
Re: [Condor-users] Condor or Classads as part of a Grid Infrastructure?



> I believe gLite (aka LCG) is doing this, and Condor-G is an integration 
> with Globus (but I don't know its status in the world of GT4.0.2).  Are 
> there others?

I can confirm the gLite/LCG part, having been behind that
development since its inception in 2000.

> I want to find out how people are extending condor or what adaptations 
> are required to scale it to large scale computational grids (1E4-1E5 
> nodes, 1E5-1E7 tasks queued, 1E3 sites, etc.)

I call it the "dancing around Condor" exercise. Not willing to throw
away the real-life experience (scale, portability, friendliness to
users, administrators and machines) that takes many years to learn and
to incorporate into real code (like Condor's), I did all I could to
keep the integration minimal.

The first need we had was to extend the matchmaking semantics, and
here the new ClassAds and their plugin extensions helped us a lot.
However, when implementing features like matchmaking against data
catalogues or 'gang'-matching, ClassAd plugin implementations, while
providing clever semantic properties, turned out to be consistently
less efficient than hardcoded ones. I believe this is part of the
reasonably recent realisation that matchmaking against computing and
data resources is best done separately.
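
To make the "separate matchmaking" point a bit more concrete, here is
a minimal sketch in plain Python. Everything in it is invented for
illustration (the attribute names RequestMemory, CloseSE, InputFiles
and the toy catalogue stand in for the real thing): compute
requirements are matched first, and only the survivors are checked
against the data catalogue, instead of gang-matching job, CE and SE
ads in a single pass.

    # Toy two-phase matchmaker; attribute names (RequestMemory, CloseSE,
    # InputFiles) and the catalogue layout are invented for illustration.

    def match_compute(job, resources):
        """Phase 1: keep only resources satisfying the compute needs."""
        return [r for r in resources if r["Memory"] >= job["RequestMemory"]]

    def match_data(job, resources, catalogue):
        """Phase 2: keep only resources close to a replica of the data."""
        replicas = set()
        for lfn in job["InputFiles"]:
            replicas |= set(catalogue.get(lfn, []))
        return [r for r in resources if r["CloseSE"] in replicas]

    if __name__ == "__main__":
        job = {"RequestMemory": 512, "InputFiles": ["lfn:/data/run42"]}
        resources = [
            {"Name": "ce01", "Memory": 1024, "CloseSE": "se01"},
            {"Name": "ce02", "Memory": 2048, "CloseSE": "se02"},
        ]
        catalogue = {"lfn:/data/run42": ["se01"]}
        survivors = match_data(job, match_compute(job, resources), catalogue)
        print([r["Name"] for r in survivors])    # -> ['ce01']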

The bottleneck that forced us into the real dance, and that by and
large is still there, is the 'condor_qedit' interface for acting on
the schedd queue, which we wanted to use to perform matchmaking
externally. Our rule-of-thumb figure, back in 2000 when we were
trying to store all job data in the schedd as 'the' data repository,
was on the order of one qedit operation per second.
Even now, Quill and schedd-on-the-side provide fast read 'mirrors' of
the 'real' schedd contents, but when the time comes to actually
change the job ad authoritatively in the schedd, we are back to the
old problem.
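
For the curious, a throwaway harness along these lines is how one can
reproduce that figure. It is only a sketch: it assumes condor_qedit's
usual '<cluster.proc> <attribute> <value>' command line and a job
42.0 that actually sits in the local queue (both illustrative).

    # Toy throughput check for condor_qedit; assumes a job 42.0 in the
    # local queue and the "<cluster.proc> <attribute> <value>" syntax.
    import subprocess
    import time

    N = 20
    start = time.time()
    for i in range(N):
        # String values need ClassAd quoting, hence the embedded quotes.
        subprocess.run(["condor_qedit", "42.0", "TestAttr", '"pass%d"' % i],
                       check=True)
    print("%.2f qedit ops/sec" % (N / (time.time() - start)))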

So, in the current gLite/LCG setup, in order to keep match-making
external (having always been discouraged from learning to talk
DaemonCore, synchronising with its development, and providing a
drop-in replacement negotiator), a 'pre-matched' job is
condor_submitted by a first process, and its progress is followed in
the Userlog (via the Userlog parsing class) by a second 'guardian
angel' process, which acts on abort, completion and, especially, hold
events, as there is no living user around who can think and decide
whether issuing condor_release makes sense.
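
The guardian angel boils down to something like the sketch below. It
is deliberately simplified: it assumes the classic plain-text user
log layout (a three-digit event code followed by the job id, with
005 = termination, 009 = abort, 012 = hold), a hypothetical log file
name, and an always-release hold policy; the real code uses the
Userlog parsing class and an actual policy instead of regexes.

    # Toy 'guardian angel': tail a Condor user log and react to hold,
    # abort and termination events. Event header assumed to look like
    # "012 (042.000.000) ... Job was held." (005/009/012 as above).
    import re
    import subprocess
    import time

    EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")

    def guard(logfile):
        with open(logfile) as log:
            while True:
                line = log.readline()
                if not line:
                    time.sleep(1)    # no new events yet, keep waiting
                    continue
                m = EVENT_RE.match(line)
                if not m:
                    continue         # body line of an event, skip it
                code = m.group(1)
                job = "%d.%d" % (int(m.group(2)), int(m.group(3)))
                if code == "012":
                    # Hold: no user around to decide. Always releasing is
                    # the toy policy; a real one inspects the hold reason.
                    subprocess.run(["condor_release", job], check=True)
                elif code in ("005", "009"):  # termination or abort
                    print("job %s left the queue (event %s)" % (job, code))
                    return

    if __name__ == "__main__":
        guard("job.log")    # hypothetical user log path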

If a re-match is needed, the old job is condor_rm'd (which may or
may not succeed for Condor-G/GRAM jobs, sometimes requiring the
renowned "condor_rm -f") and a fresh job is submitted to Condor.
The obvious consequence is an increased load on external book-keeping,
which needs to keep track, among other things, of these different
Condor cluster IDs. For this we had to maintain an external solution:
the gLite "Logging and Bookkeeping" service, a job-event-based
store-and-forward system that eventually allows the job state to be
computed on a server with a DB back-end, protected via canned queries.
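
The rm-and-resubmit step, with its book-keeping side effect, looks
roughly like this. Again a sketch: it assumes condor_submit's usual
"submitted to cluster N." output line, and the final print is a
stand-in for the real L&B update.

    import re
    import subprocess

    def rematch(old_job_id, submit_file):
        # Try a polite condor_rm first; fall back to the renowned -f
        # for Condor-G/GRAM jobs that refuse to go away.
        if subprocess.run(["condor_rm", old_job_id]).returncode != 0:
            subprocess.run(["condor_rm", "-f", old_job_id], check=True)
        out = subprocess.run(["condor_submit", submit_file],
                             capture_output=True, text=True, check=True)
        m = re.search(r"submitted to cluster (\d+)", out.stdout)
        if m is None:
            raise RuntimeError("could not parse condor_submit output")
        new_cluster = int(m.group(1))
        # Stand-in for the real book-keeping (gLite L&B) update that
        # ties the old and new cluster ids to the same logical job.
        print("logical job moved: %s -> cluster %d"
              % (old_job_id, new_cluster))
        return new_cluster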

This is, of course, a bird's-eye summary of the "integration" path we
followed over the past 6 years. It would take much longer to delve
into the details.

And... these are just the dry technical reasons, which typically have
a featherweight impact on mega-projects like the EU DataGrid or
EGEE-n. In fact, they were used by our seasoned socio-politicians
merely as grounds to justify re-writing everything from scratch using
Web Services, which flatly contradicts my starting assumption above...

Hope this helps.
Francesco Prelz
INFN Milano