
[HTCondor-users] Assessing Condor suitability


I'm working on an open source project called 'Beaker' (http://beaker-project.org/).
I'm looking at the possibility of replacing a chunk of what Beaker does, with Condor.

I've done some reading of various Condor related documents (there are a lot!), but I thought here
may be a good place to get an idea of whether or not Condor would be a good fit for our purpose.

Similarly to Condor, Beaker runs jobs on remote systems. Beaker matches jobs to systems,
keeps track of the status of jobs, and the status of systems. Our current approach is to find a system
that matches the job's criteria (hardware and OS requirements), and one that is currently 'available' to
the job (which encompasses the permissions, the current status of the system and whether someone is already using it).
Beaker does other ancillary things as well, but they are less important.

I'm considering using ClassAds to match jobs with systems. We currently do this matching with our own
XML language. We need all the basic things, such as a job requiring > 512MB RAM, the HVM CPU flag, and so on.
We also need to support a group and individual based permission model for access to the systems.
I'd also like to identify groups of systems as belonging to a certain 'pool' (just an abstract name) and
then create a job ClassAd that expresses a preference for systems from one pool over another. From
the reading I've done, I think ClassAds can do all this. The real reason I want to use ClassAds though, is
that I want to take advantage of Condor's scheduling.
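
As a rough illustration of the two-sided matching described above, here is a minimal Python sketch. It is plain Python, not the ClassAd language, and it does not call any Condor API. The attribute names (memory_mb, cpu_flags, pool, group) and the helper functions are hypothetical, chosen only to mirror the requirements listed here: a RAM threshold, the HVM flag, group-based permissions and a preference for one pool over another.

```python
# Toy model of two-sided matching; plain Python, not the ClassAd language.
# All attribute names and helpers below are hypothetical.

def matches(job, system):
    """A match needs each side's requirements to accept the other side."""
    return job["requirements"](system) and system["requirements"](job)

def best_system(job, systems):
    """Return the matching system the job ranks highest, or None."""
    candidates = [s for s in systems if matches(job, s)]
    return max(candidates, key=job["rank"], default=None)

systems = [
    {"name": "sys1", "memory_mb": 1024, "cpu_flags": {"hvm"}, "pool": "general",
     # Permission model: only these groups may use the system.
     "requirements": lambda job: job["group"] in {"kernel-qe", "virt"}},
    {"name": "sys2", "memory_mb": 2048, "cpu_flags": {"hvm"}, "pool": "preferred",
     "requirements": lambda job: job["group"] in {"virt"}},
    {"name": "sys3", "memory_mb": 256, "cpu_flags": set(), "pool": "preferred",
     "requirements": lambda job: True},
]

job = {
    "owner": "usera", "group": "virt",
    # Hard constraints: more than 512MB RAM and the HVM cpu flag.
    "requirements": lambda s: s["memory_mb"] > 512 and "hvm" in s["cpu_flags"],
    # Soft preference: systems from the 'preferred' pool rank higher.
    "rank": lambda s: 1 if s["pool"] == "preferred" else 0,
}

chosen = best_system(job, systems)
```

In this sketch the requirements are hard filters on both sides, while the rank only orders the systems that survive the filters, which is the split between "must have" and "would prefer" that the pool preference needs.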

Currently in Beaker, when a job is submitted to be executed on a system, we store the names of
all the systems that meet the hardware requirements as potential candidates.
Then, every 30 seconds or so, we check the candidate systems until we find one that is free
and that we have permission to use. This is slow, cumbersome, inefficient and involves
lots of complex database queries that we would rather not have to ever look at again :-) We want to get rid of this,
as this model no longer meets our requirements. We're envisioning a scheduler where the system can find jobs based
on what the system finds preferable, and jobs can find systems based on what the job finds preferable,
and then form a harmonious union between the two :-)

In terms of 'preferable', things like effective job priorities would be important. Currently in Beaker, if UserA and UserB both have access to
SystemA, they have equal priority to that system. We want to be able to have the system give a higher priority to a job
based on who owns the job, or what groups they belong to (and perhaps other things as well). Another example is a job
that requires a single CPU system (there are few of them) that has to wait forever because that system just happens to be swamped
by other jobs that don't care whether they have 1, 2 or 4 CPUs. We would like to be able to tell that system to give preferential
treatment to jobs that require a single CPU.
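
The system-side preference could be sketched along these lines. This is again a toy model in plain Python rather than Condor code; the field names (favoured_groups, owner_groups, wants_single_cpu) and the score weights are hypothetical and only illustrate a system ranking the jobs that compete for it.

```python
# Toy model of a system ranking the jobs that want it; not Condor code.
# Field names and score weights are hypothetical.

def system_rank(system, job):
    """Higher score means the system prefers this job."""
    score = 0
    # Prefer jobs whose owner belongs to a group the system favours.
    if system["favoured_groups"] & job["owner_groups"]:
        score += 10
    # A single-CPU system prefers jobs that actually need a single CPU.
    if system["cpus"] == 1 and job["wants_single_cpu"]:
        score += 100
    return score

def pick_job(system, queue):
    """Choose the queued job the system ranks highest (ties keep queue order)."""
    return max(queue, key=lambda job: system_rank(system, job), default=None)

single_cpu_system = {"name": "sys-1cpu", "cpus": 1, "favoured_groups": {"hw-team"}}

queue = [
    {"id": "job-any-1", "owner_groups": {"dev"}, "wants_single_cpu": False},
    {"id": "job-any-2", "owner_groups": {"hw-team"}, "wants_single_cpu": False},
    {"id": "job-single", "owner_groups": {"dev"}, "wants_single_cpu": True},
]

picked = pick_job(single_cpu_system, queue)
```

Here the job that needs a single CPU wins even though it was queued last, and among the remaining jobs the one owned by a favoured group comes next.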

So the way I imagine it, we would basically want to use Condor from the point where a user's job is analysed and matched to a system,
to where Condor returns to say "Here, you can run it on this system to the exclusion of all others".
Beaker would then actually set up and organise the running of the job (writing PXE cfg files, power cycling the system,
and so on). Once the job is complete, Beaker would inform Condor that the system is no longer running a job and so is available to
potentially run another job.
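
The hand-off described above amounts to an exclusive claim followed by an explicit release. A minimal sketch of that bookkeeping, with a hypothetical SystemPool class standing in for whatever would track system state (it is not a Condor interface), might look like this:

```python
# Toy model of the hand-off: the matcher claims a system exclusively,
# Beaker runs the job, then Beaker releases the system.
# SystemPool and its methods are hypothetical, not a Condor interface.

class SystemPool:
    def __init__(self, names):
        # Maps each system name to the job currently holding it (or None).
        self._claimed_by = {name: None for name in names}

    def claim(self, name, job_id):
        """Give job_id exclusive use of a system; fail if it is already taken."""
        if self._claimed_by[name] is not None:
            raise RuntimeError(f"{name} is already claimed by {self._claimed_by[name]}")
        self._claimed_by[name] = job_id

    def release(self, name, job_id):
        """Called once the job is complete so the system can be matched again."""
        if self._claimed_by[name] != job_id:
            raise RuntimeError(f"{name} is not claimed by {job_id}")
        self._claimed_by[name] = None

    def available(self):
        """Names of systems that are not running a job, in sorted order."""
        return sorted(n for n, j in self._claimed_by.items() if j is None)

pool = SystemPool(["sys1", "sys2"])
pool.claim("sys1", "job-42")   # matcher says: run it here, exclusively
# ... Beaker provisions the system and runs the job ...
pool.release("sys1", "job-42")  # Beaker reports the job is complete
```

Between claim and release the system is invisible to other jobs, which is the "to the exclusion of all others" guarantee.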

Beaker is written in Python, so I imagine all of the interaction with Condor would have to be through its SOAP API.

So does Condor seem like it might be a reasonable solution?