
Re: [condor-users] Multiple batch queues on a single machine



> Is it possible to have multiple batch queues in a single-machine pool
> (Condor 6.6.0)?
> As an LSF admin/user I'm very used to the concept of multiple queues of 
> different applications/priorities/policies/running windows/etc.
> 
> I looked through the Condor manual but I never spotted a concept of having more
> than one queue per machine.
> Generally, is Condor a good batch solution for a >128 CPUs SMP machine?


I don't know LSF well, but I assume it's along the lines of the NQS family of
batch-job managers.  If so, I understand how Condor might seem very
different and confusing to you; I've been there.

On the one hand, Condor was born as a grid-like system (one can argue Condor was
doing grid computing long before all the recent grid hype), and its design might
not look as well suited to Beowulfs or big SMP machines as, say, OpenPBS
or DQS (nowadays Sun Grid Engine).  Specifically, Condor is about "cycle scavenging",
and it is very good at that.

But on the other hand, Condor does support cluster environments, and might offer some
features that OpenPBS or DQS do not have.  For instance, I like the fault tolerance
you get: the Condor manager or the schedd can fail for a short time and things will be OK,
and if an execute machine fails, the job gets rescheduled automatically.  It would be
interesting to hear opinions from anybody who has tried these different systems...

What you're looking for in the manual is "running Condor on dedicated resources":
the Dedicated Scheduler, the MPI Universe, and all that.  In a nutshell, you configure
the nodes in the cluster to be managed by one special submit machine, the Dedicated Scheduler
(normally the cluster front-end machine, which is also the Central Manager), and to always accept
Condor jobs (since this is no longer "opportunistic cycle scavenging", but a legitimate use of
the resource).  Note that the jobs don't need to be proper MPI jobs, even though the manual might
imply this.
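
As a rough sketch of what that looks like in each node's local config file (adapted from
memory from the dedicated-resources section of the manual, so check it against your version;
the host name frontend.example.org is only a placeholder for your own front-end):

```
# Name of the dedicated scheduler this node belongs to.
# frontend.example.org is a made-up host name; use your submit machine.
DedicatedScheduler = "DedicatedScheduler@frontend.example.org"

# Advertise that attribute in the machine ClassAd.
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

# Always accept jobs; never suspend, preempt or kill them.
START        = True
SUSPEND      = False
CONTINUE     = True
PREEMPT      = False
KILL         = False
WANT_SUSPEND = False
WANT_VACATE  = False

# Prefer jobs coming from the dedicated scheduler.
RANK = Scheduler =?= $(DedicatedScheduler)
```

With START set to True the node also accepts ordinary jobs from other schedds; the RANK
line only makes it prefer the dedicated scheduler's jobs.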

Then there's the issue of running more than one job at a time (per CPU).  I assume you're
thinking along the lines of NQS, like having a "long queue" and a "short queue" which execute
at the same time (with different nice values).  This is perfectly doable: you define
two (or more) "Virtual Machines" per CPU, so that more than one job can run at a time.
Different nice values can be set with a trick like the one published in the Bologna
Batch System paper.  I strongly recommend reading this paper because it clarifies a lot of things
that are scattered across the Manual.
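
To give an idea of the "two queues" setup, here is a minimal sketch for a 2-CPU node.  It is
my own illustration, not the recipe from the Bologna paper: the job attribute IsShortJob is a
made-up name, and the CPU count is only an example.  It does not cover the nice values.

```
# Tell the startd the machine has twice its real number of CPUs,
# so that it advertises two virtual machines per physical CPU.
NUM_CPUS = 4

# VMs 1-2 act as the "short queue", VMs 3-4 as the "long queue".
# IsShortJob is a hypothetical custom job attribute (see below).
START = ( (VirtualMachineID <= 2) && (TARGET.IsShortJob =?= True) ) || \
        ( (VirtualMachineID >  2) && (TARGET.IsShortJob =!= True) )
```

A job would then pick the short queue by adding the line "+IsShortJob = True" to its submit
description file; jobs without that attribute land on the "long" VMs.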


JL
-- 
Jose Luis Marin                                 email: jlmarin@xxxxxxxxxxxxx
Free Software Consulting Services               Cell: +34 699 470 198
E-50500 Tarazona                                Jabber: jlmarin@xxxxxxxxxx
Zaragoza
