
Re: [Condor-users] Condor - condor_schedd daemon per pool or what?



On Wednesday, July 6, 2011 at 10:01 AM, Sassy Natan wrote:

My question starts here: if I have two Regular machines, what is the
use of having two different queues in one pool?

Michael O'Donnell answered this one already, but I'll repeat it: scalability. One scheduler can only handle so much load in the form of queued and running jobs before it starts to fall down, unable to fill all the available slots in your pool with jobs. Where that limit is depends on the job startup rate and the OS you happen to be running on (traditionally Linux-based condor_schedd machines scaled much larger than Windows-based schedulers, but that gap has been closing and is really much closer in the 7.6.x series).

There might also be administrative reasons to separate jobs onto multiple schedulers. For example, you may wish to enforce scheduling and matchmaking policies on some class of jobs via configuration file options.

You may also wish to take advantage of scheduler technologies like dedicated schedulers to make MPI-type jobs in your pool easier to run. See: http://www.cs.wisc.edu/condor/manual/v7.6/3_13Setting_Up.html#sec:Configure-Dedicated-Resource for more details on this.

So when some users are logged in to Regular01 and others to Regular02,
submitting their jobs to the Condor pool, I don't understand how I can
control my queue. I don't want to manage two different queues, but a
global one.

This is only possible to some extent with Condor. Even with a really big machine for a single scheduler, at some point your execute slots and queued jobs will exceed the performance capabilities of a single-scheduler approach. Where that limit is depends on your jobs, your hardware and your execute node count.

Distributed queues are one of the things that make Condor robust and highly scalable. You don't have a single queue point of failure or bottleneck.

Can I ask: what is it about multiple queues that makes "management" hard? What exactly are you trying to manage?

If I take the schedd out of one of the Regular machines, say
Regular01, I can't submit jobs to the pool.
Well, this is almost true, since I can submit a remote job using
the -n switch, but then I don't get what the use is
of having two schedd daemons running on two different machines in the
same pool (unless of course you want to
have some load balancing for the schedd daemons, but then again the
whole point of having a load-balancing schedd is to save the status of
the co-existing queue).

You have a few options here:

1. you can have users use -n to do remote submissions;
2. you can give all users a login on your one scheduler machine;
3. you can look into Condor's SOAP interface and write a custom submission tool that uses SOAP;
4. you can use a meta-scheduler that acts as the single queue for all your users and then load balances these jobs to Condor schedulers on their behalf (CycleServer and MRG are examples).
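
To make option 4 a little more concrete, here is a minimal Python sketch of the idea behind a front end that picks a scheduler on the user's behalf. It is not how CycleServer or MRG work internally. The host names, the queue depths and the two helper functions are hypothetical, and the condor_submit command line is only built, never executed.

```python
from typing import Dict, List


def pick_schedd(queue_depths: Dict[str, int]) -> str:
    """Return the scheduler with the fewest queued jobs (ties: alphabetical)."""
    if not queue_depths:
        raise ValueError("no schedulers configured")
    # sorted() fixes the iteration order, so min() breaks ties alphabetically.
    return min(sorted(queue_depths), key=lambda name: queue_depths[name])


def build_submit_command(schedd: str, submit_file: str) -> List[str]:
    """Build (but do not run) a condor_submit call aimed at one scheduler."""
    return ["condor_submit", "-name", schedd, submit_file]


# Hypothetical pool: two schedulers and their current queued-job counts.
depths = {"regular01.example.com": 120, "regular02.example.com": 45}
target = pick_schedd(depths)
command = build_submit_command(target, "job.sub")
print(" ".join(command))
```

In a real deployment the queue depths would have to come from the pool itself rather than a hard-coded dictionary, and a plain -name submission assumes the chosen scheduler can see the job's input files.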

I saw there are SCHEDD_NAME and SCHEDD_ADDRESS_FILE configuration
options.
But I'm not sure I got it right. When I point the name and the file to
my schedd (which in my example is
on Regular02) I still get an error and must point manually to the
Regular02 host name. (And I did put @ at the end of SCHEDD_NAME.)

I can't think of a good reason to mess with SCHEDD_NAME and SCHEDD_ADDRESS_FILE in your case.

You may want to look at SCHEDD_HOST: it names the scheduler to contact when you run commands such as 'condor_q' or 'condor_submit' without supplying the -name option. It defaults to the local machine, but you may want to set it to some other value if you're not running a scheduler on the local machine. If your scheduler was on myhost1.mydomain, you could set SCHEDD_HOST = myhost1.mydomain on every other machine in your pool, and then condor_q/rm/submit would work without having to use the -name option. See: http://www.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#15346
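
The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the precedence (explicit -name, then SCHEDD_HOST, then the local machine), not Condor's actual code. The function name, the dictionary standing in for the configuration and the host names are all assumptions.

```python
from typing import Dict, Optional


def resolve_schedd(cli_name: Optional[str],
                   config: Dict[str, str],
                   local_host: str) -> str:
    """Pick the scheduler a tool such as condor_q would contact."""
    if cli_name:
        # An explicit -name on the command line always wins.
        return cli_name
    # Otherwise use SCHEDD_HOST if it is set, else fall back to this machine.
    return config.get("SCHEDD_HOST") or local_host


# Hypothetical submit-only machine with no local scheduler.
config = {"SCHEDD_HOST": "myhost1.mydomain"}
default_target = resolve_schedd(None, config, "regular01.mydomain")
explicit_target = resolve_schedd("regular02.mydomain", config, "regular01.mydomain")
unset_target = resolve_schedd(None, {}, "regular01.mydomain")
print(default_target, explicit_target, unset_target)
```

With SCHEDD_HOST set on every machine that lacks a scheduler, users there get the first case and never need to type -name.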

Regards,
- Ian

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing