
Re: [Condor-users] Deployment Recommendations

On Thu, 23 Apr 2009, James Osborne wrote:

Dear All

My name is James Osborne and I am the Condor Project Manager at Cardiff
University in the UK.  Now that summer is approaching, and I have some
nice new virtualization infrastructure coming on stream, I am in the
process of virtualizing our Condor infrastructure.  I already have a
virtual submit machine which works very well with surprisingly low
overhead (I couldn't push it harder than about 4% CPU usage with thousands
of 15-minute jobs in the queue).  The virtualization infrastructure will soon
be a load-balanced pair of 3GHz dual-socket quad-core machines with 32GB
of RAM each with multiple redundant connections into FC storage.

I seem to remember hearing that a good 'rule of thumb' was to have no more
than 2000 execute nodes reporting to a single central manager.

I've said that before, but that is in the specific context of
a Grid site, specifically the OSG.  We have since pushed a production
pool up to 3000 and a test pool up to 10000.

1) Is that still the case ?

It depends on a number of factors.
Scaling depends in part on the number of condor_startd daemons in the
pool -- so 2000 individual nodes each with one processor behave
differently from what we've got: 215 nodes each running 10 startd
processes.

Also it depends on how you define "central manager"-- is your
central manager just running the collector/negotiator or is it
running one or more condor_schedd as well?  Right now  I would say
the collector is probably good for 10K slots but you should have
more than one condor_schedd to feed it jobs.
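
As a rough illustration of that split (hostnames here are made-up
examples, not from my setup), the daemon lists would look something
like this in condor_config:

    ## On the central manager: collector/negotiator only, no schedd
    CONDOR_HOST = cm.example.edu
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

    ## On each dedicated submit machine feeding jobs into the pool
    CONDOR_HOST = cm.example.edu
    DAEMON_LIST = MASTER, SCHEDD

That keeps negotiation-cycle load and schedd load on separate boxes,
and you can add submit machines without touching the central manager.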

2) Has anybody pushed a single central manager to about 9000 execute
nodes ?

I have done that for my "sleeper pool"... but as I hinted above,
we ran into scalability problems in the condor_schedd when we did.

3) Does it make more sense to deploy 4-5 central managers instead and use
flocking ?

No.  Avoid flocking like the plague.
But it might make sense to deploy the high-availability version
of the collector-negotiator.  I have done that on 3 clusters now.
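
A sketch of what that looks like, using the condor_had daemon -- the
exact macro names have shifted between Condor versions, and the
hostnames and port below are hypothetical, so treat this as a starting
point and check the manual for your release:

    ## Both central managers listed; clients will try each in turn
    CONDOR_HOST = cm1.example.edu, cm2.example.edu

    ## condor_had coordinates which machine runs the negotiator
    HAD_PORT = 51450
    HAD_LIST = cm1.example.edu:$(HAD_PORT), cm2.example.edu:$(HAD_PORT)
    HAD_USE_PRIMARY = True

    ## On each central manager, run HAD and let it control the negotiator
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD
    MASTER_NEGOTIATOR_CONTROLLER = HAD

Each machine runs its own collector; HAD makes sure only one
negotiator is active at a time and fails it over if the primary dies.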

4) If so, would you for example use one central manager per core network
router even if that increased the number of managers to 8 or more ?

No -- this is just asking for trouble.  I'm in the minority on this,
but I have used TCP to communicate to the collector from the
beginning (over a network plant that includes 8 different
switches in the same cluster and 3 different subnets) and I
haven't been sorry.
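
For reference, turning on TCP updates is a couple of lines of
configuration (macro names as of the 7.x series -- check your manual's
configuration-macro list):

    ## Send ClassAd updates to the collector over TCP instead of UDP
    UPDATE_COLLECTOR_WITH_TCP = True

    ## The collector caches one socket per updating daemon, so size
    ## the cache to at least the number of daemons reporting in
    COLLECTOR_SOCKET_CACHE_SIZE = 10000

The win is that updates survive lossy or routed network paths; the
cost is the collector holding many open sockets, hence the cache
setting.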

5) Has anybody tried to flock jobs to 8 or more central managers ?

I haven't tried mostly because incoming grid jobs don't handle
being flocked gracefully.

Steve Timm

I can already see how I can set execute nodes to report to different
central managers in my Condor distribution scripts.
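
For what it's worth, that partitioning is just a per-node CONDOR_HOST
setting (hostnames below are hypothetical):

    ## In condor_config.local on execute nodes in group A:
    CONDOR_HOST = cm-a.example.edu

    ## ...and on execute nodes in group B:
    CONDOR_HOST = cm-b.example.edu

Your distribution scripts only need to drop the right local config
file on each node.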

I look forward to hearing from those of you with big pools...

Thanks in advance.  Best regards


Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.