Re: [Condor-users] Request for Ideas/Plans: Designing a Large Condor Pool
- Date: Thu, 25 May 2006 08:06:04 -0400
- From: Jess Cannata <jac67@xxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Request for Ideas/Plans: Designing a Large Condor Pool
Yes, we definitely do not want to rely on only one schedd, and we
probably do not want to rely on one collector and negotiator, either. We
also have the challenge of getting the output back to the user. We have
a few ideas on how to do this, but we'd first like to hear from the
groups that are already doing this; apparently other people on the list
are interested, too.
When we were at CondorWeek last year we brought up the need for some
sample deployment diagrams for large pools, and there seemed to be a lot
of interest in this. Please send me simple diagrams (or explanations) of
your pools, even if you don't think that your pool is all that
interesting. You can send them as Visio, PowerPoint, or other formats
(just tell me what they are), and I will compile them and see about
getting them either included in the Condor manual or in some other
section on the web site.
You can send the files with attachments directly to me and I will make
sure the information is pushed back out to the list.
Advanced Research Computing
Matt Hope wrote:
On 5/24/06, Jess Cannata <jac67@xxxxxxxxxxxxxx> wrote:
Dear Condor User Community:
We are in the process of setting up a Condor pool to initially include
all lab machines (400-1000 machines) on campus, though later we plan to
add a few of our clusters. While we currently run Condor on some of our
smaller clusters, we suspect that the layout for this larger pool will
be different than a standard Condor pool.
For this campus pool, we want one entry point for users to submit jobs.
Since the pool will have tens of thousands of jobs in queue, with
several hundreds running simultaneously, we know that we will likely
overload one schedd along with the other daemons.
Does anyone have any design plans that outline how one might set up a
pool with a single point of entry, with multiple daemons to spread out
the load and provide some redundancy? I've looked in the manual for
examples of large deployments, but cannot find any. Am I missing
something? If you wouldn't mind sharing your pool layout, I think that
this would be useful to many Condor users especially if your pool is not
a typical pool.
For a large pool, having all the schedds on one machine is a very bad
idea, since that machine dying (or needing servicing) will screw the whole farm.
Having your controlled submit point automatically trigger submissions
to some distribution of schedds is the best idea. This way you have
to wrap the submission tools (or use the SOAP library), but this may be
a benefit since you also gain total control over what submit options
Dealing with getting data back is the tricky bit in this case. If you
have some form of network file system, you can just get the jobs to
direct their output there, but if not, you will need some means of
bringing back the resulting files from the disk on the schedd's
This is only rough but gives you a flavour for what you can do.
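As a rough illustration of the wrapper idea above, here is a minimal sketch in Python. It assumes the submit point knows a fixed list of schedd names (the host names below are hypothetical) and spreads users across them by hashing the user name, which is only one possible "distribution of schedds"; round-robin or load-based selection would work just as well. It only builds the command line, using condor_submit's -name option to target a named schedd, and leaves actually running it (and checking which submit options to allow) to the real wrapper.

```python
import hashlib

# Hypothetical schedd names; replace with the schedds in your own pool.
SCHEDDS = [
    "schedd1.example.edu",
    "schedd2.example.edu",
    "schedd3.example.edu",
]


def pick_schedd(user, schedds=SCHEDDS):
    """Map a user to one schedd deterministically.

    Hashing the user name keeps all of a user's jobs in the same queue
    while spreading different users across the available schedds.
    This policy is an assumption, not something the thread prescribes.
    """
    digest = hashlib.sha256(user.encode("utf-8")).hexdigest()
    return schedds[int(digest, 16) % len(schedds)]


def build_submit_command(user, submit_file, schedds=SCHEDDS):
    """Return the argument list for submitting to the chosen schedd."""
    return ["condor_submit", "-name", pick_schedd(user, schedds), submit_file]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(build_submit_command("alice", "job.sub")))
```

A real wrapper would pass the list to something like subprocess.run and would also need a plan for a schedd being down, for example by skipping it and rehashing over the remaining names.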