
Re: [Condor-users] Request for Ideas/Plans: Designing a Large Condor Pool



On 5/24/06, Jess Cannata <jac67@xxxxxxxxxxxxxx> wrote:
Dear Condor User Community:

We are in the process of setting up a Condor pool to initially include
all lab machines (400-1000 machines) on campus, though later we plan to
add a few of our clusters. While we currently run Condor on some of our
smaller clusters, we suspect that the layout for this larger pool will
differ from that of a standard Condor pool.

For this campus pool, we want one entry point for users to submit jobs.
Since the pool will have tens of thousands of jobs in the queue, with
several hundred running simultaneously, we know that we will likely
overload a single schedd, along with the other daemons.

Does anyone have any design plans that outline how one might set up a
pool with a single point of entry, with multiple daemons to spread out
the load and provide some redundancy? I've looked in the manual for
examples of large deployments, but cannot find any. Am I missing
something? If you wouldn't mind sharing your pool layout, I think that
this would be useful to many Condor users, especially if your pool is
not a typical one.

For a large pool, having all the schedds on one machine is a very bad
idea, since that machine dying (or needing servicing) will take down
the whole farm.
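
(For what it's worth, the usual way to spread that load is simply to
run one schedd per submit machine, by including SCHEDD in each
machine's DAEMON_LIST; then losing any one submit node only strands
the jobs queued on it.)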

Having your controlled submit point automatically trigger submissions
to some distribution of schedds is the best approach. This means you
have to wrap the submission tools (or use the SOAP library), but that
may be a benefit, since you also gain total control over which submit
options are set.
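
As a very rough sketch of such a wrapper (in Python, shelling out to
condor_submit rather than using the SOAP interface; the schedd names
and the round-robin state file are made up, and a real wrapper would
add error handling and enforce your site's submit policies):

#!/usr/bin/env python
"""Sketch: spread submissions round-robin across several schedds.

The schedd daemon names below are placeholders. "-remote" spools the
job and its input files to the chosen schedd, so this script can run
on a machine that has no schedd of its own.
"""
import os
import subprocess
import sys

SCHEDDS = [
    "schedd1@submit.example.edu",
    "schedd2@submit.example.edu",
    "schedd3@submit.example.edu",
]
STATE_FILE = os.path.expanduser("~/.submit_rr")  # round-robin cursor


def next_schedd():
    """Return the next schedd in round-robin order, persisting the cursor."""
    try:
        with open(STATE_FILE) as f:
            index = int(f.read().strip())
    except (IOError, ValueError):
        index = 0
    with open(STATE_FILE, "w") as f:
        f.write(str((index + 1) % len(SCHEDDS)))
    return SCHEDDS[index % len(SCHEDDS)]


def main():
    # Pass the user's submit description file (and any other arguments)
    # straight through to condor_submit, targeting the chosen schedd.
    cmd = ["condor_submit", "-remote", next_schedd()] + sys.argv[1:]
    sys.exit(subprocess.call(cmd))


if __name__ == "__main__":
    main()

Because -remote implies spooling, each job's output lands in the
chosen schedd's spool directory rather than back in the user's working
directory, which brings up the next point.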

Getting the data back is the tricky bit in this case. If you have some
form of network file system, you can just have the jobs direct their
output there; if not, you will need some means of bringing the
resulting files back from the disk on the schedd's machine.
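
If you went the spooled-submission route sketched above,
condor_transfer_data is the tool for this: it copies a job's spooled
output sandbox back from the schedd. A minimal sketch, reusing the
same placeholder schedd names:

#!/usr/bin/env python
"""Sketch: pull spooled output back from each schedd in the pool."""
import subprocess

SCHEDDS = [  # same placeholder daemon names as in the submit wrapper
    "schedd1@submit.example.edu",
    "schedd2@submit.example.edu",
    "schedd3@submit.example.edu",
]

for schedd in SCHEDDS:
    # Fetch the spooled output sandboxes of all jobs on this schedd
    # (permissions allowing) into the current working directory.
    subprocess.call(["condor_transfer_data", "-name", schedd, "-all"])

You would run something like this from the submit point (or a cron
job) once condor_q shows the jobs have completed; with a shared file
system, of course, none of it is needed.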

This is only a rough outline, but it gives you a flavour of what you
can do.

Matt