
Re: [Condor-users] Condor Performance



On 7/25/06, Ronen Yaari <Ronen@xxxxxxxxxxxxx> wrote:
Is there any document regarding performance and throughput of Condor?

I am interested in Condor's limitations, e.g. what is the maximum number of
jobs that I can submit to Condor.

How long is a piece of string....

It depends on many things - the three most important I know of being:

1) The I/O load of your jobs (how much data is transferred when each job
starts and how much it transfers back when it finishes). The submit
machine must service all of these transfers at once, which makes it a
choke point. (There is a small submit-file sketch after this list.)

2) Which brings us to number 2 - how many submit points you have.
Increasing this obviously spreads out the load from number 1. To a
lesser extent the relative power of your submit points matters too:
anecdotal evidence suggests that more memory is the best bang-for-buck
improvement, and for high I/O-load jobs I have seen considerable
benefit from placing the submit machines close - in terms of network
bandwidth - to the execute machines.

3) How fast your jobs complete - the longer the jobs, the better the
throughput relative to the per-job start/finish overhead. Certainly
anything under an hour on a sizable pool (say 100+ execute nodes per
submit point) means the submit machine is dealing with a new job start
plus an old job end roughly every 30 seconds or faster (100 slots each
turning over within an hour is at least 100 starts and 100 ends per
3600 seconds); if it is unable to service this then again you have a
choke point.
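
To make point 1 concrete, here is a minimal submit-description sketch
(a sketch only - the executable and file names are made up for
illustration). Every queued job pulls its input from, and pushes its
output back to, the submit machine:

    # illustrative vanilla-universe submit file
    universe                = vanilla
    executable              = crunch.exe
    # shipped from the submit machine when each job starts
    transfer_input_files    = input_$(Process).dat
    should_transfer_files   = YES
    # results come back to the submit machine when the job exits
    when_to_transfer_output = ON_EXIT
    output                  = crunch_$(Process).out
    error                   = crunch_$(Process).err
    log                     = crunch.log
    queue 100

With a pool of any size, a good fraction of those transfers will be in
flight at the same moment, and the submit machine has to service them
all.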

Many other factors can influence this depending on circumstances - for
example, checkpointing is effectively another in/out transfer
operation, and if a big batch of jobs finishing kicks off another batch
of jobs, they will all be trying to transfer at once (a config sketch
for smoothing such bursts follows).
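
If bursts like that are hurting, one crude smoothing option (a sketch
only - the exact knobs and their behaviour depend on your Condor
version, so check the manual for your release) is to throttle how fast
the schedd starts jobs, in the submit machine's condor_config:

    # start at most this many jobs per interval...
    JOB_START_COUNT = 5
    # ...and wait this many seconds between intervals
    JOB_START_DELAY = 2

That trades a slower ramp-up for not hammering the submit machine's
disk and network all at once.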
Previous versions of Condor could end up choking on the negotiation
cycle when things got busy - this should be considerably less likely to
cause issues with recent intelligent changes to the negotiation, most
of which will be totally transparent to you. One aspect which can still
cause issues is how you split the individual jobs up - i.e. how many
jobs you put in a cluster. In general the bigger you make this the
better (again, recent changes may obviate the need for such coarse
optimisations); see the submit-file comparison below.
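
To illustrate the cluster point: one submit file with a single queue
statement puts all the jobs into one cluster, which is much kinder to
the schedd than invoking condor_submit once per job (names below are
illustrative):

    # one cluster of 1000 jobs - preferred
    universe   = vanilla
    executable = crunch.exe
    # $(Process) runs 0..999, so each job can still get its own arguments
    arguments  = $(Process)
    queue 1000

    # versus running condor_submit 1000 times with "queue 1", which
    # creates 1000 separate clusters and far more schedd bookkeeping

Because $(Process) differs per job, one big cluster does not force the
jobs to be identical.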

Several people on this list have pools heading towards thousands of
execute nodes - though this is normally the result of careful tuning of
several aspects of the system *and* the job loads. If you just set
something up relatively naively with more than, say, tens of megabytes
of input and output data per job, then I would not expect your pool to
scale beyond a hundred execute nodes per submit point.*

This is all very rough and based on my experiences (reasonably detailed
but very Windows-specific) and previous postings to the list (and, as
second-hand opinions, they should be taken with a considerable error
factor).

Matt

* On Windows there is a rough limit on this anyway of about 120 without
tweaks, and at best 200 or so if the submit point is a dedicated server
(those numbers are using Windows Server 2003 on a beefy box).