
Re: [Condor-users] submitting LOTS of jobs



I think that unless these are submitted as a (much) smaller number of clusters, each containing many thousands of jobs, you have no chance whatsoever of this working with plain Condor.
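
To give a flavour of what 'one cluster, many jobs' means in practice, a single submit file along these lines does it (the wrapper script, file names and proc count here are invented purely for illustration):

# Sketch of a single-cluster submit file; the wrapper script, file names
# and proc count are made up for illustration.
universe    = vanilla
executable  = run_chunk.sh
arguments   = $(Process)
output      = out/chunk_$(Process).out
error       = out/chunk_$(Process).err
log         = chunks.log
queue 10000

One condor_submit call then creates a single cluster of 10,000 procs, which is far cheaper for the schedd to absorb than 10,000 separate submissions.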

Most people tend to either:

A) End up writing some other 'queue' of some sort which can efficiently hold the work required, and which trickles submissions in while checking the schedd's state (via condor_status -submitter calls if possible); there is a rough sketch of this below.

B) More recently, start writing job hooks to remove the scheduling/negotiation component entirely and just do it themselves, taking advantage of domain-specific mechanics to do it very efficiently.

I prefer the latter conceptually (it avoids the 'split brain' issue inherent in the former), but it is relatively recent and I only know of one other person here using it significantly (Ian Chesal), who has been working out some kinks in it. He would likely be able to give a better view on the suitability of this approach.
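
For what option A typically looks like, here is a deliberately crude sketch in Python. The submit file name, thresholds and polling interval are invented, and it assumes the submitter ad exposes an IdleJobs attribute, which is worth checking against your Condor version:

#!/usr/bin/env python
# Crude 'trickle submitter' sketch: keep the schedd fed without flooding it.
# jobs.sub, the thresholds and the polling interval are all hypothetical.
import subprocess
import time

MAX_IDLE = 2000            # stop feeding the schedd above this many idle jobs
BATCH_SUBMIT = "jobs.sub"  # submit file that queues one batch of work
POLL_SECONDS = 300

def idle_jobs():
    # Sum the IdleJobs attribute across the submitter ads known to the pool.
    out = subprocess.check_output(
        ["condor_status", "-submitter", "-format", "%d\n", "IdleJobs"],
        universal_newlines=True)
    return sum(int(n) for n in out.split())

while True:
    if idle_jobs() < MAX_IDLE:
        # Room in the queue: hand the schedd the next batch of work.
        subprocess.check_call(["condor_submit", BATCH_SUBMIT])
    time.sleep(POLL_SECONDS)

A real version would track what it has already submitted and stop when the work list is exhausted, but the shape is usually just this: poll, compare against a threshold, submit a batch, sleep.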

If you go with A then you should note that the machine running the schedd tends to have the following limits:

The maximum number of active jobs (since every active job consumes resources for its shadow process).
On Windows this is about 150 for Windows XP 32-bit with some registry tweaks, and around 200 or so on Windows Server 2003 64-bit. The Condor team can likely give you reasonable limits for the various Linux flavours. (The configuration sketch after this list shows the relevant schedd knob.)

The rate at which jobs start and stop, 'multiplied' by the amount of data that they transfer in/out.
If this is sizeable then the box may be unable to cope. Having the schedd transfer only a minimal script, and having the jobs fetch their data from some network location (or, better still, from storage local to each execute machine) and place their results there as well, will mitigate this considerably.
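
On the submit host itself, these are the sort of knobs to look at. The names below are the ones in the Condor manual; check them against your version and treat the values as placeholders rather than recommendations:

## Submit-host condor_config sketch; values are placeholders, not advice.
# Cap on simultaneously running jobs (one shadow each) from this schedd.
MAX_JOBS_RUNNING = 150

# Throttle how fast the schedd spawns shadows: start at most
# JOB_START_COUNT jobs every JOB_START_DELAY seconds.
JOB_START_COUNT = 10
JOB_START_DELAY = 2

On the data side, the usual mitigation is exactly as described above: transfer only a minimal wrapper (or set should_transfer_files = NO in the submit file) and have the jobs read and write a shared or node-local filesystem directly.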

The other significant factor in a task of this magnitude is how long each job takes.
The overhead of plain Condor for small jobs is significant (reduced somewhat in a nice steady state, since negotiation is avoided) but would still be measured in a few seconds at least. If you submit fewer jobs per cluster, that per-job overhead rises much higher. I also do not know how much locking the submission process does on the schedd these days; in older releases, submitting would significantly impact its ability to respond to negotiation requests and start new jobs.
 
I would assume that the number of jobs mentioned must imply they are relatively short, because a back-of-the-envelope calculation

(15M * TIME_JOB) / SLOTS = TOTAL_TIME

would suggest that with 1000 slots and 1-minute jobs you would take over 10 days.
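
Spelling that case out:

15,000,000 jobs * 1 minute      = 15,000,000 minutes of work
15,000,000 minutes / 1000 slots = 15,000 minutes wall-clock
                                = 250 hours
                                ~ 10.4 days
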
With 1000 slots you would almost certainly require at least 2 or 3 schedd machines and a reasonably hefty collector/negotiator. If the jobs were in the minute range then you would need to 'feed' your farm constantly or waste throughput, so 1 hour of downtime on one of the schedds translates almost directly into NumberOfSlots / NumberOfSchedds lost slot-hours (with 3 schedds and 1000 slots, roughly 333 slot-hours) unless you put some sort of High Availability setup in place.

For comparison, we have ~600 slots and tend to have upwards of 10 schedds active when the pool is fully utilized; most jobs run for hours if not days, and users still get annoyed at the latency inherent in submission/negotiation/transfer out/shadow/transfer in.

All numbers/suggestions above are *very* general. Being specific is impossible without some detailed knowledge of the exact mechanics of your jobs and pool structure.

The numbers are large enough that you will almost certainly need to 'tune' your setup appropriately to use Condor as intended (high throughput), or use 'wrapping' layers like the Technion's which convert many small jobs into fewer, bigger jobs on the fly.

Hope this was helpful,
Matt

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Jonathan D. Proulx
Sent: 04 November 2009 18:31
To: Condor-Users Mail List
Subject: [Condor-users] submitting LOTS of jobs

Hi All,

I have a user looking to submit 15 million jobs; this is about two
orders of magnitude above what we've done previously.  I don't yet know
too much about the calculation other than that it's genome based, and
running against the human genome (a later goal) would result in
1.8 billion jobs.

Leaving aside for a moment the question of how long this will take on
my resources, any advice on facilitating this number of jobs?  Are
there submit-side limits I'm going to run into?

Thanks
-Jon
