[Condor-users] Reasonable job queue size limits post 7.x. WAS RE: Condor 6.8.n: job scheduling process delays

Resurrecting this thread because it reminds me of some issues I have an interest in.

I know several other people here besides me used this technique on 6.8 era installations:

A user wishes to submit many thousands of jobs at once (possibly tens of thousands) to the same schedd,
accepting that at most 200 or so will run concurrently and that these will feed through the system nicely over, say, the weekend.

A) Submitting them all at once kills/DOS's the schedd so no jobs start properly anyway.
B) having that many ready to run jobs in the queue kills negotiation (and indeed most other aspects of communication with the schedd)

To fix this you must solve both A and B.

Possible solutions:
1) submit throttle
Simple and effective if you aren't using "queue n" (which I'm not, so that's fine). Essentially you maintain your own queue elsewhere, which has many benefits of its own once you go that route.
Fixes A
Issues: you must keep the 'external queue' in step with actions/events on the condor queue. Periodic tidy up becomes essential. 'Disconnected brains' events do occur and often require wiping one, or both of the queues.
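
As a rough illustration of the submit throttle in option 1, here is a minimal Python sketch. It assumes the external queue is a plain in-memory deque of submit file paths, and that the caller supplies both the current number of jobs in the schedd and a function that performs the actual submission (for example a wrapper around condor_submit). The names top_up, max_in_schedd and batch_limit are invented for this example and are not part of Condor.

```python
from collections import deque
from typing import Callable, Deque, List


def top_up(
    pending: Deque[str],
    jobs_in_schedd: int,
    max_in_schedd: int,
    batch_limit: int,
    submit: Callable[[str], None],
) -> List[str]:
    """Submit jobs from the external queue until the schedd is 'full enough'.

    pending        -- external queue of submit descriptions (hypothetical format)
    jobs_in_schedd -- how many jobs the schedd currently holds (caller measures this)
    max_in_schedd  -- ceiling we are willing to let the schedd queue reach
    batch_limit    -- cap on submissions per call, so one pass never floods the schedd
    submit         -- callable that actually submits one job
    """
    room = max(0, max_in_schedd - jobs_in_schedd)
    count = min(room, batch_limit, len(pending))
    submitted = []
    for _ in range(count):
        job = pending.popleft()  # oldest job first
        submit(job)
        submitted.append(job)
    return submitted
```

Called periodically, this keeps the schedd queue near max_in_schedd while the bulk of the work waits in the external queue. It does nothing about keeping the two queues in step, which is the harder part noted above.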

2) submit jobs on hold and have a throttled release process that sees how many idle jobs are in the queue and releases slowly
Not simple, has performance impact itself (even if you talk to the collector to work out the idle numbers)
Fixes B
Breaks remote submit due to competing interests in the hold state.
Prevents you from telling when a job has gone into a hold state for some other reason without a costly schedd check
Requires an active monitoring process
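
The release decision in option 2 can be sketched in Python as below. This is only the arithmetic of one monitoring cycle, under the assumption that the caller already knows the idle job count and has a list of the job ids that the throttle itself submitted on hold. How those are obtained (collector, schedd query, local bookkeeping) is left out. pick_releases, target_idle and max_per_cycle are hypothetical names.

```python
from typing import List, Sequence


def pick_releases(
    idle_count: int,
    target_idle: int,
    held_by_throttle: Sequence[str],
    max_per_cycle: int,
) -> List[str]:
    """Choose which held jobs to release in this monitoring cycle.

    idle_count       -- idle jobs currently in the queue (measured by the caller)
    target_idle      -- how many idle jobs we want available for negotiation
    held_by_throttle -- ids of jobs we put on hold ourselves, oldest first;
                        jobs held for any other reason must not appear here
    max_per_cycle    -- cap on releases per cycle, to keep the release slow
    """
    shortfall = max(0, target_idle - idle_count)
    count = min(shortfall, max_per_cycle, len(held_by_throttle))
    return list(held_by_throttle[:count])
```

The returned ids would then be passed to whatever performs the release. The cap per cycle is what makes the release "slow"; without it a drained queue would trigger one large burst.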

I wrote code that does the above and it has steadily evolved over time to be quite complex. Notably the throttled aspect has become overly complex as it has been made multithreaded to reduce the latency of certain operations.

I now wonder if I can eliminate the solution 2 aspects and remove the submit on hold and release throttling.
This would *vastly* simplify the code and reduce latency in some key areas for free.
The disconnected brain aspect would still remain but would be mitigated somewhat by no longer caring so much about the JobStatus state, simply about _existence_.
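
To make the existence-only idea concrete, here is a small Python sketch of the reconciliation step, assuming both sides can be reduced to sets of job ids. The id format and the names reconcile, gone and unknown are made up for illustration.

```python
from typing import Iterable, NamedTuple, Set


class Reconciliation(NamedTuple):
    gone: Set[str]     # external queue thinks these are submitted, schedd has no record
    unknown: Set[str]  # schedd has these, external queue never heard of them


def reconcile(
    external_submitted: Iterable[str],
    in_condor_queue: Iterable[str],
) -> Reconciliation:
    """Compare the two queues purely on job existence, ignoring JobStatus."""
    external = set(external_submitted)
    condor = set(in_condor_queue)
    return Reconciliation(gone=external - condor, unknown=condor - external)
```

Jobs in gone have either completed or been removed and can be retired from the external queue; jobs in unknown are the 'disconnected brain' cases that need a tidy up.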

Has anyone else that did this in the 6.8/7.0 era removed the throttled release aspect from 7.2 onwards?
If so was it successful and with what sort of loads?

For comparison some of the key variables that may affect this:
 * Windows only, some 32-bit XP schedds but heavy submissions go to 64-bit 2003 servers
 * almost all jobs will satisfy the auto clustering and will have *identical* requirements
 * not a great deal of user generated condor_q activity
 * jobs take anywhere from minutes up to around 24 hours
 * ~650 execute nodes (and likely to always climb), with about 5-15 distinct user/schedd combinations active at any one time


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: 14 October 2009 15:58
To: Condor-Users Mail List
Subject: Re: [Condor-users] Condor 6.8.n: job scheduling process delays

> Back to your original question, this is entirely a scalability issue.
> Prior to the 6.9.3 release the schedd simply couldn't handle more than
> a few thousand jobs in the job queue without a severe degradation in
> performance.  I believe your previous message stated you had around
> 17,500 jobs in the queue - this simply won't work with Condor 6.8.

Optionally, if you have only a handful of schedd machines in the pool you can upgrade them to 7.x.x. I'm running 6.8.6 execution nodes with 7.0.x central machines (negotiator/collector, schedd/quill) without issue.

- Ian

