
Re: [Condor-users] Windows schedd job limits (Was: RE: Hooks in the Scheduler Universe?)

Thanks, but I already knew those tips. The problem is the need for contiguous memory, which means the theoretical limits often can't be reached (as I said, ~200 is about our current limit).


Our pool is now plenty big enough (~600 slots and counting) that this does cause issues, especially if two users use the same schedd by accident.


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael O'Donnell
Sent: 07 October 2010 17:45
To: Condor-Users Mail List
Subject: Re: [Condor-users] Windows schedd job limits (Was: RE: Hooks in the Scheduler Universe?)


I am new to both administering Condor and using it, so maybe I do not fully understand Condor's operation when submitting jobs. What I have read is that on any Windows submit machine, the maximum number of jobs that can be submitted is dictated by the heap size. It should not make any difference whether it is a VM or not, but we do not have any VM submit machines. Although the Condor manual does not discuss it, it may also be necessary to change the heap size for non-interactive window stations, but that only affects jobs running on the execute machine. I have not had the latter problem yet.

The most jobs I have run concurrently is about 90. This is a small number and does not require changing the default heap size on Windows. Our pool is small, so for the time being we will not exceed this amount.

This is what I have in my notes (not necessarily correct):
To increase the maximum number of jobs that can run from the schedd, the global config setting MAX_JOBS_RUNNING should be raised, and the heap size on the submit machine (Windows only) must be increased.
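For reference, a minimal condor_config excerpt might look like the following. MAX_JOBS_RUNNING is a real schedd knob; the value here is purely illustrative, and it should only be raised in step with the heap changes discussed below:

```
# condor_config on the submit machine (illustrative value)
MAX_JOBS_RUNNING = 300
```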

Collector and negotiator requirements: ~10 KB of RAM per batch slot for each service.

Schedd requirements: ~10 KB of RAM per job in the queue; more is required when jobs carry large ClassAd attributes.

condor_shadow requirements: ~500 KB of RAM for each running job. This can roughly double when jobs are submitted from a 32-bit system.

Limitation: a 32-bit machine cannot run more than 10,000 simultaneous jobs due to kernel memory constraints.

The Windows default desktop heap size is 512 KB. Sixty condor_shadow daemons use about 256 KB, so 120 shadows will consume the entire default heap. To run 300 jobs, set the heap size to 1280 KB.
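The heap arithmetic above can be sanity-checked with a quick sketch. The per-shadow figure here is derived from the 120-shadows-per-512-KB estimate in these notes, not a measured number, so treat the output as a rough sizing guide only:

```python
import math

# Desktop-heap sizing sketch based on the estimates above:
# the 512 KB default heap is exhausted by ~120 shadows (~4.27 KB each).
DEFAULT_HEAP_KB = 512
SHADOWS_PER_DEFAULT_HEAP = 120

def heap_kb_needed(max_running_jobs):
    """Approximate desktop heap (KB) needed for `max_running_jobs` shadows."""
    kb_per_shadow = DEFAULT_HEAP_KB / SHADOWS_PER_DEFAULT_HEAP
    return math.ceil(max_running_jobs * kb_per_shadow)

print(heap_kb_needed(300))  # 1280, matching the figure above
```

(On Windows the non-interactive desktop heap is the third SharedSection value under the Session Manager\SubSystems\Windows registry key; as noted above, the manual does not spell this out, so check before applying it.)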

I hope this helps,


From: Matt Hope <Matt.Hope@xxxxxxxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 10/07/2010 09:38 AM

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: 07 October 2010 14:48
To: Condor-Users Mail List
Subject: Re: [Condor-users] Windows schedd job limits (Was: RE: Hooks in the Scheduler Universe?)

> On bare metal, with some exceptionally nice hardware, I've gotten it up to 150 running jobs/schedd with
> 4 schedds on the box. And even that feels meta-stable at times; the lightest of breezes pushes it off balance.
> This is Win2k3 Server 64-bit. AFAIK there are no registry tweaks to the image. What tweaks are you making?

Apparently we don't apply the desktop heap tweaks on the server VMs; perhaps that only applies to XP or 32-bit machines.
What was the most you managed to achieve on one schedd and queue?

>> Job hooks
>> In the hive mind opinion should I not consider even testing using job hooks (for replacement of schedd/negotiator) on windows right now?
> Well, Todd closed that ticket. I swear it never worked in 7.4.x for me, but I'd have
> to retest now to confirm that. It can't hurt to try it. But you lose so much with hooks for pulling jobs.
> You have to do your own job selection, or, if you let the startd reject jobs, you have to have a way to pull
> and then put back jobs that have been rejected, which is inefficient and difficult to architect such that it
> works well when you've got a *lot* of slots trying to find their next job to run.
> I'll admit this is exactly what I had working at Altera, but it was a good year-plus of work to get it functioning.

Yeah, I figured that was tricky. Though I would have the advantage that machines would never need to reject (except as a failure mode).
If I did it, I'd likely do it as:

A fixed set of queues (Q => max queues), where a job can be in multiple queues at once, albeit with a different ranking in each. These queues define a strict ordering of relative ranking for any individual slot.
Each queue is actually broken down into per-user (U => max users) sub-queues. This is entirely transparent to the below except where noted.

Every slot would map to a (stable) ordering over the possible queues.
In steady state, a freshly available slot would take the job at the top of the first queue that had a job for it.
During start-up or sudden surges of multiple machines becoming available, I would do the same as in steady state but order the process by the relative speed of the machines (we are happy for this to be a simple assigned strict ordering).

Evaluation for any slot is then O(QU), where Q is the number of distinct queues. But in almost all cases a queue being non-empty means the job goes to that slot and the process stops. A job being taken by a slot eliminates it from the other queues. At the point a queue is selected I can do some simple user-level balancing (likely just ensuring that the user with the fewest active jobs goes next; I can always add a user half-life if needed).
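The matching pass described above could be sketched roughly as follows. All names are illustrative, and the "eliminate the job from other queues" step is left out for brevity:

```python
from collections import deque

def next_job_for_slot(slot_queue_order, queues, active_jobs):
    """Walk the slot's queue preference order; within the first non-empty
    queue, pick the user with the fewest active jobs (simple balancing).

    queues: queue name -> {user: deque of jobs}
    active_jobs: user -> count of currently running jobs
    """
    for qname in slot_queue_order:
        per_user = queues.get(qname, {})
        candidates = [u for u, jobs in per_user.items() if jobs]
        if not candidates:
            continue
        user = min(candidates, key=lambda u: active_jobs.get(u, 0))
        job = per_user[user].popleft()
        active_jobs[user] = active_jobs.get(user, 0) + 1
        return job
    return None  # no queue had work for this slot
```

In the common case the first queue is non-empty and the loop terminates immediately, which is the steady-state behaviour described above.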

I would have a central server constantly maintaining the top N of each queue/user pair (where N is somewhat larger than the maximum number of slots ever available), so serious sorting effort on a queue is only needed when that top N is drained. Any event on a job (including insertion of a new one) can only cause a scan of the top N (O(QUN) in the worst case) to see whether it belongs there; if not, it is dropped into the 'to place' bucket.
On needing to repopulate the top N, I can assume the ordering of the bottom is still good; I just need to merge in the 'to place' entries, which can be done on a separate thread if need be, and in most cases the ordering will be pretty simple.
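A minimal single-threaded sketch of that top-N-plus-bucket structure (names invented here; the real thing would merge in the background rather than on demand):

```python
import bisect

class TopN:
    """Keep the best N jobs sorted; everything else falls into an
    unsorted 'to place' bucket that is merged back lazily."""

    def __init__(self, n):
        self.n = n
        self.head = []      # sorted (rank, job) pairs, best rank first
        self.to_place = []  # overflow bucket, merged when the head drains

    def insert(self, rank, job):
        # Only scan/insert into the head if the job could belong there.
        if len(self.head) < self.n or rank < self.head[-1][0]:
            bisect.insort(self.head, (rank, job))
            if len(self.head) > self.n:
                self.to_place.append(self.head.pop())  # displaced entry
        else:
            self.to_place.append((rank, job))

    def pop_best(self):
        if not self.head and self.to_place:
            # Repopulate: sort the bucket and promote the best N entries.
            self.to_place.sort()
            self.head = self.to_place[:self.n]
            self.to_place = self.to_place[self.n:]
        return self.head.pop(0) if self.head else None
```

Each insert is O(N) against the head in the worst case, matching the O(QUN) bound above when applied across all queue/user pairs.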

I doubt I would need terribly complex lock-free structures. All lock operations would last only as long as required to handle (insert/update/acquire) a single job or slot, unless the 'to place' buckets start to overfill, in which case a brief period of read-only access to clear them out may be required. If need be, by taking a per-job lock for assignment (locking protocol: queue -> job) I can deal with a job being taken by one slot while involved in an operation elsewhere, by discarding and recursing on encountering an orphan. I doubt I would bother with this to start with; accepting that incoming events have to buffer until after the current assignment pass is not that big a deal.

The in-memory structures need not be replicated to disk. On a restart they can be reread and reconstructed; in most cases there will be a natural ordering that can be used when querying the backing database for state, leaving most queues 'almost sorted' if not perfectly sorted, which simplifies things and should give relatively rapid recovery, not to mention a very simple failover mode.

Submission becomes as simple as inserting into the database, followed by a 'hint ping' to the server process to tell it to look for fresh data.
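That submission path is small enough to sketch end to end. The schema, port, and ping payload here are all invented for illustration; the point is only the shape: a transactional insert, then a fire-and-forget nudge:

```python
import socket
import sqlite3

def submit(db_path, owner, cmd, server=("127.0.0.1", 9999)):
    """Insert a job row transactionally, then hint the server to rescan."""
    con = sqlite3.connect(db_path)
    with con:  # the job row either lands completely or not at all
        con.execute(
            "CREATE TABLE IF NOT EXISTS jobs(owner TEXT, cmd TEXT, state TEXT)")
        con.execute("INSERT INTO jobs VALUES (?, ?, 'idle')", (owner, cmd))
    con.close()
    # Fire-and-forget UDP hint; losing it is fine, the server polls anyway.
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(b"hint", server)
```

Because the hint is advisory, a dropped ping costs only latency, not correctness, which keeps the submit side trivially simple.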

Pre-emption becomes the only tricky part. But since it is not likely to be an issue in the steady-state phase, we need only run a separate thread to sniff for possible pre-emption targets (needed only for slots running jobs not from their best queue), which triggers pre-emption (and remembers that it has done so) to let the slot request a new job (which will then be its optimal job).
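The sniffer's core check is simple; a sketch with an invented slot representation (a real version would also confirm a better-queue job is actually waiting before pulling the trigger):

```python
def preemption_candidates(slots, already_preempted):
    """Slots running a job from other than their best queue are candidates.

    slots: dicts with 'id', 'queue_order' (best first), 'running_queue'
    already_preempted: set of slot ids we have already triggered
    """
    out = []
    for slot in slots:
        if slot["running_queue"] is None:
            continue  # idle slot, nothing to pre-empt
        best = slot["queue_order"][0]
        if slot["running_queue"] != best and slot["id"] not in already_preempted:
            out.append(slot["id"])
    return out
```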

This all only works within the sort of constrained and, crucially, nearly transitively ordered system we have. But since that is how our system has actually worked for the past five years, I see no reason to think it will change. Exploiting the steady-state characteristics should allow an extremely fast setup with very little delay and no need to query the server for anything but the actual 'get me a new job' phase and triggering a recheck of the database. All submission/modification/querying of jobs goes directly to the database.

I would need to add security to the DB inserts/modifications so that only user X can insert/update rows 'owned' by X. Obviously superusers could delete anyone's.

Many of the things Condor does to deal with streaming or the lack of a shared filesystem could be utterly ignored for my purposes, so submission would be much simpler (and transactional to boot).

In all probability I wouldn't be doing this myself; someone else would. But I can get a rough idea of the complexity involved from the above.

>>         Multiple per user daemons per box
>>   I doubt this would actually improve things
> It does assuming you've got enough CPU so the shadows don't starve on startup.
> That's one area where I notice Windows falling down quite frequently: if you've got a high rate of shadows
> spawning it seems to have a lot of trouble getting the shadow<->startd comm setup and the job started.
> Lately I've been running with a 2:1 ratio of processor/cores to scheduler daemons.

That's worth knowing, thanks.

>>   Also not clear if anyone uses this heavily on windows
> All the time.

Right ho, still an idea then.

>> Remote submit to Linux-based schedds
>>  submission is ultimately a bit of a hack, and forces the client side to do a lot more state checking
> I have mixed feelings about remote submission. It gets particularly tricky when you're mixing OSes.


> I've had better luck with password-free ssh to submit jobs to centralized Linux machines.
> And the SOAP interface to Condor is another approach I've had better experience with than remote submission from Windows to Linux. Not to say it can't work, just that I've found it tricky.

I hadn't considered using SOAP; it seemed like it didn't have the performance of raw condor_submit access when I last tried it, but that was long ago. Also I'm not sure how well (if at all) it supports run_as_owner, which would be a deal-breaker.

> Another option is to run a light-weight "submitter" daemon on the scheduler that you have a custom submit command talk to over
> REST or SOAP and it, in turn, is running as root so it has the ability to su to any user and submit as them, on that machine.
> Might be easier than ssh.

We already run such a daemon per user (it basically reads from and maintains a database and does the submit/queue maintenance). The problem is that it is all in C# (with considerable backing libraries I don't fancy trying to port to Mono). But that may well be the best option: migrate the daemon to Mono, use it to submit jobs, and give everyone their own virtual submit server (a bit cheaper per VM at least).

Thanks for the ideas.

