Re: [Condor-users] Windows schedd job limits (Was: RE: Hooks in the Scheduler Universe?)
- Date: Thu, 7 Oct 2010 16:36:06 +0100
- From: Matt Hope <Matt.Hope@xxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Windows schedd job limits (Was: RE: Hooks in the Scheduler Universe?)
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: 07 October 2010 14:48
To: Condor-Users Mail List
Subject: Re: [Condor-users] Windows schedd job limits (Was: RE: Hooks in the Scheduler Universe?)
> On bare metal, with some exceptionally nice hardware, I've gotten it up to 150 running jobs/schedd with
> 4 schedds on the box. And even that feels meta-stable at times. The lightest of breezes pushes it off balance.
> This is Win2k3 Server 64-bit. AFAIK there are no registry tweaks to the image. What tweaks are you making?
Apparently we don't apply the desktop heap tweaks on the server VMs; perhaps that only applies to XP or 32-bit machines.
What was the most you managed to achieve on one schedd and queue?
>> Job hooks
>> In the hive mind opinion should I not consider even testing using job hooks (for replacement of schedd/negotiator) on windows right now?
> Well, Todd closed that ticket. I swear it's never worked in 7.4.x for me but I have
> to retest now and confirm this. It can't hurt to try it. But you lose so much with hooks for pulling jobs.
> You have to do your own job selection, or if you let the startd reject jobs, you have to have a way to pull
> and then put back jobs that have been rejected, which is inefficient and difficult to architect such that it
> works well when you've got a *lot* of slots trying to find their next job to run.
> I'll admit this is exactly what I had working at Altera, but it was a good year plus of work to get it functioning.
Yeah I figured that was tricky. Though I would have the advantage that machines would never need to reject (except as a failure mode).
If I did it I'd likely do it as:
A fixed set of queues (Q => max queues), where a job could be in multiple queues at once, albeit with different ranking. These queues define a strict ordering of relative ranking for any individual slot.
Each queue is actually broken down into per user (U => max users) queues within it. This is entirely transparent to the below except where noted.
Every slot would map to a (stable) ordering over the possible queues.
In a steady state a freshly available slot would take the job at the top of the first queue that had a job for it.
In start-up or sudden surges of multiple machines becoming available, I would do the same as in steady state but order the process by the relative speed of the machines (we are happy for this to be a simple assigned strict ordering).
Evaluation of any slot would then be O(QU), where Q is the number of distinct queues. But in almost all cases a queue being non-empty means the job goes to that slot and the process stops. A job being taken by a slot eliminates it from the other queues. At the point a queue is selected I can do some simple user-level balancing (likely just attempting to ensure that the user with the fewest active jobs goes next; I can always add in the user half-life if needed).
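To make the matching concrete, here is a minimal sketch of the slot-to-queue assignment described above. All names (Scheduler, submit, assign) are hypothetical, and the "ranking" is simplified to FIFO order within each per-user queue; it only illustrates the first-non-empty-queue scan, the least-active-user balancing, and removal of a taken job from its other queues.

```python
from collections import deque, defaultdict

class Scheduler:
    """Toy sketch of the proposed design (all names hypothetical).

    Each named queue holds per-user sub-queues; each slot presents a
    fixed, strict ordering over the queues it will take jobs from.
    """
    def __init__(self, queue_names):
        # queue name -> user -> deque of job ids, in rank order
        self.queues = {q: defaultdict(deque) for q in queue_names}
        self.active = defaultdict(int)  # user -> running job count

    def submit(self, job, user, queue_names):
        # a job may sit in several queues at once, under one owner
        for q in queue_names:
            self.queues[q][user].append(job)

    def assign(self, slot_queue_order):
        """O(Q*U) scan: the first non-empty queue wins; within it,
        the user with the fewest active jobs goes next."""
        for q in slot_queue_order:
            users = [u for u, jobs in self.queues[q].items() if jobs]
            if not users:
                continue
            user = min(users, key=lambda u: self.active[u])
            job = self.queues[q][user].popleft()
            # taking the job eliminates it from every other queue
            for other in self.queues.values():
                if job in other[user]:
                    other[user].remove(job)
            self.active[user] += 1
            return job
        return None
```

In the common steady-state case the very first queue scanned is non-empty, so the loop exits immediately, which is the property the design relies on.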
I would have a central server constantly maintaining the top N of each queue/user pair (where N is some number a bit bigger than the max slots ever available), so I would only need to do serious sorting effort on a queue when it is drained. Any event on a job (including insertion of a new one) would only cause a scan of the top N (O(QUN) in the worst case) to see if it belongs there; if not, it is dropped into the 'to place' bucket.
When the top N needs repopulating I can assume that the ordering of the remainder is still good; I just need to merge in the 'to place' jobs, which can be done on a separate thread if need be, and in most cases the ordering will be pretty simple.
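The top-N-plus-overflow-bucket idea might look something like the following sketch (class and method names are made up; ranks are plain comparable numbers). Only the sorted head is scanned on insert; everything that misses it lands in the unsorted bucket, which is merged back only when the head drains.

```python
import bisect

class TopN:
    """Sketch of the per queue/user top-N buffer: only the best N
    jobs are kept sorted; the rest wait in a 'to place' bucket that
    is sorted and merged back only when the head is drained."""
    def __init__(self, n):
        self.n = n
        self.head = []       # sorted list of (rank, job), best first
        self.to_place = []   # unsorted overflow bucket

    def insert(self, rank, job):
        # cheap scan: does this job belong in the top N right now?
        if len(self.head) < self.n or (rank, job) < self.head[-1]:
            bisect.insort(self.head, (rank, job))
            if len(self.head) > self.n:
                # demote the worst of the head to the bucket
                self.to_place.append(self.head.pop())
        else:
            self.to_place.append((rank, job))

    def pop_best(self):
        if not self.head:
            self._repopulate()
        return self.head.pop(0) if self.head else None

    def _repopulate(self):
        # the serious sorting effort, only paid on a drained head;
        # could run on a separate thread as the text suggests
        self.to_place.sort()
        self.head = self.to_place[:self.n]
        self.to_place = self.to_place[self.n:]
```

The point of the split is that job events touch only the small sorted head, so the expensive sort is amortised over an entire drain of the queue.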
I doubt I would need terribly complex lock-free structures. All lock operations would last only as long as required to handle (insert/update/acquire) a single job/slot, unless the 'to place' buckets start to overfill, in which case a brief period of allowing only read-only access to clear them out may be required. If need be, by taking a per-job lock for assignment (locking protocol: queue -> job) I can deal with a job being taken by one slot while it is involved in an operation on another, by discarding and recursing on encountering an orphan. I doubt I would bother with this to start with; accepting that incoming events have to buffer until after the current assignment pass is not that big a deal.
The in-memory structures need not be replicated to disk. On a restart they can be reread and reconstructed; since in most cases there is a natural ordering that can be used when querying the backing database for state, most queues come back 'almost sorted' if not actually perfectly sorted, which simplifies things and should give relatively rapid recovery, not to mention a very simple failover mode.
Submission becomes as simple as inserting into the database, followed by a 'hint ping' to the server process to take a look for fresh data.
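A sketch of that submit path, assuming a made-up `jobs` table and an arbitrary ping port (both are illustrative, not anything from Condor): a transactional insert, then a fire-and-forget connection whose only purpose is to wake the server.

```python
import sqlite3
import socket

def submit_job(db_path, job_id, owner, payload,
               server_addr=("localhost", 9999)):
    """Hypothetical submit path: transactional DB insert plus a
    best-effort 'hint ping'. Schema and port are made up."""
    # the insert is the submission; the context manager commits it
    with sqlite3.connect(db_path) as db:
        db.execute(
            "INSERT INTO jobs (id, owner, payload, state) "
            "VALUES (?, ?, ?, 'idle')",
            (job_id, owner, payload))
    # the ping carries no data; losing it only delays pickup until
    # the server's next periodic scan of the database
    try:
        socket.create_connection(server_addr, timeout=0.5).close()
    except OSError:
        pass
```

Because the database is the source of truth, the ping can be entirely unreliable, which is what makes submission both simple and transactional.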
Pre-emption becomes the only tricky part. But since that is not likely to be an issue in the steady-state phase, we need only run a separate thread to sniff for possible pre-emption targets (only needed for slots running jobs not from their best queue), which triggers pre-emption (and remembers that it has done so) to let the slot make its request for a new job (which will then be fulfilled with its optimal job).
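One pass of that sniffer thread could look like this sketch (the `Slot` shape and function name are invented for illustration): only slots whose running job did not come from their top-ranked queue are candidates, and the `already_preempted` set is the "remembers that it has done so" part.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slot:
    """Minimal stand-in for a slot's state (hypothetical shape)."""
    name: str
    queue_order: tuple   # the slot's strict queue ordering, best first
    running_queue: str   # queue its current job was taken from

def find_preemption_targets(slots, has_better_job, already_preempted):
    """One sniffer pass: flag slots running a sub-optimal job for
    which a better job now exists, at most once per slot."""
    targets = []
    for slot in slots:
        if (slot.running_queue != slot.queue_order[0]
                and slot not in already_preempted
                and has_better_job(slot)):
            targets.append(slot)
            # remember the trigger so we do not preempt twice
            already_preempted.add(slot)
    return targets
```

Running this on its own thread keeps pre-emption entirely out of the hot assignment path, matching the steady-state argument above.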
This all only works within the sort of constrained and, crucially, nearly transitively ordered system we have. But since that is how our system has actually been for the past 5 years, I see no reason to think it will change. Exploiting the steady-state characteristics should allow an extremely fast setup with very little delay, and absolutely no need to query the server for anything but the actual 'get me a new job' phase and triggering a recheck of the database. All submission/modification/querying of jobs goes directly to the database.
I would need to add security to the DB insert/modification so that only user X could insert/update rows owned by themselves. Obviously super-users could delete anyone's.
Many of the things Condor does to deal with streaming/lack of a shared filesystem could be utterly ignored for my purposes; submission would therefore be much simpler (and transactional to boot).
In all probability I wouldn't be doing this, someone else would. But I can get a rough idea of the complexity involved from the above.
>> Multiple per user daemons per box
>> I doubt this would actually improve things
> It does assuming you've got enough CPU so the shadows don't starve on startup.
> That's one area where I notice Windows falling down quite frequently: if you've got a high rate of shadows
> spawning it seems to have a lot of trouble getting the shadow<->startd comm setup and the job started.
> Lately I've been running with a 2:1 ratio of processor/cores to scheduler daemons.
That's worth knowing thanks.
>> Also not clear if anyone uses this heavily on windows
> All the time.
Right ho, still an idea then.
>> Remote submit to Linux-based schedds
>> submission is ultimately a bit of a hack and forces the client side to do a lot more state checking
> I have mixed feelings about remote submission. It gets particularly tricky when you're mixing OSes.
> I've had better luck with password-free ssh to submit jobs to centralized Linux machines.
> And the SOAP interface to Condor is another approach I've had better experience with than remote submission from Windows to Linux. Not to say it can't work, just that I've found it tricky.
I hadn't considered using SOAP; it seemed like it didn't have the performance of raw condor_submit access when I last tried it, but that was long ago. Also I'm not sure how well (if at all) it supports run_as_owner, which would be a deal-breaker.
> Another option is to run a light-weight "submitter" daemon on the scheduler that you have a custom submit command talk to over
> REST or SOAP and it, in turn, is running as root so it has the ability to su to any user and submit as them, on that machine.
> Might be easier than ssh.
We already run such a daemon per user (it basically reads from and maintains a database and does the submit/queue maintenance). The problem is that it is all in C# (with considerable backing libraries I don't fancy trying to port to Mono). But that may well be the best option: migrate the daemon to Mono, use it to submit jobs, and give everyone their own virtual submit server (a bit cheaper per VM at least).
Thanks for the ideas.
Gloucester Research Limited believes the information provided herein is reliable. While every care has been taken to ensure accuracy, the information is furnished to the recipients with no warranty as to the completeness and accuracy of its contents and on condition that any errors or omissions shall not be made the basis for any claim, demand or cause for action.
The information in this email is intended only for the named recipient. If you are not the intended recipient please notify us immediately and do not copy, distribute or take action based on this e-mail.
All messages sent to and from this email address will be logged by Gloucester Research Ltd and are subject to archival storage, monitoring, review and disclosure.
Gloucester Research Limited, 5th Floor, Whittington House, 19-30 Alfred Place, London WC1E 7EA.
Gloucester Research Limited is a company registered in England and Wales with company number 04267560.