
RE: [Condor-users] Condor priority model

See the thread with the title "Confirming MAX_JOBS_RUNNING defaults to infinite" -- there was a short discussion on limiting the number of jobs running for any particular submission host. It's not quite what you were asking for (you ask to limit jobs running by user) but it would keep your 30000-job-submitting user under control.

Also, with regard to MaxJobRetirementTime -- this appeared to be working very well for me (I had set it to 2 days). But see the most recent message with the title "[Condor-users] Using JobPrio to adjust the RANK on machines" -- I had run an experiment in a closed and controlled Condor environment to confirm my settings, and jobs appeared to enter the retirement phase properly, but of course as soon as I stepped away and stopped watching, it all fell apart.


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steffen Grunewald
Sent: September 15, 2004 3:23 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Condor priority model

Hi Matthew, hallo Carsten,

On Wed, Sep 15, 2004 at 07:36:50AM +0100, matthew hope wrote:
> The scheduling logic is a little opaque.

nicely said :-/

> if you have a claim to a machine and the machine doesn't want to 
> preempt it then you can keep on sending jobs till the cows come home.

Figure 3.4 in the manual shows the possible transitions (although it's almost Greek to me, with its states and activities).

It still needs several minutes of 100% CPU brain work to figure out what this means...

> the action of claiming a machine to a user being disassociated from 
> the action almost certainly comes because the scheduling logic can be 
> faster and in most cases it does not significantly impact users (esp 
> in a preemption-> checkpoint environment)

We don't preempt/checkpoint (low bandwidth being the main reason, and local checkpoints aren't generally a good idea IMHO)

> If your job latency is important remember that condor is geared up for 
> high throughput not low latency...

In plain English: one may have to wait some time to get machine access, but once one has it (CLAIMED the VMs), the whole cluster will be pushed through, right? (Reminds me of that classical queue situation where the old lady asks to pass by since she only needs one little thing, and then remembers 1000s of others she needs too...)

Which means that high priority users (low prio factors) may have to wait for ages if there's a low prio user with 30000 jobs who just took the chance when the whole pool was idle. The alternative is to set up preemption (which can be a pain with hundreds of VMs running long jobs, a week or more each). Right? Or the low prio user has to go and condor_hold her still-idle jobs.

Would be nice if there was an "-idleonly" option to condor_hold. Perhaps a suggestion for the next release? (In some cases it would also make sense to have a count limit to condor_release so only say 500 of 30000 jobs would be released at one time...) (Of course it can be done using some long one-liner shell script, but most of our Condor users are users, not geeks.)
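
For what it's worth, here is a rough Python sketch of that workaround, assuming the installed condor_hold, condor_release and condor_q accept -constraint (and condor_q accepts -format), and that JobStatus is 1 for idle and 5 for held jobs. The function names are made up for illustration; nothing here is part of Condor itself.

```python
import subprocess

IDLE = 1  # JobStatus value for idle jobs
HELD = 5  # JobStatus value for held jobs


def hold_idle_cmd(owner):
    """Build a condor_hold command matching only the owner's idle jobs."""
    constraint = f'Owner == "{owner}" && JobStatus == {IDLE}'
    return ["condor_hold", "-constraint", constraint]


def release_cmd(job_ids, limit):
    """Build a condor_release command for at most `limit` cluster.proc ids.

    Returns None when there is nothing to release.
    """
    batch = list(job_ids)[:limit]
    return ["condor_release", *batch] if batch else None


def held_job_ids(owner):
    """Ask condor_q for the owner's held jobs as 'cluster.proc' strings."""
    constraint = f'Owner == "{owner}" && JobStatus == {HELD}'
    result = subprocess.run(
        ["condor_q", "-constraint", constraint,
         "-format", "%d.", "ClusterId", "-format", "%d\n", "ProcId"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def run(cmd):
    """Run a command built above; does nothing if cmd is None."""
    if cmd is not None:
        subprocess.run(cmd, check=True)
```

A user would then call run(hold_idle_cmd("someuser")) once, and later run(release_cmd(held_job_ids("someuser"), 500)) whenever the pool has room for another batch. Note that this releases any of the user's held jobs, including ones put on hold for other reasons.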


Steffen Grunewald * * * Merlin cluster admin (http://pandora.aei.mpg.de)
Albert-Einstein-Institut (MPI Gravitationsphysik, http://www.aei.mpg.de)
       Science Park Golm, Am Mühlenberg 1, 14476 Potsdam, Germany
e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}

Condor-users mailing list
Condor-users@xxxxxxxxxxx http://lists.cs.wisc.edu/mailman/listinfo/condor-users