Re: [Condor-users] avoiding vanilla job eviction



You may get away with defining extra slots/VMs for your cluster machines: one with a lower priority/nice level, and the other with a higher priority but a capped run time. (There is a paper referenced somewhere in the archives that covers this problem very well. Not my paper, btw.)

I'm thinking something like:

JOB_RENICE_INCREMENT = 8 - ((VirtualMachineID == 1) * 4)

PREEMPT = ($(StateTimer) > 2 * 24 * $(HOUR)) && (VirtualMachineID != 1)

The above might not be entirely correct, and _I_ have failed to get renicing working on Windows, but it gives you an idea of what might be possible. It will be interesting to hear how you get on.

You will then need to tell your users to specify which VMs they want to target, so you may have a START expression like this:

START = ( ((VirtualMachineID == 1) && (USE_VM1 =?= TRUE)) || \
          ((VirtualMachineID == 2) && (USE_VM2 =?= TRUE)) ) ....

Then add +USE_VM1 = TRUE to the job submit files.
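
As a rough sketch of what that could look like, here is a minimal submit file for a long-running vanilla job that asks for VM1. The executable and file names are made up for illustration, and it assumes the START expression above is in place on the execute machines:

```
# Hypothetical submit file; long_job and the output file names are placeholders.
universe   = vanilla
executable = long_job
output     = long_job.out
error      = long_job.err
log        = long_job.log

# Custom job attribute tested by the START expression on the execute machines.
+USE_VM1   = TRUE

queue
```

Users with short jobs would set +USE_VM2 = TRUE instead. A job that sets neither attribute would not match either branch of the START expression as written.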


Peter

Dr Peter Myerscough-Jackopson
Engineer, MAC Ltd

phone:+44 (0) 23 8076 7808 fax:+44 (0) 23 8076 0602
email:peter.myerscough-jackopson@xxxxxxxxxx  web:www.macltd.com

Multiple Access Communications Limited is a company registered in
England at Delta House, Southampton Science Park, Southampton,
SO16 7NS, United Kingdom with Company Number 1979185

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steffen Grunewald
Sent: 09 November 2007 08:49
To: condor-users@xxxxxxxxxxx
Subject: Re: [Condor-users] avoiding vanilla job eviction

On Thu, Nov 08, 2007 at 03:33:58PM -0700, Pasquale Tricarico wrote:
> Hi,
> 
> In our condor cluster, we have two classes of users (not sorted by 
> importance whatsoever):
> 
> A) users running a few jobs for a long time (weeks), sometimes using
> the vanilla universe (only option, code links to libpthread);
> 
> B) users running many jobs (more than the available nodes) all at the 
> same time, for a short period (less than one day typically).
> 
> The problem is that the B class of users typically have a very low 
> effective priority (condor_userprio...), so their jobs can easily 
> cause the eviction of vanilla jobs from A class users. This is a 
> problem, because this way A class users lose all the time already put 
> into the job, as the vanilla jobs cannot checkpoint. Since the 
> standard universe is sometimes not an option, is there a way to 
> configure Condor in such a way that vanilla jobs are never (or almost
> never..) evicted but just kept in memory while other jobs are running?
> Or maybe some other trick so that vanilla jobs are not restarted from 
> scratch, but just suspended while waiting for enough priority? Thanks 
> for your suggestions.

On our pool, we have defined 

SUSPEND = False
PREEMPT = False
CONTINUE = True
WANT_SUSPEND = False
WANT_VACATE = False

and job eviction has gone away completely.
This wouldn't suspend the long-running jobs in favour of the short ones, though. Our current policy is a mix of "first come, first served"
(until all resources are claimed) and "fair share" (based on cumulative priority, with a short CLAIM_WORKLIFE).
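
For illustration only, a short claim lifetime could be set along these lines. CLAIM_WORKLIFE is given in seconds; the value below is an assumed example, not the one used on our pool:

```
# Illustrative value (20 minutes), chosen only as an example.
# After this many seconds a claim stops accepting further jobs
# and the slot goes back to the negotiator to be matched again.
CLAIM_WORKLIFE = 1200
```

A job that is already running is not stopped when the claim's worklife expires; the setting only controls whether the same claim is reused for the next job.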

Not sure whether this is what you (and your users) want...

Steffen

--
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html