
RE: [Condor-users] How to have schedd drop claim after each job



Sorry to be a pain, but this is a feature I've been dying for for ages, and it seems somewhat confusing in that form.

Let me get this straight.

This is what I* would like:

1) Machine A is claimed by (at the time) the best job for it.
2) A new job is added to the queue (or released / qedited etc.).
3) This job evaluates to a higher rank on machine A than the current job, but preemption_requirements evaluates to false.
4) When the current job finishes, the machine releases the current claim and behaves like a fresh machine.

I appreciate that step 4 implies an overhead and a slight loss of throughput (especially on small, rapid jobs) and may not be the desired default behaviour, but it causes the farm to behave in the least surprising manner for the end users.

Three options that trade off complexity/timeliness for overhead:

A) Only drop the claim if, during the current job, a higher scoring job arrives.
pro - no additional tests (since you evaluate the rank to determine pre-emption anyway)
con - would still cause starvation if you aren't using the user priority system (as on our farm)
con - the job may have been taken or removed in between the last evaluation and the end of the job

B) Only drop the claim if, at the end of the current job, a higher scoring job is found.
pro - absolutely correct behaviour - no starvation and user priority can also be evaluated
con - significant overhead in between jobs

C) Mix with a timer: after X amount of time in the claimed state, use B.
pro/con - as for B, but the admin can balance correctness lag against overhead
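
To make the difference between A, B and C concrete, here is a rough sketch in Python of the decision each option would make when a job finishes. This is only an illustration of the idea, not anything Condor actually implements: the Job class, the function names and the threshold parameter are all made up for this example, and "rank" simply stands for whatever score the job gets on machine A.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Job:
    """Hypothetical stand-in for a queued job (not a Condor data structure)."""
    name: str
    rank: float  # assumed: the job's score as evaluated on machine A


def best_idle_job(queue: Sequence[Job]) -> Optional[Job]:
    """Return the highest-ranked job waiting in the queue, or None if empty."""
    return max(queue, key=lambda job: job.rank, default=None)


def drop_claim_a(saw_higher_rank_during_run: bool) -> bool:
    """Option A: reuse what was learned while the job was running.

    No new evaluation happens at job end, so the answer may be stale if the
    higher-ranked job was removed or started elsewhere in the meantime.
    """
    return saw_higher_rank_during_run


def drop_claim_b(finished: Job, queue: Sequence[Job]) -> bool:
    """Option B: re-evaluate the current queue when the job finishes."""
    best = best_idle_job(queue)
    return best is not None and best.rank > finished.rank


def drop_claim_c(finished: Job, queue: Sequence[Job],
                 claimed_seconds: float, threshold_seconds: float) -> bool:
    """Option C: apply B only once the claim has been held for X seconds."""
    if claimed_seconds < threshold_seconds:
        return False  # claim is still young: keep it and skip the overhead
    return drop_claim_b(finished, queue)


if __name__ == "__main__":
    finished = Job("current", rank=2.0)
    queue = [Job("low", rank=1.0), Job("high", rank=5.0)]
    print(drop_claim_b(finished, queue))
    print(drop_claim_c(finished, queue, claimed_seconds=30,
                       threshold_seconds=60))
```

With a threshold of zero, C behaves exactly like B; with a very large threshold it degenerates to the existing behaviour, which is the tuning knob I have in mind for the admin.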

The behaviour I infer from the mail below is:

1) Machine A is claimed by (at the time) the best job for it.
2) A new job is added to the queue (or released / qedited etc.).
3) This job evaluates to a higher rank on machine A than the current job.

4a) preemption_requirements evaluates to true.
5a) The currently running job gets an additional amount of time to complete before being vacated.

4b) preemption_requirements evaluates to false.
5b) Existing behaviour.

This is an improvement, but it does not really provide the control I list above, since I do not necessarily know in advance how long is reasonable to give to a job.

I suppose I can simulate the above behaviour by pushing this retirement timeout very high, but will this lead to issues further down the line, such as:

1) Another machine becomes free but the pending job cannot use it.
2) Another job of even better rank cannot take the pending claim off the existing one.
3) Management of state and state transitions is already complex; this seems to muddy it further.

Does a pending claim count for the purposes of continuing to evaluate the cluster?

Questions like this suggest the feature may cause more confusion.

I like C because:

a) the admin can tune it.
b) the behaviour is exactly what most people would expect when looking at the queue.
c) the _current_ state is always used to determine the next allocated job, rather than any previous state.

Could you clarify what the current idea is for this functionality, please? I can then determine whether it will fit our needs or whether I need to look at an internal workaround.

Thanks,
Matt

* And, judging from this list, the behaviour a lot of people expected and would also like.

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Dan Bradley
> Sent: 19 July 2004 20:18
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] How to have schedd drop claim after each job
> 
> 
> Hi Maarten,
> 
> I think the feature you want is in the current development branch and 
> will be released in Condor 6.7.2.  In the machine policy, you will be 
> able to specify 'MaxJobRetirementTime', an expression that determines 
> the maximum runtime for a job that is in a 'retiring' claim.  A claim 
> may go into retirement due to any type of preemption, or due to Condor 
> being gracefully shut down or restarted.  It will stay in the retiring 
> state until the current job finishes or the maximum retirement time 
> expires (or the GRACEFUL_SHUTDOWN_TIMEOUT expires).
> 
> If you are living in the stable series, then there are some 
> less-than-ideal methods people have come up with to address the problem 
> with resource reallocation when you want minimal job death.  One is to 
> set your PREEMPTION_REQUIREMENTS to allow preemption only during the 
> first 10 minutes of the job and (if you _really_ can't live with jobs 
> being killed), add a USER_JOB_WRAPPER script that sleeps for 10 minutes 
> before starting jobs.
> 
> To my knowledge, there is no way to force the schedd to drop each claim 
> after running a job.  Anybody with a clever solution, please correct me!
> 
> --Dan
> 
> Maarten Ballintijn wrote:
> 
> >Hello,
> >
> >Most of our jobs are vanilla universe for the moment. In order not
> >to waste CPU time I'd like them to run to completion. I understand
> >how to configure PREEMPT and PREEMPTION_REQUIREMENTS etc. to avoid
> >killing the jobs. The catch is that schedd hangs on to the claim
> >even if the priority dictates another job should run.
> >
> >Is there a way to have schedd relinquish the claim "between" two jobs,
> >either always or when appropriate?
> >
> >Thanks for your help,
> >
> >Maarten.
