
Re: [Condor-users] quick question: is periodic vacate possible



On Tue, Jun 22, 2010 at 5:29 AM, Smith, Ian <I.C.Smith@xxxxxxxxxxxxxxx> wrote:
OK, I think I see how to go about this now. How would I write the
PREEMPT expression? Presumably it would need to include
a WANT_VACATE == TRUE term (so that only jobs that save
their own checkpoints are vacated) and some way of determining
if the run time was greater than a given periodic checkpoint time
(I guess this value could be supplied via a job classad?).

Did you get this working?

You have control over how long Condor waits for a job to checkpoint itself when it wants to get the job off a machine.

If your jobs have this in their submit file:

+CheckpointJob = True

And then, in the Condor configuration on the execute machines:

# Some helpful macros
StateTimer = (CurrentTime - EnteredCurrentState)
ActivityTimer = (CurrentTime - EnteredCurrentActivity)

# Preempt long running jobs
PREEMPT = ($(ActivityTimer) > 3600)

# WANT_VACATE gets checked when PREEMPT=True to see if we should
# vacate the job through a checkpointing call or proceed directly to killing
# the job. So move to Preempting/Vacating if this is a check-pointable job
WANT_VACATE = CheckpointJob =?= True

# Move to the Preempting/Killing state after 30 seconds in Preempting/Vacating
KILL = $(StateTimer) > 30

# And get real tough on things after another 30 seconds in the
# Preempting/Killing state
KILLING_TIMEOUT = 30
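
If it helps to see the timeline those settings produce, here is a rough Python model of it. This is only a sketch, not Condor code: the function name and the "hard-killed" label are made up for illustration, the job is assumed never to exit on its own, and the real startd evaluates these expressions periodically, so the boundaries are approximate.

```python
# Rough model of the policy above; thresholds mirror the config values.
PREEMPT_AFTER = 3600    # seconds of activity before PREEMPT becomes True
VACATE_LIMIT = 30       # seconds allowed in Preempting/Vacating (KILL)
KILLING_TIMEOUT = 30    # seconds allowed in Preempting/Killing


def phase(seconds_busy, checkpoint_job):
    """Return where a job that never exits would be after seconds_busy."""
    if seconds_busy <= PREEMPT_AFTER:
        return "Claimed/Busy"
    overtime = seconds_busy - PREEMPT_AFTER
    if checkpoint_job:
        # WANT_VACATE is True: the job first gets time to checkpoint.
        if overtime <= VACATE_LIMIT:
            return "Preempting/Vacating"
        overtime -= VACATE_LIMIT
    # Jobs without CheckpointJob skip straight to this step.
    if overtime <= KILLING_TIMEOUT:
        return "Preempting/Killing"
    return "hard-killed"


if __name__ == "__main__":
    for t in (3600, 3601, 3631, 3661):
        print(t, phase(t, True), "|", phase(t, False))
```

So a checkpointing job gets roughly 30 seconds to save its state, and a job that does not set CheckpointJob skips that window entirely.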
 
That's the approximate framework for things. Now you can tweak it to suit your needs. Perhaps your jobs take a variable, but deterministic, amount of time to vacate. In this case, if they supplied their estimated checkpointing time with a job ad:

+CheckpointTime = 120

You could try to reference it (I'm not sure this is 100% correct; TARGET. is always a tricky one to use):

# If the job told us how long to wait for it to checkpoint, use that. Otherwise use
# the default of 30 seconds.
KILL = (!isUndefined(TARGET.CheckpointTime) && ($(StateTimer) > TARGET.CheckpointTime)) || (isUndefined(TARGET.CheckpointTime) && ($(StateTimer) > 30))

That needs to be verified. But it's a start.
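
One way to verify a candidate KILL expression is to write the intended behaviour down as a small truth table and compare against it. Here is a minimal Python sketch of that intent. The function name is hypothetical, and `None` stands in for a job ad that does not define CheckpointTime; it models the decision only, not how Condor evaluates ClassAds.

```python
DEFAULT_VACATE_LIMIT = 30  # seconds, the fallback used in the config above


def should_kill(state_timer, checkpoint_time=None):
    """Intended KILL decision: honour the job's CheckpointTime if it set one."""
    if checkpoint_time is None:
        limit = DEFAULT_VACATE_LIMIT
    else:
        limit = checkpoint_time
    return state_timer > limit


if __name__ == "__main__":
    # (seconds in Preempting/Vacating, CheckpointTime from the job ad)
    for timer, ckpt in [(31, None), (31, 120), (121, 120)]:
        print(timer, ckpt, should_kill(timer, ckpt))
```

The case to watch is the middle one: a job that asked for 120 seconds should still be left alone at 31 seconds.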

- Ian