
Re: [Condor-users] quick question: is periodic vacate possible



Many thanks for this -  it looks to be exactly what I was after.

 

regards,

 

-ian.

 

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
Sent: 24 June 2010 22:49
To: Condor-Users Mail List
Subject: Re: [Condor-users] quick question: is periodic vacate possible

 

On Tue, Jun 22, 2010 at 5:29 AM, Smith, Ian <I.C.Smith@xxxxxxxxxxxxxxx> wrote:

OK I think I see how to go about this now. How would I write the
PREEMPT expression? Presumably it would need to include
a WANT_VACATE==TRUE term (so that only jobs that save
their own checkpoints are vacated) and some way of determining
if the run time was greater than a given periodic checkpoint time

(I guess this value could be supplied via a job classad?).

 

Did you get this working?

 

You have control over how long Condor waits for a job to checkpoint itself when it wants to get the job off a machine.

 

If your jobs have:

 

+CheckpointJob = True

 

And then:

 

# Some helpful macros

StateTimer = (CurrentTime - EnteredCurrentState)
ActivityTimer = (CurrentTime - EnteredCurrentActivity)

 

# Preempt long running jobs

PREEMPT = ($(ActivityTimer) > 3600)

 

# WANT_VACATE gets checked when PREEMPT=True to see if we should

# vacate the job through a checkpointing call or proceed directly to killing

# the job. So move to Preempting/Vacating if this is a check-pointable job

WANT_VACATE = CheckpointJob =?= True

 

# Move to the Preempting/Killing state after 30 seconds in Preempting/Vacating

KILL = $(StateTimer) > 30

 

# And get real tough on things after another 30 seconds in the

# Preempting/Killing state

KILLING_TIMEOUT = 30

 

That's the approximate framework for things. Now you can tweak it to suit your needs. Perhaps your jobs take a variable, but deterministic, amount of time to vacate. In this case, if they supplied their estimated checkpointing time with a job ad:

 

+CheckpointTime = 120

 

You could try to reference it (I'm not sure this is 100% correct; TARGET. is always a tricky one to use):

 

# If the job told us how long to wait for it to checkpoint, use that. Otherwise use

# the default of 30 seconds.

KILL = (!isUndefined(TARGET.CheckpointTime) && ($(StateTimer) > TARGET.CheckpointTime)) || (isUndefined(TARGET.CheckpointTime) && ($(StateTimer) > 30))

 

That needs to be verified. But it's a start.
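
One way to sanity-check the intended fallback (use the job's CheckpointTime when it is supplied, otherwise the 30 second default) before touching a pool is a small Python sketch. This is only a model of the decision, not Condor code: the function name should_kill and the use of None to stand in for an undefined job attribute are assumptions made for illustration.

```python
from typing import Optional

# Fallback vacate window in seconds, matching the default used in the config.
DEFAULT_VACATE_SECONDS = 30


def should_kill(state_timer: int, checkpoint_time: Optional[int] = None) -> bool:
    """Return True once a vacating job has overstayed its checkpoint window.

    state_timer models $(StateTimer); checkpoint_time models the job's
    +CheckpointTime attribute, with None standing in for "undefined"
    (an assumption of this sketch, not ClassAd semantics).
    """
    if checkpoint_time is None:
        limit = DEFAULT_VACATE_SECONDS
    else:
        limit = checkpoint_time
    return state_timer > limit


if __name__ == "__main__":
    # A job that declared a 120 second checkpoint time, 60 seconds in.
    print(should_kill(60, 120))
    # A job with no declared checkpoint time, 60 seconds in.
    print(should_kill(60))
```

A job that declares a longer checkpoint time should survive past the 30 second default, and a job that declares nothing should fall back to it. If the real KILL expression behaves differently for either case, the expression is what needs fixing.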

 

- Ian