
[Condor-users] feature request - vacate disabling by job



I have a farm where a certain subset of jobs cannot be vacated in
any way except by restarting from the beginning. The others can trap
the wm_close and persist their state, but take a while to do so.

Therefore my KILL expression gives them 20 mins to handle the wm_close
and exit before killing them outright. However this is a pain for the
vacate-unfriendly jobs in two ways.

1) They weren't going to finish in 20 mins anyway, so they waste 20
mins of execution time for themselves and for the waiting job.

2) They happen to finish in that 20 mins (reasonably likely given that
I try to tune the jobs to be a few hours tops) and Condor believes
them to have vacated, going to all the hassle of transferring files
around and running the job again. This wastes even more time than
case 1.

I can try to avoid this by making my KILL statement aware of the job
it is deciding about but this is

1) inelegant
2) a pain to maintain

It strikes me that a job is better placed to indicate its ability to
deal with a vacation event.

A submit parameter such as

JobWantVacate = true / false

could allow Condor to short-circuit KILL to true immediately (or,
better still, never send the wm_close / vacate signal and kill the job
immediately).
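
For illustration, here is a rough sketch of how this might look with
existing configuration. It assumes a Condor version whose startd
supports the WANT_VACATE policy expression and lets policy expressions
read job attributes through TARGET. JobWantVacate is the hypothetical
attribute proposed above, set as a custom job attribute with the "+"
submit syntax. The 20 minute term only stands in for an existing KILL
expression, and ActivityTimer and MINUTE are the helper macros from the
sample configuration, so check that they are defined locally.

```
# Submit file: custom job attribute (hypothetical name from this post)
+JobWantVacate = False

# Execute machine config (sketch, untested)
# Only send the vacate signal to jobs that have not opted out.
# =!= treats an undefined JobWantVacate as "not False", so jobs that
# do not set it still get the normal vacate.
WANT_VACATE = (TARGET.JobWantVacate =!= False)

# Fallback where WANT_VACATE is unavailable: make KILL true at once
# for opted-out jobs, otherwise allow the 20 minute grace period.
KILL = (TARGET.JobWantVacate =?= False) || ($(ActivityTimer) > 20 * $(MINUTE))
```

Jobs that set nothing would behave as before, so only the
vacate-unfriendly submit files would need to change.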

If you fancied getting fancy, a more tunable system would be to make
JobWantVacate an expression determined at runtime on the execute
machine but specified in the submit file, thus allowing the job to say
things like "If I haven't run for at least X mins there is no point
vacating, just kill me asap and get me onto another machine". This
would allow preemption to avoid unnecessary overhead.

Matt