
Re: [Condor-users] how to ask an execute machine "stop after this job" ?

On 5/15/07, Nicolas GUIOT <nicolas.guiot@xxxxxxx> wrote:

I need to do some maintenance on some execute machines.
How can I tell them "Finish the job you are running right now, but don't start any new one (until I authorize it again...)"?

Rather than repeat what the others have said, I will suggest an
alternate plan that may be of use to some people in SMP situations.
This requires that whoever manages your pool can reconfigure it
remotely (for example, by changing the config file and issuing a
condor_reconfig).

Suppose you have SMP machines running jobs with very different time
scales (think days versus less than an hour), and you want the
following policy: no more long jobs start, but faster (or just plain
risky) jobs still may, provided you don't mind them being killed when
the long job finishes.

Add the following attribute to the submit file of any job that is
happy to run on a machine in such a state:
+JobAcceptsRiskOfTermination = true

The normal START expression can then be changed to something like:

NORMAL_START = <whatever your START expression normally is>
IsTerminating = False
STARTD_ATTRS = $(STARTD_ATTRS) IsTerminating
START = ( $(NORMAL_START) ) && ( $(IsTerminating) =?= False || TARGET.JobAcceptsRiskOfTermination =?= True )

(Adding IsTerminating to STARTD_ATTRS publishes it in the machine
ClassAd, so others can see it and work out why jobs aren't running
there.)
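To make the intended policy concrete (normal operation admits every
job that passes NORMAL_START; the terminating state admits only jobs
that set JobAcceptsRiskOfTermination), here is a small Python model of
the START expression. It is an illustration only, not Condor's ClassAd
evaluator: None stands in for UNDEFINED (attribute missing from the
job ad), and meta_equals() mimics the =?= operator, which never yields
UNDEFINED.

```python
# Illustrative model of the START policy; NOT Condor's real evaluator.
# Python's None stands in for ClassAd UNDEFINED (attribute not in the job ad).

def meta_equals(a, b):
    """Mimic ClassAd =?= : True only if both are defined and equal, or both undefined."""
    if a is None or b is None:
        return a is None and b is None
    return a == b

def start(normal_start, is_terminating, job_accepts_risk):
    """Return whether the startd would accept the job under the policy.

    normal_start     -- result of $(NORMAL_START), assumed to be True/False here
    is_terminating   -- value of the IsTerminating config macro
    job_accepts_risk -- the job's JobAcceptsRiskOfTermination, or None if unset
    """
    return normal_start and (
        meta_equals(is_terminating, False)
        or meta_equals(job_accepts_risk, True)
    )

if __name__ == "__main__":
    # Print the truth table for a machine whose NORMAL_START is satisfied.
    for terminating in (False, True):
        for risk in (None, True):
            print(f"IsTerminating={terminating!s:5}  JobAcceptsRisk={risk!s:5}  "
                  f"->  start={start(True, terminating, risk)}")
```

Only the row with IsTerminating=True and an unset attribute is
refused, which is the "no more long jobs" behaviour.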

Then simply setting IsTerminating = True (and reconfiguring) triggers
the change; setting it back to False restores normal behaviour.
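One way to flip the flag is a small script that rewrites a file
holding only the IsTerminating line and then runs condor_reconfig.
This is a sketch under assumptions: the path below is hypothetical,
and the file is assumed to be read after the main configuration (for
example, listed last in LOCAL_CONFIG_FILE) so its definition wins.

```python
import subprocess
from pathlib import Path

# Hypothetical path: a file containing only the IsTerminating override,
# assumed to be read after the rest of the Condor configuration.
TOGGLE_FILE = Path("/etc/condor/condor_config.terminating")

def set_terminating(flag: bool, path: Path = TOGGLE_FILE, reconfig: bool = True) -> str:
    """Write 'IsTerminating = True/False' to path, optionally run condor_reconfig.

    Returns the text written so callers can log it.
    """
    text = f"IsTerminating = {'True' if flag else 'False'}\n"
    path.write_text(text)
    if reconfig:
        # With no arguments, condor_reconfig reconfigures the local daemons.
        subprocess.run(["condor_reconfig"], check=True)
    return text
```

For example, set_terminating(True) starts the termination phase on the
local machine, and set_terminating(False) ends it once maintenance is
done.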

I also suggest recording when the termination phase began. If your
machine's configuration language offers something like systime(), you
can get the relevant integer and place it in the ClassAd. An external
tool can then monitor the running jobs, wait until none are running
that started before that time, and trigger a restart.
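The check such an external tool needs is simple: the machine is ready
once every job still running started at or after the moment the
termination phase began (those later jobs are the risk-accepting ones,
which may be killed). Below is a minimal sketch; how the job start
times are collected (for example, by parsing condor_status output) is
left as an assumption, so the collector is a hypothetical callable
passed in by the caller.

```python
import time

def ready_for_maintenance(termination_start, running_job_starts):
    """True once no running job started before the termination phase began.

    termination_start  -- Unix time when IsTerminating was switched on
    running_job_starts -- Unix start times of jobs currently running on the machine
    """
    return all(start >= termination_start for start in running_job_starts)

def wait_for_drain(termination_start, get_running_job_starts, poll_seconds=300):
    """Block until ready_for_maintenance() holds.

    get_running_job_starts is a hypothetical, caller-supplied function
    returning the current list of job start times.
    """
    while not ready_for_maintenance(termination_start, get_running_job_starts()):
        time.sleep(poll_seconds)
```

When wait_for_drain() returns, only risk-accepting jobs remain, so the
machine can be restarted.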

This has more management overhead than a peaceful shutdown
(condor_off -peaceful), but it allows much better throughput on SMP
systems matching the criteria above.