
Re: [Condor-users] pool drainoff






Steven Timm wrote:
How can I put a single node in a Condor pool into a
'drainoff' state, that is, let any jobs currently running on
the node finish, but not accept new jobs?
It should be:

	condor_off -peaceful

In theory that will shut down the machines once all the running jobs
leave. In practice I find that if one job takes an incredibly long time
to run, new jobs keep getting assigned to the machine and a peaceful
point to shut down is never reached. That's with 6.8.6 (yeah, Condor
guys, I know: why don't I tell you about these things? Sometimes it just
slips my mind... :) ).


In practice I've found two gotchas with this approach:
(1) You have to execute condor_off -peaceful individually
for each startd in the pool. If you just do a global
condor_off -peaceful, it will kill the schedds and negotiators
well before the startds go off, and you won't get the
desired result (the jobs will all finish, but Condor
will never know about it). They need a feature added to
automatically do the startds first and then the schedds and
collector/negotiators.

(2) If you execute condor_off -peaceful for a lot of nodes
in rapid succession, it will send the collector into a dance of death
from which it can take hours to extract itself, and condor_status
will time out in the meantime. Supposedly that will
be fixed in Condor 7.0.2.
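
A minimal sketch of how the two gotchas above might be worked around: target
only the startd on each node, and pause between nodes instead of firing the
requests off back to back. The function name drain_nodes, the DRY_RUN and
PAUSE variables, and the 30-second default are hypothetical. The -startd
flag is assumed by analogy with the -schedd flag mentioned later in this
thread, and -name is assumed to select the target machine; check
condor_off's usage output on your version. Whether a pause is enough to keep
the collector healthy is also an assumption, not something confirmed here.

```shell
#!/bin/sh
# Peacefully drain the startd on each named node, one node at a time.
# DRY_RUN=1 prints the commands instead of running them.
# PAUSE is the number of seconds to wait between nodes (hypothetical default).

drain_nodes() {
    pause="${PAUSE:-30}"
    for node in "$@"; do
        # Only the startd is targeted, so schedds, the collector and the
        # negotiator stay up and can still record job completions.
        cmd="condor_off -peaceful -startd -name $node"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$cmd"
        else
            $cmd
            # Space the requests out rather than sending them in rapid succession.
            sleep "$pause"
        fi
    done
}
```

To preview what would be sent, set DRY_RUN=1 and call
drain_nodes node1 node2; unset it to actually issue the commands.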

The other two features I've wanted for a long time are (1) an instruction
to tell a schedd to start all its existing jobs but not
accept any more new ones, and (2) an instruction to let existing
jobs on a schedd complete but not start any more new ones. (Yes,
I know the latter could be accomplished with condor_hold -constraint ...)
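
For illustration, one way the condor_hold approach to (2) might look. The
constraint shown is an assumption about what was meant: it holds every job
that is still idle (JobStatus == 1 in the job ClassAd), so nothing new
starts while running jobs carry on. The function name hold_idle_jobs and
the DRY_RUN variable are hypothetical.

```shell
#!/bin/sh
# Put all idle jobs in this schedd's queue on hold so that none of them
# start; jobs that are already running are not matched by the constraint.
# DRY_RUN=1 prints the command instead of running it.

hold_idle_jobs() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "condor_hold -constraint 'JobStatus == 1'"
    else
        # JobStatus 1 is Idle; held jobs show up as JobStatus 5.
        condor_hold -constraint 'JobStatus == 1'
    fi
}
```

The held jobs can be released later with condor_release, bearing in mind
that a broad release would also free jobs that were on hold for unrelated
reasons.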

In Condor 7.1.1, condor_off -peaceful -schedd will cause the schedd to stop starting new jobs and shut down after all currently running jobs finish. I believe the answer to your request (1) above is to set MAX_JOBS_SUBMITTED=0. Or at least so says my new (not yet publicly announced) How-to:

http://nmi.cs.wisc.edu/node/1466
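
As a rough sketch, the setting would presumably go in the Condor
configuration on the submit machine. Which file to edit depends on the
installation, and the use of condor_reconfig to make the schedd reread its
configuration is an assumption rather than something stated in this thread.

```
# Condor configuration on the submit machine (file location varies by site).
# With the limit at zero the schedd should refuse new submissions, while
# jobs already in the queue remain eligible to run.
MAX_JOBS_SUBMITTED = 0

# Afterwards, have the schedd reread its configuration, for example:
#   condor_reconfig
```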

--Dan