Re: [Condor-users] Ways to limit schedd from accepting and/or starting jobs:
- Date: Fri, 12 May 2006 15:31:19 -0500 (CDT)
- From: Steven Timm <timm@xxxxxxxx>
- Subject: Re: [Condor-users] Ways to limit schedd from accepting and/or starting jobs:
On Fri, 12 May 2006, Erik Paulson wrote:
On Fri, May 12, 2006 at 09:10:08AM -0500, Steven Timm wrote:
I've asked on this list before how to stop jobs from running at
the schedd level, as opposed to doing condor_off -peaceful
or turning START = FALSE on every single worker node.
There are two scenarios:
In one, the schedd still accepts submissions, keeps track
of all running jobs, but doesn't try to negotiate any new ones.
It appears this can be done by setting the MAX_JOBS_RUNNING
macro to zero and doing condor_reconfig, although I have not
tested that yet.
That won't work - in the normal case, when Condor is not being shut
down, the schedd will kill running jobs in order to get itself under
the MAX_JOBS_RUNNING threshold.
Interesting. Other members of the Condor team told me at Condor
Week that it would work.
Really what I would like to do is to do condor_off -all -peaceful
and let all the running jobs finish up.
The use case here is draining the queue nicely before a major system
upgrade, which we are doing next week.
You could get this behavior now by setting the max-jobs-per-claim options
in later 6.7 releases, and shutting down your negotiator (or, using HOSTDENY
for the specific schedd at your negotiator) - that way, running jobs
will continue to run, but new jobs won't be able to get more resources.
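A sketch of that drain recipe, with hedged names: I believe the "max jobs per claim" knob added in the 6.7 series is CLAIM_WORKLIFE, but check your manual before relying on it.

```
## Execute-node (startd) local config -- sketch, assuming the
## 6.7-era CLAIM_WORKLIFE macro is the "max jobs per claim" option
## Erik mentions. 0 means a claim runs at most one job before
## being released, so claims cannot be reused for new jobs.
CLAIM_WORKLIFE = 0
```

Then stop matchmaking on the central manager with `condor_off -negotiator` (or use the HOSTDENY mechanism mentioned above to shut out just the one schedd); running jobs finish, and nothing new is matched.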
The other would be:
The schedd doesn't take any new submissions, but supervises
the draining of its existing queue, getting jobs run until there
are no more left in the queue to run.
I haven't seen anything in the manual that might accomplish this.
Has anyone figured out how to do it? If not, can we request this feature?
I _think_ you can do this by setting MAX_JOBS_SUBMITTED = 0. That should
stop the schedd from creating any new jobs, so any submit attempts would fail.
This is probably a bad idea, because users will get an error message that
their submit failed, and it would probably cause havoc with any DAGMan
jobs in the queue (I don't know how often DAGMan will retry submits
that failed, but I know it's at least semi-robust against this failure,
if not completely robust).
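A hedged sketch of that setting, for the schedd's local config file. Per the "I _think_" and the caveat above, this is believed but not confirmed to reject new submissions:

```
## Submit-node (schedd) local config -- sketch; apply with
## condor_reconfig. Expected effect per the discussion above:
## new condor_submit attempts fail with an error, while jobs
## already in the queue continue to run and negotiate.
MAX_JOBS_SUBMITTED = 0
```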
My guess is that most of the reasons people want features like this are
handled by disconnected operations, so you can reboot submit nodes and not
lose all of the running jobs. Right now the one thing that sucks is the
schedd can't shut itself down without killing all of the running jobs,
even if they could be reconnected to. We're fixing that, but for now if
you want that you have to use a 'kill -9' or condor_off -schedd -fast
to not give the schedd a chance to shut down "cleanly". When it comes
back up, it will reconnect to the running jobs.
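A sketch of that fast-shutdown sequence, assuming a 6.7-era install where `condor_off` takes the `-schedd` and `-fast` flags; using `condor_on -schedd` to restart is my assumption, not something stated in the thread:

```
## Stop only the schedd, skipping the "clean" shutdown that
## would kill the running jobs:
condor_off -schedd -fast

## ... reboot / upgrade the submit node ...

## Restart the schedd; per the discussion above it should
## reconnect to the jobs that kept running (assumed syntax):
condor_on -schedd
```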
Steven C. Timm, Ph.D (630) 840-8525 timm@xxxxxxxx http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team