
Re: [Condor-users] Ways to limit schedd from accepting and/or starting jobs:



On Fri, May 12, 2006 at 09:10:08AM -0500, Steven Timm wrote:
> 
> I've asked on this list before how to stop jobs from running at
> the schedd level, as opposed to doing condor_off -peaceful
> or turning START = FALSE on every single worker node.
> 
> There are two scenarios:
> 
> In one, the schedd still accepts submissions, keeps track
> of all running jobs, but doesn't try to negotiate any new ones.
> It appears this can be done by setting the MAX_JOBS_RUNNING
> macro to zero and doing condor_reconfig, although I have not
> tested that yet.
> 

That won't work - in the normal case, when Condor is not being shut
down, the schedd will kill running jobs in order to get itself under
the MAX_JOBS_RUNNING threshold.

You could get this behavior now by setting the max jobs per claim options
in later 6.7s, and shutting down your negotiator (or using HOSTDENY
for the specific schedd at your negotiator). That way, running jobs
will continue to run, but new jobs won't be able to get more resources.

> The other would be:
> The schedd doesn't take any new submissions, but supervises
> the draining of its existing queue, getting jobs run until there
> are no more left in the queue to run.
> 
> I haven't seen anything in the manual that might accomplish this.
> Has anyone figured out how to do it?  If not, can we request this feature?
> 

I _think_ you can do this by setting MAX_JOBS_SUBMITTED = 0. That should
stop the schedd from creating any new jobs, so any submit attempts would 
fail. 
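
A minimal sketch of that change, untested. It assumes the setting goes in
the submit machine's local config file and that the schedd picks it up on
a reconfig rather than needing a restart:

    # local config file on the submit node (location is site-specific)
    # 0 = refuse all new submissions; jobs already queued are unaffected
    MAX_JOBS_SUBMITTED = 0

    # then, on the submit node:
    #   condor_reconfig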

This is probably a bad idea, because users will get an error message that
their submit failed, and it would probably cause havoc with any DAGMan
jobs in the queue. (I don't know how often DAGMan will retry submits
that failed, but I know it's at least semi-robust against this failure,
if not completely robust.)

My guess is that most of the reasons people want features like this are
handled by disconnected operations, so you can reboot submit nodes and not
lose all of the running jobs. Right now the one thing that sucks is the
schedd can't shut itself down without killing all of the running jobs,
even if they could be reconnected to. We're fixing that, but for now, if
you want that, you have to use 'kill -9' or condor_off -schedd -fast
so the schedd doesn't get a chance to shut down "cleanly". When it comes
back up, it will reconnect to the running jobs.
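
Roughly, the sequence on the submit node would look like the sketch
below. The condor_on line is an assumption on my part, mirroring the
condor_off form; restart the schedd however you normally do:

    # stop the schedd immediately, without letting it clean up its jobs
    condor_off -schedd -fast

    # ... reboot or do the maintenance ...

    # bring the schedd back; it should reconnect to the running jobs
    condor_on -schedd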

-Erik