
Re: [HTCondor-users] Dead scheduler node, how to safely revive?



Hi Greg,

On Tue, 2018-08-14 at 10:22:46 -0500, Greg Thain wrote:
> On 08/14/2018 05:01 AM, Steffen Grunewald wrote:
> > Good morning,
> > 
> > two weeks ago, while I was on vacation, one of our scheduler nodes died
> > horribly - but can probably be repaired.
> > I presume that all jobs that had been submitted are still known to the
> > schedd, and therefore would likely be restarted as soon as the machine
> > comes up again - but users may in the meantime have submitted identical
> > copies from another scheduler node, and the old copies would overwrite
> > their output data once they start running.
> > Is there a simple way to prevent this from happening?
> > (To learn which jobs were still in the queue would require firing up the
> > schedd, which would start a fresh negotiation for all of them. Catch 22?)
> 
> You could set MAX_JOBS_RUNNING = 0 on the schedd node before restarting, and
> the schedd will not start any jobs.  You can then condor_q and condor_rm
> them at will.

Thanks a lot, that setting did the trick! (If it didn't have adverse side
effects - which it does - this would be a nice default for a freshly booted
schedd machine...)
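
For the archives, the sequence looked roughly like this (assuming a stock
Linux package install that reads /etc/condor/condor_config.local; adjust
paths and service name for your setup):

    # Before starting HTCondor on the revived node, forbid job starts:
    echo "MAX_JOBS_RUNNING = 0" >> /etc/condor/condor_config.local
    systemctl start condor

    # Inspect the resurrected queue and remove the stale jobs:
    condor_q
    condor_rm <cluster>      # or condor_rm -all to drop everything

    # Afterwards, delete the override from the config again and
    # pick up the change:
    condor_reconfig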

> If you know you want to remove all the jobs, they are stored in the
> job_queue.log.* files in the SPOOL directory; removing those files is an
> extreme way to remove all trace of those jobs from the schedd.

Dragons ahead... I'm happy that I didn't have to take that path.
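
(For anyone who does end up on that path: with the daemons stopped, I'd
expect it to look something like the sketch below - untested here, so
treat it as a pointer, not a recipe; SPOOL is whatever condor_config_val
reports on your node:)

    # HTCondor must be stopped while touching the spool:
    systemctl stop condor
    SPOOL=$(condor_config_val SPOOL)

    # Keep a backup - these files are the only record of the queue:
    tar czf /root/job_queue_backup.tar.gz "$SPOOL"/job_queue.log*

    # Remove the job queue transaction log and its rotations:
    rm -f "$SPOOL"/job_queue.log*
    systemctl start condor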

Thanks again,
 Steffen