
Re: [HTCondor-users] Dead scheduler node, how to safely revive?



Hi Greg,

On Tue, 2018-08-14 at 10:22:46 -0500, Greg Thain wrote:
> On 08/14/2018 05:01 AM, Steffen Grunewald wrote:
> > Good morning,
> > 
> > two weeks ago, while I was on vacation, one of our scheduler nodes died
> > horribly - but can probably be repaired.
> > I presume that all jobs that had been submitted are still known to the
> > schedd, and therefore would likely be restarted as soon as the machine
> > comes up again - but users may in the meantime have submitted identical
> > copies from another scheduler node, and the old copies would overwrite
> > their output data once they start running.
> > Is there a simple way to prevent this from happening?
> > (To learn which jobs were still in the queue would require firing up the
> > schedd, which would start a fresh negotiation for all of them. Catch 22?)
> 
> You could set MAX_JOBS_RUNNING = 0 on the schedd node before restarting, and
> the schedd will not start any jobs.  You can then condor_q and condor_rm
> them at will.

Thanks a lot, that setting did the trick! (If it didn't have adverse side
effects - which it does - this would be a nice default for a freshly booted
schedd machine...)
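
For the archives, the sequence looked roughly like this (assuming a stock
Linux package install that reads /etc/condor/condor_config.local; adjust
paths and service name for your setup):

    # Before starting HTCondor on the revived node, forbid job starts:
    echo "MAX_JOBS_RUNNING = 0" >> /etc/condor/condor_config.local
    systemctl start condor

    # Inspect the resurrected queue and remove the stale jobs:
    condor_q
    condor_rm <cluster>      # or condor_rm -all to drop everything

    # Afterwards, delete the override from the config again and
    # pick up the change:
    condor_reconfig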

> If you know you want to remove all the jobs, they are stored in the
> job_queue.log.* files in the SPOOL directory; removing those files is an
> extreme way to remove all trace of those jobs from the schedd.

Dragons ahead... I'm happy that I didn't have to take that path.
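
(For anyone who does end up on that path: with the daemons stopped, I'd
expect it to look something like the sketch below - untested here, so
treat it as a pointer, not a recipe; SPOOL is whatever condor_config_val
reports on your node:)

    # HTCondor must be stopped while touching the spool:
    systemctl stop condor
    SPOOL=$(condor_config_val SPOOL)

    # Keep a backup - these files are the only record of the queue:
    tar czf /root/job_queue_backup.tar.gz "$SPOOL"/job_queue.log*

    # Remove the job queue transaction log and its rotations:
    rm -f "$SPOOL"/job_queue.log*
    systemctl start condor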

Thanks again,
 Steffen