
Re: [HTCondor-users] Dead scheduler node, how to safely revive?



On 08/14/2018 05:01 AM, Steffen Grunewald wrote:
> Good morning,
>
> two weeks ago, while I was on vacation, one of our scheduler nodes died
> horribly - but can probably be repaired.
> I presume that all jobs that had been submitted are still known to the
> schedd, and therefore would likely be restarted as soon as the machine
> comes up again - but users may in the meantime have submitted identical
> copies from another scheduler node, and the old copies would overwrite
> their output data once they start running.
> Is there a simple way to prevent this from happening?
> (To learn which jobs were still in the queue would require firing up the
> schedd, which would start a fresh negotiation for all of them. Catch 22?)

You could set MAX_JOBS_RUNNING = 0 on the schedd node before restarting it; the schedd will then recover its job queue but will not start any jobs. You can then condor_q and condor_rm them at will.
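For example (a sketch, not a verbatim recipe; exact config file locations depend on your installation):

  # In the schedd's local configuration, before bringing HTCondor back up:
  MAX_JOBS_RUNNING = 0

  # Once the schedd is up, inspect and clean the recovered queue:
  condor_q
  condor_rm <cluster>.<proc>     # per job, or condor_rm -all to drop everything

Once the queue looks sane, take the MAX_JOBS_RUNNING knob back out (or raise it) and condor_reconfig so the jobs you kept can run again.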

If you know you want to remove all the jobs, they are stored in the job_queue.log.* files in the SPOOL directory; removing those files is an extreme way to remove all trace of those jobs from the schedd.
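If you do go that extreme route, make sure the schedd is down before deleting anything. A minimal sketch, assuming a systemd-managed install and the common default spool path (verify with condor_config_val SPOOL on your node):

  condor_config_val SPOOL                  # confirm the actual spool directory
  systemctl stop condor                    # schedd must not be running
  rm /var/lib/condor/spool/job_queue.log*  # path is an assumption, use the value above
  systemctl start condor                   # schedd comes back with an empty queue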

-greg