Re: [HTCondor-users] some jobs die at daemon restart, some don't
- Date: Sat, 06 Feb 2016 21:28:09 -0600
- From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] some jobs die at daemon restart, some don't
Unfortunately, when the startd is restarted, all vanilla universe jobs are lost.
Typically disruptive restarts can be avoided by:
1) condor_reconfig to pick up new settings. Almost all configuration changes can be ingested without restart.
2) Draining the node of jobs (condor_off -peaceful), then restarting it.
- This can be done centrally in order to do a rolling restart of the cluster.
3) I haven't tried it myself, but perhaps the Docker universe would reconnect? Greg would know...
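
For the drain-then-restart approach in (2), here is a minimal sketch of how a rolling restart might be planned from a central machine. It only builds the command lines and does not run them. The function name and hostnames are hypothetical, and the exact options (in particular -name on condor_off and condor_restart) should be checked against the man pages for your HTCondor version. Waiting for each node to actually finish its jobs between the two steps is not shown.

```python
from typing import Iterable, List


def rolling_restart_plan(hosts: Iterable[str]) -> List[List[str]]:
    """Return the commands, in order, for a one-node-at-a-time restart.

    Hypothetical helper: it only builds argv lists, it does not execute them.
    """
    plan: List[List[str]] = []
    for host in hosts:
        # Step 1: peaceful shutdown, so running jobs are allowed to finish.
        plan.append(["condor_off", "-peaceful", "-name", host])
        # Step 2: once the node is empty, restart HTCondor on it.
        # (Assumption: you confirm the node is drained before running this.)
        plan.append(["condor_restart", "-name", host])
    return plan


if __name__ == "__main__":
    # Example hostnames are placeholders.
    for argv in rolling_restart_plan(["wn01.example.org", "wn02.example.org"]):
        print(" ".join(argv))
```

Each inner list could be handed to subprocess.run() once you have verified the node is drained; keeping the plan separate from execution makes it easy to review before touching the cluster.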
Restarts of the schedd and collector/negotiator shouldn't affect running jobs.
> On Feb 4, 2016, at 7:29 AM, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
> When restarting the condor daemon on a worker node, most of the time the jobs on that node survive. Sometimes, though, a job dies; I presume that is the case when the job is actually writing to the shadow (?)
> Is there a timeout or something similar that I can increase to keep all jobs happy during a daemon restart?
> /* Christoph Beyer | Office: Building 2b / 23 *\
> * DESY | Phone: 040-8998-2317 *
> * - IT - | Fax: 040-8994-2317 *
> \* 22603 Hamburg | http://www.desy.de */