Re: [HTCondor-users] Restarting DAGman nodes
- Date: Wed, 15 May 2013 08:23:04 -0500
- From: Nathan Panike <nwp@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Restarting DAGman nodes
By default, a node that is being retried goes to the back of DAGMan's
queue of ready nodes, and DAGMan resubmits it when it reaches that point.
To get retries to go first, set the configuration variable
DAGMAN_RETRY_NODE_FIRST = True
That is really the only option at this point.
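For illustration, here is a minimal sketch of how that might look. It
assumes a per-DAG configuration file called dagman.config, a DAG file
called my.dag, and a node A with submit file a.sub; all of these names
are hypothetical. Note that the setting only matters for nodes that have
a RETRY line, since that is what makes DAGMan retry a failed node within
the same run.

```
# dagman.config (hypothetical file name)
DAGMAN_RETRY_NODE_FIRST = True

# my.dag (hypothetical file, node, and submit file names)
CONFIG dagman.config
JOB A a.sub
RETRY A 3
```

The variable could also go in the regular HTCondor configuration instead
of a per-DAG file.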
On Tue, May 14, 2013 at 11:11:30PM +0100, Brian Candler wrote:
> When DAGman is running, and some nodes have failed but I have since fixed
> the problem which caused them to fail, and DAGman is still running, is it
> possible to signal to that instance of DAGman to retry failed jobs now?
> Otherwise I have to wait for DAGman to run all the jobs it is able to,
> write out the rescue dag and terminate - at which point I can run
> condor_submit_dag again.
> This does work of course, but (a) some jobs which could be started
> immediately aren't; and (b) if DAGman completes at say 2am then I won't
> restart it until the morning. Both of these mean that the overall time to
> finish processing is longer than it could be.