Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] Restarting DAGman nodes

Date: Wed, 15 May 2013 08:23:04 -0500
From: Nathan Panike <nwp@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Restarting DAGman nodes

By default, a node that is retried goes to the back of DAGMan's queue of
ready nodes, and DAGMan resubmits it when it reaches that point.

To get retries to go first, set the configuration variable

DAGMAN_RETRY_NODE_FIRST = True

That is really the only option at this point.
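
As a rough sketch of how this could be set for a single DAG (the file
names, node names, and retry counts below are placeholders, not anything
from your setup), the variable can go in a per-DAG configuration file that
the DAG file pulls in with a CONFIG line. As far as I know it only affects
nodes that have a RETRY line and still have retries left:

```
# my.dag.config (hypothetical file name)
DAGMAN_RETRY_NODE_FIRST = True

# my.dag (hypothetical DAG; node and submit file names are placeholders)
CONFIG my.dag.config
JOB A a.sub
JOB B b.sub
PARENT A CHILD B
RETRY A 3
RETRY B 3
```

With this in place, a retry of A or B should be placed at the front of the
ready queue instead of the back.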

Nathan Panike

On Tue, May 14, 2013 at 11:11:30PM +0100, Brian Candler wrote:
> When DAGman is running, and some nodes have failed but I have since fixed
> the problem which caused them to fail, and DAGman is still running, is it
> possible to signal to that instance of DAGman to retry failed jobs now?
> 
> Otherwise I have to wait for DAGman to drain out all the jobs it is able,
> write out the rescue dag and terminate - at which point I can run
> condor_submit_dag again.
> 
> This does work of course, but (a) some jobs which could be started
> immediately aren't; and (b) if DAGman completes at say 2am then I won't
> restart it until the morning.  Both of these mean that the overall time to
> finish processing is longer than it could be.
> 
> Regards,
> 
> Brian.
