[HTCondor-users] Restarting DAGman nodes



While DAGMan is running, if some nodes have failed and I have since fixed
the problem that caused the failures, is it possible to signal that running
instance of DAGMan to retry the failed nodes now?

Otherwise I have to wait for DAGMan to finish all the jobs it can still run,
write out the rescue DAG and terminate, at which point I can run
condor_submit_dag again.

This does work, of course, but (a) some jobs which could be started
immediately aren't; and (b) if DAGMan completes at, say, 2am then I won't
restart it until the morning.  Both of these mean that the overall time to
finish processing is longer than it could be.
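
To make (b) concrete: the manual step could in principle be scripted. Below
is a minimal sketch in Python, not a tested tool. It assumes the default
rescue naming (<dagfile>.rescue001, .rescue002, ...) and that rerunning
condor_submit_dag on the original DAG file picks up the newest rescue DAG.
The function names, the polling interval and the settle delay are all
hypothetical choices of mine.

```python
import re
import subprocess
import time
from pathlib import Path

# Assumption: rescue DAGs use the default naming <dagfile>.rescueNNN.
RESCUE_RE = re.compile(r"\.rescue(\d{3})$")


def newest_rescue_number(dag_file):
    """Return the highest NNN among <dag_file>.rescueNNN files, or 0 if none."""
    dag = Path(dag_file)
    numbers = []
    for candidate in dag.parent.glob(dag.name + ".rescue*"):
        match = RESCUE_RE.search(candidate.name)
        if match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=0)


def wait_and_resubmit(dag_file, poll_seconds=60, settle_seconds=120):
    """Block until a new rescue DAG appears, then resubmit the DAG."""
    seen = newest_rescue_number(dag_file)
    while newest_rescue_number(dag_file) <= seen:
        time.sleep(poll_seconds)
    # Crude guess at how long DAGMan needs to exit after writing the rescue
    # DAG; a real script should confirm the DAGMan job has left the queue.
    time.sleep(settle_seconds)
    subprocess.run(["condor_submit_dag", str(dag_file)], check=True)


if __name__ == "__main__":
    wait_and_resubmit("my.dag")  # "my.dag" is a placeholder file name
```

Even if something like this works, it only helps with (b). It does nothing
for (a), because the failed nodes still cannot be retried until DAGMan has
exited.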

Regards,

Brian.