Re: [HTCondor-users] Retry failed nodes in a running DAG




On 5/6/21 10:26 AM, Vaurynovich, Siarhei wrote:

Hi Christoph,

Thank you for your reply!

The failed jobs are not queued anymore; they have crashed (in this case, due to insufficient disk space for their output). If the jobs were still running, I could have held and then released them to solve the problem. The question is whether I can tell HTCondor to run just those failed jobs again, given that they have crashed and are no longer running.


When you say the jobs "crashed", the important issue for dagman is whether the job exited with a zero or non-zero exit code. If a dag node job exits with a non-zero exit code, dagman considers the node to have failed. It will not run any nodes that depend on a failed node, but it will continue to run independent nodes until it cannot make any more progress. After you fix whatever caused the failure, dagman can be re-run on the same DAG, and it will run only the failed nodes and their dependents.

If, however, the job exits with a zero exit code (and the node has no POST script), dagman assumes the job has succeeded and continues on to run the dependent jobs.
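
To make the rule above concrete, here is a rough Python sketch of the behaviour. It is only an illustration, not how dagman is actually implemented: the function name run_dag, the node names A to D, and the exit codes are all made up for the example, and POST scripts are ignored.

```python
def run_dag(parents, exit_codes, already_done=frozenset()):
    """Simulate one pass over a DAG using the exit-code rule.

    parents:      dict mapping each node to the list of nodes it depends on
    exit_codes:   dict mapping each node to the exit code its job would return
    already_done: nodes that succeeded in an earlier pass (not run again)
    """
    succeeded = set(already_done)
    failed = set()
    ran = set()
    progress = True
    while progress:  # keep going until no further node can be started
        progress = False
        for node in parents:
            if node in succeeded or node in failed:
                continue
            # A node is runnable only when every parent has succeeded.
            if all(p in succeeded for p in parents[node]):
                ran.add(node)
                if exit_codes[node] == 0:
                    succeeded.add(node)  # zero exit code: node succeeded
                else:
                    failed.add(node)     # non-zero exit code: node failed
                progress = True
    not_run = set(parents) - succeeded - failed
    return {"succeeded": succeeded, "failed": failed,
            "not_run": not_run, "ran": ran}


# Hypothetical DAG: B and D depend on A, and C depends on B.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"]}

# First pass: B exits non-zero (say, it ran out of disk space).
first = run_dag(parents, {"A": 0, "B": 1, "C": 0, "D": 0})
print(sorted(first["failed"]))   # ['B']
print(sorted(first["not_run"]))  # ['C']

# After fixing the problem, run again, skipping what already succeeded.
second = run_dag(parents, {"A": 0, "B": 0, "C": 0, "D": 0},
                 already_done=first["succeeded"])
print(sorted(second["ran"]))     # ['B', 'C']
```

In the first pass the independent node D still runs even though B failed, while C is never started. The second pass runs only B and its dependent C, which is the re-run behaviour described above.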

-greg