Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] Retry failed nodes in a running DAG

Date: Thu, 06 May 2021 10:34:01 -0500
From: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Retry failed nodes in a running DAG

On 5/6/21 10:26 AM, Vaurynovich, Siarhei wrote:

Hi Christoph,

Thank you for your reply!

The failed jobs are not queued anymore; they have crashed (in this case, due to insufficient disk space for their output). If the jobs were still running, I could have held and then released them to solve the problem. The question is whether I can tell HTCondor to run just those failed jobs again, given that they have crashed and are no longer running.

When you say the jobs "crashed", the important issue to dagman is whether the job exited with a zero or non-zero exit code. If a dag node job exits with a non-zero exit code, dagman considers the node to have failed. It will not run any nodes that depend on a failed node, but it will continue to run independent nodes until it cannot make more progress. After fixing what failed, dagman can be re-run and it will just run the failed nodes and their dependents.
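
The behaviour described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not DAGMan's actual implementation: the function `run_dag`, the node names, and the exit-code table are all made up for the example, and a real re-run works from state that DAGMan itself records rather than from a set passed in by hand.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(parents, exit_codes, already_done=frozenset()):
    """Toy model of one DAG run (hypothetical helper, not an HTCondor API).

    parents:      dict mapping each node to the set of nodes it depends on
    exit_codes:   dict mapping a node to the exit code its job would return
                  (nodes missing from the dict are assumed to exit with 0)
    already_done: nodes that succeeded in an earlier run and are skipped now
    Returns three sets: (done, failed, not_run).
    """
    done = set(already_done)
    failed = set()
    not_run = set()
    # Visit nodes so that every parent is handled before its children.
    for node in TopologicalSorter(parents).static_order():
        if node in done:
            continue  # succeeded in a previous run, nothing to do
        if any(p not in done for p in parents.get(node, ())):
            not_run.add(node)  # a parent failed or was itself never run
            continue
        if exit_codes.get(node, 0) == 0:
            done.add(node)
        else:
            failed.add(node)  # non-zero exit code: the node has failed
    return done, failed, not_run


# A -> B -> C is a chain; D is independent of the chain.
parents = {"A": set(), "B": {"A"}, "C": {"B"}, "D": set()}

# First run: B exits non-zero (for example, it ran out of disk space).
done, failed, not_run = run_dag(parents, {"B": 1})
print(sorted(done), sorted(failed), sorted(not_run))
# prints: ['A', 'D'] ['B'] ['C']

# After fixing the problem, run again: only B and its dependent C are run.
done2, failed2, not_run2 = run_dag(parents, {}, already_done=done)
print(sorted(done2 - done))
# prints: ['B', 'C']
```

The independent node D completes in the first pass even though B fails, and the second pass touches only the failed node and its dependent.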

If, however, the job exits with a zero exit code (in the absence of a postscript), dagman assumes the job has succeeded and continues running dependent jobs.
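
The success rule can be written down as a small Python sketch. The function name `node_succeeded` is hypothetical, and the POST-script branch is an assumption based on DAGMan's documented default, where a POST script's exit code decides the node's outcome when one is defined; the message above only covers the case without a postscript.

```python
from typing import Optional


def node_succeeded(job_exit_code: int,
                   post_exit_code: Optional[int] = None) -> bool:
    """Toy version of the node success rule (not an HTCondor API).

    job_exit_code:  exit code of the node's job
    post_exit_code: exit code of the node's POST script, or None if the
                    node has no POST script
    """
    if post_exit_code is not None:
        # Assumption: with a POST script, its exit code decides the outcome.
        return post_exit_code == 0
    # No POST script: exit code zero means success, anything else is failure.
    return job_exit_code == 0


print(node_succeeded(0))   # prints: True
print(node_succeeded(1))   # prints: False
```

One consequence worth keeping in mind: a job that writes incomplete output but still exits with 0 counts as a success under this rule, so its dependents will run.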

-greg
