
Re: [HTCondor-users] Retry failed nodes in a running DAG



Hi Siarhei,

Additionally, it's worth noting that our upcoming v9.1.0 release (due
next week) adds a new feature that does pretty much what you're
looking for. A new configuration option called
DAGMAN_PUT_FAILED_JOBS_ON_HOLD, when enabled, tells DAGMan to put a
failed job on hold (instead of marking it failed and waiting for the
DAG to abort).
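
As a rough sketch, assuming the option is a plain boolean that is off
by default (check the 9.1.0 manual for the exact syntax once it is
out), enabling it in the configuration that DAGMan reads on the submit
machine would look something like:

```
# Assumption: boolean knob, off unless set explicitly
DAGMAN_PUT_FAILED_JOBS_ON_HOLD = True
```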

This gives you an opportunity to fix whatever caused the job to
fail, release it, and then continue the regular DAG execution.
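
For example, once the disk space problem is fixed, something along
these lines should do it. The job ID 1234.0 is only a placeholder; use
whatever ID condor_q reports for the held job:

```
# List held jobs along with their hold reasons
condor_q -hold

# Release the held job so it runs again (1234.0 is a hypothetical ID)
condor_release 1234.0
```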

If you're able to upgrade to this release when it comes out, that
would be the most straightforward solution.
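
In the meantime, the re-run Greg describes below is normally just a
matter of resubmitting the same DAG file. Assuming DAGMan wrote a
rescue DAG when it exited (the default behavior, as far as I know), it
is picked up automatically and nodes that already succeeded are
skipped. The file name here is a placeholder:

```
# my_workflow.dag stands in for your own DAG file
condor_submit_dag my_workflow.dag
```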

Mark

On Thu, May 6, 2021 at 10:34 AM Greg Thain <gthain@xxxxxxxxxxx> wrote:
>
> On 5/6/21 10:26 AM, Vaurynovich, Siarhei wrote:
>
> Hi Christoph,
>
> Thank you for your reply!
>
> The failed jobs are not queued anymore; they have crashed (in this case, due to insufficient disk space for their output). If the jobs were still running, I could have held and then released them to solve the problem. The question is whether I can tell HTCondor to run just those failed jobs again now that they have crashed and are no longer running.
>
> When you say the jobs "crashed", the important issue to dagman is whether the job exited with a zero or non-zero exit code. If a dag node job exits with a non-zero exit code, dagman considers the node to have failed. It will not run any nodes that depend on a failed node, but it will continue to run independent nodes until it cannot make more progress. After you fix what failed, dagman can be re-run and it will run just the failed nodes and their dependents.
>
> If, however, the job exits with a zero exit code (in the absence of a POST script), dagman assumes the job has succeeded and continues running dependent jobs.
>
> -greg



-- 
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison