Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] Retry failed nodes in a running DAG

Date: Thu, 06 May 2021 21:46:56 +0000
From: "Vaurynovich, Siarhei" <siarhei.vaurynovich@xxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Retry failed nodes in a running DAG

Hi Mark, Greg,

Mark, thank you for your reply! That is exactly the kind of feature that would solve my problem. From your reply, I gather that there are no similar/alternative ways to address my issue in the current releases of HTCondor.

Greg, yes, if there is a problem, the exit code from my jobs is non-zero. This makes it easier for me to identify failed jobs in a large DAG (and, of course, I do not want child jobs to crash pointlessly, since the dependencies are there for a good reason): all I need to do is look at the dagman log for a report on how many and which nodes failed. So, when a job crashes, dagman considers it failed and will not run that job or any of its child jobs unless I restart the DAG. For that, however, I either need to kill the jobs that are still running, or wait until no further progress can be made in the DAG and a rescue file is generated; neither option is optimal.
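
To make the behaviour discussed in this thread concrete, here is a minimal Python sketch of the rule: a node with a non-zero exit code is marked failed, its descendants never start, and independent nodes keep running. It is a toy model only, not DAGMan's implementation; the function name simulate_dag and the five-node example DAG are hypothetical.

```python
from collections import deque


def simulate_dag(parents, exit_codes):
    """Toy model of the failure semantics described in this thread.

    parents:    dict mapping each node to the list of its parent nodes
                (every parent must also appear as a key)
    exit_codes: dict mapping a node to the exit code its job would return;
                nodes not listed are assumed to exit with 0
    Returns (succeeded, failed, blocked) as sorted lists of node names.
    """
    children = {node: [] for node in parents}
    pending = {}
    for node, node_parents in parents.items():
        pending[node] = len(node_parents)
        for parent in node_parents:
            children[parent].append(node)

    # Nodes with no unfinished parents are ready to run.
    ready = deque(sorted(n for n, count in pending.items() if count == 0))
    succeeded, failed = [], []
    while ready:
        node = ready.popleft()
        if exit_codes.get(node, 0) == 0:
            succeeded.append(node)
            for child in children[node]:
                pending[child] -= 1
                if pending[child] == 0:
                    ready.append(child)
        else:
            # A failed node never releases its children.
            failed.append(node)

    finished = set(succeeded) | set(failed)
    blocked = sorted(n for n in parents if n not in finished)
    return sorted(succeeded), sorted(failed), blocked


# Hypothetical DAG: A -> B -> D and A -> C -> E, where B's job fails.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B"], "E": ["C"]}
result = simulate_dag(parents, {"B": 1})
print(result)
```

In this example the independent branch through C and E still completes, while D stays blocked behind the failed node B. That is the state in which no further progress can be made and a rescue file would be written.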

Best,
Siarhei.

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Mark Coatsworth
Sent: Thursday, 6 May, 2021 15:33
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Retry failed nodes in a running DAG

Hi Siarhei,

Additionally, it's worth noting that in our upcoming v9.1.0 release (due next week) we've added a new feature that does pretty much what you're looking for. A new configuration option called DAGMAN_PUT_FAILED_JOBS_ON_HOLD will tell DAGMan to put a failed job on hold (instead of marking it failed and waiting for the DAG to abort).

So this will give you an opportunity to fix whatever caused the job to fail, release it, then continue the regular dag execution.

If you're able to upgrade to this release when it comes out, that would be the most straightforward solution.
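
As a rough sketch of that workflow in Python: the block below only builds the configuration line and the command lines involved, and does not execute anything. The boolean value True is an assumption about how the new knob is set, and the helper names and the job ID 1234.0 are hypothetical; condor_q -hold and condor_release are the standard HTCondor tools for listing and releasing held jobs.

```python
import shlex

# Assumption: the new knob is a boolean set in the HTCondor configuration.
CONFIG_LINE = "DAGMAN_PUT_FAILED_JOBS_ON_HOLD = True"


def held_jobs_command():
    """Command line that lists held jobs and their hold reasons."""
    return ["condor_q", "-hold"]


def release_command(job_id):
    """Command line that releases one held job, e.g. '1234.0' or '1234'."""
    cluster, sep, proc = job_id.partition(".")
    if not cluster.isdigit() or (sep and not proc.isdigit()):
        raise ValueError(f"not a cluster[.proc] job ID: {job_id!r}")
    return ["condor_release", job_id]


# Print what would be configured and run; nothing is executed here.
print(CONFIG_LINE)
print(shlex.join(held_jobs_command()))
print(shlex.join(release_command("1234.0")))
```

The idea is to fix the underlying problem (for example, free up disk space) before releasing the held job, so that the DAG can carry on from that node.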

Mark

On Thu, May 6, 2021 at 10:34 AM Greg Thain <gthain@xxxxxxxxxxx> wrote:
>
>
> On 5/6/21 10:26 AM, Vaurynovich, Siarhei wrote:
>
> Hi Christoph,
>
> Thank you for your reply!
>
> The failed jobs are not queued anymore; they have crashed (in this case, due to insufficient disk space for their output). If the jobs were still running, I could have held and then released them to solve the problem. The question is whether I can tell HTCondor to run just those failed jobs again if the jobs have crashed and are not running anymore.
>
>
> When you say the jobs "crashed", the important issue to dagman is whether the job exited with a zero or non-zero exit code. If a dag node job exits with a non-zero exit code, dagman considers the node to have failed. It will not run any nodes that depend on a failed node, but it will continue to run independent nodes until it cannot make more progress. After fixing what failed, dagman can be re-run and it will just run the failed nodes and their dependents.
>
> If, however, the job exits with a zero exit code (in the absence of a postscript), dagman assumes the job has succeeded, and continues running dependent jobs.
>
> -greg
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/

--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison
