Hi Christoph,

Thank you for your reply! The failed jobs are not queued anymore; they have crashed (in this case, due to insufficient disk space for their output). If the jobs were still running, I could have held and then released them to solve the problem. The question is whether I can tell HTCondor to run just those failed jobs again when they have crashed and are no longer in the queue.

Thank you,
Siarhei.

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Beyer, Christoph
Hi, I think as long as the jobs are still queued you can put them back in idle mode through condor_qedit?
Best christoph
From:
"Vaurynovich, Siarhei" <siarhei.vaurynovich@xxxxxxxxxxxxx>

Hello,

Situation: I have a large DAG of jobs which is in the process of running. A few jobs failed, but most of the jobs in the DAG keep running. From the log files, I have figured out the problem and fixed it. Please let me know if there is a way to tell HTCondor to retry the failed nodes (and all of their CHILD nodes, of course) without killing any of the currently running jobs in the same DAG and without waiting for the whole DAG to fail (and generate a rescue file). From the documentation on condor_submit_dag, I can see that the following command might be a good candidate (I have sub-DAGs):

condor_submit_dag -DoRecovery -do_recurse submit_file.dag

Please let me know if that is what I should do.

Thank you very much for your help,
Siarhei.
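For comparison, the standard rescue-DAG workflow (the one the question hopes to avoid, since it requires stopping the whole DAG) can be sketched as follows; the DAGMan job id 123.0 and the DAG file name are placeholders, and this assumes HTCondor's command-line tools are available:

```shell
# Placeholder job id: 123.0 is assumed to be the condor_dagman job's cluster.proc.
# Removing the DAGMan job makes it shut down and write a rescue file
# (e.g. submit_file.dag.rescue001) recording which nodes already completed.
condor_rm 123.0

# Resubmitting the same DAG file automatically uses the most recent
# rescue file, so only failed and not-yet-run nodes are executed again.
condor_submit_dag submit_file.dag
```

This restarts the failed nodes and their children without rerunning completed work, but unlike the in-place approach asked about above, it does kill the currently running jobs in the DAG.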