
Re: [HTCondor-users] Condor DAG: retry failed jobs before rescue file is generated



Hi Todd, David,

Many thanks to both of you, your answers are very helpful and thanks a lot for the references.

Best,
Romain

On Mon, Feb 7, 2022 at 20:52, <duduhandelman@xxxxxxxxxxx> wrote:
Todd.
I probably need to go over my submit files.
Thank you very much.


From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Sent: Monday, February 7, 2022 9:30:36 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; duduhandelman@xxxxxxxxxxx <duduhandelman@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor DAG: retry failed jobs before rescue file is generated
On 2/7/2022 1:12 PM, duduhandelman@xxxxxxxxxxx wrote:
Romain
I think it will be great.
FYI, I don't remember the exact configuration.

The logic will be:

1. Put jobs on hold when exit code !=0
2. Autorelease jobs when exit code !=0 and number of starts is less than 3
This should work.
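
As a rough sketch of the two steps above, a submit file might look like the following. The script name exit_with_status.sh and the output file names are hypothetical, and the two expressions are only one assumed way of writing this policy, not David's exact configuration (which he says he does not remember).

```
# test.sub: hold-and-release retry policy (illustrative sketch)
# exit_with_status.sh is a hypothetical script that exits with a chosen status
executable = exit_with_status.sh
output     = test.$(Cluster).out
error      = test.$(Cluster).err
log        = test.$(Cluster).log

# 1. Put the job on hold when it exits on its own with a non-zero exit code
on_exit_hold = (ExitBySignal == False) && (ExitCode != 0)

# 2. Release held jobs while the number of starts is less than 3
periodic_release = (NumJobStarts < 3)

queue
```

Note that, as written, periodic_release would also release jobs that were held for other reasons (for example, by hand with condor_hold), so a real policy would probably want a narrower expression.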


I suggest you create a single submit file that runs a script which exits with a given status code. You will be able to check it quickly.

Have a look at the documentation.
Drop me a note if you need an example.
Thanks
David

Hi folks,

Just chiming in here:

While David's suggestion above would work, there is no need to place jobs on hold and autorelease jobs... current versions of HTCondor have an easier to use/understand mechanism to simply retry jobs that exit without a successful exit code. In the condor_submit man page at
    https://htcondor.readthedocs.io/en/feature/man-pages/condor_submit.html
take a look at the definitions for max_retries, success_exit_code, and retry_until. Also take a look at the following section in the manual:
    https://htcondor.readthedocs.io/en/feature/users-manual/automatic-job-management.html?highlight=max_retries#automatically-rerunning-a-failed-job
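
For illustration, a minimal submit file using these commands might look roughly like this. The executable and log names are hypothetical, and the values are only an assumed example; the man page above is the authority on exact semantics.

```
# test.sub: built-in retry policy (illustrative sketch)
# exit_with_status.sh is a hypothetical test script
executable = exit_with_status.sh
log        = test.$(Cluster).log

# Exit code that counts as success (0 is the default)
success_exit_code = 0

# Rerun the job up to 3 times if it exits with any other code
max_retries = 3

queue
```

retry_until can be added on top of this to stop retrying early, for example when the job exits with a code that means retrying is pointless.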

The decision to place retry policy directly into your job submission, or alternatively to use DAGMan to manage retries, largely depends on whether your job requires the PRE and POST script functionality that DAGMan brings to the table. For example, if you can determine whether your job succeeded or failed based on just the exit status or other attributes reflected in the job classad (like runtime, for instance), then there is likely no need to involve DAGMan to handle retries, and you can simply specify max_retries and/or retry_until in your job submit file. On the other hand, if determining whether a job succeeded requires running some procedural code (e.g. a script that does some sanity and completeness checking on the output files), then using DAGMan's retry functionality in concert with POST scripts is what I would recommend.
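
A sketch of the DAGMan variant, assuming a hypothetical submit file job.sub and a hypothetical checker script check_output.sh that exits non-zero when the output files fail its sanity checks:

```
# retry.dag: DAGMan-managed retries with a POST script (illustrative sketch)
# job.sub and check_output.sh are hypothetical names
JOB A job.sub

# Run the checker after the job; $RETURN passes the job's exit status to it
SCRIPT POST A check_output.sh $RETURN

# Retry node A up to 3 times if it fails
RETRY A 3
```

When a node has a POST script, the POST script's exit status decides whether the node succeeded, so a failed check here is what triggers the retry.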

Hope the above helps
Todd

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/