
Re: [HTCondor-users] Condor DAG: retry failed jobs before rescue file is generated



Todd,
I probably need to go over my submit files.
Thank you very much.

From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Sent: Monday, February 7, 2022 9:30:36 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; duduhandelman@xxxxxxxxxxx <duduhandelman@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor DAG: retry failed jobs before rescue file is generated
 
On 2/7/2022 1:12 PM, duduhandelman@xxxxxxxxxxx wrote:
Romain,
I think it will be great.
FYI, I don't remember the exact configuration.

The logic will be:

1. Put jobs on hold when the exit code != 0.
2. Auto-release jobs when the exit code != 0 and the number of starts is less than 3.

This should work.
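
A rough sketch of that hold-and-release logic in a submit file might look like the following. The executable name is hypothetical, the limit of 3 starts is taken from the description above, and the expressions are an untested illustration rather than the exact configuration:

```
# hold_release.sub (hypothetical file name)
executable = my_job.sh

# 1. Hold the job if it exited on its own with a non-zero exit code.
on_exit_hold = (ExitBySignal == False) && (ExitCode != 0)

# 2. Release held jobs while they have started fewer than 3 times.
#    As written, this releases a job held for any reason, not only
#    for a non-zero exit code, so it may need tightening.
periodic_release = (NumJobStarts < 3)

output = job.$(Cluster).$(Process).out
error  = job.$(Cluster).$(Process).err
log    = job.log

queue
```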


I suggest you create a single submit file and run a script that exits with a chosen status code. You will be able to check it quickly.

Have a look at the documentation. 
Drop me a note if you need an example. 
Thanks 
David

Hi folks,

Just chiming in here:

While David's suggestion above would work, there is no need to place jobs on hold and autorelease jobs...  current versions of HTCondor have an easier to use/understand mechanism to simply retry jobs that exit without a successful exit code. In the condor_submit man page at
    https://htcondor.readthedocs.io/en/feature/man-pages/condor_submit.html
take a look at the definitions for max_retries, success_exit_code, and retry_until. Also take a look at the following section in the manual:
   https://htcondor.readthedocs.io/en/feature/users-manual/automatic-job-management.html?highlight=max_retries#automatically-rerunning-a-failed-job
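
As a minimal sketch of how those three submit commands fit together (the executable name, the retry count of 3, and the exit code 42 are assumptions chosen for illustration, not values from this thread):

```
# retry.sub (hypothetical file name)
executable = my_job.sh

# Exit code that counts as success; anything else is treated as a failure.
success_exit_code = 0

# Rerun a failed job up to 3 more times.
max_retries = 3

# Optional: stop retrying early if the job exits with code 42,
# even if max_retries has not been exhausted.
retry_until = 42

output = job.$(Cluster).$(Process).out
error  = job.$(Cluster).$(Process).err
log    = job.log

queue
```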

The decision to place retry policy directly into your job submission, or alternatively to use DAGMan to manage retries, largely depends on whether your job requires the PRE and POST script functionality that DAGMan brings to the table.  For example, if you can determine whether your job succeeded or failed based on just the exit status or other attributes reflected in the job classad (like runtime, for instance), then there is likely no need to involve DAGMan to handle retries, and you can simply specify max_retries and/or retry_until in your job submit file.   On the other hand, if determining whether a job succeeded requires running some procedural code (e.g. a script that does some sanity and completeness checking on the output files), then using DAGMan's retry functionality in concert with POST scripts is what I would recommend.
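
For the DAGMan case, a minimal sketch of a DAG input file could look like this. The node name, submit file name, and check script are all hypothetical, and the retry count of 3 is just an example:

```
# retry.dag (hypothetical file name)
JOB A a.sub

# Run a checking script after the job finishes. $JOB expands to the
# node name and $RETURN to the job's exit code. If the script exits
# non-zero, the node is considered failed.
SCRIPT POST A check_output.sh $JOB $RETURN

# Retry node A up to 3 times before DAGMan treats it as failed.
RETRY A 3
```

With a setup like this, the retries happen while the DAG is still running, and the node is only recorded as failed once its retries are used up.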

Hope the above helps
Todd