Re: [HTCondor-users] Condor DAG: retry failed jobs before rescue file is generated
- Date: Mon, 07 Feb 2022 19:12:38 +0000
- From: duduhandelman@xxxxxxxxxxx
- Subject: Re: [HTCondor-users] Condor DAG: retry failed jobs before rescue file is generated
I think that would work well.
FYI, I don't remember the exact configuration, but the logic would be:
1. Put jobs on hold when the exit code != 0.
2. Automatically release held jobs when the exit code != 0 and the number of starts is less than 3.
This should work.
I suggest you create a single submit file that runs a script exiting with a non-zero status code; that way you can test it quickly.
Have a look at the documentation.
Drop me a note if you need an example.
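For reference, a minimal sketch of that hold/auto-release idea as a submit description (the attribute names are the standard job ClassAd ones; the script name and the threshold of 3 starts are just placeholders for testing):

```
# test.sub -- sketch: hold on failure, auto-release up to 3 starts
executable = fail.sh        # a test script that simply exits non-zero
log        = test.log
output     = test.out
error      = test.err

# Put the job on hold whenever it exits normally with a non-zero code
on_exit_hold = (ExitBySignal == false) && (ExitCode != 0)

# Automatically release held jobs while they have started fewer than 3 times
periodic_release = (NumJobStarts < 3)

queue
```

With a test script that just does `exit 1`, you should see the job cycle through hold and release until it hits the start limit, after which it stays held for the user to inspect.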
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of romain.bouquet04@xxxxxxxxx <romain.bouquet04@xxxxxxxxx>
Sent: Monday, February 7, 2022, 20:54
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Condor DAG: retry failed jobs before rescue file is generated
Thanks for your reply.
I have a question about your proposed solution, which I think would be the best (let's say we want 2 retries):
- submit the jobs
- retry failing jobs up to 2 times
- if a job still fails after the 2nd retry --> put that job on hold until the user examines it
That way the user can decide whether to release or kill the job.
If the job is released, retry up to 2 more times as before, and so on and so forth.
Would you know if such a thing is possible?
I only know the command line to put jobs on hold when they fail,
but I don't know how to put them on hold only after 2 tries.
Thanks again, your idea is really interesting and I am going to look at how to implement it.
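For the manual inspection step, the standard condor_* commands would apply (the cluster ID 123.0 below is illustrative):

```
# List held jobs together with the reason they were held
condor_q -hold

# After inspecting, either release the job for more retries...
condor_release 123.0

# ...or remove it for good
condor_rm 123.0
```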
How about putting jobs on hold when the exit code != 0 and periodically releasing them?
I am sending you this email because I would like to know whether it is possible to retry failed jobs in a Condor DAG (condor_submit_dag command) before the rescue file is created.
Basically I am submitting around 500 jobs plus 1 final job that is in charge of merging the results from the previous jobs.
I use the DAG feature for that and have set
RETRY ALL_NODES 2
in the DAG input file, which retries each job up to 2 times in case a transient failure occurs.
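(For context, the DAG input file looks roughly like this; the node and file names are illustrative:)

```
# my.dag -- illustrative DAG input file
JOB  job1   worker.sub
JOB  job2   worker.sub
# ... job3 through job500 ...
JOB  final  merge.sub
PARENT job1 job2 CHILD final
RETRY ALL_NODES 2
```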
Lately the machines I am running on have been quite unstable, so some of the 500 jobs can crash even with 2 retries (this happens when opening a text file, for instance, which is obviously a transient error). The crashes happen at the beginning, and since the 500 jobs run for quite a long time, I can easily spot the ones that failed.
I would prefer not to increase RETRY ALL_NODES to a higher value, as I would first like to check whether something is wrong in my code before re-submitting the failed jobs.
So I would like to be able to resubmit jobs that failed before all 500 jobs are "done", i.e. before each one has either failed or finished successfully.
The problem is that the rescue file, which enables the resubmission of failed jobs, is only created at the end, once all jobs have finished (whether they failed or succeeded).
Would you know a way to do this other than creating a new submit file myself?
Many thanks in advance,
HTCondor-users mailing list