
Re: [HTCondor-users] Condor DAG: retry failed jobs before rescue file is generated



Hi David,
Thanks for your reply.

I have a question about your proposed solution, which I think would be the best approach (let's say we want 2 retries):
- submit the jobs
- retry jobs that fail, up to 2 times
- if a job still fails after the 2nd retry, put it on hold until the user examines it

That way the user can decide whether to release or remove the job.
If the job is released, it is retried up to 2 more times as before, and so on.

Would you know if such a thing is possible?
I only know the command to put jobs on hold when they fail,
but not how to put them on hold only after 2 retries.
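
From a quick look at the condor_submit manual, I am wondering whether something like this would work (untested sketch; the expressions are my guess based on the documented ExitCode and NumJobStarts job attributes):

    # keep failed jobs in the queue so the schedd re-runs them automatically
    on_exit_remove = (ExitCode == 0)
    # after the 3rd failed start (first run + 2 retries), hold for inspection
    on_exit_hold   = (ExitCode != 0) && (NumJobStarts >= 3)

I am not sure this would give 2 fresh retries after each release, though, since NumJobStarts keeps growing.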

Thanks again, your idea is really interesting; I am going to look into how to implement it.
Best,
Romain



On Mon, Feb 7, 2022 at 19:36, <duduhandelman@xxxxxxxxxxx> wrote:
Hi Romain.
How about putting jobs on hold when the exit code is != 0 and periodically releasing them?
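
A rough, untested submit-file sketch of what I mean (the threshold of 3 starts is illustrative):

    # hold the job whenever it exits with a non-zero exit code
    on_exit_hold     = (ExitCode != 0)
    # automatically release held jobs until they have started 3 times;
    # beyond that they stay held so the user can inspect them
    periodic_release = (NumJobStarts < 3)

Jobs that stay held can then be listed with condor_q -hold and released by hand with condor_release.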

Thanks
David


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of romain.bouquet04@xxxxxxxxx <romain.bouquet04@xxxxxxxxx>
Sent: Monday, February 7, 2022 6:35:11 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Condor DAG: retry failed jobs before rescue file is generated
Dear all,

I am sending you this email because I would like to know whether it is possible to retry failed jobs with Condor DAG (the condor_submit_dag command) before the rescue file is created.

Basically I am submitting around 500 jobs plus 1 final job that is in charge of combining the results from the previous jobs.
I use the DAG feature for this and have set RETRY ALL_NODES 2 in the DAG input file, which retries each node up to 2 times in case a transient failure occurs.
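
For reference, the DAG input file looks roughly like this (node names here are placeholders):

    # ~500 independent analysis nodes (placeholder names)
    JOB  node001  node001.sub
    JOB  node002  node002.sub
    # ... one JOB line per node ...
    JOB  merge    merge.sub
    # the final merge node depends on all of them
    PARENT node001 node002 CHILD merge
    RETRY ALL_NODES 2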

Lately the machines I am running on have been quite unstable, so some of the 500 jobs can crash even with 2 retries (for instance when simply opening a text file, which is obviously a transient error). The crashes happen at the beginning of a run, and since the 500 jobs run for quite a long time, I can easily spot the ones that failed well before the DAG completes.

I would prefer not to increase RETRY ALL_NODES to a higher value, because I would first like to check whether there is something wrong in my code before re-submitting the failed jobs.
So I would like to be able to resubmit failed jobs before all 500 jobs are "done", i.e. have either failed or finished successfully.
The problem is that the rescue file, which enables the resubmission of failed jobs, is only created at the end, once all jobs have finished (whether they failed or succeeded).

Would you know a way to do this, apart from writing a new submit file myself?

I had a look at the documentation and tried, for instance, the -DoRecovery option, but the jobs are locked:
https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html
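
Concretely, what I tried was along these lines (the .dag file name is a placeholder):

    condor_submit_dag -DoRecovery my_workflow.dag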

Many thanks in advance,
Best regards,
Romain Bouquet