
[HTCondor-users] Condor DAG: retry failed jobs before rescue file is generated



Dear all,

I am sending you this email because I would like to know whether it is possible, with Condor DAG (the condor_submit_dag command), to retry failed jobs before the rescue file is created.

Basically I am submitting around 500 jobs + 1 final job that is in charge of aggregating the results from the previous jobs.
I use the DAG feature for that and have set RETRY ALL_NODES 2 in the DAG input file, which allows each job to be retried up to 2 times in case a transient failure occurs.
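For context, a minimal DAG input file for this kind of setup might look like the following (all node and file names here are hypothetical, just for illustration):

```
# my500.dag -- hypothetical file, sketch of the setup described above
JOB job1   job1.sub
JOB job2   job2.sub
# ... up to job500 ...
JOB merge  merge.sub

# the final merge node runs only after all the other jobs finish
PARENT job1 job2 CHILD merge

# retry every node up to 2 times if it fails
RETRY ALL_NODES 2
```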

Lately the machines I am running on are quite unstable, so some of the 500 jobs can crash even with 2 retries (for instance when opening a text file, which is obviously a transient error). The crashes happen at the beginning, and the 500 jobs run for quite a long time, so I can easily spot the ones that failed early on.

I would prefer not to increase RETRY ALL_NODES to a higher value, as I would first like to check whether there is something wrong in my code before re-submitting the failed jobs.
So I would like to be able to resubmit jobs that failed before all 500 jobs are "done" (either failed or finished successfully).
The problem is that the rescue file, which enables the resubmission of failed jobs, is only created at the end, once all jobs have finished (whether they failed or succeeded).
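For reference, this is the rescue behaviour I mean: when the DAG run ends with failed nodes, DAGMan writes a numbered rescue file next to the DAG input file, and re-submitting the same DAG resumes from the newest rescue file automatically (file name below assumes the hypothetical my500.dag):

```
# DAGMan writes e.g. my500.dag.rescue001 when the run ends with failures.
# Re-submitting the same DAG then resumes from the newest rescue file:
condor_submit_dag my500.dag
```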

Would you know a way to do this, apart from creating a new submit file myself?

I had a look at the documentation and tried, for instance, the -DoRecovery option, but the jobs are locked:
https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html

Many thanks in advance,
Best regards,
Romain Bouquet