
[HTCondor-users] 2 questions about job retry




Hello,

I have a couple of questions about how to tune the retry behavior of a failed DAG job.

1) What's the best way to wait some seconds before attempting a retry?

I've thought of using a POST script that takes $RETURN as one of its arguments and calls sleep if $RETURN is not equal to 0, but I am not sure whether that would work, or whether there is a simpler way to achieve the same thing.
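
To make the idea concrete, here is a minimal, untested sketch of such a POST script in Python. The file name delay_on_failure.py, the node name A, the 30-second delay and the retry count are made up for illustration. It assumes that DAGMan waits for the POST script to finish before resubmitting the node, and it relies on the fact that, once a POST script is attached, the script's exit code decides whether the node succeeded, so the script passes the job's return value through.

```python
#!/usr/bin/env python3
"""Hypothetical DAGMan POST script: pause before a retry when the node job failed.

Assumed DAG file lines (node name, file names and numbers are illustrative):
    JOB A a.sub
    SCRIPT POST A delay_on_failure.py $RETURN 30
    RETRY A 3
"""
import sys
import time


def delay_on_failure(return_code, delay_seconds, sleep=time.sleep):
    """Sleep only if the job failed, then hand back the job's return code."""
    if return_code != 0:
        sleep(delay_seconds)
    return return_code


def main(argv):
    # argv[1] is the job's return value ($RETURN); argv[2] is an optional delay.
    return_code = int(argv[1])
    delay_seconds = float(argv[2]) if len(argv) > 2 else 30.0
    return delay_on_failure(return_code, delay_seconds)


# Only act when at least the job's return value was passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    # Exit with the job's own status so a failed job still counts as a failed node.
    sys.exit(main(sys.argv))
```

Exiting with the job's return value matters here: if the script always exited with 0, the node would be treated as successful and RETRY would never be triggered.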

2) When a job retries, I would like it *not* to run where the failed job ran. Searching the web led me to add the line

requirements = Machine =!= LastRemoteHost

to the submit file that is referenced by the JOB command in the DAG file, but that doesn't seem to work. More often than not, the job reruns in the same place (same machine and same slot) as the failed attempt.

The Condor version I am using is

condor_version $CondorVersion: 9.0.11 Mar 12 2022 BuildID: 578027 PackageID: 9.0.11-1 $
$CondorPlatform: x86_64_CentOS7 $

Thanks in advance,

Nicolas