Re: [HTCondor-users] 2 questions about job retry



On 8/19/22 09:45, Nicolas Arnaud wrote:

Hello,

I have a couple questions about how to tune the retry of a failed DAG job.

1) What's the best way to wait some seconds before attempting a retry?

I've thought of using a POST script that takes $RETURN as one of its arguments and calls sleep if $RETURN is not equal to 0, but I wonder whether that would work and whether there is a simpler way to do the same thing.


This is not bad, and would be my first recommendation.
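
A minimal sketch of such a POST script, written in Python for illustration. The file name, the 60-second delay, and the DAG lines shown in the docstring are assumptions, not something taken from your setup. It relies on DAGMan using the POST script's exit status to decide whether the node succeeded, so the script reports failure again after sleeping and RETRY still triggers.

```python
#!/usr/bin/env python3
"""Hypothetical DAGMan POST script: pause after a failed job to delay the retry.

Assumed DAG file lines (node name, file names, and retry count are illustrative):
    JOB A job.sub
    SCRIPT POST A post_delay.py $RETURN
    RETRY A 3
"""
import sys
import time

DELAY_SECONDS = 60  # assumed delay; tune to taste


def post_script(job_return, delay=DELAY_SECONDS, sleep=time.sleep):
    """Sleep when the job failed, then report success (0) or failure (1).

    Any nonzero $RETURN is collapsed to 1 so the exit status stays in the
    valid range while still marking the node as failed.
    """
    if job_return != 0:
        sleep(delay)
        return 1
    return 0


if __name__ == "__main__":
    # DAGMan passes the expanded $RETURN value as the first argument.
    if len(sys.argv) > 1:
        sys.exit(post_script(int(sys.argv[1])))
```

The sleep parameter is only there so the logic can be exercised without actually waiting; DAGMan would run the script with the default.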



2) When a job retries, I would like it *not* to run where the failed job has run. Searching on the web led me to adding the line

requirements = Machine =!= LastRemoteHost

to the submit file that is called by the JOB command in the DAG file, but that doesn't seem to work. More often than not, the job reruns in the same place (same machine and same slot) as the failed try.


There's a somewhat over-engineered version of this that supports hold and release at

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAutoRetryElsewhere

But the basic mechanisms with the Requirements should be what you need.
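
As a rough sketch of the Requirements approach, the submit-file fragment below uses the job_machine_attrs submit commands to record the machines of earlier attempts. Two assumptions to check against your pool: LastRemoteHost typically has the form slot@machine, so comparing it directly with Machine may never match, and the recorded attributes live in the job ad, so this only helps when the retry happens within the same job (for example via max_retries in the submit file) rather than as a fresh submission. The history length and retry count are illustrative values.

```
# Record the Machine attribute of each slot the job runs on
job_machine_attrs = Machine
job_machine_attrs_history_length = 5

# Avoid the machines used by the two most recent attempts
requirements = (target.Machine =!= MachineAttrMachine1) && (target.Machine =!= MachineAttrMachine2)

# Assumed: let the job itself retry so the recorded history is kept
max_retries = 3
```

With this in place, MachineAttrMachine1 holds the machine of the most recent attempt and MachineAttrMachine2 the one before it; both are undefined on the first run, which is why the comparison uses =!= rather than !=.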


-greg