
Re: [HTCondor-users] Restart a job after its node failed



On 4/24/2023 4:39 AM, Gaetan Geffroy wrote:

Hi,

I was testing how Condor handles worker node failures, by starting jobs on a node and then shutting it down.

I used configuration values such as UPDATE_INTERVAL = 15 to make sure the CM detects the node failure fast, and sure enough the node is gone from condor_status after a few seconds.

But the job stays. condor_q still shows it in the running state, and -better-analyze still shows the executing node as being the failed one, even though it is already gone from condor_status.

I left it running in the background, and it was only after about two hours that the job was finally restarted on another node.

 

After a bit of research, I found this thread from 2009 describing the same behavior, with the answer saying that it would be improved soon using MAX_CLAIM_ALIVES_MISSED and ALIVE_INTERVAL:

https://groups.google.com/g/condor-computing/c/Sxag4qbtfsg

I looked at these macros, but it appears that they only have the schedd send alive messages to the startd, which stops a running job if it does not receive them. But I am looking for the opposite...

Then, there is also STARTD_SENDS_ALIVES, which looks like it does what I want, but it is deprecated.

 

How could I make the recovery of jobs on failed nodes faster?


Hi Gaetan,

Perhaps you want to specify "job_lease_duration" in your job submit file?

See this section of the manual (and/or the condor_submit man page):

https://htcondor.readthedocs.io/en/latest/users-manual/special-environment-considerations.html#job-leases

"A job lease specifies how long a given job will attempt to run on a remote resource, even if that resource loses contact with the submitting machine. Similarly, it is the length of time the submitting machine will spend trying to reconnect to the (now disconnected) execution host, before the submitting machine gives up and tries to claim another resource to run the job...."

The default value for the job lease duration is 40 minutes.
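
As a rough sketch of what that could look like in a submit file (the executable, arguments, log name and the 300-second value are illustrative assumptions, not recommendations; job_lease_duration is given in seconds):

```
# Hypothetical submit file, for illustration only
executable = /bin/sleep
arguments  = 600
log        = job.log

# Give up on reconnecting to a lost execution host after 300 seconds
# instead of the 40-minute default (value chosen only as an example)
job_lease_duration = 300

queue
```

A shorter lease should mean the submitting machine gives up on the lost node sooner, at the cost of also giving up sooner on jobs that were only briefly disconnected.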

Note that if you wish to change the default job_lease_duration for all jobs on a given Access Point (submit host), you can use the condor_config knob "JOB_DEFAULT_LEASE_DURATION".
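
For example, something along these lines in the Access Point's local configuration (the 300-second value is again only an assumption for illustration, and where your local config file lives depends on your installation):

```
# Default job lease, in seconds, for jobs submitted from this Access Point
JOB_DEFAULT_LEASE_DURATION = 300
```

I would expect this to apply to jobs submitted after the change; jobs already in the queue probably keep the lease they were submitted with.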

Hope the above helps,
Todd


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685