
Re: [Condor-users] Jobs lingering in queue if target shuts down mid-job



Lukas Slebodnik wrote:
Hi Thomas,

I think that decreasing the values of the configuration variables
MAX_CLAIM_ALIVES_MISSED and ALIVE_INTERVAL will help you.

Details are in the manual:
http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:AliveInterval
http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:MaxClaimAlivesMissed
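
A minimal sketch of such a change in the local Condor configuration file
follows. The values are illustrative assumptions only, not tested
recommendations. Going by the manual entries above, a claim is treated as
dead once MAX_CLAIM_ALIVES_MISSED consecutive keepalives, sent every
ALIVE_INTERVAL seconds, have been missed, so these values would give a
window of roughly 3 minutes.

```
# Illustrative values only (assumptions, not recommendations).
# Seconds between keepalive messages for a claim.
ALIVE_INTERVAL = 60
# Number of consecutive missed keepalives before the claim is given up.
MAX_CLAIM_ALIVES_MISSED = 3
```

The daemons would need to be reconfigured (for example with
condor_reconfig) before the new values take effect.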


Regards, Lukas

On Tue, Nov 22, 2011 at 01:59:01PM +0000, Thomas Luff wrote:
If a target machine shuts down or crashes whilst a job is running on
it, the job will hang around in the queue with the status
'Running'.

Even if the machine is shut down and left off, the job still acts as
if it's running, and it has been like this for over an hour now.

Is it possible to make these jobs automatically fail or requeue if
the target machine goes down?

Thanks

Thoughts:

- In the event of an execute machine crashing, the job should automatically requeue after a maximum of 2 hours (the TCP KEEPALIVE timeout). We plan to change this limit from 2 hours to the job lease duration (default of 20 minutes) in a future developer series release.

- Are you running the developer series of Condor? A bug introduced in Condor v7.7.2 could result in the job staying in the running state indefinitely, depending on your configuration. We have fixed it for the upcoming v7.7.4 release. Details at
  https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2591

regards,
Todd