Re: [Condor-users] Jobs lingering in queue if target shuts down mid-job
- Date: Tue, 22 Nov 2011 10:48:26 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Jobs lingering in queue if target shuts down mid-job
Lukas Slebodnik wrote:
I think that decreasing values of variables MAX_CLAIM_ALIVES_MISSED
and ALIVE_INTERVAL will help you.
Details in manual:
On Tue, Nov 22, 2011 at 01:59:01PM +0000, Thomas Luff wrote:
If a target machine shuts down or crashes whilst a job is running on
the machine, the job will hang around in the queue with the status
Even if the machine is shut down and left off, the job still acts as
if it's running, and it has been like this for over an hour now.
Is it possible to make these jobs automatically fail or requeue if
the target machine goes down?
- In the event of an execute machine crashing, the job should
automatically requeue after a maximum of 2 hours (the TCP keepalive
timeout). We plan to reduce this maximum from 2 hours to the job
lease duration (default of 20 minutes) in a future developer series release.
- Are you running the developer series of Condor? A bug introduced in
Condor v7.7.2 could result in the job staying in the running state
indefinitely, depending on your configuration. We have it fixed for the
upcoming v7.7.4 release. Details at
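
For reference, the suggestion quoted at the top of this message (lowering MAX_CLAIM_ALIVES_MISSED and ALIVE_INTERVAL) could be sketched as a local configuration fragment like the one below. This is only a sketch: the numbers are illustrative assumptions rather than recommended or default values, and the manual for your Condor version should be checked for the defaults and for which daemon reads each setting. Whether shortening these helps with a crashed execute machine also depends on the version-specific behaviour described above.

```
# Illustrative values only (assumptions, not recommendations).

# Seconds between the keepalive messages sent for a claim.
ALIVE_INTERVAL = 60

# Number of consecutive keepalives that may be missed before the
# claim is considered timed out.
MAX_CLAIM_ALIVES_MISSED = 3
```

With values like these, a claim would be given up after roughly ALIVE_INTERVAL x MAX_CLAIM_ALIVES_MISSED = 180 seconds without a keepalive. Shorter intervals mean more keepalive traffic and a higher chance of dropping a claim during a brief network hiccup, so it is worth lowering them gradually.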