Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Jobs lingering in queue if target shuts down mid-job

Date: Tue, 22 Nov 2011 10:48:26 -0600
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [Condor-users] Jobs lingering in queue if target shuts down mid-job

Lukas Slebodnik wrote:

Hi Thomas,

I think that decreasing the values of the variables MAX_CLAIM_ALIVES_MISSED
and ALIVE_INTERVAL will help you.

Details in the manual:
http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:AliveInterval
http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:MaxClaimAlivesMissed
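
As a rough sketch of what this suggestion might look like in a local Condor configuration file: the two parameter names come from the message above, but the values are illustrative assumptions only, not settings recommended anywhere in this thread. Per the manual, the condor_startd tolerates a number of missed keep-alive messages before it considers the claim timed out, so with these example values the window would be roughly 60 * 3 = 180 seconds.

```
# Illustrative values only (assumptions, not taken from this thread).
# Seconds between keep-alive messages sent by the condor_schedd
# to each claimed condor_startd.
ALIVE_INTERVAL = 60

# Number of missed keep-alives the condor_startd tolerates before
# it treats the claim as timed out and releases it.
MAX_CLAIM_ALIVES_MISSED = 3
```

The daemons would need to re-read their configuration (for example with condor_reconfig) before changed values take effect. Note that these settings govern how the execute side reacts to a silent submit side, so they may not by themselves change how quickly the queue notices a crashed execute machine.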



Regards, Lukas

On Tue, Nov 22, 2011 at 01:59:01PM +0000, Thomas Luff wrote:

If a target machine shuts down or crashes whilst a job is running on
it, the job will hang around in the queue with the status
'Running'.

Even if the machine is shut down and left off, the job still acts as
if it's running, and it has been like this for over an hour now.

Is it possible to make these jobs automatically fail or requeue if
the target machine goes down?

Thanks


Thoughts:

- in the event of an execute machine crashing, the job should automatically requeue after a maximum of 2 hrs (the TCP KEEPALIVE timeout). we plan to enhance this from 2 hrs max to instead be the job lease duration (default of 20 minutes) in a future developer series release.

- are you running the developer series of Condor? there was a bug introduced in Condor v7.7.2 that could result in the job staying running indefinitely, depending on your configuration. we have it fixed for the upcoming v7.7.4 release. details at

  https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2591

regards,
Todd

