Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Job rescheduling

Date: Mon, 17 Aug 2009 07:48:23 -0700
From: Matthew Farrellee <matt@xxxxxxxxxx>
Subject: Re: [Condor-users] Job rescheduling

Janito Ferreira Filho wrote:
> Hi,
> 
> I've looked further into the rescheduling of jobs after an execute node has died. It appears to work, but it takes too long. If I shut down an execute node with a job running on it and then restart it, it takes two hours for Condor to remove the failed job (until then, Condor thinks it is still running) and reschedule it (sometimes on the same node, which has been unclaimed since the restart). I searched the manual, but I can't find where this two-hour delay is configured. Can someone please point me in the right direction? Thank you,
> 
> JVFF

Have a look at ...

http://www.google.com/search?q=site%3Awww.cs.wisc.edu%2Fcondor%2Fmanual%2Fv7.3+claim+alive

Specifically around MAX_CLAIM_ALIVES_MISSED and ALIVE_INTERVAL.
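
As a rough sketch of how these two settings interact: my reading of the manual is that a claim is given up once no keep-alive has arrived for ALIVE_INTERVAL * MAX_CLAIM_ALIVES_MISSED seconds. The function name below is hypothetical, and the values 300 and 6 are assumptions used only for illustration; check the manual for the actual defaults in your version.

```python
def claim_timeout_seconds(alive_interval: int, max_alives_missed: int) -> int:
    """Seconds without a keep-alive before a claim is treated as dead.

    Assumes the timeout is simply the product of the two settings.
    """
    if alive_interval <= 0 or max_alives_missed <= 0:
        raise ValueError("both settings must be positive integers")
    return alive_interval * max_alives_missed


# Assumed illustrative values, not confirmed defaults.
ALIVE_INTERVAL = 300            # seconds between keep-alive messages
MAX_CLAIM_ALIVES_MISSED = 6     # missed keep-alives tolerated

timeout = claim_timeout_seconds(ALIVE_INTERVAL, MAX_CLAIM_ALIVES_MISSED)
print(f"claim timeout: {timeout} s ({timeout / 60:.0f} min)")
```

With these assumed values the product is 30 minutes rather than two hours, so compare it against the values in your own configuration before concluding that these settings alone explain the delay.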

If you're seeing a 2 hour timeout, that sounds fairly familiar. I believe Todd answered this previously, and I'd assume his answer was to reverse the direction of the alive messages. I'll ping him to include details.

Best,


matt

Follow-Ups:
- Re: [Condor-users] Job rescheduling
  - From: Todd Tannenbaum

References:
- [Condor-users] Job rescheduling
  - From: Janito Ferreira Filho
