Re: [Condor-users] Job rescheduling



Janito Ferreira Filho wrote:
> Hi,
> 
> I've looked further into the rescheduling of jobs after an execute node has died. It appears to be working, but it's taking too long. If I shut down an execute node with a job running on it and then restart it, it takes two hours for Condor to remove the failed job (until that point Condor thinks it's still running) and reschedule it (sometimes onto the same node, which has been unclaimed since the restart). I searched the manual, but I can't find where to configure this two-hour delay. Can someone please point me in the right direction? Thank you,
> 
> JVFF

Have a look at ...

http://www.google.com/search?q=site%3Awww.cs.wisc.edu%2Fcondor%2Fmanual%2Fv7.3+claim+alive

Specifically, look at MAX_CLAIM_ALIVES_MISSED and ALIVE_INTERVAL.
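
As a rough sketch of how those two settings might look in a Condor configuration file (the values below are illustrative assumptions, not recommendations or defaults, so check the manual for your version before changing anything):

```
# Illustrative values only (assumed for this example).
# Seconds between the keep-alive messages sent for a claim.
ALIVE_INTERVAL = 60

# How many keep-alives the condor_startd may miss before it
# treats the claim as timed out and releases it.
MAX_CLAIM_ALIVES_MISSED = 3
```

With values like these, the condor_startd would give up on a silent claim after roughly ALIVE_INTERVAL * MAX_CLAIM_ALIVES_MISSED seconds, about three minutes here. As far as I understand it, that covers the execute node noticing that the submit side has gone away, which may not be the same direction as the delay you're seeing.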

If you're seeing a two-hour timeout, that sounds fairly familiar. I believe Todd answered this previously, and I'd assume his answer was to reverse the direction of the alive messages. I'll ping him to include details.

Best,


matt