
[Condor-users] job restart delay after system crash


We have had several very similar system crashes lately (caused by disk failures on a specific type of hardware), and Condor responded the same way every time: after the computer that runs the starter dies, condor_q keeps reporting the job as running on that machine indefinitely (for more than 6 hours at least), even though condor_status no longer lists that computer at all.

The only way I found to remove the job from the crashed machine is to hold and release it (condor_vacate and condor_vacate_job do not work), but even after the release the job is not restarted and stays in the idle state for hours.
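
The manual check and workaround described above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function names, the job IDs, and the host names are hypothetical, the inputs are assumed to have been collected already from condor_q and condor_status output (how is left out here), and the pool list is assumed to hold bare host names. The sketch only builds the condor_hold / condor_release command lines; it does not run them.

```python
def find_stranded_jobs(running_jobs, pool_machines):
    """Return IDs of running jobs whose remote host is missing from the pool.

    running_jobs:  dict mapping "cluster.proc" -> remote host as reported
                   for the job (assumed format: "host" or "vmN@host").
    pool_machines: iterable of bare host names currently listed in the pool.
    """
    known = {m.lower() for m in pool_machines}
    stranded = []
    for job_id, remote_host in running_jobs.items():
        # Drop an optional "vmN@" prefix and compare case-insensitively.
        host = remote_host.split("@")[-1].lower()
        if host not in known:
            stranded.append(job_id)
    return sorted(stranded)


def hold_release_commands(job_ids):
    """Build (but do not execute) hold/release command lines per job."""
    commands = []
    for job_id in job_ids:
        commands.append(["condor_hold", job_id])
        commands.append(["condor_release", job_id])
    return commands


# Hypothetical example data, not taken from a real pool.
running = {
    "12.0": "vm1@node07.example.org",  # host crashed, no longer in the pool
    "13.0": "vm1@node03.example.org",  # host still present
}
pool = ["node03.example.org"]

stranded = find_stranded_jobs(running, pool)
commands = hold_release_commands(stranded)
for cmd in commands:
    print(" ".join(cmd))
```

Note that this only automates the hold-and-release step; it does nothing about the job staying idle afterwards.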

I'm not sure whether it's simply a configuration issue or a bug, so first I'd like to ask: which configuration settings might affect this behaviour? How can I make Condor realize that the scheduler is not getting any response from the starter, and then re-negotiate the job instead of waiting for hours?

(Using Condor 6.8.4 on Windows XP.)