[Condor-users] job restart delay after system crash
- Date: Thu, 19 Apr 2007 18:11:35 +0200
- From: Horvátth Szabolcs <szabolcs@xxxxxxxxxxxxx>
We have had several very similar system crashes lately (caused by disk
failures on a specific type of hardware), and Condor responded much the
same way every time: after the computer that runs the starter dies,
condor_q still reports the job as running on that machine indefinitely
(for more than 6 hours at least), even though condor_status does not
show that computer at all.
The only way I have found to remove the job from the crashed machine is
to hold and release it (condor_vacate and condor_vacate_job do not
work), but even after the release the job is not restarted and stays in
the idle state for hours.
I'm not sure whether this is simply a configuration issue or a bug, so
first I'd like to ask: which configuration settings might affect this
behaviour? How can I make Condor realize that the scheduler is not
getting any response from the starter, and then re-negotiate the job
instead of waiting for hours?
(Using Condor 6.8.4 on Windows XP.)