Re: [Condor-users] VMware job "trapped" in a deadlock! What to do?

On Aug 6, 2010, at 9:55 AM, Rob wrote:

> The problem I encounter is:
> 1. The job's log file tells me that a VM job has been evicted.
> 2. However, condor keeps telling me that this VM job is still running.
> 3. And this condition persists for many, many hours, probably for ever!
> How can I get out of this apparent deadlock of the job and
> tell Condor to reschedule the job from the last checkpoint?

Here's what I've learned from the logs you emailed to me:

The job was indeed evicted when the user log says it was, and it returned to idle status. 35 minutes later, it was matched to the same machine and Condor tried to restart it there. During file transfer, the execute machine's SUSPEND expression started evaluating to True. The startd then failed to deliver a message to the starter, which was too busy transferring the job's files to respond. The starter eventually exited, but for some unknown reason the shadow still had an open connection to the execute machine, even though that connection should close when the starter exits. As a result, the shadow kept waiting for the starter to retry the file transfer. Only when the execute machine was rebooted did the shadow notice that the connection had closed.

You can reduce the chance of this happening in the future by setting STARTD_SENDS_ALIVES=True in your config file.
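
As a rough sketch, the setting would look something like the fragment below. Only the STARTD_SENDS_ALIVES line comes from the advice above; putting it in the local config file (often condor_config.local) and applying it on both the submit and execute machines are assumptions about a typical setup, so check them against the manual for your Condor version.

```
# Assumed location: the machine's local Condor config file
# (for example condor_config.local; the path varies by installation).

# Have the startd send the keep-alive messages for a claim,
# as suggested above.
STARTD_SENDS_ALIVES = True
```

After editing the file, running condor_reconfig should make the daemons re-read their configuration, and condor_config_val STARTD_SENDS_ALIVES should show the value currently in effect. Some settings only take effect after a daemon restart, and I have not verified whether this is one of them.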

Thanks and regards,
Jaime Frey
UW-Madison Condor Team