
Re: [Condor-users] VMware job "trapped" in a deadlock! What to do?

On Tue, 10 Aug 2010 01:29, Rob wrote:
> On Mon, 9 Aug 2010 12:07 Jaime Frey wrote:
>> On Aug 6, 2010, at 9:55 AM, Rob wrote:
>>> The problem I encounter is:
>>> 1. The job's log file tells me that a VM job has been evicted.
>>> 2. However, condor keeps telling me that this VM job is still running.
>>> 3. And this condition persists for many, many hours, probably for ever!
>>> How can I get out of this apparent deadlock of the job and
>>> tell Condor to reschedule the job from the last checkpoint?
>> Here's what I've learned from the logs you emailed to me:
>> The job was indeed evicted when user log indicates, and returned to idle
>> status. 35 minutes later, it was matched to the same machine and Condor
>> tried to restart it there. During file transfer, the execute machine's
>> SUSPEND expression started evaluating to True. The startd failed to send
>> a message to the starter, which was too busy transferring the job's files.
>> The starter ended up exiting, but for some unknown reason, the shadow still
>> had an open connection to the execute machine. That connection should close
>> when the starter exits. So the shadow waited for the starter to retry the
>> file transfer. Only when the execute machine was rebooted did the shadow
>> notice the connection close.
>> You can reduce the chance of this happening in the future by setting
>> STARTD_SENDS_ALIVES=True in your config file.
> Should I set this on the Master, on the pool PC, or both?
> Thanks for your help!

I found a one-year-old email in the Condor archives:

Is the info here still valid?
For example, that "STARTD_SENDS_ALIVES = True" must be set on both
the master and the pool PCs, and that when using it, PeriodicHold
and PeriodicRelease also need to be set accordingly?

Also, the info on the next stable release


has this for 7.4.3 on STARTD_SENDS_ALIVES:

* Fixed a problem that caused the condor_startd daemon to crash
  in some cases when STARTD_SENDS_ALIVES was True.
  This setting is False by default. 

All the PCs here run Condor 7.4.2. Do I have to worry?
Or would it be better to wait until 7.4.3 comes out and only then
set STARTD_SENDS_ALIVES in the configs?
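
To be concrete, the change I have in mind is a single line in the Condor
config file. This is only a sketch of what Jaime suggested; whether it belongs
on the master, on the pool PCs, or on both is exactly what I am unsure about,
and the comments are my own reading of the 7.4.3 release note quoted above.

```
# Sketch only: the setting suggested earlier in this thread.
# Per the 7.4.3 release note, the default is False, and 7.4.2 may
# still have the startd crash that 7.4.3 fixes.
STARTD_SENDS_ALIVES = True
```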

Thank you.