Re: [Condor-users] VMware job "trapped" in a deadlock! What to do?
- Date: Thu, 19 Aug 2010 09:37:23 -0500
- From: Jaime Frey <jfrey@xxxxxxxxxxx>
- Subject: Re: [Condor-users] VMware job "trapped" in a deadlock! What to do?
On Aug 10, 2010, at 6:37 PM, Rob wrote:
> On Tue, 10 Aug 2010 01:29, Rob wrote:
>> On Mon, 9 Aug 2010 12:07 Jaime Frey wrote:
>>> On Aug 6, 2010, at 9:55 AM, Rob wrote:
>>>> The problem I encounter is:
>>>> 1. The job's log file tells me that a VM job has been evicted.
>>>> 2. However, condor keeps telling me that this VM job is still running.
>>>> 3. And this condition persists for many, many hours, probably for ever!
>>>> How can I get out of this apparent deadlock of the job and
>>>> tell Condor to reschedule the job from the last checkpoint?
>>> Here's what I've learned from the logs you emailed to me:
>>> The job was indeed evicted when user log indicates, and returned to idle
>>> status. 35 minutes later, it was matched to the same machine and Condor
>>> tried to restart it there. During file transfer, the execute machine's
>>> SUSPEND expression started evaluating to True. The startd failed to send
>>> a message to the starter, which was too busy transferring the job's files.
>>> The starter ended up exiting, but for some unknown reason, the shadow still
>>> had an open connection to the execute machine. That connection should close
>>> when the starter exits. So the shadow waited for the starter to retry the
>>> file transfer. Only when the execute machine was rebooted did the shadow
>>> notice the connection close.
>>> You can reduce the chance of this happening in the future by setting
>>> STARTD_SENDS_ALIVES=True in your config file.
>> Should I set this on the Master, on the pool PC, or both?
>> Thanks for your help!
> I found a 1-year old email in the condor archives:
> Is the info here still valid?
> Such as "STARTD_SENDS_ALIVES = True" must be set on both,
> master and pool PCs; and when using this, also PeriodicHold
> and PeriodicRelease need to be set accordingly?
Sorry for the delayed response. The information in that post is mostly valid. STARTD_SENDS_ALIVES must be set on all submit and execute machines. With current versions of Condor, I believe the job will be rescheduled without setting PeriodicHold and PeriodicRelease in a case such as yours.
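For reference, a minimal sketch of how the setting might look in the local configuration file on each submit and execute machine. The file name condor_config.local is an assumption; use whichever configuration file your installation actually reads.

    # condor_config.local (assumed file name)
    # Add on every submit machine and every execute machine.
    STARTD_SENDS_ALIVES = True

Running "condor_config_val STARTD_SENDS_ALIVES" on a machine should show the value Condor sees there. The daemons will most likely need to be reconfigured or restarted before the change takes effect.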
> Also, the info on the next stable release
> has this for 7.4.3 on STARTD_SENDS_ALIVES:
> * Fixed a problem that caused the condor_startd daemon to crash
> in some cases when STARTD_SENDS_ALIVES was True.
> This setting is False by default.
> All the PCs here have condor 7.4.2. Do I have to worry?
> Or should I better wait until 7.4.3 comes out and then
> implement STARTD_SENDS_ALIVES in the configs?
7.4.3 is now out, so you should be fine if you upgrade.
Starting in the forthcoming Condor 7.5.4, STARTD_SENDS_ALIVES will be true by default.
Thanks and regards,
UW-Madison Condor Team