
Re: [Condor-users] SUSPEND/CONTINUE puzzle



Here are my SUSPEND/CONTINUE expressions from condor_config.local. I
just upgraded to the latest version, 7.6.1, which did help detect
keyboard activity properly, but jobs still cycle between suspend and
continue every 5 seconds. This makes me suspect the problem lies in the
expression that suspends on high non-Condor load.

HighLoad        = 0.8
BackgroundLoad  = 0.3

# time keyboard must be idle to start job
StartIdleTime 		= 5 * $(MINUTE)
# max time to allow a job in suspension
MaxSuspendTime		=  2 * $(HOUR)
# if keyboard idle for this time, continue suspended job
ContinueIdleTime	= 5 * $(MINUTE)

KeyboardBusy        = (KeyboardIdle < $(StartIdleTime))
ConsoleBusy        = (ConsoleIdle  < $(StartIdleTime))
ConsoleNotBusy        = ($(ConsoleBusy) == False)
KeyorConBusy        = ($(KeyboardBusy) || $(ConsoleBusy))
KeyorConNotBusy        = ($(KeyorConBusy) == False)

# Suspend job on Slots 1 or 2 if keyboard is touched
# or the Slot has a high non-condor load;
# but don't suspend if job suspension time exceeds limit
SUSPEND1     = (SlotID <= 2 && $(KeyorConBusy))
SUSPEND2     = ( $(NonCondorLoadAvg) > $(HighLoad) )
SUSPEND3     = ( (TotalJobSuspendTime =!= UNDEFINED) && \
                 (TotalJobSuspendTime <= $(MaxSuspendTime)) \
                 || (TotalJobSuspendTime =?= UNDEFINED) )
SUSPEND        = $(SUSPEND3) && ( $(SUSPEND1) || $(SUSPEND2) )

# continue on Slots 1 & 2 if keyboard not used,
# or Slot's non-condor load drops,
# or job has been suspended more than max suspend time
CONTINUE1     = (SlotID <= 2 && $(KeyorConNotBusy))
CONTINUE2     = (SlotID > 2 && $(NonCondorLoadAvg) <= $(BackgroundLoad))
CONTINUE3     = ((TotalJobSuspendTime =!= UNDEFINED) && \
                 (TotalJobSuspendTime > $(MaxSuspendTime)))

CONTINUE     = $(CONTINUE3) || $(CONTINUE1) || $(CONTINUE2)
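
One rough way to test that suspicion is to model the expressions outside
Condor and feed them sample slot values. The sketch below is Python, not
Condor config, and is only an approximation: the function name
evaluate_policy and the sample inputs are made up for illustration, and
ClassAd UNDEFINED is modelled as None rather than with real three-valued
ClassAd evaluation.

```python
# Rough model of the SUSPEND/CONTINUE policy above (illustrative only).
# Assumption: ClassAd UNDEFINED is represented as None.
MINUTE = 60
HOUR = 60 * MINUTE

HIGH_LOAD = 0.8
BACKGROUND_LOAD = 0.3
START_IDLE_TIME = 5 * MINUTE
MAX_SUSPEND_TIME = 2 * HOUR


def evaluate_policy(slot_id, keyboard_idle, console_idle,
                    non_condor_load, total_suspend_time=None):
    """Return (SUSPEND, CONTINUE) for one snapshot of slot attributes."""
    key_or_con_busy = (keyboard_idle < START_IDLE_TIME
                       or console_idle < START_IDLE_TIME)

    # SUSPEND1..3 as written in the config
    suspend1 = slot_id <= 2 and key_or_con_busy
    suspend2 = non_condor_load > HIGH_LOAD  # note: not limited by SlotID
    suspend3 = (total_suspend_time is None
                or total_suspend_time <= MAX_SUSPEND_TIME)
    suspend = suspend3 and (suspend1 or suspend2)

    # CONTINUE1..3 as written in the config
    continue1 = slot_id <= 2 and not key_or_con_busy
    continue2 = slot_id > 2 and non_condor_load <= BACKGROUND_LOAD
    continue3 = (total_suspend_time is not None
                 and total_suspend_time > MAX_SUSPEND_TIME)
    cont = continue3 or continue1 or continue2

    return suspend, cont


# Hypothetical snapshot: slot 1, keyboard and console idle for an hour,
# non-Condor load above HighLoad.
print(evaluate_policy(1, 1 * HOUR, 1 * HOUR, 0.9))
```

With those sample values the model reports both SUSPEND and CONTINUE as
true: SUSPEND2 fires on load for any slot, while CONTINUE1 looks only at
the keyboard on slots 1 and 2. If the real NonCondorLoadAvg on slot1 is
above HighLoad while the keyboard is idle, that would be consistent with
the flip-flop in the log, though the model cannot confirm that this is
what the startd actually sees.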


On Wed, Jul 13, 2011 at 4:41 AM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
>
> On 07/12/2011 07:35 PM, Ralph&Maria Finch wrote:
>>
>> condor -version
>> $CondorVersion: 7.5.3 Jun 24 2010 BuildID: 250654 $
>> $CondorPlatform: INTEL-WINNT50 $
>>
>> Given the Windows platform, I implement a SUSPEND policy. If the
>> keyboard is touched in the last 5 minutes, or if the non-Condor load
>> reaches a high value, I want to SUSPEND the job. Then CONTINUE the job
>> when the keyboard is untouched for 5 minutes and the load is below the
>> limit.
>>
>> Unfortunately I have something wrong and the jobs SUSPEND/CONTINUE every
>> 5 seconds:
>>
>> 07/12/11 16:32:21 slot1: Sent update to 1 collector(s)
>> 07/12/11 16:32:22 slot1: State change: SUSPEND is TRUE
>> 07/12/11 16:32:22 slot1: Changing activity: Busy -> Suspended
>> 07/12/11 16:32:22 slot1: In Starter::kill() with pid 5372, sig 100 (DC_SIGSUSPEND)
>> 07/12/11 16:32:23 slot1: Received job ClassAd update from starter.
>> 07/12/11 16:32:26 Trying to update collector <123.456.78.910:9618>
>> 07/12/11 16:32:26 Attempting to send update via UDP to collector delta-mod.water.ca.gov <123.456.78.910:9618>
>> 07/12/11 16:32:26 slot1: Sent update to 1 collector(s)
>> 07/12/11 16:32:27 slot1: State change: CONTINUE is TRUE
>> 07/12/11 16:32:27 slot1: In Starter::kill() with pid 5372, sig 101 (DC_SIGCONTINUE)
>> 07/12/11 16:32:27 slot1: Changing activity: Suspended -> Busy
>> 07/12/11 16:32:27 slot1: Received job ClassAd update from starter.
>>
>>
>> Attempting to debug this, I set
>>
>> STARTD_DEBUG        = D_FULLDEBUG
>>
>> While this does give more information (see above), it doesn't state why
>> Condor decides to SUSPEND or CONTINUE a job.  And that piece of
>> information I need to see what is wrong with my condition statement.
>> What can I do to see why Condor is changing the state of a job?
>>
>> Ralph Finch
>> Calif. Dept. of Water Resources
>> Sacramento, CA USA
>
> Please include your SUSPEND/CONTINUE expressions.
>
> You can try wrapping them in debug(), though it may not be available in 7.5.3.
>
> Best,
>
>
> matt