
[HTCondor-users] Help understanding SUSPEND expression



Hello,

despite having used HTCondor for a long time, today I realized
that I'm having trouble with the SUSPEND expression, so I hope somebody
can shed some light here...

For a long time, the SUSPEND expression in our machines has been:

,----
| SUSPEND = ( ((CpuBusyTime > 2 * $(MINUTE)) && ($(ActivationTimer) > 90)) \
|             || ( (WorkerType == "desktop" || WorkerType == "burro_pro") && $(KeyboardBusy) ) )
`----

WorkerType is just a custom attribute that we add to our machines, so
that on some of them keyboard activity is taken into account while on
others it is not.

Anyway, today I found that HTCondor was not evicting jobs despite a high
load on the machine, so I looked at the SUSPEND expression.

I see references to CpuBusyTime in several documents, but our machine no
longer has any CpuBusyTime info (this is HTCondor 23.0.4); the following
prints nothing:

,----
| $ condor_status -l xxxx.xxx | grep -i cpubusytime
`----

so I guess it is no surprise that the job was not being evicted, since I
assume SUSPEND never evaluated to True.
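
To make that assumption concrete, here is a minimal Python sketch of how
the old SUSPEND expression would evaluate under ClassAd three-valued
logic when CpuBusyTime is missing. It is only a model, not HTCondor code:
UNDEFINED is represented by None, the helper names (tri_and, tri_or,
greater_than, old_suspend) are made up for illustration, and "compute" is
a hypothetical WorkerType value. WorkerType itself is assumed to be
defined.

```python
from typing import Optional

Tri = Optional[bool]  # True, False, or None standing in for UNDEFINED
MINUTE = 60


def tri_and(a: Tri, b: Tri) -> Tri:
    """ClassAd-style &&: False wins, otherwise UNDEFINED propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def tri_or(a: Tri, b: Tri) -> Tri:
    """ClassAd-style ||: True wins, otherwise UNDEFINED propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def greater_than(attr: Optional[int], limit: int) -> Tri:
    """Comparing an undefined attribute yields UNDEFINED."""
    return None if attr is None else attr > limit


def old_suspend(cpu_busy_time: Optional[int], activation_timer: int,
                worker_type: str, keyboard_busy: bool) -> Tri:
    """Model of the old SUSPEND expression quoted above."""
    cpu_part = tri_and(greater_than(cpu_busy_time, 2 * MINUTE),
                       greater_than(activation_timer, 90))
    keyboard_part = tri_and(worker_type in ("desktop", "burro_pro"),
                            keyboard_busy)
    return tri_or(cpu_part, keyboard_part)


# CpuBusyTime missing, no keyboard activity: the result is UNDEFINED.
print(old_suspend(None, 500, "compute", False))
# CpuBusyTime missing, busy keyboard on a desktop: only this path gives True.
print(old_suspend(None, 500, "desktop", True))
```

In this model the CPU half of the expression can only come out as
UNDEFINED or False once CpuBusyTime is gone, so the keyboard half is the
only way left for SUSPEND to become True.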

I assumed that this has probably changed in some recent HTCondor
version, so I looked into the current POLICY:DESKTOP template, which
reads:

# $ condor_config_val use policy:desktop
# use POLICY:DESKTOP is
#       if ! defined PolicyExprFragments
#               use FEATURE : POLICY_EXPR_FRAGMENTS
#       endif
#       STARTD_LATCH_EXPRS = $(STARTD_LATCH_EXPRS) CpuBusy
#       CpuBusyTimer=IfThenElse(CpuBusyValue is 1, time() - CpuBusyTime, 0)
#       WANT_SUSPEND=($(SmallJob) || $(KeyboardNotBusy) || $(IsVanilla) ) && ( $(SUSPEND))
#       WANT_VACATE=$(ActivationTimer) > 600 || $(IsVanilla)
#       SUSPEND=($(KeyboardBusy) || ( ($(CpuBusyTimer) > 120) && $(ActivationTimer) > 90))
#       CONTINUE=($(CPUIdle) && ($(ActivityTimer) > 10) && (KeyboardIdle > $(ContinueIdleTime)))
#       PREEMPT=(((Activity == "Suspended") && ($(ActivityTimer) > $(MaxSuspendTime))) || (SUSPEND && (WANT_SUSPEND == False)))
#       START=((KeyboardIdle > $(StartIdleTime)) && ( $(CPUIdle) || (State != "Unclaimed" && State != "Owner")) )
#       KILL=False
#       MaxJobRetirementTime=0
#       CLAIM_WORKLIFE=
#       SLOTS_CONNECTED_TO_KEYBOARD=1024*1024
#       SLOTS_CONNECTED_TO_CONSOLE=1024*1024
#       IS_OWNER=(START =?= False)



OK, so SUSPEND is defined in terms of the macro CpuBusyTimer, which is
defined in terms of CpuBusyValue and CpuBusyTime, but none of them appear
to be defined:

,----
| $ condor_status -l xxxx.xxx | grep -i cpubusytimer
`----

so if I understand correctly, that part of the expression,
($(CpuBusyTimer) > 120), is never going to be true, and as such my jobs
will never be suspended.
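
As a rough check of that reading, the CpuBusyTimer macro from the template
can be modelled in a few lines of Python. This is a sketch under stated
assumptions, not the startd's actual logic: None stands for an undefined
attribute, "is" is treated as a strict comparison that is simply False
against an undefined value, and the function names and the explicit "now"
parameter are invented for the example.

```python
import time
from typing import Optional


def cpu_busy_timer(cpu_busy_value: Optional[int],
                   cpu_busy_time: Optional[int],
                   now: Optional[int] = None) -> Optional[int]:
    """Model of IfThenElse(CpuBusyValue is 1, time() - CpuBusyTime, 0)."""
    if now is None:
        now = int(time.time())
    # Strict match: only an integer 1 counts; None (undefined) does not.
    if type(cpu_busy_value) is int and cpu_busy_value == 1:
        # Arithmetic on an undefined attribute stays undefined.
        return None if cpu_busy_time is None else now - cpu_busy_time
    return 0


def cpu_clause_fires(cpu_busy_value: Optional[int],
                     cpu_busy_time: Optional[int],
                     activation_timer: int,
                     now: Optional[int] = None) -> bool:
    """Model of ($(CpuBusyTimer) > 120) && $(ActivationTimer) > 90."""
    timer = cpu_busy_timer(cpu_busy_value, cpu_busy_time, now)
    if timer is None:
        return False  # an undefined comparison never counts as True here
    return timer > 120 and activation_timer > 90


# Neither attribute present in the machine ad: the timer falls back to 0.
print(cpu_busy_timer(None, None, now=1000))
# Attributes present, CPU busy for 300 seconds: the clause would fire.
print(cpu_clause_fires(1, 700, 500, now=1000))
```

Under these assumptions, a machine ad without CpuBusyValue always yields a
timer of 0, so the CPU clause cannot fire; whether the attributes really
are absent from the ad is exactly what the question above is about.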

It has been a long time since I played with these expressions, but surely
I'm missing something?

Any help appreciated,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747

 GPG: 0x8BDC390B69033F52