
Re: [HTCondor-users] Understanding Condor Policies on Jobs

On 3/22/2013 3:18 AM, Andrey Kuznetsov wrote:

I've been reading the documentation, and slowly figuring out what is
what, but some things are unclear.

If you are going to be the HTCondor admin at your site, you may be interested in viewing one of our HTCondorWeek Administration tutorials from a past HTCondor Week workshop (or come to Madison this May for live tutorials!). Materials/slides from past HTCondor Weeks are online; back in 2008 Red Hat was kind enough to record some of them. At URL
see the "Administrating Condor" tutorial, which I think covers most of the concepts/questions you are asking below.

I will take a quick pass at answering your questions inline below, but of course the tutorial does a much better job than my pithy comments...

From the documentation, WANT_SUSPEND = A boolean expression that, when
True, tells Condor to evaluate the SUSPEND expression.
SUSPEND = A boolean expression that, when True, causes Condor to
suspend running a Condor job. The machine may still be claimed, but
the job makes no further progress, and Condor does not generate a load
on the machine.
From default config, UWCS_WANT_SUSPEND = ( $(SmallJob) ||
$(KeyboardNotBusy) || $(IsVanilla) ) && ( $(SUSPEND) )
So SUSPEND will be evaluated if the job is small, likely some kind of
error in the job, but I am having trouble understanding the rest.

The default UWCS policy expressions in the default config are not as simple as they could (should?) be. For better or worse, these expressions relate to the default policy that was in use at the UW-Madison Computer Sciences department a while back. Something you should know is that there are a lot of standard universe jobs (aka relinked with condor_compile so they can checkpoint, and with 'universe=standard' in the submit description file) submitted at UW-Madison. Since standard universe jobs can checkpoint and restart right where they left off, the UWCS policy expressions are optimized in many places to take advantage of that. If you are not relinking with condor_compile, you are probably submitting vanilla universe jobs. Off the top of my head, a simple setup for HTCondor to relinquish one processor core when someone is typing either on the console or via ssh would be:

  # Jobs can start anytime on slots > 1, and also can
  # start on slot 1 if there has been no keyboard activity for 15 min
  START = SlotID > 1 || KeyboardIdle > 900
  # When we see keyboard activity on Slot1, send the job a SIGTERM
  # and if the job is still around 10 seconds later send a SIGKILL.
  PREEMPT = SlotID == 1 && KeyboardIdle < 60
  MachineMaxVacateTime = 10
  KILL = False

Note that all the slot (machine) classads will be numbered via an attribute SlotID (SlotID=1, SlotID=2, etc), and KeyboardIdle will be the number of seconds the keyboard (or ssh) has not had any keystrokes.

Warning: I didn't test the above, I just wrote it in my email client :)

More inline below...

1) Why is SUSPEND evaluated if there is no user at the keyboard
"KeyboardNotBusy", shouldn't it be the opposite? If the keyboard is
busy then I want the SUSPEND to be evaluated on the basis that someone
is using the machine, thus I want the job to be suspended to free
resources/processor for the user.

Note that UWCS_WANT_SUSPEND says "... $(KeyboardNotBusy) || $(IsVanilla) ...".

So for vanilla jobs, it indeed works the way you thought it should. It is only if the job is not vanilla that KeyboardNotBusy comes into play. The thinking here is that if the job is standard universe, don't bother suspending the job, just checkpoint and migrate it to a different machine right away.
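
To make the vanilla vs. non-vanilla distinction concrete, here is a rough Python sketch of how the quoted UWCS_WANT_SUSPEND expression reduces. The function and argument names are hypothetical stand-ins for the config macros, and plain booleans are assumed (no UNDEFINED values), so treat it as an illustration and not as what the startd actually runs.

```python
def uwcs_want_suspend(small_job, keyboard_not_busy, is_vanilla, suspend):
    """Mirror of ( SmallJob || KeyboardNotBusy || IsVanilla ) && ( SUSPEND )."""
    return (small_job or keyboard_not_busy or is_vanilla) and suspend


# Vanilla job, busy keyboard: the first clause is True because of IsVanilla,
# so the whole expression reduces to the value of SUSPEND.
vanilla_busy_keyboard = uwcs_want_suspend(False, False, True, True)

# Large standard universe job, busy keyboard: suspension is not wanted,
# which leaves the decision to PREEMPT (checkpoint and migrate).
standard_busy_keyboard = uwcs_want_suspend(False, False, False, True)

print(vanilla_busy_keyboard, standard_busy_keyboard)
```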

2) Why is SUSPEND evaluated when the job is running in VANILLA
universe? We are submitting jobs under VANILLA universe and add our
own environmental variables inside the jobs. It doesn't make sense why
condor would attempt to suspend a VANILLA universe job.

The thinking is that VANILLA jobs cannot necessarily be checkpointed, and thus if they are bumped off the machine they would have to restart from the beginning. So the idea of suspending the job for a few minutes before killing it off is in hopes that the keyboard user will go away soon. Kinda a bummer if you have a job that runs for 12 hours, and at hour 11 a guy just checks his email for 3 minutes then leaves... it may be better to simply suspend the job for 3 minutes instead of forcing the job to start over and lose 11 hours of computing. (Of course, suspending may irritate some users... while a suspended job uses no CPU, it will still consume RAM and/or virtual memory.)

3) Why is SUSPEND in WANT_SUSPEND since when WANT_SUSPEND=TRUE, then
SUSPEND is evaluated, seems kind of redundant?!

I guess it is not how I would have written it...

Regarding, UWCS_CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) &&
(KeyboardIdle > $(ContinueIdleTime)) )
ActivityTimer = Amount of time in seconds in the current activity.
4) What kind of activity is the timer tracking? CONTINUE is supposed
to reactivate a suspended job, that means that when the machine is
free from users and nothing is running on it, then ActivityTimer is
somehow supposed to be non-zero, and thus > 10, so what is it
tracking? Is ActivityTimer tracking the time since last user
click/interaction was made, thus if the user steps away for more than
10 seconds, condor job will continue/resume?

Slots in HTCondor are always in a specific state and activity. You see this when you run condor_status. When HTCondor suspends a job (when SUSPEND becomes true), that slot will change from activity "Busy" to activity "Suspended" and then HTCondor evaluates CONTINUE. So in the above, $(ActivityTimer) represents the number of seconds the slot has been in the "Suspended" activity.

5) What's the purpose of WANT_SUSPEND and SUSPEND? Seems like they
accomplish the same thing, except you run the check twice. Does
WANT_SUSPEND have some other kind of use?

While a job is running, if WANT_SUSPEND is True, the HTCondor startd will continuously evaluate the SUSPEND expression. If WANT_SUSPEND is False, it will not even look at the SUSPEND expression and will just continuously evaluate the PREEMPT expression. So essentially it is just a way to let folks write less complicated expressions.
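
As a rough illustration of that flow, here is a small Python sketch. It is a simplification based only on the description in this email: the function name is made up, the inputs are assumed to be already-evaluated booleans, "Preempting" is just a label for "the job is on its way off the machine", and what PREEMPT does while a job is suspended is left out entirely.

```python
def next_step(activity, want_suspend, suspend, preempt, cont):
    """Return where a claimed slot goes after one policy check (simplified)."""
    if activity == "Busy":
        if want_suspend:
            # SUSPEND is only consulted when WANT_SUSPEND is True.
            return "Suspended" if suspend else "Busy"
        # WANT_SUSPEND is False: SUSPEND is ignored and PREEMPT decides.
        return "Preempting" if preempt else "Busy"
    if activity == "Suspended":
        # A suspended job resumes once CONTINUE becomes True.
        return "Busy" if cont else "Suspended"
    raise ValueError(f"activity not covered by this sketch: {activity!r}")


# Same SUSPEND and PREEMPT values, different WANT_SUSPEND.
print(next_step("Busy", True, True, True, False))
print(next_step("Busy", False, True, True, False))
```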

6) Why are some variables in the config in the bash form, and others
not, or is it a typo?
Take a look at where SUSPEND is evaluated:
UWCS_WANT_SUSPEND = ( $(SmallJob) || $(KeyboardNotBusy) ||
$(IsVanilla) ) && ( $(SUSPEND) )
UWCS_PREEMPT = ( ((Activity == "Suspended") && ($(ActivityTimer) >
$(MaxSuspendTime))) || (SUSPEND && (WANT_SUSPEND == False)) )

The ones in bash form, aka $(), are just simple macros expanded from elsewhere in the condor_config file. The ones without $() are likely referring to ClassAd attributes, which are either characteristics of the machine or characteristics of the job. I think the tutorials cover this pretty well...

7) Are variables case sensitive? In condor_config_var, they are
printed as all capitals, but in the defaults UWCS they are used often
as lower-case with first capital letters of the word:
"$(ActivityTimer)" vs "ACTIVITYTIMER = (time() -

Macro and attribute names are both case-insensitive. For instance, $(Hour) and $(HOUR) are interchangeable.
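
A toy Python sketch of the idea, in case it helps: $(Name) references are replaced with text from the config, names are matched without regard to case, and anything without $() is left alone for later evaluation as a ClassAd attribute. The helper name and the config value are made up for illustration, and this does a single pass only, which is not how the real config parser is implemented.

```python
import re


def expand_macros(expression, config):
    """Replace each $(Name) with its config value, ignoring case in Name."""
    lowered = {name.lower(): value for name, value in config.items()}

    def substitute(match):
        name = match.group(1)
        # Macros missing from the config are left untouched in this sketch.
        return lowered.get(name.lower(), match.group(0))

    return re.sub(r"\$\((\w+)\)", substitute, expression)


config = {"ContinueIdleTime": "300"}  # hypothetical value
a = expand_macros("KeyboardIdle > $(ContinueIdleTime)", config)
b = expand_macros("KeyboardIdle > $(CONTINUEIDLETIME)", config)

print(a)       # KeyboardIdle is not a macro, so it stays as written
print(a == b)  # the two spellings expand to the same text
```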

8) How do you differentiate between variables set/updated by condor
and variables that you define? Like SUSPEND is defined in the config
by user, but "KeyboardIdle" is not in the config.

If it has $() it is from the config file, if it does not have $() that means it is referring to an attribute about the machine (or job).

9) What is =?= and =!= ?


Essentially, what happens if you write foo == 5, but foo is not defined? Should it be True? False? In HTCondor, it will not be True or False, but will evaluate to UNDEFINED. This so-called three-value logic is common in databases as well (think of the NULL value). Three-value logic lets folks write policies that explicitly deal with cases where information is missing (e.g. I want folks to submit jobs and tell me their department in the submit file, and want to do something special if someone forgot to specify their department). If you never want to deal with UNDEFINED and just want good ol' boolean two-value logic, use =?= instead of ==, and =!= instead of !=.
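
Here is a minimal Python sketch of the difference, using the missing-department case. The helper names and the job attributes are hypothetical, and the sketch ignores other details of the real ClassAd operators (type checks, string case handling), so it only shows how UNDEFINED is treated.

```python
class Undefined:
    """Stand-in for the ClassAd UNDEFINED value."""

    def __repr__(self):
        return "UNDEFINED"


UNDEFINED = Undefined()


def eq(a, b):
    """Like ==: the result is UNDEFINED if either side is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b


def meta_eq(a, b):
    """Like =?=: always True or False, even when a side is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return a == b


def meta_ne(a, b):
    """Like =!=: the two-valued negation of =?=."""
    return not meta_eq(a, b)


job = {"Owner": "alice"}  # hypothetical job ad with no Department attribute
dept = job.get("Department", UNDEFINED)

print(eq(dept, "Physics"))       # three-valued comparison
print(meta_eq(dept, "Physics"))  # two-valued comparison
print(meta_ne(dept, "Physics"))
```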

I am using:

10) How does condor know which SlotID to reserve for the user when the
desktop is being used? Where is this set?

No idea off the top of my head. Note in my simple example above, I didn't bother with SLOTS_CONNECTED_TO_KEYBOARD myself, and instead explicitly referenced SlotID in my Start/Preempt expressions. Seems more clear/explicit to me (but in more complex configurations it may make more sense to use SLOTS_CONNECTED_TO_KEYBOARD...).

Here's what my SUSPEND looks like:
SUSPEND = ( ($(KeyboardBusy) || $(ConsoleBusy)) && ((SlotID <=
&& $(ActivationTimer) > 90)
In other words, if console or keyboard is being used, and the SlotID
is 1, meaning processor #1 out of a total of 4 processors (cores) in
my computer, and the job is mature, has been running for some time,
then suspend the job.
PREEMPT = ( ((Activity == "Suspended") && ($(ActivityTimer) >
$(MaxSuspendTime))) || (SUSPEND) )
WANT_SUSPEND = ( $(SmallJob) || $(KeyboardBusy) || $(ConsoleBusy) )
CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) && (KeyboardIdle >
$(ContinueIdleTime)) )

I welcome any suggestions to improve my attempts at forcing condor to
relinquish 1 processor when a user is utilizing the computer.

Thank you very much for taking a look.

Hope the above helps and welcome to HTCondor,