Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Preemption question

Date: Fri, 24 Mar 2006 13:23:00 -0600
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] Preemption question



Steven Timm wrote:

On Fri, 24 Mar 2006, Dan Bradley wrote:

There are a number of different possible causes of preemption in Condor,
and your policy eliminates most but not all of them.  The startd RANK
expression is treated as an overriding directive by the negotiator,
trumping the normal user-priority based calculations (and therefore
PREEMPTION_REQUIREMENTS).  This means that your policy will cause
precisely the kind of preemption that you have observed--members of
group "numi" will preempt other users.

All the various tutorials I've been to and manuals I have read
didn't tell me that.  Interesting.  I thought as long as we
had PREEMPTION_REQUIREMENTS false we wouldn't preempt.

Thanks for pointing that out. I am now fixing several places in the manual where this misleading impression is given. The one place that told the truth is here:


http://www.cs.wisc.edu/condor/manual/v6.7/3_5User_Priorities.html#15677

"Note that PREEMPTION_REQUIREMENTS only applies to preemptions due to user priority. It does not have any effect if the machine rank expression prefers a different job, or if the startd policy expression causes the job to vacate due to other activity on the machine."

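As a rough sketch of what that note means in configuration terms, each of the three causes has its own knob. The fragment below only illustrates which setting governs which cause. It is not a policy I am recommending for your pool, and the comments are my own summary rather than manual text:

```
# 1. Preemption due to user priority (evaluated by the negotiator)
PREEMPTION_REQUIREMENTS = False

# 2. Preemption due to machine rank (startd): ranking all jobs equally
#    means no job is ever preferred over the one already running
RANK = 0

# 3. Preemption due to other activity on the machine (startd policy)
PREEMPT = False
```

Setting only the first of these leaves the other two in force.
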
The effect we want to have is the following:

these 15 machines are owned by group_numi.
If the queue is full and all machines are claimed, and there
are jobs waiting from both group_numi and from others, then
on these 15 machines we want the job from group_numi to start,
independent of what user priority group_numi may have at the time.


This could be achieved with a policy such as the following:

RANK = (agroup == "group_numi") * 1000
# Allow preempted jobs a total of 4 days of wall time
MaxJobRetirementTime = 3600 * 24 * 4
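
One assumption in the RANK expression above is that every job ad defines agroup. With ==, a job ad that lacks the attribute makes the comparison evaluate to UNDEFINED rather than False. If that is a concern, a variant using the =?= operator always yields either 0 or 1000 (this is a sketch, untested on your pool):

```
# =?= never evaluates to UNDEFINED, so a job without agroup simply gets 0
RANK = (agroup =?= "group_numi") * 1000
```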


However, there is one additional consideration.  The above policy doesn't say that group_numi jobs will preferentially run on group_numi machines.  It just says that they have high priority to do so.  Therefore, if there are available machines in both group_numi and elsewhere, a group_numi job could land on either one with no preference either way.  This may or may not be what you want.  To preferentially steer group_numi jobs to group_numi machines, you can do something like the following:

MachineGroup = "group_numi"
STARTD_EXPRS = $(STARTD_EXPRS) MachineGroup
NEGOTIATOR_PRE_JOB_RANK = (agroup =?= MachineGroup)*1 + (RemoteOwner =?= UNDEFINED)*2


That says to preferentially run jobs on idle machines and secondarily to prefer machines belonging to the same group.
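
To make the ordering concrete, here are the values the expression takes for a job in each of the four possible cases. Higher values are preferred, and RemoteOwner is undefined in the machine ad only while the machine is unclaimed. The table is my own working-through of the arithmetic, written as config comments:

```
# NEGOTIATOR_PRE_JOB_RANK =
#     (agroup =?= MachineGroup)*1 + (RemoteOwner =?= UNDEFINED)*2
#
#   machine      job's agroup matches MachineGroup?   value
#   unclaimed    yes                                  1 + 2 = 3
#   unclaimed    no                                   0 + 2 = 2
#   claimed      yes                                  1 + 0 = 1
#   claimed      no                                   0 + 0 = 0
```

Because the "unclaimed" term carries the larger weight, any unclaimed machine outranks any claimed one, and group membership only breaks ties within each of those two sets.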

We would really rather not
have pre-emption happen at all, even if the cost is some idle
time on the cluster every once in a while.

By this, I assume you mean that you don't want _job_ preemption. In some cases you still appear to want _claim_ preemption. If that is the case, then setting MaxJobRetirementTime to a very large number is a good solution.
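
A minimal sketch of such a setting follows. The one-year figure is an arbitrary illustration, not a recommended value; anything comfortably longer than your longest job should behave the same way:

```
# Illustrative value only: roughly one year, expressed in seconds
MaxJobRetirementTime = 3600 * 24 * 365
```

With this in place the claim can still be preempted, but the job that is already running is left to finish before the new claimant gets the machine.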

I had no idea up until now that a user through the schedd could keep a claim on a machine between the finishing of a job and the start of a new one. Where is there more information in the condor docs that
describes this situation? We may have to rethink our whole
strategy on how we do our batch system here.

In V6.6, there was no good solution for this problem (because MaxJobRetirementTime did not exist). Therefore, it is documented in the V6.6 manual in the section on disabling preemption:


http://www.cs.wisc.edu/condor/manual/v6.6/3_6Startd_Policy.html#SECTION00469500000000000000

However, in V6.7, this bit of knowledge does not appear in the manual, because the section on disabling preemption offers a solution that avoids the problem. Clearly, this section should still discuss some of the problems with alternate preemption-avoiding policies, because they are not obvious.


http://www.cs.wisc.edu/condor/manual/v6.7/3_6Startd_Policy.html#SECTION004610500000000000000

--Dan

