Re: [Condor-users] Help with scheduling start/evict policy in condor_config
- Date: Wed, 14 Apr 2010 10:48:38 -0500
- From: Dan Bradley <dan@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] Help with scheduling start/evict policy in condor_config
One likely source of trouble in this policy is that RANK is inherently a
preemptive mechanism. RANK is only relevant when deciding whether to
preempt an existing job with a new better-ranked one. This can lead to
rapid cycles of preemption in some cases.
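For illustration (the group name here is hypothetical, not from the ticket), a machine RANK of this shape prefers one accounting group; when a job from that group matches an already-claimed slot, the startd will preempt the lower-ranked running job even though PREEMPT itself is False:

```
# Hypothetical machine RANK: jobs whose AccountingGroup matches
# "group_priv" rank 1, everyone else ranks 0.  A newly matched
# group_priv job can preempt a running rank-0 job.
RANK = regexp("group_priv", TARGET.AccountingGroup)
```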
One case I have seen is this:
User with high (i.e. good) RANK has a high (i.e. _bad_) user priority.
A different user has a low RANK and a low (i.e. _good_) user priority.
When a machine is idle, RANK is irrelevant and only user priority is
taken into account. Therefore, the user with a low RANK but good user
priority will be scheduled to run on the idle machine. In the next
round of negotiation, the user with a high RANK will preempt the other user.
Typically, this is only a problem if the user with high RANK does not
keep the machine claimed for very long, because then the whole process
repeats frequently. If I read the ticket correctly, CLAIM_WORKLIFE is
1200, so if the high RANK user has relatively short jobs, then I'd
expect the cycle to be repeating every 20 minutes or so.
What to do about this?
One thing is to make sure users with low RANK also have a high (bad)
user priority relative to users with high RANK. This can be done by
adjusting their priority factors with condor_userprio.
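For example (user names and factor values below are illustrative only), priority factors can be set per user with condor_userprio; a larger factor means worse priority:

```
# Give the low-RANK users a worse (larger) priority factor than the
# high-RANK users, so the high-RANK users win idle machines outright
# and the preemption cycle never starts.
condor_userprio -setfactor public_user@example.edu 100000
condor_userprio -setfactor priv_user@example.edu    1000
```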
Another thing to do is to use MaxJobRetirementTime to prevent preemption
from happening quickly. The down side of this, of course, is that the
high ranked users don't get immediate access to the machines when they
want them.
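A retirement time goes in the startd configuration; the value below is only an example, not a recommendation:

```
# Let a preempted job keep running for up to 2 hours before it is
# actually evicted.  RANK-based matches still occur, but the running
# job gets to retire gracefully instead of cycling every few minutes.
MAXJOBRETIREMENTTIME = 2 * $(HOUR)
```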
Hope that helps.
Ian Stokes-Rees wrote:
We've got a situation where many jobs in a condor pool repeatedly go
through queue/execute/evict loops until they hit a timeout or retry
attempt limit. What is odd is that they execute for 1-2 minutes before
being evicted, then restart again a few minutes later, only to be
evicted again after 1-2 minutes. Inside of 10 minutes the same job in
the same pool can be started 3 times and evicted 3 times.
The pool effectively (but not in reality) has two groups of users:
privileged and public. The core of the policy is supposed to be quite
simple: privileged jobs evict public jobs if there are public jobs
running and no free job slots. The actual policy currently seems to
evict public jobs even if there are free job slots, which means the
public jobs then get restarted immediately into the free job slots, only
to be evicted again on the next scheduling cycle when a privileged job
considers the state of the job slot (running public job? evict it and
start the privileged job).
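One negotiator-side knob that can push toward the intended "fill free slots before preempting" behavior (a sketch, not a tested fix for this pool, and not part of the config below) is NEGOTIATOR_PRE_JOB_RANK:

```
# Strongly prefer slots with no running job (RemoteOwner undefined),
# so privileged jobs land on idle slots first and only consider
# preemption when none are free.  The weighting is illustrative.
NEGOTIATOR_PRE_JOB_RANK = (1000000 * (RemoteOwner =?= UNDEFINED))
```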
We haven't had a lot of luck figuring out why this happens. We are
almost certain it is something in the RANK, START, and PREEMPT
expressions. Below I include what I think are the relevant extracts
from condor_config. More details are in a ticket here:
https://ticket.grid.iu.edu/goc/viewer?id=8375 (click "+ Show More" to
see all the details for the latest entry).
DEFAULT_PRIO_FACTOR = 10000
MACHINEBUSY = ($(CPUBusy) || $(KeyboardBusy))
MAXSUSPENDTIME = 10 * $(MINUTE)
MAXVACATETIME = 10 * $(MINUTE)
PREEMPT = False
PREEMPTION_RANK = 0
PREEMPTION_REQUIREMENTS = False
RANK = (regexp("group_cmsuser",TARGET.AccountingGroup) ||
START = (Owner != "cdf") && ((TARGET.IsMadgraph =!= TRUE) ||
(TARGET.IsMadgraph == UNDEFINED) || (SlotID == 1)) && ($(RANK) ||
isUndefined(LastHeardFrom) || LastHeardFrom-EnteredCurrentState>600)
WANT_SUSPEND = False
WANT_VACATE = False