
Re: [Condor-users] Help with scheduling start/evict policy in condor_config

Hi Ian,

One likely source of trouble in this policy is that RANK is inherently a preemptive mechanism: it only comes into play when deciding whether to preempt a job already running on a machine in favor of a better-ranked one. In some configurations this leads to rapid cycles of preemption.

One case I have seen is this:

Suppose a user with a high (i.e. good) RANK has a high (i.e. _bad_) user priority, while a different user has a low RANK but a low (i.e. _good_) user priority. When a machine is idle, RANK is irrelevant and only user priority is taken into account, so the user with the low RANK but good user priority gets scheduled onto the idle machine. In the next round of negotiation, the user with the high RANK preempts them.

Typically, this is only a problem when the high-RANK user does not keep the machine claimed for very long, because then the whole process repeats frequently. If I read the ticket correctly, CLAIM_WORKLIFE is 1200, so if the high-RANK user has relatively short jobs, I'd expect the cycle to repeat every 20 minutes or so.

What to do about this?

One option is to make sure users with low RANK also have a high (bad) user priority relative to users with high RANK. This can be done with priority factors.
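For example, priority factors can be set per user with condor_userprio. A sketch (the username and factor value here are made up for illustration; see condor_userprio's help output for your version's exact syntax):

```
# Raise the priority factor of a low-RANK user, worsening their
# effective user priority so that high-RANK users win idle machines
# in negotiation first. The user name and 10.0 are illustrative only.
condor_userprio -setfactor publicuser@example.edu 10.0
```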

Another option is to use MaxJobRetirementTime to keep preemption from happening quickly. The downside, of course, is that the high-ranked users don't get immediate access to the machines when they need them.
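A minimal sketch of what that might look like in the startd configuration (the 3600-second value is an arbitrary illustration, not something from the ticket):

```
# Let a preempted job run for up to 1 hour in "retirement" before it
# is actually evicted; this damps rapid preemption cycles.
# The value (3600 seconds) is illustrative only.
MAXJOBRETIREMENTTIME = 3600
```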

Hope that helps.


Ian Stokes-Rees wrote:
We've got a situation where many jobs in a condor pool repeatedly go
through queue/execute/evict loops until they hit a timeout or retry
attempt limit.  What is odd is that they execute for 1-2 minutes before
being evicted, then restart again a few minutes later, only to be
evicted again after 1-2 minutes.  Inside of 10 minutes the same job in
the same pool can be started 3 times and evicted 3 times.

The pool effectively (but not in reality) has two groups of users:
privileged and public.  The core of the policy is supposed to be quite
simple: privileged jobs evict public jobs if there are public jobs
running and no free job slots.  The actual policy currently seems to
evict public jobs even if there are free job slots, which means the
public jobs then get restarted immediately into the free job slots, only
to be evicted again on the next scheduling cycle when a privileged job
considers the state of the job slot (running public job? evict and start
privileged job!).

We haven't had a lot of luck figuring out why this happens.  We are
almost certain it is something in the RANK, START, and PREEMPT
expressions.  Below I include what I think are the relevant extracts
from condor_config.  More details are in a ticket here:

https://ticket.grid.iu.edu/goc/viewer?id=8375  (click "+ Show More" to
see all the details for the latest entry).





GROUP_QUOTA_* entries

MACHINEBUSY = ($(CPUBusy) || $(KeyboardBusy))






RANK = (regexp("group_cmsuser",TARGET.AccountingGroup) ||
        regexp("group_cmsprod",TARGET.AccountingGroup) ||
        regexp("group_cdf",TARGET.AccountingGroup) ||
        regexp("group_monitor",TARGET.AccountingGroup) ||
        regexp("group_mitlns",TARGET.AccountingGroup) || ...)

START = (Owner != "cdf") && ((TARGET.IsMadgraph =!= TRUE) ||
(TARGET.IsMadgraph == UNDEFINED) || (SlotID == 1)) && ($(RANK) ||
isUndefined(LastHeardFrom) || LastHeardFrom-EnteredCurrentState>600)


