
[Condor-users] Help with scheduling start/evict policy in condor_config



We've got a situation where many jobs in a condor pool repeatedly go
through queue/execute/evict loops until they hit a timeout or retry
limit.  What is odd is that they execute for 1-2 minutes before
being evicted, then restart a few minutes later, only to be
evicted again after 1-2 minutes.  Within 10 minutes the same job in
the same pool can be started 3 times and evicted 3 times.

The pool effectively (but not in reality) has two groups of users:
privileged and public.  The core of the policy is supposed to be quite
simple: privileged jobs evict public jobs if there are public jobs
running and no free job slots.  The actual policy currently seems to
evict public jobs even if there are free job slots, which means the
public jobs then get restarted immediately into the free job slots, only
to be evicted again on the next scheduling cycle when a privileged job
considers the state of the job slot (running public job? evict and start
privileged job!).
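
To make the intended policy concrete, here is a minimal sketch of it as a
decision function.  It is an illustrative Python model only, not HTCondor
code: the names Slot and place_privileged_job and the "public" /
"privileged" labels are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    """Hypothetical model of a job slot; not an HTCondor data structure."""
    name: str
    running: Optional[str] = None  # None (free), "public", or "privileged"


def place_privileged_job(slots: List[Slot]) -> Optional[str]:
    """Place one privileged job and describe what happened.

    Intended policy: use a free slot when one exists, and evict a public
    job only when every slot is busy.
    """
    # Prefer a free slot so that no running job is disturbed.
    for slot in slots:
        if slot.running is None:
            slot.running = "privileged"
            return f"started on free slot {slot.name}"
    # No free slot: evict a public job, if one is running.
    for slot in slots:
        if slot.running == "public":
            slot.running = "privileged"
            return f"evicted public job on {slot.name}"
    # Every slot already runs a privileged job; the new job stays queued.
    return None


# Example: one busy public slot and one free slot.
pool = [Slot("slot1", "public"), Slot("slot2")]
print(place_privileged_job(pool))  # started on free slot slot2
print(place_privileged_job(pool))  # evicted public job on slot1
print(place_privileged_job(pool))  # None
```

The behaviour we actually observe corresponds to the second loop running
even when the first loop could have found a free slot.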

We haven't had a lot of luck figuring out why this happens.  We are
almost certain it is something in the RANK, START, and PREEMPT
expressions.  Below I include what I think are the relevant extracts
from condor_config.  More details are in a ticket here:

https://ticket.grid.iu.edu/goc/viewer?id=8375  (click "+ Show More" to
see all the details for the latest entry).

TIA,

Ian

DEFAULT_PRIO_FACTOR = 10000

GROUP_PRIO_FACTOR_* entries

GROUP_QUOTA_* entries

MACHINEBUSY = ($(CPUBusy) || $(KeyboardBusy))

MAXSUSPENDTIME = 10 * $(MINUTE)

MAXVACATETIME = 10 * $(MINUTE)

PREEMPT = False

PREEMPTION_RANK = 0

PREEMPTION_REQUIREMENTS = False

RANK = (regexp("group_cmsuser",TARGET.AccountingGroup) || \
        regexp("group_cmsprod",TARGET.AccountingGroup) || \
        regexp("group_cdf",TARGET.AccountingGroup) || \
        regexp("group_monitor",TARGET.AccountingGroup) || \
        regexp("group_mitlns",TARGET.AccountingGroup) || \
        regexp("group_cmshi",TARGET.AccountingGroup))

START = (Owner != "cdf") && \
        ((TARGET.IsMadgraph =!= TRUE) || \
         (TARGET.IsMadgraph == UNDEFINED) || (SlotID == 1)) && \
        ($(RANK) || isUndefined(LastHeardFrom) || \
         LastHeardFrom-EnteredCurrentState>600)

WANT_SUSPEND = False

WANT_VACATE = False
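
As a reading aid for the RANK expression in the extract, the following
Python sketch mirrors its logic: the result is true when the job's
AccountingGroup matches any of the listed group names.  It assumes that
ClassAd regexp() performs an unanchored match, which re.search imitates;
the function name rank_is_true and the sample accounting-group strings
are hypothetical.

```python
import re

# Group names copied from the RANK expression in the config extract.
PRIVILEGED_GROUPS = [
    "group_cmsuser",
    "group_cmsprod",
    "group_cdf",
    "group_monitor",
    "group_mitlns",
    "group_cmshi",
]


def rank_is_true(accounting_group):
    """Return True if any privileged group pattern matches the string.

    Assumption: re.search stands in for ClassAd regexp() (unanchored match).
    A missing AccountingGroup is modelled as None and treated as "no match";
    how ClassAd itself evaluates that case is not modelled here.
    """
    if accounting_group is None:
        return False
    return any(re.search(pattern, accounting_group)
               for pattern in PRIVILEGED_GROUPS)


# Hypothetical accounting-group values for illustration.
print(rank_is_true("group_cmsprod.alice"))  # True
print(rank_is_true("group_other.bob"))      # False
```

Under this reading, any job outside the six listed groups is a "public"
job as far as RANK is concerned.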