Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Help with scheduling start/evict policy in condor_config

Date: Wed, 14 Apr 2010 10:48:38 -0500
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] Help with scheduling start/evict policy in condor_config

Hi Ian,

One likely source of trouble in this policy is that RANK is inherently a preemptive mechanism. RANK is only relevant when deciding whether to preempt an existing job with a new better-ranked one. This can lead to rapid cycles of preemption in some cases.


One case I have seen is this:

User with high (i.e. good) RANK has a high (i.e. _bad_) user priority. A different user has a low RANK and a low (i.e. _good_) user priority. When a machine is idle, RANK is irrelevant and only user priority is taken into account. Therefore, the user with a low RANK but good user priority will be scheduled to run on the idle machine. In the next round of negotiation, the user with a high RANK will preempt the other user.

Typically, this is only a problem if the user with high RANK does not keep the machine claimed for very long, because then the whole process repeats frequently. If I read the ticket correctly, CLAIM_WORKLIFE is 1200, so if the high RANK user has relatively short jobs, then I'd expect the cycle to be repeating every 20 minutes or so.


What to do about this?

One thing is to make sure users with low RANK also have a high (bad) user priority relative to users with high RANK. This can be done with priority factors.
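
As a rough sketch of what that could look like in condor_config (the values here are only illustrative assumptions, not taken from the ticket): a larger priority factor gives a worse effective user priority, so the groups named in RANK get a small factor and everyone else is left with a large default.

```
# Illustrative values only; tune to the pool.
# Groups that the startd RANK favors get a small (good) factor.
GROUP_PRIO_FACTOR_group_cmsprod = 1.0
GROUP_PRIO_FACTOR_group_cmsuser = 1.0

# Users without a group-specific factor get this larger (worse) factor,
# so they are less likely to win an idle slot ahead of the ranked groups.
DEFAULT_PRIO_FACTOR = 10000
```

The remaining groups listed in RANK would need matching entries. The aim is only that the ordering by user priority agrees with the ordering by RANK, so that a job matched to an idle slot is not one that RANK will preempt in the next negotiation cycle.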

Another thing to do is to use MaxJobRetirementTime to prevent preemption from happening quickly. The down side of this, of course, is that the high ranked users don't get immediate access to the machines when they need them.
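
A minimal sketch of that setting on the execute nodes, assuming 20 minutes is an acceptable delay for the high ranked users (the value is an assumption, chosen here only to match the CLAIM_WORKLIFE of 1200 seconds mentioned above):

```
# Illustrative value only: give a running job up to 20 minutes of
# run time before a preemption is allowed to evict it.
# $(MINUTE) is the same macro already used for MAXSUSPENDTIME above.
MAXJOBRETIREMENTTIME = 20 * $(MINUTE)
```

A job that finishes within its retirement time is not evicted at all, so short public jobs would run to completion instead of being restarted repeatedly.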


Hope that helps.

--Dan

Ian Stokes-Rees wrote:

We've got a situation where many jobs in a condor pool repeatedly go
through queue/execute/evict loops until they hit a timeout or retry
attempt limit.  What is odd is that they execute for 1-2 minutes before
being evicted, then restart again a few minutes later, only to be
evicted again after 1-2 minutes.  Inside of 10 minutes the same job in
the same pool can be started 3 times and evicted 3 times.

The pool effectively (but not in reality) has two groups of users:
privileged and public.  The core of the policy is supposed to be quite
simple: privileged jobs evict public jobs if there are public jobs
running and no free job slots.  The actual policy currently seems to
evict public jobs even if there are free job slots, which means the
public jobs then get restarted immediately into the free job slots, only
to be evicted again on the next scheduling cycle when a privileged job
considers the state of the job slot (running public job? evict and start
privileged job!).

We haven't had a lot of luck figuring out why this happens.  We are
almost certain it is something in the RANK, START, and PREEMPT
expressions.  Below I include what I think are the relevant extracts
from condor_config.  More details are in a ticket here:

https://ticket.grid.iu.edu/goc/viewer?id=8375  (click "+ Show More" to
see all the details for the latest entry).

TIA,

Ian

DEFAULT_PRIO_FACTOR = 10000

GROUP_PRIO_FACTOR_* entries

GROUP_QUOTA_* entries

MACHINEBUSY = ($(CPUBusy) || $(KeyboardBusy))

MAXSUSPENDTIME = 10 * $(MINUTE)

MAXVACATETIME = 10 * $(MINUTE)

PREEMPT = False

PREEMPTION_RANK = 0

PREEMPTION_REQUIREMENTS = False

RANK = (regexp("group_cmsuser",TARGET.AccountingGroup) || \
        regexp("group_cmsprod",TARGET.AccountingGroup) || \
        regexp("group_cdf",TARGET.AccountingGroup) || \
        regexp("group_monitor",TARGET.AccountingGroup) || \
        regexp("group_mitlns",TARGET.AccountingGroup) || \
        regexp("group_cmshi",TARGET.AccountingGroup))

START = (Owner != "cdf") && \
        ((TARGET.IsMadgraph =!= TRUE) || (TARGET.IsMadgraph == UNDEFINED) || (SlotID == 1)) && \
        ($(RANK) || isUndefined(LastHeardFrom) || LastHeardFrom-EnteredCurrentState>600)

WANT_SUSPEND = False

WANT_VACATE = False



