Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] Preemption and Priority Issue

Date: Wed, 13 Jul 2011 14:22:44 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [Condor-users] Preemption and Priority Issue

Felix Wolfheimer wrote:

> If the preemption happens because of user priority or machine rank you
> may try to set
>
> PREEMPTION_REQUIREMENTS = False
>
> which should switch off this type of preemption.

Actually, afaik, setting PREEMPTION_REQUIREMENTS to False will disable user priority preemption but will not disable machine rank preemption.

The Condor Manual actually has a very enlightening section specifically on disabling preemption, including how to do so, and also talking about the (often overlooked) consequences and other (perhaps better) alternatives such as allowing preemption to occur only at job boundaries. See

http://www.cs.wisc.edu/condor/manual/v7.6/3_5Policy_Configuration.html#SECTION00459500000000000000

For convenience and in case there are follow-up questions, I cut-n-pasted that section of the manual below.


regards,
Todd

3.5.9.5 Disabling Preemption

Preemption can result in jobs being killed by Condor. When this happens, the jobs remain in the queue and will be automatically rescheduled. We highly recommend designing jobs that work well in this environment, rather than simply disabling preemption.

Planning for preemption makes jobs more robust in the face of other sources of failure. One way to live happily with preemption is to use Condor's standard universe, which provides the ability to produce checkpoints. If a job is incompatible with the requirements of standard universe, the job can still gracefully shut down and restart by intercepting the soft kill signal.

All that being said, there may be cases where it is appropriate to force Condor to never kill jobs within some upper time limit. This can be achieved with the following policy in the configuration of the execute nodes:


# When we want to kick a job off, let it run uninterrupted for
# up to 2 days before forcing it to vacate.
MAXJOBRETIREMENTTIME = $(HOUR) * 24 * 2

Construction of this expression may be more complicated. For example, it could provide a different retirement time to different users or different types of jobs. Also be aware that the job may come with its own definition of MaxJobRetirementTime, but this may only cause less retirement time to be used, never more than what the machine offers.
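
As a rough illustration of such an expression (a sketch, not taken from the manual): the example below assumes the pool wants no retirement time for standard universe jobs, since they can checkpoint, and up to 2 days for everything else. JobUniverse is 1 for standard universe jobs; the policy choice itself is only an assumption.

# Hypothetical policy: checkpointable (standard universe) jobs get no
# retirement time; all other jobs may run for up to 2 days.
MAXJOBRETIREMENTTIME = ifThenElse(TARGET.JobUniverse =?= 1, 0, $(HOUR) * 24 * 2)

A job could ask for less in its submit description file, for example one hour. A value larger than what the machine offers would have no effect:

# Hypothetical submit file line: request at most 1 hour of retirement time.
+MaxJobRetirementTime = 3600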

The longer the retirement time that is given, the slower reallocation of resources in the pool can become if there are long-running jobs. However, by preventing jobs from being killed, you may decrease the number of cycles that are wasted on non-checkpointable jobs that are killed. That is the basic trade-off.

Note that the use of MAXJOBRETIREMENTTIME limits the killing of jobs, but it does not prevent the preemption of resource claims. Therefore, it is technically not a way of disabling preemption, but simply a way of forcing preempting claims to wait until an existing job finishes or runs out of time. In other words, it limits the preemption of jobs but not the preemption of claims.

Limiting the preemption of jobs is often more desirable than limiting the preemption of resource claims. However, if you really do want to limit the preemption of resource claims, the following policy may be used. Some of these settings apply to the execute node and some apply to the central manager, so this policy should be configured so that it is read by both.


# Disable preemption by machine activity.
PREEMPT = False
# Disable preemption by user priority.
PREEMPTION_REQUIREMENTS = False
# Disable preemption by machine RANK by ranking all jobs equally.
RANK = 0
# Since we are disabling claim preemption, we
# may as well optimize negotiation for this case:
NEGOTIATOR_CONSIDER_PREEMPTION = False
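
If the execute nodes and the central manager read separate configuration files, the settings above could be split roughly as follows. This is a sketch: the grouping reflects which daemon evaluates each setting (condor_startd on execute nodes, condor_negotiator on the central manager), and the file split itself is an assumption about the pool layout.

# --- Execute nodes (read by condor_startd) ---
PREEMPT = False
RANK = 0

# --- Central manager (read by condor_negotiator) ---
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False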

Be aware of the consequences of this policy. Without any preemption of resource claims, once the condor_negotiator gives the condor_schedd a match to a machine, the condor_schedd may hold onto this claim indefinitely, as long as the user keeps supplying more jobs to run. If this is not desired, force claims to be retired after some amount of time using CLAIM_WORKLIFE. This enforces a time limit, beyond which no new jobs may be started on an existing claim; therefore the condor_schedd daemon is forced to go back to the condor_negotiator to request a new match, if there is still more work to do. Example execute machine configuration to include in addition to the example above:


# after 20 minutes, schedd must renegotiate to run
# additional jobs on the machine
CLAIM_WORKLIFE = 1200

References:
- [Condor-users] Preemption and Priority Issue
  - From: Natarajan, Senthil
- Re: [Condor-users] Preemption and Priority Issue
  - From: Felix Wolfheimer
