Re: [Condor-users] Preemption issues
- Date: Wed, 20 Oct 2010 09:48:28 -0500
- From: Dan Bradley <dan@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] Preemption issues
On 10/20/10 9:05 AM, Jonathan D. Proulx wrote:
I'm looking to disable preemption on some of the systems in my cluster
$CondorVersion: 7.4.0 Nov 1 2009 BuildID: 193173 $
$CondorPlatform: X86_64-LINUX_DEBIAN50 $
The goal being for running jobs never to be interrupted (which I know
isn't quite the same as not preempting claims).
My first attempt, using the example in the manual:
#Disable preemption by machine activity.
PREEMPT = False
#Disable preemption by user priority.
PREEMPTION_REQUIREMENTS = False
#Disable preemption by machine RANK by ranking all jobs equally.
RANK = 0
still gets jobs preempted due to user priority (I checked the runtime
values with condor_config_val to confirm that the values I expect are
the ones actually in use).
The above policy should definitely not allow preemption based on user
priority. Are you setting PREEMPTION_REQUIREMENTS in the configuration
of the negotiator? The rest of the settings apply to the worker node,
but that setting applies to the negotiator.
Another configuration setting that you can apply to the negotiator is this:
NEGOTIATOR_CONSIDER_PREEMPTION = False
Given the above policy, this additional setting shouldn't change
behavior, but it should result in more efficient negotiation, since the
work can be avoided.
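As a rough illustration of the split described above, the two
negotiator-side settings could be placed in the configuration read by
the machine running condor_negotiator (the central manager). Which file
they go in is an assumption about the site's layout; the settings
themselves are the ones named above.

```
# Central manager (negotiator) configuration.
# File placement is an assumption; adjust to the local setup.

# Disable preemption by user priority. This is evaluated by the
# negotiator, so setting it only on worker nodes has no effect.
PREEMPTION_REQUIREMENTS = False

# Skip preemption matchmaking altogether. With the policy above this
# should not change behavior, only save negotiation work.
NEGOTIATOR_CONSIDER_PREEMPTION = False
```

After the change, running condor_reconfig on that host should make the
negotiator reread its configuration. Running condor_config_val with the
-negotiator option should report the value the running negotiator holds,
rather than the value in the local files of whichever machine the command
is run on.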
My second attempt was to set a high MAXJOBRETIREMENTTIME, as suggested
in the same section. This "works", but queued jobs seem to get stuck to
a node that is doing this slow preemption and are not reassigned to
other resources if they become available. Since some jobs in the
cluster run for minutes and some for weeks, this is not really what I'm
looking for.
This is expected behavior. The "stickiness" has a timeout, controlled
by REQUEST_CLAIM_TIMEOUT, which defaults to 30 minutes.
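A minimal sketch of shortening that timeout, assuming it is set in the
configuration of the submit machine (the setting is read by
condor_schedd) and that the value is given in seconds. The value 600 is
purely illustrative, not a recommendation.

```
# Submit machine (condor_schedd) configuration.
# Time in seconds that the schedd waits for a startd to grant a
# requested claim. The default corresponds to 30 minutes.
REQUEST_CLAIM_TIMEOUT = 600
```

With a shorter timeout, the schedd should give up sooner on a claim
request to a machine that is still letting its current job retire, so
the waiting job can be matched to another resource in a later
negotiation cycle.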
I had thought this was working previously, and it has been an
advertised feature of our cluster for years, but I'm honestly not
certain whether the behaviour has changed or whether it was simply
insufficiently tested in the past.
I can't think of any changes in recent versions of Condor that would
impact the above policies.