
Re: [Condor-users] Preemption issues



On Wed, Oct 20, 2010 at 09:48:28AM -0500, Dan Bradley wrote:

>> #Disable preemption by machine activity.
>> PREEMPT = False
>> #Disable preemption by user priority.
>> PREEMPTION_REQUIREMENTS = False
>> #Disable preemption by machine RANK by ranking all jobs equally.
>> RANK = 0
>>
>> still gets jobs preempted due to user priority (checked runtime values
>> with condor_config_val to see the values I expect are the ones
>> actually in use)
>
> The above policy should definitely not allow preemption based on user  
> priority.  Are you setting PREEMPTION_REQUIREMENTS in the configuration  
> of the negotiator?  The rest of the settings apply to the worker node,  
> but that setting applies to the negotiator.

I was only setting PREEMPTION_REQUIREMENTS on the worker nodes, so that
probably answers the question, but of course it brings up another. I only
want this policy to apply to certain worker nodes for certain people (and
other worker nodes for other people, based on group membership and
resource ownership).  I'd taken out all the other variables to
simplify the problem as I saw it.  Can I do this with a single
negotiator, or do I need to make my world much more complicated?
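
For reference, the split described above could be sketched roughly as
follows. This is only an illustration of which daemon reads which
setting, assuming the central manager and the worker nodes have separate
local configuration files; the comments naming the hosts are my own
labels, not anything taken from the real setup.

```
# --- Central manager (negotiator) local configuration ---
# Disable preemption by user priority. The negotiator evaluates this
# expression, so setting it only on the worker nodes has no effect.
PREEMPTION_REQUIREMENTS = False

# --- Worker node (startd) local configuration ---
# Disable preemption by machine activity.
PREEMPT = False
# Disable preemption by machine RANK by ranking all jobs equally.
RANK = 0
```

After changing either file, condor_config_val on the relevant host
should show the new values once the daemons have been reconfigured.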

> Another configuration setting that you can apply to the negotiator is this:
>
> NEGOTIATOR_CONSIDER_PREEMPTION = False
>
> Given the above policy, this additional setting shouldn't change  
> behavior, but it should result in more efficient negotiation, since the  
> work can be avoided.
>
>> My second attempt was to set a high MAXJOBRETIREMENTTIME, as suggested
>> in the same section. This "works", but queued jobs seem to get stuck to
>> a node that is doing this slow preemption and are not reassigned to
>> other resources if they become available. Since some jobs in the
>> cluster run for minutes and some for weeks, this is not really what I'm
>> looking for.
>
> This is expected behavior.  The "stickiness" has a timeout, controlled  
> by REQUEST_CLAIM_TIMEOUT, which defaults to 30 minutes.

That's very good to know.  I wasn't surprised by this behaviour (which
is why I didn't try it first), but didn't know it was limited.
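
If the 30-minute default turns out to be too long for a mix of short and
long jobs, shortening it might look something like the sketch below. The
assumptions here are mine: that the value is given in seconds (so the
default corresponds to 1800), that it belongs in the configuration of
the submit machines, and that 600 is a purely illustrative number rather
than a recommendation.

```
# Submit machine (schedd) configuration -- illustrative values only.
# Assumed to be in seconds: how long a queued job stays "stuck" to a
# claim request on a slowly retiring node before the request is given
# up and the job can be matched elsewhere.
REQUEST_CLAIM_TIMEOUT = 600
```

A shorter timeout trades less waiting on a busy node against more
repeated negotiation, so the right value depends on how long retirement
usually takes.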

>
>> I had thought this was working previously, and it has been part of an
>> advertised feature of our cluster for years, but I'm honestly not
>> certain if the behaviour has changed or if it was simply
>> insufficiently tested in the past.
>
> I can't think of any changes in recent versions of Condor that would  
> impact the above policies.

I figured it was increased use showing a previously undetected problem
rather than an actual change.

-Jon