
Re: [HTCondor-users] condor_status -total, Preempting



Now I've set MAXJOBRETIREMENTTIME to a high value, and I no longer see
any machines in the "Preempting" state; instead, they're in the Claimed
state with Retiring activity.

01/29/14 10:43:23 slot5: Request accepted.
01/29/14 10:43:23 slot5: Remote owner is xxx@xxxxxx
01/29/14 10:43:23 slot5: State change: claiming protocol successful
01/29/14 10:43:23 slot5: Changing state: Unclaimed -> Claimed
01/29/14 10:43:24 slot5: Got activate_claim request from shadow (xxx.xxx.xxx.xxx)
01/29/14 10:43:24 slot5: Remote job ID is 3933.36
01/29/14 10:43:24 slot5: Got universe "VANILLA" (5) from request classad
01/29/14 10:43:24 slot5: State change: claim-activation protocol successful
01/29/14 10:43:24 slot5: Changing activity: Idle -> Busy
01/29/14 10:43:34 slot5: Preempting claim has correct ClaimId.
01/29/14 10:43:34 slot5: New claim has sufficient rank, preempting current claim.
01/29/14 10:43:34 slot5: State change: preempting claim based on user priority
01/29/14 10:43:34 slot5: State change: retiring due to preempting claim
01/29/14 10:43:34 slot5: Changing activity: Busy -> Retiring
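
To count how often this happens per slot, the StartLog can be scanned for
transitions into Retiring or Preempting. Below is a minimal sketch, assuming
the log lines have exactly the shape shown in this thread (timestamp, slot
name, "Changing ...: old -> new" on a single line). The regular expression and
the function name preemption_transitions are my own and are not part of
HTCondor.

```python
import re
from collections import Counter

# Matches StartLog lines shaped like the ones quoted in this thread, e.g.
#   01/29/14 10:43:34 slot5: Changing activity: Busy -> Retiring
#   01/27/14 16:45:55 slot22: Changing state and activity: Claimed/Busy -> Preempting/Vacating
# Assumption: each log entry sits on one line (no mail-client wrapping).
TRANSITION = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<slot>slot\d+): Changing (?:state and activity|state|activity): "
    r"(?P<old>\S+) -> (?P<new>\S+)"
)


def preemption_transitions(lines):
    """Yield (timestamp, slot, old, new) for transitions into Retiring or Preempting."""
    for line in lines:
        m = TRANSITION.match(line.strip())
        if not m:
            continue
        new = m.group("new")
        if new == "Retiring" or new.startswith("Preempting"):
            yield m.group("ts"), m.group("slot"), m.group("old"), new


# Sample lines copied from the logs in this thread.
sample = [
    "01/29/14 10:43:24 slot5: Changing activity: Idle -> Busy",
    "01/29/14 10:43:34 slot5: Changing activity: Busy -> Retiring",
    "01/27/14 16:45:55 slot22: Changing state and activity: Claimed/Busy -> Preempting/Vacating",
]

events = list(preemption_transitions(sample))
per_slot = Counter(slot for _, slot, _, _ in events)
```

On a real machine the same function could be fed an open StartLog file
object instead of the sample list.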

Also, during negotiation, the number of claimed machines fluctuates. It
should increase monotonically, but sometimes it drops to a lower value
and then starts increasing again...
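
To pin down when the count drops, the Claimed totals from successive polls
can be checked for non-monotonic steps. This is only a sketch: the helper
find_drops is hypothetical, and the numbers in the example are made up for
illustration, not taken from my pool.

```python
def find_drops(samples):
    """Return (index, previous, current) for every sample lower than the one before it."""
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(zip(samples, samples[1:]), start=1)
        if cur < prev
    ]


# Hypothetical Claimed totals from successive polls (illustrative values only).
claimed = [0, 120, 480, 470, 900, 1500, 1490, 2200]
drops = find_drops(claimed)

for index, prev, cur in drops:
    print(f"poll {index}: Claimed fell from {prev} to {cur}")
```

Matching the poll times of each drop against the StartLog timestamps should
show whether the drops coincide with the retiring claims.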



2014/1/28 Pek Daniel <pekdaniel@xxxxxxxxx>:
> 2014/1/28 Pek Daniel <pekdaniel@xxxxxxxxx>:
>> Hi,
>>
>> 2014/1/27 Todd Tannenbaum <tannenba@xxxxxxxxxxx>:
>>
>>> Hi Daniel -
>>>
>>> The below looks really unexpected.  Your settings indeed should disable
>>> preemption, assuming you did a successful condor_reconfig after the changes
>>> and they are set at the right host (the PREEMPTION_REQUIREMENTS change is
>>> read by the condor_negotiator, and the other settings are read by all the
>>> execute hosts running condor_startds).  Note that the preferred way to
>>> disable preemption on HTCondor v8.0+ is via MaxJobRetirementTime, see
>>>
>>>
>>> http://research.cs.wisc.edu/htcondor/manual/current/3_5Policy_Configuration.html#SECTION00459500000000000000
>>>
>>> But what you have below should work as well.
>>>
>>> HTCondor may preempt a job in favor of another job from the same user, but
>>> only in the case of a higher startd RANK.
>>>
>>> Very strange.
>>>
>>> Is the below regularly reproducible, or do you only see it very rarely?
>>
>> Yes, this is a regular thing; I can reproduce it. What I do is submit 4000
>> jobs spread across 10 schedds with the negotiator turned off, then turn it
>> on and poll condor_status -total. From time to time I see a nonzero value
>> in the Preempting column.
>>
>>
>>>
>>> Note that starting with HTCondor v8.1.3, the machine classads will report
>>> some helpful/insightful attributes regarding preemption; I copied the below
>>> from the manual at
>>> http://research.cs.wisc.edu/htcondor/manual/latest/12_Appendix_A.html
>>> These statistics were added for just such an occurrence, i.e. so admins can
>>> confirm that preemption is disabled. So, if you are running v8.1.3 or above,
>>> are the statistics below reporting preemptions as occurring?  If so, are
>>> they reporting user preemptions or rank preemptions? Maybe it is only
>>> happening on some specific nodes?
>>>
>>> JobPreemptions:
>>>     The total number of times a running job has been preempted on this
>>>     machine.
>>>
>>> JobRankPreemptions:
>>>     The total number of times a running job has been preempted on this
>>>     machine due to the machine's rank of jobs since the condor_startd
>>>     started running.
>>>
>>> JobUserPrioPreemptions:
>>>     The total number of times a running job has been preempted on this
>>>     machine based on a fair share allocation of the pool since the
>>>     condor_startd started running.
>>>
>>> RecentJobPreemptions:
>>>     The total number of jobs which have been preempted from this machine
>>>     in the last twenty minutes.
>>>
>>> RecentJobRankPreemptions:
>>>     The total number of times a running job has been preempted on this
>>>     machine due to the machine's rank of jobs in the last twenty minutes.
>>>
>>> RecentJobUserPrioPreemptions:
>>>     The total number of times a running job has been preempted on this
>>>     machine based on a fair share allocation of the pool in the last twenty
>>>     minutes.
>>
>> Yes, recent userprio and total values are around 16 (out of 4000 jobs).
>> These happen on different schedds and startds, not always the same. They
>> have exactly the same configuration btw.
>
> Ah, sorry, I've just noticed that this value is per machine (or per
> slot?). So this means ~16 preemptions / machine.
>
> Also I found these in my NegotiatorLog which might be relevant:
>
> 01/28/14 16:43:39 PREEMPTION_REQUIREMENTS = FALSE
> 01/28/14 16:43:39 NEGOTIATOR_INTERVAL = 1 sec
> 01/28/14 16:43:39 NEGOTIATOR_TIMEOUT = 30 sec
> 01/28/14 16:43:39 MAX_TIME_PER_SUBMITTER = 31536000 sec
> 01/28/14 16:43:39 MAX_TIME_PER_PIESPIN = 31536000 sec
> 01/28/14 16:43:39 PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize
> 01/28/14 16:43:39 NEGOTIATOR_PRE_JOB_RANK = RemoteOwner =?= UNDEFINED
> 01/28/14 16:43:39 NEGOTIATOR_POST_JOB_RANK = (RemoteOwner =?= UNDEFINED) * (ifthenElse(isUndefined(KFlops), 1000, Kflops) - SlotID - 1.0e10*(Offline=?=True))
>
> And at the beginning of new cycles:
> 01/28/14 16:43:54 Not considering preemption, therefore constraining idle machines with ifThenElse(State == "Claimed","Name State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight ConcurrencyLimits","")
>
> Can any of these cause the preemptions?
>
>
>>
>>>
>>> regards,
>>> Todd
>>>
>>
>> Thanks,
>> Daniel
>>
>>>
>>> On 1/27/2014 9:53 AM, Pek Daniel wrote:
>>>>
>>>> Some lines from the StartLog:
>>>>
>>>> 01/27/14 16:45:42 slot22: Request accepted.
>>>> 01/27/14 16:45:42 slot22: Remote owner is xxx
>>>> 01/27/14 16:45:42 slot22: State change: claiming protocol successful
>>>> 01/27/14 16:45:42 slot22: Changing state: Unclaimed -> Claimed
>>>> 01/27/14 16:45:46 slot22: Got activate_claim request from shadow (xxx.xxx.xxx.xxx)
>>>> 01/27/14 16:45:46 slot22: Remote job ID is 3920.25
>>>> 01/27/14 16:45:46 slot22: Got universe "VANILLA" (5) from request classad
>>>> 01/27/14 16:45:47 slot22: State change: claim-activation protocol successful
>>>> 01/27/14 16:45:47 slot22: Changing activity: Idle -> Busy
>>>> 01/27/14 16:45:55 slot22: Preempting claim has correct ClaimId.
>>>> 01/27/14 16:45:55 slot22: New claim has sufficient rank, preempting current claim.
>>>> 01/27/14 16:45:55 slot22: State change: preempting claim based on user priority
>>>> 01/27/14 16:45:55 slot22: State change: claim retirement ended/expired
>>>> 01/27/14 16:45:55 slot22: Changing state and activity: Claimed/Busy -> Preempting/Vacating
>>>>
>>>> 2014/1/27 Pek Daniel <pekdaniel@xxxxxxxxx>:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I tried my best to turn off preemption completely:
>>>>> PREEMPT = FALSE
>>>>> SUSPEND = FALSE
>>>>> KILL = FALSE
>>>>> PREEMPTION_REQUIREMENTS = FALSE
>>>>> NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
>>>>> RANK = 0
>>>>>
>>>>> But sometimes during negotiation, I can still see a non-zero value in
>>>>> the Preempting column of the output of condor_status -total.
>>>>>
>>>>> According to the docs:
>>>>>
>>>>> ``Preempting'': A Condor job is being preempted (possibly via
>>>>> checkpointing) in order to clear the machine for either a higher
>>>>> priority job or because the machine owner wants the machine back.
>>>>>
>>>>> Given that I have only a single user and completely identical
>>>>> jobs, I don't think the preemption would happen because of a higher
>>>>> priority job. Any idea why this happens?
>>>>>
>>>>> Thanks,
>>>>> Daniel
>>>>
>>>> _______________________________________________
>>>> HTCondor-users mailing list
>>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with
>>>> a
>>>> subject: Unsubscribe
>>>> You can also unsubscribe by visiting
>>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>>
>>>> The archives can be found at:
>>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>>>
>>>
>>>
>>> --
>>> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
>>> Center for High Throughput Computing   Department of Computer Sciences
>>> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
>>> Phone: (608) 263-7132                  Madison, WI 53706-1685