
Re: [HTCondor-users] condor_status -total, Preempting



I don't know whether it's related, but sometimes when I kill all the
jobs (some of them are running), most of the slots become Unclaimed
very quickly. But there are always a few (3-4 out of ~2000 slots) which
remain in the Claimed state for a while (a few minutes), and I have to
wait (presumably) for the next ClassAd update from the machine, which
makes the machine "Unclaimed" again. Is it possible that this is because
of UDP packet loss, or something like that?

Thanks,
Daniel

2014/1/29 Pek Daniel <pekdaniel@xxxxxxxxx>:
> Now, I've set MAXJOBRETIREMENTTIME to a high value, and now I can't
> see any machines in "Preempting" state, instead, they're in Claimed
> with Retiring activity.
>
> 01/29/14 10:43:23 slot5: Request accepted.
> 01/29/14 10:43:23 slot5: Remote owner is xxx@xxxxxx
> 01/29/14 10:43:23 slot5: State change: claiming protocol successful
> 01/29/14 10:43:23 slot5: Changing state: Unclaimed -> Claimed
> 01/29/14 10:43:24 slot5: Got activate_claim request from shadow (xxx.xxx.xxx.xxx)
> 01/29/14 10:43:24 slot5: Remote job ID is 3933.36
> 01/29/14 10:43:24 slot5: Got universe "VANILLA" (5) from request classad
> 01/29/14 10:43:24 slot5: State change: claim-activation protocol successful
> 01/29/14 10:43:24 slot5: Changing activity: Idle -> Busy
> 01/29/14 10:43:34 slot5: Preempting claim has correct ClaimId.
> 01/29/14 10:43:34 slot5: New claim has sufficient rank, preempting current claim.
> 01/29/14 10:43:34 slot5: State change: preempting claim based on user priority
> 01/29/14 10:43:34 slot5: State change: retiring due to preempting claim
> 01/29/14 10:43:34 slot5: Changing activity: Busy -> Retiring
>
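
For anyone who wants to tally these transitions over a whole StartLog
rather than eyeball them, here is a minimal Python sketch. It assumes
each log entry sits on a single line (the excerpts in this thread were
wrapped by the mail client) and that the "Changing state/activity"
wording matches what is quoted above. The function name and the sample
list are only illustrative.

```python
import re
from collections import Counter

# Matches transition lines as they appear in the quoted StartLog, e.g.
#   "01/29/14 10:43:34 slot5: Changing activity: Busy -> Retiring"
# Assumption: one log entry per line, with no mail-quoting prefix.
TRANSITION = re.compile(
    r"^\S+ \S+ (?P<slot>slot\d+): "
    r"Changing (?:state and activity|state|activity): "
    r"(?P<old>\S+) -> (?P<new>\S+)$"
)


def count_transitions(lines):
    """Return a Counter keyed by (slot, old, new) for every transition line."""
    counts = Counter()
    for line in lines:
        m = TRANSITION.match(line.strip())
        if m:
            counts[(m["slot"], m["old"], m["new"])] += 1
    return counts


# Lines taken from the excerpts in this thread (wrapped lines rejoined).
sample = [
    "01/29/14 10:43:23 slot5: Changing state: Unclaimed -> Claimed",
    "01/29/14 10:43:24 slot5: Changing activity: Idle -> Busy",
    "01/29/14 10:43:34 slot5: State change: preempting claim based on user priority",
    "01/29/14 10:43:34 slot5: Changing activity: Busy -> Retiring",
    "01/27/14 16:45:55 slot22: Changing state and activity: Claimed/Busy -> Preempting/Vacating",
]
counts = count_transitions(sample)
```

Filtering the keys for a new value of Retiring or Preempting/Vacating
picks out the transitions shown in the excerpts in this thread.
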
> Also, during the negotiation, there are some fluctuations in the
> number of claimed machines. It should be monotonically increasing, but
> sometimes it drops to a lower value and then increases again...
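
A quick way to pin down where the Claimed count goes backwards is to
record the value from each poll of condor_status -total and flag every
sample lower than the one before it. A minimal Python sketch; the sample
numbers are made up for illustration and the function name is
hypothetical.

```python
def find_drops(samples):
    """Return (index, previous, current) for each sample below its predecessor."""
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(zip(samples, samples[1:]), start=1)
        if cur < prev
    ]


# Hypothetical Claimed counts from successive polls (not measured data).
claimed = [0, 310, 640, 615, 900, 1200]
drops = find_drops(claimed)
```

An empty result means the series never decreased; each tuple otherwise
gives the poll index where the count fell.
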
>
>
>
> 2014/1/28 Pek Daniel <pekdaniel@xxxxxxxxx>:
>> 2014/1/28 Pek Daniel <pekdaniel@xxxxxxxxx>:
>>> Hi,
>>>
>>> 2014/1/27 Todd Tannenbaum <tannenba@xxxxxxxxxxx>:
>>>
>>>> Hi Daniel -
>>>>
>>>> The below looks really unexpected.  Your settings indeed should disable
>>>> preemption, assuming you did a successful condor_reconfig after the
>>>> changes and they are set on the right hosts (the PREEMPTION_REQUIREMENTS
>>>> change is read by the condor_negotiator, and the other settings are read
>>>> by all the execute hosts running condor_startds).  Note that the
>>>> preferred way to disable preemption on HTCondor v8.0+ is via
>>>> MaxJobRetirementTime, see
>>>>
>>>>
>>>> http://research.cs.wisc.edu/htcondor/manual/current/3_5Policy_Configuration.html#SECTION00459500000000000000
>>>>
>>>> But what you have below should work as well.
>>>>
>>>> HTCondor may preempt a job in favor of another job from the same user, but
>>>> only in the case of a higher startd RANK.
>>>>
>>>> Very strange.
>>>>
>>>> Is the below regularly reproducible, or do you only see it very rarely?
>>>
>>> Yes, this is a regular thing, I can reproduce it. What I do is submit 4000
>>> jobs spread across 10 schedds with the negotiator turned off, then turn
>>> it on and poll condor_status -total. From time to time I see a non-zero
>>> value in the Preempting column.
>>>
>>>
>>>>
>>>> Note that starting with HTCondor v8.1.3, the machine classads will report
>>>> some helpful/insightful attributes regarding preemption; I copied the
>>>> below from the manual at
>>>> http://research.cs.wisc.edu/htcondor/manual/latest/12_Appendix_A.html
>>>> These statistics were added for just such an occurrence, i.e. so admins
>>>> can confirm that preemption is disabled. So, if you are running v8.1.3 or
>>>> above, are the statistics below reporting preemptions as occurring?  If
>>>> so, are they reporting user preemptions or rank preemptions? Maybe it is
>>>> only happening on some specific nodes?
>>>>
>>>> JobPreemptions:
>>>>     The total number of times a running job has been preempted on this
>>>>     machine.
>>>>
>>>> JobRankPreemptions:
>>>>     The total number of times a running job has been preempted on this
>>>>     machine due to the machine's rank of jobs since the condor_startd
>>>>     started running.
>>>>
>>>> JobUserPrioPreemptions:
>>>>     The total number of times a running job has been preempted on this
>>>>     machine based on a fair share allocation of the pool since the
>>>>     condor_startd started running.
>>>>
>>>> RecentJobPreemptions:
>>>>     The total number of jobs which have been preempted from this machine
>>>>     in the last twenty minutes.
>>>>
>>>> RecentJobRankPreemptions:
>>>>     The total number of times a running job has been preempted on this
>>>>     machine due to the machine's rank of jobs in the last twenty minutes.
>>>>
>>>> RecentJobUserPrioPreemptions:
>>>>     The total number of times a running job has been preempted on this
>>>>     machine based on a fair share allocation of the pool in the last
>>>>     twenty minutes.
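
Since these counters are reported in each machine ad, one way to get a
pool-wide picture is to sum them across all ads and compare the user
priority total against the rank total. Below is a minimal Python sketch
that works on ads already loaded as plain dictionaries. How the ads are
fetched is left out, and the slot names and counter values in the
example are hypothetical.

```python
# Attribute names as listed in the manual excerpt quoted in this thread.
COUNTERS = ("JobPreemptions", "JobUserPrioPreemptions", "JobRankPreemptions")


def pool_totals(ads):
    """Sum each preemption counter across ads; a missing attribute counts as 0."""
    return {name: sum(int(ad.get(name, 0)) for ad in ads) for name in COUNTERS}


# Hypothetical ads for illustration only (not real pool data).
ads = [
    {"Name": "slot1@node-a", "JobPreemptions": 16,
     "JobUserPrioPreemptions": 16, "JobRankPreemptions": 0},
    {"Name": "slot2@node-a", "JobPreemptions": 15,
     "JobUserPrioPreemptions": 15},
]
totals = pool_totals(ads)
```

A non-zero JobUserPrioPreemptions total with a zero JobRankPreemptions
total would point at user-priority preemption rather than startd RANK.
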
>>>
>>> Yes, recent userprio and total values are around 16 (out of 4000 jobs).
>>> These happen on different schedds and startds, not always the same. They
>>> have exactly the same configuration btw.
>>
>> Ah, sorry, I've just noticed that this value is per machine (or per
>> slot?). So this means ~16 preemptions / machine.
>>
>> Also I found these in my NegotiatorLog which might be relevant:
>>
>> 01/28/14 16:43:39 PREEMPTION_REQUIREMENTS = FALSE
>> 01/28/14 16:43:39 NEGOTIATOR_INTERVAL = 1 sec
>> 01/28/14 16:43:39 NEGOTIATOR_TIMEOUT = 30 sec
>> 01/28/14 16:43:39 MAX_TIME_PER_SUBMITTER = 31536000 sec
>> 01/28/14 16:43:39 MAX_TIME_PER_PIESPIN = 31536000 sec
>> 01/28/14 16:43:39 PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize
>> 01/28/14 16:43:39 NEGOTIATOR_PRE_JOB_RANK = RemoteOwner =?= UNDEFINED
>> 01/28/14 16:43:39 NEGOTIATOR_POST_JOB_RANK = (RemoteOwner =?= UNDEFINED) * (ifthenElse(isUndefined(KFlops), 1000, Kflops) - SlotID - 1.0e10*(Offline=?=True))
>>
>> And at the beginning of new cycles:
>> 01/28/14 16:43:54 Not considering preemption, therefore constraining idle machines with ifThenElse(State == "Claimed","Name State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight ConcurrencyLimits","")
>>
>> Can any of these cause the preemptions?
>>
>>
>>>
>>>>
>>>> regards,
>>>> Todd
>>>>
>>>
>>> Thanks,
>>> Daniel
>>>
>>>>
>>>> On 1/27/2014 9:53 AM, Pek Daniel wrote:
>>>>>
>>>>> Some lines from the StartLog:
>>>>>
>>>>> 01/27/14 16:45:42 slot22: Request accepted.
>>>>> 01/27/14 16:45:42 slot22: Remote owner is xxx
>>>>> 01/27/14 16:45:42 slot22: State change: claiming protocol successful
>>>>> 01/27/14 16:45:42 slot22: Changing state: Unclaimed -> Claimed
>>>>> 01/27/14 16:45:46 slot22: Got activate_claim request from shadow (xxx.xxx.xxx.xxx)
>>>>> 01/27/14 16:45:46 slot22: Remote job ID is 3920.25
>>>>> 01/27/14 16:45:46 slot22: Got universe "VANILLA" (5) from request classad
>>>>> 01/27/14 16:45:47 slot22: State change: claim-activation protocol successful
>>>>> 01/27/14 16:45:47 slot22: Changing activity: Idle -> Busy
>>>>> 01/27/14 16:45:55 slot22: Preempting claim has correct ClaimId.
>>>>> 01/27/14 16:45:55 slot22: New claim has sufficient rank, preempting current claim.
>>>>> 01/27/14 16:45:55 slot22: State change: preempting claim based on user priority
>>>>> 01/27/14 16:45:55 slot22: State change: claim retirement ended/expired
>>>>> 01/27/14 16:45:55 slot22: Changing state and activity: Claimed/Busy -> Preempting/Vacating
>>>>>
>>>>> 2014/1/27 Pek Daniel <pekdaniel@xxxxxxxxx>:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I tried my best to turn off preemption completely:
>>>>>> PREEMPT = FALSE
>>>>>> SUSPEND = FALSE
>>>>>> KILL = FALSE
>>>>>> PREEMPTION_REQUIREMENTS = FALSE
>>>>>> NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
>>>>>> RANK = 0
>>>>>>
>>>>>> But sometimes during negotiation, I can still see a non-zero value in
>>>>>> the Preempting column of the output of condor_status -total.
>>>>>>
>>>>>> According to the docs:
>>>>>>
>>>>>> ``Preempting'': A Condor job is being preempted (possibly via
>>>>>> checkpointing) in order to clear the machine for either a higher
>>>>>> priority job or because the machine owner wants the machine back.
>>>>>>
>>>>>> Given that I have only a single user and completely identical
>>>>>> jobs, I don't think the preemption would happen because of a higher
>>>>>> priority job. Any idea why this is?
>>>>>>
>>>>>> Thanks,
>>>>>> Daniel
>>>>>
>>>>> _______________________________________________
>>>>> HTCondor-users mailing list
>>>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with
>>>>> a
>>>>> subject: Unsubscribe
>>>>> You can also unsubscribe by visiting
>>>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>>>
>>>>> The archives can be found at:
>>>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>>>>
>>>>
>>>>
>>>> --
>>>> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
>>>> Center for High Throughput Computing   Department of Computer Sciences
>>>> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
>>>> Phone: (608) 263-7132                  Madison, WI 53706-1685