
[HTCondor-users] condor_status -total, Preempting



Hi,

2014/1/27 Todd Tannenbaum <tannenba@xxxxxxxxxxx>:
> Hi Daniel -
>
> The below looks really unexpected.  Your settings should indeed disable
> preemption, assuming you ran a successful condor_reconfig after the changes
> and the settings are on the right hosts (PREEMPTION_REQUIREMENTS is read
> by the condor_negotiator, and the other settings are read by all the execute
> hosts running a condor_startd).  Note that the preferred way to disable
> preemption on HTCondor v8.0+ is via MaxJobRetirementTime, see
>
> http://research.cs.wisc.edu/htcondor/manual/current/3_5Policy_Configuration.html#SECTION00459500000000000000
>
> But what you have below should work as well.
>
> HTCondor may preempt a job in favor of another job from the same user, but
> only in the case of a higher startd RANK.
>
> Very strange.
>
> Is the below regularly reproducible, or do you only see it very rarely?

Yes, it happens regularly and I can reproduce it. I submit 4000 jobs spread across 10 schedds with the negotiator turned off, then I turn the negotiator on and poll condor_status -total. From time to time the Preempting column shows a non-zero value.
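
For anyone who wants to repeat the polling step, here is a minimal Python sketch. It assumes the usual column layout of condor_status -total (a header line containing "Preempting", followed by per-platform rows and a final Total row). The function names preempting_count and poll are made up for illustration, and SAMPLE is hand-written text, not output captured from my pool.

```python
import subprocess
import time

# Illustrative sample only (not captured from a real pool); the column layout
# is assumed to match what condor_status -total prints and may vary by version.
SAMPLE = """\
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX  4000     0    3990         7       0          3        0

               Total  4000     0    3990         7       0          3        0
"""


def preempting_count(total_output: str) -> int:
    """Return the Preempting value from the final Total row."""
    header = None
    result = None
    for line in total_output.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if header is None:
            # The first line that names the Preempting column is the header.
            if "Preempting" in tokens:
                header = tokens
            continue
        # Data rows carry one extra leading token (the row label).
        if tokens[0] == "Total" and len(tokens) == len(header) + 1:
            result = int(tokens[1 + header.index("Preempting")])
    if result is None:
        raise ValueError("no Total row with a Preempting column found")
    return result


def poll(interval_s: float = 5.0, rounds: int = 60) -> None:
    """Print a timestamped line whenever Preempting is non-zero."""
    for _ in range(rounds):
        out = subprocess.run(
            ["condor_status", "-total"],
            capture_output=True, text=True, check=True,
        ).stdout
        count = preempting_count(out)
        if count:
            print(time.strftime("%H:%M:%S"), "Preempting =", count)
        time.sleep(interval_s)
```

Calling poll() on a host where condor_status is on the PATH would print a line each time the pool-wide Preempting total is non-zero; preempting_count(SAMPLE) exercises only the parsing.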

>
> Note that starting with HTCondor v8.1.3, the machine classads report some
> helpful/insightful attributes regarding preemption; I copied the below from
> the manual at
> http://research.cs.wisc.edu/htcondor/manual/latest/12_Appendix_A.html
> These statistics were added for just such an occurrence, i.e. so admins can
> confirm that preemption is disabled. So, if you are running v8.1.3 or above,
> are the statistics below reporting preemptions as occurring?  If so, are they
> reporting user preemptions or rank preemptions? Maybe it is only happening
> on some specific nodes?
>
> JobPreemptions:
>     The total number of times a running job has been preempted on this
> machine.
>
> JobRankPreemptions:
>     The total number of times a running job has been preempted on this
> machine due to the machine's rank of jobs since the condor_startd started
> running.
>
> JobUserPrioPreemptions:
>     The total number of times a running job has been preempted on this
> machine based on a fair share allocation of the pool since the condor_startd
> started running.
>
> RecentJobPreemptions:
>     The total number of jobs which have been preempted from this machine in
> the last twenty minutes.
>
> RecentJobRankPreemptions:
>     The total number of times a running job has been preempted on this
> machine due to the machine's rank of jobs in the last twenty minutes.
>
> RecentJobUserPrioPreemptions:
>     The total number of times a running job has been preempted on this
> machine based on a fair share allocation of the pool in the last twenty
> minutes.

Yes, the recent user-priority and total preemption values are around 16 (out of 4000 jobs). The preemptions happen on different schedds and startds, not always the same ones. All of these hosts have exactly the same configuration, by the way.
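
To see which slots are affected, a sketch along these lines can summarize the counters. It assumes the text comes from something like "condor_status -af Name JobPreemptions JobRankPreemptions JobUserPrioPreemptions", which prints one whitespace-separated line per slot and the word "undefined" for attributes a startd does not publish. The names parse_counters, totals and preempted_slots are hypothetical, and SAMPLE is invented data, not my real numbers.

```python
from typing import Dict, List

FIELDS = ("JobPreemptions", "JobRankPreemptions", "JobUserPrioPreemptions")

# Invented example data; host names and counts are placeholders.
SAMPLE = """\
slot21@node01.example 0 0 0
slot22@node01.example 1 0 1
slot3@node02.example 2 0 2
slot4@node03.example undefined undefined undefined
"""


def parse_counters(text: str) -> Dict[str, Dict[str, int]]:
    """Map each slot name to its preemption counters (undefined counts as 0)."""
    rows: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) != len(FIELDS) + 1:
            continue  # skip blank or malformed lines
        name, values = tokens[0], tokens[1:]
        rows[name] = {
            field: int(value) if value.isdigit() else 0
            for field, value in zip(FIELDS, values)
        }
    return rows


def totals(rows: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum every counter over all slots."""
    return {f: sum(c[f] for c in rows.values()) for f in FIELDS}


def preempted_slots(rows: Dict[str, Dict[str, int]]) -> List[str]:
    """Return the sorted names of slots that report at least one preemption."""
    return sorted(n for n, c in rows.items() if c["JobPreemptions"] > 0)
```

If JobRankPreemptions stays at zero while JobUserPrioPreemptions matches JobPreemptions, as in the invented sample, the preemptions are being attributed to user priority rather than to startd RANK.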

>
> regards,
> Todd
>

Thanks,
Daniel

>
> On 1/27/2014 9:53 AM, Pek Daniel wrote:
>>
>> Some lines from the StartLog:
>>
>> 01/27/14 16:45:42 slot22: Request accepted.
>> 01/27/14 16:45:42 slot22: Remote owner is xxx
>> 01/27/14 16:45:42 slot22: State change: claiming protocol successful
>> 01/27/14 16:45:42 slot22: Changing state: Unclaimed -> Claimed
>> 01/27/14 16:45:46 slot22: Got activate_claim request from shadow
>> (xxx.xxx.xxx.xxx)
>> 01/27/14 16:45:46 slot22: Remote job ID is 3920.25
>> 01/27/14 16:45:46 slot22: Got universe "VANILLA" (5) from request classad
>> 01/27/14 16:45:47 slot22: State change: claim-activation protocol
>> successful
>> 01/27/14 16:45:47 slot22: Changing activity: Idle -> Busy
>> 01/27/14 16:45:55 slot22: Preempting claim has correct ClaimId.
>> 01/27/14 16:45:55 slot22: New claim has sufficient rank, preempting
>> current claim.
>> 01/27/14 16:45:55 slot22: State change: preempting claim based on user
>> priority
>> 01/27/14 16:45:55 slot22: State change: claim retirement ended/expired
>> 01/27/14 16:45:55 slot22: Changing state and activity: Claimed/Busy ->
>> Preempting/Vacating
>>
>> 2014/1/27 Pek Daniel <pekdaniel@xxxxxxxxx>:
>>>
>>> Hi,
>>>
>>> I tried my best to turn off preemption completely:
>>> PREEMPT = FALSE
>>> SUSPEND = FALSE
>>> KILL = FALSE
>>> PREEMPTION_REQUIREMENTS = FALSE
>>> NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
>>> RANK = 0
>>>
>>> But sometimes during negotiation, I can still see a non-zero value in
>>> the Preempting column of the output of condor_status -total.
>>>
>>> According to the docs:
>>>
>>> ``Preempting'': A Condor job is being preempted (possibly via
>>> checkpointing) in order to clear the machine for either a higher
>>> priority job or because the machine owner wants the machine back.
>>>
>>> Given that I have only a single user and completely identical
>>> jobs, I don't think the preemption would happen because of a higher
>>> priority job. Any idea why this happens?
>>>
>>> Thanks,
>>> Daniel
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>
>
>
> --
> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
> Center for High Throughput Computing   Department of Computer Sciences
> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                  Madison, WI 53706-1685