
Re: [HTCondor-users] condor_status -total, Preempting



2014/1/28 Pek Daniel <pekdaniel@xxxxxxxxx>:
> Hi,
>
> 2014/1/27 Todd Tannenbaum <tannenba@xxxxxxxxxxx>:
>
>> Hi Daniel -
>>
>> The below looks really unexpected.  Your settings should indeed disable
>> preemption, assuming you did a successful condor_reconfig after the
>> changes and they are set on the right hosts (the PREEMPTION_REQUIREMENTS
>> change is read by the condor_negotiator, and the other settings are read
>> by all the execute hosts running condor_startds).  Note that the
>> preferred way to disable preemption on HTCondor v8.0+ is via
>> MaxJobRetirementTime, see
>>
>>
>> http://research.cs.wisc.edu/htcondor/manual/current/3_5Policy_Configuration.html#SECTION00459500000000000000
>>
>> But what you have below should work as well.
>>
>> HTCondor may preempt a job in favor of another job from the same user, but
>> only in the case of a higher startd RANK.
>>
>> Very strange.
>>
>> Is the below regularly reproducible, or do you only see it very rarely ?
>
> Yes, this happens regularly and I can reproduce it. I submit 4000 jobs
> spread across 10 schedds with the negotiator turned off, then I turn it
> on and poll condor_status -total. From time to time the Preempting
> column shows a value other than zero.
>
>
>>
>> Note that starting with HTCondor v8.1.3, the machine ClassAds report some
>> helpful attributes regarding preemption; I copied the below from the
>> manual at
>> http://research.cs.wisc.edu/htcondor/manual/latest/12_Appendix_A.html
>> These statistics were added for just such an occurrence, i.e. so admins
>> can confirm that preemption is disabled. So, if you are running v8.1.3 or
>> above, are the statistics below reporting preemptions as occurring?  If
>> so, are they reporting user preemptions or rank preemptions? Maybe it is
>> only happening on some specific nodes?
>>
>> JobPreemptions:
>>     The total number of times a running job has been preempted on this
>>     machine.
>>
>> JobRankPreemptions:
>>     The total number of times a running job has been preempted on this
>>     machine due to the machine's rank of jobs since the condor_startd
>>     started running.
>>
>> JobUserPrioPreemptions:
>>     The total number of times a running job has been preempted on this
>>     machine based on a fair share allocation of the pool since the
>>     condor_startd started running.
>>
>> RecentJobPreemptions:
>>     The total number of jobs which have been preempted from this machine
>>     in the last twenty minutes.
>>
>> RecentJobRankPreemptions:
>>     The total number of times a running job has been preempted on this
>>     machine due to the machine's rank of jobs in the last twenty minutes.
>>
>> RecentJobUserPrioPreemptions:
>>     The total number of times a running job has been preempted on this
>>     machine based on a fair share allocation of the pool in the last
>>     twenty minutes.
>
> Yes, recent userprio and total values are around 16 (out of 4000 jobs).
> These happen on different schedds and startds, not always the same. They
> have exactly the same configuration btw.

Ah, sorry, I've just noticed that this value is per machine (or per
slot?). So this means ~16 preemptions / machine.
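
One way to settle the per-machine versus per-slot question is to dump the counters for every slot ad and compare them within each machine. The following is a minimal sketch, not a tested tool: it assumes your condor_status supports the -af (autoformat) option, that every slot ad carries the Machine and Name attributes, and that a missing counter prints as "undefined". The helper names are made up for illustration.

```python
import subprocess
from collections import defaultdict

# Attributes to print, one column each; the counters come from the list
# quoted above, Machine and Name are assumed to be present in every slot ad.
ATTRS = ["Machine", "Name", "JobPreemptions",
         "JobRankPreemptions", "JobUserPrioPreemptions"]


def parse_autoformat(text):
    """Parse whitespace-separated autoformat output into (machine, slot, counts)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(ATTRS):
            continue  # skip blank or unexpected lines
        machine, slot, *raw = fields
        # Assumption: a slot that lacks a counter prints "undefined"; count it as 0.
        counts = [int(v) if v.isdigit() else 0 for v in raw]
        rows.append((machine, slot, counts))
    return rows


def group_by_machine(rows):
    """Return {machine: {slot_name: [counters...]}}."""
    grouped = defaultdict(dict)
    for machine, slot, counts in rows:
        grouped[machine][slot] = counts
    return dict(grouped)


def same_on_all_slots(slots):
    """True if every slot of one machine reports identical counters."""
    values = list(slots.values())
    return all(v == values[0] for v in values)


def query_pool():
    """Run condor_status against the pool and group the counters by machine."""
    out = subprocess.run(["condor_status", "-af"] + ATTRS,
                         capture_output=True, text=True, check=True).stdout
    return group_by_machine(parse_autoformat(out))
```

If every slot of a machine reports the same numbers, the counters are probably kept per startd and repeated in each slot ad; if they differ between slots, they are per slot. Identical values could still be a coincidence on a lightly loaded machine, so check more than one host.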

Also I found these in my NegotiatorLog which might be relevant:

01/28/14 16:43:39 PREEMPTION_REQUIREMENTS = FALSE
01/28/14 16:43:39 NEGOTIATOR_INTERVAL = 1 sec
01/28/14 16:43:39 NEGOTIATOR_TIMEOUT = 30 sec
01/28/14 16:43:39 MAX_TIME_PER_SUBMITTER = 31536000 sec
01/28/14 16:43:39 MAX_TIME_PER_PIESPIN = 31536000 sec
01/28/14 16:43:39 PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize
01/28/14 16:43:39 NEGOTIATOR_PRE_JOB_RANK = RemoteOwner =?= UNDEFINED
01/28/14 16:43:39 NEGOTIATOR_POST_JOB_RANK = (RemoteOwner =?= UNDEFINED) * (ifthenElse(isUndefined(KFlops), 1000, Kflops) - SlotID - 1.0e10*(Offline=?=True))

And at the beginning of new cycles:
01/28/14 16:43:54 Not considering preemption, therefore constraining idle machines with ifThenElse(State == "Claimed","Name State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight ConcurrencyLimits","")

Can any of these cause the preemptions?
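
To compare what the negotiator actually loaded against the intended configuration, the "NAME = VALUE" lines it logs at startup can be collected into a dictionary. This is a rough sketch under two assumptions: each setting sits on a single line in the log file itself (mail wrapping can split them), and the timestamp format matches the lines above. The function name is hypothetical.

```python
import re

# Matches lines such as "01/28/14 16:43:39 PREEMPTION_REQUIREMENTS = FALSE".
# Assumption: setting names consist only of upper-case letters and underscores.
SETTING_RE = re.compile(
    r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ([A-Z_]+) = (.*)$"
)


def logged_settings(lines):
    """Return {name: value} for every logged setting; later lines win."""
    settings = {}
    for line in lines:
        match = SETTING_RE.match(line.rstrip("\n"))
        if match:
            settings[match.group(1)] = match.group(2).strip()
    return settings


if __name__ == "__main__":
    # Lines copied from the NegotiatorLog excerpt above.
    excerpt = [
        "01/28/14 16:43:39 PREEMPTION_REQUIREMENTS = FALSE",
        "01/28/14 16:43:39 NEGOTIATOR_INTERVAL = 1 sec",
        "01/28/14 16:43:39 NEGOTIATOR_PRE_JOB_RANK = RemoteOwner =?= UNDEFINED",
    ]
    for name, value in logged_settings(excerpt).items():
        print(name, "->", value)
```

Passing an open file works as well, for example `logged_settings(open(path))` with the path to your own NegotiatorLog. A value other than FALSE for PREEMPTION_REQUIREMENTS would mean the negotiator did not pick up the change.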


>
>>
>> regards,
>> Todd
>>
>
> Thanks,
> Daniel
>
>>
>> On 1/27/2014 9:53 AM, Pek Daniel wrote:
>>>
>>> Some lines from the StartLog:
>>>
>>> 01/27/14 16:45:42 slot22: Request accepted.
>>> 01/27/14 16:45:42 slot22: Remote owner is xxx
>>> 01/27/14 16:45:42 slot22: State change: claiming protocol successful
>>> 01/27/14 16:45:42 slot22: Changing state: Unclaimed -> Claimed
>>> 01/27/14 16:45:46 slot22: Got activate_claim request from shadow
>>> (xxx.xxx.xxx.xxx)
>>> 01/27/14 16:45:46 slot22: Remote job ID is 3920.25
>>> 01/27/14 16:45:46 slot22: Got universe "VANILLA" (5) from request classad
>>> 01/27/14 16:45:47 slot22: State change: claim-activation protocol
>>> successful
>>> 01/27/14 16:45:47 slot22: Changing activity: Idle -> Busy
>>> 01/27/14 16:45:55 slot22: Preempting claim has correct ClaimId.
>>> 01/27/14 16:45:55 slot22: New claim has sufficient rank, preempting
>>> current claim.
>>> 01/27/14 16:45:55 slot22: State change: preempting claim based on user
>>> priority
>>> 01/27/14 16:45:55 slot22: State change: claim retirement ended/expired
>>> 01/27/14 16:45:55 slot22: Changing state and activity: Claimed/Busy ->
>>> Preempting/Vacating
>>>
>>> 2014/1/27 Pek Daniel <pekdaniel@xxxxxxxxx>:
>>>>
>>>> Hi,
>>>>
>>>> I tried my best to turn off preemption completely:
>>>> PREEMPT = FALSE
>>>> SUSPEND = FALSE
>>>> KILL = FALSE
>>>> PREEMPTION_REQUIREMENTS = FALSE
>>>> NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
>>>> RANK = 0
>>>>
>>>> But sometimes during negotiation, I can still see a non-zero value in
>>>> the Preempting column of the output of condor_status -total.
>>>>
>>>> According to the docs:
>>>>
>>>> ``Preempting'': A Condor job is being preempted (possibly via
>>>> checkpointing) in order to clear the machine for either a higher
>>>> priority job or because the machine owner wants the machine back.
>>>>
>>>> Given that I have only a single user and completely identical
>>>> jobs, I don't think the preemption would happen because of a higher
>>>> priority job. Any idea why this happens?
>>>>
>>>> Thanks,
>>>> Daniel
>>>
>>> _______________________________________________
>>> HTCondor-users mailing list
>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with
>>> a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>>
>>
>>
>> --
>> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
>> Center for High Throughput Computing   Department of Computer Sciences
>> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
>> Phone: (608) 263-7132                  Madison, WI 53706-1685