Re: [HTCondor-users] startd bug? Seems to be able to reliably kill startd with GPU preemption on 8.8.7

Hi again,

On 4/9/20 3:18 PM, Carsten Aulbert wrote:
> Weird thing, this throws an error:
> 04/09/20 13:02:21 Classad debug: [0.17381ms] (((RemoteUserPrio >
> SubmitterUserPrio * 1.200000000000000E+00) && ( -(TotalSlotGpus isnt 0))
> && ( -(RequestGpus is 0))) || ( -(TotalSlotGpus isnt 0)) ||
> ((RequestGpus is 0) && ( -(TotalSlotGpus isnt 0)))) && (ClusterId > 0 &&
> ProcId > 0 && JobId isnt "") --> ERROR (attribute
> LastNegotiationCycleMatchRateSustained99 not found to be deleted)
> I have yet to find which expression is triggering this, any help
> appreciated :)

Still no real clue, but it may be related to my heavy use of macros in
that expression, e.g.

# Standard rule: if prio is at least 20% "better", then preemption may
# be considered
HasBetterPrio       =  RemoteUserPrio > SubmitterUserPrio * 1.2

# Is running job claiming a GPU?
DoesRunningJobUseGpus = TotalSlotGpus =!= 0

# Does the incoming job require a GPU
DoesNewJobWantGpus = RequestGpus =?= 0

# Debug helper (should always yield TRUE)
DebugJobInfo = ClusterId > 0 && ProcId > 0 && JobId =!= ""

    ( \
      (($(HasBetterPrio)) && (-($(DoesRunningJobUseGpus))) && (-($(DoesNewJobWantGpus)))) \
      || \
      (-($(DoesRunningJobUseGpus))) \
      || \
      ( ($(DoesNewJobWantGpus)) && (-($(DoesRunningJobUseGpus)))) ))
    ) && ($(DebugJobInfo)))

generates failures, which mostly go away if I, for example, remove the
middle OR term (-($(DoesRunningJobUseGpus))). But if I write it
all without macros, like this

PREEMPTION_REQUIREMENTS = debug((ClusterId > 0 && ProcId > 0 && JobId
=!= "") && ((RemoteUserPrio > SubmitterUserPrio * 1.2) && (TotalSlotGpus
is 0) && (RequestGpus isnt 0)) || (TotalSlotGpus is 0) || (RequestGpus
is 0 &&  TotalSlotGpus is 0))

I am unable to trigger the error again.
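
For comparison, here is a minimal Python sketch of what plain textual
substitution of the $(NAME) macros above produces. It is only an
illustration: the expand() helper is hypothetical, and it assumes
simple text replacement, which ignores the other expansion rules
HTCondor applies to its configuration.

```python
import re

# Macro definitions copied from the configuration fragment above.
MACROS = {
    "HasBetterPrio": "RemoteUserPrio > SubmitterUserPrio * 1.2",
    "DoesRunningJobUseGpus": "TotalSlotGpus =!= 0",
    "DoesNewJobWantGpus": "RequestGpus =?= 0",
    "DebugJobInfo": 'ClusterId > 0 && ProcId > 0 && JobId =!= ""',
}


def expand(text, macros, max_depth=10):
    """Replace $(NAME) with its definition until nothing changes.

    Hypothetical helper: plain text substitution only, not HTCondor's
    actual configuration parser. Unknown names are left untouched.
    """
    pattern = re.compile(r"\$\((\w+)\)")
    for _ in range(max_depth):
        new_text = pattern.sub(
            lambda m: macros.get(m.group(1), m.group(0)), text
        )
        if new_text == text:
            break
        text = new_text
    return text


# The middle OR term from the macro-based expression.
middle_term = "(-($(DoesRunningJobUseGpus)))"
print(expand(middle_term, MACROS))
```

Under that assumption the middle term becomes (-(TotalSlotGpus =!= 0)),
which has the same form as the ( -(TotalSlotGpus isnt 0)) shown in the
debug output, whereas the hand-written version uses (TotalSlotGpus is 0)
in that position. So the two versions are not textually identical;
whether that difference has anything to do with the error is not
established here.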

Anyway, if you want more information, feel free to contact me on or off
list.


Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
