
Re: [HTCondor-users] startd bug? Seems to be able to reliably kill startd with GPU preemption on 8.8.7



Hi again,

On 4/9/20 3:18 PM, Carsten Aulbert wrote:
> Weird thing, this throws an error:
> 
> 04/09/20 13:02:21 Classad debug: [0.17381ms] (((RemoteUserPrio >
> SubmitterUserPrio * 1.200000000000000E+00) && ( -(TotalSlotGpus isnt 0))
> && ( -(RequestGpus is 0))) || ( -(TotalSlotGpus isnt 0)) ||
> ((RequestGpus is 0) && ( -(TotalSlotGpus isnt 0)))) && (ClusterId > 0 &&
> ProcId > 0 && JobId isnt "") --> ERROR (attribute
> LastNegotiationCycleMatchRateSustained99 not found to be deleted)
> 
> I have yet to find which expression is triggering this, any help
> appreciated :)

Still no real clue, but it may be related to my heavy use of macros in
that expression, e.g.

# Standard rule: if prio is at least 20% "better", then preemption may
# be considered
HasBetterPrio       =  RemoteUserPrio > SubmitterUserPrio * 1.2

# Is running job claiming a GPU?
DoesRunningJobUseGpus = TotalSlotGpus =!= 0

# Does the incoming job require a GPU
DoesNewJobWantGpus = RequestGpus =?= 0

# Debug helper (should always yield TRUE)
DebugJobInfo = ClusterId > 0 && ProcId > 0 && JobId =!= ""

PREEMPTION_REQUIREMENTS = debug( \
                                 ( \
                                   (($(HasBetterPrio)) && (-($(DoesRunningJobUseGpus))) && (-($(DoesNewJobWantGpus)))) \
                                   || \
                                   (-($(DoesRunningJobUseGpus))) \
                                   || \
                                   (($(DoesNewJobWantGpus)) && (-($(DoesRunningJobUseGpus)))) \
                                 ) && ($(DebugJobInfo)))

This generates failures, which mostly go away if I, for example, remove
the middle OR expression (-($(DoesRunningJobUseGpus))). However, if I
write it all without macros, like this

PREEMPTION_REQUIREMENTS = debug((ClusterId > 0 && ProcId > 0 && JobId =!= "") && \
    ((RemoteUserPrio > SubmitterUserPrio * 1.2) && (TotalSlotGpus is 0) && (RequestGpus isnt 0)) || \
    (TotalSlotGpus is 0) || \
    (RequestGpus is 0 && TotalSlotGpus is 0))

I am unable to trigger the error again.
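
For reference, the macro substitution can be reproduced outside of
HTCondor. The following is a minimal Python sketch, assuming that
$(NAME) references are expanded purely textually; the expand() helper
and the macros table are illustrative names and not part of HTCondor:

```python
import re

# Macro definitions copied from the configuration above.
macros = {
    "HasBetterPrio": "RemoteUserPrio > SubmitterUserPrio * 1.2",
    "DoesRunningJobUseGpus": "TotalSlotGpus =!= 0",
    "DoesNewJobWantGpus": "RequestGpus =?= 0",
    "DebugJobInfo": 'ClusterId > 0 && ProcId > 0 && JobId =!= ""',
}


def expand(text, table):
    """Replace each $(NAME) by its definition (simplified, textual only)."""
    pattern = re.compile(r"\$\((\w+)\)")
    # Repeat so that macros referring to other macros are expanded too.
    while pattern.search(text):
        text = pattern.sub(lambda m: table[m.group(1)], text)
    return text


# The middle OR term of the macro-based PREEMPTION_REQUIREMENTS.
middle = expand("(-($(DoesRunningJobUseGpus)))", macros)
print(middle)  # (-(TotalSlotGpus =!= 0))
```

The expanded middle term is a unary minus applied to a comparison,
which matches the logged expression, whereas the macro-free version
above tests (TotalSlotGpus is 0) directly. The two configurations are
therefore not textually equivalent.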

Anyway, if you want more information, feel free to contact me on or off
list.

Cheers

Carsten
-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
