
[HTCondor-users] pslot preemption by rank with GPUs crashes the startd



Hi All,

We have a number of dedicated servers, each with between 0 and 7 GPUs. Their configuration essentially reduces to the GPUs and PartitionableSlot features. Additionally, some of the machines have RANK set to prefer their owner's jobs, and pslot preemption is enabled on the negotiator. Whenever an old dslot has to be preempted so that its GPU can go to a new dslot, the startd daemon crashes with the following:

07/26/19 08:44:28 slot1: Schedd sending 1 preempting claims.
07/26/19 08:44:28 slot1_1: Canceled ClaimLease timer (32)
07/26/19 08:44:28 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Killing
07/26/19 08:44:28 slot1_1[36.0]: In Starter::kill() with pid 2725126, sig 3 (SIGQUIT)
07/26/19 08:44:28 Send_Signal(): Doing kill(2725126,3) [SIGQUIT]
07/26/19 08:44:28 slot1_1[36.0]: in starter:killHard starting kill timer
07/26/19 08:44:28 slot1: Total execute space: 70511440
07/26/19 08:44:28 slot1_1: Total execute space: 70511440
07/26/19 08:44:28 slot1: Received ClaimId from schedd (<127.0.0.1:9618>#1564127014#16#...)
07/26/19 08:44:28 slot1: Match requesting resources: cpus=1 memory=128 disk=0.1% GPUs=1
07/26/19 08:44:28 Got execute_dir = /var/lib/condor/execute
07/26/19 08:44:28 slot1: Total execute space: 70511440
07/26/19 08:44:28 bind_DevIds for slot1.5 before : GPUs:{CUDA0, CUDA1, CUDA2, CUDA3, }{1_1, 1_2, 1_3, 1_4, }
07/26/19 08:44:28 ERROR "Failed to bind local resource 'GPUs'" at line 1272 in file /slots/02/dir_540114/userdir/.tmpkvAibo/condor-8.8.4/src/condor_startd.V6/ResAttributes.cpp

It seems to me this is the same problem as described in ticket #6815. Rank preemption can't be guarded with PREEMPTION_REQUIREMENTS, but then again I don't understand why refusing to make such matches is the solution. Is this not supposed to happen in the first place?

So far we have worked around the problem either by avoiding pslots (using static slots with RANK) or by avoiding preemption (using pslots with equal priorities or machine-owner restrictions). Ideally we would like to use pslots, GPUs and RANK combined.
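
For illustration, the combination described above could be sketched roughly as follows. This is a minimal sketch, not the exact configuration on these machines: the owner name "alice" is hypothetical, and the split between execute-node and central-manager settings is an assumption about a typical layout.

```
# --- execute node (startd) ---
# Detect GPUs and advertise them as a custom resource
use feature : GPUs
# One partitionable slot holding all of the machine's resources
use feature : PartitionableSlot
# Prefer jobs from the machine's owner ("alice" is a hypothetical user)
RANK = (Owner == "alice")

# --- central manager (negotiator) ---
# Allow dslots to be preempted so that a new dslot can be carved from the pslot
ALLOW_PSLOT_PREEMPTION = True
```

With settings along these lines, a job from the preferred owner arriving while all GPUs are claimed is what triggers the preemption path shown in the log above.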

Is there something I'm missing that can make this work or is this an issue that needs fixing?


Thanks in advance,

-----------------------

Kosta Polyzos

Systems Administrator

Research Computing Services - University of Surrey

t: +44(0) 1483 68 6859
e: k.polyzos@xxxxxxxxxxxx
p: IT Services, University of Surrey, Guildford, Surrey, GU2 7XH, UK