
[HTCondor-users] startd bug? Seems to be able to reliably kill startd with GPU preemption on 8.8.7



Hi all,

in our quest to get preemption running in a somewhat reliable and
predictable fashion, we have now found a problem when a job that
requested a GPU is already running but is then preempted by the
negotiator in favor of another job requesting a GPU.
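
For context, this is negotiator-side preemption. Our setup is roughly
along the following lines (an illustrative sketch only - the policy
expression and values are placeholders, not our exact configuration):

# condor_config sketch (placeholder policy, not our exact settings)
use feature : GPUs                      # advertise the node's GPUs
NEGOTIATOR_CONSIDER_PREEMPTION = True   # negotiator may preempt running claims
PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmitterUserPrio * 1.2   # hypothetical policy
MAXJOBRETIREMENTTIME = 0                # preempted jobs get no retirement time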

The final lines of the node's startd log are:

04/09/20 07:04:29 Attempting to send update via TCP to collector
condor1.atlas.local <10.20.30.16:9618>
04/09/20 07:04:29 slot1_3: Sent update to 1 collector(s)
04/09/20 07:05:14 slot1: Schedd addr =
<10.20.30.16:9618?addrs=10.20.30.16-9618&noUDP&sock=1998632_e432_4>
04/09/20 07:05:14 slot1: Alive interval = 300
04/09/20 07:05:14 slot1: Schedd sending 1 preempting claims.
04/09/20 07:05:14 slot1_5: Canceled ClaimLease timer (48)
04/09/20 07:05:14 slot1_5: Changing state and activity: Claimed/Busy ->
Preempting/Killing
04/09/20 07:05:14 slot1_5[48.2]: In Starter::kill() with pid 6043, sig 3
(SIGQUIT)
04/09/20 07:05:14 Send_Signal(): Doing kill(6043,3) [SIGQUIT]
04/09/20 07:05:14 slot1_5[48.2]: in starter:killHard starting kill timer
04/09/20 07:05:14 slot1: Total execute space: 859473444
04/09/20 07:05:14 slot1_5: Total execute space: 859473444
04/09/20 07:05:14 slot1: Received ClaimId from schedd
(<10.10.38.22:9618?addrs=10.10.38.22-9618&noUDP&sock=6676_c25a_6>#1586415265#15#...)
04/09/20 07:05:14 slot1: Match requesting resources: cpus=1 memory=128
disk=0.1% GPUs=1
04/09/20 07:05:14 Got execute_dir = /local/condor/execute
04/09/20 07:05:14 slot1: Total execute space: 859473444
04/09/20 07:05:14 bind_DevIds for slot1.1 before : GPUs:{CUDA0, }{1_5, }
04/09/20 07:05:14 ERROR "Failed to bind local resource 'GPUs'" at line
1272 in file /home/tim/CONDOR_SRC/.tmplCDN9v/condor-8.8.7/src/condor_sta
rtd.V6/ResAttributes.cpp
04/09/20 07:05:14 CronJobMgr: 1 jobs alive
04/09/20 07:05:14 slot1_4: Canceled ClaimLease timer (28)
04/09/20 07:05:14 slot1_4: Changing state and activity: Claimed/Busy ->
Preempting/Killing
04/09/20 07:05:14 slot1_4[49.7]: In Starter::kill() with pid 5785, sig 3
(SIGQUIT)
04/09/20 07:05:14 Send_Signal(): Doing kill(5785,3) [SIGQUIT]
04/09/20 07:05:14 slot1_4[49.7]: in starter:killHard starting kill timer
04/09/20 07:05:14 slot1_3: Canceled ClaimLease timer (25)
04/09/20 07:05:14 slot1_3: Changing state and activity: Claimed/Busy ->
Preempting/Killing
04/09/20 07:05:14 slot1_3[49.6]: In Starter::kill() with pid 5783, sig 3
(SIGQUIT)
04/09/20 07:05:14 Send_Signal(): Doing kill(5783,3) [SIGQUIT]
04/09/20 07:05:14 slot1_3[49.6]: in starter:killHard starting kill timer
04/09/20 07:05:14 startd exiting because of fatal exception.
04/09/20 07:05:25 Result of reading /etc/issue:  Debian GNU/Linux 10 \n \l

04/09/20 07:05:25 Using IDs: 4 processors, 4 CPUs, 0 HTs
04/09/20 07:05:25 Reading condor configuration from
'/etc/condor/condor_config'


The problem seems to be this:
04/09/20 07:05:14 bind_DevIds for slot1.1 before : GPUs:{CUDA0, }{1_5, }
04/09/20 07:05:14 ERROR "Failed to bind local resource 'GPUs'" at line
1272 in file /home/tim/CONDOR_SRC/.tmplCDN9v/condor-8.8.7/src/condor_sta
rtd.V6/ResAttributes.cpp

At this point the master sees:

04/09/20 06:54:25 Started DaemonCore process "/usr/sbin/condor_startd",
pid and pgroup = 5719
04/09/20 06:54:29 Setting ready state 'Ready' for STARTD
04/09/20 07:05:14 DefaultReaper unexpectedly called on pid 5719, status
1024.
04/09/20 07:05:14 The STARTD (pid 5719) exited with status 4
04/09/20 07:05:14 Sending obituary for "/usr/sbin/condor_startd"
04/09/20 07:05:15 restarting /usr/sbin/condor_startd in 10 seconds
04/09/20 07:05:25 Started DaemonCore process "/usr/sbin/condor_startd",
pid and pgroup = 6536
04/09/20 07:05:28 Setting ready state 'Ready' for STARTD

One interesting bit is that this is not related to actual GPU usage per
se: the job simply starts /bin/sleep and then does nothing - it only
requests the GPU via its submit file.
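
For reference, the job in question is essentially just the following
(a minimal sketch of the submit description - file names and the sleep
duration are placeholders, but the resource requests match those visible
in the startd log above):

# sleep.sub - minimal GPU-requesting job (placeholder names and values)
universe       = vanilla
executable     = /bin/sleep
arguments      = 3600
request_cpus   = 1
request_memory = 128
request_gpus   = 1
log            = sleep.log
queue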

Has anyone seen this or something similar?

(maybe this is the same issue that happened back in 2017?
https://www-auth.cs.wisc.edu/lists/htcondor-users/2017-November/msg00024.shtml)

Shall we continue the discussion here, or shall I send more
information/logs somewhere to open a ticket?

Cheers and thanks a lot in advance for looking into this

Carsten

-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
