[HTCondor-users] Startd crash w/ "Failed to bind local resource 'GPUs'"




Hello fellow Condorians,

I've been seeing these in my StartLogs for some time now:

----
11/15/17 02:56:51 slot1_8: Changing activity: Idle -> Busy
11/15/17 08:50:00 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 ERROR "Failed to bind local resource 'GPUs'" at line 1237 in file /builddir/build/BUILD/htcondor-8_6_6/sr/condor_startd.V6/ResAttributes.cpp
11/15/17 08:50:00 slot1_8: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 slot1_6: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 slot1_5: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 slot1_7: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 slot1_4: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 slot1_3: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 startd exiting because of fatal exception.
11/15/17 08:50:10 ******************************************************
11/15/17 08:50:10 ** condor_startd (CONDOR_STARTD) STARTING UP
11/15/17 08:50:10 ** /usr/sbin/condor_startd
11/15/17 08:50:10 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
----

I don't know the order of things here:

   * does startd suddenly kill everything, then hit this GPUs error, crash, and restart,

or

   * does startd hit this GPUs error first and then crash, killing everything on its way down?

I've seen this with 8.6.5 and 8.6.6 (Fedora 26 rpms).  It's sporadic (a couple of times a week) on machines
with just one GPU but quite frequent (up to several times per hour) on machines with 8 GPUs.

Has anyone seen something similar?  Any suggestions on how to figure out what is happening?
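
For anyone who wants to compare crash frequency across machines, here is a minimal sketch of a Python script that counts the fatal "Failed to bind local resource 'GPUs'" lines per hour in a StartLog. The function name and the command-line handling are my own choices for illustration, and the pattern assumes the log lines look exactly like the excerpt above; it only counts occurrences and does not answer the ordering question.

```python
import re
import sys
from collections import Counter

# Matches the timestamp prefix and the fatal error text from the StartLog
# excerpt. Assumes the "MM/DD/YY HH:MM:SS" prefix shown in that excerpt.
ERROR_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2}) (\d{2}):\d{2}:\d{2} "
    r"ERROR \"Failed to bind local resource 'GPUs'\""
)

def count_bind_failures(lines):
    """Return a Counter mapping (date, hour) to the number of bind failures."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

if __name__ == "__main__":
    # Usage: python count_bind_failures.py /path/to/StartLog
    # (the log location varies per installation and is not assumed here)
    with open(sys.argv[1], errors="replace") as log:
        for (date, hour), n in sorted(count_bind_failures(log).items()):
            print(f"{date} {hour}:00  {n}")
```

Running it against the StartLog of a one-GPU machine and an 8-GPU machine should make the difference in frequency easy to see.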

Greetings, Bert.