
Re: [HTCondor-users] Startd crash w/ "Failed to bind local resource 'GPUs'"



The order of messages seems to indicate that something is telling the startd to evict all of the jobs (perhaps a SIGTERM to the STARTD) and then the STARTD is aborting while doing that because it can't assign a GPU resource.

The ERROR message indicates that the STARTD failed to assign a specific GPU to a slot during creation of the slot.  This code only executes when a slot is created, either at startup time for static and partitionable slots, or at claim time for dynamic slots.
 
I can't tell from the log fragment which one is happening here, but I presume that this is a dynamic slot being created. 

When the STARTD first starts up and creates the dynamic slot, what is the value of AssignedGPUs for that slot?  Is there more than one GPU with the same name, perhaps?
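
One way to check for duplicate names is to compare the AssignedGPUs values across the slots. The snippet below is only a rough sketch, not anything that ships with HTCondor: it assumes you have already collected each slot's AssignedGPUs string yourself (for example by reading it off the slot ads) and that the value is a comma-separated list of GPU names such as "CUDA0,CUDA1". The function name and the sample data are made up for illustration.

```python
from collections import defaultdict

def find_duplicate_gpus(assigned_by_slot):
    """Return {gpu_name: [slot names]} for every GPU name that appears more than once."""
    owners = defaultdict(list)
    for slot, assigned in assigned_by_slot.items():
        # Assumption: AssignedGPUs is a comma-separated string of GPU names.
        for gpu in assigned.split(","):
            gpu = gpu.strip()
            if gpu:
                owners[gpu].append(slot)
    return {gpu: slots for gpu, slots in owners.items() if len(slots) > 1}

# Hypothetical sample data, not taken from the machine in question.
sample = {"slot1_1": "CUDA0", "slot1_2": "CUDA1", "slot1_8": "CUDA1"}
duplicates = find_duplicate_gpus(sample)
print(duplicates)
```

An empty result would suggest the names are unique; any entry names a GPU that shows up in more than one place.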

The specific order of messages in the log is puzzling me as well:
 
11/15/17 02:56:51 slot1_8: Changing activity: Idle -> Busy
11/15/17 08:50:00 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Killing

Is there really nothing in the log for the 6 hours between these two messages?
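
If the full StartLog is long, a small script can confirm whether such gaps are real. This is a minimal sketch that assumes every log line of interest begins with a timestamp in the "MM/DD/YY HH:MM:SS" form shown above; the function name and the one-hour threshold are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

def find_gaps(lines, min_gap=timedelta(hours=1)):
    """Return (earlier, later) timestamp pairs for consecutive log lines at least min_gap apart."""
    gaps = []
    previous = None
    for line in lines:
        try:
            # The timestamp "11/15/17 02:56:51" occupies the first 17 characters.
            stamp = datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps

# The two lines quoted above.
log_lines = [
    "11/15/17 02:56:51 slot1_8: Changing activity: Idle -> Busy",
    "11/15/17 08:50:00 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Killing",
]
gaps = find_gaps(log_lines)
for earlier, later in gaps:
    print(earlier, "->", later, "gap:", later - earlier)
```

To run it against a real file, pass the open file object instead of the list, e.g. find_gaps(open(path)) with path pointing at your StartLog.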

And then this:

11/15/17 08:50:00 slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 ERROR "Failed to bind local resource 'GPUs'" at line 1237 in file /builddir/build/BUILD/htcondor-8_6_6/sr/condor_startd.V6/ResAttributes.cpp
11/15/17 08:50:00 slot1_8: Changing state and activity: Claimed/Busy -> Preempting/Killing

How does a bit of code that ONLY executes during slot creation get invoked while all of the dynamic
slots are switching to preempting/killing state?  How does that message end up BETWEEN two other
messages about state transitions?   

The fact that all of the slots are preempting/killing at the same time may be a clue here. 
Do you have any idea what the triggering condition may be for this?  Is there something interesting happening
in any of the other logs at 11/15/17 08:50:00 ?

Did something send the STARTD a SIGTERM at 08:50:00 ?

Could you increase the logging level of your startd and reproduce the problem again?  Try adding this to the config:

STARTD_DEBUG = D_CAT D_ALWAYS:2


-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of bert.deknuydt@xxxxxxxxxxxxxxxx
Sent: Wednesday, November 15, 2017 3:58 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Startd crash w/ "Failed to bind local resource 'GPUs'"


Hello fellow Condorians,

I've been seeing these in my StartLogs for some time now...

----
11/15/17 02:56:51 slot1_8: Changing activity: Idle -> Busy
11/15/17 08:50:00 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 ERROR "Failed to bind local resource 'GPUs'" at line 1237 in file /builddir/build/BUILD/htcondor-8_6_6/sr/condor_startd.V6/ResAttributes.cpp
11/15/17 08:50:00 slot1_8: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 slot1_6: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 slot1_5: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 slot1_7: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 slot1_4: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 slot1_3: Changing state and activity: Claimed/Busy -> Preempting/Killing
11/15/17 08:50:00 startd exiting because of fatal exception.
11/15/17 08:50:10 ******************************************************
11/15/17 08:50:10 ** condor_startd (CONDOR_STARTD) STARTING UP
11/15/17 08:50:10 ** /usr/sbin/condor_startd
11/15/17 08:50:10 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
----

I don't know the order of things here:

    * is startd suddenly killing everything, and then gets this GPUs error, then crashes and restarts,

or

    * is startd getting this GPUs error, and then crashes, killing everything on its way.

I've seen this with 8.6.5 and 8.6.6 (Fedora 26 rpms).  It's sporadic (a couple of times a week) on machines
with just one GPU, but quite frequent (up to several times per hour) on machines with 8 GPUs.

Has anyone seen something similar?  Any suggestions for figuring out what is happening?

Greetings, Bert.
