
Re: [HTCondor-users] startd bug? Seems to be able to reliably kill startd with GPU preemption on 8.8.7



Good news. 

A very clueful HTCondor admin seems to have run into this same bug and sent us a patch for it. So this should be
fixed in the next stable release.

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller
Sent: Friday, April 10, 2020 10:16 AM
To: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] startd bug? Seems to be able to reliably kill startd with GPU preemption on 8.8.7

The way the logging works in that code, there will always be a trailing comma.  

So this indicates that the startd is trying to bind your one and only GPU to a slot that already has it bound.
It then aborts when it fails to do so. 
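
To illustrate the trailing-comma point, here is a minimal sketch (not HTCondor code) that reads a fragment like the `GPUs:{CUDA0, }{1_5, }` line quoted further down in this thread. The helper name `parse_devid_groups` is hypothetical, and the assumption that each `{...}` group is a comma-separated list is inferred only from that one log line. Once the empty entry after the trailing comma is dropped, each group holds a single item, so the log does not show a second GPU.

```python
import re

def parse_devid_groups(fragment):
    """Split a 'GPUs:{CUDA0, }{1_5, }' style fragment into lists of entries.

    Hypothetical helper: the brace/comma layout is an assumption taken from
    the single StartLog line in this thread, not from the HTCondor sources.
    """
    # Grab the text inside each {...} group.
    groups = re.findall(r"\{([^}]*)\}", fragment)
    # Split on commas and drop the empty entry left by the trailing comma.
    return [[item.strip() for item in g.split(",") if item.strip()]
            for g in groups]

if __name__ == "__main__":
    for group in parse_devid_groups("GPUs:{CUDA0, }{1_5, }"):
        print(len(group), group)
```

Each group comes back with exactly one entry, which is consistent with the single `CUDA0` device reported by `condor_status` and `nvidia-smi`.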

I'm assuming that this has something to do with preemption. 
It looks like when we preempt a slot that has a GPU to run a job that wants a GPU, we
try to run through some slot initialization code that we should not be hitting, and that fails (probably
because there is no *free* GPU to bind to the slot).

Does this always happen when you try to preempt a slot with a GPU to run a new job, or is there some
other condition that is required to reproduce this bug?

I'm adding a bug ticket for this.
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=7591

Feel free to edit the ticket.

-tj

-----Original Message-----
From: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx> 
Sent: Friday, April 10, 2020 8:22 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] startd bug? Seems to be able to reliably kill startd with GPU preemption on 8.8.7

Hi tj,

On 4/10/20 12:41 AM, John M Knoeller wrote:
> 04/09/20 07:05:14 ERROR "Failed to bind local resource 'GPUs'" at line 1272 ..
> 
> There was a known bug in this code when there were multiple GPUs that had the same device name.
> (i.e. the device list was CUDA0,CUDA0) Is that the case here?
Nope, this box only has a single (old) GPU in it:

condor_status -l slot1@xxxxxxxxxxxxxxxxx |awk 'tolower($1)~/gpu/ {print}'
AssignedGPUs = "CUDA0"
ChildGPUs = { 0,0,0,0 }
DetectedGPUs = 1
GPUs = 1
TotalGPUs = 1
TotalSlotGPUs = 1

nvidia-smi -L
GPU 0: GeForce GT 640 (UUID: GPU-27ce3be5-06de-e8b2-419e-6edc9e05b2c7)

But maybe the startd thinks it has an invisible second one, as some
strings seem to be incomplete in its logs:

StartLog:04/10/20 02:21:07 unbind_DevIds for slot1.3 before :
GPUs:{CUDA0, }{1_5, }

Cheers

Carsten



-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/