
Re: [HTCondor-users] GPUs not detected in 9.0.6 version



On Wed, 2021-09-29 at 14:53:39 +0000, John M Knoeller wrote:
> well that's not good.
> 
> Could you try running the 9.0.6 condor_gpu_discovery with
> 
>   condor_gpu_discovery -verbose -diag
> 
> and send me the results?
> 
> thanks
> -tj
> 
> ________________________________
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Carles Acosta <cacosta@xxxxxx>
> Sent: Tuesday, September 28, 2021 11:53 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] GPUs not detected in 9.0.6 version
> 
> Hi Stuart, TJ,
> 
> Thank you for your replies.
> 
> Regarding the CUDA version, we did not update it. We are using old CUDA 10.1.243 for these GPUs.
> 
> We did some testing as TJ suggests. We are running right now HTCondor 9.0.6 but using the condor_gpu_discovery from HTCondor 9.0.5 and the GPU is correctly discovered:
> 
> # condor_status slot2@xxxxxxxxxxxx -af Gpus DetectedGpus CondorVersion
> 1 GPU-c659279d $CondorVersion: 9.0.6 Sep 23 2021 BuildID: 557184 PackageID: 9.0.6-1 $
> 
> Thus, it is related to the new condor_gpu_discovery binary in version 9.0.6.  In fact:
> 
> [root@gpu03 ~]# /usr/libexec/condor/condor_gpu_discovery-9.0.5
> DetectedGPUs="GPU-c659279d"
> [root@gpu03 ~]# /usr/libexec/condor/condor_gpu_discovery-9.0.6
> Segmentation fault

Hi John, all,

Since I build my own set of packages, I extracted the condor_gpu_discovery binaries from them:

-rwxr-xr-x 1 root root  60040 Oct 23  2020 condor_gpu_discovery-8.8.11
-rwxr-xr-x 1 root root  60040 Aug  2 15:54 condor_gpu_discovery-8.8.15
-rwxr-xr-x 1 root root  88800 Aug  2 17:02 condor_gpu_discovery-9.0.4
-rwxr-xr-x 1 root root  88800 Aug 20 13:10 condor_gpu_discovery-9.0.5
-rwxr-xr-x 1 root root 105216 Sep 24 11:01 condor_gpu_discovery-9.0.6
-rwxr-xr-x 1 root root  96992 Aug 20 14:02 condor_gpu_discovery-9.1.3
-rwxr-xr-x 1 root root 109312 Sep 24 11:36 condor_gpu_discovery-9.2.0

and ran them on a Debian Buster machine equipped with two Kepler K10s:

condor_gpu_discovery-8.8.11
DetectedGPUs="CUDA0, CUDA1"
condor_gpu_discovery-8.8.15
DetectedGPUs="CUDA0, CUDA1"
condor_gpu_discovery-9.0.4
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.0.5
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.0.6
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.1.3
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.2.0
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"

No segfault at all.
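
For anyone who wants to repeat this comparison on their own hardware, here is a minimal sketch of the loop. It assumes the binaries were extracted into the current directory under the names shown above; `run_one` is a hypothetical helper of mine, not something shipped with HTCondor.

```shell
#!/bin/sh
# Sketch: run a command and report how it exited.
# run_one is a hypothetical helper, not part of HTCondor.
run_one() {
    status=0
    out=$("$@" 2>&1) || status=$?
    if [ "$status" -gt 128 ]; then
        # An exit status above 128 usually means death by signal
        # (128 + signal number), so 139 points at signal 11,
        # which is SIGSEGV on Linux.
        echo "$*: killed by signal $((status - 128))"
    else
        echo "$*: exit $status"
    fi
    if [ -n "$out" ]; then
        echo "    $out"
    fi
}

# Assumed layout: binaries named condor_gpu_discovery-<version>
# in the current directory.
for bin in ./condor_gpu_discovery-*; do
    [ -x "$bin" ] || continue
    run_one "$bin"
done
```

Capturing the exit status explicitly makes a crash visible even when the shell's own "Segmentation fault" message gets lost in a log.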

An error that cannot be reproduced won't help you much, though,
so I also took the binaries from the prebuilt Buster packages and ran the same test:

-rwxr-xr-x 1 root root  60040 Jul 29 19:36 condor_gpu_discovery-8.8.15
-rwxr-xr-x 1 root root  84680 Mar 30  2021 condor_gpu_discovery-8.9.13
-rwxr-xr-x 1 root root  88800 Jul 29 18:01 condor_gpu_discovery-9.0.4
-rwxr-xr-x 1 root root  88800 Aug 18 21:25 condor_gpu_discovery-9.0.5
-rwxr-xr-x 1 root root 105216 Sep 23 17:09 condor_gpu_discovery-9.0.6
-rwxr-xr-x 1 root root  96992 Aug 19 21:27 condor_gpu_discovery-9.1.3
-rwxr-xr-x 1 root root 109312 Sep 23 23:37 condor_gpu_discovery-9.2.0

condor_gpu_discovery-8.8.15
DetectedGPUs="CUDA0, CUDA1"
condor_gpu_discovery-8.9.13
DetectedGPUs="CUDA0, CUDA1"
condor_gpu_discovery-9.0.4
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.0.5
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.0.6
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.1.3
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.2.0
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"

Again, no segfault on Debian Buster. I suspect a shared library issue...
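
One way to follow up on the shared library suspicion would be to compare what each binary resolves on a working host and on a failing one. The sketch below reduces `ldd` output to "soname path" pairs that can be saved per host and diffed. `parse_ldd` and `lib_deps` are hypothetical helper names. Note that `ldd` only shows link-time dependencies: if the tool loads the CUDA or NVML libraries at runtime via dlopen() (an assumption on my side), they will not appear here.

```shell
#!/bin/sh
# Sketch: reduce `ldd` output (read from stdin) to sorted
# "soname resolved-path" pairs. Helper names are hypothetical.
parse_ldd() {
    awk '$2 == "=>" {
        if ($3 == "not") print $1, "NOT FOUND"   # unresolved library
        else             print $1, $3
    }' | LC_ALL=C sort
}

# List the link-time dependencies of one binary.
lib_deps() {
    ldd "$1" | parse_ldd
}

# Only act when binaries are given on the command line.
if [ "$#" -gt 0 ]; then
    for bin in "$@"; do
        echo "== $bin"
        lib_deps "$bin"
    done
fi
```

Running this with /usr/libexec/condor/condor_gpu_discovery-9.0.6 as the argument on both hosts and diffing the output would show whether a library is missing or resolves to a different path. For libraries loaded at runtime, glibc's `LD_DEBUG=libs` environment variable may show what the dynamic loader searches for.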

Curious to learn about the actual culprit ;)
 - Steffen


-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~