
Re: [HTCondor-users] GPUs not detected in 9.0.6 version



Hello all,

Thank you for your help. All of our GPUs are now running with CUDA 11, and everything is working fine using condor_gpu_discovery 9.0.6 without any special workaround.

Cheers,

Carles

On Thu, 30 Sept 2021 at 18:25, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
Yep. I'm pretty sure that the crash is caused by an NVML library that does not have the
nvmlDeviceGetMaxMigDeviceCount function. We are missing a check for this being NULL in
one of the code paths.

thanks again,
-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Thursday, September 30, 2021 11:14 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] GPUs not detected in 9.0.6 version
Thank you. This output will be helpful.
-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Anderson, Stuart B. <sba@xxxxxxxxxxx>
Sent: Thursday, September 30, 2021 10:58 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] GPUs not detected in 9.0.6 version
Carles,
        In case it helps, here is what I see from "condor_gpu_discovery -verbose -diag" on an SL7 system with a single GTX 1050 Ti running condor 9.0.6, CUDA 11.2 and NVIDIA driver 460.32.03.


[root@node1 config.d]# /usr/libexec/condor/condor_gpu_discovery -verbose -diag
diag: clearing environment before device enumeration
diag: using nvcuda for gpu discovery
# querying ordinal:0, dev:0x7ffc00000000 using cuDevice* API
# cuDeviceTotalMem(0) returns 0, value = 4236312576
# cuDeviceTotalMem(0x7ffc00000000) returns 0, value = 4236312576
# nvml_getBasicProps() for GPU-23b6505e-b534-990a-9ec9-f4dca5662ab0 returns 0
diag: skipping uuid=GPU-23b6505e-b534-990a-9ec9-f4dca5662ab0 during nvml enumeration because it matches CUDA0
DetectedGPUs="GPU-23b6505e"


Looks like the call to nvml_getBasicProps() is new in 9.0.6. I don't know if it will help, but it might also be worth compiling and running /usr/local/cuda/samples/1_Utilities/deviceQuery to see if there are different BasicProps being returned that Condor is choking on. Here is what I see:


[root@node1 deviceQuery]# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1050 Ti"
 CUDA Driver Version / Runtime Version 11.2 / 11.2
 CUDA Capability Major/Minor version number: 6.1
 Total amount of global memory: 4040 MBytes (4236312576 bytes)
 ( 6) Multiprocessors, (128) CUDA Cores/MP: 768 CUDA Cores
 GPU Max Clock rate: 1392 MHz (1.39 GHz)
 Memory Clock rate: 3504 Mhz
 Memory Bus Width: 128-bit
 L2 Cache Size: 1048576 bytes
 Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
 Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
 Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
 Total amount of constant memory: 65536 bytes
 Total amount of shared memory per block: 49152 bytes
 Total shared memory per multiprocessor: 98304 bytes
 Total number of registers available per block: 65536
 Warp size: 32
 Maximum number of threads per multiprocessor: 2048
 Maximum number of threads per block: 1024
 Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
 Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
 Maximum memory pitch: 2147483647 bytes
 Texture alignment: 512 bytes
 Concurrent copy and kernel execution: Yes with 2 copy engine(s)
 Run time limit on kernels: No
 Integrated GPU sharing Host Memory: No
 Support host page-locked memory mapping: Yes
 Alignment requirement for Surfaces: Yes
 Device has ECC support: Disabled
 Device supports Unified Addressing (UVA): Yes
 Device supports Managed Memory: Yes
 Device supports Compute Preemption: Yes
 Supports Cooperative Kernel Launch: Yes
 Supports MultiDevice Co-op Kernel Launch: Yes
 Device PCI Domain ID / Bus ID / location ID: 0 / 7 / 0
 Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 11.2, NumDevs = 1
Result = PASS


Thanks.


> On Sep 29, 2021, at 10:07 PM, Carles Acosta <cacosta@xxxxxx> wrote:
>
> Hi TJ,
>
> Here you have the results:
>
> # /usr/libexec/condor/condor_gpu_discovery-9.0.6 -verbose -diag
> diag: clearing environment before device enumeration
> diag: using nvcuda for gpu discovery
> # querying ordinal:0, dev:0x833117100000000 using cuDevice* API
> # cuDeviceTotalMem(0) returns 0, value = 4236312576
> # cuDeviceTotalMem(0x833117100000000) returns 0, value = 4236312576
> # nvml_getBasicProps() for GPU-c659279d-ce12-c3b9-f9c4-05a68df7c711 returns 0
> Segmentation fault
>
> On the other hand, using the 9.0.5 version:
>
> # /usr/libexec/condor/condor_gpu_discovery-9.0.5 -verbose -diag
> diag: using nvcuda for gpu discovery
> # querying ordinal:0, dev:0xa2f81b3c00000000 using cuDevice* API
> # cuDeviceTotalMem(0) returns 0, value = 4236312576
> # cuDeviceTotalMem(0xa2f81b3c00000000) returns 0, value = 4236312576
> DetectedGPUs="GPU-c659279d"
>

--
Stuart Anderson
sba@xxxxxxxxxxx




_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es