
[HTCondor-users] GPUs detected but not assigned



I'm having trouble making a couple of GPUs available on their respective machines. Both machines are configured in exactly the same way: GPUs are enabled by a file in /etc/condor/config.d/ containing the following:

use feature : GPUs(-extra -nested)
And the machine is set up as a single partitionable slot like this:

NUM_SLOTS=1
NUM_SLOTS_TYPE_1=1
SLOT_TYPE_1=95%
SLOT_TYPE_1_PARTITIONABLE=true
JOB_DEFAULT_REQUESTMEMORY=1200M
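
(For reference, I believe the discovery command the startd actually runs can be checked with condor_config_val; assuming I'm reading the feature:GPUs metaknob right, these are the relevant knobs:

condor_config_val MACHINE_RESOURCE_INVENTORY_GPUs
condor_config_val GPU_DISCOVERY_EXTRA

If the metaknob above took effect, these should point at condor_gpu_discovery with the -extra -nested arguments.)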

The first machine has two GPUs and this is the output of the command "condor_gpu_discovery -extra -nested":

DetectedGPUs="GPU-7655ffe8, GPU-26455510"
Common=[ CoresPerCU=128; DriverVersion=11.70; ECCEnabled=false; MaxSupportedVersion=11070; ]
GPU_26455510=[ id="GPU-26455510"; Capability=6.1; ClockMhz=1531.00; ComputeUnits=28; DeviceName="NVIDIA TITAN X (Pascal)"; DevicePciBusId="0000:25:00.0"; DeviceUuid="26455510-6ee9-5e28-d21f-cda9dca4671f"; GlobalMemoryMb=12196; ]
GPU_7655ffe8=[ id="GPU-7655ffe8"; Capability=5.2; ClockMhz=1076.00; ComputeUnits=24; DeviceName="NVIDIA GeForce GTX TITAN X"; DevicePciBusId="0000:17:00.0"; DeviceUuid="7655ffe8-f462-74ea-7871-7b436884d079"; GlobalMemoryMb=12213; ]
Right now, a single job is running in a dynamic slot with the following relevant ClassAd attributes:

AvailableGPUs = { GPUs_GPU_26455510 }
DetectedGPUs = "GPU-26455510, GPU-7655ffe8"
GPUs = 1
GPUs_Capability = 6.1
GPUs_ClockMhz = 1531.0
GPUs_ComputeUnits = 28
GPUs_CoresPerCU = 128
GPUs_DeviceName = "NVIDIA TITAN X (Pascal)"
GPUs_DevicePciBusId = "0000:25:00.0"
GPUs_DeviceUuid = "26455510-6ee9-5e28-d21f-cda9dca4671f"
GPUs_DriverVersion = 11.7
GPUs_ECCEnabled = false
GPUs_GlobalMemoryMb = 12196
GPUs_GPU_26455510 = [ Capability = 6.1; DevicePciBusId = "0000:25:00.0"; Id = "GPU-26455510"; ClockMhz = 1531.0; DeviceName = "NVIDIA TITAN X (Pascal)"; DeviceUuid = "26455510-6ee9-5e28-d21f-cda9dca4671f"; GlobalMemoryMb = 12196; CoresPerCU = 128; DriverVersion = 11.7; MaxSupportedVersion = 11070; ComputeUnits = 28; ECCEnabled = false ]
TotalGPUs = 2
TotalSlotGPUs = 1
The parent partitionable slot shows the following:

AvailableGPUs = { }
DetectedGPUs = "GPU-26455510, GPU-7655ffe8"
GPUs = 0
GPUs_Capability = 6.1
GPUs_ClockMhz = 1531.0
GPUs_ComputeUnits = 28
GPUs_CoresPerCU = 128
GPUs_DeviceName = "NVIDIA TITAN X (Pascal)"
GPUs_DevicePciBusId = "0000:25:00.0"
GPUs_DeviceUuid = "26455510-6ee9-5e28-d21f-cda9dca4671f"
GPUs_DriverVersion = 11.7
GPUs_ECCEnabled = false
GPUs_GlobalMemoryMb = 12196
GPUs_GPU_26455510 = [ Capability = 6.1; DevicePciBusId = "0000:25:00.0"; Id = "GPU-26455510"; ClockMhz = 1531.0; DeviceName = "NVIDIA TITAN X (Pascal)"; DeviceUuid = "26455510-6ee9-5e28-d21f-cda9dca4671f"; GlobalMemoryMb = 12196; CoresPerCU = 128; DriverVersion = 11.7; MaxSupportedVersion = 11070; ComputeUnits = 28; ECCEnabled = false ]
TotalGPUs = 2
TotalSlotGPUs = 1
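
(In case it helps to reproduce what I'm looking at, a condor_status query along these lines should show the same picture. AssignedGPUs is my guess at another attribute worth comparing; I'm not certain that's the exact name:

condor_status <machine-name> -af:h Name SlotType GPUs TotalSlotGPUs AvailableGPUs AssignedGPUs

where <machine-name> is the execute host in question.)
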
For some reason, both GPUs are detected but only one of them is ever made available. The exact same issue happens on another machine where a single GPU is affected, but since that machine has 10 GPUs I think this first one will be easier to debug. Nonetheless, this is the "condor_gpu_discovery -extra -nested" output on that machine; the failing GPU there is GPU-19decd9c (the NVIDIA TITAN Xp):

DetectedGPUs="GPU-5b862b1f, GPU-19994730, GPU-f51b9bb1, GPU-3f09d5c1, GPU-41fb1ac9, GPU-71245f0d, GPU-53044473, GPU-19decd9c, GPU-0ba8f63b, GPU-29af7eea"
Common=[ DriverVersion=11.70; ECCEnabled=false; MaxSupportedVersion=11070; ]
GPU_0ba8f63b=[ id="GPU-0ba8f63b"; Capability=7.5; ClockMhz=1545.00; ComputeUnits=68; CoresPerCU=64; DeviceName="NVIDIA GeForce RTX 2080 Ti"; DevicePciBusId="0000:8B:00.0"; DeviceUuid="0ba8f63b-9278-8484-63a1-062cfbf94ef0"; GlobalMemoryMb=11019; ]
GPU_19994730=[ id="GPU-19994730"; Capability=8.6; ClockMhz=1695.00; ComputeUnits=82; CoresPerCU=128; DeviceName="NVIDIA GeForce RTX 3090"; DevicePciBusId="0000:1B:00.0"; DeviceUuid="19994730-1109-782e-9e4b-7fc2d82bfdd4"; GlobalMemoryMb=24268; ]
GPU_19decd9c=[ id="GPU-19decd9c"; Capability=6.1; ClockMhz=1582.00; ComputeUnits=30; CoresPerCU=128; DeviceName="NVIDIA TITAN Xp"; DevicePciBusId="0000:8A:00.0"; DeviceUuid="19decd9c-f020-1f43-ba7c-343b3ef84f29"; GlobalMemoryMb=12196; ]
GPU_29af7eea=[ id="GPU-29af7eea"; Capability=7.5; ClockMhz=1650.00; ComputeUnits=68; CoresPerCU=64; DeviceName="NVIDIA GeForce RTX 2080 Ti"; DevicePciBusId="0000:8C:00.0"; DeviceUuid="29af7eea-fa59-647c-192b-c1ce2a33039a"; GlobalMemoryMb=11019; ]
GPU_3f09d5c1=[ id="GPU-3f09d5c1"; Capability=8.6; ClockMhz=1695.00; ComputeUnits=82; CoresPerCU=128; DeviceName="NVIDIA GeForce RTX 3090"; DevicePciBusId="0000:1D:00.0"; DeviceUuid="3f09d5c1-f703-0e36-1868-8762fb21596f"; GlobalMemoryMb=24268; ]
GPU_41fb1ac9=[ id="GPU-41fb1ac9"; Capability=7.5; ClockMhz=1545.00; ComputeUnits=68; CoresPerCU=64; DeviceName="NVIDIA GeForce RTX 2080 Ti"; DevicePciBusId="0000:1E:00.0"; DeviceUuid="41fb1ac9-28dc-a15a-63d0-7dc31ec8e622"; GlobalMemoryMb=11019; ]
GPU_53044473=[ id="GPU-53044473"; Capability=7.5; ClockMhz=1620.00; ComputeUnits=68; CoresPerCU=64; DeviceName="NVIDIA GeForce RTX 2080 Ti"; DevicePciBusId="0000:89:00.0"; DeviceUuid="53044473-f1ea-868e-1950-82da63dc2b22"; GlobalMemoryMb=11019; ]
GPU_5b862b1f=[ id="GPU-5b862b1f"; Capability=8.6; ClockMhz=1695.00; ComputeUnits=82; CoresPerCU=128; DeviceName="NVIDIA GeForce RTX 3090"; DevicePciBusId="0000:1A:00.0"; DeviceUuid="5b862b1f-556d-67b2-19d8-6bc1620da638"; GlobalMemoryMb=24268; ]
GPU_71245f0d=[ id="GPU-71245f0d"; Capability=7.5; ClockMhz=1545.00; ComputeUnits=68; CoresPerCU=64; DeviceName="NVIDIA GeForce RTX 2080 Ti"; DevicePciBusId="0000:88:00.0"; DeviceUuid="71245f0d-53de-c6bc-9576-2e6928039564"; GlobalMemoryMb=11019; ]
GPU_f51b9bb1=[ id="GPU-f51b9bb1"; Capability=8.6; ClockMhz=1695.00; ComputeUnits=82; CoresPerCU=128; DeviceName="NVIDIA GeForce RTX 3090"; DevicePciBusId="0000:1C:00.0"; DeviceUuid="f51b9bb1-4c6a-8b37-5cf0-85b4dee09cc1"; GlobalMemoryMb=24268; ]

I became aware of this problem because these two GPUs would always sit unused. I initially thought that some other resource might be exhausted, but there is CPU, memory, and disk available on both machines, and after looking at the slot ClassAds I'm fairly sure that is not the issue here.

Coincidentally, in both cases the failing GPU is the one with the lowest CUDA capability (5.2 on the first machine, 6.1 on the second). A third machine with 6 GPUs is set up in exactly the same way and shows no problems there, although all 6 of its GPUs are identical (NVIDIA RTX A5000, CUDA capability 8.6). I did find a possibly related thread on the HTCondor-users list, where the CUDA capability involved was 1.2 (quite low), mentioning "a poorly-implemented and poorly- or undocumented change in the [CUDA] API and data formats between 10.0 and 10.1 or something like that which tripped things up for GPU discovery" (https://www-auth.cs.wisc.edu/lists/htcondor-users/2020-April/msg00007.shtml). I don't understand much of that; I just mention it in case it is related.
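
(If the low compute capability really is what trips up the assignment, I would expect the StartLog to say something about it when the slots are created. My plan, assuming this is the right knob and the default Debian log location, is to turn up startd logging and watch for GPU-related lines:

STARTD_DEBUG = D_FULLDEBUG

followed by a condor_reconfig and something like "grep -i gpu /var/log/condor/StartLog".)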

In all cases I'm using the latest NVIDIA driver (515.48.07), CUDA driver version 11.7, and the OS is Debian 11 (bullseye). The issue was already happening with HTCondor 9.9, and it persists on both machines after updating to 9.11 (full version: 9.11.0 2022-08-25 BuildID: 602587 PackageID: 9.11.0-1.1).

Regards,
Javier Barbero Gómez