
Re: [HTCondor-users] GPUs detected but not assigned



You are right, that was a typo and it was supposed to be "==". Try just:
condor_status -af GPUs
The constraint was only me trying to filter things out to make the output easier to read, since I didn't know your pool size.

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Javier Barbero <jbarbero@xxxxxx>
Sent: Friday, September 16, 2022 6:00 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] GPUs detected but not assigned
 

I think there was a small typo in the command: SlotType=\"Partitionable\" should be SlotType==\"Partitionable\" (with double equals sign), right? Otherwise I get "Error: invalid constraint".


The output is blank, I only get the column headers:


GPUs    Name    AssignedGPUs


Javier Barbero Gómez


El 16/9/22 a las 21:11, Cole Bollig via HTCondor-users escribió:
Hi Javier,

What does the following command output, specifically for the machines in question?
condor_status -const "GPUs=!=UNDEFINED && GPUs>0 && SlotType=\"Partitionable\"" -af:ht GPUs Name AssignedGPUs 

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Javier Barbero <jbarbero@xxxxxx>
Sent: Friday, September 16, 2022 9:19 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] GPUs detected but not assigned
 

I'm having trouble making a couple of GPUs available in their respective machines. They are both configured in the exact same way: GPUs are enabled by having a file in /etc/condor/config.d/ containing the following:

@use feature : GPUs(-extra -nested)
And the machine is set up as a single partitionable slot like this:
NUM_SLOTS=1
NUM_SLOTS_TYPE_1=1
SLOT_TYPE_1=95%
SLOT_TYPE_1_PARTITIONABLE=true
JOB_DEFAULT_REQUESTMEMORY=1200M

The first machine has two GPUs and this is the output of the command "condor_gpu_discovery -extra -nested":

DetectedGPUs="GPU-7655ffe8, GPU-26455510"
Common=[ CoresPerCU=128; DriverVersion=11.70; ECCEnabled=false; MaxSupportedVersion=11070; ]
GPU_26455510=[ id="GPU-26455510"; Capability=6.1; ClockMhz=1531.00; ComputeUnits=28; DeviceName="NVIDIA TITAN X (Pascal)"; DevicePciBusId="0000:25:00.0"; DeviceUuid="26455510-6ee9-5e28-d21f-cda9dca4671f"; GlobalMemoryMb=12196; ]
GPU_7655ffe8=[ id="GPU-7655ffe8"; Capability=5.2; ClockMhz=1076.00; ComputeUnits=24; DeviceName="NVIDIA GeForce GTX TITAN X"; DevicePciBusId="0000:17:00.0"; DeviceUuid="7655ffe8-f462-74ea-7871-7b436884d079"; GlobalMemoryMb=12213; ]
Right now, a single job is running in a dynamic slot with the following relevant ClassAd attributes:
AvailableGPUs = { GPUs_GPU_26455510 }
DetectedGPUs = "GPU-26455510, GPU-7655ffe8"
GPUs = 1
GPUs_Capability = 6.1
GPUs_ClockMhz = 1531.0
GPUs_ComputeUnits = 28
GPUs_CoresPerCU = 128
GPUs_DeviceName = "NVIDIA TITAN X (Pascal)"
GPUs_DevicePciBusId = "0000:25:00.0"
GPUs_DeviceUuid = "26455510-6ee9-5e28-d21f-cda9dca4671f"
GPUs_DriverVersion = 11.7
GPUs_ECCEnabled = false
GPUs_GlobalMemoryMb = 12196
GPUs_GPU_26455510 = [ Capability = 6.1; DevicePciBusId = "0000:25:00.0"; Id = "GPU-26455510"; ClockMhz = 1531.0; DeviceName = "NVIDIA TITAN X (Pascal)"; DeviceUuid = "26455510-6ee9-5e28-d21f-cda9dca4671f"; GlobalMemoryMb = 12196; CoresPerCU = 128; DriverVersion = 11.7; MaxSupportedVersion = 11070; ComputeUnits = 28; ECCEnabled = false ]
TotalGPUs = 2
TotalSlotGPUs = 1
The parent partitionable slot shows the following:
AvailableGPUs = {  }
DetectedGPUs = "GPU-26455510, GPU-7655ffe8"
GPUs = 0
GPUs_Capability = 6.1
GPUs_ClockMhz = 1531.0
GPUs_ComputeUnits = 28
GPUs_CoresPerCU = 128
GPUs_DeviceName = "NVIDIA TITAN X (Pascal)"
GPUs_DevicePciBusId = "0000:25:00.0"
GPUs_DeviceUuid = "26455510-6ee9-5e28-d21f-cda9dca4671f"
GPUs_DriverVersion = 11.7
GPUs_ECCEnabled = false
GPUs_GlobalMemoryMb = 12196
GPUs_GPU_26455510 = [ Capability = 6.1; DevicePciBusId = "0000:25:00.0"; Id = "GPU-26455510"; ClockMhz = 1531.0; DeviceName = "NVIDIA TITAN X (Pascal)"; DeviceUuid = "26455510-6ee9-5e28-d21f-cda9dca4671f"; GlobalMemoryMb = 12196; CoresPerCU = 128; DriverVersion = 11.7; MaxSupportedVersion = 11070; ComputeUnits = 28; ECCEnabled = false ]
TotalGPUs = 2
TotalSlotGPUs = 1
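
The mismatch between the two slot ads above can be made explicit with a small script. This is only a sketch: the helper names (gpu_ids, available_ids, unaccounted) are hypothetical, the attribute values are pasted in by hand from the output above, and it assumes that each slot's AvailableGPUs list names the GPUs that slot accounts for.

```python
def gpu_ids(detected: str) -> set[str]:
    """Split a DetectedGPUs string such as "GPU-a, GPU-b" into a set of ids."""
    return {part.strip() for part in detected.split(",") if part.strip()}


def available_ids(available: str) -> set[str]:
    """Turn an AvailableGPUs list such as "{ GPUs_GPU_26455510 }" into ids
    of the form "GPU-26455510" so they can be compared with DetectedGPUs."""
    inner = available.strip().strip("{}")
    names = [name.strip() for name in inner.split(",") if name.strip()]
    # "GPUs_GPU_26455510" -> "GPU_26455510" -> "GPU-26455510"
    return {name.removeprefix("GPUs_").replace("_", "-", 1) for name in names}


def unaccounted(detected: str, slot_lists: list[str]) -> set[str]:
    """Return detected GPUs that appear in none of the given slot lists."""
    seen: set[str] = set()
    for slot_list in slot_lists:
        seen |= available_ids(slot_list)
    return gpu_ids(detected) - seen


# Values copied from the partitionable slot and the dynamic slot above.
missing = unaccounted(
    "GPU-26455510, GPU-7655ffe8",
    ["{  }", "{ GPUs_GPU_26455510 }"],
)
print(sorted(missing))  # ['GPU-7655ffe8']
```

With the values shown, the only detected GPU that does not appear in either slot is GPU-7655ffe8.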
For some reason, both GPUs are detected but only one is available. The exact same issue happens on another machine, where a single GPU is affected; that machine has 10 GPUs, so I think the problem will be easier to debug on the first one. Nonetheless, this is the "condor_gpu_discovery -extra -nested" output on that machine. The failing GPU in this case is the one with UUID GPU_19decd9c:
DetectedGPUs="GPU-5b862b1f, GPU-19994730, GPU-f51b9bb1, GPU-3f09d5c1, GPU-41fb1ac9, GPU-71245f0d, GPU-53044473, GPU-19decd9c, GPU-0ba8f63b, GPU-29af7eea"
Common=[ DriverVersion=11.70; ECCEnabled=false; MaxSupportedVersion=11070; ]
GPU_0ba8f63b=[ id="GPU-0ba8f63b"; Capability=7.5; ClockMhz=1545.00; ComputeUnits=68; CoresPerCU=64; DeviceName="NVIDIA GeForce RTX 2080 Ti"; DevicePciBusId="0000:8B:00.0"; DeviceUuid="0ba8f63b-9278-8484-63a1-062cfbf94ef0"; GlobalMemoryMb=11019; ]
GPU_19994730=[ id="GPU-19994730"; Capability=8.6; ClockMhz=1695.00; ComputeUnits=82; CoresPerCU=128; DeviceName="NVIDIA GeForce RTX 3090"; DevicePciBusId="0000:1B:00.0"; DeviceUuid="19994730-1109-782e-9e4b-7fc2d82bfdd4"; GlobalMemoryMb=24268; ]
GPU_19decd9c=[ id="GPU-19decd9c"; Capability=6.1; ClockMhz=1582.00; ComputeUnits=30; CoresPerCU=128; DeviceName="NVIDIA TITAN Xp"; DevicePciBusId="0000:8A:00.0"; DeviceUuid="19decd9c-f020-1f43-ba7c-343b3ef84f29"; GlobalMemoryMb=12196; ]
GPU_29af7eea=[ id="GPU-29af7eea"; Capability=7.5; ClockMhz=1650.00; ComputeUnits=68; CoresPerCU=64; DeviceName="NVIDIA GeForce RTX 2080 Ti"; DevicePciBusId="0000:8C:00.0"; DeviceUuid="29af7eea-fa59-647c-192b-c1ce2a33039a"; GlobalMemoryMb=11019; ]
GPU_3f09d5c1=[ id="GPU-3f09d5c1"; Capability=8.6; ClockMhz=1695.00; ComputeUnits=82; CoresPerCU=128; DeviceName="NVIDIA GeForce RTX 3090"; DevicePciBusId="0000:1D:00.0"; DeviceUuid="3f09d5c1-f703-0e36-1868-8762fb21596f"; GlobalMemoryMb=24268; ]
GPU_41fb1ac9=[ id="GPU-41fb1ac9"; Capability=7.5; ClockMhz=1545.00; ComputeUnits=68; CoresPerCU=64; DeviceName="NVIDIA GeForce RTX 2080 Ti"; DevicePciBusId="0000:1E:00.0"; DeviceUuid="41fb1ac9-28dc-a15a-63d0-7dc31ec8e622"; GlobalMemoryMb=11019; ]
GPU_53044473=[ id="GPU-53044473"; Capability=7.5; ClockMhz=1620.00; ComputeUnits=68; CoresPerCU=64; DeviceName="NVIDIA GeForce RTX 2080 Ti"; DevicePciBusId="0000:89:00.0"; DeviceUuid="53044473-f1ea-868e-1950-82da63dc2b22"; GlobalMemoryMb=11019; ]
GPU_5b862b1f=[ id="GPU-5b862b1f"; Capability=8.6; ClockMhz=1695.00; ComputeUnits=82; CoresPerCU=128; DeviceName="NVIDIA GeForce RTX 3090"; DevicePciBusId="0000:1A:00.0"; DeviceUuid="5b862b1f-556d-67b2-19d8-6bc1620da638"; GlobalMemoryMb=24268; ]
GPU_71245f0d=[ id="GPU-71245f0d"; Capability=7.5; ClockMhz=1545.00; ComputeUnits=68; CoresPerCU=64; DeviceName="NVIDIA GeForce RTX 2080 Ti"; DevicePciBusId="0000:88:00.0"; DeviceUuid="71245f0d-53de-c6bc-9576-2e6928039564"; GlobalMemoryMb=11019; ]
GPU_f51b9bb1=[ id="GPU-f51b9bb1"; Capability=8.6; ClockMhz=1695.00; ComputeUnits=82; CoresPerCU=128; DeviceName="NVIDIA GeForce RTX 3090"; DevicePciBusId="0000:1C:00.0"; DeviceUuid="f51b9bb1-4c6a-8b37-5cf0-85b4dee09cc1"; GlobalMemoryMb=24268; ]

I became aware of this problem because these two GPUs would always be unused. I initially thought that some other resource might have been unavailable, but no: there is available CPU, memory and disk on both machines. After seeing the slot ClassAds I'm pretty sure that is not the issue here.

Coincidentally, in both cases, the failing GPU is the one with the lowest CUDA capability (5.2 in the first case and 6.1 in the second case). A third machine with 6 GPUs is set up in the exact same way but no problems arise there, although that one has 6 identical GPUs (all NVIDIA RTX A5000 with 8.6 CUDA capability). I have been able to find a possibly related thread in the HTCondor-users list where the CUDA capability was 1.2 (quite low) mentioning that there was "a poorly-implemented and poorly- or undocumented change in the [CUDA] API and data formats between 10.0 and 10.1 or something like that which tripped things up for GPU discovery" (https://www-auth.cs.wisc.edu/lists/htcondor-users/2020-April/msg00007.shtml). I don't understand much of that, I just mention it in case it could be related.
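
The observation about the lowest CUDA capability can be checked mechanically against the discovery output. The following is a rough sketch, not part of HTCondor: the function name lowest_capability and the regular expression are hypothetical, and the sample lines are trimmed copies of the first machine's output. It only confirms the coincidence; it says nothing about the cause.

```python
import re

# Matches lines like: GPU_7655ffe8=[ id="GPU-7655ffe8"; Capability=5.2; ... ]
GPU_LINE = re.compile(r"^(GPU_[0-9a-f]+)=\[.*?\bCapability=([0-9.]+);")


def lowest_capability(discovery_output: str) -> tuple[str, float]:
    """Return the (name, capability) of the GPU with the lowest Capability
    found in the nested output of condor_gpu_discovery."""
    capabilities: dict[str, float] = {}
    for line in discovery_output.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            capabilities[match.group(1)] = float(match.group(2))
    name = min(capabilities, key=capabilities.get)
    return name, capabilities[name]


# Trimmed copy of the first machine's output (other attributes omitted).
first_machine = """
DetectedGPUs="GPU-7655ffe8, GPU-26455510"
Common=[ CoresPerCU=128; DriverVersion=11.70; ECCEnabled=false; MaxSupportedVersion=11070; ]
GPU_26455510=[ id="GPU-26455510"; Capability=6.1; DeviceName="NVIDIA TITAN X (Pascal)"; ]
GPU_7655ffe8=[ id="GPU-7655ffe8"; Capability=5.2; DeviceName="NVIDIA GeForce GTX TITAN X"; ]
"""

print(lowest_capability(first_machine))  # ('GPU_7655ffe8', 5.2)
```

For the first machine this picks out GPU_7655ffe8, the same GPU that never shows up as available.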

In all cases I'm using the latest NVIDIA driver (515.48.07) and CUDA driver version 11.7, and the OS is Debian 11 (bullseye). The issue was happening with HTCondor 9.9, and it persists on both machines after updating to 9.11 (full version: 9.11.0 2022-08-25 BuildID: 602587 PackageID: 9.11.0-1.1).

Regards,
Javier Barbero Gómez

