Re: [HTCondor-users] Disappearing OpenCL GPUs



You can force condor_gpu_discovery to do OpenCL detection by adding the -opencl argument.

 

condor_gpu_discovery -opencl -extra

 

Otherwise it will prefer CUDA detection over OpenCL detection, and it will never do both, so that it doesn’t overcount GPUs that show up both ways.
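
To make the startd use OpenCL detection as well, something like the following might work. This is only a sketch: it assumes that the GPU_DISCOVERY_EXTRA setting from the 19gpu.config quoted below is passed through to condor_gpu_discovery as extra arguments, which has not been verified on this node.

```
use feature : GPUs

# Assumption: the contents of GPU_DISCOVERY_EXTRA are appended to the
# condor_gpu_discovery command line that the startd runs.
GPU_DISCOVERY_EXTRA = -extra -opencl
```

If that assumption holds, the "Local machine resource GPUs" line in the StartLog should report a non-zero count once the startd has been restarted.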

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Chris Brew - STFC UKRI via HTCondor-users
Sent: Friday, September 29, 2023 5:40 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Chris Brew - STFC UKRI <chris.brew@xxxxxxxxxx>
Subject: [HTCondor-users] Disappearing OpenCL GPUs

 

Hi,

 

This is all with Condor 10.0.7 on Rocky Linux 8.

 

I’ve got a test node with a couple of AMD Instinct MI GPGPU cards in it (i.e. not CUDA), but I’m having no luck getting them to show up in the machine ClassAds.

 

condor_gpu_discovery sees them fine:

 

# /usr/libexec/condor/condor_gpu_discovery -extra -properties

DetectedGPUs="OCL0, OCL1"

Common=[ ClockMhz=1700; ComputeUnits=104; DeviceName="gfx90a:sramecc+:xnack-"; ECCEnabled=false; GlobalMemoryMb=65520; OpenCLVersion=2.0; ]

OCL0=[ id="OCL0"; ]

OCL1=[ id="OCL1"; ]

 

But the startd doesn’t:

 

# grep -i gpu /var/log/condor/StartLog

09/29/23 10:39:03    /etc/condor/config.d/19gpu.config

09/29/23 10:39:03    /etc/condor/config.d/30start_gpu.config

09/29/23 10:39:06 Local machine resource GPUs = 0

09/29/23 10:39:06 Allocating auto shares for slot type 1: Cpus: 96.000000, Memory: 257000, Swap: auto, Disk: auto, GPUs: auto

09/29/23 10:39:06   slot type 1: Cpus: 96.000000, Memory: 257000, Swap: 100.00%, Disk: 100.00%, GPUs: 0

09/29/23 10:39:06 bind DevIds tag=GPUs contraint=

09/29/23 10:39:06 CronJobList: Adding job 'GPUs_MONITOR'

09/29/23 10:39:06 CronJob: Initializing job 'GPUs_MONITOR' (/usr/libexec/condor/condor_gpu_utilization)

 

19gpu.config only contains:

 

use feature : GPUs

GPU_DISCOVERY_EXTRA = -extra

 

And 30start_gpu.config only contains:

 

START = $(START) && ( (RequestGPUs >= 1) )

 

I thought it might be because of /usr/libexec/condor/condor_gpu_utilization, which does not seem to work for non-CUDA cards:

 

# /usr/libexec/condor/condor_gpu_utilization

# Unable to load a CUDA library (libcuda.so or libcudart.so).

Hanging to prevent process churn.

 

But I think I managed to disable that by expanding the ‘use feature:GPUs’ and removing the ‘use feature:GpuMonitor’.

 

I’m now stuck. I have a very vague recollection that when I first got some NVIDIA cards they showed up as OpenCL devices. Did I do something then to make them show up as CUDA devices that is now preventing these devices from showing up? condor_config_val -dump doesn’t show any likely suspects.

 

It’s entirely possible I haven’t got the drivers and/or software correctly installed, but rocm-smi and rocminfo do see the cards as expected.

 

Thanks,

Chris.