
Re: [HTCondor-users] Disappearing OpenCL GPUs



Thanks TJ, though condor_gpu_discovery does find the OpenCL GPUs without that flag. However, testing it may just have led me to stumble on the answer. Spot the difference:

 

$ /usr/libexec/condor/condor_gpu_discovery -opencl -extra

DetectedGPUs=0

$ sudo /usr/libexec/condor/condor_gpu_discovery -opencl -extra

DetectedGPUs="OCL0, OCL1"

Common=[ ClockMhz=1700; ComputeUnits=104; DeviceName="gfx90a:sramecc+:xnack-"; ECCEnabled=false; GlobalMemoryMb=65520; OpenCLVersion=2.0; ]

OCL0=[ id="OCL0"; ]

OCL1=[ id="OCL1"; ]

 

This works without privileges:

 

$ rocm-smi

======================= ROCm System Management Interface =======================

================================= Concise Info =================================

GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%

0    32.0c  41.0W   800Mhz  1600Mhz  0%   auto  300.0W    0%   0%

1    32.0c  40.0W   800Mhz  1600Mhz  0%   auto  300.0W    0%   0%

================================================================================

============================= End of ROCm SMI Log ==============================

 

But not this:

 

$ rocminfo

ROCk module is loaded

Unable to open /dev/kfd read-write: Permission denied

brew is not member of "video" group, the default DRM access group. Users must be a member of the "video" group or another DRM access group in order for ROCm applications to run successfully.

 

But I want anyone I let onto the host to be able to use the GPUs; that's sort of the point:

 

$ ls -l /dev/kfd

crw-rw---- 1 root video 241, 0 Sep 22 06:38 /dev/kfd

[brew@hepacc13 ~]$ sudo chmod o+rw /dev/kfd
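
(That chmod won't survive a reboot or a driver reload, so for a permanent fix I'd probably use a udev rule instead; the rule-file name below is arbitrary, and ROCm's docs also suggest the alternative of simply adding every user to the "video"/"render" groups. A sketch only:)

```
# /etc/udev/rules.d/70-kfd.rules  (file name arbitrary)
# Make the ROCm compute device world-accessible when it is created
KERNEL=="kfd", MODE="0666"
```

followed by `sudo udevadm control --reload-rules && sudo udevadm trigger`, or a reboot.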

 

And now this works:

 

$ rocminfo

ROCk module is loaded

=====================

HSA System Attributes

=====================

Runtime Version:         1.1

 

One quick condor restart later:

 

$ condor_status -l hepacc13 | grep -i gpu

AssignedGPUs = "OCL0,OCL1"

AvailableGPUs = { GPUs_OCL0,GPUs_OCL1 }

ChildGPUs = {  }

DetectedGPUs = "OCL0, OCL1"

GPUs = 2

GPUs_ClockMhz = 1700

GPUs_ComputeUnits = 104

GPUs_DeviceName = "gfx90a:sramecc+:xnack-"

 

Not a Condor problem; sorry for the noise, and thank you for the sounding board.

 

Yours,

Chris.

 

On 29/09/2023, 15:24, "John M Knoeller" <johnkn@xxxxxxxxxxx> wrote:

 

You can force condor_gpu_discovery to do OpenCL detection by adding the -opencl argument.

 

condor_gpu_discovery -opencl -extra

 

Otherwise it will prefer CUDA detection over OpenCL detection, and will never do both, so that it doesn't end up over-counting GPUs that show up both ways.

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Chris Brew - STFC UKRI via HTCondor-users
Sent: Friday, September 29, 2023 5:40 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Chris Brew - STFC UKRI <chris.brew@xxxxxxxxxx>
Subject: [HTCondor-users] Disappearing OpenCL GPUs

 

Hi,

 

This is all with Condor 10.0.7 on Rocky Linux 8.

 

I’ve got a test node with a couple of AMD Instinct MI GPGPU cards (i.e. not CUDA) in it, but I’m having no luck getting them to show up in the machine ClassAds.

 

condor_gpu_discovery sees them fine:

 

# /usr/libexec/condor/condor_gpu_discovery -extra -properties

DetectedGPUs="OCL0, OCL1"

Common=[ ClockMhz=1700; ComputeUnits=104; DeviceName="gfx90a:sramecc+:xnack-"; ECCEnabled=false; GlobalMemoryMb=65520; OpenCLVersion=2.0; ]

OCL0=[ id="OCL0"; ]

OCL1=[ id="OCL1"; ]

 

But the StartD doesn’t:

 

# grep -i gpu /var/log/condor/StartLog

09/29/23 10:39:03    /etc/condor/config.d/19gpu.config

09/29/23 10:39:03    /etc/condor/config.d/30start_gpu.config

09/29/23 10:39:06 Local machine resource GPUs = 0

09/29/23 10:39:06 Allocating auto shares for slot type 1: Cpus: 96.000000, Memory: 257000, Swap: auto, Disk: auto, GPUs: auto

09/29/23 10:39:06   slot type 1: Cpus: 96.000000, Memory: 257000, Swap: 100.00%, Disk: 100.00%, GPUs: 0

09/29/23 10:39:06 bind DevIds tag=GPUs contraint=

09/29/23 10:39:06 CronJobList: Adding job 'GPUs_MONITOR'

09/29/23 10:39:06 CronJob: Initializing job 'GPUs_MONITOR' (/usr/libexec/condor/condor_gpu_utilization)

 

19gpu.config only contains:

 

use feature : GPUs

GPU_DISCOVERY_EXTRA = -extra

 

And 30start_gpu.config only contains:

 

START = $(START) && ( (RequestGPUs >= 1) )
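
(With that in place, only jobs that actually ask for a GPU can match the slot; a minimal submit-file fragment, purely for illustration, where the request_GPUs line is what satisfies the START expression:)

```
# hypothetical test job
executable   = /usr/bin/rocminfo
request_GPUs = 1
queue
```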

 

I thought it might be because of /usr/libexec/condor/condor_gpu_utilization, which does not seem to work for non-CUDA cards:

 

# /usr/libexec/condor/condor_gpu_utilization

# Unable to load a CUDA library (libcuda.so or libcudart.so).

Hanging to prevent process churn.

 

But I think I managed to disable that by expanding ‘use feature:GPUs’ in place and removing the ‘use feature:GpuMonitor’ part.
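
(The exact expansion of the metaknob can be printed with `condor_config_val use feature:GPUs`; the sketch below is roughly what the discovery part looks like, minus the monitor cron job, but treat the knob values as assumptions and check against that output:)

```
# Roughly the discovery half of 'use feature : GPUs', without the GpuMonitor part
# (verify with: condor_config_val use feature:GPUs)
MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties $(GPU_DISCOVERY_EXTRA)
```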

 

I’m now stuck. I have a very vague recollection that when I first got some NVIDIA cards they showed up as OpenCL devices. Did I do something then to make them show up as CUDA devices that is now preventing these cards from showing up? condor_config_val -dump doesn’t show any likely suspects.

 

It’s entirely possible I haven’t got the drivers and/or software correctly installed, but rocm-smi and rocminfo do see the cards as expected.

 

Thanks,

Chris.