
Re: [HTCondor-users] Disappearing OpenCL GPUs



Hi Chris,

maybe you could point

STARTD_CRON_GPUs_MONITOR_EXECUTABLE

to something other than 'condor_gpu_utilization' and imitate the expected output?
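
A minimal sketch of such a stand-in, assuming the usual startd cron convention that a job prints one "Attribute = value" line per attribute on stdout and closes a record with a line starting with "-". The attribute name used here is a made-up placeholder, not something condor_gpu_utilization is known to emit; the real names would have to be copied from its output on a machine where it works.

```python
#!/usr/bin/env python3
"""Stand-in GPU monitor sketch for a startd cron job (illustrative only)."""
import sys


def format_record(attrs):
    """Render a dict as 'Name = value' lines followed by a '-' terminator."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, bool):
            # ClassAd booleans are written as true/false.
            rendered = "true" if value else "false"
        elif isinstance(value, str):
            # Quote strings and escape backslashes and embedded quotes.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            rendered = '"' + escaped + '"'
        else:
            rendered = str(value)
        lines.append(f"{name} = {rendered}")
    lines.append("-")  # assumed end-of-record marker
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # HypotheticalGPUsMonitorAlive is a placeholder attribute name.
    sys.stdout.write(format_record({"HypotheticalGPUsMonitorAlive": True}))
    sys.stdout.flush()
```

Saved as an executable file, a script like this could be what STARTD_CRON_GPUs_MONITOR_EXECUTABLE points to while testing whether the monitor is involved at all.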

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Chris Brew - STFC UKRI via HTCondor-users" <htcondor-users@xxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
CC: "Chris Brew, UKRI STFC" <chris.brew@xxxxxxxxxx>
Sent: Friday, 29 September 2023 12:39:57
Subject: [HTCondor-users] Disappearing OpenCL GPUs

Hi,

 

This is all with Condor 10.0.7 on Rocky Linux 8.

 

I've got a test node with a couple of AMD Instinct MI GPGPU cards (i.e. not CUDA) in but I'm having no luck getting them to show up in the machine ClassAds.

 

condor_gpu_discovery sees them fine:

 

# /usr/libexec/condor/condor_gpu_discovery -extra -properties

DetectedGPUs="OCL0, OCL1"

Common=[ ClockMhz=1700; ComputeUnits=104; DeviceName="gfx90a:sramecc+:xnack-"; ECCEnabled=false; GlobalMemoryMb=65520; OpenCLVersion=2.0; ]

OCL0=[ id="OCL0"; ]

OCL1=[ id="OCL1"; ]
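
As a rough way to check this output from a script, the sketch below pulls the device IDs out of the DetectedGPUs line and groups them by prefix (OCL here). It is only an illustration built on the output shown above; the helper names are hypothetical, and other discovery options may produce IDs in a different shape.

```python
import re

# Sample text copied from the condor_gpu_discovery run above.
SAMPLE = """DetectedGPUs="OCL0, OCL1"
Common=[ ClockMhz=1700; ComputeUnits=104; DeviceName="gfx90a:sramecc+:xnack-"; ECCEnabled=false; GlobalMemoryMb=65520; OpenCLVersion=2.0; ]
OCL0=[ id="OCL0"; ]
OCL1=[ id="OCL1"; ]
"""


def detected_gpus(text):
    """Return the IDs listed in a quoted DetectedGPUs line, or [] if none."""
    for line in text.splitlines():
        match = re.match(r'\s*DetectedGPUs\s*=\s*"(.*)"\s*$', line)
        if match:
            return [gpu.strip() for gpu in match.group(1).split(",") if gpu.strip()]
    return []


def id_prefixes(gpu_ids):
    """Return the set of leading alphabetic prefixes, e.g. {'OCL'}."""
    prefixes = set()
    for gpu_id in gpu_ids:
        match = re.match(r"[A-Za-z]+", gpu_id)
        if match:
            prefixes.add(match.group(0))
    return prefixes


if __name__ == "__main__":
    ids = detected_gpus(SAMPLE)
    print(ids, id_prefixes(ids))
```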

 

But the StartD doesn't:

 

# grep -i gpu /var/log/condor/StartLog

09/29/23 10:39:03    /etc/condor/config.d/19gpu.config

09/29/23 10:39:03    /etc/condor/config.d/30start_gpu.config

09/29/23 10:39:06 Local machine resource GPUs = 0

09/29/23 10:39:06 Allocating auto shares for slot type 1: Cpus: 96.000000, Memory: 257000, Swap: auto, Disk: auto, GPUs: auto

09/29/23 10:39:06   slot type 1: Cpus: 96.000000, Memory: 257000, Swap: 100.00%, Disk: 100.00%, GPUs: 0

09/29/23 10:39:06 bind DevIds tag=GPUs contraint=

09/29/23 10:39:06 CronJobList: Adding job 'GPUs_MONITOR'

09/29/23 10:39:06 CronJob: Initializing job 'GPUs_MONITOR' (/usr/libexec/condor/condor_gpu_utilization)
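
The same check can be scripted. The sketch below looks for the "Local machine resource ... = N" line seen in the excerpt above and reports the count per resource; the pattern is taken only from that one log line, so treat it as an assumption about the log format rather than a stable interface.

```python
import re

# Two lines copied from the StartLog excerpt above.
LOG = """09/29/23 10:39:06 Local machine resource GPUs = 0
09/29/23 10:39:06 CronJobList: Adding job 'GPUs_MONITOR'
"""

RESOURCE_LINE = re.compile(r"Local machine resource (\w+) = (\d+)")


def machine_resources(log_text):
    """Map each resource name to the last count the startd logged for it."""
    found = {}
    for line in log_text.splitlines():
        match = RESOURCE_LINE.search(line)
        if match:
            found[match.group(1)] = int(match.group(2))
    return found


if __name__ == "__main__":
    for name, count in machine_resources(LOG).items():
        print(f"{name}: {count}")
```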

 

19gpu.config only contains:

 

use feature : GPUs

GPU_DISCOVERY_EXTRA = -extra

 

And 30start_gpu.config only contains:

 

START = $(START) && ( (RequestGPUs >= 1) )

 

I thought it might be because of /usr/libexec/condor/condor_gpu_utilization, which does not seem to work for non-CUDA cards:

 

# /usr/libexec/condor/condor_gpu_utilization

# Unable to load a CUDA library (libcuda.so or libcudart.so).

Hanging to prevent process churn.

 

But I think I managed to disable that by expanding the "use feature:GPUs" and removing the "use feature:GpuMonitor".

 

I'm now stuck. I have a very vague recollection that when I first got some NVIDIA cards they showed up as OpenCL devices. Did I do something then to make them show up as CUDA devices that's preventing these devices showing up? condor_config_val -dump doesn't show any likely suspects.

 

It's entirely possible I haven't got the drivers and/or software correctly installed but rocm-smi and rocminfo do see them as expected.

 

Thanks,

Chris.


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/