
[HTCondor-users] GPU discovery - empty results with older cards on 8.6.13



I'm seeing some odd results from condor_gpu_discovery on a couple of systems that have old GRID K1 cards installed.

I have the 367.130 kernel module and the CUDA 8.0 runtime installed, both of which are needed to support these cards (CUDA capability 3.0). The nvidia-smi command shows the cards with no problem.

In addition, condor_gpu_discovery has no problem identifying and enumerating the cards when run on the command line:

$ /usr/libexec/condor/condor_gpu_discovery -extra -properties -dynamic
DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3"
CUDACapability=3.0
CUDAClockMhz=849.50
CUDAComputeUnits=1
CUDACoresPerCU=192
CUDADeviceName="GRID K1"
CUDADriverVersion=8.0
CUDAECCEnabled=false
CUDAGlobalMemoryMb=4034
CUDARuntimeVersion=8.0
CUDA0PowerUsage_mw=13958
CUDA0DieTempC=38
CUDA1PowerUsage_mw=14107
CUDA1DieTempC=35
CUDA2PowerUsage_mw=13762
CUDA2DieTempC=27
CUDA3PowerUsage_mw=13657
CUDA3DieTempC=31
$

However, none of this information is pulled into the machine ClassAd, and the GPUs attribute for the machine remains at "0". The startd_cron job I set up to advertise the utilization percentage and the per-card and global free memory attributes is working fine, and all those attributes are winding up in the ClassAd with no problem.
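To compare what the command line reports against what ends up in the machine ClassAd, the key=value output shown above can be parsed with a few lines of Python. This is only a minimal sketch that assumes the output format shown above; the function names parse_gpu_discovery and detected_gpu_count are hypothetical helpers and not part of HTCondor, and treating a DetectedGPUs value of 0 as "no GPUs" is an assumption.

```python
def parse_gpu_discovery(text):
    """Parse key=value lines, as printed by condor_gpu_discovery, into a dict."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skip shell prompts and blank lines
        key, _, value = line.partition("=")
        attrs[key.strip()] = value.strip().strip('"')
    return attrs


def detected_gpu_count(attrs):
    """Count the comma-separated device names in DetectedGPUs.

    Assumption: a missing attribute, an empty value, or "0" means no GPUs.
    """
    names = attrs.get("DetectedGPUs", "")
    return len([n for n in names.split(",") if n.strip() and n.strip() != "0"])


# Excerpt of the command-line output shown above.
sample = '''DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3"
CUDACapability=3.0
CUDADeviceName="GRID K1"
CUDAGlobalMemoryMb=4034'''

attrs = parse_gpu_discovery(sample)
print(detected_gpu_count(attrs), attrs["CUDADeviceName"])
```

On the excerpt above this counts four devices, which is the number I would expect the GPUs attribute to show instead of 0.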

The machine is using a shared configuration for GPU settings which includes "use feature : GPUs", and GPU advertising is working fine on other machines equipped with newer GPU cards, namely a P100 and several V100s. condor_config_val shows that the MACHINE_RESOURCE_INVENTORY_GPUs (MRI) and environment settings for the feature are being populated in the older machine's configuration without any problems. I've done a condor_restart and a systemctl restart with no luck.

In the StartLog, I see:

05/16/19 11:36:07 History file rotation is enabled.
05/16/19 11:36:07   Maximum history file size is: 1073741824 bytes
05/16/19 11:36:07   Number of rotated history files is: 2
__
05/16/19 11:36:07 Local machine resource GPUs = 0
^^
05/16/19 11:36:07 Allocating auto shares for slot type 1: Cpus: 48.000000, Memory: 257653, Swap: 100.00%, Disk: 100.00%, GPUs: 0
slot type 1: Cpus: 48.000000, Memory: 257653, Swap: 100.00%, Disk: 100.00%, GPUs: 0
05/16/19 11:36:07 slot1: New machine resource of type 1 allocated

This suggests the startd knows it's supposed to be looking for GPUs, but is not finding them. Is there a debug level short of D_ALL I can set to see what it's attempting to do with the condor_gpu_discovery command, or does anyone have any ideas as to what might be happening?
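For checking the same StartLog line on several machines at once, something like the following could pull the resource counts out of the log text. It is a rough sketch that assumes the "Local machine resource NAME = N" wording shown in the excerpt above; local_resources is a hypothetical helper name, not an HTCondor tool.

```python
import re

# Matches lines such as "... Local machine resource GPUs = 0".
RESOURCE_RE = re.compile(r"Local machine resource (\w+) = (\d+)")


def local_resources(log_text):
    """Return {resource name: count}, keeping the last value seen per name."""
    found = {}
    for line in log_text.splitlines():
        match = RESOURCE_RE.search(line)
        if match:
            found[match.group(1)] = int(match.group(2))
    return found


# Two lines taken from the StartLog excerpt above.
log_excerpt = """05/16/19 11:36:07   Number of rotated history files is: 2
05/16/19 11:36:07 Local machine resource GPUs = 0"""

print(local_resources(log_excerpt))
```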


Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company