
Re: [HTCondor-users] HTCondor not picking up GPU memory?



So GPU memory is being detected and reported.  

But since you are using -not-nested in your config, you can't use any of the newer features of HTCondor that depend on the nested GPU properties ads.
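As a rough sketch of what dropping that flag might look like, assuming the setting comes from a local config file (the file location in the comment is only an example) and is not set somewhere else in the configuration:

```
# e.g. in a file under /etc/condor/config.d/ (location is an assumption)
use feature : GPUs
# Same value as in the config dump quoted below, minus -not-nested
GPU_DISCOVERY_EXTRA = -extra
```

The startd probably needs to be restarted, not just reconfigured, before it re-runs discovery and advertises the nested properties.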

-tj

From: Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Wednesday, April 24, 2024 10:48 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: HTCondor not picking up GPU memory?
 
*** Attention: This is an external email. Use caution responding, opening attachments or clicking on links. ***

tj,

 

/usr/libexec/condor/condor_gpu_discovery -properties -extra

DetectedGPUs="GPU-2b5bf517"

Common=[ Capability=7.5; ClockMhz=1590.00; ComputeUnits=40; CoresPerCU=64; DeviceName="Tesla T4"; DevicePciBusId="0000:03:00.0"; DeviceUuid="2b5bf517-8290-ca68-bb9e-eaf4336d1321"; DriverVersion=12.10; ECCEnabled=true; GlobalMemoryMb=14966; MaxSupportedVersion=12010; ]

GPU_2b5bf517=[ id="GPU-2b5bf517"; ]

nmradmin@neon:~$ /usr/libexec/condor/condor_gpu_discovery -properties -extra --not-nested

DetectedGPUs="GPU-2b5bf517"

CUDACapability=7.5

CUDAClockMhz=1590.00

CUDAComputeUnits=40

CUDACoresPerCU=64

CUDADeviceName="Tesla T4"

CUDADevicePciBusId="0000:03:00.0"

CUDADeviceUuid="2b5bf517-8290-ca68-bb9e-eaf4336d1321"

CUDADriverVersion=12.10

CUDAECCEnabled=true

CUDAGlobalMemoryMb=14966

CUDAMaxSupportedVersion=12010

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Date: Wednesday, April 24, 2024 at 11:13 AM
To: HTCondor Users <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor not picking up GPU memory?


What does running

 

   condor_gpu_discovery  -properties -extra

 

show on that node? What about

 

   condor_gpu_discovery  -properties -extra -not-nested

 

 

I notice you are using the -not-nested argument. The new submit keywords for GPU matchmaking, like gpus_minimum_memory = 0.1, require that the GPU properties be nested. Note, however, that those new submit keywords have a known bug in the version of HTCondor you are using and should not be used before 23.7.
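To illustrate the difference, here is a hedged sketch of a submit file showing both styles. The executable name and the 10000 threshold are made up for illustration. It assumes the startd advertises the flat CUDAGlobalMemoryMb attribute in the slot ad when -not-nested is in use, and that gpus_minimum_memory is read in megabytes; check the condor_submit manual for your version before relying on either.

```
# Hypothetical submit file; my_gpu_job.sh and the 10000 threshold are made up
executable   = my_gpu_job.sh
request_gpus = 1

# Option 1: flat properties (-not-nested): match on the flat attribute directly
requirements = (CUDAGlobalMemoryMb >= 10000)

# Option 2: nested properties and HTCondor 23.7 or later: use the submit keyword
# instead of the requirements line above (units assumed to be MB)
# gpus_minimum_memory = 10000

queue
```

Only one of the two options should be active at a time; which one applies depends on whether the execute nodes run discovery with -not-nested.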

 

-tj

 

 


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Tuesday, April 16, 2024 8:12 AM
To: HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] HTCondor not picking up GPU memory?

 


Our system does not seem to pick up GPU memory, e.g.

condor_status --gpus

Name                        ST User                GPUs GPU-Memory GPU-Name            

 

slot1@xxxxxxxxxxxxxxxxxxx   Ui _                      1            Tesla T4            

slot1@xxxxxxxxxxxxxxxx      Ui _                      1            Tesla T4            

slot1@xxxxxxxxxxxxxxxxxx    Ui _                      1            Tesla T4            

slot1@xxxxxxxxxxxxxxxxxx    Ui _                      4            NVIDIA A100-SXM4-40GB

slot1@xxxxxxxxxxxxxxxxxxxx  Ui _                      1            Tesla T4            

slot1@xxxxxxxxxxxxxxxxxx    Ui _                      1            Tesla T4            

slot1@xxxxxxxxxxxxxxxxxxx   Ui _                      1            Tesla T4            

slot1@xxxxxxxxxxxxxxxxxxx   Ui _                      1            Tesla T4            

slot1@xxxxxxxxxxxxxxxxx     Ui _                      1            Tesla T4         

 

and adding gpus_minimum_memory = 0.1 results in no matches.


We're using use feature :GPUs, and condor_config_val -dump | grep GPU shows

ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)//  CUDA_VISIBLE_DEVICES=/CUDA//

ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = 10000

GPU_DISCOVERY_EXTRA = -extra -not-nested

GPU_MONITOR = $(LIBEXEC)/condor_gpu_utilization

MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery  -properties $(GPU_DISCOVERY_EXTRA)

STARTD_CRON_GPUs_MONITOR_CONDITION = TotalGPUs > 0

STARTD_CRON_GPUs_MONITOR_EXECUTABLE = $(GPU_MONITOR)

STARTD_CRON_GPUs_MONITOR_METRICS = SUM:GPUs, PEAK:GPUsMemory

STARTD_CRON_GPUs_MONITOR_MODE = WaitForExit

STARTD_CRON_GPUs_MONITOR_PERIOD = 300

STARTD_CRON_JOBLIST =  GPUs_MONITOR

STARTD_DETECT_GPUS = -properties $(GPU_DISCOVERY_EXTRA)

STARTD_JOB_ATTRS =  GPUsUsage GPUsMemoryUsage

STARTER_HIDE_GPU_DEVICES = true

HTCondor version 23.5.2-1 running on Ubuntu 20.04 servers.