
[HTCondor-users] HTCondor not picking up GPU memory?




Our system does not seem to pick up GPU memory: the GPU-Memory column is empty for every slot. For example:

condor_status --gpus

Name                        ST User                GPUs GPU-Memory GPU-Name
slot1@xxxxxxxxxxxxxxxxxxx   Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxx      Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxx    Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxx    Ui _                      4            NVIDIA A100-SXM4-40GB
slot1@xxxxxxxxxxxxxxxxxxxx  Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxx    Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxxx   Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxxx   Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxxx     Ui _                      1            Tesla T4

Adding gpus_minimum_memory = 0.1 to a job's submit description results in no matches.
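
For context, the submit description is along these lines. This is a minimal sketch rather than our exact file: the executable and log names are placeholders, and request_gpus = 1 is an assumption about a typical single-GPU job. Only the gpus_minimum_memory line is taken from the test described above.

```
# Illustrative submit description (names are placeholders)
executable          = gpu_test.sh
log                 = gpu_test.log

# Assumed: a single-GPU request
request_gpus        = 1

# The line that leads to no matches on our pool
gpus_minimum_memory = 0.1

queue
```

Without the gpus_minimum_memory line, the same kind of job would be expected to match the GPU slots listed above.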

We're using "use feature : GPUs", and condor_config_val -dump | grep GPU shows:

ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)//  CUDA_VISIBLE_DEVICES=/CUDA//
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = 10000
GPU_DISCOVERY_EXTRA = -extra -not-nested
GPU_MONITOR = $(LIBEXEC)/condor_gpu_utilization
MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties $(GPU_DISCOVERY_EXTRA)
STARTD_CRON_GPUs_MONITOR_CONDITION = TotalGPUs > 0
STARTD_CRON_GPUs_MONITOR_EXECUTABLE = $(GPU_MONITOR)
STARTD_CRON_GPUs_MONITOR_METRICS = SUM:GPUs, PEAK:GPUsMemory
STARTD_CRON_GPUs_MONITOR_MODE = WaitForExit
STARTD_CRON_GPUs_MONITOR_PERIOD = 300
STARTD_CRON_JOBLIST = GPUs_MONITOR
STARTD_DETECT_GPUS = -properties $(GPU_DISCOVERY_EXTRA)
STARTD_JOB_ATTRS = GPUsUsage GPUsMemoryUsage
STARTER_HIDE_GPU_DEVICES = true

This is HTCondor version 23.5.2-1 running on Ubuntu 20.04 servers.