
Re: [HTCondor-users] GPU selection in HTCondor 9.0.6 LT Release

Hi Douglas.  

We have been working on solving this problem.  The 9.8.0 release has a technology preview of our likely solution.   If you upgrade all of your nodes (execute, schedd and matchmaker) to 9.8.0, you can
take advantage of a new capability of condor_submit. 

Your submit file would have something like this:

request_gpus = 1
require_gpus = GlobalMemoryMb > 10000

This ensures that the job matches only GPUs that have more than that amount of memory. For this to work, your Schedd, Matchmaker and Execute nodes all have to be running 9.8.0 or later, and you have to add the -nested option to condor_gpu_discovery on the Execute node like this.


The -nested option causes the Execute node to publish GPU properties as a nested ClassAd inside the STARTD ad. The GPU properties look something like this:

GPUs_GPU_a0223334 = [ DriverVersion = 11.2; Capability = 7.0; MaxSupportedVersion = 11020; ECCEnabled = true; DeviceName = "Tesla V100-PCIE-16GB"; Id = "GPU-a0223334"; GlobalMemoryMb = 24220; DeviceUuid = "a0223334-4445-5667-899a-abbccddeeff0"; DevicePciBusId = "0000:40:00.0" ]

The new "require_gpus" submit keyword can match on any of these properties.
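
To make the matching idea concrete, here is a minimal sketch in plain Python of how a constraint such as GlobalMemoryMb > 10000 would separate the two MIG partition sizes from the original question. It is only an illustration: it does not use the HTCondor API and is not how the matchmaker is implemented. The record names and the memory values (rough MB equivalents of the 19.955 GB and 9.721 GB partitions) are assumptions chosen for the example.

```python
# Illustrative only: a plain-Python model of filtering GPUs by property.
# This is NOT HTCondor's matchmaking code and does not call any HTCondor API.

# Hypothetical GPU property records, shaped like the nested ads shown above.
# The keys and the GlobalMemoryMb values are assumptions for illustration.
gpus = {
    "GPUs_MIG_example_2g": {
        "DeviceName": "NVIDIA A100 80GB PCIe MIG 2g.20gb",
        "GlobalMemoryMb": 19955,
    },
    "GPUs_MIG_example_1g": {
        "DeviceName": "NVIDIA A100 80GB PCIe MIG 1g.10gb",
        "GlobalMemoryMb": 9721,
    },
}


def matching_gpus(gpus, predicate):
    """Return the names of GPUs whose properties satisfy the predicate."""
    return [name for name, props in gpus.items() if predicate(props)]


# Rough analogue of: require_gpus = GlobalMemoryMb > 10000
selected = matching_gpus(gpus, lambda props: props["GlobalMemoryMb"] > 10000)
print(selected)
```

With these assumed values, only the larger partition passes the constraint, which is the behaviour the question asks for.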

There is more information in the ticket for this work here. 

This is not yet documented in the manual, but we hope to finalize and document this feature soon. 


-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Benjamin, Douglas via HTCondor-users
Sent: Monday, April 25, 2022 8:02 PM
To: htcondor-users@xxxxxxxxxxx
Cc: Benjamin, Douglas <dbenjamin@xxxxxxx>; Hollowell, Christopher <hollowec@xxxxxxx>
Subject: [HTCondor-users] GPU selection in HTCondor 9.0.6 LT Release


   We have several A100 GPUs that we have divided up using NVIDIA's MIG configuration. Each NVIDIA A100 80GB GPU is divided into three 19.955 GB partitions and one 9.721 GB partition.

Here is a snippet of the condor_gpu_discovery command output.

MIG_3f63dad5_849f_591e_9d4f_f7bacd6c2d97DeviceName="NVIDIA A100 80GB PCIe MIG 2g.20gb"
MIG_56476b2d_78a8_5280_9fa9_02bf5b74dee1DeviceName="NVIDIA A100 80GB PCIe MIG 1g.10gb"

We are using partitionable slots. 

$CondorVersion: 9.0.6 Sep 23 2021 BuildID: racf PackageID: 9.0.6 $
$CondorPlatform: X86_64-ScientificLinux_7.9 $

Is there an easy way to add the GPU memory to the requirements for a job? For users who need more than 9.721 GB of GPU memory, we would like to allow them to select a larger partition.

Is there a condor ClassAd shorthand that would allow us to use *GlobalMemoryMb > 10000 to differentiate between GPUs?


Doug Benjamin


HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe