Re: [HTCondor-users] GPU selection in HTCondor 9.0.6 LT Release
- Date: Tue, 26 Apr 2022 14:59:20 +0000
- From: John M Knoeller <johnkn@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] GPU selection in HTCondor 9.0.6 LT Release
We have been working on solving this problem. The 9.8.0 release has a technology preview of our likely solution. If you upgrade all of your nodes (execute, schedd and matchmaker) to 9.8.0, you can
take advantage of a new capability of condor_submit.
Your submit file would have something like this
request_gpus = 1
require_gpus = GlobalMemoryMb > 10000
This ensures that the job matches only GPUs that have more than that amount of memory. For this to work, your Schedd, Matchmaker and Execute nodes must all be running 9.8.0 or later, and you must add the -nested option to condor_gpu_discovery on the Execute node like this:
GPU_DISCOVERY_EXTRA = $(GPU_DISCOVERY_EXTRA) -nested
The -nested option causes the Execute node to publish GPU properties as a nested ClassAd inside the STARTD ad. The GPU properties look something like this:
GPUs_GPU_a0223334 = [ DriverVersion = 11.2; Capability = 7.0; MaxSupportedVersion = 11020; ECCEnabled = true; DeviceName = "Tesla V100-PCIE-16GB"; Id = "GPU-a0223334"; GlobalMemoryMb = 24220; DeviceUuid = "a0223334-4445-5667-899a-abbccddeeff0"; DevicePciBusId = "0000:40:00.0" ]
The new "require_gpus" submit keyword can match on any of these properties.
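As a rough illustration of matching on more than one property, a complete submit file might look something like the sketch below. The executable and the output, error and log file names are hypothetical placeholders, and the Capability threshold is only an example value taken from the sample ad above. Since this is a technology preview, treat the syntax as subject to change.

```
# Hypothetical job; train.sh and the file names are placeholders.
executable     = train.sh
output         = train.$(Cluster).$(Process).out
error          = train.$(Cluster).$(Process).err
log            = train.log

request_cpus   = 1
request_memory = 4GB
request_gpus   = 1

# Match only GPUs with more than 10000 MB of memory and a
# compute capability of 7.0 or higher (example threshold).
require_gpus   = GlobalMemoryMb > 10000 && Capability >= 7.0

queue
```

The expression is evaluated against the per-GPU properties that the Execute node publishes when -nested is enabled, so it can only refer to attribute names that appear in that nested ad.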
There is more information in the ticket for this work here.
This is not yet documented in the manual, but we hope to finalize and document this feature soon.
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Benjamin, Douglas via HTCondor-users
Sent: Monday, April 25, 2022 8:02 PM
Cc: Benjamin, Douglas <dbenjamin@xxxxxxx>; Hollowell, Christopher <hollowec@xxxxxxx>
Subject: [HTCondor-users] GPU selection in HTCondor 9.0.6 LT Release
We have several A100 GPUs that we have divided up using NVIDIA's MIG configuration. Each NVIDIA A100 80GB GPU is divided into three 19.955 GB partitions and one 9.721 GB partition.
Here is a snippet of the condor_gpu_discovery command output.
MIG_3f63dad5_849f_591e_9d4f_f7bacd6c2d97DeviceName="NVIDIA A100 80GB PCIe MIG 2g.20gb"
MIG_56476b2d_78a8_5280_9fa9_02bf5b74dee1DeviceName="NVIDIA A100 80GB PCIe MIG 1g.10gb"
We are using partitionable slots.
$CondorVersion: 9.0.6 Sep 23 2021 BuildID: racf PackageID: 9.0.6 $
$CondorPlatform: X86_64-ScientificLinux_7.9 $
Is there an easy way to add the GPU memory to the requirements for a job? For users who need more than 9.721 GB of GPU memory, we would like to allow them to select an appropriate GPU.
Is there a Condor ClassAd shorthand that would allow us to use *GlobalMemoryMb > 10000 to differentiate between GPUs?