
Re: [HTCondor-users] GPU selection in HTCondor 9.0.6 LT Release



Hi Douglas.  

We have been working on solving this problem. The 9.8.0 release has a technology preview of our likely solution. If you upgrade all of your nodes (execute, schedd and matchmaker) to 9.8.0, you can take advantage of a new capability of condor_submit.

Your submit file would have something like this:

request_gpus = 1
require_gpus = GlobalMemoryMb > 10000

This would ensure that the job only matches GPUs that have more than that amount of memory. But in order for this to work, your Schedd, Matchmaker and Execute nodes have to be running 9.8.0 or later, and you have to add the -nested option to condor_gpu_discovery on the Execute node like this:

GPU_DISCOVERY_EXTRA = $(GPU_DISCOVERY_EXTRA) -nested

The -nested option causes the Execute node to publish GPU properties as a nested ClassAd inside the STARTD ad. The GPU properties look something like this:

GPUs_GPU_a0223334 = [ DriverVersion = 11.2; Capability = 7.0; MaxSupportedVersion = 11020; ECCEnabled = true; DeviceName = "Tesla V100-PCIE-16GB"; Id = "GPU-a0223334"; GlobalMemoryMb = 24220; DeviceUuid = "a0223334-4445-5667-899a-abbccddeeff0"; DevicePciBusId = "0000:40:00.0" ]

The new "require_gpus" submit keyword can match on any of these properties.
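
As a rough sketch of how this might look in a complete submit file, the example below combines two of the properties shown above. The executable name, the CPU and memory requests, and the thresholds are placeholders, not values from a real pool. Since the feature is still a technology preview, the exact expression syntax accepted by require_gpus may change.

```
# Hypothetical submit file; my_gpu_job.sh and the resource sizes are placeholders
executable     = my_gpu_job.sh
request_cpus   = 1
request_memory = 4GB
request_gpus   = 1
# Only match GPUs with more than 10000 MB of memory and compute capability 7.0 or higher
require_gpus   = (GlobalMemoryMb > 10000) && (Capability >= 7.0)
queue
```

With the MIG layout described in the original question, an expression like this would be expected to skip the 9721 MB partitions, provided the larger partitions also advertise a Capability of at least 7.0.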

There is more information in the ticket for this work here:
https://opensciencegrid.atlassian.net/browse/HTCONDOR-953

This is not yet documented in the manual, but we hope to finalize and document this feature soon. 

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Benjamin, Douglas via HTCondor-users
Sent: Monday, April 25, 2022 8:02 PM
To: htcondor-users@xxxxxxxxxxx
Cc: Benjamin, Douglas <dbenjamin@xxxxxxx>; Hollowell, Christopher <hollowec@xxxxxxx>
Subject: [HTCondor-users] GPU selection in HTCondor 9.0.6 LT Release

Hello,

   We have several A100 GPUs that we have divided up using NVIDIA's MIG configuration. Each NVIDIA A100 80GB GPU is divided into three 19.955 GB partitions and one 9.721 GB partition.

Here is a snippet of the condor_gpu_discovery command output.

MIG_3f63dad5_849f_591e_9d4f_f7bacd6c2d97DeviceName="NVIDIA A100 80GB PCIe MIG 2g.20gb"
MIG_3f63dad5_849f_591e_9d4f_f7bacd6c2d97DeviceUuid="MIG-3f63dad5-849f-591e-9d4f-f7bacd6c2d97"
MIG_3f63dad5_849f_591e_9d4f_f7bacd6c2d97DriverVersion=11.60
MIG_3f63dad5_849f_591e_9d4f_f7bacd6c2d97GlobalMemoryMb=19955
MIG_3f63dad5_849f_591e_9d4f_f7bacd6c2d97MaxSupportedVersion=11060
MIG_56476b2d_78a8_5280_9fa9_02bf5b74dee1DeviceName="NVIDIA A100 80GB PCIe MIG 1g.10gb"
MIG_56476b2d_78a8_5280_9fa9_02bf5b74dee1DeviceUuid="MIG-56476b2d-78a8-5280-9fa9-02bf5b74dee1"
MIG_56476b2d_78a8_5280_9fa9_02bf5b74dee1DriverVersion=11.60
MIG_56476b2d_78a8_5280_9fa9_02bf5b74dee1GlobalMemoryMb=9721
MIG_56476b2d_78a8_5280_9fa9_02bf5b74dee1MaxSupportedVersion=11060

We are using partitionable slots. 

$CondorVersion: 9.0.6 Sep 23 2021 BuildID: racf PackageID: 9.0.6 $
$CondorPlatform: X86_64-ScientificLinux_7.9 $

Is there an easy way to add the GPU memory to the requirements for a job? For users who need more than 9.721 GB of GPU memory, we would like to let them select a suitable GPU.

Is there an HTCondor ClassAd shorthand that would allow us to use something like *GlobalMemoryMb > 10000 to differentiate between GPUs?

Regards,

Doug Benjamin





_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/