
Re: [HTCondor-users] oversubscribing gpus



Hey David,

The trick is wrapping the condor_gpu_discovery program so that it generates a double-list of the DetectedGPUs. You can change the MACHINE_RESOURCE_INVENTORY_GPUs configuration setting to call your wrapper instead of the actual binary.

The wrapper would look at the output of the real program, so that instead of:

 DetectedGPUs = "CUDA0, CUDA1, CUDA2, CUDA3"

You want to have:

DetectedGPUs = "CUDA0, CUDA1, CUDA2, CUDA3, CUDA0, CUDA1, CUDA2, CUDA3"

Or, if you want to depth-first fill the GPUs with jobs:

DetectedGPUs = "CUDA0, CUDA0, CUDA1, CUDA1, CUDA2, CUDA2, CUDA3, CUDA3"
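
For illustration, here is a rough sketch of what such a wrapper could look like in Python. It is not a tested recipe: the path in REAL_DISCOVERY, the function names, and the assumption that the discovery output contains a single line of the form DetectedGPUs="..." are all mine, so check them against the output of condor_gpu_discovery on your own machines.

```python
import re
import subprocess
import sys

# Assumed location of the real binary; adjust for your installation.
REAL_DISCOVERY = "/usr/libexec/condor/condor_gpu_discovery"

# Matches a line such as: DetectedGPUs="CUDA0, CUDA1"
DETECTED_RE = re.compile(r'^(\s*DetectedGPUs\s*=\s*)"(.*)"\s*$')


def oversubscribe(output, copies=2, depth_first=False):
    """Return discovery output with every detected GPU listed `copies` times."""
    lines = []
    for line in output.splitlines():
        match = DETECTED_RE.match(line)
        if match:
            gpus = [g.strip() for g in match.group(2).split(",") if g.strip()]
            if depth_first:
                # CUDA0, CUDA0, CUDA1, CUDA1, ...
                expanded = [g for g in gpus for _ in range(copies)]
            else:
                # CUDA0, CUDA1, ..., CUDA0, CUDA1, ...
                expanded = gpus * copies
            line = '{}"{}"'.format(match.group(1), ", ".join(expanded))
        lines.append(line)
    return "".join(l + "\n" for l in lines)


def run_wrapper(argv):
    """Run the real discovery tool with the same arguments and rewrite its output."""
    result = subprocess.run([REAL_DISCOVERY] + list(argv),
                            capture_output=True, text=True, check=True)
    sys.stdout.write(oversubscribe(result.stdout))
    return 0
```

The installed script would end by calling run_wrapper(sys.argv[1:]), and MACHINE_RESOURCE_INVENTORY_GPUs would point at that script rather than at the real binary, with the same arguments you pass today.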

You might also want to have your wrapper modify the CUDAGlobalMemoryMB value (from -properties option) to half of the actual value, just in case any jobs set up requirements based on the CUDA memory.
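
The memory adjustment could be sketched along the same lines. This assumes the -properties output carries the value on its own line as CUDAGlobalMemoryMb=<integer>; the exact attribute spelling and layout can differ between HTCondor versions, so treat the pattern below as a placeholder to adapt. The function name is hypothetical.

```python
import re

# Assumed line format: CUDAGlobalMemoryMb=16160 (matched case-insensitively).
MEM_RE = re.compile(r'^(\s*CUDAGlobalMemoryMb\s*=\s*)(\d+)\s*$', re.IGNORECASE)


def scale_gpu_memory(output, copies=2):
    """Divide the advertised GPU memory by the oversubscription factor."""
    lines = []
    for line in output.splitlines():
        match = MEM_RE.match(line)
        if match:
            # Integer division keeps the attribute an integer number of MB.
            line = match.group(1) + str(int(match.group(2)) // copies)
        lines.append(line)
    return "".join(l + "\n" for l in lines)
```

A wrapper would apply this to the text captured from the real discovery program before printing it, so that jobs with requirements on GPU memory see the per-slot share rather than the full card.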


Michael V Pelletier
Principal Engineer

Raytheon Technologies
Information Technology
Digital Transformation & Innovation
 

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of David Schultz
Sent: Monday, October 5, 2020 11:12 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] [HTCondor-users] oversubscribing gpus

Hi all,

Does anyone have a recipe for oversubscribing GPU resources 2:1, so each GPU would have two slots?  I think I can figure out how to do it completely manually, but was wondering if there was a nice way to hook into HTCondor's GPU detection.

Thanks,
David Schultz