
Re: [HTCondor-users] Running multiple jobs simultaneously on a single GPU



Hi Eric.   

 

NVIDIA is adding the ability to share a GPU between processes on newer hardware, with hardware-enforced memory isolation between the processes (the Multi-Instance GPU, or MIG, feature). HTCondor does plan to support that, but it does not yet, and I don’t think the NVIDIA devices that support it are very common yet. This is work in progress…

 

However, you can share a GPU between processes *without* any kind of protection between them simply by having more than one process set the environment variable CUDA_VISIBLE_DEVICES to the same value.
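
For example, outside of HTCondor you can oversubscribe device 0 from a shell like this (my_cuda_app is just a stand-in for any CUDA workload):

# both processes see, and share, physical device 0
CUDA_VISIBLE_DEVICES=0 ./my_cuda_app &
CUDA_VISIBLE_DEVICES=0 ./my_cuda_app &
wait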

 

You can get HTCondor to do this just by having the same device show up more than once in the device enumeration.  

 

For instance, if you have two GPUs and your configuration is

 

MACHINE_RESOURCE_GPUS = CUDA0, CUDA1

 

You can run two jobs on each GPU by configuring

 

MACHINE_RESOURCE_GPUS = CUDA0, CUDA1, CUDA0, CUDA1
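
Nothing special is needed on the submit side; each job still asks for one GPU, and with the duplicated enumeration two such jobs can be assigned the same physical device. A minimal sketch (again, my_cuda_app is a stand-in for your workload):

# each job requests a single GPU as usual
executable   = my_cuda_app
request_gpus = 1
queue 2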

 

If you don’t use the MACHINE_RESOURCE_GPUS knob, and instead use HTCondor’s GPU detection, the same trick works; it’s just a little more work.

 

# enable GPU discovery
use FEATURE : GPUs

# then override the GPU device enumeration with a wrapper script
# that duplicates the detected GPUs
MACHINE_RESOURCE_INVENTORY_GPUs = $(ETC)/bin/condor_gpu_discovery.sh $(1) -properties $(GPU_DISCOVERY_EXTRA)

 

The wrapper script $(ETC)/bin/condor_gpu_discovery.sh is something you need to write yourself.

 

condor_gpu_discovery produces output like this:

 

DetectedGPUs="CUDA0, CUDA1"
CUDACapability=6.0
CUDADeviceName="Tesla P100-PCIE-16GB"
CUDADriverVersion=11.0
CUDAECCEnabled=true
CUDAGlobalMemoryMb=16281
CUDAMaxSupportedVersion=11000
CUDA0DevicePciBusId="0000:3B:00.0"
CUDA0DeviceUuid="dddddddd-dddd-dddd-dddd-dddddddddddd"
CUDA1DevicePciBusId="0000:D8:00.0"
CUDA1DeviceUuid="cccccccc-cccc-cccc-cccc-cccccccccccc"

 

Your wrapper script should produce the same output, but with a modified value for DetectedGPUs, like this:

 

DetectedGPUs="CUDA0, CUDA1, CUDA0, CUDA1"
CUDACapability=6.0
CUDADeviceName="Tesla P100-PCIE-16GB"
CUDADriverVersion=11.0
CUDAECCEnabled=true
CUDAGlobalMemoryMb=16281
CUDAMaxSupportedVersion=11000
CUDA0DevicePciBusId="0000:3B:00.0"
CUDA0DeviceUuid="dddddddd-dddd-dddd-dddd-dddddddddddd"
CUDA1DevicePciBusId="0000:D8:00.0"
CUDA1DeviceUuid="cccccccc-cccc-cccc-cccc-cccccccccccc"
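
Here is a minimal sketch of such a wrapper, assuming the real condor_gpu_discovery lives in /usr/libexec/condor (adjust the path for your installation) and that DetectedGPUs appears in the quoted form shown above:

#!/bin/sh
# Wrapper around the real condor_gpu_discovery that duplicates the
# DetectedGPUs list so each physical device is advertised twice.
# "$@" forwards whatever arguments HTCondor passes (e.g. -properties).
REAL=/usr/libexec/condor/condor_gpu_discovery

"$REAL" "$@" | while IFS= read -r line; do
    case "$line" in
        DetectedGPUs=*)
            # turn DetectedGPUs="CUDA0, CUDA1"
            # into DetectedGPUs="CUDA0, CUDA1, CUDA0, CUDA1"
            list=${line#DetectedGPUs=\"}
            list=${list%\"}
            echo "DetectedGPUs=\"$list, $list\""
            ;;
        *)
            echo "$line"
            ;;
    esac
done

Running the wrapper by hand and diffing against the real tool's output is an easy sanity check; only the DetectedGPUs line should differ.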

 

-tj

 

 

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Eric Sedore via HTCondor-users
Sent: Thursday, November 19, 2020 11:44 PM
To: htcondor-users@xxxxxxxxxxx
Cc: Eric Sedore <essedore@xxxxxxx>
Subject: [HTCondor-users] Running multiple jobs simultaneously on a single GPU

 

Good evening everyone,

 

I’ve listened to a few presentations that mentioned there is a way (either ready now or planned) to allow multiple jobs to utilize a single GPU.  This would be helpful as we have a number of workloads/jobs that do not consume the entire GPU (memory or processing).  Is there documentation (apologies if I missed it) that would assist with how to set up this configuration?

 

Happy to provide more of a description if my question is not clear.

 

Thanks,

-Eric