
Re: [HTCondor-users] Running multiple jobs simultaneously on a single GPU



Hi Eric.   

 

NVIDIA is adding the ability to share a GPU between processes on newer hardware, with hardware-enforced memory isolation between the processes (the Multi-Instance GPU, or MIG, feature). HTCondor does plan to support that, but it does not yet, and I don’t think the NVIDIA devices that support it are very common yet. This is work in progress…

 

However, you can share a GPU between processes *without* any kind of protection between them simply by having more than one process set the environment variable CUDA_VISIBLE_DEVICES to the same value.
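
For example, outside of HTCondor you can oversubscribe device 0 from a shell like this (my_cuda_app is just a stand-in for any CUDA workload):

# both processes see, and share, physical device 0
CUDA_VISIBLE_DEVICES=0 ./my_cuda_app &
CUDA_VISIBLE_DEVICES=0 ./my_cuda_app &
wait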

 

You can get HTCondor to do this just by having the same device show up more than once in the device enumeration.  

 

For instance, if you have two GPUs and your configuration is

 

MACHINE_RESOURCE_GPUS = CUDA0, CUDA1

 

You can run two jobs on each GPU by configuring

 

MACHINE_RESOURCE_GPUS = CUDA0, CUDA1, CUDA0, CUDA1
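
Nothing special is needed on the submit side; each job still asks for one GPU, and with the duplicated enumeration two such jobs can be assigned the same physical device. A minimal sketch (again, my_cuda_app is a stand-in for your workload):

# each job requests a single GPU as usual
executable   = my_cuda_app
request_gpus = 1
queue 2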

 

If you don’t use the MACHINE_RESOURCE_GPUS knob, and instead use HTCondor’s GPU detection, the same trick works; it’s just a little more work.

 

# enable GPU discovery
use FEATURE : GPUs

# then override the GPU device enumeration with a wrapper script
# that duplicates the detected GPUs
MACHINE_RESOURCE_INVENTORY_GPUs = $(ETC)/bin/condor_gpu_discovery.sh $(1) -properties $(GPU_DISCOVERY_EXTRA)

 

The wrapper script $(ETC)/bin/condor_gpu_discovery.sh is something you need to write yourself.

 

condor_gpu_discovery produces output like this:

 

DetectedGPUs="CUDA0, CUDA1"
CUDACapability=6.0
CUDADeviceName="Tesla P100-PCIE-16GB"
CUDADriverVersion=11.0
CUDAECCEnabled=true
CUDAGlobalMemoryMb=16281
CUDAMaxSupportedVersion=11000
CUDA0DevicePciBusId="0000:3B:00.0"
CUDA0DeviceUuid="dddddddd-dddd-dddd-dddd-dddddddddddd"
CUDA1DevicePciBusId="0000:D8:00.0"
CUDA1DeviceUuid="cccccccc-cccc-cccc-cccc-cccccccccccc"

 

Your wrapper script should produce the same output, but with a modified value for DetectedGPUs, like this:

 

DetectedGPUs="CUDA0, CUDA1, CUDA0, CUDA1"
CUDACapability=6.0
CUDADeviceName="Tesla P100-PCIE-16GB"
CUDADriverVersion=11.0
CUDAECCEnabled=true
CUDAGlobalMemoryMb=16281
CUDAMaxSupportedVersion=11000
CUDA0DevicePciBusId="0000:3B:00.0"
CUDA0DeviceUuid="dddddddd-dddd-dddd-dddd-dddddddddddd"
CUDA1DevicePciBusId="0000:D8:00.0"
CUDA1DeviceUuid="cccccccc-cccc-cccc-cccc-cccccccccccc"
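
Here is a minimal sketch of such a wrapper, assuming the real condor_gpu_discovery lives in /usr/libexec/condor (adjust the path for your installation) and that DetectedGPUs appears in the quoted form shown above:

#!/bin/sh
# Wrapper around the real condor_gpu_discovery that duplicates the
# DetectedGPUs list so each physical device is advertised twice.
# "$@" forwards whatever arguments HTCondor passes (e.g. -properties).
REAL=/usr/libexec/condor/condor_gpu_discovery

"$REAL" "$@" | while IFS= read -r line; do
    case "$line" in
        DetectedGPUs=*)
            # turn DetectedGPUs="CUDA0, CUDA1"
            # into DetectedGPUs="CUDA0, CUDA1, CUDA0, CUDA1"
            list=${line#DetectedGPUs=\"}
            list=${list%\"}
            echo "DetectedGPUs=\"$list, $list\""
            ;;
        *)
            echo "$line"
            ;;
    esac
done

Running the wrapper by hand and diffing against the real tool's output is an easy sanity check; only the DetectedGPUs line should differ.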

 

-tj

 

 

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Eric Sedore via HTCondor-users
Sent: Thursday, November 19, 2020 11:44 PM
To: htcondor-users@xxxxxxxxxxx
Cc: Eric Sedore <essedore@xxxxxxx>
Subject: [HTCondor-users] Running multiple jobs simultaneously on a single GPU

 

Good evening everyone,

 

I’ve listened to a few presentations that mentioned there is a way (either ready now or planned) to allow multiple jobs to utilize a single GPU.  This would be helpful as we have a number of workloads/jobs that do not consume the entire GPU (memory or processing).  Is there documentation (apologies if I missed it) that would assist with how to set up this configuration?

 

Happy to provide more of a description if my question is not clear.

 

Thanks,

-Eric