
Re: [HTCondor-users] Running multiple jobs simultaneously on a single GPU



Thanks Yannik – yes, if you have time and are willing, that would be very helpful.

 

-Eric

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Rath, Yannik
Sent: Wednesday, November 25, 2020 6:51 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Running multiple jobs simultaneously on a single GPU

 

Hi Eric,

we also have a number of jobs on our cluster that do not use a full GPU. We ended up with a solution that is rather specialized to our use case, but maybe it happens to align with yours.

For each GPU on a machine, we have one partitionable job slot.
This is one limitation of our approach: it means we have to associate a fixed fraction of RAM and CPU cores with each GPU, and that the same machine cannot run jobs that require multiple GPUs.

We add an additional resource to the job slot, which we name GPUMemory. A user can request either a full GPU as usual or a certain amount of GPU memory (or of course neither, for non-GPU jobs).
In our job start expression we make sure these two don't collide, i.e. a full GPU can't be requested if part of its memory is already in use, and vice versa.
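
To give a flavor already: a rough sketch of the slot side, using HTCondor's custom machine resource mechanism. The numbers and the exact start expression below are illustrative, not our production values.

# Sketch (illustrative numbers): advertise the GPU's memory as a custom
# consumable resource on the slot, in MB.
MACHINE_RESOURCE_GPUMemory = 16000

# A job then asks for a slice of the GPU in its submit file instead of a
# whole device:
#   request_GPUMemory = 4000

# Simplified collision guard: a whole-GPU job only starts while no memory
# slice is handed out, and a slice job only starts while the GPU itself is
# still unclaimed.
START = $(START) && \
  ( (TARGET.RequestGPUs ?: 0) == 0 || MY.GPUMemory == 16000 ) && \
  ( (TARGET.RequestGPUMemory ?: 0) == 0 || MY.GPUs == 1 )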

The job slot also has a configuration variable that identifies the associated GPU, which is used to set the CUDA_VISIBLE_DEVICES environment variable in a user job wrapper.
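
Schematically, the wrapper boils down to something like this (a sketch; _CONDOR_AssignedGPU is a placeholder for however the per-slot GPU id is made visible to the wrapper, not a standard HTCondor variable):

#!/bin/sh
# USER_JOB_WRAPPER sketch: pin the job to the slot's GPU, then hand
# control to the real job command line that HTCondor passes as arguments.
export CUDA_VISIBLE_DEVICES="${_CONDOR_AssignedGPU}"
exec "$@"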

Finally, we have a monitoring script for the used GPU memory, so that HTCondor kills jobs that use more GPU memory than they requested.
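
(The heart of that script is nvidia-smi's per-process accounting, compared against each job's requested GPUMemory; something like the query below, whose PIDs the script then matches back to jobs.)

# Per-process GPU memory use, one line per process, e.g. "12345, 4000 MiB"
nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader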

In case this sounds like something that would make sense for you, I can collect the configuration parts and share them here.

Best regards,
Yannik

 

 


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Tuesday, November 24, 2020 6:08 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Running multiple jobs simultaneously on a single GPU

 

Hi Eric.   

 

Nvidia is adding the ability to share a GPU between processes in newer hardware, with hardware-enforced memory isolation between the processes. HTCondor plans to support that, but it does not yet, and I don’t think the Nvidia devices that support this are very common yet. This is work in progress…

 

However, you can share a GPU between processes *without* any kind of protection between them, just by having more than one process set the environment variable CUDA_VISIBLE_DEVICES to the same value.

 

You can get HTCondor to do this just by having the same device show up more than once in the device enumeration.  

 

For instance, if you have two GPUs and your configuration is

 

MACHINE_RESOURCE_GPUS = CUDA0, CUDA1

 

You can run two jobs on each GPU by configuring

 

MACHINE_RESOURCE_GPUS = CUDA0, CUDA1, CUDA0, CUDA1

 

If you don’t use the MACHINE_RESOURCE_GPUS knob and instead use HTCondor’s GPU detection, you can use the same trick; it’s just a little more work.

 

# enable GPU discovery

use FEATURE : GPUs

# then override the GPU device enumeration with a wrapper script that duplicates the detected GPUs

MACHINE_RESOURCE_INVENTORY_GPUs = $(ETC)/bin/condor_gpu_discovery.sh $(1) -properties $(GPU_DISCOVERY_EXTRA)

 

The wrapper script $(ETC)/bin/condor_gpu_discovery.sh is something that you need to write.

 

condor_gpu_discovery produces output like this

 

DetectedGPUs="CUDA0, CUDA1"

CUDACapability=6.0

CUDADeviceName="Tesla P100-PCIE-16GB"

CUDADriverVersion=11.0

CUDAECCEnabled=true

CUDAGlobalMemoryMb=16281

CUDAMaxSupportedVersion=11000

CUDA0DevicePciBusId="0000:3B:00.0"

CUDA0DeviceUuid="dddddddd-dddd-dddd-dddd-dddddddddddd"

CUDA1DevicePciBusId="0000:D8:00.0"

CUDA1DeviceUuid="cccccccc-cccc-cccc-cccc-cccccccccccc"

 

Your wrapper script should produce the same output, but with a modified value for DetectedGPUs like this

 

DetectedGPUs="CUDA0, CUDA1, CUDA0, CUDA1"

CUDACapability=6.0

CUDADeviceName="Tesla P100-PCIE-16GB"

CUDADriverVersion=11.0

CUDAECCEnabled=true

CUDAGlobalMemoryMb=16281

CUDAMaxSupportedVersion=11000

CUDA0DevicePciBusId="0000:3B:00.0"

CUDA0DeviceUuid="dddddddd-dddd-dddd-dddd-dddddddddddd"

CUDA1DevicePciBusId="0000:D8:00.0"

CUDA1DeviceUuid="cccccccc-cccc-cccc-cccc-cccccccccccc"
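
A minimal version of such a wrapper might look like this (a sketch, assuming the real condor_gpu_discovery binary is on the PATH; the awk line simply repeats whatever device list the tool reports):

#!/bin/sh
# Run the real GPU discovery tool with the arguments HTCondor passes in,
# then rewrite the DetectedGPUs line so every device appears twice.
condor_gpu_discovery "$@" | awk -F'"' '
  /^DetectedGPUs=/ { print "DetectedGPUs=\"" $2 ", " $2 "\""; next }
  { print }
'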

 

-tj

 

 

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Eric Sedore via HTCondor-users
Sent: Thursday, November 19, 2020 11:44 PM
To: htcondor-users@xxxxxxxxxxx
Cc: Eric Sedore <essedore@xxxxxxx>
Subject: [HTCondor-users] Running multiple jobs simultaneously on a single GPU

 

Good evening everyone,

 

I’ve listened to a few presentations that mentioned there is a way (either ready now or planned) to allow multiple jobs to utilize a single GPU.  This would be helpful as we have a number of workloads/jobs that do not consume the entire GPU (memory or processing).  Is there documentation (apologies if I missed it) that would assist with how to set up this configuration?

 

Happy to provide more of a description if my question is not clear.

 

Thanks,

-Eric