
Re: [HTCondor-users] multi-gpu-nodes limit access per slot



Hi Christoph and Todd,

On 12/10/2019 10:52 AM, Beyer, Christoph wrote:
Hi,

I have one 4-GPU node and wonder if there is a way to limit usage on a per-slot basis, e.g. four slots that each see and access a single GPU. Are cgroups the way to do this, and if so, how are they configured?


Maybe on this node just configure HTCondor with four static slots, each
with one GPU and some amount of CPU/RAM?  If you need partitionable
slots for some reason (e.g. RAM), you could edit your START expression
to say only jobs requesting 0 or 1 GPUs will be matched....

We ran some tests here on 2-socket/2-GPU nodes.
We used this static-slot approach to get correct CPU/GPU affinity (to limit undue latencies). I suppose you could limit resources the same way. For example:

# Create specific slots to enforce CPU/GPU affinity
# This conf DOES NOT suit multi-node MPI jobs
SLOT_TYPE_1 = cpus=2, gpus=1
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 2
# GPUs will always be assigned to the partitionable slots in order
ENFORCE_CPU_AFFINITY = True
# Pin each slot to the cores of its own socket (extend the lists to match your topology)
SLOT1_CPU_AFFINITY = 0,2
SLOT2_CPU_AFFINITY = 1,3
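
Following up on Todd's remark about the START expression: if you keep partitionable slots, something along these lines should restrict matches to jobs asking for at most one GPU (untested sketch; RequestGPUs is, if I'm not mistaken, the job attribute produced by request_gpus):

# Only match jobs that request zero or one GPU
START = $(START) && (TARGET.RequestGPUs =?= undefined || TARGET.RequestGPUs <= 1)

If I remember correctly, condor_status -af Name Cpus AssignedGPUs then shows which device ended up on which slot.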


As for restricting access to the GPUs, HTCondor will set the
CUDA_VISIBLE_DEVICES environment variable (and the OpenCL equivalent) to
point to the GPU provisioned to that slot. This environment variable is
honored by the low-level CUDA libraries.  Are you worried about GPU codes
that purposefully ignore or clear this environment variable?

We were worried about this here. One solution we imagined would be to make /dev/nvidia[0-X] writable only by their owner (root), and have a wrapper change the ownership when a job begins (via a dedicated script launched with sudo). This is a nice solution from the user's point of view, because you only see the GPUs you have been allocated, as if you were alone on the node, regardless of the environment.
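
Just to illustrate, a very rough sketch of such a helper (the script name and arguments are purely hypothetical; it assumes sudo is configured to allow it, and that the assigned device index and the job's user are passed in):

#!/bin/sh
# grant-gpu.sh -- hypothetical helper, run via sudo when a job starts
# usage: sudo grant-gpu.sh <device-index> <job-user>
# Hands the assigned /dev/nvidiaN over to the job's user; the other
# /dev/nvidia[0-X] stay owned by root (mode 600) and thus unusable.
# /dev/nvidiactl and /dev/nvidia-uvm would stay world-accessible.
dev="/dev/nvidia$1"
user="$2"
chown "$user" "$dev"
chmod 600 "$dev"

A matching script would give the device back to root when the job ends.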

However, it seems to me that using cgroups to manage access to the /dev/nvidia[0-X] devices would be really neat.
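
For the record, done by hand with the cgroup v1 devices controller the idea would look roughly like this (manual sketch only, not something HTCondor does for GPU devices as far as I know; it assumes the slot got /dev/nvidia0, the usual NVIDIA char major 195, and that $JOB_PID holds the job's PID):

# create a device cgroup for the slot's job
mkdir /sys/fs/cgroup/devices/htcondor_slot1
# default access is inherited (allow all), so only deny the GPUs
# this slot did not get (here nvidia1..nvidia3)
echo "c 195:1 rwm" > /sys/fs/cgroup/devices/htcondor_slot1/devices.deny
echo "c 195:2 rwm" > /sys/fs/cgroup/devices/htcondor_slot1/devices.deny
echo "c 195:3 rwm" > /sys/fs/cgroup/devices/htcondor_slot1/devices.deny
# move the job's processes into the cgroup
echo "$JOB_PID" > /sys/fs/cgroup/devices/htcondor_slot1/tasks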



--
Regards,

Nicolas Fournials