
Re: [HTCondor-users] multi-gpu-nodes limit access per slot



Hi Christoph,

nvidia-smi lets you regulate GPU access tightly on a multi-GPU machine via the per-device compute mode; see https://devtalk.nvidia.com/default/topic/1052524/cuda-programming-and-performance/nvidia-smi-exclusive_process/. We use EXCLUSIVE_PROCESS, and based on your user feedback I suspect you want to do the same.
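For reference, the compute mode is set per device with nvidia-smi (needs root; the device index here is just an example):

```shell
# Set GPU 0 to EXCLUSIVE_PROCESS: only one process may hold a CUDA
# context on the device at a time; a second process fails at init.
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Check the current mode
nvidia-smi -i 0 --query-gpu=compute_mode --format=csv
```

Note the setting does not survive a reboot unless persistence mode or a boot-time script reapplies it.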

It's also worth making sure that users never set the CUDA_VISIBLE_DEVICES environment variable themselves; instead they should request the correct number of GPUs and trust Condor to assign them!
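For what it's worth, the renumbering that the CUDA runtime applies can be illustrated without a GPU. This is a toy sketch of the variable's semantics (ordinal form only; CUDA also accepts GPU UUIDs), not Condor code:

```python
def visible_gpus(environ):
    """Mimic how the CUDA runtime interprets CUDA_VISIBLE_DEVICES:
    a comma-separated list of physical GPU ordinals. Inside the job,
    the listed devices are renumbered 0..N-1 in list order."""
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # unset: every GPU on the machine is visible
    # empty string means no GPUs visible at all
    return [int(tok) for tok in value.split(",") if tok.strip()]

# Condor hands a 1-GPU slot physical GPU 2: the job sees one device,
# and its in-job device 0 is really physical GPU 2.
assert visible_gpus({"CUDA_VISIBLE_DEVICES": "2"}) == [2]
# A 2-GPU slot: in-job devices 0 and 1 map to physical GPUs 1 and 3.
assert visible_gpus({"CUDA_VISIBLE_DEVICES": "1,3"}) == [1, 3]
```

This is exactly why a job that clears or overwrites the variable can suddenly "see" all four GPUs on the node.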

Cheers,

Alex
________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Beyer, Christoph <christoph.beyer@xxxxxxx>
Sent: 11 December 2019 09:23
To: htcondor-users
Subject: Re: [HTCondor-users] multi-gpu-nodes limit access per slot

Hi,

thanks for the helpful thoughts !

We usually have 'only' one GPU per node, so this problem hadn't come up before. I started off with 4 static slots on the only 4-GPU node we have, since I thought there would be no demand for a multi-GPU slot. But, as always, someone spotted the opportunity and has a project that would profit from a multi-GPU setup, so I reconfigured the 4-GPU machine with a dynamic/partitionable slot.

Now users complain that they can see (use?) all 4 GPUs from a single-GPU slot; I will have to dig further into this to pin the problem down ...

Best
Christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "nicolas fournials" <nicolas.fournials@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 11 December 2019 09:41:24
Subject: Re: [HTCondor-users] multi-gpu-nodes limit access per slot

Hi Christoph and Todd,

> On 12/10/2019 10:52 AM, Beyer, Christoph wrote:
>> Hi,
>>
>> I do have one 4-GPU node and wonder if there is a way to limit usage on a per-slot basis, e.g. 4 slots that each see & access a single GPU. Are cgroups the way to do this, and if so, how are they configured?
>>
>
> Maybe on this node just configure HTCondor with four static slots, each
> with one GPU and some amount of CPU/RAM?  If you need partitionable
> slots for some reason (e.g. RAM), you could edit your START expression
> to say only jobs requesting 0 or 1 GPUs will be matched....
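A START restriction along those lines might look like this (a sketch only; RequestGPUs is the job-ad attribute, but the exact expression depends on your existing START config):

```
# Sketch: on the 4-GPU node, only match jobs asking for at most 1 GPU
START = $(START) && (TARGET.RequestGPUs =?= undefined || TARGET.RequestGPUs <= 1)
```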

We did some tests here on 2-socket/2-GPU nodes.
We used this static-slot solution to get correct CPU/GPU affinity
(to limit undue latency). I suppose you could limit other resources
the same way?
For example:

# Create specific slots to enforce CPU/GPU affinity
# This conf DOES NOT suit MPI multi-nodes jobs
SLOT_TYPE_1 = cpus=2, gpus=1
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 2
# GPUs will always be assigned to the partitionable slots in order
ENFORCE_CPU_AFFINITY = True
# List each slot's cores (extend as appropriate for your topology)
SLOT1_CPU_AFFINITY = 0,2
SLOT2_CPU_AFFINITY = 1,3


> As for restricting access to the GPUs, HTCondor will set
> CUDA_VISIBLE_DEVICES environment variable (and the OpenCL equivalent) to
> point to the GPU provisioned to that slot. This environment variable is
> honored by low-level CUDA libraries.   Are you worried about GPU codes
> that purposefully ignore or clear this environment variable?

We were worried about this here. One solution we imagined would be to
make /dev/nvidia[0-X] writable only by its owner (root), and run a
wrapper when a job begins to change the ownership (via a dedicated
script launched with sudo).
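A minimal sketch of such a wrapper (hypothetical script name and paths; assumes a sudoers rule letting the condor user run it, and a matching script to revert ownership when the job ends):

```shell
#!/bin/sh
# Hypothetical prologue: grant the job's user the GPU Condor assigned.
# Usage: sudo grant-gpu.sh <gpu-index> <username>
set -eu
gpu="$1"; user="$2"
chown "$user" "/dev/nvidia${gpu}"
chmod u+rw "/dev/nvidia${gpu}"
```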
This is a nice solution from the user's point of view, because
regardless of the environment you only see what you have been
allocated, as if you were alone on the node.

However, it seems to me that cgroups to manage /dev/nvidia[0-X] devices
would be really neat.
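With the cgroup v1 devices controller that could look roughly like this (a sketch; the cgroup path is hypothetical, NVIDIA character devices use major number 195 with minor = GPU index, and /dev/nvidiactl is 195:255):

```shell
# Deny the job's cgroup all NVIDIA devices, then allow only GPU 0
echo 'c 195:* rwm' > /sys/fs/cgroup/devices/htcondor_job/devices.deny
echo 'c 195:0 rwm' > /sys/fs/cgroup/devices/htcondor_job/devices.allow
# nvidiactl must stay accessible for the driver to work at all
echo 'c 195:255 rwm' > /sys/fs/cgroup/devices/htcondor_job/devices.allow
```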



--
Regards,

Nicolas Fournials
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

