
Re: [HTCondor-users] Job access to own Job and Machine ClassAds?



Hi,

Some tools simply ignore CUDA_VISIBLE_DEVICES. For instance, anything that doesn't use CUDA but the graphics subsystem of the card, like headless rendering via EGL.

Also, some libraries / frameworks override CUDA_VISIBLE_DEVICES by default, so in our experience it's not as reliable as we'd like it to be.

What we do is use a job wrapper that calls a small setuid tool, which uses device cgroups [1] to make sure only the assigned GPU(s) can be accessed.

This can't be bypassed by the user, and it has the added benefit of "hiding" any other GPUs in the system.

Ideally this could be part of the starter's cgroup setup (when enabled); I think Slurm can do something similar. In the meantime it's working quite well for us.
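
For illustration, a minimal sketch of what the wrapper side could look like with the cgroup v1 devices controller. The cgroup path, the job name, and the assigned GPU minor are assumptions for the example, not our actual tool (NVIDIA character devices use major number 195; /dev/nvidia0 is 195:0 and /dev/nvidiactl is 195:255):

# must run with root privileges (that's what the setuid helper is for)
CG=/sys/fs/cgroup/devices/htcondor/job1
mkdir -p "$CG"
# first deny access to all NVIDIA character devices (major 195)
echo 'c 195:* rwm' > "$CG/devices.deny"
# then re-allow only the assigned GPU and the control device
echo 'c 195:0 rwm' > "$CG/devices.allow"    # /dev/nvidia0
echo 'c 195:255 rwm' > "$CG/devices.allow"  # /dev/nvidiactl
# (a real tool would also handle /dev/nvidia-uvm and friends)
# finally put the job's process into the cgroup; children inherit it
echo "$$" > "$CG/tasks"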

Best,

Joan

[1]: https://docs.kernel.org/admin-guide/cgroup-v1/devices.html

On 15/5/23 13:01, Joachim Meyer wrote:
Hi Steffen,

HTCondor sets the _CONDOR_JOB_AD and _CONDOR_MACHINE_AD environment variables,
which point to files containing the respective ClassAd dumps.
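
Since those files are plain "Attribute = Value" text, standard shell tools
are enough, no Python bindings needed. For example (the attribute values
shown here are made up for illustration):

$ grep -i '^AssignedGPUs' "$_CONDOR_MACHINE_AD"
AssignedGPUs = "GPU-24a2cfec"
$ awk -F' = ' '$1 == "RequestCpus" {print $2}' "$_CONDOR_JOB_AD"
1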

Regarding the GPU-related question:
HTCondor jobs get the environment variables CUDA_VISIBLE_DEVICES,
GPU_DEVICE_ORDINAL, and _CONDOR_AssignedGPUs, each containing a
comma-separated list of device identifiers. These are the GPUs that were
assigned to the job.
At least with Nvidia GPUs, these are strings, not actual ordinals: e.g.
"GPU-24a2cfec".
For some multi-GPU jobs and some frameworks that use CUDA under the hood, we
observed that they weren't happy with CUDA_VISIBLE_DEVICES being set to the
string IDs. When users report this, we provide them with a script that
translates the string IDs into integral ones; the script is attached. In case
there are issues, we recommend running it at the start of the job.
See the usage:

$ echo ${CUDA_VISIBLE_DEVICES}
GPU-9515f130,GPU-fc201e55,GPU-515d339b,GPU-77ac93e6
$ source translate_gpu_ids.sh
now: CUDA_VISIBLE_DEVICES=0,1,2,3
$ echo ${CUDA_VISIBLE_DEVICES}
0,1,2,3
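
The attached script isn't reproduced here, but a roughly equivalent
translation could look like the sketch below. It assumes (an assumption on
my part, not a description of the attached script) that the GPU-<id>
strings are prefixes of the full UUIDs reported by nvidia-smi:

new=""
for id in ${CUDA_VISIBLE_DEVICES//,/ }; do
    # find the nvidia-smi index whose UUID starts with the short id
    idx=$(nvidia-smi --query-gpu=index,uuid --format=csv,noheader \
        | awk -F', ' -v id="$id" 'index($2, id) == 1 {print $1}')
    new="${new:+$new,}$idx"
done
export CUDA_VISIBLE_DEVICES="$new"
echo "now: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"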

Yes, a job can alter these environment variables, and that could be abused,
but I'd really expect users not to change them: if everybody did, nobody
would be able to do anything useful.

Hope this helped!
- Joachim


Am Montag, 15. Mai 2023, 12:24:07 CEST schrieb Steffen Grunewald:
Good morning/afternoon/...,

we're facing a problem with GPU-bound jobs, and while investigating the best
approach to using a multi-GPU machine (I couldn't find an equivalent to CPU
sharing, as that is handled by the kernel), I was wondering

- Does a job running in its slot have a means to read its own Job ClassAd,
and the Machine ClassAd of the slot it's running in?
- If the answer is yes, how to do it without Python bindings?

(The background is: If the OSG gets access to some of our GPUs, how do we
and how do the users make sure there are no collisions? If there's already
a canonical way to assign and use GPUs known to, and used by, everyone -
I'd like to join in... If there isn't, how to set up a standard?)

Thanks,
  Steffen


--
Dr. Joan Josep Piles-Contreras
ZWE Scientific Computing
Max Planck Institute for Intelligent Systems
(p) +49 7071 601 1750
