Re: [HTCondor-users] Job access to own Job and Machine ClassAds?



On 5/15/2023 7:20 AM, Joan Josep Piles-Contreras wrote:
Hi,

Some tools ignore CUDA_VISIBLE_DEVICES entirely: for instance, anything that uses the card's graphics subsystem rather than CUDA, such as headless rendering with EGL.

Also, some libraries and frameworks override CUDA_VISIBLE_DEVICES by default, so in our experience it's not as reliable as we'd like.

What we do is to use a job wrapper that calls a small suid tool that uses device cgroups [1] to make sure only the assigned GPU(s) can be accessed.

This can't be bypassed by the user, and it has the added benefit of "hiding" any other GPUs in the system.
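
For readers unfamiliar with device cgroups, here is a rough sketch of the kind of rules such a tool might write. It is not the actual wrapper or suid tool mentioned above. It assumes the cgroup v1 devices controller (devices.allow / devices.deny), and it assumes the /dev/nvidiaN nodes use character major 195 with minor N, which is typical for the NVIDIA Linux driver but should be checked with "ls -l /dev/nvidia*". The function names and the cgroup directory argument are hypothetical.

```python
import os

# Assumption: /dev/nvidiaN is a character device with major 195 and minor N.
NVIDIA_MAJOR = 195


def gpu_device_rules(assigned_minors, all_minors):
    """Build (control_file, rule) pairs for the cgroup v1 devices controller.

    GPUs assigned to the job are allowed; every other GPU is denied.
    Rule format is "<type> <major>:<minor> <access>", where access is any
    of r (read), w (write), m (mknod).
    """
    assigned = set(assigned_minors)
    rules = []
    for minor in sorted(set(all_minors)):
        control_file = "devices.allow" if minor in assigned else "devices.deny"
        rules.append((control_file, f"c {NVIDIA_MAJOR}:{minor} rwm"))
    return rules


def apply_rules(cgroup_dir, rules):
    """Write each rule to its control file under cgroup_dir.

    Needs root (hence the suid helper in the setup described above);
    cgroup_dir is a hypothetical path to the job's devices cgroup.
    """
    for control_file, rule in rules:
        with open(os.path.join(cgroup_dir, control_file), "w") as fh:
            fh.write(rule)


if __name__ == "__main__":
    # Example: a 4-GPU machine where the job was assigned GPU 2.
    for control_file, rule in gpu_device_rules([2], range(4)):
        print(control_file, rule)
```

Note that this sketch leaves shared nodes such as /dev/nvidiactl untouched, and that cgroup v2 has no devices.allow / devices.deny files (device access is filtered with an eBPF program there), so a v2 setup would need a different mechanism.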

Ideally this could be part of the starter's cgroup setup (when enabled). I think Slurm can do something similar, but in the meantime it's working quite well for us.

Hi Joan,

I like the above idea, thank you for sharing. We have considered changing the ownership on the GPU /dev files, but I like the idea of using device cgroups much better. Could you please email me (off-group is fine) your wrapper/suid tool for reference, and I will see about incorporating it directly into HTCondor's native cgroup support. Or if you are interested/willing to make a GitHub pull request to do the same, that is also welcome :).

Thank you Joan,
regards,
Todd


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences