
[HTCondor-users] ebpf expert tutorial during a Condor workshop an option?



Hi Todd and Christoph,

just an ad hoc idea: an eBPF intro or tutorial by an expert could be interesting for a Condor Week (at least I would be very interested to learn more about eBPF). But maybe it is out of scope for a Condor workshop...

I got curious about the cgroup device controller after Todd's comment on Joan's suggestion to use it for GPUs (I had not noticed device controllers before) and looked at how it works in cgroups v2. However, the cgroup v2 "Device controller" section does not read very encouragingly: there is no simple pseudo-file interface, and the controller can only be used via eBPF
  https://docs.kernel.org/admin-guide/cgroup-v2.html
But maybe it could be an option for admins to learn a bit more about eBPF (if an admin could inject their own small eBPF programlets in addition to the general Condor job control)?
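
To make the idea of such a "programlet" a bit more concrete, here is a minimal userspace model, in Python, of the decision a cgroup v2 device filter would have to make. It is only a sketch under assumptions: a real filter is an eBPF program of type BPF_PROG_TYPE_CGROUP_DEVICE attached to the job's cgroup, not Python, and nothing here is taken from HTCondor. The function names and the policy (restrict only NVIDIA GPU nodes, leave everything else alone) are hypothetical. The constants mirror the BPF_DEVCG_* values in the kernel's linux/bpf.h, and major 195 with minor 255 for /dev/nvidiactl is the usual NVIDIA device numbering.

```python
# Userspace model only. The kernel passes a real filter three fields:
# access_type, major and minor. The filter returns 1 (allow) or 0 (deny).

# Values mirror the BPF_DEVCG_* constants in linux/bpf.h.
BPF_DEVCG_DEV_BLOCK = 1
BPF_DEVCG_DEV_CHAR = 2
BPF_DEVCG_ACC_MKNOD = 1
BPF_DEVCG_ACC_READ = 2
BPF_DEVCG_ACC_WRITE = 4

NVIDIA_MAJOR = 195      # character-device major of /dev/nvidia0, /dev/nvidia1, ...
NVIDIACTL_MINOR = 255   # /dev/nvidiactl, needed by every GPU job


def encode_access_type(dev_type, access):
    """Pack the fields the way the kernel does: access bits in the upper 16 bits."""
    return (access << 16) | dev_type


def allow_device(access_type, major, minor, allowed_gpu_minors):
    """Hypothetical policy: restrict NVIDIA GPU nodes, allow everything else."""
    dev_type = access_type & 0xFFFF
    if dev_type != BPF_DEVCG_DEV_CHAR or major != NVIDIA_MAJOR:
        return 1  # not an NVIDIA GPU node, so this sketch does not restrict it
    if minor == NVIDIACTL_MINOR:
        return 1  # control device shared by all GPUs
    return 1 if minor in allowed_gpu_minors else 0


if __name__ == "__main__":
    rw = encode_access_type(BPF_DEVCG_DEV_CHAR,
                            BPF_DEVCG_ACC_READ | BPF_DEVCG_ACC_WRITE)
    for gpu_minor in range(4):
        print(gpu_minor, allow_device(rw, NVIDIA_MAJOR, gpu_minor, {2}))
```

With GPU 2 assigned to the job, the model allows opening /dev/nvidia2 and /dev/nvidiactl and denies the other GPU nodes. An actual deployment would express the same check in restricted C, compile it to eBPF and attach it to the cgroup.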

Cheers,
  Thomas


On 16/05/2023 17.51, Todd Tannenbaum via HTCondor-users wrote:
On 5/15/2023 7:20 AM, Joan Josep Piles-Contreras wrote:
Hi,

Some tools directly ignore CUDA_VISIBLE_DEVICES. For instance, anything not using CUDA but the graphics subsystem of the card, like headless rendering using EGL.

Also, some libraries / frameworks override CUDA_VISIBLE_DEVICES by default, so in our experience it's not as reliable as we'd like.

What we do is to use a job wrapper that calls a small suid tool that uses device cgroups [1] to make sure only the assigned GPU(s) can be accessed.

This can't be bypassed by the user, and it has the added benefit of "hiding" any other GPUs in the system.

Ideally this could be part of the starterd cgroup setup (when enabled). I think Slurm can do something similar, but in the meantime it's working quite well for us.
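
As a rough illustration of the device-cgroup approach described above (not the actual wrapper or suid tool, which is not shown in this thread), the following Python sketch builds the rule lines that a cgroup v1 devices controller accepts in its devices.deny file. It assumes NVIDIA GPUs, whose nodes /dev/nvidiaN are character devices with major 195 and minor N; the function name gpu_deny_rules and its parameters are hypothetical.

```python
NVIDIA_MAJOR = 195  # character-device major of /dev/nvidia0, /dev/nvidia1, ...


def gpu_deny_rules(all_gpu_minors, assigned_gpu_minors):
    """Return one devices.deny line per GPU that is NOT assigned to the job.

    Each line has the cgroup v1 form "c MAJOR:MINOR rwm", i.e. deny read,
    write and mknod on that character device.
    """
    hidden = set(all_gpu_minors) - set(assigned_gpu_minors)
    return [f"c {NVIDIA_MAJOR}:{minor} rwm" for minor in sorted(hidden)]


if __name__ == "__main__":
    # Four GPUs in the machine, GPU 2 assigned to the job.
    for rule in gpu_deny_rules(range(4), [2]):
        print(rule)
    # c 195:0 rwm
    # c 195:1 rwm
    # c 195:3 rwm
```

A privileged helper would then write each line to the devices.deny file of the job's cgroup before the job starts. That write needs root, which is presumably why a suid tool is involved.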

Hi Joan,

I like the above idea, thank you for sharing. We have considered changing the ownership on the gpu /dev files, but I like the idea of using device cgroups much better. Could you please email me (off-group is fine) your wrapper/suid tool for reference, and I will see about incorporating it directly into HTCondor's native cgroup support. Or if you are interested/willing to make a GitHub pull request to do the same, that is also welcome :).

Thank you Joan,
regards,
Todd


--
Todd Tannenbaum<tannenba@xxxxxxxxxxx>   University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences

