
Re: [HTCondor-users] Multi GPUs on multiple nodes



A few questions for clarification:

* Do they need more GPUs for their application than are available on a single node, i.e. more than 4 or 8?

* Are they using model or data parallelism in their training?

Out of the box, PyTorch only uses the GPUs on a single machine. For cross-node training you will need to use something like PyTorch Distributed: https://pytorch.org/tutorials/beginner/dist_overview.html
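
For reference, a minimal sketch of what the PyTorch side of that looks like, using DistributedDataParallel with the NCCL backend. The model is a placeholder, and it assumes the rendezvous environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK) are provided by whatever launches the processes, e.g. torchrun:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL is the usual backend for multi-GPU, multi-node training; the initial
    # rendezvous happens over TCP via MASTER_ADDR/MASTER_PORT set by the launcher.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])

    # ... build a DataLoader with a DistributedSampler and run the training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

You would normally start one copy of this per node with something like "torchrun --nnodes=<N> --nproc_per_node=<GPUs per node> train.py", and torchrun fills in RANK/WORLD_SIZE/LOCAL_RANK for each process.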

In HTCondor this would require the parallel universe. I am not sure what the status of that is with GPUs.
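
If someone wants to experiment, a parallel-universe submit file would presumably look roughly like the sketch below. run_ddp.sh is a hypothetical wrapper that works out MASTER_ADDR and the per-node rank before calling torchrun, and the resource numbers are made up:

universe        = parallel
executable      = run_ddp.sh        # hypothetical wrapper: sets MASTER_ADDR, then runs torchrun
machine_count   = 2                 # number of nodes
request_gpus    = 4                 # GPUs per node
request_cpus    = 8
request_memory  = 64GB
log             = ddp.log
output          = ddp.$(Node).out   # $(Node) is the per-node index in the parallel universe
error           = ddp.$(Node).err
queue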

Benedikt

On Tue, Nov 21, 2023 at 10:04 AM Dudu Handelman <duduhandelman@xxxxxxxxxxx> wrote:
Hi All,
My users are using PyTorch and are considering using multiple GPUs across multiple physical servers.
I think that PyTorch is able to do that out of the box, using TCP between the workers.

I wonder if anyone is doing that on top of HTCondor?

Thanks
David.
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


--
Benedikt Riedel
Global Computing Coordinator, IceCube Neutrino Observatory
Technical Coordinator, IceCube Neutrino Observatory
Computing Manager, Wisconsin IceCube Particle Astrophysics Center
University of Wisconsin-Madison