
Re: [HTCondor-users] Multi GPUs on multiple nodes



Thanks. 
They are doing model parallelism. 





From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Benedikt Riedel <briedel@xxxxxxxxxxxxxxxx>
Sent: Tuesday, November 21, 2023 7:23:24 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Multi GPUs on multiple nodes

Are they doing data or model parallelism? This has a big effect on how this should be set up.

Benedikt

On Tue, Nov 21, 2023 at 11:19 AM Dudu Handelman <duduhandelman@xxxxxxxxxxx> wrote:
Thanks Benedikt. 
They are currently using 8 GPUs on a single server. 
They are considering expanding beyond that for the model. 

Thanks 
David




From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Benedikt Riedel <briedel@xxxxxxxxxxxxxxxx>
Sent: Tuesday, November 21, 2023 6:27:46 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Multi GPUs on multiple nodes

A few questions for clarification:

* Do they need more GPUs for their application than are available on a single node, i.e. more than 4 or 8?

* Are they using model or data parallelism in their training?

Out of the box, PyTorch only uses the GPUs on a single machine. For cross-node training you will need to use something like PyTorch Distributed: https://pytorch.org/tutorials/beginner/dist_overview.html
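
As a rough illustration of what the cross-node case involves, here is a minimal sketch, assuming one process per GPU, the same number of GPUs on every node, and the env:// rendezvous of torch.distributed. The helper names (global_rank, init_distributed) and the default port are made up for this example. torch is imported lazily so the rank arithmetic can be checked on a machine without it.

```python
import os


def global_rank(node_rank: int, gpus_per_node: int, local_rank: int) -> int:
    """Global rank of one process, assuming one process per GPU and
    the same number of GPUs on every node."""
    return node_rank * gpus_per_node + local_rank


def init_distributed(node_rank, nnodes, gpus_per_node, local_rank,
                     master_addr, master_port=29500):
    """Join the process group over TCP (needs PyTorch built with NCCL).

    master_addr must be reachable from every node; how it is discovered
    is site-specific and not covered here.
    """
    import torch
    import torch.distributed as dist

    # The env:// rendezvous reads the master's address and port from
    # these two environment variables.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=global_rank(node_rank, gpus_per_node, local_rank),
        world_size=nnodes * gpus_per_node,
    )
    # Bind this process to its own GPU on the local machine.
    torch.cuda.set_device(local_rank)
```

With two 8-GPU nodes, the process for GPU 3 on the second node would call init_distributed(1, 2, 8, 3, master_addr) and end up as rank 11 of 16.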

In HTCondor this would require the parallel universe. I am not sure what the status of that with GPUs is. 
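
For what it is worth, here is an untested sketch of how a parallel universe job might hand node information to the training script. It assumes the job sees the _CONDOR_PROCNO and _CONDOR_NPROCS environment variables that the parallel universe wrapper scripts rely on. The submit description in the comment and the names node_info and train_wrapper.sh are hypothetical, and whether request_gpus behaves as hoped here is exactly the open question.

```python
import os

# Hypothetical submit description (not verified with GPUs):
#   universe      = parallel
#   executable    = train_wrapper.sh
#   machine_count = 2
#   request_gpus  = 8
#   queue


def node_info(env=os.environ):
    """Return (node_rank, nnodes) for this node of a parallel universe job.

    Assumes _CONDOR_PROCNO (this node's index, starting at 0) and
    _CONDOR_NPROCS (total number of nodes) are set in the job environment.
    """
    return int(env["_CONDOR_PROCNO"]), int(env["_CONDOR_NPROCS"])
```

The pair returned by node_info() could then be passed on as the node rank and node count of whatever launcher the users choose. The address of the rank 0 node still has to be shared between nodes by some other means.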

Benedikt

On Tue, Nov 21, 2023 at 10:04 AM Dudu Handelman <duduhandelman@xxxxxxxxxxx> wrote:
Hi All,
My users are using PyTorch and are considering using multiple GPUs across multiple physical servers.
I think that PyTorch is able to do that out of the box, using TCP between the workers.

I wonder if anyone is doing that on top of HTCondor?

Thanks
David.
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


--
Benedikt Riedel
Global Computing Coordinator IceCube Neutrino Observatory
Technical Coordinator IceCube Neutrino Observatory
Computing Manager Wisconsin IceCube Particle Astrophysics Center
University of Wisconsin-Madison