
Re: [HTCondor-users] htcondor + gpudirect + openmpi



Hi,

Please keep me in this loop as well; I am looking for the same thing. I hope Jason remembers me. Can you tell me how to set the OMP_NUM_THREADS environment variable for OpenMP?

Regards,
Malathi


From: "Jason Patton" <jpatton@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, September 6, 2017 1:20:13 AM
Subject: Re: [HTCondor-users] htcondor + gpudirect + openmpi

Harald,

I'm going to investigate this some more. I'm guessing we need to modify the contact file to specify how many cores should be used on each host and then modify the "-n" argument to mpirun accordingly in openmpiscript. Do you know if there are any environment variables that also need to be passed? (For example, I'm thinking of OpenMP jobs needing OMP_NUM_THREADS set correctly, but that's not OpenMPI...)
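
A minimal sketch of that idea, assuming Open MPI's "host slots=N" hostfile syntax and that the first contact-file field is the identifier to use (as in the sort/awk snippet quoted below). The function names make_hostfile and total_ranks are hypothetical and are not part of openmpiscript:

```shell
#!/bin/sh
# Sketch only: make_hostfile and total_ranks are hypothetical helpers.

# Print one hostfile line per contact-file entry, each with a slot count.
# $1 = contact file, $2 = slots per host (e.g. the job's request_cpus)
make_hostfile() {
    # Field 1 is used as the host identifier, as in the thread's snippet.
    sort -n -k 1 < "$1" | awk -v n="$2" '{ print $1 " slots=" n }'
}

# Print the total rank count for "mpirun -n": entries * slots per host.
# $1 = contact file, $2 = slots per host
total_ranks() {
    hosts=$(awk 'END { print NR }' "$1")
    echo $((hosts * $2))
}
```

With two hosts and 4 slots per host, this would yield a two-line hostfile and "-n 8" for mpirun. Whether the ranks then land on separate cores still has to be checked on the cluster.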

Jason Patton

On Tue, Sep 5, 2017 at 2:04 PM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx> wrote:
Dear all,

We want to use HTCondor 8.6.5 in a GPU cluster with Open MPI in the parallel
universe.
Our main task will be to run Open MPI jobs with up to 16 GPUs on nodes with 4
or 8 GPUs installed.
To profit from the P2P connection on the board, we want to have 4 or 8 MPI
processes running on one machine and not distributed over the whole cluster.

If we use for example
universe = parallel
executable = /mpi/openmpiscript
arguments = a.out
machine_count = 2
request_cpus = 4
request_gpus = 4

the slots are reserved correctly, but openmpiscript ignores the CPU request and
starts 2 MPI processes in total instead of 4 on each node used.

If I just copy the host list 4 times with
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' > machines
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
and use
mpirun ... -n 8  -hostfile machines ...

the a.out processes are started, 4 on each machine, but all 4 processes on a
machine are bound to the same core.

How can I arrange for 4 a.out processes to run on each machine and use 4 cores
in total, or even more if each of them uses threads?
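
For the threaded case, a rough sketch of how a wrapper script might pick a thread count per rank. It assumes the job environment provides _CONDOR_MACHINE_AD pointing at the slot's machine ad, and that this file contains a "Cpus = N" line; the helper names cpus_from_ad and threads_per_rank are hypothetical:

```shell
#!/bin/sh
# Sketch only: cpus_from_ad and threads_per_rank are hypothetical helpers.

# Print the value of the "Cpus" attribute from a ClassAd file written as
# "Name = value" lines (assumed format of the machine ad).
cpus_from_ad() {
    awk -F' *= *' '$1 == "Cpus" { print $2; exit }' "$1"
}

# Print the cores available to each rank: cpus / ranks per host, minimum 1.
# $1 = cores in the slot, $2 = MPI ranks started on that host
threads_per_rank() {
    threads=$(($1 / $2))
    if [ "$threads" -lt 1 ]; then
        threads=1
    fi
    echo "$threads"
}

# Possible use inside a wrapper (not executed here):
#   cpus=$(cpus_from_ad "$_CONDOR_MACHINE_AD")
#   OMP_NUM_THREADS=$(threads_per_rank "$cpus" 4)
#   export OMP_NUM_THREADS
```

The value could then be forwarded to the remote ranks with mpirun's "-x OMP_NUM_THREADS" option, and "--report-bindings" can be added to see which cores each rank ends up on. Neither is confirmed here as a fix for the binding problem described above.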

Best
Harald

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

