
Re: [HTCondor-users] htcondor + gpudirect + openmpi



Hello,

On Tuesday 05 September 2017 21:50:13 Jason Patton wrote:
> Harald,
> 
> I'm going to investigate this some more. I'm guessing we need to modify the
> contact file to specify how many cores should be used on each host and then
> modify the "-n" argument to mpirun accordingly in openmpiscript. 

Yes, I think this is the right direction, but this is where my knowledge ends.
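
A minimal sketch of that direction, assuming Open MPI's hostfile syntax "host slots=N" and a machines file with one host name per line, as written by the sort/awk lines quoted below. The helper name hostfile_with_slots and the file name machines.slots are hypothetical, and whether this alone fixes the core binding under HTCondor is untested:

```shell
# Hypothetical helper: turn a list of host names (one per line, duplicates
# allowed) into an Open MPI hostfile with a fixed slot count per host.
hostfile_with_slots() {
    machines_file=$1   # e.g. the "machines" file written by openmpiscript
    slots=$2           # MPI processes per host, e.g. the request_cpus value
    # De-duplicate the hosts, skip empty lines, append "slots=N" to each.
    sort -u "$machines_file" | awk -v n="$slots" 'NF { print $1 " slots=" n }'
}

# Possible use inside openmpiscript (assumption, untested under HTCondor):
#   hostfile_with_slots machines 4 > machines.slots
#   mpirun -n 8 -hostfile machines.slots --map-by core --bind-to core \
#          --report-bindings a.out
```

Here -n 8 corresponds to machine_count = 2 times request_cpus = 4, and --report-bindings makes mpirun print which cores each rank ended up on.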

> Do you
> know if there are any environment variables that also need to be passed?
> (For example, I'm thinking of OpenMP jobs needing OMP_NUM_THREADS set
> correctly, but that's not OpenMPI...)

I will try to find out. With Slurm, at least, it seems to work, so maybe I
can learn from how mpirun is started there.
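
If variables do need to be passed on, Open MPI's mpirun has the -x option to export a variable to the launched processes. A small sketch, where the helper name mpirun_export_args is hypothetical and OMP_NUM_THREADS is only the example variable from the question above:

```shell
# Hypothetical helper: print "-x NAME" for each listed variable that is
# actually set in the current environment; unset names are skipped.
mpirun_export_args() {
    for name in "$@"; do
        # printenv exits non-zero when the variable is not set
        if printenv "$name" >/dev/null; then
            printf -- '-x %s\n' "$name"
        fi
    done
}

# Possible use (assumption, untested under HTCondor):
#   mpirun $(mpirun_export_args OMP_NUM_THREADS) -n 8 -hostfile machines a.out
```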

Harald


> 
> Jason Patton
> 
> On Tue, Sep 5, 2017 at 2:04 PM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx> wrote:
> > Dear all,
> > 
> > we want to use HTCondor 8.6.5 on a GPU cluster with Open MPI in the
> > parallel universe.
> > Our main task will be to run Open MPI jobs with up to 16 GPUs on nodes
> > with 4 or 8 GPUs installed.
> > To profit from the on-board P2P connection, we want 4 or 8 MPI processes
> > running on one machine rather than distributed over the whole cluster.
> > 
> > If we use for example
> > universe = parallel
> > executable = /mpi/openmpiscript
> > arguments = a.out
> > machine_count = 2
> > request_cpus = 4
> > request_gpus = 4
> > 
> > the slots are reserved correctly, but openmpiscript ignores the CPU
> > request and starts 2 MPI processes in total, not 4 on each node used.
> > 
> > if I just copy the hosts 4 times
> > sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' > machines
> > sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
> > sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
> > sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
> > and use
> > mpirun ... -n 8  -hostfile machines ...
> > 
> > the a.out processes are started, 4 on each machine, but all 4 processes
> > are bound to the same core.
> > 
> > How can I arrange for 4 a.out processes to run on each machine and use 4
> > cores in total, or even more if each of them uses threads?
> > 
> > Best
> > Harald
> > 