
Re: [HTCondor-users] htcondor + gpudirect + openmpi



Hi Jason,

I think I had done something wrong before, or it simply works since I have 
installed Mellanox OFED 4.1.

Up to now I have tested it only with openmpi-2.0.2a1, but at least for this 
version it works if I loop over the requested GPUs to simply get more 
lines in the machine file, and for the -n argument I multiply 
$_CONDOR_NPROCS by the number of requested GPUs.
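
In shell the loop looks roughly like this (only a sketch, not my final 
openmpiscript; machines.base is a temporary file and GPUS_PER_NODE is a 
placeholder for the per-node GPU count, see below for how I would get it):

  # hosts from the contact file, one line per node
  sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' > machines.base
  # repeat the host list once per requested GPU
  > machines
  i=0
  while [ "$i" -lt "$GPUS_PER_NODE" ]; do
      cat machines.base >> machines
      i=$((i + 1))
  done
  # one MPI rank per requested GPU on every node
  NP=$(( _CONDOR_NPROCS * GPUS_PER_NODE ))
  mpirun -n $NP -hostfile machines a.out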

How do I get the number of requested GPUs in the script?
At the moment I would parse _CONDOR_AssignedGPUs and count its entries.
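
For example (again only a sketch, assuming _CONDOR_AssignedGPUs holds a 
comma-separated list such as "CUDA0,CUDA1"):

  # number of entries in the comma-separated GPU list
  GPUS_PER_NODE=$(echo "$_CONDOR_AssignedGPUs" | awk -F',' '{print NF}')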

OMP_NUM_THREADS can be used to get the request_cpus value, but in general 
this is not the number of MPI processes per node.

Up to now I have only tested with CPUs, but I hope I can start real GPU jobs soon.

Best regards
Harald


On Wednesday 06 September 2017 06:00:53 Harald van Pee wrote:
> Hello,
> 
> On Tuesday 05 September 2017 21:50:13 Jason Patton wrote:
> > Harald,
> > 
> > I'm going to investigate this some more. I'm guessing we need to modify
> > the contact file to specify how many cores should be used on each host
> > and then modify the "-n" argument to mpirun accordingly in
> > openmpiscript.
> 
> yes, I think this is the right direction, but here my knowledge ends.
> 
> > Do you
> > know if there are any environment variables that also need to be passed?
> > (For example, I'm thinking of OpenMP jobs needing OMP_NUM_THREADS set
> > correctly, but that's not OpenMPI...)
> 
> I will try to find out; at least with Slurm it seems to work, so maybe I
> can learn how mpirun was started there.
> 
> Harald
> 
> > Jason Patton
> > 
> > On Tue, Sep 5, 2017 at 2:04 PM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx> wrote:
> > > Dear all,
> > > 
> > > we want to use HTCondor 8.6.5 in a GPU cluster with Open MPI in the
> > > parallel universe.
> > > Our main task will be to run Open MPI with up to 16 GPUs on nodes that
> > > have 4 or 8 GPUs installed.
> > > To benefit from the P2P connection on the board, we want to have 4 or 8
> > > MPI processes running on one machine and not distributed over the
> > > whole cluster.
> > > 
> > > If we use for example
> > > universe = parallel
> > > executable = /mpi/openmpiscript
> > > arguments = a.out
> > > machine_count = 2
> > > request_cpus = 4
> > > request_gpus = 4
> > > 
> > > the slots are reserved correctly, but openmpiscript ignores the cpu
> > > request and starts 2 MPI processes in total, not 4 on each node used.
> > > 
> > > if I just copy the hosts 4 times
> > > sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' > machines
> > > sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
> > > sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
> > > sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
> > > and use
> > > mpirun ... -n 8  -hostfile machines ...
> > > 
> > > the a.out processes are started, 4 on each machine, but all 4 processes
> > > are bound to the same core.
> > > 
> > > How can I arrange for 4 a.out processes to run on each machine and use
> > > 4 cores in total, or even more if each of them uses threads?
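
(A note on the binding question quoted above: with Open MPI the usual way 
seems to be to give each host a slot count in the hostfile instead of 
repeating the host line, e.g.

  node01 slots=4
  node02 slots=4

and then let mpirun do the mapping and binding, e.g.

  mpirun -n 8 -hostfile machines --map-by slot --bind-to core a.out

The host names here are only placeholders; I have not verified this with 
our setup yet.)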
> > > 
> > > Best
> > > Harald
> > > 
> 

-- 
Harald van Pee

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
mail: pee@xxxxxxxxxxxxxxxxx