
[HTCondor-users] htcondor + gpudirect + openmpi

Dear all,

we want to use HTCondor 8.6.5 on a GPU cluster with Open MPI in the parallel 
universe. Our main task will be to run Open MPI jobs on up to 16 GPUs, on nodes 
with 4 or 8 GPUs installed.
To profit from the P2P connection on the board, we want 4 or 8 MPI processes 
running on one machine rather than distributed over the whole cluster.

If we use, for example,
universe = parallel
executable = /mpi/openmpiscript
arguments = a.out 
machine_count = 2
request_cpus = 4
request_gpus = 4

the slots are reserved correctly, but openmpiscript ignores the CPU request and 
starts 2 MPI processes in total rather than 4 on each node used.

If I just list the hosts 4 times
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' > machines
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
and use 
mpirun ... -n 8  -hostfile machines ...

the a.out processes are started, 4 on each machine, but all 4 processes are bound 
to the same core.
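
For reference, the four identical sort commands can be collapsed into one
helper. This is only a sketch of the workaround described above, not a fix for
the core binding: make_hostfile is a hypothetical name, and the sort/awk field
selection is copied unchanged from the commands above. The only difference is
that the repeated entries come out grouped per host instead of interleaved.

```shell
#!/bin/sh
# Hypothetical helper (not part of openmpiscript): print every entry of the
# contact file `per_node` times, using the same field selection as above.
make_hostfile() {
    contact_file=$1   # path to the contact file, e.g. $CONDOR_CONTACT_FILE
    per_node=$2       # number of MPI processes wanted per machine
    sort -n -k 1 < "$contact_file" |
        awk -v n="$per_node" '{ for (i = 0; i < n; i++) print $1 }'
}

# Intended use inside the wrapper script, for 4 processes per machine:
#   make_hostfile "$CONDOR_CONTACT_FILE" 4 > machines
```

With machine_count = 2 this yields 8 lines in machines, matching the
"-n 8" passed to mpirun above.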

How can I make sure that 4 a.out processes run on each machine and use 4 cores 
in total, or even more if each of them uses threads?