Here is how I managed to get openmpi running in the parallel universe.
The InfiniBand part will follow.
Howto openmpi with htcondor (general part)
We use htcondor 8.4.x with debian 7 and debian 8. We use openmpi 1.6.5 mostly
with debian 7;
with debian 8 we have only tested a small openmpi example.
We use a common file system for all nodes. htcondor claims it also works
without one (but is this really useful?).
- Set up your htcondor environment for parallel jobs (see manual section 2.9).
- A running openmpi (test it on a single node or in the vanilla universe).
- An ssh client and server on each node.
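A quick way to check the last two points on a node is a small helper like the one below. It is only a sketch: the function name check_tools is made up for this example, and the tool names in the commented call are assumptions about a typical installation (sshd often lives in /usr/sbin, which may not be in a normal user's PATH).

```shell
#!/bin/bash
# Report which of the given commands cannot be found in PATH.
# Prints one "missing: NAME" line per absent command and returns
# non-zero if at least one is missing.
check_tools() {
    local rc=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return $rc
}

# Possible call on an execute node (tool names are assumptions):
# check_tools bash ssh sshd mpirun
```

This only tells you that the binaries are found, not that openmpi itself works, so the single node test is still needed.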
In my understanding, htcondor just claims the needed slots, prepares and starts
sshd on the running nodes, and then just starts mpirun. This is done by the
openmpiscript (see section 2.9.3) and other scripts.
From htcondor 8.6.1 on, these scripts are improved and condor variables can be
set which are used by openmpiscript. In earlier versions one has to change the
openmpiscript itself.
What do I have to do to get openmpi running?
Change the openmpiscript:
1. The openmpiscript is a bash script, therefore make sure that it is run by
bash and not by sh. Debian, for example, often uses dash as system shell, which
is not fully bash compatible.
My suggestion is that condor use bash explicitly for all scripts that are not
fully Bourne shell compatible and therefore need bash, not sh.
Is there any system where bash could not be installed under /bin/bash?
2. Change MPDIR to the prefix dir of your openmpi.
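The two changes can be sketched as follows. This is not the real openmpiscript, only an illustration: the helper prefix_of and the path /opt/openmpi are hypothetical, so use the prefix of your own openmpi installation.

```shell
#!/bin/bash
# The first line above requests bash explicitly instead of /bin/sh.

# Derive an installation prefix from the full path of a binary,
# e.g. /opt/openmpi/bin/mpirun -> /opt/openmpi (hypothetical path).
prefix_of() {
    dirname "$(dirname "$1")"
}

# MPDIR is the variable named in the post; the value is an example only.
MPDIR=$(prefix_of /opt/openmpi/bin/mpirun)
echo "MPDIR=$MPDIR"
```

To see which shell /bin/sh really is on a node, readlink -f /bin/sh is usually enough.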
Take into account:
The scripts will run into problems if you give your program with a path (in the
arguments for openmpiscript).
Therefore put the program into your working directory and submit from there.
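A submit file that follows this advice could be generated like this. It is a minimal sketch under several assumptions: the names write_submit_file, job.sub and mpi_hello are made up, machine_count = 2 is arbitrary, a copy of openmpiscript is assumed to sit in the working directory, and the shared file system described above is assumed, so no file transfer settings are given.

```shell
#!/bin/bash
# Write a minimal parallel universe submit file into a directory.
# $1 = working directory, $2 = program name WITHOUT any path component.
write_submit_file() {
    cat > "$1/job.sub" <<EOF
universe      = parallel
executable    = openmpiscript
arguments     = $2
machine_count = 2
output        = out.\$(NODE)
error         = err.\$(NODE)
log           = job.log
queue
EOF
}

# Possible use, run from the directory that holds the program:
# write_submit_file . mpi_hello && condor_submit job.sub
```

The point is only that the argument is the bare program name and that condor_submit is called from the directory containing it.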
We have often seen that after condor_rm the mpi processes are still running
although the parallel job was removed from condor.
Following the philosophy of condor that mpirun has to do the job, we start
mpirun in the background and wait for this process. This allows us to install a
signal handler with trap, which sends a TERM signal to mpirun after the
openmpiscript gets the TERM signal.
With this signal handler we have never seen the problem above. I do not know
if and why the condor team thinks this is not necessary, but at least it works
for us.
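The idea of the signal handler can be sketched like this. It is not our exact change to openmpiscript, just the pattern: the function name run_with_term_forwarding and the variables child, got_term and status are made up for this example, and in the real script the existing mpirun command line would take the place of "$@".

```shell
#!/bin/bash
# Start a command in the background, forward a TERM signal to it,
# and return the command's exit status.
run_with_term_forwarding() {
    got_term=0
    "$@" &
    child=$!
    # When this shell receives TERM, pass it on to the background process.
    trap 'got_term=1; kill -TERM "$child" 2>/dev/null' TERM
    wait "$child"
    status=$?
    # wait returns early when the trap fires, so wait once more
    # until the child has really exited.
    if [ "$got_term" -eq 1 ]; then
        wait "$child" 2>/dev/null
        status=$?
    fi
    trap - TERM
    return $status
}

# Possible use (the mpirun arguments are placeholders):
# run_with_term_forwarding mpirun ...
```

Without the background start and the wait, the shell would only run the trap after mpirun has finished, which is too late for condor_rm.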