
[HTCondor-users] openmpi jobs with condor [on debian] (using infiniband and new options in htcondor 8.6.1)



Hello all,

I copied the general part to the end of this mail so that everything is in
one place. Here is what I found out to get OpenMPI running:

Howto: OpenMPI with HTCondor (using InfiniBand and new options in the
openmpiscript of HTCondor 8.6.1)
For Debian 7 and Debian 8 we need one additional line in /etc/init.d/condor
to allow locking all available memory (unlimited locked-in-memory address
space):

ulimit -l unlimited 

Without this line the values in /etc/security/limits.conf are ignored and
the default value of 64 kB is used.
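
To check that the limit really reaches the daemons, you can look at the
limits of the running condor_master (a quick check, assuming condor was
started via the init script):

grep "Max locked memory" /proc/$(pidof condor_master)/limits

Both the soft and the hard limit should now show "unlimited".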

Requirements:
- You have an InfiniBand-aware OpenMPI installation.
- You allow the users an unlimited locked-in-memory address space.
  For this we need to set in /etc/security/limits.conf:
*                soft    memlock         unlimited
*                hard    memlock         unlimited
- Set the MTT (memory translation table) size big enough.
Most likely this will be done correctly if you install the Mellanox OFED
package; if you use the in-kernel drivers, see

https://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

For Debian we can only set one parameter; use in /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="mlx4_core.log_mtts_per_seg=7"
which should be good for 256 GB of memory (see the sketch below on how to
apply and verify the setting).
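
A minimal sketch of applying and verifying the parameter on Debian (the sysfs
path is the generic location for kernel module parameters):

update-grub   # regenerate the grub configuration, then reboot

# after the reboot, check that the parameter is active (expect: 7)
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg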

----

New in the openmpiscript of HTCondor 8.6.1:
Before mpirun is started, the following environment variables are set:

        # set MCA values for running on HTCondor
        export OMPI_MCA_plm_rsh_no_tree_spawn="true"   # disable ssh tree spawn
        export OMPI_MCA_btl_tcp_if_exclude="lo,$EXINT" # exclude network interfaces

        # optionally set MCA values for increasing mpirun verbosity
        #export OMPI_MCA_plm_base_verbose=30
        #export OMPI_MCA_btl_base_verbose=30

Because we are still using HTCondor 8.4.x, we currently use:

        export OMPI_MCA_plm_rsh_no_tree_spawn="true"      # disable ssh tree spawn
        export OMPI_MCA_btl_tcp_if_exclude="lo,eth0,eth1" # exclude network interfaces
        export OMPI_MCA_plm_base_verbose=30
        export OMPI_MCA_btl_base_verbose=30
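
For completeness: Open MPI reads the OMPI_MCA_* variables from the
environment, so the same settings could also be given on the mpirun command
line (an equivalent sketch, not taken from the scripts):

mpirun --mca plm_rsh_no_tree_spawn true --mca btl_tcp_if_exclude lo,eth0,eth1 ...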

----

Problems: We still sometimes see that some MPI programs keep running after
condor_rm, even with our additional signal handler; mpirun and openmpiscript
are always stopped, but not the running program.


-----

Howto: OpenMPI with HTCondor (general part)

We use HTCondor 8.4.x with Debian 7 and Debian 8, and we use OpenMPI 1.6.5
mostly with Debian 7; with Debian 8 we have only tested a small OpenMPI
example.
We use a common file system for all nodes; HTCondor claims it also works
without one (but is this really useful?).
Requirements:
- Setup your htcondor environment for parallel jobs (see manual section 2.9)
- A running OpenMPI installation (test it on a single node or in the vanilla
universe [section 2.9.4])
- ssh client and server on each node.

In my understanding, HTCondor just claims the needed slots, prepares and
starts the sshd on the claimed nodes, and then just starts mpirun. This is
done by the openmpiscript (see section 2.9.3) and other scripts.
From HTCondor 8.6.1 on, these scripts are improved and condor variables can
be set which are used by openmpiscript. In earlier versions one has to
change the openmpiscript directly.
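
For illustration, a minimal parallel universe submit file along the lines of
the manual's example (program name and machine count are placeholders):

universe                = parallel
executable              = openmpiscript
arguments               = mpitest
machine_count           = 4
should_transfer_files   = yes
transfer_input_files    = mpitest
when_to_transfer_output = on_exit
output = out.$(NODE)
error  = err.$(NODE)
log    = mpi.log
queue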

What do I have to do to get OpenMPI running?
Change the openmpiscript:
1. The openmpiscript is a bash script, therefore make sure that bash, not sh,
is used; for example use
#!/bin/bash
not
#!/bin/sh
Debian often uses dash as the system shell, which is not fully bash
compatible (see the check below).
My suggestion is that condor uses bash explicitly for all its scripts, at
least if they are not fully Bourne shell compatible and therefore need bash,
not sh. Is there any system where bash could not be installed under
/bin/bash?
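
A quick way to check which shell /bin/sh points to (on Debian it is usually
dash):

readlink -f /bin/sh   # typically prints /bin/dash on Debian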

2. Change MPDIR to the prefix directory of your OpenMPI installation.
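
For example (the path is only an illustration; point MPDIR at your actual
OpenMPI prefix, so that mpirun is found under $MPDIR/bin):

MPDIR=/usr/lib/openmpi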

Take into account:
The scripts will run into problems if you add a path to your program (in the
arguments for openmpiscript). Therefore put the program into your working
directory and submit from there, as in the fragment below.
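
For example, in the submit file (mpitest is a placeholder name):

# works: the program lies in the working directory
arguments = mpitest

# problematic: a path in front of the program confuses the scripts
arguments = bin/mpitest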


Improvements:
We have often seen that after condor_rm the MPI processes were still running
although the parallel job had been removed from condor.
Following the philosophy of condor that mpirun has to do the job, we start
mpirun in the background and wait for this process. This allows us to
install a signal handler with trap, which sends a TERM signal to mpirun when
the openmpiscript gets the TERM signal.
With this signal handler we have never seen the problem above. I do not know
if and why the condor team thinks this is not necessary, but at least it
works for us (in most cases).
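
A minimal sketch of the idea (variable names and the mpirun invocation are
simplified, not the exact code of our modified openmpiscript):

"$MPDIR"/bin/mpirun -machinefile machines "$@" &
mpirun_pid=$!

# when condor_rm makes HTCondor send TERM to openmpiscript,
# forward it to mpirun instead of exiting immediately
got_term=""
trap 'got_term=1; kill -TERM "$mpirun_pid" 2>/dev/null' TERM

# wait is interrupted when the trapped TERM arrives; wait a second
# time so the script only exits after mpirun has really terminated
wait "$mpirun_pid"
status=$?
if [ -n "$got_term" ]; then
    wait "$mpirun_pid"
    status=$?
fi
exit "$status"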

Best regards
Harald