
Re: [HTCondor-users] openmpi jobs with condor [on debian] (using infiniband and new options in htcondor 8.6.1)



export OMPI_MCA_btl_tcp_if_exclude="lo,eth0,eth1"

Based on my reading of the Open MPI wiki, disabling network interfaces should be unnecessary if Open MPI detects that you're using InfiniBand or a similar HPC communication interface. I am curious: if you have the time and capacity to run a test, could you tell us what happens if you simply comment out this line?

For reference, with this variable not defined on machines without InfiniBand (or similar), I found that the MPI proxy processes would stall at nearly 100% CPU usage. With the increased verbosity of the mpirun output, I used condor_ssh_to_job to connect to node 0 and saw in _condor_stdout that Open MPI was stuck looking for IPs on the wrong network interfaces.

Jason Patton


On Thu, Mar 16, 2017 at 1:43 PM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx> wrote:
Hello all,

I copied the general part to the end of this mail so that everything is in one
place. Here is what I found out about getting openmpi running:

Howto openmpi with htcondor (using infiniband and new options in openmpiscript 8.6.1)
For debian 7 and debian 8 we need one additional line in /etc/init.d/condor to
allow locking all available memory
(locked-in-memory address space unlimited):

ulimit -l unlimited

Without this line, the values in /etc/security/limits.conf are ignored and
the default value of 64k is used.
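To check whether the limit actually reaches condor jobs, a small bash function like the one below could be run inside a vanilla universe job. It is only a sketch of our own: check_memlock is a hypothetical name (not part of htcondor or openmpi), and it simply reads the limit reported by the bash builtin `ulimit -l`.

```shell
#!/bin/bash
# check_memlock is a hypothetical helper (not part of htcondor or openmpi).
# It prints the locked-memory limit of the current process; run inside a job,
# this is the limit the condor daemons passed on to the job.
check_memlock() {
  local limit
  limit=$(ulimit -l)   # bash reports this limit in kB, or "unlimited"
  if [ "$limit" = "unlimited" ]; then
    echo "memlock: unlimited"
    return 0
  fi
  echo "memlock: ${limit} kB (not unlimited)"
  return 1
}

if ! check_memlock; then
  echo "hint: add 'ulimit -l unlimited' to /etc/init.d/condor and restart condor" >&2
fi
```

If a job prints a number such as 64 instead of "unlimited", the init script change has not taken effect yet.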

Requirements:
- You have an infiniband-aware openmpi installation.
- You allow users an unlimited locked-in-memory address space.
 For this, set the following in /etc/security/limits.conf:
*        soft  memlock    unlimited
*        hard  memlock    unlimited
- Set the MTT (memory translation table) size large enough.
Most likely this is done correctly if you install the Mellanox OFED package;
if you use the kernel drivers, see

https://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

On debian we can set only one parameter, so we use in /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="mlx4_core.log_mtts_per_seg=7"
(run update-grub and reboot afterwards), which should be sufficient for 256GB of memory.
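The FAQ linked above gives the registerable memory as max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE. The sketch below just evaluates that formula in bash. max_reg_mem is our own helper name, and log_num_mtt=20 is an assumption for illustration, not a value read from the driver; check the actual mlx4_core parameters on your nodes (for example with modinfo mlx4_core).

```shell
#!/bin/bash
# Registerable memory per the Open MPI FAQ:
#   max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE
# max_reg_mem is our own helper name. log_num_mtt=20 below is an ASSUMPTION
# for illustration; check the real mlx4_core values on your nodes.
max_reg_mem() {
  local log_num_mtt=$1 log_mtts_per_seg=$2 page_size=$3
  echo $(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size ))
}

page_size=$(getconf PAGESIZE)
bytes=$(max_reg_mem 20 7 "$page_size")
echo "max registerable memory: $(( bytes >> 30 )) GiB"
```

With 4 KiB pages, log_num_mtt=20 and log_mtts_per_seg=7 give 2^(20+7+12) = 2^39 bytes = 512 GiB, i.e. twice 256 GiB.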

----

New in the openmpiscript of htcondor 8.6.1: before mpirun is started, the
following environment variables are set:
# set MCA values for running on HTCondor
    export OMPI_MCA_plm_rsh_no_tree_spawn="true" # disable ssh tree spawn
    export OMPI_MCA_btl_tcp_if_exclude="lo,$EXINT" # exclude network interfaces

    # optionally set MCA values for increasing mpirun verbosity
    #export OMPI_MCA_plm_base_verbose=30
    #export OMPI_MCA_btl_base_verbose=30

Because we are still using htcondor 8.4.x, we now use:
    export OMPI_MCA_plm_rsh_no_tree_spawn="true" # disable ssh tree spawn
    export OMPI_MCA_btl_tcp_if_exclude="lo,eth0,eth1" # exclude network interfaces
    export OMPI_MCA_plm_base_verbose=30
    export OMPI_MCA_btl_base_verbose=30
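Instead of hard-coding eth0,eth1 on every machine, the exclude list could be generated per node. The sketch below assumes that the interface Open MPI's TCP transport should use is called ib0 (a hypothetical IPoIB name; adjust it) and lists every other interface found under /sys/class/net. build_tcp_exclude is our own helper, not part of openmpiscript.

```shell
#!/bin/bash
# build_tcp_exclude is a hypothetical helper: it lists every network
# interface except KEEP_IF, the one Open MPI's TCP transport should use.
# lo is always excluded, since setting btl_tcp_if_exclude replaces the default.
build_tcp_exclude() {
  local keep=$1 dir=${2:-/sys/class/net}
  local list="lo" path ifname
  for path in "$dir"/*; do
    [ -e "$path" ] || continue        # no match: the glob stays literal
    ifname=${path##*/}
    if [ "$ifname" = "lo" ] || [ "$ifname" = "$keep" ]; then
      continue
    fi
    list="$list,$ifname"
  done
  echo "$list"
}

# ASSUMPTION: ib0 is the interface to keep on this node.
OMPI_MCA_btl_tcp_if_exclude=$(build_tcp_exclude ib0)
export OMPI_MCA_btl_tcp_if_exclude
```

lo stays in the list because setting btl_tcp_if_exclude replaces Open MPI's default exclude value, which normally covers the loopback interface.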

----

Problems: We still sometimes see MPI programs that keep running after
condor_rm, even with our additional signal handler. mpirun and openmpiscript
are always stopped, but the running program is not.


-----

Howto openmpi with htcondor (general part)

We use htcondor 8.4.x with debian 7 and debian 8, and openmpi 1.6.5 mostly
with debian 7; with debian 8 we have only tested a small openmpi example.
We use a common file system for all nodes; htcondor says it also works
without one (but is this really useful?).
Requirements:
- Set up your htcondor environment for parallel jobs (see manual section 2.9)
- A working openmpi installation (test it on a single node or in the vanilla universe [section 2.9.4])
- ssh client and server on each node.
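A quick way to check these per-node requirements before submitting is a small bash function run on each node, by hand or as a vanilla universe job. require_cmds is a hypothetical helper, and adding /usr/sbin to PATH is an assumption about where sshd is installed.

```shell
#!/bin/bash
# require_cmds is a hypothetical helper: print every missing command and
# return non-zero if any of them is missing.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# ASSUMPTION: sshd lives in /usr/sbin, which is often not in a job's PATH.
PATH=$PATH:/usr/sbin
require_cmds ssh sshd mpirun || echo "node ${HOSTNAME:-unknown} is not ready for openmpi jobs" >&2
```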

As I understand it, htcondor just claims the needed slots, prepares and starts
sshd on the allocated nodes, and then starts mpirun. This is done by the
openmpiscript (see section 2.9.3) and other scripts.
From htcondor 8.6.1 on, these scripts are improved, and condor variables can
be set that are used by openmpiscript. In earlier versions one has to change
the openmpiscript directly.

What I have to do to get openmpi running?
Change the openmpiscript:
1. openmpiscript is a bash script, so make sure that it is run by bash, not sh.
For example, use
#!/bin/bash
not
#!/bin/sh
debian uses dash as its system shell (/bin/sh), which is not fully bash compatible.
My suggestion is that condor use bash explicitly for all scripts, at least for
those that are not fully Bourne shell compatible and therefore need bash, not sh.
Is there any system where bash could not be installed as /bin/bash?

2. Change MPDIR to the installation prefix of your openmpi.
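The two changes could look like this at the top of a locally modified openmpiscript. This is a sketch: the re-exec guard is our own addition (it helps when the script is started as `sh openmpiscript`, where the #! line is ignored), the MPDIR path is a placeholder, and we assume the script looks for mpirun under $MPDIR/bin.

```shell
#!/bin/bash
# Step 1: make sure bash runs this script. The #! line is ignored when the
# script is started as "sh openmpiscript" (dash on debian), so re-exec with
# bash in that case. This guard is our own addition.
if [ -z "${BASH_VERSION:-}" ]; then
  exec /bin/bash "$0" "$@"
fi

# Step 2: prefix of the openmpi installation (the directory containing
# bin/mpirun). /opt/openmpi-1.6.5 is only a placeholder; we ASSUME the
# script looks for mpirun in $MPDIR/bin.
MPDIR=${MPDIR:-/opt/openmpi-1.6.5}
if [ ! -x "$MPDIR/bin/mpirun" ]; then
  echo "warning: $MPDIR/bin/mpirun not found, check MPDIR" >&2
fi
```

The guard uses only POSIX syntax, so dash can execute it before any bash-only code is reached.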

Take into account:
The scripts run into problems if you give a path for your program in the
argument to openmpiscript. Therefore, put the program into your working
directory and submit from there.
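A minimal submit description following this advice could be written by the bash function below. It is a sketch assuming the shared file system mentioned above and a program (here the hypothetical mpi_hello) that sits in the submit directory; write_submit_file is our own name, and it refuses program arguments that contain a path.

```shell
#!/bin/bash
# write_submit_file PROGRAM NODES FILE  (hypothetical helper)
# Writes a minimal parallel universe submit description for openmpiscript.
# ASSUMES a shared file system and PROGRAM in the submit directory.
write_submit_file() {
  local prog=$1 nodes=$2 file=$3
  case $prog in
    */*) echo "error: pass '$prog' without a path; copy it into $PWD" >&2
         return 1 ;;
  esac
  # \$(NODE) is escaped so condor, not bash, expands it per node.
  cat > "$file" <<EOF
universe = parallel
executable = openmpiscript
arguments = $prog
machine_count = $nodes
output = out.\$(NODE)
error = err.\$(NODE)
log = mpi.log
queue
EOF
}
```

For example, `write_submit_file mpi_hello 4 mpi_hello.sub` followed by `condor_submit mpi_hello.sub`. $(NODE) expands to each node's number, so every node gets its own output and error file.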


Improvements:
We have often seen that after condor_rm the MPI processes were still running,
although the parallel job had been removed from condor.
Following the condor philosophy that mpirun has to do the job, we start mpirun
in the background and wait for this process. This allows us to install a
signal handler with trap, which sends a TERM signal to mpirun when
openmpiscript receives the TERM signal.
With this signal handler we have never seen the problem above. I do not know
whether and why the condor team thinks this is not necessary, but at least it
works for us (in most cases).
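The signal handling described above can be sketched in bash as follows. This is our own minimal version, not code from the shipped openmpiscript; run_mpirun and forward_term are hypothetical names, and the real mpirun command line would be passed as the arguments.

```shell
#!/bin/bash
# Sketch only: run_mpirun and forward_term are hypothetical names.
mpirun_pid=""

# On TERM (e.g. caused by condor_rm), pass the signal on to mpirun, which is
# then responsible for stopping the ranks on the other nodes.
forward_term() {
  if [ -n "$mpirun_pid" ]; then
    kill -TERM "$mpirun_pid" 2>/dev/null || true
  fi
}
trap forward_term TERM

run_mpirun() {
  "$@" &                     # start mpirun in the background
  mpirun_pid=$!
  local rc=0
  wait "$mpirun_pid" || rc=$?
  # A trapped TERM makes wait return early (status > 128) while mpirun may
  # still be shutting down; keep waiting as long as it is alive.
  while kill -0 "$mpirun_pid" 2>/dev/null; do
    rc=0
    wait "$mpirun_pid" || rc=$?
  done
  return "$rc"
}

# Usage (hypothetical command line): run_mpirun mpirun -np 4 ./mpi_hello
```

Waiting on a background job is what makes the trap effective: bash runs a trap only between commands, but the wait builtin is interrupted as soon as a trapped signal arrives. The handler only reaches mpirun itself; stopping the remote ranks is still up to mpirun.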

Best regards
Harald
