
Re: [HTCondor-users] openmpi jobs with condor [on debian] (general part)



Harald,

Thank you so much for sending along your experience!

I will improve the #! line for the next 8.4 release.

We never ran into the problem of processes lingering after a condor_rm when testing openmpiscript, but if you want to pass along your signal handling code, I will include that in the next release, too.

Jason Patton

On Thu, Mar 9, 2017 at 11:12 AM, Harald van Pee <pee@xxxxxxxxxxxxxxxxx> wrote:
Hello all,

Here is how I managed to get openmpi running in the parallel universe.
The InfiniBand part will follow.

Howto openmpi with htcondor (general part)

We use htcondor 8.4.x on debian 7 and debian 8. We mostly use openmpi 1.6.5
on debian 7; on debian 8 we have only tested a small openmpi example.
We use a common file system for all nodes. htcondor claims it also works
without one (but is this really useful?).
Requirements:
- Set up your htcondor environment for parallel jobs (see manual section 2.9)
- A working openmpi (test it on a single node or in the vanilla universe
  [section 2.9.4])
- ssh client and server on each node.

As I understand it, htcondor just claims the needed slots, prepares and
starts sshd on the nodes running the job, and then starts mpirun. This is
done by the openmpiscript (see section 2.9.3) and other scripts.
From htcondor 8.6.1 on, these scripts are improved and condor variables can
be set which are used by openmpiscript. In earlier versions one has to change
the openmpiscript directly.

What did I have to do to get openmpi running?
Change the openmpiscript:
1. The openmpiscript is a bash script, so make sure that it is run by bash,
not sh.
For example use
#!/bin/bash
not
#!/bin/sh
debian usually uses dash as the system shell (/bin/sh), which is not fully
bash compatible.
My suggestion is that condor use bash explicitly for all scripts, at least
for those that are not fully Bourne shell compatible and therefore need bash,
not sh.
Is there any system where bash could not be installed under /bin/bash?
2. change MPDIR to the prefix dir of your openmpi
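
To check point 1 on a given node, a small helper along the following lines
can report which interpreter a script's first line asks for. This is only a
sketch: the function name script_interpreter is made up for this example and
is not part of htcondor, and the path in the usage comment is a placeholder
for wherever your copy of openmpiscript lives.

```shell
#!/bin/bash
# Print the interpreter path named on the first line of a script,
# or "none" if the file does not start with "#!".
# (Hypothetical helper, not shipped with htcondor.)
script_interpreter() {
    local first_line=""
    # read returns non-zero on a file without a trailing newline,
    # but still fills the variable, so ignore its status.
    IFS= read -r first_line < "$1" || true
    case "$first_line" in
        '#!'*)
            # Strip the "#!" marker.
            first_line="${first_line#'#!'}"
            # Strip any whitespace between "#!" and the path.
            first_line="${first_line#"${first_line%%[![:space:]]*}"}"
            # Drop interpreter arguments such as "-e".
            printf '%s\n' "${first_line%% *}"
            ;;
        *)
            printf 'none\n'
            ;;
    esac
}

# Usage (placeholder path):
#   script_interpreter ./openmpiscript
# To see what /bin/sh really is on this machine (GNU readlink):
#   readlink -f /bin/sh
```

If the helper prints /bin/sh and readlink shows that /bin/sh resolves to
dash, the script is being run by a shell that is not bash.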

Take into account:
The scripts will run into problems if you give your program with a path (in
the argument for openmpiscript).
Therefore put the program into your working directory and submit from there.
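
As an illustration of submitting with a bare program name, the sketch below
prints a minimal parallel universe submit description. It is not a file from
our setup: the function name make_submit_description, the program name
mpi_hello and the output, error and log file names are all placeholders, and
it assumes your modified copy of openmpiscript sits in the submit directory
next to the program.

```shell
#!/bin/bash
# Print a minimal parallel universe submit description on stdout.
#   $1: program name (a bare file name in the submit directory)
#   $2: number of slots to request
# (Hypothetical helper; all file names below are placeholders.)
make_submit_description() {
    local program=$1 slots=$2
    # Refuse anything that contains a path component.
    case "$program" in
        */*)
            echo "program must be a bare file name, not a path: $program" >&2
            return 1
            ;;
    esac
    # $(NODE) is left literal here; condor expands it per node.
    printf '%s\n' \
        'universe      = parallel' \
        'executable    = openmpiscript' \
        "arguments     = $program" \
        "machine_count = $slots" \
        'output        = out.$(NODE)' \
        'error         = err.$(NODE)' \
        'log           = job.log' \
        'queue'
}

# Usage, run from the directory that holds the program:
#   make_submit_description mpi_hello 4 > mpi_hello.sub
#   condor_submit mpi_hello.sub
```

The check on "/" only enforces the advice above; without a shared file
system you would additionally need condor's file transfer settings, which
are left out here.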


Improvements:
We have often seen that after condor_rm the mpi processes were still running
although the parallel job had been removed from condor.
Following the philosophy of condor that mpirun has to do the job, we start
mpirun in the background and wait for this process. This allows us to install
a signal handler with trap, which sends a TERM signal to mpirun when the
openmpiscript receives a TERM signal.
With this signal handler we have never seen the problem above. I do not know
whether or why the condor team thinks this is not necessary, but at least it
works for us.
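
The background-and-wait pattern described above can be sketched as follows.
This is a generic illustration, not the code from the modified openmpiscript:
the function name forward_term and the variables child_pid and got_term are
made up for this example, and the mpirun arguments are deliberately not
shown because they depend on your openmpiscript version.

```shell
#!/bin/bash
# Run a command in the background, forward a TERM signal to it,
# and return the command's exit status.
# (Hypothetical helper; names are not taken from openmpiscript.)
forward_term() {
    "$@" &                  # start the command in the background
    child_pid=$!
    got_term=0
    # On TERM, remember that it happened and pass it on to the child.
    trap 'got_term=1; kill -TERM "$child_pid" 2>/dev/null' TERM
    # wait returns early (status > 128) when a trapped signal arrives.
    wait "$child_pid"
    local status=$?
    if [ "$got_term" -eq 1 ]; then
        # The first wait was interrupted by the trap; wait again so the
        # child is really gone before this function returns.
        wait "$child_pid"
        status=$?
    fi
    trap - TERM             # restore the default TERM handling
    return "$status"
}

# Usage inside a wrapper script, in place of calling mpirun directly:
#   forward_term mpirun <the arguments the script already passes>
```

The command has to run in the background because bash only runs a trap
handler after the current foreground command has finished, whereas the wait
builtin is interrupted as soon as the signal arrives.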

Best regards
Harald