
Re: [HTCondor-users] multicore and multinode run



Two things come to mind to try when debugging this. First, you may need to
comment out the second assignment of MPDIR in openmpiscript so it doesn't
pull from the condor config:

# MPDIR=$(condor_config_val OPENMPI_INSTALL_PATH)

(You might have an older version of Open MPI installed in the default
path causing problems?)
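
After that edit, the relevant block of openmpiscript should look roughly
like this (a sketch reconstructed from the grep output quoted below; the
closing fi and the surrounding lines are my guess, so check your copy):

# $MPDIR points to the location of the OpenMPI install
MPDIR=/opt/openmpi
# MPDIR=$(condor_config_val OPENMPI_INSTALL_PATH)

# If MPDIR is not set, then use a default value
if [ -z $MPDIR ]; then
    echo "WARNING: Using default value for \$MPDIR in openmpiscript"
    MPDIR=/usr/lib64/openmpi
fi

PATH=$MPDIR/bin:.:$PATH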


Second, if you still get the same error, try setting the following in the
condor config on the execute nodes:

MOUNT_UNDER_SCRATCH = /tmp

When developing these scripts, I ran into problems when two nodes of
the job landed on the same machine, and it was due to the executables
overwriting files in /tmp. Setting MOUNT_UNDER_SCRATCH = /tmp creates
a bind mount for the jobs so that when they try to write to /tmp, they
actually write to $_CONDOR_SCRATCH_DIR/tmp, so each job's /tmp
directory is isolated.
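
For example, on each execute node (assuming the local config lives in
/etc/condor/condor_config.local; the path may differ on your Rocks nodes):

# in the execute node's local condor config
MOUNT_UNDER_SCRATCH = /tmp

# then have the running daemons pick up the change
condor_reconfig

If you're not sure the reconfig took effect, restarting the condor_startd
on the node should also pick it up.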

Jason

On Wed, Jan 24, 2018 at 12:22 PM, Mahmood Naderan <nt_mahmood@xxxxxxxxx> wrote:
> OK I did that but there seems to be a problem.
> /home is shared among nodes.
>
>
> [mahmood@rocks7 ~]$ which mpirun
> /opt/openmpi/bin/mpirun
> [mahmood@rocks7 ~]$ grep MPDIR openmpiscript
> # $MPDIR points to the location of the OpenMPI install
> MPDIR=/opt/openmpi
> MPDIR=$(condor_config_val OPENMPI_INSTALL_PATH)
> # If MPDIR is not set, then use a default value
> if [ -z $MPDIR ]; then
>     echo "WARNING: Using default value for \$MPDIR in openmpiscript"
>     MPDIR=/usr/lib64/openmpi
> PATH=$MPDIR/bin:.:$PATH
>         mpirun -v --prefix $MPDIR --mca $mca_ssh_agent $CONDOR_SSH -n
> $_CONDOR_NPROCS -hostfile machines $EXECUTABLE $@ &
> [mahmood@rocks7 ~]$ cat mpi.ht
> universe = parallel
> executable = openmpiscript
> arguments = mpihello
> log = hellompi.log
> output = hellompi.out
> error = hellompi.err
> machine_count = 2
> queue
> [mahmood@rocks7 ~]$ condor_submit mpi.ht
> Submitting job(s).
> 1 job(s) submitted to cluster 13.
> [mahmood@rocks7 ~]$ cat hellompi.err
> Not defined: MOUNT_UNDER_SCRATCH
> Not defined: MOUNT_UNDER_SCRATCH
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [compute-0-1.local:9520] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [compute-0-1.local:9521] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
> [compute-0-1.local:09228] [[38005,0],0]->[[38005,0],2]
> mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 15]
> [compute-0-1.local:09228] [[38005,0],0]-[[38005,0],2]
> mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 15
> [mahmood@rocks7 ~]$ cat hellompi.out
> WARNING: MOUNT_UNDER_SCRATCH not set in condor_config
> WARNING: MOUNT_UNDER_SCRATCH not set in condor_config
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "(null)" (-43) instead of "Success" (0)
> --------------------------------------------------------------------------
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "(null)" (-43) instead of "Success" (0)
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus
> causing
> the job to be terminated. The first process to do so was:
>
>   Process name: [[38005,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
> [mahmood@rocks7 ~]$
>
>
>
>
>
> Regards,
> Mahmood
>
>
> On Wednesday, January 24, 2018, 1:03:35 PM EST, Jason Patton
> <jpatton@xxxxxxxxxxx> wrote:
>
>
> The scripts themselves are a bit complicated, but unless your job is
> very complicated, they should hopefully not be difficult to use. For
> example, you would change your submit file from:
>
> universe = parallel
> executable = /opt/openmpi/bin/mpirun
> arguments = mpihello
> log = hellompi.log
> output = hellompi.out
> error = hellompi.err
> machine_count = 2
> queue
>
> to
>
> universe = parallel
> executable = openmpiscript
> arguments = mpihello
> log = hellompi.log
> output = hellompi.out
> error = hellompi.err
> machine_count = 2
> queue
>
>
> I'm guessing your condor pool uses a shared filesystem? If not, you
> may need to transfer your mpihello program, too:
> transfer_input_files = mpihello
>
> You should copy openmpiscript from the examples directory to the
> directory that you're submitting the job from. The only change within
> openmpiscript would be to point it to your Open MPI install directory,
> so set MPDIR=/opt/openmpi.
> (Alternatively, you can add OPENMPI_INSTALL_PATH=/opt/openmpi to your
> execute machines' condor configs; by default openmpiscript will read
> that setting from the config.)
>
> Jason
>