
Re: [HTCondor-users] multicore and multinode run



OK, I did that, but there seems to be a problem.
/home is shared among the nodes.


[mahmood@rocks7 ~]$ which mpirun
/opt/openmpi/bin/mpirun
[mahmood@rocks7 ~]$ grep MPDIR openmpiscript
# $MPDIR points to the location of the OpenMPI install
MPDIR=/opt/openmpi
MPDIR=$(condor_config_val OPENMPI_INSTALL_PATH)
# If MPDIR is not set, then use a default value
if [ -z $MPDIR ]; then
    echo "WARNING: Using default value for \$MPDIR in openmpiscript"
    MPDIR=/usr/lib64/openmpi
PATH=$MPDIR/bin:.:$PATH
        mpirun -v --prefix $MPDIR --mca $mca_ssh_agent $CONDOR_SSH -n $_CONDOR_NPROCS -hostfile machines $EXECUTABLE $@ &
[mahmood@rocks7 ~]$ cat mpi.ht
universe = parallel
executable = openmpiscript
arguments = mpihello
log = hellompi.log
output = hellompi.out
error = hellompi.err
machine_count = 2
queue
[mahmood@rocks7 ~]$ condor_submit mpi.ht
Submitting job(s).
1 job(s) submitted to cluster 13.
[mahmood@rocks7 ~]$ cat hellompi.err
Not defined: MOUNT_UNDER_SCRATCH
Not defined: MOUNT_UNDER_SCRATCH
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[compute-0-1.local:9520] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[compute-0-1.local:9521] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[compute-0-1.local:09228] [[38005,0],0]->[[38005,0],2] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 15]
[compute-0-1.local:09228] [[38005,0],0]-[[38005,0],2] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 15
[mahmood@rocks7 ~]$ cat hellompi.out
WARNING: MOUNT_UNDER_SCRATCH not set in condor_config
WARNING: MOUNT_UNDER_SCRATCH not set in condor_config
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[38005,1],0]
  Exit code:    1
--------------------------------------------------------------------------
[mahmood@rocks7 ~]$





Regards,
Mahmood


On Wednesday, January 24, 2018, 1:03:35 PM EST, Jason Patton <jpatton@xxxxxxxxxxx> wrote:


The scripts themselves are a bit involved, but unless your job is
very complicated, they should hopefully not be difficult to use. For
example, you would change your submit file from:

universe = parallel
executable = /opt/openmpi/bin/mpirun
arguments = mpihello
log = hellompi.log
output = hellompi.out
error = hellompi.err
machine_count = 2
queue

to

universe = parallel
executable = openmpiscript
arguments = mpihello
log = hellompi.log
output = hellompi.out
error = hellompi.err
machine_count = 2
queue


I'm guessing your condor pool uses a shared filesystem? If not, you
may need to transfer your mpihello program, too:
transfer_input_files = mpihello
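
For instance, a submit file with file transfer turned on might look
something like this (just a sketch; I'm assuming mpihello sits in your
submit directory and using HTCondor's standard file-transfer commands):

universe = parallel
executable = openmpiscript
arguments = mpihello
transfer_input_files = mpihello
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log = hellompi.log
output = hellompi.out
error = hellompi.err
machine_count = 2
queue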

You should copy openmpiscript from the examples directory to the
directory that you're submitting the job from. The only change within
openmpiscript would be to point it to your Open MPI install directory,
so set MPDIR=/opt/openmpi.
(Alternatively, you can add OPENMPI_INSTALL_PATH=/opt/openmpi to your
execute machines' condor configs; openmpiscript reads that value by
default.)
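
Concretely, the two options would look something like this (a sketch;
the path assumes Open MPI is installed under /opt/openmpi on every
execute node):

# Option 1: in your copy of openmpiscript, point MPDIR at your install
MPDIR=/opt/openmpi

# Option 2: instead of editing the script, add this knob to the execute
# machines' condor configuration (e.g. a file under /etc/condor/config.d/)
# and leave openmpiscript's default alone
OPENMPI_INSTALL_PATH = /opt/openmpi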

Jason