
[HTCondor-users] Issue with OpenMpi Jobs



Hi,
we are trying to run a basic OpenMPI program under the parallel universe in our test pool, and we are getting the following errors:
******************************************************
Error File:

/var/lib/condor/execute/dir_5148/condor_exec.exe: 125: [: Illegal number: mpi_hello
/var/lib/condor/execute/dir_5148/condor_exec.exe: 69: [: Illegal number: mpi_hello
cat: /var/lib/condor/execute/dir_5148/contact: No such file or directory
/var/lib/condor/execute/dir_5148/condor_exec.exe: 91: /var/lib/condor/execute/dir_5148/condor_exec.exe: cannot open /var/lib/condor/execute/dir_5148/contact: No such file
----------------------------------------------------------------------------
Open MPI has detected that a parameter given to a command line
option does not match the expected format:

 Option: n
 Param: -hostfile

This is frequently caused by omitting to provide the parameter
to an option that requires one. Please check the command line and try again.
----------------------------------------------------------------------------

******************************************************
Output File:

Contact File: /var/lib/condor/execute/dir_5148/contact
Machines
(Here we should see at least 1 hostname, but it's empty.)
ELSE
******************************************************
SubmitFile:
should_transfer_files = yes
transfer_input_files=mpi_hello
when_to_transfer_output = on_exit_or_evict

universe = parallel
executable = openmpiscript
getenv=true
arguments = mpi_hello

output = MpiOut_$(Cluster)-$(NODE)
error = MpiErr_$(Cluster)-$(NODE)
log = MpiLog.txt

notification = never
machine_count = 1
queue

******************************************************
openmpiscript (from the example):
#!/bin/sh
MPDIR=/usr/lib/openmpi
if `uname -m | grep "64" 1>/dev/null 2>&1`
then
    MPDIR=/usr/lib64/openmpi
fi
PATH=$MPDIR/lib:$MPDIR/1.4-gcc/bin:.:$PATH
export PATH

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS
CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh

SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh

. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS

# If not the head node, just sleep forever, to let the sshds run
if [ $_CONDOR_PROCNO -ne 0 ]
then
    wait
    sshd_cleanup
    exit 0
fi
EXECUTABLE=$1
shift
chmod +x $EXECUTABLE

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE
# Added for Debug
echo "Contact File: ${CONDOR_CONTACT_FILE}"
cat ${CONDOR_CONTACT_FILE}
# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
# Added for Debug
echo "Machines"
cat machines
## run the actual mpijob
if `ompi_info --param all all | grep orte_rsh_agent 1>/dev/null 2>&1`
then
    echo "IF" # Added for Debug
    mpirun -v --prefix $MPDIR --mca orte_rsh_agent $CONDOR_SSH -n $_CONDOR_NPROCS -hostfile machines $EXECUTABLE $@
else
    ########## For mpi versions 1.1 & 1.2 use the line below
    echo "ELSE" # Added for Debug
    mpirun -v --mca plm_rsh_agent $CONDOR_SSH -n $_CONDOR_NPROCS -hostfile machines $EXECUTABLE $@
fi
sshd_cleanup
rm -f machines
exit $?
******************************************************
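Side note: since the error shows "cat: .../contact: No such file or directory", the script just keeps going when the contact file was never written. A small guard like the one below, placed right after CONDOR_CONTACT_FILE is exported, would at least make that failure explicit (this is only a sketch, not part of the stock openmpiscript):

# Sketch only, not in the stock openmpiscript: abort early (and clean up
# the per-node sshds) if sshd.sh never produced the contact file.
if [ ! -s "$CONDOR_CONTACT_FILE" ]
then
    echo "Contact file missing or empty: $CONDOR_CONTACT_FILE" >&2
    sshd_cleanup
    exit 1
fi
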
After reading the docs and the error, we checked the sshd.sh file on the worker node and found this at line 125:
if [ $_CONDOR_PROCNO -eq 0 ]
Line 113 has this:
echo "$_CONDOR_PROCNO $hostname $PORT $user $currentDir $thisrun"Â |
ÂÂÂÂÂÂÂ $CONDOR_CHIRP put -mode cwa - $_CONDOR_REMOTE_SPOOL_DIR/contact

To check the output we changed it to this:
echo "$_CONDOR_PROCNO N $_CONDOR_NPROCS $hostname $PORT $user $currentDir $thisrun"Â |
ÂÂÂÂÂÂÂ $CONDOR_CHIRP put -mode cwa - $_CONDOR_REMOTE_SPOOL_DIR/contact

And got this output in the contact file:
mpi_hello N  uvcluster-01.cloud.univalle.edu.co 4444 edza /var/lib/condor/execute/dir_5148 1470260298

So $_CONDOR_PROCNO is not a number but the executable's name, and $_CONDOR_NPROCS is empty.
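
To double-check where those values come from, a few debug lines like these could be added near the top of openmpiscript, before sshd.sh is sourced (just a debugging sketch, not part of the stock script):

# Debug sketch: show what the starter actually hands to the script,
# before sshd.sh gets sourced.
echo "DEBUG: _CONDOR_PROCNO='${_CONDOR_PROCNO}'" >&2
echo "DEBUG: _CONDOR_NPROCS='${_CONDOR_NPROCS}'" >&2
echo "DEBUG: script arguments: $*" >&2

If both variables come out empty there, the problem would be upstream of sshd.sh (they are not being set for the job at all) rather than in the scripts themselves.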

Can anyone help us solve this issue? Any ideas?

Thank you very much.

--
Edier Alberto Zapata Hernández
Infrastructure Support Engineer
CIER - Sur