
Re: [HTCondor-users] Issue with OpenMpi Jobs



Hi all,
Can anyone help us with this problem?
We think the problem may be in how Condor creates the MPI job. We checked the contact file created on the scheduler and got this:

mpi_hello N uvcluster-01.cloud.univalle.edu.co 4444 edza /var/lib/condor/execute/dir_5148 1470260298

So $_CONDOR_PROCNO is not a number but the executable's name, and $_CONDOR_NPROCS is empty.
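For reference, a couple of debug lines like these at the very top of openmpiscript (before sshd.sh is sourced) would show what the script actually receives; this is only a sketch, reusing the variable names the stock script already uses:

# Debug only: print what HTCondor hands to openmpiscript before sshd.sh runs
echo "DEBUG _CONDOR_PROCNO='$_CONDOR_PROCNO'" >&2   # expected: the node number, e.g. 0
echo "DEBUG _CONDOR_NPROCS='$_CONDOR_NPROCS'" >&2   # expected: machine_count from the submit file
echo "DEBUG args='$*'" >&2                          # expected: mpi_hello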

Can anyone help us fix this?
By the way, when the jobs fail, a slot shows as in use on the execute nodes, but no tasks are reported.

Thank you.

On Wed, Aug 3, 2016 at 11:49 AM, Edier Zapata <edalzap@xxxxxxxxx> wrote:
Hi,
we are trying to run a basic OpenMPI program under the parallel universe in our test pool, and we are getting the following errors:
******************************************************
Error File:

/var/lib/condor/execute/dir_5148/condor_exec.exe: 125: [: Illegal number: mpi_hello
/var/lib/condor/execute/dir_5148/condor_exec.exe: 69: [: Illegal number: mpi_hello
cat: /var/lib/condor/execute/dir_5148/contact: No such file or directory
/var/lib/condor/execute/dir_5148/condor_exec.exe: 91: /var/lib/condor/execute/dir_5148/condor_exec.exe: cannot open /var/lib/condor/execute/dir_5148/contact: No such file
----------------------------------------------------------------------------
Open MPI has detected that a parameter given to a command line
option does not match the expected format:

 Option: n
 Param: -hostfile

This is frequently caused by omitting to provide the parameter
to an option that requires one. Please check the command line and try again.
----------------------------------------------------------------------------
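(As a side note, that "Option: n / Param: -hostfile" message is what we would expect if $_CONDOR_NPROCS expands to nothing, since openmpiscript passes it unquoted right before -hostfile; a minimal sketch, outside HTCondor:)

NPROCS=""    # stands in for an unset/empty $_CONDOR_NPROCS
echo mpirun -n $NPROCS -hostfile machines mpi_hello
# prints: mpirun -n -hostfile machines mpi_hello
# i.e. -hostfile would be read as the parameter of -n, matching the error above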

******************************************************
Output File:

Contact File: /var/lib/condor/execute/dir_5148/contact
Machines
Here we should see at least one hostname, but it's empty.
ELSE
******************************************************
Submit File:
should_transfer_files = yes
transfer_input_files=mpi_hello
when_to_transfer_output = on_exit_or_evict

universe = parallel
executable = openmpiscript
getenv=true
arguments = mpi_hello

output = MpiOut_$(Cluster)-$(NODE)
error = MpiErr_$(Cluster)-$(NODE)
log    = MpiLog.txt

notification = never
machine_count = 1
queue

******************************************************
openmpiscript (from the example):
#!/bin/sh
MPDIR=/usr/lib/openmpi
if `uname -m | grep "64" 1>/dev/null 2>&1`
then
    MPDIR=/usr/lib64/openmpi
fi
PATH=$MPDIR/lib:$MPDIR/1.4-gcc/bin:.:$PATH
export PATH

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS
CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh

SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh

. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS

# If not the head node, just sleep forever, to let the sshds run
if [ $_CONDOR_PROCNO -ne 0 ]
then
    wait
    sshd_cleanup
    exit 0
fi
EXECUTABLE=$1
shift
chmod +x $EXECUTABLE

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE
# Added for Debug
echo "Contact File: ${CONDOR_CONTACT_FILE}"
cat ${CONDOR_CONTACT_FILE}
# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
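# Note (added for this mail): per sshd.sh, each contact line should look like
#   <procno> <hostname> <port> <user> <scratch dir> <run id>
# so the first field, used as the sort key here, should be a node number.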
# Added for Debug
echo "Machines"
cat machines
## run the actual mpijob
if `ompi_info --param all all | grep orte_rsh_agent 1>/dev/null 2>&1`
then
    echo "IF" # Added for Debug
    mpirun -v --prefix $MPDIR --mca orte_rsh_agent $CONDOR_SSH -n $_CONDOR_NPROCS -hostfile machines $EXECUTABLE $@
else
    ########## For mpi versions 1.1 & 1.2 use the line below
    echo "ELSE" # Added for Debug
    mpirun -v --mca plm_rsh_agent $CONDOR_SSH -n $_CONDOR_NPROCS -hostfile machines $EXECUTABLE $@
fi
sshd_cleanup
rm -f machines
exit $?
******************************************************
After reading the docs and the error, we checked the sshd.sh file on the worker node and found this (line 125):
if [ $_CONDOR_PROCNO -eq 0 ]
Line 113 has this:
echo "$_CONDOR_PROCNO $hostname $PORT $user $currentDir $thisrun"Â |
ÂÂÂÂÂÂÂ $CONDOR_CHIRP put -mode cwa - $_CONDOR_REMOTE_SPOOL_DIR/contact

To check the output, we changed it to this:
echo "$_CONDOR_PROCNO N $_CONDOR_NPROCS $hostname $PORT $user $currentDir $thisrun"Â |
ÂÂÂÂÂÂÂ $CONDOR_CHIRP put -mode cwa - $_CONDOR_REMOTE_SPOOL_DIR/contact

And got this output in the contact file:
mpi_hello N uvcluster-01.cloud.univalle.edu.co 4444 edza /var/lib/condor/execute/dir_5148 1470260298
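For comparison, with machine_count = 1 we would have expected the modified echo to produce something like this for the head node (procno 0, nprocs 1):

0 N 1 uvcluster-01.cloud.univalle.edu.co 4444 edza /var/lib/condor/execute/dir_5148 1470260298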

So $_CONDOR_PROCNO is not a number but the executable's name, and $_CONDOR_NPROCS is empty.

Can anyone help us solve this issue? Any ideas?

Thank you very much.

--
Edier Alberto Zapata Hernández
Infrastructure Support Engineer
CIER - Sur




--
Edier Alberto Zapata Hernández
Infrastructure Support Engineer
CIER - Sur