[Condor-users] MPI Jobs on Condor



Hi all,

I am using Condor to run my MPI jobs on a large cluster of nodes. The jobs run fine, but after some time they are automatically restarted. What could be the reason?

My mpi-wrapper is scripted as follows.
___________________________________________________________________________________
#!/bin/sh

EXECUTABLE=$1

CONDOR_CHIRP=`condor_config_val libexec`/condor_chirp

contact_dir=/atlas/user/atlas1/`whoami`/Condor/MPI/contact

mkdir -p $contact_dir

thisrun=`echo $_CONDOR_REMOTE_SPOOL_DIR | sed 's!^.*/cluster\([0-9]*\).*!\1!'`

contact=$contact_dir/$thisrun

hostname  | $CONDOR_CHIRP put -mode cwa - $contact


if [ $_CONDOR_PROCNO -eq 0 ]; then
        while [ "`awk 'END { print NR }' $contact`" -lt $_CONDOR_NPROCS  ]; do
            echo WAITING
            sleep 1
        done
        /usr/bin/mpirun.openmpi -v -np $_CONDOR_NPROCS -machinefile $contact $EXECUTABLE $@
        sleep 300
        rm -f $contact
else
        wait
        exit $?
fi

exit $?
_________________________________________________________________________________________
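Two details of the wrapper above may be worth checking, although neither is confirmed as the cause of the restarts. First, since there is no shift after EXECUTABLE=$1, "$@" still contains the executable path, so it is passed to mpirun a second time as an argument. Second, the final "exit $?" reports the status of "rm -f", not of mpirun. A minimal sketch of both points follows; run_and_cleanup and split_args are hypothetical helper names used only for illustration, not part of Condor or Open MPI.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and exist just to
# illustrate argument handling and exit-status capture.

# Run a command, remove the contact file afterwards, and return the
# command's own exit status rather than that of the cleanup step.
run_and_cleanup() {
    contact_file=$1
    shift                    # drop the contact file; "$@" is now the command
    "$@"
    cmd_rc=$?                # save the status before rm overwrites $?
    rm -f "$contact_file"
    return $cmd_rc
}

# Separate the executable from its arguments so that the executable
# is not repeated among its own arguments.
split_args() {
    executable=$1
    shift                    # "$@" now holds only the remaining arguments
    echo "$executable" "$@"
}
```

In the wrapper, the same idea would mean adding a shift after EXECUTABLE=$1 and saving $? immediately after the mpirun line, then exiting with the saved value.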

My condor_submit file is:
_________________________________________________________________________________________
######################
# Condor submit file #
######################
universe                = parallel
executable              = /usr/local/bin/atlas_openmpi_wrapper
arguments               = /home/asad/MLDC4/lfakw4b1
machine_count           = 10
should_transfer_files   = yes
when_to_transfer_output = on_exit
transfer_input_files    = /home/asad/MLDC4/lfakw4b1
+ParallelShutdownPolicy = "WAIT_FOR_ALL"
log                     = /home/asad/MLDC4/logfiles/lfakw4b1.log
output                  = /home/asad/MLDC4/logfiles/lfakw4b1.log.$(NODE).out
error                   = /home/asad/MLDC4/logfiles/lfakw4b1.log.$(NODE).error
environment             = "MPI_NRPROCS=10 JOB=1"
queue
_________________________________________________________________________________________

The MPI version is Open MPI 1.2.7rc2. The problem is that the jobs start, run for a while, and then suddenly restart by themselves.

Cheers,

Asad






--
 "A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule."