
Re: [Condor-users] MPI Jobs on Condor



There are a number of reasons why this can occur.  The most insidious
is a DNS issue: make sure DNS is consistent across all of your
machines (I posted about this a while ago with more details).  Also
check the logs on the executing hosts; they should explain what
happened.  Usually the job shows up as a "vacated vanilla" job.
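
A quick way to spot the DNS problem is to compare the forward and
reverse lookups on every node, and then grep the execute-side daemon
logs for vacate/evict messages.  This is only a sketch: it assumes the
host(1) utility is installed and that the daemon logs live in the
directory reported by condor_config_val LOG (adjust to your setup):

#!/bin/sh
# Run on each execute node; the names printed should agree on every
# machine in the pool.
fqdn=`hostname -f`
addr=`host "$fqdn" | awk '/has address/ {print $4; exit}'`
rev=`host "$addr" | awk '/domain name pointer/ {print $5; exit}'`
echo "forward: $fqdn -> $addr   reverse: $addr -> $rev"

# Look for recent vacate/evict/kill messages in the startd and starter
# logs (paths assumed from the LOG config setting).
logdir=`condor_config_val LOG`
grep -iE 'vacat|evict|kill' "$logdir"/StartLog "$logdir"/StarterLog* 2>/dev/null | tail -20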

(The default behavior of vacating vanilla jobs by killing them is
counter-intuitive for a dedicated intranet pool, but it is logical for
idle-cycle processing.)
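
If your nodes are dedicated to Condor rather than borrowed desktop
cycles, you can configure the execute machines so jobs are never
suspended or preempted, which is also what the dedicated scheduler
wants for parallel-universe jobs.  A minimal sketch of the relevant
condor_config settings, assuming a scheduler host I'm calling
central-manager.example.com (a placeholder; use your own submit host):

## condor_config on the execute nodes (sketch only)
DedicatedScheduler = "DedicatedScheduler@central-manager.example.com"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

## Always start jobs and never suspend, vacate, or preempt them
START        = True
SUSPEND      = False
WANT_SUSPEND = False
WANT_VACATE  = False
PREEMPT      = False
KILL         = False

## Prefer jobs coming from the dedicated scheduler
RANK = Scheduler =?= $(DedicatedScheduler)

A condor_reconfig on the execute nodes picks up the new settings.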

On Fri, Apr 15, 2011 at 9:14 PM, Asad Ali <asad06@xxxxxxxxx> wrote:
> Hi all,
>
> I am using Condor to run my MPI jobs on a large cluster of nodes. The jobs
> run fine, but after some time they automatically get restarted. What could
> be the reason?
>
> My mpi-wrapper is scripted as follows.
> ___________________________________________________________________________________
> #!/bin/sh
>
> EXECUTABLE=$1
>
> CONDOR_CHIRP=`condor_config_val libexec`/condor_chirp
>
> contact_dir=/atlas/user/atlas1/`whoami`/Condor/MPI/contact
>
> mkdir -p $contact_dir
>
> thisrun=`echo $_CONDOR_REMOTE_SPOOL_DIR | sed 's!^.*/cluster\([0-9]*\).*!\1!'`
>
> contact=$contact_dir/$thisrun
>
> hostname  | $CONDOR_CHIRP put -mode cwa - $contact
>
>
> if [ $_CONDOR_PROCNO -eq 0 ]; then
>         while [ "`awk 'END { print NR }' $contact`" -lt $_CONDOR_NPROCS ]; do
>             echo WAITING
>             sleep 1
>         done
>         /usr/bin/mpirun.openmpi -v -np $_CONDOR_NPROCS -machinefile $contact $EXECUTABLE $@
>         sleep 300
>         rm -f $contact
> else
>         wait
>         exit $?
> fi
>
> exit $?
> _________________________________________________________________________________________
>
> My condor_submit file is
> _________________________________________________________________________________________
> ######################
> # Condor submit file #
> ######################
> universe                = parallel
> executable              = /usr/local/bin/atlas_openmpi_wrapper
> arguments               = /home/asad/MLDC4/lfakw4b1
> machine_count           = 10
> should_transfer_files   = yes
> when_to_transfer_output = on_exit
> transfer_input_files    = /home/asad/MLDC4/lfakw4b1
> +ParallelShutdownPolicy = "WAIT_FOR_ALL"
> log                     = /home/asad/MLDC4/logfiles/lfakw4b1.log
> output                  = /home/asad/MLDC4/logfiles/lfakw4b1.log.$(NODE).out
> error                   = /home/asad/MLDC4/logfiles/lfakw4b1.log.$(NODE).error
> environment        = "MPI_NRPROCS=10 JOB=1"
> queue
> _________________________________________________________________________________________
>
> The MPI version is Open MPI 1.2.7rc2. The problem is that the jobs start
> and run for a while and then suddenly restart by themselves.
>
> Cheers,
>
> Asad
>
>
>
>
>
>
> --
>  "A Bayesian is one who, vaguely expecting a horse, and catching a glimpse
> of a donkey, strongly believes he has seen a mule."
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>
>