
Re: [Condor-users] Parallel MPI Job



On Sun, 2012-09-02 at 10:29 -0700, patrick cadelina wrote:
> 
> 
> Hi,
> 
> 
> I'm trying to run a simple parallel MPI hello world on condor but I
> keep getting errors. My code works using mpirun. Here's my submit
> file:
> 
> 
> 
> universe = parallel
> requirements = (TARGET.OpSys=="LINUX" && TARGET.Arch=="INTEL")
> executable = mp2script
> arguments = hello
> log = hello.log
> output = hello.out
> error = hello.err
> machine_count = 2
> should_transfer_files = yes
> when_to_transfer_output = on_exit
> transfer_input_files = hello
> +ParallelShutdownPolicy = "WAIT_FOR_ALL"
> queue
> 
> 
> 
> 
> And here's the error that I get from the generated files:
> mpd.out.0:
> /var/lib/condor/execute/dir_3282/condor_exec.exe: 60: /var/lib/condor/execute/dir_3282/condor_exec.exe: mpd: not found
> 
> 
> 
> mpd.out.1:
> /var/lib/condor/execute/dir_5103/condor_exec.exe: 101: /var/lib/condor/execute/dir_5103/condor_exec.exe: mpd: not found
> 
When you say your code works outside of Condor using mpirun, yet mp2script
reports that mpd is not installed, that tells me mpirun is using a different
process manager than mpd (which is a good thing, IMHO).

Before pursuing an installation of mpd, I would check whether another
process manager is already available. As I recall, some MPI implementations
have a mechanism to run in mpich1 mode, which doesn't use mpd. You might
want to look at your mpirun or mpiexec man page to see whether you have that
option, or the option to use hydra.

Here's a script (to replace mp2script) that I've used with Intel MPI to
avoid using mpd. For MPICH2 (MPDIR=/usr/lib64/mpich2/bin), where the
launcher was hydra, I've also replaced the mpirun line in the same script
with

mpiexec -launcher ssh -n $_CONDOR_NPROCS -f ${MACHINE_FILE} $EXECUTABLE $@
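
In full context, that MPICH2/hydra variant of the relevant lines in the
script below would look roughly like this (the paths are from my setup and
will differ on yours):

MPDIR=/usr/lib64/mpich2/bin
PATH=$MPDIR:.:$PATH
export PATH

# hydra starts the remote ranks over ssh, so no mpd ring is needed
mpiexec -launcher ssh -n $_CONDOR_NPROCS -f ${MACHINE_FILE} $EXECUTABLE "$@"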



#!/bin/sh

MPDIR=/product/Fortran_MPI/intel64/bin

PATH=$MPDIR:.:$PATH
export PATH

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS

# The first argument is the MPI program to run (hello, in your submit file);
# anything after it is passed through to the program.
EXECUTABLE=$1
shift
chmod +x $EXECUTABLE

# Remove the contact file, so if we are held and released
# it can be recreated anew
rm -f $CONDOR_CONTACT_FILE

PATH=`condor_config_val libexec`/:$PATH

if [ $_CONDOR_PROCNO -eq 0 ]
then
	echo "trying"
	echo "setting up"
	echo $_CONDOR_NPROCS

	# ask condor_chirp for the slots assigned to this job and turn the
	# list into a plain machine file of hostnames
	SLOTS=$($(condor_config_val libexec)/condor_chirp get_job_attr AllRemoteHosts)
	MACHINE_FILE="${_CONDOR_SCRATCH_DIR}/hosts"
	echo $SLOTS | sed -e 's/\"\(.*\)\".*/\1/' -e 's/,/\n/g' | tr "@" "\n" | grep -v slot >> ${MACHINE_FILE}

	echo "---"
	cat ${MACHINE_FILE}
	echo "---"

	echo "running job"
	## run the actual MPI job in mpich1 mode
	mpirun -f ${MACHINE_FILE} -machinefile ${MACHINE_FILE} -n $_CONDOR_NPROCS $EXECUTABLE "$@"
	e=$?

	sleep 20
	echo "first node out"
	echo $e
else
	echo "second node out"
fi
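
For completeness: if you save the script above as, say, mpiscript (the name
is just a placeholder), your submit file stays essentially as you posted it;
only the executable line changes:

universe = parallel
requirements = (TARGET.OpSys=="LINUX" && TARGET.Arch=="INTEL")
executable = mpiscript
arguments = hello
machine_count = 2
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = hello
+ParallelShutdownPolicy = "WAIT_FOR_ALL"
log = hello.log
output = hello.out
error = hello.err
queue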




> 
> 
> Any help would be appreciated. Thanks!
> 
> 
> Regards,
> Pat
>  
>  