
Re: [Condor-users] Not managing to get the parallel universe example from manual section "2.11.2 Parallel Job Submission" to run



Hi,
I noticed that the mpi universe works fine. Well sort of. It does work as long as I turn off iptables. If it is on, I get error messages in the outputfiles :

outfile.0
	p0_1136: p4_error: Timeout in making connection to remote process on pirineu.cap.ed.ac.uk: 0
	p0_1136: (302.006268) net_send: could not write to fd=4, errno = 32

outfile.1
	rm_4994:  p4_error: rm_start: net_conn_to_listener failed: 33192

So I would like to know if there is a way to restrict the range of ports that (I assume) MPI uses, so that I could open just that range in iptables instead of turning it off completely.
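If MPI could be pinned to a known range (I have not found a documented way to do that for ch_p4, so the range below is just a guess), opening it on the execute machines would then be something like:

	# guess: assuming MPI could be confined to TCP ports 30000-30100,
	# allow that range through iptables on each execute machine
	iptables -A INPUT -p tcp --dport 30000:30100 -j ACCEPT
	# and keep already-established connections working
	iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT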

I also tried to run MPI through the parallel universe, but this does not work. I used the example mp1script, and set MPDIR to the path of the bin directory of my MPI distribution.

I get some errors in errfile.0:

	connect to address 129.215.191.107 port 544: Connection refused
	connect to address 129.215.191.107 port 544: Connection refused
	trying normal rsh (/usr/bin/rsh)
	pirineu.cap.ed.ac.uk: Connection refused

This puzzled me for a while, since the CONDOR_SSH and P4_RSHCOMMAND environment variables are defined in mp1script, so I assumed that condor_ssh would be called rather than rsh. But those variables seem to be OK.
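A quick way to double-check them is to dump them from inside mp1script, along these lines:

	# temporary debug lines near the top of mp1script
	echo "CONDOR_SSH=$CONDOR_SSH"       >&2
	echo "P4_RSHCOMMAND=$P4_RSHCOMMAND" >&2
	echo "PATH=$PATH"                   >&2

Both pointed at condor_ssh, so my guess is that MPICH fell back to plain rsh because condor_ssh, although named by P4_RSHCOMMAND, was not on the PATH of the execute machine.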

I altered the line in mp1script:

	PATH=$MPDIR:.:$PATH

to

	PATH=$MPDIR:`condor_config_val libexec`:.:$PATH

so that, when MPICH goes looking for its remote-shell command, the condor_ssh wrapper in Condor's libexec directory can actually be found on the PATH.

After this change there were no more error messages in errfile.0.

But there were still some error messages in outfile.0:
	/usr/local/condor/libexec/condor_ssh
	running /home/condor/execute/dir_717/simplempi on 2 LINUX ch_p4 processors
	Created /home/condor/execute/dir_717/PI760
	Starting
	p0_844: p4_error: Timeout in making connection to remote process on pirineu.cap.ed.ac.uk: 0
	p0_844: (302.474659) net_send: could not write to fd=4, errno = 32

The first line just comes from an 'echo $P4_RSHCOMMAND' in the script, so that one is expected. However, there are still errors afterwards.

I had a look at the file PI760. It looks a bit like a 'p4pg' (P4 proc group) file, but I might be wrong. If it is supposed to work like a p4pg file, there is something that surprised me a bit:

	ys.cap.ed.ac.uk 0 /home/condor/execute/dir_717/simplempi
	pirineu.cap.ed.ac.uk 1 /home/condor/execute/dir_717/simplempi

The temporary directory dir_717 is indeed the local path of simplempi on ys.cap.ed.ac.uk, but it is NOT the path on pirineu.cap.ed.ac.uk, which uses a different temporary directory (dir_4975 or something like that). So this looks a bit strange, although I assume it would work on a shared file system, where dir_717 would be visible from both machines.
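If it really is a p4pg file, then without a shared file system each line would presumably need to carry that node's own scratch path, i.e. something more like:

	ys.cap.ed.ac.uk 0 /home/condor/execute/dir_717/simplempi
	pirineu.cap.ed.ac.uk 1 /home/condor/execute/dir_4975/simplempi

(with dir_4975, or whatever the remote scratch directory actually is, on the second line).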

Therefore I altered mp1script so that it writes a P4 proc group file (generated from the 'contact' file, in the same way the 'machines' file is generated) and passed it to mpirun with the -p4pg option.
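Roughly, the idea was something like this (only a sketch: it assumes the host name is in the second column of the contact file, that $CONDOR_CONTACT_FILE, $EXECUTABLE, $MPDIR and $_CONDOR_SCRATCH_DIR are set as in the stock mp1script, and it still writes the same executable path for every node):

	# sketch: build a p4 procgroup file from the contact file; the first
	# line (rank 0, the local host) gets a count of 0, every other host
	# gets a count of 1 - column layout of the contact file is assumed
	# to match what mp1script uses when it builds the machines file
	PROCGROUP=$_CONDOR_SCRATCH_DIR/procgroup
	sort -n -k 1 < $CONDOR_CONTACT_FILE | \
	    awk -v exe="$EXECUTABLE" \
	        'NR == 1 { print $2, 0, exe; next }
	                 { print $2, 1, exe }' > $PROCGROUP

	# then start the job with the procgroup file instead of -machinefile
	$MPDIR/mpirun -p4pg $PROCGROUP $EXECUTABLE $@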

Unfortunately this did not give better results.

I wonder if there is something I need to investigate more carefully. If anybody has managed to run MPI through the parallel universe on a network of desktop workstations without a shared file system, I would be interested in the kind of scripts they use.
Thanks,