
[Condor-users] Re: Condor with MPI-over-ssh


On Wed, Feb 16, 2005, Mark Calleja wrote:
> Hence note we're using ssh to communicate over. Now, I've set up the 
> dedicated VM1_USER on each machine to have passwordless ssh access 
> between all MPI enabled nodes (by copying the ssh keys), and I can 
> indeed run jobs as that user using mpirun successfully. 

Sounds familiar to me. I had to learn that Condor doesn't use mpirun
at all, but has its own startup mechanism.

>The problem 
> arises when I submit the same job as a Condor job. I've set up a 
> dedicated submit node as mentioned in the manual, and tried an initial 
> job with just two nodes. The first execute node goes into a 
> Claimed/Busy state while the second goes into a Claimed/Idle state 
> before finally the job fails with the following sent to the stdout of 
> the first execute node:

I'll just copy and paste a reply I got from Erik Paulsen on this list:
-------------
> p0_2957:  p4_error: Child process exited while making connection to
> remote process on c029.cip.physik.local: 0
> p0_2957: (6.333597) net_send: could not write to fd=4, errno = 32
> As far as I understand, Condor uses /home/condor/condor/sbin/rsh to
> start the job, right? This doesn't work because, for security reasons,
> rsh is not allowed here.

You'll note that /home/condor/condor/sbin/rsh is not really rsh, it's
just named rsh. It does not have the security problems of the Berkeley

> So I set up ssh:
> bash-2.05b$ whoami
> condor
> bash-2.05b$ ssh c029 date
> Tue Feb  1 12:41:20 CET 2005
> and linked it there, but this didn't help either.

That was your mistake. Put the condor program named 'rsh' back.

> So
> a) how do I tell Condor where to look for MPI?
> b) how do I tell Condor to use ssh?
a) you don't need to
b) you can't

Link your job with MPICH 1.2.4 for the ch_p4 device. Condor does not
need any MPI runtime support (we don't use mpirun).
------------------------
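
For reference, here is a minimal sketch of what a submit description for
such a two-node job could look like, assuming the MPI universe as
described in the Condor manual. The executable name my_mpi_job and the
output, error and log file names are placeholders I made up; the binary
is assumed to be linked against MPICH 1.2.4 with the ch_p4 device, as
described above. Treat it as a starting point, not a tested setup.

```
# Sketch of an MPI universe submit description (names are hypothetical).
universe      = MPI
# Assumed: binary linked against MPICH 1.2.4, ch_p4 device.
executable    = my_mpi_job
# Number of nodes to claim; two, as in the failing test job.
machine_count = 2
# $(NODE) expands to the rank, so each node gets its own files.
output        = my_mpi_job.out.$(NODE)
error         = my_mpi_job.err.$(NODE)
log           = my_mpi_job.log
queue
```

Note that there is no mpirun line anywhere: Condor starts the processes
itself, which is why swapping its sbin/rsh for ssh does not help.
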
Btw, when you compile MPICH 1.2.4, do you get any errors during 'make
testing'? If not, how did you build it, and on what kind of system?

I always get (Mandrake 10.0):
**** Testing I/O functions ****
c074 : So Feb  6 17:55:01 CET 2005
/tmp/mpi/mpich-1.2.4/bin/mpicc  -c simple.c
/tmp/mpi/mpich-1.2.4/bin/mpicc  -o simple simple.o
/tmp/mpi/mpich-1.2.4/lib/libpmpich.a(getpname.o)(.text+0x13): In function `PMPI_Get_processor_name':
: undefined reference to `MPID_Node_name'
collect2: ld returned 1 exit status
make[3]: *** [simple] Error 1
Could not build executable simple; aborting tests
make[2]: [testing] Error 1 (ignored)
End of testing in directory io
Running tests in directory command
End of testing in directory command

-- 
________ This message is made of 100 % recycled electrons
\..|     PGP Key: www.stud.uni-goettingen.de/~s242275/pgpkey.pub     (o_
.\.|--   Jabber:  te_linuxguru at jabber.fsinf.de            (o  (o  //\
..\|____ ICQ:     124557012                                  (/)_(/)_V_/