
[Condor-users] Condor with MPI-over-ssh



Hi,

I'm trying to get a test MPI-universe job to run, but I've run into the following problem. First, some background: I'm using mpich v1.2.4, built with the following options:

./configure --with-device=ch_p4 --with-arch=LINUX -cc=icc -cflags="-I/usr/include" -fc=ifc --enable-cxx --enable-f77 -c++=icpc -f90=f90com -opt="-tpp7 -O2 -I/usr/include" -rsh=ssh --enable-f90modules --enable-sharedlib -prefix=/opt/mpich

So note that we're using ssh as the remote shell. I've set up a dedicated VM1_USER on each machine with passwordless ssh access between all MPI-enabled nodes (by copying the ssh keys), and as that user I can indeed run jobs successfully with mpirun. The problem arises when I submit the same job as a Condor job. I've set up a dedicated submit node as described in the manual and tried an initial job that uses just two nodes. The first execute node goes into a Claimed/Busy state while the second goes into Claimed/Idle, before the job finally fails with the following sent to the stdout of the first execute node:

p0_7062: p4_error: Timeout in making connection to remote process on <2nd execute nodename>: 0
p0_7062: (302.027937) net_send: could not write to fd=4, errno = 32
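For reference, the submit file is along these lines (the executable name and log filenames here are just placeholders for this example):

```
universe      = MPI
executable    = mpi_test        # placeholder name
machine_count = 2
output        = out.$(NODE)
error         = err.$(NODE)
log           = mpi_test.log
queue
```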


Now my guess is that the execute nodes are expected to use $MPI_CONDOR_RSH_PATH/rsh to communicate, whereas our mpich is built to use ssh, right? Hence my question: is there any way we can cajole Condor into using ssh, or do we have to rebuild our mpich to use rsh? We're really keen to use ssh if at all possible.
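If my understanding is right (this is an assumption on my part, pieced together from the manual), Condor simply puts its rsh wrapper directory first on the job's PATH, so only a binary that execs a plain "rsh" gets intercepted:

```shell
# Assumption: roughly what Condor arranges in the job's environment,
# so an mpirun built with -rsh=rsh resolves "rsh" to Condor's wrapper.
export PATH="$MPI_CONDOR_RSH_PATH:$PATH"
command -v rsh || true   # shows which rsh the job would actually run
# An mpirun built with -rsh=ssh never looks up "rsh" at all, so it
# execs the real ssh and bypasses Condor's wrapper completely.
```

Which would explain why the second node never gets contacted the way Condor expects.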

Cheers,
Mark