On Fri, Feb 20, 2004 at 03:39:11PM -0500, Joel Hernandez wrote:
> I've been trying to set up several of our nodes to run as dedicated
> resources in order to run MPI jobs. I've tested the setup using the
> simple example in section 2.10 of the online Condor Manual for V6.6.
> However, instead of containing the printout from stdin, the outfile
> contains the following error message:
> rm_3660: (-) net_recv failed for fd = 3
> rm_3660: p4_error: net_recv read, errno = : 104
> Has anyone encountered this problem?

Which version of MPICH are you using? MPICH 1.2.5 changed what it expects the
P4_RSH_COMMAND to do: 1.2.4 and earlier versions were fine if the RSH command
used to start up the job exited before the job completed, but 1.2.5 considers
that an error.
In Condor, we don't use a real RSH to start up a job, so we have an RSH
"in name only", and it exits before the job completes. We do plan to
come up with a workaround, but it'll be a while before that gets into
Condor.
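
As an aside, the 104 in the p4_error line is an errno value. Assuming the
nodes run Linux (the original message does not say), 104 is ECONNRESET,
"Connection reset by peer", which would fit a connection being dropped when
the RSH stand-in exits early. The sketch below is only a way to look the
value up on your own machine; describe_errno is a made-up helper name, not
part of Condor or MPICH.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value on the current platform."""
    # errno.errorcode maps numeric values to symbolic names for this OS;
    # the numbering differs between platforms, so the result is local.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # 104 is the value reported in the p4_error line of the outfile.
    print(describe_errno(104))
```

Run it on one of the execute nodes rather than on a desktop, since the same
number can mean something different on another operating system.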