Re: [condor-users] condor-mpich



On Fri, Feb 20, 2004 at 03:39:11PM -0500, Joel Hernandez wrote:
> I've been trying to setup several of our nodes to run as dedicated 
> resources in order to run MPI jobs.  I've tested the setup using the 
> simple example in section 2.10 of the online Condor Manual for V6.6.
> 
> However instead of containing the print out from stdin, the outfile 
> contains the following error message:
> 
> rm_3660: (-) net_recv failed for fd = 3
> rm_3660:  p4_error: net_recv read, errno = : 104
> 
> Has anyone encountered this problem?
> 

Which version of MPICH are you using? MPICH 1.2.5 changed what it expects the
P4_RSH_COMMAND to do. Versions 1.2.4 and earlier were fine if the RSH command
used to start up the job exited before the job completed; 1.2.5 treats that
as an error.

In Condor, we don't use a real RSH to start up a job, so we have an RSH
"in name only", and it exits before the job completes. We do plan to 
come up with a workaround, but it'll be a while before that gets into 
Condor.
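
To make the difference concrete, here is a toy Python model of the two
behaviours. It is only a sketch and is not MPICH or Condor code: the names
Run, launcher_exited_early and is_error are made up for illustration, and it
assumes the only thing that matters is whether the RSH-like launcher exits
before the job itself does.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical record of one job start-up (times in seconds)."""
    launcher_exit_time: float  # when the RSH-like command exited
    job_exit_time: float       # when the MPI job itself finished


def launcher_exited_early(run: Run) -> bool:
    """True if the launcher went away while the job was still running."""
    return run.launcher_exit_time < run.job_exit_time


def is_error(run: Run, strict: bool) -> bool:
    """strict=False models the tolerant 1.2.4-and-earlier behaviour;
    strict=True models 1.2.5, which treats an early launcher exit as an error."""
    return strict and launcher_exited_early(run)


# An "in name only" RSH: returns almost immediately, job runs for a while.
in_name_only = Run(launcher_exit_time=0.1, job_exit_time=30.0)

# A real RSH: stays up until the remote job has finished.
real_rsh = Run(launcher_exit_time=30.0, job_exit_time=30.0)
```

Under this model, the "in name only" launcher is accepted when strict is
False and rejected when strict is True, while a launcher that stays up for
the life of the job is accepted either way.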

-Erik
Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>