Re: [Condor-users] How to troubleshoot MPI job



On Tue, Feb 15, 2005 at 05:04:45PM +0800, Nigel Teow wrote:
> Hi,
> 
> I have installed Condor (version 6.6.8) on a cluster.
> 
> I am able to use condor_submit to run the MPI job on a single node, but 
> when I try to run it on 2 nodes, it fails. The output files follow:
> 
> outfile.0
> -----------
> p0_28434:  p4_error: Child process exited while making connection to 
> remote process on compute-0-1.local: 0
> p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32
> 
> outfile.1
> -----------
> rm_28438: (-) net_recv failed for fd = 3
> rm_28438:  p4_error: net_recv read, errno = : 104
> 

That looks like an error that shows up with MPICH 1.2.5 or later. Use MPICH 1.2.4 instead.
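
For reading the output files, it may help to decode the two errno values
they report. The sketch below assumes the nodes run Linux, where 32 is EPIPE
and 104 is ECONNRESET (the numbers differ on other systems). The helper name
describe_errno and the lookup table are hypothetical, written only for this
illustration.

```python
# Minimal sketch: decode the errno values seen in outfile.0 and outfile.1.
# Assumption: Linux errno numbering. The table is hard-coded for illustration.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
}


def describe_errno(code):
    """Return 'NAME: message' for a code in the table, else a fallback."""
    entry = LINUX_ERRNO.get(code)
    if entry is None:
        return "errno %d: not in table" % code
    name, message = entry
    return "%s: %s" % (name, message)


if __name__ == "__main__":
    # errno = 32 comes from outfile.0, errno = 104 from outfile.1.
    for code in (32, 104):
        print(code, describe_errno(code))
```

Both codes mean that the other end of the connection went away, which is
consistent with the "Child process exited" message in outfile.0.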

-Erik