
Re: [condor-users] condor-mpich

Erik Paulson wrote:

On Fri, Feb 20, 2004 at 03:39:11PM -0500, Joel Hernandez wrote:

I've been trying to set up several of our nodes as dedicated resources to run MPI jobs. I tested the setup using the simple example in section 2.10 of the online Condor Manual for V6.6.

However, instead of containing the printout from stdin, the outfile contains the following error message:

rm_3660: (-) net_recv failed for fd = 3
rm_3660:  p4_error: net_recv read, errno = : 104
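
A note on the error text: assuming the job ran on Linux (the post does not say), errno 104 is ECONNRESET, "Connection reset by peer", which means the process at the other end of the socket went away. Below is a minimal Python sketch for looking up such codes. The helper name describe_errno is hypothetical and is not part of MPICH or Condor.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# On Linux, errno.ECONNRESET has the value 104 seen in the p4_error line.
# Other platforms number their errors differently, so look it up by name.
print(describe_errno(errno.ECONNRESET))
```

Run this on the execute node itself, since the numeric values are platform-specific.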

Has anyone encountered this problem?

Which version of MPICH are you using? MPICH 1.2.5 changed what it expects the P4_RSH_COMMAND to do: 1.2.4 and earlier versions were fine if the RSH command used to start the job exited before the job completed, but 1.2.5 treats that as an error.

In Condor, we don't use a real RSH to start a job, so we have an RSH "in name only", and it exits before the job completes. We do plan to come up with a workaround, but it will be a while before that gets into Condor.

That's the problem. I installed MPICH 1.2.4 and it works!

However, I now have another problem. We have two eight-node, dual-CPU clusters (louie and duey). Users submit their jobs on louie, and when all of its nodes are busy, the jobs flock to duey and run there. This works great for non-MPI jobs.

I am now able to run the simple example MPI job from section 2.10 of the Condor users manual on either louie or duey. However, to run an MPI job on duey, I have to submit the job on duey. If I use the following line in a submit file on louie

requirements = machine == "duey3.cnidr.org"

and submit the job from louie, the job just stays in the queue.

Any ideas?

Thanks,
Joel
-----------------------------------------------------------------
Joel Hernandez
Systems Programmer / Analyst
MCNC-RDI Center for Networked Information Discovery and Retrieval
joelh@xxxxxxxxx
http://www.cnidr.org
