
Re: [condor-users] condor-mpich

----- Forwarded message from owner-condor-users@xxxxxxxxxxx -----

Date: Fri, 20 Feb 2004 23:21:28 +0100
From: Olivier Ricou 
To: condor-users@xxxxxxxxxxx
Subject: Re: [condor-users] condor-mpich
Message-ID: <20040220232128.A1184@xxxxxxxxxxxxxxxx>
References: <4036706F.4080708@xxxxxxxxx>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <4036706F.4080708@xxxxxxxxx>; from joelh@xxxxxxxxx on Fri, Feb 20, 2004 at 03:39:11PM -0500

On 20/02/04 at 21:39, Joel Hernandez <joelh@xxxxxxxxx> wrote:
> I've been trying to setup several of our nodes to run as dedicated 
> resources in order to run MPI jobs.  I've tested the setup using the 
> simple example in section 2.10 of the online Condor Manual for V6.6.
> However instead of containing the print out from stdin, the outfile 
> contains the following error message:
> rm_3660: (-) net_recv failed for fd = 3
> rm_3660:  p4_error: net_recv read, errno = : 104

If the nodes running MPI are Linux machines, have a look at
/proc/sysvipc/sem to see whether each node still has free
semaphores. If you see a list of 128 or 256 entries (I don't
remember the limit), it means your MPI program has a problem and
does not free its semaphores properly.
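
As a rough illustration of that check, here is a minimal Python sketch that counts the entries in /proc/sysvipc/sem and groups them by owner. It assumes the usual Linux layout of that file (one header line containing a "uid" column, then one row per semaphore set); the function name count_semaphore_sets is made up for this example.

```python
from collections import Counter
from pathlib import Path

SEM_TABLE = Path("/proc/sysvipc/sem")  # Linux only


def count_semaphore_sets(table_text):
    """Return (total, per_uid) for the semaphore sets listed in table_text.

    Assumes the first non-empty line is a header naming the columns
    and that one of those columns is called "uid".
    """
    lines = [ln.split() for ln in table_text.splitlines() if ln.strip()]
    if not lines:
        return 0, Counter()
    header, rows = lines[0], lines[1:]
    uid_col = header.index("uid")
    per_uid = Counter(int(row[uid_col]) for row in rows)
    return len(rows), per_uid


if __name__ == "__main__":
    if SEM_TABLE.exists():
        total, per_uid = count_semaphore_sets(SEM_TABLE.read_text())
        print(f"{total} semaphore sets in use")
        for uid, n in per_uid.most_common():
            print(f"  uid {uid}: {n}")
    else:
        print(f"{SEM_TABLE} not found (not Linux, or SysV IPC disabled)")
```

On Linux the maximum number of sets is the fourth value in /proc/sys/kernel/sem, so the total printed here can be compared against it.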

To clean up the semaphores by hand (and shared memory, I think),
use cleanipcs, which should be in the sbin directory of your MPI
installation. Beware: it cleans only the semaphores owned by the
user running it.
Hope it helps,


----- End forwarded message -----