
Re: [Condor-users] Problem with MPI universe job




For starters, you'll need to look at the StarterLog on the execute machine to see why the condor_starter died.
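
Something along these lines will usually track it down (the path below is only an example; condor_config_val gives the real one on that node):

---
# on the execute machine
condor_config_val STARTER_LOG                 # where the starter writes its log
tail -n 200 /home/condor/log/StarterLog.vm2   # vmN = the slot that ran the job

# if the log is too terse, raise the starter's verbosity in the node's
# local config and run condor_reconfig:
#   STARTER_DEBUG = D_FULLDEBUG
---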

Pasquale Tricarico wrote:

Sorry for asking again, but really nobody has any idea on how to fix
this problem?

Thanks,
Pasquale

On 3/30/07, Pasquale Tricarico <tricaric@xxxxxxxxx> wrote:
Hi,

We're still working on this problem.

I just want to add that on the job log, we get this entry:

...
007 (1454.000.000) 03/30 02:59:50 Shadow exception!
       Can no longer talk to condor_starter <10.7.7.12:40786>
       1090105856  -  Run Bytes Sent By Job
       151425472  -  Run Bytes Received By Job
...

relative to one of the crunching nodes. So why is condor_starter dying
on the remote node? Is there any configuration parameter that can be
used to prevent this?
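
Something like this on the node behind 10.7.7.12, around that timestamp,
should show what the starter logged just before it died (the log path
and slot number are only examples):

---
grep -B 5 -A 20 "02:59:" /home/condor/log/StarterLog.vm1
---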

Thanks,
Pasquale


On 3/26/07, Pasquale Tricarico <tricaric@xxxxxxxxx> wrote:
Hi,

<feel free to jump directly to the question at the end of the message>

We have a problem with an MPI job on a system running:

$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

Let me start by saying that I'm the admin of the cluster, so I'm
debugging an application that is a gray box to me (though I can ask the
user for details about it).

Simply put, each job runs on a different node, communicates with the
other jobs, and writes a large number of big files at the end of its
execution. We are NOT using a shared file system (NFS), so all the
generated files are copied back to the head node, which is used for
submission only.
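
To make the setup concrete, the submit file looks roughly like this (a
sketch only -- the executable and file names are placeholders, not the
real ones):

---
universe                = MPI
executable              = mpi_crunch            # placeholder name
machine_count           = 4
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat             # placeholder
log                     = mpi_crunch.log
output                  = mpi_crunch.$(NODE).out
error                   = mpi_crunch.$(NODE).err
queue
---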

So the job with ID=0 completes its execution (usually it's the first to
complete), and from the ShadowLog I see that all the files generated on
that node are being copied back to the head node correctly. Soon after,
while the first node is still transferring files, two more nodes
complete their jobs and start to transfer... and in the ShadowLog we see
normal entries such as:

---
/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): Inside
MpiResource::resourceExit()
/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): Inside
RemoteResource::resourceExit()
/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): setting
exit reason on vm4@xxxxxxxxxx to 100
/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615):
Resource vm4@xxxxxxxxxx changing state from EXECUTING to FINISHED
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Entering shutDownLogic(r=100)
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615): Normal exit
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm2@xxxxxxxxxxxxx     FINISHED 100
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm2@xxxxxxxxxxxxx    EXECUTING -1
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm4@xxxxxxxxxxxxx     FINISHED 100
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm4@xxxxxxxxxxxxx    EXECUTING -1
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm1@xxxxxxxxxxxxx    EXECUTING -1
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm3@xxxxxxxxxxxxx    EXECUTING -1
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm2@xxxxxxxxxxxxx    EXECUTING -1
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm4@xxxxxxxxxxxxx    EXECUTING -1
---

But then, suddenly, I get something like:

---
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
entering FileTransfer::HandleCommands
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
FileTransfer::HandleCommands read transkey=5#4604675c407de5b75d04959c
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
entering FileTransfer::Download
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
entering FileTransfer::DoDownload sync=0
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
get_file(): going to write to filename /home/x.y/z.dat
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
get_file: Receiving 2267460 bytes
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
condor_read(): Socket closed when trying to read 65536 bytes from
<10.X.Y.13:53477>
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
ReliSock::get_bytes_nobuffer: Failed to receive file.
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
get_file: wrote 65536 bytes to file
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
get_file(): ERROR: received 65536 bytes, expected 2267460!
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
DoDownload: SHADOW at 10.X.Y.250 failed to receive file
/home/x.y/z.dat
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
DoDownload: exiting at 1509
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
condor_read(): Socket closed when trying to read 5 bytes from
<10.X.Y.12:40786>
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615): IO: EOF
reading packet header
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615): ERROR
"Can no longer talk to condor_starter <10.X.Y.12:40786>" at line 123
in file NTreceivers.C
---

and then the MPI job is completely restarted...
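
(The forced restarts are easy to count in the job's user log; the file
name below is just an example:)

---
grep -c "Shadow exception" mpi_crunch.log   # count the 007 "Shadow exception!" events
---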

So my questions are: Is it OK for Condor if only a fraction of the MPI
jobs originally submitted are running at a given time? Or does Condor
interpret that as a problem and automatically restart the entire MPI
job? Should all the jobs complete their execution at pretty much the
same time?

Thank you very much for any insight you may provide on this problem.

Pasquale

--
This space intentionally has nothing but text explaining why this
space has nothing but text explaining that this space would otherwise
have been left blank, and would otherwise have been left blank.

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR