
[Condor-users] Problem with MPI universe job



Hi,

<feel free to jump directly to the question at the end of the message>

We have a problem with an MPI universe job on a system running:

$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

Let me start by saying that I'm the admin of the cluster, so I'm
debugging an application that is a gray box to me (though I can ask
the user for details about it).

Simply put, each job runs on a different node, communicates with the
other jobs, and writes a large number of big files at the end of its
execution. We are NOT using a shared file system (NFS), so all the
generated files are copied back to the head node, which is used for
submission only (a sketch of the kind of submit file involved follows
below).
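
I don't have the user's exact submit file at hand, but it is along
these lines (a simplified sketch: the executable name, machine count,
and log file names are placeholders, not the real values):

---
# Sketch of the submit description file (names and counts are
# placeholders, not the real values).
universe = MPI
executable = my_mpi_app
machine_count = 4

# No shared file system, so everything goes through Condor's file
# transfer mechanism; the big output files are copied back to the
# head node when each node exits.
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

log    = mpi_job.log
output = mpi_job.out.$(NODE)
error  = mpi_job.err.$(NODE)

queue
---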

So the job with node ID 0 completes its execution (it is usually the
first to finish), and from the ShadowLog I can see that all the files
generated on that node are copied back to the head node correctly.
Soon after, while the first node is still transferring files, two more
nodes complete their jobs and start transferring as well... and in the
ShadowLog we see normal-looking messages such as:

---
/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): Inside
MpiResource::resourceExit()
/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): Inside
RemoteResource::resourceExit()
/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): setting
exit reason on vm4@xxxxxxxxxx to 100
/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615):
Resource vm4@xxxxxxxxxx changing state from EXECUTING to FINISHED
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Entering shutDownLogic(r=100)
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615): Normal exit
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm2@xxxxxxxxxxxxx     FINISHED 100
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm2@xxxxxxxxxxxxx    EXECUTING -1
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm4@xxxxxxxxxxxxx     FINISHED 100
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm4@xxxxxxxxxxxxx    EXECUTING -1
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm1@xxxxxxxxxxxxx    EXECUTING -1
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm3@xxxxxxxxxxxxx    EXECUTING -1
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm2@xxxxxxxxxxxxx    EXECUTING -1
/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
Resource vm4@xxxxxxxxxxxxx    EXECUTING -1
---

But then, suddenly, I get something like:

---
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
entering FileTransfer::HandleCommands
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
FileTransfer::HandleCommands read transkey=5#4604675c407de5b75d04959c
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
entering FileTransfer::Download
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
entering FileTransfer::DoDownload sync=0
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
get_file(): going to write to filename /home/x.y/z.dat
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
get_file: Receiving 2267460 bytes
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
condor_read(): Socket closed when trying to read 65536 bytes from
<10.X.Y.13:53477>
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
ReliSock::get_bytes_nobuffer: Failed to receive file.
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
get_file: wrote 65536 bytes to file
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
get_file(): ERROR: received 65536 bytes, expected 2267460!
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
DoDownload: SHADOW at 10.X.Y.250 failed to receive file
/home/x.y/z.dat
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
DoDownload: exiting at 1509
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
condor_read(): Socket closed when trying to read 5 bytes from
<10.X.Y.12:40786>
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615): IO: EOF
reading packet header
/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615): ERROR
"Can no longer talk to condor_starter <10.X.Y.12:40786>" at line 123
in file NTreceivers.C
---

and then the entire MPI job is restarted from scratch...

So my questions are: is it OK for Condor if only a fraction of the
MPI jobs originally submitted are still running at a given time? Or
does Condor interpret that as a problem and automatically restart the
entire MPI job? Are all the jobs expected to complete their execution
at pretty much the same time?

Thank you very much for any insight you may provide on this problem.

Pasquale