
Re: [Condor-users] Problem with MPI universe job



Sorry for asking again, but does nobody really have any idea how to fix
this problem?

Thanks,
Pasquale

On 3/30/07, Pasquale Tricarico <tricaric@xxxxxxxxx> wrote:
Hi,

We're still working on this problem.

I just want to add that in the job log, we get this entry:

...
007 (1454.000.000) 03/30 02:59:50 Shadow exception!
        Can no longer talk to condor_starter <10.7.7.12:40786>
        1090105856  -  Run Bytes Sent By Job
        151425472  -  Run Bytes Received By Job
...

referring to one of the compute nodes. So why is condor_starter dying
on the remote node? Is there any configuration parameter that can be
used to prevent this?
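
If it helps, here is the kind of extra logging we could turn on next to
see what the starter is doing when it dies (just a debugging sketch;
the log-size values are arbitrary):

# condor_config on the execute nodes
STARTER_DEBUG   = D_FULLDEBUG
MAX_STARTER_LOG = 64000000

# condor_config on the submit node
SHADOW_DEBUG    = D_FULLDEBUG
MAX_SHADOW_LOG  = 64000000

After a condor_reconfig, the StarterLog on the execute node should show
whether the starter crashes, is killed, or times out during the file
transfer.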

Thanks,
Pasquale


On 3/26/07, Pasquale Tricarico <tricaric@xxxxxxxxx> wrote:
> Hi,
>
> <feel free to jump directly to the question at the end of the message>
>
> We have a problem with an MPI job on a system running:
>
> $CondorVersion: 6.8.2 Oct 12 2006 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
>
> Let me start by saying that I'm the admin of the cluster, so the
> application I'm debugging is a gray box to me (though I can ask the
> user for details about it).
>
> Simply put, each job runs on a different node, communicates with the
> other jobs, and writes a large number of big files at the end of its
> execution. We are NOT using a shared file system (NFS), so all the
> generated files are copied back to the head node, which is used for
> submission only.
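>
> For reference, the submit file is roughly of this form (the executable
> name, machine count, and file names below are placeholders, not the
> real ones):
>
> ---
> universe                = MPI
> executable              = my_mpi_app
> machine_count           = 8
> should_transfer_files   = YES
> when_to_transfer_output = ON_EXIT
> log                     = my_mpi_app.log
> output                  = my_mpi_app.out.$(NODE)
> error                   = my_mpi_app.err.$(NODE)
> queue
> ---
>
> (The $(NODE) macro just gives each MPI rank its own output and error
> file.)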
>
> So the job with ID=0 completes its execution (usually it's the first
> to complete), and in the ShadowLog I can see that all the files
> generated on that node are copied back to the head node correctly.
> Soon after, while the first node is still transferring files, two more
> nodes complete their jobs and start to transfer... and in the
> ShadowLog we see normal-looking entries such as:
>
> ---
> /home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): Inside
> MpiResource::resourceExit()
> /home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): Inside
> RemoteResource::resourceExit()
> /home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): setting
> exit reason on vm4@xxxxxxxxxx to 100
> /home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615):
> Resource vm4@xxxxxxxxxx changing state from EXECUTING to FINISHED
> /home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
> Entering shutDownLogic(r=100)
> /home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615): Normal exit
> /home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
> Resource vm2@xxxxxxxxxxxxx     FINISHED 100
> /home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
> Resource vm2@xxxxxxxxxxxxx    EXECUTING -1
> /home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
> Resource vm4@xxxxxxxxxxxxx     FINISHED 100
> /home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
> Resource vm4@xxxxxxxxxxxxx    EXECUTING -1
> /home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
> Resource vm1@xxxxxxxxxxxxx    EXECUTING -1
> /home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
> Resource vm3@xxxxxxxxxxxxx    EXECUTING -1
> /home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
> Resource vm2@xxxxxxxxxxxxx    EXECUTING -1
> /home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
> Resource vm4@xxxxxxxxxxxxx    EXECUTING -1
> ---
>
> But then, suddenly, I get something like:
>
> ---
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> entering FileTransfer::HandleCommands
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> FileTransfer::HandleCommands read transkey=5#4604675c407de5b75d04959c
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> entering FileTransfer::Download
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> entering FileTransfer::DoDownload sync=0
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> get_file(): going to write to filename /home/x.y/z.dat
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> get_file: Receiving 2267460 bytes
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> condor_read(): Socket closed when trying to read 65536 bytes from
> <10.X.Y.13:53477>
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> ReliSock::get_bytes_nobuffer: Failed to receive file.
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> get_file: wrote 65536 bytes to file
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> get_file(): ERROR: received 65536 bytes, expected 2267460!
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> DoDownload: SHADOW at 10.X.Y.250 failed to receive file
> /home/x.y/z.dat
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> DoDownload: exiting at 1509
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
> condor_read(): Socket closed when trying to read 5 bytes from
> <10.X.Y.12:40786>
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615): IO: EOF
> reading packet header
> /home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615): ERROR
> "Can no longer talk to condor_starter <10.X.Y.12:40786>" at line 123
> in file NTreceivers.C
> ---
>
> and then the MPI job is completely restarted...
>
> So my questions are: Is it OK with Condor if only a fraction of the
> total MPI jobs originally submitted are running at a given time? Or
> does Condor interpret that as a problem and automatically restart the
> entire MPI job? Should all the jobs complete their execution at pretty
> much the same time?
>
> Thank you very much for any insight you may provide on this problem.
>
> Pasquale
>


--
This space intentionally has nothing but text explaining why this
space has nothing but text explaining that this space would otherwise
have been left blank, and would otherwise have been left blank.