
Re: [Condor-users] Problem with MPI universe job



Hi Dan and list,

Thanks for responding and for pointing out the right log file to look
at. We just ran the job again, reproducing the problem, and this is
the relevant part of StarterLog.vm2:

4/4 01:20:48 ******************************************************
4/4 01:20:48 ** condor_starter (CONDOR_STARTER) STARTING UP
4/4 01:20:48 ** /usr/local/condor/sbin/condor_starter
4/4 01:20:48 ** $CondorVersion: 6.8.2 Oct 12 2006 $
4/4 01:20:48 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
4/4 01:20:48 ** PID = 3371
4/4 01:20:48 ** Log last touched 4/3 23:57:42
4/4 01:20:48 ******************************************************
4/4 01:20:48 Using config source: /home/condor/condor_config
4/4 01:20:48 Using local config sources:
4/4 01:20:48    /home/condor/condor_config.local
4/4 01:20:48 DaemonCore: Command Socket at <10.7.7.12:56296>
4/4 01:20:48 Done setting resource limits
4/4 01:20:48 Communicating with shadow <10.7.7.250:34338>
4/4 01:20:48 Submitting machine is "head.psi.edu"
4/4 01:20:48 File transfer completed successfully.
4/4 01:20:49 Starting a MPI universe job with ID: 1542.0
4/4 01:20:49 IWD: /home/condor/execute/dir_3371
4/4 01:20:49 Output file: /home/condor/execute/dir_3371/citcom.out.1
4/4 01:20:49 Error file: /home/condor/execute/dir_3371/citcom.error.1
4/4 01:20:49 About to exec
/home/condor/execute/dir_3371/condor_exec.exe node20.psi.edu 47688
-p4amslave -p4yourname node12.psi.edu
4/4 01:20:49 Create_Process succeeded, pid=3373
4/4 02:38:25 Process exited, pid=3373, status=0
4/4 02:38:55 condor_write(): timed out writing 65536 bytes to <10.7.7.250:34338>
4/4 02:38:55 ReliSock::put_bytes_nobuffer: Send failed.
4/4 02:38:55 ReliSock::put_file: failed to put 65536 bytes
(put_bytes_nobuffer() returned -1)
4/4 02:39:25 condor_read(): timeout reading 5 bytes from <10.7.7.250:34338>.
4/4 02:39:25 DoUpload: STARTER at 10.7.7.12 failed to send file(s) to
<10.7.7.250:34338>: error sending
/home/condor/execute/dir_3371/case2.temp.1.13
4/4 02:39:25 File transfer failed, forcing disconnect.
4/4 02:39:25 JIC::allJobsDone() failed, waiting for job lease to
expire or for a reconnect attempt
4/4 02:39:25 Got SIGTERM. Performing graceful shutdown.
4/4 02:39:25 ShutdownGraceful all jobs.
4/4 02:39:25 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

It looks like the process exits cleanly at 02:38:25, but then at
02:38:55 the condor_write() call times out. Any idea what could be
causing this, and how to fix it? Maybe the disk on the head node is
too slow? It can write at about 30 MB/s, and the cluster network is
gigabit.
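
One detail that caught our eye is the starter's "waiting for job lease
to expire or for a reconnect attempt" message. As a first experiment
(this is only a guess on our side, and the right knob may well be
something else) we are thinking of lengthening the job lease in the
submit file, to give the shadow more time to drain its writes to the
head node's disk before the starter gives up:

# hypothetical excerpt from our submit file -- 3600 seconds is an
# arbitrary "long enough" value, not a recommendation
job_lease_duration = 3600

If anyone knows which parameter actually controls the condor_write()
timeout on the starter side, that would of course be even better.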

Thanks a lot,
Pasquale

On 4/3/07, Dan Bradley <dan@xxxxxxxxxxxx> wrote:

For starters, you'll need to look at the StarterLog on the execute machine
to see why condor_starter died.

Pasquale Tricarico wrote:

>Sorry for asking again, but does really nobody have any idea how to
>fix this problem?
>
>Thanks,
>Pasquale
>
>On 3/30/07, Pasquale Tricarico <tricaric@xxxxxxxxx> wrote:
>
>
>>Hi,
>>
>>We're still working on this problem.
>>
>>I just want to add that in the job log, we get this entry:
>>
>>...
>>007 (1454.000.000) 03/30 02:59:50 Shadow exception!
>>        Can no longer talk to condor_starter <10.7.7.12:40786>
>>        1090105856  -  Run Bytes Sent By Job
>>        151425472  -  Run Bytes Received By Job
>>...
>>
>>relative to one of the crunching nodes. So why is condor_starter dying
>>on the remote node? Is there a configuration parameter that can be
>>used to prevent this?
>>
>>Thanks,
>>Pasquale
>>
>>
>>On 3/26/07, Pasquale Tricarico <tricaric@xxxxxxxxx> wrote:
>>
>>
>>>Hi,
>>>
>>><feel free to jump directly to the question at the end of the message>
>>>
>>>We have a problem with an MPI universe job on a system running:
>>>
>>>$CondorVersion: 6.8.2 Oct 12 2006 $
>>>$CondorPlatform: X86_64-LINUX_RHEL3 $
>>>
>>>Let me start by saying that I'm the admin of the cluster, so I'm
>>>debugging an application that is a gray box to me (though I can ask
>>>the user for details about it).
>>>
>>>Simply put, each job runs on a different node, communicates with the
>>>other jobs, and writes a large number of big files at the end of its
>>>execution. We are NOT using a shared file system (NFS), so all the
>>>generated files are copied back to the head node, which is used for
>>>submission only.
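>>>
>>>For concreteness, our submit file looks roughly like this (the
>>>executable name, machine count, and file names are placeholders, not
>>>the real ones):
>>>
>>>universe = MPI
>>>executable = citcom
>>>machine_count = 4
>>>should_transfer_files = YES
>>>when_to_transfer_output = ON_EXIT
>>>log = citcom.log
>>>output = citcom.out.$(NODE)
>>>error = citcom.error.$(NODE)
>>>queue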
>>>
>>>So the job with ID=0 completes its execution (it's usually the first
>>>to complete), and from the ShadowLog I can see that all the files
>>>generated on the node are copied back to the head node correctly.
>>>Soon after, while the first node is still transferring files, two more
>>>nodes complete their jobs and start to transfer... and in the
>>>ShadowLog we see normal-looking messages such as:
>>>
>>>---
>>>/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): Inside
>>>MpiResource::resourceExit()
>>>/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): Inside
>>>RemoteResource::resourceExit()
>>>/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615): setting
>>>exit reason on vm4@xxxxxxxxxx to 100
>>>/home/condor/log/ShadowLog.old:3/25 22:34:23 (1428.0) (26615):
>>>Resource vm4@xxxxxxxxxx changing state from EXECUTING to FINISHED
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Entering shutDownLogic(r=100)
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615): Normal exit
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm2@xxxxxxxxxxxxx     FINISHED 100
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm2@xxxxxxxxxxxxx    EXECUTING -1
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm4@xxxxxxxxxxxxx     FINISHED 100
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm4@xxxxxxxxxxxxx    EXECUTING -1
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm1@xxxxxxxxxxxxx    EXECUTING -1
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm3@xxxxxxxxxxxxx    EXECUTING -1
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm2@xxxxxxxxxxxxx    EXECUTING -1
>>>/home/condor/log/ShadowLog.old:3/25 22:34:31 (1428.0) (26615):
>>>Resource vm4@xxxxxxxxxxxxx    EXECUTING -1
>>>---
>>>
>>>But then, suddenly, I get something like:
>>>
>>>---
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>entering FileTransfer::HandleCommands
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>FileTransfer::HandleCommands read transkey=5#4604675c407de5b75d04959c
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>entering FileTransfer::Download
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>entering FileTransfer::DoDownload sync=0
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>get_file(): going to write to filename /home/x.y/z.dat
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>get_file: Receiving 2267460 bytes
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>condor_read(): Socket closed when trying to read 65536 bytes from
>>><10.X.Y.13:53477>
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>ReliSock::get_bytes_nobuffer: Failed to receive file.
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>get_file: wrote 65536 bytes to file
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>get_file(): ERROR: received 65536 bytes, expected 2267460!
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>DoDownload: SHADOW at 10.X.Y.250 failed to receive file
>>>/home/x.y/z.dat
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>DoDownload: exiting at 1509
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615):
>>>condor_read(): Socket closed when trying to read 5 bytes from
>>><10.X.Y.12:40786>
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615): IO: EOF
>>>reading packet header
>>>/home/condor/log/ShadowLog.old:3/25 22:34:56 (1428.0) (26615): ERROR
>>>"Can no longer talk to condor_starter <10.X.Y.12:40786>" at line 123
>>>in file NTreceivers.C
>>>---
>>>
>>>and then the MPI job is completely restarted...
>>>
>>>So my questions are: Is it OK for Condor if only a fraction of the
>>>total MPI jobs originally submitted are running at a given time? Or
>>>does Condor interpret that as a problem and automatically restart the
>>>entire MPI job? Should all the jobs complete their execution at
>>>pretty much the same time?
>>>
>>>Thank you very much for any insight you may provide on this problem.
>>>
>>>Pasquale
>>>
>>>
>>>
>>
>>
>>
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR