
Re: [Condor-users] Jobs don't run on execute machines



Very sorry for the long delay in following up on your suggestions for this problem.  I got busy with other tasks and wishfully thought that, given time, the problem might fix itself.  Strangely, my wish was not fulfilled.

To reduce the load on the submit machine and the execute machines, I set the maximum number of slots on every machine to 2 (it was 4 before, the number the hardware supports).  Still, the disconnection problem remains.
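
For reference, the change was essentially this in each machine's condor_config (sketched from memory, and assuming NUM_CPUS is the knob your setup uses to derive the slot count):

    # advertise only 2 slots even though the hardware has 4 cores
    # (the default is one slot per core)
    NUM_CPUS = 2

followed by a restart of the startd, since as far as I know a plain reconfig does not pick up a slot-count change.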

Here are snippets from the log files you suggested.  I cannot glean any insight from these logs myself and hope that help will once again arrive on this list.  This time I'll respond faster.

Submit machine:
Job log:
000 (2367.000.000) 04/04 20:39:00 Job submitted from host: <136.200.32.179:2831>
...
022 (2367.000.000) 04/05 00:59:29 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxxxxx <136.200.32.236:4774>
...
024 (2367.000.000) 04/05 00:59:29 Job reconnection failed
    Job disconnected too long: JobLeaseDuration (300 seconds) expired
    Can not reconnect to slot2@xxxxxxxxxxxxxxxxxxxxx, rescheduling job

ShadowLog:
04/05 00:58:46 Initializing a VANILLA shadow for job 2366.0
04/05 00:58:46 (2366.0) (5696): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxx <136.200.32.179:4587> was ACCEPTED
04/05 00:59:29 (2367.0) (7184): ReliSock: put_file: TransmitFile() failed, errno=10060
04/05 00:59:29 (2367.0) (7184): DoUpload: SHADOW at 136.200.32.179 failed to send file(s) to <136.200.32.236:1232>: error sending d:\delta\dsm2_v8\bin\hydro.exe; STARTER at 136.200.32.236 failed to receive file Z:\Condor\execute\dir_392\hydro.exe
04/05 00:59:29 (2367.0) (7184): condor_read() failed: recv() returned -1, errno = 10054 , reading 5 bytes from startd slot2@xxxxxxxxxxxxxxxxxxxxx.
04/05 00:59:29 (2367.0) (7184): IO: Failed to read packet header
04/05 00:59:29 (2367.0) (7184): Can no longer talk to condor_starter <136.200.32.236:4774>
04/05 00:59:29 (2367.0) (7184): Trying to reconnect to disconnected job
04/05 00:59:29 (2367.0) (7184): LastJobLeaseRenewal: 1270450910 Mon Apr 05 00:01:50 2010
04/05 00:59:29 (2367.0) (7184): JobLeaseDuration: 300 seconds
04/05 00:59:29 (2367.0) (7184): JobLeaseDuration remaining: EXPIRED!
04/05 00:59:29 (2367.0) (7184): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (300 seconds) expired
04/05 00:59:29 (2367.0) (7184): **** condor_shadow (condor_SHADOW) pid 7184 EXITING WITH STATUS 107

LOCKE:
MasterLog:
04/04 23:12:31 Started DaemonCore process "Z:/Condor/bin/condor_startd.exe", pid and pgroup = 2276
04/05 00:12:31 Preen pid is 4052
04/05 05:33:05 Sent signal 15 to COLLECTOR (pid 740)

StartLog:
04/05 00:59:59 condor_read() failed: recv() returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:1291>.
04/05 00:59:59 IO: Failed to read packet header
04/05 00:59:59 Starter pid 676 exited with status 4
04/05 00:59:59 slot1: State change: starter exited
04/05 00:59:59 slot1: Changing activity: Busy -> Idle
04/05 00:59:59 slot1: State change: idle claim shutting down due to CLAIM_WORKLIFE
04/05 00:59:59 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
04/05 00:59:59 slot1: State change: No preempting claim, returning to owner
04/05 00:59:59 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
04/05 00:59:59 slot1: State change: IS_OWNER is false
04/05 00:59:59 slot1: Changing state: Owner -> Unclaimed

StarterLog.slot1:
04/05 00:59:58 condor_read(): timeout reading 65536 bytes from <136.200.32.179:3201>.
04/05 00:59:58 ReliSock::get_bytes_nobuffer: Failed to receive file.
04/05 00:59:58 get_file(): ERROR: received 0 bytes, expected 9266688!
04/05 00:59:58 DoDownload: STARTER at 136.200.32.236 failed to receive file Z:\Condor\execute\dir_676\hydro.exe
04/05 00:59:58 File transfer failed (status=0).
04/05 00:59:58 ERROR "Failed to transfer files" at line 1882 in file ..\src\condor_starter.V6.1\jic_shadow.cpp
04/05 00:59:58 ShutdownFast all jobs.
04/05 01:00:46 Locale: English_United States.1252

StarterLog.slot2:
04/05 00:40:47 setting the orig job iwd in starter
04/05 01:21:46 condor_read(): timeout reading 65536 bytes from <136.200.32.179:4096>.
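
One thing I do notice in the ShadowLog above is that the reconnect gives up only because the 300-second JobLeaseDuration expires.  I have not tried it yet, but if the disconnects turn out to be transient, I gather the lease can be lengthened in the submit description, along these lines (untested on my side):

    # allow up to 20 minutes for the shadow and starter to reconnect
    # before the claim is given up and the job rescheduled
    job_lease_duration = 1200

Of course that would only paper over whatever is closing the sockets in the first place.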

On Thu, Feb 4, 2010 at 1:13 PM, Alan De Smet <adesmet@xxxxxxxxxxx> wrote:
Finch, Ralph <rfinch@xxxxxxxxxxxx> wrote:
> 022 (5193.000.000) 02/03 08:11:22 Job disconnected, attempting to
> reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxxxxx
> <136.200.32.179:4314>

The reconnection message is a red herring.  Condor is just trying
to recover from the real problem.  The question is, why did the
connection between your submit and execute computers close?

I suggest taking a few of these "disconnected" events and
correlating them with the ShadowLog on your submit computer and
the MasterLog, StartLog, and StarterLogs on the matching execute
computer.  There might be some useful clues in there.  I'm
betting the ShadowLog will just say something like "socket
closed unexpectedly."  Hopefully the execute computer will be
able to tell you why the connection closed.  Did the MasterLog
report that the Startd exited unexpectedly?  Did the Startd
report that the Starter exited unexpectedly?  Do the Startd or
Starter have any warnings or errors in their logs?  Perhaps one
of them complains about timing out trying to contact the submit
computer.
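
If the default logs are too terse to tell, you could also turn
up the verbosity on both sides while you reproduce a failure;
something along these lines in the relevant condor_config
files, followed by a reconfig (a sketch, not specific to your
setup):

    # more detail from the daemons involved in the claim and
    # the shadow/starter file transfer
    SHADOW_DEBUG  = D_FULLDEBUG
    STARTD_DEBUG  = D_FULLDEBUG
    STARTER_DEBUG = D_FULLDEBUG

Just remember to turn it back off afterward, since the logs
will grow quickly.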

To engage in wild guesswork, perhaps your submit computer is so
heavily overloaded that your shadows are unable to keep up with
the network traffic from the starters on the execute computers.
The starters eventually decide the other side is dead and hang
up.  If this is the problem, you might try configuration changes
on the submit computer: cut down on the number of jobs the startd
is willing to run simultaneously, use JOB_RENICE_INCREMENT to
decrease their priority, or both.  If the situation is bad
enough, you might need to stop running jobs on your submit node
entirely, but I would be surprised if you needed to go that far.
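
For the renice part, a minimal sketch of what would go in the
submit machine's condor_config (10 is just an illustrative
value, and the exact effect of the increment depends on the
platform):

    # start locally-run jobs at a lower OS scheduling priority
    # so the schedd and shadows win the CPU when things get busy
    JOB_RENICE_INCREMENT = 10

Cutting down on the local slot count would be the same kind of
NUM_CPUS-style adjustment, just applied to the submit machine's
own startd.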

I doubt that your central manager being overloaded is causing a
problem.  The most likely symptom of an overloaded central
manager is that new jobs don't get matched to execute nodes.
What you're seeing is existing jobs being interrupted.

--
Alan De Smet                              Condor Project Research
adesmet@xxxxxxxxxxx                http://www.cs.wisc.edu/condor/