
Re: [HTCondor-users] network loss between schedds and worker nodes



Hi Andrew,

I can't see anything obvious in the code or in these logs that would have led to this failure scenario.

I'm afraid someone more familiar with the shadow<->starter protocol may have to step in here... Dan?

Brian

PS - we probably should log at D_ALWAYS when keepalive packets are sent.  Otherwise, there's not much of a hint as to what the shadow is up to.
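
In the meantime, something like this in the config (schedd side for the shadow, worker-node side for the starter) should make that traffic visible - just a sketch, and D_FULLDEBUG is verbose, so probably only worth leaving on while debugging:

   # condor_config sketch: turn up shadow/starter logging temporarily
   SHADOW_DEBUG  = D_FULLDEBUG
   STARTER_DEBUG = D_FULLDEBUG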

On Dec 13, 2013, at 7:48 AM, <andrew.lahiff@xxxxxxxxxx> wrote:

> Hi Brian,
> 
> Thanks for your reply (full ShadowLog sent off-list).
> 
> I've included below [1] one of the starter log files, which is interesting:
> 
> 12/12/13 13:06:29 - CE rebooted due to kernel/errata upgrade
> 12/12/13 13:10:45 - connects back to (new) shadow
> 12/12/13 15:01:40 - clear that CE lost networking
> - CE networking returned at ~15:25 -
> 12/12/13 17:01:55 - starter exits
> 12/12/13 17:03:31 - new job starting, from the same CE
> 
> For some reason the starter couldn't talk to the shadow on the CE even after networking had returned. After the job lease expired, another job started running, in this case from the same CE that the starter previously couldn't connect to. An extract from the ShadowLog is also shown [2].
> 
> Note that between 15:25 and 17:01 the two CEs which had previously lost networking were able to start new jobs. Some of these are still running. It's only the jobs which had started running before the network break which were affected.
> 
> I've been trying to reproduce the problem using a schedd on a VM and temporarily disabling the VM's networking, but so far I haven't been able to.
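> 
> Roughly what I've been doing to simulate the break on the test schedd VM - just a sketch, and the interface name is only an example:
> 
>    # drop the VM's networking for ~30 minutes, then restore it (RHEL6-style)
>    ifdown eth0
>    sleep 1800
>    ifup eth0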
> 
> Regards,
> Andrew.
> 
> [1]
> 12/12/13 11:41:11 ******************************************************
> 12/12/13 11:41:11 ** condor_starter (CONDOR_STARTER) STARTING UP
> 12/12/13 11:41:11 ** /usr/sbin/condor_starter
> 12/12/13 11:41:11 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
> 12/12/13 11:41:11 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
> 12/12/13 11:41:11 ** $CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
> 12/12/13 11:41:11 ** $CondorPlatform: x86_64_RedHat6 $
> 12/12/13 11:41:11 ** PID = 26873
> 12/12/13 11:41:11 ** Log last touched 12/12 11:40:40
> 12/12/13 11:41:11 ******************************************************
> 12/12/13 11:41:11 Using config source: /etc/condor/condor_config
> 12/12/13 11:41:11 Using local config sources:
> 12/12/13 11:41:11    /etc/condor/config.d/10security.config
> 12/12/13 11:41:11    /etc/condor/config.d/20wn.config
> 12/12/13 11:41:11    /etc/condor/condor_config.local
> 12/12/13 11:41:11 DaemonCore: command socket at <130.246.218.157:58225>
> 12/12/13 11:41:11 DaemonCore: private command socket at <130.246.218.157:58225>
> 12/12/13 11:41:11 Communicating with shadow <130.246.181.27:58210?noUDP>
> 12/12/13 11:41:11 Submitting machine is "cream-ce02.gridpp.rl.ac.uk"
> 12/12/13 11:41:11 setting the orig job name in starter
> 12/12/13 11:41:11 setting the orig job iwd in starter
> 12/12/13 11:41:11 Done setting resource limits
> 12/12/13 11:41:11 File transfer completed successfully.
> 12/12/13 11:41:11 Job 284063.0 set to execute immediately
> 12/12/13 11:41:11 Starting a VANILLA universe job with ID: 284063.0
> 12/12/13 11:41:11 IWD: /pool/condor/dir_26873
> 12/12/13 11:41:11 Output file: /pool/condor/dir_26873/_condor_stdout
> 12/12/13 11:41:11 Error file: /pool/condor/dir_26873/_condor_stderr
> 12/12/13 11:41:11 Renice expr "10" evaluated to 10
> 12/12/13 11:41:11 About to exec /pool/condor/dir_26873/condor_exec.exe
> 12/12/13 11:41:11 Setting job's virtual memory rlimit to 0 megabytes
> 12/12/13 11:41:11 Running job as user tlhcb005
> 12/12/13 11:41:11 Create_Process succeeded, pid=26877
> 12/12/13 13:06:29 condor_read() failed: recv(fd=6) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.181.27:31505>.
> 12/12/13 13:06:29 IO: Failed to read packet header
> 12/12/13 13:10:45 Accepted request to reconnect from <130.246.181.27:31505>
> 12/12/13 13:10:45 Ignoring old shadow <130.246.181.27:58210?noUDP>
> 12/12/13 13:10:45 Communicating with shadow <130.246.181.27:40051?noUDP>
> 12/12/13 15:01:40 condor_read(): timeout reading 21 bytes from <130.246.181.27:62806>.
> 12/12/13 15:01:40 IO: Failed to read packet header
> 12/12/13 15:16:43 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10, errno=110 Connection timed out
> 12/12/13 15:16:43 Buf::write(): condor_write() failed
> 12/12/13 15:28:45 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 15:28:45 Buf::write(): condor_write() failed
> 12/12/13 15:36:00 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 15:36:00 Buf::write(): condor_write() failed
> 12/12/13 15:41:01 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 15:41:01 Buf::write(): condor_write() failed
> 12/12/13 15:46:01 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 15:46:01 Buf::write(): condor_write() failed
> 12/12/13 15:51:01 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 15:51:01 Buf::write(): condor_write() failed
> 12/12/13 15:56:02 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 15:56:02 Buf::write(): condor_write() failed
> 12/12/13 16:01:02 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 16:01:02 Buf::write(): condor_write() failed
> 12/12/13 16:06:03 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 16:06:03 Buf::write(): condor_write() failed
> 12/12/13 16:11:03 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 16:11:03 Buf::write(): condor_write() failed
> 12/12/13 16:16:03 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 16:16:03 Buf::write(): condor_write() failed
> 12/12/13 16:21:04 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 16:21:04 Buf::write(): condor_write() failed
> 12/12/13 16:26:04 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 16:26:04 Buf::write(): condor_write() failed
> 12/12/13 16:31:05 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 16:31:05 Buf::write(): condor_write() failed
> 12/12/13 16:36:05 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 16:36:05 Buf::write(): condor_write() failed
> 12/12/13 16:41:06 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 16:41:06 Buf::write(): condor_write() failed
> 12/12/13 16:46:07 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 16:46:07 Buf::write(): condor_write() failed
> 12/12/13 16:51:07 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 16:51:07 Buf::write(): condor_write() failed
> 12/12/13 16:51:55 Got SIGTERM. Performing graceful shutdown.
> 12/12/13 16:51:55 ShutdownGraceful all jobs.
> 12/12/13 16:56:07 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 16:56:07 Buf::write(): condor_write() failed
> 12/12/13 17:01:08 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 17:01:08 Buf::write(): condor_write() failed
> 12/12/13 17:01:55 Got SIGQUIT.  Performing fast shutdown.
> 12/12/13 17:01:55 ShutdownFast all jobs.
> 12/12/13 17:01:55 Process exited, pid=26877, signal=9
> 12/12/13 17:01:55 condor_write(): Socket closed when trying to write 422 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 17:01:55 Buf::write(): condor_write() failed
> 12/12/13 17:01:55 condor_write(): Socket closed when trying to write 116 bytes to <130.246.181.27:62806>, fd is 10
> 12/12/13 17:01:55 Buf::write(): condor_write() failed
> 12/12/13 17:01:55 Failed to send job exit status to shadow
> 12/12/13 17:01:55 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
> 12/12/13 17:01:55 Returning from CStarter::JobReaper()
> 12/12/13 17:03:31 ******************************************************
> 12/12/13 17:03:31 ** condor_starter (CONDOR_STARTER) STARTING UP
> 12/12/13 17:03:31 ** /usr/sbin/condor_starter
> 12/12/13 17:03:31 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
> 12/12/13 17:03:31 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
> 12/12/13 17:03:31 ** $CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
> 12/12/13 17:03:31 ** $CondorPlatform: x86_64_RedHat6 $
> 12/12/13 17:03:31 ** PID = 27508
> 12/12/13 17:03:31 ** Log last touched 12/12 17:01:55
> 12/12/13 17:03:31 ******************************************************
> 12/12/13 17:03:31 Using config source: /etc/condor/condor_config
> 12/12/13 17:03:31 Using local config sources:
> 12/12/13 17:03:31    /etc/condor/config.d/10security.config
> 12/12/13 17:03:31    /etc/condor/config.d/20wn.config
> 12/12/13 17:03:31    /etc/condor/condor_config.local
> 12/12/13 17:03:31 DaemonCore: command socket at <130.246.218.157:37675>
> 12/12/13 17:03:31 DaemonCore: private command socket at <130.246.218.157:37675>
> 12/12/13 17:03:31 Communicating with shadow <130.246.181.27:47959?noUDP>
> 12/12/13 17:03:31 Submitting machine is "cream-ce02.gridpp.rl.ac.uk"
> 12/12/13 17:03:31 setting the orig job name in starter
> 12/12/13 17:03:31 setting the orig job iwd in starter
> 12/12/13 17:03:31 Done setting resource limits
> 12/12/13 17:03:31 File transfer completed successfully.
> 12/12/13 17:03:31 Job 283875.0 set to execute immediately
> 12/12/13 17:03:31 Starting a VANILLA universe job with ID: 283875.0
> 12/12/13 17:03:31 IWD: /pool/condor/dir_27508
> 12/12/13 17:03:31 Output file: /pool/condor/dir_27508/_condor_stdout
> 12/12/13 17:03:31 Error file: /pool/condor/dir_27508/_condor_stderr
> 12/12/13 17:03:31 Renice expr "10" evaluated to 10
> 12/12/13 17:03:31 About to exec /pool/condor/dir_27508/condor_exec.exe
> 12/12/13 17:03:31 Setting job's virtual memory rlimit to 0 megabytes
> 12/12/13 17:03:31 Running job as user tlhcb005
> 12/12/13 17:03:31 Create_Process succeeded, pid=27538
> 
> 
> [2]
> [root@cream-ce02 ~]# grep 10137 /var/log/condor/ShadowLog
> 12/12/13 13:10:45 ** PID = 10137
> 12/12/13 13:10:45 (284063.0) (10137): Trying to reconnect to disconnected job
> 12/12/13 13:10:45 (284063.0) (10137): LastJobLeaseRenewal: 1386852989 Thu Dec 12 12:56:29 2013
> 12/12/13 13:10:45 (284063.0) (10137): JobLeaseDuration: 7200 seconds
> 12/12/13 13:10:45 (284063.0) (10137): JobLeaseDuration remaining: 6344
> 12/12/13 13:10:45 (284063.0) (10137): Attempting to locate disconnected starter
> 12/12/13 13:10:45 (284063.0) (10137): Found starter: <130.246.218.157:58225>
> 12/12/13 13:10:45 (284063.0) (10137): Attempting to reconnect to starter <130.246.218.157:58225>
> 12/12/13 13:10:45 (284063.0) (10137): Reconnect SUCCESS: connection re-established
> 12/12/13 16:51:55 (284063.0) (10137): condor_read() failed: recv(fd=3) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from starter at <130.246.218.157:58225>.
> 12/12/13 16:51:55 (284063.0) (10137): IO: Failed to read packet header
> 12/12/13 16:51:55 (284063.0) (10137): Can no longer talk to condor_starter <130.246.218.157:58225>
> 12/12/13 16:51:55 (284063.0) (10137): JobLeaseDuration remaining: EXPIRED!
> 12/12/13 16:51:55 (284063.0) (10137): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (7200 seconds) expired
> 12/12/13 16:51:55 (284063.0) (10137): **** condor_shadow (condor_SHADOW) pid 10137 EXITING WITH STATUS 107
> 
> 
> 
> 
> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Brian Bockelman
> Sent: 12 December 2013 23:29
> To: HTCondor-Users Mail List
> Subject: Re: [HTCondor-users] network loss between schedds and worker nodes
> 
> Hi Andrew,
> 
> I poked around and saw nothing obvious that could cause it.  condor_read's error message doesn't log the timeout used, but I don't see any way that the call could have blocked for multiple hours (hence preventing the lease renewal messages from going out the door).
> 
> Can you reveal more of the ShadowLog?  What other messages do you see?
> 
> Can you replicate this under "laboratory conditions"?
> 
> Brian
> 
> On Dec 12, 2013, at 4:14 PM, andrew.lahiff@xxxxxxxxxx wrote:
> 
>> Hi,
>> 
>> We currently have JobLeaseDuration set to 7200 so that if a machine running a schedd needs to be rebooted or is down for other reasons, we have some time before jobs would be lost (the default wasn't long enough). This works fine, and we clearly see the "Attempting to locate disconnected starter..." and "Reconnect SUCCESS: connection re-established" messages in ShadowLog.
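>> 
>> (A sketch of one way to set this, via the job's submit description:)
>> 
>>    # condor_submit file (sketch)
>>    job_lease_duration = 7200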
>> 
>> However, today we had our first experience of a network break between 2 schedds and all the worker nodes. It lasted about 30 minutes. Two hours after the beginning of the break, we had lots of messages like this (*) in ShadowLog, and the running jobs therefore failed from the users' point of view. It seems that the shadows didn't notice that the network had returned, and then they all exited after the JobLeaseDuration had expired.
>> 
>> I assume that if we had done a "service condor restart" on the schedd machines after networking was restored then everything would have been fine, i.e. new shadows would have been created which would then connect to the starters. In general though, is there any way to ensure that a shadow will detect the loss and return of the connection to a starter and automatically reconnect, provided of course that some time limit hasn't been exceeded?
>> 
>> Thanks,
>> Andrew.
>> 
>> (*)
>> 12/12/13 16:52:05 (281379.0) (9670): condor_read() failed: recv(fd=3) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from starter at <130.246.219.76:54959>.
>> 12/12/13 16:52:05 (281379.0) (9670): IO: Failed to read packet header
>> 12/12/13 16:52:05 (281379.0) (9670): Can no longer talk to condor_starter <130.246.219.76:54959>
>> 12/12/13 16:52:05 (281379.0) (9670): JobLeaseDuration remaining: EXPIRED!
>> 12/12/13 16:52:05 (281379.0) (9670): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (7200 seconds) expired
>> 12/12/13 16:52:05 (281379.0) (9670): **** condor_shadow (condor_SHADOW) pid 9670 EXITING WITH STATUS 107
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/