
Re: [HTCondor-users] network loss between schedds and worker nodes



Hi Brian,

Thanks for your reply (the full ShadowLog has been sent off-list).

I've included one of the starter log files below [1]; the timeline is interesting:

12/12/13 13:06:29 - CE rebooted due to kernel/errata upgrade
12/12/13 13:10:45 - starter reconnects to the (new) shadow
12/12/13 15:01:40 - starter notices that the CE has lost networking
- CE networking returned at ~15:25 -
12/12/13 17:01:55 - starter exits
12/12/13 17:03:31 - new job starts, from the same CE

For some reason the starter couldn't re-establish communication with the shadow on the CE even after networking had returned. Once the job lease expired, another job started running in the slot, in this case submitted from the same CE which the starter previously couldn't reach. An extract from the corresponding ShadowLog is shown in [2].
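
For reference, the 7200-second lease involved here is the job's JobLeaseDuration attribute. This thread doesn't show exactly how we set it, but a minimal sketch, assuming it's done either per-job in the submit file or as a pool-wide default in the configuration, would be:

  # per-job, in the submit description file
  job_lease_duration = 7200

  # or as a pool-wide default for jobs that don't set it themselves
  JOB_DEFAULT_LEASE_DURATION = 7200

Either way, the shadow and starter should tolerate a disconnection of up to 7200 seconds before the job is considered lost, which matches the "JobLeaseDuration remaining" messages in the ShadowLog extract below.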

Note that between 15:25 and 17:01 the two CEs which had previously lost networking were able to start new jobs, and some of these are still running. Only the jobs which were already running before the network break were affected.
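
A quick way to separate the two cases on the schedd side seems to be grepping the ShadowLog for the reconnect outcome (the messages are the ones quoted in [2]); a rough sketch:

  # shadows that re-established the connection after the break
  grep "Reconnect SUCCESS" /var/log/condor/ShadowLog

  # shadows that stayed disconnected until the lease ran out
  grep "Reconnect FAILED" /var/log/condor/ShadowLog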

I've been trying to reproduce this by running a schedd on a VM and temporarily disabling the VM's networking, but so far I haven't been able to trigger the same behaviour.
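
In case it helps, a sketch of one way to simulate the break on the test schedd VM: instead of downing the whole interface, use iptables to drop just the worker-node traffic so the VM itself stays reachable (the subnet and duration below are placeholders):

  # block traffic to/from the worker nodes for ~30 minutes, then restore it
  iptables -I INPUT  -s 130.246.218.0/24 -j DROP
  iptables -I OUTPUT -d 130.246.218.0/24 -j DROP
  sleep 1800
  iptables -D INPUT  -s 130.246.218.0/24 -j DROP
  iptables -D OUTPUT -d 130.246.218.0/24 -j DROP

The thing to watch is then whether the shadows log "Reconnect SUCCESS" once the rules are removed, or sit on the dead socket until the lease expires as in [2].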

Regards,
Andrew.

[1]
12/12/13 11:41:11 ******************************************************
12/12/13 11:41:11 ** condor_starter (CONDOR_STARTER) STARTING UP
12/12/13 11:41:11 ** /usr/sbin/condor_starter
12/12/13 11:41:11 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
12/12/13 11:41:11 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
12/12/13 11:41:11 ** $CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
12/12/13 11:41:11 ** $CondorPlatform: x86_64_RedHat6 $
12/12/13 11:41:11 ** PID = 26873
12/12/13 11:41:11 ** Log last touched 12/12 11:40:40
12/12/13 11:41:11 ******************************************************
12/12/13 11:41:11 Using config source: /etc/condor/condor_config
12/12/13 11:41:11 Using local config sources:
12/12/13 11:41:11    /etc/condor/config.d/10security.config
12/12/13 11:41:11    /etc/condor/config.d/20wn.config
12/12/13 11:41:11    /etc/condor/condor_config.local
12/12/13 11:41:11 DaemonCore: command socket at <130.246.218.157:58225>
12/12/13 11:41:11 DaemonCore: private command socket at <130.246.218.157:58225>
12/12/13 11:41:11 Communicating with shadow <130.246.181.27:58210?noUDP>
12/12/13 11:41:11 Submitting machine is "cream-ce02.gridpp.rl.ac.uk"
12/12/13 11:41:11 setting the orig job name in starter
12/12/13 11:41:11 setting the orig job iwd in starter
12/12/13 11:41:11 Done setting resource limits
12/12/13 11:41:11 File transfer completed successfully.
12/12/13 11:41:11 Job 284063.0 set to execute immediately
12/12/13 11:41:11 Starting a VANILLA universe job with ID: 284063.0
12/12/13 11:41:11 IWD: /pool/condor/dir_26873
12/12/13 11:41:11 Output file: /pool/condor/dir_26873/_condor_stdout
12/12/13 11:41:11 Error file: /pool/condor/dir_26873/_condor_stderr
12/12/13 11:41:11 Renice expr "10" evaluated to 10
12/12/13 11:41:11 About to exec /pool/condor/dir_26873/condor_exec.exe
12/12/13 11:41:11 Setting job's virtual memory rlimit to 0 megabytes
12/12/13 11:41:11 Running job as user tlhcb005
12/12/13 11:41:11 Create_Process succeeded, pid=26877
12/12/13 13:06:29 condor_read() failed: recv(fd=6) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.181.27:31505>.
12/12/13 13:06:29 IO: Failed to read packet header
12/12/13 13:10:45 Accepted request to reconnect from <130.246.181.27:31505>
12/12/13 13:10:45 Ignoring old shadow <130.246.181.27:58210?noUDP>
12/12/13 13:10:45 Communicating with shadow <130.246.181.27:40051?noUDP>
12/12/13 15:01:40 condor_read(): timeout reading 21 bytes from <130.246.181.27:62806>.
12/12/13 15:01:40 IO: Failed to read packet header
12/12/13 15:16:43 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10, errno=110 Connection timed out
12/12/13 15:16:43 Buf::write(): condor_write() failed
12/12/13 15:28:45 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 15:28:45 Buf::write(): condor_write() failed
12/12/13 15:36:00 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 15:36:00 Buf::write(): condor_write() failed
12/12/13 15:41:01 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 15:41:01 Buf::write(): condor_write() failed
12/12/13 15:46:01 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 15:46:01 Buf::write(): condor_write() failed
12/12/13 15:51:01 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 15:51:01 Buf::write(): condor_write() failed
12/12/13 15:56:02 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 15:56:02 Buf::write(): condor_write() failed
12/12/13 16:01:02 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 16:01:02 Buf::write(): condor_write() failed
12/12/13 16:06:03 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 16:06:03 Buf::write(): condor_write() failed
12/12/13 16:11:03 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 16:11:03 Buf::write(): condor_write() failed
12/12/13 16:16:03 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 16:16:03 Buf::write(): condor_write() failed
12/12/13 16:21:04 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 16:21:04 Buf::write(): condor_write() failed
12/12/13 16:26:04 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 16:26:04 Buf::write(): condor_write() failed
12/12/13 16:31:05 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 16:31:05 Buf::write(): condor_write() failed
12/12/13 16:36:05 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 16:36:05 Buf::write(): condor_write() failed
12/12/13 16:41:06 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 16:41:06 Buf::write(): condor_write() failed
12/12/13 16:46:07 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 16:46:07 Buf::write(): condor_write() failed
12/12/13 16:51:07 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 16:51:07 Buf::write(): condor_write() failed
12/12/13 16:51:55 Got SIGTERM. Performing graceful shutdown.
12/12/13 16:51:55 ShutdownGraceful all jobs.
12/12/13 16:56:07 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 16:56:07 Buf::write(): condor_write() failed
12/12/13 17:01:08 condor_write(): Socket closed when trying to write 327 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 17:01:08 Buf::write(): condor_write() failed
12/12/13 17:01:55 Got SIGQUIT.  Performing fast shutdown.
12/12/13 17:01:55 ShutdownFast all jobs.
12/12/13 17:01:55 Process exited, pid=26877, signal=9
12/12/13 17:01:55 condor_write(): Socket closed when trying to write 422 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 17:01:55 Buf::write(): condor_write() failed
12/12/13 17:01:55 condor_write(): Socket closed when trying to write 116 bytes to <130.246.181.27:62806>, fd is 10
12/12/13 17:01:55 Buf::write(): condor_write() failed
12/12/13 17:01:55 Failed to send job exit status to shadow
12/12/13 17:01:55 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
12/12/13 17:01:55 Returning from CStarter::JobReaper()
12/12/13 17:03:31 ******************************************************
12/12/13 17:03:31 ** condor_starter (CONDOR_STARTER) STARTING UP
12/12/13 17:03:31 ** /usr/sbin/condor_starter
12/12/13 17:03:31 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
12/12/13 17:03:31 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
12/12/13 17:03:31 ** $CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
12/12/13 17:03:31 ** $CondorPlatform: x86_64_RedHat6 $
12/12/13 17:03:31 ** PID = 27508
12/12/13 17:03:31 ** Log last touched 12/12 17:01:55
12/12/13 17:03:31 ******************************************************
12/12/13 17:03:31 Using config source: /etc/condor/condor_config
12/12/13 17:03:31 Using local config sources:
12/12/13 17:03:31    /etc/condor/config.d/10security.config
12/12/13 17:03:31    /etc/condor/config.d/20wn.config
12/12/13 17:03:31    /etc/condor/condor_config.local
12/12/13 17:03:31 DaemonCore: command socket at <130.246.218.157:37675>
12/12/13 17:03:31 DaemonCore: private command socket at <130.246.218.157:37675>
12/12/13 17:03:31 Communicating with shadow <130.246.181.27:47959?noUDP>
12/12/13 17:03:31 Submitting machine is "cream-ce02.gridpp.rl.ac.uk"
12/12/13 17:03:31 setting the orig job name in starter
12/12/13 17:03:31 setting the orig job iwd in starter
12/12/13 17:03:31 Done setting resource limits
12/12/13 17:03:31 File transfer completed successfully.
12/12/13 17:03:31 Job 283875.0 set to execute immediately
12/12/13 17:03:31 Starting a VANILLA universe job with ID: 283875.0
12/12/13 17:03:31 IWD: /pool/condor/dir_27508
12/12/13 17:03:31 Output file: /pool/condor/dir_27508/_condor_stdout
12/12/13 17:03:31 Error file: /pool/condor/dir_27508/_condor_stderr
12/12/13 17:03:31 Renice expr "10" evaluated to 10
12/12/13 17:03:31 About to exec /pool/condor/dir_27508/condor_exec.exe
12/12/13 17:03:31 Setting job's virtual memory rlimit to 0 megabytes
12/12/13 17:03:31 Running job as user tlhcb005
12/12/13 17:03:31 Create_Process succeeded, pid=27538


[2]
[root@cream-ce02 ~]# grep 10137 /var/log/condor/ShadowLog
12/12/13 13:10:45 ** PID = 10137
12/12/13 13:10:45 (284063.0) (10137): Trying to reconnect to disconnected job
12/12/13 13:10:45 (284063.0) (10137): LastJobLeaseRenewal: 1386852989 Thu Dec 12 12:56:29 2013
12/12/13 13:10:45 (284063.0) (10137): JobLeaseDuration: 7200 seconds
12/12/13 13:10:45 (284063.0) (10137): JobLeaseDuration remaining: 6344
12/12/13 13:10:45 (284063.0) (10137): Attempting to locate disconnected starter
12/12/13 13:10:45 (284063.0) (10137): Found starter: <130.246.218.157:58225>
12/12/13 13:10:45 (284063.0) (10137): Attempting to reconnect to starter <130.246.218.157:58225>
12/12/13 13:10:45 (284063.0) (10137): Reconnect SUCCESS: connection re-established
12/12/13 16:51:55 (284063.0) (10137): condor_read() failed: recv(fd=3) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from starter at <130.246.218.157:58225>.
12/12/13 16:51:55 (284063.0) (10137): IO: Failed to read packet header
12/12/13 16:51:55 (284063.0) (10137): Can no longer talk to condor_starter <130.246.218.157:58225>
12/12/13 16:51:55 (284063.0) (10137): JobLeaseDuration remaining: EXPIRED!
12/12/13 16:51:55 (284063.0) (10137): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (7200 seconds) expired
12/12/13 16:51:55 (284063.0) (10137): **** condor_shadow (condor_SHADOW) pid 10137 EXITING WITH STATUS 107




-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Brian Bockelman
Sent: 12 December 2013 23:29
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] network loss between schedds and worker nodes

Hi Andrew,

I poked around and saw nothing obvious that could cause it.  condor_read's error message doesn't log the timeout used, but I don't see any way that the call could have blocked for multiple hours (hence preventing the lease renewal messages from going out the door).

Can you reveal more of the ShadowLog?  What other messages do you see?

Can you replicate this under "laboratory conditions"?

Brian

On Dec 12, 2013, at 4:14 PM, andrew.lahiff@xxxxxxxxxx wrote:

> Hi,
> 
> We currently have JobLeaseDuration set to 7200 so that if a machine running a schedd needs to be rebooted or is down for other reasons we have some time before jobs would be lost (the default wasn't long enough). This works fine and we clearly see the "Attempting to locate disconnected starter..." and "Reconnect SUCCESS: connection re-established" messages in ShadowLog.
> 
> However, today we had our first experience of a network break between 2 schedds and all the worker nodes. It lasted about 30 minutes. Two hours after the beginning of the break, we had lots of messages like this (*) in ShadowLog, and the running jobs therefore failed from the users' point of view. It seems that the shadows didn't notice that the network had returned, and then they all exited after the JobLeaseDuration had expired.
> 
> I assume that if we had done a "service condor restart" on the schedd machines after networking was restored then everything would have been fine, i.e. new shadows would have been created which would then connect to the starters. In general though, is there any way to ensure that a shadow will detect the loss and return of the connection to a starter and automatically reconnect, provided some time limit hasn't been exceeded, of course?
> 
> Thanks,
> Andrew.
> 
> (*)
> 12/12/13 16:52:05 (281379.0) (9670): condor_read() failed: recv(fd=3) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from starter at <130.246.219.76:54959>.
> 12/12/13 16:52:05 (281379.0) (9670): IO: Failed to read packet header
> 12/12/13 16:52:05 (281379.0) (9670): Can no longer talk to condor_starter <130.246.219.76:54959>
> 12/12/13 16:52:05 (281379.0) (9670): JobLeaseDuration remaining: EXPIRED!
> 12/12/13 16:52:05 (281379.0) (9670): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (7200 seconds) expired
> 12/12/13 16:52:05 (281379.0) (9670): **** condor_shadow (condor_SHADOW) pid 9670 EXITING WITH STATUS 107


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/