
Re: [HTCondor-users] network loss between schedds and worker nodes



Hi Andrew,

An update --

After spending some time last night examining the code, I believe the tuning you found in the old Red Hat ticket would make this issue go away in most cases.  Something like this should work:

$ sysctl -a | grep net.ipv4.tcp_keepalive
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 6
net.ipv4.tcp_keepalive_time = 300

This would be a setting on the schedd node only, and it should have minimal impact: it would cause one TCP packet every five minutes while a connection is idle (and since STARTER_UPDATE_INTERVAL defaults to 5 minutes, it should cause zero operational impact when the starter is connected).
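
For reference, applying that tuning would look something like this on the schedd node; the /etc/sysctl.conf step is just so the values survive a reboot (exact values from above, adjust as you see fit):

# apply immediately (run as root on the schedd node)
sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.ipv4.tcp_keepalive_intvl=5
sysctl -w net.ipv4.tcp_keepalive_probes=6

# persist the same values across reboots
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 6
EOF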

I found in the code that JobLeaseDuration is updated each time the shadow hears from the starter.  [So the schedd and shadow have different semantics for the same attribute -- this is what confused me.]  The starter periodically sends ClassAd updates to the shadow, which should reset the shadow's JobLeaseDuration timer -- assuming the TCP connection is alive.

Hence, the whole issue boils down to having the shadow detect dead TCP connections in a timely manner -- and the schedd-side keepalive tuning should keep you above water until the code is updated.
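
As a quick sanity check once the tuning is in place, you can watch the keepalive timer on the shadow's connection via netstat's timer column.  Using the shadow socket from your earlier netstat output as an example (substitute whatever port netstat -anp shows for your shadow):

[root@lcgvm-ui01 ~]# netstat -anpo | grep 52103

The -o flag adds a Timer column; assuming the shadow has SO_KEEPALIVE enabled on that socket (which the roughly two-hour timeout you observed suggests it does), the ESTABLISHED line should show a "keepalive (.../0/0)" timer counting down from 300 seconds.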

Brian

On Dec 14, 2013, at 9:15 AM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:

> Hi Andrew,
> 
> This logging is enough to let me figure it out.
> 
> 1) The starter and shadow are disconnected on the network.
> 2) The starter continues to send periodic ClassAd updates to the shadow, but these fail.  The shadow does not notice the TCP socket is dead.
> 3) In the meantime, the startd sends keepalive packets to the schedd (except for the period where the disconnect occurred).  This causes the schedd to keep the claim alive.
>  - NOTE!  The schedd does not inform the shadow it has an updated JobLeaseDuration.
> 4) After slightly more than 2 hours, the default TCP socket keepalive policy causes the shadow to notice its socket is dead.
>  - Because the schedd does not inform the shadow that it has received the keepalive from the startd, the shadow doesn't realize the job lease is still valid.
> 5) Because it thinks the lease is expired, the shadow exits with code 107 (JOB_NOT_CKPTED).  The schedd restarts the job.
> 
> So it seems that HTCondor can't survive this particular temporary network disconnect between schedd and worker nodes!  This behavior has existed since at least 7.5.6.
> 
> To replicate this, you must have one of the following two conditions:
> (a) The starter and the shadow disagree on the state of the TCP socket.
>  - This can happen if the starter tried to write into the socket while the network was out (causing the starter to realize the network is dead but not the shadow).
>  - As it requires two nodes to duplicate, it's going to be a pain to write a test case. :)
> (b) Any network issue after the job has run for more than JobLeaseDuration.
> 
> Luckily, there are two fairly simple ways to fix this:
> 1) Have the schedd inform the shadow when the job lease is updated.  This is fairly simple (make JobLeaseDuration a dirty attribute), but there may be scalability issues to think about.  It may be better to introduce a new command between the schedd and shadow specific to lease updates.
> 2) Set TCP keepalive on the shadow socket to be on the order of JobLeaseDuration / 2.
> 
> Both items could probably be backported to the stable series (especially if we take the simpler route for item (1)).
> 
> Thanks again for the debug logs!  Otherwise, this would have been hard to track down...
> 
> Brian
> 
> On Dec 14, 2013, at 4:08 AM, <andrew.lahiff@xxxxxxxxxx> wrote:
> 
>> Hi,
>> 
>> I did another test, this time with D_ALL set for everything. I've put schedd, shadow, startd and starter logs here:
>> 
>> /afs/cern.ch/user/a/alahiff/public/condor
>> 
>> For this test, all I did was submit one job, shut down networking on the CE (i.e. the machine running the schedd) after the job started running, and restart networking 30 minutes later. The job eventually dies when the job lease expires (and is then restarted). With D_ALL there is more information about condor_write failing on the starter [1], but from the ShadowLog it's not obvious to me that the shadow has even noticed that it can't communicate with the worker node, or that it eventually can again.
>> 
>> I'm not sure if it's related to this old ticket about the shadow not detecting disconnected jobs:
>> 
>> https://bugzilla.redhat.com/show_bug.cgi?id=645030
>> 
>> During the period that the condor_write calls were failing after networking had returned, netstat gave this:
>> 
>> [root@lcgvm-ui01 ~]# netstat -anp | grep 52103
>> tcp        0      0 130.246.223.18:52103        130.246.216.5:60837         ESTABLISHED 34206/condor_shadow 
>> 
>> [root@lcg0733 ~]# netstat -anp | grep 60837
>> tcp        0      0 0.0.0.0:60837               0.0.0.0:*                   LISTEN      30114/condor_startd 
>> udp        0      0 0.0.0.0:60837               0.0.0.0:*                               30114/condor_startd 
>> 
>> Regards,
>> Andrew.
>> 
>> [1]
>> 12/14/13 09:11:13 Calling Timer handler 10 (JobInfoCommunicator::periodicJobUpdateTimerHandler)
>> 12/14/13 09:11:13 Initializing Directory: curr_dir = /pool/condor/dir_30417
>> 12/14/13 09:11:13 In VanillaProc::PublishUpdateAd()
>> 12/14/13 09:11:13 Inside OsProc::PublishUpdateAd()
>> 12/14/13 09:11:13 Inside UserProc::PublishUpdateAd()
>> 12/14/13 09:11:13 Entering JICShadow::updateShadow()
>> 12/14/13 09:11:13 Initializing Directory: curr_dir = /pool/condor/dir_30417
>> 12/14/13 09:11:13 In VanillaProc::PublishUpdateAd()
>> 12/14/13 09:11:13 Inside OsProc::PublishUpdateAd()
>> 12/14/13 09:11:13 Inside UserProc::PublishUpdateAd()
>> 12/14/13 09:11:13 condor_write(): Socket closed when trying to write 299 bytes to <130.246.223.18:52103>, fd is 6
>> 12/14/13 09:11:13 Buf::write(): condor_write() failed
>> 12/14/13 09:11:13 Sent job ClassAd update to startd.
>> 12/14/13 09:11:13 JICShadow::updateShadow(): failed to send update
>> 12/14/13 09:11:13 Return from Timer handler 10 (JobInfoCommunicator::periodicJobUpdateTimerHandler) - took 0.001s
>> 
>> 
>> 
>> -----Original Message-----
>> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of andrew.lahiff@xxxxxxxxxx
>> Sent: 13 December 2013 15:00
>> To: htcondor-users@xxxxxxxxxxx
>> Subject: Re: [HTCondor-users] network loss between schedds and worker nodes
>> 
>> Hi Brian,
>> 
>> It does seem to be reproducible [1]. I did a "service condor restart" first just before 13:59 to simulate the CE reboot that we had (I haven't tried yet without this, so it might not be important). I then did a "/sbin/ifdown eth0" on the test CE at about 14:05 and then "/sbin/ifup eth0" at 14:30. The "Buf::write(): condor_write() failed" messages are still appearing even though the network is back to normal.
>> 
>> Regards,
>> Andrew.
>> 
>> [1]
>> 12/13/13 13:57:24 ******************************************************
>> 12/13/13 13:57:24 ** condor_starter (CONDOR_STARTER) STARTING UP
>> 12/13/13 13:57:24 ** /usr/sbin/condor_starter
>> 12/13/13 13:57:24 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
>> 12/13/13 13:57:24 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
>> 12/13/13 13:57:24 ** $CondorVersion: 8.0.2 Aug 15 2013 BuildID: 162062 $
>> 12/13/13 13:57:24 ** $CondorPlatform: x86_64_RedHat6 $
>> 12/13/13 13:57:24 ** PID = 3267
>> 12/13/13 13:57:24 ** Log last touched 12/3 11:44:18
>> 12/13/13 13:57:24 ******************************************************
>> 12/13/13 13:57:24 Using config source: /etc/condor/condor_config
>> 12/13/13 13:57:24 Using local config sources:
>> 12/13/13 13:57:24    /etc/condor/config.d/10security.conf
>> 12/13/13 13:57:24    /etc/condor/config.d/10security.config
>> 12/13/13 13:57:24    /etc/condor/config.d/20wn.config
>> 12/13/13 13:57:24    /etc/condor/condor_config.local
>> 12/13/13 13:57:24 DaemonCore: command socket at <130.246.216.5:48547>
>> 12/13/13 13:57:24 DaemonCore: private command socket at <130.246.216.5:48547>
>> 12/13/13 13:57:24 Communicating with shadow <130.246.223.18:52450?noUDP>
>> 12/13/13 13:57:24 Submitting machine is "lcgvm-ui01.gridpp.rl.ac.uk"
>> 12/13/13 13:57:24 setting the orig job name in starter
>> 12/13/13 13:57:24 setting the orig job iwd in starter
>> 12/13/13 13:57:24 Done setting resource limits
>> 12/13/13 13:57:24 File transfer completed successfully.
>> 12/13/13 13:57:25 Job 1468.0 set to execute immediately
>> 12/13/13 13:57:25 Starting a VANILLA universe job with ID: 1468.0
>> 12/13/13 13:57:25 IWD: /pool/condor/dir_3267
>> 12/13/13 13:57:25 Output file: /pool/condor/dir_3267/_condor_stdout
>> 12/13/13 13:57:25 Error file: /pool/condor/dir_3267/_condor_stderr
>> 12/13/13 13:57:25 Renice expr "10" evaluated to 10
>> 12/13/13 13:57:25 About to exec /pool/condor/dir_3267/condor_exec.exe
>> 12/13/13 13:57:25 Setting job's virtual memory rlimit to 0 megabytes
>> 12/13/13 13:57:25 Running job as user alahiff
>> 12/13/13 13:57:25 Create_Process succeeded, pid=3271
>> 12/13/13 13:59:12 Accepted request to reconnect from <130.246.223.18:56974>
>> 12/13/13 13:59:12 Ignoring old shadow <130.246.223.18:52450?noUDP>
>> 12/13/13 13:59:12 Communicating with shadow <130.246.223.18:41326?noUDP>
>> 12/13/13 14:07:35 condor_read(): timeout reading 21 bytes from <130.246.223.18:43828>.
>> 12/13/13 14:07:35 IO: Failed to read packet header
>> 12/13/13 14:22:36 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10, errno=113 No route to host
>> 12/13/13 14:22:36 Buf::write(): condor_write() failed
>> 12/13/13 14:34:37 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10
>> 12/13/13 14:34:37 Buf::write(): condor_write() failed
>> 12/13/13 14:41:49 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10
>> 12/13/13 14:41:49 Buf::write(): condor_write() failed
>> 12/13/13 14:46:49 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10
>> 12/13/13 14:46:49 Buf::write(): condor_write() failed
>> 12/13/13 14:51:50 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10
>> 12/13/13 14:51:50 Buf::write(): condor_write() failed
>> 
>> 
>> -----Original Message-----
>> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Brian Bockelman
>> Sent: 13 December 2013 14:00
>> To: HTCondor-Users Mail List
>> Subject: Re: [HTCondor-users] network loss between schedds and worker nodes
>> 
>> Hi Andrew,
>> 
>> I can't see anything obvious in the code or in these logs that would have led to this failure scenario.
>> 
>> I'm afraid someone more familiar with the shadow<->starter protocol may have to step in here... Dan?
>> 
>> Brian
>> 
>> PS - we probably should log at D_ALWAYS when keepalive packets are sent.  Otherwise, there's not much of a hint as to what the shadow is up to.
>> 
> 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/