[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] network loss between schedds and worker nodes



Hi,

I did another test, this time with D_ALL set for everything. I've put schedd, shadow, startd and starter logs here:

/afs/cern.ch/user/a/alahiff/public/condor

For this test all I did was submit 1 job, after it started running shutdown networking on the CE (i.e. machine running the schedd), and restarted the networking 30 minutes later. The job eventually dies when the job lease expires (and it's then restarted). With D_ALL there is more information about the condor_write failing on the starter [1], but from the ShadowLog it's not obvious to me that it's even noticed that it can't communicate with the worker node and then eventually can.

I'm not sure if it's related to this old ticket about the shadow not detecting disconnected jobs:

https://bugzilla.redhat.com/show_bug.cgi?id=645030

During the period that the condor_write's were failing after networking had returned, netstat gave this:

[root@lcgvm-ui01 ~]# netstat -anp | grep 52103
tcp        0      0 130.246.223.18:52103        130.246.216.5:60837         ESTABLISHED 34206/condor_shadow 

[root@lcg0733 ~]# netstat -anp | grep 60837
tcp        0      0 0.0.0.0:60837               0.0.0.0:*                   LISTEN      30114/condor_startd 
udp        0      0 0.0.0.0:60837               0.0.0.0:*                               30114/condor_startd 

Regards,
Andrew.

[1]
12/14/13 09:11:13 Calling Timer handler 10 (JobInfoCommunicator::periodicJobUpdateTimerHandler)
12/14/13 09:11:13 Initializing Directory: curr_dir = /pool/condor/dir_30417
12/14/13 09:11:13 In VanillaProc::PublishUpdateAd()
12/14/13 09:11:13 Inside OsProc::PublishUpdateAd()
12/14/13 09:11:13 Inside UserProc::PublishUpdateAd()
12/14/13 09:11:13 Entering JICShadow::updateShadow()
12/14/13 09:11:13 Initializing Directory: curr_dir = /pool/condor/dir_30417
12/14/13 09:11:13 In VanillaProc::PublishUpdateAd()
12/14/13 09:11:13 Inside OsProc::PublishUpdateAd()
12/14/13 09:11:13 Inside UserProc::PublishUpdateAd()
12/14/13 09:11:13 condor_write(): Socket closed when trying to write 299 bytes to <130.246.223.18:52103>, fd is 6
12/14/13 09:11:13 Buf::write(): condor_write() failed
12/14/13 09:11:13 Sent job ClassAd update to startd.
12/14/13 09:11:13 JICShadow::updateShadow(): failed to send update
12/14/13 09:11:13 Return from Timer handler 10 (JobInfoCommunicator::periodicJobUpdateTimerHandler) - took 0.001s



-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of andrew.lahiff@xxxxxxxxxx
Sent: 13 December 2013 15:00
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] network loss between schedds and worker nodes

Hi Brian,

It does seem to be reproducible [1]. I did a "service condor restart" first just before 13:59 to simulate the CE reboot that we had (I haven't tried yet without this, so it might not be important). I then did a "/sbin/ifdown eth0" on the test CE at about 14:05 and then "/sbin/ifup eth0" at 14:30. The "Buf::write(): condor_write() failed" messages are still appearing even though the network is back to normal.

Regards,
Andrew.

[1]
12/13/13 13:57:24 ******************************************************
12/13/13 13:57:24 ** condor_starter (CONDOR_STARTER) STARTING UP
12/13/13 13:57:24 ** /usr/sbin/condor_starter
12/13/13 13:57:24 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
12/13/13 13:57:24 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
12/13/13 13:57:24 ** $CondorVersion: 8.0.2 Aug 15 2013 BuildID: 162062 $
12/13/13 13:57:24 ** $CondorPlatform: x86_64_RedHat6 $
12/13/13 13:57:24 ** PID = 3267
12/13/13 13:57:24 ** Log last touched 12/3 11:44:18
12/13/13 13:57:24 ******************************************************
12/13/13 13:57:24 Using config source: /etc/condor/condor_config
12/13/13 13:57:24 Using local config sources:
12/13/13 13:57:24    /etc/condor/config.d/10security.conf
12/13/13 13:57:24    /etc/condor/config.d/10security.config
12/13/13 13:57:24    /etc/condor/config.d/20wn.config
12/13/13 13:57:24    /etc/condor/condor_config.local
12/13/13 13:57:24 DaemonCore: command socket at <130.246.216.5:48547>
12/13/13 13:57:24 DaemonCore: private command socket at <130.246.216.5:48547>
12/13/13 13:57:24 Communicating with shadow <130.246.223.18:52450?noUDP>
12/13/13 13:57:24 Submitting machine is "lcgvm-ui01.gridpp.rl.ac.uk"
12/13/13 13:57:24 setting the orig job name in starter
12/13/13 13:57:24 setting the orig job iwd in starter
12/13/13 13:57:24 Done setting resource limits
12/13/13 13:57:24 File transfer completed successfully.
12/13/13 13:57:25 Job 1468.0 set to execute immediately
12/13/13 13:57:25 Starting a VANILLA universe job with ID: 1468.0
12/13/13 13:57:25 IWD: /pool/condor/dir_3267
12/13/13 13:57:25 Output file: /pool/condor/dir_3267/_condor_stdout
12/13/13 13:57:25 Error file: /pool/condor/dir_3267/_condor_stderr
12/13/13 13:57:25 Renice expr "10" evaluated to 10
12/13/13 13:57:25 About to exec /pool/condor/dir_3267/condor_exec.exe
12/13/13 13:57:25 Setting job's virtual memory rlimit to 0 megabytes
12/13/13 13:57:25 Running job as user alahiff
12/13/13 13:57:25 Create_Process succeeded, pid=3271
12/13/13 13:59:12 Accepted request to reconnect from <130.246.223.18:56974>
12/13/13 13:59:12 Ignoring old shadow <130.246.223.18:52450?noUDP>
12/13/13 13:59:12 Communicating with shadow <130.246.223.18:41326?noUDP>
12/13/13 14:07:35 condor_read(): timeout reading 21 bytes from <130.246.223.18:43828>.
12/13/13 14:07:35 IO: Failed to read packet header
12/13/13 14:22:36 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10, errno=113 No route to host
12/13/13 14:22:36 Buf::write(): condor_write() failed
12/13/13 14:34:37 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10
12/13/13 14:34:37 Buf::write(): condor_write() failed
12/13/13 14:41:49 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10
12/13/13 14:41:49 Buf::write(): condor_write() failed
12/13/13 14:46:49 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10
12/13/13 14:46:49 Buf::write(): condor_write() failed
12/13/13 14:51:50 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10
12/13/13 14:51:50 Buf::write(): condor_write() failed


-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Brian Bockelman
Sent: 13 December 2013 14:00
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] network loss between schedds and worker nodes

Hi Andrew,

I can't see anything obvious in the code or in these logs that would have led to this failure scenario.

I'm afraid someone more familiar with the shadow<->starter protocol may have to step in here... Dan?

Brian

PS - we probably really should log when keepalive packets are sent at D_ALWAYS.  Otherwise, there's not too much of a hint as to what the shadow is up to.

-- 
Scanned by iCritical.

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
-- 
Scanned by iCritical.