
Re: [HTCondor-users] network loss between schedds and worker nodes



Hi Brian,

It does seem to be reproducible [1]. I first ran "service condor restart" just before 13:59 to simulate the CE reboot we had (I haven't yet tried without this step, so it may not be important). I then ran "/sbin/ifdown eth0" on the test CE at about 14:05, followed by "/sbin/ifup eth0" at 14:30. The "Buf::write(): condor_write() failed" messages are still appearing even though the network has been back to normal since then.
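
In script form, the test was roughly the following (a sketch; the sleep lengths only approximate the timings above):

# Run on the test CE; simulates a reboot followed by a ~25-minute
# network outage.
service condor restart    # just before 13:59, standing in for the CE reboot
sleep 360                 # wait until ~14:05
/sbin/ifdown eth0         # take the CE off the network
sleep 1500                # outage lasts until ~14:30
/sbin/ifup eth0           # restore connectivity
# The "Buf::write(): condor_write() failed" messages keep appearing
# in the worker node's StarterLog even after the interface is back up.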

Regards,
Andrew.

[1]
12/13/13 13:57:24 ******************************************************
12/13/13 13:57:24 ** condor_starter (CONDOR_STARTER) STARTING UP
12/13/13 13:57:24 ** /usr/sbin/condor_starter
12/13/13 13:57:24 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
12/13/13 13:57:24 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
12/13/13 13:57:24 ** $CondorVersion: 8.0.2 Aug 15 2013 BuildID: 162062 $
12/13/13 13:57:24 ** $CondorPlatform: x86_64_RedHat6 $
12/13/13 13:57:24 ** PID = 3267
12/13/13 13:57:24 ** Log last touched 12/3 11:44:18
12/13/13 13:57:24 ******************************************************
12/13/13 13:57:24 Using config source: /etc/condor/condor_config
12/13/13 13:57:24 Using local config sources:
12/13/13 13:57:24    /etc/condor/config.d/10security.conf
12/13/13 13:57:24    /etc/condor/config.d/10security.config
12/13/13 13:57:24    /etc/condor/config.d/20wn.config
12/13/13 13:57:24    /etc/condor/condor_config.local
12/13/13 13:57:24 DaemonCore: command socket at <130.246.216.5:48547>
12/13/13 13:57:24 DaemonCore: private command socket at <130.246.216.5:48547>
12/13/13 13:57:24 Communicating with shadow <130.246.223.18:52450?noUDP>
12/13/13 13:57:24 Submitting machine is "lcgvm-ui01.gridpp.rl.ac.uk"
12/13/13 13:57:24 setting the orig job name in starter
12/13/13 13:57:24 setting the orig job iwd in starter
12/13/13 13:57:24 Done setting resource limits
12/13/13 13:57:24 File transfer completed successfully.
12/13/13 13:57:25 Job 1468.0 set to execute immediately
12/13/13 13:57:25 Starting a VANILLA universe job with ID: 1468.0
12/13/13 13:57:25 IWD: /pool/condor/dir_3267
12/13/13 13:57:25 Output file: /pool/condor/dir_3267/_condor_stdout
12/13/13 13:57:25 Error file: /pool/condor/dir_3267/_condor_stderr
12/13/13 13:57:25 Renice expr "10" evaluated to 10
12/13/13 13:57:25 About to exec /pool/condor/dir_3267/condor_exec.exe
12/13/13 13:57:25 Setting job's virtual memory rlimit to 0 megabytes
12/13/13 13:57:25 Running job as user alahiff
12/13/13 13:57:25 Create_Process succeeded, pid=3271
12/13/13 13:59:12 Accepted request to reconnect from <130.246.223.18:56974>
12/13/13 13:59:12 Ignoring old shadow <130.246.223.18:52450?noUDP>
12/13/13 13:59:12 Communicating with shadow <130.246.223.18:41326?noUDP>
12/13/13 14:07:35 condor_read(): timeout reading 21 bytes from <130.246.223.18:43828>.
12/13/13 14:07:35 IO: Failed to read packet header
12/13/13 14:22:36 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10, errno=113 No route to host
12/13/13 14:22:36 Buf::write(): condor_write() failed
12/13/13 14:34:37 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10
12/13/13 14:34:37 Buf::write(): condor_write() failed
12/13/13 14:41:49 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10
12/13/13 14:41:49 Buf::write(): condor_write() failed
12/13/13 14:46:49 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10
12/13/13 14:46:49 Buf::write(): condor_write() failed
12/13/13 14:51:50 condor_write(): Socket closed when trying to write 298 bytes to <130.246.223.18:43828>, fd is 10
12/13/13 14:51:50 Buf::write(): condor_write() failed


-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Brian Bockelman
Sent: 13 December 2013 14:00
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] network loss between schedds and worker nodes

Hi Andrew,

I can't see anything obvious in the code or in these logs that would have led to this failure scenario.

I'm afraid someone more familiar with the shadow<->starter protocol may have to step in here... Dan?

Brian

PS - we really should log at D_ALWAYS when keepalive packets are sent; otherwise there's not much of a hint as to what the shadow is up to.
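
In the meantime, turning up the shadow and starter debug levels should at least expose what is happening on that socket. A minimal sketch, assuming a local config drop-in (the 99debug.config name is just an example); SHADOW_DEBUG takes effect on the submit side, STARTER_DEBUG on the worker nodes:

# Append verbose-logging knobs to a local config file and reconfigure.
cat >> /etc/condor/config.d/99debug.config <<'EOF'
SHADOW_DEBUG  = D_FULLDEBUG
STARTER_DEBUG = D_FULLDEBUG
EOF
condor_reconfig    # apply the new debug levels without restarting the daemons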
