[HTCondor-users] MasterLog: "condor_read(): timeout reading 5 bytes": ignore or bad news?

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Hi,

Jobs submitted to our HTCondor pool are currently stuck permanently in the idle state, and I am trying desperately to figure out what is wrong. There seems to have been no change in the pool PCs or the network, so I suspect a configuration problem on the HTCondor Master PC.

When I start HTCondor (on Linux, Fedora 20), the MasterLog file shows lines such as "condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:46834>", where the IP address is that of the HTCondor Master PC itself:

03/29/14 17:15:53 ******************************************************
03/29/14 17:15:53 ** condor_master (CONDOR_MASTER) STARTING UP
03/29/14 17:15:53 ** /usr/sbin/condor_master
03/29/14 17:15:53 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
03/29/14 17:15:53 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
03/29/14 17:15:53 ** $CondorVersion: 8.1.1 Oct 25 2013 BuildID: RH-8.1.1-0.3.fc20 $
03/29/14 17:15:53 ** $CondorPlatform: I686-Fedora_20 $
03/29/14 17:15:53 ** PID = 1213
03/29/14 17:15:53 ** Log last touched 3/29 17:15:53
03/29/14 17:15:53 ******************************************************
03/29/14 17:15:53 Using config source: /etc/condor/condor_config
03/29/14 17:15:53 Using local config sources:
03/29/14 17:15:53 /etc/condor/config.d/00personal_condor.config
03/29/14 17:15:53 /etc/condor/config.d/90skku_condor.config
03/29/14 17:15:53 CLASSAD_CACHING is ENABLED
03/29/14 17:15:53 DaemonCore: command socket at <xxx.xxx.140.72:50402>
03/29/14 17:15:53 DaemonCore: private command socket at <xxx.xxx.140.72:50402>
03/29/14 17:15:53 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1382718547)
03/29/14 17:15:53 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 1215
03/29/14 17:15:53 Waiting for /var/log/condor/.collector_address to appear.
03/29/14 17:17:14 condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:46834>.
03/29/14 17:17:14 IO: Failed to read packet header
03/29/14 17:17:14 Failed to read ChildAlive packet (1)
03/29/14 17:17:14 Found /var/log/condor/.collector_address.
03/29/14 17:17:14 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 1225
03/29/14 17:17:14 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 1226
03/29/14 17:17:34 condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:52388>.
03/29/14 17:17:34 IO: Failed to read packet header
03/29/14 17:17:34 Failed to read ChildAlive packet (1)
03/29/14 17:17:54 condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:49466>.
03/29/14 17:17:54 IO: Failed to read packet header
03/29/14 17:17:54 Failed to read ChildAlive packet (1)
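
In case it helps with reading the excerpt, here is a rough Python sketch that tallies the condor_read() timeouts and measures how long the master waited for the .collector_address file. It is not part of HTCondor: the function names (parse_lines, summarize) and the SAMPLE string are my own illustration, and the script assumes the MM/DD/YY HH:MM:SS timestamp layout seen in the log above.

```python
import re
from datetime import datetime

# Timestamp layout as it appears in the MasterLog excerpt (assumed MM/DD/YY HH:MM:SS).
STAMP = "%m/%d/%y %H:%M:%S"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
TIMEOUT_RE = re.compile(r"condor_read\(\): timeout reading (\d+) bytes from <([^>]+)>")


def parse_lines(text):
    """Yield (datetime, message) for each timestamped log line; skip the rest."""
    for raw in text.splitlines():
        m = LINE_RE.match(raw.strip())
        if m:
            yield datetime.strptime(m.group(1), STAMP), m.group(2)


def summarize(text):
    """Return (timeout events, seconds spent waiting for .collector_address)."""
    timeouts = []
    wait_start = wait_end = None
    for when, msg in parse_lines(text):
        t = TIMEOUT_RE.search(msg)
        if t:
            timeouts.append((when, t.group(2)))  # (timestamp, peer address)
        elif msg.startswith("Waiting for") and ".collector_address" in msg:
            wait_start = when
        elif msg.startswith("Found") and ".collector_address" in msg:
            wait_end = when
    wait = None
    if wait_start and wait_end:
        wait = (wait_end - wait_start).total_seconds()
    return timeouts, wait


# Hypothetical sample: a few lines copied from the excerpt above.
SAMPLE = """\
03/29/14 17:15:53 Waiting for /var/log/condor/.collector_address to appear.
03/29/14 17:17:14 condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:46834>.
03/29/14 17:17:14 Found /var/log/condor/.collector_address.
03/29/14 17:17:34 condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:52388>.
03/29/14 17:17:54 condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:49466>.
"""

if __name__ == "__main__":
    events, wait = summarize(SAMPLE)
    print(f"{len(events)} timeouts, collector address wait: {wait:.0f} s")
    for when, peer in events:
        print(when.strftime("%H:%M:%S"), peer)
```

On the lines above this counts three timeouts, spaced 20 seconds apart after the first, and an 81-second gap between "Waiting for" and "Found" for the collector address file. To run it on a real log, pass the contents of the MasterLog file to summarize() instead of SAMPLE.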

Does this indicate trouble, and could it be a hint as to why jobs are not being executed on the pool PCs?

Thanks!

Rob.
