[HTCondor-users] MasterLog: "condor_read(): timeout reading 5 bytes": ignore or bad news?



Hi,

Jobs submitted to our HTCondor pool currently stay idle indefinitely, and I am trying to figure out what is wrong. Nothing seems to have changed on the pool PCs or the network, so I suspect a configuration problem on the HTCondor Master PC.

When I start HTCondor (on Fedora 20 Linux), the MasterLog file shows lines such as "condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:46834>", where the IP address is that of the HTCondor Master PC itself:

03/29/14 17:15:53 ******************************************************
03/29/14 17:15:53 ** condor_master (CONDOR_MASTER) STARTING UP
03/29/14 17:15:53 ** /usr/sbin/condor_master
03/29/14 17:15:53 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
03/29/14 17:15:53 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
03/29/14 17:15:53 ** $CondorVersion: 8.1.1 Oct 25 2013 BuildID: RH-8.1.1-0.3.fc20 $
03/29/14 17:15:53 ** $CondorPlatform: I686-Fedora_20 $
03/29/14 17:15:53 ** PID = 1213
03/29/14 17:15:53 ** Log last touched 3/29 17:15:53
03/29/14 17:15:53 ******************************************************
03/29/14 17:15:53 Using config source: /etc/condor/condor_config
03/29/14 17:15:53 Using local config sources:
03/29/14 17:15:53    /etc/condor/config.d/00personal_condor.config
03/29/14 17:15:53    /etc/condor/config.d/90skku_condor.config
03/29/14 17:15:53 CLASSAD_CACHING is ENABLED
03/29/14 17:15:53 DaemonCore: command socket at <xxx.xxx.140.72:50402>
03/29/14 17:15:53 DaemonCore: private command socket at <xxx.xxx.140.72:50402>
03/29/14 17:15:53 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1382718547)
03/29/14 17:15:53 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 1215
03/29/14 17:15:53 Waiting for /var/log/condor/.collector_address to appear.
03/29/14 17:17:14 condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:46834>.
03/29/14 17:17:14 IO: Failed to read packet header
03/29/14 17:17:14 Failed to read ChildAlive packet (1)
03/29/14 17:17:14 Found /var/log/condor/.collector_address.
03/29/14 17:17:14 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 1225
03/29/14 17:17:14 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 1226
03/29/14 17:17:34 condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:52388>.
03/29/14 17:17:34 IO: Failed to read packet header
03/29/14 17:17:34 Failed to read ChildAlive packet (1)
03/29/14 17:17:54 condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:49466>.
03/29/14 17:17:54 IO: Failed to read packet header
03/29/14 17:17:54 Failed to read ChildAlive packet (1)
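
To put numbers on the excerpt above, here is a minimal Python sketch that measures how long the master waited for .collector_address and how often the read timeouts recur. The function names (parse_master_log, summarize) and the SAMPLE string are made up for illustration, and the timestamp format is only inferred from the lines shown here; it summarizes the log and does not diagnose the cause.

```python
import re
from datetime import datetime

# Timestamp layout inferred from the excerpt: "MM/DD/YY HH:MM:SS message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_master_log(lines):
    """Yield (timestamp, message) for each line that starts with a timestamp."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
            yield stamp, match.group(2)


def summarize(lines):
    """Summarize the collector-address wait and the condor_read() timeouts."""
    waiting_since = None
    collector_wait = None
    timeouts = []
    child_alive_failures = 0
    for stamp, msg in parse_master_log(lines):
        if msg.startswith("Waiting for") and ".collector_address" in msg:
            waiting_since = stamp
        elif msg.startswith("Found") and ".collector_address" in msg:
            if waiting_since is not None:
                collector_wait = (stamp - waiting_since).total_seconds()
        elif msg.startswith("condor_read(): timeout"):
            timeouts.append(stamp)
        elif msg.startswith("Failed to read ChildAlive packet"):
            child_alive_failures += 1
    # Seconds between consecutive timeout messages
    gaps = [(b - a).total_seconds() for a, b in zip(timeouts, timeouts[1:])]
    return {
        "collector_wait_seconds": collector_wait,
        "timeout_count": len(timeouts),
        "timeout_gaps_seconds": gaps,
        "child_alive_failures": child_alive_failures,
    }


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = """\
03/29/14 17:15:53 Waiting for /var/log/condor/.collector_address to appear.
03/29/14 17:17:14 condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:46834>.
03/29/14 17:17:14 IO: Failed to read packet header
03/29/14 17:17:14 Failed to read ChildAlive packet (1)
03/29/14 17:17:14 Found /var/log/condor/.collector_address.
03/29/14 17:17:34 condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:52388>.
03/29/14 17:17:34 IO: Failed to read packet header
03/29/14 17:17:34 Failed to read ChildAlive packet (1)
03/29/14 17:17:54 condor_read(): timeout reading 5 bytes from <xxx.xxx.140.72:49466>.
03/29/14 17:17:54 IO: Failed to read packet header
03/29/14 17:17:54 Failed to read ChildAlive packet (1)
"""

result = summarize(SAMPLE.splitlines())
```

On these lines the sketch reports an 81-second wait for .collector_address and three timeouts spaced 20 seconds apart. To run it on the real file, pass an open file handle to summarize(), for example open("/var/log/condor/MasterLog"); that path is an assumption based on the log directory shown above.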


Does this indicate trouble, and could it be a hint as to why jobs cannot be executed on the pool PCs?

Thanks!

Rob.