
Re: [HTCondor-users] protocol error in collector after housekeeping



Hi Klint,

I've seen this error message type in the past when I've accidentally appended the port to the address a second time.
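
For illustration, the doubled-port mistake can be caught with a quick check like the sketch below. The helper names are hypothetical, and it assumes a plain IPv4 address or hostname (IPv6 literals contain colons and would need different handling). It is not part of HTCondor.

```python
import re


def count_trailing_ports(address: str) -> int:
    """Count how many ':<digits>' groups sit at the end of an address string."""
    match = re.search(r"(?::\d+)+$", address)
    return 0 if match is None else match.group(0).count(":")


def has_duplicate_port(address: str) -> bool:
    """Return True when a port appears to have been appended more than once."""
    return count_trailing_ports(address) > 1


if __name__ == "__main__":
    # Illustrative values only; the second one shows the doubled-port mistake.
    for value in ("10.1.1.55", "10.1.1.55:9618", "10.1.1.55:9618:9618"):
        print(value, "-> duplicate port:", has_duplicate_port(value))
```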

However, your CONDOR_HOST setting looks okay.

Could you run the following:

condor_config_val -master CONDOR_HOST

condor_config_val -v CONDOR_HOST

I think we can ignore the "Connection refused" error for the moment. My guess is that the master doesn't yet know the collector is dead, so it is still trying to send an update. (That sounds like a bug in itself, really.)

Could you bump up the debugging?

MASTER_DEBUG = D_FULLDEBUG
COLLECTOR_DEBUG = D_FULLDEBUG

Cheers, Iain
________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Klint Gore [kgore4@xxxxxxxxxx]
Sent: 27 June 2016 08:25
To: HTCondor-Users Mail List
Subject: [HTCondor-users] protocol error in collector after housekeeping

Can someone point me to what could be causing the following to happen?  Everything is condor 8.4.7 from the wisc.edu repo.  The MasterLog started throwing this message when I upgraded from 8.2.8 to 8.4.7. There’s an "Unknown protocol" error in the collector log whose timestamps seem to correspond to the master's errors.  It always comes right after the housekeeper's "Done cleaning" message.

----- master log
06/27/16 13:42:34 condor_write(): Socket closed when trying to write 1276 bytes to collector 10.1.1.55, fd is 11
06/27/16 13:42:34 Buf::write(): condor_write() failed
06/27/16 13:42:34 attempt to connect to <10.1.1.55:9618> failed: Connection refused (connect errno = 111).
06/27/16 13:42:34 ERROR: SECMAN:2003:TCP connection to collector 10.1.1.55 failed.
06/27/16 13:42:34 Failed to start non-blocking update to <10.1.1.55:9618>.
06/27/16 13:42:45 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 46684
06/27/16 13:57:46 DefaultReaper unexpectedly called on pid 46684, status 1024.
06/27/16 13:57:46 The COLLECTOR (pid 46684) exited with status 4
06/27/16 13:57:46 Sending obituary for "/usr/sbin/condor_collector"
06/27/16 13:57:47 restarting /usr/sbin/condor_collector in 10 seconds
06/27/16 13:57:47 condor_write(): Socket closed when trying to write 1277 bytes to collector 10.1.1.55, fd is 11
06/27/16 13:57:47 Buf::write(): condor_write() failed
06/27/16 13:57:47 attempt to connect to <10.1.1.55:9618> failed: Connection refused (connect errno = 111).
06/27/16 13:57:47 ERROR: SECMAN:2003:TCP connection to collector 10.1.1.55 failed.
06/27/16 13:57:47 Failed to start non-blocking update to <10.1.1.55:9618>.
06/27/16 13:57:57 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 46794
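
The two numbers in the master log ("status 1024" and "exited with status 4") describe the same event: 1024 is the raw wait status, and the exit code lives in its second byte. A minimal sketch of the decoding, assuming the usual Linux wait-status layout and ignoring stopped/continued states (the function name is hypothetical):

```python
def decode_wait_status(status: int) -> dict:
    """Decode a raw wait status, assuming the common Linux bit layout.

    Low 7 bits: terminating signal (0 means a normal exit).
    Bits 8-15:  exit code, meaningful only for a normal exit.
    Stopped/continued states are deliberately not handled here.
    """
    signal_number = status & 0x7F
    if signal_number == 0:
        return {"kind": "exited", "code": (status >> 8) & 0xFF}
    return {"kind": "signaled", "signal": signal_number}


if __name__ == "__main__":
    # 1024 is the value reported by DefaultReaper in the master log above.
    print(decode_wait_status(1024))
```

On a Unix system, Python's `os.WIFEXITED` and `os.WEXITSTATUS` perform the same decoding.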

----- collector log
06/27/16 13:42:00 Query info: matched=89; skipped=9; query_time=0.003030; send_time=0.031904; type=Any; requirements={( ( ( MyType == "Scheduler" ) || (
06/27/16 13:42:33 Housekeeper:  Ready to clean old ads
06/27/16 13:42:33       Cleaning StartdAds ...
06/27/16 13:42:33       Cleaning StartdPrivateAds ...
06/27/16 13:42:33       Cleaning ScheddAds ...
06/27/16 13:42:33       Cleaning SubmittorAds ...
06/27/16 13:42:33       Cleaning LicenseAds ...
06/27/16 13:42:33       Cleaning MasterAds ...
06/27/16 13:42:33       Cleaning CkptServerAds ...
06/27/16 13:42:33       Cleaning CollectorAds ...
06/27/16 13:42:33       Cleaning StorageAds ...
06/27/16 13:42:33       Cleaning NegotiatorAds ...
06/27/16 13:42:33       Cleaning HadAds ...
06/27/16 13:42:33       Cleaning GridAds ...
06/27/16 13:42:33       Cleaning XferServiceAds ...
06/27/16 13:42:33       Cleaning LeaseManagerAds ...
06/27/16 13:42:33       Cleaning Generic Ads ...
06/27/16 13:42:33 Housekeeper:  Done cleaning
06/27/16 13:42:34 ERROR "Unknown protocol (1) in Sock::bind(); aborting." at line 741 in file /slots/01/dir_1114870/userdir/.tmpthm9vL/BUILD/condor-8.4.
06/27/16 13:42:45 Setting maximum file descriptors to 10240.
06/27/16 13:42:45 ******************************************************
06/27/16 13:42:45 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
06/27/16 13:42:45 ** /usr/sbin/condor_collector
06/27/16 13:42:45 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(3) class=DAEMON(1)
06/27/16 13:42:45 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
06/27/16 13:42:45 ** $CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $
06/27/16 13:42:45 ** $CondorPlatform: x86_64_RedHat7 $
06/27/16 13:42:45 ** PID = 46684
06/27/16 13:42:45 ** Log last touched 6/27 13:42:34
06/27/16 13:42:45 ******************************************************
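
To check whether the collector aborts and the master's write failures really do line up, the timestamps in the two logs can be compared mechanically. The sketch below is one way to do that, assuming every relevant line starts with an `MM/DD/YY HH:MM:SS` stamp as in the excerpts above. The function names, the search strings, and the 5-second window are illustrative choices.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%m/%d/%y %H:%M:%S"
STAMP_LENGTH = 17  # length of e.g. "06/27/16 13:42:34"


def parse_stamp(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    try:
        return datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
    except ValueError:
        return None


def matching_events(lines, needle):
    """Collect (timestamp, line) pairs for stamped lines containing needle."""
    events = []
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None and needle in line:
            events.append((stamp, line.rstrip()))
    return events


def correlate(collector_lines, master_lines, window_seconds=5):
    """Pair each collector abort with master write failures close in time."""
    window = timedelta(seconds=window_seconds)
    aborts = matching_events(collector_lines, "Unknown protocol")
    failures = matching_events(master_lines, "Socket closed")
    pairs = []
    for abort_time, abort_line in aborts:
        for failure_time, failure_line in failures:
            if abs(failure_time - abort_time) <= window:
                pairs.append((abort_line, failure_line))
    return pairs
```

Pass in the lines read from the CollectorLog and MasterLog files; each returned pair is a collector abort and a master write failure that occurred within the window.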


10.1.1.55 is the condor host, running CentOS 7. The firewall is active, but that interface is in the trusted zone.  SELinux is off.  It is a virtual machine on ESXi, if that makes any difference (4 CPUs, 4 GB memory).
Its config is

------------
CONDOR_HOST = 10.1.1.55
COLLECTOR_NAME          = AGBU
ALLOW_READ = 10.1.*
ALLOW_WRITE = 10.1.*
DEFAULT_DOMAIN_NAME = agbu.localdomain
NO_DNS = True
TRUST_UID_DOMAIN = True
BIND_ALL_INTERFACES = False
NETWORK_INTERFACE = 10.1.1.55

START = True
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
-------------------

My laptop is on the same network and it isn’t having any trouble maintaining an ssh connection to 10.1.1.55.  There are no entries in /var/log/messages indicating any issues.

Ideas anyone?


--
Klint Gore
Database Manager
Sheep CRC
A.G.B.U.
University of New England
Armidale NSW 2350

Ph: 02 6773 3789
Fax: 02 6773 3266
EMail: kgore4@xxxxxxxxxx