Re: [HTCondor-users] protocol error in collector after housekeeping



I’d just found that and tested it as your message came in.

[root@xxxxxxxxx condor]# condor_config_val -master CONDOR_DEVELOPERS_COLLECTOR

Not defined

Setting that to NONE stopped it crashing.

The name resolves to 128.105.19.35.  Does HTCondor use a library to look that up?  The machine is a minimal CentOS 7 install, so maybe there’s a library missing.

These machines don't have any access to the outside world anyway, so it’ll never connect.

Klint.

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
Sent: Monday, 27 June 2016 6:36 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] protocol error in collector after housekeeping

Hi Klint,

Looks like your collector machine has something bogus set up in the /etc/hosts file or DNS when resolving "condor.cs.wisc.edu". Could you investigate that for us?
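
One way to check what the machine's resolver returns for that name is a short Python sketch like the one below. It is only an illustration: the helper name resolve_all is made up here, port 9618 is taken from the collector log quoted further down, and the lookup goes through the system resolver via getaddrinfo, which may or may not match how HTCondor itself resolves the name.

```python
import socket


def resolve_all(host, port=9618):
    """Return sorted (address family, address) pairs from the system resolver.

    resolve_all is a hypothetical helper for this thread, not part of HTCondor.
    """
    results = set()
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(host, port):
        # sockaddr[0] is the textual address for both AF_INET and AF_INET6.
        results.add((socket.AddressFamily(family).name, sockaddr[0]))
    return sorted(results)


# Example call on the collector machine (raises socket.gaierror if the lookup fails):
#     print(resolve_all("condor.cs.wisc.edu"))
```

An empty or unexpected result, or an address family other than AF_INET or AF_INET6, would point at the hosts file or DNS setup rather than at HTCondor.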

 

Meanwhile, as an immediate workaround, you could perhaps avoid the problem by putting the following in the condor_config file on your central manager machine:

CONDOR_DEVELOPERS_COLLECTOR = NONE

Hope this helps,

Todd

Sent from my iPhone


On Jun 27, 2016, at 2:38 AM, Klint Gore <kgore4@xxxxxxxxxx> wrote:

Just in case

[root@xxxxxxxxx condor]# condor_config_val -v COLLECTOR_HOST
COLLECTOR_HOST = 10.1.1.55
# at: <Default>
# raw: COLLECTOR_HOST = $(CONDOR_HOST)


-----Original Message-----
From: Klint Gore
Sent: Monday, 27 June 2016 5:40 PM
To: HTCondor-Users Mail List
Subject: RE: protocol error in collector after housekeeping

[root@xxxxxxxxx condor]# condor_config_val -master CONDOR_HOST
10.1.1.55
[root@xxxxxxxxx condor]# condor_config_val -v CONDOR_HOST
CONDOR_HOST = 10.1.1.55
# at: /etc/condor/config.d/condor_config.local, line 1
# raw: CONDOR_HOST = 10.1.1.55

Jobs do get run in the 15 minutes after the collector restarts until the housekeeper kicks in.

------ collector log with D_FULLDEBUG

06/27/16 17:22:41 Housekeeper:  Ready to clean old ads
06/27/16 17:22:41       Cleaning StartdAds ...
06/27/16 17:22:41       Cleaning StartdPrivateAds ...
06/27/16 17:22:41       Cleaning ScheddAds ...
06/27/16 17:22:41       Cleaning SubmittorAds ...
06/27/16 17:22:41       Cleaning LicenseAds ...
06/27/16 17:22:41       Cleaning MasterAds ...
06/27/16 17:22:41       Cleaning CkptServerAds ...
06/27/16 17:22:41       Cleaning CollectorAds ...
06/27/16 17:22:41       Cleaning StorageAds ...
06/27/16 17:22:41       Cleaning NegotiatorAds ...
06/27/16 17:22:41       Cleaning HadAds ...
06/27/16 17:22:41       Cleaning GridAds ...
06/27/16 17:22:41       Cleaning XferServiceAds ...
06/27/16 17:22:41       Cleaning LeaseManagerAds ...
06/27/16 17:22:41       Cleaning Generic Ads ...
06/27/16 17:22:41 Housekeeper:  Done cleaning
06/27/16 17:22:42 ScheddAd     : Updating ... "< 10-1-1-61.agbu.localdomain , 10.1.1.61 >"
06/27/16 17:22:42 In OfflineCollectorPlugin::update ( 1 )
06/27/16 17:22:42 CollectorAd  : Updating ... "< AGBU@xxxxxxxxxxxxxxxxxxxxxxxxxx >"
06/27/16 17:22:42 Attempting to send update via UDP to collector condor.cs.wisc.edu <:9618>
06/27/16 17:22:42 ERROR "Unknown protocol (1) in Sock::bind(); aborting." at line 741 in file /slots/01/dir_1114870/userdir/.tmpthm9vL/BUILD/condor-8.4.7/src/condor_io/sock.cpp
------

Looks like the address is blank in that "Attempting to send update" line: it shows <:9618> with nothing before the port.
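
To find every occurrence of such a blank address in a larger log, a small Python sketch along these lines might help. The function name and the pattern are illustrative assumptions: the pattern only covers the <host:port> form seen in the excerpt above, not bracketed IPv6 addresses.

```python
import re

# Matches "<host:port>" and captures the host part, which may be empty.
# Written for the form seen in the log excerpt; bracketed IPv6 is not handled.
ADDRESS = re.compile(r"<([^:<>]*):(\d+)>")


def blank_address_lines(lines):
    """Return the lines containing an address whose host part is empty."""
    hits = []
    for line in lines:
        for host, _port in ADDRESS.findall(line):
            if not host.strip():
                hits.append(line.rstrip("\n"))
                break  # one hit per line is enough
    return hits


# Example call, with the path to the collector log supplied by the reader:
#     with open(path_to_collector_log) as log:
#         print("\n".join(blank_address_lines(log)))
```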

Klint.

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Iain Bradford Steers
Sent: Monday, 27 June 2016 4:35 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] protocol error in collector after housekeeping

Hi Klint,

I've seen this type of error message in the past when I've accidentally appended the port to the address a second time.

However your CONDOR_HOST var seems okay.

Could you run the following:

condor_config_val -master CONDOR_HOST

condor_config_val -v CONDOR_HOST

I think we can ignore the connection refused error for the moment. The master doesn't know the collector is dead, so it is trying to send an update, I think. (Sounds like a bug in itself, really.)

Could you bump up the debugging?

MASTER_DEBUG = D_FULLDEBUG
COLLECTOR_DEBUG = D_FULLDEBUG

Cheers, Iain