
Re: [HTCondor-users] protocol error in collector after housekeeping



Hi Klint,

I think I can reproduce the problem you observed by setting
  NO_DNS = True
in my central manager.

What values of NO_DNS and DEFAULT_DOMAIN_NAME are you using for your collector? Assuming you have NO_DNS = True in your setup, I think I now know enough about the problem to patch the code and spare future users this pain.
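
A rough sketch of the suspected failure mode, for illustration only. With NO_DNS = True, HTCondor builds host names from the IP address plus DEFAULT_DOMAIN_NAME, which matches the "10-1-1-61.agbu.localdomain" name in the collector log quoted below. The Python below is not HTCondor's code: the function names are hypothetical, and the reverse step (recovering an address from a name without consulting DNS) is an assumption about why a real name such as condor.cs.wisc.edu would end up with a blank address.

```python
import ipaddress
from typing import Optional


def no_dns_hostname(ip: str, default_domain: str) -> str:
    """Build a host name from an IPv4 address: dots become dashes and
    the default domain is appended (hypothetical helper)."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a bad address
    return f"{str(addr).replace('.', '-')}.{default_domain}"


def no_dns_lookup(hostname: str) -> Optional[str]:
    """Recover the IPv4 address encoded in the first label of a host name.
    Returns None when that label is not a dashed IPv4 address, which is
    the assumed source of the blank address (hypothetical helper)."""
    first_label = hostname.split(".", 1)[0]
    try:
        return str(ipaddress.IPv4Address(first_label.replace("-", ".")))
    except ValueError:
        return None


if __name__ == "__main__":
    print(no_dns_hostname("10.1.1.61", "agbu.localdomain"))
    print(no_dns_lookup("10-1-1-61.agbu.localdomain"))
    print(no_dns_lookup("condor.cs.wisc.edu"))
```

If that assumption holds, any host name that is not in the dashed-IP form has no address to fall back on, which would explain why setting CONDOR_DEVELOPERS_COLLECTOR = NONE avoids the crash.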

Thanks for reporting the issue.

regards,
Todd

On 6/27/2016 4:04 AM, Klint Gore wrote:
I’d just found that and tested it as your message came in.

[root@xxxxxxxxx condor]# condor_config_val -master CONDOR_DEVELOPERS_COLLECTOR
Not defined

Setting that to NONE stopped it crashing.

condor.cs.wisc.edu resolves to 128.105.19.35 here.  Does HTCondor use a
library to look that up?  The machine is a minimal CentOS 7 install, so
maybe there’s a library missing.

These machines don't have any access to the outside world anyway so
it’ll never connect.

Klint.

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On
Behalf Of Todd Tannenbaum
Sent: Monday, 27 June 2016 6:36 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] protocol error in collector after
housekeeping

Hi Klint,



Looks like your collector machine has something bogus set up in the
/etc/hosts file or DNS when resolving "condor.cs.wisc.edu".
Could you investigate that for us?

Meanwhile, as an immediate workaround, you may be able to avoid the
problem by putting the following in the condor_config file on your
central manager machine:

CONDOR_DEVELOPERS_COLLECTOR = NONE

Hope this helps,

Todd

Sent from my iPhone


On Jun 27, 2016, at 2:38 AM, Klint Gore <kgore4@xxxxxxxxxx> wrote:

    Just in case

    [root@xxxxxxxxx condor]# condor_config_val -v COLLECTOR_HOST
    COLLECTOR_HOST = 10.1.1.55
    # at: <Default>
    # raw: COLLECTOR_HOST = $(CONDOR_HOST)


    -----Original Message-----
    From: Klint Gore
    Sent: Monday, 27 June 2016 5:40 PM
    To: HTCondor-Users Mail List
    Subject: RE: protocol error in collector after housekeeping

    [root@xxxxxxxxx condor]# condor_config_val -master CONDOR_HOST
    10.1.1.55
    [root@xxxxxxxxx condor]# condor_config_val -v CONDOR_HOST
    CONDOR_HOST = 10.1.1.55
    # at: /etc/condor/config.d/condor_config.local, line 1
    # raw: CONDOR_HOST = 10.1.1.55

    Jobs do get run in the 15 minutes after the collector restarts until
    the housekeeper kicks in.

    ------ collector log with D_FULLDEBUG

    06/27/16 17:22:41 Housekeeper:  Ready to clean old ads
    06/27/16 17:22:41       Cleaning StartdAds ...
    06/27/16 17:22:41       Cleaning StartdPrivateAds ...
    06/27/16 17:22:41       Cleaning ScheddAds ...
    06/27/16 17:22:41       Cleaning SubmittorAds ...
    06/27/16 17:22:41       Cleaning LicenseAds ...
    06/27/16 17:22:41       Cleaning MasterAds ...
    06/27/16 17:22:41       Cleaning CkptServerAds ...
    06/27/16 17:22:41       Cleaning CollectorAds ...
    06/27/16 17:22:41       Cleaning StorageAds ...
    06/27/16 17:22:41       Cleaning NegotiatorAds ...
    06/27/16 17:22:41       Cleaning HadAds ...
    06/27/16 17:22:41       Cleaning GridAds ...
    06/27/16 17:22:41       Cleaning XferServiceAds ...
    06/27/16 17:22:41       Cleaning LeaseManagerAds ...
    06/27/16 17:22:41       Cleaning Generic Ads ...
    06/27/16 17:22:41 Housekeeper:  Done cleaning
    06/27/16 17:22:42 ScheddAd     : Updating ... "< 10-1-1-61.agbu.localdomain , 10.1.1.61 >"
    06/27/16 17:22:42 In OfflineCollectorPlugin::update ( 1 )
    06/27/16 17:22:42 CollectorAd  : Updating ... "< AGBU@xxxxxxxxxxxxxxxxxxxxxxxxxx >"
    06/27/16 17:22:42 Attempting to send update via UDP to collector condor.cs.wisc.edu <:9618>
    06/27/16 17:22:42 ERROR "Unknown protocol (1) in Sock::bind(); aborting." at line 741 in file /slots/01/dir_1114870/userdir/.tmpthm9vL/BUILD/condor-8.4.7/src/condor_io/sock.cpp
    ------

    Looks like the address is blank (just "<:9618>") in that
    "Attempting to send update" line.

    Klint.
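
    A quick way to spot the same symptom in a longer collector log is to
    search for an address printed as "<:PORT>" with nothing before the
    colon. The Python below is a minimal sketch under that assumption;
    the function name is made up for illustration, and it expects each
    log entry on a single line, as in the log file itself.

```python
import re
from typing import Iterable, List

# An address is logged as "<host:port>"; a blank host appears as "<:9618>".
BLANK_ADDR = re.compile(r"<:\d+>")


def blank_address_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain an address with an empty host part."""
    return [line for line in lines if BLANK_ADDR.search(line)]


if __name__ == "__main__":
    sample = [
        "06/27/16 17:22:41 Housekeeper:  Done cleaning",
        "06/27/16 17:22:42 Attempting to send update via UDP to collector "
        "condor.cs.wisc.edu <:9618>",
    ]
    for hit in blank_address_lines(sample):
        print(hit)
```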

    -----Original Message-----
    From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On
    Behalf Of Iain Bradford Steers
    Sent: Monday, 27 June 2016 4:35 PM
    To: HTCondor-Users Mail List
    Subject: Re: [HTCondor-users] protocol error in collector after
    housekeeping

    Hi Klint,

    I've seen this type of error message in the past when I've accidentally
    appended the port to the address a second time.

    However your CONDOR_HOST var seems okay.

    Could you run the following:

    condor_config_val -master CONDOR_HOST

    condor_config_val -v CONDOR_HOST

    I think we can ignore the connection refused error for the moment.
    The master doesn't know the collector is dead, so is trying to send
    an update, I think. (Sounds like a bug in itself really)

    Could you bump up the debugging?

    MASTER_DEBUG = D_FULLDEBUG
    COLLECTOR_DEBUG = D_FULLDEBUG

    Cheers, Iain
    _______________________________________________
    HTCondor-users mailing list
    To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
    subject: Unsubscribe
    You can also unsubscribe by visiting
    https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

    The archives can be found at:
    https://lists.cs.wisc.edu/archive/htcondor-users/




--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685