Re: [HTCondor-users] Communication Error: condor_status -constraint



On Feb 26, 2013, at 6:43 PM, Rob <spamrefuse@xxxxxxxxx> wrote:

> ----- Original Message -----
> 
>> From: Rob <spamrefuse@xxxxxxxxx>
>> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
>> Cc:
>> Sent: Wednesday, February 27, 2013 9:28 AM
>> Subject: Re: [HTCondor-users] Communication Error: condor_status -constraint
>> 
>> Hi,
>> 
>> I am actually having the same or a similar problem.
>> 
>> HTCondor 7.9.1 runs on the Linux central manager and
>> version 7.4.4 on the Windows XP execute machines.
>> 
>> On the manager I run a cron script every 5 minutes, which over time builds
>> a database of the pool's HTCondor status. At the heart of the script is
>> this line:
>> 
>> condor_status -debug -total -constraint "target.Arch == \"INTEL\""
>> 
>> 
>> Once every few days I get an email from the cron daemon with:
>> 
>> 
>> Status: R
>> Error: communication error
> 
> Oops, I added the "-debug" flag later, just before I sent the previous email.
> 
> Now, with "-debug", I actually get the following in the cron daemon email:
> 
> 
> 02/27/13 09:05:01 Enumerating interfaces: lo 127.0.0.1 up
> 02/27/13 09:05:01 Enumerating interfaces: em1 xxx.xxx.140.72 up
> 02/27/13 09:05:01 Collector central.manager.edu blacklisted; skipping
> Error: communication error
> 
> 
> (I have removed the actual IP address and hostname of the central manager)
> 
> Why is the hostname of the central manager blacklisted? It is an officially registered hostname!
> And why does this blacklisting happen only once in a few days, whereas the status command is called every 5 minutes?


Do you have multiple collectors?

The blacklisting happens when you have multiple collectors configured for fault tolerance (the COLLECTOR_HOST parameter has multiple hostnames or IP addresses). Daemons and tools will query the collectors in a random order until they get a successful result. If a query fails, the tool/daemon will avoid that collector (blacklisting it) for an hour.

I'm surprised you're seeing this with condor_status, as it means the tool is making two queries to the collector(s). The first query failing would trigger the blacklisting, and the second query would produce the message you're seeing. If you only have one collector, then I'm very surprised that you're seeing this.
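
To make that sequence concrete, here is a rough Python sketch of the behavior described above. It is only a toy model, not HTCondor's actual code: the CollectorList class, its method names, the injected send callable, and the use of ConnectionError are all made up for illustration. The one-hour interval is the only number taken from the explanation above.

```python
import random
import time

BLACKLIST_SECONDS = 3600  # "for an hour", as described above


class CollectorList:
    """Hypothetical model of a tool choosing among configured collectors."""

    def __init__(self, hosts, clock=time.time):
        self.hosts = list(hosts)
        self.clock = clock  # injectable so the behavior can be tested
        self.blacklisted_until = {}  # host -> time it may be tried again

    def is_blacklisted(self, host):
        return self.clock() < self.blacklisted_until.get(host, 0.0)

    def query(self, send):
        """Try collectors in random order; return the first successful result."""
        candidates = self.hosts[:]
        random.shuffle(candidates)
        for host in candidates:
            if self.is_blacklisted(host):
                continue  # the "blacklisted; skipping" case
            try:
                return send(host)
            except ConnectionError:
                # A failed query puts this collector on the blacklist.
                self.blacklisted_until[host] = self.clock() + BLACKLIST_SECONDS
        # Every collector either failed or was skipped.
        raise ConnectionError("communication error")
```

In this model, with a single collector, the first failed query blacklists it, and a second query made by the same CollectorList object skips it and fails right away without contacting anything. That is why the message implies two queries from the same tool invocation.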

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project