Re: [HTCondor-users] Communication Error: condor_status -constraint

Thank you Jaime Frey for your reply.

Here is what I get on my HTCondor central manager:

~$ condor_version
$CondorVersion: 7.9.1 Aug 24 2012 PRE-RELEASE-UWCS $
$CondorPlatform: I686-Fedora_18 $

~$ condor_config_val -config
Configuration source:
Local configuration sources:

~$ condor_config_val -v COLLECTOR_HOST
COLLECTOR_HOST: condor.skk.edu
  Defined in '/etc/condor/config.d/99skk_condor.config', line 8.

So there is only one collector running on the master.

However, a second hostname also resolves to the IP address of the HTCondor master: before I got an "official" hostname (the one above), I used a free hostname from dyndns.com.

Hence, all Windows XP execute machines in the pool still use the dyndns hostname to connect to the master, whereas the master itself uses the official hostname. Both hostnames point to the same IPv4 address.
The execution machines run Condor version 7.4.4.

Again, this error

   02/27/13 09:05:01 Collector condor.skku.edu blacklisted; skipping
   Error: communication error

from the condor_status command seems to occur only once every few days, or once a week, even though the command runs from a cron job on the Condor master every 5 minutes, 24/7.

Please let me know if there is any further debugging I can do.


----- Original Message -----
From: Jaime Frey <jfrey@xxxxxxxxxxx>
To: Rob <spamrefuse@xxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Sent: Saturday, March 2, 2013 5:27 AM
Subject: Re: [HTCondor-users] Communication Error: condor_status -constraint

On Feb 26, 2013, at 6:43 PM, Rob <spamrefuse@xxxxxxxxx> wrote:

> ----- Original Message -----
>> From: Rob <spamrefuse@xxxxxxxxx>
>> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
>> Cc:
>> Sent: Wednesday, February 27, 2013 9:28 AM
>> Subject: Re: [HTCondor-users] Communication Error: condor_status -constraint
>> Hi,
>> I am actually having the same or a similar problem.
>> HTCondor 7.9.1 runs on the Linux central manager and
>> version 7.4.4 on the Windows XP execute machines.
>> On the manager I run a cron script every 5 minutes, which creates over time
>> a database with the HTCondor status of the pool. At the heart of the script is
>> this line:
>> condor_status -debug -total -constraint "target.Arch == \"INTEL\""
>> Once every few days I get an email from the cron daemon with:
>> Status: R
>> Error: communication error
> Oops, I added the "-debug" flag only later, just before sending the previous email.
> With "-debug", I actually get the following in the cron daemon email:
> 02/27/13 09:05:01 Enumerating interfaces: lo up
> 02/27/13 09:05:01 Enumerating interfaces: em1 xxx.xxx.140.72 up
> 02/27/13 09:05:01 Collector central.manager.edu blacklisted; skipping
> Error: communication error
> (I have removed the actual IP number and hostname of the central manager)
> Why is the hostname of the central manager blacklisted? It is an officially registered hostname!
> And why does this blacklisting happen only once every few days, when the status command is called every 5 minutes?

Do you have multiple collectors?

The blacklisting happens when you have multiple collectors configured for fault tolerance (the COLLECTOR_HOST parameter has multiple hostnames or IP addresses). Daemons and tools will query the collectors in a random order until they get a successful result. If a query fails, the tool/daemon will avoid that collector (blacklisting it) for an hour.
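
To make the behavior described above concrete, here is a minimal Python sketch of it. This is not HTCondor's actual code: the class name `CollectorList`, the `send` callback, and the use of `OSError` to signal a failed query are all assumptions made for illustration. Only the random query order, the one-hour avoidance window, and the wording of the two messages come from this thread.

```python
import random
import time

BLACKLIST_SECONDS = 3600  # "for an hour", as described above


class CollectorList:
    """Hypothetical model of a fault-tolerant collector list (not HTCondor code)."""

    def __init__(self, hosts, clock=time.time):
        self.hosts = list(hosts)
        self.clock = clock              # injectable clock, so the logic can be tested
        self.blacklisted_until = {}     # host -> time at which it may be tried again

    def query(self, send):
        """Try each collector in random order.

        `send(host)` is an assumed callback that returns a result on success
        and raises OSError when the query to that host fails.
        """
        for host in random.sample(self.hosts, len(self.hosts)):
            if self.clock() < self.blacklisted_until.get(host, 0.0):
                print(f"Collector {host} blacklisted; skipping")
                continue
            try:
                return send(host)
            except OSError:
                # A failed query causes this collector to be avoided for an hour.
                self.blacklisted_until[host] = self.clock() + BLACKLIST_SECONDS
        # No collector returned a result.
        raise ConnectionError("communication error")
```

In this model, with a single configured host, the "blacklisted; skipping" message can only appear on a second call to `query` within the same process, after the first call has already failed.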

I'm surprised you're seeing this with condor_status, as it means the tool is making two queries to the collector(s). The first query failing would trigger the blacklisting, and the second query would produce the message you're seeing. If you only have one collector, then I'm very surprised that you're seeing this.

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project