Re: [HTCondor-users] blacklisted local host Collector



On Oct 7, 2015, at 3:10 AM, Richard Crozier <richard.crozier@xxxxxxxxxxx> wrote:
> 
> On 31/03/15 21:39, Jaime Frey wrote:
>> On Mar 27, 2015, at 5:00 AM, Richard Crozier
>> <richard.crozier@xxxxxxxxxxx> wrote:
>>> 
>>> On 26/03/15 15:41, Todd Tannenbaum wrote:
>>>> On 3/26/2015 9:29 AM, Richard Crozier wrote:
>>>>> Hello,
>>>>> 
>>>>> I'm running a personal condor pool on a machine with 64 nodes.
>>>>> Sometimes I see the following:
>>>>> 
>>>>> condor_status -total -debug
>>>>> 
>>>>> 03/26/15 13:40:25 Collector 127.0.0.1 blacklisted; skipping
>>>>> 
> 
> <snip>
> 
>>> 
>>> I tried to set DEAD_COLLECTOR_MAX_AVOIDANCE_TIME using:
>>> 
>>> $ condor_config_val -set "DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60"
>>> 
>>> but got:
>>> 
>>> Attempt to set configuration "DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60"
>>> on master chameleon <127.0.0.1:44031> failed.
>>> 
>>> Instead I opened /etc/condor/condor_config
>>> 
>>> and added the line
>>> 
>>> DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60
>>> 
>>> Is this right? I couldn't find DEAD_COLLECTOR_MAX_AVOIDANCE_TIME
>>> existing anywhere in this file already.
>>> 
>>> Finally, in case it's relevant:
>>> 
>>> $ condor_version
>>> $CondorVersion: 8.0.5 Jan 14 2014 BuildID:
>>> Debian-8.0.5~dfsg.1-1ubuntu1 Debian-8.0.5~dfsg.1-1ubuntu1 $
>>> $CondorPlatform: X86_64-Ubuntu_ $
>> 
>> The dead collector avoidance feature is intended for pools that have
>> multiple collectors set up for fault tolerance. Daemons send their
>> information to all of the collectors and queriers pick a random
>> collector to contact. If one collector goes down, queriers will avoid it
>> after failing to contact it for the first time. This feature isn’t
>> useful when you only have one collector.
>> 
>> We’ve had another user report the same problem that you’re experiencing,
>> occurring occasionally when they do thousands of queries. We haven’t
>> been able to track down the cause yet.
>> 
>> Thanks and regards,
>> Jaime Frey
>> UW-Madison HTCondor Project
>> 
>> 
> 
> I'm still seeing this issue, and it wasn't actually confirmed if what I did will have any effect. Could you please confirm whether opening
> 
> /etc/condor/condor_config
> 
> and adding the line
> 
> DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 60
> 
> will change how long the collector is blacklisted to 60s? Is it correct that the default time is 1 hour?
> 
> Alternatively, can you tell me how to do it correctly, or whether you've found the root cause? I'm happy to provide any information that might help. If there are any other workarounds you're aware of, those would also be really helpful.
> 
> Best regards,
> 
> Richard

We believe we understand the root cause, but haven’t been able to catch it in the act. We did fix how the code reacts when it decides that all collectors are blacklisted. Now, it will always try contacting at least one collector immediately. This change was introduced in HTCondor 8.2.9 and 8.3.8.
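
The behavior described in this thread (pick a random collector, avoid one that failed for a while, but never skip every collector) can be modeled roughly as follows. This is a minimal Python sketch for illustration only, not HTCondor's actual code: the class `CollectorList`, its methods, and the `contact` callback are all hypothetical names, and `max_avoidance_time` merely plays the role of DEAD_COLLECTOR_MAX_AVOIDANCE_TIME.

```python
import random
import time


class CollectorList:
    """Toy model of dead-collector avoidance (hypothetical, not HTCondor code)."""

    def __init__(self, collectors, max_avoidance_time, clock=time.monotonic):
        self.collectors = list(collectors)
        # Seconds to avoid a failed collector; stands in for
        # DEAD_COLLECTOR_MAX_AVOIDANCE_TIME.
        self.max_avoidance_time = max_avoidance_time
        self.clock = clock
        self.avoid_until = {}  # collector -> time until which it is skipped

    def is_blacklisted(self, collector):
        return self.clock() < self.avoid_until.get(collector, 0.0)

    def mark_failed(self, collector):
        self.avoid_until[collector] = self.clock() + self.max_avoidance_time

    def candidates(self):
        """Return collectors to try, in random order."""
        usable = [c for c in self.collectors if not self.is_blacklisted(c)]
        if not usable:
            # Every collector is blacklisted: still try one immediately
            # instead of skipping them all.
            usable = [random.choice(self.collectors)]
        random.shuffle(usable)
        return usable

    def query(self, contact):
        """Call contact(collector) until one succeeds."""
        for collector in self.candidates():
            try:
                return contact(collector)
            except ConnectionError:
                self.mark_failed(collector)
        raise ConnectionError("no collector could be contacted")
```

In a single-collector pool like the one discussed here, `CollectorList(["127.0.0.1"], max_avoidance_time=60)` would still contact 127.0.0.1 on the next query even right after a failure, because the fallback in `candidates()` never returns an empty list.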

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project