
Re: [HTCondor-users] CCB problems and high daemon load

On 11/29/2016 09:42 PM, Brian Bockelman wrote:
>> On Nov 24, 2016, at 3:19 PM, Matthias Schnepf <udcqn@xxxxxxxxxxxxxxx> wrote:
>> Hello all,
>> We have about 2000 VM worker nodes (~8000 cores) behind a NAT. We start up to 10 VMs every 30 seconds. Sometimes we get problems with the CCB:
>> CCBClient: Failed to read response from CCB server collector...
> This is not a particularly large rate compared to other installations.
> What do you use for authentication?  Note that only a few methods (such as GSI) have been made non-blocking (blocking authz may cause the above issue).
    We are using the password authentication method. We will probably
test HTCondor version 8.5.6 with one collector to see if the issue is caused by
>> Failed to reverse connect to startd workernode via CCB.
>> Also, the Collector, Negotiator and Scheduler reach a daemon load of up to 100%, and condor_q/condor_status become slow. However, the machines have free memory and CPU resources. The Collector, Negotiator and Scheduler run Condor version 8.4.8/9 and the worker nodes run version 8.5.7.
>> The network between the VMs and the Collector looks stable. Our plan is to start additional Collectors with CCBs. Would that help? How many Collectors do we need, and how should we configure our system?
> What do you mean by "daemon load" (Daemon Core Duty Cycle metric?  Something about the process as measured by Linux?)?
    I mean the Daemon Core Duty Cycle. The Linux system load was low.
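
    A rough way to see how these two numbers can diverge, assuming the
duty cycle is the fraction of wall-clock time a daemon spends inside its
handlers rather than idle in its event loop. The trace values and state
names below are made up for illustration; this is a conceptual sketch,
not how HTCondor computes the attribute internally.

```python
def duty_cycle(intervals):
    """Fraction of wall time spent in handlers (anything that is not 'idle').

    intervals is a list of (state, seconds) pairs. Time a handler spends
    blocked on a slow peer still counts as busy here.
    """
    total = sum(seconds for _, seconds in intervals)
    busy = sum(seconds for state, seconds in intervals if state != "idle")
    return busy / total if total else 0.0


def cpu_fraction(intervals):
    """Fraction of wall time in which the process actually used the CPU."""
    total = sum(seconds for _, seconds in intervals)
    cpu = sum(seconds for state, seconds in intervals if state == "compute")
    return cpu / total if total else 0.0


# Hypothetical one-minute trace of a single-threaded daemon: little real
# computation, but most of the time stuck in handlers waiting on the network.
trace = [("compute", 3.0), ("blocked_io", 55.0), ("idle", 2.0)]

print(f"duty cycle:   {duty_cycle(trace):.0%}")
print(f"CPU fraction: {cpu_fraction(trace):.0%}")
```

    In this model a daemon that mostly waits on blocking network
operations shows a duty cycle near 100% while barely using the CPU.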
    We tested our setup with one collector on the shared port and 20
extra collectors on other ports. See another E-Mail conversation:

    Now the Daemon Core Duty Cycle of the main collector is about 20%
and the Linux system load is a bit higher.
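
    To give a feeling for how the load spreads out, here is a small
simulation. It assumes each worker node picks one of the 20 extra
collectors at random as its CCB server; the port range is hypothetical
and the thread does not say how the assignment is actually done.

```python
import random
from collections import Counter

EXTRA_PORTS = range(9620, 9640)  # hypothetical ports for the 20 extra collectors
N_WORKERS = 2000                 # approximate number of VM worker nodes


def pick_ccb_port(rng):
    """Pick one of the extra collectors uniformly at random."""
    return rng.choice(EXTRA_PORTS)


rng = random.Random(42)  # fixed seed so the run is repeatable
load = Counter(pick_ccb_port(rng) for _ in range(N_WORKERS))

mean = N_WORKERS / len(EXTRA_PORTS)
print(f"collectors used: {len(load)}")
print(f"workers per collector: min={min(load.values())}, "
      f"max={max(load.values())}, mean={mean:.0f}")
```

    With 2000 workers and 20 collectors, each collector serves about 100
workers on average instead of one collector handling all of them.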
> Typically, the duty cycle metric for negotiator is always 100%.  I would expect it to never be 100% for the collector or schedd at the scale you describe.
    The negotiator duty cycle is about 80%, and it takes one core at
100% for a few seconds at a time. However, the matching looks good now
that we run more collectors. If you say that this duty cycle is normal,
then that should not be a problem. The schedd, condor_q and
condor_status also run more smoothly now.

> Hope this helps,

    Yes, it did. :-)