
Re: [HTCondor-users] CCB problems and hight daemon load



> On Nov 24, 2016, at 3:19 PM, Matthias Schnepf <udcqn@xxxxxxxxxxxxxxx> wrote:
> 
> Hello all,
> 
> We have about 2000 VM worker nodes (~8000 cores) behind a NAT. We start up to 10 VMs every 30 seconds. Sometimes we get problems with the CCB:
> CCBClient: Failed to read response from CCB server collector...
> 

This is not a particularly large rate compared to other installations.

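What do you use for authentication?  Note that only a few methods (such as GSI) have been made non-blocking. With a blocking method, the daemon cannot service other requests while an authentication handshake is in progress, which may cause the above issue.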

> Failed to reverse connect to startd workernode via CCB.
> Also, the Collector, Negotiator and Scheduler reach a daemon load of 100%, and condor_q / condor_status become slow. However, the machines have free memory and CPU resources. The Collector, Negotiator and Scheduler run Condor version 8.4.8/9 and the worker nodes run version 8.5.7.
> The network between the VMs and the Collector looks stable. Our plan is to start additional Collectors with CCBs. Would that help? How many Collectors do we need, and how should we configure our system?

What do you mean by "daemon load"?  Is it the DaemonCore duty cycle metric, or something about the process as measured by Linux?

The duty cycle metric for the negotiator is typically 100%.  I would not expect it to reach 100% for the collector or schedd at the scale you describe.
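
To see which number is actually at 100%, the duty cycle can be read from the daemon ads. Below is a minimal sketch, not a definitive tool. It assumes that condor_status is on the PATH, that the metric is published as the RecentDaemonCoreDutyCycle attribute, and that its value is a fraction between 0 and 1. The function names and the 0.95 threshold are hypothetical, chosen only for illustration.

```python
import subprocess

# Assumption: the duty cycle is published under this ClassAd attribute.
ATTR = "RecentDaemonCoreDutyCycle"


def parse_duty_cycles(text):
    """Parse 'Name value' lines into {name: float}.

    The value is taken as the last whitespace-separated token, because
    daemon names (for example collector names) may contain spaces.
    Lines whose value is not a number (e.g. 'undefined') are skipped.
    """
    cycles = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            cycles[name] = float(value)
        except ValueError:
            continue
    return cycles


def busy_daemons(cycles, threshold=0.95):
    """Return the sorted names of daemons at or above the threshold.

    The default threshold is an illustrative assumption, not an
    official HTCondor limit.
    """
    return sorted(name for name, value in cycles.items() if value >= threshold)


def query_duty_cycles(daemon_flag):
    """Run condor_status for one daemon type, e.g. '-schedd' or '-collector'."""
    result = subprocess.run(
        ["condor_status", daemon_flag, "-af", "Name", ATTR],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_duty_cycles(result.stdout)
```

On a host that can reach the pool, busy_daemons(query_duty_cycles("-schedd")) would list the schedds that look saturated by this measure. The same call with "-collector" covers the collector.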

Hope this helps,

Brian