Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes

Date: Sat, 30 Jul 2016 14:17:51 -0500
From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes

> On Jul 30, 2016, at 2:13 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
> 
> Hi Brian,
> 
> We noticed this on our 8.5.5 CC7 infra nodes (cm, schedds) as well, primarily on start-up.
> 
>> On Jul 30, 2016, at 21:04, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
>> 
>> The two things I can think of that block in the master are:
>> - DNS lookups (the one Andrew originally quoted appears to be inside the security subsystem).
>> - Updating the collector.  Kinda: I suspect that most updates are nonblocking because they buffer in the outgoing TCP socket.  However, when you have to establish a new security session...
> 
> Primarily the second one for us. Whilst conversing with the HTCondor team, it was noticed that most of our kills occurred with HTCondor inside ReliSock doing the initial authentications on start-up.

Indeed: client-side authentication is often blocking.  Only the server side has been made non-blocking.

Something for the TODO list, I suppose.  If we get to the point where only DNS lookups are blocking in the master, then maybe it's time to take another look at c-ares.

:/  DNS is hard.
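
To illustrate why a blocking lookup is a problem for a single-threaded daemon that must keep answering a watchdog, here is a minimal Python sketch. It is not HTCondor code and not how c-ares works internally: it simply pushes the resolver call onto a worker thread and stops waiting after a deadline. The names `resolve_with_timeout` and `slow_resolver` are hypothetical, and the fake resolver stands in for a DNS server that is slow to answer.

```python
import concurrent.futures
import socket
import time

# Shared worker pool, so a stuck lookup never blocks the calling thread.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def resolve_with_timeout(host, timeout, resolver=socket.gethostbyname):
    """Resolve `host`, giving up after `timeout` seconds.

    Returns the resolver's result, or None if the deadline passed.
    The lookup itself keeps running in the worker thread; only the
    caller stops waiting for it.
    """
    future = _POOL.submit(resolver, host)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return None


def slow_resolver(host):
    """Hypothetical stand-in for a DNS lookup that takes too long."""
    time.sleep(0.5)
    return "192.0.2.1"  # address from the documentation range


if __name__ == "__main__":
    start = time.monotonic()
    answer = resolve_with_timeout("example.invalid", 0.05, resolver=slow_resolver)
    elapsed = time.monotonic() - start
    print(f"answer={answer!r} after {elapsed:.2f}s")
```

The caller gets control back after roughly the timeout rather than after the full lookup, which is the property a daemon needs in order to keep servicing its event loop. A real asynchronous resolver would avoid the extra thread altogether.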

> 
> We also had kills in schedds opening an initial security session with execute nodes across a fairly saturated WAN.
> 

Nah, this shouldn't affect the master.  That could be the master killing off the schedd (the latter also has a ton of other blocking behavior).

Brian
