
Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes



Hi Brian,

We noticed this on our 8.5.5 CC7 infra nodes (cm, schedds) as well, primarily on start-up.

> On Jul 30, 2016, at 21:04, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
> 
> The two things I can think of that block in the master are:
> - DNS lookups (the one Andrew originally quoted appears to be inside the security subsystem).
> - Updating the collector.  Kinda: I suspect that most updates are nonblocking because they buffer in the outgoing TCP socket.  However, when you have to establish a new security session...

Primarily the second one for us. While we were working with the HTCondor team, we noticed that most of our kills occurred while HTCondor was inside ReliSock, doing the initial authentication on start-up.

We also had kills in schedds opening an initial security session with execute nodes across a fairly saturated WAN.
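
For anyone trying to size WatchdogSec, here is a rough sketch of the timing Zach describes below. It is only a simplified model and rests on two assumptions of mine: the master sends its keepalive exactly every WatchdogSec/2 seconds when it is not blocked, and systemd acts as soon as WatchdogSec passes without one. The function name is made up for illustration.

```python
def max_tolerated_block(watchdog_sec: float) -> float:
    """Longest block (seconds) the master survives in the worst case.

    Simplified model, not HTCondor's actual code: keepalives go out every
    watchdog_sec / 2 seconds, and systemd kills the service once
    watchdog_sec elapses without receiving one.
    """
    keepalive_interval = watchdog_sec / 2.0
    # Worst case: the block begins just before the next keepalive is due,
    # so keepalive_interval seconds have already passed since the last one.
    return watchdog_sec - keepalive_interval


for watchdog_sec in (5, 60):
    limit = max_tolerated_block(watchdog_sec)
    print(f"WatchdogSec={watchdog_sec}: a block longer than {limit:.1f}s can trigger a kill")
```

Under this model the packaged value of 5 leaves only about 2.5 seconds of slack, which matches the "as little as 3 seconds" figure, while 60 leaves about 30 seconds.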

Cheers, Iain

> 
> 60s seems reasonable.
> 
> Personally, I'd suggest that we need louder notifications that either blocking reason is going on.  Killing off HTCondor might be a little on the strong side though...
> 
> (At the Nebraska cluster, we'll probably tune it back down to 5s...)
> 
> Brian
> 
>> On Jul 29, 2016, at 4:23 PM, Zach Miller <zmiller@xxxxxxxxxxx> wrote:
>> 
>> Yep, Jaimie and I also just confirmed this.  The WatchdogSec is set to 5 in our packaging.  The master attempts to send a keepalive every (WatchdogSec/2) seconds, so if the timing is bad, having the master block for as little as 3 seconds could trigger systemd to kill off condor.
>> 
>> Workaround for now:  Set WatchdogSec to something much higher.  Having thought about this for approximately one minute, I'd suggest 60. :)
>> 
>> 
>> Cheers,
>> -zach
>> 
>> 
>>> -----Original Message-----
>>> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
>>> Of Brian Bockelman
>>> Sent: Friday, July 29, 2016 4:12 PM
>>> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
>>> Subject: Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes
>>> 
>>> 
>>> Hi Michael,
>>> 
>>> It's not really a systemd issue. Condor's config file puts in place a
>>> directive of "if I haven't responded in the last 5 seconds, then consider
>>> me deadlocked and kill off all my processes."
>>> 
>>> HTCondor asked, systemd listened.
>>> 
>>> Sent from my iPhone
>>> 
>>> On Jul 29, 2016, at 4:00 PM, Michael V Pelletier
>>> <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:
>>> 
>>> 
>>> 
>>> 	This seems to be another example of how systemd doesn't seem to acknowledge
>>> 	decades of UNIX-derived system management experience that people have
>>> 	accumulated over the years.
>>> 
>>> 	http://suckless.org/sucks/systemd
>>> 
>>> 	The philosophy of UNIX is, in part, "write programs that do one thing, and
>>> 	do it well."
>>> 
>>> 
>>> 
>>> Which is one reason why systemd is broken up into, what, a dozen different
>>> daemons?
>>> 
>>> One thing that sysvinit does poorly is manage services. That's why many
>>> (most?) commercial POSIX implementations have abandoned it.
>>> 
>>> In fact, that sysvinit didn't keep up the "and do it well" half of the
>>> sentence is one of the motivations for the condor_master.  It is
>>> encouraging to me that all the features the HTCondor team found missing
>>> from sysvinit have now made it into RHEL7's service management framework.
>>> 
>>> Brian
>>> 
>>> 
>>> 
>>> 	I hope I don't loathe it too much when I finally get around to installing
>>> 	a CentOS 7 VM.
>>> 
>>> 	        -Michael Pelletier.
>>> 
>>> 
>>> 	_______________________________________________
>>> 	HTCondor-users mailing list
>>> 	To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>>> <mailto:htcondor-users-request@xxxxxxxxxxx>  with a
>>> 	subject: Unsubscribe
>>> 	You can also unsubscribe by visiting
>>> 	https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>> 
>>> 	The archives can be found at:
>>> 	https://lists.cs.wisc.edu/archive/htcondor-users/
>> 
> 
> 
