
Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes



The two things I can think of that block in the master are:
- DNS lookups (the one Andrew originally quoted appears to be inside the security subsystem).
- Updating the collector.  Kinda: I suspect that most updates are nonblocking because they buffer in the outgoing TCP socket.  However, when you have to establish a new security session...

60s seems reasonable.
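
To put rough numbers on that, here is a minimal Python sketch of the timing. It assumes the behaviour Zach describes below (a keepalive every WatchdogSec/2 seconds) and assumes systemd counts the timeout from the last keepalive it received. The function name is made up for illustration, and timer jitter and delivery latency are ignored.

```python
def worst_case_stall(watchdog_sec: float) -> float:
    """Longest stall (in seconds) the master survives in the worst case.

    Assumption: keepalives go out every watchdog_sec / 2 seconds and the
    watchdog fires watchdog_sec seconds after the last keepalive received.
    """
    keepalive_interval = watchdog_sec / 2
    # Worst case: blocking starts just before the next keepalive is due,
    # so keepalive_interval seconds of the budget are already used up.
    return watchdog_sec - keepalive_interval


for setting in (5, 60):
    print(f"WatchdogSec={setting}: worst-case safe stall is {worst_case_stall(setting)}s")
```

Under those assumptions, a 5s setting tolerates a stall of only about 2.5s in the worst case, while 60s tolerates about 30s.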

Personally, I'd suggest that we need louder notifications when either kind of blocking is going on.  Killing off HTCondor might be a little on the strong side though...
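
One way such a notification could look, as a rough Python sketch rather than anything taken from the HTCondor source: time a potentially blocking call and log a warning when it exceeds a budget. The helper name `run_with_budget` and the logger name are hypothetical; a natural budget would be half the watchdog interval.

```python
import logging
import time

log = logging.getLogger("blocking_sketch")  # hypothetical logger name


def run_with_budget(label, budget_sec, fn, *args):
    """Call fn(*args); warn if it blocked longer than budget_sec.

    Returns (result, over_budget). This only reports the stall after the
    call returns; it does not interrupt the call.
    """
    start = time.monotonic()  # monotonic clock: immune to wall-clock jumps
    result = fn(*args)
    elapsed = time.monotonic() - start
    over_budget = elapsed > budget_sec
    if over_budget:
        log.warning("%s blocked for %.2fs (budget %.2fs)",
                    label, elapsed, budget_sec)
    return result, over_budget
```

For example, a DNS lookup could be wrapped as `run_with_budget("DNS lookup", 2.5, socket.getaddrinfo, "example.org", None)`, which would log a warning whenever the lookup took longer than 2.5 seconds.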

(At the Nebraska cluster, we'll probably tune it back down to 5s...)

Brian

> On Jul 29, 2016, at 4:23 PM, Zach Miller <zmiller@xxxxxxxxxxx> wrote:
> 
> Yep, Jaimie and I also just confirmed this.  The WatchdogSec is set to 5 in our packaging.  The master attempts to send a keepalive every (WatchdogSec/2) seconds, so if the timing is bad, having the master block for as little as 3 seconds could trigger systemd to kill off condor.
> 
> Workaround for now:  Set WatchdogSec to something much higher.  Having thought about this for approximately one minute, I'd suggest 60. :)
> 
> 
> Cheers,
> -zach
> 
> 
>> -----Original Message-----
>> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
>> Of Brian Bockelman
>> Sent: Friday, July 29, 2016 4:12 PM
>> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
>> Subject: Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes
>> 
>> 
>> Hi Michael,
>> 
>> It's not really a systemd issue. Condor's config file puts in place a
>> directive of "if I haven't responded in the last 5 seconds, then consider
>> me deadlocked and kill off all my processes."
>> 
>> HTCondor asked, systemd listened.
>> 
>> Sent from my iPhone
>> 
>> On Jul 29, 2016, at 4:00 PM, Michael V Pelletier
>> <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:
>> 
>> 
>> 
>> 	This seems to be another example of how systemd doesn't seem to acknowledge
>> 	decades of UNIX-derived system management experience that people have
>> 	accumulated over the years.
>> 
>> 	http://suckless.org/sucks/systemd
>> 
>> 	The philosophy of UNIX is, in part, "write programs that do one thing, and
>> 	do it well."
>> 
>> 
>> 
>> Which is one reason why systemd is broken up into, what, a dozen different
>> daemons?
>> 
>> One thing that sysvinit does poorly is manage services. That's why many
>> (most?) commercial POSIX implementations have abandoned it.
>> 
>> In fact, that sysvinit didn't keep up the "and do it well" half of the
>> sentence is one of the motivations for the condor_master.  It is
>> encouraging to me that all the features the HTCondor team found missing
>> from sysvinit have now made it into RHEL7's service management framework.
>> 
>> Brian
>> 
>> 
>> 
>> 	I hope I don't loathe it too much when I finally get around to installing
>> 	a CentOS 7 VM.
>> 
>> 	        -Michael Pelletier.
>> 
>> 
>> 	_______________________________________________
>> 	HTCondor-users mailing list
>> 	To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> <mailto:htcondor-users-request@xxxxxxxxxxx>  with a
>> 	subject: Unsubscribe
>> 	You can also unsubscribe by visiting
>> 	https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>> 
>> 	The archives can be found at:
>> 	https://lists.cs.wisc.edu/archive/htcondor-users/