Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes
- Date: Sat, 30 Jul 2016 14:04:08 -0500
- From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes
The two things I can think of that block in the master are:
- DNS lookups (the one Andrew originally quoted appears to be inside the security subsystem).
- Updating the collector. Kinda: I suspect that most updates are nonblocking because they buffer in the outgoing TCP socket. However, when you have to establish a new security session...
60s seems reasonable.
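
To see why 60s leaves more headroom than 5s, the timing Zach describes in the quoted message (keepalive every WatchdogSec/2 seconds) can be sketched as a small model. This is only an illustration of the arithmetic, not HTCondor's or systemd's actual code: the function names are made up, and it assumes the master sends an overdue keepalive immediately once it stops blocking.

```python
def keepalive_interval(watchdog_sec: float) -> float:
    """Interval between keepalives: the master sends one every WatchdogSec/2."""
    return watchdog_sec / 2.0


def gap_between_keepalives(watchdog_sec: float,
                           block_start_offset: float,
                           block_duration: float) -> float:
    """Seconds between two consecutive keepalives (hypothetical model).

    block_start_offset is how long after the previous keepalive the master
    starts blocking. Assumption: once unblocked, an overdue keepalive is
    sent right away.
    """
    interval = keepalive_interval(watchdog_sec)
    return max(interval, block_start_offset + block_duration)


def watchdog_fires(watchdog_sec: float,
                   block_start_offset: float,
                   block_duration: float) -> bool:
    """True if the gap exceeds WatchdogSec, i.e. systemd would kill the service."""
    gap = gap_between_keepalives(watchdog_sec, block_start_offset, block_duration)
    return gap > watchdog_sec


def min_fatal_block(watchdog_sec: float) -> float:
    """Shortest block that can exceed the watchdog in the worst case,
    where the block starts just as the next keepalive is due."""
    return watchdog_sec - keepalive_interval(watchdog_sec)
```

Under this model, a block that starts right when a keepalive is due only needs to last longer than WatchdogSec/2 to trip the watchdog, so raising WatchdogSec from 5 to 60 raises that worst-case margin proportionally.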
Personally, I'd suggest that we need louder notifications that either blocking reason is going on. Killing off HTCondor might be a little on the strong side though...
(At the Nebraska cluster, we'll probably tune it back down to 5s...)
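
As a rough idea of what a "louder notification" could look like, here is a minimal sketch that times a possibly blocking call and logs a warning when it exceeds a threshold. It is not how the condor_master is implemented; `timed_call`, the logger name, and the `warn_after` default are all hypothetical names and values chosen for illustration.

```python
import logging
import time

log = logging.getLogger("blocking_watch")  # hypothetical logger name


def timed_call(label, func, *args, warn_after=1.0, clock=time.monotonic):
    """Run func(*args) and return (result, elapsed_seconds).

    Logs a warning if the call took longer than warn_after seconds.
    The clock is injectable so the behaviour can be checked without
    actually waiting.
    """
    start = clock()
    result = func(*args)
    elapsed = clock() - start
    if elapsed > warn_after:
        log.warning("%s blocked for %.2fs (threshold %.2fs)",
                    label, elapsed, warn_after)
    return result, elapsed
```

A DNS lookup could be wrapped as `timed_call("DNS lookup", socket.getaddrinfo, "localhost", None)` after `import socket`; the same wrapper would apply to any other call suspected of blocking.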
Brian
> On Jul 29, 2016, at 4:23 PM, Zach Miller <zmiller@xxxxxxxxxxx> wrote:
>
> Yep, Jaimie and I also just confirmed this. The WatchdogSec is set to 5 in our packaging. The master attempts to send a keepalive every (WatchdogSec/2) seconds, so if the timing is bad, having the master block for as little as 3 seconds could trigger systemd to kill off condor.
>
> Workaround for now: Set WatchdogSec to something much higher. Having thought about this for approximately one minute, I'd suggest 60. :)
>
>
> Cheers,
> -zach
>
>
>> -----Original Message-----
>> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
>> Of Brian Bockelman
>> Sent: Friday, July 29, 2016 4:12 PM
>> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
>> Subject: Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes
>>
>>
>> Hi Michael,
>>
>> It's not really a systemd issue. Condor's config file puts in place a
>> directive of "if I haven't responded in the last 5 seconds, then consider
>> me deadlocked and kill off all my processes."
>>
>> HTCondor asked, systemd listened.
>>
>>
>> On Jul 29, 2016, at 4:00 PM, Michael V Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:
>>
>>
>>
>> This seems to be another example of how systemd doesn't seem to acknowledge
>> decades of UNIX-derived system management experience that people have
>> accumulated over the years.
>>
>> http://suckless.org/sucks/systemd
>>
>> The philosophy of UNIX is, in part, "write programs that do one thing, and
>> do it well."
>>
>>
>>
>> Which is one reason why systemd is broken up into, what, a dozen different
>> daemons?
>>
>> One thing that sysvinit does poorly is manage services. That's why many
>> (most?) commercial POSIX implementations have abandoned it.
>>
>> In fact, that sysvinit didn't keep up the "and do it well" half of the
>> sentence is one of the motivations for the condor_master. It is
>> encouraging to me that all the features the HTCondor team found missing
>> from sysvinit have now made it into RHEL7's service management framework.
>>
>> Brian
>>
>>
>>
>> I hope I don't loathe it too much when I finally get around to installing
>> a CentOS 7 VM.
>>
>> -Michael Pelletier.
>>
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> <mailto:htcondor-users-request@xxxxxxxxxxx> with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
>