
Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes



Yep, Jaimie and I also just confirmed this.  The WatchdogSec is set to 5 in our packaging.  The master attempts to send a keepalive every (WatchdogSec/2) seconds, so if the timing is bad, having the master block for as little as 3 seconds could trigger systemd to kill off condor.
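
To make the timing margin concrete, here is a rough sketch in Python. It is a simplified model rather than anything taken from the master's code: it assumes the last keepalive went out exactly on schedule and ignores scheduling and delivery latency. The function name is made up for illustration.

```python
def shortest_fatal_stall(watchdog_sec: float, keepalive_interval: float) -> float:
    """Shortest stall (in seconds) that can let the systemd watchdog expire.

    Simplified model (an assumption, not HTCondor's actual code): the last
    keepalive went out at t=0, the next one is due at t=keepalive_interval,
    and the stall begins just before it is due. The delayed keepalive then
    goes out at keepalive_interval + stall, which is too late once that sum
    reaches watchdog_sec.
    """
    if not 0 < keepalive_interval <= watchdog_sec:
        raise ValueError("expected 0 < keepalive_interval <= watchdog_sec")
    return watchdog_sec - keepalive_interval


for watchdog_sec in (5, 60):
    interval = watchdog_sec / 2  # keepalive every WatchdogSec/2 seconds
    stall = shortest_fatal_stall(watchdog_sec, interval)
    print(f"WatchdogSec={watchdog_sec}: a stall of about {stall:g}s can be fatal")
```

Under this model the margin is about 2.5 seconds with WatchdogSec=5, in the same ballpark as the 3 seconds mentioned above, and about 30 seconds with WatchdogSec=60.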

Workaround for now:  Set WatchdogSec to something much higher.  Having thought about this for approximately one minute, I'd suggest 60. :)
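
One way to apply this without editing the packaged unit file is a systemd drop-in. This is only a sketch: the unit name (condor.service) and the drop-in file name are assumptions, so check what your packaging actually installs.

```ini
# /etc/systemd/system/condor.service.d/watchdog.conf
# (assumed path; the directory name must match your HTCondor unit name)
[Service]
WatchdogSec=60
```

After creating the file, run "systemctl daemon-reload" and restart the service so the new timeout takes effect.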


Cheers,
-zach


> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
> Of Brian Bockelman
> Sent: Friday, July 29, 2016 4:12 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes
> 
> 
> Hi Michael,
> 
> It's not really a systemd issue. Condor's config file puts in place a
> directive of "if I haven't responded in the last 5 seconds, then consider
> me deadlocked and kill off all my processes."
> 
> HTCondor asked, systemd listened.
> 
> Sent from my iPhone
> 
> On Jul 29, 2016, at 4:00 PM, Michael V Pelletier
> <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:
> 
> 
> 
> 	This seems to be another example of how systemd doesn't seem to
> 	acknowledge decades of UNIX-derived system management experience
> 	that people have accumulated over the years.
> 
> 	http://suckless.org/sucks/systemd
> 
> 	The philosophy of UNIX is, in part, "write programs that do one
> 	thing, and do it well."
> 
> 
> 
> Which is one reason why systemd is broken up into, what, a dozen different
> daemons?
> 
> One thing that sysvinit does poorly is manage services. That's why many
> (most?) commercial POSIX implementations have abandoned it.
> 
> In fact, that sysvinit didn't keep up the "and do it well" half of the
> sentence is one of the motivations for the condor_master.  It is
> encouraging to me that all the features the HTCondor team found missing
> from sysvinit have now made it into RHEL7's service management framework.
> 
> Brian
> 
> 
> 
> 	I hope I don't loathe it too much when I finally get around to
> 	installing a CentOS 7 VM.
> 
> 	        -Michael Pelletier.