Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes
- Date: Fri, 29 Jul 2016 08:54:47 +0000
- From: andrew.lahiff@xxxxxxxxxx
- Subject: Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes
I switched over to 3 different DNS servers yesterday morning, and since then the problem hasn't occurred again.
Note that we have DNS Nagios tests which check whether DNS lookups take more than 0.5s, and there haven't been any alarms at all in recent weeks. However, these are of course not triggered by a single slow DNS lookup (I think they need 5 lookups slower than 0.5s within a 20-minute period).
Even though the problem seems to have now been fixed, I'll probably increase the watchdog timeout anyway, as I don't like the idea that a single slow DNS lookup can kill all jobs on a worker node.
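To illustrate why a windowed alarm of this kind can stay silent while a single stalled lookup still trips the watchdog, here is a minimal Python sketch. The rule (5 lookups slower than 0.5s within 20 minutes) is taken from the recollection above and may not match the real Nagios check; the one-probe-per-minute schedule and the function name `alarm_fires` are assumptions made up for the example.

```python
from collections import deque

def alarm_fires(samples, threshold=0.5, needed=5, window=20 * 60):
    """samples: (timestamp_seconds, lookup_seconds) pairs in time order.

    Returns True if `needed` lookups slower than `threshold` ever fall
    within `window` seconds of each other.
    """
    slow = deque()  # timestamps of recent slow lookups
    for ts, duration in samples:
        if duration > threshold:
            slow.append(ts)
        # Forget slow lookups that have aged out of the window.
        while slow and ts - slow[0] > window:
            slow.popleft()
        if len(slow) >= needed:
            return True
    return False

# Hypothetical probe data: one probe per minute for 30 minutes.
# Case 1: a single probe stalls for 6 s (longer than the 5 s watchdog limit).
single_stall = [(60 * i, 6.0 if i == 10 else 0.05) for i in range(30)]
# Case 2: five consecutive probes are moderately slow.
sustained = [(60 * i, 0.8 if 10 <= i < 15 else 0.05) for i in range(30)]

print(alarm_fires(single_stall))  # False
print(alarm_fires(sustained))     # True
```

The first case is the one that matters here: one lookup slow enough to exceed a 5s watchdog limit never reaches the count of five, so the monitoring stays green.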
From: Lahiff, Andrew (STFC,RAL,PPD)
Sent: Wednesday, July 27, 2016 9:47 PM
To: HTCondor-Users Mail List
Subject: RE: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes
I'm not aware of any DNS issues at the moment, but I'll ask around tomorrow. We are meant to be (finally) moving from the RAL site DNS servers to our own soon, so the first thing I could try would be to switch the SL7 worker nodes over to using those to see if that helps (even though the new DNS servers are not yet production services). If that doesn't help I'll try increasing the watchdog timeout until the DNS slowness can be sorted out.
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Brian Bockelman [bbockelm@xxxxxxxxxxx]
Sent: Wednesday, July 27, 2016 9:13 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes
So, systemd is killing off all the HTCondor processes because the condor_master is unresponsive. So, ignore all the tracebacks besides the master: they’re effectively saying the daemon was idle when it got a SIGWHATEVER.
Judging from your condor_master stack trace, your condor instance is waiting on the results of a DNS query (which is blocking in HTCondor). If you can figure out why your DNS is acting up, you can probably prevent the watchdog timeout from triggering. Alternatively, you can bump up the watchdog timeout (but that’s avoiding the issue - a slow DNS server can do other Bad Things).
I’ve also noticed that HTCondor can leave around Docker job turds on the host. File a ticket for that too, it ought to be solvable.
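One way to look for the kind of stall described above is to time individual blocking lookups from the worker node and compare the worst one against the watchdog limit. The sketch below is only an illustration: it uses Python's standard `socket.getaddrinfo`, not anything from HTCondor, and the names `time_lookup`, `worst_lookup` and the host list are hypothetical. The 5 second figure comes from the "limit 5s" in the systemd log quoted below.

```python
import socket
import time

WATCHDOG_LIMIT = 5.0  # seconds, from "watchdog timeout (limit 5s)" in the log

def time_lookup(host, resolver=socket.getaddrinfo):
    """Return how long one blocking lookup of `host` takes, in seconds."""
    start = time.monotonic()
    try:
        resolver(host, None)
    except OSError:
        # A failed lookup still costs the caller the time it took.
        pass
    return time.monotonic() - start

def worst_lookup(hosts, resolver=socket.getaddrinfo):
    """Return (host, seconds) for the slowest lookup among `hosts`."""
    timings = [(host, time_lookup(host, resolver)) for host in hosts]
    return max(timings, key=lambda pair: pair[1])

if __name__ == "__main__":
    # Replace with names the worker node actually resolves (assumption).
    host, seconds = worst_lookup(["localhost"])
    verdict = "exceeds" if seconds > WATCHDOG_LIMIT else "is within"
    print(f"slowest lookup: {host} took {seconds:.3f}s, "
          f"which {verdict} the {WATCHDOG_LIMIT:.0f}s limit")
```

Running this repeatedly on a busy node and keeping only the worst result would catch the single slow lookup that an averaged or windowed check misses.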
> On Jul 27, 2016, at 2:59 PM, andrew.lahiff@xxxxxxxxxx wrote:
> I'm experiencing problems with HTCondor 8.5.5 on SL7 running only Docker universe jobs. Occasionally all HTCondor daemons on a worker node do a stack dump and die, including the master, startd and all starters.
> An interesting side effect of this is that while HTCondor deletes the job sandboxes, the Docker containers actually continue running, but HTCondor seems unaware of this, and therefore eventually starts running another set of jobs. So I end up with twice as many jobs running as there should be on an affected worker node, half of which are no longer under HTCondor's control.
> In /var/log/messages is this (it sometimes happens several times consecutively):
> 2016-07-27T20:20:09.881844+01:00 lcg1879 systemd: condor.service watchdog timeout (limit 5s)!
> 2016-07-27T20:20:10.068139+01:00 lcg1879 systemd: condor.service: main process exited, code=killed, status=6/ABRT
> 2016-07-27T20:20:10.157858+01:00 lcg1879 systemd: Unit condor.service entered failed state.
> 2016-07-27T20:20:10.158120+01:00 lcg1879 systemd: condor.service failed.
> 2016-07-27T20:20:15.381427+01:00 lcg1879 systemd: condor.service holdoff time over, scheduling restart.
> In /var/log/condor/StartLog is this:
> Stack dump for process 27243 at timestamp 1469647209 (9 frames)
> while every StarterLog has something like this:
> Stack dump for process 1211947 at timestamp 1469647209 (9 frames)
> and finally the MasterLog:
> Stack dump for process 27213 at timestamp 1469647209 (29 frames)
> Has anyone else seen this? It's not obvious to me from the timestamps in the logs whether it was systemd that killed all the HTCondor daemons due to the watchdog timeout (I guess it's probably this?) or whether everything died first and systemd then noticed. It only seems to happen when a worker node is very busy (i.e. I've never seen this happen on idle SL7 worker nodes).
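On the timestamp question, the stack dumps record epoch seconds while /var/log/messages uses ISO 8601 with a UTC offset, so the two are easier to compare after conversion. The following Python sketch uses only the two values quoted above; the helper name `dump_vs_watchdog` is made up for the example.

```python
from datetime import datetime

def dump_vs_watchdog(dump_epoch, watchdog_iso):
    """Return (dump time in the log's timezone, seconds from dump to log line)."""
    watchdog_time = datetime.fromisoformat(watchdog_iso)
    # Express the epoch timestamp in the same UTC offset as the syslog line.
    dump_time = datetime.fromtimestamp(dump_epoch, tz=watchdog_time.tzinfo)
    return dump_time, (watchdog_time - dump_time).total_seconds()

dump_time, delta = dump_vs_watchdog(
    1469647209,                          # "Stack dump ... at timestamp"
    "2016-07-27T20:20:09.881844+01:00",  # systemd watchdog timeout line
)
print(dump_time.isoformat())  # 2016-07-27T20:20:09+01:00
print(delta)                  # 0.881844
```

The dumps and the watchdog message land in the same second. Because the stack dump timestamp has only one-second resolution, this alone cannot prove which came first, but it shows the two events are at most a second apart.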
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> The archives can be found at: