Re: [HTCondor-users] Monitoring condor nodes with hobbit
- Date: Wed, 8 Jan 2014 15:15:40 -0600
- From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Monitoring condor nodes with hobbit
It's worth noting that HTCondor 8.1 now forwards all the stats Lans references (and more) to Ganglia "out-of-the-box" (for some value of "out-of-the-box").
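Enabling that forwarding amounts to running the condor_gangliad daemon somewhere in the pool. A minimal sketch of the configuration, based on the 8.1-era knobs (verify the macro names and defaults against the manual for your version):

```
# Run the gangliad on one machine in the pool (often the central manager).
DAEMON_LIST = $(DAEMON_LIST) GANGLIAD

# Directory of metric definition files the gangliad reads; the path
# below is the common packaged default, not something you must set.
GANGLIAD_METRICS_CONFIG_DIR = /etc/condor/ganglia.d
```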
However, I think you're referring more to health monitoring, right? Other than periodic probing (I run "condor_q -const false", for example), I can't think of anything overly clever.
As Ben mentioned, even periodic probing can be tough as it's difficult to differentiate "very busy" from "not responding".
On Jan 8, 2014, at 9:54 AM, Lans Carstensen <Lans.Carstensen@xxxxxxxxxxxxxx> wrote:
> Aside from setting up up/down monitoring to ensure that your
> collectors are healthy, submissions are working, and startd nodes
> haven't fallen out of a pool - the real monitoring value that's been
> added in the last few years is in operational statistics included in
> the negotiator and schedd daemon classads. It's worth your while to
> collect and graph some of those stats. I covered a couple of the graphs
> we use at HTCondorWeek a couple of years ago.
> -- Lans Carstensen
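A sketch of collecting a few of those schedd classad statistics as flat "metric value timestamp" lines for a graphing tool. The attribute names (TotalRunningJobs, TotalIdleJobs, RecentDaemonCoreDutyCycle) exist in recent HTCondor releases, but run "condor_status -schedd -long" to see exactly what your pool publishes; the metric prefix is an arbitrary choice:

```shell
#!/bin/sh
# Poll each schedd ad and emit one line per statistic, in the
# "metric value timestamp" shape Graphite-style collectors accept.
condor_status -schedd -autoformat Name TotalRunningJobs TotalIdleJobs RecentDaemonCoreDutyCycle |
while read name running idle duty; do
    ts=$(date +%s)
    # Flatten the schedd name into a dotted metric path component.
    prefix="condor.schedd.$(echo "$name" | tr '.@' '__')"
    echo "$prefix.running_jobs $running $ts"
    echo "$prefix.idle_jobs $idle $ts"
    echo "$prefix.duty_cycle $duty $ts"
done
```

The same pattern works for the negotiator ad (condor_status -negotiator) if you want cycle-duration statistics as well.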
> On Wed, Jan 8, 2014 at 6:40 AM, Cody Belcher <codytrey@xxxxxxxxxxxxxxxx> wrote:
>> I suppose my end goal is to easily see when a node has an issue, but you
>> are right, I do get emails when, say, the schedd crashes. Without any
>> extra configuration I can use hobbit to see which hosts are up, and that
>> will work for my needs.
>> On 01/08/2014 08:13 AM, Ben Cotton wrote:
>>> When I was at Purdue, I tried monitoring HTCondor servers (i.e. not
>>> execute nodes) with Nagios. I eventually removed the checks because
>>> they didn't add value. The condor_master does a good job of making
>>> sure the daemons are running. I did get alerts for the schedd checks,
>>> but they turned out to be false alarms when the schedd was just too
>>> busy to answer the condor_q from Nagios. (I suppose that's an issue in
>>> itself, but it wasn't what we were checking for).
>>> I guess the point of this story is to ask what exactly you want to
>>> check and why. Knowing that makes it easier to offer guidance.
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> The archives can be found at: