Re: [HTCondor-users] Monitoring condor nodes with hobbit



Hi Cody,

It's worth noting that HTCondor 8.1 now forwards all the stats Lans references (and more) to Ganglia "out-of-the-box" (for some value of "out-of-the-box").

However, I think you're referring more to health monitoring, right?  Other than periodic probing (I run "condor_q -const false", for example), I can't think of anything especially clever.

As Ben mentioned, even periodic probing can be tough as it's difficult to differentiate "very busy" from "not responding".
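
For what it's worth, here is a minimal sketch of that kind of probe in Python. It is not anything HTCondor ships: the function names, the timeout, and the "alert only after several misses in a row" rule are all assumptions of mine, and the last one is just one way to soften the busy-versus-dead ambiguity, not a fix for it.

```python
import subprocess


def probe(cmd, timeout_s):
    """Run a probe command and classify the outcome.

    Returns "ok" if the command exits with status 0 within the timeout,
    "no-answer" if it is still running when the timeout expires, and
    "error" if it exits non-zero or cannot be started at all.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Busy or hung: a single timeout cannot tell the two apart.
        return "no-answer"
    except OSError:
        # For example, the binary is not installed on this host.
        return "error"
    return "ok" if result.returncode == 0 else "error"


def should_alert(history, max_consecutive):
    """Alert only after max_consecutive non-"ok" results in a row.

    history is a list of probe() results, oldest first (hypothetical
    bookkeeping; the monitoring system would normally keep this).
    """
    recent = history[-max_consecutive:]
    return len(recent) == max_consecutive and all(r != "ok" for r in recent)
```

You would call it as probe(["condor_q", "-const", "false"], 30) from whatever scheduler drives your checks, and feed the results to should_alert with a threshold that suits how long your schedd can plausibly stay busy.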

Brian

On Jan 8, 2014, at 9:54 AM, Lans Carstensen <Lans.Carstensen@xxxxxxxxxxxxxx> wrote:

> Aside from setting up up/down monitoring to ensure that your
> collectors are healthy, submissions are working, and startd nodes
> haven't fallen out of a pool - the real monitoring value that's been
> added in the last few years is in operational statistics included in
> the negotiator and schedd daemon classads.  It's worth your while to
> collect and graph some of those stats.  I covered a couple of the graphs
> we use at HTCondor Week a couple of years ago.
> 
> http://research.cs.wisc.edu/htcondor/CondorWeek2012/presentations/carstensen-dreamworks.pdf
> 
> -- Lans Carstensen
> 
> On Wed, Jan 8, 2014 at 6:40 AM, Cody Belcher <codytrey@xxxxxxxxxxxxxxxx> wrote:
>> I suppose my end goal is to easily see when a node has an issue, but you are
>> right, I do get emails when, say, the schedd crashes or something. Without any
>> extra configuration I can use hobbit to see which hosts are on, and that
>> will work for my needs.
>> 
>> Thanks,
>> 
>> Cody
>> 
>> 
>> On 01/08/2014 08:13 AM, Ben Cotton wrote:
>>> 
>>> Cody,
>>> 
>>> When I was at Purdue, I tried monitoring HTCondor servers (i.e. not
>>> execute nodes) with Nagios. I eventually removed the checks because
>>> they didn't add value. The condor_master does a good job of making
>>> sure the daemons are running. I did get alerts for the schedd checks,
>>> but they turned out to be false alarms when the schedd was just too
>>> busy to answer the condor_q from Nagios. (I suppose that's an issue in
>>> itself, but it wasn't what we were checking for).
>>> 
>>> I guess the point of this story is to ask what exactly you want to
>>> check and why. Knowing that makes it easier to offer guidance.
>>> 
>>> 
>>> Thanks,
>>> BC
>>> 
>> 
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>> 
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/