
Re: [HTCondor-users] Monitoring condor nodes with hobbit



On 1/8/2014 3:15 PM, Brian Bockelman wrote:
Hi Cody,

It's worth noting that HTCondor 8.1 now forwards all the stats Lans references (and more) to Ganglia "out-of-the-box" (for some value of "out-of-the-box").

However, I think you're referring more to health monitoring, right?  Other than periodic probing (I do "condor_q -const false", for example), I can't think of anything overly clever.
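A minimal sketch of such a periodic probe, wrapped in a hard timeout so a busy daemon can't hang the monitor (the timeout values and the `probe` helper are assumptions, not part of any HTCondor tooling):

```shell
# probe CMD [TIMEOUT] - run a query command under a timeout and report
# one of: ok / timeout / down.  GNU coreutils `timeout` exits 124 when
# the command is killed for running too long.
probe() {
  timeout "${2:-30}" $1 >/dev/null 2>&1
  case $? in
    0)   echo "ok" ;;
    124) echo "timeout" ;;
    *)   echo "down" ;;
  esac
}

# in a real check this would be something like:
#   probe "condor_q -const false" 30
probe "true" 5        # prints "ok"
probe "sleep 10" 1    # prints "timeout"
probe "false" 5       # prints "down"
```

Distinguishing "timeout" from "down" at least lets the monitor report "very busy" separately from "not responding", even if it can't fix the ambiguity Ben mentions below.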


Just a quick note - for simply detecting startd nodes that have fallen out of the pool (or temporarily taken out of the pool with condor_off), you could configure HTCondor to show you such absent nodes via "condor_status -absent". This can show all the machines that reported to the collector in the past X days, but are currently not reporting.

In the manual see http://goo.gl/sxqa5h and http://goo.gl/0Q4r6F for more info.
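For reference, the collector-side configuration looks roughly like this (knob names are from the manual's absent-ClassAds documentation; the path and expiry value here are just illustrative):

```
# Mark machine ads as "absent" instead of deleting them when they
# stop reporting, and persist them across collector restarts.
ABSENT_REQUIREMENTS = True
COLLECTOR_PERSISTENT_AD_LOG = /var/lib/condor/spool/collector_persistent.ad_log
ABSENT_EXPIRE_ADS_AFTER = 2592000    # keep absent ads ~30 days (seconds)
```

With that in place, "condor_status -absent" lists the machines that used to report but currently don't.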

regards,
Todd



As Ben mentioned, even periodic probing can be tough as it's difficult to differentiate "very busy" from "not responding".

Brian

On Jan 8, 2014, at 9:54 AM, Lans Carstensen <Lans.Carstensen@xxxxxxxxxxxxxx> wrote:

Aside from setting up up/down monitoring to ensure that your
collectors are healthy, submissions are working, and startd nodes
haven't fallen out of a pool - the real monitoring value that's been
added in the last few years is in the operational statistics included in
the negotiator and schedd daemon classads.  It's worth your while to
collect and graph some of those stats.  I covered a couple of the graphs
we use at HTCondorWeek a couple of years ago:

http://research.cs.wisc.edu/htcondor/CondorWeek2012/presentations/carstensen-dreamworks.pdf
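A rough sketch of feeding a few of those schedd-ad stats to a Graphite-style collector (the metric path layout and the `to_graphite` helper are assumptions; TotalRunningJobs and TotalIdleJobs are standard schedd ClassAd attributes):

```shell
# emit one graphite-style "path value timestamp" line per stat;
# TS can be pinned for testing, otherwise use the current time
to_graphite() {   # args: schedd-name running idle
  ts=${TS:-$(date +%s)}
  echo "condor.schedd.$1.running $2 $ts"
  echo "condor.schedd.$1.idle $3 $ts"
}

# real usage would pipe condor_status output through it, e.g.:
#   condor_status -schedd -af Name TotalRunningJobs TotalIdleJobs |
#     while read name run idle; do to_graphite "$name" "$run" "$idle"; done
# (hostnames contain dots, which graphite treats as path separators -
#  you'd likely want to substitute them first)

TS=0 to_graphite submit01 12 3
```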

-- Lans Carstensen

On Wed, Jan 8, 2014 at 6:40 AM, Cody Belcher <codytrey@xxxxxxxxxxxxxxxx> wrote:
I suppose my end goal is to easily see when a node has an issue, but you are
right, I do get emails when, say, the schedd crashes or something. Without any
extra configuration I can use hobbit to see which hosts are up, and that
will work for my needs.

Thanks,

Cody


On 01/08/2014 08:13 AM, Ben Cotton wrote:

Cody,

When I was at Purdue, I tried monitoring HTCondor servers (i.e. not
execute nodes) with Nagios. I eventually removed the checks because
they didn't add value. The condor_master does a good job of making
sure the daemons are running. I did get alerts for the schedd checks,
but they turned out to be false alarms when the schedd was just too
busy to answer the condor_q from Nagios. (I suppose that's an issue in
itself, but it wasn't what we were checking for).

I guess the point of this story is to ask what exactly you want to
check and why. Knowing that makes it easier to offer guidance.


Thanks,
BC


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/




--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685