
Re: [HTCondor-users] Monitoring condor nodes with hobbit

Aside from setting up up/down monitoring to ensure that your
collectors are healthy, submissions are working, and startd nodes
haven't fallen out of the pool, the real monitoring value added in
the last few years is in the operational statistics included in
the negotiator and schedd daemon ClassAds.  It's worth your while to
collect and graph some of those stats.  I covered some of the graphs
we use at HTCondor Week a couple of years ago.
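
As a rough illustration of collecting those stats, here is a minimal
sketch that parses the text printed by `condor_status -schedd -long`
and keeps a few numeric attributes for graphing. The helper names
(parse_long_ads, numeric_stats, collect_schedd_stats) are hypothetical,
and the attribute list is an assumption: check which statistics your
HTCondor version actually publishes before relying on any of them.

```python
import subprocess

# Assumed attribute names, for illustration only; adjust to what your pool publishes.
WANTED = ("TotalRunningJobs", "TotalIdleJobs", "RecentDaemonCoreDutyCycle")


def parse_long_ads(text):
    """Split `-long` style output into a list of dicts, one per ClassAd.

    Ads are separated by blank lines; each line has the form `Attr = value`.
    """
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition(" = ")
        if sep:
            # Drop the surrounding quotes on string values.
            current[name] = value.strip('"')
    if current:
        ads.append(current)
    return ads


def numeric_stats(ad, wanted=WANTED):
    """Return the wanted attributes that are present and numeric, as floats."""
    stats = {}
    for name in wanted:
        try:
            stats[name] = float(ad[name])
        except (KeyError, ValueError):
            pass  # attribute missing or not a plain number; skip it
    return stats


def collect_schedd_stats():
    """Query the pool's schedd ads and return {schedd name: stats dict}."""
    result = subprocess.run(
        ["condor_status", "-schedd", "-long"],
        check=True, capture_output=True, text=True,
    )
    return {
        ad.get("Name", "unknown"): numeric_stats(ad)
        for ad in parse_long_ads(result.stdout)
    }
```

Running collect_schedd_stats() from cron and feeding the result to
whatever graphing tool you already use would be one way to start; the
same parsing should work for the negotiator ad via `condor_status
-negotiator -long`, with a different attribute list.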


-- Lans Carstensen

On Wed, Jan 8, 2014 at 6:40 AM, Cody Belcher <codytrey@xxxxxxxxxxxxxxxx> wrote:
> I suppose my end goal is to easily see when a node has an issue, but you are
> right, I do get emails when, say, the schedd crashes or something. Without any
> extra configuration I can use hobbit to see which hosts are on, and that
> will work for my needs.
> Thanks,
> Cody
> On 01/08/2014 08:13 AM, Ben Cotton wrote:
>> Cody,
>> When I was at Purdue, I tried monitoring HTCondor servers (i.e. not
>> execute nodes) with Nagios. I eventually removed the checks because
>> they didn't add value. The condor_master does a good job of making
>> sure the daemons are running. I did get alerts for the schedd checks,
>> but they turned out to be false alarms when the schedd was just too
>> busy to answer the condor_q from Nagios. (I suppose that's an issue in
>> itself, but it wasn't what we were checking for).
>> I guess the point of this story is to ask what exactly you want to
>> check and why. Knowing that makes it easier to offer guidance.
>> Thanks,
>> BC