Re: [Condor-users] host failure detection
- Date: Wed, 2 Aug 2006 17:17:11 +0100
- From: "Matt Hope" <matthew.hope@xxxxxxxxx>
- Subject: Re: [Condor-users] host failure detection
On 8/2/06, Michael Thomas <thomas@xxxxxxxxxxxxxxx> wrote:
> While disk failures may be the biggest black hole cause, I'm also
> interested in looking at a general solution.
>
> > On the Dell machines this event mechanism is very fast (seconds)
> > whereas on the HP's it can be as much as 5 mins.
>
> Even a 5 minute delay would be preferable to the situation I have now.
> But I see your point in using a system-level error checking script which
> can automatically update the condor classad for that machine. I got the
> same suggestion on another list.
>
> Additionally, one thing I would really like to see is a way to get these
> per-host statistics into a higher level monitoring infrastructure like
> MonALISA. I already monitor the cluster load, network IO, and per-VO
> jobs in MonALISA. If condor provided a way to obtain # jobs completed
> per node, and average time to completion per node, it would help me to
> detect both a black hole and underperforming nodes at the same time.
Have you looked at Hawkeye?
I don't use it myself but a lot of others on this list do...
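
For the per-node numbers themselves (jobs completed and average time to completion), here is a rough sketch of the bookkeeping. It assumes you can already extract (host, runtime in seconds) pairs for completed jobs from somewhere, for example your job logs. The function names and the min_jobs and ratio thresholds are made up for illustration; none of this is a Condor or MonALISA API.

```python
from collections import defaultdict
from statistics import median


def per_node_stats(records):
    """Summarise completed jobs per host.

    records: iterable of (host, runtime_seconds) pairs (assumed input format).
    Returns {host: {"completed": int, "avg_runtime": float}}.
    """
    runtimes = defaultdict(list)
    for host, runtime in records:
        runtimes[host].append(runtime)
    return {
        host: {"completed": len(times), "avg_runtime": sum(times) / len(times)}
        for host, times in runtimes.items()
    }


def suspect_black_holes(stats, min_jobs=20, ratio=0.1):
    """Return hosts that completed many jobs suspiciously quickly.

    A host is flagged when it has at least min_jobs completions and its
    average runtime is below ratio times the median of the per-host averages.
    Both thresholds are illustrative guesses, not tuned values.
    """
    if not stats:
        return []
    cluster_median = median(s["avg_runtime"] for s in stats.values())
    return sorted(
        host
        for host, s in stats.items()
        if s["completed"] >= min_jobs and s["avg_runtime"] < ratio * cluster_median
    )
```

The same per-host averages could be compared against the median in the other direction (well above it rather than well below) to pick out underperforming nodes.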