Re: [Condor-users] host failure detection



On 8/2/06, Michael Thomas <thomas@xxxxxxxxxxxxxxx> wrote:
> While disk failures may be the biggest black hole cause, I'm also
> interested in looking at a general solution.
>
> > On the Dell machines this event mechanism is very fast (seconds)
> > whereas on the HP's it can be as much as 5 mins.
>
> Even a 5 minute delay would be preferable to the situation I have now.
> But I see your point in using a system-level error checking script which
> can automatically update the condor classad for that machine.  I got the
> same suggestion on another list.
>
> Additionally, one thing I would really like to see is a way to get these
> per-host statistics into a higher level monitoring infrastructure like
> MonALISA.  I already monitor the cluster load, network IO, and per-VO
> jobs in MonALISA.  If condor provided a way to obtain # jobs completed
> per node, and average time to completion per node, it would help me to
> detect both a black hole and underperforming nodes at the same time.

Have you looked at Hawkeye?

http://www.cs.wisc.edu/condor/hawkeye/

I don't use it myself but a lot of others on this list do...
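
For the system-level check script mentioned above, here is a rough
sketch of the kind of module I have in mind. It assumes the startd is
set up to run the script periodically (Hawkeye / startd cron style) and
to merge each "Attribute = Value" line it prints into the machine
ClassAd. The attribute names NodeHealthy and NodeDiskFreeFraction, the
5% threshold, and the choice of "/" as the path to check are all made
up for illustration, so adjust them to whatever your site needs.

```python
import shutil


def health_attributes(free_bytes, total_bytes, min_free_fraction):
    """Build ClassAd-style 'Name = Value' lines from disk numbers.

    The attribute names are hypothetical; nothing in Condor defines them.
    """
    if total_bytes <= 0:
        # Could not size the filesystem at all: report the node as unhealthy.
        free_fraction = 0.0
        healthy = False
    else:
        free_fraction = free_bytes / total_bytes
        healthy = free_fraction >= min_free_fraction
    return [
        "NodeHealthy = %s" % ("True" if healthy else "False"),
        "NodeDiskFreeFraction = %.3f" % free_fraction,
    ]


if __name__ == "__main__":
    # Assumed check: free space on the root filesystem, 5% minimum.
    usage = shutil.disk_usage("/")
    for line in health_attributes(usage.free, usage.total, 0.05):
        print(line)
```

The machine's START expression could then refer to NodeHealthy, so a
node that fails the check stops accepting jobs instead of eating them.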

Matt