
Re: [Condor-users] host failure detection



Matt Hope wrote:
> On 8/1/06, Michael Thomas <thomas@xxxxxxxxxxxxxxx> wrote:
> 
>>We recently had a disk problem with one of the 60 machines in our condor
>>pool that caused jobs to fail quickly.  As a result, most jobs ended up
>>landing on this node, which generated a large number of failed jobs out
>>of the total job submissions.  Unfortunately, we were not aware of this
>>failing node until we took a long look at the job output logs.
>>
>>What kind of tools does condor provide for monitoring things like:
>>* Average job time to completion per node
>>* Number of jobs processed per node
>>
>>Any sort of host-level monitoring information that we can get from
>>condor would be useful to plug into a higher-level monitoring system
>>like MonALISA, and allow us to detect such problems as they occur and
>>not days after the fact.
> 
> 
> We had a similar problem a while back.
> 
> Whilst general solutions are all nice, disk failure will almost
> certainly be the biggest cause of 'black holes' on your pool.
> Black holes are machines which accept a job but always fail to run it
> properly, often very fast, thus sending loads of your patiently queued
> jobs into a black hole.

While disk failures may be the biggest black hole cause, I'm also
interested in looking at a general solution.

> On the Dell machines this event mechanism is very fast (seconds)
> whereas on the HPs it can be as much as 5 minutes.

Even a 5 minute delay would be preferable to the situation I have now.
But I see your point in using a system-level error checking script which
can automatically update the condor classad for that machine.  I got the
same suggestion on another list.
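
To make the idea concrete, here is a minimal sketch of such a check. It
only tests whether a small file can be written, synced and read back in a
given directory, then prints the result as a ClassAd-style "Name = Value"
line. The attribute name NODE_DISK_OK and both function names are made up
for illustration, and how the output is fed into the machine's classad is
site configuration that is not shown here.

```python
import os
import tempfile


def disk_is_writable(directory):
    """Return True if a small file can be written, synced and read back."""
    payload = b"condor-health-check"
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # force the data out to the disk
            handle.seek(0)
            return handle.read() == payload
    except OSError:
        # Missing directory, read-only filesystem, I/O error, disk full...
        return False


def classad_line(name, healthy):
    """Format a boolean result as a ClassAd-style attribute line."""
    return f"{name} = {'True' if healthy else 'False'}"


if __name__ == "__main__":
    # NODE_DISK_OK is a hypothetical attribute name, not a Condor built-in.
    print(classad_line("NODE_DISK_OK", disk_is_writable(tempfile.gettempdir())))
```

A check like this only catches failures that show up as write or read
errors; a disk that is merely slow would need a timing test on top.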

Additionally, one thing I would really like to see is a way to get these
per-host statistics into a higher-level monitoring infrastructure like
MonALISA.  I already monitor the cluster load, network IO, and per-VO
jobs in MonALISA.  If condor provided a way to obtain the number of jobs
completed per node and the average time to completion per node, it would
help me to detect both black holes and underperforming nodes at the same
time.
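
As a rough sketch of the statistics I have in mind: given job records
already pulled out of the job logs as (host, runtime in seconds, exit
code) tuples, the per-node job count and average time to completion are
a simple aggregation, and a node with many jobs that all fail quickly
stands out. The record layout, the function names and the thresholds
below are all assumptions chosen for illustration; extracting the
records from condor is not shown.

```python
from collections import defaultdict


def per_host_stats(jobs):
    """Aggregate (host, runtime_seconds, exit_code) records per host."""
    totals = defaultdict(lambda: {"jobs": 0, "failed": 0, "runtime": 0.0})
    for host, runtime, exit_code in jobs:
        entry = totals[host]
        entry["jobs"] += 1
        entry["runtime"] += runtime
        if exit_code != 0:
            entry["failed"] += 1
    return {
        host: {
            "jobs": e["jobs"],
            "failed": e["failed"],
            "avg_runtime": e["runtime"] / e["jobs"],
        }
        for host, e in totals.items()
    }


def suspected_black_holes(stats, min_jobs=5, max_avg_runtime=60.0,
                          min_failure_rate=0.9):
    """Return hosts that ran many jobs, failed almost all, and did so fast.

    The default thresholds are arbitrary placeholders to be tuned per pool.
    """
    return sorted(
        host
        for host, s in stats.items()
        if s["jobs"] >= min_jobs
        and s["avg_runtime"] <= max_avg_runtime
        and s["failed"] / s["jobs"] >= min_failure_rate
    )


if __name__ == "__main__":
    # Made-up sample data: one healthy node and one that fails in seconds.
    sample = [("node01", 3600, 0)] * 3 + [("node07", 5, 1)] * 10
    stats = per_host_stats(sample)
    print(stats)
    print(suspected_black_holes(stats))
```

The same per-host dictionary could be pushed to MonALISA as-is, so the
black hole check and the underperforming-node check share one data feed.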

--Mike
