Re: [Condor-users] host failure detection

On 8/1/06, Michael Thomas <thomas@xxxxxxxxxxxxxxx> wrote:
We recently had a disk problem with one of the 60 machines in our condor
pool that caused jobs to fail quickly.  As a result, most jobs ended up
landing on this node, which generated a large number of failed jobs out
of the total job submissions.  Unfortunately, we were not aware of this
failing node until we took a long look at the job output logs.

What kind of tools does condor provide for monitoring things like:
* Average job time to completion per node
* Number of jobs processed per node

Any sort of host-level monitoring information that we can get from
condor would be useful to plug into a higher-level monitoring system
like MonALISA, and allow us to detect such problems as they occur and
not days after the fact.

We had a similar problem a while back.

General solutions are all very well, but disk failure will almost
certainly be the biggest cause of 'black holes' in your pool.
Black holes are machines which accept a job but always fail to run it
properly, often very quickly, thus sucking loads of your patiently
queued jobs into the black hole.
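
One way to spot a black hole from the outside is to look at per-machine
job statistics: a host that takes far more jobs than its peers, with very
short run times and a high failure rate, is suspect. The following is a
minimal Python sketch, not something Condor provides. The record layout
(host, run time in seconds, exit code) and the thresholds are assumptions
for illustration, and the records would have to be extracted from your
own job logs.

```python
from collections import defaultdict


def summarize_by_host(records):
    """Aggregate per-host job count, average run time and failure rate.

    records: iterable of (host, runtime_seconds, exit_code) tuples.
    This layout is an assumption, not a Condor format.
    """
    stats = defaultdict(lambda: {"jobs": 0, "failed": 0, "total_runtime": 0.0})
    for host, runtime, exit_code in records:
        entry = stats[host]
        entry["jobs"] += 1
        entry["total_runtime"] += runtime
        if exit_code != 0:
            entry["failed"] += 1

    summary = {}
    for host, entry in stats.items():
        summary[host] = {
            "jobs": entry["jobs"],
            "avg_runtime": entry["total_runtime"] / entry["jobs"],
            "failure_rate": entry["failed"] / entry["jobs"],
        }
    return summary


def suspect_black_holes(summary, min_jobs=10, max_avg_runtime=60.0,
                        min_failure_rate=0.9):
    """Return hosts that ran many jobs, finished them fast, and mostly failed.

    The default thresholds are arbitrary placeholders; tune them per pool.
    """
    return sorted(
        host
        for host, s in summary.items()
        if s["jobs"] >= min_jobs
        and s["avg_runtime"] <= max_avg_runtime
        and s["failure_rate"] >= min_failure_rate
    )
```

The per-host summary also gives the two figures asked about above (jobs
processed per node and average time to completion per node), so the same
numbers could be fed to an external monitoring system.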

If you control the machines in your pool directly and they all come from
the same few vendors then it is likely they will have some form of
management software for their hardware
(e.g. we have a bunch of Dells and a bunch of HPs).
This normally provides some means of triggering events on disk failure.
I got our systems people to set up a script, executed on a machine with
admin privileges over the farm, which gracefully shuts down Condor on
the offending machine (so if one half of a RAID disk fails then the
currently happy jobs get to complete nicely).
It also alters the machine's config to make it 'look' different
(essentially it changes an internally defined machine attribute).
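
As a rough illustration of the "change an attribute in the config" step,
here is a minimal Python sketch that sets a `NAME = value` line in the
text of a config file, replacing an existing definition or appending a
new one. The attribute name DISK_HEALTHY is hypothetical, not the one we
actually use, and the sketch does not show how the attribute gets
advertised in the machine ad or how the daemons are told to re-read
their config; both depend on your setup.

```python
def set_config_attribute(config_text, name, value):
    """Return config_text with `name = value` set.

    An existing, uncommented definition of `name` is replaced in place;
    otherwise the definition is appended at the end.
    """
    new_line = f"{name} = {value}"
    lines = config_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.lstrip().startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not definitions
        key = line.split("=", 1)[0].strip()
        if key == name:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # DISK_HEALTHY is a made-up attribute name used only for this example.
    before = "START = True\nDISK_HEALTHY = True\n"
    print(set_config_attribute(before, "DISK_HEALTHY", "False"), end="")
```

Jobs (or the machine's own START expression) can then refer to the
attribute so that a machine marked as unhealthy no longer matches.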

On the Dell machines this event mechanism is very fast (seconds)
whereas on the HPs it can take as much as 5 minutes.

For the HP machines this is a little too slow, so one of the systems
guys suggested moving the condor binaries onto the same disk as the
jobs execute from. Thus if the disk dies there is a high probability
that the condor system will itself die (or at the least be very unlikely
to be able to fire up a starter, and thus unable to become a black hole).

This still leaves the machine to be tidied up but significantly reduces
the frustration.
It would be nice if condor could work out that the disk it is trying
to transfer files to is not working, though I accept this is not as
trivial as it sounds.
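
For what it is worth, a crude external check is easy to sketch: write a
small file into the directory, force it to disk, read it back, and treat
any error as a failure. The Python below is only an illustration under
that assumption; it is not something condor does, and the function name
and probe file prefix are made up. A half-dead disk can still pass, since
the read may be served from cache, which is part of why this is less
trivial than it sounds.

```python
import os
import tempfile


def directory_is_writable(path, payload=b"condor-disk-probe"):
    """Return True if a small file can be written to and read back from path."""
    try:
        # mkstemp fails with OSError if the directory is missing or read-only.
        fd, probe_path = tempfile.mkstemp(dir=path, prefix=".probe-")
    except OSError:
        return False
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data to the device, not just the cache
        with open(probe_path, "rb") as f:
            return f.read() == payload
    except OSError:
        return False
    finally:
        # Always try to clean up the probe file, ignoring errors on a bad disk.
        try:
            os.remove(probe_path)
        except OSError:
            pass
```

Something like this could be run periodically against the execute
directory, with the result used to trigger the same shutdown script as
the vendor disk-failure events.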