
[Condor-users] host failure detection



We recently had a disk problem with one of the 60 machines in our condor
pool that caused jobs to fail quickly.  As a result, most jobs ended up
landing on this node, which generated a large number of failed jobs out
of the total job submissions.  Unfortunately, we were not aware of this
failing node until we took a long look at the job output logs.

What kind of tools does condor provide for monitoring things like:
* Average job time to completion per node
* Number of jobs processed per node

Any sort of host-level monitoring information that we can get from
condor would be useful to plug into a higher-level monitoring system
like MonALISA, and allow us to detect such problems as they occur and
not days after the fact.
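
To make the two metrics above concrete, here is a minimal sketch of the
kind of per-node summary we have in mind. It assumes per-job records of
(host, runtime in seconds, exit code) have already been extracted from
somewhere, for example from condor_history output or the job logs; that
extraction step is not shown. The function names, the record layout, and
the thresholds are all hypothetical and only for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_host(records):
    """Aggregate (host, runtime_seconds, exit_code) tuples per host.

    The record layout is an assumption for this sketch, not a Condor format.
    """
    runtimes = defaultdict(list)
    failures = defaultdict(int)
    for host, runtime, exit_code in records:
        runtimes[host].append(runtime)
        if exit_code != 0:
            failures[host] += 1
    return {
        host: {
            "jobs": len(times),                           # jobs processed on this host
            "avg_runtime": mean(times),                   # average time to completion
            "failure_rate": failures[host] / len(times),  # fraction with nonzero exit
        }
        for host, times in runtimes.items()
    }


def suspect_hosts(summary, min_jobs=10, max_failure_rate=0.5):
    """Return hosts with enough jobs and a failure rate above the threshold.

    Both thresholds are arbitrary placeholders to be tuned per pool.
    """
    return sorted(
        host
        for host, stats in summary.items()
        if stats["jobs"] >= min_jobs and stats["failure_rate"] > max_failure_rate
    )


# Made-up sample data: node02 fails every job within a few seconds.
records = [
    ("node01", 3600, 0),
    ("node01", 3500, 0),
    ("node02", 5, 1),
    ("node02", 4, 1),
    ("node02", 6, 1),
]
summary = summarize_by_host(records)
print(suspect_hosts(summary, min_jobs=3))
```

A host that combines a high job count, a very short average runtime, and
a high failure rate is the pattern we saw with the bad disk, so numbers
like these are what we would want to feed into the monitoring system.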

--Mike
