
[Condor-users] Condor monitoring alternatives



Here at RIT, we've been building up a respectable Condor pool with some success, and we're now running into the problem of monitoring our clients. We have enough client machines that it is no longer practical to scan the condor_status output by eye to find "stragglers", so I'm looking for an automated solution.

I'm specifically looking for a lightweight alternative to Hawkeye, possibly something we could integrate into, or run alongside, our quick stats page (http://stats.rc.rit.edu/condor/).

Has anyone written a simple script or similar that contains a master list of machines that should be up and compares the output of condor_status to it? It seems to be something that would be very useful and I'm hoping I can reuse someone else's code.

Ideally, we want a lightweight webpage that lists the machines Condor is installed on (by hostname, IP, whatever) along with the status of each: up and running Condor, up but not running or not responding to Condor, or down. These states should be easy to determine by combining a ping test, the output of condor_status, and a master list of machines. My question is: has anyone done this?
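
To make the idea concrete, here is a rough Python sketch of the comparison I have in mind. It is only an illustration, not existing code: the function names and the one-hostname-per-line master list format are made up, it assumes the master list uses the same names that condor_status reports in the Machine attribute, and the ping flags assume a Linux ping.

```python
import subprocess

def parse_master_list(text):
    """Hostnames from a master list: one per line, '#' starts a comment (assumed format)."""
    hosts = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip().lower()
        if name:
            hosts.append(name)
    return hosts

def parse_condor_status(output):
    """Set of machine names, given one Machine value per line.

    A machine with several slots appears several times; the set removes repeats.
    """
    return {line.strip().lower() for line in output.splitlines() if line.strip()}

def classify(master, in_condor, pingable):
    """Map each host in the master list to one of the three states."""
    status = {}
    for host in master:
        if host in in_condor:
            status[host] = "up, running Condor"
        elif host in pingable:
            status[host] = "up, not responding to Condor"
        else:
            status[host] = "down"
    return status

def query_condor():
    """Ask the collector for machine names (needs condor_status on the PATH)."""
    result = subprocess.run(
        ["condor_status", "-format", "%s\\n", "Machine"],
        capture_output=True, text=True, check=True,
    )
    return parse_condor_status(result.stdout)

def is_pingable(host):
    """One ICMP echo with a 1 second timeout (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def check_pool(master_text):
    """Combine the master list, condor_status and ping into one status table."""
    master = parse_master_list(master_text)
    in_condor = query_condor()
    # Only ping hosts that are missing from condor_status.
    pingable = {h for h in master if h not in in_condor and is_pingable(h)}
    return classify(master, in_condor, pingable)
```

Something like check_pool(open("machines.txt").read()) run from cron could then write the resulting table out as a static HTML page; machines.txt is a placeholder name.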

Thanks,

Brent Strong
Research Computing at RIT