
[Condor-users] Condor and monitoring performance




Does anyone have suggestions on how to monitor the health of a Condor pool? I am trying to track down an error (see Q3 below) and would also like to develop a set of commands for monitoring Condor.

1. I found this URL, which is helpful, but there appear to be some issues on Windows.
URL: https://nmi.cs.wisc.edu/node/1481
I noticed that the binary for condor_updates_stats does not exist in Windows installations of Condor. Is this a mistake, or is it simply not available on Windows?

2. Does anyone have suggestions for querying Condor to help detect potential performance issues?

3. I am getting the following error, and I am not sure how to determine whether I need to modify my configuration or whether something else is wrong.
SchedLog excerpt:
06/30 18:53:04 (pid:1732) Received UDP command 60011 (DC_NOP) from  <xxx.xxx.xxx.xx:9608>, access level READ
06/30 18:53:04 (pid:1732) Calling HandleReq <handle_nop()> (0)
06/30 18:53:04 (pid:1732) Return from HandleReq <handle_nop()> (handler: 0.000s, sec: 0.371s)
06/30 18:53:04 (pid:1732) Calling Handler <SecManStartCommand::WaitForSocketCallback DC_INVALIDATE_KEY> (6)
06/30 18:53:04 (pid:1732) SECMAN: resuming command 60014 DC_INVALIDATE_KEY to daemon at <xxx.xxx.xxx.xx:4278> from TCP port 4371 (non-blocking, raw).
06/30 18:53:04 (pid:1732) SECMAN: TCP connection to daemon at <xxx.xxx.xxx.xx:4278> failed.
06/30 18:53:04 (pid:1732) Failed to send DC_INVALIDATE_KEY to daemon at <xxx.xxx.xxx.xx:4278>: SECMAN:2003:TCP connection to daemon at <xxx.xxx.xxx.xx:4278> failed.
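
To see how often failures like these occur and which daemons they involve, one option is to tally the SECMAN error codes in the log. Below is a minimal sketch in Python, assuming every line follows the "MM/DD HH:MM:SS (pid:N) message" layout seen in the excerpt above. The function name count_secman_failures is hypothetical and not part of Condor, and the patterns are inferred from this excerpt only, so they may need adjusting for other log settings.

```python
import re
from collections import Counter

# Line layout assumed from the excerpt: "MM/DD HH:MM:SS (pid:N) message"
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")
# Numeric SECMAN error codes such as "SECMAN:2003:" or "SECMAN: 2007:"
CODE_RE = re.compile(r"SECMAN:\s*(\d+):")
# Peer addresses written in angle brackets, e.g. <host:port>
ADDR_RE = re.compile(r"<([^>]+)>")


def count_secman_failures(lines):
    """Count log lines carrying a numeric SECMAN error code.

    Returns a Counter keyed by (error_code, peer_address). Lines that do
    not match the assumed layout, or carry no numeric code, are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        message = match.group(3)
        code = CODE_RE.search(message)
        if not code:
            continue
        addr = ADDR_RE.search(message)
        peer = addr.group(1) if addr else "unknown"
        counts[(code.group(1), peer)] += 1
    return counts
```

Passing the lines of a SchedLog file (for example, an open file object) to this function returns the counts per error code and peer. A code that keeps recurring for the same peer would suggest a problem reaching that particular daemon rather than a pool-wide issue.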

4. Occasionally I get an error when I use condor_status or condor_q, which I believe is related to the errors in Q3:
Failed to fetch ads from ...
SECMAN: 2007: Failed to end classad schedlog


Thank you for your suggestions,
Mike