
Re: [Condor-users] having multiple schedulers and collectors



On Tue, Jun 1, 2010 at 1:24 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
There have been discussions about putting some statistical monitors into different daemons, including the Schedd, which would be direct indicators of backup etc. However, that's just talk at the moment.

That would be useful to have so long as the information can be gotten to when things are falling apart. :D
 
From a practical point of view, a few options are:
 0) You can use the knowledge that the Negotiator puts a Slot into the Matched state, and the Schedd then puts it into the Claimed/Idle state and then into Claimed/Busy, and observe that a large number of Matched slots can indicate backup. Claimed/Idle can also be an indicator.
 1) You can look at the exit status of shadows in the Schedd's log and why the exits happened in the ShadowLog. Seeing exit code 4 or "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT" is an indication of a backed up Schedd.
 
 2) You can look at response time to a condor_q query that should return no result and wouldn't initiate an O(n) scan of the queue.

I generally advise against running condor_q if you think your scheduler is overloaded. It really only makes things worse. But what kind of condor_q query returns nothing and doesn't initiate a queue scan? Something like:

condor_q -better-analyze <job>

?

I can't think of any others that are safe in this case, and even the above is a call that's not going to help the situation.
 
 3) Looking at the history file (with condor_history), you can compare the CompletionDate with the EnteredCurrentStatus and JobFinishedHookDone attributes. Any drift can indicate a backup in the Schedd.

This assumes you have jobs completing, of course, and history turned on.
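
A rough sketch of that comparison, in case it helps. The report_drift function name, the 60-second default threshold, and the three-column input layout are my own assumptions for illustration, not anything Condor defines. It only compares CompletionDate with EnteredCurrentStatus, and it skips records with no CompletionDate (removed jobs, for example).

```shell
# Print "<job> <drift in seconds>" for each history record whose
# CompletionDate and EnteredCurrentStatus differ by more than the threshold.
# Input on stdin: "<job> <CompletionDate> <EnteredCurrentStatus>" per line.
# The 60 second default is an arbitrary placeholder; tune it for your pool.
report_drift() {
    awk -v max="${1:-60}" '
        NF == 3 && $2 > 0 {
            drift = $3 - $2
            if (drift < 0) drift = -drift   # direction is not assumed
            if (drift > max) print $1, drift
        }
    '
}

# Possible usage (assumes every record defines all four attributes,
# otherwise the columns will not line up):
#   condor_history -format "%d." ClusterId -format "%d " ProcId \
#       -format "%d " CompletionDate -format "%d\n" EnteredCurrentStatus \
#       | report_drift 60
```

Anything it prints is a job where the two timestamps disagree by more than the threshold, which is the drift described above.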
 
 4) Compare the reported count of running jobs in the Schedd (condor_q | tail -n1) to the count reported by condor_status -schedd. Some small drift is acceptable.

If you're trying to avoid using condor_q you can look at the job_queue.log file instead. If you grep on "JobStatus 2" you'll get a list of running jobs to compare to the condor_status view of the world.
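
One caveat with a plain grep: job_queue.log is an append-only log, so a job that ran earlier and has since finished still has an old "JobStatus 2" line in it. Here is a minimal sketch that keeps only the latest status per job. It assumes the log lines look like "103 <cluster.proc> <Attribute> <value>" for attribute updates and "102 <cluster.proc>" for ads that were removed, and it ignores transaction boundaries. count_running is just a name I made up.

```shell
# Count jobs whose most recent JobStatus record is 2 (Running).
# Assumed line formats (check against your own job_queue.log):
#   103 <cluster.proc> <Attribute> <value>   attribute set
#   102 <cluster.proc>                       ad destroyed
count_running() {
    awk '
        $1 == "103" && $3 == "JobStatus" { status[$2] = $4 }   # later lines win
        $1 == "102"                      { delete status[$2] } # job is gone
        END {
            n = 0
            for (job in status) if (status[job] + 0 == 2) n++
            print n
        }
    ' "$1"
}

# Possible usage:
#   count_running "$(condor_config_val SPOOL)/job_queue.log"
```

The number it prints is what you would compare against the running-job count from condor_status -schedd.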

Hope that helps!

- Ian