
Re: [Condor-users] having multiple schedulers and collectors



On 06/03/2010 05:34 PM, Ian Chesal wrote:
> On Tue, Jun 1, 2010 at 1:24 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
> 
>     There have been discussions about putting some statistical monitors
>     into different daemons, including the Schedd, which would be direct
>     indicators of backup etc. However, that's just talk at the moment.
> 
> 
> That would be useful to have so long as the information can be gotten to
> when things are falling apart. :D
>  
> 
>     From a practical point of view, a few options are:
>      0) You can use the knowledge that the Negotiator puts a Slot into
>     the Matched state, the Schedd then puts it into the Claimed/Idle
>     state and then into Claimed/Busy, and observe that a large number of
>     Matched slots can indicate a backup. Claimed/Idle can also be an
>     indicator.
>      1) You can look at the exit status of shadows in the Schedd's log
>     and why the exits happened in the ShadowLog. Seeing exit code 4 or
>     "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT" is an indication
>     of a backed up Schedd.
> 
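
Options 0 and 1 above can be roughed out in a few lines of Python. This is only a sketch under assumptions: the slot states are assumed to have been dumped one slot per line as "State Activity" (for example with something like condor_status -format "%s " State -format "%s\n" Activity, which you should check against your version), and the function names are made up for illustration. The only log string it looks for is the keep-alive message quoted above.

```python
from collections import Counter

# Message quoted in option 1; nothing else about the ShadowLog format is assumed.
KEEPALIVE_FAILURE = "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT"


def tally_slot_states(lines):
    """Count (State, Activity) pairs from lines such as 'Claimed Idle'.

    The one-slot-per-line input layout is an assumption of this sketch.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            counts[(parts[0], parts[1])] += 1
    return counts


def waiting_fraction(counts):
    """Fraction of slots that are Matched or Claimed/Idle (option 0)."""
    total = sum(counts.values())
    waiting = sum(
        n
        for (state, activity), n in counts.items()
        if state == "Matched" or (state, activity) == ("Claimed", "Idle")
    )
    return waiting / total if total else 0.0


def count_keepalive_failures(shadow_log_lines):
    """Count ShadowLog lines that contain the keep-alive failure (option 1)."""
    return sum(1 for line in shadow_log_lines if KEEPALIVE_FAILURE in line)
```

What counts as "a large number" depends on the pool, so any threshold on waiting_fraction is something to tune locally rather than a fixed rule.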
>  
> 
>      2) You can look at response time to a condor_q query that should
>     return no result and wouldn't initiate an O(n) scan of the queue.
> 
> 
> I generally advise against running condor_q if you think your scheduler
> is overloaded. It really only makes things worse. But what kind of
> condor_q query returns nothing and doesn't initiate a queue scan?
> Something like:
> 
> condor_q -better-analyze <job>
> 
> ?
> 
> I can't think of any others that are safe in this case and even the
> above is a call that's not going to help the situation.
>  
> 
>      3) Looking at the history file (with condor_history), you can
>     compare the CompletionDate with the EnteredCurrentStatus and
>     JobFinishedHookDone attributes. Any drift can indicate a backup in
>     the Schedd.
> 
> 
> This is assuming you have jobs completing of course. And history turned on.

History is on by default.
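
For the drift comparison in option 3, a minimal sketch, assuming each history ad has already been loaded into a dict of integer epoch seconds keyed by attribute name. How you get the ads out of condor_history is left open, and the function names are hypothetical.

```python
# Attributes compared against CompletionDate, as named in option 3.
LATER_ATTRS = ("EnteredCurrentStatus", "JobFinishedHookDone")


def completion_drift(ad):
    """Seconds between CompletionDate and each later timestamp present in ad.

    Ads with a missing or zero CompletionDate are skipped; treating zero as
    "never completed" is an assumption of this sketch.
    """
    done = ad.get("CompletionDate", 0)
    if done <= 0:
        return {}
    return {name: ad[name] - done for name in LATER_ATTRS if name in ad}


def worst_drift(ads):
    """Largest drift in seconds across all ads, or 0 if nothing is comparable."""
    drifts = [d for ad in ads for d in completion_drift(ad).values()]
    return max(drifts, default=0)
```

A worst_drift that keeps growing from one sample to the next is the signal to look for; a small constant offset is probably just normal bookkeeping delay.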


>      4) Compare the reported count of running jobs in the Schedd
>     (condor_q | tail -n1) to the count reported by condor_status
>     -schedd. Some small drift is acceptable.
> 
> 
> If you're trying to avoid using condor_q you can look at the
> job_queue.log file instead. If you grep on "JobStatus 2" you'll get a
> list of running jobs to compare to the condor_status view of the world.

job_queue.log is a transaction log and generally you can't just grep it. You'd have to replay it.
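
To illustrate why a grep is not enough, here is a rough replay sketch. It rests on assumptions you should verify against your own file: that each entry is one line starting with a numeric op code, and that 101/102/103/104/105/106 mean new ad, destroy ad, set attribute, delete attribute, begin transaction and end transaction. It is not how the Schedd itself reads the log, and it ignores anything procs inherit from cluster ads.

```python
# Op codes ASSUMED for this sketch; check them against your job_queue.log.
OP_NEW, OP_DESTROY, OP_SET, OP_DELETE = "101", "102", "103", "104"
OP_BEGIN, OP_END = "105", "106"


def _apply(ads, fields):
    """Apply one parsed log entry to the in-memory table of ads."""
    op = fields[0]
    if op == OP_NEW and len(fields) >= 2:
        ads.setdefault(fields[1], {})
    elif op == OP_DESTROY and len(fields) >= 2:
        ads.pop(fields[1], None)
    elif op == OP_SET and len(fields) == 4:
        ads.setdefault(fields[1], {})[fields[2]] = fields[3]
    elif op == OP_DELETE and len(fields) >= 3:
        ads.get(fields[1], {}).pop(fields[2], None)
    # Any other op code is ignored.


def replay(lines):
    """Replay log lines in order and return {job key: {attribute: value}}.

    Entries inside a transaction are buffered and applied only when the
    matching end marker is seen, so a trailing unfinished transaction is dropped.
    """
    ads = {}
    pending = None  # None outside a transaction, a list of entries inside one
    for line in lines:
        fields = line.rstrip("\r\n").split(" ", 3)
        op = fields[0]
        if op == OP_BEGIN:
            pending = []
        elif op == OP_END:
            for entry in pending or []:
                _apply(ads, entry)
            pending = None
        elif pending is not None:
            pending.append(fields)
        else:
            _apply(ads, fields)
    return ads


def running_jobs(ads):
    """Keys of ads whose last recorded JobStatus is 2 (running)."""
    return sorted(key for key, ad in ads.items() if ad.get("JobStatus") == "2")
```

A job that ran and then completed leaves both a "JobStatus 2" and a later "JobStatus 4" entry in the file. The grep counts the first one; the replay only sees the last value.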

Best,


matt


> Hope that helps!
> 
> - Ian
> 