
Re: [condor-users] Condor sleeping



On Wed June 16 2004 5:16 am, Mark Silberstein wrote:
> It never happened for me with 6.4.7. It started with the 6.6 series, and it
> is not only annoying, but also makes my users feel that the system is
> unreliable, which unfortunately is true in these circumstances. I wish I
> had more time to debug it. Maybe someone with some time could try moving
> the collector and negotiator to another machine (with another IP/name, in
> case there is a problem with DNS resolution), preferably one running
> Linux, or at least not Windows. From all the mail on the list, it seems
> that Windows causes some problems here. By the way, I don't experience
> any problems with pools that use a Linux-based matchmaker.

It'd be useful to know whether these problems are caused by lost updates or 
by something else.  Fortunately, we have some new sources of data.

Have you tried looking at the new "Collector Updates Stats" fields?  They can 
help quantify lost updates.  As of Condor 6.6.2, "condor_updates_stats" is 
shipped with Condor; it's a Perl script that parses this output into more 
readable text:

nleroy@chopin% condor_status -l c2-001 | grep Updates
UpdatesTotal = 12785
UpdatesSequenced = 12772
UpdatesLost = 33
UpdatesHistory = "0x00000000000000000000000000000000"
UpdatesTotal = 12678
UpdatesSequenced = 12666
UpdatesLost = 31
UpdatesHistory = "0x00100000000000000000000000000000"
nleroy@chopin% condor_status -l c2-001 | condor_updates_stats
(Reading from stdin)
*** Name/Machine = 'vm1@xxxxxxxxxxxxxxxxxx' MyType = 'Machine' ***
 Type: Main
   Stats: Total=12785, Seq=12772, Lost=33 (0.26%)
     0: Ok
  ...
   127: Ok

*** Name/Machine = 'vm2@xxxxxxxxxxxxxxxxxx' MyType = 'Machine' ***
 Type: Main
   Stats: Total=12678, Seq=12666, Lost=31 (0.24%)
     0: Ok
  ...
    11: Missed
    12: Ok
  ...
   127: Ok
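
For anyone who wants to read UpdatesHistory by hand, here is a rough Python 
sketch of the decoding.  It is not the shipped Perl script; the bit layout 
(most significant bit = slot 0, a set bit = missed update) is an assumption 
inferred from the sample output above, and the function name 
"missed_updates" is made up for illustration:

```python
from typing import List


def missed_updates(history: str) -> List[int]:
    """Return the slot indices whose bit is set in an UpdatesHistory value.

    Assumption (inferred from the sample output, not from Condor source):
    the leftmost (most significant) bit is slot 0, and a 1 bit marks a
    missed update.
    """
    digits = history.strip().strip('"')
    if digits.lower().startswith("0x"):
        digits = digits[2:]
    nbits = 4 * len(digits)  # each hex digit carries 4 history slots
    bits = format(int(digits, 16), "0{}b".format(nbits))
    return [i for i, bit in enumerate(bits) if bit == "1"]


if __name__ == "__main__":
    # vm2's history value from the transcript above
    print(missed_updates("0x00100000000000000000000000000000"))
```

With vm2's value this reports slot 11, which agrees with the "11: Missed" 
line in the script's output.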

If you know your update interval (default = 5 minutes), you can pass it with 
--interval (in seconds), and the script will estimate when each missed 
update should have arrived:

nleroy@chopin% condor_status -l c2-001 | condor_updates_stats --interval=300
(Reading from stdin)
*** Name/Machine = 'vm1@xxxxxxxxxxxxxxxxxx' MyType = 'Machine' ***
 Type: Main
   Stats: Total=12786, Seq=12773, Lost=33 (0.26%)
     0 @ Wed Jun 16 08:31:30 2004: Ok
  ...
   127 @ Tue Jun 15 21:56:30 2004: Ok

*** Name/Machine = 'vm2@xxxxxxxxxxxxxxxxxx' MyType = 'Machine' ***
 Type: Main
   Stats: Total=12679, Seq=12667, Lost=31 (0.24%)
     0 @ Wed Jun 16 08:31:31 2004: Ok
  ...
    12 @ Wed Jun 16 07:31:31 2004: Missed
    13 @ Wed Jun 16 07:26:31 2004: Ok
  ...
   127 @ Tue Jun 15 21:56:31 2004: Ok
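
The timestamps above look like simple arithmetic: slot N is N update 
intervals before the most recent slot.  A minimal Python sketch of that 
guess follows; the helper name "slot_time" is hypothetical, and the formula 
is inferred from the sample output rather than taken from the script itself:

```python
from datetime import datetime, timedelta


def slot_time(latest: datetime, index: int, interval: int = 300) -> datetime:
    """Estimate when history slot `index` was due.

    Assumes slot 0 is the most recent update and that updates are evenly
    spaced `interval` seconds apart (300 s = the default 5 minutes).
    """
    return latest - timedelta(seconds=index * interval)


if __name__ == "__main__":
    # Slot 0 for vm2 in the transcript above
    latest = datetime(2004, 6, 16, 8, 31, 31)
    for idx in (12, 13, 127):
        print(idx, slot_time(latest, idx))
```

For vm2 this puts slot 12 at 07:31:31 on Jun 16 and slot 127 at 21:56:31 on 
Jun 15, matching the output above.  These are only estimates; real updates 
drift a little from the nominal interval.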


-Nick

> On Wed, 2004-06-16 at 12:53, Ron Viloria wrote:
> > I've always seen it happen, as early as 6.4.7; again, it's more of an
> > annoyance. I've always assumed it's because the CPU is busy doing
> > non-Condor stuff in the background, or something like antivirus or
> > backups.

-- 
           <<< The matrix has you. >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences