
[Condor-users] Collector using a lot of CPU



I am running 11 dedicated worker nodes (dual CPU, Scientific Linux 3.0.3,
Condor 6.7.3) with 4 VMs, plus two separate schedulers and another scheduler
on the CM node (dual 2.4 GHz Xeon, 2 GB RAM, 1GbE interface).  The
condor_collector process always seems to be at about 77% CPU:

  PID   PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
10762    25   0  3832 3832  2120 R    76.3  0.1  63:58   2 condor_collecto

Is this normal?  I believe the CM may be dropping UDP packets and is 
therefore removing VMs from the system.  I upgraded to Condor 6.7.3, which 
is supposed to help with this issue, but I still see VMs being dropped.  
I also doubled the ClassAd lifetime and the timeouts in the collector and 
negotiator:
 CLASSAD_LIFETIME       = 1800
 CLIENT_TIMEOUT         = 60
 NEGOTIATOR_TIMEOUT     = 60
but I still see stale ads being removed.
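
To put the CLASSAD_LIFETIME setting in perspective, here is a rough sketch
of how many updates in a row would have to go missing before an ad expires.
The 300-second update interval is an assumption (the interval is not among
the settings listed here), and treating each lost update as independent is
a simplification.  The function names are made up for illustration.

```python
def consecutive_losses_to_expire(classad_lifetime_s, update_interval_s):
    """Roughly how many updates in a row must be lost before an ad expires.

    An ad is refreshed by each update that arrives, so it only goes stale
    if every update sent during one full lifetime is lost.
    """
    return classad_lifetime_s // update_interval_s


def prob_ad_expires(loss_probability, consecutive_losses):
    """Chance that a given run of updates is lost entirely.

    Assumes each update is lost independently with the same probability,
    which is a simplification (UDP loss under load tends to be bursty).
    """
    return loss_probability ** consecutive_losses


if __name__ == "__main__":
    lifetime = 1800   # CLASSAD_LIFETIME from the settings above
    interval = 300    # ASSUMED update interval in seconds, not from this post
    k = consecutive_losses_to_expire(lifetime, interval)
    print("updates that must be lost in a row:", k)
    for p in (0.1, 0.5, 0.9):
        print("per-update loss %.1f -> expiry chance %.6f"
              % (p, prob_ad_expires(p, k)))
```

Under these assumptions several updates in a row have to be lost, so ads
expiring regularly would point to either very heavy loss or something
other than random packet drops.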

My concern is that I am running only 5% of our worker nodes in the system 
so far.  What happens when I scale up to 220 worker nodes?  The next step 
is to go to TCP, but I was wondering whether some misconfiguration is 
causing the collector to be too busy.  The relevant debugging and other 
parameters are set to:
  ALL_DEBUG                     = D_PROTOCOL D_MATCH
  COLLECTOR_CLASS_HISTORY_SIZE  = 1024
  COLLECTOR_DAEMON_HISTORY_SIZE = 128
  COLLECTOR_DAEMON_STATS        = True
  COLLECTOR_DEBUG               =
  MAX_COLLECTOR_LOG             = 640000000
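
For the scaling question, a back-of-the-envelope estimate of the ad update
rate the collector would see at 11 and at 220 worker nodes.  This is only a
sketch: it assumes 4 VMs per node, a 300-second update interval, and one ad
per VM per interval, none of which is confirmed by the settings above, and
it ignores ads from schedulers and other daemons.

```python
def ads_per_second(nodes, vms_per_node, update_interval_s):
    """Estimated machine-ad updates per second arriving at the collector."""
    return nodes * vms_per_node / float(update_interval_s)


if __name__ == "__main__":
    vms_per_node = 4        # ASSUMED: "4 VMs" read as 4 per worker node
    update_interval = 300   # ASSUMED update interval in seconds

    current_nodes = 11
    target_nodes = 220

    # Fraction of the planned pool that is running today.
    print("fraction deployed:", current_nodes / float(target_nodes))

    for nodes in (current_nodes, target_nodes):
        rate = ads_per_second(nodes, vms_per_node, update_interval)
        print("%3d nodes -> about %.2f ads/s" % (nodes, rate))
```

The rate scales linearly with the node count, so the full pool would send
20 times as many updates as the current 11 nodes.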

Thanks
Leslie Groer

-- 
   ,-~~-.___.       ________________________________________________
  / |  '     \      groer@xxxxxxxxxxxxxxxxxxx  Department of Physics
 (  )        0           Tel: +1-416-978-2959  University of Toronto
  \_/-, ,----'           Fax: +1-416-978-8221  60 St. George Street
     ====           //                         Toronto, ON M5S 1A7
    /  \-'~;    /~~~(O)                        Canada
   /  __/~|   /       |  Office: McLennan Physics Lab Room 911
 =(  _____| (_________|  http://home.fnal.gov/~groer
     Leslie S. Groer