Re: [Condor-users] Could not get startd's private ad

Constantinos Evangelinos wrote:

We have a Condor setup that has evolved over time from 6.4.x via 6.5.x to 6.6.x (currently 6.6.3). Ever since the move to 6.6, condor_collector on the master node (which is also a view host) has become a total resource hog, using 100% of the CPU.

Checking the collector logs with the debugging info, I see a constant stream of messages of this type:

10/7 11:16:42           **** Removing stale ad: "< x.x.x.edu , >"
10/7 11:16:42 (Invalidated 1 ads)
10/7 11:16:42 (Invalidated 0 ads)
10/7 11:16:42 StartdAd     : Updating ... "< x.x.x.edu ,>"
10/7 11:16:42   (Could not get startd's private ad)

About 330 of these actions are recorded every second. This was not an issue under earlier Condor installations. This particular one has not been used for some time now due to this problem but needs to become usable again. I would appreciate any help/ideas.
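A rough way to confirm a rate like this is to count the matching log lines per timestamp. The sketch below is only illustrative: it assumes the "MM/DD HH:MM:SS message" line shape seen in the excerpt above, and the helper name count_per_second is made up for this example, not part of Condor.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the excerpt: "MM/DD HH:MM:SS <message>"
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2})\s+(.*)$")


def count_per_second(lines, needle="Could not get startd's private ad"):
    """Count log lines containing `needle`, grouped by their timestamp."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and needle in match.group(2):
            counts[match.group(1)] += 1
    return counts


# Small sample modelled on the excerpt (not real collector output).
sample = [
    '10/7 11:16:42 StartdAd     : Updating ... "< x.x.x.edu ,>"',
    "10/7 11:16:42   (Could not get startd's private ad)",
    "10/7 11:16:42   (Could not get startd's private ad)",
    "10/7 11:16:43   (Could not get startd's private ad)",
]
counts = count_per_second(sample)
for stamp, n in sorted(counts.items()):
    print(stamp, n)
```

To run it against a real log, pass an open file object instead of the sample list; the path to the collector log depends on the local configuration.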

The 'uptime' command, for example, gives you three load averages (each computed over a different time span):

-> load average: 1.22, 1.06, 0.75

From 'man uptime' :
      ... average number of jobs in the run queue over the last 1, 5 and
      15 minutes

For a hyperthreaded dual-Xeon that has been working at full capacity for a long time, it gives roughly 4, 4, 4 (just one job in the run queue for each virtual CPU).
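To compare machines with different CPU counts, one option is to divide each load average by the number of logical CPUs, so that "fully busy" reads as roughly 1.0 everywhere. This is only a sketch of that idea, not a description of what Condor does internally; the function name load_per_cpu is made up for this example.

```python
import os


def load_per_cpu(loadavg, ncpus):
    """Divide each load average by the number of logical CPUs."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return tuple(value / ncpus for value in loadavg)


# The hyperthreaded dual-Xeon case from the text: 4 virtual CPUs, load 4, 4, 4.
example = load_per_cpu((4.0, 4.0, 4.0), 4)

if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute averages on Unix systems.
    ncpus = os.cpu_count() or 1
    print(load_per_cpu(os.getloadavg(), ncpus))
```

With this scaling, the fully loaded dual-Xeon and a fully loaded single-CPU machine both report about 1.0.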

Now, I do not know whether the Condor view of load average is normalized to the 0-1 range, to cope with the differences in what counts as a 'normal healthy state' on a single-, dual-, or quad-processor machine. Just curious ;-)



