
Re: [Condor-users] Could not get startd's private ad



On Saturday 09 October 2004 06:41, Alain EMPAIN wrote:

> Constantinos Evangelinos wrote:
> >We have a Condor setup that's evolved over time from 6.4.x via 6.5.x to
> > 6.6.x (6.6.3 currently). Ever since the change to 6.6 happened
> > condor_collector on the master node (which is also a view host) has
> > become a total resource hog, grabbing the CPU 100% of the time.
> >
> >Checking the collector logs with the debugging info, I see a constant
> > stream of messages of this type:
> >
> >10/7 11:16:42 Got INVALIDATE_STARTD_ADS
> >10/7 11:16:42           **** Removing stale ad: "< x.x.x.edu , 192.168.0.22 >"
> >10/7 11:16:42 (Invalidated 1 ads)
> >10/7 11:16:42 (Invalidated 0 ads)
> >10/7 11:16:42 StartdAd     : Updating ... "< x.x.x.edu , 192.168.0.22>"
> >10/7 11:16:42   (Could not get startd's private ad)
> >
> >About 330 of these actions are recorded every second. This was not an
> > issue under earlier Condor installations. This particular one has not
> > been used for some time now due to this problem but needs to become
> > usable again. I would appreciate any help/ideas.
> > load average: 1.22, 1.06, 0.75
>
> The command 'uptime', for example, gives you three load averages (they
> differ in the time span over which the mean is computed)
>
> ->  load average: 1.22, 1.06, 0.75
>
>  From 'man uptime' :
>        ... average number of jobs in the run queue over the last 1, 5 and
>        15 minutes
>
> For a fully loaded (working at full capacity for a long time)
> hyperthreading dual-Xeon, it gives roughly 4, 4, 4 (just one job in the
> queue for each virtual CPU)
>
> Now I do not know if the Condor view of load average is normalized
> to the range 0-1, to cope with the differences between what is a 'normal
> healthy state' for a mono-, bi-, or quad-processor machine. Just curious ;-)

I'm not sure how that last "load average" line made it into my e-mail (it's
not in my saved copy). Anyway, thanks for trying to answer, Alain, but
unfortunately the uptime issue is irrelevant: the master node is not an
execute node and therefore should not have any jobs running on it. What is
grabbing 100% of the CPU is condor_collector, and it is evidently busy
invalidating and then attempting to update startd ads. Any ideas, anyone? Am
I the only person in the world with this problem? Googling for the
condor_collector error string has turned up nothing. :-(
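
In case anyone wants to check whether their own collector log shows the
same pattern, here is a rough sketch of how the per-second rate of these
messages could be counted. It is plain Python, not part of Condor; the
function name failures_per_second is made up for illustration, and it
assumes every log line starts with a timestamp in the same
"month/day hh:mm:ss" form as the excerpt quoted above.

```python
import re
from collections import Counter

# Timestamp prefix as seen in the quoted excerpt, e.g. "10/7 11:16:42".
# Assumption: all collector log lines of interest begin this way.
TIMESTAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2})\s")

# The error string reported in the collector log.
MARKER = "Could not get startd's private ad"


def failures_per_second(lines):
    """Count lines containing MARKER, bucketed by their one-second timestamp.

    `lines` is any iterable of strings (an open file works). Lines without
    a recognizable timestamp are ignored.
    """
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and MARKER in line:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to failures_per_second and printing the
resulting items gives one count per second, in the order the seconds first
appear in the log, which can be compared with the roughly 330 per second
seen here.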

Constantinos
-- 
Dr. Constantinos Evangelinos                    Room 54-1518, EAPS/MIT
Earth, Atmospheric and Planetary Sciences       77 Massachusetts Avenue
Massachusetts Institute of Technology           Cambridge, MA 02139
+1-617-253-5259/+1-617-253-4464 (fax)           USA