
RE: [Condor-users] condor_status taking ages to report

I think I've finally got to the root of this. The Condor View server
was rebooted but the Condor daemons didn't come up on it. The collector
on the manager was so busy trying to contact the (now defunct) view
server that nothing else got a look in. I'm not sure why it didn't just
give up, as the Condor stats are hardly mission-critical.

I'm still puzzled as to why the collector is taking up so much memory
(getting on for 500 MB). I've restarted the daemons and rebooted the
machine, but no change. How does this scale with the number of startds
in the pool? At present we have ~100, but this is small compared to
some sites. If we run out of real memory and are into swap, presumably
it's going to crawl along.
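
For a rough sense of scale, here is a back-of-envelope sketch in Python using only the figures above (about 500 MB for about 100 startds). The function names are made up for illustration, and the projection assumes memory grows linearly with the number of startds, which is exactly the thing I don't know.

```python
def per_startd_mb(collector_mb: float, startds: int) -> float:
    """Average collector memory attributed to each startd, in MB."""
    if startds <= 0:
        raise ValueError("startds must be positive")
    return collector_mb / startds


def projected_mb(collector_mb: float, startds: int, target_startds: int) -> float:
    """Projected collector memory at target_startds.

    ASSUMPTION: memory scales linearly with the number of startds.
    """
    return per_startd_mb(collector_mb, startds) * target_startds


if __name__ == "__main__":
    # Figures from this pool: ~500 MB with ~100 startds.
    print(per_startd_mb(500, 100))
    # Hypothetical pool ten times the size, under the linear assumption.
    print(projected_mb(500, 100, 1000))
```

If the per-startd figure really is several MB, a much larger pool would not fit in real memory on this machine, which is why the current usage looks wrong to me.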

many thanks,


--On 23 March 2005 15:26 +0000 "Kewley, J (John)" <j.kewley@xxxxxxxx> wrote:


This was the system that Jaime and I were trying to sort out at Condor
Week, without success. Having seen the CONDOR_IDS settings in the central
node's config file, I suggested changing these to the condor user since
the daemons were not running as the user you would expect. I think the
problems came about as a result of this. There are a LOT of 30s timeouts
at the new machine's end. I haven't checked the central node again to see
what is happening there.



-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Matt Hope
Sent: 23 March 2005 15:02
To: Condor-Users Mail List
Subject: Re: [Condor-users] condor_status taking ages to report

On Wed, 23 Mar 2005 10:52:43 +0000, Dr Ian C. Smith <i.c.smith@xxxxxxxxxxxxxxx> wrote:
> Hi,
>
> I've had a Condor pool working fine now for several months
> but after making a small change to the condor_config
> on the central manager condor_status and condor_q -global
> are now taking over five minutes to respond (if at all!).
>
> The manager is running condor 6.6.5 on a Sun-Blade-1000
> with solaris 8. We have around 100 Wintel execute hosts in the pool.
> The load average is < 0.1 so I don't see this as a problem.
> The condor_collector has been taking up to ~ 500 MB of memory
> which seems a huge amount and makes me suspect a memory leak.
> Any one else seen anything similar ?
>
> Any help on this would be very much appreciated !

perhaps an indication of what the small change you made was
would be useful...

Note that condor_q -global is a BAD thing to do, especially if your
pool is running slowly, since it locks the schedd on your version,
slowing down negotiation/job starting/preemptions etc.

The collector sounds like it is using far too much (have you tried
restarting it?). You haven't accidentally upped the number of startds
running per machine or added some horrifically large value to all
classads, have you?


_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users