
Re: [Condor-users] condor_status taking ages to report



The COLLECTOR and NEGOTIATOR were missing from the DAEMON_LIST. When
I added them it all came up fine. The collector is now taking < 10 MB
which seems much more reasonable.
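For anyone hitting the same symptom, a minimal sketch of a check for this misconfiguration follows. It assumes a simplified reading of the config file (one `NAME = value` per line, last assignment wins, no macro expansion or line continuation); the function names are hypothetical and not part of Condor.

```python
def daemon_list(config_text):
    """Return daemon names from the last DAEMON_LIST assignment.

    Simplified parser: ignores comments, macros and line continuations.
    """
    daemons = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip().upper() == "DAEMON_LIST":
            # Accept comma- or space-separated daemon names.
            daemons = [d.upper() for d in value.replace(",", " ").split()]
    return daemons


def missing_central_daemons(config_text, required=("COLLECTOR", "NEGOTIATOR")):
    """List the central-manager daemons absent from DAEMON_LIST."""
    present = set(daemon_list(config_text))
    return [d for d in required if d not in present]


if __name__ == "__main__":
    sample = "DAEMON_LIST = MASTER, SCHEDD\n"  # illustrative value only
    print(missing_central_daemons(sample))
```

On the central manager, an empty result is what you would hope to see; anything else suggests the master will never start those daemons.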

many thanks,


-ian.


--On 29 March 2005 11:29 +0100 "Dr Ian C. Smith" <i.c.smith@xxxxxxxxxxxxxxx> wrote:




--On 24 March 2005 15:27 +0000 "Dr Ian C. Smith"
<i.c.smith@xxxxxxxxxxxxxxx> wrote:
--On 24 March 2005 09:10 -0600 Alain Roy <roy@xxxxxxxxxxx> wrote:


The manager is running condor 6.6.5 on a Sun-Blade-1000
with Solaris 8. We have around 100 Wintel execute hosts in the pool.
The load average is < 0.1 so I don't see this as a problem.
The condor_collector has been taking up to ~500 MB of memory,
which seems a huge amount and makes me suspect a memory leak.

Your collector should not be using 500MB of memory for 100 execution hosts.

It would be useful if _the_people_who_wrote_this_stuff_ could tell me
how the dynamic memory allocation for the collector scales with the
number of startds, schedds, etc. At least that way we'd have a handle
on the requirements for the central manager.

I don't have an exact formula for you. I suspect we could come up with one, but let me give you a basic heuristic: the condor_collector run by the Condor group manages about 800 computers. Some have multiple CPUs, so the total number of startds is a bit greater than that. The collector has lots of ClassAds in it (startd, schedd, submitter, master...). Our collector is taking about 50 MB. We have roughly 10 times as many computers and it's taking roughly 1/10th the space. Clearly you have a problem.
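To put rough numbers on that comparison, here is a small sketch using only the figures quoted in this thread (about 50 MB for about 800 machines, versus about 500 MB for about 100 execute hosts). It is a back-of-the-envelope ratio, not a scaling formula for the collector; the function name is made up for illustration.

```python
def mb_per_machine(collector_mb, machines):
    """Collector memory divided evenly over the machines in the pool."""
    return collector_mb / machines


if __name__ == "__main__":
    reference = mb_per_machine(50, 800)   # figures quoted for the Condor group's pool
    reported = mb_per_machine(500, 100)   # figures reported for the affected pool
    print(f"reference pool: {reference:.4f} MB per machine")
    print(f"affected pool:  {reported:.4f} MB per machine")
    print(f"ratio: {reported / reference:.0f}x")
```

On these figures the affected collector uses about 80 times more memory per machine than the reference pool, which is consistent with a leak rather than normal growth.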

Fortunately, the problem should be easy to solve. Condor 6.6.6 fixed a
memory leak in the collector:

http://www.cs.wisc.edu/condor/manual/v6.6/8_2Stable_Release.html#SECTION00924000000000000000
* Fixed a memory leak in the condor_collector.

My recommendation is to update Condor to a newer version. If you can't
update your whole pool, it is safe to upgrade just the collector. It
would be better to use Condor 6.6.9 rather than just 6.6.6: we've made a
number of bug fixes.

I suspect that this will fix the problem for you. If it doesn't, let us
know and we can look more deeply into the problem.

-alain

I've just tried running condor 6.6.9 but the collector and other daemons don't seem to want to start. In the master log I get:

3/29 11:00:03 Using config file: /etc/condor/condor_config
3/29 11:00:03 Using local config files: /opt1/condor-6.6.9/home/condor_config.local
3/29 11:00:03 DaemonCore: Command Socket at <138.253.100.178:33675>
3/29 11:00:03 Started DaemonCore process "/opt1/condor-6.6.9/sbin/condor_schedd", pid and pgroup = 28220
3/29 11:00:08 Can't connect to <138.253.100.178:9618>:0, errno = 146
3/29 11:00:08 Will keep trying for 10 seconds...
3/29 11:00:18 Connect failed for 10 seconds; returning FALSE
3/29 11:00:18 ERROR: SECMAN:2003:TCP connection to <138.253.100.178:9618> failed

and the schedd log:

3/29 11:00:03 ******************************************************
3/29 11:00:03 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
3/29 11:00:03 ** /opt1/condor-6.6.9/sbin/condor_schedd
3/29 11:00:03 ** $CondorVersion: 6.6.9 Mar 10 2005 $
3/29 11:00:03 ** $CondorPlatform: SUN4X-SOLARIS28 $
3/29 11:00:03 ** PID = 28220
3/29 11:00:03 ******************************************************
3/29 11:00:03 Using config file: /etc/condor/condor_config
3/29 11:00:03 Using local config files: /opt1/condor-6.6.9/home/condor_config.local
3/29 11:00:03 DaemonCore: Command Socket at <138.253.100.178:33676>
3/29 11:00:04 Can't connect to <138.253.100.178:9618>:0, errno = 146
3/29 11:00:04 Will keep trying for 10 seconds...
3/29 11:00:14 Connect failed for 10 seconds; returning FALSE
3/29 11:00:14 ERROR: SECMAN:2003:TCP connection to <138.253.100.178:9618> failed
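Both logs show the daemons failing to open a TCP connection to port 9618 on the manager itself. On Solaris, errno 146 is ECONNREFUSED, which usually means nothing is listening on that port. A minimal sketch of an independent check is below; the helper name is hypothetical, and the address and port are simply the ones from the logs above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address and port taken from the master and schedd logs in this message.
    host, port = "138.253.100.178", 9618
    state = "listening" if can_connect(host, port) else "not reachable"
    print(f"collector port {port} on {host}: {state}")
```

If this reports the port as not reachable while the condor_master is running, the collector itself is probably not being started.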

I've swapped back to the original 6.6.5 version and now it's broken as
well, so instead of condor + memory leak I now have no condor service
at all. Brilliant.

-ian,




_______________________________________________ Condor-users mailing list Condor-users@xxxxxxxxxxx https://lists.cs.wisc.edu/mailman/listinfo/condor-users