
Re: [Condor-users] condor_status taking ages to report




The manager is running Condor 6.6.5 on a Sun-Blade-1000
with Solaris 8. We have around 100 Wintel execute hosts in the pool.
The load average is < 0.1, so I don't see this as a problem.
The condor_collector has been taking up to ~500 MB of memory,
which seems a huge amount and makes me suspect a memory leak.

Your collector should not be using 500MB of memory for 100 execution hosts.

It would be useful if _the_people_who_wrote_this_stuff_ could tell me
how the dynamic memory allocation for the collector scales with the number
of startds, schedds, etc.  At least that way we'd have a handle on the
requirements for the central manager.

I don't have an exact formula for you. I suspect we could come up with one, but let me give you a basic heuristic. The condor_collector run by the Condor group manages about 800 computers. Some have multiple CPUs, so the total number of startds is a bit greater than that. The collector holds lots of ClassAds (startd, schedd, submitter, master...). Our collector is taking about 50 MB. We have roughly 8 times as many computers as you do, and our collector is taking roughly 1/10th the space. Clearly you have a problem.
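To put rough numbers on that comparison, here is a small back-of-the-envelope sketch using only the figures quoted in this thread. It assumes that collector memory scales linearly with the number of machines and ignores any fixed overhead; that is an assumption made for illustration, not a measured property of the collector. The helper name mb_per_host is made up for this example.

```python
# Figures quoted in this thread (memory in MB, machine counts approximate).
# Assumption: memory scales linearly with host count, no fixed overhead.
def mb_per_host(collector_mb, hosts):
    """Average collector memory attributed to each host, in MB."""
    return collector_mb / hosts

reference = mb_per_host(50, 800)   # Condor group's pool
reported = mb_per_host(500, 100)   # the pool with the suspected leak

# How much more memory per host the reported pool is using.
ratio = reported / reference

# Naive linear estimate for a 100-host pool at the reference rate.
expected_mb = reference * 100

print(f"{reported:.2f} MB/host vs {reference:.4f} MB/host ({ratio:.0f}x); "
      f"linear estimate for 100 hosts: {expected_mb:.2f} MB")
```

Under that assumption, the reported collector is using about 80 times more memory per host than ours, and a 100-host pool would be expected to need only a few MB. The real figure depends on how many ClassAds each machine contributes, so treat this as an order-of-magnitude check only.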


Fortunately, the problem should be easy to solve. Condor 6.6.6 fixed a memory leak in the collector:

http://www.cs.wisc.edu/condor/manual/v6.6/8_2Stable_Release.html#SECTION00924000000000000000
* Fixed a memory leak in the condor_collector.

My recommendation is to update Condor to a newer version. If you can't update your whole pool, it is safe to upgrade just the collector. It would be better to use Condor 6.6.9 rather than just 6.6.6: we've made a number of bug fixes since then.

I suspect that this will fix the problem for you. If it doesn't, let us know and we can look more deeply into the problem.

-alain