[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Memory leaks in Quill ?



Hi,

I would like to test patched condor_quill, before sending this email.
And it seems that everything is OK (use 30MB) and quill is not yet greedy :)
I create patch from git branch origin/V7_4_4-branch using command:
#git format-patch origin/V7_4_4-branch --stdout >quill_memory_fix.patch

Could anyone from Condor developers check, if this patch if correct, because I
don't know dependencies among patch files and another part of Condor.

I also attach a log file from valgrind, which could possibly help you
understand next explanation of my point of view.

We can see in valgrind.log, that there is 28,677,184 bytes possibly lost. This
could correspond to our problem. Two instances of class ClassAdLogParser use the
same file descriptor(JobQueueDBManager::maintain, Prober::probe). The main
problem. We get descriptor from first instance caLogParser->getFileDescriptor() 
And then set to second instance caLogParser.setFileDescriptor(job_queue_fd).
Implementation of this method call next code: log_fp = fdopen(fd,"r");.
Function fdopen should allocate memory for structure FILE (this depends on libc
implementation). I think that this memory could be free using fclose function,
but we could not call method ClassAdLogParser::closeFile() at the end of
Prober::probe(), because we need work with file in JobQueueDBManager::maintain()
I think that in another version of glibc this should not be so critical like in
our case. We have version GLIBC_2.2.5, you can see it in valgrind stack trace. 

I would be very gratefull for any comments.

Thanks in advance!

Cheers,
Lukas

On Fri, Dec 31, 2010 at 09:56:53AM -0600, Steven Timm wrote:
> I am also seeing all my condor 7.4.3 condor_Quill daemons using between
> 1.5-2 GB of RAM apiece.  I don't recall that this was happening in 
> earlier versions of quill.
>
>
> Steve
>
>
> On Fri, 31 Dec 2010, Marian Zvada wrote:
>
>> Hi,
>>
>> On 27.12.2010 16:12, slebodnik wrote:
>>> This problem is still actual.
>>>
>>> I attach a graph of actual quill memory usage.
>>> First 130 hours, value of variable QUILL_POLLING_PERIOD was "1", and
>>> then we
>>> change value to "10".
>>>
>>> We would like to also know how much memory use quill daemon on yours submit
>>> machine (where schedd daemon is running).
>>
>> I remember there is somewhere presentation from condor-week 20?? which  
>> provides lots of useful info regarding memory footprint vs. quill 
>> usage.
>>
>> Can someone, maybe from condor folks, recall this presentation and 
>> point it to us? Or maybe provide other recommended quill setup we might 
>> use for further tests? Qull logs don't say much using 
>> QUILL_DEBUG=D_FULLDEBUG.
>>
>> Currently, we have done some valgrind and gdb work to find out source 
>> of the trouble. Our investigation shows it's likely typical memory leak 
>> and using simplified code as follows helps:
>>
>> FILE * fp;
>> while (true){
>>   fp = fdopen(file_descriptor, "w") //here is memory leak
>>   //work
>>   wait(QUILL_POOLING_PERIOD)
>> }
>>
>> I've talked to Lukas what he found out also on our development 
>> environment - see attached plot. What is strange, same code on CentOS 
>> takes all the memory and crashes virtual host on our dev env. Whereas, 
>> on debian it takes 300MB of res mem and that's it... Maybe different 
>> glibc on CentOS vs debian, but this is what Lukas can comment more in 
>> details if needed.
>>
>> Well, we have option to compile quill and see what is going on with 
>> memory in our test case or we can live with gdb hook Lukas patched to 
>> the code. However we would be happy if someone from condor experts can 
>> review our report and give a hint in here. Or, should we open 
>> condor-admin ticket track the issue yet?
>>
>> Thanks in advance!
>>
>> Cheers,
>> Marian
>>
>>
>> PS: Happy New Year 2011 to all condor-users and wishing successful new 
>> lines of code into next condor releases ;)
>>
>>
>>
>>> Thanks
>>>
>>> Lukas Slebodnik
>>>
>>> On Tue, 7 Dec 2010 09:41:11 +0100, Vladimir Motoska <motoska@xxxxxxxx>
>>> wrote:
>>>> Hi,
>>>>
>>>> We have some issues with quill daemon on the submitter node. We use
>>>> condor 7.4.4 , x86_64 rhel5 dynamic running on CentOS with
>>>> postgresql 8.1.22-1.el5_5.1.
>>>>
>>>> The problem is that quill accumulates memory. Here is the memory usage
>>>> log
>>>> http://pastie.org/1354799
>>>> The data are divided to 3 columns.
>>>> 1st percentage of memory used by quill
>>>> 2rd usage of memory in KB
>>>> 3th usage of memory in MB
>>>> Each line represents a time stamp one minute.
>>>>
>>>> Here is also our condor_config.
>>>> http://pastie.org/1354821
>>>>
>>>> On all other nodes quill runs just fine. Can anybody give me some hint ?
>>>>
>>>> Thanks
>>>
>>>
>>> _______________________________________________
>>> Condor-users mailing list
>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/condor-users/
>>
>
> -- 
> ------------------------------------------------------------------
> Steven C. Timm, Ph.D  (630) 840-8525
> timm@xxxxxxxx  http://home.fnal.gov/~timm/
> Fermilab Computing Division, Scientific Computing Facilities,
> Grid Facilities Department, FermiGrid Services Group, Group Leader.
> Lead of FermiCloud project.
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>
>From c34c9589e921f2849cf191cf4e200420b9b02570 Mon Sep 17 00:00:00 2001
From: slebodnik <slebodnik@xxxxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 12 Jan 2011 11:12:51 +0100
Subject: [PATCH] Quill memory leak fix

	modified:   src/condor_c++_util/JobLogReader.cpp
	modified:   src/condor_c++_util/prober.cpp
	modified:   src/condor_c++_util/prober.h
	modified:   src/condor_tt/jobqueuedbmanager.cpp
---
 src/condor_c++_util/JobLogReader.cpp |    2 +-
 src/condor_c++_util/prober.cpp       |   11 +++++++++--
 src/condor_c++_util/prober.h         |    2 +-
 src/condor_tt/jobqueuedbmanager.cpp  |    2 +-
 4 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/src/condor_c++_util/JobLogReader.cpp b/src/condor_c++_util/JobLogReader.cpp
index cd4baa5..32284c6 100644
--- a/src/condor_c++_util/JobLogReader.cpp
+++ b/src/condor_c++_util/JobLogReader.cpp
@@ -65,7 +65,7 @@ JobLogReader::Poll() {
 		return;
 	}
 
-	probe_st = prober.probe(parser.getLastCALogEntry(),parser.getFileDescriptor());
+	probe_st = prober.probe(parser.getLastCALogEntry(),parser.getFilePointer());
 
 	bool success = true;
 	switch(probe_st) {
diff --git a/src/condor_c++_util/prober.cpp b/src/condor_c++_util/prober.cpp
index 91af2cb..67cb91c 100644
--- a/src/condor_c++_util/prober.cpp
+++ b/src/condor_c++_util/prober.cpp
@@ -126,11 +126,18 @@ Prober::getCurProbedCreationTime() {
 //! probe job_queue.log file
 ProbeResultType
 Prober::probe(ClassAdLogEntry *curCALogEntry,
-			  int job_queue_fd)
+			  FILE* job_queue_fp)
 {
 	FileOpErrCode   st;
 	int op_type;
 	struct stat filestat;
+	int job_queue_fd = fileno(job_queue_fp);
+	//TODO: uncomment and possibly change
+	/* 
+	if (job_queue_fd == -1) {
+		return PROBE_ERROR;
+	}
+	*/
 
 	//TODO: should use condor's StatInfo instead.
 	if (fstat(job_queue_fd, &filestat) == -1)
@@ -146,7 +153,7 @@ Prober::probe(ClassAdLogEntry *curCALogEntry,
 
 	ClassAdLogParser caLogParser;
 	
-	caLogParser.setFileDescriptor(job_queue_fd);
+	caLogParser.setFilePointer(job_queue_fp);
 	caLogParser.setNextOffset(0);
 	st = caLogParser.readLogEntry(op_type);
 	
diff --git a/src/condor_c++_util/prober.h b/src/condor_c++_util/prober.h
index eb9e5fb..3cb3b31 100644
--- a/src/condor_c++_util/prober.h
+++ b/src/condor_c++_util/prober.h
@@ -41,7 +41,7 @@ public:
 
 	//! probe job_queue.log file
 	ProbeResultType probe(ClassAdLogEntry *curCALogEntry,
-						  int job_queue_fd);
+						  FILE* job_queue_fp);
 
 	//! update state information about size of log file last probed etc.
 	//! Call this after successfully responding to probe() result.
diff --git a/src/condor_tt/jobqueuedbmanager.cpp b/src/condor_tt/jobqueuedbmanager.cpp
index 32c6500..dd70fc0 100644
--- a/src/condor_tt/jobqueuedbmanager.cpp
+++ b/src/condor_tt/jobqueuedbmanager.cpp
@@ -343,7 +343,7 @@ JobQueueDBManager::maintain()
 	
 		// polling
 	probe_st = prober->probe(caLogParser->getCurCALogEntry(), 
-							 caLogParser->getFileDescriptor());
+							 caLogParser->getFilePointer());
 
 	// no harm in doing this every time - if someone shutdown and reset
 	// the schedd, we want to be sure to have the latest verison (we could
-- 
1.5.5.6

==11162== Memcheck, a memory error detector
==11162== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==11162== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==11162== Command: /opt/condor/current/sbin/condor_quill -f
==11162== Parent PID: 11148
==11162== 
==11162== 
==11162== HEAP SUMMARY:
==11162==     in use at exit: 29,717,572 bytes in 55,122 blocks
==11162==   total heap usage: 31,434,939 allocs, 31,379,817 frees, 14,717,558,886 bytes allocated
==11162== 
==11162== 16 bytes in 1 blocks are possibly lost in loss record 145 of 571
==11162==    at 0x4A062CA: operator new[](unsigned long) (vg_replace_malloc.c:264)
==11162==    by 0x51F429: SimpleList<classy_counted_ptr<CCBListener> >::SimpleList() (simplelist.h:78)
==11162==    by 0x51FF05: CCBListeners::CCBListeners() (ccb_listener.h:76)
==11162==    by 0x5139B2: DaemonCore::reconfig() (daemon_core.cpp:2700)
==11162==    by 0x524D0D: main (daemon_core_main.cpp:2254)
==11162== 
==11162== 
==11162== ---- Print suppression ? --- [Return/N/n/Y/y/C/c] ---- ==11162== 30 bytes in 1 blocks are possibly lost in loss record 285 of 571
==11162==    at 0x4A0666E: operator new(unsigned long) (vg_replace_malloc.c:220)
==11162==    by 0x3ED6C9B800: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.8)
==11162==    by 0x3ED6C9D0B0: std::string::_M_mutate(unsigned long, unsigned long, unsigned long) (in /usr/lib64/libstdc++.so.6.0.8)
==11162==    by 0x3ED6C9D22B: std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long) (in /usr/lib64/libstdc++.so.6.0.8)
==11162==    by 0x57C704: Sinful::Sinful(char const*) (condor_sinful.cpp:194)
==11162==    by 0x50FA60: DaemonCore::InfoCommandSinfulStringMyself(bool) (daemon_core.cpp:1064)
==11162==    by 0x511714: DaemonCore::InfoCommandSinfulString(int) (daemon_core.cpp:973)
==11162==    by 0x51540B: DaemonCore::InitDCCommandSocket(int) (daemon_core.cpp:8503)
==11162==    by 0x5246F0: main (daemon_core_main.cpp:2107)
==11162== 
==11162== 37 bytes in 1 blocks are possibly lost in loss record 358 of 571
==11162==    at 0x4A0666E: operator new(unsigned long) (vg_replace_malloc.c:220)
==11162==    by 0x3ED6C9B800: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.8)
==11162==    by 0x3ED6C9D0B0: std::string::_M_mutate(unsigned long, unsigned long, unsigned long) (in /usr/lib64/libstdc++.so.6.0.8)
==11162==    by 0x3ED6C9D22B: std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long) (in /usr/lib64/libstdc++.so.6.0.8)
==11162==    by 0x57C6EA: Sinful::Sinful(char const*) (condor_sinful.cpp:191)
==11162==    by 0x50FA60: DaemonCore::InfoCommandSinfulStringMyself(bool) (daemon_core.cpp:1064)
==11162==    by 0x511714: DaemonCore::InfoCommandSinfulString(int) (daemon_core.cpp:973)
==11162==    by 0x51540B: DaemonCore::InitDCCommandSocket(int) (daemon_core.cpp:8503)
==11162==    by 0x5246F0: main (daemon_core_main.cpp:2107)
==11162== 
==11162== 45 bytes in 1 blocks are possibly lost in loss record 389 of 571
==11162==    at 0x4A0666E: operator new(unsigned long) (vg_replace_malloc.c:220)
==11162==    by 0x3ED6C9B800: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.8)
==11162==    by 0x3ED6C9C304: ??? (in /usr/lib64/libstdc++.so.6.0.8)
==11162==    by 0x3ED6C9C4B1: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.8)
==11162==    by 0x57C627: Sinful::Sinful(char const*) (condor_sinful.cpp:177)
==11162==    by 0x50FA60: DaemonCore::InfoCommandSinfulStringMyself(bool) (daemon_core.cpp:1064)
==11162==    by 0x511714: DaemonCore::InfoCommandSinfulString(int) (daemon_core.cpp:973)
==11162==    by 0x51540B: DaemonCore::InitDCCommandSocket(int) (daemon_core.cpp:8503)
==11162==    by 0x5246F0: main (daemon_core_main.cpp:2107)
==11162== 
==11162== 1,032 bytes in 1 blocks are possibly lost in loss record 544 of 571
==11162==    at 0x4A062CA: operator new[](unsigned long) (vg_replace_malloc.c:264)
==11162==    by 0x549253: ExtArray<RuntimeConfigItem>::ExtArray(int) (extArray.h:74)
==11162==    by 0x542F07: __static_initialization_and_destruction_0(int, int) (condor_config.cpp:2115)
==11162==    by 0x542F4E: global constructors keyed to ConfigTab (condor_config.cpp:2578)
==11162==    by 0x7E5975: ??? (in /opt/condor/condor-7.4.4/sbin/condor_quill)
==11162==    by 0x4DD4CA: ??? (in /opt/condor/condor-7.4.4/sbin/condor_quill)
==11162==    by 0x302F: ???
==11162==    by 0x7E58D6: __libc_csu_init (in /opt/condor/condor-7.4.4/sbin/condor_quill)
==11162==    by 0x3ED581D92D: (below main) (in /lib64/libc-2.5.so)
==11162== 
==11162== 28,677,184 bytes in 50,488 blocks are possibly lost in loss record 571 of 571
==11162==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==11162==    by 0x3ED5860F78: fdopen@@GLIBC_2.2.5 (in /lib64/libc-2.5.so)
==11162==    by 0x56A313: ClassAdLogParser::setFileDescriptor(int) (classadlogparser.cpp:69)
==11162==    by 0x56A79C: Prober::probe(ClassAdLogEntry*, int) (prober.cpp:149)
==11162==    by 0x4F513C: JobQueueDBManager::maintain() (jobqueuedbmanager.cpp:346)
==11162==    by 0x4EC99F: TTManager::maintain() (ttmanager.cpp:235)
==11162==    by 0x4ECF70: TTManager::pollingTime() (ttmanager.cpp:197)
==11162==    by 0x52B0B2: TimerManager::Timeout() (timer_manager.cpp:394)
==11162==    by 0x50C0D3: DaemonCore::Driver() (daemon_core.cpp:2912)
==11162==    by 0x524DDA: main (daemon_core_main.cpp:2287)
==11162== 
==11162== LEAK SUMMARY:
==11162==    definitely lost: 0 bytes in 0 blocks
==11162==    indirectly lost: 0 bytes in 0 blocks
==11162==      possibly lost: 28,678,344 bytes in 50,493 blocks
==11162==    still reachable: 1,039,228 bytes in 4,629 blocks
==11162==         suppressed: 0 bytes in 0 blocks
==11162== Reachable blocks (those to which a pointer was found) are not shown.
==11162== To see them, rerun with: --leak-check=full --show-reachable=yes
==11162== 
==11162== For counts of detected and suppressed errors, rerun with: -v
==11162== ERROR SUMMARY: 6 errors from 6 contexts (suppressed: 4 from 4)