
Re: [HTCondor-users] memory leak in condor_q?



Thank you, Todd! We need to measure condor_q -global performance
because our users rely heavily on its counterpart in our current
system. They use it to poll their jobs, grep through the output, and
so on. Worse, we can't (and don't want to) tie a user to a specific
submission node, because leaving users untied gives us more efficient
load balancing between schedds. That means a user who wants to find a
job has to query with -global. Also, the dedicated schedd nodes won't
be reachable by our users through ssh or similar, so every submission
will be a -remote (or -name) submission.
With Condor, I'm aware there are much more efficient ways to do this
polling (like checking the job log) that put far less load on the
service. But bad habits die hard: during the transitional phase some
users will surely stick to global queries to poll their jobs, so we
have to be prepared.
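
As a rough illustration of the job log approach mentioned above, here is a minimal Python sketch that derives each job's last known state from a user job log instead of querying the schedd. It assumes the classic text log layout, where each event starts with a three-digit event number followed by (cluster.proc.subproc); the sample log, the function name job_states, and the state labels are hypothetical and only for illustration.

```python
import re

# Assumed event header layout: "NNN (cluster.proc.subproc) date time message"
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) ")

# Event numbers mapped to a coarse state label (labels are illustrative).
EVENT_STATE = {
    "000": "idle",       # job submitted
    "001": "running",    # job executing
    "004": "idle",       # job evicted
    "005": "completed",  # job terminated
    "009": "removed",    # job aborted
    "012": "held",       # job held
    "013": "idle",       # job released
}


def job_states(log_text):
    """Return {"cluster.proc": state} using the last state-changing event per job."""
    states = {}
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line)
        if match is None:
            continue  # continuation line or "..." event separator
        code, cluster, proc = match.groups()
        if code in EVENT_STATE:
            states[f"{int(cluster)}.{int(proc)}"] = EVENT_STATE[code]
    return states


# Hypothetical sample log, not taken from a real pool.
SAMPLE_LOG = """\
000 (1234.000.000) 02/06 10:11:12 Job submitted from host: <10.0.0.1:9618>
...
000 (1235.000.000) 02/06 10:11:13 Job submitted from host: <10.0.0.1:9618>
...
001 (1234.000.000) 02/06 10:12:00 Job executing on host: <10.0.0.2:9618>
...
001 (1235.000.000) 02/06 10:12:05 Job executing on host: <10.0.0.3:9618>
...
005 (1234.000.000) 02/06 10:20:00 Job terminated.
\t(1) Normal termination (return value 0)
...
012 (1235.000.000) 02/06 10:21:00 Job was held.
...
"""

if __name__ == "__main__":
    for job_id, state in sorted(job_states(SAMPLE_LOG).items()):
        print(job_id, state)
```

Reading a local log file this way never contacts the schedd, which is the point of preferring it over repeated global queries.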
I had a look at
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=ReleaseHistory, but
is there a rough estimate of when the next development release (8.1.4)
will be out?

Thanks,
Daniel

2014-02-06 Todd Tannenbaum <tannenba@xxxxxxxxxxx>:
>
> Indeed, as Brian says, the patches going into the code deal with the
> condor_q -stream issue below.  Note that an often overlooked downside of
> "condor_q -stream" is that the results will not be sorted in any manner, as
> normally the schedd sends the job ads out in hash table order and condor_q
> does the sorting.
>
> Another often overlooked option is to use condor_status with either "-schedd"
> or "-submitter", instead of polling condor_q, to get "big picture"
> aggregate statistics such as total jobs running/idle per schedd or per user,
> respectively.  There is no need to use condor_q to slowly dump all 500,000
> jobs and then count them to get this sort of aggregate info; condor_status is
> much faster and less resource intensive.  Yes, condor_status gives you
> "cached" information that may only be updated once or twice a minute, but in
> many situations (like portals that want to update a web page or what have
> you) that is just fine...
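
The condor_status suggestion above could be scripted roughly as follows. This is a sketch under stated assumptions: it assumes that condor_status accepts -autoformat together with -schedd, and that the schedd ads carry the attributes Name, TotalRunningJobs and TotalIdleJobs. The function names and the sample output are hypothetical. Only the parsing step is exercised here; query_schedd_totals needs a working pool.

```python
import subprocess


def parse_schedd_totals(text):
    """Parse lines of the form 'name running idle' into a dict keyed by schedd name."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, running, idle = fields
        try:
            totals[name] = {"running": int(running), "idle": int(idle)}
        except ValueError:
            continue  # e.g. an attribute printed as "undefined"
    return totals


def query_schedd_totals():
    """Ask the collector for per-schedd totals (assumed flags and attribute names)."""
    result = subprocess.run(
        ["condor_status", "-schedd", "-autoformat",
         "Name", "TotalRunningJobs", "TotalIdleJobs"],
        check=True, capture_output=True, text=True,
    )
    return parse_schedd_totals(result.stdout)


# Hypothetical output for two schedds, used to exercise the parser only.
SAMPLE_OUTPUT = "schedd1.example.org 120 4500\nschedd2.example.org 80 300\n"

if __name__ == "__main__":
    print(parse_schedd_totals(SAMPLE_OUTPUT))
```

Because the collector already holds these totals, a portal polling this way never asks a schedd to send out its whole queue.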
>
> regards,
> Todd
>
>
> On 2/6/2014 10:11 AM, Pek Daniel wrote:
>>
>> Yaaay, perfect! :)
>>
>> 2014-02-06 Brian Bockelman <bbockelm@xxxxxxxxxxx>:
>>>
>>> Hi Daniel,
>>>
>>> This is expected.  Stream does not stream (well, until my patches land).
>>> It still buffers the entire response in memory before parsing it.
>>>
>>> Stream prevents HTCondor from sorting the results in memory.
>>>
>>> Non-blocking condor_q patch set will take care of this and turn it into a
>>> real stream.
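
To make the trade-off described above concrete, here is a small Python sketch contrasting a buffered, sorted listing with a true stream. It is not HTCondor code: fake_schedd_ads is a hypothetical stand-in for job ads arriving in arbitrary order, and the two consumer functions are illustrative names.

```python
from typing import Dict, Iterable, Iterator, List


def fake_schedd_ads(n: int) -> Iterator[Dict[str, int]]:
    """Hypothetical source of job ads in a scrambled, hash-table-like order."""
    for i in range(n):
        # Multiplying by a prime scrambles the order deterministically
        # (a permutation as long as n is not a multiple of 7919).
        yield {"ClusterId": (i * 7919) % n, "ProcId": 0}


def buffered_sorted(ads: Iterable[Dict[str, int]]) -> List[Dict[str, int]]:
    """Hold every ad in memory so the result can be sorted before output."""
    held = list(ads)  # memory grows with the size of the queue
    held.sort(key=lambda ad: (ad["ClusterId"], ad["ProcId"]))
    return held


def streamed(ads: Iterable[Dict[str, int]]) -> Iterator[Dict[str, int]]:
    """Hand each ad on as it arrives; only one ad is held, so no sorting is possible."""
    for ad in ads:
        yield ad


if __name__ == "__main__":
    print([ad["ClusterId"] for ad in buffered_sorted(fake_schedd_ads(10))])
    print([ad["ClusterId"] for ad in streamed(fake_schedd_ads(10))])
```

The streamed variant keeps memory flat regardless of queue size, at the cost of emitting ads in whatever order the source produces them.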
>>>
>>> Sent from my iPhone
>>>
>>>> On Feb 6, 2014, at 9:46 AM, Pek Daniel <pekdaniel@xxxxxxxxx> wrote:
>>>>
>>>> I've noticed there's not much difference between the memory
>>>> consumption of condor_q -stream and condor_q:
>>>>
>>>> [root@XXX thrash]# /usr/bin/time -v condor_q -stream >/dev/null
>>>> ...
>>>> Maximum resident set size (kbytes): 186592
>>>> ...
>>>>
>>>> [root@XXX thrash]# /usr/bin/time -v condor_q >/dev/null
>>>> ...
>>>> Maximum resident set size (kbytes): 306352
>>>> ...
>>>>
>>>> There were 100k jobs in the queue. I've dug into the source a bit, and
>>>> I suspect a leak somewhere here:
>>>>
>>>> https://github.com/htcondor/htcondor/blob/b151357dcd13efe2703a2386e1d89bbacac79cd6/src/condor_schedd.V6/qmgmt_send_stubs.cpp#L862-L882
>>>>
>>>> or here:
>>>>
>>>> https://github.com/htcondor/htcondor/blob/0222c71b4a7cf5946ab9d5caf5ecca0ca8c75539/src/condor_utils/classad_oldnew.cpp#L57-L130
>>>>
>>>> Maybe the ReliSock, or the ClassAd...
>>>>
>>>> Cheers,
>>>> daniel
>>>> _______________________________________________
>>>> HTCondor-users mailing list
>>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>>>> with a
>>>> subject: Unsubscribe
>>>> You can also unsubscribe by visiting
>>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>>
>>>> The archives can be found at:
>>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>>
>>
>
>
> --
> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
> Center for High Throughput Computing   Department of Computer Sciences
> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                  Madison, WI 53706-1685