
Re: [HTCondor-users] increasing schedd memory usage [v8.6.0?]



Hi Brian,

many thanks for the explanation!

The issue should be fixed for good, and we no longer underestimate held
jobs the way we did before ;)

A short question: to limit the total number of jobs per scheduler,
MAX_JOBS_SUBMITTED should be the right knob, i.e., covering
idle+running+held jobs, correct?

So far I have only found MaxJobsRunning/MAX_JOBS_RUNNING at 10000 [1] and
would now also like to set an upper limit on the total number of jobs.
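
If that is the right knob, the schedd's local config would then presumably
look something like this (the numbers are only placeholders, and I still
need to check whether MAX_JOBS_SUBMITTED indeed counts held jobs as well):

  # sketch of the intended per-schedd limits, exact values still to be decided
  MAX_JOBS_RUNNING   = 10000
  MAX_JOBS_SUBMITTED = 50000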

Cheers and many thanks,
  Thomas


[1]
>>> thisSchedd['MaxJobsRunning']
10000L
http://research.cs.wisc.edu/htcondor/manual/v8.6/3_5Configuration_Macros.html#param:MaxJobsRunning
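
(For completeness, a minimal way to get such a 'thisSchedd' ad via the
Python bindings would be roughly the following -- the schedd name is just a
placeholder:)

>>> import htcondor
>>> coll = htcondor.Collector()   # default pool collector from the config
>>> thisSchedd = coll.query(htcondor.AdTypes.Schedd,
...                         'Name == "my-schedd.example.org"')[0]
>>> thisSchedd['MaxJobsRunning']
10000L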


On 2017-02-15 03:59, Brian Bockelman wrote:
> Hi Thomas,
> 
> That makes sense - when the schedd forked a child condor_schedd to answer a query, the child grew rapidly in terms of RAM usage.  By default, the parent and child share the same pages in RAM - meaning the cost of the fork is relatively low.  *However*, if the parent condor_schedd's queue is rapidly changing, Linux will allocate a lot of new pages.
> 
> A large, changing queue causes trouble in two ways:
> - it breaks the sharing of pages between parent and child, AND
> - it causes the child to live longer, since answering the query takes longer.
> 
> Note that you can reduce the number of query helpers drastically and force the schedd to respond to queries in-process.  It will make the parent process work harder - but it consumes less memory in the worst-case scenario.
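
I assume the knob meant here is SCHEDD_QUERY_WORKERS (to be double-checked
against the manual); if so, forcing the schedd to answer queries in-process
would presumably be something like this in the schedd's config:

  # sketch: answer queries in the parent schedd, no forked helper processes
  SCHEDD_QUERY_WORKERS = 0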
> 
> Brian
> 
>> On Feb 14, 2017, at 7:48 AM, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
>>
>> Hi all,
>>
>> we probably found the cause and fixed it (fingers crossed)
>>
>> post mortem ~>
>> https://confluence.desy.de/pages/viewpage.action?pageId=47425023
>>
>> Presumably, during a 'transparent' maintenance of the ARC's underlying
>> supervisor, the Condor shadows etc. could not access the local job files.
>> This caused(?) a large number of jobs to be seen as failed by Condor and
>> to be put on hold.
>> Apparently, Condor was overwhelmed by the large number of held jobs
>> (160,000 jobs on hold, /var/lib/condor/spool.old.20170214/job_queue.log
>> already at ~1.4GB). Simply removing the held jobs with condor_rm failed
>> accordingly(?), so we moved the spool dir away and gave Condor a fresh
>> start.
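
(For the record, what we tried was essentially a plain removal of all held
jobs, i.e. roughly

  # remove everything in the held state (JobStatus == 5)
  condor_rm -constraint 'JobStatus == 5'

which the schedd apparently could no longer work through at that point.)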
>>
>> Since then, the node has been running fine again.
>>
>> Cheers,
>>  Thomas
>>
