Re: [HTCondor-users] increasing schedd memory usage [v8.6.0?]

Hi Thomas,

That makes sense - when the schedd forked a child condor_schedd to answer a query, the child grew rapidly in terms of RAM usage.  By default, the parent and child share the same pages in RAM (copy-on-write), so the cost of the fork is relatively low.  *However*, if the parent condor_schedd's queue is rapidly changing, every page the parent modifies has to be copied, and Linux ends up allocating a lot of new pages.
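To make the fork behaviour concrete, here is a minimal sketch in Python (Unix only). It is not how the schedd is implemented; the `queue` dict and the function name are made up for illustration. The child keeps seeing the data as it was at fork time, even after the parent changes it, which is exactly why the two processes can no longer share the modified pages.

```python
import os


def fork_snapshot_demo():
    """Fork a child, change the parent's data, and return what each side sees."""
    queue = {"job_1": "idle"}        # hypothetical stand-in for the in-memory job queue
    result_r, result_w = os.pipe()   # child -> parent: what the child sees
    go_r, go_w = os.pipe()           # parent -> child: "I have modified the queue"

    pid = os.fork()
    if pid == 0:
        # Child: wait until the parent has changed its copy, then report our view.
        try:
            os.close(result_r)
            os.close(go_w)
            os.read(go_r, 1)
            os.write(result_w, queue["job_1"].encode())
        finally:
            os._exit(0)              # never fall through into the parent's code

    # Parent: modify the queue after the fork, then let the child answer.
    os.close(result_w)
    os.close(go_r)
    queue["job_1"] = "held"
    os.write(go_w, b"x")
    child_view = os.read(result_r, 64).decode()
    os.waitpid(pid, 0)
    os.close(result_r)
    os.close(go_w)
    return child_view, queue["job_1"]


if __name__ == "__main__":
    print(fork_snapshot_demo())
```

The child reports the pre-fork state while the parent holds the new one. On a real schedd the same thing happens page by page, so the more the queue changes while a child is alive, the more memory ends up duplicated.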

A large, changing queue causes trouble in two ways:
- It breaks the page sharing between parent and child, AND
- It causes the child to live longer, because the response to the query takes longer to produce.

Note that you can drastically reduce the number of helpers and force the schedd to respond to queries in-process.  This makes the parent process work harder, but it consumes less memory in the worst-case scenario.
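As a back-of-the-envelope illustration of that worst case, here is a small Python sketch. The model is my own simplification, not something the schedd reports: it assumes every helper is alive at the same time and that some fraction of the parent's pages stops being shared while each one runs. I believe the knob that controls the number of helpers is SCHEDD_QUERY_WORKERS, but please check the manual for your version before relying on that.

```python
def worst_case_memory_gb(parent_gb, workers, dirtied_fraction=1.0):
    """Rough upper bound on RAM used by the schedd plus its forked query helpers.

    Assumption (hypothetical model): all helpers are alive at once, and for each
    of them `dirtied_fraction` of the parent's pages is no longer shared.
    """
    if workers < 0 or not 0.0 <= dirtied_fraction <= 1.0:
        raise ValueError("workers must be >= 0 and dirtied_fraction in [0, 1]")
    unshared_per_helper = parent_gb * dirtied_fraction
    return parent_gb + workers * unshared_per_helper


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any site.
    print(worst_case_memory_gb(2.0, 8))   # sharing completely broken
    print(worst_case_memory_gb(2.0, 0))   # no helpers, queries answered in-process
```

With these made-up numbers, a 2 GB schedd with 8 helpers could need up to 18 GB if the sharing breaks completely, while with no helpers it stays at 2 GB. That is the trade-off: more work for the parent, but a much lower ceiling on memory.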


> On Feb 14, 2017, at 7:48 AM, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
> Hi all,
> we probably found the cause and fixed it (fingers crossed)
> post mortem ~>
> https://confluence.desy.de/pages/viewpage.action?pageId=47425023
> Presumably during a 'transparent' maintenance on the ARC's underlying
> supervisor, Condor shadows etc. could not access the local job files.
> This caused(?) a large number of jobs to be seen as failed by condor and
> sending them to hold.
> Apparently, condor was overwhelmed by the large number of hold jobs
> (160.000 jobs in hold, /var/lib/condor/spool.old.20170214/job_queue.log
> already at ~1.4GB). Plain removing the hold jobs with condor_rm failed
> accordingly(?), so that we moved the spool dir away and gave condor a
> fresh start.
> Since then, the node has been running fine again.
> Cheers,
>  Thomas