Re: [HTCondor-users] htcondor cgroups and memory limits on CentOS7

Hi Todd, (sorry to fork in between)

I am a bit confused regarding the soft limits.

So far I had assumed that the kernel lets a cgroup exceed its soft limit
as long as free memory is available, and only kills the group's
processes once the system runs low on unwired memory (assuming the
limits in condor are translated into cgroup limits).

So we have effectively not set a 'real' cgroup hard limit, assuming that
the soft limit would be sufficient. E.g., would the kernel kill [1] once
it exceeds its 4GB soft limit and the system is running low on memory?
(Looking at the values now: would memsw, set to such a large value,
actually send the job into heavy swapping?)



On 2017-10-20 18:26, Todd Tannenbaum wrote:
> On 10/20/2017 9:44 AM, Alessandra Forti wrote:
>> Hi,
>> is more information needed?
> Hi Alessandra,
> The version of HTCondor you are using would be helpful :).
> But I have some answers/suggestions below that I hope will help...
>>> * On the head node
>>> RemoveMemoryUsage = ( ResidentSetSize_RAW > 2000*RequestMemory )
>>> SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage) || <OtherParameters>
>>> So there are two questions:
>>> 1) Why didn't SYSTEM_PERIODIC_REMOVE work?
> Because the (system_)periodic_remove expressions are evaluated by the
> condor_shadow while the job is running, and the *_RAW attributes are
> only updated in the condor_schedd.
> A simple solution is to use attribute MemoryUsage instead of
> ResidentSetSize_RAW. So I think things will work as you want if you
> instead did:
>   RemoveMemoryUsage = ( MemoryUsage > 2*RequestMemory )
>   SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage) || <OtherParameters>
> Note that MemoryUsage is in the same units as RequestMemory, so you
> only need to multiply by 2 instead of 2000.
> You are not the first person to be tripped up by this. :( I realize it
> is not at all intuitive. I think I will add a quick patch in the code to
> allow _RAW attributes to be referenced inside of job policy expressions
> to help prevent frustration by the next person.
> Also you may want to place your memory limit policy on the execute nodes
> via a startd policy expression, instead of having it enforced on the
> submit machine (what I think you are calling the head node). The reason
> is the execute node policy is evaluated every five seconds, while the
> submit machine policy is evaluated only every several minutes. A runaway
> job could consume a lot of memory in a few minutes :).
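
For reference, an execute-node policy of the kind described above might
look roughly like the sketch below. It follows the PREEMPT / WANT_HOLD
pattern from the HTCondor documentation. The macro name MEMORY_EXCEEDED,
the factor of 2 and the hold-reason text are assumptions chosen for
illustration, not something taken from this thread, so please check it
against the manual for your HTCondor version before using it.

```
# Sketch only: startd-side memory policy for the execute nodes.
# MEMORY_EXCEEDED is a local macro name chosen for illustration.
# "=?= True" keeps the expression False while MemoryUsage is still undefined.
MEMORY_EXCEEDED = ((MemoryUsage > 2*RequestMemory) =?= True)

# Evict the job once it crosses the threshold ...
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
# ... never just suspend it in that case ...
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= True
# ... and put it on hold with a reason instead of letting it run again.
WANT_HOLD = ($(MEMORY_EXCEEDED))
WANT_HOLD_REASON = ifThenElse($(MEMORY_EXCEEDED), "Job used more than twice its requested memory.", undefined)
```

This assumes PREEMPT and WANT_SUSPEND are already defined earlier in the
configuration, as they are in the default policy.
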
>>> 2) Shouldn't htcondor set the job soft limit with this configuration?
>>> or is the site expected to set the soft limit separately?
> Personally, I think "soft" limits in cgroups are completely bogus. The
> way the Linux kernel treats soft limits does not do in practice what
> anyone (including htcondor itself) expects. I recommend setting
> CGROUP_MEMORY_LIMIT_POLICY to either none or hard; soft makes no sense
> imho. "CGROUP_MEMORY_LIMIT_POLICY=hard" is clear to understand: if the
> job uses more memory than it requested, it is __immediately__ kicked
> off and put on hold. This way users get a consistent experience.
> If you want jobs to be able to go over their requested memory so long as
> the machine isn't swapping, consider disabling swap on your execute
> nodes (not a bad idea for compute servers in general) and simply leaving
> "CGROUP_MEMORY_LIMIT_POLICY=none". What will happen is if the system is
> stressed, eventually the Linux OOM (out-of-memory) killer will kick in
> and pick a process to kill. HTCondor sets the OOM priority of job
> processes such that the OOM killer should always pick job processes
> ahead of other processes on the system. Furthermore, HTCondor "captures"
> the OOM request to kill a job and only allows it to continue if the job
> is indeed using more memory than requested (i.e. provisioned in the
> slot). This is probably what you wanted by setting the limit to soft in
> the first place.
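
As a configuration fragment, the two alternatives described above would
look roughly as follows. This is only a sketch: as far as I know the
manual spells the knob CGROUP_MEMORY_LIMIT_POLICY, and I am assuming that
cgroup-based tracking is already enabled on the execute nodes.

```
# Option A: strict enforcement. A job that exceeds its requested memory
# is kicked off and put on hold.
CGROUP_MEMORY_LIMIT_POLICY = hard

# Option B: no cgroup memory limit. Jobs may exceed their request until the
# machine is under memory pressure and the OOM killer steps in (best
# combined with swap disabled on the execute node).
# CGROUP_MEMORY_LIMIT_POLICY = none
```

Only one of the two settings should be active at a time; the second is
commented out here for that reason.
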
> I am thinking we should remove the "soft" option to
> CGROUP_MEMORY_LIMIT_POLICY in future releases; it just causes confusion
> imho. Curious if others on the list disagree...
> Hope the above helps,
> regards,
> Todd