
Re: [HTCondor-users] htcondor cgroups and memory limits on CentOS7

A couple of LIGO admins, including myself, have recently complained specifically about how high the combined memory + SWAP limit is. As you note, when used with a soft limit it effectively allows any job to use all of RAM and SWAP. When used with the HTCondor notion of a hard limit, it effectively allows any job to use all of SWAP.


Most of the problem is the kernel itself: the cgroups v1 memory controller doesn't regulate swap separately from RAM. It regulates the combined footprint. So if you set a hard limit on RAM of 4G and a hard limit on RAM+swap of 4G, you could have 0G in RAM and 4G in swap. It's just how it is.


But, right now, the limit for RAM+SWAP is set to something like all the RAM+SWAP in the system.
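The accounting described above can be sketched as a small check. This is only an illustration of the two caps (RAM alone, and RAM+swap combined), not kernel code; the function name `fits_cgroup_v1` and the 64G/16G machine sizes are assumptions made up for the example.

```python
GIB = 1024 ** 3


def fits_cgroup_v1(ram, swap, mem_limit, memsw_limit):
    """Return True if a (ram, swap) footprint in bytes satisfies both caps.

    mem_limit   caps RAM alone           (memory.limit_in_bytes)
    memsw_limit caps RAM + swap together (memory.memsw.limit_in_bytes)
    There is no separate cap on swap alone.
    """
    return ram <= mem_limit and ram + swap <= memsw_limit


# Hard limit of 4G on RAM and 4G on RAM+swap: the job may sit entirely in swap.
all_in_swap = fits_cgroup_v1(0, 4 * GIB, 4 * GIB, 4 * GIB)
one_byte_over = fits_cgroup_v1(4 * GIB, 1, 4 * GIB, 4 * GIB)

# Hypothetical node with 64G RAM and 16G swap, where the combined limit is
# left at roughly the whole machine: a 4G job may then fill all of swap.
machine_total = (64 + 16) * GIB
fills_swap = fits_cgroup_v1(4 * GIB, 16 * GIB, 4 * GIB, machine_total)
```

With the combined limit equal to the RAM limit, the first footprint is accepted and the second rejected; with the combined limit at machine size, the third is accepted, which is the behaviour complained about here.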


I think they heard us when we complained and that solutions are forthcoming but you should understand that the bulk of the problem is in the kernel itself. The only good swap is dead swap.



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Alessandra Forti <Alessandra.Forti@xxxxxxx>
Reply-To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Date: Monday, October 23, 2017 at 10:12 AM
To: Todd Tannenbaum <tannenba@xxxxxxxxxxx>, HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] htcondor cgroups and memory limits on CentOS7


Hi Todd,

On 20/10/2017 17:26, Todd Tannenbaum wrote:

On 10/20/2017 9:44 AM, Alessandra Forti wrote:


is more information needed?

Hi Alessandra,

The version of HTCondor you are using would be helpful :).

I'm using 8.6.6

But I have some answers/suggestions below that I hope will help...

* On the head node

RemoveMemoryUsage = ( ResidentSetSize_RAW > 2000*RequestMemory )
SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage)  || <OtherParameters>

So the questions are two

1) Why SYSTEM_PERIODIC_REMOVE  didn't work?

Because the (system_)periodic_remove expressions are evaluated by the condor_shadow while the job is running, and the *_RAW attributes are only updated in the condor_schedd.

A simple solution is to use attribute MemoryUsage instead of ResidentSetSize_RAW.  So I think things will work as you want if you instead did:

  RemoveMemoryUsage = ( MemoryUsage > 2*RequestMemory )
  SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage)  || <OtherParameters>

let me get this straight if I replace ResidentSetSize_RAW with MemoryUsage it should work?

Note that MemoryUsage is in the same units as RequestMemory, so only need to multiply by 2 instead of 2000.
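The unit difference can be checked with a few lines of Python. This is a sketch under two assumptions: that ResidentSetSize_RAW is reported in KiB, and that MemoryUsage follows the usual default of ( ResidentSetSize + 1023 ) / 1024 with integer division. The function names are invented for the example.

```python
def memory_usage_mb(resident_set_size_kib):
    """Approximate MemoryUsage (MB) from a resident set size in KiB.

    Assumes the default definition (ResidentSetSize + 1023) / 1024,
    evaluated with integer division (i.e. rounded up to the next MB).
    """
    return (resident_set_size_kib + 1023) // 1024


def exceeds_raw_policy(rss_kib, request_memory_mb):
    """Original expression: ResidentSetSize_RAW > 2000*RequestMemory."""
    return rss_kib > 2000 * request_memory_mb


def exceeds_usage_policy(rss_kib, request_memory_mb):
    """Suggested expression: MemoryUsage > 2*RequestMemory."""
    return memory_usage_mb(rss_kib) > 2 * request_memory_mb


# A job that requested 2048 MB and currently holds 5 GiB resident.
request_mb = 2048
rss_kib = 5 * 1024 * 1024
raw_hit = exceeds_raw_policy(rss_kib, request_mb)
usage_hit = exceeds_usage_policy(rss_kib, request_mb)
```

The two thresholds are not byte-for-byte identical (a factor of 2000 versus 2*1024), so a job sitting just above twice its request in KiB terms can trip the first expression slightly earlier than the second.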

You are not the first person to be tripped up by this. :(  I realize it is not at all intuitive. I think I will add a quick patch in the code to allow _RAW attributes to be referenced inside of job policy expressions to help prevent frustration by the next person.

Also you may want to place your memory limit policy on the execute nodes via a startd policy expression, instead of having it enforced on the submit machine (what I think you are calling the head node).  The reason is the execute node policy is evaluated every five seconds, while the submit machine policy is evaluated every several minutes.

I read that the submit machine evaluates the expression every 60 seconds since version 7.4, though admittedly the blog I read is quite old, so things might have changed again (https://spinningmatt.wordpress.com/2009/12/05/cap-job-runtime-debugging-periodic-job-policy-in-a-condor-pool/).

A runaway job could consume a lot of memory in a few minutes :).

Do you mean I should move SYSTEM_PERIODIC_REMOVE to the WN? Or is there another recipe? The recipe I'm using is used by several other sites.

2) Shouldn't htcondor set the job soft limit with this configuration? or is the site expected to set the soft limit separately?

Personally, I think "soft" limits in cgroups are completely bogus.  The way the Linux kernel treats soft limits does not do in practice what anyone (including htcondor itself) expects.  I recommend setting CGROUP_MEMORY_LIMIT to either none or hard, soft makes no sense imho.

"CGROUP_MEMORY_LIMIT=hard" is clear to understand: if the job uses more memory than it requested, it is __immediately__ kicked off and put on hold.  This way users get a consistent experience.

If you want jobs to be able to go over their requested memory so long as the machine isn't swapping, consider disabling swap on your execute nodes (not a bad idea for compute servers in general) and simply leaving "CGROUP_MEMORY_LIMIT=none".  What will happen is if the system is stressed, eventually the Linux OOM (out of memory killer) will kick in and pick a process to kill. 

At the moment there are no limits set in cgroups, i.e. the limit number is practically infinite, so either policy - soft or hard - might not work without one (OOM doesn't kick in). This is why sites are setting the system_periodic_remove. The machines were stressed because the application was using up to 15 times what it requested. For example, using stress I just submitted a job that uses 80GB of memory on a machine that has 64GB of RAM:

[root@wn2208290 ~]# for a in $(seq -w 1 50); do egrep '^rss|^swap' /sys/fs/cgroup/memory/system.slice/condor.service/condor_scratch_condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/memory.stat|grep -v huge; sleep 5 ;echo;done
rss 65429585920
swap 457154560


rss 65468846080
swap 1864413184

it is happily filling the swap. I don't think removing the swap is a good idea but the sum RAM+swap should be indeed limited to either a multiple of what is requested or a default max limit. If I put a soft limit to 4GB in the job it does bring down the memory to 40GB but the limit doesn't affect the swap which starts increasing at a faster pace.

[root@wn2208290 ~]# for a in $(seq -w 1 50); do egrep '^rss|^swap' /sys/fs/cgroup/memory/system.slice/condor.service/condor_scratch_condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/memory.stat|grep -v huge; sleep 5 ;echo;done
rss 64724119552
swap 16926076928


rss 64367165440
swap 21876707328
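The shell loop above can also be written as a small Python helper, which makes it easier to log or compare samples. This is a sketch only: it assumes the cgroup v1 `memory.stat` layout of one `name value` pair per line, and the function names and the `stat_path` argument are hypothetical (the real path is the per-slot one shown in the commands above).

```python
def parse_memory_stat(text, keys=("rss", "swap")):
    """Extract selected counters (in bytes) from memory.stat contents.

    Only exact key matches are kept, so 'rss_huge' is skipped in the same
    way the 'grep -v huge' in the shell loop skips it.
    """
    values = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in keys:
            values[parts[0]] = int(parts[1])
    return values


def read_cgroup_usage(stat_path):
    """Read rss and swap for one cgroup; stat_path points at its memory.stat."""
    with open(stat_path) as handle:
        return parse_memory_stat(handle.read())


# First sample from the transcript above, plus lines that must be ignored.
sample = "cache 1000\nrss 65429585920\nrss_huge 0\nswap 457154560\n"
usage = parse_memory_stat(sample)
combined_bytes = usage["rss"] + usage["swap"]
```

Summing the two counters gives the combined footprint that a RAM+swap limit would have to cap.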

However, the soft limit is the only thing it lets me set with a brutal echo redirection. The general memory limit and the memsw limit both give errors:

echo 4G > /sys/fs/cgroup/memory/system.slice/condor.service/condor_scratch_condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/memory.limit_in_bytes
-bash: echo: write error: Device or resource busy

echo 4G > /sys/fs/cgroup/memory/system.slice/condor.service/condor_scratch_condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/memory.memsw.limit_in_bytes
-bash: echo: write error: Invalid argument
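One possible reading of the "Invalid argument" error, offered as an assumption rather than a diagnosis: cgroups v1 requires memory.memsw.limit_in_bytes to stay greater than or equal to memory.limit_in_bytes at all times, so a 4G memsw limit is rejected while the memory limit is still unlimited. ("Device or resource busy" on the first write is a different condition, typically seen when current usage cannot be reclaimed below the new limit.) The sketch below only plans the order of the two writes under that constraint. The function name and the `UNLIMITED` constant are invented for the example, and it does not touch the filesystem.

```python
GIB = 1024 ** 3
UNLIMITED = 2 ** 63 - 1  # stand-in for "no limit set"; an assumption


def plan_limit_writes(cur_mem, cur_memsw, new_mem, new_memsw):
    """Order the two limit writes so that memsw >= mem holds after each one.

    Returns a list of (file_name, value_in_bytes) in the order to write.
    """
    if new_memsw < new_mem:
        raise ValueError("the RAM+swap limit cannot be below the RAM limit")
    mem = ("memory.limit_in_bytes", new_mem)
    memsw = ("memory.memsw.limit_in_bytes", new_memsw)
    if new_mem <= cur_memsw:
        # Lowering (or staying under the current combined cap): RAM first.
        return [mem, memsw]
    # Raising above the current combined cap: widen RAM+swap first.
    return [memsw, mem]


# The case from the transcript: both limits unlimited, both targets 4G.
lowering = plan_limit_writes(UNLIMITED, UNLIMITED, 4 * GIB, 4 * GIB)
# The opposite direction: raising both limits from 4G to 8G.
raising = plan_limit_writes(4 * GIB, 4 * GIB, 8 * GIB, 8 * GIB)
```

Even with the right ordering, lowering the limit of a cgroup that is already far above the target can still fail, so setting both limits when the cgroup is created is the more robust option.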

So my plan to set this stuff on the fly doesn't seem feasible. I wonder if any of the condor daemons that actually create the condor job process groups could do that at creation time? I'm just getting started with cgroups. Reading the docs, I thought things were quite straightforward, but now I'm confused about how it works.


HTCondor sets the OOM priority of job processes such that the OOM killer should always pick job processes ahead of other processes on the system.  Furthermore, HTCondor "captures" the OOM request to kill a job and only allows it to continue if the job is indeed using more memory than requested (i.e. provisioned in the slot). This is probably what you wanted by setting the limit to soft in the first place.

I am thinking we should remove the "soft" option to CGROUP_MEMORY_LIMIT in future releases, it just causes confusion imho.  Curious if others on the list disagree...

Hope the above helps,

Respect is a rational process. \\//
Fatti non foste a viver come bruti, ma per seguir virtute e canoscenza(Dante)
For Ur-Fascism, disagreement is treason. (U. Eco)
But but but her emails... covfefe!