
Re: [HTCondor-users] htcondor cgroups and memory limits on CentOS7

Hi again,
Also, you may want to place your memory-limit policy on the execute nodes via a startd policy expression, instead of having it enforced on the submit machine (what I think you are calling the head node). The reason is that the execute-node policy is evaluated every five seconds, while the submit-machine policy is only evaluated every several minutes.
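A minimal sketch of such an execute-node policy (macro and attribute names from the standard startd policy machinery; adapt to your local config before relying on it):

```
# Startd-side memory policy sketch: evict a job whose measured usage
# exceeds the memory provisioned in the slot. The startd re-evaluates
# these expressions every few seconds.
MEMORY_EXCEEDED = ((MemoryUsage =!= UNDEFINED) && (MemoryUsage > Memory))
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
WANT_SUSPEND = ($(WANT_SUSPEND)) && (($(MEMORY_EXCEEDED)) =!= True)
```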
I read that the submit machine has evaluated the expression every 60 seconds since version 7.4 (though admittedly the blog post I read is quite old, so things might have changed again: https://spinningmatt.wordpress.com/2009/12/05/cap-job-runtime-debugging-periodic-job-policy-in-a-condor-pool/).
I'm trying to look at both ResidentSetSize_RAW and MemoryUsage on the schedd machine, and it actually takes a full 15 minutes before either gets a value assigned (unless I have misunderstood the time attributes):

condor_q -autoformat MemoryUsage ResidentSetSize_RAW ClusterId Owner '(ServerTime-LastMatchTime)/60' | sort -rnk5
1221 1104640 101393 atlprd002 22
undefined undefined 101767 atlprd007 15
undefined undefined 101409 atlprd002 8
undefined undefined 101779 atlpil017 4

A runaway job could consume a lot of memory in a few minutes :).
Do you mean I should move SYSTEM_PERIODIC_REMOVE to the WN, or is there another recipe? The recipe I'm using is also used by several other sites.

2) Shouldn't htcondor set the job soft limit with this configuration? or is the site expected to set the soft limit separately?

Personally, I think "soft" limits in cgroups are completely bogus. The way the Linux kernel treats soft limits does not do in practice what anyone (including HTCondor itself) expects. I recommend setting CGROUP_MEMORY_LIMIT to either none or hard; soft makes no sense, imho.

"CGROUP_MEMORY_LIMIT=hard" is clear to understand: if the job uses more memory than it requested, it is __immediately__ kicked off and put on hold. This way users get a consistent experience.

If you want jobs to be able to go over their requested memory so long as the machine isn't swapping, consider disabling swap on your execute nodes (not a bad idea for compute servers in general) and simply leaving "CGROUP_MEMORY_LIMIT=none". What will happen is that if the system is stressed, eventually the Linux OOM killer (out-of-memory killer) will kick in and pick a process to kill.
At the moment there are no limits set in cgroups, i.e. the limit is practically infinite, so neither policy (soft or hard) takes effect and the OOM killer doesn't kick in. This is why sites are setting SYSTEM_PERIODIC_REMOVE. The machines were stressed because the application was using up to 15 times what it requested. For example, using stress I just submitted a job that uses 80GB of memory on a machine that has 64GB of RAM:
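For reference, a hypothetical submit file reproducing that test (executable path, worker count and sizes are illustrative; 8 workers of 10G each give the ~80GB footprint):

```
# stress.sub -- allocate ~80GB with stress(1) while requesting far less
executable     = /usr/bin/stress
arguments      = --vm 8 --vm-bytes 10G --vm-hang 0
request_memory = 4096
request_cpus   = 1
queue
```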

[root@wn2208290 ~]# for a in $(seq -w 1 50); do egrep '^rss|^swap' /sys/fs/cgroup/memory/system.slice/condor.service/condor_scratch_condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/memory.stat|grep -v huge; sleep 5 ;echo;done
rss 65429585920
swap 457154560


rss 65468846080
swap 1864413184

It is happily filling the swap. I don't think removing swap is a good idea, but the sum RAM+swap should indeed be limited, either to a multiple of what is requested or to a default maximum. If I put a 4GB soft limit on the job it does bring the memory down to 40GB, but the limit doesn't affect the swap, which starts increasing at a faster pace.

[root@wn2208290 ~]# for a in $(seq -w 1 50); do egrep '^rss|^swap' /sys/fs/cgroup/memory/system.slice/condor.service/condor_scratch_condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/memory.stat|grep -v huge; sleep 5 ;echo;done
rss 64724119552
swap 16926076928


rss 64367165440
swap 21876707328

However, the soft limit is the only thing it lets me set with a brutal echo redirection. The general memory limit and the memsw limit give errors:

echo 4G > /sys/fs/cgroup/memory/system.slice/condor.service/condor_scratch_condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/memory.limit_in_bytes
-bash: echo: write error: Device or resource busy

echo 4G > /sys/fs/cgroup/memory/system.slice/condor.service/condor_scratch_condor_pool_condor_slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/memory.memsw.limit_in_bytes
-bash: echo: write error: Invalid argument
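One likely explanation (an assumption worth checking against the cgroup v1 memory controller docs): memory.memsw.limit_in_bytes must always be >= memory.limit_in_bytes, so writing 4G to memsw while the plain limit is still unlimited returns EINVAL; and writing a plain limit far below current usage can return EBUSY when the kernel cannot reclaim enough pages. Under that reading, the order would be:

```
# Sketch, cgroup v1 semantics; the slot path is illustrative.
cg=/sys/fs/cgroup/memory/system.slice/condor.service/<slot-cgroup>
echo 100G > "$cg/memory.limit_in_bytes"        # RAM limit first (>= current usage)
echo 100G > "$cg/memory.memsw.limit_in_bytes"  # then RAM+swap, >= RAM limit
```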

So my plan to set this stuff on the fly doesn't seem feasible. I wonder if any of the condor daemons that actually create the job process groups could do that at creation time? I'm just getting started with cgroups; reading the docs I thought things were quite straightforward, but now I'm confused about how it works.


HTCondor sets the OOM priority of job processes such that the OOM killer should always pick job processes ahead of other processes on the system. Furthermore, HTCondor "captures" the OOM request to kill a job and only allows it to proceed if the job is indeed using more memory than requested (i.e. than provisioned in the slot). This is probably what you wanted when setting the limit to soft in the first place.

I am thinking we should remove the "soft" option to CGROUP_MEMORY_LIMIT in future releases; it just causes confusion, imho. Curious whether others on the list disagree...

Hope the above helps,

Respect is a rational process. \\//
Fatti non foste a viver come bruti, ma per seguir virtute e canoscenza(Dante)
For Ur-Fascism, disagreement is treason. (U. Eco)
But but but her emails... covfefe!

HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting

The archives can be found at:
