
Re: [HTCondor-users] Limiting HTCondor total RAM usage



On 17/02/2015 14:33, Brian Bockelman wrote:
1) cgroups enforce a "first to touch" model.  The mmap memory usage is charged to the first job to pull a piece of data into the page cache (which may or may not be the first job to run).  When the job exits, the memory usage is migrated to the parent cgroup.
  - Note that the page cache (unless you mlock stuff) is treated as evictable - this will get swapped out if applications need to malloc() something.

2) If you don't trust your users to put in reasonable hard limits, you may want to look at the soft limit model: this only enforces memory limits kernel-side when there is memory pressure on the host (a bit before OOM fires).

The upside of using these approaches is that the OOM killer will delegate its work to HTCondor, and HTCondor will put the job on hold (in most cases - I can't say it'll happen 100% of the time; there are many corner cases when systems are running out of memory).
...

Ah-ha!  Reading your message again, you don't really care about memory usage; you care about making sure the OOM killer doesn't hit important stuff.  Newer versions of HTCondor automatically adjust the jobs' OOM score so that the OOM killer is much more likely to pick them.  Look at oom_adj (newer kernels: oom_score_adj); you can either make critical tasks less likely to get picked by the OOM killer or make HTCondor's jobs more likely.  This is an older kernel feature and is likely to be better supported.
Thank you: that's all really useful stuff to know.

Yes, I need HTCondor to be a good citizen when sitting alongside other apps, and so at minimum not having those other apps killed would be a good thing; but not having HTCondor exceed a certain allocated amount of RAM (and/or swap) would be better.
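As an aside, the "make critical tasks less likely to get picked" half of that advice boils down to something like the following (mycritical_daemon is just a placeholder for whatever must not be killed):

# echo -1000 > /proc/$(pidof mycritical_daemon)/oom_score_adj   # newer kernels
# echo -17   > /proc/$(pidof mycritical_daemon)/oom_adj         # older kernels

where -1000 and -17 are the respective "never pick this process" values.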

I got a very helpful reply from Mark Calleja off-list:

> I have some notes on my experience of HTCondor and cgroups at:
>
> http://www.ucs.cam.ac.uk/scientific/camgrid/technical/Cgroups
>
> I mention some aspects of Debian strangeness half way down the page.

After some binary chopping of configs, which established that the Debian kernel doesn't support "memory.memsw.limit_in_bytes", I ended up with this:

mount {
        cpu     = /sys/fs/cgroup;
        cpuset  = /sys/fs/cgroup;
        cpuacct = /sys/fs/cgroup;
        memory  = /sys/fs/cgroup;
        freezer = /sys/fs/cgroup;
        blkio   = /sys/fs/cgroup;
}

group htcondor {
        cpu {}
        cpuacct {}
        memory {
                memory.use_hierarchy="1";
                memory.memsw.limit_in_bytes=59478159K;
        }
        freezer {}
        blkio {}

        cpuset {
                cpuset.cpus = 0-31;
                cpuset.mems = 0;
        }
}
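For reference, a figure like that memsw limit can be generated straight from /proc/meminfo (MemTotal is reported in kB), e.g.:

# awk '/^MemTotal:/ { printf "%dK\n", $2 * 0.90 }' /proc/meminfo

which is just a sketch of the "90% of RAM" calculation mentioned below.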

(where 59478159K = 90% of system RAM). At this point, things initially appeared to be working:

* I have some jobs running
* The one which is running in slot1_1@<thismachine> was submitted with request_memory="1805"
* This slot has classAd attribute Memory = 1920

# cat /sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot1_1@<thismachine>/memory.soft_limit_in_bytes
2013265920

(that's 1920*1024*1024)

And I get what appears to be an accurate report of the memory used in execution:

# cat /sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot1_1@<thismachine>/memory.usage_in_bytes
1779826688

Unfortunately, I spoke too soon. What happened next is that the OOM killer started killing processes when I don't think it should have.

I can cat memory.usage_in_bytes every second and watch it slowly creep up to 1.6G or 1.7G; then suddenly the process dies, and dmesg shows a splurge of oom_killer backtrace output.
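(The polling was nothing clever, just:

# while sleep 1; do cat /sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot1_1@<thismachine>/memory.usage_in_bytes; done

run as root on the execute node.)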

Setting an empty memory {} block in /etc/cgconfig.conf didn't make a difference (so it's not a question of the overall memory allocated to the htcondor group).

Neither did doing this:

BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = none

Well, technically it did make a difference, in that I could see that

/sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot1_1@<host>/memory.limit_in_bytes
/sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot1_1@<host>/memory.soft_limit_in_bytes

are now both set extremely high (2^63 - 1). So now there is no limit at the htcondor cgroup level, and no limit at the slot cgroup level, but the processes are still being killed!

The only way I could stop this was by commenting out the BASE_CGROUP = htcondor setting entirely, so HTCondor no longer creates per-slot cgroups at all.

Now it appears that something is causing the OOM killer to be invoked when it isn't necessary. With all the cgroup stuff turned off, I can see the slot size:

$ condor_status -long slot1_1@xxxxxxxxxxxxxxxx | grep ^Memory
Memory = 1920

And I can confirm through manual inspection that the memory used by the processes in the job is within that limit:

root@dar1:~# ps auxwww | grep condor_starter | grep "slot1_1 "
condor    8264  0.0  0.0  95720   6976 ?        Ss   16:50   0:00 condor_starter -f -a slot1_1 proliant.example.com
root@dar1:~# pstree -ap 8264
condor_starter,8264 -f -a slot1_1 proliant.example.com
  └─python,8341 /var/lib/condor/execute/dir_8264/condor_exec.exe
      └─bash,8397 -c...
          ├─xxx,8406
          ├─yyy,8404 ...
          └─xxx,8407 ...
root@dar1:~# ps uw 8264 8341 8397 8406 8404 8407
USER       PID %CPU %MEM    VSZ    RSS TTY      STAT START   TIME COMMAND
condor    8264  0.0  0.0  95720   6976 ?        Ss   16:50   0:00 condor_starter -f -a slot1_1 proliant.example.com
brian     8341  0.0  0.0  83168  46628 ?        SNs  16:50   0:00 python /var/lib/condor/execute/dir_8264/condor_exec.exe
brian     8397  0.0  0.0  17660   1464 ?        SN   16:51   0:00 /bin/bash -c set -o pipefail; ...
brian     8404  1.0  1.2 853972 823952 ?        SN   16:51   0:09 yyy ...
brian     8406  4.7  0.0 131332   1476 ?        SN   16:51   0:41 xxx
brian     8407 16.2  1.4 954476 930976 ?        RN   16:51   2:20 zzz ...

(Total RSS of the job's processes = 1,804,496K; slot size is 1920*1024 = 1,966,080K)
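For anyone who wants to reproduce that sum mechanically, something like this does it (it totals the starter plus everything beneath it, so it comes out a few MB above the job-only figure):

# ps -o rss= -p $(pstree -p 8264 | grep -o '([0-9]*)' | tr -d '()' | paste -sd,) \
      | awk '{ t += $1 } END { print t "K total RSS" }'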

But in any case, with limit 'none' or 'soft' I would not expect any processes to be killed.
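(For completeness, the 'soft' variant of that knob is:

BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft

which, as I understand it, makes HTCondor set memory.soft_limit_in_bytes to the slot size - matching what I saw earlier - rather than a hard cap, so jobs should only get squeezed when the machine is genuinely short of memory.)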

This is HTCondor 8.2.4-281588 running under Debian Wheezy amd64, kernel 3.2.65-1+deb7u1.

If anybody has any clues, that would be great, but I'm starting to suspect that the Debian Wheezy kernel is broken w.r.t. cgroups.
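One thing I still need to rule out (a guess, not something I've verified on this box): as I understand it, Debian kernels of this vintage ship with memory cgroup swap accounting disabled unless swapaccount=1 is on the kernel command line (and the memory controller itself may need cgroup_enable=memory), which would at least explain the missing memory.memsw.limit_in_bytes. Quick things to check:

# cat /proc/cmdline                     # look for cgroup_enable=memory swapaccount=1
# grep -w memory /proc/cgroups          # the "enabled" column should be 1
# ls /sys/fs/cgroup/memory.memsw.limit_in_bytes   # only present with swap accounting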

Debian Jessie is supposed to have docker.io officially supported (and therefore implicitly cgroups); or else maybe I should be looking at running something like CoreOS on the metal, and HTCondor inside a container.

Regards,

Brian.