On 17/02/2015 14:33, Brian Bockelman wrote:
> ...1) cgroups enforce a "first to touch" model. The mmap memory usage
> is charged to the first job to pull a piece of data into the page cache
> (which may or may not be the first job to run). When the job exits, the
> memory usage is migrated to the parent cgroup.
>
> - Note that the page cache (unless you mlock stuff) is treated as
> evictable - this will get swapped out if applications need to malloc()
> something.
>
> 2) If you don't trust your users to put in reasonable hard limits, you
> may want to look at the soft limit model: this only enforces memory
> limits kernel-side when there is memory pressure on the host (a bit
> before OOM fires).
>
> The upside of using these approaches is that the OOM killer will
> delegate its work to HTCondor, and HTCondor will put the job on hold
> (in most cases - can't say it'll happen 100% of the time; there are
> many corner cases when systems are running out of memory).

Thank you: that's all really useful stuff to know.

> Ah-ha! Reading your message again, you don't really care about memory
> usage, you care about making sure the OOM killer doesn't hit important
> stuff. Newer versions of HTCondor automatically adjust the job's OOM
> score so it's much more likely to pick these. Look at oom_adj (newer
> kernels: oom_score_adj); you can either make critical tasks less likely
> to get picked by the OOM killer, or make HTCondor more likely. This is
> an older kernel feature and is likely to be better supported.

Yes, I need HTCondor to be a good citizen when sitting alongside other
apps, so at minimum not having those other apps killed would be a good
thing; but not having HTCondor exceed a certain allocated amount of RAM
(and/or swap) would be better.

I got a very helpful reply from Mark Calleja off-list:

> I have some notes on my experience of HTCondor and cgroups at:
>
> http://www.ucs.cam.ac.uk/scientific/camgrid/technical/Cgroups
>
> I mention some aspects of Debian strangeness half way down the page.
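Brian's oom_adj / oom_score_adj suggestion is easy to experiment with by
hand. A minimal sketch (assuming a Linux host with /proc mounted;
$CRIT_PID is a hypothetical PID of a critical task, not something from
this thread):

```shell
# Read this shell's current OOM-score adjustment (range -1000..1000 on
# kernels with oom_score_adj; older kernels expose oom_adj, -17..15):
cat /proc/self/oom_score_adj

# Make the current process MORE likely to be picked by the OOM killer;
# raising the value needs no special privilege:
echo 500 > /proc/self/oom_score_adj
cat /proc/self/oom_score_adj

# Making a critical task LESS likely requires CAP_SYS_RESOURCE (root):
#   echo -1000 > /proc/$CRIT_PID/oom_score_adj
```

Per Brian's note, newer HTCondor versions effectively do the first kind
of adjustment (raising the job's score) automatically.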
After some binary chopping of configs, to determine that the Debian
kernel doesn't support "memory.memsw.limit_in_bytes", I ended up with
this:

mount {
    cpu     = /sys/fs/cgroup;
    cpuset  = /sys/fs/cgroup;
    cpuacct = /sys/fs/cgroup;
    memory  = /sys/fs/cgroup;
    freezer = /sys/fs/cgroup;
    blkio   = /sys/fs/cgroup;
}

group htcondor {
    cpu {}
    cpuacct {}
    memory {
        memory.use_hierarchy = "1";
        memory.memsw.limit_in_bytes = 59478159K;
    }
    freezer {}
    blkio {}
    cpuset {
        cpuset.cpus = 0-31;
        cpuset.mems = 0;
    }
}

(where 59478159K = 90% of system RAM)

At this point it initially appeared to be working:

* I have some jobs running
* The one running in slot1_1@<thismachine> was submitted with
  request_memory = "1805"
* This slot has classAd attribute Memory = 1920

# cat /sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot1_1@<thismachine>/memory.soft_limit_in_bytes
2013265920

(that's 1920*1024*1024). And I get what appears to be an accurate report
of the memory used in execution:

# cat /sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot1_1@<thismachine>/memory.usage_in_bytes
1779826688

Unfortunately, I spoke too soon. What happened then is that the OOM
killer started killing processes when I don't think it should. I can cat
memory.usage_in_bytes every second and watch it slowly creep up to 1.6G
or 1.7G; then suddenly the process dies, and dmesg shows a splurge of
oom_killer backtrace output.

Setting memory {} in /etc/cgconfig.conf didn't make a difference (so
it's not a question of the overall memory allocated to htcondor).
Neither did setting this:

BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = none

Well, technically it did make a difference, in that I could see that

/sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot1_1@<host>/memory.limit_in_bytes
/sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot1_1@<host>/memory.soft_limit_in_bytes

are now both set extremely high (2^63 - 1).
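As an aside, the 59478159K figure (and the memsw-support check) can be
scripted rather than worked out by hand. A sketch, assuming a Linux host
with /proc/meminfo and the cgroup mount point used in the config above:

```shell
# Print 90% of MemTotal in the "<n>K" form used in cgconfig.conf:
awk '/^MemTotal:/ { printf "%dK\n", $2 * 9 / 10 }' /proc/meminfo

# Check whether the running kernel exposes the memsw limit at all
# (Debian kernels typically need the swapaccount=1 boot parameter
# before memory+swap accounting appears):
if [ -e /sys/fs/cgroup/htcondor/memory.memsw.limit_in_bytes ]; then
    echo "memsw limit available"
else
    echo "memsw limit not available"
fi
```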
So now there is no limit at the htcondor cgroup level and no limit at
the slot cgroup level, but the processes are still being killed! The
only way I could stop this was by commenting out the BASE_CGROUP =
htcondor setting entirely, so that htcondor doesn't create per-process
cgroups any more.

It now appears that something is causing the OOM killer to be invoked
when it isn't necessary. With all the cgroup stuff turned off, I can see
the slot size:

$ condor_status -long slot1_1@xxxxxxxxxxxxxxxx | grep ^Memory
Memory = 1920

And I can confirm through manual inspection that the memory used by the
processes in the job is within that limit:

root@dar1:~# ps auxwww | grep condor_starter | grep "slot1_1 "
condor    8264  0.0  0.0  95720   6976 ?  Ss  16:50  0:00 condor_starter -f -a slot1_1 proliant.example.com

root@dar1:~# pstree -ap 8264
condor_starter,8264 -f -a slot1_1 proliant.example.com
  └─python,8341 /var/lib/condor/execute/dir_8264/condor_exec.exe
      └─bash,8397 -c...
          ├─xxx,8406
          ├─yyy,8404 ...
          └─xxx,8407 ...

root@dar1:~# ps uw 8264 8341 8397 8406 8404 8407
USER      PID %CPU %MEM    VSZ    RSS TTY  STAT START  TIME COMMAND
condor   8264  0.0  0.0  95720   6976 ?    Ss   16:50  0:00 condor_starter -f -a slot1_1 proliant.example.com
brian    8341  0.0  0.0  83168  46628 ?    SNs  16:50  0:00 python /var/lib/condor/execute/dir_8264/condor_exec.exe
brian    8397  0.0  0.0  17660   1464 ?    SN   16:51  0:00 /bin/bash -c set -o pipefail; ...
brian    8404  1.0  1.2 853972 823952 ?    SN   16:51  0:09 yyy ...
brian    8406  4.7  0.0 131332   1476 ?    SN   16:51  0:41 xxx
brian    8407 16.2  1.4 954476 930976 ?    RN   16:51  2:20 zzz ...

(Total RSS of the job's processes, excluding the starter itself, is
1,804,496K; the slot size is 1920*1024 = 1,966,080K.)

But in any case, with limit policy 'none' or 'soft' I would not expect
any processes to be killed at all.

This is condor 8.2.4-281588 running under Debian Wheezy amd64, kernel
3.2.65-1+deb7u1. If anybody has any clues that would be great, but I'm
starting to suspect that the Debian Wheezy kernel is broken w.r.t.
cgroups.
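The manual RSS arithmetic above can be automated. A sketch that totals
VmRSS over a whole process tree (assumes Linux /proc and pgrep from
procps; the current shell's PID stands in for the real starter PID, 8264
above, so the sketch is self-contained):

```shell
#!/bin/sh
# Sum VmRSS (in KiB) for a process and all of its descendants, to
# compare against the slot's Memory attribute.
rss_tree() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status" 2>/dev/null
    for child in $(pgrep -P "$1" 2>/dev/null); do
        rss_tree "$child"
    done
}

total=0
for kb in $(rss_tree $$); do   # $$ stands in for the starter PID
    total=$((total + kb))
done
echo "Total RSS: ${total}K"
```

Run against the starter PID this would give the 1,804,496K-style figure
directly, instead of adding up RSS columns by hand.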
Debian Jessie is supposed to have docker.io officially supported (and
therefore, implicitly, cgroups); or else maybe I should be looking at
running something like CoreOS on the bare metal, with HTCondor inside a
container.

Regards,

Brian.