
Re: [HTCondor-users] Limiting memory used on the worker node with c-groups




On 4/30/20 4:27 AM, jean-michel.barbet@xxxxxxxxxxxxxxxxx wrote:
On 4/30/20 6:19 AM, tpdownes@xxxxxxxxx wrote:
I do think your problem is as simple as Thomas' question: figuring out why oom_control is set to disabled. These cgroup settings are inherited hierarchically so it could be the htcondor group itself or a cgroup above it. It could even be set system-wide.


HTCondor intentionally sets oom_kill_disable because the starter really needs to know whether the job was OOM killed, so that it can treat the job differently than if it had just received a normal signal 9. We think it is very unfortunate that the OOM killer kills with the usual signal 9 rather than a custom signal just for OOM; we wouldn't need to do this if the OOM signal had its own value. The starter also installs a handler to get notified when the kernel OOM-kills a process in the job. This lets the starter clean up the job and put it on hold with an appropriate message if it gets OOM killed. If we didn't do this, an OOM-killed job would be killed with signal 9 and would probably leave the queue, since from condor's perspective it has exited of its own accord.
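
For anyone who wants to check what a given cgroup actually reports, here is a minimal Python sketch that parses the contents of a cgroup v1 memory.oom_control file (lines such as "oom_kill_disable 1" and "under_oom 0"). It is not HTCondor code. The helper names parse_oom_control and read_oom_control are made up for illustration, and the directory you pass in (something under /sys/fs/cgroup/memory on a cgroup v1 system) depends on how your node is configured.

```python
from pathlib import Path


def parse_oom_control(text):
    """Parse 'name value' lines from a cgroup v1 memory.oom_control file."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "key integer" lines; skip anything else.
        if len(parts) == 2 and parts[1].isdigit():
            fields[parts[0]] = int(parts[1])
    return fields


def read_oom_control(cgroup_dir):
    """Read memory.oom_control from one cgroup directory.

    cgroup_dir is an assumption about your layout, e.g. a directory
    under /sys/fs/cgroup/memory on a cgroup v1 system.
    """
    path = Path(cgroup_dir) / "memory.oom_control"
    return parse_oom_control(path.read_text())


# Sample file contents, so the sketch runs without a real cgroup.
sample = "oom_kill_disable 1\nunder_oom 0\n"
status = parse_oom_control(sample)
print(status)
```

Calling read_oom_control on the job's cgroup directory and on each of its parent directories shows at which level oom_kill_disable is set.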


-greg