
[HTCondor-users] cgroups and OOM killers



Hello HTCondor experts,

We're seeing some interesting behaviour with user jobs on our local HTCondor cluster, running version 9.8.

Basically, if a job in its cgroup goes far enough over its memory limit that the cgroup can no longer allocate the kernel-accounted memory needed for basic functioning (e.g. to hold its cmdline), the cgroup starts to affect the machine as a whole and will eventually bring it down. That is a worse outcome than HTCondor not being able to fully collect the status/failure reason for one specific job. And since oom_kill_disable is set to 1 on the condor-managed cgroups, the kernel will not intervene, so the entire system grinds to a halt. We would much rather lose state for a single job, let the kernel do its thing, and have the system survive.

At the moment the only workaround is to run

    for i in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do echo 0 > $i ; done

in a loop, to keep re-applying the sysadmin-intended setting to the condor-managed cgroups.
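For completeness, this is roughly what that loop looks like as a standalone script. It is only a sketch of the workaround above, assuming the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory and that the job cgroups live under htcondor/condor*; the 30-second interval is an arbitrary choice on our side, not anything HTCondor requires.

    #!/bin/bash
    # Periodically re-enable the kernel OOM killer (oom_kill_disable=0)
    # inside every condor-managed memory cgroup.
    while true; do
        for i in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control; do
            # Skip the literal glob if no matching cgroups exist right now.
            [ -e "$i" ] && echo 0 > "$i"
        done
        sleep 30
    done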

Is there a configuration setting to keep oom_kill_disable at 0? Shouldn't this be an option, or was there another reason for oom_kill_disable being set to 1?

Thanks,

Mary