[HTCondor-users] how to effectively enforce resource/cgroup limits per job?

Hi all,

is there a way in Condor to tune the memory limits of the jobs' cgroups
in a more fine-grained way?
The thing is that we just had a few nodes that a user managed to swap to
death (e.g., see the attached stats).

As for memory handling, we have so far been running with soft limits,
which is AFAIS reflected in the job slices' memory.max_usage_in_bytes as
well as in the startd log [1].
Since the hard and the mem+swap limits are pretty generous, they will
never take effect, I suppose.

Also, if I understand memory.oom_control correctly, the out-of-memory
control is actually not handled by the kernel but by Condor [2], right?
I guess it is then up to Condor to clean up a job?

Since the OOM situation on the affected nodes became serious pretty
rapidly, I wonder whether we can make the memory control stricter while
still allowing a soft over-allocation.
E.g., per-job hard limits for mem / memsw set to multiples of
soft_limit_in_bytes but still below the total mem / total mem + X.
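To make the proposal concrete, here is a minimal sketch of the arithmetic only. The function name, the multipliers, the headroom and the swap allowance X are all hypothetical values chosen for illustration; none of them are HTCondor settings, and the sketch does not show how Condor would be told to apply the results.

```python
GIB = 1024**3


def proposed_limits(soft, mem_total, k_mem=2, k_memsw=3,
                    headroom=2 * GIB, swap_x=8 * GIB):
    """Return hypothetical (hard, memsw) limits in bytes for one job.

    hard  = k_mem * soft,   capped at mem_total - headroom
    memsw = k_memsw * soft, capped at mem_total + swap_x
    All multipliers and margins are illustrative assumptions.
    """
    hard = min(int(k_mem * soft), mem_total - headroom)
    memsw = min(int(k_memsw * soft), mem_total + swap_x)
    # cgroup v1 expects the mem+swap limit to be at least the memory limit
    memsw = max(memsw, hard)
    return hard, memsw


# Soft limit taken from the log excerpt; MemTotal converted assuming kB = 1024 bytes
soft_limit = 21072183296
mem_total = 263944848 * 1024

hard, memsw = proposed_limits(soft_limit, mem_total)
print(f"hard={hard} bytes, memsw={memsw} bytes")
```

With the caps in place, a job with a very large request still cannot be granted more than the node physically has, while ordinary jobs get a hard limit proportional to their soft limit.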

For the moment I am trying to limit the overall memory of the Condor
unit's slice as a safeguard to keep the node responsive - obviously at
the price that all jobs/slices below it get affected indiscriminately
when a single job pushes the whole Condor slice into the limit :(


07/30/18 07:15:55 (pid:41725) Running job as user cmsplt036
07/30/18 07:15:55 (pid:41725) Create_Process succeeded, pid=41745
07/30/18 07:15:55 (pid:41725) Limiting (soft) memory usage to 0 bytes
07/30/18 07:15:55 (pid:41725) Limiting memsw usage to 9223372036854775807 bytes
07/30/18 07:15:55 (pid:41725) Limiting (soft) memory usage to 21072183296 bytes
07/30/18 07:15:55 (pid:41725) Limiting (hard) memory usage to 278668124160 bytes
07/30/18 07:15:55 (pid:41725) Limiting memsw usage to 278668128256 bytes

  MemTotal:       263944848 kB
i.e., the hard mem and memsw limits are both set roughly 8 GB above the
total mem (taking the kB in /proc/meminfo as 1024 bytes; ~15 GB if kB is
read as 1000 bytes)
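A quick check of that comparison, as a sketch. The numbers are copied from the log excerpt and the MemTotal line; the one assumption is that kB in the MemTotal line means 1024 bytes, as it does in /proc/meminfo. Under that assumption the hard limit sits about 8.4 GB above total memory, whereas reading kB as 1000 bytes gives about 14.7 GB.

```python
# Values copied from the log excerpt and the MemTotal line
MEM_TOTAL_KB = 263944848
SOFT_LIMIT_BYTES = 21072183296
HARD_LIMIT_BYTES = 278668124160
MEMSW_LIMIT_BYTES = 278668128256

# Assumption: kB means 1024 bytes (the /proc/meminfo convention)
mem_total_bytes = MEM_TOTAL_KB * 1024

hard_excess = HARD_LIMIT_BYTES - mem_total_bytes
memsw_minus_hard = MEMSW_LIMIT_BYTES - HARD_LIMIT_BYTES

# For comparison: the same difference if kB is read as 1000 bytes
hard_excess_decimal = HARD_LIMIT_BYTES - MEM_TOTAL_KB * 1000

print(f"hard limit above MemTotal (kB=1024): {hard_excess / 1e9:.2f} GB")
print(f"hard limit above MemTotal (kB=1000): {hard_excess_decimal / 1e9:.2f} GB")
print(f"memsw limit above hard limit: {memsw_minus_hard} bytes")
print(f"soft limit as share of MemTotal: {SOFT_LIMIT_BYTES / mem_total_bytes:.1%}")
```

Either way, both limits lie above physical memory, so a single job can drive the node into swap long before its own hard limit is reached.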

> cat

oom_kill_disable 1
under_oom 0
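For reading this output programmatically, a small sketch follows. It parses the two key/value lines shown above; the function name is hypothetical, and the sample text is simply the output quoted in this mail. In cgroup v1, oom_kill_disable 1 means the kernel OOM killer is switched off for that cgroup, so tasks hitting the limit are paused until memory is freed or a userspace listener acts, and under_oom 0 means the cgroup is not in an OOM state at the moment. That is consistent with Condor handling the OOM event itself, though the output alone does not prove it.

```python
def parse_oom_control(text):
    """Parse 'key value' lines as found in a cgroup v1 memory.oom_control file."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed 'name integer' lines, skip anything else
        if len(parts) == 2 and parts[1].isdigit():
            fields[parts[0]] = int(parts[1])
    return fields


# Sample text: the output quoted above
sample = "oom_kill_disable 1\nunder_oom 0\n"
state = parse_oom_control(sample)

kernel_killer_disabled = state.get("oom_kill_disable") == 1
currently_under_oom = state.get("under_oom") == 1

print(f"kernel OOM killer disabled for this cgroup: {kernel_killer_disabled}")
print(f"cgroup currently under OOM: {currently_under_oom}")
```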

