
[HTCondor-users] How to effectively enforce resource/cgroup limits per job?



Hi all,

is there a way in Condor to tune the memory limits for the jobs' cgroups
at a finer granularity?
We just had a few nodes that a user managed to swap to death (see the
attached stats).

For memory handling we have so far been running with soft limits, i.e.,
  CGROUP_MEMORY_LIMIT_POLICY = soft
which, as far as I can see, is reflected in the job slices'
memory.max_usage_in_bytes as well as in the startd log [1].
Since the hard and the mem+swap limits are pretty generous, I suppose
they will never take effect.

Also, if I understand memory.oom_control correctly, out-of-memory
handling is actually left to Condor rather than to the kernel [2], right?
I guess this is so that Condor can clean up a job itself?

Since the OOM situation on the affected nodes became serious pretty
rapidly, I wonder whether we can make the memory control stricter while
still allowing a soft over-allocation.
E.g., per-job hard limits for mem / memsw as multiples of
soft_limit_in_bytes, but still below total mem / total mem + X.
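
To make the idea concrete, here is a minimal sketch of the arithmetic I
have in mind. It is not an existing HTCondor knob: the function name, the
factors 2.0 and 2.5, and the swap_extra parameter (the "X" above) are all
hypothetical and only illustrate the proposed policy.

```python
def proposed_limits(soft_limit, mem_total, hard_factor=2.0,
                    memsw_factor=2.5, swap_extra=0):
    """Return hypothetical (hard, memsw) limits in bytes for one job.

    hard  = hard_factor  * soft_limit, capped at mem_total
    memsw = memsw_factor * soft_limit, capped at mem_total + swap_extra
    """
    hard = min(int(hard_factor * soft_limit), mem_total)
    memsw = min(int(memsw_factor * soft_limit), mem_total + swap_extra)
    # In cgroup v1 the mem+swap limit must not be below the memory limit.
    memsw = max(memsw, hard)
    return hard, memsw


# Soft limit taken from the startd log in [1]; MemTotal converted to bytes
# assuming the kB in /proc/meminfo are units of 1024 bytes.
hard, memsw = proposed_limits(21072183296, 263944848 * 1024)
print(hard, memsw)
```

With the values from [1] this would put the hard limit at twice the soft
limit, far below the node's total memory, instead of above it.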

For the moment I am trying to limit the overall memory of the condor
unit's slice as a safeguard to keep the node responsive, obviously at the
price that all jobs/slices below it will be affected indiscriminately when
a single job pushes the whole Condor slice into the limit :(

Cheers,
  Thomas


[1]
07/30/18 07:15:55 (pid:41725) Running job as user cmsplt036
07/30/18 07:15:55 (pid:41725) Create_Process succeeded, pid=41745
07/30/18 07:15:55 (pid:41725) Limiting (soft) memory usage to 0 bytes
07/30/18 07:15:55 (pid:41725) Limiting memsw usage to 9223372036854775807 bytes
07/30/18 07:15:55 (pid:41725) Limiting (soft) memory usage to 21072183296 bytes
07/30/18 07:15:55 (pid:41725) Limiting (hard) memory usage to 278668124160 bytes
07/30/18 07:15:55 (pid:41725) Limiting memsw usage to 278668128256 bytes

where
  MemTotal:       263944848 kB
i.e., the hard mem and memsw limits are both set ~8 GB larger than the
total mem (MemTotal is reported in units of 1024 bytes)
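
The comparison can be checked with a few lines of arithmetic. This is
only a sketch using the numbers from the log above; the variable names are
mine. Note that the size of the gap depends on how the kB in
/proc/meminfo is read: the kernel reports units of 1024 bytes, and reading
them as 1000 bytes gives a noticeably larger figure.

```python
mem_total_kb = 263944848      # MemTotal from /proc/meminfo
hard_limit = 278668124160     # "Limiting (hard) memory usage to ..."
memsw_limit = 278668128256    # "Limiting memsw usage to ..."

# /proc/meminfo "kB" are units of 1024 bytes.
mem_total_bytes = mem_total_kb * 1024
excess_bytes = hard_limit - mem_total_bytes

# The same gap if kB is (incorrectly) read as 1000 bytes.
excess_if_decimal_kb = hard_limit - mem_total_kb * 1000

print(f"hard limit exceeds MemTotal by {excess_bytes / 1e9:.1f} GB")
print(f"(reading kB as 1000 bytes would give {excess_if_decimal_kb / 1e9:.1f} GB)")
print(f"memsw limit exceeds hard limit by {memsw_limit - hard_limit} bytes")
```

Either way, both limits sit above the physical memory of the node, so a
single job cannot reach them before the machine is already swapping.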

[2]
> cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxx/memory.oom_control

oom_kill_disable 1
under_oom 0
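
For reference, a small sketch of how this file can be read
programmatically. As far as I understand the cgroup v1 memory controller,
oom_kill_disable 1 means the kernel OOM killer is switched off for that
cgroup, so tasks hitting the limit are paused until memory is freed or
something in user space acts on it, and under_oom shows whether the cgroup
is currently in that state. The helper name parse_oom_control is
hypothetical, and the sample text is just the output shown above.

```python
def parse_oom_control(text):
    """Parse the 'key value' lines of a cgroup v1 memory.oom_control file."""
    state = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:          # skip blank or malformed lines
            state[parts[0]] = int(parts[1])
    return state


sample = "oom_kill_disable 1\nunder_oom 0\n"
state = parse_oom_control(sample)

# True when the kernel OOM killer is disabled for this cgroup.
kernel_oom_killer_disabled = state.get("oom_kill_disable") == 1
# True when the cgroup is currently stuck at its limit.
currently_under_oom = state.get("under_oom") == 1

print(kernel_oom_killer_disabled, currently_under_oom)
```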

Attachment: batch0946_load.png
Description: PNG image

Attachment: batch0946_mem.png
Description: PNG image
