
Re: [HTCondor-users] Memory accounting issue with cgroups



Hi,

One thing we have noticed, which doesn't seem to have changed, is that the reported memory usage is taken from the ClassAd attribute. This is highly inaccurate for two reasons:

* It's sampled (by default every five minutes), so this value might be quite old by the time the OoM event arrives.
* It uses the old memory tracking system based on RSS that doesn't take into account things like tmpfs (for instance, some of our users use /dev/shm).

This inaccuracy results in at least one bug: tmpfs filling up the requested memory is treated as "the system running out of memory". With the default `IGNORE_LEAF_OOM` value of true (still the case in 10.0), this causes jobs to hang, waiting forever for the system to free up memory (when that's not the issue at all).

It also confuses users: they sometimes see a reported "peak usage" much lower than the limit, and it's not clear to them that something else might be going on.

So, would it be possible to make it get the value directly from the cgroup, i.e. `memory.max_usage_in_bytes` or `memory.memsw.max_usage_in_bytes`? I'm talking about cgroups v1, I'm not sure how this would affect v2.
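
To make concrete what I mean by reading the value from the cgroup, here is a minimal Python sketch. It is not HTCondor code; the function names and the example cgroup path are made up for illustration, and the actual location of a job's cgroup depends on how the hierarchy is mounted and named on the execute node. It only assumes cgroups v1 and the two counter files mentioned above.

```python
from pathlib import Path
from typing import Optional

# cgroup v1 peak counters named in the message above.
PEAK_FILES = ("memory.max_usage_in_bytes", "memory.memsw.max_usage_in_bytes")


def read_counter(path: Path) -> Optional[int]:
    """Return the integer stored in a cgroup counter file, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing (e.g. swap accounting disabled) or not a plain integer.
        return None


def peak_memory(cgroup_dir: Path) -> dict:
    """Map each cgroup v1 peak counter name to its value in bytes (None if absent)."""
    return {name: read_counter(cgroup_dir / name) for name in PEAK_FILES}


if __name__ == "__main__":
    # Hypothetical path: the real one depends on the mount point and on how
    # the job's cgroup is named on the execute node.
    job_cgroup = Path("/sys/fs/cgroup/memory/htcondor/example_slot")
    for name, value in peak_memory(job_cgroup).items():
        print(name, value if value is not None else "unavailable")
```

Since these counters are maintained by the kernel, they are current at the moment they are read, which is the point of the request: the number would not depend on when the last sample was taken.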

Best,

Joan

On 19/5/23 10:37, Jan van Eldik wrote:
Hallo Marco,

Could this be the issue addressed in https://github.com/htcondor/htcondor/commit/3c1b39bf5607d7485aa36e90ab8f6de6f99baeb0

Release condor-10.6.0-0.644330.el9.x86_64 includes this, and we have not
observed any cgroups-v2 related crashes on our EL9 servers since we deployed it a few weeks ago.


   hope this helps, groeten, Jan

--
Dr. Joan Josep Piles-Contreras
ZWE Scientific Computing
Max Planck Institute for Intelligent Systems
(p) +49 7071 601 1750
