Re: [HTCondor-users] [HTCondor-Users] Cgroup memory hard limit



On 6/23/2020 4:38 AM, ervikrant06@xxxxxxxxx wrote:
Hello Experts,

I am still struggling with this issue. Can anyone please share their thoughts?

Thanks & Regards,
Vikrant Aggarwal


Hi Vikrant,

Sorry to hear about the issues below....

One quick thought: one possible reason for the silence on the below is that both HTCondor v8.5.8 and RHEL 6.10 are quite old. When reporting problems with very basic functionality, that does not matter so much. But when working with more "recent" mechanisms like control groups and namespaces, combined with shared memory etc., it can make a difference. HTCondor v8.5.8 is no longer officially supported, and several reported cgroup issues have been fixed in the past 4+ years. Meanwhile, the RHEL6 kernel ran into so many corner-case issues with cgroups that a couple of years back even Docker, a project with near-infinite resources, gave up on it and now only supports RHEL7. I recall that here at UW-Madison we also gave up on using stock RHEL6 with cgroups (and Docker) after encountering many mysterious problems. At the time, one thing that helped somewhat was grabbing a more modern kernel from the elrepo-kernel channel at https://elrepo.org....

You mention you can easily reproduce the problem, which is great! Could you test with one execute node running HTCondor v8.8.9 (which is still released for RHEL6) and see if the problem still exists? If it does, I'll meet you halfway and try your test with a recent HTCondor on a recent CentOS 7 :).

regards
Todd


On Wed, May 27, 2020 at 4:20 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

    I can easily reproduce this behavior on RHEL 6.10.

    If I copy a file larger than the memory-per-core limit into /dev/shm, the job goes into held status.

    If I try to read a file larger than the requested memory, the job goes into status 4 (completed) instead of held status. I can
    see the error in the stderr log file. It should have gone into removed status; I am not sure why it is marked as completed.

    Thanks & Regards,
    Vikrant Aggarwal