
Re: [HTCondor-users] [HTCondor-Users] Cgroup memory hard limit

Thanks Todd.

Today I tested on 7.7 with HTCondor version 8.8.5 (not the latest one, but much newer than 8.5.8 :-) ). I can see the job always goes into held status whenever it breaches the memory limit, with both the bash and C++ tests.

Thanks & Regards,
Vikrant Aggarwal

On Wed, Jun 24, 2020 at 1:17 AM Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 6/23/2020 4:38 AM, ervikrant06@xxxxxxxxx wrote:
> Hello Experts,
> I am still juggling with this issue. Can anyone please share thoughts?
> Thanks & Regards,
> Vikrant Aggarwal

Hi Vikrant,

Sorry to hear about the issues below....

One quick thought: one possible reason for the silence on the below is that both HTCondor v8.5.8 and RHEL 6.10 are quite old.
When reporting problems with very basic functionality, that does not matter so much. But when working with more
"recent" mechanisms like control groups and namespaces, combined with shared memory etc., it can make a difference.
HTCondor v8.5.8 is no longer officially supported, and several reported cgroup issues have been fixed in the past 4+
years. Meanwhile, the RHEL6 kernel had so many corner-case issues with cgroups that a couple of years back even
Docker, a project with near-infinite resources, gave up on it and now only supports RHEL7. I recall that here at UW-Madison
we also gave up on using stock RHEL6 with cgroups (and Docker) after encountering many mysterious problems. At the time,
one thing that helped some was grabbing a more modern kernel from the elrepo-kernel channel at https://elrepo.org....

You mention you can easily reproduce the problem, which is great! Could you test with one execute node at HTCondor
v8.8.9 (which is still released for RHEL6) and see if the problem still exists? If it does, I'll meet you half-way and
try your test with a recent HTCondor on a recent Centos7 :).


> On Wed, May 27, 2020 at 4:20 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
>    I can easily reproduce this behavior on RHEL 6.10:
>    If I copy a file larger than the per-core memory limit into /dev/shm, the job goes into held status.
>    If I try to read a file larger than request_memory, the job goes into status 4 (completed) instead of held status. I can
>    see the error in the stderr log file; it should have gone into removed status, so I'm not sure why it's marked as completed.
>    Thanks & Regards,
>    Vikrant Aggarwal
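For anyone following along: the execute-node knob that controls whether HTCondor enforces the cgroup memory limit as a hard or soft limit is CGROUP_MEMORY_LIMIT_POLICY. A minimal sketch of the relevant configuration plus a matching submit file for the memory test (all values are illustrative, and the executable name is hypothetical):

```
# condor_config on the execute node (requires a kernel with working
# cgroup support; BASE_CGROUP names the parent cgroup HTCondor uses)
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = hard

# submit description for the memory-breach test: the job tries to
# touch 2048 MiB while only requesting 1024 MiB
executable     = memtest
arguments      = 2048
request_memory = 1024
queue
```

With the hard policy, a job that exceeds request_memory should be put on hold by the startd, which matches the held-status behavior reported above for v8.8.5.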