
Re: [HTCondor-users] [HTCondor-Users] Cgroup memory hard limit



On 6/30/2020 8:10 AM, ervikrant06@xxxxxxxxx wrote:
Thanks Todd.

Today I used 7.7 with HTCondor version 8.8.5 (not the latest one, but much newer than 8.5.8 :-) ). I can see the job always goes into held status whenever it exceeds its memory limit, with both the bash and C++ tests.


Hi Vikrant,

Always glad to hear about progress and success!

Thanks for the update,
Todd



Thanks & Regards,
Vikrant Aggarwal


On Wed, Jun 24, 2020 at 1:17 AM Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:

    On 6/23/2020 4:38 AM, ervikrant06@xxxxxxxxx wrote:
     > Hello Experts,
     >
     > I am still juggling with this issue. Can anyone please share thoughts?
     >
     > Thanks & Regards,
     > Vikrant Aggarwal
     >

    Hi Vikrant,

    Sorry to hear about the issues below....

    One quick thought: One possible reason for silence on the below is that both HTCondor v8.5.8 and RHEL 6.10 are quite old.
    When reporting problems with very basic functionality, that does not matter so much. But when working with more
    "recent" mechanisms like control groups and namespaces, combined with shared memory etc., this can make a difference.
    HTCondor v8.5.8 is no longer officially supported, and several reported cgroup issues have been fixed in the past 4+
    years. Meanwhile, the RHEL6 kernel had so many corner-case issues with cgroups that a couple of years back even
    Docker, a project with near infinite resources, gave up on it and now only supports RHEL7. I recall that here at
    UW-Madison we also gave up on using stock RHEL6 with cgroups (and Docker) after encountering many mysterious problems.
    At the time, one thing that helped some was grabbing a more modern kernel from the elrepo-kernel channel at
    https://elrepo.org....

    You mention you can easily reproduce the problem, which is great! Could you test with one execute node at HTCondor
    v8.8.9 (which is still released for RHEL6) and see if the problem still exists? If it does, I'll meet you half-way and
    try your test with a recent HTCondor on a recent CentOS 7 :).

    regards
    Todd

     >
     > On Wed, May 27, 2020 at 4:20 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
     >
     >    I can easily reproduce this behavior on RHEL 6.10.
     >
     >    If I copy a file larger than the memory-per-core limit into /dev/shm, the job goes into held status.
     >
     >    If I try to read a file larger than the requested memory, the job goes into status 4 (completed) instead of
     >    held status. I can see the error in the stderr log file; it should have gone into removed status, and I am
     >    not sure why it is marked as completed.
     >
     >    Thanks & Regards,
     >    Vikrant Aggarwal
     >
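For anyone wanting to repeat the second test above without a large input file, a rough stand-in might look like the sketch below. It is not the bash or C++ test used in this thread; the function name allocate_mib and its parameters are hypothetical, and it assumes a 4 KiB page size. It simply allocates memory in chunks and writes to every page so the memory is actually resident rather than only reserved.

```python
PAGE_SIZE = 4096  # assumed page size; adjust if the execute node differs


def allocate_mib(total_mib, chunk_mib=16):
    """Allocate about total_mib MiB and touch every page.

    Returns the list of buffers so the caller keeps them alive.
    Hypothetical helper for illustration only.
    """
    chunks = []
    allocated = 0
    while allocated < total_mib:
        size_mib = min(chunk_mib, total_mib - allocated)
        buf = bytearray(size_mib * 1024 * 1024)
        # Write one byte per page so the kernel has to back it with real memory.
        for offset in range(0, len(buf), PAGE_SIZE):
            buf[offset] = 1
        chunks.append(buf)
        allocated += size_mib
    return chunks
```

Calling allocate_mib with a value well above the job's request_memory (for example allocate_mib(2048) in a job that requests 1024 MB) should push the job past its limit; whether it then ends up held or completed is exactly the behavior being compared in this thread.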



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685