Re: [HTCondor-users] [HTCondor-Users] Cgroup memory hard limit
- Date: Tue, 30 Jun 2020 10:26:10 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] [HTCondor-Users] Cgroup memory hard limit
On 6/30/2020 8:10 AM, ervikrant06@xxxxxxxxx wrote:
Today I used 7.7 with HTCondor version 8.8.5 (not the latest one, but much newer than 8.5.8 :-) ). I can see the job
always goes into held status whenever it exceeds the memory limit, with both the bash and C++ tests.
Always glad to hear about progress and success!
Thanks for the update,
Thanks & Regards,
On Wed, Jun 24, 2020 at 1:17 AM Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 6/23/2020 4:38 AM, ervikrant06@xxxxxxxxx wrote:
> Hello Experts,
> I am still juggling with this issue. Can anyone please share thoughts?
> Thanks & Regards,
> Vikrant Aggarwal
Sorry to hear about the issues below....
One quick thought: One possible reason for silence on the below is that both HTCondor v8.5.8 and RHEL 6.10 are quite old.
When reporting problems with very basic functionality, that does not matter so much. But when working with more
"recent" mechanisms like control groups and namespaces, combined with shared memory etc, this can make a difference.
HTCondor v8.5.8 is no longer officially supported, and several reported cgroup issues have been fixed in the past 4+
years. Meanwhile the RHEL6 kernel had so many encountered corner case issues with cgroups that a couple years back even
Docker, a project with near infinite resources, gave up on it and now only supports RHEL7. I recall here at
we also gave up on using stock RHEL6 with cgroups (and docker) after encountering many mysterious problems. At the
one thing that helped some was grabbing a more modern kernel from elrepo-kernel channel at https://elrepo.org....
You mention you can easily reproduce the problem, which is great! Could you test with one execute node at HTCondor
v8.8.9 (which is still released for RHEL6) and see if the problem still exists? If it does, I'll meet you half-way and
try your test with a recent HTCondor on a recent CentOS 7 :).
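
For anyone wanting to repeat this kind of test, a reproduction job could look roughly like the sketch below. It is only an illustration, not the bash or C++ test used in this thread: the function name touch_memory, the chunk size, and the default target are all made up for the example. The idea is to run it as the job executable with an argument (in MB) larger than the job's request_memory, then check whether the job ends up held.

```python
import sys


def touch_memory(total_mb, chunk_mb=16):
    """Allocate roughly total_mb of memory in chunk_mb pieces.

    Hypothetical helper for illustration. Every 4 KiB page is written
    to, so the memory is actually resident and counted against the job
    rather than just reserved.
    """
    chunks = []
    allocated = 0
    while allocated < total_mb:
        size = min(chunk_mb, total_mb - allocated)
        block = bytearray(size * 1024 * 1024)
        # Touch one byte per page to force the kernel to back it.
        for offset in range(0, len(block), 4096):
            block[offset] = 1
        chunks.append(block)  # keep a reference so nothing is freed
        allocated += size
    return chunks


if __name__ == "__main__":
    # Target size in MB; pick a value above the job's memory request.
    target_mb = int(sys.argv[1]) if len(sys.argv) > 1 else 64
    held = touch_memory(target_mb)
    print("allocated %d MB in %d chunks" % (target_mb, len(held)))
```

If the final print line shows up in the job's stdout, the job was allowed to go past its request; what happens instead (held, killed, or completed with an error) depends on the cgroup configuration on the execute node.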
> On Wed, May 27, 2020 at 4:20 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
>    I can easily reproduce this behavior on RHEL 6.10.
>    If I copy a file larger than the memory-per-core limit into /dev/shm, the job goes into held status.
>    If I try to read a file larger than the requested memory, the job goes into status 4 (completed) instead of held. I can
>    see the error in the stderr log file. It should have gone into removed status; I am not sure why it is marked as completed.
>    Thanks & Regards,
>    Vikrant Aggarwal
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685