
Re: [HTCondor-users] htcondor cgroups and memory limits on CentOS7



Hi Todd:

I think there are a couple of points to discuss here:

1. Should HTCondor have a CGROUP_MEMORY_LIMIT policy called "soft"?
2. Should HTCondor set cgroups soft memory limits?

Question #1 is poorly posed, and I'll address question #2 below. The cgroups (v1) memory controller poses two questions to the user/HTCondor:

A. At what point do you want me to start reclaiming your unnecessary memory usage?
B. At what point do you want me to kill your job itself?

(A) is the soft limit and will cause the kernel to remove cached files and that sort of thing from memory. (B) is the hard limit, which behaves as you/we'd expect it to.

The fundamental mistake is that HTCondor is only answering one of these questions when the kernel is asking it to answer both. You should have neither a "hard" nor a "soft" policy, but rather a set of knobs that lets you answer these questions separately.
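
Purely to illustrate (these knob names are made up for this email, not existing HTCondor configuration), answering (A) and (B) separately could look something like:

# Hypothetical knobs, invented only to show the two answers being given independently:
# (A) start reclaiming once the job passes 1.0 x its RequestMemory
CGROUP_MEMORY_RECLAIM_LIMIT = 1.0
# (B) OOM-kill the job only once it passes 1.5 x its RequestMemory
CGROUP_MEMORY_KILL_LIMIT = 1.5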

I also think you are being doctrinaire with respect to hard limits. The following statements are true:

* The memory footprint of some jobs is literally unknowable. They may follow a statistical distribution with no a priori way of asserting the footprint of any particular job. Multi-threaded applications may be scheduled in such a way that even deterministic memory footprints vary.
* Users are unlikely to know the memory footprint even if it is a predictable quantity.

You will find that the cgroups v2 philosophy is much less oriented around a hard limit. In fact, it changes to a set of three memory levels plus a separate swap limit (sketched against the v2 interface files just after this list):

(Low) if your footprint is below this value, you are subject to reclaim only once every other cgroup has already been reclaimed from
(High) if you go above this limit, you are subject to heavy reclaim pressure (roughly the current soft limit)
(Max) never, ever go above this limit; exceeding it means an OOM kill
(Swap Max) separate accounting for swap, with only a single hard limit enforced
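
For reference, these correspond to per-cgroup files in the v2 unified hierarchy. A rough sketch with made-up values and an illustrative cgroup path (see the kernel's cgroup-v2 documentation for the exact semantics):

/sys/fs/cgroup/<job-cgroup>/memory.low       e.g. 1G   (reclaim here only as a last resort)
/sys/fs/cgroup/<job-cgroup>/memory.high      e.g. 2G   (heavy reclaim pressure above this)
/sys/fs/cgroup/<job-cgroup>/memory.max       e.g. 3G   (hard ceiling; OOM kill beyond it)
/sys/fs/cgroup/<job-cgroup>/memory.swap.max  e.g. 512M (separate hard limit on swap usage)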

You're better off buying into what the kernel is actually doing rather than what you think it ought to be doing. In the future, it will be doing even less of what you think it ought to be doing.

Tom

On October 20, 2017 at 11:29:19 AM, Todd Tannenbaum (tannenba@xxxxxxxxxxx) wrote:

On 10/20/2017 9:44 AM, Alessandra Forti wrote:
> Hi,
>
> is more information needed?
>

Hi Alessandra,

The version of HTCondor you are using would be helpful :).

But I have some answers/suggestions below that I hope will help...

>> * On the head node
>>
>> RemoveMemoryUsage = ( ResidentSetSize_RAW > 2000*RequestMemory )
>> SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage)  || <OtherParameters>
>>
>> So the questions are two
>>
>> 1) Why SYSTEM_PERIODIC_REMOVE  didn't work?

Because the (system_)periodic_remove expressions are evaluated by the
condor_shadow while the job is running, and the *_RAW attributes are
only updated in the condor_schedd.

A simple solution is to use the attribute MemoryUsage instead of
ResidentSetSize_RAW. So I think things will work as you want if you
instead use:

RemoveMemoryUsage = ( MemoryUsage > 2*RequestMemory )
SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage) || <OtherParameters>

Note that MemoryUsage is in the same units (megabytes) as RequestMemory,
so you only need to multiply by 2 instead of 2000.

You are not the first person to be tripped up by this. :( I realize it
is not at all intuitive. I think I will add a quick patch in the code to
allow _RAW attributes to be referenced inside job policy expressions, to
help prevent frustration for the next person.

Also you may want to place your memory limit policy on the execute nodes
via a startd policy expression, instead of having it enforced on the
submit machine (what I think you are calling the head node). The reason
is that the execute node policy is evaluated every five seconds, while
the submit machine policy is only evaluated every few minutes. A runaway
job could consume a lot of memory in a few minutes :).
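
As a rough, untested sketch (MEMORY_EXCEEDED is just a macro name I picked
here; Memory is the slot's provisioned memory and MemoryUsage is the job's
current usage, both in megabytes), an execute node policy could look like:

MEMORY_EXCEEDED = ( MemoryUsage =!= UNDEFINED && MemoryUsage > Memory )
# OR this into any PREEMPT expression you already have
PREEMPT = ( $(MEMORY_EXCEEDED) )
WANT_HOLD = ( $(MEMORY_EXCEEDED) )
WANT_HOLD_REASON = "job exceeded the memory provisioned in its slot"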

>> 2) Shouldn't htcondor set the job soft limit with this configuration?
>> or is the site expected to set the soft limit separately?
>>

Personally, I think "soft" limits in cgroups are completely bogus. The
way the Linux kernel treats soft limits does not do in practice what
anyone (including HTCondor itself) expects. I recommend setting
CGROUP_MEMORY_LIMIT to either none or hard; soft makes no sense imho.

"CGROUP_MEMORY_LIMIT=hard" is clear to understand: if the job uses more
memory than it requested, it is __immediately__ kicked off and put on
hold. This way users get a consistent experience.

If you want jobs to be able to go over their requested memory so long as
the machine isn't swapping, consider disabling swap on your execute
nodes (not a bad idea for compute servers in general) and simply leaving
"CGROUP_MEMORY_LIMIT=none". What will happen is if the system is
stressed, eventually the Linux OOM (out of memory killer) will kick in
and pick a process to kill. HTCondor sets the OOM priority of job
process such that the OOM killer should always pick job processes ahead
of other processes on the system. Furthermore, HTCondor "captures" the
OOM request to kill a job and only allows it to continue if the job is
indeed using more memory than requested (i.e. provisioned in the slot).
This is probably what you wanted by setting the limit to soft in the
first place.
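
For concreteness, the two workable approaches come down to a one-line
choice in the execute node configuration. A sketch (note the manual
spells this knob CGROUP_MEMORY_LIMIT_POLICY, so check the exact name
for your HTCondor version):

# Option 1: strict enforcement; the job is held as soon as it exceeds its request
CGROUP_MEMORY_LIMIT_POLICY = hard

# Option 2: no per-job ceiling; with swap disabled at the OS level, rely on
# the kernel OOM killer plus HTCondor's handling of OOM events described above
CGROUP_MEMORY_LIMIT_POLICY = none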

I am thinking we should remove the "soft" option to CGROUP_MEMORY_LIMIT
in future releases; it just causes confusion imho. Curious if others on
the list disagree...

Hope the above helps,
regards,
Todd

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/