
Re: [HTCondor-users] cgroups and OOM killers




Hi Mary,

A couple comments in addition to the wisdom from Greg below:

1. How HTCondor on the Execution Point (EP) reacts to the OOM killer was changed/improved starting with HTCondor version 10.3.0 to deal with issues like yours below.  From the version history in the Manual:

When HTCondor is configured to use cgroups, if the system as a whole is out of memory, and the kernel kills a job with the out of memory killer, HTCondor now checks to see if the job is below the provisioned memory. If so, HTCondor now evicts the job, and marks it as idle, not held, so that it might start again on a machine with sufficient resources. Previously, HTCondor would let this job attempt to run, hoping the next time the OOM killer fired it would pick a different process. (HTCONDOR-1512)
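
As a quick sanity check, you can compare each job's provisioned memory against what it is actually using with condor_q; RequestMemory and MemoryUsage are the standard job ClassAd attribute names, but do verify them against the ads in your own pool:

# Print requested memory (MB) next to measured usage (MB) for running jobs.
condor_q -allusers -constraint 'JobStatus == 2' -af:h ClusterId ProcId RequestMemory MemoryUsage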

2. Perhaps you want to set a value for RESERVED_MEMORY in your HTCondor config?  From the Manual:

RESERVED_MEMORY
How much memory would you like reserved from HTCondor? By default, HTCondor considers all the physical memory of your machine as available to be used by HTCondor jobs. If RESERVED_MEMORY is defined, HTCondor subtracts it from the amount of memory it advertises as available.
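
For example, a minimal sketch of applying and verifying it on the EP (the config.d file name is arbitrary, and I believe the value is interpreted as megabytes, so double-check the Manual for your version):

# Reserve roughly 2 GB of physical memory for the OS and the HTCondor daemons.
echo "RESERVED_MEMORY = 2048" > /etc/condor/config.d/99-reserved-memory.conf
condor_config_val RESERVED_MEMORY    # confirm the new value is picked up
condor_reconfig                      # a startd restart may be needed before the advertised Memory shrinks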

Hope the above plus Greg's ideas below helps,
Todd

On 7/6/2023 10:42 AM, Greg Thain via HTCondor-users wrote:
On 7/6/23 10:01, Mary Hester wrote:
Hello HTCondor experts,

We're seeing some interesting behaviour with user jobs on our local HTCondor cluster, running version 9.8.

Basically, if a job in the cgroup manages to go sufficiently over memory that the container cannot allocate accountable memory needed for basic functioning of the system as a whole (e.g. to hold its cmdline), then the container has an impact on the whole system and will bring it down. This is a worse condition than condor not being able to fully get the status/failure reason for any single specific container. And since oom_kill_disable is set to 1, the kernel will now not intervene, and hence the entire system grinds to a halt. It is preferable to lose state for a single job, have the kernel do its thing, and have the system survive. Right now, the only workaround is to run for i in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do echo 0 > $i ; done in a loop to ensure the sysadmin-intended settings are applied to the condor-managed cgroups.
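
Spelled out as a standalone script, the workaround is roughly the following (paths as above; the sleep interval is arbitrary):

# Keep re-enabling the kernel OOM killer in every condor-managed memory cgroup,
# so that newly created job cgroups also get the sysadmin-intended setting.
while true ; do
    for i in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do
        echo 0 > "$i"    # 0 = oom_kill_disable off, kernel OOM killer active
    done
    sleep 30
done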


Hi Mary:

I'm sorry your system is having problems.  Perhaps what is happening is that swap is enabled on the system, cgroups are limiting the amount of physical memory used by the job, and the system is paging itself to death before the starter can read the OOM message.  Can you try setting

DISABLE_SWAP_FOR_JOB = true

and see if the problem persists?
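
For example, a minimal way to apply and verify it on the EP (the config.d file name is arbitrary):

echo "DISABLE_SWAP_FOR_JOB = true" > /etc/condor/config.d/99-disable-swap.conf
condor_reconfig
condor_config_val DISABLE_SWAP_FOR_JOB    # should now report true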

The reason condor sets oom_kill_disable to true is that the starter registers to get notified of the OOM event, so that it can know the job exited due to an OOM kill.  It sounds like perhaps the system is so overloaded that this event isn't getting delivered or processed.
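
If you want to see what the starter actually set, you can read memory.oom_control for the job cgroups directly (cgroup v1; paths reused from your message above):

# Dump the OOM-control state of every condor-managed memory cgroup.
for f in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do
    echo "== $f" ; cat "$f"
done
# The output includes "oom_kill_disable 1" (set by the starter) and "under_oom 0" or 1.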


Newer versions of HTCondor, and those using cgroup v2, don't set oom_kill_disable; they wait for the cgroup to die, and there is first-class support in the cgroup for querying whether the OOM killer fired.  We hope this will be a more reliable method in the future.
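
For example, on a cgroup v2 machine the kernel keeps per-cgroup OOM counters in memory.events; a rough way to look at them, without assuming a particular layout for the condor cgroups in the unified hierarchy:

# Show the OOM counters for every cgroup whose path mentions condor.
find /sys/fs/cgroup -path '*condor*' -name memory.events -exec grep -H 'oom' {} +
# memory.events includes an "oom" count (limit hit) and an "oom_kill" count (processes killed).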


Let us know how this goes,


-greg


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685