
Re: [HTCondor-users] cgroups and OOM killers



On 7/6/23 10:01, Mary Hester wrote:
Hello HTCondor experts,

We're seeing some interesting behaviour with user jobs on our local HTCondor cluster, running version 9.8.

Basically, if a job in the cgroup goes sufficiently far over memory that the container can no longer allocate the accountable memory needed for basic functioning of the system as a whole (e.g. to hold its cmdline), then the container impacts the whole system and will bring it down. That is a worse outcome than condor not being able to fully record the status/failure reason for any single container. And since oom_kill_disable is set to 1, the kernel will not intervene, so the entire system grinds to a halt. It is preferable to lose state for a single job, let the kernel do its thing, and have the system survive. Right now, the only workaround is to run

    for i in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do echo 0 > $i ; done

in a loop to ensure the sysadmin-intended settings are applied to the condor-managed cgroups.


Hi Mary:

I'm sorry your system is having problems. Perhaps what is happening is that swap is enabled on the system, the cgroup is limiting only the amount of physical memory the job uses, and the system is paging itself to death before the starter can read the OOM message. Can you try setting

DISABLE_SWAP_FOR_JOB = true

and see if the problem persists?
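
A quick way to check whether the node really is swapping (just a sanity check on my hypothesis, assuming a standard Linux toolset; it isn't anything the starter reports) is:

    # Is any swap device active on this node?
    swapon --show

    # Watch the si/so columns while a job runs; sustained nonzero values
    # mean the machine is paging.
    vmstat 5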

The reason condor sets oom_kill_disable to true is that the starter registers to be notified of the OOM event, so that it can know the job exited because it was OOM-killed. It sounds like the system is so overloaded that this event isn't getting delivered or processed.
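
For reference, on cgroup v1 you can see both the flag condor sets and whether a cgroup is currently stuck in an OOM state by reading memory.oom_control directly. A small sketch, reusing the htcondor cgroup layout from your message:

    # cgroup v1: dump the OOM control state of every condor-managed job cgroup.
    # Each file reports oom_kill_disable, under_oom, and (on newer kernels) oom_kill.
    for f in /sys/fs/cgroup/memory/htcondor/condor*/memory.oom_control ; do
        echo "== $f" ; cat "$f"
    done

A cgroup showing under_oom 1 together with oom_kill_disable 1 has its tasks paused waiting for the OOM condition to be resolved, which matches the hang you're describing.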


Newer versions of HTCondor, and those using cgroup v2, don't set oom_kill_disable; they wait for the cgroup to die, and cgroup v2 has first-class support for querying whether the OOM killer fired. We hope this will be a more reliable method in the future.
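
For example, on a cgroup v2 host the kernel keeps per-cgroup event counters that can simply be read after the fact; the path below is only illustrative, and the real slot cgroup name will differ:

    # cgroup v2: nonzero oom / oom_kill counters mean the memory limit was hit
    # and the OOM killer fired inside this cgroup.
    cat /sys/fs/cgroup/htcondor/<slot cgroup>/memory.events

No eventfd registration is needed; the counters persist for the life of the cgroup.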


Let us know how this goes,


-greg