
Re: [HTCondor-users] Job OOM with CGROUP_MEMORY_LIMIT_POLICY=none



On 12/26/23 14:26, JM wrote:
Based on limited information from the worker logs, the chances are very high that the server did indeed run out of both physical and swap memory. Multiple jobs with the same (high) memory usage pattern hit the server at the same time, and one of them was terminated by the OOM killer. I was confused by the startd log message, which talked about the job's memory usage threshold. The message gave the impression that the job was killed by a policy. If I remember correctly, more typical feedback from HTCondor is that the job was terminated with return value 137.


Hi JM:

This is something that has changed in HTCondor. In the past, if cgroups were not enabled and the OOM killer killed a job (because the system as a whole was out of memory), the job could leave the queue by default, because to HTCondor it simply looked as if the job had been killed with signal 9, perhaps by something inside the job itself.
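(If you are stuck on a version with that older behavior, a rough submit-file workaround along these lines should keep a signal-killed job in the queue; ExitBySignal is the standard job attribute, and the 137 you remember is just 128 + 9, the way most shells report a SIGKILL. Untested sketch, adjust to taste:

    # Keep the job in the queue for another run if it exited on a signal
    # (e.g. SIGKILL from the kernel OOM killer) rather than exiting normally.
    on_exit_remove = (ExitBySignal == False)

)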

Our philosophy is that a job should not leave the queue because of something that happened to it outside of its control. For example, if it is running on a worker node that gets rebooted, by default the job should start again somewhere else; it is not the job's fault the node was rebooted. Likewise, if the OOM killer kills the job not because the job is over its per-cgroup limit, but because the system as a whole is out of memory, we want to treat that the same way.
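For anyone following along, the knob in question is the one in the subject line, set in the execute node's condor_config. A minimal sketch of the two ends of the spectrum (the "hard" alternative is from memory -- check the manual for the values your version actually supports):

    # No per-job cgroup memory limit; only the kernel's system-wide OOM
    # killer can step in (the configuration discussed in this thread).
    CGROUP_MEMORY_LIMIT_POLICY = none

    # Alternative: have HTCondor apply the job's requested memory as a
    # hard cgroup limit, so overuse is contained to the offending job.
    # CGROUP_MEMORY_LIMIT_POLICY = hard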

I agree that the message is confusing, and I'll work on cleaning that up.

Thanks,

-greg