
Re: [HTCondor-users] Job OOM with CGROUP_MEMORY_LIMIT_POLICY=none



If a job is killed by the OOM killer, does HTCondor treat that particular node differently moving forward?

That is, will it try to avoid scheduling more jobs on that node?

Note: We found our servers (Linux / Ubuntu 20.04) much easier to manage using earlyoom: https://github.com/rfjakob/earlyoom
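For reference, a minimal earlyoom invocation looks something like the
following (the thresholds are illustrative; see the project README for the
exact flags your version supports):

    # Start killing the largest process once available RAM and swap both
    # drop below 10% (earlyoom's -m / -s thresholds).
    earlyoom -m 10 -s 10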


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, December 26, 2023 at 4:23 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Job OOM with CGROUP_MEMORY_LIMIT_POLICY=none


On 12/26/23 14:26, JM wrote:
> Based on the limited information in the worker logs, the chance is very
> high that the server did run out of both physical and swap memory.
> Multiple jobs with the same (high) memory usage pattern hit the server
> at the same time, and one of them was terminated by the OOM killer. I
> was confused by the startd log message, which referred to a job memory
> usage threshold and gave the impression that the job was killed by a
> policy. If I remember correctly, the more typical feedback from
> HTCondor is that the job was terminated with return value 137.


Hi JM:

This is something that has changed in HTCondor.  In the past, if cgroups
were not enabled and the OOM killer killed a job (because the system as a
whole was out of memory), the job could leave the queue by default,
because to HTCondor it simply looked as if the job had been killed with
signal 9, perhaps by something within the job itself.
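
(The "return value 137" JM remembers is the shell's way of reporting that
signal: 128 + 9 for SIGKILL.)  On versions with the old behavior, one way
to keep such jobs in the queue was to spell the retry policy out in the
submit file yourself; a minimal sketch, assuming the standard ExitBySignal
and ExitCode job attributes:

    # Only let the job leave the queue when it exits on its own with code 0.
    # If it was killed by a signal (e.g. SIGKILL from the kernel OOM killer),
    # on_exit_remove evaluates to False and the job goes back to idle to be
    # rescheduled elsewhere.
    on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)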

Our philosophy is that a job should not leave the queue when something
happens to it that is outside of its control.  For example, if it is
running on a worker node that gets rebooted, by default the job should
start again somewhere else; it is not the job's fault the node was
rebooted.  Likewise, if the OOM killer kills the process not because the
job is over its per-cgroup limit, but because the system as a whole is out
of memory, we want to treat that the same way.
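
If you want to confirm which policy an execute node is actually using, one
quick check (run on that node) is to ask the configuration directly:

    # "none" means HTCondor is not imposing a per-job cgroup memory limit,
    # so an OOM kill there reflects system-wide memory pressure rather than
    # a per-job limit.
    condor_config_val CGROUP_MEMORY_LIMIT_POLICY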

I agree that the message is confusing, and I'll work on cleaning that up.

Thanks,

-greg


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/