
Re: [HTCondor-users] Limiting memory used on the worker node with c-groups



Was just thinking about that, Greg! It's been a while.

JM - look at the log for your slot, with a name like /var/log/condor/StarterLog_*. It's clear why the job is being paused by the kernel, but the job should still be cleaned up by HTCondor.

Tom

On Thu, Apr 30, 2020 at 8:16 AM Gregory Thain <gthain@xxxxxxxxxxx> wrote:

On 4/30/20 4:27 AM, jean-michel.barbet@xxxxxxxxxxxxxxxxx wrote:
> On 4/30/20 6:19 AM, tpdownes@xxxxxxxxx wrote:
>> I do think your problem is as simple as Thomas' question: figuring
>> out why oom_control is set to disabled. These cgroup settings are
>> inherited hierarchically so it could be the htcondor group itself or
>> a cgroup above it. It could even be set system-wide.


HTCondor intentionally sets oom_kill_disable because the starter really
needs to know if the job was OOM killed, and treat the job differently
than if it just got a normal signal 9. We think it is very unfortunate
that the OOM killer kills with the usual signal 9, and not a custom
signal just for OOM -- we wouldn't need to do this if the OOM signal was
its own value. The starter also installs a handler to get notified when
the kernel oom-kills a process in the job. This lets the starter clean
up the job, and put the job on hold with an appropriate message if it
gets OOM killed. If we didn't do this, an OOM killed job would be
killed with signal 9, and probably leave the queue, as from condor's
perspective, it has exited of its own accord.
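
For anyone who wants to check what a given cgroup is actually set to, here is a minimal Python sketch that parses a cgroup v1 memory.oom_control file. It is not how the starter does it; the function names are made up for illustration, and the cgroup directory in the usage comment is only a guess at where a job's cgroup might live on your system. It assumes the file holds "name value" lines such as oom_kill_disable and under_oom.

```python
from pathlib import Path


def parse_oom_control(text):
    """Parse the 'name value' lines of a cgroup v1 memory.oom_control file."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name <integer>" lines; ignore anything else.
        if len(parts) == 2 and parts[1].isdigit():
            fields[parts[0]] = int(parts[1])
    return fields


def read_oom_control(cgroup_dir):
    """Read and parse memory.oom_control from a cgroup directory (path is an assumption)."""
    return parse_oom_control((Path(cgroup_dir) / "memory.oom_control").read_text())


def describe(fields):
    """Summarize the state in words (hypothetical helper, for illustration only)."""
    disabled = fields.get("oom_kill_disable") == 1
    if disabled and fields.get("under_oom") == 1:
        return "paused: at the memory limit with the OOM killer disabled"
    if disabled:
        return "OOM killer disabled, not currently under OOM"
    return "OOM killer enabled"


# Usage (the directory below is a placeholder, adjust to your own cgroup layout):
#   print(describe(read_oom_control("/sys/fs/cgroup/memory/htcondor")))
```

A cgroup that reports oom_kill_disable 1 together with under_oom 1 matches the "paused by the kernel" symptom discussed in this thread.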


-greg
