
Re: [HTCondor-users] Job OOM with CGROUP_MEMORY_LIMIT_POLICY=none



On 12/26/23 14:26, JM wrote:
Based on limited information from the worker logs, the chances are very high that the server did indeed run out of both physical and swap memory. Multiple jobs with the same (high) memory usage pattern hit the server at the same time, and one of them was terminated by the OOM killer. I was confused by the startd log message, which talked about the job's memory usage threshold. The message gave the impression that the job was killed by a policy. If I remember correctly, more typical feedback from HTCondor is that the job was terminated with return value 137.


Hi JM:

This is something that has changed in HTCondor. In the past, if cgroups were not enabled and the OOM killer killed a job (because the system as a whole was out of memory), the job could leave the queue by default, because to HTCondor it simply looked as if the job had been killed with signal 9, perhaps by something inside the job itself.
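(If you are stuck on a version with that older behavior, a rough submit-file workaround along these lines should keep a signal-killed job in the queue; ExitBySignal is the standard job attribute, and the 137 you remember is just 128 + 9, the way most shells report a SIGKILL. Untested sketch, adjust to taste:

    # Keep the job in the queue for another run if it exited on a signal
    # (e.g. SIGKILL from the kernel OOM killer) rather than exiting normally.
    on_exit_remove = (ExitBySignal == False)

)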

Our philosophy is that a job should not leave the queue because of something that happened to it outside of its control. For example, if it is running on a worker node that gets rebooted, by default the job should start again somewhere else; it is not the job's fault the node was rebooted. Likewise, if the OOM killer kills the job not because the job is over its per-cgroup limit, but because the system as a whole is out of memory, we want to treat that the same way.
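For anyone following along, the knob in question is the one in the subject line, set in the execute node's condor_config. A minimal sketch of the two ends of the spectrum (the "hard" alternative is from memory -- check the manual for the values your version actually supports):

    # No per-job cgroup memory limit; only the kernel's system-wide OOM
    # killer can step in (the configuration discussed in this thread).
    CGROUP_MEMORY_LIMIT_POLICY = none

    # Alternative: have HTCondor apply the job's requested memory as a
    # hard cgroup limit, so overuse is contained to the offending job.
    # CGROUP_MEMORY_LIMIT_POLICY = hard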

I agree that the message is confusing, and I'll work on cleaning that up.

Thanks,

-greg