
Re: [HTCondor-users] Job OOM with CGROUP_MEMORY_LIMIT_POLICY=none

Based on the limited information in the worker logs, the chance is very high that the server did run out of physical and swap memory. Multiple jobs with the same (high) memory usage pattern hit the server at the same time, and one of them was terminated by the OOM killer. I was confused by the startd log message about the job memory usage threshold: it gave the impression that the job was killed by a policy. If I remember correctly, the more typical feedback from HTCondor in this case is that the job terminated with return value 137.
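For reference, 137 is 128 + 9, i.e. the process died from SIGKILL, which is the signal the kernel OOM killer sends. A quick way to confirm a kernel-level OOM event on a worker is to search the kernel log (a sketch, assuming a systemd-based worker image and root access; adjust for your distribution):

    # search the kernel journal for OOM-killer activity
    journalctl -k --since "2023-12-26" | grep -iE "out of memory|oom-killer"
    # or, on systems without a persistent journal:
    dmesg -T | grep -iE "out of memory|killed process"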

Thank you for looking into this.

On Tue, Dec 26, 2023 at 1:24 PM John M Knoeller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
The hold message below can only happen if there was an out of memory event for the job. But the job reaper always checks to see if there was an out of memory event, regardless of how the job exited.

Is the question whether it is possible, when CGROUP_MEMORY_LIMIT_POLICY is None, for a job to run to completion and exit successfully, or be evicted for a different reason, and then be held for exceeding memory? From my reading of the code, this is possible.

I would have to ask our CGROUP expert to be sure and he is out on vacation today.

I do believe that the reported peak memory usage in the hold message below is correct and that the value came from the CGROUP.

If you want to know whether the job stopped running because of the out of memory event or for some other reason, I would suggest you look at the StarterLog.slot1_1 and the StartLog on the execute node, to see if the job exited on its own or was killed, and if killed, for what reason.
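For example (a sketch; the actual log directory comes from the LOG knob, so condor_config_val can locate it on the node):

    # on the execute node
    LOGDIR=$(condor_config_val LOG)
    grep -iE "oom|memory" "$LOGDIR/StarterLog.slot1_1"
    grep -iE "kill|evict" "$LOGDIR/StartLog"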

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of JM <jm@xxxxxxxxxxxxxxxxxxxx>
Sent: Tuesday, December 26, 2023 10:36 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Job OOM with CGROUP_MEMORY_LIMIT_POLICY=none

HTCondor community,

In an HTCondor 23 LTS execute-node setup, all worker servers are identical and CGROUP_MEMORY_LIMIT_POLICY is set to none. Each worker server has a sizable swap partition. Jobs were randomly held for exceeding memory limits.
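For reference, the policy is a single line in each worker's condor_config, and a running worker can be checked in place (a sketch):

    # condor_config fragment on each execute node
    CGROUP_MEMORY_LIMIT_POLICY = none

    # verify the value a worker actually uses
    condor_config_val CGROUP_MEMORY_LIMIT_POLICY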

Job 4882.11349 going into Hold state (code 34,0): Error from slot1_1@workerX: Job has gone over memory limit of 128 megabytes. Peak usage: 4346 megabytes.
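The hold reason codes behind such a message can be pulled directly from the job ad (a sketch, using the job ID above; condor_history takes the same options once the job has left the queue):

    condor_q 4882.11349 -af:j HoldReasonCode HoldReasonSubCode HoldReason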

The worker server log showed that the Linux OOM killer terminated the job. Worker nodes are preemptible instances in a public cloud, so it is not easy (but doable) to collect actual memory usage for each worker.

Can someone please advise whether the OOM occurred because server memory (physical + swap) was used up, or because another HTCondor knob killed the job for using more memory than it requested?

Thank you and happy holidays!

JM.