
Re: [HTCondor-users] Memory accounting issue with cgroups



Thanks Greg,

I've been experimenting a bit, using the information in the cgroup v2 overview at https://facebookmicrosites.github.io/cgroup2/docs/overview.html.

The "memory.max" setting in 10.6 worked for one job (keeping its RSS within request_mem), but for another type of job it almost immediately OOM-killed all instances. From the condor viewpoint this was correct (in principle): the instances were neatly put on Hold. (The only minor inconvenience is that the system sends an email for every OOM.)
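For the record, this is roughly how I've been peeking at a job's usage versus its limit. It's just a sketch in Python (any way of reading the cgroup files will do), and the cgroup path is a made-up example; substitute whatever path the starter actually created for the job on your machine:

    # Sketch: compare a job cgroup's current usage against its memory.max.
    from pathlib import Path

    cg = Path("/sys/fs/cgroup/htcondor/job_123")         # hypothetical path

    max_raw = (cg / "memory.max").read_text().strip()    # "max" or bytes
    current = int((cg / "memory.current").read_text())   # bytes in use

    limit = None if max_raw == "max" else int(max_raw)
    if limit:
        print(f"usage: {current / limit:.1%} of memory.max ({limit} bytes)")
    else:
        print(f"usage: {current} bytes, no memory.max limit set")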

However, when I manually set memory.high to 50% of memory.max, the offending jobs crept up to about 110% of that level and kept running. The memory.pressure (see the doc at https://facebookmicrosites.github.io/cgroup2/docs/pressure-metrics.html) then slowly rose to 98.5%, supposedly meaning that the job was spending 98.5% of its time stalled waiting for memory pages to be swapped back in.
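In case anyone wants to watch the same numbers: memory.pressure is two PSI lines, "some" and "full", each with 10s/60s/300s averages. A minimal parsing sketch, with the same caveat that the cgroup path below is a made-up example:

    # Sketch: print the PSI averages from a cgroup's memory.pressure file.
    # Lines look like: "some avg10=98.50 avg60=97.20 avg300=90.01 total=..."
    from pathlib import Path

    cg = Path("/sys/fs/cgroup/htcondor/job_123")          # hypothetical path

    for line in (cg / "memory.pressure").read_text().splitlines():
        kind, *fields = line.split()
        stats = dict(f.split("=") for f in fields)
        print(f"{kind:>4}: avg10={stats['avg10']}%  avg60={stats['avg60']}%")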

This is on a machine with no disk swap and plenty of spare memory above memory.max (256G requested out of 768G), so I suppose no actual swapping was taking place, just physical pages being marked "free" and then "taken" again. (Or something like that; I'm no expert in Linux kernel memory management.)

Increasing memory.high to 90% of memory.max made consumption creep up again, levelling just below memory.max, with no OOMs. Reducing memory.high also worked, and consumption would go down again. Very neat.

Unfortunately, I didn't try lifting memory.high while the job was at memory.max, to see whether memory.max alone would pressure the job before OOM-killing it (provided it approached memory.max slowly). The overview's description of memory.max appears to suggest this: "if [memory consumption] reaches this limit **and can't be reduced** [then OOM ensues]".

I'm in the dark about what "and can't be reduced" means. The OOM came almost immediately after the job started, whereas with memory.high set at 90% of max, the job ran to completion.

Either way, it would seem that setting memory.high to ~90% of memory.max would be appropriate. I haven't yet thought about memory.min/low.
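For completeness, this is the sort of thing I did by hand while experimenting. A sketch only: it needs root, the cgroup path is again a made-up example, and ideally condor would set this itself:

    # Sketch: set memory.high to 90% of the cgroup's existing memory.max.
    from pathlib import Path

    cg = Path("/sys/fs/cgroup/htcondor/job_123")          # hypothetical path

    max_raw = (cg / "memory.max").read_text().strip()
    if max_raw != "max":                      # only if a hard limit is set
        high = int(int(max_raw) * 0.9)
        (cg / "memory.high").write_text(f"{high}\n")
        print(f"memory.high set to {high} bytes (90% of memory.max)")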

Cheers
Marco



On 20/05/2023 23:43, Greg Thain via HTCondor-users wrote:

On 5/20/23 5:03 AM, Marco van Zwetselaar wrote:
I guess my mental picture of memory.high as a yellow card, and memory.max as the red card was incorrect. It's more like rugby: the referee's stare is enough. :-)

The kernel docs are a little vague about the difference between "high" and "max", saying that usually a cgroup gets OOM killed when it hits "high", but in some cases can go all the way up to "max" before the OOM arrives. It isn't clear to me if this means maybe a page or two more memory, in order to deliver the signal, or potentially some unbounded amount of memory. Given that, I chose to have condor only set "max".

If you will excuse me stretching your metaphor, "high" is the moment the red card goes into the air, but "max" is when the guilty party actually leaves the pitch. "memory.min" is like our youth leagues here, where there is an unwritten understanding that if one team can't field some minimum number of players (seven?), the opposing team (if able) will loan them some players in order that the kids can still get a game in (despite a forfeit on the books). And I have no good idea right now what htcondor should set "memory.low" to.



On a side note to the Condor devs: my config has 'DISABLE_SWAP_FOR_JOB = true'. Shouldn't that translate to 'memory.swap.max = 0' on the cgroup (currently shows "max")?


The cgroup v2 code path doesn't set this. I'll write a PR to fix this.
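In the meantime the same effect can be had by hand by writing 0 into the job cgroup's memory.swap.max. A rough sketch (the cgroup path is a made-up example, and this needs root):

    # Sketch: forbid swap for a cgroup by setting memory.swap.max to 0.
    from pathlib import Path

    cg = Path("/sys/fs/cgroup/htcondor/job_123")          # hypothetical path
    (cg / "memory.swap.max").write_text("0\n")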


Thanks,

-greg


