
[HTCondor-users] Memory accounting issue with cgroups



The following issue started occurring with one of the 10.x releases (I'm not certain which, but it is still present in 10.4.3), installed from .debs on nodes running Ubuntu 22.04.

My config has long had "CGROUP_MEMORY_LIMIT_POLICY = hard" and "use POLICY : Hold_If_Memory_Exceeded", and jobs were correctly put on hold when they exceeded their request_memory.
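
For reference, the relevant lines in the node's configuration are:

    # enforce request_memory through the job's cgroup rather than only monitoring it
    CGROUP_MEMORY_LIMIT_POLICY = hard
    # policy metaknob: put jobs on hold when their memory usage exceeds request_memory
    use POLICY : Hold_If_Memory_Exceeded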

Now, with the same config and the same jobs, they eventually all go on hold with "memory usage exceeded request_memory", while their actual consumption (USS, PSS and RSS as reported by smem) never exceeds request_memory.

Their job log shows image size updates every 5 minutes, with the reported RSS steadily increasing by about 1 GB per 5 minutes. Once this exceeds request_memory, they (correctly) go on hold - except that their actual RSS never went beyond 2 GB. When I remove the 'use POLICY' line, the jobs keep running and the reported RSS keeps growing without bound.

Looking in the cgroup of the job's (dynamic) slot, it seems that Condor takes 'memory.current' to be the job's RSS. This would be correct if the job were under (severe) memory pressure, but (and this seems to be the crux of the issue) both 'memory.high' and 'memory.max' are set to "max" (and the machine has loads of memory), so memory.current presumably also counts page cache that the kernel has no reason to reclaim. The Condor docs suggest that memory.high and memory.max should be set to 90% and 100% of request_memory, respectively.
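
For illustration, this is what I look at inside the slot's cgroup (the slot directory name below is a placeholder, it differs per dynamic slot; on this node the HTCondor cgroup tree lives under /sys/fs/cgroup/htcondor):

    cd /sys/fs/cgroup/htcondor/<slot-cgroup>
    cat memory.current    # the value Condor appears to report as the job's RSS
    cat memory.high       # reads "max", though the docs suggest ~90% of request_memory
    cat memory.max        # reads "max", though the docs suggest 100% of request_memory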

In fact, when I "cat memory.current | sudo tee memory.high", then memory.current and the RSS reported by Condor stay at that same level throughout, which presumably is precisely how this was supposed to work. (Very elegant mechanism!)
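
Concretely, that manual experiment is just this, run inside the same slot cgroup directory as above:

    # clamp memory.high to the current usage, so the kernel applies reclaim pressure
    # (dropping page cache first) whenever the cgroup grows beyond that level
    cat memory.current | sudo tee memory.high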

Not sure where to look for diagnostics, but I see one ominous message in the slot's StarterLog: "Error while locating memcg controller for starter: 50014 Cgroup not initialized". This is the tail of that log:

 05/18/23 09:50:58 (pid:1120439) Starting a VANILLA universe job with ID: 4574.0
 05/18/23 09:50:58 (pid:1120439) Checking to see if htcondor is a writeable cgroup
 05/18/23 09:50:58 (pid:1120439) Cgroup /htcondor is useable
 05/18/23 09:50:58 (pid:1120439) Current mount, /tmp, is shared.
 05/18/23 09:50:58 (pid:1120439) Current mount, /, is shared.
 05/18/23 09:50:58 (pid:1120439) IWD: /var/lib/condor/execute/dir_1120439
 05/18/23 09:50:58 (pid:1120439) Output file: /var/lib/condor/execute/dir_1120439/_condor_stdout
 05/18/23 09:50:58 (pid:1120439) Error file: /var/lib/condor/execute/dir_1120439/_condor_stderr
 05/18/23 09:50:58 (pid:1120439) Renice expr "0" evaluated to 0
 05/18/23 09:50:58 (pid:1120439) Running job as user zwets
 05/18/23 09:50:58 (pid:1120439) About to exec [... omitted ...]
 05/18/23 09:50:58 (pid:1120439) Create_Process succeeded, pid=1120441
 05/18/23 09:50:58 (pid:1120439) Error while locating memcg controller for starter: 50014 Cgroup not initialized
 05/18/23 09:51:06 (pid:1120439) Failed to open '.update.ad' to read update ad: No such file or directory (2).
 05/18/23 09:51:06 (pid:1120439) Failed to open '.update.ad' to read update ad: No such file or directory (2).


Any suggestions on where to look or what could be the issue here?

Kind regards,
Marco

--
KCRI
Marco van Zwetselaar
Bioinformatician
Kilimanjaro Clinical Research Institute
P.O. Box 2236 | Moshi, Kilimanjaro | Tanzania