
Re: [HTCondor-users] ResidentSetSize report on Almalinux 9 and condor 10.5.0//23.0.0//23.0.1



Hi all,

I was just about to report this and noticed instead this thread existed.

I think the issue is that memory.current (and the high watermark memory.peak) includes file cache, which obviously is not RSS.

Here's an example from a job:

[root@b9s05p0611 htcondor]# cat condor_pool_condor_slot1_23@xxxxxxxxxxxxxxxxxx/cgroup.procs | xargs ps -o pid,pgid,rss,vsz
PID PGID RSS VSZ
246130 246130 960 2504
246212 246130 3864 16356
246655 246130 1988344 2177908
[root@b9s05p0611 htcondor]# cat condor_pool_condor_slot1_23@xxxxxxxxxxxxxxxxxx/memory.peak
13427261440
[root@b9s05p0611 htcondor]# cat condor_pool_condor_slot1_23@xxxxxxxxxxxxxxxxxx/memory.current
13426835456


[root@b9s05p0611 dir_246126]# grep "^file " /sys/fs/cgroup/htcondor/condor_pool_condor_slot1_25@xxxxxxxxxxxxxxxxxx/memory.stat
file 13073104896

What was happening here is that a job was writing a fairly large file, and thus there was a bunch of file cache.
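A quick way to get the RSS-like figure is to read the `anon` field of memory.stat rather than memory.current; a minimal sketch (the cgroup directory is whatever HTCondor created for the slot, passed as an argument):

```shell
# Print the "anon" bytes from a cgroup v2 memory.stat, i.e. resident
# anonymous memory, without the page cache that memory.current includes.
# $1 is the cgroup directory, e.g. a slot cgroup under /sys/fs/cgroup/htcondor/.
cgroup_anon() {
    awk '$1 == "anon" { print $2 }' "$1/memory.stat"
}
```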

We also had some ancient periodic remove that removed jobs whose ResidentSetSize exceeded 10x RequestMemory, so a 2 GB job was being removed when its output file reached 18 GB :( At the very least, having ResidentSetSize report the value of memory.peak seems like false advertising :D
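For reference, the expression was along these lines (a sketch, not our exact config; the 10x factor and the MiB-to-KiB conversion for RequestMemory are assumptions):

```
# ResidentSetSize is in KiB, RequestMemory in MiB
SYSTEM_PERIODIC_REMOVE = ResidentSetSize > 10 * 1024 * RequestMemory
```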

I guess that with memory.max / memory.high, the first thing that happens as you approach the limit is that things like file cache get thrown away, but otherwise there's not really much the user can do.
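If you want to watch the cache get reclaimed as a job approaches its limit, something like this reports how much of memory.current is file cache (a sketch; it takes the cgroup directory as an argument):

```shell
# Print the integer percentage of a cgroup's memory.current that is
# page cache. $1 is the cgroup directory (e.g. the slot cgroup).
cache_percent() {
    cur=$(cat "$1/memory.current")
    file=$(awk '$1 == "file" { print $2 }' "$1/memory.stat")
    echo $(( file * 100 / cur ))
}
```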

Anyhow, I think that explains the behaviour?

cheers,
ben

> On 21 Feb 2024, at 17:56, Carles Acosta <cacosta@xxxxxx> wrote:
> 
> Hi Thomas, Greg,
> 
> Thank you for your responses. 
> 
> I didn't check memory.peak before, and it is comparable to what is reported as MemoryUsage on AlmaLinux 9, but not on CentOS 7 (I use a different example for CentOS 7 because the other one has finished).
> 
> Alma9
> [root@td807 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_10\@td807.pic.es/memory.current
> 2426515456
> [root@td807 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_10\@td807.pic.es/memory.peak
> 9710202880
> 
> # condor_q 22018463.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost 
> 8662 9766 3500 slot1_10@xxxxxxxxxxxx
> 
> CentOS 7
> [root@tds410 ~]# cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_24\@tds410.pic.es/memory.usage_in_bytes
> 3476918272
> [root@tds410 ~]# cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_24\@tds410.pic.es/memory.max_usage_in_bytes
> 7336988672
> 
> # condor_q 19602512.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost
> 3312 3418 3500 slot1_24@xxxxxxxxxxxxx
> 
> Greg, I'm sending you the StarterLog privately.
> 
> Thank you again.
> 
> Carles
> 
> 
> On Wed, 21 Feb 2024 at 16:44, Greg Thain <gthain@xxxxxxxxxxx> wrote:
> On 2/21/24 05:10, Carles Acosta wrote:
>> Hi,
>> 
>> I just want to comment that I continue to observe the memory issue with HTCondor and EL9. I cannot rely on adding MEMORY_EXCEED conditions in HTCondor/AlmaLinux9. 
>> 
>> Let me show you another example with Atlas jobs, both top-xaod executions.
>> 
>> Alma9 WN 
>> * HTCondor version 23.0.3
>> * RES memory according to top: 3 GB
>> * ResidentSetSize according to condor:
>> 
>> # condor_q 22018463.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost
>> 8662 9766 3500 slot1_10@xxxxxxxxxxxx
> 
> 
> Hi Carles -- this is really odd.  Can you send me (directly) the StarterLog.slotXXX that corresponds to this job?  Also, I assume that the cgroup memory.peak is roughly the same as memory.current?
> 
> -greg
> 
> 
> 
> 
> -- 
> Carles Acosta i Silva
> PIC (Port d'Informació Científica)
> Campus UAB, Edifici D
> E-08193 Bellaterra, Barcelona
> Tel: +34 93 581 33 08
> Fax: +34 93 581 41 10
> http://www.pic.es Avís - Aviso - Legal Notice:  http://legal.ifae.es
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/