
Re: [HTCondor-users] ResidentSetSize report on Almalinux 9 and condor 10.5.0//23.0.0//23.0.1



Hi Ben, all,

Thank you very much, your debugging explains what was happening to us. In our case, we had a periodic hold expression (roughly like the sketch below) for jobs that exceeded 8 times their RequestMemory, and we had to remove it on AlmaLinux 9.
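
As a minimal sketch of that kind of knob (not our exact configuration; it assumes ResidentSetSize is in KiB and RequestMemory in MiB, hence the extra factor of 1024):

SYSTEM_PERIODIC_HOLD = (ResidentSetSize =!= undefined) && (ResidentSetSize > 8 * 1024 * RequestMemory)
SYSTEM_PERIODIC_HOLD_REASON = "Job exceeded 8 times RequestMemory"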

For an Atlas job requesting 2 GB:

# condor_q 22065498 -af Owner RequestCpus RequestMemory ResidentSetSize_RAW/1024 RemoteHost
atprd015 1 2000 32485 slot1_41@xxxxxxxxxxxx

[root@td806 ~]# grep "^file " /sys/fs/cgroup/htcondor/condor_home_execute_slot1_41\@td806.pic.es/memory.stat
file 33750253568

[root@td806 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_41\@td806.pic.es/memory.peak
34820653056
[root@td806 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_41\@td806.pic.es/memory.current
34827796480

[root@td806 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_41\@td806.pic.es/cgroup.procs | xargs ps -o pid,pgid,rss,vsz
    PID    PGID    RSS     VSZ
1767993 1767993    940    2500
1768002 1767993   4356    5244
1774675 1774675  93124 1569040
1774676 1767993   3076    5244
1829967 1829967   3564    4820
1829983 1829983   1988    2956
1830072 1829967   4360    5588
1831333 1829967   7356 1205220
1831445 1829967  10224 1145156
1831488 1829967   5692    7956
1831820 1829967   3280   12232
1837101 1829967   2952   12232
1837102 1829967  16428  123384
1837613 1829967   2680   11884
1837616 1829967 333936 1251280
2123398 1767993   1044    2660
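
For reference, the RSS of the processes in the cgroup adds up to only about 0.5 GB, while memory.current is ~34.8 GB; the ~33.75 GB "file" line above accounts for essentially all of the difference, and the ~32 GB ResidentSetSize reported by condor (well over 8 x 2 GB) is what was tripping our hold. A quick way to get that RSS total, same slot path as above:

[root@td806 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_41\@td806.pic.es/cgroup.procs | xargs ps -o rss= | awk '{sum+=$1} END {print sum " KiB"}'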

Cheers,

Carles

On Tue, 27 Feb 2024 at 19:58, Ben Jones <ben.dylan.jones@xxxxxxxxx> wrote:
Hi all,

I was just about to report this and noticed instead this thread existed.

I think the issue is that memory.current (and the high watermark memory.peak) includes file cache, which obviously is not RSS.

Here's an example from a job:

[root@b9s05p0611 htcondor]# cat condor_pool_condor_slot1_23@xxxxxxxxxxxxxxxxxx/cgroup.procs | xargs ps -o pid,pgid,rss,vsz
PID PGID RSS VSZ
246130 246130 960 2504
246212 246130 3864 16356
246655 246130 1988344 2177908
[root@b9s05p0611 htcondor]# cat condor_pool_condor_slot1_23@xxxxxxxxxxxxxxxxxx/memory.peak
13427261440
[root@b9s05p0611 htcondor]# cat condor_pool_condor_slot1_23@xxxxxxxxxxxxxxxxxx/memory.current
13426835456


[root@b9s05p0611 dir_246126]# grep "^file " /sys/fs/cgroup/htcondor/condor_pool_condor_slot1_25@xxxxxxxxxxxxxxxxxx/memory.stat
file 13073104896

What was happening here is that a job was writing a fairly large file, and thus there was a bunch of file cache.
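
For completeness, the anon/file split can be read straight out of memory.stat (a rough check; memory.current is roughly anon + file plus some kernel-side memory, and the anon figure should be close to what ps reports as RSS):

[root@b9s05p0611 htcondor]# awk '$1 == "anon" || $1 == "file"' condor_pool_condor_slot1_23@xxxxxxxxxxxxxxxxxx/memory.stat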

We also had an ancient periodic remove expression that compared ResidentSetSize against 10x RequestMemory, so a 2 GB job was being removed when its output file got to ~18 GB :( At the very least, having ResidentSetSize report the value of memory.peak would seem to be false advertising :D

I guess that with memory.max / memory.high the first thing that happens as you approach the limit is that things like file cache get thrown away, but otherwise there's not really much the user can do.
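
(If it helps anyone poking at this, the limits actually applied to the slot cgroup can be checked directly; whether they are set at all depends on how the starter is configured, e.g. CGROUP_MEMORY_LIMIT_POLICY. Values are bytes, or "max" if no limit is set:)

[root@b9s05p0611 htcondor]# cat condor_pool_condor_slot1_23@xxxxxxxxxxxxxxxxxx/memory.high condor_pool_condor_slot1_23@xxxxxxxxxxxxxxxxxx/memory.max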

Anyhow, I think that explains the behaviour?

cheers,
ben

> On 21 Feb 2024, at 17:56, Carles Acosta <cacosta@xxxxxx> wrote:
>
> Hi Thomas, Greg,
>
> Thank you for your responses.
>
> I hadn't checked memory.peak before; it is comparable to what is reported as MemoryUsage on AlmaLinux 9, but not on CentOS 7 (I use a different example for CentOS 7 because the other one has finished).
>
> Alma9
> [root@td807 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_10\@td807.pic.es/memory.current
> 2426515456
> [root@td807 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_10\@td807.pic.es/memory.peak
> 9710202880
>
> # condor_q 22018463.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost
> 8662 9766 3500 slot1_10@xxxxxxxxxxxx
>
> CentOS 7
> [root@tds410 ~]# cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_24\@tds410.pic.es/memory.usage_in_bytes
> 3476918272
> [root@tds410 ~]# cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_24\@tds410.pic.es/memory.max_usage_in_bytes
> 7336988672
>
> # condor_q 19602512.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost
> 3312 3418 3500 slot1_24@xxxxxxxxxxxxx
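>
> (For a like-for-like check on cgroup v1, the page-cache share is the "cache" line of memory.stat in the same directory, e.g.:)
> [root@tds410 ~]# grep "^cache " /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_24\@tds410.pic.es/memory.stat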
>
> Greg, I'm sending you the StarterLog privately.
>
> Thank you again.
>
> Carles
>
>
> On Wed, 21 Feb 2024 at 16:44, Greg Thain <gthain@xxxxxxxxxxx> wrote:
> On 2/21/24 05:10, Carles Acosta wrote:
>> Hi,
>>
>> I just want to comment that I continue to observe the memory issue with HTCondor and EL9. I cannot rely on adding MEMORY_EXCEED conditions in HTCondor/AlmaLinux9.
>>
>> Let me show you another example with Atlas jobs, both running a top-xaod execution.
>>
>> Alma9 WN
>> * HTCondor version 23.0.3
>> * RES memory according to top: 3 GB
>> * ResidentSetSize according to condor:
>>
>> # condor_q 22018463.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost
>> 8662 9766 3500 slot1_10@xxxxxxxxxxxx
>
>
> Hi Carles -- this is really odd. Can you send me (directly) the StarterLog.slotXXX that corresponds to this job? Also, I assume that the cgroup memory.peak is roughly the same as memory.current?
>
> -greg
>
>
>
>
> --
> Carles Acosta i Silva
> PIC (Port d'Informació Científica)
> Campus UAB, Edifici D
> E-08193 Bellaterra, Barcelona
> Tel: +34 93 581 33 08
> Fax: +34 93 581 41 10
> http://www.pic.es Avís - Aviso - Legal Notice: http://legal.ifae.es



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es