
Re: [HTCondor-users] ResidentSetSize report on Almalinux 9 and condor 10.5.0//23.0.0//23.0.1



Hi,

I just want to comment that I continue to observe the memory issue with HTCondor on EL9. I cannot rely on adding MEMORY_EXCEED conditions for HTCondor on AlmaLinux 9.

Let me show you another example with ATLAS jobs, both running a top-xaod execution (a short snippet to put the condor and cgroup numbers side by side follows the two examples below).

Alma9 WN
* HTCondor version 23.0.3
* RES memory according to top: 3 GB
* ResidentSetSize according to condor:

# condor_q 22018463.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost
8662 9766 3500 slot1_10@xxxxxxxxxxxx

* Cgroup memory:
# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_10\@td807.pic.es/memory.current
3279331328

CentOS7 WN
* HTCondor version 9.0.17
* RES memory according to top: 3.2 GB
* ResidentSetSize according to condor:

# condor_q 19600433.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost
3329 3418 3500 slot1_10@xxxxxxxxxxxxx

* Cgroup memory:
# cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_10\@tds410.pic.es/memory.usage_in_bytes
3496230912
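For reference, this is roughly how I collect the two numbers for a given job (the job ids and cgroup paths are the ones from the examples above; ResidentSetSize_RAW is in KiB, the cgroup files are in bytes):

# On the CE: what condor reports (ResidentSetSize_RAW/1024, MemoryUsage and RequestMemory are all MiB)
condor_q 22018463.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost

# On the Alma9 WN (cgroup v2): memory.current, bytes -> MiB (~3127 MiB here, vs 8662 from condor)
echo $(( $(cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_10@td807.pic.es/memory.current) / 1024 / 1024 ))

# On the CentOS7 WN (cgroup v1): memory.usage_in_bytes, bytes -> MiB (~3334 MiB, vs 3329 from condor)
echo $(( $(cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_10@tds410.pic.es/memory.usage_in_bytes) / 1024 / 1024 ))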

So, again, the ResidentSetSize and MemoryUsage exceed what we can see in top or in the cgroup memory report. I do not know if we are the only ones seeing this problem...

Cheers,

Carles

On Thu, 16 Nov 2023 at 13:45, Carles Acosta <cacosta@xxxxxx> wrote:
Hi again,

Although the problem is most prominent for the LHCb experiment, I have been investigating whether it affects other projects as well, for example CMS. I checked all the jobs where the ResidentSetSize is over 1.5 times the RequestMemory: 18 of 186 jobs. Of these 18 jobs, 17 are running on AlmaLinux 9 WNs. The check was roughly the query sketched below; one example follows after it.
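A minimal sketch of that check, assuming the standard job ClassAd attributes (ResidentSetSize is in KiB, RequestMemory in MiB; the 1.5 factor is just the one described above):

# Running jobs whose ResidentSetSize (KiB) exceeds 1.5 times RequestMemory (MiB)
condor_q -allusers -constraint 'JobStatus == 2 && ResidentSetSize/1024 > 1.5 * RequestMemory' \
         -af:h ClusterId ProcId ResidentSetSize_RAW/1024 RequestMemory RemoteHost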

[root@td810 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_11@xxxxxxxxxxxx/memory.current
25772843008

[root@ce13 ~]# condor_q 21238903 -af ResidentSetSize ResidentSetSize_RAW
32500000 31951976
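Converting to a common unit makes the gap explicit (the cgroup value is in bytes, ResidentSetSize_RAW in KiB):

echo $(( 25772843008 / 1024 / 1024 ))   # cgroup memory.current: ~24578 MiB
echo $(( 31951976 / 1024 ))             # ResidentSetSize_RAW:   ~31203 MiB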

As we only put on hold the jobs that exceed 2 times the RequestMemory, these CMS jobs can still run. I'm not sure this is a huge issue for CMS, but in general it seems that the ResidentSetSize values reported on AlmaLinux 9 WNs are higher.
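For reference, the hold condition is along these lines; this is only a sketch assuming a SYSTEM_PERIODIC_HOLD expression on the schedd, not our exact configuration:

SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && (MemoryUsage > 2 * RequestMemory)
SYSTEM_PERIODIC_HOLD_REASON = "Memory usage exceeds 2 x RequestMemory"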

Cheers,

Carles

On Fri, 10 Nov 2023 at 06:43, Carles Acosta <cacosta@xxxxxx> wrote:
Hi Greg,

The old job has already finished, but here is another job (with exactly the same ResidentSetSize as before...):

[root@ce13 ~]# condor_q 21206191.0 -af ClusterId ProcId Owner RemoteHost ResidentSetSize/1024 ResidentSetSize_RAW/1024
21206191 0 lhpilot001 slot1_54@xxxxxxxxxxxx 12207 11011

[root@td813 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_54\@td813.pic.es/memory.current
8253329408

For a Centos7 job:

[root@ce13 ~]# condor_q 21205088 -af ClusterId ProcId Owner RemoteHost ResidentSetSize/1024 ResidentSetSize_RAW/1024
21205088 0 lhpilot001 slot1_33@xxxxxxxxxxxxx 7324 6608

[root@tds408 condor]# cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_33\@tds408.pic.es/memory.usage_in_bytes
7089156096
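In the same units (the cgroup files are bytes, the condor_q columns above are MiB):

echo $(( 8253329408 / 1024 / 1024 ))   # Alma9 cgroup:   ~7870 MiB vs 11011 MiB from condor
echo $(( 7089156096 / 1024 / 1024 ))   # CentOS7 cgroup: ~6760 MiB vs  6608 MiB from condor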

Cheers,

Carles



On Thu, 9 Nov 2023 at 19:08, Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

[root@ce13 ~]# condor_q 21200475 21200476 -af ClusterId ProcId Owner RemoteHost ResidentSetSize/1024
21200475 0 lhpilot001 slot1_43@xxxxxxxxxxxx 12207
21200476 0 lhpilot001 slot1_51@xxxxxxxxxxxx 1708


Hi Carles:

I'm curious what value the cgroup is reporting for this job. Can you tell us the contents of the "memory.current" file for the cgroup that job is in?
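Something along these lines should find it on a cgroup v2 host; the exact path layout depends on how the startd was started, so treat this as a sketch:

# Locate the slot's cgroup by name and print memory.current (in bytes)
find /sys/fs/cgroup -type d -name '*slot1_43*' | while read d; do
    echo "$d: $(cat "$d/memory.current" 2>/dev/null)"
done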

And the errors about the missing "memory.peak" files are ok. I'll try to change the error message to indicate this better.


-greg







--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es