
Re: [HTCondor-users] ResidentSetSize report on Almalinux 9 and condor 10.5.0//23.0.0//23.0.1



Hi again,

Although the problem is most prominent for the LHCb experiment, I have been checking whether it affects other projects as well, for example CMS. I looked at all the jobs whose ResidentSetSize is over 1.5 times the RequestedMemory: 18 of 186 jobs. Of these 18 jobs, 17 are running on AlmaLinux 9 WNs. One example:

[root@td810 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_11@xxxxxxxxxxxx/memory.current
25772843008

[root@ce13 ~]# condor_q 21238903 -af ResidentSetSize ResidentSetSize_RAW
32500000 31951976
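
To put the two numbers above on the same scale, here is a minimal sketch of the unit conversion. It assumes that ResidentSetSize_RAW is expressed in KiB and that memory.current is in bytes; the constant and function names are only illustrative. The two values are sampled at different moments, so the ratio is indicative rather than exact.

```python
# Values copied from the transcript above (CMS job 21238903).
CGROUP_MEMORY_CURRENT_BYTES = 25772843008  # memory.current, in bytes
RESIDENT_SET_SIZE_RAW_KIB = 31951976       # ResidentSetSize_RAW, assumed to be KiB


def bytes_to_kib(n_bytes: int) -> float:
    """Convert a byte count to KiB."""
    return n_bytes / 1024


def rss_to_cgroup_ratio(rss_kib: float, cgroup_bytes: int) -> float:
    """Ratio between the reported RSS and the cgroup usage, both taken in KiB."""
    return rss_kib / bytes_to_kib(cgroup_bytes)


ratio = rss_to_cgroup_ratio(RESIDENT_SET_SIZE_RAW_KIB, CGROUP_MEMORY_CURRENT_BYTES)

cgroup_gib = CGROUP_MEMORY_CURRENT_BYTES / 1024**3
rss_gib = RESIDENT_SET_SIZE_RAW_KIB / 1024**2
print(f"cgroup memory.current: {cgroup_gib:.2f} GiB")
print(f"ResidentSetSize_RAW:   {rss_gib:.2f} GiB")
print(f"RSS_RAW / cgroup:      {ratio:.2f}")
```

Under these assumptions the raw ResidentSetSize comes out roughly a quarter larger than what the cgroup reports for the same slot.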

As we only put on hold the jobs that exceed 2 times the RequestedMemory, the CMS jobs can still run. I don't think this is a huge issue for CMS, but in general it seems that the ResidentSetSize values reported on AlmaLinux 9 WNs are higher.
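
The two thresholds mentioned in this message (1.5 times for the check, 2 times for the hold) can be sketched as follows. This is only an illustration, not our actual hold policy: it assumes ResidentSetSize is in KiB and the requested memory in MiB, and the requested memory value in the example is hypothetical because it does not appear in the output above.

```python
FLAG_FACTOR = 1.5  # jobs above this ratio were counted in the check (18 of 186)
HOLD_FACTOR = 2.0  # jobs above this ratio are put on hold


def classify(rss_kib: float, request_memory_mib: float) -> str:
    """Classify a job by the ratio of its RSS to its requested memory.

    Assumes rss_kib is in KiB and request_memory_mib is in MiB.
    """
    ratio = rss_kib / (request_memory_mib * 1024)
    if ratio > HOLD_FACTOR:
        return "hold"
    if ratio > FLAG_FACTOR:
        return "flagged"
    return "ok"


# ResidentSetSize from the example above; 20000 MiB is a HYPOTHETICAL request.
print(classify(32500000, 20000))
```

With that hypothetical request the job lands between the two thresholds, which is the situation described for the CMS jobs: counted by the check, but still allowed to run.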

Cheers,

Carles

On Fri, 10 Nov 2023 at 06:43, Carles Acosta <cacosta@xxxxxx> wrote:
Hi Greg,

The old job is finished. But for another job (exactly the same ResidentSetSize as before...):

[root@ce13 ~]# condor_q 21206191.0 -af ClusterId ProcId Owner RemoteHost ResidentSetSize/1024 ResidentSetSize_RAW/1024
21206191 0 lhpilot001 slot1_54@xxxxxxxxxxxx 12207 11011

[root@td813 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_54\@td813.pic.es/memory.current
8253329408

For a CentOS 7 job:

[root@ce13 ~]# condor_q 21205088 -af ClusterId ProcId Owner RemoteHost ResidentSetSize/1024 ResidentSetSize_RAW/1024
21205088 0 lhpilot001 slot1_33@xxxxxxxxxxxxx 7324 6608

[root@tds408 condor]# cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_33\@tds408.pic.es/memory.usage_in_bytes
7089156096
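
For a side-by-side view of the two jobs above, a small sketch that converts the cgroup byte counts to MiB and compares them with the ResidentSetSize_RAW values, which the condor_q commands already divide by 1024 (so MiB, assuming the attribute is in KiB). The dictionary layout and labels are illustrative only.

```python
MIB = 1024 * 1024

# Numbers copied from the transcripts above.
jobs = {
    "21206191.0 (td813, memory.current)": {
        "rss_mib": 12207,
        "rss_raw_mib": 11011,
        "cgroup_bytes": 8253329408,
    },
    "21205088 (tds408, memory.usage_in_bytes)": {
        "rss_mib": 7324,
        "rss_raw_mib": 6608,
        "cgroup_bytes": 7089156096,
    },
}


def raw_over_cgroup(job: dict) -> float:
    """ResidentSetSize_RAW divided by the cgroup usage, both in MiB."""
    return job["rss_raw_mib"] / (job["cgroup_bytes"] / MIB)


for name, job in jobs.items():
    cgroup_mib = job["cgroup_bytes"] / MIB
    print(f"{name}: cgroup={cgroup_mib:.0f} MiB, "
          f"RSS_RAW={job['rss_raw_mib']} MiB, ratio={raw_over_cgroup(job):.2f}")
```

With these numbers the job on td813 shows a raw ResidentSetSize about 40% above the cgroup value, while on the CentOS 7 node the two agree within a few percent.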

Cheers,

Carles



On Thu, 9 Nov 2023 at 19:08, Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

[root@ce13 ~]# condor_q 21200475 21200476 -af ClusterId ProcId Owner RemoteHost ResidentSetSize/1024
21200475 0 lhpilot001 slot1_43@xxxxxxxxxxxx 12207
21200476 0 lhpilot001 slot1_51@xxxxxxxxxxxx 1708


Hi Carles:

I'm curious what the value that cgroup is reporting for this job is. Can you tell us the contents of the "memory.current" value for the cgroup that job is in?
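
As a convenience for collecting that value on several nodes, here is a hedged sketch that reads the usage file from a job's cgroup directory. It tries "memory.current" first (cgroup v2) and falls back to "memory.usage_in_bytes" (cgroup v1). The function name is made up for illustration, and the cgroup directory has to be supplied by the caller since its location is site-specific.

```python
from pathlib import Path
from typing import Optional

# File names for cgroup v2 and cgroup v1 respectively.
USAGE_FILES = ("memory.current", "memory.usage_in_bytes")


def read_cgroup_memory_bytes(cgroup_dir: str) -> Optional[int]:
    """Return the current memory usage in bytes for a cgroup directory.

    Returns None when neither usage file is present in the directory.
    """
    for name in USAGE_FILES:
        usage_file = Path(cgroup_dir) / name
        if usage_file.is_file():
            return int(usage_file.read_text().strip())
    return None
```

Calling it with the slot's cgroup directory returns the same number that `cat` would print, as an integer, so it can be compared directly with ResidentSetSize_RAW after converting units.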

And the errors about the missing "memory.peak" files are ok. I'll try to change the error message to indicate this better.


-greg



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://legal.ifae.es


--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es