
Re: [HTCondor-users] ResidentSetSize report on Almalinux 9 and condor 10.5.0//23.0.0//23.0.1



Hi Carles,

I personally take RSS only as a rough upper limit, since it also includes shared pages. Summed over a process group, that probably inflates the reported memory usage, especially on larger nodes where many jobs may map the same shared libraries. IMHO, PSS is a more realistic metric, as are the cgroup memory controller metrics.
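
To make the RSS/PSS difference concrete, here is a minimal Python sketch. It assumes a Linux kernel that exposes /proc/<pid>/smaps_rollup (4.14 or later) and permission to read it for the target process. The function names are illustrative and not part of HTCondor.

```python
from pathlib import Path


def parse_smaps_rollup(text):
    """Extract the Rss and Pss totals (in kB) from smaps_rollup contents."""
    wanted = {"Rss", "Pss"}
    result = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        # Exact key match, so Pss_Anon, Pss_File, etc. are ignored.
        if sep and key in wanted:
            result[key] = int(rest.split()[0])
    return result


def rss_and_pss_kb(pid):
    """Read /proc/<pid>/smaps_rollup and return its Rss and Pss in kB."""
    path = Path(f"/proc/{pid}/smaps_rollup")
    return parse_smaps_rollup(path.read_text())
```

Calling rss_and_pss_kb on each process of a job and summing the values should show the effect: every process counts a shared page in full in its Rss, while Pss divides it among the processes that map it.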

Cheers,
  Thomas


On 21/02/2024 12.10, Carles Acosta wrote:
Hi,

I just want to comment that I continue to observe the memory issue with HTCondor and EL9. I cannot rely on adding MEMORY_EXCEED conditions in HTCondor/AlmaLinux9.

Let me show you another example with two ATLAS jobs, both running top-xaod.

*_Alma9 WN_*
* HTCondor version 23.0.3
* RES memory according to top: 3 GB
* ResidentSetSize according to condor:

# condor_q 22018463.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost
8662 9766 3500 slot1_10@xxxxxxxxxxxx

* Cgroup memory:
# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_10\@td807.pic.es/memory.current
3279331328

*_CentOS 7 WN_*
* HTCondor version 9.0.17
* RES memory according to top: 3.2 GB
* ResidentSetSize according to condor:

# condor_q 19600433.0 -af ResidentSetSize_RAW/1024 MemoryUsage RequestMemory RemoteHost
3329 3418 3500 slot1_10@xxxxxxxxxxxxx

* Cgroup memory:
# cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_10\@tds410.pic.es/memory.usage_in_bytes
3496230912

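So, again, on the AlmaLinux 9 node the ResidentSetSize and MemoryUsage reported by HTCondor (8662 and 9766 MB) far exceed what top and the cgroup memory report show (about 3 GB), while on CentOS 7 the values agree. I do not know if we are the only ones seeing this problem...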
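
The size of the discrepancy can be checked with a few lines of Python. This is a minimal sketch using only the numbers quoted above. It assumes ResidentSetSize_RAW is reported in KiB, so the /1024 values are MiB. The function name is illustrative.

```python
def rss_to_cgroup_ratio(rss_mib, cgroup_bytes):
    """Ratio of the RSS reported by HTCondor (MiB) to the cgroup counter (bytes)."""
    cgroup_mib = cgroup_bytes / (1024 * 1024)
    return rss_mib / cgroup_mib


# Values copied from the two jobs above.
alma9 = rss_to_cgroup_ratio(8662, 3279331328)    # memory.current (cgroup v2)
centos7 = rss_to_cgroup_ratio(3329, 3496230912)  # memory.usage_in_bytes (cgroup v1)

print(f"Alma9:   {alma9:.2f}")
print(f"CentOS7: {centos7:.2f}")
```

On Alma9 the value HTCondor reports is roughly 2.8 times the cgroup figure, whereas on CentOS 7 the two agree to within about 1%.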

Cheers,

Carles

On Thu, 16 Nov 2023 at 13:45, Carles Acosta <cacosta@xxxxxx> wrote:

    Hi again,

    Although the problem is most prominent for the LHCb experiment, I
    have been investigating whether it affects other projects as well,
    for example CMS. I counted the jobs whose ResidentSetSize is over
    1.5 times RequestMemory: 18 of 186 jobs. Of these 18 jobs, 17 are
    running on AlmaLinux 9 WNs. One example:

    [root@td810 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_11@xxxxxxxxxxxx/memory.current
    25772843008

    [root@ce13 ~]# condor_q 21238903 -af ResidentSetSize ResidentSetSize_RAW
    32500000 31951976

    As we only put on hold the jobs that exceed 2 times RequestMemory,
    the CMS jobs can still run. This is probably not a huge issue for
    CMS, but in general it seems that the ResidentSetSize values
    reported on AlmaLinux 9 WNs are higher.

    Cheers,

    Carles

    On Fri, 10 Nov 2023 at 06:43, Carles Acosta <cacosta@xxxxxx> wrote:

        Hi Greg,

        The old job is finished. But for another job (exactly the same
        ResidentSetSize as before...):

        [root@ce13 ~]# condor_q 21206191.0 -af ClusterId ProcId Owner RemoteHost ResidentSetSize/1024 ResidentSetSize_RAW/1024
        21206191 0 lhpilot001 slot1_54@xxxxxxxxxxxx 12207 11011

        [root@td813 ~]# cat /sys/fs/cgroup/htcondor/condor_home_execute_slot1_54\@td813.pic.es/memory.current
        8253329408

        For a CentOS 7 job:

        [root@ce13 ~]# condor_q 21205088 -af ClusterId ProcId Owner RemoteHost ResidentSetSize/1024 ResidentSetSize_RAW/1024
        21205088 0 lhpilot001 slot1_33@xxxxxxxxxxxxx 7324 6608

        [root@tds408 condor]# cat /sys/fs/cgroup/memory/system.slice/condor.service/condor_home_execute_slot1_33\@tds408.pic.es/memory.usage_in_bytes
        7089156096

        Cheers,

        Carles



        On Thu, 9 Nov 2023 at 19:08, Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

            [root@ce13 ~]# condor_q 21200475 21200476 -af ClusterId ProcId Owner RemoteHost ResidentSetSize/1024
            21200475 0 lhpilot001 slot1_43@xxxxxxxxxxxx 12207
            21200476 0 lhpilot001 slot1_51@xxxxxxxxxxxx 1708


            Hi Carles:

            I'm curious what value the cgroup is reporting for this
            job. Can you tell us the contents of the "memory.current"
            file for the cgroup that job is in?

            And the errors about the missing "memory.peak" files are ok.
            I'll try to change the error message to indicate this better.


            -greg



            _______________________________________________
            HTCondor-users mailing list
            To unsubscribe, send a message to
            htcondor-users-request@xxxxxxxxxxx with a
            subject: Unsubscribe
            You can also unsubscribe by visiting
            https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

            The archives can be found at:
            https://lists.cs.wisc.edu/archive/htcondor-users/



        --
        Carles Acosta i Silva
        PIC (Port d'Informació Científica)
        Campus UAB, Edifici D
        E-08193 Bellaterra, Barcelona
        Tel: +34 93 581 33 08
        Fax: +34 93 581 41 10
        http://www.pic.es
        Avís - Aviso - Legal Notice: http://legal.ifae.es




