
Re: [HTCondor-users] HTcondor disk resource related queries



Hi Vikrant,

what does your storage setup look like?

My guess would be that
  18446744073692897281
is a bit large, so that the partitionable parent slot maybe has an overflow or so, but that the partitioned slots are cut out properly.
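
A quick check of this guess, as a minimal Python sketch. It assumes the Disk figure in the summary line is the sum of the per-slot Disk values from the quoted 30 May `condor_status -server` output, printed as an unsigned 64-bit integer. That is an assumption about how the total is displayed, not something confirmed from the HTCondor source.

```python
# Per-slot Disk values (KiB) copied from the quoted 30 May condor_status output.
slot_disk = {
    "slot1": -25210961,   # partitionable parent slot
    "slot1_1": 4278313,
    "slot1_2": 4278313,
}


def as_uint64(value: int) -> int:
    """Reinterpret a signed integer as an unsigned 64-bit value."""
    return value % 2**64


signed_total = sum(slot_disk.values())
print(signed_total)             # -16654335
print(as_uint64(signed_total))  # 18446744073692897281
```

If that assumption holds, the huge total is only the negative parent-slot value wrapping around, and the open question is why slot1 reports a negative Disk in the first place.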

Cheers,
  Thomas

On 06/06/2023 14.53, Vikrant Aggarwal wrote:
Hello Experts,

Any input on disk issues?

Thanks & Regards,
Vikrant Aggarwal


On Sat, Jun 3, 2023 at 6:28 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

    Hello Tomer,

    Thanks for sharing the configuration. It helps to put jobs that
    breach RequestDisk on hold. We have a problem in our infra where
    people don't request disk in the job spec, hence I want to modify
    it on the worker machine based on some logic related to CPUs. I
    am seeing strange behavior.

    RequestDisk stays at whatever we put in the job submit file (2 GB),
    but I couldn't understand where the Disk attribute is being picked
    up from. By default it's ~4 GB.

    # condor_who -af:h globaljobid disk DiskUsage TotalDisk TotalSlotDisk RequestDisk
    globaljobid                        disk     DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
    test.example.com#429.0#1685829846  4271297  27         4271296648  4271297.0      2097152

    Attempt 1: Modify RequestDisk to 4 GB, but Disk becomes 8 GB.
    Maybe the default 4 GB is being added?

    MODIFY_REQUEST_EXPR_REQUESTDISK = 4194304

    globaljobid                        disk     DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
    test.example.com#430.0#1685830072  8542594  27         4271296648  8542594.0      2097152


    Attempt 2: Modify RequestDisk to 6 GB, but Disk becomes 8 GB.
    If the 4 GB addition logic held, it should have been 10 GB.

    MODIFY_REQUEST_EXPR_REQUESTDISK = 6291456

    globaljobid                        disk     DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
    test.example.com#431.0#1685830179  8542594  2          4271296648  8542594.0      2097152

    Attempt 3: Modify RequestDisk to 8 GB. As expected, Disk becomes
    12 GB.

    MODIFY_REQUEST_EXPR_REQUESTDISK = 8388608

    globaljobid                        disk      DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
    test.example.com#428.0#1685829703  12813890  8192027    4271296648  12813890.0     2097152

    Attempt 4: Modify RequestDisk to 1 GB. Disk stays at 4 GB.

    MODIFY_REQUEST_EXPR_REQUESTDISK = 1048576

    globaljobid                        disk     DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
    test.example.com#432.0#1685830887  4271297  2          4271296648  4271297.0      2097152


    Command used to grab outputs:

    condor_who -af:h globaljobid disk DiskUsage TotalDisk TotalSlotDisk
    RequestDisk
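
One pattern is visible in the outputs above, shown here as a small Python check: every reported Disk value equals a whole number of steps of TotalDisk / 1000, rounded up to a whole KiB. The step size and the rounding are inferred purely from the numbers posted in this message; they are an assumption, not documented HTCondor behaviour, and the sketch does not explain why a given request lands on a particular step count. The helper names are made up for illustration.

```python
TOTAL_DISK = 4271296648  # TotalDisk (KiB) from the condor_who output above

# Disk values reported in the initial run and in attempts 1 to 4.
observed_disk = [4271297, 8542594, 8542594, 12813890, 4271297]


def steps_of_total(disk: int, total: int = TOTAL_DISK) -> int:
    """Nearest whole number of 0.1% steps of TotalDisk for a Disk value."""
    return round(disk * 1000 / total)


def disk_for_steps(steps: int, total: int = TOTAL_DISK) -> int:
    """Disk size for a step count, rounded up to a whole KiB (inferred rule)."""
    return (steps * total + 999) // 1000


for disk in observed_disk:
    k = steps_of_total(disk)
    # Print the value, its step count, and whether the inferred rule reproduces it.
    print(disk, k, disk_for_steps(k) == disk)
```

Under this reading, the ~4 GB "default" is simply one step, i.e. 0.1% of the roughly 4 TB TotalDisk, and the 8 GB and 12 GB values are two and three steps.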


    Finally, there is more confusion from the negative Disk values in the following output:

    # condor_status `hostname` -server
    Name                                OpSys  Arch    LoadAv  Memory  Disk       Mips   KFlops

    slot1@xxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64  0.000   172962  -57841021  22492  1705677
    slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64  0.000   19218   12813890   22492  1705677
    slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64  0.000   19218   8542594    22492  1705677
    slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64  0.000   19218   8542594    22492  1705677
    slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64  0.000   19218   4271297    22492  1705677

                  Machines  Avail  Memory  Disk                  MIPS    KFLOPS

    X86_64/LINUX  5         5      249834  18446744073685880970  112460  8528385

    Total         5         5      249834  18446744073685880970  112460  8528385




    Questions:

    - Where is it picking up the default 4 GB Disk size from?
    - Why is it setting Disk to different values than what we ask for
    in the modify expression?
    - Why do we see a negative Disk value in the -server output?


    HTCondor version: 9.0.17



    Regards,
    Vikrant Aggarwal

    On Thu, 1 Jun, 2023, 09:38 Tomer Pearl, <tomerp@xxxxxxxxxxx> wrote:

        Hi Vikrant,

        The following configuration works for me. I'm not sure which
        version I'm running; it should be 9+.

        STARTD_JOB_ATTRS = $(STARTD_JOB_ATTRS) RequestDisk
        DISK_USAGE_EXCEEDED = (JobUniverse !=13 && DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk)
        use POLICY: WANT_HOLD_IF = (DISK_USAGE_EXCEEDED, 105, my error string..)

        I'm not sure whether the reason text (my error string..) should be
        surrounded by quotation marks, as I'm templating the file with Jinja.

        Tomer.

        ------------------------------------------------------------------------
        From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of
        Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
        Sent: Thursday, June 1, 2023 12:44 AM
        To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
        Subject: Re: [HTCondor-users] HTcondor disk resource related queries

        Hello Experts,

        I am testing this configuration to put jobs that breach the
        disk limit on hold.

        STARTD_JOB_ATTRS = $(STARTD_JOB_ATTRS) RequestDisk
        DISK_USAGE_EXCEEDED = (JobUniverse =!=13 && DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk)
        WANT_HOLD = $(DISK_USAGE_EXCEEDED)
        WANT_HOLD_REASON = "Job exceeded disk usage limits"

        I can clearly see that some jobs are using more than their
        RequestDisk, yet they are not getting held.

        # condor_who -af:h globaljobid disk DiskUsage TotalDisk TotalSlotDisk RequestDisk
        globaljobid                        disk      DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
        test.example.com#412.0#1685567906  21356484  8192026    4271296648  21356484.0     16777216
        test.example.com#413.0#1685567923  12813890  8192026    4271296648  12813890.0     8388608
        test.example.com#414.0#1685567952  8542594   8192026    4271296648  8542594.0      3250000
        test.example.com#415.0#1685568493  8542594   8192025    4271296648  8542594.0      3250000
        test.example.com#416.0#1685568803  12813890  8192026    4271296648  12813890.0     10000000
        test.example.com#417.0#1685568954  4271297   8192025    4271296648  4271297.0      1

        I am using HTCondor version 9.0.17.
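
To make the claim concrete, here is a small Python sketch that applies the comparison from DISK_USAGE_EXCEEDED (DiskUsage > RequestDisk) to the rows of the condor_who output above. It only mirrors the arithmetic of the expression; it is not how the startd evaluates ClassAds, and the JobUniverse and UNDEFINED checks are left out.

```python
# (job id, DiskUsage, RequestDisk) copied from the condor_who output, in KiB.
jobs = [
    ("412.0", 8192026, 16777216),
    ("413.0", 8192026, 8388608),
    ("414.0", 8192026, 3250000),
    ("415.0", 8192025, 3250000),
    ("416.0", 8192026, 10000000),
    ("417.0", 8192025, 1),
]


def disk_usage_exceeded(disk_usage: int, request_disk: int) -> bool:
    """Same comparison as the DISK_USAGE_EXCEEDED expression."""
    return disk_usage > request_disk


over_limit = [job for job, usage, request in jobs
              if disk_usage_exceeded(usage, request)]
print(over_limit)  # ['414.0', '415.0', '417.0']
```

By this comparison, three of the six jobs (414.0, 415.0 and 417.0) are over their RequestDisk and are the ones a working hold policy would be expected to catch.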


        Thanks & Regards,
        Vikrant Aggarwal


        On Tue, May 30, 2023 at 1:09 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

            Hello Experts,

            A couple of queries:

            - Why is it showing a negative Disk value for the primary
            partitionable slot?

            # condor_status `hostname` -server
            Name                                OpSys  Arch    LoadAv  Memory  Disk       Mips   KFlops

            slot1@xxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64  0.000   211398  -25210961  25601  1764976
            slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64  0.000   19218   4278313    25601  1764976
            slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64  0.000   19218   4278313    25601  1764976

                          Machines  Avail  Memory  Disk                  MIPS   KFLOPS

            X86_64/LINUX  3         3      249834  18446744073692897281  76803  5294928

            Total         3         3      249834  18446744073692897281  76803  5294928


            # condor_status -compact `hostname` -af Disk
            4269756335

            - I have this in the worker node config to modify the job's
            requested disk to the value below, but it never worked. We use
            similar expressions for CPU and memory, and they work fine.

            # condor_config_val MODIFY_REQUEST_EXPR_REQUESTDISK
            80000

            Not sure where it's picking this value from.

            # grep -r 'Disk =' /spare/condor/dir_14*/.machine.ad
            /spare/condor/dir_1417831/.machine.ad:Disk = 4278313
            /spare/condor/dir_1417831/.machine.ad:TotalDisk = 4278312960
            /spare/condor/dir_1417831/.machine.ad:TotalSlotDisk = 4278313.0
            /spare/condor/dir_1425169/.machine.ad:Disk = 4278313
            /spare/condor/dir_1425169/.machine.ad:TotalDisk = 4278312960
            /spare/condor/dir_1425169/.machine.ad:TotalSlotDisk = 4278313.0


            # du -sh /spare/condor/dir_1425169
            3.0G    /spare/condor/dir_1425169

            Thanks & Regards,
            Vikrant Aggarwal


        _______________________________________________
        HTCondor-users mailing list
        To unsubscribe, send a message to
        htcondor-users-request@xxxxxxxxxxx
        <mailto:htcondor-users-request@xxxxxxxxxxx> with a
        subject: Unsubscribe
        You can also unsubscribe by visiting
        https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
        <https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users>

        The archives can be found at:
        https://lists.cs.wisc.edu/archive/htcondor-users/
        <https://lists.cs.wisc.edu/archive/htcondor-users/>

