
Re: [HTCondor-users] soft/hard limiting cpu.shares ?



Hi Tom,

many thanks for the confirmation :)

tbh, we probably want to have our cake and eat it too, i.e., have a somewhat hard limit while still being lenient towards our users... We will probably play a bit with the cgroup values and see how the system evolves.

Cheers and thanks,
  Thomas



On 16/06/2021 17.44, tpdownes@xxxxxxxxx wrote:
Thomas:

You understand the cpu shares mechanism correctly. It's a soft limit with a policy for resolving conflict when conflict arises.
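
To illustrate the soft-limit behaviour, here is a minimal Python sketch of a work-conserving proportional-share split. It is a simplified model and not the kernel's CFS implementation; the function name allocate_cpu and the example numbers are made up for illustration.

```python
def allocate_cpu(capacity, shares, demand):
    """Split `capacity` cores among groups in proportion to `shares`.

    A group never gets more than its `demand`; CPU it leaves unused is
    handed to groups that still want more (work-conserving behaviour).
    """
    alloc = {name: 0.0 for name in shares}
    active = {name for name in shares if demand[name] > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_shares = sum(shares[name] for name in active)
        satisfied = set()
        handed_out = 0.0
        for name in active:
            offer = remaining * shares[name] / total_shares
            give = min(offer, demand[name] - alloc[name])
            alloc[name] += give
            handed_out += give
            if alloc[name] >= demand[name] - 1e-9:
                satisfied.add(name)
        remaining -= handed_out
        if not satisfied:
            break  # everyone took their full offer: nothing left to redistribute
        active -= satisfied
    return alloc


# 4 cores, two groups with equal shares (nominally 2 cores each).
# Group "b" is mostly idle, so "a" may exceed its nominal 2 cores.
print(allocate_cpu(4, {"a": 100, "b": 100}, {"a": 3.0, "b": 0.5}))
# Both groups are busy: the shares now decide, 2 cores each.
print(allocate_cpu(4, {"a": 100, "b": 100}, {"a": 4.0, "b": 4.0}))
```

In this model the shares only matter once the groups together ask for more CPU than the machine has, which matches the "soft limit" description above.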

If you really want to nail down HTCondor jobs to a total number of cores, you want to use cpu.cfs_quota_us (and optionally cpu.cfs_period_us) on the parent htcondor cgroup. This is an honest-to-goodness hard limit on CPU usage that works in parallel with the shares mechanism.

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu

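Short version: to assign 1 core to the cgroup, set the quota equal to the period, i.e. 100000 with the default cpu.cfs_period_us of 100000 (a quota of 1000000 corresponds to 1 core only if the period is also set to 1000000).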
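
The quota is interpreted relative to cpu.cfs_period_us, so N cores correspond to N times the period. A small Python sketch of that arithmetic follows. The helper names are hypothetical, and the cgroup directory /sys/fs/cgroup/cpu/htcondor is an assumption (a typical cgroup v1 mount point); check where the htcondor cgroup lives on your nodes. The sketch only computes the value and does not write anything.

```python
import os

# Kernel default for cpu.cfs_period_us is 100000 microseconds (100 ms).
DEFAULT_PERIOD_US = 100_000


def cfs_quota_us(cores, period_us=DEFAULT_PERIOD_US):
    """Return the cpu.cfs_quota_us value that caps a cgroup at `cores` CPUs."""
    if cores <= 0 or period_us <= 0:
        raise ValueError("cores and period_us must be positive")
    return round(cores * period_us)


def quota_setting(cores, cgroup_dir="/sys/fs/cgroup/cpu/htcondor",
                  period_us=DEFAULT_PERIOD_US):
    """Return (file path, value to write); the directory is an assumed default."""
    path = os.path.join(cgroup_dir, "cpu.cfs_quota_us")
    return path, str(cfs_quota_us(cores, period_us))


print(cfs_quota_us(1))                       # 100000 (default period)
print(cfs_quota_us(1, period_us=1_000_000))  # 1000000 (period raised to 1 s)
print(quota_setting(24))                     # cap the parent cgroup at 24 cores
```

Fractional values work as well, e.g. cfs_quota_us(0.5) limits the cgroup to half a core per period.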

Within the htcondor cgroup, the per-job shares set by HTCondor will still be enforced, but the overall limit will be applied at the parent level.

Tom
On Wed, Jun 16, 2021 at 10:20 AM Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:

    Hi all,

    a short question regarding the scaling of jobs' core time via cgroup
    cpu.shares:

    The relative share of a job's cgroup is only limiting with respect to
    the total core-scaled CPU time, right?

    I.e., we are running our nodes with hyperthreading 2x enabled for
    simplicity, since we use the same machines for production jobs as well
    as for user job sub-clusters.

    Since users occasionally have odd jobs (that tend to work better
    without overbooking), on user nodes we broker only 1/2 of the HT-core
    count for jobs.

    Now, the condor parent cgroup has been assigned
       htcondor/cpu.shares = 1024
    with respect to the total system share of
       cpu.shares=1024
    so all condor child processes (without further sub-groups) could in
    principle use up to 100% of the total HT-core scaled CPU time.

    A single core job gets a relative share like

    htcondor/condor_var_lib_condor_execute_slot2_15@xxxxxxxxxxxxxxx/cpu.shares
    100
    where we broker only 50% of the total HT-core scaled time - as far
    as I see.

    However, user jobs can utilize more than their nominally assigned
    cpu share.
    My understanding is that the kernel notices that the total CPU time is
    not utilized completely and thus allows processes to use more than
    their nominal time limit, as there is still CPU time available.
    Is this correct?

    When we scale the condor parent cgroup to a reasonable fraction of the
    system cpu.share (taking HT efficiency into account), we should be able
    to scale CPU times per job to (roughly) core-equivalents - without the
    need to bind jobs to specific cores, right?

    Cheers,
       Thomas

    _______________________________________________
    HTCondor-users mailing list
    To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
    subject: Unsubscribe
    You can also unsubscribe by visiting
    https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

    The archives can be found at:
    https://lists.cs.wisc.edu/archive/htcondor-users/



