
Re: [HTCondor-users] HT cores utilized to 100% although HT core count is false



Hi Tom,

Yes, thinking about it with your and Greg's feedback, our attempt to make Condor/the kernel behave like 'non-HT' on an 'HT CPU' through CPU time does not make much sense.

So far we have been on cgroups v1 and some user jobs (or some of their frameworks) lie about their needs all the time... (like hard-wired `make -j 32` in a score job :-/ )

---
Taking the CPU time share approach to the extreme: halving the overall CPU time shares or so would definitely not be equivalent to the CPU with HT switched off in hardware :-/ Binding a process to a cpuset might be somewhat workable - but I have no idea whether in the end we would force the kernel to make inefficient decisions... (plus I have no idea what the CPU is actually doing underneath)

Cheers and thanks,
  Thomas

On 15/01/2021 18.18, tpdownes@xxxxxxxxx wrote:
Thomas:

It's also worth emphasizing what's happening here. One or more of your jobs is simplifying/lying about the CPU resources it needs. That's also a problem.

Miron has in the past said that HTCondor should not act as "the CPU police", which I think is the better part of wisdom. But you, personally, might consider acting as the CPU judge/jury/executioner.

Tom

On Fri, Jan 15, 2021 at 11:11 AM Tom Downes <tpdownes@xxxxxxxxx> wrote:

    Thomas:

    Cgroups allow you to set hard limits on CPU if you wish.

    https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu

    There is a lot of movement in cgroups, so refer to your own OS kernel
    docs where you can. The RHEL 6 link above should work on most
    contemporary OSes. On modern versions of systemd (this excludes
    CentOS 7), you can set this with the systemd directive CPUQuota, but
    you'd have to use "condor.service" as your parent cgroup for jobs.
    Otherwise you have to manually script up a solution for your
    htcondor cgroup.

    Alternatively - and this might be best for your use case... set the
    cgroup's cpuset to the kernel-exposed logical CPUs that map to
    independent physical cores. This will completely preclude the kernel
    from considering the "fake" cores when scheduling your threads.
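
A minimal sketch of how such a CPU list could be derived, assuming the
standard Linux sysfs topology files
(/sys/devices/system/cpu/cpuN/topology/thread_siblings_list). The helper
names are hypothetical, and which cgroup's cpuset.cpus file the result
should be written to is site-specific and not shown here.

```python
import glob


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0,48' or '2-3' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU of every SMT sibling group."""
    return sorted({parse_cpu_list(s)[0] for s in sibling_lists if s.strip()})


def format_cpu_list(cpus):
    """Collapse CPU numbers into the range syntax used by cpuset.cpus."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu
        else:
            ranges.append([cpu, cpu])
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


if __name__ == "__main__":
    # Assumption: sysfs exposes one thread_siblings_list per logical CPU.
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    contents = []
    for path in glob.glob(pattern):
        with open(path) as handle:
            contents.append(handle.read())
    print(format_cpu_list(one_cpu_per_core(contents)))
```

The printed string (for example "0-47" on a node where CPU N and CPU N+48
are siblings) is in the format that a cpuset's cpus file accepts. Whether
restricting jobs this way actually helps has to be measured per workload.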

    Whether any of this actually helps your applications is another
    matter, but these are the ways of accomplishing what you want.

    Tom



    On Fri, Jan 15, 2021 at 10:51 AM Greg Thain <gthain@xxxxxxxxxxx> wrote:


        Hi Thomas:

        When you set

        COUNT_HYPERTHREAD_CPUS = false

        HTCondor will only advertise as many cores as there are physical
        cores. Whether the kernel will choose to schedule processes
        only on the physical cores is kind of up to the Linux kernel. If
        you absolutely want to prohibit the kernel from ever running
        a process using hyperthreads, it might be best to disable
        hyperthreading in the BIOS, but I understand that's more work
        than merely setting a condor knob.

        As you see from your cgroups, an HTCondor with root will set
        cpu.shares. Note that cpu.shares isn't a hard limit, but only
        comes into play when there is contention. That is, let's say on
        your machine you have 48 slots, all running jobs that have
        requested and been allocated one core each. If 47 of those jobs
        are idle, (maybe waiting on I/O), but one job launched 96
        cpu-bound threads, the Linux kernel scheduler may run all 96 of
        those threads concurrently. If the 47 idle jobs suddenly become
        cpu-bound again, the Linux scheduler will throttle the 96 thread
        job back to one core.
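
The behaviour described above can be illustrated with a toy model. This
is only a sketch under stated assumptions, not the kernel's actual
scheduler: CPU time is treated as divisible, each job is capped by its
number of runnable threads, and the remainder is split in proportion to
cpu.shares. The function name and the (shares, threads) tuples are
hypothetical.

```python
def allocate_cpus(n_cpus, jobs):
    """Idealized proportional-share split of n_cpus among jobs.

    jobs is a list of (shares, runnable_threads) tuples with shares > 0.
    A job never gets more CPUs than it has runnable threads; capacity it
    cannot use is redistributed to the other jobs by weight.
    """
    alloc = [0.0] * len(jobs)
    # Jobs without runnable threads (e.g. waiting on I/O) take no CPU.
    active = {i for i, (_, threads) in enumerate(jobs) if threads > 0}
    remaining = float(n_cpus)
    while active and remaining > 1e-9:
        total = sum(jobs[i][0] for i in active)
        # Jobs whose weighted share would cover all their threads.
        capped = {i for i in active
                  if remaining * jobs[i][0] / total >= jobs[i][1]}
        if not capped:
            # Everyone left is limited by shares, not by thread count.
            for i in active:
                alloc[i] = remaining * jobs[i][0] / total
            break
        for i in capped:
            alloc[i] = float(jobs[i][1])
            remaining -= jobs[i][1]
        active -= capped
    return alloc


if __name__ == "__main__":
    greedy = (100, 96)  # one slot running 96 cpu-bound threads
    print(allocate_cpus(48, [greedy] + [(100, 0)] * 47)[0])  # others idle
    print(allocate_cpus(48, [greedy] + [(100, 1)] * 47)[0])  # others busy
    print(allocate_cpus(96, [greedy] + [(100, 1)] * 47)[0])  # 96 logical CPUs
```

In this model the 96-thread job gets all 48 CPUs while the other slots
are idle and exactly one CPU once they are busy. With 96 schedulable
logical CPUs and 47 busy single-threaded jobs, the model leaves 49 CPUs
to the 96-thread job, since shares only bite under contention.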

        Now, whether to use or disable hyperthreads depends on your
        needs. Enabling hyperthreads, in general, increases throughput,
        at the cost of the performance of, and memory available to,
        individual jobs. There is no free lunch.

        -greg

        On 1/15/21 8:30 AM, Thomas Hartmann wrote:
        Hi all,

        I am currently wondering about a few nodes that show
        utilization of all (HT) cores but should be using only
        50%, i.e., just the physical core count.

        The nodes have AMD Epycs with HT/SMT cores active - but since
        we have
          COUNT_HYPERTHREAD_CPUS = false
        set, Condor should be using only 50% of the (virtual) core
        count [1], right?

        What worries me a bit is that the CPU time shares of the jobs
        look good [2], i.e., currently fewer than 48 single-core jobs,
        each with a relative weight of '100'. However, I am no longer
        sure how the kernel distributes the CPU time slots here if the
        parent's relative share is 100%(?) of the overall(??) time share.

        Is the CPU time weighting maybe misleading here, if one tries
        to 'match' only for the physical core count?

        Cheers and thanks for ideas,
          Thomas



        [1]
        COUNT_HYPERTHREAD_CPUS = false
        ...
        DETECTED_CORES = 96
        DETECTED_CPUS = 48
        DETECTED_MEMORY = 257656
        DETECTED_PHYSICAL_CPUS = 48
        ..
        NUM_CPUS = $(DETECTED_CPUS)


        [2]
        [root@batch1071 htcondor]# cat /sys/fs/cgroup/cpu,cpuacct/cpu.shares
        1024
        [root@batch1071 htcondor]# cat /sys/fs/cgroup/cpu,cpuacct/htcondor/cpu.shares
        1024
        [root@batch1071 htcondor]# cat /sys/fs/cgroup/cpu,cpuacct/htcondor/condor_var_lib_condor_execute_slot*/cpu.shares | sort | wc -l
        45

        _______________________________________________
        HTCondor-users mailing list
        To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
        subject: Unsubscribe
        You can also unsubscribe by visiting
        https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

        The archives can be found at:
        https://lists.cs.wisc.edu/archive/htcondor-users/

