
[HTCondor-users] Including excess per-core memory request in SlotWeight for dynamic slots



We, like many others, have situations where certain kinds of jobs need only a few CPU cores but a substantial fraction of the system’s memory, and the default SLOT_WEIGHT configuration of “Cpus” doesn’t take that into account when measuring a user’s weighted-hours utilization of the system. A user might claim a single core but 90% of the system’s memory, and the weighted use would reflect only one core, even though almost none of the other cores on the machine can be used by anyone else.

 

I cooked up an expression for the configuration value to take that into account.

 

The idea is that each core in the system has an uncontested claim to that core’s equal share of the system memory, kind of like a static slot. As a simple example, for a 64GB machine with 64 CPU cores, each core should be able to take up to 1GB of memory without penalty.

 

This “baseline” allowance is represented by the expression:

TotalMemory / TotalCpus

 

The “excess” memory is what the job uses over and above the baseline amount for the number of cores the job has claimed. On that 64GB, 64-core machine, a one-core job using 2GB or a two-core job using 3GB would have an excess of 1GB, a one-core job using 32GB would have an excess of 31GB, and a one-core job using 512MB would have an excess of zero.

 

This is represented by:

ifThenElse(TotalSlotMemory > TotalMemory / TotalCpus * Cpus, (TotalSlotMemory - (TotalMemory / TotalCpus * Cpus)), 0)
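
As a sanity check on the excess calculation, here is a minimal Python sketch that mirrors the expression above. The function and argument names are my own stand-ins for the machine attributes (TotalMemory, TotalCpus, TotalSlotMemory, Cpus), and it assumes memory values in MB, which is how I understand HTCondor reports them.

```python
def excess_memory_mb(total_memory_mb, total_cpus, slot_memory_mb, cpus):
    """Memory (MB) claimed beyond the per-core baseline for the claimed cores.

    Hypothetical helper: the arguments stand in for the TotalMemory,
    TotalCpus, TotalSlotMemory and Cpus machine attributes.
    """
    baseline_per_core = total_memory_mb / total_cpus
    allowance = baseline_per_core * cpus
    # Same shape as the ifThenElse(): the excess never goes below zero.
    return slot_memory_mb - allowance if slot_memory_mb > allowance else 0


# Example machine from the text: 64GB of memory and 64 cores.
TOTAL_MEMORY_MB = 64 * 1024
TOTAL_CPUS = 64

for label, mem_mb, cpus in [
    ("one core, 2GB", 2 * 1024, 1),
    ("two cores, 3GB", 3 * 1024, 2),
    ("one core, 32GB", 32 * 1024, 1),
    ("one core, 512MB", 512, 1),
]:
    excess = excess_memory_mb(TOTAL_MEMORY_MB, TOTAL_CPUS, mem_mb, cpus)
    print(f"{label}: excess {excess / 1024:g}GB")
```

The four cases are the ones from the paragraph above, so the printed excess values can be compared directly against it.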

 

To calculate how many cores this excess memory use is equivalent to, you divide it by the baseline allowance.

            (TotalSlotMemory - (TotalMemory / TotalCpus * Cpus)) / (TotalMemory / TotalCpus)

 

Then you add that total to the existing “Cpus” slot weight. So your final SLOT_WEIGHT expression would be

 

SLOT_WEIGHT = Cpus + ifThenElse(TotalSlotMemory > (TotalMemory / TotalCpus * Cpus), \
    (TotalSlotMemory - (TotalMemory / TotalCpus * Cpus)) / (TotalMemory / TotalCpus), 0)
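
To see what weights this produces, the following Python sketch evaluates the same arithmetic outside of HTCondor. It is only an illustration: slot_weight() and its arguments are hypothetical stand-ins for the machine attributes, memory is assumed to be in MB, and the round_up flag is my own addition for anyone who prefers an integer weight.

```python
import math


def slot_weight(total_memory_mb, total_cpus, slot_memory_mb, cpus, round_up=False):
    """Cpus plus the core-equivalent of any memory claimed beyond the baseline.

    Hypothetical helper mirroring the SLOT_WEIGHT arithmetic; it is not
    part of HTCondor.
    """
    per_core = total_memory_mb / total_cpus   # baseline allowance per core
    allowance = per_core * cpus               # allowance for the claimed cores
    if slot_memory_mb > allowance:
        excess_cores = (slot_memory_mb - allowance) / per_core
    else:
        excess_cores = 0
    weight = cpus + excess_cores
    # Optionally take the ceiling so the weight is always an integer.
    return math.ceil(weight) if round_up else weight


# Example machine: 64GB of memory and 64 cores.
TOTAL_MEMORY_MB = 64 * 1024
TOTAL_CPUS = 64

# One core with 32GB is weighted like 32 cores.
print(slot_weight(TOTAL_MEMORY_MB, TOTAL_CPUS, 32 * 1024, 1))
# One core with 512MB stays at a weight of 1.
print(slot_weight(TOTAL_MEMORY_MB, TOTAL_CPUS, 512, 1))
# One core with roughly 90% of the memory, rounded up to a whole number.
print(slot_weight(TOTAL_MEMORY_MB, TOTAL_CPUS, 58982, 1, round_up=True))
```

With these numbers the single-core job holding about 90% of the memory is charged for most of the machine instead of for one core, which is the effect the expression is after.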

If you wish, you can round the calculated value or take its ceiling so that the slot weight stays an integer rather than a floating-point number. That doesn’t appear to be necessary for the stock tools: condor_userprio, for example, rounds the “Res In Use” column, so a floating-point value doesn’t throw off the formatting. But if you’ve got scripts which assume SlotWeight is an integer, you may want to do the rounding yourself in the expression.

 

Note that this only works if you’re using Partitionable / Dynamic slots, because it’s using the TotalSlotMemory machine attribute – for a static slot that value will always be equal to the baseline allowance, by definition. For static slots, you’d have to use a job attribute instead, which requires the use of SCHEDD_SLOT_WEIGHT since the SLOT_WEIGHT config can’t refer to job attributes.

 

            -Michael Pelletier.