
Re: [HTCondor-users] Periodic Hold for jobs exceeding memory and CPU requests



On 2/24/2021 8:49 AM, David Cohen wrote:
Hi Todd,
~$ condor_config_val BASE_CGROUP
htcondor
~$ condor_config_val CGROUP_MEMORY_LIMIT_POLICY
HARD

And still I recall at least two occasions when users were running over the requested memory.


Hi David,

Note that CGROUP_MEMORY_LIMIT_POLICY does not hold jobs that run over their requested memory; it holds jobs that use more resident memory than is allocated to the execute slot.  For instance, if a job requests 6 GB and is matched to a slot containing 12 GB of memory, the job will not be halted unless the resident set size of all processes running in that slot exceeds 12 GB.
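
To make the distinction concrete, here is a minimal Python sketch of the two comparisons. It is only an illustration of the logic, not HTCondor code; the function names and the 8 GB usage figure are made up for the example.

```python
# Illustrative only: these helpers are hypothetical, not part of HTCondor.
GB = 1024  # work in MB, as HTCondor memory attributes do


def over_request(resident_mb: int, request_memory_mb: int) -> bool:
    """True when the job uses more resident memory than it requested."""
    return resident_mb > request_memory_mb


def over_slot_limit(resident_mb: int, slot_memory_mb: int) -> bool:
    """True when resident memory exceeds what the slot was allocated.

    This is the comparison the hard cgroup limit effectively enforces.
    """
    return resident_mb > slot_memory_mb


request_mb = 6 * GB   # job asked for 6 GB
slot_mb = 12 * GB     # but was matched to a 12 GB slot
usage_mb = 8 * GB     # assumed usage, for illustration

print(over_request(usage_mb, request_mb))    # True: over the request
print(over_slot_limit(usage_mb, slot_mb))    # False: under the slot limit
```

So a job can be well over its request and still be under the limit that the cgroup policy acts on.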

By default, HTCondor will match a job to a slot where the job's RequestMemory <= the slot's Memory, so the memory requested by a job will not always be exactly equal to the slot's memory.  Reasons they may differ include use of static slots, or, with partitionable slots, a) the config setting MODIFY_REQUEST_EXPR_REQUESTMEMORY, which rounds the memory of the dynamic slot upwards so that it matches more jobs in the future (see https://tinyurl.com/ydev6mka), and/or b) slot preemption, where a dynamic slot is created with 20GB for a job requesting 20GB, but then that slot is preempted for use by a higher priority user for a job that requested less than 20GB.
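
As a rough illustration of the rounding in (a), the Python sketch below rounds a request up to the next multiple of a step size and shows the resulting gap between request and slot memory. The helper name and the 128 MB step are assumptions chosen for the example; check MODIFY_REQUEST_EXPR_REQUESTMEMORY on your own pool for the actual expression.

```python
# Illustrative only: quantize_up is a hypothetical helper, and the
# 128 MB step is an assumed value, not read from any real config.
def quantize_up(value_mb: int, step_mb: int) -> int:
    """Round value_mb up to the nearest multiple of step_mb."""
    return -(-value_mb // step_mb) * step_mb


def matches(request_memory_mb: int, slot_memory_mb: int) -> bool:
    """Memory part of the match: the request must fit in the slot."""
    return request_memory_mb <= slot_memory_mb


request_mb = 1000
slot_mb = quantize_up(request_mb, 128)

print(slot_mb)                       # 1024
print(matches(request_mb, slot_mb))  # True
print(slot_mb - request_mb)          # 24 MB of headroom above the request
```

The job in this example could use up to 24 MB more than it requested before reaching the slot's memory.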

Hope the above helps,
Todd