
Re: [HTCondor-users] Periodic Hold for jobs exceeding memory and CPU requests



Hi Todd,
~$ condor_config_val BASE_CGROUP
htcondor
~$ condor_config_val CGROUP_MEMORY_LIMIT_POLICY
HARD

And still I recall at least two occasions when users were running over the requested memory.

As for CPU usage: because I only schedule 0.75 of the HT cores (for better performance), misbehaving jobs are able to affect other jobs.
The link you sent seems to address that, thanks.

David

On Wed, Feb 24, 2021 at 4:25 PM Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 2/24/2021 6:31 AM, David Cohen wrote:
Hi,
I was under the, apparently wrong, impression that setting
CGROUP_MEMORY_LIMIT_POLICY = HARD
will suffice to kill jobs running over the requested memory.
I now understand that I have to back it up by a SYSTEM_PERIODIC_HOLD

Hi David,

How did you arrive at the conclusion that you need to do anything more than setting CGROUP_MEMORY_LIMIT_POLICY=HARD to have jobs placed on hold if they exceed the memory allocated to the slot?

As Christoph stated earlier, that should be sufficient assuming you are running the HTCondor services with root privileges (i.e. as a system service) and you have BASE_CGROUP defined in your config (it is defined by default... did you change it?).
While I'm at it can I also use that method to remove jobs that are using more cores than requested (cpu usage > cpu requested)?


Assuming HTCondor is launched as root, it will automatically restrict CPU usage of jobs (using Linux cgroups) to not exceed the number of cores in the slot when there is contention for the cores. That is, on an eight-core machine, with only a single, one-core slot running, and otherwise idle, the job running in the one slot could consume all eight CPUs concurrently. If, however, all eight slots were running jobs, with each configured for one CPU, the CPU usage would be assigned equally to each job, regardless of the number of processes or threads in each job.
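
To make the arithmetic concrete, here is a rough Python sketch of work-conserving proportional sharing, which is the general behavior described above. It is only a model, not HTCondor or kernel code: the function name share_cpus, the (slot_cpus, demand) tuples, and the assumption that each job's weight equals the number of cores in its slot are all made up for illustration.

```python
def share_cpus(total_cores, jobs):
    """Toy model of proportional CPU sharing under contention.

    jobs is a list of (slot_cpus, demand) pairs (hypothetical layout):
      slot_cpus: cores configured for the slot, used as the job's weight
                 (assumed to be positive)
      demand:    cores the job would use if nothing held it back
    Returns the cores each job ends up with, in the same order.
    """
    alloc = [0.0] * len(jobs)
    active = set(range(len(jobs)))
    remaining = float(total_cores)
    while active:
        weight_sum = sum(jobs[i][0] for i in active)
        # Each active job's weighted share of what is still unassigned.
        fair = {i: remaining * jobs[i][0] / weight_sum for i in active}
        # Jobs that want no more than their share get exactly what they want.
        satisfied = [i for i in active if jobs[i][1] <= fair[i]]
        if not satisfied:
            # Everyone left wants more than their share: split by weight.
            for i in active:
                alloc[i] = fair[i]
            break
        for i in satisfied:
            alloc[i] = float(jobs[i][1])
            remaining -= jobs[i][1]
            active.remove(i)
    return alloc


# One greedy job alone on an eight-core machine may use every core.
print(share_cpus(8, [(1, 8)]))      # [8.0]
# Eight greedy one-core jobs are each held to one core.
print(share_cpus(8, [(1, 8)] * 8))  # eight entries of 1.0
```

In this model a job only exceeds its slot's share when the extra cores would otherwise sit idle, which is the point made above.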

Because of this, few administrators see the need to stop jobs using more cores than requested, because the only scenario in which this could happen is one where no other user is impacted and the cores would otherwise go idle. If for some reason you still want to do this, you may find the HOWTO at this page useful:
 https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitCpuUsage
(specifically Option 3 on this page).

Hope the above helps
Todd