
Re: [HTCondor-users] Periodic Hold for jobs exceeding memory and CPU requests

Hi David,

Is that on the worker?

Also, does your OS work with cgroups OK?

Please check whether the cgroups are set up as expected. The listing should look like this:

[root@batch1074 ~]# ls /sys/fs/cgroup/memory/htcondor/condor*
cgroup.clone_children  memory.force_empty              memory.kmem.slabinfo                memory.kmem.tcp.usage_in_bytes  memory.memsw.failcnt             memory.move_charge_at_immigrate  memory.soft_limit_in_bytes  memory.use_hierarchy
cgroup.event_control   memory.kmem.failcnt             memory.kmem.tcp.failcnt             memory.kmem.usage_in_bytes      memory.memsw.limit_in_bytes      memory.numa_stat                 memory.stat                 notify_on_release
cgroup.procs           memory.kmem.limit_in_bytes      memory.kmem.tcp.limit_in_bytes      memory.limit_in_bytes           memory.memsw.max_usage_in_bytes  memory.oom_control               memory.swappiness           tasks
memory.failcnt         memory.kmem.max_usage_in_bytes  memory.kmem.tcp.max_usage_in_bytes  memory.max_usage_in_bytes       memory.memsw.usage_in_bytes      memory.pressure_level            memory.usage_in_bytes
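
To go one step beyond listing the directory, the values in those files can be read directly. The following is a minimal Python sketch, not part of HTCondor: the function names and the `unlimited_threshold` cutoff are assumptions made for illustration. It relies only on the cgroup v1 files shown in the listing above, and on the kernel reporting a very large number (close to 2^63) in memory.limit_in_bytes when no limit is set.

```python
import os

# cgroup v1 memory controller files; all of them appear in the listing above.
FIELDS = (
    "memory.limit_in_bytes",
    "memory.usage_in_bytes",
    "memory.max_usage_in_bytes",
    "memory.failcnt",
)


def read_memory_cgroup(cgroup_dir):
    """Return the integer value of each file in FIELDS for one job cgroup."""
    stats = {}
    for name in FIELDS:
        with open(os.path.join(cgroup_dir, name)) as handle:
            stats[name] = int(handle.read().strip())
    return stats


def limit_is_enforced(stats, unlimited_threshold=1 << 60):
    """Guess whether a real limit is set.

    Assumption: an unlimited cgroup reports a value near 2**63, so anything
    below the (hypothetical) threshold is treated as a configured limit.
    """
    return stats["memory.limit_in_bytes"] < unlimited_threshold
```

Pass one of the directories matched by the glob above to `read_memory_cgroup`. A non-zero memory.failcnt means the job has hit its limit at least once.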


Christoph Beyer
DESY Hamburg

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

mail: christoph.beyer@xxxxxxx

From: "David Cohen" <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
To: "Todd Tannenbaum" <tannenba@xxxxxxxxxxx>
CC: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 24 February 2021 15:49:30
Subject: Re: [HTCondor-users] Periodic Hold for jobs exceeding memory and CPU requests

Hi Todd,
~$ condor_config_val BASE_CGROUP
~$ condor_config_val CGROUP_MEMORY_LIMIT_POLICY

And still I recall at least two occasions when users were running over the requested memory.

As for CPU usage: because I only schedule 0.75 of the hyperthreaded cores, for better performance, misbehaving jobs are able to affect other jobs.
The link you sent seems to address that, thanks.


On Wed, Feb 24, 2021 at 4:25 PM Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 2/24/2021 6:31 AM, David Cohen wrote:
I was under the, apparently wrong, impression that setting
will suffice to kill jobs running over the requested memory.
I now understand that I have to back it up by a SYSTEM_PERIODIC_HOLD

Hi David,

How did you arrive at the conclusion that you need to do anything more than setting CGROUP_MEMORY_LIMIT_POLICY=HARD to have jobs placed on hold if they exceed the memory allocated to the slot?

As Christoph stated earlier, that should be sufficient, assuming you are running the HTCondor services with root privileges (i.e. as a system service) and you have BASE_CGROUP defined in your config (it is defined by default; did you change it?).
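
The three preconditions named above can be written down as a small checklist. This is only a sketch in Python: `hard_limit_problems` is a hypothetical helper, not an HTCondor API, and the `config` dictionary is assumed to hold values you collected yourself, for example from `condor_config_val`.

```python
def hard_limit_problems(uid, config):
    """List which of the preconditions from the message above are not met.

    uid    -- effective user id the HTCondor daemons run under (0 is root)
    config -- dict of config knob name to value, gathered by the caller
    """
    problems = []
    if uid != 0:
        problems.append("HTCondor is not running as root")
    if not config.get("BASE_CGROUP", "").strip():
        problems.append("BASE_CGROUP is not defined")
    policy = config.get("CGROUP_MEMORY_LIMIT_POLICY", "").strip().lower()
    if policy != "hard":
        problems.append("CGROUP_MEMORY_LIMIT_POLICY is not set to hard")
    return problems
```

An empty list means all three conditions hold. The value "htcondor" for BASE_CGROUP used when trying this out is only an illustrative guess based on the cgroup path shown earlier in the thread.
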
While I'm at it can I also use that method to remove jobs that are using more cores than requested (cpu usage > cpu requested)?

Assuming HTCondor is launched as root, it will automatically restrict CPU usage of jobs (using Linux cgroups) to not exceed the number of cores in the slot when there is contention for the cores.  That is, on an eight-core machine, with only a single, one-core slot running, and otherwise idle, the job running in the one slot could consume all eight CPUs concurrently.  If, however, all eight slots were running jobs, with each configured for one CPU, the CPU usage would be assigned equally to each job, regardless of the number of processes or threads in each job.
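
The behaviour described above can be illustrated with a small model. This Python sketch is a simplified picture of weight-proportional sharing with redistribution of idle capacity; it is not how the kernel scheduler or HTCondor is implemented, and the function name and the (weight, demand) representation are assumptions made for the example. The weight stands for the cores configured in the slot and the demand for how many cores the job tries to use.

```python
def allocate_cpu(total_cores, jobs):
    """Share total_cores among jobs given as (weight, demand) pairs.

    Jobs that want less than their weighted fair share get what they ask
    for; the leftover is split among the remaining jobs by weight.
    """
    alloc = [0.0] * len(jobs)
    active = set(range(len(jobs)))
    remaining = float(total_cores)
    eps = 1e-12
    while active and remaining > eps:
        total_weight = sum(jobs[i][0] for i in active)
        # Jobs whose outstanding demand fits within their current fair share.
        satisfied = [
            i for i in active
            if jobs[i][1] - alloc[i] <= remaining * jobs[i][0] / total_weight + eps
        ]
        if not satisfied:
            # Full contention: everyone is capped at the weighted share.
            for i in active:
                alloc[i] += remaining * jobs[i][0] / total_weight
            break
        for i in satisfied:
            need = jobs[i][1] - alloc[i]
            alloc[i] += need
            remaining -= need
            active.discard(i)
    return alloc
```

With `allocate_cpu(8, [(1, 8)])` the single job receives all eight cores, while `allocate_cpu(8, [(1, 8)] * 8)` gives each of the eight jobs one core, matching the two scenarios in the paragraph above.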

Because of this, few administrators see the need to stop jobs using more cores than requested, because the only scenario in which this could happen is one where no other user is impacted and the cores would otherwise go idle.  If for some reason you still want to do this, you may find the HOWTO at this page useful:
(specifically Option 3 on this page).

Hope the above helps
