Re: [HTCondor-users] Periodic Hold for jobs exceeding memory and CPU requests

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Yes, that's on the worker.

CentOS7

/sys/fs/cgroup/memory/htcondor/condor_var_lib_condor_execute_slot1_9@wn214:
cgroup.clone_children  memory.force_empty              memory.kmem.slabinfo                memory.kmem.tcp.usage_in_bytes  memory.memsw.failcnt             memory.move_charge_at_immigrate  memory.soft_limit_in_bytes  memory.use_hierarchy
cgroup.event_control   memory.kmem.failcnt             memory.kmem.tcp.failcnt             memory.kmem.usage_in_bytes      memory.memsw.limit_in_bytes      memory.numa_stat                 memory.stat                 notify_on_release
cgroup.procs           memory.kmem.limit_in_bytes      memory.kmem.tcp.limit_in_bytes      memory.limit_in_bytes           memory.memsw.max_usage_in_bytes  memory.oom_control               memory.swappiness           tasks
memory.failcnt         memory.kmem.max_usage_in_bytes  memory.kmem.tcp.max_usage_in_bytes  memory.max_usage_in_bytes       memory.memsw.usage_in_bytes      memory.pressure_level            memory.usage_in_bytes

On Wed, Feb 24, 2021 at 5:21 PM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:

Hi David,

Is that on the worker?

Also, does your OS work with cgroups OK?

Please check whether the cgroups are set up as expected. The listing should look like this:

[root@batch1074 ~]# ls /sys/fs/cgroup/memory/htcondor/condor*
/sys/fs/cgroup/memory/htcondor/condor_var_lib_condor_execute_slot2_10@xxxxxxxxxxxxxxxxx:
cgroup.clone_children  memory.force_empty              memory.kmem.slabinfo                memory.kmem.tcp.usage_in_bytes  memory.memsw.failcnt             memory.move_charge_at_immigrate  memory.soft_limit_in_bytes  memory.use_hierarchy
cgroup.event_control   memory.kmem.failcnt             memory.kmem.tcp.failcnt             memory.kmem.usage_in_bytes      memory.memsw.limit_in_bytes      memory.numa_stat                 memory.stat                 notify_on_release
cgroup.procs           memory.kmem.limit_in_bytes      memory.kmem.tcp.limit_in_bytes      memory.limit_in_bytes           memory.memsw.max_usage_in_bytes  memory.oom_control               memory.swappiness           tasks
memory.failcnt         memory.kmem.max_usage_in_bytes  memory.kmem.tcp.max_usage_in_bytes  memory.max_usage_in_bytes       memory.memsw.usage_in_bytes      memory.pressure_level            memory.usage_in_bytes
<snip>
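
Beyond checking that the per-slot directory exists, it may help to look at the counters inside it. The following is a minimal sketch, not an HTCondor tool: it reads a few of the cgroup v1 files that appear in the listing above. The function names and the UNLIMITED_THRESHOLD cut-off are assumptions made for illustration. On cgroup v1 an unset limit is typically reported as a very large number, so a finite memory.limit_in_bytes is what one would hope to see when a hard limit is in force; that expectation is an assumption here, not something confirmed in this thread.

```python
from pathlib import Path

# Counter files that appear in the directory listings in this thread.
COUNTERS = (
    "memory.limit_in_bytes",
    "memory.usage_in_bytes",
    "memory.max_usage_in_bytes",
    "memory.failcnt",
)

# Assumption: a limit at or above this value means "no limit was applied".
UNLIMITED_THRESHOLD = 1 << 60


def read_memory_counters(cgroup_dir):
    """Read selected cgroup v1 memory counters; a missing file maps to None."""
    base = Path(cgroup_dir)
    counters = {}
    for name in COUNTERS:
        path = base / name
        counters[name] = int(path.read_text().strip()) if path.is_file() else None
    return counters


def limit_is_set(counters):
    """Return True if the cgroup carries a finite memory limit."""
    limit = counters["memory.limit_in_bytes"]
    return limit is not None and limit < UNLIMITED_THRESHOLD


if __name__ == "__main__":
    # Path taken from the listings in this thread (BASE_CGROUP = htcondor).
    root = Path("/sys/fs/cgroup/memory/htcondor")
    slot_dirs = sorted(root.glob("condor*")) if root.is_dir() else []
    for slot_dir in slot_dirs:
        counters = read_memory_counters(slot_dir)
        print(slot_dir.name, "limit set:", limit_is_set(counters), counters)
```

Run as root on the worker while a job is active in the slot. A non-zero memory.failcnt would suggest the job has hit its limit at least once.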

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

From: "David Cohen" <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
To: "Todd Tannenbaum" <tannenba@xxxxxxxxxxx>
CC: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 24 February 2021 15:49:30
Subject: Re: [HTCondor-users] Periodic Hold for jobs exceeding memory and CPU requests

Hi Todd,
~$ condor_config_val BASE_CGROUP
htcondor
~$ condor_config_val CGROUP_MEMORY_LIMIT_POLICY
HARD

And still, I recall at least two occasions when users' jobs ran over the requested memory.

As for CPU usage: since I only schedule 0.75 of the HT cores, for better performance, misbehaving jobs are able to affect other jobs.
The link you sent seems to address that, thanks.

David

On Wed, Feb 24, 2021 at 4:25 PM Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:

On 2/24/2021 6:31 AM, David Cohen wrote:

Hi,

I was under the, apparently wrong, impression that setting

CGROUP_MEMORY_LIMIT_POLICY = HARD

will suffice to kill jobs running over the requested memory.

I now understand that I have to back it up with a SYSTEM_PERIODIC_HOLD.

Hi David,

How did you arrive at the conclusion that you need to do anything more than setting CGROUP_MEMORY_LIMIT_POLICY=HARD to have jobs placed on hold if they exceed the memory allocated to the slot?

As Christoph stated earlier, that should be sufficient, assuming you are running the HTCondor services with root privileges (i.e. as a system service) and you have BASE_CGROUP defined in your config (it is defined by default... did you change it?).

While I'm at it, can I also use that method to remove jobs that are using more cores than requested (CPU usage > CPUs requested)?

Assuming HTCondor is launched as root, it will automatically restrict the CPU usage of jobs (using Linux cgroups) so that it does not exceed the number of cores in the slot when there is contention for the cores. That is, on an eight-core machine that is otherwise idle, with only a single one-core slot running, the job in that slot could consume all eight CPUs concurrently. If, however, all eight slots were running jobs, each configured for one CPU, the CPU time would be divided equally among the jobs, regardless of the number of processes or threads in each job.
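
The sharing behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function share_cpu is hypothetical, each job is assumed to be weighted by the number of cores in its slot, and the real division is done by the Linux scheduler, not by code like this. The model hands out CPU in proportion to the weights and passes any capacity a job cannot use on to the jobs that still want more.

```python
def share_cpu(total_cores, slot_cores, demand):
    """Toy proportional-share model (not the kernel's algorithm).

    total_cores: cores on the machine
    slot_cores:  weight of each job, assumed equal to its slot's core count
    demand:      cores each job would use if nothing held it back
    Returns the cores each job receives.
    """
    alloc = [0.0] * len(demand)
    active = [i for i, d in enumerate(demand) if d > 0]
    remaining = float(total_cores)
    while active and remaining > 1e-9:
        weight = sum(slot_cores[i] for i in active)
        # Jobs whose outstanding demand fits inside their weighted share.
        satisfied = [
            i for i in active
            if demand[i] - alloc[i] <= remaining * slot_cores[i] / weight
        ]
        if not satisfied:
            # Full contention: split what is left in proportion to the weights.
            for i in active:
                alloc[i] += remaining * slot_cores[i] / weight
            break
        for i in satisfied:
            need = demand[i] - alloc[i]
            alloc[i] += need
            remaining -= need
        active = [i for i in active if i not in satisfied]
    return alloc


if __name__ == "__main__":
    # One one-core slot busy on an idle eight-core machine, job runs 8 threads.
    print(share_cpu(8, [1], [8]))
    # Eight one-core slots busy; the first job runs 8 threads, the others 1.
    print(share_cpu(8, [1] * 8, [8] + [1] * 7))
```

In the first call the lone job is given all eight cores because nothing else wants them. In the second, every job ends up with one core, however many threads the first job starts.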

Because of this, few administrators see the need to stop jobs using more cores than requested: the only scenario in which this can happen is one where no other user is impacted and the cores would otherwise go idle. If for some reason you still want to do this, you may find the HOWTO on this page useful:
  https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitCpuUsage
(specifically Option 3 on that page).

Hope the above helps
Todd

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
