
Re: [HTCondor-users] Periodic Hold for jobs exceeding memory and CPU requests



Yes, that's on the worker, which runs CentOS 7.

/sys/fs/cgroup/memory/htcondor/condor_var_lib_condor_execute_slot1_9@wn214:
cgroup.clone_children memory.force_empty memory.kmem.slabinfo memory.kmem.tcp.usage_in_bytes memory.memsw.failcnt memory.move_charge_at_immigrate memory.soft_limit_in_bytes memory.use_hierarchy
cgroup.event_control memory.kmem.failcnt memory.kmem.tcp.failcnt memory.kmem.usage_in_bytes memory.memsw.limit_in_bytes memory.numa_stat memory.stat notify_on_release
cgroup.procs memory.kmem.limit_in_bytes memory.kmem.tcp.limit_in_bytes memory.limit_in_bytes memory.memsw.max_usage_in_bytes memory.oom_control memory.swappiness tasks
memory.failcnt memory.kmem.max_usage_in_bytes memory.kmem.tcp.max_usage_in_bytes memory.max_usage_in_bytes memory.memsw.usage_in_bytes memory.pressure_level memory.usage_in_bytes
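
Seeing the files only shows that the cgroup exists. To check whether a limit is actually applied to the slot, one could read a few of the counters listed above. This is a minimal sketch, not an HTCondor tool: the function name memory_summary and the UNLIMITED_THRESHOLD cutoff are my own assumptions, and it only reads cgroup v1 files that appear in the listing.

```python
from pathlib import Path

# Assumption: cgroup v1 reports "no limit" as a huge value close to 2**63,
# so anything at or above this threshold is treated as unlimited.
UNLIMITED_THRESHOLD = 2**60


def _read_int(path):
    """Read a single integer from a cgroup control file."""
    return int(Path(path).read_text().strip())


def memory_summary(cgroup_dir):
    """Summarize the cgroup v1 memory counters found in cgroup_dir."""
    d = Path(cgroup_dir)
    limit = _read_int(d / "memory.limit_in_bytes")
    return {
        "limit_bytes": limit,
        "usage_bytes": _read_int(d / "memory.usage_in_bytes"),
        "peak_bytes": _read_int(d / "memory.max_usage_in_bytes"),
        # Number of times usage ran into the limit.
        "limit_hits": _read_int(d / "memory.failcnt"),
        "limited": limit < UNLIMITED_THRESHOLD,
    }
```

Calling memory_summary() on the slot directory shown above while a job is running should, if the hard limit is in effect, report "limited" as True with a limit_bytes value in the range of the slot's memory. If "limited" is False, the cgroup exists but no limit has been written to it.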



On Wed, Feb 24, 2021 at 5:21 PM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi David,

Is that on the worker?

Also, does your OS work with cgroups OK?

Please check that the cgroups are set up as expected; the listing should look like this:

[root@batch1074 ~]# ls /sys/fs/cgroup/memory/htcondor/condor*
/sys/fs/cgroup/memory/htcondor/condor_var_lib_condor_execute_slot2_10@xxxxxxxxxxxxxxxxx:
cgroup.clone_children memory.force_empty memory.kmem.slabinfo memory.kmem.tcp.usage_in_bytes memory.memsw.failcnt memory.move_charge_at_immigrate memory.soft_limit_in_bytes memory.use_hierarchy
cgroup.event_control memory.kmem.failcnt memory.kmem.tcp.failcnt memory.kmem.usage_in_bytes memory.memsw.limit_in_bytes memory.numa_stat memory.stat notify_on_release
cgroup.procs memory.kmem.limit_in_bytes memory.kmem.tcp.limit_in_bytes memory.limit_in_bytes memory.memsw.max_usage_in_bytes memory.oom_control memory.swappiness tasks
memory.failcnt memory.kmem.max_usage_in_bytes memory.kmem.tcp.max_usage_in_bytes memory.max_usage_in_bytes memory.memsw.usage_in_bytes memory.pressure_level memory.usage_in_bytes
<snip>

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "David Cohen" <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
To: "Todd Tannenbaum" <tannenba@xxxxxxxxxxx>
CC: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 24 February 2021 15:49:30
Subject: Re: [HTCondor-users] Periodic Hold for jobs exceeding memory and CPU requests

Hi Todd,
~$ condor_config_val BASE_CGROUP
htcondor
~$ condor_config_val CGROUP_MEMORY_LIMIT_POLICY
HARD

And still I recall at least two occasions when users were running over the requested memory.

As for CPU usage: for better performance I only schedule 0.75 of the HT cores, which allows misbehaving jobs to affect other jobs.
The link you sent seems to address that, thanks.

David

On Wed, Feb 24, 2021 at 4:25 PM Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 2/24/2021 6:31 AM, David Cohen wrote:
Hi,
I was under the, apparently wrong, impression that setting
CGROUP_MEMORY_LIMIT_POLICY = HARD
will suffice to kill jobs running over the requested memory.
I now understand that I have to back it up by a SYSTEM_PERIODIC_HOLD

Hi David,

How did you arrive at the conclusion that you need to do anything more than setting CGROUP_MEMORY_LIMIT_POLICY=HARD to have jobs placed on hold if they exceed the memory allocated to the slot?

As Christoph stated earlier, that should be sufficient, assuming you are running the HTCondor services with root privileges (i.e. as a system service) and you have BASE_CGROUP defined in your config (it is defined by default... did you change it?).

While I'm at it, can I also use that method to remove jobs that are using more cores than requested (CPU usage > CPUs requested)?


Assuming HTCondor is launched as root, it will automatically restrict the CPU usage of jobs (using Linux cgroups) so that it does not exceed the number of cores in the slot when there is contention for the cores. That is, on an eight-core machine with only a single one-core slot running, and otherwise idle, the job running in that slot could consume all eight CPUs concurrently. If, however, all eight slots were running jobs, each configured for one CPU, the CPU time would be divided equally among the jobs, regardless of the number of processes or threads in each job.
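
To make the two scenarios above concrete, here is a small sketch of the arithmetic. It is a simplified model of proportional sharing under contention, not HTCondor or kernel code: the function name share_cpu is hypothetical, and I am assuming each job's weight is the number of cores in its slot (weights must be positive).

```python
def share_cpu(total_cores, jobs):
    """Weighted max-min fair split of total_cores.

    jobs is a list of (weight, demand) pairs: weight is the number of cores
    in the slot, demand is how many cores the job would use if unconstrained.
    Returns the cores each job ends up with, in the same order.
    """
    alloc = [0.0] * len(jobs)
    active = set(range(len(jobs)))
    remaining = float(total_cores)
    while active and remaining > 0:
        total_weight = sum(jobs[i][0] for i in active)
        # Jobs whose demand fits inside their weighted share get all they ask for.
        satisfied = [i for i in active
                     if jobs[i][1] <= remaining * jobs[i][0] / total_weight]
        if not satisfied:
            # Contention: every remaining job is capped at its weighted share.
            for i in active:
                alloc[i] = remaining * jobs[i][0] / total_weight
            break
        for i in satisfied:
            alloc[i] = jobs[i][1]
            remaining -= jobs[i][1]
            active.discard(i)
    return alloc
```

With one one-core slot running an eight-thread job on an idle eight-core machine, share_cpu(8, [(1, 8)]) returns [8.0]. With eight such jobs, share_cpu(8, [(1, 8)] * 8) gives each job 1.0 core. A mixed case such as share_cpu(8, [(1, 0.5), (1, 8)]) returns [0.5, 7.5]: the greedy job only picks up cores the other job leaves idle.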

Because of this, few administrators see the need to stop jobs from using more cores than requested: the only scenario in which this can happen is one where no other user is impacted and the cores would otherwise sit idle. If for some reason you still want to do this, you may find the HOWTO at this page useful:
 https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitCpuUsage
(specifically Option 3 on this page).

Hope the above helps
Todd




_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/