
Re: [HTCondor-users] Limiting memory used on the worker node with c-groups



Hi Jean-Michel,

just an idea, but can you check whether the out-of-memory control is handled by the kernel or by Condor?

As far as I understand [1], something like
> cat /sys/fs/cgroup/memory/system.slice/condor.service/SLOT/memory.oom_control
  oom_kill_disable 1
  under_oom 0
should indicate that the kernel OOM killer is disabled for that cgroup, i.e. the kernel itself is not killing processes. According to [1], tasks in such a cgroup may instead be put to sleep while it is under OOM, and the behaviour might also depend on the parent cgroup's OOM settings.
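
To make the check above repeatable across slots, here is a minimal Python sketch that parses the contents of a memory.oom_control file. The function names are made up for illustration, and the meaning of the two fields is taken from [1] (cgroup v1); I have not tested this against a live HTCondor slot.

```python
def parse_oom_control(text):
    """Parse the "key value" lines of a cgroup-v1 memory.oom_control file."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name integer" lines.
        if len(parts) == 2 and parts[1].isdigit():
            values[parts[0]] = int(parts[1])
    return values


def kernel_oom_killer_enabled(values):
    """Per [1], oom_kill_disable 1 means the kernel OOM killer is disabled."""
    return values.get("oom_kill_disable", 0) == 0


# Sample text copied from the output shown above.
sample = "oom_kill_disable 1\nunder_oom 0\n"
state = parse_oom_control(sample)
print(state, kernel_oom_killer_enabled(state))
```

On a worker node one would read the real file (the path depends on the slot's cgroup) and pass its contents to parse_oom_control.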


Cheers,
  Thomas

[1]
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt


On 24/04/2020 08.43, Jean-Michel Barbet wrote:
Hello,

Having had worker nodes hang many times because of memory exhaustion,
I am trying to figure out how we can prevent this. I believe the memory
exhaustion is due to some kind of pathological job using far more memory
than it should.

The first question would be: does it make sense to use
SYSTEM_PERIODIC_REMOVE in the config of a worker node (startd), or does
it work only on the scheduler (thus reacting with a certain delay)?

Then I tried different settings of CGROUP_MEMORY_LIMIT_POLICY.

I understand that the default setting is "none". In this case, in
/sys/fs/cgroup/memory/htcondor/condor_dlocal_htcondor_slot1\@worker,
"memory.limit_in_bytes" is set to the node's detected memory divided by
the number of cores and "memory.soft_limit_in_bytes" is 0.

I tried setting CGROUP_MEMORY_LIMIT_POLICY to "soft". It seems to do its
job, with jobs being removed with "Job has gone over memory limit of
6000 megabytes. Peak usage: 5926 megabytes." BUT: the result on the
worker nodes is a number of processes in "Deferred" status, which gives
a high Unix load even though no CPU is consumed. No new jobs are
scheduled. It looks like the jobs are not killed cleanly.

I am now trying with "hard". Let's see...

I have read this presentation :
https://research.cs.wisc.edu/htcondor/HTCondorWeek2017/presentations/WedDownes_cgroups.pdf
... but I do not understand everything. Sorry.

This is HTCondor version 8.6.13. Also, please note that I have made
it so that the threshold is higher than the detected memory:

MEMORY = 1.5 * quantize( $(DETECTED_MEMORY), 1000 )
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory,100)
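
To illustrate what the two expressions above compute, here is a small Python sketch. It assumes that the ClassAd quantize(a, b) function rounds a up to the next multiple of b for positive integers; the input values are hypothetical and chosen only for the arithmetic, they are not measurements from our nodes.

```python
def quantize(value, step):
    """Round a positive integer up to the next multiple of step.

    Sketch of the assumed behaviour of ClassAd quantize() for integers.
    """
    return -(-value // step) * step  # integer ceiling division, then scale


# Hypothetical detected memory in MB (not a real node value).
detected_memory_mb = 63500
memory_mb = 1.5 * quantize(detected_memory_mb, 1000)

# Hypothetical RequestMemory in MB.
request_memory_mb = quantize(5926, 100)

print(memory_mb, request_memory_mb)
```

With these inputs, the advertised MEMORY is 1.5 times the rounded detected memory, and the request is rounded up to the next 100 MB.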

Thank you in advance.

JM

