
[HTCondor-users] Limiting memory used on the worker node with c-groups


Having had worker nodes hang many times because of memory exhaustion,
I am trying to figure out how we can prevent this. I believe the memory
exhaustion is due to some pathological job using far more memory than
it should.

The first question would be: does it make sense to use
SYSTEM_PERIODIC_REMOVE in the config of a worker node (startd), or does
it only work on the scheduler (thus reacting with a certain delay)?
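For reference, the kind of expression I have in mind is something like
this (just a sketch; MemoryUsage and RequestMemory are the job ClassAd
attributes, both in megabytes, and the exact form may need adjusting):

```
# Remove any job whose measured memory usage exceeds what it requested
SYSTEM_PERIODIC_REMOVE = (MemoryUsage > RequestMemory)
```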

Then, I tried different settings of CGROUP_MEMORY_LIMIT_POLICY.

I understand that the default setting is "none". In this case,
"memory.limit_in_bytes" is set to the node's detected memory divided by
the number of cores, and "memory.soft_limit_in_bytes" is 0.

I tried setting CGROUP_MEMORY_LIMIT_POLICY to "soft". It seems to do its
job, with jobs being removed with "Job has gone over memory limit of 6000 megabytes. Peak usage: 5926 megabytes." BUT: the result on the worker
nodes is a number of processes stuck in "D" (uninterruptible sleep)
state, which gives a high Unix load even though no CPU is consumed. No
new jobs are scheduled. It looks like the jobs are not killed cleanly.
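This is how I spot the stuck processes on the worker node (a standard
ps/awk one-liner, nothing HTCondor-specific):

```shell
# List processes in uninterruptible sleep ("D" state);
# these count toward the load average without using any CPU.
ps -eo state,pid,comm | awk '$1 == "D"'
```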

I am now trying with "hard". Let's see...
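i.e. in the worker node config (my understanding from the manual is that
with "hard", memory.limit_in_bytes is set to the slot's provisioned
memory, so the kernel OOM killer should act inside the job's cgroup):

```
CGROUP_MEMORY_LIMIT_POLICY = hard
```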

I have read this presentation :
... but I do not understand everything. Sorry.

This is HTCondor version 8.6.13. Also, please note that I have made
it so that the threshold is higher than the detected memory:

MEMORY = 1.5 * quantize( $(DETECTED_MEMORY), 1000 )
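As a sanity check on the arithmetic (quantize() rounds its first
argument up to the next multiple of the second, if I read the manual
right), here is the computation for a hypothetical node with 64331 MB
detected:

```python
def quantize(value, step):
    """Round value up to the next multiple of step (like HTCondor's quantize())."""
    return ((value + step - 1) // step) * step

detected_memory = 64331                           # MB, hypothetical DETECTED_MEMORY
memory = 1.5 * quantize(detected_memory, 1000)    # quantize -> 65000
print(memory)                                     # 97500.0
```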

Thank you in advance.


Jean-michel BARBET                    | Tel: +33 (0)2 51 85 84 86
Laboratoire SUBATECH Nantes France    | Fax: +33 (0)2 51 85 84 79
CNRS-IN2P3/Ecole des Mines/Universite | E-Mail: barbet@xxxxxxxxxxxxxxxxx