I set up cgroups on my HTCondor cluster some months ago. I expected cgroups to handle soft limits and HTCondor to remove jobs via SYSTEM_PERIODIC_REMOVE when their memory usage exceeds twice the requested memory. However, last week we had a user wreaking havoc on the nodes and using up to 35GB of RSS when his limit should have been 4GB.
My settings are as follows:
* On the WNs
# Enable CGROUP
BASE_CGROUP = /system.slice/condor.service
CGROUP_MEMORY_LIMIT = soft
* On the head node
RemoveMemoryUsage = ( ResidentSetSize_RAW > 2000*RequestMemory )
SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage) || <OtherParameters>
This is a setup other sites have.
However, the cgroup doesn't have any limit set, neither soft nor hard.
So I have two questions:
1) Why didn't SYSTEM_PERIODIC_REMOVE work? Here is an example of a job that exceeded the 4GB limit:
condor_history 66469.0 -autoformat ClusterId 2000*RequestMemory ResidentSetSize_RAW
66469 4000000 34723028
2) Shouldn't HTCondor set the job's soft limit with this configuration? Or is the site expected to set the soft limit separately?
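
As a sanity check on the numbers in question 1, here is a small Python sketch of the comparison that RemoveMemoryUsage is meant to perform. It assumes the usual HTCondor units (RequestMemory in MiB, ResidentSetSize_RAW in KiB); the function name exceeds_limit is made up for illustration, and the RequestMemory value of 2000 is inferred from the condor_history output above, not read from the job ad directly.

```python
def exceeds_limit(rss_raw_kib, request_memory_mib, factor=2):
    """Mirror of RemoveMemoryUsage: ResidentSetSize_RAW > 2000*RequestMemory.

    RequestMemory is in MiB and ResidentSetSize_RAW in KiB, so multiplying
    the request by 1000 converts it (approximately) to KiB, and the factor
    of 2 gives "twice the requested memory".
    """
    return rss_raw_kib > factor * 1000 * request_memory_mib


# Values for job 66469.0; RequestMemory is inferred from 4000000 / 2000.
request_memory_mib = 2000
rss_raw_kib = 34723028

print(exceeds_limit(rss_raw_kib, request_memory_mib))  # True
```

So, under these assumptions, the expression itself should have evaluated to true for this job, which suggests the problem is in when or where it gets evaluated rather than in the arithmetic.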