Re: [HTCondor-users] Using cgroups to limit job memory



On 03/04/2015 22:21, Roderick Johnstone wrote:
If so, would something like the following, (based on examples from the
wiki page), in an environment with cgroups enabled, place a job on hold
when the job process tree allocates more resident memory than in the
request_memory submit file attribute?

# Allow jobs to not be limited by request_memory otherwise
# this policy can never be triggered
CGROUP_MEMORY_LIMIT_POLICY=none

# hold jobs that are more than 10% over requested memory
MEMORY_EXCEEDED = ((MemoryUsage*1.1 > request_memory) =!= TRUE)
PREEMPT = $(PREEMPT)) || $(MEMORY_EXCEEDED)
WANT_SUSPEND = False
WANT_HOLD = $(MEMORY_EXCEEDED)
WANT_HOLD_REASON = ifThenElse( $(MEMORY_EXCEEDED), \
       "Your job used more resident memory than it requested.", \
       undefined )


I have not tested the above, but off the top of my head the idea looks
like it should work.  Note that the PREEMPT expression has a syntax
error due to an unmatched parenthesis - you probably wanted
   PREEMPT = ($(PREEMPT)) || $(MEMORY_EXCEEDED)
Also note that jobs will not be preempted until they exhaust their
MaxJobRetirementTime, which is the time HTCondor promises to let a job
run without being preempted for any reason. So if you want to hold jobs
immediately when they exceed their memory request, even if the jobs
have specified a MaxJobRetirementTime, and you are using HTCondor v8.2
or above, you will want to use the template at
  https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitCpuUsage
and just replace $(CPU_EXCEEDED) with $(MEMORY_EXCEEDED).


Just to complete this thread, and in case it is of use to others, the configuration I am now using with cgroups enabled is listed below and seems to work well. When a job's resident memory exceeds the request_memory value from its submit file, the job is put on hold and the hold reason is set.

My previous use of request_memory in the MEMORY_EXCEEDED expression above did not seem to work; it had to be replaced by RequestMemory.
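
A likely explanation, offered as an assumption rather than something confirmed in this thread: request_memory is the submit file command, while the attribute it sets in the job ClassAd is RequestMemory. Policy expressions are evaluated against ClassAd attributes, so a reference to request_memory is UNDEFINED and the comparison itself can never be TRUE. Roughly, with both forms shown as comments for illustration only:

# UNDEFINED for every job: the job ad has no attribute called request_memory
#   MEMORY_EXCEEDED = (MemoryUsage > request_memory)
# TRUE or FALSE once the job ad carries MemoryUsage
#   MEMORY_EXCEEDED = (MemoryUsage > RequestMemory)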

# Allow jobs to not be limited by request_memory otherwise
# this policy can never be triggered
CGROUP_MEMORY_LIMIT_POLICY = none

# Hold jobs that are using more memory than their request_memory
# Based on recipes at these two pages:
# https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitMemoryUsage
# Option 3 at: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitCpuUsage
# Note that to place the job on hold, we first eliminate any
# retirement time and preempt the job.

MEMORY_EXCEEDED = (MemoryUsage > RequestMemory)
PREEMPT = ($(PREEMPT:False)) || $(MEMORY_EXCEEDED)
MAXJOBRETIREMENTTIME = ifthenelse($(MEMORY_EXCEEDED),0,$(MAXJOBRETIREMENTTIME:0))
WANT_SUSPEND = ($(WANT_SUSPEND:False)) && $(MEMORY_EXCEEDED) =!= TRUE
WANT_HOLD = ($(WANT_HOLD:False)) || $(MEMORY_EXCEEDED)
WANT_HOLD_REASON = ifThenElse( $(MEMORY_EXCEEDED), \
       "Job used more resident memory than specified by request_memory.", \
       $(WANT_HOLD_REASON:UNDEFINED))
WANT_HOLD_SUBCODE = ifThenElse($(MEMORY_EXCEEDED),1,$(WANT_HOLD_SUBCODE:UNDEFINED))
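
If some headroom is wanted, as in the 10% from the first attempt, the margin needs to go on the request side, and the trailing =!= TRUE has to be dropped since it inverts the test. An untested sketch, assuming MemoryUsage and RequestMemory are both reported in MB, to be used in place of the MEMORY_EXCEEDED line above:

# Hold only when usage is more than 10% above the request
MEMORY_EXCEEDED = (MemoryUsage > RequestMemory * 1.1)

With request_memory = 1000 this would trigger only once MemoryUsage goes above 1100.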

Roderick