
Re: [HTCondor-users] Using cgroups to limit job memory



On 4/2/2015 9:30 AM, Roderick Johnstone wrote:
Todd

The HOWTO recipes are your friend.  From the HTCondor.org homepage look
for "HOWTO recipes"; the direct link is
   https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAdminRecipes

Thanks for the pointer, yes these are really useful.

Glad to hear it!


OK, I think I understand this, but it would be good to have
clarification on one thing, please:

1) Is MemoryUsage tracking the resident memory usage (i.e. excluding any
virtual memory) of the whole job process tree when cgroups are configured?


Yes.

And even if you do not configure/use cgroups, HTCondor attempts to make MemoryUsage mean the same thing. It does this by summing the resident set size of each process in the job process tree. This may overestimate the memory usage compared to what cgroups would report (imagine a job that has dozens of child processes that all load the same shared library), but it is a pretty reasonable approximation.

Without cgroups, HTCondor tracks which processes belong to a job via several different algorithms that can work very accurately in practice, especially if you give HTCondor a range of GIDs to use (see http://goo.gl/LVDSys).
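
To make the shared-library overestimate concrete, here is a minimal sketch in Python. It is only a toy model, not HTCondor code: the Proc type, the function names, and the memory figures are all made up for illustration, and "count each shared mapping once" is a simplification of how a cgroup-style accounting behaves.

```python
from typing import Dict, List, NamedTuple


class Proc(NamedTuple):
    """Hypothetical per-process memory summary (all sizes in KB)."""
    private_kb: int
    shared_kb: Dict[str, int]  # resident shared mappings, keyed by library name


def summed_rss_kb(tree: List[Proc]) -> int:
    # Per-process RSS includes shared pages, so a library loaded by
    # N processes is counted N times in the sum.
    return sum(p.private_kb + sum(p.shared_kb.values()) for p in tree)


def shared_once_kb(tree: List[Proc]) -> int:
    # Simplified group-level accounting: private pages per process,
    # plus each distinct shared mapping counted a single time.
    shared: Dict[str, int] = {}
    for p in tree:
        shared.update(p.shared_kb)
    return sum(p.private_kb for p in tree) + sum(shared.values())


# 24 children, each with 10 MB private memory and the same 50 MB library.
tree = [Proc(10_000, {"libbig.so": 50_000}) for _ in range(24)]
print(summed_rss_kb(tree), shared_once_kb(tree))
```

In this made-up case the summed figure is several times the deduplicated one, which is the kind of gap to keep in mind when comparing MemoryUsage with and without cgroups.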


If so, would something like the following, (based on examples from the
wiki page), in an environment with cgroups enabled, place a job on hold
when the job process tree allocates more resident memory than in the
request_memory submit file attribute?

# Allow jobs to not be limited by request_memory otherwise
# this policy can never be triggered
CGROUP_MEMORY_LIMIT_POLICY=none

# hold jobs that are more than 10% over requested memory
MEMORY_EXCEEDED = ((MemoryUsage*1.1 > request_memory) =!= TRUE)
PREEMPT = $(PREEMPT)) || $(MEMORY_EXCEEDED)
WANT_SUSPEND = False
WANT_HOLD = $(MEMORY_EXCEEDED)
WANT_HOLD_REASON = ifThenElse( $(MEMORY_EXCEEDED), \
       "Your job used more resident memory than it requested.", \
       undefined )


Without actually testing the above, off the top of my head the idea looks like it should work. Note that the PREEMPT expression above has a syntax error due to an unmatched parenthesis; you probably wanted
  PREEMPT = ($(PREEMPT)) || $(MEMORY_EXCEEDED)

Also note that jobs will not be preempted until they exhaust their MaxJobRetirementTime, which is the time HTCondor promises to let the job run without being preempted for any reason. So if you want to immediately hold jobs that exceed their memory request, even if the jobs have specified a MaxJobRetirementTime, and you are using HTCondor v8.2 or above, you will want to use the template at
 https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitCpuUsage
and just replace $(CPU_EXCEEDED) with $(MEMORY_EXCEEDED).
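
A side note on the quoted MEMORY_EXCEEDED expression, sketched in Python rather than ClassAd syntax. In ClassAd expressions, =!= is the "is not identical to" operator, so (...) =!= TRUE is TRUE whenever the inner comparison is FALSE or UNDEFINED. Separately, MemoryUsage*1.1 > request_memory crosses over at roughly 91% of the request, not 110%. The function names below are hypothetical and None stands in for UNDEFINED; this only models the logic and is not HTCondor code, so check the behavior on your own pool before relying on it.

```python
from typing import Optional


def is_not_identical_to_true(value: Optional[bool]) -> bool:
    """Model of ClassAd `x =!= TRUE`; None stands in for UNDEFINED."""
    return value is not True


def quoted_memory_exceeded(usage_mb: Optional[float],
                           request_mb: Optional[float]) -> bool:
    """Model of ((MemoryUsage*1.1 > request_memory) =!= TRUE) as quoted."""
    if usage_mb is None or request_mb is None:
        inner = None  # comparison with UNDEFINED is UNDEFINED
    else:
        inner = usage_mb * 1.1 > request_mb
    return is_not_identical_to_true(inner)


def ten_percent_over(usage_mb: Optional[float],
                     request_mb: Optional[float]) -> bool:
    """What the comment describes: usage more than 10% above the request."""
    if usage_mb is None or request_mb is None:
        return False  # treat missing data as "not exceeded"
    return usage_mb > 1.1 * request_mb


# A job using half of its 1000 MB request, and one using double.
for usage in (500, 2000):
    print(usage, quoted_memory_exceeded(usage, 1000), ten_percent_over(usage, 1000))
```

Under this model the quoted expression is true for the well-behaved job and false for the one that is over its request, which is the opposite of what the comment describes; the second function is one possible reading of the intended test.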

Nice work Roderick, thanks for sharing!

Hope the above helps,
Todd




Also likely of interest is
   https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitCpuUsage

Thanks, that's my next project!


Hope the above helps.  Also interested in any thoughts you may have to
improve the above HOWTOs.

Thanks again. See above for (minimal) feedback.

Roderick

regards,
Todd


--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685