
[HTCondor-users] Resource usage RFC (Re: Using cgroups to limit job memory)



On 04/01/2015 09:20 AM, Roderick Johnstone wrote:

> While this behaviour is good for the machine owner, it's less than ideal
> for the job owner, since the job may continue but only very slowly because
> it's paging a lot. This condition might not be obvious to the job owner.

This is one of the things I never understood about condor's
request_memory: it sure makes sense if your execute node runs a batch OS,
but from my vague recollection of Operating Systems 101 it just can't
work all that well on a virtual memory system with the default deferred
allocation.

> Although this seems to be the behaviour documented in the manual, I'm
> sure I have seen a description of a configuration in which the job can
> be placed on hold with a suitable message if it tries to allocate more
> memory than it requests, although I can't find that now.
> 
> So, is it possible to configure what happens when the job exceeds the
> requested memory at all?

(Courtesy of Lauren Michael)

periodic_hold = ( MemoryUsage >= ( ( RequestMemory ) * 3 / 2 ) )

where 3/2 is your margin. You'll want a

periodic_release = (JobStatus == 5) ...

with that. And, depending on your situation, either bump request_memory
or blacklist the machine. The latter is in the wiki recipes; the former
(again, Lauren's) is:

+MemoryUsage = ( 800 ) * 2 / 3
request_memory = ( MemoryUsage ) * 3 / 2
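To make the arithmetic concrete, here is a small Python sketch of what the
two expressions above work out to for an 800 MB job. It assumes the
expressions are evaluated with truncating integer division, as in C (hence
`//`); the function names are made up for illustration and are not anything
condor provides.

```python
def scaled(value_mb, num, den):
    """Evaluate ( value ) * num / den with truncating integer division."""
    return value_mb * num // den


def hold_threshold(request_memory_mb):
    """MemoryUsage at which the periodic_hold expression becomes true."""
    return scaled(request_memory_mb, 3, 2)


def seeded_request(target_mb):
    """Seed MemoryUsage from a target, then derive request_memory from it."""
    seed = scaled(target_mb, 2, 3)       # +MemoryUsage = ( target ) * 2 / 3
    return seed, scaled(seed, 3, 2)      # request_memory = ( MemoryUsage ) * 3 / 2


print(hold_threshold(800))   # 1200
print(seeded_request(800))   # (533, 799)
```

Under that assumption the seed of 533 gives a first request of 799 MB rather
than exactly 800, which is close enough for matching purposes.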

Keep in mind that the MemoryUsage reported by the Linux kernel can be
skewed by e.g. copy-on-write and is wildly inaccurate for some
applications. Also, the above can potentially grow your request_memory to
the point where no nodes match and the job goes on hold forever, so
you'll want a periodic_remove on top.
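A rough Python sketch of how quickly that can happen. The assumptions are
mine: truncating integer division, a pessimistic job whose reported
MemoryUsage always reaches whatever it last requested, and a hypothetical
pool whose largest slot has 4096 MB. The function name is illustrative only.

```python
def request_history(seed_usage_mb, largest_slot_mb):
    """List successive request_memory values until one exceeds the largest slot.

    Assumes each run reports MemoryUsage equal to its previous request, so
    the next request is always 3/2 of the last one.
    """
    if seed_usage_mb < 2:
        # 1 * 3 // 2 == 1, so the request would never grow.
        raise ValueError("seed_usage_mb must be at least 2")
    history = []
    usage = seed_usage_mb
    while True:
        request = usage * 3 // 2      # request_memory = ( MemoryUsage ) * 3 / 2
        history.append(request)
        if request > largest_slot_mb:
            return history            # the last entry no longer matches any node
        usage = request               # pessimistic: the job fills what it asked for


print(request_history(533, 4096))
# [799, 1198, 1797, 2695, 4042, 6063]
```

With these made-up numbers the job stops matching after five reruns, which
is the point where a periodic_remove has to step in.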

Personally I think this should be automagically handled by condor: even
those of us who understand how it works increasingly don't know what the
application's requirements are until we grep through the dag.nodes.log
after the fact. To think that a wider user base can figure it all out is
a delusion IMO.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
