Re: [HTCondor-users] Preempt jobs which exceed their request_memory - but no parallel universe?



On Tue, Mar 03, 2015 at 10:36:30AM -0600, Greg Thain wrote:
> On 03/03/2015 05:31 AM, Steffen Grunewald wrote:
> >I'm confused.
> >
> >I have a couple of users who underestimate the memory their jobs
> >would attempt to allocate, and as a result some worker nodes end
> >up swapping heavily.
> >I tried to get those jobs preempted, and sent back into the queue
> >with their updated (ImageSize) request_memory:
> >
> ># Let job use its declared amount of memory and some more
> >MEMORY_EXTRA            = 2048
> >MEMORY_ALLOWED          = (Memory + $(MEMORY_EXTRA)*Cpus)
> ># Get the current footprint
> >MEMORY_CURRENT          = (ImageSize/1024)
> ># Exceeds expectations?
> >MEMORY_EXCEEDED         = $(MEMORY_CURRENT) > $(MEMORY_ALLOWED)
> ># If exceeding, preempt
> >#[preset]PREEMPT        = False
> >PREEMPT                 = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
> >WANT_SUSPEND            = False
> >
> >
> This should all work.  Can you wrap your PREEMPT expression in the
> debug() function like this:
> 
> PREEMPT = debug($(PREEMPT) || ($(MEMORY_EXCEEDED)))

I assume this also requires raising the debug level in the config (and
extra disk space for the logs), right?
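
If I read the manual correctly, debug() writes its evaluation trace to the
log of the daemon that evaluates the expression, at the D_FULLDEBUG level.
On the worker nodes that would mean something like the following. This is
an untested sketch: the assumption that the startd is the relevant daemon
and the log size cap are mine.

```
# Assumption: the startd evaluates PREEMPT, so debug() output lands in its log
STARTD_DEBUG   = D_FULLDEBUG
# Assumption: arbitrary cap (in bytes) so the larger log does not fill the disk
MAX_STARTD_LOG = 100000000
```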

> What are WANT_VACATE and KILL set to?  If you don't want to give
> these jobs a grace period, you probably want WANT_VACATE = false.

Certainly (that's been the policy for years):
$ condor_config_val -dump | grep -i WANT_
WANT_SUSPEND = False
WANT_UDP_COMMAND_SOCKET = true
WANT_VACATE = False
WANT_XML_LOG = false
$ condor_config_val -dump | grep -i KILL
KILL = False
KILLING_TIMEOUT = 30
VM_KILLING_TIMEOUT = 60
WINDOWS_SOFTKILL = 

For the "exclude parallel universe from preemption" part, I will now use
PREEMPT                 = ($(PREEMPT)) || ($(MEMORY_EXCEEDED) && (JobUniverse =!= 11))
(I'm afraid "PREEMPT_VANILLA = False" was the reason preemption was not
happening for vanilla universe jobs; I have removed it from the config now.)
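
Putting the pieces together, the relevant part of the startd policy would
then look roughly like this. It is a sketch assembled from the settings
quoted above, not a tested configuration. It assumes the preset PREEMPT is
False, and the parentheses around MEMORY_EXCEEDED are added here only for
readability.

```
# Let a job use its declared amount of memory plus some extra per core (MB)
MEMORY_EXTRA    = 2048
MEMORY_ALLOWED  = (Memory + $(MEMORY_EXTRA)*Cpus)
# Current footprint: ImageSize is reported in KiB, so convert to MB
MEMORY_CURRENT  = (ImageSize/1024)
# Exceeds expectations?
MEMORY_EXCEEDED = ($(MEMORY_CURRENT) > $(MEMORY_ALLOWED))
# Preempt on excess memory, except for parallel universe jobs (JobUniverse 11)
PREEMPT         = ($(PREEMPT)) || ($(MEMORY_EXCEEDED) && (JobUniverse =!= 11))
# No suspension and no grace period for preempted jobs
WANT_SUSPEND    = False
WANT_VACATE     = False
```

Since =!= never evaluates to UNDEFINED, the universe test is True when
JobUniverse is missing from the ad, so only jobs that explicitly report
universe 11 are exempt.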

Let's see what happens...

Thanks,
 Steffen