
Re: [HTCondor-users] badput

Re the below -

I don't think Condor (or I suppose I should get used to saying 'HTCondor') will do anything by default when a job exceeds the Memory provisioned in the slot. But you can configure it to do whatever you want...

So when a job starts using more memory than what is available in the slot, HTCondor could
  1. Do nothing (this is how the default config files are set up).
  2. Evict the job.
  3. Put the job on hold, perhaps with a hold message that says what happened.
  4. <whatever other policy you care to imagine...>

If, for example, you want to do #2 above, you should put something like the following in your nodes' condor_config file(s) to tell your startds to preempt:

   PREEMPT = ( $(PREEMPT) ) || ( MemoryUsage > Memory )

In the above, Memory = the amount of RAM in MB assigned to that slot (static or dynamic), and MemoryUsage = what Condor thinks the RAM usage of the job is. How it computes this memory usage is configurable, but in Condor v7.8 it defaults to the sum of the ResidentSetSize of all processes in the job on that slot. The PREEMPT expression is only evaluated periodically (every 10 or 20 seconds, iirc), so as David mentioned below, it is possible for a job that allocates memory very rapidly to run the system out of RAM before HTCondor reacts. We are improving this in HTCondor v7.9.2 on Linux by adding the ability to enforce memory limits via Linux cgroups.
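As a concrete sketch of what the cgroup approach would look like on a v7.9.2+ Linux execute node (knob names from memory, so double-check the manual before relying on them):

   # Parent cgroup under which per-slot cgroups are created
   BASE_CGROUP = htcondor
   # "hard" = the kernel enforces the slot's Memory as a hard limit;
   # "soft" = only enforced when the machine is under memory pressure
   CGROUP_MEMORY_LIMIT_POLICY = hard

The advantage over a PREEMPT expression is that the kernel enforces the limit immediately, rather than waiting for the next policy evaluation cycle.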

If, for example, you wanted to do #3 above, use WANT_HOLD and WANT_HOLD_REASON instead of PREEMPT, perhaps like so:

   MEMORY_EXCEEDED = MemoryUsage > Memory
   WANT_HOLD = $(MEMORY_EXCEEDED)
   WANT_HOLD_REASON = \
       ifThenElse( $(MEMORY_EXCEEDED), \
           "Your job used too much memory.", \
           undefined )

Note that all of the above assumes HTCondor version 7.8 or later. Disclaimer: this is off the top of my head; I didn't test it out.
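One more note: the flip side of any of these policies is making sure users request a realistic amount of memory up front, so the slot is provisioned appropriately in the first place. With dynamic (partitionable) slots, that is a line like the following in the submit file (request_memory is in MB; 2048 here is just an example value):

   request_memory = 2048

The policy expressions above then compare the job's actual usage against what it asked for.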

Hope the above helps,

On 11/15/2012 2:44 PM, David Brodbeck wrote:

On Wed, Nov 14, 2012 at 1:27 AM, Ian Cottam <Ian.Cottam@xxxxxxxxxxxxxxxx
<mailto:Ian.Cottam@xxxxxxxxxxxxxxxx>> wrote:

    A colleague just asked me:

    "When a Condor node runs out of memory - to the point that it starts
    evicting jobs - does it:

    1. Evict the most recently started job to minimize "badput"?
    2. Evict the first job that requests more memory when all the memory has
    been exhausted?
    3. Some other strategy?"

    I'm not sure.

In my experience, it's the first job to exceed the memory available for
its slot, or (if dynamic slots are in use) its RequestMemory setting.
Note that if the machine as a whole runs out of RAM and swap before
Condor reacts, the OS's out of memory reactions come into play.  In
Linux what happens at that point is configurable, but the default is for
the kernel to start killing off processes until there's enough free RAM
to proceed.  I'm not sure what heuristics it uses to decide what to kill
off, but it usually seems to be the largest process first.

David Brodbeck
System Administrator, Linguistics
University of Washington

HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685