
Re: [HTCondor-users] badput



On 11/15/2012 3:26 PM, Todd Tannenbaum wrote:
Re the below -

I don't think Condor (or I suppose I should get used to saying
'HTCondor') will do anything by default when a job exceeds the Memory
provisioned in the slot.  But you can configure it to do whatever you
want...


Looks like TJ just updated the following HOWTO page with most of the info from my previous post:


So when a job starts using more memory than what is available in the
slot, HTCondor could
   1. Do nothing (this is how the default config files are set up)
   2. Evict the job
   3. Put the job on hold, perhaps with a hold message that says what
happened.
   4. <whatever other policy you care to imagine...>

If, for example, you want to do #2 above, you should put something like
the following in your nodes' condor_config file(s) to tell your startds
to preempt:

    PREEMPT = ( $(PREEMPT) ) || ( MemoryUsage > Memory )

In the above, Memory is the amount of RAM in MB assigned to that slot
(static or dynamic), and MemoryUsage is what HTCondor thinks the RAM usage
of the job is.  How it computes this memory usage is configurable, but
in v7.8 it defaults to the sum of the ResidentSetSize for all
processes in the job on that slot.  The PREEMPT expression is only
evaluated every 10 or 20 seconds iirc, so as David mentioned below, it is
possible for a job that allocates memory very rapidly to run the system
out of RAM before HTCondor reacts.  We are improving this in HTCondor
v7.9.2 on Linux by adding the ability to enforce memory limits via Linux
cgroups.
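
For reference, a rough sketch of what cgroup-based enforcement might look
like in the startd's config.  BASE_CGROUP and CGROUP_MEMORY_LIMIT_POLICY
are the knob names the HTCondor manual uses for cgroup support; whether
they apply unchanged to a particular 7.9.x release is an assumption, as is
the cgroup name htcondor, so check the manual for your version:

    # Parent cgroup under which each job is placed (assumed name; it
    # must already exist on the execute node)
    BASE_CGROUP = htcondor
    # hard: the job cannot use more physical RAM than its slot's Memory
    # soft: the job may exceed it while the machine has free RAM
    CGROUP_MEMORY_LIMIT_POLICY = hard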

If, for example, you wanted to do #3 above, use WANT_HOLD and
WANT_HOLD_REASON in addition to PREEMPT (the startd only consults
WANT_HOLD at the moment PREEMPT becomes true), perhaps like so:

    MEMORY_EXCEEDED = ( MemoryUsage > Memory )
    PREEMPT = ( $(PREEMPT) ) || ( $(MEMORY_EXCEEDED) )
    WANT_HOLD = ( $(MEMORY_EXCEEDED) )
    WANT_HOLD_REASON = \
        ifThenElse( $(MEMORY_EXCEEDED), \
                "Your job used too much memory.", \
                undefined )
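
A slightly more defensive variant of the same idea, as a sketch only: it
ignores MemoryUsage until the attribute has actually been reported, and it
allows some headroom over the slot's Memory before acting.  The 10%
headroom figure is an arbitrary assumption of this example, not an
HTCondor default, and this variant is untested as well:

    # True only once MemoryUsage is defined and exceeds Memory by 10%
    MEMORY_EXCEEDED = ( MemoryUsage =!= UNDEFINED && \
                        MemoryUsage > 1.1 * Memory )
    # WANT_HOLD is only looked at when PREEMPT fires, so set both
    PREEMPT = ( $(PREEMPT) ) || ( $(MEMORY_EXCEEDED) )
    WANT_HOLD = ( $(MEMORY_EXCEEDED) )
    WANT_HOLD_REASON = \
        ifThenElse( $(MEMORY_EXCEEDED), \
                "Your job exceeded its slot memory by more than 10%.", \
                undefined )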

Note the above all assumes HTCondor version 7.8 or above, and
disclaimer: the above is off the top of my head, I didn't test it out.

Hope the above helps,
regards,
Todd


On 11/15/2012 2:44 PM, David Brodbeck wrote:



On Wed, Nov 14, 2012 at 1:27 AM, Ian Cottam
<Ian.Cottam@xxxxxxxxxxxxxxxx> wrote:

    A colleague just asked me:

    "When a Condor node runs out of memory - to the point that it starts
    evicting jobs - does it:

    1. Evict the most recently started job to minimize "badput"?
    2. Evict the first job that requests more memory when all the memory
    has been exhausted?
    3. Some other strategy?"

    I'm not sure.


In my experience, it's the first job to exceed the memory available for
its slot, or (if dynamic slots are in use) its RequestMemory setting.
Note that if the machine as a whole runs out of RAM and swap before
Condor reacts, the OS's out of memory reactions come into play.  In
Linux what happens at that point is configurable, but the default is for
the kernel to start killing off processes until there's enough free RAM
to proceed.  I'm not sure what heuristics it uses to decide what to kill
off, but it usually seems to be the largest process first.

--
David Brodbeck
System Administrator, Linguistics
University of Washington



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685