Re: [HTCondor-users] out-of-memory event?



On Thu, Oct 12, 2017 at 11:11 AM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>
> Assuming you are on Linux... One thing that changed between v8.4.x and
> v8.6.x is in v8.6.x cgroup support is enabled by default which allows
> HTCondor to more accurately track the amount of memory your job uses during
> its lifetime.  On an execute node that put your job on hold, what does
>   condor_config_val -dump CGROUP PREEMPT
> say?  I am interested in values for CGROUP_MEMORY_LIMIT_POLICY and
> BASE_CGROUP (see the Manual for details on these knobs), or if your machines
> are configured to PREEMPT jobs that use more memory than provisioned in the
> slot.  These settings could tell HTCondor to put jobs on hold that use more
> memory than allocated in the slot.
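
(for anyone searching the archives later: a quick way to confirm these
are memory holds and see what condor measured is something like the
line below; just a sketch, using the standard job ClassAd attribute
names:)

  condor_q -held -af ClusterId ProcId MemoryUsage RequestMemory HoldReason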

yes this is linux, rhel 7.4 to be specific

nothing named cgroup_preempt shows up in the dump.  the cgroup-related
values it does report are:

base_cgroup = htcondor
cgroup_memory_limit_policy = none

we actually set PREEMPT to false everywhere.  we don't want jobs
preempted for any reason unless a person specifically asks for it.
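
in config terms that amounts to roughly the following (a sketch; knob
names are from the manual, and the PREEMPT line is a simplification of
our actual policy expressions):

  # execute-node configuration (sketch of what we have)
  BASE_CGROUP = htcondor
  CGROUP_MEMORY_LIMIT_POLICY = none
  # no automatic preemption, ever
  PREEMPT = False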

> So what is likely happening is your job is using more memory than allocated
> to the slot, and you will need to increase the value in your job submit file
> for request_memory.

yes, i agree that is very likely what's happening.  our users are
pretty bad at accurately specifying memory requests.  they get pretty
close, which has generally been good enough, but i guess condor is now
enforcing the request more stringently.
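
we'll probably just point people at what their recent jobs actually
used and have them bump request_memory to cover it, something like the
following (a sketch; the username, executable name, and the 4GB figure
are made up, MemoryUsage/RequestMemory are the standard job attributes):

  # compare actual peak usage (MB) against what was requested, recent jobs only
  condor_history someuser -limit 20 -af MemoryUsage RequestMemory

  # then in the submit file, ask for a bit more than the observed peak
  executable     = my_job.sh
  request_memory = 4GB
  queue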