
Re: [HTCondor-users] out-of-memory event?



On 10/12/2017 9:17 AM, Michael Di Domenico wrote:
Ever since I upgraded our condor pool from 8.4.x to 8.6.1, a lot (but
not all) of my jobs are getting put on hold with "job has encountered
an out-of-memory event".

There were a lot of condor/system changes at the same time, so it's
certainly possible that a config setting got changed.

The problem is I can't seem to locate which knob or knobs produce this error.

Our config is fairly generic and we use most of the default condor settings.


Assuming you are on Linux... One thing that changed between v8.4.x and v8.6.x is that in v8.6.x cgroup support is enabled by default, which allows HTCondor to track more accurately how much memory your job uses during its lifetime. On an execute node that put your job on hold, what does
  condor_config_val -dump CGROUP PREEMPT
say? I am interested in the values of CGROUP_MEMORY_LIMIT_POLICY and BASE_CGROUP (see the Manual for details on these knobs), and in whether your machines are configured to PREEMPT jobs that use more memory than provisioned in the slot. These settings can tell HTCondor to put jobs on hold when they use more memory than was allocated to the slot.
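
For reference, here is a minimal sketch of what such an execute-node configuration might look like; the knob names are the ones mentioned above, but the values shown are only illustrative and depend on your site policy:

  # condor_config.local on the execute node -- illustrative values only
  # "hard" enforces the slot's memory limit, "soft" lets jobs exceed it
  # while the machine has free memory, and "none" disables cgroup memory limits.
  CGROUP_MEMORY_LIMIT_POLICY = soft
  # Parent cgroup under which HTCondor places per-job cgroups.
  BASE_CGROUP = htcondor

After editing the config, running condor_reconfig on the execute node picks up the change.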

So what is likely happening is that your job is using more memory than is allocated to the slot, and you will need to increase the value of request_memory in your job submit file. If you specify log=<file> in your submit file, the log file will report the maximum memory your job used (for jobs that completed). While the guide below is tailored somewhat to policies specific to UW-Madison users, you may still find it useful:
  http://chtc.cs.wisc.edu/helloworld.shtml
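
As a rough illustration (the file and program names here are hypothetical), a submit file along these lines both raises the memory request and enables the event log:

  # example.sub -- hypothetical submit file sketch
  executable = my_analysis
  arguments  = input.dat
  # The event log records how much memory the job used when it terminates.
  log    = my_analysis.log
  output = my_analysis.out
  error  = my_analysis.err
  # Ask for 2048 MB (units are MiB when none are given) instead of the default.
  request_memory = 2048
  queue

The job-termination event in that log reports memory usage alongside what was requested and allocated, which is a reasonable basis for choosing a better request_memory.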

regards,
Todd


--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685