
Re: [HTCondor-users] out-of-memory event?



On 10/12/2017 9:17 AM, Michael Di Domenico wrote:
Ever since I upgraded our condor pool from 8.4.x to 8.6.1, a lot (but
not all) of my jobs are getting put on hold with "job has encountered
an out-of-memory event".

There were a lot of condor/system changes at the same time, so it's
certainly possible that a config setting got changed.

The problem is I can't seem to locate which knob or knobs produce this error.

Our config is fairly generic and we use most of the default condor settings.


Assuming you are on Linux... One thing that changed between v8.4.x and v8.6.x is that in v8.6.x cgroup support is enabled by default, which allows HTCondor to track more accurately how much memory your job uses during its lifetime. On an execute node that put your job on hold, what does
  condor_config_val -dump CGROUP PREEMPT
say? I am interested in the values of CGROUP_MEMORY_LIMIT_POLICY and BASE_CGROUP (see the Manual for details on these knobs), and in whether your machines are configured to PREEMPT jobs that use more memory than provisioned in the slot. These settings can tell HTCondor to put jobs on hold when they use more memory than was allocated to the slot.
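
For reference, here is a minimal sketch of what such an execute-node configuration might look like; the knob names are the ones mentioned above, but the values shown are only illustrative and depend on your site policy:

  # condor_config.local on the execute node -- illustrative values only
  # "hard" enforces the slot's memory limit, "soft" lets jobs exceed it
  # while the machine has free memory, and "none" disables cgroup memory limits.
  CGROUP_MEMORY_LIMIT_POLICY = soft
  # Parent cgroup under which HTCondor places per-job cgroups.
  BASE_CGROUP = htcondor

After editing the config, running condor_reconfig on the execute node picks up the change.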

So what is likely happening is that your job is using more memory than is allocated to the slot, and you will need to increase the value of request_memory in your job submit file. If you specify log=<file> in your submit file, the log file will report the maximum memory your job used (for jobs that completed). While the guide below is tailored somewhat to policies specific to UW-Madison users, you may still find it useful:
  http://chtc.cs.wisc.edu/helloworld.shtml
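
As a rough illustration (the file and program names here are hypothetical), a submit file along these lines both raises the memory request and enables the event log:

  # example.sub -- hypothetical submit file sketch
  executable = my_analysis
  arguments  = input.dat
  # The event log records how much memory the job used when it terminates.
  log    = my_analysis.log
  output = my_analysis.out
  error  = my_analysis.err
  # Ask for 2048 MB (units are MiB when none are given) instead of the default.
  request_memory = 2048
  queue

The job-termination event in that log reports memory usage alongside what was requested and allocated, which is a reasonable basis for choosing a better request_memory.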

regards,
Todd


--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685