
[HTCondor-users] Out of memory killer & cgroups



Hi,

When cgroups are enabled and the soft memory limit is used, i.e.

BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft

and a job uses so much memory that the system runs out of memory, the OOM killer kills the job:

Oct  2 11:04:15 lcg1077 kernel: Out of memory: Kill process 23856 (condor_exec.exe) score 270 or sacrifice child
Oct  2 11:04:15 lcg1077 kernel: Killed process 23856, UID 99, (condor_exec.exe) total-vm:8433308kB, anon-rss:3961620kB, file-rss:28kB

but it's not at all obvious to the user that this has happened. All that can be seen in the job's ClassAd is:

ExitReason = "died on signal 9 (Killed)"
ExitSignal = 9
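
Confirming the cause therefore means checking the kernel log on the execute node by hand. A minimal sketch in Python of matching the "Killed process" line, assuming the syslog format shown in the excerpt above (other kernel versions may format it differently); `parse_oom_kill` is a hypothetical helper name, not part of HTCondor:

```python
import re

# Matches the kernel's "Killed process" line in the format shown in the
# log excerpt above (kernel 2.6.32). This pattern is an assumption based
# on that single example and may need adjusting for other kernels.
KILLED_RE = re.compile(
    r"kernel: Killed process (?P<pid>\d+),? .*?\((?P<comm>[^)]+)\)"
    r".*?total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def parse_oom_kill(line):
    """Return details of an OOM kill from one syslog line, or None."""
    m = KILLED_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "comm": m.group("comm"),
        "total_vm_kb": int(m.group("total_vm_kb")),
        "anon_rss_kb": int(m.group("anon_rss_kb")),
    }
```

Comparing the returned `pid` with the PID of the job's process would tie the signal 9 in the ClassAd to the OOM killer, and `anon_rss_kb` gives a rough idea of how much memory the job was using when it was killed.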

A number of tickets seem to suggest that such jobs should be held with a message saying that the job has exceeded its memory limit (note that in the tests I've done I've had request_memory=1000 with jobs that use much more memory than this).

This is with HTCondor 8.2.2 on an SL6.4 machine with kernel 2.6.32-431.23.3.el6.

Is what I'm seeing the expected behaviour?

Many Thanks,
Andrew.
