
[HTCondor-users] CGROUPS + OOM / HOLD on exit



Hi all:

We are having some problems using cgroups for memory limiting. When a job exits, the OOM-killer handling gets invoked and the job is placed on hold instead of being allowed to finish normally. With full debug logging enabled for the starter (and a really short job), we get:

07/24/13 12:47:39 Initializing cgroup library.
07/24/13 12:47:44 DaemonCore: No more children processes to reap.
07/24/13 12:47:44 Process exited, pid=32686, status=0
07/24/13 12:47:44 Inside VanillaProc::JobReaper()
07/24/13 12:47:44 Inside OsProc::JobReaper()
07/24/13 12:47:44 Inside UserProc::JobReaper()
07/24/13 12:47:44 Reaper: all=1 handled=1 ShuttingDown=0
07/24/13 12:47:44 In VanillaProc::PublishUpdateAd()
07/24/13 12:47:44 Inside OsProc::PublishUpdateAd()
07/24/13 12:47:44 Inside UserProc::PublishUpdateAd()
07/24/13 12:47:44 HOOK_JOB_EXIT not configured.
07/24/13 12:47:44 In VanillaProc::PublishUpdateAd()
07/24/13 12:47:44 Inside OsProc::PublishUpdateAd()
07/24/13 12:47:44 Inside UserProc::PublishUpdateAd()
07/24/13 12:47:44 Entering JICShadow::updateShadow()
07/24/13 12:47:44 Sent job ClassAd update to startd.
07/24/13 12:47:44 Leaving JICShadow::updateShadow(): success
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 JICShadow::transferOutput(void): Transferring...
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
07/24/13 12:47:44 Inside OsProc::JobExit()
07/24/13 12:47:44 Notifying exit status=0 reason=100
07/24/13 12:47:44 Sent job ClassAd update to startd.
07/24/13 12:47:44 Hold all jobs
07/24/13 12:47:44 All jobs were removed due to OOM event.
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
07/24/13 12:47:44 Closing event FD pipe 0.
07/24/13 12:47:44 Close_Pipe on invalid pipe end: 0
07/24/13 12:47:44 ERROR "Close_Pipe error" at line 2104 in file /slots/01/dir_5373/userdir/src/condor_daemon_core.V6/daemon_core.cpp
07/24/13 12:47:44 ShutdownFast all jobs.
07/24/13 12:47:44 Got ShutdownFast when no jobs running.
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)

It seems that, for some reason, an event is being fired on the OOM eventfd (perhaps because the cgroup itself is being destroyed?). Has anybody else seen the same issue? Could it be due to a change in the kernel's cgroup interface?
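
To make the suspicion concrete, here is a rough Python sketch of how an OOM notification is registered under cgroup v1, as far as I understand the interface: an eventfd is paired with memory.oom_control by writing both file descriptors to cgroup.event_control. This is not the starter's actual code. The helper names (registration_line, classify_wakeup, watch_oom) are made up for illustration, and telling the two cases apart by checking whether the cgroup directory still exists is only an assumption about how a wake-up caused by cgroup removal might be distinguished from a real OOM.

```python
import os


def registration_line(event_fd: int, control_fd: int) -> str:
    """Text written to cgroup.event_control (cgroup v1) to register for
    OOM events: "<eventfd> <fd of memory.oom_control>"."""
    return f"{event_fd} {control_fd}"


def classify_wakeup(cgroup_dir: str) -> str:
    """Hypothetical check to run once the eventfd becomes readable.

    Assumption: if the cgroup directory is already gone, the wake-up was
    caused by the cgroup being removed and not by a real OOM condition.
    """
    if not os.path.isdir(cgroup_dir):
        return "cgroup-removed"
    return "oom"


def watch_oom(cgroup_dir: str) -> str:
    """Block until the eventfd is signalled, then classify the wake-up.

    Needs Linux, Python 3.10+ (for os.eventfd) and a cgroup v1 memory
    hierarchy; cgroup_dir is a path chosen by the caller.
    """
    efd = os.eventfd(0)
    cfd = os.open(os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY)
    try:
        with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as f:
            f.write(registration_line(efd, cfd))
        os.read(efd, 8)  # blocks until the kernel signals the eventfd
        return classify_wakeup(cgroup_dir)
    finally:
        os.close(cfd)
        os.close(efd)
```

If a check along these lines reported "cgroup-removed" for a job that had already exited with status=0, that would be consistent with the log above, where the hold happens only after the exit status has been reported.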

Thanks,

Joan
-- 
--------------------------------------------------------------------------
Joan Josep Piles Contreras -  Analista de sistemas
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 876 55 51 47 (ext. 845147)
http://i3a.unizar.es -- jpiles@xxxxxxxxx
--------------------------------------------------------------------------