
Re: [HTCondor-users] CGROUPS + OOM / HOLD on exit



Hi Joan,

I think I have figured out why you're hitting this and we aren't locally.  The cgroup API generates a notification whenever an OOM occurs *or* the cgroup is removed.  Locally, we pre-create all the possible cgroups (older RHEL kernels would crash if the cgroups were not pre-created; that has since been fixed), so condor never deletes the cgroup.  Hence, we never got the notification on cgroup removal, and this issue was missed in testing.
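
For anyone following along, here is a minimal sketch of how a cgroup v1 OOM notification is typically wired up (the cgroup path below is made up for illustration; judging by the m_oom_fd/m_oom_efd names in the starter, condor does something similar internally). The key point is that the kernel signals the same eventfd on an OOM and on removal of the cgroup, so the reader cannot distinguish the two cases from the event alone:

#include <sys/eventfd.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
    /* hypothetical cgroup path, for illustration only */
    int efd = eventfd(0, 0);  /* endpoint the daemon selects/reads on */
    int ofd = open("/cgroup/memory/htcondor/job/memory.oom_control", O_RDONLY);
    int cfd = open("/cgroup/memory/htcondor/job/cgroup.event_control", O_WRONLY);
    if (efd < 0 || ofd < 0 || cfd < 0) { perror("open"); return 1; }

    /* writing "<eventfd> <oom_control fd>" registers for OOM events */
    char buf[64];
    int n = snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
    if (write(cfd, buf, n) != n) { perror("write"); return 1; }
    close(cfd);

    uint64_t count;
    read(efd, &count, sizeof(count));  /* fires on OOM *or* on rmdir of the cgroup */
    printf("notification received (count=%llu): OOM or cgroup removal\n",
           (unsigned long long)count);
    close(ofd);
    close(efd);
    return 0;
}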

Your fix is acceptable, I think.  A "more proper" approach would be to parse the memory.oom_control file to see whether under_oom is set to 1.  (On further thought, I'd actually prefer your approach: parsing files when memory is actually tight is more likely to lead to deadlock.)
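
For completeness, that check would look something like the sketch below (the path is hypothetical, and as noted, Joan's num_pids guard has the advantage of not touching the filesystem at all when memory is tight):

#include <fstream>
#include <string>

/* Returns true if the cgroup reports a real OOM condition.  The file
 * contains lines like "oom_kill_disable 0" and "under_oom 0". */
bool cgroup_under_oom(const std::string &oom_control_path) {
    std::ifstream in(oom_control_path);
    std::string key;
    int value;
    while (in >> key >> value) {
        if (key == "under_oom")
            return value == 1;
    }
    return false;  /* missing or malformed file: assume no OOM */
}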

I'm just digging out from under an avalanche of emails - did a bug report get filed for this?

Brian

On Jul 25, 2013, at 5:47 AM, Joan J. Piles <jpiles@xxxxxxxxx> wrote:

Hi,

Just in case it is useful for somebody: we have been able to solve (or work around) the problem with a little patch to the condor source:

diff -ur condor-8.0.0.orig/src/condor_starter.V6.1/vanilla_proc.cpp condor-8.0.0/src/condor_starter.V6.1/vanilla_proc.cpp
--- condor-8.0.0.orig/src/condor_starter.V6.1/vanilla_proc.cpp    2013-05-29 18:58:09.000000000 +0200
+++ condor-8.0.0/src/condor_starter.V6.1/vanilla_proc.cpp    2013-07-25 12:13:09.000000000 +0200
@@ -798,6 +798,18 @@
 int
 VanillaProc::outOfMemoryEvent(int /* fd */)
 {
+
+    /* If we have no jobs left, return and do nothing */
+    if (num_pids == 0) {
+        dprintf(D_FULLDEBUG, "Closing event FD pipe %d.\n", m_oom_efd);
+        daemonCore->Close_Pipe(m_oom_efd);
+        close(m_oom_fd);
+        m_oom_efd = -1;
+        m_oom_fd = -1;
+
+        return 0;
+    }
+
     std::stringstream ss;
     if (m_memory_limit >= 0) {
         ss << "Job has gone over memory limit of " << m_memory_limit << " megabytes.";

I don't know if this is the best way to work around the problem, but at least it seems to work for us. We forced a genuine OOM condition and the starter responded as it should, while jobs were no longer put on hold at exit.
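
(For anyone wanting to reproduce the test: a trivial memory hog like the sketch below, run inside the job's cgroup, is enough to trigger a genuine OOM. This is illustrative only and not part of the patch; touching each page is what actually charges the memory against the cgroup limit.)

#include <cstdlib>
#include <cstring>
#include <cstdio>

int main() {
    const size_t chunk = 64UL * 1024 * 1024;  /* allocate 64 MB per step */
    for (;;) {
        char *p = static_cast<char *>(malloc(chunk));
        if (!p) break;
        memset(p, 1, chunk);  /* fault the pages in so they count */
        printf("allocated another 64 MB\n");
    }
    return 0;
}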

I don't think the patch is very clean either, but as I said, it's more of a quick-and-dirty hack to get this feature (which is really interesting for us) running.

Regards,

Joan

On 24/07/13 17:24, Paolo Perfetti wrote:
Hi,

On 24/07/2013 13:07, Joan J. Piles wrote:
Hi all:

We are having some problems using cgroups for memory limiting. When jobs
exit, the OOM-killer handling gets invoked, placing the job on hold
instead of letting it end normally. With full starter debug logging (and
a really short job) we see:

I've been going crazy over the same problem for a week now.
My system is an up-to-date Debian Wheezy with condor version 8.0.1-148801 (from the research.cs.wisc.edu repository):
odino:~$ uname  -a
Linux odino 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 GNU/Linux


cgroups seem to be working properly:
odino:~$ condor_config_val BASE_CGROUP
htcondor
odino:~$ condor_config_val CGROUP_MEMORY_LIMIT_POLICY
soft
odino:~$ grep cgroup /etc/default/grub
GRUB_CMDLINE_LINUX="cgroup_enable=memory"
odino:~$ cat /etc/cgconfig.conf
mount {
        cpu     = /cgroup/cpu;
        cpuset  = /cgroup/cpuset;
        cpuacct = /cgroup/cpuacct;
        memory  = /cgroup/memory;
        freezer = /cgroup/freezer;
        blkio   = /cgroup/blkio;
}

group htcondor {
        cpu {}
        cpuset {}
        cpuacct {}
        memory {
# Tested both memory.limit_in_bytes and memory.soft_limit_in_bytes
#memory.limit_in_bytes = 16370672K;
          memory.soft_limit_in_bytes = 16370672K;
        }
        freezer {}
        blkio {}
}
odino:~$ mount | grep cgrou
cgroup on /cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /cgroup/cpuset type cgroup (rw,relatime,cpuset)
cgroup on /cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /cgroup/memory type cgroup (rw,relatime,memory)
cgroup on /cgroup/freezer type cgroup (rw,relatime,freezer)
cgroup on /cgroup/blkio type cgroup (rw,relatime,blkio)

Submit file is trivial:
universe = parallel
executable = /bin/sleep
arguments = 15
machine_count = 4
#request_cpu = 1
request_memory = 128
log = log
output = output
error  = error
notification = never
should_transfer_files = always
when_to_transfer_output = on_exit
queue

Below is my StarterLog.

Any suggestions would be appreciated.
Thanks, Paolo


07/24/13 16:56:09 Enumerating interfaces: lo 127.0.0.1 up
07/24/13 16:56:09 Enumerating interfaces: eth0 192.168.100.161 up
07/24/13 16:56:09 Enumerating interfaces: eth1 10.5.0.2 up
07/24/13 16:56:09 Initializing Directory: curr_dir = /etc/condor/config.d
07/24/13 16:56:09 ******************************************************
07/24/13 16:56:09 ** condor_starter (CONDOR_STARTER) STARTING UP
07/24/13 16:56:09 ** /usr/sbin/condor_starter
07/24/13 16:56:09 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
07/24/13 16:56:09 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
07/24/13 16:56:09 ** $CondorVersion: 8.0.1 Jul 15 2013 BuildID: 148801 $
07/24/13 16:56:09 ** $CondorPlatform: x86_64_Debian7 $
07/24/13 16:56:09 ** PID = 31181
07/24/13 16:56:09 ** Log last touched 7/24 16:37:26
07/24/13 16:56:09 ******************************************************
07/24/13 16:56:09 Using config source: /etc/condor/condor_config
07/24/13 16:56:09 Using local config sources:
07/24/13 16:56:09    /etc/condor/config.d/00-asgard-common
07/24/13 16:56:09    /etc/condor/config.d/10-asgard-execute
07/24/13 16:56:09    /etc/condor/condor_config.local
07/24/13 16:56:09 Running as root.  Enabling specialized core dump routines
07/24/13 16:56:09 Not using shared port because USE_SHARED_PORT=false
07/24/13 16:56:09 DaemonCore: command socket at <192.168.100.161:35626>
07/24/13 16:56:09 DaemonCore: private command socket at <192.168.100.161:35626>
07/24/13 16:56:09 Setting maximum accepts per cycle 8.
07/24/13 16:56:09 Will use UDP to update collector odino.bo.ingv.it <192.168.100.160:9618>
07/24/13 16:56:09 Not using shared port because USE_SHARED_PORT=false
07/24/13 16:56:09 Entering JICShadow::receiveMachineAd
07/24/13 16:56:09 Communicating with shadow <192.168.100.160:36378?noUDP>
07/24/13 16:56:09 Shadow version: $CondorVersion: 8.0.1 Jul 15 2013 BuildID: 148801 $
07/24/13 16:56:09 Submitting machine is "odino.bo.ingv.it"
07/24/13 16:56:09 Instantiating a StarterHookMgr
07/24/13 16:56:09 Job does not define HookKeyword, not invoking any job hooks.
07/24/13 16:56:09 setting the orig job name in starter
07/24/13 16:56:09 setting the orig job iwd in starter
07/24/13 16:56:09 ShouldTransferFiles is "YES", transfering files
07/24/13 16:56:09 Submit UidDomain: "bo.ingv.it"
07/24/13 16:56:09  Local UidDomain: "bo.ingv.it"
07/24/13 16:56:09 Initialized user_priv as "username"
07/24/13 16:56:09 Done moving to directory "/var/lib/condor/execute/dir_31181"
07/24/13 16:56:09 Job has WantIOProxy=true
07/24/13 16:56:09 Initialized IO Proxy.
07/24/13 16:56:09 LocalUserLog::initFromJobAd: path_attr = StarterUserLog
07/24/13 16:56:09 LocalUserLog::initFromJobAd: xml_attr = StarterUserLogUseXML
07/24/13 16:56:09 No StarterUserLog found in job ClassAd
07/24/13 16:56:09 Starter will not write a local UserLog
07/24/13 16:56:09 Done setting resource limits
07/24/13 16:56:09 Changing the executable name
07/24/13 16:56:09 entering FileTransfer::Init
07/24/13 16:56:09 entering FileTransfer::SimpleInit
07/24/13 16:56:09 FILETRANSFER: protocol "http" handled by "/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "ftp" handled by "/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "file" handled by "/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "data" handled by "/usr/lib/condor/libexec/data_plugin"
07/24/13 16:56:09 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:09 TransferIntermediate="(none)"
07/24/13 16:56:09 entering FileTransfer::DownloadFiles
07/24/13 16:56:09 entering FileTransfer::Download
07/24/13 16:56:09 FileTransfer: created download transfer process with id 31184
07/24/13 16:56:09 entering FileTransfer::DownloadThread
07/24/13 16:56:09 entering FileTransfer::DoDownload sync=1
07/24/13 16:56:09 DaemonCore: No more children processes to reap.
07/24/13 16:56:09 DaemonCore: in SendAliveToParent()
07/24/13 16:56:09 REMAP: begin with rules:
07/24/13 16:56:09 REMAP: 0: condor_exec.exe
07/24/13 16:56:09 REMAP: res is 0 ->  !
07/24/13 16:56:09 Sending GoAhead for 192.168.100.160 to send /var/lib/condor/execute/dir_31181/condor_exec.exe and all further files.
07/24/13 16:56:09 Completed DC_CHILDALIVE to daemon at <192.168.100.161:53285>
07/24/13 16:56:09 DaemonCore: Leaving SendAliveToParent() - success
07/24/13 16:56:09 Received GoAhead from peer to receive /var/lib/condor/execute/dir_31181/condor_exec.exe.
07/24/13 16:56:09 get_file(): going to write to filename /var/lib/condor/execute/dir_31181/condor_exec.exe
07/24/13 16:56:09 get_file: Receiving 31136 bytes
07/24/13 16:56:09 get_file: wrote 31136 bytes to file
07/24/13 16:56:09 ReliSock::get_file_with_permissions(): going to set permissions 755
07/24/13 16:56:09 DaemonCore: No more children processes to reap.
07/24/13 16:56:09 File transfer completed successfully.
07/24/13 16:56:09 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:10 Calling client FileTransfer handler function.
07/24/13 16:56:10 HOOK_PREPARE_JOB not configured.
07/24/13 16:56:10 Job 90.0 set to execute immediately
07/24/13 16:56:10 Starting a PARALLEL universe job with ID: 90.0
07/24/13 16:56:10 In OsProc::OsProc()
07/24/13 16:56:10 Main job KillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Main job RmKillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Main job HoldKillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Constructor of ParallelProc::ParallelProc
07/24/13 16:56:10 in ParallelProc::StartJob()
07/24/13 16:56:10 Found Node = 0 in job ad
07/24/13 16:56:10 ParallelProc::addEnvVars()
07/24/13 16:56:10 No Path in ad, $PATH in env
07/24/13 16:56:10 before: /bin:/sbin:/usr/bin:/usr/sbin
07/24/13 16:56:10 New env: PATH=/usr/bin:/bin:/sbin:/usr/bin:/usr/sbin _CONDOR_PROCNO=0 CONDOR_CONFIG=/etc/condor/condor_config _CONDOR_NPROCS=4 _CONDOR_REMOTE_SPOOL_DIR=/var/lib/condor/spool/90/0/cluster90.proc0.subproc0
07/24/13 16:56:10 in VanillaProc::StartJob()
07/24/13 16:56:10 Requesting cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx for job.
07/24/13 16:56:10 Value of RequestedChroot is unset.
07/24/13 16:56:10 PID namespace option: false
07/24/13 16:56:10 in OsProc::StartJob()
07/24/13 16:56:10 IWD: /var/lib/condor/execute/dir_31181
07/24/13 16:56:10 Input file: /dev/null
07/24/13 16:56:10 Output file: /var/lib/condor/execute/dir_31181/_condor_stdout
07/24/13 16:56:10 Error file: /var/lib/condor/execute/dir_31181/_condor_stderr
07/24/13 16:56:10 About to exec /var/lib/condor/execute/dir_31181/condor_exec.exe 15
07/24/13 16:56:10 Env = TEMP=/var/lib/condor/execute/dir_31181 _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_31181 _CONDOR_SLOT=slot1_1 TMPDIR=/var/lib/condor/execute/dir_31181 _CONDOR_PROCNO=0 _CONDOR_JOB_PIDS= TMP=/var/lib/condor/execute/dir_31181 _CONDOR_REMOTE_SPOOL_DIR=/var/lib/condor/spool/90/0/cluster90.proc0.subproc0 _CONDOR_JOB_AD=/var/lib/condor/execute/dir_31181/.job.ad _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_31181 CONDOR_CONFIG=/etc/condor/condor_config PATH=/usr/bin:/bin:/sbin:/usr/bin:/usr/sbin _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_31181/.machine.ad _CONDOR_NPROCS=4
07/24/13 16:56:10 Setting job's virtual memory rlimit to 17179869184 megabytes
07/24/13 16:56:10 ENFORCE_CPU_AFFINITY not true, not setting affinity
07/24/13 16:56:10 Running job as user username
07/24/13 16:56:10 track_family_via_cgroup: Tracking PID 31185 via cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx.
07/24/13 16:56:10 About to tell ProcD to track family with root 31185 via cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx
07/24/13 16:56:10 Create_Process succeeded, pid=31185
07/24/13 16:56:10 Initializing cgroup library.
07/24/13 16:56:18 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:18 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:18 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:18 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:18 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:18 Entering JICShadow::updateShadow()
07/24/13 16:56:18 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:18 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:18 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:18 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:18 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:18 Sent job ClassAd update to startd.
07/24/13 16:56:18 Leaving JICShadow::updateShadow(): success
07/24/13 16:56:25 DaemonCore: No more children processes to reap.
07/24/13 16:56:25 Process exited, pid=31185, status=0
07/24/13 16:56:25 Inside VanillaProc::JobReaper()
07/24/13 16:56:25 Inside OsProc::JobReaper()
07/24/13 16:56:25 Inside UserProc::JobReaper()
07/24/13 16:56:25 Reaper: all=1 handled=1 ShuttingDown=0
07/24/13 16:56:25 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:25 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:25 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:25 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:25 HOOK_JOB_EXIT not configured.
07/24/13 16:56:25 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:25 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:25 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:25 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:25 Entering JICShadow::updateShadow()
07/24/13 16:56:25 Sent job ClassAd update to startd.
07/24/13 16:56:25 Leaving JICShadow::updateShadow(): success
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 JICShadow::transferOutput(void): Transferring...
07/24/13 16:56:25 Begin transfer of sandbox to shadow.
07/24/13 16:56:25 entering FileTransfer::UploadFiles (final_transfer=1)
07/24/13 16:56:25 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Sending new file _condor_stdout, time==1374677770, size==0
07/24/13 16:56:25 Skipping file in exception list: .job.ad
07/24/13 16:56:25 Sending new file _condor_stderr, time==1374677770, size==0
07/24/13 16:56:25 Skipping file in exception list: .machine.ad
07/24/13 16:56:25 Skipping file chirp.config, t: 1374677769==1374677769, s: 54==54
07/24/13 16:56:25 Skipping file condor_exec.exe, t: 1374677769==1374677769, s: 31136==31136
07/24/13 16:56:25 FileTransfer::UploadFiles: sent TransKey=1#51efeb09437ffa2dcc159bc
07/24/13 16:56:25 entering FileTransfer::Upload
07/24/13 16:56:25 entering FileTransfer::DoUpload
07/24/13 16:56:25 DoUpload: sending file _condor_stdout
07/24/13 16:56:25 FILETRANSFER: outgoing file_command is 1 for _condor_stdout
07/24/13 16:56:25 Received GoAhead from peer to send /var/lib/condor/execute/dir_31181/_condor_stdout.
07/24/13 16:56:25 Sending GoAhead for 192.168.100.160 to receive /var/lib/condor/execute/dir_31181/_condor_stdout and all further files.
07/24/13 16:56:25 ReliSock::put_file_with_permissions(): going to send permissions 100644
07/24/13 16:56:25 put_file: going to send from filename /var/lib/condor/execute/dir_31181/_condor_stdout
07/24/13 16:56:25 put_file: Found file size 0
07/24/13 16:56:25 put_file: sending 0 bytes
07/24/13 16:56:25 ReliSock: put_file: sent 0 bytes
07/24/13 16:56:25 DoUpload: sending file _condor_stderr
07/24/13 16:56:25 FILETRANSFER: outgoing file_command is 1 for _condor_stderr
07/24/13 16:56:25 Received GoAhead from peer to send /var/lib/condor/execute/dir_31181/_condor_stderr.
07/24/13 16:56:25 ReliSock::put_file_with_permissions(): going to send permissions 100644
07/24/13 16:56:25 put_file: going to send from filename /var/lib/condor/execute/dir_31181/_condor_stderr
07/24/13 16:56:25 put_file: Found file size 0
07/24/13 16:56:25 put_file: sending 0 bytes
07/24/13 16:56:25 ReliSock: put_file: sent 0 bytes
07/24/13 16:56:25 DoUpload: exiting at 3294
07/24/13 16:56:25 End transfer of sandbox to shadow.
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Inside OsProc::JobExit()
07/24/13 16:56:25 Initializing Directory: curr_dir = /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Notifying exit status=0 reason=100
07/24/13 16:56:25 Sent job ClassAd update to startd.
07/24/13 16:56:25 Hold all jobs
07/24/13 16:56:25 All jobs were removed due to OOM event.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Closing event FD pipe 65536.
07/24/13 16:56:25 ShutdownFast all jobs.
07/24/13 16:56:25 Got ShutdownFast when no jobs running.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Got SIGQUIT.  Performing fast shutdown.
07/24/13 16:56:25 ShutdownFast all jobs.
07/24/13 16:56:25 Got ShutdownFast when no jobs running.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 dirscat: dirpath = /
07/24/13 16:56:25 dirscat: subdir = /var/lib/condor/execute
07/24/13 16:56:25 Initializing Directory: curr_dir = /var/lib/condor/execute/
07/24/13 16:56:25 Removing /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Attempting to remove /var/lib/condor/execute/dir_31181 as SuperUser (root)
07/24/13 16:56:25 **** condor_starter (condor_STARTER) pid 31181 EXITING WITH STATUS 0

07/24/13 12:47:39 Initializing cgroup library.
07/24/13 12:47:44 DaemonCore: No more children processes to reap.
07/24/13 12:47:44 Process exited, pid=32686, status=0
07/24/13 12:47:44 Inside VanillaProc::JobReaper()
07/24/13 12:47:44 Inside OsProc::JobReaper()
07/24/13 12:47:44 Inside UserProc::JobReaper()
07/24/13 12:47:44 Reaper: all=1 handled=1 ShuttingDown=0
07/24/13 12:47:44 In VanillaProc::PublishUpdateAd()
07/24/13 12:47:44 Inside OsProc::PublishUpdateAd()
07/24/13 12:47:44 Inside UserProc::PublishUpdateAd()
07/24/13 12:47:44 HOOK_JOB_EXIT not configured.
07/24/13 12:47:44 In VanillaProc::PublishUpdateAd()
07/24/13 12:47:44 Inside OsProc::PublishUpdateAd()
07/24/13 12:47:44 Inside UserProc::PublishUpdateAd()
07/24/13 12:47:44 Entering JICShadow::updateShadow()
07/24/13 12:47:44 Sent job ClassAd update to startd.
07/24/13 12:47:44 Leaving JICShadow::updateShadow(): success
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 JICShadow::transferOutput(void): Transferring...
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
07/24/13 12:47:44 Inside OsProc::JobExit()
07/24/13 12:47:44 Notifying exit status=0 reason=100
07/24/13 12:47:44 Sent job ClassAd update to startd.
07/24/13 12:47:44 Hold all jobs
07/24/13 12:47:44 All jobs were removed due to OOM event.
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
07/24/13 12:47:44 Closing event FD pipe 0.
07/24/13 12:47:44 Close_Pipe on invalid pipe end: 0
07/24/13 12:47:44 ERROR "Close_Pipe error" at line 2104 in file /slots/01/dir_5373/userdir/src/condor_daemon_core.V6/daemon_core.cpp
07/24/13 12:47:44 ShutdownFast all jobs.
07/24/13 12:47:44 Got ShutdownFast when no jobs running.
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)

It seems an event is fired at the OOM eventfd for some reason (perhaps by the
cgroup itself being destroyed?). Has anybody else seen the same
issue? Could it be a change in the kernel's cgroup interface?

Thanks,

Joan

-- 
--------------------------------------------------------------------------
Joan Josep Piles Contreras -  Analista de sistemas
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 876 55 51 47 (ext. 845147)
http://i3a.unizar.es -- jpiles@xxxxxxxxx
--------------------------------------------------------------------------
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/