
Re: [HTCondor-users] CGROUPS + OOM / HOLD on exit



On 7/30/2013 1:01 PM, Brian Bockelman wrote:
Hi Joan,

I think I have figured out why you're hitting this and not us locally.
The cgroup API generates a notification whenever an OOM occurs *or*
the cgroup is removed.  Locally, we pre-create all the possible cgroups
(older RHEL kernels would crash if the cgroups were not pre-created;
since fixed), which means condor never deletes the cgroup.  Hence, we
never got the notification when the cgroup was removed, and this issue
was missed in testing.
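[Editor's note: for context, the cgroup-v1 notification Brian describes is armed roughly as in the sketch below. This is illustrative, not the starter's actual code; the helper name and error handling are made up, and the cgroup path comes from the caller.]

// Minimal sketch of registering a cgroup-v1 OOM notification.
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <cstdio>
#include <cstring>
#include <string>

// Returns an eventfd that becomes readable on OOM *or* when the cgroup
// is removed; *oom_fd_out receives the memory.oom_control fd, which must
// stay open for the registration to remain valid.
int register_oom_eventfd(const std::string &cgroup_dir, int *oom_fd_out)
{
    int oom_fd = open((cgroup_dir + "/memory.oom_control").c_str(), O_RDONLY);
    if (oom_fd < 0) return -1;

    int efd = eventfd(0, 0);
    if (efd < 0) { close(oom_fd); return -1; }

    int ctl_fd = open((cgroup_dir + "/cgroup.event_control").c_str(), O_WRONLY);
    if (ctl_fd < 0) { close(efd); close(oom_fd); return -1; }

    // Writing "<eventfd> <oom_control fd>" arms the notification.
    char buf[64];
    snprintf(buf, sizeof(buf), "%d %d", efd, oom_fd);
    if (write(ctl_fd, buf, strlen(buf)) < 0) {
        close(ctl_fd); close(efd); close(oom_fd);
        return -1;
    }
    close(ctl_fd);

    // From here on, read(efd, &count, 8) returns both when an OOM fires
    // in this cgroup and when the cgroup itself is removed -- exactly
    // the ambiguity discussed in this thread.
    *oom_fd_out = oom_fd;
    return efd;
}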

Your fix is acceptable, I think.  A "more proper" approach would be to
parse the memory.oom_control file to see if under_oom is set to 1.  (On
further thought, I'd actually prefer your approach - parsing files in
the case where memory is actually tight is more likely to lead to deadlock.)
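[Editor's note: a minimal sketch of the under_oom check Brian mentions, assuming the handler can still reach the cgroup directory. The helper name is hypothetical; the field layout follows the v1 memory controller's memory.oom_control file.]

// After the eventfd wakes up, read memory.oom_control and see whether
// under_oom is 1.  Reading a file while memory is critically tight is
// the deadlock risk Brian notes above.
#include <cstdio>
#include <cstring>
#include <string>

bool cgroup_is_under_oom(const std::string &cgroup_dir)
{
    FILE *f = fopen((cgroup_dir + "/memory.oom_control").c_str(), "r");
    if (!f) return false;  // cgroup already removed: not a real OOM

    // File contents look like:
    //   oom_kill_disable 0
    //   under_oom 1
    char key[32];
    int value = 0;
    bool under_oom = false;
    while (fscanf(f, "%31s %d", key, &value) == 2) {
        if (strcmp(key, "under_oom") == 0) {
            under_oom = (value == 1);
            break;
        }
    }
    fclose(f);
    return under_oom;
}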

I'm just coming back under an avalanche of emails - did a bug report get
filed from this?


I created a ticket for this at
  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3824

Once we receive a CLA from Joan, we will incorporate the patch posted here into the codebase.

Joan, please let me know once you've had a chance to email in the CLA (I emailed you specific directions out of band).

best
Todd



Brian

On Jul 25, 2013, at 5:47 AM, Joan J. Piles <jpiles@xxxxxxxxx> wrote:

Hi,

Just in case it is useful for somebody, we have been able to solve
(or work around) the problem with a little patch to the condor source:

diff -ur condor-8.0.0.orig/src/condor_starter.V6.1/vanilla_proc.cpp condor-8.0.0/src/condor_starter.V6.1/vanilla_proc.cpp
--- condor-8.0.0.orig/src/condor_starter.V6.1/vanilla_proc.cpp 2013-05-29 18:58:09.000000000 +0200
+++ condor-8.0.0/src/condor_starter.V6.1/vanilla_proc.cpp      2013-07-25 12:13:09.000000000 +0200
@@ -798,6 +798,18 @@
 int
 VanillaProc::outOfMemoryEvent(int /* fd */)
 {
+
+    /* If we have no jobs left, return and do nothing */
+    if (num_pids == 0) {
+        dprintf(D_FULLDEBUG, "Closing event FD pipe %d.\n", m_oom_efd);
+        daemonCore->Close_Pipe(m_oom_efd);
+        close(m_oom_fd);
+        m_oom_efd = -1;
+        m_oom_fd = -1;
+
+        return 0;
+    }
+
     std::stringstream ss;
     if (m_memory_limit >= 0) {
         ss << "Job has gone over memory limit of " << m_memory_limit << " megabytes.";

I don't know if it is the best way to work around this problem, but at
least it seems to work for us. We have forced a (true) OOM condition
and it responded as it should, and jobs are no longer put on hold at
exit.

I don't think it's too clean, either, but as I've said it's more of a
quick-and-dirty hack to get this feature (which is really interesting
for us) running.

Regards,

Joan

On 24/07/13 17:24, Paolo Perfetti wrote:
Hi,

On 24/07/2013 13:07, Joan J. Piles wrote:
Hi all:

We are having some problems using cgroups for memory limiting. When
jobs exit, the OOM-killer routines get called, placing the job on hold
instead of letting it end normally. With full starter debug logging
(and a really short job) we get:

I've been going crazy over the same problem for a week now.
My system is an up-to-date Debian Wheezy with condor version
8.0.1-148801 (from the research.cs.wisc.edu repository):
odino:~$ uname  -a
Linux odino 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 GNU/Linux


cgroups seem to be working properly:
odino:~$ condor_config_val BASE_CGROUP
htcondor
odino:~$ condor_config_val CGROUP_MEMORY_LIMIT_POLICY
soft
odino:~$ grep cgroup /etc/default/grub
GRUB_CMDLINE_LINUX="cgroup_enable=memory"
odino:~$ cat /etc/cgconfig.conf
mount {
        cpu     = /cgroup/cpu;
        cpuset  = /cgroup/cpuset;
        cpuacct = /cgroup/cpuacct;
        memory  = /cgroup/memory;
        freezer = /cgroup/freezer;
        blkio   = /cgroup/blkio;
}

group htcondor {
        cpu {}
        cpuset {}
        cpuacct {}
        memory {
# Tested both memory.limit_in_bytes and memory.soft_limit_in_bytes
#memory.limit_in_bytes = 16370672K;
          memory.soft_limit_in_bytes = 16370672K;
        }
        freezer {}
        blkio {}
}
odino:~$ mount | grep cgrou
cgroup on /cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /cgroup/cpuset type cgroup (rw,relatime,cpuset)
cgroup on /cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /cgroup/memory type cgroup (rw,relatime,memory)
cgroup on /cgroup/freezer type cgroup (rw,relatime,freezer)
cgroup on /cgroup/blkio type cgroup (rw,relatime,blkio)

Submit file is trivial:
universe = parallel
executable = /bin/sleep
arguments = 15
machine_count = 4
#request_cpu = 1
request_memory = 128
log = log
output = output
error  = error
notification = never
should_transfer_files = always
when_to_transfer_output = on_exit
queue

Below is my StarterLog.

Any suggestion would be appreciated.
tnx, Paolo


07/24/13 16:56:09 Enumerating interfaces: lo 127.0.0.1 up
07/24/13 16:56:09 Enumerating interfaces: eth0 192.168.100.161 up
07/24/13 16:56:09 Enumerating interfaces: eth1 10.5.0.2 up
07/24/13 16:56:09 Initializing Directory: curr_dir =
/etc/condor/config.d
07/24/13 16:56:09 ******************************************************
07/24/13 16:56:09 ** condor_starter (CONDOR_STARTER) STARTING UP
07/24/13 16:56:09 ** /usr/sbin/condor_starter
07/24/13 16:56:09 ** SubsystemInfo: name=STARTER type=STARTER(8)
class=DAEMON(1)
07/24/13 16:56:09 ** Configuration: subsystem:STARTER local:<NONE>
class:DAEMON
07/24/13 16:56:09 ** $CondorVersion: 8.0.1 Jul 15 2013 BuildID: 148801 $
07/24/13 16:56:09 ** $CondorPlatform: x86_64_Debian7 $
07/24/13 16:56:09 ** PID = 31181
07/24/13 16:56:09 ** Log last touched 7/24 16:37:26
07/24/13 16:56:09 ******************************************************
07/24/13 16:56:09 Using config source: /etc/condor/condor_config
07/24/13 16:56:09 Using local config sources:
07/24/13 16:56:09    /etc/condor/config.d/00-asgard-common
07/24/13 16:56:09    /etc/condor/config.d/10-asgard-execute
07/24/13 16:56:09    /etc/condor/condor_config.local
07/24/13 16:56:09 Running as root.  Enabling specialized core dump
routines
07/24/13 16:56:09 Not using shared port because USE_SHARED_PORT=false
07/24/13 16:56:09 DaemonCore: command socket at <192.168.100.161:35626>
07/24/13 16:56:09 DaemonCore: private command socket at
<192.168.100.161:35626>
07/24/13 16:56:09 Setting maximum accepts per cycle 8.
07/24/13 16:56:09 Will use UDP to update collector odino.bo.ingv.it
<192.168.100.160:9618>
07/24/13 16:56:09 Not using shared port because USE_SHARED_PORT=false
07/24/13 16:56:09 Entering JICShadow::receiveMachineAd
07/24/13 16:56:09 Communicating with shadow
<192.168.100.160:36378?noUDP>
07/24/13 16:56:09 Shadow version: $CondorVersion: 8.0.1 Jul 15 2013
BuildID: 148801 $
07/24/13 16:56:09 Submitting machine is "odino.bo.ingv.it"
07/24/13 16:56:09 Instantiating a StarterHookMgr
07/24/13 16:56:09 Job does not define HookKeyword, not invoking any
job hooks.
07/24/13 16:56:09 setting the orig job name in starter
07/24/13 16:56:09 setting the orig job iwd in starter
07/24/13 16:56:09 ShouldTransferFiles is "YES", transfering files
07/24/13 16:56:09 Submit UidDomain: "bo.ingv.it"
07/24/13 16:56:09  Local UidDomain: "bo.ingv.it"
07/24/13 16:56:09 Initialized user_priv as "username"
07/24/13 16:56:09 Done moving to directory
"/var/lib/condor/execute/dir_31181"
07/24/13 16:56:09 Job has WantIOProxy=true
07/24/13 16:56:09 Initialized IO Proxy.
07/24/13 16:56:09 LocalUserLog::initFromJobAd: path_attr =
StarterUserLog
07/24/13 16:56:09 LocalUserLog::initFromJobAd: xml_attr =
StarterUserLogUseXML
07/24/13 16:56:09 No StarterUserLog found in job ClassAd
07/24/13 16:56:09 Starter will not write a local UserLog
07/24/13 16:56:09 Done setting resource limits
07/24/13 16:56:09 Changing the executable name
07/24/13 16:56:09 entering FileTransfer::Init
07/24/13 16:56:09 entering FileTransfer::SimpleInit
07/24/13 16:56:09 FILETRANSFER: protocol "http" handled by
"/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "ftp" handled by
"/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "file" handled by
"/usr/lib/condor/libexec/curl_plugin"
07/24/13 16:56:09 FILETRANSFER: protocol "data" handled by
"/usr/lib/condor/libexec/data_plugin"
07/24/13 16:56:09 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:09 TransferIntermediate="(none)"
07/24/13 16:56:09 entering FileTransfer::DownloadFiles
07/24/13 16:56:09 entering FileTransfer::Download
07/24/13 16:56:09 FileTransfer: created download transfer process
with id 31184
07/24/13 16:56:09 entering FileTransfer::DownloadThread
07/24/13 16:56:09 entering FileTransfer::DoDownload sync=1
07/24/13 16:56:09 DaemonCore: No more children processes to reap.
07/24/13 16:56:09 DaemonCore: in SendAliveToParent()
07/24/13 16:56:09 REMAP: begin with rules:
07/24/13 16:56:09 REMAP: 0: condor_exec.exe
07/24/13 16:56:09 REMAP: res is 0 ->  !
07/24/13 16:56:09 Sending GoAhead for 192.168.100.160 to send
/var/lib/condor/execute/dir_31181/condor_exec.exe and all further files.
07/24/13 16:56:09 Completed DC_CHILDALIVE to daemon at
<192.168.100.161:53285>
07/24/13 16:56:09 DaemonCore: Leaving SendAliveToParent() - success
07/24/13 16:56:09 Received GoAhead from peer to receive
/var/lib/condor/execute/dir_31181/condor_exec.exe.
07/24/13 16:56:09 get_file(): going to write to filename
/var/lib/condor/execute/dir_31181/condor_exec.exe
07/24/13 16:56:09 get_file: Receiving 31136 bytes
07/24/13 16:56:09 get_file: wrote 31136 bytes to file
07/24/13 16:56:09 ReliSock::get_file_with_permissions(): going to set
permissions 755
07/24/13 16:56:09 DaemonCore: No more children processes to reap.
07/24/13 16:56:09 File transfer completed successfully.
07/24/13 16:56:09 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:10 Calling client FileTransfer handler function.
07/24/13 16:56:10 HOOK_PREPARE_JOB not configured.
07/24/13 16:56:10 Job 90.0 set to execute immediately
07/24/13 16:56:10 Starting a PARALLEL universe job with ID: 90.0
07/24/13 16:56:10 In OsProc::OsProc()
07/24/13 16:56:10 Main job KillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Main job RmKillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Main job HoldKillSignal: 15 (SIGTERM)
07/24/13 16:56:10 Constructor of ParallelProc::ParallelProc
07/24/13 16:56:10 in ParallelProc::StartJob()
07/24/13 16:56:10 Found Node = 0 in job ad
07/24/13 16:56:10 ParallelProc::addEnvVars()
07/24/13 16:56:10 No Path in ad, $PATH in env
07/24/13 16:56:10 before: /bin:/sbin:/usr/bin:/usr/sbin
07/24/13 16:56:10 New env:
PATH=/usr/bin:/bin:/sbin:/usr/bin:/usr/sbin _CONDOR_PROCNO=0
CONDOR_CONFIG=/etc/condor/condor_config _CONDOR_NPROCS=4
_CONDOR_REMOTE_SPOOL_DIR=/var/lib/condor/spool/90/0/cluster90.proc0.subproc0
07/24/13 16:56:10 in VanillaProc::StartJob()
07/24/13 16:56:10 Requesting cgroup
htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx for job.
07/24/13 16:56:10 Value of RequestedChroot is unset.
07/24/13 16:56:10 PID namespace option: false
07/24/13 16:56:10 in OsProc::StartJob()
07/24/13 16:56:10 IWD: /var/lib/condor/execute/dir_31181
07/24/13 16:56:10 Input file: /dev/null
07/24/13 16:56:10 Output file:
/var/lib/condor/execute/dir_31181/_condor_stdout
07/24/13 16:56:10 Error file:
/var/lib/condor/execute/dir_31181/_condor_stderr
07/24/13 16:56:10 About to exec
/var/lib/condor/execute/dir_31181/condor_exec.exe 15
07/24/13 16:56:10 Env = TEMP=/var/lib/condor/execute/dir_31181
_CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_31181
_CONDOR_SLOT=slot1_1 TMPDIR=/var/lib/condor/execute/dir_31181
_CONDOR_PROCNO=0 _CONDOR_JOB_PIDS=
TMP=/var/lib/condor/execute/dir_31181
_CONDOR_REMOTE_SPOOL_DIR=/var/lib/condor/spool/90/0/cluster90.proc0.subproc0
_CONDOR_JOB_AD=/var/lib/condor/execute/dir_31181/.job.ad
_CONDOR_JOB_IWD=/var/lib/condor/execute/dir_31181
CONDOR_CONFIG=/etc/condor/condor_config
PATH=/usr/bin:/bin:/sbin:/usr/bin:/usr/sbin
_CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_31181/.machine.ad
_CONDOR_NPROCS=4
07/24/13 16:56:10 Setting job's virtual memory rlimit to 17179869184
megabytes
07/24/13 16:56:10 ENFORCE_CPU_AFFINITY not true, not setting affinity
07/24/13 16:56:10 Running job as user username
07/24/13 16:56:10 track_family_via_cgroup: Tracking PID 31185 via
cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxx
07/24/13 16:56:10 About to tell ProcD to track family with root 31185
via cgroup
htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxx
07/24/13 16:56:10 Create_Process succeeded, pid=31185
07/24/13 16:56:10 Initializing cgroup library.
07/24/13 16:56:18 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:18 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:18 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:18 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:18 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:18 Entering JICShadow::updateShadow()
07/24/13 16:56:18 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:18 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:18 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:18 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:18 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:18 Sent job ClassAd update to startd.
07/24/13 16:56:18 Leaving JICShadow::updateShadow(): success
07/24/13 16:56:25 DaemonCore: No more children processes to reap.
07/24/13 16:56:25 Process exited, pid=31185, status=0
07/24/13 16:56:25 Inside VanillaProc::JobReaper()
07/24/13 16:56:25 Inside OsProc::JobReaper()
07/24/13 16:56:25 Inside UserProc::JobReaper()
07/24/13 16:56:25 Reaper: all=1 handled=1 ShuttingDown=0
07/24/13 16:56:25 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:25 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:25 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:25 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:25 HOOK_JOB_EXIT not configured.
07/24/13 16:56:25 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:25 In ParallelProc::PublishUpdateAd()
07/24/13 16:56:25 In VanillaProc::PublishUpdateAd()
07/24/13 16:56:25 Inside OsProc::PublishUpdateAd()
07/24/13 16:56:25 Inside UserProc::PublishUpdateAd()
07/24/13 16:56:25 Entering JICShadow::updateShadow()
07/24/13 16:56:25 Sent job ClassAd update to startd.
07/24/13 16:56:25 Leaving JICShadow::updateShadow(): success
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 JICShadow::transferOutput(void): Transferring...
07/24/13 16:56:25 Begin transfer of sandbox to shadow.
07/24/13 16:56:25 entering FileTransfer::UploadFiles (final_transfer=1)
07/24/13 16:56:25 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Sending new file _condor_stdout, time==1374677770,
size==0
07/24/13 16:56:25 Skipping file in exception list: .job.ad
07/24/13 16:56:25 Sending new file _condor_stderr, time==1374677770,
size==0
07/24/13 16:56:25 Skipping file in exception list: .machine.ad
07/24/13 16:56:25 Skipping file chirp.config, t:
1374677769==1374677769, s: 54==54
07/24/13 16:56:25 Skipping file condor_exec.exe, t:
1374677769==1374677769, s: 31136==31136
07/24/13 16:56:25 FileTransfer::UploadFiles: sent
TransKey=1#51efeb09437ffa2dcc159bc
07/24/13 16:56:25 entering FileTransfer::Upload
07/24/13 16:56:25 entering FileTransfer::DoUpload
07/24/13 16:56:25 DoUpload: sending file _condor_stdout
07/24/13 16:56:25 FILETRANSFER: outgoing file_command is 1 for
_condor_stdout
07/24/13 16:56:25 Received GoAhead from peer to send
/var/lib/condor/execute/dir_31181/_condor_stdout.
07/24/13 16:56:25 Sending GoAhead for 192.168.100.160 to receive
/var/lib/condor/execute/dir_31181/_condor_stdout and all further files.
07/24/13 16:56:25 ReliSock::put_file_with_permissions(): going to
send permissions 100644
07/24/13 16:56:25 put_file: going to send from filename
/var/lib/condor/execute/dir_31181/_condor_stdout
07/24/13 16:56:25 put_file: Found file size 0
07/24/13 16:56:25 put_file: sending 0 bytes
07/24/13 16:56:25 ReliSock: put_file: sent 0 bytes
07/24/13 16:56:25 DoUpload: sending file _condor_stderr
07/24/13 16:56:25 FILETRANSFER: outgoing file_command is 1 for
_condor_stderr
07/24/13 16:56:25 Received GoAhead from peer to send
/var/lib/condor/execute/dir_31181/_condor_stderr.
07/24/13 16:56:25 ReliSock::put_file_with_permissions(): going to
send permissions 100644
07/24/13 16:56:25 put_file: going to send from filename
/var/lib/condor/execute/dir_31181/_condor_stderr
07/24/13 16:56:25 put_file: Found file size 0
07/24/13 16:56:25 put_file: sending 0 bytes
07/24/13 16:56:25 ReliSock: put_file: sent 0 bytes
07/24/13 16:56:25 DoUpload: exiting at 3294
07/24/13 16:56:25 End transfer of sandbox to shadow.
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Inside OsProc::JobExit()
07/24/13 16:56:25 Initializing Directory: curr_dir =
/var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Notifying exit status=0 reason=100
07/24/13 16:56:25 Sent job ClassAd update to startd.
07/24/13 16:56:25 Hold all jobs
07/24/13 16:56:25 All jobs were removed due to OOM event.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Closing event FD pipe 65536.
07/24/13 16:56:25 ShutdownFast all jobs.
07/24/13 16:56:25 Got ShutdownFast when no jobs running.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 Got SIGQUIT.  Performing fast shutdown.
07/24/13 16:56:25 ShutdownFast all jobs.
07/24/13 16:56:25 Got ShutdownFast when no jobs running.
07/24/13 16:56:25 Inside JICShadow::transferOutput(void)
07/24/13 16:56:25 Inside JICShadow::transferOutputMopUp(void)
07/24/13 16:56:25 dirscat: dirpath = /
07/24/13 16:56:25 dirscat: subdir = /var/lib/condor/execute
07/24/13 16:56:25 Initializing Directory: curr_dir =
/var/lib/condor/execute/
07/24/13 16:56:25 Removing /var/lib/condor/execute/dir_31181
07/24/13 16:56:25 Attempting to remove
/var/lib/condor/execute/dir_31181 as SuperUser (root)
07/24/13 16:56:25 **** condor_starter (condor_STARTER) pid 31181
EXITING WITH STATUS 0

07/24/13 12:47:39 Initializing cgroup library.
07/24/13 12:47:44 DaemonCore: No more children processes to reap.
07/24/13 12:47:44 Process exited, pid=32686, status=0
07/24/13 12:47:44 Inside VanillaProc::JobReaper()
07/24/13 12:47:44 Inside OsProc::JobReaper()
07/24/13 12:47:44 Inside UserProc::JobReaper()
07/24/13 12:47:44 Reaper: all=1 handled=1 ShuttingDown=0
07/24/13 12:47:44 In VanillaProc::PublishUpdateAd()
07/24/13 12:47:44 Inside OsProc::PublishUpdateAd()
07/24/13 12:47:44 Inside UserProc::PublishUpdateAd()
07/24/13 12:47:44 HOOK_JOB_EXIT not configured.
07/24/13 12:47:44 In VanillaProc::PublishUpdateAd()
07/24/13 12:47:44 Inside OsProc::PublishUpdateAd()
07/24/13 12:47:44 Inside UserProc::PublishUpdateAd()
07/24/13 12:47:44 Entering JICShadow::updateShadow()
07/24/13 12:47:44 Sent job ClassAd update to startd.
07/24/13 12:47:44 Leaving JICShadow::updateShadow(): success
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 JICShadow::transferOutput(void): Transferring...
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
07/24/13 12:47:44 Inside OsProc::JobExit()
07/24/13 12:47:44 Notifying exit status=0 reason=100
07/24/13 12:47:44 Sent job ClassAd update to startd.
07/24/13 12:47:44 Hold all jobs
07/24/13 12:47:44 All jobs were removed due to OOM event.
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)
07/24/13 12:47:44 Closing event FD pipe 0.
07/24/13 12:47:44 Close_Pipe on invalid pipe end: 0
07/24/13 12:47:44 ERROR "Close_Pipe error" at line 2104 in file
/slots/01/dir_5373/userdir/src/condor_daemon_core.V6/daemon_core.cpp
07/24/13 12:47:44 ShutdownFast all jobs.
07/24/13 12:47:44 Got ShutdownFast when no jobs running.
07/24/13 12:47:44 Inside JICShadow::transferOutput(void)
07/24/13 12:47:44 Inside JICShadow::transferOutputMopUp(void)

It seems an event is fired to the OOM eventfd for some reason (the
cgroup itself being destroyed, perhaps?). Has anybody else seen the
same issue? Could it be a change in the kernel's cgroups interface?

Thanks,

Joan

--
--------------------------------------------------------------------------

Joan Josep Piles Contreras -  Analista de sistemas
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 876 55 51 47 (ext. 845147)
http://i3a.unizar.es  --jpiles@xxxxxxxxx
--------------------------------------------------------------------------








--
--------------------------------------------------------------------------
Joan Josep Piles Contreras -  Analista de sistemas
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 876 55 51 47 (ext. 845147)
http://i3a.unizar.es  --jpiles@xxxxxxxxx
--------------------------------------------------------------------------



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685