[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Frequent Close_Pipe errors



Especially as the use of cgroup support in HTCondor has grown in popularity, some folks have started to observe problems similar to what Andrew Lahiff reported below, and they have been looking at this email thread in the archives for a resolution. So just for the email record, note the problems reported below were fixed starting with HTCondor v8.2.7. Folks interested in the details of the fix can go to URL
  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4774
Much thanks to Andrew for helping us to ultimately track down and fix the problem.

regards
Todd

>>> On 10/8/2014 10:17 AM, andrew.lahiff@xxxxxxxxxx wrote:
Hi,

Over the past month or so we’ve been getting large numbers of held
jobs, whereas previously we normally had close to zero. What seems to
be happening is:
(1) a job is running successfully
(2) the user’s process finishes
(3) in the ShadowLog we see

10/08/14 14:26:26 (860002.0) (30272): ERROR "Error from
slot1@xxxxxxxxxxxxxx: Close_Pipe error" at line 558 in file
/slots/03/dir_2866/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp

and at the same time in the StarterLog we see:

10/08/14 14:26:26 (pid:18932) Close_Pipe on invalid pipe end: 0
10/08/14 14:26:26 (pid:18932) ERROR "Close_Pipe error" at line 2089
in file
/slots/04/dir_38687/userdir/src/condor_daemon_core.V6/daemon_core.cpp

(4) the job gets restarted on a different worker node, then for some
reason the Shadow has permission problems trying to write to output
files, then the job goes into the held state.

Below I show an extract from the ShadowLog [1] and StarterLog [2].

We are currently using 8.2.2. I think the problems started occurring
at around the time we upgraded the worker nodes from 8.0.6 to 8.2.2
(although I'm not 100% sure).

Has anyone seen this before or know what could be causing these
Close_Pipe errors?

Many Thanks,
Andrew.

[1]
10/07/14 12:43:39 Initializing a VANILLA shadow for job 860002.0
10/07/14 12:43:39 (860002.0) (30272): Request to run on
slot1@xxxxxxxxxxxxxx <130.a.b.c:35760> was ACCEPTED
10/07/14 12:43:40 (860002.0) (30272): File transfer completed
successfully.
10/08/14 14:26:26 (860002.0) (30272): File transfer completed
successfully.
10/08/14 14:26:26 (860002.0) (30272): ERROR "Error from
slot1@xxxxxxxxxxxxxx: Close_Pipe error" at line 558 in file
/slots/03/dir_2866/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp
10/08/14 14:27:33 Initializing a VANILLA shadow for job 860002.0
10/08/14 14:27:33 (860002.0) (49167): Request to run on
slot1@xxxxxxxxxxxxxx <130.a.b.c:45410> was ACCEPTED
10/08/14 14:27:33 (860002.0) (49167): File transfer completed
successfully.
10/08/14 14:37:13 (860002.0) (49167): get_file(): Failed to open file
/var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.out,
errno = 13: Permission denied.
10/08/14 14:37:13 (860002.0) (49167): get_file(): consumed 57681
bytes of file transmission
10/08/14 14:37:13 (860002.0) (49167): DoDownload: consuming rest of
transfer and failing after encountering the following error: SHADOW
at 130.a.b.c failed to write to file
/var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.out:
(errno 13) Permission denied
10/08/14 14:37:13 (860002.0) (49167): get_file(): Failed to open file
/var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err,
errno = 13: Permission denied.
10/08/14 14:37:13 (860002.0) (49167): get_file(): consumed 0 bytes of
file transmission
10/08/14 14:37:13 (860002.0) (49167): DoDownload: consuming rest of
transfer and failing after encountering the following error: SHADOW
at 130.a.b.c failed to write to file
/var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err:
(errno 13) Permission denied
10/08/14 14:37:13 (860002.0) (49167): Mock terminating job 860002.0:
exited_by_signal=FALSE, exit_code=0 OR exit_signal=0,
core_dumped=FALSE, exit_reason="Exited normally"
10/08/14 14:37:13 (860002.0) (49167): File transfer failed (status=0).
10/08/14 14:37:13 (860002.0) (49167): Job 860002.0 going into Hold
state (code 12,13): Error from slot1@xxxxxxxxxxxxxx: STARTER at
130.a.b.c failed to send file(s) to <130.a.b.c:54052>; SHADOW at
130.a.b.c failed to write to file
/var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err:
(errno 13) Permission denied
10/08/14 14:37:13 (860002.0) (49167): **** condor_shadow
(condor_SHADOW) pid 49167 EXITING WITH STATUS 112

[2]
10/07/14 12:43:40 (pid:18932)
******************************************************
10/07/14 12:43:40 (pid:18932) ** condor_starter (CONDOR_STARTER)
STARTING UP
10/07/14 12:43:40 (pid:18932) ** /usr/sbin/condor_starter
10/07/14 12:43:40 (pid:18932) ** SubsystemInfo: name=STARTER
type=STARTER(8) class=DAEMON(1)
10/07/14 12:43:40 (pid:18932) ** Configuration: subsystem:STARTER
local:<NONE> class:DAEMON
10/07/14 12:43:40 (pid:18932) ** $CondorVersion: 8.2.2 Aug 07 2014
BuildID: 265643 $
10/07/14 12:43:40 (pid:18932) ** $CondorPlatform: x86_64_RedHat6 $
10/07/14 12:43:40 (pid:18932) ** PID = 18932
10/07/14 12:43:40 (pid:18932) ** Log last touched 10/7 12:39:50
10/07/14 12:43:40 (pid:18932)
******************************************************
10/07/14 12:43:40 (pid:18932) Using config source:
/etc/condor/condor_config
10/07/14 12:43:40 (pid:18932) Using local config sources:
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/10security.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/20wn.config
10/07/14 12:43:40 (pid:18932)
/etc/condor/config.d/20wn_startd_history.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/21slots.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/31cgroups.config
10/07/14 12:43:40 (pid:18932)
/etc/condor/config.d/32mountunderscratch.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/condor_config.local
10/07/14 12:43:40 (pid:18932) config Macros = 128, Sorted = 127,
StringBytes = 4666, TablesBytes = 4704
10/07/14 12:43:40 (pid:18932) CLASSAD_CACHING is OFF
10/07/14 12:43:40 (pid:18932) Daemon Log is logging: D_ALWAYS D_ERROR
10/07/14 12:43:40 (pid:18932) DaemonCore: command socket at
<130.a.b.c:38881>
10/07/14 12:43:40 (pid:18932) DaemonCore: private command socket at
<130.a.b.c:38881>
10/07/14 12:43:40 (pid:18932) Communicating with shadow
<130.a.b.c:34061?noUDP>
10/07/14 12:43:40 (pid:18932) Submitting machine is "arc-ce04.domain"
10/07/14 12:43:40 (pid:18932) setting the orig job name in starter
10/07/14 12:43:40 (pid:18932) setting the orig job iwd in starter
10/07/14 12:43:40 (pid:18932) Chirp config summary: IO false, Updates
false, Delayed updates true.
10/07/14 12:43:40 (pid:18932) Initialized IO Proxy.
10/07/14 12:43:40 (pid:18932) Done setting resource limits
10/07/14 12:43:40 (pid:18932) File transfer completed successfully.
10/07/14 12:43:41 (pid:18932) Job 860002.0 set to execute immediately
10/07/14 12:43:41 (pid:18932) Starting a VANILLA universe job with
ID: 860002.0
10/07/14 12:43:41 (pid:18932) IWD: /pool/condor/dir_18932
10/07/14 12:43:41 (pid:18932) Output file:
/pool/condor/dir_18932/_condor_stdout
10/07/14 12:43:41 (pid:18932) Error file:
/pool/condor/dir_18932/_condor_stdout
10/07/14 12:43:41 (pid:18932) Renice expr "10" evaluated to 10
10/07/14 12:43:41 (pid:18932) About to exec
/usr/libexec/condor/condor_pid_ns_init condor_exec.exe
10/07/14 12:43:41 (pid:18932) Running job as user tlhcb005
10/07/14 12:43:41 (pid:18932) Create_Process succeeded, pid=18953
10/08/14 14:26:26 (pid:18932) Process exited, pid=18953, status=0
10/08/14 14:26:26 (pid:18932) Close_Pipe on invalid pipe end: 0
10/08/14 14:26:26 (pid:18932) ERROR "Close_Pipe error" at line 2089
in file
/slots/04/dir_38687/userdir/src/condor_daemon_core.V6/daemon_core.cpp
10/08/14 14:26:26 (pid:18932) ShutdownFast all jobs.