
[HTCondor-users] Frequent Close_Pipe errors



Hi,

Over the past month or so we’ve been getting large numbers of held jobs, whereas previously we normally had close to zero. What seems to be happening is:
(1) a job is running successfully
(2) the user’s process finishes
(3) in the ShadowLog we see

10/08/14 14:26:26 (860002.0) (30272): ERROR "Error from slot1@xxxxxxxxxxxxxx: Close_Pipe error" at line 558 in file /slots/03/dir_2866/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp

and at the same time in the StarterLog we see:

10/08/14 14:26:26 (pid:18932) Close_Pipe on invalid pipe end: 0
10/08/14 14:26:26 (pid:18932) ERROR "Close_Pipe error" at line 2089 in file /slots/04/dir_38687/userdir/src/condor_daemon_core.V6/daemon_core.cpp

(4) the job is restarted on a different worker node, the Shadow then runs into permission problems writing the output files back, and the job goes into the held state (hold code 12,13 in the ShadowLog extract below).
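
In case it's useful, the hold reasons can be listed straight from the queue with something like the following; the "(code 12,13)" in the hold message is the HoldReasonCode/HoldReasonSubCode pair, with 13 matching the errno from the permission-denied error:

    condor_q -constraint 'JobStatus == 5' \
        -autoformat ClusterId ProcId HoldReasonCode HoldReasonSubCode HoldReason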

Below are extracts from the ShadowLog [1] and the StarterLog [2].

We are currently using 8.2.2. I think the problems started around the time we upgraded the worker nodes from 8.0.6 to 8.2.2, although I'm not 100% sure of the timing.
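
For completeness, the Condor version running on each worker node can be confirmed with something like:

    condor_status -autoformat Machine CondorVersion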

Has anyone seen this before, or does anyone know what could be causing these Close_Pipe errors?
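
If more detailed logs would help, I can turn up the debug level on a handful of nodes; presumably something along these lines in /etc/condor/config.d/, followed by a condor_reconfig:

    STARTER_DEBUG = D_FULLDEBUG   # on the worker nodes
    SHADOW_DEBUG  = D_FULLDEBUG   # on the CE, where the shadows run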

Many Thanks,
Andrew.

[1]
10/07/14 12:43:39 Initializing a VANILLA shadow for job 860002.0
10/07/14 12:43:39 (860002.0) (30272): Request to run on slot1@xxxxxxxxxxxxxx <130.a.b.c:35760> was ACCEPTED
10/07/14 12:43:40 (860002.0) (30272): File transfer completed successfully.
10/08/14 14:26:26 (860002.0) (30272): File transfer completed successfully.
10/08/14 14:26:26 (860002.0) (30272): ERROR "Error from slot1@xxxxxxxxxxxxxx: Close_Pipe error" at line 558 in file /slots/03/dir_2866/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp
10/08/14 14:27:33 Initializing a VANILLA shadow for job 860002.0
10/08/14 14:27:33 (860002.0) (49167): Request to run on slot1@xxxxxxxxxxxxxx <130.a.b.c:45410> was ACCEPTED
10/08/14 14:27:33 (860002.0) (49167): File transfer completed successfully.
10/08/14 14:37:13 (860002.0) (49167): get_file(): Failed to open file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.out, errno = 13: Permission denied.
10/08/14 14:37:13 (860002.0) (49167): get_file(): consumed 57681 bytes of file transmission
10/08/14 14:37:13 (860002.0) (49167): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 130.a.b.c failed to write to file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.out: (errno 13) Permission denied
10/08/14 14:37:13 (860002.0) (49167): get_file(): Failed to open file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err, errno = 13: Permission denied.
10/08/14 14:37:13 (860002.0) (49167): get_file(): consumed 0 bytes of file transmission
10/08/14 14:37:13 (860002.0) (49167): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 130.a.b.c failed to write to file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err: (errno 13) Permission denied
10/08/14 14:37:13 (860002.0) (49167): Mock terminating job 860002.0: exited_by_signal=FALSE, exit_code=0 OR exit_signal=0, core_dumped=FALSE, exit_reason="Exited normally"
10/08/14 14:37:13 (860002.0) (49167): File transfer failed (status=0).
10/08/14 14:37:13 (860002.0) (49167): Job 860002.0 going into Hold state (code 12,13): Error from slot1@xxxxxxxxxxxxxx: STARTER at 130.a.b.c failed to send file(s) to <130.a.b.c:54052>; SHADOW at 130.a.b.c failed to write to file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err: (errno 13) Permission denied
10/08/14 14:37:13 (860002.0) (49167): **** condor_shadow (condor_SHADOW) pid 49167 EXITING WITH STATUS 112

[2]
10/07/14 12:43:40 (pid:18932) ******************************************************
10/07/14 12:43:40 (pid:18932) ** condor_starter (CONDOR_STARTER) STARTING UP
10/07/14 12:43:40 (pid:18932) ** /usr/sbin/condor_starter
10/07/14 12:43:40 (pid:18932) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
10/07/14 12:43:40 (pid:18932) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
10/07/14 12:43:40 (pid:18932) ** $CondorVersion: 8.2.2 Aug 07 2014 BuildID: 265643 $
10/07/14 12:43:40 (pid:18932) ** $CondorPlatform: x86_64_RedHat6 $
10/07/14 12:43:40 (pid:18932) ** PID = 18932
10/07/14 12:43:40 (pid:18932) ** Log last touched 10/7 12:39:50
10/07/14 12:43:40 (pid:18932) ******************************************************
10/07/14 12:43:40 (pid:18932) Using config source: /etc/condor/condor_config
10/07/14 12:43:40 (pid:18932) Using local config sources:
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/10security.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/20wn.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/20wn_startd_history.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/21slots.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/31cgroups.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/32mountunderscratch.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/condor_config.local
10/07/14 12:43:40 (pid:18932) config Macros = 128, Sorted = 127, StringBytes = 4666, TablesBytes = 4704
10/07/14 12:43:40 (pid:18932) CLASSAD_CACHING is OFF
10/07/14 12:43:40 (pid:18932) Daemon Log is logging: D_ALWAYS D_ERROR
10/07/14 12:43:40 (pid:18932) DaemonCore: command socket at <130.a.b.c:38881>
10/07/14 12:43:40 (pid:18932) DaemonCore: private command socket at <130.a.b.c:38881>
10/07/14 12:43:40 (pid:18932) Communicating with shadow <130.a.b.c:34061?noUDP>
10/07/14 12:43:40 (pid:18932) Submitting machine is "arc-ce04.domain"
10/07/14 12:43:40 (pid:18932) setting the orig job name in starter
10/07/14 12:43:40 (pid:18932) setting the orig job iwd in starter
10/07/14 12:43:40 (pid:18932) Chirp config summary: IO false, Updates false, Delayed updates true.
10/07/14 12:43:40 (pid:18932) Initialized IO Proxy.
10/07/14 12:43:40 (pid:18932) Done setting resource limits
10/07/14 12:43:40 (pid:18932) File transfer completed successfully.
10/07/14 12:43:41 (pid:18932) Job 860002.0 set to execute immediately
10/07/14 12:43:41 (pid:18932) Starting a VANILLA universe job with ID: 860002.0
10/07/14 12:43:41 (pid:18932) IWD: /pool/condor/dir_18932
10/07/14 12:43:41 (pid:18932) Output file: /pool/condor/dir_18932/_condor_stdout
10/07/14 12:43:41 (pid:18932) Error file: /pool/condor/dir_18932/_condor_stdout
10/07/14 12:43:41 (pid:18932) Renice expr "10" evaluated to 10
10/07/14 12:43:41 (pid:18932) About to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe
10/07/14 12:43:41 (pid:18932) Running job as user tlhcb005
10/07/14 12:43:41 (pid:18932) Create_Process succeeded, pid=18953
10/08/14 14:26:26 (pid:18932) Process exited, pid=18953, status=0
10/08/14 14:26:26 (pid:18932) Close_Pipe on invalid pipe end: 0
10/08/14 14:26:26 (pid:18932) ERROR "Close_Pipe error" at line 2089 in file /slots/04/dir_38687/userdir/src/condor_daemon_core.V6/daemon_core.cpp
10/08/14 14:26:26 (pid:18932) ShutdownFast all jobs.

