[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Frequent Close_Pipe errors



On 10/8/2014 2:50 PM, andrew.lahiff@xxxxxxxxxx wrote:
Hi Todd,

Yes, we are using cgroups:

[root@lcg1402 ~]# condor_config_val BASE_CGROUP
htcondor

Thanks,
Andrew.


Thanks. And just to be sure I am barking up the right tree (I think I am), besides cgroups, are you using either a) glexec, or b) streaming the job stdout/err ?

thanks again,
Todd


________________________________________
From: Todd Tannenbaum [tannenba@xxxxxxxxxxx]
Sent: Wednesday, October 08, 2014 8:39 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Frequent Close_Pipe errors

Hi Andrew,

Question re the below:

On your execute machine(s), what does
    condor_config_val BASE_CGROUP
report?

In other words, I am wondering if you are using HTCondor's cgroup
process tracking....  if you answer is yes, based on looking over the
source code, I have a guess as to what is going wrong.

Thanks
Todd


On 10/8/2014 10:17 AM, andrew.lahiff@xxxxxxxxxx wrote:
Hi,

Over the past month or so we’ve been getting large numbers of held jobs, whereas previously we normally had close to zero. What seems to be happening is:
(1) a job is running successfully
(2) the user’s process finishes
(3) in the ShadowLog we see

10/08/14 14:26:26 (860002.0) (30272): ERROR "Error from slot1@xxxxxxxxxxxxxx: Close_Pipe error" at line 558 in file /slots/03/dir_2866/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp

and at the same time in the StarterLog we see:

10/08/14 14:26:26 (pid:18932) Close_Pipe on invalid pipe end: 0
10/08/14 14:26:26 (pid:18932) ERROR "Close_Pipe error" at line 2089 in file /slots/04/dir_38687/userdir/src/condor_daemon_core.V6/daemon_core.cpp

(4) the job gets restarted on a different worker node, then for some reason the Shadow has permission problems trying to write to output files, then the job goes into the held state.

Below I show an extract from the ShadowLog [1] and StarterLog [2].

We are currently using 8.2.2. I think the problems started occurring at around the time we upgraded the worker nodes from 8.0.6 to 8.2.2 (although I'm not 100% sure).

Has anyone seen this before or know what could be causing these Close_Pipe errors?

Many Thanks,
Andrew.

[1]
10/07/14 12:43:39 Initializing a VANILLA shadow for job 860002.0
10/07/14 12:43:39 (860002.0) (30272): Request to run on slot1@xxxxxxxxxxxxxx <130.a.b.c:35760> was ACCEPTED
10/07/14 12:43:40 (860002.0) (30272): File transfer completed successfully.
10/08/14 14:26:26 (860002.0) (30272): File transfer completed successfully.
10/08/14 14:26:26 (860002.0) (30272): ERROR "Error from slot1@xxxxxxxxxxxxxx: Close_Pipe error" at line 558 in file /slots/03/dir_2866/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp
10/08/14 14:27:33 Initializing a VANILLA shadow for job 860002.0
10/08/14 14:27:33 (860002.0) (49167): Request to run on slot1@xxxxxxxxxxxxxx <130.a.b.c:45410> was ACCEPTED
10/08/14 14:27:33 (860002.0) (49167): File transfer completed successfully.
10/08/14 14:37:13 (860002.0) (49167): get_file(): Failed to open file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.out, errno = 13: Permission denied.
10/08/14 14:37:13 (860002.0) (49167): get_file(): consumed 57681 bytes of file transmission
10/08/14 14:37:13 (860002.0) (49167): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 130.a.b.c failed to write to file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.out: (errno 13) Permission denied
10/08/14 14:37:13 (860002.0) (49167): get_file(): Failed to open file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err, errno = 13: Permission denied.
10/08/14 14:37:13 (860002.0) (49167): get_file(): consumed 0 bytes of file transmission
10/08/14 14:37:13 (860002.0) (49167): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 130.a.b.c failed to write to file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err: (errno 13) Permission denied
10/08/14 14:37:13 (860002.0) (49167): Mock terminating job 860002.0: exited_by_signal=FALSE, exit_code=0 OR exit_signal=0, core_dumped=FALSE, exit_reason="Exited normally"
10/08/14 14:37:13 (860002.0) (49167): File transfer failed (status=0).
10/08/14 14:37:13 (860002.0) (49167): Job 860002.0 going into Hold state (code 12,13): Error from slot1@xxxxxxxxxxxxxx: STARTER at 130.a.b.c failed to send file(s) to <130.a.b.c:54052>; SHADOW at 130.a.b.c failed to write to file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err: (errno 13) Permission denied
10/08/14 14:37:13 (860002.0) (49167): **** condor_shadow (condor_SHADOW) pid 49167 EXITING WITH STATUS 112

[2]
10/07/14 12:43:40 (pid:18932) ******************************************************
10/07/14 12:43:40 (pid:18932) ** condor_starter (CONDOR_STARTER) STARTING UP
10/07/14 12:43:40 (pid:18932) ** /usr/sbin/condor_starter
10/07/14 12:43:40 (pid:18932) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
10/07/14 12:43:40 (pid:18932) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
10/07/14 12:43:40 (pid:18932) ** $CondorVersion: 8.2.2 Aug 07 2014 BuildID: 265643 $
10/07/14 12:43:40 (pid:18932) ** $CondorPlatform: x86_64_RedHat6 $
10/07/14 12:43:40 (pid:18932) ** PID = 18932
10/07/14 12:43:40 (pid:18932) ** Log last touched 10/7 12:39:50
10/07/14 12:43:40 (pid:18932) ******************************************************
10/07/14 12:43:40 (pid:18932) Using config source: /etc/condor/condor_config
10/07/14 12:43:40 (pid:18932) Using local config sources:
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/10security.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/20wn.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/20wn_startd_history.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/21slots.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/31cgroups.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/32mountunderscratch.config
10/07/14 12:43:40 (pid:18932)    /etc/condor/condor_config.local
10/07/14 12:43:40 (pid:18932) config Macros = 128, Sorted = 127, StringBytes = 4666, TablesBytes = 4704
10/07/14 12:43:40 (pid:18932) CLASSAD_CACHING is OFF
10/07/14 12:43:40 (pid:18932) Daemon Log is logging: D_ALWAYS D_ERROR
10/07/14 12:43:40 (pid:18932) DaemonCore: command socket at <130.a.b.c:38881>
10/07/14 12:43:40 (pid:18932) DaemonCore: private command socket at <130.a.b.c:38881>
10/07/14 12:43:40 (pid:18932) Communicating with shadow <130.a.b.c:34061?noUDP>
10/07/14 12:43:40 (pid:18932) Submitting machine is "arc-ce04.domain"
10/07/14 12:43:40 (pid:18932) setting the orig job name in starter
10/07/14 12:43:40 (pid:18932) setting the orig job iwd in starter
10/07/14 12:43:40 (pid:18932) Chirp config summary: IO false, Updates false, Delayed updates true.
10/07/14 12:43:40 (pid:18932) Initialized IO Proxy.
10/07/14 12:43:40 (pid:18932) Done setting resource limits
10/07/14 12:43:40 (pid:18932) File transfer completed successfully.
10/07/14 12:43:41 (pid:18932) Job 860002.0 set to execute immediately
10/07/14 12:43:41 (pid:18932) Starting a VANILLA universe job with ID: 860002.0
10/07/14 12:43:41 (pid:18932) IWD: /pool/condor/dir_18932
10/07/14 12:43:41 (pid:18932) Output file: /pool/condor/dir_18932/_condor_stdout
10/07/14 12:43:41 (pid:18932) Error file: /pool/condor/dir_18932/_condor_stdout
10/07/14 12:43:41 (pid:18932) Renice expr "10" evaluated to 10
10/07/14 12:43:41 (pid:18932) About to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe
10/07/14 12:43:41 (pid:18932) Running job as user tlhcb005
10/07/14 12:43:41 (pid:18932) Create_Process succeeded, pid=18953
10/08/14 14:26:26 (pid:18932) Process exited, pid=18953, status=0
10/08/14 14:26:26 (pid:18932) Close_Pipe on invalid pipe end: 0
10/08/14 14:26:26 (pid:18932) ERROR "Close_Pipe error" at line 2089 in file /slots/04/dir_38687/userdir/src/condor_daemon_core.V6/daemon_core.cpp
10/08/14 14:26:26 (pid:18932) ShutdownFast all jobs.




--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685