
Re: [HTCondor-users] Frequent Close_Pipe errors



Hi Todd,

Unfortunately, jobs that don't use glexec have also had this problem, and all our jobs have:

StreamErr = false
StreamOut = false
WhenToTransferOutput = "ON_EXIT_OR_EVICT"

Note that we're using MOUNT_UNDER_SCRATCH and PID namespaces.
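In case it's useful, here is a minimal triage sketch (standard Unix tools only; the helper name and log path are made up, adjust to your LOG directory) for tallying how often the error fires per day, e.g. to see whether the rate lines up with the 8.0.6 -> 8.2.2 upgrade:

```shell
# Tally "Close_Pipe error" occurrences per day in a ShadowLog.
# The ShadowLog format puts the date in the first whitespace-separated
# field, e.g. "10/08/14 14:26:26 (860002.0) (30272): ERROR ...".
close_pipe_per_day() {
    grep 'Close_Pipe error' "$1" | awk '{print $1}' | sort | uniq -c
}

# Example (path is hypothetical):
#   close_pipe_per_day /var/log/condor/ShadowLog
```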

Thanks,
Andrew.

________________________________________
From: Todd Tannenbaum [tannenba@xxxxxxxxxxx]
Sent: Wednesday, October 08, 2014 9:00 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Frequent Close_Pipe errors

On 10/8/2014 2:50 PM, andrew.lahiff@xxxxxxxxxx wrote:
> Hi Todd,
>
> Yes, we are using cgroups:
>
> [root@lcg1402 ~]# condor_config_val BASE_CGROUP
> htcondor
>
> Thanks,
> Andrew.
>

Thanks.  And just to be sure I am barking up the right tree (I think I
am): besides cgroups, are you using either a) glexec, or b) streaming
the job stdout/err?
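One quick way to check the streaming question across a whole queue, sketched here against a saved `condor_q -long` dump (the `streaming_jobs` helper name is made up; `ClusterId`, `StreamOut`, and `StreamErr` are standard job ClassAd attributes):

```shell
# List ClusterIds of jobs that have stdout/err streaming enabled,
# given a file produced by e.g.:  condor_q -long > jobs.txt
# Assumes the usual condor_q -long layout: one "Attr = Value" per line,
# ads separated by blank lines, attributes in alphabetical order
# (so ClusterId appears before StreamErr/StreamOut within each ad).
streaming_jobs() {
    awk -F' = ' '/^ClusterId/ {id=$2}
                 /^Stream(Out|Err) = true/ {print id}' "$1" | sort -u
}

# Example (file name is hypothetical):
#   streaming_jobs jobs.txt
```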

thanks again,
Todd


> ________________________________________
> From: Todd Tannenbaum [tannenba@xxxxxxxxxxx]
> Sent: Wednesday, October 08, 2014 8:39 PM
> To: HTCondor-Users Mail List
> Subject: Re: [HTCondor-users] Frequent Close_Pipe errors
>
> Hi Andrew,
>
> Question re the below:
>
> On your execute machine(s), what does
>     condor_config_val BASE_CGROUP
> report?
>
> In other words, I am wondering if you are using HTCondor's cgroup
> process tracking... if your answer is yes, then based on looking over
> the source code, I have a guess as to what is going wrong.
>
> Thanks
> Todd
>
>
> On 10/8/2014 10:17 AM, andrew.lahiff@xxxxxxxxxx wrote:
>> Hi,
>>
>> Over the past month or so we’ve been getting large numbers of held jobs, whereas previously we normally had close to zero. What seems to be happening is:
>> (1) a job is running successfully
>> (2) the user’s process finishes
>> (3) in the ShadowLog we see
>>
>> 10/08/14 14:26:26 (860002.0) (30272): ERROR "Error from slot1@xxxxxxxxxxxxxx: Close_Pipe error" at line 558 in file /slots/03/dir_2866/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp
>>
>> and at the same time in the StarterLog we see:
>>
>> 10/08/14 14:26:26 (pid:18932) Close_Pipe on invalid pipe end: 0
>> 10/08/14 14:26:26 (pid:18932) ERROR "Close_Pipe error" at line 2089 in file /slots/04/dir_38687/userdir/src/condor_daemon_core.V6/daemon_core.cpp
>>
>> (4) the job is restarted on a different worker node, the Shadow then has permission problems trying to write to the output files, and the job goes into the held state.
>>
>> Below I show an extract from the ShadowLog [1] and StarterLog [2].
>>
>> We are currently using 8.2.2. I think the problems started occurring at around the time we upgraded the worker nodes from 8.0.6 to 8.2.2 (although I'm not 100% sure).
>>
>> Has anyone seen this before or know what could be causing these Close_Pipe errors?
>>
>> Many Thanks,
>> Andrew.
>>
>> [1]
>> 10/07/14 12:43:39 Initializing a VANILLA shadow for job 860002.0
>> 10/07/14 12:43:39 (860002.0) (30272): Request to run on slot1@xxxxxxxxxxxxxx <130.a.b.c:35760> was ACCEPTED
>> 10/07/14 12:43:40 (860002.0) (30272): File transfer completed successfully.
>> 10/08/14 14:26:26 (860002.0) (30272): File transfer completed successfully.
>> 10/08/14 14:26:26 (860002.0) (30272): ERROR "Error from slot1@xxxxxxxxxxxxxx: Close_Pipe error" at line 558 in file /slots/03/dir_2866/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp
>> 10/08/14 14:27:33 Initializing a VANILLA shadow for job 860002.0
>> 10/08/14 14:27:33 (860002.0) (49167): Request to run on slot1@xxxxxxxxxxxxxx <130.a.b.c:45410> was ACCEPTED
>> 10/08/14 14:27:33 (860002.0) (49167): File transfer completed successfully.
>> 10/08/14 14:37:13 (860002.0) (49167): get_file(): Failed to open file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.out, errno = 13: Permission denied.
>> 10/08/14 14:37:13 (860002.0) (49167): get_file(): consumed 57681 bytes of file transmission
>> 10/08/14 14:37:13 (860002.0) (49167): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 130.a.b.c failed to write to file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.out: (errno 13) Permission denied
>> 10/08/14 14:37:13 (860002.0) (49167): get_file(): Failed to open file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err, errno = 13: Permission denied.
>> 10/08/14 14:37:13 (860002.0) (49167): get_file(): consumed 0 bytes of file transmission
>> 10/08/14 14:37:13 (860002.0) (49167): DoDownload: consuming rest of transfer and failing after encountering the following error: SHADOW at 130.a.b.c failed to write to file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err: (errno 13) Permission denied
>> 10/08/14 14:37:13 (860002.0) (49167): Mock terminating job 860002.0: exited_by_signal=FALSE, exit_code=0 OR exit_signal=0, core_dumped=FALSE, exit_reason="Exited normally"
>> 10/08/14 14:37:13 (860002.0) (49167): File transfer failed (status=0).
>> 10/08/14 14:37:13 (860002.0) (49167): Job 860002.0 going into Hold state (code 12,13): Error from slot1@xxxxxxxxxxxxxx: STARTER at 130.a.b.c failed to send file(s) to <130.a.b.c:54052>; SHADOW at 130.a.b.c failed to write to file /var/spool/arc/grid02/U7JMDmQKivknE6QDjqmt6UqoABFKDmABFKDmAVMKDmABFKDmQRxEln/Rq5Ib2.err: (errno 13) Permission denied
>> 10/08/14 14:37:13 (860002.0) (49167): **** condor_shadow (condor_SHADOW) pid 49167 EXITING WITH STATUS 112
>>
>> [2]
>> 10/07/14 12:43:40 (pid:18932) ******************************************************
>> 10/07/14 12:43:40 (pid:18932) ** condor_starter (CONDOR_STARTER) STARTING UP
>> 10/07/14 12:43:40 (pid:18932) ** /usr/sbin/condor_starter
>> 10/07/14 12:43:40 (pid:18932) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
>> 10/07/14 12:43:40 (pid:18932) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
>> 10/07/14 12:43:40 (pid:18932) ** $CondorVersion: 8.2.2 Aug 07 2014 BuildID: 265643 $
>> 10/07/14 12:43:40 (pid:18932) ** $CondorPlatform: x86_64_RedHat6 $
>> 10/07/14 12:43:40 (pid:18932) ** PID = 18932
>> 10/07/14 12:43:40 (pid:18932) ** Log last touched 10/7 12:39:50
>> 10/07/14 12:43:40 (pid:18932) ******************************************************
>> 10/07/14 12:43:40 (pid:18932) Using config source: /etc/condor/condor_config
>> 10/07/14 12:43:40 (pid:18932) Using local config sources:
>> 10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/10security.config
>> 10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/20wn.config
>> 10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/20wn_startd_history.config
>> 10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/21slots.config
>> 10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/31cgroups.config
>> 10/07/14 12:43:40 (pid:18932)    /etc/condor/config.d/32mountunderscratch.config
>> 10/07/14 12:43:40 (pid:18932)    /etc/condor/condor_config.local
>> 10/07/14 12:43:40 (pid:18932) config Macros = 128, Sorted = 127, StringBytes = 4666, TablesBytes = 4704
>> 10/07/14 12:43:40 (pid:18932) CLASSAD_CACHING is OFF
>> 10/07/14 12:43:40 (pid:18932) Daemon Log is logging: D_ALWAYS D_ERROR
>> 10/07/14 12:43:40 (pid:18932) DaemonCore: command socket at <130.a.b.c:38881>
>> 10/07/14 12:43:40 (pid:18932) DaemonCore: private command socket at <130.a.b.c:38881>
>> 10/07/14 12:43:40 (pid:18932) Communicating with shadow <130.a.b.c:34061?noUDP>
>> 10/07/14 12:43:40 (pid:18932) Submitting machine is "arc-ce04.domain"
>> 10/07/14 12:43:40 (pid:18932) setting the orig job name in starter
>> 10/07/14 12:43:40 (pid:18932) setting the orig job iwd in starter
>> 10/07/14 12:43:40 (pid:18932) Chirp config summary: IO false, Updates false, Delayed updates true.
>> 10/07/14 12:43:40 (pid:18932) Initialized IO Proxy.
>> 10/07/14 12:43:40 (pid:18932) Done setting resource limits
>> 10/07/14 12:43:40 (pid:18932) File transfer completed successfully.
>> 10/07/14 12:43:41 (pid:18932) Job 860002.0 set to execute immediately
>> 10/07/14 12:43:41 (pid:18932) Starting a VANILLA universe job with ID: 860002.0
>> 10/07/14 12:43:41 (pid:18932) IWD: /pool/condor/dir_18932
>> 10/07/14 12:43:41 (pid:18932) Output file: /pool/condor/dir_18932/_condor_stdout
>> 10/07/14 12:43:41 (pid:18932) Error file: /pool/condor/dir_18932/_condor_stdout
>> 10/07/14 12:43:41 (pid:18932) Renice expr "10" evaluated to 10
>> 10/07/14 12:43:41 (pid:18932) About to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe
>> 10/07/14 12:43:41 (pid:18932) Running job as user tlhcb005
>> 10/07/14 12:43:41 (pid:18932) Create_Process succeeded, pid=18953
>> 10/08/14 14:26:26 (pid:18932) Process exited, pid=18953, status=0
>> 10/08/14 14:26:26 (pid:18932) Close_Pipe on invalid pipe end: 0
>> 10/08/14 14:26:26 (pid:18932) ERROR "Close_Pipe error" at line 2089 in file /slots/04/dir_38687/userdir/src/condor_daemon_core.V6/daemon_core.cpp
>> 10/08/14 14:26:26 (pid:18932) ShutdownFast all jobs.
>>
>>


--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/