[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] "Connection to shadow may be lost" messages



Yep,

Found the one I saw earlier, a job from our ARC CE. I’ll have another hunt to see if the same thing is occurring on our htcondorces.

Similarities I notice straight off is that we’re both using IO Proxies between the starters and shadows.

Cheers, Iain

02/16/16 17:15:19 (pid:1609794) ******************************************************
02/16/16 17:15:19 (pid:1609794) ** condor_starter (CONDOR_STARTER) STARTING UP
02/16/16 17:15:19 (pid:1609794) ** /usr/sbin/condor_starter
02/16/16 17:15:19 (pid:1609794) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
02/16/16 17:15:19 (pid:1609794) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
02/16/16 17:15:19 (pid:1609794) ** $CondorVersion: 8.3.8 Aug 27 2015 BuildID: 338845 $
02/16/16 17:15:19 (pid:1609794) ** $CondorPlatform: X86_64-RedHat_6.6 $
02/16/16 17:15:19 (pid:1609794) ** PID = 1609794
02/16/16 17:15:19 (pid:1609794) ** Log last touched 2/16 17:13:59
02/16/16 17:15:19 (pid:1609794) ******************************************************
02/16/16 17:15:19 (pid:1609794) Using config source: /etc/condor/condor_config
02/16/16 17:15:19 (pid:1609794) Using local config sources: 
02/16/16 17:15:19 (pid:1609794)    /etc/condor/config.d/10_security.config
02/16/16 17:15:19 (pid:1609794)    /etc/condor/config.d/14_network.config
02/16/16 17:15:19 (pid:1609794)    /etc/condor/config.d/20_workernode.config
02/16/16 17:15:19 (pid:1609794)    /etc/condor/config.d/24_workernode_crons.config
02/16/16 17:15:19 (pid:1609794)    /etc/condor/condor_config.local
02/16/16 17:15:19 (pid:1609794) config Macros = 170, Sorted = 169, StringBytes = 6318, TablesBytes = 6200
02/16/16 17:15:19 (pid:1609794) CLASSAD_CACHING is OFF
02/16/16 17:15:19 (pid:1609794) Daemon Log is logging: D_ALWAYS D_ERROR
02/16/16 17:15:19 (pid:1609794) SharedPortEndpoint: waiting for connections to named socket 15788_d2fb_365
02/16/16 17:15:19 (pid:1609794) DaemonCore: command socket at <:9618?addrs=-9618&noUDP&sock=15788_d2fb_365>
02/16/16 17:15:19 (pid:1609794) DaemonCore: private command socket at <:9618?addrs=-9618&noUDP&sock=15788_d2fb_365>
02/16/16 17:15:19 (pid:1609794) Communicating with shadow <:9618?addrs=-9618&noUDP&sock=8114_cc03_875902>
02/16/16 17:15:19 (pid:1609794) Submitting machine is “ce501.cern.ch"
02/16/16 17:15:19 (pid:1609794) setting the orig job name in starter
02/16/16 17:15:19 (pid:1609794) setting the orig job iwd in starter
02/16/16 17:15:19 (pid:1609794) Chirp config summary: IO false, Updates false, Delayed updates true.
02/16/16 17:15:19 (pid:1609794) Initialized IO Proxy.
02/16/16 17:15:19 (pid:1609794) Done setting resource limits
02/16/16 17:15:19 (pid:1609794) File transfer completed successfully.
02/16/16 17:15:20 (pid:1609794) Job 1281546.0 set to execute immediately
02/16/16 17:15:20 (pid:1609794) Starting a VANILLA universe job with ID: 1281546.0
02/16/16 17:15:20 (pid:1609794) IWD: /pool/condor/dir_1609794
02/16/16 17:15:21 (pid:1609794) Output file: /pool/condor/dir_1609794/_condor_stdout
02/16/16 17:15:21 (pid:1609794) Error file: /pool/condor/dir_1609794/_condor_stdout
02/16/16 17:15:21 (pid:1609794) Renice expr "10" evaluated to 10
02/16/16 17:15:21 (pid:1609794) Using wrapper /usr/local/condor/job_wrapper to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe
02/16/16 17:15:21 (pid:1609794) Running job as user <blanked>
02/16/16 17:15:21 (pid:1609794) Create_Process succeeded, pid=1609798
02/16/16 17:15:21 (pid:1609794) Limitting memory usage to 2097152000 bytes
02/16/16 17:15:21 (pid:1609794) Limitting memsw usage to 35696619520 bytes
02/16/16 17:15:40 (pid:1609794) Process exited, pid=1609798, status=0
02/16/16 17:15:41 (pid:1609794) Connection to shadow may be lost, will test by sending whoami request.
02/16/16 17:15:41 (pid:1609794) condor_write(): Socket closed when trying to write 37 bytes to <:37550>, fd is 19
02/16/16 17:15:41 (pid:1609794) Buf::write(): condor_write() failed
02/16/16 17:15:41 (pid:1609794) i/o error result is 0, errno is 0
02/16/16 17:15:41 (pid:1609794) Lost connection to shadow, waiting 2400 secs for reconnect
02/16/16 17:15:41 (pid:1609794) Got SIGQUIT.  Performing fast shutdown.
02/16/16 17:15:41 (pid:1609794) ShutdownFast all jobs.
02/16/16 17:15:41 (pid:1609794) **** condor_starter (condor_STARTER) pid 1609794 EXITING WITH STATUS 0

> On Feb 16, 2016, at 23:29, andrew.lahiff@xxxxxxxxxx wrote:
> 
> Hi,
> 
> We seem to be getting messages like this in the StarterLogs very frequently:
> 
> 02/16/16 20:51:37 (pid:2619) Connection to shadow may be lost, will test by sending whoami request.
> 02/16/16 20:51:37 (pid:2619) condor_write(): Socket closed when trying to write 37 bytes to <x.y.z.t:8847>, fd is 7
> 02/16/16 20:51:37 (pid:2619) Buf::write(): condor_write() failed
> 02/16/16 20:51:37 (pid:2619) i/o error result is 0, errno is 0
> 02/16/16 20:51:37 (pid:2619) Lost connection to shadow, waiting 2400 secs for reconnect
> 
> The complete StarterLog for this job is shown below [1], as well as the associated ShadowLog [2].
> 
> Has anyone else encountered this problem or know what could be causing it? I think it only happens with 8.4.x schedds (with 8.2.7 at least I can't reproduce it).
> 
> Many Thanks,
> Andrew.
> 
> [1]
> 02/16/16 20:51:34 (pid:2619) ******************************************************
> 02/16/16 20:51:34 (pid:2619) ** condor_starter (CONDOR_STARTER) STARTING UP
> 02/16/16 20:51:34 (pid:2619) ** /usr/sbin/condor_starter
> 02/16/16 20:51:34 (pid:2619) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
> 02/16/16 20:51:34 (pid:2619) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
> 02/16/16 20:51:34 (pid:2619) ** $CondorVersion: 8.3.7 Jul 23 2015 BuildID: 331383 $
> 02/16/16 20:51:34 (pid:2619) ** $CondorPlatform: X86_64-RedHat_6.6 $
> 02/16/16 20:51:34 (pid:2619) ** PID = 2619
> 02/16/16 20:51:34 (pid:2619) ** Log last touched 2/16 20:49:32
> 02/16/16 20:51:34 (pid:2619) ******************************************************
> 02/16/16 20:51:34 (pid:2619) Using config source: /etc/condor/condor_config
> 02/16/16 20:51:34 (pid:2619) Using local config sources:
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/00-general
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/01-security
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/10-worker-node
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/11-cgroups
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/96-hibernation
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/97-preemptable
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/98-reduce-power-now
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/condor_config.local
> 02/16/16 20:51:34 (pid:2619) config Macros = 148, Sorted = 147, StringBytes = 6526, TablesBytes = 5440
> 02/16/16 20:51:34 (pid:2619) CLASSAD_CACHING is OFF
> 02/16/16 20:51:34 (pid:2619) Daemon Log is logging: D_ALWAYS D_ERROR
> 02/16/16 20:51:34 (pid:2619) Daemoncore: Listening at <0.0.0.0:42560> on TCP (ReliSock) and UDP (SafeSock).
> 02/16/16 20:51:34 (pid:2619) DaemonCore: command socket at <a.b.c.d:42560?addrs=a.b.c.d-42560>
> 02/16/16 20:51:34 (pid:2619) DaemonCore: private command socket at <a.b.c.d:42560?addrs=a.b.c.d-42560>
> 02/16/16 20:51:34 (pid:2619) Communicating with shadow <x.y.z.t:29414?addrs=x.y.z.t-29414&noUDP>
> 02/16/16 20:51:34 (pid:2619) Submitting machine is "arc-ce05.domain”
> 02/16/16 20:51:34 (pid:2619) setting the orig job name in starter
> 02/16/16 20:51:34 (pid:2619) setting the orig job iwd in starter
> 02/16/16 20:51:34 (pid:2619) Chirp config summary: IO false, Updates false, Delayed updates true.
> 02/16/16 20:51:34 (pid:2619) Initialized IO Proxy.
> 02/16/16 20:51:34 (pid:2619) Done setting resource limits
> 02/16/16 20:51:35 (pid:2619) File transfer completed successfully.
> 02/16/16 20:51:36 (pid:2619) Job 3714.0 set to execute immediately
> 02/16/16 20:51:36 (pid:2619) Starting a VANILLA universe job with ID: 3714.0
> 02/16/16 20:51:36 (pid:2619) IWD: /pool/condor/dir_2619
> 02/16/16 20:51:36 (pid:2619) Renice expr "10" evaluated to 10
> 02/16/16 20:51:36 (pid:2619) About to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe 1
> 02/16/16 20:51:36 (pid:2619) Running job as user nagios
> 02/16/16 20:51:36 (pid:2619) Create_Process succeeded, pid=4471
> 02/16/16 20:51:37 (pid:2619) Process exited, pid=4471, status=0
> 02/16/16 20:51:37 (pid:2619) Connection to shadow may be lost, will test by sending whoami request.
> 02/16/16 20:51:37 (pid:2619) condor_write(): Socket closed when trying to write 37 bytes to <x.y.z.t:8847>, fd is 7
> 02/16/16 20:51:37 (pid:2619) Buf::write(): condor_write() failed
> 02/16/16 20:51:37 (pid:2619) i/o error result is 0, errno is 0
> 02/16/16 20:51:37 (pid:2619) Lost connection to shadow, waiting 2400 secs for reconnect
> 02/16/16 20:51:37 (pid:2619) Got SIGQUIT.  Performing fast shutdown.
> 02/16/16 20:51:37 (pid:2619) ShutdownFast all jobs.
> 02/16/16 20:51:37 (pid:2619) **** condor_starter (condor_STARTER) pid 2619 EXITING WITH STATUS 0
> 
> 
> [2]
> 02/16/16 20:51:33 ******************************************************
> 02/16/16 20:51:33 ** condor_shadow (CONDOR_SHADOW) STARTING UP
> 02/16/16 20:51:33 ** /usr/sbin/condor_shadow
> 02/16/16 20:51:33 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
> 02/16/16 20:51:33 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
> 02/16/16 20:51:33 ** $CondorVersion: 8.4.4 Feb 03 2016 BuildID: 355883 $
> 02/16/16 20:51:33 ** $CondorPlatform: x86_RedHat6 $
> 02/16/16 20:51:33 ** PID = 115754
> 02/16/16 20:51:33 ** Log last touched 2/16 20:48:10
> 02/16/16 20:51:33 ******************************************************
> 02/16/16 20:51:33 Using config source: /etc/condor/condor_config
> 02/16/16 20:51:33 Using local config sources:
> 02/16/16 20:51:33    /etc/condor/config.d/10security.config
> 02/16/16 20:51:33    /etc/condor/config.d/12resourcelimits.config
> 02/16/16 20:51:33    /etc/condor/config.d/13users-blacklists.config
> 02/16/16 20:51:33    /etc/condor/config.d/14accounting-groups.config
> 02/16/16 20:51:33    /etc/condor/config.d/21schedd.config
> 02/16/16 20:51:33    /etc/condor/config.d/24per-job-history-files.config
> 02/16/16 20:51:33    /etc/condor/condor_config.local
> 02/16/16 20:51:33 config Macros = 234, Sorted = 234, StringBytes = 13208, TablesBytes = 1928
> 02/16/16 20:51:33 CLASSAD_CACHING is OFF
> 02/16/16 20:51:33 Daemon Log is logging: D_ALWAYS D_ERROR
> 02/16/16 20:51:33 Daemoncore: Listening at <0.0.0.0:29414> on TCP (ReliSock).
> 02/16/16 20:51:33 DaemonCore: command socket at <x.y.z.t:29414?addrs=x.y.z.t-29414&noUDP>
> 02/16/16 20:51:33 DaemonCore: private command socket at <x.y.z.t:29414?addrs=x.y.z.t-29414>
> 02/16/16 20:51:33 Initializing a VANILLA shadow for job 3714.0
> 02/16/16 20:51:33 (3714.0) (115754): Request to run on slot1@xxxxxxxxxxxxxx <a.b.c.d:44704?addrs=a.b.c.d-44704> was ACCEPTED
> 02/16/16 20:51:35 (3714.0) (115754): File transfer completed successfully.
> 02/16/16 20:51:37 (3714.0) (115754): Job 3714.0 terminated: exited with status 0
> 02/16/16 20:51:37 (3714.0) (115754): Reporting job exit reason 100 and attempting to fetch new job.
> 02/16/16 20:51:37 (3714.0) (115754): **** condor_shadow (condor_SHADOW) pid 115754 EXITING WITH STATUS 100
> 
> 
> 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/

Attachment: smime.p7s
Description: S/MIME cryptographic signature