
Re: [HTCondor-users] "Connection to shadow may be lost" messages



Hi Andrew,

I spotted this same message in a few of our starter logs earlier today, but we had another issue at the time so I just assumed they were related.

We’re running the schedds on 8.3.8 at the moment, so presumably it was something introduced somewhere in the 8.3.x series rather than being specific to 8.4.x.
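(In case it helps anyone else cross-check their pool: the schedd ads advertise the running version, so something along the lines of

    condor_status -schedd -autoformat Name CondorVersion

should list what each schedd is actually running, and condor_version on a worker node covers the starter side.)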

Did you notice it because of a specific shadow exit code, or was it just a blip in your monitoring? Either way, I can have a hunt in our pool for some corroborating logs.
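In case it's useful, this is roughly what I'll be grepping for on our side (our daemon logs live under /var/log/condor, so adjust the paths for your layout):

    # on a worker node: how often the starter reports the lost connection
    grep -h "Connection to shadow may be lost" /var/log/condor/StarterLog* | wc -l

    # on the schedd: what the shadow exit statuses look like around the same time
    grep "EXITING WITH STATUS" /var/log/condor/ShadowLog | awk '{print $NF}' | sort | uniq -c

That should make it clear whether the message always coincides with the shadow exiting with status 100, as in your logs, or with something else.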

Cheers, Iain

> On Feb 16, 2016, at 23:29, andrew.lahiff@xxxxxxxxxx wrote:
> 
> Hi,
> 
> We seem to be getting messages like this in the StarterLogs very frequently:
> 
> 02/16/16 20:51:37 (pid:2619) Connection to shadow may be lost, will test by sending whoami request.
> 02/16/16 20:51:37 (pid:2619) condor_write(): Socket closed when trying to write 37 bytes to <x.y.z.t:8847>, fd is 7
> 02/16/16 20:51:37 (pid:2619) Buf::write(): condor_write() failed
> 02/16/16 20:51:37 (pid:2619) i/o error result is 0, errno is 0
> 02/16/16 20:51:37 (pid:2619) Lost connection to shadow, waiting 2400 secs for reconnect
> 
> The complete StarterLog for this job is shown below [1], as well as the associated ShadowLog [2].
> 
> Has anyone else encountered this problem or know what could be causing it? I think it only happens with 8.4.x schedds (with 8.2.7 at least I can't reproduce it).
> 
> Many Thanks,
> Andrew.
> 
> [1]
> 02/16/16 20:51:34 (pid:2619) ******************************************************
> 02/16/16 20:51:34 (pid:2619) ** condor_starter (CONDOR_STARTER) STARTING UP
> 02/16/16 20:51:34 (pid:2619) ** /usr/sbin/condor_starter
> 02/16/16 20:51:34 (pid:2619) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
> 02/16/16 20:51:34 (pid:2619) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
> 02/16/16 20:51:34 (pid:2619) ** $CondorVersion: 8.3.7 Jul 23 2015 BuildID: 331383 $
> 02/16/16 20:51:34 (pid:2619) ** $CondorPlatform: X86_64-RedHat_6.6 $
> 02/16/16 20:51:34 (pid:2619) ** PID = 2619
> 02/16/16 20:51:34 (pid:2619) ** Log last touched 2/16 20:49:32
> 02/16/16 20:51:34 (pid:2619) ******************************************************
> 02/16/16 20:51:34 (pid:2619) Using config source: /etc/condor/condor_config
> 02/16/16 20:51:34 (pid:2619) Using local config sources:
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/00-general
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/01-security
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/10-worker-node
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/11-cgroups
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/96-hibernation
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/97-preemptable
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/98-reduce-power-now
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/condor_config.local
> 02/16/16 20:51:34 (pid:2619) config Macros = 148, Sorted = 147, StringBytes = 6526, TablesBytes = 5440
> 02/16/16 20:51:34 (pid:2619) CLASSAD_CACHING is OFF
> 02/16/16 20:51:34 (pid:2619) Daemon Log is logging: D_ALWAYS D_ERROR
> 02/16/16 20:51:34 (pid:2619) Daemoncore: Listening at <0.0.0.0:42560> on TCP (ReliSock) and UDP (SafeSock).
> 02/16/16 20:51:34 (pid:2619) DaemonCore: command socket at <a.b.c.d:42560?addrs=a.b.c.d-42560>
> 02/16/16 20:51:34 (pid:2619) DaemonCore: private command socket at <a.b.c.d:42560?addrs=a.b.c.d-42560>
> 02/16/16 20:51:34 (pid:2619) Communicating with shadow <x.y.z.t:29414?addrs=x.y.z.t-29414&noUDP>
> 02/16/16 20:51:34 (pid:2619) Submitting machine is "arc-ce05.domain"
> 02/16/16 20:51:34 (pid:2619) setting the orig job name in starter
> 02/16/16 20:51:34 (pid:2619) setting the orig job iwd in starter
> 02/16/16 20:51:34 (pid:2619) Chirp config summary: IO false, Updates false, Delayed updates true.
> 02/16/16 20:51:34 (pid:2619) Initialized IO Proxy.
> 02/16/16 20:51:34 (pid:2619) Done setting resource limits
> 02/16/16 20:51:35 (pid:2619) File transfer completed successfully.
> 02/16/16 20:51:36 (pid:2619) Job 3714.0 set to execute immediately
> 02/16/16 20:51:36 (pid:2619) Starting a VANILLA universe job with ID: 3714.0
> 02/16/16 20:51:36 (pid:2619) IWD: /pool/condor/dir_2619
> 02/16/16 20:51:36 (pid:2619) Renice expr "10" evaluated to 10
> 02/16/16 20:51:36 (pid:2619) About to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe 1
> 02/16/16 20:51:36 (pid:2619) Running job as user nagios
> 02/16/16 20:51:36 (pid:2619) Create_Process succeeded, pid=4471
> 02/16/16 20:51:37 (pid:2619) Process exited, pid=4471, status=0
> 02/16/16 20:51:37 (pid:2619) Connection to shadow may be lost, will test by sending whoami request.
> 02/16/16 20:51:37 (pid:2619) condor_write(): Socket closed when trying to write 37 bytes to <x.y.z.t:8847>, fd is 7
> 02/16/16 20:51:37 (pid:2619) Buf::write(): condor_write() failed
> 02/16/16 20:51:37 (pid:2619) i/o error result is 0, errno is 0
> 02/16/16 20:51:37 (pid:2619) Lost connection to shadow, waiting 2400 secs for reconnect
> 02/16/16 20:51:37 (pid:2619) Got SIGQUIT.  Performing fast shutdown.
> 02/16/16 20:51:37 (pid:2619) ShutdownFast all jobs.
> 02/16/16 20:51:37 (pid:2619) **** condor_starter (condor_STARTER) pid 2619 EXITING WITH STATUS 0
> 
> 
> [2]
> 02/16/16 20:51:33 ******************************************************
> 02/16/16 20:51:33 ** condor_shadow (CONDOR_SHADOW) STARTING UP
> 02/16/16 20:51:33 ** /usr/sbin/condor_shadow
> 02/16/16 20:51:33 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
> 02/16/16 20:51:33 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
> 02/16/16 20:51:33 ** $CondorVersion: 8.4.4 Feb 03 2016 BuildID: 355883 $
> 02/16/16 20:51:33 ** $CondorPlatform: x86_RedHat6 $
> 02/16/16 20:51:33 ** PID = 115754
> 02/16/16 20:51:33 ** Log last touched 2/16 20:48:10
> 02/16/16 20:51:33 ******************************************************
> 02/16/16 20:51:33 Using config source: /etc/condor/condor_config
> 02/16/16 20:51:33 Using local config sources:
> 02/16/16 20:51:33    /etc/condor/config.d/10security.config
> 02/16/16 20:51:33    /etc/condor/config.d/12resourcelimits.config
> 02/16/16 20:51:33    /etc/condor/config.d/13users-blacklists.config
> 02/16/16 20:51:33    /etc/condor/config.d/14accounting-groups.config
> 02/16/16 20:51:33    /etc/condor/config.d/21schedd.config
> 02/16/16 20:51:33    /etc/condor/config.d/24per-job-history-files.config
> 02/16/16 20:51:33    /etc/condor/condor_config.local
> 02/16/16 20:51:33 config Macros = 234, Sorted = 234, StringBytes = 13208, TablesBytes = 1928
> 02/16/16 20:51:33 CLASSAD_CACHING is OFF
> 02/16/16 20:51:33 Daemon Log is logging: D_ALWAYS D_ERROR
> 02/16/16 20:51:33 Daemoncore: Listening at <0.0.0.0:29414> on TCP (ReliSock).
> 02/16/16 20:51:33 DaemonCore: command socket at <x.y.z.t:29414?addrs=x.y.z.t-29414&noUDP>
> 02/16/16 20:51:33 DaemonCore: private command socket at <x.y.z.t:29414?addrs=x.y.z.t-29414>
> 02/16/16 20:51:33 Initializing a VANILLA shadow for job 3714.0
> 02/16/16 20:51:33 (3714.0) (115754): Request to run on slot1@xxxxxxxxxxxxxx <a.b.c.d:44704?addrs=a.b.c.d-44704> was ACCEPTED
> 02/16/16 20:51:35 (3714.0) (115754): File transfer completed successfully.
> 02/16/16 20:51:37 (3714.0) (115754): Job 3714.0 terminated: exited with status 0
> 02/16/16 20:51:37 (3714.0) (115754): Reporting job exit reason 100 and attempting to fetch new job.
> 02/16/16 20:51:37 (3714.0) (115754): **** condor_shadow (condor_SHADOW) pid 115754 EXITING WITH STATUS 100
> 
