[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] "Connection to shadow may be lost" messages



Hi,

We seem to be getting messages like this in the StarterLogs very frequently:

02/16/16 20:51:37 (pid:2619) Connection to shadow may be lost, will test by sending whoami request.
02/16/16 20:51:37 (pid:2619) condor_write(): Socket closed when trying to write 37 bytes to <x.y.z.t:8847>, fd is 7
02/16/16 20:51:37 (pid:2619) Buf::write(): condor_write() failed
02/16/16 20:51:37 (pid:2619) i/o error result is 0, errno is 0
02/16/16 20:51:37 (pid:2619) Lost connection to shadow, waiting 2400 secs for reconnect

The complete StarterLog for this job is shown below [1], as well as the associated ShadowLog [2].

Has anyone else encountered this problem or know what could be causing it? I think it only happens with 8.4.x schedds (with 8.2.7 at least I can't reproduce it).

Many Thanks,
Andrew.

[1]
02/16/16 20:51:34 (pid:2619) ******************************************************
02/16/16 20:51:34 (pid:2619) ** condor_starter (CONDOR_STARTER) STARTING UP
02/16/16 20:51:34 (pid:2619) ** /usr/sbin/condor_starter
02/16/16 20:51:34 (pid:2619) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
02/16/16 20:51:34 (pid:2619) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
02/16/16 20:51:34 (pid:2619) ** $CondorVersion: 8.3.7 Jul 23 2015 BuildID: 331383 $
02/16/16 20:51:34 (pid:2619) ** $CondorPlatform: X86_64-RedHat_6.6 $
02/16/16 20:51:34 (pid:2619) ** PID = 2619
02/16/16 20:51:34 (pid:2619) ** Log last touched 2/16 20:49:32
02/16/16 20:51:34 (pid:2619) ******************************************************
02/16/16 20:51:34 (pid:2619) Using config source: /etc/condor/condor_config
02/16/16 20:51:34 (pid:2619) Using local config sources:
02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/00-general
02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/01-security
02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/10-worker-node
02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/11-cgroups
02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/96-hibernation
02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/97-preemptable
02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/98-reduce-power-now
02/16/16 20:51:34 (pid:2619)    /etc/condor/condor_config.local
02/16/16 20:51:34 (pid:2619) config Macros = 148, Sorted = 147, StringBytes = 6526, TablesBytes = 5440
02/16/16 20:51:34 (pid:2619) CLASSAD_CACHING is OFF
02/16/16 20:51:34 (pid:2619) Daemon Log is logging: D_ALWAYS D_ERROR
02/16/16 20:51:34 (pid:2619) Daemoncore: Listening at <0.0.0.0:42560> on TCP (ReliSock) and UDP (SafeSock).
02/16/16 20:51:34 (pid:2619) DaemonCore: command socket at <a.b.c.d:42560?addrs=a.b.c.d-42560>
02/16/16 20:51:34 (pid:2619) DaemonCore: private command socket at <a.b.c.d:42560?addrs=a.b.c.d-42560>
02/16/16 20:51:34 (pid:2619) Communicating with shadow <x.y.z.t:29414?addrs=x.y.z.t-29414&noUDP>
02/16/16 20:51:34 (pid:2619) Submitting machine is "arc-ce05.domain”
02/16/16 20:51:34 (pid:2619) setting the orig job name in starter
02/16/16 20:51:34 (pid:2619) setting the orig job iwd in starter
02/16/16 20:51:34 (pid:2619) Chirp config summary: IO false, Updates false, Delayed updates true.
02/16/16 20:51:34 (pid:2619) Initialized IO Proxy.
02/16/16 20:51:34 (pid:2619) Done setting resource limits
02/16/16 20:51:35 (pid:2619) File transfer completed successfully.
02/16/16 20:51:36 (pid:2619) Job 3714.0 set to execute immediately
02/16/16 20:51:36 (pid:2619) Starting a VANILLA universe job with ID: 3714.0
02/16/16 20:51:36 (pid:2619) IWD: /pool/condor/dir_2619
02/16/16 20:51:36 (pid:2619) Renice expr "10" evaluated to 10
02/16/16 20:51:36 (pid:2619) About to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe 1
02/16/16 20:51:36 (pid:2619) Running job as user nagios
02/16/16 20:51:36 (pid:2619) Create_Process succeeded, pid=4471
02/16/16 20:51:37 (pid:2619) Process exited, pid=4471, status=0
02/16/16 20:51:37 (pid:2619) Connection to shadow may be lost, will test by sending whoami request.
02/16/16 20:51:37 (pid:2619) condor_write(): Socket closed when trying to write 37 bytes to <x.y.z.t:8847>, fd is 7
02/16/16 20:51:37 (pid:2619) Buf::write(): condor_write() failed
02/16/16 20:51:37 (pid:2619) i/o error result is 0, errno is 0
02/16/16 20:51:37 (pid:2619) Lost connection to shadow, waiting 2400 secs for reconnect
02/16/16 20:51:37 (pid:2619) Got SIGQUIT.  Performing fast shutdown.
02/16/16 20:51:37 (pid:2619) ShutdownFast all jobs.
02/16/16 20:51:37 (pid:2619) **** condor_starter (condor_STARTER) pid 2619 EXITING WITH STATUS 0


[2]
02/16/16 20:51:33 ******************************************************
02/16/16 20:51:33 ** condor_shadow (CONDOR_SHADOW) STARTING UP
02/16/16 20:51:33 ** /usr/sbin/condor_shadow
02/16/16 20:51:33 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
02/16/16 20:51:33 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
02/16/16 20:51:33 ** $CondorVersion: 8.4.4 Feb 03 2016 BuildID: 355883 $
02/16/16 20:51:33 ** $CondorPlatform: x86_RedHat6 $
02/16/16 20:51:33 ** PID = 115754
02/16/16 20:51:33 ** Log last touched 2/16 20:48:10
02/16/16 20:51:33 ******************************************************
02/16/16 20:51:33 Using config source: /etc/condor/condor_config
02/16/16 20:51:33 Using local config sources:
02/16/16 20:51:33    /etc/condor/config.d/10security.config
02/16/16 20:51:33    /etc/condor/config.d/12resourcelimits.config
02/16/16 20:51:33    /etc/condor/config.d/13users-blacklists.config
02/16/16 20:51:33    /etc/condor/config.d/14accounting-groups.config
02/16/16 20:51:33    /etc/condor/config.d/21schedd.config
02/16/16 20:51:33    /etc/condor/config.d/24per-job-history-files.config
02/16/16 20:51:33    /etc/condor/condor_config.local
02/16/16 20:51:33 config Macros = 234, Sorted = 234, StringBytes = 13208, TablesBytes = 1928
02/16/16 20:51:33 CLASSAD_CACHING is OFF
02/16/16 20:51:33 Daemon Log is logging: D_ALWAYS D_ERROR
02/16/16 20:51:33 Daemoncore: Listening at <0.0.0.0:29414> on TCP (ReliSock).
02/16/16 20:51:33 DaemonCore: command socket at <x.y.z.t:29414?addrs=x.y.z.t-29414&noUDP>
02/16/16 20:51:33 DaemonCore: private command socket at <x.y.z.t:29414?addrs=x.y.z.t-29414>
02/16/16 20:51:33 Initializing a VANILLA shadow for job 3714.0
02/16/16 20:51:33 (3714.0) (115754): Request to run on slot1@xxxxxxxxxxxxxx <a.b.c.d:44704?addrs=a.b.c.d-44704> was ACCEPTED
02/16/16 20:51:35 (3714.0) (115754): File transfer completed successfully.
02/16/16 20:51:37 (3714.0) (115754): Job 3714.0 terminated: exited with status 0
02/16/16 20:51:37 (3714.0) (115754): Reporting job exit reason 100 and attempting to fetch new job.
02/16/16 20:51:37 (3714.0) (115754): **** condor_shadow (condor_SHADOW) pid 115754 EXITING WITH STATUS 100