
Re: [HTCondor-users] "Connection to shadow may be lost" messages



Hi Andrew,

Yes, the ones I've found so far are 'good' exits as well.

It doesn't appear to be related to how loaded the schedd is either; it's happening on our ARC CE, which has ~100 running jobs, and on at least one of our HTCondor CEs, which has ~9000 running jobs.
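
In case it's useful for comparing rates, this is roughly how I'm tallying the message per day out of our StarterLogs (just a sketch; the default log path is a guess and the timestamp parsing assumes the format in the excerpts below):

#!/usr/bin/env python
# Rough sketch: tally "Connection to shadow may be lost" messages per day
# from one or more StarterLogs, to compare rates between lightly and
# heavily loaded CEs. The timestamp format ("MM/DD/YY HH:MM:SS") matches
# the log excerpts below; the default path is only a guess, so adjust
# both for your own setup.
import sys
from collections import Counter

PATTERN = "Connection to shadow may be lost"

def tally(path):
    counts = Counter()
    with open(path) as log:
        for line in log:
            if PATTERN in line:
                counts[line.split()[0]] += 1  # e.g. "02/16/16"
    return counts

for path in sys.argv[1:] or ["/var/log/condor/StarterLog"]:
    for day, n in sorted(tally(path).items()):
        print("%s: %s %d" % (path, day, n))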

Cheers, Iain

Sent from Nine

From: andrew.lahiff@xxxxxxxxxx
Sent: 17 Feb 2016 00:09
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] "Connection to shadow may be lost" messages

Hi Iain,

Just today we've had many hundreds (if not thousands) of these errors. They hadn't been noticed before because, even with this error, the shadows generally seem to exit with 'good' exit codes (as in this example) and the jobs have exit status 0 and appear to be successful. I only noticed the problem when looking into an ATLAS job which, according to Panda, had a lost heartbeat and appeared to have been killed, but which according to HTCondor had an exit status of 0 and hadn't been removed. It did, however, have this error.
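
For what it's worth, this is roughly how I've been checking that each occurrence coincides with a clean exit; it just pairs every error with the next 'EXITING WITH STATUS' line in the same StarterLog (a sketch that assumes the log format shown below, nothing official):

#!/usr/bin/env python
# Sketch: for every "Connection to shadow may be lost" message in a
# StarterLog, report the starter exit status that follows it, to check
# whether the error really does coincide with otherwise clean (status 0)
# exits. Assumes the log format shown in the excerpts below.
import re
import sys

ERROR = "Connection to shadow may be lost"
EXIT = re.compile(r"EXITING WITH STATUS (\d+)")

def pair_errors_with_exits(path):
    pending = None  # timestamp of the most recent unmatched error
    with open(path) as log:
        for line in log:
            if ERROR in line:
                pending = " ".join(line.split()[:2])  # "MM/DD/YY HH:MM:SS"
                continue
            if pending:
                m = EXIT.search(line)
                if m:
                    yield pending, int(m.group(1))
                    pending = None

for path in sys.argv[1:]:
    for stamp, status in pair_errors_with_exits(path):
        print("%s: error at %s, starter exited with status %d"
              % (path, stamp, status))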

Regards,
Andrew.

________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Iain Bradford Steers [iain.steers@xxxxxxx]
Sent: Tuesday, February 16, 2016 10:46 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] "Connection to shadow may be lost" messages

Hi Andrew,

I spotted this same message in a few of our starter logs earlier today, but we had another issue at the time so I just assumed they were related.

We're running the schedds on 8.3.8 at the moment, so presumably it was something introduced in 8.3.x.

Did you notice it due to a specific shadow exit code? I can have a hunt in our pool as well for some corroborating logs.

Or was it just a blip on your monitoring?
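
As a starting point for the hunt, I was planning something like the following to tally shadow exit statuses out of our ShadowLogs (a rough sketch; the default path is an assumption, and status 100 just means the job exited of its own accord, as in your excerpt):

#!/usr/bin/env python
# Sketch for the hunt: tally condor_shadow exit statuses from one or more
# ShadowLogs, to see whether the starter-side errors line up with anything
# other than the usual status 100 (job exited of its own accord, as in the
# excerpt below). The default path is an assumption; adjust as needed.
import re
import sys
from collections import Counter

EXIT = re.compile(r"condor_shadow .* EXITING WITH STATUS (\d+)")

counts = Counter()
for path in sys.argv[1:] or ["/var/log/condor/ShadowLog"]:
    with open(path) as log:
        for line in log:
            m = EXIT.search(line)
            if m:
                counts[m.group(1)] += 1

for status, n in sorted(counts.items()):
    print("shadow exit status %s: %d" % (status, n))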

Cheers, Iain

> On Feb 16, 2016, at 23:29, andrew.lahiff@xxxxxxxxxx wrote:
>
> Hi,
>
> We seem to be getting messages like this in the StarterLogs very frequently:
>
> 02/16/16 20:51:37 (pid:2619) Connection to shadow may be lost, will test by sending whoami request.
> 02/16/16 20:51:37 (pid:2619) condor_write(): Socket closed when trying to write 37 bytes to <x.y.z.t:8847>, fd is 7
> 02/16/16 20:51:37 (pid:2619) Buf::write(): condor_write() failed
> 02/16/16 20:51:37 (pid:2619) i/o error result is 0, errno is 0
> 02/16/16 20:51:37 (pid:2619) Lost connection to shadow, waiting 2400 secs for reconnect
>
> The complete StarterLog for this job is shown below [1], as well as the associated ShadowLog [2].
>
> Has anyone else encountered this problem, or does anyone know what could be causing it? I think it only happens with 8.4.x schedds (with 8.2.7, at least, I can't reproduce it).
>
> Many Thanks,
> Andrew.
>
> [1]
> 02/16/16 20:51:34 (pid:2619) ******************************************************
> 02/16/16 20:51:34 (pid:2619) ** condor_starter (CONDOR_STARTER) STARTING UP
> 02/16/16 20:51:34 (pid:2619) ** /usr/sbin/condor_starter
> 02/16/16 20:51:34 (pid:2619) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
> 02/16/16 20:51:34 (pid:2619) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
> 02/16/16 20:51:34 (pid:2619) ** $CondorVersion: 8.3.7 Jul 23 2015 BuildID: 331383 $
> 02/16/16 20:51:34 (pid:2619) ** $CondorPlatform: X86_64-RedHat_6.6 $
> 02/16/16 20:51:34 (pid:2619) ** PID = 2619
> 02/16/16 20:51:34 (pid:2619) ** Log last touched 2/16 20:49:32
> 02/16/16 20:51:34 (pid:2619) ******************************************************
> 02/16/16 20:51:34 (pid:2619) Using config source: /etc/condor/condor_config
> 02/16/16 20:51:34 (pid:2619) Using local config sources:
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/00-general
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/01-security
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/10-worker-node
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/11-cgroups
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/96-hibernation
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/97-preemptable
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/config.d/98-reduce-power-now
> 02/16/16 20:51:34 (pid:2619)    /etc/condor/condor_config.local
> 02/16/16 20:51:34 (pid:2619) config Macros = 148, Sorted = 147, StringBytes = 6526, TablesBytes = 5440
> 02/16/16 20:51:34 (pid:2619) CLASSAD_CACHING is OFF
> 02/16/16 20:51:34 (pid:2619) Daemon Log is logging: D_ALWAYS D_ERROR
> 02/16/16 20:51:34 (pid:2619) Daemoncore: Listening at <0.0.0.0:42560> on TCP (ReliSock) and UDP (SafeSock).
> 02/16/16 20:51:34 (pid:2619) DaemonCore: command socket at <a.b.c.d:42560?addrs=a.b.c.d-42560>
> 02/16/16 20:51:34 (pid:2619) DaemonCore: private command socket at <a.b.c.d:42560?addrs=a.b.c.d-42560>
> 02/16/16 20:51:34 (pid:2619) Communicating with shadow <x.y.z.t:29414?addrs=x.y.z.t-29414&noUDP>
> 02/16/16 20:51:34 (pid:2619) Submitting machine is "arc-ce05.domain"
> 02/16/16 20:51:34 (pid:2619) setting the orig job name in starter
> 02/16/16 20:51:34 (pid:2619) setting the orig job iwd in starter
> 02/16/16 20:51:34 (pid:2619) Chirp config summary: IO false, Updates false, Delayed updates true.
> 02/16/16 20:51:34 (pid:2619) Initialized IO Proxy.
> 02/16/16 20:51:34 (pid:2619) Done setting resource limits
> 02/16/16 20:51:35 (pid:2619) File transfer completed successfully.
> 02/16/16 20:51:36 (pid:2619) Job 3714.0 set to execute immediately
> 02/16/16 20:51:36 (pid:2619) Starting a VANILLA universe job with ID: 3714.0
> 02/16/16 20:51:36 (pid:2619) IWD: /pool/condor/dir_2619
> 02/16/16 20:51:36 (pid:2619) Renice expr "10" evaluated to 10
> 02/16/16 20:51:36 (pid:2619) About to exec /usr/libexec/condor/condor_pid_ns_init condor_exec.exe 1
> 02/16/16 20:51:36 (pid:2619) Running job as user nagios
> 02/16/16 20:51:36 (pid:2619) Create_Process succeeded, pid=4471
> 02/16/16 20:51:37 (pid:2619) Process exited, pid=4471, status=0
> 02/16/16 20:51:37 (pid:2619) Connection to shadow may be lost, will test by sending whoami request.
> 02/16/16 20:51:37 (pid:2619) condor_write(): Socket closed when trying to write 37 bytes to <x.y.z.t:8847>, fd is 7
> 02/16/16 20:51:37 (pid:2619) Buf::write(): condor_write() failed
> 02/16/16 20:51:37 (pid:2619) i/o error result is 0, errno is 0
> 02/16/16 20:51:37 (pid:2619) Lost connection to shadow, waiting 2400 secs for reconnect
> 02/16/16 20:51:37 (pid:2619) Got SIGQUIT.  Performing fast shutdown.
> 02/16/16 20:51:37 (pid:2619) ShutdownFast all jobs.
> 02/16/16 20:51:37 (pid:2619) **** condor_starter (condor_STARTER) pid 2619 EXITING WITH STATUS 0
>
>
> [2]
> 02/16/16 20:51:33 ******************************************************
> 02/16/16 20:51:33 ** condor_shadow (CONDOR_SHADOW) STARTING UP
> 02/16/16 20:51:33 ** /usr/sbin/condor_shadow
> 02/16/16 20:51:33 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
> 02/16/16 20:51:33 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
> 02/16/16 20:51:33 ** $CondorVersion: 8.4.4 Feb 03 2016 BuildID: 355883 $
> 02/16/16 20:51:33 ** $CondorPlatform: x86_RedHat6 $
> 02/16/16 20:51:33 ** PID = 115754
> 02/16/16 20:51:33 ** Log last touched 2/16 20:48:10
> 02/16/16 20:51:33 ******************************************************
> 02/16/16 20:51:33 Using config source: /etc/condor/condor_config
> 02/16/16 20:51:33 Using local config sources:
> 02/16/16 20:51:33    /etc/condor/config.d/10security.config
> 02/16/16 20:51:33    /etc/condor/config.d/12resourcelimits.config
> 02/16/16 20:51:33    /etc/condor/config.d/13users-blacklists.config
> 02/16/16 20:51:33    /etc/condor/config.d/14accounting-groups.config
> 02/16/16 20:51:33    /etc/condor/config.d/21schedd.config
> 02/16/16 20:51:33    /etc/condor/config.d/24per-job-history-files.config
> 02/16/16 20:51:33    /etc/condor/condor_config.local
> 02/16/16 20:51:33 config Macros = 234, Sorted = 234, StringBytes = 13208, TablesBytes = 1928
> 02/16/16 20:51:33 CLASSAD_CACHING is OFF
> 02/16/16 20:51:33 Daemon Log is logging: D_ALWAYS D_ERROR
> 02/16/16 20:51:33 Daemoncore: Listening at <0.0.0.0:29414> on TCP (ReliSock).
> 02/16/16 20:51:33 DaemonCore: command socket at <x.y.z.t:29414?addrs=x.y.z.t-29414&noUDP>
> 02/16/16 20:51:33 DaemonCore: private command socket at <x.y.z.t:29414?addrs=x.y.z.t-29414>
> 02/16/16 20:51:33 Initializing a VANILLA shadow for job 3714.0
> 02/16/16 20:51:33 (3714.0) (115754): Request to run on slot1@xxxxxxxxxxxxxx <a.b.c.d:44704?addrs=a.b.c.d-44704> was ACCEPTED
> 02/16/16 20:51:35 (3714.0) (115754): File transfer completed successfully.
> 02/16/16 20:51:37 (3714.0) (115754): Job 3714.0 terminated: exited with status 0
> 02/16/16 20:51:37 (3714.0) (115754): Reporting job exit reason 100 and attempting to fetch new job.
> 02/16/16 20:51:37 (3714.0) (115754): **** condor_shadow (condor_SHADOW) pid 115754 EXITING WITH STATUS 100


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/