
[HTCondor-users] Shared port daemon fails to start after reboot on Dual Stack CentOS7 nodes



Hi,

 

Condor is failing to restart cleanly after node reboots on our dual stack (IPv4 and IPv6) nodes.

 

The issue appears to be in the communication from the Shared Port Daemon back to the Master that started it.

 

If I run `sudo systemctl restart condor` once I can log into the node, everything comes up cleanly, so I’m wondering if the Master is coming up before something that it needs.

 

Extracts of the MasterLog and SharedPortLog are below.

 

This is with condor 8.8.13 on CentOS7.

 

Has anyone seen anything like this and/or know of a fix? I’m wondering if the first line of the MasterLog extract is significant.
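
One way to test whether name resolution is simply not ready when the Master starts is to repeat the lookup by hand early in boot. The sketch below is only an illustration: `wait_for_resolution` and its parameters are hypothetical names, and it copies the "3 seconds, 20 tries" pattern from the first MasterLog line rather than calling any HTCondor code.

```python
import socket
import time


def wait_for_resolution(hostname, tries=20, delay=3.0,
                        resolver=socket.getaddrinfo, sleep=time.sleep):
    """Return the attempt number on which hostname resolved, or None.

    Hypothetical helper: retries only on EAI_AGAIN (temporary failure),
    which is the error reported in the first MasterLog line.
    """
    for attempt in range(1, tries + 1):
        try:
            resolver(hostname, None)
            return attempt
        except socket.gaierror as err:
            # Any error other than a temporary failure is re-raised.
            if err.errno != socket.EAI_AGAIN:
                raise
            if attempt < tries:
                sleep(delay)
    return None
```

Calling `wait_for_resolution("heplnc001.pp.rl.ac.uk")` from a unit that runs at the same point in boot as condor would show how many attempts the lookup needs. A result greater than 1 would be consistent with the Master starting before DNS is usable, though it would not prove that this causes the shared port failure.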

 

Thanks,

Chris.

 

## MasterLog

 

06/08/21 10:37:49 init_local_hostname_impl: ipv6_getaddrinfo() returned EAI_AGAIN for 'heplnc001.pp.rl.ac.uk'.  Will try again after sleeping 3 seconds (try 2 of 20).

06/08/21 10:37:49 ******************************************************

06/08/21 10:37:49 ** condor_master (CONDOR_MASTER) STARTING UP

06/08/21 10:37:49 ** /usr/sbin/condor_master

06/08/21 10:37:49 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)

06/08/21 10:37:49 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON

06/08/21 10:37:49 ** $CondorVersion: 8.8.13 Mar 23 2021 BuildID: 534541 PackageID: 8.8.13-1 $

06/08/21 10:37:49 ** $CondorPlatform: x86_64_CentOS7 $

06/08/21 10:37:49 ** PID = 1677

06/08/21 10:37:49 ** Log last touched 6/8 10:35:24

06/08/21 10:37:49 ******************************************************

06/08/21 10:37:49 Using config source: /etc/condor/condor_config

06/08/21 10:37:49 Using local config sources:

06/08/21 10:37:49    /etc/condor/config.d/00init.config

06/08/21 10:37:49    /etc/condor/config.d/01puppet_ssl.config

06/08/21 10:37:49    /etc/condor/config.d/02machines.config

06/08/21 10:37:49    /etc/condor/config.d/05security.config

06/08/21 10:37:49    /etc/condor/config.d/20wn_centos7.config

06/08/21 10:37:49    /etc/condor/config.d/25scaling.config

06/08/21 10:37:49    /etc/condor/config.d/27healthcheck.config

06/08/21 10:37:49    /etc/condor/config.d/28rebooter.config

06/08/21 10:37:49    /etc/condor/config.d/29start.config

06/08/21 10:37:49    /etc/condor/config.d/30start_jobtypes.config

06/08/21 10:37:49    /etc/condor/config.d/30start_multicore.config

06/08/21 10:37:49    /etc/condor/config.d/41shared_port.config

06/08/21 10:37:49    /etc/condor/condor_config.local

06/08/21 10:37:49 config Macros = 159, Sorted = 159, StringBytes = 7323, TablesBytes = 5868

06/08/21 10:37:49 CLASSAD_CACHING is OFF

06/08/21 10:37:49 Daemon Log is logging: D_ALWAYS D_ERROR

06/08/21 10:37:50 SharedPortEndpoint: waiting for connections to named socket 1677_dc69

06/08/21 10:37:50 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory

06/08/21 10:37:50 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.

06/08/21 10:37:50 DaemonCore: private command socket at <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>

06/08/21 10:37:50 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1616514849)

06/08/21 10:37:50 Starting shared port with port: 9618

06/08/21 10:37:50 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 1852

06/08/21 10:37:50 Waiting for /var/lock/condor/shared_port_ad to appear.

06/08/21 10:37:50 DefaultReaper unexpectedly called on pid 1852, status 1024.

06/08/21 10:37:50 The SHARED_PORT (pid 1852) exited with status 4

06/08/21 10:37:50 Sending obituary for "/usr/libexec/condor/condor_shared_port"

06/08/21 10:37:50 restarting /usr/libexec/condor/condor_shared_port in 10 seconds
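
A note on the two status values above: "status 1024" on the DefaultReaper line and "exited with status 4" on the following line appear to be the same result, assuming 1024 is a raw POSIX wait status with the exit code in the high byte. A minimal check on Linux:

```python
import os

raw_status = 1024  # value from the "DefaultReaper unexpectedly called" line

# On Linux a normal exit stores the exit code in bits 8-15 of the wait status.
assert os.WIFEXITED(raw_status)
exit_code = os.WEXITSTATUS(raw_status)
print(exit_code)  # 4
```

This only shows that the two numbers agree; it does not say what exit code 4 means to the Master.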

 

## SharedPortLog

 

06/08/21 10:40:09 Setting maximum file descriptors to 4096.

06/08/21 10:40:09 ******************************************************

06/08/21 10:40:09 ** condor_shared_port (CONDOR_SHARED_PORT) STARTING UP

06/08/21 10:40:09 ** /usr/libexec/condor/condor_shared_port

06/08/21 10:40:09 ** SubsystemInfo: name=SHARED_PORT type=SHARED_PORT(11) class=DAEMON(1)

06/08/21 10:40:09 ** Configuration: subsystem:SHARED_PORT local:<NONE> class:DAEMON

06/08/21 10:40:09 ** $CondorVersion: 8.8.13 Mar 23 2021 BuildID: 534541 PackageID: 8.8.13-1 $

06/08/21 10:40:09 ** $CondorPlatform: x86_64_CentOS7 $

06/08/21 10:40:09 ** PID = 2997

06/08/21 10:40:09 ** Log last touched 6/8 10:40:08

06/08/21 10:40:09 ******************************************************

06/08/21 10:40:09 Using config source: /etc/condor/condor_config

06/08/21 10:40:09 Using local config sources:

06/08/21 10:40:09    /etc/condor/config.d/00init.config

06/08/21 10:40:09    /etc/condor/config.d/01puppet_ssl.config

06/08/21 10:40:09    /etc/condor/config.d/02machines.config

06/08/21 10:40:09    /etc/condor/config.d/05security.config

06/08/21 10:40:09    /etc/condor/config.d/20wn_centos7.config

06/08/21 10:40:09    /etc/condor/config.d/25scaling.config

06/08/21 10:40:09    /etc/condor/config.d/27healthcheck.config

06/08/21 10:40:09    /etc/condor/config.d/28rebooter.config

06/08/21 10:40:09    /etc/condor/config.d/29start.config

06/08/21 10:40:09    /etc/condor/config.d/30start_jobtypes.config

06/08/21 10:40:09    /etc/condor/config.d/30start_multicore.config

06/08/21 10:40:09    /etc/condor/config.d/41shared_port.config

06/08/21 10:40:09    /etc/condor/condor_config.local

06/08/21 10:40:09 config Macros = 161, Sorted = 161, StringBytes = 7389, TablesBytes = 5940

06/08/21 10:40:09 CLASSAD_CACHING is ENABLED

06/08/21 10:40:09 Daemon Log is logging: D_ALWAYS D_ERROR

06/08/21 10:40:09 Daemoncore: Listening at <[::]:9618> on TCP (ReliSock).

06/08/21 10:40:09 DaemonCore: command socket at <[2001:630:58:1c20::82f6:2d01]:9618?addrs=[2001-630-58-1c20--82f6-2d01]-9618&noUDP>

06/08/21 10:40:09 DaemonCore: private command socket at <[2001:630:58:1c20::82f6:2d01]:9618?addrs=[2001-630-58-1c20--82f6-2d01]-9618>

06/08/21 10:40:09 main_init() called

06/08/21 10:40:09 About to update statistics in shared_port daemon ad file at /var/lock/condor/shared_port_ad :

ForkedChildrenPeak = 0

RequestsBlocked = 0

ForkedChildrenCurrent = 0

RequestsSucceeded = 0

RequestsPendingPeak = 0

RequestsPendingCurrent = 0

RequestsFailed = 0

SharedPortCommandSinfuls = "<[2001:630:58:1c20::82f6:2d01]:9618>"

MyAddress = "<[2001:630:58:1c20::82f6:2d01]:9618?addrs=[2001-630-58-1c20--82f6-2d01]-9618&noUDP>"

06/08/21 10:40:09 attempt to connect to <[2001:630:58:1c20::82f6:2d01]:0> failed: Connection refused (connect errno = 111).

06/08/21 10:40:09 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <[2001:630:58:1c20::82f6:2d01]:0> (try 1 of 3): CEDAR:6001:Failed to connect to <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>

06/08/21 10:40:09 attempt to connect to <[2001:630:58:1c20::82f6:2d01]:0> failed: Connection refused (connect errno = 111).

06/08/21 10:40:09 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <[2001:630:58:1c20::82f6:2d01]:0> (try 2 of 3): CEDAR:6001:Failed to connect to <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>|CEDAR:6001:Failed to connect to <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>

06/08/21 10:40:09 attempt to connect to <[2001:630:58:1c20::82f6:2d01]:0> failed: Connection refused (connect errno = 111).

06/08/21 10:40:09 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <[2001:630:58:1c20::82f6:2d01]:0> (try 3 of 3): CEDAR:6001:Failed to connect to <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>|CEDAR:6001:Failed to connect to <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>|CEDAR:6001:Failed to connect to <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>

06/08/21 10:40:09 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>" at line 241 in file /var/lib/condor/execute/slot1/dir_12537/userdir/.tmpfVvlO6/BUILD/condor-8.8.13/src/condor_daemon_core.V6/daemon_keep_alive.cpp
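
To make the addresses in the two logs easier to compare, the sketch below splits one of the bracketed address strings into host, port and parameters. The pattern is inferred only from the lines quoted above, `parse_address` is a hypothetical helper, and it is not HTCondor's own parser.

```python
import re
from urllib.parse import parse_qs

# Pattern inferred from the bracketed IPv6 addresses in the logs above only.
_ADDR = re.compile(
    r"^<\[(?P<host>[0-9A-Fa-f:]+)\]:(?P<port>\d+)(?:\?(?P<query>[^>]*))?>$"
)


def parse_address(text):
    """Split '<[host]:port?key=value&flag>' into its parts (hypothetical helper)."""
    match = _ADDR.match(text)
    if match is None:
        raise ValueError(f"unrecognised address: {text!r}")
    # keep_blank_values keeps bare flags such as 'noUDP' with an empty value.
    query = parse_qs(match.group("query") or "", keep_blank_values=True)
    return {
        "host": match.group("host"),
        "port": int(match.group("port")),
        "params": {key: values[0] for key, values in query.items()},
    }


parent = parse_address("<[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>")
print(parent["port"], parent["params"])  # 0 {'sock': '1677_dc69'}
```

The parent address carries port 0 plus a `sock=` name, and the failed connection attempts in the SharedPortLog go to port 0 on the same host, which is consistent with the "Connection refused" errors. Whether the Master should have advertised port 9618 at that point is an open question here, not something this sketch establishes.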
