
[HTCondor-users] Collector not reachable on localhost?



Hi all,

I am struggling somewhat to bring up a test cluster (v8.6.4 on OpenStack,
IPv4 only) where a Master, Collector, Negotiator and Scheduler are all
running on one host (plus a few workers).

The problem seems to be that the collector cannot connect to itself(?). The
master restarts the collector several times but cannot connect to
it [1].
In the Collector log [2], DaemonCore also complains about not being
able to connect to the Collector (on the same host/IP).
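
A minimal sketch of how the restart pattern in [1] could be tallied, assuming the MasterLog keeps the "The <DAEMON> (pid N) exited with status S" wording shown there. The function name and the regular expression are illustrative only, not part of HTCondor.

```python
import re
from collections import Counter

# Matches lines such as:
#   06/30/17 16:01:14 The COLLECTOR (pid 4979) exited with status 4
# The pattern is an assumption based on the MasterLog excerpt in [1].
EXIT_RE = re.compile(r"The (\w+) \(pid (\d+)\) exited with status (\d+)")


def count_daemon_exits(log_text):
    """Return a Counter keyed by (daemon name, exit status)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EXIT_RE.search(line)
        if match:
            daemon, _pid, status = match.groups()
            counts[(daemon, int(status))] += 1
    return counts


# Usage (the log path is an assumption; adjust to the local LOG setting):
#   with open("/var/log/condor/MasterLog") as fh:
#       print(count_daemon_exits(fh.read()))
```

Grouping by exit status makes it easier to see whether the same daemon keeps dying for the same reason across restarts.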

Network-wise, I was able to communicate via ncat between a worker and the
collector host on the shared port 9618 (and 9620). And, as far as I can
see, the collector is actually listening on 9618 [5].
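
The same reachability check can be scripted instead of done by hand with ncat. This is a minimal sketch using only the Python standard library; the function name and the timeout value are arbitrary choices. It only tests whether a TCP connection can be opened and says nothing about whether the daemon behind the port answers HTCondor commands.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused (errno 111), timeouts and DNS failures.
        return False


# Usage, with host and ports taken from the logs in this mail:
#   for port in (9618, 9620):
#       print(port, can_connect("os-condor-dev-collector.desy.de", port))
```

Running it both on the collector host itself and from a worker would show whether the "Connection refused" in [1] is specific to local connections.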

(Sched and Negotiator seem to be happy and are listening to the
DaemonCore on port 9620)

Maybe somebody has an idea what could be jamming the collector?
(being bound also to IPv6 link-local should be no problem, or??)

Cheers and thanks for ideas,
  Thomas

btw: is it actually necessary to set POOL_HISTORY_DIR [6] (here pointing to
/var/ViewHist/)? I had to create the directory manually, but I do not
remember having had to set the directory explicitly before.



[1]
> MasterLog
06/30/17 16:01:12 SharedPortEndpoint: waiting for connections to named
socket 4932_852f
06/30/17 16:01:12 SharedPortEndpoint: failed to open
/var/lock/condor/shared_port_ad: No such file or directory
06/30/17 16:01:12 SharedPortEndpoint: did not successfully find
SharedPortServer address. Will retry in 60s.
06/30/17 16:01:12 DaemonCore: private command socket at
<131.169.240.85:0?sock=4932_852f>
06/30/17 16:01:12 Adding SHARED_PORT to DAEMON_LIST, because
USE_SHARED_PORT=true (to disable this, set
AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
06/30/17 16:01:12 Master restart (GRACEFUL) is watching
/usr/sbin/condor_master (mtime:1498066869)
06/30/17 16:01:12 Started DaemonCore process
"/usr/libexec/condor/condor_shared_port", pid and pgroup = 4978
06/30/17 16:01:12 Waiting for /var/lock/condor/shared_port_ad to appear.
06/30/17 16:01:13 Found /var/lock/condor/shared_port_ad.
06/30/17 16:01:13 Started DaemonCore process
"/usr/sbin/condor_collector", pid and pgroup = 4979
06/30/17 16:01:13 Waiting for /var/log/condor/.collector_address to appear.
06/30/17 16:01:14 Found /var/log/condor/.collector_address.
06/30/17 16:01:14 Started DaemonCore process
"/usr/sbin/condor_negotiator", pid and pgroup = 4980
06/30/17 16:01:14 Started DaemonCore process
"/usr/libexec/condor/condor_gangliad", pid and pgroup = 4981
06/30/17 16:01:14 Started DaemonCore process "/usr/sbin/condor_schedd",
pid and pgroup = 4982
06/30/17 16:01:14 Started DaemonCore process
"/usr/libexec/condor/condor_defrag", pid and pgroup = 4983
06/30/17 16:01:14 DefaultReaper unexpectedly called on pid 4979, status
1024.
06/30/17 16:01:14 The COLLECTOR (pid 4979) exited with status 4
06/30/17 16:01:14 Sending obituary for "/usr/sbin/condor_collector"
06/30/17 16:01:14 restarting /usr/sbin/condor_collector in 10 seconds
06/30/17 16:01:14 attempt to connect to <131.169.240.85:9618> failed:
Connection refused (connect errno = 111).
06/30/17 16:01:14 ERROR: SECMAN:2003:TCP connection to collector
os-condor-dev-collector.desy.de:9618 failed.
06/30/17 16:01:14 Failed to start non-blocking update to
<131.169.240.85:9618>.
06/30/17 16:01:14 DefaultReaper unexpectedly called on pid 4981, status
1024.
06/30/17 16:01:14 The GANGLIAD (pid 4981) exited with status 4
06/30/17 16:01:14 Sending obituary for "/usr/libexec/condor/condor_gangliad"
06/30/17 16:01:14 restarting /usr/libexec/condor/condor_gangliad in 10
seconds
06/30/17 16:01:14 attempt to connect to <131.169.240.85:9618> failed:
Connection refused (connect errno = 111).
06/30/17 16:01:14 ERROR: SECMAN:2003:TCP connection to collector
os-condor-dev-collector.desy.de:9618 failed.
06/30/17 16:01:14 Failed to start non-blocking update to
<131.169.240.85:9618>.


[2]
> CollectorLog w. mkdir /var/ViewHist

06/30/17 16:05:51 MasterAd     : Inserting ** "<
os-condor-dev-collector.desy.de >"
06/30/17 16:05:51 Query info: matched=0; skipped=0; query_time=0.000982;
send_time=0.000148; type=MachinePrivate; requirements={true};
peer=<131.169.240.85:25443>; projection={}
06/30/17 16:05:51 Number of Active Workers 0
06/30/17 16:05:51 creating new table for type Defrag
06/30/17 16:05:51 Defrag: Inserting ** "< os-condor-dev-collector.desy.de >"
06/30/17 16:05:51 (Sending 0 ads in response to query)
06/30/17 16:05:51 Query info: matched=0; skipped=2; query_time=0.001465;
send_time=0.000103; type=Any; requirements={( ( ( MyType == "Scheduler"
) || ( MyType == "Submitter" ) ) || ( ( MyType == "Machine" ) ) )};
peer=<131.169.240.85:3507>; projection={}
06/30/17 16:05:51 ScheddAd     : Inserting ** "<
os-condor-dev-collector.desy.de , 131.169.240.85 >"
...
06/30/17 16:05:51 AccountingAd  : Inserting ** "< group_OPS >"
06/30/17 16:05:51 AccountingAd  : Inserting ** "< group_OTHER >"
06/30/17 16:05:51 DaemonCore: Can't receive command request from
131.169.240.85 (perhaps a timeout?)
06/30/17 16:05:51 NegotiatorAd  : Inserting ** "< NEGOTIATOR >"
06/30/17 16:05:54 Got QUERY_STARTD_ADS
06/30/17 16:05:54 Number of Active Workers 0
...
06/30/17 16:11:32 DaemonCore: Can't receive command request from
131.169.240.85 (perhaps a timeout?)

[3]
os-condor-dev-batch02 > condor_status
Error: communication error
CONDOR_STATUS:1:Unable to resolve COLLECTOR_HOST
(os-condor-dev-collector01.desy.de:9618).

[4]
ip addr | grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
    inet 131.169.240.85/23 brd 131.169.241.255 scope global dynamic eth0
    inet6 fe80::f816:3eff:fe62:d318/64 scope link

[5]
> netstat -tlnp | grep 9618
tcp        0      0 0.0.0.0:9618            0.0.0.0:*
LISTEN      6653/condor_collect
tcp6       0      0 :::9618                 :::*
LISTEN      6653/condor_collect


[6]
> CollectorLog
06/30/17 16:01:13 Setting maximum file descriptors to 10240.
06/30/17 16:01:13 ******************************************************
06/30/17 16:01:13 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
06/30/17 16:01:13 ** /usr/sbin/condor_collector
06/30/17 16:01:13 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(3)
class=DAEMON(1)
06/30/17 16:01:13 ** Configuration: subsystem:COLLECTOR local:<NONE>
class:DAEMON
06/30/17 16:01:13 ** $CondorVersion: 8.6.4 Jun 21 2017 BuildID: 408625 $
06/30/17 16:01:13 ** $CondorPlatform: x86_64_RedHat7 $
06/30/17 16:01:13 ** PID = 4979
06/30/17 16:01:13 ** Log last touched 6/30 16:00:40
06/30/17 16:01:13 ******************************************************
06/30/17 16:01:13 Using config source: /etc/condor/condor_config
06/30/17 16:01:13 Using local config sources:
06/30/17 16:01:13    /etc/condor/config.d/00masterd.conf
06/30/17 16:01:13    /etc/condor/config.d/04defragd.conf
06/30/17 16:01:13    /etc/condor/config.d/06accounting.conf
06/30/17 16:01:13    /etc/condor/config.d/20rebooter.conf
06/30/17 16:01:13    /etc/condor/condor_config.local
06/30/17 16:01:13 config Macros = 126, Sorted = 126, StringBytes = 6392,
TablesBytes = 4616
06/30/17 16:01:13 CLASSAD_CACHING is ENABLED
06/30/17 16:01:13 Daemon Log is logging: D_ALWAYS D_ERROR
06/30/17 16:01:13 SharedPortEndpoint: waiting for connections to named
socket 4979_d11b
06/30/17 16:01:13 DaemonCore: non-shared command socket at
<131.169.240.85:9618>
06/30/17 16:01:13 Daemoncore: Listening at <0.0.0.0:9618> on TCP
(ReliSock) and UDP (SafeSock).
06/30/17 16:01:13 DaemonCore: non-shared command socket at <[::1]:9618>
06/30/17 16:01:13 WARNING: Condor is running on a loopback address
06/30/17 16:01:13          of this machine, and may not visible to other
hosts!
06/30/17 16:01:13 Daemoncore: Listening at <[::]:9618> on TCP (ReliSock)
and UDP (SafeSock).
06/30/17 16:01:13 DaemonCore: command socket at
<131.169.240.85:9620?addrs=131.169.240.85-9620+[--1]-9620&noUDP&sock=4979_d11b>
06/30/17 16:01:13 DaemonCore: private command socket at
<131.169.240.85:9620?addrs=131.169.240.85-9620+[--1]-9620&noUDP&sock=4979_d11b>
06/30/17 16:01:14 In ViewServer::Init()
06/30/17 16:01:14 In CollectorDaemon::Init()
06/30/17 16:01:14 In ViewServer::Config()
06/30/17 16:01:14 In CollectorDaemon::Config()
06/30/17 16:01:14 ABSENT_REQUIREMENTS = None
06/30/17 16:01:14 OfflineCollectorPlugin::configure: no persistent store
was defined for off-line ads.
06/30/17 16:01:14 enable: Creating stats hash table
06/30/17 16:01:14 Enabling CCB Server.
06/30/17 16:01:14 m_reconnect_fname =
/var/lib/condor/spool/131.169.240.85-9620.ccb_reconnect
06/30/17 16:01:14 Configuration: SAMPLING_INTERVAL=60,
MAX_STORAGE=10000000, MaxFileSize=333333, POOL_HISTORY_DIR=/var/ViewHist
06/30/17 16:01:14 ERROR "POOL_HISTORY_DIR (/var/ViewHist) does not
exist." at line 180 in file
/slots/02/dir_4081266/userdir/.tmpO178Wi/BUILD/condor-8.6.4/src/condor_collector.V6/view_server.cpp
06/30/17 16:01:24 Setting maximum file descriptors to 10240.
