
[HTCondor-users] 8.4.0: ERROR: SECMAN:2003:TCP connection to collector failed.



Hi all,

I just updated a CentOS 6 central manager to 6.7 and condor 8.4.0. After a
reboot I get:

> 10/07/15 16:31:38 ******************************************************
> 10/07/15 16:31:38 Using config source: /etc/condor/condor_config
> 10/07/15 16:31:38 Using local config sources:
> 10/07/15 16:31:38 /etc/condor/config.d/10_pool.conf
> 10/07/15 16:31:38 /etc/condor/config.d/20_host.conf
> 10/07/15 16:31:38 /etc/condor/config.d/30_cron.conf
> 10/07/15 16:31:38 /etc/condor/config.d/99_nfy_glidein2.conf
> 10/07/15 16:31:38 /dev/null
> 10/07/15 16:31:38 config Macros = 113, Sorted = 113, StringBytes = 3711, TablesBytes = 4156
> 10/07/15 16:31:38 CLASSAD_CACHING is ENABLED
> 10/07/15 16:31:38 Daemon Log is logging: D_ALWAYS D_ERROR
> 10/07/15 16:31:38 SharedPortEndpoint: waiting for connections to named socket 19015_99d8_5
> 10/07/15 16:31:38 DaemonCore: command socket at <144.92.167.251:9619?addrs=144.92.167.251-9619&noUDP&sock=19015_99d8_5>
> 10/07/15 16:31:38 DaemonCore: private command socket at <144.92.167.251:9619?addrs=144.92.167.251-9619&noUDP&sock=19015_99d8_5>
> 10/07/15 16:31:38 my_popenv failed
> 10/07/15 16:31:38 Failed to execute /usr/sbin/condor_starter.std, ignoring
> 10/07/15 16:31:38 VM-gahp server reported an internal error

... nothing unusual until

> 10/07/15 16:31:38 slot12: Changing activity: Benchmarking -> Idle
> 10/07/15 16:31:41 attempt to connect to <144.92.167.251:9618> failed: Connection refused (connect errno = 111).
> 10/07/15 16:31:41 ERROR: SECMAN:2003:TCP connection to collector exocet.bmrb.wisc.edu failed.
> 10/07/15 16:31:41 Failed to start non-blocking update to <144.92.167.251:9618>.
> 10/07/15 16:31:42 attempt to connect to <144.92.167.251:9618> failed: Connection refused (connect errno = 111).
> 10/07/15 16:31:42 ERROR: SECMAN:2003:TCP connection to collector exocet.bmrb.wisc.edu failed.
> 10/07/15 16:31:42 Failed to start non-blocking update to <144.92.167.251:9618>.

and my condor pool is dead:

> $ condor_status
> Error: communication error
> CEDAR:6001:Failed to connect to <144.92.167.251:9618>
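
For anyone wanting to check the same symptom on their own central manager, here is a minimal sketch that tries a plain TCP connect to each port and reports whether it is accepted or refused. The host and the two ports (9618 from the errors, 9619 from the command socket line) are taken from the log above; the function names are made up for illustration and this is not part of HTCondor.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a plain TCP connect; return 'open', 'refused', or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Same condition the log reports as "connect errno = 111" on Linux.
        return "refused"
    except OSError as exc:
        # Timeouts, unreachable hosts, DNS failures, and so on.
        return f"error: {exc}"


def probe_ports(host, ports):
    """Probe several ports on one host and return {port: status}."""
    return {port: probe_tcp(host, port) for port in ports}


# Example call, using the address and ports that appear in the log:
# print(probe_ports("144.92.167.251", [9618, 9619]))
```

A "refused" result only says that nothing is listening on that port at that moment; it does not say which daemon was supposed to be there.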

Thankfully,

> # rpm -e --nodeps condor-external-libs
> # rpm -e --nodeps condor-classads
> # rpm -e --nodeps condor-procd
> # yum downgrade condor

fixed it, my pool is back to normal.

What's special about this box is OSG flocking and shared port:

> USE_SHARED_PORT = TRUE
> SHARED_PORT_ARGS = -p 9619
> SEC_WRITE_AUTHENTICATION_METHODS = FS, PASSWORD, CLAIMTOBE, GSI, SSL
> SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD, GSI, SSL
> SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD, GSI, SSL, CLAIMTOBE
> SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True
> SEC_DEFAULT_NEGOTIATION =
> AUTH_SSL_CLIENT_CADIR = /etc/grid-security/sslcerts
> AUTH_SSL_CLIENT_CERTFILE = /etc/grid-security/hostcert.pem
> AUTH_SSL_CLIENT_KEYFILE = /etc/grid-security/hostkey.pem
> AUTH_SSL_SERVER_CADIR = /etc/grid-security/sslcerts
> AUTH_SSL_SERVER_CERTFILE = /etc/grid-security/hostcert.pem
> AUTH_SSL_SERVER_KEYFILE = /etc/grid-security/hostkey.pem
> FLOCK_INCREMENT=10
> SCHEDD_MAX_FILE_DESCRIPTORS = 102400
> DAGMAN_MAX_JOBS_SUBMITTED=
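
As a rough aid for comparing this config with the log, the sketch below pulls the port out of a `SHARED_PORT_ARGS` line. It assumes that `-p` is what sets the shared port daemon's port, which the 9619 in the command socket line suggests but which I have not confirmed against the 8.4 manual. The helper name is hypothetical and the parsing is deliberately simplistic (no macro expansion, no includes).

```python
import shlex


def shared_port_from_config(lines):
    """Return the port given via '-p N' in SHARED_PORT_ARGS, or None.

    Accepts raw config lines, optionally prefixed with '> ' as quoted in
    this message. If the setting appears more than once, the last one wins.
    """
    port = None
    for raw in lines:
        line = raw.lstrip("> ").strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip().upper() != "SHARED_PORT_ARGS":
            continue
        args = shlex.split(value)
        for i, arg in enumerate(args):
            if arg == "-p" and i + 1 < len(args):
                port = int(args[i + 1])
    return port


config = [
    "> USE_SHARED_PORT = TRUE",
    "> SHARED_PORT_ARGS = -p 9619",
]
print(shared_port_from_config(config))
```

The value it finds can then be compared by hand with the port in the "attempt to connect" lines, which is 9618 in my log.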

So which of those did you guys break in 8.4.0?

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
