
[HTCondor-users] HTCondor Transferer issue for HAD Replication issue with Shared Port



Dear HTCondor experts,

It seems our high-availability (HA) setup of two central managers (CMs) is broken.

I regularly get the following in TransferLog (with ADDRCM1 being the address of CM1 and ADDRCM2 being the address of CM2):
------------------------
04/12/18 12:27:41 SharedPortEndpoint: waiting for connections to named socket 2142_fb90
04/12/18 12:27:41 DaemonCore: command socket at <ADDRCM1:9618?addrs=ADDRCM1-9618+[--1]-9618&noUDP&sock=2142_fb90>
04/12/18 12:27:41 DaemonCore: private command socket at <ADDRCM1:9618?addrs=ADDRCM1-9618+[--1]-9618&noUDP&sock=2142_fb90>
04/12/18 12:27:41 BaseReplicaTransferer::reinitialize started
04/12/18 12:29:48 attempt to connect to <ADDRCM2:43586> failed: Connection timed out (connect errno = 110).  Will keep trying for 2147483647 total seconds (2147483520 to go).
------------------------
The port number ("43586" here) changes with each error message.
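
For anyone who wants to check the same pattern in their own log, here is a minimal Python sketch that tallies the failed connection targets that are not on the shared port. The function name failed_ports and the default of 9618 are only illustrative assumptions; the regular expression is modelled on the single error line quoted above and may need adjusting for other message formats.

```python
import re
from collections import Counter

# Matches the target of a failed connect, e.g. "<ADDRCM2:43586>".
# Assumption: the host part contains no ':' (plain IPv4 address or hostname).
FAILED_CONNECT = re.compile(r"attempt to connect to <([^:>]+):(\d+)> failed")


def failed_ports(log_lines, shared_port=9618):
    """Count failed connection attempts per (host, port), skipping the shared port."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_CONNECT.search(line)
        if match and int(match.group(2)) != shared_port:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Two lines taken from the log excerpt above.
sample = [
    "04/12/18 12:27:41 BaseReplicaTransferer::reinitialize started",
    "04/12/18 12:29:48 attempt to connect to <ADDRCM2:43586> failed: "
    "Connection timed out (connect errno = 110).",
]
print(failed_ports(sample))
```

Fed the lines of the real log file instead of the sample list, this should show at a glance whether every failed attempt goes to a port other than 9618.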

Configuration is as follows (with FQDNCM1 and FQDNCM2 being the real FQDNs, of course):
------------------------
SHARED_PORT_PORT = 9618
SHARED_PORT_ARGS = -p $(SHARED_PORT_PORT)
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = condor-cm1.physik.uni-bonn.de?sock=collector, condor-cm2.physik.uni-bonn.de?sock=collector
USE_SHARED_PORT = true

HAD_PORT = $(SHARED_PORT_PORT)
HAD_USE_SHARED_PORT = TRUE
REPLICATION_PORT = $(SHARED_PORT_PORT)
REPLICATION_USE_SHARED_PORT = TRUE
REPLICATION_LIST = FQDNCM1:$(REPLICATION_PORT), FQDNCM2:$(REPLICATION_PORT)
HAD_LIST = FQDNCM1:$(HAD_PORT), FQDNCM2:$(HAD_PORT)
------------------------

Can somebody tell me why the transferer tries to connect to an arbitrary port? 
This is naturally blocked by our firewall, since we are using shared_port mode. 

Any help is appreciated. 

Cheers,
	Oliver
