[HTCondor-users] Master server sends wrong SCHEDD IP



Hello from Vienna!

We have a rather severe problem in our cluster: jobs are evicted after
a connection timeout. It seems that the wrong IP address of the SCHEDD
is transmitted.

In our small grid we use two networks: the standard LAN and an openvpn
network.
We have one central master server (collector, schedd, negotiator,
startd, shared port daemon) which listens on two network interfaces: eth0
(for the standard LAN) and tun0 (for the openvpn).
We have calculation nodes on both networks. It seems that the master
server sends out all IP addresses correctly (depending on the network),
but NOT the SCHEDD IP. The SCHEDD IP is always the LAN IP.

Therefore all nodes on the openvpn network get the LAN IP of the SCHEDD.
This IP is not reachable by them due to firewall restrictions, so the
nodes cannot connect to the SCHEDD and abort the running jobs after a
timeout. In practice this means that jobs are matched to nodes on the
openvpn network, start running, and are evicted after 20 minutes (the
timeout when the SCHEDD cannot be reached).
Unfortunately we cannot run a second master server: all machines may
be shut down by their users, so every node has to use the same
server.

Example:
MasterServer 
	LAN IP:123.123.123.123
	openvpn IP:10.8.0.1
Node on standard LAN (information retrieved from StartLog):
	Got activate_claim request from shadow (123.123.123.123)
	collector <123.123.123.123:9618?sock=collector>
	Schedd addr = <123.123.123.123:9618?noUDP&sock=6778_8abd_3>
	ClaimId(<123.123.123.123:9618>#1343371837#2...

Node on openvpn network:
	Got activate_claim request from shadow (10.8.0.1)
	collector <10.8.0.1:9618?sock=collector>
	Schedd addr = <123.123.123.123:9618?noUDP&sock=6778_8abd_4>
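
To make the mismatch easier to spot on other nodes, here is a minimal
Python sketch that scans StartLog lines for advertised addresses outside
the network the node can actually reach. It is only an illustration: the
function name, the regular expression and the subnet 10.8.0.0/16 (chosen
to match NETWORK_INTERFACE = 10.8.*) are assumptions, and it assumes the
log lines look like the excerpts above.

```python
import ipaddress
import re

# Host part of an address written as <host:port...>, e.g.
# <10.8.0.1:9618?sock=collector>. IPv4 only; this pattern is an assumption
# based on the log excerpts, not an official format definition.
ADDR_HOST = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):\d+")


def unreachable_addresses(log_lines, reachable_net):
    """Return (line, host) pairs whose advertised host is outside reachable_net."""
    net = ipaddress.ip_network(reachable_net)
    hits = []
    for line in log_lines:
        for host in ADDR_HOST.findall(line):
            if ipaddress.ip_address(host) not in net:
                hits.append((line.strip(), host))
    return hits


# Lines copied from the StartLog of the node on the openvpn network.
vpn_node_log = [
    "Got activate_claim request from shadow (10.8.0.1)",
    "collector <10.8.0.1:9618?sock=collector>",
    "Schedd addr = <123.123.123.123:9618?noUDP&sock=6778_8abd_4>",
]

# 10.8.0.0/16 is an assumed subnet for the openvpn network.
for line, host in unreachable_addresses(vpn_node_log, "10.8.0.0/16"):
    print(f"{host} is outside the VPN subnet: {line}")
```

For the excerpt above, only the "Schedd addr" line is reported; the
collector address is inside the assumed VPN subnet.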


Our configuration is as follows.
The central server is configured in this way:
BIND_ALL_INTERFACES = true
MAX_FILE_DESCRIPTORS = 100000
#Restrict Condor to use only port 9618
SHARED_PORT_ARGS = -p 9618
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
USE_SHARED_PORT = TRUE
##DISABLE UDP
UPDATE_COLLECTOR_WITH_TCP = True
WANT_UDP_COMMAND_SOCKET = False
COLLECTOR_SOCKET_CACHE_SIZE=10000
SCHEDD_QUERY_WORKERS   = 6
COLLECTOR_QUERY_WORKERS = 32


The openvpn clients are configured:
CONDOR_HOST = 10.8.0.1
BIND_ALL_INTERFACES = False
NETWORK_INTERFACE = 10.8.*
#Restrict Condor to use only port 9618
SHARED_PORT_ARGS = -p 9618
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
USE_SHARED_PORT = TRUE
#DISABLE UDP
UPDATE_COLLECTOR_WITH_TCP = True
WANT_UDP_COMMAND_SOCKET = False
COLLECTOR_SOCKET_CACHE_SIZE=10000

Are there any suggestions on how we could resolve this situation?
This is quite a show stopper, as we have quite a few nodes in the openvpn
network which are unusable at the moment.
Is there a way to override the SCHEDD IP address at the client level?

Best regards from Austria,
Hermann

-- 
-------------
DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien

Tel.  + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx