
[Condor-users] Connection Problem with multiple Network Cards - Timeouts after 20 min



Hello

We are having problems in our cluster: many jobs are matched, start
running, and are then evicted after 20 minutes.
I would appreciate your help in finding a solution.

Our cluster setup is rather complicated.
We have one central server with two network interfaces:
-) eth0, a physical network interface with a public IP
-) tun0, an OpenVPN network interface (IP: 10.8.0.1, network: 10.8.0.*)

Some machines connect to the central server using its public IP
address (eth0), while the bulk of the machines connect to it using its
OpenVPN IP address (tun0). The central server therefore needs to listen
on both IP addresses.
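
To make the intended routing explicit, here is a minimal sketch of
which server address each client is expected to use. The helper name
expected_server_address is hypothetical, the /24 prefix is my reading
of "10.8.0.*", and the public IP is the same dummy value used in the
log excerpt further down.

```python
import ipaddress

# Values taken from the description above. The /24 prefix is an assumed
# reading of "10.8.0.*"; the public IP is a dummy placeholder.
VPN_NETWORK = ipaddress.ip_network("10.8.0.0/24")
SERVER_VPN_IP = ipaddress.ip_address("10.8.0.1")
SERVER_PUBLIC_IP = ipaddress.ip_address("123.123.123.123")


def expected_server_address(client_ip):
    """Return the server address a client should talk to.

    Hypothetical helper: clients inside the OpenVPN network should use
    the tun0 address, all other clients the public eth0 address.
    """
    if ipaddress.ip_address(client_ip) in VPN_NETWORK:
        return SERVER_VPN_IP
    return SERVER_PUBLIC_IP


if __name__ == "__main__":
    for client in ("10.8.0.17", "192.0.2.5"):
        print(client, "->", expected_server_address(client))
```

This only illustrates the expectation; it says nothing about how
Condor itself chooses the address it advertises.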

The central server is configured as follows:
BIND_ALL_INTERFACES = true
MAX_FILE_DESCRIPTORS = 100000
#Restrict Condor to use only port 9618
SHARED_PORT_ARGS = -p 9618
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
USE_SHARED_PORT = TRUE
##DISABLE UDP
UPDATE_COLLECTOR_WITH_TCP = True
WANT_UDP_COMMAND_SOCKET = False
COLLECTOR_SOCKET_CACHE_SIZE=10000
SCHEDD_QUERY_WORKERS   = 6
COLLECTOR_QUERY_WORKERS = 32

The client is configured as follows:
CONDOR_HOST = 10.8.0.1
BIND_ALL_INTERFACES = False
NETWORK_INTERFACE = 10.8.*
#Restrict Condor to use only port 9618
SHARED_PORT_ARGS = -p 9618
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
USE_SHARED_PORT = TRUE
#DISABLE UDP
UPDATE_COLLECTOR_WITH_TCP = True
WANT_UDP_COMMAND_SOCKET = False
COLLECTOR_SOCKET_CACHE_SIZE=1000


Jobs are correctly matched and started using the tun0 IP (10.8.0.1) of
the server. Once a job has started, it seems that the node then tries
to use the public IP instead of the OpenVPN IP of the central server.
This fails, and after a timeout (20 min) the job is evicted.
Below is a snippet of the StartLog (I have replaced the public IP with
a dummy value):

10/24/12 10:10:31 Got activate_claim request from shadow (10.8.0.1)
10/24/12 10:10:31 Remote job ID is 2395.14
10/24/12 10:10:31 Got universe "VANILLA" (5) from request classad
10/24/12 10:10:31 State change: claim-activation protocol successful
10/24/12 10:10:31 Changing activity: Idle -> Busy
10/24/12 10:11:45 attempt to connect to <123.123.123.123:9618> failed:
Connection timed out (connect errno = 110).
10/24/12 10:11:45 Failed to connect to schedd
<123.123.123.123:9618?noUDP&sock=16127_44fe_3>
10/24/12 10:12:53 attempt to connect to <123.123.123.123:9618> failed:
Connection timed out (connect errno = 110).  Will keep trying for 397
total seconds (334 to go).
10/24/12 10:18:28 attempt to connect to <123.123.123.123:9618> failed:
Connection timed out (connect errno = 110).
10/24/12 10:18:28 Failed to connect to schedd
<123.123.123.123:9618?noUDP&sock=16127_44fe_3>
10/24/12 10:19:37 attempt to connect to <123.123.123.123:9618> failed:
Connection timed out (connect errno = 110).  Will keep trying for 397
total seconds (333 to go).
10/24/12 10:23:03 Called deactivate_claim_forcibly()
10/24/12 10:23:03 Changing state and activity: Claimed/Busy ->
Preempting/Vacating
10/24/12 10:23:03 Starter pid 1137 exited with status 0
10/24/12 10:23:03 State change: starter exited
10/24/12 10:23:03 State change: No preempting claim, returning to owner
10/24/12 10:23:03 Changing state and activity: Preempting/Vacating ->
Owner/Idle
10/24/12 10:23:03 State change: IS_OWNER is false
10/24/12 10:23:03 Changing state: Owner -> Unclaimed
10/24/12 10:23:03 Changing state: Unclaimed -> Delete
10/24/12 10:23:03 Resource no longer needed, deleting
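
To see at a glance which address the node keeps trying, a small script
along the following lines could count the failed connection attempts
per target. It is only a sketch: the function name
count_failed_targets and the regular expression are assumptions
derived from the excerpt above, not anything provided by Condor.

```python
import re
from collections import Counter

# Assumed pattern, based only on lines in the excerpt above such as
# "attempt to connect to <123.123.123.123:9618> failed:"
CONNECT_FAILED = re.compile(r"attempt to connect to <([\d.]+):(\d+)> failed")


def count_failed_targets(log_text):
    """Return a Counter mapping (ip, port) to the number of failed attempts."""
    targets = Counter()
    for match in CONNECT_FAILED.finditer(log_text):
        targets[(match.group(1), int(match.group(2)))] += 1
    return targets


if __name__ == "__main__":
    sample = (
        "10/24/12 10:10:31 Got activate_claim request from shadow (10.8.0.1)\n"
        "10/24/12 10:11:45 attempt to connect to <123.123.123.123:9618> failed:\n"
        "Connection timed out (connect errno = 110).\n"
        "10/24/12 10:12:53 attempt to connect to <123.123.123.123:9618> failed:\n"
        "Connection timed out (connect errno = 110).  Will keep trying for 397\n"
        "total seconds (334 to go).\n"
    )
    for (ip, port), n in count_failed_targets(sample).items():
        print(f"{ip}:{port} failed {n} time(s)")
```

In practice the contents of the StartLog file would be read and passed
in instead of the sample string.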


Does anybody have an idea why this is happening?
Any suggestions on how to remedy this?
Perhaps there is a configuration setting to force the clients to talk
only to a specific IP of the central server?


Best regards from Austria,
Hermann
-- 
-------------
DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien

Tel.  + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx