
[Condor-users] Problem: Jobs are evicted after 20 minutes



Hello

We are having a problem with our cluster. Many jobs are matched, start
to run, and are then evicted after 20 minutes.
I would appreciate your help in finding a solution.

We have a rather complicated cluster setup.
There is one central server with two network interfaces:
-) eth0, a real network interface with a public IP
-) tun0, an OpenVPN network interface (IP: 10.8.0.1, network: 10.8.0.*)

Nodes may connect to the central server using the public IP, or (most of
the nodes) using the openvpn network IP. The central server is
configured in this way:
BIND_ALL_INTERFACES = true
MAX_FILE_DESCRIPTORS = 100000
#Restrict Condor to use only port 9618
SHARED_PORT_ARGS = -p 9618
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
USE_SHARED_PORT = TRUE
##DISABLE UDP
UPDATE_COLLECTOR_WITH_TCP = True
WANT_UDP_COMMAND_SOCKET = False
COLLECTOR_SOCKET_CACHE_SIZE=10000
SCHEDD_QUERY_WORKERS   = 6
COLLECTOR_QUERY_WORKERS = 32

Most of the nodes are virtual machines which are behind a NAT and should
not use the public IP of the central server.
The nodes can contact the server using the openvpn IP, show up in
condor_status and are also matched and given a job.

Once the job has started, it seems that the node then tries to use the
public IP of the central server INSTEAD of its OpenVPN IP. This fails,
and after a timeout the job is evicted.
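
One way to confirm this from an affected node is to test plain TCP reachability of both server addresses on the shared port. The following is only a minimal sketch, assuming Python 3 is available on the node; the helper name can_connect is made up for illustration and is not part of Condor.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False


# Example use on a node (not run here). 10.8.0.1 is the OpenVPN address of
# the central server; 123.123.123.123 is only the dummy stand-in for the
# public IP and has to be replaced with the real one:
#
#   for host in ("10.8.0.1", "123.123.123.123"):
#       print(host, can_connect(host, 9618))
```

If the OpenVPN address is reachable and the public one is not, that would match the behaviour described above.
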
Below you will find a snippet of the StartLog (I have replaced the public
IP with a dummy value):

10/24/12 10:10:31 Got activate_claim request from shadow (10.8.0.1)
10/24/12 10:10:31 Remote job ID is 2395.14
10/24/12 10:10:31 Got universe "VANILLA" (5) from request classad
10/24/12 10:10:31 State change: claim-activation protocol successful
10/24/12 10:10:31 Changing activity: Idle -> Busy
10/24/12 10:11:45 attempt to connect to <123.123.123.123:9618> failed:
Connection timed out (connect errno = 110).
10/24/12 10:11:45 Failed to connect to schedd
<123.123.123.123:9618?noUDP&sock=16127_44fe_3>
10/24/12 10:12:53 attempt to connect to <123.123.123.123:9618> failed:
Connection timed out (connect errno = 110).  Will keep trying for 397
total seconds (334 to go).

10/24/12 10:18:28 attempt to connect to <123.123.123.123:9618> failed:
Connection timed out (connect errno = 110).
10/24/12 10:18:28 Failed to connect to schedd
<123.123.123.123:9618?noUDP&sock=16127_44fe_3>
10/24/12 10:19:37 attempt to connect to <123.123.123.123:9618> failed:
Connection timed out (connect errno = 110).  Will keep trying for 397
total seconds (333 to go).

10/24/12 10:23:03 Called deactivate_claim_forcibly()
10/24/12 10:23:03 Changing state and activity: Claimed/Busy ->
Preempting/Vacating
10/24/12 10:23:03 Starter pid 1137 exited with status 0
10/24/12 10:23:03 State change: starter exited
10/24/12 10:23:03 State change: No preempting claim, returning to owner
10/24/12 10:23:03 Changing state and activity: Preempting/Vacating ->
Owner/Idle
10/24/12 10:23:03 State change: IS_OWNER is false
10/24/12 10:23:03 Changing state: Owner -> Unclaimed
10/24/12 10:23:03 Changing state: Unclaimed -> Delete
10/24/12 10:23:03 Resource no longer needed, deleting
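
To see at a glance which addresses a node fails to reach, the failed connection attempts in a log like the one above can be counted per target. This is only a rough sketch, assuming Python 3; the function name failed_targets and the sample text are illustrative, and the regular expression is based solely on the line format shown in this log.

```python
import ipaddress
import re

# Matches the target in lines such as
# "attempt to connect to <123.123.123.123:9618> failed:"
FAILED_CONNECT = re.compile(
    r"attempt to connect to <(\d+\.\d+\.\d+\.\d+):(\d+)> failed"
)

# The OpenVPN network from this post (10.8.0.*).
VPN_NET = ipaddress.ip_network("10.8.0.0/24")


def failed_targets(log_text):
    """Count failed connection attempts per (ip, port, on_vpn) target."""
    counts = {}
    for ip, port in FAILED_CONNECT.findall(log_text):
        on_vpn = ipaddress.ip_address(ip) in VPN_NET
        key = (ip, int(port), on_vpn)
        counts[key] = counts.get(key, 0) + 1
    return counts


# A few lines copied from the log above, with the same dummy public IP.
sample = """10/24/12 10:10:31 Got activate_claim request from shadow (10.8.0.1)
10/24/12 10:11:45 attempt to connect to <123.123.123.123:9618> failed:
Connection timed out (connect errno = 110).
10/24/12 10:12:53 attempt to connect to <123.123.123.123:9618> failed:
"""
print(failed_targets(sample))
```

In this log, every failed attempt targets the public address and none targets the 10.8.0.* network.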


Does anybody have an idea why this is happening?
Any suggestions on how to remedy this?


Best regards from Austria,
Hermann

-- 
-------------
DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien

Tel.  + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx