
Re: [Condor-users] Problem: Jobs are evicted after 20 minutes



Hermann - 

Good morning!

You will likely want to configure your CCB settings for NAT'd networks: 
http://research.cs.wisc.edu/condor/manual/v7.9/3_7Networking_includes.html#SECTION00474000000000000000
 
If needed, you can also bind Condor to a specific adapter using 'NETWORK_INTERFACE = tun0', or whatever your VPN adapter is called.
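
As a rough sketch of what that could look like in the local config of a NAT'd execute node: the adapter name tun0 and the use of the pool's collector as the CCB server are assumptions here, not something taken from your setup, so please check them against the manual section above before using them.

```
# Sketch for a NAT'd execute node; all values are assumptions, adjust to your pool.
# Bind Condor to the VPN adapter only (tun0 is an assumed adapter name).
NETWORK_INTERFACE = tun0
BIND_ALL_INTERFACES = False
# Use the pool's collector as the CCB server for this node.
CCB_ADDRESS = $(COLLECTOR_HOST)
```

A restart of the Condor daemons on the node is the safe way to make sure a changed network setting is picked up.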

Why is `BIND_ALL_INTERFACES = true` if you want condor to use the NAT?  

Cheers,
Tim 

----- Original Message -----
> From: "Hermann Fuchs" <hermann.fuchs@xxxxxxxxxxxxxxxx>
> To: "condor-users" <condor-users@xxxxxxxxxxx>
> Sent: Wednesday, October 24, 2012 3:55:58 AM
> Subject: [Condor-users] Problem: Jobs are evicted after 20 minutes
> 
> Hello
> 
> We have some problems within our cluster. It seems a lot of jobs are
> matched, start to calculate and are evicted after 20 minutes.
> I would like your help in trying to find a solution.
> 
> We have a rather complicated cluster setup.
> There is one central server with two network interfaces:
> -) eth0, a physical network interface with a public IP
> -) tun0, an OpenVPN network interface (IP: 10.8.0.1, network: 10.8.0.*)
> 
> Nodes may connect to the central server using the public IP or (as most
> of the nodes do) using the OpenVPN network IP. The central server is
> configured in this way:
> BIND_ALL_INTERFACES = true
> MAX_FILE_DESCRIPTORS = 100000
> #Restrict Condor to use only port 9618
> SHARED_PORT_ARGS = -p 9618
> DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
> COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
> USE_SHARED_PORT = TRUE
> ##DISABLE UDP
> UPDATE_COLLECTOR_WITH_TCP = True
> WANT_UDP_COMMAND_SOCKET = False
> COLLECTOR_SOCKET_CACHE_SIZE=10000
> SCHEDD_QUERY_WORKERS   = 6
> COLLECTOR_QUERY_WORKERS = 32
> 
> Most of the nodes are virtual machines that are behind a NAT and should
> not use the public IP of the central server.
> The nodes can contact the server using the OpenVPN IP, show up in
> condor_status, and are also matched and given a job.
> 
> Once the job has started, it seems that the node then tries to use the
> public IP INSTEAD of the OpenVPN IP of the central server. This fails,
> and after a timeout the job is evicted.
> Below you will find a snippet of the StartLog (I have replaced the public
> IP with a dummy value):
> 
> 10/24/12 10:10:31 Got activate_claim request from shadow (10.8.0.1)
> 10/24/12 10:10:31 Remote job ID is 2395.14
> 10/24/12 10:10:31 Got universe "VANILLA" (5) from request classad
> 10/24/12 10:10:31 State change: claim-activation protocol successful
> 10/24/12 10:10:31 Changing activity: Idle -> Busy
> 10/24/12 10:11:45 attempt to connect to <123.123.123.123:9618> failed: Connection timed out (connect errno = 110).
> 10/24/12 10:11:45 Failed to connect to schedd <123.123.123.123:9618?noUDP&sock=16127_44fe_3>
> 10/24/12 10:12:53 attempt to connect to <123.123.123.123:9618> failed: Connection timed out (connect errno = 110).  Will keep trying for 397 total seconds (334 to go).
> 
> 10/24/12 10:18:28 attempt to connect to <123.123.123.123:9618> failed: Connection timed out (connect errno = 110).
> 10/24/12 10:18:28 Failed to connect to schedd <123.123.123.123:9618?noUDP&sock=16127_44fe_3>
> 10/24/12 10:19:37 attempt to connect to <123.123.123.123:9618> failed: Connection timed out (connect errno = 110).  Will keep trying for 397 total seconds (333 to go).
> 
> 10/24/12 10:23:03 Called deactivate_claim_forcibly()
> 10/24/12 10:23:03 Changing state and activity: Claimed/Busy -> Preempting/Vacating
> 10/24/12 10:23:03 Starter pid 1137 exited with status 0
> 10/24/12 10:23:03 State change: starter exited
> 10/24/12 10:23:03 State change: No preempting claim, returning to owner
> 10/24/12 10:23:03 Changing state and activity: Preempting/Vacating -> Owner/Idle
> 10/24/12 10:23:03 State change: IS_OWNER is false
> 10/24/12 10:23:03 Changing state: Owner -> Unclaimed
> 10/24/12 10:23:03 Changing state: Unclaimed -> Delete
> 10/24/12 10:23:03 Resource no longer needed, deleting
> 
> 
> Does anybody have an idea why this is happening?
> Any suggestions on how to remedy this?
> 
> 
> Best regards from Austria,
> Hermann
> 
> --
> -------------
> DI Hermann Fuchs
> Christian Doppler Laboratory for Medical Radiation Research for
> Radiation Oncology
> Department of Radiation Oncology
> Medical University Vienna
> Währinger Gürtel 18-20
> A-1090 Wien
> 
> Tel.  + 43 / 1 / 40 400 7271
> Mail. hermann.fuchs@xxxxxxxxxxxxxxxx
> 