
Re: [Condor-users] Problem: Jobs are evicted after 20 minutes



Hello

The reason for `BIND_ALL_INTERFACES = true` is that some of our machines
connect to the central server using its public IP address (eth0),
while the bulk of the machines connect to it using its openvpn IP
address (tun0).

This is mainly because we cannot install openvpn on all
the machines...

So I do not want to change the CCB settings for NAT. The problem arises
with the machines using the tun0 network device. Somehow they try to use
the other network device, which most likely will not work.
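
One way to confirm this from an affected node is a plain TCP reachability
check against both server addresses. The following is only a minimal Python
sketch, not part of Condor; the helper name can_connect is made up for
illustration, and port 9618 is the shared port from the configuration
quoted below.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Hypothetical helper for illustration; it only tests plain TCP
    reachability and knows nothing about the Condor protocol itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False
```

On a node behind the NAT, can_connect("10.8.0.1", 9618) would be expected to
succeed, while the same call with the public IP of the central server would
be expected to fail if the observation above is right.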

I have no idea why Condor should behave that way. I always thought
Condor answers a machine over the same network interface that the machine
used to connect, so the clients should not even know the public IP of the
central server...
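
To see which server addresses a node actually tries to reach, the StartLog
can be scanned for connection attempts that point outside the VPN. This is a
rough Python sketch under some assumptions: the VPN network is taken to be
10.8.0.0/24 (from the 10.8.0.* range in my original mail), the pattern
matches only the "attempt to connect to <ip:port>" lines as they appear in
the excerpt quoted below, and the function name non_vpn_targets is made up.

```python
import ipaddress
import re

# Assumption: the openvpn network is 10.8.0.0/24 (written as 10.8.0.* in the thread).
VPN_NET = ipaddress.ip_network("10.8.0.0/24")

# Matches lines such as: attempt to connect to <123.123.123.123:9618> failed:
ADDR_RE = re.compile(r"attempt to connect to <(\d+\.\d+\.\d+\.\d+):(\d+)>")


def non_vpn_targets(log_text, vpn_net=VPN_NET):
    """Return the sorted (ip, port) pairs of connect attempts outside vpn_net."""
    targets = set()
    for match in ADDR_RE.finditer(log_text):
        ip = ipaddress.ip_address(match.group(1))
        if ip not in vpn_net:
            targets.add((str(ip), int(match.group(2))))
    return sorted(targets)
```

Feeding the contents of the StartLog of an affected node to
non_vpn_targets() should list every non-VPN address the node tried to
connect to.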

Best regards,
Hermann

On Wed, 2012-10-24 at 09:47 -0400, Tim St Clair wrote:
> Hermann - 
> 
> Guten Morgen! 
> 
> You will likely want to configure your CCB settings for NAT'd networks: 
> http://research.cs.wisc.edu/condor/manual/v7.9/3_7Networking_includes.html#SECTION00474000000000000000
>  
> If needed, you can also bind condor to a specific adapter using 'NETWORK_INTERFACE = tun0' or whatever your vpn adapter is. 
> 
> Why is `BIND_ALL_INTERFACES = true` if you want condor to use the NAT?  
> 
> Cheers,
> Tim 
> 
> ----- Original Message -----
> > From: "Hermann Fuchs" <hermann.fuchs@xxxxxxxxxxxxxxxx>
> > To: "condor-users" <condor-users@xxxxxxxxxxx>
> > Sent: Wednesday, October 24, 2012 3:55:58 AM
> > Subject: [Condor-users] Problem: Jobs are evicted after 20 minutes
> > 
> > Hello
> > 
> > We have some problems within our cluster. It seems a lot of jobs are
> > matched, start to calculate and are evicted after 20 minutes.
> > I would like your help in trying to find a solution.
> > 
> > We have a rather complicated cluster set up:
> > We do have one central server. It has two network interfaces:
> > -)eth0, a real network interface: public IP
> > -)tun0, an openvpn network interface (ip:10.8.0.1) network:10.8.0.*
> > 
> > Nodes may connect to the central server using the public IP, or (most
> > of the nodes) using the openvpn network IP. The central server is
> > configured in this way:
> > BIND_ALL_INTERFACES = true
> > MAX_FILE_DESCRIPTORS = 100000
> > #Restrict Condor to use only port 9618
> > SHARED_PORT_ARGS = -p 9618
> > DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
> > COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
> > USE_SHARED_PORT = TRUE
> > ##DISABLE UDP
> > UPDATE_COLLECTOR_WITH_TCP = True
> > WANT_UDP_COMMAND_SOCKET = False
> > COLLECTOR_SOCKET_CACHE_SIZE=10000
> > SCHEDD_QUERY_WORKERS   = 6
> > COLLECTOR_QUERY_WORKERS = 32
> > 
> > Most of the nodes are virtual machines which are behind a NAT and
> > should not use the public IP of the central server.
> > The nodes can contact the server using the openvpn IP, show up in
> > condor_status and are also matched and given a job.
> > 
> > Once the job has started, it seems that the node then tries to use
> > the public IP INSTEAD of the openvpn IP of the central server. This
> > fails, and after a timeout the job is evicted.
> > Below you will find a snippet of the StartLog (I have replaced the
> > public IP with a dummy value):
> > 
> > 10/24/12 10:10:31 Got activate_claim request from shadow (10.8.0.1)
> > 10/24/12 10:10:31 Remote job ID is 2395.14
> > 10/24/12 10:10:31 Got universe "VANILLA" (5) from request classad
> > 10/24/12 10:10:31 State change: claim-activation protocol successful
> > 10/24/12 10:10:31 Changing activity: Idle -> Busy
> > 10/24/12 10:11:45 attempt to connect to <123.123.123.123:9618> failed: Connection timed out (connect errno = 110).
> > 10/24/12 10:11:45 Failed to connect to schedd <123.123.123.123:9618?noUDP&sock=16127_44fe_3>
> > 10/24/12 10:12:53 attempt to connect to <123.123.123.123:9618> failed: Connection timed out (connect errno = 110).  Will keep trying for 397 total seconds (334 to go).
> > 
> > 10/24/12 10:18:28 attempt to connect to <123.123.123.123:9618> failed: Connection timed out (connect errno = 110).
> > 10/24/12 10:18:28 Failed to connect to schedd <123.123.123.123:9618?noUDP&sock=16127_44fe_3>
> > 10/24/12 10:19:37 attempt to connect to <123.123.123.123:9618> failed: Connection timed out (connect errno = 110).  Will keep trying for 397 total seconds (333 to go).
> > 
> > 10/24/12 10:23:03 Called deactivate_claim_forcibly()
> > 10/24/12 10:23:03 Changing state and activity: Claimed/Busy -> Preempting/Vacating
> > 10/24/12 10:23:03 Starter pid 1137 exited with status 0
> > 10/24/12 10:23:03 State change: starter exited
> > 10/24/12 10:23:03 State change: No preempting claim, returning to owner
> > 10/24/12 10:23:03 Changing state and activity: Preempting/Vacating -> Owner/Idle
> > 10/24/12 10:23:03 State change: IS_OWNER is false
> > 10/24/12 10:23:03 Changing state: Owner -> Unclaimed
> > 10/24/12 10:23:03 Changing state: Unclaimed -> Delete
> > 10/24/12 10:23:03 Resource no longer needed, deleting
> > 
> > 
> > Does somebody have an idea why this is happening?
> > Any suggestions on how to remedy this?
> > 
> > 
> > Best regards from Austria,
> > Hermann
> > 
> > --
> > -------------
> > DI Hermann Fuchs
> > Christian Doppler Laboratory for Medical Radiation Research for
> > Radiation Oncology
> > Department of Radiation Oncology
> > Medical University Vienna
> > Währinger Gürtel 18-20
> > A-1090 Wien
> > 
> > Tel.  + 43 / 1 / 40 400 7271
> > Mail. hermann.fuchs@xxxxxxxxxxxxxxxx
> > 
> > _______________________________________________
> > Condor-users mailing list
> > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> > with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> > 
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/condor-users/
> >

-- 
-------------
DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien

Tel.  + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx