Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Master server sends wrond SCHEDD IP

Date: Mon, 19 Nov 2012 16:19:36 -0600
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Master server sends wrond SCHEDD IP

Hi Hermann,

I reproduced the problem and pushed a fix for 7.8.7:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3330,0

I heard through the grapevine that you managed to work around the problem by disabling shared port. This is unexpected and does not match the results in my test. I would expect that you could work around the problem by setting PRIVATE_NETWORK_INTERFACE and PRIVATE_NETWORK_NAME appropriately. I would not expect shared port to make a difference either way. Let me know if you can shed any light on that. I'd be interested in knowing precisely what changed in the configuration between the time it did not work and the time when it did.

To work around the problem, you would need to set PRIVATE_NETWORK_INTERFACE to the IP address of the vpn network, and you would need to set PRIVATE_NETWORK_NAME to the same value on the submit node and the vpn nodes. It doesn't matter what value you set the name to as long as it is not empty and it is the same on all the nodes that have access to the vpn.
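
A minimal sketch of this workaround, assuming the addresses from Hermann's example quoted below (10.8.0.1 as the central server's openvpn IP). The network name "openvpn.example" is a made-up placeholder; any non-empty value should do as long as it is identical on every node. Placing PRIVATE_NETWORK_INTERFACE on the submit node only is one reading of the description above, not a tested configuration.

```
# Submit node (central server). Values are illustrative; adjust to your site.
PRIVATE_NETWORK_NAME = openvpn.example
PRIVATE_NETWORK_INTERFACE = 10.8.0.1

# Every node that reaches the server over the vpn.
# The name must match the one set on the submit node.
PRIVATE_NETWORK_NAME = openvpn.example
```

After changing these settings, the daemons need to pick up the new configuration (for example with condor_reconfig, or a restart if the change does not take effect).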


Cheers,
--Dan

On 11/13/12 8:55 PM, Dan Bradley wrote:

Hi Hermann,

This case is supposed to be handled by ENABLE_ADDRESS_REWRITING, which defaults to True. The schedd IP address that is sent to the startd is supposed to be set to the interface that the schedd used to communicate with the startd.

I haven't yet had a chance to try to reproduce this problem. It is possible it is tied to shared_port. I'll let you know what I find.


--Dan

On 11/13/12 1:49 AM, Hermann Fuchs wrote:

Hello from Vienna!

We do have a quite severe problem in our cluster, leading to job
evictions after a connection timeout. It seems the wrong IP address of
the SCHEDD is transmitted.

In our small grid we use two networks. One is the standard LAN network
and the second one is an openvpn network.
We have one central master server (collector, schedd, negotiator,
startd, shared_port daemon) which listens on two network interfaces: eth0
(for the standard LAN) and tun0 (for the openvpn).
We have calculation nodes on both networks. It seems that the master
server sends out all IP addresses (depending on the network) correctly
but NOT the SCHEDD IP. The SCHEDD IP is always the LAN IP.

Therefore all nodes on the openvpn network get the LAN IP of the SCHEDD.
This IP is not reachable by them due to firewall restrictions. So the
nodes cannot connect to the SCHEDD and abort the running jobs after a
timeout. In practice this means jobs are matched to nodes on the openvpn
network, start running and are evicted after 20 minutes (the timeout if
the SCHEDD cannot be reached).
Unfortunately we cannot run a second master server as all machines may
be shut down by the user, therefore every node has to use the same
server.

Example:
MasterServer
    LAN IP:123.123.123.123
    openvpn IP:10.8.0.1
Node on standard LAN:
(information retrieved from StartLog)
    Got activate_claim request from shadow (123.123.123.123)
     collector <123.123.123.123:9618?sock=collector>
     Schedd addr = <123.123.123.123:9618?noUDP&sock=6778_8abd_3>
    ClaimId(<123.123.123.123:9618>#1343371837#2...

Node on openvpn network:
    Got activate_claim request from shadow (10.8.0.1)
    collector <10.8.0.1:9618?sock=collector>
    Schedd addr = <123.123.123.123:9618?noUDP&sock=6778_8abd_4>


Our configuration is as follows:
The central server is
configured in this way:
BIND_ALL_INTERFACES = true
MAX_FILE_DESCRIPTORS = 100000
#Restrict Condor to use only port 9618
SHARED_PORT_ARGS = -p 9618
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
USE_SHARED_PORT = TRUE
##DISABLE UDP
UPDATE_COLLECTOR_WITH_TCP = True
WANT_UDP_COMMAND_SOCKET = False
COLLECTOR_SOCKET_CACHE_SIZE=10000
SCHEDD_QUERY_WORKERS   = 6
COLLECTOR_QUERY_WORKERS = 32


The openvpn clients are configured:
CONDOR_HOST = 10.8.0.1
BIND_ALL_INTERFACES = False
NETWORK_INTERFACE = 10.8.*
#Restrict Condor to use only port 9618
SHARED_PORT_ARGS = -p 9618
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
USE_SHARED_PORT = TRUE
#DISABLE UDP
UPDATE_COLLECTOR_WITH_TCP = True
WANT_UDP_COMMAND_SOCKET = False
COLLECTOR_SOCKET_CACHE_SIZE=10000

Are there any suggestions how we could resolve this situation?
This is quite a show stopper, as we got quite a few nodes in the openvpn
network which at the moment are unusable.
Is there a way to override the SCHEDD IP address at the client level?

Best regards from Austria,
Hermann


