
Re: [HTCondor-users] Master server sends wrong SCHEDD IP



Hi Hermann,

I reproduced the problem and pushed a fix for 7.8.7:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3330,0

I heard through the grapevine that you managed to work around the problem by disabling shared port. This is unexpected and does not match the results of my test. I would expect that you could work around the problem by setting PRIVATE_NETWORK_INTERFACE and PRIVATE_NETWORK_NAME appropriately, and I would not expect shared port to make a difference either way. Let me know if you can shed any light on that. I'd be interested in knowing precisely what changed in the configuration between the time it did not work and the time it did.

To work around the problem, you would need to set PRIVATE_NETWORK_INTERFACE to the IP address of the vpn network, and you would need to set PRIVATE_NETWORK_NAME to the same value on the submit node and the vpn nodes. It doesn't matter what value you set the name to as long as it is not empty and it is the same on all the nodes that have access to the vpn.
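
As a rough sketch of that workaround, assuming the addresses from the example quoted below (10.8.0.1 for the server's openvpn interface), the settings might look like the following. The name "openvpn_net" is only a placeholder chosen for illustration; any non-empty string should do as long as it is identical on every node that can reach the vpn.

```
# Central server (submit node): a sketch, not a tested configuration.
# "openvpn_net" is a hypothetical name; it only has to match on all vpn nodes.
PRIVATE_NETWORK_NAME = openvpn_net
# Address of the server's openvpn interface (tun0 in the example below).
PRIVATE_NETWORK_INTERFACE = 10.8.0.1

# openvpn clients: use the same name as on the central server.
PRIVATE_NETWORK_NAME = openvpn_net
```

Nodes on the standard LAN would leave PRIVATE_NETWORK_NAME unset, so they should continue to be given the LAN address of the schedd.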

Cheers,
--Dan

On 11/13/12 8:55 PM, Dan Bradley wrote:
Hi Hermann,

This case is supposed to be handled by ENABLE_ADDRESS_REWRITING, which defaults to True. The schedd IP address that is sent to the startd is supposed to be set to the interface that the schedd used to communicate with the startd.

I haven't yet had a chance to try to reproduce this problem. It is possible it is tied to shared_port. I'll let you know what I find.

--Dan

On 11/13/12 1:49 AM, Hermann Fuchs wrote:
Hello from Vienna!

We do have a quite severe problem in our cluster, leading to job
evictions after a connection timeout. It seems the wrong IP address of
the SCHEDD is transmitted.

In our small grid we use two networks. One is the standard LAN network
and the second one is an openvpn network.
We have one central master server (collector, schedd, negotiator,
startd, shared_port daemon) which listens on two network interfaces: eth0
(for the standard LAN) and tun0 (for the openvpn).
We have calculation nodes on both networks. It seems that the master
server sends out all IP addresses (depending on the network) correctly
but NOT the SCHEDD IP. The SCHEDD IP is always the LAN IP.

Therefore all nodes on the openvpn network get the LAN IP of the SCHEDD.
This IP is not reachable by them due to firewall restrictions. So the
nodes can not connect to the SCHEDD and abort the running jobs after a
timeout. In practice this means jobs are matched to nodes on the openvpn
network, start running and are evicted after 20 minutes (the timeout if
the SCHEDD can not be reached).
Unfortunately we can not run a second master server as all machines may
be shut down by the user, therefore every node has to use the same
server.

Example:
MasterServer
    LAN IP:123.123.123.123
    openvpn IP:10.8.0.1
Node on standard LAN:
(information retrieved from StartLog)
    Got activate_claim request from shadow (123.123.123.123)
     collector <123.123.123.123:9618?sock=collector>
     Schedd addr = <123.123.123.123:9618?noUDP&sock=6778_8abd_3>
    ClaimId(<123.123.123.123:9618>#1343371837#2...

Node on openvpn network:
    Got activate_claim request from shadow (10.8.0.1)
    collector <10.8.0.1:9618?sock=collector>
    Schedd addr = <123.123.123.123:9618?noUDP&sock=6778_8abd_4>


Our configuration is as follows:
The central server is
configured in this way:
BIND_ALL_INTERFACES = true
MAX_FILE_DESCRIPTORS = 100000
#Restrict Condor to use only port 9618
SHARED_PORT_ARGS = -p 9618
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
USE_SHARED_PORT = TRUE
##DISABLE UDP
UPDATE_COLLECTOR_WITH_TCP = True
WANT_UDP_COMMAND_SOCKET = False
COLLECTOR_SOCKET_CACHE_SIZE=10000
SCHEDD_QUERY_WORKERS   = 6
COLLECTOR_QUERY_WORKERS = 32


The openvpn clients are configured:
CONDOR_HOST = 10.8.0.1
BIND_ALL_INTERFACES = False
NETWORK_INTERFACE = 10.8.*
#Restrict Condor to use only port 9618
SHARED_PORT_ARGS = -p 9618
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
USE_SHARED_PORT = TRUE
#DISABLE UDP
UPDATE_COLLECTOR_WITH_TCP = True
WANT_UDP_COMMAND_SOCKET = False
COLLECTOR_SOCKET_CACHE_SIZE=10000

Are there any suggestions how we could resolve this situation?
This is quite a show stopper, as we got quite a few nodes in the openvpn
network which at the moment are unusable.
Is there a way to override the SCHEDD IP address at the client level?

Best regards from Austria,
Hermann

