
Re: [HTCondor-users] Job does not start: Failed to reverse connect to daemon via CCB



So 172.17.0.7 (central manager) and 172.17.0.8 (submit node) share a private network. You have a forwarding host, 1.1.1.99, and an execute node, 1.1.1.51, which is on the forwarding host's network. The following instructions assume that the 172.17.0.* network has outbound connectivity to the 1.1.1.* network.

In this situation, what you want to do is set TCP_FORWARDING_HOST on the central manager to 1.1.1.99, and redirect port 9618 on the forwarding host to 172.17.0.7 (as you have done). That will cause the central manager (and all its daemons) to advertise themselves at 1.1.1.99, at whatever port number they're actually bound to. (I don't think TCP_FORWARDING_HOST allows for port-number remaps.)

You then also need to set TCP_FORWARDING_HOST on the submit node to 1.1.1.99; to make sure it advertises itself at port 9619, set SHARED_PORT_PORT on the submit node to 9619. Change the port redirection on the forwarding host to match.
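As a rough sketch of the two steps above, the relevant configuration might look like the following. The addresses and ports are the ones from this thread; the comments about the redirection targets assume the forwarding host maps each port to the same port number on the internal machine, since the daemons advertise the port they are actually bound to.

```
# Central manager (172.17.0.7)
# Advertise this machine's daemons at the forwarding host's address.
# The forwarding host is assumed to redirect 1.1.1.99:9618 to 172.17.0.7:9618.
TCP_FORWARDING_HOST = 1.1.1.99

# Submit node (172.17.0.8)
# Advertise at the same forwarding host, but bind the shared port to 9619
# so it does not collide with the central manager's 9618.
# The forwarding host is assumed to redirect 1.1.1.99:9619 to 172.17.0.8:9619.
TCP_FORWARDING_HOST = 1.1.1.99
SHARED_PORT_PORT = 9619
```

Note that HTCondor configuration comments need to sit on their own lines; a trailing `#` after a value becomes part of the value. Reconfigure or restart the daemons after making the change.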

Because you set up the port forwards (and told HTCondor about them), 1.1.1.51 can open connections to the central manager and submit node. The assumed outbound connectivity from 172.17.0.* to 1.1.1.* means that the central manager and submit node can in turn open connections back, and that's all you need for networking.

Therefore, remove CCB_ADDRESS, HOST_ALIAS, and PRIVATE_NETWORK_NAME from all configurations. If you have other execute nodes on the 172.17.0.* network, you may want to set PRIVATE_NETWORK_INTERFACE (and leave PRIVATE_NETWORK_NAME set), so that they don't try to contact the central manager or submit node via their TCP_FORWARDING_HOST.
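For the optional case with additional execute nodes on 172.17.0.*, a minimal sketch might look like this. The network name `cluster.internal` is a made-up placeholder (keep whatever name you already have), and this assumes each machine on the private network uses the same PRIVATE_NETWORK_NAME and its own 172.17.0.* address for PRIVATE_NETWORK_INTERFACE.

```
# Only needed if other execute nodes live on the 172.17.0.* network.
# CCB_ADDRESS and HOST_ALIAS are removed in every case.

# Central manager (172.17.0.7)
# "cluster.internal" is a hypothetical name; it must match on every
# machine that should talk over the private network.
PRIVATE_NETWORK_NAME = cluster.internal
PRIVATE_NETWORK_INTERFACE = 172.17.0.7

# Submit node (172.17.0.8)
PRIVATE_NETWORK_NAME = cluster.internal
PRIVATE_NETWORK_INTERFACE = 172.17.0.8
```

Execute nodes on the private network would carry the same PRIVATE_NETWORK_NAME; the execute node at 1.1.1.51 would not, so it keeps using the forwarded address.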

- ToddM