
Re: [HTCondor-users] Going from Condor 7.7 to HTCondor 8.8



Hi Zach,

That did it!

I'm going to post this exchange on htcondor-users so that if anyone else ever has the problem I did, they might be able to find your solution.

Thanks!

On 5/24/19 3:36 PM, Zach Miller wrote:
Hi William,

You no longer need BIND_ALL_INTERFACES, as the default now is true.

But since you have multiple interfaces, it seems HTCondor is
selecting the wrong interface to use by default. I would try setting:
   NETWORK_INTERFACE = 10.44.7.84

to see if that solves the problem.


(Check the manual, I believe you can also specify the name of the interface (a la "eth0") if you would prefer)
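
For reference, here is a minimal sketch of the relevant condor_config lines, using the addresses from this thread. The interface name "eth0" is only a placeholder, and the wildcard form is from my reading of the manual, so verify both before relying on them:

   # BIND_ALL_INTERFACES = true   (can be dropped; true is now the default)

   # Pin HTCondor to the private interface by IP address:
   NETWORK_INTERFACE = 10.44.7.84

   # Alternatives to try instead (assumptions, check the manual):
   # NETWORK_INTERFACE = eth0      (placeholder interface name)
   # NETWORK_INTERFACE = 10.44.*   (wildcard form)

If this condor_config is shared by every node, a literal address only matches one host, so the first form would presumably belong in that host's local configuration file.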


Cheers,
-zach


On 5/24/19, 12:51 PM, "William Seligman" <seligman@xxxxxxxxxxxx> wrote:

Hi Zach,

Thanks for the offer of help. I've already tried reinstalling 8.8
from scratch and starting from a fresh condor_config file. It worked
if the condor master was flying solo. The problems start when I
include the slave nodes.

Before I deluge you with debug outputs, let me describe the network setup and what I think is the problem.

The condor master is a dual-homed host sitting on the demilitarized
zone of our network's firewall. The two interfaces are:

     olga.mydomain.org       = 12.34.56.84
     olga-local.mydomain.org = 10.44.7.84

The idea, which worked in 7.7 (and in other condor pools I've worked
with in the past), is that users can log in remotely to any system
in our demilitarized zone and submit jobs to batch nodes in our local
NAT'ed network.

For this "olga cluster", we use a shared filesystem with a shared
condor_config file, so I know both the condor master (olga) and the
various batch nodes see the same configuration. In condor_config:

     DAEMON_LIST         = MASTER,COLLECTOR,NEGOTIATOR,SCHEDD
     CONDOR_HOST         = olga-local.mydomain.org
     ALLOW_WRITE         = 12.34.56.0/24, 10.44.0.0/16
     BIND_ALL_INTERFACES = true

There's more, of course, and I'll send you the full dump if you
think we'll need it.

Let's consider one of the slave nodes:

     olga00.mydomain.org = 10.44.14.0
     DAEMON_LIST = MASTER, STARTD

When I submit a job on olga that runs on any of the batch nodes,
including olga00, it swaps from Idle to Running and back again. The
job log says:

     Error from slot1@xxxxxxxxxxxxxxxxxxx: Could not initiate file
     transfer

When I look at StarterLog.slot1 on olga00, it seems clear what's
happening:

     05/24/19 13:43:16 (pid:1313536) DaemonCore: command socket at
     <10.44.14.0:9618?addrs=10.44.14.0-9618&noUDP&sock=1313247_1034_32>
     05/24/19 13:43:16 (pid:1313536) DaemonCore: private command socket at
     <10.44.14.0:9618?addrs=10.44.14.0-9618&noUDP&sock=1313247_1034_32>
     05/24/19 13:43:17 (pid:1313536) Communicating with shadow
     <12.34.56.84:9618?addrs=12.34.56.84-9618&noUDP&sock=2002206_0144_30>
     05/24/19 13:43:17 (pid:1313536) Submitting machine is
     "olga-local.mydomain.org"

Somehow, though olga00 should only be communicating over the NAT'ed
local network, it's trying to communicate with the condor master's
shadow process via our public network. Since the local machines
can't see the public interface on the dual-homed host, we're stuck.

Bear in mind that the result of `hostname` on the condor master
is "olga".

I should add that I use a DNS trick to simplify my users' lives when
they log in to machines elsewhere on my site. On olga, if I use DNS:

     # host olga
     olga.mydomain.org has address 12.34.56.84

On a machine that's on our local network, the result is different:

     # host olga
     olga.mydomain.org has address 10.44.7.84
     olga.mydomain.org has address 12.34.56.84

This is so that users can either log in remotely or use a laptop
attached to our NAT'ed network and still use a command like "ssh olga"
without having to think about which network they're on. This has
nothing directly to do with the olgaNN slave nodes, since the users
can't log in to them; it's just a general operational principle for
our site.

Any ideas? Do you need to see full dumps?
