Many thanks for the quick reply on this. Unfortunately the Uni is still in lockdown at present, so it's difficult to actually go in and do any hands-on testing. Having said that, it *may* be possible to log in remotely to some machines and have a peek at what's going on. I know we do have some machines with IPv6 addresses (they have IPv4 as well), so that may be a cause.
The comment about the service startup order is interesting. If this isn't explicitly set, then I could imagine a race condition between HTCondor and the network service, which would explain why some machines get the correct interface address and some get the loopback. I'll get back to you when I have more information.
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: 21 May 2020 15:43:12
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] execute hosts advertise loopback address
If you log in to one of these machines and run
is the result 127.0.0.1?
This would indicate that HTCondor is unable to determine which interface is external, OR that it has
been explicitly configured to bind only to the loopback.
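As a quick way to see which address the OS itself would treat as "external", here is a small Python check, assuming Python 3 is available on the execute host. It is only a diagnostic and not how HTCondor picks its interface (HTCondor has its own selection logic). Connecting a UDP socket sends no traffic; it just asks the OS to choose a route and a source address. The probe address 192.0.2.1 (a documentation-only range) and the helper name outbound_ipv4 are arbitrary choices for illustration.

```python
import ipaddress
import socket


def outbound_ipv4(probe_host: str = "192.0.2.1", probe_port: int = 9) -> str:
    """Return the local IPv4 address the OS would use to reach probe_host.

    connect() on a UDP socket sends no packets; it only makes the OS pick
    a route and source address. If there is no usable route (for example
    the network is not up yet), fall back to the loopback address.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    addr = outbound_ipv4()
    kind = "loopback" if ipaddress.ip_address(addr).is_loopback else "external"
    print(f"{addr} ({kind})")
```

If this reports an external address while the startd is still advertising 127.0.0.1, that would point more towards HTCondor's configuration or startup timing than towards the network setup itself.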
Also, if you run
condor_config_val -dump NETWORK
is NETWORK_INTERFACE set to something?
Do the public interfaces of these machines perhaps have IPv4 disabled, so they are IPv6 only?
A newer HTCondor such as 8.8.9 has better support for IPv6, including the ability to prefer it
or to prefer IPv4.
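To check whether a machine's name resolves to IPv4, IPv6 or both, here is a rough Python sketch that groups getaddrinfo() results by address family. It assumes Python 3.7+, and addresses_by_family is a hypothetical helper name. It reflects name resolution only, not the interface configuration itself, so looking at ipconfig on the machine is still worthwhile.

```python
from __future__ import annotations

import socket


def addresses_by_family(host: str | None = None) -> dict[str, list[str]]:
    """Group the addresses a host name resolves to into IPv4 and IPv6 lists."""
    host = host or socket.gethostname()
    result: dict[str, list[str]] = {"IPv4": [], "IPv6": []}
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return result  # the name does not resolve at all
    for family, _type, _proto, _canon, sockaddr in infos:
        if family == socket.AF_INET:
            label = "IPv4"
        elif family == socket.AF_INET6:
            label = "IPv6"
        else:
            continue
        addr = sockaddr[0]
        # getaddrinfo returns one entry per socket type, so de-duplicate.
        if addr not in result[label]:
            result[label].append(addr)
    return result


if __name__ == "__main__":
    for label, addrs in addresses_by_family().items():
        print(label, addrs or "(none)")
```

An empty IPv4 list for the machine's own name would fit the "IPv6 only" explanation.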
If you restart condor on the machine, does it continue to advertise the loopback? If the restart fixes it, the problem may
be that the network is not initialized, so only the loopback can be found at the time that condor starts up.
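If the startup ordering turns out to be the culprit and can't easily be changed, one hypothetical workaround is a pre-start check that waits until the machine has a non-loopback address before HTCondor is launched. The sketch below assumes Python 3.7+ and polls a caller-supplied probe function (for example, one that reports the source address the OS would use for outbound traffic) until it returns a non-loopback address or a timeout expires. The name wait_for_external_address and the 120 s / 5 s defaults are illustrative, not HTCondor settings.

```python
from __future__ import annotations

import ipaddress
import time
from typing import Callable


def wait_for_external_address(
    probe: Callable[[], str],
    timeout: float = 120.0,
    interval: float = 5.0,
) -> str | None:
    """Poll probe() until it returns a non-loopback address.

    Returns that address, or None if only loopback addresses were seen
    before the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        addr = probe()
        if not ipaddress.ip_address(addr).is_loopback:
            return addr
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

Returning None on timeout leaves it to the caller to decide whether to start HTCondor anyway or just log the failure. Setting the service dependency properly is still the cleaner fix.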
You might also want to check in the Services control panel to make sure that HTCondor is not started until
after the network service. This should be set up automatically by the MSI installer package, but it's worth checking.
I have come across a very strange problem with our HTCondor pool whereby *some* execute hosts advertise
the loopback address as the address of the startd, as evidenced by this from the CollectorLog:
Some execute hosts work fine and advertise their correct address, whereas a substantial number advertise the
loopback, and I believe there are even examples of both on the same subnet. The execute hosts all run Windows 10
and HTCondor version 8.4.6, and they use power saving so that idle machines (i.e. with no local user activity and no HTCondor
jobs) go into hibernation after approximately 10 minutes.
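To list the affected machines from the collector rather than reading CollectorLog by hand, something along these lines may help. It is a sketch assuming Python 3.7+ on a machine that can run condor_status, and that each startd ad's MyAddress attribute holds a sinful string of the form <ip:port?...> (with IPv6 addresses in square brackets). The helper names and the regular expression are illustrative, not part of HTCondor.

```python
from __future__ import annotations

import ipaddress
import re
import subprocess
from typing import Iterable, Iterator

# Matches the host part of a sinful string such as
# <10.0.0.5:9618?addrs=...> or <[::1]:9618?addrs=...>.
SINFUL = re.compile(r"^<(?:\[([0-9A-Fa-f:.]+)\]|([^:>?]+)):(\d+)")


def advertised_host(sinful: str) -> str | None:
    """Return the host/IP part of a sinful string, or None if it doesn't parse."""
    m = SINFUL.match(sinful.strip())
    if m is None:
        return None
    return m.group(1) or m.group(2)


def loopback_startds(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (name, host) for each 'Name MyAddress' line advertising a loopback IP."""
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        name, address = parts
        host = advertised_host(address)
        if host is None:
            continue
        try:
            if ipaddress.ip_address(host).is_loopback:
                yield name, host
        except ValueError:
            continue  # a hostname rather than a literal IP address


def report_loopback_startds() -> None:
    """Query the collector and print every slot whose MyAddress is a loopback IP."""
    out = subprocess.run(
        ["condor_status", "-af", "Name", "MyAddress"],
        capture_output=True, text=True, check=True,
    ).stdout
    for name, host in loopback_startds(out.splitlines()):
        print(f"{name} advertises {host}")
```

Calling report_loopback_startds() on a host that can query the collector would then print one line per slot advertising 127.0.0.1 (or ::1).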
A typical scenario is that I wake a machine to run a job and the machine advertises its loopback address to the
collector. The negotiator either finds a match or ignores the loopback (not quite sure which), but in any case the job never
starts on the execute host, and so the host returns to hibernation.
I turned up this submission to htcondor-users in the archives, but it is pretty old (Windows XP) and doesn't appear to
come up with a satisfactory solution:
Any suggestions would be extremely useful as I'm totally baffled by this.
Dr Ian C. Smith,
Advanced Research Computing,
University of Liverpool