
Re: [HTCondor-users] execute hosts advertise loopback address



Many thanks for the quick reply on this. Unfortunately the Uni is still in lockdown at present, so it's difficult to actually go in and do any hands-on testing. Having said that, it *may* be possible to log in remotely to some machines and have a peek at what's going on. I know we do have some machines with IPv6 addresses (although they have IPv4 addresses as well), so that may be a cause.


The comment about the service startup order is interesting. If this isn't explicitly set, then I could imagine a race condition between HTCondor and the network service, which would explain why some machines get the correct interface address and some get the loopback. I'll get back to you when I have more information.


thanks again,


-ian.



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: 21 May 2020 15:43:12
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] execute hosts advertise loopback address
 

If you log in to one of these machines and run

condor_config_val IP_ADDRESS

is the result 127.0.0.1?

 

This would indicate that HTCondor is unable to determine which interface is external, OR that it has been explicitly configured to bind only to the loopback.
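As a minimal sketch of this first check, the Python below runs `condor_config_val IP_ADDRESS` and reports whether the value is a loopback address. It assumes `condor_config_val` is on the PATH and prints a single bare address; the helper names are hypothetical and not part of HTCondor.

```python
import ipaddress
import subprocess
from typing import Optional


def is_loopback(value: str) -> bool:
    """Return True if value parses as a loopback IPv4 or IPv6 address."""
    try:
        return ipaddress.ip_address(value.strip()).is_loopback
    except ValueError:
        # Empty string or a hostname: not treated as loopback by this sketch.
        return False


def configured_ip_address() -> Optional[str]:
    """Ask condor_config_val for IP_ADDRESS; None if the tool is missing or fails."""
    try:
        result = subprocess.run(
            ["condor_config_val", "IP_ADDRESS"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    addr = configured_ip_address()
    if addr is None:
        print("condor_config_val not available or IP_ADDRESS undefined")
    elif is_loopback(addr):
        print(f"IP_ADDRESS is loopback ({addr}); check NETWORK_INTERFACE and startup order")
    else:
        print(f"IP_ADDRESS = {addr}")
```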

 

Try

condor_config_val -dump NETWORK

Is NETWORK_INTERFACE set to something?
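If there are many hosts to check, the `-dump NETWORK` output could be scanned by a script. The sketch below assumes each parameter is printed as `NAME = value` with `#` comment lines (worth checking against your version's output), and treats an empty value or `*` as "no restriction"; the function names are illustrative only.

```python
from typing import Dict, Optional


def parse_dump(text: str) -> Dict[str, str]:
    """Parse assumed 'NAME = value' lines, skipping blanks and '#' comments."""
    params: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        params[name.strip().upper()] = value.strip()
    return params


def network_interface_restriction(params: Dict[str, str]) -> Optional[str]:
    """Return NETWORK_INTERFACE if it narrows the choice, else None.

    Assumption: an empty value or '*' means HTCondor may pick any interface.
    """
    value = params.get("NETWORK_INTERFACE", "").strip()
    if value in ("", "*"):
        return None
    return value
```

A returned value of `127.0.0.1` would point to the "explicitly configured" case rather than to interface detection.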

 

Do the public interfaces of these machines perhaps have IPv4 disabled, so that they are IPv6 only?

A newer HTCondor such as 8.8.9 has better support for IPv6, including the ability to prefer IPv6 or to prefer IPv4.
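To illustrate what "preferring" one address family means, here is a hypothetical helper that picks an address to advertise from a list of candidates, skipping loopback and link-local addresses. This is not HTCondor's own selection logic, only a sketch of the idea; returning None stands in for the case in this thread where nothing but the loopback is available.

```python
import ipaddress
from typing import Iterable, Optional


def choose_advertised_address(
    candidates: Iterable[str], prefer_ipv4: bool = True
) -> Optional[str]:
    """Pick a routable address, preferring IPv4 or IPv6 as requested."""
    usable = []
    for text in candidates:
        try:
            addr = ipaddress.ip_address(text)
        except ValueError:
            continue  # skip anything that is not a bare IP address
        if addr.is_loopback or addr.is_link_local:
            continue  # neither is reachable from the central manager
        usable.append(addr)
    if not usable:
        return None
    wanted = 4 if prefer_ipv4 else 6
    for addr in usable:
        if addr.version == wanted:
            return str(addr)
    # Preferred family absent: fall back to the first usable address.
    return str(usable[0])
```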

 

If you restart condor on the machine, does it continue to advertise the loopback? If not, that is, if a restart fixes it, the problem may be that the network is not yet initialized, so only the loopback can be found at the time that condor starts up.
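The startup-order explanation can be tested independently of HTCondor. The sketch below polls until the machine has a route with a non-loopback source IPv4 address, which a script could run before the service starts after a wake from hibernation. The probe technique (a UDP `connect()`, which selects a route but sends no packets) and all names here are assumptions for illustration.

```python
import ipaddress
import socket
import time
from typing import Callable, Optional


def outbound_ipv4_address() -> Optional[str]:
    """Return the local IPv4 source address for an outbound route, or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # 192.0.2.1 is a TEST-NET-1 documentation address, used only to ask
        # the routing table which source address it would pick.
        sock.connect(("192.0.2.1", 9))
        return sock.getsockname()[0]
    except OSError:
        return None  # no usable route yet, e.g. network still coming up
    finally:
        sock.close()


def wait_for_external_address(
    probe: Callable[[], Optional[str]] = outbound_ipv4_address,
    timeout: float = 120.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll probe() until it yields a usable address or the timeout expires."""
    deadline = clock() + timeout
    while True:
        addr = probe()
        if addr is not None:
            parsed = ipaddress.ip_address(addr)
            if not (parsed.is_loopback or parsed.is_unspecified):
                return addr
        if clock() >= deadline:
            return None
        sleep(interval)
```

A None result after the timeout would suggest the network was still not up, in which case starting HTCondor at that moment would likely see only the loopback.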

 

You might also want to check in the Services control panel that HTCondor is not started until after the network service. This should be set up automatically by the MSI installer package, but it's worth checking.

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Smith, Ian
Sent: Thursday, May 21, 2020 8:49 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] execute hosts advertise loopback address

 

 

Hello All,

 

I have come across a very strange problem with our HTCondor pool whereby *some* execute hosts advertise the loopback address as the address of the startd, as evidenced by this from the CollectorLog:

 

05/21/20 14:09:35 StartdAd     : Inserting ** "< slot1@xxxxxxxxxxxxxxxxxxxxxxxx , 127.0.0.1 >"
05/21/20 14:09:35 StartdPvtAd  : Inserting ** "< slot1@xxxxxxxxxxxxxxxxxxxxxxxx , 127.0.0.1 >"
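To see how many hosts are affected, the CollectorLog could be scanned for public startd ads inserted with a loopback address. The regular expression below is modelled only on the two log lines above; the log format may differ between HTCondor versions, and the function name is hypothetical.

```python
import ipaddress
import re
from typing import Iterable, Set

# Modelled on the excerpt above, e.g.
#   05/21/20 14:09:35 StartdAd     : Inserting ** "< slot1@host , 127.0.0.1 >"
# Only public "StartdAd" lines match; "StartdPvtAd" lines are ignored.
INSERT_RE = re.compile(
    r'StartdAd\s*:\s*Inserting \*\* "<\s*(?P<name>[^,\s]+)\s*,\s*(?P<addr>[^\s>]+)\s*>"'
)


def loopback_slots(lines: Iterable[str]) -> Set[str]:
    """Return slot names whose public startd ad was inserted with a loopback address."""
    found: Set[str] = set()
    for line in lines:
        match = INSERT_RE.search(line)
        if not match:
            continue
        try:
            if ipaddress.ip_address(match.group("addr")).is_loopback:
                found.add(match.group("name"))
        except ValueError:
            continue  # address field is not a bare IP address
    return found
```

For example, `sorted(loopback_slots(open("CollectorLog")))` would list the affected slots, which could then be compared by subnet.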

 

Some execute hosts work fine and advertise their correct address, whereas a substantial number advertise the loopback, and I believe there are even examples of both on the same subnet. The execute hosts all run Windows 10 and HTCondor version 8.4.6, and employ power saving so that idle machines (viz. no local user use or HTCondor use) go into hibernation after approximately 10 minutes.

 

A typical scenario is that I wake a machine to run a job, and the machine advertises its loopback address to the collector. The negotiator either finds a match or ignores the loopback (I'm not quite sure which), but in any case the job never starts on the execute host, and so the host returns to hibernation.

 

I turned up this submission to htcondor-users in the archives, but it seems pretty old (Windows XP) and doesn't seem to come up with a satisfactory solution:

 

 

Any suggestions would be extremely useful as I'm totally baffled by this.

 

regards,

 

-ian.

 

Dr Ian C. Smith,

Condor Manager,

Advanced Research Computing,

University of Liverpool

UK.