Re: [HTCondor-users] daemons not using IPv4 on unusable IPv6 network



Hi Todd,

let me just start with a post mortem realisation: HTCondor handled this incident very, very, very well!
Daemons reacted gracefully and kept on using existing connections for most operations. Nothing substantial broke, and we could probably have taken some more hours to react without any notable service degradation.

> 	So it looks like the shared port daemon picked up the IPv6 address and started advertising it.
Just for reference:
- The IPv6 address got auto configured by the host/switch several days earlier. Our network people had reconfigured the switch to provide IPv6 for other machines in the same rack.
- The shared port daemon started using the IPv6 address after a condor_reconfig, which was issued automatically because of fair-share changes in the configs.

> 	In addition, this particular warning (IPV6 address literal not matching itself) is spurious; we just fixed it for the 8.6.5 release:
Thumbs up! *puts on my guess-we-should-update-more-often-hat*

> 	condor_pool@ is usually associated with using PASSWORD authentication.  I'll guess here that the problem is that the link-local address doesn't resolve to 'gridka.de', and your ALLOW list contains 'condor_pool@xxxxxxxxx'.  If your ALLOW list contains 'condor_pool@*', let me know, because we just found a problem with '*' being protocol-specific:
Our ALLOW for this is indeed 'condor_pool@xxxxxxxxx', and the IPv6 was indeed not resolved to gridka.de.
Could you give me a pointer on how HTCondor handles these identities? My understanding is:
- In this case, 'gridka.de' is the UID_DOMAIN, not the actual fqdn domain.
- Unless we allow it to (TRUST_UID_DOMAIN), the UID_DOMAIN can only be used from hosts which also share the domain in their fqdn.
- Since the fqdn cannot be resolved/does not resolve to '*.gridka.de', the authorisation fails because the UID_DOMAIN cannot be verified.
So even though we do not explicitly require the host name to match gridka.de, it is implicitly required to match the domain name.

> 	I'll have to look into this.  Could you check your logs and verify the sinful they were trying to contact was
Sorry, we did not have the appropriate log level turned on. I only know that this was the sinful advertised by the SharedPort of the central node, so I assume it is what the daemons got.
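
For reference, the sinful string quoted further down in this thread can be taken apart with a few lines of Python. This is only a sketch that assumes the layout visible in that single example (an "addrs" parameter with entries joined by "+", each written as ip-port, and IPv6 colons replaced by dashes inside square brackets). It is not derived from HTCondor's own parser, and the function name parse_sinful_addrs is made up for illustration.

```python
import ipaddress


def parse_sinful_addrs(sinful):
    """Split a sinful string into its primary address and its 'addrs' list.

    Assumption: the format matches the one example in this thread, i.e.
    <host:port?addrs=ip-port+[v6-with-dashes]-port&otherflags>.
    """
    body = sinful.strip()
    if body.startswith("<") and body.endswith(">"):
        body = body[1:-1]
    primary, _, query = body.partition("?")

    addrs = []
    for param in query.split("&"):
        key, _, value = param.partition("=")
        if key != "addrs":
            continue
        for entry in value.split("+"):
            # The port follows the last dash; everything before it is the host.
            host, _, port = entry.rpartition("-")
            if host.startswith("[") and host.endswith("]"):
                # IPv6 literal: restore the colons that were written as dashes.
                host = host[1:-1].replace("-", ":")
            addrs.append((ipaddress.ip_address(host), int(port)))
    return primary, addrs


example = ("<10.97.13.108:9618?addrs=10.97.13.108-9618+"
           "[2a00-1398-10a-610d-3a63-bbff-fe3f-7a08]-9618&noUDP>")
primary, addrs = parse_sinful_addrs(example)
for ip, port in addrs:
    print(ip.version, ip, port)
```

Running this against daemon ads would show at a glance which protocols a daemon is advertising.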

> 	After reading the code, it looks like HTCondor prefers public addresses over private addresses, and PREFER_IPV4 only changes the relative order of the protocol within those categories.[...]
Does the regular NETWORK_INTERFACE play a role in this too? On the Schedds, it is set to the private IPv4 address.
I will push to get this changed to using PRIVATE_NETWORK_NAME and PRIVATE_NETWORK_INTERFACE for our internal resources.
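
The ordering Todd describes (public before private, with PREFER_IPV4 only deciding the order within each group) can be sketched as a two-part sort key. This is an illustration of the description, not HTCondor's code: order_addresses is a hypothetical helper, and Python's is_private is used as a stand-in for whatever HTCondor itself considers a private address.

```python
import ipaddress


def order_addresses(addrs, prefer_ipv4=True):
    """Sort public addresses ahead of private ones.

    Within each of the two groups, the preferred protocol comes first.
    Assumption: ipaddress.is_private approximates HTCondor's notion of
    a private address.
    """
    preferred = 4 if prefer_ipv4 else 6

    def key(text):
        ip = ipaddress.ip_address(text)
        # False sorts before True: public first, then preferred protocol first.
        return (ip.is_private, ip.version != preferred)

    return sorted(addrs, key=key)


# The two addresses advertised by the central node in this incident.
candidates = ["10.97.13.108", "2a00:1398:10a:610d:3a63:bbff:fe3f:7a08"]
print(order_addresses(candidates, prefer_ipv4=True))
```

With a private IPv4 address and a public IPv6 address, the IPv6 address sorts first even when IPv4 is preferred, which would be consistent with what we saw.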

> 	Should the default for ENABLE_IPV6 be FALSE if only link-local IPv6 addresses are found?
I'm tempted to say 'yes'. On all machines with link-local only, condor_config_val is showing IPV6_ADDRESS = ::1 - so it seems to ignore the link-local interfaces in some cases already.
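
A check along these lines is easy to express. The following is a minimal sketch of the proposed heuristic, assuming the machine's addresses have already been collected as strings. The helper name has_non_link_local_ipv6 is hypothetical, and nothing here reflects how HTCondor actually evaluates ENABLE_IPV6.

```python
import ipaddress


def has_non_link_local_ipv6(addresses):
    """Return True if any IPv6 address is neither loopback nor link-local.

    Under the proposed default, IPv6 would only be enabled automatically
    when this returns True.
    """
    for text in addresses:
        # Drop a zone suffix such as "%eth0" before parsing.
        ip = ipaddress.ip_address(text.split("%")[0])
        if ip.version == 6 and not (ip.is_link_local or ip.is_loopback):
            return True
    return False


link_local_only = ["10.97.13.108", "::1", "fe80::3a63:bbff:fe3f:59b4"]
dual_stack = link_local_only + ["2a00:1398:10a:610d:3a63:bbff:fe3f:7a08"]
print(has_non_link_local_ipv6(link_local_only))
print(has_non_link_local_ipv6(dual_stack))
```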

> 	Just to be clear, HTCondor doesn't know (or have any way to know) if a particular protocol is "working", only if an address of that protocol is present on any of the machine's interfaces.  If you want HTCondor to never try any IPv6 address, set ENABLE_IPV6 = FALSE.
That's the part that is confusing me about the incident: PREFER_IPV4 and all of the related knobs are at their default values, i.e. True. The IPv4 addresses are present on all machines, and have been used by HTCondor for months.
I've set ENABLE_IPV6=FALSE on the central node, which fixed the issue for now. Being able to safely set 'auto' on all machines would greatly simplify adopting dual stack, though.
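
To make the "default values" point concrete: as described in the quoted message below, PREFER_IPV4 only supplies the default for four other knobs. A rough sketch of that fallback follows, assuming the configuration is available as a plain dictionary. The effective_knobs helper is hypothetical, and the default of True for PREFER_IPV4 is taken from what we observe here, not from the manual.

```python
# The four knobs that fall back to PREFER_IPV4, as listed in Todd's message.
DEPENDENT_KNOBS = (
    "ADVERTISE_IPV4_FIRST",
    "IGNORE_TARGET_PROTOCOL_PREFERENCE",
    "PREFER_OUTBOUND_IPV4",
    "IGNORE_DNS_PROTOCOL_PREFERENCE",
)


def effective_knobs(config):
    """Resolve each dependent knob, falling back to PREFER_IPV4 if unset."""
    base = config.get("PREFER_IPV4", True)  # assumed default: True
    return {name: config.get(name, base) for name in DEPENDENT_KNOBS}


# With nothing set explicitly, every knob resolves to True.
print(effective_knobs({}))
```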

At any rate, I have to say that we feel *safe enough* to tackle IPv6 for condor soon. Most of the issues are *definitely* our fault, and something tells me this might be the case for the others as well. Either way, the condor cluster remained stable enough for us to run some extensive tests in the future.

Cheers,
Max

> On 11.07.2017 at 22:14, Todd L Miller <tlmiller@xxxxxxxxxxx> wrote:
> 
>> we had an unexpected dual stack test run yesterday when our central node [collector+negotiator] started using IPv6 due to network misconfiguration.
> 
> 	So it looks like the shared port daemon picked up the IPv6 address and started advertising it.
> 
>> - The Negotiator started talking to local daemons on IPv6 even though forward resolution failed [2].
> 
> 	This warning is, generally, only informative.  However,
> 
>> 07/10/17 06:40:01 (pid:3203476) (D_ALWAYS) WARNING: forward resolution of
>> fe80::3a63:bbff:fe3f:59b4 doesn't match fe80::3a63:bbff:fe3f:59b4!
> 
> 	In addition, this particular warning (IPV6 address literal not matching itself) is spurious; we just fixed it for the 8.6.5 release:
> 
> https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6338.
> 
>> 07/10/17 06:40:01 (pid:3203476) (D_ALWAYS) PERMISSION DENIED to
>> condor_pool@xxxxxxxxx from host fe80::3a63:bbff:fe3f:59b4 for command 421
>> (Reschedule), access level DAEMON: reason: DAEMON authorization policy
>> contains no matching ALLOW entry for this request; identifiers used for this
>> host: fe80::3a63:bbff:fe3f:59b4, hostname size = 0, original ip address =
>> fe80::3a63:bbff:fe3f:59b4
> 
> 	condor_pool@ is usually associated with using PASSWORD authentication.  I'll guess here that the problem is that the link-local address doesn't resolve to 'gridka.de', and your ALLOW list contains 'condor_pool@xxxxxxxxx'.  If your ALLOW list contains 'condor_pool@*', let me know, because we just found a problem with '*' being protocol-specific:
> 
> https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6340
> 
>> The main question for us is why did these components try using IPv6 anyway?
> 
> 	I'll have to look into this.  Could you check your logs and verify the sinful they were trying to contact was
> 
> <10.97.13.108:9618?addrs=10.97.13.108-9618+[2a00-1398-10a-610d-3a63-bbff-fe3f-7a08]-9618&noUDP>
> 
> ?  Adding D_HOSTNAME to SCHEDD_DEBUG (or STARTD_DEBUG, etc, as appropriate) should give you a lot of information about what HTCondor was trying to do.
> 
> 	After reading the code, it looks like HTCondor prefers public addresses over private addresses, and PREFER_IPV4 only changes the relative order of the protocol within those categories.  There's another layer of preferences on top, where if the private IPv4 address is declared as a private network, and the source's private network name is the same as the target's, HTCondor will use the private network.  See PRIVATE_NETWORK_NAME and PRIVATE_NETWORK_INTERFACE in the manual.  I don't know off the top of my head if these will work without CCB; it may also be easier simply to disable IPv6 on machines where it's known not to work (usually meaning those that have only a link-local IPv6 address).
> 
> 	Should the default for ENABLE_IPV6 be FALSE if only link-local IPv6 addresses are found?
> 
>> which seem to imply that IPv6 is not used at all unless IPv4 is not working.
> 
> 	Just to be clear, HTCondor doesn't know (or have any way to know) if a particular protocol is "working", only if an address of that protocol is present on any of the machine's interfaces.  If you want HTCondor to never try any IPv6 address, set ENABLE_IPV6 = FALSE.  Otherwise:
> 
> PREFER_IPV4: Metaconfiguration; the default value for the next four settings.
> 
> ADVERTISE_IPV4_FIRST: If true, HTCondor will advertise IPv4 in its address lists before IPV6, indicating that it prefers to be contacted over IPV4. Otherwise, HTCondor will advertise IPV6 first.
> 
> IGNORE_TARGET_PROTOCOL_PREFERENCE: If true, HTCondor will ignore
> target daemon's protocol preference, as indicated above, and instead select which protocol to use based on its own preferences.  By default,
> HTCondor will prefer IPv6.
> 
> PREFER_OUTBOUND_IPV4: If true, HTCondor will prefer IPv4 when
> IGNORE_TARGET_PROTOCOL_PREFERENCE (above) is true.  Also, if IGNORE_DNS_PROTOCOL_PREFERENCE (below) is true, HTCondor will try IPv4 addresses returned by DNS first.
> 
> IGNORE_DNS_PROTOCOL_PREFERENCE: If true, HTCondor will sort DNS replies
> by protocol.  (DNS replies are normally sorted in order of the host's
> contact preferences.)  By default, IPv6 addresses will be sorted first.
> 
> - ToddM