Page MenuHomeFreeBSD

Fix lagg failover mode caused by missing node notifications
Needs RevisionPublic

Authored by smh on Nov 9 2015, 12:47 AM.
Tags
None
Referenced Files
F106814230: D4111.id10047.diff
Sun, Jan 5, 8:16 PM
Unknown Object (File)
Fri, Jan 3, 4:02 AM
Unknown Object (File)
Thu, Jan 2, 4:06 AM
Unknown Object (File)
Thu, Jan 2, 4:01 AM
Unknown Object (File)
Thu, Jan 2, 4:00 AM
Unknown Object (File)
Thu, Jan 2, 3:57 AM
Unknown Object (File)
Tue, Dec 31, 12:27 AM
Unknown Object (File)
Mon, Dec 30, 4:21 PM

Details

Summary

When using lagg failover mode no gratuitous ARP (IPv4) or neighbour advertisements are sent to notify other nodes that the address has moved. This results is slow
failover, dropped packets and network outages for the lagg interface.

Based on work from koobs@ and suggestions from glebius@ this change fixes this.

It uses a new linkstate (LINK_STATE_UP_FORCE) to allow lagg to force through link state changes and hence fire a ifnet_link_event which are now monitored by rip and nd6.

Upon receiving these events the protocols trigger the relevant notifications:

  • inet4 => Gratuitous ARP
  • inet6 => Neighbour Announce

This fixes PR #156226

Original thread from 2012: PATCH if_lagg driver enhancements.

Test Plan

Setup
ifconfig <port1> up
ifconfig <port2> up
ifconfig lagg0 create
ifconfig lagg0 laggproto failover laggport <port1> laggport <port2> <ip>

Test 1 - Slave port failover

  1. Disable switch port <port1> - confirm arp seen from <port2>

Test 2 - Master port recovery

  1. Disable switch port <port1> - confirm arp seen from <port2>
  2. Enable switch port <port1> - confirm arp seen from <port1>

More to come...

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 1076
Build 1079: arc lint + arc unit

Event Timeline

smh retitled this revision from to Fix lagg failover mode caused by missing node notifications.
smh updated this object.
smh edited the test plan for this revision. (Show Details)
smh added reviewers: koobs, glebius, network.
smh updated this object.
smh updated this object.

It might be more appropriate to move inet4 event handler to if_ether.c from raw_ip.c and register it in arp_init ?

Move arp handler from raw_ip.c to if_ether.c so its with all other arp methods.

This also removes unrelated fixes from rawp_ip.c too.

Add check for IFF_UP and split into two methods so that arp_notify can be used for iflladdr_event too.

Style(9) fix for arp_notify.

smh edited the test plan for this revision. (Show Details)

Check IFF_UP in nd6_ifnet_link_event.

hrs requested changes to this revision.Nov 9 2015, 4:48 AM
hrs added a reviewer: hrs.
hrs added a subscriber: hrs.

Sending NAs for all of the AF_INET6 addresses unconditionally with ND_NA_FLAG_OVERRIDE flag is incorrect (anycast address should not have one, for example. See 7.2.6 in RFC 4861). Unsolicited NAs are certainly useful for L2 address change, but I am skeptical about if it works well in up/down events. This is because all of AF_INET6 addresses are marked as tentative when an interface is becoming down and duplicate address detection is always performed again when becoming up. So this behavior will conflict with NAs sent by the linkup event.

This revision now requires changes to proceed.Nov 9 2015, 4:48 AM
smh edited edge metadata.

Don't set ND_NA_FLAG_OVERRIDE for anycast addresses, addressing point 1 of feedbad from hrs.

It should be noted that if_carp has this same bug, sends NA with ND_NA_FLAG_OVERRIDE set for anycast addresses, as that's what I used as a basis for this portion of the code.

hrs: if you agree that this is the correct fix for this small component I'll get it fixed in carp too separately.

smh edited edge metadata.

Removed erroneous blank line.

Update to r290603 which introduced similar changes to arp for lladdr events.

Update comment in arp_notify to be more accurate given r290603 changes.

Update based on r290604 which add arp_announce_ifaddr.

This makes inet6 match inet4 for the purpose of announcements, which also fixes carp sending NA with override set for anycast addresses.

melifaro edited edge metadata.

Looks good to me. Some notes below.
Also, RFC 4861 asks that node " ..SHOULD introduce a small delay between the sending of each advertisement to reduce the probability of the advertisements being lost due to congestion". It would be great to implement (probably along with delaying first packet), but this is not the scope of this review.
Also note that all related code in ipv4/ipv6 traverses if_addrhead unlocked. (Again, this is not within scope of this review.)
Since you can't pass the packet while holding ifaddr lock, you have to do some tricks like saving all the addresses we want to send notification to to some temporary array, and then traverse this array unlocked. That was one of the reasons why I decomposed arp_ifinit() so it can be called without providing ifa.
It would be great if nd6_announce_ifaddr() would also accept addr+flags instead of ifa, so it would be easier to make proper locking afterwards..

sys/net/if.c
2032

I'd avoid adding LINK_STATE_UP_FORCE link state (and making it visible to everyone including userland applications like routing daemons) ). Probably, the better way might be introduction of if_link_state_change_cond(ifp, link_state, int force) ?

sys/net/if_var.h
533 ↗(On Diff #10062)

It would be better to move given define to netinet/in_var.h or similar place. if_var shouldn't be aware of any address families other than link. There is also IFA_IN6() macro doing the same for IPv6, so IFA_IN() might be a better namig for that

sys/net/if.c
2032

I much prefer this idea, implementing...

sys/net/if_var.h
533 ↗(On Diff #10062)

Thanks, implementing...

smh edited edge metadata.

Fix unlocked if_addrhead traversal.

As we cant send arp / na while holding the lock we now copy the required details to a temporary STAIL_Q while holding the lock then process through it after dropping the lock.

Also switch to addr + flags for nd6_announce_ifaddr.

Tweak arp_announce_ifaddr addr param to be a pointer, reducing stack usage and allowing consistency with nd6 versions.

Implement and use nd6_announce_delay to abide by RFC 4861 7.2.6 and 7.2.7.

Rename IFA_SINADDR to IFA_IN and move to in_var.h.

This should address all of feeback from melifaro.

smh marked 4 inline comments as done.Nov 9 2015, 6:18 PM
sys/netinet/if_ether.c
1159

Well, IMHO it would be better if this could be implemented using array+realloc. Then, in most cases default array size (say, 16 entries) would be sufficient and you would avoid malloc() inside cycle.

1163

lladdr can be stored once since it is the same.

sys/netinet6/nd6.c
255

It is not possible to sleep in eventhandler.
I might be better to deal w/ delayed announces in separate commit/review.

sys/netinet6/nd6.c
255

DELAY is a busy wait not a sleep, so is that not OK?

When I looked I couldn't find any reference to not being able to sleep in EVENTHANDLER(9).

Is this a you shouldn't sleep / delay due as it will cause delay to other handlers receiving the event or is there some other reason?

Having tested the code in a full debug kernel I see no warnings about this, so would like to understand more about this restriction.

smh edited edge metadata.

Use incrementing array allocations for temporary address storage in (arp|nd6)_announce.

Remove unused link_state parameter from lagg_linkstate_p.

Add lagg_linkstate_p calls to if_(route|unroute) fixing ifconfig <port> (up|down) for child ports not announcing changes.

Eliminate lladdr recalculation and use direct in_addr for temp storage in arp_announce.

smh marked 2 inline comments as done.Nov 10 2015, 2:51 AM

@smh First, thank you for picking up this low hanging fruit :)

While we're here, might the following be valuable

  • any unit tests? (Either for the expected behaviour, and/or for existing behaviour expectations for code paths being modified/extended)
  • a sysctl to globally enable/disable the (default: on) behaviour? are there any cases where this may be beneficial?

Also, I'm not sure if you've seen this (related?) review, but it might be worth mentioning:

In D4111#86576, @koobs wrote:

@smh First, thank you for picking up this low hanging fruit :)

While we're here, might the following be valuable

  • any unit tests? (Either for the expected behaviour, and/or for existing behaviour expectations for code paths being modified/extended)

Not sure that would be easily unit testable, if you have any pointers on this lmk, however I will expand on the manual test cases above once I've finished testing here.

  • a sysctl to globally enable/disable the (default: on) behaviour? are there any cases where this may be beneficial?

Nope I don't see any case where that would be useful.

Also, I'm not sure if you've seen this (related?) review, but it might be worth mentioning:

Yep, some of the changes here are based off it and melifaro (the author of that patch) has already done a number of reviews on this.

  • Can you please consider to reuse nd6_ns_input() in some way to send unsolicited NAs? nd6_announce() is a specific version of it and most of code can be shared. nd6_announce() lacks ifa checks other than anycast (e.g. tentative, deprecated, duplicated addresses must be filtered out) and R flag handling, and nd6_na_output_fib() instead of nd6_na_output() has to be used based on an interface's FIB for consistency. nd6_announce_delay() should use retrans timer (ND_IFINFO(ifa->ifa_ifp)->retrans) to separate NAs instead of hard-coded value (1000). And in practice, a down->up event sends an invalid NA because DAD of link-local address must be performed before sending an NA (i.e. nd6_announce() has to wait until a link-local address is ready to use).

    Adding all of them into nd6_announce() is possible but it causes duplication of code between the two. NS/NA handling should be in nd6_nbr.c, not in nd6.c Factoring out the NA sending part from nd6_ns_input() to reimplement solicited and unsolicited NA by using it looks reasonable for me.
  • The name nd6_announce sounds strange to me. Unlike ARP, NDP is not just for L2-L3 address translation, so "announce" is a confusing name. How about using nd6_na_output_unsolicited(struct ifaddr *ifa) for sending NA in a per-ifa basis and putting the TAILQ_FOREACH() loop for an ifnet into nd6_ifnet_link_event()?
  • And, I am concerned that gratuitous ARP/NAs are always sent upon an interface up/down event after this change even if L2 address is not changed. Is this intentional? For if_lagg(4) this makes sense in some degree, it sounds overkill for the other interfaces (to me, at least).
smh edited edge metadata.

Address points raised by hrs

  • Extract outbound logic from nd6_ns_input for reuse with in unsolicited na's.
  • Added constant for inter address delay of unsolicited na's.
  • Moved unsolicited na methods to nd6_nbr.c.
  • Renamed nd6_announceXXX methods to nd6_na_output_unsolicitedXXXX.
In D4111#86850, @hrs wrote:
  • Can you please consider to reuse nd6_ns_input() in some way to send unsolicited NAs? nd6_announce() is a specific version of it and most of code can be shared. nd6_announce() lacks ifa checks other than anycast (e.g. tentative, deprecated, duplicated addresses must be filtered out) and R flag handling, and nd6_na_output_fib() instead of nd6_na_output() has to be used based on an interface's FIB for consistency.

Based on your comment I've extracted out what seems to be all the outbound flag logic including tentative, deprecated, duplicated and R flag handling via the IFA_ND6_NA_UNSOLICITED_SKIP and IFA_ND6_NA_BASE_FLAGS macros, eliminating what seems to be be all the duplicate logic.

With regards to using nd6_na_output_fib as there is no inbound packet to base the FIB decision how would I determine the fibnum?

nd6_announce_delay() should use retrans timer (ND_IFINFO(ifa->ifa_ifp)->retrans) to separate NAs instead of hard-coded value (1000). And in practice, a down->up event sends an invalid NA because DAD of link-local address must be performed before sending an NA (i.e. nd6_announce() has to wait until a link-local address is ready to use).

These are not retransmits, so using the retransmit timer doesn't seem to make sense.

This delay corresponds to the following from the RFC:

A node that has multiple IP addresses assigned to an interface MAY multicast a separate Neighbor Advertisement for each address. In such a case, the node SHOULD introduce a small delay between the sending of each advertisement to reduce the probability of the advertisements being lost due to congestion.

As I interpret it, this is not related to RetransTimer mentioned in other parts of the RFC which is the MUST delay between retransmit's, which is not the same as the SHOULD delay between each IP's advertisement.

I've created a define to make this explicit does that works for you?

Adding all of them into nd6_announce() is possible but it causes duplication of code between the two.  NS/NA handling should be in nd6_nbr.c, not in nd6.c Factoring out the NA sending part from nd6_ns_input() to reimplement solicited and unsolicited NA by using it looks reasonable for me.

Functions moved, could you confirm the logic behind the split, I'm guessing based on your comment nbr means neighbor?

  • The name nd6_announce sounds strange to me. Unlike ARP, NDP is not just for L2-L3 address translation, so "announce" is a confusing name. How about using nd6_na_output_unsolicited(struct ifaddr *ifa) for sending NA in a per-ifa basis and putting the TAILQ_FOREACH() loop for an ifnet into nd6_ifnet_link_event()?

This was named like that to maintain consistency across protocols but I've renamed this as per your suggestion.

  • And, I am concerned that gratuitous ARP/NAs are always sent upon an interface up/down event after this change even if L2 address is not changed. Is this intentional? For if_lagg(4) this makes sense in some degree, it sounds overkill for the other interfaces (to me, at least).

Yes that is totally the point of the change. We're trying to ensure that the remote nodes, be that machines or switches update their entries to point to the correct place when failover occurs. For machines this is usually just about IP -> MAC but for switches port comes into play, so without it they could continue to send packets to the wrong port as their fdb entry will be out of date. In real terms this can be minutes before a successful change occurs as its reliant on arp timeout.

Set scope for unsolicited neighbor advertisements, this was preventing them being sent due to route lookup matching ipv6_route__llma reject route. This means that carp IPv6 NA's have not worked since r251584 which added this route.

Added sysctls for disabling announcements on link, which are useful for convergence speed testing:

  • net.link.ether.inet.arp_on_link
  • net.inet6.icmp6.nd6_on_link

Also added descriptions for net.inet6.icmp6 sysctls while here.

Hi guys just wondered if you had chance to review this yet and if you had any further feedback on it.

Sorry to chase after only 1 week but I'm intending to use this work in a project which gets deployed in the next two weeks.

Avoid wrapping short SYSCTL description lines.

This revision was automatically updated to reflect the committed changes.
  • Fix panic for etherswitches which don't have a LLADDR.
  • Disabled DELAY in unsolicited NDA, which needs further work.
  • Fixed missing DELAY in carp_send_na.
  • style(9) fix.

glebius raised concerns about this fix so its currently been removed from the tree, so we can continue discussion here.

I'm wondering if the main concerns center around the delay which "should" be implemented between interface IPv6 NDA's?

Currently these delays have been disabled and although this is not ideal it does produce the desired behaviour.

If this is the main concern would there be any merit in splitting the commit into two, one for IPv4 and one for IPv6 once completed at a later date?

Sending a GARP on every link-up is a good idea. Please keep that. Consider a laptop moving between switch ports.

By the way, a colleague of mine is working on a patch to retransmit GARPs using a callout. That is orthogonal but relevant to this work.

This does not appear to have been MFC as the commit message implies. I don't see the commit in 10 STABLE or 10.3 RELENG. Was this MFC cancelled due to issues?

sys/net/if.c
2022

Parenthesis missing.

sys/netinet/if_ether.c
1153–1181

This cycle will send ARP requests for CARP addresses in BACKUP or INIT state. I didn't look at corresponding IPv6 function, but likely it has same issue.

Note, that arprequest() already has a cycle over all ifaddrs. Once you fix the CARP problem, both cycles will look quite similar. I'm pondering if we can improve arprequest() to be capable to send gratious from every valid IP address and don't create a new function.

Of course all memory allocation is ugly. I'd prefer temporary unlocking instead, but in that case we need protection against address list change.

1154

This function is superfluous. It is called from three places:

  1. arp_ifinit() just above and it already has check against INADDR_ANY.
  2. carp. Unlikely that INADDR_ANY comes from here.
  3. New function arp_announce() down below. Check can be added there.
This revision now requires changes to proceed.May 23 2020, 5:12 PM