There are cases where we have to send data while already holding
locks, which may lead to LORs (lock order reversals) or recursive lock acquisition.
Typical problem areas:
- ndp (ICMPv6 needs to be routed over IPv6)
- nested interfaces sending control traffic
The idea is to simplify the locking model for consumers by adding a generic
mbuf queue which
a) deals with interface departures automatically
b) is able to save state that may be needed to process an mbuf before sending
c) calls a special handler for each mbuf in the queue, so it becomes relatively easy to do preprocessing before sending
What exactly is proposed:
- One more netisr queue for handling different types of packets
- metainfo is stored in an mbuf_tag attached to the packet
- ifnet departure handler taking care of packets queued from/to a departed ifnet
- API to register/unregister/dispatch a given type of traffic
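To make the proposed pieces concrete, here is a minimal userspace sketch of such a queue. It is only a model under stated assumptions: every name (dq_*, struct dq_pkt, demo_handler) is hypothetical and not taken from the patch, a plain linked list stands in for the netisr queue, and a void pointer stands in for the mbuf_tag metainfo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace model only; all names here are hypothetical, not the patch API. */
#define	DQ_MAXTYPES	8

struct dq_pkt {
	struct dq_pkt *next;
	int type;	/* registered traffic type */
	int ifindex;	/* interface the packet is bound to */
	void *meta;	/* stands in for the mbuf_tag metainfo */
};

typedef void (*dq_handler_t)(struct dq_pkt *);

static dq_handler_t dq_handlers[DQ_MAXTYPES];
static struct dq_pkt *dq_head, **dq_tailp = &dq_head;
static int demo_sent;

/* Example handler: only counts the packets it is asked to send. */
void
demo_handler(struct dq_pkt *p)
{
	(void)p;
	demo_sent++;
}

/* Register a handler for a traffic type; fails if the slot is taken. */
int
dq_register(int type, dq_handler_t h)
{
	if (type < 0 || type >= DQ_MAXTYPES || h == NULL ||
	    dq_handlers[type] != NULL)
		return (-1);
	dq_handlers[type] = h;
	return (0);
}

void
dq_unregister(int type)
{
	if (type >= 0 && type < DQ_MAXTYPES)
		dq_handlers[type] = NULL;
}

/* Queue a packet. Nothing is sent here, so callers may hold their locks. */
int
dq_enqueue(int type, int ifindex, void *meta)
{
	struct dq_pkt *p;

	if (type < 0 || type >= DQ_MAXTYPES || dq_handlers[type] == NULL)
		return (-1);
	if ((p = malloc(sizeof(*p))) == NULL)
		return (-1);
	p->next = NULL;
	p->type = type;
	p->ifindex = ifindex;
	p->meta = meta;
	*dq_tailp = p;
	dq_tailp = &p->next;
	return (0);
}

/* Interface departure: drop every queued packet bound to ifindex. */
int
dq_ifnet_departure(int ifindex)
{
	struct dq_pkt **pp = &dq_head, *p;
	int dropped = 0;

	while ((p = *pp) != NULL) {
		if (p->ifindex == ifindex) {
			*pp = p->next;
			free(p);
			dropped++;
		} else
			pp = &p->next;
	}
	dq_tailp = pp;	/* pp now points at the terminating NULL link */
	return (dropped);
}

/* Drain the queue, calling the per-type handler for each packet. */
int
dq_dispatch(void)
{
	struct dq_pkt *p;
	int n = 0;

	while ((p = dq_head) != NULL) {
		assert(p->type >= 0 && p->type < DQ_MAXTYPES);
		dq_head = p->next;
		if (dq_head == NULL)
			dq_tailp = &dq_head;
		if (dq_handlers[p->type] != NULL) {
			dq_handlers[p->type](p);
			n++;
		}
		free(p);
	}
	return (n);
}
```

In this model a consumer registers once, calls dq_enqueue() from any locked context, and the handler runs later from dq_dispatch() with no consumer locks held. Packets for a departed interface never reach the handler.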
Current problems that can be solved:
- Locking in IPv6 LLE timers (solution embedded)
We're using per-LLE IPv6 timers for various purposes. Most of them
require LLE modifications, so the timer function starts with the LLE write lock
held.
Some timer events require us to send neighbour solicitation messages,
which involves a) source address selection (requiring the LLE lock to be
held) and b) calling ip6_output(), which requires the LLE lock not to be
held. Currently this is handled exactly as in the IPv4 ARP handling code: the timer function
drops the write lock before calling nd6_ns_output().
Dropping and reacquiring the lock is error-prone. For example, the following scenario is possible (traced by ae@):
if_detach(ifp) runs in thread 1 (T1) and nd6_llinfo_timer() runs in thread 2 (T2).
Then the following can happen:
#1 T2 releases the LLE lock and runs nd6_ns_output().
#2 T1 proceeds with detaching: in6_ifdetach() -> in6_purgeaddr() -> nd6_rem_ifa_lle() -> in6_lltable_prefix_free(),
which removes all LLEs for the given prefix, acquiring each LLE write lock. "Our" LLE is not destroyed since it is refcounted by nd6_llinfo_settimer_locked().
#3 T2 proceeds with nd6_ns_output(), selecting the source address (which involves acquiring the LLE read lock)
#4 T1 finishes detaching interface addresses and sets ifp->if_addr to NULL
#5 T2 calls nd6_ifptomac(), which reads the interface MAC from ifp->if_addr
#6 User inspects the core generated by the previous call
Using the new API, we can avoid #6 by making the following code changes:
- the LLE timer does not drop/reacquire the LLE lock
- we require nd6_ns_output() callers to lock the LLE if one is provided
- nd6_ns_output() uses the "slow" path instead of sending the mbuf to ip6_output() immediately if the LLE is not NULL.
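The three rules above can be sketched as a small userspace model. This is an illustration under assumptions, not the patch itself: the model_* functions and struct lle_model are hypothetical stand-ins for nd6_llinfo_timer(), nd6_ns_output(), ip6_output() and the LLE, the lock is reduced to a flag, and the deferred queue to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's structures. */
struct lle_model {
	bool wlocked;		/* models the LLE write lock being held */
};

static int lle_locks_held;	/* number of LLE locks held right now */
static int sent_direct;		/* packets handed to the output stand-in */
static int queued_slow;		/* packets deferred to the "slow" path */

/* Stand-in for ip6_output(): must never run with an LLE lock held. */
void
model_ip6_output(void)
{
	assert(lle_locks_held == 0);
	sent_direct++;
}

/* Stand-in for nd6_ns_output() following the rules listed above. */
void
model_ns_output(struct lle_model *lle)
{
	if (lle != NULL) {
		/* Callers must hold the LLE lock when they pass an LLE. */
		assert(lle->wlocked);
		queued_slow++;		/* slow path: defer, do not send */
		return;
	}
	model_ip6_output();		/* no LLE: send immediately */
}

/* Stand-in for the LLE timer: keeps the lock across the call. */
void
model_llinfo_timer(struct lle_model *lle)
{
	lle->wlocked = true;
	lle_locks_held++;
	model_ns_output(lle);
	lle_locks_held--;
	lle->wlocked = false;
}

/* Stand-in for the queue handler: sends deferred packets lock-free. */
int
model_drain(void)
{
	int n = 0;

	while (queued_slow > 0) {
		model_ip6_output();
		queued_slow--;
		n++;
	}
	return (n);
}
```

The point of the model is that the timer never opens a window in which the LLE is unlocked, so the interleaving from steps #1 to #5 has nowhere to occur. The actual send happens later, outside any LLE lock.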
- Lagg locking:
Changing the lagg primary port requires updating MAC addresses on other ports and nested devices.
We do this while holding the lagg WLOCK. Since changing the MAC involves sending a gratuitous ARP, we generate an mbuf and
send it via the vlan interface, which transmits it to lagg, which then tries to acquire the read lock.
While this was (partially?) addressed by r272547, I feel we still have to send the gratuitous ARP via the given interface,
because:
- the current lagg scheme works by a) _detaching_ all ports on reconfig and b) doing this in a taskqueue. This is too complex and bad for production traffic.
- there can be lots of other cases with nested devices, so we'd better solve them in one place.
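The nested-device recursion described for lagg can be reproduced in a few lines. This is a rough userspace model under assumptions: the model_* names are hypothetical, the non-recursive lagg lock is reduced to a single flag, and a failed read-lock attempt is counted instead of deadlocking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: the lagg rwlock is reduced to one flag. */
static bool lagg_wlocked;
static int lagg_tx_ok;		/* transmits that got the read lock */
static int lagg_tx_recursed;	/* transmits attempted under our own WLOCK */
static int garp_pending;	/* gratuitous ARPs waiting in the queue */

/* lagg transmit needs the read lock, which it cannot get under WLOCK. */
int
model_lagg_transmit(void)
{
	if (lagg_wlocked) {
		lagg_tx_recursed++;
		return (-1);
	}
	lagg_tx_ok++;
	return (0);
}

/* A vlan stacked on lagg simply passes the packet down. */
int
model_vlan_transmit(void)
{
	return (model_lagg_transmit());
}

/* Primary port change: the MAC update happens under the WLOCK. */
void
model_lagg_set_primary(bool defer)
{
	lagg_wlocked = true;
	if (defer)
		garp_pending++;			/* queue it for later */
	else
		(void)model_vlan_transmit();	/* re-enters lagg */
	lagg_wlocked = false;
}

/* Queue handler: runs after the WLOCK has been released. */
int
model_garp_drain(void)
{
	int n = 0;

	while (garp_pending > 0) {
		if (model_vlan_transmit() == 0)
			n++;
		garp_pending--;
	}
	return (n);
}
```

Calling model_lagg_set_primary(false) hits the recursion, while model_lagg_set_primary(true) followed by model_garp_drain() transmits cleanly. The same deferral would apply to any other stacked device without per-driver workarounds.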
We've been running a very similar patch on more than 50 heavily loaded IPv6 firewalls since January without any issues.