Netisr delayed dispatch queue for complex processing cases.
ClosedPublic

Authored by melifaro on Oct 5 2014, 11:56 AM.

Details

Summary

There are cases where we have to send data while holding
too many locks, which may lead to LORs (lock order reversals) or recursive lock acquisition.

Typical problem areas:

NDP (ICMPv6 needs to be routed over IPv6)
nested interfaces sending control traffic

The idea is to simplify the locking model for consumers by adding a generic
mbuf queue which
a) deals with interface departures automatically
b) is able to save state which may be needed to process an mbuf before sending
c) calls a special handler for each mbuf in the queue, so it becomes relatively easy to do preprocessing before sending
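
Properties b) and c) can be illustrated with a small userland model. This is a sketch under stated assumptions, not the code under review: every name here (`dq_item`, `dq_enqueue`, `dq_drain`, `dq_record`) is hypothetical, a singly linked list stands in for the netisr queue, and a fixed-size character array stands in for the mbuf tag.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an mbuf plus its attached tag. */
struct dq_item {
	struct dq_item	*next;
	char		 state[32];	/* saved metainfo (the "mbuf tag") */
	char		 payload[32];	/* packet contents */
};

typedef void (*dq_handler_t)(struct dq_item *);

struct dq {
	struct dq_item	 *head;
	struct dq_item	**tailp;
	dq_handler_t	  handler;
};

/* Example handler: records what it saw instead of sending anything. */
int  dq_seen;
char dq_last_state[32];

void
dq_record(struct dq_item *it)
{
	dq_seen++;
	snprintf(dq_last_state, sizeof(dq_last_state), "%s", it->state);
}

void
dq_init(struct dq *q, dq_handler_t handler)
{
	q->head = NULL;
	q->tailp = &q->head;
	q->handler = handler;
}

/* Safe with locks held: only copies and links, never calls the handler. */
int
dq_enqueue(struct dq *q, const char *state, const char *payload)
{
	struct dq_item *it = calloc(1, sizeof(*it));

	if (it == NULL)
		return (-1);
	snprintf(it->state, sizeof(it->state), "%s", state);
	snprintf(it->payload, sizeof(it->payload), "%s", payload);
	*q->tailp = it;
	q->tailp = &it->next;
	return (0);
}

/* Runs later, with no locks held; returns the number of items handled. */
int
dq_drain(struct dq *q)
{
	struct dq_item *it;
	int n = 0;

	while ((it = q->head) != NULL) {
		q->head = it->next;
		if (q->head == NULL)
			q->tailp = &q->head;
		q->handler(it);
		free(it);
		n++;
	}
	return (n);
}
```

The point of the split is that `dq_enqueue()` does no work that could take further locks, while the handler runs only from `dq_drain()`, where the caller's locks are no longer held.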

What exactly is proposed:

One more netisr queue for handling different types of packets
Metainfo is stored in an mbuf_tag attached to the packet
An ifnet departure handler taking care of packets queued from/to a killed ifnet
An API to register/unregister/dispatch a given type of traffic
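
As a rough illustration of how these pieces could fit together, here is a minimal userland sketch. It is not the proposed kernel API: the names (`ddq_register`, `ddq_unregister`, `ddq_dispatch`, `ddq_ifnet_departure`, `ddq_run`) and the fixed-size array are assumptions made for the example, and an integer stands in for the packet and its tag.

```c
#include <assert.h>
#include <stddef.h>

#define DDQ_MAXTYPES	4
#define DDQ_MAXITEMS	16

typedef void (*ddq_handler_t)(int ifindex, int data);

struct ddq_item {
	int	type;		/* registered traffic type */
	int	ifindex;	/* interface the item is bound to */
	int	data;		/* stand-in for the packet and its tag */
};

ddq_handler_t	ddq_handlers[DDQ_MAXTYPES];
struct ddq_item	ddq_items[DDQ_MAXITEMS];
int		ddq_count;
int		ddq_sum;	/* updated by the example handler below */

/* Example handler used for demonstration only. */
void
ddq_sum_handler(int ifindex, int data)
{
	(void)ifindex;
	ddq_sum += data;
}

/* Drop queued items matching type or ifindex; pass -1 to ignore a field. */
int
ddq_drop(int type, int ifindex)
{
	int i, kept = 0, dropped;

	for (i = 0; i < ddq_count; i++) {
		if (ddq_items[i].type == type ||
		    ddq_items[i].ifindex == ifindex)
			continue;
		ddq_items[kept++] = ddq_items[i];
	}
	dropped = ddq_count - kept;
	ddq_count = kept;
	return (dropped);
}

int
ddq_register(int type, ddq_handler_t h)
{
	if (type < 0 || type >= DDQ_MAXTYPES || h == NULL ||
	    ddq_handlers[type] != NULL)
		return (-1);
	ddq_handlers[type] = h;
	return (0);
}

/* Unregistering also discards items of that type that are still queued. */
void
ddq_unregister(int type)
{
	if (type < 0 || type >= DDQ_MAXTYPES)
		return;
	ddq_handlers[type] = NULL;
	ddq_drop(type, -1);
}

/* Queue an item for later processing; never calls the handler directly. */
int
ddq_dispatch(int type, int ifindex, int data)
{
	if (type < 0 || type >= DDQ_MAXTYPES || ifindex < 0 ||
	    ddq_handlers[type] == NULL || ddq_count == DDQ_MAXITEMS)
		return (-1);
	ddq_items[ddq_count].type = type;
	ddq_items[ddq_count].ifindex = ifindex;
	ddq_items[ddq_count].data = data;
	ddq_count++;
	return (0);
}

/* Departure hook: purge everything queued for the dying interface. */
int
ddq_ifnet_departure(int ifindex)
{
	return (ddq_drop(-1, ifindex));
}

/* Process a snapshot of the queue so handlers may safely enqueue more. */
int
ddq_run(void)
{
	struct ddq_item batch[DDQ_MAXITEMS];
	int i, n = ddq_count;

	for (i = 0; i < n; i++)
		batch[i] = ddq_items[i];
	ddq_count = 0;
	for (i = 0; i < n; i++)
		if (ddq_handlers[batch[i].type] != NULL)
			ddq_handlers[batch[i].type](batch[i].ifindex,
			    batch[i].data);
	return (n);
}
```

In this model the departure hook runs before the handlers ever see an item, so a handler never has to check whether its interface still exists.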

Current problems that can be solved:

Locking in IPv6 LLE timers (solution embedded)

We use per-LLE IPv6 timers for various purposes. Most of them
require LLE modifications, so the timer function starts with the LLE write lock
held.

Some timer events require us to send neighbour solicitation messages,
which involves a) source address selection (requiring the LLE lock to be
held) and b) calling ip6_output(), which requires the LLE lock not to be
held. This is solved exactly as in the IPv4 ARP handling code: the timer function
drops the write lock before calling nd6_ns_output().

Dropping and reacquiring the lock is error-prone. For example, the following scenario is possible (traced by ae@):

Thread 1 (T1) calls if_detach(ifp) while thread 2 (T2) runs nd6_llinfo_timer.
Then the following can happen:

#1 T2 releases LLE lock and runs nd6_ns_output().
#2 T1 proceeds with detaching: in6_ifdetach() -> in6_purgeaddr() -> nd6_rem_ifa_lle() -> in6_lltable_prefix_free()

which removes all LLEs for the given prefix, acquiring each LLE write lock.
"Our" LLE is not destroyed since it is refcounted by nd6_llinfo_settimer_locked().

#3 T2 proceeds with nd6_ns_output(), selecting the source address (which involves acquiring the LLE read lock)

#4 T1 finishes with detaching interface addresses and sets ifp->if_addr to NULL

#5 T2 calls nd6_ifptomac(), which reads the interface MAC from ifp->if_addr, now NULL

#6 The user inspects the core dump generated by the previous call

Using the new API, we can avoid #6 by making the following code changes:

The LLE timer does not drop/reacquire the LLE lock
We require nd6_ns_output() callers to lock the LLE if one is provided
nd6_ns_output() uses the "slow" path instead of sending the mbuf to ip6_output() immediately if the LLE is not NULL.
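
The three changes above can be modelled in a few lines of userland C. This is only a sketch of the control flow: the `model_*` functions are hypothetical stand-ins for the timer, nd6_ns_output() and ip6_output(), the lock is a plain flag rather than a real rwlock, the deferred queue is reduced to a counter, and the assertions encode the locking rules described in the summary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a link-layer entry with a write-lock flag. */
struct model_lle {
	int	wlocked;
	int	state;
};

int	model_locks_held;	/* entry locks held by the current thread */
int	model_pending;		/* packets parked on the deferred queue */
int	model_sent;		/* packets handed to the output stand-in */

/* Stand-in for ip6_output(): must never run with an entry lock held. */
void
model_ip6_output(void)
{
	assert(model_locks_held == 0);
	model_sent++;
}

/* Stand-in for nd6_ns_output(): a non-NULL entry forces the slow path. */
void
model_ns_output(struct model_lle *lle)
{
	if (lle != NULL) {
		assert(lle->wlocked);	/* callers must lock a provided entry */
		model_pending++;	/* defer instead of sending now */
		return;
	}
	model_ip6_output();
}

/* Stand-in for the timer: the lock is held from start to finish. */
void
model_timer(struct model_lle *lle)
{
	lle->wlocked = 1;
	model_locks_held++;
	lle->state++;			/* modify the entry under the lock */
	model_ns_output(lle);		/* no drop/reacquire around the send */
	lle->wlocked = 0;
	model_locks_held--;
}

/* Runs later from the queue context, where no entry locks are held. */
int
model_drain(void)
{
	int n = 0;

	while (model_pending > 0) {
		model_pending--;
		model_ip6_output();
		n++;
	}
	return (n);
}
```

Because the timer never releases the lock mid-function, there is no window in which the detach path can run between the entry update and the send decision. The actual transmission happens from `model_drain()`.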

Lagg locking:

Changing the lagg primary port requires updating MAC addresses on other ports and nested devices.
We do this while holding the lagg WLOCK. Since changing the MAC involves sending a gratuitous ARP, we generate an mbuf and
send it via the vlan interface, which transmits it to lagg, which tries to acquire the read lock.
While this was (partially?) addressed by r272547, I feel we still have to send the gratuitous ARP via the given interface,
because:

The current lagg scheme works by a) _detaching_ all ports on reconfig and b) doing this in a taskqueue. This is too complex and bad for production traffic.
There can be lots of other cases with nested devices, so we'd better solve them in one place.

We've been running a very similar patch on more than 50 heavily loaded IPv6 firewalls since January, without any issues.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

No Lint Coverage

Unit

No Test Coverage

Event Timeline

melifaro updated this revision to Diff 1897. Oct 5 2014, 11:56 AM

melifaro retitled this revision to "Netisr delayed dispatch queue for complex processing cases."

melifaro updated this object.

melifaro edited the test plan for this revision.

melifaro added a subscriber: ae.

melifaro added a reviewer: network. Oct 5 2014, 11:56 AM

melifaro updated this object.

Query: are there a sufficient number of these types that simply adding more netisr protocols isn't the right solution? That was what was done historically for routing sockets, etc. I'm not opposed to a more fine-grained mechanism, but there are tradeoffs in adding additional memory allocation/freeing for every packet, etc. These mostly don't matter for lower-volume events, of course.

s/Netisr/netisr/ in mbuf.h.

I would prefer it if this were named 'deferred' rather than 'delayed', as that is the term used elsewhere for this design pattern (i.e., 'deferred dispatch').

I agree with Robert's comments on the topic. I'd also like to know how this affects the performance of the system. Finally, if this were to become a general mechanism in the kernel, how would we abstract it so that it worked, for instance, with ARP?

Yes, it is true that LORs in if_lagg(4) are not solved even in r272547.

I agree that ARP, NDP, and MLD/IGMP require an asynchronous queue to avoid recursive lock acquisition. I basically like this idea, though I am not sure whether we should implement it as a new protocol set for netisr, as others pointed out.

Also, what is the difference between dispatch and pdispatch in practice? Correct me if I am wrong, but to me they look like almost the same interface. Do you have any specific use cases?

emaste added a subscriber: emaste. Oct 24 2014, 3:36 PM

melifaro updated this revision to Diff 2312. Nov 6 2014, 7:57 PM

This comment was removed by melifaro.

Sorry for the noise, guys.
I've pushed a patch for a different revision here by mistake, so there is still nothing to review.
I'll try to provide a better version of the delayed queues (half-finished) soon.

I think this is a good candidate to fix the issue described in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197059