Page MenuHomeFreeBSD

Restructure mbuf send tags to provide stronger guarantees.
ClosedPublic

Authored by jhb on May 1 2019, 12:35 AM.

Details

Summary
  • Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags.

    Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds.

    To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions.

    For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN.
  • To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped.

    This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop).

    In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag.
  • cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called.

    To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag.
  • mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead.
  • Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Sponsored by: Netflix

Test Plan
  • tested using netperf -q with a RATELIMIT kernel on cxgbe over plain cc interfaces, vlan+cc, and lagg + cc (including failing traffic over between two cc interfaces in a lagg while netperf was running)
  • also tested with additional changes to use send tags for TLS sessions on plain cc and lagg + cc.
  • Drew has tested on mlx5.

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 24377
Build 23197: arc lint + arc unit

Event Timeline

jhb created this revision.May 1 2019, 12:35 AM
jhb added inline comments.May 1 2019, 12:43 AM
sys/net/if_lagg.c
1559

This function is only used in one place (during my WIP it was used in more than one place), but it makes the error handling simpler since all of the errors conditions here would have to LAGG_RUNLOCK() and now they can just do it in the caller. However, I could collapse this back down in the caller if folks don't like it being split out.

sys/net/if_vlan.c
1977–1984

The old vlan_snd_tag_alloc was missing this locking to keep ifv_trunk stable.

sys/netinet6/ip6_output.c
231–244

This change to use m_dup_pkthdr() isn't strictly needed for this change since m0 shouldn't have a send tag assigned yet (it gets assigned in ip6_output_send() on the generated fragments), however, it looks more correct and matches what ip_fragment does. I would be happy to commit the ip6_fragment change separately.

ae set the repository for this revision to rS FreeBSD src repository.
ae added a comment.May 1 2019, 10:13 AM

I'm sorry, I completely missed this change in the past. But it looks like it can break ipfw firewall rules, since rcvif is now union with snd_tag. And this means, rcvif can be initialized for packets that were not actually received on specified interface. ipfw uses rcvif in rules to check that a packet was received on specified interface, and this check was correct even for outgoing packets. Now it looks like such checks can be incorrect.

In D20117#433037, @ae wrote:

I'm sorry, I completely missed this change in the past. But it looks like it can break ipfw firewall rules, since rcvif is now union with snd_tag. And this means, rcvif can be initialized for packets that were not actually received on specified interface. ipfw uses rcvif in rules to check that a packet was received on specified interface, and this check was correct even for outgoing packets. Now it looks like such checks can be incorrect.

Send tags are only used for endstation initiated traffic (eg, TCP). So if a packet has a send tag set, then it could not possibly have arrived via any interface. Eg, it would have previously had a NULL rcvif. So if you compare pointers, it should be fine.

If you look at something in the rcvif, there is now a flag (CSUM_SND_TAG) that indicates that the rcvif/snd_tag union is a send tag, so you can avoid treating a send_tag as an ifnet in that case.

adrian added a subscriber: adrian.May 1 2019, 3:36 PM
In D20117#433037, @ae wrote:

I'm sorry, I completely missed this change in the past. But it looks like it can break ipfw firewall rules, since rcvif is now union with snd_tag. And this means, rcvif can be initialized for packets that were not actually received on specified interface. ipfw uses rcvif in rules to check that a packet was received on specified interface, and this check was correct even for outgoing packets. Now it looks like such checks can be incorrect.

Send tags are only used for endstation initiated traffic (eg, TCP). So if a packet has a send tag set, then it could not possibly have arrived via any interface. Eg, it would have previously had a NULL rcvif. So if you compare pointers, it should be fine.
If you look at something in the rcvif, there is now a flag (CSUM_SND_TAG) that indicates that the rcvif/snd_tag union is a send tag, so you can avoid treating a send_tag as an ifnet in that case.

Ugh. Lemme go make sure net80211's abuse of rcvif to store the net80211 node struct doesn't interfere with this. :( I'm pretty sure outbound and inbound net80211 mbufs overload the rcvif for this purpose.

adrian added a comment.May 1 2019, 3:38 PM

I see you've done a bit of net80211 checking there; I think I'm going to use this change as motivation to speed up my desire to make rcvif manipulation a bit less insane and error prone.

gallatin added a comment.EditedMay 1 2019, 4:05 PM

I see you've done a bit of net80211 checking there; I think I'm going to use this change as motivation to speed up my desire to make rcvif manipulation a bit less insane and error prone.

Would it work to use PH_loc.ptr in place of rcvif to store the pointer the net80211 node struct? I assume PH_loc was added after you'd already taken rcvif?

If anybody wants to throw rotten vegetables for the choice to unionize rcvif and snd_tag, lob them at me. I argued strongly to keep the mbuf size unchanged. This is because (on amd64) we wind up taking an extra cacheline miss on mbuf ext free if the pkt_hdr adds a new pointer, because any additions to the packet header will push m_ext.ext_arg1 into another cacheline. In our workload, this ext_free happens millions of times per second. I wish I could think of some clean, non-intrusive way to have an mbuf have just an m_ext, but not waste space on a pkt_hdr.

I have made one quick pass over this, now that I have read all the bits I would like to make a second pass, my comments now are nits mostly, and can be safely ignored. However I do, as others, have some pretty big concerns about sharing the rcvif with tags, is there some great cost in not doing that?

sys/net/if_lagg.c
1559

I like it split out, but it should also get a block comment describing it, and perhaps you ab above comments? There are also some other functions that could use block comments, though not all of them need them as they are trivially short

sys/net/if_vlan.c
1977–1984

Is that a bug in the current code that should just be fixed with a commit outside this review? Does that bug exist in stable/12 and most importantly stable/11 right now?

sys/netinet6/ip6_output.c
231–244

If this change can be applied seperate and in advance of this differential I think that would be the better, as it makes one less change to look at in this already large set of diffs.

sys/sys/mbuf.h
1230

If you have to make that same check each time before calling m_snd_tag_rele would it make since to do that in the routine and just pass m? These are m_foo routines, I am use to seeing those take (m) as argument not some nested data.

adrian added a comment.May 1 2019, 4:36 PM

I see you've done a bit of net80211 checking there; I think I'm going to use this change as motivation to speed up my desire to make rcvif manipulation a bit less insane and error prone.

Would it work to use PH_loc.ptr in place of rcvif to store the pointer the net80211 node struct? I assume PH_loc was added after you'd already taken rcvif?
If anybody wants to throw rotten vegetables for the choice to unionize rcvif and snd_tag, lob them at me. I argued strongly to keep the mbuf size unchanged. This is because (on amd64) we wind up taking an extra cacheline miss on mbuf ext free if the pkt_hdr adds a new pointer, because any additions to the packet header will push m_ext.ext_arg1 into another cacheline. In our workload, this ext_free happens millions of times per second. I wish I could think of some clean, non-intrusive way to have an mbuf have just an m_ext, but not waste space on a pkt_hdr.

This was done in net80211 far before I was on the 802.11 scene. :(

I'm tempted to just turn it into an mbuf tag to store the net80211 node pointer. I don't care about the performance overhead right now; I'd rather fix net80211's locking and mbuf sins first and then rethink performance.

(I also have idly wondered about stuff like ext_free where it could be interesting to turn it into an ID into a function array, so it's not wasting a 64 bit pointer value for what's almost always one of < 20 methods; then subsystems can register mbuf methods like for ext_free..)

Can you hold off on this for a little bit? I think I'd like to go do up a quick (!) patch which makes rcvif dereference use an inline accessor, which we can then turn into a runtime assert that (in your world) it definitely isn't a snd_tag. That should also make it less error prone in general.

jhb added a comment.May 1 2019, 4:38 PM
In D20117#433037, @ae wrote:

I'm sorry, I completely missed this change in the past. But it looks like it can break ipfw firewall rules, since rcvif is now union with snd_tag. And this means, rcvif can be initialized for packets that were not actually received on specified interface. ipfw uses rcvif in rules to check that a packet was received on specified interface, and this check was correct even for outgoing packets. Now it looks like such checks can be incorrect.

So I looked at this a bit, and the overloading is what led to the panic slavash@ reported on comitters@. However, the explicit CSUM_SND_TAG now means we avoid ambiguity about the field, and I've added assertions to make sure we don't clobber a valid rcvif with a send tag or vice versa. As Drew noted, a forwarded packet shouldn't have a send tag as only ip_output with a valid inp will set a send tag. I do plan to use send tags in future changes to work with TLS sessions, but those tags will have the same condition.

I did review the use of ip_fw2.c, but I assumed it had the the IN vs OUT flags backward, but on re-reviewing it now, I see it doesn't. I'll fix its use of rcvif to do what I did in bpf and use NULL if CSUM_SND_TAG is set since the PFIL_OUT hook in ether_output_frame() can see a packet with a send tag. The pfil hooks in ip_output are run before tags are set so couldn't be confused. Also, FWIW, that bug is true today.

jhb added a comment.May 1 2019, 4:40 PM
In D20117#433037, @ae wrote:

I'm sorry, I completely missed this change in the past. But it looks like it can break ipfw firewall rules, since rcvif is now union with snd_tag. And this means, rcvif can be initialized for packets that were not actually received on specified interface. ipfw uses rcvif in rules to check that a packet was received on specified interface, and this check was correct even for outgoing packets. Now it looks like such checks can be incorrect.

Send tags are only used for endstation initiated traffic (eg, TCP). So if a packet has a send tag set, then it could not possibly have arrived via any interface. Eg, it would have previously had a NULL rcvif. So if you compare pointers, it should be fine.
If you look at something in the rcvif, there is now a flag (CSUM_SND_TAG) that indicates that the rcvif/snd_tag union is a send tag, so you can avoid treating a send_tag as an ifnet in that case.

Ugh. Lemme go make sure net80211's abuse of rcvif to store the net80211 node struct doesn't interfere with this. :( I'm pretty sure outbound and inbound net80211 mbufs overload the rcvif for this purpose.

My reading was that only outbound net80211 abused it (presumably on receive, rcvif has to be "real")). I did add assertions and it should be safe since no current 802.11 drivers support send tags. If they ever wanted to support send tags in the future the recvif overloading would need to be resolved in another way.

jhb marked an inline comment as done.May 1 2019, 9:12 PM
jhb added inline comments.
sys/net/if_vlan.c
1977–1984

It's a very rare bug. We don't let you move a vlan to a different vlandev once it is assigned (at least according to ifconfig(8)). I think if you detach the trunk device (e.g. devctl detach or kldunload of the driver) leaving the vlan orphaned then you would have a race with reading a stale pointer. However, this is not a common occurrence.

sys/netinet6/ip6_output.c
231–244

I really need someone with slightly more ipv6 fu (e.g. bz@) to review it.

sys/sys/mbuf.h
1230

Various places that drop references don't have an 'm', only a tag that is detached from an mbuf (e.g. in the transmit routines of lagg and vlan, but also some places in cxgbe). As such, they can't take an mbuf.

jhb added a comment.May 1 2019, 9:16 PM

I see you've done a bit of net80211 checking there; I think I'm going to use this change as motivation to speed up my desire to make rcvif manipulation a bit less insane and error prone.

Would it work to use PH_loc.ptr in place of rcvif to store the pointer the net80211 node struct? I assume PH_loc was added after you'd already taken rcvif?
If anybody wants to throw rotten vegetables for the choice to unionize rcvif and snd_tag, lob them at me. I argued strongly to keep the mbuf size unchanged. This is because (on amd64) we wind up taking an extra cacheline miss on mbuf ext free if the pkt_hdr adds a new pointer, because any additions to the packet header will push m_ext.ext_arg1 into another cacheline. In our workload, this ext_free happens millions of times per second. I wish I could think of some clean, non-intrusive way to have an mbuf have just an m_ext, but not waste space on a pkt_hdr.

This was done in net80211 far before I was on the 802.11 scene. :(
I'm tempted to just turn it into an mbuf tag to store the net80211 node pointer. I don't care about the performance overhead right now; I'd rather fix net80211's locking and mbuf sins first and then rethink performance.

I think Drew's suggestion of using PH_loc is probably cheaper, and it's probably fine if you aren't crossing protocol layers.

Can you hold off on this for a little bit? I think I'd like to go do up a quick (!) patch which makes rcvif dereference use an inline accessor, which we can then turn into a runtime assert that (in your world) it definitely isn't a snd_tag. That should also make it less error prone in general.

I'd rather not delay this very long as there is other pending work that builds on this.

jhb updated this revision to Diff 56930.May 1 2019, 9:18 PM
  • Move bpf_rcvif to mbuf.h and rename it m_rcvif.
  • Use m_rcvif.
  • Add an assertion to if_setrcvif.
  • Add a comment for lookup_snd_tag_port.
adrian added a comment.May 1 2019, 9:23 PM
In D20117#433281, @jhb wrote:

I'm tempted to just turn it into an mbuf tag to store the net80211 node pointer. I don't care about the performance overhead right now; I'd rather fix net80211's locking and mbuf sins first and then rethink performance.

I think Drew's suggestion of using PH_loc is probably cheaper, and it's probably fine if you aren't crossing protocol layers.

I'll look into it. Thanks!

Can you hold off on this for a little bit? I think I'd like to go do up a quick (!) patch which makes rcvif dereference use an inline accessor, which we can then turn into a runtime assert that (in your world) it definitely isn't a snd_tag. That should also make it less error prone in general.

I'd rather not delay this very long as there is other pending work that builds on this.

It's cool; if this doesn't break wifi on your laptop it's "good for me" for now. I'll go and add a getrcvif method and start sprinkling it around the tree.

-a

rgrimes added a reviewer: bz.May 1 2019, 9:46 PM

You can mark all my comments as done, I am fine with your answers, I added bz and requested his review on the one change.

sys/netinet6/ip6_output.c
231–244

I added bz to the review, could you please look at this both for correctness, and second for possible seperation from this review. Thanks.

jhb added a comment.May 1 2019, 10:16 PM

Weird, I had put 'battlez' (IRC nick) as a subscriber in arc when uploading and it didn't complain about the missing account. *sigh*

gallatin accepted this revision.May 1 2019, 10:48 PM
This revision is now accepted and ready to land.May 1 2019, 10:48 PM

I have a generic question about the snd_tag functionality.

Do I understand right that the primary purpose of the m_snd_tag is to provide/verify valid physical ifp for the particular pcb? Could you please clarify if there is any other functionality?

The reason I'm asking is that there is a WIP around making route lookup functions return special next-hop objects (very similar to what fib[46]_lookup currently returns) instead of rtentries. These objects will be cached instead of rtenty in the _struct route_ and it will be possible to cache the next-hop object referencing the particular physical ifp (even with vlan-over-lagg use cases) . All such objects are created using control-plane machinery.

This nexthop change I'm talking about does not conflict with this change and is not a blocker in any way. Basically I'm trying to understand whether some of the proposed functionality (like vlan/lagg changes, for example) could be converted later to use the nexthop control plane mechanisms.

sys/netinet/ip_output.c
207

Do we need to specify __inline explicitly here?

sys/netinet6/ip6_output.c
279

Do we need to specify __inline explicitly here?

ae accepted this revision as: ae.May 2 2019, 3:05 PM
ae added inline comments.
sys/netinet6/ip6_output.c
290

it looks like mst is always NULL here. The same for ip_output_send().

jhb added a comment.May 2 2019, 4:23 PM

I have a generic question about the snd_tag functionality.
Do I understand right that the primary purpose of the m_snd_tag is to provide/verify valid physical ifp for the particular pcb? Could you please clarify if there is any other functionality?

My understanding is that the purpose of a send tag is to allocate some per-flow (per-socket) offload resources from a NIC driver, which may include allocating resources on a NIC itself. The current consumer in-tree is rate limit which allocates a queue on a NIC that packets can be stored in for a flow and the NIC will pace those queued packets out at a configured rate. The use case I have in out-of-tree-patches is to allocate a TLS session where you save the keys for a TLS session on the NIC and then have the NIC encrypt TLS records and segment them (via TSO). A future use case I'm considering is something similar, but for IPSec which isn't per-flow but is still allocating some offload state on the NIC. Having the ifp mismatch be "perfect" is much more important for TLS where sending the TLS frame to a driver that doesn't use send tags means we would send unencrypted TLS records on the wire. For RATELIMIT, we just degrade to sending packets without any pacing if a route change means packets start being sent over a NIC driver that doesn't support pacing.

The reason I'm asking is that there is a WIP around making route lookup functions return special next-hop objects (very similar to what fib[46]_lookup currently returns) instead of rtentries. These objects will be cached instead of rtenty in the _struct route_ and it will be possible to cache the next-hop object referencing the particular physical ifp (even with vlan-over-lagg use cases) . All such objects are created using control-plane machinery.
This nexthop change I'm talking about does not conflict with this change and is not a blocker in any way. Basically I'm trying to understand whether some of the proposed functionality (like vlan/lagg changes, for example) could be converted later to use the nexthop control plane mechanisms.

I think they will be orthogonal. In this case we have to get down to the actual NIC driver because we want to allocate resources on the NIC, and for RATELIMIT and TLS those resources are per-flow, not per-ifp or per-route.

jhb added inline comments.May 2 2019, 4:29 PM
sys/netinet/ip_output.c
207

We do in other places in the tree where we really want the compiler to inline it. In particular, for a kernel without RATELIMIT the compiler should be able to determine mst is always NULL and compress this down to just if_output as the code was before without send tags. I don't think it hurts to inline it always as the previous code did. The purpose of the function is more to reduce code duplication.

sys/netinet6/ip6_output.c
290

It is, yes. I had pulled this from a version that has an earlier check for TLS here: https://github.com/bsdjhb/freebsd/blob/kern_tls_send_tags/sys/netinet/ip_output.c#L208

However, I can remove the 'mst == NULL' check from here for this change and reintroduce it in the TLS changes when they eventually land if that is clearer.

jhb added a comment.May 2 2019, 4:50 PM

One more comment for adrian: I did boot an INVARIANTS kernel with this on my laptop last night and was able to 'git fetch' over iwm(4) ok.

bz added a comment.May 2 2019, 8:31 PM

The ip6_fragment() change seems fine if you want to go ahead and factor that out.

sys/netinet6/ip6_output.c
280

ip6_output_send() is a rather overloaded name as SeND is a thing in IPv6 (Secure ND) which we also support and I think that name is confusing. Now I am aware that there is rip6_send() as well, which makes this even more confusing.

sys/sys/mbuf.h
499

I keep arguing with myself if we could not name it CSUM_* as it's just not that.

ae added inline comments.May 2 2019, 8:37 PM
sys/sys/mbuf.h
499

I also think this is not good enough, but it seems csum_flags is just bad name for what it is purposed. The comment says that it is for "checksum and offload features".

rgrimes added inline comments.May 2 2019, 8:44 PM
sys/sys/mbuf.h
472

I am putting this comment up here, hopefully it pulls this into view. Should we call these Outbound flags -> of_flags?

499

See above, the block comment calls these Outbound flags, so of_flags and OF_* seem more proper to me as well, though that is a massive churn. I could also go with OL_* for Off Loading flags.

jhb updated this revision to Diff 57284.May 10 2019, 7:45 PM
  • Rebase.
  • Remove checks that are currently always true.
  • Drop __inline.
This revision now requires review to proceed.May 10 2019, 7:45 PM
jhb marked 3 inline comments as done.May 10 2019, 7:55 PM

@bz any thoughts on the ip6_fragment change?

sys/netinet/ip_output.c
207

I went ahead and dropped these. In my testing the compiler ignored it anyway and inlined in the !RATELIMIT case and used a separate function in the RATELIMIT case.

sys/netinet6/ip6_output.c
280

Do you have any suggestions? ip6_output_output seems worse. Would you prefer ip6_output_send_packet, maybe ip6_output_transmit (I don't really like this as it is calling if_output, not if_transmit over in ip_output.c)? ip6_output_packet?

sys/sys/mbuf.h
472

That would be a lot of churn.

499

TSO isn't a checksum either. The ifnet field name of hwassist is better (and a send tag is really about a more generic vector of certain types of hardware assists). However, using CSUM_* seems most consistent with the field name in the packet header and the other constants. I think having a single field in csum_flags that doesn't match will be worse, and I think that renaming all of them is probably too much churn (almost every NIC driver would have to change).

jhb added a comment.May 10 2019, 7:57 PM

Whoops, missed bz's earlier note about ip6_fragment, so I will pull that out and merge it separately.

rgrimes accepted this revision as: rgrimes.May 10 2019, 7:59 PM
This revision is now accepted and ready to land.May 10 2019, 7:59 PM
This revision was automatically updated to reflect the committed changes.
jhb reopened this revision.May 10 2019, 8:16 PM
gallatin accepted this revision.May 11 2019, 12:52 PM
This revision is now accepted and ready to land.May 11 2019, 12:52 PM
jhb updated this revision to Diff 57661.May 21 2019, 11:10 PM
  • Rebase after ip6_fragment commit.
This revision now requires review to proceed.May 21 2019, 11:10 PM
gallatin accepted this revision.May 22 2019, 12:13 AM
This revision is now accepted and ready to land.May 22 2019, 12:13 AM
hselasky accepted this revision.May 22 2019, 7:41 AM
hselasky added inline comments.
sys/sys/mbuf.h
1229

You may want to put ()'s around the &'s before the &&'s.

jhb closed this revision.May 24 2019, 11:30 PM