
Begin adding IPv4 fragment handling and IPv4/IPv6 UDP awareness to the RSS code.

Authored by adrian on Aug 2 2014, 11:36 PM.


  • Add new functions to in_rss.[ch] which implement software hashing for IPv4 frames that are aware of the currently configured RSS hash types.
  • Extend the ip_output() API to have a flag that disables overriding the mbuf with the inp flowid details.
  • Set the flowid details in the UDP IPv4 and IPv6 output paths by doing a software hash. This is currently not yet cached in the inpcb so it's recalculated each time.
  • Recalculate the flowid of a reassembled IPv4 fragmented datagram.
  • .. and reinject the datagram back into netisr for recalculation.
  • Shift the IPv4 netisr queue to use a CPU policy and teach it to use the rss_m2cpuid function to (re)-assign the frame to the correct queue as appropriate.
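The software hashing referred to above is the Toeplitz hash defined by the RSS specification, computed over the address/port tuple when the NIC has not supplied a flowid. A minimal userland sketch of the bit-serial Toeplitz computation follows; the function name and key handling are illustrative, not the actual in_rss.c API.

```c
/*
 * Userland sketch of the Toeplitz hash used by RSS.  For each set bit
 * of the input, XOR in the current 32-bit window of the key, sliding
 * the window one bit per input bit.  Illustrative only.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t
toeplitz_hash(const uint8_t *key, size_t keylen,
    const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0, v;
	size_t i, b;

	/* v holds the sliding 32-bit window into the key. */
	v = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | key[3];
	for (i = 0; i < datalen; i++) {
		for (b = 0; b < 8; b++) {
			if (data[i] & (0x80 >> b))
				hash ^= v;
			v <<= 1;
			/* Shift in the next key bit, if any remain. */
			if ((i + 4) < keylen &&
			    (key[i + 4] & (0x80 >> b)))
				v |= 1;
		}
	}
	return (hash);
}
```

With an all-zero input no key window is ever XOR'd in, so the hash is 0; with exactly one input bit set the hash equals the key window at that bit position.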

There are a few things that I know about but haven't yet implemented:

  • IPv6 software hashing;
  • IPv6 fragment path handling;
  • IPv4 and IPv6 de-capsulation and recalculation (IPSEC, various encapsulation protocols, etc.);
  • Automatically configuring bound netisr threads for the configured RSS parameters;
  • Ensuring that all fragments in the IPv4/IPv6 output path have the correct RSS hash information copied into the generated fragment mbufs.
Test Plan
  • IPv4 and IPv6 UDP send/receive on ixgbe(4) NIC
  • IPv4 fragments have been tested
  • (IPv6 fragments haven't yet been tested.)

Diff Detail

rS FreeBSD src repository - subversion
Lint Skipped
Unit Tests Skipped

Event Timeline

adrian retitled this revision from to Begin adding IPv4 fragment handling and IPv4/IPv6 UDP awareness to the RSS code..
adrian updated this object.
adrian edited the test plan for this revision. (Show Details)
adrian added reviewers: rwatson, hiren, gnn.

It would be better if these two topics were split into separate reviews.

508 ↗(On Diff #924)

I'd recommend, at a minimum, a #define for INGRESS and EGRESS. in/out is easy to get wrong depending on your viewpoint (into the NIC? into the stack?).
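The suggestion could be as simple as the following; these names are invented for illustration, not existing FreeBSD API.

```c
/*
 * Illustrative only: named direction constants instead of a bare
 * in/out flag, so the viewpoint is unambiguous at the call site.
 */
#define	RSS_DIR_INGRESS	1	/* frame arriving from the NIC into the stack */
#define	RSS_DIR_EGRESS	2	/* frame leaving the stack toward the NIC */
```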

123 ↗(On Diff #924)

The naming seems redundant: rss_hash_v4 would seem sufficient.

125 ↗(On Diff #924)

Ditto with naming.

131 ↗(On Diff #924)

If RSS is active, there is no option to direct-dispatch RSS-capable traffic or you risk sending a flow to the wrong PCB group.

156 ↗(On Diff #924)

Fragments should not be resubmitted to the stack: they have already been accounted for in the statistics. It would seem you need an additional netisr that submits the already verified/accounted frame directly to the inetsw (even the M_FASTFWD_OURS bypass will still double-count ips_delivered).

1178 ↗(On Diff #924)

This should either be a panic if RSS is defined and the packet can be hashed, or perhaps verify if the current PCB context is the one the packet belongs to.

1409 ↗(On Diff #924)


My comments about PCBGROUPs and how they deal with packets not belonging to a group's PCB list are based on a proprietary implementation that may behave differently. In any event, it would seem prudent not to do direct dispatch since that would seem to have a high chance of ending up in the wrong context/group.

156 ↗(On Diff #924)

On further thought, I think the FASTFWD is what you want, modulo the byte order/length expectations in the first conditional in ip_input. ips_delivered won't be double counted since the frame is netisr'd prior to reaching inetsw.

156 ↗(On Diff #924)

I think that yes, perhaps we do need an ip reinject netisr that things get queued to. I'm just worried about unintended queue behaviour between the IP and IP-reinject queues.

Thanks for picking that up though! I didn't even think about it double-accounting things.

So hm, does that mean that the various IP tunnel de-encapsulation paths are falling similarly afoul? They also re-queue into the NETISR_IP queue.

1178 ↗(On Diff #924)

This was mostly for debugging/evaluation purposes for the (current) UDP situation where it's 2-tuple hashed.

You're right though - if RSS is enabled then it should just netisr_dispatch() and not be configurable.

156 ↗(On Diff #924)

Tunnel decaps is entirely different: you're dealing with a completely new packet once the outer layer has been stripped: it hasn't had any input processing.

The requeueing is problematic. There has already been work done to process the packet, and you don't want to drop it because a netisr queue is full. That smells a lot like a DoS attack. A possible way to do this is to reserve a netisr queue slot for a reassembly entry, and only initiate a new reassembly if there is an available slot, so it is guaranteed that a successful reassembly will result in delivery to the transport layer.
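The slot-reservation idea above can be modelled in a few lines: reassembly only begins if a queue slot can be reserved up front, so a successfully reassembled datagram is guaranteed a place in the queue rather than being dropped. All names and the capacity are invented for illustration.

```c
/*
 * Userland model of reserving a netisr queue slot per reassembly.
 * A reassembly may only start if a slot is free; completing (or
 * abandoning) the reassembly releases the slot.  Illustrative only.
 */
#include <assert.h>
#include <stdbool.h>

static int reassembly_slots_free = 4;	/* capacity reserved in the queue */

static bool
reassembly_start(void)
{
	if (reassembly_slots_free == 0)
		return (false);	/* no slot: refuse to start reassembly */
	reassembly_slots_free--;
	return (true);
}

static void
reassembly_complete(void)
{
	/* Datagram handed to the queue; its reserved slot is released. */
	reassembly_slots_free++;
}
```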

Seems like it's easy to spin out of control with complexity in this area :(

123 ↗(On Diff #924)

The reason I have a verbose name is so stack/driver writers realise the function is actually doing a software hash if required, rather than just checking the hash value in the mbuf.

For example, rss_m2cpuid() doesn't do a software hash calculation at the moment.
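A minimal userland model of the hash-to-CPU step that an rss_m2cpuid()-style function performs: mask the RSS hash down to a bucket index, then look the bucket up in a bucket-to-CPU table. The bucket layout and names here are illustrative.

```c
/*
 * Sketch: map an RSS hash to a CPU id via a bucket table.
 * nbuckets is assumed to be a power of two.  Illustrative only.
 */
#include <assert.h>
#include <stdint.h>

static int
rss_hash2cpuid_model(uint32_t hash, int nbuckets, const int *bucket_cpu)
{
	return (bucket_cpu[hash & (uint32_t)(nbuckets - 1)]);
}
```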

adrian edited edge metadata.

Address some of the comments from grehan@ . Thanks!

Updating D527: Begin adding IPv4 fragment handling and IPv4/IPv6 UDP awareness to the RSS


647 ↗(On Diff #1082)

Insert options into the PCS scripts I wrote for RSS verification and see what happens!

384 ↗(On Diff #1082)

You're now making changes that are outside of the RSS conditional, so it seems like this needs additional review. My suggestion is to collapse the two tests of *_OURS into one to avoid the extra conditional test in a common path:

if (m->m_flags & (M_FASTFWD_OURS | M_REINJECT_OURS)) {

/* put individual tests here */
}


Also, the use of the reinjected flag doesn't seem to be necessary. There is no need to do *any* tests again: the fragment code has always passed reassembled fragments directly to the stack, and all the intercept points knew this. Seems less invasive to have a label just prior to the inetsw dispatch and goto that, rather than a seemingly arbitrary set of tests bracketed with the reinjected flag, and some without.
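The label-before-dispatch idea described above can be sketched as follows: reinjected frames jump straight to the protocol dispatch, skipping validation the fragment path already performed. The flag name, counters, and structure are invented for illustration.

```c
/*
 * Userland model of a reinject path that skips re-validation:
 * frames already verified before reassembly goto a label just
 * prior to the protocol dispatch.  Illustrative only.
 */
#include <assert.h>

#define	M_REINJECTED	0x01	/* hypothetical mbuf flag */

static unsigned validations, dispatches;

static void
ip_input_model(int mflags)
{
	if (mflags & M_REINJECTED)
		goto ours;	/* already verified before fragmentation */

	validations++;		/* header checks, pfil hooks, etc. */
ours:
	dispatches++;		/* hand the packet to inetsw/protocols */
}
```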

The real fix is to have a netisr (or other) dispatch that goes directly to the protocols. I'm not sure what the "unintended queue behaviour" is that you are referring to. Certainly the proprietary implementation I worked on did have a dispatch direct to protocols without any side effects.

748 ↗(On Diff #1082)

goto label for reinjected frames should go here.

748 ↗(On Diff #1082)

Right, but we should also at least run it through pfil once more, right?

Hm, thinking about it, it _isn't_ doing that for reassembled frames anyway. OK, I think I'll do this for now.

Updates from comments from Peter Grehan.

Updating D527: Begin adding IPv4 fragment handling and IPv4/IPv6 UDP awareness to the RSS


Update again - without the arcconfig update.

Updating D527: Begin adding IPv4 fragment handling and IPv4/IPv6 UDP awareness to the RSS


More feedback from Peter Grehan - migrate the IP reinject stuff to
a new netisr.

Updating D527: Begin adding IPv4 fragment handling and IPv4/IPv6 UDP awareness to the RSS


Add some more RSS validation and a new netisr path for reinjected
IP frames.

This includes some code to fetch and set the UDP frame RSS and
flowid information. I've been using this for doing userland
validation of packet handling.

Updating D527: Begin adding IPv4 fragment handling and IPv4/IPv6 UDP awareness to the RSS


So, I've thought about the multiple netisr dispatches a bit. My worry is that if some frames in a flow get a recalculated hash value (e.g. IP fragments -> large UDP frame -> 4-tuple UDP hashing, or IP fragments -> large TCP frame -> 4-tuple TCP hashing) we will end up with some frames going via the IP netisr path and the reassembled fragments going via the IP-direct netisr path. In theory they're on the same CPU, but I don't know if it'll always preserve the same order.

There is the possibility for frame reordering even with the netisr queue. Consider the case of a fragment train preceding a non-fragmented packet in the same connection. Other than the first one, the fragments will most likely arrive on a different queue than the following non-fragmented packet. By the time the last fragment has been reassembled and requeued, it is highly probable that the following packet has already been processed (depending on queue lengths in the adapter, and what work the pcbgroups are doing).

This could also happen externally in the network if there is L2/L3 forwarding on an aggregated link with a hash policy that includes the protocol.

TCP will be fine so long as this doesn't trigger fast retransmit, but given that fragmentation is going to be problematic even without RSS, this isn't going to be solved by falling back to the netisr queue.

There's also the issue of where you deal with traffic from non-RSS capable NICs. It seems reasonable to deal with that in ip_input after the packet has been validated, in which case the requeueing will have to be done at a transport layer. For this case, there are no reordering issues since presumably frames from that adapter are always going to end up in the same context before being requeued.

grehan edited edge metadata.

Please remove debug printfs before submitting.

Here's my general reservations:

  • there is leakage of stuff outside of the RSS conditional. If you are going to do development in head (and I don't see why this is the case: it's the perfect candidate for a branch) then at least everything in here that's new should be conditional (e.g. the API constants, some udp code).
  • What testing is being done with this? None? Lots? At the moment, no one knows. Since this is a fundamental change to the stack, it would be worthwhile to document it so it can at least be reproduced. Maybe the wiki is the place for this. Ditto for benchmarking.
  • as mentioned in comments, the API naming at least seems misleading. That's somewhat of my personal preference, but for things that are globally visible and will be around forever in a BSD API, it seems like there should be some more feedback than just mine and yours.
  • You've mentioned in person that you are not wed to having threads/applications bound to CPUs. I'd like to see that written down somewhere. The wiki implies that CPU pinning is going to be the way this works.
553 ↗(On Diff #1187)

I don't agree with the naming RSS_BUCKET. It's really a pcbgroup number, or 'connection group', or something more abstract. The name implies something that's h/w-related.

682 ↗(On Diff #1187)

Seems like an easy test to write.

This revision is now accepted and ready to land. (Aug 26 2014, 1:40 AM)
adrian updated this revision to Diff 1410.

Closed by commit rS271291 (authored by @adrian).