
Bring back AIM (Adaptive Interrupt Moderation) that was lost in IFLIB migration.
ClosedPublic

Authored by stallamr_netapp.com on Nov 23 2020, 7:58 PM.

Details

Summary

AIM was part of the BSD11 driver. The feature was lost in the IFLIB migration; this change re-introduces AIM into the IFLIB-based IXGBE driver.
One caveat is that in the BSD11 driver a queue comprised both an Rx and a Tx ring. Starting from BSD12, Rx and Tx have their own queues and rings, and the IRQ is now configured only for the Rx side. So, with AIM re-enabled, we should consider only Rx stats when programming the EITR register, in contrast to BSD11, where both Rx and Tx stats were used to adjust EITR.
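To make the summary concrete, here is a minimal sketch of AIM-style throttling driven purely by Rx stats, as described above. All names, thresholds, and the EITR encoding are hypothetical illustrations, not the actual if_ix.c code: the idea is simply that the average Rx frame size since the last interrupt selects how aggressively to throttle.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical AIM sketch: pick a new EITR-style throttle value from
 * the Rx bytes/packets accumulated since the last interrupt. Larger
 * average frames suggest bulk traffic, where throttling harder lets
 * LRO aggregate more; small frames favor low latency. The threshold
 * and return values below are illustrative only.
 */
#define AIM_LOW_LATENCY   128   /* small frames: interrupt quickly */
#define AIM_AVG_LATENCY   400
#define AIM_BULK_LATENCY  1200  /* big frames: throttle for LRO */

static uint32_t
aim_pick_itr(uint32_t rx_bytes, uint32_t rx_packets)
{
	uint32_t avg;

	if (rx_packets == 0)
		return (AIM_LOW_LATENCY);  /* idle: stay responsive */
	avg = rx_bytes / rx_packets;       /* avg frame size this interval */
	if (avg > 1200)
		return (AIM_BULK_LATENCY);
	else if (avg > 300)
		return (AIM_AVG_LATENCY);
	return (AIM_LOW_LATENCY);
}
```

Note how only Rx counters feed the decision, matching the BSD12 queue split described above.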

Test Plan

Unit-tested with the AIM feature enabled on NetApp platforms with physical NICs.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

Can you provide a justification in terms of performance and performance stability? This was dropped because it was believed to be unnecessary with the newer driver software queuing architecture.

At NetApp, in a 64K sequential write load test with one of our proprietary applications, we noticed that the average LRO packet/segment size gets shorter. We also observed that average LRO segment size is proportional to interrupt throttling. For our application to run well, we need the average LRO segment size to be on the higher side. With AIM enabled (and with another Intel patch), the average LRO size now comes close to the BSD11 average LRO segment size for the same application.

So, we are introducing AIM back with the default being disabled; it's business as usual for the rest of the community.

I would rather see this fixed by implementing LRO mbuf sorting in iflib, a la cxgbe and mlx5en. It's a small code change like https://freshbsd.org/commit/freebsd/src/317041. Would you be willing to look at this, or discuss with Intel doing it on your behalf?

Sure, thanks Kevin. I will discuss with Intel engineers.

I would rather see this fixed by implementing LRO mbuf sorting in iflib, a la cxgbe and mlx5en. It's a small code change like https://freshbsd.org/commit/freebsd/src/317041. Would you be willing to look at this, or discuss with Intel doing it on your behalf?

Note that I tried to implement RSS-assisted LRO in iflib ("mbuf sorting") a while back. The first step is undoing some of the mbuf queueing that happens in iflib. When doing so, @olivier's fleet of terrible machines showed regressions in packet-forwarding performance, and this is where I paused.

Also note that NetApp's problem is likely different from the problems experienced by a CDN. In the CDN case, mbuf sorting is helpful, as it allows aggregation of packets from the same TCP session that have many intermediate packets between them. This is common for a CDN, where you have tens or hundreds of thousands of connections. My assumption is that NetApp has just a few active connections, so mbuf sorting won't help them at all, and all they need is simply a longer interrupt timer.

Approving for just myself for now until the discussion regarding AIM / LRO is wrapped up.

This revision is now accepted and ready to land.Nov 24 2020, 3:25 PM

A few more thoughts based on a conversation I had with somebody: in order for mbuf sorting to be effective, you need to have a large number of packets received per irq. Sorting a dozen mbufs is not useful when you're already aggregating 64 different connections, for example. So AIM (or very large irq coalescing timer/packet settings) is required to get enough packets per irq to make mbuf sorting worthwhile.

To be clear, I think mbuf sorting is something we want, and something which could be helpful. I just think it may not be the answer to NetApp's problem, and it's not fair to suggest it as an alternative to AIM. In fact, the proposed solution of restoring AIM can almost be viewed as a prerequisite for mbuf sorting.

@gallatin I don't have the resources to fix it, so I'll just say that one of the things we were trying to do is get away from expert knob-turning for common use cases, which I would lump both a NetApp filer and a CDN into for FreeBSD. iflib is supposed to not needlessly rearm interrupts while it is processing and there is no more work, so we should have some moderation of interrupts already, or we need to add a callback into the framework so hardware like this can further modulate if necessary. I don't like bringing back a driver-only hack instead of holistically solving this. Nonetheless, if I am the only person who cares about that kind of thing, I think my assumptions need to be retuned and I'll go away.

@gallatin I don't have the resources to fix it, so I'll just say that one of the things we were trying to do is get away from expert knob-turning for common use cases, which I would lump both a NetApp filer and a CDN into for FreeBSD. iflib is supposed to not needlessly rearm interrupts while it is processing and there is no more work, so we should have some moderation of interrupts already, or we need to add a callback into the framework so hardware like this can further modulate if necessary. I don't like bringing back a driver-only hack instead of holistically solving this. Nonetheless, if I am the only person who cares about that kind of thing, I think my assumptions need to be retuned and I'll go away.

At least on the real hardware I've used with iflib, with both ix and ixl, one needs to adjust irq moderation manually in order to get a decent number of packets per irq. I honestly think AIM might be the best path toward fewer tuning knobs, if it were the default with a wider range of min and max irq rates.

And I do think that we need to enable Hans's RSS-assisted LRO. I just think it's not the solution here.

sys/dev/ixgbe/if_ix.c
2185

Rather than a goto I think it would be nicer to lift this code into a separate function that gets called here.
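The refactor suggested here can be illustrated with a small standalone sketch. The struct fields and function names below are hypothetical stand-ins, not the actual if_ix.c code: the point is only that the AIM bookkeeping becomes a single helper the interrupt handler calls, instead of a goto within it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative helper along the lines suggested: hoist the AIM update
 * out of the interrupt handler. Computes the next throttle value from
 * the interval's Rx stats and resets the counters for the next
 * interrupt. Names and fields are hypothetical.
 */
struct rx_ring_stats {
	uint32_t bytes;
	uint32_t packets;
};

static uint32_t
ixgbe_aim_update(struct rx_ring_stats *rxr, uint32_t olditr)
{
	uint32_t newitr = olditr;	/* keep old value on an idle queue */

	if (rxr->packets != 0)
		newitr = rxr->bytes / rxr->packets;
	/* Reset the interval counters for the next interrupt. */
	rxr->bytes = 0;
	rxr->packets = 0;
	return (newitr);
}
```

Since the helper is static and called from one place, the compiler will typically inline it, so there is no extra call overhead in the interrupt path.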

2205

newitr is always 0 here.

I am nervous about the lack of atomicity with respect to rxr->packets and rxr->bytes. They are 32-bit values and could conceivably overflow, resulting in a division-by-zero. I would suggest loading the values with atomic_load_int() before doing any math with them, to ensure that you're operating on a consistent snapshot of the values.
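The snapshot idea can be shown in a small userspace sketch using C11 atomics; the kernel equivalent would be atomic_load_int() from machine/atomic.h. Variable and function names here are hypothetical. The point is to load each counter once into a local, then do all arithmetic on the locals, so a concurrent update cannot produce an inconsistent bytes/packets pair or a division by zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-ring counters, updated elsewhere. */
static _Atomic uint32_t rxr_bytes;
static _Atomic uint32_t rxr_packets;

static uint32_t
avg_frame_size(void)
{
	/* Snapshot both counters before doing any math on them. */
	uint32_t bytes = atomic_load(&rxr_bytes);
	uint32_t packets = atomic_load(&rxr_packets);

	if (packets == 0)
		return (0);	/* guard the division */
	return (bytes / packets);
}
```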

2230

I believe these counters may be modified concurrently. If so, these updates may be lost. Is that acceptable? If so I think it's worth a comment.

sys/dev/ixgbe/if_ix.c
2185

This is old BSD11 code; in my opinion, I see no harm in keeping it as-is. Wouldn't a new function call in interrupt context be a little unwarranted overhead?

2205

Why do you think newitr will be 0?

In any possible case we will have bytes >= packets, so at worst newitr will be 1. I do not see how it can be 0.

Next, these variables are u32, and we disable the IRQ, process the rxq, and then re-enable the IRQ.
Our max rx queue length is going to be, say, 4096 or 8192, and a packet cannot span more than 9 mbuf clusters.
I presume you are thinking of a case where a single IRQ is able to process UINT32_MAX packets? That seems very unlikely.

As for atomicity: these bytes/packets counters get incremented when we process the IRQ. So the fact that we are in the IRQ routine means these bytes/packets are a snapshot up until the previous IRQ. I do not see the possibility of a race between IRQs and Rx processing; we do Rx processing with the IRQ disabled.

Also, there was no atomicity in BSD11 either.

2230

No, these counters do not get modified concurrently. As said in the comment above, we disable the IRQ, process the Rx queue, and then re-enable the IRQ.

At least on the real hardware I've used with iflib, with both ix and ixl, one needs to adjust irq moderation manually in order to get a decent number of packets per irq. I honestly think AIM might be the best path toward fewer tuning knobs, if it were the default with a wider range of min and max irq rates.

This is the patch Sai mentioned: https://reviews.freebsd.org/D27465. The problem is that, at least for some adapters, we have no influence on the number of RX descriptors that are written back at once by the HW. Using interrupts with moderation allows more fragments to be aggregated in a single LRO transaction and makes processing less CPU-intensive. It still requires more testing with different HW and workloads, so we made it configurable with a sysctl.

iflib is supposed to not needlessly rearm interrupts while it is processing and there is no more work, so we should have some moderation of interrupts already

In this case interrupts are not rearmed. iflib_rxd_avail always finds at least one ready descriptor, so the RX task is re-scheduled. But in the next run it finds only 2-3 descriptors to process, which makes LRO quite ineffective.

or we need to add a callback into the framework so hardware like this can further modulate if necessary. I don't like bringing back a driver-only hack instead of holistically solving this.
I like the idea, but I think we should first test more with different HW and, if possible, with real-life workloads.

sys/dev/ixgbe/if_ix.c
2185

Most of the code in this function is for dealing with AIM, an optional feature. IMO it is simply easier to read the interrupt handler code if AIM code is in a separate function. It will be inlined by the compiler so there is no runtime difference. I don't think "it was this way before" is a good argument against making changes.

2205

I mean, we have:

newitr = 0;
...
newitr = max(newitr, ...);

so the max() is redundant. If it is possible for rxr->bytes / rxr->packets to be negative (I presume not), then this should be using imin() instead.

With respect to the counters, I see that rxr->packets and rxr->bytes are incremented from iflib_rxeof(), which indeed disables queue interrupts, so I think you're right. This deserves a comment IMO.
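The redundancy being pointed out can be shown in a minimal standalone sketch (the function and its exact arithmetic are hypothetical, not the driver's code): with unsigned counters, max(0, x) is always x, so initializing to 0 and then taking the max adds nothing over a direct assignment.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: since bytes/packets are unsigned, the quotient is
 * never negative, so a direct assignment gives the same result as
 * max(newitr, bytes / packets) when newitr starts at 0.
 */
static uint32_t
compute_newitr(uint32_t bytes, uint32_t packets)
{
	uint32_t newitr = 0;

	if (packets != 0)
		newitr = bytes / packets;  /* same as max(0, quotient) */
	return (newitr);
}
```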

  • Move AIM functionality into its own function to keep interrupt handler code simple
This revision now requires review to proceed.Dec 9 2020, 2:36 AM

Addressed Mark J comments.

This revision is now accepted and ready to land.Dec 21 2020, 7:30 PM

Review https://reviews.freebsd.org/D27465 helps to improve latency overall and has no dependency on this revision.

For NetApp, review D27465 and D27344 together helped to reduce latency by ~1.2ms.

Is this going to get committed soon? Possibly for inclusion in 13.0?

In D27344#649966, @erj wrote:

Is this going to get committed soon? Possibly for inclusion in 13.0?

If you have no objection, I'll commit and get it into 13.0.

This revision was automatically updated to reflect the committed changes.