
Bring back AIM (Adaptive Interrupt Moderation) that was lost in IFLIB migration.
ClosedPublic

Authored by stallamr_netapp.com on Nov 23 2020, 7:58 PM.

Details

Summary

AIM was part of the BSD11 driver. The feature was lost in the IFLIB migration; this change re-introduces AIM into the IFLIB-based IXGBE driver.
One caveat is that in the BSD11 driver a queue comprised both an Rx and a Tx ring. Starting from BSD12, Rx and Tx have their own queues and rings, and the IRQ is now configured only for the Rx side. So, with AIM re-enabled, we should consider only Rx stats when programming the EITR register, in contrast to BSD11, where both Rx and Tx stats were used to adjust EITR.
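To make the summary concrete, here is a minimal sketch of AIM-style throttling driven purely by Rx stats, as described above. All names, thresholds, and the EITR encoding are hypothetical illustrations, not the actual if_ix.c code: the idea is simply that the average Rx frame size since the last interrupt selects how aggressively to throttle.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical AIM sketch: pick a new EITR-style throttle value from
 * the Rx bytes/packets accumulated since the last interrupt. Larger
 * average frames suggest bulk traffic, where throttling harder lets
 * LRO aggregate more; small frames favor low latency. The threshold
 * and return values below are illustrative only.
 */
#define AIM_LOW_LATENCY   128   /* small frames: interrupt quickly */
#define AIM_AVG_LATENCY   400
#define AIM_BULK_LATENCY  1200  /* big frames: throttle for LRO */

static uint32_t
aim_pick_itr(uint32_t rx_bytes, uint32_t rx_packets)
{
	uint32_t avg;

	if (rx_packets == 0)
		return (AIM_LOW_LATENCY);  /* idle: stay responsive */
	avg = rx_bytes / rx_packets;       /* avg frame size this interval */
	if (avg > 1200)
		return (AIM_BULK_LATENCY);
	else if (avg > 300)
		return (AIM_AVG_LATENCY);
	return (AIM_LOW_LATENCY);
}
```

Note how only Rx counters feed the decision, matching the BSD12 queue split described above.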

Test Plan

Unit-tested with the AIM feature enabled on NetApp platforms with physical NICs.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

Can you provide a justification in terms of performance and performance stability? This was dropped because it was believed to be unnecessary with the newer driver software queuing architecture.

At NetApp, in a 64K sequential write load test with one of our proprietary applications, we noticed that the average LRO packet/segment size gets shorter. We also observed that average LRO segment size is proportional to interrupt throttling. For our application to run well, we need the average LRO segment size to be on the higher side. With AIM enabled (and with another Intel patch), the average LRO size now comes close to the BSD11 average LRO segment size for the same application.

So, we are introducing AIM back with the default being disabled; it's business as usual for the rest of the community.

I would rather see this fixed by implementing LRO mbuf sorting in iflib, a la cxgbe and mlx5en. It's a small code change like https://freshbsd.org/commit/freebsd/src/317041. Would you be willing to look at this, or discuss with Intel doing it on your behalf?

Sure, thanks Kevin. I will discuss with Intel engineers.

I would rather see this fixed by implementing LRO mbuf sorting in iflib, a la cxgbe and mlx5en. It's a small code change like https://freshbsd.org/commit/freebsd/src/317041. Would you be willing to look at this, or discuss with Intel doing it on your behalf?

Note that I tried to implement RSS-assisted LRO in iflib ("mbuf sorting") a while back. The first step is undoing some of the mbuf queueing that happens in iflib. When doing so, @olivier's fleet of terrible machines showed regressions in packet-forwarding performance, and this is where I paused.

Also note that NetApp's problem is likely different from the problems experienced by a CDN. In the CDN case, mbuf sorting is helpful, as it allows aggregation of packets from the same TCP session that have many intermediate packets between them. This is common for a CDN, where you have tens or hundreds of thousands of connections. My assumption is that NetApp has just a few active connections, so mbuf sorting won't help them at all, and all they need is simply a longer interrupt timer.

Approving for just myself for now until the discussion regarding AIM / LRO is wrapped up.

This revision is now accepted and ready to land.Nov 24 2020, 3:25 PM

A few more thoughts based on a conversation I had with somebody: in order for mbuf sorting to be effective, you need to have a large number of packets received per irq. Sorting a dozen mbufs is not useful when you're already aggregating 64 different connections, for example. So AIM (or very large irq coalescing timer/packet settings) is required to get enough packets per irq to make mbuf sorting worthwhile.

To be clear, I think mbuf sorting is something we want, and something which could be helpful. I just think it may not be the answer to NetApp's problem, and it's not fair to suggest it as an alternative to AIM. In fact, the proposed solution of restoring AIM can almost be viewed as a prerequisite for mbuf sorting.

@gallatin I don't have the resources to fix it, so I'll just say that one of the things we were trying to do is get away from expert knob-turning for common use cases, which I would lump both a NetApp filer and a CDN into for FreeBSD. iflib is supposed to not needlessly rearm interrupts while it is processing and there is no more work, so we should have some moderation of interrupts already, or we need to add a callback into the framework so hardware like this can further modulate if necessary. I don't like bringing back a driver-only hack instead of holistically solving this. Nonetheless, if I am the only person who cares about that kind of thing, I think my assumptions need to be retuned and I'll go away.

@gallatin I don't have the resources to fix it, so I'll just say that one of the things we were trying to do is get away from expert knob-turning for common use cases, which I would lump both a NetApp filer and a CDN into for FreeBSD. iflib is supposed to not needlessly rearm interrupts while it is processing and there is no more work, so we should have some moderation of interrupts already, or we need to add a callback into the framework so hardware like this can further modulate if necessary. I don't like bringing back a driver-only hack instead of holistically solving this. Nonetheless, if I am the only person who cares about that kind of thing, I think my assumptions need to be retuned and I'll go away.

At least on the real hardware I've used with iflib, with both ix and ixl, one needs to adjust irq moderation manually in order to get a decent number of packets per irq. I honestly think AIM might be the best path toward fewer tuning knobs, if it were the default with a wider range of min and max irq rates.

And I do think that we need to enable Hans's RSS-assisted LRO. I just think it's not the solution here.

sys/dev/ixgbe/if_ix.c
2185

Rather than a goto I think it would be nicer to lift this code into a separate function that gets called here.
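The refactor suggested here can be illustrated with a small standalone sketch. The struct fields and function names below are hypothetical stand-ins, not the actual if_ix.c code: the point is only that the AIM bookkeeping becomes a single helper the interrupt handler calls, instead of a goto within it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative helper along the lines suggested: hoist the AIM update
 * out of the interrupt handler. Computes the next throttle value from
 * the interval's Rx stats and resets the counters for the next
 * interrupt. Names and fields are hypothetical.
 */
struct rx_ring_stats {
	uint32_t bytes;
	uint32_t packets;
};

static uint32_t
ixgbe_aim_update(struct rx_ring_stats *rxr, uint32_t olditr)
{
	uint32_t newitr = olditr;	/* keep old value on an idle queue */

	if (rxr->packets != 0)
		newitr = rxr->bytes / rxr->packets;
	/* Reset the interval counters for the next interrupt. */
	rxr->bytes = 0;
	rxr->packets = 0;
	return (newitr);
}
```

Since the helper is static and called from one place, the compiler will typically inline it, so there is no extra call overhead in the interrupt path.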

2205

newitr is always 0 here.

I am nervous about the lack of atomicity with respect to rxr->packets and rxr->bytes. They are 32-bit values and could conceivably overflow, resulting in a division-by-zero. I would suggest loading the values with atomic_load_int() before doing any math with them, to ensure that you're operating on a consistent snapshot of the values.
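The snapshot idea can be shown in a small userspace sketch using C11 atomics; the kernel equivalent would be atomic_load_int() from machine/atomic.h. Variable and function names here are hypothetical. The point is to load each counter once into a local, then do all arithmetic on the locals, so a concurrent update cannot produce an inconsistent bytes/packets pair or a division by zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-ring counters, updated elsewhere. */
static _Atomic uint32_t rxr_bytes;
static _Atomic uint32_t rxr_packets;

static uint32_t
avg_frame_size(void)
{
	/* Snapshot both counters before doing any math on them. */
	uint32_t bytes = atomic_load(&rxr_bytes);
	uint32_t packets = atomic_load(&rxr_packets);

	if (packets == 0)
		return (0);	/* guard the division */
	return (bytes / packets);
}
```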

2230

I believe these counters may be modified concurrently. If so, these updates may be lost. Is that acceptable? If so I think it's worth a comment.

sys/dev/ixgbe/if_ix.c
2185

This is old BSD11 code; in my opinion, I see no harm in keeping it as-is. Wouldn't a new function call in interrupt context be a little unwarranted overhead?

2205

Why do you think newitr will be 0?

In any possible case we will have bytes >= packets, so at worst newitr will be 1. I do not see how it can be 0.

Next, these variables are u32, and we disable the IRQ, process the rxq, and then re-enable the IRQ.
Our max rx queue length is going to be, say, 4096 or 8192, and a packet cannot span more than 9 mbuf clusters.
I presume you are thinking of a case where a single IRQ is able to process UINT32_MAX packets? That seems very unlikely.

As for atomicity: these bytes/packets counters get incremented when we process the IRQ. So the fact that we are in the IRQ routine means these bytes/packets are a snapshot up until the previous IRQ. I do not see the possibility of a race between IRQs and Rx processing; we do Rx processing with the IRQ disabled.

Also, there was no atomicity in BSD11 either.

2230

No, these counters do not get modified concurrently. As said in the comment above, we disable the IRQ, process the Rx queue, and then re-enable the IRQ.

At least on the real hardware I've used with iflib, with both ix and ixl, one needs to adjust irq moderation manually in order to get a decent number of packets per irq. I honestly think AIM might be the best path toward fewer tuning knobs, if it were the default with a wider range of min and max irq rates.

This is the patch Sai mentioned: https://reviews.freebsd.org/D27465. The problem is that, at least for some adapters, we have no influence on the number of RX descriptors that are written back at once by the HW. Using interrupts with moderation allows more fragments to be aggregated in a single LRO transaction and makes processing less CPU-intensive. It still requires more testing with different HW and workloads, so we made it configurable with a sysctl.

iflib is supposed to not needlessly rearm interrupts while it is processing and there is no more work, so we should have some moderation of interrupts already

In this case interrupts are not rearmed. iflib_rxd_avail always finds at least one ready descriptor, so the RX task is re-scheduled. But in the next run it finds only 2-3 descriptors to process, which makes LRO quite ineffective.

or we need to add a callback into the framework so hardware like this can further modulate if necessary. I don't like bringing back a driver-only hack instead of holistically solving this.
I like the idea, but I think we should first test more with different HW and, if possible, with real-life workloads.

sys/dev/ixgbe/if_ix.c
2185

Most of the code in this function is for dealing with AIM, an optional feature. IMO it is simply easier to read the interrupt handler code if AIM code is in a separate function. It will be inlined by the compiler so there is no runtime difference. I don't think "it was this way before" is a good argument against making changes.

2205

I mean, we have:

newitr = 0;
...
newitr = max(newitr, ...);

so the max() is redundant. If it is possible for rxr->bytes / rxr->packets to be negative (I presume not), then this should be using imin() instead.

With respect to the counters, I see that rxr->packets and rxr->bytes are incremented from iflib_rxeof(), which indeed disables queue interrupts, so I think you're right. This deserves a comment IMO.
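The redundancy being pointed out can be shown in a minimal standalone sketch (the function and its exact arithmetic are hypothetical, not the driver's code): with unsigned counters, max(0, x) is always x, so initializing to 0 and then taking the max adds nothing over a direct assignment.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: since bytes/packets are unsigned, the quotient is
 * never negative, so a direct assignment gives the same result as
 * max(newitr, bytes / packets) when newitr starts at 0.
 */
static uint32_t
compute_newitr(uint32_t bytes, uint32_t packets)
{
	uint32_t newitr = 0;

	if (packets != 0)
		newitr = bytes / packets;  /* same as max(0, quotient) */
	return (newitr);
}
```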

  • Move AIM functionality into its own function to keep interrupt handler code simple
This revision now requires review to proceed.Dec 9 2020, 2:36 AM

Addressed Mark J comments.

This revision is now accepted and ready to land.Dec 21 2020, 7:30 PM

Review https://reviews.freebsd.org/D27465 helps to improve latency overall and has no dependency on this revision.

For NetApp, review D27465 and D27344 together helped to reduce latency by ~1.2ms.

Is this going to get committed soon? Possibly for inclusion in 13.0?

In D27344#649966, @erj wrote:

Is this going to get committed soon? Possibly for inclusion in 13.0?

If you have no objection, I'll commit and get it into 13.0.

This revision was automatically updated to reflect the committed changes.