Page MenuHomeFreeBSD

iflib: remove transmit prefetching
AcceptedPublic

Authored by gallatin on Mon, Nov 10, 6:21 PM.
Tags
None
Referenced Files
F135586814: D53674.id166147.diff
Tue, Nov 11, 2:01 AM
F135565439: D53674.id.diff
Mon, Nov 10, 9:06 PM
F135565125: D53674.diff
Mon, Nov 10, 9:01 PM
F135561745: D53674.id.diff
Mon, Nov 10, 8:17 PM
F135557308: D53674.diff
Mon, Nov 10, 7:20 PM

Details

Summary

iflib's prefetching transmit descriptors actually hurts performance. I've tested 100g and 400g NICs in multiple generations of AMD and Intel servers and I have not yet found a case where it improves things. In fact, it reduces peak bandwidth by 2-4%, increases cache misses, and increases memory bus traffic.

I was thinking to add a flag to conditionally disable it, but I'd rather just remove the code entirely, as removing it simplifies the transmit path greatly. Also, I'm working on optimizing the cache behavior of the iflib transit path, and removing this code removes a reference to ifc_flags from the hot path, which helps a lot with cache locality. (there are other references that I'll introduce a patch to remove after this is merged).

I have left receive side prefetching in-place as I don't have a good setup to test it.

Typical results from my UDP packet blaster (sends full-mtu UDP segments directly to the driver) on a 7502P AMD EPYC using ice and bnxt):
ice:
before (with prefetch)
1 thread: 31.2Gb/s 19.8% L3 miss, 6.1GB/s memory bw
3 thread: 82.5Gb/s 26.5% L3 miss, 15GB/s

after prefetch removed
1 thread: 31.6Gb/s 18% L3 miss, 61.GB/s mem bw
3 thread: 84.4Gb/s 25% L3 miss, 14.8GB/s

bnxt:
before: (with prefetch)
1 thread; 30.7Gb, 20% L3 miss, 4.6GB/s
3 thread: 82.gB/s 26% L3 miss, 13.7GB/s

after prefetch removed
1 thread: 31.1Gb/s 17% L3 miss, 4.5GB/s
3 thread: 88.8Gb/s 19% L3 miss, 13.6GB/s

400G results are from a NIC which is still under NDA and cannot be shared. L3 miss/mem bw data is from, AMDuProfPcm -m memory,l3

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

This looks legitimate.

The kernel does not have infra for static branches, so a tunable would not look good anyway. However, instead of just removing it I would propose just ifdefing the code for the time being, maybe start the file with:

/*
 * Uncomment to enable prefetching. Measurements in https://reviews.freebsd.org/D53674 show it hurts performance.
 */
//#define IFLIB_PREFETCH

Then interested parties can experiment should the choose to.

If mere existence of this code is making other work harder, then by all means remove it.

In D53674#1225304, @mjg wrote:

This looks legitimate.

The kernel does not have infra for static branches, so a tunable would not look good anyway. However, instead of just removing it I would propose just ifdefing the code for the time being, maybe start the file with:

/*
 * Uncomment to enable prefetching. Measurements in https://reviews.freebsd.org/D53674 show it hurts performance.
 */
//#define IFLIB_PREFETCH

Then interested parties can experiment should the choose to.

If mere existence of this code is making other work harder, then by all means remove it.

Do we know if there are any interested parties? It feels like iflib is suffering from a "tragedy of the commons" situation, where now that @shurd is gone, nobody really feels any ownership of it, and folks make tiny changes related to iflib's interaction with whatever they are working on. I tend to think this is at least partially driven by how hard to code is to understand, so I feel like anything that can make it simpler, and is performance neutral or better, will improve things. Improving performance and reducing the prematurely optimized, overly complex nature of the code has been my goal. I'm afraid I may end up owning this..

My feeling is that the code is in the repo, and anybody qualified to do these sorts of experiments would be qualified to revert this patch.

So I feel rather strongly that it should just go..

these also need to go:

sys/net/iflib.c:#define IFC_PREFETCH            0x040
sys/net/iflib.c:        if (rxq->ifr_ctx->ifc_flags & IFC_PREFETCH)
sys/net/iflib.c:        do_prefetch = (ctx->ifc_flags & IFC_PREFETCH);
sys/net/iflib.c:                ctx->ifc_flags |= IFC_PREFETCH;
In D53674#1225349, @mjg wrote:

these also need to go:

sys/net/iflib.c:#define IFC_PREFETCH            0x040
sys/net/iflib.c:        if (rxq->ifr_ctx->ifc_flags & IFC_PREFETCH)
sys/net/iflib.c:        do_prefetch = (ctx->ifc_flags & IFC_PREFETCH);
sys/net/iflib.c:                ctx->ifc_flags |= IFC_PREFETCH;

Yes, just as soon as I have a good receive-side benchmark. I prefer testing to justify anything I do is at least performance neutral. Right now, I have my UDP packet blaster, and the Netflix CDN workload, both of which are heavily transmit focused. The only receive side changes that I've made have been related to LRO, where I can measure an impact. [ EDIT ] And where I asked Olivier to confirm his packet forwarding benchmark on slow hardware didn't show regressions. However, this only applies to 10G or faster NICs, so his testbed would not be helpful..

This revision is now accepted and ready to land.Mon, Nov 10, 8:42 PM