iflib's prefetching of transmit descriptors actually hurts performance. I've tested 100G and 400G NICs in multiple generations of AMD and Intel servers, and I have not yet found a case where it improves things. In fact, it reduces peak bandwidth by 2-4%, increases cache misses, and increases memory bus traffic.
I considered adding a flag to disable it conditionally, but I'd rather remove the code entirely, as removing it simplifies the transmit path considerably. I'm also working on optimizing the cache behavior of the iflib transmit path, and removing this code eliminates a reference to ifc_flags from the hot path, which helps a lot with cache locality. (There are other references that I'll post a follow-up patch to remove after this is merged.)
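For context, the removed code follows roughly this pattern. This is a minimal sketch under simplified assumptions, not the actual iflib source: IFC_PREFETCH and ifc_flags are the iflib names, but the flag value and struct layouts here are stand-ins for illustration.

```c
#include <stdint.h>

/* Simplified stand-ins for illustration only. */
struct tx_desc { uint64_t addr; uint32_t cmd; uint32_t len; };
struct if_ctx  { uint32_t ifc_flags; /* ... many other fields ... */ };
#define IFC_PREFETCH 0x0200	/* assumed value, for illustration */

static inline void
txd_prefetch_next(struct if_ctx *ctx, struct tx_desc *txd, int pidx, int ntxd)
{
	/*
	 * Even when the prefetch is disabled, this test reads ifc_flags
	 * for every packet, pulling an extra cache line into the hot
	 * transmit path; deleting the code removes that reference too.
	 * ntxd is assumed to be a power of two.
	 */
	if (ctx->ifc_flags & IFC_PREFETCH) {
		__builtin_prefetch(&txd[(pidx + 1) & (ntxd - 1)]);
		__builtin_prefetch(&txd[(pidx + 2) & (ntxd - 1)]);
	}
}
```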
I have left the receive-side prefetching in place, as I don't have a good setup to test it.
Typical results from my UDP packet blaster (which sends full-MTU UDP segments directly to the driver) on a 7502P AMD EPYC, using the ice and bnxt drivers:
ice:
before (with prefetch):
1 thread: 31.2Gb/s, 19.8% L3 miss, 6.1GB/s memory bw
3 threads: 82.5Gb/s, 26.5% L3 miss, 15GB/s memory bw
after (prefetch removed):
1 thread: 31.6Gb/s, 18% L3 miss, 6.1GB/s memory bw
3 threads: 84.4Gb/s, 25% L3 miss, 14.8GB/s memory bw

bnxt:
before (with prefetch):
1 thread: 30.7Gb/s, 20% L3 miss, 4.6GB/s memory bw
3 threads: 82Gb/s, 26% L3 miss, 13.7GB/s memory bw
after (prefetch removed):
1 thread: 31.1Gb/s, 17% L3 miss, 4.5GB/s memory bw
3 threads: 88.8Gb/s, 19% L3 miss, 13.6GB/s memory bw
The 400G results are from a NIC which is still under NDA, so they cannot be shared. The L3 miss and memory bandwidth data are from AMDuProfPcm -m memory,l3.
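For anyone reproducing the counter numbers, the collection was essentially the command below, run while the blaster was active. Only -m memory,l3 is taken from the run above; -a (all cores) and -d (duration in seconds) are my assumptions about the uProf CLI and may vary by version.

```
AMDuProfPcm -m memory,l3 -a -d 30
```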