While mp_ring can provide excellent scalability when the number of cores exceeds the number of NIC tx rings, it can also greatly reduce performance in simpler, high packet rate scenarios, due to the extra CPU cycles and cache misses incurred by its complexity.
In testing on a 400GbE NIC in an AMD EPYC 7502P server, this simple tx routine is roughly 2.5x as fast as mp_ring (8Gb/s -> 20Gb/s) and 5x as fast as mp_ring with tx_abdicate=1 (4Gb/s) for a simple in-kernel packet generator, which is currently closed source. It also shows a 50% speedup (5Gb/s -> 8Gb/s) for a simple netperf -t UDP_STREAM test.
This change is mostly a noop, as it is not enabled by default. The one exception is the change to iflib_encap() to immediately reclaim completed tx descriptors, failing the transmit and scheduling a later reclaim only if iflib_completed_tx_reclaim() did not free enough descriptors.