
iflib: Drop tx lock when freeing mbufs using simple_transmit
Needs Review · Public

Authored by gallatin on Tue, Dec 23, 10:21 PM.
Details

Reviewers
shurd
erj
glebius
markj
kgalazka
kbowling
Group Reviewers
iflib
Summary

Freeing completed transmit mbufs can be time-consuming (the mbufs are
cold in cache, and their ext free routines may take locks), especially
when tx completions are batched. Doing this while holding the tx ring
mutex causes lock contention on that mutex when using
iflib_simple_transmit.

To resolve this, this patch opportunistically copies completed mbuf
pointers into a new array (ifsd_m_defer) so they can be freed after
dropping the transmit mutex. The ifsd_m_defer array is used
opportunistically and may be NULL; if it's NULL, mbufs are freed
inline as before. The array pointer is atomically nulled while a
thread is using it, and atomically restored when the freeing thread is
done with it. Using atomics here avoids re-acquiring and releasing the
tx lock just to restore the array after freeing the mbufs.
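
Roughly, the claim/free/restore sequence looks like the sketch below.
This is illustrative only: apart from ifsd_m_defer, the structure
layout, names, and signatures are assumptions rather than the actual
patch, though the atomics are FreeBSD's standard atomic(9) primitives.

/*
 * Sketch only: apart from ifsd_m_defer, the names and layout here
 * are illustrative assumptions, not the actual patch.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <machine/atomic.h>

struct sketch_txq {
	struct mtx	  ift_mtx;	/* tx ring mutex */
	struct mbuf	**ifsd_m;	/* per-descriptor mbufs */
	struct mbuf	**ifsd_m_defer;	/* NULL while claimed */
	int		  ift_size;	/* ring size, power of 2 */
};

/* Called with the tx mutex held; drops it before returning. */
static void
sketch_tx_reclaim(struct sketch_txq *txq, int cidx, int ndesc)
{
	struct mbuf **defer, *m;
	int i, idx, n;

	mtx_assert(&txq->ift_mtx, MA_OWNED);

	/*
	 * Opportunistically claim the defer array.  Reading NULL
	 * means another thread owns it (or the feature is off), in
	 * which case we fall back to freeing under the lock.
	 */
	defer = (struct mbuf **)atomic_swap_ptr(
	    (volatile uintptr_t *)&txq->ifsd_m_defer, (uintptr_t)NULL);

	for (i = 0, n = 0; i < ndesc; i++) {
		idx = (cidx + i) & (txq->ift_size - 1);
		m = txq->ifsd_m[idx];
		txq->ifsd_m[idx] = NULL;
		if (m == NULL)
			continue;
		if (defer != NULL)
			defer[n++] = m;		/* free after unlock */
		else
			m_freem(m);		/* old inline path */
	}

	mtx_unlock(&txq->ift_mtx);

	if (defer != NULL) {
		while (n-- > 0)
			m_freem(defer[n]);
		/*
		 * Release store: our frees complete before the array
		 * becomes visible to the next reclaiming thread.
		 */
		atomic_store_rel_ptr(
		    (volatile uintptr_t *)&txq->ifsd_m_defer,
		    (uintptr_t)defer);
	}
}

The swap means at most one thread owns the defer array at a time; a
second concurrent reclaimer simply sees NULL and takes the inline
path, so no extra locking is needed.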

Since we're no longer always freeing mbufs inline, peeking into one to
see whether its transmit used TSO would cause a useless cache miss, as
nothing else in the mbuf is likely to be accessed soon. To avoid that
cache miss, we encode a TSO/non-TSO flag in the low bits of the mbuf
pointer stored in the ifsd_m array. Note that the IFLIB_NO_TSO flag
exists primarily for sanity checking and debugging.
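
The encoding is the conventional low-bit pointer tag: mbufs are at
least pointer-aligned, so the low bit of a valid mbuf pointer is
always zero. A sketch, with illustrative names (only the technique
itself is from the patch):

/* Sketch: names are illustrative; only the technique is from the patch. */
#include <sys/param.h>
#include <sys/mbuf.h>

#define	SKETCH_M_TSO	((uintptr_t)0x1)	/* low tag bit */

static inline struct mbuf *
sketch_m_encode(struct mbuf *m, bool tso)
{
	return ((struct mbuf *)((uintptr_t)m |
	    (tso ? SKETCH_M_TSO : 0)));
}

static inline struct mbuf *
sketch_m_decode(struct mbuf *stored, bool *tso)
{
	*tso = ((uintptr_t)stored & SKETCH_M_TSO) != 0;
	return ((struct mbuf *)((uintptr_t)stored & ~SKETCH_M_TSO));
}

At reclaim time the tag tells the caller whether the transmit used
TSO, presumably so the right busdma resources (see the busdma note in
the comments below) can be selected without dereferencing the mbuf.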

iflib_completed_tx_reclaim() was refactored to break out
iflib_txq_can_reclaim() and _iflib_completed_tx_reclaim(),
so that the tx routine can call iflib_tx_credits_update()
just once, rather than twice.
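
The split is roughly the following shape (a sketch against iflib's
internal types, where iflib_txq_t is private to sys/net/iflib.c; the
signatures are assumptions, and only the function names and the
single-credits-update idea come from the patch):

/* Sketch: signatures are assumptions; only the split is from the patch. */
static bool
iflib_txq_can_reclaim(iflib_txq_t txq, int thresh, int credits)
{
	return (credits >= thresh);
}

static int
_iflib_completed_tx_reclaim(if_ctx_t ctx, iflib_txq_t txq, int credits)
{
	/* ... free the mbufs for `credits` completed descriptors ... */
	return (credits);
}

static int
iflib_completed_tx_reclaim(if_ctx_t ctx, iflib_txq_t txq, int thresh)
{
	/* One credits update shared by the check and the reclaim. */
	int credits = iflib_tx_credits_update(ctx, txq);

	if (!iflib_txq_can_reclaim(txq, thresh, credits))
		return (0);
	return (_iflib_completed_tx_reclaim(ctx, txq, credits));
}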

Note that deferred mbuf freeing is not enabled by default, and can be
enabled using the dev.$DEV.$UNIT.iflib.tx_defer_mfree sysctl.
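For example, enabling it on a hypothetical unit 0 of an "mce" driver
would look like sysctl dev.mce.0.iflib.tx_defer_mfree=1 (the device
and unit names here are illustrative).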

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

I've only been able to test this on amd64, and I'm pretty bad at atomic acq/rel semantics. If somebody could review that, I'd appreciate it.

Have you tried chaining mbufs instead of having an array? Does that have negative cache effect?

How big is the array? Can it be placed on stack achieving same level of batching?

> Have you tried chaining mbufs instead of having an array? Does that have negative cache effect?

Yes; chained mbufs cause a cache miss for each mbuf when queuing, and again when dequeuing.

> How big is the array? Can it be placed on stack achieving same level of batching?

The array is the same size as the tx ring, so that we can free an entire ring's worth of transmits at once if needed.
Iflib supports ring sizes up to 32k entries, so that's 32k * 8 bytes per pointer, or 256KB. That's too big for the stack, I think.

I don't like the atomics either, but I'm fairly sure that is the cheapest way to do this.

What really bugs me is the mess with the different busdma resources for TSO and non-TSO. No driver I've ever written has needed this. I'd like to clean it up, but there are so many drivers using iflib, and I'm sure that at least one must actually need it.