Page MenuHomeFreeBSD

Store mbuf cluster reference count in the first mbuf.
Needs ReviewPublic

Authored by glebius on Feb 22 2016, 7:24 PM.

Details

Reviewers
rrs
Group Reviewers
network
Summary

Storing refcounts one after one appears to be a cache line
collision provoking. Having extra UMA zone for external refcounts
is even worse - it requires additional allocation.

The idea is to store mbuf refcount in the first mbuf to reference
a cluster. This may sometimes delay freeing the first mbuf, since
it goes free only when cluster is free. But the win is bigger than
the loss.

The UMA_ZONE_REFCOUNT code is wiped from UMA. jeff@ personally
gave blessing for that. mbufs were the only ones who used it.

The idea for this patch came independntly from jeff@, rrs@,
rwatson@ and probably others.

Test Plan

Passed Peter Holm's testing. I'm also going to test that at Netflix.

I'd very much appreciate benchmarking the patch. Note, that dumb
packet forwarding benchmark will not provide any difference.
Actually melifaro@ already benched the patch this was and there
is no difference, as expected. What we need is complex TCP load.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 2620
Build 2637: arc lint + arc unit

Event Timeline

glebius retitled this revision from to Store mbuf cluster reference count in the first mbuf..Feb 22 2016, 7:24 PM
glebius updated this object.
glebius edited the test plan for this revision. (Show Details)
glebius added a reviewer: network.
glebius updated this revision to Diff 13614.

Haven't read the diffs but I like this very much in concept.

rrs added a reviewer: rrs.Feb 23 2016, 11:34 AM
rrs accepted this revision.
This revision is now accepted and ready to land.Feb 23 2016, 11:34 AM
rrs edited edge metadata.Feb 23 2016, 12:04 PM
rrs requested changes to this revision.

It just occurred to me what bothers me about this .. you need a check

At line 680 & 645 where you are calling
mfree(mref);

You have to check the flags to see if its M_NOFREE on the original
mbuf (mref).. otherwise if you have the situation where the base mbuf
that holds the refcnt is marked as M_NOFREE you will free it anyway.

This revision now requires changes to proceed.Feb 23 2016, 12:04 PM
glebius edited edge metadata.Feb 23 2016, 9:52 PM
glebius updated this revision to Diff 13662.

Fix EXT_EXTREF handling.

Randall, I tried to address your feedback, got a patch that complicates code a lot, and then figured out that such mbufs cannot exist (for now). The M_NOFREE mbufs come from drivers[1] which allocate cluster+mbuf as a single chunk. These mbufs have always type of EXT_EXTREF, any mb_dupcl() on them will create a new EXT_EXTREF typed mbuf. For these type of mbuds the mref is never freed. btw, this is the bug I have just fixed.

Speaking "drivers", we actually mean only Chelsio. No one else utilizes this. But the plan is to utilize the "Chelsio way" to the packet zone. In this case, M_NOFREEs will become of type EXT_PACKET, which is also special. Then in mb_free_ext() all the special cases of EXT_PACKET and M_NOFREE will merge into one, reducing complexity, instead of spreading it.

For now we can put KASSERT, that M_NOFREE doesn't happen for mref. However, we can't put it at beginning of the function, since dereferencing result of __containerof() isn't save for all EXT types. Not safe for EXT_EXTREF.

We have finished benchmarking this at Netflix.

At Netflix we mostly send data with sendfile(), which is already optimized wrt refcounting: see sf_ext_ref(). In our year ago benchmarking it showed a solid win against UMA based refcounting. Thus, at Netflix we are benching not the "UMA refcounting" VS "embedded refcount", instead we are benching "sf_ext_ref()" VS "embedded refcount". Anyway, we got a 2-6% improvement with this patch. Since embedded refcount appears to be better than sf_ext_ref, and the latter in its turn showed to be better than UMA refcounting, I'd assert that for non-sendfile sends, the improvement is over 6%.

I'll wait a couple of more days for someone to comment or benchmark this patch, and will proceed with committing.