Page MenuHomeFreeBSD

Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.
ClosedPublic

Authored by gallatin on Mar 28 2020, 8:27 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Nov 23, 1:07 AM
Unknown Object (File)
Oct 5 2024, 10:33 PM
Unknown Object (File)
Oct 1 2024, 4:14 AM
Unknown Object (File)
Sep 24 2024, 11:34 PM
Unknown Object (File)
Sep 24 2024, 11:34 PM
Unknown Object (File)
Sep 24 2024, 11:34 PM
Unknown Object (File)
Sep 24 2024, 11:34 PM
Unknown Object (File)
Sep 24 2024, 11:34 PM
Subscribers

Details

Summary

While the original implementation of unmapped mbufs is a large
step forward in terms of reducing cache misss by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_art2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 30154

Event Timeline

Don't you want to provide some asserts about the struct mbuf layout ?
Imagine I do not know anything about constraints and want to add a field somewhere.

In D24213#532679, @kib wrote:

Don't you want to provide some asserts about the struct mbuf layout ?
Imagine I do not know anything about constraints and want to add a field somewhere.

In kern_mbuf.c, I replaced the constraint that m_ext_pgs was 256b with the constraints that the offset of m_ext in struct mbuf matches the offset of m_ext_pgs.m_ext, as well as a constraint that the size of struct mbuf does not exceed MSIZE. Did you have others in mind? The ones I added caught a few problems on 32-bit platforms which naturally align 8-byte types, for example.

In D24213#532679, @kib wrote:

Don't you want to provide some asserts about the struct mbuf layout ?
Imagine I do not know anything about constraints and want to add a field somewhere.

In kern_mbuf.c, I replaced the constraint that m_ext_pgs was 256b with the constraints that the offset of m_ext in struct mbuf matches the offset of m_ext_pgs.m_ext, as well as a constraint that the size of struct mbuf does not exceed MSIZE. Did you have others in mind? The ones I added caught a few problems on 32-bit platforms which naturally align 8-byte types, for example.

No, I do not, otherwise I would suggest the checks.

BTW, you commented out some asserts, why not remove them if they are not relevant ?

Looks good and also see my comments.

sys/kern/uipc_ktls.c
1222

What about byte order here? Do we need to use htonl() ?

sys/kern/uipc_mbuf.c
166

Need to fix these before commit.

sys/sys/mbuf.h
352

Maybe add the new field at the end of the union ....

  • Restored static asserts on the size of m_ext() that I'd forgotten I'd commented
  • Changed copy of seqno to a memcpy
gallatin added inline comments.
sys/kern/uipc_ktls.c
1222

Yes, the assignment was almost certainly wrong. I've changed this to a memcpy of the seqno into the trailer. This should have the same outcome as overlaying the bytes in a union, as it was previously done.

However, I really, really, really think that this belongs in the mellanox driver. Can you please remind me why you need this in the stack, rather than in the driver?

In D24213#532734, @kib wrote:
In D24213#532679, @kib wrote:

Don't you want to provide some asserts about the struct mbuf layout ?
Imagine I do not know anything about constraints and want to add a field somewhere.

In kern_mbuf.c, I replaced the constraint that m_ext_pgs was 256b with the constraints that the offset of m_ext in struct mbuf matches the offset of m_ext_pgs.m_ext, as well as a constraint that the size of struct mbuf does not exceed MSIZE. Did you have others in mind? The ones I added caught a few problems on 32-bit platforms which naturally align 8-byte types, for example.

No, I do not, otherwise I would suggest the checks.

BTW, you commented out some asserts, why not remove them if they are not relevant ?

Thanks for pointing those out, I've restored them with the new, updated size of m_ext (much larger, since it now includes the ext_pgs hdr/trailer/pg array)

sys/sys/mbuf.h
352

Hmm.. It makese more sense to me the way it is, actually.. so I'd prefer to leave it, but I don't feel hugely strongly about that.

When doing TLS v1.3 retransmissions the HW needs to know the sequence number in the fast path.

sys/kern/uipc_ktls.c
1222

It wasn't the seqno that was supposed to be in the trailer, but the type byte, like DATA / ALERT and so on for TLS v1.3 ???

After discussion with Hans, it turns out that it is the record type, not the seqno, that the Mellanox driver needs in the trailer for hwtls.

  • Rebase past r359474
  • Move m_ext_pgs down below m_pkthdr in the union, as suggested by Hans

This does mean that for the non-TLS case, ext_pgs now only holds 5 pages instead of 19, so for "plain" sendfile() this won't be as beneficial (though it's still a win over pre-ext_pgs). I think this is generally ok (I have some out of tree patches I've mentioned previously I'll have to adjust as they were using space in the unused pkthdr region, but only to cache computations).

sys/kern/uipc_mbuf.c
1775

I think you could make the ext_pgs pointer const instead of needing the __DECONST?

sys/kern/uipc_sockbuf.c
115

I would probably drop this blank line.

sys/sys/mbuf.h
270

Perhaps trim the blank line above this and add back the comment from before ext_pgs:

char            *ext_buf;       /* start of buffer */
298

Perhaps 'struct<space>ktls_session' to match the other fields in this structure?

gallatin marked 3 inline comments as done.

Make cosmetic/style fixes as suggested by jhb

sys/kern/uipc_mbuf.c
1775

The problem is that uiomove() does not take a const, and I would then need to DECONST several more times to call it (which seems uglier), or I'd need to change uiomove() itself (which seems out of scope for this change).

jhb added inline comments.
sys/kern/uipc_mbuf.c
1775

Oof, ok. You can't change uiomove() since it can read or write to the kernel buffer. What you have is probably best, yes.

Fix issues encountered when running mlx5 with hw tls enabled with this patchset:

    • accessing ext_pgs via pointer was missed at compile time, as the driver was using the ext_buf rather than ext_pgs pointer
  • the driver was using ext_pgs at the same time as pkthdr

I can test this on T6 once it's committed.

This revision is now accepted and ready to land.Apr 10 2020, 1:37 PM

Just a heads up that i plan to commit this tomorrow.