This diff reorders the fields of the tcpcb so that the layout is optimized
for common-case input and output processing, with a 64-byte
cache line in mind. We want the first cache miss to bring in the
most commonly accessed fields, and we want the fields accessed in the
common path to stay within that cache line for as long as possible.
Hopefully, by the time we spill over to the next cache line, the
prefetch (read-ahead) will already have brought line two in. Fields that
are used less often (retransmission paths, SACK, etc.) are pushed
towards the bottom of the structure, optimizing for what we hope are the most common paths.
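As a rough illustration of the layout idea described above, here is a minimal C sketch. It is not the real tcpcb: `struct demo_pcb`, its field names, `DEMO_CACHE_LINE`, and the helper functions are all hypothetical, and the field widths were picked so that the hot group fills exactly one assumed 64-byte line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64	/* assumed cache line size in bytes */

/*
 * Hypothetical control block, NOT the real tcpcb. Hot fields come first
 * so that one cache miss brings them all in; cold fields follow.
 */
struct demo_pcb {
	/* Hot: touched on nearly every input/output pass (16 x 4 = 64 bytes). */
	_Alignas(DEMO_CACHE_LINE) uint32_t state;
	uint32_t flags, maxseg, rtt_est;
	uint32_t snd_una, snd_nxt, snd_max, snd_wnd;
	uint32_t snd_cwnd, snd_ssthresh, rcv_nxt, rcv_wnd;
	uint32_t rcv_adv, last_ack_sent, ts_recent, ts_recent_age;
	/* Cold: retransmission, SACK, and back-pointer style fields. */
	uint32_t rexmt_shift, dupacks, sack_hole_count, sack_newdata;
	uint64_t vnet_backptr;	/* stand-in for a rarely used back-pointer */
};

/* Index of the cache line (relative to the struct start) holding an offset. */
static inline size_t
demo_cache_line(size_t offset)
{
	return (offset / DEMO_CACHE_LINE);
}

/* Compile-time guard: the last hot field must end inside line 0. */
_Static_assert(offsetof(struct demo_pcb, ts_recent_age) + sizeof(uint32_t)
    <= DEMO_CACHE_LINE, "hot fields must fit in the first cache line");

/* The line indices only mean something if the object starts on a boundary. */
static inline void
demo_check_alignment(const struct demo_pcb *p)
{
	assert((uintptr_t)p % DEMO_CACHE_LINE == 0);
}
```

A compile-time assertion like this is one way to keep a later field addition from silently pushing a hot field onto the second line. The sketch assumes the object itself starts on a cache line boundary, which is why the first member carries `_Alignas`.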
Details
This changes no code; it only reorders fields in the tcpcb.
It has been tested and has been running like this at NF for a couple of years now. VTune
has shown it to be more efficient.
Diff Detail
- Repository: rS FreeBSD src repository - subversion
- Lint: Lint Not Applicable
- Unit: Tests Not Applicable
Event Timeline
Did you ever measure any difference (apart from VTune)? What does this change do to VIMAGE kernels, given that t_vnet moves down to the cold side of the structure?
Yes, I gained about 0.5 Gbps of added performance in my tests.
As to VIMAGE, who really uses that? No one I know of. Considering
how little it is used, I saw no real reason to keep it
in the first cache line. Of course, the other question is how
often one uses the back-pointer to the parent vnet.
Hmm, looking in the code, t_vnet is only used by:
- The new hpts code
- tcp_subr, when creating a new tcpcb
- The timer code
All of these seem to me to be prime candidates for a later cache line. You
want the hits to be against things in the direct input/output path, which
these are not.
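To make the argument above concrete, here is a small C sketch that counts how many cache lines a hot path touches under two layouts. Both structs are invented for illustration (they are not the old or new tcpcb), as are the helper names and the 64-byte line size. The only point is that moving a rarely used back-pointer out of the first line lets the hot fields share it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_LINE 64	/* assumed cache line size in bytes */

/* Hypothetical layout with a cold back-pointer and cold state up front. */
struct demo_before {
	uint64_t vnet_backptr;		/* offset 0, rarely used */
	uint32_t snd_nxt;		/* offset 8, hot */
	uint32_t rexmt_state[13];	/* offsets 12..63, cold */
	uint32_t rcv_nxt;		/* offset 64, hot */
	uint32_t snd_una;		/* offset 68, hot */
};

/* Same fields, with the hot ones grouped first and the back-pointer last. */
struct demo_after {
	uint32_t snd_nxt;		/* offset 0, hot */
	uint32_t rcv_nxt;		/* offset 4, hot */
	uint32_t snd_una;		/* offset 8, hot */
	uint32_t rexmt_state[13];	/* offsets 12..63, cold */
	uint64_t vnet_backptr;		/* offset 64, rarely used */
};

/* Count distinct cache lines covered by a set of field offsets. */
static unsigned
demo_lines_touched(const size_t *offsets, size_t n)
{
	uint64_t seen = 0;
	unsigned count = 0;

	for (size_t i = 0; i < n; i++) {
		size_t line = offsets[i] / DEMO_LINE;

		assert(line < 64);	/* the bitmask tracks at most 64 lines */
		if ((seen & (UINT64_C(1) << line)) == 0) {
			seen |= UINT64_C(1) << line;
			count++;
		}
	}
	return (count);
}

/* Lines touched by the hot fields in the "before" layout. */
static unsigned
demo_hot_lines_before(void)
{
	const size_t offs[] = {
		offsetof(struct demo_before, snd_nxt),
		offsetof(struct demo_before, rcv_nxt),
		offsetof(struct demo_before, snd_una),
	};
	return (demo_lines_touched(offs, sizeof(offs) / sizeof(offs[0])));
}

/* Lines touched by the same hot fields in the "after" layout. */
static unsigned
demo_hot_lines_after(void)
{
	const size_t offs[] = {
		offsetof(struct demo_after, snd_nxt),
		offsetof(struct demo_after, rcv_nxt),
		offsetof(struct demo_after, snd_una),
	};
	return (demo_lines_touched(offs, sizeof(offs) / sizeof(offs[0])));
}
```

In this toy case the hot path touches two lines before the reorder and one after, assuming each struct starts on a line boundary. Paths that do need the back-pointer (here standing in for timer or setup code) pay for the extra line instead.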
I have been running a variant of this for over a year (with some slight site-specific changes).