- bits 6 & 7 of mbuf addresses are currently always zero, which means that when (mostly) accessing only the first 64 bytes of each mbuf we're only able to use 1/4 of the cache sets
- similarly, bits 6-10 are always zero in the addresses of mbuf clusters; this means that, particularly in the rx path, sub-cache-line-sized packets are only able to use 1/32nd of the cache sets
This may, in part, explain why prefetching frequently worsens measured performance.
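The fractions in the bullets above can be sanity-checked with a small standalone sketch. It is not part of the patch: `sets_reached` is a hypothetical helper, and the cache geometry (64-byte lines, 64 sets, as in a 32 KB 8-way L1) is an assumption, as are the 256-byte mbuf and 2048-byte cluster alignments.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SHIFT 6   /* assumption: 64-byte cache lines */
#define NSETS      64  /* assumption: 32 KB, 8-way L1 => 32768 / 64 / 8 sets */

/*
 * Hypothetical helper: count the distinct cache sets touched by the first
 * cache line of objects whose start addresses are multiples of `align`.
 */
static unsigned
sets_reached(uint64_t align)
{
	unsigned char seen[NSETS] = {0};
	unsigned count = 0;

	/* A zero or non-power-of-two alignment is outside this sketch. */
	assert(align != 0 && (align & (align - 1)) == 0);

	/* Walk one full wrap of the set index (NSETS lines' worth of bytes). */
	for (uint64_t addr = 0; addr < ((uint64_t)NSETS << LINE_SHIFT);
	    addr += align) {
		unsigned set = (unsigned)((addr >> LINE_SHIFT) % NSETS);

		if (!seen[set]) {
			seen[set] = 1;
			count++;
		}
	}
	return (count);
}
```

Under these assumptions `sets_reached(256)` returns 16 (1/4 of the 64 sets) and `sets_reached(2048)` returns 2 (1/32 of them). On a cache with more sets the absolute counts change, but the fixed index bits waste the same fraction.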
Poor cache utilization generally isn't measurable at lower packet rates. Nonetheless, there are some users who would benefit.
It looks like the underlying UMA / VM code is broken. @shurd reports that tests don't complete when it's enabled. I'm adding @markj in the hope that he'll take a look after his move.