Do some more bcmp "optimization". (That's in quotes because "optimization" is rarely straightforward. What "optimizes" one person's use case may well pessimize another's.)
Main changes:
- Do 8-byte comparisons in a loop prior to doing 1-byte comparisons in a loop.
- In the large case, do 1-byte comparisons in a loop (instead of repe cmpsb).
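For readers who want the shape of the two loops, here is a minimal C sketch of the strategy in the list above. It is an illustration only, not the code in this change: the name `bcmp_sketch` is hypothetical, the split between the small and large cases is not modeled, and the real routine may well be written in assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch; bcmp_sketch is an illustrative name, not the
 * function touched by this change.
 * Returns 0 if the first len bytes match, non-zero otherwise.
 */
int
bcmp_sketch(const void *b1, const void *b2, size_t len)
{
	const unsigned char *p1 = b1;
	const unsigned char *p2 = b2;

	/* Compare 8 bytes at a time while at least 8 bytes remain. */
	while (len >= 8) {
		uint64_t w1, w2;

		/* memcpy sidesteps unaligned-access and aliasing problems. */
		memcpy(&w1, p1, sizeof(w1));
		memcpy(&w2, p2, sizeof(w2));
		if (w1 != w2)
			return (1);
		p1 += 8;
		p2 += 8;
		len -= 8;
	}

	/* Finish the remaining 0-7 bytes one byte at a time. */
	while (len > 0) {
		if (*p1 != *p2)
			return (1);
		p1++;
		p2++;
		len--;
	}
	return (0);
}
```

Because bcmp only has to report equal versus not equal, the 8-byte loop can return as soon as two words differ without working out which byte differed.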
In a user-space test over many memory blocks of varying sizes (up to 105 bytes) and alignments, this seems to reduce run time by 42%.
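A user-space sweep of this kind could look roughly like the sketch below. This is not the actual test program: `run_cmp_sweep`, `MAX_ALIGN`, and the use of `clock()` are assumptions made for illustration, and only the 105-byte upper bound comes from the description.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define MAX_LEN   105	/* largest block size mentioned above */
#define MAX_ALIGN 8	/* assumed range of start offsets to try */

typedef int (*cmp_fn)(const void *, const void *, size_t);

/*
 * Hypothetical harness: call fn on matching buffers for every
 * (offset, offset, length) combination, repeated `rounds` times.
 * Returns how many calls reported a difference (0 is expected) and
 * stores the CPU time spent, in seconds, in *elapsed_sec.
 */
unsigned long
run_cmp_sweep(cmp_fn fn, unsigned rounds, double *elapsed_sec)
{
	static unsigned char a[MAX_LEN + MAX_ALIGN];
	static unsigned char b[MAX_LEN + MAX_ALIGN];
	unsigned long diffs = 0;
	clock_t start, end;

	/* Fill both buffers with one byte value so any two slices match. */
	memset(a, 0x5a, sizeof(a));
	memset(b, 0x5a, sizeof(b));

	start = clock();
	for (unsigned r = 0; r < rounds; r++)
		for (size_t off1 = 0; off1 < MAX_ALIGN; off1++)
			for (size_t off2 = 0; off2 < MAX_ALIGN; off2++)
				for (size_t len = 0; len <= MAX_LEN; len++)
					if (fn(a + off1, b + off2, len) != 0)
						diffs++;
	end = clock();

	*elapsed_sec = (double)(end - start) / CLOCKS_PER_SEC;
	return (diffs);
}
```

Running the sweep once with the old routine and once with the new one, using the same `rounds`, gives two elapsed times to compare. Matching buffers force every call to scan the full length, so this measures the worst case only; a fuller test would also cover buffers that differ early.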
Additionally, when running the same test @mjg ran (using pmcstat to capture samples of bcmp), bcmp fell from appearing in 0.7% of samples to not showing up on the list at all.
One nagging worry is whether making this function larger will negatively affect other parts of the system.