
mergesort: use memcpy() for copying
Accepted, Public

Authored by minsoochoo0122_proton.me on Wed, Jul 1, 10:44 PM.

Details

Reviewers
fuz
Summary

Currently, mergesort() uses the ICOPY_*() macros to copy data in
four-byte blocks instead of one byte at a time. However, this is only
possible when both the size and base arguments are aligned to four bytes.
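
The restriction can be pictured with a minimal sketch. It is not the code from mergesort() itself: the helper names are made up for illustration, and only the idea (copy whole ints when both the address and the element size are multiples of sizeof(int), copy single bytes otherwise) is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: may an element at `base` be copied as whole ints? */
static int
can_copy_as_ints(const void *base, size_t size)
{
	return (size % sizeof(int) == 0 &&
	    (uintptr_t)base % sizeof(int) == 0);
}

/*
 * Int-at-a-time copy; valid only when can_copy_as_ints() held for both
 * pointers. Note: accessing arbitrary objects through int * can violate
 * strict aliasing; it is shown here only to illustrate the idea.
 */
static void
copy_elt_ints(int *dst, const int *src, size_t size)
{
	assert(size % sizeof(int) == 0);
	for (size_t i = 0; i < size / sizeof(int); i++)
		dst[i] = src[i];
}

/* Byte-at-a-time fallback for every other case. */
static void
copy_elt_bytes(unsigned char *dst, const unsigned char *src, size_t size)
{
	for (size_t i = 0; i < size; i++)
		dst[i] = src[i];
}

/* The dispatch a hand-rolled copy has to do; memcpy() hides it. */
static void
copy_elt(void *dst, const void *src, size_t size)
{
	if (can_copy_as_ints(dst, size) && can_copy_as_ints(src, size))
		copy_elt_ints(dst, src, size);
	else
		copy_elt_bytes(dst, src, size);
}
```

An element size such as 6 or 13, or a base pointer offset by one byte, always ends up on the byte path in this sketch.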

Using memcpy() is preferable because 1) it is cleaner and 2) the library
will use SIMD for copying when the hardware supports it. Whereas
ICOPY_*() copies four bytes at a time, SIMD can copy up to 64 bytes at a
time. When a SIMD-backed memcpy() finds that the address is unaligned,
it can first copy data up to the nearest aligned address and then use
SIMD operations for faster transfer. Thus memcpy() can give better
performance than mergesort()'s own implementation.
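
The align-then-bulk strategy described above can be sketched in portable C. This is an illustration only, not FreeBSD's memcpy(): the name sketch_copy and the 16-byte CHUNK are assumptions, aligning the destination (rather than the source) is an arbitrary choice, and the fixed-size inner memcpy() merely stands in for the wide load/store that a real implementation would write with SIMD instructions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 16	/* assumed block width; real SIMD widths vary */

/* Hypothetical copy routine: byte head, wide bulk, byte tail. */
static void *
sketch_copy(void *dstv, const void *srcv, size_t len)
{
	unsigned char *dst = dstv;
	const unsigned char *src = srcv;

	/* Head: single bytes until dst sits on a CHUNK boundary. */
	while (len > 0 && (uintptr_t)dst % CHUNK != 0) {
		*dst++ = *src++;
		len--;
	}

	/* Bulk: one fixed-size block per iteration. */
	while (len >= CHUNK) {
		memcpy(dst, src, CHUNK);
		dst += CHUNK;
		src += CHUNK;
		len -= CHUNK;
	}

	/* Tail: whatever is left, byte by byte. */
	while (len > 0) {
		*dst++ = *src++;
		len--;
	}
	return (dstv);
}
```

The caller never has to check alignment or element size; any pointer and any length go through the same entry point.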

This is benchmarked on amd64, where there isn't a SIMD-backed
implementation yet. However, the baseline implementation in assembly
already delivers better performance in unaligned cases, although there
are some performance drops in aligned cases. The benchmark results and
script are available in the Phabricator review. Ideally, more
performance improvements will come when amd64 gets a SIMD implementation
of memcpy().

Signed-off-by: Minsoo Choo <minsoochoo0122@proton.me>

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 74472
Build 71355: arc lint + arc unit

Event Timeline

Benchmark results

Original:

[==========] Running 17 benchmarks.
[ RUN      ] align.rec4_aligned
[       OK ] align.rec4_aligned (mean 7.853ms, confidence interval +- 0.233798%)
[ RUN      ] align.rec4_misaligned
[       OK ] align.rec4_misaligned (mean 9.132ms, confidence interval +- 0.138881%)
[ RUN      ] align.rec8_aligned
[       OK ] align.rec8_aligned (mean 8.261ms, confidence interval +- 0.431836%)
[ RUN      ] align.rec8_misaligned
[       OK ] align.rec8_misaligned (mean 9.697ms, confidence interval +- 0.162985%)
[ RUN      ] align.rec16_aligned
[       OK ] align.rec16_aligned (mean 8.835ms, confidence interval +- 0.890204%)
[ RUN      ] align.rec16_misaligned
[       OK ] align.rec16_misaligned (mean 10.344ms, confidence interval +- 0.302903%)
[ RUN      ] align.rec64_aligned
[       OK ] align.rec64_aligned (mean 9.457ms, confidence interval +- 0.312470%)
[ RUN      ] align.rec64_misaligned
[       OK ] align.rec64_misaligned (mean 9.367ms, confidence interval +- 1.613589%)
[ RUN      ] align.rec6_oddsize
[       OK ] align.rec6_oddsize (mean 9.771ms, confidence interval +- 0.136250%)
[ RUN      ] align.rec13_oddsize
[       OK ] align.rec13_oddsize (mean 11.119ms, confidence interval +- 0.165858%)
[ RUN      ] align.rec8_aligned_sorted
[       OK ] align.rec8_aligned_sorted (mean 153.083us, confidence interval +- 0.306207%)
[ RUN      ] align.rec8_aligned_reverse
[       OK ] align.rec8_aligned_reverse (mean 350.774us, confidence interval +- 0.365715%)
[ RUN      ] align.rec8_aligned_fewunique
[       OK ] align.rec8_aligned_fewunique (mean 3.217ms, confidence interval +- 0.864563%)
[ RUN      ] align.rec8_aligned_1e4
[       OK ] align.rec8_aligned_1e4 (mean 628.263us, confidence interval +- 0.107621%)
[ RUN      ] align.rec8_misaligned_1e4
[       OK ] align.rec8_misaligned_1e4 (mean 746.592us, confidence interval +- 0.123026%)
[ RUN      ] align.rec8_aligned_1e6
[       OK ] align.rec8_aligned_1e6 (mean 96.576ms, confidence interval +- 0.809997%)
[ RUN      ] align.rec8_misaligned_1e6
[       OK ] align.rec8_misaligned_1e6 (mean 114.531ms, confidence interval +- 0.278125%)
[==========] 17 benchmarks ran.
[  PASSED  ] 17 benchmarks.

Patch applied:

[==========] Running 17 benchmarks.
[ RUN      ] align.rec4_aligned
[       OK ] align.rec4_aligned (mean 9.466ms, confidence interval +- 0.213250%)
[ RUN      ] align.rec4_misaligned
[       OK ] align.rec4_misaligned (mean 9.487ms, confidence interval +- 0.125024%)
[ RUN      ] align.rec8_aligned
[       OK ] align.rec8_aligned (mean 9.564ms, confidence interval +- 0.534718%)
[ RUN      ] align.rec8_misaligned
[       OK ] align.rec8_misaligned (mean 9.558ms, confidence interval +- 0.201565%)
[ RUN      ] align.rec16_aligned
[       OK ] align.rec16_aligned (mean 9.830ms, confidence interval +- 0.987650%)
[ RUN      ] align.rec16_misaligned
[       OK ] align.rec16_misaligned (mean 9.842ms, confidence interval +- 0.265556%)
[ RUN      ] align.rec64_aligned
[       OK ] align.rec64_aligned (mean 10.471ms, confidence interval +- 0.204237%)
[ RUN      ] align.rec64_misaligned
[       OK ] align.rec64_misaligned (mean 10.731ms, confidence interval +- 1.251673%)
[ RUN      ] align.rec6_oddsize
[       OK ] align.rec6_oddsize (mean 8.438ms, confidence interval +- 0.538611%)
[ RUN      ] align.rec13_oddsize
[       OK ] align.rec13_oddsize (mean 8.512ms, confidence interval +- 0.353424%)
[ RUN      ] align.rec8_aligned_sorted
[       OK ] align.rec8_aligned_sorted (mean 156.486us, confidence interval +- 1.616154%)
[ RUN      ] align.rec8_aligned_reverse
[       OK ] align.rec8_aligned_reverse (mean 361.990us, confidence interval +- 0.246398%)
[ RUN      ] align.rec8_aligned_fewunique
[       OK ] align.rec8_aligned_fewunique (mean 4.504ms, confidence interval +- 0.204794%)
[ RUN      ] align.rec8_aligned_1e4
[       OK ] align.rec8_aligned_1e4 (mean 681.198us, confidence interval +- 0.157808%)
[ RUN      ] align.rec8_misaligned_1e4
[       OK ] align.rec8_misaligned_1e4 (mean 681.108us, confidence interval +- 0.128650%)
[ RUN      ] align.rec8_aligned_1e6
[       OK ] align.rec8_aligned_1e6 (mean 107.260ms, confidence interval +- 1.064875%)
[ RUN      ] align.rec8_misaligned_1e6
[       OK ] align.rec8_misaligned_1e6 (mean 102.222ms, confidence interval +- 0.308288%)
[==========] 17 benchmarks ran.
[  PASSED  ] 17 benchmarks.

As you can see, there is a performance drop in aligned cases and an improvement in unaligned cases. As I noted in the commit message, both should improve when a SIMD implementation of memcpy() arrives (currently there is none for amd64).
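
To put numbers on the drop and the improvement, the relative change of a mean can be computed as (after - before) / before. A minimal sketch, using two pairs of means copied from the amd64 output above; the helper name relative_change_pct is made up for illustration:

```c
#include <stdio.h>

/*
 * Hypothetical helper: percentage change of `after` relative to `before`.
 * A positive result means the patched run is slower.
 */
static double
relative_change_pct(double before, double after)
{
	return ((after - before) / before * 100.0);
}

/* Print the change for two benchmarks from the amd64 run (means in ms). */
static void
print_examples(void)
{
	printf("align.rec4_aligned:  %+.1f%%\n",
	    relative_change_pct(7.853, 9.466));
	printf("align.rec13_oddsize: %+.1f%%\n",
	    relative_change_pct(11.119, 8.512));
}
```

With these inputs, align.rec4_aligned comes out roughly 20% slower and align.rec13_oddsize roughly 23% faster.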

The benchmark code is available on GitHub Gist.

This sounds like a good idea for simplicity indeed.

However, the performance does suffer in the aligned case across the board.

Could you by chance test on arm64 where we do have a SIMD-based memcpy implementation?

In D58002#1330056, @fuz wrote:

This sounds like a good idea for simplicity indeed.

However, the performance does suffer in the aligned case across the board.

Could you by chance test on arm64 where we do have a SIMD-based memcpy implementation?

Tested on 15.1 arm64 (I don't have arm64 -CURRENT, but 15.1 has a SIMD memcpy() as well).

Original:

[==========] Running 17 benchmarks.
[ RUN      ] align.rec4_aligned
[       OK ] align.rec4_aligned (mean 11.515ms, confidence interval +- 0.726057%)
[ RUN      ] align.rec4_misaligned
[       OK ] align.rec4_misaligned (mean 16.398ms, confidence interval +- 0.031860%)
[ RUN      ] align.rec8_aligned
[       OK ] align.rec8_aligned (mean 12.605ms, confidence interval +- 0.888527%)
[ RUN      ] align.rec8_misaligned
[       OK ] align.rec8_misaligned (mean 19.350ms, confidence interval +- 0.016958%)
[ RUN      ] align.rec16_aligned
[       OK ] align.rec16_aligned (mean 14.210ms, confidence interval +- 1.531384%)
[ RUN      ] align.rec16_misaligned
[       OK ] align.rec16_misaligned (mean 22.564ms, confidence interval +- 0.074281%)
[ RUN      ] align.rec64_aligned
[       OK ] align.rec64_aligned (mean 18.472ms, confidence interval +- 2.161852%)
[ RUN      ] align.rec64_misaligned
[       OK ] align.rec64_misaligned (mean 16.244ms, confidence interval +- 0.479047%)
[ RUN      ] align.rec6_oddsize
[       OK ] align.rec6_oddsize (mean 18.272ms, confidence interval +- 0.034819%)
[ RUN      ] align.rec13_oddsize
[       OK ] align.rec13_oddsize (mean 23.236ms, confidence interval +- 0.147330%)
[ RUN      ] align.rec8_aligned_sorted
[       OK ] align.rec8_aligned_sorted (mean 330.577us, confidence interval +- 0.105261%)
[ RUN      ] align.rec8_aligned_reverse
[       OK ] align.rec8_aligned_reverse (mean 713.333us, confidence interval +- 0.153345%)
[ RUN      ] align.rec8_aligned_fewunique
[       OK ] align.rec8_aligned_fewunique (mean 5.547ms, confidence interval +- 0.132338%)
[ RUN      ] align.rec8_aligned_1e4
[       OK ] align.rec8_aligned_1e4 (mean 1.012ms, confidence interval +- 0.162285%)
[ RUN      ] align.rec8_misaligned_1e4
[       OK ] align.rec8_misaligned_1e4 (mean 1.543ms, confidence interval +- 0.040833%)
[ RUN      ] align.rec8_aligned_1e6
[       OK ] align.rec8_aligned_1e6 (mean 152.388ms, confidence interval +- 0.027235%)
[ RUN      ] align.rec8_misaligned_1e6
[       OK ] align.rec8_misaligned_1e6 (mean 235.189ms, confidence interval +- 0.020163%)
[==========] 17 benchmarks ran.
[  PASSED  ] 17 benchmarks.

Patch applied (with Neon):

[==========] Running 17 benchmarks.
[ RUN      ] align.rec4_aligned
[       OK ] align.rec4_aligned (mean 13.213ms, confidence interval +- 0.597244%)
[ RUN      ] align.rec4_misaligned
[       OK ] align.rec4_misaligned (mean 13.342ms, confidence interval +- 0.041792%)
[ RUN      ] align.rec8_aligned
[       OK ] align.rec8_aligned (mean 12.514ms, confidence interval +- 0.923415%)
[ RUN      ] align.rec8_misaligned
[       OK ] align.rec8_misaligned (mean 12.522ms, confidence interval +- 0.050003%)
[ RUN      ] align.rec16_aligned
[       OK ] align.rec16_aligned (mean 12.594ms, confidence interval +- 1.765293%)
[ RUN      ] align.rec16_misaligned
[       OK ] align.rec16_misaligned (mean 12.676ms, confidence interval +- 0.203617%)
[ RUN      ] align.rec64_aligned
[       OK ] align.rec64_aligned (mean 14.661ms, confidence interval +- 0.940104%)
[ RUN      ] align.rec64_misaligned
[       OK ] align.rec64_misaligned (mean 14.815ms, confidence interval +- 0.448580%)
[ RUN      ] align.rec6_oddsize
[       OK ] align.rec6_oddsize (mean 13.108ms, confidence interval +- 0.077572%)
[ RUN      ] align.rec13_oddsize
[       OK ] align.rec13_oddsize (mean 13.086ms, confidence interval +- 0.265297%)
[ RUN      ] align.rec8_aligned_sorted
[       OK ] align.rec8_aligned_sorted (mean 324.032us, confidence interval +- 0.121752%)
[ RUN      ] align.rec8_aligned_reverse
[       OK ] align.rec8_aligned_reverse (mean 757.879us, confidence interval +- 0.117649%)
[ RUN      ] align.rec8_aligned_fewunique
[       OK ] align.rec8_aligned_fewunique (mean 7.645ms, confidence interval +- 0.088442%)
[ RUN      ] align.rec8_aligned_1e4
[       OK ] align.rec8_aligned_1e4 (mean 1.021ms, confidence interval +- 0.073652%)
[ RUN      ] align.rec8_misaligned_1e4
[       OK ] align.rec8_misaligned_1e4 (mean 1.025ms, confidence interval +- 0.040314%)
[ RUN      ] align.rec8_aligned_1e6
[       OK ] align.rec8_aligned_1e6 (mean 150.585ms, confidence interval +- 0.092804%)
[ RUN      ] align.rec8_misaligned_1e6
[       OK ] align.rec8_misaligned_1e6 (mean 151.384ms, confidence interval +- 0.081576%)
[==========] 17 benchmarks ran.
[  PASSED  ] 17 benchmarks.

We see huge performance improvements in most cases, particularly the misaligned and odd-size ones. align.rec4_aligned and align.rec8_aligned_reverse have a small performance drop. align.rec8_aligned_fewunique has a huge performance drop.

But I think this is fine because most cases show a performance gain, and the last one is especially good.

Please consider that the original libc is a release build while the post-patch libc is a debug build.

Oh is that the case for both? How did you select a debug build? If so it's kind of an apples to oranges comparison.
I'm inclined to accept the patch nonetheless as it seems to make things simpler, but do let me think about it some more.

In D58002#1330171, @fuz wrote:

Oh is that the case for both? How did you select a debug build? If so it's kind of an apples to oranges comparison.

By debug build, I mean the vanilla buildworld. For a better comparison, I ran the arm64 benchmark again, but this time both libraries were built on my local machine at release/15.1.0-p1. Both are debug builds.

For amd64, the baseline was a debug build as well, as it was from pkgbase-current. But for a better comparison, I should run the benchmark with libraries built on my local machine. I'll run the amd64 benchmark later.

Updated benchmark; both libc builds were made on my local machine.

Original:

[==========] Running 17 benchmarks.
[ RUN      ] align.rec4_aligned
[       OK ] align.rec4_aligned (mean 11.454ms, confidence interval +- 0.585918%)
[ RUN      ] align.rec4_misaligned
[       OK ] align.rec4_misaligned (mean 16.347ms, confidence interval +- 0.033041%)
[ RUN      ] align.rec8_aligned
[       OK ] align.rec8_aligned (mean 12.601ms, confidence interval +- 0.852469%)
[ RUN      ] align.rec8_misaligned
[       OK ] align.rec8_misaligned (mean 19.348ms, confidence interval +- 0.037812%)
[ RUN      ] align.rec16_aligned
[       OK ] align.rec16_aligned (mean 14.169ms, confidence interval +- 1.569751%)
[ RUN      ] align.rec16_misaligned
[       OK ] align.rec16_misaligned (mean 22.561ms, confidence interval +- 0.094936%)
[ RUN      ] align.rec64_aligned
[       OK ] align.rec64_aligned (mean 18.286ms, confidence interval +- 2.098980%)
[ RUN      ] align.rec64_misaligned
[       OK ] align.rec64_misaligned (mean 16.295ms, confidence interval +- 0.383004%)
[ RUN      ] align.rec6_oddsize
[       OK ] align.rec6_oddsize (mean 18.276ms, confidence interval +- 0.054065%)
[ RUN      ] align.rec13_oddsize
[       OK ] align.rec13_oddsize (mean 23.242ms, confidence interval +- 0.159953%)
[ RUN      ] align.rec8_aligned_sorted
[       OK ] align.rec8_aligned_sorted (mean 330.848us, confidence interval +- 0.397700%)
[ RUN      ] align.rec8_aligned_reverse
[       OK ] align.rec8_aligned_reverse (mean 711.899us, confidence interval +- 0.144525%)
[ RUN      ] align.rec8_aligned_fewunique
[       OK ] align.rec8_aligned_fewunique (mean 5.529ms, confidence interval +- 0.073231%)
[ RUN      ] align.rec8_aligned_1e4
[       OK ] align.rec8_aligned_1e4 (mean 1.006ms, confidence interval +- 0.135189%)
[ RUN      ] align.rec8_misaligned_1e4
[       OK ] align.rec8_misaligned_1e4 (mean 1.540ms, confidence interval +- 0.059561%)
[ RUN      ] align.rec8_aligned_1e6
[       OK ] align.rec8_aligned_1e6 (mean 152.384ms, confidence interval +- 0.118410%)
[ RUN      ] align.rec8_misaligned_1e6
[       OK ] align.rec8_misaligned_1e6 (mean 235.181ms, confidence interval +- 0.036172%)
[==========] 17 benchmarks ran.
[  PASSED  ] 17 benchmarks.

Patch applied:

[==========] Running 17 benchmarks.
[ RUN      ] align.rec4_aligned
[       OK ] align.rec4_aligned (mean 12.909ms, confidence interval +- 0.629741%)
[ RUN      ] align.rec4_misaligned
[       OK ] align.rec4_misaligned (mean 13.055ms, confidence interval +- 0.045282%)
[ RUN      ] align.rec8_aligned
[       OK ] align.rec8_aligned (mean 12.850ms, confidence interval +- 0.891822%)
[ RUN      ] align.rec8_misaligned
[       OK ] align.rec8_misaligned (mean 12.868ms, confidence interval +- 0.220019%)
[ RUN      ] align.rec16_aligned
[       OK ] align.rec16_aligned (mean 12.871ms, confidence interval +- 1.839393%)
[ RUN      ] align.rec16_misaligned
[       OK ] align.rec16_misaligned (mean 12.849ms, confidence interval +- 0.271427%)
[ RUN      ] align.rec64_aligned
[       OK ] align.rec64_aligned (mean 14.878ms, confidence interval +- 0.928887%)
[ RUN      ] align.rec64_misaligned
[       OK ] align.rec64_misaligned (mean 15.015ms, confidence interval +- 0.407387%)
[ RUN      ] align.rec6_oddsize
[       OK ] align.rec6_oddsize (mean 13.424ms, confidence interval +- 0.065090%)
[ RUN      ] align.rec13_oddsize
[       OK ] align.rec13_oddsize (mean 13.319ms, confidence interval +- 0.267004%)
[ RUN      ] align.rec8_aligned_sorted
[       OK ] align.rec8_aligned_sorted (mean 324.888us, confidence interval +- 0.355521%)
[ RUN      ] align.rec8_aligned_reverse
[       OK ] align.rec8_aligned_reverse (mean 760.790us, confidence interval +- 0.115692%)
[ RUN      ] align.rec8_aligned_fewunique
[       OK ] align.rec8_aligned_fewunique (mean 7.775ms, confidence interval +- 0.053783%)
[ RUN      ] align.rec8_aligned_1e4
[       OK ] align.rec8_aligned_1e4 (mean 993.321us, confidence interval +- 0.058439%)
[ RUN      ] align.rec8_misaligned_1e4
[       OK ] align.rec8_misaligned_1e4 (mean 998.753us, confidence interval +- 0.053028%)
[ RUN      ] align.rec8_aligned_1e6
[       OK ] align.rec8_aligned_1e6 (mean 154.790ms, confidence interval +- 0.023819%)
[ RUN      ] align.rec8_misaligned_1e6
[       OK ] align.rec8_misaligned_1e6 (mean 155.455ms, confidence interval +- 0.020475%)
[==========] 17 benchmarks ran.
[  PASSED  ] 17 benchmarks.

EDIT: copy-pasted wrong benchmark result

Updated benchmark for amd64 16-CURRENT; both libc builds were made and tested on my local machine.

Before:

[==========] Running 17 benchmarks.
[ RUN      ] align.rec4_aligned
[       OK ] align.rec4_aligned (mean 7.601ms, confidence interval +- 0.478645%)
[ RUN      ] align.rec4_misaligned
[       OK ] align.rec4_misaligned (mean 8.832ms, confidence interval +- 0.322316%)
[ RUN      ] align.rec8_aligned
[       OK ] align.rec8_aligned (mean 7.991ms, confidence interval +- 0.554446%)
[ RUN      ] align.rec8_misaligned
[       OK ] align.rec8_misaligned (mean 9.344ms, confidence interval +- 0.129353%)
[ RUN      ] align.rec16_aligned
[       OK ] align.rec16_aligned (mean 8.557ms, confidence interval +- 1.002189%)
[ RUN      ] align.rec16_misaligned
[       OK ] align.rec16_misaligned (mean 9.929ms, confidence interval +- 0.382486%)
[ RUN      ] align.rec64_aligned
[       OK ] align.rec64_aligned (mean 9.801ms, confidence interval +- 2.148648%)
[ RUN      ] align.rec64_misaligned
[       OK ] align.rec64_misaligned (mean 9.701ms, confidence interval +- 1.481086%)
[ RUN      ] align.rec6_oddsize
[       OK ] align.rec6_oddsize (mean 10.189ms, confidence interval +- 2.433937%)
[ RUN      ] align.rec13_oddsize
[       OK ] align.rec13_oddsize (mean 11.036ms, confidence interval +- 2.299150%)
[ RUN      ] align.rec8_aligned_sorted
[       OK ] align.rec8_aligned_sorted (mean 147.021us, confidence interval +- 0.686232%)
[ RUN      ] align.rec8_aligned_reverse
[       OK ] align.rec8_aligned_reverse (mean 336.634us, confidence interval +- 0.440853%)
[ RUN      ] align.rec8_aligned_fewunique
[       OK ] align.rec8_aligned_fewunique (mean 3.110ms, confidence interval +- 0.491127%)
[ RUN      ] align.rec8_aligned_1e4
[       OK ] align.rec8_aligned_1e4 (mean 620.335us, confidence interval +- 0.273342%)
[ RUN      ] align.rec8_misaligned_1e4
[       OK ] align.rec8_misaligned_1e4 (mean 741.846us, confidence interval +- 0.298275%)
[ RUN      ] align.rec8_aligned_1e6
[       OK ] align.rec8_aligned_1e6 (mean 99.085ms, confidence interval +- 0.695158%)
[ RUN      ] align.rec8_misaligned_1e6
[       OK ] align.rec8_misaligned_1e6 (mean 113.756ms, confidence interval +- 1.118856%)
[==========] 17 benchmarks ran.
[  PASSED  ] 17 benchmarks.

After:

[==========] Running 17 benchmarks.
[ RUN      ] align.rec4_aligned
[       OK ] align.rec4_aligned (mean 9.530ms, confidence interval +- 0.246582%)
[ RUN      ] align.rec4_misaligned
[       OK ] align.rec4_misaligned (mean 9.512ms, confidence interval +- 0.194525%)
[ RUN      ] align.rec8_aligned
[       OK ] align.rec8_aligned (mean 9.577ms, confidence interval +- 0.395499%)
[ RUN      ] align.rec8_misaligned
[       OK ] align.rec8_misaligned (mean 9.564ms, confidence interval +- 0.298806%)
[ RUN      ] align.rec16_aligned
[       OK ] align.rec16_aligned (mean 9.872ms, confidence interval +- 0.945139%)
[ RUN      ] align.rec16_misaligned
[       OK ] align.rec16_misaligned (mean 9.851ms, confidence interval +- 0.440628%)
[ RUN      ] align.rec64_aligned
[       OK ] align.rec64_aligned (mean 11.574ms, confidence interval +- 1.255698%)
[ RUN      ] align.rec64_misaligned
[       OK ] align.rec64_misaligned (mean 11.697ms, confidence interval +- 1.647383%)
[ RUN      ] align.rec6_oddsize
[       OK ] align.rec6_oddsize (mean 9.547ms, confidence interval +- 0.117859%)
[ RUN      ] align.rec13_oddsize
[       OK ] align.rec13_oddsize (mean 9.725ms, confidence interval +- 0.149874%)
[ RUN      ] align.rec8_aligned_sorted
[       OK ] align.rec8_aligned_sorted (mean 162.973us, confidence interval +- 0.750951%)
[ RUN      ] align.rec8_aligned_reverse
[       OK ] align.rec8_aligned_reverse (mean 362.147us, confidence interval +- 0.417816%)
[ RUN      ] align.rec8_aligned_fewunique
[       OK ] align.rec8_aligned_fewunique (mean 4.826ms, confidence interval +- 0.158951%)
[ RUN      ] align.rec8_aligned_1e4
[       OK ] align.rec8_aligned_1e4 (mean 749.217us, confidence interval +- 0.124175%)
[ RUN      ] align.rec8_misaligned_1e4
[       OK ] align.rec8_misaligned_1e4 (mean 749.495us, confidence interval +- 0.102323%)
[ RUN      ] align.rec8_aligned_1e6
[       OK ] align.rec8_aligned_1e6 (mean 105.336ms, confidence interval +- 0.771537%)
[ RUN      ] align.rec8_misaligned_1e6
[       OK ] align.rec8_misaligned_1e6 (mean 101.883ms, confidence interval +- 0.073384%)
[==========] 17 benchmarks ran.
[  PASSED  ] 17 benchmarks.

amd64 doesn't have a SIMD memcpy(), so we see a performance drop in most cases, except for the odd-size cases and some misaligned cases.

It's probably a good idea to do this change either way as the current code won't work with CHERI. Perf looks ok and I think we can live with the slight hit. I'll look into landing it.

This revision is now accepted and ready to land.
Thu, Jul 2, 5:18 PM