Page MenuHomeFreeBSD

lib/libc/aarch64/string: add memcmp SIMD implementation
AcceptedPublic

Authored by getz on Jun 17 2024, 3:26 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 18, 9:37 PM
Unknown Object (File)
Sat, Jan 18, 9:35 PM
Unknown Object (File)
Sat, Jan 18, 5:58 PM
Unknown Object (File)
Sat, Jan 18, 5:56 PM
Unknown Object (File)
Sun, Jan 12, 4:26 AM
Unknown Object (File)
Sun, Jan 12, 4:19 AM
Unknown Object (File)
Sun, Jan 12, 4:13 AM
Unknown Object (File)
Sun, Jan 12, 4:10 AM
Subscribers

Details

Reviewers
fuz
emaste
andrew
Summary

This changeset includes a port of the SIMD implementation of memcmp
for amd64 to Aarch64.

It also solves an issue with the existing implementation for
Aarch64 where the return value is not in accordance with the
man page and only returns -1,0 or 1 instead of the byte difference.

Performance is better than the existing memcmp implementation
borrowed from the Arm Optimized Routines except for long strings.

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
            │  memcmpARM  │             memcmpSIMD              │
            │   sec/op    │   sec/op     vs base                │
MemcmpShort   63.96µ ± 1%   32.41µ ± 0%  -49.33% (p=0.000 n=20)
MemcmpMid     12.09µ ± 1%   12.33µ ± 1%   +1.98% (p=0.000 n=20)
MemcmpLong    4.648µ ± 1%   4.942µ ± 1%   +6.32% (p=0.000 n=20)
geomean       15.32µ        12.55µ       -18.10%

            │  memcmpARM   │              memcmpSIMD              │
            │     B/s      │     B/s       vs base                │
MemcmpShort   1.820Gi ± 1%   3.592Gi ± 0%  +97.35% (p=0.000 n=20)
MemcmpMid     9.629Gi ± 1%   9.442Gi ± 1%   -1.94% (p=0.000 n=20)
MemcmpLong    25.05Gi ± 1%   23.55Gi ± 1%   -5.96% (p=0.000 n=20)
geomean       7.600Gi        9.279Gi       +22.09%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
            │  memcmpARM   │             memcmpSIMD              │
            │    sec/op    │   sec/op     vs base                │
MemcmpShort   136.11µ ± 3%   69.69µ ± 0%  -48.80% (p=0.000 n=20)
MemcmpMid      34.16µ ± 1%   32.55µ ± 1%   -4.71% (p=0.000 n=20)
MemcmpLong     9.382µ ± 0%   8.972µ ± 1%   -4.37% (p=0.000 n=20)
geomean        35.20µ        27.30µ       -22.44%

            │  memcmpARM   │              memcmpSIMD               │
            │     B/s      │      B/s       vs base                │
MemcmpShort   875.8Mi ± 3%   1710.6Mi ± 0%  +95.31% (p=0.000 n=20)
MemcmpMid     3.408Gi ± 1%    3.577Gi ± 1%   +4.94% (p=0.000 n=20)
MemcmpLong    12.41Gi ± 0%    12.98Gi ± 1%   +4.57% (p=0.000 n=20)
geomean       3.307Gi         4.264Gi       +28.93%
Test Plan

Passes the tests in the test suite

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 58232
Build 55120: arc lint + arc unit

Event Timeline

getz requested review of this revision.Jun 17 2024, 3:26 PM
lib/libc/aarch64/string/memcmp.S
32

My idea here is that if a buffer is placed just at a boundary, jump, then edit the pointer to load -32 bytes from the end and shift the vector register using TBL a dynamic amount to get everything in the right position and compare as usual.

91

It might be worth messing with the loads before to make this aligned, otherwise this could also cross into an unmapped page I believe.

lib/libc/aarch64/string/memcmp.S
96

I'm using UMAXP here instead of NOT'ing the output of CMEQ and doing a FCMP with #0.0 as this had better performance in general.

Can you test this with the two buffers at the end of a page and the next page is unmapped. I'm seeing segmentation faults when I test it.

I'm working on solving that right now, I mentioned it in the description.
The problem is that a buffer less than 32 bytes can be placed at the end of a page causing an overread.
I'll update it later today. I was mostly looking for some feedback at first.

  • Avoid overreads in main loop

We re-compare some of the bytes we already compared to avoid crossing
over into an unmapped page.

Still lacking handling for the "short input" case that can still cross
over into unmapped pages.

getz marked an inline comment as not done.Jun 18 2024, 1:46 PM

Some initial comments.

Stylistic notes: a space goes after the commas separating instruction operands.

lib/libc/aarch64/string/memcmp.S
36

Seems like the comment is the wrong way round?

52–55

Rule of thumb: try to interleave independent instructions so multi-issue in order CPUs can execute them in parallel.

60–71

This should be done in a branch-free manner.

104–108

This code always sets x2 to 32. We can treat the limit as constant from here on.

The following two subtractions set x8 to x8 + x2 - 32 and x9 to x9 + x2 - 32. You then add 32 to that for the load. Why all this complication? Why not just add and then load?

123

This too should be all branch free.

getz marked 5 inline comments as done.
  • Address comments

interleaves independent instructions for paralellism.
finds the offset in a branch free manner after a match.
simplifies the code for avoiding overreads in the main loop

  • Avoid page crossing into unmapped pages

Passes all tests, need to extend memcmp tests to verify correct
behaviour for small buffers at page boundaries.

getz edited the test plan for this revision. (Show Details)
lib/libc/aarch64/string/memcmp.S
46

Here too try to load the shift table address only once.

187–188

This table needs to be padded and .size added in much the same way as I recommend for strcmp.

Pad .rodata with -1 and use unsigned comparisons

Use correct style for SPDX identifier

follow __$FUNC convention

Follow __$FUNC convention
ifdef's to simulate bcmp

  • Use unsigned comparisons for main loop

Also unrolls the main loop which resulted in a modest performance
increase:

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
            │  memcmpARM   │             memcmpSIMD              │
            │    sec/op    │   sec/op     vs base                │
MemcmpShort    64.55µ ± 1%   32.55µ ± 1%  -49.57% (p=0.000 n=20)
MemcmpMid      12.47µ ± 1%   12.36µ ± 1%        ~ (p=0.121 n=20)
MemcmpLong     4.812µ ± 2%   4.930µ ± 0%   +2.46% (p=0.000 n=20)
BcmpShort     130.97µ ± 1%   32.24µ ± 1%  -75.38% (p=0.000 n=20)
BcmpMid        75.37µ ± 2%   12.42µ ± 1%  -83.52% (p=0.000 n=20)
BcmpLong      67.378µ ± 4%   4.926µ ± 0%  -92.69% (p=0.000 n=20)
geomean        37.02µ        12.55µ       -66.10%

            │  memcmpARM  │               memcmpSIMD               │
            │    MiB/s    │    MiB/s      vs base                  │
MemcmpShort   1.936k ± 1%    3.840k ± 1%    +98.29% (p=0.000 n=20)
MemcmpMid     10.02k ± 1%    10.12k ± 1%          ~ (p=0.121 n=20)
MemcmpLong    25.98k ± 2%    25.36k ± 0%     -2.40% (p=0.000 n=20)
BcmpShort      954.4 ± 1%    3877.4 ± 1%   +306.25% (p=0.000 n=20)
BcmpMid       1.659k ± 2%   10.066k ± 1%   +506.85% (p=0.000 n=20)
BcmpLong      1.855k ± 4%   25.377k ± 0%  +1267.86% (p=0.000 n=20)
geomean       3.376k         9.959k        +194.96%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
            │   memcmpARM   │             memcmpSIMD              │
            │    sec/op     │   sec/op     vs base                │
MemcmpShort    133.92µ ± 0%   70.15µ ± 0%  -47.62% (p=0.000 n=20)
MemcmpMid       31.63µ ± 1%   31.82µ ± 1%   +0.59% (p=0.006 n=20)
MemcmpLong      9.384µ ± 0%   9.058µ ± 1%   -3.48% (p=0.000 n=20)
BcmpShort      247.46µ ± 2%   70.17µ ± 0%  -71.64% (p=0.000 n=20)
BcmpMid        148.94µ ± 1%   31.93µ ± 1%  -78.56% (p=0.000 n=20)
BcmpLong      126.093µ ± 0%   8.949µ ± 2%  -92.90% (p=0.000 n=20)
geomean         75.47µ        27.20µ       -63.95%

            │  memcmpARM   │                memcmpSIMD                │
            │     B/s      │      B/s        vs base                  │
MemcmpShort   890.1Mi ± 0%    1699.3Mi ± 0%    +90.91% (p=0.000 n=20)
MemcmpMid     3.680Gi ± 1%     3.659Gi ± 1%     -0.59% (p=0.006 n=20)
MemcmpLong    12.41Gi ± 0%     12.85Gi ± 1%     +3.60% (p=0.000 n=20)
BcmpShort     481.9Mi ± 2%    1698.9Mi ± 0%   +252.56% (p=0.000 n=20)
BcmpMid       800.4Mi ± 1%    3733.5Mi ± 1%   +366.47% (p=0.000 n=20)
BcmpLong      945.4Mi ± 0%   13320.9Mi ± 2%  +1309.01% (p=0.000 n=20)
geomean       1.543Gi          4.279Gi        +177.40%

exp-run says it's fine.

This revision is now accepted and ready to land.Nov 6 2024, 2:26 PM

Have you tried any other benchmarks? I tried testing against the glibc memcmp using their benchmarking tool as should be the same code as on FreeBSD and found this version was slower in most cases.

Have you tried any other benchmarks? I tried testing against the glibc memcmp using their benchmarking tool as should be the same code as on FreeBSD and found this version was slower in most cases.

What CPU model did you test on? I am not sure exactly how the glibc benchmarks work and don't run Linux, so it's hard to say for me what the cause of this is.
If you run Linux on some ARM box, check if you can run our strperf benchmark there and see what results you get: https://github.com/clausecker/strperf.

I was running on a Neoverse-N2 CPU. I ran with your benchmark and received mixed results, some faster some slower.

Will hold this one pending the rework @getz has promised.