Page MenuHomeFreeBSD

lib/libc/aarch64/string: add strncmp SIMD implementation
AcceptedPublic

Authored by getz on Jul 10 2024, 4:10 PM.
Tags
None
Referenced Files
F106630001: D45943.diff
Fri, Jan 3, 12:45 AM
Unknown Object (File)
Sat, Dec 28, 1:58 PM
Unknown Object (File)
Fri, Dec 27, 2:53 AM
Unknown Object (File)
Thu, Dec 26, 4:29 AM
Unknown Object (File)
Sun, Dec 22, 3:28 AM
Unknown Object (File)
Nov 21 2024, 6:29 PM
Unknown Object (File)
Nov 18 2024, 9:03 PM
Unknown Object (File)
Nov 14 2024, 8:26 AM
Subscribers

Details

Reviewers
fuz
emaste
andrew
Summary

This changeset includes a port of the SIMD implementation of strncmp
for amd64 to Aarch64.

It is based on D45839 with added handling for the limit.

An extended unit test for strncmp is currently being written to make
sure the bounds checks for page crossings work as expected.

Performance is significantly better than the existing implementation
from the Arm Optimized Routines repository.

Benchmark results are as usual generated by the strperf utility written
by fuz.

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
                      │  strncmpARM  │             strncmpSIMD             │
                      │    sec/op    │   sec/op     vs base                │
StrncmpShortAligned      97.06µ ± 2%   63.24µ ± 2%  -34.85% (p=0.000 n=20)
StrncmpMidAligned        28.59µ ± 1%   20.61µ ± 1%  -27.92% (p=0.000 n=20)
StrncmpLongAligned      18.363µ ± 2%   9.330µ ± 2%  -49.19% (p=0.000 n=20)
StrncmpShortUnaligned   148.56µ ± 1%   72.96µ ± 0%  -50.89% (p=0.000 n=20)
StrncmpMidUnaligned      43.26µ ± 1%   22.37µ ± 1%  -48.30% (p=0.000 n=20)
StrncmpLongUnaligned    25.327µ ± 0%   9.508µ ± 2%  -62.46% (p=0.000 n=20)
StrncmpShortQsort        1.374m ± 0%   1.339m ± 0%   -2.55% (p=0.000 n=20)
StrncmpMidQsort          306.9µ ± 0%   275.7µ ± 0%  -10.16% (p=0.000 n=20)
geomean                  87.70µ        53.75µ       -38.71%

                      │  strncmpARM  │              strncmpSIMD               │
                      │     B/s      │      B/s       vs base                 │
StrncmpShortAligned     1.199Gi ± 2%    1.841Gi ± 2%   +53.48% (p=0.000 n=20)
StrncmpMidAligned       4.072Gi ± 1%    5.649Gi ± 1%   +38.74% (p=0.000 n=20)
StrncmpLongAligned      6.339Gi ± 2%   12.480Gi ± 2%   +96.86% (p=0.000 n=20)
StrncmpShortUnaligned   802.4Mi ± 1%   1634.0Mi ± 0%  +103.63% (p=0.000 n=20)
StrncmpMidUnaligned     2.691Gi ± 1%    5.205Gi ± 1%   +93.43% (p=0.000 n=20)
StrncmpLongUnaligned    4.597Gi ± 0%   12.245Gi ± 2%  +166.38% (p=0.000 n=20)
StrncmpShortQsort       86.76Mi ± 0%    89.03Mi ± 0%    +2.62% (p=0.000 n=20)
StrncmpMidQsort         388.4Mi ± 0%    432.3Mi ± 0%   +11.31% (p=0.000 n=20)
geomean                 1.327Gi         2.166Gi        +63.17%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
                      │ strncmpARM  │             strncmpSIMD             │
                      │   sec/op    │   sec/op     vs base                │
StrncmpShortAligned     144.8µ ± 0%   100.2µ ± 0%  -30.79% (p=0.000 n=20)
StrncmpMidAligned       49.89µ ± 0%   39.26µ ± 1%  -21.30% (p=0.000 n=20)
StrncmpLongAligned      23.29µ ± 0%   15.38µ ± 0%  -33.95% (p=0.000 n=20)
StrncmpShortUnaligned   195.3µ ± 0%   112.0µ ± 0%  -42.66% (p=0.000 n=20)
StrncmpMidUnaligned     68.18µ ± 1%   42.81µ ± 1%  -37.22% (p=0.000 n=20)
StrncmpLongUnaligned    33.44µ ± 0%   16.92µ ± 0%  -49.41% (p=0.000 n=20)
StrncmpShortQsort       1.770m ± 0%   1.801m ± 0%   +1.72% (p=0.000 n=20)
StrncmpMidQsort         413.5µ ± 0%   390.7µ ± 0%   -5.51% (p=0.000 n=20)
geomean                 123.7µ        87.56µ       -29.22%

                      │  strncmpARM  │              strncmpSIMD              │
                      │     B/s      │      B/s       vs base                │
StrncmpShortAligned     823.5Mi ± 0%   1189.8Mi ± 0%  +44.48% (p=0.000 n=20)
StrncmpMidAligned       2.333Gi ± 0%    2.965Gi ± 1%  +27.06% (p=0.000 n=20)
StrncmpLongAligned      4.998Gi ± 0%    7.567Gi ± 0%  +51.40% (p=0.000 n=20)
StrncmpShortUnaligned   610.4Mi ± 0%   1064.4Mi ± 0%  +74.39% (p=0.000 n=20)
StrncmpMidUnaligned     1.707Gi ± 1%    2.720Gi ± 1%  +59.28% (p=0.000 n=20)
StrncmpLongUnaligned    3.481Gi ± 0%    6.881Gi ± 0%  +97.65% (p=0.000 n=20)
StrncmpShortQsort       67.34Mi ± 0%    66.20Mi ± 0%   -1.69% (p=0.000 n=20)
StrncmpMidQsort         288.3Mi ± 0%    305.1Mi ± 0%   +5.83% (p=0.000 n=20)
geomean                 963.7Mi         1.330Gi       +41.28%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
                      │ strncmpARM  │             strncmpSIMD             │
                      │   sec/op    │   sec/op     vs base                │
StrncmpShortAligned     193.7µ ± 0%   135.5µ ± 0%  -30.08% (p=0.000 n=20)
StrncmpMidAligned       62.40µ ± 1%   51.60µ ± 1%  -17.31% (p=0.000 n=20)
StrncmpLongAligned      34.20µ ± 0%   22.82µ ± 0%  -33.28% (p=0.000 n=20)
StrncmpShortUnaligned   277.5µ ± 0%   153.4µ ± 0%  -44.71% (p=0.000 n=20)
StrncmpMidUnaligned     98.78µ ± 1%   58.79µ ± 0%  -40.48% (p=0.000 n=20)
StrncmpLongUnaligned    45.96µ ± 0%   22.84µ ± 0%  -50.31% (p=0.000 n=20)
StrncmpShortQsort       2.524m ± 0%   2.402m ± 0%   -4.83% (p=0.000 n=20)
StrncmpMidQsort         577.6µ ± 0%   504.4µ ± 0%  -12.67% (p=0.000 n=20)
geomean                 171.8µ        118.9µ       -30.83%

                      │  strncmpARM  │              strncmpSIMD               │
                      │     B/s      │      B/s       vs base                 │
StrncmpShortAligned     615.3Mi ± 0%    880.0Mi ± 0%   +43.01% (p=0.000 n=20)
StrncmpMidAligned       1.865Gi ± 1%    2.256Gi ± 1%   +20.93% (p=0.000 n=20)
StrncmpLongAligned      3.404Gi ± 0%    5.103Gi ± 0%   +49.89% (p=0.000 n=20)
StrncmpShortUnaligned   429.6Mi ± 0%    777.0Mi ± 0%   +80.88% (p=0.000 n=20)
StrncmpMidUnaligned     1.179Gi ± 1%    1.980Gi ± 0%   +68.02% (p=0.000 n=20)
StrncmpLongUnaligned    2.533Gi ± 0%    5.097Gi ± 0%  +101.25% (p=0.000 n=20)
StrncmpShortQsort       47.23Mi ± 0%    49.62Mi ± 0%    +5.07% (p=0.000 n=20)
StrncmpMidQsort         206.4Mi ± 0%    236.3Mi ± 0%   +14.50% (p=0.000 n=20)
geomean                 693.8Mi        1003.0Mi        +44.56%
Test Plan

Passes all current tests, in the process of adding new bounds test for strncmp

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 59012
Build 55899: arc lint + arc unit

Event Timeline

getz requested review of this revision.Jul 10 2024, 4:10 PM

Looks reasonable. Looking forwards o your fixes!

lib/libc/aarch64/string/strncmp.S
236–237

This performs better on microarchitectures where cmp+bne fuse.

251–254

Try to interleave more (this applies in multiple places).

Fixes the page crossing for tails.

Interleaves instructions for better ILP.

getz edited the test plan for this revision. (Show Details)
getz marked 2 inline comments as done.
getz added inline comments.
lib/libc/aarch64/string/strncmp.S
236–237

mvn still has to performed in the mismatch so the difference is negligible

Really nice!

lib/libc/aarch64/string/strncmp.S
31–36

This logic seems to be duplicated around .Llt16. Would it perhaps be possible to deduplicate it?

139–140
147–148
175

Instead of reloading #0xf all over the place, you might be able to get away with loading the constant #-1 somewhere around the beginning and using it in all those places where you currently use #0xf and #-1. After all, it doesn't matter if we inject more mismatches after the first injected mismatch.

286–288

I think this is correct, but please check.

Thanks for review! Changes appear to increase performance by a percent or two.

lib/libc/aarch64/string/strncmp.S
175

Sure, that should help.

286–288

Seems correct, passes unit tests and I verified the path was taken through a debugger.

getz marked 2 inline comments as done.
  • Updated based on review

Progress! Found some more ideas in the meanwhile.

lib/libc/aarch64/string/strncmp.S
23

Missing whitespace

43–47
64–65

The same applies to the .Llt16 branch below.

76–77
115–116

This code is also duplicated, but not sure if it is a good idea to deduplicate it.

198–211

Perhaps that could help?

But then you would also have to adjust the code at .Lnulfound and .Lmismatch to deal with x8 being pre-incremented, so maybe it's not worth doing.

229–230

If you decrement x2 by 32 before entering into the loop, you could get away with just

subs x2, x2, #32

instead of a subtraction plus a comparison, saving one instruction each iteration. Maybe that could be made to work?

getz marked 4 inline comments as done.Jul 11 2024, 3:15 PM
getz added inline comments.
lib/libc/aarch64/string/strncmp.S
115–116

I believe it's better to leave it where its at, enough arithmetic already going on before the conditionals above.

198–211

I tried this but couldn't get it be more efficient as several places needed handling to accommodate for this.

229–230

This is probably what was going on in the original amd64 code where 48 is subtracted before the loop, my problem is that x2 would wrap around and it would mess up the tail handling.

  • Revert to complicated handling of short string near a page boundary

I extended the unit tests and it was failing for short unaligned strings near a page boundary.

Introduces a null "match" when limit is reached for short strings near a page boundary

Missed this case in the tests.

Use unsigned comparisons everywhere

Use unsigned comparisons for limit
follow __$FUNC convention

exp-run says it's fine.

This revision is now accepted and ready to land.Nov 6 2024, 2:25 PM