Details

Reviewers

fuz
emaste
andrew

Commits

rG25c485e14769: lib/libc/aarch64/string: add strncmp SIMD implementation

Summary

This changeset includes a port of the SIMD implementation of strncmp
for amd64 to Aarch64.

It is based on D45839 with added handling for the limit.

An extended unit test for strncmp is currently being written to make
sure the bounds checks for page crossings work as expected.

Performance is significantly better than the existing implementation
from the Arm Optimized Routines repository.

Benchmark results are as usual generated by the strperf utility written
by fuz.

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
                      │  strncmpARM  │             strncmpSIMD             │
                      │    sec/op    │   sec/op     vs base                │
StrncmpShortAligned      97.06µ ± 2%   63.24µ ± 2%  -34.85% (p=0.000 n=20)
StrncmpMidAligned        28.59µ ± 1%   20.61µ ± 1%  -27.92% (p=0.000 n=20)
StrncmpLongAligned      18.363µ ± 2%   9.330µ ± 2%  -49.19% (p=0.000 n=20)
StrncmpShortUnaligned   148.56µ ± 1%   72.96µ ± 0%  -50.89% (p=0.000 n=20)
StrncmpMidUnaligned      43.26µ ± 1%   22.37µ ± 1%  -48.30% (p=0.000 n=20)
StrncmpLongUnaligned    25.327µ ± 0%   9.508µ ± 2%  -62.46% (p=0.000 n=20)
StrncmpShortQsort        1.374m ± 0%   1.339m ± 0%   -2.55% (p=0.000 n=20)
StrncmpMidQsort          306.9µ ± 0%   275.7µ ± 0%  -10.16% (p=0.000 n=20)
geomean                  87.70µ        53.75µ       -38.71%

                      │  strncmpARM  │              strncmpSIMD               │
                      │     B/s      │      B/s       vs base                 │
StrncmpShortAligned     1.199Gi ± 2%    1.841Gi ± 2%   +53.48% (p=0.000 n=20)
StrncmpMidAligned       4.072Gi ± 1%    5.649Gi ± 1%   +38.74% (p=0.000 n=20)
StrncmpLongAligned      6.339Gi ± 2%   12.480Gi ± 2%   +96.86% (p=0.000 n=20)
StrncmpShortUnaligned   802.4Mi ± 1%   1634.0Mi ± 0%  +103.63% (p=0.000 n=20)
StrncmpMidUnaligned     2.691Gi ± 1%    5.205Gi ± 1%   +93.43% (p=0.000 n=20)
StrncmpLongUnaligned    4.597Gi ± 0%   12.245Gi ± 2%  +166.38% (p=0.000 n=20)
StrncmpShortQsort       86.76Mi ± 0%    89.03Mi ± 0%    +2.62% (p=0.000 n=20)
StrncmpMidQsort         388.4Mi ± 0%    432.3Mi ± 0%   +11.31% (p=0.000 n=20)
geomean                 1.327Gi         2.166Gi        +63.17%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
                      │ strncmpARM  │             strncmpSIMD             │
                      │   sec/op    │   sec/op     vs base                │
StrncmpShortAligned     144.8µ ± 0%   100.2µ ± 0%  -30.79% (p=0.000 n=20)
StrncmpMidAligned       49.89µ ± 0%   39.26µ ± 1%  -21.30% (p=0.000 n=20)
StrncmpLongAligned      23.29µ ± 0%   15.38µ ± 0%  -33.95% (p=0.000 n=20)
StrncmpShortUnaligned   195.3µ ± 0%   112.0µ ± 0%  -42.66% (p=0.000 n=20)
StrncmpMidUnaligned     68.18µ ± 1%   42.81µ ± 1%  -37.22% (p=0.000 n=20)
StrncmpLongUnaligned    33.44µ ± 0%   16.92µ ± 0%  -49.41% (p=0.000 n=20)
StrncmpShortQsort       1.770m ± 0%   1.801m ± 0%   +1.72% (p=0.000 n=20)
StrncmpMidQsort         413.5µ ± 0%   390.7µ ± 0%   -5.51% (p=0.000 n=20)
geomean                 123.7µ        87.56µ       -29.22%

                      │  strncmpARM  │              strncmpSIMD              │
                      │     B/s      │      B/s       vs base                │
StrncmpShortAligned     823.5Mi ± 0%   1189.8Mi ± 0%  +44.48% (p=0.000 n=20)
StrncmpMidAligned       2.333Gi ± 0%    2.965Gi ± 1%  +27.06% (p=0.000 n=20)
StrncmpLongAligned      4.998Gi ± 0%    7.567Gi ± 0%  +51.40% (p=0.000 n=20)
StrncmpShortUnaligned   610.4Mi ± 0%   1064.4Mi ± 0%  +74.39% (p=0.000 n=20)
StrncmpMidUnaligned     1.707Gi ± 1%    2.720Gi ± 1%  +59.28% (p=0.000 n=20)
StrncmpLongUnaligned    3.481Gi ± 0%    6.881Gi ± 0%  +97.65% (p=0.000 n=20)
StrncmpShortQsort       67.34Mi ± 0%    66.20Mi ± 0%   -1.69% (p=0.000 n=20)
StrncmpMidQsort         288.3Mi ± 0%    305.1Mi ± 0%   +5.83% (p=0.000 n=20)
geomean                 963.7Mi         1.330Gi       +41.28%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
                      │ strncmpARM  │             strncmpSIMD             │
                      │   sec/op    │   sec/op     vs base                │
StrncmpShortAligned     193.7µ ± 0%   135.5µ ± 0%  -30.08% (p=0.000 n=20)
StrncmpMidAligned       62.40µ ± 1%   51.60µ ± 1%  -17.31% (p=0.000 n=20)
StrncmpLongAligned      34.20µ ± 0%   22.82µ ± 0%  -33.28% (p=0.000 n=20)
StrncmpShortUnaligned   277.5µ ± 0%   153.4µ ± 0%  -44.71% (p=0.000 n=20)
StrncmpMidUnaligned     98.78µ ± 1%   58.79µ ± 0%  -40.48% (p=0.000 n=20)
StrncmpLongUnaligned    45.96µ ± 0%   22.84µ ± 0%  -50.31% (p=0.000 n=20)
StrncmpShortQsort       2.524m ± 0%   2.402m ± 0%   -4.83% (p=0.000 n=20)
StrncmpMidQsort         577.6µ ± 0%   504.4µ ± 0%  -12.67% (p=0.000 n=20)
geomean                 171.8µ        118.9µ       -30.83%

                      │  strncmpARM  │              strncmpSIMD               │
                      │     B/s      │      B/s       vs base                 │
StrncmpShortAligned     615.3Mi ± 0%    880.0Mi ± 0%   +43.01% (p=0.000 n=20)
StrncmpMidAligned       1.865Gi ± 1%    2.256Gi ± 1%   +20.93% (p=0.000 n=20)
StrncmpLongAligned      3.404Gi ± 0%    5.103Gi ± 0%   +49.89% (p=0.000 n=20)
StrncmpShortUnaligned   429.6Mi ± 0%    777.0Mi ± 0%   +80.88% (p=0.000 n=20)
StrncmpMidUnaligned     1.179Gi ± 1%    1.980Gi ± 0%   +68.02% (p=0.000 n=20)
StrncmpLongUnaligned    2.533Gi ± 0%    5.097Gi ± 0%  +101.25% (p=0.000 n=20)
StrncmpShortQsort       47.23Mi ± 0%    49.62Mi ± 0%    +5.07% (p=0.000 n=20)
StrncmpMidQsort         206.4Mi ± 0%    236.3Mi ± 0%   +14.50% (p=0.000 n=20)
geomean                 693.8Mi        1003.0Mi        +44.56%

Test Plan

Passes all current tests, in the process of adding new bounds test for strncmp

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

getz created this revision.Jul 10 2024, 4:10 PM

Herald added a reviewer: andrew. · View Herald TranscriptJul 10 2024, 4:10 PM

Herald added a subscriber: imp. · View Herald Transcript

getz requested review of this revision.Jul 10 2024, 4:10 PM

Harbormaster completed remote builds in B58602: Diff 140759.Jul 10 2024, 4:10 PM

Looks reasonable. Looking forwards o your fixes!

lib/libc/aarch64/string/strncmp.S
236–237	This performs better on microarchitectures where cmp+bne fuse.
251–254	Try to interleave more (this applies in multiple places).

Fixes the page crossing for tails.

Interleaves instructions for better ILP.

Harbormaster completed remote builds in B58609: Diff 140786.Jul 10 2024, 9:42 PM

getz edited the summary of this revision. (Show Details)Jul 10 2024, 9:43 PM

getz edited the test plan for this revision. (Show Details)

getz marked 2 inline comments as done.

getz added inline comments.

lib/libc/aarch64/string/strncmp.S
236–237	mvn still has to performed in the mismatch so the difference is negligible

Really nice!

lib/libc/aarch64/string/strncmp.S
31–36	This logic seems to be duplicated around `.Llt16`. Would it perhaps be possible to deduplicate it?
139–140
147–148
175	Instead of reloading `#0xf` all over the place, you might be able to get away with loading the constant `#-1` somewhere around the beginning and using it in all those places where you currently use `#0xf` and `#-1`. After all, it doesn't matter if we inject more mismatches after the first injected mismatch.
286–288	I think this is correct, but please check.

Thanks for review! Changes appear to increase performance by a percent or two.

lib/libc/aarch64/string/strncmp.S
175	Sure, that should help.
286–288	Seems correct, passes unit tests and I verified the path was taken through a debugger.

Updated based on review

Harbormaster completed remote builds in B58612: Diff 140795.Jul 11 2024, 11:21 AM

Progress! Found some more ideas in the meanwhile.

lib/libc/aarch64/string/strncmp.S
23	Missing whitespace
43–47
64–65	The same applies to the `.Llt16` branch below.
76–77
115–116	This code is also duplicated, but not sure if it is a good idea to deduplicate it.
198–211	Perhaps that could help? But then you would also have to adjust the code at `.Lnulfound` and `.Lmismatch` to deal with `x8` being pre-incremented, so maybe it's not worth doing.
229–230	If you decrement `x2` by 32 before entering into the loop, you could get away with just subs x2, x2, #32 instead of a subtraction plus a comparison, saving one instruction each iteration. Maybe that could be made to work?

getz marked 4 inline comments as done.Jul 11 2024, 3:15 PM

getz added inline comments.

lib/libc/aarch64/string/strncmp.S
115–116	I believe it's better to leave it where its at, enough arithmetic already going on before the conditionals above.
198–211	I tried this but couldn't get it be more efficient as several places needed handling to accommodate for this.
229–230	This is probably what was going on in the original amd64 code where 48 is subtracted before the loop, my problem is that x2 would wrap around and it would mess up the tail handling.