This changeset includes a port of the SIMD implementation of strncmp
for amd64 to Aarch64.
It is based on D45839 with added handling for the limit.
An extended unit test for strncmp is currently being written to make
sure the bounds checks for page crossings work as expected.
Performance is significantly better than the existing implementation
from the Arm Optimized Routines repository.
Benchmark results are as usual generated by the strperf utility written
by fuz.
os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
                      │  strncmpARM  │             strncmpSIMD             │
                      │    sec/op    │   sec/op     vs base                │
StrncmpShortAligned      97.06µ ± 2%   63.24µ ± 2%  -34.85% (p=0.000 n=20)
StrncmpMidAligned        28.59µ ± 1%   20.61µ ± 1%  -27.92% (p=0.000 n=20)
StrncmpLongAligned      18.363µ ± 2%   9.330µ ± 2%  -49.19% (p=0.000 n=20)
StrncmpShortUnaligned   148.56µ ± 1%   72.96µ ± 0%  -50.89% (p=0.000 n=20)
StrncmpMidUnaligned      43.26µ ± 1%   22.37µ ± 1%  -48.30% (p=0.000 n=20)
StrncmpLongUnaligned    25.327µ ± 0%   9.508µ ± 2%  -62.46% (p=0.000 n=20)
StrncmpShortQsort        1.374m ± 0%   1.339m ± 0%   -2.55% (p=0.000 n=20)
StrncmpMidQsort          306.9µ ± 0%   275.7µ ± 0%  -10.16% (p=0.000 n=20)
geomean                  87.70µ        53.75µ       -38.71%
                      │  strncmpARM  │              strncmpSIMD               │
                      │     B/s      │      B/s       vs base                 │
StrncmpShortAligned     1.199Gi ± 2%    1.841Gi ± 2%   +53.48% (p=0.000 n=20)
StrncmpMidAligned       4.072Gi ± 1%    5.649Gi ± 1%   +38.74% (p=0.000 n=20)
StrncmpLongAligned      6.339Gi ± 2%   12.480Gi ± 2%   +96.86% (p=0.000 n=20)
StrncmpShortUnaligned   802.4Mi ± 1%   1634.0Mi ± 0%  +103.63% (p=0.000 n=20)
StrncmpMidUnaligned     2.691Gi ± 1%    5.205Gi ± 1%   +93.43% (p=0.000 n=20)
StrncmpLongUnaligned    4.597Gi ± 0%   12.245Gi ± 2%  +166.38% (p=0.000 n=20)
StrncmpShortQsort       86.76Mi ± 0%    89.03Mi ± 0%    +2.62% (p=0.000 n=20)
StrncmpMidQsort         388.4Mi ± 0%    432.3Mi ± 0%   +11.31% (p=0.000 n=20)
geomean                 1.327Gi         2.166Gi        +63.17%
os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
                      │ strncmpARM  │             strncmpSIMD             │
                      │   sec/op    │   sec/op     vs base                │
StrncmpShortAligned     144.8µ ± 0%   100.2µ ± 0%  -30.79% (p=0.000 n=20)
StrncmpMidAligned       49.89µ ± 0%   39.26µ ± 1%  -21.30% (p=0.000 n=20)
StrncmpLongAligned      23.29µ ± 0%   15.38µ ± 0%  -33.95% (p=0.000 n=20)
StrncmpShortUnaligned   195.3µ ± 0%   112.0µ ± 0%  -42.66% (p=0.000 n=20)
StrncmpMidUnaligned     68.18µ ± 1%   42.81µ ± 1%  -37.22% (p=0.000 n=20)
StrncmpLongUnaligned    33.44µ ± 0%   16.92µ ± 0%  -49.41% (p=0.000 n=20)
StrncmpShortQsort       1.770m ± 0%   1.801m ± 0%   +1.72% (p=0.000 n=20)
StrncmpMidQsort         413.5µ ± 0%   390.7µ ± 0%   -5.51% (p=0.000 n=20)
geomean                 123.7µ        87.56µ       -29.22%
                      │  strncmpARM  │              strncmpSIMD              │
                      │     B/s      │      B/s       vs base                │
StrncmpShortAligned     823.5Mi ± 0%   1189.8Mi ± 0%  +44.48% (p=0.000 n=20)
StrncmpMidAligned       2.333Gi ± 0%    2.965Gi ± 1%  +27.06% (p=0.000 n=20)
StrncmpLongAligned      4.998Gi ± 0%    7.567Gi ± 0%  +51.40% (p=0.000 n=20)
StrncmpShortUnaligned   610.4Mi ± 0%   1064.4Mi ± 0%  +74.39% (p=0.000 n=20)
StrncmpMidUnaligned     1.707Gi ± 1%    2.720Gi ± 1%  +59.28% (p=0.000 n=20)
StrncmpLongUnaligned    3.481Gi ± 0%    6.881Gi ± 0%  +97.65% (p=0.000 n=20)
StrncmpShortQsort       67.34Mi ± 0%    66.20Mi ± 0%   -1.69% (p=0.000 n=20)
StrncmpMidQsort         288.3Mi ± 0%    305.1Mi ± 0%   +5.83% (p=0.000 n=20)
geomean                 963.7Mi         1.330Gi       +41.28%
os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
                      │ strncmpARM  │             strncmpSIMD             │
                      │   sec/op    │   sec/op     vs base                │
StrncmpShortAligned     193.7µ ± 0%   135.5µ ± 0%  -30.08% (p=0.000 n=20)
StrncmpMidAligned       62.40µ ± 1%   51.60µ ± 1%  -17.31% (p=0.000 n=20)
StrncmpLongAligned      34.20µ ± 0%   22.82µ ± 0%  -33.28% (p=0.000 n=20)
StrncmpShortUnaligned   277.5µ ± 0%   153.4µ ± 0%  -44.71% (p=0.000 n=20)
StrncmpMidUnaligned     98.78µ ± 1%   58.79µ ± 0%  -40.48% (p=0.000 n=20)
StrncmpLongUnaligned    45.96µ ± 0%   22.84µ ± 0%  -50.31% (p=0.000 n=20)
StrncmpShortQsort       2.524m ± 0%   2.402m ± 0%   -4.83% (p=0.000 n=20)
StrncmpMidQsort         577.6µ ± 0%   504.4µ ± 0%  -12.67% (p=0.000 n=20)
geomean                 171.8µ        118.9µ       -30.83%
                      │  strncmpARM  │              strncmpSIMD               │
                      │     B/s      │      B/s       vs base                 │
StrncmpShortAligned     615.3Mi ± 0%    880.0Mi ± 0%   +43.01% (p=0.000 n=20)
StrncmpMidAligned       1.865Gi ± 1%    2.256Gi ± 1%   +20.93% (p=0.000 n=20)
StrncmpLongAligned      3.404Gi ± 0%    5.103Gi ± 0%   +49.89% (p=0.000 n=20)
StrncmpShortUnaligned   429.6Mi ± 0%    777.0Mi ± 0%   +80.88% (p=0.000 n=20)
StrncmpMidUnaligned     1.179Gi ± 1%    1.980Gi ± 0%   +68.02% (p=0.000 n=20)
StrncmpLongUnaligned    2.533Gi ± 0%    5.097Gi ± 0%  +101.25% (p=0.000 n=20)
StrncmpShortQsort       47.23Mi ± 0%    49.62Mi ± 0%    +5.07% (p=0.000 n=20)
StrncmpMidQsort         206.4Mi ± 0%    236.3Mi ± 0%   +14.50% (p=0.000 n=20)
geomean                 693.8Mi        1003.0Mi        +44.56%