lib/libc/aarch64/string: add strspn optimized implementation
Needs Review · Public

Authored by getz on Aug 21 2024, 12:19 PM.
Tags
None
Referenced Files
F101385373: D46396.id142343.diff
Mon, Oct 28, 6:32 PM

Details

Reviewers
fuz
emaste
andrew
Summary

This is a port of the scalar optimized variant of strspn from amd64 to aarch64.

It uses a lookup table (LUT) to speed up the function. A SIMD variant is still
under development.
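
For readers unfamiliar with the approach, the LUT idea can be sketched in C roughly as follows. This is only an illustration and not the code under review: the name strspn_lut and the 256-entry bool table are assumptions made for the example, and the assembly in this revision may lay out and fill its table differently.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical illustration of a LUT-based strspn.
 * Not the implementation from this revision.
 */
static size_t
strspn_lut(const char *s, const char *set)
{
	bool table[256] = { false };
	const unsigned char *p;
	size_t i;

	/* Mark every byte value that occurs in the set. */
	for (p = (const unsigned char *)set; *p != '\0'; p++)
		table[*p] = true;

	/*
	 * table['\0'] is never set, so the terminating NUL of s
	 * ends the scan just like any other byte outside the set.
	 */
	for (i = 0; table[(unsigned char)s[i]]; i++)
		;

	return (i);
}
```

Building the table costs time proportional to the set length, after which each input byte needs only one table lookup, independent of how many characters the set contains.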

Performance benchmarks are, as usual, generated by strperf.

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ strspnScalar │             strspnSIMD              │
        │    sec/op    │   sec/op     vs base                │
Short1     240.3µ ± 1%   152.0µ ± 0%  -36.73% (p=0.000 n=20)
Mid1      146.06µ ± 0%   83.33µ ± 0%  -42.95% (p=0.000 n=20)
Long1     112.77µ ± 0%   70.46µ ± 1%  -37.52% (p=0.000 n=20)
Short5     298.5µ ± 2%   253.9µ ± 0%  -14.94% (p=0.000 n=20)
Mid5       161.7µ ± 1%   110.4µ ± 0%  -31.77% (p=0.000 n=20)
Long5     112.75µ ± 0%   68.38µ ± 1%  -39.35% (p=0.000 n=20)
Short20    527.1µ ± 1%   514.8µ ± 0%   -2.33% (p=0.000 n=20)
Mid20      227.0µ ± 1%   180.3µ ± 0%  -20.59% (p=0.000 n=20)
Long20    112.78µ ± 0%   68.29µ ± 2%  -39.45% (p=0.000 n=20)
Short40    840.1µ ± 1%   869.8µ ± 1%   +3.53% (p=0.000 n=20)
Mid40      314.7µ ± 1%   275.6µ ± 0%  -12.42% (p=0.000 n=20)
Long40    112.82µ ± 0%   68.13µ ± 2%  -39.61% (p=0.000 n=20)
geomean    212.9µ        153.9µ       -27.70%

        │ strspnScalar │              strspnSIMD               │
        │     B/s      │      B/s       vs base                │
Short1    496.1Mi ± 1%    784.1Mi ± 0%  +58.06% (p=0.000 n=20)
Mid1      816.2Mi ± 0%   1430.6Mi ± 0%  +75.28% (p=0.000 n=20)
Long1     1.032Gi ± 0%    1.652Gi ± 1%  +60.04% (p=0.000 n=20)
Short5    399.4Mi ± 2%    469.5Mi ± 0%  +17.57% (p=0.000 n=20)
Mid5      737.0Mi ± 1%   1080.2Mi ± 0%  +46.56% (p=0.000 n=20)
Long5     1.033Gi ± 0%    1.702Gi ± 1%  +64.88% (p=0.000 n=20)
Short20   226.2Mi ± 1%    231.5Mi ± 0%   +2.38% (p=0.000 n=20)
Mid20     525.1Mi ± 1%    661.2Mi ± 0%  +25.92% (p=0.000 n=20)
Long20    1.032Gi ± 0%    1.705Gi ± 2%  +65.15% (p=0.000 n=20)
Short40   141.9Mi ± 1%    137.1Mi ± 1%   -3.41% (p=0.000 n=20)
Mid40     378.8Mi ± 1%    432.5Mi ± 0%  +14.18% (p=0.000 n=20)
Long40    1.032Gi ± 0%    1.709Gi ± 2%  +65.60% (p=0.000 n=20)
geomean   559.9Mi         774.4Mi       +38.30%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ strspnScalar │             strspnSIMD              │
        │    sec/op    │   sec/op     vs base                │
Short1     177.2µ ± 1%   112.2µ ± 1%  -36.67% (p=0.000 n=20)
Mid1      100.87µ ± 0%   48.44µ ± 0%  -51.97% (p=0.000 n=20)
Long1      90.16µ ± 0%   50.55µ ± 0%  -43.93% (p=0.000 n=20)
Short5     231.8µ ± 1%   181.6µ ± 2%  -21.65% (p=0.000 n=20)
Mid5      115.49µ ± 2%   60.70µ ± 2%  -47.44% (p=0.000 n=20)
Long5      90.16µ ± 0%   36.59µ ± 0%  -59.41% (p=0.000 n=20)
Short20    426.9µ ± 0%   436.4µ ± 3%        ~ (p=0.149 n=20)
Mid20      168.0µ ± 0%   126.0µ ± 0%  -24.99% (p=0.000 n=20)
Long20     90.17µ ± 0%   36.59µ ± 0%  -59.43% (p=0.000 n=20)
Short40    702.9µ ± 0%   758.4µ ± 1%   +7.89% (p=0.000 n=20)
Mid40      244.3µ ± 0%   213.4µ ± 0%  -12.65% (p=0.000 n=20)
Long40     90.22µ ± 0%   36.68µ ± 1%  -59.34% (p=0.000 n=20)
geomean    164.4µ        102.4µ       -37.73%

        │ strspnScalar │              strspnSIMD              │
        │    MiB/s     │    MiB/s     vs base                 │
Short1      705.3 ± 1%   1113.6 ± 1%   +57.89% (p=0.000 n=20)
Mid1       1.239k ± 0%   2.580k ± 0%  +108.21% (p=0.000 n=20)
Long1      1.386k ± 0%   2.473k ± 0%   +78.36% (p=0.000 n=20)
Short5      539.3 ± 1%    688.4 ± 2%   +27.64% (p=0.000 n=20)
Mid5       1.082k ± 2%   2.059k ± 2%   +90.28% (p=0.000 n=20)
Long5      1.386k ± 0%   3.416k ± 0%  +146.39% (p=0.000 n=20)
Short20     292.8 ± 0%    286.4 ± 3%         ~ (p=0.149 n=20)
Mid20       744.2 ± 0%    992.2 ± 0%   +33.31% (p=0.000 n=20)
Long20     1.386k ± 0%   3.416k ± 0%  +146.46% (p=0.000 n=20)
Short40     177.8 ± 0%    164.8 ± 1%    -7.31% (p=0.000 n=20)
Mid40       511.7 ± 0%    585.7 ± 0%   +14.48% (p=0.000 n=20)
Long40     1.385k ± 0%   3.407k ± 1%  +145.95% (p=0.000 n=20)
geomean     760.4        1.221k        +60.59%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
        │ strspnScalar │             strspnSIMD              │
        │    sec/op    │   sec/op     vs base                │
Short1     341.0µ ± 0%   223.6µ ± 0%  -34.42% (p=0.000 n=20)
Mid1      201.66µ ± 1%   99.97µ ± 1%  -50.43% (p=0.000 n=20)
Long1     169.75µ ± 0%   91.16µ ± 0%  -46.30% (p=0.000 n=20)
Short5     413.9µ ± 0%   294.1µ ± 1%  -28.95% (p=0.000 n=20)
Mid5       220.5µ ± 0%   117.2µ ± 1%  -46.86% (p=0.000 n=20)
Long5     169.73µ ± 0%   75.27µ ± 0%  -55.65% (p=0.000 n=20)
Short20    784.7µ ± 0%   352.3µ ± 1%  -55.11% (p=0.000 n=20)
Mid20      323.5µ ± 0%   135.3µ ± 0%  -58.16% (p=0.000 n=20)
Long20    169.79µ ± 0%   74.71µ ± 0%  -56.00% (p=0.000 n=20)
Short40   1288.4µ ± 0%   432.8µ ± 1%  -66.41% (p=0.000 n=20)
Mid40      467.8µ ± 0%   160.1µ ± 0%  -65.78% (p=0.000 n=20)
Long40    169.84µ ± 0%   74.77µ ± 0%  -55.97% (p=0.000 n=20)
geomean    310.3µ        146.5µ       -52.80%

        │ strspnScalar │               strspnSIMD               │
        │     B/s      │      B/s       vs base                 │
Short1    349.6Mi ± 0%    533.1Mi ± 0%   +52.49% (p=0.000 n=20)
Mid1      591.1Mi ± 1%   1192.5Mi ± 1%  +101.73% (p=0.000 n=20)
Long1     702.3Mi ± 0%   1307.8Mi ± 0%   +86.22% (p=0.000 n=20)
Short5    288.0Mi ± 0%    405.3Mi ± 1%   +40.74% (p=0.000 n=20)
Mid5      540.5Mi ± 0%   1017.2Mi ± 1%   +88.19% (p=0.000 n=20)
Long5     702.3Mi ± 0%   1583.7Mi ± 0%  +125.49% (p=0.000 n=20)
Short20   151.9Mi ± 0%    338.4Mi ± 1%  +122.76% (p=0.000 n=20)
Mid20     368.5Mi ± 0%    880.9Mi ± 0%  +139.02% (p=0.000 n=20)
Long20    702.1Mi ± 0%   1595.7Mi ± 0%  +127.28% (p=0.000 n=20)
Short40   92.53Mi ± 0%   275.44Mi ± 1%  +197.69% (p=0.000 n=20)
Mid40     254.8Mi ± 0%    744.7Mi ± 0%  +192.24% (p=0.000 n=20)
Long40    701.9Mi ± 0%   1594.3Mi ± 0%  +127.14% (p=0.000 n=20)
geomean   384.1Mi         813.9Mi       +111.87%
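
As a reading aid for the tables above, the sketch below shows how the "vs base" figures in the sec/op and B/s tables relate to each other. It assumes that "vs base" is the relative change of the new value against the base value; the helper names are made up for this example. Because the tables print rounded values, recomputing from them only matches the reported percentages approximately.

```c
/*
 * Hypothetical helpers for reading the benchmark tables.
 * Both return a percentage.
 */

/* Relative change of a sec/op value against its base. */
static double
time_change_pct(double base_sec, double new_sec)
{
	return ((new_sec - base_sec) / base_sec * 100.0);
}

/*
 * Throughput is inversely proportional to time per operation,
 * so the B/s change follows from the same two sec/op values.
 */
static double
throughput_change_pct(double base_sec, double new_sec)
{
	return ((base_sec / new_sec - 1.0) * 100.0);
}
```

For the Cortex-A76 Short1 row (240.3µ against 152.0µ) this gives roughly -36.7% in time and +58.1% in throughput, in line with the reported -36.73% and +58.06%.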
Test Plan

Passes all tests in the test suite

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 59124
Build 56011: arc lint + arc unit

Event Timeline

getz requested review of this revision. Aug 21 2024, 12:19 PM

Some review comments. Code looks ok. Looking forward to your SIMD implementation!

lib/libc/aarch64/string/strspn.S
14

Consider pre-loading the first character of the string early on, so as to reduce the impact of a cache miss.

30–35

This could perform better

42–49

This order of operations might perform better, but do benchmark the code.

57–58

etc for the others?

88

This could be optimised using the same algorithm used for strlen, except you look for mismatches instead of matches.
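
One possible reading of this suggestion, sketched in C under the assumption that it concerns a set containing a single character: compare whole words against a broadcast copy of that character and stop at the first word that differs. The name span_single and the 8-byte word size are hypothetical, and the real strlen routine may well use SIMD registers instead. Like typical libc string routines, the sketch reads up to the end of the aligned word containing the terminator, which is not strictly portable C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: length of the leading run of c in s.
 * Requires c != '\0', which always holds for a set character.
 */
static size_t
span_single(const char *s, char c)
{
	const uint64_t pattern = 0x0101010101010101ULL * (unsigned char)c;
	const char *p = s;
	uint64_t word;

	/* Go byte by byte until p is 8-byte aligned. */
	while (((uintptr_t)p & 7) != 0) {
		if (*p != c)
			return ((size_t)(p - s));
		p++;
	}

	/*
	 * Aligned word loop: the XOR is zero only if all 8 bytes
	 * equal c. The terminating NUL differs from c, so it is
	 * found as a mismatch and the loop always ends.
	 */
	for (;;) {
		memcpy(&word, p, sizeof(word));
		if ((word ^ pattern) != 0)
			break;
		p += sizeof(word);
	}

	/* Locate the mismatching byte inside the final word. */
	while (*p == c)
		p++;

	return ((size_t)(p - s));
}
```

An assembly version would presumably find the mismatch position with a bit scan rather than the final byte loop; the byte loop is used here only to keep the sketch independent of byte order.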

getz added inline comments.
lib/libc/aarch64/string/strspn.S
14

I tried adding ldrb w8, [x0] early, but it degraded performance by about 1-2%. Is this a worthwhile trade-off?

30–35

Tried this. Performance was worse on cores with slower SIMD units and on par on cores with faster SIMD units.

42–49

This significantly increased performance for small sets, thanks!

getz marked 2 inline comments as done.
  • Update based on review

Changed order of operations for checking characters in set.
Significantly increased performance for shorter sets.

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ strspnScalar │               strspnSIMD               │
        │    sec/op    │   sec/op     vs base                   │
Short1     240.3µ ± 1%   145.9µ ± 0%  -39.29% (p=0.000 n=20+10)
Mid1      146.06µ ± 0%   83.44µ ± 1%  -42.87% (p=0.000 n=20+10)
Long1     112.77µ ± 0%   70.49µ ± 0%  -37.49% (p=0.000 n=20+10)
Short5     298.5µ ± 2%   230.3µ ± 0%  -22.86% (p=0.000 n=20+10)
Mid5       161.7µ ± 1%   101.4µ ± 1%  -37.29% (p=0.000 n=20+10)
Long5     112.75µ ± 0%   65.61µ ± 0%  -41.81% (p=0.000 n=20+10)
Short20    527.1µ ± 1%   352.3µ ± 0%  -33.16% (p=0.000 n=20+10)
Mid20      227.0µ ± 1%   135.6µ ± 0%  -40.28% (p=0.000 n=20+10)
Long20    112.78µ ± 0%   65.35µ ± 1%  -42.06% (p=0.000 n=20+10)
Short40    840.1µ ± 1%   540.6µ ± 1%  -35.65% (p=0.000 n=20+10)
Mid40      314.7µ ± 1%   186.7µ ± 0%  -40.69% (p=0.000 n=20+10)
Long40    112.82µ ± 0%   65.28µ ± 0%  -42.14% (p=0.000 n=20+10)
geomean    212.9µ        131.6µ       -38.18%

        │ strspnScalar │                strspnSIMD                │
        │     B/s      │      B/s       vs base                   │
Short1    496.1Mi ± 1%    817.1Mi ± 0%  +64.72% (p=0.000 n=20+10)
Mid1      816.2Mi ± 0%   1428.6Mi ± 1%  +75.04% (p=0.000 n=20+10)
Long1     1.032Gi ± 0%    1.652Gi ± 0%  +59.98% (p=0.000 n=20+10)
Short5    399.4Mi ± 2%    517.7Mi ± 0%  +29.64% (p=0.000 n=20+10)
Mid5      737.0Mi ± 1%   1175.4Mi ± 1%  +59.47% (p=0.000 n=20+10)
Long5     1.033Gi ± 0%    1.774Gi ± 0%  +71.86% (p=0.000 n=20+10)
Short20   226.2Mi ± 1%    338.4Mi ± 0%  +49.62% (p=0.000 n=20+10)
Mid20     525.1Mi ± 1%    879.2Mi ± 0%  +67.45% (p=0.000 n=20+10)
Long20    1.032Gi ± 0%    1.781Gi ± 1%  +72.59% (p=0.000 n=20+10)
Short40   141.9Mi ± 1%    220.5Mi ± 1%  +55.40% (p=0.000 n=20+10)
Mid40     378.8Mi ± 1%    638.6Mi ± 0%  +68.60% (p=0.000 n=20+10)
Long40    1.032Gi ± 0%    1.783Gi ± 0%  +72.83% (p=0.000 n=20+10)
geomean   559.9Mi         905.7Mi       +61.76%