lib/libc/aarch64/string: add strlen SIMD implementation
Accepted, Public

Authored by getz on Jun 17 2024, 7:52 PM.
Tags
None
Referenced Files
F103317709: D45623.diff
Sat, Nov 23, 11:43 AM
F103287951: D45623.diff
Sat, Nov 23, 2:38 AM
F103235543: D45623.id142467.diff
Fri, Nov 22, 12:15 PM

Details

Reviewers
fuz
emaste
andrew
Summary

Adds a SIMD-enhanced strlen for AArch64. It takes inspiration from
the amd64 implementation, but I struggled to get the performance
I had hoped for on cores like the Graviton3 when compared to the
existing implementation from Arm Optimized Routines.

Benchmark results are also available for a simple SIMD variant
loading 16 bytes at a time and checking with a simple cmeq, shrn, fcmp loop.

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ strlen_ARM  │           strlen_SIMD                │          strlen_SIMD_UNROLL          │
        │   sec/op    │    sec/op     vs base                │    sec/op     vs base                │
Short     93.82µ ± 1%   100.94µ ± 2%   +7.59% (p=0.000 n=20)   105.13µ ± 0%  +12.05% (p=0.000 n=20)
Mid       23.56µ ± 6%    23.01µ ± 0%   -2.36% (p=0.000 n=20)    23.79µ ± 0%        ~ (p=0.602 n=20)
Long      3.065µ ± 0%    3.541µ ± 0%  +15.54% (p=0.000 n=20)    3.172µ ± 0%   +3.50% (p=0.000 n=20)
geomean   18.92µ         20.18µ        +6.67%                   19.94µ        +5.39%

        │  strlen_ARM  │           strlen_SIMD                │          strlen_SIMD_UNROLL          │
        │     B/s      │     B/s       vs base                │     B/s       vs base                │
Short     1.241Gi ± 1%   1.153Gi ± 2%   -7.06% (p=0.000 n=20)   1.107Gi ± 0%  -10.75% (p=0.000 n=20)
Mid       4.940Gi ± 6%   5.060Gi ± 0%   +2.42% (p=0.000 n=20)   4.894Gi ± 0%        ~ (p=0.602 n=20)
Long      37.99Gi ± 0%   32.88Gi ± 0%  -13.45% (p=0.000 n=20)   36.70Gi ± 0%   -3.38% (p=0.000 n=20)
geomean   6.152Gi        5.768Gi        -6.25%                  5.837Gi        -5.12%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ strlen_ARM  │          strlen_SIMD                │          strlen_SIMD_UNROLL         │
        │   sec/op    │   sec/op     vs base                │   sec/op     vs base                │
Short     134.5µ ± 0%   119.2µ ± 0%  -11.36% (p=0.000 n=20)   115.3µ ± 0%  -14.29% (p=0.000 n=20)
Mid       37.10µ ± 0%   29.60µ ± 0%  -20.23% (p=0.000 n=20)   33.29µ ± 1%  -10.27% (p=0.000 n=20)
Long      4.442µ ± 0%   5.661µ ± 0%  +27.44% (p=0.000 n=20)   4.267µ ± 0%   -3.94% (p=0.000 n=20)
geomean   28.09µ        27.13µ        -3.41%                  25.39µ        -9.60%

        │  strlen_ARM  │           strlen_SIMD                │           strlen_SIMD_UNROLL          │
        │     B/s      │     B/s       vs base                │      B/s       vs base                │
Short     886.2Mi ± 0%   999.8Mi ± 0%  +12.82% (p=0.000 n=20)   1034.0Mi ± 0%  +16.68% (p=0.000 n=20)
Mid       3.138Gi ± 0%   3.933Gi ± 0%  +25.37% (p=0.000 n=20)    3.497Gi ± 1%  +11.45% (p=0.000 n=20)
Long      26.21Gi ± 0%   20.57Gi ± 0%  -21.53% (p=0.000 n=20)    27.28Gi ± 0%   +4.10% (p=0.000 n=20)
geomean   4.144Gi        4.291Gi        +3.54%                   4.584Gi       +10.62%
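
For readers unfamiliar with the cmeq, shrn approach mentioned in the summary, the following is a portable scalar C model of it, not the code under review. It assumes the shift amount is 4, so that each input byte maps to one nibble of a 64-bit mask, and it assumes the buffer stays readable for 15 bytes past the terminator. The fcmp step is modeled as a plain test of the mask against zero, which is also an assumption. The names `cmeq_shrn4` and `strlen_chunked` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar model of "cmeq v, v, #0" followed by "shrn #4" on one 16-byte
 * chunk. The shift amount of 4 is an assumption. Byte i of the chunk ends
 * up as nibble i of the result: 0xF if the byte was NUL, 0x0 otherwise.
 */
static uint64_t cmeq_shrn4(const unsigned char chunk[16])
{
    uint64_t mask = 0;

    for (int lane = 0; lane < 8; lane++) {
        /* cmeq: 0xFF where the byte is NUL, 0x00 elsewhere */
        uint16_t lo = chunk[2 * lane] == 0 ? 0xFF : 0x00;
        uint16_t hi = chunk[2 * lane + 1] == 0 ? 0xFF : 0x00;
        /* two bytes form one little-endian 16-bit lane */
        uint16_t lane16 = (uint16_t)(hi << 8 | lo);
        /* shrn #4: shift right by 4 and keep the low 8 bits */
        uint8_t narrowed = (uint8_t)(lane16 >> 4);

        mask |= (uint64_t)narrowed << (8 * lane);
    }
    return mask;
}

/*
 * Hypothetical 16-bytes-at-a-time strlen built on the model above.
 * The caller must guarantee 15 readable bytes after the terminator,
 * because whole chunks are inspected.
 */
static size_t strlen_chunked(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    for (size_t off = 0;; off += 16) {
        uint64_t mask = cmeq_shrn4(p + off);

        if (mask != 0) {
            /* count trailing zero bits; four bits per input byte */
            size_t tz = 0;
            while ((mask & 1) == 0) {
                mask >>= 1;
                tz++;
            }
            return off + tz / 4;
        }
    }
}
```

For a zero-padded buffer such as `char buf[32] = "hello"`, `strlen_chunked(buf)` returns 5. In assembly the trailing-zero count would typically come from a bit-reverse and count-leading-zeros pair followed by a shift right by 2 rather than a loop.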
Test Plan

Passes all unit tests.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 59175
Build 56062: arc lint + arc unit

Event Timeline

getz requested review of this revision. Jun 17 2024, 7:52 PM
lib/libc/aarch64/string/strlen.S
19

This fmov can be moved to before the cbz as it's the same in both branches.

21

This is an independent computation. Try to interleave it with the SIMD instructions so that multi-issue in-order CPUs can execute the two at the same time. For visualisation, you can try to group instructions that can execute at the same time. Try to make it so that there are often 2 or 3 in each group.

60–61
  • Update based on review

Got rid of the special handling for prealigned strings.
Not sure if this one is gonna be worth adding; the existing
implementation is quite hard to beat.

Newly generated performance benchmarks:

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │  strlenARM  │             strlenSIMD             │
        │   sec/op    │   sec/op     vs base               │
Short     94.62µ ± 1%   96.51µ ± 0%  +2.00% (p=0.000 n=20)
Mid       24.29µ ± 3%   23.77µ ± 0%       ~ (p=0.678 n=20)
Long      3.073µ ± 1%   3.173µ ± 0%  +3.25% (p=0.000 n=20)
geomean   19.19µ        19.38µ       +1.00%

        │  strlenARM  │             strlenSIMD             │
        │    MiB/s    │    MiB/s     vs base               │
Short     1.321k ± 1%   1.295k ± 0%  -1.96% (p=0.000 n=20)
Mid       5.148k ± 3%   5.259k ± 0%       ~ (p=0.678 n=20)
Long      40.67k ± 1%   39.39k ± 0%  -3.14% (p=0.000 n=20)
geomean   6.516k        6.450k       -1.00%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │  strlenARM  │             strlenSIMD             │
        │   sec/op    │   sec/op     vs base               │
Short     134.5µ ± 0%   121.5µ ± 0%  -9.69% (p=0.000 n=20)
Mid       37.10µ ± 0%   40.55µ ± 0%  +9.30% (p=0.000 n=20)
Long      4.442µ ± 0%   4.269µ ± 0%  -3.88% (p=0.000 n=20)
geomean   28.09µ        27.60µ       -1.74%

        │  strlenARM   │              strlenSIMD              │
        │     B/s      │     B/s       vs base                │
Short     886.2Mi ± 0%   981.3Mi ± 0%  +10.73% (p=0.000 n=20)
Mid       3.138Gi ± 0%   2.871Gi ± 0%   -8.51% (p=0.000 n=20)
Long      26.21Gi ± 0%   27.27Gi ± 0%   +4.04% (p=0.000 n=20)
geomean   4.144Gi        4.217Gi        +1.77%
  • Revert to previous strlen as the other version failed a test when running the entire test suite

I don't think this one is gonna be worth including as I'm unable to comfortably beat the perf of the existing implementation.

New benchmark results are available below:

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │  strlenARM  │              strlenSIMD              │
        │   sec/op    │    sec/op     vs base                │
Short     94.62µ ± 1%   104.12µ ± 0%  +10.04% (p=0.000 n=20)
Mid       24.29µ ± 3%    23.03µ ± 0%   -5.21% (p=0.000 n=20)
Long      3.073µ ± 1%    3.408µ ± 3%  +10.88% (p=0.000 n=20)
geomean   19.19µ         20.14µ        +4.97%

        │  strlenARM  │             strlenSIMD             │
        │    MiB/s    │    MiB/s     vs base               │
Short     1.321k ± 1%   1.201k ± 0%  -9.13% (p=0.000 n=20)
Mid       5.148k ± 3%   5.428k ± 0%  +5.45% (p=0.000 n=20)
Long      40.67k ± 1%   36.68k ± 3%  -9.81% (p=0.000 n=20)
geomean   6.516k        6.206k       -4.75%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │  strlenARM  │             strlenSIMD              │
        │   sec/op    │   sec/op     vs base                │
Short     134.5µ ± 0%   119.8µ ± 0%  -10.98% (p=0.000 n=20)
Mid       37.10µ ± 0%   30.18µ ± 1%  -18.67% (p=0.000 n=20)
Long      4.442µ ± 0%   5.628µ ± 0%  +26.71% (p=0.000 n=20)
geomean   28.09µ        27.30µ        -2.83%

        │  strlenARM   │              strlenSIMD              │
        │     B/s      │     B/s       vs base                │
Short     886.2Mi ± 0%   995.5Mi ± 0%  +12.33% (p=0.000 n=20)
Mid       3.138Gi ± 0%   3.858Gi ± 1%  +22.95% (p=0.000 n=20)
Long      26.21Gi ± 0%   20.68Gi ± 0%  -21.08% (p=0.000 n=20)
geomean   4.144Gi        4.265Gi        +2.91%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
        │  strlenARM  │             strlenSIMD              │
        │   sec/op    │   sec/op     vs base                │
Short     173.5µ ± 0%   197.0µ ± 1%  +13.54% (p=0.000 n=20)
Mid       52.07µ ± 3%   47.61µ ± 1%   -8.56% (p=0.000 n=20)
Long      5.943µ ± 0%   8.902µ ± 0%  +49.78% (p=0.000 n=20)
geomean   37.72µ        43.70µ       +15.85%

        │  strlenARM   │              strlenSIMD              │
        │     B/s      │     B/s       vs base                │
Short     687.1Mi ± 0%   605.2Mi ± 1%  -11.92% (p=0.000 n=20)
Mid       2.236Gi ± 3%   2.445Gi ± 1%   +9.36% (p=0.000 n=20)
Long      19.59Gi ± 0%   13.08Gi ± 0%  -33.23% (p=0.000 n=20)
geomean   3.086Gi        2.664Gi       -13.68%
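
As a reading aid for the tables above, the sketch below recomputes the two derived figures: the "vs base" percentage and the geomean row. It assumes that the geomean row is the geometric mean of the Short, Mid and Long values, which is consistent with the numbers shown. The helper names `pct_change`, `cube_root` and `geomean3` are made up for illustration, and the cube root uses a Newton iteration only so that the sketch does not depend on libm.

```c
/* Relative change of new_val against base, in percent. */
static double pct_change(double base, double new_val)
{
    return (new_val - base) / base * 100.0;
}

/* Cube root of v by Newton iteration; valid for v > 0 only. */
static double cube_root(double v)
{
    /* start at or above the root so the iteration descends onto it */
    double x = v > 1.0 ? v : 1.0;

    for (int i = 0; i < 200; i++)
        x = (2.0 * x + v / (x * x)) / 3.0;
    return x;
}

/* Geometric mean of three row values (Short, Mid, Long). */
static double geomean3(double a, double b, double c)
{
    return cube_root(a * b * c);
}
```

With the Cortex-A78C sec/op values, `pct_change(173.5, 197.0)` gives about 13.5 and `geomean3(173.5, 52.07, 5.943)` gives about 37.7, in line with the Short row and the strlenARM geomean. Figures recomputed from the rounded table entries can differ from the printed ones in the last digit.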

exp-run says it's fine.

This revision is now accepted and ready to land. Wed, Nov 6, 2:26 PM