Page MenuHomeFreeBSD

lib/libc/aarch64/string: add memccpy SIMD implementation
AcceptedPublic

Authored by getz on Jul 27 2024, 7:57 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Nov 17, 8:06 PM
Unknown Object (File)
Sat, Nov 16, 9:21 PM
Unknown Object (File)
Sat, Nov 16, 7:15 PM
Unknown Object (File)
Wed, Nov 13, 3:16 AM
Unknown Object (File)
Oct 30 2024, 8:45 AM
Unknown Object (File)
Oct 27 2024, 8:31 AM
Unknown Object (File)
Oct 11 2024, 1:23 AM
Unknown Object (File)
Oct 10 2024, 8:45 PM
Subscribers

Details

Reviewers
fuz
emaste
andrew
Summary

This changeset includes a port of the SIMD implementation of memccpy
for amd64 to Aarch64.

Performance is significantly better than the scalar implementation
except for short strings.

Benchmark results are as usual generated by the strperf utility written
by fuz.

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short       136.7µ ± 1%   142.4µ ± 0%   +4.11% (p=0.000 n=20)
Mid         69.85µ ± 1%   30.63µ ± 1%  -56.15% (p=0.000 n=20)
Long      112.854µ ± 0%   7.898µ ± 1%  -93.00% (p=0.000 n=20)
geomean     102.5µ        32.53µ       -68.27%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short      871.9Mi ± 1%    837.4Mi ± 0%     -3.95% (p=0.000 n=20)
Mid        1.667Gi ± 1%    3.801Gi ± 1%   +128.04% (p=0.000 n=20)
Long       1.032Gi ± 0%   14.740Gi ± 1%  +1328.86% (p=0.000 n=20)
geomean    1.135Gi         3.578Gi        +215.14%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memccpyScalar │             memccpySIMD              │
        │    sec/op     │    sec/op     vs base                │
Short       96.73µ ± 1%   122.82µ ± 1%  +26.98% (p=0.000 n=20)
Mid         48.50µ ± 0%    24.62µ ± 0%  -49.23% (p=0.000 n=20)
Long       84.122µ ± 1%    4.961µ ± 0%  -94.10% (p=0.000 n=20)
geomean     73.35µ         24.66µ       -66.37%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short     1232.5Mi ± 1%    970.6Mi ± 1%    -21.25% (p=0.000 n=20)
Mid        2.400Gi ± 0%    4.728Gi ± 0%    +96.95% (p=0.000 n=20)
Long       1.384Gi ± 1%   23.466Gi ± 0%  +1595.65% (p=0.000 n=20)
geomean    1.587Gi         4.720Gi        +197.38%
Test Plan

Passes all the unit tests including the extended memccpy test D46051

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 59005
Build 55892: arc lint + arc unit

Event Timeline

getz requested review of this revision.Jul 27 2024, 7:57 PM

will rebase on D46052 to improve performance for the short case

New method for handling short strings based on D46052

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short       136.7µ ± 1%   141.8µ ± 0%   +3.74% (p=0.000 n=20)
Mid         69.85µ ± 1%   31.93µ ± 1%  -54.28% (p=0.000 n=20)
Long      112.854µ ± 0%   7.985µ ± 3%  -92.92% (p=0.000 n=20)
geomean     102.5µ        33.07µ       -67.74%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short      871.9Mi ± 1%    840.5Mi ± 0%     -3.60% (p=0.000 n=20)
Mid        1.667Gi ± 1%    3.646Gi ± 1%   +118.72% (p=0.000 n=20)
Long       1.032Gi ± 0%   14.579Gi ± 3%  +1313.28% (p=0.000 n=20)
geomean    1.135Gi         3.520Gi        +210.02%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memccpyScalar │             memccpySIMD              │
        │    sec/op     │    sec/op     vs base                │
Short       96.73µ ± 1%   119.82µ ± 1%  +23.87% (p=0.000 n=20)
Mid         48.50µ ± 0%    24.17µ ± 0%  -50.15% (p=0.000 n=20)
Long       84.122µ ± 1%    4.960µ ± 0%  -94.10% (p=0.000 n=20)
geomean     73.35µ         24.31µ       -66.86%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short     1232.5Mi ± 1%    994.9Mi ± 1%    -19.27% (p=0.000 n=20)
Mid        2.400Gi ± 0%    4.816Gi ± 0%   +100.61% (p=0.000 n=20)
Long       1.384Gi ± 1%   23.470Gi ± 0%  +1595.96% (p=0.000 n=20)
geomean    1.587Gi         4.789Gi        +201.71%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short       234.4µ ± 0%   197.0µ ± 0%  -15.95% (p=0.000 n=20)
Mid        115.03µ ± 1%   52.82µ ± 1%  -54.08% (p=0.000 n=20)
Long       178.23µ ± 0%   11.47µ ± 0%  -93.57% (p=0.000 n=20)
geomean     168.7µ        49.23µ       -70.83%

        │ memccpyScalar │               memccpySIMD                │
        │      B/s      │      B/s        vs base                  │
Short      508.6Mi ± 0%     605.1Mi ± 0%    +18.97% (p=0.000 n=20)
Mid        1.012Gi ± 1%     2.204Gi ± 1%   +117.76% (p=0.000 n=20)
Long       668.8Mi ± 0%   10397.0Mi ± 0%  +1454.50% (p=0.000 n=20)
geomean    706.4Mi          2.365Gi        +242.77%
lib/libc/aarch64/string/memccpy.S
27

I tried ubfiz x12, x1, #2, #4 here but it degraded perfomance a tiny bit

A general comment: shifts instructions take the shift amount modulo the data size. So if the shift amount is wrong by a multiple of the data size, you don't need to go and fix that up. This could save a few additions and subtractions around the code.

Code looks ok. Looking forwards to the acceptance test.

lib/libc/aarch64/string/memccpy.S
38

Comment outdated? You no longer induce a match prior to this.

80–82

It's nicer stylistically to have the label in the same line as the instruction it labels if the label fits there. There are some other places with the same issue.

163–166

Accepted pending final acceptance test.

This revision is now accepted and ready to land.Aug 7 2024, 10:51 AM
getz marked 2 inline comments as done.Aug 13 2024, 5:38 PM
getz added inline comments.
lib/libc/aarch64/string/memccpy.S
80–82

I prefer it the way I have it written and I have been doing it this way for all my previous functions.
But I'm not opposed to changing it if that's the accepted style.

163–166

I tried this and performance regressed a bit.

getz marked an inline comment as done.

Update based on review.

Slightly improved performance

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short       136.7µ ± 1%   139.7µ ± 0%   +2.18% (p=0.000 n=20)
Mid         69.85µ ± 1%   32.87µ ± 0%  -52.94% (p=0.000 n=20)
Long      112.854µ ± 0%   7.884µ ± 2%  -93.01% (p=0.000 n=20)
geomean     102.5µ        33.08µ       -67.73%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short      871.9Mi ± 1%    853.3Mi ± 0%     -2.13% (p=0.000 n=20)
Mid        1.667Gi ± 1%    3.542Gi ± 0%   +112.50% (p=0.000 n=20)
Long       1.032Gi ± 0%   14.765Gi ± 2%  +1331.36% (p=0.000 n=20)
geomean    1.135Gi         3.519Gi        +209.92%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memccpyScalar │              memccpySIMD6               │
        │    sec/op     │    sec/op     vs base                   │
Short       96.79µ ± 1%   119.83µ ± 1%  +23.80% (p=0.000 n=40+20)
Mid         48.44µ ± 0%    24.36µ ± 0%  -49.71% (p=0.000 n=40+20)
Long       83.154µ ± 1%    4.964µ ± 0%  -94.03% (p=0.000 n=40+20)
geomean     73.06µ         24.38µ       -66.63%

        │ memccpyScalar │               memccpySIMD6                │
        │     MiB/s     │    MiB/s      vs base                     │
Short       1.291k ± 1%    1.043k ± 1%    -19.23% (p=0.000 n=40+20)
Mid         2.580k ± 0%    5.131k ± 0%    +98.85% (p=0.000 n=40+20)
Long        1.503k ± 1%   25.181k ± 0%  +1575.14% (p=0.000 n=40+20)
geomean     1.711k         5.127k        +199.65%
This revision now requires review to proceed.Aug 13 2024, 5:39 PM

exp-run says it's fine.

This revision is now accepted and ready to land.Wed, Nov 6, 2:25 PM