string: add memccpy SIMD implementation
ClosedPublic
Actions

Authored by getz on Jul 27 2024, 7:57 PM.

Details

Reviewers

fuz
emaste
andrew

Commits

rGbad17991c06d: lib/libc/aarch64/string: add memccpy SIMD implementation

Summary

This changeset includes a port of the SIMD implementation of memccpy
for amd64 to Aarch64.

Performance is significantly better than the scalar implementation
except for short strings.

Benchmark results are as usual generated by the strperf utility written
by fuz.

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short       136.7µ ± 1%   142.4µ ± 0%   +4.11% (p=0.000 n=20)
Mid         69.85µ ± 1%   30.63µ ± 1%  -56.15% (p=0.000 n=20)
Long      112.854µ ± 0%   7.898µ ± 1%  -93.00% (p=0.000 n=20)
geomean     102.5µ        32.53µ       -68.27%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short      871.9Mi ± 1%    837.4Mi ± 0%     -3.95% (p=0.000 n=20)
Mid        1.667Gi ± 1%    3.801Gi ± 1%   +128.04% (p=0.000 n=20)
Long       1.032Gi ± 0%   14.740Gi ± 1%  +1328.86% (p=0.000 n=20)
geomean    1.135Gi         3.578Gi        +215.14%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memccpyScalar │             memccpySIMD              │
        │    sec/op     │    sec/op     vs base                │
Short       96.73µ ± 1%   122.82µ ± 1%  +26.98% (p=0.000 n=20)
Mid         48.50µ ± 0%    24.62µ ± 0%  -49.23% (p=0.000 n=20)
Long       84.122µ ± 1%    4.961µ ± 0%  -94.10% (p=0.000 n=20)
geomean     73.35µ         24.66µ       -66.37%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short     1232.5Mi ± 1%    970.6Mi ± 1%    -21.25% (p=0.000 n=20)
Mid        2.400Gi ± 0%    4.728Gi ± 0%    +96.95% (p=0.000 n=20)
Long       1.384Gi ± 1%   23.466Gi ± 0%  +1595.65% (p=0.000 n=20)
geomean    1.587Gi         4.720Gi        +197.38%

Test Plan

Passes all the unit tests including the extended memccpy test D46051

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

getz created this revision.Jul 27 2024, 7:57 PM

Herald added a reviewer: andrew. · View Herald TranscriptJul 27 2024, 7:57 PM

Herald added a subscriber: imp. · View Herald Transcript

getz requested review of this revision.Jul 27 2024, 7:57 PM

Harbormaster completed remote builds in B58870: Diff 141475.Jul 27 2024, 7:57 PM

will rebase on D46052 to improve performance for the short case

New method for handling short strings based on D46052

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short       136.7µ ± 1%   141.8µ ± 0%   +3.74% (p=0.000 n=20)
Mid         69.85µ ± 1%   31.93µ ± 1%  -54.28% (p=0.000 n=20)
Long      112.854µ ± 0%   7.985µ ± 3%  -92.92% (p=0.000 n=20)
geomean     102.5µ        33.07µ       -67.74%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short      871.9Mi ± 1%    840.5Mi ± 0%     -3.60% (p=0.000 n=20)
Mid        1.667Gi ± 1%    3.646Gi ± 1%   +118.72% (p=0.000 n=20)
Long       1.032Gi ± 0%   14.579Gi ± 3%  +1313.28% (p=0.000 n=20)
geomean    1.135Gi         3.520Gi        +210.02%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memccpyScalar │             memccpySIMD              │
        │    sec/op     │    sec/op     vs base                │
Short       96.73µ ± 1%   119.82µ ± 1%  +23.87% (p=0.000 n=20)
Mid         48.50µ ± 0%    24.17µ ± 0%  -50.15% (p=0.000 n=20)
Long       84.122µ ± 1%    4.960µ ± 0%  -94.10% (p=0.000 n=20)
geomean     73.35µ         24.31µ       -66.86%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short     1232.5Mi ± 1%    994.9Mi ± 1%    -19.27% (p=0.000 n=20)
Mid        2.400Gi ± 0%    4.816Gi ± 0%   +100.61% (p=0.000 n=20)
Long       1.384Gi ± 1%   23.470Gi ± 0%  +1595.96% (p=0.000 n=20)
geomean    1.587Gi         4.789Gi        +201.71%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short       234.4µ ± 0%   197.0µ ± 0%  -15.95% (p=0.000 n=20)
Mid        115.03µ ± 1%   52.82µ ± 1%  -54.08% (p=0.000 n=20)
Long       178.23µ ± 0%   11.47µ ± 0%  -93.57% (p=0.000 n=20)
geomean     168.7µ        49.23µ       -70.83%

        │ memccpyScalar │               memccpySIMD                │
        │      B/s      │      B/s        vs base                  │
Short      508.6Mi ± 0%     605.1Mi ± 0%    +18.97% (p=0.000 n=20)
Mid        1.012Gi ± 1%     2.204Gi ± 1%   +117.76% (p=0.000 n=20)
Long       668.8Mi ± 0%   10397.0Mi ± 0%  +1454.50% (p=0.000 n=20)
geomean    706.4Mi          2.365Gi        +242.77%

Harbormaster completed remote builds in B58905: Diff 141613.Jul 31 2024, 7:56 PM

getz added inline comments.Jul 31 2024, 7:57 PM

lib/libc/aarch64/string/memccpy.S
27	I tried `ubfiz x12, x1, #2, #4` here but it degraded perfomance a tiny bit

A general comment: shifts instructions take the shift amount modulo the data size. So if the shift amount is wrong by a multiple of the data size, you don't need to go and fix that up. This could save a few additions and subtractions around the code.

Code looks ok. Looking forwards to the acceptance test.

lib/libc/aarch64/string/memccpy.S
38	Comment outdated? You no longer induce a match prior to this.
80–82	It's nicer stylistically to have the label in the same line as the instruction it labels if the label fits there. There are some other places with the same issue.
163–166

Accepted pending final acceptance test.

This revision is now accepted and ready to land.Aug 7 2024, 10:51 AM

getz mentioned this in D46243: lib/libc/aarch64/string: add strlcpy SIMD implementation.Aug 8 2024, 1:54 PM

getz marked 2 inline comments as done.Aug 13 2024, 5:38 PM

getz added inline comments.

lib/libc/aarch64/string/memccpy.S
80–82	I prefer it the way I have it written and I have been doing it this way for all my previous functions. But I'm not opposed to changing it if that's the accepted style.
163–166	I tried this and performance regressed a bit.

Update based on review.

Slightly improved performance

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short       136.7µ ± 1%   139.7µ ± 0%   +2.18% (p=0.000 n=20)
Mid         69.85µ ± 1%   32.87µ ± 0%  -52.94% (p=0.000 n=20)
Long      112.854µ ± 0%   7.884µ ± 2%  -93.01% (p=0.000 n=20)
geomean     102.5µ        33.08µ       -67.73%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short      871.9Mi ± 1%    853.3Mi ± 0%     -2.13% (p=0.000 n=20)
Mid        1.667Gi ± 1%    3.542Gi ± 0%   +112.50% (p=0.000 n=20)
Long       1.032Gi ± 0%   14.765Gi ± 2%  +1331.36% (p=0.000 n=20)
geomean    1.135Gi         3.519Gi        +209.92%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memccpyScalar │              memccpySIMD6               │
        │    sec/op     │    sec/op     vs base                   │
Short       96.79µ ± 1%   119.83µ ± 1%  +23.80% (p=0.000 n=40+20)
Mid         48.44µ ± 0%    24.36µ ± 0%  -49.71% (p=0.000 n=40+20)
Long       83.154µ ± 1%    4.964µ ± 0%  -94.03% (p=0.000 n=40+20)
geomean     73.06µ         24.38µ       -66.63%

        │ memccpyScalar │               memccpySIMD6                │
        │     MiB/s     │    MiB/s      vs base                     │
Short       1.291k ± 1%    1.043k ± 1%    -19.23% (p=0.000 n=40+20)
Mid         2.580k ± 0%    5.131k ± 0%    +98.85% (p=0.000 n=40+20)
Long        1.503k ± 1%   25.181k ± 0%  +1575.14% (p=0.000 n=40+20)
geomean     1.711k         5.127k        +199.65%

This revision now requires review to proceed.Aug 13 2024, 5:39 PM

Harbormaster completed remote builds in B59005: Diff 142054.Aug 13 2024, 5:39 PM

getz mentioned this in D46292: lib/libc/aarch64/string: add strncat SIMD implementation.Aug 14 2024, 2:12 PM

getz added a child revision: D46292: lib/libc/aarch64/string: add strncat SIMD implementation.Aug 14 2024, 2:12 PM

getz removed a child revision: D46292: lib/libc/aarch64/string: add strncat SIMD implementation.Aug 25 2024, 6:42 PM

exp-run says it's fine.

This revision is now accepted and ready to land.Nov 6 2024, 2:25 PM

Closed by commit rGbad17991c06d: lib/libc/aarch64/string: add memccpy SIMD implementation (authored by getz, committed by fuz). · Explain WhyFri, Jan 10, 3:04 PM

This revision was automatically updated to reflect the committed changes.

fuz mentioned this in rG756b7fc80837: lib/libc/aarch64/string: add strlcpy SIMD implementation.

fuz mentioned this in rG3dc5429158cf: lib/libc/aarch64/string: add strncat SIMD implementation.

fuz added a commit: rGbad17991c06d: lib/libc/aarch64/string: add memccpy SIMD implementation.