lib/libc/aarch64/string: add memcpy SIMD implementation
Accepted
Public

Authored by getz on Aug 9 2024, 1:18 PM.

Details

Reviewers
fuz
emaste
andrew
Summary

I noticed that the arm-optimized-routines in contrib/ already include a
SIMD-optimized memcpy.

This patch makes libc on aarch64 use the SIMD variant instead of the
scalar-optimized variant.

The benchmarks below were generated with fuz's strperf utility.

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memcpyScalar │             memcpySIMD              │
        │    sec/op    │   sec/op     vs base                │
64         30.71µ ± 0%   22.47µ ± 1%  -26.83% (p=0.000 n=20)
4k         7.875µ ± 0%   4.069µ ± 0%  -48.33% (p=0.000 n=20)
256k       6.608µ ± 0%   5.126µ ± 0%  -22.43% (p=0.000 n=20)
16m        512.0µ ± 0%   503.0µ ± 0%   -1.75% (p=0.000 n=20)
1g         41.42m ± 0%   39.73m ± 0%   -4.08% (p=0.000 n=20)
geomean    127.7µ        98.70µ       -22.68%

        │ memcpyScalar │              memcpySIMD               │
        │     B/s      │      B/s       vs base                │
64        7.582Gi ± 0%   10.362Gi ± 1%  +36.68% (p=0.000 n=20)
4k        29.57Gi ± 0%    57.22Gi ± 0%  +93.55% (p=0.000 n=20)
256k      35.23Gi ± 0%    45.42Gi ± 0%  +28.91% (p=0.000 n=20)
16m       29.11Gi ± 0%    29.62Gi ± 0%   +1.78% (p=0.000 n=20)
1g        23.02Gi ± 0%    24.00Gi ± 0%   +4.26% (p=0.000 n=20)
geomean   22.12Gi         28.60Gi       +29.33%
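
As a reading aid, the "vs base" column and the geomean row can be reproduced from the rounded values printed in the Neoverse-V1 sec/op table. This is only a sketch: the helper names are made up for illustration, and because the inputs are already rounded, the last digit may differ slightly from the table.

```python
import math

# Neoverse-V1 sec/op values copied from the table above, in microseconds.
scalar_us = [30.71, 7.875, 6.608, 512.0, 41420.0]
simd_us = [22.47, 4.069, 5.126, 503.0, 39730.0]


def percent_change(base, new):
    """Relative change of new versus base, in percent."""
    return (new - base) / base * 100.0


def geomean(values):
    """Geometric mean, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Per-row change for the 64-byte case and the change of the geomeans.
row_64 = percent_change(scalar_us[0], simd_us[0])
overall = percent_change(geomean(scalar_us), geomean(simd_us))

print(f"64-byte row: {row_64:+.2f}%")
print(f"geomean:     {overall:+.2f}%")
```

The geomean weights every size equally, so the large gain at 4k counts as much as the small gain at 16m.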

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memcpyScalar │             memcpySIMD              │
        │    sec/op    │   sec/op     vs base                │
64         51.55µ ± 0%   46.25µ ± 0%  -10.29% (p=0.000 n=20)
4k         9.866µ ± 0%   7.253µ ± 0%  -26.48% (p=0.000 n=20)
256k       7.044µ ± 0%   7.793µ ± 0%  +10.64% (p=0.000 n=20)
16m        3.523m ± 6%   3.707m ± 5%        ~ (p=0.602 n=20)
1g         209.3m ± 1%   211.3m ± 1%   +0.93% (p=0.035 n=20)
geomean    305.1µ        289.9µ        -4.97%

        │ memcpyScalar │              memcpySIMD              │
        │     B/s      │     B/s       vs base                │
64        4.516Gi ± 0%   5.035Gi ± 0%  +11.48% (p=0.000 n=20)
4k        23.60Gi ± 0%   32.10Gi ± 0%  +36.02% (p=0.000 n=20)
256k      33.05Gi ± 0%   29.88Gi ± 0%   -9.62% (p=0.000 n=20)
16m       4.230Gi ± 5%   4.020Gi ± 5%        ~ (p=0.602 n=20)
1g        4.556Gi ± 1%   4.514Gi ± 1%   -0.92% (p=0.035 n=20)
geomean   9.255Gi        9.739Gi        +5.23%
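
The sec/op and B/s tables describe the same runs, so one can be derived from the other. The sketch below, with hypothetical helper names, shows two relationships: a change in time implies the reciprocal change in throughput, and the product of sec/op and B/s gives the number of bytes moved per op. It assumes the "Gi" suffix means 2^30, which the source does not state explicitly.

```python
GIB = 2**30  # assumption: "Gi" in the B/s table is a binary prefix


def throughput_change(time_change_pct):
    """Percent change in B/s implied by a percent change in sec/op."""
    return (1.0 / (1.0 + time_change_pct / 100.0) - 1.0) * 100.0


def bytes_per_op(sec_per_op, gib_per_sec):
    """Bytes processed in one op, from the two table columns."""
    return sec_per_op * gib_per_sec * GIB


# Cortex-A76, 256k row: sec/op rose by 10.64%.
slowdown = throughput_change(10.64)

# Cortex-A76, 64-byte row, scalar column: 51.55 microseconds at 4.516 GiB/s.
work_64 = bytes_per_op(51.55e-6, 4.516)

print(f"256k throughput change: {slowdown:+.2f}%")
print(f"bytes per op (64 row):  {work_64:,.0f}")
```

With these inputs the 64-byte row works out to roughly 250,000 bytes per op, which suggests that one op performs many small copies rather than a single 64-byte copy. That would explain why the 64-byte rows show a larger sec/op than the 4k rows, but it is an inference from the tables and not something the source confirms.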

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
        │ memcpyScalar │             memcpySIMD             │
        │    sec/op    │   sec/op     vs base               │
64         67.58µ ± 0%   64.87µ ± 0%  -4.00% (p=0.000 n=20)
4k         14.42µ ± 0%   14.43µ ± 0%       ~ (p=0.478 n=20)
256k       14.68µ ± 1%   14.76µ ± 1%       ~ (p=0.192 n=20)
16m        1.513m ± 1%   1.500m ± 1%       ~ (p=0.301 n=20)
1g         86.77m ± 2%   87.08m ± 1%       ~ (p=0.640 n=20)
geomean    284.9µ        282.7µ       -0.78%

        │ memcpyScalar │             memcpySIMD              │
        │     B/s      │     B/s       vs base               │
64        3.445Gi ± 0%   3.589Gi ± 0%  +4.17% (p=0.000 n=20)
4k        16.15Gi ± 0%   16.14Gi ± 0%       ~ (p=0.478 n=20)
256k      15.86Gi ± 1%   15.77Gi ± 1%       ~ (p=0.192 n=20)
16m       9.850Gi ± 1%   9.931Gi ± 1%       ~ (p=0.301 n=20)
1g        10.99Gi ± 2%   10.95Gi ± 1%       ~ (p=0.640 n=20)
geomean   9.909Gi        9.987Gi       +0.78%
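
The "~" entries mark rows where no delta is reported. The tables resemble benchstat output, and assuming benchstat's default significance threshold of 0.05 (an assumption, since the source does not give the settings used), the rule can be sketched as follows. The function name is hypothetical.

```python
ALPHA = 0.05  # assumed threshold; not stated in the source


def is_significant(p_value, alpha=ALPHA):
    """True when the p-value is small enough for a delta to be reported."""
    return p_value <= alpha


# (row label, p-value) pairs copied from the tables above.
rows = [
    ("Cortex-A76 1g", 0.035),
    ("Cortex-A76 16m", 0.602),
    ("Cortex-A78C 64", 0.000),
    ("Cortex-A78C 4k", 0.478),
]

for label, p in rows:
    marker = "delta reported" if is_significant(p) else "~"
    print(f"{label:16s} p={p:.3f} -> {marker}")
```

Under this reading, the Cortex-A78C results show a measurable difference only at the 64-byte size, and the Cortex-A76 1g row (p=0.035) just clears the threshold.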
Test Plan

No regressions were noticed in the test suite; all tests pass.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 58975
Build 55862: arc lint + arc unit