
libc: scalar memcpy() in RISC-V assembly
Needs ReviewPublic

Authored by strajabot on Jul 25 2024, 8:41 PM.

Details

Reviewers
fuz
emaste
Summary

Optimized assembly implementation of memcpy() for the RISC-V architecture.
The implementation has two paths:

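  • An aligned path - (dst - src) % 8 == 0, so dst and src can reach 8-byte alignment together; runs faster
  • An unaligned path - (dst - src) % 8 != 0, so at most one of dst and src can be 8-byte aligned at a time; runs slower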
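
The path selection above can be sketched in C. This is only an illustration of the alignment condition, not the code under review: the names same_word_offset, head_bytes and choose_path are hypothetical, and the real implementation is written in RISC-V assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum copy_path { PATH_ALIGNED, PATH_UNALIGNED };

/*
 * Hypothetical helper: true when dst and src have the same offset within
 * an 8-byte word, i.e. (dst - src) % 8 == 0.
 */
static int
same_word_offset(const void *dst, const void *src)
{
	return ((((uintptr_t)dst - (uintptr_t)src) & 7) == 0);
}

/*
 * Hypothetical helper: number of leading bytes to copy before p reaches
 * the next 8-byte boundary (0 if it is already aligned).
 */
static size_t
head_bytes(const void *p)
{
	return ((size_t)(-(uintptr_t)p & 7));
}

/* Hypothetical path selection mirroring the two cases in the summary. */
static enum copy_path
choose_path(const void *dst, const void *src)
{
	if (same_word_offset(dst, src)) {
		/* Copying the same head aligns both pointers at once. */
		assert(head_bytes(dst) == head_bytes(src));
		return (PATH_ALIGNED);
	}
	return (PATH_UNALIGNED);
}
```

On the aligned path, whole 8-byte loads and stores can be used on both sides once the head bytes are copied. On the unaligned path at most one side can be word-aligned, which is presumably why it runs slower.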
os: FreeBSD
arch: riscv
           │ memcpy_baseline │            memcpy_scalar            │
           │     sec/op      │   sec/op     vs base                │
64Align8         851.6µ ± 1%   488.9µ ± 1%  -42.59% (p=0.000 n=12)
4kAlign8         681.5µ ± 1%   255.1µ ± 2%  -62.57% (p=0.000 n=12)
256kAlign8       273.0µ ± 2%   230.7µ ± 2%  -15.50% (p=0.000 n=12)
16mAlign8        98.07m ± 0%   95.29m ± 0%   -2.84% (p=0.000 n=12)
64UAlign         887.5µ ± 1%   531.6µ ± 1%  -40.10% (p=0.000 n=12)
4kUAlign         725.6µ ± 1%   262.2µ ± 1%  -63.87% (p=0.000 n=12)
256kUAlign       844.1µ ± 2%   322.8µ ± 0%  -61.76% (p=0.000 n=12)
16mUAlign        134.9m ± 0%   101.2m ± 0%  -24.97% (p=0.000 n=20)
geomean          2.410m        1.371m       -43.12%
 
           │ memcpy_baseline │            memcpy_scalar             │
           │      MiB/s      │    MiB/s     vs base                 │
64Align8          293.6 ± 1%    511.3 ± 1%   +74.18% (p=0.000 n=12)
4kAlign8          366.8 ± 1%    980.0 ± 2%  +167.15% (p=0.000 n=12)
256kAlign8        915.8 ± 2%   1083.7 ± 2%   +18.34% (p=0.000 n=12)
16mAlign8         163.1 ± 0%    167.9 ± 0%    +2.92% (p=0.000 n=12)
64UAlign          281.7 ± 1%    470.3 ± 1%   +66.94% (p=0.000 n=12)
4kUAlign          344.5 ± 1%    953.6 ± 1%  +176.77% (p=0.000 n=12)
256kUAlign        296.2 ± 2%    774.5 ± 0%  +161.49% (p=0.000 n=12)
16mUAlign         118.6 ± 0%    158.1 ± 0%   +33.28% (p=0.000 n=20)
geomean           293.4         515.8        +75.81%
Test Plan

Tested using the in-tree test suite

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 58833
Build 55720: arc lint + arc unit

Event Timeline

Quite happy with the code. Here are some suggestions for further things you could try.

lib/libc/riscv/string/memcpy.S
26

It might be faster to just compare len with 7 and branch to a special code path for that case. This case should be rare (so the branch is easily predicted), and a dedicated path should still be faster than handling it in a single path, as we don't have to go through the main loop and the tail Duff's device.
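
A rough C sketch of one reading of this suggestion, assuming the special case is a length of at most 7 bytes. The names copy_short and memcpy_sketch are hypothetical, and the general path is a stand-in byte loop rather than the main loop and tail of the actual assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical short path: copies at most 7 bytes with a plain byte loop. */
static void
copy_short(unsigned char *dst, const unsigned char *src, size_t len)
{
	assert(len <= 7);
	while (len-- > 0)
		*dst++ = *src++;
}

/* Hypothetical entry point showing only the early branch. */
static void *
memcpy_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Assumed comparison: short copies skip the main loop and the tail. */
	if (len <= 7) {
		copy_short(d, s, len);
		return (dst);
	}

	/* Stand-in for the general path (main loop plus tail handling). */
	for (size_t i = 0; i < len; i++)
		d[i] = s[i];
	return (dst);
}
```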

102–121

It might be worth unrolling this one, too. Your current code consists mostly of instructions that depend on each other. If you process two iterations at a time, chances are the CPU can execute two instructions in parallel most of the time.
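
To illustrate the idea in C (a sketch, not the loop at these lines, which is not shown here): processing two words per iteration gives two load/store pairs that do not depend on each other, so a CPU that can issue more than one instruction per cycle may overlap them. The name copy_words_unrolled2 is hypothetical, and both pointers are assumed to be 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical word-copy loop, unrolled by two.
 * Assumes dst and src are 8-byte aligned and do not overlap.
 */
static void
copy_words_unrolled2(uint64_t *dst, const uint64_t *src, size_t nwords)
{
	assert(((uintptr_t)dst & 7) == 0 && ((uintptr_t)src & 7) == 0);

	while (nwords >= 2) {
		/* The two loads are independent of each other ... */
		uint64_t a = src[0];
		uint64_t b = src[1];

		/* ... and so are the two stores. */
		dst[0] = a;
		dst[1] = b;

		src += 2;
		dst += 2;
		nwords -= 2;
	}

	/* Odd word count: copy the one remaining word. */
	if (nwords != 0)
		dst[0] = src[0];
}
```

Whether this helps depends on the core; an in-order single-issue core may see little benefit beyond the reduced loop overhead.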

247–268

No need to duplicate the code, I think.