libc: scalar memcpy() in RISC-V assembly
Needs Review · Public

Authored by strajabot on Jul 25 2024, 8:41 PM.

Details

Reviewers

fuz
emaste

Summary

Optimized assembly implementation of memcpy() for the RISC-V architecture.
The implementation has two paths:

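An aligned path, taken when (dst - src) % 8 == 0, i.e. when dst and src share the same offset within an 8-byte word. This path runs faster.
An unaligned path, taken when (dst - src) % 8 != 0. This path runs slower.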
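
The path selection described above can be illustrated in C. This is only a minimal sketch under stated assumptions, not the assembly under review: the names same_word_offset() and sketch_memcpy() are hypothetical, the fixed-size memcpy() calls merely stand in for 64-bit load/store instructions, and the unaligned path is simplified to a byte loop, which the real implementation presumably does not do.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: nonzero when dst and src have the same offset
 * within an 8-byte word, i.e. (dst - src) % 8 == 0.
 */
static int
same_word_offset(const void *dst, const void *src)
{
	return ((((uintptr_t)dst - (uintptr_t)src) % 8) == 0);
}

/* Hypothetical sketch of the two paths; not the code under review. */
static void *
sketch_memcpy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (same_word_offset(d, s)) {
		/* Head: copy bytes until dst (and therefore src) is 8-byte aligned. */
		while (len > 0 && ((uintptr_t)d % 8) != 0) {
			*d++ = *s++;
			len--;
		}
		/* Body: one 64-bit word per iteration. */
		while (len >= 8) {
			uint64_t w;

			memcpy(&w, s, sizeof(w));	/* stands in for a 64-bit load */
			memcpy(d, &w, sizeof(w));	/* stands in for a 64-bit store */
			d += 8;
			s += 8;
			len -= 8;
		}
	}
	/*
	 * Tail of the aligned path, and the entire copy on the unaligned
	 * path (simplified to a byte loop in this sketch).
	 */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
	return (dst);
}
```

Because the test is on the difference of the two pointers, a single byte-wise head loop aligns both of them at once, which is what makes whole-word copies possible on the aligned path.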

os: FreeBSD
arch: riscv
           │ memcpy_baseline │            memcpy_scalar            │
           │     sec/op      │   sec/op     vs base                │
64Align8         851.6µ ± 1%   488.9µ ± 1%  -42.59% (p=0.000 n=12)
4kAlign8         681.5µ ± 1%   255.1µ ± 2%  -62.57% (p=0.000 n=12)
256kAlign8       273.0µ ± 2%   230.7µ ± 2%  -15.50% (p=0.000 n=12)
16mAlign8        98.07m ± 0%   95.29m ± 0%   -2.84% (p=0.000 n=12)
64UAlign         887.5µ ± 1%   531.6µ ± 1%  -40.10% (p=0.000 n=12)
4kUAlign         725.6µ ± 1%   262.2µ ± 1%  -63.87% (p=0.000 n=12)
256kUAlign       844.1µ ± 2%   322.8µ ± 0%  -61.76% (p=0.000 n=12)
16mUAlign        134.9m ± 0%   101.2m ± 0%  -24.97% (p=0.000 n=20)
geomean          2.410m        1.371m       -43.12%
 
           │ memcpy_baseline │            memcpy_scalar             │
           │      MiB/s      │    MiB/s     vs base                 │
64Align8          293.6 ± 1%    511.3 ± 1%   +74.18% (p=0.000 n=12)
4kAlign8          366.8 ± 1%    980.0 ± 2%  +167.15% (p=0.000 n=12)
256kAlign8        915.8 ± 2%   1083.7 ± 2%   +18.34% (p=0.000 n=12)
16mAlign8         163.1 ± 0%    167.9 ± 0%    +2.92% (p=0.000 n=12)
64UAlign          281.7 ± 1%    470.3 ± 1%   +66.94% (p=0.000 n=12)
4kUAlign          344.5 ± 1%    953.6 ± 1%  +176.77% (p=0.000 n=12)
256kUAlign        296.2 ± 2%    774.5 ± 0%  +161.49% (p=0.000 n=12)
16mUAlign         118.6 ± 0%    158.1 ± 0%   +33.28% (p=0.000 n=20)
geomean           293.4         515.8        +75.81%

Test Plan

Tested using the in-tree test suite

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 58833
Build 55720: arc lint + arc unit

Event Timeline

strajabot created this revision. Jul 25 2024, 8:41 PM

Herald added a subscriber: imp. Jul 25 2024, 8:41 PM

strajabot requested review of this revision. Jul 25 2024, 8:41 PM

Harbormaster completed remote builds in B58833: Diff 141397. Jul 25 2024, 8:41 PM

Quite happy with the code. Here are some suggestions for further things you could try.

lib/libc/riscv/string/memcpy.S
26	It might be faster to just compare len with 7 and branch to a special code path for such short lengths. This case should be rare (so the branch is easily predicted), and the special path should still be faster than handling it on the common path, as we don't have to go through the main loop and the tail Duff's device.
102–121	It might be worth unrolling this one, too. Your current code consists mostly of instructions that depend on each other. If you process two iterations at a time, chances are the CPU can execute two instructions in parallel most of the time.
247–268	No need to duplicate the code, I think.
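
The unrolling suggestion for lines 102–121 can be illustrated generically in C. This is a sketch under assumptions, not a rewrite of the loop under review (which is not shown here): copy_words_unrolled2() is a hypothetical name, and the fixed-size memcpy() calls stand in for 64-bit loads and stores. The point is only that the two loads in one iteration do not depend on each other, so a CPU that can issue two instructions per cycle has independent work available.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: copy nwords 64-bit words from s to d, handling
 * two words per loop iteration.
 */
static void
copy_words_unrolled2(unsigned char *d, const unsigned char *s, size_t nwords)
{
	while (nwords >= 2) {
		uint64_t a, b;

		/* Two loads with no data dependency between them. */
		memcpy(&a, s, sizeof(a));
		memcpy(&b, s + 8, sizeof(b));
		/* Two stores, each depending only on its own load. */
		memcpy(d, &a, sizeof(a));
		memcpy(d + 8, &b, sizeof(b));
		s += 16;
		d += 16;
		nwords -= 2;
	}
	/* At most one word is left over when nwords was odd. */
	if (nwords == 1) {
		uint64_t a;

		memcpy(&a, s, sizeof(a));
		memcpy(d, &a, sizeof(a));
	}
}
```

In C the compiler is free to reorder or re-roll such a loop, so the sketch only shows the dependency structure. In hand-written assembly the instruction order is what the author chooses, and whether unrolling pays off would need to be confirmed by benchmarking on the target hardware.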