Page MenuHomeFreeBSD

libc: scalar memchr() in RISC-V assembly
Needs ReviewPublic

Authored by strajabot on Jul 18 2024, 6:04 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Oct 11, 6:44 AM
Unknown Object (File)
Sun, Sep 29, 10:10 PM
Unknown Object (File)
Sat, Sep 21, 9:51 PM
Unknown Object (File)
Sep 9 2024, 1:31 AM
Unknown Object (File)
Sep 8 2024, 2:24 AM
Unknown Object (File)
Sep 7 2024, 8:17 PM
Unknown Object (File)
Aug 28 2024, 10:52 PM
Unknown Object (File)
Aug 20 2024, 7:24 PM
Subscribers

Details

Reviewers
fuz
emaste
Summary

Added an optimized memchr() implementation in RISC-V assembly and updated
the relevant manpage.

        │ memchr_baseline │            memchr_scalar            │
        │     sec/op      │   sec/op     vs base                │
Short         636.6µ ± 1%   495.9µ ± 1%  -22.10% (p=0.000 n=20)
Mid           279.7µ ± 1%   224.1µ ± 1%  -19.87% (p=0.000 n=20)
Long          138.8µ ± 0%   124.9µ ± 0%  -10.00% (p=0.000 n=20)
geomean       291.3µ        240.3µ       -17.48%

        │ memchr_baseline │            memchr_scalar             │
        │       B/s       │     B/s       vs base                │
Short        187.3Mi ± 1%   240.4Mi ± 1%  +28.37% (p=0.000 n=20)
Mid          426.2Mi ± 1%   531.9Mi ± 1%  +24.79% (p=0.000 n=20)
Long         859.0Mi ± 0%   954.4Mi ± 0%  +11.11% (p=0.000 n=20)
geomean      409.3Mi        496.0Mi       +21.19%
Test Plan

Tested using the in-tree test suite

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 59002
Build 55889: arc lint + arc unit

Event Timeline

Looks reasonable. Make sure the code works with phony buffer lengths such as SIZE_MAX.

lib/libc/riscv/string/memchr.S
35–36

Could you somehow avoid having to deal with two counters here? Each addition is 20% of the full loop, so if you get rid of one, that's a 20% speedup.

Also consider completely peeling the loop or using Duff's device.

120

Avoid depending on the previous instruction.

libc: unroll loop in RISC-V memchr()

Refactor of memchr to remove the case where len < 8 and to unroll the main loop. Removes many data dependencies between the instructions to improve performance on small strings.

        │ memchr_baseline │            memchr_scalar            │              memchr_uroll              │
        │     sec/op      │   sec/op     vs base                │   sec/op     vs base                   │
Short         636.6µ ± 1%   495.9µ ± 1%  -22.10% (p=0.000 n=20)   497.5µ ± 2%  -21.86% (p=0.000 n=20+22)
Mid           279.7µ ± 1%   224.1µ ± 1%  -19.87% (p=0.000 n=20)   220.0µ ± 2%  -21.35% (p=0.000 n=20+22)
Long          138.8µ ± 0%   124.9µ ± 0%  -10.00% (p=0.000 n=20)   117.3µ ± 2%  -15.45% (p=0.000 n=20+22)
geomean       291.3µ        240.3µ       -17.48%                  234.2µ       -19.60%

        │ memchr_baseline │            memchr_scalar             │
        │       B/s       │     B/s       vs base                │
Short        187.3Mi ± 1%   240.4Mi ± 1%  +28.37% (p=0.000 n=20)
Mid          426.2Mi ± 1%   531.9Mi ± 1%  +24.79% (p=0.000 n=20)
Long         859.0Mi ± 0%   954.4Mi ± 0%  +11.11% (p=0.000 n=20)
geomean      409.3Mi        496.0Mi       +21.19%

        │ memchr_uroll │
        │    MiB/s     │
Short       251.3 ± 1%
Mid         568.2 ± 2%
Long       1.065k ± 2%
geomean     533.8