libc: scalar memchr() in RISC-V assembly
Needs ReviewPublic
Actions

Authored by strajabot on Jul 18 2024, 6:04 PM.

Details

Reviewers

fuz
emaste

Summary

Added an optimized memchr() implementation in RISC-V assembly and updated
the relevant manpage.

        │ memchr_baseline │            memchr_scalar            │
        │     sec/op      │   sec/op     vs base                │
Short         636.6µ ± 1%   495.9µ ± 1%  -22.10% (p=0.000 n=20)
Mid           279.7µ ± 1%   224.1µ ± 1%  -19.87% (p=0.000 n=20)
Long          138.8µ ± 0%   124.9µ ± 0%  -10.00% (p=0.000 n=20)
geomean       291.3µ        240.3µ       -17.48%

        │ memchr_baseline │            memchr_scalar             │
        │       B/s       │     B/s       vs base                │
Short        187.3Mi ± 1%   240.4Mi ± 1%  +28.37% (p=0.000 n=20)
Mid          426.2Mi ± 1%   531.9Mi ± 1%  +24.79% (p=0.000 n=20)
Long         859.0Mi ± 0%   954.4Mi ± 0%  +11.11% (p=0.000 n=20)
geomean      409.3Mi        496.0Mi       +21.19%

Test Plan

Tested using the in-tree test suite

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 59002
Build 55889: arc lint + arc unit

Event Timeline

strajabot created this revision.Jul 18 2024, 6:04 PM

Herald added a subscriber: imp. · View Herald TranscriptJul 18 2024, 6:04 PM

strajabot requested review of this revision.Jul 18 2024, 6:04 PM

Harbormaster completed remote builds in B58705: Diff 141081.Jul 18 2024, 6:04 PM

Looks reasonable. Make sure the code works with phony buffer lengths such as SIZE_MAX.

lib/libc/riscv/string/memchr.S
35–36	Could you somehow avoid having to deal with two counters here? Each addition is 20% of the full loop, so if you get rid of one, that's a 20% speedup. Also consider completely peeling the loop or using Duff's device.
120	Avoid depending on the previous instruction.

libc: unroll loop in RISC-V memchr()

Refactor of memchr to remove the case where len < 8 and to unroll the main loop. Removes many data dependencies between the instructions to improve performance on small strings.

        │ memchr_baseline │            memchr_scalar            │              memchr_uroll              │
        │     sec/op      │   sec/op     vs base                │   sec/op     vs base                   │
Short         636.6µ ± 1%   495.9µ ± 1%  -22.10% (p=0.000 n=20)   497.5µ ± 2%  -21.86% (p=0.000 n=20+22)
Mid           279.7µ ± 1%   224.1µ ± 1%  -19.87% (p=0.000 n=20)   220.0µ ± 2%  -21.35% (p=0.000 n=20+22)
Long          138.8µ ± 0%   124.9µ ± 0%  -10.00% (p=0.000 n=20)   117.3µ ± 2%  -15.45% (p=0.000 n=20+22)
geomean       291.3µ        240.3µ       -17.48%                  234.2µ       -19.60%

        │ memchr_baseline │            memchr_scalar             │
        │       B/s       │     B/s       vs base                │
Short        187.3Mi ± 1%   240.4Mi ± 1%  +28.37% (p=0.000 n=20)
Mid          426.2Mi ± 1%   531.9Mi ± 1%  +24.79% (p=0.000 n=20)
Long         859.0Mi ± 0%   954.4Mi ± 0%  +11.11% (p=0.000 n=20)
geomean      409.3Mi        496.0Mi       +21.19%

        │ memchr_uroll │
        │    MiB/s     │
Short       251.3 ± 1%
Mid         568.2 ± 2%
Long       1.065k ± 2%
geomean     533.8

Harbormaster completed remote builds in B59002: Diff 142041.Aug 13 2024, 2:26 PM