string: add stpncpy scalar, baseline implementation
Closed · Public

Authored by fuz on Nov 9 2023, 5:43 AM.

Details

Reviewers

kib
mjg

Commits

rG76f9afcdcfa4: lib/libc/amd64/string: implement strncpy() by calling stpncpy()
rG7527fecbfe0c: lib/libc/amd64/string: add stpncpy scalar, baseline implementation
rG57da891ad354: share/man/man7/simd.7: document simd-enhanced strncpy, stpncpy
rG438a1ff803a5: lib/libc/tests/string/stpncpy_test.c: extend for upcoming SSE implementation
rGe19d46c80826: lib/libc/amd64/string: implement strncpy() by calling stpncpy()
rG75a9e2250656: share/man/man7/simd.7: document simd-enhanced strncpy, stpncpy
rG90253d49db09: lib/libc/amd64/string: add stpncpy scalar, baseline implementation
rG6fa9e7d87375: lib/libc/tests/string/stpncpy_test.c: extend for upcoming SSE implementation

Summary

This was surprisingly annoying to get right, despite stpncpy() being
such a simple function. A scalar implementation is also provided; it
just calls into our optimised memchr(), memcpy(), and memset() routines
to carry out its job.
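
As a rough illustration of how stpncpy() can be composed from those three routines, here is a minimal C sketch. It is not the code under review (the scalar implementation in this revision lives in stpncpy.S), and the name sketch_stpncpy is hypothetical. It assumes the usual stpncpy() contract: copy at most len bytes, pad with NUL bytes up to len, and return a pointer to the first NUL written, or dst + len if none was written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical illustration, not the implementation from this review:
 * stpncpy() expressed through memchr(), memcpy(), and memset().
 */
char *
sketch_stpncpy(char *restrict dst, const char *restrict src, size_t len)
{
	/* Look for the terminating NUL within the first len bytes of src. */
	const char *end = memchr(src, '\0', len);

	if (end == NULL) {
		/* No NUL in range: copy len bytes and do not terminate. */
		memcpy(dst, src, len);
		return (dst + len);
	}

	size_t n = (size_t)(end - src);

	assert(n < len);
	memcpy(dst, src, n);		/* the string itself */
	memset(dst + n, '\0', len - n);	/* NUL padding up to len bytes */
	return (dst + n);		/* first NUL written to dst */
}
```

Each call makes up to three passes over the data plus three function calls, which is one plausible reading of the fixed overhead visible for short strings.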

The unit test for stpncpy has been extended significantly and now covers
the function quite well.

I'm quite happy with the performance. glibc only beats us for very long
strings, likely due to its use of AVX-512. Because the scalar
implementation is built from calls into memchr(), memcpy(), and
memset(), it has a high fixed overhead, but it then performs acceptably
for the amount of effort that went into it. It still beats the old C
code, except for very short strings.

os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
        │ stpncpy.pre.out │          stpncpy.scalar.out          │        stpncpy.baseline.out         │
        │     sec/op      │    sec/op     vs base                │   sec/op     vs base                │
Short        106.88µ ± 0%   134.08µ ± 0%  +25.45% (p=0.000 n=20)   75.95µ ± 0%  -28.94% (p=0.000 n=20)
Mid          49.292µ ± 0%   18.558µ ± 0%  -62.35% (p=0.000 n=20)   9.911µ ± 1%  -79.89% (p=0.000 n=20)
Long         48.106µ ± 0%   11.614µ ± 0%  -75.86% (p=0.000 n=20)   3.606µ ± 0%  -92.50% (p=0.000 n=20)
geomean       63.28µ         30.69µ       -51.51%                  13.95µ       -77.96%

        │ stpncpy.pre.out │           stpncpy.scalar.out           │          stpncpy.baseline.out           │
        │       B/s       │      B/s       vs base                 │      B/s       vs base                  │
Short       1115.4Mi ± 0%    889.1Mi ± 0%   -20.29% (p=0.000 n=20)   1569.5Mi ± 0%    +40.72% (p=0.000 n=20)
Mid          2.362Gi ± 0%    6.273Gi ± 0%  +165.61% (p=0.000 n=20)   11.746Gi ± 1%   +397.36% (p=0.000 n=20)
Long         2.420Gi ± 0%   10.024Gi ± 0%  +314.22% (p=0.000 n=20)   32.284Gi ± 0%  +1234.05% (p=0.000 n=20)
geomean      1.840Gi         3.794Gi       +106.22%                   8.345Gi        +353.66%
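
For readers unfamiliar with this table layout, the "vs base" columns can be reproduced from the sec/op figures. The sketch below assumes the delta is the plain relative change between the two reported values and that throughput is inversely proportional to time for a fixed workload; the function names are hypothetical, and small differences from the table are expected because the printed values are rounded.

```c
#include <assert.h>

/* Percent change of val relative to base; negative means smaller. */
double
pct_change(double base, double val)
{
	assert(base > 0.0);
	return ((val / base - 1.0) * 100.0);
}

/*
 * Throughput delta derived from the same two timings: for a fixed
 * amount of data, B/s scales with 1/time.
 */
double
throughput_pct_change(double base_time, double new_time)
{
	assert(base_time > 0.0 && new_time > 0.0);
	return ((base_time / new_time - 1.0) * 100.0);
}
```

With the Short row, pct_change(106.88, 75.95) gives about -28.94 and throughput_pct_change(106.88, 75.95) gives about +40.72, in line with the baseline columns above.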

os: Linux
arch: x86_64
cpu:
        │ stpncpy.glibc.out │
        │      sec/op       │
Short           153.9µ ± 0%
Mid             11.96µ ± 1%
Long            2.623µ ± 0%
geomean         16.90µ

        │ stpncpy.glibc.out │
        │        B/s        │
Short          774.6Mi ± 0%
Mid            9.731Gi ± 1%
Long           44.39Gi ± 0%
geomean        6.888Gi

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 54424
Build 51314: arc lint + arc unit

Event Timeline

fuz created this revision. Nov 9 2023, 5:43 AM

Herald added a subscriber: imp. Nov 9 2023, 5:43 AM

fuz requested review of this revision. Nov 9 2023, 5:43 AM

Harbormaster completed remote builds in B54343: Diff 129885. Nov 9 2023, 5:43 AM

I found a possible integer overflow bug in this code. While the overflow cannot happen with valid inputs, it could cause invalid inputs (specifically, buffer sizes very close to SIZE_MAX) to silently behave as if the buffer was very short instead of crashing. I'll put something in to catch that case.

lib/libc/amd64/string/stpncpy.S
Line 111: This can overflow. I'll have to fix the code.
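
The review does not show which computation overflows, so the following C sketch is only a generic model of the failure mode described above. It assumes, hypothetically, that the code adds a small alignment offset (0 to 15) to the caller's length; both function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: offset plus length with no overflow check.
 * For len close to SIZE_MAX the sum wraps around to a tiny value,
 * so the buffer silently appears to be very short.
 */
size_t
naive_span(size_t off, size_t len)
{
	assert(off < 16);
	return (off + len);
}

/* Same sum, but clamped to SIZE_MAX when the addition wraps. */
size_t
saturating_span(size_t off, size_t len)
{
	assert(off < 16);
	size_t sum = off + len;

	return (sum < len ? SIZE_MAX : sum);
}
```

With off = 15 and len = SIZE_MAX - 3, naive_span() returns 11 while saturating_span() returns SIZE_MAX, so the invalid input keeps looking like a huge buffer instead of a short one.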

lib/libc/amd64/string/stpncpy.S: reduce loop-carried dependency on r10 (+5.01%)
lib/libc/amd64/string/stpncpy.S: fix possible integer overflow (-0.08%)
lib/libc/amd64/string/stpncpy.S: reduce loop-carried dependencies in tail loops

This fixes the bug and adds some minor performance improvements
I missed in my initial optimisation pass through the function.

Harbormaster completed remote builds in B54424: Diff 130084. Nov 14 2023, 5:28 PM

lib/libc/amd64/string/stpncpy.S: optimize further (+1.65%)

Enter the tail with 1--16 bytes left instead of 0--15 bytes left; this has a 1/16 chance of reducing the iteration count by 1.
Do the same when clearing out the rest of the destination: as we clear the possibly unaligned tail ahead of time, the last iteration can be omitted if exactly 16 bytes remain.
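
The iteration-count claim can be checked with a little arithmetic. The C sketch below is a hypothetical model, not code from stpncpy.S: it assumes a main loop that consumes 16 bytes per iteration and leaves the remainder to a tail routine, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Main-loop iterations when the loop leaves 0..15 bytes for the tail. */
size_t
iters_tail_0_15(size_t n)
{
	return (n / 16);
}

/* Main-loop iterations when the loop leaves 1..16 bytes for the tail. */
size_t
iters_tail_1_16(size_t n)
{
	assert(n > 0);
	return ((n - 1) / 16);
}

/* Bytes handed to the tail in the 1..16 scheme. */
size_t
tail_1_16(size_t n)
{
	assert(n > 0);
	return ((n - 1) % 16 + 1);
}
```

The two counts differ only when n is a nonzero multiple of 16: for n = 64 the first scheme runs 4 iterations with an empty tail, the second runs 3 iterations with a 16-byte tail. If lengths are spread evenly modulo 16, that is the 1/16 chance mentioned above.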

Harbormaster completed remote builds in B54543: Diff 130380. Nov 21 2023, 9:29 AM

This revision was not accepted when it landed; it landed in state Needs Review. Dec 25 2023, 2:26 PM

Closed by commit rG6fa9e7d87375: lib/libc/tests/string/stpncpy_test.c: extend for upcoming SSE implementation (authored by fuz).

This revision was automatically updated to reflect the committed changes.

fuz added a commit: rG6fa9e7d87375: lib/libc/tests/string/stpncpy_test.c: extend for upcoming SSE implementation.

fuz added a commit: rG90253d49db09: lib/libc/amd64/string: add stpncpy scalar, baseline implementation.

fuz added a commit: rG75a9e2250656: share/man/man7/simd.7: document simd-enhanced strncpy, stpncpy.

fuz added a commit: rGe19d46c80826: lib/libc/amd64/string: implement strncpy() by calling stpncpy().

fuz added a commit: rG438a1ff803a5: lib/libc/tests/string/stpncpy_test.c: extend for upcoming SSE implementation. Jan 24 2024, 7:46 PM

fuz added a commit: rG57da891ad354: share/man/man7/simd.7: document simd-enhanced strncpy, stpncpy.

fuz added a commit: rG7527fecbfe0c: lib/libc/amd64/string: add stpncpy scalar, baseline implementation.

fuz added a commit: rG76f9afcdcfa4: lib/libc/amd64/string: implement strncpy() by calling stpncpy().

fuz mentioned this in D54169: libc/tests/string: improve stpncpy() "bounds" unit test. Wed, Dec 10, 9:18 PM

fuz added a child revision: D54169: libc/tests/string: improve stpncpy() "bounds" unit test. Wed, Dec 10, 9:19 PM

fuz mentioned this in rG123c08620049: libc/tests/string: improve stpncpy() "bounds" unit test. Sun, Dec 14, 4:08 PM

fuz mentioned this in rG66eb78377bf1: libc/amd64: fix overread conditions in stpncpy().

fuz mentioned this in rG54849107681d: libc/tests/string: improve stpncpy() "bounds" unit test. Thu, Jan 1, 10:22 PM

fuz mentioned this in rGb49401c0bd4c: libc/amd64: fix overread conditions in stpncpy(). Sun, Jan 4, 1:24 PM

fuz mentioned this in rGe2f095779323: libc/tests/string: improve stpncpy() "bounds" unit test. Sun, Jan 4, 1:27 PM

fuz mentioned this in rGb0dc25c6d378: libc/amd64: fix overread conditions in stpncpy().

Revision Contents
Changeset List

Path	Size
lib/libc/amd64/string/Makefile.inc	2 lines
lib/libc/amd64/string/stpncpy.S	288 lines
lib/libc/amd64/string/strncpy.c	41 lines
lib/libc/tests/string/stpncpy_test.c	99 lines
share/man/man7/simd.7	7 lines

Commit	Tree	Parents	Author	Summary	Date
7b8648dcfc84	e1789bdbc03d	7636145b88b8	Robert Clausecker	lib/libc/amd64/string/stpncpy.S: reduce loop-carried dependencies in tail loops	Nov 14 2023, 5:27 PM
7636145b88b8	38d8aee56ec9	0f50359f387c	Robert Clausecker	lib/libc/amd64/string/stpncpy.S: fix possible integer overflow (-0.08%)	Nov 14 2023, 5:14 PM
0f50359f387c	40802f5abc73	e6900b92921c	Robert Clausecker	lib/libc/amd64/string/stpncpy.S: reduce loop-carried dependency on r10 (+5.01%)	Nov 14 2023, 7:31 AM
e6900b92921c	3d7ca9741cbd	311dd701f6af	Robert Clausecker	lib/libc/amd64/string: add stpncpy scalar, baseline implementation	Nov 9 2023, 4:49 AM
311dd701f6af	855849eb9f96	190952225e6c	Robert Clausecker	share/man/man7/simd.7: document simd-enhanced strncpy, stpncpy	Nov 9 2023, 4:39 AM
190952225e6c	b3e2cc03993c	891b04997f9b	Robert Clausecker	lib/libc/amd64/string: implement strncpy() by calling stpncpy()	Nov 9 2023, 4:25 AM
891b04997f9b	6c92b30758c2	93900a55d62b	Robert Clausecker	lib/libc/amd64/string/stpncpy.S: add scalar implementation	Nov 9 2023, 4:19 AM
93900a55d62b	40b8d8097e51	2f9b1770efa5	Robert Clausecker	lib/libc/amd64/string/stpncpy.S: unroll main loop (+5.78%)	Nov 9 2023, 3:21 AM
2f9b1770efa5	86593b82a281	483799dd8cef	Robert Clausecker	lib/libc/amd64/string/stpncpy.S: fix baseline implementation	Nov 9 2023, 12:56 AM
483799dd8cef	a8dd349c1201	d4393d83f8a9	Robert Clausecker	lib/libc/tests/string/stpncpy_test.c: allow testing of external stpncpy…	Nov 9 2023, 12:28 AM
d4393d83f8a9	60bd8aeaaf8e	776085ddf2d0	Robert Clausecker	lib/libc/amd64/string/stpncpy.S: finish and hook up	Nov 9 2023, 12:27 AM
776085ddf2d0	13833476ff9b	a2c9b71e9476	Robert Clausecker	lib/libc/amd64/string/stpncpy.S: further work	Nov 8 2023, 5:16 AM
a2c9b71e9476	b234f2fe4d41	eac42cc661de	Robert Clausecker	lib/libc/tests/string/stpncpy_test.c: extend for upcoming SSE implementation	Nov 5 2023, 4:02 AM
eac42cc661de	bfdc7280d9c2	1368f6bf0982	Robert Clausecker	lib/libc/amd64/string/stpncpy.S: implement runt (rdx<32) case of baseline kernel	Oct 30 2023, 3:15 AM