string: add memccpy, strncat scalar, baseline implementations
Closed, Public

Authored by fuz on Dec 4 2023, 10:17 PM.

Details

Reviewers

kib
mjg

Commits

rGddab9e646122: lib/libc/amd64/string: implement strncat() by calling strlen(), memccpy()
rGa3ce82e5b887: lib/libc/amd64/string: add memccpy scalar, baseline implementation
rGbd051ed3fed7: share/man/man7/simd.7: document simd-enhanced memccpy, strncat
rGea7b13771cc9: lib/libc/amd64/string: implement strncat() by calling strlen(), memccpy()
rGfc0e38a7a67a: lib/libc/amd64/string: add memccpy scalar, baseline implementation
rG5fa0fbf40b11: share/man/man7/simd.7: document simd-enhanced memccpy, strncat

Summary

Based on the strlcpy code from D42863, this revision adds a SIMD-enhanced
(baseline) implementation of memccpy for amd64. A scalar implementation
that calls memchr and memcpy is provided as well. strncat is then
reimplemented on top of strlen and memccpy, so that it benefits from the
enhanced implementations of both.
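The composition described above can be sketched in portable C. This is
only an illustration under stated assumptions, not the committed code:
the names sketch_memccpy and sketch_strncat are hypothetical, and the
actual amd64 implementations may be structured differently.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: memccpy expressed through memchr and memcpy.
 * Copies up to len bytes and stops after the first byte equal to c.
 */
static void *
sketch_memccpy(void *restrict dst, const void *restrict src, int c, size_t len)
{
	const char *hit = memchr(src, c, len);

	if (hit != NULL) {
		/* Copy everything up to and including the matching byte. */
		size_t n = (size_t)(hit - (const char *)src) + 1;

		memcpy(dst, src, n);
		return ((char *)dst + n);
	}

	/* No match within len bytes: copy all of them and report that. */
	memcpy(dst, src, len);
	return (NULL);
}

/*
 * Hypothetical sketch: strncat expressed through strlen and memccpy.
 * As with strncat, dst must have room for len + 1 additional bytes.
 */
static char *
sketch_strncat(char *restrict dst, const char *restrict src, size_t len)
{
	char *tail = dst + strlen(dst);

	/* Stop at the NUL of src; if found, it has already been copied. */
	if (sketch_memccpy(tail, src, '\0', len) == NULL)
		tail[len] = '\0';	/* src was truncated: terminate by hand */
	return (dst);
}
```

With this structure, any speedup in strlen, memchr, memcpy, or memccpy
carries over to strncat without strncat needing a copy loop of its own.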

Please note that this code does not behave exactly like the C
implementation of memccpy for overlapping inputs. However, overlapping
inputs are not allowed for this function by ISO/IEC 9899:1999, and the
C code makes no attempt to handle them either: it simply proceeds
byte by byte, which may or may not do the expected thing for some
overlaps. memccpy(3) does not document whether overlapping inputs are
supported.
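To make the overlap caveat concrete, here is a minimal byte-by-byte
memccpy in the spirit of a plain C implementation. The name
bytewise_memccpy is hypothetical and the code is a sketch, not a copy of
the libc source; restrict is deliberately left off so that the
overlapping call in the usage note is well defined for this loop.

```c
#include <stddef.h>

/*
 * Hypothetical sketch of a byte-by-byte memccpy. Each byte is stored
 * before the next one is loaded, so with overlapping buffers a later
 * load may observe an earlier store.
 */
static void *
bytewise_memccpy(void *dst, const void *src, int c, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	unsigned char stop = (unsigned char)c;

	while (len-- > 0) {
		unsigned char b = *s++;

		*d++ = b;
		if (b == stop)
			return (d);	/* points just past the copied c */
	}
	return (NULL);			/* c not found in the first len bytes */
}
```

For example, with a buffer holding "abcd", calling
bytewise_memccpy(buf + 1, buf, 'z', 3) rereads bytes it has just written
and leaves "aaaa" in the buffer. An implementation that loads a whole
block before storing it is free to produce a different result, which is
why the two need not agree on overlapping inputs.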

New unit tests are added to cover memccpy in more detail.

The baseline implementation is up to about 22x faster than the C code
(Long benchmark below). The scalar implementation performs well, too,
except for very short strings, where it is about 18% slower than the C
code.

os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
        │ memccpy.pre.out │          memccpy.scalar.out          │        memccpy.baseline.out         │
        │     sec/op      │    sec/op     vs base                │   sec/op     vs base                │
Short         92.24µ ± 0%   109.23µ ± 0%  +18.42% (p=0.000 n=20)   66.93µ ± 0%  -27.44% (p=0.000 n=20)
Mid          52.091µ ± 0%   16.617µ ± 1%  -68.10% (p=0.000 n=20)   8.008µ ± 1%  -84.63% (p=0.000 n=20)
Long         80.934µ ± 0%   11.611µ ± 0%  -85.65% (p=0.000 n=20)   3.577µ ± 0%  -95.58% (p=0.000 n=20)
geomean       72.99µ         27.62µ       -62.16%                  12.42µ       -82.98%

        │ memccpy.pre.out │           memccpy.scalar.out           │          memccpy.baseline.out           │
        │       B/s       │      B/s       vs base                 │      B/s       vs base                  │
Short        1.262Gi ± 0%    1.066Gi ± 0%   -15.55% (p=0.000 n=20)    1.739Gi ± 0%    +37.82% (p=0.000 n=20)
Mid          2.235Gi ± 0%    7.006Gi ± 1%  +213.49% (p=0.000 n=20)   14.537Gi ± 1%   +550.49% (p=0.000 n=20)
Long         1.438Gi ± 0%   10.026Gi ± 0%  +597.03% (p=0.000 n=20)   32.550Gi ± 0%  +2162.92% (p=0.000 n=20)
geomean      1.595Gi         4.215Gi       +164.25%                   9.371Gi        +487.59%

Test Plan

Passes the newly added unit tests, and there are no new failures in the
rest of the Kyua test suite.