lib/libc/amd64/string: add stpncpy scalar, baseline implementation
Closed, Public

Authored by fuz on Nov 9 2023, 5:43 AM.

Summary

This was surprisingly annoying to get right, despite being such a simple
function. A scalar implementation is also provided; it simply calls into
our optimised memchr(), memcpy(), and memset() routines to carry out its
job.
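
For illustration, here is a minimal C sketch of that strategy. The scalar
code under review is assembly (stpncpy.S); stpncpy_sketch is a hypothetical
name used only here.

    #include <string.h>

    char *
    stpncpy_sketch(char *restrict dst, const char *restrict src, size_t len)
    {
            const char *nul = memchr(src, '\0', len);
            size_t n = (nul != NULL) ? (size_t)(nul - src) : len;

            memcpy(dst, src, n);            /* copy the string body */
            memset(dst + n, '\0', len - n); /* NUL-pad the remainder */

            return (dst + n);               /* first NUL, or dst + len */
    }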

The unit test for stpncpy has been extended significantly and now covers
the function quite well.

I'm quite happy with the performance. glibc only beats us for very long
strings, likely due to its use of AVX-512. The scalar implementation,
being built on those library routines, has high call overhead to begin
with, but performs acceptably for the amount of effort that went into it
and still beats the old C code, except for very short strings.

os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
        │ stpncpy.pre.out │          stpncpy.scalar.out          │        stpncpy.baseline.out         │
        │     sec/op      │    sec/op     vs base                │   sec/op     vs base                │
Short        106.88µ ± 0%   134.08µ ± 0%  +25.45% (p=0.000 n=20)   75.95µ ± 0%  -28.94% (p=0.000 n=20)
Mid          49.292µ ± 0%   18.558µ ± 0%  -62.35% (p=0.000 n=20)   9.911µ ± 1%  -79.89% (p=0.000 n=20)
Long         48.106µ ± 0%   11.614µ ± 0%  -75.86% (p=0.000 n=20)   3.606µ ± 0%  -92.50% (p=0.000 n=20)
geomean       63.28µ         30.69µ       -51.51%                  13.95µ       -77.96%

        │ stpncpy.pre.out │           stpncpy.scalar.out           │          stpncpy.baseline.out           │
        │       B/s       │      B/s       vs base                 │      B/s       vs base                  │
Short       1115.4Mi ± 0%    889.1Mi ± 0%   -20.29% (p=0.000 n=20)   1569.5Mi ± 0%    +40.72% (p=0.000 n=20)
Mid          2.362Gi ± 0%    6.273Gi ± 0%  +165.61% (p=0.000 n=20)   11.746Gi ± 1%   +397.36% (p=0.000 n=20)
Long         2.420Gi ± 0%   10.024Gi ± 0%  +314.22% (p=0.000 n=20)   32.284Gi ± 0%  +1234.05% (p=0.000 n=20)
geomean      1.840Gi         3.794Gi       +106.22%                   8.345Gi        +353.66%

os: Linux
arch: x86_64
cpu:
        │ stpncpy.glibc.out │
        │      sec/op       │
Short           153.9µ ± 0%
Mid             11.96µ ± 1%
Long            2.623µ ± 0%
geomean         16.90µ

        │ stpncpy.glibc.out │
        │        B/s        │
Short          774.6Mi ± 0%
Mid            9.731Gi ± 1%
Long           44.39Gi ± 0%
geomean        6.888Gi

Sponsored by: The FreeBSD Foundation

Test Plan

Passes the extended unit tests; no new test suite failures.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

fuz requested review of this revision. Nov 9 2023, 5:43 AM
fuz planned changes to this revision. Nov 14 2023, 6:37 AM

I found a possible integer overflow bug in this code. While the overflow cannot happen with valid inputs, it could cause invalid inputs (specifically, buffer sizes very close to SIZE_MAX) to silently behave as if the buffer were very short instead of crashing. I'll put something in to catch that case.

lib/libc/amd64/string/stpncpy.S
110 ↗(On Diff #129885)

This can overflow. I'll have to fix the code.
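
As a hypothetical illustration of this class of bug (not the diff's actual
expression): rounding a length near SIZE_MAX up to a multiple of 16 wraps
around to a tiny value.

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
            size_t len = SIZE_MAX - 5;
            size_t rounded = (len + 15) & ~(size_t)15;      /* wraps to 0 */

            printf("%zu rounds to %zu\n", len, rounded);
            return (0);
    }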

  • lib/libc/amd64/string/stpncpy.S: reduce loop-carried dependency on r10 (+5.01%)
  • lib/libc/amd64/string/stpncpy.S: fix possible integer overflow (-0.08%)
  • lib/libc/amd64/string/stpncpy.S: reduce loop-carried dependencies in tail loops

This fixes the bug and adds some minor performance improvements
I missed in my initial optimisation pass through the function.
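
For context: a loop-carried dependency forces each iteration to wait on a
value the previous iteration produced. A simplified C analogue of the kind
of restructuring the first update above refers to (the real change is in
assembly; this is not the diff's code):

    #include <string.h>

    static void
    copy16(char *dst, const char *src)
    {
            memcpy(dst, src, 16);   /* stand-in for a 16-byte SIMD copy */
    }

    /* Before: dst and src each feed their own update every iteration,
     * so the loop carries two pointer-increment dependency chains. */
    void
    copy_v1(char *dst, const char *src, size_t n)
    {
            while (n >= 16) {
                    copy16(dst, src);
                    dst += 16;
                    src += 16;
                    n -= 16;
            }
    }

    /* After: a single index carries the loop, and each address is
     * formed as base + index, which x86 addressing modes compute off
     * the critical path. */
    void
    copy_v2(char *dst, const char *src, size_t n)
    {
            size_t i;

            for (i = 0; i + 16 <= n; i += 16)
                    copy16(dst + i, src + i);
    }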

lib/libc/amd64/string/stpncpy.S: optimize further (+1.65%)

  • enter the tail with 1--16 bytes left instead of 0--15 bytes left
  • this has a 1/16 chance of reducing the iteration count by 1
  • do the same when clearing out the rest of the destination (see the sketch below)
  • as we clear the possibly unaligned tail ahead of time, the last iteration can be omitted if exactly 16 bytes remain
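
A hedged sketch of the tail trick from the bullets above, applied to the
clearing step (assumes at least 16 bytes in total; zero_tail is an
illustrative name, not the diff's code):

    #include <emmintrin.h>
    #include <stddef.h>

    static void
    zero_tail(char *dst, size_t len)        /* requires len >= 16 */
    {
            const __m128i zero = _mm_setzero_si128();

            /* Loop while MORE than 16 bytes remain... */
            while (len > 16) {
                    _mm_storeu_si128((__m128i *)dst, zero);
                    dst += 16;
                    len -= 16;
            }

            /* ...so 1--16 bytes are left and one final, possibly
             * overlapping store covers them.  When len is an exact
             * multiple of 16, this saves an iteration over looping
             * while len >= 16 and then handling a 0--15 byte tail. */
            _mm_storeu_si128((__m128i *)(dst + len - 16), zero);
    }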