Based on the strlcpy code from D42863, this DR adds a SIMD-enhanced
implementation of memccpy for amd64. A scalar implementation that calls
memchr and memcpy is provided, too. strncat is then reimplemented on top
of strlen and memccpy, allowing it to benefit from the enhanced
implementations.
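
For illustration only, the scalar approach and the strncat rewrite could
look roughly like the following. This is a minimal sketch and not the
committed code: the names sketch_memccpy and sketch_strncat are
hypothetical, chosen to avoid clashing with libc, and the actual
implementation may be structured differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name; stands in for a scalar memccpy built on memchr/memcpy. */
void *
sketch_memccpy(void *restrict dst, const void *restrict src, int c, size_t len)
{
	/* memchr compares against (unsigned char)c, as memccpy requires. */
	const char *hit = memchr(src, c, len);
	size_t n;

	if (hit == NULL) {
		/* c does not occur in the first len bytes: copy them all. */
		memcpy(dst, src, len);
		return (NULL);
	}

	/* Copy up to and including c, return the byte after it in dst. */
	n = (size_t)(hit - (const char *)src) + 1;
	memcpy(dst, src, n);
	return ((char *)dst + n);
}

/* Hypothetical name; strncat expressed through strlen and memccpy. */
char *
sketch_strncat(char *restrict dst, const char *restrict src, size_t len)
{
	char *end = dst + strlen(dst);
	char *after = sketch_memccpy(end, src, '\0', len);

	/* No NUL within the first len bytes of src: terminate ourselves. */
	if (after == NULL)
		end[len] = '\0';
	return (dst);
}
```

With this shape, any speedup in strlen, memchr, memcpy, or memccpy
carries over to strncat without separate assembly for it.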
Please note that this code does not behave exactly the same as the C
implementation of memccpy for overlapping inputs. However, overlapping
inputs are not allowed for this function by ISO/IEC 9899:1999, and the C
code makes no attempt to handle them either. It just proceeds
byte-by-byte, which may or may not do the expected thing for some
overlaps. We do not document whether overlapping inputs are supported in
memccpy(3).
New unit tests are added to cover memccpy in more detail.
The performance is up to 21x better than the C code. The scalar
implementation is also considerably faster than the C code for medium
and long strings, but about 18% slower for short ones.
os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
        │ memccpy.pre.out │          memccpy.scalar.out          │         memccpy.baseline.out         │
        │     sec/op      │    sec/op     vs base                │    sec/op     vs base                │
Short        92.24µ ± 0%    109.23µ ± 0%  +18.42% (p=0.000 n=20)    66.93µ ± 0%  -27.44% (p=0.000 n=20)
Mid         52.091µ ± 0%    16.617µ ± 1%  -68.10% (p=0.000 n=20)    8.008µ ± 1%  -84.63% (p=0.000 n=20)
Long        80.934µ ± 0%    11.611µ ± 0%  -85.65% (p=0.000 n=20)    3.577µ ± 0%  -95.58% (p=0.000 n=20)
geomean      72.99µ          27.62µ       -62.16%                   12.42µ       -82.98%

        │ memccpy.pre.out │          memccpy.scalar.out           │          memccpy.baseline.out          │
        │       B/s       │      B/s       vs base                │      B/s       vs base                 │
Short        1.262Gi ± 0%    1.066Gi ± 0%   -15.55% (p=0.000 n=20)    1.739Gi ± 0%    +37.82% (p=0.000 n=20)
Mid          2.235Gi ± 0%    7.006Gi ± 1%  +213.49% (p=0.000 n=20)   14.537Gi ± 1%   +550.49% (p=0.000 n=20)
Long         1.438Gi ± 0%   10.026Gi ± 0%  +597.03% (p=0.000 n=20)   32.550Gi ± 0%  +2162.92% (p=0.000 n=20)
geomean      1.595Gi         4.215Gi       +164.25%                   9.371Gi        +487.59%