This changeset includes a port of the amd64 SIMD implementation of
memccpy to AArch64.
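Performance is significantly better than the scalar implementation for
mid-length and long strings (roughly 2x and 14x to 17x the throughput,
respectively), but short strings are 4% to 27% slower.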
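
For reference, the semantics any memccpy implementation has to preserve
can be sketched with a byte-at-a-time loop. This is only an illustrative
sketch: the name memccpy_ref is hypothetical, and this is not the scalar
implementation that was benchmarked.

```c
#include <stddef.h>

/*
 * Hypothetical reference version, for illustration only.
 * Copies bytes from src to dst until either n bytes have been copied
 * or a byte equal to (unsigned char)c has been copied.
 * Returns a pointer to the byte in dst just past the copied c,
 * or NULL if c does not occur in the first n bytes of src.
 */
void *
memccpy_ref(void *restrict dst, const void *restrict src, int c, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	const unsigned char ch = (unsigned char)c;

	for (size_t i = 0; i < n; i++) {
		d[i] = s[i];		/* the terminating byte is copied too */
		if (s[i] == ch)
			return (d + i + 1);
	}
	return (NULL);			/* c not found within n bytes */
}
```

A vectorized version does the same work a block of bytes at a time,
which is consistent with the gains growing with string length.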
Benchmark results are, as usual, generated by the strperf utility
written by fuz.

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │    sec/op     vs base               │
Short       136.7µ ± 1%    142.4µ ± 0%   +4.11% (p=0.000 n=20)
Mid         69.85µ ± 1%    30.63µ ± 1%  -56.15% (p=0.000 n=20)
Long      112.854µ ± 0%    7.898µ ± 1%  -93.00% (p=0.000 n=20)
geomean     102.5µ         32.53µ       -68.27%

        │ memccpyScalar │              memccpySIMD              │
        │      B/s      │      B/s      vs base                 │
Short      871.9Mi ± 1%   837.4Mi ± 0%     -3.95% (p=0.000 n=20)
Mid        1.667Gi ± 1%   3.801Gi ± 1%   +128.04% (p=0.000 n=20)
Long       1.032Gi ± 0%  14.740Gi ± 1%  +1328.86% (p=0.000 n=20)
geomean    1.135Gi        3.578Gi        +215.14%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │    sec/op     vs base               │
Short       96.73µ ± 1%   122.82µ ± 1%  +26.98% (p=0.000 n=20)
Mid         48.50µ ± 0%    24.62µ ± 0%  -49.23% (p=0.000 n=20)
Long       84.122µ ± 1%    4.961µ ± 0%  -94.10% (p=0.000 n=20)
geomean     73.35µ         24.66µ       -66.37%

        │ memccpyScalar │              memccpySIMD              │
        │      B/s      │      B/s      vs base                 │
Short     1232.5Mi ± 1%   970.6Mi ± 1%    -21.25% (p=0.000 n=20)
Mid        2.400Gi ± 0%   4.728Gi ± 0%    +96.95% (p=0.000 n=20)
Long       1.384Gi ± 1%  23.466Gi ± 0%  +1595.65% (p=0.000 n=20)
geomean    1.587Gi        4.720Gi        +197.38%