The scalar implementation is fairly simple and performs only
modestly better than the generic C implementation (about 27% less
time per operation by geometric mean in the results below). It could
be improved by using the same algorithm as for memchr, but that
would have made it a lot more complicated.
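For reference, the semantics that any memrchr implementation has to
preserve can be sketched in portable C as below. This is only an
illustration under assumptions: the name memrchr_simple is
hypothetical, and the code is neither the committed scalar routine
nor the generic C implementation it is compared against.

```c
#include <stddef.h>

/*
 * Hypothetical reference sketch (not the committed code): scan the
 * buffer from its last byte toward its first and return a pointer to
 * the last byte equal to c, or NULL if no such byte exists.
 */
static void *
memrchr_simple(const void *s, int c, size_t n)
{
	const unsigned char *p = s;
	unsigned char ch = (unsigned char)c;

	while (n > 0) {
		n--;			/* index of the byte to examine */
		if (p[n] == ch)
			return ((void *)(p + n));
	}
	return (NULL);
}
```

A byte-at-a-time loop like this is the behaviour an optimized
version must match while examining more than one byte per step.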
The baseline implementation performs well, taking 69% to 92% less
time per operation than the previous code depending on input length,
and is similar to timingsafe_memcmp in the way it operates. See the
usual place for benchmark results:
os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
        │ memrchr.pre.out │          memrchr.scalar.out          │         memrchr.baseline.out         │
        │     sec/op      │    sec/op     vs base                │    sec/op     vs base                │
Short        120.95µ ± 0%    98.08µ ± 0%  -18.90% (p=0.000 n=20)    37.75µ ± 1%  -68.79% (p=0.000 n=20)
Mid          74.374µ ± 0%   48.394µ ± 0%  -34.93% (p=0.000 n=20)    9.120µ ± 0%  -87.74% (p=0.000 n=20)
Long         52.181µ ± 0%   38.607µ ± 0%  -26.01% (p=0.000 n=20)    4.110µ ± 0%  -92.12% (p=0.000 n=20)
geomean       77.72µ         56.80µ       -26.91%                   11.23µ       -85.55%

        │ memrchr.pre.out │          memrchr.scalar.out           │          memrchr.baseline.out           │
        │       B/s       │      B/s       vs base                │      B/s        vs base                 │
Short        985.6Mi ± 0%   1215.4Mi ± 0%  +23.31% (p=0.000 n=20)    3158.2Mi ± 1%   +220.42% (p=0.000 n=20)
Mid          1.565Gi ± 0%    2.406Gi ± 0%  +53.68% (p=0.000 n=20)    12.765Gi ± 0%   +715.52% (p=0.000 n=20)
Long         2.231Gi ± 0%    3.015Gi ± 0%  +35.16% (p=0.000 n=20)    28.323Gi ± 0%  +1169.56% (p=0.000 n=20)
geomean      1.498Gi         2.050Gi       +36.82%                    10.37Gi        +592.26%
New unit tests to cover this function are provided, too.