Conceptually very similar to timingsafe_bcmp(), but with comparison
logic inspired by Elijah Stone's fancy memcmp.  A baseline (SSE)
implementation was omitted this time as I was not able to get it to
perform adequately: the best I got was 8% over the scalar version for
long inputs, and it was slower for short inputs.

Performance is solid, at about 10x that of the generic C
implementation overall:
os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
              │ memcmp.pre.out │          memcmp.amd64.out           │
              │     sec/op     │   sec/op     vs base                │
TsMemcmpShort    189.11µ ± 0%    55.85µ ± 0%  -70.47% (p=0.000 n=20)
TsMemcmpMid      146.47µ ± 0%    10.14µ ± 0%  -93.08% (p=0.000 n=20)
TsMemcmpLong    130.642µ ± 0%    6.608µ ± 0%  -94.94% (p=0.000 n=20)
geomean           153.5µ         15.52µ       -89.89%

              │ memcmp.pre.out │           memcmp.amd64.out            │
              │      B/s       │      B/s       vs base                │
TsMemcmpShort     630.4Mi ± 0%    2134.4Mi ± 0%   +238.60% (p=0.000 n=20)
TsMemcmpMid       813.9Mi ± 0%   11761.9Mi ± 0%  +1345.11% (p=0.000 n=20)
TsMemcmpLong      912.5Mi ± 0%   18039.2Mi ± 0%  +1876.92% (p=0.000 n=20)
geomean           776.5Mi          7.499Gi        +888.99%

As with the timingsafe_bcmp implementation from D41673, care has been
taken to ensure that only instructions that appear on Intel's list of
instructions with data operand independent timing have been used.
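
For readers unfamiliar with the contract, the following is a minimal
portable C sketch of a timing-safe memcmp.  It is not the amd64 assembly
from this change and does not use the vectorised comparison logic
mentioned above; the name ts_memcmp_sketch is hypothetical.  It only
illustrates the idea that every byte is always examined and that the
first differing byte decides the result without any data-dependent
branch or early exit.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only (hypothetical name, not part of this change).
 * Returns -1, 0 or 1 like memcmp(), but always walks all len bytes.
 */
int
ts_memcmp_sketch(const void *b1, const void *b2, size_t len)
{
	const unsigned char *p1 = b1, *p2 = b2;
	int res = 0;	/* result, fixed by the first differing byte */
	int done = 0;	/* becomes 1 once a difference has been seen */
	size_t i;

	for (i = 0; i < len; i++) {
		uint32_t a = p1[i], b = p2[i];

		/* a - b wraps around iff a < b, which sets the top bit. */
		int lt = (int)((a - b) >> 31);
		int gt = (int)((b - a) >> 31);

		/* cmp is 1 if a > b, -1 if a < b, else 0. */
		int cmp = gt - lt;

		/* done - 1 is all ones while !done, else zero. */
		res |= cmp & (done - 1);
		done |= lt | gt;
	}

	return (res);
}
```

Whether a compiler keeps such C code free of data-dependent branches is
not guaranteed, which is one reason to write the routine in assembly
and check each instruction against the list.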
Sponsored by: The FreeBSD Foundation