This is a work in progress, currently lacking a method to avoid crossing
into unmapped pages.
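One common way to handle this in vectorized string routines is to check, before issuing a vector load that may read past the end of the buffer, whether that load would run off the end of the current page, and to take a safe path if it would. The following is a minimal sketch of such a check and is not part of this changeset. The 4 KiB page size, the 16-byte vector width, and the names `SKETCH_PAGE_SIZE`, `SKETCH_VEC_SIZE` and `load_crosses_page` are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */
#define SKETCH_VEC_SIZE  16u	/* assumption: 128-bit vector loads */

/*
 * Return true if a SKETCH_VEC_SIZE-byte load starting at p would extend
 * past the end of the page that contains p.
 * Hypothetical helper, not taken from the changeset.
 */
static bool
load_crosses_page(const void *p)
{
	/* Offset of p within its page. */
	uintptr_t off = (uintptr_t)p & (SKETCH_PAGE_SIZE - 1);

	/* The load covers off .. off + SKETCH_VEC_SIZE - 1. */
	return (off > SKETCH_PAGE_SIZE - SKETCH_VEC_SIZE);
}
```

A caller would use the result to choose between the full-width load and a fallback such as a byte loop or a load from an aligned-down address.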
This changeset includes a port of the amd64 SIMD implementation of
memcmp to AArch64.
It also fixes an issue in the existing AArch64 implementation, whose
return value does not match the man page: it only returns -1, 0, or 1
instead of the byte difference.
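For reference, the return-value behaviour described in the man page can be illustrated with a plain scalar loop. This is only a sketch of the expected semantics, not the SIMD code from this changeset, and `ref_memcmp` is a hypothetical name.

```c
#include <stddef.h>

/*
 * Scalar illustration of the documented memcmp semantics: return 0 when
 * the buffers are equal, otherwise the difference between the first pair
 * of differing bytes, each treated as unsigned char.
 * ref_memcmp is a hypothetical name used only for this illustration.
 */
static int
ref_memcmp(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	for (size_t i = 0; i < n; i++) {
		if (p1[i] != p2[i])
			return (p1[i] - p2[i]);
	}
	return (0);
}
```

With these semantics, comparing the single bytes `"a"` and `"d"` yields -3 rather than -1.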
Performance is better than the existing memcmp implementation
borrowed from the Arm Optimized Routines, except for long inputs,
where it takes roughly 6% (Neoverse-V1) to 21% (Cortex-A76) longer
per operation.
```
os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
            │ memcmpScalar │              memcmpARM               │              memcmpSIMD              │
            │    sec/op    │    sec/op     vs base                │    sec/op     vs base                │
MemcmpShort   139.41µ ± 2%    63.96µ ± 1%  -54.12% (p=0.000 n=20)    25.14µ ± 1%  -81.97% (p=0.000 n=20)
MemcmpMid      93.38µ ± 3%    12.09µ ± 1%  -87.05% (p=0.000 n=20)    10.91µ ± 1%  -88.31% (p=0.000 n=20)
MemcmpLong    86.722µ ± 7%    4.648µ ± 1%  -94.64% (p=0.000 n=20)    4.931µ ± 0%  -94.31% (p=0.000 n=20)
geomean        104.1µ         15.32µ       -85.29%                   11.06µ       -89.38%

            │ memcmpScalar │                memcmpARM                │               memcmpSIMD                │
            │     B/s      │      B/s       vs base                  │      B/s       vs base                  │
MemcmpShort   855.1Mi ± 2%   1864.0Mi ± 1%   +117.99% (p=0.000 n=20)   4742.2Mi ± 1%   +454.58% (p=0.000 n=20)
MemcmpMid     1.247Gi ± 3%    9.629Gi ± 1%   +671.89% (p=0.000 n=20)   10.668Gi ± 1%   +755.20% (p=0.000 n=20)
MemcmpLong    1.342Gi ± 6%   25.048Gi ± 1%  +1765.92% (p=0.000 n=20)   23.608Gi ± 0%  +1658.68% (p=0.000 n=20)
geomean       1.118Gi         7.600Gi        +579.66%                   10.53Gi        +841.33%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
            │ memcmpScalar  │              memcmpARM               │              memcmpSIMD              │
            │    sec/op     │    sec/op     vs base                │    sec/op     vs base                │
MemcmpShort    183.12µ ± 0%   101.29µ ± 0%  -44.68% (p=0.000 n=20)    39.96µ ± 0%  -78.18% (p=0.000 n=20)
MemcmpMid      129.12µ ± 0%    24.55µ ± 1%  -80.99% (p=0.000 n=20)    21.48µ ± 0%  -83.37% (p=0.000 n=20)
MemcmpLong    111.374µ ± 0%    6.288µ ± 0%  -94.35% (p=0.000 n=20)    7.593µ ± 0%  -93.18% (p=0.000 n=20)
geomean         138.1µ         25.01µ       -81.89%                   18.68µ       -86.47%

            │ memcmpScalar │                memcmpARM                │               memcmpSIMD                │
            │     B/s      │      B/s       vs base                  │      B/s       vs base                  │
MemcmpShort   651.0Mi ± 0%   1176.9Mi ± 0%    +80.78% (p=0.000 n=20)   2983.1Mi ± 0%   +358.23% (p=0.000 n=20)
MemcmpMid     923.3Mi ± 0%   4856.0Mi ± 1%   +425.95% (p=0.000 n=20)   5551.0Mi ± 0%   +501.23% (p=0.000 n=20)
MemcmpLong    1.045Gi ± 0%   18.514Gi ± 0%  +1671.24% (p=0.000 n=20)   15.332Gi ± 0%  +1366.84% (p=0.000 n=20)
geomean       863.3Mi         4.656Gi        +452.24%                   6.233Gi        +639.33%
```