This is a work in progress: it currently lacks a method to avoid reading
across a page boundary into unmapped pages.
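To illustrate the open problem: a vector load that starts near the end of a page can touch the following page, which may be unmapped. The sketch below shows one common way to test for that case before issuing a wide load. It is not necessarily the guard this change will adopt. The name `load_crosses_page` and the 4 KiB `SKETCH_PAGE_SIZE` are assumptions made for the example only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch; real code would use the platform's value. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical helper: returns true if reading `width` bytes starting at `p`
 * would extend past the end of the page that contains `p`.
 */
bool
load_crosses_page(const void *p, size_t width)
{
	/* Offset of p within its page (the page size is a power of two). */
	uintptr_t off = (uintptr_t)p & (SKETCH_PAGE_SIZE - 1);

	return (off + width > SKETCH_PAGE_SIZE);
}
```

With a 16-byte vector width, a load starting 16 bytes before the end of a page is still safe, while one starting 15 bytes before the end is not.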
This changeset ports the amd64 SIMD implementation of memcmp
to AArch64.
It also fixes an issue in the existing AArch64 implementation,
whose return value does not follow the man page: it only returns
-1, 0, or 1 instead of the byte difference.
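For reference, the return-value behaviour described above can be written as a plain scalar loop. This is a minimal sketch and not the code in this changeset. `memcmp_bytediff` is a hypothetical name used only for illustration.

```c
#include <stddef.h>

/*
 * Scalar reference: returns the difference between the first pair of
 * differing bytes, both treated as unsigned char, or 0 if the buffers
 * are equal over n bytes.
 */
int
memcmp_bytediff(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	for (size_t i = 0; i < n; i++) {
		if (p1[i] != p2[i])
			return (p1[i] - p2[i]);
	}
	return (0);
}
```

Comparing "a" with "e" under this definition yields -4, where an implementation that only reports the sign would yield -1.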
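Performance is better than the existing memcmp implementation
borrowed from the Arm Optimized Routines, except for long inputs
(MemcmpLong), where the existing implementation has 6.10% higher
throughput on the Neoverse-V1 and 20.75% higher on the Cortex-A76.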
```
os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
            │ memcmpSIMD   │                memcmpARM                │              memcmpScalar               │
            │   sec/op     │    sec/op        vs base                │    sec/op        vs base                │
MemcmpShort    25.14µ ± 1%     63.96µ ± 1%   +154.42% (p=0.000 n=20)    139.41µ ± 2%   +454.58% (p=0.000 n=20)
MemcmpMid      10.91µ ± 1%     12.09µ ± 1%    +10.79% (p=0.000 n=20)     93.38µ ± 3%   +755.74% (p=0.000 n=20)
MemcmpLong     4.931µ ± 0%     4.648µ ± 1%     -5.73% (p=0.000 n=20)    86.722µ ± 7%  +1658.68% (p=0.000 n=20)
geomean        11.06µ          15.32µ         +38.51%                    104.1µ        +841.53%

            │ memcmpSIMD   │                memcmpARM                │              memcmpScalar               │
            │     B/s      │      B/s         vs base                │      B/s         vs base                │
MemcmpShort  4742.2Mi ± 1%   1864.0Mi ± 1%    -60.69% (p=0.000 n=20)    855.1Mi ± 2%    -81.97% (p=0.000 n=20)
MemcmpMid    10.668Gi ± 1%    9.629Gi ± 1%     -9.74% (p=0.000 n=20)    1.247Gi ± 3%    -88.31% (p=0.000 n=20)
MemcmpLong   23.608Gi ± 0%   25.048Gi ± 1%     +6.10% (p=0.000 n=20)    1.342Gi ± 6%    -94.31% (p=0.000 n=20)
geomean       10.53Gi         7.600Gi         -27.80%                   1.118Gi         -89.38%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
            │ memcmpSIMD   │                memcmpARM                │              memcmpScalar               │
            │   sec/op     │    sec/op        vs base                │    sec/op        vs base                │
MemcmpShort    39.96µ ± 0%    101.29µ ± 0%   +153.47% (p=0.000 n=20)    183.12µ ± 0%   +358.23% (p=0.000 n=20)
MemcmpMid      21.48µ ± 0%     24.55µ ± 1%    +14.31% (p=0.000 n=20)    129.12µ ± 0%   +501.23% (p=0.000 n=20)
MemcmpLong     7.593µ ± 0%     6.288µ ± 0%    -17.19% (p=0.000 n=20)   111.374µ ± 0%  +1366.84% (p=0.000 n=20)
geomean        18.68µ          25.01µ         +33.88%                    138.1µ        +639.33%

            │ memcmpSIMD   │                memcmpARM                │              memcmpScalar               │
            │     B/s      │      B/s         vs base                │      B/s         vs base                │
MemcmpShort  2983.1Mi ± 0%   1176.9Mi ± 0%    -60.55% (p=0.000 n=20)    651.0Mi ± 0%    -78.18% (p=0.000 n=20)
MemcmpMid    5551.0Mi ± 0%   4856.0Mi ± 1%    -12.52% (p=0.000 n=20)    923.3Mi ± 0%    -83.37% (p=0.000 n=20)
MemcmpLong   15.332Gi ± 0%   18.514Gi ± 0%    +20.75% (p=0.000 n=20)    1.045Gi ± 0%    -93.18% (p=0.000 n=20)
geomean       6.233Gi         4.656Gi         -25.31%                   863.3Mi         -86.47%
```