This changeset ports the amd64 SIMD implementation of memcmp to
AArch64.
It also fixes an issue in the existing AArch64 implementation, whose
return value does not follow the man page: it returns only -1, 0, or 1
instead of the byte difference.
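
To illustrate the difference, here is a minimal C sketch of the
documented behavior. It is not the code in this changeset: ref_memcmp
and sign_only are hypothetical names used only for this example. It
assumes the man page wording that the result is the difference between
the first two differing bytes, compared as unsigned char.

```c
#include <stddef.h>

/*
 * Hypothetical reference implementation, for illustration only.
 * Returns 0 if the first len bytes are equal, otherwise the difference
 * between the first pair of differing bytes, compared as unsigned char.
 */
int
ref_memcmp(const void *b1, const void *b2, size_t len)
{
	const unsigned char *p1 = b1, *p2 = b2;

	for (size_t i = 0; i < len; i++) {
		if (p1[i] != p2[i])
			return ((int)p1[i] - (int)p2[i]);
	}
	return (0);
}

/* Collapse a byte difference to -1, 0 or 1, as a sign-only memcmp would. */
int
sign_only(int diff)
{
	return ((diff > 0) - (diff < 0));
}
```

With these helpers, comparing "abca" against "abce" yields -4 from
ref_memcmp, while a sign-only implementation would report -1. Both
satisfy ISO C, which only specifies the sign, but only the former
matches the man page.
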
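Compared to the existing memcmp implementation borrowed from the Arm
Optimized Routines, short inputs are roughly twice as fast (-46% to
-49% sec/op). Medium and long inputs are on par or slightly slower
(up to about 6% on Neoverse-V1).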
```
os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
            │  memcmpARM  │             memcmpSIMD             │
            │   sec/op    │   sec/op     vs base               │
MemcmpShort   63.96µ ± 1%   32.41µ ± 0%  -49.33% (p=0.000 n=20)
MemcmpMid     12.09µ ± 1%   12.33µ ± 1%   +1.98% (p=0.000 n=20)
MemcmpLong    4.648µ ± 1%   4.942µ ± 1%   +6.32% (p=0.000 n=20)
geomean       15.32µ        12.55µ       -18.10%

            │  memcmpARM   │             memcmpSIMD              │
            │     B/s      │     B/s       vs base               │
MemcmpShort   1.820Gi ± 1%   3.592Gi ± 0%  +97.35% (p=0.000 n=20)
MemcmpMid     9.629Gi ± 1%   9.442Gi ± 1%   -1.94% (p=0.000 n=20)
MemcmpLong    25.05Gi ± 1%   23.55Gi ± 1%   -5.96% (p=0.000 n=20)
geomean       7.600Gi        9.279Gi       +22.09%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
            │   memcmpARM   │               memcmpSIMD               │
            │    sec/op     │    sec/op      vs base                 │
MemcmpShort   171.05µ ± 12%   91.81µ ± 24%   -46.33% (p=0.000 n=20+60)
MemcmpMid     35.03µ ± 3%     36.84µ ± 11%         ~ (p=0.661 n=20+60)
MemcmpLong    10.56µ ± 2%     10.57µ ± 14%         ~ (p=0.377 n=20+60)
geomean       39.84µ          32.94µ         -17.33%

            │   memcmpARM   │                memcmpSIMD                │
            │      B/s      │      B/s         vs base                 │
MemcmpShort   696.9Mi ± 14%   1298.4Mi ± 31%   +86.31% (p=0.000 n=20+60)
MemcmpMid     3.323Gi ± 3%    3.160Gi ± 12%          ~ (p=0.661 n=20+60)
MemcmpLong    11.03Gi ± 2%    11.02Gi ± 15%          ~ (p=0.377 n=20+60)
geomean       2.922Gi         3.534Gi          +20.96%
```