This changeset ports the amd64 SIMD implementation of memcmp
to AArch64.
It also fixes an issue with the existing AArch64 implementation,
whose return value does not match the man page: it only returns
-1, 0, or 1 instead of the difference between the first differing bytes.
Compared to the existing memcmp implementation borrowed from the
Arm Optimized Routines, it is about twice as fast for short strings,
but about 2% slower for mid-sized strings and 6% slower for long ones.
```
os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
            │  memcmpARM  │             memcmpSIMD             │
            │   sec/op    │   sec/op     vs base               │
MemcmpShort   63.96µ ± 1%   32.41µ ± 0%  -49.33% (p=0.000 n=20)
MemcmpMid     12.09µ ± 1%   12.33µ ± 1%   +1.98% (p=0.000 n=20)
MemcmpLong    4.648µ ± 1%   4.942µ ± 1%   +6.32% (p=0.000 n=20)
geomean       15.32µ        12.55µ       -18.10%

            │  memcmpARM   │              memcmpSIMD              │
            │     B/s      │     B/s       vs base                │
MemcmpShort   1.820Gi ± 1%   3.592Gi ± 0%  +97.35% (p=0.000 n=20)
MemcmpMid     9.629Gi ± 1%   9.442Gi ± 1%   -1.94% (p=0.000 n=20)
MemcmpLong    25.05Gi ± 1%   23.55Gi ± 1%   -5.96% (p=0.000 n=20)
geomean       7.600Gi        9.279Gi       +22.09%
```
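
For reference, the return-value behavior described above can be
illustrated with a minimal scalar sketch. This is not the code from
this changeset; ref_memcmp is a hypothetical helper name. It assumes
the man page wording that the result is the difference between the
first two differing bytes, treated as unsigned char values.

```c
#include <stddef.h>

/*
 * Hypothetical scalar reference, for illustration only.
 * Returns 0 if the buffers match, otherwise the difference between
 * the first pair of differing bytes, compared as unsigned char.
 */
static int
ref_memcmp(const void *a, const void *b, size_t len)
{
	const unsigned char *p = a;
	const unsigned char *q = b;

	for (size_t i = 0; i < len; i++) {
		if (p[i] != q[i])
			return (p[i] - q[i]);	/* full byte difference, not just the sign */
	}
	return (0);
}
```

With this sketch, comparing "a" against "d" yields -3 rather than -1,
which is the distinction between the two behaviors.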