This changeset includes a port of the SIMD implementation of strcmp
for amd64 to AArch64.

Below is a description of its method, as given in D41971:

The basic idea is to process the bulk of the string in aligned blocks
of 16 bytes such that one string runs ahead and the other runs behind.
The string that runs ahead is checked for NUL bytes, the one that runs
behind is compared with the corresponding chunk of the string that runs
ahead. This trades an extra load per iteration for the very complicated
block-reassembly needed in the other implementations (bionic, glibc).
On the flip side, we need two code paths depending on the relative
alignment of the two buffers.

The initial part of the string is compared directly if it is known not
to cross a page boundary. Otherwise, a more complex slow path is taken
to avoid reading from unmapped memory.
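
As a rough illustration of the page-boundary test mentioned above, the
following C sketch checks whether a 16-byte load starting at a given
pointer stays within a single page. The 4096-byte page size, the 16-byte
head length, and the names SKETCH_PAGE_SIZE, SKETCH_CHUNK and
head_load_is_safe are assumptions made for this example; they are not
taken from the assembly in this changeset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages; real systems may differ. */
#define SKETCH_PAGE_SIZE 4096u
/* Assumption for this sketch: the head is loaded as one 16-byte chunk. */
#define SKETCH_CHUNK     16u

/*
 * Return true if the SKETCH_CHUNK bytes starting at p all lie within
 * the page that contains p, so that loading them cannot touch the
 * following (possibly unmapped) page.
 */
static bool
head_load_is_safe(const void *p)
{
	/* Offset of p within its page. */
	uintptr_t off = (uintptr_t)p & (SKETCH_PAGE_SIZE - 1);

	/* The load covers offsets off .. off + SKETCH_CHUNK - 1. */
	return (off <= SKETCH_PAGE_SIZE - SKETCH_CHUNK);
}
```

A caller would take the direct comparison when head_load_is_safe() holds
for both strings and fall back to the slow path otherwise.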
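
Performance is better in most cases than the existing implementation
from the Arm Optimized Routines repository: the geometric mean of the
run times below drops by about 13% on the Cortex-A76 and by 28-30% on
the Neoverse-V1, with small regressions in a few individual benchmarks.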

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
                     │   strcmpARM   │              strcmpSIMD               │           strcmpSIMDupdated           │
                     │    sec/op     │    sec/op      vs base                │    sec/op      vs base                │
StrcmpShortAligned       137.6µ ± 1%     113.8µ ± 0%  -17.35% (p=0.000 n=20)     110.0µ ± 0%  -20.11% (p=0.000 n=20)
StrcmpMidAligned         37.54µ ± 2%     38.93µ ± 0%   +3.69% (p=0.000 n=20)     35.85µ ± 1%   -4.50% (p=0.000 n=20)
StrcmpLongAligned        17.65µ ± 0%     14.67µ ± 0%  -16.89% (p=0.000 n=20)     15.09µ ± 2%  -14.51% (p=0.000 n=20)
StrcmpShortUnaligned     183.7µ ± 1%     125.2µ ± 0%  -31.83% (p=0.000 n=20)     122.5µ ± 0%  -33.32% (p=0.000 n=20)
StrcmpMidUnaligned       51.74µ ± 0%     38.69µ ± 2%  -25.23% (p=0.000 n=20)     41.98µ ± 1%  -18.86% (p=0.000 n=20)
StrcmpLongUnaligned      16.20µ ± 0%     16.12µ ± 0%   -0.50% (p=0.000 n=20)     16.51µ ± 0%   +1.89% (p=0.000 n=20)
StrcmpShortQsort         1.511m ± 0%     1.450m ± 0%   -4.05% (p=0.000 n=20)     1.412m ± 0%   -6.55% (p=0.000 n=20)
StrcmpMidQsort           354.1µ ± 0%     345.1µ ± 0%   -2.56% (p=0.000 n=20)     334.8µ ± 0%   -5.46% (p=0.000 n=20)
geomean                  96.49µ          84.24µ       -12.69%                    83.60µ       -13.35%

                     │   strcmpARM   │              strcmpSIMD               │           strcmpSIMDupdated           │
                     │      B/s      │      B/s       vs base                │      B/s       vs base                │
StrcmpShortAligned      866.2Mi ± 1%   1048.0Mi ± 0%  +20.99% (p=0.000 n=20)   1084.2Mi ± 0%  +25.17% (p=0.000 n=20)
StrcmpMidAligned        3.101Gi ± 1%    2.991Gi ± 0%   -3.56% (p=0.000 n=20)    3.247Gi ± 1%   +4.71% (p=0.000 n=20)
StrcmpLongAligned       6.597Gi ± 0%    7.938Gi ± 0%  +20.33% (p=0.000 n=20)    7.717Gi ± 2%  +16.98% (p=0.000 n=20)
StrcmpShortUnaligned    649.0Mi ± 1%    952.1Mi ± 0%  +46.70% (p=0.000 n=20)    973.3Mi ± 0%  +49.96% (p=0.000 n=20)
StrcmpMidUnaligned      2.250Gi ± 0%    3.009Gi ± 2%  +33.74% (p=0.000 n=20)    2.773Gi ± 1%  +23.25% (p=0.000 n=20)
StrcmpLongUnaligned     7.186Gi ± 0%    7.222Gi ± 0%   +0.50% (p=0.000 n=20)    7.053Gi ± 0%   -1.85% (p=0.000 n=20)
StrcmpShortQsort        78.89Mi ± 0%    82.22Mi ± 0%   +4.22% (p=0.000 n=20)    84.42Mi ± 0%   +7.01% (p=0.000 n=20)
StrcmpMidQsort          336.6Mi ± 0%    345.5Mi ± 0%   +2.62% (p=0.000 n=20)    356.1Mi ± 0%   +5.77% (p=0.000 n=20)
geomean                 1.207Gi         1.382Gi       +14.53%                   1.392Gi       +15.41%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
                     │   strcmpARM   │              strcmpSIMD               │           strcmpSIMDupdated           │
                     │    sec/op     │    sec/op      vs base                │    sec/op      vs base                │
StrcmpShortAligned       85.63µ ± 1%     72.88µ ± 1%  -14.88% (p=0.000 n=20)     67.86µ ± 1%  -20.75% (p=0.000 n=20)
StrcmpMidAligned         21.59µ ± 5%     17.51µ ± 2%  -18.88% (p=0.000 n=20)     16.60µ ± 2%  -23.10% (p=0.000 n=20)
StrcmpLongAligned       13.457µ ± 8%     8.315µ ± 3%  -38.21% (p=0.000 n=20)     8.809µ ± 4%  -34.54% (p=0.000 n=20)
StrcmpShortUnaligned    135.96µ ± 0%     85.68µ ± 0%  -36.98% (p=0.000 n=20)     78.51µ ± 0%  -42.26% (p=0.000 n=20)
StrcmpMidUnaligned       30.31µ ± 1%     18.64µ ± 1%  -38.49% (p=0.000 n=20)     17.37µ ± 1%  -42.70% (p=0.000 n=20)
StrcmpLongUnaligned     13.649µ ± 1%     8.442µ ± 2%  -38.15% (p=0.000 n=20)     8.480µ ± 5%  -37.87% (p=0.000 n=20)
StrcmpShortQsort        1173.0µ ± 0%     984.4µ ± 0%  -16.08% (p=0.000 n=20)     950.8µ ± 0%  -18.94% (p=0.000 n=20)
StrcmpMidQsort           263.1µ ± 0%     227.8µ ± 0%  -13.38% (p=0.000 n=20)     220.6µ ± 0%  -16.14% (p=0.000 n=20)
geomean                  67.51µ          48.79µ       -27.74%                    47.06µ       -30.29%

                     │   strcmpARM   │              strcmpSIMD               │           strcmpSIMDupdated           │
                     │      B/s      │      B/s       vs base                │      B/s       vs base                │
StrcmpShortAligned      1.360Gi ± 1%    1.597Gi ± 1%  +17.49% (p=0.000 n=20)    1.715Gi ± 1%  +26.18% (p=0.000 n=20)
StrcmpMidAligned        5.393Gi ± 5%    6.648Gi ± 2%  +23.28% (p=0.000 n=20)    7.013Gi ± 2%  +30.04% (p=0.000 n=20)
StrcmpLongAligned       8.651Gi ± 9%   14.001Gi ± 3%  +61.83% (p=0.000 n=20)   13.216Gi ± 4%  +52.77% (p=0.000 n=20)
StrcmpShortUnaligned    876.8Mi ± 0%   1391.3Mi ± 0%  +58.69% (p=0.000 n=20)   1518.5Mi ± 0%  +73.19% (p=0.000 n=20)
StrcmpMidUnaligned      3.841Gi ± 1%    6.245Gi ± 1%  +62.58% (p=0.000 n=20)    6.704Gi ± 1%  +74.52% (p=0.000 n=20)
StrcmpLongUnaligned     8.529Gi ± 1%   13.791Gi ± 2%  +61.69% (p=0.000 n=20)   13.729Gi ± 5%  +60.96% (p=0.000 n=20)
StrcmpShortQsort        101.6Mi ± 0%    121.1Mi ± 0%  +19.16% (p=0.000 n=20)    125.4Mi ± 0%  +23.37% (p=0.000 n=20)
StrcmpMidQsort          453.2Mi ± 0%    523.2Mi ± 0%  +15.45% (p=0.000 n=20)    540.4Mi ± 0%  +19.25% (p=0.000 n=20)
geomean                 1.724Gi         2.386Gi       +38.39%                   2.474Gi       +43.46%