Details

Reviewers

fuz
emaste
andrew

Commits

rG5e7d93a60440: lib/libc/aarch64/string: add strcmp SIMD implementation

Summary

This changeset includes a port of the SIMD implementation of strcmp
for amd64 to Aarch64.

Below is a description of its method as described in D41971

The basic idea is to process the bulk of the string in aligned blocks
of 16 bytes such that one string runs ahead and the other runs behind.
The string that runs ahead is checked for NUL bytes, the one that runs
behind is compared with the corresponding chunk of the string that runs
ahead. This trades an extra load per iteration for the very complicated
block-reassembly needed in the other implementations (bionic, glibc).
On the flip side, we need two code paths depending on the relative
alignment of the two buffers.

The initial part of the string is compared directly if it is known not
to cross a page boundary. Otherwise, a complex slow path to avoid
crossing into unmapped memory commences.

Performance is better in most cases than the existing implementation
from the Arm Optimized Routines repository.

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
                     │  strcmpARM  │             strcmpSIMD              │          strcmpSIMDupdated          │
                     │   sec/op    │   sec/op     vs base                │   sec/op     vs base                │
StrcmpShortAligned     137.6µ ± 1%   113.8µ ± 0%  -17.35% (p=0.000 n=20)   110.0µ ± 0%  -20.11% (p=0.000 n=20)
StrcmpMidAligned       37.54µ ± 2%   38.93µ ± 0%   +3.69% (p=0.000 n=20)   35.85µ ± 1%   -4.50% (p=0.000 n=20)
StrcmpLongAligned      17.65µ ± 0%   14.67µ ± 0%  -16.89% (p=0.000 n=20)   15.09µ ± 2%  -14.51% (p=0.000 n=20)
StrcmpShortUnaligned   183.7µ ± 1%   125.2µ ± 0%  -31.83% (p=0.000 n=20)   122.5µ ± 0%  -33.32% (p=0.000 n=20)
StrcmpMidUnaligned     51.74µ ± 0%   38.69µ ± 2%  -25.23% (p=0.000 n=20)   41.98µ ± 1%  -18.86% (p=0.000 n=20)
StrcmpLongUnaligned    16.20µ ± 0%   16.12µ ± 0%   -0.50% (p=0.000 n=20)   16.51µ ± 0%   +1.89% (p=0.000 n=20)
StrcmpShortQsort       1.511m ± 0%   1.450m ± 0%   -4.05% (p=0.000 n=20)   1.412m ± 0%   -6.55% (p=0.000 n=20)
StrcmpMidQsort         354.1µ ± 0%   345.1µ ± 0%   -2.56% (p=0.000 n=20)   334.8µ ± 0%   -5.46% (p=0.000 n=20)
geomean                96.49µ        84.24µ       -12.69%                  83.60µ       -13.35%

                     │  strcmpARM   │              strcmpSIMD               │           strcmpSIMDupdated           │
                     │     B/s      │      B/s       vs base                │      B/s       vs base                │
StrcmpShortAligned     866.2Mi ± 1%   1048.0Mi ± 0%  +20.99% (p=0.000 n=20)   1084.2Mi ± 0%  +25.17% (p=0.000 n=20)
StrcmpMidAligned       3.101Gi ± 1%    2.991Gi ± 0%   -3.56% (p=0.000 n=20)    3.247Gi ± 1%   +4.71% (p=0.000 n=20)
StrcmpLongAligned      6.597Gi ± 0%    7.938Gi ± 0%  +20.33% (p=0.000 n=20)    7.717Gi ± 2%  +16.98% (p=0.000 n=20)
StrcmpShortUnaligned   649.0Mi ± 1%    952.1Mi ± 0%  +46.70% (p=0.000 n=20)    973.3Mi ± 0%  +49.96% (p=0.000 n=20)
StrcmpMidUnaligned     2.250Gi ± 0%    3.009Gi ± 2%  +33.74% (p=0.000 n=20)    2.773Gi ± 1%  +23.25% (p=0.000 n=20)
StrcmpLongUnaligned    7.186Gi ± 0%    7.222Gi ± 0%   +0.50% (p=0.000 n=20)    7.053Gi ± 0%   -1.85% (p=0.000 n=20)
StrcmpShortQsort       78.89Mi ± 0%    82.22Mi ± 0%   +4.22% (p=0.000 n=20)    84.42Mi ± 0%   +7.01% (p=0.000 n=20)
StrcmpMidQsort         336.6Mi ± 0%    345.5Mi ± 0%   +2.62% (p=0.000 n=20)    356.1Mi ± 0%   +5.77% (p=0.000 n=20)
geomean                1.207Gi         1.382Gi       +14.53%                   1.392Gi       +15.41%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
                     │  strcmpARM   │             strcmpSIMD              │          strcmpSIMDupdated          │
                     │    sec/op    │   sec/op     vs base                │   sec/op     vs base                │
StrcmpShortAligned      85.63µ ± 1%   72.88µ ± 1%  -14.88% (p=0.000 n=20)   67.86µ ± 1%  -20.75% (p=0.000 n=20)
StrcmpMidAligned        21.59µ ± 5%   17.51µ ± 2%  -18.88% (p=0.000 n=20)   16.60µ ± 2%  -23.10% (p=0.000 n=20)
StrcmpLongAligned      13.457µ ± 8%   8.315µ ± 3%  -38.21% (p=0.000 n=20)   8.809µ ± 4%  -34.54% (p=0.000 n=20)
StrcmpShortUnaligned   135.96µ ± 0%   85.68µ ± 0%  -36.98% (p=0.000 n=20)   78.51µ ± 0%  -42.26% (p=0.000 n=20)
StrcmpMidUnaligned      30.31µ ± 1%   18.64µ ± 1%  -38.49% (p=0.000 n=20)   17.37µ ± 1%  -42.70% (p=0.000 n=20)
StrcmpLongUnaligned    13.649µ ± 1%   8.442µ ± 2%  -38.15% (p=0.000 n=20)   8.480µ ± 5%  -37.87% (p=0.000 n=20)
StrcmpShortQsort       1173.0µ ± 0%   984.4µ ± 0%  -16.08% (p=0.000 n=20)   950.8µ ± 0%  -18.94% (p=0.000 n=20)
StrcmpMidQsort          263.1µ ± 0%   227.8µ ± 0%  -13.38% (p=0.000 n=20)   220.6µ ± 0%  -16.14% (p=0.000 n=20)
geomean                 67.51µ        48.79µ       -27.74%                  47.06µ       -30.29%

                     │  strcmpARM   │              strcmpSIMD               │           strcmpSIMDupdated           │
                     │     B/s      │      B/s       vs base                │      B/s       vs base                │
StrcmpShortAligned     1.360Gi ± 1%    1.597Gi ± 1%  +17.49% (p=0.000 n=20)    1.715Gi ± 1%  +26.18% (p=0.000 n=20)
StrcmpMidAligned       5.393Gi ± 5%    6.648Gi ± 2%  +23.28% (p=0.000 n=20)    7.013Gi ± 2%  +30.04% (p=0.000 n=20)
StrcmpLongAligned      8.651Gi ± 9%   14.001Gi ± 3%  +61.83% (p=0.000 n=20)   13.216Gi ± 4%  +52.77% (p=0.000 n=20)
StrcmpShortUnaligned   876.8Mi ± 0%   1391.3Mi ± 0%  +58.69% (p=0.000 n=20)   1518.5Mi ± 0%  +73.19% (p=0.000 n=20)
StrcmpMidUnaligned     3.841Gi ± 1%    6.245Gi ± 1%  +62.58% (p=0.000 n=20)    6.704Gi ± 1%  +74.52% (p=0.000 n=20)
StrcmpLongUnaligned    8.529Gi ± 1%   13.791Gi ± 2%  +61.69% (p=0.000 n=20)   13.729Gi ± 5%  +60.96% (p=0.000 n=20)
StrcmpShortQsort       101.6Mi ± 0%    121.1Mi ± 0%  +19.16% (p=0.000 n=20)    125.4Mi ± 0%  +23.37% (p=0.000 n=20)
StrcmpMidQsort         453.2Mi ± 0%    523.2Mi ± 0%  +15.45% (p=0.000 n=20)    540.4Mi ± 0%  +19.25% (p=0.000 n=20)
geomean                1.724Gi         2.386Gi       +38.39%                   2.474Gi       +43.46%

Test Plan

Passes all current unit tests

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

getz created this revision.Jul 2 2024, 7:04 PM

Herald added a reviewer: andrew. · View Herald TranscriptJul 2 2024, 7:04 PM

Herald added a subscriber: imp. · View Herald Transcript

getz requested review of this revision.Jul 2 2024, 7:04 PM

Harbormaster completed remote builds in B58456: Diff 140476.Jul 2 2024, 7:04 PM

Remove unnecessary mov

Harbormaster completed remote builds in B58504: Diff 140564.Jul 5 2024, 11:06 AM

Impressive performance!

I wonder how the performance would change if you unconditionally did the tbl maneuver to get the heads in place. As you could then interleave the dependency chains for the two strings and get rid of two conditional branches.

lib/libc/aarch64/string/strcmp.S
37–41	Permit more ILP on multi-issue in-order architectures.
54–55	This should be done once at an earlier point and the address you looked up should be used for both cases instead of reloading it.
57	Why do you reload `q0` here? It should be more efficient to `cmeq` into a new register earlier so you keep the string head already loaded.

Some more microöptimisations. Some of them occur more than once in the code but I only marked some of them.

lib/libc/aarch64/string/strcmp.S
87–92
100–105
115–116	This may perform better as cmp + b.cc fuse on some microarchitectures.
187–188
190–191
206–207
356	The `.size` directive makes objdump output more readable. Pad the table with `0xff` so we never read past its end (this may otherwise crash if our rodata section ends up right at the end of the last mapped page; doing so also makes static checkers happy).