armv8 has support for optional CRC32C instructions. This patch checks whether they are available and, if so, makes use of them.
This improves the performance. When doing an SCTP bulk transfer with full-sized frames on an RPi3 over the loopback interface with checksum offloading disabled, the throughput increases from 0.6 GBit/sec to 1.8 GBit/sec (2.3 GBit/sec with checksum offloading enabled).
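For reference, the checksum that the instructions accelerate can be sketched in portable C. This is a minimal bitwise CRC32C (Castagnoli polynomial, reflected form 0x82F63B78), not the code from this patch; the name `crc32c_sw` and the convention of inverting the value on entry and exit are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC32C reference (hypothetical helper, not part of the patch).
 * Processes one bit at a time, so it is slow but easy to verify.
 */
static uint32_t
crc32c_sw(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	crc = ~crc;				/* initial inversion */
	while (len-- > 0) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++) {
			/* Shift right; XOR the polynomial in if the low bit was set. */
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
		}
	}
	return (~crc);				/* final inversion */
}
```

With this convention, `crc32c_sw(0, "123456789", 9)` yields the standard CRC32C check value 0xE3069283.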
Details
- Reviewers: andrew
- Group Reviewers: arm64
- Commits:
  - rS319404: MFC r317512:
  - rS317512: armv8 has support for optional CRC32C instructions. This patch checks if they…
- Get the unit tests running, for which support for armv8 is also added by this patch.
- Do SCTP bulk transfers over the loopback interface with checksum offloading disabled on the loopback interface
Diff Detail
- Repository: rS FreeBSD src repository - subversion
- Lint: Lint Skipped
- Unit: Tests Skipped
- Build Status: Buildable 8902, Build 9289: CI src build Jenkins
Event Timeline
sys/libkern/arm64/crc32c_armv8.S | ||
---|---|---|
29 | __FBSDID("$FreeBSD$"); You should also add .arch armv8-a+crc (I think) to ensure the assembler allows the crc instructions. | |
59–62 | There may be an optimisation to load a pair of 64-bit words & run two crc instructions, however the Cortex-A53 in the Raspberry Pi 3 is in-order so this may be minimal. | |
sys/libkern/crc32.c | ||
775 | This should be != ID_AA64ISAR0_CRC32_NONE. I expect any changes to this field would also indicate the current instructions still work as expected. | |
tests/sys/kern/libkern_crc32.c | ||
37 | I would use #elif defined(__aarch64__) with an error in the #else case. | |
115 | What stops this when we don't have the crc instructions in HW? |
Addressing some of Andrew's comments:
- Added __FBSDID("$FreeBSD$");
- clang 3.8 (running on ref12-aarch64) doesn't support .arch armv8-a+crc, so I didn't add it. clang 4.0 supports it, but I would prefer the code to be compilable by older versions of clang.
- I tried replacing the loop with the following code and tested on ref12-aarch64. There was no difference. So I kept the loop as is.
    double_word_aligned:
            lsr     w9, w2, #0x4
            cbz     w9, last_word
    loop:
            ldr     x10, [x1], #0x8
            ldr     x11, [x1], #0x8
            crc32cx w0, w0, x10
            crc32cx w0, w0, x11
            subs    w9, w9, #1
            b.ne    loop
- I changed the test to != ID_AA64ISAR0_CRC32_NONE as suggested.
- I added #elif defined(__aarch64__) with an error in the #else case as suggested.
- I haven't added code to deal with the case on armv8 that the crc32c instructions are not available, since these checks are also not there for amd64. I guess the program will be terminated and the test framework will count them as failed. If you really want, I can see how to detect the capabilities in userland and let the tests indicate that they are not applicable (if that is possible in the test framework).
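The `!= ID_AA64ISAR0_CRC32_NONE` test mentioned above can be illustrated with a small sketch. It assumes the architectural layout in which the CRC32 field occupies bits 19:16 of ID_AA64ISAR0_EL1, with 0 meaning "not implemented". The `SKETCH_*` macros and `sketch_has_crc32` are hypothetical names; the kernel uses its own ID_AA64ISAR0_* definitions and reads the register itself, whereas here the register value is simply passed in.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's ID_AA64ISAR0_CRC32_* macros. */
#define	SKETCH_ISAR0_CRC32_SHIFT	16
#define	SKETCH_ISAR0_CRC32_MASK		(UINT64_C(0xf) << SKETCH_ISAR0_CRC32_SHIFT)
#define	SKETCH_ISAR0_CRC32_NONE		(UINT64_C(0x0) << SKETCH_ISAR0_CRC32_SHIFT)
#define	SKETCH_ISAR0_CRC32_BASE		(UINT64_C(0x1) << SKETCH_ISAR0_CRC32_SHIFT)

/*
 * Return true if the CRC32 field of the given ID_AA64ISAR0_EL1 value is
 * anything other than "none". Comparing against NONE instead of testing
 * for equality with BASE keeps the check valid if a later revision of the
 * architecture defines larger values for this field.
 */
static bool
sketch_has_crc32(uint64_t isar0)
{
	return ((isar0 & SKETCH_ISAR0_CRC32_MASK) != SKETCH_ISAR0_CRC32_NONE);
}
```

Testing for "not none" rather than for one specific value follows the reasoning in the review comment: a changed field value is expected to still imply that the current instructions work.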
sys/libkern/arm64/crc32c_armv8.S | ||
---|---|---|
29 | I added the __FBSDID("$FreeBSD$"); The .arch armv8-a+crc is not supported by clang 3.8, which is running on ref12-aarch64. So I did not add it. Should I add it and require a newer compiler? Didn't check whether clang 3.9 works or not... | |
59–62 | I replaced the loop by double_word_aligned: lsr w9, w2, #0x4 cbz w9, last_word loop: ldr x10, [x1], #0x8 ldr x11, [x1], #0x8 crc32cx w0, w0, x10 crc32cx w0, w0, x11 subs w9, w9, #1 b.ne loop and tested it on ref12-aarch64. There was no change in the runtime. So either I did it wrong or it doesn't give a benefit. I kept the loop as is for now. | |
sys/libkern/crc32.c | ||
775 | Done. | |
tests/sys/kern/libkern_crc32.c | ||
37 | Done. | |
115 | Nothing. Which is the same as in the amd64 || i386 case. I guess the test program will crash and the test will be counted as failed. I could add a userland check if the instructions are available (can I read the register without special privileges?) and mark the tests as not applicable if that is possible in the test framework... |
sys/libkern/arm64/crc32c_armv8.S | ||
---|---|---|
29 | It's supported on clang 4.0 in head and stable/11. | |
59–62 | You can load x10 and x11 with ldp x10, x11, [x1], #0x10. I would expect only a minor improvement over loading with two instructions. | |
tests/sys/kern/libkern_crc32.c | ||
115 | There is currently no simple way to check in userland. |
Hi Andrew,
I did some testing using
    loop:
            ldp     x10, x11, [x1], #0x10
            crc32cx w0, w0, x10
            crc32cx w0, w0, x11
            subs    w9, w9, #1
            b.ne    loop
and could not measure any substantial difference on ref12-aarch64. So I'll commit the version as is.
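The shape of the paired loop can be sketched in portable C. This is not the assembly from the patch: `sketch_step_u64` is a hypothetical software stand-in for one crc32cx step, all names are invented for illustration, and alignment handling is omitted. Note that each step still depends on the accumulator produced by the previous one, which may be one reason the paired load made no measurable difference.

```c
#include <stddef.h>
#include <stdint.h>

/* Raw CRC32C update for one byte (reflected polynomial 0x82F63B78). */
static uint32_t
sketch_step_u8(uint32_t crc, unsigned char byte)
{
	crc ^= byte;
	for (int k = 0; k < 8; k++)
		crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	return (crc);
}

/* Software stand-in for one crc32cx step: fold 8 bytes, low byte first. */
static uint32_t
sketch_step_u64(uint32_t crc, uint64_t word)
{
	for (int i = 0; i < 8; i++)
		crc = sketch_step_u8(crc, (unsigned char)(word >> (8 * i)));
	return (crc);
}

/* Assemble a little-endian 64-bit word, independent of host byte order. */
static uint64_t
sketch_load_le64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return (v);
}

/* Raw update (no inversion) consuming 16 bytes per iteration, then a tail. */
static uint32_t
sketch_crc32c_pairs(uint32_t crc, const unsigned char *p, size_t len)
{
	size_t pairs = len >> 4;		/* like: lsr w9, w2, #0x4 */

	while (pairs-- > 0) {
		uint64_t a = sketch_load_le64(p);	/* like: ldp x10, x11, ... */
		uint64_t b = sketch_load_le64(p + 8);

		crc = sketch_step_u64(crc, a);	/* second step waits on first */
		crc = sketch_step_u64(crc, b);
		p += 16;
	}
	for (len &= 15; len > 0; len--)		/* remaining 0..15 bytes */
		crc = sketch_step_u8(crc, *p++);
	return (crc);
}
```

The function returns the raw accumulator, so a caller wanting the conventional CRC32C value would start from 0xFFFFFFFF and invert the result.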
I am confused by the description of "with checksum offloading disabled" on a loopback interface. Did you mean to say that you force it to calculate the checksums on loopback rather than just setting the flag that checksums are ok as would be the default? I think it would be good to re-phrase this as there is no "offloading" on loopback.
I'm referring to "setting a flag that checksums are OK" as "checksum offloading" for the loopback interface.
Please note that this was just done for the measurements. The patch doesn't change this feature.
To be crystal clear, the measurement was:
- Using ifconfig lo0 rxcsum txcsum you get 2.3 GBit/sec
- Using ifconfig lo0 -rxcsum -txcsum you get 0.6 GBit/sec without this patch
- Using ifconfig lo0 -rxcsum -txcsum you get 1.8 GBit/sec with this patch
This is just to show that there is a substantial performance increase. You wouldn't use this on the loopback interface,
but on real interfaces not supporting CRC32C offloading. Since I don't have access to an armv8-based system
with an interface faster than 100 MBit/sec, I tested it with the loopback interface on the RPi3.