
Add support for optional CRC32c instructions on armv8
ClosedPublic

Authored by tuexen on Apr 25 2017, 8:20 PM.

Details

Summary

armv8 has support for optional CRC32C instructions. This patch checks whether they are available and, if so, makes use of them.
This improves the performance. When doing an SCTP bulk transfer with full sized frames on a RPi3 over the loopback interface with checksum offloading disabled, the throughput increases from 0.6 GBit/sec to 1.8 GBit/sec (2.3 GBit/sec with checksum offloading enabled).
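
For reference, here is a minimal, portable sketch of the computation that the hardware instructions accelerate. It is not the code from this patch: the names crc32c_update_sw and crc32c_sw are hypothetical, and the bit-at-a-time loop is chosen for clarity, not speed. The assumption is that the CRC32C instructions perform the same running update (reflected Castagnoli polynomial, no inversion), with the initial and final inversion left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the Castagnoli polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED	0x82F63B78u

/*
 * Hypothetical helper: fold len bytes into a running CRC32C value,
 * one bit at a time. No initial or final inversion is applied here.
 */
static uint32_t
crc32c_update_sw(uint32_t crc, const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++) {
			/* Subtract (XOR) the polynomial if the low bit is set. */
			uint32_t mask = 0u - (crc & 1u);
			crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & mask);
		}
	}
	return (crc);
}

/* Conventional CRC-32C: start from all ones, invert the result. */
static uint32_t
crc32c_sw(const unsigned char *buf, size_t len)
{
	return (crc32c_update_sw(0xFFFFFFFFu, buf, len) ^ 0xFFFFFFFFu);
}
```

A reference like this can be used to cross-check a hardware-assisted implementation: crc32c_sw() over the nine ASCII bytes "123456789" yields the standard CRC-32C check value 0xE3069283.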

Test Plan
  • Run the unit tests; this patch also adds armv8 support to them.
  • Do SCTP bulk transfers over the loopback interface with checksum offloading disabled on it.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sys/libkern/arm64/crc32c_armv8.S
29 ↗(On Diff #27728)

__FBSDID("$FreeBSD$");

You should also add .arch armv8-a+crc (I think) to ensure the assembler allows the crc instructions.

59–62 ↗(On Diff #27728)

There may be an optimisation to load a pair of 64-bit words & run two crc instructions, however the Cortex-A53 in the Raspberry Pi 3 is in-order so this may be minimal.

sys/libkern/crc32.c
775 ↗(On Diff #27728)

This should be != ID_AA64ISAR0_CRC32_NONE. I expect any changes to this field would also indicate the current instructions still work as expected.
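
A minimal sketch of the suggested comparison, assuming the CRC32 field occupies bits [19:16] of ID_AA64ISAR0_EL1 and that a value of zero means "not implemented". The EX_ macros and ex_have_crc32() are illustrative names, not the kernel's definitions, and the register value is passed in as a plain integer so the logic can be exercised anywhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel's ID_AA64ISAR0_CRC32_* macros. */
#define EX_ISAR0_CRC32_SHIFT	16
#define EX_ISAR0_CRC32_MASK	(UINT64_C(0xf) << EX_ISAR0_CRC32_SHIFT)
#define EX_ISAR0_CRC32_NONE	(UINT64_C(0x0) << EX_ISAR0_CRC32_SHIFT)

/*
 * Treat any non-zero field value as "CRC32 instructions present".
 * Comparing against NONE, rather than for equality with one known
 * value, keeps the check valid if the field gains higher values later.
 */
static bool
ex_have_crc32(uint64_t isar0)
{
	return ((isar0 & EX_ISAR0_CRC32_MASK) != EX_ISAR0_CRC32_NONE);
}
```

With this shape, a field value of 1 and any larger value both enable the accelerated path, while bits set in neighbouring fields are ignored.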

tests/sys/kern/libkern_crc32.c
37 ↗(On Diff #27728)

I would use #elif defined(__aarch64__) with an error in the #else case.
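
The suggested structure might look like the sketch below. The macro EX_CRC32C_BACKEND and the function ex_crc32c_backend() are hypothetical placeholders; the real test file selects architecture-specific functions at this point, not strings.

```c
/*
 * Select an architecture-specific backend explicitly and fail the
 * build on anything else, instead of falling into a silent default.
 */
#if defined(__amd64__) || defined(__i386__)
#define EX_CRC32C_BACKEND	"sse42"
#elif defined(__aarch64__)
#define EX_CRC32C_BACKEND	"armv8"
#else
#error "No hardware CRC32C backend known for this architecture"
#endif

/* Hypothetical accessor so the choice can be inspected at run time. */
static const char *
ex_crc32c_backend(void)
{
	return (EX_CRC32C_BACKEND);
}
```

The #error branch turns a port to a new architecture into a compile-time failure, which is easier to notice than a test that quietly exercises the wrong code path.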

115 ↗(On Diff #27728)

What stops this when we don't have the crc instructions in HW?

Addressing some of Andrew's comments:

  • Added __FBSDID("$FreeBSD$");
  • clang 3.8 (running on ref12-aarch64) doesn't support .arch armv8-a+crc, so I didn't add it. clang 4.0 supports it, but I would prefer the code to remain compilable by older versions of clang.
  • I tried replacing the loop with the following code and tested on ref12-aarch64. There was no difference. So I kept the loop as is.
double_word_aligned:
	lsr	w9, w2, #0x4
	cbz	w9, last_word
loop:
	ldr	x10, [x1], #0x8
	ldr	x11, [x1], #0x8
	crc32cx	w0, w0, x10
	crc32cx	w0, w0, x11
	subs	w9, w9, #1
	b.ne	loop
  • I changed the test to != ID_AA64ISAR0_CRC32_NONE as suggested.
  • I added #elif defined(__aarch64__) with an error in the #else case as suggested.
  • I haven't added code to deal with the case on armv8 that the crc32c instructions are not available, since these checks are also not there for amd64. I guess the program will be terminated and the test framework will count them as failed. If you really want, I can see how to detect the capabilities in userland and let the tests indicate that they are not applicable (if that is possible in the test framework).
sys/libkern/arm64/crc32c_armv8.S
29 ↗(On Diff #27728)

I added the __FBSDID("$FreeBSD$");

The .arch armv8-a+crc directive is not supported by clang 3.8, which is running on ref12-aarch64. So I did not add it. Should I add it and require a newer compiler? I didn't check whether clang 3.9 works or not...

59–62 ↗(On Diff #27728)

I replaced the loop by

double_word_aligned:
	lsr	w9, w2, #0x4
	cbz	w9, last_word
loop:
	ldr	x10, [x1], #0x8
	ldr	x11, [x1], #0x8
	crc32cx	w0, w0, x10
	crc32cx	w0, w0, x11
	subs	w9, w9, #1
	b.ne	loop

and tested it on ref12-aarch64. There was no change in the runtime. So either I did it wrong or it doesn't give a benefit. I kept the loop as is for now.

sys/libkern/crc32.c
775 ↗(On Diff #27728)

Done.

tests/sys/kern/libkern_crc32.c
37 ↗(On Diff #27728)

Done.

115 ↗(On Diff #27728)

Nothing. Which is the same as in the amd64 || i386 case. I guess the test program will crash and the test will be counted as failed.

I could add a userland check if the instructions are available (can I read the register without special privileges?) and mark the tests as not applicable if that is possible in the test framework...

andrew added inline comments.
sys/libkern/arm64/crc32c_armv8.S
29 ↗(On Diff #27728)

It's supported on clang 4.0 in head and stable/11.

59–62 ↗(On Diff #27728)

You can load x10 and x11 with ldp x10, x11, [x1], #0x10. I would expect only a minor improvement over loading them with two instructions.

tests/sys/kern/libkern_crc32.c
115 ↗(On Diff #27728)

There is currently no simple way to check in userland.

This revision is now accepted and ready to land. Apr 27 2017, 7:50 AM

Hi Andrew,

I did some testing using

loop:
        ldp     x10, x11, [x1], #0x10
        crc32cx w0, w0, x10
        crc32cx w0, w0, x11
        subs    w9, w9, #1
        b.ne    loop

and could not measure any substantial difference on ref12-aarch64. So I'll commit the version as is.

I am confused by the description of "with checksum offloading disabled" on a loopback interface. Did you mean to say that you force it to calculate the checksums on loopback rather than just setting the flag that checksums are ok as would be the default? I think it would be good to re-phrase this as there is no "offloading" on loopback.

I'm referring to "setting a flag that checksums are OK" as "checksum offloading" for the loopback interface.
Please note that this was just done for the measurements. The patch doesn't change this feature.
To be crystal clear, the measurement was:

  • Using ifconfig lo0 rxcsum txcsum you get 2.3 GBit/sec
  • Using ifconfig lo0 -rxcsum -txcsum you get 0.6 GBit/sec without this patch
  • Using ifconfig lo0 -rxcsum -txcsum you get 1.8 GBit/sec with this patch

This is just to show that there is a substantial performance increase. You wouldn't use this on the loopback interface,
but on real interfaces not supporting CRC32C offloading. Since I don't have access to an armv8-based system
with an interface faster than 100MBit/sec, I tested it with the loopback interface on the RPi3.

This revision was automatically updated to reflect the committed changes.