Page MenuHomeFreeBSD

calculate_crc32c: Add SSE4.2 implementation on x86
ClosedPublic

Authored by cem on Jan 26 2017, 12:15 AM.
Tags
None
Referenced Files
F133274168: D9342.diff
Fri, Oct 24, 1:10 PM
Unknown Object (File)
Thu, Oct 23, 8:49 AM
Unknown Object (File)
Tue, Oct 21, 2:41 PM
Unknown Object (File)
Sat, Oct 18, 3:50 AM
Unknown Object (File)
Fri, Oct 17, 11:08 PM
Unknown Object (File)
Fri, Oct 17, 6:08 AM
Unknown Object (File)
Fri, Oct 17, 5:18 AM
Unknown Object (File)
Wed, Oct 15, 4:03 AM

Details

Summary

Derived from an implementation by Mark Adler.

The fast loop performs three simultaneous CRCs over subsets of the data
before composing them. This takes advantage of certain properties of
the CRC32 implementation in Intel hardware.

The CRC32 instruction does not manipulate FPU state.

i386 does not have the crc32q instruction, so avoid it there. Otherwise
the implementation is identical to amd64.

Add basic userland tests to verify correctness on a variety of inputs.

PR: 216467

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

cem retitled this revision from to calculate_crc32c: Add SSE4.2 implementation on x86.
cem updated this object.
cem edited the test plan for this revision. (Show Details)
cem added reviewers: kib, markj.

I don't have any intelligent thoughts on the implementation, but I think machine-dependent libkern code belongs in a subdirectory, probably sys/libkern/x86 in this case. I also don't understand how this implementation can be the default in the kernel - don't we need a fpu_kern_enter() call in order to use it?

In D9342#192958, @markj wrote:

I also don't understand how this implementation can be the default in the kernel - don't we need a fpu_kern_enter() call in order to use it?

I see, it looks like we don't need to save SSE registers to use this.

In D9342#192958, @markj wrote:

I don't have any intelligent thoughts on the implementation, but I think machine-dependent libkern code belongs in a subdirectory, probably sys/libkern/x86 in this case.

Sure, that's easy enough to do.

In D9342#192963, @markj wrote:
In D9342#192958, @markj wrote:

I also don't understand how this implementation can be the default in the kernel - don't we need a fpu_kern_enter() call in order to use it?

I see, it looks like we don't need to save SSE registers to use this.

Right. The routine that uses PCLMULQDQ needs FPU state, but this one does not. You can also implement the unrolled-by-3 variant without using PCLMULQDQ, although I'm not aware of a permissively licensed version of that.

Interestingly enough, there is a permissively licensed implementation of the unrolled-by-3 *with* PCLMULQDQ variant in the Linux kernel: http://lxr.free-electrons.com/source/arch/x86/crypto/crc32c-pcl-intel-asm_64.S

I think it makes sense to start with a variant that doesn't require FPU save, though.

cem edited edge metadata.

Move SSE4.2 CRC32C code into libkern/x86 subdirectory.

Fix build on PC98.

Other architectures seem to build okay, except for fallout from ARM being
broken.

Could you explain or point to explanation of the licensing (?) issue ?

In kernel, we use __asm() and not icc builtins for access to CPU instructions. The only exception I know about is aesno.ko. I note this because I am not sure how much compatible clang vs. gcc is, also that adds one more requirement on the compiler for porting.

#include <nmmintrin.h>

pulls in a file outside sys/.

Otherwise, this looks fine.

sys/libkern/crc32.c
760 ↗(On Diff #24462)

Do we need defined(_KERNEL) there as well ? Why PC98 is excluded (it seems to be on the way out) ?

In D9342#193008, @kib wrote:

Could you explain or point to explanation of the licensing (?) issue ?

I don't think there's any issue with this implementation, which is derived from the MIT-licensed https://github.com/anandsuresh/sse4_crc32 .

We have an Intel implementation inside Isilon which is not licensed in a way that we can upstream. Intel may be able to, though.

The PCLMULQDQ implementation in the Linux kernel is available and dual-licensed (BSD/GPL).

In kernel, we use __asm() and not icc builtins for access to CPU instructions. The only exception I know about is aesno.ko. I note this because I am not sure how much compatible clang vs. gcc is, also that adds one more requirement on the compiler for porting.

Right, this portion of conf/files* was derived from the aesni example.

cem edited edge metadata.

To match the other CRC32C primitives, do not invert.

sys/libkern/crc32.c
760 ↗(On Diff #24462)

Is this code built outside of _KERNEL? tinderbox succeeds, so I don't believe this code is used outside of the kernel.

PC98 is excluded because I believe it predates SSE4.2, and I preferred to avoid including dead SSE4.2 code in the PC98 kernel if it would never be used.

Additionally, here's a permissively licensed by-3 algorithm that doesn't rely upon PCLMULQDQ:

http://gitweb.dragonflybsd.org/dragonfly.git/blob/HEAD:/test/debug/crc32hw.c

It uses __asm instead of the icc intrinsics.

However, Clang cannot automatically unroll loops with the __asm instruction, whereas it can with the icc intrinsics. This makes a noticable difference; without the unrolling, the Adler version is much slower than the simple version on inputs smaller than SHORT*3 (768) bytes. Unrolling doesn't noticeably help the by-3 inner loops.

markj edited edge metadata.

Seems reasonable to me. I'd consider adding a kernel option to i386 and amd64 GENERIC for this instead - that'd replace "defined(amd64) || (defined(i386) && !defined(PC98))" while making it easy to exclude this code when using an exotic compiler.

sys/libkern/x86/crc32_sse42.c
42 ↗(On Diff #24489)

Extra newline.

sys/sys/libkern.h
181 ↗(On Diff #24489)

Exclude PC98 here too?

This revision is now accepted and ready to land.Jan 27 2017, 12:02 AM
In D9342#193229, @markj wrote:

Seems reasonable to me. I'd consider adding a kernel option to i386 and amd64 GENERIC for this instead - that'd replace "defined(amd64) || (defined(i386) && !defined(PC98))" while making it easy to exclude this code when using an exotic compiler.

That'd be fine. Something like "SSE42_SUPPORT?"

sys/libkern/x86/crc32_sse42.c
42 ↗(On Diff #24489)

Will fix.

sys/sys/libkern.h
181 ↗(On Diff #24489)

Sure, or just throw it under the same feature mentioned earlier.

In D9342#193235, @cem wrote:
In D9342#193229, @markj wrote:

Seems reasonable to me. I'd consider adding a kernel option to i386 and amd64 GENERIC for this instead - that'd replace "defined(amd64) || (defined(i386) && !defined(PC98))" while making it easy to exclude this code when using an exotic compiler.

That'd be fine. Something like "SSE42_SUPPORT?"

I think that's ok, but @kib or @jhb will probably have actual opinions. There's some precedent here with the use of the popcnt instruction in the amd64 pmap code, but I don't see where \_\_POPCNT\_\_ is defined (if anywhere).

cem edited edge metadata.
cem marked 2 inline comments as done.
  • Replace implementation with Mark Adler's unrolled-by-3 implementation. This is significantly faster for larger inputs (the slicing kicks in at 768 input bytes).
  • Add basic testcases to verify correctness on a variety of inputs / alignedness.
This revision now requires review to proceed.Jan 27 2017, 6:26 AM
cem edited the test plan for this revision. (Show Details)
cem edited edge metadata.

No reason not to prefer the SSE42 implementation on amd64/i386 platforms that
support it.

I'm happy with the correctness and performance improvement of this change — I think it's ready to commit. I've got some benchmarking code that won't be committed but the tests pass and it improves performance on supported hardware:

0x000080: multitable:53 intrins:4  (ns per buf)
0x000100: multitable:162 intrins:13  (ns per buf)
0x000200: multitable:352 intrins:34  (ns per buf)
0x000400: multitable:685 intrins:63  (ns per buf)
0x000800: multitable:1358 intrins:133  (ns per buf)
0x002000: multitable:5402 intrins:447  (ns per buf)
0x008000: multitable:21632 intrins:1498  (ns per buf)
0x020000: multitable:86477 intrins:5612  (ns per buf)

Naive (not sliced-by-3) intrinsics approach for comparison:

0x000080: sse42:9
0x000100: sse42:19
0x000200: sse42:40
0x000400: sse42:100
0x000800: sse42:235
0x002000: sse42:1010
0x008000: sse42:4119
0x020000: sse42:16546

I looked at adding an option, but that seems like it requires a lot of messy changes to files I am unfamiliar with, so I'm punting on that for now. It can always be done later.

kib edited edge metadata.

It is small piece of code, I do not see a need in an option to enable it. Mentioned POPCNT instruction is also compiled in unconditionally and is used based on runtime feature test.

<bde tone>I do not like _mm builtins. They are very bad documented: gcc documentation is an excuse, and the real documentation is in intel compiler bundle, which is not freely available. It requires knowing both assembler and compiler builtins to read the code.</tone>

Other than that, I am fine with the patch if it passes the correct tests.

sys/libkern/x86/crc32_sse42.c
165 ↗(On Diff #24497)

If you said that this is only compiled for kernel, why _KERNEL is not completely eliminated from the patch ?

This revision is now accepted and ready to land.Jan 27 2017, 7:10 PM
sys/libkern/crc32.c
758 ↗(On Diff #24497)

You can just use #ifdef __SSE4_2__. -msse42 (or -msse4) will define that via the compiler driver. That is what controls 'popcnt' (in the case of popcnt we use it if you set CPUTYPE in your make.conf to specify a CPU that includes popcnt).

In D9342#193446, @kib wrote:

It is small piece of code, I do not see a need in an option to enable it. Mentioned POPCNT instruction is also compiled in unconditionally and is used based on runtime feature test.

Ok.

<bde tone>I do not like _mm builtins. They are very bad documented: gcc documentation is an excuse, and the real documentation is in intel compiler bundle, which is not freely available. It requires knowing both assembler and compiler builtins to read the code.</tone>

Yeah, but the unrolling does really help for moderate sized inputs (32-768 bytes).

Would the GCC builtins (__builtin_ia32_crc32di() etc) be any better? https://svnweb.freebsd.org/base/head/contrib/llvm/tools/clang/lib/Headers/smmintrin.h?revision=309124&view=markup#l474

They would be equivalent for the purposes of unrolling loops and legibility.

Other than that, I am fine with the patch if it passes the correct tests.

It does. Ben RUBSON from the PR has also used it successfully with iscsi.

sys/libkern/crc32.c
758 ↗(On Diff #24497)

Is -msse4 used to compile this file on i386/amd64? I thought it was off by default, and only enabled on a few files. My build logs actually show -mno-sse -mno-avx.

sys/libkern/x86/crc32_sse42.c
165 ↗(On Diff #24497)

It was only compiled for kernel at the time. Since then, I've added userspace correctness tests that need to build the code in userspace.

sys/libkern/crc32.c
758 ↗(On Diff #24497)

The entries in sys/conf/files.* explicitly specify -msse4.

sys/libkern/crc32.c
758 ↗(On Diff #24497)

Only for libkern/x86/crc32_sse42.c. Not libkern/crc32.c (this file).

sys/libkern/x86/crc32_sse42.c
165 ↗(On Diff #24497)

This clearly indicates my confusion and its source. I think you should use some explicitely provided symbol, e.g. TEST_MODE or anything similar, instead of !_KERNEL. The !_KERNEL code is not compiled for non-kernel base system use, but for tests.

sys/libkern/x86/crc32_sse42.c
165 ↗(On Diff #24497)

That would be fine, but is a little more churn. Why prefer something like TEST_MODE over !_KERNEL? The code is generally usable in userspace as-is.

sys/libkern/x86/crc32_sse42.c
165 ↗(On Diff #24497)

Because the code is not used outside the kernel (modulo the tests), and never will be while it is part of libkern. My concern is a confusion of anybodye looking at the code, since defined(_KERNEL) is typically a very strong indicator of code use outside the kernel. See my confusion above.

More, I think that you can provide test-specific header with the definitions which a required for tests, and include it instead of spreading _KERNEL around the file.

I do not insist, but IMO this improves the code.

cem edited edge metadata.
  • Use USERSPACE_TESTING define instead of !_KERNEL
  • Remove PC98 workarounds for i386
This revision now requires review to proceed.Jan 30 2017, 11:51 PM
kib edited edge metadata.
This revision is now accepted and ready to land.Jan 31 2017, 12:02 AM
This revision was automatically updated to reflect the committed changes.
arichardson added inline comments.
head/tests/sys/kern/libkern_crc32.c
117

It appears this value is incorrect: It doesn't match the software implementations and also doesn't match the ARM HW-accelerated one.

head/tests/sys/kern/libkern_crc32.c
117

Please ignore this, I accidentally had mismatched sizes when updating D28395