Page MenuHomeFreeBSD

Build arm64 kernel with inline ffs, fls
ClosedPublic

Authored by dougm on May 12 2019, 9:16 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 25, 7:53 PM
Unknown Object (File)
Sat, Jan 25, 7:52 PM
Unknown Object (File)
Sat, Jan 25, 7:24 PM
Unknown Object (File)
Sat, Jan 25, 7:14 PM
Unknown Object (File)
Fri, Jan 24, 6:55 PM
Unknown Object (File)
Sat, Jan 18, 5:09 PM
Unknown Object (File)
Fri, Jan 17, 8:51 AM
Unknown Object (File)
Mon, Jan 13, 9:39 PM

Details

Summary

Implement the ffs and fls instructions, and their longer counterparts, in cpufunc, in terms of gcc extensions like __builtin_ffs, for arm64 architectures, and use those, rather than simple libkern implementations, in building arm64 kernels.

Tested by: greg_unrelenting.technology

Test Plan

I can complete an arm64 build, and I can examine an object file like pmap.o or subr_blist.o for signs that the inlining has happened, as instruction sequences like

244: dac001c9 rbit x9, x14
248: dac01129 clz x9, x9

appear there to indicate a fast implementation of ffs. My reference on smart bit ops for arm64 and how to recognize them is https://fgiesen.wordpress.com/2013/10/18/bit-scanning-equivalencies/.

Without the patch, the arm64 build does not have these sequences in these object files and has lines like this instead:

af34:       94000000        bl      0 <ffsl>

which do not appear with the patch in place.

None of this suggests that this patch is correct, but only that it does something. I don't have the means to test an arm64 kernel at this time. If Peter Holm can help, that would be great. If not, I'll have to wait for the freebsd arm64 test boxes to come back online.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Perhaps @andrew, @manu, @greg_unrelenting.technology or one of the other ARM folks would be willing to test?

arm64/include/cpufunc.h
56 ↗(On Diff #57343)

The indentation in this function and the one above looks off.

Address indentation issues.

Can somebody explain to me why don't we use __builtin_ffs() and other functions everywhere ?

arm64/include/cpufunc.h
47 ↗(On Diff #57349)

Why do you need the cast of mask to u_int ? Prototype in the gcc reference shows
int __builtin_ctz(unsigned).

I also find it strange to write mask == 0 ? mask : ..., I would certainly express it as mask == 0 ? 0 : ....

85 ↗(On Diff #57349)

Why casting the result of __builtin_clzl() to int ? It is tautology.

dougm marked 3 inline comments as done.

Reduce strangeness.

arm64/include/cpufunc.h
47 ↗(On Diff #57349)

In imitating the code in amd64/include/cpufunc.h, I unnecessarily copied a cast-to-int. I will remove it. I also copied the use of 'mask' that seems strange. I will remove it too.

85 ↗(On Diff #57349)

Agreed.

In D20250#436408, @kib wrote:

Can somebody explain to me why don't we use __builtin_ffs() and other functions everywhere ?

In case you are asking why I wrote code using builtin_clz and not builtin_ffs, it is because I believe that many call to ffs and fls always subtract one from the result, and always know that the argument is nonzero, and I think that the way I wrote things might let the compiler optimize out the zero-check and/or the plus-one sometimes, and maybe with __builtin_ffs that wouldn't happen. But that is speculation.

I case you are asking about the whole freebsd codebase in general, I can't say.

Kernel with this patch boots fine on Amazon EC2 (which btw you can use too, freebsd-snapshots posts have aarch64 AMIs)

The only difference in the pmap.o code on arm64 between using builtin_ffsl and _builtin_ctzl (with a zero-test and an add) is that the version with builtin_ffsl has an 'orr' instruction where the other version has an 'add'.

So this version uses builtin_ffs and builtin_ffsl.

arm64/include/cpufunc.h
94 ↗(On Diff #57437)

This assumes that long and long long will always be the same size. If a tool chain came along that defined long long as 128 bits, this would compile without an error or warning and produce broken results. I would either add a CTASSERT() that long and long long are the same size or implement this using __builtin_clzll.

Handle the possibility that sizeof(long long) != sizeof(long).

This revision is now accepted and ready to land.May 17 2019, 4:36 AM
This revision was automatically updated to reflect the committed changes.