Page MenuHomeFreeBSD

Implement a translation from FreeBSD atomic API to gcc intrinsics
Needs ReviewPublic

Authored by mjg on Feb 13 2020, 2:23 PM.

Details

Reviewers
kib
markj
jeff
jhb
Summary

Which are of course supported by clang.

Primary motivation is to eliminate problems like this (mtx lock):

0xffffffff80cb873b <+139>:	xor    %eax,%eax
0xffffffff80cb873d <+141>:	lock cmpxchg %r14,0x18(%rbx)
0xffffffff80cb8743 <+147>:	sete   %cl
0xffffffff80cb8746 <+150>:	test   %cl,%cl
0xffffffff80cb8748 <+152>:	je     0xffffffff80cb8798 <vfs_ref+232>

patched:

0xffffffff80caaa3b <+139>:	xor    %eax,%eax
0xffffffff80caaa3d <+141>:	lock cmpxchg %r14,0x18(%rbx)
0xffffffff80caaa43 <+147>:	jne    0xffffffff80caaa8e <vfs_ref+222>

Note that clang apparently now supports indicating the result in flags, meaning the hand-rolled routine can be patched to end up generating equivalent code. Howver, I don't think there is any good reason now to handroll said routines.

Note there is no equivalent for testandclear/readandclear so they are left be.

Also I did not alter load/store routines. In my opinion that part needs to be reworked to be type-friendly to passed arguments and goes beyond the scope of this patch.

The patch does not add routines which were not already present on amd64, but this can be changed later. The intent is to eventually remove all hand-rolled routines.

I provided powerpc64 people with a patch for their platform and asked that they test. Works for me on amd64 of course.

Kernel .text changed only a little bit:

    text      data       bss        dec         hex   filename
22234960   1778275   4505600   28518835   0x1b329b3   GENERIC-NODEBUG.patched/kernel
22340096   1778275   4505600   28623971   0x1b4c463   GENERIC-NODEBUG.stock/kernel

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 29344

Event Timeline

So in CheriBSD we just replaced all of atomic(9) with the __atomic API supported in both modern GCC and clang for RISC-V:

https://github.com/CTSRD-CHERI/cheribsd/commit/e9e9ff10652179eaeec911ed804326bee482a0b3#diff-dc47b5c2fe7ae88823267345a6d1ae5c

However, that header could likely just be used on any architecture as none of it is MD. It also does the full atomic API and works with both clang and gcc.

I do think it would be nice to also add untyped variants as part of doing this as well.

There is a risk of trusting the compiler (we had to fix a bug in RISC-V clang I'm not sure is yet upstream for the 32-bit cmpset).

In D23661#521321, @jhb wrote:

So in CheriBSD we just replaced all of atomic(9) with the __atomic API supported in both modern GCC and clang for RISC-V:

https://github.com/CTSRD-CHERI/cheribsd/commit/e9e9ff10652179eaeec911ed804326bee482a0b3#diff-dc47b5c2fe7ae88823267345a6d1ae5c

However, that header could likely just be used on any architecture as none of it is MD. It also does the full atomic API and works with both clang and gcc.

Well my header is very similar but superior in that interested architectures can pick and alias whatever they want however they see fit, most notably not running into any 32 vs 64-bit problems or forcing weak/strong fcmpset when other variant is desired. Otherwise it's the same stuff.

I do think it would be nice to also add untyped variants as part of doing this as well.

Untyped variants of what specifically? All types or just ptr? I think this is separate effort.

Note there are 2 nits I'll fix later:

  1. the clear routine can push negation one level below
  2. failing (f)cmpset does not have to post a fence

Biggest objection is that there is still no ABI for atomic stuff even on x86, so each compiler (version) is free to choose the implementation that it finds suitable and that non necessary compatible with the other' compiler implementation.

I.e., is the generated code compatible with existing atomic.h on x86 is the valid question as well.

  • force strong cmpset since this seems to be the requirement right now
  • push ~ lower in atomic_clear
  • use relaxed barrier for failing (f)cmpset

I think it's a fair worry that mixing hand-rolled atomics with compiler intrinsics may lead to trouble, but it's not hard to remedy.

Note the 2 mentioned missing testandclear and readandclear can be implemented without significant difficulty (the latter is a oneliner, the former worst case can be a cmpset loop. I grepped only 2 users.)

sys/sys/_gcc_atomic.h
51

Why is success_memorder always relaxed?

sys/sys/_gcc_atomic.h
51

ops, that's breakage introduced in subsequent update. of course it was meant to be for success, will fix later

In D23661#521387, @mjg wrote:

I think it's a fair worry that mixing hand-rolled atomics with compiler intrinsics may lead to trouble, but it's not hard to remedy.

Note the 2 mentioned missing testandclear and readandclear can be implemented without significant difficulty (the latter is a oneliner, the former worst case can be a cmpset loop. I grepped only 2 users.)

atomic_testandset/atomic_testandclear could be implemented with __atomic_fetch_or/__atomic_fetch_and, but we'd have to check that the intrinsics generate no worse code (currently we use bts/btr). We have more planned uses of these but were waiting on generic implementations of long variants in D22963 (which indeed uses a cmpset loop).

In D23661#521387, @mjg wrote:

I think it's a fair worry that mixing hand-rolled atomics with compiler intrinsics may lead to trouble, but it's not hard to remedy.

Note the 2 mentioned missing testandclear and readandclear can be implemented without significant difficulty (the latter is a oneliner, the former worst case can be a cmpset loop. I grepped only 2 users.)

atomic_testandset/atomic_testandclear could be implemented with __atomic_fetch_or/__atomic_fetch_and, but we'd have to check that the intrinsics generate no worse code (currently we use bts/btr). We have more planned uses of these but were waiting on generic implementations of long variants in D22963 (which indeed uses a cmpset loop).

No, they modify the entire memory operand and in fact are equivalent to atomic_set/atomic_clear (and are used in this very patch to implement them). testandset only modifies the bit, which can be preserved with a cmpset loop. Otherwise you can lose concurrent updates.

Now that I wrote the above this may actually be doable, I'll have to take a sober look later.

sys/sys/_gcc_atomic.h
51

sigh, I meant the actual barrier was meant for success and relaxed for failure.

sys/amd64/include/atomic.h
79

Could this be factored out into a separate header so that platforms that can use the compiler builtins can simply change their atomic.h to #include <atomics_using_compiler_intrinsics.h> or similar?

This won't work when using GCC with any of MIPS, RISC-V and certain 32-bit ARM configurations (A32 LDREX/STREX are ARMv6, A32 B/H/D variants are ARMv6K and T32 atomics need Thumb-2 ie ARMv6T2 and above). GCC will use libcalls for non-native-width atomics, unlike Clang which knows how to expand all of them (although Clang will still use a libcall on those ARM configurations that have no atomic instructions available at all, of course). As a downstream we can get away with only supporting Clang, but here we need something that supports both compilers. One option would be to use the atomic intrinsics and then implement the libcalls in a separate .S, though that may cause a slight performance hit due to the additional function call overhead on GCC.