Add 8 and 16 bit versions of atomic_cmpset and atomic_fcmpset for arm.
AbandonedPublic
Actions

Authored by • ian on Sep 17 2019, 10:56 PM.

Details

Reviewers

Group Reviewers

Summary

This adds 8 and 16 bit versions of the cmpset and fcmpset functions. Macros are used to generate all the flavors from the same set of instructions; the macro expansion handles the couple minor differences between each size variation (generating ldrexb/ldrexh/ldrex for 8/16/32, etc).

In addition to handling new sizes, the instruction sequences used for cmpset and fcmpset are rewritten to be a bit shorter/faster, and the new sequence will not return false when *dst==*old but the store-exclusive fails because of concurrent writers. Instead, it just loops like ldrex/strex sequences normally do until it gets a non-conflicted store. The manpage allows LL/SC architectures to bogusly return false, but there's no reason to actually do so, at least on arm.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 26541

Event Timeline

• ian created this revision.Sep 17 2019, 10:56 PM

Herald added subscribers: andrew, imp. · View Herald TranscriptSep 17 2019, 10:56 PM

Harbormaster completed remote builds in B26539: Diff 62239.Sep 17 2019, 10:56 PM

Add 8 and 16 bit [f]cmpset for armv4/5 (kernel only).

Harbormaster completed remote builds in B26541: Diff 62243.Sep 17 2019, 11:44 PM

andrew added inline comments.Sep 18 2019, 8:49 AM

sys/arm/include/atomic-v6.h
206	Isn't the point of fcmpset that we don't loop? They can return false when the store fails and the calling code has to handle this, e.g. it will be in a larger loop.

• ian added inline comments.Sep 18 2019, 6:35 PM

sys/arm/include/atomic-v6.h
206	An outer loop might handle failure because of an oldval mismatch, but why should it have to handle failure due to the internal implementation details of atomics on a given platform? The strex can fail for reasons such as an interrupt or fault happening between the ldrex and strex. We can loop more efficiently internally than an outer loop can (it costs literally 0 cycles unless the strex actually fails). The manpage for fcmpset says that some platforms "might" return false even when old==new, and I think that's a documentation bug (and implementation bug if it's actually true on any platform) that should be fixed.

andrew added a reviewer: mjg.Sep 19 2019, 7:31 AM

mjg added inline comments.Sep 19 2019, 7:44 AM

sys/arm/include/atomic-v6.h
206	The point of fcmpset (over cmpset) is that it returns the found value instead of throwing it away and forcing common consumers to read it again. The c11 standard provides cmpxchg with two variants (weak and strong) to let you pick what behavior you want and if there is a bug it's that we don't have fcmpset_weak and fcmpset_strong (with whatever names). Another consideration is that fcmpset is inlined all over the kernel due to locking primitives, so adding the loop makes the call site bigger (but I don't know the impact on arm). On the other hand these suckers seem so big already that perhaps deinline locking routines would be faster on this arch. tl;dr looping or not looping is up to you, if it's faster to loop then do it, but apart from it perhaps you just want to reduce the spread of fcmpset over the kernel

• ian added inline comments.Sep 19 2019, 3:14 PM

sys/arm/include/atomic-v6.h
206	Adding the loop on arm makes the call site bigger by one instruction, which will take 0 cycles in the usual case. I guess it's a cisc-centric background that would make someone declare this code as "large". The biggest one, fcmpset64, is 10 instructions, of which only 6 execute in the usual uncontested case (I wonder if x86 lock cmpxchg is able to run in just 6 cycles without impacting any core other than the one it runs on?). ARM's ability to conditionally execute (or not) most instructions makes for really fast branch-free code in which you get to coast through some instructions at a cost of zero cycles, and most of the ones that aren't skipped are 1 cycle to execute. The point about returning the found value is why I think it's important to loop internally on a strex access conflict. If you don't do so, then the value returned is the original value you read from there, even though you know (because of the strex failure) that there is now a new value there. If the code returns false on strex failure instead of looping, the caller cannot know whether *oldval is valid or not, and would have to re-read it to be sure.

mjg added inline comments.Sep 23 2019, 3:59 PM

sys/arm/include/atomic-v6.h
206	Even for cache-hot atomics the cost drastically differs between microarchitectures (in part because of how they react to it with the surrounding code). Stale value is fine. For example a loop trying to replace the non-0 counter with a value 1 bigger worst case will get a counter which is already bigger (or lower) and will try to replace it, which is going to fail. Problems could only stem from the read not being preceded with a memory barrier or the value being invaild (e.g. a torn store or straight up garbage), none of which is the case here. Anything which can legitimately be there has to already be accounted for by the consumer. Whether relooping internally helps on this arch I cannot comment on.

So is this waiting on anyone in particular? I'm not qualified to review the arm variants and most definitely should not be added as a reviewer.

This was committed as r352938, but I neglected to include the phab link in the commit.

Revision Contents
Changeset List

Path

Size

sys/

arm/

include/

atomic-v4.h

79 lines

atomic-v6.h

356 lines

Diff 62243

View Options

sys/arm/include/atomic-v4.h