Page MenuHomeFreeBSD

Add 8 and 16 bit versions of atomic_cmpset and atomic_fcmpset for arm.
Needs ReviewPublic

Authored by ian on Tue, Sep 17, 10:56 PM.


Group Reviewers

This adds 8 and 16 bit versions of the cmpset and fcmpset functions. Macros are used to generate all the flavors from the same set of instructions; the macro expansion handles the couple minor differences between each size variation (generating ldrexb/ldrexh/ldrex for 8/16/32, etc).

In addition to handling new sizes, the instruction sequences used for cmpset and fcmpset are rewritten to be a bit shorter/faster, and the new sequence will not return false when *dst==*old but the store-exclusive fails because of concurrent writers. Instead, it just loops like ldrex/strex sequences normally do until it gets a non-conflicted store. The manpage allows LL/SC architectures to bogusly return false, but there's no reason to actually do so, at least on arm.

Diff Detail

rS FreeBSD src repository
Lint Skipped
Unit Tests Skipped
Build Status
Buildable 26541

Event Timeline

ian created this revision.Tue, Sep 17, 10:56 PM
ian updated this revision to Diff 62243.Tue, Sep 17, 11:43 PM

Add 8 and 16 bit [f]cmpset for armv4/5 (kernel only).

andrew added inline comments.Wed, Sep 18, 8:49 AM

Isn't the point of fcmpset that we don't loop? They can return false when the store fails and the calling code has to handle this, e.g. it will be in a larger loop.

ian added inline comments.Wed, Sep 18, 6:35 PM

An outer loop might handle failure because of an oldval mismatch, but why should it have to handle failure due to the internal implementation details of atomics on a given platform? The strex can fail for reasons such as an interrupt or fault happening between the ldrex and strex. We can loop more efficiently internally than an outer loop can (it costs literally 0 cycles unless the strex actually fails).

The manpage for fcmpset says that some platforms "might" return false even when old==new, and I think that's a documentation bug (and implementation bug if it's actually true on any platform) that should be fixed.

andrew added a reviewer: mjg.Thu, Sep 19, 7:31 AM
mjg added inline comments.Thu, Sep 19, 7:44 AM

The point of fcmpset (over cmpset) is that it returns the found value instead of throwing it away and forcing common consumers to read it again.

The c11 standard provides cmpxchg with two variants (weak and strong) to let you pick what behavior you want and if there is a bug it's that we don't have fcmpset_weak and fcmpset_strong (with whatever names).

Another consideration is that fcmpset is inlined all over the kernel due to locking primitives, so adding the loop makes the call site bigger (but I don't know the impact on arm). On the other hand these suckers seem so big already that perhaps deinline locking routines would be faster on this arch.

tl;dr looping or not looping is up to you, if it's faster to loop then do it, but apart from it perhaps you just want to reduce the spread of fcmpset over the kernel

ian added inline comments.Thu, Sep 19, 3:14 PM

Adding the loop on arm makes the call site bigger by one instruction, which will take 0 cycles in the usual case. I guess it's a cisc-centric background that would make someone declare this code as "large". The biggest one, fcmpset64, is 10 instructions, of which only 6 execute in the usual uncontested case (I wonder if x86 lock cmpxchg is able to run in just 6 cycles without impacting any core other than the one it runs on?). ARM's ability to conditionally execute (or not) most instructions makes for really fast branch-free code in which you get to coast through some instructions at a cost of zero cycles, and most of the ones that aren't skipped are 1 cycle to execute.

The point about returning the found value is why I think it's important to loop internally on a strex access conflict. If you don't do so, then the value returned is the original value you read from there, even though you know (because of the strex failure) that there is now a new value there. If the code returns false on strex failure instead of looping, the caller cannot know whether *oldval is valid or not, and would have to re-read it to be sure.

mjg added inline comments.Mon, Sep 23, 3:59 PM

Even for cache-hot atomics the cost drastically differs between microarchitectures (in part because of how they react to it with the surrounding code).

Stale value is fine. For example a loop trying to replace the non-0 counter with a value 1 bigger worst case will get a counter which is already bigger (or lower) and will try to replace it, which is going to fail. Problems could only stem from the read not being preceded with a memory barrier or the value being invaild (e.g. a torn store or straight up garbage), none of which is the case here. Anything which can legitimately be there has to already be accounted for by the consumer. Whether relooping internally helps on this arch I cannot comment on.

mjg added a comment.Tue, Sep 24, 7:05 PM

So is this waiting on anyone in particular? I'm not qualified to review the arm variants and most definitely should not be added as a reviewer.