
atomic.9: provide fine details about CAS memory model MD semantics
Accepted · Public

Authored by kib on Sep 26 2025, 7:00 AM.
Details

Reviewers
andrew
alc
markj
Summary
atomic.9: fix description of acquire and release

The ordering point is not the atomic operation itself, but the load for
acquire or store for release done as part of the atomic.  This does not
matter for atomic_load_acq and atomic_store_rel, but does matter for RMW
operations.

atomic.9: provide fine details about CAS memory model MD semantics
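The distinction the summary draws can be illustrated with a hypothetical C11 sketch (this is not FreeBSD kernel code; producer, consumer, flag, and payload are invented names): for a release RMW the ordering point is the store half, and for an acquire RMW it is the load half.

```c
#include <stdatomic.h>

_Atomic int flag;
int payload;

/*
 * Release RMW: the ordering point is the STORE half.  The write to
 * payload cannot be reordered after the store of the incremented
 * value to flag.
 */
void
producer(void)
{
	payload = 1;
	atomic_fetch_add_explicit(&flag, 1, memory_order_release);
}

/*
 * Acquire RMW: the ordering point is the LOAD half.  The read of
 * payload cannot move before the load of flag, but on some
 * architectures it may move before the store half of the RMW.
 */
int
consumer(void)
{
	if (atomic_fetch_add_explicit(&flag, 0, memory_order_acquire) >= 1)
		return (payload);
	return (-1);
}
```

Run single-threaded, producer() followed by consumer() returns 1; the interesting cases are of course the concurrent ones.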

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

kib requested review of this revision. Sep 26 2025, 7:00 AM

This is my reading of both the LSE and LL/SC implementations of CAS ops on aarch64.

I am not sure whether it should be documented even if I am right, but I tend to think that it is worth it.

On arm64, for atomics with an acquire, later memory operations can move before the store; however, the store needs to succeed, since if it does not we will execute the load-acquire again.

I don't think the CAS instructions allow this. The pseudo-code has the comment "All observers in the shareability domain observe the following load and store atomically" before listing the load and store operations. This seems to indicate that if the load has been observed, the store has too.

Note that the LL/SC instructions have some subtlety when implementing fully sequentially consistent semantics. As the acquire and release are on different instructions, memory operations may move into the sequence and be swapped. As we don't implement them in atomic(9) this shouldn't be an issue, however it will be if we later add them.
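The hazard described above for the LL/SC-based acquire CAS can be sketched in C11 (a hypothetical illustration, not FreeBSD code; lock, data, and try_lock_and_write are invented names). On arm64 such a CAS compiles to an LDAXR/STXR loop in which only the load carries acquire semantics:

```c
#include <stdatomic.h>
#include <stdbool.h>

_Atomic int lock;
int data;

/*
 * Take the lock with an acquire CAS, then write some data.  With an
 * LDAXR/STXR implementation the STXR is a plain store, so a third
 * observer may see the store to data before it sees lock == 1.
 */
bool
try_lock_and_write(void)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong_explicit(&lock, &expected, 1,
	    memory_order_acquire, memory_order_relaxed))
		return (false);
	data = 42;	/* may become visible before the store to lock */
	return (true);
}
```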

> On arm64, for atomics with an acquire, later memory operations can move before the store; however, the store needs to succeed, since if it does not we will execute the load-acquire again.

> I don't think the CAS instructions allow this. The pseudo-code has the comment "All observers in the shareability domain observe the following load and store atomically" before listing the load and store operations. This seems to indicate that if the load has been observed, the store has too.

But if CAS failed, could we claim that the load was not observed?

Anyway, I do not question the behavior.

> Note that the LL/SC instructions have some subtlety when implementing fully sequentially consistent semantics. As the acquire and release are on different instructions, memory operations may move into the sequence and be swapped. As we don't implement them in atomic(9) this shouldn't be an issue, however it will be if we later add them.

> On arm64, for atomics with an acquire, later memory operations can move before the store; however, the store needs to succeed, since if it does not we will execute the load-acquire again.

> I don't think the CAS instructions allow this. The pseudo-code has the comment "All observers in the shareability domain observe the following load and store atomically" before listing the load and store operations. This seems to indicate that if the load has been observed, the store has too.

> Note that the LL/SC instructions have some subtlety when implementing fully sequentially consistent semantics. As the acquire and release are on different instructions, memory operations may move into the sequence and be swapped. As we don't implement them in atomic(9) this shouldn't be an issue, however it will be if we later add them.

I vaguely recall Linux having an issue with the aforementioned behavior about 8-10 years ago in the implementation of a synchronization primitive.

share/man/man9/atomic.9: lines 250, 252, 253
kib marked 3 inline comments as done.

Fix grammar.

In D52744#1205716, @alc wrote:

> On arm64, for atomics with an acquire, later memory operations can move before the store; however, the store needs to succeed, since if it does not we will execute the load-acquire again.

> I don't think the CAS instructions allow this. The pseudo-code has the comment "All observers in the shareability domain observe the following load and store atomically" before listing the load and store operations. This seems to indicate that if the load has been observed, the store has too.

> Note that the LL/SC instructions have some subtlety when implementing fully sequentially consistent semantics. As the acquire and release are on different instructions, memory operations may move into the sequence and be swapped. As we don't implement them in atomic(9) this shouldn't be an issue, however it will be if we later add them.

> I vaguely recall Linux having an issue with the aforementioned behavior about 8-10 years ago in the implementation of a synchronization primitive.

But then don't we have a similar (or even larger) problem on ppc and risc-v, where the CAS is fully relaxed?

share/man/man9/atomic.9:251

The Tn macro is dead; you can check the linter warning with mandoc -T lint ./atomic.9.

share/man/man9/atomic.9:251

Why?
Did groff drop it as well? If not, I think there is something wrong with the decision.

> Why?

The linter command I posted says it is useless and does nothing, IIRC.

> Did groff drop it as well?

Yes. The bottom of the groff_mdoc(7) manual says: "As of groff 1.23, ‘Tn’ no longer changes the type size; this functionality may return in the next release."

In D52744#1205716, @alc wrote:

> Note that the LL/SC instructions have some subtlety when implementing fully sequentially consistent semantics. As the acquire and release are on different instructions, memory operations may move into the sequence and be swapped. As we don't implement them in atomic(9) this shouldn't be an issue, however it will be if we later add them.

> I vaguely recall Linux having an issue with the aforementioned behavior about 8-10 years ago in the implementation of a synchronization primitive.

I've known about the issue on Linux. This is the fix for the Linux issue: https://github.com/torvalds/linux/commit/8e86f0b409a44193f1587e87b69c5dcf8f65be67. It has not been a problem on FreeBSD, as we don't have atomic operations with a full barrier in atomic(9).

kib marked an inline comment as done.

Remove .Tn from the patch. Existing .Tns are kept.

In D52744#1205716, @alc wrote:

> Note that the LL/SC instructions have some subtlety when implementing fully sequentially consistent semantics. As the acquire and release are on different instructions, memory operations may move into the sequence and be swapped. As we don't implement them in atomic(9) this shouldn't be an issue, however it will be if we later add them.

> I vaguely recall Linux having an issue with the aforementioned behavior about 8-10 years ago in the implementation of a synchronization primitive.

> I've known about the issue on Linux. This is the fix for the Linux issue: https://github.com/torvalds/linux/commit/8e86f0b409a44193f1587e87b69c5dcf8f65be67. It has not been a problem on FreeBSD, as we don't have atomic operations with a full barrier in atomic(9).

Yes, this is what I was remembering. smr_enter has an #ifdef for handling a case where this would be an issue. While it is true that there is not a problem with the primitives in atomic.h themselves, I am afraid of their misuse in the sense of expecting x86-like behavior elsewhere.
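The kind of misuse described here, code written expecting x86-like fully ordered RMWs, is the classic store-buffering pattern. A hypothetical C11 sketch (thread1, thread2, x, and y are invented names): with seq_cst RMWs the concurrent outcome where both threads read 0 is forbidden, while acquire or release ordering alone does not forbid it.

```c
#include <stdatomic.h>

_Atomic int x, y;

/*
 * Store-buffering litmus.  In the real test thread1() and thread2()
 * run concurrently on two CPUs; seq_cst RMWs forbid both returning 0,
 * but RMWs with only acq/rel ordering do not.
 */
int
thread1(void)
{
	atomic_fetch_add_explicit(&x, 1, memory_order_seq_cst);
	return (atomic_load_explicit(&y, memory_order_relaxed));
}

int
thread2(void)
{
	atomic_fetch_add_explicit(&y, 1, memory_order_seq_cst);
	return (atomic_load_explicit(&x, memory_order_relaxed));
}
```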

@kib I am supportive of a change like this. However, in regards to the arm64 text, I don't think that it will lead the reader to the intended conclusion. At present, you are trying to describe what the operations do provide, but that leaves the reader to infer what they don't provide. I think that we need to be more direct and describe what the problem is: that, for lack of a better word, the load and store making up the operation are not performed "indivisibly", and so reordering of other accesses with respect to one or the other can occur.

Be explicit about what might happen on aarch64.

share/man/man9/atomic.9: lines 249, 256

The original is ok, but IMHO my suggestion is a bit clearer.

share/man/man9/atomic.9: lines 261, 266, 267

My recollection is that this is not specific to the LL/SC variant, it is true of the LSE atomic instruction as well. Despite the fact that it is a single instruction, its memory accesses are effectively separate from the POV of memory access ordering. Is that right?

kib marked 4 inline comments as done. Oct 1 2025, 10:38 AM
kib added inline comments.
share/man/man9/atomic.9:267

I believe yes; this follows from the description of CASAL, which references J1.1.3.101 MemAtomic, where the pseudocode has separate load and store. I believe andrew@ explained the comment "All observers in the shareability domain observe the following load and store atomically" as implying exclusive ownership of the location, and not as an indivisible load+store event.

kib marked an inline comment as done.

Markj@ editing

share/man/man9/atomic.9:247

Doesn't this caveat apply to RMW atomics generally, not just atomic_(f)cmpset? e.g., atomic_testandset.

share/man/man9/atomic.9:247

For instance, aarch64 fetchadd is implemented with plain LDXR/STXR instructions and does not provide any fence semantics. readandclear seems to be the same.

markj added inline comments.
share/man/man9/atomic.9:247

You mean, we do not provide any variants other than _relaxed? That's true, but maybe one will be added someday.

This revision is now accepted and ready to land. Fri, Oct 3, 10:13 PM
share/man/man9/atomic.9:247

I mean, only aarch64 [f]cmpset has acq+rel semantics. All other RMW ops seem to be either seq-cst (x86) or fully relaxed. But I have not fully reviewed the implementations to make this claim.

It has been a long time since I looked at this man page. In the interest of making sure that these changes were written in a style consistent with the rest of the man page, I decided to take a look, and having done so the following bits concern me:

For example, to subtract two integers ensuring that the subtraction is
completed before any subsequent loads and stores are performed, use
atomic_subtract_acq_int().

and

For example, to add two long integers ensuring that all prior loads and
stores are completed before the addition is performed, use
atomic_add_rel_long().

A reasonable interpretation of these statements seems in conflict with the change under discussion here. For example, if you interpret "the subtraction is completed" as the store of the computed difference having been completed, then this is most definitely not true on arm64.

Here's a modest proposal to consider: We update the man page's section "Acquire and Release Operations" to correctly describe what happens in general, e.g., on arm64. Then, the proposed caveats section simply describes how amd64 and i386 differ, and says machine independent code should not rely on this behavior.
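The concern about "the subtraction is completed" can be made concrete with a hypothetical C11 analogue of atomic_subtract_acq_int (subtract_acq and v are invented names, and this is not the kernel implementation): only the load half is an ordering point, so the store of the difference may still drift past subsequent accesses on arm64.

```c
#include <stdatomic.h>

_Atomic int v = 10;

/* Hypothetical analogue of atomic_subtract_acq_int(&v, amount). */
int
subtract_acq(int amount)
{
	/*
	 * Subsequent loads may not move before the LOAD of v, but the
	 * STORE of the computed difference is not an ordering point
	 * and may be delayed past later accesses on arm64.
	 */
	return (atomic_fetch_sub_explicit(&v, amount,
	    memory_order_acquire));
}
```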

share/man/man9/atomic.9: lines 246–251, 253–255

I suggest this because we don't use x86 elsewhere in this man page. We spell out amd64 and i386.

share/man/man9/atomic.9: lines 258–261
kib marked 5 inline comments as done. Sun, Oct 5, 8:52 AM

Yes, indeed, what you noted is a bug in the description of acq/rel for atomics that perform more than one memory access, basically all RMW atomics. I think just fixing that would be enough.

kib edited the summary of this revision. (Show Details)

Fix acq/rel description for RMW.

This revision now requires review to proceed. Sun, Oct 5, 8:53 AM
This revision is now accepted and ready to land. Tue, Oct 7, 8:18 PM

Alan, do you plan to comment on this?