
libc: Rewrite quick_exit() and at_quick_exit() using C11 atomics.
Closed, Public

Authored by des on Sep 22 2023, 11:20 AM.

Details

Summary

Compiler memory barriers do not prevent the CPU from executing the code
out of order. Switch to C11 atomics. This also lets us get rid of the
mutex; instead, loop until the compare_exchange succeeds.

While here, change the return value of at_quick_exit() on failure to
the more traditional -1, matching atexit().

Sponsored by: Klara, Inc.
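
A minimal sketch of the approach described above, for readers following the inline discussion below. The identifiers (quick_exit_handler, handlers, cleanup) follow the names used in the review comments; the code is illustrative only, is not the committed diff, and the exact memory orders were still being debated at this point.

/* Illustrative sketch only, not the committed diff. */
#include <stdatomic.h>
#include <stdlib.h>

struct quick_exit_handler {
	struct quick_exit_handler *next;
	void (*cleanup)(void);
};

/* Stack of cleanup handlers, invoked in LIFO order by quick_exit(). */
static _Atomic(struct quick_exit_handler *) handlers;

int
at_quick_exit(void (*func)(void))
{
	struct quick_exit_handler *h;

	if ((h = malloc(sizeof(*h))) == NULL)
		return (-1);
	h->cleanup = func;
	/*
	 * Push onto the list.  On failure the compare-exchange reloads
	 * h->next with the current head (the "winner" of the race), so
	 * the loop body stays empty.
	 */
	h->next = atomic_load(&handlers);
	while (!atomic_compare_exchange_weak(&handlers, &h->next, h))
		continue;
	return (0);
}

void
quick_exit(int status)
{
	struct quick_exit_handler *h;

	/*
	 * The acquire load pairs with the (release-or-stronger) ordering
	 * of the compare-exchange above, so the stores to h->cleanup are
	 * visible before the handlers are invoked.
	 */
	for (h = atomic_load_explicit(&handlers, memory_order_acquire);
	    h != NULL; h = h->next)
		h->cleanup();
	_Exit(status);
}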

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

This revision is now accepted and ready to land. Sep 22 2023, 11:45 AM
kib added inline comments.
lib/libc/stdlib/quick_exit.c
60

I do not see why you need fcmpset(), since you re-assign h->next on each iteration. Either move the h->next = handlers; out of the loop, or use cmpset().

73

Each dereference of h->next also needs to have acquire semantics, so that the read of *(h->next) is consistent with finding the element on the list.

lib/libc/stdlib/quick_exit.c
60

Because if I lose the race I need h->next to point to the winner.

73

Why? h->next is never updated after insertion.

lib/libc/stdlib/quick_exit.c
60

Oh I realize my mistake now, will update shortly.

fcmpset will update h->next
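
For context, a rough sketch of the difference being discussed (fragment only; it assumes the pre-conversion declaration of handlers as a plain pointer and the handler struct from the sketch near the top of this page; see atomic(9) for the exact prototypes). atomic_cmpset_*() takes the expected value by value, while atomic_fcmpset_*() takes it by reference and writes back the observed value on failure.

/* cmpset: the caller must re-read the head itself on every failure. */
do {
	h->next = handlers;
} while (!atomic_cmpset_rel_ptr((volatile uintptr_t *)&handlers,
    (uintptr_t)h->next, (uintptr_t)h));

/* fcmpset: a failed attempt writes the observed head back into h->next,
 * so it already points at the winner and the loop body can stay empty. */
h->next = handlers;
while (!atomic_fcmpset_rel_ptr((volatile uintptr_t *)&handlers,
    (uintptr_t *)&h->next, (uintptr_t)h))
	continue;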

This revision now requires review to proceed. Sep 22 2023, 12:34 PM
des marked an inline comment as done. Sep 22 2023, 12:35 PM
olce added inline comments.
lib/libc/stdlib/quick_exit.c
73

Why? h->next is never updated after insertion.

Indeed, I don't think acquire reads are necessary here.

This revision is now accepted and ready to land. Sep 22 2023, 12:38 PM
lib/libc/stdlib/quick_exit.c
45–46

Maybe clean up this comment while you're here, perhaps something like "Stack of cleanup handlers, invoked in LIFO order by quick_exit()." would be a bit more clear?

lib/libc/stdlib/quick_exit.c
45–46

Oh, that's why it had a trailing space, I hadn't even noticed the sentence is incomplete! Maybe @theraven remembers how it was supposed to end?

merge and clarify comments

This revision now requires review to proceed. Sep 22 2023, 2:08 PM
des marked an inline comment as done. Sep 22 2023, 2:08 PM
des marked an inline comment as done.

Is there a reason to write new code that uses atomics but not C11 atomics? We've had support for them for several releases now, and it makes code much harder for new developers to understand if it uses nonstandard features for things that are part of the standard.

lib/libc/stdlib/quick_exit.c
60

Why do you want a predict false here? I would expect it to either be ignored or make things worse on any modern CPU. If you want to micro-optimise a cold loop like this, it would be better to put a macro in the retry path that expands to pause on x86 and the local equivalent on other architectures.

What is supposed to happen when this is a multithreaded program and multiple threads call quick_exit at the same time? While it is silly for such a setup to exist, I had the misfortune of encountering it (albeit with full-blown exit(3)).

That is to say, while the original code is indeed buggy, I think the actual work to do here would be to make sure there are no repeat calls to exit handlers in the presence of multiple threads landing here.

One way out would be to atomic_swap the head to NULL.

Note this could still race against threads adding to the list as someone calls quick_exit, but that is unfixable.
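
A hedged sketch of what that suggestion could look like in C11 terms (atomic_exchange), reusing the names from the sketch near the top of this page; this is not what the diff does.

void
quick_exit(int status)
{
	struct quick_exit_handler *h;

	/*
	 * Claim the whole list atomically.  Only the thread that wins the
	 * exchange sees a non-NULL head, so the handlers run at most once
	 * even if several threads race into quick_exit().
	 */
	h = atomic_exchange_explicit(&handlers, NULL, memory_order_acquire);
	for (; h != NULL; h = h->next)
		h->cleanup();
	_Exit(status);
}

Note that the losing threads still reach _Exit() immediately, which is the concern raised in the next comment.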

lib/libc/stdlib/quick_exit.c
60

pause in a tight loop like this is a known pessimizer (although there are archs where some arch-specific delay does help). Note this is not *waiting* for the value to reach a certain state (as in locking primitives), where retrying in a tight loop would indeed be bad.

One way out would be to atomic_swap the head to NULL.

That's also probably surprising. One thread will start running cleanup functions, the others will then exit the process before the first has finished. You probably want to acquire a mutex to block all other threads from entering quick_exit after one has. This might want to be shared with the atexit cleanup, because otherwise you can have strange situations where one thread calls exit and another calls quick_exit.

lib/libc/stdlib/quick_exit.c
60

note this is not *waiting* for the value to reach a certain state

Ah, yes, you're right. Please ignore that part of the suggestion. I don't believe predict false is useful though, because most architectures will assume the CAS succeeds in their internal hinting; adding an explicit hint will not help and will cause it to do the wrong thing under contention (which should be very rare here, but it's still more source code for no benefit).

You probably want to acquire a mutex to block all other threads from entering quick_exit after one has. This might want to be shared with the atexit cleanup, because otherwise you can have strange situations where one thread calls exit and another calls quick_exit.

I want to say that's their problem, not ours...

des marked 3 inline comments as done. Sep 22 2023, 6:45 PM

From the C standard (C17):

The exit function causes normal program termination to occur. No functions registered by the
at_quick_exit function are called. If a program calls the exit function more than once, or calls the
quick_exit function in addition to the exit function, the behavior is undefined.

and conversely for quick_exit.

So I tend to agree with des. This doesn't preclude us from being "nicer", but I'm not sure there is much point in doing that if other prominent implementations (such as glibc) don't care (which I admittedly haven't checked).

In D41936#956161, @olce.freebsd_certner.fr wrote:

From the C standard (C17):

The exit function causes normal program termination to occur. No functions registered by the
at_quick_exit function are called. If a program calls the exit function more than once, or calls the
quick_exit function in addition to the exit function, the behavior is undefined.

and conversely for quick_exit.

So I tend to agree with des. This doesn't preclude us from being "nicer", but I'm not sure there is much point in doing that if other prominent implementations (such as glibc) don't care (which I admittedly haven't checked).

What are the real programs that depend on parallel quick_exit() to behave this way? If there are any, we can certainly adopt that behavior.

lib/libc/stdlib/quick_exit.c
73

It does not matter if it changes after insertion or not. The write to 'cleanup' needs to be visible if we see the element on the list. Stating it in C11 terminology: there needs to be a happens-before relation between the write to 'cleanup' and all the reads of it below.

The fcmpsets in at_quick_exit() only have release semantics, so I do not see a happens-before ordering between the writes to the 'cleanup' members of the list, and consequently no happens-before with the read.

The author of the previous version of the code had a similar intuition, I suspect; this is why he inserted the compiler barriers there.

That said, note that acq and rel are free on x86: they are translated into compiler barriers and plain loads/stores.

lib/libc/stdlib/quick_exit.c
73

Konstantin, I believe all writes to 'cleanup' are guaranteed to be seen by the thread that calls acquire, as noted in this excerpt I pulled from 'man atomic':

When a release operation by one thread synchronizes with an acquire
operation by another thread, usually meaning that the acquire operation
reads the value written by the release operation, then the effects of all
prior stores by the releasing thread must become visible to subsequent
loads by the acquiring thread.  Moreover, the effects of all stores (by
other threads) that were visible to the releasing thread must also become
visible to the acquiring thread.  These rules only apply to the
synchronizing threads.  Other threads might observe these stores in a
different order.
lib/libc/stdlib/quick_exit.c
73

Right, this is a somewhat informal description of the rel/acq effects. So imagine that thread 1 added a cleanup handler, then thread 2 added yet another handler. Note that there are no rel/acq pairs between threads 1 and 2.

Now thread 3 does quick_exit(), and its acq load synchronizes-with thread 2's release store. What ensures that thread 3 sees the stores from thread 1?
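
For reference, the interleaving being asked about, written out schematically with the names used elsewhere in this review:

  thread 1:  h1->cleanup = f1;  CAS_rel(handlers: NULL -> h1)   /* succeeds */
  thread 2:  h2->cleanup = f2;  h2->next = h1;
             CAS_rel(handlers: h1 -> h2)                        /* succeeds */
  thread 3:  h = load_acq(handlers)      /* reads h2; syncs with thread 2 */
             h->cleanup()                /* f2: visible */
             h = h->next                 /* h1 */
             h->cleanup()                /* is the store of f1 visible? */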

lib/libc/stdlib/quick_exit.c
73

Nice example! I did a little digging; it appears to me that 'lock cmpxchg' on x86 is intrinsically a full barrier, and on arm64 it appears we always use 'ldaxr ... stlxr', which is an acquire+release sequence (I guess to emulate x86 semantics?). This really caught me off guard, but I may have missed something, so be sure to double-check my math.

Unfortunately, the default semantics of FreeBSD atomic operations are "relaxed", whereas in C11 they are "seq_cst", and FreeBSD has no "seq_cst" variant of atomic_fcmpset(). So, short of relying on atomic_fcmpset() on FreeBSD being a full barrier in practice (which I do not know for certain), it does seem that we need to strengthen the fcmpset loop, perhaps by using acquire semantics on the first load of handlers into h->next.

OTOH, moving fully to C11 atomics and using the default cmpset semantics (i.e., "_seq_cst") would be sufficient in my mind to resolve the correctness issue posed by your scenario. Thoughts?

lib/libc/stdlib/quick_exit.c
73

In that scenario, if fcmpset_rel in thread 2 is to be trusted vs load_acq in thread 3, the update by thread 1 is automagically handled because its fcmpset_rel succeeded *prior* to thread 2 doing anything. (Alternatively, if they were racing each other, thread 1 won said race and thread 2 had to retry.)

In fact the load acq used here is too strong -- a consume fence would be sufficient, but there is no freebsd-specific intrinsic to do it.

lib/libc/stdlib/quick_exit.c
73

OTOH, moving fully to C11 atomics and using the default cmpset semantics (i.e., "_seq_cst") would be sufficient in my mind to resolve the correctness issue posed by your scenario. Thoughts?

See my previous comment. We should not add more code using non-standard C for things that are supported in standard C unless there is a *really* good reason. I have not seen any justification for not using C11 atomics here.

lib/libc/stdlib/quick_exit.c
73

In that scenario, if fcmpset_rel in thread 2 is to be trusted vs load_acq in thread 3, the update by thread 1 is automagically handled because its fcmpset_rel succeeded *prior* to thread 2 doing anything. (Alternatively, if they were racing each other, thread 1 won said race and thread 2 had to retry.)

You are claiming that fcmpset_rel() has seq_cst semantics, which is not guaranteed.

In fact the load acq used here is too strong -- a consume fence would be sufficient, but there is no freebsd-specific intrinsic to do it.

Consume is not implemented anywhere; it is aliased to acquire by both clang and gcc. I have also seen papers claiming that the current definition is not implementable.

lib/libc/stdlib/quick_exit.c
73

Nice example! I did a little digging; it appears to me that 'lock cmpxchg' on x86 is intrinsically a full barrier, and on arm64 it appears we always use 'ldaxr ... stlxr', which is an acquire+release sequence (I guess to emulate x86 semantics?). This really caught me off guard, but I may have missed something, so be sure to double-check my math.

Unfortunately, the default semantics of FreeBSD atomic operations are "relaxed", whereas in C11 they are "seq_cst", and FreeBSD has no "seq_cst" variant of atomic_fcmpset(). So, short of relying on atomic_fcmpset() on FreeBSD being a full barrier in practice (which I do not know for certain), it does seem that we need to strengthen the fcmpset loop, perhaps by using acquire semantics on the first load of handlers into h->next.

Just in case, note that acq+rel != seq_cst.

lib/libc/stdlib/quick_exit.c
73

You are claiming that fcmpset_rel() has seq_cst semantics, which is not guaranteed.

I'm not.

I am claiming:

  1. fcmpset_rel paired with load_acq already works for just 2 threads doing anything
  2. if thread 1 *succeeded* in fcmpset_rel and then thread 2 succeeded in fcmpset_rel, and thread 3 load_acq'd the value as set by thread 2, it is necessarily also synchronized against the stores from thread 1
  3. a load consume would suffice (see the snippet below). I mention it because acq pessimizes things on arm64; for example, if you take a look at the primitives in Concurrency Kit, they hand-roll a mere dereference enclosed in compiler barriers. I recall Linux was doing something to that extent for RCU as well.
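
To make the comparison in item 3 concrete, here are the two load flavours side by side in C11 spelling (illustrative fragment only; the hand-rolled compiler-barrier variant mentioned above is deliberately not shown, since it is not expressible as portable C11):

/* What the diff under review uses (modulo exact spelling): */
h = atomic_load_explicit(&handlers, memory_order_acquire);

/* What would suffice in principle; in practice both clang and gcc promote
 * consume to acquire, as noted in the reply below. */
h = atomic_load_explicit(&handlers, memory_order_consume);
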
lib/libc/stdlib/quick_exit.c
73

I was unable to find any formal rule why #2 from your list is necessarily true.

There is indeed a total order of atomic ops on each location, and this order must not contradict happens-before. But there is no requirement (in C11) that the local order of atomics on a specific location implies a global happens-before ordering of those ops, even when the participating releases are RMW ops.

There are memory models, like HW ARMv8, where releases are globally totally ordered, effectively including each per-location atomic total order in happens-before.

In D41936#956247, @kib wrote:

What are the real programs that depend on parallel quick_exit() to behave this way? If there are any, we can certainly adopt that behavior.

Two quick thoughts on this:

  1. This "extension" of specified behavior is independent of what is been fixed here, i.e., lack of proper synchronization between the producer (at_quick_exit()) and the consumer (quick_exit()) in use cases that do not trigger undefined behavior according to the C standard. So I don't think it should prevent @des from committing the current change.
  2. Adopting the behavior of running the relevant exit handlers only once would imply we are comfortable with committing to this behavior for a long time, in other words that we consider it a sensible behavior, which IMO warrants a discussion. I already have some ideas and thoughts to share on this topic, but I don't want to start "polluting" the matter at hand with it. I'd rather have this discussion in another venue, such as another review, and preferably after we have been shown a real program triggering undefined behavior under the C standard in this area. In any case (and again), let's ensure we are all confident that the currently visible changes are correct and have @des commit them before possibly delving into that part.

I've answered the part concerning the synchronization issue inline.

lib/libc/stdlib/quick_exit.c
73

It does not matter if it changes after insertion or not. The write to 'cleanup' needs to be visible if we see the element on the list. Stating it in C11 terminology: there needs to be a happens-before relation between the write to 'cleanup' and all the reads of it below.
I was unable to find any formal rule why #2 from your list is necessarily true.

I think you're mistaken (if you're not, then I am).

The write to cleanup necessarily happens before the read in quick_exit() if the element is on the list as seen by quick_exit(). Adopting the C standard parlance, this write is sequenced before the atomic_fcmpset_rel_ptr() call which is a release operation (if using the corresponding C atomic operation, and if the object has a C atomic type, indeed; this should be changed accordingly if switching to C atomics as asked by @theraven) which synchronizes with atomic_load_acq_ptr() (an acquire operation; again, this is true only if using the corresponding C atomic operation while reasoning in the C standard frame), which is sequenced before the read of cleanup. And, as you know, "sequenced before" is a sub-relation of "happens before", and the latter's definition implies it is transitive.

Maybe your objection comes from the fact that the reasoning above proves that the write to cleanup happens before the read in quick_exit(), but is not by itself enough to prove the former is *visible* to the latter (this is the understandably missing part in @mjg's explanation). Indeed, of all the side-effects to cleanup that happen before the read, the one before the atomic_fcmpset_rel_ptr() that is then read by the acquire operation must be the last one. But then it is immediate that this is true, because each quick_exit_handler structure is allocated anew, so it is impossible for two threads to share the same address for handler (as long as malloc() works...). Hence the *only* side-effect that can possibly affect the handler field of the quick_exit_handler structure actually read in quick_exit() is the one at the start of at_quick_exit() in the same call. Since, as we saw in the previous paragraph, it also happens before the read in quick_exit(), it is by definition (deferring to the C standard) the one that is visible at that point. QED.
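
Schematically, the chain described above for a single handler node is:

  W(h->cleanup)    --sequenced-before-->   release CAS on handlers  (at_quick_exit)
  release CAS      --synchronizes-with-->  acquire load of handlers (quick_exit)
  acquire load     --sequenced-before-->   R(h->cleanup)
  --------------------------------------------------------------------------
  therefore W(h->cleanup) happens-before R(h->cleanup), by transitivity.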

lib/libc/stdlib/quick_exit.c
73

Hence the *only* side-effect that can possibly affect the handler field of the quick_exit_handler structure...

Typo here, handler should be replaced by cleanup.

des marked 12 inline comments as done. Sep 26 2023, 5:46 PM
des retitled this revision from "libc: Add memory barriers to quick_exit() and at_quick_exit()." to "libc: Rewrite quick_exit() and at_quick_exit() using C11 atomics.".
des edited the summary of this revision. (Show Details)
lib/libc/stdlib/quick_exit.c
57–59

So you changed to sequentially consistent accesses there, which is stronger than needed, but fine.

72–73

The load must have acquire semantics to synchronize-with the CAS in the last at_quick_exit().

des marked an inline comment as done. Sep 26 2023, 6:29 PM
des marked an inline comment as done.
This revision is now accepted and ready to land. Sep 26 2023, 6:46 PM
lib/libc/stdlib/quick_exit.c
57–59

This probably wants to be a weak CAS, since we're not doing anything in the loop. On RISC architectures a weak CAS expands to an LL/SC pair, a strong CAS to an LL/SC pair in a loop. We then put another loop around that, which doesn't do anything differently.
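
To illustrate the distinction (fragment only, reusing the declarations from the sketch near the top of this page; not the committed code):

/* Strong CAS: the implementation must hide spurious LL/SC failures, so on
 * RISC targets the compiler emits its own retry loop inside the operation,
 * nested inside our retry loop. */
while (!atomic_compare_exchange_strong(&handlers, &h->next, h))
	continue;

/* Weak CAS: may fail spuriously, but since we retry anyway the outer loop
 * absorbs those failures and no inner loop is generated. */
while (!atomic_compare_exchange_weak(&handlers, &h->next, h))
	continue;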