
libc: Rewrite quick_exit() and at_quick_exit() using C11 atomics.
Closed, Public

Authored by des on Sep 22 2023, 11:20 AM.
Details

Summary

Compiler memory barriers do not prevent the CPU from executing the code
out of order. Switch to C11 atomics. This also lets us get rid of the
mutex; instead, loop until the compare_exchange succeeds.

While here, change the return value of at_quick_exit() on failure to
the more traditional -1, matching atexit().

Sponsored by: Klara, Inc.
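
For reference, a minimal sketch of the shape this rewrite takes, using C11 <stdatomic.h>. The struct layout and the qe_* names are illustrative rather than the committed diff's; the point is the lock-free push with a release CAS loop in at_quick_exit() and the acquire load in quick_exit().

#include <stdatomic.h>
#include <stdlib.h>

struct qe_handler {
        struct qe_handler *next;
        void (*cleanup)(void);
};

static _Atomic(struct qe_handler *) qe_handlers;

int
at_quick_exit(void (*func)(void))
{
        struct qe_handler *h;

        if ((h = malloc(sizeof(*h))) == NULL)
                return (-1);    /* -1 on failure, matching atexit() */
        h->cleanup = func;
        h->next = atomic_load_explicit(&qe_handlers, memory_order_relaxed);
        /* Push onto the list head; retry until the CAS succeeds. */
        while (!atomic_compare_exchange_weak_explicit(&qe_handlers,
            &h->next, h, memory_order_release, memory_order_relaxed))
                ;
        return (0);
}

_Noreturn void
quick_exit(int status)
{
        /* The acquire load pairs with the release CAS above. */
        for (struct qe_handler *h = atomic_load_explicit(&qe_handlers,
            memory_order_acquire); h != NULL; h = h->next)
                h->cleanup();
        _Exit(status);
}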

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

This revision is now accepted and ready to land. Sep 22 2023, 11:45 AM
kib added inline comments.
lib/libc/stdlib/quick_exit.c
61

I do not see why you need fcmpset() since you re-assign h->next on each iteration. Either move the h->next = handlers; assignment out of the loop, or use cmpset().

74

Each dereference of h->next also needs to have acquire semantics, so that the read of *(h->next) is consistent with finding the element on the list.

lib/libc/stdlib/quick_exit.c
61

Because if I lose the race I need h->next to point to the winner.

74

Why? h->next is never updated after insertion.

lib/libc/stdlib/quick_exit.c
61

Oh I realize my mistake now, will update shortly.

fcmpset will update h->next
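
To make the distinction concrete: per atomic(9), a failed atomic_fcmpset_*() writes the value it observed back into its "old" argument, whereas atomic_cmpset_*() does not, so with cmpset the caller must refresh the expected value itself. A rough sketch of the two loop shapes, assuming the atomic(9) primitives are usable from userland via <machine/atomic.h>; the names and casts are illustrative, not the diff's exact code:

#include <stdint.h>
#include <machine/atomic.h>

struct handler { struct handler *next; void (*cleanup)(void); };
static struct handler *handlers;

static void
push_cmpset(struct handler *h)
{
        /* cmpset(): re-read the expected value on every attempt. */
        do {
                h->next = handlers;
        } while (atomic_cmpset_rel_ptr((volatile uintptr_t *)&handlers,
            (uintptr_t)h->next, (uintptr_t)h) == 0);
}

static void
push_fcmpset(struct handler *h)
{
        /*
         * fcmpset(): a failed attempt rewrites h->next with the current
         * head, so the assignment belongs outside the loop.
         */
        h->next = handlers;
        while (atomic_fcmpset_rel_ptr((volatile uintptr_t *)&handlers,
            (uintptr_t *)&h->next, (uintptr_t)h) == 0)
                ;
}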

This revision now requires review to proceed. Sep 22 2023, 12:34 PM
des marked an inline comment as done. Sep 22 2023, 12:35 PM
olce added inline comments.
lib/libc/stdlib/quick_exit.c
74

Why? h->next is never updated after insertion.

Indeed, I don't think acquire reads are necessary here.

This revision is now accepted and ready to land. Sep 22 2023, 12:38 PM
lib/libc/stdlib/quick_exit.c
46–47

Maybe clean up this comment while you're here, perhaps something like "Stack of cleanup handlers, invoked in LIFO order by quick_exit()." would be a bit more clear?

lib/libc/stdlib/quick_exit.c
46–47

Oh, that's why it had a trailing space; I hadn't even noticed that the sentence was incomplete! Maybe @theraven remembers how it was supposed to end?

merge and clarify comments

This revision now requires review to proceed. Sep 22 2023, 2:08 PM
des marked an inline comment as done. Sep 22 2023, 2:08 PM
des marked an inline comment as done.

Is there a reason to write new code that uses atomics but not C11 atomics? We’ve had support for them for several releases now, and it makes the code much harder for new developers to understand if it uses nonstandard features for things that are part of the standard.

lib/libc/stdlib/quick_exit.c
61

Why do you want a predict false here? I would expect it to either be ignored or to make things worse on any modern CPU. If you want to micro-optimise a cold loop like this, it would be better to put a macro that expands to pause on x86 (and local equivalents on other architectures) in the retry path.

What is supposed to happen when this is a multithreaded program and multiple threads call quick_exit at the same time? While it is silly for such a setup to exist, I had the misfortune of encountering it (albeit with full-blown exit(3)).

That is to say, while the original code is indeed buggy, I think the actual work to do here would be to make sure there are no repeat calls to exit handlers in the presence of multiple threads landing here.

One way out would be to atomic_swap the head to NULL.

Note this could still race against threads adding to the list as someone calls quick_exit, but that is unfixable.
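
A rough sketch of that suggestion in C11 terms, reusing the illustrative qe_handler/qe_handlers names from the sketch in the summary (this shows the idea being proposed, not the committed code): whoever wins the exchange owns the whole list, and later callers find NULL and simply exit.

#include <stdatomic.h>
#include <stdlib.h>

struct qe_handler { struct qe_handler *next; void (*cleanup)(void); };
static _Atomic(struct qe_handler *) qe_handlers;

_Noreturn void
quick_exit(int status)
{
        /* Detach the whole list atomically; only one caller gets it. */
        struct qe_handler *h = atomic_exchange_explicit(&qe_handlers,
            NULL, memory_order_acq_rel);

        for (; h != NULL; h = h->next)
                h->cleanup();
        _Exit(status);
}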

lib/libc/stdlib/quick_exit.c
61

pause in a tight loop like this is a known pessimizer (although there are archs where some arch-specific delay does help). Note this is not *waiting* for the value to reach a certain state (like in locking primitives), where retrying in a tight loop would indeed be bad.

One way out would be to atomic_swap the head to NULL.

That's also probably surprising. One thread will start running cleanup functions, the others will then exit the process before the first has finished. You probably want to acquire a mutex to block all other threads from entering quick_exit after one has. This might want to be shared with the atexit cleanup, because otherwise you can have strange situations where one thread calls exit and another calls quick_exit.
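
A sketch of that alternative, for illustration only: a plain pthread mutex that the first caller takes and never releases. run_quick_exit_handlers() is a hypothetical helper standing in for the handler walk; real libc code would use its internal locking primitives, possibly shared with exit().

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t exit_mutex = PTHREAD_MUTEX_INITIALIZER;

void run_quick_exit_handlers(void);     /* hypothetical helper */

_Noreturn void
quick_exit(int status)
{
        /*
         * The first caller wins and keeps the lock; every later caller
         * blocks here until the process exits, so the handlers run once.
         */
        pthread_mutex_lock(&exit_mutex);
        run_quick_exit_handlers();
        _Exit(status);
}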

lib/libc/stdlib/quick_exit.c
61

Note this is not *waiting* for the value to reach a certain state

Ah, yes, you're right. Please ignore that part of the suggestion. I don't believe predict false is useful though, because most architectures will assume the CAS succeeds in their internal hinting; adding an explicit hint will not help and will cause it to do the wrong thing in cases of contention (which should be very rare here, but it's still more source code for no benefit).

You probably want to acquire a mutex to block all other threads from entering quick_exit after one has. This might want to be shared with the atexit cleanup, because otherwise you can have strange situations where one thread calls exit and another calls quick_exit.

I want to say that's their problem, not ours...

des marked 3 inline comments as done. Sep 22 2023, 6:45 PM

From the C standard (C17):

The exit function causes normal program termination to occur. No functions registered by the
at_quick_exit function are called. If a program calls the exit function more than once, or calls the
quick_exit function in addition to the exit function, the behavior is undefined.

and conversely for quick_exit.

So I tend to agree with des. This doesn't preclude us from being "nicer", but I'm not sure there is much point in doing that if other prominent implementations (such as glibc) don't care (which I admittedly haven't checked).

In D41936#956161, @olce.freebsd_certner.fr wrote:

From the C standard (C17):

The exit function causes normal program termination to occur. No functions registered by the
at_quick_exit function are called. If a program calls the exit function more than once, or calls the
quick_exit function in addition to the exit function, the behavior is undefined.

and conversely for quick_exit.

So I tend to agree with des. This doesn't preclude us from being "nicer", but I'm not sure there is much point in doing that if other prominent implementations (such as glibc) don't care (which I admittedly haven't checked).

What are the real programs that depend on parallel quick_exit() to behave this way? If there are any, we can certainly adopt it.

lib/libc/stdlib/quick_exit.c
74

It does not matter whether it changes after insertion or not. The write to 'cleanup' needs to be visible if we see the element on the list. Stating it using the C11 terminology, there needs to be a happens-before relation between the write to 'cleanup' and all reads in the line below.

The fcmpset()s in at_quick_exit() only have release semantics, so I do not see a happens-before ordering between writes to the 'cleanup' members of the list, and consequently no happens-before with the read.

The author of the previous version of the code had a similar intuition, I suspect; this is why he inserted the compiler barriers there.

That said, note that on x86 acq and rel are free: they are translated into compiler barriers and plain loads/stores.

lib/libc/stdlib/quick_exit.c
74

Konstantin, I believe all writes to 'cleanup' are guaranteed to be seen by the thread that calls acquire, as noted in this excerpt I pulled from 'man atomic':

When a release operation by one thread synchronizes with an acquire
operation by another thread, usually meaning that the acquire operation
reads the value written by the release operation, then the effects of all
prior stores by the releasing thread must become visible to subsequent
loads by the acquiring thread.  Moreover, the effects of all stores (by
other threads) that were visible to the releasing thread must also become
visible to the acquiring thread.  These rules only apply to the
synchronizing threads.  Other threads might observe these stores in a
different order.
lib/libc/stdlib/quick_exit.c
74

Right, this is a somewhat informal description of the rel/acq effects. So imagine that thread 1 added a cleanup handler, then thread 2 added yet another handler. Note that there are no rel/acq pairs between threads 1 and 2.

Now thread 3 does quick_exit(), and its acq load synchronizes with thread 2's release store. What ensures that thread 3 sees the stores from thread 1?
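
To make the scenario concrete, here is a stand-alone C11 rendering of it (a sketch, not the code under review): thread 1 calls push(&a, 1) and completes, then thread 2 calls push(&b, 2), and finally thread 3 calls walk(). Thread 3's acquire load reads the head written by thread 2's release CAS; the question is what guarantees that thread 1's plain store to a.payload is visible when thread 3 follows b.next.

#include <stdatomic.h>
#include <stddef.h>

struct node { struct node *next; int payload; };

static struct node a, b;        /* nodes inserted by threads 1 and 2 */
static _Atomic(struct node *) head;

/* Insert as at_quick_exit() does: plain stores to the node, release CAS. */
static void
push(struct node *n, int v)
{
        n->payload = v;         /* plain (non-atomic) store */
        n->next = atomic_load_explicit(&head, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(&head, &n->next, n,
            memory_order_release, memory_order_relaxed))
                ;
}

/* Walk as quick_exit() does. */
static int
walk(void)
{
        int sum = 0;

        /* Acquire load of the head written by thread 2's release CAS. */
        for (struct node *n = atomic_load_explicit(&head,
            memory_order_acquire); n != NULL; n = n->next)
                sum += n->payload;      /* must thread 1's store be visible? */
        return (sum);
}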

lib/libc/stdlib/quick_exit.c
74

Nice example! I did a little digging, and it appears to me that 'lock cmpxchg' on x86 is intrinsically a full barrier, and on arm64 it appears we always use 'ldaxr ... stlxr', which is an acquire+release sequence (I guess to emulate x86 semantics?). This really caught me off guard, but I may have missed something, so be sure to double-check my math.

Unfortunately, the default semantics of FreeBSD atomic operations are "relaxed", whereas in C11 they are "seq_cst", and FreeBSD has no "seq_cst" variant of atomic_fcmpset(). So without relying on just knowing that atomic_fcmpset() on FreeBSD is a full barrier (which I do not know for certain), it does seem that we need to strengthen the fcmpset loop, perhaps by using acquire semantics on the first load of handlers into h->next.

OTOH, moving fully to C11 atomics and using the default cmpset semantics (i.e., "_seq_cst") would be sufficient in my mind to resolve the correctness issue posed by your scenario. Thoughts?

lib/libc/stdlib/quick_exit.c
74

In that scenario, if fcmpset_rel in thread 2 is to be trusted vs load_acq in thread 3, the update by thread 1 is automagically handled because its fcmpset_rel succeeded *prior* to thread 2 doing anything. (Alternatively, if they were racing each other, thread 1 won said race and thread 2 had to retry.)

In fact the load acq used here is too strong -- a consume fence would be sufficient, but there is no FreeBSD-specific intrinsic to do it.

lib/libc/stdlib/quick_exit.c
74

OTOH, moving fully to C11 atomics and using the default cmpset semantics (i.e., "_seq_cst") would be sufficient in my mind to resolve the correctness issue posed by your scenario. Thoughts?

See my previous comment. We should not add more code using non-standard C for things that are supported in standard C unless there is a *really* good reason. I have not seen any justification for not using C11 atomics here.

lib/libc/stdlib/quick_exit.c
74

In that scenario, if fcmpset_rel in thread 2 is to be trusted vs load_acq in thread 3, the update by thread 1 is automagically handled because its fcmpset_rel succeeded *prior* to thread 2 doing anything. (Alternatively, if they were racing each other, thread 1 won said race and thread 2 had to retry.)

You are claiming that fcmpset_rel() has seq_cst semantics, which is not guaranteed.

In fact the load acq used here is too strong -- a consume fence would be sufficient, but there is no FreeBSD-specific intrinsic to do it.

Consume is not implemented anywhere; it is aliased to acquire by both clang and gcc. I have also seen papers claiming that the current definition is not implementable.

74

Nice example! I did a little digging, and it appears to me that 'lock cmpxchg' on x86 is intrinsically a full barrier, and on arm64 it appears we always use 'ldaxr ... stlxr', which is an acquire+release sequence (I guess to emulate x86 semantics?). This really caught me off guard, but I may have missed something, so be sure to double-check my math.

Unfortunately, the default semantics of FreeBSD atomic operations are "relaxed", whereas in C11 they are "seq_cst", and FreeBSD has no "seq_cst" variant of atomic_fcmpset(). So without relying on just knowing that atomic_fcmpset() on FreeBSD is a full barrier (which I do not know for certain), it does seem that we need to strengthen the fcmpset loop, perhaps by using acquire semantics on the first load of handlers into h->next.

Just in case, note that acq+rel != seq_cst.

lib/libc/stdlib/quick_exit.c
74

You are claiming that fcmpset_rel() has seq_cst semantics, which is not guaranteed.

I'm not.

I am claiming:

  1. fcmpset_rel paired with load_acq already works for just 2 threads doing anything
  2. if thread 1 *succeeded* in fcmpset_rel and then thread 2 succeeded in fcmpset_rel, and thread 3 load_acq'd the value as set by thread 2, it is necessarily also synchronized against stores from thread 1
  3. load consume would suffice. I mention it because acq pessimizes things on arm64. For example, if you take a look at the primitives in Concurrency Kit, they hand-roll a mere dereference enclosed with compiler barriers. I recall Linux was doing something to that extent for RCU as well.
lib/libc/stdlib/quick_exit.c
74

I was unable to find any formal rule why #2 from your list is necessarily true.

There is indeed a total order of atomic ops on each location, and this order must not contradict happens-before. But there is no requirement (in C11) that the local order of atomics for a specific location implies a global happens-before ordering on those ops, even when the participating releases are RMW ops.

There are memory models like HW ARMv8 where releases are globally totally ordered, effectively including each local atomic total order in happens-before.

In D41936#956247, @kib wrote:

What are the real programs that depend on parallel quick_exit() to behave this way? If there are any, we can certainly adopt it.

Two quick thoughts on this:

  1. This "extension" of specified behavior is independent of what is being fixed here, i.e., the lack of proper synchronization between the producer (at_quick_exit()) and the consumer (quick_exit()) in use cases that do not trigger undefined behavior according to the C standard. So I don't think it should prevent @des from committing the current change.
  2. Adopting the behavior of running the relevant exit handlers only once would imply we are comfortable with committing to this behavior for a long time, in other words that we consider it a sensible behavior, which IMO warrants a discussion. I already have some ideas and thoughts to share on this topic, but I don't want to start "polluting" the matter at hand with it. I'd rather have this discussion in another venue, such as another review, and preferably after we have been shown a real program triggering undefined behavior under the C standard in this area. In any case (and again), let's ensure we are all confident that the currently visible changes are correct and have @des commit them before possibly delving into that part.

I've answered the part concerning the synchronization issue inline.

lib/libc/stdlib/quick_exit.c
74

It does not matter whether it changes after insertion or not. The write to 'cleanup' needs to be visible if we see the element on the list. Stating it using the C11 terminology, there needs to be a happens-before relation between the write to 'cleanup' and all reads in the line below.
I was unable to find any formal rule why #2 from your list is necessarily true.

I think you're mistaken (if you're not, then I am).

The write to cleanup necessarily happens before the read in quick_exit() if the element is on the list as seen by quick_exit(). Adopting the C standard parlance, this write is sequenced before the atomic_fcmpset_rel_ptr() call which is a release operation (if using the corresponding C atomic operation, and if the object has a C atomic type, indeed; this should be changed accordingly if switching to C atomics as asked by @theraven) which synchronizes with atomic_load_acq_ptr() (an acquire operation; again, this is true only if using the corresponding C atomic operation while reasoning in the C standard frame), which is sequenced before the read of cleanup. And, as you know, "sequenced before" is a sub-relation of "happens before", and the latter's definition implies it is transitive.

Maybe your objection comes from the fact that the reasoning above proves that the write to cleanup happens before the read in quick_exit() but is actually not enough to prove the former is *visible* to the latter (this is the understandably missing part in @mjg's explanation). Indeed, of all the side-effects to cleanup that happen before the read, the one before the atomic_fcmpset_rel_ptr() that is then read by the acquire operation must be the last one. But then it is immediate that this is true because each quick_exit_handler structure is allocated anew, so it is impossible for two threads to share the same address for handler (as long as malloc() works...). Hence the *only* side-effect that can possibly affect the handler field of the quick_exit_handler structure actually read in quick_exit() is the one at the start of at_quick_exit() in the same call. Since, as we saw in the previous paragraph, it also happens before the read in quick_exit(), it's by definition (deferring to the C standard) the one that is visible at that point. QED.

lib/libc/stdlib/quick_exit.c
74

Hence the *only* side-effect that can possibly affect the handler field of the quick_exit_handler structure...

Typo here, handler should be replaced by cleanup.

des marked 12 inline comments as done. Sep 26 2023, 5:46 PM
des retitled this revision from "libc: Add memory barriers to quick_exit() and at_quick_exit()." to "libc: Rewrite quick_exit() and at_quick_exit() using C11 atomics.".
des edited the summary of this revision. (Show Details)
lib/libc/stdlib/quick_exit.c
58

So you changed to sequentially consistent access there, which is stronger than needed, but fine.

73

The load must have acquire semantics to synchronize with the CAS in the last at_quick_exit().

des marked an inline comment as done. Sep 26 2023, 6:29 PM
des marked an inline comment as done.
This revision is now accepted and ready to land. Sep 26 2023, 6:46 PM
lib/libc/stdlib/quick_exit.c
58–60

This probably wants to be a weak CAS, since we're not doing anything in the loop. On RISC architectures a weak CAS expands to an LL/SC pair, a strong CAS to an LL/SC pair in a loop. We then put another loop around that, which doesn't do anything differently.
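
For reference, the two forms side by side, assuming the illustrative qe_handlers/h names from the sketch in the summary. Since the surrounding loop already retries and the expected value (h->next) is refreshed on failure, spurious failures of the weak form are harmless, and on LL/SC architectures it avoids nesting a retry loop inside the CAS itself.

/* Assumes the qe_handlers and h declarations from the earlier sketch. */

/* Strong: each call retries internally until the LL/SC pair is not
 * spuriously interrupted, and then we loop around it again anyway. */
while (!atomic_compare_exchange_strong_explicit(&qe_handlers,
    &h->next, h, memory_order_release, memory_order_relaxed))
        ;

/* Weak: one LL/SC attempt per iteration of the loop we already have. */
while (!atomic_compare_exchange_weak_explicit(&qe_handlers,
    &h->next, h, memory_order_release, memory_order_relaxed))
        ;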