Compiler memory barriers only constrain the compiler; they do not
prevent the CPU from reordering memory accesses at run time. Switch to
C11 atomics, which order both. This also lets us get rid of the mutex;
instead, retry the compare_exchange in a loop until it succeeds.

While here, change the return value of at_quick_exit() on failure to
the more traditional -1, matching atexit().
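
For illustration only, a minimal sketch of the lock-free registration
described above. It is not the committed code: the names (struct
handler, register_handler, run_handlers, first, second) are
hypothetical, and it assumes the handlers live on a push-only singly
linked list whose head is a C11 atomic pointer.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; a sketch, not the libc source. */
struct handler {
	struct handler *next;
	void (*cleanup)(void);
};

/* Head of a singly linked, push-only list of handlers. */
static _Atomic(struct handler *) handlers;

/* Register fn; returns 0 on success, -1 if allocation fails. */
static int
register_handler(void (*fn)(void))
{
	struct handler *h;

	if ((h = malloc(sizeof(*h))) == NULL)
		return (-1);
	h->cleanup = fn;
	/* Snapshot the current head as our successor. */
	h->next = atomic_load(&handlers);
	/*
	 * Publish h as the new head.  On failure the compare-exchange
	 * stores the head it actually observed into h->next, so the
	 * loop simply retries with the refreshed value.
	 */
	while (!atomic_compare_exchange_weak(&handlers, &h->next, h))
		continue;
	return (0);
}

/* Run handlers newest first; returns how many were called. */
static int
run_handlers(void)
{
	struct handler *h;
	int n = 0;

	for (h = atomic_load(&handlers); h != NULL; h = h->next) {
		h->cleanup();
		n++;
	}
	return (n);
}

/* Example handlers that record the order in which they run. */
static int order[2], ncalls;
static void first(void) { order[ncalls++] = 1; }
static void second(void) { order[ncalls++] = 2; }
```

Because every push goes through the atomic head, no mutex is needed:
a thread that loses the race just retries against the head it was
handed back. The default sequentially consistent ordering is used here
for simplicity; whether weaker orderings would suffice is not
addressed by this sketch. Nodes are never freed, on the assumption
that the process exits right after the handlers run.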
Sponsored by: Klara, Inc.