These use the cas*, ld* and swp instructions from the Large System
Extensions (LSE) added in ARMv8.1. Testing shows them to be
significantly faster than the LL/SC-based implementations.
These were mostly written by Ali Saidi. I wrote atomic_testand*() and
simplified atomic_(set|clear|add|subtract)_* slightly.
No functional change: the wrappers still unconditionally select the
_llsc variants.
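
For illustration only, the wrapper arrangement can be sketched in portable
C11 roughly as follows. This is not the code in this change: every
sketch_* name is hypothetical, the CAS retry loop merely stands in for an
exclusive load/store (LL/SC) sequence, and the single fetch-add stands in
for an LSE ldadd. The point is that both variants exist side by side
while the wrapper keeps calling the _llsc one.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for an LL/SC-style variant: read the old value,
 * then retry a compare-and-swap until no other thread intervened.
 */
static inline uint32_t
sketch_fetchadd_32_llsc(_Atomic uint32_t *p, uint32_t val)
{
	uint32_t old = atomic_load_explicit(p, memory_order_relaxed);

	/* On failure, 'old' is refreshed with the current value. */
	while (!atomic_compare_exchange_weak_explicit(p, &old, old + val,
	    memory_order_relaxed, memory_order_relaxed))
		;
	return (old);
}

/*
 * Hypothetical stand-in for an LSE-style variant: one atomic
 * read-modify-write operation with no retry loop.
 */
static inline uint32_t
sketch_fetchadd_32_lse(_Atomic uint32_t *p, uint32_t val)
{
	return (atomic_fetch_add_explicit(p, val, memory_order_relaxed));
}

/*
 * Hypothetical wrapper: unconditionally selects the _llsc variant, so
 * adding the _lse variant changes no behavior for callers.
 */
static inline uint32_t
sketch_fetchadd_32(_Atomic uint32_t *p, uint32_t val)
{
	return (sketch_fetchadd_32_llsc(p, val));
}
```

Both variants return the previous value and leave the same result in
memory, so a wrapper like this could later be switched to the _lse
variant without changing callers.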
Submitted by: Ali Saidi <alisaidi@amazon.com> (original version)