This adds 8 and 16 bit versions of the cmpset and fcmpset functions. Macros are used to generate all the flavors from the same set of instructions; the macro expansion handles the couple minor differences between each size variation (generating ldrexb/ldrexh/ldrex for 8/16/32, etc).
In addition to handling new sizes, the instruction sequences used for cmpset and fcmpset are rewritten to be a bit shorter/faster, and the new sequence will not return false when *dst==*old but the store-exclusive fails because of concurrent writers. Instead, it just loops like ldrex/strex sequences normally do until it gets a non-conflicted store. The manpage allows LL/SC architectures to bogusly return false, but there's no reason to actually do so, at least on arm.