I have no idea if this works in practice, but the generated assembly looks OK at a glance and it seems to work in a single-threaded application.
Each operation does ll (containing word) -> drop to C for the logic -> sc (containing word).
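To make the shape of that sequence concrete, here is a rough sketch in portable C11. It is not the code from this change: `subword_set_8` is a hypothetical name, the relaxed load stands in for the ll, and `atomic_compare_exchange_weak_explicit` stands in for the sc (it fails and reloads if the containing word changed in between). Lane 0 is assumed to be the least-significant byte; mapping a byte address to a lane depends on endianness and is left out.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): atomically OR `bits` into one
 * byte lane of a 32-bit word. Lane 0 is the least-significant byte.
 */
static inline void
subword_set_8(_Atomic uint32_t *word, unsigned lane, uint8_t bits)
{
	unsigned shift = (lane & 3u) * 8u;
	uint32_t old, new;

	/* Stand-in for "ll": read the whole containing word. */
	old = atomic_load_explicit(word, memory_order_relaxed);
	do {
		/* Plain C logic on the loaded copy; other lanes are untouched. */
		new = old | ((uint32_t)bits << shift);
		/* Stand-in for "sc": on failure `old` is reloaded and we retry. */
	} while (!atomic_compare_exchange_weak_explicit(word, &old, new,
	    memory_order_seq_cst, memory_order_relaxed));
}
```

The retry loop is the part that matters: a neighbouring byte changing between the load and the store makes the store fail, so the logic is simply redone on the fresh word.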
I'm incredibly unsure of the {,f}cmpset logic as far as ll/sc semantics go. When the comparison fails, I assume the sc of the original value should succeed; if it doesn't, it'll spin, and the value should still be wrong on the next ll. I think we have to do this, as opposed to the straightforward compare-and-bail-out of the word/dword versions, since it's not necessarily clear what else is resident in the whole word we're modifying.
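A minimal sketch of the mismatch handling described above, again in portable C11 rather than the actual ll/sc assembly. `subword_fcmpset_8` is a hypothetical name, the compare-exchange stands in for the sc, and lane 0 is assumed to be the least-significant byte. Unlike a real fcmpset, which may be allowed to fail spuriously, this sketch loops until the store lands.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): compare-and-set one byte lane
 * of a 32-bit word. On a mismatch, the unmodified word is stored back and
 * the observed byte is written to *expected.
 */
static inline bool
subword_fcmpset_8(_Atomic uint32_t *word, unsigned lane, uint8_t *expected,
    uint8_t desired)
{
	unsigned shift = (lane & 3u) * 8u;
	uint32_t mask = (uint32_t)0xffu << shift;
	uint32_t old, new;
	bool match;

	/* Stand-in for "ll": read the whole containing word. */
	old = atomic_load_explicit(word, memory_order_relaxed);
	for (;;) {
		match = (uint8_t)((old & mask) >> shift) == *expected;
		/* Match: splice in the new byte. Mismatch: keep the word as is. */
		new = match ? ((old & ~mask) | ((uint32_t)desired << shift)) : old;
		/* Stand-in for "sc": on failure `old` is reloaded and we spin. */
		if (atomic_compare_exchange_weak_explicit(word, &old, new,
		    memory_order_seq_cst, memory_order_relaxed))
			break;
	}
	if (!match)
		*expected = (uint8_t)((old & mask) >> shift);
	return match;
}
```

Storing the original word back on a mismatch means the failure result is only reported once the observed word has been confirmed stable, which is the behaviour I'm assuming the sc gives us.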