Use atomic loads and stores to ensure that the compiler doesn't optimize away these loops. Change the boolean to an int to match what the atomic API supplies. Remove wmb(), since the atomic_store_rel() on status.done ensures that the prior writes to status are ordered before done becomes visible.
Coverity caught this, and kib@ suggested these changes. They supposedly also use fewer cycles.
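
A minimal userland sketch of the pattern, for illustration only. It assumes C11 <stdatomic.h> as a stand-in for the kernel's atomic(9) acquire/release operations; struct poll_status, poll_complete() and poll_wait() are hypothetical names, not the driver's actual code.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the driver's completion status. */
struct poll_status {
	int		result;	/* payload written before done is set */
	atomic_int	done;	/* int rather than bool, to suit the atomic API */
};

/*
 * Completion side: write the payload, then set done with release
 * semantics.  The release store orders the earlier write to result
 * before done becomes visible, so no separate write barrier is needed.
 */
static void
poll_complete(struct poll_status *st, int result)
{
	st->result = result;
	atomic_store_explicit(&st->done, 1, memory_order_release);
}

/*
 * Polling side: the atomic load must be performed on every iteration,
 * so the compiler cannot hoist it out of the loop.  The acquire load
 * pairs with the release store above.
 */
static int
poll_wait(struct poll_status *st)
{
	while (atomic_load_explicit(&st->done, memory_order_acquire) == 0)
		;	/* spin; real code would pause or yield here */
	return (st->result);
}
```

Once the polling side observes done != 0, the acquire/release pairing means it also observes the payload written before the store.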