As suggested by the comment in buf_ring_dequeue_sc, the use of an
acquire load of br_cons_head is completely unnecessary. This code path
is for the single consumer case, and so an acquire load of a field
modified by the consumer merely synchronises with itself and is thus a
waste of time, not to mention that if it were necessary it should be
done everywhere, not only on Arm. The important synchronisation here is
with pr_prod_tail, being under the producer's control, in order to
synchronise with the corresponding store release in order to guarantee
visibility of the store into that ring index. This is pretty standard
lock-free code and does not warrant a 25 line comment, certainly not one
that vastly exceeds the line length limit, and the acquire load should
be sufficiently self-documenting just like the others littered
throughout the tree.
Relatedly, buf_ring_peek_clear_sc gained a fence in r346593 but only on
Arm, which is incorrect; some form of acquire synchronisation is needed
on all architectures for exactly the same reason as above. Rather than
leave the fence in, copy the (fixed) code from buf_ring_peek_clear_sc
instead so they don't needlessly diverge in implementation details. This
also has the benefit of making it clearer that we are synchronising with
a release store of pr_prod_tail.
Finally, buf_ring_peek, which really should be called buf_ring_peek_sc,
has exactly the same bug as buf_ring_peek_clear_sc pre-r346593, so copy
the atomic use there too.
Whilst here, move the lock check in buf_ring_dequeue_sc to before we
attempt to do anything that would be racy if it failed, and slightly
refactor buf_ring_peek_clear_sc to avoid an unnecessary bit of code
duplication and ifdef usage.