Add memory barriers to buf_ring
AbandonedPublic
Actions

Authored by jpa-semihalf.com on Feb 13 2015, 12:21 PM.

Details

Reviewers

kmacy
zbb
rpaulo
imp

Summary

This change is mostly the fix for ARMv7 architecture. Make sure
that the correct value is observable on the dequeue side.

Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

zbb updated this revision to Diff 3756.Feb 13 2015, 12:21 PM

zbb retitled this revision from to Add memory barriers to buf_ring.

zbb updated this object.

zbb edited the test plan for this revision. (Show Details)

zbb added reviewers: kmacy, rpaulo, imp.

zbb added a subscriber: Unknown Object (MLST).

Damn, we don't have an atomic way to assign a pointer value? (ie, expressing it via an atomic acq/rel semantic) ?

Hmm, I think these memory barriers were in the tree but were removed recently. I think it would be nice to use the atomic load acq/rel because this change could negatively impact x86.

This revision now requires changes to proceed.Feb 13 2015, 10:01 PM

an atomic_store_rel would put the dmb() before writing the producer ring head, the opposite of what this change proposes (which reinforces my vague feeling that this change isn't quite right somehow). On the consumer side, it's not clear what piece of data would be the subject of an atomic_load_acq(), because I can't really tell from the current placement of the rmb() what reordering it's trying to prevent.

andrew added a subscriber: andrew.Feb 14 2015, 11:03 AM

andrew added inline comments.

sys/sys/buf_ring.h
107 ↗	(On Diff #3756)	atomic_store_rel_int includes a dmb before writing to &br->br_prod_tail. I don't think this code will be hitting ARM errata 761319 [1] as the read is in a loop where we exit when we see the new value at least once meaning it has been incremented to the value we expect. . [1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf
167 ↗	(On Diff #3756)	It feels like one or both of these should be read with atomic_load_acq_int.

I'm mostly echoing Ian Lepore's comments here:

The wmb() is equivalent to an atomic_store_acq_32 - which is not part of the API. It would seem we'd rather have a atomic_load_acq_32 where the value is actually later to be read.

I don't know what the rmb() is protecting against.

This penalizes architectures that don't need it so at the very least a local macro needs to be defined that is a no-op on all other architectures.

Are you actually seeing problems that is fixed by this change?

From W. Macek:

It was observed on ARM Cortex-A15, that sometimes, the buf_ring_dequeue_sc finction returned the old pointer, which left there from the previous buffer ring pass. With this fix, the issue disappeared.

Inside buf_ring_dequeue_sc, after the cons_head is read, the CPU memory access predictors are capable of preloading the values from the memory for the future use - like a br->br_ring[cons_head], for example. However, this value might not be valid. There is a small "hole" in the logic, between line 165 and 166. If the enqueue function puts the new entry to the ring just after the core prefetches the buf = br->br_ring[cons_head] and before the br->br_prod_tail is read, it will return the wrong, old, value. To prevent this, full memmory barrier must be added before reading br->br_ring[cons_head]. However, that change would negatively impact x86, so it's better to use atomic_load_acq_32 on br->br_prod_tail variable: on Intel it will not change anything (except adding compiler barrier), but on ARM it puts a full dmb() after the br->br_prod_tail is read.

As Andrew suggested, the atomic operation was also added to br->br_cons_head variable.

That looks much better. Thanks. This change has my blessing.

Hmm, IMHO, this lockless buf ring stuff should be reconsidered much more because of presented issue. It's lockless so there is no way how to get stable snapshot of buf ring variables. A race is always present. Each read value should be considered stale at any moment. Thus what is the real issue here? Is it the order of reading, i.e. prod_tail is pre-read before cons_head? It must be that because in other cases, IMHO, nothing can help. But if it's reordering issue, what else variables are involved? These two reads are very close, but nobody ensures that it could not happen in bigger distance...

At least, the same scenario is in buf_ring_advance_sc() and alike in buf_ring_dequeue_mc() and buf_ring_peek(). IMHO, if there is some presumption that some variables are changed in defined order and they should be read in the same order, then we must re-checked the code for all of them.

Approved.

In D1833#18, @onwahe-gmail-com wrote:

Hmm, IMHO, this lockless buf ring stuff should be reconsidered much more because of presented issue. It's lockless so there is no way how to get stable snapshot of buf ring variables. A race is always present. Each read value should be considered stale at any moment. Thus what is the real issue here? Is it the order of reading, i.e. prod_tail is pre-read before cons_head? It must be that because in other cases, IMHO, nothing can help. But if it's reordering issue, what else variables are involved? These two reads are very close, but nobody ensures that it could not happen in bigger distance...

At least, the same scenario is in buf_ring_advance_sc() and alike in buf_ring_dequeue_mc() and buf_ring_peek(). IMHO, if there is some presumption that some variables are changed in defined order and they should be read in the same order, then we must re-checked the code for all of them.

Getting the memory ordering correct is difficult, yes. But the semantics are not ambiguous. Consumers are (at least for the most part) using it correctly in the network stack.

rpaulo accepted this revision.Feb 20 2015, 10:47 PM

rpaulo edited edge metadata.

This revision is now accepted and ready to land.Feb 20 2015, 10:47 PM

This change looks good on its surface, but I haven't looked deeply to see if it has any bad connotations...

Even I’m not able to evaluate all aspect, I see some serious defects here - at least from ARM point of view.
The buf_ring_enqueue() guarantees proper write ordering (and visibility):

Store with acquire to br->br_prod_head
Normal store to br_ring[]
Store with release to br_prod_tail.

Unfortunately buf_ring_dequeue_sc() have not defined any read ordering and code can see
updated br_prod_tail and stale br_ring[]. (imho, this is true for amd64/i386 too).
Unlike of Semihalf guys, I see little different solution for race in buf_ring_dequeue_sc() read logic.
The br_ring[] must be read after br_prod_tail, but read order of br_cons_head and br_prod_tail is not important.
So (line 183)

buf = atomic_load_rel_32(&br->br_ring[cons_head]);

looks more correct to me.

On ARM, all stores to variable referenced by atomic_cmpset() must be done by atomic_store (or by other
atomic_* functions), normal store doesn’t clear exclusive monitor.
Thus, for ARM we MUST use atomic_store() for each store to variable referenced by atomic_cmpset() !

Ohh, and note – current atomic_store() implementation on ARM is broken too, but fix is easy (see atomic store_*_64().

In D1833#26, @meloun-miracle-cz wrote:

...
On ARM, all stores to variable referenced by atomic_cmpset() must be done by atomic_store (or by other
atomic_* functions), normal store doesn’t clear exclusive monitor.
Thus, for ARM we MUST use atomic_store() for each store to variable referenced by atomic_cmpset() !

Ohh, and note – current atomic_store() implementation on ARM is broken too, but fix is easy (see atomic store_*_64().

I'm commenting on just this aspect of Michal's comment, not the overall change. I'm not sure I agree that a normal store fails to clear the exclusive monitor, but even more importantly in this context... the discussion of that and any possible fixes for it shouldn't hold up this change right now.

I'm specifically not commenting on (or formally reviewing) the overall behavior of the ring buffer code. I'm not qualified to do that without a lot more studying of the code and just don't have time right now.

In D1833#26, @meloun-miracle-cz wrote:
Even I’m not able to evaluate all aspect, I see some serious defects here - at least from ARM point of view.
The buf_ring_enqueue() guarantees proper write ordering (and visibility):

Store with acquire to br->br_prod_head

Normal store to br_ring[]

Store with release to br_prod_tail.

Unfortunately buf_ring_dequeue_sc() have not defined any read ordering and code can see
updated br_prod_tail and stale br_ring[]. (imho, this is true for amd64/i386 too).
Unlike of Semihalf guys, I see little different solution for race in buf_ring_dequeue_sc() read logic.
The br_ring[] must be read after br_prod_tail, but read order of br_cons_head and br_prod_tail is not important.
So (line 183)
buf = atomic_load_rel_32(&br->br_ring[cons_head]);
looks more correct to me.

The update to prod_tail happens _after_ the store to br_ring and is done with an atomic_store_rel_int subsequently in dequeue_sc the read of prod_tail will be done with an atomic_load_acq_int before loading br_prod. Since there is a memory barrier after the update in enqueue and one before the read where is the problem?

On ARM, all stores to variable referenced by atomic_cmpset() must be done by atomic_store (or by other
atomic_* functions), normal store doesn’t clear exclusive monitor.
Thus, for ARM we MUST use atomic_store() for each store to variable referenced by atomic_cmpset() !

Great, but the only variable that is updated with atomic_cmpset is prod_head to atomically acquire the right to store in to that index in br_ring.

Please clarify what the actual problem is. This change fixes the only real issue that I can glean from your comment.

First of all, I was wrong in the behavior of atomic operations on ARM. After yesterday’s long session on IRC with Ian and Andy, we concluded that pure (non atomic) store affects(clears) state of exclusive monitor. So our implementation of atomics is fine on ARM. I apologize for my mistake, and, please, ignore everything about atomic_cmpset() in my previous message.

Back to your first question. I thought it only as a technical note (because misordered read occurs on br_ring[]). So, i'm fine with proposed patch.
And again, sorry for noise. I certainly did not intend to be “destructive”.

Reading this again, cons_head is used only by dequeue and is thus protected by the lock or an atomic update. Hence the atomic_load_acq_32 isn't needed for it. On the other hand, there are other places where prod_tail needs to be read with a memory barrier.

I think there are a few other potential races in the code since the 'tail' values are not mutex protected. Please see
https://reviews.freebsd.org/D1945

I've been spending more time looking at this than I care to admit. I don't understand why we need the load_acq_32 on prod_tail. The atomic_store_rel_int in enqueue should guarantee that the store to the ring happens before the update to prod_tail. So the consumer should _never_ see a prod_tail that is newer than the ring[prod_next] value. I suspect that something else is going on that we don't understand.

kmacy was so kind to offer posting the final solution to this ticket so I'm changing status to abandoned.
Thank you all for your review and discussion.

jpa-semihalf.com commandeered this revision.May 11 2015, 5:04 PM

jpa-semihalf.com added a reviewer: zbb.

kib added a subscriber: kib.Jul 30 2015, 1:41 PM

kib added inline comments.

sys/sys/buf_ring.h
165–166	Isn't the same situation occur in other places, at least in the buf_ring_dequeue_mc ? You probably could replace two acqs with recently added atomic_thread_fence_acq(). rmb() in buf_ring_enqueue() arguably should be also spelled as thread_fence_acq().