Page MenuHomeFreeBSD

Add memory barriers to buf_ring
AbandonedPublic

Authored by jpa-semihalf.com on Feb 13 2015, 12:21 PM.
Tags
None
Referenced Files
F104107859: D1833.diff
Tue, Dec 3, 3:48 PM
Unknown Object (File)
Tue, Nov 26, 11:08 AM
Unknown Object (File)
Thu, Nov 21, 4:37 AM
Unknown Object (File)
Wed, Nov 20, 1:43 PM
Unknown Object (File)
Tue, Nov 19, 9:20 AM
Unknown Object (File)
Mon, Nov 18, 3:37 PM
Unknown Object (File)
Sun, Nov 10, 1:17 AM
Unknown Object (File)
Thu, Nov 7, 6:01 PM

Details

Summary

This change is mostly the fix for ARMv7 architecture. Make sure
that the correct value is observable on the dequeue side.

Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

zbb retitled this revision from to Add memory barriers to buf_ring.
zbb updated this object.
zbb edited the test plan for this revision. (Show Details)
zbb added reviewers: kmacy, rpaulo, imp.
zbb added a subscriber: Unknown Object (MLST).

Damn, we don't have an atomic way to assign a pointer value? (ie, expressing it via an atomic acq/rel semantic) ?

rpaulo requested changes to this revision.Feb 13 2015, 10:01 PM
rpaulo edited edge metadata.

Hmm, I think these memory barriers were in the tree but were removed recently. I think it would be nice to use the atomic load acq/rel because this change could negatively impact x86.

This revision now requires changes to proceed.Feb 13 2015, 10:01 PM

an atomic_store_rel would put the dmb() before writing the producer ring head, the opposite of what this change proposes (which reinforces my vague feeling that this change isn't quite right somehow). On the consumer side, it's not clear what piece of data would be the subject of an atomic_load_acq(), because I can't really tell from the current placement of the rmb() what reordering it's trying to prevent.

andrew added inline comments.
sys/sys/buf_ring.h
107 ↗(On Diff #3756)

atomic_store_rel_int includes a dmb before writing to &br->br_prod_tail.

I don't think this code will be hitting ARM errata 761319 [1] as the read is in a loop where we exit when we see the new value at least once meaning it has been incremented to the value we expect.

. [1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf

167 ↗(On Diff #3756)

It feels like one or both of these should be read with atomic_load_acq_int.

I'm mostly echoing Ian Lepore's comments here:

  • The wmb() is equivalent to an atomic_store_acq_32 - which is not part of the API. It would seem we'd rather have a atomic_load_acq_32 where the value is actually later to be read.
  • I don't know what the rmb() is protecting against.
  • This penalizes architectures that don't need it so at the very least a local macro needs to be defined that is a no-op on all other architectures.

Are you actually seeing problems that is fixed by this change?

zbb updated this object.
zbb edited edge metadata.

From W. Macek:

It was observed on ARM Cortex-A15, that sometimes, the buf_ring_dequeue_sc finction returned the old pointer, which left there from the previous buffer ring pass. With this fix, the issue disappeared.

Inside buf_ring_dequeue_sc, after the cons_head is read, the CPU memory access predictors are capable of preloading the values from the memory for the future use - like a br->br_ring[cons_head], for example. However, this value might not be valid. There is a small "hole" in the logic, between line 165 and 166. If the enqueue function puts the new entry to the ring just after the core prefetches the buf = br->br_ring[cons_head] and before the br->br_prod_tail is read, it will return the wrong, old, value. To prevent this, full memmory barrier must be added before reading br->br_ring[cons_head]. However, that change would negatively impact x86, so it's better to use atomic_load_acq_32 on br->br_prod_tail variable: on Intel it will not change anything (except adding compiler barrier), but on ARM it puts a full dmb() after the br->br_prod_tail is read.

As Andrew suggested, the atomic operation was also added to br->br_cons_head variable.

That looks much better. Thanks. This change has my blessing.

Hmm, IMHO, this lockless buf ring stuff should be reconsidered much more because of presented issue. It's lockless so there is no way how to get stable snapshot of buf ring variables. A race is always present. Each read value should be considered stale at any moment. Thus what is the real issue here? Is it the order of reading, i.e. prod_tail is pre-read before cons_head? It must be that because in other cases, IMHO, nothing can help. But if it's reordering issue, what else variables are involved? These two reads are very close, but nobody ensures that it could not happen in bigger distance...

At least, the same scenario is in buf_ring_advance_sc() and alike in buf_ring_dequeue_mc() and buf_ring_peek(). IMHO, if there is some presumption that some variables are changed in defined order and they should be read in the same order, then we must re-checked the code for all of them.

kmacy edited edge metadata.

Approved.

In D1833#18, @onwahe-gmail-com wrote:

Hmm, IMHO, this lockless buf ring stuff should be reconsidered much more because of presented issue. It's lockless so there is no way how to get stable snapshot of buf ring variables. A race is always present. Each read value should be considered stale at any moment. Thus what is the real issue here? Is it the order of reading, i.e. prod_tail is pre-read before cons_head? It must be that because in other cases, IMHO, nothing can help. But if it's reordering issue, what else variables are involved? These two reads are very close, but nobody ensures that it could not happen in bigger distance...

At least, the same scenario is in buf_ring_advance_sc() and alike in buf_ring_dequeue_mc() and buf_ring_peek(). IMHO, if there is some presumption that some variables are changed in defined order and they should be read in the same order, then we must re-checked the code for all of them.

Getting the memory ordering correct is difficult, yes. But the semantics are not ambiguous. Consumers are (at least for the most part) using it correctly in the network stack.

rpaulo edited edge metadata.
This revision is now accepted and ready to land.Feb 20 2015, 10:47 PM

This change looks good on its surface, but I haven't looked deeply to see if it has any bad connotations...

Even I’m not able to evaluate all aspect, I see some serious defects here - at least from ARM point of view.
The buf_ring_enqueue() guarantees proper write ordering (and visibility):

  • Store with acquire to br->br_prod_head
  • Normal store to br_ring[]
  • Store with release to br_prod_tail.

Unfortunately buf_ring_dequeue_sc() have not defined any read ordering and code can see
updated br_prod_tail and stale br_ring[]. (imho, this is true for amd64/i386 too).
Unlike of Semihalf guys, I see little different solution for race in buf_ring_dequeue_sc() read logic.
The br_ring[] must be read after br_prod_tail, but read order of br_cons_head and br_prod_tail is not important.
So (line 183)

buf = atomic_load_rel_32(&br->br_ring[cons_head]);

looks more correct to me.

On ARM, all stores to variable referenced by atomic_cmpset() must be done by atomic_store (or by other
atomic_* functions), normal store doesn’t clear exclusive monitor.
Thus, for ARM we MUST use atomic_store() for each store to variable referenced by atomic_cmpset() !

Ohh, and note – current atomic_store() implementation on ARM is broken too, but fix is easy (see atomic store_*_64().

In D1833#26, @meloun-miracle-cz wrote:

...
On ARM, all stores to variable referenced by atomic_cmpset() must be done by atomic_store (or by other
atomic_* functions), normal store doesn’t clear exclusive monitor.
Thus, for ARM we MUST use atomic_store() for each store to variable referenced by atomic_cmpset() !

Ohh, and note – current atomic_store() implementation on ARM is broken too, but fix is easy (see atomic store_*_64().

I'm commenting on just this aspect of Michal's comment, not the overall change. I'm not sure I agree that a normal store fails to clear the exclusive monitor, but even more importantly in this context... the discussion of that and any possible fixes for it shouldn't hold up this change right now.

I'm specifically not commenting on (or formally reviewing) the overall behavior of the ring buffer code. I'm not qualified to do that without a lot more studying of the code and just don't have time right now.

In D1833#26, @meloun-miracle-cz wrote:

Even I’m not able to evaluate all aspect, I see some serious defects here - at least from ARM point of view.
The buf_ring_enqueue() guarantees proper write ordering (and visibility):

  • Store with acquire to br->br_prod_head
  • Normal store to br_ring[]
  • Store with release to br_prod_tail.

Unfortunately buf_ring_dequeue_sc() have not defined any read ordering and code can see
updated br_prod_tail and stale br_ring[]. (imho, this is true for amd64/i386 too).
Unlike of Semihalf guys, I see little different solution for race in buf_ring_dequeue_sc() read logic.
The br_ring[] must be read after br_prod_tail, but read order of br_cons_head and br_prod_tail is not important.
So (line 183)

buf = atomic_load_rel_32(&br->br_ring[cons_head]);

looks more correct to me.

The update to prod_tail happens _after_ the store to br_ring and is done with an atomic_store_rel_int subsequently in dequeue_sc the read of prod_tail will be done with an atomic_load_acq_int before loading br_prod. Since there is a memory barrier after the update in enqueue and one before the read where is the problem?

On ARM, all stores to variable referenced by atomic_cmpset() must be done by atomic_store (or by other
atomic_* functions), normal store doesn’t clear exclusive monitor.
Thus, for ARM we MUST use atomic_store() for each store to variable referenced by atomic_cmpset() !

Great, but the only variable that is updated with atomic_cmpset is prod_head to atomically acquire the right to store in to that index in br_ring.

Please clarify what the actual problem is. This change fixes the only real issue that I can glean from your comment.

First of all, I was wrong in the behavior of atomic operations on ARM. After yesterday’s long session on IRC with Ian and Andy, we concluded that pure (non atomic) store affects(clears) state of exclusive monitor. So our implementation of atomics is fine on ARM. I apologize for my mistake, and, please, ignore everything about atomic_cmpset() in my previous message.

Back to your first question. I thought it only as a technical note (because misordered read occurs on br_ring[]). So, i'm fine with proposed patch.
And again, sorry for noise. I certainly did not intend to be “destructive”.

Reading this again, cons_head is used only by dequeue and is thus protected by the lock or an atomic update. Hence the atomic_load_acq_32 isn't needed for it. On the other hand, there are other places where prod_tail needs to be read with a memory barrier.

I think there are a few other potential races in the code since the 'tail' values are not mutex protected. Please see
https://reviews.freebsd.org/D1945

I've been spending more time looking at this than I care to admit. I don't understand why we need the load_acq_32 on prod_tail. The atomic_store_rel_int in enqueue should guarantee that the store to the ring happens before the update to prod_tail. So the consumer should _never_ see a prod_tail that is newer than the ring[prod_next] value. I suspect that something else is going on that we don't understand.

kmacy was so kind to offer posting the final solution to this ticket so I'm changing status to abandoned.
Thank you all for your review and discussion.

kib added inline comments.
sys/sys/buf_ring.h
165–166

Isn't the same situation occur in other places, at least in the buf_ring_dequeue_mc ?

You probably could replace two acqs with recently added atomic_thread_fence_acq().
rmb() in buf_ring_enqueue() arguably should be also spelled as thread_fence_acq().

sys/sys/buf_ring.h
165–166

This thread is marked as "abandoned" - please, have a look at D1945