Page MenuHomeFreeBSD

Buf ring cleanups
Needs ReviewPublic

Authored by jpa-semihalf.com on Feb 23 2015, 12:47 AM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Dec 30, 3:09 AM
Unknown Object (File)
Sun, Dec 29, 8:00 PM
Unknown Object (File)
Sun, Dec 29, 12:43 PM
Unknown Object (File)
Tue, Dec 17, 2:14 PM
Unknown Object (File)
Mon, Dec 16, 10:55 PM
Unknown Object (File)
Sun, Dec 8, 9:49 AM
Unknown Object (File)
Sat, Dec 7, 2:21 PM
Unknown Object (File)
Fri, Dec 6, 9:06 PM

Details

Reviewers
kmacy
andrew
imp
Summary
  • comment on buf_ring operation
  • Improve memory ordering guarantees for reads of non-mutex protected values in buf_ring
  • add comment to clarify atomic_<>_acq operations

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kmacy retitled this revision from to Buf ring cleanups.
kmacy updated this object.
kmacy edited the test plan for this revision. (Show Details)
kmacy added reviewers: imp, rpaulo, andrew, zbb.
kmacy added a subscriber: emaste.
kmacy updated this object.
imp edited edge metadata.

These changes look good to me. The only issues I see are if the queue size is 1. And even then I'm not sure they are issues!

sys/sys/buf_ring.h
2

2007-2015 is project style.

2

Happiness from the anonymous nitpickers society.

This revision is now accepted and ready to land.Feb 23 2015, 1:28 AM
kmacy updated this object.
kmacy edited edge metadata.
  • make sure tabbing is correct
  • multi-line comments need to be re-worked (
This revision now requires review to proceed.Feb 23 2015, 2:01 AM

Thinking about this some more I think I've been overzealous with the memory barriers.

There are five variables in question:

  • prod_head: this is only read by the enqueue routine, if the latter were to initially read a stale value for it the cmpxchg (atomic_cmpset_acq_32) would fail. However, the implied memory barrier in cmpxchg would cause the subsequent read of prod_head to read the up to date value permitting the cmpxchg to succeed the second time. Conclusion - no additional memory barriers needed.
  • prod_tail: this value is the one used by dequeue to determine the effective producer index. The absence of a memory barrier preceding it made it possible for a new value to be visible to dequeue before the corresponding store to br_ring[idx] was visible in enqueue. Reading a stale value in enqueue could prevent progress, but would not violate correctness. Conclusion - atomic_store_rel_32() needed when doing update in enqueue and atomic_load_acq_32() needed when reading the value in dequeue. Correctness will not be violated by reading a stale value in enqueue so the initial read can be without a memory barrier, if it doesn't match the preceding value it may be stale so we need the while loop to read with an atomic_load_acq_32().
  • cons_head: this value is used only by dequeue, it is either updated atomically (dequeue_mc) or protected by a mutex (dequeue_sc). Conclusion - no memory barriers needed
  • cons_tail: This is used to communicate the latest consumer index between dequeue and enqueue. Reading a stale value in enqueue can cause an enqueue fail when it should not have. This was fixed in a recent commit by requiring a memory barrier before reading it again and returning ENOBUFS. There is no risk of reading a _new_ value of cons_tail before br_ring[idx] has been read because the value is not updated until following the read. Conclusion - forcing ordered reads necessary. Ordered rights don't matter here.
  • br_ring[idx] : The consistency of this value is addressed with prod_tail. This value needs to reach memory before prod_tail does. I believe the latest code handles this correctly but will need to look at it again.
sys/sys/buf_ring.h
166

_acq_32

168

This is the part that I don't understand: cons_tail and prod_head don't share a cache line, so how can the read memory barrier help?

sys/sys/buf_ring.h
168

So this bit of confusion leads me to believe that the implied specificity of FreeBSD's acquire / release API is a bit of sophistry that actually falls short of the more straightforward rmb()/wmb()/mb() in Linux. So I know of no architecture where memory barriers apply only to specific cache lines. Can you name one?

sys/sys/buf_ring.h
168

Actually, you're right and this seems to work as intended.

kmacy edited edge metadata.
  • Add further discussion of memory ordering expectations
  • Remove gratuitous memory barriers
  • Fix comment in dequeue_sc

Now I'm wondering if we need to guarantee that the load from ring needs to complete before subsequent changes that could expose it to destructive updates

I think I can guarantee that the ring load completes before it is exposed to destructive updates by just enforcing release semantics on cons_tail. This is obviously _much_ cheaper on x86 being an ordinary store.

Only do atomic_load_acq to guarantee that stale values can't cause unwarranted failures. The use of atomic_store_rel_32 on the tail indices should assure that reading a stale value can't violate correctness.

From Wojciech Macek:
One thing, here is what's i think is going on:

// 1. Core reads cons_head first
cons_head = br->br_cons_head;
// 2. It knows it might need a br->br_ring[cons_head]. The index is already known, so early-load data into read buffers.
// 3. Some kind of stall, pagetable walk, mesi wait or whatever
// 4. Other core enqueues something to the ring and updates prod_tail, data in br->br_ring[cons_head] changes.
// 5. The core now is running, just read prod_tail
prod_tail = br->br_prod_tail;

cons_next = (cons_head + 1) & br->br_cons_mask;

<some stuff here>

// 6. Yeah, there is something in the ring!
if (cons_head == prod_tail) 
    return (NULL);

<other even more important stuff>

br->br_cons_head = cons_next;
// 7. Oh, I've already early-loded this value... it would be great to use it now, wouldn't it?
buf = br->br_ring[cons_head];


<....>

// 8. Return the old value
return (buf);

I understand now. The problem is the relaxed consistency of ARM whereby loads can be ordered ahead of loads. When we enter the function br_ring[cons_head] is not valid. By the time we check prod_tail the value in memory of br_ring[cons_head] _is_ valid, but the value in our ROB is invalid. This could be mitigated, but not avoided entirely by making it a critical section. So you really do need to enforce that prod_tail is read before br_ring[cons_head] since program ordering of reads is not guaranteed. Unfortunately acq_load is expensive on x86 and loads can't be ordered loads. Thus I'd rather add an architecture specific define for relaxed consistency architectures so that x86 doesn't suffer a gratuitous atomic (while those that do need it will have it).

Why not have a new API that more accurately captures the nuance here?

But I think #4 in Wojciech's comment actually speaks volumes. "data in br->br_ring[cons_head] changes." cons_head shouldn't be changing behind the scenes. If it can, then you've just dropped data. Why is it being updated in the #6 step? I mean, I understand why it needs to do that for the algorithm, but that means that this isn't lockless.

Also, in #7, why is it using the *new* value of cons_head rather than the old value? Or is that just bad English and that happens on the other core.

This code is also fragile when cache line size is larger than the compiled constant CACHE_LINE_SIZE, but that isn't changed by these patches...

Is this more descriptive example ?

`Core(0) - buf_ring_enqueue()                                       Core(1) - buf_ring_dequeue_sc()
-----------------------------------------                                       ----------------------------------------------

                                                                                cons_head = br->br_cons_head;
atomic_cmpset_acq_32(&br->br_prod_head, ...));
                                                                                buf = br->br_ring[cons_head];     <see <1>>
br->br_ring[prod_head] = buf;
atomic_store_rel_32(&br->br_prod_tail, ...);
                                                                                prod_tail = br->br_prod_tail;
                                                                                if (cons_head == prod_tail) 
                                                                                        return (NULL);
                                                                                <condition is false and code uses invalid(old) buf>`

<1>
Load (on core 1) from br->br_ring[cons_head] can be reordered (speculative readed) by CPU.

And, imho, the compiler can do the same, because non-volatile load can be moved
across volatile load. But I'm not 100% sure here.

In D1945#30, @meloun-miracle-cz wrote:

Is this more descriptive example ?

`Core(0) - buf_ring_enqueue()                                       Core(1) - buf_ring_dequeue_sc()
-----------------------------------------                                       ----------------------------------------------

                                                                                cons_head = br->br_cons_head;
atomic_cmpset_acq_32(&br->br_prod_head, ...));
                                                                                buf = br->br_ring[cons_head];     <see <1>>
br->br_ring[prod_head] = buf;
atomic_store_rel_32(&br->br_prod_tail, ...);
                                                                                prod_tail = br->br_prod_tail;
                                                                                if (cons_head == prod_tail) 
                                                                                        return (NULL);
                                                                                <condition is false and code uses invalid(old) buf>`

<1>
Load (on core 1) from br->br_ring[cons_head] can be reordered (speculative readed) by CPU.

Load re-ordering is the fundamental issue here.

And, imho, the compiler can do the same, because non-volatile load can be moved
across volatile load. But I'm not 100% sure here.

My, possibly naive, understanding of volatile is that loads from memory only have to happen once - not necessarily that they can be re-ordered. However, re-ordering them is only a small step. On the practical side, scalarizing an arbitrary sized global array is not something that LLVM is currently capable of. Load re-ordering of scalars on the other hand is probably common when, for example, lifting loads out of a loop.

In D1945#29, @imp wrote:

Why not have a new API that more accurately captures the nuance here?

According to http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf the only re-ordering possible with normal memory operations on x86 is that stores can be re-ordered after loads.

But I think #4 in Wojciech's comment actually speaks volumes. "data in br->br_ring[cons_head] changes." cons_head shouldn't be changing behind the scenes. If it can, then you've just dropped data. Why is it being updated in the #6 step? I mean, I understand why it needs to do that for the algorithm, but that means that this isn't lockless.

The validity of the data in the ring is predicated on the value of prod_tail. If loads were executed in program order then this could never happen. Either prod_tail would indicate that the ring is empty and the function would return right away, or prod_tail would indicate that the ring was not empty - guaranteeing that value in the br_ring[cons_head] was valid. The problem only occurs because load re-ordering permits us to read br_ring[cons_head] before prod_tail says it is safe to do so.

Also, in #7, why is it using the *new* value of cons_head rather than the old value? Or is that just bad English and that happens on the other core.

The value of cons_head is irrelevant. It is controlled by dequeue and updates are serialized by a mutex.

This code is also fragile when cache line size is larger than the compiled constant CACHE_LINE_SIZE, but that isn't changed by these patches...

Why? CACHE_LINE_SIZE alignment is only to avoid gratuitous coherency traffic. I actually have an optimization related to that that I'd like to discuss but I'd like to clear this up first.

  • Improve prod_tail and cons_tail ordering comments
  • Enforce read barrier after prod_tail read in dequeue

Store release doesn't intrinsically serialize loads with stores. Update ring on dequeue to guarantee that the update of cons_tail happens last on all architectures.

Remove gratuitous re-ordering of assignments and fix PREFETCH by initializing the now conditional prod_tail

  • Make ring entries volatile void * to protect against any future compiler optimizations
  • Comment on alignment
  • make ring entries structs to simplify (possible) aligning of entries
  • comment on prod_tail usage in PREFETCH_DEFINED
  • remove incorrect comment above putback
  • missing a dequeue is not as disruptive as missing an enqueue, remove the double check

fix bug on RMO arches in dequeue_mc

the cmpset_acq in dequeue_mc should provide the memory barrier we need

I see there was huge amount of good work put into fixing the race, fruitful discussion, but also no updates since last month. Frankly speaking, I am waiting to see this fix in mainline, instead of using patch from with Wojciech Macek. Are there any more issues with this or can it be merged?

Looks like a good fix to have in the tree. Any other holdups besides the usual suspect (time)?

cheers,
Hiren

alc added inline comments.
sys/sys/buf_ring.h
71–80

As of this week, you no longer need this. The excessive, i.e., unnecessary, ordering by load_acq on x86 no longer exists.

176

Is there a release operation on &br->br_prod_head anywhere in this code? I didn't see one. So, I'm not sure what this is intended to order.

236–237

That is guaranteed by the release semantics on the store in line 252. Preceding (by program order) loads as well as stores are ordered by a release store.

298–299

Same comment as before.

331–339

Same comment as before.

sbahra_repnop.org added inline comments.
sys/sys/buf_ring.h
225

If br_prod_tail is re-ordered prior to load of br_cons_head, bad things will happen. Let's assume (a, b) = (cons_head, prod_tail).

Initial state: (0, 0), indicating empty.

T0: Producer transitions us from (0,0) to (0, 1).
T1: A consumer comes in and transitions us to (1, 1). This is an empty state.
T2: Begins this operation, reads prod_tail when it is 0 and reads cons_head after T1 has completed. It has observed (1, 0) despite T1 having rendered the ring buffer empty and will go ahead and successfully apply a CAS speculation on cons_head being 1.

This scenario was also confirmed with fault injection on Power (RMO).

ck_ring has slightly different semantics (it is completely lock-free on consumer, but assumes UINT_MAX operations don't occur during a delay in dequeue) but I took a stab at the buf_ring MP semantics today. I have tested this on RMO architectures and verified the fences with fault injection.

Will likely be cleaning up excessive macro use on M*_DEFINE, but it's up here in the mean time: https://github.com/concurrencykit/ck/blob/master/include/ck_ring.h - it seems like the old variation in ordering guarantee are MPMC fencing of shared producer / consumer counters. I hope it's useful.

sys/sys/buf_ring.h
168

Similar race to my previous comment. We must order this read or risk invalidly storing a value.

zbb removed a reviewer: zbb.
rpaulo removed a reviewer: rpaulo.
rpaulo added inline comments.
sys/sys/buf_ring.h
254

This should be under #else, right?