- comment on buf_ring operation
- Improve memory ordering guarantees for reads of non-mutex protected values in buf_ring
- add comment to clarify atomic_<>_acq operations
Details
Diff Detail
- Lint
- Lint Skipped 
- Unit
- Tests Skipped 
Event Timeline
Thinking about this some more I think I've been overzealous with the memory barriers.
There are five variables in question:
- prod_head: this is only read by the enqueue routine, if the latter were to initially read a stale value for it the cmpxchg (atomic_cmpset_acq_32) would fail. However, the implied memory barrier in cmpxchg would cause the subsequent read of prod_head to read the up to date value permitting the cmpxchg to succeed the second time. Conclusion - no additional memory barriers needed.
- prod_tail: this value is the one used by dequeue to determine the effective producer index. The absence of a memory barrier preceding it made it possible for a new value to be visible to dequeue before the corresponding store to br_ring[idx] was visible in enqueue. Reading a stale value in enqueue could prevent progress, but would not violate correctness. Conclusion - atomic_store_rel_32() needed when doing update in enqueue and atomic_load_acq_32() needed when reading the value in dequeue. Correctness will not be violated by reading a stale value in enqueue so the initial read can be without a memory barrier, if it doesn't match the preceding value it may be stale so we need the while loop to read with an atomic_load_acq_32().
- cons_head: this value is used only by dequeue, it is either updated atomically (dequeue_mc) or protected by a mutex (dequeue_sc). Conclusion - no memory barriers needed
- cons_tail: This is used to communicate the latest consumer index between dequeue and enqueue. Reading a stale value in enqueue can cause an enqueue fail when it should not have. This was fixed in a recent commit by requiring a memory barrier before reading it again and returning ENOBUFS. There is no risk of reading a _new_ value of cons_tail before br_ring[idx] has been read because the value is not updated until following the read. Conclusion - forcing ordered reads necessary. Ordered rights don't matter here.
- br_ring[idx] : The consistency of this value is addressed with prod_tail. This value needs to reach memory before prod_tail does. I believe the latest code handles this correctly but will need to look at it again.
| sys/sys/buf_ring.h | ||
|---|---|---|
| 157 | So this bit of confusion leads me to believe that the implied specificity of FreeBSD's acquire / release API is a bit of sophistry that actually falls short of the more straightforward rmb()/wmb()/mb() in Linux. So I know of no architecture where memory barriers apply only to specific cache lines. Can you name one? | |
| sys/sys/buf_ring.h | ||
|---|---|---|
| 157 | Actually, you're right and this seems to work as intended. | |
- Add further discussion of memory ordering expectations
- Remove gratuitous memory barriers
Now I'm wondering if we need to guarantee that the load from ring needs to complete before subsequent changes that could expose it to destructive updates
I think I can guarantee that the ring load completes before it is exposed to destructive updates by just enforcing release semantics on cons_tail. This is obviously _much_ cheaper on x86 being an ordinary store.
Only do atomic_load_acq to guarantee that stale values can't cause unwarranted failures. The use of atomic_store_rel_32 on the tail indices should assure that reading a stale value can't violate correctness.
From Wojciech Macek:
One thing, here is what's i think is going on:
// 1. Core reads cons_head first
cons_head = br->br_cons_head;
// 2. It knows it might need a br->br_ring[cons_head]. The index is already known, so early-load data into read buffers.
// 3. Some kind of stall, pagetable walk, mesi wait or whatever
// 4. Other core enqueues something to the ring and updates prod_tail, data in br->br_ring[cons_head] changes.
// 5. The core now is running, just read prod_tail
prod_tail = br->br_prod_tail;
cons_next = (cons_head + 1) & br->br_cons_mask;
<some stuff here>
// 6. Yeah, there is something in the ring!
if (cons_head == prod_tail) 
    return (NULL);
<other even more important stuff>
br->br_cons_head = cons_next;
// 7. Oh, I've already early-loded this value... it would be great to use it now, wouldn't it?
buf = br->br_ring[cons_head];
<....>
// 8. Return the old value
return (buf);I understand now. The problem is the relaxed consistency of ARM whereby loads can be ordered ahead of loads. When we enter the function br_ring[cons_head] is not valid. By the time we check prod_tail the value in memory of br_ring[cons_head] _is_ valid, but the value in our ROB is invalid. This could be mitigated, but not avoided entirely by making it a critical section. So you really do need to enforce that prod_tail is read before br_ring[cons_head] since program ordering of reads is not guaranteed. Unfortunately acq_load is expensive on x86 and loads can't be ordered loads. Thus I'd rather add an architecture specific define for relaxed consistency architectures so that x86 doesn't suffer a gratuitous atomic (while those that do need it will have it).
Why not have a new API that more accurately captures the nuance here?
But I think #4 in Wojciech's comment actually speaks volumes. "data in br->br_ring[cons_head] changes." cons_head shouldn't be changing behind the scenes. If it can, then you've just dropped data. Why is it being updated in the #6 step? I mean, I understand why it needs to do that for the algorithm, but that means that this isn't lockless.
Also, in #7, why is it using the *new* value of cons_head rather than the old value? Or is that just bad English and that happens on the other core.
This code is also fragile when cache line size is larger than the compiled constant CACHE_LINE_SIZE, but that isn't changed by these patches...
Is this more descriptive example ?
`Core(0) - buf_ring_enqueue()                                       Core(1) - buf_ring_dequeue_sc()
-----------------------------------------                                       ----------------------------------------------
                                                                                cons_head = br->br_cons_head;
atomic_cmpset_acq_32(&br->br_prod_head, ...));
                                                                                buf = br->br_ring[cons_head];     <see <1>>
br->br_ring[prod_head] = buf;
atomic_store_rel_32(&br->br_prod_tail, ...);
                                                                                prod_tail = br->br_prod_tail;
                                                                                if (cons_head == prod_tail) 
                                                                                        return (NULL);
                                                                                <condition is false and code uses invalid(old) buf>`<1>
Load (on core 1) from br->br_ring[cons_head] can be reordered (speculative readed) by CPU.
And, imho, the compiler can do the same, because non-volatile load can be moved
across volatile load. But I'm not 100% sure here.
Load re-ordering is the fundamental issue here.
And, imho, the compiler can do the same, because non-volatile load can be moved
across volatile load. But I'm not 100% sure here.
My, possibly naive, understanding of volatile is that loads from memory only have to happen once - not necessarily that they can be re-ordered. However, re-ordering them is only a small step. On the practical side, scalarizing an arbitrary sized global array is not something that LLVM is currently capable of. Load re-ordering of scalars on the other hand is probably common when, for example, lifting loads out of a loop.
According to http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf the only re-ordering possible with normal memory operations on x86 is that stores can be re-ordered after loads.
But I think #4 in Wojciech's comment actually speaks volumes. "data in br->br_ring[cons_head] changes." cons_head shouldn't be changing behind the scenes. If it can, then you've just dropped data. Why is it being updated in the #6 step? I mean, I understand why it needs to do that for the algorithm, but that means that this isn't lockless.
The validity of the data in the ring is predicated on the value of prod_tail. If loads were executed in program order then this could never happen. Either prod_tail would indicate that the ring is empty and the function would return right away, or prod_tail would indicate that the ring was not empty - guaranteeing that value in the br_ring[cons_head] was valid. The problem only occurs because load re-ordering permits us to read br_ring[cons_head] before prod_tail says it is safe to do so.
Also, in #7, why is it using the *new* value of cons_head rather than the old value? Or is that just bad English and that happens on the other core.
The value of cons_head is irrelevant. It is controlled by dequeue and updates are serialized by a mutex.
This code is also fragile when cache line size is larger than the compiled constant CACHE_LINE_SIZE, but that isn't changed by these patches...
Why? CACHE_LINE_SIZE alignment is only to avoid gratuitous coherency traffic. I actually have an optimization related to that that I'd like to discuss but I'd like to clear this up first.
- Improve prod_tail and cons_tail ordering comments
- Enforce read barrier after prod_tail read in dequeue
Store release doesn't intrinsically serialize loads with stores. Update ring on dequeue to guarantee that the update of cons_tail happens last on all architectures.
Remove gratuitous re-ordering of assignments and fix PREFETCH by initializing the now conditional prod_tail
- Make ring entries volatile void * to protect against any future compiler optimizations
- Comment on alignment
- make ring entries structs to simplify (possible) aligning of entries
- comment on prod_tail usage in PREFETCH_DEFINED
- remove incorrect comment above putback
- missing a dequeue is not as disruptive as missing an enqueue, remove the double check
I see there was huge amount of good work put into fixing the race, fruitful discussion, but also no updates since last month. Frankly speaking, I am waiting to see this fix in mainline, instead of using patch from with Wojciech Macek. Are there any more issues with this or can it be merged?
Looks like a good fix to have in the tree. Any other holdups besides the usual suspect (time)?
cheers,
Hiren
| sys/sys/buf_ring.h | ||
|---|---|---|
| 60–140 | As of this week, you no longer need this. The excessive, i.e., unnecessary, ordering by load_acq on x86 no longer exists. | |
| 165 | Is there a release operation on &br->br_prod_head anywhere in this code? I didn't see one. So, I'm not sure what this is intended to order. | |
| 224–225 | That is guaranteed by the release semantics on the store in line 252. Preceding (by program order) loads as well as stores are ordered by a release store. | |
| 286–287 | Same comment as before. | |
| 325–333 | Same comment as before. | |
| sys/sys/buf_ring.h | ||
|---|---|---|
| 213 | If br_prod_tail is re-ordered prior to load of br_cons_head, bad things will happen. Let's assume (a, b) = (cons_head, prod_tail). Initial state: (0, 0), indicating empty. T0: Producer transitions us from (0,0) to (0, 1). This scenario was also confirmed with fault injection on Power (RMO). | |
ck_ring has slightly different semantics (it is completely lock-free on consumer, but assumes UINT_MAX operations don't occur during a delay in dequeue) but I took a stab at the buf_ring MP semantics today. I have tested this on RMO architectures and verified the fences with fault injection.
Will likely be cleaning up excessive macro use on M*_DEFINE, but it's up here in the mean time: https://github.com/concurrencykit/ck/blob/master/include/ck_ring.h - it seems like the old variation in ordering guarantee are MPMC fencing of shared producer / consumer counters. I hope it's useful.
| sys/sys/buf_ring.h | ||
|---|---|---|
| 158 | Similar race to my previous comment. We must order this read or risk invalidly storing a value. | |
| sys/sys/buf_ring.h | ||
|---|---|---|
| 243 | This should be under #else, right? | |