
rmlock: Add a missing compiler membar to the rlock slow path
ClosedPublic

Authored by markj on Feb 20 2021, 8:43 PM.

Details

Summary

On an armv7 kernel we are seeing deadlocks where an _rm_wlock() caller
is blocked after propagating priority to a reader, and the reader is
blocked in _rm_rlock_hard() on the rmlock interlock.

A look at the disassembly of _rm_rlock_hard() shows that the compiler is
emitting loads for

  • tracker->rmp_cpuQueue.rmq_next
  • tracker->rmp_cpuQueue.rmq_prev, and
  • tracker->rmp_flags

before performing any stores to update the per-CPU queue. This breaks
synchronization with the writer IPI, which removes the tracker from the
per-CPU queue and updates the flags. If the reader does not observe
that RMPF_ONQUEUE is set before blocking on the interlock, we get a
deadlock.
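
For reference, the shape of the fix is a single compiler barrier between the queue unlink and the flags check. A simplified sketch of the relevant piece of _rm_rlock_hard() (paraphrased; the trylock and interlock handling around it is elided):

/* Remove our tracker from the per-CPU queue. */
rm_tracker_remove(pc, tracker);

/*
 * Keep the compiler from hoisting the rmp_flags load above the
 * queue-unlink stores: if the IPI fired while the tracker was still
 * on the queue, it set RMPF_ONQUEUE and we must observe that here
 * before blocking on the interlock.
 */
__compiler_membar();

/* Check whether the IPI granted us the lock after all. */
if (tracker->rmp_flags) {
        /* Just add the tracker back - we hold the lock. */
        rm_tracker_add(pc, tracker);
        critical_exit();
        return (1);
}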

I don't claim that this change is complete; it is just enough to fix the
deadlocks we are seeing. I will audit the code some more; any help would be
much appreciated.

Test Plan

Disassembly before:

0xc03cc0b4 <+72>:    tst     r2, r1, lsr r0
0xc03cc0b8 <+76>:    beq     0xc03cc110 <_rm_rlock_hard+164>   <--- branch on write CPUs
0xc03cc0bc <+80>:    ldr     r0, [r5]                          <--- loading all three fields
0xc03cc0c0 <+84>:    ldr     r1, [r5, #16]
0xc03cc0c4 <+88>:    ldr     r2, [r5, #4]
0xc03cc0c8 <+92>:    cmp     r1, #0                            <--- comparing rmp_flags == 0
0xc03cc0cc <+96>:    str     r2, [r0, #4]                      <--- updating per-CPU queue
0xc03cc0d0 <+100>:   str     r0, [r2]
0xc03cc0d4 <+104>:   beq     0xc03cc144 <_rm_rlock_hard+216>

After:

c041a948:       0a000012        beq     c041a998 <_rm_rlock_hard+0x9c>
c041a94c:       e5950000        ldr     r0, [r5]
c041a950:       e5951004        ldr     r1, [r5, #4]
c041a954:       e5801004        str     r1, [r0, #4]
c041a958:       e5810000        str     r0, [r1]
c041a95c:       e5950010        ldr     r0, [r5, #16]
c041a960:       e3500000        cmp     r0, #0  ; 0x0
c041a964:       0a000015        beq     c041a9c0 <_rm_rlock_hard+0xc4>

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 37228
Build 34117: arc lint + arc unit

Event Timeline

markj requested review of this revision. Feb 20 2021, 8:43 PM
markj added reviewers: kib, mjg, jhb, rlibby.
markj added subscribers: scottl, loos.
This revision is now accepted and ready to land. Feb 20 2021, 8:55 PM

While the patch at hand may indeed be good enough for a quick fixup, trying to review the entire ordeal suggests the mechanism is very error-prone (at least to my taste). For example, rm_tracker_remove may still be executing against IPIs, and it contains multiple unordered stores.

In my opinion the thing to do is to rework the internals to follow rms (e.g., see rms_rlock). Namely, any transition read -> write (or to whatever other state) stalls for as long as there is any fast path executing (and blocks new arrivals), working around almost all races. Past that, all read/write requests are globally serialized, taking care of the above bug.
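
For readers unfamiliar with rms, the idea is roughly the following. This is only a toy illustration in portable C11, not the rmslock code (which keeps per-CPU reader counts rather than a single shared counter); names such as toy_rms are made up:

#include <stdatomic.h>
#include <stdbool.h>

struct toy_rms {
        atomic_int      readers;        /* in-flight fast-path readers */
        atomic_bool     writer;         /* a writer is pending or active */
};

static void
toy_rlock(struct toy_rms *l)
{
        for (;;) {
                atomic_fetch_add(&l->readers, 1);
                if (!atomic_load(&l->writer))
                        return;         /* fast path */
                /* A writer is pending: back off and stall until it is done. */
                atomic_fetch_sub(&l->readers, 1);
                while (atomic_load(&l->writer))
                        ;
        }
}

static void
toy_runlock(struct toy_rms *l)
{
        atomic_fetch_sub(&l->readers, 1);
}

static void
toy_wlock(struct toy_rms *l)
{
        bool expected = false;

        /* Serialize writers and block new fast-path arrivals. */
        while (!atomic_compare_exchange_weak(&l->writer, &expected, true))
                expected = false;
        /* Stall until every in-flight fast-path reader has drained. */
        while (atomic_load(&l->readers) != 0)
                ;
}

static void
toy_wunlock(struct toy_rms *l)
{
        atomic_store(&l->writer, false);
}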

In D28821#645324, @mjg wrote:

While the patch at hand may indeed be good enough for a quick fixup, trying to review the entire ordeal suggests the mechanism is very error-prone (at least to my taste). For example, rm_tracker_remove may still be executing against IPIs, and it contains multiple unordered stores.

I think that is ok: the ordering doesn't matter with respect to the IPI, which only does a forward traversal and does not update the queue.

In my opinion the thing to do is to rework the internals to follow rms (e.g., see rms_rlock). Namely, any transition read -> write (or to whatever other state) stalls for as long as there is any fast path executing (and blocks new arrivals), working around almost all races. Past that, all read/write requests are globally serialized, taking care of the above bug.

Thanks, I will look at the rmslock implementation, but it'll be a separate diff to make backporting easier.

I have a couple questions, but the change itself looks fine.

which removes the tracker from the per-CPU queue

Pardon my ignorance of this code, but it doesn't seem to do that? It seems it sets the flag and links the tracker into rm_activeReaders, but doesn't write the per-CPU queue.

sys/kern/kern_rmlock.c
365–370

Why don't we need atomic semantics for this? Because the IPI stores are from the same CPU? Should we comment something to that effect?

Anyway, if that's the case, maybe this should just be a volatile load or atomic_load_int()? I don't think we care about any ordering other than that with the earlier volatile store to the per-CPU list pointer (e.g. we don't care when rm->lock_object.lo_flags is loaded).
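
Concretely, something like the following (just a sketch of the suggested alternative, not a patch; whether the volatile-based ordering is something to rely on is discussed further down):

rm_tracker_remove(pc, tracker);

/*
 * atomic_load_int() is implemented as a volatile load, and the queue
 * linkage updated by rm_tracker_remove() is declared volatile, so the
 * compiler keeps the program order; nothing stronger is needed, since
 * the IPI runs on this same CPU.
 */
if (atomic_load_int(&tracker->rmp_flags)) {
        rm_tracker_add(pc, tracker);
        critical_exit();
        return (1);
}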

In D28821#645324, @mjg wrote:

While the patch at hand may indeed be good enough for a quick fixup, trying to review the entire ordeal suggests the mechanism is very error-prone (at least to my taste). For example, rm_tracker_remove may still be executing against IPIs, and it contains multiple unordered stores.

I think that is ok: the ordering doesn't matter with respect to the IPI, which only does a forward traversal and does not update the queue.

My understanding of this code is:

  • rmq_next/prev are declared volatile, so they are not totally unordered: the compiler should at least emit reads and writes to those memory addresses in source order.
  • Whenever we do rm_tracker_remove or rm_tracker_add, we are in a critical section (or a pseudo critical section).
  • The IPI executes on the same CPU as the thread which is updating the per-CPU list and reading the rmp_flags, so the synchronization requirements are lighter than one might otherwise think at first (don't need atomic acq/rel/cst).

So combined I think these ensure that the IPI sees the list links updated in an acceptable order.
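
For reference, the declarations and the unlink in question look roughly like this (paraphrased from sys/sys/_rmlock.h and sys/kern/kern_rmlock.c):

struct rm_queue {
        struct rm_queue *volatile rmq_next;
        struct rm_queue *volatile rmq_prev;
};

static void inline
rm_tracker_remove(struct pcpu *pc, struct rm_priotracker *tracker)
{
        struct rm_queue *next, *prev;

        next = tracker->rmp_cpuQueue.rmq_next;
        prev = tracker->rmp_cpuQueue.rmq_prev;

        /* Not used during forward traversal by the IPI. */
        next->rmq_prev = prev;

        /* Remove from the list. */
        prev->rmq_next = next;
}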

I have a couple questions, but the change itself looks fine.

which removes the tracker from the per-CPU queue

Pardon my ignorance of this code, but it doesn't seem to do that? It seems it sets the flag and links the tracker into rm_activeReaders, but doesn't write the per-CPU queue.

Sorry, you're right. I confused myself when writing the description: the reader unlinks the tracker, and the IPI merely scans the list and sets RMPF_ONQUEUE if it finds a matching tracker. So if the IPI fires before the tracker is unlinked, the writer will block and the reader must observe that RMPF_ONQUEUE is set; if the IPI fires after the tracker is unlinked, the reader will block until it is able to acquire the interlock. And we assume that the store to the rmp_cpuQueue.rmq_next pointer is atomic with respect to IPIs.
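
In other words, the IPI handler only does something like the following (a simplified sketch of rm_cleanIPI(), matching the description above):

static void
rm_cleanIPI(void *arg)
{
        struct rmlock *rm = arg;
        struct rm_priotracker *tracker;
        struct rm_queue *queue;
        struct pcpu *pc;

        pc = get_pcpu();
        /* Forward scan of this CPU's tracker queue; the queue is not modified. */
        for (queue = pc->pc_rm_queue.rmq_next; queue != &pc->pc_rm_queue;
            queue = queue->rmq_next) {
                tracker = (struct rm_priotracker *)queue;
                if (tracker->rmp_rmlock == rm && tracker->rmp_flags == 0) {
                        /* Flag the reader and put it on rm_activeReaders. */
                        tracker->rmp_flags = RMPF_ONQUEUE;
                        mtx_lock_spin(&rm_spinlock);
                        LIST_INSERT_HEAD(&rm->rm_activeReaders, tracker,
                            rmp_qentry);
                        mtx_unlock_spin(&rm_spinlock);
                }
        }
}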

sys/kern/kern_rmlock.c
365–370

Right, we assume that a plain load is atomic with respect to the current CPU. That assumption is made all over the kernel. I can expand the comment.

Are you suggesting that we should use atomic_load_int() instead of __compiler_membar()? I think that would fix the problem as well: as you noted, the queue linkage pointers are declared volatile, so using atomic_load_int() would prevent reordering of the (volatile) stores with the load of rmp_flags. I would also wrap accesses of the linkage fields with atomic_store_ptr() in that case. I went with __compiler_membar() here because the read path already uses it elsewhere.

sys/kern/kern_rmlock.c
365–370

Well, first, I think both work and I don't want to insist on the style so I'll leave it to your judgement.

I'd note that the __compiler_membar()s in _rm_rlock() are really just hand-inlining of critical_enter()/critical_exit() (without preemption check).

Personally, given the use of volatile for the list pointers, atomic_load_int() makes a little more sense to me here as a reader. Equally, I think this could be rewritten with __compiler_membar() instead of volatile for the list pointers.
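
For context, the _rm_rlock() fast path looks roughly like this (simplified):

td->td_critnest++;              /* critical_enter() by hand */
__compiler_membar();

pc = cpuid_to_pcpu[td->td_oncpu];
rm_tracker_add(pc, tracker);
sched_pin();

__compiler_membar();
td->td_critnest--;              /* critical_exit() without the preemption check */

/* Common case: no preemption pending and no writer has flagged this CPU. */
if (__predict_true(0 == (td->td_owepreempt |
    CPU_ISSET(pc->pc_cpuid, &rm->rm_writecpus))))
        return (1);

/* Otherwise fall back to the slow path. */
return (_rm_rlock_hard(rm, tracker, trylock));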

Use atomic_load_int() instead of a compiler barrier. I verified this with the
armv7 kernel where the problem was originally observed.

This revision now requires review to proceed. Feb 23 2021, 9:07 PM
sys/kern/kern_rmlock.c
372

But an atomic load without a fence does not provide any 'after' guarantees. It might change the compiler output, but that is coincidental.

Also, this only changes the processor ordering, not what other threads could observe, unless I miss something.

sys/kern/kern_rmlock.c
372

Note that the problem here involves synchronization with an IPI on the same CPU. atomic_load_int() is a volatile load, and the queue linkage fields touched by rm_tracker_remove() are declared volatile; in combination, the compiler cannot reorder the load of rmp_flags with those stores.

I think this is actually more explicit with __compiler_membar(). It is not very common to use atomic_* to enforce compiler ordering constraints. But see the earlier comments below.

I won't be able to go back and test this for at least a few more days. However, when I tested the atomic_load_int() suggestion over the weekend, I got a really weird panic instead. I didn't dig into it because of other time pressures, but it does make me worry about compiler bugs and other weirdness, whereas the membar is proven to work for now.

I won't be able to go back and test this for at least a few more days. However, when I tested the atomic_load_int() suggestion over the weekend, I got a really weird panic instead. I didn't dig into it because of other time pressures, but it does make me worry about compiler bugs and other weirdness, whereas the membar is proven to work for now.

Looking at the patch you pasted at the time, you had if (atomic_load_int(tracker->rmp_flags)) instead of if (atomic_load_int(&tracker->rmp_flags)). I'm not sure how the former would compile (though it is more plausible on armv7), but if it did I'd indeed expect a panic.

sys/kern/kern_rmlock.c
372

Well, no, atomic_load/store without fences are relaxed and are not guaranteed to stay ordered by the compiler when they access different locations. I do not think that the gcc semantics for volatile t * provide any program-order guarantees either, but even if they did, it is only a detail of our atomic_load() implementation.

So the semantics needed are exactly those of C11 atomic_signal_fence(). Maybe we should put that into atomic_common.h?

diff --git a/sys/sys/atomic_common.h b/sys/sys/atomic_common.h
index 48f0a8b8939c..11301b5a469f 100644
--- a/sys/sys/atomic_common.h
+++ b/sys/sys/atomic_common.h
@@ -78,4 +78,6 @@
 #define atomic_load_consume_ptr(p) \
     ((__typeof(*p)) atomic_load_acq_ptr((uintptr_t *)p))
 
+#define   atomic_signal_fence()   __compiler_membar()
+
 #endif

Maybe we should call it atomic_interrupt_fence() instead (also avoiding a conflict with C11).
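
With such a macro, the flags check in the slow path would read something like this (hypothetical usage of the proposed name; the fix itself is unchanged):

rm_tracker_remove(pc, tracker);

/* Order the rmp_flags load after the unlink w.r.t. the IPI on this CPU. */
atomic_interrupt_fence();

if (tracker->rmp_flags) {
        rm_tracker_add(pc, tracker);
        critical_exit();
        return (1);
}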

sys/kern/kern_rmlock.c
372

The compiler must not reorder volatile accesses, even to different locations, and here I'm relying on the implementation property of atomic_load_int() that it is a volatile load. That, combined with the fact that rmp_cpuQueue.rmq_(next|prev) are defined as volatile, is sufficient to give the correct ordering, I believe. But as you note this is an abuse of implementation details, so the original patch was preferable.

I will revert to __compiler_membar() and commit. I would prefer to add atomic_interrupt_fence() separately to simplify MFCs, and since there are other uses of __compiler_membar() in the file (see rms_int_membar). atomic_interrupt_fence() seems like the right name for the kernel.

Revert to using __compiler_membar().

This revision is now accepted and ready to land. Feb 24 2021, 1:38 AM

I won't be able to go back and test this for at least a few more days. However, when I tested the atomic_load_int() suggestion over the weekend, I got a really weird panic instead. I didn't dig into it because of other time pressures, but it does make me worry about compiler bugs and other weirdness, whereas the membar is proven to work for now.

Looking at the patch you pasted at the time, you had if (atomic_load_int(tracker->rmp_flags)) instead of if (atomic_load_int(&tracker->rmp_flags)). I'm not sure how the former would compile (though it is more plausible on armv7), but if it did I'd indeed expect a panic.

Hah, ok. I was quite cross-eyed by that point in the morning. Thanks for the reality check.