Page MenuHomeFreeBSD

rmlock: Add a missing compiler membar to the rlock slow path
ClosedPublic

Authored by markj on Feb 20 2021, 8:43 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 18, 6:58 PM
Unknown Object (File)
Fri, Jan 10, 10:32 PM
Unknown Object (File)
Thu, Jan 9, 7:01 PM
Unknown Object (File)
Dec 7 2024, 10:03 PM
Unknown Object (File)
Dec 6 2024, 7:33 AM
Unknown Object (File)
Dec 1 2024, 1:02 AM
Unknown Object (File)
Nov 30 2024, 9:10 AM
Unknown Object (File)
Nov 16 2024, 3:24 AM

Details

Summary

On an armv7 kernel we are seeing deadlocks where an _rm_wlock() caller
is blocked after propagating priority to a reader, and the reader is
blocked in _rm_rlock_hard() on the rmlock interlock.

A look at the disassembly of _rm_rlock_hard() shows that the compiler is
emitting loads for

  • tracker->rmp_cpuQueue.rmq_next
  • tracker->rmp_cpuQueue.rmq_prev, and
  • tracker->rmp_flags

before performing any stores to update the per-CPU queue. This breaks
synchronization with the writer IPI, which removes the tracker from the
per-CPU queue and updates the flags. If the reader does not observe
that RMPF_ONQUEUE is set before blocking on the interlock, we get a
deadlock.

I don't claim that this change is complete, it is just enough to fix the
deadlocks we are seeing. I will audit the code some more, any help
would be very appreciated.

Test Plan

Disassembly before:

0xc03cc0b4 <+72>:    tst     r2, r1, lsr r0                                                                                                                                                                                                                                                                                
0xc03cc0b8 <+76>:    beq     0xc03cc110 <_rm_rlock_hard+164>   <--- branch on write CPUs                                                                                                                                                                                                                                                            
0xc03cc0bc <+80>:    ldr     r0, [r5]   <--- loading all three fields                                                                                                                                                                                                                                                                                   
0xc03cc0c0 <+84>:    ldr     r1, [r5, #16]                                                                                                                                                                                                                                                                                 
0xc03cc0c4 <+88>:    ldr     r2, [r5, #4]                                                                                                                                                                                                                                                                                  
0xc03cc0c8 <+92>:    cmp     r1, #0    <--- comparing rmp_flags == 0                                                                                                                                                                                                                                                                                    
0xc03cc0cc <+96>:    str     r2, [r0, #4]  <--- updating per-CPU queue                                                                                                                                                                                                                                                                                
0xc03cc0d0 <+100>:   str     r0, [r2]                                                                                                                                                                                                                                                                                      
0xc03cc0d4 <+104>:   beq     0xc03cc144 <_rm_rlock_hard+216>

After:

c041a948:       0a000012        beq     c041a998 <_rm_rlock_hard+0x9c>
c041a94c:       e5950000        ldr     r0, [r5]
c041a950:       e5951004        ldr     r1, [r5, #4]
c041a954:       e5801004        str     r1, [r0, #4]
c041a958:       e5810000        str     r0, [r1]
c041a95c:       e5950010        ldr     r0, [r5, #16]
c041a960:       e3500000        cmp     r0, #0  ; 0x0
c041a964:       0a000015        beq     c041a9c0 <_rm_rlock_hard+0xc4>

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

markj requested review of this revision.Feb 20 2021, 8:43 PM
markj added reviewers: kib, mjg, jhb, rlibby.
markj added subscribers: scottl, loos.
This revision is now accepted and ready to land.Feb 20 2021, 8:55 PM

While the patch at hand may indeed be good enough for a quick fixup, trying to review the entire ordeal suggests the mechanism is very error prone (at least to my taste). For example rm_tracker_remove may still be executing against IPIs and it contains multiple unordered stores.

In my opinion the thing to do is to rework the internals to follow rms (e.g., see rms_rlock). Namely any transitions read -> write or whatever other state stall for as long as there is any fast path executing (and block new arrivals), working around almost all races. Past that all read/write requests are globally serialized, taking care of the above bug.

In D28821#645324, @mjg wrote:

While the patch at hand may indeed be good enough for a quick fixup, trying to review the entire ordeal suggests the mechanism is very error prone (at least to my taste). For example rm_tracker_remove may still be executing against IPIs and it contains multiple unordered stores.

I think that is ok, the ordering doesn't matter with respect to the IPI, which only does forward traversal and does not update the queue.

In my opinion the thing to do is to rework the internals to follow rms (e.g., see rms_rlock). Namely any transitions read -> write or whatever other state stall for as long as there is any fast path executing (and block new arrivals), working around almost all races. Past that all read/write requests are globally serialized, taking care of the above bug.

Thanks, I will look at the rmslock implementation, but it'll be a separate diff to make backporting easier.

I have a couple questions, but the change itself looks fine.

which removes the tracker from the per-CPU queue

Pardon my ignorance of this code, but it doesn't seem to do that? It seems it sets the flag and links the tracker into rm_activeReaders, but doesn't write the per-CPU queue.

sys/kern/kern_rmlock.c
365–370

Why don't we need atomic semantics for this? Because the IPI stores are from the same CPU? Should we comment something to that effect?

Anyway if that's the case, maybe this should just be a volatile load or atomic_load_int()? I don't think we care about any other ordering except that with the earlier volatile store to the per-CPU list pointer (e.g. we don't care when rm->lock_object.lo_flags is loaded).

In D28821#645324, @mjg wrote:

While the patch at hand may indeed be good enough for a quick fixup, trying to review the entire ordeal suggests the mechanism is very error prone (at least to my taste). For example rm_tracker_remove may still be executing against IPIs and it contains multiple unordered stores.

I think that is ok, the ordering doesn't matter with respect to the IPI, which only does forward traversal and does not update the queue.

My understanding of this code is:

  • rmq_next/prev are declared as volatile, so they are not totally unordered, the compiler should at least emit reads and writes to those memory addresses in source order.
  • Whenever we do rm_tracker_remove or rm_tracker_add, we are in a critical section (or a pseudo critical section).
  • The IPI executes on the same CPU as the thread which is updating the per-CPU list and reading the rmp_flags, so the synchronization requirements are lighter than one might otherwise think at first (don't need atomic acq/rel/cst).

So combined I think these ensure that the IPI sees the list links updated in an acceptable order.

I have a couple questions, but the change itself looks fine.

which removes the tracker from the per-CPU queue

Pardon my ignorance of this code, but it doesn't seem to do that? It seems it sets the flag and links the tracker into rm_activeReaders, but doesn't write the per-CPU queue.

Sorry, you're right. I confused myself when writing the description: the reader unlinks the tracker and the IPI merely scans the list and sets RMPF_ONQUEUE if it finds a matching tracker. So if the IPI fires before the tracker is unlinked, then the writer will block and the reader must observe that RMPF_ONQUEUE is set, and if the IPI fires after the tracker is unlinked, then the reader will block until it is able to acquire the interlock. And we assume that the store to the rmp_cpuQueue.rmq_next pointer is atomic with respect to IPIs.

sys/kern/kern_rmlock.c
365–370

Right, we assume that a plain load is atomic with respect to the current CPU. That assumption is made all over the kernel. I can expand the comment.

Are you suggesting that we should use an atomic_load_int() instead of the compiler_membar()? I think that would fix the problem as well, since as you noted the queue linkage pointers are declared volatile, and using atomic_load_int() would prevent reordering of the (volatile) stores with the load of rmp_flags. I would also wrap accesses of the linkage fields with atomic_store_ptr() in that case. I noticed that the read path uses compiler_membar() elsewhere in the read path, so I went with that here.

sys/kern/kern_rmlock.c
365–370

Well, first, I think both work and I don't want to insist on the style so I'll leave it to your judgement.

I'd note that the __compiler_membar()s in _rm_rlock() are really just hand-inlining of critical_enter()/critical_exit() (without preemption check).

Personally I think given the use of volatile for the list pointers then atomic_load_int() makes a little more sense to me here, as a reader. Equally I think this could be rewritten with __compiler_membar() instead of volatile for the list pointers.

Use atomic_load_int() instead of a compiler barrier. I verified this with the
armv7 kernel where the problem was originally observed.

This revision now requires review to proceed.Feb 23 2021, 9:07 PM
sys/kern/kern_rmlock.c
372

But atomic load without fence does not provide any 'after' guarantees. It might change the compiler output but this is coincidental.

Also, this only changes the processor ordering, not what other threads could observe, unless I miss something.

sys/kern/kern_rmlock.c
372

Note that the problem here involves synchronization with an IPI on the same CPU. The use of atomic_load_int() ensures that the load will not be reordered by the compiler with the stores in rm_tracker_remove(), since the queue linkage fields are declared volatile. This combination ensures that the load and stores will not be reordered by the compiler.

I think this is actually more explicit with __compiler_membar(). It is not very common to use atomic_* to enforce compiler ordering constraints. But see the earlier comments below.

I won't be able to go back and test this for at least a few more days. However, when I tested the atomic_load_int() suggestion over the weekend, I got a really weird panic instead. I didn't dig into it because of other time pressures, but it does make me worry about compiler bugs and other weirdness, whereas the membar is proven to work for now.

I won't be able to go back and test this for at least a few more days. However, when I tested the atomic_load_int() suggestion over the weekend, I got a really weird panic instead. I didn't dig into it because of other time pressures, but it does make me worry about compiler bugs and other weirdness, whereas the membar is proven to work for now.

Looking at the patch you pasted at the time, you had if (atomic_load_int(tracker->rmp_flags)) instead of if (atomic_load_int(&tracker->rmp_flags)). I'm not sure how the former would compile (though it is more plausible on armv7), but if it did I'd indeed expect a panic.

sys/kern/kern_rmlock.c
372

Well, no, atomic_load/store without fences are relaxed and are not guaranteed to be reordered by compiler if accessing different locations. I do not think that gcc semantic for volatile t * provides any program order guarantees either, but even if it was, it is only a detail of our atomic_load() implementation.

So the semantic needed is exactly like C11 atomic_signal_fence(). Might be we should put that into atomic_common.h?

diff --git a/sys/sys/atomic_common.h b/sys/sys/atomic_common.h
index 48f0a8b8939c..11301b5a469f 100644
--- a/sys/sys/atomic_common.h
+++ b/sys/sys/atomic_common.h
@@ -78,4 +78,6 @@
 #define atomic_load_consume_ptr(p) \
     ((__typeof(*p)) atomic_load_acq_ptr((uintptr_t *)p))
 
+#define   atomic_signal_fence()   __compiler_membar()
+
 #endif

Might be, we should call it atomic_interrupt_fence() instead (also avoiding conflict with C11).

sys/kern/kern_rmlock.c
372

The compiler must not reorder volatile accesses, even to different locations, and here I'm relying on the implementation property of atomic_load_int() that it is a volatile load. That combined with the fact that rmp_cpuQueue.rmq_(next|prev) are defined as volatile is sufficient to give the correct ordering, I believe. But as you note this is an abuse of implementation details, so the original patch was preferable.

I will revert to __compiler_membar() and commit. I would prefer to add atomic_interrupt_fence() separately to simplify MFCs, and since there are other uses of __compiler_membar() in the file (see rms_int_membar). atomic_interrupt_fence() seems like the right name for the kernel.

Revert to using __compiler_membar().

This revision is now accepted and ready to land.Feb 24 2021, 1:38 AM

I won't be able to go back and test this for at least a few more days. However, when I tested the atomic_load_int() suggestion over the weekend, I got a really weird panic instead. I didn't dig into it because of other time pressures, but it does make me worry about compiler bugs and other weirdness, whereas the membar is proven to work for now.

Looking at the patch you pasted at the time, you had if (atomic_load_int(tracker->rmp_flags)) instead of if (atomic_load_int(&tracker->rmp_flags)). I'm not sure how the former would compile (though it is more plausible on armv7), but if it did I'd indeed expect a panic.

Hah, ok. I was quite cross-eyed by that point in the morning. Thanks for the reality check.