
if_wg: use proper barriers around pkt->p_state
ClosedPublic

Authored by kevans on Mar 9 2024, 1:33 AM.

Details

Summary

The hardware's allowed to reorder these loads in wg_deliver_out() and
wg_deliver_in() such that we end up with a garbage mbuf that we try to
pass on without appropriate load-synchronization to pair with store
barriers in wg_encrypt() and wg_decrypt(). The issue is particularly
prevalent with the weaker memory models of !x86 platforms.

With the patch, my dual-iperf3 reproducer is dramatically more stable
than it is without.
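
As a sketch of the pattern at issue (field and function names as they appear in if_wg.c; this illustrates the race and is not the driver code verbatim):

/* Producer, wg_encrypt()/wg_decrypt(): */
pkt->p_mbuf = m;			/* publish the crypted mbuf */
wmb();					/* order the store above before the one below */
pkt->p_state = WG_PACKET_CRYPTED;	/* mark the packet ready */

/* Consumer, wg_deliver_out()/wg_deliver_in(), before this patch: */
if (pkt->p_state != WG_PACKET_CRYPTED)	/* plain load, no ordering */
	goto error;
m = pkt->p_mbuf;	/* on weakly ordered CPUs this load may be
			 * satisfied before the p_state check above,
			 * yielding a garbage mbuf */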

PR: 264115

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

kevans requested review of this revision. Mar 9 2024, 1:33 AM
sys/dev/wg/if_wg.c
1519–1520

There's a little more context in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264115#c23, but all of the non-debug reports look like a bogus fib from the mbuf, which I suspected lines up with the pkt->p_mbuf load in wg_deliver_out() ultimately happening before this specific store is visible... the panic I was finally able to reproduce seems consistent with that.

sys/dev/wg/if_wg.c
1519–1520

That's kind of a classic symptom... but it's hard to know for sure: it could also be random corruption... If it is a race here, I think the release on the store and the acquire on the read are sufficient for this variable... but I often forget key details of the semantics, so don't just commit on my say-so...
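
Concretely, the release/acquire pairing being discussed would look something like this (a sketch using the atomic(9) _int accessors; treat the exact spelling as illustrative rather than the committed diff):

/* Store side: the release replaces the bare wmb() + plain store. */
pkt->p_mbuf = m;
atomic_store_rel_int(&pkt->p_state, WG_PACKET_CRYPTED);

/* Load side: the acquire pairs with the release above. */
if (atomic_load_acq_int(&pkt->p_state) != WG_PACKET_CRYPTED)
	goto error;
m = pkt->p_mbuf;	/* now guaranteed to observe the producer's store */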

zlei added inline comments.
sys/dev/wg/if_wg.c
1648

Will an rmb() before dereferencing pkt->p_mbuf be sufficient for load synchronization?

andrew added a subscriber: andrew.
andrew added inline comments.
sys/dev/wg/if_wg.c
1648

Is there a reason to prefer rmb over atomic_load_acq_int? rmb looks heavier than a load acquire on many architectures.

This revision is now accepted and ready to land. Mar 11 2024, 12:47 PM
sys/dev/wg/if_wg.c
1648

Yeah, rmb should be fine as well, but it doesn't seem like there's much point in penalizing x86 with an lfence when we can handle it more optimally for everyone and more clearly outline our expectations at the same time (i.e., that the remaining pkt field updates are visible when we observe p_state != WG_PACKET_CRYPTED).
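
For comparison, the rmb() variant would sit on the consumer side like this (sketch only; on amd64 rmb() expands to the lfence mentioned above, while a load-acquire there is just a plain load under TSO):

if (pkt->p_state != WG_PACKET_CRYPTED)
	goto error;
rmb();			/* orders the p_state load before the p_mbuf load;
			 * correct, but costs an lfence on x86 and is
			 * typically no cheaper than an acquire on weakly
			 * ordered architectures either */
m = pkt->p_mbuf;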

sys/dev/wg/if_wg.c
1648

@andrew Is there a reason to prefer rmb over atomic_load_acq_int? rmb looks heavier than a load acquire on many architectures.

That was my first thought on seeing the wmb(). Normally rmb() and wmb() should be used as a pair.
I have not tested it yet, but I suspect the wmb() in this if_wg driver is even redundant on x86.

@kevans Yeah, rmb should be fine as well, but it doesn't seem like there's much point in penalizing x86 with an lfence when we can handle it more optimally for everyone and more clearly outline our expectations at the same time (i.e., that the remaining pkt field updates are visible when we observe p_state != WG_PACKET_CRYPTED).

Last night my brain was completely filled with weak memory models. I'm now wondering: why not establish a happens-before order between GROUPTASK_ENQUEUE() and the next run of the tasks (wg_deliver_out() and wg_deliver_in())?

sys/dev/wg/if_wg.c
1648

That was my first thought on seeing the wmb(). Normally rmb() and wmb() should be used as a pair.
I have not tested it yet, but I suspect the wmb() in this if_wg driver is even redundant on x86.

Yes, I think this is the case.

Last night my brain was completely filled with weak memory models. I'm now wondering: why not establish a happens-before order between GROUPTASK_ENQUEUE() and the next run of the tasks (wg_deliver_out() and wg_deliver_in())?

Because we don't really need that heavy of a hammer, and I'm pretty sure that's not sufficient anyways. Just because we GROUPTASK_ENQUEUE() here, that doesn't mean a future run of the task is the one that peels the packet off of the serial queue -- it could actually be a concurrent run pulling it off pretty much as soon as we're done with it. We don't really have a reason to prevent that, either, we just need to make sure that the pkt state is consistent when we actually go to process it.
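
A hypothetical interleaving that shows why the enqueue edge alone is not enough:

/* CPU 0: wg_encrypt()                 CPU 1: wg_deliver_out(), already running
 *
 * pkt->p_mbuf = m;
 * wmb();
 * pkt->p_state = CRYPTED;  -------->  observes p_state == CRYPTED
 *                                     m = pkt->p_mbuf;  <-- without an
 *                                     acquire, this load may still return
 *                                     the stale pre-encryption value
 * GROUPTASK_ENQUEUE(&peer->p_send);   (CPU 1 never went through this
 *                                     enqueue, so that edge establishes
 *                                     no happens-before for this packet) */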

Looks good to me.

sys/dev/wg/if_wg.c
1648

That was my first thought on seeing the wmb(). Normally rmb() and wmb() should be used as a pair.
I have not tested it yet, but I suspect the wmb() in this if_wg driver is even redundant on x86.

Yes, I think this is the case.

Last night my brain was completely filled with weak memory models. I'm now wondering: why not establish a happens-before order between GROUPTASK_ENQUEUE() and the next run of the tasks (wg_deliver_out() and wg_deliver_in())?

Because we don't really need that heavy of a hammer, and I'm pretty sure that's not sufficient anyways. Just because we GROUPTASK_ENQUEUE() here, that doesn't mean a future run of the task is the one that peels the packet off of the serial queue -- it could actually be a concurrent run pulling it off pretty much as soon as we're done with it. We don't really have a reason to prevent that, either, we just need to make sure that the pkt state is consistent when we actually go to process it.

I think your analysis is right.

-----
wg_xmit()
	wg_peer_send_staged()
		wg_queue_both()
			enqueue(&peer->p_encrypt_serial)
			enqueue(&sc->sc_encrypt_parallel)
		wg_encrypt_dispatch()
			GROUPTASK_ENQUEUE(&sc->sc_encrypt[cpu]);  // signal one cpu to do wg_softc_encrypt()

-----
wg_softc_encrypt()
	while ((pkt = dequeue(&sc->sc_encrypt_parallel)) != NULL)
		wg_encrypt(sc, pkt)
			noise_keypair_encrypt()	// Do the encryption, the heavy part
			pkt->p_mbuf = m;
			wmb();
			pkt->p_state = state;
			GROUPTASK_ENQUEUE(&peer->p_send);	// signal one cpu to do wg_deliver_out()

-----
wg_deliver_out()
	while ((pkt = dequeue(&peer->p_encrypt_serial)) != NULL)
		assert(pkt->p_state == WG_PACKET_CRYPTED)
		m = pkt->p_mbuf;
		wg_send(sc, &endpoint, m);
			sosend()

I'm now wondering: why not establish a happens-before order between GROUPTASK_ENQUEUE() and the next run of the tasks (wg_deliver_out() and wg_deliver_in())?

Actually there is a happens-before order (GROUPTASK_ENQUEUE() happens before the next run of wg_deliver_out()), BUT wg_deliver_out() dequeues multiple packets, and some of them may have been produced by other threads running wg_softc_encrypt().
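
So the ordering has to travel with the packet itself; an acquire on p_state covers packets from every producer, whichever task run dequeues them (a sketch following the call flow above, not the committed diff):

wg_deliver_out()
	while ((pkt = dequeue(&peer->p_encrypt_serial)) != NULL)
		/* acquire pairs with the release store in whichever
		 * thread ran wg_encrypt() for this packet */
		if (atomic_load_acq_int(&pkt->p_state) != WG_PACKET_CRYPTED)
			goto error;
		m = pkt->p_mbuf;	/* safe to dereference now */
		wg_send(sc, &endpoint, m);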