if_wg: use proper barriers around pkt->p_state
Without appropriate load-synchronization to pair with store barriers in
wg_encrypt() and wg_decrypt(), the compiler and hardware are often
allowed to reorder these loads in wg_deliver_out() and wg_deliver_in()
such that we end up with a garbage or intermediate mbuf that we try to
pass on. The issue is particularly prevalent with the weaker
memory models of !x86 platforms.
Switch from the big-hammer wmb() to more explicit acq/rel atomics to
both make it obvious what we're syncing up with, and to avoid somewhat
hefty fences on platforms that don't necessarily need this.
With this patch, my dual-iperf3 reproducer is dramatically more stable
than it is without on aarch64.
PR: 264115
MFC after: 1 week
Reviewed by: andrew, zlei
Differential Revision: https://reviews.freebsd.org/D44283