Since I had already worked out the interesting races, this was surprisingly easy. The idea is that instead of a synthetic write sequence we just use ticks as the sequence. You will always accumulate 2 ticks' worth of memory before anything can be reclaimed, so I limit the poll damage by not even checking before then.
The synchronization is a little more subtle because interrupts and context switches are key to understanding why it is safe. If either occurs, any speculative loads are lost and the store buffer is flushed, so the event itself acts as the barrier when one is needed.
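To make the tick idea concrete, here is a minimal single-threaded sketch in C. It is an illustration under stated assumptions, not the code in this change: cur_tick, struct retired, reader_enter(), can_free() and the fixed reader table are all hypothetical names, and the model does not demonstrate the interrupt/context-switch barrier argument, only the "stamp with ticks, do not poll before 2 ticks" bookkeeping.

```c
#include <stdbool.h>
#include <stddef.h>

#define GRACE_TICKS 2   /* assumed from "2 ticks" in the description */
#define NREADERS    4   /* hypothetical fixed reader table */

/* Hypothetical stand-in for the kernel's tick counter. */
static unsigned int cur_tick;

struct retired {
    void *item;
    unsigned int stamp;     /* tick at which the item was retired */
};

struct reader {
    bool active;            /* currently inside a read section */
    unsigned int entered;   /* tick observed on entry */
};

static struct reader readers[NREADERS];

/* Read side: record the current tick; no write sequence is involved. */
static void
reader_enter(int id)
{
    readers[id].entered = cur_tick;
    readers[id].active = true;
}

static void
reader_exit(int id)
{
    readers[id].active = false;
}

/* Write side: stamp the item with the tick instead of a synthetic sequence. */
static void
retire(struct retired *r, void *item)
{
    r->item = item;
    r->stamp = cur_tick;
}

/*
 * Poll: refuse to look at readers at all until GRACE_TICKS have elapsed.
 * The signed difference keeps the comparison valid across counter wrap
 * (assuming the usual two's complement conversion).
 */
static bool
can_free(const struct retired *r)
{
    if ((int)(cur_tick - r->stamp) < GRACE_TICKS)
        return (false);
    for (size_t i = 0; i < NREADERS; i++) {
        /* A reader that entered at or before the stamp may still hold it. */
        if (readers[i].active &&
            (int)(readers[i].entered - r->stamp) <= 0)
            return (false);
    }
    return (true);
}
```

In this model an item retired at tick 10 is never examined at ticks 10 or 11, and at tick 12 it is freed only if no reader that entered at or before tick 10 is still active.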
It is about 30% slower than SMR on my write-heavy benchmark and consumes 10x as much memory, but it is better than both RCU and epoch. I have a loop that times a read section. Some costs:
RCU/NULL: 2ns (loop + global variable increment)
SMR_LAZY: 3ns
SMR: 6ns
EPOCH: 8ns
EPOCH_PREEMPT: 11ns
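For reference, a timed section loop of the kind used for numbers like these could look roughly like the sketch below. This is an assumed userspace approximation, not the benchmark that produced the figures above: time_section(), null_enter() and null_exit() are hypothetical names, and timespec_get() reads the wall clock, where a kernel test would more likely read a cycle counter.

```c
#include <stdint.h>
#include <time.h>

/* Global that every iteration increments, as in the RCU/NULL baseline. */
static volatile uint64_t counter;

/* Empty enter/exit pair: measures only the loop and the increment. */
static void null_enter(void) { }
static void null_exit(void) { }

/*
 * Run iters read sections bracketed by enter()/leave() and return the
 * average cost of one iteration in nanoseconds.
 */
static double
time_section(void (*enter)(void), void (*leave)(void), uint64_t iters)
{
    struct timespec start, end;
    double ns;

    timespec_get(&start, TIME_UTC);
    for (uint64_t i = 0; i < iters; i++) {
        enter();
        counter++;
        leave();
    }
    timespec_get(&end, TIME_UTC);

    ns = (double)(end.tv_sec - start.tv_sec) * 1e9 +
        (double)(end.tv_nsec - start.tv_nsec);
    return (ns / (double)iters);
}
```

Calling time_section(null_enter, null_exit, n) with a large n gives the baseline; passing a real enter/exit pair and subtracting the baseline approximates the read-side overhead.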
So the read side here is pretty close to free, although it does still trade a lot of memory and large bucket caches for that. Because we store the ticks, it is amenable to conversion to a preemptible technique.