This is a memory/latency vs. CPU tradeoff. The less frequently you advance wr_seq, the longer it takes to reclaim memory, but the shared cacheline is written less often and the read sections become cheaper. I do see a small positive effect on read-section time in my tests.
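To make the batching concrete, here is a minimal userspace sketch of the idea; the names and the C11 atomics are illustrative only and not the kernel implementation:

#include <stdatomic.h>

/*
 * Illustrative sketch, not the real SMR API.  Each writer keeps a
 * private 'pending' count and publishes to the shared wr_seq only
 * once per ADVANCE_BATCH frees, so the shared cacheline is dirtied
 * less often and readers that load wr_seq take fewer misses.
 */
#define ADVANCE_BATCH	64	/* frees per shared-cacheline write */

static _Atomic unsigned int wr_seq;	/* shared, contended cacheline */

unsigned int
wr_seq_advance_batched(unsigned int *pending)
{
	/* Common case: touch only the caller's private counter. */
	if (++*pending < ADVANCE_BATCH)
		return (atomic_load_explicit(&wr_seq,
		    memory_order_acquire));
	*pending = 0;
	/* One shared write amortized over the whole batch. */
	return (atomic_fetch_add_explicit(&wr_seq, 1,
	    memory_order_release) + 1);
}

The larger ADVANCE_BATCH is, the longer freed memory lingers before it can be reclaimed, which is exactly the memory/latency side of the tradeoff above.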
In my experiments so far, UMA does not benefit from this: the buckets amortize the global cacheline writes enough that it's irrelevant. In fact, I see a couple thousand alloc/free pairs before poll() is called. I suspect that at very high CPU counts this may start to change, so one strategy may be limit = ncpus/factor.
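Roughly what that scaling might look like; the divisor is a placeholder, not a measured value:

/*
 * Hypothetical sketch of the limit = ncpus/factor idea: scale the
 * batching limit with the CPU count so the policy adapts as core
 * counts grow.  'factor' is an illustrative tunable.
 */
static int
smr_batch_limit(int ncpus)
{
	const int factor = 4;		/* placeholder divisor */
	int limit = ncpus / factor;

	return (limit > 0 ? limit : 1);	/* never batch below one */
}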
However, if we are to integrate this with subr_epoch.c, a nice strategy would be to avoid calling smr_advance() for every callback and instead moderate it by some value. If we advance the epoch when the callout is scheduled, the memory will most likely be available to free by the time the timer fires. This should give us one clock tick of latency instead of the two we have today, even if we batch advancement.
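A rough sketch of what I mean; the epoch_sketch structure and the epoch_* helpers are illustrative stand-ins rather than the current subr_epoch.c code, while smr_advance(), smr_poll(), callout_pending() and callout_reset() are the real interfaces:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/callout.h>
#include <sys/smr.h>

struct epoch_sketch {
	struct callout	 e_callout;	/* one-tick reclamation timer */
	smr_t		 e_smr;		/* backing SMR state */
	smr_seq_t	 e_goal;	/* sequence this batch waits for */
};

/* Illustrative queue operations, not the real epoch(9) internals. */
static void epoch_enqueue(struct epoch_sketch *, void (*)(void *), void *);
static void epoch_run_callbacks(struct epoch_sketch *);

static void
epoch_reclaim(void *arg)
{
	struct epoch_sketch *ep = arg;

	/* By the time the tick fires e_goal has almost always been
	 * reached, so this normally succeeds without waiting. */
	if (smr_poll(ep->e_smr, ep->e_goal, false))
		epoch_run_callbacks(ep);
	else
		callout_reset(&ep->e_callout, 1, epoch_reclaim, ep);
}

static void
epoch_defer_sketch(struct epoch_sketch *ep, void (*cb)(void *), void *arg)
{
	epoch_enqueue(ep, cb, arg);	/* queue the deferred free */
	if (!callout_pending(&ep->e_callout)) {
		/* One advance covers every callback queued this tick. */
		ep->e_goal = smr_advance(ep->e_smr);
		callout_reset(&ep->e_callout, 1, epoch_reclaim, ep);
	}
}

Advancing at scheduling time is what buys the single tick: the grace period elapses concurrently with the callout delay instead of starting only when the timer fires.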