I am quite a ways away from committing this but I think it's significant enough that we should discuss ahead of time. The first user of this interface is here: https://reviews.freebsd.org/D22587
This implements a lighter-weight variant of epoch closely coupled to UMA. Why do this? Tighter integration gives us a more efficient mechanism, lower free-to-use latency, better memory management policy, and a simpler API for consumers.
The epoch_call() system has the overhead of maintaining a per-cpu queue of each freed memory location. It relies on consumers pre-allocating space for the list and providing callback functions. In general it's just a more complex API than is necessary for many cases. By making uma_zfree() compatible with SMR I am able to use per-cpu buckets to batch operations and avoid queuing individual items. In effect the free fast path is no slower than for non-SMR zones. This allows me to use it for memory with much more rapid replacement rates, like the radix zone, without imposing much overhead.
The current use of epoch drives the epoch count from the clock tick interrupt. At best you need two clock ticks after freeing memory before it can be reclaimed; depending on the timing of frees and critical sections it may be more. Because my implementation does not need a bounded number of queues, I am free to increment the epoch every time a bucket is freed. This means my free-to-use latency is simply the longest critical_section() time. Because UMA already keeps a queue of full buckets, we can stamp the bucket filled by free with the current epoch and place it at the tail of this list. By the time the bucket is selected for re-use the epoch is virtually guaranteed to have expired and we avoid the scan entirely.
By using UMA we are able to independently tune various aspects of performance and memory consumption with little effort. For example, the bucket size can be increased or the epoch can be advanced for every N buckets in order to reduce the number of cacheline invalidations from advancing the epoch. If the items are re-used before the epoch has expired we can increase the depth of the queue of buckets until the cache size matches the product of the read latency and the consumption rate. The page daemon still has access to these queues and can recover this additional memory when needed.
I believe this combination of properties is sufficiently attractive to support a distinct mechanism. I am not yet supporting preemptable sections, and I'm not sure I will. This means all UMA SMR-protected code must execute within a critical section, so no blocking mutex acquires. The ck/subr_epoch.c mechanism continues to be appropriate for those uses.
The one wrinkle is that SMR zones can never be without a free bucket. If you have to synchronize() on every call to free, the performance degradation is massive. In effect this means one bucket per CPU per SMR zone is always allocated. We can force the bucket to drain in low-memory conditions, but the bucket itself will never be freed. In the worst case you will synchronize every time you fill a bucket, a rate of 1/N where N is likely the maximum bucket size. This condition only happens if memory allocation fails for the bucket zone, and at that point the system is likely crawling along stuck in VM_WAIT anyhow.
I am a little uncertain of a few of my fences and would appreciate another eye on those. The current implementation is a bit barebones. I have been validating with dtrace. I see as many as 50,000 frees before we have to do a scan. The vast majority of the time the epoch is before min_epoch and only an atomic load and a comparison are necessary. In build profiles the synchronization does not register, or barely registers (< .01% cpu). I have verified that things are clocking forward as expected. I handle wrapping with modular arithmetic, which I should probably place in macros a la tcp seq, but I have not yet tested wrapping.