
Implement safe memory reclamation in UMA.
Accepted · Public

Authored by jeff on Thu, Nov 28, 1:14 AM.

Details

Summary

I am quite a ways away from committing this, but I think it's significant enough that we should discuss it ahead of time. The first user of this interface is here: https://reviews.freebsd.org/D22587

This implements a lighter-weight variant of epoch closely coupled to UMA. Why do this? Tighter integration gives us a more efficient mechanism, lower free-to-use latency, better memory management policy, and a simpler API for consumers.

The epoch_call() system has the overhead of maintaining a per-cpu queue of each freed memory location. It relies on consumers pre-allocating space for the list and providing callback functions. In general it is a more complex API than many cases need. By making uma_zfree() compatible with SMR I am able to use per-cpu buckets to batch operations and avoid queuing individual items. In effect the free fast path is no slower than for non-SMR zones. This allows me to use it for memory with much more rapid replacement rates, like the radix zone, without imposing much overhead.
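For illustration, here is a minimal consumer-side sketch of the simpler API, under assumed names: uma_smr_enter(), uma_smr_exit(), and foo_smr are placeholders invented for this example, not necessarily the interface in the diff; only uma_zfree() is the existing UMA call.

#include <sys/param.h>
#include <vm/uma.h>

/* Hypothetical section API, declared only so the sketch is complete. */
void	uma_smr_enter(void *);
void	uma_smr_exit(void *);

struct foo {
	int	value;
};

static uma_zone_t foo_zone;		/* zone created with an SMR flag */
static void *foo_smr;			/* placeholder for the zone's SMR state */
static struct foo * volatile foo_ptr;	/* pointer published to readers */

/* Reader: bracket accesses with a critical-section-based SMR section. */
static int
foo_read(void)
{
	struct foo *p;
	int v;

	uma_smr_enter(foo_smr);			/* critical_enter() + observe epoch */
	p = foo_ptr;
	v = (p != NULL) ? p->value : -1;	/* safe: no reuse until expiry */
	uma_smr_exit(foo_smr);			/* critical_exit() */
	return (v);
}

/* Writer: unpublish, then an ordinary free; UMA defers the reuse. */
static void
foo_retire(struct foo *p)
{
	foo_ptr = NULL;
	uma_zfree(foo_zone, p);		/* batched into per-cpu buckets, no callbacks */
}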

The current use of epoch drives the epoch count from the clock tick interrupt. At best you need two clock ticks after freeing memory before it can be reclaimed; depending on the timing of frees and critical sections it may be more. Because my implementation does not need a bounded number of queues, I am free to increment the epoch every time a bucket is freed. This means my free-to-use latency is simply the longest critical section time. Because UMA already keeps a queue of full buckets, we can stamp the bucket filled by free with the current epoch and place it at the tail of this list. By the time the bucket is selected for re-use the epoch is virtually guaranteed to have expired and we avoid the scan entirely.
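To make the stamp-and-defer scheme concrete, here is a sketch with purely illustrative types and names (smr_x, bucket_x, and their fields are inventions for this example, not the identifiers in the diff):

#include <sys/param.h>
#include <sys/queue.h>
#include <machine/atomic.h>

struct smr_x {
	volatile uint32_t s_wr_epoch;	/* advanced as buckets are freed */
	volatile uint32_t s_min_epoch;	/* cached minimum observed epoch */
};

struct bucket_x {
	TAILQ_ENTRY(bucket_x) b_link;
	uint32_t	b_epoch;	/* epoch stamped at free time */
};
TAILQ_HEAD(bucketq_x, bucket_x);

/*
 * Free side: stamp the filled bucket with the current epoch, advance
 * the epoch to start its grace period, and queue it at the tail.
 */
static void
bucket_defer(struct smr_x *s, struct bucketq_x *full, struct bucket_x *b)
{
	b->b_epoch = atomic_fetchadd_32(&s->s_wr_epoch, 1);
	TAILQ_INSERT_TAIL(full, b, b_link);
}

/*
 * Reuse side: the common case is one load and one wrap-safe compare;
 * the per-CPU scan runs only when the stamp has not yet expired.
 */
static bool
bucket_expired(struct smr_x *s, struct bucket_x *b)
{
	return ((int32_t)(atomic_load_acq_32(&s->s_min_epoch) -
	    b->b_epoch) > 0);
}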

By using UMA we are able to independently tune various aspects of performance and memory consumption with little effort. For example, the bucket size can be increased or the epoch can be advanced for every N buckets in order to reduce the number of cacheline invalidations from advancing the epoch. If the items are re-used before the epoch has expired we can increase the depth of the queue of buckets until the cache size matches the product of the read latency and the consumption rate. The page daemon still has access to these queues and can recover this additional memory when needed.
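To put rough, purely illustrative numbers on that sizing rule: at one million frees per second with a worst-case read section of 100 microseconds, about 100 items are in flight at any instant, so at typical bucket sizes a single extra bucket of queue depth already covers the window.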

I believe this combination of properties is sufficiently attractive to support a distinct mechanism. I am not yet supporting preemptible sections and I'm not sure I will. This means all of the UMA SMR protected code must execute within a critical section, so no blocking mutex acquires. The ck/subr_epoch.c mechanism continues to be appropriate for those uses.

The one wrinkle is that SMR zones can never be without a free bucket. If you have to synchronize() for every call to free, the performance degradation is massive. In effect this means one bucket per CPU per SMR zone is always allocated. We can force the bucket to drain in low memory conditions but the bucket itself will never be freed. In the worst case you will synchronize every time you fill a bucket, i.e. once per N frees, where N is likely the maximum bucket size. This condition only happens if memory allocation fails for the bucket zone, and at that point the system is likely crawling along stuck in VM_WAIT anyhow.
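A sketch of that degraded path follows; every helper here is a hypothetical name declared only to make the fragment read, and none of this is the diff itself:

#include <sys/param.h>
#include <vm/uma.h>

/* Hypothetical helpers, declared only so the sketch is complete. */
struct bucket_x;
struct bucket_x *zone_pcpu_free_bucket(uma_zone_t);
void	 bucket_push(struct bucket_x *, void *);
void	*zone_smr(uma_zone_t);
void	 uma_smr_synchronize(void *);
void	 zone_free_item_direct(uma_zone_t, void *);

static void
smr_zfree_sketch(uma_zone_t zone, void *item)
{
	struct bucket_x *b;

	b = zone_pcpu_free_bucket(zone);	/* per-cpu reserve: drained, never freed */
	if (__predict_true(b != NULL)) {
		bucket_push(b, item);		/* fast path: batch, no per-item queuing */
		return;
	}
	/*
	 * No bucket could be allocated: wait out all readers and free
	 * the item directly.  This costs at most one synchronize per N
	 * frees (N = bucket size) and only occurs when the bucket zone
	 * itself cannot allocate, i.e. the system is already starved.
	 */
	uma_smr_synchronize(zone_smr(zone));
	zone_free_item_direct(zone, item);
}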

I am a little uncertain of a few of my fences and would appreciate another eye on those. The current implementation is a bit barebones. I have been validating with dtrace. I see as many as 50,000 frees before we have to do a scan. The vast majority of the time the epoch is before min_epoch and only an atomic load and a comparison are necessary. In build profiles the synchronization barely registers, if at all (< 0.01% CPU). I have verified that things are clocking forward as expected. I handle wrapping with modular arithmetic, which I should probably place in macros à la tcp_seq, but I have not yet tested wrapping.

Test Plan

This is passing stress2 on my machine. I am getting a 192-core loaner from Intel to continue profiling. I am testing builds and I/O.

I need to write some kind of smr stress test to ensure that there is no access-after-free.

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 27866
Build 26038: arc lint + arc unit

Event Timeline

jeff created this revision. Thu, Nov 28, 1:14 AM
jeff edited the summary of this revision. Thu, Nov 28, 2:07 AM
jeff edited the test plan for this revision.
jeff set the repository for this revision to rS FreeBSD src repository.
hselasky added inline comments. Thu, Nov 28, 8:52 AM
sys/vm/uma_smr.c
127

Do you need to cast this difference, or compute it into its own variable, to keep the compiler from making assumptions?

173

You probably need to cast these differences, or compute them into their own variables, to avoid assumptions made by the compiler.

197

Ditto.

198

Ditto.

216

Ditto.

jeff added inline comments. Thu, Nov 28, 11:19 PM
sys/vm/uma.h
279–280

This comment is stale. The memory will not be re-used at all until the smr epoch expires.

sys/vm/uma_smr.c
127

Can you elaborate? I'm not sure I follow.

hselasky added inline comments. Fri, Nov 29, 10:49 AM
sys/vm/uma_smr.c
127

From what I understand, the counter g->smrg_epoch_min will eventually wrap around, so that the subtraction can overflow its sign. I.e. you may be subtracting a higher value from a lower one, which for signed arithmetic is undefined behaviour from what I understand. Then you either need to cast the subtraction or put it into a temporary unsigned variable. @kib can probably explain this better than I can. See also the commit which added:

sys/conf/kern.mk:CFLAGS+=	-fwrapv
kib added inline comments. Fri, Nov 29, 12:13 PM
sys/vm/uma_smr.c
127

Yes, -fwrapv should make the signed arithmetic wrap on overflow, same as unsigned. Of course this is a less-tested compiler option, so the usual approach is to avoid the undefined behavior either by changing the types of the epoch counters to unsigned ints, or at least by casting the expression elements to unsigned.
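For concreteness, the usual wrap-safe idiom looks like the tcp_seq.h comparisons: do the subtraction on unsigned values (well-defined wrap), then interpret the delta as signed to recover ordering. A sketch with illustrative names (the macro and type names the diff actually adopts may differ):

#include <sys/types.h>

typedef uint32_t	smr_seq_x;	/* illustrative, not the diff's type */
typedef int32_t		smr_delta_x;

/* Correct as long as live epochs stay within 2^31 of each other. */
#define	SMR_SEQ_DELTA_X(a, b)	((smr_delta_x)((smr_seq_x)(a) - (smr_seq_x)(b)))
#define	SMR_SEQ_LT_X(a, b)	(SMR_SEQ_DELTA_X(a, b) < 0)
#define	SMR_SEQ_LEQ_X(a, b)	(SMR_SEQ_DELTA_X(a, b) <= 0)
#define	SMR_SEQ_GT_X(a, b)	(SMR_SEQ_DELTA_X(a, b) > 0)
#define	SMR_SEQ_GEQ_X(a, b)	(SMR_SEQ_DELTA_X(a, b) >= 0)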

jeff updated this revision to Diff 65086. Sun, Dec 1, 4:48 AM

Use sequence macros. Clarify some comments. Rename a confusing function.

hselasky accepted this revision. Sun, Dec 1, 7:09 AM

Looks good with regards to EPOCH and wrapping counters.

This revision is now accepted and ready to land. Sun, Dec 1, 7:09 AM
jeff added a reviewer: pho. Sun, Dec 1, 10:49 PM
mmacy accepted this revision. Mon, Dec 2, 8:33 PM
jeff added a comment. Tue, Dec 10, 4:17 AM

Small update: this has enabled a drop of several orders of magnitude in object lock contention via the vm radix zones. Radix is often second only to mbuf in terms of turnover rate. Thus far benchmarks have shown trivial overhead for expiration polls and a small increase in memory footprint. I believe this approach has merit and I am continuing to work to productize it. My next steps are a man page and some kind of stress test. Intel has donated time on a quad-socket machine and I will be further validating performance and stability there.