
Implement safe memory reclamation in UMA.

Authored by jeff on Nov 28 2019, 1:14 AM.



I am quite a ways away from committing this but I think it's significant enough that we should discuss ahead of time. The first user of this interface is here:

This implements a lighter-weight variant of epoch closely coupled to UMA. Why do this? Tighter integration gives us a more efficient mechanism, lower free-to-use latency, better memory management policy, and a simpler API for consumers.

The epoch_call() system has the overhead of maintaining a per-cpu queue of each freed memory location. It relies on the consumers pre-allocating space for the list and providing callback functions. In general it's just a more complex API than is necessary to handle many cases. By making uma_zfree() compatible with SMR I am able to use per-cpu buckets to batch operations and save the individual item queuing. In effect the free fast path is no slower than for non-SMR zones. This allows me to use it for memory with much more rapid replacement rates, like the radix zone, without imposing much overhead.

The current use of epoch drives the epoch count from the clock tick interrupt. At best you need two clock ticks after freeing memory before it can be reclaimed; depending on the timing of frees and critical sections it may take more. Because my implementation does not need a bounded number of queues I am free to increment the epoch every time a bucket is freed. This means my free-to-use latency is simply the longest critical_section() time. Because UMA already keeps a queue of full buckets, we can stamp the bucket filled by free with the current epoch and place it at the tail of this list. By the time the bucket is selected for re-use the epoch is virtually guaranteed to have expired and we avoid the scan entirely.
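A minimal userland sketch of the bucket-stamping idea described above (all names here are hypothetical illustrations, not the actual UMA code): a full bucket is stamped with the write sequence at free time, and it may be reused without any scan once the lazily updated read sequence has caught up to the stamp.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: a bucket filled by free is stamped with the
 * current write sequence.  It is safe to reuse once every reader has
 * observed a sequence at or past the stamp, i.e. once the (lazily
 * updated) read sequence has caught up.
 */
struct bucket {
	uint32_t ub_seq;	/* write sequence when the bucket was filled */
};

static uint32_t wr_seq;		/* advanced each time a full bucket is queued */
static uint32_t rd_seq;		/* lower bound over all active readers */

static void
bucket_queue(struct bucket *b)
{
	/* Advance the epoch once per bucket, not once per item. */
	b->ub_seq = ++wr_seq;
}

/* True when the bucket can be reused without polling the readers. */
static bool
bucket_reusable(const struct bucket *b)
{
	/* Signed difference so the comparison survives 2^32 wrap. */
	return ((int32_t)(rd_seq - b->ub_seq) >= 0);
}
```

Because buckets sit at the tail of the full-bucket queue for a while before reuse, `bucket_reusable()` is expected to be true almost every time, which is why the fast path needs only a load and a comparison.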

By using UMA we are able to independently tune various aspects of performance and memory consumption with little effort. For example, the bucket size can be increased or the epoch can be advanced for every N buckets in order to reduce the number of cacheline invalidations from advancing the epoch. If the items are re-used before the epoch has expired we can increase the depth of the queue of buckets until the cache size matches the product of the read latency and the consumption rate. The page daemon still has access to these queues and can recover this additional memory when needed.

I believe this combination of properties is sufficiently attractive to support a distinct mechanism. I am not yet supporting preemptable sections and I'm not sure I will. This means all of the uma smr protected code must execute under a critical section, so no blocking mutex acquires. The ck/subr_epoch.c mechanism continues to be appropriate for those uses.
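The scheme above can be modeled in a few lines. This is a single-threaded toy, not the committed API: a reader publishes the write sequence it observed on section entry (in the kernel this happens under critical_enter(), per the non-preemptible constraint), and a poller may only advance the read hand past the minimum published value. All names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define	NCPU		4
#define	SEQ_INACTIVE	0	/* reserved in this toy: cpu not in a section */

/*
 * Toy model of the non-preemptible sequence scheme.  Readers publish
 * the write sequence they observed; the poller computes the minimum
 * over active readers and advances the read hand up to it.
 */
static uint32_t wr_seq = 1;		/* write hand */
static uint32_t rd_seq = 1;		/* read hand, a lazy lower bound */
static uint32_t cpu_seq[NCPU];		/* per-cpu published sequence */

static void
sec_enter(int cpu)
{
	/* In the kernel this runs inside a critical section. */
	cpu_seq[cpu] = wr_seq;
}

static void
sec_exit(int cpu)
{
	cpu_seq[cpu] = SEQ_INACTIVE;
}

static uint32_t
seq_advance(void)
{
	/* One increment per freed bucket, no clock tick needed. */
	return (++wr_seq);
}

/* Returns true once all sections that began before "goal" have exited. */
static bool
seq_poll(uint32_t goal)
{
	uint32_t min = wr_seq;
	int i;

	for (i = 0; i < NCPU; i++) {
		if (cpu_seq[i] != SEQ_INACTIVE &&
		    (int32_t)(cpu_seq[i] - min) < 0)
			min = cpu_seq[i];
	}
	if ((int32_t)(min - rd_seq) > 0)
		rd_seq = min;
	return ((int32_t)(rd_seq - goal) >= 0);
}
```

The free-to-use latency claim falls out of this shape: once every CPU that was inside a section at advance time exits, the poll succeeds, so the bound is the longest critical section rather than a clock tick.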

The one wrinkle is that SMR zones can never be without a free bucket. If you have to synchronize() for every call to free, the performance degradation is massive. In effect this means there is always one bucket allocated per CPU per SMR zone. We can force the bucket to drain in low memory conditions but the bucket itself will never be freed. In the worst case you will synchronize every time you fill a bucket, which is 1/N of frees where N is likely the max bucket size. This condition only happens if memory allocation fails for the bucket zone, and at that point the system is likely crawling along stuck in VM_WAIT anyhow.

I am a little uncertain of a few of my fences and would appreciate another eye on those. The current implementation is a bit barebones. I have been validating with dtrace. I see as many as 50,000 frees before we have to do a scan. The vast majority of the time the epoch is before min_epoch and only an atomic load and a comparison are necessary. In build profiles the synchronization does not register, or barely at all (< .01% cpu). I have verified that things are clocking forward as expected. I handle wrapping with modular arithmetic which I should probably place in macros a la tcp_seq, but I have not yet tested wrapping.
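The tcp_seq-style wrap handling mentioned above amounts to comparing sequence numbers through a signed modular subtraction. A sketch of what such macros could look like (hypothetical names; the committed code may spell them differently):

```c
#include <stdint.h>

/*
 * Wrap-safe sequence comparisons in the style of <netinet/tcp_seq.h>:
 * the subtraction is reduced modulo 2^32 and interpreted as signed, so
 * the result is correct as long as the two values are within 2^31 of
 * each other.
 */
#define	SMR_SEQ_LT(a, b)	((int32_t)((a) - (b)) < 0)
#define	SMR_SEQ_LEQ(a, b)	((int32_t)((a) - (b)) <= 0)
#define	SMR_SEQ_GT(a, b)	((int32_t)((a) - (b)) > 0)
#define	SMR_SEQ_GEQ(a, b)	((int32_t)((a) - (b)) >= 0)
```

The payoff is that a sequence just past the 2^32 wrap still compares as "after" one just before it, so nothing special has to happen at the wrap point.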

Test Plan

This is passing stress2 on my machine. I am getting a 192 core loaner from Intel to continue profiling. I am testing builds and I/O.

I need to write some kind of smr stress test to ensure that there is no access-after-free.

Diff Detail

rS FreeBSD src repository

Event Timeline

This revision now requires review to proceed.Jan 23 2020, 7:08 PM
This revision is now accepted and ready to land.Jan 23 2020, 10:08 PM

Delay uz_dtor() until the memory is safe to be reclaimed. This makes UMA SMR
compatible with existing use-after-free detection and allows zones to specify
typed callbacks to be notified when an item expires. The one limitation of
this system is that the udata argument to uma_zfree_arg() is not preserved.

Implement a very simple smr test that of course found bugs.

Fix my embarrassing bugs which are fortunately not performance impacting.
Document one particularly hairy race condition that I did not consider before
but which is harmless with the exception of some extra bookkeeping. I will
revisit this later to see if there is a more elegant solution.

This revision now requires review to proceed.Jan 25 2020, 2:28 AM

I have some quantitative data to support my performance claims. In my test SMR is ~3x faster while consuming 5% as much memory vs epoch.

I wrote a uma performance testing kernel module. It simulates queuing and dequeuing 2k network packets. Every CPU runs a thread that allocates N packets and picks a random other CPU to queue them to. It then dequeues its own packets. There is some memory access to the allocated buffers. I don't actually need SMR semantics but I can optionally run this with epoch, with uma smr, or with plain uma calls. With smr or epoch I simply enter and exit a section around queues and dequeues to create some activity. I time a certain workload and then look at how much memory UMA needed in total to support it.
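A toy model of the harness described above, loosely following that shape (this is an illustrative userland sketch, not the actual kernel module): each thread allocates a batch of 2k "packets", queues them to a pseudo-randomly chosen peer, then drains and frees whatever has landed on its own queue. Mutex-protected lists stand in for the real structures, no SMR section is simulated, and the sizes are tiny so it runs quickly.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define	NTHREADS	4
#define	BATCH		8
#define	ROUNDS		50
#define	PKTSIZE		2048

struct pkt {
	struct pkt	*next;
	char		 data[PKTSIZE];
};

struct pktq {
	pthread_mutex_t	 lock;
	struct pkt	*head;
};

static struct pktq queues[NTHREADS];
static atomic_long nalloc, nfree;

static void *
worker(void *arg)
{
	int self = (int)(intptr_t)arg;
	unsigned seed = self + 1;
	struct pkt *p, *next;
	int r, i, target;

	for (r = 0; r < ROUNDS; r++) {
		/* Allocate a batch and queue it to a random peer. */
		target = rand_r(&seed) % NTHREADS;
		for (i = 0; i < BATCH; i++) {
			p = malloc(sizeof(*p));
			memset(p->data, 0, 64);	/* touch the buffer */
			atomic_fetch_add(&nalloc, 1);
			pthread_mutex_lock(&queues[target].lock);
			p->next = queues[target].head;
			queues[target].head = p;
			pthread_mutex_unlock(&queues[target].lock);
		}
		/* Drain and free our own queue. */
		pthread_mutex_lock(&queues[self].lock);
		p = queues[self].head;
		queues[self].head = NULL;
		pthread_mutex_unlock(&queues[self].lock);
		for (; p != NULL; p = next) {
			next = p->next;
			atomic_fetch_add(&nfree, 1);
			free(p);
		}
	}
	return (NULL);
}

static void
run_bench(void)
{
	pthread_t tids[NTHREADS];
	struct pkt *p, *next;
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_mutex_init(&queues[i].lock, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, worker, (void *)(intptr_t)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	/* Free anything still queued after the workers exit. */
	for (i = 0; i < NTHREADS; i++) {
		for (p = queues[i].head; p != NULL; p = next) {
			next = p->next;
			atomic_fetch_add(&nfree, 1);
			free(p);
		}
	}
}
```

In the real module the interesting numbers are the wall-clock time for a fixed workload and the total pages UMA needed to sustain it; here the point is only the cross-CPU alloc-here/free-there traffic pattern.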

Here are results on a 32 thread, 2 domain thread ripper:

vanilla uma:
UMA Perf: 32 threads did 320000000 allocs and 320000000 frees: 124775MB/s
tput/cpu 3899, time 5009ms, packets/s 63885000, packets/s cpu 1996406
20933 pages

epoch:
UMA Perf: 32 threads did 320000000 allocs and 320000000 frees: 41564MB/s
tput/cpu 1298, time 15037ms, packets/s 21280000, packets/s cpu 665000
450661 pages

uma smr:
UMA Perf: 32 threads did 320000000 allocs and 320000000 frees: 112107MB/s
tput/cpu 3503, time 5575ms, packets/s 57399000, packets/s cpu 1793718
20401 pages

I ran each test three times but have pasted only the median result for each. The pages fluctuated by ~30% but throughput was actually very stable.

There is a 10% performance impact to using SMR vs no synchronization and virtually no additional memory overhead, although the stddev is higher on memory use. I will continue to do perf analysis, but the impact is more likely due to making UMA itself more expensive and very unlikely to have anything to do with the actual smr synchronization.

jeff added a subscriber: mckusick.
1 ↗(On Diff #67270)

/sys/tools is just used for stuff relating to kernel build infrastructure. This should really go in /tools/test.

462 ↗(On Diff #67270)

This is fine because callers of zone_alloc_bucket() and zone_free_bucket() already check bucketdisable, right?

209 ↗(On Diff #67270)

Should we perhaps use a named constant akin to EPOCH_INIT here? EPOCH_INACTIVE?

55 ↗(On Diff #67270)

FWIW I tend to prefer the use of atomic_load/atomic_store over the volatile qualifier. It makes the code more straightforward to follow since information about accesses is encoded in the code itself rather than in the type definition.

462 ↗(On Diff #67270)

Yes. The one wrinkle with UMA integration is that you more or less can't run without a per-cpu bucket in a SMR zone. If you can't get a bucket it means you're going to serialize on every free. So I changed the bucketdisable logic to ignore SMR zones and added the bit that preserves the existing bucket if you can't allocate so that you get at least N frees per-synchronization event.

You could push the epoch # into the slab layer and force it to handle smr as well and get rid of the constraints on buckets. But it's a lot more complicated for questionable gain.

4351 ↗(On Diff #67270)

I just noticed that the keg does not need the SMR flag.

533 ↗(On Diff #67270)

This should be moved to consume one of the spare pointers in the read-only section.

209 ↗(On Diff #67270)

Sure can.

55 ↗(On Diff #67270)

Yeah that is probably good form.

One thing that I like about rcu on linux is that they define a number of macros that are not strictly necessary but denote which fields are being accessed and modified within a section. It is purely for the reader but I think worth the effort.


Use sequence instead of epoch to avoid confusion with the epoch algorithm.
epoch becomes wr_seq, the write sequence number.
epoch_min becomes rd_seq, the read sequence number.

Take a pass through comments to attempt to improve clarity. Add a
little graphical tldr.
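The rename can be summarized with a sketch of the state involved (field names here are illustrative, not necessarily the committed layout): a shared pair of sequence hands plus a per-cpu copy of the write sequence each reader observed on entry, with the invariant rd_seq <= c_seq <= wr_seq (modulo 2^32) for every cpu in a section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shapes illustrating the wr_seq/rd_seq naming. */
struct smr_shared {
	uint32_t	s_wr_seq;	/* write hand, advanced by freers */
	uint32_t	s_rd_seq;	/* read hand, a lazy lower bound */
};

struct smr_pcpu {
	uint32_t	c_seq;		/* wr_seq observed at section entry */
};

/* The invariant for a cpu inside a section: rd_seq <= c_seq <= wr_seq. */
static bool
smr_invariant(const struct smr_shared *s, const struct smr_pcpu *c)
{
	return ((int32_t)(c->c_seq - s->s_rd_seq) >= 0 &&
	    (int32_t)(s->s_wr_seq - c->c_seq) >= 0);
}
```

The signed-difference comparisons keep the invariant meaningful across the 2^32 wrap of the sequence space.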

64–67 ↗(On Diff #67303)

The language describing the invariant is not very good.

58 ↗(On Diff #67303)

You might note that the global rd_seq is lazily updated, so it is really just a lower bound.

127 ↗(On Diff #67303)

... but may need to spin until the difference wr_seq - rd_seq is smaller than SMR_SEQ_MAX_DELTA.

185 ↗(On Diff #67303)

Would it be worth adding a counter for instances of this scenario?

246 ↗(On Diff #67303)

This is a dead store.

47 ↗(On Diff #67303)

"sequence numbers"?


175 ↗(On Diff #67303)

"the current write sequence value"?

180 ↗(On Diff #67303)

Shouldn't this be using an atomic_load?

494 ↗(On Diff #67303)

Shouldn't we be using the named constant SMR_SEQ_INVALID here and elsewhere in UMA?

544 ↗(On Diff #67303)

It would be nice to have a counter for this case.

546 ↗(On Diff #67303)

Can't we optimize this a bit by writing dtor = (uz_flags & UMA_ZFLAG_CTORDTOR) != 0 || UMA_ALWAYS_CTORDTOR?

577 ↗(On Diff #67303)

This means that the bucket cache is now FIFO for all zones, and zone_fetch_bucket() is getting the least recently used bucket. Can't we use LRU only for SMR zones? It is also kind of weird that bucket_cache_reclaim() still fetches buckets from the tail of the queue. For SMR zones this increases the likelihood that a draining thread will have to poll while waiting for the read sequence number to advance.

202 ↗(On Diff #67303)

This needs to be updated with the new SMR flag.

185 ↗(On Diff #67303)

Definitely counters & sysctl are on my todo. I have been using dtrace and hwpmc to verify.

246 ↗(On Diff #67303)

I just wanted to be sure the compiler wouldn't confuse it for an uninitialized value. It is accessed after the for loop.

546 ↗(On Diff #67303)

Yes I can make the test an inline since it would be repeated.

577 ↗(On Diff #67303)

Yes I can restore this. Your observation about reclaim is important. I will fix both.

Address some review feedback. Still a few comments to clear up.

Rename the 'global' structure the 'shared' structure as I believe that
expresses things more clearly.

Improve smrstress with multiple writers.

577 ↗(On Diff #67303)

I have my doubts about the efficacy of this given that we have two buckets already between alloc/free. So data in uzd_buckets is going to be stale and going to a different cpu. However, I have restored it. I'm just going to have reclaim pop the head though. Even if there is some minor improvement in cache effect it is rare enough that it won't matter and I'm sensitive to adding tons of special cases to support this.

252–268 ↗(On Diff #67342)

I need to think about this some more... I think this may still be a problem. Isn't it possible that the same race results in us observing SMR_SEQ_INVALID here instead of the stale value? I guess this is done this way to avoid a loop in smr_enter, but would a __predict_false loop be so bad?

293–294 ↗(On Diff #67342)

This would be easier to read with a SMR_SEQ_MIN(x, y) macro. Ditto above.

177 ↗(On Diff #67303)

Why not just atomic_add?

179–185 ↗(On Diff #67303)

Is it legal to smr_advance() inside an SMR section? I think no, and we should KASSERT. Otherwise, there's a potential deadlock here if a CPU is in an SMR section and tries to advance too far, where it would attempt to wait on itself.

Separately, can you state the rationale for waiting until goal (completely caught up)? A couple of other options might be goal - SMR_SEQ_MAX_DELTA + 1 (minimally caught up), or goal - SMR_SEQ_MAX_DELTA/2 (somewhere in between). I think a minimum wait may be better; after all, it wouldn't restrict how much we actually advance the read hand.

234 ↗(On Diff #67303)

SMR_SEQ_GEQ(g_rd_seq, goal) (vs GT) reads like an off-by-one according to the comment, but I think you are actually just folding in the separate case s_rd_seq == goal, which is in range (and therefore not skippable) but means that all readers have observed it. For clarity it would be good to either call this out or have a separate comment and test for it.

242 ↗(On Diff #67303)

Redundant, success is already true.

246 ↗(On Diff #67303)

Spell it c_seq = SMR_SEQ_INVALID ?

47–48 ↗(On Diff #67303)

Do you want to replace "epoch" here with "sequence"?

50–53 ↗(On Diff #67303)

Pedantically, these should be int32_t, or else they would break with an int > 4 bytes.

114 ↗(On Diff #67303)

This is just for the KASSERT, right? You're not trying to forcibly load the c_global? In that case I suggest just spelling the KASSERT with smr->c_global->g_name and losing the local.

179–180 ↗(On Diff #67303)

Blank line for style.

2646 ↗(On Diff #67303)

This placement is a little surprising with smr having moved out from under uma. You might consider if you think it would read better in vm_init() now.

252–268 ↗(On Diff #67342)

I still believe this is safe. The race would be that the cpu had loaded a value that is now stale and has not stored it yet. Imagine a hypervisor thread was preempted for a long time. However, in this case, we are guaranteed that no user data has been loaded yet. Which means that the local copy of g_wr_seq which was fetched before c_seq is guaranteed to be less than or equal to the sequence number of the actual data accesses that follow. Since we don't allow rd_seq to advance beyond the wr_seq that we observed at the start of the function we can not advance unsafely.

I had looked at a loop but it has other implications and I think as long as I can prove it's not strictly necessary it will just have to be a quirk with a lot of comments.

It does bring to mind that I have not thought extensively about what happens if the thread running poll() is preempted and looking at very stale values. That is perhaps more concerning. I need to re-evaluate with this in mind and possibly reload the global values after spinwaits.

252–268 ↗(On Diff #67342)

Never mind about SMR_SEQ_INVALID, I see why that doesn't matter.

I think this is okay, I get that we'd rather handle this case in smr_poll() than by looping in smr_enter().

4109–4117 ↗(On Diff #67342)

What case is this covering? If it is the uma_zfree_arg() path, it looks wrong (we pass SKIP_NONE for SMR), and I don't see any other callers that pass SKIP_DTOR.

Also comparison with == feels more fragile than a comparison with </<= (e.g. because of SKIP_CNT). Should we add a new SKIP_SYNC?

3723 ↗(On Diff #67342)

I just noticed that this means that uma_zfree_arg() can't pass non-NULL udata when freeing to an SMR zone. It's probably not a big deal (consumers can embed a pointer in the structure itself if desired), but we should probably have an assert to catch that.

Review feedback. Improved comments. Some minor fixes.

Introduce two new required functions:

This is partially to simplify and streamline the implementation, but mostly
to make it clear to consumers which they are using and to assert that they
are not mixing and matching.

One more performance update: this has less than half the cost of enter()/exit() vs epoch, ~200 cycles vs ~500 cycles on my simple benchmark.

I tried using tsc for the wr_seq but this did not improve the overhead. I have not analyzed further than that.

229–230 ↗(On Diff #67491)

I was experimenting with a 64bit value so I could use tsc instead of a synthetic clock and this was necessary. It is likely valuable on its own and should be committed separately.

Just dropping some nitpick comments before heading to bed. I looked through the interdiff for the latest update, but I was still intending to take another pass through the rest of the uma_core.c changes.

57 ↗(On Diff #67491)

"a the"

72–74 ↗(On Diff #67491)

"Readers never advance [either] sequence number"

3700–3702 ↗(On Diff #67342)

Is this deliberately missing from uma_zfree_smr? It seems like a potential point of confusion, being different from free() and uma_zfree/_arg(). If so, maybe KASSERT(item != NULL) there instead of crashing in pmap_kextract so that misuse may be more easily diagnosed?

229 ↗(On Diff #67491)

More parens for hygiene, although obviously safe in practice below.

2902–2903 ↗(On Diff #67491)

"static inline" now? I think the compiler may do it anyway, but it may better capture the intent.

3102–3128 ↗(On Diff #67491)

Could consider moving this to a static inline and just passing udata=NULL from SMR instead of c&p, or do you see this and the SMR path diverging further in the future?

3786–3790 ↗(On Diff #67491)

Better to do domain = itemdomain = 0 here too, like in uma_zfree_arg. Else if !NUMA then itemdomain is uninitialized, and a compiler may get angry when it is passed to cache_free() (although it is harmless).

markj added inline comments.
546 ↗(On Diff #67303)

Sorry, I don't quite follow what you mean. I'm just pointing out that we may be unnecessarily iterating over the bucket entries, same in bucket_drain().

969 ↗(On Diff #67491)

&& should come on the previous line.

3128 ↗(On Diff #67491)

Missing parens.

This revision is now accepted and ready to land.Jan 30 2020, 8:57 PM
jeff marked an inline comment as done.

Review feedback. Minor nits.

This revision now requires review to proceed.Jan 30 2020, 11:16 PM
This revision was not accepted when it landed; it landed in state Needs Review.Jan 31 2020, 12:50 AM
This revision was automatically updated to reflect the committed changes.