
UMA: Avoid polling for an invalid read sequence number.
Abandoned (Public)

Authored by markj on Aug 4 2020, 8:51 PM.

Diff Detail

Lint: Passed
Unit: No Test Coverage
Build Status: Buildable 32760, Build 30195 (arc lint + arc unit)

Event Timeline

markj requested review of this revision. Aug 4 2020, 8:51 PM
markj created this revision.
sys/vm/uma_core.c:905

This load is unsynchronized.

@pho can you see if the buf_vlist_add() panic is still reproducible with this patch?

cem added a subscriber: cem.

This fix looks good to me. I'm not confident it's the only thing we need re: preallocated zones, but I am not super familiar with UMA.

This revision is now accepted and ready to land. Aug 4 2020, 9:15 PM

@pho can you see if the buf_vlist_add() panic is still reproducible with this patch?

Sure.

In D25952#575285, @cem wrote:

This fix looks good to me. I'm not confident it's the only thing we need re: preallocated zones, but I am not super familiar with UMA.

I think even without SMR the preallocation+NOFREE mechanism isn't foolproof. Free items may be cached in per-CPU caches not accessible to an allocating thread. In a low-memory situation we will try to reclaim items from per-CPU caches, but there's no guarantee that this will happen quickly enough to avoid an allocation failure. In fact, in Peter's report we can see that the uma_reclaim worker thread is waiting on a runqueue after having bound itself to a specific CPU in order to reclaim per-CPU items. In practice I suspect that, since the bufdaemon will run before we get too close to the upper limit, we just never hit this problem. In other words, you are right that there's still a problem here and SMR makes it worse in theory, but I suspect it'll still be very hard to hit, at least on large-memory systems.

In D25952#575285, @cem wrote:

This fix looks good to me. I'm not confident it's the only thing we need re: preallocated zones, but I am not super familiar with UMA.

I think even without SMR the preallocation+NOFREE mechanism isn't foolproof. Free items may be cached in per-CPU caches not accessible to an allocating thread. In a low-memory situation we will try to reclaim items from per-CPU caches, but there's no guarantee that this will happen quickly enough to avoid an allocation failure. In fact, in Peter's report we can see that the uma_reclaim worker thread is waiting on a runqueue after having bound itself to a specific CPU in order to reclaim per-CPU items. In practice I suspect that, since the bufdaemon will run before we get too close to the upper limit, we just never hit this problem. In other words, you are right that there's still a problem here and SMR makes it worse in theory, but I suspect it'll still be very hard to hit, at least on large-memory systems.

Is there any rigorous way we can put an upper bound on the slop (mp_ncpus * per-CPU cache maximum) and pre-allocate that much?

In D25952#575328, @cem wrote:
In D25952#575285, @cem wrote:

This fix looks good to me. I'm not confident it's the only thing we need re: preallocated zones, but I am not super familiar with UMA.

I think even without SMR the preallocation+NOFREE mechanism isn't foolproof. Free items may be cached in per-CPU caches not accessible to an allocating thread. In a low-memory situation we will try to reclaim items from per-CPU caches, but there's no guarantee that this will happen quickly enough to avoid an allocation failure. In fact, in Peter's report we can see that the uma_reclaim worker thread is waiting on a runqueue after having bound itself to a specific CPU in order to reclaim per-CPU items. In practice I suspect that, since the bufdaemon will run before we get too close to the upper limit, we just never hit this problem. In other words, you are right that there's still a problem here and SMR makes it worse in theory, but I suspect it'll still be very hard to hit, at least on large-memory systems.

Is there any rigorous way we can put an upper bound on the slop (mp_ncpus * per-CPU cache maximum) and pre-allocate that much?

We have a mechanism to reserve items in a keg; they only get allocated when M_USE_RESERVE is specified in the allocation request. We could keep a number of reserved items and allocate them if a regular allocation attempt fails. The minimum required size of the reservation is a bit tricky to calculate: we have up to three buckets per CPU (alloc, free, cross-domain) and we might need to reserve that many items per NUMA domain, I'm not sure.
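For concreteness, a minimal sketch of what a reserve-backed fallback could look like. uma_zone_reserve(), uma_prealloc(), uma_zalloc_smr() and M_USE_RESERVE are the existing interfaces; the reserve size, the function names, and the extern declaration are invented for illustration only.

```c
/*
 * Hypothetical sketch: carve out a keg reserve and dip into it only when a
 * normal allocation attempt fails.  BUF_TRIE_RESERVE is a placeholder value,
 * not an existing tunable.
 */
#include <sys/param.h>
#include <sys/malloc.h>
#include <vm/uma.h>

#define	BUF_TRIE_RESERVE	128		/* placeholder sizing */

extern uma_zone_t buf_trie_zone;		/* assumed visible; really static in vfs_subr.c */

static void
buf_trie_reserve_init(void)
{
	/* Items below this watermark are handed out only with M_USE_RESERVE. */
	uma_zone_reserve(buf_trie_zone, BUF_TRIE_RESERVE);
	uma_prealloc(buf_trie_zone, BUF_TRIE_RESERVE);
}

static void *
buf_trie_alloc_fallback(void)
{
	void *item;

	/* Try a normal allocation first; fall back to the reserve on failure. */
	item = uma_zalloc_smr(buf_trie_zone, M_NOWAIT);
	if (item == NULL)
		item = uma_zalloc_smr(buf_trie_zone, M_NOWAIT | M_USE_RESERVE);
	return (item);
}
```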

With SMR things get more complicated. Normally read sections are very short, but in principle they are unbounded; in a VM, a thread running in a read section might vmexit, and the corresponding host thread might be preempted.

It is probably simpler to find a way to fix this in the buf trie code. Is there some reason we can't embed a trie node in each buf? (Ignoring for now the fact that the PCTRIE code wants nodes to be dynamically allocated.)

Is there some reason we can't embed a trie node in each buf? (Ignoring for now the fact that the PCTRIE code wants nodes to be dynamically allocated.)

Never mind, I forgot that keys are stored in the leaves.

In D25952#575328, @cem wrote:

Is there any rigorous way we can put an upper bound on the slop (mp_ncpus * per-CPU cache maximum) and pre-allocate that much?

We have a mechanism to reserve items in a keg; they only get allocated when M_USE_RESERVE is specified in the allocation request. We could keep a number of reserved items and allocate them if a regular allocation attempt fails.

Is there any point in having separate reserved items vs. regular preallocated items for the buf tries? All allocations are the same and must not fail, so I'm not sure M_USE_RESERVE-only items help here.

The minimum required size of the reservation is a bit tricky to calculate: we have up to three buckets per CPU (alloc, free, cross-domain) and we might need to reserve that many items per NUMA domain, I'm not sure.

That seems like a decent starting place, at least.

With SMR things get more complicated. Normally read sections are very short, but in principle they are unbounded;

Well, they are critical sections, so as far as FreeBSD's locking hierarchy is concerned, they are very short. Practically, the read sections for the buf tries should be very short.

in a VM, a thread running in a read section might vmexit, and the corresponding host thread might be preempted.

Byzantine VM behavior isn't a major concern for me at the moment. Sure, VMs can do this, but they can do this in violation of all of our locking models — it is not unique to SMR.

It is probably simpler to find a way to fix this in the buf trie code. Is there some reason we can't embed a trie node in each buf? (Ignoring for now the fact that the PCTRIE code wants nodes to be dynamically allocated.)

I don't think embedding trie nodes in the bufs is viable or really helps; it just adds a 1:1 mapping between specific bufs and trie UMA items. But the lifetimes are different, and embedding the trie nodes makes allocation difficult. Tries aren't RB-trees.

Maybe for the buf trie code we could do the following:

(1) Preallocate 3 x Bucket size x CPUs additional items.
(2) Spin and retry on failure, external to UMA.

Should we emit a warning or bound the spin time or anything? The vast majority of the time we will never spin, but if we never get an item, it may indicate a UMA or buftrie bug.
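For what it's worth, a rough sketch of the spin-and-retry idea in (2), with an invented retry threshold, wait message, and warning; whether the caller is actually allowed to sleep here depends on the locks it holds, which the real buf trie callers may not permit.

```c
/*
 * Sketch only: the retry threshold and warning are made up, and the pause()
 * assumes the caller may sleep briefly.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <vm/uma.h>

extern uma_zone_t buf_trie_zone;	/* assumed visible; really static in vfs_subr.c */

static void *
buf_trie_alloc_retry(void)
{
	void *item;
	int tries;

	for (tries = 0; ; tries++) {
		item = uma_zalloc_smr(buf_trie_zone, M_NOWAIT);
		if (item != NULL)
			return (item);
		if (tries == 1000)
			printf("buf_trie_alloc_retry: still waiting for a "
			    "trie node; possible UMA or buf trie bug\n");
		/* Give other CPUs a chance to return items to the zone. */
		pause("buftrie", 1);
	}
}
```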

I realized that this patch does not fix the bug: calling smr_poll() with a goal of SMR_SEQ_INVALID is basically harmless, since in the common case that value will lie outside the range [rd_seq, wr_seq] and will be treated like a valid sequence number. Worth fixing anyway, I think.

This morning I looked a bit at the vmcore from Peter. I think there's a more direct problem unrelated to SMR: cache_alloc() is short-circuited by bucketdisable before it checks the full bucket cache. In that case it goes directly to the keg, but if all of the preallocated items are cached in the full bucket list, it won't find any of them in the keg, and an attempt to allocate a new slab fails because we have no free memory. A partial fix is to check the per-domain cache before checking bucketdisable. I will post another patch shortly. The problem was introduced by some recent-ish refactoring in UMA and possibly made easier to trigger by the conversion to SMR.
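Roughly, the partial fix reorders things like the following heavily simplified sketch; cache_fetch_item() and keg_fetch_item() are invented stand-ins for the real cache_alloc()/keg paths, and only bucketdisable is a real variable.

```c
/* Invented stand-ins for the real per-domain cache and keg paths. */
static void *cache_fetch_item(uma_zone_t zone, int domain);
static void *keg_fetch_item(uma_zone_t zone, int domain, int flags);

static void *
cache_alloc_sketch(uma_zone_t zone, int domain, int flags)
{
	void *item;

	/*
	 * Partial fix: consult the per-domain full-bucket cache before the
	 * bucketdisable short-circuit, so that preallocated items parked in
	 * full buckets are still found when free memory is exhausted.
	 */
	item = cache_fetch_item(zone, domain);
	if (item != NULL)
		return (item);

	/*
	 * With the cache exhausted, fall back to the keg as before; this can
	 * still fail if no fresh slab can be allocated.  The real
	 * cache_alloc() would also try to refill per-CPU buckets here when
	 * bucketdisable is not set.
	 */
	return (keg_fetch_item(zone, domain, flags));
}
```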

In D25952#575503, @cem wrote:
In D25952#575328, @cem wrote:

Is there any rigorous way we can put an upper bound on the slop (mp_ncpus * per-CPU cache maximum) and pre-allocate that much?

We have a mechanism to reserve items in a keg; they only get allocated when M_USE_RESERVE is specified in the allocation request. We could keep a number of reserved items and allocate them if a regular allocation attempt fails.

Is there any point in having separate reserved items vs. regular preallocated items for the buf tries? All allocations are the same and must not fail, so I'm not sure M_USE_RESERVE-only items help here.

The minimum required size of the reservation is a bit tricky to calculate: we have up to three buckets per CPU (alloc, free, cross-domain) and we might need to reserve that many items per NUMA domain, I'm not sure.

That seems like a decent starting place, at least.

With SMR things get more complicated. Normally read sections are very short, but in principle they are unbounded;

Well, they are critical sections, so as far as FreeBSD's locking hierarchy is concerned, they are very short. Practically, the read sections for the buf tries should be very short.

in a VM, a thread running in a read section might vmexit, and the corresponding host thread might be preempted.

Byzantine VM behavior isn't a major concern for me at the moment. Sure, VMs can do this, but they can do this in violation of all of our locking models — it is not unique to SMR.

In violation how, exactly?

It is probably simpler to find a way to fix this in the buf trie code. Is there some reason we can't embed a trie node in each buf? (Ignoring for now the fact that the PCTRIE code wants nodes to be dynamically allocated.)

I don't think embedding trie nodes in the bufs is viable or really helps; it just adds a 1:1 mapping between specific bufs and trie UMA items. But the lifetimes are different, and embedding the trie nodes makes allocation difficult. Tries aren't RB-trees.

That specific idea doesn't really help, I just didn't have enough coffee. But I think all of this could be streamlined a bit. Right now we maintain two tries and two linked lists, for clean and dirty buffers. Almost all trie lookups check both tries. The exception is buf_vlist_add(), which uses BUF_PCTRIE_LOOKUP_LE to help ensure that the linked list is kept sorted. Outside of reassignbuf(), it is called only by bgetvp(), which always inserts a clean buf.

We could simply merge the two tries (not the linked lists), so reassignbuf() never has to allocate a trie node and allocation failures become easier to deal with. That makes buf_vlist_add() more expensive. One way to mitigate that would be to augment the trie: in each node, maintain a bitmap with a bit for each child. Set the bit for a child node if there is at least one clean buf in the subtree rooted at that child. bgetvp() always inserts a clean buf, so when it calls buf_vlist_add() it can quickly find the preceding clean buf by checking the bitmap before descending a level. That is, it can skip over long ranges of dirty bufs quickly. Insertion and deletion have to ensure that the bitmaps are kept up to date. Deletion becomes a bit more expensive when removing the last clean buf from a subtree, since you potentially have to ascend back from the deleted leaf to update bitmaps. I expect that would be a rare case though.
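To make the shape of that concrete, a purely hypothetical sketch of an augmented node and the bounded "preceding clean buf" lookup. The structure, names, and the fixed-depth layout (no path compression) are all invented and do not match the real pctrie.

```c
/*
 * Invented sketch: a fixed-depth trie node carrying a "has a clean buf"
 * bitmap per child.  None of these names exist in the tree.
 */
#include <sys/param.h>
#include <sys/buf.h>

#define	BTRIE_WIDTH	4
#define	BTRIE_COUNT	(1 << BTRIE_WIDTH)

struct btrie_node {
	uint16_t	bn_clean_mask;		/* bit i: child i's subtree has a clean buf */
	int		bn_level;		/* 0: children are leaf bufs */
	void		*bn_child[BTRIE_COUNT];
};

static int
btrie_slot(const struct btrie_node *node, daddr_t lblkno)
{
	return ((lblkno >> (node->bn_level * BTRIE_WIDTH)) & (BTRIE_COUNT - 1));
}

/*
 * Find the clean buf with the largest lblkno <= the given one, skipping any
 * subtree whose clean bit is clear.  This is the lookup buf_vlist_add()
 * needs in order to keep the clean tailq sorted.
 */
static struct buf *
btrie_lookup_clean_le(const struct btrie_node *node, daddr_t lblkno,
    bool bounded)
{
	struct buf *bp;
	int slot, start;

	if (node == NULL)
		return (NULL);
	start = bounded ? btrie_slot(node, lblkno) : BTRIE_COUNT - 1;
	for (slot = start; slot >= 0; slot--) {
		if ((node->bn_clean_mask & (1u << slot)) == 0)
			continue;	/* only dirty bufs (or nothing) below */
		if (node->bn_level == 0) {
			bp = node->bn_child[slot];
			if (bp != NULL && (!bounded || bp->b_lblkno <= lblkno))
				return (bp);
		} else {
			bp = btrie_lookup_clean_le(node->bn_child[slot],
			    lblkno, bounded && slot == start);
			if (bp != NULL)
				return (bp);
		}
	}
	return (NULL);
}
```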

With this, reassignbuf() would only have to do a single lookup to update bitmaps, instead of a delete+insert, and it would never need to allocate pctrie nodes. bgetvp() still needs to perform allocations, but it can sleep if it unlocks the bufobj after an allocation failure. Code that does lookups for either clean or dirty bufs just needs to search a single trie instead of two.

This is perhaps all too complicated or specialized to be worth implementing, but maybe there is a better solution.

Maybe for the buf trie code we could do the following:

(1) Preallocate 3 x Bucket size x CPUs additional items.
(2) Spin and retry on failure, external to UMA.

Should we emit a warning or bound the spin time or anything? The vast majority of the time we will never spin, but if we never get an item, it may indicate a UMA or buftrie bug.

The idea of using a reserve would be to have a fallback mechanism if a regular allocation attempt fails. It ensures that some number of items is available to all threads.

For (2), we need to make sure that something is advancing the write sequence number after an allocation failure. Right now that only happens when freeing items, but we can't rely on that to happen while spinning after an allocation failure.

I realized that this patch does not fix the bug: calling smr_poll() with a goal of SMR_SEQ_INVALID is basically harmless, since in the common case that value will lie outside the range [rd_seq, wr_seq] and will be treated like a valid sequence number. Worth fixing anyway, I think.

This morning I looked a bit at the vmcore from Peter. I think there's a more direct problem unrelated to SMR: cache_alloc() is short-circuited by bucketdisable before it checks the full bucket cache. In that case it goes directly to the keg, but if all of the preallocated items are cached in the full bucket list, it won't find any of them in the keg, and an attempt to allocate a new slab fails because we have no free memory. A partial fix is to check the per-domain cache before checking bucketdisable. I will post another patch shortly. The problem was introduced by some recent-ish refactoring in UMA and possibly made easier to trigger by the conversion to SMR.

I noticed bucketdisable earlier but didn't realize it was an issue or raise it outside of the sidebar thread with Peter and Mateusz, oops.

In D25952#575503, @cem wrote:

Byzantine VM behavior isn't a major concern for me at the moment. Sure, VMs can do this, but they can do this in violation of all of our locking models — it is not unique to SMR.

In violation how, exactly?

A hypervisor thread could be preempted, swapped out, and effectively go to sleep for an unbounded time while the guest kernel holds a spin lock. That violates our (guest) lock invariants: acquiring a spinlock is cheap, and holders cannot block on sleep mutexes or sleep. But if the VM thread is swapped out, that isn't true.

That specific idea doesn't really help, I just didn't have enough coffee. But I think all of this could be streamlined a bit.

Yes.

Right now we maintain two tries and two linked lists, for clean and dirty buffers. Almost all trie lookups check both tries. The exception is buf_vlist_add(), which uses BUF_PCTRIE_LOOKUP_LE to help ensure that the linked list is kept sorted. Outside of reassignbuf(), it is called only by bgetvp(), which always inserts a clean buf.

We could simply merge the two tries (not the linked lists), so reassignbuf() never has to allocate a trie node and allocation failures become easier to deal with. That makes buf_vlist_add() more expensive. One way to mitigate that would be to augment the trie: in each node, maintain a bitmap with a bit for each child. Set the bit for a child node if there is at least one clean buf in the subtree rooted at that child. bgetvp() always inserts a clean buf, so when it calls buf_vlist_add() it can quickly find the preceding clean buf by checking the bitmap before descending a level. That is, it can skip over long ranges of dirty bufs quickly. Insertion and deletion have to ensure that the bitmaps are kept up to date. Deletion becomes a bit more expensive when removing the last clean buf from a subtree, since you potentially have to ascend back from the deleted leaf to update bitmaps. I expect that would be a rare case though.

With this, reassignbuf() would only have to do a single lookup to update bitmaps, instead of a delete+insert, and it would never need to allocate pctrie nodes. bgetvp() still needs to perform allocations, but it can sleep if it unlocks the bufobj after an allocation failure. Code that does lookups for either clean or dirty bufs just needs to search a single trie instead of two.

This is perhaps all too complicated or specialized to be worth implementing, but maybe there is a better solution.

It does seem reasonable, but adds quite a bit of complexity. Integrating it with the generic PCTRIE macros seems potentially difficult. I'm not sure the delete and insert is even a significant cost.

The idea of using a reserve would be to have a fallback mechanism if a regular allocation attempt fails. It ensures that some number of items is available to all threads.

The idea being that, unlike regular preallocated items, reserve items are not freed back into per-CPU caches that can't be reached by other threads?

For (2), we need to make sure that something is advancing the write sequence number after an allocation failure. Right now that only happens when freeing items, but we can't rely on that to happen while spinning after an allocation failure.

smr_synchronize() advances the write sequence number and waits for readers to catch up. We can discuss that further in D25950?

One other possibility would be to keep the separate clean/dirty tries, but kill the sorted tailqs. We could traverse a trie like a tailq, and they are inherently sorted. That would be a lot of code churn, though.
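For illustration, a rough sketch of what tailq-free in-order traversal might look like, assuming it sits in vfs_subr.c next to the existing PCTRIE_DEFINE() so the generated BUF_PCTRIE_LOOKUP_GE accessor is visible, and glossing over negative (metadata) lblkno values, which sort after the others as unsigned keys. The callback signature is invented.

```c
/*
 * Sketch only: iterate the clean buf trie in key order instead of walking
 * bo_clean.bv_hd.  fn must not insert or remove bufs while we iterate.
 */
static void
bufobj_foreach_clean(struct bufobj *bo, void (*fn)(struct buf *bp, void *arg),
    void *arg)
{
	struct buf *bp;
	daddr_t lblkno;

	ASSERT_BO_LOCKED(bo);
	lblkno = 0;
	while ((bp = BUF_PCTRIE_LOOKUP_GE(&bo->bo_clean.bv_root,
	    lblkno)) != NULL) {
		fn(bp, arg);
		lblkno = bp->b_lblkno + 1;
	}
}
```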

I will post a modified version of this diff in a new review since the discussion here is somewhat tangential and I haven't already added other reviewers. Based on the new diff I'll make a suggestion for D25950.