
buf_trie_alloc: Spin on alloc failure
Abandoned (Public)

Authored by cem on Aug 4 2020, 6:34 PM.
Details

Summary

Peter Holm reports a low-mem buf_trie_alloc() failure:

panic: buf_vlist_add:  Preallocated nodes insufficient.
cpuid = 9
time = 1596433571
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0339908560
vpanic() at vpanic+0x182/frame 0xfffffe03399085b0
panic() at panic+0x43/frame 0xfffffe0339908610
buf_vlist_add() at buf_vlist_add+0x1c1/frame 0xfffffe0339908640
reassignbuf() at reassignbuf+0x15f/frame 0xfffffe0339908670
bdirty() at bdirty+0x58/frame 0xfffffe03399086b0
bdwrite() at bdwrite+0x92/frame 0xfffffe0339908720
ext2_update() at ext2_update+0x1c0/frame 0xfffffe0339908770
ext2_inactive() at ext2_inactive+0x3f/frame 0xfffffe03399087a0
VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0x59/frame 0xfffffe03399087c0
vinactivef() at vinactivef+0x107/frame 0xfffffe0339908810
vput_final() at vput_final+0x298/frame 0xfffffe0339908870
kern_funlinkat() at kern_funlinkat+0x336/frame 0xfffffe0339908ab0
sys_unlink() at sys_unlink+0x28/frame 0xfffffe0339908ad0

The buf trie UMA zone is preallocated to contain all the entries we will
ever need, so an M_NOWAIT allocation is not expected to fail. The problem
seems to be that SMR deferred-free latency causes preallocated zones to
unpredictably run out of free items. From Peter's report, the "free" count
is a large majority of the total item count, so we would need to
preallocate an unacceptably larger number of items (something like 2x) to
avoid this shortfall, and even that does not seem robust:

db:0:pho>  show uma
          Zone   Size    Used    Free    Requests  Sleeps  Bucket  Total Mem    XFree
      BUF TRIE    144    4708  127172   102807141       0      62  18990720        0
                         ^^^^  ^^^^^^ 96% free

So what is the right way to reliably preallocate SMR zone items, such that
allocation does not need to sleep (or ideally, spin much)? buf_trie_alloc()
and similar SMR consumers could do the spin externally to UMA; I don't see
any obvious place inside UMA to spin, aside from right at the API boundary
(this patch). We could set a special flag on preallocated zones?

Any suggestions welcome.

Reported by: pho
X-MFC-With: r363482
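
The diff itself is not reproduced here. As a rough illustration of the
external-spin approach described above (a sketch only; buf_trie_zone,
uma_zalloc_smr() and cpu_spinwait() are existing names, the retry loop is
the illustrative part):

    /*
     * Sketch: the zone is fully preallocated, so an M_NOWAIT failure
     * should only mean that cached items are still held back by SMR;
     * spin until readers expire and an item becomes available.
     */
    static void *
    buf_trie_alloc(struct pctrie *ptree)
    {
            void *node;

            while ((node = uma_zalloc_smr(buf_trie_zone, M_NOWAIT)) == NULL)
                    cpu_spinwait();
            return (node);
    }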

Diff Detail

Lint: Lint Passed
Unit: No Test Coverage
Build Status: Buildable 32819 (Build 30234: arc lint + arc unit)

Event Timeline

cem requested review of this revision. Aug 4 2020, 6:34 PM
cem created this revision.

Was this on a NUMA system? Please show the full report.

Are all of the free trie nodes in a bucket list? Were any threads executing the read section at the time of the panic? I'm surprised that we weren't able to make progress with that many free items; I wouldn't rule out a bug somewhere in UMA or SMR.

sys/vm/uma_core.c:3353 (On Diff #75390)

This is smr_wait().
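
(For reference, a sketch of the same spin expressed with the existing
smr(9) primitive; the zone/bucket context here is assumed:)

    /*
     * smr_wait() is the spinning form of smr_poll() (wait == true), so
     * the open-coded retry loop reduces to a single call, with the goal
     * taken from the bucket whose items we are waiting to reuse.
     */
    smr_wait(zone->uz_smr, bucket->ub_seq);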


Hmm, I think I see a bug that could cause this. In each full bucket cache we maintain a uzd_seq: the lowest write sequence number for any bucket in the list. When zone_fetch_bucket() takes a full bucket off the head of the list, it updates uzd_seq to the sequence number of the following bucket. The problem is that we might cache full buckets with ub_seq == SMR_SEQ_INVALID in rare cases, and:

        /* SMR Buckets can not be re-used until readers expire. */
        if ((zone->uz_flags & UMA_ZONE_SMR) != 0 &&
            bucket->ub_seq != SMR_SEQ_INVALID) {
                if (!smr_poll(zone->uz_smr, bucket->ub_seq, false))
                        return (NULL);
                bucket->ub_seq = SMR_SEQ_INVALID;
                dtor = (zone->uz_dtor != NULL) || UMA_ALWAYS_CTORDTOR;
                if (STAILQ_NEXT(bucket, ub_link) != NULL)
                        zdom->uzd_seq = STAILQ_NEXT(bucket, ub_link)->ub_seq; // what if this ub_seq == SMR_SEQ_INVALID?
        }

We might cache a full bucket with an invalid sequence number when cache_alloc() allocates a new bucket, fills it, and discovers that it lost a race. So, with uzd_seq == SMR_SEQ_INVALID, cache_fetch_bucket() would short-circuit and never try to allocate from the full bucket list. In Peter's scenario we had two free pages at the time of the panic, so a slab allocation would fail if all of the preallocated slabs were cached in the full bucket list.

Surprisingly, smr_poll() doesn't assert that goal != SMR_SEQ_INVALID, but it pretty clearly assumes it. I think there's a second, related bug here: uzd_seq is not initialized at zone creation time, so it is implicitly left at SMR_SEQ_INVALID.
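
(Purely illustrative and not necessarily what the proposed patch does: one
conservative guard would be to avoid propagating an invalid sequence number
from the following bucket into uzd_seq, e.g.:)

    /*
     * Sketch only, not the committed fix: skip the uzd_seq update when
     * the following bucket carries no valid sequence number, so the
     * short-circuit check in cache_fetch_bucket() cannot get stuck on
     * SMR_SEQ_INVALID.
     */
    if (STAILQ_NEXT(bucket, ub_link) != NULL &&
        STAILQ_NEXT(bucket, ub_link)->ub_seq != SMR_SEQ_INVALID)
            zdom->uzd_seq = STAILQ_NEXT(bucket, ub_link)->ub_seq;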

Hmm, I think I see a bug that could cause this.

https://reviews.freebsd.org/D25952 contains a proposed patch.

  • Pull fallback alloc spin out to buf_trie_alloc
  • Preallocate buf trie items with some arbitrary slop to accommodate PCPU caches
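
(A guess at the shape of that preallocation; uma_prealloc(), buf_trie_zone
and nbuf exist in the tree, while the 0.5 * nbuf fudge factor mentioned
below is the illustrative part:)

    /* Hypothetical sizing: all trie nodes we may need, plus PCPU-cache slop. */
    uma_prealloc(buf_trie_zone, nbuf + nbuf / 2);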
cem retitled this revision from "XXX uma_zalloc_smr: Spin on failure" to "buf_trie_alloc: Spin on alloc failure". Aug 6 2020, 6:15 PM
cem edited the summary of this revision.

If anything of the sort is really the solution, it should be made general with an M_SPINWAITOK (or whatever) flag in the allocator. No comments on prealloc.

I don't strongly object to this, but:

  • 0.5 * nbuf * 144 bytes is a ton of extra slop for a case where nbuf trie nodes is already pretty conservative and we're trying to address a theoretical problem. The case that Peter hit was caused by a UMA bug.
  • I tend to think that this should be handled in the allocator. In particular, if upon an allocation we cannot claim a bucket off the full bucket list because smr_poll() fails, I think we should advance the write sequence. Otherwise we are relying on free or drain operations to do so, and there's zero guarantee that either will happen in short order.

I don't strongly object to this, but:

  • 0.5 * nbuf * 144 bytes is a ton of extra slop for a case where nbuf trie nodes is already pretty conservative and we're trying to address a theoretical problem. The case that Peter hit was caused by a UMA bug.

Yeah, that may be unnecessary with the UMA bug addressed.

  • I tend to think that this should be handled in the allocator. In particular, if upon an allocation we cannot claim a bucket off the full bucket list because smr_poll() fails, I think we should advance the write sequence. Otherwise we are relying on free or drain operations to do so, and there's zero guarantee that either will happen in short order.

What do you envision there? Should pools that use uma_prealloc() set some internal flag which is used in allocation? Do we need a new M_SMRPREALLOCATED sleep mode?

I'm definitely not attached to anything in this revision, but would like some resolution.

In D25950#577167, @cem wrote:

What do you envision there? Should pools that use uma_prealloc() set some internal flag which is used in allocation? Do we need a new M_SMRPREALLOCATED sleep mode?

I'm definitely not attached to anything in this revision, but would like some resolution.

After D26024 the next step would be to advance the SMR write sequence after smr_poll() fails in cache_fetch_bucket(). Then we are not relying on frees or drains to ensure that cached full buckets become reusable.
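
(A rough sketch of that idea, not a committed change; the exact placement
inside cache_fetch_bucket() is assumed:)

    /*
     * Sketch: before giving up because the cached full buckets are
     * still held back by SMR, bump the write sequence ourselves so the
     * items become reclaimable as soon as the current readers exit,
     * rather than relying on an unrelated free or drain to advance it.
     */
    if ((zone->uz_flags & UMA_ZONE_SMR) != 0 &&
        !smr_poll(zone->uz_smr, zdom->uzd_seq, false)) {
            smr_advance(zone->uz_smr);
            /* ... then fall back to the slab allocator as before. */
    }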

After that I would look at how this is all supposed to work for NUMA. Right now uma_prealloc() is allocating slabs from all domains using round-robin, but the zone itself is first-touch, meaning that if keg_fetch_slab() fails for the local domain, we don't search for cached free items from remote domains. As a result, on NUMA systems we really don't handle the worst case properly. I think it just happens to work because on average we allocate less than one buf trie node per active buf.

I don't have an easy resolution for that at the moment. UMA doesn't have good mechanisms for guaranteeing successful M_NOWAIT allocations. I would explore trying to make the buffer cache code more resilient in the face of trie node allocation failures, again possibly by merging the clean and dirty tries so that reassignbuf() does not need to perform allocations at all.

Let's keep this in mind and revisit if we manage to hit an OOM in buf_trie_alloc with the UMA bucket issue fixed.