
Only update the domain cursor once in keg_fetch_slab().
Closed, Public

Authored by markj on Sep 17 2018, 8:32 PM.

Details

Summary

We drop the keg lock while actually allocating the slab, which allows
other threads to advance the cursor. In principle, this can cause us to
exit the round-robin loop before having attempted allocations from all
domains.

Suppose one domain, N, is depleted and its page daemon cannot reclaim
any memory (e.g., because virtually all of the memory in the domain is
wired). Suppose keg_fetch_slab() attempts to allocate from that domain
first and fails, and that, while the keg lock was dropped, a different
thread advanced the cursor to N - 1. Upon reacquiring the keg lock, we
will then set domain = N and retry the loop, resulting in a blocking
allocation that will never return.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

markj created this revision. Sep 17 2018, 8:32 PM
cem accepted this revision. Sep 18 2018, 12:29 AM
cem added a subscriber: cem.

This changes keg cursor advancement behavior slightly. I'm not sure that matters.

This revision is now accepted and ready to land. Sep 18 2018, 12:29 AM
In D17209#366880, @cem wrote:

This changes keg cursor advancement behavior slightly. I'm not sure that matters.

On my 32-core EPYC server, this simple change makes the difference between a stable system and a ZFS hang: with NUMA enabled, writes under load block waiting to allocate memory until the ZFS deadman switch triggers a panic after 1000 seconds.

markj added a comment. Sep 18 2018, 3:15 PM
In D17209#366880, @cem wrote:

This changes keg cursor advancement behavior slightly. I'm not sure that matters.

Yes, but only in the case where the first allocation attempt fails.

alc accepted this revision. Sep 18 2018, 4:22 PM
This revision was automatically updated to reflect the committed changes.