
Fix memguard when options NUMA is configured.
Abandoned · Public

Authored by markj on Sep 14 2018, 8:56 PM.
Details

Reviewers
kib
alc
jeff
cem
Summary

kmem_back() is now somewhat deficient: it doesn't know that the pages
backing the same large virtual page must all come from the same NUMA
domain in order to satisfy the constraints of vm_reserv_extend(). It
just selects a domain according to the configured policy, which may or
may not correspond to the "colour" of the caller-supplied KVA.

memguard is the last in-tree consumer of kmem_back(). This change
allows it to work with "options NUMA" by ensuring that we consistently
select the same domain for the pages backing a given large virtual
page. As part of this, ensure that the KVA range reserved for memguard
is aligned to a large page boundary.
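
For illustration, the domain selection described above can be keyed to the superpage index of the KVA, along the lines of the sketch below (hedged: KVA_QUANTUM_SHIFT and the function name are stand-ins; the actual diff may differ).

/*
 * Minimal sketch: derive the NUMA domain from the superpage
 * containing the KVA, so that every page backing one large virtual
 * page is allocated from the same domain.  KVA_QUANTUM_SHIFT is
 * assumed to be the superpage shift (e.g., 21 for 2MB pages on
 * amd64).
 */
static int
memguard_kva_domain(vm_offset_t addr)
{
    return ((addr >> KVA_QUANTUM_SHIFT) % vm_ndomains);
}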

Test Plan

Set vm.memguard.desc="mbuf" on a NUMA system and ran network
traffic until the memguard cursor wrapped around.


Event Timeline

markj added reviewers: kib, alc, jeff.
This revision is now accepted and ready to land. Sep 14 2018, 9:10 PM
cem added a subscriber: cem.

Looks like kmem_back() can now be entirely removed.

In D17175#366163, @cem wrote:

Looks like kmem_back() can now be entirely removed.

Yep, I'll propose that change separately.

sys/vm/memguard.c
374

I think we can fall back to kmem_back() if the initial allocation attempt fails.

sys/vm/memguard.c
374

... not quite, since we could race with a free. I considered somehow moving this logic into kmem_back() itself (i.e., use the KVA to select an initial domain and then fall back if necessary), but memguard still needs to ensure that its KVA arena is aligned and a multiple of 2MB in size.

One thing we could do is round up the cursor to the next 2MB page in the event of an allocation failure, so that the next allocation attempt will select a different domain. I'll attempt that in a separate change.
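
A minimal sketch of that fallback (assuming the superpage size is available as KVA_QUANTUM; not the actual diff):

/*
 * On allocation failure, advance the cursor past the current
 * superpage so that the next attempt selects a different domain.
 * The +1 guarantees progress even when the cursor is already
 * superpage-aligned.
 */
if (vm_ndomains > 1)
    memguard_cursor = roundup2(memguard_cursor + 1, KVA_QUANTUM);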

393

I think this should be origaddr + size_v.

Advance the cursor to the next superpage boundary if the page
allocation fails.

Advance the cursor by the correct amount in the case where page
allocation succeeds.

This revision now requires review to proceed. Sep 15 2018, 5:56 PM
markj added inline comments.
sys/vm/memguard.c
374

I ended up rolling that up into this change. Otherwise, if one domain is depleted, memguard allocation attempts won't advance the cursor at all and we'll just keep hitting the same domain over and over.

cem added inline comments.
sys/vm/memguard.c
370

Is this necessary? (Isn't any integer mod 1 going to be zero?) Or is it valid for vm_ndomains to be zero?

387–388

Why atomic_cmpset? No other update to memguard_cursor seems to be atomic. (And the cmpset seems meaningless, no?) For the cmpset behavior I think you want an extra local variable to show that the load is only performed once:

if (vm_ndomains...) {
    vm_offset_t cursor, next;

    cursor = memguard_cursor; // atomic_load() ?
    next = roundup2(cursor, ...);
    atomic_cmpset_long(&memguard_cursor, cursor, next);
}
393

Agree

This revision is now accepted and ready to land. Sep 15 2018, 6:18 PM
markj added inline comments.
sys/vm/memguard.c
370

It's not really necessary, but I wanted the condition to match the one in the failure case below. vm_ndomains is initialized to 1.

387–388

Oops, right, I need to save the value of memguard_cursor loaded at the beginning of the routine. I also realized that the domain selection doesn't work properly if we're allocating multiple pages and the KVA crosses a superpage boundary.
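
A sketch of per-page domain selection for that case (object, pindex, and req are assumed to come from the enclosing kmem_back()-style context; the names are hedged): recompute the domain from each page's own address rather than once per call.

/*
 * For a multi-page allocation whose KVA crosses a superpage
 * boundary, derive the domain from each page's own address rather
 * than from the first page only.
 */
for (off = 0; off < size; off += PAGE_SIZE) {
    domain = ((addr + off) >> KVA_QUANTUM_SHIFT) % vm_ndomains;
    m = vm_page_alloc_domain(object, pindex + atop(off), domain, req);
    if (m == NULL)
        break;    /* caller unwinds or retries */
}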

sys/vm/memguard.c
331–335

Setting aside the NUMA problem for a second, this actually doesn't work with a vmem arena behind the cursor. Suppose that we have wrapped around. vmem manages its free lists in a LIFO fashion. So, if you are allocating one page, vmem is going to return the most recently freed page that hasn't coalesced, and not the first free page near the start of the arena. The cursor will jump to that location, and I expect that the next allocation attempt will fail, resulting in another cursor reset.

Until wraparound, the most recently freed pages were rejected because their addresses were below the cursor. But in that case vmem_xalloc is iterating over all of the free list entries for these pages until it gets to the entry representing the region between the cursor and the end of the arena. Hopefully some coalescing of these recently freed pages occurs.
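
For reference, the cursor-based search under discussion looks roughly like this (paraphrased with hedged details; see sys/vm/memguard.c for the real code):

/*
 * Ask vmem for any free range at or above the cursor.  After a
 * wraparound resets the cursor, vmem's LIFO free lists can return a
 * recently freed page far from the arena start, so the cursor jumps
 * forward again almost immediately.
 */
if (vmem_xalloc(memguard_arena, size_v, 0, 0, 0, memguard_cursor,
    VMEM_ADDR_MAX, M_BESTFIT | M_NOWAIT, &addr) == 0)
    memguard_cursor = addr + size_v;    /* cf. the "393" comment */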

sys/vm/memguard.c
331–335

The vmem paper describes a "nextfit" allocation strategy, but our implementation doesn't have it yet. It could be used here since the memguard arena doesn't use a quantum cache. Assuming that you don't have an alternate strategy for handling the NUMA problem, what do you think of implementing that and using it here?

sys/vm/memguard.c
331–335

Please upload the next-fit implementation to phabricator. I think that it's worth having.

That said, iterating over an arena's segment list leads to a bad worst-case performance bound. Because allocated boundary tags do not coalesce (unlike free boundary tags), a next-fit allocation might iterate over every single allocation.

Essentially, one has to hope that the arena on which nextfit is applied is sparsely utilized.
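
To make that worst case concrete, here is an illustrative next-fit walk over a boundary-tag list (the types are hypothetical stand-ins, not vmem's actual data structures): because allocated tags never coalesce, the walk may visit one tag per live allocation before finding a free segment.

#include <stdbool.h>
#include <stddef.h>

struct seg {                /* stand-in for a vmem boundary tag */
    struct seg *next;       /* circular segment list */
    bool        free;       /* free tag or allocated tag */
    size_t      size;
};

/* Resume scanning after the previous allocation; wrap at most once. */
static struct seg *
nextfit(struct seg *cursor, size_t size)
{
    struct seg *s;

    for (s = cursor->next; s != cursor; s = s->next)
        if (s->free && s->size >= size)
            return (s);
    return (NULL);
}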

sys/vm/memguard.c
67–72

My suggestion would be to keep memguard NUMA-oblivious and fix kmem_back(). In other words, keep code like the above inside vm/vm_kern.c.

sys/vm/memguard.c
67–72

I mentioned this possibility in a different comment thread. memguard still needs to ensure that its arena doesn't share any large pages with its parent, so kmem_back() cannot completely hide the problem.

331–335

Right, I hesitated for a while because of this limitation of the approach. I took a look at illumos, and their implementation seems to have the same problem.

sys/vm/memguard.c
67–72

The kernel_object has only one use case outside of memguard.c and vm_kern.c, in subr_vmem.c, and that use case allocates an address from a per-domain arena. Also, the per-domain arenas should only be pulling address ranges that are a multiple of 2MB and start at a 2MB aligned address. So, memguard shouldn't be getting addresses under which there is a reservation that is shared with a per-domain arena. (memguard is getting its entire address from the the "global" kernel arena.)