Page MenuHomeFreeBSD

Move NUMA policy iterators into the page allocator layer

Authored by jeff on Nov 9 2017, 5:19 AM.



This patch makes most of the physical page layer unaware of numa policies and domain specific. This is a stepping stone towards per-domain free locking.

Notably this does not handle free or a few other global phys operations like defrag. It also only partially converts reservations. Really reservations should be consulted prior to calling the allocator and potentially without a free queue lock but this is not yet possible. This does enable callers to override normal allocation policy and specify individual domains.

I make no guarantee about the iterators, I used what was present. Long-term they need to be re-visited.

I am particularly concerned about the change in control flow around allocation failures. While I have tested it did require the allocation functions to be more tolerant of NULL pages when the stats indicate there should be some present. Once the stats are per-domain we can go back to a tighter relationship if necessary.

I also tried to pick a sane argument order for 'domain' not just slapping it in front of all variables. If anyone objects to the order I picked I'm open to suggestions

Test Plan


I am running buildworld and make universe with VM_NUMA_ALLOC and non-NUMA kernels as well as forcibly exhausting memory. I verified that we block appropriately and don't spin.

Diff Detail

rS FreeBSD src repository - subversion
Lint Not Applicable
Tests Not Applicable

Event Timeline

I forgot to mention; This partially came from markj's updated numa2 branch which came from my original numa branch. I then refactored slightly and produced this version. I'm next going to figure out the kmem_ layer and then pick up UMA from the numa2 branch. It probably will be difficult to merge those two and I may abandon the numa branch if I can keep doing patches this small on head.

122 ↗(On Diff #34971)

This is in the wrong header but dependency issues make it really hard to move.

625 ↗(On Diff #34971)

This could eventually want to be two functions;

One would check the object for an existing reservation at that pindex and satisfy the allocation from it. This would be done without a domain argument and would not create a reservation. It would possibly need the domain free queue for the page free count.

The second would possibly populate a new reservation when a page allocation is done. This could be called with the domain free queue already locked.

158 ↗(On Diff #34971)

Use curproc instead of curthread->td_proc ?

1599 ↗(On Diff #34971)

Perhaps these two lines can be kept as is, without reformatting.

1611 ↗(On Diff #34971)

So assume that all non-last domains failed to provide a page to allocate, and the request is WAITOK. In this case, only last domain in the iteration is retried, and we would only return the page when that domain gets some free page.

IMO this is not right, we should restart the whole iteration for WAITOK. With the current arrangements of the counters, the patch is fine for WAITFAIL.

100 ↗(On Diff #34971)

I would prefer to call this function vm_page_domain_idx() instead. It is impossible to say from the names of vm_page_domain()/vm_phys_domain() what is the difference, and practically impossible to remember.

I used vm_phys prefix for the name because it is placed in vm_phys.h. Yes, it is too hard to move it into other place.

158 ↗(On Diff #34971)

I just moved this code directly unmodified. I intend to replace it entirely in a later diff. I would rather not move and modify in a single patch.

1611 ↗(On Diff #34971)

You are either going to block on the first or last in the list. You could convert to WAITFAIL and rescan for WAITOK. I think waiting for multiple domains to reclaim memory is going to be very complex and not worth it. With my page daemon refactoring I can survive extremely high memory pressure and never block.

I would prefer to refine this when I replace the policy and iterators. I want to do that after I refactor the VM APIs. The new iterators could provide more facilities for managing consistent policies in blocking allocations. For now this works so unless you strongly object I would prefer to do it as a later action.

100 ↗(On Diff #34971)

I agree. I will look at the use cases and come up with another name.

1611 ↗(On Diff #34971)

My point is not about waiting only for the last domain. Rather, I am stating that you would allocate from domains 0,1,2,...,n<all fails, wait>n,n,n,n,.... In other words, after the the wait, iteration should be restarted.

Aside from kib's comment about WAITOK, this looks ok to me. This is pretty similar to the changes in projects/numa2.

Incidentally, I think this addresses a crash that can occur when using a fixed-domain allocation policy: previously, we didn't handle the possibility of vm_page_alloc_pages() returning NULL, which could happen if we only attempted to allocate from a single domain before giving up. Given that VM_NUMA_ALLOC currently contains this rather severe bug, I think it's ok to commit this refactoring first and deal with the iterator API later.

1824 ↗(On Diff #34971)

Style nit: unneeded parens. Ditto in vm_page_alloc_after().

477–478 ↗(On Diff #34971)

Style nit: I would swap the order of declarations of vm_page_alloc_after() and vm_page_alloc_domain().

This revision is now accepted and ready to land.Nov 14 2017, 12:24 AM
This revision was automatically updated to reflect the committed changes.