
vm_object: drop memq field
Needs ReviewPublic

Authored by dougm on Dec 9 2024, 8:06 PM.

Details

Summary

This is a first draft and a work in progress; the automatically added reviewers don't need to rush to review it.

The memq field in vm_object is used to maintain a list of the object's resident pages, sorted by pindex. Remove that field, and stop maintaining the list. Change all iterations that use the list to use pctrie iterators instead. Change the tailq field in vm_page so that it becomes part of a union that overlaps the (pindex, object) pair, and remains in use only for the buddy allocator in vm_phys.c and for the maintenance of temporary lists in uma_core.c. Where a predecessor page is passed to a function solely to help with insertion into the memq list, drop that parameter.
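
As a rough sketch of the layout change described above (illustrative only; the exact arrangement in the diff may differ):

```c
/*
 * Sketch only. listq is needed just while a page is free (on the
 * vm_phys buddy queues) or on a temporary list in uma_core.c; in both
 * states the page belongs to no object, so listq can safely overlap
 * the (pindex, object) pair.
 */
struct vm_page {
	/* ... */
	union {
		struct {
			vm_object_t object;	/* owning object */
			vm_pindex_t pindex;	/* offset into object */
		};
		TAILQ_ENTRY(vm_page) listq;	/* vm_phys/UMA use only */
	};
	/* ... */
};
```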

Diff Detail

Lint: Skipped
Unit Tests: Skipped

Event Timeline

dougm requested review of this revision. Dec 9 2024, 8:06 PM

A code reformat to (slightly) reduce the diff size.

Define and use VM_RADIX_FOREACH macros. Initialize a couple of iterators before acquiring read locks.
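
For reference, a minimal sketch of what such a macro might look like, built on the existing vm_radix iterator functions (the actual definitions in the diff may differ):

```c
/*
 * Sketch: walk every resident page of an object in pindex order.
 * Assumes the iterator was set up with vm_radix_iter_init() (or
 * vm_page_iter_init()) and that the object lock is held.
 */
#define	VM_RADIX_FOREACH(m, pages)					\
	for ((m) = vm_radix_iter_lookup_ge((pages), 0);			\
	    (m) != NULL; (m) = vm_radix_iter_step((pages)))
```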

sys/amd64/sgx/sgx.c
401

Rather than lifting out the vm_page_remove() calls, perhaps pass the iterator to sgx_page_remove() as well, and have it decide whether to call vm_page_remove() or vm_page_iter_remove() based on whether the iterator pointer is NULL.
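
A sketch of that suggestion (the vm_page_iter_remove() signature and the surrounding sgx details are assumed here):

```c
/* Sketch: a NULL iterator selects the non-iterator removal path. */
static void
sgx_page_remove(struct sgx_softc *sc, vm_page_t p,
    struct pctrie_iter *pages)
{
	/* ... existing EPC bookkeeping ... */
	if (pages != NULL)
		vm_page_iter_remove(pages, p);	/* assumed signature */
	else
		vm_page_remove(p);
}
```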

sys/vm/vm_glue.c
625

Do you have some other uses for NOCREAT_LT planned? This function is called rarely (kernel stacks are cached by UMA and reused many times without going through this function), so the extra complexity feels a bit unnecessary.

sys/vm/vm_page.h
224

Not quite related to this change, but once it lands, I'd like to add one 8-byte field to plinks. I believe that'd be enough to let us start embedding UMA slab headers (i.e. uma_slab_t) within the vm_page structure. This is an alternative approach to D22768.

sys/vm/vm_radix.h
272

Bikeshedding, but VM_RADIX_FOREACH_FROM reads a bit more naturally to me, and we already have TAILQ_FOREACH_FROM etc.

sys/vm/vm_reserv.c
528

This is more expensive, but at some point I presume it should be possible to pass an iterator down from the page allocator and use that instead of doing a full lookup here.

dougm marked 2 inline comments as done. Mon, Jan 6, 7:26 PM
dougm added inline comments.
sys/vm/vm_glue.c
625

I have no other uses planned. I was just looking for a way to avoid a redundant lookup.

sys/vm/vm_radix.h
272

I don't object to using FROM, but the reason I didn't is that FOREACH_FROM takes two arguments while this macro takes three: FOREACH_FROM assumes that the lookup has already happened, whereas FOREACH_START has to do the lookup itself. Those differences are why I chose a distinct name. But it doesn't matter much to me.
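
To illustrate the difference (hypothetical definitions, not the actual macros from the diff):

```c
/* Two arguments: 'm' is assumed to be looked up already, as with
 * TAILQ_FOREACH_FROM. */
#define	VM_RADIX_FOREACH_FROM(m, pages)					\
	for (; (m) != NULL; (m) = vm_radix_iter_step((pages)))

/* Three arguments: performs the initial lookup itself. */
#define	VM_RADIX_FOREACH_START(m, pages, start)				\
	for ((m) = vm_radix_iter_lookup_ge((pages), (start));		\
	    (m) != NULL; (m) = vm_radix_iter_step((pages)))
```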

sys/vm/vm_reserv.c
528

I agree that it would be possible.

dougm marked an inline comment as done.

Doug, how would you like to proceed with the patch? Since quite a few pieces of it are independent, I imagine they can be reviewed and peeled off one by one, especially in areas that aren't performance-critical.

Are there still some concerns about the overhead of switching to iterators in some of the already-committed changes? I've lost track a bit, I'm sorry.

sys/kern/kern_kcov.c
410

Can't we use vm_radix_iter_next() here?
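
For context, vm_radix_iter_next() returns the page at exactly the next pindex (or NULL), so it fits when the pages are known to be resident at consecutive indices. A sketch under that assumption:

```c
/* Sketch: free pages assumed resident at consecutive pindices 0..n-1. */
struct pctrie_iter pages;
vm_page_t m;

vm_page_iter_init(&pages, object);
for (m = vm_radix_iter_lookup(&pages, 0); m != NULL;
    m = vm_radix_iter_next(&pages))
	vm_page_free(m);
```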

sys/kern/subr_pctrie.c
550

This can presumably be committed on its own. It looks reasonable to me.

sys/x86/iommu/amd_idpgtbl.c
122

The assertion can come first, as in dmar_domain_free_pgtbl().

sys/x86/iommu/intel_idpgtbl.c
267

Elsewhere you are careful to initialize the iterator outside the scope of the object lock, but not here.

269

Does it make sense to have a macro for this usage too? Perhaps we could use VM_RADIX_FOREACH and assert that pindices are consecutive?

dougm marked 3 inline comments as done.

> Doug, how would you like to proceed with the patch? Since quite a few pieces of it are independent, I imagine they can be reviewed and peeled off one by one, especially in areas that aren't performance-critical.

Some of them have been reviewed separately and approved by @kib, but are blocked by comments from @alc. His comment on D47718 is why I stopped creeping along a bit at a time and threw it all out for consideration at once. I expect that his approval depends on some kind of performance testing. He's been drawn away from BSD considerations by other life events lately.

> Are there still some concerns about the overhead of switching to iterators in some of the already-committed changes? I've lost track a bit, I'm sorry.

@alc has concerns. He's reported that performance worsened a bit with some of the page management changes already made, and with some other changes that have been under review for a while.

sys/x86/iommu/intel_idpgtbl.c
269

I don't know. I'll add a VM_RADIX_FORALL macro for this purpose, and you can decide how you feel about it.
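
A sketch of the usage pattern under discussion (illustrative only; the macro arguments and the assertion are assumptions):

```c
/* Walk all pages, asserting that pindices are consecutive from 0. */
struct pctrie_iter pages;
vm_page_t m;
vm_pindex_t expect = 0;

vm_page_iter_init(&pages, obj);
VM_RADIX_FORALL(m, &pages) {
	KASSERT(m->pindex == expect++,
	    ("unexpected hole at pindex %ju", (uintmax_t)m->pindex));
	/* ... per-page work ... */
}
```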

Use VM_RADIX_FORALL_FROM in one more place.

Use VM_RADIX_FORALL in kcov_free.

Abandon VM_ALLOC_NOCREAT_LT.

> Are there still some concerns about the overhead of switching to iterators in some of the already-committed changes? I've lost track a bit, I'm sorry.

> @alc has concerns. He's reported that performance worsened a bit with some of the page management changes already made, and with some other changes that have been under review for a while.

Do we have a benchmark to experiment with, either a workload to test the kernel or a standalone userspace program embedding subr_pctrie.o? I wrote a patch which tries to add a fast path to pctrie_iter_step/next, wherein pctrie.h inlines iteration when the top-most node in the iterator contains a run of successive leaf values, and falls back to pctrie_iter_lookup_ge() to look up new nodes or handle holes. So far I've only tested it to check correctness, but I wonder if it would help performance in the case where a VM object is densely populated.
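
As a rough pseudocode sketch of that fast path (the node-inspection helpers and the cached-node field here are invented for illustration; only pctrie_iter_lookup_ge() is the real fallback named above):

```c
static inline uint64_t *
pctrie_iter_next_fast(struct pctrie_iter *it)
{
	/* Hypothetical field caching the bottom-most visited node. */
	struct pctrie_node *node = it->node;

	/* Fast path: the next slot of the same node holds a leaf. */
	if (node != NULL && node_slot_is_leaf(node, it->index + 1)) {
		it->index++;
		return (node_load_leaf(node, it->index));
	}
	/* Slow path: a hole or a node boundary; do the full descent. */
	return (pctrie_iter_lookup_ge(it, it->index + 1));
}
```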

One other performance-related detail I'm wondering about is, we check the iterator limit twice when stepping: once before looking up a leaf, and once after finding one. I believe the latter check is sufficient for correctness. If so, are we making the right tradeoff in checking twice instead of once?
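
In pseudocode, the stepping pattern being questioned looks roughly like this (a sketch, not the actual subr_pctrie.c source):

```c
/* Step an iterator by one index, honoring an inclusive limit. */
static uint64_t *
iter_step_sketch(struct iter *it)
{
	uint64_t *val;

	if (it->index + 1 > it->limit)	/* check #1: before the lookup */
		return (NULL);
	val = iter_lookup_ge(it, it->index + 1); /* advances it->index */
	if (val != NULL && it->index > it->limit) /* check #2: after */
		return (NULL);
	return (val);
}
```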

sys/amd64/amd64/pmap.c
7710

I don't have a concrete suggestion to offer at the moment, but here we're doing some unnecessary work: vm_map_pmap_enter(), the sole caller of this function, uses an iterator to find a contiguous run of pindices that correspond to a run of valid pages; then, it passes the bounds of this run to pmap_enter_object(), which uses a separate iterator to re-lookup each page in the run. That seems redundant, and we should perhaps rework the pmap_enter_object() interface to avoid it.

One possibility would be modifying vm_map_pmap_enter() to store a fixed-size array of vm_page pointers on the stack, which gets passed to pmap_enter_object() once it is full, or once the end of a run or a superpage is detected.
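
A sketch of that idea as it might look inside vm_map_pmap_enter(), with a hypothetical pmap_enter_pages() consuming each batch (all names and the batch size are illustrative):

```c
#define	NBATCH	16			/* illustrative batch size */
vm_page_t batch[NBATCH];
vm_page_t m;
int n = 0;

/* Collect valid pages; hand off a full batch or a completed run. */
for (m = vm_radix_iter_lookup_ge(&pages, pindex); m != NULL;
    m = vm_radix_iter_step(&pages)) {
	if (!vm_page_all_valid(m))
		break;			/* end of the run */
	batch[n++] = m;
	if (n == NBATCH) {
		pmap_enter_pages(map->pmap, addr, batch, n, prot);
		addr += ptoa(n);
		n = 0;
	}
}
if (n > 0)
	pmap_enter_pages(map->pmap, addr, batch, n, prot);
```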

In vm_map.c, have the page index jump ahead when a superpage is encountered. Initialize the cursor before acquiring a read lock.

sys/amd64/amd64/pmap.c
7710

How about if vm_map_pmap_enter stops trying to find a maximal page range between calls to pmap_enter_object, and instead just calls it every time vm_page_all_valid is true, passing along the start and psind values? The pmap lock would be acquired once, before the loop in vm_map_pmap_enter, and released once, and pmap_enter_object wouldn't have a loop at all. Since I don't know what the rwlock is about, I don't know how the potential extra locking and unlocking of it would affect things.

Undo the parts of the last change about moving the iterator initialization.

> Are there still some concerns about the overhead of switching to iterators in some of the already-committed changes? I've lost track a bit, I'm sorry.

> @alc has concerns. He's reported that performance worsened a bit with some of the page management changes already made, and with some other changes that have been under review for a while.

> Do we have a benchmark to experiment with, either a workload to test the kernel or a standalone userspace program embedding subr_pctrie.o? I wrote a patch which tries to add a fast path to pctrie_iter_step/next, wherein pctrie.h inlines iteration when the top-most node in the iterator contains a run of successive leaf values, and falls back to pctrie_iter_lookup_ge() to look up new nodes or handle holes. So far I've only tested it to check correctness, but I wonder if it would help performance in the case where a VM object is densely populated.

I've just been running buildworld tests. I wrote a patch (a part of D47680) that tries to have the next lookup execute straightline code in the common case.

> One other performance-related detail I'm wondering about is, we check the iterator limit twice when stepping: once before looking up a leaf, and once after finding one. I believe the latter check is sufficient for correctness. If so, are we making the right tradeoff in checking twice instead of once?

@alc has suggested that we could modify pctrie nodes to maintain a parent pointer and avoid the iterator array.

> One other performance-related detail I'm wondering about is, we check the iterator limit twice when stepping: once before looking up a leaf, and once after finding one. I believe the latter check is sufficient for correctness. If so, are we making the right tradeoff in checking twice instead of once?

I haven't decoded this yet. Could you provide a bit more detail?

sys/amd64/amd64/pmap.c
7710

Cancel the previous idea.

Can we use p->plinks.s to build a list of the valid pages, and pass that list to pmap_enter_object to avoid the second iterator? Using an array raises the question of how big to make the array, whereas using plinks.s avoids that - assuming that no one else is using plinks for something while vm_map_pmap_enter is happening. I don't know whether I can safely assume that.

sys/amd64/amd64/pmap.c
7710

At the moment, we cannot use plinks.s: it's overlaid with plinks.q, the fields used to link pages together in the global page queues. The pages mapped here will in general be managed (which among other things means that they are reclaimable and thus are kept in these page queues), so plinks is already in use.

That said, in this diff you remove two pointer-sized fields from struct vm_page, and I would later like to add one back for use in UMA (the kernel memory allocator). We could make use of such a field here: that would not overlap with UMA's use of it.