
vm_domainset: Only probe domains once when iterating, instead of up to 4 times
ClosedPublic

Authored by olce on Jul 11 2025, 7:00 AM.

Details

Summary

Because of the 'di_minskip' logic, which resets the initial domain, an
iterator starts by considering only domains that have more than
'free_min' pages in a first phase, and then all domains in a second one.
Non-"underpaged" domains are thus examined twice, even if the allocation
can't succeed.

Re-scanning the same domains twice just wastes time, as allocation
attempts that must not wait may rely on failing sooner and those that
must will loop anyway (a domain previously scanned twice has more pages
than 'free_min' and consequently vm_wait_doms() will just return
immediately).

Additionally, the DOMAINSET_POLICY_FIRSTTOUCH policy would aggravate
this situation by re-examining the current domain at the end of each
phase. In the case of a single domain, this means doubling again the
number of times domain 0 is probed.

The implementation consists of adding two 'domainset_t' fields to
'struct vm_domainset_iter' (and removing the 'di_n' counter). The
first, 'di_remain_mask', contains the domains still to be explored in
the current phase, the first phase concerning only domains with more
pages than 'free_min' ('di_minskip' true) and the second one concerning
only the domains previously under 'free_min' ('di_minskip' false). The
second, 'di_min_mask', accumulates the domains with fewer pages than
'free_min' encountered during the first phase, and serves as the reset
value for 'di_remain_mask' when transitioning to the second phase.
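
For illustration only, here is a minimal userland sketch of this
two-mask, two-phase scheme. It uses plain uint64_t masks in place of
the kernel's domainset_t bitsets, and domain_above_min() is a
hypothetical stand-in for the 'free_min' check; it is not the committed
code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the 'free_min' check. */
    extern bool domain_above_min(int domain);

    struct iter {
        uint64_t remain_mask; /* Domains left in the current phase. */
        uint64_t min_mask;    /* Under-'free_min' domains seen in phase 1. */
        bool     minskip;     /* True during the first phase. */
    };

    static void
    iter_init(struct iter *it, uint64_t valid_mask)
    {
        it->remain_mask = valid_mask;
        it->min_mask = 0;
        it->minskip = true;
    }

    /* Next domain to probe, or -1 when both phases are exhausted. */
    static int
    iter_next(struct iter *it)
    {
        for (;;) {
            while (it->remain_mask != 0) {
                int d = __builtin_ctzll(it->remain_mask);

                it->remain_mask &= ~(1ULL << d);
                if (it->minskip && !domain_above_min(d)) {
                    /* Defer under-provisioned domains to phase 2. */
                    it->min_mask |= 1ULL << d;
                    continue;
                }
                /* Each domain is returned exactly once overall. */
                return (d);
            }
            if (!it->minskip || it->min_mask == 0)
                return (-1);
            /* Phase 2: only the domains deferred during phase 1. */
            it->remain_mask = it->min_mask;
            it->minskip = false;
        }
    }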

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

olce requested review of this revision. Jul 11 2025, 7:00 AM

Re-simplify the logic that determines whether a currently examined domain can
be returned (second phase, or the domain has enough memory), also making the
control flow clearer.

sys/vm/vm_domainset.h
41

Why do we need this second (third?) mask at all?

When iterating, clear the bit of each domain that is returned, and skip !count_min() domains without clearing their bit. Then iterate over all the remaining bits in remain_mask.
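
(Under one possible reading of this proposal, and reusing the uint64_t
conventions and hypothetical domain_above_min() from the sketch above,
the single-mask variant would look roughly like this:)

    /* Single-mask variant: clear the bit of each domain returned; leave
     * the bit of skipped !count_min() domains set, so the surviving
     * bits form the second pass. */
    static int
    iter_next_one_mask(uint64_t *remain_mask, bool *minskip)
    {
        for (;;) {
            uint64_t m = *remain_mask;

            while (m != 0) {
                int d = __builtin_ctzll(m);

                m &= ~(1ULL << d);
                if (*minskip && !domain_above_min(d))
                    continue; /* Bit stays set for the second pass. */
                *remain_mask &= ~(1ULL << d);
                return (d);
            }
            if (!*minskip)
                return (-1);
            *minskip = false; /* Second pass over the surviving bits. */
        }
    }

Note that if such an iteration is stopped and resumed during the first
pass, the still-set bits of already-skipped domains get re-examined,
which is the objection raised in the reply below.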

olce added inline comments.
sys/vm/vm_domainset.h
41

Iteration with this API can be stopped and resumed (or not) at will. With your proposal, when you start over, you just don't know which domains have already been determined to be under-provisioned and skipped in the first phase, and you're going to have to parse over them again. I precisely want to get rid of any occurrence of that here, although those in your proposal wouldn't be as critical as the existing ones are.

The current bitset API does not allow saving the current iteration point, and even if it did, that would be more context to save than a simple domain bitmask, given that MAXMEMDOM is always below 64.

Also, all struct vm_domainset_iter objects are allocated on the stack, so I don't see how an additional 64-bit field could make any significant difference.

41

di_valid_mask serves to remember that some specific allocation cannot be fulfilled from a domain (see vm_domainset_iter_ignore()), so it is different from di_domain->ds_mask.

This revision was not accepted when it landed; it landed in state Needs Review. Sep 9 2025, 7:58 AM
This revision was automatically updated to reflect the committed changes.

I don't fully understand this change:

Because of the 'di_minskip' logic, which resets the initial domain, an
iterator starts by considering only domains that have more than
'free_min' pages in a first phase, and then all domains in a second one.
Non-"underpaged" domains are thus examined twice, even if the allocation
can't succeed.

When would the allocation not succeed? I guess this is the contig allocation case? Why exactly do the unnecessary allocation attempts make a big difference? I see a reference to PR 277476, but it's hard to see exactly why these patches help there.

Please do not commit unreviewed patches to sys/vm. Virtually all non-trivial patches there are reviewed. It is important that more than one person understand how these layers work. I'm sorry you didn't get timely review here, but in that case please poke us again instead of just committing. I spent some time today reading these commits and am not asking that they be reverted, but I think we need to find a way to simplify these iterators. There is a lot of overhead here for single-page allocations where no page shortage is occurring, which is by far the common case.

When would the allocation not succeed? I guess this is the contig allocation case?

Exactly.

Why exactly do the unnecessary allocation attempts make a big difference? I see a reference to PR 277476, but it's hard to see exactly why these patches help there.

Because the TTM layer of DRM drivers attempts to fill a pool of pages with large contiguous chunks: it starts by trying to allocate large chunks, which fails, and then retries, progressively shrinking the chunk size, until it gets all the pages it wanted.
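
(Schematically, and only as a hedged sketch rather than the actual TTM
code, the pool-filling pattern is something like this, where
alloc_contig_pages() is a hypothetical placeholder for the contiguous
allocator:)

    /* Hypothetical contiguous allocator: true on success. */
    extern bool alloc_contig_pages(int order);

    /* Fill a pool of 'want' pages, preferring large contiguous chunks. */
    static void
    fill_pool(int want, int max_order)
    {
        int got = 0, order = max_order;

        while (got < want) {
            if ((1 << order) <= want - got && alloc_contig_pages(order)) {
                got += 1 << order;
            } else if (order > 0) {
                /* Shrink and retry; each failed attempt walks the
                 * domain iterator, hence the cost of re-probing. */
                order--;
            } else {
                break; /* Even single pages failed. */
            }
        }
    }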

I hope that clears up why I'm doing these changes. But I don't think that's really relevant to assessing why this change is good, which is explained by this paragraph of the commit message:

Re-scanning the same domains twice just wastes time, as allocation
attempts that must not wait may rely on failing sooner and those that
must will loop anyway (a domain previously scanned twice has more pages
than 'free_min' and consequently vm_wait_doms() will just return
immediately).

If you don't agree with that, I'd like to hear your arguments.

Please do not commit unreviewed patches to sys/vm. Virtually all non-trivial patches there are reviewed. It is important that more than one person understand how these layers work. I'm sorry you didn't get timely review here, but in that case please poke us again instead of just committing.

While not completely trivial, I don't think this series of changes (D51249 to D51251) is particularly involved either, and in any case it doesn't fundamentally change the nature and purpose of the iterators. We have been shipping both 14.2 & 14.3 with bugs rendering amdgpu drivers basically unusable on most configurations, and we should absolutely strive to avoid that for 15.0. I actually sent a ping above, and just assumed there was no interest in this series. I will ping multiple times next time.

I spent some time today reading these commits and am not asking that they be reverted, but I think we need to find a way to simplify these iterators. There is a lot of overhead here for single-page allocations where no page shortage is occurring, which is by far the common case.

These commits actually factor out a lot of the existing iterator code, eliminating discrepancies and making the logic clearer, which immediately helped spot another bug where skipping domains with fewer than v_free_min pages was not done consistently. I don't think we can significantly "simplify" these iterators further (the code is pretty minimal for the functionality while staying understandable, and there doesn't seem to be too much functionality here in general, does there?).

If there is a lot of overhead for single-page allocations (more generally, for not-too-large ones), which slightly surprises me given that, IIRC, performing such an allocation requires more steps, and more costly ones, than those done by the iterators (admittedly, I haven't looked that up in detail nor recently), we can imagine several targeted improvements. The first that comes to mind is a fast path where part of the iterator functionality (or some dedicated routine) is used to determine a starting domain from which we directly try to allocate a single page, falling back to a full iteration on failure.
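
(As a hedged sketch of that fast-path idea, with every function below a
hypothetical placeholder rather than an existing KPI:)

    /* Hypothetical helpers, for illustration only. */
    extern int   pick_first_domain(void);       /* Starting domain choice. */
    extern void *alloc_page_domain(int domain);
    extern void *alloc_page_full_iteration(void);

    static void *
    alloc_page_fast(void)
    {
        void *m;

        /* Common case: no shortage, the first candidate domain works. */
        m = alloc_page_domain(pick_first_domain());
        if (m != NULL)
            return (m);
        /* Slow path: full two-phase iteration over all domains. */
        return (alloc_page_full_iteration());
    }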

It is important that more than one person understand how these layers work.

An additional note on this point, which I completely support: directing it at me is extremely unfair.

I think you will have a hard time finding someone more aware of that than I am, and who actually puts his money where his mouth is, as can be seen from the number of comments I've added to this code to clearly explain how the iterators work (see D51251), whereas there was not a single one before (probably a reason why they were suboptimal in terms of domain traversal), except for a single short paragraph at the top of vm_pagequeue.h, which only mentions that domains with more pages than v_free_min are used in priority.

The requirement that more than a single person understand these layers works both ways, and I'd like to see people explain the inner workings (whether real or assumed) of the things they develop in more detail by writing more comments (I'm not saying that about you).