
vm_domainset: Only probe domains once when iterating, instead of up to 4 times
ClosedPublic

Authored by olce on Jul 11 2025, 7:00 AM.

Details

Summary

Because of the 'di_minskip' logic, which resets the initial domain, an
iterator starts by considering only domains that have more than
'free_min' pages in a first phase, and then all domains in a second one.
Non-"underpaged" domains are thus examined twice, even if the allocation
can't succeed.

Re-scanning the same domains twice just wastes time, as allocation
attempts that must not wait may rely on failing sooner and those that
must will loop anyway (a domain previously scanned twice has more pages
than 'free_min' and consequently vm_wait_doms() will just return
immediately).

Additionally, the DOMAINSET_POLICY_FIRSTTOUCH policy would aggravate
this situation by re-examining the current domain at the end of each
phase. In the case of a single domain, this means doubling again the
number of times domain 0 is probed.

The implementation consists of adding two 'domainset_t' fields to
'struct vm_domainset_iter' (and removing the 'di_n' counter). The
first, 'di_remain_mask', contains the domains still to be explored in
the current phase, the first phase concerning only domains with more
pages than 'free_min' ('di_minskip' true) and the second one concerning
only the domains previously under 'free_min' ('di_minskip' false). The
second, 'di_min_mask', accumulates the domains with fewer pages than
'free_min' encountered during the first phase, and serves as the reset
value for 'di_remain_mask' when transitioning to the second phase.
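
For illustration only, here is a minimal userland sketch of this
two-mask, two-phase scheme. It uses plain uint64_t masks in place of
the kernel's domainset_t bitsets, and domain_above_min() is a
hypothetical stand-in for the 'free_min' check; it is not the committed
code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the 'free_min' check. */
    extern bool domain_above_min(int domain);

    struct iter {
        uint64_t remain_mask; /* Domains left in the current phase. */
        uint64_t min_mask;    /* Under-'free_min' domains seen in phase 1. */
        bool     minskip;     /* True during the first phase. */
    };

    static void
    iter_init(struct iter *it, uint64_t valid_mask)
    {
        it->remain_mask = valid_mask;
        it->min_mask = 0;
        it->minskip = true;
    }

    /* Next domain to probe, or -1 when both phases are exhausted. */
    static int
    iter_next(struct iter *it)
    {
        for (;;) {
            while (it->remain_mask != 0) {
                int d = __builtin_ctzll(it->remain_mask);

                it->remain_mask &= ~(1ULL << d);
                if (it->minskip && !domain_above_min(d)) {
                    /* Defer under-provisioned domains to phase 2. */
                    it->min_mask |= 1ULL << d;
                    continue;
                }
                /* Each domain is returned exactly once overall. */
                return (d);
            }
            if (!it->minskip || it->min_mask == 0)
                return (-1);
            /* Phase 2: only the domains deferred during phase 1. */
            it->remain_mask = it->min_mask;
            it->minskip = false;
        }
    }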

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

olce requested review of this revision. Jul 11 2025, 7:00 AM

Re-simplify the logic that determines whether a currently examined domain can
be returned (second phase, or the domain has enough memory), also making the
control flow clearer.

sys/vm/vm_domainset.h
41

Why do we need this second (third?) mask at all?

When iterating, clear the bit of each domain that is returned, and skip !count_min() domains without clearing their bit. Then iterate over all the remaining bits in remain_mask.
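
(Under one possible reading of this proposal, and reusing the uint64_t
conventions and hypothetical domain_above_min() from the sketch above,
the single-mask variant would look roughly like this:)

    /* Single-mask variant: clear the bit of each domain returned; leave
     * the bit of skipped !count_min() domains set, so the surviving
     * bits form the second pass. */
    static int
    iter_next_one_mask(uint64_t *remain_mask, bool *minskip)
    {
        for (;;) {
            uint64_t m = *remain_mask;

            while (m != 0) {
                int d = __builtin_ctzll(m);

                m &= ~(1ULL << d);
                if (*minskip && !domain_above_min(d))
                    continue; /* Bit stays set for the second pass. */
                *remain_mask &= ~(1ULL << d);
                return (d);
            }
            if (!*minskip)
                return (-1);
            *minskip = false; /* Second pass over the surviving bits. */
        }
    }

Note that if such an iteration is stopped and resumed during the first
pass, the still-set bits of already-skipped domains get re-examined,
which is the objection raised in the reply below.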

olce added inline comments.
sys/vm/vm_domainset.h
41

Iteration with this API can be stopped and resumed (or not) at will. With your proposal, when you start over, you just don't know which domains have already been determined to be under-provisioned and skipped in the first phase, and you're going to have to parse over them again. I precisely want to get rid of any occurrence of that here, although those in your proposal wouldn't be as critical as the existing ones are.

The current bitset API does not allow saving the current iteration point, and even if it did, that would be more context to save than a simple domain bitmask, given that MAXMEMDOM is always below 64.

Also, all struct vm_domainset_iter objects are allocated on the stack, so I don't see how an additional 64-bit field could make any significant difference.

41

di_valid_mask serves to remember that some specific allocation cannot be fulfilled from a domain (see vm_domainset_iter_ignore()), so it is different from di_domain->ds_mask.

This revision was not accepted when it landed; it landed in state Needs Review. Sep 9 2025, 7:58 AM
This revision was automatically updated to reflect the committed changes.

I don't fully understand this change:

Because of the 'di_minskip' logic, which resets the initial domain, an
iterator starts by considering only domains that have more than
'free_min' pages in a first phase, and then all domains in a second one.
Non-"underpaged" domains are thus examined twice, even if the allocation
can't succeed.

When would the allocation not succeed? I guess this is the contig allocation case? Why exactly do the unnecessary allocation attempts make a big difference? I see a reference to PR 277476, but it's hard to see exactly why these patches help there.

Please do not commit unreviewed patches to sys/vm. Virtually all non-trivial patches there are reviewed. It is important that more than one person understand how these layers work. I'm sorry you didn't get timely review here, but in that case please poke us again instead of just committing. I spent some time today reading these commits and am not asking that they be reverted, but I think we need to find a way to simplify these iterators. There is a lot of overhead here for single-page allocations where no page shortage is occurring, which is by far the common case.

When would the allocation not succeed? I guess this is the contig allocation case?

Exactly.

Why exactly do the unnecessary allocation attempts make a big difference? I see a reference to PR 277476, but it's hard to see exactly why these patches help there.

Because the TTM layer of DRM drivers attempts to fill a pool of pages with large contiguous chunks: it starts by trying to allocate large chunks, which fails, and then retries, progressively shrinking the chunk size, until it gets all the pages it wanted.
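
(Schematically, and only as a hedged sketch rather than the actual TTM
code, the pool-filling pattern is something like this, where
alloc_contig_pages() is a hypothetical placeholder for the contiguous
allocator:)

    /* Hypothetical contiguous allocator: true on success. */
    extern bool alloc_contig_pages(int order);

    /* Fill a pool of 'want' pages, preferring large contiguous chunks. */
    static void
    fill_pool(int want, int max_order)
    {
        int got = 0, order = max_order;

        while (got < want) {
            if ((1 << order) <= want - got && alloc_contig_pages(order)) {
                got += 1 << order;
            } else if (order > 0) {
                /* Shrink and retry; each failed attempt walks the
                 * domain iterator, hence the cost of re-probing. */
                order--;
            } else {
                break; /* Even single pages failed. */
            }
        }
    }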

I hope that clears up why I'm doing these changes. But I don't think that's really relevant to assessing why this change is good, which is explained by this paragraph of the commit message:

Re-scanning the same domains twice just wastes time, as allocation
attempts that must not wait may rely on failing sooner and those that
must will loop anyway (a domain previously scanned twice has more pages
than 'free_min' and consequently vm_wait_doms() will just return
immediately).

If you don't agree with that, I'd like to hear your arguments.

Please do not commit unreviewed patches to sys/vm. Virtually all non-trivial patches there are reviewed. It is important that more than one person understand how these layers work. I'm sorry you didn't get timely review here, but in that case please poke us again instead of just committing.

While not completely trivial, I don't think this series of changes (D51249 to D51251) is particularly involved either, and in any case it doesn't fundamentally change the nature and purpose of the iterators. We have been shipping both 14.2 & 14.3 with bugs rendering amdgpu drivers basically unusable on most configurations, and we should absolutely strive to avoid that for 15.0. I actually sent a ping above, and just assumed there was no interest in this series. I will ping multiple times next time.

I spent some time today reading these commits and am not asking that they be reverted, but I think we need to find a way to simplify these iterators. There is a lot of overhead here for single-page allocations where no page shortage is occurring, which is by far the common case.

These commits actually factor out a lot of the existing iterator code, eliminating discrepancies and making the logic clearer, which immediately helped spot another bug where skipping domains with fewer than v_free_min pages was not done consistently. I don't think we can significantly "simplify" these iterators further (the code is pretty minimal for the functionality while staying understandable, and there doesn't seem to be too much functionality here in general, does there?).

If there is a lot of overhead for single-page allocations (more generally, for not-too-large ones), which slightly surprises me given that, IIRC, performing such an allocation requires more steps, and more costly ones, than those done by the iterators (admittedly, I haven't looked that up in detail nor recently), we can imagine several targeted improvements. The first that comes to mind is a fast path where part of the iterator functionality (or some dedicated routine) is used to determine a starting domain from which we directly try to allocate a single page, falling back to a full iteration on failure.
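
(As a hedged sketch of that fast-path idea, with every function below a
hypothetical placeholder rather than an existing KPI:)

    /* Hypothetical helpers, for illustration only. */
    extern int   pick_first_domain(void);       /* Starting domain choice. */
    extern void *alloc_page_domain(int domain);
    extern void *alloc_page_full_iteration(void);

    static void *
    alloc_page_fast(void)
    {
        void *m;

        /* Common case: no shortage, the first candidate domain works. */
        m = alloc_page_domain(pick_first_domain());
        if (m != NULL)
            return (m);
        /* Slow path: full two-phase iteration over all domains. */
        return (alloc_page_full_iteration());
    }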

It is important that more than one person understand how these layers work.

An additional note on this point, which I completely support: directing it at me is extremely unfair.

I think you will have a hard time finding someone more aware of that than I am, and who actually puts his money where his mouth is, as can be seen from the number of comments I've added to this code to clearly explain how the iterators work (see D51251), whereas there was not a single one before (probably a reason why they were suboptimal in terms of domain traversal), except for a single short paragraph at the top of vm_pagequeue.h, which only mentions that domains with more pages than v_free_min are used in priority.

The requirement that more than a single person understand these layers works both ways, and I'd like to see people explain the inner workings (whether real or assumed) of the things they develop in more detail by writing more comments (I'm not saying that about you).