Tentative physical memory compaction
Needs Review · Public

Authored by bnovkov on Jun 27 2023, 2:42 PM.

Details

Reviewers

markj
kib
dougm
imp
dchagin
alc
gallatin
cperciva
dim
mjg
emaste

Summary

This patch implements mechanisms for physical memory compaction and adds a system daemon that performs proactive compaction.
This is still a work in progress so this patch should be used for internal reviews and discussions only.

A detailed description and list of points to discuss will be added soon.

A high-level overview of the added changes is available here.

This work is sponsored by the Google Summer of Code program.

Test Plan

I've found that the most reliable way of incurring fragmentation (and the fastest way of testing compaction) is running buildkernel and/or buildworld.
The fragmentation metric values can be monitored using the vm.phys_frag_idx sysctl.
The total number of pages relocated per compaction run can be observed using the fbt dtrace provider: fbt::vm_compact_run:return { trace(arg1) }.
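For readers unfamiliar with fragmentation metrics, the sketch below computes one common one, the unusable free space index, from per-order free block counts. It is an illustration only and is not necessarily the formula behind vm.phys_frag_idx; the function name, the NORDERS constant, and the layout of the input array are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NORDERS 13	/* hypothetical: buddy orders 0..12 */

/*
 * Unusable free space index for a request of the given order, scaled to
 * 0..1000. 0 means every free page sits in a block large enough to satisfy
 * the request; 1000 means none does.
 * free_blocks[i] is the number of free blocks of 2^i pages.
 */
static unsigned
unusable_free_index(const uint64_t free_blocks[NORDERS], unsigned order)
{
	uint64_t total = 0, usable = 0;

	assert(order < NORDERS);
	for (unsigned i = 0; i < NORDERS; i++) {
		uint64_t pages = free_blocks[i] << i;

		total += pages;
		if (i >= order)
			usable += pages;
	}
	if (total == 0)
		return (0);	/* nothing free, so nothing is fragmented */
	return ((unsigned)(((total - usable) * 1000) / total));
}
```

With 512 free single pages and one free 512-page block, a request of order 9 can use only half of the free memory, so the index is 500.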

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

bnovkov created this revision. Jun 27 2023, 2:42 PM

Herald added a subscriber: imp. Jun 27 2023, 2:42 PM

bnovkov requested review of this revision. Jun 27 2023, 2:42 PM

grahamperrin added a subscriber: grahamperrin. Jul 8 2023, 1:29 AM

afedorov added a subscriber: afedorov. Jul 28 2023, 7:38 AM

Note that this has become more important now that ASLR is turned on for 64-bit processes as of 13.2-RELEASE. ASLR adds a great deal of fragmentation. It leads to significant performance degradation over the long run because superpages become unusable due to memory fragmentation.

In D40772#966839, @eugen_grosbein.net wrote:

Note that this has become more important now that ASLR is turned on for 64-bit processes as of 13.2-RELEASE. ASLR adds a great deal of fragmentation. It leads to significant performance degradation over the long run because superpages become unusable due to memory fragmentation.

Until a few months ago, the handling of anonymous memory allocations when ASLR is turned on was not really operating as intended. Over the summer, I fixed the biggest of the issues. Consequently, the behavior that I see now on CURRENT (and presumably 14) with ASLR on is no worse than with it off. In fact, it is sometimes better. For example, a buildworld (or any workload that uses clang a lot) now gets substantially more promotions and fewer underutilized reservations with ASLR turned on, because ASLR automatically segregates anonymous memory allocations from memory-mapped files. That said, there is still one minor change that I am considering. All of these changes could be backported to 13 before the next release.

To be clear, none of this is meant to imply that I think that we shouldn't implement compaction, only that ASLR when working correctly doesn't magnify the need for it. (I apologize for all the negations in that sentence. :-))

gbe added a subscriber: gbe. Oct 26 2023, 7:35 PM

In D40772#966839, @eugen_grosbein.net wrote:

Note that this has become more important now that ASLR is turned on for 64-bit processes as of 13.2-RELEASE. ASLR adds a great deal of fragmentation. It leads to significant performance degradation over the long run because superpages become unusable due to memory fragmentation.

Thank you for taking an interest in this patch!

I'd just like to note a couple of things to keep in mind while testing/reviewing this:

The current iteration is a bit overengineered - the vm_compact layer was designed to do bookkeeping for multiple concurrent compaction daemons (e.g. one daemon for jumbo frame allocations and one for kTLS frames). After a couple of conversations with @markj I concluded that this is unnecessary so this layer will be removed soon.
Determining candidate regions for compaction is pretty barebones and needs to be refined. The system currently tracks the total amount of free space and the number of unallocated 0-order pages inside a physical memory region and uses this information (albeit very naively) to select compaction candidates. More suitable metrics and candidate selection criteria could be added.
There are a couple of fixes that I still haven't included in the diff - I'll do this along with removing the vm_compact layer.
I haven't had the time to conduct proper testing. All of the tests so far consisted of several consecutive buildkernel workloads with the source and object directories mounted as tmpfs, while tracking the total number of promotions and the compaction daemon's CPU time. Preliminary results show that the number of promotions improves compared to the baseline.
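The candidate selection described above could look roughly like the following sketch. It is a guess at the general shape, not the code in the diff: struct region_stats, its fields, and the scoring rule (among regions with enough free space, prefer the one with the most free pages stuck in 0-order blocks) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-region counters; the patch's real bookkeeping may differ. */
struct region_stats {
	size_t free_pages;	/* all free pages in the region */
	size_t order0_free;	/* free pages that sit in 0-order blocks */
};

/*
 * Return the index of the region most worth compacting, or -1 if none
 * qualifies. A region qualifies when it has at least min_free free pages
 * and at least one isolated (0-order) free page; among those, the region
 * with the most isolated free pages wins.
 */
static int
pick_compaction_candidate(const struct region_stats *r, size_t n,
    size_t min_free)
{
	int best = -1;

	assert(r != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (r[i].free_pages < min_free || r[i].order0_free == 0)
			continue;
		if (best < 0 || r[i].order0_free > r[(size_t)best].order0_free)
			best = (int)i;
	}
	return (best);
}
```

A rule this simple ignores how many non-movable pages a region contains, which is one reason more suitable metrics and selection criteria would be needed.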

In D40772#966842, @alc wrote:

In D40772#966839, @eugen_grosbein.net wrote:

Note that this has become more important now that ASLR is turned on for 64-bit processes as of 13.2-RELEASE. ASLR adds a great deal of fragmentation. It leads to significant performance degradation over the long run because superpages become unusable due to memory fragmentation.

Until a few months ago, the handling of anonymous memory allocations when ASLR is turned on was not really operating as intended. Over the summer, I fixed the biggest of the issues. Consequently, the behavior that I see now on CURRENT (and presumably 14) with ASLR on is no worse than with it off.

Are you planning to merge the fix to stable/13?

I don't know if the kernel is in shape where this can be properly evaluated.

There are several high-usage NOFREE zones and this is not going to get sorted out in this decade.

Unless someone fixed the problem while I was not looking, the kernel is really bad at damage-controlling them. In the long run they result in significant fragmentation which can't be reversed.

So I think a prerequisite would be to sort out this problem first (for example: allocate a 2M superpage and carve out NOFREE allocations from that area, then rinse and repeat as needed). Some of the biggest culprits probably should not be using per-CPU caching to begin with, and sorting that out would also help. (The list includes struct proc, thread and mount, all of which already serialize on global locks to do anything.)
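The "carve NOFREE allocations out of a 2M superpage" idea can be sketched in userspace as a bump allocator. This is only an illustration of the technique under stated assumptions: struct nofree_arena and nofree_alloc() are hypothetical names, and aligned_alloc() stands in for whatever the kernel would use to obtain one physically contiguous superpage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_SIZE ((size_t)2 << 20)	/* one 2 MB superpage */

/* Hypothetical bump allocator state; zero-initialize before first use. */
struct nofree_arena {
	uint8_t *cur;	/* next unused byte in the current chunk */
	size_t left;	/* bytes remaining in the current chunk */
};

/*
 * Carve size bytes out of the current 2 MB chunk, grabbing a fresh chunk
 * when the current one cannot fit the request. Nothing is ever freed, in
 * the spirit of NOFREE zones, so the tail of an exhausted chunk is simply
 * abandoned.
 */
static void *
nofree_alloc(struct nofree_arena *a, size_t size)
{
	void *p;

	if (size == 0 || size > CHUNK_SIZE)
		return (NULL);
	/* Round up to 16 bytes so every returned object stays aligned. */
	size = (size + 15) & ~(size_t)15;
	if (size > a->left) {
		/* Assumption: stand-in for allocating one superpage. */
		uint8_t *chunk = aligned_alloc(CHUNK_SIZE, CHUNK_SIZE);

		if (chunk == NULL)
			return (NULL);
		a->cur = chunk;
		a->left = CHUNK_SIZE;
	}
	p = a->cur;
	a->cur += size;
	a->left -= size;
	return (p);
}
```

Because consecutive allocations land in the same 2 MB chunk, the non-movable objects stay clustered instead of being scattered across many otherwise reclaimable regions.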

In D40772#969386, @mjg wrote:

I don't know if the kernel is in shape where this can be properly evaluated.

There are several high-usage NOFREE zones and this is not going to get sorted out in this decade.

We have https://reviews.freebsd.org/D16620 from last decade. Probably that should be finished and committed, along with D24758.

If you can rebase both changes and show me how to collect fragmentation stats I can test this against a full ports tree build.

@markj ping?

netchild added a subscriber: netchild. Apr 8 2024, 9:01 AM

In D40772#969417, @mjg wrote:

If you can rebase both changes and show me how to collect fragmentation stats I can test this against a full ports tree build.

Fragmentation stats are available in https://reviews.freebsd.org/D40575

Regenerate and simplify patch.

As promised, I've removed some redundant bits from the patch, so it should be a bit clearer now.

As @mjg pointed out, there are some things that need to be solved before this patch can be properly evaluated.
We have several issues with "permanent" fragmentation, i.e. improper placement of non-movable pages, which is especially problematic for compaction.
I'm currently working on a solution which will allow UMA cache zones to allocate pages in a more contiguous manner.

I managed to evaluate this patch as a part of a paper presented at this year's AsiaBSDcon (available at https://people.freebsd.org/~bnovkov/papers/).
The preliminary results show some improvements when running a build workload, but a more thorough evaluation will take place once the other issues have been solved.