
arm64: Make jemalloc safe for 16k / 4k interoperability
Needs Review · Public

Authored by gallatin on May 1 2024, 1:30 PM.

Details

Summary

jemalloc obtains its page size at compile time. It does not work when the compiled-in page size is smaller than the runtime page size, and it asserts that the kernel's page size is no larger than the compiled-in page size. This prevents booting a kernel with 16KB pages without building a matching world, makes downgrading to a 4KB world while booted on a 16KB kernel "fun", and makes it impossible to run pre-built 4KB static binaries from packages or other sources such as Go (e.g., pkg-static).

However, jemalloc runs just fine when the compiled-in page size is larger than the runtime page size.

To make 16KB and 4KB kernels interoperable, I'd like to increase jemalloc's compiled-in page size to 16KB. This will waste a small amount of space, but the payoff is making a 16KB kernel much easier to use. For some workloads (like static web serving), 16KB pages show up to a 25% performance improvement in my testing.


Event Timeline

gallatin created this revision.

I've run this patch and noticed no world build-time regressions with a 4KB kernel.

I've been thinking about adding PAGE_SIZE_MAX/PAGE_SHIFT_MAX or similar to arm64 to define the largest page size the kernel could support. We could then use that here if it's defined.

> I've been thinking about adding PAGE_SIZE_MAX/PAGE_SHIFT_MAX or similar to arm64 to define the largest page size the kernel could support. We could then use that here if it's defined.

I like that. I'd also bite the bullet and update the other arch to define that too...

> I've been thinking about adding PAGE_SIZE_MAX/PAGE_SHIFT_MAX or similar to arm64 to define the largest page size the kernel could support. We could then use that here if it's defined.

That sounds like a great idea!

See D45065 (and D45066 for a use in getpagesize & getpagesizes)
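As a sketch, the PAGE_SIZE_MAX idea might look like the following. The names and values are assumptions for illustration (D45065 has the actual proposal); 64KB is used because it is the largest translation granule arm64 supports:

```c
/*
 * Hypothetical sketch of a PAGE_SIZE_MAX definition for arm64's
 * machine/param.h; not a committed FreeBSD definition.
 */
#define	PAGE_SHIFT_MAX	16			/* 64KB, largest arm64 granule */
#define	PAGE_SIZE_MAX	(1 << PAGE_SHIFT_MAX)

/*
 * jemalloc's build could then prefer the platform maximum when it is
 * available, falling back to the historical 4KB default otherwise.
 */
#ifdef PAGE_SHIFT_MAX
#define	LG_PAGE	PAGE_SHIFT_MAX
#else
#define	LG_PAGE	12
#endif
```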

Do we have any idea what the downsides of the change are? If we make the default 64KB, then I'd expect memory usage to increase; do we have any idea what that looks like? It'd be nice to, for example, compare memory usage on a newly booted system with and without this change.

It might be useful to provide a build knob so that embedded device builds can select 4KB if the memory usage reduction is significant.

> Do we have any idea what the downsides of the change are? If we make the default 64KB, then I'd expect memory usage to increase; do we have any idea what that looks like? It'd be nice to, for example, compare memory usage on a newly booted system with and without this change.

I had the same question. It will clearly impact a lot of page granularity counters, at the very least causing some confusion for people who look at those counters, e.g.,

./include/jemalloc/internal/arena_inlines_b.h-          arena_stats_add_u64(tsdn, &arena->stats,
./include/jemalloc/internal/arena_inlines_b.h-              &arena->decay_dirty.stats->nmadvise, 1);
./include/jemalloc/internal/arena_inlines_b.h-          arena_stats_add_u64(tsdn, &arena->stats,
./include/jemalloc/internal/arena_inlines_b.h:              &arena->decay_dirty.stats->purged, extent_size >> LG_PAGE);
./include/jemalloc/internal/arena_inlines_b.h-          arena_stats_sub_zu(tsdn, &arena->stats, &arena->stats.mapped,
./include/jemalloc/internal/arena_inlines_b.h-              extent_size);

However, it's not so obvious what the effect on the memory footprint will be. For example, the madvise(MADV_FREE) calls will have coarser granularity. If we set the page size to 64KB, then one in-use 4KB page within a 64KB region will be enough to block the application of madvise(MADV_FREE) to the other 15 pages. Quantifying the impact that this coarsening has will be hard.

This does, however, seem to be the intended workaround: https://github.com/jemalloc/jemalloc/issues/467

Buried in that issue is the claim that Firefox's built-in derivative of jemalloc eliminated the statically compiled page size.

There is an open review to upgrade jemalloc to 5.3.0 that I have been testing and running on CURRENT and 14-stable for some time now. I will attempt to test this change with that. I would like to see the jemalloc update merged.


What direction does the kernel grow the vm map? They apparently reverted support for lg page size values larger than the runtime page size because it caused fragmentation when the kernel grows the vm map downwards.

> What direction does the kernel grow the vm map? They apparently reverted support for lg page size values larger than the runtime page size because it caused fragmentation when the kernel grows the vm map downwards.

Typically, existing map entries are only extended in an upward direction. For downward-growing regions, e.g., stacks, new entries are created. Do you have a pointer to where this is discussed? I'm puzzled as to why the direction would be a factor.

In D45042#1029022, @alc wrote:
> Typically, existing map entries are only extended in an upward direction. For downward-growing regions, e.g., stacks, new entries are created. Do you have a pointer to where this is discussed? I'm puzzled as to why the direction would be a factor.

It was in that GitHub issue: https://github.com/jemalloc/jemalloc/issues/467#issuecomment-252025408