Set pseflag after r332489.
Closed, Public

Authored by markj on Jul 15 2018, 7:32 PM.

Details

Summary

It looks like r332489 inadvertently removed the code which checks for
CPU support for large pages. As a result, pmap_pg_ps_enabled is always
0 on i386, so superpage promotion is disabled.
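
The dependency described above can be sketched as a small user-space model. This is not the code from the diff: the function names are hypothetical, and only the constants CPUID_PSE and PG_PS are meant to mirror the usual x86 definitions. It shows why losing the CPU feature check leaves the large-page flag at 0, which in turn forces superpages off regardless of the tunable.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_PSE 0x00000008u	/* CPUID.1:EDX bit 3, page size extension */
#define PG_PS     0x080u	/* PDE bit 7, entry maps a large page */

/*
 * Hypothetical model of the check that went missing: pick the PDE flag
 * for large mappings from the CPU feature word, or 0 if PSE is absent.
 */
uint32_t
model_pseflag(uint32_t cpu_feature)
{
	return ((cpu_feature & CPUID_PSE) != 0 ? PG_PS : 0);
}

/*
 * Hypothetical model of the enable decision: superpages are usable only
 * if the administrator left them enabled and the CPU supports them.
 */
bool
model_pg_ps_enabled(bool tunable, uint32_t pseflag)
{
	return (tunable && pseflag != 0);
}
```

In this model, a pseflag that is never set makes model_pg_ps_enabled() return false for every tunable value, which matches the symptom in the summary.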

Test Plan

I booted a kernel with this change on one of my laptops. I ran some builds
and did some browsing with firefox, which exercises the recently added
pmap_enter(psind = 1) support.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 18050
Build 17794: arc lint + arc unit

Event Timeline

markj added reviewers: alc, kib.
sys/i386/i386/pmap.c, line 521

I guess this should be initialized to PG_PS per the comment, though it doesn't seem to matter in practice.

  • Initialize pseflag to PG_PS

I find it weird that some places test pseflag and not pg_ps_enabled. We end up with superpage mappings even if user explicitly disabled them. I do not even mind for the kernel mappings, but for pmap_object_init_pt() it is not correct.

Use pg_ps_enabled for user mappings.
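
A minimal sketch of the policy being discussed, assuming a hypothetical helper (may_use_superpage is not a function in pmap.c): kernel mappings only need hardware support, while user mappings must also respect the administrator's setting.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether a mapping may use a large page.
 * pseflag is nonzero when the CPU supports large pages; pg_ps_enabled
 * reflects whether superpages were left enabled by the user.
 */
bool
may_use_superpage(bool user_mapping, uint32_t pseflag, bool pg_ps_enabled)
{
	if (pseflag == 0)
		return (false);		/* no hardware support at all */
	if (user_mapping)
		return (pg_ps_enabled);	/* honor the explicit disable */
	return (true);			/* kernel mappings: hardware check only */
}
```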

I'm curious about Firefox. Are you seeing psind==1 usage on the shared libraries?

This revision is now accepted and ready to land. Jul 15 2018, 8:54 PM
In D16279#345394, @alc wrote:

I'm curious about Firefox. Are you seeing psind==1 usage on the shared libraries?

Recent versions of firefox are multi-process and use shared memory, and I'm seeing that they execute pmap_enter(psind == 1) during faults on the shared memory regions.

This revision was automatically updated to reflect the committed changes.

Could you please post "procstat -v" output for one of the Firefox processes? I'm just curious to see it.

Sure: https://reviews.freebsd.org/P191

I see good and bad things there.

I'm glad to see automatic promotion on the code segment for libxul.so:

1089 0x22c00000 0x2782e000 r-x 11168 11667  12   7 CNS- vn /usr/local/lib/firefox/libxul.so

However, the overall number of map entries is frightening. Has jemalloc stopped using MADV_FREE and started munmap()ing aggressively? The pattern where you have high ref counts on underlying objects interspersed without a low ref count suggests as much.

In D16279#345431, @alc wrote:

However, the overall number of map entries is frightening. Has jemalloc stopped using MADV_FREE and started munmap()ing aggressively? The pattern where you have high ref counts on underlying objects interspersed without a low ref count suggests as much.

No, it still seems to be using MADV_FREE. (Linux finally gained support for that recently, BTW.)

As a side note, almost all free() calls in firefox end with this stack:

libc.so.7`0x8012e900a
libc.so.7`heapsort+0xa5a
libc.so.7`heapsort+0x934
libc.so.7`__bsd_iconvlist+0x278
libc.so.7`__bsd_iconv+0x36
libc.so.7`_citrus_stdenc_open+0x63
libc.so.7`mbstowcs+0x20

Seems like we are sorting the same data over and over.

In D16279#345431, @alc wrote:

However, the overall number of map entries is frightening. Has jemalloc stopped using MADV_FREE and started munmap()ing aggressively? The pattern where you have high ref counts on underlying objects interspersed without a low ref count suggests as much.

No, it still seems to be using MADV_FREE. (Linux finally gained support for that recently, BTW.)

For example, take a look at lines 2253 through 2296 in P191. At some point in the past, firefox had a valid range of addresses starting at 0x33e09000 and ending at 0x33ed4000. Then, a bunch of munmap()s fragmented the range, and finally new allocations were created within the holes.

This seems like a newish behavior, starting with jemalloc 5.0. The relevant details seem to be:

Unlike all previous jemalloc releases, this release does not use naturally
aligned "chunks" for virtual memory management, and instead uses page-aligned
"extents".  This change has few externally visible effects, but the internal
impacts are... extensive.
- Implement two-phase decay of unused dirty pages.  Pages transition from
   dirty-->muzzy-->clean, where the first phase transition relies on
   madvise(... MADV_FREE) semantics, and the second phase transition discards
   pages such that they are replaced with demand-zeroed pages on next access.
   (@jasone)
opt.muzzy_decay_ms (ssize_t) r-

   Approximate time in milliseconds from the creation of a set of unused muzzy pages until an equivalent set of unused muzzy pages is purged (i.e. converted to clean) and/or reused. Muzzy pages are defined as previously having been unused dirty pages that were subsequently purged in a manner that left them subject to the reclamation whims of the operating system (e.g. madvise(...MADV_FREE)), and therefore in an indeterminate state. The pages are incrementally purged according to a sigmoidal decay curve that starts and ends with zero purge rate. A decay time of 0 causes all unused muzzy pages to be purged immediately upon creation. A decay time of -1 disables purging. The default decay time is 10 seconds. See arenas.muzzy_decay_ms and arena.<i>.muzzy_decay_ms for related dynamic control options.

I speculate that setting muzzy_decay_ms to -1 will stop the fragmentation.
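
One way to try this from inside a program, sketched under the assumption that the process is linked against an unprefixed jemalloc 5.x that reads the global malloc_conf string at startup (the option can also be supplied through the malloc.conf symbolic link or the MALLOC_CONF environment variable):

```c
/*
 * Assumption: the allocator is jemalloc 5.x built without a symbol
 * prefix, so it consults this global when it initializes.  A decay
 * time of -1 disables purging of muzzy pages.
 */
const char *malloc_conf = "muzzy_decay_ms:-1";
```

With other allocators the variable is simply unused, so the setting has no effect there.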

In D16279#345433, @alc wrote:

I speculate that setting muzzy_decay_ms to -1 will stop the fragmentation.

Some quick testing suggests that it helps. I tried starting firefox and opening three fairly resource-intensive websites, and counted the total number of vm_map entries among all firefox processes. Once everything has loaded, I consistently see about half the total number of entries (~3800 vs. ~7500) when malloc.conf is set as you suggest. I'll try this on my desktop today - when I last killed firefox, I had about 80 tabs open and about 110,000 map entries(!).

https://reviews.freebsd.org/P193

The fundamental issue is that the object's ref count is too coarse-grained. A per-pindex (within the object) ref count is needed. In other words, we need the ability to precisely count the number of mappings to an arbitrary window of indices within an object. Then, when jemalloc munmap()s and later mmap()s a subregion (of a larger valid region), the VM system can recognize that the existing object from the larger valid region can be reused because there were no mappings to the corresponding subrange of the existing object.

Are you familiar with the anon structure in the System V/Solaris/NetBSD style of COW implementation?

In D16279#345673, @alc wrote:

The fundamental issue is that the object's ref count is too coarse-grained. A per-pindex (within the object) ref count is needed. In other words, we need the ability to precisely count the number of mappings to an arbitrary window of indices within an object. Then, when jemalloc munmap()s and later mmap()s a subregion (of a larger valid region), the VM system can recognize that the existing object from the larger valid region can be reused because there were no mappings to the corresponding subrange of the existing object.

To be a bit more concrete, suppose a mapped VM object spans pindices [0, n], and we unmap the range in [x1, x2] where 0 < x1 < x2 < n, and then map it again. Are you observing that we'll instantiate a new object to back the range originally backed by [x1, x2]?
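
The scenario in the question can be reproduced from user space with a short program. This is only a sketch of the access pattern: remap_hole is a hypothetical helper, and nothing in it inspects which VM object ends up backing the hole (on FreeBSD, procstat -v on the process would be the way to look at the resulting map entries).

```c
#define _DEFAULT_SOURCE		/* exposes MAP_ANON on glibc; harmless elsewhere */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map pages [0, n], unmap the inclusive page range [x1, x2] with
 * 0 < x1 < x2 < n, then map that same range again.  Returns 0 if every
 * step succeeded and the contents are as expected, -1 otherwise.
 */
int
remap_hole(size_t n, size_t x1, size_t x2)
{
	long ps = sysconf(_SC_PAGESIZE);
	size_t pagesz, hole_len;
	char *base, *hole;
	int ok;

	if (ps <= 0 || x1 == 0 || x1 >= x2 || x2 >= n)
		return (-1);
	pagesz = (size_t)ps;
	hole_len = (x2 - x1 + 1) * pagesz;

	base = mmap(NULL, (n + 1) * pagesz, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (base == MAP_FAILED)
		return (-1);
	memset(base, 0xa5, (n + 1) * pagesz);	/* touch every page */

	/* Punch the hole, then map fresh anonymous memory into it. */
	ok = munmap(base + x1 * pagesz, hole_len) == 0;
	if (ok) {
		hole = mmap(base + x1 * pagesz, hole_len,
		    PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
		/* New pages read as zero; the neighbors keep their data. */
		ok = hole != MAP_FAILED && hole[0] == 0 &&
		    (unsigned char)base[(x1 - 1) * pagesz] == 0xa5 &&
		    (unsigned char)base[(x2 + 1) * pagesz] == 0xa5;
	}
	(void)munmap(base, (n + 1) * pagesz);
	return (ok ? 0 : -1);
}
```

For example, remap_hole(7, 2, 4) maps eight pages, replaces pages 2 through 4, and leaves pages 0, 1 and 5 through 7 untouched.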

BTW, from my reading it seems jemalloc is not actually unmapping ranges when it "cleans" them - it just mmaps over the existing range. On Linux it will actually just call madvise(MADV_DONTNEED) on the range, since that causes the kernel to zero the pages. (Am I right in believing that we currently don't have a kernel interface that would allow us to mimic this behaviour for the Linuxulator?)

Are you familiar with the anon structure in the System V/Solaris/NetBSD style of COW implementation?

Not very. I do know that at least Solaris maintains an array of "slots" tracking per-page protection and madvise info in a given anonymous memory region, but am not sure how it's used in COW.

In D16279#345673, @alc wrote:

The fundamental issue is that the object's ref count is too coarse-grained. A per-pindex (within the object) ref count is needed. In other words, we need the ability to precisely count the number of mappings to an arbitrary window of indices within an object. Then, when jemalloc munmap()s and later mmap()s a subregion (of a larger valid region), the VM system can recognize that the existing object from the larger valid region can be reused because there were no mappings to the corresponding subrange of the existing object.

To be a bit more concrete, suppose a mapped VM object spans pindices [0, n], and we unmap the range in [x1, x2] where 0 < x1 < x2 < n, and then map it again. Are you observing that we'll instantiate a new object to back the range originally backed by [x1, x2]?

Yes, that is what happens today. Look at vm_object_coalesce(). That said, I think that we might be able to recycle the range in the existing object from the preceding mapping if that object has the OBJ_ONEMAPPING flag set. However, in general, we would need per-pindex reference counts.

BTW, from my reading it seems jemalloc is not actually unmapping ranges when it "cleans" them - it just mmaps over the existing range. On Linux it will actually just call madvise(MADV_DONTNEED) on the range, since that causes the kernel to zero the pages. (Am I right in believing that we currently don't have a kernel interface that would allow us to mimic this behaviour for the Linuxulator?)

It's using MAP_FIXED?

Are you familiar with the anon structure in the System V/Solaris/NetBSD style of COW implementation?

Not very. I do know that at least Solaris maintains an array of "slots" tracking per-page protection and madvise info in a given anonymous memory region, but am not sure how it's used in COW.

Essentially, this array contains vm_page_t pointers, and within the vm_page structure a count of the number of pointers to that page. To set up COW, you replicate the array and increment the count on each page. A COW fault checks the count on the page, and replaces it with a new private copy if the count was > 1.
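
The scheme just described can be modeled in a few lines of user-space C. This is a toy sketch, not code from any of the kernels mentioned: every name (toy_page, toy_anon, toy_anon_fork, toy_write_fault) and the tiny page size are made up for illustration, and allocation failures simply abort.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 64	/* hypothetical, kept small for illustration */

struct toy_page {
	int		refs;	/* number of anon slots pointing at this page */
	unsigned char	data[TOY_PAGE_SIZE];
};

struct toy_anon {
	size_t		  npages;
	struct toy_page	**slots;	/* one page pointer per index */
};

static void *
xcalloc(size_t n, size_t size)
{
	void *p = calloc(n, size);

	if (p == NULL)
		abort();
	return (p);
}

/* Create an array of npages (> 0) zeroed pages, each referenced once. */
struct toy_anon *
toy_anon_create(size_t npages)
{
	struct toy_anon *am = xcalloc(1, sizeof(*am));
	size_t i;

	am->npages = npages;
	am->slots = xcalloc(npages, sizeof(*am->slots));
	for (i = 0; i < npages; i++) {
		am->slots[i] = xcalloc(1, sizeof(struct toy_page));
		am->slots[i]->refs = 1;
	}
	return (am);
}

/* Set up COW: replicate the array and bump the count on every page. */
struct toy_anon *
toy_anon_fork(const struct toy_anon *parent)
{
	struct toy_anon *child = xcalloc(1, sizeof(*child));
	size_t i;

	child->npages = parent->npages;
	child->slots = xcalloc(parent->npages, sizeof(*child->slots));
	for (i = 0; i < parent->npages; i++) {
		child->slots[i] = parent->slots[i];
		child->slots[i]->refs++;
	}
	return (child);
}

/* COW fault: if the page is shared, swap in a private copy first. */
struct toy_page *
toy_write_fault(struct toy_anon *am, size_t idx)
{
	struct toy_page *old = am->slots[idx], *copy;

	if (old->refs > 1) {
		copy = xcalloc(1, sizeof(*copy));
		memcpy(copy->data, old->data, TOY_PAGE_SIZE);
		copy->refs = 1;
		old->refs--;
		am->slots[idx] = copy;
	}
	return (am->slots[idx]);
}
```

After toy_anon_fork(), parent and child share every page; the first toy_write_fault() on a slot gives the faulting side its own copy and drops the shared page's count back by one, so a later fault by the other side finds a count of 1 and writes in place.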

There are pros and cons to this approach versus our Mach-derived approach. I've often wondered if a hybrid approach didn't make sense where you used a more space efficient data structure to maintain just the per-pindex counts. The hypothesis being that the counts will typically be the same over the entire range of indices.

In D16279#345790, @alc wrote:

Yes, that is what happens today. Look at vm_object_coalesce(). That said, I think that we might be able to recycle the range in the existing object from the preceding mapping if that object has the OBJ_ONEMAPPING flag set. However, in general, we would need per-pindex reference counts.

I can try to implement this optimization later this week once I'm finished with pmap_enter(psind == 1) for arm64.

BTW, from my reading it seems jemalloc is not actually unmapping ranges when it "cleans" them - it just mmaps over the existing range. On Linux it will actually just call madvise(MADV_DONTNEED) on the range, since that causes the kernel to zero the pages. (Am I right in believing that we currently don't have a kernel interface that would allow us to mimic this behaviour for the Linuxulator?)

It's using MAP_FIXED?

Hmm, sorry, I think I was mistaken. The code is a little hard to follow. extent_dalloc_wrapper() munmap()s the range before "decommitting" the range, i.e., calling mmap(PROT_NONE, MAP_FIXED).

There are pros and cons to this approach versus our Mach-derived approach. I've often wondered if a hybrid approach didn't make sense where you used a more space efficient data structure to maintain just the per-pindex counts. The hypothesis being that the counts will typically be the same over the entire range of indices.

I think you were at one point considering using blists for this purpose?

In D16279#345790, @alc wrote:

Yes, that is what happens today. Look at vm_object_coalesce(). That said, I think that we might be able to recycle the range in the existing object from the preceding mapping if that object has the OBJ_ONEMAPPING flag set. However, in general, we would need per-pindex reference counts.

I can try to implement this optimization later this week once I'm finished with pmap_enter(psind == 1) for arm64.

There is also a bug fix to the pv_chunk code on amd64 that is needed by arm64. Specifically, there is a race condition in the pv_chunk code that arises when the pvh_global_lock is removed.

In D16279#345821, @alc wrote:
In D16279#345790, @alc wrote:

Yes, that is what happens today. Look at vm_object_coalesce(). That said, I think that we might be able to recycle the range in the existing object from the preceding mapping if that object has the OBJ_ONEMAPPING flag set. However, in general, we would need per-pindex reference counts.

I can try to implement this optimization later this week once I'm finished with pmap_enter(psind == 1) for arm64.

There is also a bug fix to the pv_chunk code on amd64 that is needed by arm64. Specifically, there is a race condition in the pv_chunk code that arises when the pvh_global_lock is removed.

I'll investigate and work on that too. I think I just found the last bug in my psind == 1 patch. :)

The needed commits are covered in the last column of the table that I've been posting.

                Uses            Has PV Alloc    Has COW         Needs r324665
                pv_chunk        Problem         Bug             and r325285
                -------------------------------------------------------------
amd64           Yes             Fixed           Fixed           Fixed
arm/pmap-v4.c   No              N/A             No[1]           N/A
arm/pmap-v6.c   Yes             No              No[2]           No[4]
arm64           Yes             Fixed           No[2]           Yes
i386            Yes             No              Fixed           No[4]
mips            Yes             No              Fixed           No[4]
powerpc/booke   No              N/A             No[2]           N/A
powerpc/oea     No              N/A[3]          No[2]           N/A
powerpc/oea64   No              N/A             No[2]           N/A
powerpc/pseries[5]
riscv           Yes             Fixed           Fixed           No[4]
sparc64         No              N/A             No[2]           N/A

[1] SMP is not supported.
[2] Performs "break-before-make".
[3] The comments say that the PV entry is reused, but it is not.  That said,
    the old PV entry is freed before the new one is allocated.  I believe
    that reuse could be beneficial because it would eliminate two O(log n)
    Red-Black tree operations.
[4] Still has pvh_global_lock.
[5] Literally derived from powerpc/oea64.