
Set pseflag after r332489.
Closed, Public

Authored by markj on Jul 15 2018, 7:32 PM.

Details

Summary

It looks like r332489 inadvertently removed the code which checks for
CPU support for large pages. As a result, pmap_pg_ps_enabled is always
0 on i386, so superpage promotion is disabled.
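
As an aside, the CPU capability in question is the PSE feature bit, which CPUID leaf 1 reports in EDX bit 3. The sketch below is a hypothetical userland probe of that bit, not the pmap.c code that this change restores:

/*
 * Hypothetical userland probe (not pmap.c): read CPUID.01H:EDX and test
 * bit 3, "PSE", which must be set before the kernel may enable 4MB page
 * support on i386.
 */
#include <cpuid.h>
#include <stdio.h>

#define CPUID_PSE   0x00000008      /* EDX bit 3: page size extension */

int
main(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0) {
                fprintf(stderr, "CPUID leaf 1 not supported\n");
                return (1);
        }
        printf("PSE (large page) support: %s\n",
            (edx & CPUID_PSE) != 0 ? "yes" : "no");
        return (0);
}

On CPUs that report PSE, the i386 pmap can set PG_PS in page directory entries to create 4MB mappings, subject to the pg_ps_enabled knob discussed below.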

Test Plan

I booted a kernel with this change on one of my laptops. I ran some builds
and did some browsing with firefox, which exercises the recently added
pmap_enter(psind = 1) support.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

markj added reviewers: alc, kib.
sys/i386/i386/pmap.c
521 (On Diff #45326)

I guess this should be initialized to PG_PS per the comment, though it doesn't seem to matter in practice.

  • Initialize pseflag to PG_PS

I find it weird that some places test pseflag and not pg_ps_enabled. We end up with superpage mappings even if the user explicitly disabled them. I do not even mind it for the kernel mappings, but for pmap_object_init_pt() it is not correct.

Use pg_ps_enabled for user mappings.

I'm curious about Firefox. Are you seeing psind==1 usage on the shared libraries?

This revision is now accepted and ready to land. Jul 15 2018, 8:54 PM
In D16279#345394, @alc wrote:

I'm curious about Firefox. Are you seeing psind==1 usage on the shared libraries?

Recent versions of firefox are multi-process and use shared memory, and I'm seeing that they execute pmap_enter(psind == 1) during faults on the shared memory regions.

This revision was automatically updated to reflect the committed changes.
In D16279#345429, @alc wrote:
Could you please post "procstat -v" output for one of the Firefox processes? I'm just curious to see it.

Sure: https://reviews.freebsd.org/P191

I see good and bad things there.

I'm glad to see automatic promotion on the code segment for libxul.so:

1089 0x22c00000 0x2782e000 r-x 11168 11667  12   7 CNS- vn /usr/local/lib/firefox/libxul.so

However, the overall number of map entries is frightening. Has jemalloc stopped using MADV_FREE and started munmap()ing aggressively? The pattern where you have high ref counts on underlying objects interspersed with low ref counts suggests as much.

No, it still seems to be using MADV_FREE. (Linux finally gained support for that recently, BTW.)

As a side note, almost all free() calls in firefox end with this stack:

libc.so.7`0x8012e900a
libc.so.7`heapsort+0xa5a
libc.so.7`heapsort+0x934
libc.so.7`__bsd_iconvlist+0x278
libc.so.7`__bsd_iconv+0x36
libc.so.7`_citrus_stdenc_open+0x63
libc.so.7`mbstowcs+0x20

Seems like we are sorting the same data over and over.

In D16279#345431, @alc wrote:

However, the overall number of map entries is frightening. Has jemalloc stopped using MADV_FREE and started munmap()ing aggressively? The pattern where you have high ref counts on underlying objects interspersed with low ref counts suggests as much.

For example, take a look at lines 2253 through 2296 in P191. At some point in the past, firefox had a valid range of addresses starting at 0x33e09000 and ending at 0x33ed4000. Then, a bunch of munmap()s fragmented the range, and finally new allocations were created within the holes.
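
A minimal userland reproduction of that pattern, for anyone who wants to see the effect in procstat -v; this is a sketch, not jemalloc's code, and the sizes and the MAP_FIXED re-mapping are made up for the demonstration:

/*
 * Map a large anonymous region, punch a hole in it with munmap(), then
 * map anonymous memory back into the hole, mimicking the fragmentation
 * pattern described above.  Error handling is minimal.
 */
#include <sys/mman.h>
#include <err.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        size_t pgsz = (size_t)getpagesize();
        size_t len = 16 * pgsz;
        char *base, *hole;

        base = mmap(NULL, len, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE, -1, 0);
        if (base == MAP_FAILED)
                err(1, "mmap");
        memset(base, 1, len);                   /* fault the range in */

        hole = base + 4 * pgsz;
        if (munmap(hole, 4 * pgsz) != 0)        /* fragment the region */
                err(1, "munmap");

        /* A later allocation lands back in the hole. */
        if (mmap(hole, 4 * pgsz, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0) == MAP_FAILED)
                err(1, "mmap (hole)");

        pause();        /* keep the mappings alive for procstat -v */
        return (0);
}

procstat -v then shows the original region split into separate map entries, and the re-mapped hole is backed by a fresh anonymous object rather than the original one, which is the vm_object_coalesce() behavior discussed further below.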

This seems like newish behavior, starting with jemalloc 5.0. The relevant details seem to be:

Unlike all previous jemalloc releases, this release does not use naturally
aligned "chunks" for virtual memory management, and instead uses page-aligned
"extents".  This change has few externally visible effects, but the internal
impacts are... extensive.
- Implement two-phase decay of unused dirty pages.  Pages transition from
   dirty-->muzzy-->clean, where the first phase transition relies on
   madvise(... MADV_FREE) semantics, and the second phase transition discards
   pages such that they are replaced with demand-zeroed pages on next access.
   (@jasone)
opt.muzzy_decay_ms (ssize_t) r-

   Approximate time in milliseconds from the creation of a set of unused muzzy pages until an equivalent set of unused muzzy pages is purged (i.e. converted to clean) and/or reused. Muzzy pages are defined as previously having been unused dirty pages that were subsequently purged in a manner that left them subject to the reclamation whims of the operating system (e.g. madvise(...MADV_FREE)), and therefore in an indeterminate state. The pages are incrementally purged according to a sigmoidal decay curve that starts and ends with zero purge rate. A decay time of 0 causes all unused muzzy pages to be purged immediately upon creation. A decay time of -1 disables purging. The default decay time is 10 seconds. See arenas.muzzy_decay_ms and arena.<i>.muzzy_decay_ms for related dynamic control options.

I speculate that setting muzzy_decay_ms to -1 will stop the fragmentation.

Some quick testing suggests that it helps. I tried starting firefox and opening three fairly resource-intensive websites, and counted the total number of vm_map entries among all firefox processes. Once everything has loaded, I consistently see about half the total number of entries (~3800 vs. ~7500) when malloc.conf is set as you suggest. I'll try this on my desktop today - when I last killed firefox, I had about 80 tabs open and about 110,000 map entries(!).

https://reviews.freebsd.org/P193

The fundamental issue is that the object's ref count is too coarse-grained. A per-pindex (within the object) ref count is needed. In other words, the ability to precisely count the number of mappings to an arbitrary window of indices within an object. Then, when jemalloc munmap()s and later mmap()s a subregion (of a larger valid region), the VM system can recognize that the existing object from the larger valid region can be reused because there were no mappings to the corresponding subrange of the existing object.

Are you familiar with the anon structure in the System V/Solaris/NetBSD style of COW implementation?

In D16279#345673, @alc wrote:

The fundamental issue is that the object's ref count is too coarse-grained. A per-pindex (within the object) ref count is needed. In other words, the ability to precisely count the number of mappings to an arbitrary window of indices within an object. Then, when jemalloc munmap()s and later mmap()s a subregion (of a larger valid region), the VM system can recognize that the existing object from the larger valid region can be reused because there were no mappings to the corresponding subrange of the existing object.

To be a bit more concrete, suppose a mapped VM object spans pindices [0, n], and we unmap the range in [x1, x2] where 0 < x1 < x2 < n, and then map it again. Are you observing that we'll instantiate a new object to back the range originally backed by [x1, x2]?

BTW, from my reading it seems jemalloc is not actually unmapping ranges when it "cleans" them - it just mmaps over the existing range. On Linux it will actually just call madvise(MADV_DONTNEED) on the range, since that causes the kernel to zero the pages. (Am I right in believing that we currently don't have a kernel interface that would allow us to mimic this behaviour for the Linuxulator?)

Are you familiar with the anon structure in the System V/Solaris/NetBSD style of COW implementation?

Not very. I do know that at least Solaris maintains an array of "slots" tracking per-page protection and madvise info in a given anonymous memory region, but am not sure how it's used in COW.

To be a bit more concrete, suppose a mapped VM object spans pindices [0, n], and we unmap the range in [x1, x2] where 0 < x1 < x2 < n, and then map it again. Are you observing that we'll instantiate a new object to back the range originally backed by [x1, x2]?

Yes, that is what happens today. Look at vm_object_coalesce(). That said, I think that we might be able to recycle the range in the existing object from the preceding mapping if that object has the OBJ_ONEMAPPING flag set. However, in general, we would need per-pindex reference counts.

BTW, from my reading it seems jemalloc is not actually unmapping ranges when it "cleans" them - it just mmaps over the existing range. On Linux it will actually just call madvise(MADV_DONTNEED) on the range, since that causes the kernel to zero the pages. (Am I right in believing that we currently don't have a kernel interface that would allow us to mimic this behaviour for the Linuxulator?)

It's using MAP_FIXED?

Are you familiar with the anon structure in the System V/Solaris/NetBSD style of COW implementation?

Not very. I do know that at least Solaris maintains an array of "slots" tracking per-page protection and madvise info in a given anonymous memory region, but am not sure how it's used in COW.

Essentially, this array contains vm_page_t pointers, and within the vm_page structure a count of the number of pointers to that page. To setup COW, you replicate the array and increment the count on each page. A COW fault checks the count on the page, and replaces it with a new private copy if the count was > 1.

There are pros and cons to this approach versus our Mach-derived approach. I've often wondered if a hybrid approach didn't make sense where you used a more space efficient data structure to maintain just the per-pindex counts. The hypothesis being that the counts will typically be the same over the entire range of indices.
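
For readers who have not seen that scheme, a toy model of the bookkeeping being described is below. The names (struct amap, amap_cow_copy(), amap_cow_fault()) are invented for the illustration; this is not the Solaris/NetBSD implementation:

/*
 * Toy anon-array COW bookkeeping: each slot points to a page carrying a
 * reference count.  COW setup copies the slot array and bumps the counts;
 * a write fault copies the page only while it is still shared.
 * Allocation failures are not handled.
 */
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE   4096

struct toy_page {
        int     refcnt;                 /* slots pointing at this page */
        char    data[TOY_PAGE_SIZE];
};

struct amap {
        size_t          npages;
        struct toy_page **slots;        /* one slot per pindex */
};

/* COW setup: share every page and bump its reference count. */
struct amap *
amap_cow_copy(const struct amap *src)
{
        struct amap *dst = malloc(sizeof(*dst));
        size_t i;

        dst->npages = src->npages;
        dst->slots = malloc(src->npages * sizeof(*dst->slots));
        for (i = 0; i < src->npages; i++) {
                dst->slots[i] = src->slots[i];
                if (dst->slots[i] != NULL)
                        dst->slots[i]->refcnt++;        /* now shared */
        }
        return (dst);
}

/* Write fault: copy the page only if it is still shared. */
void
amap_cow_fault(struct amap *am, size_t pindex)
{
        struct toy_page *old = am->slots[pindex], *copy;

        if (old == NULL || old->refcnt == 1)
                return;                 /* already private */
        copy = malloc(sizeof(*copy));
        copy->refcnt = 1;
        memcpy(copy->data, old->data, TOY_PAGE_SIZE);
        old->refcnt--;
        am->slots[pindex] = copy;       /* private copy for this amap */
}

The per-page refcnt here plays the role of the per-pindex count discussed above; the space cost is one slot per page, which is what motivates the hybrid idea of a more compact structure when the counts are uniform across a range.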

In D16279#345790, @alc wrote:

Yes, that is what happens today. Look at vm_object_coalesce(). That said, I think that we might be able to recycle the range in the existing object from the preceding mapping if that object has the OBJ_ONEMAPPING flag set. However, in general, we would need per-pindex reference counts.

I can try to implement this optimization later this week once I've finished pmap_enter(psind == 1) for arm64.

BTW, from my reading it seems jemalloc is not actually unmapping ranges when it "cleans" them - it just mmaps over the existing range. On Linux it will actually just call madvise(MADV_DONTNEED) on the range, since that causes the kernel to zero the pages. (Am I right in believing that we currently don't have a kernel interface that would allow us to mimic this behaviour for the Linuxulator?)

It's using MAP_FIXED?

Hmm, sorry, I think I was mistaken. The code is a little hard to follow. extent_dalloc_wrapper() munmap()s the range before "decommitting" the range, i.e., calling mmap(PROT_NONE, MAP_FIXED).
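
For what it's worth, the munmap-then-reserve sequence being described looks roughly like this; a sketch, not jemalloc's extent_dalloc_wrapper():

/*
 * "Decommit" an address range: release the pages, then re-reserve the
 * same range with an inaccessible anonymous mapping so the addresses
 * stay claimed but are no longer backed by memory.
 */
#include <sys/mman.h>
#include <stddef.h>
#include <err.h>

void
decommit_range(void *addr, size_t len)
{
        if (munmap(addr, len) != 0)
                err(1, "munmap");
        if (mmap(addr, len, PROT_NONE,
            MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0) == MAP_FAILED)
                err(1, "mmap (PROT_NONE)");
}

The PROT_NONE, MAP_FIXED mapping keeps the range reserved while the earlier munmap() lets the kernel free the pages that backed it.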

There are pros and cons to this approach versus our Mach-derived approach. I've often wondered if a hybrid approach didn't make sense where you used a more space efficient data structure to maintain just the per-pindex counts. The hypothesis being that the counts will typically be the same over the entire range of indices.

I think you were at one point considering using blists for this purpose?

There is also a bug fix to the pv_chunk code on amd64 that is needed by arm64. Specifically, there is a race condition in the pv_chunk code that arises when the pvh_global_lock is removed.

I'll investigate and work on that too. I think I just found the last bug in my psind == 1 patch. :)

The needed commits are covered in the last column of the table that I've been posting.

                Uses            Has PV Alloc    Has COW         Needs r324665
                pv_chunk        Problem         Bug             and r325285
                -------------------------------------------------------------
amd64           Yes             Fixed           Fixed           Fixed
arm/pmap-v4.c   No              N/A             No[1]           N/A
arm/pmap-v6.c   Yes             No              No[2]           No[4]
arm64           Yes             Fixed           No[2]           Yes
i386            Yes             No              Fixed           No[4]
mips            Yes             No              Fixed           No[4]
powerpc/booke   No              N/A             No[2]           N/A
powerpc/oea     No              N/A[3]          No[2]           N/A
powerpc/oea64   No              N/A             No[2]           N/A
powerpc/pseries[5]
riscv           Yes             Fixed           Fixed           No[4]
sparc64         No              N/A             No[2]           N/A

[1] SMP is not supported.
[2] Performs "break-before-make".
[3] The comments say that the PV entry is reused, but it is not.  That said,
    the old PV entry is freed before the new one is allocated.  I believe
    that reuse could be beneficial because it would eliminate two O(log n)
    Red-Black tree operations.
[4] Still has pvh_global_lock.
[5] Literally derived from powerpc/oea64.