It looks like r332489 inadvertently removed the code which checks for
CPU support for large pages. As a result, pmap_pg_ps_enabled is always
0 on i386, so superpage promotion is disabled.
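For readers unfamiliar with the check in question, the logic can be sketched roughly as below. This is only an illustration, not the code from the commit: the helper name, the `tunable` parameter, and the `SKETCH_` macro are made up here. The only hard fact it relies on is that CPUID leaf 1 reports the page size extension (PSE) in bit 3 of EDX.

```c
#include <stdint.h>

/* CPUID.1:EDX bit 3 is the Page Size Extension (PSE) feature flag. */
#define SKETCH_CPUID_PSE 0x00000008u

/*
 * Hypothetical helper: decide whether superpage promotion may stay enabled.
 * `cpu_feature` stands in for the EDX word returned by CPUID leaf 1, and
 * `tunable` for the administrator's setting.  Neither name is taken from
 * pmap.c.
 */
static int
sketch_pg_ps_enabled(uint32_t cpu_feature, int tunable)
{
	if ((cpu_feature & SKETCH_CPUID_PSE) == 0)
		return (0);	/* no large-page support: force off */
	return (tunable != 0);
}
```

With the CPU check missing, a function like this would have no way to tell a PSE-capable CPU from one without large pages, which is the situation the summary describes.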
Details
Reviewers: alc, kib
Commits: rS336321: Restore the check for the page size extension after r332489.
I booted a kernel with this change on one of my laptops. I ran some builds
and did some browsing with firefox, which exercises the recently added
pmap_enter(psind = 1) support.
Event Timeline
sys/i386/i386/pmap.c, line 521: I guess this should be initialized to PG_PS per the comment, though it doesn't seem to matter in practice.
I find it weird that some places test pseflag and not pg_ps_enabled. We end up with superpage mappings even if the user explicitly disabled them. I do not even mind for the kernel mappings, but for pmap_object_init_pt() it is not correct.
Recent versions of firefox are multi-process and use shared memory, and I'm seeing that they execute pmap_enter(psind == 1) during faults on the shared memory regions.
Could you please post "procstat -v" output for one of the Firefox processes? I'm just curious to see it.
I see good and bad things there.
I'm glad to see automatic promotion on the code segment for libxul.so:
1089 0x22c00000 0x2782e000 r-x 11168 11667 12 7 CNS- vn /usr/local/lib/firefox/libxul.so
However, the overall number of map entries is frightening. Has jemalloc stopped using MADV_FREE and started munmap()ing aggressively? The pattern where you have high ref counts on underlying objects interspersed with ones with a low ref count suggests as much.
No, it still seems to be using MADV_FREE. (Linux finally gained support for that recently, BTW.)
As a side note, almost all free() calls in firefox end with this stack:
libc.so.7`0x8012e900a
libc.so.7`heapsort+0xa5a
libc.so.7`heapsort+0x934
libc.so.7`__bsd_iconvlist+0x278
libc.so.7`__bsd_iconv+0x36
libc.so.7`_citrus_stdenc_open+0x63
libc.so.7`mbstowcs+0x20
Seems like we are sorting the same data over and over.
For example, take a look at lines 2253 through 2296 in P191. At some point in the past, firefox had a valid range of addresses starting at 0x33e09000 and ending at 0x33ed4000. Then, a bunch of munmap()s fragmented the range, and finally new allocations were created within the holes.
This seems like a newish behavior, starting with jemalloc 5.0. The relevant details seem to be:
Unlike all previous jemalloc releases, this release does not use naturally aligned "chunks" for virtual memory management, and instead uses page-aligned "extents". This change has few externally visible effects, but the internal impacts are... extensive.
- Implement two-phase decay of unused dirty pages. Pages transition from dirty-->muzzy-->clean, where the first phase transition relies on madvise(... MADV_FREE) semantics, and the second phase transition discards pages such that they are replaced with demand-zeroed pages on next access. (@jasone)
opt.muzzy_decay_ms (ssize_t) r- Approximate time in milliseconds from the creation of a set of unused muzzy pages until an equivalent set of unused muzzy pages is purged (i.e. converted to clean) and/or reused. Muzzy pages are defined as previously having been unused dirty pages that were subsequently purged in a manner that left them subject to the reclamation whims of the operating system (e.g. madvise(...MADV_FREE)), and therefore in an indeterminate state. The pages are incrementally purged according to a sigmoidal decay curve that starts and ends with zero purge rate. A decay time of 0 causes all unused muzzy pages to be purged immediately upon creation. A decay time of -1 disables purging. The default decay time is 10 seconds. See arenas.muzzy_decay_ms and arena.<i>.muzzy_decay_ms for related dynamic control options.
I speculate that setting muzzy_decay_ms to -1 will stop the fragmentation.
Some quick testing suggests that it helps. I tried starting firefox and opening three fairly resource-intensive websites, and counted the total number of vm_map entries among all firefox processes. Once everything has loaded, I consistently see about half the total number of entries (~3800 vs. ~7500) when malloc.conf is set as you suggest. I'll try this on my desktop today - when I last killed firefox, I had about 80 tabs open and about 110,000 map entries(!).
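For anyone who wants to try the same setting from inside a program rather than system-wide, jemalloc also reads a `malloc_conf` global if the application defines one. A minimal sketch, assuming the program is linked against an unprefixed jemalloc such as the one in FreeBSD's libc (a prefixed build would use a different symbol name):

```c
/*
 * jemalloc consults this global at startup if the program defines it.
 * The option string matches what one would put in the MALLOC_CONF
 * environment variable: disable purging of muzzy pages.
 */
const char *malloc_conf = "muzzy_decay_ms:-1";
```

This is only meant to show the shape of the option string. The tests described above used malloc.conf, not this mechanism.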
The fundamental issue is that the object's ref count is too coarse-grained. A per-pindex (within the object) ref count is needed. In other words, the ability to precisely count the number of mappings to an arbitrary window of indices within an object. Then, when jemalloc munmap()s and later mmap()s a subregion (of a larger valid region), the VM system can recognize that the existing object from the larger valid region can be reused because there were no mappings to the corresponding subrange of the existing object.
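To make the per-pindex idea concrete, here is a deliberately naive sketch that keeps one counter per page index. All names are hypothetical and nothing here reflects the real vm_object layout. A real design would want something far more compact than a flat array.

```c
#include <stddef.h>
#include <stdbool.h>

#define SKETCH_NPAGES 16	/* illustrative object size in pages */

/* Hypothetical object with one mapping count per page index. */
struct sketch_object {
	unsigned int map_count[SKETCH_NPAGES];
};

/* Record a new mapping of pindices [start, end). */
static void
sketch_map(struct sketch_object *obj, size_t start, size_t end)
{
	for (size_t i = start; i < end; i++)
		obj->map_count[i]++;
}

/* Drop a mapping of pindices [start, end). */
static void
sketch_unmap(struct sketch_object *obj, size_t start, size_t end)
{
	for (size_t i = start; i < end; i++)
		obj->map_count[i]--;
}

/* True if nothing maps [start, end), so that part of the object is reusable. */
static bool
sketch_range_reusable(const struct sketch_object *obj, size_t start,
    size_t end)
{
	for (size_t i = start; i < end; i++)
		if (obj->map_count[i] != 0)
			return (false);
	return (true);
}
```

After mapping the whole object and unmapping a window in the middle, `sketch_range_reusable()` returns true for that window only, which is exactly the information a single object-wide ref count cannot provide.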
Are you familiar with the anon structure in the System V/Solaris/NetBSD-style of COW implementation?
To be a bit more concrete, suppose a mapped VM object spans pindices [0, n], and we unmap the range in [x1, x2] where 0 < x1 < x2 < n, and then map it again. Are you observing that we'll instantiate a new object to back the range originally backed by [x1, x2]?
BTW, from my reading it seems jemalloc is not actually unmapping ranges when it "cleans" them - it just mmaps over the existing range. On Linux it will actually just call madvise(MADV_DONTNEED) on the range, since that causes the kernel to zero the pages. (Am I right in believing that we currently don't have a kernel interface that would allow us to mimic this behaviour for the Linuxulator?)
> Are you familiar with the anon structure in the System V/Solaris/NetBSD-style of COW implementation?
Not very. I do know that at least Solaris maintains an array of "slots" tracking per-page protection and madvise info in a given anonymous memory region, but am not sure how it's used in COW.
Yes, that is what happens today. Look at vm_object_coalesce(). That said, I think that we might be able to recycle the range in the existing object from the preceding mapping if that object has the OBJ_ONEMAPPING flag set. However, in general, we would need per-pindex reference counts.
> BTW, from my reading it seems jemalloc is not actually unmapping ranges when it "cleans" them - it just mmaps over the existing range. On Linux it will actually just call madvise(MADV_DONTNEED) on the range, since that causes the kernel to zero the pages. (Am I right in believing that we currently don't have a kernel interface that would allow us to mimic this behaviour for the Linuxulator?)
It's using MAP_FIXED?
>> Are you familiar with the anon structure in the System V/Solaris/NetBSD-style of COW implementation?
> Not very. I do know that at least Solaris maintains an array of "slots" tracking per-page protection and madvise info in a given anonymous memory region, but am not sure how it's used in COW.
Essentially, this array contains vm_page_t pointers, and within the vm_page structure a count of the number of pointers to that page. To setup COW, you replicate the array and increment the count on each page. A COW fault checks the count on the page, and replaces it with a new private copy if the count was > 1.
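The description above can be sketched in a few lines of user-space C. This is a toy model under stated assumptions: the `sketch_` names, the fixed page size, and the use of malloc() are all illustrative and are not taken from any of the kernels mentioned.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical page: data plus a count of anon-array slots pointing at it. */
struct sketch_page {
	unsigned int ref;
	unsigned char data[SKETCH_PAGE_SIZE];
};

/* Set up COW: replicate the slot array and bump each page's count. */
static struct sketch_page **
sketch_cow_setup(struct sketch_page **slots, size_t n)
{
	struct sketch_page **copy = malloc(n * sizeof(*copy));

	if (copy == NULL)
		return (NULL);
	for (size_t i = 0; i < n; i++) {
		copy[i] = slots[i];
		if (copy[i] != NULL)
			copy[i]->ref++;
	}
	return (copy);
}

/* Write fault on slot i: copy the page only if it is still shared. */
static struct sketch_page *
sketch_cow_fault(struct sketch_page **slots, size_t i)
{
	struct sketch_page *old = slots[i], *priv;

	if (old->ref == 1)
		return (old);		/* sole owner: write in place */
	priv = malloc(sizeof(*priv));
	if (priv == NULL)
		return (NULL);
	memcpy(priv->data, old->data, SKETCH_PAGE_SIZE);
	priv->ref = 1;
	old->ref--;
	slots[i] = priv;
	return (priv);
}
```

Note that the second writer to fault finds the count back at 1 and keeps the original page, so only one copy is made per shared page.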
There are pros and cons to this approach versus our Mach-derived approach. I've often wondered if a hybrid approach didn't make sense where you used a more space efficient data structure to maintain just the per-pindex counts. The hypothesis being that the counts will typically be the same over the entire range of indices.
I can try to implement this optimization later this week once I'm finished with pmap_enter(psind == 1) for arm64.
>> BTW, from my reading it seems jemalloc is not actually unmapping ranges when it "cleans" them - it just mmaps over the existing range. On Linux it will actually just call madvise(MADV_DONTNEED) on the range, since that causes the kernel to zero the pages. (Am I right in believing that we currently don't have a kernel interface that would allow us to mimic this behaviour for the Linuxulator?)
> It's using MAP_FIXED?
Hmm, sorry, I think I was mistaken. The code is a little hard to follow. extent_dalloc_wrapper() munmap()s the range before "decommitting" the range, i.e., calling mmap(PROT_NONE, MAP_FIXED).
> There are pros and cons to this approach versus our Mach-derived approach. I've often wondered if a hybrid approach didn't make sense where you used a more space efficient data structure to maintain just the per-pindex counts. The hypothesis being that the counts will typically be the same over the entire range of indices.
I think you were at one point considering using blists for this purpose?
There is also a bug fix to the pv_chunk code on amd64 that is needed by arm64. Specifically, there is a race condition in the pv_chunk code that arises when the pvh_global_lock is removed.
I'll investigate and work on that too. I think I just found the last bug in my psind == 1 patch. :)
The needed commits are covered in the last column of the table that I've been posting.
                     Uses       Has PV Alloc   Has COW   Needs r324665
                     pv_chunk   Problem        Bug       and r325285
------------------------------------------------------------------------
amd64                Yes        Fixed          Fixed     Fixed
arm/pmap-v4.c        No         N/A            No[1]     N/A
arm/pmap-v6.c        Yes        No             No[2]     No[4]
arm64                Yes        Fixed          No[2]     Yes
i386                 Yes        No             Fixed     No[4]
mips                 Yes        No             Fixed     No[4]
powerpc/booke        No         N/A            No[2]     N/A
powerpc/oea          No         N/A[3]         No[2]     N/A
powerpc/oea64        No         N/A            No[2]     N/A
powerpc/pseries[5]
riscv                Yes        Fixed          Fixed     No[4]
sparc64              No         N/A            No[2]     N/A

[1] SMP is not supported.
[2] Performs "break-before-make".
[3] The comments say that the PV entry is reused, but it is not. That said, the old PV entry is freed before the new one is allocated. I believe that reuse could be beneficial because it would eliminate two O(log n) Red-Black tree operations.
[4] Still has pvh_global_lock.
[5] Literally derived from powerpc/oea64.