
Add support for psind == 1 to pmap_enter() on amd64 and use that support in vm_fault()'s "fast path"
Closed, Public

Authored by alc on Jul 11 2017, 4:25 AM.

Details

Summary

When we find ourselves in vm_fault()'s "fast path", we check to see if the 4 KB page that we faulted on belongs to an existing 2 MB superpage. (Imagine that another process or processes with which we share memory have already fully validated the superpage.) Moreover, we check to see if the current process has the entire superpage mapped in its address space at the machine-independent layer (just not in its page table). If these conditions are met, then we instruct pmap_enter() to create a 2 MB page mapping, bypassing the promotion process. This change results in a significant reduction in the number of page faults for PostgreSQL.
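
For reference, here is a condensed sketch of the fast-path test described above, written against helper names that appear later in this review (vm_reserv_to_superpage(), vm_page_ps_test(), pmap_ps_enabled()); it is illustrative, and the committed code differs in detail:

static bool
fault_soft_fast_superpage_test(struct faultstate *fs, vm_offset_t vaddr,
    vm_prot_t prot, vm_page_t m)
{
        vm_page_t m_super;
        int flags;

        if ((m->flags & PG_FICTITIOUS) != 0 ||
            (m_super = vm_reserv_to_superpage(m)) == NULL ||
            !pmap_ps_enabled(fs->map->pmap))
                return (false);
        /* The entire 2MB range must lie within the map entry ... */
        if (rounddown2(vaddr, pagesizes[1]) < fs->entry->start ||
            roundup2(vaddr + 1, pagesizes[1]) > fs->entry->end)
                return (false);
        /* ... and the virtual and physical superpage offsets must agree. */
        if ((vaddr & (pagesizes[1] - 1)) !=
            (VM_PAGE_TO_PHYS(m) & (pagesizes[1] - 1)))
                return (false);
        flags = PS_ALL_VALID;
        if ((prot & VM_PROT_WRITE) != 0)
                flags |= PS_ALL_DIRTY | PS_NONE_BUSY;
        /* skip_m == m: exempt the faulted-on page from the dirty test. */
        return (vm_page_ps_test(m_super, flags, m));
}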

Currently, there is one edge case involving the use of pmap_enter(psind == 1) on the kernel pmap that isn't handled. Specifically, the patch doesn't handle an allocation failure for the radix trie node that is required to keep the kernel page table page in reserve after creating the 2 MB page mapping. However, the fast path shouldn't be creating kernel mappings.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

There are a very large number of changes, so older changes are hidden.
amd64/amd64/pmap.c
3776 ↗(On Diff #30634)

Regarding the first question, the answer is "Yes." Historically, pmap_invalidate_all() was never used on the kernel pmap, because it didn't invalidate PG_G mappings.

Regarding the second question, the answer is that invlpg instructions have been used only when necessary because they have been quite costly on some x86 implementations. For example, my vague recollection is that they took 500 to 1000 cycles on the Pentium 4, whereas contemporaneous AMD processors took about 100 cycles. I don't know what their current cost is, but they are serializing instructions.

That said, I do think that we are underutilizing pmap_invalidate_range(). For example, when the new pmap_enter_pde() calls this function under PostgreSQL, we are typically invalidating fewer than 100 mappings out of the 512 entries in the page table page. However, I don't know whether those mappings were contiguous and so would have generated a single pmap_invalidate_range() call.

vm/vm_page.h
449 ↗(On Diff #30634)

There is no reason.

Use hexadecimal values for vm_page_ps_test() flags.
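
For reference, a sketch of those flag definitions in hexadecimal; the names appear elsewhere in this review, but the exact values here are illustrative:

#define PS_ALL_DIRTY    0x1
#define PS_ALL_VALID    0x2
#define PS_NONE_BUSY    0x4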

alc marked 2 inline comments as done. Jul 12 2017, 2:05 AM

Is it possible to extract and commit the refactoring bits in advance? I mean pmap_enter_2mpage(), pmap_remove_ptes(), and vm_page_ps_test(), at least.

vm/vm_fault.c
286 ↗(On Diff #30663)

Don't you need to check the (vaddr % page index) alignment?

292 ↗(On Diff #30663)

I am not sure about non-amd64 pmaps. Is the assumption that a writeable superpage is dirty amd64-specific?

I understand that this block is under #ifdef amd64, but later the call to vm_fault_dirty() is arch-independent and gated on psind == 0.

In D11556#239313, @kib wrote:

Is it possible to extract and commit the refactoring bits in advance? I mean pmap_enter_2mpage(), pmap_remove_ptes(), and vm_page_ps_test(), at least.

vm_page_ps_test() is easy. Let me start with that. I propose to commit that part of this change in about 9 hours. Can I declare that part as having been reviewed by you guys?

In D11556#239751, @alc wrote:
In D11556#239313, @kib wrote:

Is it possible to extract and commit the refactoring bits in advance? I mean pmap_enter_2mpage(), pmap_remove_ptes(), and vm_page_ps_test(), at least.

vm_page_ps_test() is easy. Let me start with that. I propose to commit that part of this change in about 9 hours. Can I declare that part as having been reviewed by you guys?

vm_page_ps_test() looks fine to me.

In D11556#239751, @alc wrote:

vm_page_ps_test() is easy. Let me start with that. I propose to commit that part of this change in about 9 hours. Can I declare that part as having been reviewed by you guys?

Sure.

Peel off the part of this change that generalizes vm_page_ps_is_valid() and commit it.

vm/vm_fault.c
309 ↗(On Diff #30767)

This "psind == 0" test is wrong and should be removed. vm_fault_dirty() should always be called. The current fault might well be the one that dirties 'm'. In fact, the whole point of the last argument to vm_page_ps_test() was to handle such situations.

286 ↗(On Diff #30663)

Yes.

Correct a couple errors in vm_fault_soft_fast().

alc marked 3 inline comments as done. Jul 14 2017, 3:29 AM
amd64/amd64/pmap.c
4402–4403 ↗(On Diff #30768)

I think that a KASSERT(m->psind == 1, ...) here would be appropriate. pmap_enter_pde() asserts that the virtual address is superpage aligned, but nowhere is there an assert on the physical address.

amd64/amd64/pmap.c
4402–4403 ↗(On Diff #30768)

Maybe the assert on the alignment of the physical address is more useful then?

vm/vm_fault.c
309 ↗(On Diff #30767)

It took some time to convince myself that vm_fault_dirty() there only needs to be called on m and not on all the pages making up the superpage. Perhaps a comment is due there.

amd64/amd64/pmap.c
4402–4403 ↗(On Diff #30768)

My rationale for using m->psind was that the physical address could be "accidentally" aligned, whereas m->psind != 0 guarantees that the entire superpage starting at m is allocated as well as aligned.

Assert that the page being mapped is actually a superpage when the psind value passed to pmap_enter() is 1. Don't initialize mpte when it may never be used.

amd64/amd64/pmap.c
4402–4403 ↗(On Diff #30768)

I would check both facts then.
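
A sketch of checking both facts, as agreed above; the assertion messages are illustrative:

KASSERT(psind == 0 || m->psind > 0,
    ("pmap_enter: psind == 1, but m is not the start of a superpage"));
KASSERT(psind == 0 || (VM_PAGE_TO_PHYS(m) & (pagesizes[1] - 1)) == 0,
    ("pmap_enter: physical address %#lx is not 2MB-aligned",
    (u_long)VM_PAGE_TO_PHYS(m)));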

alc marked an inline comment as done. Jul 14 2017, 4:23 PM
In D11556#239313, @kib wrote:

Is it possible to extract and commit the refactoring bits in advance? I mean pmap_enter_2mpage(), pmap_remove_ptes(), and vm_page_ps_test(), at least.

I propose to peel off the refactoring of pmap_remove(), creating pmap_remove_ptes(), next. Do you guys have any further questions about pmap_remove_ptes()?

In D11556#240188, @alc wrote:

I propose to peel off the refactoring of pmap_remove(), creating pmap_remove_ptes(), next. Do you guys have any further questions about pmap_remove_ptes()?

That bit looks fine to me.

amd64/amd64/pmap.c
3782 ↗(On Diff #30789)

I actually do not understand this break.

amd64/amd64/pmap.c
3782 ↗(On Diff #30789)

Ignore me, it is fine.

Here is some data on PostgreSQL execution that I collected with the initial version of this patch.

To be clear, I am using a database that is three times larger than physical memory. The database is stored on a Samsung 850 PRO SSD, so my tests are I/O bound, not CPU bound. Adjust your expectations accordingly. Nonetheless, I see a slight but consistent increase in TPS.

I have configured PostgreSQL to use 1/4 of physical memory for shared buffers. I am not using unmanaged pages for the shared buffers.

My machine has six cores and I am running the following test script:

$ for i in 15 15 15 15 15 15; do
> sysctl vm.pmap.pde vm.reserv vm.stats.vm
> pgbench -j 1 -c $i -T 1800 -S bench
> done ; sysctl vm.pmap.pde vm.reserv vm.stats.vm

After the first iteration, TPS results don't vary much at all, at most 10 TPS.

Before the patch:

number of transactions actually processed: 20717230
tps = 11509.560124 (including connections establishing)
tps = 11509.740169 (excluding connections establishing)

After the patch:

number of transactions actually processed: 20990549
tps = 11661.192676 (including connections establishing)
tps = 11661.376262 (excluding connections establishing)

Here is some raw data after the 2nd iteration:

Before the patch:

vm.pmap.pde.promotions: 400
vm.pmap.pde.p_failures: 6882801
vm.pmap.pde.mappings: 3879
vm.pmap.pde.demotions: 110
...
vm.reserv.fullpop: 3112
...
vm.stats.vm.v_vm_faults: 59527355

After the patch:

vm.pmap.pde.promotions: 407
vm.pmap.pde.p_failures: 59
vm.pmap.pde.mappings: 134076
vm.pmap.pde.demotions: 116
...
vm.reserv.fullpop: 3126
...
vm.stats.vm.v_vm_faults: 19490732

The most striking thing about the counts is the 3-fold reduction in page faults.

The next most striking thing to me was the difference between fullpops and promotions. In both cases, fullpop was greater than 3100, which says that we have 3100+ physical superpages, yet we never promoted more than 407 virtual-to-physical mappings to 2 MB page mappings. Prior to the patch, most of the 2 MB page mappings were being created by fork(2). In other words, whatever promotions occurred in the master (template) process got carried over to the child processes by pmap_copy() on fork(2).

Peel off the part of this change that extracts the innermost loop of pmap_remove() into a new function, pmap_remove_ptes().

In D11556#239313, @kib wrote:

Is it possible to extract and commit the refactoring bits in advance? I mean pmap_enter_2mpage(), pmap_remove_ptes(), and vm_page_ps_test(), at least.

It's hard to disentangle the pmap_enter_2mpage() / pmap_enter_pde() refactoring from the new code.

That said, there is one purely stylistic change in pmap_enter_pde() that I could peel off. I changed from the variable name "mpde" to "pdpg" for the vm_page_t-type variable that references a PD page. About half of the pmap code uses "pdpg" and the other half uses "mpde". I've come to prefer "pdpg". I could do a global replace and make this consistent throughout the file. This would reduce the size of this change by a few lines. Opinions?

In D11556#240303, @alc wrote:

It's hard to disentangle the pmap_enter_2mpage() / pmap_enter_pde() refactoring from the new code.

Ok.

That said, there is one purely stylistic change in pmap_enter_pde() that I could peel off. I changed from the variable name "mpde" to "pdpg" for the vm_page_t-type variable that references a PD page. About half of the pmap code uses "pdpg" and the other half uses "mpde". I've come to prefer "pdpg". I could do a global replace and make this consistent throughout the file. This would reduce the size of this change by a few lines. Opinions?

Go for it.

Here is another test case that benefits from this change:

#include <sys/param.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#include <err.h>
#include <stdio.h>

#define SHM_SIZE        (8 * 1024 * 1024)

int
main(int argc, char *argv[])
{
        key_t key;
        int shmid;
        char *data;
        int i, sum;

        key = ftok("/etc/motd", 63);
        if ((shmid = shmget(key, SHM_SIZE, IPC_CREAT | SHM_R | SHM_W)) == -1)
                err(1, "shmget");
        if ((data = shmat(shmid, 0, 0)) == (void *)-1)
                err(1, "shmat");
        printf("data = %p\n", data);
        sum = 0;
        /* Touch every byte so that each 4KB page of the segment is dirtied. */
        for (i = 0; i < SHM_SIZE; i++) {
                data[i] += i;
                sum += data[i];
        }
        printf("sum = %d\n", sum);
        /* Pause so the mappings can be inspected before detaching. */
        getchar();
        if (shmdt(data) == -1)
                warn("shmdt");
#if 0
        if (shmctl(shmid, IPC_RMID, NULL) == -1)
                warn("shmctl");
#endif
        return (0);
}

The first execution of this program will go through the normal page allocation, mapping, and eventual promotion process. (If you enable kern.ipc.shm_use_phys, then the number of page faults is reduced. Essentially, a single page fault iteratively maps and eventually promotes each superpage.)

The second and subsequent executions are where we see a difference. Initially, shmat() prefaults read-only superpage mappings without the PG_A bit set because these are speculative mappings. Then, the program faults on a write access. Without this change, we demote the mapping and reenable write access page-by-page, faulting on each page until eventually we repromote with a writeable superpage mapping. With this change, on the first page fault on each superpage we recognize that the underlying physical superpage is all dirty and immediately recreate a writeable superpage mapping.

In D11556#240398, @alc wrote:

The second and subsequent executions are where we see a difference. Initially, shmat() prefaults read-only superpage mappings without the PG_A bit set because these are speculative mappings. Then, the program faults on a write access. Without this change, we demote the mapping and reenable write access page-by-page, faulting on each page until eventually we repromote with a writeable superpage mapping. With this change, on the first page fault on each superpage we recognize that the underlying physical superpage is all dirty and immediately recreate a writeable superpage mapping.

This finally prompted me to ask the question which I have had for some time. Why do you test for PS_ALL_DIRTY at all? If the page fault which causes the psind == 1 mapping is a read and the next access is a write, then the whole superpage becomes dirty. So, why not dirty all pages when mapping the superpage?

I understand that this is not how current superpage promotion works, but the in-tree promotion code has to operate inside the pmap. With the high-level vm_fault() code knowing about superpage mappings, I believe there are no architectural constraints like that.

In D11556#240401, @kib wrote:
In D11556#240398, @alc wrote:

The second and subsequent executions are where we see a difference. Initially, shmat() prefaults read-only superpage mappings without the PG_A bit set because these are speculative mappings. Then, the program faults on a write access. Without this change, we demote the mapping and reenable write access page-by-page, faulting on each page until eventually we repromote with a writeable superpage mapping. With this change, on the first page fault on each superpage we recognize that the underlying physical superpage is all dirty and immediately recreate a writeable superpage mapping.

This finally prompted me to ask the question which I have had for some time. Why do you test for PS_ALL_DIRTY at all? If the page fault which causes the psind == 1 mapping is a read and the next access is a write, then the whole superpage becomes dirty. So, why not dirty all pages when mapping the superpage?

I'm not sure if this is the answer to your question. Consider an mmap(..., PROT_WRITE, MAP_SHARED, fd, ...) on an actual file. I'm trying to avoid a single store making us treat the entire superpage as dirty, i.e., have to write it back to disk. Strictly speaking, the extra parameter, "skip_m", to vm_page_ps_test() is allowing us to ask whether all pages other than the one that we faulted on are dirty, and we allow the psind == 1 mapping to occur even if the one page that we faulted on is not yet dirty and we took a read fault on it, not a write fault. In a sense, I am predicting that this one page will be dirtied based on the fact that its 511 peers in the same physical superpage already are. In this case, if I'm wrong, we only wrote an extra 4KB to disk.

I understand that this is not how current superpage promotion works, but the in-tree promotion code has to operate inside the pmap. With the high-level vm_fault() code knowing about superpage mappings, I believe there are no architectural constraints like that.
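
For concreteness, a sketch of how the skip_m exemption described above might look inside vm_page_ps_test()'s loop; the loop structure and variable names are assumptions, not the committed code:

        /* npages is the number of 4KB pages in the superpage, i.e., 512. */
        for (i = 0; i < npages; i++) {
                m = &m_super[i];
                /*
                 * Predict that the faulted-on page (skip_m) is about to be
                 * dirtied by this very fault, so exempt it from the
                 * all-dirty requirement.
                 */
                if ((flags & PS_ALL_DIRTY) != 0 && m != skip_m &&
                    m->dirty != VM_PAGE_BITS_ALL)
                        return (false);
        }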

Peel off a style change: Rename "mpde" to "pdpg". Tweak some KASSERT()s and add a comment.

amd64/amd64/pmap.c
4402–4403 ↗(On Diff #30768)

The above KASSERT()s check the exact same conditions that pmap_enter_object() has long used to verify both virtual and physical alignment before attempting a superpage mapping. m->psind > 0 is simultaneously guaranteeing alignment and that the physical superpage is fully allocated.

In D11556#240402, @alc wrote:

I'm not sure if this is the answer to your question. Consider an mmap(..., PROT_WRITE, MAP_SHARED, fd, ...) on an actual file. I'm trying to avoid a single store making us treat the entire superpage as dirty, i.e., have to write it back to disk. Strictly speaking, the extra parameter, "skip_m", to vm_page_ps_test() is allowing us to ask whether all pages other than the one that we faulted on are dirty, and we allow the psind == 1 mapping to occur even if the one page that we faulted on is not yet dirty and we took a read fault on it, not a write fault. In a sense, I am predicting that this one page will be dirtied based on the fact that its 511 peers in the same physical superpage already are. In this case, if I'm wrong, we only wrote an extra 4KB to disk.

Yes, but you are only considering two corner cases: where all pages except the current one are clean, and where all pages except the current one are dirty. If the argument is that forcing all pages of the superpage to become dirty is too costly because all of them then need to be paged out, note that we might equally observe a write after the promotion, in which case we either have to demote or declare all pages dirty.

Now, with the high-level VM knowing about superpages, we can apply more involved policies at the place that calls pmap_enter(). E.g., we might count the number of dirty pages forming the superpage and request psind == 1 if the count >= 1/4 of 512, or any other arbitrary constant, or even make it a tunable to allow tuning for specific loads.

In D11556#240408, @kib wrote:
In D11556#240402, @alc wrote:

I'm not sure if this is the answer to your question. Consider an mmap(..., PROT_WRITE, MAP_SHARED, fd, ...) on an actual file. I'm trying to avoid a single store making us treat the entire superpage as dirty, i.e., have to write it back to disk. Strictly speaking, the extra parameter, "skip_m", to vm_page_ps_test() is allowing us to ask whether all pages other than the one that we faulted on are dirty, and we allow the psind == 1 mapping to occur even if the one page that we faulted on is not yet dirty and we took a read fault on it, not a write fault. In a sense, I am predicting that this one page will be dirtied based on the fact that its 511 peers in the same physical superpage already are. In this case, if I'm wrong, we only wrote an extra 4KB to disk.

Yes, but you are only considering two corner cases: where all pages except the current one are clean, and where all pages except the current one are dirty. If the argument is that forcing all pages of the superpage to become dirty is too costly because all of them then need to be paged out, note that we might equally observe a write after the promotion, in which case we either have to demote or declare all pages dirty.

Now, with the high-level VM knowing about superpages, we can apply more involved policies at the place that calls pmap_enter(). E.g., we might count the number of dirty pages forming the superpage and request psind == 1 if the count >= 1/4 of 512, or any other arbitrary constant, or even make it a tunable to allow tuning for specific loads.

Ok, now I understand. Yes, we could apply more speculative policies. For example, my temptation would be to distinguish between OBJT_VNODE and OBJT_DEFAULT, applying a higher threshold to the former. Already, along these lines, we don't care whether unmanaged pages are dirty or not when creating a writeable mapping.
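
A hypothetical sketch of such a threshold policy; the function, dirty_count, and the cutoffs below are purely illustrative and not part of this change:

static bool
superpage_mapping_policy(vm_object_t object, int dirty_count)
{
        int threshold;

        /*
         * Demand that every page already be dirty for file-backed objects,
         * where falsely declaring pages dirty costs writeback I/O, but
         * accept a quarter-dirty superpage for anonymous memory.
         */
        threshold = object->type == OBJT_VNODE ? 512 : 128;
        return (dirty_count >= threshold);
}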

I just double-checked my recollection about PostgreSQL: Relaxing the all-dirty requirement has no effect; specifically, there is no reduction in page faults.

In D11556#240441, @alc wrote:

I just double-checked my recollection about PostgreSQL: Relaxing the all-dirty requirement has no effect; specifically, there is no reduction in page faults.

Postgres has a very specific usage pattern: it first writes the whole large buffer, and then reads it as a whole.

Add a comment to vm_fault_soft_fast().

vm/vm_fault.c
292 ↗(On Diff #30663)

I am not sure about non-amd64 pmaps. Is the assumption that a writeable superpage is dirty amd64-specific?

I'm pretty confident that the arm v6 pmap is the same. I don't think that the arm64 pmap implements a proper dirty bit. Specifically, I think that any writeable page may be regarded as dirty.

This revision is now accepted and ready to land. Jul 16 2017, 5:15 PM

I'm able to trip the "freeing mapped page" assertion in vm_page_free_toq() when running with this diff applied. firefox was exiting after dumping core (almost certainly also a result of this change) and was deallocating objects and freeing their pages. I have to go AFK for a few hours now but will debug further when I get back.

On an unrelated note, does pmap_enter(psind > 0) create large mappings even when pg_ps_enabled = 0?

I'm able to trip the "freeing mapped page" assertion in vm_page_free_toq() when running with this diff applied. firefox was exiting after dumping core (almost certainly also a result of this change) and was deallocating objects and freeing their pages. I have to go AFK for a few hours now but will debug further when I get back.

On an unrelated note, does pmap_enter(psind > 0) create large mappings even when pg_ps_enabled = 0?

Hmm, yes. Worse yet, pagesizes[1] is 0 when pg_ps_enabled is 0. That could produce strange behavior in vm_fault_soft_fast().

To be clear, when pg_ps_enabled is 0, the reservation code will still be active and "promote" physical pages, i.e., set m->psind to 1. Prior to this patch, we just didn't create superpage mappings.

Did you have pg_ps_enabled set to 0?

alc edited edge metadata.

Export the function pmap_ps_enabled() from the amd64 pmap, and call it from vm_fault_soft_fast() to determine if superpage mappings are enabled at the pmap level.
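
A sketch of the exported predicate; this matches my reading of the amd64 code, but treat the body as an assumption:

bool
pmap_ps_enabled(pmap_t pmap)
{

        return (pg_ps_enabled && (pmap->pm_flags & PMAP_PDE_SUPERPAGE) != 0);
}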

This revision now requires review to proceed. Jul 16 2017, 10:54 PM
In D11556#240559, @alc wrote:

To be clear, when pg_ps_enabled is 0, the reservation code will still be active and "promote" physical pages, i.e., set m->psind to 1. Prior to this patch, we just didn't create superpage mappings.

Did you have pg_ps_enabled set to 0?

Isilon does, but we do have reservations enabled. It appears to have been disabled back when we were on FreeBSD 7 (i.e., before my time there), but I can't find any reasoning for it in the commit logs. I believe Jeff experimented with enabling them earlier this year and found that it didn't give much of an improvement, but I don't know the details. I'll try to find out more.

vm/vm_fault.c
286 ↗(On Diff #30845)

Mark, I'm pretty confident that the explanation for your assertion failure is the lack of a "+ 1" here. The previous line should round up "vaddr + 1", not "vaddr".
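
The fix in miniature; pagesizes[1] is used here for brevity:

        /*
         * For a superpage-aligned vaddr, roundup2(vaddr, pagesizes[1]) is
         * vaddr itself, so the containment check would pass even when the
         * superpage extends beyond the map entry; rounding up vaddr + 1
         * yields the true end of the candidate 2MB range.
         */
        if (roundup2(vaddr + 1, pagesizes[1]) <= fs->entry->end) {
                /* The candidate superpage mapping fits within the entry. */
        }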

Correct a rounding error in vm_fault_soft_fast().

vm/vm_fault.c
286 ↗(On Diff #30845)

That seems to have done the trick. :)

Peter, could you please run this patch through your battery of tests?

amd64/amd64/pmap.c
4695–4696 ↗(On Diff #30851)

I really wish we had a more superpage-friendly optimization than this simple per-page flag for tracking the possible existence of writeable mappings. This is the only part of pmap_enter_pde() that does 512 per-page operations. This is also why I didn't revise pmap_copy() to use pmap_enter_pde() as a helper function. If pmap_copy() is copying a PG_RW PDE, it can assume that the PGA_WRITEABLE flag is already set.
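
The per-page cost in question, sketched; the loop bounds are assumptions about pmap_enter_pde()'s shape:

        /*
         * The lone 512-iteration loop: flag every 4KB page as possibly
         * having a writeable mapping before installing a PG_RW PDE.
         */
        if ((newpde & PG_RW) != 0)
                for (mt = m; mt < &m[NBPDR / PAGE_SIZE]; mt++)
                        vm_page_aflag_set(mt, PGA_WRITEABLE);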

This revision is now accepted and ready to land. Jul 17 2017, 4:05 PM
vm/vm_fault.c
286 ↗(On Diff #30845)

Your assertion failure strongly suggests the following rather interesting scenario occurred: You are mmap()ing a large shared library, and the last N pages of code were memory resident, but N is less than 512, so it can't be a superpage mapping. However, the next 512 - N pages of presumably copy-on-write initialized data is also resident. That is how the vm_reserv_to_superpage() check passed. On amd64 there is always a superpage boundary between the end of the code segment and the beginning of the data, so we could use the data as read/execute-only padding to fill out a superpage mapping for the end of the code segment.

vm/vm_fault.c
286 ↗(On Diff #30845)

I'll note a few things that I observed while debugging yesterday:

  • The page triggering the assertion usually belonged to an OBJT_SWAP object allocated by shmget(). In one case, though, it belonged to an OBJT_DEFAULT object.
  • Using KTR I could see that we were promoting the entries in a PTP very shortly before we entered a 2MB mapping for those pages in another pmap. The lingering pv entry that triggered the assertion failure always came from a 4KB mapping in the range of the promoted mapping.
  • Both 2MB mappings were demoted shortly before the crash, which was triggered by exiting firefox. I'm not sure why, though, since pmap_remove_pages() doesn't demote 2MB mappings.
  • The crash didn't occur reproducibly; it usually took a few tries. In most cases firefox didn't segfault as I had initially reported.

vm/vm_fault.c
286 ↗(On Diff #30845)

I'm a little puzzled at how the bug could have led to an assertion failure with shmat(). The obvious trigger is a situation where a subrange of a vm object gets mapped, and the bug leads to mappings at the pmap level without corresponding mappings at the vm map entry level, so we fail to do a pmap_remove() before vm_page_free().

I can trigger this with ease:
panic: Assertion m_super->object == fs->first_object failed at ../../../vm/vm_fault.c:290
Details @ https://people.freebsd.org/~pho/stress/log/alc007.txt

I too observed a firefox SIGSEGV. Running:
firefox -no-remote http://b.dk/
and exiting after the images had loaded would trigger the fault.

alc edited edge metadata.

Disallow a psind == 1 mapping when the faulted-upon page is fictitious.

This revision now requires review to proceed. Jul 18 2017, 4:50 PM
In D11556#241098, @pho wrote:

I can trigger this with ease:
panic: Assertion m_super->object == fs->first_object failed at ../../../vm/vm_fault.c:290
Details @ https://people.freebsd.org/~pho/stress/log/alc007.txt

I believe that the updated patch will address this assertion failure.

I too observed a firefox SIGSEGV. Running:
firefox -no-remote http://b.dk/
and exiting after the images had loaded would trigger the fault.

vm/vm_fault.c
286 ↗(On Diff #30845)

I don't quite understand it either. I'm planning to dig into this further, FWIW.

I haven't seen any segfaults after applying the fix, unlike Peter.

vm/vm_fault.c
286 ↗(On Diff #30845)

The shm segments are shared between firefox and Xorg. The segments of interest have size 0x7af000. What appears to happen is that anonymous RW mappings are created in the small region immediately following one of the segments, and they end up being backed by the segment's VM object. If this happens in one of the processes but not the other, and the mapping of the last 2MB of a segment is promoted, I think we can hit the situation you described.

In D11556#241123, @alc wrote:
In D11556#241098, @pho wrote:

I can trigger this with ease:
panic: Assertion m_super->object == fs->first_object failed at ../../../vm/vm_fault.c:290
Details @ https://people.freebsd.org/~pho/stress/log/alc007.txt

I believe that the updated patch will address this assertion failure.

It did. I have not observed any other problems with this patch, apart from the firefox issue.

In D11556#241371, @pho wrote:
In D11556#241123, @alc wrote:
In D11556#241098, @pho wrote:

I can trigger this with ease:
panic: Assertion m_super->object == fs->first_object failed at ../../../vm/vm_fault.c:290
Details @ https://people.freebsd.org/~pho/stress/log/alc007.txt

I believe that the updated patch will address this assertion failure.

It did. I have not observed any other problems with this patch, apart from the firefox issue.

My suspicion is that this is a copy-on-write issue, specifically, that a page on which we took a copy-on-write fault is being replaced by the original unmodified page through the 2MB page mapping.

Peter, do you have any other information on the SIGSEGV? Knowing the faulting address and the vm_map_entry's for the address space, e.g., "procstat -v" or the like, might help.

In D11556#241616, @alc wrote:
In D11556#241371, @pho wrote:
In D11556#241123, @alc wrote:
In D11556#241098, @pho wrote:

I can trigger this with ease:
panic: Assertion m_super->object == fs->first_object failed at ../../../vm/vm_fault.c:290
Details @ https://people.freebsd.org/~pho/stress/log/alc007.txt

I believe that the updated patch will address this assertion failure.

It did. I have not observed any other problems with this patch, apart from the firefox issue.

My suspicion is that this is a copy-on-write issue, specifically, that a page on which we took a copy-on-write fault is being replaced by the original unmodified page through the 2MB page mapping.

Peter, do you have any other information on the SIGSEGV? Knowing the faulting address and the vm_map_entry's for the address space, e.g., "procstat -v" or the like, might help.

I took a step back and ran the same firefox test on a pristine kernel:
$ firefox -no-remote http://b.dk/
GLib-GIO-Message: Using the 'memory' GSettings backend. Your settings will not be saved or shared with other applications.
Segmentation fault (core dumped)
$ uname -a
FreeBSD t1.osted.lan 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r321074: Thu Jul 20 08:25:22 CEST 2017 pho@t1.osted.lan:/usr/src/sys/amd64/compile/PHO amd64
$
So I would say that this problem is unrelated to your patch.

Check for a mismatch between the reservation's object field and the page's object field in vm_reserv_to_superpage(). As long as vm_fault()'s fast path is the only caller of vm_reserv_to_superpage(), I don't think that a mismatch is possible, but the check could become necessary if we start calling vm_reserv_to_superpage() in other places.
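
A sketch of vm_reserv_to_superpage() with the mismatch check; the field and macro names are assumptions based on vm_reserv.c:

vm_page_t
vm_reserv_to_superpage(vm_page_t m)
{
        vm_reserv_t rv;

        VM_OBJECT_ASSERT_LOCKED(m->object);
        rv = vm_reserv_from_page(m);
        /* Only a fully populated reservation in m's object qualifies. */
        return (rv->object == m->object && rv->popcnt == VM_LEVEL_0_NPAGES ?
            rv->pages : NULL);
}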

vm/vm_reserv.c
1134 ↗(On Diff #31006)

Can we assert that rv->object == m->object || rv->object == NULL?

vm/vm_reserv.c
1134 ↗(On Diff #31006)

No. There are situations involving copy-on-write, specifically, shadow object chains, in which a page and a reservation might legitimately, if only temporarily, differ. For example, suppose a superpage is created in object A. Then, object A acquires a shadow object B. However, the conditions for a collapse or a so-called "optimized" COW fault arise. We will start migrating pages from A to B (rather than allocating new pages and copying code/data over to the new pages). When the first page migrates to B, we will also migrate the reservation, but there will still be pages left in A that come from the reservation.

In D11556#241767, @alc wrote:

Check for a mismatch between the reservation's object field and the page's object field in vm_reserv_to_superpage(). As long as vm_fault()'s fast path is the only caller of vm_reserv_to_superpage(), I don't think that a mismatch is possible, but the check could become necessary if we start calling vm_reserv_to_superpage() in other places.

I tested this version with all of the tests I have that use mmap(2). Also did a buildworld. No problems seen.

In D11556#242095, @pho wrote:
In D11556#241767, @alc wrote:

Check for a mismatch between the reservation's object field and the page's object field in vm_reserv_to_superpage(). As long as vm_fault()'s fast path is the only caller of vm_reserv_to_superpage(), I don't think that a mismatch is possible, but the check could become necessary if we start calling vm_reserv_to_superpage() in other places.

I tested this version with all of the tests I have that use mmap(2). Also did a buildworld. No problems seen.

Thanks, Peter.

Add an unconditional check for object field consistency to vm_page_ps_test(). Remove an assertion from vm_fault_soft_fast() that is now redundant.

I haven't been able to conjure up an actual test case that demonstrates a need for the new consistency check in vm_page_ps_test(). However, I am still worried that a case could arise. Essentially, we have a fully populated reservation that originates in what becomes a backing object (as opposed to a "first" object). We take an ordinary copy-on-write fault that leads to the creation of a shadow object and a new writeable physical page. Then, the conditions change so that future write faults are eligible for the "optimized" COW handling, wherein we migrate the existing, previously read-only physical page to the first object and simply reenable write access. (In other words, we don't really copy the page on this optimized path.) A critical fact here is that we also migrate the reservation. My hypothetical problem scenario is that somehow one of the newly read/write mappings created by the optimized COW path within the pmap level gets invalidated. And, we fault on it again. Now, we find ourselves in vm_fault_soft_fast() and without the new check in vm_page_ps_test() we could incorrectly map the superpage even though the initial COW fault logically replaced one of the 4KB pages.
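
The new unconditional check, in miniature; the variable names follow the vm_page_ps_test() sketch earlier in this review:

                /*
                 * Reject the superpage if any constituent page has been
                 * migrated to, or replaced in, a different object.
                 */
                if (m->object != m_super->object)
                        return (false);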

Did that make sense? If so, I propose to declare victory and commit these changes. I don't have any remaining nagging doubts.

Peter, given the nature of my latest change, I don't think that you need to repeat any of your testing.

This revision is now accepted and ready to land. Jul 21 2017, 5:44 PM
markj added inline comments.
vm/vm_reserv.c
1134 ↗(On Diff #31006)

I see, thanks.

This revision was automatically updated to reflect the committed changes.