Page MenuHomeFreeBSD

riscv: handle existing mapping in pmap_enter_l2()
ClosedPublic

Authored by mhorne on Sep 14 2022, 2:21 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 26, 1:06 PM
Unknown Object (File)
Sun, Jan 26, 12:54 AM
Unknown Object (File)
Wed, Jan 22, 12:33 AM
Unknown Object (File)
Sat, Jan 18, 2:29 AM
Unknown Object (File)
Thu, Jan 16, 12:17 AM
Unknown Object (File)
Fri, Jan 3, 8:43 AM
Unknown Object (File)
Dec 8 2024, 8:31 PM
Unknown Object (File)
Dec 4 2024, 3:09 PM
Subscribers

Details

Summary

It is possible that the function will be asked map something for a
second time. For example, if a file has memory-resident pages and a
process first calls mmap() and then madvise(MADV_WILLNEED) for this
range of the file, pmap_enter_object() will be called on this range
of pages twice.

Because pmap_enter_l2() is passed PMAP_ENTER_NOREPLACE, it 'fails' to
map the 2M page the second time, and falls back to trying
pmap_enter_quick_locked() for each 4K virtual page. This may result in
the creation of false PV entries for some of these pages.

If the range is later munmap'ed, the system will panic during the
process' exit in pmap_remove_pages(), when it attempts to clean up the
PV entries for mappings which no longer exist.

PR: 266108

Test Plan

Create the temp file with all zeros (this is required to generate the false PV entries)

# dd if=/dev/zero of=/tmp/mapfile bs=1m count=5

The following reproducer will trigger the panic upon the exit of the second child process:

#include <assert.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <malloc.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <sys/mman.h>
#include <sys/wait.h>

#define PAGE_SIZE	4096

/* Larger than one superpage: 2MiB + 7 * 4KiB */
#define	MAP_LENGTH	((1UL << 21) + 7 * PAGE_SIZE)

void
child_run(int fd)
{
	char buf[16];
	char *addr, *cpyaddr;

	/* mmap the first MAP_LENGTH bytes of the file into the process. */
	addr = mmap(NULL, MAP_LENGTH, PROT_READ, MAP_PRIVATE, fd, 0);
	if (addr == MAP_FAILED) {
		err(1, "mmap failed");
	}

	/* Now madvise() with WILLNEED, so that the mappings are (re-)created. */
	if (madvise(addr, MAP_LENGTH, MADV_WILLNEED) == -1) {
		err(1, "madvise(MADV_WILLNEED) failed");
	}

	close(fd);

	/* Read a few bytes from every page in the mapped region, so that they
	 * are paged in. */
	cpyaddr = addr;
	for (cpyaddr = addr; cpyaddr < (addr + MAP_LENGTH); cpyaddr += PAGE_SIZE) {
		memcpy(buf, cpyaddr, sizeof(buf));
	}

	if (munmap(addr, MAP_LENGTH) == -1) {
		err(1, "munmap failed");
	}

	/* exit process */
	exit(0);
}

int
main(int arg, char *argv[])
{
	pid_t pid;
	int fd, status;

	fd = open("/tmp/mapfile", O_RDONLY);
	if (fd == -1) {
		err(1, "failed to open mapfile");
	}

	for (int i = 0; i < 2; i++) {
		pid = fork();
		if (pid == -1) {
			err(1, "Failed to fork");
		} else if (pid == 0) {
			child_run(fd);
		} else {
			/* Allow the child to exit before running again. */
			waitpid(pid, &status, 0);
		}
	}

	return (0);
}

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

What if the permissions are different? Won't this still do the same? From what I can tell arm64 and x86 just add an additional bunch of conditions to the NOREPLACE check and will break+remake the mapping?

To better understand what was going on here, I added a few CTR calls and examined the trace output. We can see how the second call to pmap_enter_l2() fails, and we fall back on the loop of pmap_enter_quick_locked():

2013 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86406000
2012 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86405000
2011 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86404000
2010 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86403000
2009 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86402000
2008 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86401000
2007 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86400000
2006 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 863ff000
2005 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 863fe000
2004 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 863fd000

...

1499 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86204000
1498 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86203000
1497 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86202000
1496 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86201000
1495 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86200000
1494 (0xffffffc222b955c0:cpu0): pmap_enter_l2: failure for va 0x86200000 in pmap 0xffffffd01a6f8950: couldn't replace existing mapping
1493 (0xffffffc222b955c0:cpu0): vm_map_pmap_enter: addr=86200000, pindex=0, size=2125824
1492 (0xffffffc222b955c0:cpu0): vm_map_advise(): map=0xffffffd01a6f8820, start=86200000, end=86407000, behav=3

1491 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86406000
1490 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86405000
1489 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86404000
1488 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86403000
1487 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86402000
1486 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86401000
1485 (0xffffffc222b955c0:cpu0): pmap_enter_quick_locked: 0xffffffd01a6f8950 86400000
1484 (0xffffffc222b955c0:cpu0): pmap_enter_l2: success for va 0x86200000 in pmap 0xffffffd01a6f8950
1483 (0xffffffc222b955c0:cpu0): vm_map_pmap_enter: addr=86200000, pindex=0, size=2125824
1482 (0xffffffc222b955c0:cpu0): vm_map_entry_link: map 0xffffffd01a6f8820, nentries 22, entry 0xffffffd069cfee40

With this change things become much more sensible:

5155 (0xffffffc223cfc000:cpu2): pmap_enter_quick_locked: 0xffffffd02c88e610 84c06000
5154 (0xffffffc223cfc000:cpu2): pmap_enter_quick_locked: 0xffffffd02c88e610 84c05000
5153 (0xffffffc223cfc000:cpu2): pmap_enter_quick_locked: 0xffffffd02c88e610 84c04000
5152 (0xffffffc223cfc000:cpu2): pmap_enter_quick_locked: 0xffffffd02c88e610 84c03000
5151 (0xffffffc223cfc000:cpu2): pmap_enter_quick_locked: 0xffffffd02c88e610 84c02000
5150 (0xffffffc223cfc000:cpu2): pmap_enter_quick_locked: 0xffffffd02c88e610 84c01000
5149 (0xffffffc223cfc000:cpu2): pmap_enter_quick_locked: 0xffffffd02c88e610 84c00000
5148 (0xffffffc223cfc000:cpu1): pmap_enter_l2: success for va 0x84a00000 in pmap 0xffffffd02c88e610; pre-existing mapping
5147 (0xffffffc223cfc000:cpu1): vm_map_pmap_enter: addr=84a00000, pindex=0, size=2125824
5146 (0xffffffc223cfc000:cpu1): vm_map_advise(): map=0xffffffd02c88e4e0, start=84a00000, end=84c07000, behav=3

5145 (0xffffffc223cfc000:cpu1): pmap_enter_quick_locked: 0xffffffd02c88e610 84c06000
5144 (0xffffffc223cfc000:cpu1): pmap_enter_quick_locked: 0xffffffd02c88e610 84c05000
5143 (0xffffffc223cfc000:cpu1): pmap_enter_quick_locked: 0xffffffd02c88e610 84c04000
5142 (0xffffffc223cfc000:cpu1): pmap_enter_quick_locked: 0xffffffd02c88e610 84c03000
5141 (0xffffffc223cfc000:cpu1): pmap_enter_quick_locked: 0xffffffd02c88e610 84c02000
5140 (0xffffffc223cfc000:cpu1): pmap_enter_quick_locked: 0xffffffd02c88e610 84c01000
5139 (0xffffffc223cfc000:cpu1): pmap_enter_quick_locked: 0xffffffd02c88e610 84c00000
5138 (0xffffffc223cfc000:cpu1): pmap_enter_l2: success for va 0x84a00000 in pmap 0xffffffd02c88e610
5137 (0xffffffc223cfc000:cpu1): vm_map_pmap_enter: addr=84a00000, pindex=0, size=2125824
5136 (0xffffffc223cfc000:cpu1): vm_map_entry_link: map 0xffffffd02c88e4e0, nentries 22, entry 0xffffffd015f127e0
sys/riscv/riscv/pmap.c
3386–3388

This is where I've departed from how amd64 and arm64 handle this edge-case. Here, they perform a check like:

if ((pmap_load(l2) & PTE_RWX) != 0)
        return (NULL);

This eliminates the potential panic, but we still end up (uselessly) calling pmap_enter_quick_locked() for each 4K page in the superpage mapping -- see the KTR trace output.

What if the permissions are different? Won't this still do the same? From what I can tell arm64 and x86 just add an additional bunch of conditions to the NOREPLACE check and will break+remake the mapping?

This is where the oldl2 == new_l2 check might need to be expanded a little... obviously it works in this case, but I did not yet fully enumerate the changes in properties that could be observed here. I'm hoping for some additional intuition here from others who are more familiar with these code paths.

The expanded checks around NOREPLACE had no bearing on the outcome here, from my experimentation.

Do you see any reason not to turn your reproducer into a regression test case under tests/sys/vm? I will volunteer to do that if so; we really need to improve test coverage for the interactions between VM syscalls and superpages. One difficulty is that there's no guarantee that promotion can happen at any given point in the test run. We can at least use mincore() to detect whether promotion occurred at appropriate points, and skip the test if not.

What if the permissions are different? Won't this still do the same? From what I can tell arm64 and x86 just add an additional bunch of conditions to the NOREPLACE check and will break+remake the mapping?

This is where the oldl2 == new_l2 check might need to be expanded a little... obviously it works in this case, but I did not yet fully enumerate the changes in properties that could be observed here. I'm hoping for some additional intuition here from others who are more familiar with these code paths.

pmap_enter_object() is used only to prefault mappings [of a single VM object], so it is allowed to do nothing. I'm not sure if that was the intent behind the interface when it was first introduced, since the name doesn't suggest that it is advisory. When support was added to map 2MB pages, pmap_enter_object() extended its existing behaviour of doing nothing upon discovering an existing mapping, hence the NOREPLACE flag.

This diff tries to fix two problems: the logic bug where pmap_enter_quick_locked() fails to detect a 2MB mapping (actually I think this is broken on amd64/arm64 for 1GB mappings?), and needless calls to pmap_enter_quick_locked(). But pmap_enter_2mpage() is private to pmap.c - can't we simply extend it to signal this special case to the caller? For instance with a new return value.

sys/riscv/riscv/pmap.c
3150–3151

This comment doesn't quite reflect reality anymore.

3191
3386–3388

This code is reachable via pmap_enter_quick(). I think the assertion you added cannot be triggered by its sole caller today, if only because pmap_is_prefaultable() will return false if the address is mapped as a superpage. But, today we permit callers of pmap_enter_quick() to "overwrite" existing L3 mappings (the function simply does nothing in that case), so it seems inconsistent to require that there is no pre-existing L2 mapping. So, I think amd64 and arm64's behaviour makes more sense, and their pmap_enter_object() implementations should instead be smarter in this scenario.

pmap_enter_object() is used only to prefault mappings [of a single VM object], so it is allowed to do nothing. I'm not sure if that was the intent behind the interface when it was first introduced, since the name doesn't suggest that it is advisory.

It was.

sys/riscv/riscv/pmap.c
3386–3388

Yes, they could be smarter, but only sometimes. Specifically, if the old mapping is a superpage. On the other hand, hypothetically, the physical page could now be a complete, valid superpage, hence you find yourself in pmap_enter_2mpage(), but only a subset of the base pages are mapped. In that case, the code should continue to behave as it does now, trying pmap_enter_quick_locked() on every base page.

See D36801 for changes to the amd64 and arm64 pmaps to avoid the unnecessary pmap_enter_quick_locked() calls. I recommend merging those changes to riscv.

mhorne marked 2 inline comments as done.

Switch assertion in pmap_enter_quick_locked() to just return NULL when it
encounters a superpage.

Rework the pmap_enter_l2() changes to follow D36801/1d5ebad06c20.

I will split this review into two commits when it is time to push it.

Do you see any reason not to turn your reproducer into a regression test case under tests/sys/vm? I will volunteer to do that if so; we really need to improve test coverage for the interactions between VM syscalls and superpages. One difficulty is that there's no guarantee that promotion can happen at any given point in the test run. We can at least use mincore() to detect whether promotion occurred at appropriate points, and skip the test if not.

Yes, I intend to make this a regression test. Thanks for the hint about mincore, I will make use of that.

This revision is now accepted and ready to land.Oct 5 2022, 7:47 AM
sys/riscv/riscv/pmap.c
3192

Don't we need to take into account the possibility that the L3 table is empty? This can happen in the kernel map, where page table pages are never freed.

sys/riscv/riscv/pmap.c
3192

Is this the !pmap_every_pte_is_zero() check performed here by the other pmaps? I was wondering what the purpose of that was.

sys/riscv/riscv/pmap.c
3192

Yes. It arises only for kernel pmaps, hence the va < VM_MAXUSER_ADDRESS check. I believe this would apply to riscv as well. Though, I think it should be quite rare for the kernel to map 2MB pages this way.

mhorne added inline comments.
sys/riscv/riscv/pmap.c
3192