amd64 pmap: do not panic on inability to insert ptp into trie When pmap_enter_pde() needs to destroy existing kernel superpage mapping, do not remove saved pt page from the pm_root trie. Then the pt page does not need to be re-inserted into the trie. If the kernel region is not mapped with the superpage, try to insert the pt page into pm_root trie before clearing ptes. If failed, we can return failure without a need to rewind. Suggested by: alc dev/mem: use sx instead of rw lock Some ops require sleepable context to success, like DMAP demotion. amd64 pmap: preallocate pt page foir pmap_demote_pdpe() in pmap_demote_DMAP() Allocate the page outside the kernel_pmap locked region, and pass it to pmap_demote_pdpe() instead of panicing if VM_ALLOC_INTERRUPT failing.
Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
sys/vm/vm_radix.c | ||
---|---|---|
140 ↗ | (On Diff #157419) | If parentp == NULL, node is not linked into the tree. I don't see how it ever gets freed, so I think it's a memory leak. |
sys/vm/vm_radix.c | ||
---|---|---|
140 ↗ | (On Diff #157419) | I see, in this case pctrie_insert_lookup_compound() uses existing nul leaf. Am I right? I changed the interface to vm_radix_insert_prealloc() to take **node, to signal that the *node was not used. |
Change vm_radix_insert_prealloc() to indicate if node was used. Pass &node and NULL it if insertion consumed it.
sys/amd64/amd64/pmap.c | ||
---|---|---|
10065 | I think this can be reached via mem_range_attr_set(), and because of the rwlock this will be a non-sleepable context. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
10058 | BTW, do you have an explanation why it is safe to ignore the demote requests with len > 1G? |
sys/amd64/amd64/pmap.c | ||
---|---|---|
10065 | It seems that we can change mr_lock to sx. At least all callers are unlocked or in sleepable context. I looked at the use of it in linuxkpi, and lkpi_arch_phys_wc_del() potentially might be called from the non-sleepable place, the wc_add() sleeps by itself. But I do not see a need in the split if we do MEMRANGE_SET_REMOVE(), since it was already demoted by the SET_UPDATE. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
10058 | I think I see it myself. If some pdp is participating in the mapping of the region which we demote, then it must fully fit into the region, by the assertion that base is aligned at len. In other words, this is indeed not a generic demotion. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
7598 | This means you could fail in a case where an allocation wasn't needed, where currently you would succeed. | |
7669–7670 | I assume that with this change, this line will never be reached. Why can't pmap_abort_ptp(pmap, va, pdpg); return (KERN_RESOURCE_SHORTAGE); appear here, instead of doing the early node allocation? | |
sys/vm/vm_radix.c | ||
140 ↗ | (On Diff #157419) | You are right. Why not just call vm_radix_node_free(*node)) here, in the parentp==NULL case? |
sys/amd64/amd64/pmap.c | ||
---|---|---|
7598 | Yes. But IMO either conditions are very rare, and have the machine continuing running is more important. | |
7669–7670 | We already cleared page table entries in the if/else block above. If we are going to return failure there, we must restore the mappings, at least for the kernel pmap. The rollback is harder. | |
sys/vm/vm_radix.c | ||
140 ↗ | (On Diff #157419) | Caller still needs to call node_free(), so it is slightly less code. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
7669–7670 | ||
7734 | It's possible to have node == NULL here, but uma_zfree_smr(NULL) is not permitted. | |
10039 | ||
10064 | ||
10065 | It seems to be ok for both functions to sleep. They are probably only used by drm-kmod, and there the functions are called during driver init and deinit. | |
sys/vm/vm_radix.h | ||
361 ↗ | (On Diff #157428) | Should it be grouped with the other prototypes at the beginning of the file? |
sys/amd64/amd64/pmap.c | ||
---|---|---|
9653–9654 | Here you are handling the case where *mp is null, but why? It cannot happen today, and in general it is probably better for the caller to handle allocation failures. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
9653–9654 | if mp == NULL, we allocate with pmap_alloc_pt_page(), which might fail. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
9653–9654 | I'm talking about the case *mp == NULL, not mp == NULL. If mp != NULL, then I think we must have pdpg != NULL here. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
9653–9654 | Then I do not understand what do you mean. I have to handle the possibility of pdpg == NULL, and this is all what I do there. pdpg can be NULL if mp is NULL. I do not see a place where I have a check for *mp == NULL as is. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
9653–9654 | I think the code should be structured like this: if (mp == NULL) { pdpg = pmap_alloc_pt_page(...); if (pdpg == NULL) return (false); } else { pdpg = *mp; ... } |
sys/amd64/amd64/pmap.c | ||
---|---|---|
9653–9654 | Well, this just tightens up the NULL check, but ok. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
7596 | The most common use case for this function is pmap_enter_object(), and so only about one in 10,000 calls will actually perform the potentially problematic pmap_insert_pt_page(). The unnecessary allocations can easily be avoided by deferring the allocation until you are inside of the subsequent if statement. Put it right before the comment /* Break the existing mapping(s). */. Moreover, condition it on (oldpde & PG_PS) == 0 && va >= VM_MAXUSER_ADDRESS. Once you do this, then the much later call to pmap_insert_pt_page() that deals with wired mappings can revert back to its original behavior that does not assume a preallocated node. Most of the time, that preallocated node was not going to be used anyway, because the PTP is going to be referenced from an existing trie inode. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
7631–7632 | I think that we can be even more selective about node allocation than I suggested in my previous comment. If we are removing a kernel mapping here, then we are actually reinstalling the PTP with invalid mappings. In other words, the PTP was already in the trie. Instead, we could install an invalid PDE here, and leave the PTP in the trie. Then, ... (see below) | |
7658 | If we are about to call this function on the kernel pmap, then beforehand we try performing the existing pmap_insert_pt_page() without any of the proposed changes. In other words, I don't see any reason why we can't attempt to insert the kernel PTP before actually destroying the 4KB granularity mappings in the PTE. Concurrent access to the trie by another thread will be blocked by the pmap lock. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
9637 | I don't see why the final argument can't simply be a vm_page_t, not a vm_page_t *. When a vm_page_t is given the function is guaranteed to succeed. The caller can just set m to NULL itself. | |
10083 | This can also fail. Unlike the regular kernel address space, we don't have preallocated page table pages for the direct map. In other words, we also need to preallocate a page table page for this demotion, but that only needs to happen if len < NBPDR. I will post suggested changes shortly. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
10063 | Should we perform a WARN_SLEEPOK alongside the KASSERTs? | |
10092–10095 | If you move this if statement up inside the prior if statement, then the m = NULL statement can be eliminated. |
sys/amd64/amd64/pmap.c | ||
---|---|---|
10057 | ||
10076 | ||
10091–10092 |
sys/amd64/amd64/pmap.c | ||
---|---|---|
7653 |