vm_object_coalesce(): any of these conditions must prevent coalesce

When there is more than one reference to the object, the object size does not exactly match the next start, or OBJ_ONEMAPPING is not set, we cannot safely coalesce. We might corrupt some other valid mapping or a shadow chain by doing it.

Debugging help from: mmel

vm_object_coalesce(): simplify

The checked condition for the object size to coalesce makes follow-up re-checking of the object size redundant. In particular, call vm_object_page_remove() whenever we coalesce.

Debugging help from: mmel

vm_object_page_remove(): clear the pager even if there are no resident pages

The swap pager might still carry the data.

Debugging help from: mmel

vm_map_insert(): add a check that coalescing does not revive stale pages

Debugging help from: mmel
| sys/vm/vm_object.c | ||
|---|---|---|
| 2201 | Just an FYI: between the KASSERT and the prior if we have ... && prev_object->size == next_pindex && ..., so the KASSERT is effectively checking next_pindex > 0 as things are here. | |
Drop KASSERT() checking the layout in vm_object_coalesce(). It really outlived its usefulness.
Can you elaborate on what you saw when debugging this problem? In particular, whether the map entries had OBJ_ONEMAPPING set? In principle, the ref_count and size shouldn't be a concern if the object has OBJ_ONEMAPPING set. For example, suppose there is an anonymous mapping (with OBJ_ONEMAPPING set) for the range [A, D). Further, suppose we punch a hole in the middle of that mapping, by calling munmap() on the range [B, C) where A < B < C < D. Now, we have two mappings, [A, B) and [C, D), that both reference the original object, and that object should still have OBJ_ONEMAPPING set. Because OBJ_ONEMAPPING is set, the munmap() should have freed any physical pages and swap space from the object that fell within the range [B, C). So, if a new anonymous mapping is created starting at either B or D, we should be able to safely coalesce it.
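A minimal userland sketch of this scenario (illustration only, not part of the patch; the page-sized hole and the MAP_FIXED | MAP_EXCL placement are my own choices to force the new mapping to land at B):

#include <sys/mman.h>
#include <err.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	size_t pgsz = (size_t)getpagesize();
	char *a;

	/* [A, D): one anonymous mapping, backed by one anonymous object. */
	a = mmap(NULL, 4 * pgsz, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (a == MAP_FAILED)
		err(1, "mmap");
	a[0] = 1;		/* fault in a page before the hole */
	a[2 * pgsz] = 1;	/* and one after it */

	/* Punch the hole [B, C): two map entries now reference the object. */
	if (munmap(a + pgsz, pgsz) != 0)
		err(1, "munmap");

	/*
	 * Re-allocate the hole starting at B.  vm_map_insert() may coalesce
	 * this range into the existing object; the new pages must read as
	 * zeroes, not as stale object or swap contents.
	 */
	if (mmap(a + pgsz, pgsz, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE | MAP_FIXED | MAP_EXCL, -1, 0) == MAP_FAILED)
		err(1, "mmap hole");
	if (a[pgsz] != 0)
		abort();	/* stale data revived */
	return (0);
}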
I did not have access to the object state there (all debugging was done remotely).
The situation you described is one of the cases that concerned me. I am not sure that we have a guarantee that doing the coalesce on an object with the OBJ_ONEMAPPING flag but ref_count > 1 would not corrupt some other mapping. We need to do vm_object_page_remove(), and in principle that could remove pages which belong to another fragment.
After another day of extensive testing, I can confirm that everything is working properly.
I'm happy to perform additional debugging to clarify the problematic condition, should you find it helpful.
Would you like a dump of the affected object for the newly added goto remove_pager case? Or anything else?
Try just this part of the series alone. Do not enable vm_check_pg_zero. Does it help?
diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index 9d09224e7346..8850a1b8644e 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -1988,7 +1988,7 @@ vm_object_page_remove(vm_object_t object, vm_pindex_t start, vm_pindex_t end,
 	    (options & (OBJPR_CLEANONLY | OBJPR_NOTMAPPED)) == OBJPR_NOTMAPPED,
 	    ("vm_object_page_remove: illegal options for object %p", object));
 	if (object->resident_page_count == 0)
-		return;
+		goto remove_pager;
 	vm_object_pip_add(object, 1);
 	vm_page_iter_limit_init(&pages, object, end);
 again:
@@ -2061,6 +2061,7 @@ vm_object_page_remove(vm_object_t object, vm_pindex_t start, vm_pindex_t end,
 	}
 	vm_object_pip_wakeup(object);
+remove_pager:
 	vm_pager_freespace(object, start, (end == 0 ? object->size : end) -
 	    start);
 }
Confirmed, this part is enough.
Is there a way to detect a page fault that maps a zeroed page? (I assume that mmap() returns with lazily mapped pages.)
This makes sense to me. Suppose that the object has OBJ_ONEMAPPING set. On a memory-starved machine, we could have paged out all of the object's pages, so that the resident count would be zero. Then, if a hole is punched in a mapping to this object, vm_map_entry_delete() will call vm_object_page_remove(), which will fail to release the swap space.
@mmel you have been testing on ARMv7, correct? Also, you have the option enabled in jemalloc to write junk in the freed pages, correct? And, jemalloc is complaining upon recycling the memory that it didn't find the expected junk values in the memory?
@kib, consider this subset of the changes reviewed by me. In short, I see no reason not to commit this part of the change now.
. . . testing on ARMv7, correct? Also, you have the option enabled in jemalloc to write junk in the freed pages, correct? And, jemalloc is complaining upon recycling the memory that it didn't find the expected junk values in the memory?
The failing assert is part of a (partial) check that the pages required to be all zero are in fact all zero. The context is a normal debug build of main, so junk filling is enabled by default (on both allocation and deallocation). In my testing, rejected pages with non-zero bytes had the 0x5A pattern (the deallocation junk pattern) in those non-zero bytes. Generally, a page started with zeros and then transitioned to 0x5A's at some point in the byte sequence. I did see an example of all 0x5A's and no zeros, but that was not common.
ehooks_debug_zero_check(void *addr, size_t size) {
	assert(((uintptr_t)addr & PAGE_MASK) == 0);
	assert((size & PAGE_MASK) == 0);
	assert(size > 0);
	if (config_debug) {
		/* Check the whole first page. */
		size_t *p = (size_t *)addr;
		for (size_t i = 0; i < PAGE / sizeof(size_t); i++) {
			assert(p[i] == 0);
		}
		/* ... */

The rest of the changes.
I still think that the 'and' condition in vm_object_coalesce() is too optimistic and we might sometimes remap a mapped object into the coalesced place, destroying another mapping's content as well. If not, then the whole diff for vm_object_coalesce() should go away, since the simplification part depends on the exact checks.
The vm_check_pg_zero part under INVARIANTS for vm_map_insert() is independent and should be committed IMO.
Yes, and also in an arm32 jail.
Also, you have the option enabled in jemalloc to write junk in the freed pages, correct?
No, I have junk fill disabled.
And, jemalloc is complaining upon recycling the memory that it didn't find the expected junk values in the memory?
No, jemalloc complains about a dirty page in this check in https://cgit.freebsd.org/src/tree/contrib/jemalloc/include/jemalloc/internal/ehooks.h#n162
This check is called from extent_alloc_core() on mmapped memory from https://cgit.freebsd.org/src/tree/contrib/jemalloc/src/pages.c#n250
. . .
@mmel you have been testing on ARMv7, correct?
[I'm not mmel. So just an FYI.]
I'll note that my testing included both armv7 chroot use on aarch64 and i386 chroot use on amd64, both for main. I got backtraces for both contexts but I only examined the memory via the core files for armv7, just due to the detailed differences in what was "optimized out" vs. accessible in the debugger for the asserting jemalloc code.
I believe OBJ_ONEMAPPING means, "each page in the object is mapped at most once", so in the case you describe, OBJ_ONEMAPPING should not be set to begin with.
I think the extra checking is a good idea.
| sys/vm/vm_map.c | ||
|---|---|---|
| 1751 ↗ | (On Diff #167045) | Don't you also want to check the pager? |
| sys/vm/vm_object.c | ||
| 2230 | Stray period. | |
But this is exactly my concern: even if OBJ_ONEMAPPING is not set, the other conditions would allow the coalesce.
I understand the main objection against the change: it causes more coalescing failures, increasing fragmentation.
I might use the word "pindex" here, rather than "page". Further, we can say: an unmapped pindex will not have a physical page or swap space allocated to it. An OBJ_ONEMAPPING object can have a reference count greater than one, where each reference stems from a map entry that does not overlap with any of the other map entries mapping the same object. These map entries must be within the same vm_map, and that vm_map ensures that we do not create overlapping map entries to the object.
| sys/vm/vm_object.c | ||
|---|---|---|
| 2226–2228 | If the object has OBJ_ONEMAPPING set, then we don't need to perform the call to vm_object_page_remove() here. | |
One scenario where I wouldn't be surprised at all if there were a problem with the coalescing code is after a fork() involving a MAP_INHERIT_SHARE map entry.
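For reference, such a MAP_INHERIT_SHARE entry is easy to construct from userland with minherit(2); a minimal sketch (illustration only, not a reproducer from this review):

#include <sys/mman.h>
#include <sys/wait.h>
#include <err.h>
#include <unistd.h>

int
main(void)
{
	size_t len = 4 * (size_t)getpagesize();
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE,
	    -1, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	/* Mark the entry INHERIT_SHARE so fork() shares it with the child. */
	if (minherit(p, len, INHERIT_SHARE) != 0)
		err(1, "minherit");
	if (fork() == 0) {
		/*
		 * Parent and child now reference the same anonymous object,
		 * so coalescing next to this range in either process must
		 * account for the other mapping.
		 */
		p[0] = 1;
		_exit(0);
	}
	wait(NULL);
	return (0);
}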
| sys/vm/vm_object.c | ||
|---|---|---|
| 2226–2228 | I am afraid of stale pages that might be left for any reason in the unmapped parts of the object. I can immediately think of wired pages that cannot be removed, and which we do not even always invalidate when found during vm_object_page_remove(). | |
Move vm_check_pg_zero code from vm_map_insert() into vm_object_coalesce().
I kept the other two chunks for vm_map_coalesce() there as well for now, since there is discussion around them.
| sys/vm/vm_object.c | ||
|---|---|---|
| 2249 | Isn't it safe if the page is invalid? | |
| 2257 | Because we know that prev_object is OBJ_ANON, can we just use swap_pager_seek_data() or similar to validate this invariant? Then I do not think it needs to be conditional on vm_check_pg_zero(); we just need to do two pctrie lookups, which is cheap enough to be enabled unconditionally, IMO. | |
| 2269 | This can be de-indented. | |
| sys/vm/vm_object.c | ||
|---|---|---|
| 2249 | You mean, eventually vm_fault() would look up this invalid page, and since there is no backing object and the pager cannot provide the page data, the fault handler would do vm_fault_zerofill(). But then there is a bug, IMO: vm_fault_zerofill() checks PG_ZERO, which might still be present there (the flag is only cleared on vm_page_free()). This raises the question: if vm_object_page_remove() cannot remove a page, for instance because it is wired, should it at least invalidate it? And, if we allow invalid pages, we have to iterate over the whole range to check that all found pages are invalid. | |
| 2257 | Still, using _seek_data() or what I wrote, we have to iterate over the whole range if we allow invalid pages, so it is not just two trie lookups (in theory). Also, I like the consistency of the zero-page checks being handled by the same sysctl, but that is secondary, of course. | |
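To make the shape of the discussed check concrete, here is a rough sketch of what iterating the whole coalesce range could look like under INVARIANTS; the helper name is hypothetical and the pager-side lookup is left as a comment, since the exact interface (swap_pager_seek_data() or similar) is still being discussed:

#ifdef INVARIANTS
/*
 * Hypothetical helper, not the committed code: assert that no pindex in
 * [start, end) of the object being coalesced has a resident page.  A
 * matching swap-pager lookup per pindex would also be needed to catch data
 * that is present only in swap.
 */
static void
vm_object_coalesce_assert_empty(vm_object_t object, vm_pindex_t start,
    vm_pindex_t end)
{
	vm_pindex_t pi;

	VM_OBJECT_ASSERT_WLOCKED(object);
	for (pi = start; pi < end; pi++) {
		KASSERT(vm_page_lookup(object, pi) == NULL,
		    ("coalesce: object %p pindex %jx has a resident page",
		    object, (uintmax_t)pi));
		/* A swap-pager check for pindex pi would go here. */
	}
}
#endif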
| sys/vm/vm_object.c | ||
|---|---|---|
| 2249 | Maybe vm_page_invalid() should clear PG_ZERO. | |
| sys/vm/vm_object.c | ||
|---|---|---|
| 2249 | Suppose we are punching a hole in an anonymous mapping with munmap(). If the page was wired by mlock(), then vm_map_delete() will unwire the map_entry and underlying pages. If the object is OBJ_ONEMAPPING, this unwiring will happen before the call to vm_object_page_remove() by vm_map_entry_delete(). If, however, the page was wired by the system, then vm_map_delete() will wait out the wiring before it deletes the map entry. Do you have a different scenario in mind? | |
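A minimal userland sketch of the mlock() case described in the inline comment above (illustration only; it assumes the memorylocked resource limit permits the mlock()):

#include <sys/mman.h>
#include <err.h>
#include <unistd.h>

int
main(void)
{
	size_t pgsz = (size_t)getpagesize();
	char *p;

	p = mmap(NULL, 4 * pgsz, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	if (mlock(p, 4 * pgsz) != 0)	/* user-wire the pages via mlock() */
		err(1, "mlock");
	/*
	 * Punch a hole in the wired mapping: per the comment above, the
	 * kernel unwires [p + pgsz, p + 2 * pgsz) while deleting the map
	 * entry, before vm_object_page_remove() runs.
	 */
	if (munmap(p + pgsz, pgsz) != 0)
		err(1, "munmap");
	return (0);
}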
| sys/vm/vm_object.c | ||
|---|---|---|
| 2219 | Suppose that the object is OBJ_ONEMAPPING and that we, in fact, have a single mapping to it. Then, we punch a hole in that mapping. vm_map_entry_delete() only subtracts from the object's charge when we shrink the size of the object. Now, suppose that we perform mmap(MAP_ANON) to reallocate some of the hole. Aren't we going to add to the charge here, even though vm_map_entry_delete() never subtracted from the charge for the hole that was created? | |
| sys/vm/vm_object.c | ||
|---|---|---|
| 2219 | Yes, I think this is a bug in the swap accounting. We should not add to the charge up to the object size, only the difference between the object end and next_end (see the small arithmetic sketch after this table). | |
| 2249 | Yes, I am referencing the code in vm_object_page_remove() under the vm_page_wire() condition. This could occur for many reasons now, especially after the merge of the hold and wire counts. For instance, sysctl might wire a region. Or an I/O occurring into the anon region triggers vn_io_fault to hold (wire) the user pages backing the I/O buffer. Or physio does vm_fault_quick_hold() on the user buffer. And so on. In each of these cases, we are left with invalid page(s) on the object queue if userspace does an unmap to create a hole. |
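A small arithmetic sketch of the charge rule described in the comment above on line 2219; the helper name and standalone types are illustrative, not the actual vm_object_coalesce() code:

#include <assert.h>
#include <stdint.h>

/*
 * Pages to charge when coalescing a new range ending at next_end into an
 * object whose current size is object_size (both in pages): only growth
 * past the current object end is charged, because pindexes inside a
 * previously punched hole were never uncharged by vm_map_entry_delete().
 */
static uint64_t
coalesce_charge_pages(uint64_t object_size, uint64_t next_end)
{
	return (next_end > object_size ? next_end - object_size : 0);
}

int
main(void)
{
	/* Hole re-mapped entirely inside a 4-page object: no new charge. */
	assert(coalesce_charge_pages(4, 3) == 0);
	/* New range extends two pages past the object end: charge 2 pages. */
	assert(coalesce_charge_pages(4, 6) == 2);
	return (0);
}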