vm_object_coalesce(): any of these conditions must prevent coalesce

When there is more than one reference to the object, the object size does not exactly match the next start, or OBJ_ONEMAPPING is not set, we cannot safely coalesce. We might corrupt some other valid mapping or a shadow chain by doing it.

Debugging help from: mmel

vm_object_coalesce(): simplify

The checked condition for the object size to coalesce makes follow-up re-checking of the object size redundant. In particular, call vm_object_page_remove() whenever we coalesce.

Debugging help from: mmel

vm_object_page_remove(): clear the pager even if there are no resident pages

The swap pager might still carry the data.

Debugging help from: mmel

vm_map_insert(): add a check that coalescing does not revive stale pages

Debugging help from: mmel
| sys/vm/vm_object.c | ||
|---|---|---|
| 2201 | Just an FYI: between the KASSERT and the prior if we have ... && prev_object->size == next_pindex && ..., so the KASSERT is effectively checking next_pindex > 0 as things are here. | |
Drop KASSERT() checking the layout in vm_object_coalesce(). It really outlived its usefulness.
Can you elaborate on what you saw when debugging this problem? In particular, whether the map entries had OBJ_ONEMAPPING set? In principle, the ref_count and size shouldn't be a concern if the object has OBJ_ONEMAPPING set. For example, suppose there is an anonymous mapping (with OBJ_ONEMAPPING set) for the range [A, D). Further, suppose we punch a hole in the middle of that mapping, by calling munmap() on the range [B, C) where A < B < C < D. Now, we have two mappings, [A, B) and [C, D), that both reference the original object, and that object should still have OBJ_ONEMAPPING set. Because OBJ_ONEMAPPING is set, the munmap() should have freed any physical pages and swap space from the object that fell within the range [B, C). So, if a new anonymous mapping is created starting at either B or D, we should be able to safely coalesce it.
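A minimal userland sketch of this scenario (illustration only, not part of the patch; the page-sized hole and the MAP_FIXED | MAP_EXCL placement are my own choices to force the new mapping to land at B):

#include <sys/mman.h>
#include <err.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	size_t pgsz = (size_t)getpagesize();
	char *a;

	/* [A, D): one anonymous mapping, backed by one anonymous object. */
	a = mmap(NULL, 4 * pgsz, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (a == MAP_FAILED)
		err(1, "mmap");
	a[0] = 1;		/* fault in a page before the hole */
	a[2 * pgsz] = 1;	/* and one after it */

	/* Punch the hole [B, C): two map entries now reference the object. */
	if (munmap(a + pgsz, pgsz) != 0)
		err(1, "munmap");

	/*
	 * Re-allocate the hole starting at B.  vm_map_insert() may coalesce
	 * this range into the existing object; the new pages must read as
	 * zeroes, not as stale object or swap contents.
	 */
	if (mmap(a + pgsz, pgsz, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE | MAP_FIXED | MAP_EXCL, -1, 0) == MAP_FAILED)
		err(1, "mmap hole");
	if (a[pgsz] != 0)
		abort();	/* stale data revived */
	return (0);
}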
I did not have access to the object state there (all debugging was done remotely).
The situation you described is one of the cases that concerned me. I am not sure that we have a guarantee that doing the coalesce on an object with the OBJ_ONEMAPPING flag but ref_count > 1 would not corrupt some other mapping. We need to do vm_object_page_remove(), and in principle that could remove pages which belong to another fragment.
After another day of extensive testing, I can confirm that everything is working properly.
I'm happy to perform additional debugging to clarify the problematic condition, should you find it helpful.
Would you like a dump of the affected object for the newly added goto remove_pager case? Or anything else?
Try just this part of the series alone. Do not enable vm_check_pg_zero. Does it help?
diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index 9d09224e7346..8850a1b8644e 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -1988,7 +1988,7 @@ vm_object_page_remove(vm_object_t object, vm_pindex_t start, vm_pindex_t end,
 	    (options & (OBJPR_CLEANONLY | OBJPR_NOTMAPPED)) == OBJPR_NOTMAPPED,
 	    ("vm_object_page_remove: illegal options for object %p", object));
 	if (object->resident_page_count == 0)
-		return;
+		goto remove_pager;
 	vm_object_pip_add(object, 1);
 	vm_page_iter_limit_init(&pages, object, end);
 again:
@@ -2061,6 +2061,7 @@ vm_object_page_remove(vm_object_t object, vm_pindex_t start, vm_pindex_t end,
 	}
 	vm_object_pip_wakeup(object);
+remove_pager:
 	vm_pager_freespace(object, start, (end == 0 ? object->size : end) -
 	    start);
 }
Confirmed, this part is enough.
Is there a way to detect a page fault that maps a zeroed page? (I assume that mmap() returns with lazily mapped pages.)
This makes sense to me. Suppose that the object has OBJ_ONEMAPPING set. On a memory-starved machine, we could have paged out all of the object's pages, so that the resident count would be zero. Then, if a hole is punched in a mapping to this object, vm_map_entry_delete() will call vm_object_page_remove(), which will fail to release the swap space.
@mmel you have been testing on ARMv7, correct? Also, you have the option enabled in jemalloc to write junk in the freed pages, correct? And, jemalloc is complaining upon recycling the memory that it didn't find the expected junk values in the memory?
@kib, consider this subset of the changes reviewed by me. In short, I see no reason not to commit this part of the change now.
. . . testing on ARMv7, correct? Also, you have the option enabled in jemalloc to write junk in the freed pages, correct? And, jemalloc is complaining upon recycling the memory that it didn't find the expected junk values in the memory?
The failing assert is part of a (partial) check that the pages required to be all zero are in fact all zero. The context is a normal debug build of main, so junk filling is enabled by default (on both allocation and deallocation). In my testing, rejected pages with non-zero bytes had the 0x5A pattern (the deallocation junk pattern) in those non-zero bytes. Generally, a page started with zeros and then transitioned to 0x5A's at some point in the byte sequence. I did see an example of all 0x5A's and no zeros, but that was not common.
ehooks_debug_zero_check(void *addr, size_t size) {
	assert(((uintptr_t)addr & PAGE_MASK) == 0);
	assert((size & PAGE_MASK) == 0);
	assert(size > 0);
	if (config_debug) {
		/* Check the whole first page. */
		size_t *p = (size_t *)addr;
		for (size_t i = 0; i < PAGE / sizeof(size_t); i++) {
			assert(p[i] == 0);
		}
		/* ... */

The rest of the changes.
I still think that the 'and' condition in vm_object_coalesce() is too optimistic and we might sometimes remap a mapped object into the coalesced place, destroying another mapping's content as well. If not, then the whole diff for vm_object_coalesce() should go away, since the simplification part depends on the exact checks.
The vm_check_pg_zero part under INVARIANTS for vm_map_insert() is independent and should be committed IMO.
Yes, and also in an arm32 jail.
Also, you have the option enabled in jemalloc to write junk in the freed pages, correct?
No, I have junk fill disabled.
And, jemalloc is complaining upon recycling the memory that it didn't find the expected junk values in the memory?
No, jemalloc complains about a dirty page in this check in https://cgit.freebsd.org/src/tree/contrib/jemalloc/include/jemalloc/internal/ehooks.h#n162
This check is called from extent_alloc_core() on mmapped memory from https://cgit.freebsd.org/src/tree/contrib/jemalloc/src/pages.c#n250
. . .
@mmel you have been testing on ARMv7, correct?
[I'm not mmel. So just an FYI.]
I'll note that my testing included both armv7 chroot use on aarch64 and i386 chroot use on amd64, both for main. I got backtraces for both contexts but I only examined the memory via the core files for armv7, just due to the detailed differences in what was "optimized out" vs. accessible in the debugger for the asserting jemalloc code.
I believe OBJ_ONEMAPPING means, "each page in the object is mapped at most once", so in the case you describe, OBJ_ONEMAPPING should not be set to begin with.
I think the extra checking is a good idea.
| sys/vm/vm_map.c | ||
|---|---|---|
| 1751 ↗ | (On Diff #167045) | Don't you also want to check the pager? |
| sys/vm/vm_object.c | ||
| 2230 | Stray period. | |
But this is exactly my concern: even if OBJ_ONEMAPPING is not set, the other conditions would allow the coalesce.
I understand the main objection against the change: it causes more coalescing failures, increasing fragmentation.
I might use the word "pindex" here, rather than "page". Further, we can say: an unmapped pindex will not have a physical page or swap space allocated to it. An OBJ_ONEMAPPING object can have a reference count greater than one, where each reference stems from a map entry that does not overlap with any of the other map entries mapping the same object. These map entries must be within the same vm_map, and that vm_map ensures that we do not create overlapping map entries to the object.
| sys/vm/vm_object.c | ||
|---|---|---|
| 2226–2228 | If the object has OBJ_ONEMAPPING set, then we don't need to perform the call to vm_object_page_remove() here. | |
One scenario where I wouldn't be surprised at all if there were a problem with the coalescing code is after a fork() involving a MAP_INHERIT_SHARE map entry.
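For reference, such a MAP_INHERIT_SHARE entry is easy to construct from userland with minherit(2); a minimal sketch (illustration only, not a reproducer from this review):

#include <sys/mman.h>
#include <sys/wait.h>
#include <err.h>
#include <unistd.h>

int
main(void)
{
	size_t len = 4 * (size_t)getpagesize();
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE,
	    -1, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	/* Mark the entry INHERIT_SHARE so fork() shares it with the child. */
	if (minherit(p, len, INHERIT_SHARE) != 0)
		err(1, "minherit");
	if (fork() == 0) {
		/*
		 * Parent and child now reference the same anonymous object,
		 * so coalescing next to this range in either process must
		 * account for the other mapping.
		 */
		p[0] = 1;
		_exit(0);
	}
	wait(NULL);
	return (0);
}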
| sys/vm/vm_object.c | ||
|---|---|---|
| 2226–2228 | I am afraid of stale pages that might be left for any reason in the unmapped parts of the object. I can immediately think of wired pages that cannot be removed, and which we do not even always invalidate when found during vm_object_page_remove(). | |
Move vm_check_pg_zero code from vm_map_insert() into vm_object_coalesce().
I kept the other two chunks for vm_map_coalesce() there as well for now, since there is discussion around them.
| sys/vm/vm_object.c | ||
|---|---|---|
| 2249 | Isn't it safe if the page is invalid? | |
| 2257 | Because we know that prev_object is OBJ_ANON, can we just use swap_pager_seek_data() or similar to validate this invariant? Then I do not think it needs to be conditional on vm_check_pg_zero(); we just need to do two pctrie lookups, which is cheap enough to be enabled unconditionally, IMO. | |
| 2269 | This can be de-indented. | |
| sys/vm/vm_object.c | ||
|---|---|---|
| 2249 | You mean, eventually vm_fault() would look up this invalid page, and since there is no backing object and the pager cannot provide the page data, the fault handler would do vm_fault_zerofill(). But then there is a bug, IMO: vm_fault_zerofill() checks PG_ZERO, which might still be present there (the flag is only cleared on vm_page_free()). This raises the question: if vm_object_page_remove() cannot remove a page, for instance because it is wired, should it at least invalidate it? And, if we allow invalid pages, we have to iterate over the whole range to check that all found pages are invalid. | |
| 2257 | Still, using _seek_data() or what I wrote, we have to iterate over the whole range if we allow invalid pages, so it is not just two trie lookups (in theory). Also, I like the consistency of the zero-page checks being handled by the same sysctl, but that is secondary, of course. | |
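To make the shape of the discussed check concrete, here is a rough sketch of what iterating the whole coalesce range could look like under INVARIANTS; the helper name is hypothetical and the pager-side lookup is left as a comment, since the exact interface (swap_pager_seek_data() or similar) is still being discussed:

#ifdef INVARIANTS
/*
 * Hypothetical helper, not the committed code: assert that no pindex in
 * [start, end) of the object being coalesced has a resident page.  A
 * matching swap-pager lookup per pindex would also be needed to catch data
 * that is present only in swap.
 */
static void
vm_object_coalesce_assert_empty(vm_object_t object, vm_pindex_t start,
    vm_pindex_t end)
{
	vm_pindex_t pi;

	VM_OBJECT_ASSERT_WLOCKED(object);
	for (pi = start; pi < end; pi++) {
		KASSERT(vm_page_lookup(object, pi) == NULL,
		    ("coalesce: object %p pindex %jx has a resident page",
		    object, (uintmax_t)pi));
		/* A swap-pager check for pindex pi would go here. */
	}
}
#endif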
| sys/vm/vm_object.c | ||
|---|---|---|
| 2249 | Maybe vm_page_invalid() should clear PG_ZERO. | |
| sys/vm/vm_object.c | ||
|---|---|---|
| 2249 | Suppose we are punching a hole in an anonymous mapping with munmap(). If the page was wired by mlock(), then vm_map_delete() will unwire the map_entry and underlying pages. If the object is OBJ_ONEMAPPING, this unwiring will happen before the call to vm_object_page_remove() by vm_map_entry_delete(). If, however, the page was wired by the system, then vm_map_delete() will wait out the wiring before it deletes the map entry. Do you have a different scenario in mind? | |
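A minimal userland sketch of the mlock() case described in the inline comment above (illustration only; it assumes the memorylocked resource limit permits the mlock()):

#include <sys/mman.h>
#include <err.h>
#include <unistd.h>

int
main(void)
{
	size_t pgsz = (size_t)getpagesize();
	char *p;

	p = mmap(NULL, 4 * pgsz, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	if (mlock(p, 4 * pgsz) != 0)	/* user-wire the pages via mlock() */
		err(1, "mlock");
	/*
	 * Punch a hole in the wired mapping: per the comment above, the
	 * kernel unwires [p + pgsz, p + 2 * pgsz) while deleting the map
	 * entry, before vm_object_page_remove() runs.
	 */
	if (munmap(p + pgsz, pgsz) != 0)
		err(1, "munmap");
	return (0);
}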
| sys/vm/vm_object.c | ||
|---|---|---|
| 2219 | Suppose that the object is OBJ_ONEMAPPING and that we, in fact, have a single mapping to it. Then, we punch a hole in that mapping. vm_map_entry_delete() only subtracts from the object's charge when we shrink the size of the object. Now, suppose that we perform mmap(MAP_ANON) to reallocate some of the hole. Aren't we going to add to the charge here, even though vm_map_entry_delete() never subtracted from the charge for the hole that was created? | |
| sys/vm/vm_object.c | ||
|---|---|---|
| 2219 | Yes, I think this is a bug in the swap accounting. We should not add to the charge up to the object size, only the difference between the object end and next_end (see the small arithmetic sketch after this table). | |
| 2249 | Yes, I am referencing the code in vm_object_page_remove() under the vm_page_wire() condition. This could occur for many reasons now, especially after the merge of the hold and wire counts. For instance, sysctl might wire a region. Or an I/O occurring into the anon region triggers vn_io_fault to hold (wire) the user pages backing the I/O buffer. Or physio does vm_fault_quick_hold() on the user buffer. And so on. In each of these cases, we are left with invalid page(s) on the object queue if userspace does an unmap to create a hole. |
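A small arithmetic sketch of the charge rule described in the comment above on line 2219; the helper name and standalone types are illustrative, not the actual vm_object_coalesce() code:

#include <assert.h>
#include <stdint.h>

/*
 * Pages to charge when coalescing a new range ending at next_end into an
 * object whose current size is object_size (both in pages): only growth
 * past the current object end is charged, because pindexes inside a
 * previously punched hole were never uncharged by vm_map_entry_delete().
 */
static uint64_t
coalesce_charge_pages(uint64_t object_size, uint64_t next_end)
{
	return (next_end > object_size ? next_end - object_size : 0);
}

int
main(void)
{
	/* Hole re-mapped entirely inside a 4-page object: no new charge. */
	assert(coalesce_charge_pages(4, 3) == 0);
	/* New range extends two pages past the object end: charge 2 pages. */
	assert(coalesce_charge_pages(4, 6) == 2);
	return (0);
}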