Details

Reviewers

alc
markj

Commits

rG6f41575a94b3: tuning.7: wording fixes
rG457b940bfb6a: tuning.7: add more explanation about swap (over-)accounting
rGd160447129fe: vm_object: remove the charge member
rGde770681234d: rfork(2): fix swap accounting in vmspace_unshare()
rG7361727d4584: swap_release_by_cred*(): give some additional info on panics due to underflow

Summary

swap_release_by_cred*(): give some additional info on panics due to underflow


rfork(2): fix swap accounting in vmspace_unshare()

When an attempt to increase the swap charge for the ucred failed, we
must forcibly increase the charge to allow the vmspace_destroy()
operation to correctly un-charge the accumulated objects.

Add a swap_reserve_force_by_cred() helper and use it in
vmspace_unshare(), same as it is done in normal fork operations.


vm_object: remove the charge member

State that the object charge is zero if object->cred == NULL, or equal
to the ptoa(object->size) otherwise.

Besides being much simpler, the transition to use object->size corrects
the architectural issue with the use of object->charge.  The split
operations effectively carve the holes in the charged regions, but
single counter cannot properly express it.  As result, coalescing
anonymous mappings cannot calculate correctly if the extended mapping
already backed by the existing object is already accounted or not [1].

To properly solve the issue, either we need to start tracking exact
charged regions in the anonymous objects, which has the significant
overhead and complications.  Or give up on the slight over-accounting
and charge the whole object unconditionally, as it is done in the patch.

Reported by:    mmel, pho [1]
Tested by: pho

 tuning.7: add more explanation about swap (over-)accounting

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

kib created this revision.Jan 7 2026, 9:08 AM

Herald added subscribers: olce, glebius, imp. · View Herald TranscriptJan 7 2026, 9:08 AM

kib requested review of this revision.Jan 7 2026, 9:08 AM

vm_map_copy_swap_object(): clear the OBJ_ONEMAPPING flag on the original obj

Original object, shadowed by the new object, should be no longer considered one-mapped. Simply due to the fact that it is shadowed, parent mappings are unknown to the backing object.

I don't quite follow. After vm_object_split() the original and new objects are not connected to each other in the sense of being linked via backing_object. What shadowing is happening?

Also I suspect that if OBJ_ONEMAPPING is set, then the original object's ref_count will be 1, and vm_object_split() is a no-op. If this isn't true, I'd want to understand how that can happen.

In D54572#1247459, @markj wrote:

vm_map_copy_swap_object(): clear the OBJ_ONEMAPPING flag on the original obj

Original object, shadowed by the new object, should be no longer considered one-mapped. Simply due to the fact that it is shadowed, parent mappings are unknown to the backing object.

I don't quite follow. After vm_object_split() the original and new objects are not connected to each other in the sense of being linked via backing_object. What shadowing is happening?

Hm, yes, I just dropped this change.

Also I suspect that if OBJ_ONEMAPPING is set, then the original object's ref_count will be 1, and vm_object_split() is a no-op. If this isn't true, I'd want to understand how that can happen.

This is not necessarily true. For instance, if you create an anon mapping of three pages ABC, and then unmap B, I believe that the object behind the original vm_map get the ref count of 2 (first incremented to 3 by split, then decremented by actual delete), but the OBJ_ONEMAPPING flag is kept.

Drop the 'clear OBJ_ONEMAPPING' patch.

In D54572#1247713, @kib wrote:

In D54572#1247459, @markj wrote:

Also I suspect that if OBJ_ONEMAPPING is set, then the original object's ref_count will be 1, and vm_object_split() is a no-op. If this isn't true, I'd want to understand how that can happen.

This is not necessarily true. For instance, if you create an anon mapping of three pages ABC, and then unmap B, I believe that the object behind the original vm_map get the ref count of 2 (first incremented to 3 by split, then decremented by actual delete), but the OBJ_ONEMAPPING flag is kept.

Oops, of course.

Or give up on the slight over-accounting and charge the whole object unconditionally, as it is done in the patch.

How do we know it is slight? Is there some way to express a bound on the amount of swap that might be wasted due to the over-accounting?

sys/vm/vm_map.c
3999	Can we call it `oldsize` instead of `size1`, and just write `swap_release_by_cred(ptoa(oldsize - object->size), object->cred);`?
4961	I think this needs a comment.

s/size1/oldsize/
add comment

In D54572#1248346, @markj wrote:

In D54572#1247713, @kib wrote:

Or give up on the slight over-accounting and charge the whole object unconditionally, as it is done in the patch.

How do we know it is slight? Is there some way to express a bound on the amount of swap that might be wasted due to the over-accounting?

Do you want an exact formula? I do not think it is possible, simply due to the nature of the interaction with the user.
Typically, split() would apply to the whole object, just in chunks. If split does not cover the whole object, then this is usually because the covering vm_map_entry was split for ops like mprotect() or (more rare) mlock().
If the split() covers the entire object, it should be destroyed shortly by collapse().

But if split is not total, we would keep this space accounting. It can be said that the space is leaked because no actual pageout could be initiated from it, because no pages could be installed there. Of course, the charge is un-accounted on the object free.

The only alternative to the fix I see is to record the ranges, which I do not want to do. One more intrusive alternative is to split the object itself, but this is perhaps even worse than recording ranges.

In D54572#1248367, @kib wrote:

In D54572#1248346, @markj wrote:

In D54572#1247713, @kib wrote:

Or give up on the slight over-accounting and charge the whole object unconditionally, as it is done in the patch.

How do we know it is slight? Is there some way to express a bound on the amount of swap that might be wasted due to the over-accounting?

Do you want an exact formula? I do not think it is possible, simply due to the nature of the interaction with the user.

No, I'm just wondering if there is some informal argument to justify the use of "slight."

Typically, split() would apply to the whole object, just in chunks. If split does not cover the whole object, then this is usually because the covering vm_map_entry was split for ops like mprotect() or (more rare) mlock().
If the split() covers the entire object, it should be destroyed shortly by collapse().

But if split is not total, we would keep this space accounting. It can be said that the space is leaked because no actual pageout could be initiated from it, because no pages could be installed there. Of course, the charge is un-accounted on the object free.

The only alternative to the fix I see is to record the ranges, which I do not want to do. One more intrusive alternative is to split the object itself, but this is perhaps even worse than recording ranges.

Right, I'm trying to understand how big the leak might be in practice. It really depends on the behaviour of the memory allocator and in particular whether it unmaps memory instead of using MADV_FREE to release free memory back to the system. In any case, this architectural bug should probably be documented somewhere, but I'm not sure where would be appropriate.

This revision is now accepted and ready to land.Jan 11 2026, 11:43 PM

In D54572#1248709, @markj wrote:

Right, I'm trying to understand how big the leak might be in practice. It really depends on the behaviour of the memory allocator and in particular whether it unmaps memory instead of using MADV_FREE to release free memory back to the system. In any case, this architectural bug should probably be documented somewhere, but I'm not sure where would be appropriate.

Not only the free behavior. It also depends on the usage. If all memory in CoW is shadowed by both parent and child, or parent exits early, then we can collapse and eventually do.