Page MenuHomeFreeBSD

Fix charge accounting for objects
ClosedPublic

Authored by kib on Wed, Jan 7, 9:08 AM.
Tags
None
Referenced Files
F142263252: D54572.diff
Sat, Jan 17, 10:47 PM
F142228675: D54572.id169499.diff
Sat, Jan 17, 1:19 PM
F142199808: D54572.id169498.diff
Sat, Jan 17, 3:40 AM
F142199339: D54572.id169356.diff
Sat, Jan 17, 3:29 AM
F142199305: D54572.id169590.diff
Sat, Jan 17, 3:28 AM
F142199273: D54572.id169456.diff
Sat, Jan 17, 3:28 AM
Unknown Object (File)
Sat, Jan 17, 2:20 AM
Unknown Object (File)
Sat, Jan 17, 2:17 AM

Details

Summary
swap_release_by_cred*(): give some additional info on panics due to underflow


rfork(2): fix swap accounting in vmspace_unshare()

When an attempt to increase the swap charge for the ucred failed, we
must forcibly increase the charge to allow the vmspace_destroy()
operation to correctly un-charge the accumulated objects.

Add a swap_reserve_force_by_cred() helper and use it in
vmspace_unshare(), same as it is done in normal fork operations.


vm_object: remove the charge member

State that the object charge is zero if object->cred == NULL, or equal
to the ptoa(object->size) otherwise.

Besides being much simpler, the transition to use object->size corrects
the architectural issue with the use of object->charge.  The split
operations effectively carve the holes in the charged regions, but
single counter cannot properly express it.  As result, coalescing
anonymous mappings cannot calculate correctly if the extended mapping
already backed by the existing object is already accounted or not [1].

To properly solve the issue, either we need to start tracking exact
charged regions in the anonymous objects, which has the significant
overhead and complications.  Or give up on the slight over-accounting
and charge the whole object unconditionally, as it is done in the patch.

Reported by:    mmel, pho [1]
Tested by: pho

 tuning.7: add more explanation about swap (over-)accounting

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib requested review of this revision.Wed, Jan 7, 9:08 AM

vm_map_copy_swap_object(): clear the OBJ_ONEMAPPING flag on the original obj

Original object, shadowed by the new object, should be no longer considered one-mapped. Simply due to the fact that it is shadowed, parent mappings are unknown to the backing object.

I don't quite follow. After vm_object_split() the original and new objects are not connected to each other in the sense of being linked via backing_object. What shadowing is happening?

Also I suspect that if OBJ_ONEMAPPING is set, then the original object's ref_count will be 1, and vm_object_split() is a no-op. If this isn't true, I'd want to understand how that can happen.

vm_map_copy_swap_object(): clear the OBJ_ONEMAPPING flag on the original obj

Original object, shadowed by the new object, should be no longer considered one-mapped. Simply due to the fact that it is shadowed, parent mappings are unknown to the backing object.

I don't quite follow. After vm_object_split() the original and new objects are not connected to each other in the sense of being linked via backing_object. What shadowing is happening?

Hm, yes, I just dropped this change.

Also I suspect that if OBJ_ONEMAPPING is set, then the original object's ref_count will be 1, and vm_object_split() is a no-op. If this isn't true, I'd want to understand how that can happen.

This is not necessarily true. For instance, if you create an anon mapping of three pages ABC, and then unmap B, I believe that the object behind the original vm_map get the ref count of 2 (first incremented to 3 by split, then decremented by actual delete), but the OBJ_ONEMAPPING flag is kept.

kib edited the summary of this revision. (Show Details)

Drop the 'clear OBJ_ONEMAPPING' patch.

In D54572#1247713, @kib wrote:

Also I suspect that if OBJ_ONEMAPPING is set, then the original object's ref_count will be 1, and vm_object_split() is a no-op. If this isn't true, I'd want to understand how that can happen.

This is not necessarily true. For instance, if you create an anon mapping of three pages ABC, and then unmap B, I believe that the object behind the original vm_map get the ref count of 2 (first incremented to 3 by split, then decremented by actual delete), but the OBJ_ONEMAPPING flag is kept.

Oops, of course.

Or give up on the slight over-accounting and charge the whole object unconditionally, as it is done in the patch.

How do we know it is slight? Is there some way to express a bound on the amount of swap that might be wasted due to the over-accounting?

sys/vm/vm_map.c
3998–3999

Can we call it oldsize instead of size1, and just write swap_release_by_cred(ptoa(oldsize - object->size), object->cred);?

4960

I think this needs a comment.

s/size1/oldsize/
add comment

kib marked 2 inline comments as done.Sun, Jan 11, 3:15 AM
In D54572#1247713, @kib wrote:

Or give up on the slight over-accounting and charge the whole object unconditionally, as it is done in the patch.

How do we know it is slight? Is there some way to express a bound on the amount of swap that might be wasted due to the over-accounting?

Do you want an exact formula? I do not think it is possible, simply due to the nature of the interaction with the user.
Typically, split() would apply to the whole object, just in chunks. If split does not cover the whole object, then this is usually because the covering vm_map_entry was split for ops like mprotect() or (more rare) mlock().
If the split() covers the entire object, it should be destroyed shortly by collapse().

But if split is not total, we would keep this space accounting. It can be said that the space is leaked because no actual pageout could be initiated from it, because no pages could be installed there. Of course, the charge is un-accounted on the object free.

The only alternative to the fix I see is to record the ranges, which I do not want to do. One more intrusive alternative is to split the object itself, but this is perhaps even worse than recording ranges.

In D54572#1248367, @kib wrote:
In D54572#1247713, @kib wrote:

Or give up on the slight over-accounting and charge the whole object unconditionally, as it is done in the patch.

How do we know it is slight? Is there some way to express a bound on the amount of swap that might be wasted due to the over-accounting?

Do you want an exact formula? I do not think it is possible, simply due to the nature of the interaction with the user.

No, I'm just wondering if there is some informal argument to justify the use of "slight."

Typically, split() would apply to the whole object, just in chunks. If split does not cover the whole object, then this is usually because the covering vm_map_entry was split for ops like mprotect() or (more rare) mlock().
If the split() covers the entire object, it should be destroyed shortly by collapse().

But if split is not total, we would keep this space accounting. It can be said that the space is leaked because no actual pageout could be initiated from it, because no pages could be installed there. Of course, the charge is un-accounted on the object free.

The only alternative to the fix I see is to record the ranges, which I do not want to do. One more intrusive alternative is to split the object itself, but this is perhaps even worse than recording ranges.

Right, I'm trying to understand how big the leak might be in practice. It really depends on the behaviour of the memory allocator and in particular whether it unmaps memory instead of using MADV_FREE to release free memory back to the system. In any case, this architectural bug should probably be documented somewhere, but I'm not sure where would be appropriate.

This revision is now accepted and ready to land.Sun, Jan 11, 11:43 PM

Right, I'm trying to understand how big the leak might be in practice. It really depends on the behaviour of the memory allocator and in particular whether it unmaps memory instead of using MADV_FREE to release free memory back to the system. In any case, this architectural bug should probably be documented somewhere, but I'm not sure where would be appropriate.

Not only the free behavior. It also depends on the usage. If all memory in CoW is shadowed by both parent and child, or parent exits early, then we can collapse and eventually do.

kib edited the summary of this revision. (Show Details)

tuning.7: add more explanation about swap (over-)accounting

This revision now requires review to proceed.Mon, Jan 12, 3:58 AM

Wording update in the man page

markj added inline comments.
share/man/man7/tuning.7
230 ↗(On Diff #169499)
231 ↗(On Diff #169499)
This revision is now accepted and ready to land.Mon, Jan 12, 3:18 PM
kib marked 2 inline comments as done.Wed, Jan 14, 3:37 PM