swap_release_by_cred*(): give some additional info on panics due to underflow rfork(2): fix swap accounting in vmspace_unshare() When an attempt to increase the swap charge for the ucred failed, we must forcibly increase the charge to allow the vmspace_destroy() operation to correctly un-charge the accumulated objects. Add a swap_reserve_force_by_cred() helper and use it in vmspace_unshare(), same as it is done in normal fork operations. vm_object: remove the charge member State that the object charge is zero if object->cred == NULL, or equal to the ptoa(object->size) otherwise. Besides being much simpler, the transition to use object->size corrects the architectural issue with the use of object->charge. The split operations effectively carve the holes in the charged regions, but single counter cannot properly express it. As result, coalescing anonymous mappings cannot calculate correctly if the extended mapping already backed by the existing object is already accounted or not [1]. To properly solve the issue, either we need to start tracking exact charged regions in the anonymous objects, which has the significant overhead and complications. Or give up on the slight over-accounting and charge the whole object unconditionally, as it is done in the patch. Reported by: mmel, pho [1] Tested by: pho tuning.7: add more explanation about swap (over-)accounting
Details
- Reviewers
alc markj - Commits
- rG6f41575a94b3: tuning.7: wording fixes
rG457b940bfb6a: tuning.7: add more explanation about swap (over-)accounting
rGd160447129fe: vm_object: remove the charge member
rGde770681234d: rfork(2): fix swap accounting in vmspace_unshare()
rG7361727d4584: swap_release_by_cred*(): give some additional info on panics due to underflow
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
vm_map_copy_swap_object(): clear the OBJ_ONEMAPPING flag on the original obj
Original object, shadowed by the new object, should be no longer considered one-mapped. Simply due to the fact that it is shadowed, parent mappings are unknown to the backing object.
I don't quite follow. After vm_object_split() the original and new objects are not connected to each other in the sense of being linked via backing_object. What shadowing is happening?
Also I suspect that if OBJ_ONEMAPPING is set, then the original object's ref_count will be 1, and vm_object_split() is a no-op. If this isn't true, I'd want to understand how that can happen.
Hm, yes, I just dropped this change.
Also I suspect that if OBJ_ONEMAPPING is set, then the original object's ref_count will be 1, and vm_object_split() is a no-op. If this isn't true, I'd want to understand how that can happen.
This is not necessarily true. For instance, if you create an anon mapping of three pages ABC, and then unmap B, I believe that the object behind the original vm_map get the ref count of 2 (first incremented to 3 by split, then decremented by actual delete), but the OBJ_ONEMAPPING flag is kept.
Oops, of course.
Or give up on the slight over-accounting and charge the whole object unconditionally, as it is done in the patch.
How do we know it is slight? Is there some way to express a bound on the amount of swap that might be wasted due to the over-accounting?
| sys/vm/vm_map.c | ||
|---|---|---|
| 3999 | Can we call it oldsize instead of size1, and just write swap_release_by_cred(ptoa(oldsize - object->size), object->cred);? | |
| 4961 | I think this needs a comment. | |
Do you want an exact formula? I do not think it is possible, simply due to the nature of the interaction with the user.
Typically, split() would apply to the whole object, just in chunks. If split does not cover the whole object, then this is usually because the covering vm_map_entry was split for ops like mprotect() or (more rare) mlock().
If the split() covers the entire object, it should be destroyed shortly by collapse().
But if split is not total, we would keep this space accounting. It can be said that the space is leaked because no actual pageout could be initiated from it, because no pages could be installed there. Of course, the charge is un-accounted on the object free.
The only alternative to the fix I see is to record the ranges, which I do not want to do. One more intrusive alternative is to split the object itself, but this is perhaps even worse than recording ranges.
No, I'm just wondering if there is some informal argument to justify the use of "slight."
Typically, split() would apply to the whole object, just in chunks. If split does not cover the whole object, then this is usually because the covering vm_map_entry was split for ops like mprotect() or (more rare) mlock().
If the split() covers the entire object, it should be destroyed shortly by collapse().But if split is not total, we would keep this space accounting. It can be said that the space is leaked because no actual pageout could be initiated from it, because no pages could be installed there. Of course, the charge is un-accounted on the object free.
The only alternative to the fix I see is to record the ranges, which I do not want to do. One more intrusive alternative is to split the object itself, but this is perhaps even worse than recording ranges.
Right, I'm trying to understand how big the leak might be in practice. It really depends on the behaviour of the memory allocator and in particular whether it unmaps memory instead of using MADV_FREE to release free memory back to the system. In any case, this architectural bug should probably be documented somewhere, but I'm not sure where would be appropriate.
Not only the free behavior. It also depends on the usage. If all memory in CoW is shadowed by both parent and child, or parent exits early, then we can collapse and eventually do.