
Allow collapse to operate to completion during paging.
ClosedPublic

Authored by jeff on Dec 23 2019, 12:42 AM.

Details

Summary

This change allows collapse to run concurrently with fault in all cases. This was necessary so that I can drop the object lock in more cases in fault. I also believe the resulting behavior is easier to understand. It provides the following benefits:

  1. Objects eligible for collapse no longer linger because paging_in_progress was set.
  2. We do not have to delay collapse for normal paging operations; we wait for xbusy only when they intersect the pages we are working on.
  3. The complicated direct reference count manipulation in vm_object_deallocate() goes away, leaving only atomic refcount manipulation.
  4. coalesce and split won't abandon operations due to paging in progress or an in-progress collapse.
  5. It is now safe to pip_add without the object lock so long as you have some other valid reference to the object.
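
A minimal userland model of benefit 5, using C11 atomics in place of the kernel's reference and pip machinery; the names here (obj_pip_add() and friends) are illustrative, not the FreeBSD KPI:

```
/*
 * Illustrative model only: a pip counter that may be bumped without the
 * object lock, provided the caller already holds some other reference
 * that keeps the object alive.  These names are hypothetical and do not
 * correspond to the real FreeBSD functions.
 */
#include <assert.h>
#include <stdatomic.h>

struct obj {
	atomic_int	ref_count;		/* keeps the object alive */
	atomic_int	paging_in_progress;	/* pip: in-flight paging ops */
};

static void
obj_pip_add(struct obj *o)
{
	/* Caller must already hold a reference (e.g. via a map entry). */
	assert(atomic_load(&o->ref_count) > 0);
	atomic_fetch_add(&o->paging_in_progress, 1);	/* no lock needed */
}

static void
obj_pip_wakeup(struct obj *o)
{
	atomic_fetch_sub(&o->paging_in_progress, 1);
	/* A real implementation would wake sleepers when this reaches zero. */
}

int
main(void)
{
	struct obj o;

	atomic_init(&o.ref_count, 1);		/* the caller's reference */
	atomic_init(&o.paging_in_progress, 0);
	obj_pip_add(&o);
	obj_pip_wakeup(&o);
	return (0);
}
```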

The primary operations that must be synchronized with respect to collapse are split, getpages, and other collapse calls. We need only worry about getpages on swap objects, which cannot create pages outside of the requested range while either split or collapse is in progress. Otherwise the busy lock protects against parallel operations.

split and collapse cannot run concurrently on orig_object because it has multiple references. split may discover that the backing object is collapsing and will now do an interlocked sleep, waiting for that condition to clear. Previously we aborted the split operation in this case.
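
A sketch of that interlocked-sleep pattern, modeled with a pthread mutex and condition variable rather than the kernel's object lock and sleep queues; the flag and function names are hypothetical:

```
#include <pthread.h>
#include <stdbool.h>

struct obj {
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
	bool		collapsing;	/* stands in for the collapsing state */
};

/*
 * Instead of aborting the split when a collapse is in flight (the old
 * behavior), sleep until the collapsing thread clears the flag and
 * broadcasts on the condition variable, then let the caller retry.
 */
void
obj_collapse_wait(struct obj *backing)
{
	pthread_mutex_lock(&backing->lock);
	while (backing->collapsing)
		pthread_cond_wait(&backing->cv, &backing->lock);
	pthread_mutex_unlock(&backing->lock);
}
```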

This could increase the frequency with which we call vm_object_scan_all_shadowed(), because we no longer avoid it while pip != 0. Swap I/O is also less efficient if we page in while a split or collapse is running.

Test Plan

I have written a custom collapse test for stress2 that I am refining. It uses minherit and fork to create deep chains of shadow objects and then random parts of the chain are deallocated on exit().

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint: Not Applicable
Unit Tests: Not Applicable

Event Timeline

jeff edited the test plan for this revision. (Show Details)
jeff added reviewers: alc, dougm, kib, markj.
jeff set the repository for this revision to rS FreeBSD src repository - subversion.

It might be beneficial to split the meaning of OBJ_DEAD between anonymous objects and vnodes, e.g. by renaming OBJ_DEAD for anonymous objects to OBJ_COLLAPSING. Then we would need to check both flags.

Could you please explain the changes to the collapse algorithm?

sys/vm/swap_pager.c
1003 ↗(On Diff #65923)

This should be asserted.

sys/vm/vm_object.c
582 ↗(On Diff #65923)

You dropped the assertion that OBJ_ANON is set on object.

The check for OBJ_ANON on backing_object is delegated to vm_object_collapse() now, am I right?

I originally wrote this with an OBJ_COLLAPSE flag, but I just ended up testing for it virtually everywhere that DEAD was tested. It looked redundant and provided little extra information.

The collapse algorithm is really no different. We just need to ensure that the swap meta tree is stable in a few places and prevent fault from operating on a stale meta tree. I believe these changes and their assumptions are well documented in the comments.

sys/vm/swap_pager.c
1003 ↗(On Diff #65923)

Can do

sys/vm/vm_object.c
582 ↗(On Diff #65923)

Yes, I can add more asserts if necessary. We know backing_object was ANON when we entered the function. It could have changed to something that was not anon if we raced with another collapse. In that case vm_object_collapse() will return immediately, and checking here is redundant.

sys/vm/vm_object.c
582 ↗(On Diff #65923)

This assert is more to help the reader of the function understand the assumptions about its environment.

1429 ↗(On Diff #65923)

The read of source->flags is unlocked and racy. In principle, we might miss OBJ_DEAD and only note it later when we lock source.

1432 ↗(On Diff #65923)

After this loop orig_object == entry->object.vm_object because the only caller has the vm_map locked. This is not obvious to the reader of the function.

1685 ↗(On Diff #65923)

This comment seems to be no longer valid: the new code does not skip, it always waits for busy and then restarts.

1793 ↗(On Diff #65923)

Again, I think we might miss the OBJ_DEAD flag setting there.

1811 ↗(On Diff #65923)

IMO it would be cleaner to set OBJ_DEAD on backing_object there instead of in collapse_scan() now.

1824 ↗(On Diff #65923)

Please explain why these checks can be removed. In particular, what prevents a parallel collapse, with this object as the backing_object, from starting later?

sys/vm/vm_object.c
1429 ↗(On Diff #65923)

See the comment in anon_deallocate(). It is safe to check DEAD with the parent locked because collapse can only start with both locks held. I was thinking about moving this into its own function like: source = vm_object_backing_wait(orig_object);
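
A rough sketch of what such a helper could look like, again modeled with userland primitives; vm_object_backing_wait() is only a proposed name, and obj_backing_wait() below is hypothetical:

```
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	pthread_mutex_t	 lock;
	pthread_cond_t	 cv;
	bool		 dead;		/* set only with parent and child locked */
	struct obj	*backing;
};

/*
 * Hypothetical helper: the caller holds parent->lock.  Because a collapse
 * may only begin while holding both the parent's and the backing object's
 * locks, it is safe to test the backing object's "dead" flag here without
 * locking it.  The collapsing thread is assumed to broadcast parent->cv
 * once it finishes and detaches or replaces the backing object.
 */
struct obj *
obj_backing_wait(struct obj *parent)
{
	struct obj *source;

	while ((source = parent->backing) != NULL && source->dead)
		pthread_cond_wait(&parent->cv, &parent->lock);
	return (source);
}
```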

1824 ↗(On Diff #65923)

Only the first caller to set DEAD does the collapse; this serializes collapse operations. xbusy is sufficient to serialize page use, so we only need to worry about serializing the swap contents. This is why I changed getpages() and forced split to wait for collapse to complete.
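
A condensed model of that serialization rule, with illustrative names; whichever caller flips the flag first under the object lock owns the collapse, and later callers must wait for it to finish:

```
#include <pthread.h>
#include <stdbool.h>

struct obj {
	pthread_mutex_t	lock;
	bool		collapsing;	/* stands in for the DEAD/COLLAPSING flag */
};

/*
 * Returns true if the caller won the right to perform the collapse.  The
 * flag transition happens under the object lock, so at most one caller
 * can win; everyone else sees the flag already set.
 */
bool
obj_collapse_start(struct obj *o)
{
	bool won;

	pthread_mutex_lock(&o->lock);
	won = !o->collapsing;
	if (won)
		o->collapsing = true;
	pthread_mutex_unlock(&o->lock);
	return (won);
}
```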

The one bug in this patch is that we need to grab a ref on the parent object when we come via vm_object_anon_deallocate(), or I have to handle collapse and terminate happening concurrently, which is messier.

Restore the reference to the backing_object's last shadow to simplify teardown races. Add a COLLAPSING flag so that the conditions and synchronization surrounding waiting for it are clearer. Attempt to assert more and document better. This is passing a test case that found bugs in head, but I'm still not hitting every case I care about with it yet. I believe this is close to committable pending further testing.

sys/vm/vm_object.c
590 ↗(On Diff #66119)

I don't really like this, but I can't blindly acquire_if_not_zero because we could reference an object that is partway through collapse(), unless I convert the tail of collapse to use vm_object_deallocate() instead of vm_object_terminate(), which starts to get a bit funky.

877 ↗(On Diff #66119)

This could be a WAITERS flag and ad-hoc synchronization if we don't like overloading pip.

The important thing for me is to be able to pip anonymous objects without a lock, which prevents pip from being used to synchronize collapse.
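
For comparison, a sketch of the WAITERS-flag alternative mentioned above, with hypothetical names; waiters announce themselves before sleeping so the finishing thread only issues a wakeup when someone is actually waiting:

```
#include <pthread.h>
#include <stdbool.h>

struct obj {
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
	bool		collapsing;
	bool		waiters;	/* someone is sleeping on "collapsing" */
};

void
obj_collapse_sleep(struct obj *o)
{
	pthread_mutex_lock(&o->lock);
	while (o->collapsing) {
		o->waiters = true;	/* announce before sleeping */
		pthread_cond_wait(&o->cv, &o->lock);
	}
	pthread_mutex_unlock(&o->lock);
}

void
obj_collapse_finish(struct obj *o)
{
	pthread_mutex_lock(&o->lock);
	o->collapsing = false;
	if (o->waiters) {		/* wake only if someone asked */
		o->waiters = false;
		pthread_cond_broadcast(&o->cv);
	}
	pthread_mutex_unlock(&o->lock);
}
```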

sys/vm/vnode_pager.c
203 ↗(On Diff #66119)

This has not been possible for a while.

sys/vm/vm_object.c
578 ↗(On Diff #66119)

Quibbling, but it's easier to see that this is a helper for vm_object_deallocate() if it's called vm_object_deallocate_anon() instead.

804 ↗(On Diff #66119)

Can't this assertion be in vm_object_backing_insert_locked()?

1522 ↗(On Diff #66119)

This comment is kind of confusing, there is no backing_object.

1981 ↗(On Diff #66119)

Could you add a (void) cast while here?

sys/vm/vm_object.c
578 ↗(On Diff #66119)

OK, provided I also rename vm_object_vndeallocate() to vm_object_deallocate_vnode().

804 ↗(On Diff #66119)

Yes, that's a good idea.

1522 ↗(On Diff #66119)

It is called 'source' in this function. I could rename it for clarity.

sys/vm/vm_object.c
578 ↗(On Diff #66119)

I missed that inconsistency. It's up to you of course, but I prefer vm_object_deallocate_vnode().

1522 ↗(On Diff #66119)

For minimal churn I'd write, "The additional reference on orig_object's backing object by new_object will prevent further collapses until the split completes." I'm fine with renaming too.

sys/vm/vm_object.c
1522 ↗(On Diff #66119)

It is only referenced a few times, and I'm going to rename it for clarity. Having only read the implementation of shadow objects in detail a few months ago, I often found the naming inconsistencies confusing at first glance. So I think a little churn for consistency is worth it, given the amount of other related churn that has gone on.

Rename a few things for clarity. Address Mark's feedback. Simplify some changes by calling vm_object_terminate() in collapse instead of destroy().

This could increase the frequency with which we call vm_object_scan_all_shadowed(), because we no longer avoid it while pip != 0. Swap I/O is also less efficient if we page in while a split or collapse is running.

We also no longer compare the resident page counts. Is that necessary with the rest of this change?

sys/vm/swap_pager.c
1214 ↗(On Diff #66274)

Why isn't this a problem for the vnode pager as well?

sys/vm/vm_object.c
577 ↗(On Diff #66274)

Extra newline.

804 ↗(On Diff #66119)

Did that not work for some reason?

markj added inline comments.
sys/vm/swap_pager.c
1214 ↗(On Diff #66274)

... because we aren't going to split or collapse a vnode object.

This revision is now accepted and ready to land. Jan 3 2020, 2:25 AM
sys/vm/vm_object.c
804 ↗(On Diff #66119)

Just forgot to do it. I will adjust in my next patch.

sys/vm/vm_object.c
497 ↗(On Diff #66274)

This lock could go away if we changed the way we tear down vnode objects during forced unmount. This is a somewhat significant source of contention, in addition to creating a lot of additional complexity here and elsewhere.

I argue that the vnode_pager.c change should be committed separately.

This revision was automatically updated to reflect the committed changes.