Page MenuHomeFreeBSD

(vm object 3) Add a shared object busy synchronization mechanism that blocks new page busy acquires while held.
ClosedPublic

Authored by jeff on Sep 10 2019, 7:58 PM.

Details

Summary

This is the least elegant of the object concurrency patches. The basic notion is that the object can become busy, after which new pages can not become busy but existing pages may persist in being busy. This permits the existing style of only checking, and not acquiring, busy for dedicated blocks of code.

Here with object busy we're synchronizing vm_fault_soft_fast against pmap_remove* or changes in valid bits on any constituent page in the super page set. We could simply busy every page in the set and then unbusy at the end as we would've for 512 individual mappings. This would be my preference, but there was concern over the cost. So instead we use the atomic object busy counter to emulate the old behavior of a busy check being sufficient.

This is the only consumer that still uses the pattern of checking busy without acquiring and so it is at present the only place that requires object busy. This is only synchronizing against other page busy consumers that don't have the object lock held. Callers which do have the object write lock will still be serialized. The object busy may allow us to do vm_fault_soft_fast without the object lock held across the pmap operation.

The implementation does allow for pages to transiently become busy, notice the object busy count, and drop busy. These will revert back to single page mappings if you lose the race. I don't see a reasonable way around this. I think it's likely that if the page is becoming busy it may drop out of the superpage set later anyway.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

jeff created this revision.Sep 10 2019, 7:58 PM
jeff retitled this revision from Add a shared object busy synchronization mechanism that blocks new page busy acquires while held. to (vm object 3) Add a shared object busy synchronization mechanism that blocks new page busy acquires while held..Sep 10 2019, 8:10 PM
jeff edited the summary of this revision. (Show Details)
jeff edited reviewers, added: alc, kib, markj, dougm; removed: manu.
jeff edited the summary of this revision. (Show Details)Sep 17 2019, 9:55 PM
kib added inline comments.Sep 18 2019, 6:24 AM
sys/amd64/amd64/pmap.c
5660 ↗(On Diff #61909)

As was discussed somewhere else, this assert should stil check that the page is busied if PMAP_ENTER_NOSLEEP was specified. It may be just a replacement of VM_OBJECT_ASSERT_LOCKED() by (PMAP_ENTER_NOSLEEP && VM_PAGE_OBJECT_BUSY_ASSERT).

sys/vm/vm_fault.c
285 ↗(On Diff #61909)

We wrote a value to fs->first_object->busy above, and there we read m->busy. Other thread could write the lock cookie to m->busy, and then rechecks first_object->busy.

What prevents us from seeing non-busies m, simultaneously with other thread seeing unbusy first_object ? In other words, it is typical (x=1; read y;) vs. (y = 1; read x;) race.

jeff added inline comments.Sep 18 2019, 8:43 PM
sys/amd64/amd64/pmap.c
5660 ↗(On Diff #61909)

I'm not sure I fully follow. We're already asserting that the page is busy in all cases. What is the additional assert?

sys/vm/vm_fault.c
285 ↗(On Diff #61909)

The important thing is that when the object is busy you can transiently have a busy page but you can not have any modifications permitted by busying a page. So you are left with a harmless race and you skip superpage insertion.

So in your example the first thread may read the page as busy and operate as-if the page had been busy before the call to vm_object_busy(). The second thread will observe the object busy count in tryxbusy and unbusy the page before returning failure to the caller. So you do have this transient condition. I believe it is harmless and according to tests I have done it is quite rare.

The other obvious solution is to abandon this whole approach and simply busy/unbusy 512 pages.

markj added inline comments.Sep 18 2019, 10:17 PM
sys/vm/vm_fault.c
281 ↗(On Diff #61909)

Since the goal of these patches is to permit busying a page without the object lock held, how is the check for the busy state of m below going to work? I presume that at some point we must change this code to xbusy m, but that won't succeed if the object is busy.

285 ↗(On Diff #61909)

Currently vm_page_xbusy() and vm_page_sbusy() panic if their attempts to update the page state fail. Right now we are protected from that by the object lock, but how would this work in the longer term?

sys/vm/vm_page.c
904 ↗(On Diff #61909)

I think this deserves a comment. What exactly does the wiring protect us from here?

1112 ↗(On Diff #61909)

m doesn't need to be parenthesized here.

jeff added inline comments.Sep 18 2019, 10:25 PM
sys/vm/vm_fault.c
281 ↗(On Diff #61909)

I feel that my comments must be failing to describe something fundamental. This is addressed in the patch description.

If the page was busied by another thread before the call to vm_object_busy then busied returns true and we fall back to the fully synchronized version of vm_fault. All of the code in soft_fault_fast is an optional fast path.

We do not busy the first page or any other page in the superpage set. We use the object busy to guarantee that the state remains stable across the operation by denying other threads from acquiring busy.

285 ↗(On Diff #61909)

Those functions are to be eliminated at the end of the patchset. Nothing in the kernel calls them after #2. Only trybusy and busy_acquire() are used.

sys/vm/vm_page.c
904 ↗(On Diff #61909)

Yes the notion is if you don't have the object locked you must be holding a reference count on the page otherwise it can disappear or change identity while you sleep and loop to acquire busy again.

markj added inline comments.Sep 18 2019, 10:39 PM
sys/vm/vm_page.c
1351 ↗(On Diff #61909)

Isn't obj != NULL guaranteed by the fact that we hold the object lock?

1384 ↗(On Diff #61909)

Same here.

I think I've brought this up before, but I would like it if the VM had a generic per-2MB page structure. We already have several in vm_reserv and the pmap, and IMO it would be a good place to maintain a "compound" busy state, rather than in the object. I worry that a mechanism to block the busying of all pages in an object will inhibit concurrency and lead to transient latency spikes. I don't object to the current approach though.

sys/vm/vm_fault.c
281 ↗(On Diff #61909)

I just confused myself while reading code.

jeff added a comment.Sep 18 2019, 10:59 PM

I think I've brought this up before, but I would like it if the VM had a generic per-2MB page structure. We already have several in vm_reserv and the pmap, and IMO it would be a good place to maintain a "compound" busy state, rather than in the object. I worry that a mechanism to block the busying of all pages in an object will inhibit concurrency and lead to transient latency spikes. I don't object to the current approach though.

Right now the object lock is the mechanism that blocks busying of all pages and creates transient latency spikes. This at least narrows it a level and allows other object operations to proceed. If you had a per-superpage object you would have to be able to very quickly look it up and acquire in tryxbusy/sbusy.

Ultimately I would like to see a generic mechanism that can treat pages as groups with a single set of state. A variable page size.

kib added inline comments.Sep 19 2019, 1:47 PM
sys/amd64/amd64/pmap.c
5660 ↗(On Diff #61909)

I see, I forgot about the fact that vm_page_object_busy_assert() does check for busy (but not for xbusy, it seems).

sys/vm/vm_fault.c
285 ↗(On Diff #61909)

But the impossible reads are relevant for the non-superpage case as well. We might map a busy page (4k), and other thread might mark and assume the page busy without noticing object->busy.

kib added a comment.EditedSep 19 2019, 1:55 PM
In D21592#473571, @jeff wrote:

I think I've brought this up before, but I would like it if the VM had a generic per-2MB page structure. We already have several in vm_reserv and the pmap, and IMO it would be a good place to maintain a "compound" busy state, rather than in the object. I worry that a mechanism to block the busying of all pages in an object will inhibit concurrency and lead to transient latency spikes. I don't object to the current approach though.

Right now the object lock is the mechanism that blocks busying of all pages and creates transient latency spikes. This at least narrows it a level and allows other object operations to proceed. If you had a per-superpage object you would have to be able to very quickly look it up and acquire in tryxbusy/sbusy.
Ultimately I would like to see a generic mechanism that can treat pages as groups with a single set of state. A variable page size.

But isn't vm_reserv adequate object ? psind is set to 1 iff the reserv is fully populated. Why cannot we lock vm_reserv around pmap_enter() in psind==1 case for fast_soft ?

Update: when I wrote that, I thought about locking reserv and e.g. busy the first page in the range. Incompatible manipulation of any page in the superpage would need to break the reservation. But this is not enough, indeed.

jeff added a comment.Sep 19 2019, 7:38 PM
In D21592#473738, @kib wrote:
In D21592#473571, @jeff wrote:

I think I've brought this up before, but I would like it if the VM had a generic per-2MB page structure. We already have several in vm_reserv and the pmap, and IMO it would be a good place to maintain a "compound" busy state, rather than in the object. I worry that a mechanism to block the busying of all pages in an object will inhibit concurrency and lead to transient latency spikes. I don't object to the current approach though.

Right now the object lock is the mechanism that blocks busying of all pages and creates transient latency spikes. This at least narrows it a level and allows other object operations to proceed. If you had a per-superpage object you would have to be able to very quickly look it up and acquire in tryxbusy/sbusy.
Ultimately I would like to see a generic mechanism that can treat pages as groups with a single set of state. A variable page size.

But isn't vm_reserv adequate object ? psind is set to 1 iff the reserv is fully populated. Why cannot we lock vm_reserv around pmap_enter() in psind==1 case for fast_soft ?
Update: when I wrote that, I thought about locking reserv and e.g. busy the first page in the range. Incompatible manipulation of any page in the superpage would need to break the reservation. But this is not enough, indeed.

You would need to handle busy like we did vm_page_lock() where it refers to a range of pages. Or you would need to do an optional tiered lock like object busy but with the reservation being the second lock. I hope that as we figure out how to aggregate pages into collections better this object busy goes away. For now it lets me go forward with a lot of object lock reduction.

sys/vm/vm_fault.c
285 ↗(On Diff #61909)

I don't understand this example. How will another thread assume page is busy without noticing object busy?

jeff added inline comments.Sep 25 2019, 11:32 PM
sys/vm/vm_fault.c
285 ↗(On Diff #61909)

I now understand this to mean I likely need stronger serialization in vm_object_busy().

I'm also considering renaming this to vm_object_barrier() or something of the sort so the exact mechanism and guarantees are not conflated with page busy.

jeff updated this revision to Diff 62924.Oct 4 2019, 10:29 PM

Add missing barrier.

kib added inline comments.Oct 6 2019, 12:30 PM
sys/vm/vm_object.c
2261 ↗(On Diff #62924)

I believe you need the _acq_rel() fence there, because _acq is basically r|r,w barrier. I.e. the write of the new busy counter is allowed to 'move' after the _acq(). _rel is r,w|w.

You may want to note in comment that the _rel fence would sync/with _acq of the page trybusy operation.

On x86 this would still produce only compiler memory barrier, so there should be no loss.

kib added inline comments.Oct 6 2019, 12:40 PM
sys/vm/vm_page.c
1108 ↗(On Diff #62924)

And there, I believe, you need a _rel fence to sync/w the acq part of the fence in vm_object_busy().

jeff added inline comments.Oct 6 2019, 9:41 PM
sys/vm/vm_object.c
2261 ↗(On Diff #62924)

We want to stop a load of the page busy field from being re-ordered before the refcount is acquired. _rel would allow the load to precede it.

sys/vm/vm_page.c
1108 ↗(On Diff #62924)

This acq barrier should mean that we did not speculatively load the object busy until after the atomic completed.

kib added inline comments.Oct 8 2019, 8:38 AM
sys/vm/vm_object.c
2261 ↗(On Diff #62924)

Right, _rel is not enough, but _acq is not enough as well, see above. This is why I said that
_acq_rel is needed (refcount acquire is both load and store in atomic, so it should be not possible to pass _acq_rel fence).

sys/vm/vm_page.c
1108 ↗(On Diff #62924)

acq only prevents the operations after the fcmpset to appear before it. It does not prevents load before to execute after fcmpset.

jeff added inline comments.Oct 8 2019, 10:54 PM
sys/vm/vm_object.c
2261 ↗(On Diff #62924)

I see yes, I agree.

sys/vm/vm_page.c
1108 ↗(On Diff #62924)

" When an atomic operation has acquire semantics, the operation must have

completed before any subsequent load or store (by program order) is
performed.  Conversely, acquire semantics do not require that prior loads
or stores have completed before the atomic operation is performed. "

This is all that is necessary. The import object busy check is after the atomic. The one above is racy and speculative.

kib added inline comments.Oct 9 2019, 8:11 AM
sys/vm/vm_page.c
1108 ↗(On Diff #62924)

You are right, sorry.

jeff added inline comments.Oct 10 2019, 3:32 AM
sys/vm/vm_page.c
1108 ↗(On Diff #62924)

I think we are in agreement now? Ok to commit with the fence above fixed? Thank you for your attention to detail.

kib added inline comments.Oct 10 2019, 8:57 AM
sys/vm/vm_page.c
1108 ↗(On Diff #62924)

Yes, sure.

This revision was not accepted when it landed; it landed in state Needs Review.Sat, Oct 26, 9:31 PM
This revision was automatically updated to reflect the committed changes.