
Change synchronization rules for page reference counting.
Closed, Public

Authored by markj on May 31 2019, 9:18 PM.

Details

Summary

My aim is to reduce the scope of the vm_page lock array. It serves
multiple unrelated purposes, and the locks show up as a bottleneck
in some workloads. For example, vm_object_terminate_pages() bounces
between page locks at a high rate during builds. The locks also show up
with sendfile: in vm_page_grab_pages() when wiring resident pages, and
when freeing SF_NOCACHE pages (i.e., Netflix's workload).

The VM page lock protects the wire_count field and the object's
reference. The basic idea is to use per-page atomic reference counting
to remove the need for the page locks. This allows many code paths to
simply avoid the page lock and instead perform a single atomic operation
on a cache line within the vm_page. My approach imposes some overhead
on the page daemon, which now must acquire and drop a transient
reference to each page while scanning it, and handle some new race
conditions. However, I believe this is a reasonable tradeoff: in most
workloads, the page daemon is not the bottleneck, and profiling shows
that when it is, we would be best served by trying to reduce the
effect of cache misses incurred while traversing the queue itself.

Summary of changes:
The wire_count field is renamed to ref_count. wire_count is preserved
for now to avoid churn in the pmap code.
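
As a rough illustration of the rename (trimmed to a single field; the real
struct vm_page has many more fields and the actual aliasing mechanism in the
patch may differ), the old name can be kept as an alias via an anonymous
union, a C11 feature that comes up again later in this review:

#include <sys/types.h>

/*
 * Sketch only: ref_count replaces wire_count, and the legacy name is
 * preserved as an alias so that pmap code does not need to change yet.
 */
struct vm_page_sketch {
	union {
		u_int wire_count;	/* legacy name, kept for pmap code */
		u_int ref_count;	/* new name: all page references */
	};
};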

vm_page_wire() can now be used only for pages belonging to an object.
It asserts that the page is either busy, or the vm object lock is held.
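
A minimal sketch of that contract, assuming ref_count is the renamed field;
the exact assertions and atomic update in the patch may differ:

void
vm_page_wire(vm_page_t m)
{
	KASSERT(m->object != NULL,
	    ("vm_page_wire: page %p does not belong to an object", m));
	if (!vm_page_busied(m))
		VM_OBJECT_ASSERT_LOCKED(m->object);

	/*
	 * The busy lock or object lock keeps the page stable; a single
	 * atomic increment records the new wiring.
	 */
	atomic_add_int(&m->ref_count, 1);
}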

vm_page_wire_mapped() is for use by pmap_extract_and_hold(). It may
fail if a thread is concurrently executing pmap_remove_all() or
pmap_remove_write().
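
Conceptually, vm_page_wire_mapped() can be a compare-and-swap loop that
refuses to wire the page while a blocking flag is set by a concurrent
vm_page_try_remove_*() call. VPRC_BLOCKED below is the flag mentioned later
in this review; its exact encoding in ref_count is an assumption here:

bool
vm_page_wire_mapped(vm_page_t m)
{
	u_int old;

	old = m->ref_count;
	do {
		/*
		 * A concurrent vm_page_try_remove_all()/_write() has
		 * blocked new wirings; fail and let
		 * pmap_extract_and_hold() handle it.
		 */
		if ((old & VPRC_BLOCKED) != 0)
			return (false);
	} while (!atomic_fcmpset_int(&m->ref_count, &old, old + 1));

	return (true);
}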

vm_page_try_remove_all() and vm_page_try_remove_write() atomically check
for wirings of a page and block new wirings via vm_page_wire_mapped()
before calling pmap_remove_all() or pmap_remove_write(). They fail if a
wiring of the page exists.
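
A sketch of how the blocking might be structured; pmap_op stands for
pmap_remove_all() or pmap_remove_write(), and VPRC_WIRE_COUNT() is an
assumed helper that extracts the wiring count from ref_count:

static bool
vm_page_try_remove_sketch(vm_page_t m, void (*pmap_op)(vm_page_t))
{
	u_int old;

	old = m->ref_count;
	do {
		/* Fail if the page has any wirings. */
		if (VPRC_WIRE_COUNT(old) != 0)
			return (false);
	} while (!atomic_fcmpset_int(&m->ref_count, &old,
	    old | VPRC_BLOCKED));

	/* New wirings via vm_page_wire_mapped() now fail. */
	pmap_op(m);

	atomic_clear_int(&m->ref_count, VPRC_BLOCKED);
	return (true);
}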

vm_page_unwire() no longer returns a value and no longer accepts PQ_NONE
as the queue parameter. It frees the page if the wiring was the last
reference. vm_page_unwire_noq() returns true if it released the last
wiring (not necessarily the last reference) to the page.
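
Caller-side, the new semantics look roughly like this (the calls are
illustrative usage only, not changes to any particular call site):

static void
unwire_usage_sketch(vm_page_t m)
{
	/*
	 * Drop a wiring and requeue the page into PQ_INACTIVE; if the
	 * wiring was the last reference, vm_page_unwire() frees the
	 * page itself, so the caller no longer checks a return value.
	 */
	vm_page_unwire(m, PQ_INACTIVE);
}

static void
unwire_noq_usage_sketch(vm_page_t m)
{
	/*
	 * Drop a wiring without touching queue state.  A true return
	 * value means the last wiring is gone, though other references
	 * (e.g., the object's) may remain.
	 */
	if (vm_page_unwire_noq(m)) {
		/* ... caller decides whether to requeue or free ... */
	}
}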

sendfile_free_page() and vfs_vmio_unwire() are replaced with a generic
vm_page_release(), which releases the page to the page cache.
vm_page_try_to_free() is removed.

vm_page_remove() and vm_page_free_prep() no longer assert the page lock.
The object lock is required to remove a page from its object. We still
dequeue the page in vm_page_free_prep(), but we leverage the fact that
no other threads hold a reference in order to update queue state without
the page lock. vm_page_rename() still internally uses the page lock to
atomically move a page between objects. In other words, only the object
lock is required to transition m->object between an object and NULL, and
the page lock is required to transition m->object between two objects.

Page queue scans now require a transient reference to ensure that the
page daemon does not update the page's logical queue state (i.e.,
aflags) while a concurrent vm_page_free_prep() is dequeuing the page.
The transient reference is dropped once the object lock is acquired.

Test Plan

I have started some benchmarking, testing kernel builds on tmpfs and
single-threaded sendfile performance. So far the results show no difference
or a small improvement on a 16-core Haswell system.

Drew has tested the patch on a couple of systems at Netflix. In their
workload, the vm page locks are the most highly contended locks in
the kernel; with this patch they effectively disappear from lockstat
profiles and we see a 1-2% CPU usage decrease.

Peter completed a stress2 run with the patch.

I added an mlock test to will-it-scale. It shows a marginal (~1-2%) improvement
in throughput in the single-threaded case on a Xeon E5-2630v3, and a much larger
improvement with 32 copies of the test running in parallel. The test maps 64KB
of anonymous memory, and simply mlock()s and munlock()s the range in a loop.
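
The test body is roughly the following standalone approximation (not the
exact will-it-scale harness, which drives a per-process testcase() function
and aggregates the iteration counts reported below):

#include <sys/mman.h>

#include <stdlib.h>

#define	MAPSIZE	(64 * 1024)

int
main(void)
{
	char *p;
	unsigned long iters;

	p = mmap(NULL, MAPSIZE, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		exit(1);
	/* iters stands in for the per-process count the harness reports. */
	for (iters = 0;; iters++) {
		if (mlock(p, MAPSIZE) != 0 || munlock(p, MAPSIZE) != 0)
			exit(1);
	}
}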

With the patch, n=1:
min:200229 max:200229 total:200229
min:200152 max:200152 total:200152
min:200142 max:200142 total:200142
min:200169 max:200169 total:200169                                             
min:200146 max:200146 total:200146
min:200180 max:200180 total:200180
min:200155 max:200155 total:200155                                                                                                                            
min:200159 max:200159 total:200159                                             
min:200146 max:200146 total:200146
min:200177 max:200177 total:200177

Without the patch, n=1:
min:197246 max:197246 total:197246                                             
min:197266 max:197266 total:197266     
min:197253 max:197253 total:197253                                             
min:197255 max:197255 total:197255 
min:197260 max:197260 total:197260                                             
min:197242 max:197242 total:197242
min:197255 max:197255 total:197255
min:197249 max:197249 total:197249
min:197269 max:197269 total:197269
min:197255 max:197255 total:197255

With the patch, n=32:
min:89635 max:105000 total:3215957
min:89420 max:104891 total:3211078
min:89574 max:105039 total:3215716
min:89722 max:105012 total:3215434                                             
min:89747 max:105006 total:3215072                                             
min:89758 max:104994 total:3215708
min:89718 max:104947 total:3212421 
min:89702 max:104896 total:3211679                                                                                                                            
min:89613 max:104898 total:3213014                                             
min:89634 max:104876 total:3212128

Without the patch, n=32:
min:48451 max:104544 total:2565376                                             
min:48017 max:104470 total:2564240
min:47383 max:103517 total:2553197
min:47077 max:103944 total:2550691
min:47776 max:103415 total:2556992                                             
min:47719 max:104336 total:2554948
min:47531 max:104168 total:2557383                                             
min:48232 max:104220 total:2562169
min:47913 max:104223 total:2558416
min:48147 max:104188 total:2555988


Event Timeline

There are a very large number of changes, so older changes are hidden.
markj edited reviewers, added: jeff, alc, kib; removed: manu.
markj edited subscribers, added: gallatin; removed: jhibbits, bdragon, delphij and 3 others.

Fix a race hit by gallatin.

Suppose a thread frees a managed page without having dequeued it first, for
example when sendfile(SF_NOCACHE) completes and the page is unmapped and has
no wirings. The page daemon may have acquired a transient reference
to the page, in preparation for scanning. Freeing the page will remove it
from its object; the page daemon will detect this and skip the page. However,
we cannot free the page to vm_phys until all references are gone.

Fix the problem by changing vm_page_free_prep() to check the return value
of vm_page_remove(), and abort if the page still has references. Callers
are still required to ensure that there are no wirings of the page.
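
A sketch of the fix, assuming vm_page_remove() now reports whether it
dropped the final reference (the rest of vm_page_free_prep() is omitted):

static bool
free_prep_sketch(vm_page_t m)
{
	/*
	 * Disassociate the page from its object.  If other references
	 * remain (e.g., the page daemon's transient reference), abort;
	 * the last holder becomes responsible for freeing the page.
	 */
	if (!vm_page_remove(m))
		return (false);

	/* ... continue tearing down the page and free it to vm_phys ... */
	return (true);
}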

sys/vm/vm_page.c
3610–3628

I think it should be possible to check whether you are dropping the final reference and use that to only pick up the lock when necessary. For certain kinds of operations this may be common.

sys/vm/vm_page.c
3610–3628

Yes. I will do that in a separate diff since this one is already too big, and that proposal changes the KPI, which currently requires the caller to do the locking.

Rebase and fix some problems found by pho.

  • Rebase.
  • Fix bugs found by Peter.
sys/vm/vm_fault.c
493–503

vm_page_xunbusy_maybelocked() acquires the page lock if it is not already held. You would have to change the held page lock to match this page before calling it.

sys/vm/vm_fault.c
493–503

The condition (fault_flags & VM_FAULT_WIRE) will have the same value throughout the loop. So either we update m_mtx for each page, in which case vm_page_xunbusy_maybelocked() will use the correct page lock, or m_mtx will be NULL, in which case vm_page_xunbusy_maybelocked() will acquire the page lock.

Rebase on top of D21257 and perform some cleanups.

  • Get rid of the transient page daemon reference. With atomic updates to the queue state, we no longer need it. This also gives a nice invariant: if a page's object is write locked, and the page is unbusied and unmapped, new wirings of the page cannot be created. This invariant can be used in, e.g., vm_object_terminate_pages() to release the object reference with a plain store instead of an atomic operation.
  • Move page locking into vm_page_unwire(). We do not need the page lock if the page has more than one wiring, and some future work aims to remove the need for the page lock to update queue state in the first place. (A sketch of this fast path follows the list.)
  • In vm_page_wire() and vm_page_unwire_noq(), only use atomics if the page is managed. For unmanaged pages the caller must already guarantee synchronization externally; for example, for pmap pages, this is done using the pmap lock.
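
A sketch of that vm_page_unwire() fast path (VPRC_WIRE_COUNT() is an assumed
helper extracting the wiring count from ref_count; the requeue/free handling
in the slow path is elided):

static void
unwire_fastpath_sketch(vm_page_t m)
{
	u_int old;

	old = m->ref_count;
	do {
		if (VPRC_WIRE_COUNT(old) == 1)
			break;		/* possibly the last wiring */
	} while (!atomic_fcmpset_int(&m->ref_count, &old, old - 1));

	if (VPRC_WIRE_COUNT(old) > 1)
		return;			/* fast path: no page lock taken */

	/* Slow path: take the page lock to update queue state. */
	vm_page_lock(m);
	if (vm_page_unwire_noq(m)) {
		/* ... requeue or free the page as appropriate ... */
	}
	vm_page_unlock(m);
}
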
sys/vm/vm_page.c
1348

I assume the comment is not intentional?

3936

We would expect a failure of this mechanism to produce a page with additional references after the op. Can you safely assert that this hasn't happened?

sys/vm/vm_pageout.c
815

Can you describe why the atomic load is necessary?

markj marked an inline comment as done.

Address feedback.

sys/vm/vm_page.c
3936

Yes, I can strengthen the assertion below.

sys/vm/vm_pageout.c
815

I want to make it explicit to both readers and compilers that the value of m->object is not stable. Without an atomic_load here, the compiler is permitted to replace subsequent references to object with m->object, but that is not safe.
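
For illustration, the pattern under discussion looks roughly like this
(atomic_load_ptr() is used as a stand-in for whatever primitive the patch
actually uses, and the re-validation step is only a plausible consumer):

static bool
scan_object_sketch(vm_page_t m)
{
	vm_object_t object;

	/*
	 * Snapshot m->object exactly once.  Without an atomic load the
	 * compiler may legally re-read m->object below, yet the field
	 * can change concurrently because the object lock is not held.
	 */
	object = (vm_object_t)atomic_load_ptr(&m->object);
	if (object == NULL)
		return (false);		/* page was removed or freed */

	VM_OBJECT_WLOCK(object);
	if (m->object != object) {
		/* The page moved or was freed after the snapshot. */
		VM_OBJECT_WUNLOCK(object);
		return (false);
	}
	/* ... safe to examine the page under the object lock ... */
	VM_OBJECT_WUNLOCK(object);
	return (true);
}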

sys/vm/vm_pageout.c
815

Add a comment? Perhaps a wrapper.

This revision is now accepted and ready to land. Aug 19 2019, 1:03 AM

Update expository comments at the beginning of vm_page.h.

Allow vm_page_try_remove_all() and vm_page_try_remove_write() to be called
with the object read lock held. In this case, new wirings may be acquired
while mappings are being torn down, but no new wired mappings may be created.
This is for use in vm_swapout_object_deactivate_pages(), which tries to
remove mappings before deactivating a page while holding the object read-locked.

This revision now requires review to proceed. Aug 27 2019, 3:46 AM

Remove some more now-unnecessary page locking around calls to
vm_page_free().

sys/amd64/amd64/pmap.c
2940

Is vm_page_pa_tryrelock() still used?

sys/vm/vm_page.c
1372

My _feel_ is that this op should be atomic.

1488–1490

Can you assert that VPRC_BLOCKED is not returned?

sys/vm/vm_page.h
206–209

Anonymous unions are not in C99; they appeared in C11.

sys/amd64/amd64/pmap.c
2940

Yes, in pmap_mincore(). I will clean that up in a follow-up; that usage is not important now that mincore() only reports on the current mapping by default.

sys/vm/vm_page.c
1372

The most common case is page allocation, where it is preferable to avoid atomic ops (and they are unnecessary there). In other cases we are inserting an unmanaged page, and among existing callers I did not see any possible races.

It would be straightforward to add a bool parameter to vm_page_insert_after() to differentiate between these cases, though.

1488–1490

Yes, I will add it.

sys/vm/vm_page.h
206–209

Hmm, we use them in other core pieces of the kernel, struct mbuf for example, so I thought they were OK. I plan to remove this one in a follow-up diff in any case.

share/man/man9/vm_page_wire.9
54

"function" -> "functions"

"... the page, which prevents it ...

sys/amd64/amd64/pmap.c
2930

"pa" is now an initialized but otherwise unused variable.

This revision was not accepted when it landed; it landed in state Needs Review. Sep 13 2019, 1:18 PM
This revision was automatically updated to reflect the committed changes.