
Modify vm_page_wire() to not dequeue the specified page
Closed, Public

Authored by markj on Aug 10 2017, 1:22 AM.

Details

Summary

Instead, let the page daemon lazily dequeue any wired pages that it
encounters. vm_page_unwire() is modified to requeue the page, preserving
LRU. This improves scalability quite a bit in some benchmarks, for example
when the workload involves the buffer cache and the working set size
exceeds the configured buffer limit. In this case, multiple threads will
be wiring pages into the buffer cache while the buf daemon evicts them.
Without this change, all of these threads contend heavily on the global
inactive queue lock; with the change, only the buf daemon needs to
interact with the inactive queue.

The change has the potential drawback of causing the page daemon to do
more work, since it may encounter many wired pages during an
inactive queue scan. However, I think this is still the right tradeoff
to make from a scalability perspective, and the page daemon can check
wire_count before attempting to acquire the VM object lock.
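
As a rough sketch (illustrative only, not the actual diff), the scan-side handling could take the following shape, with the page daemon lazily dequeuing wired pages it encounters:

/*
 * Illustrative sketch only: during the inactive queue scan, a lazily
 * dequeued wired page is removed from the queue and skipped before
 * the scan pays for the VM object lock.
 */
if (m->wire_count != 0) {
    vm_page_dequeue_locked(m);
    continue;
}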

I think this patch is still pretty rough and not committable as-is. I
haven't yet implemented Alan's suggestion of avoiding the requeue when
the page belongs to PQ_ACTIVE.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden.
sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

The vnode lock is held, and at least for typical filesystems like UFS it is held in exclusive mode, so the vn_io_fault() code cannot run in parallel.

For ZFS and tmpfs this is not true: ZFS takes only a shared vnode lock, while tmpfs pages belong to a swap object. So I/O really can occur in parallel, but I think this is mostly harmless, although indeed not too pretty. It might cause partial or contradictory content to persist if the system crashes, but I suspect that ZFS provides some internal synchronization there (I do not know for sure).

sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

I'm skeptical that the hold_count tests serve any useful purpose. Moreover, if we want to say, "hands off this page because it's undergoing I/O," we have other ways of saying that. It shouldn't be the role of hold_count.

  • Reference pages in PQ_ACTIVE upon unwiring them.
  • Remove wire_count checks from the page-clustering code.
  • Simplify the wire_count checking in vm_pageout_launder().
  • Hold off on adding a large comment to vm_page.h for now.
sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

It looks like the hold_count tests were added when the page holding mechanism was introduced in r1549. I believe that the tests are not needed for correctness; my guess is that we wanted to avoid writing "in-flux" dirty pages that are perhaps likely to be re-dirtied again soon anyway. That makes sense as a rationale for testing hold_count in the PQ_LAUNDRY scan, but not so much in the clustering code, I think.

sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

I've had a chance to think some more about this.

Consider kern_physio.c calling vmapbuf(); in other words, places where we hold a mapped page, create a temporary mapping to that page in the kernel address space (or simply use the direct map), and then modify the page through the temporary mapping.

Now, suppose that that page is dirty and we are about to launder it. The holder's temporary mapping bypasses the write protection that is applied to the page by vm_pageout_flush().

So, if we are not paying attention to the hold count in the laundry code, how do we know whether the holder's write through the temporary mapping occurred before or after the page was written to disk? Skipping held pages in the laundry was probably, once upon a time, the simplest approach to this potential problem.

Rhetorical question: Why did I say, "once upon a time"? Well, before the introduction of vm_page_lock(), the clustering code held the page queue lock, which by the way was also required to hold a page. So, I think that there was actually some ordering guarantee between these operations provided by the page queue lock. However, now, I don't see anything blocking a holder from jumping in right after the clustering code completed. Moreover, I'm not convinced that all holders are checking for, for example, the busy state of a page before writing to it.

To be clear, my initial statement that the hold count served no purpose is clearly contradicted by what I wrote above. :-) However, that doesn't mean that we should keep them. As I said in the previous paragraph, I don't see how we could eliminate the potential races. I think that we need to audit the holders and take appropriate steps in the holders to synchronize with possible laundering.

sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

I started performing such an audit and quickly discovered r253939 and r253953. I'll admit that I don't really understand the reasoning for the revert: 1) the vget() call in vm_fault_hold() uses LK_NOWAIT to avoid sleeping on the vnode lock with the fault page busied, presumably because vm_object_clean(), for example, may sleep on a busy page with the object's vnode lock held; 2) vm_fault_hold() will sleep forever if it finds that the page is resident and already busied. What am I misunderstanding?

sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

The mentioned revision replaced hold with busy.

The busy state is de facto a sleepable lock. This lock comes after the vnode lock in the global lock order. Besides the mentioned vm_fault_hold() and vm_object_page_clean(), this order is followed by the buffer cache and by miscellaneous VOP_GETPAGES() implementations in filesystems.

For one example where the referenced change would break things, consider exec: there, exec_map_first_page() busies the first executable page while the binary's vnode is locked. But then the busy state is converted to a hold, because the binary's vnode lock is dropped several times during the execve(2) flow. Keeping the page busy would violate the lock order when relocking, causing deadlocks.

Another example is vfs_vnops.c, where the deadlock avoidance code holds pages, then unlocks the vnode to access userspace in uiomove(). Before vn_io_fault(), userspace was accessed while the vnode and buffer locks were held, causing a well-known deadlock (I call it the UPS deadlock after Stephan Uphoff, who provided code to reliably reproduce the bug; see tools/test/upsdl).

Yet another example is tmpfs I/O, where again, if we busied the tmpfs node object's pages around uiomove() in uiomove_object_page(), we would risk faulting on them, and deadlocking, if userspace asked for I/O to or from a mapping of the tmpfs node.

I believe that similar examples existed for GEM GTT or CPU mappings in the current drm2 code.

There are probably more cases, but I think that the list is good enough. It would work to replace hold with wire, but this was not done because hold is much cheaper. Replacing hold with busy, however, would break things: hold and wire are non-sleeping, while busying is sleepable.

sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

Thanks for the explanation. Reading through the vm_page_hold() callers in the tree, we seem to have two cases: 1) vm_page_hold() is called while the object is locked and after we guaranteed that the page is unbusied by anyone else, and 2) pmap_extract_and_hold(), where the page is found in a pmap. The laundry thread already synchronizes properly with case 1 since it holds the object lock throughout vm_pageout_cluster() and busies the pages before releasing the lock. There's nothing synchronizing a concurrent vm_pageout_cluster() and pmap_extract_and_hold(VM_PROT_WRITE), I think, since the laundry thread calls pmap_remove_write() after dropping the page lock. As Alan noted, before r207410 we handled case 2 properly.

I'm wondering if we can handle synchronization of case 2 with laundering by simply moving the pmap_remove_write() call in vm_pageout_flush() to under the page lock in vm_pageout_cluster(). I note that the other caller of vm_pageout_flush(), vm_object_page_collect_flush(), already removes write mappings.
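
In rough form, the suggestion is something like this (a sketch, not the actual change):

/*
 * Sketch: revoke write access while the page lock is still held in
 * vm_pageout_cluster(), instead of later in vm_pageout_flush(), so
 * that pmap_extract_and_hold(..., VM_PROT_WRITE), which runs under
 * the page lock, cannot create a writeable hold concurrently.
 */
vm_page_assert_locked(m);
pmap_remove_write(m);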

sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

Yes, indeed this may work. The interaction with the vm_fault_soft_fast() case should be similar.

sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

vm_fault_soft_fast() checks the busy state and then (possibly) holds the page without dropping the object lock at any point in between. Isn't that sufficient?

sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

The fast path only checks for the busy state when the mapping is for write. If it maps for read, then busy is fine; in fact, this behavior is modeled after vm_map_pmap_enter().

But the fact that a write access would cause a page fault, which would wait for busy, should make your change in D12084 correct.

sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

Returning to this discussion, I see some vm_fault_quick_hold_pages() callers that immediately wire and unhold the returned pages (see zbuf_sfbuf_get() for instance). That suggests that we really do need the wire_count checks in vm_pageout_cluster(), since we otherwise won't detect the existence of an unmanaged writeable mapping. Alternatively, perhaps such callers should be modified to avoid unholding the page immediately, and instead unhold at the same time as unwiring.

sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

Why should we care about unmanaged writes? It is the caller's choice; otherwise the caller would busy or otherwise protect the page state.

Also, what do you mean by checking for wired in vm_pageout_cluster()? Do you mean that wired pages should be excluded from the laundry? Then we put user data at risk; e.g., a writeable file that is mapped and happens to be wired must still be msynced periodically.

In the case of bpf, the wired pages should (usually) be normal pages.

sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

I think wiring would count as "otherwise protected the page state," but after this change it might not: vm_pageout_cluster() skips pages not in the laundry queues, and if wiring a page always removes it from its page queue, then we won't attempt to launder it. With this change, the laundry thread skips wired pages in its queue scan, but may still cluster with wired pages.

Right, I meant that I should perhaps restore the wire_count != 0 check next to the hold_count != 0 check in vm_pageout_cluster(). Note that vm_pageout_cluster() is used only by the laundry thread, so I believe your comment about msync doesn't apply. vm_object_page_collect_flush() doesn't check hold_count either.
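
Concretely, the restored test could look something like this (a sketch, not the exact diff):

/*
 * Sketch: in vm_pageout_cluster(), do not cluster with held or wired
 * pages, since a wired page may be writeable through an unmanaged
 * mapping that write-protection cannot revoke.
 */
if (p->hold_count != 0 || p->wire_count != 0)
    break;      /* stop extending the cluster */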

sys/vm/vm_pageout.c
402 ↗(On Diff #31840)

With this change, the laundry thread skips wired pages in its queue scan, but may still cluster with wired pages.

Ah, I understand.

sys/vm/vm_page.c
2826 ↗(On Diff #32074)

The "by yet another map" is archaic, and we should just delete it. Essentially, it is implying that wiring only happens through mappings.

sys/vm/vm_page.c
2710–2711 ↗(On Diff #32074)

Aren't we implicitly changing the behavior here? Specifically, a lazily wired page, i.e., one that hasn't been removed from the active queue, will now have its act_count updated by this function.

sys/vm/vm_page.c
2836 ↗(On Diff #32074)

While we are here, please convert this to the newer form: vm_page_assert_locked(m).

2885 ↗(On Diff #32074)

Stylistically, I don't think that we need a blank line here.

2892 ↗(On Diff #32074)

Repeating myself (and to be clear this isn't a complaint :-)), I don't think that we gain anything (in terms of the replacement policy) by requeueing active pages. Do you?

2899 ↗(On Diff #32074)

"Initialize act_count."

2905 ↗(On Diff #32074)

Again, since we're here, should we go ahead and change the return type to bool?

  • Address review comments; restore wire_count check in vm_pageout_cluster()
markj added inline comments.
sys/vm/vm_page.c
2892 ↗(On Diff #32074)

I don't, no. We just need to set PGA_REFERENCED, I think.
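
For instance (a sketch of the shape, not the committed code):

/*
 * Sketch: when unwiring a page that is still enqueued, record a
 * reference cheaply if it is in PQ_ACTIVE; otherwise requeue it to
 * preserve LRU.
 */
if (m->queue == PQ_ACTIVE)
    vm_page_aflag_set(m, PGA_REFERENCED);
else
    vm_page_requeue(m);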

markj added inline comments.
sys/vm/vm_page.c
2909 ↗(On Diff #32900)

If we call vm_page_unwire(PQ_NONE) on a page in a queue, we leave it in the queue. I can't think of any specific scenarios where that would lead to a problem, though.

2710–2711 ↗(On Diff #32074)

Indeed... it's not obvious to me that the new behaviour is incorrect though. The page daemon will ignore act_count when it visits this page, if it's still wired. Otherwise, vm_page_unwire(PQ_ACTIVE) will call vm_page_activate() anyway, so we're guaranteed to have act_count = ACT_INIT, and if we unwire into another queue, the value of act_count doesn't matter. So unless I'm missing something, the difference in behaviour is inconsequential.

sys/vm/vm_page.c
2909 ↗(On Diff #32900)

I think that we should remove the page from the queue.

To be clear, I've looked at all of the vm_page_unwire(PQ_NONE) calls that we have today, and I don't see a problem with any of them because they all immediately call vm_page_free() or vm_page_deactivate{,_noreuse}().

However, someday I might do something like this:

vm_page_lock(m);
vm_page_unwire(m, PQ_NONE);
vm_page_unlock(m);

expecting afterwards that the page can't be reclaimed by the page daemon because it is either wired or not in a page queue. But, if we lazily leave the page enqueued when the wire count transitions to zero, the page daemon will see no reason not to reclaim the page.

Put another way, not removing the page represents a subtle change in semantics that I would consider surprising.
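
In other words, the final unwire would have the following shape (a sketch under the semantics described above, not the committed code):

/*
 * Sketch: on the final unwire, either dequeue the page (PQ_NONE) so
 * the page daemon cannot reclaim it, or move it to the requested
 * queue.
 */
if (--m->wire_count == 0) {
    if (queue == PQ_NONE)
        vm_page_dequeue(m);
    else
        vm_page_enqueue(queue, m);
}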

  • Make vm_page_unwire(PQ_NONE) dequeue the page.
  • Deindent the vm_page_wire() herald comment to match that of vm_page_unwire().

I ran all of the stress2 tests on amd64 without finding any problems.

Originally, my hope was that this change would only acquire the inactive queue lock once in vfs_vmio_unwire() for requeueing the page and eliminate the acquisition in vfs_vmio_extend() because we lazily wire each page. However, please take a look at vfs_vmio_unwire(). Unfortunately, the way it is written, the page queue lock will now be acquired twice, first during the vm_page_unwire(PQ_NONE) to dequeue the page and again during vm_page_deactivate() to reenqueue the page. However, if we don't dequeue the page during vm_page_unwire(PQ_NONE) (as in the previous version of this change), then vm_page_deactivate() does nothing because it sees the page is already inactive. In other words, the page doesn't get requeued to the tail.

In D11943#258507, @alc wrote:

Originally, my hope was that this change would only acquire the inactive queue lock once in vfs_vmio_unwire() for requeueing the page and eliminate the acquisition in vfs_vmio_extend() because we lazily wire each page. [...]

I think it's rather difficult to fix this completely, that is, eliminate unnecessary page queue locking for all cases in vfs_vmio_unwire(). I think the common case could be addressed with:

if ((bp->b_flags & (B_NOREUSE | B_DIRECT)) == 0 && m->valid != 0)
    queue = PQ_INACTIVE;
else
    queue = PQ_NONE;
if (vm_page_unwire(m, queue)) {

With this, we'll dequeue and enqueue in separate operations for NOREUSE pages, but I don't see how we can address this without adding a separate queue for NOREUSE pages, or introducing something like vm_page_unwire_noreuse() (ugh).

Add vm_page_unwire_noq() and use it in vfs_vmio_unwire() and
sendfile_free_page()

In D11943#258507, @alc wrote:

Originally, my hope was that this change would only acquire the inactive queue lock once in vfs_vmio_unwire() for requeueing the page and eliminate the acquisition in vfs_vmio_extend() because we lazily wire each page. [...]

I tried to address this by adding a new KPI, vm_page_unwire_noq(), which unwires the page without modifying its position in any page queue. This way, the caller has fine-grained control over what happens to the page, though it must be careful to avoid leaks. I modified sendfile_free_page() to use it as well, and restructured it a bit to more closely resemble vfs_vmio_unwire().
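
A minimal sketch of what such a KPI might look like (assumed shape; the actual implementation may differ):

/*
 * Release a wiring without touching the page queues.  Returns true
 * if this was the last wiring, in which case the caller decides what
 * happens to the page (and must avoid leaking it).
 */
bool
vm_page_unwire_noq(vm_page_t m)
{

    vm_page_assert_locked(m);
    KASSERT(m->wire_count != 0, ("vm_page_unwire_noq: page %p is not wired", m));
    if (--m->wire_count != 0)
        return (false);
    atomic_subtract_int(&vm_cnt.v_wire_count, 1);
    return (true);
}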

This is good! I'll give it a spin.

P.S. We could use this new function in the pmap in place of some inline code.

I have a simple test where I load an approximately 16GB sized file into memory and then run a program that performs random accesses of various sizes (and alignments) to the file. Lock contention is not an issue in this test. I am not trying to measure the SMP benefits of the lazy wiring.

The results are as I would hope. There is a small but consistent benefit. (I ran the test multiple times but I'm just presenting one, representative run here.)

Before:

file size: 15569256448
block size: 4096
 read size: 1
  usecs/read: 3.458450
 read size: 4096
  usecs/read: 4.195940
block size: 8192
 read size: 1
  usecs/read: 3.408855
 read size: 8192
  usecs/read: 4.827630
block size: 16384
 read size: 1
  usecs/read: 3.408352
 read size: 16384
  usecs/read: 6.281994
block size: 32768
 read size: 1
  usecs/read: 3.411384
 read size: 32768
  usecs/read: 9.349455
block size: 65536
 read size: 1
  usecs/read: 3.139929
 read size: 65536
  usecs/read: 17.169623
block size: 131072
 read size: 1
  usecs/read: 2.677881
 read size: 131072
  usecs/read: 32.472003
block size: 262144
 read size: 1
  usecs/read: 1.839661
 read size: 262144
  usecs/read: 64.864683

After:

file size: 15569256448
block size: 4096
 read size: 1
  usecs/read: 3.307335
 read size: 4096
  usecs/read: 3.993329
block size: 8192
 read size: 1
  usecs/read: 3.316298
 read size: 8192
  usecs/read: 4.658893
block size: 16384
 read size: 1
  usecs/read: 3.255763
 read size: 16384
  usecs/read: 6.084286
block size: 32768
 read size: 1
  usecs/read: 3.252896
 read size: 32768
  usecs/read: 9.104744
block size: 65536
 read size: 1
  usecs/read: 3.005966
 read size: 65536
  usecs/read: 16.992908
block size: 131072
 read size: 1
  usecs/read: 2.573434
 read size: 131072
  usecs/read: 32.207462
block size: 262144
 read size: 1
  usecs/read: 1.793598
 read size: 262144
  usecs/read: 63.802824

In D11943#269227, @alc wrote:

I have a simple test where I load an approximately 16GB sized file into memory and then run a program that performs random accesses of various sizes (and alignments) to the file. Lock contention is not an issue in this test. I am not trying to measure the SMP benefits of the lazy wiring.

When I wrote the patch originally, I tried running a read-only multi-threaded pgbench against a database that fit in RAM. Without the patch, we see lots of inactive queue lock contention as postgres processes wire pages into the buffer cache. With the patch, the contention is completely gone since the queue manipulations are deferred to the bufdaemon.

Another relevant scenario, which I did not test, is one where a thread is frequently pulling data (e.g., TCP stats) out of the kernel via sysctl. Many sysctl handlers will use sysctl_wire_old_buffer() to ensure that they can copy out while holding a non-sleepable lock. With this change, the sysctl handler will likely be able to avoid page queue locks entirely, since the output buffer is probably in the active queue and we can record the reference by setting PGA_REFERENCED. It's my understanding that Netflix is using a patch where sysctl_wire_old_buffer() holds the pages rather than wiring them, specifically to avoid the queue lock.

Does it have a measurable effect on TPS?

In D11943#269242, @alc wrote:

Does it have a measurable effect on TPS?

It does. I meant to note in my previous comment that the benchmark was read-only. Without the patch, I get the following stats for 180s trials (throwing the first couple away):

number of transactions actually processed: 20701513
latency average = 0.139 ms
tps = 115012.149920 (excluding connections establishing)
number of transactions actually processed: 20716412
latency average = 0.139 ms
tps = 115094.924957 (excluding connections establishing)
number of transactions actually processed: 20725505
latency average = 0.139 ms
tps = 115145.427075 (excluding connections establishing)
number of transactions actually processed: 20714370
latency average = 0.139 ms
tps = 115087.431262 (excluding connections establishing)
number of transactions actually processed: 20711754
latency average = 0.139 ms
tps = 115069.332940 (excluding connections establishing)

With the patch:

number of transactions actually processed: 20795454
latency average = 0.138 ms
tps = 115538.475430 (excluding connections establishing)
number of transactions actually processed: 20804857
latency average = 0.138 ms
tps = 115586.838077 (excluding connections establishing)
number of transactions actually processed: 20801735
latency average = 0.138 ms
tps = 115569.147712 (excluding connections establishing)
number of transactions actually processed: 20809457
latency average = 0.138 ms
tps = 115612.433323 (excluding connections establishing)
number of transactions actually processed: 20822891
latency average = 0.138 ms
tps = 115686.929423 (excluding connections establishing)

sys/kern/kern_sendfile.c
162 ↗(On Diff #34561)

This would appear to deactivate an active page, and not just enqueue a page that is not in any queue. If the page is referenced, it will be reactivated, so I don't think that there is any real harm to the replacement strategy's ordering, just pointless shuffling of the page between queues.

sys/kern/vfs_bio.c
2635 ↗(On Diff #34561)

Ditto.

Avoid deactivating active pages.

sys/kern/kern_sendfile.c
162 ↗(On Diff #34561)

I think we want to deactivate pages in the laundry as well, so in the latest revision I just avoid deactivating active pages.

In D11943#269212, @alc wrote:

P.S. We could use this new function in the pmap in place of some inline code.

(FWIW, I'm planning to do the pmap changes in a separate change, assuming we go ahead with this approach.)

Skip pages that were wired while the object lock was dropped.

Before, the vm_page_in_laundry() check above would have caught this.

None of the tests have changed my mind. I still think that this change is a move in the right direction.

sys/vm/vm_page.c
3040 ↗(On Diff #35932)

I have done something with the buffer cache recently which just sets a flag when something needs to be re-queued. This could be useful here to avoid locking the pagequeue for something that is very frequently referenced. It simply implements a second chance algorithm when the flag is seen.

sys/vm/vm_pageout.c
398 ↗(On Diff #35932)

If we really must repeat this check so many times, I would prefer that it were an inline function with a comment. Alternatively, does wire_count > 0 imply a hold count? Should we bump hold when we bump wired?

sys/vm/vm_page.c
3040 ↗(On Diff #35932)

Right now, a scan of PQ_INACTIVE or PQ_LAUNDRY moves referenced pages to PQ_ACTIVE. Are you suggesting adding a second type of reference flag that simply means "requeue"?

sys/vm/vm_pageout.c
398 ↗(On Diff #35932)

I can't see any reason we can't bump the hold count with the wire count.

sys/vm/vm_page.c
3040 ↗(On Diff #35932)

Yes, that's right. In the buf case I set a flag that is protected by the buf lock. buf_recycle() will move the buf to the tail and clear the flag if it is set. It is not perfect LRU, it is second chance, but it eliminates a ton of lock activity for things like indirect blocks, which are accessed very frequently.

The active queue already does basically the same thing when it discovers page activity. It clears ACTIVE and requeues the page. The same flag could be used when we hold and release the page. Right now hold doesn't move the page in the LRU like wired does. If we had such a flag it could do so without touching the pq locks.
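
A sketch of how that might look for pages, using a hypothetical PGA_REQUEUE aflag (the name is invented here for illustration):

/* Recording the reference is cheap: aflags need no page queue lock. */
vm_page_aflag_set(m, PGA_REQUEUE);

/* In the queue scan, a flagged page gets a second chance: */
if ((m->aflags & PGA_REQUEUE) != 0) {
    vm_page_aflag_clear(m, PGA_REQUEUE);
    vm_page_requeue_locked(m);
    continue;
}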

sys/vm/vm_page.c
3040 ↗(On Diff #35932)

I think it would be reasonable to set such a flag in all of vm_page_unwire(), vfs_vmio_unwire(), and sendfile_free_page(). The only other vm_page_requeue() calls in the tree are in vm_swapout_object_deactivate_pages(), and those apply only to pages in PQ_ACTIVE. (I'm not sure that the requeue is really crucial in that case either.)

I'll work on adding such a flag, but that belongs in a separate diff.

sys/vm/vm_pageout.c
398 ↗(On Diff #35932)

One possible issue comes from the fact that hold_count is only 16 bits wide. I don't think it would be hard to construct a scenario that triggers an overflow if we bump hold_count along with wire_count, so for now I'd like to hold off on making that change.

I'd like to simplify the situation we have with hold/wire/busy mechanisms. This change makes holds and wires more similar than they were, so it seems like a step in the right direction. I'll also write a block comment for vm_page.h that describes these mechanisms and the relationships between them.

  • Add a block comment describing hold/wire/busy.

Remove changes that I didn't mean to upload in this diff.

Remove more unintended changes. Sorry for the noise.

sys/vm/vm_pageout.c
398 ↗(On Diff #35932)

I added a block comment.

I'm ok with adding an inline for testing hold_count > 0 || wire_count > 0, but am not sure what it should be called. vm_page_is_referenced() isn't great since it's unrelated to vm_page_reference().

sys/vm/vm_page.h
116 ↗(On Diff #38952)

I am not sure I follow the distinction between page structure and page content. Either of hold/wire or busy prevents page reclaim.

Also, I believe that the description makes things more confusing than clear by mixing hold/wire and busy. I agree that hold and wire are very similar, especially after the patch in this review, while the busy mechanism's purpose is quite different.

122 ↗(On Diff #38952)

I believe you mean pmap_extract_and_hold() and vm_fault_quick_hold_pages(). The ordering between waiting for busy and the vnode lock there is only a symptom; really, we do not want to block other page accesses while a driver or filesystem holds the user page as the source of data I/O.

sys/vm/vm_page.h
146 ↗(On Diff #38953)

It is useful to note that the page identity might change during the sleep, so the waiter needs to re-lookup the page in the object after the wait.

sys/vm/vm_page.h
116 ↗(On Diff #38952)

I just wanted to avoid confusing the counters with the page's reference bits. I'll reword this.

I described the busy lock in a separate paragraph. I can remove the reference to it here, but I had kept it because of the note that hold_count is used in place of the busy lock in some areas.

  • Adjust the comment.
  • Add vm_page_held().
markj marked 3 inline comments as done. Edited Feb 6 2018, 7:40 PM
  • Add vm_page_held().

This function seems a bit misnamed, since it counts wires as well. However, in a separate change I'd like to acquire/release a hold on the page when the wire count transitions between 0 and 1. Then, vm_page_held() just needs to check hold_count.
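
For reference, a plausible form of the helper (assumed, not copied from the diff):

static inline bool
vm_page_held(vm_page_t m)
{

    return (m->hold_count > 0 || m->wire_count > 0);
}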

sys/vm/vm_page.h
115 ↗(On Diff #38962)

Maybe: 'contains two counters, each of which prevents page reuse.' Then you avoid the word 'reference' and do not need to explain that this is not PGA_REFERENCED or m->active.

117 ↗(On Diff #38962)

Again, remove 'reference' there.

  • Fix the pool field annotation.
  • Avoid using "reference" when describing the hold and wire counters.
markj marked an inline comment as done. Feb 6 2018, 8:20 PM
markj added inline comments.
sys/vm/vm_page.h
117 ↗(On Diff #38962)

I removed two uses of "reference," but I think the remaining usage is clear from context.

jeff added inline comments.
sys/vm/vm_page.h
134–135 ↗(On Diff #38968)

I believe it also protects identity.

That is, the (object, pindex) tuple that defines the page cannot change while it is busy unless the busy holder changes it.

This revision is now accepted and ready to land. Feb 7 2018, 3:41 AM
This revision was automatically updated to reflect the committed changes.