Page MenuHomeFreeBSD

rework handling of POSIX_FADV_{DONTNEED,NOREUSE}
ClosedPublic

Authored by markj on Sep 23 2015, 10:18 PM.
Tags
None
Referenced Files
Unknown Object (File)
Oct 23 2024, 3:57 PM
Unknown Object (File)
Sep 20 2024, 1:32 PM
Unknown Object (File)
Sep 18 2024, 3:18 AM
Unknown Object (File)
Sep 9 2024, 11:25 AM
Unknown Object (File)
Sep 8 2024, 11:27 PM
Unknown Object (File)
Sep 8 2024, 8:32 AM
Unknown Object (File)
Sep 8 2024, 2:50 AM
Unknown Object (File)
Sep 8 2024, 12:42 AM
Subscribers

Details

Summary

As a step towards the removal of PG_CACHED pages, we'd like to remove
the call to vm_object_page_cache() in vop_stdadvise(). The first
iteration of this change effectively replaced the call with one to
vm_page_advise(MADV_DONTNEED) for each resident page in the specified
range. However, this interacts badly with the current implementation of
POSIX_FADV_DONTNEED: as a thread reads or writes a file sequentially,
vn_read() and vn_write() coalesce the consecutive file ranges so that
the range passed to VOP_ADVISE grows to cover the entire file. Unlike
page caching, page deactivation does not remove the page from the
object, so each call to VOP_ADVISE ends up duplicating the work of the
previous call.

This change modifies vop_stdadvise() so that it deactivates pages
without running into this problem. In particular, it

  • removes the range coalescing code from vn_read() and vn_write(), and
  • introduces a new buf flag, B_NOREUSE, which is a hint to vfs_vmio_release() that indicates that LRU operation may be bypassed when deactivating pages.

This change adds a new VM function, vm_page_deactivate_noreuse(), to
perform B_NOREUSE deactivations in vfs_vmio_release(). This is done
instead of calling vm_page_advise(MADV_DONTNEED) since the latter has
some mildly undesirable side effects here: it'll clear the
PGA_REFERENCED bit on the page, which could have been caused by a
userspace mapping of the file, and it needlessly checks whether the page
is dirty.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 615
Build 615: arc lint + arc unit

Event Timeline

markj retitled this revision from to rework handling of POSIX_FADV_{DONTNEED,NOREUSE}.
markj edited the test plan for this revision. (Show Details)
markj updated this object.

Let's set aside the POSIX_FADV_NOREUSE case that lead us to this approach and consider a single POSIX_FADV_DONTNEED call on a file. The vm object might have pages in the range targeted by the POSIX_FADV_DONTNEED that don't have associated buffers. For example, those pages may have been read and kicked out of the buffer map at some point in the past. For those pages, we still need to perform something like vm_page_deactivate_noreuse() on each of those pages.

sys/kern/vfs_bio.c
59

Is this #include still needed?

In D3726#76901, @alc wrote:

Let's set aside the POSIX_FADV_NOREUSE case that lead us to this approach and consider a single POSIX_FADV_DONTNEED call on a file. The vm object might have pages in the range targeted by the POSIX_FADV_DONTNEED that don't have associated buffers. For example, those pages may have been read and kicked out of the buffer map at some point in the past. For those pages, we still need to perform something like vm_page_deactivate_noreuse() on each of those pages.

Oops, right. :(
These requirements seem to suggest that we really need to iterate over both the buffers in the specified range (or the entire object, as the vinalbuf() call did), and the resident pages in the object. This implies that we still need a vm_object_madvise(MADV_DONTNEED) call, though, which brings back the issue of deactivating pages via vm_page_advise() (which clears the PGA_REFERENCED bit). One possibility is to repurpose vm_object_page_cache() and have it call vm_page_deactivate_noreuse() instead.

I'm inclined to think that vop_stdadvise() should attempt to deactivate resident pages before iterating over any buffer cache entries: otherwise, the brelse() call will deactivate buffer pages, and the subsequent vm_object_madvise() call will pointlessly requeue previously-deactivated pages. In the other order, vm_object_madvise() will have no effect on pages in the buffer cache since they are wired, and _vm_page_deactivate() will thus avoid acquiring the inactive page queue lock unnecessarily.

In D3726#76981, @markj wrote:
In D3726#76901, @alc wrote:

Let's set aside the POSIX_FADV_NOREUSE case that lead us to this approach and consider a single POSIX_FADV_DONTNEED call on a file. The vm object might have pages in the range targeted by the POSIX_FADV_DONTNEED that don't have associated buffers. For example, those pages may have been read and kicked out of the buffer map at some point in the past. For those pages, we still need to perform something like vm_page_deactivate_noreuse() on each of those pages.

Oops, right. :(
These requirements seem to suggest that we really need to iterate over both the buffers in the specified range (or the entire object, as the vinalbuf() call did), and the resident pages in the object. This implies that we still need a vm_object_madvise(MADV_DONTNEED) call, though, which brings back the issue of deactivating pages via vm_page_advise() (which clears the PGA_REFERENCED bit). One possibility is to repurpose vm_object_page_cache() and have it call vm_page_deactivate_noreuse() instead.

For now, I'll endorse the plan to convert vm_object_page_cache() into vm_object_page_noreuse(). We can revisit the question of whether vm_object_page_noreuse() can be replaced by vm_object_madvise() after this change is finalized and committed.

I'm inclined to think that vop_stdadvise() should attempt to deactivate resident pages before iterating over any buffer cache entries: otherwise, the brelse() call will deactivate buffer pages, and the subsequent vm_object_madvise() call will pointlessly requeue previously-deactivated pages. In the other order, vm_object_madvise() will have no effect on pages in the buffer cache since they are wired, and _vm_page_deactivate() will thus avoid acquiring the inactive page queue lock unnecessarily.

I agree.

sys/kern/vfs_vnops.c
768–769

A different variable name, like "orig_offset", might better convey how this variable is used.

markj edited edge metadata.

Updates based on testing and review:

  • Restore vm_object_page_cache() and repurpose it for LRU-defeating page deactivation: vop_stdadvise() should still requeue any resident pages that are not in the buffer cache.
  • Rename offset to orig_offset in vn_{read,write}().
  • Only set B_NOREUSE if we have a VMIO buffer. This is guaranteed by all current callers, but isn't a requirement of the interface.
  • Release B_NOREUSE buffers in vfs_vmio_iodone(). Otherwise, the now-clean buffers will remain in the buffer cache until they are evicted by some other mechanism.

Patch needs remerging after the Jeff' commit. Could you, please, in the course of update also commit trivial but useful changes to reduce unrelated noise ? E.g., ANSI convertions for vn_read/vn_write and space truncation in the comment from vm_object.c are good candidates.

sys/kern/vfs_bio.c
2062

It is not the best place to put the convertion of B_NOREUSE to B_RELBUF there. vfs_vmio_iodone() manages pages, it is unexpected to see it changing the buffer flags not related to the current pages state.

Why not do this in brelse and bqrelse ?

markj edited edge metadata.

Updates:

  • Rebase after r288299.
  • Remove cosmetic changes from the diff.
  • Revert the change to vfs_vmio_iodone() - now, brelse() sets B_RELBUF on a clean buffer with B_NOREUSE set. Also modify bufdone_finish() to call brelse() on buffers with B_NOREUSE set, rather than calling bqrelse().
  • Clear B_NOREUSE in brelse().
  • Unconditionally set B_RELBUF in vop_stdadvise(). If the buffer is dirty, brelse() will clear B_RELBUF.
sys/kern/vfs_bio.c
2064

Sorry, I missed the last sentence and modified bufdone_finish() instead of bqrelse(), but bqrelse() seems like the more correct place to handle B_NOREUSE. I'll upload a new revision shortly.

Have bqrelse() release clean B_NOREUSE buffers.

Please do not leave out the changes you removed from the patch, go ahead and commit that part.

sys/kern/vfs_default.c
1107

Wouldn't it save some iterations in the object loop if buffer loop is done before object ? The wired pages owned by the buffer cache are still on the object queue, so you have to iterate over them and ignore above, as your comment notes.

sys/kern/vfs_default.c
1107

No. The pages aren't being cached/freed. They are being placed at the head of the inactive queue. So, they remain in the object's memq regardless of whether you perform the buffer cache or vm object operation first. If Mark reorders the operations as you suggest, then the pages that are ejected from the buffer cache to the head of the inactive queue will be removed and reinserted a second time at the head of the inactive queue by vm_object_page_noreuse(), putting further stress on the inactive queue lock. Under Mark's current ordering, we acquire the page lock, observe that the page is wired, and move on to the next page.

Fix a mistake introduced in a previous change: only clear B_NOREUSE once
the buffer has been truncated. We don't want to unconditionally clear
it in brelse(), since that'll clear it for B_DELWRI buffers too.

sys/kern/vfs_bio.c
1878

Wouldn't it be simpler to add B_NOREUSE flag to the set of flags tested above, instead of first converting it into B_RELBUF several lines above, and then clearing the B_NOREUSE (why clearing it, btw ?).

sys/kern/vfs_default.c
1107

I see.

Don't bother setting B_RELBUF when B_NOREUSE is set.

sys/kern/vfs_bio.c
1878–1882

Thanks, I think this is simpler.

As far as I can see, it's necessary to clear it here: otherwise the flag will still be set when the buffer is reused, which isn't the intent. Is there something I'm missing that makes this unnecessary?

sys/kern/vfs_bio.c
1878–1882

I forgot that getnewbuf_reuse_bp() does not clear flags. I think it would be useful to assert that the flag did not leaked then, in _reuse_bp().

Assert that B_NOREUSE is not set in getnewbuf_reuse_bp().

sys/kern/vfs_bio.c
1878–1883

Done in the latest upload. (Just extended an existing assertion a bit.)

kib edited edge metadata.
kib added inline comments.
sys/kern/vfs_bio.c
2489

It makes sense to print the b_flags value in the panic message. Before the change, there was only one reason why the buffer considered invalid B_DELWRI.

This revision is now accepted and ready to land.Sep 30 2015, 6:09 AM
alc edited edge metadata.
This revision was automatically updated to reflect the committed changes.