
Add tmpfs page cache read support.
Closed, Public

Authored by kib on Sep 6 2020, 8:44 PM.

Details

Summary

First, convert the page cache read to a VOP. There are several negative side effects of not calling into the VOP layer at all for page cache reads. The biggest is the missed activation of EVFILT_READ knotes. Also, the conditions to do pgcache reads in the vn layer were too rigid; they can be usefully weakened for tmpfs, which would be too clumsy to do with new flags.

Then, the tmpfs regular vnode object lifecycle is significantly different from the normal OBJT_VNODE: it is alive as long as ref_count > 0.

All the arguments convinced me that tmpfs is better served by separate code, not touching vn_read_from_obj().

I kept VIRF_PGREAD flag around, it is still useful for nullfs and for asserts.

Second, add a tmpfs VOP_READ_PGCACHE that takes advantage of all the tmpfs quirks. It is quite cheap, in the code-size sense, to support page-ins for read on tmpfs even if we do not own the tmpfs vnode lock. Also, we can handle holes in the tmpfs node without additional effort, and do not have a limitation on the transfer size.

Ensure liveness of the tmpfs VREG node, and consequently of v_object, inside VOP_READ_PGCACHE by referencing the tmpfs node in tmpfs_open(). Provide a custom tmpfs fo_close() method on the file to ensure that close is paired with open.

Micro-optimize tmpfs node ref/unref by using atomics and avoiding the tmpfs mount and node locks when the ref count is greater than zero, which is the case until the node is destroyed by unlink or unmount.
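A hedged sketch of this ref/unref fast path (hypothetical names; not the actual tmpfs_ref_node() code): the atomic's own return value proves the count was nonzero, so no lock is needed until the final release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the tmpfs node; in the kernel the count
 * would be manipulated through the refcount(9) KPI. */
struct node {
	atomic_uint refcount;
	/* ... node data, normally guarded by mount/node locks ... */
};

/* Fast path: no locks taken while the count is known to be > 0. */
static void
node_ref(struct node *n)
{
	unsigned old;

	old = atomic_fetch_add(&n->refcount, 1);
	assert(old > 0);	/* caller must already hold a reference */
}

/* Returns true when the caller dropped the last reference; only then
 * (unlink/unmount) would the mount and node locks be taken. */
static bool
node_unref(struct node *n)
{
	unsigned old;

	old = atomic_fetch_sub(&n->refcount, 1);
	assert(old > 0);
	return (old == 1);
}
```

The assert-on-return-value pattern is the race-free form suggested later in the review: the value observed by the atomic op itself cannot go stale between a separate load and the increment.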

Do not copy vp into f_data for DTYPE_VNODE files. The pointer to the vnode is already stored in f_vnode, so f_data can be reused. Fix all found users of f_data for DTYPE_VNODE.

Tested by: pho

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

kib requested review of this revision. Sep 6 2020, 8:44 PM

First, a minor note: the tmpfs part fails to plug the new routine into the vop table, so I added it with the following:

diff --git a/sys/fs/tmpfs/tmpfs_vnops.c b/sys/fs/tmpfs/tmpfs_vnops.c
index 7dc0e1b4a870..ac158167fd12 100644
--- a/sys/fs/tmpfs/tmpfs_vnops.c
+++ b/sys/fs/tmpfs/tmpfs_vnops.c
@@ -1766,6 +1766,7 @@ struct vop_vector tmpfs_vnodeop_entries = {
        .vop_stat =                     tmpfs_stat,
        .vop_getattr =                  tmpfs_getattr,
        .vop_setattr =                  tmpfs_setattr,
+       .vop_read_pgcache =             tmpfs_read_pgcache,
        .vop_read =                     tmpfs_read,
        .vop_write =                    tmpfs_write,
        .vop_fsync =                    tmpfs_fsync,

The important part though is that the patch causes a slowdown both single- and multi-threaded. In vm_object, ref_count and paging_in_progress are close enough to each other that, given vm_object alignment, they may or may not randomly share a cacheline. The newly added routine updates them back to back, once with a cmpxchg loop. The shared case causes excessive cacheline ping-pong. However, even with this problem addressed I don't think this is a satisfying solution (see below for a comparison against Linux).
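A minimal, illustrative-only sketch of the alignment point (the struct names and the 64-byte line size are assumptions, not the actual vm_object layout): whether two hot counters ping-pong depends on whether they land in the same cacheline, which explicit per-member alignment can control at the cost of object size.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Two adjacent counters: with no padding they usually share a
 * 64-byte line, so back-to-back atomic updates from different CPUs
 * bounce the line between caches. */
struct obj_shared {
	int ref_count;
	int paging_in_progress;
};

/* Same counters, each forced onto its own 64-byte line: concurrent
 * updates no longer contend on one line. */
struct obj_padded {
	alignas(64) int ref_count;
	alignas(64) int paging_in_progress;
};
```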

When doing a 24-way read from the same vnode and profiling with dtrace -w -n 'profile:::profile-997 { @[arg0] = count(); } tick-5s { system("clear"); trunc(@, 40); printa("%40a %@16d\n", @); clear(@); }':

   kernel`vm_object_pip_add+0x13             8039
kernel`vm_object_pip_wakeup+0x11             8892
kernel`vm_object_deallocate+0x58            14808
     kernel`vm_page_sunbusy+0x15            14896
    kernel`vm_page_trysbusy+0x2d            15161
  kernel`tmpfs_read_pgcache+0x96            22107
  kernel`tmpfs_read_pgcache+0xb6            26864

I tested this on a 24 core vm on a Cascade Lake box provided by cperciva. It can be powered on on demand.

Linux was 5.9.0-rc4; FreeBSD was r365424 plus the patch, with debug.vn_io_pgcache_read_enable toggled (shown below as pgread).

Tests are (will-it-scale):
read2 - Same file read
pread3 - Same file pread to different offsets

-t denotes processes

I don't know if I managed to whack all the differences for a fair single-threaded test, so single-threaded differences should be ignored. They are provided to give an idea of the scaling factor.

test            Linux       FBSD + pgread=0   FBSD + pgread=1
read2  -t 1       3479276           3971812           3654834
read2  -t 24     60119926           6485077           4086881
pread3 -t 1       4660092           4305291           4191886
pread3 -t 24    100346643           6801614           4484083

If you want you can get access to perform your own tests.

I'm not familiar with this code, I suspect this can be made to scale to roughly Linux level (or higher) with the following observations:

  • vm object is type stable
  • there is a routine to perform lockless lookup
  • pages can be busied and validated afterwards; should it turn out the page was reassigned, it can fall back to a regular read
  • vm object teardown can be made to wait for all busy pages to drain

This would reduce work single-threaded and reduce contention points multithreaded (most notably make pread3 scale completely).
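The busy-then-validate idea from the list above can be sketched in a speculative userspace form (none of these names are the real VM interfaces; the real code would use vm_page_trysbusy() and the radix lookup):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative page: the owner pointer may change if the page is
 * reassigned between lookup and busying. */
struct page {
	_Atomic(void *) object;	/* current owner */
	atomic_int busy;	/* 0 == not busy */
};

/*
 * Lockless read attempt: busy the page first, then re-check that it
 * is still the page we wanted.  If it was reassigned in the window,
 * release it and tell the caller to take the locked fallback path.
 */
static bool
try_lockless_lookup(struct page *p, void *want_obj)
{
	int exp = 0;

	if (!atomic_compare_exchange_strong(&p->busy, &exp, 1))
		return (false);		/* already busy: fall back */
	if (atomic_load(&p->object) != want_obj) {
		atomic_store(&p->busy, 0);
		return (false);		/* reassigned: fall back */
	}
	/* Page is busied and validated; safe to copy out, then unbusy. */
	return (true);
}
```

With object teardown made to wait for busy pages to drain, a successful busy-and-validate would be sufficient to read without touching ref_count at all, which is the contention point measured above.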

Remove pip increment. Refcount alone is enough to guarantee swap object liveness.
Add tmpfs vop to the table.

In fact tmpfs_read_pgcache can become tmpfs_read.

As a useful experiment, you could comment out the call to vm_page_activate() in uiomove_object_page().

This still performs worse than the stock kernel:

     kernel`vm_page_sunbusy+0x15              979
kernel`vm_object_deallocate+0x2d             1120
  kernel`copyout_smap_erms+0x186             1975
  kernel`tmpfs_read_pgcache+0x60             3148
    kernel`vm_page_trysbusy+0x49             4959
    kernel`vm_page_trysbusy+0x2d            12040
  kernel`tmpfs_read_pgcache+0x83            19499
kernel`vm_object_deallocate+0x27            21954
kernel`vm_object_deallocate+0x58            22291
  kernel`tmpfs_read_pgcache+0xa1            25232

trysbusy reads from the object and presumably ping-pongs here against refing/unrefing.

Just for testing purposes I removed the ref_count manipulation. It brings scalability into the same ballpark as Linux.

test            Linux       FBSD + pgread=0   FBSD + pgread=1   FBSD + pgread=1 v2   FBSD + pgread=1 no ref
read2  -t 1       3479276           3971812           3654834              3937229                  4049478
read2  -t 24     60119926           6485077           4086881              5037021                 57303186
pread3 -t 1       4660092           4305291           4191886              4273027                  4600783
pread3 -t 24    100346643           6801614           4484083              5914140                 91718291

Is there a fundamental problem not taking the ref here? vm obj reuse could be postponed until after the vnode is freed if that's a concern.

Ref tmpfs VREG node on open, store it into fp->f_data and unref on close.
Switch to atomics and avoid the tmpfs mount/node locks for the usual ref/unref.

I tried to do a pass over misuse of f_data for DTYPE_VNODE, probably missed some.

sys/fs/tmpfs/tmpfs.h
42 ↗(On Diff #76757)

Why is it needed?

sys/fs/tmpfs/tmpfs_subr.c
216 ↗(On Diff #76757)

To assert this without races, you can use the return value from refcount_acquire().

383 ↗(On Diff #76757)

This is asserted by refcount_release().

kib marked 3 inline comments as done.

refcount improvements

I can't test on the original box; the patched routine is how it looked after my own removal of vm obj refcounting, so I presume performance is the same.

I disagree with special-casing of this sort though. I don't know the lifecycle for the vm object, so perhaps there is a lot of badness there.

The vnode is guaranteed to have usecount > 0, and that should be sufficient to provide whatever guarantees are needed in the face of forced unmount (modulo SMR for ->v_data). Making sure this works will likely clean up hacks elsewhere. Is there a fundamental problem assuring this?

A general note is that the current forced-unmount behavior is rather cumbersome, and perhaps we would be way better served with something cooperating with the filesystem. Most notably, it can leave ->v_data in place, making it an invariant that the pointer is *never* NULL as long as you at least hold the vnode.

In D26346#586024, @kib wrote:

I tried to do a pass over misuse of f_data for DTYPE_VNODE, probably missed some.

zfs_file_fsync() perhaps, but I'm not certain.

sys/fs/tmpfs/tmpfs_vnops.c
341 ↗(On Diff #76771)

Why can't this be done in tmpfs_close()?

601 ↗(On Diff #76771)

Perhaps comment that the VIRF_DOOMED check is racy.

621 ↗(On Diff #76771)

Typo, rangelock.

kib marked 3 inline comments as done. Sep 9 2020, 2:59 PM
In D26346#586024, @kib wrote:

I tried to do a pass over misuse of f_data for DTYPE_VNODE, probably missed some.

zfs_file_fsync() perhaps, but I'm not certain.

I will write a coccinelle script to find all instances before I pass this to pho.

sys/fs/tmpfs/tmpfs_vnops.c
341 ↗(On Diff #76771)

tmpfs_close() is a VOP, and it is not paired with VOP_OPEN()s. The main offender is reclamation; for instance, if we have

  1. ten open(2) for read
  2. unmount -f

this translates to

  1. ten VOP_OPEN(FREAD)
  2. single VOP_CLOSE(FNONBLOCK)
  3. VOP_RECLAIM() -> from now on tmpfs_close is not called on close

and we get node refcount leaked by nine.

The only events that are guaranteed to be paired are VOP_OPEN()s and struct file -> fo_close(). This is one of the reasons why D_TRACKCLOSE does not work in devfs, and why I wrote cdevpriv to allow drivers to track per-fd data.
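The pairing argument above, together with the f_data reuse from the summary, can be modeled in a toy form (all names here are illustrative, not the kernel's struct file or the actual tmpfs code): a per-open file object whose close hook always runs keeps the node refcount balanced even when vnode-level close stops being called after reclamation.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy tmpfs node: just a refcount. */
struct node {
	int ref;
};

/* Toy struct file: in the real patch the node pointer goes into
 * fp->f_data (f_vnode already holds the vnode pointer). */
struct file {
	struct node *f_data;
};

/* open(2) analogue: takes a node reference, as in tmpfs_open(). */
static struct file *
file_open(struct node *n)
{
	struct file *fp = malloc(sizeof(*fp));

	assert(fp != NULL);
	fp->f_data = n;
	n->ref++;
	return (fp);
}

/* fo_close() analogue: guaranteed to run exactly once per open,
 * regardless of what happened to the vnode in between. */
static void
file_close(struct file *fp)
{
	fp->f_data->ref--;
	free(fp);
}
```

In the ten-opens-then-forced-unmount scenario above, a VOP_CLOSE-based unref would fire once and leak nine references; the per-file hook fires ten times and balances.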

kib marked an inline comment as done.

Mark notes:
use f_vnode in zfs code
update comments.

So again is there a problem keeping vm object in a good enough shape as long as v_usecount > 1? This would avoid adding more work on tmpfs open/close and would most likely facilitate easier support for other filesystems, at the expense of some surgery to vgone.

sys/kern/vfs_vnops.c
976 ↗(On Diff #76886)

This induces an indirect func call only to find out the filesystem at hand does not even do pgcache reads.

Why not hoist VIRF_PGREAD | VIRF_DOOMED check here? It looks like all participating filesystems would have to do it anyway.

kib marked an inline comment as done. Edited Sep 10 2020, 11:06 PM
In D26346#587083, @mjg wrote:

So again is there a problem keeping vm object in a good enough shape as long as v_usecount > 1? This would avoid adding more work on tmpfs open/close and would most likely facilitate easier support for other filesystems, at the expense of some surgery to vgone.

The liveness of the vm_object for a tmpfs VREG node is equal to the liveness of the node itself. This is a fundamental difference between the tmpfs v_object and a typical filesystem with permanent storage, where v_object is used for caching and its lifetime is a strict subset of the vnode (not inode) lifetime.

It is not vgone() that must be patched but freevnode(), and it would have to become aware of filesystem-specific details for vnodes that are already reclaimed. This is a violation of the VFS architecture, and I do not see it being worth the effort. The patch currently adds a single atomic to open and close on tmpfs.

[Later add: and doing it in the last vdrop/freevnode means a user-uncontrolled memory leak. Right now node data is kept as long as there is a mapping (or, with the patch, an opened file).]

Check VIRF flags in vfs_vnops.c instead of doing it in VOPs.

markj added inline comments.
sys/fs/tmpfs/tmpfs_subr.c
224 ↗(On Diff #76910)

I'll add refcount_acquire_gt_zero() if you agree with the name.

refcount_acquire() did not have a return value until recently, so most existing callers do not check.

This revision is now accepted and ready to land. Sep 11 2020, 6:37 PM
kib marked an inline comment as done. Sep 11 2020, 6:53 PM
kib added inline comments.
sys/fs/tmpfs/tmpfs_subr.c
224 ↗(On Diff #76910)

I would add 'assert' somewhere in the name, e.g. refcount_acquire_assert_gt_zero().

In D26346#587114, @kib wrote:
In D26346#587083, @mjg wrote:

So again is there a problem keeping vm object in a good enough shape as long as v_usecount > 1? This would avoid adding more work on tmpfs open/close and would most likely facilitate easier support for other filesystems, at the expense of some surgery to vgone.

The liveness of the vm_object for a tmpfs VREG node is equal to the liveness of the node itself. This is a fundamental difference between the tmpfs v_object and a typical filesystem with permanent storage, where v_object is used for caching and its lifetime is a strict subset of the vnode (not inode) lifetime.

It is not vgone() that must be patched but freevnode(), and it would have to become aware of filesystem-specific details for vnodes that are already reclaimed. This is a violation of the VFS architecture, and I do not see it being worth the effort. The patch currently adds a single atomic to open and close on tmpfs.

[Later add: and doing it in the last vdrop/freevnode means a user-uncontrolled memory leak. Right now node data is kept as long as there is a mapping (or, with the patch, an opened file).]

So a general point here is that the current behavior vs vgone is quite problematic, even ignoring this particular use case. And the fix would be integrating vnode recycling with filesystems, making the current change spurious in my opinion. As a step towards that, we could just make the vm obj for tmpfs behave in the same manner as it does in other cases.

Another example of the problem is non-lockmgr locks for vnodes. If it ever worked, it was in part because vget was a 2-step process, starting with setting a flag that the vnode is needed, waiting for any in-flight stuff to settle and only then locking. The flag stuff was removed years ago, and now you can get threads stuck in a custom locking primitive while the vnode gets doomed and gets its op vector replaced with something only using lockmgr. This leads to all kinds of woes, and the only clean fix I see is to provide filesystems with custom badops (or an equivalent) so that non-lockmgr locks are operational right until the vnode is freed.

I don't get the point about user-controlled leak. The user can keep vnodes around predominantly by opening them, which has the same side effect with the current patch.

kib marked an inline comment as done.
kib edited the summary of this revision. (Show Details)

Fix ordering issue between finit() and tmpfs_open().

This revision now requires review to proceed. Sep 14 2020, 11:08 AM
In D26346#587295, @mjg wrote:

So a general point here is that the current behavior vs vgone is quite problematic, even ignoring this particular use case. And the fix would be integrating vnode recycling with filesystems, making the current change spurious in my opinion. As a step towards that, we could just make the vm obj for tmpfs behave in the same manner as it does in other cases.

Another example of the problem is non-lockmgr locks for vnodes. If it ever worked, it was in part because vget was a 2-step process, starting with setting a flag that the vnode is needed, waiting for any in-flight stuff to settle and only then locking. The flag stuff was removed years ago, and now you can get threads stuck in a custom locking primitive while the vnode gets doomed and gets its op vector replaced with something only using lockmgr. This leads to all kinds of woes, and the only clean fix I see is to provide filesystems with custom badops (or an equivalent) so that non-lockmgr locks are operational right until the vnode is freed.

Reworking forced unmounts is arguably out of scope of this patch.

I don't get the point about user-controlled leak. The user can keep vnodes around predominantly by opening them, which has the same side effect with the current patch.

I wrote "user-uncontrolled leak". Before this patch, if the user mapped a tmpfs node, its content memory is kept around until the last unmap is done. With this patch, opening has the same effect. But if the node (and v_object, as the actual goal) is kept around until the last reference on the vnode is dropped, the content memory is kept as well, and since the user cannot control when freevnode() actually runs, this is a temporal leak not controlled by the user.

kib edited the summary of this revision. (Show Details)
This revision was not accepted when it landed; it landed in state Needs Review. Sep 15 2020, 9:55 PM
This revision was automatically updated to reflect the committed changes.