
VMIO read
Closed, Public

Authored by kib on Aug 5 2020, 10:02 PM.

Details

Summary

If possible, i.e. if the requested range is resident and valid in the vm object queue, and some secondary conditions hold, copy data for read(2) directly from the valid cached pages, avoiding taking the vnode lock and instantiating buffers. I intentionally do not start read-ahead, nor handle the advice set on the cached range.

It is only enabled for UFS right now. Tmpfs does not store the file size in the swap object, so I cannot clip the read at EOF. In principle this can work for ZFS as well.
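
For orientation, here is a minimal sketch of that fast path, assuming the names used in the patch (vn_read_from_obj(), io_hold_cnt, vn_io_fault_pgmove()) and a 64-bit platform for atomic_load_64. The OBJ_DEAD/pip interlock and the page referencing discussed later in the review are omitted, as is most error handling; this is not the committed code.

/* Simplified sketch, not the committed code. */
static int
vn_read_from_obj(struct vnode *vp, struct uio *uio)
{
	vm_object_t obj;
	vm_page_t ma[io_hold_cnt + 2];
	off_t off, vsz;
	ssize_t resid;
	int error, i, j;

	obj = atomic_load_ptr(&vp->v_object);
	if (obj == NULL)
		return (EJUSTRETURN);	/* caller falls back to VOP_READ() */

	/*
	 * Busy the run of resident valid pages covering the request,
	 * never sleeping and never instantiating new pages.
	 */
	resid = uio->uio_resid;
	off = uio->uio_offset;
	for (i = 0; resid > 0 && i < (int)nitems(ma); i++) {
		ma[i] = vm_page_grab_unlocked(obj, atop(off),
		    VM_ALLOC_NOCREAT | VM_ALLOC_SBUSY | VM_ALLOC_NOWAIT);
		if (ma[i] == NULL)
			break;
		if (vm_page_none_valid(ma[i])) {
			vm_page_sunbusy(ma[i]);
			break;
		}
		resid -= PAGE_SIZE;
		off += PAGE_SIZE;
	}
	if (i == 0)
		return (EJUSTRETURN);

	/* Clip the transfer at EOF and at the end of the busied run. */
	vsz = atomic_load_64(&obj->un_pager.vnp.vnp_size);
	if (uio->uio_offset >= vsz) {
		error = EJUSTRETURN;	/* post-EOF: use VOP_READ() */
	} else {
		resid = MIN(uio->uio_resid, vsz - uio->uio_offset);
		resid = MIN(resid, ptoa(i) - (uio->uio_offset & PAGE_MASK));
		error = vn_io_fault_pgmove(ma,
		    uio->uio_offset & PAGE_MASK, resid, uio);
	}
	for (j = 0; j < i; j++)
		vm_page_sunbusy(ma[j]);
	return (error);
}

The rest of the review hangs off this skeleton: interlocking with object termination, LRU handling for the copied-from pages, and how to obtain the byte size for filesystems other than UFS.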

Sample of a 2.5G file read; the first read is of a guaranteed-uncached file from a spinning disk:

root@:/ # umount /mnt
root@:/ # mount -r /dev/da0p2 /mnt
root@:/ # sysctl debug.vn_io_pgcache_read_enable=0
debug.vn_io_pgcache_read_enable: 0 -> 0
root@:/ # time dd if=/mnt/work/kernbuild.stacks of=/dev/zero
5256686+1 records in
5256686+1 records out
2691423676 bytes transferred in 19.233627 secs (139933240 bytes/sec)
       19.25 real         1.58 user        16.58 sys
root@:/ # time dd if=/mnt/work/kernbuild.stacks of=/dev/zero
5256686+1 records in
5256686+1 records out
2691423676 bytes transferred in 17.040581 secs (157942009 bytes/sec)
       17.04 real         1.77 user        15.27 sys
root@:/ # time dd if=/mnt/work/kernbuild.stacks of=/dev/zero
5256686+1 records in
5256686+1 records out
2691423676 bytes transferred in 16.993587 secs (158378784 bytes/sec)
       16.99 real         1.89 user        15.09 sys
root@:/ # umount /mnt
root@:/ # mount -r /dev/da0p2 /mnt
root@:/ # sysctl debug.vn_io_pgcache_read_enable=1
debug.vn_io_pgcache_read_enable: 0 -> 1
root@:/ # time dd if=/mnt/work/kernbuild.stacks of=/dev/zero
5256686+1 records in
5256686+1 records out
2691423676 bytes transferred in 18.191389 secs (147950421 bytes/sec)
       18.21 real         1.77 user         9.71 sys
root@:/ # time dd if=/mnt/work/kernbuild.stacks of=/dev/zero
5256686+1 records in
5256686+1 records out
2691423676 bytes transferred in 10.764700 secs (250023107 bytes/sec)
       10.76 real         1.67 user         9.08 sys
root@:/ # time dd if=/mnt/work/kernbuild.stacks of=/dev/zero
5256686+1 records in
5256686+1 records out
2691423676 bytes transferred in 10.720573 secs (251052233 bytes/sec)
       10.72 real         1.70 user         9.01 sys

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

In D25968#575879, @kib wrote:
In D25968#575713, @mjg wrote:

Is there a problem getting this to work for tmpfs? I had a WIP lockless routine (VOP_GETATTR_LITE) which would give you the size without taking the vnode lock.

Yes, the only reason why tmpfs is skipped right now is because I need a way to get the vnode size without vnode lock.

the tmpfs_node object is protected with SMR since lockless lookup. tmpfs_getattr does:

obj = node->tn_reg.tn_aobj;
vap->va_bytes = (u_quad_t)obj->resident_page_count * PAGE_SIZE;

Assuming this is correct with the vnode lock held, a racy variant can be trivially devised, even if aobj can get reassigned from under us in the absence of any locks. Note that since no locks are held, the potential mismatch between the actual and reported size has to be accounted for anyway. The VOP_GETATTR_LITE routine I mentioned was going to return whatever it finds, and if the caller does not hold the vnode lock, that data may be stale on return (care would be taken to update select fields in an atomic manner, but no effort would be made to provide a fully consistent snapshot).

Is there another problem here?

sys/kern/vfs_vnops.c
175

Having it in irflag would eliminate the VREG and mp != NULL checks (along with the mp read), which I think is a nice win, assuming it's not problematic.

In D25968#576131, @mjg wrote:

Is there another problem here?

No, it is this problem. In more detail: object->size is the size in pages, not in bytes. For proper read(2) clipping at EOF we need to know the size in bytes. Vnode vm objects have the vnp.vnp_size field for this; for tmpfs swap objects I do not want to add it, in part because there is no single place to maintain it.

It is fine to read the size racily; the patch already does that for regular vnodes, because post-EOF reads fall back to VOP_READ().

It might be that SMR protection for tmpfs nodes is enough.
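
A lockless size read along these lines might look like the following hypothetical sketch. It assumes SMR protection of the tmpfs node is indeed sufficient and that the byte size lives in the node (tn_size) rather than in the swap object; the helper name is made up for illustration.

/*
 * Hypothetical sketch: racily read the byte size of a tmpfs file
 * without the vnode lock.  The swap object only records a page count
 * (obj->size), so the byte size kept in the tmpfs node is needed to
 * clip read(2) exactly at EOF.  A stale result is tolerable because
 * post-EOF reads fall back to the locked VOP_READ() path.
 */
static off_t
tmpfs_size_lockless(struct vnode *vp)
{
	struct tmpfs_node *node;

	node = VP_TO_TMPFS_NODE(vp);
	return (atomic_load_64(&node->tn_size));
}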

sys/kern/vfs_vnops.c
136

Should it be RWTUN? Same for the sysctls above.

875

Is it possible to use vm_page_grab_pages_unlocked() here? We will potentially grab a run of invalid pages, but I do not see why it can't be handled.

877

But it's still possible that we are grabbing pages from an arbitrary object, right? I think we must never block in that case, i.e., VM_ALLOC_NOWAIT must be specified for all pages in the range. Consider that we may allocate objects for kernel memory in which pages are permanently xbusy. Kernel stacks are an example. Before r360354 kernel stack objects were dynamically allocated.

kib marked 4 inline comments as done. Aug 7 2020, 9:14 PM
kib added inline comments.
sys/kern/vfs_vnops.c
136

I did not find a use for it, since vn_io_fault_enable was introduced as an emergency switch. But I do not see why not.

875

I tried it and I do not like the result. The saving of code lines in the initial loop is eaten by the need to calculate the run length twice: first to see how many pages to grab, then to iterate and cut the run at invalid pages. Also, busying any invalid page after the first is just a waste.

kib marked 2 inline comments as done.

Fix loop termination for invalid pages.
Use non-waiting busy.
Move pgcache flag from mnt kern flags to vp irflags.

kib retitled this revision from vmio read to VMIO read. Aug 7 2020, 9:30 PM
kib edited the summary of this revision. (Show Details)
sys/sys/vnode.h
248
@@ -4225,7 +4231,9 @@ vn_printf(struct vnode *vp, const char *fmt, ...)
        buf[1] = '\0';
        if (vp->v_irflag & VIRF_DOOMED)
                strlcat(buf, "|VIRF_DOOMED", sizeof(buf));
-       flags = vp->v_irflag & ~(VIRF_DOOMED);
+       if (vp->v_irflag & VIRF_PGREAD)
+               strlcat(buf, "|VIRF_PGREAD", sizeof(buf));
+       flags = vp->v_irflag & ~(VIRF_DOOMED | VIRF_PGREAD);
        if (flags != 0) {
                snprintf(buf2, sizeof(buf2), "|VIRF(0x%lx)", flags);
                strlcat(buf, buf2, sizeof(buf));
sys/ufs/ufs/ufs_vnops.c
285

should this migrate back below the append check?

kib marked 2 inline comments as done. Aug 7 2020, 11:53 PM
kib added inline comments.
sys/ufs/ufs/ufs_vnops.c
285

This was intentional, so that nullfs sees the vm_object for its open as early as possible.

kib marked an inline comment as done.

Teach vn_printf() about VIRF_PGREAD.

So I added a cheap hack to use VOP_UGETATTR_LITE from the other review and added tmpfs support (diff at the bottom).

There is the pread3 ("Same file pread to different offsets") test from will-it-scale, which on a stock kernel de facto serializes on ping-ponging the vnode lock.
Single-threaded there is a nice win (numbers are ops/s, before/after flipping debug.vn_io_pgcache_read_enable from 0 to 1):
pread3_processes -t 1 -s 10
before: 4197381
after: 4768564

But the real win is multithreaded.
pread3_processes -t 52 -s 10
before: 4353811
after: 166425928

Another problematic test was read2 ("Same file read") -- reads over the entire file. Previously it once more serialized completely on the vnode lock; now the effect is significantly lessened.

read2_processes -t 52 -s 10
before: 3680856
after: 79499021

While I can't compare these numbers against Linux right now, it did scale very well in pread3 and had some issues in read2. Thus I suspect we are now in the same ballpark for these 2 tests.

Hack for benchmarking:

diff --git a/sys/fs/tmpfs/tmpfs_subr.c b/sys/fs/tmpfs/tmpfs_subr.c
index 55a6390f6833..eb5f6dfff616 100644
--- a/sys/fs/tmpfs/tmpfs_subr.c
+++ b/sys/fs/tmpfs/tmpfs_subr.c
@@ -736,6 +736,8 @@ tmpfs_alloc_vp(struct mount *mp, struct tmpfs_node *node, int lkflag,
 
 out:
        if (error == 0) {
+               if (vp->v_type == VREG)
+                       vp->v_irflag |= VIRF_PGREAD;
                *vpp = vp;
 
 #ifdef INVARIANTS
diff --git a/sys/kern/vfs_vnops.c b/sys/kern/vfs_vnops.c
index ed54f6435755..3eaad8844e1e 100644
--- a/sys/kern/vfs_vnops.c
+++ b/sys/kern/vfs_vnops.c
@@ -853,6 +853,7 @@ vn_read_from_obj(struct vnode *vp, struct uio *uio)
 {
        vm_object_t obj;
        vm_page_t ma[io_hold_cnt + 2];
+       struct vattr_lite attr;
        off_t off, vsz;
        ssize_t resid;
        int error, i, j;
@@ -907,11 +908,12 @@ vn_read_from_obj(struct vnode *vp, struct uio *uio)
         * variable so that possible concurrent extension does not
         * break calculation.
         */
-#if defined(__powerpc__) && !defined(__powerpc64__)
-       vsz = object->un_pager.vnp.vnp_size ;
-#else
-       vsz = atomic_load_64(&obj->un_pager.vnp.vnp_size);
-#endif
+       error = VOP_UGETATTR_LITE(vp, &attr, curthread->td_ucred);
+       if (__predict_false(error != 0)) {
+               error = EJUSTRETURN;
+               goto out;
+       }
+       vsz = attr.va_size;
        if (uio->uio_offset + resid > vsz)
                resid = vsz - uio->uio_offset;
sys/kern/vfs_vnops.c
922

We should ensure that this counts as a reference with respect to LRU.

I think there are basically two things we could do: call vm_page_deactivate(), or call vm_page_reference(). The first will trigger a requeue of the page, so it will go to the tail of the inactive queue in the near future. This is basically what happens when a page is evicted from the buffer cache. vm_page_reference() merely sets a flag that will tell the page daemon to requeue or reactivate the page, so it is a "lazy" requeue. In particular it does not preserve LRU, but it is cheaper.

To be more specific about the behaviour of vm_page_reference(): when an inactive scan encounters a page with PGA_REFERENCED set, and the object has ref_count > 0, the page is moved to the active queue. If ref_count == 0, it is requeued. So, if the page is mapped elsewhere, it will be activated.

sys/kern/vfs_vnops.c
922

Suppose that the page is active: what would vm_page_deactivate() do? From my reading of vm_page_mvqueue(), it would schedule the page's move to the inactive queue (same as before the scheduling stuff), right? I do not think this is fair.

Buffer cache deactivation of the page on buffer reclamation is supported by a specific use case, and by the fact that we lose all page access history when the buffer is formed. Moreover, the page is kept resident for a guaranteed long time.

In other words, I do not think that deactivation is fair. OTOH, I do not see a problem with vm_page_reference().

sys/kern/vfs_vnops.c
922

Right, vm_page_deactivate() will unconditionally move the page to the inactive queue. vm_page_unwire(PQ_INACTIVE) actually has friendlier semantics: if the page is in the active queue, PGA_REFERENCED is set and the page is left in the active queue. We should probably have a function unrelated to wiring that implements this behaviour. In the meantime, I think using vm_page_reference() is fine.

kib marked 2 inline comments as done.

Reference pages we read from.
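
The unbusy loop in the sketch above then becomes, roughly, the following (names as in the patch; per the discussion above, vm_page_reference() just sets PGA_REFERENCED for the page daemon to act on lazily):

	/* Lazily record the access before dropping the shared busy lock. */
	for (j = 0; j < i; j++) {
		if (error == 0)
			vm_page_reference(ma[j]);
		vm_page_sunbusy(ma[j]);
	}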

sys/kern/vfs_vnops.c
893

Note that vm_object_page_remove() does not remove wired pages from the object, it merely marks them invalid. So we may temporarily busy an invalid page after vm_object_page_remove() has passed it. Then we may race with vm_object_terminate_pages(), which asserts that all resident pages are unbusy.

vm_object_terminate_pages() removes pages from their object, which is an operation protected by the busy lock. So I don't think we can correctly weaken the assertion.

911

Extra space before ;.

We should set reference bits or similar so that the LRU is updated lazily.

vm_page_grab_valid() could be used to handle the pagein. I'm not sure whether this will be performance beneficial or not. It should slightly simplify the code.

I'm really glad you did this. This was a large part of the up-front impetus for the object concurrency changes. This should alleviate lockmgr contention for read bufs in databases, which we identified as significant.

kib marked 2 inline comments as done.

Increment pip around pgread. Interlock with the OBJ_DEAD check so as not to interfere with vm_object_terminate_pages().

kib removed a subscriber: jeff.
In D25968#577165, @jeff wrote:

We should set reference bits or similar so that the LRU is updated lazily.

This is already done.

vm_page_grab_valid() could be used to handle the pagein. I'm not sure whether this will be performance beneficial or not. It should slightly simplify the code.

Let's postpone page validation and write(2) support for pgcache.

sys/kern/vfs_vnops.c
865

I would add /* Depends on type-stability. */ here.

874

Hrm, vm_object_terminate_pages() also asserts paging_in_progress == 0.

893

I wonder why vm_object_page_remove() does not remove wired pages from the object.

sys/sys/vnode.h
247

Maybe, "direct reads from the page cache are permitted"?

kib marked 4 inline comments as done. Aug 12 2020, 4:27 PM
kib added inline comments.
sys/kern/vfs_vnops.c
893

Because it causes other issues. In short, there is an architectural contradiction between user wiring and msync(MS_INVALIDATE). On the one hand, we must keep the wired pages around, since otherwise e.g. vm_object_unwire() panics with 'missed page'. On the other hand, MS_INVALIDATE works by doing vm_object_page_remove().

So the solution that fixed a lot of breakage was to keep wired pages but invalidate them. I am still not sure that everything was fixed, but definitely a lot was.

kib marked an inline comment as done.

Remove pip == 0 assert in vm_object_terminate().
Update comment.

sys/kern/vfs_vnops.c
893

Hmm, right. We prohibit msync(MS_INVALIDATE) on an mlock()ed region, but the mapping may be mlock()ed into a different address space.

I will say that keeping the wired pages also introduces a fair bit of complexity. r345382 is one example, but I remember that there were others in the past.

sys/vm/vm_object.c
946

s/should/must/.

Maybe we should have a vm_object_pip_add() wrapper to do the OBJ_DEAD check.

kib marked 2 inline comments as done. Aug 12 2020, 4:57 PM
kib added inline comments.
sys/vm/vm_object.c
946

Maybe. I will look at this after the current patch lands.

kib marked an inline comment as done.

s/should/must/

markj added inline comments.
sys/fs/nullfs/null_subr.c
265

Looks like there is some extra white space at the beginning of each line.

This revision is now accepted and ready to land. Aug 12 2020, 5:30 PM
sys/kern/vfs_vnops.c
866–870

I'm confused about all of this. The vnode has a reference to the object. We are reading from a reference to a vnode. Is the vnode shared locked here? This is all to handle a race with forced unmount?

sys/vm/vm_object.c
946

Why was this removed? References are more strict than pip.

kib marked 3 inline comments as done. Aug 12 2020, 5:54 PM
kib added inline comments.
sys/fs/nullfs/null_subr.c
265

There is a tab and a space before each '*', as usual for multiline comments. Unless I am missing something, this is the normal indent.

sys/kern/vfs_vnops.c
866–870

No, the vnode is not locked; we only have a reference to it through struct file's f_vnode.
It handles forced unmounts in particular.

The vnode content is read-locked by a rangelock over the read range, so we get the expected read atomicity.

sys/vm/vm_object.c
946

My bad, restored.

kib marked 3 inline comments as done.

Restore ref_count assert in terminate().

This revision now requires review to proceed. Aug 12 2020, 5:55 PM
sys/kern/vfs_vnops.c
866–870

I see, ok. Well, it really is a shame how much extra synchronization and complexity is required for forced unmount. Orthogonal to this patch, but maybe we should rethink the way that works to avoid all of this. Possibly something closer to write suspension.

sys/fs/nullfs/null_subr.c
265

I meant that there is an extra space after each *.

kib marked 2 inline comments as done.

spaces

markj added inline comments.
sys/kern/vfs_vnops.c
939

Can't we handle requests larger than ptoa(io_hold_cnt) by looping with the object PIP held?

This revision is now accepted and ready to land. Aug 13 2020, 2:13 PM
kib marked an inline comment as done. Aug 13 2020, 2:45 PM

Patch is currently tested by pho.

sys/kern/vfs_vnops.c
939

We can, but I am not sure it would be very useful.
The vn_io_fault code already chunks I/O into io_hold_cnt blocks, so for the normal case where vn_io_fault is enabled, we would never see a large block. The check is there to do something sensible with !vn_io_fault.

We might simply not do the pgcache read for !vn_io_fault, but that seems too arbitrary.
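
The guard under discussion amounts to something like this (sketch, names from the patch):

	/*
	 * Requests that cannot fit into the ma[] page array fall back
	 * to the locked path; vn_io_fault already limits normal
	 * requests to io_hold_cnt pages.
	 */
	if (uio->uio_resid > ptoa(io_hold_cnt))
		return (EJUSTRETURN);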

sys/kern/vfs_vnops.c
939

Thanks, I see now.

I ran a full stress2 test with this.

This revision was automatically updated to reflect the committed changes.
kib marked an inline comment as done.