Page MenuHomeFreeBSD

Buffer pager for FFS.
ClosedPublic

Authored by kib on Oct 8 2016, 5:28 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Nov 20, 8:33 PM
Unknown Object (File)
Sun, Nov 17, 8:02 AM
Unknown Object (File)
Sun, Nov 17, 12:58 AM
Unknown Object (File)
Mon, Nov 11, 2:29 PM
Unknown Object (File)
Mon, Nov 11, 6:08 AM
Unknown Object (File)
Thu, Oct 31, 5:41 PM
Unknown Object (File)
Fri, Oct 25, 4:16 PM
Unknown Object (File)
Oct 20 2024, 8:13 PM
Subscribers

Details

Summary

This is mostly optional FFS pager which, comparing with the generic local fs pager from vnode_pager.c, correctly operates on FFS volumes which underlying block device has block size larger than the machine page size.

Main issue there is that vnode_pager.c cannot easily align page-in requests on the block size. It would need to grab and busy pages outside the page-in run, and handle valid pages in the run extensions. Buffer cache/FFS already handle these problems, so simply bread()-ing the corresponding buffers provides the pager functionality almost for free. In particular, the most delicate issue with valid pages substitutions by bogus pages is avoided.

The only problematic issue is that exclusively busy state, which is assumed by the vm_pager_getpages() interface, is incompatible with the VMIO buffer cache desire to shared-busy pages. Most of the code handles the downgrade (trivial) before reading buffers, and less trivial upgrade from the shared-busy to excl-busy.

I release buffers after the read, if possible (see the check for b_dep and B_RELBUF flag setting) to somewhat ease the buffer cache pressure, this is probably not strictly needed.

Test Plan

Patch was tested by Peter on both 8K and normal volumes, for later the sysctl vfs.ffs.use_buf_pager = 1 was set. In fact I might benchmark this to consider changing default to use_buf_pager = 1 for HEAD (but did not, yet).

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Did you consider some mechanism (e.g. new buf flag) to indicate to the buffer cache that the pages are already (exclusive-)busied? Is there some reason this wouldn't be sufficient?

What FFS-specific properties is this code relying on?

sys/ufs/ffs/ffs_vnops.c
123 ↗(On Diff #21186)

Looks like these were sorted before.

1811 ↗(On Diff #21186)

Isn't "ma" a more common name for this?

1859 ↗(On Diff #21186)

Under what circumstances can m->valid == 0 be true here?

In D8198#170382, @markj wrote:

Did you consider some mechanism (e.g. new buf flag) to indicate to the buffer cache that the pages are already (exclusive-)busied? Is there some reason this wouldn't be sufficient?

Yes, I did considered, but not for too long. I decided that the buffer cache modifications would be needed for this code, and went with the downgrade. I wanted the pager be as little intrusive as possible.

Note that just single flag specifying that the buffer pages are excl-busy is not enough. Page-in request typically does not include all buffer pages in the run, so only some pages are excl-busy by the current thread. This makes the 'buffers with some pages excl-busy' complicated.

If not the soft-busy mechanism bugs, this patch would consist only of the additional code in ffs_vnops.c.

What FFS-specific properties is this code relying on?

At very least, the b_deps check is pure FFS specific. Also, the concept of the um_devvp -- device vnode of the mount, and the way to fetch the block size of that device is also FFS-specific. Yes, with some minor refactoring the pager should work for msdosfs at least, and might be for cd9660. But this is not the straight plug-in due to the above details, which needs to be filled there as well.

I will fix style issues you pointed out tomorrow, thanks.

sys/ufs/ffs/ffs_vnops.c
1859 ↗(On Diff #21186)

The page can be invalidated while the object lock is not owned. Other thread might mmap the page after bread() marked it as valid, and invalidate. I believe that the soft-busy state of the page would not prevent the scenario.

kib edited edge metadata.

Style.

kib marked 2 inline comments as done.Oct 11 2016, 9:48 AM
markj edited edge metadata.
This revision is now accepted and ready to land.Oct 11 2016, 8:01 PM
mckusick edited edge metadata.

This looks good.

Unless it causes a regression in performance, I think that it should be enabled in HEAD,

sys/ufs/ffs/ffs_vnops.c
1896 ↗(On Diff #21243)

Shouldn't you be unlocking and relocking the object here?

kib edited edge metadata.

Unlock the object around VM_WAIT.
Restructure the busy loop to have the control flow less convoluted.

This revision now requires review to proceed.Oct 13 2016, 4:05 PM
sys/ufs/ffs/ffs_vnops.c
1906 ↗(On Diff #21358)

"Since the pages ..."

1907 ↗(On Diff #21358)

"... neither the buffer nor the object lock was held ..."

1908 ↗(On Diff #21358)

"Recheck the valid ..."

kib marked 4 inline comments as done.
kib edited edge metadata.

Grammar fixes.

mckusick edited edge metadata.

I concur with the changes suggested by alc@

This revision is now accepted and ready to land.Oct 14 2016, 5:11 AM

Ran an old dbench test with D8197+8 applied + vfs.ffs.use_buf_pager=1 :

$ procs=4 ../compare.sh dbench.r307234.2016-10-13.20161014.050428.log dbench.r307234M.2016-10-13.20161014.101527.log
Ministat for 4 procs:
x r307234
+ r307234M
+------------------------------------------------------------+
|+ + x x x x x + ++ |
| |__________|________AM______A|______________M___________||
+------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 598.833 602.65 600.993 600.9392 1.5471662
+ 5 596.953 606.203 605.05 602.3048 4.7793623
No difference proven at 95.0% confidence
$

In D8198#171414, @pho wrote:

Ran an old dbench test with D8197+8 applied + vfs.ffs.use_buf_pager=1 :
No difference proven at 95.0% confidence

Thank you, this is interesting.

If measuring the speed, a possible curious experiment is to do the same measurement with removed line 1865

bp->b_flags |= B_RELBUF;

The system may just deadlock without it, but if not, the removal might make some difference in the benchmark.

sys/ufs/ffs/ffs_vnops.c
1822 ↗(On Diff #21374)

The vnode lock is ensuring that vnp_size doesn't concurrently change, yes?

1846 ↗(On Diff #21374)

The busy status of the page is ensuring that the page's valid field doesn't change, yes? (Normally, we would lock the object to test the page's valid field.)

1889 ↗(On Diff #21374)

Was there a reason not to use vm_page_grab() here?

1912 ↗(On Diff #21374)

I found this hard to read. Do you mean, "... and partial validity for the page at index count - 1 could mean ..."

1914 ↗(On Diff #21374)

"..., so we must restart ..."

sys/ufs/ffs/ffs_vnops.c
1822 ↗(On Diff #21374)

Sure. Also, Peter' tests show that the issue with the parallel truncation is fixed.

1846 ↗(On Diff #21374)

It does not. Shared-busy pages can be invalidated. But the test skips useless creation of the buffer for most typical case when invalidation is not used. The shared->excl upgrade loop at the end would catch the case if we raced, in the reliable way (protected by the object lock).

1889 ↗(On Diff #21374)

Initially the code was structured differently, and vm_page_grab() was not suitable. Now indeed it could be used.

But I started considering that I must also drop the vnode lock around VM_WAIT, after your pointed my bug with the object lock. I probably will add the vnode unlocking around the VM_WAIT, then vm_page_grab() would become unsuitable again.

From Peter' mail:

With this I see an improvement in throughput:

$ ../compare.sh dbench.r307234.2016-10-13.20161014.050428.log dbench.r307234M.2016-10-13.20161014.160711.log
Ministat for 4 procs:
x r307234
+ r307234M
+------------------------------------------------------------+
|                                         +                  |
|x xx xx                                 ++        +        +|
| |_A__|                               |__M____A________|    |
+------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       598.833        602.65       600.993      600.9392     1.5471662
+   5       625.198       637.268       625.772      629.0804      5.268211
Difference at 95.0% confidence
 28.1412 +/- 5.66242
 4.68287% +/- 0.942261%
 (Student's t, pooled s = 3.88251)
$ 

My old dbench setup runs on a async mounted swap backed memory disk, ...

There was no difference in the buildworld times, I asked mjg@ to measure buildworld with this on pig1.

kib edited edge metadata.

Switch to vm_page_grab(). Buffer cache does not drop vnode lock around page allocations, page-in can safely do same.

Comment-out B_RELBUF and explain why it might be useful, in the comment.

Update comments.

This revision now requires review to proceed.Oct 15 2016, 6:56 AM

With mjg@ testing on pig1 (40 cores), the impact of the change on buildworld:

The differnce is real-time is marginal.
sys-time is the same.

(statistically significant).

In other words, there is no regressions with the buildworld load, confirmed both by pho@ and mjg@.

In addition to adding a comment concerning the unsynchronized "m->valid" accesses, I would also suggest turning a portion of the summary into a comment at the top of this function. Otherwise, I'm happy with this change.

sys/ufs/ffs/ffs_vnops.c
1846 ↗(On Diff #21374)

I think that your response to my question should be turned into a comment in the code.

mckusick edited edge metadata.

Given the comparable performance measures by pho@ and mjg@, I suggest that this change be enabled by default.

I suggest that rather than commenting out the B_RELBUF that it be enabled by a sysctl so that folks (like Netflix) that want to keep down the pressure on the buffer cache can easily do so.

This revision is now accepted and ready to land.Oct 15 2016, 11:45 PM
kib edited edge metadata.

Add requested comments.
Disable racy assert (but useful for debugging in the controlled env).
Add a knob to enable B_RELBUF.
Enable buf pager by default.

This revision now requires review to proceed.Oct 16 2016, 5:50 AM

Alan, please note that D8197 is the prerequisite for this code.

mckusick edited edge metadata.

Thanks for the additions.Looks good to go.

This revision is now accepted and ready to land.Oct 16 2016, 8:24 PM
sys/ufs/ffs/ffs_vnops.c
1802 ↗(On Diff #21428)

"In contrast to the generic ...

1804 ↗(On Diff #21428)

"... size. The buffer

1806 ↗(On Diff #21428)

"... does the necessary ...

1807 ↗(On Diff #21428)

"replacements ...

1810 ↗(On Diff #21428)

"... that the exclusive ...

1812 ↗(On Diff #21428)

"... buffer cache's desire ...

1813 ↗(On Diff #21428)

"pages. This function performs a trivial downgrade of the pages' state before

1814 ↗(On Diff #21428)

"... and a less ...

1815 ↗(On Diff #21428)

"the excl-busy state after the read.

sys/ufs/ffs/ffs_vnops.c
1869 ↗(On Diff #21428)

"... busy and the object ...

1870 ↗(On Diff #21428)

"... which together allow for the pages' invalidation. ...

1872 ↗(On Diff #21428)

"... for the most ...

1875 ↗(On Diff #21428)

"... in a reliable ...

alc edited edge metadata.
kib edited edge metadata.

Language fixes.

This revision now requires review to proceed.Oct 17 2016, 10:08 AM
markj edited edge metadata.
This revision is now accepted and ready to land.Oct 17 2016, 5:27 PM
alc edited edge metadata.
mckusick edited edge metadata.

Make it so.

This revision was automatically updated to reflect the committed changes.