VFS: Track sequential reads and writes separately
ClosedPublic
Actions

Authored by tmunro on May 27 2020, 10:34 AM.

Details

Reviewers

mjg
kib

Commits

rS362460: vfs: track sequential reads and writes separately

Summary

If you're reading sequential data from a file, modifying it and then writing it back sequentially in blocks, with the writes some distance behind the reads due to some kind of user space buffering scheme, then the I/O clustering facility isn't triggered. This hurts PostgreSQL, SQLite, and similar when they're rewriting a large table on a UFS filesystem.

Here's a patch to fix that by tracking sequential reads and write separately. It was written by Andrew Gierth back in 2008, and I've rebased it and hopefully improved it slightly.

Test Plan

Create a file significantly larger than memory on a UFS filesystem. Run the following test program on it:

https://github.com/macdice/some-io-tests/blob/master/sequential-read-and-write.c

This reads in a straight line, and writes in a straight line some distance behind with interleaving system calls. When DISTANCE is > 0, there is no clustering and performance suffers. This is visible with iostat, as smaller KB/t and MB/s. With the patch, you get large I/O requests, just like when you so simple reading, which matches the behaviour on Linux for the same test.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

tmunro created this revision.May 27 2020, 10:34 AM

Herald added subscribers: Contributor Reviews (src), imp. · View Herald TranscriptMay 27 2020, 10:34 AM

tmunro requested review of this revision.May 27 2020, 10:34 AM

tmunro added a reviewer: mjg.

mjg added a reviewer: kib.May 27 2020, 11:22 AM

mjg accepted this revision.May 27 2020, 12:56 PM

This revision is now accepted and ready to land.May 27 2020, 12:56 PM

kib added inline comments.May 27 2020, 1:08 PM

sys/kern/vfs_vnops.c
487 ↗	(On Diff #72314)	Style: no initialization in local decl.
sys/sys/file.h
194 ↗	(On Diff #72314)	I do not like this, of course, It gives +8 bytes to struct file. Could you change the type of f_nextoff to int32 and only store the addend to f_offset, for instance ?

mjg added inline comments.May 27 2020, 1:15 PM

sys/sys/file.h
194 ↗	(On Diff #72314)	The problem with the struct is that it is past 64 bytes already and there is no way to shrink it. This has a side effect where updates to the reference counter can result in false sharing. Thus I think the way to go is to to let the struct grow and if the NOFREE status becomes a problem that status can be revisited with SMR.

kib added inline comments.May 27 2020, 1:31 PM

sys/sys/file.h
194 ↗	(On Diff #72314)	The problem with the struct is the memory usage. Patch adds +8 bytes. which even on my small desktop with kern.maxfiles=1043454 equals to potentially consuming 8MB of kernel memory. NOFREE also adds to the whole picture, of course.

mjg added inline comments.May 27 2020, 2:19 PM

sys/sys/file.h
194 ↗	(On Diff #72314)	Well that's true but it would also be a small fraction of total memory usage given all these object will be mostly pointing to distinct other objects of significantly bigger size. Also note there is an older review which points out that the struct can be shortened in size 8 bytes by removing the unused f_label field. This can be resurrected to get this in without further complications.

andrew_tao173.riddles.org.uk added inline comments.May 27 2020, 2:48 PM

sys/sys/file.h
194 ↗	(On Diff #72314)	f_nextoff can be arbitrarily different from f_offset, for example consider the case of a >2GB file which is being read and written exclusively with pread/pwrite,

tmunro updated this revision to Diff 72342.May 27 2020, 10:32 PM

This revision now requires review to proceed.May 27 2020, 10:32 PM

tmunro marked an inline comment as done.May 27 2020, 10:32 PM

I think that currently f_nextoff is often redundant, and this can be used to avoid further extending the size of the struct. Most of the time, I believe, f_nextoff == f_offset. You can use this to consider both f_offset/f_nextoff as the pointers to two streams of accesses. More, it could serve a situation where you have e.g. two read streams, instead of current patch insistence of one stream being read and one write.

Still I am fine with current patch assuming f_label is removed. It might make sense to just remove r/w requirement for the next step.

sys/sys/file.h
194 ↗	(On Diff #72314)	Of course it can be, then the seq heuristic would (should) fail.
194 ↗	(On Diff #72314)	I am fine with removing f_label as an intermediate measure.

mjg mentioned this in D25036: Remove f_label from struct file.May 28 2020, 11:21 AM

andrew_tao173.riddles.org.uk added inline comments.May 28 2020, 1:38 PM

sys/sys/file.h
194 ↗	(On Diff #72314)	"should" fail? proper sequential heuristics for the case when the application is using pread exclusively are essential.

struct file shrinkage just went in.

This revision is now accepted and ready to land.Jun 1 2020, 3:59 PM

In D25024#551329, @kib wrote:

I think that currently f_nextoff is often redundant, and this can be used to avoid further extending the size of the struct. Most of the time, I believe, f_nextoff == f_offset. You can use this to consider both f_offset/f_nextoff as the pointers to two streams of accesses. More, it could serve a situation where you have e.g. two read streams, instead of current patch insistence of one stream being read and one write.

Agreed, it could be interesting to support detection of N interleaved streams. A semi-related topic is "window" based stream detection for striding patterns, which works on other OSes. Wondering about these two ideas is actually what put me off proposing this simple patch for so long, despite the simple r/w case being a known problem for PostgreSQL users.