Page MenuHomeFreeBSD

VFS: Track sequential reads and writes separately
ClosedPublic

Authored by tmunro on May 27 2020, 10:34 AM.

Details

Summary

If you're reading sequential data from a file, modifying it and then writing it back sequentially in blocks, with the writes some distance behind the reads due to some kind of user space buffering scheme, then the I/O clustering facility isn't triggered. This hurts PostgreSQL, SQLite, and similar when they're rewriting a large table on a UFS filesystem.

Here's a patch to fix that by tracking sequential reads and write separately. It was written by Andrew Gierth back in 2008, and I've rebased it and hopefully improved it slightly.

Test Plan

Create a file significantly larger than memory on a UFS filesystem. Run the following test program on it:

https://github.com/macdice/some-io-tests/blob/master/sequential-read-and-write.c

This reads in a straight line, and writes in a straight line some distance behind with interleaving system calls. When DISTANCE is > 0, there is no clustering and performance suffers. This is visible with iostat, as smaller KB/t and MB/s. With the patch, you get large I/O requests, just like when you so simple reading, which matches the behaviour on Linux for the same test.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

tmunro added a reviewer: mjg.
This revision is now accepted and ready to land.May 27 2020, 12:56 PM
sys/kern/vfs_vnops.c
486

Style: no initialization in local decl.

sys/sys/file.h
194

I do not like this, of course, It gives +8 bytes to struct file.

Could you change the type of f_nextoff to int32 and only store the addend to f_offset, for instance ?

sys/sys/file.h
194

The problem with the struct is that it is past 64 bytes already and there is no way to shrink it. This has a side effect where updates to the reference counter can result in false sharing.

Thus I think the way to go is to to let the struct grow and if the NOFREE status becomes a problem that status can be revisited with SMR.

sys/sys/file.h
194

The problem with the struct is the memory usage. Patch adds +8 bytes. which even on my small desktop with kern.maxfiles=1043454 equals to potentially consuming 8MB of kernel memory.

NOFREE also adds to the whole picture, of course.

sys/sys/file.h
194

Well that's true but it would also be a small fraction of total memory usage given all these object will be mostly pointing to distinct other objects of significantly bigger size.

Also note there is an older review which points out that the struct can be shortened in size 8 bytes by removing the unused f_label field. This can be resurrected to get this in without further complications.

sys/sys/file.h
194

f_nextoff can be arbitrarily different from f_offset, for example consider the case of a >2GB file which is being read and written exclusively with pread/pwrite,

This revision now requires review to proceed.May 27 2020, 10:32 PM

I think that currently f_nextoff is often redundant, and this can be used to avoid further extending the size of the struct. Most of the time, I believe, f_nextoff == f_offset. You can use this to consider both f_offset/f_nextoff as the pointers to two streams of accesses. More, it could serve a situation where you have e.g. two read streams, instead of current patch insistence of one stream being read and one write.

Still I am fine with current patch assuming f_label is removed. It might make sense to just remove r/w requirement for the next step.

sys/sys/file.h
194

Of course it can be, then the seq heuristic would (should) fail.

194

I am fine with removing f_label as an intermediate measure.

sys/sys/file.h
194

"should" fail? proper sequential heuristics for the case when the application is using pread exclusively are essential.

struct file shrinkage just went in.

This revision is now accepted and ready to land.Jun 1 2020, 3:59 PM
In D25024#551329, @kib wrote:

I think that currently f_nextoff is often redundant, and this can be used to avoid further extending the size of the struct. Most of the time, I believe, f_nextoff == f_offset. You can use this to consider both f_offset/f_nextoff as the pointers to two streams of accesses. More, it could serve a situation where you have e.g. two read streams, instead of current patch insistence of one stream being read and one write.

Agreed, it could be interesting to support detection of N interleaved streams. A semi-related topic is "window" based stream detection for striding patterns, which works on other OSes. Wondering about these two ideas is actually what put me off proposing this simple patch for so long, despite the simple r/w case being a known problem for PostgreSQL users.

Still I am fine with current patch assuming f_label is removed. It might make sense to just remove r/w requirement for the next step.

Thanks. f_label is gone -- are you OK with this now?

This revision was automatically updated to reflect the committed changes.