
Add a UFS-specific implementation for lseek(SEEK_DATA).
Closed, Public

Authored by markj on Mar 15 2019, 4:38 PM.
Details

Reviewers
kib
mckusick
fsu
Summary

The intent is to replace vn_bmap_seekhole() for lseek(SEEK_DATA) calls.
The algorithm used by vn_bmap_seekhole() is very slow for large sparse
files, and it seems we can't do better without extending the VFS
(adding a new VOP or so). In particular, syzkaller generates test
programs which truncate a file to a very large size and then call
lseek(fd, 0, SEEK_DATA); such a call does not complete in any sane
amount of time, and the process is unkillable while it runs.

ufs_bmap_seekdata() has the obvious implementation: starting from the
caller-provided offset, we search for a direct block or indirect block
entry which references a non-zero disk block.
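
The patch itself works on UFS inode and indirect-block structures in the kernel; purely as an illustration, the shape of the search can be sketched in user-space C. Everything below (blkmap, find_first_data_lbn, the tiny sizes) is made up for the example and is not code from the patch:

#include <stdint.h>
#include <stdio.h>

#define NDADDR	12	/* direct block pointers (illustrative; the real constant is UFS_NDADDR) */
#define NINDIR	4	/* entries per indirect block (tiny, just for the example) */

/* Hypothetical flattened block map: direct pointers plus one indirect block. */
struct blkmap {
	uint64_t direct[NDADDR];
	uint64_t indir[NINDIR];
};

/*
 * Return the first logical block number >= start that maps to a non-zero
 * disk block, or -1 if only holes remain.  This mirrors the shape of the
 * search described above: scan the direct pointers, then scan the indirect
 * block entries, stopping at the first non-zero one.
 */
static long
find_first_data_lbn(const struct blkmap *bm, long start)
{
	long lbn;

	for (lbn = start; lbn < NDADDR; lbn++)
		if (bm->direct[lbn] != 0)
			return (lbn);
	for (lbn = (start < NDADDR) ? NDADDR : start;
	    lbn < NDADDR + NINDIR; lbn++)
		if (bm->indir[lbn - NDADDR] != 0)
			return (lbn);
	return (-1);
}

int
main(void)
{
	struct blkmap bm = { .indir = { 0, 0, 0, 42 } };

	printf("%ld\n", find_first_data_lbn(&bm, 0));	/* prints 15 */
	return (0);
}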

When discussing the problem, kib suggested that we modify
vn_bmap_seekhole() to periodically check whether the RLIMIT_CPU limit
was exceeded by the current process, and abort if so. That would
be a simpler solution but would require some synchronization with
lim_cb(), and ufs_bmap_seekdata() will give better performance for
real-world cases. However, if the preference is to avoid adding more
UFS code, I can go that route.

vn_bmap_seekhole() is still used for SEEK_HOLE, as I believe it is
still sufficient for that case, and I did not want to complicate ufs_bmap_seekdata().

ufs_bmap_seekdata() does not attempt to handle snapshot files yet. If
we agree that this approach is ok, I'll try to implement that.


Event Timeline

  • Rename ufs_readmeta() to ufs_readindir().

This still looks inefficient: you do a lot of calculations for each sequential block number. Right now you getblk() indirect blocks for each possible bn. As a hypothetical optimal implementation, you can read indirect blocks and scan them until you find a non-zero block pointer. Then you can reconstruct the offset (or keep the offset calculation in parallel with sliding the cursor over the blocks).

sys/ufs/ufs/ufs_bmap.c
97

Ultimately this function would share the code with breadn_flags(). Or getblkx() could grow a flag to stop calling VOP_BMAP() and get the blkno directly as an argument. I do not request this for the current change.

359

Why do you need to call ufs_getlbns() for bn < UFS_NDADDR? You can directly walk over the direct blocks in the inode.

In D19599#419613, @kib wrote:

This still looks inefficient: you do a lot of calculations for each sequential block number. Right now you getblk() indirect blocks for each possible bn. As a hypothetical optimal implementation, you can read indirect blocks and scan them until you find a non-zero block pointer. Then you can reconstruct the offset (or keep the offset calculation in parallel with sliding the cursor over the blocks).

Hmm, that's what my implementation does. (My first implementation was indeed inefficient in the way that you described.) The innermost for-loop scans the current indirect block looking for a non-zero entry.

sys/ufs/ufs/ufs_bmap.c
359

No real need, it just seemed simpler and more general. I'll try changing it to scan the direct blocks without using ufs_getlbns().

In D19599#419613, @kib wrote:

This still looks inefficient: you do a lot of calculations for each sequential block number. Right now you getblk() indirect blocks for each possible bn. As a hypothetical optimal implementation, you can read indirect blocks and scan them until you find a non-zero block pointer. Then you can reconstruct the offset (or keep the offset calculation in parallel with sliding the cursor over the blocks).

Hmm, that's what my implementation does. (My first implementation was indeed inefficient in the way that you described.) The innermost for-loop scans the current indirect block looking for a non-zero entry.

This implementation is still not optimal if it is possible for an inode to reference an indirect block that is completely zeroed. I'm not sure if such a situation ever arises in practice, though. At least, I could not trigger it.

For example, suppose the double indirect block references multiple indirect blocks, and each indirect block is all-zeroes except the last one. This implementation will fetch the double indirect block from the buffer cache multiple times, but that's not really necessary. I did not want to complicate the implementation to handle this case unless it is non-hypothetical.

In D19599#419613, @kib wrote:

This still looks inefficient: you do a lot of calculations for each sequential block number. Right now you getblk() indirect blocks for each possible bn. As a hypothetical optimal implementation, you can read indirect blocks and scan them until you find a non-zero block pointer. Then you can reconstruct the offset (or keep the offset calculation in parallel with sliding the cursor over the blocks).

Hmm, that's what my implementation does. (My first implementation was indeed inefficient in the way that you described.) The innermost for-loop scans the current indirect block looking for a non-zero entry.

This implementation is still not optimal if it is possible for an inode to reference an indirect block that is completely zeroed. I'm not sure if such a situation ever arises in practice, though. At least, I could not trigger it.

For example, suppose the double indirect block references multiple indirect blocks, and each indirect block is all-zeroes except the last one. This implementation will fetch the double indirect block from the buffer cache multiple times, but that's not really necessary. I did not want to complicate the implementation to handle this case unless it is non-hypothetical.

I see, yes, I agree.

sys/ufs/ufs/ufs_bmap.c
405

perhaps kassert that num == 0

markj marked 2 inline comments as done.
  • Return ENXIO if the offset is negative. Upper layers do not check this for us.
  • Don't use ufs_getlbns() to map direct blocks.
  • Assert num == 0 when we break from the loop.
This revision is now accepted and ready to land. Mar 15 2019, 9:26 PM
sys/ufs/ufs/ufs_bmap.c
97

BTW, there are at least three copies of this code in UFS/FFS.

This seems like a very useful improvement.

From a quick inspection, it appears that the new ufs_readindir() function could be used as a refactoring in ufs/ffs/ffs_inode.c: ffs_indirtrunc() and ufs/ffs/ffs_softdep.c: setup_trunc_indir().
It could potentially be adapted for use in fs/ext2fs/ext2_bmap.c: ext2_bmaparray() and kern/vfs_bio.c: breadn_flags(). The refactoring should be done in a different checkin.

This was committed in r345244.

This is broken in a couple of ways:

  • bn += blockcnt isn't right if bn has a partial offset into the indirect block.
  • The innermost loop contains a severe bug: it doesn't update ap->in_lbn for the remaining indirect block descriptors, so we may load an indirect block at the wrong logical offset in the vnode. This can cause data corruption if the vnode is subsequently modified.
This revision is now accepted and ready to land. Mar 18 2019, 5:12 AM

Fix the issues with the previous version.

In each iteration of the outer loop, we recompute the block number
nextbn, which is the next LBN that we will look up with ufs_getlbns().
If a hole is encountered at any point in the lookup, we advance nextbn
to the LBN of the first non-zero block pointer in the current indirect
block, or to the next indirect block in the case where all indirect block
entries are zero.
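
To make that advancing step concrete, here is a minimal user-space sketch, assuming the current indirect block ind covers LBNs [base, base + nindir); the function and variable names are illustrative and not taken from the patch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Given an indirect block "ind" covering logical blocks [base, base + nindir)
 * and a cursor "bn" inside that range, compute the next LBN to look up:
 * either the first non-zero entry at or after bn, or the first LBN of the
 * following indirect block when every remaining entry is zero.  "found"
 * reports whether the returned LBN is known to reference data.
 */
static int64_t
advance_nextbn(const uint64_t *ind, int nindir, int64_t base, int64_t bn,
    bool *found)
{
	int64_t i;

	for (i = bn - base; i < nindir; i++) {
		if (ind[i] != 0) {
			*found = true;
			return (base + i);
		}
	}
	*found = false;
	return (base + nindir);		/* skip the rest of this indirect block */
}

int
main(void)
{
	uint64_t ind[4] = { 0, 0, 7, 0 };
	bool found;
	int64_t next = advance_nextbn(ind, 4, 100, 101, &found);

	printf("nextbn=%jd found=%d\n", (intmax_t)next, (int)found);	/* nextbn=102 found=1 */
	return (0);
}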

This revision now requires review to proceed. Mar 28 2019, 2:58 PM
This revision is now accepted and ready to land. Mar 29 2019, 10:44 AM

Hi, Mark.

Thanks a lot for adding me to the review.
But could you please provide some information on how to test FIOSEEKDATA/FIOSEEKHOLE from the user-space side?
I mean, did you write some user-space code to test it, or do some unit tests exist somewhere?

Also, it would be interesting to know how much faster it became.

In D19599#423330, @fsu wrote:

Hi, Mark.

Thanks a lot for adding me to the review.
But could you please provide some information on how to test FIOSEEKDATA/FIOSEEKHOLE from the user-space side?
I mean, did you write some user-space code to test it, or do some unit tests exist somewhere?

I did not modify SEEK_HOLE. I tested SEEK_DATA with a program that truncates an empty file to a random length, and then calls SEEK_DATA on the file with a random starting offset, and verifies that the value returned is the starting offset of the last LBN in the file, since UFS seems to always put a data block at the end of a sparse file, even if the inode size % block size is 0. I also ran a private instance of syzkaller on the patch and verified that it exercises most of the branches in the new function (currently it cannot trigger errors from readindir(), for example).

I started writing some tests, but discarded them. It is hard to write useful tests since the implementation of file holes is highly filesystem-dependent, and even for UFS there is no blessed procedure for creating holes in a file AFAIK. So any tests for this will be coupled to filesystem internals.
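
For reference, an ad-hoc check along the lines described above might look roughly like the following; the file name, the 32 KB block size, and the exact expected-offset computation are assumptions, and this is not the actual test program:

#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	const off_t bsize = 32768;	/* assumed UFS block size */
	off_t size, start, expected, res;
	int fd;

	fd = open("sparse.tmp", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		err(1, "open");

	size = 1 + (off_t)arc4random_uniform(1 << 30);	/* random length */
	if (ftruncate(fd, size) != 0)
		err(1, "ftruncate");

	/*
	 * If UFS allocates a data block for the last LBN of the truncated
	 * file, SEEK_DATA should land on the start of that block, or on the
	 * starting offset itself when it already falls inside that block.
	 */
	start = (off_t)arc4random_uniform((uint32_t)size);
	expected = (size - 1) / bsize * bsize;
	if (start > expected)
		expected = start;

	res = lseek(fd, start, SEEK_DATA);
	if (res != expected)
		errx(1, "size %jd start %jd: got %jd, expected %jd",
		    (intmax_t)size, (intmax_t)start, (intmax_t)res,
		    (intmax_t)expected);
	printf("ok: size %jd start %jd data %jd\n",
	    (intmax_t)size, (intmax_t)start, (intmax_t)res);
	return (0);
}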

Also, it would be interesting to know how much faster it became.

I did not attempt to compare them. The old algorithm is O(size of file) and the new one is effectively constant-time as long as the inode does not reference indirect blocks that are full of zeroes. With a sufficiently large sparse file the old algorithm will not complete in any reasonable amount of time.

In D19599#423330, @fsu wrote:

Hi, Mark.

Thanks a lot for adding me to the review.
But could you please provide some information on how to test FIOSEEKDATA/FIOSEEKHOLE from the user-space side?
I mean, did you write some user-space code to test it, or do some unit tests exist somewhere?

I did not modify SEEK_HOLE. I tested SEEK_DATA with a program that truncates an empty file to a random length, and then calls SEEK_DATA on the file with a random starting offset, and verifies that the value returned is the starting offset of the last LBN in the file, since UFS seems to always put a data block at the end of a sparse file, even if the inode size % block size is 0. I also ran a private instance of syzkaller on the patch and verified that it exercises most of the branches in the new function (currently it cannot trigger errors from readindir(), for example).
I started writing some tests, but discarded them. It is hard to write useful tests since the implementation of file holes is highly filesystem-dependent, and even for UFS there is no blessed procedure for creating holes in a file AFAIK. So any tests for this will be coupled to filesystem internals.

Also, it would be interesting to know how much faster it became.

I did not attempt to compare them. The old algorithm is O(size of file) and the new one is effectively constant-time as long as the inode does not reference indirect blocks that are full of zeroes. With a sufficiently large sparse file the old algorithm will not complete in any reasonable amount of time.

I expected that some user-space tests exist that could be applied to ext2fs to ensure that the implementation is compatible with UFS.
But ok, I will try to find something and check it on ufs too.

Thanks.

In D19599#423338, @fsu wrote:
In D19599#423330, @fsu wrote:

Hi, Mark.

Thanks a lot for adding me to the review.
But could you please provide some information on how to test FIOSEEKDATA/FIOSEEKHOLE from the user-space side?
I mean, did you write some user-space code to test it, or do some unit tests exist somewhere?

I did not modify SEEK_HOLE. I tested SEEK_DATA with a program that truncates an empty file to a random length, and then calls SEEK_DATA on the file with a random starting offset, and verifies that the value returned is the starting offset of the last LBN in the file, since UFS seems to always put a data block at the end of a sparse file, even if the inode size % block size is 0. I also ran a private instance of syzkaller on the patch and verified that it exercises most of the branches in the new function (currently it cannot trigger errors from readindir(), for example).
I started writing some tests, but discarded them. It is hard to write useful tests since the implementation of file holes is highly filesystem-dependent, and even for UFS there is no blessed procedure for creating holes in a file AFAIK.

What you described above *is* the correct way to create holes on UFS. Seek past EOF and write a single byte; then all of the content between the old EOF and the new block containing that byte is a hole. The only detail is that if the byte lands in a direct block, you might get a fragment at the end instead of a full block (which does not matter for the tests).
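
In code, that recipe is just a seek past EOF plus a one-byte write (the path and offset below are placeholders):

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	int fd;

	fd = open("holey.tmp", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		err(1, "open");

	/* Seek well past EOF and write one byte; everything before it becomes a hole. */
	if (lseek(fd, 10 * 1024 * 1024, SEEK_SET) == -1)
		err(1, "lseek");
	if (write(fd, "x", 1) != 1)
		err(1, "write");
	close(fd);
	return (0);
}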

So any tests for this will be coupled to filesystem internals.

Also, it would be interesting to know how much faster it became.

I did not attempt to compare them. The old algorithm is O(size of file) and the new one is effectively constant-time as long as the inode does not reference indirect blocks that are full of zeroes. With a sufficiently large sparse file the old algorithm will not complete in any reasonable amount of time.

I expected that some user-space tests exist that could be applied to ext2fs to ensure that the implementation is compatible with UFS.

https://gist.github.com/5319d0d7d52e0f9199ddf2b5a75a6438
https://gist.github.com/a4f9c80a7699151e7f2e5ba051740aa7
This is what I used for vn_bmap_seekhole() implementation.

But ok, I will try to find something and check it on ufs too.

Thanks.

  • Return ENXIO immediately if the starting offset is >= inode size.
  • Ensure that the returned offset is >= the starting offset.
This revision now requires review to proceed. Mar 29 2019, 6:56 PM
In D19599#423359, @kib wrote:
In D19599#423338, @fsu wrote:

I did not modify SEEK_HOLE. I tested SEEK_DATA with a program that truncates an empty file to a random length, and then calls SEEK_DATA on the file with a random starting offset, and verifies that the value returned is the starting offset of the last LBN in the file, since UFS seems to always put a data block at the end of a sparse file, even if the inode size % block size is 0. I also ran a private instance of syzkaller on the patch and verified that it exercises most of the branches in the new function (currently it cannot trigger errors from readindir(), for example).
I started writing some tests, but discarded them. It is hard to write useful tests since the implementation of file holes is highly filesystem-dependent, and even for UFS there is no blessed procedure for creating holes in a file AFAIK.

What you described above *is* the correct way to create holes on UFS. Seek past EOF and write a single byte; then all of the content between the old EOF and the new block containing that byte is a hole. The only detail is that if the byte lands in a direct block, you might get a fragment at the end instead of a full block (which does not matter for the tests).

Ok. I was thinking about a general mechanism that lets one create holes at arbitrary offsets within a specific file.

So any tests for this will be coupled to filesystem internals.

Also, it would be interesting to know how much faster it became.

I did not attempt to compare them. The old algorithm is O(size of file) and the new one is effectively constant-time as long as the inode does not reference indirect blocks that are full of zeroes. With a sufficiently large sparse file the old algorithm will not complete in any reasonable amount of time.

I expected that some user-space tests exist that could be applied to ext2fs to ensure that the implementation is compatible with UFS.

https://gist.github.com/5319d0d7d52e0f9199ddf2b5a75a6438
https://gist.github.com/a4f9c80a7699151e7f2e5ba051740aa7
This is what I used for vn_bmap_seekhole() implementation.

Thanks, it found a couple of bugs. There is one remaining difference in the seek_hole2 output, in the last test case. The test creates a file of size 1048578 = 2 mod bsize on my system. The test expects lseek(fd, 200000, SEEK_DATA) to return -1. But, there is a data block starting at offset 1048576, so my implementation returns 1048576. I confirmed the existence of the data block with fsdb. vn_bmap_seekhole() returns -1/ENXIO. This is because the loop increments *noff by the block size, and 200000 mod bsize > 2. The comment at the end of seek_hole2.c suggests that your implementation is correct, but I don't understand why.
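
(For concreteness, assuming the default 32 KB UFS block size: 32 * 32768 = 1048576, so the last LBN of the 1048578-byte file starts at offset 1048576 and 1048578 mod 32768 = 2, while 200000 mod 32768 = 3392 > 2. Every offset visited by the vn_bmap_seekhole() loop is therefore congruent to 3392 mod 32768, so the loop steps past EOF without ever landing on offset 1048576.)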

In D19599#423359, @kib wrote:
In D19599#423338, @fsu wrote:
In D19599#423330, @fsu wrote:

Hi, Mark.

Thanks a lot for adding me to the review.
But could you please provide some information on how to test FIOSEEKDATA/FIOSEEKHOLE from the user-space side?
I mean, did you write some user-space code to test it, or do some unit tests exist somewhere?

I did not modify SEEK_HOLE. I tested SEEK_DATA with a program that truncates an empty file to a random length, and then calls SEEK_DATA on the file with a random starting offset, and verifies that the value returned is the starting offset of the last LBN in the file, since UFS seems to always put a data block at the end of a sparse file, even if the inode size % block size is 0. I also ran a private instance of syzkaller on the patch and verified that it exercises most of the branches in the new function (currently it cannot trigger errors from readindir(), for example).
I started writing some tests, but discarded them. It is hard to write useful tests since the implementation of file holes is highly filesystem-dependent, and even for UFS there is no blessed procedure for creating holes in a file AFAIK.

What you described above *is* the correct way to create holes on UFS. Seek past EOF and write a single byte; then all of the content between the old EOF and the new block containing that byte is a hole. The only detail is that if the byte lands in a direct block, you might get a fragment at the end instead of a full block (which does not matter for the tests).

So any tests for this will be coupled to filesystem internals.

Also, it would be interesting to know how much faster it became.

I did not attempt to compare them. The old algorithm is O(size of file) and the new one is effectively constant-time as long as the inode does not reference indirect blocks that are full of zeroes. With a sufficiently large sparse file the old algorithm will not complete in any reasonable amount of time.

I expected that some user-space tests exist that could be applied to ext2fs to ensure that the implementation is compatible with UFS.

https://gist.github.com/5319d0d7d52e0f9199ddf2b5a75a6438
https://gist.github.com/a4f9c80a7699151e7f2e5ba051740aa7
This is what I used for vn_bmap_seekhole() implementation.

Ok, thank you. This is what I was trying to find, to avoid implementing it from scratch.

But ok, I will try to find something and check it on ufs too.

Thanks.

Thanks, it found a couple of bugs. There is one remaining difference in the seek_hole2 output, in the last test case. The test creates a file of size 1048578 = 2 mod bsize on my system. The test expects lseek(fd, 200000, SEEK_DATA) to return -1. But, there is a data block starting at offset 1048576, so my implementation returns 1048576. I confirmed the existence of the data block with fsdb. vn_bmap_seekhole() returns -1/ENXIO. This is because the loop increments *noff by the block size, and 200000 mod bsize > 2. The comment at the end of seek_hole2.c suggests that your implementation is correct, but I don't understand why.

According to the Solaris lseek(2) man page:

If whence is SEEK_DATA, the file pointer is set to the start of the next non-hole file region greater than or equal to the supplied offset.

and later

ENXIO

    For SEEK_DATA, there are no more data regions past the supplied offset. For SEEK_HOLE, there are no more holes past the supplied offset.

See https://docs.oracle.com/cd/E26502_01/html/E29032/lseek-2.html

IMO this does not leave room for any different interpretation of the results. If you reach the last block and the request is SEEK_DATA, you must return ENXIO, or even better, check for EOF in advance.
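
As a concrete illustration of those rules, consider a file extended only by ftruncate(). This is a minimal sketch, the path is a placeholder, and whether SEEK_DATA finds anything depends on whether the filesystem allocated a data block for the file's tail:

#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	off_t res;
	int fd;

	fd = open("allhole.tmp", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		err(1, "open");
	if (ftruncate(fd, 1048576) != 0)	/* extend without writing any data */
		err(1, "ftruncate");

	/*
	 * Per the semantics quoted above, SEEK_DATA either returns the offset
	 * of the next data region at or after 0, or fails with ENXIO when no
	 * data region exists past the supplied offset.
	 */
	errno = 0;
	res = lseek(fd, 0, SEEK_DATA);
	printf("SEEK_DATA: %jd (errno %d)\n", (intmax_t)res, errno);

	/* SEEK_HOLE reports the start of the next hole region, if any. */
	res = lseek(fd, 0, SEEK_HOLE);
	printf("SEEK_HOLE: %jd\n", (intmax_t)res);
	return (0);
}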

In D19599#424473, @kib wrote:

Thanks, it found a couple of bugs. There is one remaining difference in the seek_hole2 output, in the last test case. The test creates a file of size 1048578 = 2 mod bsize on my system. The test expects lseek(fd, 200000, SEEK_DATA) to return -1. But, there is a data block starting at offset 1048576, so my implementation returns 1048576. I confirmed the existence of the data block with fsdb. vn_bmap_seekhole() returns -1/ENXIO. This is because the loop increments *noff by the block size, and 200000 mod bsize > 2. The comment at the end of seek_hole2.c suggests that your implementation is correct, but I don't understand why.

According to the Solaris lseek(2) man page:

If whence is SEEK_DATA, the file pointer is set to the start of the next non-hole file region greater than or equal to the supplied offset.

and later

ENXIO

    For SEEK_DATA, there are no more data regions past the supplied offset. For SEEK_HOLE, there are no more holes past the supplied offset.

See https://docs.oracle.com/cd/E26502_01/html/E29032/lseek-2.html

IMO this does not leave room for any different interpretation of the results. If you reach the last block and the request is SEEK_DATA, you must return ENXIO, or even better, check for EOF in advance.

So, I misread the expected output comment at the end of seek_hole2.c. There are in fact two tests, one with a file that has a byte written at either end, and one with a completely sparse file. The output for the first test file agrees with my implementation: the final lseek(200000, SEEK_DATA) should return 1048576, and this makes sense since there is a data region in the range [1048576, 1048577]. vn_bmap_seekhole() returns -1/ENXIO in that case and I believe that is a bug.

I re-checked the results and now I believe that this diff is correct. If I get a chance I'll try to convert the seek_hole*.c tests to ATF tests.

In D19599#424473, @kib wrote:

Thanks, it found a couple of bugs. There is one remaining difference in the seek_hole2 output, in the last test case. The test creates a file of size 1048578 = 2 mod bsize on my system. The test expects lseek(fd, 200000, SEEK_DATA) to return -1. But, there is a data block starting at offset 1048576, so my implementation returns 1048576. I confirmed the existence of the data block with fsdb. vn_bmap_seekhole() returns -1/ENXIO. This is because the loop increments *noff by the block size, and 200000 mod bsize > 2. The comment at the end of seek_hole2.c suggests that your implementation is correct, but I don't understand why.

According to the Solaris lseek(2) man page:

If whence is SEEK_DATA, the file pointer is set to the start of the next non-hole file region greater than or equal to the supplied offset.

and later

ENXIO

    For SEEK_DATA, there are no more data regions past the supplied offset. For SEEK_HOLE, there are no more holes past the supplied offset.

See https://docs.oracle.com/cd/E26502_01/html/E29032/lseek-2.html

IMO this does not leave room for any different interpretation of the results. If you reach the last block and the request is SEEK_DATA, you must return ENXIO, or even better, check for EOF in advance.

So, I misread the expected output comment at the end of seek_hole2.c. There are in fact two tests, one with a file that has a byte written at either end, and one with a completely sparse file. The output for the first test file agrees with my implementation: the final lseek(200000, SEEK_DATA) should return 1048576, and this makes sense since there is a data region in the range [1048576, 1048577]. vn_bmap_seekhole() returns -1/ENXIO in that case and I believe that is a bug.

I re-checked the results and now I believe that this diff is correct. If I get a chance I'll try to convert the seek_hole*.c tests to ATF tests.

See D19811.

Should this now be closed since it was resolved in D19811?

Should this now be closed since it was resolved in D19811?

No, they are unrelated. My revision fixed a functional bug, while Mark's change fixes speed.

In D19599#426470, @kib wrote:

Should this now be closed since it was resolved in D19811?

No, they are unrelated. My revision fixed a functional bug, while Mark's change fixes speed.

Note also that this revision changes only SEEK_DATA, and vn_bmap_seekhole() continues to implement SEEK_HOLE.

Ping? I started adding the regression tests for SEEK_HOLE/DATA to the automated suite, but that will be a separate diff.

This revision is now accepted and ready to land. Apr 20 2019, 11:54 AM

I committed this in r346932.