Page MenuHomeFreeBSD

Extend VOP_ALLOCATE(9) and implement new fallocate(2) syscall to allow punching holes
Needs RevisionPublic

Authored by sobomax on Jan 30 2016, 4:44 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Apr 12, 5:03 AM
Unknown Object (File)
Dec 29 2023, 11:42 AM
Unknown Object (File)
Dec 20 2023, 4:25 AM
Unknown Object (File)
Nov 17 2023, 5:55 PM
Unknown Object (File)
Nov 14 2023, 3:45 AM
Unknown Object (File)
Nov 10 2023, 6:51 AM
Unknown Object (File)
Nov 6 2023, 12:24 PM
Unknown Object (File)
Oct 19 2023, 5:57 PM

Details

Reviewers
cem
Group Reviewers
manpages
Summary

Implement new fallocate(2) syscall and extend VOP_ALLOCATE(9) to allow punching holes into existing files. Use that to properly convert from BIO_DELETE operations in the md(4) module.

Only supported on ZFS now.

Test Plan

Test case needs to be created.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sobomax retitled this revision from to Extend VOP_ALLOCATE(9) and implement new fallocate(2) syscall to allow punching holes.
sobomax updated this object.
sobomax edited the test plan for this revision. (Show Details)
sobomax set the repository for this revision to rS FreeBSD src repository - subversion.
sobomax added subscribers: mckusick, kib, imp, pjd.

Looks mostly fine to me. I'm hardly an expert in these things. I did not review the ZFS parts; I don't know much about it.

share/man/man9/VOP_ALLOCATE.9
71–74

A partial result being successful seems dubious given the guarantee above.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
6061

if ((ap->a_mode & VNA_DEALLOC) == 0)

sys/kern/vfs_default.c
921–924

I'd suggest asserting that no unexpected flags are set here.

KASSERT((ap->a_mode & ~(VNA_ALLOC | VNA_DEALLOC | VNA_KEEP_SIZE)) == 0, ...

926–928

No need for braces.

967–992

if ((mode & KEEP_SIZE) == 0) {

sys/kern/vfs_syscalls.c
4494–4501

The fallocate(2) man page included in this review does not document that FALLOC_FL_PUNCH_HOLE without FALLOC_FL_KEEP_SIZE returns EINVAL.

sobomax edited edge metadata.

Oops, we are not compatible with any official standard. Clean the standards section of the manpage.

sobomax edited edge metadata.

Document FALLOC_FL_KEEP_SIZE and return errors EINVAL and EOPNOTSUPPORTED.

OK, new revision is up, fixed some manpages.

share/man/man9/VOP_ALLOCATE.9
71–74

It's basically documents the current state of things. It was not my mission to clean up ALLOCATE path or touch it in any way.

sys/kern/vfs_default.c
926–928

Well, it's been discussed recently and voted on to be acceptable. I find it useful to reduce diff when I need to stick debug printf into one of those return paths.

sys/kern/vfs_syscalls.c
4494–4501

Thanks, fixed the manpage.

share/man/man9/VOP_ALLOCATE.9
71–74

Ok.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
6061

Ping? style(9) issue

sys/kern/vfs_default.c
921–924

Ping?

967–992

if ((mode & KEEP_SIZE) == 0) { (style(9))

sobomax edited edge metadata.
sobomax marked an inline comment as done.
o Rename:

  VNA_ALLOC -> VNA_OP_ALLOC;
  VNA_DEALLOC -> VNA_OP_DEALLOC.

oDefine some helper constants VNA_OP_MASK and VNA_ALL_MFLAGS.

o Use explicit "== 0" or "!= 0" in the if () statements.

style(9) are cleared.

cem added a reviewer: cem.

Looks good to me (non-ZFS bits).

This revision is now accepted and ready to land.Jan 31 2016, 5:54 PM

So this is just a slight rehashing of the previous patch which was discussed to death and where the API semantic was one of the biggest question not resolved at the end of the discussion.

Now you put the same patch to the review without providing context for the issues where the objections were raised.

I do not want to retype all the stuff that was pointed out during that discussion, but at least I will paste two my replies which IMO highlight the problem and provide the background for why I consider the current patch unnaceptable.

Some time ago, kib wrote:
On Sat, Dec 12, 2015 at 04:31:05PM -0800, Maxim Sobolev wrote:
> > 3. VNA_DEALLOC mode: I think it should not have any pre-disposition
> > WRT alignment/block sizes. The actual implementation is to be FS
> > specific, so I don't think it needs any constrains. At least our
> > current only working ("reference?") implementation in ZFS does not
> > care, I made a small test userland case and checked that it works with
> > any offsets and sizes not aligned around some specific value. It's FS
> > layer after all.
It is completely unacceptable for the behaviour to be left to FS
implementation,  How would caller know what to expect from the call ?

> > 5. "Emulation" in the VOP_ALLOCATE(), I don't think this good idea at
> > all. I see it's a feature, not a bug that it would return EOPNOTSUPP
> > if underlying file system has no support for this operation and do not
> > try to mock it up. This gives opportunity for the application to think
> > again and make a decision between "I don't care about this data
> > anymore" and "I don't care about this data anymore AND I WANT IT TO BE
> > DESTROYED" and act accordingly in the latter case with appropriate
> > call to write(). This would actually prevent fallocate() to ever
> > return EOPNOTSUPP.
This call does not "DESTROY" the data, it only zeroes them, potentially
freeing blocks with all the consequences of freeing the blocks, from the
level where the user code cares about visible consequences.  Having
default fallback which zeroes the range is both reasonable and makes the
call actually useful by allowing caller to not care about implementation
details.  The caller always gets constistent data in the file after the
call, instead of being forced to interpret error codes to do more actions.

Also, typical fs implementation of PUNCH_HOLE cannot handle non-block
boundary requests, and common helper to zero ranges would be useful as
the library to avoid copy/paste into all fses.

> 2. Still I respectfully disagree wrt VNA_DELETE emulation. You see, in
> most of those cases you listed, esp tmpfs and nfs, you'd definitely
> need to let filesystem deal with deallocation down to the very last
> byte. Otherwise you might easily run into what I call "DELETE-reversal
> effect". Take md(4) on vnode as an example. Let's say you start with
> file that is just one big hole (i.e. truncate -s 100TB /tmp/bigfile)
> and try to use create ZFS on it. Before there was a "fake" BIO_DELETE
> support, you'd not have any issue doing that without actually having
> 100TB of free space, just few megs for metadata would be sufficient.
> However, after that feature has been added it's not longer possible!
> What happens there is that ZFS checks the GEOM provider capabilities
> and discovers that "aha, mdX supports BIO_DELETE!" Therefore, when you
> run zpool create it would sweep over the entire size of mdX (100TB in
> our example), and issue BIO_DELETE at every block, which effectively
> actually writes 100TB worth of zeroes into that file before you even
> created a file system. Any simplistic emulation of the VNA_DELETE,
> therefore, would have this property unless it has functionality to
> discover if the area in question is already a hole and only write
> zeroes if it's not and it and I think it might take the implementation
> kinda hairy.

You describe a bug in the zfs tools.  Compare your scenario with
UFS' newfs -E flag, which is optional, as well as the ability to turn
trim on and off for the volume with -t.

My point is that there are two layers of side-effects from your API,
both of which must be considered. One is the application-visible state
consisting of zeroing the range of file. There, the partial-block
zeroing comes into play as well.

Another one is the _possible_ hole created by freeing blocks in
the range, and possible (but not neccessary) hints issued for the
lower-layer in the form of BIO_DELETE. This is not visible to app, it
is something about the administration state of the volume. At the level
below fs, BIO_DELETE might be ignored even if reported as supported etc,
there the des@ beating is quite good.

And, while I understand that your goal is only item two, it is item
one consistency and usability which really bothers me. Your API would
be only started used by application if there is no hurdles and no
surprising outcomes from it. Right now, app needs to issue the syscall,
check the error, specially interpret some error code and know how to
act to make the result similar to what would be done by syscall if not
refused. This is not going to work, or rather, it might be enough for
your immediate and limited goal, but this is not the general-purpose
API.
cem requested changes to this revision.Jan 31 2016, 6:33 PM
cem edited edge metadata.
This revision now requires changes to proceed.Jan 31 2016, 6:33 PM
In D5126#109542, @kib wrote:

So this is just a slight rehashing of the previous patch which was discussed to death and where the API semantic was one of the biggest question not resolved at the end of the discussion.

Now you put the same patch to the review without providing context for the issues where the objections were raised.

I do not want to retype all the stuff that was pointed out during that discussion, but at least I will paste two my replies which IMO highlight the problem and provide the background for why I consider the current patch unnaceptable.

@kib, your concerns have been heard, I've just did not have time to work on those parts yet. However, there was a lot of work put into those changes, so I don't want them to rot either. On the other hand, @mckusick's review was pretty favorable so I felt like those patches may be ready for a wider review. And in fact some constructive feedback have been provided by @cem already allowing me to improve some style and put in missing pieces into manpages. That would not have happened had I just been sitting on those changes.

This service is exactly for that IMHO, for making incremental improvements by having more than one pair of eyes look into the emerging code.

In fact I think maybe more important missing piece for those changes is a test case. So stay tuned, there will be updates into this project, hopefully sooner rather than later.