Implement a new fallocate(2) syscall and extend VOP_ALLOCATE(9) to allow punching holes into existing files. Use that to properly convert BIO_DELETE operations in the md(4) module.
Currently only supported on ZFS.
Differential D5126
Extend VOP_ALLOCATE(9) and implement new fallocate(2) syscall to allow punching holes
Authored by sobomax on Jan 30 2016, 4:44 AM
Event Timeline

Comment Actions
Looks mostly fine to me. I'm hardly an expert in these things. I did not review the ZFS parts; I don't know much about it.
Comment Actions
Oops, we are not compatible with any official standard. Clean the standards section of the manpage.

Comment Actions
OK, new revision is up, fixed some manpages.
Comment Actions
o Rename: VNA_ALLOC -> VNA_OP_ALLOC; VNA_DEALLOC -> VNA_OP_DEALLOC.
o Define some helper constants VNA_OP_MASK and VNA_ALL_MFLAGS.
o Use explicit "== 0" or "!= 0" in the if () statements.

Comment Actions
So this is just a slight rehashing of the previous patch, which was discussed to death and where the API semantics were one of the biggest questions left unresolved at the end of the discussion. Now you put the same patch up for review without providing context for the issues where the objections were raised. I do not want to retype all the stuff that was pointed out during that discussion, but at least I will paste two of my replies which IMO highlight the problem and provide the background for why I consider the current patch unacceptable.

Some time ago, kib wrote:

On Sat, Dec 12, 2015 at 04:31:05PM -0800, Maxim Sobolev wrote:
> > 3. VNA_DEALLOC mode: I think it should not have any pre-disposition
> > WRT alignment/block sizes. The actual implementation is to be FS
> > specific, so I don't think it needs any constraints. At least our
> > current only working ("reference?") implementation in ZFS does not
> > care; I made a small userland test case and checked that it works with
> > any offsets and sizes not aligned to some specific value. It's the FS
> > layer after all.

It is completely unacceptable for the behaviour to be left to the FS implementation. How would the caller know what to expect from the call?

> > 5. "Emulation" in VOP_ALLOCATE(): I don't think this is a good idea at
> > all. I see it as a feature, not a bug, that it would return EOPNOTSUPP
> > if the underlying file system has no support for this operation, and not
> > try to mock it up. This gives the application the opportunity to think
> > again and make a decision between "I don't care about this data
> > anymore" and "I don't care about this data anymore AND I WANT IT TO BE
> > DESTROYED", and act accordingly in the latter case with an appropriate
> > call to write().
> > This would actually prevent fallocate() from ever
> > returning EOPNOTSUPP.

This call does not "DESTROY" the data, it only zeroes them, potentially freeing blocks, with all the consequences of freeing the blocks, from the level where the user code cares about visible consequences. Having a default fallback which zeroes the range is both reasonable and makes the call actually useful by allowing the caller to not care about implementation details. The caller always gets consistent data in the file after the call, instead of being forced to interpret error codes to do more actions. Also, a typical fs implementation of PUNCH_HOLE cannot handle non-block-boundary requests, and a common helper to zero ranges would be useful as a library, to avoid copy/paste into all fses.

> 2. Still, I respectfully disagree wrt VNA_DELETE emulation. You see, in
> most of those cases you listed, esp. tmpfs and nfs, you'd definitely
> need to let the filesystem deal with deallocation down to the very last
> byte. Otherwise you might easily run into what I call the "DELETE-reversal
> effect". Take md(4) on a vnode as an example. Let's say you start with a
> file that is just one big hole (i.e. truncate -s 100TB /tmp/bigfile)
> and try to create ZFS on it. Before there was "fake" BIO_DELETE
> support, you'd have no issue doing that without actually having
> 100TB of free space; just a few megs for metadata would be sufficient.
> However, after that feature was added it's no longer possible!
> What happens is that ZFS checks the GEOM provider capabilities
> and discovers that "aha, mdX supports BIO_DELETE!" Therefore, when you
> run zpool create it sweeps over the entire size of mdX (100TB in
> our example) and issues BIO_DELETE at every block, which effectively
> writes 100TB worth of zeroes into that file before you have even
> created a file system.
> Any simplistic emulation of VNA_DELETE,
> therefore, would have this property, unless it has functionality to
> discover whether the area in question is already a hole and only write
> zeroes if it's not, and I think that might make the implementation
> kinda hairy.

You describe a bug in the zfs tools. Compare your scenario with UFS's newfs -E flag, which is optional, as well as the ability to turn trim on and off for the volume with -t.

My point is that there are two layers of side-effects from your API, both of which must be considered. One is the application-visible state, consisting of zeroing the range of the file. There, partial-block zeroing comes into play as well. Another one is the _possible_ hole created by freeing blocks in the range, and possible (but not necessary) hints issued to the lower layer in the form of BIO_DELETE. This is not visible to the app; it is about the administration state of the volume. At the level below the fs, BIO_DELETE might be ignored even if reported as supported, etc.; there the des@ beating is quite good.

And while I understand that your goal is only item two, it is item one's consistency and usability which really bothers me. Your API would only start being used by applications if there are no hurdles and no surprising outcomes from it. Right now, the app needs to issue the syscall, check the error, specially interpret some error code, and know how to act to make the result similar to what the syscall would have done had it not refused. This is not going to work; or rather, it might be enough for your immediate and limited goal, but this is not a general-purpose API.

Comment Actions
@kib, your concerns have been heard; I just have not had time to work on those parts yet. However, there was a lot of work put into these changes, so I don't want them to rot either. On the other hand, @mckusick's review was pretty favorable, so I felt these patches might be ready for a wider review.
And in fact some constructive feedback has already been provided by @cem, allowing me to improve some style issues and put missing pieces into the manpages. That would not have happened had I just been sitting on those changes. This service is exactly for that, IMHO: for making incremental improvements by having more than one pair of eyes look at the emerging code. In fact, I think maybe the most important missing piece for these changes is a test case. So stay tuned; there will be updates to this project, hopefully sooner rather than later.