Page MenuHomeFreeBSD

Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).
Needs ReviewPublic

Authored by khng on Jan 26 2021, 11:10 AM.

Details

Summary

Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).

fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI.

The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.

fspacectl(2) comprises of cmd and flags parameters to specify the
space management operation to be performed. Currently cmd can be
SPACECTL_DEALLOC, and flags has to be 0 for SPACECTL_DEALLOC
operation.

fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.

Submitted by: Ka Ho Ng <khng@freebsdfoundation.org>
Sponsored by: The FreeBSD Foundation

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 39171
Build 36060: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/kern/uipc_shm.c
1891

TODO: kill this, and kill int error;

1919

TODO: Just directly return (0).

1963–1965

TODO: Move them to D28833.

khng marked 4 inline comments as not done.Mar 22 2021, 6:16 PM
khng added inline comments.
tests/sys/file/fspacectl_test.c
294

Done in D28833.

Sorry I'm late to the party, but I have a few questions:

  1. Why a new syscall? There are already at least three APIs for hole-punching. Instead of adding a fourth, can we just reuse one of these existing APIs?
    • Solaris uses fcntl() with F_FREESP. AFAIK it was the first major operating system to support hole punching.
    • Linux uses fallocate() with FALLOC_FL_PUNCH_HOLE.
    • GEOM uses ioctl() with DIOCGDELETE.
  1. Where did zfs_deallocate go? I see it in the comments but not in the source.

Sorry I'm late to the party, but I have a few questions:

  1. Why a new syscall? There are already at least three APIs for hole-punching. Instead of adding a fourth, can we just reuse one of these existing APIs?
    • Solaris uses fcntl() with F_FREESP. AFAIK it was the first major operating system to support hole punching.
    • Linux uses fallocate() with FALLOC_FL_PUNCH_HOLE.
    • GEOM uses ioctl() with DIOCGDELETE.

The Solaris's use of F_FREESP is actually unstandarized. As far as I remember, only
F_FREESP on ZFS would do hole-punching, while F_FREESP on UFS only allows
truncation. Even with that, the ZFS behavior of supporting F_FREESP also allows
extending a file and doing hole-punching at the same time, which I could not
understand in what situation could we use such functionalities. The extending
behavior is not seen in other operating system's equivalent such as Linux's
fallocate(2) or FSCTL_SET_ZERO_DATA on Windows either, nor is it listed to be
one of the side-effect of the NFSv4 DEALLOCATE command.

For Linux's fallocate(), we currently do not have an equivalent of linux's fallocate(2)
either. In particular I do not quite like it putting FALLOC_FL_PUNCH_HOLE as a flag
despite it is actually one of the major function.

GEOM uses ioctl() with DIOCGDELETE does TRIMMING as its intended behavior.
But Trimming does not equal hole-punching as trimming need not to enforce the
"zeroing" as the main side-effect. The fspacectl(SPACECTL_DEALLOC) on the
otherhand enforces "zeroing" as its main side-effect, while allowing the underlying
file system to do deallocation as well.

The fspacectl(2) in this revision currently does SPACECTL_DEALLOC only. In
D28833 SPACECTL_ALLOC is added as well and posix_fallocate(2) is changed over
to use this. In future we might even add an equivalent of TRIM which does not
mandate the zeroing behavior and only exposes the TRIM command of the
underlying storage if possible.

  1. Where did zfs_deallocate go? I see it in the comments but not in the source.

I separated the giant revision into different smaller revisions, which is listed under
"Stack" tab. zfs_deallocate() is in D28834.
for them.

In D28347#657998, @khng wrote:

For Linux's fallocate(), we currently do not have an equivalent of linux's fallocate(2)
either. In particular I do not quite like it putting FALLOC_FL_PUNCH_HOLE as a flag
despite it is actually one of the major function.

A particular part regarding fallocate(2) with FALLOC_FL_PUNCH_HOLE is that, the behavior of the call on Linux would fail in case the underlying file system does not support hole-punching. However such behavior requires calling the fallocate(2) to check explicitly the return value as well, which makes the code calling the system call to be wrapped with extra fallback behaviors. As a remedy for this, _PC_FDEALLOC_PRESENT is added to pathconf(2) to allow checking if the underlying file system does support hole-punching without internal fallback, for userspace code that really cares whether the hole-punching requires long range of zero-filling. In case the userspace code does not care about the whether hole punching is supported on the file system, fspacectl(SPACECTL_DEALLOC) could be called without much extra care.

  • Move them to D28833.
  • shm_deallocate: kill int error;. Just directly return (0).

Some comments on the man pages.

lib/libc/sys/fspacectl.2
77

s/needs be/needs to be/

79

s/If signal/If a signal/

81

s/or signal/or the signal/

share/man/man9/vn_deallocate.9
41

Not sure about this one: s/deallocate/deallocates/

53

Needs a line break here after the sentence stop.

[I did not started reading the actual code yet, only declarations/man page]

lib/libc/sys/fspacectl.2
81

I do not understand this sentence. Do you mean that syscall is allowed to throw out signals generated while it is running?

share/man/man9/VOP_DEALLOCATE.9
37

It is better to use .Fo/.Fc for long prototypes

52

Where is the list of the accepted flags and its meaning?

56

What is this?

66

What does this sentence mean? I cannot understand it in context of the previous text.

68

What it means 'file should be locked' ?

share/man/man9/vn_deallocate.9
72

What is this?

82

What it means 'granted'?

share/man/man9/vnode_pager_purge_range.9
1 ↗(On Diff #87465)

I think there are some problems with the diff generation.

sys/kern/sys_generic.c
893

Extra () on this and previous lines.

894

(flags & ~XXX) != 0

sys/sys/fcntl.h
391

Move this right after flock(), avoiding another ifdef.

sys/sys/file.h
37

What?

lib/libc/sys/fspacectl.2
81

I meant the syscall would either return -1 with errno set to EINTR if the operation internally doesn't begin yet, or the syscall finishes the whole operation. There are no partial results to be expected if a signal was received when the syscall was ongoing.

share/man/man9/VOP_DEALLOCATE.9
52

For this commit it has to be 0.

66

It was a leftover line when I was also trying to implement preallocation. Will clean this up.

68

vnode lock has to be acquired (either shared or exclusively) before calling this function.

share/man/man9/vn_deallocate.9
72

Leftover from implementing posix_fallocate. Will clean this up.

82

'acquired'

share/man/man9/VOP_DEALLOCATE.9
68

The line was actually the same as the one in LOCKS section in VOP_READ/VOP_WRITE/VOP_RDWR.

  • Manpage clarifications and fixes
  • Code cosmetic fixes
  • Removed a stray #include "sys/fcntl.h"
khng marked 2 inline comments as done.
  • A leftover manpage fix
khng marked 8 inline comments as done.Apr 15 2021, 6:51 PM
gbe added a subscriber: gbe.

LGTM for the man page part.

Global question: assume that fspacectl(2) did partial operation. How userspace can determine it? Besides signals, you may get an error inside the loop.

share/man/man9/VOP_DEALLOCATE.9
69

I think this should be

The
.Fa *offset
and
.Fa *len
variables are updated ...
91

I do not think that caller tries to write. At best, he tries to zero.

share/man/man9/vnode_pager_purge_range.9
6 ↗(On Diff #87513)

You still have same issues with generating the patch.

sys/compat/freebsd32/syscalls.master
1178

BTW, the fact that spacectl_range has the same layout for native and compat32 is somewhat incidental. Some static asserts would be due, I believe.

sys/kern/sys_generic.c
895

Don't you need to check that offset+len do not overflow?

sys/kern/uipc_shm.c
1887

Can offset + PAGE_MASK overflow ?

1895

I believe this should be an earlier error from overflow, and not the truncation at this stage.

1898

I do not understand this condition. Why it is not (offset & PAGE_MASK) != 0?
And I think it makes sense to assign offset & PAGE_MASK to some local var.

1943

I see that this comment was copied/pasted without looking at the content.

sys/kern/vfs_default.c
1091

I think you should try to set iosize to multiple of blocksize that fits into zero_region.

1119

Yielding with the vnode locked might be not the best proposition.

sys/kern/vfs_vnops.c
2421

You should assert that the vnode is locked, then.

3468

Imagine that vn_start_write() returned EINTR due to a signal + PCATCH. Also assume that this is second iteration of the while (len > 0) loop. Then you return EINTR while some data was already deallocated.

That said, relocking ranglelock basically makes it useless. Rangelock only purpose is to establish atomicity WRT parallel reads and writes due to VOP dropping the vnode lock. In other words, you need to get the ranglelock once and for whole op.

3501

You don't unlock the rangelock on iteration, but do relock it on each loop entry. I believe you would deadlock against yourself if short dealloc occurs.

sys/kern/vnode_if.src
803

Where do you plan to use shared vnode lock? Isn't exclusive lock de-facto required?

sys/sys/fcntl.h
366

Start it from 1.

sys/sys/vnode.h
727

I do not understand this comment

tests/sys/file/fspacectl_test.c
5

this line is not needed.

Catch up with vnode_pager_purge_range(9) manual updates.

khng marked 15 inline comments as done.
  • Introduce struct space_range32 for CTASSERT()
  • Update manpages
  • Check offset + length do not overflow in kern_ variant and vn_deallocate. We do not crop the offset + length to within OFF_MAX anymore
  • shm_deallocate overflow when offset fits in the range [OFF_MAX - PAGE_SIZE + 1, OFF_MAX]
  • vp_zerofill does not yield
  • vn_deallocate_impl locking fixes.
  • fspacectl updates range on return.
  • struct spacectl_range32 seems redundant. Use CTASSERT(sizeof(struct spacectl_range) == 16) instead.
sys/compat/freebsd32/freebsd32_misc.c
3586 ↗(On Diff #89151)

With the removal of spacectl_range32 this whole function is redundant.

  • Move CTASSERT(sizeof(struct spacectl_range) == 16) from freebsd32_misc.c to sys_generic.c
  • Remove freebsd32_fspacectl()
khng marked an inline comment as done.May 13 2021, 7:56 PM
khng added inline comments.
sys/kern/vnode_if.src
803

This is for MNT_SHARED_WRITES(mp) == true, which is the case in ZFS.