Page MenuHomeFreeBSD

Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).
ClosedPublic

Authored by khng on Jan 26 2021, 11:10 AM.

Details

Summary

Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).

fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI.

The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.

fspacectl(2) comprises of cmd and flags parameters to specify the
space management operation to be performed. Currently cmd can be
SPACECTL_DEALLOC, and flags has to be 0 for SPACECTL_DEALLOC
operation.

fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.

Submitted by: Ka Ho Ng <khng@freebsdfoundation.org>
Sponsored by: The FreeBSD Foundation

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 36735
Build 33624: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/compat/freebsd32/freebsd32_misc.c
3572

With the removal of spacectl_range32 this whole function is redundant.

  • Move CTASSERT(sizeof(struct spacectl_range) == 16) from freebsd32_misc.c to sys_generic.c
  • Remove freebsd32_fspacectl()
khng marked an inline comment as done.May 13 2021, 7:56 PM
khng added inline comments.
sys/kern/vnode_if.src
794

This is for MNT_SHARED_WRITES(mp) == true, which is the case in ZFS.

Updates:

  • vn_deallocate_impl() does bwillwrite() and vn_start_write() on every iteration instead of before and after all the iterations.
  • vn_deallocate() no longer takes a 'struct thread *'.

_PC_FDEALLOC_PRESENT -> _PC_DEALLOC_PRESENT

lib/libc/sys/Symbol.map
412

Since we are now at FreeBSD 14, this should be in FBSD_1.7 namespace.

(Unrelated: I always wonder why we can't just call future versions e.g. FBSD_15.0; I usually find myself looking at Version.def to confirm)

sys/kern/sys_generic.c
109

I think it's worth commenting here that the reason we are CTASSERT'ing the size was because we wanted to make sure that for both 32-bit and 64-bit variants, the struct have the same layout (and thus the compat32 call can just use the same struct); having it in the same block of asserting size_t was confusing for me in the first glance.

917

Is this intentional? If so, could you please add a comment explaining why we are ignoring ERESTART/EINTR/EWOULDBLOCK? (I think this is different from short read/write and these shouldn't be ignored?)

Address delphij@'s comments.

sys/kern/sys_generic.c
917

It was intentional, but yes they shouldn't be ignored as short read/write.

Manpage updates:

  • Author part fixes
  • Remove the unnecessary partial success description from fspacectl.2. The value of 0 is for complete successful operation only.
sys/kern/sys_generic.c
917

I believe that was already discussed, and now you returned back the same errant behavior. Unix convention is that error return is only acceptable if the file content was not changed at all. If syscall did modified the file, regardless was it completed WRT requested op, or just a partial execution, userspace must know exactly what was done.

khng marked an inline comment as done.
  • Move rangelock around iteration loop to make it clearer
  • fspacectl(2) returns 0 as long as the file is modified, to prevent the system call being restarted after signals.
sys/compat/freebsd32/syscalls.master
1176

I think that this requires translation for compat32, because spacectl_range contains off_t. Look how freebsd32_lseek() handles that, in particular the big- vs. little-endian versions of PAIR32TO64.

sys/kern/sys_generic.c
875

I wonder if it make sense to change the fspacectl(2) interface while not too late and split in and out spacectl_range. Basically keep original spacectl_range as is, and copy out optional processed length:

fspacectl(int fd, int cmd, const struct spacectl_range *sr, int flags, off_t *length);
901

Why don't you use fget_write()? I am not even sure why you check for CAP_PWRITE instead of CAP_WRITE, the later seems to be more suitable.

sys/kern/uipc_shm.c
1902 ↗(On Diff #93055)

This was copied from shm_dotruncate_locked(), right? It misses the vm_pager_has_page() case. Basically, the page that you need to partially zero might be not resident but swap contains the user data. So we need to read the page from swap.

Perhaps you should refactor code to instantiate the page from shm_dotruncate_locked() and reuse it. Perhaps the code should take some control which region of the page should be cleared, to be passed to pmap_zero_page_area().

sys/kern/sys_generic.c
875

Does it sound better to have copyout spacectl_range instead of copyout length, so that callers may save an fseek(DATA) call to guess where to restart?

901

But if we can in theory operate in any positions in a file, we should use CAP_PWRITE?

khng marked 4 inline comments as done.
  • Updated fspacectl(2) to contain an optional parameter to store unprocessed operation range after the syscall returns
  • Updated fspacectl.2 because of ^^^
  • Added fspacectl(2) compat32 translation
  • Remove uipc_shm (This should end up as a separate review)
  • fget_write(CAP_PWRITE) instead of fget(CAP_PWRITE) in kern_fspacectl

Restore line 1116 in sys/kern/vfs_default.c

sys/kern/sys_generic.c
875

Why do you copyout result only in case of an error? This is esp. not useful for ERESTART/EINTR etc case where you cleared the error.

sys/kern/vfs_vnops.c
2350–2351

I suggest to extract introduction of vn_bmap_seekhole_locked() into a separate commit and do it in advance.

2403

This assert should be either repeated in vn_bmap_seekhole_locked(), or moved to that function.

3364

Use bool?

3373
if ((ioflg & (IO_NODELOCKED | IO_RANGELOCKED)) == 0)
khng marked 2 inline comments as done.
  • Fix sys_fspacectl(2) copying out rmsr in opposite cases.
  • Separate vn_bmap_seekhole_locked to a separate differential
  • styles and use bool in vn_deallocate_impl

Fix mangled git commit order.

kib added inline comments.
sys/kern/capabilities.conf
232 ↗(On Diff #93206)

s/descriptors/operations/, perhaps?

This revision is now accepted and ready to land.Aug 4 2021, 2:50 AM
sys/compat/freebsd32/freebsd32_misc.c
3875

This condition is still not fixed.

BTW, I suggest to remove the error != 0 check at all. Copy *rqsr into *rmsr at the very beginning of kern_fspacectl() if rmsr != NULL, and it should be fine.

This revision now requires review to proceed.Aug 4 2021, 2:54 AM
khng marked an inline comment as done.
  • Fix compat32's fspacectl error cases
  • kern_fspacectl asserts rqsr be non-NULL

kern_fspacectl should not KASSERT because of rqsr, instead it should return EINVAL to align with other EINVAL cases.

Minor change in kern_fspacectl(2): No need to copy rqsr into *rmsrp in the very beginning since things in *rmsrp is useless if error != 0. This reduces the lines further.

sys/kern/capabilities.conf
232 ↗(On Diff #93206)

You did not fixed this

sys/kern/sys_generic.c
923

I still insist on copying out rmsr always, not only on error == 0. For instance, if errno == ENOSPC, you still might have modified parts of the file, and it is important for usermode to know.

  • fspacectl(2) now also returns rmsr in case return value is not 0.
  • ^^^ Changes in fspacectl.2
khng marked 3 inline comments as done.Aug 4 2021, 5:26 PM
lib/libc/sys/fspacectl.2
151 ↗(On Diff #93230)
lib/libc/sys/fspacectl.2
153 ↗(On Diff #93230)
sys/kern/sys_generic.c
877

No, I think error is more important than cerror.

In other words, it should be

if (error == 0)
     error = cerror;
khng marked an inline comment as done.
  • Returns error instead of copyout(9) error if error != 0.
  • fspacectl.2 minor fixes.
This revision is now accepted and ready to land.Aug 4 2021, 6:29 PM