Page MenuHomeFreeBSD

Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).
ClosedPublic

Authored by khng on Jan 26 2021, 11:10 AM.
Tags
None
Referenced Files
F133235559: D28347.id89086.diff
Fri, Oct 24, 5:09 AM
F133227384: D28347.id93086.diff
Fri, Oct 24, 3:39 AM
F133205624: D28347.id83461.diff
Thu, Oct 23, 11:15 PM
F133192146: D28347.id84401.diff
Thu, Oct 23, 8:18 PM
F133150292: D28347.id85982.diff
Thu, Oct 23, 10:26 AM
Unknown Object (File)
Thu, Oct 23, 3:34 AM
Unknown Object (File)
Thu, Oct 23, 2:27 AM
Unknown Object (File)
Tue, Oct 21, 8:19 PM

Details

Summary

Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).

fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI.

The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.

fspacectl(2) comprises of cmd and flags parameters to specify the
space management operation to be performed. Currently cmd can be
SPACECTL_DEALLOC, and flags has to be 0 for SPACECTL_DEALLOC
operation.

fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.

Submitted by: Ka Ho Ng <khng@freebsdfoundation.org>
Sponsored by: The FreeBSD Foundation

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 40774
Build 37663: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/compat/freebsd32/freebsd32_misc.c
3586 ↗(On Diff #89151)

With the removal of spacectl_range32 this whole function is redundant.

  • Move CTASSERT(sizeof(struct spacectl_range) == 16) from freebsd32_misc.c to sys_generic.c
  • Remove freebsd32_fspacectl()
khng marked an inline comment as done.May 13 2021, 7:56 PM
khng added inline comments.
sys/kern/vnode_if.src
804

This is for MNT_SHARED_WRITES(mp) == true, which is the case in ZFS.

Updates:

  • vn_deallocate_impl() does bwillwrite() and vn_start_write() on every iteration instead of before and after all the iterations.
  • vn_deallocate() no longer takes a 'struct thread *'.

_PC_FDEALLOC_PRESENT -> _PC_DEALLOC_PRESENT

lib/libc/sys/Symbol.map
413

Since we are now at FreeBSD 14, this should be in FBSD_1.7 namespace.

(Unrelated: I always wonder why we can't just call future versions e.g. FBSD_15.0; I usually find myself looking at Version.def to confirm)

sys/kern/sys_generic.c
110

I think it's worth commenting here that the reason we are CTASSERT'ing the size was because we wanted to make sure that for both 32-bit and 64-bit variants, the struct have the same layout (and thus the compat32 call can just use the same struct); having it in the same block of asserting size_t was confusing for me in the first glance.

922

Is this intentional? If so, could you please add a comment explaining why we are ignoring ERESTART/EINTR/EWOULDBLOCK? (I think this is different from short read/write and these shouldn't be ignored?)

Address delphij@'s comments.

sys/kern/sys_generic.c
922

It was intentional, but yes they shouldn't be ignored as short read/write.

Manpage updates:

  • Author part fixes
  • Remove the unnecessary partial success description from fspacectl.2. The value of 0 is for complete successful operation only.
sys/kern/sys_generic.c
922

I believe that was already discussed, and now you returned back the same errant behavior. Unix convention is that error return is only acceptable if the file content was not changed at all. If syscall did modified the file, regardless was it completed WRT requested op, or just a partial execution, userspace must know exactly what was done.

khng marked an inline comment as done.
  • Move rangelock around iteration loop to make it clearer
  • fspacectl(2) returns 0 as long as the file is modified, to prevent the system call being restarted after signals.
sys/compat/freebsd32/syscalls.master
1179

I think that this requires translation for compat32, because spacectl_range contains off_t. Look how freebsd32_lseek() handles that, in particular the big- vs. little-endian versions of PAIR32TO64.

sys/kern/sys_generic.c
880

I wonder if it make sense to change the fspacectl(2) interface while not too late and split in and out spacectl_range. Basically keep original spacectl_range as is, and copy out optional processed length:

fspacectl(int fd, int cmd, const struct spacectl_range *sr, int flags, off_t *length);
906

Why don't you use fget_write()? I am not even sure why you check for CAP_PWRITE instead of CAP_WRITE, the later seems to be more suitable.

sys/kern/uipc_shm.c
1902

This was copied from shm_dotruncate_locked(), right? It misses the vm_pager_has_page() case. Basically, the page that you need to partially zero might be not resident but swap contains the user data. So we need to read the page from swap.

Perhaps you should refactor code to instantiate the page from shm_dotruncate_locked() and reuse it. Perhaps the code should take some control which region of the page should be cleared, to be passed to pmap_zero_page_area().

sys/kern/sys_generic.c
880

Does it sound better to have copyout spacectl_range instead of copyout length, so that callers may save an fseek(DATA) call to guess where to restart?

906

But if we can in theory operate in any positions in a file, we should use CAP_PWRITE?

khng marked 4 inline comments as done.
  • Updated fspacectl(2) to contain an optional parameter to store unprocessed operation range after the syscall returns
  • Updated fspacectl.2 because of ^^^
  • Added fspacectl(2) compat32 translation
  • Remove uipc_shm (This should end up as a separate review)
  • fget_write(CAP_PWRITE) instead of fget(CAP_PWRITE) in kern_fspacectl

Restore line 1116 in sys/kern/vfs_default.c

sys/kern/sys_generic.c
880

Why do you copyout result only in case of an error? This is esp. not useful for ERESTART/EINTR etc case where you cleared the error.

sys/kern/vfs_vnops.c
2441–2442

I suggest to extract introduction of vn_bmap_seekhole_locked() into a separate commit and do it in advance.

2496

This assert should be either repeated in vn_bmap_seekhole_locked(), or moved to that function.

3467

Use bool?

3476
if ((ioflg & (IO_NODELOCKED | IO_RANGELOCKED)) == 0)
khng marked 2 inline comments as done.
  • Fix sys_fspacectl(2) copying out rmsr in opposite cases.
  • Separate vn_bmap_seekhole_locked to a separate differential
  • styles and use bool in vn_deallocate_impl

Fix mangled git commit order.

kib added inline comments.
sys/kern/capabilities.conf
232 ↗(On Diff #93206)

s/descriptors/operations/, perhaps?

This revision is now accepted and ready to land.Aug 4 2021, 2:50 AM
sys/compat/freebsd32/freebsd32_misc.c
3876 ↗(On Diff #93206)

This condition is still not fixed.

BTW, I suggest to remove the error != 0 check at all. Copy *rqsr into *rmsr at the very beginning of kern_fspacectl() if rmsr != NULL, and it should be fine.

This revision now requires review to proceed.Aug 4 2021, 2:54 AM
khng marked an inline comment as done.
  • Fix compat32's fspacectl error cases
  • kern_fspacectl asserts rqsr be non-NULL

kern_fspacectl should not KASSERT because of rqsr, instead it should return EINVAL to align with other EINVAL cases.

Minor change in kern_fspacectl(2): No need to copy rqsr into *rmsrp in the very beginning since things in *rmsrp is useless if error != 0. This reduces the lines further.

sys/kern/capabilities.conf
232 ↗(On Diff #93206)

You did not fixed this

sys/kern/sys_generic.c
928

I still insist on copying out rmsr always, not only on error == 0. For instance, if errno == ENOSPC, you still might have modified parts of the file, and it is important for usermode to know.

  • fspacectl(2) now also returns rmsr in case return value is not 0.
  • ^^^ Changes in fspacectl.2
khng marked 3 inline comments as done.Aug 4 2021, 5:26 PM
lib/libc/sys/fspacectl.2
152
lib/libc/sys/fspacectl.2
154
sys/kern/sys_generic.c
882

No, I think error is more important than cerror.

In other words, it should be

if (error == 0)
     error = cerror;
khng marked an inline comment as done.
  • Returns error instead of copyout(9) error if error != 0.
  • fspacectl.2 minor fixes.
This revision is now accepted and ready to land.Aug 4 2021, 6:29 PM