Page MenuHomeFreeBSD

Add an implementation of the 9P filesystem
AcceptedPublic

Authored by dfr on Sep 13 2023, 12:33 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, May 14, 5:20 AM
Unknown Object (File)
Fri, May 10, 10:50 AM
Unknown Object (File)
Sun, May 5, 2:53 PM
Unknown Object (File)
Sun, May 5, 10:59 AM
Unknown Object (File)
Sat, May 4, 5:03 PM
Unknown Object (File)
Fri, Apr 26, 6:04 PM
Unknown Object (File)
Fri, Apr 26, 9:33 AM
Unknown Object (File)
Fri, Apr 26, 1:44 AM

Details

Reviewers
gnn
dch
Summary

This is derived from swills@ fork of the Juniper virtfs with lots of
changes by me including bug fixes, style improvements, clearer layering
and more consistent logging. The filesystem is renamed to p9fs to better
reflect its function and to prevent possible future confusion with
virtio-fs.

To use this with bhyve, add 'virtio_p9fs_load=YES' to loader.conf. The
bhyve virtio-9p device allows access from the guest to files on the host
by mapping a 'sharename' to a host path. It is possible to use p9fs as a
root filesystem by adding this to /boot/loader.conf:

vfs.root.mountfrom="p9fs:sharename"

for non-root filesystems add something like this to /etc/fstab:

sharename /mnt virtfs rw 0 0

In both examples, substitute the share name used on the bhyve command
line.

The 9P filesystem protocol relies on stateful file opens which map
protocol-level FIDs to host file descriptors. The FreeBSD vnode
interface doesn't really support this and we use heuristics to guess the
right FID to use for file operations. This can be confused by privilege
lowering and does not guarantee that the FID created for a given file
open is always used for file operations, even if the calling process is
using the file descriptor from the original open call. Improving this
would involve changes to the vnode interface which is out-of-scope for
this import.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 54481
Build 51370: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I just skimmed over the patch for now. Is there a spec for 9p protocol?

sys/dev/virtio/p9fs/virtio_p9fs.c
90

Could we please use short name for mutex, they are visible in top/ps and other places with limited space.

sys/dev/virtio/p9fs/virtio_p9fs.h
39

Why do you need packed there?

Regardless, FreeBSD prefers spell it __packed instead of direct use of attribute.

sys/fs/p9fs/p9_client.c
125
155

These assignments are pointless.

387
401
431
524

Don't we need proper alignments for all allocations?

567

fidpool is not protected by clnt_mtx (I assume clnt-spin is a typo in comment).
BTW, both unrs can be embedded into struct p9_client instead of using separate allocations.

580

This leaks fidpool

590

This leaks both pools

595

More leaks

602

Please fix all returns missing ().

sys/fs/p9fs/p9_client.h
42

This is second time inclusion of sys/param.h. Since sys/systm.h is included, the first time include is also excessive. sys/param.h includes sys/types.h. sys/systm.h includes sys/queue.h

sys/fs/p9fs/p9fs_vfsops.c
79

Can only bsize be zero? Or is there a typo and && should be || ?

106

This is LoR: vrele might need to lock node->parent, while we own the child vnode lock.

129

Again, what about alignments?

sys/fs/p9fs/p9fs_vnops.c
107

This is done by VFS.

109

This is done by VFS.

150

How could np be NULL?

170

This kills all caching. Why cannot we cache 9p vnodes?

199
322

This should be VN_IS_DOOMED(dvp)

341

The code in the ISDOTDOT branch is better handled by calling vn_vget_ino_gen().

sys/fs/p9fs/p9fs_vnops.c
321

we can drop the cases for old 13.x

sys/fs/p9fs/p9fs_vnops.c
1417

Allocating memory while vnode is locked is potentially deadlock-prone, since it prevents pagedaemon from flushing pages owned by locked vnode.

1419

There are too many blank lines, this is actually a note for the whole sources.

1440

Direct use of uiomove() while the vnode is locked is dead-lock prone, but this can be fixed later.

1449

No need to check for io_buffer != NULL

1535

If something was written, but the next iteration of the loop(s) failed, we get inconsistent inode i_size and vnode vm_object size.

Also, if anything was written, VOP_WRITE() should not return an error, instead short write must be reported.

1859

Strange indent.

2165

Isn't this place potentially causes LoR?

Its mainly because filesystem names must be valid C identifiers - this is assumed by the VFS_SET registration macro. Possibly that could be fixed by changing VFS_SET to add a prefix.

If we don't do the prefix we should add a comment explaining why it's p9 (something like Wikipedia's note when a title is changed due to "technical restrictions").

The correct name of this filesystem is 9p. "p9" is used in some cases because ...

dch requested changes to this revision.Sep 23 2023, 11:33 AM

for current, after https://reviews.freebsd.org/D41850 the flag needs to be dropped:

--------------------------------------------------------------
/usr/src/sys/dev/virtio/p9fs/virtio_p9fs.c:244:45: error: too many arguments to function call, expected 3, have 4
        return (virtio_alloc_virtqueues(dev, 0, 1, &vq_info));
                ~~~~~~~~~~~~~~~~~~~~~~~            ^~~~~~~~
/usr/src/sys/dev/virtio/virtio.h:106:6: note: 'virtio_alloc_virtqueues' declared here
int      virtio_alloc_virtqueues(device_t dev, int nvqs,
         ^

would like to see this MFC eventually for 14-STABLE though too.

This revision now requires changes to proceed.Sep 23 2023, 11:33 AM

Thanks a lot for this work!

Just a little gotcha in the commit message, the fs is not called virtfs but p9fs (it's correct in the manpage). Maybe also worth mentioning you need the late option in fstab when the mountpoint is in some ZFS dataset?

I'm currently testing it with poudriere logs mounted from the host and just got this error:

_mktemp: mkstemp failed on 15a-default/2023-09-24_07h40m07s/.tmp-.data.json.eZBBTR7X: Input/output error
mapfile_write: Missing handle

Is this related to the limitations mentioned? I've seen similar issues with other implementations of 9pfs I tried earlier. At least, this one seems to recover from it (I can just restart the poudriere build). So, not usable (yet?) for mounting poudriere logs from the host, as these errors stall the build and happen frequently, but progress as the mount survives it.

Just logging in to a home directory shared with the Linux host, there's a lock order issue:

panic: excl->share
cpuid = 9
time = 1697151893
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c27cc480
vpanic() at vpanic+0x145/frame 0xfffffe00c27cc5b0
panic() at panic+0x43/frame 0xfffffe00c27cc610
witness_checkorder() at witness_checkorder+0xbc3/frame 0xfffffe00c27cc7d0
lockmgr_lock_flags() at lockmgr_lock_flags+0x93/frame 0xfffffe00c27cc810
_vn_lock() at _vn_lock+0x53/frame 0xfffffe00c27cc870
vget_finish() at vget_finish+0x4d/frame 0xfffffe00c27cc8a0
vfs_hash_get() at vfs_hash_get+0xd1/frame 0xfffffe00c27cc8f0
p9fs_vget_common() at p9fs_vget_common+0xc9/frame 0xfffffe00c27cc980
p9fs_lookup() at p9fs_lookup+0x5ac/frame 0xfffffe00c27ccad0
vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c27ccb60
namei() at namei+0x2e1/frame 0xfffffe00c27ccbc0
kern_statat() at kern_statat+0x130/frame 0xfffffe00c27ccd00
sys_fstatat() at sys_fstatat+0x27/frame 0xfffffe00c27cce00
amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe00c27ccf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c27ccf30
--- syscall (552, FreeBSD ELF64, fstatat), rip = 0x153cb5916d7a, rsp = 0x153cb1f94a18, rbp = 0x153cb1f94b30 ---
KDB: enter: panic
[ thread pid 799 tid 100185 ]
Stopped at      kdb_enter+0x32: movq    $0,0xe2f603(%rip)

Interesting!! The excl->share thing I was getting seems to happen when I mount /home, but not when I mount /home/val only. (On the host /home/val is a mounted ZFS dataset if that matters… well /home is a dataset too but only has the mountpoint dir.)

I've imported the Juniper changes (except the P9PROTO_TREADDIR one), see P611 – but then "Fix the broken chain of virtfs_node" seems to cause:

panic: cache_enter_time: different vnodes for dot entry (0xfffff8006529ce00; 0xfffff80002cbc540)

maybe the assignment shouldn't be moved out of if (error != 0) {? @nkumarababu_gmail.com

Ah no, the cache_enter_time thing seems to be unrelated. Maybe it was bad luck that I got it very quickly after porting the patches just doing some ls commands; with pkg fetching packages it has happened both with the patches as they are and without them at all; i.e. no matter what. Not sure what can be done, my attempts to "fix" it (with my nonexistent vfs knowledge) seem to only introduce new panics.

Is there any assistance, short of time, that you could use from the community to help this along?

Is there any assistance, short of time, that you could use from the community to help this along?

I've been busy with a few other things but plan to spend some time working through the comments over the next few days. I should be able to update the diff next week.

dfr marked 29 inline comments as done.

Rebased, addressed most of kib@'s comments.

sys/dev/virtio/p9fs/virtio_p9fs.h
39

This struct represents a configuration parameter passed from the hypervisor. I don't think it needs __packed at all since the compiler will not put padding between uint16_t and uint8_t.

sys/fs/p9fs/p9_client.c
567

I will change both to use init_unrhdr to avoid the allocations. This pattern might be more widely used if init_unrhdr was described in the manpage.

Looking into clnt_mtx - nothing actually initialises either mutex in p9_client which is concerning. I checked the latest Juniper code drop and nothing in that tree uses these mutexes. For now, I will use clnt_mtx to protect the unit number pools.

sys/fs/p9fs/p9fs_vfsops.c
79

I don't know. I can't actually find anything which calls this. The code stores a pointer to this function in p9fs_filesize2bytes but nothing seems to read that field. I think this is dead code - I will delete it.

106

Would it be reasonable to call vrele on the parent vp after vp->v_data is cleared below?

106

What I'm trying to do here (and with the matching vref in p9fs_vget_common is to prevent a use-after-free which happens if the parent vnode is recycled before the child. Would vhold/vdrop work for this and if so, would that avoid the LoR?

sys/fs/p9fs/p9fs_vnops.c
150

I don't think it can. Removed.

170

I think this is a bug - other filesystems call vrecycle when a vnode was removed (e.g. links==0).

1417

How much of a problem is this? I think this happens in other filesystems, e.g. smbfs_read ends up calling smb_rq_alloc which calls malloc.

1419

I have removed some blank lines in this function but it might be best to have a style pass later when/if the technical content is acceptable.

2165

Is this LoR? According to the VOP_RENAME manpage, source directory and file are unlocked. I am way out of my comfort zone here - how should we avoid this problem?

sys/fs/p9fs/p9fs_vnops.c
1417

Are there examples other than smbfs? smbfs is renowned for its lack of quality...

sys/fs/p9fs/p9fs_vnops.c
1417

I haven't found one yet. Reading the code again, this buffer is only used to copy and coalesce data from uio->uio_iov - perhaps it would be better to do the p9_client_write calls directly from the iovs, avoiding the allocation and extra data copy.

thanks, builds cleanly against CURRENT on arm64 & amd64. I will try to get some testing over next week on this.

This revision is now accepted and ready to land.Nov 17 2023, 9:20 AM
sys/fs/p9fs/p9_client.c
34

These includes should be also cleaned. sys/queue.h is provided by sys/systm.h, sys/param.h is included twice but none is needed. sys/types.h is also included twice and is also not needed.

129
493

Why this block uses strncmp? Are there suffixes that we need to ignore?

753
987
sys/fs/p9fs/p9_client.h
43

The includes should be ordered alphabetically, with exception of sys/systm.h going first. sys/types.h is provided by sys/systm.h

sys/fs/p9fs/p9fs.h
109

Would it be useful to add cast to p9fs_node?

111

Same q about cast.

sys/fs/p9fs/p9fs_vfsops.c
106

In principle vhold/vdrop would indeed avoid the LoR. But, having such structure means that all directories forming the path from mp root to any alive vnode, are also alive. This should cause issues with vnodes recycling.

It seems that the design of the client assumes that there is 1:1 correspondence between nodes and vnodes, i.e. recycling vnode invalidates cached node. If you do this, you also should allow VFS to shrink caches. Having reference to the parent actively obstructs vlru.

sys/fs/p9fs/p9fs_vnops.c
666

So if the file was locally modified and closed, and on the next open we noticed that external entity modified the file, local modifications are potentially dropped. With such design, should we care more about user data and e.g. force fsync on the last close?

See my note about fsync. This vinvalbuf() is really about dropping page cache.

1417

In fact, I think that the problem is too fundamental with the current code and can be ignored for now. As I understand, almost any VOP is directly translated into network request that allocates memory etc.

2139

I do not see fsync() implementation. Oh, the writes are synchronous.

But fs does allow mmaping files, by creating vnode v_object. This means that page cache exists, but it is incoherent with reads/writes.

sys/fs/p9fs/p9fs_vnops.c
267

Looks like here 37bbe0c9 by swills should actually be reverted.

That's actually what causes panic: cache_enter_time: different vnodes for dot entry for me.

Playing with this again, a lot better now!

Importing the Juniper patches, this time as separate commits, so bundled as a tarball: F72405937 (the dot revert as well).

With this, I was able to boot the system and build a Rust project. (Definitely can personally confirm the necessity of 731fffe5227 heh. Without it, things like software builds make the system hang on I/O a lot.) What's left:

  • Trying to do a buildkernel, hit an assert: panic: cleaned vnode isn't (namei → vfs_lookup → p9fs_lookup → p9fs_vget_common → vfs_hash_insert → vput_final → freevnode)
  • Or this one: panic: Assertion prev_offset < next_offset failed at sys/vm/vnode_pager.c:1472 (vnode_pager_generic_putpages via vn_closefile)
    • the offsets are equal and the goto start_write wasn't triggered; if we bypass that assertion a zero-length write gets logged
  • Managed to hit cache_enter_time: different vnodes for dot entry eventually anyway, even with the '.' special case removed
  • chown fails, e.g. from pkg: pkg: Fail to chown /usr/local/lib/.pkgtemp.libpkg.so.Z0Kcu30kQ8zB:Message too long (same from tar)
In D41844#973060, @dch wrote:

thanks, builds cleanly against CURRENT on arm64 & amd64. I will try to get some testing over next week on this.

Can we land this without an aggressive MFC and test in CURRENT before the review goes stale?

Can we land this without an aggressive MFC and test in CURRENT before the review goes stale?

+1. Will be a lot easier for me to submit my patches then. With those, it almost works well enough for me, the only thing missing is mmap'ed writes.