Page MenuHomeFreeBSD

unionfs: Implement VOP_GETLOWVNODE and employ it for basicmount-time deadlock detection
ClosedPublic

Authored by jah on Sat, Nov 29, 9:29 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Dec 20, 4:42 AM
Unknown Object (File)
Sat, Dec 13, 9:44 PM
Unknown Object (File)
Sat, Dec 13, 8:57 PM
Unknown Object (File)
Sat, Dec 13, 8:47 PM
Unknown Object (File)
Sun, Nov 30, 10:30 AM
Subscribers

Details

Summary

vnode_if.src: fix function name in locking annotation
getwritevnode->getlowvnode

unionfs: Implement VOP_GETLOWVNODE

This function returns the vnode that will be used to resolve the
access type specified in the 'flags' argument, and is useful for
optimal behavior of vn_copy_file_range(). While most filesystems
can simply use the default implementation which returns the passed-
in vnode, unionfs (like nullfs) ideally should resolve the access
request to whichever base layer vnode will be used for the I/O.

For unionfs, write accesses must be resolved through the upper vnode,
while read accesses will be resolved through the upper vnode if
present or the lower vnode otherwise. In the case of write access
when an upper vnode has not yet been created, the unionfs vnode
itself should be treated as the resolving vnode so that unionfs can
handle the required copy when processing the write. Provide a simple
unionfs_getlowvnode() implementation that reflects this policy.

unionfs: detect common deadlock-producing mount misconfigurations

When creating a unionfs mount, it's fairly easy to shoot oneself
in the foot by specifying upper and lower file hierarchies that
resolve back to the same vnodes. This is fairly easy to do if
the sameness is not obvious due to aliasing through nullfs or other
unionfs mounts (as in the associated PR), and will produce either
deadlock or failed locking assertions on any attempt to use the
resulting unionfs mount.

Leverage VOP_GETLOWVNODE() to detect the most common cases of
foot-shooting at mount time and fail the mount with EDEADLK.
This is not meant to be an exhaustive check for all possible
deadlock-producing scenarios, but it is an extremely cheap and
simple approach that, unlike previous proposed fixes, also works
in the presence of nullfs aliases.

PR: 172334
Reported by: ngie, Karlo Miličević <karlo98.m@gmail.com>

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

jah requested review of this revision.Sat, Nov 29, 9:29 PM
jah retitled this revision from unionfs: Implement VOP_GETLOWVNODE and employ it for basic mount-time deadlock detection to unionfs: Implement VOP_GETLOWVNODE and employ it for basicmount-time deadlock detection.Sat, Nov 29, 9:30 PM
jah added reviewers: kib, markj, pho.
sys/fs/unionfs/union_vnops.c
3088 ↗(On Diff #167285)

I know I don't technically need this since vop_default is specified above, but I prefer to make it clear that we're intentionally using the default implementation.

sys/fs/unionfs/union_vnops.c
2137 ↗(On Diff #167285)

Don't you have to do the instantiation of the upper vnode there? Otherwise, vn_copy_file_range() malfunctions.

It is wrong right now without unionfs specialization for the VOP, but having the VOP implemented but wrong is IMO worse.

sys/fs/unionfs/union_vnops.c
2137 ↗(On Diff #167285)

See the comment block I added above this.

In fact, I had written an earlier version of the patch which did exactly that, but it ended up being fairly messy. The requirements of calling unionfs_copyfile() here mean that we have to do additional locking and rechecking of the vnode, and we also have to change VOP_GETLOWVNODE() to pass the caller creds so that the upper file can be created correctly. All of that is doable, but rather ugly and IMO not something we should be doing in a fairly simple vnode accessor if we can avoid it.

Instead, the idea here is that if the unionfs vnode has not yet instantiated an upper vnode, then we simply fall back to the default implementation which returns the unionfs vnode itself instead of a base-layer vnode. That is no different that what already happens today, and in that case vn_copy_file_range() should end up falling back to vn_generic_copy_file_range() (Even if both the source and destination vnodes were unionfs vnodes, it would still end up calling vn_generic_copy_file_range() because unionfs does not specialize VOP_COPY_FILE_RANGE). Since the 'outvp' to vn_generic_copy_file_range() will then be a unionfs vnode, the write issued against it will result in the upper vnode being instantiated at that point.

It's possible I missed something, but I think this should work?

I ran the uniofs tests in a loop for 5 hours without observing any issues.

sys/fs/unionfs/union_vnops.c
2137 ↗(On Diff #167285)

Actually perhaps I do need to go back to the messier implementation. I'd been thinking that vn_[generic]_copy_file_range() opened the target vnode for write access, which is where the copy-up actually happens in unionfs. But looking back over the code, it looks as though I misremembered and there is no open there.

That said, it looks as though vn_copy_file_range() should only be called in cases in which the target vnode has already been opened for write? If we can guarantee that, then I should be able to do something even simpler here and return EBADF if no upper vnode is present.

sys/fs/unionfs/union_vnops.c
2137 ↗(On Diff #167285)

I am not sure that e.g. NFS is required to call VOP_OPEN() before doing block copy. And generally, you are adding the new requirement: VOP_OPEN(for write) must be done before VOP_GETLOWVNODE(FWRITE) is ever called.

I would not object if you enforce the later for unionfs. You might indeed check for upper vnode, and return something. EBADF does not sound right, it is for different situation.

I thought about using v_writecount to assert that open for write was done, but there is no vnode to check on, or when the vnode is there, it is already fine.

sys/fs/unionfs/union_vnops.c
2137 ↗(On Diff #167285)

I didn't think the NFS code uses vn_copy_file_range(), does it? That of course doesn't mean that NFS (or something else) might not start making use of VOP_GETLOWVNODE() at some point in the future. Also anything involving the interaction between NFS and unionfs is really its own can of worms since there are quite a few known deficiencies there.

Interestingly, unionfs_write() issues VOP_WRITE() against the lower vnode if the upper vnode is not present. I'm not sure if it does that because of NFS or some other case that doesn't first open the vnode, but it's been that way since at least 2006. That behavior seems quite wrong for a filesystem that's supposed to have CoW semantics, but that is a problem for another day.

My preference would definitely be to just return an error in this case (EACCES?). I did save off the original bigger patch that called unionfs_copyfile(), so it wouldn't be too hard to bring it back if we ended up needing it for some reason.

sys/fs/unionfs/union_vnops.c
2137 ↗(On Diff #167285)

I am completely fine with returning an error in this case.

Might be, ENXIO or (somewhat strange but often used for strange purposes EDADLK) look somewhat better for me. But selecting the error code should not be a stopper there.

Return error if no upper vnode present

This revision is now accepted and ready to land.Tue, Dec 2, 1:13 AM

I don't think just having unionfs_getlowvnode() return EACCESS alone is enough, it has problems, see inline comment.

I'm not sure we should expect unionfs_getlowvnode() to copy up the file. This indeed complicates code for an accessor, and perhaps more importantly brings the question of the semantics of VOP_GETLOWVNODE(), especially when called with F_WRITE.

There are currently only two callers of VOP_GETLOWVNODE(), one is null_getlowvnode(), the other vn_copy_file_range(), and only the latter uses F_WRITE. nullfs' implementation will not work if the vnode is locked on entry. Also, it does not guarantee the same locking for the output vnode at all (and I don't think it should).

Given all this, here's what I propose:

  1. Change part of the contract of VOP_GETLOWVNODE() with respect to locking and error: Use U U - (input vnode is unlocked, output is unlocked, and no vnode returned on error).
  2. unionfs_getlowvnode() must not fail. On FWRITE, if not implementing copy up there, then at least it should return the vnode passed as input.

Returning the input vnode on FWRITE in 2 has the drawback of preventing using VOP_COPY_FILE_RANGE() on the upper layer where applicable, but at least vn_copy_file_range() can fall back to vn_generic_copy_file_range() and we avoid the problem mentioned in the inline comment.

In the end, we still may want to do the copy up directly in VOP_GETLOWVNODE() or perhaps change the guts of vn_copy_file_range() with calling VOP_COPY_FILE_RANGE() unconditionally (dispatching on the input vnode) regardless of the vnode to write to not being on the same filesystem, with each implementation adding his own checks and in the case of unionfs performing the copy up there. But that's not strictly necessary here.

For the check that prompted all this, yes, a non-exhaustive check catching the most frequent pilot error is much better than no check at all. Thanks!

sys/fs/unionfs/union_vnops.c
2133–2136 ↗(On Diff #167397)

Just returning EACCESS here causes vn_copy_file_range() (and then kern_copy_file_range() and the system call) to just fail, which not only feels wrong, but will also cause, e.g., cp(1) to loop forever (arguably, the code around this in cp(1) seems buggy for independent reasons).

Additionally, this means that, when trying to copy an existing unionfs file that has already been copied to the upper layer to a non-existent or unmodified file (i.e., backed by the lower layer), we basically lose the benefit of vn_generic_copy_file_range() (copying in the kernel). That's second order, but still a non-negligible effect.

Please see the general comment.

olce requested changes to this revision.Tue, Dec 9, 11:33 AM
This revision now requires changes to proceed.Tue, Dec 9, 11:33 AM
sys/fs/unionfs/union_vnops.c
2133–2136 ↗(On Diff #167397)

e.g., cp(1) to loop forever (arguably, the code around this in cp(1) seems buggy for independent reasons).

Bad code reading, sorry, no infinite loop, but a copy failure which is still problematic.

I don't think just having unionfs_getlowvnode() return EACCESS alone is enough, it has problems, see inline comment.

I'm not sure we should expect unionfs_getlowvnode() to copy up the file. This indeed complicates code for an accessor, and perhaps more importantly brings the question of the semantics of VOP_GETLOWVNODE(), especially when called with F_WRITE.

There are currently only two callers of VOP_GETLOWVNODE(), one is null_getlowvnode(), the other vn_copy_file_range(), and only the latter uses F_WRITE. nullfs' implementation will not work if the vnode is locked on entry. Also, it does not guarantee the same locking for the output vnode at all (and I don't think it should).

Given all this, here's what I propose:

  1. Change part of the contract of VOP_GETLOWVNODE() with respect to locking and error: Use U U - (input vnode is unlocked, output is unlocked, and no vnode returned on error).

I made exactly this change in my original heavyweight attempt at unionfs_getlowvnode (which did do the copy-up and thus required the vnode lock). But as a practical concern, I think the nullfs issue you mentioned above could be just as easily dealt with by changing null_getlowvnode() to take the interlock instead of the full vnode lock. We might still want to change this contract anyway, but I think that should be done as a separate change, no?

  1. unionfs_getlowvnode() must not fail. On FWRITE, if not implementing copy up there, then at least it should return the vnode passed as input.

Returning the input vnode on FWRITE in 2 has the drawback of preventing using VOP_COPY_FILE_RANGE() on the upper layer where applicable, but at least vn_copy_file_range() can fall back to vn_generic_copy_file_range() and we avoid the problem mentioned in the inline comment.

This is what I did in the initial posted version of this change; you can still see that version in the diff history here, along with the conversation with kib@ that resulted in the change to return EACCES instead.

Basically, the problem is that unionfs_write() today does not do any copy-up; the copy-up is instead done when unionfs_open() is invoked with write access requested. Since the vn_copy_file_range() path does not VOP_OPEN() the target vnode before issuing VOP_WRITE() against it, returning the input unionfs vnode here will not provide any opportunity for unionfs to do the copy-up within the vn_copy_file_range() path. However, vn_copy_file_range() is today only used in a syscall that expects the target file to have already been open()ed for write, so my expectation is that any necessary copy-up should have already been performed by unionfs_open() by the time we reach the point of calling unionfs_getlowvnode().

As an aside, perhaps unionfs_write() should do a copy-up? I mentioned in an earlier comment in this review that it at least seems wrong that the current implementation of unionfs_write() allows the write to proceed against the lower vnode if no upper vnode is present.

In the end, we still may want to do the copy up directly in VOP_GETLOWVNODE() or perhaps change the guts of vn_copy_file_range() with calling VOP_COPY_FILE_RANGE() unconditionally (dispatching on the input vnode) regardless of the vnode to write to not being on the same filesystem, with each implementation adding his own checks and in the case of unionfs performing the copy up there. But that's not strictly necessary here.

For the check that prompted all this, yes, a non-exhaustive check catching the most frequent pilot error is much better than no check at all. Thanks!

In D53988#1237096, @jah wrote:

I made exactly this change in my original heavyweight attempt at unionfs_getlowvnode (which did do the copy-up and thus required the vnode lock). But as a practical concern, I think the nullfs issue you mentioned above could be just as easily dealt with by changing null_getlowvnode() to take the interlock instead of the full vnode lock. We might still want to change this contract anyway, but I think that should be done as a separate change, no?

You're right, doing it separately doesn't hurt, as VOP_GETLOWVNODE() is currently never called with a lock.

This is what I did in the initial posted version of this change; you can still see that version in the diff history here, along with the conversation with kib@ that resulted in the change to return EACCES instead.

Oh sorry, I read it, but apparently too quickly. I missed that the top-level VOP_GETLOWVNODE() calls are all triggered through the copy_file_range() system calls, for which users must have opened the file, leading to VOP_OPEN() being called first (if we except the NFS case).

All in all, this is an improvement with no drawbacks, it should go, sorry for the delay.

Basically, the problem is that unionfs_write() today does not do any copy-up; the copy-up is instead done when unionfs_open() is invoked with write access requested. Since the vn_copy_file_range() path does not VOP_OPEN() the target vnode before issuing VOP_WRITE() against it, returning the input unionfs vnode here will not provide any opportunity for unionfs to do the copy-up within the vn_copy_file_range() path. However, vn_copy_file_range() is today only used in a syscall that expects the target file to have already been open()ed for write, so my expectation is that any necessary copy-up should have already been performed by unionfs_open() by the time we reach the point of calling unionfs_getlowvnode().

As an aside, perhaps unionfs_write() should do a copy-up? I mentioned in an earlier comment in this review that it at least seems wrong that the current implementation of unionfs_write() allows the write to proceed against the lower vnode if no upper vnode is present.

In the end, we still may want to do the copy up directly in VOP_GETLOWVNODE() or perhaps change the guts of vn_copy_file_range() with calling VOP_COPY_FILE_RANGE() unconditionally (dispatching on the input vnode) regardless of the vnode to write to not being on the same filesystem, with each implementation adding his own checks and in the case of unionfs performing the copy up there. But that's not strictly necessary here.

For the check that prompted all this, yes, a non-exhaustive check catching the most frequent pilot error is much better than no check at all. Thanks!

This revision is now accepted and ready to land.Thu, Dec 11, 3:08 PM
In D53988#1237096, @jah wrote:

I made exactly this change in my original heavyweight attempt at unionfs_getlowvnode (which did do the copy-up and thus required the vnode lock). But as a practical concern, I think the nullfs issue you mentioned above could be just as easily dealt with by changing null_getlowvnode() to take the interlock instead of the full vnode lock. We might still want to change this contract anyway, but I think that should be done as a separate change, no?

You're right, doing it separately doesn't hurt, as VOP_GETLOWVNODE() is currently never called with a lock.

This is what I did in the initial posted version of this change; you can still see that version in the diff history here, along with the conversation with kib@ that resulted in the change to return EACCES instead.

Oh sorry, I read it, but apparently too quickly. I missed that the top-level VOP_GETLOWVNODE() calls are all triggered through the copy_file_range() system calls, for which users must have opened the file, leading to VOP_OPEN() being called first (if we except the NFS case).

All in all, this is an improvement with no drawbacks, it should go, sorry for the delay.

No worries, thanks for looking it over!

Basically, the problem is that unionfs_write() today does not do any copy-up; the copy-up is instead done when unionfs_open() is invoked with write access requested. Since the vn_copy_file_range() path does not VOP_OPEN() the target vnode before issuing VOP_WRITE() against it, returning the input unionfs vnode here will not provide any opportunity for unionfs to do the copy-up within the vn_copy_file_range() path. However, vn_copy_file_range() is today only used in a syscall that expects the target file to have already been open()ed for write, so my expectation is that any necessary copy-up should have already been performed by unionfs_open() by the time we reach the point of calling unionfs_getlowvnode().

As an aside, perhaps unionfs_write() should do a copy-up? I mentioned in an earlier comment in this review that it at least seems wrong that the current implementation of unionfs_write() allows the write to proceed against the lower vnode if no upper vnode is present.

In the end, we still may want to do the copy up directly in VOP_GETLOWVNODE() or perhaps change the guts of vn_copy_file_range() with calling VOP_COPY_FILE_RANGE() unconditionally (dispatching on the input vnode) regardless of the vnode to write to not being on the same filesystem, with each implementation adding his own checks and in the case of unionfs performing the copy up there. But that's not strictly necessary here.

For the check that prompted all this, yes, a non-exhaustive check catching the most frequent pilot error is much better than no check at all. Thanks!