zfs: step #1 of bringing zpl to full obedience of vfs
ClosedPublic

Authored by avg on May 24 2016, 4:51 PM.

Details

Summary

ZFS POSIX Layer is originally written for Solaris VFS which is very
different from FreeBSD VFS. Most importantly many things that FreeBSD VFS
manages on behalf of all filesystems are implemented in ZPL in a different
way.
Thus, ZPL contains code that is redundant on FreeBSD or duplicates VFS
functionality or, in the worst cases, badly interacts / interferes
with VFS.

The most prominent problem is a deadlock caused by the lock order reversal
of vnode locks that may happen with concurrent zfs_rename() and lookup().
The deadlock is a result of zfs_rename() not observing the vnode locking
contract expected by VFS.

This commit removes all ZPL internal locking that protects parent-child
relationships of filesystem nodes. These relationships are protected
by vnode locks and the code is changed to take advantage of that fact
and to properly interact with VFS.

Removal of the internal locking allowed all ZPL dmu_tx_assign calls to
use TXG_WAIT mode.

Another victim, disputable perhaps, is support for case insensitive
operations.

To do:

  • replace ZFS_ENTER mechanism with VFS managed / visible mechanism
  • do vnode locking in zfs_zget
  • better integration with VFS namecache
  • get rid of not really useful now zfs_freebsd_* adapters
  • more cleanups of unneeded / unused code
  • fix / replace .zfs support

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
avg retitled this revision from to zfs: step #1 of bringing zpl to full obedience of vfs.May 24 2016, 4:51 PM
avg updated this object.
avg edited the test plan for this revision. (Show Details)
avg added a subscriber: ZFS.
avg updated this revision to Diff 16864.May 25 2016, 4:08 PM

zfs_rename: fix a few problems

  • ZFS_VERIFY_ZP can not be used before ZFS_ENTER because it calls ZFS_EXIT
  • ZFS_VERIFY_ZP can not be used here it at all because it can return bypassing all the cleanup code
  • likewise, we should not bypass the cleanup code is sdzp is invalid

Also, add some comments to the code.

avg updated this revision to Diff 17204.Jun 1 2016, 9:22 PM

Fix multiple problems, chiefly lock order reversals between z_teardown_lock
and vnode lock.

More details:

  • zfs_link: remove a couple of duplicate lines
  • zfs_zget: comment why vrele-ing a doomed vnode should not lead to a lor
  • zfs_lookup: ensure that we never take vnode lock after z_teardown_lock
  • zfs_rename: ensure that we never take vnode lock after z_teardown_lock
  • zfs_remove / zfs_rmdir: verify child znode before using it
  • zfs_vptocnp: check for invalid znode
  • zfs_rename: fix some more problems
    • re-arrange the code so that the vnode locks are acquired before ZFS_ENTER
    • fix another ZFS_VERIFY_ZP
    • check szp and tzp for being invalid
    • do zil_commit() only on success
    • fix a comment
emaste added a subscriber: emaste.Jun 15 2016, 5:34 PM
jhibbits added a subscriber: jhibbits.

Add the full ZFS umbrella group

swills added a subscriber: swills.Jul 18 2016, 6:57 PM

I've tested this patch quite a bit on powerpc64, where I was seeing the issue a lot. So far, this patch resolves the issue and doesn't cause any new issues. Any status update on this?

mahrens added inline comments.Jul 19 2016, 8:24 PM
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
1069 ↗(On Diff #17204)

Looks like the 2nd arg to this macro is always the function name. You could use func inside the macro instead. (like dprintf() in zfs_debug.h) Or if you don't want to change the macro, you could pass func to it from these new callers.

emaste added inline comments.Jul 19 2016, 8:40 PM
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
1069 ↗(On Diff #17204)

Good point, at least passing __func__ in new callers will simplify any future effort to update the macro (by avoiding the need to verify the arg matches the function name).

smh added a subscriber: smh.Jul 22 2016, 2:31 PM

Thanks for attacking this Andriy its a big task which really needed some attention.

I don't feel qualified to comment directly on the VFS changes myself without investing quite a bit of time familiarising myself with the ins and outs of our VFS layer and unfortunately I don't have time for that ATM.

Would it be a good idea to add kib to the reviews, IIRC he's very familiar with VFS?

Another victim, disputable perhaps, is support for case insensitive
operations.

What do you mean by "disputable" here?

kib added a subscriber: kib.Jul 22 2016, 4:53 PM

IMO ths stuff should be committed. Nobody cares except ZFS/VFS interaction besides you, and waiting for a review would only cause aggravation to users suffering from the bug.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
1130 ↗(On Diff #17204)

Why this place is not braced with #ifdef FreeBSD ?

1169 ↗(On Diff #17204)

Why do you need VI_LOCK() around this test ? You drop the interlock immediately, so if your check is to avoid returning doomed vnode, then it is not useful.

But VFS treats VI_DOOMED specially, the flag is set when both vnode lock and interlock are owned. So the exclusive lock you checked there already prevents the flag modification.

avg added a comment.Jul 25 2016, 2:33 PM

I've tested this patch quite a bit on powerpc64, where I was seeing the issue a lot. So far, this patch resolves the issue and doesn't cause any new issues. Any status update on this?

Thank you very much for testing! I'd like to have a bit more of discussion before I can consider committing this change.

avg added inline comments.Jul 25 2016, 2:37 PM
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
1069 ↗(On Diff #17204)

The second argument does not have tobe just a function name, it can include a more detailed explanation of an assertion. But you are absolutely correct that in this case we can use __func__ in all instances. Thank you for the suggestion, I will make sure to implement it.

avg added a comment.Jul 25 2016, 2:39 PM
In D6533#151464, @smh wrote:

Would it be a good idea to add kib to the reviews, IIRC he's very familiar with VFS?

Thank you for adding Kostik. The more eyes the better.

avg added a comment.Jul 25 2016, 2:48 PM

What do you mean by "disputable" here?

It seems that we formally support insensitive and mixed values of casesensitivity property, but I am not sure how much we really support them now.
I think that mixed is the same as sensitive at the moment as we do not have any other mechanism to control the policy.
Also, I am not sure how well our VFS support case insensitive operations (the name cache, etc).

So, I removed the support for anything else but the case-sensitive lookups (in ZPL) to make the code simpler, but I am not sure if that was a right thing to do.

avg added inline comments.Jul 25 2016, 3:50 PM
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
1130 ↗(On Diff #17204)

It probably should be. OTOH, that check is pretty useless on FreeBSD itself.

1169 ↗(On Diff #17204)

Well, the main reason is that the vnode is not locked here. I know that that is bad, but I haven't got around to fixing it. Please note that this piece of code is not changing much from the current version.

avg added a comment.EditedJul 25 2016, 6:50 PM
In D6533#151511, @kib wrote:

IMO ths stuff should be committed. Nobody cares except ZFS/VFS interaction besides you, and waiting for a review would only cause aggravation to users suffering from the bug.

Well, I think that the patch has several points of potential contention. Some functions are changed so much that they are very different from their originals in illumos.
Also, I decided to remove all chunks of non-trivial size guarded with ifdef illumos. In my opinion they served only to obscure the code and didn't help much with correctly merging upstream changes. My opinion in general is that it is inevitable that the "upper half" of the ZPL code would be quite different between OSes.

The patch is also "incomplete". There are still some major changes to be done. For instance, z_teardown_lock that protects things like rollback and unmounting should be acquired before any vnode locks, but currently that is quite hard to implement, so it is always acquired after all vnode locks making the code messier than necessary.
Ideally, however, it would be better to replace z_teardown_lock with a VFS mechanism similar to vfs_write_suspend.

All that said, I do not really mind committing the change. I will try to draw more attention to this review request, so that the contentious points are discussed before the code goes in.

swills added a comment.EditedJul 26 2016, 3:25 PM

I've now done a number of exp-run builds (all packages) on an amd64 box using this patch and there were no issues, or at least none that I noticed.

For what it's worth, I use a case-insensitive filesystem to build a Mono project.
This won't be missed: the project should be fixed instead.

asomers added a reviewer: will.Aug 3 2016, 2:47 PM
avg updated this revision to Diff 19050.Aug 4 2016, 8:45 PM
  • restore functionality of filesystems with casesensitivity=insensitive
  • bypass namecache if case-insensitive or normalized lookups are enabled
avg added a comment.Aug 4 2016, 8:47 PM

For what it's worth, I use a case-insensitive filesystem to build a Mono project.
This won't be missed: the project should be fixed instead.

I've restored the functionality (and even tried to make it more correct).
It would be great if you could test it.
Thank you!

avg updated this revision to Diff 19051.Aug 4 2016, 9:00 PM

fix the previous diff update which was based on a wrong revision

smh added inline comments.Aug 4 2016, 9:09 PM
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c
520 ↗(On Diff #19051)

intentionally left in?

533 ↗(On Diff #19051)

intentionally left in?

avg added inline comments.Aug 4 2016, 9:15 PM
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c
520 ↗(On Diff #19051)

Yes. On the one hand, there is this limit. On the other hand, we never enforced it before in this code. So...

533 ↗(On Diff #19051)

ditto

avg updated this revision to Diff 19052.EditedAug 4 2016, 9:16 PM

use __func__ as the second arg to VOP_ASSERT_* macros

avg updated this revision to Diff 19053.Aug 4 2016, 9:19 PM

use __func__ as the second arg to VOP_ASSERT_* macros

avg updated this revision to Diff 19054.Aug 4 2016, 9:23 PM

use __func__ as the second arg to VOP_ASSERT_* macros

peter added a subscriber: peter.Aug 4 2016, 9:30 PM

Hmm. I'd like to run this in the freebsd.org cluster on a larger scale. Do you have this checked in somewhere? Dealing with raw .diff files without version history isn't terribly appealing.

smh added inline comments.Aug 4 2016, 9:32 PM
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c
520 ↗(On Diff #19051)

Cool, be good to add a quick comment to this effect to the code, so when we look back its obvious why its currently disabled.

For what it's worth, I've been running this on amd64 on top of 11.0-ALPHA2 since June 3rd without a hiccup. Plenty of regular usage, running a production email server as well as a lot of other stuff.

peter added a comment.Aug 5 2016, 12:53 AM

We're running diff 19054 on a bunch of freebsd.org cluster machines now. I'm spinning up the machine that used to regularly explode when building npm/node things right now.

I confirmed that with the latest patch case-insensitive filesystem works.

$ echo foo >> /tmp/m/ab
$ echo bar >> /tmp/m/Ab
$ echo baz >> /tmp/m/aB
$ cat /tmp/m/AB
foo
bar
baz

This revision was automatically updated to reflect the committed changes.