
zfs: depessimize zfs_root with rmlocks
ClosedPublic

Authored by mjg on Sep 19 2018, 7:47 AM.
Details

Summary

Currently the VFS layer frequently calls the ->vfs_root method of file systems, which causes very significant lock contention on mount-point-heavy boxes with a lot of cores.

Since it is late in the release cycle and I think this should be addressed, I went for a trivial hack: store the root vnode in the mount point and protect access to it with rmlocks. The vnode is cleared on unmount and never repopulated. A better-quality, long-term fix will appear later. I'm not at all fond of this code, and if you have a better idea (one that can land in 12.0), I'm happy to drop this patch.
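A minimal sketch of the idea (not the committed diff), in the context of zfs_vfsops.c: it assumes hypothetical z_rootvnodelock (an rmlock) and z_rootvnode fields added to zfsvfs_t, and it elides populating the cache (e.g. at mount time).

	/*
	 * Hedged sketch only, not the committed patch.
	 * z_rootvnodelock (an rmlock) and z_rootvnode are assumed fields on
	 * zfsvfs_t; populating the cache is elided.
	 */
	static int
	zfs_root(vfs_t *vfsp, int flags, vnode_t **vpp)
	{
		zfsvfs_t *zfsvfs = (zfsvfs_t *)vfsp->vfs_data;
		struct rm_priotracker tracker;
		znode_t *rootzp;
		vnode_t *vp;
		int error;

		/* Fast path: read-lock the rmlock; no ZFS_ENTER and no zfs_zget. */
		rm_rlock(&zfsvfs->z_rootvnodelock, &tracker);
		vp = zfsvfs->z_rootvnode;
		if (vp != NULL) {
			vref(vp);	/* take a reference for the caller */
			rm_runlock(&zfsvfs->z_rootvnodelock, &tracker);
			error = vn_lock(vp, flags);
			if (error != 0) {
				vrele(vp);
				return (error);
			}
			*vpp = vp;
			return (0);
		}
		rm_runlock(&zfsvfs->z_rootvnodelock, &tracker);

		/*
		 * Slow path, effectively the pre-patch zfs_root; used once
		 * unmount has cleared the cached vnode (never repopulated).
		 */
		ZFS_ENTER(zfsvfs);
		error = zfs_zget(zfsvfs, zfsvfs->z_root, &rootzp);
		if (error == 0)
			*vpp = ZTOV(rootzp);
		ZFS_EXIT(zfsvfs);
		if (error == 0)
			error = vn_lock(*vpp, flags);
		return (error);
	}

The point of the fast path is that concurrent zfs_root callers only take a read lock on the rmlock, so they no longer pile up on the sx locks zfs_zget needs.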

This significantly reduces the system time and real time of -j 128 buildkernel on EPYC (AMD EPYC 7601 32-Core Processor, 2 package(s) x 4 groups x 2 cache groups x 4 core(s) x 2 hardware threads) with the default partition setup (pasted at the end).

Before:
make -s -j 128 buildkernel 2778.09s user 3319.45s system 8370% cpu 1:12.85 total

After:
make -s -j 128 buildkernel 3199.57s user 1772.78s system 8232% cpu 1:00.40 total

Lock profile (total wait time) before:

5808005606 (sx:zfsvfs->z_hold_mtx[i])
363396408 (lockmgr:zfs)
261479733 (rw:vm object)
203987276 (sx:dd->dd_lock)
114277109 (spin mutex:turnstile chain)
71652435 (sx:vm map (user))
56534812 (spin mutex:sleepq chain)
44072442 (sx:dn->dn_mtx)
39450115 (sleep mutex:zio_write_issue)
28089280 (sleep mutex:vnode interlock)

Lock profile (total wait time) after:

411035342 (lockmgr:zfs)
348574707 (sleep mutex:struct mount mtx)
285654874 (rw:vm object)
183426131 (sx:dd->dd_lock)
111464752 (spin mutex:turnstile chain)
110790531 (sleep mutex:vnode interlock)
74655012 (sx:vm map (user))
72533018 (spin mutex:sleepq chain)
46086874 (sx:dn->dn_mtx)
45494908 (sx:zp->z_acl_lock)

partition setup:
zroot/ROOT/default on / (zfs, local, noatime, nfsv4acls)
devfs on /dev (devfs, local, multilabel)
zroot/tmp on /tmp (zfs, local, noatime, nosuid, nfsv4acls)
zroot/usr/home on /usr/home (zfs, local, noatime, nfsv4acls)
zroot/usr/ports on /usr/ports (zfs, local, noatime, nosuid, nfsv4acls)
zroot/usr/src on /usr/src (zfs, local, noatime, nfsv4acls)
zroot/var/audit on /var/audit (zfs, local, noatime, noexec, nosuid, nfsv4acls)
zroot/var/crash on /var/crash (zfs, local, noatime, noexec, nosuid, nfsv4acls)
zroot/var/log on /var/log (zfs, local, noatime, noexec, nosuid, nfsv4acls)
zroot/var/mail on /var/mail (zfs, local, nfsv4acls)
zroot/var/tmp on /var/tmp (zfs, local, noatime, nosuid, nfsv4acls)
zroot on /zroot (zfs, local, noatime, nfsv4acls)

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

mjg edited the summary of this revision.
  • reupload with context

So what happens if the root vnode is reclaimed?

Note zfs_zget vrefs the vnode.

The existing code can already encounter a reclaimed vnode and I think it is harmless (in fact I believe this is common to all filesystems). The lock op will either fail with ENOENT or, if LK_RETRY is passed, we will get the vnode, but nothing will work on it.

The only code path which results in reclamation that I found is unmount, which starts with just clearing the cached vnode. In the window between snatching the pointer and going to lock the vnode it can be cleared and reclaimed by vflush. But that's precisely the same race as seen with ZFS_ENTER/EXIT macros, which only protect snatching the vnode and don't protect locking it. Similarly for ufs, snatching is protected with the vfs_hash_lock but locking is not.

If this is not what you meant, can you elaborate on a specific scenario?
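For illustration, here is a hedged sketch of that unmount-side clearing, using the same hypothetical z_rootvnodelock / z_rootvnode fields as in the sketch in the summary; the helper name is made up, and the cache is assumed to hold its own reference on the root vnode.

	/*
	 * Hedged sketch, not committed code: clear the cached root vnode
	 * before vflush runs during unmount.
	 */
	static void
	zfs_rootvnode_clear(zfsvfs_t *zfsvfs)
	{
		vnode_t *vp;

		rm_wlock(&zfsvfs->z_rootvnodelock);
		vp = zfsvfs->z_rootvnode;
		zfsvfs->z_rootvnode = NULL;
		rm_wunlock(&zfsvfs->z_rootvnodelock);
		if (vp != NULL)
			vrele(vp);	/* drop the cache's reference so vflush can reclaim the vnode */
	}

In this sketch zfs_umount would call the helper before vflush, matching the ordering described above: the cached pointer disappears first, reclamation happens afterwards.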

In D17233#367326, @mjg wrote:

The existing code can already encounter a reclaimed vnode and I think it is harmless (in fact I believe this is common to all filesystems). The lock op will either fail with ENOENT or, if LK_RETRY is passed, we will get the vnode, but nothing will work on it.

The only code path which results in reclamation that I found is unmount, which starts with just clearing the cached vnode. In the window between snatching the pointer and going to lock the vnode it can be cleared and reclaimed by vflush. But that's precisely the same race as seen with ZFS_ENTER/EXIT macros, which only protect snatching the vnode and don't protect locking it. Similarly for ufs, snatching is protected with the vfs_hash_lock but locking is not.

If this is not what you meant, can you elaborate on a specific scenario?

I am not sure what snatching means.

I mean that if the root vnode is reclaimed, then zfs_root() is destined to always return that reclaimed vnode. If zfs_zget() executes vget() without LK_RETRY, it fails. If LK_RETRY is passed in, a useless vnode is returned.

I noted that the only codepath leading to reclamation that I see goes through zfs_umount, which starts with clearing the cached vnode. Thus I don't see how a perpetually bad vnode can happen. Moreover, the forced vflush leading to said reclamation is not going to fail on zfs, and there are no failure points afterwards either.

If there is a different reclamation codepath, worst case the patch can be updated to check for VI_DOOMED and relock for writing to clear the vnode.
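As a hedged illustration of that fallback (again using the hypothetical fields from the earlier sketches), the fast path could do something along these lines after locking the cached vnode:

	/*
	 * Hedged fragment, not committed code: one way the fast path could
	 * handle a cached root vnode that turns out to be doomed.  vp is the
	 * vnode the fast path returned locked.
	 */
	if ((vp->v_iflag & VI_DOOMED) != 0) {
		bool cleared = false;

		VOP_UNLOCK(vp, 0);
		rm_wlock(&zfsvfs->z_rootvnodelock);
		if (zfsvfs->z_rootvnode == vp) {
			zfsvfs->z_rootvnode = NULL;
			cleared = true;
		}
		rm_wunlock(&zfsvfs->z_rootvnodelock);
		if (cleared)
			vrele(vp);	/* the cache's own reference */
		vrele(vp);		/* the reference taken for the caller */
		/* ... then fall back to the slow path and look the root up again. */
	}

Reading v_iflag here is fine because the vnode lock is held; as the inline comments below note, the result can still become stale once the lock is dropped, which is the same window the existing code already has.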

In D17233#367334, @mjg wrote:

I noted that the only codepath leading to reclamation that I see goes through zfs_umount, which starts with clearing the cached vnode. Thus I don't see how a perpetually bad vnode can happen. Moreover, the forced vflush leading to said reclamation is not going to fail on zfs, and there are no failure points afterwards either.

If there is a different reclamation codepath, worst case the patch can be updated to check for VI_DOOMED and relock for writing to clear the vnode.

VFS does not guarantee that there are no other ways to reclaim a referenced vnode. Yes, some handling for the VI_DOOMED case should be added.

  • clear doomed vnodes in zfs_root.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
2045 (On Diff #48209)

VI_DOOMED is protected by either the vnode interlock or the vnode lock. I do not see what prevents the test from being invalidated right after the check.

mjg marked an inline comment as done. Sep 19 2018, 1:05 PM
mjg added inline comments.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
2045 (On Diff #48209)

It can be, but I don't see why this would be a problem. It's exactly the same race that already exists. On the other hand, the patch addresses the worry of reclamation not going through zfs_umount and resulting in a perpetually bad root vnode.

This revision is now accepted and ready to land. Sep 19 2018, 1:35 PM

I highly suggest getting a review from somebody who knows a bit about zfs.

mav added a subscriber: mav.

Looks good to me, but VFS and the ZPL are not areas where I know all the details.

The ZFS side seems fine to me (I defer to those with stronger backgrounds on the VFS interactions).

This revision was automatically updated to reflect the committed changes.