
zfs: depessimize zfs_root with rmlocks
ClosedPublic

Authored by mjg on Sep 19 2018, 7:47 AM.

Details

Summary

Currently the VFS layer likes to call the ->vfs_root method of file systems, which causes very significant lock contention on mount-point-heavy boxes with a lot of cores.

Since it is late in the release cycle and I think this should be addressed, I went for a trivial hack: store the vnode in the mount point and protect the access with rmlocks. The vnode is cleared on unmount and never repopulated. A better-quality, long-term fix will appear later. I'm not at all fond of this code, and if you have a better idea (one that can land in 12.0), I'm happy to drop this patch.
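
To make the idea concrete, here is a minimal sketch of the caching scheme described above, not the actual patch: it assumes a hypothetical z_rootvnode field and z_rootvnode_lock rmlock hung off the zfsvfs, and a hypothetical zfs_root_slow() helper standing in for the original lookup via zfs_zget().

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>
#include <sys/mount.h>
#include <sys/vnode.h>

/*
 * Illustrative fast path for zfs_root(): take the rmlock in read mode,
 * vref() the cached root vnode and lock it after dropping the rmlock.
 * zfsvfs_t with the z_rootvnode/z_rootvnode_lock fields and
 * zfs_root_slow() are assumed for the sketch, not taken from the patch.
 */
static int
zfs_root_cached(struct mount *mp, int flags, struct vnode **vpp)
{
	zfsvfs_t *zfsvfs = mp->mnt_data;
	struct rm_priotracker tracker;
	struct vnode *vp;
	int error;

	rm_rlock(&zfsvfs->z_rootvnode_lock, &tracker);
	vp = zfsvfs->z_rootvnode;
	if (vp == NULL) {
		/* Cleared on unmount (or never populated): fall back. */
		rm_runlock(&zfsvfs->z_rootvnode_lock, &tracker);
		return (zfs_root_slow(mp, flags, vpp));
	}
	vref(vp);
	rm_runlock(&zfsvfs->z_rootvnode_lock, &tracker);

	/* Same window as with ZFS_ENTER/EXIT: the vnode is locked afterwards. */
	error = vn_lock(vp, flags);
	if (error != 0) {
		vrele(vp);
		return (error);
	}
	*vpp = vp;
	return (0);
}
```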

This significantly reduces system time and real time of -j 128 buildkernel on EPYC (AMD EPYC 7601 32-Core Processor, 2 package(s) x 4 groups x 2 cache groups x 4 core(s) x 2 hardware threads) with the default partition setup (pasted at the end).

Before:
make -s -j 128 buildkernel 2778.09s user 3319.45s system 8370% cpu 1:12.85 total

After:
make -s -j 128 buildkernel 3199.57s user 1772.78s system 8232% cpu 1:00.40 total

Lock profile (total wait time) before:

5808005606 (sx:zfsvfs->z_hold_mtx[i])
363396408 (lockmgr:zfs)
261479733 (rw:vm object)
203987276 (sx:dd->dd_lock)
114277109 (spin mutex:turnstile chain)
71652435 (sx:vm map (user))
56534812 (spin mutex:sleepq chain)
44072442 (sx:dn->dn_mtx)
39450115 (sleep mutex:zio_write_issue)
28089280 (sleep mutex:vnode interlock)

Lock profile (total wait time) after:

411035342 (lockmgr:zfs)
348574707 (sleep mutex:struct mount mtx)
285654874 (rw:vm object)
183426131 (sx:dd->dd_lock)
111464752 (spin mutex:turnstile chain)
110790531 (sleep mutex:vnode interlock)
74655012 (sx:vm map (user))
72533018 (spin mutex:sleepq chain)
46086874 (sx:dn->dn_mtx)
45494908 (sx:zp->z_acl_lock)

partition setup:
zroot/ROOT/default on / (zfs, local, noatime, nfsv4acls)
devfs on /dev (devfs, local, multilabel)
zroot/tmp on /tmp (zfs, local, noatime, nosuid, nfsv4acls)
zroot/usr/home on /usr/home (zfs, local, noatime, nfsv4acls)
zroot/usr/ports on /usr/ports (zfs, local, noatime, nosuid, nfsv4acls)
zroot/usr/src on /usr/src (zfs, local, noatime, nfsv4acls)
zroot/var/audit on /var/audit (zfs, local, noatime, noexec, nosuid, nfsv4acls)
zroot/var/crash on /var/crash (zfs, local, noatime, noexec, nosuid, nfsv4acls)
zroot/var/log on /var/log (zfs, local, noatime, noexec, nosuid, nfsv4acls)
zroot/var/mail on /var/mail (zfs, local, nfsv4acls)
zroot/var/tmp on /var/tmp (zfs, local, noatime, nosuid, nfsv4acls)
zroot on /zroot (zfs, local, noatime, nfsv4acls)

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

mjg edited the summary of this revision.
  • reupload with context

So what happens if the root vnode is reclaimed?

Note zfs_zget vrefs the vnode.

The existing code can already encounter a reclaimed vnode, and I think it is harmless (in fact I believe this is common to all filesystems). The lock op will either fail with ENOENT or, if LK_RETRY is passed, we will get the vnode, but nothing will work on it.

The only code path resulting in reclamation that I found is unmount, which starts by just clearing the cached vnode. In the window between snatching the pointer and going to lock the vnode, it can be cleared and reclaimed by vflush. But that is precisely the same race as seen with the ZFS_ENTER/EXIT macros, which only protect snatching the vnode and do not protect locking it. Similarly for ufs: snatching is protected by the vfs_hash_lock, but locking is not.

If this is not what you meant, can you elaborate on a specific scenario?

In D17233#367326, @mjg wrote:

The existing code can already encounter a reclaimed vnode, and I think it is harmless (in fact I believe this is common to all filesystems). The lock op will either fail with ENOENT or, if LK_RETRY is passed, we will get the vnode, but nothing will work on it.

The only code path resulting in reclamation that I found is unmount, which starts by just clearing the cached vnode. In the window between snatching the pointer and going to lock the vnode, it can be cleared and reclaimed by vflush. But that is precisely the same race as seen with the ZFS_ENTER/EXIT macros, which only protect snatching the vnode and do not protect locking it. Similarly for ufs: snatching is protected by the vfs_hash_lock, but locking is not.

If this is not what you meant, can you elaborate on a specific scenario?

I am not sure what snatching means.

I mean that if the root vnode is reclaimed, then zfs_root() is destined to always return that reclaimed vnode. If zfs_zget() executes vget() without LK_RETRY, it fails. If LK_RETRY is passed in, a useless vnode is returned.

I noted that the only codepath leading to reclamation that I see goes through zfs_umount, which starts by clearing the cached vnode. Thus I don't see how a perpetually bad vnode can happen. Moreover, the forced vflush leading to said reclamation is not going to fail on zfs, and there are no failure points afterwards either.

If there is a different reclamation codepath, worst case the patch can be updated to check for VI_DOOMED and re-lock for writing to clear the vnode.

In D17233#367334, @mjg wrote:

I noted that the only codepath leading to reclamation that I see goes through zfs_umount, which starts by clearing the cached vnode. Thus I don't see how a perpetually bad vnode can happen. Moreover, the forced vflush leading to said reclamation is not going to fail on zfs, and there are no failure points afterwards either.

If there is a different reclamation codepath, worst case the patch can be updated to check for VI_DOOMED and re-lock for writing to clear the vnode.

VFS does not guarantee that there are no other ways to reclaim a referenced vnode. Yes, some handling for the VI_DOOMED case should be added.

  • clear doomed vnodes in zfs_root.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
2045 (On Diff #48209)

VI_DOOMED is protected by either the vnode interlock or the vnode lock. I do not see what prevents the test from being invalidated right after the check.

mjg marked an inline comment as done. Sep 19 2018, 1:05 PM
mjg added inline comments.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
2045 (On Diff #48209)

It can be, but I don't see why that would be a problem: it's exactly the same race that already exists. On the other hand, the patch addresses the worry of a reclamation not going through zfs_umount and resulting in a perpetually bad root vnode.
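
For illustration, a hedged sketch of the doomed-vnode handling discussed above, continuing the hypothetical field names from the earlier sketch; the VI_DOOMED check runs with the vnode lock held, and a stale cached root is dropped under the write-locked rmlock so a later call re-resolves it.

```c
/*
 * Hypothetical continuation of the earlier sketch: after vn_lock()
 * succeeds we hold the vnode lock, so VI_DOOMED is stable here.  If the
 * cached root was reclaimed, forget it and report failure so the caller
 * can fall back to the regular lookup.  Names are illustrative only.
 */
static int
zfs_root_check_doomed(zfsvfs_t *zfsvfs, struct vnode *vp)
{

	if ((vp->v_iflag & VI_DOOMED) == 0)
		return (0);

	VOP_UNLOCK(vp, 0);
	vrele(vp);

	/* Clear the cached pointer only if it still refers to this vnode. */
	rm_wlock(&zfsvfs->z_rootvnode_lock);
	if (zfsvfs->z_rootvnode == vp)
		zfsvfs->z_rootvnode = NULL;
	rm_wunlock(&zfsvfs->z_rootvnode_lock);

	return (ENOENT);
}
```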

This revision is now accepted and ready to land. Sep 19 2018, 1:35 PM

I highly suggest you get a review from somebody who knows a bit about zfs.

mav added a subscriber: mav.

Looks good to me, but VFS and ZPL are not my favorite areas, so I don't know all the details.

The ZFS side seems fine to me (I defer to people with stronger backgrounds on the VFS interactions).

This revision was automatically updated to reflect the committed changes.