
zfs: Fix a deadlock between page busy and the teardown lock
ClosedPublic

Authored by markj on Nov 10 2021, 5:45 PM.

Details

Summary

ZFS has a per-mountpoint read-mostly "teardown lock" which is acquired
in read mode by most VOPs. It is used to suspend filesystem operations
in preparation for some dataset-level operation like a rollback
(reverting a filesystem to an earlier snapshot). In particular, the ZFS
VOP_GETPAGES implementation acquires this lock.

When rolling back a dataset, ZFS invalidates all file data resident in
the page cache, since a rollback can change a file's contents and we
don't want stale data to linger. To do this, it calls
vn_pages_remove() on each vnode associated with the mountpoint. This
introduces a lock order reversal: to handle a page fault we busy vnode
pages before calling VOP_GETPAGES (which acquires the teardown lock in
read mode), and during rollback we busy vnode pages in
vm_object_page_remove() with the teardown lock held in write mode.

Resolve the deadlock by exploiting the fact that rollback only needs to
purge valid pages: invalid pages need not be purged by definition, so a
thread holding a busy lock on an invalid ZFS vnode page can safely
block on the teardown lock. This assumes that vm_pager_get_pages() will
never pass valid pages to VOP_GETPAGES. Since ZFS is solely responsible
for marking its vnode pages valid, I believe this is a safe assumption.

Add a new mode to vm_object_page_remove() to skip over invalid pages.
Use it when rolling back a dataset.

PR: 258208

Test Plan

Peter has a stress2 scenario which triggers the deadlock fairly regularly.
We haven't observed any problems with this patch applied.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

markj requested review of this revision. Nov 10 2021, 5:45 PM
sys/vm/vm_object.c:2129

Would this still cause the issue with partially valid pages at EOF? If a page is not fully valid, vm_fault() still calls into the pager, and the pager calls into the VOP. Validation of the rest of the page is performed by vm_pager_get_pages() after pgo_getpages() has validated the page up to EOF.

It might be simplest to avoid this issue altogether by unconditionally doing vm_page_zero_invalid() in the ZFS VOP_GETPAGES().

sys/vm/vm_object.c:2129

I believe ZFS vnode pages are never partially valid. This is asserted in several places that maintain coherence between the page cache and the DMU; see page_busy() or dmu_read_pages(). Truncation (currently) does not mark pages partially invalid; see the last comment in vnode_pager_subpage_purge(). Maybe it is a somewhat fragile assumption, but it exists already.

This revision is now accepted and ready to land. Nov 10 2021, 8:05 PM
sef added a subscriber: sef.

The code looks fine (small change, after all), other than my request for comments. :)

sys/kern/vfs_vnops.c:2447–2458

Can we get some comments? I realize the code has historically been low on them, but we can try to fix that for new code. :)

markj marked an inline comment as done.

Add a comment for vn_pages_remove_valid().

This revision now requires review to proceed. Nov 11 2021, 8:33 PM

Thank you very much! This is a very neat fix for what appeared to be a quite substantial problem (and maybe is, in terms of interactions between layers/subsystems).

This revision is now accepted and ready to land. Nov 12 2021, 10:04 AM