Details

Reviewers

markj
olce

Commits

rG2f60984053e5: namei dotdot tracker: take mnt_renamelock shared to prevent parallel renames
rG2ffee9aef7c2: vfs_lookup: split NDRESTART
rG369a2542caa9: msdosfs: remove pm_checkpath_lock
rG2c17429915f2: ufs: remove um_checkpath_lock
rGef6ea91593eb: VOP_RENAME: add mp-global lock

Summary

VOP_RENAME: add mp-global lock

It is before all vnode locks, but after vn_start_write().

The lock prevents parallel rename operations on the same mount point,
which should in (near future) simplify a lot of code in VFS/fs that
otherwise need to code with either the changing hierarchy, or with the
lock order for vnodes due to changed hierarchy.

On renames, the lock is taken on the lowest stacked filesystem.
Otherwise rename could still occur in parallel, by performing one of op
on the lower fs.


ufs, msdosfs: remove checkpath locks


namei dotdot tracker: take mnt_renamelock shared to prevent parallel renames

For each visited dvp vnode, take the shared lock on mp->mnt_renamelock
for the lowest stacked filesystem.

This keeps the hierarchy walked by O_RELATIVE_BENEATH lookups stable.
As a consequence, we no longer need to track the dvps visited, it is
enough to remember the mount points only, and even this is needed only
to properly unlock them afterward.

Taking the lowest mp using VOP_GETWRITEMOUNT() ensures that renames on
lower fs are not allowed to break the guarantees of the
O_RELATIVE_BENEATH.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

kib created this revision.Jun 2 2025, 7:41 AM

Herald added subscribers: olce, imp. · View Herald TranscriptJun 2 2025, 7:41 AM

Not even booted yet.
Simplifying relocking in VOP_RENAME() implementations is out of scope for this change, will be looked at later.

markj added inline comments.Jun 2 2025, 4:12 PM

sys/kern/vfs_lookup.c
232 ↗	(On Diff #156404)	Do we need to handle the possibility that `nt->dp` is doomed? In that case `v_mount` would be null.
sys/kern/vfs_syscalls.c
3837	I'd be worried about livelock here. If the trylock fails, can we acquire the renamelock, then re-lookup and lock vnodes involved? Or is some deadlock then possible?

Try harder to mitigate deadlock in the rename syscall and namei, lock renamelock in the blocking fashion if nowait failed.
Record dp->v_mount in tracker elements to handle possible reclaim of the tracked vnode.
Handle the case were tracker is recorded but the renamelock was not locked.

Do not record tracker element if taking the renamelock failed, at all.

markj added inline comments.Jun 2 2025, 5:06 PM

sys/kern/vfs_lookup.c
232 ↗	(On Diff #156404)	Hmm, does the vnode ref guarantee that the `struct mount` is still live? I'm not sure, delmntque() releases the ref when the vnode is reclaimed.

kib added inline comments.Jun 2 2025, 5:11 PM

sys/kern/vfs_lookup.c
232 ↗	(On Diff #156404)	struct mount is type-safe, its zone is NOFREE. We must free the same lockmgr lock as we took there, so I record v_mount.

For some reason, I had missed the existence of um_checkpath_lock, which implies that some renames (involving directories whose parent directory changes) are basically serialized. Since this was added in 2021, probably nobody really complained, but that is slightly surprising to me.

The generalization here has the potential to be much worse, serializing absolutely all renames, and affecting all filesystems, including ZFS (which is a problem only if it can support concurrent renames, which I don't know). The alternative you mentioned in 50371 seems a better way in this respect.

In D50648#1156291, @olce wrote:

For some reason, I had missed the existence of um_checkpath_lock, which implies that some renames (involving directories whose parent directory changes) are basically serialized. Since this was added in 2021, probably nobody really complained, but that is slightly surprising to me.

The generalization here has the potential to be much worse, serializing absolutely all renames, and affecting all filesystems, including ZFS (which is a problem only if it can support concurrent renames, which I don't know). The alternative you mentioned in 50371 seems a better way in this respect.

Is it really worthwhile to try to improve scalability of renames? I suspect that in practice there are enough bottlenecks elsewhere that it's not important, e.g., if multiple processes are renaming files within the same directory during a build, they will already collide on vnode locks.

sys/kern/vfs_lookup.c
237 ↗	(On Diff #156431)	Now that renames are serialized, do we need to do all of this tracking at all?
232 ↗	(On Diff #156404)	I think some comment would be valuable, even just `/* Depends on type-stability of struct mount. */`

In D50648#1156311, @markj wrote:

Is it really worthwhile to try to improve scalability of renames? I suspect that in practice there are enough bottlenecks elsewhere that it's not important, e.g., if multiple processes are renaming files within the same directory during a build, they will already collide on vnode locks.

For ZFS, my main concern would be that different datasets in a single pool should be able to operate completely in parallel, but that is not a problem here as the lock is per mount.

That said, if you have parallel renaming activity in separate parts of a single FS (such as, e.g., multiple jails on UFS, but that need not involve jails at all, let's say, parallel builds of different programs/checkouts), you could probably see a difference. By construction here, there would be no bottleneck caused by multiple renames in the same directory.

kib added inline comments.Jun 2 2025, 8:58 PM

sys/kern/vfs_lookup.c
237 ↗	(On Diff #156431)	Mmm, yes we do need. There are two possibilities, one of which would allow to simplify tracking. Basically, there is still no promise that the directory does not change parent, even with the patch, due to e.g. nullfs lower mount still allowing renames. So either we keep the current form of tracking, or we need to make the renamelock locking do the bypass. In the later case, indeed, there is no sense to record the full path of the vnodes for ORB lookups, but we still need to record the v_mount's to properly unlock. And we do need to record the bypassed v_mount's in fact. (It seems that VOP_GETWRITEMOUNT() is enough)

Track dvp->v_mount's with bypass.

markj added inline comments.Jun 3 2025, 10:04 PM

sys/kern/vfs_lookup.c
237 ↗	(On Diff #156431)	But, this also means that my concern in D50371 about concurrent renames is not a problem after all, no? I agree that with this patch, the tracking can be simplified as you suggest.

kib added inline comments.Jun 3 2025, 10:10 PM

sys/kern/vfs_lookup.c
237 ↗	(On Diff #156431)	But, this also means that my concern in D50371 about concurrent renames is not a problem after all, no? I 'after all' means 'after this patch' then yes. Not changing parent in my prev reply means that we detect the changed parent only when traversing dotdot. To not allow changing the hierarchy at the time of ORB lookup we need the renamelock.

markj added inline comments.Jun 3 2025, 10:17 PM

sys/kern/vfs_lookup.c
237 ↗	(On Diff #156431)	I meant without this patch: I think it's probably ok to change the hierarchy while an ORB lookup is taking place. My worry was that a concurrent change would allow an ORB dotdot lookup to escape the lookup dirfd, but now I think it's not possible. I forgot about the tracking list mechanism when I originally brought up the issue.

kib added a subscriber: pho.Jun 10 2025, 1:32 AM

emaste added a subscriber: emaste.Jun 14 2025, 2:30 PM

I should be able to do some parallel buildworld tests this week.

In D50648#1161451, @olce wrote:

I should be able to do some parallel buildworld tests this week.

The patch in the review is outdated, it got a lot of fixes. Peter is testing ATM, I will update the change after he is done.

In D50648#1161495, @kib wrote:

The patch in the review is outdated, it got a lot of fixes. Peter is testing ATM, I will update the change after he is done.

Ok. Could you please update it relatively soon then, so that I can test in parallel? I gather that Peter is testing through stress2? IIRC there's no scenario in that suite close to what I'd like to test.

Current version of the patch

More fixes.
And an important change, vop_rename_pre() asserts that tdvp->v_mount mnt_renamelock is owned.

This seems to behave properly with limited stress2 testing. Done by Peter.

The patch was fully tested. I do not want to leave it rot in phab.
Thank you.

markj added inline comments.Jun 27 2025, 12:08 AM

sys/kern/vfs_lookup.c
81 ↗	(On Diff #157212)	This comment applies to NDRESTART(), not NDRESTART1().
82 ↗	(On Diff #157212)	NDRESET() seems like a better name.
219 ↗	(On Diff #157212)	The line break here isn't needed.
691 ↗	(On Diff #157212)	Is it necessary to call nameicap_cleanup() here too?

kib marked 9 inline comments as done.Jun 27 2025, 1:19 AM

kib added inline comments.

sys/kern/vfs_lookup.c
82 ↗	(On Diff #157212)	I would never remember which is wider and which is more specialized.
691 ↗	(On Diff #157212)	Might be not, but it would be less surprising in future if it becomes needed even for lockless lookup. I prefer to keep it consistent, it is two more reads for linux ABI lookups.
232 ↗	(On Diff #156404)	This is now consumed by using VOP_GETWRITEMOUNT().

NDRESET()

With this change I see some test failures in sys/capsicum/functional and sys/vfs/lookup_cap_dotdot.

Assign ni_rbeneath_dpp for NI_LCF_STRICTREL set regardless of RBENEATH.

In fact Peter pointed out that O_RB was broken early on, but I only fixed it for O_RB instead of the proper scope.

In D50648#1165580, @kib wrote:

The patch was fully tested. I do not want to leave it rot in phab.
Thank you.

I understand this, sorry for the delay in testing, but please allow me until the end of this day to finish these.

I'm fine with the letter of the patch (have only two minor inline comments).

sys/kern/vfs_lookup.c
241 ↗	(On Diff #157699)	While here, could you change `"not strictrelative"` for something more descriptive like `"not strictrelative but tracker fields set"`?
801–802 ↗	(On Diff #157699)	Please merge this `MPASS()` on `error` with the one above.