Page MenuHomeFreeBSD

mnt_renamelock
ClosedPublic

Authored by kib on Jun 2 2025, 7:41 AM.
Tags
None
Referenced Files
F122715700: D50648.id157132.diff
Mon, Jul 7, 4:59 PM
Unknown Object (File)
Sun, Jul 6, 5:26 PM
Unknown Object (File)
Sun, Jul 6, 5:12 PM
Unknown Object (File)
Sun, Jul 6, 4:04 AM
Unknown Object (File)
Sat, Jul 5, 8:31 PM
Unknown Object (File)
Sat, Jul 5, 5:39 AM
Unknown Object (File)
Sat, Jul 5, 3:00 AM
Unknown Object (File)
Fri, Jul 4, 11:46 PM
Subscribers

Details

Summary
VOP_RENAME: add mp-global lock

It is before all vnode locks, but after vn_start_write().

The lock prevents parallel rename operations on the same mount point,
which should in (near future) simplify a lot of code in VFS/fs that
otherwise need to code with either the changing hierarchy, or with the
lock order for vnodes due to changed hierarchy.

On renames, the lock is taken on the lowest stacked filesystem.
Otherwise rename could still occur in parallel, by performing one of op
on the lower fs.


ufs, msdosfs: remove checkpath locks


namei dotdot tracker: take mnt_renamelock shared to prevent parallel renames

For each visited dvp vnode, take the shared lock on mp->mnt_renamelock
for the lowest stacked filesystem.

This keeps the hierarchy walked by O_RELATIVE_BENEATH lookups stable.
As a consequence, we no longer need to track the dvps visited, it is
enough to remember the mount points only, and even this is needed only
to properly unlock them afterward.

Taking the lowest mp using VOP_GETWRITEMOUNT() ensures that renames on
lower fs are not allowed to break the guarantees of the
O_RELATIVE_BENEATH.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

kib requested review of this revision.EditedJun 2 2025, 7:41 AM

Not even booted yet.
Simplifying relocking in VOP_RENAME() implementations is out of scope for this change, will be looked at later.

sys/kern/vfs_lookup.c
232 ↗(On Diff #156404)

Do we need to handle the possibility that nt->dp is doomed? In that case v_mount would be null.

sys/kern/vfs_syscalls.c
3837

I'd be worried about livelock here. If the trylock fails, can we acquire the renamelock, then re-lookup and lock vnodes involved? Or is some deadlock then possible?

kib marked 2 inline comments as done.

Try harder to mitigate deadlock in the rename syscall and namei, lock renamelock in the blocking fashion if nowait failed.
Record dp->v_mount in tracker elements to handle possible reclaim of the tracked vnode.
Handle the case were tracker is recorded but the renamelock was not locked.

Do not record tracker element if taking the renamelock failed, at all.

sys/kern/vfs_lookup.c
232 ↗(On Diff #156404)

Hmm, does the vnode ref guarantee that the struct mount is still live? I'm not sure, delmntque() releases the ref when the vnode is reclaimed.

sys/kern/vfs_lookup.c
232 ↗(On Diff #156404)

struct mount is type-safe, its zone is NOFREE. We must free the same lockmgr lock as we took there, so I record v_mount.

For some reason, I had missed the existence of um_checkpath_lock, which implies that some renames (involving directories whose parent directory changes) are basically serialized. Since this was added in 2021, probably nobody really complained, but that is slightly surprising to me.

The generalization here has the potential to be much worse, serializing absolutely all renames, and affecting all filesystems, including ZFS (which is a problem only if it can support concurrent renames, which I don't know). The alternative you mentioned in 50371 seems a better way in this respect.

For some reason, I had missed the existence of um_checkpath_lock, which implies that some renames (involving directories whose parent directory changes) are basically serialized. Since this was added in 2021, probably nobody really complained, but that is slightly surprising to me.

The generalization here has the potential to be much worse, serializing absolutely all renames, and affecting all filesystems, including ZFS (which is a problem only if it can support concurrent renames, which I don't know). The alternative you mentioned in 50371 seems a better way in this respect.

Is it really worthwhile to try to improve scalability of renames? I suspect that in practice there are enough bottlenecks elsewhere that it's not important, e.g., if multiple processes are renaming files within the same directory during a build, they will already collide on vnode locks.

sys/kern/vfs_lookup.c
237 ↗(On Diff #156431)

Now that renames are serialized, do we need to do all of this tracking at all?

232 ↗(On Diff #156404)

I think some comment would be valuable, even just /* Depends on type-stability of struct mount. */

Is it really worthwhile to try to improve scalability of renames? I suspect that in practice there are enough bottlenecks elsewhere that it's not important, e.g., if multiple processes are renaming files within the same directory during a build, they will already collide on vnode locks.

For ZFS, my main concern would be that different datasets in a single pool should be able to operate completely in parallel, but that is not a problem here as the lock is per mount.

That said, if you have parallel renaming activity in separate parts of a single FS (such as, e.g., multiple jails on UFS, but that need not involve jails at all, let's say, parallel builds of different programs/checkouts), you could probably see a difference. By construction here, there would be no bottleneck caused by multiple renames in the same directory.

sys/kern/vfs_lookup.c
237 ↗(On Diff #156431)

Mmm, yes we do need.

There are two possibilities, one of which would allow to simplify tracking. Basically, there is still no promise that the directory does not change parent, even with the patch, due to e.g. nullfs lower mount still allowing renames. So either we keep the current form of tracking, or we need to make the renamelock locking do the bypass.

In the later case, indeed, there is no sense to record the full path of the vnodes for ORB lookups, but we still need to record the v_mount's to properly unlock. And we do need to record the bypassed v_mount's in fact. (It seems that VOP_GETWRITEMOUNT() is enough)

Track dvp->v_mount's with bypass.

sys/kern/vfs_lookup.c
237 ↗(On Diff #156431)

But, this also means that my concern in D50371 about concurrent renames is not a problem after all, no?

I agree that with this patch, the tracking can be simplified as you suggest.

sys/kern/vfs_lookup.c
237 ↗(On Diff #156431)

But, this also means that my concern in D50371 about concurrent renames is not a problem after all, no?

I 'after all' means 'after this patch' then yes. Not changing parent in my prev reply means that we detect the changed parent only when traversing dotdot. To not allow changing the hierarchy at the time of ORB lookup we need the renamelock.

sys/kern/vfs_lookup.c
237 ↗(On Diff #156431)

I meant without this patch: I think it's probably ok to change the hierarchy while an ORB lookup is taking place. My worry was that a concurrent change would allow an ORB dotdot lookup to escape the lookup dirfd, but now I think it's not possible. I forgot about the tracking list mechanism when I originally brought up the issue.

I should be able to do some parallel buildworld tests this week.

I should be able to do some parallel buildworld tests this week.

The patch in the review is outdated, it got a lot of fixes. Peter is testing ATM, I will update the change after he is done.

In D50648#1161495, @kib wrote:

The patch in the review is outdated, it got a lot of fixes. Peter is testing ATM, I will update the change after he is done.

Ok. Could you please update it relatively soon then, so that I can test in parallel? I gather that Peter is testing through stress2? IIRC there's no scenario in that suite close to what I'd like to test.

Current version of the patch

More fixes.
And an important change, vop_rename_pre() asserts that tdvp->v_mount mnt_renamelock is owned.

This seems to behave properly with limited stress2 testing. Done by Peter.

The patch was fully tested. I do not want to leave it rot in phab.
Thank you.

sys/kern/vfs_lookup.c
81 ↗(On Diff #157212)

This comment applies to NDRESTART(), not NDRESTART1().

82 ↗(On Diff #157212)

NDRESET() seems like a better name.

219 ↗(On Diff #157212)

The line break here isn't needed.

691 ↗(On Diff #157212)

Is it necessary to call nameicap_cleanup() here too?

kib marked 9 inline comments as done.Fri, Jun 27, 1:19 AM
kib added inline comments.
sys/kern/vfs_lookup.c
82 ↗(On Diff #157212)

I would never remember which is wider and which is more specialized.

691 ↗(On Diff #157212)

Might be not, but it would be less surprising in future if it becomes needed even for lockless lookup. I prefer to keep it consistent, it is two more reads for linux ABI lookups.

232 ↗(On Diff #156404)

This is now consumed by using VOP_GETWRITEMOUNT().

kib marked 3 inline comments as done.
kib edited the summary of this revision. (Show Details)

NDRESET()

With this change I see some test failures in sys/capsicum/functional and sys/vfs/lookup_cap_dotdot.

Assign ni_rbeneath_dpp for NI_LCF_STRICTREL set regardless of RBENEATH.

In fact Peter pointed out that O_RB was broken early on, but I only fixed it for O_RB instead of the proper scope.

In D50648#1165580, @kib wrote:

The patch was fully tested. I do not want to leave it rot in phab.
Thank you.

I understand this, sorry for the delay in testing, but please allow me until the end of this day to finish these.

I'm fine with the letter of the patch (have only two minor inline comments).

sys/kern/vfs_lookup.c
241 ↗(On Diff #157699)

While here, could you change "not strictrelative" for something more descriptive like "not strictrelative but tracker fields set"?

801–802 ↗(On Diff #157699)

Please merge this MPASS() on error with the one above.

I understand this, sorry for the delay in testing, but please allow me until the end of this day to finish these.

Not being able to test all I wanted, but if you're confident enough, don't wait for me, I may perform some remaining ones a bit later.

This revision is now accepted and ready to land.Fri, Jun 27, 9:44 PM
sys/kern/vfs_lookup.c
267 ↗(On Diff #157699)

This comment is stale.

kib marked an inline comment as done.

Update herald comment for nameicap_check_dotdot()

This revision now requires review to proceed.Sat, Jun 28, 2:42 AM
kib marked 2 inline comments as done.

Update KASSERT message.
Remove redundant MPASS().

This revision is now accepted and ready to land.Tue, Jul 1, 2:01 PM

Still small nit in MPASS().

sys/kern/vfs_lookup.c
798 ↗(On Diff #157730)
kib marked an inline comment as done.

Fix assert.

This revision now requires review to proceed.Tue, Jul 1, 3:30 PM
This revision is now accepted and ready to land.Tue, Jul 1, 3:59 PM