VOP_RENAME: add mp-global lock It is before all vnode locks, but after vn_start_write(). The lock prevents parallel rename operations on the same mount point, which should in (near future) simplify a lot of code in VFS/fs that otherwise need to code with either the changing hierarchy, or with the lock order for vnodes due to changed hierarchy. On renames, the lock is taken on the lowest stacked filesystem. Otherwise rename could still occur in parallel, by performing one of op on the lower fs. ufs, msdosfs: remove checkpath locks namei dotdot tracker: take mnt_renamelock shared to prevent parallel renames For each visited dvp vnode, take the shared lock on mp->mnt_renamelock for the lowest stacked filesystem. This keeps the hierarchy walked by O_RELATIVE_BENEATH lookups stable. As a consequence, we no longer need to track the dvps visited, it is enough to remember the mount points only, and even this is needed only to properly unlock them afterward. Taking the lowest mp using VOP_GETWRITEMOUNT() ensures that renames on lower fs are not allowed to break the guarantees of the O_RELATIVE_BENEATH.
Details
- Reviewers
markj olce - Commits
- rG2f60984053e5: namei dotdot tracker: take mnt_renamelock shared to prevent parallel renames
rG2ffee9aef7c2: vfs_lookup: split NDRESTART
rG369a2542caa9: msdosfs: remove pm_checkpath_lock
rG2c17429915f2: ufs: remove um_checkpath_lock
rGef6ea91593eb: VOP_RENAME: add mp-global lock
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
Not even booted yet.
Simplifying relocking in VOP_RENAME() implementations is out of scope for this change, will be looked at later.
sys/kern/vfs_lookup.c | ||
---|---|---|
232 ↗ | (On Diff #156404) | Do we need to handle the possibility that nt->dp is doomed? In that case v_mount would be null. |
sys/kern/vfs_syscalls.c | ||
3837 | I'd be worried about livelock here. If the trylock fails, can we acquire the renamelock, then re-lookup and lock vnodes involved? Or is some deadlock then possible? |
Try harder to mitigate deadlock in the rename syscall and namei, lock renamelock in the blocking fashion if nowait failed.
Record dp->v_mount in tracker elements to handle possible reclaim of the tracked vnode.
Handle the case were tracker is recorded but the renamelock was not locked.
sys/kern/vfs_lookup.c | ||
---|---|---|
232 ↗ | (On Diff #156404) | Hmm, does the vnode ref guarantee that the struct mount is still live? I'm not sure, delmntque() releases the ref when the vnode is reclaimed. |
sys/kern/vfs_lookup.c | ||
---|---|---|
232 ↗ | (On Diff #156404) | struct mount is type-safe, its zone is NOFREE. We must free the same lockmgr lock as we took there, so I record v_mount. |
For some reason, I had missed the existence of um_checkpath_lock, which implies that some renames (involving directories whose parent directory changes) are basically serialized. Since this was added in 2021, probably nobody really complained, but that is slightly surprising to me.
The generalization here has the potential to be much worse, serializing absolutely all renames, and affecting all filesystems, including ZFS (which is a problem only if it can support concurrent renames, which I don't know). The alternative you mentioned in 50371 seems a better way in this respect.
Is it really worthwhile to try to improve scalability of renames? I suspect that in practice there are enough bottlenecks elsewhere that it's not important, e.g., if multiple processes are renaming files within the same directory during a build, they will already collide on vnode locks.
sys/kern/vfs_lookup.c | ||
---|---|---|
237 ↗ | (On Diff #156431) | Now that renames are serialized, do we need to do all of this tracking at all? |
232 ↗ | (On Diff #156404) | I think some comment would be valuable, even just /* Depends on type-stability of struct mount. */ |
For ZFS, my main concern would be that different datasets in a single pool should be able to operate completely in parallel, but that is not a problem here as the lock is per mount.
That said, if you have parallel renaming activity in separate parts of a single FS (such as, e.g., multiple jails on UFS, but that need not involve jails at all, let's say, parallel builds of different programs/checkouts), you could probably see a difference. By construction here, there would be no bottleneck caused by multiple renames in the same directory.
sys/kern/vfs_lookup.c | ||
---|---|---|
237 ↗ | (On Diff #156431) | Mmm, yes we do need. There are two possibilities, one of which would allow to simplify tracking. Basically, there is still no promise that the directory does not change parent, even with the patch, due to e.g. nullfs lower mount still allowing renames. So either we keep the current form of tracking, or we need to make the renamelock locking do the bypass. In the later case, indeed, there is no sense to record the full path of the vnodes for ORB lookups, but we still need to record the v_mount's to properly unlock. And we do need to record the bypassed v_mount's in fact. (It seems that VOP_GETWRITEMOUNT() is enough) |
sys/kern/vfs_lookup.c | ||
---|---|---|
237 ↗ | (On Diff #156431) |
I 'after all' means 'after this patch' then yes. Not changing parent in my prev reply means that we detect the changed parent only when traversing dotdot. To not allow changing the hierarchy at the time of ORB lookup we need the renamelock. |
sys/kern/vfs_lookup.c | ||
---|---|---|
237 ↗ | (On Diff #156431) | I meant without this patch: I think it's probably ok to change the hierarchy while an ORB lookup is taking place. My worry was that a concurrent change would allow an ORB dotdot lookup to escape the lookup dirfd, but now I think it's not possible. I forgot about the tracking list mechanism when I originally brought up the issue. |
The patch in the review is outdated, it got a lot of fixes. Peter is testing ATM, I will update the change after he is done.
Ok. Could you please update it relatively soon then, so that I can test in parallel? I gather that Peter is testing through stress2? IIRC there's no scenario in that suite close to what I'd like to test.
More fixes.
And an important change, vop_rename_pre() asserts that tdvp->v_mount mnt_renamelock is owned.
This seems to behave properly with limited stress2 testing. Done by Peter.
sys/kern/vfs_lookup.c | ||
---|---|---|
82 ↗ | (On Diff #157212) | I would never remember which is wider and which is more specialized. |
691 ↗ | (On Diff #157212) | Might be not, but it would be less surprising in future if it becomes needed even for lockless lookup. I prefer to keep it consistent, it is two more reads for linux ABI lookups. |
232 ↗ | (On Diff #156404) | This is now consumed by using VOP_GETWRITEMOUNT(). |
With this change I see some test failures in sys/capsicum/functional and sys/vfs/lookup_cap_dotdot.
Assign ni_rbeneath_dpp for NI_LCF_STRICTREL set regardless of RBENEATH.
In fact Peter pointed out that O_RB was broken early on, but I only fixed it for O_RB instead of the proper scope.
I understand this, sorry for the delay in testing, but please allow me until the end of this day to finish these.
I'm fine with the letter of the patch (have only two minor inline comments).
sys/kern/vfs_lookup.c | ||
---|---|---|
241 ↗ | (On Diff #157699) | While here, could you change "not strictrelative" for something more descriptive like "not strictrelative but tracker fields set"? |
801–802 ↗ | (On Diff #157699) | Please merge this MPASS() on error with the one above. |
Not being able to test all I wanted, but if you're confident enough, don't wait for me, I may perform some remaining ones a bit later.
sys/kern/vfs_lookup.c | ||
---|---|---|
267 ↗ | (On Diff #157699) | This comment is stale. |