
unlink, rmdir: call notify_upper from VOP pre method instead of syscall
Closed, Public

Authored by kib on Tue, Feb 4, 12:26 AM.
Details

Summary
Suppose that there are two or more nullfs mounts over some fs, and
suppose that we unlink a file on one of the nullfs mounts.
This way, notify_upper gets called on the lower vnode as well, allowing
the other nullfs mounts to note the unlink and drop their caches for the
unlinked vnode.

PR:     254210
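
For illustration, a minimal sketch of the idea (the hook name below is made up; vfs_notify_upper() and VFS_NOTIFY_UPPER_UNLINK are the existing primitives): instead of notifying from the unlink/rmdir syscalls, the notification is issued from a generic VOP pre method, so it also runs when the operation reaches the lower vnode through the nullfs bypass:

    #include <sys/param.h>
    #include <sys/mount.h>
    #include <sys/vnode.h>

    /*
     * Sketch only: the hook name and its exact placement are illustrative,
     * not the actual patch.
     */
    static void
    vop_unlink_pre_sketch(struct vnode *vp)
    {
        /*
         * Fires for every layer the VOP passes through, so sibling nullfs
         * mounts over the same lower fs learn about the unlink and can
         * drop their cached upper vnodes.
         */
        vfs_notify_upper(vp, VFS_NOTIFY_UPPER_UNLINK);
    }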

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

kib requested review of this revision. Tue, Feb 4, 12:26 AM

The approach looks generally OK to me, but please take the opportunity to move notifying the upper layers into the "post" hooks instead of the "pre" ones, under the condition that rc is effectively 0, as it seems wasted work to destroy the upper vnodes if the lower one isn't (as they are probably going to need to be reconstructed later).
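
(For concreteness, a rough sketch of what that suggestion might look like, assuming the existing vop_remove_post() hook keeps its (ap, rc) shape; this is an illustration, not the proposed code:)

    /*
     * Illustrative only: notify the upper layers from the post hook, and
     * only when the lower VOP actually succeeded, so upper vnodes are not
     * torn down for an unlink that failed and will likely be needed again.
     */
    static void
    vop_remove_post_sketch(void *ap, int rc)
    {
        struct vop_remove_args *a = ap;

        /* ... existing post-hook work (knotes, etc.) elided ... */
        if (rc == 0)
            vfs_notify_upper(a->a_vp, VFS_NOTIFY_UPPER_UNLINK);
    }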

I ran all of the stress2 tests using nullfs and also ronald@'s test scenario. LGTM.

The approach looks generally OK to me, but please take the opportunity to move notifying the upper layers into the "post" hooks instead of the "pre" ones, under the condition that rc is effectively 0, as it seems wasted work to destroy the upper vnodes if the lower one isn't (as they are probably going to need to be reconstructed later).

Well, optimizing for the failing cases should not be a necessity. That said, I am not sure that there are no complications with the vnode ref counts if the upper notification is moved post-VOP (I might be too cautious there, but that is the reason to keep this separate). You can do it after my change has landed.

In D48825#1113760, @kib wrote:

Well, optimizing for the failing cases should not be a necessity. That said, I am not sure that there are no complications with the vnode ref counts if the upper notification is moved post-VOP (I might be too cautious there, but that is the reason to keep this separate). You can do it after my change has landed.

Fine. I don't see how such additional complications could exist as both VOPs take an exclusively locked vnode and must return it back in the same state (if the stacking FSes temporarily drop the lock, they may have to be careful about the ref count, but that is irrespective of the move to the post hooks), but that probably warrants a new round of testing, and pho@ has already tested the current patch.

This revision is now accepted and ready to land. Tue, Feb 4, 4:04 PM
In D48825#1113760, @kib wrote:

Well, optimizing for the failing cases should not be a necessity. That said, I am not sure that there are no complications with the vnode ref counts if the upper notification is moved post-VOP (I might be too cautious there, but that is the reason to keep this separate). You can do it after my change has landed.

Fine. I don't see how such additional complications could exist as both VOPs take an exclusively locked vnode and must return it back in the same state (if the stacking FSes temporarily drop the lock, they may have to be careful about the ref count, but that is irrespective of the move to the post hooks), but that probably warrants a new round of testing, and pho@ has already tested the current patch.

Dropping the lock is the issue. For instance, UFS may successfully unlink but drop the lock, which allows the upper vnode to be reclaimed, and then the other upper filesystems are not notified if the call is moved after the VOP.

In D48825#1114098, @kib wrote:

Dropping the lock is the issue. For instance, UFS may successfully unlink but drop the lock, which allows the upper vnode to be reclaimed, and then the other upper filesystems are not notified if the call is moved after the VOP.

You're right. But does it matter? In the case you describe, some upper vnodes are reclaimed without having been notified of some lower vnode unlink, and I don't see any conceptual problem with that. Same if replacing "which allows the upper vnode to be reclaimed" with "which allows the *lower* vnode to be reclaimed": In this case, in the post hook, the mountpoint has been disconnected from the vnode already, so we aren't able to notify upper layers for unlink, but the dooming process itself will have already notified them about the reclaim, so again I don't see it as a problem. It would be quite surprising if nullfs doesn't support that. I'll check before drafting something.

In D48825#1114098, @kib wrote:

Dropping the lock is the issue. For instance, UFS may successfully unlink but drop the lock, which allows the upper vnode to be reclaimed, and then the other upper filesystems are not notified if the call is moved after the VOP.

You're right. But does it matter?

Yes it does. The second sibling nullfs mount still references the lower vnode, so the use count on it is non-zero. For instance, on UFS, this prevents vop_inactive() from truncating the file, clearing the inode, and reclaiming the vnode. Basically, it allows the issue fixed in this review to return, but only in a racy way.

In the case you describe, some upper vnodes are reclaimed without having been notified of some lower vnode unlink, and I don't see any conceptual problem with that. Same if replacing "which allows the upper vnode to be reclaimed" with "which allows the *lower* vnode to be reclaimed": In this case, in the post hook, the mountpoint has been disconnected from the vnode already, so we aren't able to notify upper layers for unlink, but the dooming process itself will have already notified them about the reclaim, so again I don't see it as a problem. It would be quite surprising if nullfs doesn't support that. I'll check before drafting something.

In D48825#1114388, @kib wrote:

You're right. But does it matter?

Yes it does. The second sibling nullfs mount still references the lower vnode, so the use count on it is non-zero. For instance, on UFS, this prevents vop_inactive() from truncating the file, clearing the inode, and reclaiming the vnode. Basically, it allows the issue fixed in this review to return, but only in a racy way.

Oh, you were talking about sibling nullfs mounts, and not stacked ones. I still don't think that this causes a conceptual problem, though.

In the post wrapper case, the unlink from one of the nullfs mounts is propagated to the lower layer (e.g., UFS), which then will call vfs_notify_upper(VFS_NOTIFY_UPPER_UNLINK) (provided the unlink succeeds, in the post case), unless the vnode was doomed in the meantime, in which case vfs_notify_upper(VFS_NOTIFY_UPPER_RECLAIM) will have been called instead (by the other thread that did the vgone()).

As said above, this all boils down to nullfs behaving correctly when no VFS_NOTIFY_UPPER_UNLINK notification has been received but a VFS_NOTIFY_UPPER_RECLAIM instead. I see a priori no reason why it shouldn't, but I'll check.
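
(For readers following along: vfs_notify_upper() is the dispatcher that walks the mounts stacked above the vnode's mount and invokes their per-mount callbacks. The sketch below is heavily simplified: the iterator is a stand-in, the VFS_RECLAIM_LOWERVP()/VFS_UNLINK_LOWERVP() wrappers around the nullfs-implemented vfsops entries are assumed, and all list handling and locking are elided:)

    /* Heavily simplified sketch of the dispatch, not the real function. */
    static void
    vfs_notify_upper_sketch(struct vnode *vp, int event)
    {
        struct mount *ump;

        /* FOREACH_UPPER_MOUNT() is a stand-in for the real bookkeeping. */
        FOREACH_UPPER_MOUNT(ump, vp->v_mount) {
            switch (event) {
            case VFS_NOTIFY_UPPER_RECLAIM:
                VFS_RECLAIM_LOWERVP(ump, vp);  /* upper fs dooms its vnode */
                break;
            case VFS_NOTIFY_UPPER_UNLINK:
                VFS_UNLINK_LOWERVP(ump, vp);   /* upper fs drops its cache */
                break;
            }
        }
    }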

In D48825#1114388, @kib wrote:

You're right. But does it matter?

Yes it does. The second sibling nullfs mount still references the lower vnode, so the use count on it is non-zero. For instance, on UFS, this prevents vop_inactive() from truncating the file, clearing the inode, and reclaiming the vnode. Basically, it allows the issue fixed in this review to return, but only in a racy way.

Oh, you were talking about sibling nullfs mounts, and not stacked ones. I still don't think that this causes a conceptual problem, though.

In the post wrapper case, the unlink from one of the nullfs mounts is propagated to the lower layer (e.g., UFS), which then will call vfs_notify_upper(VFS_NOTIFY_UPPER_UNLINK) (provided the unlink succeeds, in the post case), unless the vnode was doomed in the meantime, in which case vfs_notify_upper(VFS_NOTIFY_UPPER_RECLAIM) will have been called instead (by the other thread that did the vgone()).

As said above, this all boils down to nullfs behaving correctly when no VFS_NOTIFY_UPPER_UNLINK notification has been received but a VFS_NOTIFY_UPPER_RECLAIM instead. I see a priori no reason why it shouldn't, but I'll check.

No, with your proposal the call would only propagate to sibling uppers if it succeeded, which is not true when the lower vnode is unlocked.

In D48825#1114446, @kib wrote:

No, with your proposal the call would only propagate to sibling uppers if it succeeded, which is not true when the lower vnode is unlocked.

The only way I have found so far to make sense of this sentence is to suspect you're confused. Whether the lower vnode is unlocked or not has no bearing on the return code of the *lower* routine (I specifically checked that for UFS; but, in general, if some routine reports zero or non-zero, we have to assume that it did or did not perform its actions accordingly, regardless of whether the vnode was temporarily unlocked). It's that return code that matters for calling vfs_notify_upper() in the vop_post*() handlers, not that of the upper routine (that of nullfs in your scenario; additionally, I also don't see how null_remove() or null_rmdir() could return an error even if the lower vnode was temporarily unlocked and the upper vnode doomed in that interval, as null_nodeget() doesn't check for that condition and can only fail if insmntque1() fails, and the rest of null_bypass() doesn't set the returned error on its own).

If you still think I'm wrong, could you give more details about which part of what I said is incorrect, and/or about the scenario you see failing?

I've made the (obvious) changes locally and will soon test them. That will probably shed light on my misunderstandings, if any.

Look at ERELOOKUP, specifically the one coming from softdep_prelink().

In D48825#1114649, @kib wrote:

Look at ERELOOKUP, specifically the one coming from softdep_prelink().

I had seen them already (there is a call in both ufs_remove() and ufs_rmdir()). When ERELOOKUP is returned, it means that FFS had to flush some journal entries, and the caller has to relookup the relevant vnodes because they may have been doomed while temporarily unlocked. But, again, if vp has been doomed, all uppers will have already received VFS_NOTIFY_UPPER_RECLAIM anyway. ERELOOKUP, like any other error, is just passed through by null_bypass(), triggering a relookup in kern_funlinkat(). So, assume we got ERELOOKUP but vp was not doomed: Either we are going to relook it up via nullfs and apply VOP_REMOVE()/VOP_RMDIR(), perhaps several times, until 0 is returned, and VFS_NOTIFY_UPPER_UNLINK will eventually be sent, or we won't, because it is no longer the vnode referenced by that name in the relevant directory, and then nullfs really has no reason to release the reference its wrapping vnode still holds. So, where is the problem?
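
(Schematically, the syscall-side retry referred to here; this is a sketch and not the real kern_funlinkat(): lookup_for_delete() is a hypothetical stand-in for the NDINIT()/namei() setup, and mount busying, MAC checks, and the dvp == vp case are elided:)

    static int
    unlink_retry_sketch(struct nameidata *ndp)
    {
        int error;

        do {
            /* Hypothetical helper: set up and run the DELETE lookup. */
            error = lookup_for_delete(ndp);
            if (error != 0)
                return (error);
            error = VOP_REMOVE(ndp->ni_dvp, ndp->ni_vp, &ndp->ni_cnd);
            vput(ndp->ni_vp);
            vput(ndp->ni_dvp);
        } while (error == ERELOOKUP);  /* vnodes may be stale: look the name up again */
        return (error);
    }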

The problem is when the upper vnode is reclaimed but the lower one is not, which can happen while the lower vnode is temporarily unlocked for something. Then, despite being unlinked, the lower vnode is still referenced by other upper vnodes, which prevents inactivation from truncating and clearing the inode.

In D48825#1114666, @kib wrote:

The problem is when the upper vnode is reclaimed but the lower one is not, which can happen while the lower vnode is temporarily unlocked for something. Then, despite being unlinked, the lower vnode is still referenced by other upper vnodes, which prevents inactivation from truncating and clearing the inode.

No, because if the lower vnode has actually been unlinked, then the lower layer's VOP_REMOVE() (e.g., ufs_remove()) will return 0, and the post hook will send VFS_NOTIFY_UPPER_UNLINK to the other upper vnodes (and then these will free their references and in the end cause proper inactivation of the vnode). If ERELOOKUP was returned by softdep_prelink(), and thus by ufs_remove(), then the vnode has not been unlinked, and we don't need/want to inactivate it.

In D48825#1114666, @kib wrote:

The problem is when the upper vnode is reclaimed but the lower one is not, which can happen while the lower vnode is temporarily unlocked for something. Then, despite being unlinked, the lower vnode is still referenced by other upper vnodes, which prevents inactivation from truncating and clearing the inode.

No, because if the lower vnode has actually been unlinked, then the lower layer's VOP_REMOVE() (e.g., ufs_remove()) will return 0,

This is not true, and is not guaranteed.
Again, the VOP can relock at any moment, and indicate to the VFS layer that this indeed occurred. There is no breakage of contract in doing some part of the requested work and then relocking.

and the post hook will send VFS_NOTIFY_UPPER_UNLINK to the other upper vnodes (and then these will free their references and in the end cause proper inactivation of the vnode). If ERELOOKUP was returned by softdep_prelink(), and thus by ufs_remove(), then the vnode has not been unlinked, and we don't need/want to inactivate it.

I was reviewing prelink and VOP_VPUT_PAIR() due to this discussion, and now I think that VOP_VPUT_PAIR() was applied too cautiously, instead of more properly and aggressively. E.g., it is currently only put into places where UFS might create a new object, instead of all syscalls which need to vput the pair, like remove or rmdir. This encodes too subtle a knowledge of the current UFS internals into the VFS.

In D48825#1114691, @kib wrote:

No, because if the lower vnode has actually been unlinked, then the lower layer's VOP_REMOVE() (e.g., ufs_remove()) will return 0,

This is not true, and is not guaranteed.

You mentioned softdep_prelink() but I don't think it invalidates what I said, since it doesn't perform any unlink operation at all and if it returns ERELOOKUP then ufs_rmdir() and ufs_remove() just exit right away (as they should).

For UFS, the "real meat" (see below) of the unlink is performed by ufs_dirremove(). It does take care of reverting the link count update if the UFS_BLKATOFF() call to get the block containing the directory entry fails and it has to return early with an error. However, after the directory entry has been updated (and the file link count as well), it can still return an error if writing the directory block fails (but only if that write is synchronous, obviously). That seems to be the only place where returning non-zero can happen although UFS logically considers the file to have been unlinked.
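
(To make that shape concrete, here is a schematic reconstructed from the description above, not copied from ufs_dirremove(); the function name and the sync_write parameter are invented, and UFS_BLKATOFF() is assumed to keep its (vp, offset, res, bpp) form. The point is where an error can still be returned after the unlink has already happened in memory:)

    static int
    dirremove_shape_sketch(struct vnode *dvp, off_t offset, bool sync_write)
    {
        struct buf *bp;
        int error;

        error = UFS_BLKATOFF(dvp, offset, NULL, &bp);
        if (error != 0) {
            /* Early failure: the already-done link-count update is
             * reverted, nothing has been unlinked. */
            return (error);
        }
        /* Clear or merge the directory entry: from this point on, the name
         * is gone from the in-memory directory, whatever we return. */
        if (!sync_write) {
            bdwrite(bp);        /* delayed write: no error to report */
            return (0);
        }
        return (bwrite(bp));    /* may fail although the unlink already
                                   "happened" as far as lookups go */
    }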

It is historically understandable to return an error in this case, because we generally like to notify users that something went wrong. However, it is seriously debatable that this is still a good idea in an era where there are other ways to write back than synchronous writes, and where the latter are statistically nonexistent except in corner cases. If an application really wants to know if something has reached the disk, it has to use fsync() or variants. Consequently, returning an error although the internal FS logic has settled down correctly (the operation is seen as performed from the outside) is at best of dubious value, and I'd even argue it is actually harmful as some applications may be tempted to give up (or sometimes retry) on this error, although the unlink has in fact been performed (even if not written back to disk). Moreover, even the current error reporting in the synchronous write case is incomplete, as only the directory portion is written in ufs_dirremove(), but not the new link count of the unlinked inode (which I think will be reflected on disk only at inactivation or fsync()).

There is no breakage of contract in doing some part of the requested work and then relocking.

It is not that simple.

While it is possible to do the work in chunks internally, externally there are only two possibilities: The requested unlink has really been performed or it has not (forgetting the case of an indefinitely delayed answer). That's why I called ufs_dirremove() the "real meat" above: It really performs memory modifications reflecting the unlink, and once they are done, other threads will not be able to retrieve the name that was removed (regardless of any error it may return after that point).

As developed above, returning an error while the unlink succeeded (in the memory structures) conflates whether the operation was successful and whether it will be durable. As said, I think the best option is simply to change that by forgetting about the durability aspect. Alternatively, or in the meantime, as the returned error is what is in the end communicated to userland, the simplest way out seems to be to create a new vnode flag indicating in-memory completion, which must be set or unset by VOP_RMDIR()/VOP_REMOVE() after the last relock of vp has been performed (so it is not read by the wrong call).
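
(A tiny sketch to make that alternative concrete; the flag name and bit are invented, and using vn_irflag_set()/vn_irflag_read() for it is only an assumption about where such a bit could live:)

    /* Hypothetical flag, name and bit value made up for illustration. */
    #define VIRF_UNLINKED_INMEM  0x0100

    /* In the VOP_REMOVE()/VOP_RMDIR() implementations, after the last
     * point where vp may have been relocked: */
        vn_irflag_set(vp, VIRF_UNLINKED_INMEM);

    /* In the post hook, key the notification on the flag rather than on
     * the VOP return value: */
        if ((vn_irflag_read(vp) & VIRF_UNLINKED_INMEM) != 0)
            vfs_notify_upper(vp, VFS_NOTIFY_UPPER_UNLINK);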

What do you think?

I was reviewing prelink and VOP_VPUT_PAIR() due to this discussion, and now I think that VOP_VPUT_PAIR() was applied too cautiously, instead of more properly and aggressively. E.g., it is currently only put into places where UFS might create a new object, instead of all syscalls which need to vput the pair, like remove or rmdir. This encodes too subtle a knowledge of the current UFS internals into the VFS.

Yes, it has to be called also for directory entry removal, so that UFS can perform directory compaction, and, due to the VFS interface, this is not done in the implementations of VOP_RMDIR()/VOP_REMOVE() themselves.
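
(A sketch of the hypothetical call site this would imply in the remove/rmdir syscalls; it assumes that VOP_VPUT_PAIR() with unlock_vp == true releases both vnodes, as the default vop_stdvput_pair() does:)

        error = VOP_REMOVE(dvp, vp, &nd.ni_cnd);
        /* Instead of vput(dvp); vput(vp);, give the filesystem both vnodes
         * so that e.g. UFS could compact the directory here. */
        VOP_VPUT_PAIR(dvp, &vp, true);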