
ufs: end-of-life truncate should depend on dirent write
Needs Review · Public

Authored by rlibby on Oct 26 2020, 8:33 PM.

Details

Summary

Ensure that a file is not truncated until after it is unlinked on disk.

When a dirent is removed or changed due to a remove or rename, we
correctly made the write of the inode that decremented the link count
depend on the write of the buf containing the change to the dirent.
However, we did not prevent the end-of-life truncate from running under
the vput / vinactive following the remove or rename. Therefore it used
to happen that we would do the end-of-life truncate, including the inode
write with a size of zero (but still with the old link count) before the
dirent write. If we then crashed before the dirent write, we would
recover with a state where the file was still linked, but had been
truncated to zero.

An example where upon reboot /tmp/foo is linked but has size zero:
dd if=/usr/share/dict/words of=/tmp/foo bs=8k count=1
fsync /tmp/foo /tmp
sync
rm -f /tmp/foo
sleep 10
reboot -nq # or panic, reset, etc

The same problem happens with rename, where the consequence is worse.
mv -f /etc/foo.tmp /etc/foo can result in /etc/foo being truncated to
zero upon crash recovery.

Now, delay the end-of-life truncate until after the i_nlink count, and
not just the i_effnlink count, reaches zero. With softdep, the
difference between the two is whether the change to a directory page
corresponding to an unlink has landed on disk.


I'm not at all sure that this is the best way to enforce such a
dependency. However, it is simple and seems to work.
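The intended ordering can be illustrated with a toy userland model (all names and fields here are hypothetical simplifications, not the kernel code): i_effnlink drops as soon as the unlink happens in memory, i_nlink drops only once the dirent write has landed, and the end-of-life truncate is gated on i_nlink.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the link counts (hypothetical names, not ufs code). */
struct toy_inode {
	int i_nlink;      /* link count as it will next appear on disk */
	int i_effnlink;   /* effective in-memory link count */
	size_t i_size;
};

static void
toy_unlink(struct toy_inode *ip)
{
	ip->i_effnlink--;   /* dirent change is only queued, not on disk */
}

static void
toy_dirent_write(struct toy_inode *ip)
{
	ip->i_nlink--;      /* softdep lets i_nlink follow the dirent write */
}

/* End-of-life truncate gated on i_nlink, not i_effnlink. */
static void
toy_inactive(struct toy_inode *ip)
{
	if (ip->i_nlink <= 0 && ip->i_size != 0)
		ip->i_size = 0;   /* safe: the unlink is already on disk */
}
```

With this gating, an inactivation between the unlink and the dirent write is a no-op, so a crash at that point recovers a still-linked file with its contents intact.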

Test Plan
kyua test -k /usr/tests/sys/Kyuafile acl aio fifo file fs kern kqueue sys vfs vm

dd if=/usr/share/dict/words of=/tmp/foo bs=8k count=1
fsync /tmp/foo /tmp
sync
rm -f /tmp/foo
sleep 10
reboot -nq

dd if=/usr/share/dict/words of=/tmp/foo bs=8k count=1
dd if=/usr/share/dict/words of=/tmp/foo.tmp bs=8k count=1
fsync /tmp/foo /tmp/foo.tmp /tmp
sync
mv -f /tmp/foo.tmp /tmp/foo
sleep 10
reboot -nq

cd stress2/misc
./all.sh -o

No performance testing done.


Event Timeline

rlibby created this revision.

You note that ``If we then crashed before the dirent write, we would recover with a state where the file was still linked, but had been truncated to zero. The resulting state could be considered corruption.'' The filesystem is not corrupted in the sense that it needs fsck to be run to clean it up. We simply end up with an unexpected result.

Given that you have come up with a simple way to avoid this unexpected result, I am not averse to making this change if it can pass Peter Holm's filesystem torture tests.

I am not clear on why you have the "isparent" test for deciding to do the vunref(). Is this to detect that the dirrem is not being reused for its second use when renaming a directory within the same directory?

If so, then I would code that logic based on control flow rather than the current way which is impossible to understand. By control flow I mean set a bool "dovunref = 1" at the top then set "dovunref = 0" just before the reuse of the dirrem at "dirrem->dm_state = ONDEPLIST;"
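A literal reading of the suggestion might look like this toy sketch (hypothetical names and a counter standing in for vunref(); the follow-up below discusses why the rmdir reuse case complicates it):

```c
#include <assert.h>
#include <stdbool.h>

#define ONDEPLIST 0x0001

/* Toy rendering of the suggested control-flow style: start with
 * dovunref = true and clear it just before the dirrem is reused
 * for its second (parent) pass. */
struct toy_dirrem {
	int dm_state;
};

static int vunref_calls;         /* stand-in for vunref(vp) */

static void
handle_remove_sketch(struct toy_dirrem *dirrem, bool reuse_for_parent)
{
	bool dovunref = true;

	if (reuse_for_parent) {
		dovunref = false;                /* second use: no ref was taken */
		dirrem->dm_state = ONDEPLIST;    /* requeue the reused dirrem */
	}
	if (dovunref)
		vunref_calls++;
}
```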

You note that ``If we then crashed before the dirent write, we would recover with a state where the file was still linked, but had been truncated to zero. The resulting state could be considered corruption.'' The filesystem is not corrupted in the sense that it needs fsck to be run to clean it up. We simply end up with an unexpected result.

That's right, fsck won't help us but we have an unexpected / uncommanded result. Internally we ran into this where we ended up losing the contents of files in /etc after they were updated in what seemed like a safe manner with rename.

Given that you have come up with a simple way to avoid this unexpected result, I am not averse to making this change if it can pass Peter Holm's filesystem torture tests.

Thanks, I'll CC him and see about running that test myself too. That is just for integrity and not for performance, correct?

I am not clear on why you have the "isparent" test for deciding to do the vunref(). Is this to detect that the dirrem is not being reused for its second use when renaming a directory within the same directory?

Yes, it's because of the dirrem reuse, which is done when the link count of the parent changes due to the net unlink of a directory. As coded in the patch, in newdirrem we only take a vref on the child, and not on the parent, and so we only release on the child and not on the parent. For rmdir, the dirrem is reused for the parent after being used for the child. isparent is meant to indicate whether this call of handle_workitem_remove is for the child or is the reuse for the parent.

If so, then I would code that logic based on control flow rather than the current way which is impossible to understand. By control flow I mean set a bool "dovunref = 1" at the top then set "dovunref = 0" just before the reuse of the dirrem at "dirrem->dm_state = ONDEPLIST;"

I don't think I understand your suggested implementation. For the rmdir case, even when we reuse the dirrem we still need to vunref the child directory, and then additionally on the reuse for parent we take the path labeled "Normal file deletion" (RMDIR is now cleared), which, as I understand the suggestion, would still have dovunref set.

Would maybe some additional comments about why we're comparing the oldinum vs dirinum be a path forward?

Alternately, I also thought about adding a flag for dm_state to represent whether a vref was held, but I thought this was simpler.

Fix leaked ref in recursive case

Fix a leak of the vref in the rmdir case that results in recursion: we need to vunref there too, so just do it at the top, after the vget.

This doesn't fully address Kirk's feedback.

rlibby edited the test plan for this revision. (Show Details)

Given that you have come up with a simple way to avoid this unexpected result, I am not averse to making this change if it can pass Peter Holm's filesystem torture tests.

Thanks for the tip. stress2's fs.sh exposed a leak in the recursive case. It passes now. I'm running the rest of all.sh overnight.

Distinguish first and second pass with state and vtype

Besides the ino check being inscrutable, it was also unsafe because one
of the fields was held in a union that was also in use.

sys/ufs/ffs/ffs_softdep.c
9258 (On Diff #78800)

This is maybe just true but I wasn't positive, as otherwise I don't understand why isrmdir would be plumbed at all rather than just being v_type == VDIR.

9906 (On Diff #78800)

Because in the second pass RMDIR is cleared and we're working on the parent.

I am not sure about this approach. Note that a vref()/usecount reference does not prevent vnode reclamation. So, for instance, a forced unmount results in vgone(), which does inactivation and reclaim regardless of the active state (or rather, does inactivation if the vnode is active). In this case, it seems not to fix the issue.

Also, typically SU does not rely on the _vnode_ state; workitems are attached to the metadata blocks owned by devvp.

Okay. I am definitely open to better solutions, especially if they fit the paradigm better.

In ufs_inactive, why do we look at i_effnlink at all? Why not just base the truncate on i_nlink? I think this could be another method of delaying the end-of-life truncate until after the write of the dirent, but without relying on a vnode reference. I think that might look like this:
https://github.com/rlibby/freebsd/commit/3b62c248f3377c47fb4bfa65a19b0f5390caec37
(Passes stress2's fs.sh and issue repros above.)

Okay. I am definitely open to better solutions, especially if they fit the paradigm better.

In ufs_inactive, why do we look at i_effnlink at all? Why not just base the truncate on i_nlink? I think this could be another method of delaying the end-of-life truncate until after the write of the dirent, but without relying on a vnode reference. I think that might look like this:
https://github.com/rlibby/freebsd/commit/3b62c248f3377c47fb4bfa65a19b0f5390caec37
(Passes stress2's fs.sh and issue repros above.)

I do like your new approach better, as it is much clearer what is going on. I think it may be sufficient to just make the last change in your delta, where you switch to doing the truncation only when i_nlink falls to zero. The other actions taken when i_effnlink falls to zero should still be OK to do then. As before, getting Peter Holm's testing is important.

In D26964#601553, @kib wrote:

I am not sure about this approach. Note that a vref()/usecount reference does not prevent vnode reclamation. So, for instance, a forced unmount results in vgone(), which does inactivation and reclaim regardless of the active state (or rather, does inactivation if the vnode is active). In this case, it seems not to fix the issue.

Also, typically SU does not rely on the _vnode_ state; workitems are attached to the metadata blocks owned by devvp.

I concur with your uneasiness about using vref()/usecount and hope that the alternative solution works.

I am not concerned with your forcible unmount scenario, though, since it starts by suspending the filesystem, which will finish all in-progress system calls and then cause all the pending writes to be flushed. So by the time the vnodes are inactivated the filesystem will be consistent.

I have started a test of D26964.78800.diff.

Okay. I am definitely open to better solutions, especially if they fit the paradigm better.

In ufs_inactive, why do we look at i_effnlink at all? Why not just base the truncate on i_nlink? I think this could be another method of delaying the end-of-life truncate until after the write of the dirent, but without relying on a vnode reference. I think that might look like this:
https://github.com/rlibby/freebsd/commit/3b62c248f3377c47fb4bfa65a19b0f5390caec37
(Passes stress2's fs.sh and issue repros above.)

I do like your new approach better, as it is much clearer what is going on. I think it may be sufficient to just make the last change in your delta, where you switch to doing the truncation only when i_nlink falls to zero. The other actions taken when i_effnlink falls to zero should still be OK to do then. As before, getting Peter Holm's testing is important.

Okay. I'll post the diff as you suggest, but I don't quite understand. Those other actions seem just to be doing a vn_start_secondary_write(). Is this so that when i_effnlink <= 0 but i_nlink > 0 and we have one of the IN_ change flags, we use V_NOWAIT for the vn_start_secondary_write() and possibly defer with VI_OWEINACT, vs. the V_WAIT we might use just above UFS_UPDATE?

Feedback: Instead of trying to block vinactive from softdep, change ufs_inactive to truncate based on i_nlink instead of i_effnlink, which will in turn depend on the dirent write.

rlibby retitled this revision from ufs softdep: end-of-life truncate should depend on dirent write to ufs: end-of-life truncate should depend on dirent write.Oct 31 2020, 9:20 PM
rlibby edited the summary of this revision. (Show Details)

Okay. I am definitely open to better solutions, especially if they fit the paradigm better.

In ufs_inactive, why do we look at i_effnlink at all? Why not just base the truncate on i_nlink? I think this could be another method of delaying the end-of-life truncate until after the write of the dirent, but without relying on a vnode reference. I think that might look like this:
https://github.com/rlibby/freebsd/commit/3b62c248f3377c47fb4bfa65a19b0f5390caec37
(Passes stress2's fs.sh and issue repros above.)

I do like your new approach better, as it is much clearer what is going on. I think it may be sufficient to just make the last change in your delta, where you switch to doing the truncation only when i_nlink falls to zero. The other actions taken when i_effnlink falls to zero should still be OK to do then. As before, getting Peter Holm's testing is important.

Okay. I'll post the diff as you suggest, but I don't quite understand. Those other actions seem just to be doing a vn_start_secondary_write(). Is this so that when i_effnlink <= 0 but i_nlink > 0 and we have one of the IN_ change flags, we use V_NOWAIT for the vn_start_secondary_write() and possibly defer with VI_OWEINACT, vs. the V_WAIT we might use just above UFS_UPDATE?

The crucial action there is for i_effnlink == 0 and i_mode == 0, where we recycle the vnode. I believe it should be that i_mode == 0 implies i_effnlink == 0, but there might be nuances that prevent asserting this directly.

sys/ufs/ufs/ufs_inode.c
165

Please style it while you change the line anyway: isize != 0.

In D26964#603222, @rlibby wrote:

In D26964#603106, @mckusick wrote:

I do like your new approach better, as it is much clearer what is going on. I think it may be sufficient to just make the last change in your delta, where you switch to doing the truncation only when i_nlink falls to zero. The other actions taken when i_effnlink falls to zero should still be OK to do then. As before, getting Peter Holm's testing is important.

Okay. I'll post the diff as you suggest, but I don't quite understand. Those other actions seem just to be doing a vn_start_secondary_write(). Is this so that when i_effnlink <= 0 but i_nlink > 0 and we have one of the IN_ change flags, we use V_NOWAIT for the vn_start_secondary_write() and possibly defer with VI_OWEINACT, vs. the V_WAIT we might use just above UFS_UPDATE?

My recollection is that the i_effnlink <= 0 check (for when i_nlink > 0) was added because we were somehow missing getting VI_OWEINACT set. But looking through the code, I just cannot come up with a scenario where that is the case. So your original proposed change is probably correct and would save some unnecessary work.

So, I noticed a problem and I'm not sure of the best path forward.

Looking through history, in r209717 Jeff changed the check from i_nlink to i_effnlink so that the truncate would be done in the context of the thread that did the unlink/close, and not mostly in the softdep flush thread. This patch would effectively revert that.

https://svnweb.freebsd.org/base?view=revision&revision=209717

We could look at the case where i_effnlink <= 0 && i_nlink > 0 and try to call through to softdep's process_removes() so that we still get synchronous behavior here. Or we could do that conditionally on some measure of load for the flush thread, so that it remains async when load is low.

I will chew on this for a little, but please let me know if you have ideas or preferences.

sys/ufs/ufs/ufs_inode.c
165

Will do.

I have thought about how to preserve the performance behavior aspects of r209717, but I haven't come up with how to do it. Here are a few thoughts.

  • There's the patch here that uses i_nlink instead of i_effnlink, which I think solves the problem I set out to solve, but probably reopens the problem from r209717.
  • We could do the above and add a chicken switch, like vfs.ffs.doasyncfree (vfs.ffs.doeagertrunc?).
  • I think, most ideally, we would just make all of the writes that happen in ffs_truncate for the end-of-life truncate depend on i_nlink reaching zero (via softdep, either directly or indirectly). That way the thread doing the remove is still usually the one calling ffs_truncate and can be throttled. However, I don't really know how to do this in code, and I'm unsure whether it's feasible.
  • As a hack, we could do some kind of proxy throttle. This would be something like, if we are inactivating a large file, or we are softdep_excess_items(D_DIRREM), then do some process_worklist_item() to do some of the flusher work.

Thoughts on how to proceed?
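For the last bullet, the proxy throttle could be modeled in userland roughly as follows (the names, the threshold, and the queue are all hypothetical stand-ins for softdep_excess_items() / process_worklist_item(), not the real code): when the removing thread sees too many queued dirrem-like items, it processes a few itself instead of leaving everything to the flusher thread.

```c
#include <assert.h>

/* Toy proxy-throttle model (hypothetical names and threshold). */
#define BACKLOG_LIMIT 8

static int backlog;              /* queued work items */
static int done_by_remover;      /* items the removing thread handled */

static void
queue_workitem(void)
{
	backlog++;
}

static void
process_worklist_item_sketch(void)
{
	if (backlog > 0) {
		backlog--;
		done_by_remover++;
	}
}

static void
remove_file_sketch(void)
{
	queue_workitem();                   /* the unlink queues a dirrem */
	while (backlog > BACKLOG_LIMIT)     /* excess: throttle the remover */
		process_worklist_item_sketch();
}
```

The effect is that a burst of removals self-throttles once the backlog exceeds the limit, roughly preserving the r209717 behavior of charging the work to the removing thread.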

I have thought about how to preserve the performance behavior aspects of r209717, but I haven't come up with how to do it. Here are a few thoughts.

  • There's the patch here that uses i_nlink instead of i_effnlink, which I think solves the problem I set out to solve, but probably reopens the problem from r209717.
  • We could do the above and add a chicken switch, like vfs.ffs.doasyncfree (vfs.ffs.doeagertrunc?).
  • I think, most ideally, we would just make all of the writes that happen in ffs_truncate for the end-of-life truncate depend on i_nlink reaching zero (via softdep, either directly or indirectly). That way the thread doing the remove is still usually the one calling ffs_truncate and can be throttled. However, I don't really know how to do this in code, and I'm unsure whether it's feasible.
  • As a hack, we could do some kind of proxy throttle. This would be something like, if we are inactivating a large file, or we are softdep_excess_items(D_DIRREM), then do some process_worklist_item() to do some of the flusher work.

Thoughts on how to proceed?

Thanks for tracing back to r209717. I have had a chance to go back and review what led to that change. The issue was when a large number of files were being deleted at once (i.e., an ``rm -r largetree'' was being done). It could quickly exhaust available soft updates structures and/or journal space. So the idea was to allow the throttling of such a command. As you have discovered, the result is that we now prematurely truncate a file before it is removed.

I think that the most viable solution is your suggestion of a proxy throttle: processing some worklist items when there is an excess of D_DIRREM items or a shortage of journal space. If we do this we will need to carefully figure out which parts of r209717 need to be reverted (notably the SPACECOUNTED tracking).