In your summary you say ``rw<->ro remounts are not atomic, filesystem is accessible by other threads during the process. As result, its internal state is inconsistent. Just blocking writers with suspend is not enough.'' Can you elaborate on how having other processes reading the filesystem causes trouble?
Mar 10 2021
Mar 3 2021
This seems like a reasonable solution to the problem. The allocation will fail in a few cases where it previously would have succeeded, but hopefully those will be rare. The effect of failing will simply be slower lookups rather than unexpected errors to applications.
I agree that this change is appropriate.
Mar 2 2021
Sorry for the delayed review. This change looks correct to me.
Sorry for the delayed review. This fix looks correct to me.
Feb 25 2021
It appears that this change should be MFC'ed to 12.
Feb 24 2021
Feb 20 2021
It's a wrap!
Feb 18 2021
I note that lib/libprocstat/libprocstat.c includes ufs/ufs/inode.h but in fact does not need to do so.
Getting rid of _buf_cluster.h was a lot more work than I expected, but is definitely cleaner and has the added bonus of cleaning up some other cruft as well.
The problem is in lib/libprocstat that wants to read inodes out of the kernel so needs to know their size. So, I agree avoiding _buf_cluster.h is hard. I reluctantly agree with that solution.
Feb 17 2021
To avoid the _buf_cluster.h file, how about making the inclusion of buf.h in ufs/inode.h and ext2fs/inode.h conditional on #ifdef KERNEL? Both vfs_cluster.c and fuse_node.h already include buf.h so are not an issue. I don't know if msdosfs/denode.h is used outside the kernel, but if so could make inclusion of buf.h conditional on #ifdef KERNEL.
All looks good.
Feb 16 2021
Overall looks good.
Overall looks good. A few inline comments.
Thanks for rewriting this comment.
I have provided some suggestions for cleanup and/or clarification.
Feb 12 2021
Jan 30 2021
Jan 26 2021
Jan 16 2021
Jan 12 2021
Jan 7 2021
This change certainly fixes the problem though it write-locks far more than necessary.
Jan 3 2021
Jan 1 2021
Dec 31 2020
Dec 23 2020
No change in actual running, but definitely a correct change.
Dec 18 2020
Dec 11 2020
Dec 9 2020
Dec 8 2020
Dec 6 2020
These updates look needed and correct.
Nov 29 2020
Good to go.
Nov 25 2020
The sentiment is correct, but the logic fixes noted are needed.
Nov 20 2020
Belatedly, these changes look good and in particular get rid of VOP_SYNC(..., MNT_WAIT).
Nov 17 2020
Nov 16 2020
I have wanted this change for a long time. Thanks for doing it.
Nov 14 2020
Have we reached any conclusions about whether to do any of the ideas suggested in this phabricator thread?
In D26964#604409, @rlibby wrote: I have thought about how to preserve the performance behavior aspects of r209717, but I haven't come up with how to do it. Here are a few thoughts.
- There's the patch here that uses i_nlink instead of i_effnlink, which I think solves the problem I set out to solve, but probably reopens the problem from r209717.
- We could do the above and add a chicken switch, like vfs.ffs.doasyncfree (vfs.ffs.doeagertrunc?).
- I think, most ideally, we would just make all of the writes that happen in ffs_truncate for the end-of-life truncate depend on i_nlink reaching zero (via softdep, and either directly or indirectly). That way the thread doing the remove is still usually the one calling ffs_truncate and can be throttled. However I don't really know how to do this in code, and I'm unsure whether it's feasible (?).
- As a hack, we could do some kind of proxy throttle. This would be something like, if we are inactivating a large file, or we are softdep_excess_items(D_DIRREM), then do some process_worklist_item() to do some of the flusher work.
Thoughts on how to proceed?
Nov 6 2020
Two questions.
Nov 1 2020
Oct 31 2020
In D26964#603222, @rlibby wrote:
In D26964#603106, @mckusick wrote:
I do like your new approach better as it is much clearer what is going on. I think that it may be sufficient to just make the last change in your delta where you switch it to only doing the truncation when i_nlink falls to zero. The other actions taken when i_effnlink falls to zero should still be OK to be done then. As before, getting Peter Holm's testing is important.
Okay. I'll post the diff as you suggest, but I don't quite understand. Those other actions seem just to be doing a vn_start_secondary_write(). Is this so that when i_effnlink <= 0 but i_nlink > 0 and we have one of the IN_ change flags, that we use V_NOWAIT for the vn_start_secondary_write and possibly defer with VI_OWEINACT vs the V_WAIT we might use just above UFS_UPDATE?
My recollection is that the i_effnlink <= 0 when i_nlink > 0 got added because we were somehow missing getting VI_OWEINACT set. But looking through the code, I just cannot come up with a scenario where that is the case. So your original proposed change is probably correct and would save some extra unnecessary work.
This is an impressive piece of work, what a lot of effort to fix this LOR. Took a couple of hours to review, but overall looks good. A couple of minor inline comments.
Oct 30 2020
In D26964#601553, @kib wrote: I am not sure about this approach. Note that vref()/usecount reference does not prevent the vnode reclaim. So for instance force umount results in vgone() which does inactivation and reclaim regardless of the active state (or rather does inactivation if the vnode is active). In this case, it seems to not fix the issue.
Also, typically SU does not rely on the _vnode_ state, workitems are attached to the metadata blocks owned by devvp.
In D26964#601787, @rlibby wrote: Okay. I am definitely open to better solutions, especially if they fit the paradigm better.
In ufs_inactive, why do we look at i_effnlink at all? Why not just base the truncate on i_nlink? I think this could be another method of delaying the end-of-life truncate until after the write of the dirent, but without relying on a vnode reference. I think that might look like this:
https://github.com/rlibby/freebsd/commit/3b62c248f3377c47fb4bfa65a19b0f5390caec37
(Passes stress2's fs.sh and issue repros above.)
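The difference between the two policies can be illustrated with a small userland sketch. This is not the kernel code: the struct and helper names are hypothetical, and it assumes the usual soft updates convention that i_effnlink drops as soon as the remove is issued while i_nlink drops only after the directory entry write has completed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two link counts kept in a UFS inode. */
struct inode_sketch {
	int i_nlink;	/* file link count */
	int i_effnlink;	/* i_nlink when pending I/O completes */
};

/* Model of a remove: the effective count drops immediately. */
static void
sketch_remove(struct inode_sketch *ip)
{
	assert(ip->i_effnlink > 0);
	ip->i_effnlink--;
}

/* Model of the dirent write completing: now the real count drops. */
static void
sketch_dirent_written(struct inode_sketch *ip)
{
	assert(ip->i_nlink > ip->i_effnlink);
	ip->i_nlink--;
}

/* Existing policy as discussed: truncate once i_effnlink reaches zero. */
static bool
truncate_by_effnlink(const struct inode_sketch *ip)
{
	return (ip->i_effnlink <= 0);
}

/* Proposed policy: wait until i_nlink itself reaches zero. */
static bool
truncate_by_nlink(const struct inode_sketch *ip)
{
	return (ip->i_nlink <= 0);
}
```

In the window between sketch_remove() and sketch_dirent_written(), only the i_effnlink policy would truncate, which is the window in which a crash leaves a still-linked but empty file.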
Oct 27 2020
You note that ``If we then crashed before the dirent write, we would recover with a state where the file was still linked, but had been truncated to zero. The resulting state could be considered corruption.'' The filesystem is not corrupted in the sense that it needs fsck to be run to clean it up. We simply end up with an unexpected result.
Oct 25 2020
Oct 23 2020
It would be trivial to request high priority for synchronous writes in bwrite() and, if desired, for synchronous reads in bread(). That would have effects for several filesystems.
Oct 18 2020
Added a couple of inline comments.
Oct 3 2020
Sep 26 2020
Sep 25 2020
In D26511#591360, @kib wrote: The block being written with the barrier is a newly allocated block of inodes. The write is done asynchronously and the cylinder group is then updated to reflect that the additional inodes are available. The reason for the barrier is so that the cylinder group buffer with the newly expanded inode map cannot be written before the newly allocated set of inodes.
Since the incore cylinder group includes the newly created inodes, some other thread can come along and try to use one of those newly allocated inodes, but it will block on the inode buffer until its write has completed.
The bug in this instance is that there is an assumption that the write cannot fail. Clearly this is a bad assumption. So, the correct fix is to not depend on the barrier write, but rather to create a callback that updates the cylinder group once the write of the new inodes has completed successfully.
In the specific case of kostik1316, doing that would deadlock the machine. The problem is that the CoW write returned ENOSPC, which means that there is no way to correctly free space on the volume. I am not sure what to do there. Most likely any other write would also return ENOSPC, so in fact consistency of the volume is not too badly broken if we just fail there. If we start tracking the write as a dependency for the cg write, then all those buffers add to the dirty space, which eventually hangs the buffer subsystem.
I believe (based on Peter's testing) that just erroring the write allows us to unmount.
What we are trying to do where the barrier write is being used in UFS is expand the number of available inodes. It is OK to abandon the write of the zero'ed out inode block as the expansion can be put off to later. But, we must not update the cylinder group to say that these new inodes are available if we have not been able to zero them out. As it is currently written, the cylinder group is updated but the on-disk inodes are not zero'ed out. If the unmount succeeds in writing out the dirty buffers we will end up with a corrupted filesystem because fsck_ffs will try to check the non-zero'ed out inodes and raise numerous errors trying to correct the inconsistencies that arise from the random data in the uninitialized inode block.
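The ordering rule described here (publish the new inodes in the cylinder group only after the zeroed inode block has been written successfully) can be sketched as a completion callback. This is a simplified userland model under stated assumptions, not the ffs_alloc.c code: the structs, field names, and callback are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified cylinder group: only the inode count matters. */
struct cg_sketch {
	int initialized_inodes;	/* inodes the cg advertises as usable */
};

/* Hypothetical record of one in-flight write of a zeroed inode block. */
struct inode_block_write {
	struct cg_sketch *cg;
	int new_inodes;		/* inodes zeroed by this write */
};

/*
 * Completion callback: update the cylinder group only if the write of
 * the zeroed inode block succeeded. On error the expansion is simply
 * abandoned and can be retried later; the cg is left untouched.
 */
static bool
inode_block_write_done(struct inode_block_write *w, int error)
{
	assert(w != NULL && w->cg != NULL);
	if (error != 0)
		return (false);
	w->cg->initialized_inodes += w->new_inodes;
	return (true);
}
```

With this shape, a failed write (for example one returning ENOSPC) never leaves the cylinder group claiming inodes whose on-disk block was not zeroed.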
In D26511#590951, @kib wrote: In D26511#590923, @mckusick wrote: In D26511#590229, @kib wrote: In D26511#590172, @mckusick wrote: As long as the buffer remains locked until it is successfully written, this should be fine.
Sorry, I do not fully understand your note. Do you mean that it is fine to have B_BARRIER set as long as the buffer is not unlocked? If not, could you please clarify.
We are depending on this being written before it can be used. If it were unlocked, then some other thread could get it and make use of it. See comment above use of babarrierwrite in sys/ufs/ffs/ffs_alloc.c.
I still do not understand what you mean by 'use'. In the kostik1316 dump, the most probable scenario was that babarrierwrite() for the inode block failed with ENOSPC, and then bufdone() did brelse() on the buffer. So it is unlocked, but this is somewhat unrelated to the issue of leaking B_BARRIER.
Sep 23 2020
In D26511#590229, @kib wrote: In D26511#590172, @mckusick wrote: As long as the buffer remains locked until it is successfully written, this should be fine.
Sorry, I do not fully understand your note. Do you mean that it is fine to have B_BARRIER set as long as the buffer is not unlocked? If not, could you please clarify.
We are depending on this being written before it can be used. If it were unlocked, then some other thread could get it and make use of it. See comment above use of babarrierwrite in sys/ufs/ffs/ffs_alloc.c.
Sep 22 2020
Sep 21 2020
As long as the buffer remains locked until it is successfully written, this should be fine.
Sep 19 2020
Sep 13 2020
Sep 12 2020
This change looks reasonable to me, as at least for amd64 every path out of the kernel appears to check TDF_ASTPENDING and call ast() if it is set.
Sep 1 2020
It was a horrible hack at the time and should have been tossed decades ago.
Aug 31 2020
Aug 29 2020
A new review? If so, I suggest that you close this one and give a pointer to the new one.
Aug 27 2020
What is the status with this change?
Aug 26 2020
The get_parent_vp() function seems to be handling two different problems.
In its first instance it is dealing with locking its parent. Here we already have the function vn_vget_ino() which presumably could be used to handle this situation.
In its second and third instances it is avoiding a LOR of acquiring an inode lock while holding a buffer lock. Here we need a function like the proposed get_parent_vp(). However, we will no longer need the first (vp) argument as we are acquiring a child inode so do not have to release the already held parent inode before acquiring the child inode.
Aug 15 2020
Aug 8 2020
This change will provide the interface that we will need to notify daemon processes for our forcible unmount and forcible downgrade to read-only as well as re-establishment of read-write after a successful background fsck.
Jul 22 2020
I concur with the proposed change and also agree that cem's suggestion is a good one.
Jul 21 2020
Jul 14 2020
Jul 13 2020
This should not be done as written as it can cause a denial of service (infinite loop) as described in my previous comment.
Jul 10 2020
If data is being written to the filesystem faster than the disk can write it then a sync with MNT_WAIT will never finish. The only safe way to use sync with MNT_WAIT is to first suspend the filesystem to create a finite number of write operations that need to be done.
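A toy model makes the argument concrete. It is only a sketch with made-up rates, not kernel code: each step the disk retires some dirty buffers while writers add new ones, unless the filesystem has been suspended first. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of a waiting sync. Each step the disk retires up to `drain`
 * dirty buffers and, unless the filesystem is suspended, writers add
 * `incoming` new ones. Returns true if the dirty count reaches zero
 * within max_steps, false if the sync is still waiting at that point.
 */
static bool
sync_wait_finishes(int dirty, int drain, int incoming, bool suspended,
    int max_steps)
{
	assert(dirty >= 0 && drain > 0 && incoming >= 0);
	for (int step = 0; step < max_steps; step++) {
		if (dirty == 0)
			return (true);
		dirty -= (dirty < drain) ? dirty : drain;
		if (!suspended)
			dirty += incoming;
	}
	return (dirty == 0);
}
```

With incoming greater than drain and no suspension the dirty count only grows, so the call never succeeds however large max_steps is; with suspension the same backlog is finite and drains.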
Jul 9 2020
Jul 6 2020
Jun 30 2020
Per the above commentary, this change should be withdrawn.
Jun 24 2020
Having read Chuck's argument, I agree with it. I was thinking that the g_label_ufs.c routines passed in M_UFSMNT as their malloc type. But they do not. They use the M_GEOM malloc type. So there is no need for them to have a definition for M_UFSMNT.
Thanks for this cleanup. I found the previous transformation grating for the reasons that you listed.
This looks like a reasonable change to me.
I believe that the existing test already has this covered. The test (resid > uio->uio_resid) is checking that any data has been written. In the typical case this is at the end of the file and the size will have increased. But if the write was in the middle of a file filling in a hole then this test will catch that and cause the inode to be written.
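As a rough illustration of why the residual-count test covers the hole-filling case, here is a userland sketch. The structs and helpers are hypothetical simplifications; only the comparison (resid > uio->uio_resid) is taken from the comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical stand-ins for the pieces of struct uio and the inode used here. */
struct uio_sketch {
	ssize_t uio_resid;	/* bytes still to be transferred */
};

struct file_sketch {
	off_t size;		/* current file size */
};

/* Model of a write of `done` bytes at `offset`; size grows only past EOF. */
static void
sketch_write(struct file_sketch *fp, struct uio_sketch *uio, off_t offset,
    ssize_t done)
{
	assert(done >= 0 && done <= uio->uio_resid);
	uio->uio_resid -= done;
	if (offset + done > fp->size)
		fp->size = offset + done;
}

/* The test from the review: true if any data at all was written. */
static bool
wrote_any_data(ssize_t resid, const struct uio_sketch *uio)
{
	return (resid > uio->uio_resid);
}
```

A write that fills a hole in the middle of the file leaves the size unchanged but still lowers uio_resid, so the test fires even though a size comparison would not.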