Apr 14 2021
Apr 13 2021
Not having seen this review, I started a review at https://reviews.freebsd.org/D29754. I am fine with abandoning my review, though I do think you should consider incorporating my version of the manual page.
Apr 9 2021
Apr 2 2021
Given that -pg support has been withdrawn from the kernel, it is sensible to remove kgmon(8).
Mar 31 2021
Your change looks good.
Mar 26 2021
In D28856#659254, @kib wrote:
In D28856#659176, @mckusick wrote:
The cost of an extra allocation versus the overhead of having to handle low-memory situations and building up and tearing down zones seems like a bad tradeoff.
Why? It is reverse, IMO: the normal system operation performs a lot of vnode allocations and deallocations, while lowmem is rare condition, where we do not worry about system performance at all, only about system liveness. Optimizing for normal path is right, optimizing for lowmem handler is not.
The purpose of this change is to reduce the amount of memory dedicated to vnodes.
In D28856#659141, @kib wrote:
In D28856#659119, @mckusick wrote:
There is still overhead as the zone memory has to be cleaned up (locks disposed of) and then new memory initialized (zeroed, lists and queues initialized, locks initialized, etc). Also there is extra work done detecting that we have hit these conditions and making them happen. In general we are going to have more memory tied up and do more work moving it between the zones. If we just had one zone for vnodes and another zone for bufobjs we could avoid all of this. In all likelihood we would only need occasional freeing of memory in the bufobj zone.
I am curious why you are so resistant to having a single vnode zone and a single bufobj zone?
With either (vnode + bufobj, vnode - bufobj) or (vnode - bufobj, bufobj) we still have two zones, and on a low memory condition two zones need to be drained. But with the separate bufobj zone, we additionally punish filesystems that use buffers. Instead of a single allocation for a vnode, they have to perform two, and they also have to perform two frees.
We have a similar structure in namecache, where {short,long}x{timestamp, no timestamp} allocations use specific zones, instead of allocating nc + path segment + timestamp.
Mar 25 2021
I understand that there cannot be more than maxvnodes. What I am concerned about is how much memory is tied up in the two zones. In this example, vnlru() frees (vnodes without bufobjs) into the (vnode without bufobj) zone. It then allocates (vnode+bufobj) memory from the (vnode+bufobj) zone. That allocation cannot use the memory in the (vnode without bufobj) zone. So when we are done we have enough memory locked down in the two zones to support 2 * maxvnodes. This is much more wasteful of memory than having a single zone for pure vnodes and a second zone that holds bufobjs, each of which will be limited to maxvnodes in size.
Mar 24 2021
In D28856#658811, @kib wrote:
In D28856#658765, @mckusick wrote:
No, we do not have two pools of vnodes after this change. We have two zones, but zones do not keep vnodes; they cache partially initialized memory for vnodes. Neither the current single zone nor the two zones after applying the patch have any limit on how large that cache can grow. But it is a cache of memory and not vnodes. With or without the patch, only maxvnodes constructed vnodes can exist in the system. A constructed vnode is a struct vnode which is correctly initialized and has an identity belonging to some filesystem, or is reclaimed. [In fact in some cases getnewvnode() is allowed to ignore the limit of maxvnodes, but this is not relevant to the discussion].
Let me try again to explain my perceived issue.
Under this scheme we have two zones. If there is a lot of ZFS activity, the vnode-only zone can be filled with maxvnodes worth of entries. Now suppose activity in ZFS drops but activity in NFS rises. Now the zone with vnodes + bufobj can fill to maxvnodes worth of memory. As I understand it, we do not reclaim any of the memory from the vnode-only zone; it just sits there unable to be used. Is that correct?
No, this is not correct. The total number of vnodes (the sum of both types) is limited by maxvnodes. After the load shifts from zfs to nfs in your scenario, vnlru starts reclaiming vnodes in LRU order from the global free list, freeing (vnode without bufobj)s, and most of the allocated vnodes would then come from the (vnode+bufobj) zone. We do not allow more than maxvnodes total vnodes allocated in the system.
No, we do not have two pools of vnodes after this change. We have two zones, but zones do not keep vnodes; they cache partially initialized memory for vnodes. Neither the current single zone nor the two zones after applying the patch have any limit on how large that cache can grow. But it is a cache of memory and not vnodes. With or without the patch, only maxvnodes constructed vnodes can exist in the system. A constructed vnode is a struct vnode which is correctly initialized and has an identity belonging to some filesystem, or is reclaimed. [In fact in some cases getnewvnode() is allowed to ignore the limit of maxvnodes, but this is not relevant to the discussion].
Mar 22 2021
In D28856#657865, @kib wrote:
In D28856#657658, @mckusick wrote:
Three inline comments / questions.
In Sun's implementation of vnodes each filesystem type had its own pool. When I adopted the vnode idea into BSD, I created generic vnodes that could be used by all filesystems so that they could move between filesystems based on demand.
This design reverts back to vnodes usable by ZFS and a few other filesystems and vnodes for NFS, UFS, and most other filesystems. This will be a win for systems that run just ZFS. But, systems that are also running NFS or UFS will not be able to share vnode memory and will likely have a bigger memory footprint than if they stuck with the single type of vnode.
There has been no attempt to fix vlrureclaim() so we can end up reclaiming a bunch of vnodes of the wrong type thus reducing the usefulness of the cache without recovering any useful memory. In the worst case, we can end up with each of the two vnode pools using maxvnodes worth of memory.
We probably need to have a separate maxvnodes for each pool. Alternately we could keep track of how many vnodes are in each pool and limit the two pools to a total of maxvnodes. That of course begs the question of how we decide to divide the quota between the two pools. At a minimum, vlrureclaim() needs to have a way to decide which pool needs to have vnodes reclaimed.
We do not have two pools of vnodes after the patch. For a very long time, we have freed a vnode after its hold count goes to zero (mod SMR complications).
Three inline comments / questions.
Mar 16 2021
Mar 14 2021
Mar 12 2021
Mar 11 2021
Flag definitions look good.
Breakdown of commits is excellent.
Changes should resolve the problem.
It would help if I had looked at your commit logs before my previous comment. You have in fact separated everything out appropriately.
Nearly all of the changes in ffs_softdep.c are code cleanups and not related to this bug fix. I would prefer to see the code cleanups in a separate commit so that it is easier to see the changes that are needed to fix this problem. That said, this update appears to solve the problem that you describe.
Mar 10 2021
In your summary you say ``rw<->ro remounts are not atomic, filesystem is accessible by other threads during the process. As result, its internal state is inconsistent. Just blocking writers with suspend is not enough.'' Can you elaborate on how having other processes reading the filesystem causes trouble?
Mar 3 2021
This seems like a reasonable solution to the problem. The allocation will fail in a few cases where it previously would have succeeded, but hopefully those will be rare. The effect of failing will simply be slower lookups rather than unexpected errors to applications.
I agree that this change is appropriate.
Mar 2 2021
Sorry for the delayed review. This change looks correct to me.
Sorry for the delayed review. This fix looks correct to me.
Feb 25 2021
It appears that this change should be MFC'ed to 12.
Feb 24 2021
Feb 20 2021
It's a wrap!
Feb 18 2021
I note that lib/libprocstat/libprocstat.c includes ufs/ufs/inode.h but in fact does not need to do so.
Getting rid of _buf_cluster.h was a lot more work than I expected, but is definitely cleaner and has the added bonus of cleaning up some other cruft as well.
The problem is that lib/libprocstat wants to read inodes out of the kernel and so needs to know their size. So, I agree that avoiding _buf_cluster.h is hard. I reluctantly agree with that solution.
Feb 17 2021
To avoid the _buf_cluster.h file, how about making the inclusion of buf.h in ufs/inode.h and ext2fs/inode.h conditional on #ifdef KERNEL? Both vfs_cluster.c and fuse_node.h already include buf.h so are not an issue. I don't know if msdosfs/denode.h is used outside the kernel, but if so could make inclusion of buf.h conditional on #ifdef KERNEL.
All looks good.
Feb 16 2021
Overall looks good.
Overall looks good. A few inline comments.
Thanks for rewriting this comment.
I have provided some suggestions for cleanup and/or clarification.
Feb 12 2021
Jan 30 2021
Jan 26 2021
Jan 16 2021
Jan 12 2021
Jan 7 2021
This change certainly fixes the problem though it write-locks far more than necessary.
Jan 3 2021
Jan 1 2021
Dec 31 2020
Dec 23 2020
No change in actual running, but definitely correct change.
Dec 18 2020
Dec 11 2020
Dec 9 2020
Dec 8 2020
Dec 6 2020
These updates look needed and correct.
Nov 29 2020
Good to go.
Nov 25 2020
The sentiment is correct, but the logic fixes noted are needed.
Nov 20 2020
Belatedly, these changes look good and in particular get rid of VOP_SYNC(..., MNT_WAIT).
Nov 17 2020
Nov 16 2020
I have wanted this change for a long time. Thanks for doing it.
Nov 14 2020
Have we reached any conclusions about whether to do any of the ideas suggested in this phabricator thread?
In D26964#604409, @rlibby wrote:
I have thought about how to preserve the performance behavior aspects of r209717, but I haven't come up with how to do it. Here are a few thoughts.
- There's the patch here that uses i_nlink instead of i_effnlink, which I think solves the problem I set out to solve, but probably reopens the problem from r209717.
- We could do the above and add a chicken switch, like vfs.ffs.doasyncfree (vfs.ffs.doeagertrunc?).
- I think, most ideally, we would just make all of the writes that happen in ffs_truncate for the end-of-life truncate depend on i_nlink reaching zero (via softdep, and either directly or indirectly). That way the thread doing the remove is still usually the one calling ffs_truncate and can be throttled. However I don't really know how to do this in code, and I'm unsure whether it's feasible (?).
- As a hack, we could do some kind of proxy throttle. This would be something like, if we are inactivating a large file, or we are softdep_excess_items(D_DIRREM), then do some process_worklist_item() to do some of the flusher work.
Thoughts on how to proceed?
Nov 6 2020
Two questions.
Nov 1 2020
Oct 31 2020
In D26964#603222, @rlibby wrote:
In D26964#603106, @mckusick wrote:
I do like your new approach better as it is much clearer what is going on. I think that it may be sufficient to just make the last change in your delta where you switch to only doing the truncation when i_nlink falls to zero. The other actions taken when i_effnlink falls to zero should still be OK to be done then. As before, getting Peter Holm's testing is important.
Okay. I'll post the diff as you suggest, but I don't quite understand. Those other actions seem just to be doing a vn_start_secondary_write(). Is this so that when i_effnlink <= 0 but i_nlink > 0 and we have one of the IN_ change flags, we use V_NOWAIT for the vn_start_secondary_write and possibly defer with VI_OWEINACT, vs the V_WAIT we might use just above UFS_UPDATE?
My recollection is that the i_effnlink <= 0 while i_nlink > 0 case got added because we were somehow missing getting VI_OWEINACT set. But looking through the code, I just cannot come up with a scenario where that is the case. So your original proposed change is probably correct and would save some unnecessary extra work.
This is an impressive piece of work, what a lot of effort to fix this LOR. Took a couple of hours to review, but overall looks good. A couple of minor inline comments.
Oct 30 2020
In D26964#601553, @kib wrote:
I am not sure about this approach. Note that vref()/usecount reference does not prevent the vnode reclaim. So for instance force umount results in vgone() which does inactivation and reclaim regardless of the active state (or rather does inactivation if the vnode is active). In this case, it seems to not fix the issue.
Also, typically SU does not rely on the _vnode_ state, workitems are attached to the metadata blocks owned by devvp.
In D26964#601787, @rlibby wrote:
Okay. I am definitely open to better solutions, especially if they fit the paradigm better.
In ufs_inactive, why do we look at i_effnlink at all? Why not just base the truncate on i_nlink? I think this could be another method of delaying the end-of-life truncate until after the write of the dirent, but without relying on a vnode reference. I think that might look like this:
https://github.com/rlibby/freebsd/commit/3b62c248f3377c47fb4bfa65a19b0f5390caec37
(Passes stress2's fs.sh and issue repros above.)
Oct 27 2020
You note that ``If we then crashed before the dirent write, we would recover with a state where the file was still linked, but had been truncated to zero. The resulting state could be considered corruption.'' The filesystem is not corrupted in the sense that it needs fsck to be run to clean it up. We simply end up with an unexpected result.
Oct 25 2020
Oct 23 2020
It would be trivial to request high priority for synchronous writes in bwrite() and if desired synchronous reads in bread(). That would have effects for several filesystems.
Oct 18 2020
Added a couple of inline comments.
Oct 3 2020
Sep 26 2020
Sep 25 2020
In D26511#591360, @kib wrote:
The block being written with the barrier is a newly allocated block of inodes. The write is done asynchronously and the cylinder group is then updated to reflect that the additional inodes are available. The reason for the barrier is so that the cylinder group buffer with the newly expanded inode map cannot be written before the newly allocated set of inodes.
Since the incore cylinder group includes the newly created inodes, some other thread can come along and try to use one of those newly allocated inodes. But it will block on the inode buffer until its write has completed.
The bug in this instance is that there is an assumption that the write cannot fail. Clearly this is a bad assumption. So, the correct fix is to not depend on the barrier write, but rather to create a callback that updates the cylinder group once the write of the new inodes has completed successfully.
In the specific case of kostik1316, doing that would deadlock the machine. The problem is that the CoW write returned ENOSPC, which means that there is no way to correctly free space on the volume. I am not sure what to do there. Most likely any other write would also return ENOSPC, so in fact the consistency of the volume is not too badly broken if we just fail there. If we start tracking the write as a dependency for the cg write, then all those buffers add to the dirty space, which eventually hangs the buffer subsystem.
I believe (based on Peter's testing) that just erroring the write allows us to unmount.
What we are trying to do where the barrier write is being used in UFS is expand the number of available inodes. It is OK to abandon the write of the zero'ed out inode block as the expansion can be put off to later. But, we must not update the cylinder group to say that these new inodes are available if we have not been able to zero them out. As it is currently written, the cylinder group is updated but the on-disk inodes are not zero'ed out. If the unmount succeeds in writing out the dirty buffers we will end up with a corrupted filesystem, because fsck_ffs will try to check the non-zero'ed out inodes and raise numerous errors trying to correct the inconsistencies that arise from the random data in the uninitialized inode block.
In D26511#590951, @kib wrote:
In D26511#590923, @mckusick wrote:
In D26511#590229, @kib wrote:
In D26511#590172, @mckusick wrote:
As long as the buffer remains locked until it is successfully written, this should be fine.
Sorry, I do not fully understand your note. Do you mean that it is fine to have B_BARRIER set as long as the buffer is not unlocked? If not, could you please clarify.
We are depending on this being written before it can be used. If it were unlocked, then some other thread could get it and make use of it. See comment above use of babarrierwrite in sys/ufs/ffs/ffs_alloc.c.
I still do not understand what you mean by 'use'. In the kostik1316 dump, the most probable scenario was that babarrierwrite() for the inode block failed with ENOSPC, and then bufdone() did brelse() on the buffer. So it is unlocked, but this is somewhat unrelated to the issue of leaking B_BARRIER.
Sep 23 2020
In D26511#590229, @kib wrote:
In D26511#590172, @mckusick wrote:
As long as the buffer remains locked until it is successfully written, this should be fine.
Sorry, I do not fully understand your note. Do you mean that it is fine to have B_BARRIER set as long as the buffer is not unlocked? If not, could you please clarify.
We are depending on this being written before it can be used. If it were unlocked, then some other thread could get it and make use of it. See comment above use of babarrierwrite in sys/ufs/ffs/ffs_alloc.c.
Sep 22 2020
Sep 21 2020
As long as the buffer remains locked until it is successfully written, this should be fine.
Sep 19 2020
Sep 13 2020
Sep 12 2020
This change looks reasonable to me, as at least for amd64 every path out of the kernel appears to check TDF_ASTPENDING and call ast() if set.
Sep 1 2020
It was a horrible hack at the time and should have been tossed decades ago.