Move struct bufobj out of struct vnode
Needs ReviewPublic
Actions

Authored by kib on Feb 22 2021, 2:26 PM.

Details

Reviewers

Summary

Remove v_bufobj from struct vnode, but move v_object into vnode. Define struct vnode_bo which consists of struct vnode and bufobj. It is backed by separate zone.

If a filesystem uses buffer cache, as indicated by MNTK_USES_BCACHE mount point flag, getnewvnode() allocates node_bo instead of vnode, and marks the vnode with VIRF_BUFOBJ flag. For convenience, syncer vnodes are also automatically extended with bufobj.

I considered not doing the co-location for vnode/bufobj, putting the bufobj into inodes for filesystems needed byffers. This appeared to be too problematic due to the lifetime issues, not for filesystems itself, but for use of vnodes from other filesystems. For instance devvp buffers cannot be reliably handled if bufobj for devvp is located in cdev_priv (devfs spec inode).

Right now this shrinks struct vnode from 448 to 304 bytes for filesystems that do not use buffer cache, i.e. at least zfs/tmpfs/nullfs. Filesystems using buffer cache still gets the 'large' struct vnode + struct bufobj blob when allocating the vnode.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

kib created this revision.Feb 22 2021, 2:26 PM

Herald added subscribers: asomers, delphij. · View Herald TranscriptFeb 22 2021, 2:26 PM

kib requested review of this revision.Feb 22 2021, 2:26 PM

kib edited the summary of this revision. (Show Details)Feb 22 2021, 2:38 PM

I think this runs into trouble with the way the way vnlru is implemented. It may maintain some used/free vnode targets, but resulting freed memory may not be reusable by other filesystems. For example consider a box with ufs + nullfs, which now cannot reuse vnodes after each other. I think the long term solution is to fundamentally rework this model and put a small 'struct vnode' as part of fs-specific object, which would also mean reworking forced unmount.

In the meantime I think this can work if bufobj is allocated from an additional zone and struct vnode stores a pointer. The pointer may be only nullified in freevnode to avoid spurious checks.

In D28856#646077, @mjg wrote:

I think this runs into trouble with the way the way vnlru is implemented. It may maintain some used/free vnode targets, but resulting freed memory may not be reusable by other filesystems. For example consider a box with ufs + nullfs, which now cannot reuse vnodes after each other. I think the long term solution is to fundamentally rework this model and put a small 'struct vnode' as part of fs-specific object, which would also mean reworking forced unmount.

In the meantime I think this can work if bufobj is allocated from an additional zone and struct vnode stores a pointer. The pointer may be only nullified in freevnode to avoid spurious checks.

Vnlru does not reuses vnodes, and I doubt that its goal is to reuse vnode memory. Vnodes are freed and allocated anew, daemon maintains the target by vnode count, and that the only restriction it has. Since default limits are still tuned by the KVA usage for vnode_bo+large NFS inode, nothing really changed WRT vnlru except that vnode cache could consume less memory now. If you mean that other subsystems can eat this memory and vnode allocator would be unable to allocate, again this sounds too unlikely. Testing should show it.

BTW, NFS_NCLNODE_SZ should be updated after current changes, but I probably do it once after this patchset is landed.

The fact that vnode cache does not try to react to vm_lowmem is also a bug.

On a general note, the entire mechanism is already highly deficient. For example the code can bump into global limits even if there are free vnodes in local per-CPU caches. This needs a complete revamp which either lifts it out of UMA or integrates this into UMA completely.

vnlru does not reuse memory, it tries to free vnodes so that they can be reused. In particular in vn_alloc where a call to vnlru_free_locked will try free up something. In the current code, should the routine succeed, there will be a vnode for grabs. In contrast, with your patch the freed vnode may be of the wrong kind and thus not usable. Thus in my opinion the patch as proposed makes the situation worse.

In D28856#646189, @mjg wrote:

vnlru does not reuse memory, it tries to free vnodes so that they can be reused. In particular in vn_alloc where a call to vnlru_free_locked will try free up something. In the current code, should the routine succeed, there will be a vnode for grabs. In contrast, with your patch the freed vnode may be of the wrong kind and thus not usable. Thus in my opinion the patch as proposed makes the situation worse.

In current code, if vnlru frees a vnode, there is no guarantee that there is a vnode for the grab, because memory could be grabbed for something else meantime as well. The only guiding mechanism is the vnode count. I do not believe that your objection is practical.

In fact, the vnode_bo/vnode mechanism is identical to the namecache small/large/ts/no-ts allocations, and for namecache it is even more varied.

As noted earlier the mechanism is highly deficient. In fact now that I think of it it's even worse than described because of the way SMR is implemented, where freeing can only happen in full batches.

That said, if the patch is to go in, it has to come with an explanation in vn_alloc_hard. Currently it states:

* The routine can try to free a vnode or stall for up to 1 second waiting for
* vnlru to clear things up, but ultimately always performs a M_WAITOK allocation.

Given the realities (and especially so with the proposed patch) I think it ends up being misleading. Instead it should note that the vnode count just provides some notion of controlling total memory usage stemming from vnodes themselves and associated (meta)data, where vnlru free call may free up some memory but is not guaranteed to make any of it reusable for this allocation.

emaste added a subscriber: emaste.Feb 22 2021, 8:23 PM

The ZFS bits look good, just one question about the change in zfs_freebsd_bmap.

sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
4325	What does changing this to NULL do?

kib marked an inline comment as done.Feb 24 2021, 11:12 PM

kib added inline comments.

sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
4325	This is important for vnode_generic_pager(), which uses returned bufobj to factually read the content of the requested pages. Generic pager cannot work for ZFS, which provides its own VOP_GETPAGES() implementation. Since we no longer have a spare bufobj in the struct vnode, returning NULL there should change nothing.

Fix bugs found by pho

Three inline comments / questions.

In Sun's implementation of vnodes each filesystem type had its own pool. When I adopted the vnode idea into BSD, I created generic vnodes that could be used by all filesystems so that they could move between filesystems based on demand.

This design reverts back to vnodes usable by ZFS and a few other filesystems and vnodes for NFS, UFS, and most other filesystems. This will be a win for systems that run just ZFS. But, systems that are also running NFS or UFS will not be able to share vnode memory and will likely have a bigger memory footprint than if they stuck with the single type of vnode.

There has been no attempt to fix vlrureclaim() so we can end up reclaiming a bunch of vnodes of the wrong type thus reducing the usefulness of the cache without recovering any useful memory. In the worst case, we can end up with each of the two vnode pools using maxvnodes worth of memory.

We probably need to have a separate maxvnodes for each pool. Alternately we could keep track of how many vnodes are in each pool and limit the two pools to a total of maxvnodes. That of course begs the question of how we decide to divide the quota between the two pools. At a minimum, vlrureclaim() needs to have a way to decide which pool needs to have vnodes reclaimed.

Before making this change, I would like to see some measurements on an NFS server storing its files in ZFS that compares the amount of memory used in the old vnode pool versus that used in the two new vnode pools.

sys/geom/geom_vfs.c
294–295	Why can you pass NULL here and not have to come up with vm_object? You will fail to clear the VM cache which it should since flags & (V_ALT \| V_NORMAL \| V_CLEANONLY \| V_VMIO)) == 0
sys/kern/vfs_bio.c
4150–4151	Why are the following KASSERTs no longer relevant?
sys/kern/vfs_subr.c
2126	Seems like a KASSERT that the vnode has its VIRF_BUFOBJ flag set would be useful.

This revision now requires changes to proceed.Mar 22 2021, 5:19 AM

In D28856#657658, @mckusick wrote:

Three inline comments / questions.

In Sun's implementation of vnodes each filesystem type had its own pool. When I adopted the vnode idea into BSD, I created generic vnodes that could be used by all filesystems so that they could move between filesystems based on demand.

This design reverts back to vnodes usable by ZFS and a few other filesystems and vnodes for NFS, UFS, and most other filesystems. This will be a win for systems that run just ZFS. But, systems that are also running NFS or UFS will not be able to share vnode memory and will likely have a bigger memory footprint than if they stuck with the single type of vnode.

There has been no attempt to fix vlrureclaim() so we can end up reclaiming a bunch of vnodes of the wrong type thus reducing the usefulness of the cache without recovering any useful memory. In the worst case, we can end up with each of the two vnode pools using maxvnodes worth of memory.

We probably need to have a separate maxvnodes for each pool. Alternately we could keep track of how many vnodes are in each pool and limit the two pools to a total of maxvnodes. That of course begs the question of how we decide to divide the quota between the two pools. At a minimum, vlrureclaim() needs to have a way to decide which pool needs to have vnodes reclaimed.

We do not have two pools of vnodes after the patch. For very long time, we free vnode after its hold count goes to zero (mod SMR complications). The maxvnodes limit is mostly a cap on the usage of KVA for whole fs data. We already have kernel memory usage specific to fs, because sizeof of inode is fs-specific, and e.g. (assuming zfs inode is much larger than ufs inode), you actually do not get enough memory total to allocate ufs inode if zfs inode is reclaimed and freed. See a similar discussion with mjg above.

More, the memory use of either single current vnode zone, or two zones I introduced, can legitimately consume (much) more memory than would imply just maxvnodes * sizeof(struct vnode).

Mjg proposed to have bufobj allocated by separate alloc when needed, instead of using the special zone with tail for bufobj. But we already use similar allocation schemes for e.g. namecache, where the global cap on nc entries just limits total KVA, while size of nc element is determined by the length of the cached path component.

Before making this change, I would like to see some measurements on an NFS server storing its files in ZFS that compares the amount of memory used in the old vnode pool versus that used in the two new vnode pools.

For me, Peter' testing was convincing that we do not introduce new problems with vnlru(). What specifics do you want to see with nfs over ZFS? Just numbers of total memory consumed by vnode zone(s) before/after the patch, or something else (where I would worry about vnlru() being stuck, but this is not the case I believe).

sys/geom/geom_vfs.c
294–295	I do not think we could ever dirty a page for devvp, because we cannot mmap cdevs. The dirtyness for devvp is always tracked by devvp-owned buffers. But I fixed this place nonetheless, using bo2vnode().
sys/kern/vfs_bio.c
4150–4151	I believe these asserts are not too relevant before the patch, as well. They are tautological because before this patch, v_object is defined as v_bufobj->bo_object, and vmio is evaluated as v_object != NULL. After v_object is the proper member of the struct vnode instead of being a define, I cannot even formulate the tautological condition for the first (vmio == true) case. Might be it could be bp->b_bufobj == vp2bo(vp), but this is trivial after bgetvp(). For the second assert, !vmio, this is tautological check vp->v_object == NULL.
sys/kern/vfs_subr.c
2126	vp2bo() accessor already has the assert built-in. It should not be checked for the case of nullfs (handle != vp case). So I do not see a need in adding yet another assert.

Pass vnode v_object to vinval_bufobj() from geom_vfs_close().

In D28856#657865, @kib wrote:

In D28856#657658, @mckusick wrote:

Three inline comments / questions.

In Sun's implementation of vnodes each filesystem type had its own pool. When I adopted the vnode idea into BSD, I created generic vnodes that could be used by all filesystems so that they could move between filesystems based on demand.

This design reverts back to vnodes usable by ZFS and a few other filesystems and vnodes for NFS, UFS, and most other filesystems. This will be a win for systems that run just ZFS. But, systems that are also running NFS or UFS will not be able to share vnode memory and will likely have a bigger memory footprint than if they stuck with the single type of vnode.

There has been no attempt to fix vlrureclaim() so we can end up reclaiming a bunch of vnodes of the wrong type thus reducing the usefulness of the cache without recovering any useful memory. In the worst case, we can end up with each of the two vnode pools using maxvnodes worth of memory.

We probably need to have a separate maxvnodes for each pool. Alternately we could keep track of how many vnodes are in each pool and limit the two pools to a total of maxvnodes. That of course begs the question of how we decide to divide the quota between the two pools. At a minimum, vlrureclaim() needs to have a way to decide which pool needs to have vnodes reclaimed.

We do not have two pools of vnodes after the patch. For very long time, we free vnode after its hold count goes to zero (mod SMR complications).

We do have two pools of vnodes after this change. One zone is vnodes without bufobj and the other is vnodes with bufobj. My point is that both of these pools can grow big enough to hold maxvnodes of their type. We could avoid this problem if we just had a single zone of pure vnodes that would simply have a pointer to a bufobj (or could be a union * so that non bufobj filesystems like ZFS could use it for other purposes). It will be much easier for vlrureclaim() to be cleaned up if it only has a single pool of vnode to manage. I would suggest something like mjg suggested where there is a way to ask the underlying filesystem using the vnode how many pages it is identifying so as to more accurately guage how useful it is.

The maxvnodes limit is mostly a cap on the usage of KVA for whole fs data. We already have kernel memory usage specific to fs, because sizeof of inode is fs-specific, and e.g. (assuming zfs inode is much larger than ufs inode), you actually do not get enough memory total to allocate ufs inode if zfs inode is reclaimed and freed. See a similar discussion with mjg above.

More, the memory use of either single current vnode zone, or two zones I introduced, can legitimately consume (much) more memory than would imply just maxvnodes * sizeof(struct vnode).

Mjg proposed to have bufobj allocated by separate alloc when needed, instead of using the special zone with tail for bufobj. But we already use similar allocation schemes for e.g. namecache, where the global cap on nc entries just limits total KVA, while size of nc element is determined by the length of the cached path component.

The point of splitting out the bufobj is to make more efficient use of the memory. With two types of vnodes we end up tying up maxvnodes worth of sizeof (vnode + bufobj) that is unusable when most demand is for ZFS vnodes. With bufobj allocated separately we are only tying up maxvnodes worth of sizeof(bufobj) when most demand is for ZFS vnodes.

Before making this change, I would like to see some measurements on an NFS server storing its files in ZFS that compares the amount of memory used in the old vnode pool versus that used in the two new vnode pools.

For me, Peter' testing was convincing that we do not introduce new problems with vnlru(). What specifics do you want to see with nfs over ZFS? Just numbers of total memory consumed by vnode zone(s) before/after the patch, or something else (where I would worry about vnlru() being stuck, but this is not the case I believe).

If the limit on memory use is being limited by KVA for fs data, then the problem will be that the number of vnodes for ZFS and NFS will be lower with this new scheme than with the old scheme. So rather than being an issue of higher memory usage with the new scheme the issue will be longer running time because of having to reload recycled vnodes more frequently. A worst case scenario would be if most of the KVM got dedicated to one type of vnode so that the filesystem on the losing end would only have a small pool of vnodes to use.

As an aside, I wonder if the KVA memory limit is what is causing the long vlrureclaim() delays in "vlruwt" that have been recently reported when running poudriere (even when setting insanely high maxvnodes).

In D28856#658066, @mckusick wrote:

In D28856#657865, @kib wrote:

We do not have two pools of vnodes after the patch. For very long time, we free vnode after its hold count goes to zero (mod SMR complications).

We do have two pools of vnodes after this change. One zone is vnodes without bufobj and the other is vnodes with bufobj. My point is that both of these pools can grow big enough to hold maxvnodes of their type. We could avoid this problem if we just had a single zone of pure vnodes that would simply have a pointer to a bufobj (or could be a union * so that non bufobj filesystems like ZFS could use it for other purposes). It will be much easier for vlrureclaim() to be cleaned up if it only has a single pool of vnode to manage. I would suggest something like mjg suggested where there is a way to ask the underlying filesystem using the vnode how many pages it is identifying so as to more accurately guage how useful it is.

No, we do not have two pools of vnodes after this change. We have two zones, but zones do not keep vnodes, they cache partially initialized memory for vnodes. Either the current single zone, or two zones after applying the patch, do not have any limits to grow that size of the cache. But it is a cache of memory and not vnodes. Without or with the patch, only maxvnodes constructed vnodes can exist in the system. The constructed vnode is struct vnode which is correctly initialized and has identity belonging to some filesystem, or reclaimed. [In fact in some cases getnewvnodes() is allowed to ignore the limit of maxvnodes, but this is not relevant to the discussion].

Vnlru has only one free list of vnodes to process. Vnodes reclaimed e.g. by vnlru are not reused, they are freed to uma. What vnlru manages, de facto, is the count of total vnodes, and not the cache. After vnlru freed a unit number for vnode, new vnode is allocated from the proper uma zone. We do not need to free specific kind of vnode (with or without bufobj) to satisfy the incoming request, since newly allocated vnode does not reuse the memory of just disposed vnode.

Question of vnlru not making very intelligent decisions about the relative weight of the vnode for reclaim (currently it is v_object->resident_page_count, and I just fixed nullfs bug there) is also not relevant for the patch.

The point of splitting out the bufobj is to make more efficient use of the memory. With two types of vnodes we end up tying up maxvnodes worth of sizeof (vnode + bufobj) that is unusable when most demand is for ZFS vnodes. With bufobj allocated separately we are only tying up maxvnodes worth of sizeof(bufobj) when most demand is for ZFS vnodes.

maxvnodes limit is not about limiting memory for vnodes itself, it provides a cap on the resource usage by vnodes, Two important facts: 1. we calculate the default value for maxvnodes based on KVA required to hold (vnode + inode + vm_object) 2. we do not shrink constructed vnodes when lowmem handler is running.

Before making this change, I would like to see some measurements on an NFS server storing its files in ZFS that compares the amount of memory used in the old vnode pool versus that used in the two new vnode pools.

For me, Peter' testing was convincing that we do not introduce new problems with vnlru(). What specifics do you want to see with nfs over ZFS? Just numbers of total memory consumed by vnode zone(s) before/after the patch, or something else (where I would worry about vnlru() being stuck, but this is not the case I believe).

If the limit on memory use is being limited by KVA for fs data, then the problem will be that the number of vnodes for ZFS and NFS will be lower with this new scheme than with the old scheme.

Why? Limit of vnodes is not changed.

So rather than being an issue of higher memory usage with the new scheme the issue will be longer running time because of having to reload recycled vnodes more frequently. A worst case scenario would be if most of the KVM got dedicated to one type of vnode so that the filesystem on the losing end would only have a small pool of vnodes to use.

Please see above, there is no 'dedicated' KVA for 'kind' of vnodes, I limit the total number with the same formula.

As an aside, I wonder if the KVA memory limit is what is causing the long vlrureclaim() delays in "vlruwt" that have been recently reported when running poudriere (even when setting insanely high maxvnodes).

Apparently this vnlru() misbehavior has nothing to do with KVA clamp, and was diagnosed and fixed by mjg as ZFS-specific issue, see https://cgit.freebsd.org/src/commit/?id=e9272225e6bed840b00eef1c817b188c172338ee

No, we do not have two pools of vnodes after this change. We have two zones, but zones do not keep vnodes, they cache partially initialized memory for vnodes. Either the current single zone, or two zones after applying the patch, do not have any limits to grow that size of the cache. But it is a cache of memory and not vnodes. Without or with the patch, only maxvnodes constructed vnodes can exist in the system. The constructed vnode is struct vnode which is correctly initialized and has identity belonging to some filesystem, or reclaimed. [In fact in some cases getnewvnodes() is allowed to ignore the limit of maxvnodes, but this is not relevant to the discussion].

Let me try again to explain my perceived issue.

Under this scheme we have two zones. If there is a lot of ZFS activity, the vnode-only zone can be filled with maxvnodes worth of entries. Now suppose activity in ZFS drops but activity in NFS rises. Now the zone with vnodes + bufobj can fill to maxvnodes worth of memory. As I understand it we do not reclaim any of the memory from the vnode-only zone, it just sits there unable to be used. Is that correct?

In D28856#658765, @mckusick wrote:

No, we do not have two pools of vnodes after this change. We have two zones, but zones do not keep vnodes, they cache partially initialized memory for vnodes. Either the current single zone, or two zones after applying the patch, do not have any limits to grow that size of the cache. But it is a cache of memory and not vnodes. Without or with the patch, only maxvnodes constructed vnodes can exist in the system. The constructed vnode is struct vnode which is correctly initialized and has identity belonging to some filesystem, or reclaimed. [In fact in some cases getnewvnodes() is allowed to ignore the limit of maxvnodes, but this is not relevant to the discussion].

Let me try again to explain my perceived issue.

Under this scheme we have two zones. If there is a lot of ZFS activity, the vnode-only zone can be filled with maxvnodes worth of entries. Now suppose activity in ZFS drops but activity in NFS rises. Now the zone with vnodes + bufobj can fill to maxvnodes worth of memory. As I understand it we do not reclaim any of the memory from the vnode-only zone, it just sits there unable to be used. Is that correct?

No, this is not correct. Total number of vnodes (summary of both types) is limited by maxvnodes. After the load shifted from zfs to nfs in your scenario, vnlru starts reclaiming vnodes in LRU order from the global free list, freeing (vnode without bufobj)s, and most of the allocated vnodes would be from the (vnode+bufobj) zone. We do not allow more that maxvnodes total vnodes allocated in the system.

In D28856#658811, @kib wrote:

In D28856#658765, @mckusick wrote:

No, we do not have two pools of vnodes after this change. We have two zones, but zones do not keep vnodes, they cache partially initialized memory for vnodes. Either the current single zone, or two zones after applying the patch, do not have any limits to grow that size of the cache. But it is a cache of memory and not vnodes. Without or with the patch, only maxvnodes constructed vnodes can exist in the system. The constructed vnode is struct vnode which is correctly initialized and has identity belonging to some filesystem, or reclaimed. [In fact in some cases getnewvnodes() is allowed to ignore the limit of maxvnodes, but this is not relevant to the discussion].

Let me try again to explain my perceived issue.

Under this scheme we have two zones. If there is a lot of ZFS activity, the vnode-only zone can be filled with maxvnodes worth of entries. Now suppose activity in ZFS drops but activity in NFS rises. Now the zone with vnodes + bufobj can fill to maxvnodes worth of memory. As I understand it we do not reclaim any of the memory from the vnode-only zone, it just sits there unable to be used. Is that correct?

No, this is not correct. Total number of vnodes (summary of both types) is limited by maxvnodes. After the load shifted from zfs to nfs in your scenario, vnlru starts reclaiming vnodes in LRU order from the global free list, freeing (vnode without bufobj)s, and most of the allocated vnodes would be from the (vnode+bufobj) zone. We do not allow more that maxvnodes total vnodes allocated in the system.

I understand that there cannot be more than maxvnodes. What I am concerned about is how much memory is tied up in the two zones. In this example as vnlru() is freeing (vnodes without bufobjs) into the (vnodes without bufobjs) zone. It then allocates memory from (vnodes+bufobj) from the (vnodes+bufobj) zone. That allocation cannot use the memory in the (vnodes without bufobj) zone. So when we are done we have enough memory locked down in the two zones to support 2 * maxvnodes. This is much more wasteful of memory than having a single zone for pure vnodes and a second zone that holds bufobjs each of which will be limited to maxvnodes in size.

In D28856#658823, @mckusick wrote:

In D28856#658811, @kib wrote:

In D28856#658765, @mckusick wrote:

No, we do not have two pools of vnodes after this change. We have two zones, but zones do not keep vnodes, they cache partially initialized memory for vnodes. Either the current single zone, or two zones after applying the patch, do not have any limits to grow that size of the cache. But it is a cache of memory and not vnodes. Without or with the patch, only maxvnodes constructed vnodes can exist in the system. The constructed vnode is struct vnode which is correctly initialized and has identity belonging to some filesystem, or reclaimed. [In fact in some cases getnewvnodes() is allowed to ignore the limit of maxvnodes, but this is not relevant to the discussion].

Let me try again to explain my perceived issue.

Under this scheme we have two zones. If there is a lot of ZFS activity, the vnode-only zone can be filled with maxvnodes worth of entries. Now suppose activity in ZFS drops but activity in NFS rises. Now the zone with vnodes + bufobj can fill to maxvnodes worth of memory. As I understand it we do not reclaim any of the memory from the vnode-only zone, it just sits there unable to be used. Is that correct?

No, this is not correct. Total number of vnodes (summary of both types) is limited by maxvnodes. After the load shifted from zfs to nfs in your scenario, vnlru starts reclaiming vnodes in LRU order from the global free list, freeing (vnode without bufobj)s, and most of the allocated vnodes would be from the (vnode+bufobj) zone. We do not allow more that maxvnodes total vnodes allocated in the system.

I understand that there cannot be more than maxvnodes. What I am concerned about is how much memory is tied up in the two zones. In this example as vnlru() is freeing (vnodes without bufobjs) into the (vnodes without bufobjs) zone. It then allocates memory from (vnodes+bufobj) from the (vnodes+bufobj) zone. That allocation cannot use the memory in the (vnodes without bufobj) zone. So when we are done we have enough memory locked down in the two zones to support 2 * maxvnodes. This is much more wasteful of memory than having a single zone for pure vnodes and a second zone that holds bufobjs each of which will be limited to maxvnodes in size.

Memory in zones is not tied. On the first vm_lowmem condition (i.e. when pageadaemon fails to keep with the demand), the caches are purged. This does not involve any activity from vnlru(), since we only cache vnode memory, not vnodes, in zones.

Also, memory returned by uma_zfree() does not neccessary stays in zone until lowmem event. Slabs are sometimes returned back during (almost) normal uma operation.

Practically it is the same situation with two zones as with single zone.

I understand that there cannot be more than maxvnodes. What I am concerned about is how much memory is tied up in the two zones. In this example as vnlru() is freeing (vnodes without bufobjs) into the (vnodes without bufobjs) zone. It then allocates memory from (vnodes+bufobj) from the (vnodes+bufobj) zone. That allocation cannot use the memory in the (vnodes without bufobj) zone. So when we are done we have enough memory locked down in the two zones to support 2 * maxvnodes. This is much more wasteful of memory than having a single zone for pure vnodes and a second zone that holds bufobjs each of which will be limited to maxvnodes in size.

Memory in zones is not tied. On the first vm_lowmem condition (i.e. when pageadaemon fails to keep with the demand), the caches are purged. This does not involve any activity from vnlru(), since we only cache vnode memory, not vnodes, in zones.

Also, memory returned by uma_zfree() does not neccessary stays in zone until lowmem event. Slabs are sometimes returned back during (almost) normal uma operation.

Practically it is the same situation with two zones as with single zone.

There is still overhead as the zone memory has to be cleaned up (locks disposed of) and then new memory initialized (zeroed, lists and queues initialized, locks initialized, etc). Also there is extra work done detecting that we have hit these conditions and making them happen. In general we are going to have more memory tied up and do more work moving it between the zones. If we just had one zone for vnodes and another zone for bufobjs we could avoid all of this. In all likelyhood we would only need occational freeing of memory in the bufobj zone.

I am curious why you are so resistant to having a single vnode zone and a single bufobj zone?

In D28856#659119, @mckusick wrote:

There is still overhead as the zone memory has to be cleaned up (locks disposed of) and then new memory initialized (zeroed, lists and queues initialized, locks initialized, etc). Also there is extra work done detecting that we have hit these conditions and making them happen. In general we are going to have more memory tied up and do more work moving it between the zones. If we just had one zone for vnodes and another zone for bufobjs we could avoid all of this. In all likelyhood we would only need occational freeing of memory in the bufobj zone.

I am curious why you are so resistant to having a single vnode zone and a single bufobj zone?

With either (vnode + bufobj, vnode - bufobj), or (vnode - bufobj, bufobj) we still have two zones, and on low memory condition two zones needs to be drained. But for the separate bufobj zone, we additionally punish filesystems that use buffers. Instead of single allocation for vnode, they have to perform two, and also they have to perform two frees.

We have a similar structure in namecache, where {short,long}x{timestamp, no timestamp} allocations use specific zones, instead of allocating nc + path segment + timestamp.

In D28856#659141, @kib wrote:

In D28856#659119, @mckusick wrote:

There is still overhead as the zone memory has to be cleaned up (locks disposed of) and then new memory initialized (zeroed, lists and queues initialized, locks initialized, etc). Also there is extra work done detecting that we have hit these conditions and making them happen. In general we are going to have more memory tied up and do more work moving it between the zones. If we just had one zone for vnodes and another zone for bufobjs we could avoid all of this. In all likelyhood we would only need occational freeing of memory in the bufobj zone.

I am curious why you are so resistant to having a single vnode zone and a single bufobj zone?

With either (vnode + bufobj, vnode - bufobj), or (vnode - bufobj, bufobj) we still have two zones, and on low memory condition two zones needs to be drained. But for the separate bufobj zone, we additionally punish filesystems that use buffers. Instead of single allocation for vnode, they have to perform two, and also they have to perform two frees.

We have a similar structure in namecache, where {short,long}x{timestamp, no timestamp} allocations use specific zones, instead of allocating nc + path segment + timestamp.

The cost of an extra allocation versus the overhead of having to handle low-memory situations and building up and tearing down zones seems like a bad tradeoff. We already do a local allocation for each vnode allocated by a filesystem, so we could include the bufobj in that structure (i.e., for UFS it would be in the inode structure). But it seems to me that sharing the bufobj structures between filesystems that use them would be a better way to go.

In D28856#659176, @mckusick wrote:

The cost of an extra allocation versus the overhead of having to handle low-memory situations and building up and tearing down zones seems like a bad tradeoff.

Why? It is reverse, IMO: the normal system operation performs a lot of vnode allocations and deallocations, while lowmem is rare condition, where we do not worry about system performance at all, only about system liveness. Optimizing for normal path is right, optimizing for lowmem handler is not.

In D28856#659176, @mckusick wrote:

We already do a local allocation for each vnode allocated by a filesystem, so we could include the bufobj in that structure (i.e., for UFS it would be in the inode structure). But it seems to me that sharing the bufobj structures between filesystems that use them would be a better way to go.

This is how I started developing the change, my intent was to embed bufobj into inodes (v_data) for filesystems that need it. I had to add something like VOP_GET_BUFOBJ() to translate vnode to bufobj, which was quite bad already. But then, I realized that there are issues with vnode reclamation/bufobj liveness that seems to be unsolvable with this approach. Biggest offender was devfs where we do metadata io without locking devvp. But there were some problems even with e.g. more well-behaving UFS. I had to abandon this approach and require bufobj having the same lifetime as the vnode.

In D28856#659254, @kib wrote:

In D28856#659176, @mckusick wrote:

The cost of an extra allocation versus the overhead of having to handle low-memory situations and building up and tearing down zones seems like a bad tradeoff.

Why? It is reverse, IMO: the normal system operation performs a lot of vnode allocations and deallocations, while lowmem is rare condition, where we do not worry about system performance at all, only about system liveness. Optimizing for normal path is right, optimizing for lowmem handler is not.

The purpose of this change is to reduce the amount of memory dedicated to vnodes.

It will result in one of two outcomes:

If there is no memory pressure (which you assert is the usual case), we end up with two zones each with maxvnodes worth of space, one zone is with bufobj and is equal to what we already have. The other without bufobj is all using more memory than before.

There is memory pressure in which case we spend a great deal of time and effort reclaiming memory from the lesser used pool plus drain memory from other pools that likely would have been left untouched or less lightly impacted.

I do not see any gain here and there is an increase in the complexity of the VFS interface.

In D28856#659393, @mckusick wrote:

The purpose of this change is to reduce the amount of memory dedicated to vnodes.

It will result in one of two outcomes:

If there is no memory pressure (which you assert is the usual case), we end up with two zones each with maxvnodes worth of space, one zone is with bufobj and is equal to what we already have. The other without bufobj is all using more memory than before.

There is memory pressure in which case we spend a great deal of time and effort reclaiming memory from the lesser used pool plus drain memory from other pools that likely would have been left untouched or less lightly impacted.

I do not see any gain here and there is an increase in the complexity of the VFS interface.

Let me restate, maxvnodes only controls amount of live (free or not) vnodes, not the amount of memory transiently consumed by the vnode allocator.

As far as we do not get memory pressure, it is not pre-determined how and when zones would cache. Zone caching in not limited to maxvnodes at all, it can legitimately be 10xmaxvnodes or whatever, as far as uma subsystem does not feel back-pressure. But this caching is not for alive vnodes, it is for backing memory, and it is only the current implementation detail, there is nothing that in principle prevents uma from returning fully free slabs earlier than on drain.

When we get memory pressure, not reserving memory for unused bufobj gives us all that memory for immediate consumption.

And we do less operations at the hot path (vnode allocation and free) than we would do with (vnode, bufob) zone scheme, while for slow path, we have one more zone to drain anyway. either bufobj-less vnodes zone, or bufobj zone.

Just for illustration: on my w/s with hand-tuned maxvnodes at 1000000, I see

ITEM                   SIZE  LIMIT     USED     FREE      REQ     FAILSLEEP XDOMAIN
VNODE:                  488,      0,  946866, 1008446,36577061,   0,   0,   0

I think the amount of free vnodes (not vnodes on free list, which are fully constructed), but cached memory for vnodes, is explained by the load. This machine runs poudriere which unmounts populated tmpfs mounts. Since there is no immediate getnewvnode() that follows freevnode(), machine ends up with caching a lot of it. There is enough memory to not kick uma_reclaim().

[I did not spend any time reading the diff so far, just the discussion. Sorry if I missed anything below, just wanted to provide some clarification on UMA's involvement.]

In D28856#659434, @kib wrote:

In D28856#659393, @mckusick wrote:

The purpose of this change is to reduce the amount of memory dedicated to vnodes.

It will result in one of two outcomes:

If there is no memory pressure (which you assert is the usual case), we end up with two zones each with maxvnodes worth of space, one zone is with bufobj and is equal to what we already have. The other without bufobj is all using more memory than before.

This is a worst-case situation, I believe. One would have to allocate maxvnodes worth of vnodes of one type, free them all, and then allocate maxvnodes vnodes of the other type. It certainly can happen in reasonable workloads, but I note also that maxvnodes * sizeof(struct vnode) is going to be a small fraction of physical memory based on current limits, something like 2%.

There is memory pressure in which case we spend a great deal of time and effort reclaiming memory from the lesser used pool plus drain memory from other pools that likely would have been left untouched or less lightly impacted.

A recent-ish change is that we no longer fully drain UMA caches in the presence of memory pressure. That was indeed a fairly heavyweight operation. Now, we track recent allocation activity of the zone to get a working set size estimate, and instead drain anything in excess of that estimate. If the system is really in a severe shortage, we will indeed drain the caches, but that shouldn't be happening in any reasonable steady-state operation even if the page daemon is constantly running.

If there is a situation where we frequently ping-pong between heavy usage of the two vnode types, then under memory pressure we may indeed spend a significant amount of time draining slabs.

I do not see any gain here and there is an increase in the complexity of the VFS interface.

Let me restate, maxvnodes only controls amount of live (free or not) vnodes, not the amount of memory transiently consumed by the vnode allocator.

As far as we do not get memory pressure, it is not pre-determined how and when zones would cache. Zone caching in not limited to maxvnodes at all, it can legitimately be 10xmaxvnodes or whatever, as far as uma subsystem does not feel back-pressure.

This is true. In practice I suspect that the allocation+cache is still bounded to maxvnodes+epsilon. There is a single vnlru thread that frees vnodes to UMA, and getnewvnode() always allocates from UMA. When there is pressure on vnlru, some free vnodes may accumulate in the per-CPU caches, but the total size of the per-CPU caches is bounded; at some point, a bucket of free vnodes will become visible to the allocating thread. With NUMA the picture is not so clear because vnodes are allocated using a first-touch policy. (Since constructed vnodes are cached on a freelist this policy actually doesn't make much sense to me.)

With this diff, it sounds like we are increasing that total allocation+cache to 2*maxvnodes+epsilon in some cases. Conceptually, it does not seem like a fatal problem. But then it raises the question of whether we retain any benefits from the patch.

But this caching is not for alive vnodes, it is for backing memory, and it is only the current implementation detail, there is nothing that in principle prevents uma from returning fully free slabs earlier than on drain.

We've talked several times about proactively pruning UMA caches using the WSS estimate, instead of waiting for a vm_lowmem event. With some zones the cache can blow up to many times the size of the WSS for various reasons.

overall I agree with mjg and kirk that it would be simpler and better in the long term for the bufobj to be part of the fs-specific "inode" structure for file systems that need it, rather than having the fs-independent code need to know about multiple flavors of vnodes.

you mention in the review summary that there's some problem with devfs vnodes used by other file systems if the bufobj is moved from the vnode to the devfs "inode", but you didn't say what the problem was exactly. could you explain?

I suspect that the issues with moving the bufobj into the devfs inode are the same issues that led us to create the "mntfs" private vnodes that UFS now uses for its devvp rather than the original devfs vnode. currently UFS is the only file system that uses mntfs vnodes, but every other file system that uses a devfs vnode (and its bufobj) to access its underlying storage has the same issues with that that UFS did (namely that vgone'ing a file system's device vnode out from under the file system is unsafe because file systems do not follow the usual vnode locking rules for the accessing their underlying storage via device vnodes), and so really all of those other file systems should be changed to use mntfs vnodes as well. if those changes were made, would there still a problem with moving the bufobj into the devfs inode for the remaining usage of devfs vnodes (ie. from file descriptors)?

The main issue with devfs is that metadata io does not lock devvp. In other words, we cannot guarantee the liveness of devfs vnode for the duration of io, which results in problems with liveness of bufobj. I.e. bufobj could go away while owned buffer is under io. It is somewhat similar to mntfs reason, but have its own nuances. My conclusion was that the lifetime of bufobj must be equal to lifetime of vnode and not inode.

That said, I do not agree with the arguments that having two vnode zones is problematic or detrimental. UMA should (and I think recently it become better at this) auto-tune the sizes of per-zone caches, in particular it started better shrinking down oversized caches. But besides that, having bufobj allocated separately means one more allocation/free for vnode lifecycle, and one pointer dereference for each bufobj access.

I was not able to convince myself that this additional indirection is fine, so I did not reworked the patch.

If going for multiple zones, I think the way to go is to get rid of the current notion of vnodes as separate allocations to begin with and rework dooming/forced unmount semantics.

Then per-fs inode would embed 'struct vnode' for interaction with VFS and would have bufobj as needed.

The way the concept of "dooming" is currently implemented runs into many problems. For one, vnodes can supposedly have custom locking routines, but they get replaced with the default as you get doomed, meaning the actual lock can change from under code doing VOP_LOCK().

One way to fix it would be to make dooming itself only replace the vnode op table with a filesystem-supplied "bad" variant and otherwise let the thing be (in particular not mess with freeing anything at this point yet). The new ops can still do whatever i/o necessary if needed and return errors otherwise.

While a lot of churn, I think it is worth it as it gets rid of a lot of complexity.

Embedding vnode into inode is indeed a lot of churn, together with some secondary troubles. For instance, unmount of the last filesystem for given vfs type no longer means that fs-specific zone can be destroyed. So although the idea of embedding have its merit, its scope is much larger (not only due to required code churn) than struct vnode shrinking by not allocating bufobj part, which is my goal in the patch.

That said, lock switching has nothing to do with the vnode lifecycle definition. There are two current lock-switching filesystems, nullfs and ufs. For nullfs, indeed final lock switch is tied to reclamation, but for ufs it is not. Rather, ufs switches lock when vnode becomes snapshot, which occurs some time after vnode is created and written to, and when vnode stops being a snapshot, which again happens not only during reclamation.

VOP_LOCK() implementations have to be aware of the switch, so even if we switch locks by changing vop vector, instead of setting v_vnlock, the locking method implementation would be same as it is now.

The tests that Peter Holm ran (back in March) showed that buildworld ran about 1.5% slower with this change than without it. We were not able to figure out why that slowdown occurred, but unless this can be done in a way that at least has the same running time, I do not think we should do it.