Move struct bufobj out of struct vnode
Needs Review · Public

Authored by kib on Mon, Feb 22, 2:26 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

Remove v_bufobj from struct vnode, but move v_object into the vnode. Define struct vnode_bo, which consists of struct vnode and struct bufobj; it is backed by a separate zone.

If a filesystem uses the buffer cache, as indicated by the MNTK_USES_BCACHE mount point flag, getnewvnode() allocates a vnode_bo instead of a plain vnode and marks the vnode with the VIRF_BUFOBJ flag. For convenience, syncer vnodes are also automatically extended with a bufobj.
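To make the shape of the change concrete, here is a minimal sketch of the layout and the allocation decision described above; the member names, zone names, and flag handling are illustrative assumptions, not necessarily what the patch itself does:

    /* Hypothetical layout: one allocation holding a vnode followed by its bufobj. */
    struct vnode_bo {
            struct vnode    vb_vnode;       /* must come first, so the object is
                                               usable wherever a vnode is expected */
            struct bufobj   vb_bufobj;      /* buffer cache state */
    };

    /* Two backing UMA zones: plain vnodes, and vnode + bufobj blobs. */
    static uma_zone_t vnode_zone;           /* sizeof(struct vnode) */
    static uma_zone_t vnode_bo_zone;        /* sizeof(struct vnode_bo) */

    /* Simplified selection in getnewvnode(); the real code differs in detail. */
    if (mp != NULL && (mp->mnt_kern_flag & MNTK_USES_BCACHE) != 0) {
            struct vnode_bo *vb;

            vb = uma_zalloc(vnode_bo_zone, M_WAITOK);
            vp = &vb->vb_vnode;
            vp->v_irflag |= VIRF_BUFOBJ;    /* bufobj is co-located with this vnode */
    } else {
            vp = uma_zalloc(vnode_zone, M_WAITOK);
    }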

I considered not co-locating the vnode and bufobj, instead putting the bufobj into the inodes of filesystems that need buffers. This appeared too problematic due to lifetime issues, not for the filesystems themselves, but for the use of their vnodes by other filesystems. For instance, devvp buffers cannot be handled reliably if the bufobj for devvp is located in cdev_priv (the devfs spec inode).

Right now this shrinks struct vnode from 448 to 304 bytes for filesystems that do not use the buffer cache, i.e. at least zfs/tmpfs/nullfs. Filesystems using the buffer cache still get the 'large' struct vnode + struct bufobj blob when allocating a vnode.

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

kib requested review of this revision. Mon, Feb 22, 2:26 PM

I think this runs into trouble with the way vnlru is implemented. It may maintain some used/free vnode targets, but the resulting freed memory may not be reusable by other filesystems. For example, consider a box with ufs + nullfs, which now cannot reuse each other's vnodes. I think the long-term solution is to fundamentally rework this model and put a small 'struct vnode' as part of the fs-specific object, which would also mean reworking forced unmount.

In the meantime I think this can work if the bufobj is allocated from an additional zone and struct vnode stores a pointer. The pointer may only be nullified in freevnode to avoid spurious checks.
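A rough sketch of what that alternative could look like (the field, zone, and placement shown here are hypothetical, not part of the actual proposal):

    /* struct vnode would carry only a pointer, NULL when the filesystem
       does not use the buffer cache. */
    struct bufobj   *v_bufobj;

    /* In freevnode(), the single place where the pointer is cleared. */
    if (vp->v_bufobj != NULL) {
            uma_zfree(bufobj_zone, vp->v_bufobj);
            vp->v_bufobj = NULL;
    }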

In D28856#646077, @mjg wrote:

I think this runs into trouble with the way vnlru is implemented. It may maintain some used/free vnode targets, but the resulting freed memory may not be reusable by other filesystems. For example, consider a box with ufs + nullfs, which now cannot reuse each other's vnodes. I think the long-term solution is to fundamentally rework this model and put a small 'struct vnode' as part of the fs-specific object, which would also mean reworking forced unmount.

In the meantime I think this can work if the bufobj is allocated from an additional zone and struct vnode stores a pointer. The pointer may only be nullified in freevnode to avoid spurious checks.

vnlru does not reuse vnodes, and I doubt that its goal is to reuse vnode memory. Vnodes are freed and allocated anew; the daemon maintains the target by vnode count, and that is the only restriction it enforces. Since the default limits are still tuned by the KVA usage for vnode_bo + a large NFS inode, nothing really changes WRT vnlru, except that the vnode cache can now consume less memory. If you mean that other subsystems could eat this memory and the vnode allocator would then be unable to allocate, again this sounds too unlikely; testing should show it.

BTW, NFS_NCLNODE_SZ should be updated after the current changes, but I will probably do that once this patchset has landed.

The fact that the vnode cache does not try to react to vm_lowmem is also a bug.

On a general note, the entire mechanism is already highly deficient. For example, the code can bump into global limits even when there are free vnodes in local per-CPU caches. This needs a complete revamp which either lifts it out of UMA or integrates it into UMA completely.

vnlru does not reuse memory; it tries to free vnodes so that they can be reused, in particular in vn_alloc, where a call to vnlru_free_locked will try to free something up. In the current code, should the routine succeed, there will be a vnode up for grabs. In contrast, with your patch the freed vnode may be of the wrong kind and thus not usable. Thus, in my opinion, the patch as proposed makes the situation worse.

In D28856#646189, @mjg wrote:

vnlru does not reuse memory; it tries to free vnodes so that they can be reused, in particular in vn_alloc, where a call to vnlru_free_locked will try to free something up. In the current code, should the routine succeed, there will be a vnode up for grabs. In contrast, with your patch the freed vnode may be of the wrong kind and thus not usable. Thus, in my opinion, the patch as proposed makes the situation worse.

In the current code, if vnlru frees a vnode, there is no guarantee that there is a vnode up for grabs, because the memory could be grabbed for something else in the meantime. The only guiding mechanism is the vnode count. I do not believe that your objection is practical.

In fact, the vnode_bo/vnode mechanism is identical to the namecache small/large/ts/no-ts allocations, and for the namecache it is even more varied.
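For comparison, a simplified rendition of the namecache allocation being referred to; the actual code in vfs_cache.c is more involved, so treat this as a sketch:

    /* The namecache already picks one of several zones per allocation,
       based on name length and whether timestamps are kept. */
    static struct namecache *
    cache_alloc_sketch(int len, bool ts)
    {
            if (len <= CACHE_PATH_CUTOFF)
                    return (uma_zalloc(ts ? cache_zone_small_ts :
                        cache_zone_small, M_WAITOK));
            return (uma_zalloc(ts ? cache_zone_large_ts :
                cache_zone_large, M_WAITOK));
    }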

As noted earlier, the mechanism is highly deficient. In fact, now that I think of it, it is even worse than described because of the way SMR is implemented, where freeing can only happen in full batches.

That said, if the patch is to go in, it has to come with an explanation in vn_alloc_hard. Currently it states:

* The routine can try to free a vnode or stall for up to 1 second waiting for
* vnlru to clear things up, but ultimately always performs a M_WAITOK allocation.

Given the realities (and especially so with the proposed patch), I think it ends up being misleading. Instead it should note that the vnode count just provides some notion of controlling the total memory usage stemming from vnodes themselves and their associated (meta)data, where the vnlru free call may free up some memory but is not guaranteed to make any of it reusable for this allocation.
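A possible wording along those lines (a sketch of the suggested comment, not the actual change):

    /*
     * The vnode count limit only gives a rough notion of bounding the total
     * memory used by vnodes and their associated (meta)data.  The vnlru free
     * call below may release some memory, but nothing guarantees that any of
     * it becomes reusable for this particular allocation; ultimately the
     * routine always falls back to a M_WAITOK allocation.
     */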

The ZFS bits look good, just one question about the change in zfs_freebsd_bmap.

sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
4323

What does changing this to NULL do?

kib marked an inline comment as done. Wed, Feb 24, 11:12 PM
kib added inline comments.
sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
4323

This is important for vnode_generic_pager(), which uses the returned bufobj to actually read the content of the requested pages. The generic pager cannot work for ZFS, which provides its own VOP_GETPAGES() implementation.

Since we no longer have a spare bufobj in the struct vnode, returning NULL there should change nothing.
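For context, a sketch of the zfs_freebsd_bmap() change being discussed; the body is abbreviated and may not match the diff exactly:

    static int
    zfs_freebsd_bmap(struct vop_bmap_args *ap)
    {
            /* ZFS implements VOP_GETPAGES() itself, so the generic pager is
               never used; the bufobj pointer can be reported as NULL. */
            if (ap->a_bop != NULL)
                    *ap->a_bop = NULL;      /* previously &ap->a_vp->v_bufobj */
            if (ap->a_bnp != NULL)
                    *ap->a_bnp = ap->a_bn;
            if (ap->a_runp != NULL)
                    *ap->a_runp = 0;
            if (ap->a_runb != NULL)
                    *ap->a_runb = 0;
            return (0);
    }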