Page MenuHomeFreeBSD

4/7 vfs: add per-mount vnode lazy list and use it for deferred inactive + msync

Authored by mjg on Jan 1 2020, 10:55 PM.



The primary motivation for this change is to eliminate the active list (see remaining reviews).

This obviates the need to scan the entire active list looking for vnodes
of interest.

msync is handled by adding all vnodes with write count to the lazy list.

deferred inactive directly adds vnodes as it sets the VI_DEFINACT

Vnodes get dequeued from the list when their hold count reaches 0.

Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that
spurious locking is avoided in the common case.

Test Plan

tested by pho

Diff Detail

rS FreeBSD src repository - subversion
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mjg retitled this revision from vfs: add per-mount vnode dirty list and use it for deferred inactive + msync to 2/5 vfs: add per-mount vnode dirty list and use it for deferred inactive + msync.Jan 1 2020, 10:56 PM

I like the structural change and explicit dirty list. The duplication of code for iteration seems unfortunate. Are there any differences except for which TAILQ we're iterating? Can we use a common set of iteration code? I still don't love the filter concept. It feels like a workaround for other synchronization issues.

Kirk, it was just brought to my attention that vnlru does not actually use the lru free list but instead uses the vnode instantiation time ordered mountlist. Is there any good reason for this or is this just a workaround for other deficiencies?

The duplication is addressed in the 4th diff where active list is removed. I added the filter stuff to band aid over the current approach. Both currently and with the patchset vfs performs its own vnode iteration over stuff regardless of what the filesystem is doing, which I consider to be a bug. However, coming up with a nice API which allows everyone to iterate once and do all they need requires some flaming and more importantly patching all filesystems to cooperate. The current form is a trivial temporary drop-in replacement without either need and with lessened deficiency of the original.

vnlru iterates in 2 ways:

  1. calls into vnlru_free_locked where it walks the free list to deallocate *free* vnodes
  2. for each mount point calls into vlrureclaim where it walks vnode list in fifo-ish order to deallocate *non-free* vnodes

The second iteration is fifo-ish because there is no other ordering for vnodes added to a mount point. They are never requeued (modulo by vlrureclaim). With the 4th patch (D22997) global LRU-ish ordering is introduced and vnlru can be converted to just walk the global list. I intend to do it after this patchset is sorted out.

1239 ↗(On Diff #66224)

Did you actually observed a case when the the VOP was called with vp->v_mount == NULL ?

3113 ↗(On Diff #66224)

You do not need a nested block there.

} else if (vp->v_iflag & VI_OWEINACT) {
} else {
4465 ↗(On Diff #66224)

It is strange that you do not check VV_NOSYNC there. Once the flag is set, it is not cleared.

And I do not like VOP_ISLOCKED() there, as before. It allows the rare and then esp. hard to note race where vnode' pages could be left unsynced for at least one syncer run. I do not see why not use trylock.

151 ↗(On Diff #66224)

This is the part that I dislike most from the whole patch. It blows struct vnode by 16 bytes, which scales to additional tens of MBs of kernel memory for typical machine.

Why don't you reused v_actfreelist ?

151 ↗(On Diff #66224)

The internal fragmentation in the slab means there's more space here before we actually consume more physical memory. It is a step function on PAGE_SIZE / sizeof(vnode). So we shouldn't gratuitously blow it out and there are ways we could make it smaller fairly reasonably but every pointer doesn't translate to increased physical memory.

We are looking at multi-page slabs for some of these larger zones to control fragmentation better but unless you make ridiculous allocation sizes you don't get a lot of benefit.

1239 ↗(On Diff #66224)

Yes, stress2 managed to panic the patchset without this. I don't remember the backtrace.

3113 ↗(On Diff #66224)


4465 ↗(On Diff #66224)

The check in the filter is just a copy-paste of what msync was doing in the loop. I can add VV_NOSYNC no problem.

Similarly, VOP_ISLOCKED was there. I think it's rather iffy and I'm more than happy to just remove it.

151 ↗(On Diff #66224)

I should have noted this. According to my conversations with jeff there was unused space in the slab.

With the D22997 the active list is retired and instead global vnode linkage is introduced (making 3 linkages in total: all allocated managed LRU, assigned to a mount point, dirty list on a mount point)

As for sizeof itself, I plan to shrink the struct later. v_tag is almost unused (the only consumer can be easily patched away), flag fields are bigger than they need to be and fit in the hole at the beginning of the struct. Note there are several ufs-isms in the vnode perhaps should be moved to something else so that othe filesystems don't have to pay for it.

  • rename VI_DEFERRED_INACTIVE to VI_DEFINACT and group it with the other flags for consistency
  • trylock for non-MNT_WAIT case in both msync and inactive
  • check for VV_NOSYNC
mjg edited the summary of this revision. (Show Details)
mjg retitled this revision from 2/5 vfs: add per-mount vnode dirty list and use it for deferred inactive + msync to 4/7 vfs: add per-mount vnode dirty list and use it for deferred inactive + msync.Jan 5 2020, 4:06 AM

pho tested the previous iteration of the patchset (without extra split), it's almost the same code when summed up.

Overall I like these changes. I have one specific comment here and also a suggested change to qsync/qsyncvp to mitigate the increased cost of qsync().

151 ↗(On Diff #66224)

It would be very helpful to add multi-page slabs. When I was writing my book some years ago I did some tests and found that nearly 10% of wired memory in the kernel was wasted space due to bad packing of slabsAt that time vnodes were the worst offenders followed by processes. I note that Solaris supports multi-page slabs and has much less wasted kernel memory.

mjg retitled this revision from 4/7 vfs: add per-mount vnode dirty list and use it for deferred inactive + msync to 4/7 vfs: add per-mount vnode lazy list and use it for deferred inactive + msync.
mjg edited the summary of this revision. (Show Details)
mjg edited the test plan for this revision. (Show Details)
  • rename dirty to lazy
This revision is now accepted and ready to land.Jan 12 2020, 7:48 PM