Details

Reviewers

kib
mckusick

Commits

rGd8aa5ff4f606: cache: add high level overview
rGf79bd71def7a: cache: add high level overview

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

mjg requested review of this revision. Feb 15 2021, 1:37 AM

mjg created this revision.

mjg updated this revision to Diff 83900. Feb 15 2021, 1:42 AM

kib added inline comments. Feb 15 2021, 2:38 AM

sys/kern/vfs_cache.c
141	Perhaps add some language saying that it is up to the filesystem to populate the name cache as appropriate. Also I believe it is useful to note that the name cache cannot be too dynamic, which means that pseudo filesystems, for instance procfs and devfs, tend not to use it. I believe that nullfs does not use the namecache at all due to its structure. Note in the text that it might be useful to allow special filesystems to avoid the namecache, with the tmpfs nonc option as an example.
153	Not component but a vnode, i.e. the lookup result. I think it is too confusing to leave it as is.
165	dtrace probe
199	This statement depends on the fs. For instance, for UFS it is not true, i.e. there cannot exist a cached inode without a corresponding vnode. Even for SU, where the inode block might have dependencies attached to it (and thus cannot be reclaimed), it is still not the case. I think this is also false for any filesystem in FreeBSD that uses persistent storage. In other words, only tmpfs has this special feature.
203	I think this line needs a grammar fix, but I am not sure which.

mjg added inline comments. Feb 15 2021, 3:25 AM

sys/kern/vfs_cache.c
141	It states that filesystems have to call cache_enter and even shows how ufs can end up doing it. devfs is a bad example in that it is mostly static, and at least one node (null) keeps seeing a lot of traffic. I'll see about procfs.
199	In that case, can something be done about:
    /*
     * clustering stuff
     */
    daddr_t v_cstart;   /* v start block of cluster */
    daddr_t v_lasta;    /* v last allocation */
    daddr_t v_lastw;    /* v last write */
    int     v_clen;     /* v length of cur. cluster */
Removing this and moving v_hash into a 4-byte hole is almost enough to shrink the struct to fit 9 instead of 8 objects per page. There are a few more bytes to shave elsewhere in the struct.

mjg added inline comments. Feb 15 2021, 3:26 AM

sys/kern/vfs_cache.c
199	None of zfs, tmpfs or nullfs use it.

kib added inline comments. Feb 15 2021, 3:45 AM

sys/kern/vfs_cache.c
199	Of course zfs/tmpfs/nullfs do not use these fields, because they are for clustering support in the buffer cache. I do not have a good idea where to move them. It would be fine to have them in the inode, but then the clustering code needs a way to find them given only vp as input. Perhaps there could be a VOP like vop_clusterdata() that returns a pointer to a structure holding these four fields.

mjg added inline comments. Feb 15 2021, 3:54 AM

sys/kern/vfs_cache.c
199	A cursory read suggests that the vfs_cluster.c routines could grow an argument which is a pointer to these fields.

kib added inline comments. Feb 15 2021, 4:40 AM

sys/kern/vfs_cache.c
199	A VOP would be somewhat less clumsy, but ok: D28679. It is strange that you do not complain about v_bufobj then, on the same basis. On the other hand, moving bufobj to the nodes is harder due to v_object.

mjg added inline comments. Feb 15 2021, 4:47 AM

sys/kern/vfs_cache.c
199	I don't have a strong opinion, just a suggestion. It does save an indirect function call. There is a lot more to remove or shrink. Basically the vnode can be made to fit 9 per page with some headway.

Thanks for rewriting this comment.
I have provided some suggestions for cleanup and/or clarification.

sys/kern/vfs_cache.c
89	Like byte-range file locking, I originally implemented the name cache as part of UFS. Both were later extracted from UFS and provided as kernel support so that they could be used by other filesystems.
94	thorought should be either thorough or throughout
95	shortcommings => shortcomings
106	represented => described
111	vnode (see => vnode, see
118	hacks around the problem => minimises the number of locks
127	set => provided
145	For lockless case forward lookup avoids any writes to share areas apart from the terminal path component. => For lockless lookups, every path-name component, except possibly the last one, can be found and traversed without needing to make any modification to the cache data structures.
191	that => than
203	I do not understand this sentence: now hard to replace with malloc due to dependence on SMR. What are the limits of SMR? I am not aware of limits placed on use of malloc by SMR. Are there limits placed on the cache entries by SMR?

This revision is now accepted and ready to land. Feb 16 2021, 1:23 AM

address some of the feedback

This revision now requires review to proceed. Feb 16 2021, 6:05 AM

mjg updated this revision to Diff 83987. Feb 16 2021, 6:08 AM

mjg marked 4 inline comments as done.

mjg marked 3 inline comments as done. Feb 16 2021, 6:22 AM

mjg added inline comments.

sys/kern/vfs_cache.c
145	I don't think this covers it. The key here is that this scales completely, and that stems from not bouncing cache lines with anyone.
203	First, consider a pre-SMR kernel. The namecache maintains its own zones, but there is no good reason to do so that I know of; instead it could use malloc (with better than power-of-2 granularity for malloc zones). In general, if more of the kernel used malloc instead of hand-rolled zones, it would be easier to justify better malloc granularity and likely get better total memory usage. For example, most names are < 8 characters or so, yet every single one allocates a 104-byte object with 45 bytes to hold them. But there is already some overhead from the very fact that we are dealing with a namecache entry, so the buffer which comes after it has to be worth it. Should namecache-specific zones get abolished, the problem would disappear.
Next, note that in the current implementation traversal has to access memory from any of 4 namecache zones, the vnode zone and the pwd zone (for cwd and other pointers).
Consider a kernel with a generalized safe memory reclamation mechanism which supports all allocations. Whether the namecache is using custom zones, malloc or whatever else, there would be an smr_enter or compatible call up front to provide protection. This is roughly how it works in Linux with RCU.
SMR as implemented here is very different. A "vfs_smr" object is created and installed in the aforementioned zones, then smr_enter(vfs_smr) provides protection against freeing from said zones. But it does not provide any protection against freeing from malloc or anything else not roped in with vfs_smr. Let's say someone added malloc_smr. It would have to be smr_enter-ed separately, and that comes with a huge single-threaded cost (an atomic op). There was a mode to do it without said op, but it would still come with extra overhead, making this a non-starter. I don't know how the current code would handle a hypothetical global smr.
That's the rough outline. As you can see, the comment assumed some familiarity with the above.

This revision was not accepted when it landed; it landed in state Needs Review. Apr 2 2021, 3:15 AM

Closed by commit rGf79bd71def7a: cache: add high level overview (authored by mjg).

This revision was automatically updated to reflect the committed changes.

mjg added a commit: rGf79bd71def7a: cache: add high level overview.

mjg added a commit: rGd8aa5ff4f606: cache: add high level overview. Apr 10 2021, 6:04 AM

cache: add an introductory comment
Closed, Public

Diff 86726

sys/kern/vfs_cache.c
