sys/kern/vfs_cache.c
#include <security/mac/mac_framework.h>
#ifdef DDB
#include <ddb/ddb.h>
#endif
#include <vm/uma.h>
/*
 * High level overview of name caching in the VFS layer.
 *
 * Originally caching was implemented as part of UFS, later extracted to allow
 * use by other filesystems. A decision was made to make it optional and
 * completely detached from the rest of the kernel, which comes with limitations
 * outlined near the end of this comment block.
 *
 * This fundamental choice needs to be revisited. In the meantime, the current
 * state is described below. The significance of all notable routines is
 * explained in comments placed above their implementation. Scattered
 * throughout the file are TODO comments indicating shortcomings which can be
 * fixed without reworking everything (most of the fixes will likely be
 * reusable). Various details are omitted from this explanation so as not to
 * clutter the overview; they have to be checked by reading the code and
 * associated commentary.
 *
 * Keep in mind that it's individual path components which are cached, not full
 * paths. That is, for a fully cached path "foo/bar/baz" there are 3 entries,
 * one for each name.
 *
 * I. Data organization
 *
 * Entries are described by "struct namecache" objects and stored in a hash
 * table. See cache_get_hash for more information.
 *
 * "struct vnode" contains pointers to source entries (names which can be found
 * when traversing through said vnode), destination entries (names of that
 * vnode, see "Limitations" for a breakdown on the subject) and a pointer to
 * the parent vnode.
 *
 * The (directory vnode; name) tuple reliably determines the target entry if
 * it exists.
 *
 * Since there are no small locks at this time (all are 32 bytes in size on
 * LP64), the code minimises the number of locks by introducing lock arrays to
 * protect hash buckets and vnode lists.
 *
 * II. Filesystem integration
 *
 * Filesystems participating in name caching do the following:
 * - set the vop_lookup routine to vfs_cache_lookup
 * - set vop_cachedlookup to whatever can perform the lookup if the above fails
 * - if they support lockless lookup (see below), vop_fplookup_vexec and
 *   vop_fplookup_symlink are provided along with the MNTK_FPLOOKUP flag on
 *   the mount point
 * - call cache_purge or cache_vop_* routines to eliminate stale entries as
 *   applicable
 * - call cache_enter to add entries depending on the MAKEENTRY flag
 *
 * With the above in mind, there are 2 entry points when doing lookups:
 * - ... -> namei -> cache_fplookup -- this is the default
 * - ... -> VOP_LOOKUP -> vfs_cache_lookup -- normally only called by namei
 *   should the above fail
 *
 * Example code flow of how an entry is added:
 * ... -> namei -> cache_fplookup -> cache_fplookup_noentry -> VOP_LOOKUP ->
 * vfs_cache_lookup -> VOP_CACHEDLOOKUP -> ufs_lookup_ino -> cache_enter
 *
 * III. Performance considerations
 *
 * In the lockless case, forward lookup avoids any writes to shared areas
 * apart from the terminal path component. That is, non-modifying lookups of
 * different files don't suffer any scalability problems in the namecache.
 * Looking up the same file is limited by VFS and goes beyond the scope of
 * this file.
 *
 * At least on amd64 the single-threaded bottleneck for long paths is hashing
 * (see cache_get_hash). There are cases where the code issues an acquire
 * fence multiple times; these can be combined on architectures which suffer
 * from it.
 *
 * In the locked case each encountered vnode has to be referenced and locked
 * in order to be handed out to the caller (normally that's namei). This
 * introduces a significant single-threaded hit and serialization when
 * multi-threaded.
 *
 * Reverse lookup (e.g., "getcwd") fully scales provided it is fully cached --
 * it avoids any writes to shared areas for any of the components.
 *
 * Unrelated insertions are partially serialized on updating the global entry
 * counter and possibly serialized on colliding bucket or vnode locks.
 *
 * IV. Observability
 *
 * Note that not everything has an explicit dtrace probe, nor should it have
 * one; thus some of the one-liners below depend on implementation details.
 *
 * Examples:
 *
 * # Check what lookups failed to be handled in a lockless manner. Column 1 is
 * # line number, column 2 is status code (see cache_fpl_status)
 * dtrace -n 'vfs:fplookup:lookup:done { @[arg1, arg2] = count(); }'
 *
 * # Lengths of names added by binary name
 * dtrace -n 'fbt::cache_enter_time:entry { @[execname] = quantize(args[2]->cn_namelen); }'
 *
 * # Same as above but only those which exceed 64 characters
 * dtrace -n 'fbt::cache_enter_time:entry /args[2]->cn_namelen > 64/ { @[execname] = quantize(args[2]->cn_namelen); }'
 *
 * # Who is performing lookups with spurious slashes (e.g., "foo//bar") and what
 * # path is it
 * dtrace -n 'fbt::cache_fplookup_skip_slashes:entry { @[execname, stringof(args[0]->cnp->cn_pnbuf)] = count(); }'
 *
 * V. Limitations and implementation defects
 *
 * - since it is possible there is no entry for an open file, tools like
 *   "procstat" may fail to resolve fd -> vnode -> path to anything
 * - even if a filesystem adds an entry, it may get purged (e.g., due to memory
 *   shortage) in which case the above problem applies
 * - hardlinks are not tracked, thus if a vnode is reachable in more than one
 *   way, resolving a name may return a different path than the one used to
 *   open it (even if said path is still valid)
 * - by default entries are not added for newly created files
 * - adding an entry may need to evict a negative entry first, which happens in
 *   2 distinct places (evicting on lookup, adding in a later VOP) making it
 *   impossible to simply reuse it
 * - there is a simple scheme to evict negative entries as the cache is
 *   approaching its capacity, but it is very unclear if doing so is a good
 *   idea to begin with
 * - vnodes are subject to being recycled even if the target inode is left in
 *   memory, which loses the name cache entries when it perhaps should not. In
 *   case of tmpfs names get duplicated -- kept by the filesystem itself and
 *   by the namecache separately
 * - struct namecache has a fixed size and comes in 2 variants, often wasting
 *   space; it is now hard to replace with malloc due to the dependence on
 *   SMR, which the dedicated uma zones are hooked into
 * - lack of better integration with the kernel also turns nullfs into a
 *   layered filesystem instead of something which can take advantage of
 *   caching
 */
static SYSCTL_NODE(_vfs, OID_AUTO, cache, CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
    "Name cache");
SDT_PROVIDER_DECLARE(vfs);
SDT_PROBE_DEFINE3(vfs, namecache, enter, done, "struct vnode *", "char *",
    "struct vnode *");
SDT_PROBE_DEFINE3(vfs, namecache, enter, duplicate, "struct vnode *", "char *",
    "struct vnode *");
mckusick: Like byte-range file locking, I originally implemented the name cache as part of UFS. Both were later extracted from UFS and provided as kernel support so that they could be used by other filesystems.