This provides an implementation of three Linux system calls,
inotify_init1(), inotify_add_watch() and inotify_rm_watch(). They can
be used to generate a stream of notifications for file operations. This
is morally similar to EVFILT_VNODE, but allows one to watch all files in
a directory without opening every single file. This avoids races where a
file is created and then must be opened before events can be received.
This implementation is motivated by a desire to close the above race,
while adopting an interface that is already widely used. It also
greatly reduces the resources needed to monitor a large hierarchy, as
with kevent() one needs an fd per monitored file. It was implemented
using the Linux manual page as a reference and aims to be
API-compatible. At present it is not perfectly compatible due to some
limitations imposed by our name cache design, but I think it's close
enough for most purposes.
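As a rough illustration of the interface from userland, the sketch below
watches a directory through a single descriptor and walks the packed,
variable-length records that read(2) returns. It uses only calls
documented in the Linux manual page; the helper names watch_directory()
and count_events() and the choice of event mask are made up for this
example.

```c
#include <sys/inotify.h>

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: return an inotify descriptor with one watch on
 * "dir", or -1 on error.  The single watch covers every entry in the
 * directory, so no per-file descriptor is needed.
 */
static int
watch_directory(const char *dir)
{
	int fd;

	assert(dir != NULL);
	fd = inotify_init1(IN_CLOEXEC);
	if (fd == -1)
		return (-1);
	if (inotify_add_watch(fd, dir,
	    IN_CREATE | IN_DELETE | IN_MODIFY) == -1) {
		(void)close(fd);
		return (-1);
	}
	return (fd);
}

/*
 * Hypothetical helper: walk a buffer filled by read(2) on an inotify
 * descriptor and count the records whose mask intersects "mask".
 * Each record is a struct inotify_event header followed by "len" bytes
 * of NUL-padded name.
 */
static size_t
count_events(const char *buf, size_t buflen, uint32_t mask)
{
	struct inotify_event ev;
	size_t count, off;

	count = 0;
	for (off = 0; off + sizeof(ev) <= buflen;
	    off += sizeof(ev) + ev.len) {
		/* Copy the header out to avoid unaligned access. */
		memcpy(&ev, buf + off, sizeof(ev));
		if ((ev.mask & mask) != 0)
			count++;
	}
	return (count);
}
```

A caller would read() from the descriptor into a buffer of at least
sizeof(struct inotify_event) + NAME_MAX + 1 bytes and hand the bytes
returned to count_events().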
The patch introduces a new fd type, DTYPE_INOTIFY, three new system
calls which manipulate it (inotify_init1() is actually implemented
using __specialfd(), so only two new system call numbers are
allocated), and some new VOPs (these are perhaps not needed; see
below). Below is some explanation of the internals.
Using an inotify fd, one can "watch" files or directories for events.
Once subscribed, events in the form of struct inotify_event can be read
from the descriptor. In particular, the event includes the name of the
file (technically, hard link) which triggered the event. A vnode that
has at least one watch on it has VIRF_INOTIFY set in its flags. When a
file belongs to a watched directory, it has VIRF_INOTIFY_PARENT set.
The INOTIFY macro ignores vnodes unless they have one of these flags
set, so the overhead of this feature should be small for unwatched files.
To simplify the implementation, VIRF_INOTIFY_PARENT is cleared lazily
after all watches are removed.
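The flag test can be pictured with a small user-space model. This is
not the kernel code: the flag names come from the patch, but the bit
values, the struct model_vnode type, its v_irflag field and the
function name are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names are from the patch; the bit values here are made up. */
#define	VIRF_INOTIFY		0x01	/* vnode has at least one watch */
#define	VIRF_INOTIFY_PARENT	0x02	/* vnode is in a watched directory */

/* Toy stand-in for struct vnode, reduced to the flag word. */
struct model_vnode {
	unsigned int	v_irflag;
};

/*
 * Model of the cheap test made before any inotify work is done: a
 * vnode with neither flag set is skipped without further processing.
 */
static bool
model_inotify_wanted(const struct model_vnode *vp)
{
	assert(vp != NULL);
	return ((vp->v_irflag & (VIRF_INOTIFY | VIRF_INOTIFY_PARENT)) != 0);
}
```

In this model a stale VIRF_INOTIFY_PARENT bit only makes the test return
true once more than necessary, which is why clearing it lazily is
harmless.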
For some events, e.g., file creation, rename, removal, we have the namei
data used to look up the target. In that case we can obtain the file
name unambiguously. In other cases, e.g., file I/O, we have only the
file vnode: how can we figure out whether a containing directory wants
to be notified about the I/O? For this I have to use the name cache:
cache_vop_inotify() scans cache entries pointing to the target vnode,
looking for directory vnodes with VIRF_INOTIFY set. If found, we can
log an event using the name. An inotify watch keeps the directory's
usecount above zero, preventing its outgoing namecache entries from
being reclaimed.
The above scheme mostly works, but has two bugs involving hard links,
both with the same cause: when an event occurs on a vnode, we have no
way of knowing which hard link triggered the event. On Linux, an open
file handle keeps track of
the name used to open the backing vnode, updated during renames, but we
don't have this glue. Thus, if a directory containing two links for the
same file is watched, each access to the file will generate two inotify
events (except for IN_CREATE, IN_MOVE and IN_DELETE). And, if a watched
directory contains a file that is linked outside the directory, and that
link is accessed, the directory will get an event even though the access
technically happened from outside the directory. I think these issues
are unlikely to be a major problem.
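Both effects fall out of a toy model of the name cache scan. The sketch
below is not cache_vop_inotify(): the types, the field names and
model_scan() are hypothetical, and a plain array stands in for the name
cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy vnode: "watched" stands in for VIRF_INOTIFY. */
struct model_vnode {
	bool	watched;
};

/* Toy name cache entry: directory "dvp" maps "name" to "vp". */
struct model_ncentry {
	const struct model_vnode *dvp;
	const char		 *name;
	const struct model_vnode *vp;
};

/*
 * Model of the scan: visit every cache entry that resolves to "vp" and
 * report one event per entry whose directory is watched.  Returns the
 * number of events; the first "max" names are stored in "names".
 * Nothing here records which link was actually used for the access.
 */
static size_t
model_scan(const struct model_ncentry *nc, size_t n,
    const struct model_vnode *vp, const char **names, size_t max)
{
	size_t i, nevents;

	assert(nc != NULL || n == 0);
	nevents = 0;
	for (i = 0; i < n; i++) {
		if (nc[i].vp != vp || !nc[i].dvp->watched)
			continue;
		if (nevents < max)
			names[nevents] = nc[i].name;
		nevents++;
	}
	return (nevents);
}
```

With links "a" and "b" to one file in a watched directory, a single
access makes model_scan() return 2. With the file linked as "a" in the
watched directory and as "c" in an unwatched one, an access through "c"
still reports an event named "a".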
inotify is fundamentally about monitoring accesses via file handles, not
all file accesses in general. So, I inserted INOTIFY calls directly
into the system call layer, close to where the file handle is used to
obtain the vnode, rather than using vop_*_post callbacks like
EVFILT_VNODE does. Hooking the vop_*_post calls is not quite right for
inotify, e.g., open(file, O_PATH) should generate an IN_OPEN event even
though we don't call VOP_OPEN in this case.
I introduced two VOPs, VOP_INOTIFY and VOP_INOTIFY_ADD_WATCH, which
ensure that when inotify is applied to a nullfs vnode, we automatically watch
the lower vnode (or search the lower vnode for watches). The new VIRF flags
are lazily synchronized in null_bypass().