
inotify: Initial revision
ClosedPublic

Authored by markj on May 12 2025, 8:52 PM.

Details

Summary

This provides an implementation of three Linux system calls,
inotify_init1(), inotify_add_watch() and inotify_rm_watch(). They can
be used to generate a stream of notifications for file operations. This
is morally similar to EVFILT_VNODE, but allows one to watch all files in
a directory without opening every single file. This avoids races where a
file is created and then must be opened before events can be received.

This implementation is motivated by a desire to close the above race,
while adopting an interface that is already widely used. It also
greatly reduces the resources needed to monitor a large hierarchy, as
with kevent() one needs an fd per monitored file. It was implemented
using the Linux manual page as a reference and aims to be
API-compatible. At present it is not perfectly compatible due to some
limitations imposed by our name cache design, but I think it's close
enough for most purposes.

The patch introduces a new fd type, DTYPE_INOTIFY, three new system
calls which manipulate that API (actually, inotify_init1() is implemented
using __specialfd(), so only two new system call numbers are allocated),
and some new VOPs (these are perhaps not needed, see below). Below is
some explanation of the internals.

Using an inotify fd, one can "watch" files or directories for events.
Once subscribed, events in the form of struct inotify_event can be read
from the descriptor. In particular, the event includes the name of the
file (technically, hard link) which triggered the event. A vnode that
has at least one watch on it has VIRF_INOTIFY set in its flags. When a
file belongs to a watched directory, it has VIRF_INOTIFY_PARENT set.
The INOTIFY macro ignores vnodes unless they have one of these flags
set, so the overhead of this feature should be small for unwatched files.
To simplify the implementation, VIRF_INOTIFY_PARENT is cleared lazily
after all watches are removed.
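
For reference, a minimal user-space sketch of the API described above (not part of the patch; the watch mask and buffer size are arbitrary, and error handling is abbreviated):

#include <sys/inotify.h>

#include <err.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    char buf[8192], *p;
    struct inotify_event *ev;
    ssize_t n;
    int fd, wd;

    fd = inotify_init1(IN_CLOEXEC);
    if (fd == -1)
        err(1, "inotify_init1");
    wd = inotify_add_watch(fd, argc > 1 ? argv[1] : ".",
        IN_CREATE | IN_OPEN | IN_CLOSE_WRITE | IN_DELETE);
    if (wd == -1)
        err(1, "inotify_add_watch");

    for (;;) {
        n = read(fd, buf, sizeof(buf));
        if (n <= 0)
            err(1, "read");
        /*
         * Events are packed back to back in the buffer; each record is
         * followed by ev->len bytes holding the NUL-terminated name,
         * if any.
         */
        for (p = buf; p < buf + n; p += sizeof(*ev) + ev->len) {
            ev = (struct inotify_event *)p;
            printf("wd %d mask %#x name %s\n", ev->wd,
                (unsigned int)ev->mask, ev->len > 0 ? ev->name : "");
        }
    }
}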

For some events, e.g., file creation, rename, removal, we have the namei
data used to look up the target. In that case we can obtain the file
name unambiguously. In other cases, e.g., file I/O, we have only the
file vnode: how can we figure out whether a containing directory wants
to be notified about the I/O? For this I have to use the name cache:
cache_vop_inotify() scans cache entries pointing to the target vnode,
looking for directory vnodes with VIRF_INOTIFY set. If found, we can
log an event using the name. An inotify watch holds the directory
usecount > 0, preventing its outgoing namecache entries from being
reclaimed.
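
Roughly, the hook boils down to the following sketch; inotify_log_direct() and inotify_log_parents() are hypothetical names invented for illustration and are not the functions used in the patch:

/*
 * Hypothetical sketch only; the two helpers below are invented names.
 */
static void
inotify_hook_sketch(struct vnode *vp, int event)
{
    /* Unwatched vnodes fail this single flag test and return early. */
    if ((vn_irflag_read(vp) & (VIRF_INOTIFY | VIRF_INOTIFY_PARENT)) == 0)
        return;
    if ((vn_irflag_read(vp) & VIRF_INOTIFY) != 0)
        inotify_log_direct(vp, event);    /* a watch on vp itself */
    /*
     * Scan name cache entries pointing at vp for parent directories
     * with VIRF_INOTIFY set, and log an event under each cached name;
     * this is what cache_vop_inotify() does in the patch.
     */
    inotify_log_parents(vp, event);
}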

The above scheme mostly works, but has two bugs involving hard links:
when an event occurs on a vnode, we have no way of knowing which hard
link triggered the event. On Linux, an open file handle keeps track of
the name used to open the backing vnode, updated during renames, but we
don't have this glue. Thus, if a directory containing two links for the
same file is watched, each access to the file will generate two inotify
events (except for IN_CREATE, IN_MOVE and IN_DELETE). And, if a watched
directory contains a file that is linked outside the directory, and that
link is accessed, the directory will get an event even though the access
technically happened from outside the directory. I think these issues
are unlikely to be a major problem.
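
For concreteness, a small sketch of the first caveat, assuming the current directory is being watched (e.g., by the program above) and already contains a file "a":

#include <fcntl.h>
#include <unistd.h>

#include <err.h>

int
main(void)
{
    int fd;

    if (link("a", "b") == -1)
        err(1, "link");
    /*
     * "a" and "b" share one vnode, and the vnode cannot tell which name
     * was used here, so the watcher receives IN_OPEN (and later
     * IN_CLOSE) events for both names.
     */
    fd = open("a", O_RDONLY);
    if (fd == -1)
        err(1, "open");
    close(fd);
    return (0);
}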

inotify is fundamentally about monitoring accesses via file handles, not
all file accesses in general. So, I inserted INOTIFY calls directly
into the system call layer, close to where the file handle is used to
obtain the vnode, rather than using vop_*_post callbacks like
EVFILT_VNODE does. Hooking the vop_*_post calls is not quite right for
inotify, e.g., open(file, O_PATH) should generate an IN_OPEN event even
though we don't call VOP_OPEN in this case.
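
As a small illustration of the O_PATH case, assuming a watch on the containing directory is already in place and "file" exists:

#include <fcntl.h>
#include <unistd.h>

#include <err.h>

int
main(void)
{
    int fd;

    /*
     * No VOP_OPEN is issued for an O_PATH descriptor, yet per the
     * description above the watcher should still see IN_OPEN for
     * "file".
     */
    fd = open("file", O_PATH | O_CLOEXEC);
    if (fd == -1)
        err(1, "open");
    close(fd);
    return (0);
}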

I introduced two VOPs, VOP_INOTIFY and VOP_INOTIFY_ADD_WATCH, which
ensure that when inotify is applied to a nullfs vnode, we automatically watch
the lower vnode (or search the lower vnode for watches). The new VIRF flags
are lazily synchronized in null_bypass().

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline


Would it be possible to make the native sys/inotify.h a drop-in replacement for libinotify's? In other words, I want a program that was compiled against /usr/include/sys/inotify.h to work when run with LD_PRELOAD=libinotify.so. This will provide an almost painless transition from libinotify to the native implementation.

Yes, that was my intent in the previous revision, but I missed some flags. The new version should be almost completely compatible: the difference is with IN_CLOEXEC and IN_NONBLOCK, and the libinotify flag values are handled in libc. I think this provides compatibility in both directions, since libinotify already handles FreeBSD O_CLOEXEC and O_NONBLOCK as well.

I haven't tested compatibility yet. Soon I will try updating the base system+kernel on my desktop, so that applications linked against libinotify will automatically start using libc/libsys's implementations of those symbols.

BTW, my branch now adds inotify to the linuxulator, but I didn't test it much yet.

Yes, that was my intent in the previous revision, but I missed some flags. The new version should be almost completely compatible: the difference is with IN_CLOEXEC and IN_NONBLOCK, and the libinotify flag values are handled in libc. I think this provides compatibility in both directions, since libinotify already handles FreeBSD O_CLOEXEC and O_NONBLOCK as well.

I can confirm that inotify-test from libinotify-kqueue now works when compiled against /usr/include/sys/inotify.h, but linked to -linotify.

I haven't tested compatibility yet. Soon I will try updating the base system+kernel on my desktop, so that applications linked against libinotify will automatically start using libc/libsys's implementations of those symbols.

I suggest you to first apply https://reviews.freebsd.org/D50736 to your ports to actually switch most consumers to the native implementation. Otherwise the unpatched devel/libinotify will just shadow the native inotify for most ports.

Yes, that was my intent in the previous revision, but I missed some flags. The new version should be almost completely compatible: the difference is with IN_CLOEXEC and IN_NONBLOCK, and the libinotify flag values are handled in libc. I think this provides compatibility in both directions, since libinotify already handles FreeBSD O_CLOEXEC and O_NONBLOCK as well.

I can confirm that inotify-test from libinotify-kqueue now works when compiled against /usr/include/sys/inotify.h, but linked to -linotify.

What if it is still using libinotify's inotify.h?

I haven't tested compatibility yet. Soon I will try updating the base system+kernel on my desktop, so that applications linked against libinotify will automatically start using libc/libsys's implementations of those symbols.

I suggest you to first apply https://reviews.freebsd.org/D50736 to your ports to actually switch most consumers to the native implementation. Otherwise the unpatched devel/libinotify will just shadow the native inotify for most ports.

What do you mean by "shadow"? The point of testing existing packages is to make sure nothing breaks when updating the kernel+libc+libsys.

What if it is still using libinotify's inotify.h?

You mean, am I sure that it didn't pick up libinotify's inotify.h when I was testing? I'm now unsure and will redo the test.

What do you mean by "shadow"? The point of testing existing packages is to make sure nothing breaks when updating the kernel+libc+libsys.

Why test this configuration if we can first prepare Ports for the native inotify and instead test how it gets picked up?

By "shadow" I mean that unpatched devel/libinotify will install its own /usr/local/include/sys/inotify.h and a .pc file with -I/usr/local/include -L/usr/local/lib -linotify. This will make other ports pick up libinotify's libinotify.h in most cases and loading the library into a process would mean that native inotify API wouldn't be used.

What if it is still using libinotify's inotify.h?

You mean, am I sure that it didn't pick up libinotify's inotify.h when I was testing? I'm now unsure and will redo the test.

Suppose you have a dynamically linked executable which links against libinotify.so. Then you upgrade your system to get native inotify, but your executable is not updated. When you run it, the dynamic linker will search for implementations of inotify_add_watch() etc. Because the symbols in libinotify.so are unversioned, I believe the executable will automatically start using the symbols from libc.so instead of libinotify.so.

Assuming the above is correct (I will verify it soon), we should ensure that the executable still runs without problems in this situation.

What do you mean by "shadow"? The point of testing existing packages is to make sure nothing breaks when updating the kernel+libc+libsys.

Why test this configuration if we can first prepare Ports for the native inotify and instead test how it gets picked up?

By "shadow" I mean that unpatched devel/libinotify will install its own /usr/local/include/sys/inotify.h and a .pc file with -I/usr/local/include -L/usr/local/lib -linotify. This will make other ports pick up libinotify's libinotify.h in most cases and loading the library into a process would mean that native inotify API wouldn't be used.

Per what I wrote above, I believe the native implementation would still be used, but the application would be compiled against libinotify's inotify.h.

I cannot test this for the next day or two. If you are able to test it, please let me know what happens in practice.

Yep, I messed up and it was still using libinotify's inotify.h. I've redone the test properly and found that the program starts behaving differently if linked to libinotify.

The test I'm running is

./inotify-test .
touch a
rm a

When not linked with libinotify, I get

a created in watched directory
a was opened
a opened for writing was closed
a delete from watched directory

When linked to libinotify, I only get

a created in watched directory
a deleted from watched directory

I don't think it is worth fixing, because this mess can only arise for 15-CURRENT users that do not recompile their ports after this change hits the tree.

Per what I wrote above, I believe the native implementation would still be used, but the application would be compiled against libinotify's inotify.h.

Yes, which is why I propose that D50736 land first. With it, it wouldn't be possible to link to libinotify via pkgconfig, and both headers would be identical to the system one.

Only programs that were not recompiled after the base update will continue to use libinotify, which isn't a big deal.

I don't think it is worth fixing, because this mess can only arise for 15-CURRENT users that do not recompile their ports after this change hits the tree.

Well, this scenario is supposed to work in general, so I'd like to understand what's going on. I will investigate a bit.

Yep, I messed up and it was still using libinotify's inotify.h. I've redone the test properly and found that the program starts behaving differently if linked to libinotify.

The test I'm running is

./inotify-test .
touch a
rm a

When not linked with libinotify, I get

a created in watched directory
a was opened
a opened for writing was closed
a delete from watched directory

When linked to libinotify, I only get

a created in watched directory
a deleted from watched directory

I don't think it is worth fixing, because this mess can only arise for 15-CURRENT users that do not recompile their ports after this change hits the tree.

It should not be fixed. You have observed a race in libinotify. The in-kernel implementation looks correct here.

I don't think it is worth fixing, because this mess can only arise for 15-CURRENT users that do not recompile their ports after this change hits the tree.

It should not be fixed. You have observed a race in libinotify. The in-kernel implementation looks correct here.

I agree. touch(1) opens the file when creating it (but not when updating timestamps).

The diff is unlikely to change much without more feedback. If anyone would like me to split it up for easier review, please let me know.

lib/libsys/inotify.2
52

Did you consider adding the inode number there?

I mean, the Linux namecache is very different from ours, and while Linux's dir + name is something that identifies their inode, ours does not. Adding a field should be source-compatible.

63

How does an inotify fd behave on fork?

87
141

What about mknod?

146

File or vnode? Also, if vnode, effective or real link count?

Anyway, doesn't the inotify object itself acquire the use/effective count?

sys/fs/nullfs/null_subr.c
251

Did you consider taking the vnode interlock around all three if()s and using vn_irflag_set_locked()?

BTW, why are the flags in irflags? From my reading, they are actually modified under the vnode lock, and perhaps should be moved to v_flag (VV_XXX). Except that there is no space in v_flag.

sys/kern/syscalls.master
3364

I still cannot accept the frivolous breakage of the specialfd design that was done. The bug in ioctl type checking is an example of the consequences.

sys/kern/vfs_inotify.c
89

Should the file be passable over unix socket? (Follow-up to the fork question)

149

Maybe this should be an assert instead of the check?

203

Maybe use exterr and provide some description for EINVAL, to start setting an example.

219

Could you please explain this memset?

304

Indeed a useful stat should be provided.

sys/sys/resourcevar.h
125

This is the number of allocated inotify fds, am I right? And watchcnt is the number of watches registered per user.

IMO this is not a very useful design for resources. There is already a global and per-process file limit. What is needed here is a per-user limit for events. Then the overflow record should be placed when the user limit is reached, instead of per-queue. It is easier for an admin to utilize the limit this way (instead of multiplying the number of descriptors by the queue length).

lib/libsys/inotify.2
63

The fd is inherited, unlike our kqueue. This is what prevented me from enabling the inotify watcher in devel/dbus: it first creates an inotify fd and then forks to do the actual work.

lib/libsys/inotify.2
63

Yes, I mean that this should be mentioned in the man page.

markj marked 13 inline comments as done.

Try to handle most of kib's feedback.

lib/libsys/inotify.2
52

It's source-compatible, but it wouldn't be binary-compatible, and this is required to avoid spuriously breaking applications linked against libinotify.so.

Maybe we could add a new flag which requests an alternate layout for struct inotify_event. Fetching the inode number is a bit complex too: I believe we would have to issue a VOP_GETATTR for every single event to get the inode number, which is a bit expensive.

146

The file. (I would avoid referencing "vnodes" in user-facing documentation.)

When the link count of a file changes, we use VOP_GETATTR to check the va_nlink value; the usecount reference on the vnode doesn't affect that.

sys/fs/nullfs/null_subr.c
251

My reasoning was that the flags are checked "frequently" (i.e., in many VOP *_post hooks) and this is the motivation for the v_irflag split.

The flags are set under the interlock (and maybe the vnode lock, but this is coincidental), so I don't think v_flag is correct.

sys/kern/syscalls.master
3364

Sorry, can you elaborate? What frivolous breakage are you referring to?

sys/kern/vfs_inotify.c
89

Yes, I believe it would be safe.

203

I used EXTERR_CAT_FILEDESC, I'm not sure if it's an abuse or not.

219

I added a comment.

sys/sys/resourcevar.h
125

Yes, this is the number of inotify fds, and then the total number of watches per user. These limits come directly from Linux, for better or worse.

Suppose I implement your suggestion. A user has two processes, each of which opens an inotify descriptor. The first process hangs due to some bug or because it is stopped by a debugger, so events accumulate until the limit is reached. Then the second application cannot receive any more events. I think this is too fragile.

lib/libsys/inotify.2
52

this is required to avoid spuriously breaking applications linked against libinotify.so.

Do we really need this? It is nice to have if we don't need to make tradeoffs, but the problem will only be visible to CURRENT users. We could just as well ask everyone to rebuild their ports after updating the world, to get rid of all applications linked to libinotify.so.

lib/libsys/inotify.2
52

It'll be visible to users when they upgrade to 15.0-RELEASE. I want to be sure that their libinotify-linked applications don't misbehave after upgrading the base system.

And since none of those applications will expect to see an ino field, IMO it is not really useful to add one by default. Only new code will be able to use it.

lib/libsys/inotify.2
52

For the ports software we require all packages/ports to be reinstalled after major upgrades. Users following this procedure will get a FreeBSD 15 package set, for which we'll disable linking to libinotify with https://reviews.freebsd.org/D50736

Programs that are compiled by users themselves will still be linked to libinotify, but they should still work, because they were compiled against libinotify's sys/inotify.h and they will still be calling the library, not the native implementation.

The only case that comes to my mind that will be broken is:

  1. An application that is using native inotify calls inotify_init().
  2. It then loads some legacy module that pulls in libinotify.
  3. Subsequent calls to inotify_add_watch() and others will go to the library.

But this is a pretty rare case, I think.

lib/libsys/inotify.2
52

Well, aside from the compatibility issue, it's still relatively expensive to fetch the inode number, and nothing uses it (because all code using inotify today is targeting Linux). How exactly would the new field be useful?

lib/libsys/inotify.2
52

Indeed, the library should normally appear before libc in the order of DT_NEEDED objects, and the library's symbols would interpose the libc symbols. BTW, is libinotify versioned?

It might be that the symbols exported from libc should be changed to not clash with the symbols from libinotify, similar to how iconv was handled (iconv_open/libiconv_open), but in the reverse direction.

WRT how the inode number in an inotify event would be useful: how is the event used? For instance, if I watch a specific file, how can I be sure that events are for it and there is no race? Normally I would open the file, and whenever I get an event for my name, I want to be sure that the name is still a hard link to my file. How do I do that? If ino is reported, I simply compare the ino in the event with the result of fstat().
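
For illustration, a sketch of that check; note that struct inotify_event has no inode field today, so ev_ino below is a hypothetical value that such a field would carry:

#include <sys/stat.h>

#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if a name-based event (carrying the hypothetical inode
 * number ev_ino) still refers to the file we already have open.
 */
static bool
event_is_for_my_file(int myfd, uint64_t ev_ino)
{
    struct stat sb;

    if (fstat(myfd, &sb) != 0)
        return (false);
    return ((uint64_t)sb.st_ino == ev_ino);
}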

sys/fs/nullfs/null_subr.c
251

The v_irflag motivation was 'read often, set never'. Perhaps only VIRF_TEXT_REF is somewhat of an outlier there, and even this is up for discussion. All other flags are set at vnode creation time and then not changed.

The inotify flags are supposed to have an active life.

sys/kern/kern_resource.c
1752

It is fine to provide the userspace glue later, of course.

sys/kern/syscalls.master
3364

The timerfd case.

sys/kern/vfs_inotify.c
203

The most natural way would be to allocate a dedicated category, since for non-bloated kernels we report the line number of the place where the situation occurred, but not the file name.
EXTERR_CAT_INOTIFY is fine, we do not have a shortage of numbers so far.

689

Just break out of the loop, to have the free() in a single place?

714

Then return if error != 0

721

Why can't this call be vrefact()?

831

Again SET_ERROR() for the series of EINVALs would be useful.

869

Why do we need the special case of the existing watch? Is it only to decrement the limit?
Perhaps a comment would be useful.

884

How could fp be NULL there?

sys/sys/resourcevar.h
125

It is much easier (and IMO more natural) to configure global or per-uid limits than to do math like num_of_inotify times max_queue_length if we need to make something behave.

Also, it handles the normal situations better, where you have one outlier that requires a lot of queued events while other listeners are mostly dormant. If you have a per-queue limit, then you must bump it, which has a global effect and is hard to balance.

markj marked 10 inline comments as done.

Handle some of kib's comments.

lib/libsys/inotify.2
52

libinotify's symbols have no versions. I think you're right that the clash is probably not a problem in practice, unless applications are underlinked.

If you are watching a specific file, then all events for that watch are associated with the same inode/vnode, so there is no race.

If you are watching a directory, and you receive a notification for a name in that directory, then yes, it's possible for the name to refer to something else by the time you open it. But I'm not sure why that's a problem: the watcher can also receive notifications about renames or unlinks, so it can see that the name has changed.

sys/fs/nullfs/null_subr.c
251

VIRF_DOOMED is set after creation too. VIRF_MOUNTPOINT can also change during the vnode's lifetime, as mounts are created and destroyed.

In general I expect the inotify flags to be set somewhat rarely, certainly less often than they are checked. And v_irflag is specifically annotated as being synchronized by the vnode interlock, so I don't really understand the "write never" requirement.

sys/kern/vfs_inotify.c
831

BTW, it would be nice if SET_ERROR* was an expression, so we could write return (SET_ERROR0(EINVAL, ...));, like the SET_ERROR() macro in OpenZFS.

869

Yes, it's just a small hack to ensure that we handle the accounting correctly. I added a comment.

sys/sys/resourcevar.h
125

I agree that the current scheme is not ideal, but a global limit makes it very easy to starve the rest of the system, and IMHO this is worse.

Note that there is some event coalescing, so file I/O should not generate too many back-to-back events.

I've heard that the main problem in practice in Linux is the per-user limit on the number of watches: when watching some large directory hierarchy (e.g., a copy of the freebsd src tree), it's easy to run out of watches.

sys/fs/nullfs/null_subr.c
251

VIRF_DOOMED is extra special, it is in fact protected by both interlock and vnode lock, and this is actively utilized.

VIRF_MOUNTPOINT does not have the intensive modification traffic.

Since your code often seems to read the flags locklessly, e.g., in all the vnode post-VOP hooks, it would be nice to be honest/race-free and put the flags into a place which is naturally protected.

At least this is not part of the ABI.

sys/sys/resourcevar.h
125

From my experience with e.g. swap, the per-object limit scheme is not adequate. You want a reasonable global limit (global as in, e.g., per uid). When I needed swap limits, this was the only workable scheme; limits set on per-process VA did not provide the required safety, and the cause was not the shadowing.

So I do think that the current approach is a formality and does not provide effects that would allow users to put working limits on misbehaving inotify consumers.

sys/fs/nullfs/null_subr.c
251

I do not expect to see intensive modification traffic for VIRF_INOTIFY*. The most common use-case for the feature is some background indexing service which watches for new files in a hierarchy. Such a service creates watches only rarely.

And mountpoints can be created and destroyed quite frequently in some workloads, e.g., port builds in poudriere.

it would be nice to be honest/race-free

Well, suppose I use the vnode lock to synchronize the flags. What should I do for VOPs where the vnode lock is not held, e.g., VOP_READ_PGCACHE? Either I would check the flag racily, which is the same situation as right now, or acquire the lock in the _post hook, which is slow and complex.

In the current patch, the raciness is intentional. If INOTIFY observes that VIRF_INOTIFY* is set, then it will acquire the pollinfo lock to check for watches in a race-free manner. I want this feature to have minimal overhead when not enabled.

sys/sys/resourcevar.h
125

But there is a global limit: inotify_max_queued_events * inotify_max_user_instances restricts the number of pending records (per user). The current scheme is simplistic: we should derive max_queued_events from the amount of RAM in the system, rather than hard-coding a default. I will work on that.

Yes, we could make the limit of queued events per-uid instead of per-inotify fd. But again, that allows one process to starve another, and I think that's worse than the current scheme.

  • Rebase.
  • Fix handling of the queue limit, i.e., coalesce overflow events when possible.
  • Make the user watch limit a function of the maximum number of vnodes. Each watch represents a vnode ref, so we should ensure that excessive watches do not prevent vnode reclamation. At the same time, some reading suggests that the watch limit is often too low on Linux, so this approach is better than a hard-coded limit.
  • Impose a global limit on the number of watches. This is higher than the per-user limit.
  • Add a sysctl to export the total number of watches in use.
sys/fs/nullfs/null_subr.c
251

Let me rehash the intent behind the flags fields in struct vnode:

  • VV_ aka v_vflag represents the vnode state and is protected by the vnode lock, like all other live vnode state;
  • VI_ aka v_iflag is mostly internal to VFS vnode lifecycle management;
  • VIRF_ aka v_irflag was added as a hack, with the promise that it is used only for stable flags that do not change after vnode creation, therefore there is no race in reading them unlocked.

Abusing v_irflag for the transient state is perhaps technically allowed, indeed.

sys/sys/resourcevar.h
125

Starving one process because another consumes a lot of resources is in fact how resource limits are supposed to work. It helps in critical situations, and usually goes unnoticed until things go wrong.

Compare with the virtual address limits: we have a per-process VSS limit (RLIMIT_AS), and we have a per-uid swap limit. Despite the fact that we can claim the total is limited by the per-uid process limit times RLIMIT_AS, it does not work: fork() needs overcommit to be practical, and in some situations a real swap limit is required.

The same holds for all other resources: admins want to set the total limit per user, but a user should (or needs to) be able to exhaust the limit in a single process, otherwise it cannot be exercised fully.

sys/sys/resourcevar.h
125

The inotify event limit is just a seatbelt, not a resource that must be managed by the admin. An inotify queue cannot be allowed to grow indefinitely; there must be some limit on the amount of kernel memory committed to inotify events. So, the limit can be global or per-queue, but either way, applications must handle the possibility of drops.

Suppose the limit is global. Then yes, if there is one busy inotify application running on the system, the queue can be much deeper, but the possibility of drops is still there. The application must handle them, or else it is simply buggy. And, the admin cannot easily know which applications are using inotify, and how many inotify applications might be busy in a given workload.

If the limit is per-queue, then the queue size is fixed, and drops may occur more easily than with a global limit. But, applications have to handle drops anyway. In either scenario, inotify cannot be treated as a reliable file event notification mechanism. A per-queue limit is simpler and cheaper to implement, and one buggy application cannot starve others, so the system behaviour is more predictable. That's why I prefer this scheme.

Rather than comparing inotify queue limits with memory limits, IMO it makes more sense to compare with socket buffers. Applications cannot really control how much kernel memory is committed to a socket receive buffer, so there must be some per-buffer limit. Does it make sense to make this limit per-UID instead of per-socket? I do not really think so. Any kernel queue (inotify queue, socket buffer, etc.) should have some bounded size. When the size is per-queue, applications can be independently tuned for performance; when it is global, it is impossible to tune a single application since one has to account for unrelated resource utilization.

The same holds for all other resources: admins want to set the total limit per user, but a user should (or needs to) be able to exhaust the limit in a single process, otherwise it cannot be exercised fully.

But sometimes "exercised fully" is not important. One (IMHO valid) use for per-process RLIMIT_AS that I have seen in production was to ensure that a memory leak in a daemon does not result in memory pressure in the kernel or other parts of the system (composed of several dozen daemons). In this situation, a global limit was in fact undesirable: a buggy daemon should not trigger mmap() failures in unrelated services, instead it should die quickly and automatically get restarted. So, setting RLIMIT_AS to 2GB in each daemon process worked fine in practice.

  • Add a counter for event drops.
  • Fix a watch accounting leak.

I wonder if there are any other comments on the review? As far as I know, the two main points of contention in the discussion are, 1) whether to (ab)use v_irflag to store the fact that a vnode is being watched by inotify, and 2) whether the resource limits have the right shape. For 1) I think the v_irflag may technically be an abuse, but it still seems better than any alternative, and in fact I'd like to add a VIRF_KNOTE flag to replace the (also racy) VN_KNLIST_EMPTY test that we do in many VOPs. That way, we can test for inotify and knotes in a single branch, without additional memory accesses in the common case. With respect to 2), I still feel that a global limit is not the right approach, basically because the application does not control event queuing, similar to socket buffer limits and unlike memory usage limits. In any case, there is still time before 15.0 to experiment with the implementation of limits and potentially change them.

I believe I already agreed with note 1.

For note 2, I do not agree, but I do not want to block this work.

You did not add me as a reviewer, so I did not click the accept button.

This revision is now accepted and ready to land. Thu, Jul 3, 6:11 PM

This is exciting! Many thanks @markj for working on this.

I will now apply https://reviews.freebsd.org/D50736 and rebuild my desktop to test it live.