
inotify: Initial revision
Needs Review · Public

Authored by markj on May 12 2025, 8:52 PM.

Details

Reviewers
brooks
olce
Summary

This provides an implementation of three Linux system calls,
inotify_init1(), inotify_add_watch() and inotify_rm_watch(). They can
be used to generate a stream of notifications for file operations. This
is morally similar to EVFILT_VNODE, but allows one to watch all files in
a directory without opening every single file. This avoids races where a
file is created and then must be opened before events can be received.

This implementation is motivated by a desire to close the above race,
while adopting an interface that is already widely used. It also
greatly reduces the resources needed to monitor a large hierarchy, as
with kevent() one needs an fd per monitored file. It was implemented
using the Linux manual page as a reference and aims to be
API-compatible. At present it is not perfectly compatible due to some
limitations imposed by our name cache design, but I think it's close
enough for most purposes.
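
For illustration only (this program is not part of the patch, and error handling is abbreviated), a minimal consumer of the API as documented in the Linux manual pages might look like this:

#include <sys/inotify.h>
#include <err.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	_Alignas(struct inotify_event) char buf[4096];
	ssize_t n;
	int fd, wd;

	fd = inotify_init1(IN_CLOEXEC);
	if (fd < 0)
		err(1, "inotify_init1");
	/* Watch a directory; events carry the name of the affected entry. */
	wd = inotify_add_watch(fd, "/tmp", IN_CREATE | IN_DELETE | IN_MOVE);
	if (wd < 0)
		err(1, "inotify_add_watch");
	/* Each read() returns one or more packed inotify_event records. */
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		for (char *p = buf; p < buf + n;) {
			struct inotify_event *ev = (struct inotify_event *)p;
			printf("wd %d mask %#x name %s\n", ev->wd, ev->mask,
			    ev->len > 0 ? ev->name : "");
			p += sizeof(*ev) + ev->len;
		}
	}
	(void)inotify_rm_watch(fd, wd);
	(void)close(fd);
	return (0);
}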

The patch introduces a new fd type, DTYPE_INOTIFY, three new system
calls that operate on it (actually, inotify_init1() is implemented
using __specialfd(), so only two new system call numbers are allocated),
and some new VOPs (these are perhaps not needed; see below). Below is
some explanation of the internals.

Using an inotify fd, one can "watch" files or directories for events.
Once subscribed, events in the form of struct inotify_event can be read
from the descriptor. In particular, the event includes the name of the
file (technically, hard link) which triggered the event. A vnode that
has at least one watch on it has VIRF_INOTIFY set in its flags. When a
file belongs to a watched directory, it has VIRF_INOTIFY_PARENT set.
The INOTIFY macro ignores vnodes unless they have one of these flags
set, so the overhead of this feature should be small for unwatched files.
To simplify the implementation, VIRF_INOTIFY_PARENT is cleared lazily
after all watches are removed.

For some events, e.g., file creation, rename, removal, we have the namei
data used to look up the target. In that case we can obtain the file
name unambiguously. In other cases, e.g., file I/O, we have only the
file vnode: how can we figure out whether a containing directory wants
to be notified about the I/O? For this I have to use the name cache:
cache_vop_inotify() scans cache entries pointing to the target vnode,
looking for directory vnodes with VIRF_INOTIFY set. If found, we can
log an event using the name. An inotify watch holds the directory
usecount > 0, preventing its outgoing namecache entries from being
reclaimed.

The above scheme mostly works, but has two bugs involving hard links:
when an event occurs on a vnode, we have no way of knowing which hard
link triggered the event. On Linux, an open file handle keeps track of
the name used to open the backing vnode, updated during renames, but we
don't have this glue. Thus, if a directory containing two links for the
same file is watched, each access to the file will generate two inotify
events (except for IN_CREATE, IN_MOVE and IN_DELETE). And, if a watched
directory contains a file that is linked outside the directory, and that
link is accessed, the directory will get an event even though the access
technically happened from outside the directory. I think these issues
are unlikely to be a major problem.
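
To make the first case concrete, a shell sketch (names are arbitrary, and the watch mask is assumed to include the open/access events): with "watched" being monitored,

mkdir watched
touch watched/a
ln watched/a watched/b
cat watched/a

the read through "a" would be reported once for "a" and once for "b", since both names point at the same vnode; only IN_CREATE, IN_MOVE and IN_DELETE carry the specific name, as noted above.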

inotify is fundamentally about monitoring accesses via file handles, not
all file accesses in general. So, I inserted INOTIFY calls directly
into the system call layer, close to where the file handle is used to
obtain the vnode, rather than using vop_*_post callbacks like
EVFILT_VNODE does. Hooking the vop_*_post calls is not quite right for
inotify, e.g., open(file, O_PATH) should generate an IN_OPEN event even
though we don't call VOP_OPEN in this case.

I introduced two VOPs, VOP_INOTIFY and VOP_INOTIFY_ADD_WATCH, which
ensure that when inotify is applied to a nullfs vnode, we automatically watch
the lower vnode (or search the lower vnode for watches). The new VIRF flags
are lazily synchronized in null_bypass().

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

There are a very large number of changes, so older changes are hidden.
  • Flesh out the man page a bit further.
  • Add a short VOP_INOTIFY man page.
  • Implement IN_UNMOUNT support: when a watched vnode is forcibly reclaimed, generate IN_UNMOUNT events and remove watches.

I don't have any further changes planned.

Some final tweaking: move most of the INOTIFY calls to VOP post hooks.
At some point it would be useful to combine checks for inotify watches
and knotes into a single v_irflag check, otherwise we're accessing v_pollinfo
unnecessarily most of the time.

Remove some generated files that belong in a separate commit.

Qt tests now pass for me too.

One last frontier left:

poudriere testport -ci devel/kf6-kcoreaddons
<wait for build to finish>
`make -v BUILD_WRKSRC`/bin/kdirwatch_inotify_benchmarktest

The benchmark fails for me at some point with "Inotify Event queue overflowed, check max_queued_events value". I do not get this message with either the kqueue-based qfilesystemwatcher backend or the libinotify-kqueue-backed kdirwatch.

I pushed necessary commits to https://github.com/arrowd/freebsd-ports/tree/inotify-test

Qt tests now pass for me too.

One last frontier left:

poudriere testport -ci devel/kf6-kcoreaddons
<wait for build to finish>
`make -v BUILD_WRKSRC`/bin/kdirwatch_inotify_benchmarktest

The benchmark fails for me at some point with "Inotify Event queue overflowed, check max_queued_events value". I do not get this message with either the kqueue-based qfilesystemwatcher backend or the libinotify-kqueue-backed kdirwatch.

This appears to be because the default max_queued_events limit is too small. Increasing it to the Linux default (from 512 to 16384) lets the test pass.
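
For reference, with the patch this limit is a sysctl; the OID below is a guess (a later update renames the sysctl to match the Linux name), so confirm the name first:

sysctl -a | grep inotify
sysctl vfs.inotify.max_queued_events=16384    # hypothetical OID name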

  • Increase the default max_events value.
  • Rename a sysctl to match the corresponding Linux name.
  • Update the libsys version map to include inotify syscalls.

Yes, it works for me too now, and the performance is on par with the other implementations.

This is splendid work, thanks for doing it!

One note: upon exiting the poudriere jail I got the message freeing uidinfo: 65534, inotifywatchcnt = 1. Not sure if this is normal or not.

Now we need to figure out how to handle a conflict with libinotify-kqueue. The problem can be observed by taking the main branch of the ports tree and running

poudriere testport -ci devel/libinotify
<wait for the build to finish>
cc /usr/share/examples/inotify/inotify.c -lxo -I/usr/local/include

The resulting program fails with EINVAL when calling __specialfd(), because it picks up the sys/inotify.h header from the libinotify-kqueue package.

I think we should do the following in the port:

.if exists(/usr/include/sys/inotify.h)
# avoid installing /usr/local/include/sys/inotify.h
.endif

@wulf does this make sense to you?

The current ABI is not compatible: libinotify's event values differ from the system-provided ones.
You must use the libinotify header to link with libinotify, or change the system one to be ABI-compatible.

Ugh, I guess you're right. So we need to make sure that libinotify-kqueue is not installed on a system with native inotify.

This can be achieved with a new USES=inotify replacing the LIB_DEPENDS= line in each consumer port, which is quite a bit of work. I'll think a bit more and try to come up with something else.

As far as I can see, the libinotify event values were just copied from Linux, so it may make sense to change the system ones to make linuxulator support simpler.

As far as I can see, the libinotify event values were just copied from Linux, so it may make sense to change the system ones to make linuxulator support simpler.

This is fine for most of the flags, I will change it. libinotify has this weird thing:

/* Flags for the parameter of inotify_init1. */
#define IN_CLOEXEC      02000000        /* Linux x86 O_CLOEXEC */
#define IN_NONBLOCK     00004000        /* Linux x86 O_NONBLOCK */

Some applications will call inotify_init1(O_CLOEXEC) instead of specifying IN_CLOEXEC. In my inotify.h I just defined those flags to be equal, but to preserve compatibility with libinotify we have to support both values. I guess this is best handled in libc.
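
A sketch of the kind of mapping libc could do (not the actual libc code; the COMPAT_* names and the helper are invented here for illustration, and the octal values are the Linux-derived ones from the header quoted above):

#include <fcntl.h>

/* Hypothetical helper: accept libinotify's Linux-derived flag values
 * in addition to the native O_CLOEXEC/O_NONBLOCK from <fcntl.h>. */
#define COMPAT_IN_CLOEXEC	02000000	/* Linux O_CLOEXEC */
#define COMPAT_IN_NONBLOCK	00004000	/* Linux O_NONBLOCK */

static int
inotify_compat_flags(int flags)
{
	int out = flags & ~(COMPAT_IN_CLOEXEC | COMPAT_IN_NONBLOCK);

	if (flags & COMPAT_IN_CLOEXEC)
		out |= O_CLOEXEC;
	if (flags & COMPAT_IN_NONBLOCK)
		out |= O_NONBLOCK;
	return (out);
}

The inotify_init1() wrapper would then pass the normalized flags to the kernel; the native IN_CLOEXEC/IN_NONBLOCK, being defined equal to the O_* values, pass through unchanged.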

There is also this:

/* libinotify-specific - Direct mode operation. See below. */
#define IN_DIRECT       0x80000000

Does anything use it?

One note: upon exiting the poudriere jail I got the message freeing uidinfo: 65534, inotifywatchcnt = 1. Not sure if this is normal or not.

This is a bug, I believe the next update will fix it.

There is also this:

/* libinotify-specific - Direct mode operation. See below. */
#define IN_DIRECT       0x80000000

Does anything use it?

It is a libinotify-kqueue-specific thing that I invented; it is described in the commit message: https://github.com/libinotify-kqueue/libinotify-kqueue/commit/c3dff3863e386b7f0cdea98c44d92e2ee66e5a35
It does not make sense for a native inotify implementation, so our sys/inotify.h should not have it.

As far as I can see, the libinotify event values were just copied from Linux, so it may make sense to change the system ones to make linuxulator support simpler.

This is fine for most of the flags, I will change it. libinotify has this weird thing:

/* Flags for the parameter of inotify_init1. */
#define IN_CLOEXEC      02000000        /* Linux x86 O_CLOEXEC */
#define IN_NONBLOCK     00004000        /* Linux x86 O_NONBLOCK */

Some applications will call inotify_init1(O_CLOEXEC) instead of specifying IN_CLOEXEC. In my inotify.h I just defined those flags to be equal, but to preserve compatibility with libinotify we have to support both values. I guess this is best handled in libc.

libinotify accepts both the FreeBSD O_CLOEXEC value and the Linux O_CLOEXEC value (called IN_CLOEXEC) as flag parameters. The in-kernel inotify should support only the FreeBSD O_CLOEXEC value. That means #define IN_NONBLOCK O_NONBLOCK is enough; there is no need to use the Linux value for this flag.

There is also this:

/* libinotify-specific - Direct mode operation. See below. */
#define IN_DIRECT       0x80000000

Does anything use it?

The in-kernel implementation should know nothing of this flag; it enables some userland socket emulation that is very specific to libinotify.

As far as I can see, the libinotify event values were just copied from Linux, so it may make sense to change the system ones to make linuxulator support simpler.

This is fine for most of the flags, I will change it. libinotify has this weird thing:

/* Flags for the parameter of inotify_init1. */
#define IN_CLOEXEC      02000000        /* Linux x86 O_CLOEXEC */
#define IN_NONBLOCK     00004000        /* Linux x86 O_NONBLOCK */

Some applications will call inotify_init1(O_CLOEXEC) instead of specifying IN_CLOEXEC. In my inotify.h I just defined those flags to be equal, but to preserve compatibility with libinotify we have to support both values. I guess this is best handled in libc.

libinotify accepts both the FreeBSD O_CLOEXEC value and the Linux O_CLOEXEC value (called IN_CLOEXEC) as flag parameters. The in-kernel inotify should support only the FreeBSD O_CLOEXEC value. That means #define IN_NONBLOCK O_NONBLOCK is enough; there is no need to use the Linux value for this flag.

I believe libc still needs to support both values in order to avoid breakage when someone updates their libc+kernel. But the kernel doesn't need to know about it, indeed.

  • Drop some spurious diffs.
  • Change the values of IN_* flags to match libinotify.
  • Make libc handle libinotify's IN_CLOEXEC and IN_NONBLOCK values.
  • Fix an accounting bug: if VOP_INOTIFY_ADD_WATCH modifies an existing watch instead of creating a new one, don't bump the per-user watch count.

Would it be possible to make the native sys/inotify.h a drop-in replacement for libinotify's? In other words, I want a program that was compiled against /usr/include/sys/inotify.h to work when run with LD_PRELOAD=libinotify.so. This would provide an almost painless transition from libinotify to the native implementation.

  • Fix more flag definitions to be compatible.
  • Minor refactoring.

Would it be possible to make the native sys/inotify.h a drop-in replacement for libinotify's? In other words, I want a program that was compiled against /usr/include/sys/inotify.h to work when run with LD_PRELOAD=libinotify.so. This would provide an almost painless transition from libinotify to the native implementation.

Yes, that was my intent in the previous revision, but I missed some flags. The new version should be almost completely compatible: the difference is with IN_CLOEXEC and IN_NONBLOCK, and the libinotify flag values are handled in libc. I think this provides compatibility in both directions, since libinotify already handles FreeBSD O_CLOEXEC and O_NONBLOCK as well.

I haven't tested compatibility yet. Soon I will try updating the base system+kernel on my desktop, so that applications linked against libinotify will automatically start using libc/libsys's implementations of those symbols.

BTW, my branch now adds inotify to the linuxulator, but I didn't test it much yet.

Yes, that was my intent in the previous revision, but I missed some flags. The new version should be almost completely compatible: the difference is with IN_CLOEXEC and IN_NONBLOCK, and the libinotify flag values are handled in libc. I think this provides compatibility in both directions, since libinotify already handles FreeBSD O_CLOEXEC and O_NONBLOCK as well.

I can confirm that inotify-test from libinotify-kqueue now works when compiled against /usr/include/sys/libinotify.h, but linked to -linotify.

I haven't tested compatibility yet. Soon I will try updating the base system+kernel on my desktop, so that applications linked against libinotify will automatically start using libc/libsys's implementations of those symbols.

I suggest you first apply https://reviews.freebsd.org/D50736 to your ports tree, to actually switch most consumers to the native implementation. Otherwise the unpatched devel/libinotify will just shadow the native inotify for most ports.

Yes, that was my intent in the previous revision, but I missed some flags. The new version should be almost completely compatible: the difference is with IN_CLOEXEC and IN_NONBLOCK, and the libinotify flag values are handled in libc. I think this provides compatibility in both directions, since libinotify already handles FreeBSD O_CLOEXEC and O_NONBLOCK as well.

I can confirm that inotify-test from libinotify-kqueue now works when compiled against /usr/include/sys/libinotify.h, but linked to -linotify.

What if it is still using libinotify's libinotify.h?

I haven't tested compatibility yet. Soon I will try updating the base system+kernel on my desktop, so that applications linked against libinotify will automatically start using libc/libsys's implementations of those symbols.

I suggest you first apply https://reviews.freebsd.org/D50736 to your ports tree, to actually switch most consumers to the native implementation. Otherwise the unpatched devel/libinotify will just shadow the native inotify for most ports.

What do you mean by "shadow"? The point of testing existing packages is to make sure nothing breaks when updating the kernel+libc+libsys.

What if it is still using libinotify's libinotify.h?

You mean, am I sure that it didn't pick up libinotify's libinotify.h when I was testing? I'm now unsure and will redo the test.

What do you mean by "shadow"? The point of testing existing packages is to make sure nothing breaks when updating the kernel+libc+libsys.

Why test this configuration if we can first prepare Ports for the native inotify and instead test how it gets picked up?

By "shadow" I mean that unpatched devel/libinotify will install its own /usr/local/include/sys/inotify.h and a .pc file with -I/usr/local/include -L/usr/local/lib -linotify. This will make other ports pick up libinotify's libinotify.h in most cases and loading the library into a process would mean that native inotify API wouldn't be used.

What if it is still using libinotify's libinotify.h?

You mean, am I sure that it didn't pick up libinotify's libinotify.h when I was testing? I'm now unsure and will redo the test.

Suppose you have a dynamically linked executable which links against libinotify.so. Then you upgrade your system to get native inotify, but your executable is not updated. When you run it, the dynamic linker will search for implementations of inotify_add_watch() etc. Because the symbols in libinotify.so are unversioned, I believe the executable will automatically start using the symbols from libc.so instead of libinotify.so.

Assuming the above is correct (I will verify it soon), we should ensure that the executable still runs without problems in this situation.
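
A quick way to inspect the binaries involved (the library path here assumes devel/libinotify's usual install location):

readelf -d ./inotify-test | grep NEEDED
nm -D /usr/local/lib/libinotify.so | grep inotify_add_watch

If the symbol appears unversioned in both libinotify.so and libc.so, the runtime symbol lookup order decides which implementation the program ends up calling.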

What do you mean by "shadow"? The point of testing existing packages is to make sure nothing breaks when updating the kernel+libc+libsys.

Why test this configuration if we can first prepare Ports for the native inotify and instead test how it gets picked up?

By "shadow" I mean that unpatched devel/libinotify will install its own /usr/local/include/sys/inotify.h and a .pc file with -I/usr/local/include -L/usr/local/lib -linotify. This will make other ports pick up libinotify's libinotify.h in most cases and loading the library into a process would mean that native inotify API wouldn't be used.

Per what I wrote above, I believe the native implementation would still be used, but the application would be compiled against libinotify's inotify.h.

I cannot test this for the next day or two. If you are able to test it, please let me know what happens in practice.

Yep, I messed up and it was still using libinotify's libinotify.h. I've redone the test properly and found that the program starts behaving differently if linked to libinotify.

The test I'm running is

./inotify-test .
touch a
rm a

When not linked with libinotify, I get

a created in watched directory
a was opened
a opened for writing was closed
a delete from watched directory

When linked to libinotify, I only get

a created in watched directory
a deleted from watched directory

I don't think it is worth fixing, because this mess can only arise for 15-CURRENT users that do not recompile their ports after this change hits the tree.

Per what I wrote above, I believe the native implementation would still be used, but the application would be compiled against libinotify's inotify.h.

Yes, which is why I propose landing D50736 first. With it, it wouldn't be possible to link to libinotify via pkgconfig, and both headers would be identical to the system one.

Only programs that were not recompiled after the base update would still use libinotify, which isn't a big deal.

I don't think it is worth fixing, because this mess can only arise for 15-CURRENT users that do not recompile their ports after this change hits the tree.

Well, this scenario is supposed to work in general, so I'd like to understand what's going on. I will investigate a bit.

Yep, I messed up and it was still using libinotify's libinotify.h. I've redone the test properly and found that the program starts behaving differently if linked to libinotify.

The test I'm running is

./inotify-test .
touch a
rm a

When not linked with libinotify, I get

a created in watched directory
a was opened
a opened for writing was closed
a delete from watched directory

When linked to libinotify, I only get

a created in watched directory
a deleted from watched directory

I don't think it is worth fixing, because this mess can only arise for 15-CURRENT users that do not recompile their ports after this change hits the tree.

It should not be fixed. You have observed a race in libinotify; the in-kernel implementation looks correct here.

I don't think it is worth fixing, because this mess can only arise for 15-CURRENT users that do not recompile their ports after this change hits the tree.

It should not be fixed. You have observed a race in libinotify; the in-kernel implementation looks correct here.

I agree. touch(1) opens the file when creating it (but not when updating timestamps).

The diff is unlikely to change much without more feedback. If anyone would like me to split it up for easier review, please let me know.

lib/libsys/inotify.2
51 ↗(On Diff #156711)

Did you consider adding the inode number there?

I mean, the Linux namecache is very different from ours, and while on Linux dir + name is something that identifies their inode, ours does not. Adding a field should be source-compatible.

62 ↗(On Diff #156711)

How does an inotify fd behave on fork?

86 ↗(On Diff #156711)
140 ↗(On Diff #156711)

What about mknod?

145 ↗(On Diff #156711)

File or vnode? Also, if vnode, effective or real link count?

Anyway, doesn't the inotify object itself acquire the use/effective count?

sys/fs/nullfs/null_subr.c
251

Did you consider taking the vnode interlock around all three if()s and using vn_irflag_set_locked()?

BTW, why are the flags in v_irflag? From my reading, they are actually modified under the vnode lock, and perhaps should be moved to v_flag (VV_XXX). Modulo the fact that there is no space left in v_flag.

sys/kern/syscalls.master
3364

I still cannot accept the frivolous breakage of the specialfd design that was done. The bug in ioctl type checking is an example of the consequences.

sys/kern/vfs_inotify.c
89

Should the file be passable over unix socket? (Follow-up to the fork question)

149

Perhaps this should be an assert instead of the check?

203

Perhaps use exterr and provide some description for EINVAL, to start setting an example.

219

Could you please explain this memset?

304

Indeed a useful stat should be provided.

sys/sys/resourcevar.h
125

This is the number of allocated inotify fds, am I right? And watchcnt is the number of watches registered per user.

IMO this is not a very useful design for the resource limits. There are already global and per-process file limits. What is needed here is a per-user limit on events. Then the overflow record should be generated when the user limit is reached, instead of per queue. It is easier for an admin to make use of the limit this way (instead of multiplying the number of descriptors by the queue length).

lib/libsys/inotify.2
62 ↗(On Diff #156711)

The fd is inherited, unlike our kqueue. This is what prevented me from enabling the inotify watcher in devel/dbus: it first creates a notify fd and then forks to do the actual work.

lib/libsys/inotify.2
62 ↗(On Diff #156711)

Yes, I mean that this should be mentioned in the man page.

markj marked 13 inline comments as done.

Try to handle most of kib's feedback.

lib/libsys/inotify.2
51 ↗(On Diff #156711)

It's source-compatible, but it wouldn't be binary-compatible, and this is required to avoid spuriously breaking applications linked against libinotify.so.

Maybe we could add a new flag which requests an alternate layout for struct inotify_event. Fetching the inode number is a bit complex too: I believe we would have to add a VOP_GETATTR call for every single event to get the inode number, which is a bit expensive.

145 ↗(On Diff #156711)

The file. (I would avoid referencing "vnodes" in user-facing documentation.)

When the link count of a file changes, we use VOP_GETATTR to check the va_nlink value; the usecount reference on the vnode doesn't affect that.

sys/fs/nullfs/null_subr.c
251

My reasoning was that the flags are checked "frequently" (i.e., in many VOP *_post hooks) and this is the motivation for the v_irflag split.

The flags are set under the interlock (and maybe the vnode lock, but this is coincidental), so I don't think v_flag is correct.

sys/kern/syscalls.master
3364

Sorry, can you elaborate? What frivolous breakage are you referring to?

sys/kern/vfs_inotify.c
89

Yes, I believe it would be safe.

203

I used EXTERR_CAT_FILEDESC, I'm not sure if it's an abuse or not.

219

I added a comment.

sys/sys/resourcevar.h
125

Yes, this is the number of inotify fds, and then the total number of watches per user. These limits come directly from Linux, for better or worse.

Suppose I implement your suggestion. A user has two processes, each of which opens an inotify descriptor. The first process hangs due to some bug, or because it is stopped by a debugger, so events accumulate until the limit is reached. Then the second application cannot receive any more events. I think this is too fragile.

lib/libsys/inotify.2
51 ↗(On Diff #156711)

this is required to avoid spuriously breaking applications linked against libinotify.so.

Do we really need this? It is nice to have if we don't need to make tradeoffs, but the problem will only be visible for CURRENT users. We may as well just ask everyone to rebuild their ports after updating the world, to get rid of all applications linked to libinotify.so.

lib/libsys/inotify.2
51 ↗(On Diff #156711)

It'll be visible to users when they upgrade to 15.0-RELEASE. I want to be sure that their libinotify-linked applications don't misbehave after upgrading the base system.

And since none of those applications will expect to see an ino field, IMO it is not really useful to add one by default. Only new code will be able to use it.

lib/libsys/inotify.2
51 ↗(On Diff #156711)

For the ports software we require all packages/ports to be reinstalled after major upgrades. Users following this procedure will get a FreeBSD 15 package set, for which we'll disable linking to libinotify with https://reviews.freebsd.org/D50736

Programs that are compiled by users themselves will still be linked to libinotify, but they should still work, because they were compiled against libinotify's sys/inotify.h and will still be calling the library, not the native implementation.

The only case that comes to my mind that will be broken is:

  1. Application that is using native inotify calls inotify_init()
  2. It then loads some legacy module that pulls in libinotify.
  3. Subsequent calls to inotify_add_watch() and the other functions will go to the library.

But this is a pretty rare case, I think.

lib/libsys/inotify.2
51 ↗(On Diff #156711)

Well, aside from the compatibility issue, it's still relatively expensive to fetch the inode number, and nothing uses it (because all code using inotify today is targeting Linux). How exactly would the new field be useful?

lib/libsys/inotify.2
51 ↗(On Diff #156711)

Indeed, the library normally should appear before libc in the order of DT_NEEDED objects, and the library's symbols would interpose the libc symbols. BTW, is libinotify versioned?

It might be that the symbols exported from libc should be changed so as not to clash with the symbols from libinotify, similar to how iconv was handled (iconv_open/libiconv_open), but in the reverse direction.

WRT how the inode number in an inotify event would be useful: how is the event used? For instance, if I watch a specific file, how can I be sure that events are for it and there is no race? Normally I would open the file, and whenever I get an event for my name, I want to be sure that the name is still a hardlink to my file. How do I do that? If ino were reported, I would simply compare the ino in the event with the result of fstat().
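
For illustration, without an inode number in the event a watcher has to recheck file identity itself, roughly as follows (a sketch only; it assumes the identity of the watched file was recorded with fstat() when monitoring started, and the window between the event and this check is exactly the race being discussed):

#include <sys/stat.h>
#include <fcntl.h>
#include <stdbool.h>

/* Does "name" (from an event on a watched directory) still refer to the
 * file identified by "want"? */
static bool
still_same_file(int dirfd, const char *name, const struct stat *want)
{
	struct stat st;

	if (fstatat(dirfd, name, &st, AT_SYMLINK_NOFOLLOW) != 0)
		return (false);
	return (st.st_dev == want->st_dev && st.st_ino == want->st_ino);
}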

sys/fs/nullfs/null_subr.c
251

The motivation for v_irflag was 'read often, set never'. Perhaps only VIRF_TEXT_REF is somewhat of an outlier there, and even that is up for discussion. All other flags are set at vnode creation time and then not changed.

The inotify flags are expected to have an active life.

sys/kern/kern_resource.c
1752

It is fine to provide the userspace glue later, of course.

sys/kern/syscalls.master
3364

The timerfd case.

sys/kern/vfs_inotify.c
203

The most natural way would be to allocate a dedicated category, since for non-bloated kernels we report the line number of the place where the situation occurred, but not the file name.
EXTERR_CAT_INOTIFY is fine; we have no shortage of numbers so far.

689

Just break out of the loop, to have the free() in a single place?

714

Then return if error != 0

721

Why cannot this call be vrefact()?

831

Again SET_ERROR() for the series of EINVALs would be useful.

869

Why do we need the special case of the existing watch? Is it only to decrement the limit?
Perhaps a comment would be useful.

884

How could fp be NULL there?

sys/sys/resourcevar.h
125

It is much easier (and IMO more natural) to configure global or per-uid limits than to have to do math like num_of_inotify times max_queue_length when we need to rein something in.

It also handles the normal situations better, where one outlier requires a lot of queued events while the other listeners are mostly dormant. If you have a per-queue limit, then you must bump it, which has a global effect and is hard to balance.

markj marked 10 inline comments as done.

Handle some of kib's comments.

lib/libsys/inotify.2
51 ↗(On Diff #156711)

libinotify's symbols have no versions. I think you're right that the clash is probably not a problem in practice, unless applications are underlinked.

If you are watching a specific file, then all events for that watch are associated with the same inode/vnode, so there is no race.

If you are watching a directory, and you receive a notification for a name in that directory, then yes, it's possible for the name to refer to something else by the time you open it. But I'm not sure why that's a problem: the watcher can also receive notifications about renames or unlinks, so it can see that the name has changed.

sys/fs/nullfs/null_subr.c
251

VIRF_DOOMED is set after creation too. VIRF_MOUNTPOINT can also change during the vnode's lifetime, as mounts are created and destroyed.

In general I expect the inotify flags to be set somewhat rarely, certainly less often than they are checked. And v_irflag is specifically annotated as being synchronized by the vnode interlock, so I don't really understand the "write never" requirement.

sys/kern/vfs_inotify.c
831

BTW, it would be nice if SET_ERROR* were an expression, so we could write return (SET_ERROR0(EINVAL, ...));, like the SET_ERROR() macro in OpenZFS.

869

Yes, it's just a small hack to ensure that we handle the accounting correctly. I added a comment.

sys/sys/resourcevar.h
125

I agree that the current scheme is not ideal, but a global limit makes it very easy to starve the rest of the system, and IMHO this is worse.

Note that there is some event coalescing, so file I/O should not generate too many back-to-back events.

I've heard that the main problem in practice in Linux is the per-user limit on the number of watches: when watching some large directory hierarchy (e.g., a copy of the freebsd src tree), it's easy to run out of watches.
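
For a sense of scale: recursively watching a tree requires one watch per directory, so something like

find /usr/src -type d | wc -l

approximates the number of watches a naive recursive watcher would consume on a src checkout.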

sys/fs/nullfs/null_subr.c
251

VIRF_DOOMED is extra special: it is in fact protected by both the interlock and the vnode lock, and this is actively exploited.

VIRF_MOUNTPOINT does not see intensive modification traffic.

Since your code often reads the flags locklessly, e.g. in all the post-VOP hooks, it would be nice to be honest/race-free and put the flags in a place that is naturally protected.

At least this is not part of the ABI.

sys/sys/resourcevar.h
125

From my experience with e.g. swap, the per-object limit scheme is not adequate. You want a reasonable global limit (global as in, e.g., per uid). When I needed swap limits, that was the only workable scheme; limits on per-process VA did not provide the required safety, and the cause was not the shadowing.

So I do think that the current approach is merely formal and does not give users an effective way to put working limits on misbehaving inotify consumers.