Page MenuHomeFreeBSD

Copy kqueues into the child on fork
ClosedPublic

Authored by kib on Aug 20 2025, 10:48 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Oct 17, 2:44 PM
Unknown Object (File)
Fri, Oct 17, 1:35 PM
Unknown Object (File)
Fri, Oct 17, 11:46 AM
Unknown Object (File)
Fri, Oct 17, 10:22 AM
Unknown Object (File)
Fri, Oct 17, 2:40 AM
Unknown Object (File)
Fri, Oct 17, 1:54 AM
Unknown Object (File)
Thu, Oct 16, 7:23 PM
Unknown Object (File)
Thu, Oct 16, 4:55 PM

Details

Summary

Add KQUEUE_CPONFORK flag for kqueue1(2), that marks the created kqueue as requiring copy on fork into the child' file descriptor table.

This means that a new kqueue is created at the same fd index as the parent' kqueue, and all registered events are copied into the new kqueue (when possible). The current active events list is also duplicated.

Another sore place is the removal on the marker from the knlist, which is very high-cost for SLIST (the second XXXKIB place).

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
kib marked an inline comment as done.

Ignore fd-based knotes for which fd was not copied.
Tweak locking to avoid witness warning about dup filedesc lock.

kib marked an inline comment as done.Aug 21 2025, 7:20 PM

Handle copy for vfs/vnode filters

Skip copying of influx notes (that are infux not due to scan).
Add copy method for netmap notes.

Call kqueue_expand() as needed. Not doing so has profound consequences, including not calling knote_drop() on close of forked kqueue.

Allocate forked kqueue fd with the ref count on 1 instead of 2. Using '2' hide the issue with missed knote_drop().

Harmonize internal interfaces for copying specific knote.

kib removed subscribers: kevans, markj.

From my current PoV, this should be more or less finished implementation. In particular, I reviewed all filters for the need of special copy handling.

In D52045#1193293, @kib wrote:

From my current PoV, this should be more or less finished implementation. In particular, I reviewed all filters for the need of special copy handling.

Ping?

In D52045#1199301, @kib wrote:
In D52045#1193293, @kib wrote:

From my current PoV, this should be more or less finished implementation. In particular, I reviewed all filters for the need of special copy handling.

Ping?

Sorry, I will try to get to this patch later this week.

kib added a subscriber: pho.

Rebase. Document KQUEUE_CPONFORK

sys/compat/linux/linux_event.c
107

Shouldn't the third arg be true in this call? In Linux epoll fds are copied across the forks.

kib marked an inline comment as done.Wed, Oct 8, 5:30 AM
kib added inline comments.
sys/compat/linux/linux_event.c
107

Might be. For now I intend to keep existing behavior.

kib marked an inline comment as done.

Bug fixes after Peter' testing.

Keep source kq pointer in the copy kq until copying phase two is done.

I intend to commit this early next week.

lib/libsys/kqueue.2
105 ↗(On Diff #164025)
109 ↗(On Diff #164025)
111 ↗(On Diff #164025)

Maybe, "The kqueue descriptor will be inherited by the child, i.e., the numeric value of the descriptor will remain the same."?

115 ↗(On Diff #164025)
sys/kern/kern_event.c
2919

Strange indentation here, is it from debugging?

3030

This looks like some debugging.

kib marked 6 inline comments as done.Mon, Oct 13, 4:06 PM
kib added inline comments.
sys/kern/kern_event.c
2919

Indeed this is debugging, I forgot to drop the patch when generating the diff.

kib marked an inline comment as done.

Man page editing

lib/libsys/kqueue.2
109 ↗(On Diff #164102)

Maybe mention that this is contrary to how all other descriptors are treated upon fork(), i.e., we make shallow copies instead.

sys/kern/kern_descrip.c
2530

This comment should be removed or enhanced to explain what is actually happening.

2548

If you handle the !fork_pass case here instead, you can reduce indentation below.

2550

Maybe make a local variable for ofde->fde_file->f_ops so that the lines aren't so long.

2562

Why do we handle DFLAG_FORK in a separate pass? Why not call fo_fork in the first loop?

sys/kern/kern_event.c
873

Inheriting the cpuid like this doesn't really make sense. Ideally the callout will fire on the same CPU where it was scheduled from. Maybe filt_timerstart() should also reset kc->cpuid = PCPU_GET(cpuid).

2995

It is not too unusual for a queue to have several thousand knotes. I suspect this will be slow indeed.

kib marked 7 inline comments as done.Tue, Oct 14, 6:01 PM
kib added inline comments.
sys/kern/kern_descrip.c
2548

Each FILEDESC_FOREACH_FDE() needs its own block scope unfortunately.

2562

This is the commit message for DFLAG_FORK change:

struct file: add DFLAG_FORK, indicate copying on fork

instead of referencing or ignoring.

The pass to handle copying is performed after all passable files are
installed into the new file descriptor's table, since one of the future
consumers of this interface (kqueue) needs all passable file descriptors
already valid. For the same reason the copying is done in two passes
by itself, first allowing copyable files to be copied, and then giving
the subsystems a second chance to really fill the copied file backing
structure with the content.

[MFC note: use fo_spares for fo_fork]

(and note that kqueue might need the fd number to reference other kqueue)

sys/kern/kern_event.c
2995

It is worse at more significant place, knlist_remove_kq().

kib marked 3 inline comments as done.

Changes due to the latest comments.

Wrap the first stage of the fork pass into the original pass over source fd table.

Some editing to the comment.

The code changes seem ok to me. This should really have some regression tests. Even some simple ones which exercise the f_copy routine for various filters, and verify that knotes for O_CLOFORK fds are handled properly, would be much better than nothing.

lib/libsys/kqueue.2
120 ↗(On Diff #164200)

You might mention somewhere that knotes corresponding to O_CLOFORK fds will not be inherited.

sys/kern/kern_descrip.c
2571
sys/kern/kern_event.c
2995

knlist_remove_kq() is just removing a single knote though.

kib marked 2 inline comments as done.Wed, Oct 15, 10:42 PM

This should really have some regression tests. Even some simple ones which exercise the f_copy routine for various filters, and verify that knotes for O_CLOFORK fds are handled properly, would be much better than nothing.

There is a program I posted which I use for testing the change. I believe Peter took it into stress2 already, staged.

I will convert the program to atf someday, but I do not want this to delay the commit. I promise the conversion.

sys/kern/kern_event.c
2995

But all knotes are removed eventually, so the same amount of work is spread over the time.

Man page update: explain that knotes for closed fds are not copied.

kib marked an inline comment as done.Wed, Oct 15, 10:43 PM
markj added inline comments.
lib/libsys/kqueue.2
105 ↗(On Diff #164272)
sys/kern/kern_event.c
2947

If someone adds a filter which requires some specialized code in order to copy a knote, and forgets to implement f_copy, there will be some hard-to-debug memory corruption most likely. That is, this is a fail-open design.

I wonder if we can do better? Maybe skip the knote if f_copy is note defined, and have a no-op f_copy implementation that filters can use.

2995

It depends on the order of the removals. Removing the first element of the list until the list is empty is an O(n) operation. Removing the last element of the list until the list is empty is an O(n^2) operation. Traversal here is an O(n^2) operation always. I suspect it will eventually cause pain for someone.

This revision is now accepted and ready to land.Thu, Oct 16, 1:51 PM
kib marked 3 inline comments as done.Fri, Oct 17, 12:09 AM
kib added inline comments.
sys/kern/kern_event.c
2995

Basically you argue for the same thing that I wanted to do, change this to list. But lets do it separately as a follow-up.

kib marked an inline comment as done.

knote_triv_copy

This revision now requires review to proceed.Fri, Oct 17, 12:10 AM
This revision is now accepted and ready to land.Fri, Oct 17, 12:48 PM