I would like a kqueue interface for an asynchronous, multiplexable version of _umtx_op(2)'s simple wait/wake operations (AKA futexes). Unlike EVFILT_USER, which requires all participating threads to have access to the same kqueue, the proposed EVFILT_USERMEM requires them only to have the same word of memory mapped (even if in different processes or at a different address). Unlike _umtx_op(), the proposed EVFILT_USERMEM can wait on multiple memory addresses at the same time, along with other kinds of kevents.
The minimal implementation would have only a way of waiting, and you could use _umtx_op() to wake it. This patch does interoperate with _umtx_op(), but if this idea goes somewhere I would also like there to be a plausible pathway for {Open/Net/DragonFly/Darwin}BSD to adopt the new filter, if they want to, and for client code to be portable without changes. Unfortunately every one of those OSs has invented its own incompatible umtx/futex API, and _umtx_op() isn't really a fully public API. So I would also like to make it possible to wake using kevent() itself. NOTE_TRIGGER is the obvious way to spell that, and that would fit quite well beside EVFILT_USER, which also does both waiting and waking through kevent() calls.
When waking an EVFILT_USERMEM, the kq fd is irrelevant, since the address (ident) is sufficient on its own, and I would not like it to allocate or touch knotes just to wake. Therefore, I added a new flag EV_ANON that skips all that and invokes a new filter operation, f_anon. It also seems to follow that you should optionally be able to call kevent() without having to create a dummy kqueue. The existing 'anonymous' mechanism, not currently in use (it was previously used by CloudABI), seems to fit the bill, so here I have exposed it to users as kevent(KQUEUE_FD_ANON, ...). KQUEUE_FD_ANON can also be used to wait, but a long-lived kqueue is used in the two existing example users described below in order to multiplex with other kernel events efficiently.
This set of choices makes it possible to use kevent() for synchronous or asynchronous wait for one or more memory words, and any other kinds of filter. More advanced umtx operations could be considered later, for example mask-based selective wakeups and other exotic stuff.
Example applications
- Efficient job token management for make(1) (bmake).
bmake has -j to restrict the number of jobs that are allowed to run, across a whole tree of recursive make processes. This works by providing an inherited 'job token pipe' to all submakes. If you stare at the protocol for long enough you eventually realise that it is conceptually a semaphore, except that you can go away and do something else while waiting for a job token to be available. When a message arrives on the pipe, all submakes fight over the right to read it and be granted a new token, effectively decrementing the conceptual semaphore, and when finished they write into the pipe to increment it, triggering a new fight to read it. In the meantime, they also consume results from jobs that are already running. Currently all of this is multiplexed with poll(2).
We can switch bmake to use kevent(2) instead of poll(2), which already makes it marginally more efficient at handling its set of pipes without any other changes. Then we can replace the job token pipe with an integer in a tiny piece of shared memory that behaves as a classical semaphore. When bmake wants a job token to start a new job, it tries to decrement the counter using compare-exchange, but if it's zero, it uses EVFILT_USERMEM to request an event when it goes non-zero so it can try again. When a submake finishes a job, it increments the counter and wakes just one waiter with NOTE_TRIGGER, avoiding thundering-herd contention.
Here are some example benchmark numbers from mjg@ using our experimental modified version of bmake, saving ~468s of system time:
chroot /mnt/tmp make -s -j 104 -C /usr/src tinderbox WITHOUT_LLVM_ASSERTIONS=yes \
    WITH_MALLOC_PRODUCTION=yes WITHOUT_CLEAN=1 WITHOUT_CTF=1 MAKE_JUST_KERNELS=1

before:  964.45s user 801.41s system 9338% cpu 18.908 total
after:  1072.53s user 333.03s system 9119% cpu 15.412 total

lock profile:

before: 473635699 (sleep mutex:pipe mutex)
         89799780 (rw:pmap pv list)
          5010373 (sleep mutex:vnode interlock)
          1984126 (sleep mutex:select mtxpool)
after:   48424743 (rw:pmap pv list)
           270533 (sleep mutex:vm active pagequeue)
           175377 (sleep mutex:vnode_list)
           125479 (sx:vm map (user))
See experimental proof-of-concept code for that at https://github.com/macdice/freebsd/tree/kqueue-usermem-bmake .
- Implementation of PostgreSQL 'latches'.
PostgreSQL has an internal 'latch' concept that is effectively a one-bit futex used to wake other processes, but which can also be multiplexed with various other socket and pipe I/O (and, in future, async disk I/O). It is implemented with a shared memory flag, plus various operating system facilities for multiplexed wakeups: EVFILT_SIGNAL (BSDs/macOS), signalfd (Linux), a signal handler that writes to a self-pipe (other Unixes), or Win32 events (Windows). EVFILT_USERMEM is a close match for the 'latch' concept, and has three advantages over the current EVFILT_SIGNAL approach: (1) it could be Capsicumised, whereas direct use of kill() can't be; (2) a single system call can wake a set of N waiters, which is good for efficiency in some cases; and (3) it removes the slight possibility of sending a signal to a pid that has been recycled and is no longer the intended recipient.
Other solutions to these problems would be possible, including a network of interprocess pipes, or process descriptors for every other process, or using threads instead of processes and then EVFILT_USER. EVFILT_USERMEM more directly models the concept without the need for N^2 descriptors, and also allows for batching of wakeups as a micro-optimisation.
See experimental proof-of-concept code for that at https://github.com/macdice/postgres/tree/kqueue-usermem .
- When porting applications from Linux, or perhaps emulating Linux, I speculate that its new futex_waitv() (= wait for any of N futexes) could, at least in some cases, be replaced with kevent(KQUEUE_FD_ANON, ...). An example user is Wine, which I have not investigated myself, but which I am told can use futex_waitv() to emulate Windows' WaitForMultipleObjects() efficiently (the central event loop in many Windows applications/games). Other uses of futex_waitv() might eventually show up in the wild.