I would like a kqueue interface for an asynchronous, multiplexable version of _umtx_op(2)'s simple wait/wake operations (AKA futexes). Unlike EVFILT_USER, which requires all participating threads to have access to the same kqueue, the proposed EVFILT_USERMEM requires them only to have the same word of memory mapped (even if in different processes or at a different address). Unlike _umtx_op(), the proposed EVFILT_USERMEM can wait on multiple memory addresses at the same time, along with other kinds of kevents.
The minimal implementation would have only a way of waiting, and you could use _umtx_op() to wake it. This patch does interoperate with _umtx_op(), but if this idea goes somewhere I would also like there to be a plausible pathway for {Open/Net/DragonFly/Darwin}BSD to adopt the new filter, if they want to, and for client code to be portable without changes. Unfortunately every one of those OSs has invented its own incompatible umtx/futex API, and _umtx_op() isn't really a fully public API. So I would also like to make it possible to wake using kevent() itself. NOTE_TRIGGER is the obvious way to spell that, and that would fit quite well beside EVFILT_USER, which also does both waiting and waking through kevent() calls.
When waking an EVFILT_USERMEM, the kq fd is irrelevant, since the address (ident) is sufficient on its own, and I would not like it to allocate or touch knotes just to wake. Therefore, I added a new flag EV_ANON that skips all that and invokes a new filter operation, f_anon. It also seems to follow that you should optionally be able to call kevent() without having to create a dummy kqueue. The existing 'anonymous' mechanism, not currently in use (it was previously used by CloudABI), seems to fit the bill, so here I have exposed it to users as kevent(KQUEUE_FD_ANON, ...). KQUEUE_FD_ANON can also be used to wait, but a long-lived kqueue is used in the two existing example users described below in order to multiplex with other kernel events efficiently.
This set of choices makes it possible to use kevent() for synchronous or asynchronous wait for one or more memory words, and any other kinds of filter. More advanced umtx operations could be considered later, for example mask-based selective wakeups and other exotic stuff.
Example applications
- Efficient job token management for make(1) (bmake).
bmake has -j to restrict the number of jobs that are allowed to run, across a whole tree of recursive make processes. This works by providing an inherited 'job token pipe' to all submakes. If you stare at the protocol for long enough you eventually realise that it is conceptually a semaphore, except that you can go away and do something else while waiting for a job token to be available. When a message arrives on the pipe, all submakes fight over the right to read it and be granted a new token, effectively decrementing the conceptual semaphore, and when finished they write into the pipe to increment it, triggering a new fight to read it. In the meantime, they also consume results from jobs that are already running. Currently all of this is multiplexed with poll(2).
We can switch bmake to use kevent(2) instead of poll(2), which already makes it marginally more efficient at handling its set of pipes without any other changes. Then we can replace the job token pipe with an integer in a tiny piece of shared memory that behaves as a classical semaphore. When bmake wants a job token to start a new job, it tries to decrement the counter using compare-exchange, but if it's zero, it uses EVFILT_USERMEM to request an event when it goes non-zero so it can try again. When a submake finishes a job, it increments the counter and wakes just one waiter with NOTE_TRIGGER, avoiding thundering-herd contention.
Here are some example benchmark numbers from mjg@ using our experimental modified version of bmake, saving ~468s of system time:
chroot /mnt/tmp make -s -j 104 -C /usr/src tinderbox WITHOUT_LLVM_ASSERTIONS=yes \
    WITH_MALLOC_PRODUCTION=yes WITHOUT_CLEAN=1 WITHOUT_CTF=1 MAKE_JUST_KERNELS=1

before:  964.45s user 801.41s system 9338% cpu 18.908 total
after:  1072.53s user 333.03s system 9119% cpu 15.412 total

lock profile:

before: 473635699 (sleep mutex:pipe mutex)
         89799780 (rw:pmap pv list)
          5010373 (sleep mutex:vnode interlock)
          1984126 (sleep mutex:select mtxpool)
after:   48424743 (rw:pmap pv list)
           270533 (sleep mutex:vm active pagequeue)
           175377 (sleep mutex:vnode_list)
           125479 (sx:vm map (user))
See experimental proof-of-concept code for that at https://github.com/macdice/freebsd/tree/kqueue-usermem-bmake .
- Implementation of PostgreSQL 'latches'.
PostgreSQL has an internal 'latch' concept that is effectively a one-bit futex used to wake other processes, but which can also be multiplexed with various other socket and pipe I/O (and, in future, async disk I/O). It is implemented with a shared memory flag, plus various operating system facilities for multiplexed wakeups: EVFILT_SIGNAL (BSDs/macOS), signalfd (Linux), a signal handler that writes to a self-pipe (other Unixes), or Win32 events (Windows). EVFILT_USERMEM is a close match for the 'latch' concept, and has three advantages over the current EVFILT_SIGNAL approach: (1) it could be Capsicumised, whereas direct use of kill() can't be; (2) a single system call can wake a set of N waiters, which is good for efficiency in some cases; and (3) it removes the slight possibility of sending a signal to a pid that has been recycled and is no longer the intended recipient.
Other solutions to these problems would be possible, including a network of interprocess pipes, or process descriptors for every other process, or using threads instead of processes and then EVFILT_USER. EVFILT_USERMEM more directly models the concept without the need for N^2 descriptors, and also allows for batching of wakeups as a micro-optimisation.
See experimental proof-of-concept code for that at https://github.com/macdice/postgres/tree/kqueue-usermem .
- When porting applications from Linux, or perhaps emulating Linux, I speculate that its new futex_waitv() (= wait for any of N futexes) could, at least in some cases, be replaced with kevent(KQUEUE_FD_ANON, ...). An example user is Wine, which I have not investigated myself, but which I am told can use futex_waitv() to emulate Windows' WaitForMultipleObjects() efficiently (the central event loop in many Windows applications/games). Other uses of futex_waitv() might eventually show up in the wild.