This is work in progress.
I would like to be able to run a lot of I/Os with a small number of system calls. lio_listio() allows for batch submission, but we lack a corresponding way to reap completion events in batches.
Currently, if you use SIGEV_KEVENT as a notification mechanism for asynchronous I/O, you still need to call either aio_return() or aio_waitcomplete() to reap each individual completion. By "reap", I mean collecting the result of the operation and releasing kernel resources (kaiocb). You can therefore never get below one syscall per I/O, and for many applications the syscall count may be somewhere around 4 due to extra calls to aio_error().
Examples of APIs in other operating systems that reap N completion events using 0 or 1 system calls: Solaris aio_waitn(), AIX aio_nwait(), HP-UX aio_reap(), Windows GetQueuedCompletionStatusEx(), Linux io_uring_wait_cqe_nr().
An obvious function to add to FreeBSD would be aio_waitcompleten(), along the same lines as the above, possibly with a user space queue in front. I tried that and it worked, but I didn't love the implicit process-wide queue, or the inability to multiplex with other kinds of kernel events, among other problems. Hence the present prototype, which provides an "asynchronous reap" mode when using SIGEV_KEVENT.
Changes:
- A new flag AIO_KEVENT_FLAG_REAP is defined in <aio.h>. You can test for presence of the feature with #ifdef, and set it in sigev_notify_kevent_flags to enable the new mode.
- When this flag is set, kernel resources will be released asynchronously when the I/O completes, and the result will be stored in the kevent's data field (and also in the user space aiocb object, see below). For success, kev->data holds the value aio_return() would return (number of bytes transferred for reads and writes, 0 for fsync). For failure, it's the value aio_error() would return, and EV_ERROR is set.
- When this mode is not requested, it's as before: you have to call aio_error() and aio_return(), or aio_waitcomplete(). That's because we don't know which of those interfaces you're going to use, and aio_waitcomplete() relies on an in-kernel queue to work correctly. Therefore, we can't release the kaiocb asynchronously unless you opt in to this behaviour explicitly, or we'd break existing applications.
- Whether or not you request this new mode, and even if you don't use SIGEV_KEVENT, more work is done asynchronously than before:
  - The rusage counters are updated because of change D33271, which this patch depends on (now split off for separate review as requested). That change is a minor improvement in its own right, but also makes it possible to free kaiocb objects early because it means we have the submitting thread (as required for fdrop()) and we know it won't disappear from under us.
  - The final error and return statuses are written to user space asynchronously. Previously those values were written synchronously in aio_return()/aio_waitcomplete() and not used for much. This change is necessary because of the way POSIX defines lio_listio()'s interface: if it reports failure, the client is required to check the error status of all submitted aiocb objects, so aio_error() must return scrutable results for I/Os that were (a) not submitted due to resource limits, (b) submitted and still in progress, or (c) submitted, completed and already reaped. The result must therefore be obtainable from the user space object, since aiocb objects in categories (a) and (c) are unknown to the kernel.
Assorted thoughts/observations/problems:
- AIO_KEVENT_FLAG_REAP is currently defined as EV_FLAG2, which I don't love for the superficial reason that truss shows it as EV_ERROR (these bits are all overloaded). I clear it immediately because EV_ERROR means something in output kevents. EV_FLAG1 is already used for special magic.
- EV_ERROR on output has previously been used only to signal failure to process kevent changes. This introduces a new kind of use for it, to indicate whether data contains an error or a result. This seemed better to me than using a negative value for errors (Linuxism) or making use of the ugly undocumented "ext" members.
- aio_error() now returns the value from the user space object (even though it still enters the kernel). This behaviour was already present, but only used for a special case ("hack for failure of aio_aqueue"). With this patch it becomes the main codepath, since the value is always stored asynchronously in user space. That means libc could just read the value directly (the syscall is totally redundant), but I haven't done that in this patch, as my goal is not to call it at all, syscall or not (except in lio_listio() failure handling, which is rare, so I don't care how slow it is). This does result in a user-visible behaviour change: after you call aio_return()/aio_waitcomplete(), you can still call aio_error() repeatedly and get the last value, instead of EINVAL. That is permitted by POSIX ("If the aiocb structure pointed to by aiocbp is not associated with an operation that has been scheduled, the results are undefined."), but is a change from FreeBSD's traditional behaviour.
- Requiring all updates of td_ru.ru_{oublock,inblock,msgsend,msgrcv} to go through new atomic macros may not be acceptable; is there out-of-tree module code that needs to update these counters? On the other hand, the current scheme of assuming curthread is the only writer seems incompatible with the very concept of asynchronous I/O, leading to the strange choice of falsely blaming I/O on the thread that calls aio_return(), not the thread that initiated the I/O. Something has to give here!
- Compatibility considerations: NetBSD and Darwin don't support SIGEV_KEVENT, despite having both AIO and kqueue; DragonFly and OpenBSD have kqueue but not AIO. So there are no preexisting API conventions along these lines, and no compatibility problems to worry about.
- Future direction: I see this as a stepping stone to being able to put oneshot kevents into a user space ring buffer that sits in front of a kqueue, to avoid entering the kernel sometimes, though I have no code for that.
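Putting the pieces above together, usage of the proposed mode might look like the following sketch. This is FreeBSD-specific, and AIO_KEVENT_FLAG_REAP exists only with this patch applied, hence the #ifdef feature test suggested earlier; the file being read and the event batch size are arbitrary choices of mine.

```c
/* Hypothetical usage sketch of the proposed asynchronous-reap mode.
 * Compiles only against a patched FreeBSD; guarded by #ifdef. */
#include <sys/types.h>
#include <sys/event.h>
#include <aio.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
#ifdef AIO_KEVENT_FLAG_REAP
	int kq = kqueue();
	int fd = open("/etc/motd", O_RDONLY);
	if (kq < 0 || fd < 0)
		err(1, "setup");

	static char buf[4096];
	struct aiocb cb;
	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = sizeof(buf);
	cb.aio_sigevent.sigev_notify = SIGEV_KEVENT;
	cb.aio_sigevent.sigev_notify_kqueue = kq;
	cb.aio_sigevent.sigev_notify_kevent_flags = AIO_KEVENT_FLAG_REAP;
	cb.aio_sigevent.sigev_value.sival_ptr = &cb;
	if (aio_read(&cb) != 0)
		err(1, "aio_read");

	/* One kevent() call can reap a whole batch of completions; no
	 * aio_error()/aio_return()/aio_waitcomplete() follow-up. */
	struct kevent evs[64];
	int n = kevent(kq, NULL, 0, evs, 64, NULL);
	for (int i = 0; i < n; i++) {
		struct aiocb *done = evs[i].udata;
		if (evs[i].flags & EV_ERROR)	/* data holds an errno */
			fprintf(stderr, "aiocb %p failed: %d\n",
			    (void *)done, (int)evs[i].data);
		else				/* data holds the result */
			printf("aiocb %p done: %jd bytes\n",
			    (void *)done, (intmax_t)evs[i].data);
	}
#else
	fprintf(stderr, "AIO_KEVENT_FLAG_REAP not available\n");
#endif
	return 0;
}
```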
Better ideas for all of the above or anything else are very welcome.