Page MenuHomeFreeBSD

Reap AIO completions asynchronously when using kqueue.
Needs ReviewPublic

Authored by tmunro on Nov 28 2021, 3:10 AM.
Tags
None
Referenced Files
F93166858: D33144.id.diff
Sat, Sep 7, 7:51 PM
Unknown Object (File)
Sat, Aug 31, 10:16 AM
Unknown Object (File)
Sat, Aug 31, 5:56 AM
Unknown Object (File)
Thu, Aug 22, 11:59 PM
Unknown Object (File)
Thu, Aug 22, 2:18 AM
Unknown Object (File)
Wed, Aug 21, 8:21 PM
Unknown Object (File)
Mon, Aug 19, 11:23 PM
Unknown Object (File)
Fri, Aug 16, 11:33 PM

Details

Reviewers
asomers
Group Reviewers
transport
Summary

This is work in progress.

I would like to be able to run a lot of I/Os with a small number of system calls. lio_listio() allows for batch submission, but we lack a corresponding way to reap completion events in batches.

Currently, if you use SIGEV_KEVENT as a notification mechanism for asynchronous I/O, you still need to call either aio_return() or aio_waitcomplete() to reap each individual completion. By "reap", I mean collecting the result of the operation and releasing kernel resources (kaiocb). You can therefore never get below one syscall per I/O, and for many applications the syscall count may be somewhere around 4 due to extra calls to aio_error().

Examples of APIs in other operating systems that reap N completion events using 0 or 1 system calls: Solaris aio_waitn(), AIX aio_nwait(), HPUX aio_reap(), Windows GetQueuedCompletionStatusEx(), Linux io_uring_wait_cqe_nr().

An obvious function to add to FreeBSD would be aio_waitcompleten(), along the same lines as the above, possibly with a user space queue in front. I tried that and it worked, but I didn't love the implicit process-wide queue, or the inability to multiplex with other kinds of kernel events, among other problems. Hence the present prototype, which provides an "asynchronous reap" mode when using SIGEV_KEVENT.

Changes:

  1. A new flag AIO_KEVENT_FLAG_REAP is defined in <aio.h>. You can test for presence of the feature with #ifdef, and set it in sigev_notify_kevent_flags to enable the new mode.
  1. When this flag is set, kernel resources will be released asynchronously when the I/O completes, and the result will be stored in the kevent's data field (and also in the user space aiocb object, see below). For success, kev->data holds the value aio_return() would return (number of bytes transferred for reads and writes, 0 for fsync). For failure, it's the value aio_error() would return, and EV_ERROR is set.
  1. When this mode is not requested, it's as before: you have to call aio_error() and aio_return(), or aio_waitcomplete(). That's because we don't know which of those interfaces you're going to use, and aio_waitcomplete() relies on an in-kernel queue to work correctly. Therefore, we can't release the kaiocb asynchronously unless you opt in to this behaviour explicitly, or we'd break existing applications.
  1. Whether or not you request this new mode, and even if you don't use SIGEV_KEVENT, more work is done asynchronously than before:

    4.1. The rusage counters are updated because of change D33271, which this patch depends on (now split off for separate review as requested). That change is a minor improvement in its own right, but also makes it possible to free kaiocb objects early because it means we have the submitting thread (as required for fdrop()) and we know it won't disappear from under us.

    4.2. The final error and return statuses are written to user space asynchronously. Previously those values were written synchronously in aio_return()/aio_waitcomplete() and not used for much. This change is necessary because of the way lio_listio()'s interface is defined by POSIX; if it reports failure, the client is required to check the error status of all submitted aiocb objects, so we need a way for aio_error() to return scrutable results for I/Os that were (a) not submitted due to resource limits, (b) submitted and are still in progress, (c) submitted and are already completed and reaped. Therefore the result must be obtainable from the user space object, since aiocb objects in categories (a) and (c) are unknown to the kernel.

Assorted thoughts/observations/problems:

  1. AIO_KEVENT_FLAG_REAP is currently defined as EV_FLAG2, which I don't love for the superficial reason that truss shows it as EV_ERROR (these bits are all overloaded). I clear it immediately because EV_ERROR means something in output kevents. EV_FLAG1 is already used for special magic.
  1. EV_ERROR on output has previously been used only to signal failure to process kevent changes. This introduces a new kind of use for it, to indicate whether data contains an error or a result. This seemed better to me than using a negative value for errors (Linuxism) or making use of the ugly undocumented "ext" members.
  1. aio_error() now returns the value from the user space object (even though it enters the kernel). This behaviour was already present, but used for a special case "hack for failure of aio_aqueue". With this patch, that is now the main codepath, since the value is always stored asynchronously in user space. That means that libc could just read the value directly (syscall is totally redundant), but I haven't done that in this patch, as my goal is not to call it at all, syscall or not (except in lio_listio() failure handling, rare, I don't care how slow it is). This change does result in a user visible change: after you call aio_return()/aio_waitcomplete(), you can still call aio_error() repeatedly and get the last value, instead of EINVAL. This is permitted by POSIX ("If the aiocb structure pointed to by aiocbp is not associated with an operation that has been scheduled, the results are undefined."), but is a change from FreeBSD's traditional behaviour.
  1. Requiring all updates of td_ru.ru_{oublock,inblock,msgsend,msgrcv} to go through new atomic macros may not be acceptable; is there out-of-tree module code that needs to update these counters? On the other hand, the current scheme of assuming curthread is the only writer seems incompatible with the very concept of asynchronous I/O, leading to the strange choice of falsely blaming I/O on the thread that calls aio_return(), not the thread that initiated the I/O. Something has to give here!
  1. Compatibility considerations: NetBSD and Darwin don't support SIGEV_KEVENT, despite having both AIO and kqueue; Dragonfly and OpenBSD have kqueue but not AIO. So there are no compatibility problems to worry about or preexisting established API conventions along these lines.
  1. Future direction: I see this as a stepping stone to being able to put oneshot kevents into a user space ring buffer that sits in front of a kqueue, to avoid entering the kernel sometimes, though I have no code for that.

Better ideas for all of the above or anything else are very welcome.

Test Plan

I added some simple Kyua tests to exercise the code.

I also have an experimental fork of PostgreSQL that uses AIO and consumes completions with this mechanism, at https://github.com/macdice/postgres/tree/aio-kqueue, which has motivated my work on this; this should be compiled with --with-posix-aio and run with io_method=posix_aio, and then big I/O workloads can be run through it to exercise it.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

This is a very awesome feature! A few questions:

  • aio_error now returns the value from userspace, but still enters the kernel. Could it be changed to not enter the kernel (create a new symbol version for it, and the old aio_error@fbsd1.6 will do the legacy thing and call an aio_error_freebsd13 syscall). Ditto for aio_return. If so, then existing code would require fewer modifications to take advantage of this feature. It would be even easier if aio_error has the ability to check for the AIO_KEVENT_FLAG_REAP flag. If so, no new symbol version would be required.
  • In fact, if aio_error and aio_return are backwards compatible in that way, could we also create new versions of aio_write, etc that set AIO_KEVENT_FLAG_REAP automatically? That way existing programs could reap the benefits of your work with no code changes, just a recompile.
  • The rusage thread bug seems to be separate from the new feature. How hard would it be to split up this revision into two: one to fix rusage and one to add the new feature?
  • What about a test case for 32-bit compat with AIO_KEVENT_FLAG_REAP?
share/man/man4/aio.4
30

Don't forget to bump the date, too.

sys/fs/nfsclient/nfs_clvnops.c
3509 ↗(On Diff #99091)

Curious that the NFS client never increments ru_inblock. It probably should. I just wrote to rmacklem about that. Heads up: if he changes it, that could create a subtle merge conflict for you.

sys/kern/vfs_aio.c
584

What is the point of taking the lock just for a pointer copy? It looks like that pointer can get modified without the proc lock, as seen at the end of aio_proc_rundown .

2687

I'm confused. Could we avoid overloading flag values by using one of kevent's ext fields? From a brief inspection, it looks like ext[0] is the only ext field that's used anywhere, and it's only used by the linuxulator.

sys/sys/aio.h
116

s/sigev_kevent_flags/sigev_notify_kevent_flags/

tests/sys/aio/aio_test.c
2013

Don't forget to replace this flag with AIO_KEVENT_FLAG_REAP here and on line 2035.

Split into two, moved the rusage stuff to D33271.

In fact, if aio_error and aio_return are backwards compatible in that way, could we also create new versions of aio_write, etc that set AIO_KEVENT_FLAG_REAP automatically? That way existing programs could reap the benefits of your work with no code changes, just a recompile.

Turning this on by default for all users of SIGEV_KEVENT is bolder than what I was thinking. It would break hypothetical applications that use SIGEV_KEVENT but still expect aio_waitcomplete() to work. That is admittedly a fairly strange combination of APIs to use, so maybe no application does it, and perhaps you're right and we should just turn it on and document that that combination doesn't work. I will try some things out, and report back, hopefully next weekend.

The rusage thread bug seems to be separate from the new feature. How hard would it be to split up this revision into two: one to fix rusage and one to add the new feature?

Good idea. Done.

What about a test case for 32-bit compat with AIO_KEVENT_FLAG_REAP?

Is there some more general way to run the whole test suite run under 32-bit compat?

share/man/man4/aio.4
30

I'll update this with each version to remember...

sys/kern/vfs_aio.c
2687

I think it depends if we want to enable this automatically as you were suggesting, in which case maybe an ext field makes sense (ie just for libc and the kernel to communicate). For a public API for explicit opt-in as I originally imagined, I figured that would be too weird and iinternal (?)

Is there some more general way to run the whole test suite run under 32-bit compat?

No, but there really ought to be. The problem right now, IIRC, is that many of the tests fail because their APIs aren't useful in 32-bit compat mode, like the libgeom ones. I'll take this up on freebsd-tests and see if we can get it more permanently fixed.

tmunro marked an inline comment as not done.

More context.

You cannot rely on headers to detect the presence of the kernel feature. It is not always true that programs are run on the same system where they were compiled.

sys/kern/vfs_aio.c
1732

The new flag clearly should belong to EV_ namespace.

sys/sys/aio.h
120

There should be an assert somewhere that EV_FLAG2 is equal to this REAP flag.

In D33144#752233, @kib wrote:

You cannot rely on headers to detect the presence of the kernel feature. It is not always true that programs are run on the same system where they were compiled.

What do you mean, @kib? Don't we always guarantee that the kernel will be at least as recent as the userland?

In D33144#752233, @kib wrote:

You cannot rely on headers to detect the presence of the kernel feature. It is not always true that programs are run on the same system where they were compiled.

What do you mean, @kib? Don't we always guarantee that the kernel will be at least as recent as the userland?

I am not aware of any method to guarantee that. More, it is quite common for e.g. official poudriere builders to run newer userspace and older kernel.
The usual mantra is that such configuration is unsupported and if it breaks it is responsibility of the system owner to fix, but such things occur regularly. Just for another example, after ino64, there were a shims added to HEAD libc to fall back to pre-ino64 syscalls if kernel is old.

In D33144#752431, @kib wrote:
In D33144#752233, @kib wrote:

You cannot rely on headers to detect the presence of the kernel feature. It is not always true that programs are run on the same system where they were compiled.

What do you mean, @kib? Don't we always guarantee that the kernel will be at least as recent as the userland?

I am not aware of any method to guarantee that. More, it is quite common for e.g. official poudriere builders to run newer userspace and older kernel.
The usual mantra is that such configuration is unsupported and if it breaks it is responsibility of the system owner to fix, but such things occur regularly. Just for another example, after ino64, there were a shims added to HEAD libc to fall back to pre-ino64 syscalls if kernel is old.

The best use case is I upgraded the kernel, I upgraded userland and then I'm forced to downgrade the kernel to build one that doesn't have some horrible bug that I encountered after making the decision to update the userland. I added the shims in part due to my experience upgrading, as well as sometimes having a situation where we had newer binaries doing innocent things on an older kernel for reasons too long to enumerate here (I've tried to eliminate this as much as possible, but we still have one scenario that we need to do this to automatically recover, though that may be going away as we move to ZFS snapshots)

Another involves jails, which is another common source of version skew that often works, but sometimes doesn't. There's a number of other scenarios that require userland to be able to cope with recent but not totally up-to-date kernels in a way that's failsafe.

kib's right: header presence is a really bad way to detect these features. At the very least, we need to fail safe and sanely when there's a mismatch. The guarantee isn't 100%, especially if it intrudes into the upgrade path. It never has been. We don't do complete forward compatibility by any means, but neither do we never do it.

I had some thoughts about this during Thomas' talk today during BSDCan. The resource accounting stuff is already in another review, but some other thoughts I had were:

  1. Make aio_return() and aio_error() check the userspace structure and return status immediately from that unless the error value is EINPROGRESS. This can then permit having completions not require as many syscalls while still working with the same API. The kernel can also be selective about when it does implicit frees by deciding when to push the status out to userland vs leaving it as EINPROGRESS.
  1. I would make aio_suspend() store the completed status for all completed requests. In-progress requests would still need aio_error to call into the kernel, but completed requests would not.
  1. When a request using SIGEV_KEVENT completes, have it also store the completed state immediately and free the kernel-side aiocb. This would permit the call to aio_return/error in userland to not require system calls.

However, as Thomas noted, this breaks using aio_waitcomplete to wait for a request submitted with SIGEV_KEVENT. I think I'm ok with breaking that use case as it's obscure and these are all non-standard FreeBSD extensions already. We could though provide a sysctl that disabled the asynchronous free's and always left the status as EINPROGRESS.

Another variant of this would be to add (or reuse) an internal error code that means "go ask the kernel for status": something like ECOMPLETED. In that case, the kernel when completing a request would either write the real status or ECOMPLETED. aio_error/aio_return would only call into the kernel if the job's userspace status was ECOMPLETED. The advantage of this is that it would mean that in the aio_suspend() case you would no longer need any syscalls to scan the array as EINPROGRESS status for in-flight requests would not require a syscall.

Here's another backwards-compatibility nit. In kevent, AIO events are level-triggered by default. That is, if kevent reports that an event is ready, but the caller for some reason calls kevent again without calling aio_complete, then kevent will immediately return ready again. But with AIO_KEVENT_FLAG_REAP, it won't be able to, because the kernel will no longer be aware of that event. I don't know if we need to care about users who depend on that level-triggering, but if we do an easy fix would be to only enable the new behavior if EV_ONESHOT is also set. It probably doesn't matter if we're requiring the user to explicitly set AIO_KEVENT_FLAG_REAP anyway. But it could matter if we use the new behavior by default.

Here's another backwards-compatibility nit. In kevent, AIO events are level-triggered by default. That is, if kevent reports that an event is ready, but the caller for some reason calls kevent again without calling aio_complete, then kevent will immediately return ready again. But with AIO_KEVENT_FLAG_REAP, it won't be able to, because the kernel will no longer be aware of that event. I don't know if we need to care about users who depend on that level-triggering, but if we do an easy fix would be to only enable the new behavior if EV_ONESHOT is also set. It probably doesn't matter if we're requiring the user to explicitly set AIO_KEVENT_FLAG_REAP anyway. But it could matter if we use the new behavior by default.

Hmm, I'm not sure we care about that use case, but it is something to consider (and to document if we make this change). Note that in my proposal we wouldn't need the AIO_KEVENT_FLAG_REAP flag.