In some cases (e.g., the nginx web server), a large number of processes using aio at the same time can put a lot of pressure on the aio job mutex. Work around this in two ways:
- Take pressure off the aio job mutex by managing num_aio_procs, refid, and seqno with atomics (see the sketch after this list). The existing code is a bit sloppy in managing num_aio_procs (the count is checked in the parent before forking the child, then incremented in the child), and I make no attempt to improve that. Because of this, we can end up with at most one extra process per aio softc (of which there is one by default), which seems faithful to the current behavior.
- Allow creation of multiple aio software contexts. E.g., rather than a single aio job list and related state, shard them by process ID across a larger number of aio contexts.
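
As a minimal userspace sketch of the first point: the counter names (refid, seqno, num_aio_procs) mirror the kernel counters, but everything else here is illustrative, using C11 atomics rather than the kernel's atomic(9)/mtx(9) primitives.

```c
/*
 * Illustrative sketch only: the shape of replacing mutex-protected
 * counter updates with atomic operations.
 */
#include <stdatomic.h>
#include <stdio.h>

static atomic_uint refid;          /* hypothetical: per-job reference id */
static atomic_uint seqno;          /* hypothetical: job sequence number */
static atomic_int  num_aio_procs;  /* hypothetical: count of aio daemons */
static const int   max_aio_procs = 32;

/* Before: lock the job mutex, bump the counter, unlock.  After: one atomic op. */
static unsigned
next_refid(void)
{
	return (atomic_fetch_add(&refid, 1) + 1);
}

static unsigned
next_seqno(void)
{
	return (atomic_fetch_add(&seqno, 1) + 1);
}

/*
 * The check-then-increment is split across parent and child, as described
 * above: the parent checks the limit before forking, the child increments
 * after, so the count can briefly overshoot by one per context.
 */
static int
aio_can_spawn_proc(void)
{
	return (atomic_load(&num_aio_procs) < max_aio_procs);
}

static void
aio_proc_started(void)
{
	atomic_fetch_add(&num_aio_procs, 1);
}

int
main(void)
{
	if (aio_can_spawn_proc())
		aio_proc_started();
	printf("refid %u seqno %u procs %d\n",
	    next_refid(), next_seqno(), atomic_load(&num_aio_procs));
	return (0);
}
```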
Note that since we hash aio state to aio contexts by pid, multithreaded servers making heavy use of aio will still see the same contention. Hashing by pid was chosen because the process (not thread) structure keeps track of aio state, so it was straightforward and required no other modifications. And it "disappears" when the code uses a single aio context (which is the current default).
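
A rough userspace sketch of the pid-based sharding follows; the structure layout and names are hypothetical, not the actual softc. The pid selects one of N contexts, and with N == 1 every lookup collapses back to a single context, matching the current default.

```c
/*
 * Illustrative sketch: shard aio state by pid across a small array of
 * contexts, each with its own job mutex.
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_AIO_CTX	4	/* 1 reproduces the single-context default */

struct aio_ctx {
	pthread_mutex_t	job_mtx;	/* per-context job mutex */
	int		num_jobs;	/* stand-in for the per-context job list */
};

static struct aio_ctx aio_ctxs[NUM_AIO_CTX];

/*
 * Hash by pid: all threads of one process share a context, so the aio
 * state kept in the proc structure stays within a single shard.
 */
static struct aio_ctx *
aio_ctx_for(pid_t pid)
{
	return (&aio_ctxs[(unsigned)pid % NUM_AIO_CTX]);
}

int
main(void)
{
	struct aio_ctx *ctx;

	for (int i = 0; i < NUM_AIO_CTX; i++)
		pthread_mutex_init(&aio_ctxs[i].job_mtx, NULL);

	ctx = aio_ctx_for(getpid());
	pthread_mutex_lock(&ctx->job_mtx);
	ctx->num_jobs++;		/* enqueue a job under this shard's lock */
	pthread_mutex_unlock(&ctx->job_mtx);

	printf("pid %d -> context %ld (%d jobs)\n", (int)getpid(),
	    (long)(ctx - aio_ctxs), ctx->num_jobs);
	return (0);
}
```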
I've set the default number of contexts to 1, so this patch is mostly a noop. I was debating setting the default to 0, which autotunes based on the number of cores and provides the best scalability I was able to find for my workload. I'm interested in feedback on this, as the space tradeoff is fairly minimal, since the aio softc is tiny.
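
For the autotune idea, a hedged sketch of what "0 means derive from CPU count" could look like; the sizing policy (rounding up to a power of two so the pid hash can be a cheap mask) and the names below are assumptions for this example, not necessarily what the patch does.

```c
/*
 * Illustrative sketch of autotuning the number of aio contexts:
 * a tunable of 0 sizes the context array from the CPU count.
 */
#include <stdio.h>
#include <unistd.h>

static unsigned
round_up_pow2(unsigned v)
{
	unsigned p = 1;

	while (p < v)
		p <<= 1;
	return (p);
}

static unsigned
aio_num_contexts(unsigned tunable)
{
	long ncpu;

	if (tunable != 0)
		return (tunable);	/* explicit setting wins */
	ncpu = sysconf(_SC_NPROCESSORS_ONLN);
	if (ncpu < 1)
		ncpu = 1;
	return (round_up_pow2((unsigned)ncpu));
}

int
main(void)
{
	printf("contexts (tunable=0): %u\n", aio_num_contexts(0));
	printf("contexts (tunable=1): %u\n", aio_num_contexts(1));
	return (0);
}
```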
A trivial benchmark using tools/regression/aio/aiop on a 32-core / 64-thread AMD server shows a roughly 6x speedup. The exact benchmark was 8 copies of "aiop $file 4096 1000000 255 ro".