
aio: improve lock contention on the aio_job_mtx
Accepted · Public

Authored by gallatin on Nov 11 2024, 7:14 PM.

Details

Reviewers
kib
jhb
markj
imp
Summary

In some cases (e.g., the nginx web server), a large number of processes using aio at the same time can put a lot of pressure on the aio job mutex. Work around this in two ways:

  • Take pressure off the aio_job_mtx by managing num_aio_procs, refid, and seqno with atomics (see the sketch after this list). The existing code is a bit sloppy in managing num_aio_procs (the count is checked in the parent before forking the child, then incremented in the child), and I make no attempt to improve that. Because of this we can end up with at most one extra proc per aio softc (of which there is 1 by default), which seems faithful to the current behavior.
  • Allow creation of multiple aio software contexts. E.g., rather than keeping a single aio job list and related state, shard them by process id across a larger number of aio contexts.
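
A minimal sketch of the atomic handling from the first bullet, using the atomic(9) primitives; the variable names mirror vfs_aio.c, but the surrounding logic is simplified and illustrative only, not the patch itself:

```
#include <sys/types.h>
#include <sys/errno.h>
#include <machine/atomic.h>

/* Illustrative fragment; in the patch these are globals in vfs_aio.c. */
static u_int	num_aio_procs;	/* (*) number of active aio daemons */
static u_long	jobseqno;	/* (*) job sequence number */

static int
aio_reserve_proc_slot(u_int max_aio_procs)
{
	/* atomic_fetchadd_int() returns the value before the addition. */
	if (atomic_fetchadd_int(&num_aio_procs, 1) >= max_aio_procs) {
		/* Too many daemons already; undo the reservation. */
		atomic_subtract_int(&num_aio_procs, 1);
		return (EAGAIN);
	}
	return (0);
}

static u_long
aio_next_seqno(void)
{
	/* u_long is 64-bit on LP64 kernels, so wraparound is not a practical worry there. */
	return (atomic_fetchadd_long(&jobseqno, 1));
}
```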

Note that since we hash aio to aio contexts via pid, multithreaded servers making heavy use of aio will still see the same contention. The decision to hash via pid is due to the process (not thread) struct keeping track of aio state, so hashing by pid was fairly straightforward and required no other modifications. And it "disappears" when the code uses a single aio context (which is the current default).

I've set the default number of contexts to 1, so this patch is mostly a no-op. I was debating setting the default to "0", which autotunes based on the number of cores and provides the best scalability I was able to find for my workload. I'm interested in feedback on this, as the space tradeoff is fairly minimal, since the aio softc is tiny.

A trivial benchmark using tools/regression/aio/aiop on a 32-core / 64-thread AMD server shows roughly a 6x speedup. The exact benchmark was 8 copies of "aiop $file 4096 1000000 255 ro".

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sys/kern/vfs_aio.c
315

Extra blank line.

317

Can this var be static?

325

I think that this pid->aio_sc idea is incompatible with rfork().

430

Extra blank line

1678

The herald comment promised that the job* vars are wrapped on overflow. I think it matters at least on 32bit arches.

1950

This should be KASSERT, but usually we do not even bother asserting non-NULL, allowing the CPU hw to catch it.

gallatin marked 4 inline comments as done.

Address kib's feedback

sys/kern/vfs_aio.c
325

I think I see what you mean. You're saying p_aioinfo could be shared among parent / child, thus breaking pid hashing.
I think it would be safe to stash the softc pointer in p_aioinfo and use it rather than the pid after p_aioinfo is set up.
I assume it would be safe to access p->p_aioinfo without holding the proc lock, because aio_proc_rundown() currently removes and NULLs p_aioinfo without holding the proc lock.
I'm going to try something like this:

`

diff --git a/FreeBSD/sys/kern/vfs_aio.c b/FreeBSD/sys/kern/vfs_aio.c
index bfcc82970601f..e22b07284e735 100644
--- a/FreeBSD/sys/kern/vfs_aio.c
+++ b/FreeBSD/sys/kern/vfs_aio.c
@@ -282,6 +282,7 @@ struct kaioinfo {
        TAILQ_HEAD(,kaiocb) kaio_syncready;  /* (a) second q for aio_fsync */
        struct  task kaio_task;         /* (*) task to kick aio processes */
        struct  task kaio_sync_task;    /* (*) task to schedule fsync jobs */
+       struct  aio_softc *aio_sc;      /* (*) aio softc pointer*/
 };
 
 #define AIO_LOCK(ki)           mtx_lock(&(ki)->kaio_mtx)
@@ -321,6 +322,9 @@ aio_proc_to_softc(struct proc *p)
 {
        unsigned int idx;
 
+       if (p->p_aioinfo != NULL) {
+               return (p->p_aioinfo->aio_sc);
+       }
        idx = p->p_pid % num_aio_sc;
        return (&aio_scs[idx]);
 }
@@ -485,6 +489,7 @@ aio_init_aioinfo(struct proc *p)
        TAILQ_INIT(&ki->kaio_syncready);
        TASK_INIT(&ki->kaio_task, 0, aio_kick_helper, p);
        TASK_INIT(&ki->kaio_sync_task, 0, aio_schedule_fsync, ki);
+       ki->aio_sc = aio_proc_to_softc(p);
        PROC_LOCK(p);
        if (p->p_aioinfo == NULL) {
                p->p_aioinfo = ki;

`
1678

I'm not sure that I understand what you mean. Doesn't fetchadd wrap the same way a simple addition would?

1950

Yes, left over from initial debugging. Sorry.

sys/kern/vfs_aio.c
325

p_aioinfo is zeroed on fork. Perhaps it should be moved into the inherited section of struct proc then.

I mean that rfork, not fork, could cause problems, where the address space is shared between parent and child.

1678

I mean that the comment suggests the wraparound should skip 0 and wrap to 1. That is easy to implement with non-atomic arithmetic under the lock, but it requires a CAS loop to do lock-free.

If a zero jobrefid and jobseqno are fine, then my note is irrelevant.
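
For illustration, a lock-free increment that never yields zero needs something like the following fcmpset loop (a sketch only; the helper name is made up, and atomic_fcmpset_long(9) is the primitive assumed here):

```
#include <sys/types.h>
#include <machine/atomic.h>

/*
 * Sketch of a lock-free "increment, but never return 0" job id, as
 * described above.  A plain fetchadd cannot skip zero; an fcmpset loop can.
 */
static u_long
jobid_alloc(volatile u_long *idp)
{
	u_long old, new;

	old = *idp;
	do {
		new = old + 1;
		if (new == 0)
			new = 1;	/* wrap past 0 straight to 1 */
	} while (!atomic_fcmpset_long(idp, &old, new));
	return (new);
}
```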

sys/kern/vfs_aio.c
1678

Looking at the old code, it's just a simple increment, with no special handling of 0 that I can see. Unless I'm missing something, I don't think the old code did this either.

sys/kern/vfs_aio.c
325

Looking at the code, I'm not sure how this matters. For new aio requests, a null p_aioinfo results in a new one being allocated. For checking the status / completion of existing requests (aio_return, aio_error), it appears that they already don't work in a child, due to the null p_aioinfo.

So it seems like hashing by pid is well suited to aio, since the p_aioinfo state is per-process and children don't inherit their parent's aio requests.

Can you please explain what I'm missing?

sys/kern/vfs_aio.c
325

I mean e.g. rfork(RFPROC|RFMEM). My belief is that such a child thread must be able to do aio_return(2) and aio_cancel(2) on requests from any other thread sharing this address space.

Perhaps it is broken now.

1678

Yes.

I said, 'if this requires fixing, using atomics makes the fix harder'.

sys/kern/vfs_aio.c
325

I just tested aio with fork and rfork, and on today's system (unmodified -current from ~1 month ago), rfork(RFPROC|RFMEM) prevents aio from working. See the attached transcript with a toy test program and its output.

I honestly don't see how it could be any other way, given that p_aioinfo seems to be zeroed irrespective of the flags passed to do_fork(). {F102547320}
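
For reference, a hypothetical toy test along those lines (this is not the restricted attachment; the file name, buffer size, and the use of rfork_thread(3) to give the RFMEM child a usable stack are all my own choices) would look roughly like:

```
#include <sys/types.h>
#include <sys/wait.h>
#include <aio.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static char buf[4096];
static int fd;

/* The rfork(RFPROC|RFMEM) child tries to queue and reap an aio request. */
static int
child(void *arg __unused)
{
	struct aiocb cb;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = sizeof(buf);
	if (aio_read(&cb) != 0)
		err(1, "aio_read in rfork(RFPROC|RFMEM) child");
	while (aio_error(&cb) == EINPROGRESS)
		usleep(1000);
	printf("child aio_return: %zd\n", aio_return(&cb));
	_exit(0);
}

int
main(void)
{
	char *stack;
	pid_t pid;

	fd = open("/etc/motd", O_RDONLY);
	if (fd < 0)
		err(1, "open");

	/* rfork_thread(3) handles the shared-address-space stack setup. */
	stack = malloc(65536);
	if (stack == NULL)
		err(1, "malloc");
	pid = rfork_thread(RFPROC | RFMEM, stack + 65536, child, NULL);
	if (pid == -1)
		err(1, "rfork_thread");
	waitpid(pid, NULL, 0);
	return (0);
}
```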

kib added inline comments.
sys/kern/vfs_aio.c
325

The test program has restricted access.

As I noted in a previous message, I sort of expected it to be broken. I plan to look at fixing this after your commit lands.

This revision is now accepted and ready to land. Thu, Nov 14, 12:49 AM

I wonder if another way to think about the sharding might be to not assign a process to a shard, but hash on the (pid, fd) or some such to pick the shard to queue an individual job to. That might help even within the same process while still preserving the relative order of requests for a single file descriptor. The latter somewhat matters in that you can queue multiple aio_read() requests for a socket which complete in FIFO order, and for LIO_SYNC jobs that need to drain the jobs in front of them. This would also mean that management of the job pool would still kind of be global in that you need to ensure there's the minimum number of workers for each shard/context.

sys/kern/vfs_aio.c
498–499

How does this work with multiple contexts? When the first AIO job is queued, I believe it will create N aio_procs all tied to the context of the first user process. When subsequent user processes call this routine, num_aio_procs will be large enough, so no kprocs will be created for the new context. I feel like num_aio_procs actually needs to be per-context (and then you don't need to deal with atomics for it since all the modifications to it will be under the per-context lock anyway).

That said, MAX_AIO_PROCS and TARGET_AIO_PROCS are also rather silly on modern systems. Their initial values should be computed from mp_ncpus rather than the static constants of 32 and 4, respectively.

My guess is each context wants a target of 1 kproc and a max of 4 kprocs (or something like that). Combined with your mp_ncpu-scaled context count, this scales the number of kprocs with the number of contexts. I would also include the context number in the kproc name, and probably make the aiod_unr per-context as well, so you end up with kproc names of aiodX_Y where X is the context ID and Y is the specific kproc in that context.
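
A rough sketch of that arrangement (a guess at shape only; the field and function names here are mine, not the patch's):

```
/*
 * Fragment intended for sys/kern/vfs_aio.c; the usual kernel headers
 * (sys/param.h, sys/kthread.h, sys/lock.h, sys/mutex.h, sys/proc.h) are
 * assumed.  Per-context worker accounting: the counter and unit-number
 * allocator live in the softc, so no atomics are needed, and the kproc
 * name encodes the context ID.
 */
struct aio_softc {
	struct mtx	aio_job_mtx;	/* (c) protects the fields below */
	int		aio_sc_id;	/* context index */
	int		num_aio_procs;	/* (c) daemons in this context */
	struct unrhdr	*aiod_unr;	/* per-context kproc numbering */
	/* ... job queues, free kproc list, etc. ... */
};

static int
aio_newproc_sc(struct aio_softc *sc)
{
	struct proc *p;
	int error, id;

	id = alloc_unr(sc->aiod_unr);
	error = kproc_create(aio_daemon, sc, &p, RFNOWAIT, 0,
	    "aiod%d_%d", sc->aio_sc_id, id);
	if (error == 0) {
		mtx_lock(&sc->aio_job_mtx);
		sc->num_aio_procs++;
		mtx_unlock(&sc->aio_job_mtx);
	} else {
		free_unr(sc->aiod_unr, id);
	}
	return (error);
}
```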

1095

Needs a blank line before the comment, but we also normally declare types up at the top of the file as a group.

1213

It might be more intuitive if the argument to this function was the struct aio_softc *.

1678–1679

The whole jid thing can just die. The aiocb_private.kernelinfo field is write-only. Nothing reads it anymore. We used to read it back in and try to find a matching aiocb, but now we walk the linked list looking for a matching userspace pointer instead. Its only use was removed in commit 1ce9182407f671538287837f0bbd8cc05bf41873.

I would maybe start with just removing jobrefid and all the store_kernelinfo stuff entirely as an initial commit.

The job sequence number field will not handle wrap correctly, but since it's 64-bit that's probably good enough and using atomics for it is ok.

To kib's point about dealing with rfork(), maybe instead of hashing on the (pid, fd) tuple you hash on the resolved struct file * pointer that is saved in job->fd.
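
A minimal sketch of that variant, with the helper name and exact field spelling assumed (the review refers to the pointer as job->fd):

```
/*
 * Hypothetical shard selection keyed on the job's resolved struct file
 * pointer, so all requests for one descriptor land on the same context no
 * matter which process (or rfork() sibling) queued them.
 */
static struct aio_softc *
aio_job_to_softc(struct kaiocb *job)
{
	uintptr_t key;

	/* Heuristic: drop low bits that are zero due to allocator alignment. */
	key = (uintptr_t)job->fd >> 8;
	return (&aio_scs[key % num_aio_sc]);
}
```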

Super helpful review, John. I just opened a new review (https://reviews.freebsd.org/D47583) for the simplest suggested change. Will work on your other suggestions.

sys/kern/vfs_aio.c
1678–1679

Thanks. I just opened https://reviews.freebsd.org/D47583 for this.