nvme: fix race condition in split bio completion path
Fix a race condition observed under the following circumstances:
- I/O split on 128KB boundary with Intel NVMe controller. Current Intel controllers produce better latency when I/Os do not span a 128KB boundary - even if the I/O size itself is less than 128KB.
- Per-CPU I/O queues are enabled.
- Child I/Os are submitted on different submission queues.
- Interrupts for child I/O completions occur almost simultaneously.
- ithread for child I/O A increments bio_inbed, then is immediately preempted (by a rendezvous IPI or a higher-priority interrupt).
- ithread for child I/O B increments bio_inbed, then completes parent bio since all children are now completed.
- parent bio is freed, and immediately reallocated for a VFS or gpart bio (including setting bio_children to 1 and clearing bio_driver1).
- ithread for child I/O A resumes processing. What it believes is the parent bio now has bio_children set to 1, so it concludes that it must complete the parent bio.
The result is either a call to a NULL callback function or a double
free of the bio to its UMA zone.
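The interleaving above can be replayed deterministically in userspace. The following is only an illustrative sketch using C11 atomics, not the driver code: struct parent_bio and simulate_completions() are hypothetical names, and the "safe" variant (snapshot the child count before the atomic increment, then never read the parent again unless this child is the last one) is an assumption about one way to close the window, not a statement of what this change does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the parent bio's bookkeeping fields. */
struct parent_bio {
	unsigned int	children;	/* models bio_children */
	atomic_uint	inbed;		/* models bio_inbed */
};

/*
 * Replay the interleaving from the list above on a single thread and
 * return how many times the parent gets completed.  When
 * snapshot_children_first is false, child A reads the child count only
 * after it resumes, i.e. from memory that has already been recycled.
 */
static int
simulate_completions(bool snapshot_children_first)
{
	struct parent_bio p;
	unsigned int a_children, a_inbed, b_inbed;
	int completions = 0;

	p.children = 2;
	atomic_init(&p.inbed, 0);

	/* Child A: snapshot the child count, count itself, get preempted. */
	a_children = p.children;
	a_inbed = atomic_fetch_add(&p.inbed, 1) + 1;

	/* Child B: runs to the end, is the last child, completes the parent. */
	b_inbed = atomic_fetch_add(&p.inbed, 1) + 1;
	assert(b_inbed == 2);
	if (b_inbed == p.children)
		completions++;

	/* Parent is freed and reallocated as a bio with a single child. */
	p.children = 1;
	atomic_store(&p.inbed, 0);

	/* Child A resumes, still holding a pointer to the recycled bio. */
	if (!snapshot_children_first)
		a_children = p.children;	/* stale read: sees 1, not 2 */
	if (a_inbed == a_children)
		completions++;			/* second, bogus completion */

	return (completions);
}
```

In this model simulate_completions(false) counts two completions of the parent (the double completion described above), while simulate_completions(true) counts exactly one.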
PR: 203746
Reported by: Drew Gallatin <gallatin@netflix.com>,
Marc Goroff <mgoroff@quorum.net>
Tested by: Drew Gallatin <gallatin@netflix.com>
MFC after: 3 days
Sponsored by: Intel