nvme: fix race condition in split bio completion path


nvme: fix race condition in split bio completion path

Fixes race condition observed under following circumstances:

  1. I/O split on 128KB boundary with Intel NVMe controller. Current Intel controllers produce better latency when I/Os do not span a 128KB boundary - even if the I/O size itself is less than 128KB.
  2. Per-CPU I/O queues are enabled.
  3. Child I/Os are submitted on different submission queues.
  4. Interrupts for child I/O completions occur almost simultaneously.
  5. ithread for child I/O A increments bio_inbed, then immediately is preempted (rendezvous IPI, higher priority interrupt).
  6. ithread for child I/O B increments bio_inbed, then completes parent bio since all children are now completed.
  7. parent bio is freed, and immediately reallocated for a VFS or gpart bio (including setting bio_children to 1 and clearing bio_driver1).
  8. ithread for child I/O A resumes processing. bio_children for what it thinks is the parent bio is set to 1, so it thinks it needs to complete the parent bio.

Result is either calling a NULL callback function, or double freeing
the bio to its uma zone.

PR: 203746
Reported by: Drew Gallatin <gallatin@netflix.com>,

		Marc Goroff <mgoroff@quorum.net>

Tested by: Drew Gallatin <gallatin@netflix.com>
MFC after: 3 days
Sponsored by: Intel


jimharrisAuthored on
rS290197: After r290196, the kernel won't wait for stuff like gmirror nodes