MFV r328227: 8909 8585 can cause a use-after-free kernel panic
illumos/illumos-gate@94ddd0900a8838f62bba15e270649a42f4ef9f81
https://www.illumos.org/issues/8909:
There's a race condition that exists if zil_free_lwb races with either
zil_commit_waiter_timeout and/or zil_lwb_flush_vdevs_done.
Here's an example panic due to this bug:
::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40 operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc) image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513 panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in mo
dule "zfs" due to a NULL pointer dereference
dump content: kernel pages only
$c
zio_shrink+0x12() zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20) zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit+0x80(ffffff03dcd15cc0, 9a9) zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) write+0x250(42, fffffd7ff4832000, 2000) sys_syscall+0x177()
If there's an outstanding lwb that's in zil_commit_waiter_timeout
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call zil_free_lwb. If we end up calling zil_free_lwb, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the zil_commit_waiter_t structure of the
thread waiting on the waiter's CV is used.
A similar situation can occur if an lwb is issued to disk, and thus in
the LWB_STATE_ISSUED state, and zil_free_lwb is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
zil_free_lwb, which will result in a use-after-free situation when the
lwb's zio completes, and zil_lwb_flush_vdevs_done is called.
This race condition is prevented in zil_close by calling zil_commit
before zil_free_lwb is called, which will ensure all outstanding (i.e.
all lwb's in the LWB_STATE_OPEN and/or LWB_STATE_ISSUED states)
reach the LWB_STATE_DONE state before the lwb's are freed
(zil_commit will not return untill all the lwb's are
LWB_STATE_DONE).
Further, this race condition is prevented in zil_sync by only calling
zil_free_lwb for lwb's that do not have their lwb_buf pointer set.
All lwb's not in the LWB_STATE_DONE state will have a non-null value
for this pointer; the pointer is only cleared in
zil_lwb_flush_vdevs_done, at which point the lwb's state will be
changed to LWB_STATE_DONE.
This race is present in zil_suspend, leading to this bug.
At first glance, it would appear as though this would not be true
because zil_suspend will call zil_commit, just like zil_close, but
the problem is that zil_suspend will set the zilog's zl_suspend
field prior to calling zil_commit. Further, in zil_commit, if
zl_suspend is set, zil_commit will take a special branch of logic
and use txg_wait_synced instead of performing the normal zil_commit
logic.
This call to txg_wait_synced might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the LWB_STATE_DONE state before it returns.
This is because, if there's an lwb "stuck" in
zil_commit_waiter_timeout, waiting for it's lwb to timeout, it will
maintain a non-null value for it's lwb_buf field and thus zil_sync
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnesseary.
So, after zil_commit is called from zil_suspend, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used zil_commit_waiter_timeout.
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>