Hello,
Since the merge of "MFV r323535: 8585 improve batching done in zil_commit()" (I have done a bisect to find it) there is a deadlock in the FreeBSD part of ZFS code.
The deadlock happen when we try to do a zfs rename on a still opened zvol.
I have spotted the bug because I'm using ctld to expose some zvol through iscsi and I had a race when my ctld was still reloading (to no longer use the zvol) while my script was already trying to rename the zvol.
Here is some ddb stacktrace during the deadlock:
the zfs rename stack:
Tracing command zfs pid 1202 tid 101726 td 0xfffff801475e7000 sched_switch() at sched_switch+0x3d0/frame 0xfffffe7f47507550 mi_switch() at mi_switch+0xe5/frame 0xfffffe7f47507580 sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe7f475075b0 _cv_wait() at _cv_wait+0x15b/frame 0xfffffe7f475075f0 txg_wait_synced() at txg_wait_synced+0xa5/frame 0xfffffe7f47507630 dsl_sync_task() at dsl_sync_task+0x209/frame 0xfffffe7f47507700 dsl_dir_rename() at dsl_dir_rename+0x47/frame 0xfffffe7f47507730 zfsdev_ioctl() at zfsdev_ioctl+0x6bf/frame 0xfffffe7f475077e0 devfs_ioctl_f() at devfs_ioctl_f+0x128/frame 0xfffffe7f47507840 kern_ioctl() at kern_ioctl+0x26b/frame 0xfffffe7f475078b0 sys_ioctl() at sys_ioctl+0x16c/frame 0xfffffe7f47507980 amd64_syscall() at amd64_syscall+0xa31/frame 0xfffffe7f47507ab0 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe7f47507ab0 --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80141651a, rsp = 0x7fffffffc678, rbp = 0x7fffffffc6f0 ---
the txg_sync thread:
Tracing command zfskern pid 15 tid 101701 td 0xfffff802361dd000 sched_switch() at sched_switch+0x3d0/frame 0xfffffe7f47494020 mi_switch() at mi_switch+0xe5/frame 0xfffffe7f47494050 sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe7f47494080 _cv_wait() at _cv_wait+0x15b/frame 0xfffffe7f474940c0 txg_wait_synced() at txg_wait_synced+0xa5/frame 0xfffffe7f47494100 zil_close() at zil_close+0xc7/frame 0xfffffe7f47494140 zvol_last_close() at zvol_last_close+0x15/frame 0xfffffe7f47494160 zvol_rename_minor() at zvol_rename_minor+0x148/frame 0xfffffe7f474941d0 zvol_rename_minors() at zvol_rename_minors+0x1cf/frame 0xfffffe7f47494630 dsl_dir_rename_sync() at dsl_dir_rename_sync+0x3e9/frame 0xfffffe7f474946c0 dsl_sync_task_sync() at dsl_sync_task_sync+0xd5/frame 0xfffffe7f474946f0 dsl_pool_sync() at dsl_pool_sync+0x38b/frame 0xfffffe7f47494770 spa_sync() at spa_sync+0x982/frame 0xfffffe7f474949a0 txg_sync_thread() at txg_sync_thread+0x400/frame 0xfffffe7f47494a70 fork_exit() at fork_exit+0x85/frame 0xfffffe7f47494ab0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe7f47494ab0 --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
a ps:
root 1202 0.0 0.0 7796 3596 0 D+ 13:45 0:00.05 zfs rename bld311/win/fio130k13 bld311/win/fio130k13fff
ddb ps:
pid ppid pgrp uid state wmesg wchan cmd 1202 1199 1202 0 D+ tx->tx_s 0xfffff801c563aa40 zfs ... 101111 D tx->tx_q 0xfffff801c57fea50 [txg_thread_enter] 101112 D tx->tx_s 0xfffff801c57fea30 [txg_thread_enter] 101700 D tx->tx_q 0xfffff801c563aa50 [txg_thread_enter] 101701 D tx->tx_s 0xfffff801c563aa40 [txg_thread_enter] ...
From what I understand, before r323535 we were still calling zvol_rename_minor()->zvol_last_close()->zil_close() but zil_close() was not doing a txg_sync_wait() call so we were not impacted but it is no longer the case. With my small understanding of the txg part of ZFS, I guess we are not supposed to call a function that is doing a txg_sync_wait() from the txg_sync thread itself.
Note that in illumos and ZoL they don't have this issue because they are not calling zvol_last_close() during a rename, from what I see they are able to rename the device directly without closing/opening it.
to reproduce it:
- checkout at least 02ec18c016ef723fc4b74c85b315b1253f48767b (on stable/11)
- create a zvol
- open it ( through ctld in my case )
- zfs rename it
The patch proposed here is certainly not the best one, but it solve almost the problem. The issue with it is that there is potentially a small race between the check and the use, but it is better than the current scenario of a sure deadlock.
Note : all my tests/debug were done on stable/11 and not -current