Page MenuHomeFreeBSD

zfs: deadlock in "zfs rename" if zv_total_opens > 0 since MFV r323535
Needs ReviewPublic

Authored by nikita_elyzion.net on Dec 11 2017, 2:44 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Dec 22, 7:14 PM
Unknown Object (File)
Oct 3 2024, 11:50 PM
Unknown Object (File)
Oct 3 2024, 4:40 PM
Unknown Object (File)
Oct 3 2024, 8:53 AM
Unknown Object (File)
Oct 1 2024, 8:30 AM
Unknown Object (File)
Sep 8 2024, 10:44 PM
Unknown Object (File)
Sep 7 2024, 10:58 AM
Unknown Object (File)
Aug 23 2024, 6:49 AM
Subscribers

Details

Reviewers
mav
avg
bapt
manu
Summary

Hello,

Since the merge of "MFV r323535: 8585 improve batching done in zil_commit()" (I have done a bisect to find it) there is a deadlock in the FreeBSD part of ZFS code.
The deadlock happen when we try to do a zfs rename on a still opened zvol.

I have spotted the bug because I'm using ctld to expose some zvol through iscsi and I had a race when my ctld was still reloading (to no longer use the zvol) while my script was already trying to rename the zvol.

Here is some ddb stacktrace during the deadlock:
the zfs rename stack:

Tracing command zfs pid 1202 tid 101726 td 0xfffff801475e7000
sched_switch() at sched_switch+0x3d0/frame 0xfffffe7f47507550
mi_switch() at mi_switch+0xe5/frame 0xfffffe7f47507580
sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe7f475075b0
_cv_wait() at _cv_wait+0x15b/frame 0xfffffe7f475075f0
txg_wait_synced() at txg_wait_synced+0xa5/frame 0xfffffe7f47507630
dsl_sync_task() at dsl_sync_task+0x209/frame 0xfffffe7f47507700
dsl_dir_rename() at dsl_dir_rename+0x47/frame 0xfffffe7f47507730
zfsdev_ioctl() at zfsdev_ioctl+0x6bf/frame 0xfffffe7f475077e0
devfs_ioctl_f() at devfs_ioctl_f+0x128/frame 0xfffffe7f47507840
kern_ioctl() at kern_ioctl+0x26b/frame 0xfffffe7f475078b0
sys_ioctl() at sys_ioctl+0x16c/frame 0xfffffe7f47507980
amd64_syscall() at amd64_syscall+0xa31/frame 0xfffffe7f47507ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe7f47507ab0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80141651a, rsp = 0x7fffffffc678, rbp = 0x7fffffffc6f0 ---

the txg_sync thread:

Tracing command zfskern pid 15 tid 101701 td 0xfffff802361dd000
sched_switch() at sched_switch+0x3d0/frame 0xfffffe7f47494020
mi_switch() at mi_switch+0xe5/frame 0xfffffe7f47494050
sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe7f47494080
_cv_wait() at _cv_wait+0x15b/frame 0xfffffe7f474940c0
txg_wait_synced() at txg_wait_synced+0xa5/frame 0xfffffe7f47494100
zil_close() at zil_close+0xc7/frame 0xfffffe7f47494140
zvol_last_close() at zvol_last_close+0x15/frame 0xfffffe7f47494160
zvol_rename_minor() at zvol_rename_minor+0x148/frame 0xfffffe7f474941d0
zvol_rename_minors() at zvol_rename_minors+0x1cf/frame 0xfffffe7f47494630
dsl_dir_rename_sync() at dsl_dir_rename_sync+0x3e9/frame 0xfffffe7f474946c0
dsl_sync_task_sync() at dsl_sync_task_sync+0xd5/frame 0xfffffe7f474946f0
dsl_pool_sync() at dsl_pool_sync+0x38b/frame 0xfffffe7f47494770
spa_sync() at spa_sync+0x982/frame 0xfffffe7f474949a0
txg_sync_thread() at txg_sync_thread+0x400/frame 0xfffffe7f47494a70
fork_exit() at fork_exit+0x85/frame 0xfffffe7f47494ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe7f47494ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

a ps:

root    1202    0.0  0.0  7796  3596  0  D+   13:45     0:00.05 zfs rename bld311/win/fio130k13 bld311/win/fio130k13fff

ddb ps:

  pid  ppid  pgrp   uid   state   wmesg         wchan        cmd
 1202  1199  1202     0  D+      tx->tx_s 0xfffff801c563aa40 zfs
...
101111                   D       tx->tx_q 0xfffff801c57fea50 [txg_thread_enter]
101112                   D       tx->tx_s 0xfffff801c57fea30 [txg_thread_enter]
101700                   D       tx->tx_q 0xfffff801c563aa50 [txg_thread_enter]
101701                   D       tx->tx_s 0xfffff801c563aa40 [txg_thread_enter]
...

From what I understand, before r323535 we were still calling zvol_rename_minor()->zvol_last_close()->zil_close() but zil_close() was not doing a txg_sync_wait() call so we were not impacted but it is no longer the case. With my small understanding of the txg part of ZFS, I guess we are not supposed to call a function that is doing a txg_sync_wait() from the txg_sync thread itself.

Note that in illumos and ZoL they don't have this issue because they are not calling zvol_last_close() during a rename, from what I see they are able to rename the device directly without closing/opening it.

to reproduce it:

  • checkout at least 02ec18c016ef723fc4b74c85b315b1253f48767b (on stable/11)
  • create a zvol
  • open it ( through ctld in my case )
  • zfs rename it

The patch proposed here is certainly not the best one, but it solve almost the problem. The issue with it is that there is potentially a small race between the check and the use, but it is better than the current scenario of a sure deadlock.

Note : all my tests/debug were done on stable/11 and not -current

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

I don't have anything against this patch, but I would prefer a more universal solution.
There are more of problems involving FreeBSD-specific volume manipulations done in the sync context.
Please see this bug report for some details: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=203864
These comments have my analysis:

And my proposed fix D7179.
But the fix introduces a window between the changes are done at DSL level and when they applied to zvol devices.
So, it opens a possibility of some races, etc.

I haven't given any more thought to the problem since then, so I do not have any better proposals right now.

I was not aware of D7179, thanks to pointing it. I tried it in some simple case and it seems to work pretty well.
Moving zfsvfs_update_fromname() and zvol_rename_minors() from txg_sync to the ioctl part was something I considered but I was unsure of the result because of the part of the rename still in txg_sync but after a second look it seems fine.
About the race, maybe adding a way to lock a specific zvol by adding something in zvol_state (from zvol.c) and holding the lock on the specific zvol during a zfs_ioc_rename could be enough to fix the issue ? so we won't have concurrent rename on the same zvol ? or am I missing another race here ?
My feeling is that we have now a second obvious deadlock on zfs rename in stable/11 and it would be better to use your patch now with (or maybe even without) a lock on the zvol than what we have now.

I forgot to add, even with D7179, it seems still be a good idea to ensure that we are not trying to rename a zvol which is still opened somewhere.