Page MenuHomeFreeBSD

add white listing for ZFS locking pairs that WITNESS can't report accurately and enable WITNESS by default in ZFS
Needs ReviewPublic

Authored by kmacy on Apr 8 2018, 3:39 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 17, 2:17 AM
Unknown Object (File)
Jan 4 2025, 12:25 PM
Unknown Object (File)
Dec 26 2024, 8:19 AM
Unknown Object (File)
Dec 14 2024, 11:54 PM
Unknown Object (File)
Dec 9 2024, 4:24 PM
Unknown Object (File)
Dec 4 2024, 7:13 AM
Unknown Object (File)
Oct 11 2024, 1:50 AM
Unknown Object (File)
Oct 9 2024, 12:10 AM

Details

Summary

In spite of never being heavily involved with ZFS I personally have fixed 3 deadlocks in it over the years. A recent deadlock during testing by a customer caused a major setback for FreeBSD. Thus I'd very much like to move towards leveraging WITNESS.

It currently isn't possible to use WITNESS for much of ZFS because the lock name doesn't provide WITNESS with enough context to distinguish between what are in fact two unique locking orders. This change will, at _enrollment_ time mark a lock pair as "blessed". Thus at lock acquisition there is no added cost over the existing code - unlike the previous bless implementation. The added enrollment cost is only borne by locks in the classes that are being blessed (in the case of ZFS sx and lockmgr). The added overhead of enrollment for participating lock types is an extra string compare for each lock that is in a white listed pair.

The intent is that with WITNESS enabled, we can incrementally determine how to give WITNESS the context it needs for each lock pair. And even if we aren't reporting reorders,
enrolling witness locks mean that we can now inspect them in ddb when we get a lockup.

Obviously the real solution is to fix the opensolaris shim layer to provide the context / naming that WITNESS needs to provide accurate LORs. Nonetheless, this prevents regressions on new locks in the present.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 16049

Event Timeline

kmacy added a reviewer: markj.
cem accepted this revision.EditedApr 8 2018, 4:10 AM
cem added a subscriber: cem.

I'm not too familiar with witness and the patch lacks context, but the changes look good to me (and I really like the general direction).

Some non-ZFS LORs that could be suppressed include several in UFS: http://sources.zabbadoz.net/freebsd/lor.html (though some of these may not be false positives).

sys/kern/subr_witness.c
740

spurious diff

2161

style nit: != 0

This revision is now accepted and ready to land.Apr 8 2018, 4:10 AM
kmacy added reviewers: avg, mav.

add mls->mls_lock + dl->dl_lock

This revision now requires review to proceed.Apr 8 2018, 4:25 AM

Don't include extra changes

Why not set WITNESS_BLESSED when the first reversal is detected, and have blessed() return true if WITNESS_BLESSED is set? Then the implementation will still be cheap and we could retain the option of logging the stacks, so investigating the reversals will be easier.

sys/kern/subr_witness.c
177

The comment is wrong now.

330

The declarations are sorted alphabetically and include parameter names.

1290

Why not do this in witness_lock_order_check?

2150

You're not doing anything with "type."

2164

Wonky indentation.

2168

Extra braces.

sys/kern/subr_witness.c
2150

The loop + following if statement checks if the lock class is in the blessed_lock_types table. Locks other than those (currently sx and lockmgr) are ignored for the purposes of this function.

It's a cheaper test than the strcmp iteration for locks that aren't sx or lockmgr.

Why not set WITNESS_BLESSED when the first reversal is detected, and have blessed() return true if WITNESS_BLESSED is set? Then the implementation will still be cheap and we could retain the option of logging the stacks, so investigating the reversals will be easier.

Is it cheaper to call blessed on every potential reversal than init?

Why not set WITNESS_BLESSED when the first reversal is detected, and have blessed() return true if WITNESS_BLESSED is set? Then the implementation will still be cheap and we could retain the option of logging the stacks, so investigating the reversals will be easier.

That doesn't actually work. We lose the ability to continue traversing the list otherwise we get:

Apr 8 22:23:56 entropy kernel: witness rmatrix paradox! [64][60]=14 both ancestor and descendant
Apr 8 22:23:56 entropy kernel: KDB: stack backtrace:
Apr 8 22:23:56 entropy kernel: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01f6ba8120
Apr 8 22:23:56 entropy kernel: adopt() at adopt+0x42e/frame 0xfffffe01f6ba8180
Apr 8 22:23:56 entropy kernel: itismychild() at itismychild+0x164/frame 0xfffffe01f6ba81b0
Apr 8 22:23:56 entropy kernel: witness_checkorder() at witness_checkorder+0x1030/frame 0xfffffe01f6ba83d0
Apr 8 22:23:56 entropy kernel: _sx_xlock() at _sx_xlock+0x68/frame 0xfffffe01f6ba8410
Apr 8 22:23:56 entropy kernel: dbuf_sync_list() at dbuf_sync_list+0x593/frame 0xfffffe01f6ba84c0
Apr 8 22:23:56 entropy kernel: dbuf_sync_list() at dbuf_sync_list+0x470/frame 0xfffffe01f6ba8570
Apr 8 22:23:56 entropy kernel: dnode_sync() at dnode_sync+0xe91/frame 0xfffffe01f6ba8630
Apr 8 22:23:56 entropy kernel: dmu_objset_sync() at dmu_objset_sync+0x1a2/frame 0xfffffe01f6ba86b0
Apr 8 22:23:56 entropy kernel: dsl_pool_sync() at dsl_pool_sync+0x4ec/frame 0xfffffe01f6ba8740
Apr 8 22:23:56 entropy kernel: spa_sync() at spa_sync+0xd09/frame 0xfffffe01f6ba8990
Apr 8 22:23:56 entropy kernel: txg_sync_thread() at txg_sync_thread+0x2a9/frame 0xfffffe01f6ba8a70
Apr 8 22:23:56 entropy kernel: fork_exit() at fork_exit+0x84/frame 0xfffffe01f6ba8ab0
Apr 8 22:23:56 entropy kernel: fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01f6ba8ab0
Apr 8 22:23:56 entropy kernel: --- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Then if you disable those checks for blessed locks you still get:

isitmychild: rmatrix mismatch between buf->b_evict_lock (index 68) and buf_hash_table.ht_locks[i].ht_lock (index 69): w_rmatrix[68][69] == 9 but w_rmatrix[69][68] == 6
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01f6ba7f60
_isitmyx() at _isitmyx+0x2b4/frame 0xfffffe01f6ba7ff0
isitmychild() at isitmychild+0x31/frame 0xfffffe01f6ba8020
witness_lock_order_check() at witness_lock_order_check+0x5c/frame 0xfffffe01f6ba8050
witness_checkorder() at witness_checkorder+0x5a7/frame 0xfffffe01f6ba8270
_sx_xlock() at _sx_xlock+0x68/frame 0xfffffe01f6ba82b0
arc_release() at arc_release+0xad/frame 0xfffffe01f6ba8340
dbuf_write() at dbuf_write+0xc1/frame 0xfffffe01f6ba8410
dbuf_sync_list() at dbuf_sync_list+0xa72/frame 0xfffffe01f6ba84c0
dbuf_sync_list() at dbuf_sync_list+0x470/frame 0xfffffe01f6ba8570
dnode_sync() at dnode_sync+0xe91/frame 0xfffffe01f6ba8630
dmu_objset_sync() at dmu_objset_sync+0x1a2/frame 0xfffffe01f6ba86b0
dsl_pool_sync() at dsl_pool_sync+0x4ec/frame 0xfffffe01f6ba8740
spa_sync() at spa_sync+0xd09/frame 0xfffffe01f6ba8990
txg_sync_thread() at txg_sync_thread+0x2a9/frame 0xfffffe01f6ba8a70
fork_exit() at fork_exit+0x84/frame 0xfffffe01f6ba8ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01f6ba8ab0

  • trap 0, rip = 0, rsp = 0, rbp = 0 ---

@markj you can find my attempt at that here https://github.com/mattmacy/networking/tree/projects/witnessfix2 -- you can have a go at it. If you don't have the time / inclination I'd _really_ appreciate it if we didn't let the best become the enemy of the good. If we want stacks of reversals we can comment out blessed entries.

I am not closely familiar with WITNESS, so just a feeling: the long lists of blessed locks and their combinations promises high chances for them to be forgotten on following ZFS updates. At very list it would be good to document how those new mechanisms should be used.

sys/kern/subr_witness.c
759

"db->db_mtx" is specified twice in blessed_locks.

In D15010#316190, @mav wrote:

I am not closely familiar with WITNESS, so just a feeling: the long lists of blessed locks and their combinations promises high chances for them to be forgotten on following ZFS updates.

That's actually true of all documented lock orders. I don't have a good fix for that. However, the cost of lookup can be further reduced by putting the names in a red black tree, reducing the overhead to the O(lgN).

At very list it would be good to document how those new mechanisms should be used.

Yup. Seems pretty self-explanatory apart from the separate lists used for expediting negative lookups.

In D15010#316190, @mav wrote:

I am not closely familiar with WITNESS, so just a feeling: the long lists of blessed locks and their combinations promises high chances for them to be forgotten on following ZFS updates.

That's actually true of all documented lock orders. I don't have a good fix for that. However, the cost of lookup can be further reduced by putting the names in a red black tree, reducing the overhead to the O(lgN).

At very list it would be good to document how those new mechanisms should be used.

Yup. Seems pretty self-explanatory apart from the separate lists used for expediting negative lookups.

We should just do some simple a/b perf tests to put this blessed list checking issue to bed. We have enough locks in the blessed lists to see if it impacts a build on ffs. We don't want to measure the impact of newly running witness on zfs.

Hopefully as ZFS code changes people will be alerted to locks they forgot to add to the list by witness warnings and that will be enough impetus to fix it. The worst case should be removed locks that are not garbage collected from the list.

In D15010#468020, @mjg wrote:

Is this getting anywhere?

Maybe now. Now that we're actually doing development on ZFS