
unionfs: release parent vnodes in deferred context
ClosedPublic

Authored by jah on Jun 12 2021, 8:37 PM.

Details

Summary

Each unionfs node holds a reference to its parent directory vnode.
If a sufficiently deep unionfs directory hierarchy is deleted, a
single open file reference can therefore end up keeping the vnode
hierarchy in place. When that reference is released, the resulting
VOP_RECLAIM call chain can exhaust the kernel stack.

This is easily reproducible by running the unionfs.sh stress2 test.
Fix it by deferring recursive unionfs vnode release to taskqueue
context.

PR: 238883
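
A minimal sketch of the approach, using illustrative names (un_rele, un_dvp, unionfs_rele_tq) that do not necessarily match the committed code: instead of releasing the parent vnode synchronously from the reclaim path, the node is appended to a mutex-protected list and a task is enqueued so a taskqueue thread performs the vrele() later, off the deep call stack.

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/queue.h>
    #include <sys/taskqueue.h>
    #include <sys/vnode.h>
    #include <fs/unionfs/union.h>   /* struct unionfs_node */

    /*
     * Sketch only: assumes struct unionfs_node gains a
     * STAILQ_ENTRY(unionfs_node) un_rele link and keeps its parent
     * directory vnode in un_dvp.
     */
    static struct mtx unionfs_rele_lock;            /* mtx_init()ed at module init */
    static STAILQ_HEAD(, unionfs_node) unionfs_rele_list =
        STAILQ_HEAD_INITIALIZER(unionfs_rele_list);
    static struct task unionfs_rele_task;           /* TASK_INIT()ed at module init */
    static struct taskqueue *unionfs_rele_tq;       /* created at module init */

    static void
    unionfs_defer_rele(struct unionfs_node *unp)
    {
        /* Called from the reclaim path instead of vrele(unp->un_dvp). */
        mtx_lock(&unionfs_rele_lock);
        STAILQ_INSERT_TAIL(&unionfs_rele_list, unp, un_rele);
        mtx_unlock(&unionfs_rele_lock);
        taskqueue_enqueue(unionfs_rele_tq, &unionfs_rele_task);
    }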

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 39983
Build 36872: arc lint + arc unit

Event Timeline

jah requested review of this revision. Jun 12 2021, 8:37 PM

This is a naive approach. Instead of using a taskqueue, would I be better off with something completely different? For example, could I just stop holding a reference to the parent and instead look up the parent through namei in the relatively small number of cases where unionfs needs it? Or, if the taskqueue is an acceptable approach, should I use a dedicated taskqueue rather than funneling everything through taskqueue_thread?

I suspect it might be unsafe to use the global taskqueue. If any filesystem uses the same taskqueue for anything during inactivation, the system would deadlock.

Even if filesystems do not use the taskqueue, it is single-threaded and global. As I understand from your summary, the deferred queue can become quite large, and its length is in fact user-controllable. Better to use a dedicated taskqueue thread, IMO.
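
As a sketch of what a dedicated queue could look like (names here are assumptions, not necessarily what the revision uses), there are two common ways to get a private single-threaded taskqueue: the TASKQUEUE_DEFINE_THREAD() convenience macro, or explicit creation at module init.

    /* Needs <sys/taskqueue.h>, <sys/malloc.h> (M_WAITOK) and <sys/priority.h> (PVFS). */

    /* Option 1: convenience macro; defines taskqueue_unionfs_rele with one thread. */
    TASKQUEUE_DEFINE_THREAD(unionfs_rele);

    /* Option 2: explicit creation, e.g. called from unionfs_init(). */
    static struct taskqueue *unionfs_rele_tq;

    static void
    unionfs_rele_tq_create(void)
    {
        unionfs_rele_tq = taskqueue_create("unionfs_rele", M_WAITOK,
            taskqueue_thread_enqueue, &unionfs_rele_tq);
        taskqueue_start_threads(&unionfs_rele_tq, 1, PVFS, "unionfs rele taskq");
    }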

sys/fs/unionfs/union_subr.c
114

I believe this does too much unlocking and relocking. You can, under the lock, take the whole deferred list into a local head, then clear union_deferred_rele_list, and unlock rele_lock. The processing of the local list then no longer requires the lock.

There are many instances of this trick in the kernel; the first one I remembered is in dev_unlock_and_free().

sys/fs/unionfs/union_subr.c
114

Ah! I didn't see that the _CONCAT macros were the way to do this. Thanks, I'll switch to STAILQ and take this approach.
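
A sketch of the list-splicing pattern described above, continuing the illustrative names from the earlier sketch: the task handler steals the whole pending list in O(1) while holding the lock, then walks its private copy without any further locking.

    static void
    unionfs_rele_task_fn(void *arg __unused, int pending __unused)
    {
        STAILQ_HEAD(, unionfs_node) local;
        struct unionfs_node *unp;

        STAILQ_INIT(&local);

        /* Splice everything queued so far onto the local head; the shared list becomes empty. */
        mtx_lock(&unionfs_rele_lock);
        STAILQ_CONCAT(&local, &unionfs_rele_list);
        mtx_unlock(&unionfs_rele_lock);

        /* The local list is private now; no lock needed to process it. */
        while ((unp = STAILQ_FIRST(&local)) != NULL) {
            STAILQ_REMOVE_HEAD(&local, un_rele);
            vrele(unp->un_dvp);     /* release the parent directory vnode */
        }
    }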

Use dedicated taskqueue, use STAILQ_CONCAT to reduce list locking

If memory serves, unionfs panics on mount if DEBUG_VFS_LOCKS is enabled, but the patch needs to be tested with it. I don't remember what stands in the way of fixing that, but at a bare minimum the triggering assert can be commented out for testing.

FIFO processing for the deferred list

I think it is useful, if not required, to drain the taskqueue on unmount.

In D30748#691212, @mjg wrote:

If memory serves, unionfs panics on mount if DEBUG_VFS_LOCKS is enabled, but the patch needs to be tested with it. I don't remember what stands in the way of fixing that, but at a bare minimum the triggering assert can be commented out for testing.

I didn't know we had DEBUG_VFS_LOCKS. My guess is that unionfs under stress will fail these checks in a variety of ways, but hopefully none of them will be caused by this change. This seems like another useful tool for evaluating unionfs, so I'll give it a shot as soon as I get some time.

In D30748#691229, @kib wrote:

I think it is useful, if not required, to drain the taskqueue on unmount.

This could be helpful in making it more likely for non-forced unmounts to succeed if there have been very recent deletions. Would you consider that a requirement?

In D30748#691607, @jah wrote:
In D30748#691229, @kib wrote:

I think it is useful, if not required, to drain the taskqueue on unmount.

This could be helpful in making it more likely for non-forced unmounts to succeed if there have been very recent deletions. Would you consider that a requirement?

My motivation for the suggestion was that the module cannot be unloaded until the taskqueue is drained, and it is logical to drain on unmount to ensure that the unmount is actually complete, i.e. that as many resources related to the filesystem as possible are freed.

But now I also think you need to terminate and clean up the taskqueue thread on VFS module unload.
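
If draining on unmount were adopted, it could be as small as a single call late in the unmount path; this is only an illustration of the suggestion, not necessarily what the revision ends up doing.

    /* e.g. near the end of unionfs_unmount(), after the per-mount vnodes are flushed: */
    taskqueue_drain_all(unionfs_rele_tq);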

In D30748#691598, @jah wrote:
In D30748#691212, @mjg wrote:

If memory serves, unionfs panics on mount if DEBUG_VFS_LOCKS is enabled, but the patch needs to be tested with it. I don't remember what stands in the way of fixing that, but at a bare minimum the triggering assert can be commented out for testing.

I didn't know we had DEBUG_VFS_LOCKS. My guess is that unionfs under stress will fail these checks in a variety of ways, but hopefully none of them will be caused by this change. This seems like another useful tool for evaluating unionfs, so I'll give it a shot as soon as I get some time.

Surprisingly, the panic on mount (new vnode not exclusively locked for insmntque()) is the only issue I've found in testing with DEBUG_VFS_LOCKS so far.

In D30748#691785, @kib wrote:
In D30748#691607, @jah wrote:
In D30748#691229, @kib wrote:

I think it is useful, if not required, to drain the taskqueue on unmount.

This could be helpful in making it more likely for non-forced unmounts to succeed if there have been very recent deletions. Would you consider that a requirement?

My motivation for the suggestion was that the module cannot be unloaded until the taskqueue is drained, and it is logical to drain on unmount to ensure that the unmount is actually complete, i.e. that as many resources related to the filesystem as possible are freed.

But now I also think you need to terminate and clean up the taskqueue thread on VFS module unload.

Hmm, I'd initially thought calling taskqueue_free() in unionfs_uninit() would be sufficient. But looking at it again, in the case of a recent forced unmount the number of mounts could be 0, which would allow unionfs_uninit() to proceed while doomed vnodes with non-zero use counts are still waiting to be released. Some of them may not yet have been enqueued for deferred release, so simply calling taskqueue_free() would abort their processing. It seems better to call taskqueue_quiesce() before taskqueue_free().
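
A sketch of that ordering in unionfs_uninit(), with the assumed queue name from the earlier sketches: quiesce first so already-queued and not-yet-queued deferred releases can finish, then free the queue and its thread.

    static int
    unionfs_uninit(struct vfsconf *vfsp __unused)
    {
        /* Wait until the queue stays empty and no task is running... */
        taskqueue_quiesce(unionfs_rele_tq);
        /* ...then tear down the queue and its worker thread. */
        taskqueue_free(unionfs_rele_tq);
        return (0);
    }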

Quiesce the taskqueue on module unload

sys/fs/unionfs/union_subr.c
79

This is an informational sysctl, right? Do you think it is useful generally, and not just for debugging your patch?

That said, why did you mark it as a tunable?

In D30748#691844, @jah wrote:
In D30748#691598, @jah wrote:
In D30748#691212, @mjg wrote:

If memory serves, unionfs panics on mount if DEBUG_VFS_LOCKS is enabled, but the patch needs to be tested with it. I don't remember what stands in the way of fixing that, but at a bare minimum the triggering assert can be commented out for testing.

I didn't know we had DEBUG_VFS_LOCKS. My guess is that unionfs under stress will fail these checks in a variety of ways, but hopefully none of them will be caused by this change. This seems like another useful tool for evaluating unionfs, so I'll give it a shot as soon as I get some time.

Surprisingly, the panic on mount (new vnode not exclusively locked for insmntque()) is the only issue I've found in testing with DEBUG_VFS_LOCKS so far.

To be clear, I do plan to fix the insmntque() panic in a separate change, but I have a few other things I want to get to first.

sys/fs/unionfs/union_subr.c
79

This is an informational sysctl, right? Do you think it is useful generally, and not just for debugging your patch?

It was moderately useful in verifying that the deferred release worker was functioning, and in getting a sense of how much release traffic there is. I decided to keep it for the code review in case someone else finds it useful, e.g. in debugging a unionfs performance issue.

That said, why did you mark it as a tunable?

copy-paste-o :) NOFETCH | TUNABLE don't make any sense here.
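
For reference, a plain read-only informational counter needs no NOFETCH/TUN flags; a hypothetical declaration (name and description are illustrative) could look like:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    SYSCTL_DECL(_vfs);

    static u_long unionfs_ndeferred;
    SYSCTL_ULONG(_vfs, OID_AUTO, unionfs_ndeferred, CTLFLAG_RD,
        &unionfs_ndeferred, 0,
        "Number of unionfs deferred vnode releases");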

This revision is now accepted and ready to land. Jun 16 2021, 3:40 PM

Use STAILQ_FOREACH_SAFE to simplify the release loop
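
For illustration, the processing loop from the earlier sketch could be written with STAILQ_FOREACH_SAFE, which caches the next pointer up front so the body may unlink or free the current node without breaking the iteration (whether the node is freed here is an assumption of the sketch).

    struct unionfs_node *unp, *tunp;

    STAILQ_FOREACH_SAFE(unp, &local, un_rele, tunp) {
        vrele(unp->un_dvp);     /* release the parent directory vnode */
        /* unp may be unlinked or freed here without affecting the walk */
    }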

This revision now requires review to proceed. Jun 18 2021, 11:36 PM
This revision is now accepted and ready to land. Jun 28 2021, 1:38 PM