Page MenuHomeFreeBSD

unionfs: release parent vnodes in deferred context
ClosedPublic

Authored by jah on Jun 12 2021, 8:37 PM.

Details

Summary

Each unionfs node holds a reference to its parent directory vnode.
If a sufficiently deep unionfs directory hierarchy is deleted, a
single open file reference can therefore end up keeping the vnode
hierarchy in place. When that reference is released, the resulting
VOP_RECLAIM call chain can exhaust the kernel stack.

This is easily reproducible by running the unionfs.sh stress2 test.
Fix it by deferring recursive unionfs vnode release to taskqueue
context.

PR: 238883

Diff Detail

Repository
rG FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

jah requested review of this revision.Jun 12 2021, 8:37 PM

This is a naive approach. Instead of using a taskqueue, would I be better of with something completely different? For example, could I just stop holding a reference to the parent and instead lookup the parent through namei in the relatively small number of cases where unionfs needs the parent? Or, if the taskqueue is an acceptable approach, should I instead use a dedicated taskqueue instead of funneling everything through taskqueue_thread?

I suspect it might be unsafe to use the global taskqueue. If any filesystem uses the same taskqueue for anything during inactivation, the system would deadlock.

Even if filesystems do not use the taskqueue, it is single-threaded and global. As I understand from your summary, the deferred queue can become quite large, and its length is in fact user-controllable. Better to use dedicated taskqueue thread, IMO.

sys/fs/unionfs/union_subr.c
113

I believe this is too much unlocks/locks. You can, under the lock, take the whole deferred list into a local head, then clear union_deferred_rele_list, and unlock rele_lock. Then the processing of the local list should no longer require the lock.

There are many instances of this trick in kernel, one I remembered first is in dev_unlock_and_free().

sys/fs/unionfs/union_subr.c
113

Ah! I didn't see that the _CONCAT macros were the way to do this. Thanks, I'll switch to STAILQ and take this approach.

Use dedicated taskqueue, use STAILQ_CONCAT to reduce list locking

If memory serves unionfs panics on mount if DEBUG_VFS_LOCKS is enabled, but the patch needs to be tested with it. I don't remember what stands in the way of fixing it, but bare minimum the triggering assert can be commented out for testing.

FIFO processing for the deferred list

I think it is useful, if not required, to drain the taskqueue on unmount.

In D30748#691212, @mjg wrote:

If memory serves unionfs panics on mount if DEBUG_VFS_LOCKS is enabled, but the patch needs to be tested with it. I don't remember what stands in the way of fixing it, but bare minimum the triggering assert can be commented out for testing.

I didn't know we had DEBUG_VFS_LOCKS. My guess is that unionfs under stress will fail these checks in a variety of ways, but hopefully none of them will be caused by this change. This seems like another useful tool for evaluating unionfs, so I'll give it a shot as soon as I get some time.

In D30748#691229, @kib wrote:

I think it is useful, if not required, to drain the taskqueue on unmount.

This could be helpful in making it more likely for non-forced unmounts to succeed if there have been very recent deletions. Would you consider that a requirement?

In D30748#691607, @jah wrote:
In D30748#691229, @kib wrote:

I think it is useful, if not required, to drain the taskqueue on unmount.

This could be helpful in making it more likely for non-forced unmounts to succeed if there have been very recent deletions. Would you consider that a requirement?

My motivation for the suggestion was that the module cannot be unloaded until taskqueue is drained, and it is logical to drain on unmount to ensure that unmount is actually completed, i.e. as much resources related to the filesystem is freed as possible.

But now I think that also you need to terminate and clean up the taskqueue thread on vfs module unload.

In D30748#691598, @jah wrote:
In D30748#691212, @mjg wrote:

If memory serves unionfs panics on mount if DEBUG_VFS_LOCKS is enabled, but the patch needs to be tested with it. I don't remember what stands in the way of fixing it, but bare minimum the triggering assert can be commented out for testing.

I didn't know we had DEBUG_VFS_LOCKS. My guess is that unionfs under stress will fail these checks in a variety of ways, but hopefully none of them will be caused by this change. This seems like another useful tool for evaluating unionfs, so I'll give it a shot as soon as I get some time.

Surprisingly, the panic on mount (new vnode not exclusively locked for insmntque()) is the only issue I've found in testing with DEBUG_VFS_LOCKS so far.

In D30748#691785, @kib wrote:
In D30748#691607, @jah wrote:
In D30748#691229, @kib wrote:

I think it is useful, if not required, to drain the taskqueue on unmount.

This could be helpful in making it more likely for non-forced unmounts to succeed if there have been very recent deletions. Would you consider that a requirement?

My motivation for the suggestion was that the module cannot be unloaded until taskqueue is drained, and it is logical to drain on unmount to ensure that unmount is actually completed, i.e. as much resources related to the filesystem is freed as possible.

But now I think that also you need to terminate and clean up the taskqueue thread on vfs module unload.

Hmm, I'd initially thought calling taskqueue_free() in unionfs_uninit() would be sufficient. But looking at it again, I think in the case of a recent forced unmount the number of mounts could be 0, which would allow unionfs_uninit() to proceed, but there could be remaining doomed vnodes with non-zero usecounts waiting to be released. Some of them may not have yet been enqueued for deferred release, so simply calling taskqueue_free() would abort processing of them. It seems better to call taskqueue_quiesce() before taskqueue_free().

Quiesce the taskqueue on module unload

sys/fs/unionfs/union_subr.c
78

This is an informational sysctl, right? Do you think it is useful generally, and not for debugging of your patch?

That said, why did you marked it as a tunable?

In D30748#691844, @jah wrote:
In D30748#691598, @jah wrote:
In D30748#691212, @mjg wrote:

If memory serves unionfs panics on mount if DEBUG_VFS_LOCKS is enabled, but the patch needs to be tested with it. I don't remember what stands in the way of fixing it, but bare minimum the triggering assert can be commented out for testing.

I didn't know we had DEBUG_VFS_LOCKS. My guess is that unionfs under stress will fail these checks in a variety of ways, but hopefully none of them will be caused by this change. This seems like another useful tool for evaluating unionfs, so I'll give it a shot as soon as I get some time.

Surprisingly, the panic on mount (new vnode not exclusively locked for insmntque()) is the only issue I've found in testing with DEBUG_VFS_LOCKS so far.

To be clear, I do plan to fix the instmntque() panic in a separate change, but I have a few other things I want to get to first.

sys/fs/unionfs/union_subr.c
78

This is an informational sysctl, right? Do you think it is useful generally, and not for debugging of your patch?

It was moderately useful in making sure the deferred release worker was working, and in getting a sense of how much release traffic there is. I decided to keep it for the code review in case someone else might find it useful e.g. in debugging a unionfs performance issue.

That said, why did you marked it as a tunable?

copy-paste-o :) NOFETCH | TUNABLE don't make any sense here.

This revision is now accepted and ready to land.Jun 16 2021, 3:40 PM

Use STAILQ_FOREACH_SAFE to simplify the release loop

This revision now requires review to proceed.Jun 18 2021, 11:36 PM
This revision is now accepted and ready to land.Jun 28 2021, 1:38 PM