Page MenuHomeFreeBSD

unionfs: simplify writecount management
ClosedPublic

Authored by jah on Dec 21 2021, 11:52 PM.

Details

Summary

Use atomics to track the writecount granted to the underlying FS,
and avoid holding the vnode interlock while calling the underling FS'
VOP_ADD_WRITECOUNT(). This also fixes a WITNESS warning about nesting
the same lock type. Also add comments explaining why we need to track
the writecount on the unionfs vnode in the first place.

unionfs: implement VOP_SET_TEXT/VOP_UNSET_TEXT

The implementation simply passes the text ref to the appropriate
underlying vnode. Without this, the default [un]set_text
implementation will only manage the text ref on the unionfs vnode,
causing it to be out of sync with the underlying filesystems and
potentially allowing corruption of executable file contents.
On INVARIANTS kernels, it also readily produces a panic on process
termination because the VM object representing the executable mapping
is backed by the underlying vnode, not the unionfs vnode.

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

jah requested review of this revision.Dec 21 2021, 11:52 PM

From what I can tell, the main reason that managing v_writecount on the unionfs vnode might be useful would be in handling vflush(WRITECLOSE).
But WRITECLOSE only seems to be used by a few filesystems in support of read-only remounts, and unionfs isn't one of them. It's entirely possible I've missed something here, but if I can get away with it this seems like a useful simplification.

@pho I'll post a separate fix for the panic exposed by unionfs7.sh, which just amounts to unionfs_open() not correctly accounting for the vnode being locked shared on the loader path.

Did you tried similar change for nullfs?

Basically text ref count needs to exist at least on the vnode which owns the actual v_object.

In D33611#759420, @kib wrote:

Did you tried similar change for nullfs?

Basically text ref count needs to exist at least on the vnode which owns the actual v_object.

I haven't tried any nullfs changes, but the thought did occur to me since the unionfs code I changed was obviously copied from nullfs.
Would you like me to change nullfs as well (in a separate commit of course)?

From my reading of the code, unionfs and nullfs vnodes always inherit v_object from the underlying filesystem, which usually creates it through vnode_create_vobject() in its open or fhtovp handler.
The presence of a non-NULL v_object for the unionfs/nullfs vnode additionally prevents vnode_pager_alloc (from mmap(2) etc.) from trying to create a new object. So it looks as though the VM object will always be backed by the underlying vnode and not the unionfs/nullfs vnode.

In D33611#759589, @jah wrote:
In D33611#759420, @kib wrote:

Did you tried similar change for nullfs?

Basically text ref count needs to exist at least on the vnode which owns the actual v_object.

I haven't tried any nullfs changes, but the thought did occur to me since the unionfs code I changed was obviously copied from nullfs.
Would you like me to change nullfs as well (in a separate commit of course)?

Yes, and in fact I think that nullfs should be done first.

From my reading of the code, unionfs and nullfs vnodes always inherit v_object from the underlying filesystem, which usually creates it through vnode_create_vobject() in its open or fhtovp handler.
The presence of a non-NULL v_object for the unionfs/nullfs vnode additionally prevents vnode_pager_alloc (from mmap(2) etc.) from trying to create a new object. So it looks as though the VM object will always be backed by the underlying vnode and not the unionfs/nullfs vnode.

The mmaping of nullfs vnode cannot work unless lowervp->v_object == vp->v_object. So this is indeed true, and it is very important invariant. The vm_mmap_vnode() helper does the following:

		if (obj->type == OBJT_VNODE && obj->handle != vp) {
			vput(vp);
			vp = (struct vnode *)obj->handle;

to get to the lowervp.

In D33611#759601, @kib wrote:
In D33611#759589, @jah wrote:
In D33611#759420, @kib wrote:

Did you tried similar change for nullfs?

Basically text ref count needs to exist at least on the vnode which owns the actual v_object.

I haven't tried any nullfs changes, but the thought did occur to me since the unionfs code I changed was obviously copied from nullfs.
Would you like me to change nullfs as well (in a separate commit of course)?

Yes, and in fact I think that nullfs should be done first.

Ok, will do. On a related note, I did test nullfs enough to know that it doesn't have the same panic-on-process-termination issue as unionfs. So for nullfs the text ref seems to be managed on the correct vnode even though nullfs also does not implement VOP_[UN]SET_TEXT.
I assume this is because of null_bypass? If so would it be sufficient to simply delete null_add_writecount?

From my reading of the code, unionfs and nullfs vnodes always inherit v_object from the underlying filesystem, which usually creates it through vnode_create_vobject() in its open or fhtovp handler.
The presence of a non-NULL v_object for the unionfs/nullfs vnode additionally prevents vnode_pager_alloc (from mmap(2) etc.) from trying to create a new object. So it looks as though the VM object will always be backed by the underlying vnode and not the unionfs/nullfs vnode.

The mmaping of nullfs vnode cannot work unless lowervp->v_object == vp->v_object. So this is indeed true, and it is very important invariant. The vm_mmap_vnode() helper does the following:

		if (obj->type == OBJT_VNODE && obj->handle != vp) {
			vput(vp);
			vp = (struct vnode *)obj->handle;

to get to the lowervp.

In D33611#759655, @jah wrote:

Ok, will do. On a related note, I did test nullfs enough to know that it doesn't have the same panic-on-process-termination issue as unionfs. So for nullfs the text ref seems to be managed on the correct vnode even though nullfs also does not implement VOP_[UN]SET_TEXT.
I assume this is because of null_bypass? If so would it be sufficient to simply delete null_add_writecount?

Yes, null_bypass results in text references go onto the lower vnode.

I mostly agree with the note that maintaining writecount on the upper vnode is effectively nop. So I am interested in see if nullfs can work without it, and indeed for nullfs the corresponding change would be to remove the custom bypass.

In D33611#759740, @kib wrote:
In D33611#759655, @jah wrote:

Ok, will do. On a related note, I did test nullfs enough to know that it doesn't have the same panic-on-process-termination issue as unionfs. So for nullfs the text ref seems to be managed on the correct vnode even though nullfs also does not implement VOP_[UN]SET_TEXT.
I assume this is because of null_bypass? If so would it be sufficient to simply delete null_add_writecount?

Yes, null_bypass results in text references go onto the lower vnode.

I mostly agree with the note that maintaining writecount on the upper vnode is effectively nop. So I am interested in see if nullfs can work without it, and indeed for nullfs the corresponding change would be to remove the custom bypass.

Ah, I see now why we can't completely leave write ref managment to the lower FS. There's a comment to the effect in null_reclaim(): if the nullfs/unionfs is forcibly unmounted, we can end up leaking write refs on the lower vnode as prior unionfs/nullfs write refs will now be undone through the deadfs vnops.

There may still be some room for simplification in both unionfs and nullfs, but not as much as I'd hoped.

Restore writecount tracking on unionfs vnode, in simplified form

jah retitled this revision from unionfs: leave writecount management to underlying filesystem to unionfs: simplify writecount management.Dec 25 2021, 8:28 PM
jah edited the summary of this revision. (Show Details)
sys/fs/unionfs/union_vnops.c
2531–2532

Hm, I probably miss something. Isn't goal of unionfs is to always write to the upper? If yes, how could it be that we put a write reference on lowervp?

sys/fs/unionfs/union_subr.c
486 ↗(On Diff #100567)

This code, and the equivalent nullfs code, makes me a little bit uncomfortable. We leave the refcount in place on the unionfs/nullfs vnode because we assume the reference owner(s) will still undo the reference(s), and that the deadfs vnops will undo those references in a way that's compatible with our reference tracking to avoid hitting the assert in freevnode(). Based on the testing I've done, that does seem to always be the case but it strikes me as fragile.

What if we instead do this:
--Add no-op implementations of add_writecount() and set_text() to deadfs, to match the no-op implementation of unset_text() that's already there.
--Change vgonel() to clear v_writecount, but only for the case in which the doomed vnode is active. Otherwise we still want the assert in freevnode() to catch filesystems that leak references during normal operation.

Would this approach be a useful improvement?

sys/fs/unionfs/union_vnops.c
2531–2532

I noticed the same thing when I wrote the patch, but decided against changing the logic in case there was something I missed. As far as I can tell, and from the testing I've done so far, write references should always be taken on the upper. Shadow directory creation and file copying generally happen during lookup/open, before the write reference is taken.

On second thought though, I think I really do want to change this to always use the upper and assert that it's non-NULL. The reason is that if I've missed something, and there is a situation in which we can take a write ref on the lower, and the code path that holds the ref does a unionfs copy-on-write operation, then the write ref will be undone on the wrong vnode. It seems better to catch this case immediately than to wait for some later call to vop_stdadd_writecount() to panic.

sys/fs/unionfs/union_subr.c
486 ↗(On Diff #100567)

I do not quite understand why active/inactive makes a difference. Either you clear all write/text references on reclaim, or you let them drain in a normal way when vnode is closed. It cannot be mixed.

If you mean that active reclaim means forced unmount, this is not quite true for some time. For instance we have a debug sysctl to force reclaim of the specific vnode.

I remember I tried to clean everything on reclaim in some intermediate versions of the write/text ref merge, but it did not worked well (I may misremember).

sys/fs/unionfs/union_vnops.c
2531–2532

Yes, I agree. Unionfs should only put write reference on the upper vnode.

BTW, isn't a situation possible where lower vnode is executed, so gained text references, and then unionfs still must allow writes by CoW? Or rather, it should allow writes if execution is done through lower vnode, but sould not if unionfs vnode was executed. For me, this sounds as a need to maintain text refs on unionfs vnode.

sys/fs/unionfs/union_subr.c
486 ↗(On Diff #100567)

My thinking was the vgonel() would clear v_writecount at the same time it resets the vnode's dispatch table to dead_vnodeops. Since I would also change deadfs to provide a no-op implementation of vop_add_writecount(), subsequent draining would not matter.

active/inactive would only make a difference for debugging. In the normal (inactive) case, I wouldn't want to reset v_writecount, because a non-zero writecount would indicate a buggy filesystem, and I would want the assert in freevnode() (or some equivalent check) to catch it.

sys/fs/unionfs/union_vnops.c
2531–2532

If a text ref is taken through the unionfs vnode at a time when only the lower vnode is available, won't the executable mapping then always be backed by the lower vnode? If so, then wouldn't it be fine to still allow copy-on-write? Any modified file contents would belong to the newly-created upper vnode.

sys/fs/unionfs/union_subr.c
486 ↗(On Diff #100567)

My opinion is to leave this as is, until we see real problems with the code.

sys/fs/unionfs/union_vnops.c
2531–2532

I mean that this is not logical, although possibly acceptable. You executed a, then attempts to write to a should result in ETXTBUSY. Unionfs behavior is to allow write due to CoW.

jah marked an inline comment as done.Dec 28 2021, 11:35 PM
jah added inline comments.
sys/fs/unionfs/union_subr.c
486 ↗(On Diff #100567)

I'm fine with that.

sys/fs/unionfs/union_vnops.c
2531–2532

By my reading of do_execve(), the vnode should be locked shared across the setting of the text ref and establishment of the backing object for the executable image. Unionfs copy-on-write paths should execute with the vnode locked exclusive (we assert this in unionfs_node_update()). Therefore it seems that vnode locking should prevent any case in which the text ref would be set on the lower vnode but the executable mapping would be backed by the upper vnode.

Given that, I see this added bit of flexibility on the part of unionfs as being more feature than bug. Since tracking text refs on the unionfs vnode would add complexity, I'd prefer to avoid it unless/until lack of it causes a problem.

sys/fs/unionfs/union_vnops.c
2531–2532

Ok. But do you still want to change the code to avoid passing writecount VOPs to lowervp?

jah marked an inline comment as done.

Only allow write refs on upper vnode

sys/fs/unionfs/union_subr.c
489 ↗(On Diff #100687)

Still, you do add to lowervp writecount (or substract from it, if you prefer this term).

sys/fs/unionfs/union_subr.c
489 ↗(On Diff #100687)

Oops, sorry. I was distracted by a few other things and completely forgot to change that part of the code.

Also simplify writecount management in unionfs_noderem()

kib added inline comments.
sys/fs/unionfs/union_subr.c
480 ↗(On Diff #100744)

I suggest to say 'text ref count' instead of 'negative write count', it explains why do we assert this.

This revision is now accepted and ready to land.Dec 30 2021, 4:05 AM
This revision now requires review to proceed.Dec 30 2021, 5:40 AM
This revision is now accepted and ready to land.Dec 30 2021, 12:58 PM