Details

Reviewers

dchagin
markj

Commits

rGec3864cc6861: fusefs: sanitize FUSE_READLINK results for embedded NULs
rG0022bd418e6c: fusefs: sanitize FUSE_READLINK results for embedded NULs
rG8fca98f6881f: fusefs: sanitize FUSE_READLINK results for embedded NULs
rG662ec2f78152: fusefs: sanitize FUSE_READLINK results for embedded NULs

Summary

This can happen if VOP_READLINK returns such a path. So far I've seen
it in fusefs, but it can probably happen in other file systems, too.

PR: 274268
MFC after: 1 week
Sponsored by: Axcient

Test Plan

Test case added to the fusefs test suite, though the bug isn't fusefs specific.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

asomers created this revision.Oct 4 2023, 6:54 PM

Herald added a subscriber: imp. · View Herald TranscriptOct 4 2023, 6:54 PM

asomers requested review of this revision.Oct 4 2023, 6:54 PM

Harbormaster completed remote builds in B53837: Diff 128257.Oct 4 2023, 6:54 PM

sorry mate, I'm no longer involved here, you have to prod someone else

Restrict the new fusefs warning to INVARIANTS

Harbormaster completed remote builds in B53838: Diff 128258.Oct 4 2023, 7:01 PM

Sorry, @mjg . Good luck with your next project.

asomers edited reviewers, added: dchagin; removed: mjg.Oct 4 2023, 7:04 PM

by "here" I meant the vfs layer ;)

i don't think the patch is legitimate in that length indicated by VOP_READLINK does not line up with the real length -- it is that discrepancy which needs to be avoided instead. i would make it an invariant that VOP_READLINK returns correct size or an error like in the case above. then a debug_pre func for VOP_READLINK could validate it. i also note the proposed patch pessimizes all lookups (not only those which encounter a symlink) by adding a func call and no longer avoiding an extra branch per character (currently achieved by slapping / at the end)

with this i'm buggering off the area

perhaps @kib and/or @markj should be added as reviewers

mjg added subscribers: kib, markj.Oct 5 2023, 10:43 AM

mjg removed a subscriber: mjg.

emaste added a subscriber: emaste.Oct 5 2023, 1:10 PM

In D42081#959985, @mjg wrote:

i don't think the patch is legitimate in that length indicated by VOP_READLINK does not line up with the real length -- it is that discrepancy which needs to be avoided instead. i would make it an invariant that VOP_READLINK returns correct size or an error like in the case above. then a debug_pre func for VOP_READLINK could validate it. i also note the proposed patch pessimizes all lookups (not only those which encounter a symlink) by adding a func call and no longer avoiding an extra branch per character (currently achieved by slapping / at the end)

I thought of that. But I think it's better this way for two reasons:

Programmer efficiency. Enforcing the invariant in each file system would require similar code in about a dozen different places, instead of one.
CPU efficiency. There's no way to enforce this invariant in most file systems without doing a full scan of the link value. But vfs_lookup has to do a full scan anyway, so we might as well piggy back on this scan.

I can of course inline strchrnul if you think it would be worthwhile.

In D42081#960092, @asomers wrote:

I thought of that. But I think it's better this way for two reasons:

Programmer efficiency. Enforcing the invariant in each file system would require similar code in about a dozen different places, instead of one.

Why do other filesystems need to be modified if we check the invariant in a vop_readlink_post()?

sys/kern/vfs_lookup.c
1092 ↗	(On Diff #128258)	Before, we were checking an invariant ("only the last byte of the buffer is a nul char"). VOP_READLINK can violate this invariant, so the patch removes it for all consumers. That doesn't feel right.

In D42081#960121, @markj wrote:

Why do other filesystems need to be modified if we check the invariant in a vop_readlink_post()?

Because I didn't know about the vop _post hooks ;). But now that I look at it, it isn't possible. vop_readlink_post only has access to the struct uio _after_ the copy operation. So it doesn't know where the path actually starts, and it can't check it. Can you think of any other common point in the call stack?

In D42081#960179, @asomers wrote:

In D42081#960121, @markj wrote:

Why do other filesystems need to be modified if we check the invariant in a vop_readlink_post()?

Because I didn't know about the vop _post hooks ;). But now that I look at it, it isn't possible. vop_readlink_post only has access to the struct uio _after_ the copy operation. So it doesn't know where the path actually starts, and it can't check it. Can you think of any other common point in the call stack?

You could add something to namei_follow_link(), where VOP_READLINK is actually called during name resolution. But I also want to ask, why is not sufficient to modify fuse alone? How would other filesystems allow this situation to occur, barring corruption of on-disk data?

You could add something to namei_follow_link(), where VOP_READLINK is actually called during name resolution. But I also want to ask, why is not sufficient to modify fuse alone? How would other filesystems allow this situation to occur, barring corruption of on-disk data?

Exactly, corruption of on-disk data. From inspection, I think that ext2fs is vulnerable. I haven't checked others. But I think we should be resilient against corrupt disks.

In D42081#960550, @asomers wrote:

You could add something to namei_follow_link(), where VOP_READLINK is actually called during name resolution. But I also want to ask, why is not sufficient to modify fuse alone? How would other filesystems allow this situation to occur, barring corruption of on-disk data?

Exactly, corruption of on-disk data. From inspection, I think that ext2fs is vulnerable. I haven't checked others. But I think we should be resilient against corrupt disks.

But your approach to resiliency should depend on the filesystem. FUSE explicitly needs to guard against this particular possibility, so it should check for nul bytes in symlink targets. OpenZFS probably does not, assuming that it always writes valid symlinks to disk and can catch corruption using checksums. UFS and ext2 provide no mechanism to detect data corruption, so we cannot be resilient to corruption in any general sense. Certainly they could explicitly check for this case and panic, but then we should presumably also make sure that directory entries don't contain nul bytes (ext2_readdir() and ufs_readdir() don't appear to verify this), and the rabbit hole doesn't end there.

In D42081#960578, @markj wrote:

In D42081#960550, @asomers wrote:

You could add something to namei_follow_link(), where VOP_READLINK is actually called during name resolution. But I also want to ask, why is not sufficient to modify fuse alone? How would other filesystems allow this situation to occur, barring corruption of on-disk data?

Exactly, corruption of on-disk data. From inspection, I think that ext2fs is vulnerable. I haven't checked others. But I think we should be resilient against corrupt disks.

But your approach to resiliency should depend on the filesystem. FUSE explicitly needs to guard against this particular possibility, so it should check for nul bytes in symlink targets. OpenZFS probably does not, assuming that it always writes valid symlinks to disk and can catch corruption using checksums. UFS and ext2 provide no mechanism to detect data corruption, so we cannot be resilient to corruption in any general sense. Certainly they could explicitly check for this case and panic, but then we should presumably also make sure that directory entries don't contain nul bytes (ext2_readdir() and ufs_readdir() don't appear to verify this), and the rabbit hole doesn't end there.

ZFS checksums only guard against hardware-caused corruption. Corruption caused by software bugs is not prevented. For example, I've dealt with a software bug that caused ZFS to record compressed records but indicate in the block pointer that they were uncompressed. If any of those records were for symlink targets (does ZFS ever allocate a whole record for a really long symlink?) then it would've contained NULs.

In D42081#960588, @asomers wrote:

In D42081#960578, @markj wrote:

In D42081#960550, @asomers wrote:

You could add something to namei_follow_link(), where VOP_READLINK is actually called during name resolution. But I also want to ask, why is not sufficient to modify fuse alone? How would other filesystems allow this situation to occur, barring corruption of on-disk data?

Exactly, corruption of on-disk data. From inspection, I think that ext2fs is vulnerable. I haven't checked others. But I think we should be resilient against corrupt disks.

But your approach to resiliency should depend on the filesystem. FUSE explicitly needs to guard against this particular possibility, so it should check for nul bytes in symlink targets. OpenZFS probably does not, assuming that it always writes valid symlinks to disk and can catch corruption using checksums. UFS and ext2 provide no mechanism to detect data corruption, so we cannot be resilient to corruption in any general sense. Certainly they could explicitly check for this case and panic, but then we should presumably also make sure that directory entries don't contain nul bytes (ext2_readdir() and ufs_readdir() don't appear to verify this), and the rabbit hole doesn't end there.

ZFS checksums only guard against hardware-caused corruption. Corruption caused by software bugs is not prevented. For example, I've dealt with a software bug that caused ZFS to record compressed records but indicate in the block pointer that they were uncompressed. If any of those records were for symlink targets (does ZFS ever allocate a whole record for a really long symlink?) then it would've contained NULs.

The way we find such bugs is with assertions. This diff removes an assertion which could have caught the bug you described.

In any case, I'm opposed to the vfs_lookup.c portion of the diff. If we have a symlink pointing to foo/bar and some kind of corruption turns that into foo\0bar, an INVARIANTS kernel would now treat that as "foo" when before it would have failed closed.

FUSE is special here since its VOP_READLINK returns a target provided by an untrusted source. FUSE should be verifying that its inputs to the VFS don't violate VFS invariants. Other filesystems could likely do more to avoid trusting on-disk data more than is strictly necessary without overly hurting performance, but that seems like a separate concern.

Move the nul check into namei_follow_link

Harbormaster completed remote builds in B53882: Diff 128373.Oct 6 2023, 4:22 PM

@markj I've moved the check into namei_follow_link . Do you like this version better? I could also have it print some kind of warning to dmesg, subject to INVARIANTS, if you would prefer. Or I could remove it entirely and restore the panic, if you prefer. The problem with doing the check in namei_follow_link is that it causes an extra scan through the string, rather than piggy backing on the scan that already takes place in vfs_lookup.

BTW is there a check in fusefs that lookup never returns vp == dvp (of non-dot cnp, but this should never reach VOP)? IMO it is even more fundamental invariant.

In D42081#960674, @kib wrote:

BTW is there a check in fusefs that lookup never returns vp == dvp (of non-dot cnp, but this should never reach VOP)? IMO it is even more fundamental invariant.

Very close. There is such a check at the interface to the fuse server. So it checks that FUSE_LOOKUP doesn't return the same inode number as the parent directory. And similar for FUSE_CREATE, FUSE_MKDIR, etc.

In D42081#960671, @asomers wrote:

@markj I've moved the check into namei_follow_link . Do you like this version better?

Sorry, but I still don't like it. :)

fuse_vnop_readlink() should be sanitizing its inputs to the VFS rather than expecting the VFS to handle this situation in some arbitrary way. With this patch, we are now hiding what could be bugs in all VOP_READLINK implementations. If it were ZFS returning corrupted symlink targets, then I'd want the system to panic (at least on INVARIANTS systems). If this situation can legitimately arise in FUSE filesystems, then fuse_vnop_readlink() needs to decide how to handle it; in this case it seems like you want to simply return the string up to the first nul terminator.

In D42081#960676, @markj wrote:

Sorry, but I still don't like it. :)

fuse_vnop_readlink() should be sanitizing its inputs to the VFS rather than expecting the VFS to handle this situation in some arbitrary way. With this patch, we are now hiding what could be bugs in all VOP_READLINK implementations. If it were ZFS returning corrupted symlink targets, then I'd want the system to panic (at least on INVARIANTS systems). If this situation can legitimately arise in FUSE filesystems, then fuse_vnop_readlink() needs to decide how to handle it; in this case it seems like you want to simply return the string up to the first nul terminator.

Well, it can't _legitimately_ occur in fusefs. The best behavior I think would be to return EIO if a fuse server does this. The only reason I chose to swallow the NUL in vfs_lookup was because it was easier to do that than to return an error. What I really want is for the system to not panic, at least on non-debug kernels, due to a naughty fuse server or corrupted disk or something. Do you really prefer a panic over returning an error and printing something to dmesg?

In D42081#960683, @asomers wrote:

In D42081#960676, @markj wrote:

Sorry, but I still don't like it. :)

fuse_vnop_readlink() should be sanitizing its inputs to the VFS rather than expecting the VFS to handle this situation in some arbitrary way. With this patch, we are now hiding what could be bugs in all VOP_READLINK implementations. If it were ZFS returning corrupted symlink targets, then I'd want the system to panic (at least on INVARIANTS systems). If this situation can legitimately arise in FUSE filesystems, then fuse_vnop_readlink() needs to decide how to handle it; in this case it seems like you want to simply return the string up to the first nul terminator.

Well, it can't _legitimately_ occur in fusefs. The best behavior I think would be to return EIO if a fuse server does this. The only reason I chose to swallow the NUL in vfs_lookup was because it was easier to do that than to return an error.

In the patch you're changing the VFS to treat the first nul as the end of the symlink value. You can change fuse_vnop_readlink() to do that instead.

What I really want is for the system to not panic, at least on non-debug kernels, due to a naughty fuse server or corrupted disk or something. Do you really prefer a panic over returning an error and printing something to dmesg?

In an INVARIANTS kernel, of course I prefer to panic. The panic you reported in PR 274268 is from a KASSERT, so I'm not sure what you mean about non-debug kernels. As far as I can see, an embedded nul won't trigger a panic directly in such kernels, but it's hard to reason about what will happen with or without your patch. In any case, I think FUSE needs to handle the possibility of a naughty fuse server without weakening VFS invariants.