Implement O_BENEATH flag for openat(2).
Needs ReviewPublic

Authored by jonathan on Jun 14 2015, 4:41 AM.

Details

Reviewers
pjd
rwatson
sson
Summary

Capsicum restricts capability-mode and capability-relative path lookups
to be "strictly relative": we do not permit absolute paths or ".." in
path resolution for these cases. This functionality could be useful in
non-Capsicum applications as well, and it has been proposed for inclusion
in Linux as a step along the way to Capsicum in Linux.

This commit would add the O_BENEATH flag for use in openat(2) to enable
"strict relative" lookups with unrestricted file descriptors (i.e., not
Capsicum capabilities) outside of capability mode. The only difference
from the Capsicum behaviour is that O_BENEATH would cause errno to be
EPERM rather than ECAPMODE or ENOTCAPABLE. If O_BENEATH is used together
with capability mode or a directory capability, EPERM is returned, since
the usermode application has explicitly requested the new behaviour.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage
jonathan retitled this revision from to Implement O_BENEATH flag for openat(2)..Jun 14 2015, 4:41 AM
jonathan updated this object.
jonathan edited the test plan for this revision. (Show Details)
jonathan added reviewers: rwatson, pjd.

I like the idea of this change.

I'd rename ni_strictrelative to nd_strictrelativeerror or such, since it's now an errno.

Does this change have the correct effect when combined with fchdir() and fchroot()? I.e., can I fchdir() to an O_BENEATH descriptor and then perform a lookup without beneath properties?

sys/kern/vfs_lookup.c
1170

I might move this comment up to just above the function, and provide a little more explanation as to why this is needed, not just what it does.

1187

Possibly this should be a panic(), as should this occur at runtime, we want to fail closed, not open? Also, do include the proposed error number in the panic message.

Thanks for the feedback!

On the ni_strictrelative renaming: how about ni_nonrelativeerrno?

On your composition question:
The proposed mechanism would only affect the current system call: there is no change to the behaviour of the returned descriptor. If the dirfd was a capability then we still apply the usual capability restrictions to the opened fd. If the dirfd was not a capability, then we return a normal full-fledged file descriptor.

In your specific example, let's say we have a regular (non-capability) directory descriptor top and that we are not in capability mode. In that case, we could do:

int subdir = openat(top, "deep/subdir/path", O_DIRECTORY | O_BENEATH);
// subdir is now an unrestricted dirfd

int config = openat(subdir, "../config/my.conf", O_RDONLY);

fchdir(subdir);
int config2 = open("../config/my.conf", O_RDONLY);
int config3 = openat(FD_ATCWD, "../config/my.conf", O_RDONLY);

after which config, config2 and config3 should all be valid file descriptors for the configuration file.

If we are in capability mode or operating relative to a capability then the usual Capsicum rules apply.

sys/kern/vfs_lookup.c
1170

I can certainly do this. I was wondering, however, if this function would be considered useful or overly heavyweight: the call could be replaced by a conditional assignment in most places (and an unconditional assignment in one). I like the clarity of an explicit function call, but it can't really be inlined (unless I move the implementation to a header file, which I also don't like).

jonathan updated this revision to Diff 6191.Jun 14 2015, 4:25 PM

Additional commit:

  • Rename nameidata::ni_strictrelative to ni_nonrelativeerrno.

Sounds reasonable with regard to fchdir() and friends -- and, indeed -- right now, CAPMODE forces O_BENEATH, but it's true we don't have a variant on O_BENEATH that forces it to be used for open()'d directories via other open()'d directories with O_BENEATH set. I think this is fine. There's a slight confused deputy problem if directories are passed from CAPMODE processes to non-CAPMODE processes anyway -- and we don't make it worse here.

Is there a way we can add David Drysdale as a reviewer for this patch?

sys/kern/vfs_lookup.c
1170

I would be OK with moving to various in situ -- especially of the form "... CAPMODE ? ENOTCAPABLE : EPERM".

emaste added a subscriber: emaste.Jun 15 2015, 1:58 AM

Is there a way we can add David Drysdale as a reviewer for this patch?

Yes, David will need to check out https://wiki.freebsd.org/CodeReview which explains the procedure for getting an account. Short version: sign up at https://reviews.freebsd.org/auth/register/

emaste added inline comments.Jun 15 2015, 2:38 PM
sys/kern/vfs_syscalls.c
1000–1002

Extra braces

sys/sys/namei.h
172–173

This just seems to describe ndrequire_strict_relative_lookups's implementation, and I'm not sure it would help my understanding of this functionality without the full context of this change.

We should also add a test case -- @drysdale_google.com, is there a possibility the O_BENEATH tests you have could be made available under a BSD license?

@emaste, I updated the capsicum-test framework to include openat tests as well as capability-mode/capability-dfd tests, so that covers everything I've got.

sys/kern/vfs_lookup.c
1173

This seems superfluous with the second KASSERT below

sys/kern/vfs_syscalls.c
983

(Random aside: does FreeBSD not attempt to police unrecognized flags? In Linux, syscalls are (now) recommended to reject unknown flags with EINVAL.)

sys/sys/namei.h
153

I couldn't see where this is used - is it needed?

Thanks for these comments! I should be updating (and shrinking) the diff soon... just need a little more testing first.

sys/kern/vfs_lookup.c
1173

This function will go away anyhow based the above discussion with Robert.

sys/kern/vfs_syscalls.c
983

This strikes me as a fundamentally good idea, but others may disagree for compatibility reasons... maybe worth talking about for 11, but probably not in this hopefully-MFC'able change.

1000–1002

Ok, I will change this conform to style(9), but I officially protest that I do so under duress. :)

From now on, I will refer to this paragraph of style(9) as the "goto fail" paragraph.

sys/sys/namei.h
153

You are quite right: that was only needed in my very first patch version. I'll remove it.

172–173

I'm going to change this by removing the function and just setting the fields in the appropriate places. The logic will be unconditional in one place and if (foo != EPERM) in two, so it's probably not worth the cost of this only-inlineable-with-LTO function call.

@emaste, I updated the capsicum-test framework to include openat tests as well as capability-mode/capability-dfd tests, so that covers everything I've got.

Sounds good. I was thinking of the standalone O_BENEATH test posted to LKML and wasn't aware that equivalent test coverage now exists in the capsicum-test project - thanks.

jonathan updated this revision to Diff 6214.Jun 15 2015, 5:18 PM

Remove ndrequire_strict_relative_lookups().

jonathan updated this revision to Diff 6215.Jun 15 2015, 5:24 PM
  • Increase style(9) compliance.
  • Remove unused namei() flag.

Ok, so I believe that I have addressed all of the comments above... please let me know if I've missed anything! If I have indeed addressed everything, it might be appropriate to hit the LGTM button now. :)

Getting close, but a couple more comments :-).

sys/kern/vfs_lookup.c
658–660

Hmm. If we're limited by O_BENEATH, should we be using a different KTR trace function here?

sys/kern/vfs_syscalls.c
983

I think validating flags is a good idea -- but agree that it might be done on a different MFC schedule and hence should be a separate change.

1018–1020

Hmm. I think failing closed here is reasonable for capability mode -- but I'm not sure if that's true for ambient O_BENEATH. I believe td_dupfd is set for /dev/fd/[012] and /dev/{stderr,stdin,stdout}, which might reasonable be expected to work outside of capability mode. It also looks like it's used in another edge case or two (sysvr4 compatibility!?).

jonathan marked 12 inline comments as done.Jun 17 2015, 7:10 PM
jonathan added inline comments.
sys/kern/vfs_lookup.c
658–660

How would you feel about using KTR_NAMEI instead? We could use KTR_CAPFAIL for Capsicum-caused failures and KTR_NAMEI for O_BENEATH-induced failures, but that's adding a lot of complexity and it's not entirely clear to me how we should handle the "both" case...

sys/kern/vfs_syscalls.c
1018–1020

Since we've said that O_BENEATHsupercedes the capability rules (in the sense that we want the O_BENEATH errno whether we're in or out of capability mode), we can't know at this point whether the original file was a capability: we just see EPERM.

If this code is related to fdopen(), I don't anticipate O_BENEATH being used much...

rwatson added inline comments.Jul 1 2015, 2:10 AM
sys/kern/vfs_lookup.c
658–660

Hurum. Agreed it would make things more complex, but it probably is the right thing. How about we simply make the decision based on whether it's ENOTCAPABLE or EPERM?

sys/kern/vfs_syscalls.c
1018–1020

Maybe make this an XXXCAP or XXXJA or such, but otherwise, yes, I agree. We'll see if anyone ever runs into it!

jonathan marked an inline comment as done.Jul 2 2015, 10:39 PM
jonathan added inline comments.
sys/kern/vfs_lookup.c
658–660

Ok, I can try that and see how it looks.

sys/kern/vfs_syscalls.c
1018–1020

Adding more XXX here sounds like a fine thing to do. :)

jonathan updated this revision to Diff 6847.Jul 10 2015, 7:16 PM
  • Let ktrace output depend on nonrelativeerrno.
jonathan updated this revision to Diff 6848.Jul 10 2015, 7:39 PM
  • Implement O_BENEATH flag for openat(2).
  • Rename nameidata::ni_strictrelative to ni_nonrelativeerrno.
  • Remove ndrequire_strict_relative_lookups().
  • Increase style(9) compliance.
  • Remove unused namei() flag.
jonathan added a subscriber: mjg.Jul 10 2015, 7:57 PM

Sorry for all of the recent noise: it seems that Arcanist requires extra care when doing distributed development!

To catch everybody up: I implemented @rwatson's ktrace change suggestion, but now it's been rolled back by the Arcanist fixups. Before I re-implement it, I'm going to merge in @mjg's namei refactoring work (which hit HEAD yesterday).

mjg added inline comments.Jul 10 2015, 8:29 PM
sys/kern/vfs_lookup.c
211

The caller is free to specify any error they want, yet this checks specifically for EPERM. I would say that's inconsequential for no good reason.

Instead, you can if (ndp->ni_nonrelativeerrno == 0) ndp->ni_nonrelativeerrno = ENOTCAPABLE;

while a no-op change with current consumers, may come in handy.

jonathan updated this revision to Diff 7188.Jul 22 2015, 6:40 PM
  • Let ktrace output depend on nonrelativeerrno.
jonathan marked 4 inline comments as done.Jul 22 2015, 6:46 PM
jonathan added inline comments.
sys/kern/vfs_lookup.c
211

We actually want to honour EPERM above ENOTCAPABLE, since the user application code can always set the O_BENEATH flag and might expect EPERM when there is a failure. In all other cases, we preserve the existing ABI by returning ENOTCAPABLE.

658–661

Ok, @rwatson, how does this ktrace update look?

jonathan updated this revision to Diff 7189.Jul 22 2015, 6:48 PM
  • Add an XXXJA to the fdopen() with O_BENEATH case.
jonathan marked 3 inline comments as done.Jul 22 2015, 6:49 PM

Marking request for XXXJA in the fdopen()+O_BENEATH case as done.

mjg added inline comments.Aug 6 2015, 11:14 PM
sys/kern/vfs_lookup.c
211

But proposed change would honour EPERM just fine. Further, when one day other caller will want to return EACCESS, it will just work, while with current patch whatever it sets will be converted to ENOTCAPABLE. That's only a nit though.

jonathan added inline comments.Aug 7 2015, 12:26 PM
sys/kern/vfs_lookup.c
211

That's a fair point. Another development (not visible here on the issue tracker) is that we're considering just creating a brand new errno for this kind of failure. That would provide more information to the caller about the true nature of the error. It would also allow us to stay compatible between the FreeBSD and Linux implementations even though Linux doesn't have ENOTCAPABLE, and we could #define this value to ENOTCAPABLE on FreeBSD 10 to maintain ABI compatibility.

jonathan updated this revision to Diff 18847.Jul 28 2016, 2:22 PM
  • MFC r303406
  • Treat EPERM slightly less specially.
jonathan marked 6 inline comments as done.Jul 28 2016, 2:26 PM

Address @mjg 's point about EPERM.