Capsicum vs the Pathnames, a PoC
Needs Review · Public

Authored by trasz on Mar 15 2024, 12:48 PM.

Details

Reviewers

brooks
val_packett.cool
jonathan

Group Reviewers

capsicum

Summary

This is a proof of concept implementation of some changes to how Capsicum
handles path names. It's in some ways similar to D38351 by Val Packett,
but implemented quite differently. The primary motivation is to make it possible
to execute binaries in capability mode from the start, without having to trust them.

The way this works now is that absolute path lookups are prohibited,
and relative lookups are only allowed with an explicitly provided directory
descriptor.

The way it works with the patch is that both are allowed, but only
if the process - or its ancestor - called fchdir(2) and fchroot(2)
to set the descriptors the (newly allowed) lookups are relative to.
Calling cap_enter(2) clears both descriptors again.

There is a (pretty terrible, and obviously temporary) hack
to the chroot(8) utility to run binaries in capability mode "by hand":

$ chroot -Cdn 5 /bin/sh 5< /

Regarding the Capsicum security model, I believe the lookup change doesn't change it.
The directory descriptors for lookups still need to be provided by the process,
like before; it's just that now it can ask the kernel to use them for absolute
and relative lookups instead of having to explicitly pass them to APIs like openat(2).

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 56625
Build 53513: arc lint + arc unit

Event Timeline

trasz created this revision. Mar 15 2024, 12:48 PM

Herald added a reviewer: brooks. Mar 15 2024, 12:48 PM

Herald added subscribers: olce, glebius, jonathan, imp.

trasz requested review of this revision. Mar 15 2024, 12:48 PM

Harbormaster completed remote builds in B56625: Diff 135783. Mar 15 2024, 12:48 PM

trasz added a reviewer: capsicum. Mar 15 2024, 12:48 PM

trasz edited the summary of this revision. Mar 15 2024, 1:50 PM

trasz added reviewers: val_packett.cool, jonathan.

I worry somewhat about interactions with dlopen, which was previously disabled in capability mode by virtue of breaking open(2). It's true that fdlopen existed, but that's a somewhat different beast and I suspect its users are more likely to be audited.

I kind of want to disallow fchroot in capability mode and have a cap_enter2 that takes a root fd and a flags argument that includes flags to disable this functionality, but that also feels like it adds complexity.

sys/kern/kern_mib.c
102	Not obviously related to the rest of this patch. Seems generally fine though.
sys/kern/syscalls.master
146	At first glance, I find myself wanting a separate flag from `SYF_CAPENABLED` so we can potentially deny these syscalls in syscallenter if both curdir and root are ecapmodevp. I'm not sure this is actually a good idea, but it's easier to make the annotations in syscalls.master different now.
160	Leakage from D44372?
usr.bin/procstat/procstat_files.c
148	Old binaries will still use this so might as well keep it until we're ready to completely remove API support.

trasz mentioned this in D44372: Allow subset of wait4(2) functionality in Capsicum mode. Mar 21 2024, 12:44 PM

emaste added a subscriber: emaste. Mar 21 2024, 2:02 PM

Can you describe the dlopen threat model a bit more? My assumption is, a typical Capsicum-aware app wouldn't be setting the rootdir/curdir at all. Or, if it does, it could call cap_enter(2) again before calling dlopen(3), clearing those vnodes.

Also, do you think it makes sense to split off fchroot(2) and get that bit committed first?

sys/kern/kern_mib.c
102	Yeah, bmake refuses to work without it. I suppose it should be fixed in bmake and not here though; this often contains personal information (builder's hostname and login).
sys/kern/syscalls.master
146	I was thinking about something similar - a separate flag to be used by an explicit sysctl to switch back to the old semantics in case of a security bug. It did make the patch quite a bit larger though.
160	Yup.
usr.bin/procstat/procstat_files.c
148	Makes sense.

In D44373#1013871, @trasz wrote:

Can you describe the dlopen threat model a bit more? My assumption is, a typical Capsicum-aware app wouldn't be setting the rootdir/curdir at all. Or, if it does, it could call cap_enter(2) again before calling dlopen(3), clearing those vnodes.

My worry is dlopen calls the developer is unaware of (or not thinking about) suddenly working. For example, things that look like iconv or nss that didn't use to work and now could be coerced to work. I'm not sure how serious an issue this is.

Also, do you think it makes sense to split off fchroot(2) and get that bit committed first?

It probably does make sense to commit separately. This review is pretty large.

sys/kern/kern_mib.c
102	Hmm, could make it a SYSCTL_PROC and output something more reserved in capability mode?

trasz edited the summary of this revision. Mon, May 13, 10:07 AM

@trasz : thanks for sending this review request. My general feeling is that I'm leery of relaxing the in-kernel security model, not just because of the potential for opening things we don't mean to open, but also because it complicates the model for those who are trying to understand it. "No global namespaces", while limiting, is a clearer rule than "no global namespaces unless you or your ancestor has previously called fchroot(2), unless-unless something has also called cap_enter(2) again to clear that magic vnode".

I wonder if, instead of changing the in-kernel model, this might be better addressed through interposition, either using LD_PRELOAD-ed wrappers that convert open(2) to openat(2) (relative to a pre-set "root" FD) or within libc itself?

In fact, maybe I should connect you with a student of mine who is playing with LD_PRELOAD-ed wrappers in order to run unmodified installer scripts like RVM and (hopefully soon) rustup...

In D44373#1032959, @jonathan wrote:

I wonder if, instead of changing the in-kernel model, this might be better addressed through interposition, either using LD_PRELOAD-ed wrappers that convert open(2) to openat(2) (relative to a pre-set "root" FD) or within libc itself?

This whole thing originates with me being too frustrated with how janky and non-robust LD_PRELOAD-based hackery was feeling! :) The proposal I submitted for discussion was a kernel-based very literal equivalent to libpreopen, which due to being in the kernel could just do the substitution in the *one* central place in the codebase where all FS lookups go through, instead of having to hook *every* entry point that eventually ends up doing an FS lookup in the kernel, which is just not viable.

But sure, doing this(*) in the actual libc itself, while still a lot more tedious than in the kernel for the aforementioned "one central place" advantage the kernel has, definitely has the potential to be robust! Normal operation, no "special" interposition at runtime + as the libc is part of the base system, making sure this thing works with new syscall additions and changes would be part of normal maintenance.

(*) about "this" though i.e. what do we even want: I've been admiring the direction WASI is going towards with WebAssembly Component Model… Where e.g. "the file system" is a defined interface which can have multiple implementations, and when launching a process you could compose the environment in whatever arrangement you want: plug the "real file system" component into the "main program", or instead plug a "virtual in-memory file system" or a "remote file system" one, or one that combines many of those…

*and*, at the same time, I've been really longing for some way of constructing a virtual filesystem tree for a sandboxed process subtree (jailed or this-kind-of-thing-substituted-capsicumized) that would NOT involve "persistent" (/ outwardly visible) state like tmpfs's with nullfs mounts that are present in the global VFS namespace and visible in the mount output and so on. Call me shallow and superficial but I really hate all that stuff "sticking out" like that, it offends me aesthetically xD Linux filesystem namespaces at least let you "tuck it all away" and (importantly!) tie it directly to the lifecycle of that process subtree so you won't ever end up with "leftovers", but this Component Model style composability feels far superior honestly.

I'm not sure how to implement that well in the down-to-earth Unix/C world though, with file descriptors being single numbers handled purely by the kernel right now, and all that… I guess nsswitch is a precedent for pluggable stuff in the libc, but that's an easy case as it doesn't have anything like object handles. One idea for handling VFS—rather round-trippy in the single process case but would allow for incredible flexibility in terms of structuring how the environment ends up at runtime—is to make a sort of "VFS-in-userspace" system. Imagine this: a syscall to create a producer-consumer pair of capability descriptors. The consumer can be duplicated, inherited, passed over sockets, used with *at operations by itself, inserted into this fchroot(); and all the operations on it (and on virtual descriptors created by it) become requests to the producer side, which that side must handle via kqueue or something.

In D44373#1032959, @jonathan wrote:

@trasz : thanks for sending this review request. My general feeling is that I'm leery of relaxing the in-kernel security model, not just because of the potential for opening things we don't mean to open, but also because it complicates the model for those who are trying to understand it. "No global namespaces", while limiting, is a clearer rule than "no global namespaces unless you or your ancestor has previously called fchroot(2), unless-unless something has also called cap_enter(2) again to clear that magic vnode".

Ah, but my whole point here is (that I believe) it _doesn't_ change the security model :)

Perhaps we understand the term "global namespace" differently. To me, this doesn't do anything with a global namespace - it's about the kernel doing the mapping instead of libc, like Val described. One difference is that this mapping is inherited from the parent; with mapping in userspace you'd inherit them as ordinary file descriptors. You're not supposed to stash the system's or a jail's actual root file descriptor there; I imagine that typically it would either be a premade, read-only system image, or something synthetic.

Or perhaps it's my explanation above, which describes the implementation rather than the way to use it. For the users, the mental model would be "instead of explicitly passing a file descriptor to openat(2) every time, you can pre-set it using fchroot(2) and fchdir(2)".

I wonder if, instead of changing the in-kernel model, this might be better addressed through interposition, either using LD_PRELOAD-ed wrappers that convert open(2) to openat(2) (relative to a pre-set "root" FD) or within libc itself?

I can see a few problems there. First, without inheriting cwd and rootfd (of some kind) from the parent you can't have something that resembles a Unix shell. Second, when starting a new process you need to somehow find rtld, then the shared libraries. Sure, it can be done, but with the above in the kernel you don't need to. And finally, you have static binaries and weird runtimes, like golang.

In fact, maybe I should connect you with a student of mine who is playing with LD_PRELOAD-ed wrappers in order to run unmodified installer scripts like RVM and (hopefully soon) rustup...

Yes please :)