Page MenuHomeFreeBSD

unix: Restrict dirfds exchanged between jails with a different root
Needs ReviewPublic

Authored by markj on Fri, May 16, 1:40 AM.
Tags
None
Referenced Files
F117680070: D50371.diff
Wed, May 21, 4:13 PM
Unknown Object (File)
Mon, May 19, 6:54 PM
Unknown Object (File)
Sat, May 17, 2:47 PM
Unknown Object (File)
Sat, May 17, 1:40 PM
Unknown Object (File)
Sat, May 17, 1:12 AM
Unknown Object (File)
Fri, May 16, 4:41 PM
Unknown Object (File)
Fri, May 16, 1:15 PM
Unknown Object (File)
Fri, May 16, 11:31 AM

Details

Reviewers
kib
Summary

Just a patch to motivate discussion.

Sibling jails which can exchange fds over a unix socket can escape the
jail chroot by passing a directory fd that points to a vnode outside of
the receiver's root. Try to restrict this by treating dirfds as
capabilities when exchanged between two jails with different root
directories.

First, add a new per-fd flag which causes name lookups relative to that
fd to behave as though O_RESOLVE_BENEATH was specified. When
transferring fds, set that flag in the case described above. This
requires modifying some fget* interfaces to return the fd flags.

Second, disallow fchdir (and fchroot) to a directory received in the
case described above. I think this is maybe too heavy-handed, but I
don't have a better solution at the moment.

PR: 262179

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 64223
Build 61107: arc lint + arc unit

Event Timeline

markj requested review of this revision.Fri, May 16, 1:40 AM

IMO the right way is much simpler, and was formulated in the PR: only allow to pass dirfd across the same jail, or from any jail to prison0. I think your addition of allowing to pass dirfd between different jails but having same rootvp is neat.

sys/kern/kern_descrip.c
3184

So, when openat(O_RESOLVE_BENEATH) is done, with the patch applied, O_RESOLVE_BENEATH becomes sticky for all further lookups made with the dirfd returned from that first openat().

Am I miss something?

IMO the right way is much simpler, and was formulated in the PR: only allow to pass dirfd across the same jail, or from any jail to prison0.

It might be a way out but will prevent sending FDs across sibling or parent/child jails. Not sure about the general usefulness of this, but I can imagine some scenarios.

The "right way" seems more to check whether we are traversing the FD's prison root during lookup, in addition to the calling process' prison root (and equally for the chroot directory). After some preliminary research, I would imagine something like this: As in this proposal, we just forbid chdir(), but instead of forcing UF_RBENEATH we modify namei_setup() to record in struct nameidata the root and top directories from ni_dirfd (or perhaps a whole copy/new reference on the struct pwd or even struct filedesc, if feasible and deemed more appropriate), and we then have to check for that in vfs_lookup(). The corresponding bits in the VFS cache have to be changed too (maybe by adding some field to struct cache_fpl).

I think the change here is good enough as a stopgap (modulo the problem reported by kib@), but would prefer the above proposal.

sys/kern/kern_descrip.c
3184

Am I miss something?

I don't think so.

The conflation and systematic passing of all flags down to files increases the risk of such mistakes. Pending a big sweep, the problem you point out could be resolved by having a new, internal O_* flag.

markj added a subscriber: jamie.

IMO the right way is much simpler, and was formulated in the PR: only allow to pass dirfd across the same jail, or from any jail to prison0.

It might be a way out but will prevent sending FDs across sibling or parent/child jails. Not sure about the general usefulness of this, but I can imagine some scenarios.

Directory fds, yes. This was my reasoning--I'm worried that entirely disallowing passing dirfds between jails will break some applications. Maybe @jamie has some insight there?

Yes we can add a chicken switch sysctl to control it, but I'd prefer not to if possible.

The "right way" seems more to check whether we are traversing the FD's prison root during lookup, in addition to the calling process' prison root (and equally for the chroot directory). After some preliminary research, I would imagine something like this: As in this proposal, we just forbid chdir(), but instead of forcing UF_RBENEATH we modify namei_setup() to record in struct nameidata the root and top directories from ni_dirfd (or perhaps a whole copy/new reference on the struct pwd or even struct filedesc, if feasible and deemed more appropriate), and we then have to check for that in vfs_lookup(). The corresponding bits in the VFS cache have to be changed too (maybe by adding some field to struct cache_fpl).

As I understand it, that idea is similar to this patch, except that a passed dirfd is not a capability for the referenced directory, but rather is a capability for the root of the jail where the fd was created?

I think the change here is good enough as a stopgap (modulo the problem reported by kib@), but would prefer the above proposal.

One other problem with this change is that it assumes CAPABILITIES is configured; otherwise it's not possible to revoke CAP_FCHDIR.

sys/kern/kern_descrip.c
3184

We do not have so many flag bits left available, so I didn't want to allocate a new one.

One hacky solution to the problem is to observe that there are only two callers to _finstall() and unp_externalize(), so we can unconditionally mask off O_RESOLVE_BENEATH in the former.

sys/sys/filedesc.h
96

This was left over from an experiment.

markj marked an inline comment as done.
  • Remove a modification to struct pwd.
  • Only make O_RESOLVE_BENEATH sticky for fds installed by unp_externalize().
sys/kern/kern_descrip.c
3184

I meant, the callers are finstall_refed() and unp_externalize(), and I modified the former.

The "right way" seems more to check whether we are traversing the FD's prison root during lookup, in addition to the calling process' prison root (and equally for the chroot directory). After some preliminary research, I would imagine something like this: As in this proposal, we just forbid chdir(), but instead of forcing UF_RBENEATH we modify namei_setup() to record in struct nameidata the root and top directories from ni_dirfd (or perhaps a whole copy/new reference on the struct pwd or even struct filedesc, if feasible and deemed more appropriate), and we then have to check for that in vfs_lookup(). The corresponding bits in the VFS cache have to be changed too (maybe by adding some field to struct cache_fpl).

As I understand it, that idea is similar to this patch, except that a passed dirfd is not a capability for the referenced directory, but rather is a capability for the root of the jail where the fd was created?

I'm not sure why you're talking about capabilities here, maybe I'm missing something but my proposal doesn't involve them. I'd like to make use of the transported FD's file's credentials to limit lookups from that FD.

The idea is different in that I'm not trying to restrict the use of .. with the passed directory FD, which could be legitimate. If there is footshooting because the passed FD is outside the receiving jail, then so be it, that's part of the "shitty virtualization" and probably is useful.

AIUI, the bug in PR 262179 is in fact about a passed FD being used in another jail, and from that FD other files than the jail's ones can be accessed. I agree that's a problem, because then the sending jail in reality passes more rights than it has itself. To prevent this, I propose using f_cred of the received file when it is the origin of a lookup via namei(), in particular to ensure a lookup does not go outside the sender jail's root.

I now see that there may be a security problem with both approaches. One such FD has been transferred, and is used to lookup another directory (underneath it let's say, which can happen with both approaches), does the user can bypass again the sending jail's root? In your approach, that depends on whether O_RESOLVE_BENEATH is sticky for FDs/files obtained for successive lookups, which I don't think it is, and if I'm true could be in fact considered a bug of the O_RESOLVE_BENEATH implementation. In my alternative, the problem is that new files obtained after lookup have their credentials set to the calling thread's. That should be changed to that of the file serving as the lookup starting point (else that defeats relying on the file's credentials). I've looked it up a bit, and a priori it seems doing that would not cause other problems, but more audit is warranted.

One other problem with this change is that it assumes CAPABILITIES is configured; otherwise it's not possible to revoke CAP_FCHDIR.

In my alternative, we could modify sys_fchdir() to just check whether the candidate FD's credentials are the same than that of the caller (or apply the more relaxed check of restrict_caps()) rather than using capabilities.

(snip) could be in fact considered a bug of the O_RESOLVE_BENEATH implementation (snip)

I was wrong here, O_RESOLVE_BENEATH as it stands (without a sticky property) is useful for the case of processing an untrusted path, ensuring that the result stays below the origin, and there's nothing wrong with that.

In D50371#1149495, @kib wrote:

IMO the right way is much simpler, and was formulated in the PR: only allow to pass dirfd across the same jail, or from any jail to prison0. I think your addition of allowing to pass dirfd between different jails but having same rootvp is neat.

I would widen this a little, and allow passing dirfds up and down the jail parent/child chain. That fits with the current ".."-handling strategy of preventing traversal up from the current jail or any ancestor.

The "right way" seems more to check whether we are traversing the FD's prison root during lookup, in addition to the calling process' prison root (and equally for the chroot directory). After some preliminary research, I would imagine something like this: As in this proposal, we just forbid chdir(), but instead of forcing UF_RBENEATH we modify namei_setup() to record in struct nameidata the root and top directories from ni_dirfd (or perhaps a whole copy/new reference on the struct pwd or even struct filedesc, if feasible and deemed more appropriate), and we then have to check for that in vfs_lookup(). The corresponding bits in the VFS cache have to be changed too (maybe by adding some field to struct cache_fpl).

As I understand it, that idea is similar to this patch, except that a passed dirfd is not a capability for the referenced directory, but rather is a capability for the root of the jail where the fd was created?

I'm not sure why you're talking about capabilities here, maybe I'm missing something but my proposal doesn't involve them.

Both solutions involve constraining the use of received file descriptors such that they don't obey standard unix interfaces. In particular, they are constrained with respect to path lookup, in a way that makes them behave like (capsicum) capabilities. That's all I meant.

I'd like to make use of the transported FD's file's credentials to limit lookups from that FD.

The idea is different in that I'm not trying to restrict the use of .. with the passed directory FD, which could be legitimate. If there is footshooting because the passed FD is outside the receiving jail, then so be it, that's part of the "shitty virtualization" and probably is useful.

AIUI, the bug in PR 262179 is in fact about a passed FD being used in another jail, and from that FD other files than the jail's ones can be accessed. I agree that's a problem, because then the sending jail in reality passes more rights than it has itself. To prevent this, I propose using f_cred of the received file when it is the origin of a lookup via namei(), in particular to ensure a lookup does not go outside the sender jail's root.

I now see that there may be a security problem with both approaches. One such FD has been transferred, and is used to lookup another directory (underneath it let's say, which can happen with both approaches), does the user can bypass again the sending jail's root? In your approach, that depends on whether O_RESOLVE_BENEATH is sticky for FDs/files obtained for successive lookups, which I don't think it is, and if I'm true could be in fact considered a bug of the O_RESOLVE_BENEATH implementation.

You're right, this is another bug. It looks like dup() should already behave as desired for both approaches.

In my alternative, the problem is that new files obtained after lookup have their credentials set to the calling thread's. That should be changed to that of the file serving as the lookup starting point (else that defeats relying on the file's credentials). I've looked it up a bit, and a priori it seems doing that would not cause other problems, but more audit is warranted.

One other problem with this change is that it assumes CAPABILITIES is configured; otherwise it's not possible to revoke CAP_FCHDIR.

In my alternative, we could modify sys_fchdir() to just check whether the candidate FD's credentials are the same than that of the caller (or apply the more relaxed check of restrict_caps()) rather than using capabilities.

I think I'm a bit uncomfortable changing or extending the semantics of f_cred.

In D50371#1149495, @kib wrote:

IMO the right way is much simpler, and was formulated in the PR: only allow to pass dirfd across the same jail, or from any jail to prison0. I think your addition of allowing to pass dirfd between different jails but having same rootvp is neat.

I would widen this a little, and allow passing dirfds up and down the jail parent/child chain. That fits with the current ".."-handling strategy of preventing traversal up from the current jail or any ancestor.

In PR 262179, an example is given of a jail configuration where two jails have a parent/child relationship, but their root directories do not: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262179#c17

I think that case breaks this proposal. Is it a bug that jail(8) permits this at all?

I would widen this a little, and allow passing dirfds up and down the jail parent/child chain. That fits with the current ".."-handling strategy of preventing traversal up from the current jail or any ancestor.

On further reading, never mind: I just saw crest's comment on the bug about parent/child jails without parent/child directory relationships. It could well be that the only safe relationships is with one end of the conversation being prison0 (or at least a true-rooted prison).

Ensure that O_RESOLVE_BENEATH is inherited.

So overall I do not like this, sorry. It makes very delicate semantic change for what is basically an assisted self-induced jail problem. The O_RESOLVE_BENEATH is huge complication to the lookup process already to abuse it even more.
I would very much prefer the straight fix of not allowing to externalize dirfd if it comes from a different jail (with nuances about prison0 or equal roots).

In D50371#1149975, @kib wrote:

So overall I do not like this, sorry. It makes very delicate semantic change for what is basically an assisted self-induced jail problem. The O_RESOLVE_BENEATH is huge complication to the lookup process already to abuse it even more.

There is no semantic change, I am just adding a "sticky" mode for O_RESOLVE_BENEATH, i.e., extending the existing semantics. It gives capsicum's semantics for a single directory file descriptor. I suspect it could be useful elsewhere.

I would very much prefer the straight fix of not allowing to externalize dirfd if it comes from a different jail (with nuances about prison0 or equal roots).

Ok. How should unp_externalize() handle this case though? Just insert -1 into the cmsg buffer?

In D50371#1149975, @kib wrote:

So overall I do not like this, sorry. It makes very delicate semantic change for what is basically an assisted self-induced jail problem. The O_RESOLVE_BENEATH is huge complication to the lookup process already to abuse it even more.

There is no semantic change, I am just adding a "sticky" mode for O_RESOLVE_BENEATH, i.e., extending the existing semantics. It gives capsicum's semantics for a single directory file descriptor. I suspect it could be useful elsewhere.

Well, I think if you interpret the patch this way, then you need to go with the full file flag route. First add the flag to automatically add O_R_B when fd is used as dirfd, then add a sticky variant of this flag so it cannot be cleared with fcntl(2).

Might be, then the introduction of the approach would be more natural.

I would very much prefer the straight fix of not allowing to externalize dirfd if it comes from a different jail (with nuances about prison0 or equal roots).

Ok. How should unp_externalize() handle this case though? Just insert -1 into the cmsg buffer?

Either that, or skip the element, reducing the length of the message.

In D50371#1150088, @kib wrote:

Well, I think if you interpret the patch this way, then you need to go with the full file flag route. First add the flag to automatically add O_R_B when fd is used as dirfd, then add a sticky variant of this flag so it cannot be cleared with fcntl(2).
Might be, then the introduction of the approach would be more natural.

Yes, the sticky flag must not be clearable by users in any way.

I would very much prefer the straight fix of not allowing to externalize dirfd if it comes from a different jail (with nuances about prison0 or equal roots).

The nuances seem necessary as, without them, we'll simply be prohibiting sending file descriptors across jails. With them, these scenarios become possible but only for "top-level" jails (those that are direct children of prison0), which is a kind of surprising limitation (from the outside, why should we treat "sub-jails" differently than top-level ones).

So I think that we should rather adopt the approach here at a minimum, and ideally (perhaps later) the one I proposed above which a priori does not seem too complex to implement (but is more complex than this one).

I would widen this a little, and allow passing dirfds up and down the jail parent/child chain. That fits with the current ".."-handling strategy of preventing traversal up from the current jail or any ancestor.

In PR 262179, an example is given of a jail configuration where two jails have a parent/child relationship, but their root directories do not: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262179#c17

I think that case breaks this proposal. Is it a bug that jail(8) permits this at all?

On further reading, never mind: I just saw crest's comment on the bug about parent/child jails without parent/child directory relationships. It could well be that the only safe relationships is with one end of the conversation being prison0 (or at least a true-rooted prison).

@jamie : Was the possibility of having a sub-jail with a root that is not under its parent jail something that was explicitly meant to be allowed from the start? Aren't all the other jail-managed resources not allowed to grow when going down the jail tree (for those where that makes sense)? I agree there is currently no mechanism to actually enforce that some directory stays below another one, but maybe we can search for some practical ones if we consider this to be really relevant. Providing such a guarantee could ease general reasoning about whether some paths are contained in others. But that doesn't seem to make a big difference here: With such a property, we could allow simply passing any descriptor from some sub-jail to a parent jail, but it would still be too weak by itself to allow a parent jail to unconditionally send a descriptor to its children, or have two siblings exchanging one. If you have some pointers or remembrances about restrictions on the root of sub-jails, I would be interested.

@jamie : Was the possibility of having a sub-jail with a root that is not under its parent jail something that was explicitly meant to be allowed from the start? Aren't all the other jail-managed resources not allowed to grow when going down the jail tree (for those where that makes sense)? I agree there is currently no mechanism to actually enforce that some directory stays below another one, but maybe we can search for some practical ones if we consider this to be really relevant. Providing such a guarantee could ease general reasoning about whether some paths are contained in others. But that doesn't seem to make a big difference here: With such a property, we could allow simply passing any descriptor from some sub-jail to a parent jail, but it would still be too weak by itself to allow a parent jail to unconditionally send a descriptor to its children, or have two siblings exchanging one. If you have some pointers or remembrances about restrictions on the root of sub-jails, I would be interested.

It's something I never really considered, but given the unenforceability that you mention, I probably would have not done anything if I did think of it.

In D50371#1150088, @kib wrote:
In D50371#1149975, @kib wrote:

So overall I do not like this, sorry. It makes very delicate semantic change for what is basically an assisted self-induced jail problem. The O_RESOLVE_BENEATH is huge complication to the lookup process already to abuse it even more.

There is no semantic change, I am just adding a "sticky" mode for O_RESOLVE_BENEATH, i.e., extending the existing semantics. It gives capsicum's semantics for a single directory file descriptor. I suspect it could be useful elsewhere.

Well, I think if you interpret the patch this way, then you need to go with the full file flag route. First add the flag to automatically add O_R_B when fd is used as dirfd, then add a sticky variant of this flag so it cannot be cleared with fcntl(2).

Well, to be clear, this sticky flag is an fd flag, not a file flag. In the current patch, fcntl() cannot be used to clear it (it ignores everything except UF_EXCLOSE). So it is already sticky. Do you mean that I should introduce a new open(2) flag which causes the returned fd to have UF_BENEATH set? Or maybe we should just add a new fcntl() flag to start? That is, add FD_BENEATH and let it be set by fcntl(F_SETFD, FD_BENEATH)? Perhaps both FD_BENEATH and FD_BENEATH_STICKY are needed?

I am a bit wary of introducing new O_* flags, since there are not so many free bits left.

Might be, then the introduction of the approach would be more natural.

I would very much prefer the straight fix of not allowing to externalize dirfd if it comes from a different jail (with nuances about prison0 or equal roots).

Ok. How should unp_externalize() handle this case though? Just insert -1 into the cmsg buffer?

Either that, or skip the element, reducing the length of the message.

In D50371#1150088, @kib wrote:
In D50371#1149975, @kib wrote:

So overall I do not like this, sorry. It makes very delicate semantic change for what is basically an assisted self-induced jail problem. The O_RESOLVE_BENEATH is huge complication to the lookup process already to abuse it even more.

There is no semantic change, I am just adding a "sticky" mode for O_RESOLVE_BENEATH, i.e., extending the existing semantics. It gives capsicum's semantics for a single directory file descriptor. I suspect it could be useful elsewhere.

Well, I think if you interpret the patch this way, then you need to go with the full file flag route. First add the flag to automatically add O_R_B when fd is used as dirfd, then add a sticky variant of this flag so it cannot be cleared with fcntl(2).

Well, to be clear, this sticky flag is an fd flag, not a file flag. In the current patch, fcntl() cannot be used to clear it (it ignores everything except UF_EXCLOSE). So it is already sticky. Do you mean that I should introduce a new open(2) flag which causes the returned fd to have UF_BENEATH set? Or maybe we should just add a new fcntl() flag to start? That is, add FD_BENEATH and let it be set by fcntl(F_SETFD, FD_BENEATH)? Perhaps both FD_BENEATH and FD_BENEATH_STICKY are needed?

I am a bit wary of introducing new O_* flags, since there are not so many free bits left.

Right, I fully agree with this proposal. It is not necessary to provide the sticky O_R_B variant for open(), at least for now.

Either that, or skip the element, reducing the length of the message.

Corrupting the SCM_RIGHTS message would potentially create new bugs. If the recvmsg() call is modified the whole call should fail with an error (EPERM) and free the reference to the attached file descriptors.