Page MenuHomeFreeBSD

unix: Restrict dirfds exchanged between jails with a different root
ClosedPublic

Authored by markj on May 16 2025, 1:40 AM.
Tags
None
Referenced Files
F121246202: D50371.diff
Tue, Jun 24, 9:40 PM
Unknown Object (File)
Sun, Jun 22, 12:11 PM
Unknown Object (File)
Sun, Jun 22, 2:58 AM
Unknown Object (File)
Sat, Jun 21, 6:34 AM
Unknown Object (File)
Fri, Jun 20, 6:31 AM
Unknown Object (File)
Thu, Jun 19, 6:54 PM
Unknown Object (File)
Thu, Jun 19, 10:11 AM
Unknown Object (File)
Wed, Jun 18, 9:04 AM

Details

Summary

Just a patch to motivate discussion.

Sibling jails which can exchange fds over a unix socket can escape the
jail chroot by passing a directory fd that points to a vnode outside of
the receiver's root. Try to restrict this by treating dirfds as
capabilities when exchanged between two jails with different root
directories.

First, add a new per-fd flag which causes name lookups relative to that
fd to behave as though O_RESOLVE_BENEATH was specified. When
transferring fds, set that flag in the case described above. This
requires modifying some fget* interfaces to return the fd flags.

Second, disallow fchdir (and fchroot) to a directory received in the
case described above. I think this is maybe too heavy-handed, but I
don't have a better solution at the moment.

PR: 262179

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

markj requested review of this revision.May 16 2025, 1:40 AM

IMO the right way is much simpler, and was formulated in the PR: only allow to pass dirfd across the same jail, or from any jail to prison0. I think your addition of allowing to pass dirfd between different jails but having same rootvp is neat.

sys/kern/kern_descrip.c
3200

So, when openat(O_RESOLVE_BENEATH) is done, with the patch applied, O_RESOLVE_BENEATH becomes sticky for all further lookups made with the dirfd returned from that first openat().

Am I miss something?

IMO the right way is much simpler, and was formulated in the PR: only allow to pass dirfd across the same jail, or from any jail to prison0.

It might be a way out but will prevent sending FDs across sibling or parent/child jails. Not sure about the general usefulness of this, but I can imagine some scenarios.

The "right way" seems more to check whether we are traversing the FD's prison root during lookup, in addition to the calling process' prison root (and equally for the chroot directory). After some preliminary research, I would imagine something like this: As in this proposal, we just forbid chdir(), but instead of forcing UF_RBENEATH we modify namei_setup() to record in struct nameidata the root and top directories from ni_dirfd (or perhaps a whole copy/new reference on the struct pwd or even struct filedesc, if feasible and deemed more appropriate), and we then have to check for that in vfs_lookup(). The corresponding bits in the VFS cache have to be changed too (maybe by adding some field to struct cache_fpl).

I think the change here is good enough as a stopgap (modulo the problem reported by kib@), but would prefer the above proposal.

sys/kern/kern_descrip.c
3200

Am I miss something?

I don't think so.

The conflation and systematic passing of all flags down to files increases the risk of such mistakes. Pending a big sweep, the problem you point out could be resolved by having a new, internal O_* flag.

markj added a subscriber: jamie.

IMO the right way is much simpler, and was formulated in the PR: only allow to pass dirfd across the same jail, or from any jail to prison0.

It might be a way out but will prevent sending FDs across sibling or parent/child jails. Not sure about the general usefulness of this, but I can imagine some scenarios.

Directory fds, yes. This was my reasoning--I'm worried that entirely disallowing passing dirfds between jails will break some applications. Maybe @jamie has some insight there?

Yes we can add a chicken switch sysctl to control it, but I'd prefer not to if possible.

The "right way" seems more to check whether we are traversing the FD's prison root during lookup, in addition to the calling process' prison root (and equally for the chroot directory). After some preliminary research, I would imagine something like this: As in this proposal, we just forbid chdir(), but instead of forcing UF_RBENEATH we modify namei_setup() to record in struct nameidata the root and top directories from ni_dirfd (or perhaps a whole copy/new reference on the struct pwd or even struct filedesc, if feasible and deemed more appropriate), and we then have to check for that in vfs_lookup(). The corresponding bits in the VFS cache have to be changed too (maybe by adding some field to struct cache_fpl).

As I understand it, that idea is similar to this patch, except that a passed dirfd is not a capability for the referenced directory, but rather is a capability for the root of the jail where the fd was created?

I think the change here is good enough as a stopgap (modulo the problem reported by kib@), but would prefer the above proposal.

One other problem with this change is that it assumes CAPABILITIES is configured; otherwise it's not possible to revoke CAP_FCHDIR.

sys/kern/kern_descrip.c
3200

We do not have so many flag bits left available, so I didn't want to allocate a new one.

One hacky solution to the problem is to observe that there are only two callers to _finstall() and unp_externalize(), so we can unconditionally mask off O_RESOLVE_BENEATH in the former.

sys/sys/filedesc.h
96

This was left over from an experiment.

markj marked an inline comment as done.
  • Remove a modification to struct pwd.
  • Only make O_RESOLVE_BENEATH sticky for fds installed by unp_externalize().
sys/kern/kern_descrip.c
3200

I meant, the callers are finstall_refed() and unp_externalize(), and I modified the former.

The "right way" seems more to check whether we are traversing the FD's prison root during lookup, in addition to the calling process' prison root (and equally for the chroot directory). After some preliminary research, I would imagine something like this: As in this proposal, we just forbid chdir(), but instead of forcing UF_RBENEATH we modify namei_setup() to record in struct nameidata the root and top directories from ni_dirfd (or perhaps a whole copy/new reference on the struct pwd or even struct filedesc, if feasible and deemed more appropriate), and we then have to check for that in vfs_lookup(). The corresponding bits in the VFS cache have to be changed too (maybe by adding some field to struct cache_fpl).

As I understand it, that idea is similar to this patch, except that a passed dirfd is not a capability for the referenced directory, but rather is a capability for the root of the jail where the fd was created?

I'm not sure why you're talking about capabilities here, maybe I'm missing something but my proposal doesn't involve them. I'd like to make use of the transported FD's file's credentials to limit lookups from that FD.

The idea is different in that I'm not trying to restrict the use of .. with the passed directory FD, which could be legitimate. If there is footshooting because the passed FD is outside the receiving jail, then so be it, that's part of the "shitty virtualization" and probably is useful.

AIUI, the bug in PR 262179 is in fact about a passed FD being used in another jail, and from that FD other files than the jail's ones can be accessed. I agree that's a problem, because then the sending jail in reality passes more rights than it has itself. To prevent this, I propose using f_cred of the received file when it is the origin of a lookup via namei(), in particular to ensure a lookup does not go outside the sender jail's root.

I now see that there may be a security problem with both approaches. One such FD has been transferred, and is used to lookup another directory (underneath it let's say, which can happen with both approaches), does the user can bypass again the sending jail's root? In your approach, that depends on whether O_RESOLVE_BENEATH is sticky for FDs/files obtained for successive lookups, which I don't think it is, and if I'm true could be in fact considered a bug of the O_RESOLVE_BENEATH implementation. In my alternative, the problem is that new files obtained after lookup have their credentials set to the calling thread's. That should be changed to that of the file serving as the lookup starting point (else that defeats relying on the file's credentials). I've looked it up a bit, and a priori it seems doing that would not cause other problems, but more audit is warranted.

One other problem with this change is that it assumes CAPABILITIES is configured; otherwise it's not possible to revoke CAP_FCHDIR.

In my alternative, we could modify sys_fchdir() to just check whether the candidate FD's credentials are the same than that of the caller (or apply the more relaxed check of restrict_caps()) rather than using capabilities.

(snip) could be in fact considered a bug of the O_RESOLVE_BENEATH implementation (snip)

I was wrong here, O_RESOLVE_BENEATH as it stands (without a sticky property) is useful for the case of processing an untrusted path, ensuring that the result stays below the origin, and there's nothing wrong with that.

In D50371#1149495, @kib wrote:

IMO the right way is much simpler, and was formulated in the PR: only allow to pass dirfd across the same jail, or from any jail to prison0. I think your addition of allowing to pass dirfd between different jails but having same rootvp is neat.

I would widen this a little, and allow passing dirfds up and down the jail parent/child chain. That fits with the current ".."-handling strategy of preventing traversal up from the current jail or any ancestor.

The "right way" seems more to check whether we are traversing the FD's prison root during lookup, in addition to the calling process' prison root (and equally for the chroot directory). After some preliminary research, I would imagine something like this: As in this proposal, we just forbid chdir(), but instead of forcing UF_RBENEATH we modify namei_setup() to record in struct nameidata the root and top directories from ni_dirfd (or perhaps a whole copy/new reference on the struct pwd or even struct filedesc, if feasible and deemed more appropriate), and we then have to check for that in vfs_lookup(). The corresponding bits in the VFS cache have to be changed too (maybe by adding some field to struct cache_fpl).

As I understand it, that idea is similar to this patch, except that a passed dirfd is not a capability for the referenced directory, but rather is a capability for the root of the jail where the fd was created?

I'm not sure why you're talking about capabilities here, maybe I'm missing something but my proposal doesn't involve them.

Both solutions involve constraining the use of received file descriptors such that they don't obey standard unix interfaces. In particular, they are constrained with respect to path lookup, in a way that makes them behave like (capsicum) capabilities. That's all I meant.

I'd like to make use of the transported FD's file's credentials to limit lookups from that FD.

The idea is different in that I'm not trying to restrict the use of .. with the passed directory FD, which could be legitimate. If there is footshooting because the passed FD is outside the receiving jail, then so be it, that's part of the "shitty virtualization" and probably is useful.

AIUI, the bug in PR 262179 is in fact about a passed FD being used in another jail, and from that FD other files than the jail's ones can be accessed. I agree that's a problem, because then the sending jail in reality passes more rights than it has itself. To prevent this, I propose using f_cred of the received file when it is the origin of a lookup via namei(), in particular to ensure a lookup does not go outside the sender jail's root.

I now see that there may be a security problem with both approaches. One such FD has been transferred, and is used to lookup another directory (underneath it let's say, which can happen with both approaches), does the user can bypass again the sending jail's root? In your approach, that depends on whether O_RESOLVE_BENEATH is sticky for FDs/files obtained for successive lookups, which I don't think it is, and if I'm true could be in fact considered a bug of the O_RESOLVE_BENEATH implementation.

You're right, this is another bug. It looks like dup() should already behave as desired for both approaches.

In my alternative, the problem is that new files obtained after lookup have their credentials set to the calling thread's. That should be changed to that of the file serving as the lookup starting point (else that defeats relying on the file's credentials). I've looked it up a bit, and a priori it seems doing that would not cause other problems, but more audit is warranted.

One other problem with this change is that it assumes CAPABILITIES is configured; otherwise it's not possible to revoke CAP_FCHDIR.

In my alternative, we could modify sys_fchdir() to just check whether the candidate FD's credentials are the same than that of the caller (or apply the more relaxed check of restrict_caps()) rather than using capabilities.

I think I'm a bit uncomfortable changing or extending the semantics of f_cred.

In D50371#1149495, @kib wrote:

IMO the right way is much simpler, and was formulated in the PR: only allow to pass dirfd across the same jail, or from any jail to prison0. I think your addition of allowing to pass dirfd between different jails but having same rootvp is neat.

I would widen this a little, and allow passing dirfds up and down the jail parent/child chain. That fits with the current ".."-handling strategy of preventing traversal up from the current jail or any ancestor.

In PR 262179, an example is given of a jail configuration where two jails have a parent/child relationship, but their root directories do not: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262179#c17

I think that case breaks this proposal. Is it a bug that jail(8) permits this at all?

I would widen this a little, and allow passing dirfds up and down the jail parent/child chain. That fits with the current ".."-handling strategy of preventing traversal up from the current jail or any ancestor.

On further reading, never mind: I just saw crest's comment on the bug about parent/child jails without parent/child directory relationships. It could well be that the only safe relationships is with one end of the conversation being prison0 (or at least a true-rooted prison).

Ensure that O_RESOLVE_BENEATH is inherited.

So overall I do not like this, sorry. It makes very delicate semantic change for what is basically an assisted self-induced jail problem. The O_RESOLVE_BENEATH is huge complication to the lookup process already to abuse it even more.
I would very much prefer the straight fix of not allowing to externalize dirfd if it comes from a different jail (with nuances about prison0 or equal roots).

In D50371#1149975, @kib wrote:

So overall I do not like this, sorry. It makes very delicate semantic change for what is basically an assisted self-induced jail problem. The O_RESOLVE_BENEATH is huge complication to the lookup process already to abuse it even more.

There is no semantic change, I am just adding a "sticky" mode for O_RESOLVE_BENEATH, i.e., extending the existing semantics. It gives capsicum's semantics for a single directory file descriptor. I suspect it could be useful elsewhere.

I would very much prefer the straight fix of not allowing to externalize dirfd if it comes from a different jail (with nuances about prison0 or equal roots).

Ok. How should unp_externalize() handle this case though? Just insert -1 into the cmsg buffer?

In D50371#1149975, @kib wrote:

So overall I do not like this, sorry. It makes very delicate semantic change for what is basically an assisted self-induced jail problem. The O_RESOLVE_BENEATH is huge complication to the lookup process already to abuse it even more.

There is no semantic change, I am just adding a "sticky" mode for O_RESOLVE_BENEATH, i.e., extending the existing semantics. It gives capsicum's semantics for a single directory file descriptor. I suspect it could be useful elsewhere.

Well, I think if you interpret the patch this way, then you need to go with the full file flag route. First add the flag to automatically add O_R_B when fd is used as dirfd, then add a sticky variant of this flag so it cannot be cleared with fcntl(2).

Might be, then the introduction of the approach would be more natural.

I would very much prefer the straight fix of not allowing to externalize dirfd if it comes from a different jail (with nuances about prison0 or equal roots).

Ok. How should unp_externalize() handle this case though? Just insert -1 into the cmsg buffer?

Either that, or skip the element, reducing the length of the message.

In D50371#1150088, @kib wrote:

Well, I think if you interpret the patch this way, then you need to go with the full file flag route. First add the flag to automatically add O_R_B when fd is used as dirfd, then add a sticky variant of this flag so it cannot be cleared with fcntl(2).
Might be, then the introduction of the approach would be more natural.

Yes, the sticky flag must not be clearable by users in any way.

I would very much prefer the straight fix of not allowing to externalize dirfd if it comes from a different jail (with nuances about prison0 or equal roots).

The nuances seem necessary as, without them, we'll simply be prohibiting sending file descriptors across jails. With them, these scenarios become possible but only for "top-level" jails (those that are direct children of prison0), which is a kind of surprising limitation (from the outside, why should we treat "sub-jails" differently than top-level ones).

So I think that we should rather adopt the approach here at a minimum, and ideally (perhaps later) the one I proposed above which a priori does not seem too complex to implement (but is more complex than this one).

I would widen this a little, and allow passing dirfds up and down the jail parent/child chain. That fits with the current ".."-handling strategy of preventing traversal up from the current jail or any ancestor.

In PR 262179, an example is given of a jail configuration where two jails have a parent/child relationship, but their root directories do not: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262179#c17

I think that case breaks this proposal. Is it a bug that jail(8) permits this at all?

On further reading, never mind: I just saw crest's comment on the bug about parent/child jails without parent/child directory relationships. It could well be that the only safe relationships is with one end of the conversation being prison0 (or at least a true-rooted prison).

@jamie : Was the possibility of having a sub-jail with a root that is not under its parent jail something that was explicitly meant to be allowed from the start? Aren't all the other jail-managed resources not allowed to grow when going down the jail tree (for those where that makes sense)? I agree there is currently no mechanism to actually enforce that some directory stays below another one, but maybe we can search for some practical ones if we consider this to be really relevant. Providing such a guarantee could ease general reasoning about whether some paths are contained in others. But that doesn't seem to make a big difference here: With such a property, we could allow simply passing any descriptor from some sub-jail to a parent jail, but it would still be too weak by itself to allow a parent jail to unconditionally send a descriptor to its children, or have two siblings exchanging one. If you have some pointers or remembrances about restrictions on the root of sub-jails, I would be interested.

@jamie : Was the possibility of having a sub-jail with a root that is not under its parent jail something that was explicitly meant to be allowed from the start? Aren't all the other jail-managed resources not allowed to grow when going down the jail tree (for those where that makes sense)? I agree there is currently no mechanism to actually enforce that some directory stays below another one, but maybe we can search for some practical ones if we consider this to be really relevant. Providing such a guarantee could ease general reasoning about whether some paths are contained in others. But that doesn't seem to make a big difference here: With such a property, we could allow simply passing any descriptor from some sub-jail to a parent jail, but it would still be too weak by itself to allow a parent jail to unconditionally send a descriptor to its children, or have two siblings exchanging one. If you have some pointers or remembrances about restrictions on the root of sub-jails, I would be interested.

It's something I never really considered, but given the unenforceability that you mention, I probably would have not done anything if I did think of it.

In D50371#1150088, @kib wrote:
In D50371#1149975, @kib wrote:

So overall I do not like this, sorry. It makes very delicate semantic change for what is basically an assisted self-induced jail problem. The O_RESOLVE_BENEATH is huge complication to the lookup process already to abuse it even more.

There is no semantic change, I am just adding a "sticky" mode for O_RESOLVE_BENEATH, i.e., extending the existing semantics. It gives capsicum's semantics for a single directory file descriptor. I suspect it could be useful elsewhere.

Well, I think if you interpret the patch this way, then you need to go with the full file flag route. First add the flag to automatically add O_R_B when fd is used as dirfd, then add a sticky variant of this flag so it cannot be cleared with fcntl(2).

Well, to be clear, this sticky flag is an fd flag, not a file flag. In the current patch, fcntl() cannot be used to clear it (it ignores everything except UF_EXCLOSE). So it is already sticky. Do you mean that I should introduce a new open(2) flag which causes the returned fd to have UF_BENEATH set? Or maybe we should just add a new fcntl() flag to start? That is, add FD_BENEATH and let it be set by fcntl(F_SETFD, FD_BENEATH)? Perhaps both FD_BENEATH and FD_BENEATH_STICKY are needed?

I am a bit wary of introducing new O_* flags, since there are not so many free bits left.

Might be, then the introduction of the approach would be more natural.

I would very much prefer the straight fix of not allowing to externalize dirfd if it comes from a different jail (with nuances about prison0 or equal roots).

Ok. How should unp_externalize() handle this case though? Just insert -1 into the cmsg buffer?

Either that, or skip the element, reducing the length of the message.

In D50371#1150088, @kib wrote:
In D50371#1149975, @kib wrote:

So overall I do not like this, sorry. It makes very delicate semantic change for what is basically an assisted self-induced jail problem. The O_RESOLVE_BENEATH is huge complication to the lookup process already to abuse it even more.

There is no semantic change, I am just adding a "sticky" mode for O_RESOLVE_BENEATH, i.e., extending the existing semantics. It gives capsicum's semantics for a single directory file descriptor. I suspect it could be useful elsewhere.

Well, I think if you interpret the patch this way, then you need to go with the full file flag route. First add the flag to automatically add O_R_B when fd is used as dirfd, then add a sticky variant of this flag so it cannot be cleared with fcntl(2).

Well, to be clear, this sticky flag is an fd flag, not a file flag. In the current patch, fcntl() cannot be used to clear it (it ignores everything except UF_EXCLOSE). So it is already sticky. Do you mean that I should introduce a new open(2) flag which causes the returned fd to have UF_BENEATH set? Or maybe we should just add a new fcntl() flag to start? That is, add FD_BENEATH and let it be set by fcntl(F_SETFD, FD_BENEATH)? Perhaps both FD_BENEATH and FD_BENEATH_STICKY are needed?

I am a bit wary of introducing new O_* flags, since there are not so many free bits left.

Right, I fully agree with this proposal. It is not necessary to provide the sticky O_R_B variant for open(), at least for now.

Either that, or skip the element, reducing the length of the message.

Corrupting the SCM_RIGHTS message would potentially create new bugs. If the recvmsg() call is modified the whole call should fail with an error (EPERM) and free the reference to the attached file descriptors.

  • Add a regression test.
  • Make the UF_BENEATH flag settable via fcntl(F_SETFD).
  • Handle this flag in the fplookup code, by falling back to the locked lookup.
  • Disallow fchdir() and fchroot() on a fd with UF_BENEATH set, so we don't rely on capsicum rights to prevent this.

Some of these changes required passing fde flags down through fget_*, which
involves some churn and bloats those functions a bit.

In D50371#1150812, @kib wrote:

Well, to be clear, this sticky flag is an fd flag, not a file flag. In the current patch, fcntl() cannot be used to clear it (it ignores everything except UF_EXCLOSE). So it is already sticky. Do you mean that I should introduce a new open(2) flag which causes the returned fd to have UF_BENEATH set? Or maybe we should just add a new fcntl() flag to start? That is, add FD_BENEATH and let it be set by fcntl(F_SETFD, FD_BENEATH)? Perhaps both FD_BENEATH and FD_BENEATH_STICKY are needed?

I am a bit wary of introducing new O_* flags, since there are not so many free bits left.

Right, I fully agree with this proposal. It is not necessary to provide the sticky O_R_B variant for open(), at least for now.

So far I did not implement both variants, just a plain sticky UF_BENEATH, since that's sufficient for the mitigation. Do you think it's worth implementing both?

Does anything prevent openat(dfd, "foo/../..", O_RESOLVE_BENEATH | O_DIRECTORY) from succeeding if the directory referenced by dfd is concurrently moved so that foo/.. references something else due to a race?

So far I did not implement both variants, just a plain sticky UF_BENEATH, since that's sufficient for the mitigation. Do you think it's worth implementing both?

As far as I see, after the ground work is done, the open flag is trivial. So it can be done ATM it is required. OTOH I do not have objections against adding it now.

Does anything prevent openat(dfd, "foo/../..", O_RESOLVE_BENEATH | O_DIRECTORY) from succeeding if the directory referenced by dfd is concurrently moved so that foo/.. references something else due to a race?

No, but define 'concurrently'. How would either syscall or userspace determine a difference between 'concurrently' vs 'after'?

In D50371#1155991, @kib wrote:

So far I did not implement both variants, just a plain sticky UF_BENEATH, since that's sufficient for the mitigation. Do you think it's worth implementing both?

As far as I see, after the ground work is done, the open flag is trivial. So it can be done ATM it is required. OTOH I do not have objections against adding it now.

Does anything prevent openat(dfd, "foo/../..", O_RESOLVE_BENEATH | O_DIRECTORY) from succeeding if the directory referenced by dfd is concurrently moved so that foo/.. references something else due to a race?

No, but define 'concurrently'. How would either syscall or userspace determine a difference between 'concurrently' vs 'after'?

Suppose two processes in two sibling jails are able to pass fds over a unix socket. Process P1 creates a directory D, opens it, passes it two process P2 in the second jail.

With the patch, P2 should not be able to traverse out of the directory. But suppose P1 renames the subdirectory foo at the same time that P2 does a relative lookup like foo/../../. I believe it's possible for the lookup to succeed if the rename is done after the lookup of foo but before the lookup of foo/... Does something prevent this? For instance, should namei(dvp, RBENEATH) fail if the dvp->v_seqc changed during the namei() call?

In D50371#1155991, @kib wrote:

So far I did not implement both variants, just a plain sticky UF_BENEATH, since that's sufficient for the mitigation. Do you think it's worth implementing both?

As far as I see, after the ground work is done, the open flag is trivial. So it can be done ATM it is required. OTOH I do not have objections against adding it now.

Does anything prevent openat(dfd, "foo/../..", O_RESOLVE_BENEATH | O_DIRECTORY) from succeeding if the directory referenced by dfd is concurrently moved so that foo/.. references something else due to a race?

No, but define 'concurrently'. How would either syscall or userspace determine a difference between 'concurrently' vs 'after'?

Suppose two processes in two sibling jails are able to pass fds over a unix socket. Process P1 creates a directory D, opens it, passes it two process P2 in the second jail.

With the patch, P2 should not be able to traverse out of the directory. But suppose P1 renames the subdirectory foo at the same time that P2 does a relative lookup like foo/../../. I believe it's possible for the lookup to succeed if the rename is done after the lookup of foo but before the lookup of foo/... Does something prevent this?

No. But 'so what'?

For more contrived example, 'a/b/../../' and rename of b would also give some way. If we re-check v_seqc, and I suppose that we have to recheck v_seqc for all vnodes recorded by the namei tracker, what to do if seqc changed? If we do re-lookup, this would practically allow for the adversary actor to spin the lookup thread in kernel. If we plain fail the lookup, this is a reliability issue, not to mention that seqc change might be not related to _this_ hierarchy.

But then, what is really the threat model? Passing a dirfd and trying to organize the race to get the receiver the full access to the source jail file namespace? If yes, why not pass a pipe connected to the shell?

For instance, should namei(dvp, RBENEATH) fail if the dvp->v_seqc changed during the namei() call?

Absolutely not. This just makes the syscall unreliable.

NIRES_BENEATH should be added to namei.9

sys/kern/kern_descrip.c
3080–3082
3081
3142

Same extra () and needs '!='.

3429

Why cannot you name _fget_unlocked() as fget_unlocked_flags()?

In D50371#1155993, @kib wrote:
In D50371#1155991, @kib wrote:

Does anything prevent openat(dfd, "foo/../..", O_RESOLVE_BENEATH | O_DIRECTORY) from succeeding if the directory referenced by dfd is concurrently moved so that foo/.. references something else due to a race?

No, but define 'concurrently'. How would either syscall or userspace determine a difference between 'concurrently' vs 'after'?

Suppose two processes in two sibling jails are able to pass fds over a unix socket. Process P1 creates a directory D, opens it, passes it two process P2 in the second jail.

With the patch, P2 should not be able to traverse out of the directory. But suppose P1 renames the subdirectory foo at the same time that P2 does a relative lookup like foo/../../. I believe it's possible for the lookup to succeed if the rename is done after the lookup of foo but before the lookup of foo/... Does something prevent this?

No. But 'so what'?

For more contrived example, 'a/b/../../' and rename of b would also give some way. If we re-check v_seqc, and I suppose that we have to recheck v_seqc for all vnodes recorded by the namei tracker, what to do if seqc changed? If we do re-lookup, this would practically allow for the adversary actor to spin the lookup thread in kernel. If we plain fail the lookup, this is a reliability issue, not to mention that seqc change might be not related to _this_ hierarchy.

But then, what is really the threat model? Passing a dirfd and trying to organize the race to get the receiver the full access to the source jail file namespace?

This doesn't give just full access to the source jail file namespace. It potentially gives access to the whole filesystem. If the sibling jails are at /j/1, /j/2, then P2 can walk out of /j/1 since its jail root is /j/2, and this directory is not encountered during dotdot traversal.

For instance, should namei(dvp, RBENEATH) fail if the dvp->v_seqc changed during the namei() call?

Absolutely not. This just makes the syscall unreliable.

Or it could retry internally. BTW, I found the following error return for Linux's openat2():

EAGAIN how.resolve contains either RESOLVE_IN_ROOT or
       RESOLVE_BENEATH, and the kernel could not ensure that a
       ".." component didn't escape (due to a race condition or
       potential attack).  The caller may choose to retry the
       openat2() call.

I believe RESOLVE_BENEATH is equivalent to openat(... O_RESOLVE_BENEATH) in FreeBSD. I don't know how they determine whether an "escape" happened.

In D50371#1155993, @kib wrote:
In D50371#1155991, @kib wrote:

Does anything prevent openat(dfd, "foo/../..", O_RESOLVE_BENEATH | O_DIRECTORY) from succeeding if the directory referenced by dfd is concurrently moved so that foo/.. references something else due to a race?

No, but define 'concurrently'. How would either syscall or userspace determine a difference between 'concurrently' vs 'after'?

Suppose two processes in two sibling jails are able to pass fds over a unix socket. Process P1 creates a directory D, opens it, passes it two process P2 in the second jail.

With the patch, P2 should not be able to traverse out of the directory. But suppose P1 renames the subdirectory foo at the same time that P2 does a relative lookup like foo/../../. I believe it's possible for the lookup to succeed if the rename is done after the lookup of foo but before the lookup of foo/... Does something prevent this?

No. But 'so what'?

For more contrived example, 'a/b/../../' and rename of b would also give some way. If we re-check v_seqc, and I suppose that we have to recheck v_seqc for all vnodes recorded by the namei tracker, what to do if seqc changed? If we do re-lookup, this would practically allow for the adversary actor to spin the lookup thread in kernel. If we plain fail the lookup, this is a reliability issue, not to mention that seqc change might be not related to _this_ hierarchy.

But then, what is really the threat model? Passing a dirfd and trying to organize the race to get the receiver the full access to the source jail file namespace?

This doesn't give just full access to the source jail file namespace. It potentially gives access to the whole filesystem. If the sibling jails are at /j/1, /j/2, then P2 can walk out of /j/1 since its jail root is /j/2, and this directory is not encountered during dotdot traversal.

For instance, should namei(dvp, RBENEATH) fail if the dvp->v_seqc changed during the namei() call?

Absolutely not. This just makes the syscall unreliable.

Or it could retry internally.

Then this gives an easy way to force the lookup thread to loop in kernel forever. See above.

BTW, I found the following error return for Linux's openat2():

EAGAIN how.resolve contains either RESOLVE_IN_ROOT or
       RESOLVE_BENEATH, and the kernel could not ensure that a
       ".." component didn't escape (due to a race condition or
       potential attack).  The caller may choose to retry the
       openat2() call.

I believe RESOLVE_BENEATH is equivalent to openat(... O_RESOLVE_BENEATH) in FreeBSD. I don't know how they determine whether an "escape" happened.

So I do not think that checking seqc is proper because it is too coarse.

An variant could be to do the descend over the nameicap tracker after the lookup, to check that the dirs are still parent/child.
But the complication there is due to lookup vp and possible dvp already locked, which means that we cannot lock any of the tracker vnodes. Unlocking dvp and vp, check tracker, then relocking is somewhat stinky.

Another variant might be to accept the failure and lock the renames globally per mount, with an sx, as mjg suggested some time ago. Then, ORB lookups would take the sx shared, which should prevent the issue. The sx can be also reused as um_checkpath_lock and pm_checkpath_lock for ufs and msdosfs, or rather it eliminates the need for them.

With the patch, P2 should not be able to traverse out of the directory. But suppose P1 renames the subdirectory foo at the same time that P2 does a relative lookup like foo/../../. I believe it's possible for the lookup to succeed if the rename is done after the lookup of foo but before the lookup of foo/... Does something prevent this?

This is not possible as long as the renames always transform a descendant directory in another descendant directory. That's the same requirement as for jails having a root that is not a filesystem root in order to avoid escaping. In practical terms, however, I don't think it's reasonable to expect people (whether a jail admin, or even some innocent user) not to shoot themselves in the foot and break the use of UF_BENEATH. What can seem acceptable to prevent jail escape is probably not in this case. Moreover:

This doesn't give just full access to the source jail file namespace. It potentially gives access to the whole filesystem.

and that's a big problem again (delegation of more rights than the originator has).

Recording the top and root directories from the credentials of the user opening the file seems the straightforward to prevent this (either via some fd's credentials, or new fields). With it, at least we would be back to only the traditional jail escape situation and recommendation.

Another possibility is some more involved mechanism to check that some file stays below another one. I'd love to contribute some code, but unfortunately won't be able to until after BSDCan (or perhaps during the KW hackathon).

In D50371#1155993, @kib wrote:

But then, what is really the threat model? Passing a dirfd and trying to organize the race to get the receiver the full access to the source jail file namespace? If yes, why not pass a pipe connected to the shell?

The threat model would be foot-shooting, where some user/admin thinks he gives away some restricted rights but it then turns out that the restrictions won't apply (IOW, as said above, delegation of more rights than the originator has).

For instance, should namei(dvp, RBENEATH) fail if the dvp->v_seqc changed during the namei() call?

Absolutely not. This just makes the syscall unreliable.

I tend to agree. Linux's openat2() allows EAGAIN, which solves the in-kernel infinite loop in a straightforward way but puts the burden on userspace in more situations than it probably expects.

markj marked 4 inline comments as done.

Handle kib's comments.

This revision is now accepted and ready to land.Tue, Jun 3, 10:58 PM
sys/sys/fcntl.h
292

This will need an update to fcntl.2 which currently documents F_GETFD/F_SETFD only as checking or setting FD_CLOEXEC

sys/kern/kern_descrip.c
543

We'll want to note this in the man page as well.

547–549

Existing case in _finstall is more style(9)ish with (flags & O_CLOEXEC) != 0 ? UF_EXCLOSE : 0, should we be consistent here?

markj marked 2 inline comments as done.
  • Rename FD_BENEATH and UF_BENEATH to FD_RESOLVE_BENEATH and UF_RESOLVE_BENEATH for consistency with O_RESOLVE_BENEATH and AT_RESOLVE_BENEATH.
  • Update fcntl.2.
  • Handle review comments.
This revision now requires review to proceed.Thu, Jun 5, 4:03 PM
lib/libsys/fcntl.2
131

This formulation seems a little awkward now that we have two different flags. Do we have standard language for flag bits in other man pages?

Michael Dexter was interested in this discussion so adding him.

This is the same idea as OpenBSD's O_BELOW resp. F_BELOW. Perhaps the API can be designed to match OpenBSD's?

https://marc.info/?l=openbsd-tech&m=174844109910709&w=2

In D50371#1157705, @fuz wrote:

This is the same idea as OpenBSD's O_BELOW resp. F_BELOW. Perhaps the API can be designed to match OpenBSD's?

https://marc.info/?l=openbsd-tech&m=174844109910709&w=2

Well, I am trying to solve an existing problem using namei() features that already exist. O_BELOW has different semantics, e.g., in its handling of absolute paths, so would require more modifications, but they don't help in solving the problem at hand.

O_BELOW also sets a flag in the file structure, whereas this patch applies the flag on the file descriptor, so different descriptors for the same file can have different semantics; this is required to handle the case where a dirfd is transferred over a unix socket: I don't want to modify lookup semantics in the sender, so modifying the file structure is not possible.

Use a list of flags in the F_GETFD documentation.

kib added inline comments.
lib/libsys/fcntl.2
147

I think 'similar operations' need to be defined more concise.

This revision is now accepted and ready to land.Sun, Jun 8, 5:45 PM

As brought up in OpenBSD the cleaner design would be to require the user to set a sticky O_BELOW flag on the file description with fcntl() to allow cross jail fd passing instead of hacking it into the file descriptor.

What would be the consequence of dup()/dup2() or passing the file descriptor again within the same jail?

As brought up in OpenBSD the cleaner design would be to require the user to set a sticky O_BELOW flag on the file description with fcntl() to allow cross jail fd passing instead of hacking it into the file descriptor.

Well, I am not thrilled to introduce new name lookup semantics just to solve this problem. We could however make a sticky O_RESOLVE_BENEATH flag set in the file description with F_SETFL, and require it to be set when cross jail boundaries. That would be cleaner to implement since it doesn't require modification to the fget* interface. It is not as good with respect to compatibility, since it requires application changes, and then the file description referenced by the sender will also change (unnecessarily).

As brought up in OpenBSD the cleaner design would be to require the user to set a sticky O_BELOW flag on the file description with fcntl() to allow cross jail fd passing instead of hacking it into the file descriptor.

Well, I am not thrilled to introduce new name lookup semantics just to solve this problem. We could however make a sticky O_RESOLVE_BENEATH flag set in the file description with F_SETFL, and require it to be set when cross jail boundaries. That would be cleaner to implement since it doesn't require modification to the fget* interface. It is not as good with respect to compatibility, since it requires application changes, and then the file description referenced by the sender will also change (unnecessarily).

So what is the plan for 15-current, 14-stable, and the releng branches?