Page MenuHomeFreeBSD

Jail descriptors
Needs ReviewPublic

Authored by jamie on Feb 1 2024, 12:23 AM.
Tags
None
Referenced Files
F126573330: D43696.diff
Thu, Aug 21, 4:42 AM
Unknown Object (File)
Tue, Aug 19, 2:52 PM
Unknown Object (File)
Tue, Aug 19, 2:47 PM
Unknown Object (File)
Tue, Aug 12, 6:16 PM
Unknown Object (File)
Mon, Aug 11, 12:26 AM
Unknown Object (File)
Tue, Aug 5, 1:39 PM
Unknown Object (File)
Sat, Aug 2, 8:47 PM
Unknown Object (File)
Sat, Aug 2, 1:50 AM

Details

Reviewers
brooks
bz
dch
Group Reviewers
Jails
security
Summary

Similar in spirit to process descriptors, jail descriptors are a reference to a jail using the file interface.

Jail descriptors can be created along with the underlying prison with jail_set(2), or fetched later with jail_get(2). They can then be used instead of the jid when a jail is used, via the new system calls jd_attach(2) and jd_remove(2), and also via the "desc" parameter in jail_get(2) and jail_set(2).

Jail_set and jail_get can both create a new descriptor or reference an existing one, controlled by flags. JAIL_USE_DESC reads the desc parameter for a descriptor of the jail to operate on. JAIL_AT_DESC specifies instead the jail to operate under, for example the namespace where to find/create a jail, and the permissions to pass to it. JAIL_GET_DESC will return a new descriptor in the desc parameter (even in jail_set which normally doesn't return parameters), and JAIL_OWN_DESC with return that descriptor with the sticky bit set.

Jail descriptors eliminate the race problem where a jail with a given ID may be destroyed and re-created, and not be the same jail that was referred to before. A descriptor always refers to a single jail over (and beyond) its lifetime, and it is an error to try use an existing descriptor to create a jail. So when a the jail on a descriptor claims to be dead, you can be sure it stays that way.

Permission for jail operations using descriptors is checked against the credentials of the process that first created the descriptor. This allows passing of jail descriptors to non-privileged users who could then for instance attach to a jail without being root. In addition to the privilege checks, the descriptor access mode is also checked, allowing per-user/group and per-operation controls. Read bits control jail_get, write bits control jail_set, and execute bits control jail_attach and the JAIL_AT_DEC flag.

The descriptor sticky bit is repurposed to indicate an owning descriptor, where the jail lifetime is now tied ("stuck") to the descriptor. When the descriptor closes, do will the attached jail. This is similar to the default behavior of process descriptors.

Jail descriptors can be followed via kevent, with the following note types:

  • NOTE_JAIL_SET: jail_set has been called. It's up to the user to see what if anything has actually changed.
  • NOTE_JAIL_ATTACH: a process has attached to the jail. The note data contains the pid.
  • NOTE_JAIL_REMOVE: the jail has been removed. The descriptor remains open, but is no longer attached to the jail.
  • NOTE_JAIL_CREATE: a child has been created under the watched jail. The note data contains the child's jid.

Future plans, not yet done:

  • Capsicum support
  • documentation

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

jamie requested review of this revision.Feb 1 2024, 12:23 AM

If you have reason to upload a new diff (e.g. addressing some feedback) please upload with full context - e.g. git diff -U99999 or git show -U99999 <hash>. Full details at https://wiki.freebsd.org/Phabricator

bz requested changes to this revision.Feb 1 2024, 1:56 PM
bz added a subscriber: bz.

Also, can we please have a man page?

sys/sys/jail.h
119

Can we spell these out as jail_desc_attach() or jaildesc_attach() or something similar but at least preserve the full jail* prefix?
I am just picking this place but obviously that pull through the changes.

This revision now requires changes to proceed.Feb 1 2024, 1:56 PM

I read the description, but I'm still not sure how a jail descriptor would be used. How about some pseudo-code, to illustrate the concept please?

In D43696#996669, @dvl wrote:

I read the description, but I'm still not sure how a jail descriptor would be used. How about some pseudo-code, to illustrate the concept please?

From just looking at the diff in it's current form I assume the following:

  • jail_set() with the JAIL_CREATE flag can be used to create a jail like before, but you can pass the new "desc" argument with a negative number (e.g. -1) as argument. This would cause jail_set() to attempt to create a new jail descriptor bound to the the jail and put into the calling process. The file descriptor number is returned by overwriting the previously negative int argument.
  • jail_get() can be used the same way to get a jail descriptor for an already existing jail.
  • jail_set() can than be used with the "desc" argument and a non-negative file descriptor number instead of the jail id in which case the permission checks are performed against the credentials used to create the jail file descriptor instead of the calling processes current permissions.
  • Or it could be required to use the new syscalls for this.
  • It's planned, but not yet implemented, to register a jail descriptor with kqueue using a new filter to select which types of events to notify (a variable was modified, a process attached, the jail was destroyed, etc.).
  • Not mentioned but also missing to allow safe delegation to unprivileged users is a way to restrict the capability represented by a jail descriptor e.g. attach only, destroy only, watch only.
lib/libc/sys/Symbol.map
420

Syscalls for 15.0 go in FBSD_1.8

sys/kern/kern_jail.c
2825

Is the i a typo?

sys/kern/kern_jaildesc.c
121

Blank line no longer required by style(9)

sys/sys/user.h
455

Why a spare? New entires seem not to include them and old ones put a huge block at the top.

sys/kern/kern_jail.c
2825

Yes - copy/paste from the jail_attach_args comment.

sys/sys/user.h
455

Yes, it does seem out of place. It preserves the 64-bit alignment, but that's not necessary.

sys/sys/jail.h
119

Sure. I went with "jd" from similarity with the "pd" calls, but "jaildesc" works.

sys/sys/jail.h
119

Maybe jail_attach_jd() or jail_attach_desc()? I somewhat prefer jd to desc.

markj added inline comments.
sys/sys/jail.h
119

Or jaild_ or jailfd_? :)

We have a generic __specialfd system call that can be used to return descriptors of different types. I just bring it up to show that there's some precedent for an "fd" suffix.

The __specialfd() system call is good to know about, but on its own doesn't provide a way to atomically create a configured jail with its lifetime tied to the file descriptor unlike passing a negative integer as "desc" argument to jail_set. There would either be a window where the jail would exist without a file descriptor associated with it which would reintroduce races and resource leak or the file descriptor has to start out unattached to a jail and be passed to the jail_set(JAIL_CREATE) call. The later design choice would get rid of the awkwardness of (ab-)using negative file descriptor numbers to request a file descriptor via an in-out parameter.

sys/sys/jail.h
119

Maybe jail_attach_jd() or jail_attach_desc()? I somewhat prefer jd to desc.

Yes, that jail_xxx_jd sounds good. What about the parameter - would a "jd" parameter be better than the current "desc"? My only concern is its similarity to "jid".

In D43696#996617, @bz wrote:

Also, can we please have a man page?

Of course I'll need to do that sometime ;-). The parameters to get/set the descriptor will have their place in jail(8) where all the other parameters live, though they're kind of strange there because it's very much not part of the "(8)" side of things. Also jail_attach_jd and jail_remove_jd (or whatever we settle on) could be added to jail(2) which has the other jail-related calls.

For how the whole thing works, I suppose a jaildesc(4) works, again as a counterpart to procdesc(4).

Here's the latest diff to address concerns so far (except those that request proper documentation).

jamie marked 5 inline comments as done.Feb 5 2024, 4:28 AM

The syscall bits look good (module feedback on a flag argument which you can take or leave). I've not reviewed the descriptor bits in great detail so another review by someone more familiar with that code would be good.

sys/kern/kern_jail.c
2702–2705

Should there be a KASSERT that only one flag is set?

sys/kern/syscalls.master
3338

Is there any possibility one might want flags in the future? An unused, must-be-zero flags argument is cheap.

sys/kern/kern_jaildesc.c
121

Perhaps assert that jd->jd_prison == NULL before the assignment?

142

The opening brace should be on the previous line.

sys/kern/syscalls.master
3341

How do jail descriptors fit into rights(4)? Should we define fine-grained rights on a jail descriptor, akin to CAP_PDGETPID, CAP_PDKILL, etc.?

sys/sys/jaildesc.h
50

It looks like this list is unused. Do you have some future plans for it?

sys/kern/kern_jail.c
2702–2705

Perhaps - this pattern is used a lot in kern_jail.c, and the flags are only local to this file. There are likely a number of places where I could assert such.

sys/kern/kern_jaildesc.c
142

Indeed. I often miss this because it's my own coding style for non-FreeBSD stuff.

sys/kern/syscalls.master
3338

I think not, because such a flag would mostly likely apply to the non-descriptor versions of both of these calls. Though I have to admit if there were a flag word passed to the old calls, I could have made these calls into a flag on them.

3341

Yes, that's what I'm planning as part of this. Capabilities for set, attach, remove.

sys/sys/jaildesc.h
50

Yes, it is for kevent. When a reportable event happens to a jail, it will traverse this list to report it.

dch requested changes to this revision.Mar 2 2024, 1:31 PM
dch added inline comments.
lib/libc/sys/Symbol.map
420

does this now move to lib/libsys/Symbol.sys.map after libc -> libsys changes?

recent CURRENT doesn't have Symbol.map anymore.

This revision now requires changes to proceed.Mar 2 2024, 1:31 PM

@jamie this not longer builds against CURRENT, if possible can you massage it a little, and then commit it so we don't get any more rot? I fixed symbol.map changes in https://git.sr.ht/~dch/src/commit/9051eedd52fc but that's not sufficient.

Diff updated for libsys and other recent changes.

lib/libc/sys/Symbol.map
420

That's correct. Most likely, just rebasing the patch will do the right thing today.

lib/libsys/Symbol.sys.map
382–383 ↗(On Diff #138628)
sys/kern/kern_jail.c
1071

This function seems to mostly favor comparing error to 0.

sys/sys/jail.h
425–426

These seem not to exist?

sys/kern/kern_jail.c
1071

Indeed. I can change this to match.

sys/sys/jail.h
425–426

Indeed - I found there was no need for those, since there's little overlap before do_jail_attach of prison_deref is called. I'll remove these lines.

jamie marked 4 inline comments as done.
jamie edited the summary of this revision. (Show Details)

The big addition is kevent support. A jail descriptor can pass four notes:

The way descriptors were passed to/from jail_set and jail_get has been updated with new flag bits, instead of the old hacky method of checking of checking whether the passed number is -1. The descriptor number is still passed via the "desc" parameter, but now the flag bits control how it's used. I've updated the summary to describe these bits.

One new use of the descriptor is the flag JAIL_AT_DESC, which uses the described jail as the current/parent prison, in place of the caller's prison. This allowed child jails to be manipulated using a descriptor, in the spirit of the "fooat" family of file system calls.

I've added file access permissions, via the mode bits on the descriptor, settable via fchmod. By default the read bits are on (as anyone can call jail_get) and the write and execute bits are on for the user and only if it has the associated privileges (jail_set for write, jail_attach for execute). I also use the sticky bit the make the jail go away when the descriptor closes, similar to the way a process is killed when its process descriptor closes.

libjail has been updated to be aware of the desc parameter and flags, since it needs to know when to pass it to jail_get and retrieve it from jail_set.

The big addition is kevent support. A jail descriptor can pass four notes:

The way descriptors were passed to/from jail_set and jail_get has been updated with new flag bits, instead of the old hacky method of checking of checking whether the passed number is -1. The descriptor number is still passed via the "desc" parameter, but now the flag bits control how it's used. I've updated the summary to describe these bits.

One new use of the descriptor is the flag JAIL_AT_DESC, which uses the described jail as the current/parent prison, in place of the caller's prison. This allowed child jails to be manipulated using a descriptor, in the spirit of the "fooat" family of file system calls.

I've added file access permissions, via the mode bits on the descriptor, settable via fchmod. By default the read bits are on (as anyone can call jail_get) and the write and execute bits are on for the user and only if it has the associated privileges (jail_set for write, jail_attach for execute). I also use the sticky bit the make the jail go away when the descriptor closes, similar to the way a process is killed when its process descriptor closes.

libjail has been updated to be aware of the desc parameter and flags, since it needs to know when to pass it to jail_get and retrieve it from jail_set.

This sounds deeply confused about what a file descriptor is (supposed to be). It is the capability to use the referenced resource. The permissions in the file description should govern what's allowed through the descriptor not the current process credentials. The check against the current process credential has to happen as the description the descriptor references is created.

"When the descriptor closes, do will the attached jail" should probably read "... so will the attached jail".

In D43696#996669, @dvl wrote:

I read the description, but I'm still not sure how a jail descriptor would be used. How about some pseudo-code, to illustrate the concept please?

Now that the kevent stuff is there, you have to tools to handle keeping track of a jail hierarchy. Something like...

		my_jid = jail_setv(JAIL_CREATE | JAIL_OWN_DESC, "name", "foo", "desc", desc, ..., NULL);
		kq = kqueue();
		EV_SET(&ev, desc, EVFILT_JAILDESC, EV_ADD, NOTE_JAIL_SET | NOTE_JAIL_ATTACH | NOTE_JAIL_REMOVE | NOTE_JAIL_CREATE, 0, NULL);
		kevent(kq, &ev, 1, NULL, 0, NULL);
		for (;;) {
			if (kevent(kq, NULL, 0, &ev, 1, NULL) > 0)
				switch (ev.fflags) {
				case NOTE_JAIL_SET:
					jail_getv(JAIL_USE_DESC, "desc", desc, ..., NULL);
					do_something_with_jail(...);
					break;
				case NOTE_JAIL_ATTACH:
					do_something_with_process(ev.data);
					break;
				case NOTE_JAIL_REMOVE:
					close(desc);
					goto done;
				case NOTE_JAIL_CREATE:
					jail_getv(JAIL_GET_DESC, "jid", ev.data, "desc", "child_desc", "parent", parent_jid, ..., NULL);
					if (parent_jid != my_jid) {
						// ignore or handle race where this isn't
						// the jail that I just got a note for.
					} else
						do_something_with_jail(child_desc);
					break;
				}
			}
		}

Other features let different tasks be delegated to a sub-process or whatever.

This sounds deeply confused about what a file descriptor is (supposed to be). It is the capability to use the referenced resource. The permissions in the file description should govern what's allowed through the descriptor not the current process credentials. The check against the current process credential has to happen as the description the descriptor references is created.

Um, yeah. I'm just saying that's how it works now. In the previous revision, those checks weren't there, and anyone who had the descriptor could do anything with it. Whether it's "new feature" or "finally works correctly" is just different ways of describing the new code.

Or am I deeply confused enough that I don't know which part of my description you're referring to?

"When the descriptor closes, do will the attached jail" should probably read "... so will the attached jail".

Hey, I'm not the one who chose to put those keys right next to each other.

As I learn more about kqueue, I see I am trying to make it do things it's not suited for. In particular, the data field can't contain the kind of information I'm putting it in (process ID, jail ID). The non-queuing nature of kqueue (!) means that one event can be obliterated by the next event, leading to unreliable notification. There's not a lot of advantage of being notified that at least one child jail has been created, and here's the JID of the most recent one.

Notably all the existing uses of kqueue have data that tells you many times something happened, of how much of something exists - the kind of data where the latest value is the only useful one. An exception is EVFILT_PROC, where an exiting process gives you its status, and NOTE_TRACK gives you a NOTE_CHILD with a parent PID. The former works because NOTE_EXIT will always be the last thing a process does, and the latter because the kernel creates a new kevent behind the scenes (which may fail, leading to NOTE_TRACKERR).

So perhaps I could go the NOTE_TRACK route, especially considering the creating a child jail corresponds somewhat to a fork. That would include the caller having to handle a failure to be told.

I've added D51940, which adds kqueue support to jails by JID, so it's separate from the descriptor work, except it contains fixes for the problems I've identified with the current kqueue code in this revision. In particular, it uses the NOTE_TRACK convention that processes use, and adds a flag that notes if attached processes were missed.