Paths

Table of Contentst

mac_curtain infrastructure: syscall filters
Needs ReviewPublic
Actions

Authored by sigsys_gmail.com on Apr 4 2022, 6:58 AM.

Details

Reviewers

bapt
brooks
markj
kevans
emaste

Summary

This is part of the infrastructure needed to support the mac_curtain module (new sandboxing mechanism that includes pledge()/unveil() compatibility) that I announced on -hackers a week ago:

https://github.com/Math2/freebsd-pledge
https://github.com/Math2/freebsd-pledge/blob/curtain/current/main/CURTAIN-README.md

I extracted (after much cleanups) the "sysfil" (syscall filters) part for review. This patch, by itself, should preserve existing behavior in all cases. Sysfils are intended to be managed by mac_curtain through a new curtainctl(2) syscall. And to complete the integration, mac_curtain needs a bunch of new MAC handlers (not included in this review).

Overview:

Each "sysfil" is a category of things that the kernel can do (mostly inspired from pledge() promises). Syscalls are annotated with a sysfil bitmap representing the sysfil bits that a process needs to have enabled in its ucred to be allowed to call it.

Capsicum was slightly modified to make use of a sysfil bit. ucred's cr_flags was replaced by a cr_sysfilset field. This keeps the syscall entry check in subr_syscall.c simple as both Capsicum and (other) sysfils are checked in one operation. The IN_CAPABILITY_MODE() macro didn't change meaning, it only is true for processes in Capsicum mode. But there's a new IN_RESTRICTED_MODE() macro that is true for all processes that do not have all sysfils enabled (which thus includes processes in capability mode and processes sandboxed with mac_curtain).

If a syscall doesn't have any SYSFIL_* keyword in syscalls.master, makesyscalls.lua will give it SYSFIL_CATCHALL (which is always disabled for processes sandboxed with mac_curtain, but will still be enabled under Capsicum like they were before (unless you use mac_curtain sandboxing at the same time as Capsicum, which is completely supported)).

Sysfils aren't meant to be exposed to the userland. mac_curtain exposes intermediate "abilities" as part of its curtainctl(2) API, and the pledge(3) userland compatibility wrapper exposes "promises" compatible with OpenBSD's. So sysfils could be reorganized internally without breaking user APIs.

mac_curtain can signal processes on check failures by returning special errnos (the locking situation in the check call sites sometimes don't allow to send signals... but I wonder if there could be a better way to do this?) and that's included in this patch even though nothing uses them yet.

About the syscalls categorizations:

There's a short comment describing each sysfil in sys/sysfil.h.

Many of the sysfils are there specifically to provide restrictions compatible with certain OpenBSD pledge() promises. Overall the syscalls have been broken up into smaller sets (so that mac_curtain can be more fine-grained), so some promises are covered by a union of sysfils. And not all pledge() promises even need a sysfil, some of the higher-level ones are enforced just with MAC handlers. Some don't even need particular kernel-level support, they're built on top of other restrictions and only exist in the userland pledge() compatibility library. Some of the sysfil assignments are more questionnable than others... for example, not sure that separating SYSFIL_CLOCK/SYSFIL_TIMER/SYSFIL_POSIXRT made that much sense.

Also, note that sysfils are just one part of the protection. They're the first line of defense, but they're incomplete by themselves. They need to be complemented by checks from MAC modules (and with mac_curtain this includes path-based FS access restrictions).

For example, open(2) allows you to open paths for both read and write, but it depends on the arguments, so its sysfil is just SYSFIL_PATH (misc. operations on path) plus SYSFIL_FDESC (file descriptor management). But creat(2) always open files for writing, so it gets SYSFIL_WPATH (path-based write access). But it does not *necessarily* create a new path (if the file already exists), so it does not get SYSFIL_CPATH (path creation). But mkdir(2) can only create paths, so it gets SYSFIL_CPATH. SYSFIL_FATTR means the ability to change attributes on ANY type of FD, so fchmod() gets just SYSFIL_FATTR, but fchmodat() gets SYSFIL_PATH as well. shm_open(2) is allowed with just SYSFIL_MMAN (memory management) because opening an anonymous memory segment is allowed without the full SYSFIL_POSIXSHM (Capsicum does something similar).

The idea is that a process doesn't request blocking specific syscalls directly, it requests certain higher-level "abilities" from mac_curtain and syscalls are sort of "opportunistically" blocked if they can't possibly be useful for any of those abilities. The MAC module does the final checks depending on the specifics of the call and what the target object is. There are currently 101 abilities supported by mac_curtain (with some dependency relationships between one another).

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

sigsys_gmail.com created this revision.Apr 4 2022, 6:58 AM

Herald added a reviewer: brooks. · View Herald TranscriptApr 4 2022, 6:58 AM

Herald added subscribers: jonathan, imp. · View Herald Transcript

sigsys_gmail.com requested review of this revision.Apr 4 2022, 6:58 AM

bapt added reviewers: markj, kevans, emaste.Apr 6 2022, 11:50 AM

I just left some comments based on reading the code, I don't have a lot to say about the high-level design yet. It would be nice to see the new MAC handlers mentioned in your review description.

Some vague and not-very-actionable comments: we have a lot of new terminology here: curtain, sysfil, restricted mode, veiled. I wonder if it could be simpler. I'm also not sure how much benefit we get from making capability mode a special case of "restricted mode"; I also wonder a bit if MAC could be extended to handle more of what sysfil and restricted mode are doing.

sys/conf/options
186	Generally we use positive senses for features enabled by config options. IMO it's nicer to see `#ifdef SYSFIL` rather than the double negative `#ifndef NOSYSFIL`.
sys/kern/subr_syscall.c
132	If NOSYSFIL and CAPABILITY mode are defined in the kernel config, then we lose this check.
260	Again, this kills off handling of ENOTCAPABLE and ECAPMODE if NOSYSFIL is defined.
sys/kern/syscalls.master
430	Does this mean that all ioctls on all descriptors are permitted as long as "core" is enabled? Or does the MAC module filter them somehow?
sys/kern/vfs_syscalls.c
1169	Can you explain what exactly is happening here?
sys/security/mac/mac_framework.c
731	Not very important, but I don't understand why we do all of these checks.
sys/sys/sysfil.h
17	Can we perhaps use bitset(9) to implement this? It looks like we're pretty close to the limit already.
60	What do procctl() and curtainctl() have to do with each other?

In D34761#788699, @markj wrote:

I just left some comments based on reading the code, I don't have a lot to say about the high-level design yet. It would be nice to see the new MAC handlers mentioned in your review description.

Some vague and not-very-actionable comments: we have a lot of new terminology here: curtain, sysfil, restricted mode, veiled. I wonder if it could be simpler. I'm also not sure how much benefit we get from making capability mode a special case of "restricted mode"; I also wonder a bit if MAC could be extended to handle more of what sysfil and restricted mode are doing.

Yeah, thinking about it, I can see how more of it could go into MAC. At first I was making "sysfils" more of a new subsystem in the kernel. But as it became more complex I realized that I was just making a miniature MAC. So I moved as much stuff as possible to MAC. Now sysfils are just a way to tag syscalls and establish a general "sandboxing condition" for processes.

"Veiled VFS" mode could be part of the MAC API instead because this module's FS visibility restrictions only work with MAC enabled anyway. IN_RESTRICTED_MODE() could probably go too. It was more used during my first attempts at implementing this whole thing (and before I did some cleanup a few days ago), but now there are very few uses of it left.

Sysfils could conceivably be used by other modules, but the categorization is very specific to mac_curtain...

The "sysfil" name just came from when I tried to steal the SYF_* namespace at first BTW.

Full diff to sys/security/mac:
https://gist.githubusercontent.com/Math2/7866a3c476496d51eb085590554e2260/raw/a4e4a7e7e948668c49e9d42a769f351e2c336511/gistfile1.txt

A lot of those new MAC handlers are pretty ad-hoc and not very elegant. The rest of MAC had a pretty consistent design to it and my patch just adds random things that it needed...

The mac_vnode_walk_*() functions are probably the worst offenders. But from my point of view it's a whole lot better than when I put all of this directly in vfs_lookup.c. Now the new code is mostly in the module (but the new call sites still add a lot of noise).

The exec-related handlers mostly deal with switching the restrictions on execve() (because you can have a separate "curtain" it switches to when you execute).

sys/conf/options
186	Alright.
sys/kern/subr_syscall.c
260	Yeah, CAPABILITY would become dependent on !NOSYSFIL (or a positive SYSFIL option). It would become the underlying syscall filtering mechanism behind Capsicum and mac_curtain (and if people compile out CAPABILITY to reduce overhead, they should also compile out sysfils as well for maximum effect).
sys/kern/syscalls.master
430	Yeah I added a mac_generic_check_ioctl() function (it's "generic" because it works on arbitrary FDs, unlike most other MAC handlers).
sys/kern/vfs_syscalls.c
1169	This is specifically to emulate pledge() on OpenBSD. It does that. It just silently drops the sticky and s[ug]id mode bits. With my module, there's also a separate "ability" to deny setting just the s[ug]id bits. But it returns an error when they're used, not silently drop them (because that's what can be done with the existing MAC handlers). You can use one or the other (or both...) with my module. It's probably not very important compatibility-wise, it could be dropped and added back later if it matters. Or there could be a new MAC handler just for that instead of a sysfil.
sys/sys/sysfil.h
17	I was trying to keep the overhead of sysfils as close to zero as possible (compared to what was done with Capsicum) since it would affect the performance of all processes. If more sysfils are needed, they can always be reorganized to free one. This doesn't break user API but it breaks kernel API with mac_curtain... If you think a bitset is fine then I could switch them to a bitset. It would allow to improve the categorization a bit too (I was being a bit careful not to use too many but it could use 10 more or so I think). sysents would need a bitset as well since syscalls can be tagged with more than one sysfil. Really starting to wonder too how syscall filtering could be moved to MAC. Having the SYSFIL_* tags directly in syscalls.master is convenient but that's more support code in the stock kernel that probably isn't going to be used by anything else. Annoyingly the AUE_* audit codes don't fully allow to identify syscalls. If the syscall handler functions are global symbols, could the MAC module build a table of them and compare them directly with say a mac_system_sysent_check() handler?
60	A little bit. procctl() can do some security stuff too. curtainct() could have its own sysfil, but sharing it "saves" a sysfil (so yeah, 64 bits is tight...). In general I want curtain(1) sandboxes to be "nestable", so curtainctl() wouldn't usually be disabled (and it's also the way to drop pledge() promises, so it's not disabled for pledge()'d applications either). But there's still a lot of complexity to it so I wanted SOME way to disable it. And I figured that a process that doesn't need procctl() probably isn't going to need curtainctl() either. Largely arbitrary...
60	What do procctl() and curtainctl() have to do with each other?

A few thoughts about integration with syscalls.master:

The if SYSFIL values aren't in syscalls.master they people won't remember to add them. This happened all the time with CAPENABLED annotations when it was in a separate file.
I'm worried that there are enough values that people are just going to punt and not add any flags so this will be an ongoing maintenance burden even with integration into syscalls.master.

It should be a compile or config error to enable CAPABILITY_MODE without SYSFIL.

sys/kern/sys_capability.c
115	You explain the negation elsewhere, but I think it is sufficiently confusing that setting the flag should be done by a macro defined near the explanation.
sys/sys/ucred.h
84	This breaks the ABI of `struct ucred`. I think that's OK in CURRENT, but we should probably get rid of `cr_pspare2`at the same time (I don't think it's useful in practice due to not reserving a useful amount or consistent shape of space) and consider if we want to add new spares.

In D34761#789689, @brooks wrote:

A few thoughts about integration with syscalls.master:

The if SYSFIL values aren't in syscalls.master they people won't remember to add them. This happened all the time with CAPENABLED annotations when it was in a separate file.

Should the per-sysfil short description comments be moved there too? It would put them right where they're needed the most. The Linuxulator could use sysfils too though (I had it working for a while but I didn't keep its syscalls.master files up-to-date as I kept rearranging the sysfils..).

I'm worried that there are enough values that people are just going to punt and not add any flags so this will be an ongoing maintenance burden even with integration into syscalls.master.

This reminds me, I noticed that setgroups(2)/posix_fadvise(2) don't have CAPENABLED but arguably they should.

It should be a compile or config error to enable CAPABILITY_MODE without SYSFIL.

Alright.

I have an idea to rejiggle this in a way that would minimize changes to the rest of the kernel while keeping the syscall annotations in syscalls.master though. (And I really should've done this before making this review...)

I would drop "sysfils" and replace them with "syscats" (syscall categories). Unlike sysfils they would just be an ordinal category number, not a bitmap. There would be room for 128 categories in sysent's sy_flags (sharing it with SYF_CAPENABLED) without widening it from 8 bits (which does keep sysents a bit more compact).

ucreds wouldn't (directly) have anything related to syscats in them and it would have its cr_flags restored. Everything needed for syscalls filtering already is in the curtain object pointed to by the label and the filtering could be done by a new mac_system_check_sysent() MAC call on syscall entry.

There could be a new "syscall filtering needed" flag in ucred's cr_flags for fastpaths in MAC to minimize the impact on other processes (though it would be nice to have a more general mechanism for that as well, for the fplookup stuff in vfs_cache.c in particular).

Since syscalls would only be able to have one category, I would reorganize them a little so that there can be one category that will represent the needed combinations (ex instead of SYSFIL_EXTATTR|SYSFIL_FATTR|SYSFIL_PATH, there could be a SYSCAT_SETEXTATTR_PATH). And with more categories to spare the categorization could be a bit more general and less tied to the specific way that mac_curtain works.

With some other cleanups, this would keep pretty much only the syscalls annotations from this patch and move almost everything else to MAC. So my next patch could include all of the needed kernel changes, I don't think it would be very useful to break it down into separate patches anymore.

sys/kern/sys_capability.c
115	Alright.

In D34761#789755, @sigsys_gmail.com wrote:

In D34761#789689, @brooks wrote:

A few thoughts about integration with syscalls.master:

The if SYSFIL values aren't in syscalls.master they people won't remember to add them. This happened all the time with CAPENABLED annotations when it was in a separate file.

Should the per-sysfil short description comments be moved there too? It would put them right where they're needed the most. The Linuxulator could use sysfils too though (I had it working for a while but I didn't keep its syscalls.master files up-to-date as I kept rearranging the sysfils..).

I think we've miscommunicated. What I meant is that if the annotations aren't in syscalls.master (e.g., they live in a MAC module) then people who add syscalls won't add them. This is mostly a vote for adding them to syscalls.master.

I'm worried that there are enough values that people are just going to punt and not add any flags so this will be an ongoing maintenance burden even with integration into syscalls.master.

This reminds me, I noticed that setgroups(2)/posix_fadvise(2) don't have CAPENABLED but arguably they should.

I'm not sure why either of those aren't. setgroups at least dates to the original work so there is presumably a reason even if I can't think of it off the top of my head.

With some other cleanups, this would keep pretty much only the syscalls annotations from this patch and move almost everything else to MAC. So my next patch could include all of the needed kernel changes, I don't think it would be very useful to break it down into separate patches anymore.

That seems reasonable.

emaste mentioned this in D34903: Allow posix_fadvise in capability mode.Apr 13 2022, 10:39 PM

In D34761#790503, @brooks wrote:

I'm not sure why either of those aren't. setgroups at least dates to the original work so there is presumably a reason even if I can't think of it off the top of my head.

Is it as simple as groups representing a global namespace?
Also as Matieu points out in D34903 groups could be used as negative permissions (e.g. https://lwn.net/Articles/621612/)

emaste mentioned this in rG52a1d90c8bfe: Allow posix_fadvise in capability mode.Apr 14 2022, 7:14 PM

sigsys_gmail.com mentioned this in D35116: mac_curtain infrastructure: everything.May 3 2022, 7:53 AM

emaste mentioned this in rGee9bc5818356: Allow posix_fadvise in capability mode.Oct 20 2022, 2:01 PM

me_cameronkatri.com added a subscriber: me_cameronkatri.com.Nov 5 2022, 4:04 PM