This is part of the infrastructure needed to support the mac_curtain module (a new sandboxing mechanism that includes pledge()/unveil() compatibility) that I announced on -hackers a week ago.
I extracted (after a lot of cleanup) the "sysfil" (syscall filters) part for review. This patch, by itself, should preserve existing behavior in all cases. Sysfils are intended to be managed by mac_curtain through a new curtainctl(2) syscall. To complete the integration, mac_curtain also needs a bunch of new MAC handlers (not included in this review).
Each "sysfil" is a category of things that the kernel can do (mostly inspired by pledge() promises). Each syscall is annotated with a sysfil bitmap representing the sysfil bits that a process needs to have enabled in its ucred to be allowed to call it.
Capsicum was slightly modified to make use of a sysfil bit. ucred's cr_flags was replaced by a cr_sysfilset field. This keeps the syscall entry check in subr_syscall.c simple, as both Capsicum and (other) sysfils are checked in one operation. The IN_CAPABILITY_MODE() macro didn't change meaning: it is still true only for processes in Capsicum mode. But there's a new IN_RESTRICTED_MODE() macro that is true for all processes that do not have all sysfils enabled (which thus includes both processes in capability mode and processes sandboxed with mac_curtain).
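To make the single-operation check concrete, here is a rough userland sketch of the idea, not the code from the patch. Everything prefixed with toy_/TOY_ is made up for illustration, including the bit layout and the assumption that capability mode is modeled as one bit that gets cleared when entering it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sysfil indices; the real list lives in sys/sysfil.h. */
enum {
	TOY_SYSFIL_NOTCAPMODE = 0,	/* assumed: cleared in capability mode */
	TOY_SYSFIL_CATCHALL,
	TOY_SYSFIL_PATH,
	TOY_SYSFIL_FDESC,
	TOY_SYSFIL_COUNT
};

typedef uint64_t toy_sysfilset_t;

/* Set with every known sysfil enabled (an unrestricted process). */
#define	TOY_ALL	((((toy_sysfilset_t)1) << TOY_SYSFIL_COUNT) - 1)

static toy_sysfilset_t
toy_bit(int index)
{
	assert(index >= 0 && index < TOY_SYSFIL_COUNT);
	return ((toy_sysfilset_t)1 << index);
}

/* Stand-ins for the ucred field and the per-syscall annotation. */
struct toy_ucred { toy_sysfilset_t cr_sysfilset; };
struct toy_sysent { toy_sysfilset_t sy_sysfilset; };

/* True only when the (assumed) capability-mode bit has been cleared. */
static bool
toy_in_capability_mode(const struct toy_ucred *cr)
{
	return ((cr->cr_sysfilset & toy_bit(TOY_SYSFIL_NOTCAPMODE)) == 0);
}

/* True as soon as any sysfil at all is disabled. */
static bool
toy_in_restricted_mode(const struct toy_ucred *cr)
{
	return ((cr->cr_sysfilset & TOY_ALL) != TOY_ALL);
}

/* One mask operation covers both Capsicum and the other sysfils. */
static bool
toy_syscall_allowed(const struct toy_ucred *cr, const struct toy_sysent *se)
{
	return ((se->sy_sysfilset & ~cr->cr_sysfilset) == 0);
}
```

In this model a syscall that is not permitted in capability mode would simply carry the "not capability mode" bit in its required set, so no separate Capsicum test is needed at syscall entry.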
If a syscall doesn't have any SYSFIL_* keyword in syscalls.master, makesyscalls.lua will give it SYSFIL_CATCHALL. That sysfil is always disabled for processes sandboxed with mac_curtain, but it stays enabled under Capsicum alone, so those syscalls behave there as they did before. (Using mac_curtain sandboxing at the same time as Capsicum is completely supported, and in that case SYSFIL_CATCHALL is disabled.)
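The defaulting rule could be pictured like this. The real logic lives in makesyscalls.lua and runs at build time; this C sketch only mirrors the rule described above, and the dflt_/DFLT_ names and bit values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t dflt_sysfilset_t;

/* Made-up bit values; only the SYSFIL_CATCHALL name comes from the patch. */
#define	DFLT_SYSFIL_CATCHALL	((dflt_sysfilset_t)1 << 0)
#define	DFLT_SYSFIL_PATH	((dflt_sysfilset_t)1 << 1)
#define	DFLT_SYSFIL_ALL		(DFLT_SYSFIL_CATCHALL | DFLT_SYSFIL_PATH)

/* A syscall with no SYSFIL_* annotation falls back to the catch-all. */
static dflt_sysfilset_t
dflt_effective_sysfils(dflt_sysfilset_t annotated)
{
	return (annotated != 0 ? annotated : DFLT_SYSFIL_CATCHALL);
}

/* A curtain-style sandbox never keeps the catch-all bit enabled. */
static dflt_sysfilset_t
dflt_curtain_set(dflt_sysfilset_t wanted)
{
	return (wanted & ~DFLT_SYSFIL_CATCHALL);
}

/* Same subset test as at syscall entry. */
static bool
dflt_allowed(dflt_sysfilset_t required, dflt_sysfilset_t enabled)
{
	return ((required & ~enabled) == 0);
}
```

With this rule, an unannotated syscall is blocked for any process sandboxed with mac_curtain regardless of which other sysfils it kept.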
Sysfils aren't meant to be exposed to userland. mac_curtain exposes intermediate "abilities" as part of its curtainctl(2) API, and the pledge(3) userland compatibility wrapper exposes "promises" compatible with OpenBSD's. So sysfils could be reorganized internally without breaking user APIs.
mac_curtain can signal processes on check failures by returning special errnos (the locking situation at the check call sites sometimes doesn't allow sending signals... but I wonder if there could be a better way to do this?). Support for those errnos is included in this patch even though nothing uses them yet.
About the syscall categorization:
There's a short comment describing each sysfil in sys/sysfil.h.
Many of the sysfils are there specifically to provide restrictions compatible with certain OpenBSD pledge() promises. Overall the syscalls have been broken up into smaller sets (so that mac_curtain can be more fine-grained), so some promises are covered by a union of sysfils. And not all pledge() promises even need a sysfil: some of the higher-level ones are enforced just with MAC handlers. Some don't even need particular kernel-level support; they're built on top of other restrictions and only exist in the userland pledge() compatibility library. Some of the sysfil assignments are more questionable than others... for example, I'm not sure that separating SYSFIL_CLOCK/SYSFIL_TIMER/SYSFIL_POSIXRT made that much sense.
Also, note that sysfils are just one part of the protection. They're the first line of defense, but they're incomplete by themselves. They need to be complemented by checks from MAC modules (and with mac_curtain this includes path-based FS access restrictions).
For example, open(2) allows you to open paths for both read and write, but that depends on the arguments, so its sysfil is just SYSFIL_PATH (misc. operations on paths) plus SYSFIL_FDESC (file descriptor management). But creat(2) always opens files for writing, so it gets SYSFIL_WPATH (path-based write access). It does not *necessarily* create a new path (the file may already exist), so it does not get SYSFIL_CPATH (path creation). But mkdir(2) can only create paths, so it gets SYSFIL_CPATH. SYSFIL_FATTR means the ability to change attributes on ANY type of FD, so fchmod() gets just SYSFIL_FATTR, but fchmodat() gets SYSFIL_PATH as well. shm_open(2) is allowed with just SYSFIL_MMAN (memory management) because opening an anonymous memory segment is allowed without the full SYSFIL_POSIXSHM (Capsicum does something similar).
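The assignments just described can be written down as a small table. This is only an illustrative sketch: the bit positions and the ex_ names are invented, and each row lists only the sysfils named in the paragraph above (the actual syscalls.master entries may carry more).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t ex_sysfilset_t;

/* Bit positions are made up; only the names come from the text. */
#define	EX_SYSFIL_PATH		((ex_sysfilset_t)1 << 0)
#define	EX_SYSFIL_FDESC		((ex_sysfilset_t)1 << 1)
#define	EX_SYSFIL_WPATH		((ex_sysfilset_t)1 << 2)
#define	EX_SYSFIL_CPATH		((ex_sysfilset_t)1 << 3)
#define	EX_SYSFIL_FATTR		((ex_sysfilset_t)1 << 4)
#define	EX_SYSFIL_MMAN		((ex_sysfilset_t)1 << 5)
#define	EX_SYSFIL_POSIXSHM	((ex_sysfilset_t)1 << 6)

struct ex_syscall {
	const char	*name;
	ex_sysfilset_t	 required;	/* all of these bits must be enabled */
};

static const struct ex_syscall ex_table[] = {
	{ "open",	EX_SYSFIL_PATH | EX_SYSFIL_FDESC },
	{ "creat",	EX_SYSFIL_WPATH },
	{ "mkdir",	EX_SYSFIL_CPATH },
	{ "fchmod",	EX_SYSFIL_FATTR },
	{ "fchmodat",	EX_SYSFIL_FATTR | EX_SYSFIL_PATH },
	{ "shm_open",	EX_SYSFIL_MMAN },	/* no EX_SYSFIL_POSIXSHM needed */
};

/* Returns whether a process with the given enabled set passes the filter. */
static bool
ex_may_call(const char *name, ex_sysfilset_t enabled)
{
	assert(name != NULL);
	for (size_t i = 0; i < sizeof(ex_table) / sizeof(ex_table[0]); i++) {
		if (strcmp(ex_table[i].name, name) == 0)
			return ((ex_table[i].required & ~enabled) == 0);
	}
	return (false);	/* syscall not covered by this sketch */
}
```

Passing this filter only means the syscall is not rejected outright; the MAC checks still decide based on the arguments and the target object.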
The idea is that a process doesn't request blocking of specific syscalls directly. It requests certain higher-level "abilities" from mac_curtain, and syscalls are sort of "opportunistically" blocked if they can't possibly be useful for any of those abilities. The MAC module does the final checks depending on the specifics of the call and what the target object is. There are currently 101 abilities supported by mac_curtain (with some dependency relationships between one another).