diff --git a/en_US.ISO8859-1/books/arch-handbook/mac/chapter.sgml b/en_US.ISO8859-1/books/arch-handbook/mac/chapter.sgml
index 6e053546e9..4a826bf9b2 100644
--- a/en_US.ISO8859-1/books/arch-handbook/mac/chapter.sgml
+++ b/en_US.ISO8859-1/books/arch-handbook/mac/chapter.sgml
@@ -1,7821 +1,7821 @@
ChrisCostelloTrustedBSD Projectchris@FreeBSD.orgRobertWatsonTrustedBSD Projectrwatson@FreeBSD.orgThe TrustedBSD MAC FrameworkMAC Documentation CopyrightThis documentation was developed for the FreeBSD Project by
Chris Costello at Safeport Network Services and Network
Associates Laboratories, the Security Research Division of
Network Associates, Inc. under DARPA/SPAWAR contract
N66001-01-C-8035 (CBOSS), as part of the DARPA
CHATS research program.Redistribution and use in source (SGML DocBook) and
'compiled' forms (SGML, HTML, PDF, PostScript, RTF and so forth)
with or without modification, are permitted provided that the
following conditions are met:Redistributions of source code (SGML DocBook) must
retain the above copyright notice, this list of conditions
and the following disclaimer as the first lines of this file
unmodified.Redistributions in compiled form (transformed to other
DTDs, converted to PDF, PostScript, RTF and other formats)
must reproduce the above copyright notice, this list of
conditions and the following disclaimer in the documentation
and/or other materials provided with the
distribution.THIS DOCUMENTATION IS PROVIDED BY THE NETWORKS ASSOCIATES
TECHNOLOGY, INC "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL NETWORKS ASSOCIATES TECHNOLOGY,
INC BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENTATION, EVEN
IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.SynopsisFreeBSD includes experimental support for several
mandatory access control policies, as well as a framework
for kernel security extensibility, the TrustedBSD MAC
Framework. The MAC Framework provides a pluggable access
control framework, permitting new security policies to
be easily linked into the kernel, loaded at boot, or loaded
dynamically at run-time. The framework provides a variety
of features to make it easier to implement new policies,
including the ability to easily tag security labels (such as
confidentiality information) onto system objects.This chapter introduces the MAC policy framework and
provides documentation for a sample MAC policy module.IntroductionThe TrustedBSD MAC framework provides a mechanism to allow
the compile-time or run-time extension of the kernel access
control model. New system policies may be implemented as
kernel modules and linked to the kernel; if multiple policy
modules are present, their results will be composed. The
MAC Framework provides a variety of access control infrastructure
services to assist policy writers, including support for
transient and persistent policy-agnostic object security
labels. This support is currently considered experimental.
-
+ Policy BackgroundMandatory Access Control (MAC) refers to a set of
access control policies that are mandatorily enforced on
users by the operating system. MAC policies may be contrasted
with Discretionary Access Control (DAC) protections, by which
non-administrative users may (at their discretion) protect
objects. In traditional UNIX systems, DAC protections include
file permissions and access control lists; MAC protections include
process controls preventing inter-user debugging and firewalls.
A variety of MAC policies have been formulated by operating system
designers and security researchers, including the Multi-Level
Security (MLS) confidentiality policy, the Biba integrity policy,
Role-Based Access Control (RBAC), and Type Enforcement (TE). Each
model bases decisions on a variety of factors, including user
identity, role, and security clearance, as well as security labels
on objects representing concepts such as data sensitivity and
integrity.The TrustedBSD MAC Framework is capable of supporting policy
modules that implement all of these policies, as well as a broad
class of system hardening policies. In addition, despite the
name, the MAC Framework can also be used to implement purely
discretionary policies, as policy modules are given substantial
flexibility in how they authorize protections.MAC Framework Kernel ArchitectureThe TrustedBSD MAC Framework permits kernel modules to
extend the operating system security policy, as well as
providing infrastructure functionality required by many
access control modules. If multiple policies are
simultaneously loaded, the MAC Framework will usefully (for
some definition of useful) compose the results of the
policies.

Kernel Elements

The MAC Framework contains a number of kernel elements:

Framework management interfaces
Concurrency and synchronization primitives
Policy registration
Extensible security label for kernel objects
Policy entry point composition operators
Label management primitives
Entry point API invoked by kernel services
Entry point API to policy modules
Entry point implementations (policy life cycle, object life cycle/label management, access control checks)
Policy-agnostic label-management system calls
mac_syscall() multiplex system call
Various security policies implemented as MAC policy modules

Management Interfaces

The TrustedBSD MAC Framework may be directly managed using
sysctls, loader tunables, and system calls.In most cases, sysctls and loader tunables modify the same
parameters, and control behavior such as enforcement of
protections relating to various kernel subsystems. In addition,
if MAC debugging support is compiled into the kernel, a variety
of counters will be maintained tracking label allocation. In
most cases, it is advised that per-subsystem enforcement
controls not be used to control policy behavior in production
environments, as they broadly impact the operation of all
active policies. Instead, per-policy controls should be
preferred to ensure proper policy operation.Loading and unloading of policy modules is performed
using the system module management system calls and other
system interfaces, including loader variables.Concurrency and SynchronizationAs the set of active policies may change at run-time,
and the invocation of entry points is non-atomic,
synchronization is required to prevent unloading or
loading of new policies while an entry point invocation
is in progress, freezing the list of policies for the
duration. This is accomplished by means of a Framework
busy count. Whenever an entry point is entered, the
busy count is incremented; whenever it is exited, the
busy count is decremented. While the busy count is
elevated, policy list changes are not permitted, and
threads attempting to modify the policy list will sleep
until the list is not busy. The busy count is protected
by a mutex, and a condition variable is used to wake up
sleepers waiting on policy list modifications.Various optimizations are used to reduce the overhead
of the busy count, including avoiding the full cost of
incrementing and decrementing if the list is empty or
contains only static entries (policies that are loaded
before the system starts, and cannot be unloaded).Policy RegistrationThe MAC Framework maintains two lists of active
policies: a static list, and a dynamic list. The lists
differ only with regard to their locking semantics: an
elevated reference count is not required to make use of
the static list. When kernel modules containing MAC
Framework policies are loaded, the policy module will
use SYSINIT to invoke a registration
function; when a policy module is unloaded,
SYSINIT will likewise invoke a
de-registration function. Registration may fail if a
policy module is loaded more than once, if insufficient
resources are available for the registration (for
example, the policy might require labeling and
insufficient labeling state might be available), or
if other policy prerequisites are not met (some
policies may only be loaded prior to boot). Likewise,
de-registration may fail if a policy refuses an
unload.Entry PointsKernel services interact with the MAC Framework in two ways:
they invoke a series of APIs to notify the framework of relevant
events, and they provide a policy-agnostic label structure in
security-relevant objects. This label structure is maintained by
the MAC Framework via label management entry points, and permits
the Framework to offer a labeling service to policy modules
through relatively non-invasive changes to the kernel subsystem
maintaining the object. For example, label structures have been
added to processes, process credentials, sockets, pipes, vnodes,
Mbufs, network interfaces, IP reassembly queues, and a variety
of other security-relevant structures. Kernel services also
invoke the MAC Framework when they perform important security
decisions, permitting policy modules to augment those decisions
based on their own criteria (possibly including data stored in
security labels).Policy CompositionWhen more than one policy module is loaded into the kernel
at a time, the results of the policy modules will be composed
by the framework using a composition operator. This operator
is currently hard-coded, and requires that all active policies
approve a request for it to occur. As policies may
return a variety of error conditions (success, access denied,
object doesn't exist, ...), a precedence operator selects the
resulting error from the set of errors returned by policies.
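
The composition rule described here can be pictured with a small userland sketch. It is not the framework's code: the names `compose_error`, `compose_checks`, and the toy policies are hypothetical, and the precedence ordering used below (ENOENT over EACCES over EPERM, any error over success) is an assumption chosen only to illustrate the idea of a precedence operator.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical check signature: returns 0 to approve, or an errno value. */
typedef int (*policy_check_fn)(int request);

/*
 * Pick the more significant of two results. The ordering is an
 * assumption made for illustration: "object does not exist" outranks
 * "access denied", which outranks "not permitted"; any error outranks
 * success.
 */
static int
compose_error(int e1, int e2)
{
	static const int precedence[] = { ENOENT, EACCES, EPERM };
	size_t i;

	for (i = 0; i < sizeof(precedence) / sizeof(precedence[0]); i++) {
		if (e1 == precedence[i] || e2 == precedence[i])
			return (precedence[i]);
	}
	return (e1 != 0 ? e1 : e2);
}

/* Ask every policy; the request proceeds only if the result is 0. */
static int
compose_checks(const policy_check_fn *policies, size_t npolicies, int request)
{
	int error = 0;
	size_t i;

	assert(policies != NULL || npolicies == 0);
	for (i = 0; i < npolicies; i++)
		error = compose_error(policies[i](request), error);
	return (error);
}

/* Three toy policies used to exercise the composition. */
static int policy_allow(int request) { (void)request; return (0); }
static int policy_deny(int request) { (void)request; return (EACCES); }
static int policy_hide(int request) { (void)request; return (ENOENT); }
```

With these toy policies, a request is approved only when every policy returns 0; a single denial is enough to reject it, and the precedence table decides which error the caller sees.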
While it is not guaranteed that the resulting composition will
be useful or secure, we've found that it is for many useful
selections of policies.Labeling SupportAs many interesting access control extensions rely on
security labels on objects, the MAC Framework provides a set
of policy-agnostic label management system calls covering
a variety of user-exposed objects. Common label types
include partition identifiers, sensitivity labels, integrity
labels, compartments, domains, roles, and types. Policy
modules participate in the internalization and externalization
of string-based labels provided by user applications, and can
expose multiple label elements to applications if desired.In-memory labels are stored in struct
label, which consists of a fixed-length array
of unions, each holding a void * pointer
and a long. Policies registering for
label storage will be assigned a "slot" identifier, which
may be used to dereference the label storage. The semantics
of the storage are left entirely up to the policy module:
modules are provided with a variety of entry points
associated with the kernel object life cycle, including
initialization, association/creation, and destruction. Using
these interfaces, it is possible to implement reference
counting and other storage mechanisms. Direct access to
the kernel object is generally not required by policy
modules to retrieve a label, as the MAC Framework generally
passes both a pointer to the object and a direct pointer
to the object's label into entry points.Initialization entry points frequently include a blocking
disposition flag indicating whether or not an initialization
is permitted to block; if blocking is not permitted, a failure
may be returned to cancel allocation of the label. This may
occur, for example, in the network stack during interrupt
handling, where blocking is not permitted. Due to the
performance cost of maintaining labels on in-flight network
packets (Mbufs), policies must specifically declare a
requirement that Mbuf labels be allocated. Dynamically
loaded policies making use of labels must be able to handle
the case where their init function has not been called on
an object, as objects may already exist when the policy is
loaded.In the case of file system labels, special support is
provided for the persistent storage of security labels in
extended attributes. Where available, EA transactions
are used to permit consistent compound updates of
security labels on vnodes.Currently, if a labeled policy permits dynamic
unloading, its state slot cannot be reclaimed.System CallsThe MAC Framework implements a number of system calls:
most of these calls support the policy-agnostic label
retrieval and manipulation APIs exposed to user
applications.The label management calls accept a label description
structure, struct mac, which
contains a series of MAC label elements. Each element
contains a character string name, and character string
value. Each policy will be given the chance to claim a
particular element name, permitting policies to expose
multiple independent elements if desired. Policy modules
perform the internalization and externalization between
kernel labels and user-provided labels via entry points,
permitting a variety of semantics. Label management system
calls are generally wrapped by user library functions to
perform memory allocation and error handling.In addition, mac_syscall()
permits policy modules to create new system calls without
allocating new system call numbers. mac_execve()
permits an atomic process credential label change when
executing a new image.MAC Policy ArchitectureSecurity policies are either linked directly into the kernel,
or compiled into loadable kernel modules that may be loaded at
boot, or dynamically using the module loading system calls at
runtime. Policy modules interact with the system through a
set of declared entry points, providing access to a stream of
system events and permitting the policy to influence access
control decisions. Each policy contains a number of elements:

Optional configuration parameters for policy.
Centralized implementation of the policy logic and parameters.
Optional implementation of policy life cycle events, such as initialization and destruction.
Optional support for initializing, maintaining, and destroying labels on selected kernel objects.
Optional support for user process inspection and modification of labels on selected objects.
Implementation of selected access control entry points that are of interest to the policy.
Declaration of policy identity, module entry points, and policy properties.

Policy Declaration

Modules may be declared using the
MAC_POLICY_SET() macro, which names the
policy, provides a reference to the MAC entry point vector,
provides load-time flags determining how the policy framework
should handle the policy, and optionally requests the
allocation of label state by the framework.

static struct mac_policy_ops mac_policy_ops =
{
    .mpo_destroy = mac_policy_destroy,
    .mpo_init = mac_policy_init,
    .mpo_init_bpfdesc_label = mac_policy_init_bpfdesc_label,
    .mpo_init_cred_label = mac_policy_init_label,
    /* ... */
    .mpo_check_vnode_setutimes = mac_policy_check_vnode_setutimes,
    .mpo_check_vnode_stat = mac_policy_check_vnode_stat,
    .mpo_check_vnode_write = mac_policy_check_vnode_write,
};

The MAC policy entry point vector,
mac_policy_ops in this example, associates
functions defined in the module with specific entry points. A
complete listing of available entry points and their
prototypes may be found in the MAC entry point reference
section. Of specific interest during module registration are
the .mpo_destroy and .mpo_init
entry points. .mpo_init will be invoked once a
policy is successfully registered with the module framework
but prior to any other entry points becoming active. This
permits the policy to perform any policy-specific allocation
and initialization, such as initialization of any data or
locks. .mpo_destroy will be invoked when a
policy module is unloaded to permit releasing of any allocated
memory and destruction of locks. Currently, these two entry
points are invoked with the MAC policy list mutex held to
prevent any other entry points from being invoked: this will
be changed, but in the meantime, policies should be careful
about what kernel primitives they invoke so as to avoid lock
ordering or sleeping problems.The policy declaration's module name field exists so that
the module may be uniquely identified for the purposes of
module dependencies. An appropriate string should be selected.
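
As a rough userland illustration of what a declaration carries and what registration does with it, the sketch below models a policy with a short name, a full name, an entry point vector, and a flags field. Every identifier in it (`sketch_policy_conf`, `sketch_policy_register`, the `spc_`/`spo_` fields) is hypothetical, and the `EEXIST` and `ENOMEM` return values are assumptions; it does not reproduce the kernel's structures or the `MAC_POLICY_SET()` macro.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the framework's declaration structures. */
struct sketch_policy_conf;

struct sketch_policy_ops {
	void	(*spo_init)(struct sketch_policy_conf *conf);
	void	(*spo_destroy)(struct sketch_policy_conf *conf);
};

struct sketch_policy_conf {
	const char			*spc_name;	/* short, unique module name */
	const char			*spc_fullname;	/* shown to the user */
	const struct sketch_policy_ops	*spc_ops;	/* entry point vector */
	int				 spc_flags;	/* load-time flags */
};

#define SKETCH_MAX_POLICIES 8
static struct sketch_policy_conf *sketch_policies[SKETCH_MAX_POLICIES];
static size_t sketch_npolicies;

/* Register a policy: reject duplicates by name, then run its init hook. */
static int
sketch_policy_register(struct sketch_policy_conf *conf)
{
	size_t i;

	assert(conf != NULL && conf->spc_ops != NULL);
	for (i = 0; i < sketch_npolicies; i++) {
		if (strcmp(sketch_policies[i]->spc_name, conf->spc_name) == 0)
			return (EEXIST);	/* assumed error value */
	}
	if (sketch_npolicies == SKETCH_MAX_POLICIES)
		return (ENOMEM);		/* assumed error value */
	sketch_policies[sketch_npolicies++] = conf;
	/* init runs once registration has succeeded */
	if (conf->spc_ops->spo_init != NULL)
		conf->spc_ops->spo_init(conf);
	return (0);
}

/* A toy policy that only counts how often its init hook runs. */
static int sketch_init_calls;

static void
sketch_init(struct sketch_policy_conf *conf)
{
	(void)conf;
	sketch_init_calls++;
}

static const struct sketch_policy_ops sketch_ops = {
	.spo_init = sketch_init,
	/* .spo_destroy is left NULL: only selected entry points are set */
};

static struct sketch_policy_conf sketch_conf = {
	.spc_name = "mac_sketch",
	.spc_fullname = "Sketch MAC policy",
	.spc_ops = &sketch_ops,
	.spc_flags = 0,
};
```

Registering `sketch_conf` once succeeds and runs the init hook; a second attempt with the same name is refused, mirroring the rule that a policy module may not be loaded more than once.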
The full string name of the policy is displayed to the user
via the kernel log during load and unload events, and also
exported when providing status information to userland
processes.Policy FlagsThe policy declaration flags field permits the module to
provide the framework with information about its capabilities at
the time the module is loaded. Currently, three flags are
defined:

MPC_LOADTIME_FLAG_UNLOADOK
This flag indicates that the policy module may be
unloaded. If this flag is not provided, then the policy
framework will reject requests to unload the module.
Modules that allocate label state and are unable to free
that state at runtime might omit this flag.

MPC_LOADTIME_FLAG_NOTLATE
This flag indicates that the policy module
must be loaded and initialized early in the boot
process. If the flag is specified, attempts to register
the module following boot will be rejected. The flag
may be used by policies that require pervasive labeling
of all system objects, and cannot handle objects that
have not been properly initialized by the policy.

MPC_LOADTIME_FLAG_LABELMBUFS
This flag indicates that the policy module requires
labeling of Mbufs, and that memory should always be
allocated for the storage of Mbuf labels. By default,
the MAC Framework will not allocate label storage for
Mbufs unless at least one loaded policy has this flag
set. This measurably improves network performance when
policies do not require Mbuf labeling. A kernel option,
MAC_ALWAYS_LABEL_MBUF, exists to
force the MAC Framework to allocate Mbuf label storage
regardless of the setting of this flag, and may be
useful in some environments.

Policies using the
MPC_LOADTIME_FLAG_LABELMBUFS flag without the
MPC_LOADTIME_FLAG_NOTLATE flag set
must be able to correctly handle NULL
Mbuf label pointers passed into entry points. This is necessary
as in-flight Mbufs without label storage may persist after a
policy enabling Mbuf labeling has been loaded. If a policy
is loaded before the network subsystem is active (i.e., the
policy is not being loaded late), then all Mbufs are guaranteed
to have label storage.Policy Entry PointsFour classes of entry points are offered to policies
registered with the framework: entry points associated with
the registration and management of policies, entry points
denoting initialization, creation, destruction, and other life
cycle events for kernel objects, events associated with access
control decisions that the policy module may influence, and
calls associated with the management of labels on objects. In
addition, a mac_syscall() entry point is
provided so that policies may extend the kernel interface
without registering new system calls.Policy module writers should be aware of the kernel
locking strategy, as well as what object locks are available
during which entry points. Writers should attempt to avoid
deadlock scenarios by avoiding grabbing non-leaf locks inside
of entry points, and also follow the locking protocol for
object access and modification. In particular, writers should
be aware that while necessary locks to access objects and
their labels are generally held, sufficient locks to modify an
object or its label may not be present for all entry points.
Locking information for arguments is documented in the MAC
framework entry point document.Policy entry points will pass a reference to the object
label along with the object itself. This permits labeled
policies to be unaware of the internals of the object yet
still make decisions based on the label. The exception to this
is the process credential, which is assumed to be understood
by policies as a first class security object in the kernel.
Policies that do not implement labels on kernel objects will
be passed NULL pointers for label arguments to entry
points.MAC Policy Entry Point ReferenceGeneral-Purpose Module Entry Points&mac.mpo;_initvoid
&mac.mpo;_initstruct mac_policy_conf
*conf
&mac.thead;
confMAC policy definitionPolicy load event. The policy list mutex is held, so
caution should be applied.&mac.mpo;_destroyvoid
&mac.mpo;_destroystruct mac_policy_conf
*conf
&mac.thead;
confMAC policy definitionPolicy unload event. The policy list mutex is held, so
caution should be applied.&mac.mpo;_syscallint
&mac.mpo;_syscallstruct thread
*tdint callvoid *arg
&mac.thead;
tdCalling threadcallSyscall numberargPointer to syscall argumentsThis entry point provides a policy-multiplexed system
call so that policies may provide additional services to
user processes without registering specific system calls.
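
The multiplexing idea can be sketched in userland as a lookup by policy name followed by a call into that policy's handler. The names below (`sketch_mac_syscall`, `sketch_policy`, `sketch_double`) are hypothetical, the handler omits the thread argument because it has no userland equivalent, and returning `ENOSYS` when no handler is found is an assumption. Copying arguments in from user space is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-policy handler: call number plus opaque argument. */
typedef int (*sketch_syscall_fn)(int call, void *arg);

struct sketch_policy {
	const char		*sp_name;	/* name given at registration */
	sketch_syscall_fn	 sp_syscall;	/* NULL if not implemented */
};

/* Route a call to the policy whose registered name matches target. */
static int
sketch_mac_syscall(const struct sketch_policy *policies, size_t n,
    const char *target, int call, void *arg)
{
	size_t i;

	assert(target != NULL);
	for (i = 0; i < n; i++) {
		if (strcmp(policies[i].sp_name, target) != 0)
			continue;
		if (policies[i].sp_syscall == NULL)
			break;
		return (policies[i].sp_syscall(call, arg));
	}
	return (ENOSYS);	/* assumption: no such policy, or no handler */
}

/* Toy handler: call 1 doubles the integer that arg points to. */
static int
sketch_double(int call, void *arg)
{
	if (call != 1 || arg == NULL)
		return (EINVAL);
	*(int *)arg *= 2;
	return (0);
}
```

A caller names the policy it wants, and only that policy interprets the call number and argument, which is why no per-policy system call needs to be registered.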
The policy name provided during registration is used to
demux calls from userland, and the arguments will be
forwarded to this entry point. When implementing new
services, security modules should be sure to invoke
appropriate access control checks from the MAC framework as
needed. For example, if a policy implements an augmented
signal functionality, it should call the necessary signal
access control checks to invoke the MAC framework and other
registered policies.Modules must currently perform the
copyin() of the syscall data on their
own.&mac.mpo;_thread_userretvoid
&mac.mpo;_thread_userretstruct thread
*td
&mac.thead;
tdReturning threadThis entry point permits policy modules to perform
MAC-related events when a thread returns to user space.
This is required for policies that have floating process
labels, as it is not always possible to acquire the process
lock at arbitrary points in the stack during system call
processing; process labels might represent traditional
authentication data, process history information, or other
data.Label Operations&mac.mpo;_init_bpfdesc_labelvoid
&mac.mpo;_init_bpfdesc_labelstruct label
*label
&mac.thead;
labelNew label to applyInitialize the label on a newly instantiated bpfdesc (BPF
descriptor).&mac.mpo;_init_cred_labelvoid
&mac.mpo;_init_cred_labelstruct label
*label
&mac.thead;
labelNew label to initializeInitialize the label for a newly instantiated
user credential.&mac.mpo;_init_devfsdirent_labelvoid
&mac.mpo;_init_devfsdirent_labelstruct label
*label
&mac.thead;
labelNew label to applyInitialize the label on a newly instantiated devfs
entry.&mac.mpo;_init_ifnet_labelvoid
&mac.mpo;_init_ifnet_labelstruct label
*label
&mac.thead;
labelNew label to applyInitialize the label on a newly instantiated network
interface.&mac.mpo;_init_ipq_labelvoid
&mac.mpo;_init_ipq_labelstruct label
*labelint flag
&mac.thead;
labelNew label to applyflagBlocking/non-blocking &man.malloc.9;; see
belowInitialize the label on a newly instantiated IP fragment
reassembly queue. The flag field may
be one of M_WAITOK and M_NOWAIT,
and should be employed to avoid performing a blocking
&man.malloc.9; during this initialization call. IP fragment
reassembly queue allocation frequently occurs in performance
sensitive environments, and the implementation should be careful
to avoid blocking or long-lived operations. This entry point
is permitted to fail, resulting in the failure to allocate
the IP fragment reassembly queue.&mac.mpo;_init_mbuf_labelvoid
&mac.mpo;_init_mbuf_labelint flagstruct label
*label
&mac.thead;
flagBlocking/non-blocking &man.malloc.9;; see
belowlabelPolicy label to initializeInitialize the label on a newly instantiated mbuf packet
header (mbuf). The
flag field may be one of
M_WAITOK and M_NOWAIT, and
should be employed to avoid performing a blocking
&man.malloc.9; during this initialization call. Mbuf
allocation frequently occurs in performance sensitive
environments, and the implementation should be careful to
avoid blocking or long-lived operations. This entry point
is permitted to fail, resulting in the failure to allocate
the mbuf header.&mac.mpo;_init_mount_labelvoid
&mac.mpo;_init_mount_labelstruct label
*mntlabelstruct label
*fslabel
&mac.thead;
mntlabelPolicy label to be initialized for the mount
itselffslabelPolicy label to be initialized for the file
systemInitialize the labels on a newly instantiated mount
point.&mac.mpo;_init_mount_fs_labelvoid
&mac.mpo;_init_mount_fs_labelstruct label
*label
&mac.thead;
labelLabel to be initializedInitialize the label on a newly mounted file
system.&mac.mpo;_init_pipe_labelvoid
&mac.mpo;_init_pipe_labelstruct
label*label
&mac.thead;
labelLabel to be filled inInitialize a label for a newly instantiated pipe.&mac.mpo;_init_socket_labelvoid
&mac.mpo;_init_socket_labelstruct label
*labelint flag
&mac.thead;
labelNew label to initializeflag&man.malloc.9; flagsInitialize a label for a newly instantiated
socket.&mac.mpo;_init_socket_peer_labelvoid
&mac.mpo;_init_socket_peer_labelstruct label
*labelint flag
&mac.thead;
labelNew label to initializeflag&man.malloc.9; flagsInitialize the peer label for a newly instantiated
socket.&mac.mpo;_init_proc_labelvoid
&mac.mpo;_init_proc_labelstruct label
*label
&mac.thead;
labelNew label to initializeInitialize the label for a newly instantiated
process.&mac.mpo;_init_vnode_labelvoid
&mac.mpo;_init_vnode_labelstruct label
*label
&mac.thead;
labelNew label to initializeInitialize the label on a newly instantiated vnode.&mac.mpo;_destroy_bpfdesc_labelvoid
&mac.mpo;_destroy_bpfdesc_labelstruct label
*label
&mac.thead;
labelbpfdesc labelDestroy the label on a BPF descriptor. In this entry
point a policy should free any internal storage associated
with label so that it may be
destroyed.&mac.mpo;_destroy_cred_labelvoid
&mac.mpo;_destroy_cred_labelstruct label
*label
&mac.thead;
labelLabel being destroyedDestroy the label on a credential. In this entry point,
a policy module should free any internal storage associated
with label so that it may be
destroyed.&mac.mpo;_destroy_devfsdirent_labelvoid
&mac.mpo;_destroy_devfsdirent_labelstruct label
*label
&mac.thead;
labelLabel being destroyedDestroy the label on a devfs entry. In this entry
point, a policy module should free any internal storage
associated with label so that it may
be destroyed.&mac.mpo;_destroy_ifnet_labelvoid
&mac.mpo;_destroy_ifnet_labelstruct label
*label
&mac.thead;
labelLabel being destroyedDestroy the label on a removed interface. In this entry
point, a policy module should free any internal storage
associated with label so that it may
be destroyed.&mac.mpo;_destroy_ipq_labelvoid
&mac.mpo;_destroy_ipq_labelstruct label
*label
&mac.thead;
labelLabel being destroyedDestroy the label on an IP fragment queue. In this
entry point, a policy module should free any internal
storage associated with label so that
it may be destroyed.&mac.mpo;_destroy_mbuf_labelvoid
&mac.mpo;_destroy_mbuf_labelstruct label
*label
&mac.thead;
labelLabel being destroyedDestroy the label on an mbuf header. In this entry
point, a policy module should free any internal storage
associated with label so that it may
be destroyed.&mac.mpo;_destroy_mount_labelvoid
&mac.mpo;_destroy_mount_labelstruct label
*label
&mac.thead;
labelMount point label being destroyedDestroy the label on a mount point. In this entry
point, a policy module should free the internal storage
associated with label so that it
may be destroyed.&mac.mpo;_destroy_mount_labelvoid
&mac.mpo;_destroy_mount_labelstruct label
*mntlabelstruct label
*fslabel
&mac.thead;
mntlabelMount point label being destroyedfslabelFile system label being destroyedDestroy the labels on a mount point. In this entry
point, a policy module should free the internal storage
associated with mntlabel and
fslabel so that they may be
destroyed.&mac.mpo;_destroy_socket_labelvoid
&mac.mpo;_destroy_socket_labelstruct label
*label
&mac.thead;
labelSocket label being destroyedDestroy the label on a socket. In this entry point, a
policy module should free any internal storage associated
with label so that it may be
destroyed.&mac.mpo;_destroy_socket_peer_labelvoid
&mac.mpo;_destroy_socket_peer_labelstruct label
*peerlabel
&mac.thead;
peerlabelSocket peer label being destroyedDestroy the peer label on a socket. In this entry
point, a policy module should free any internal storage
associated with peerlabel so that it may
be destroyed.&mac.mpo;_destroy_pipe_labelvoid
&mac.mpo;_destroy_pipe_labelstruct label
*label
&mac.thead;
labelPipe labelDestroy the label on a pipe. In this entry point, a
policy module should free any internal storage associated
with label so that it may be
destroyed.&mac.mpo;_destroy_proc_labelvoid
&mac.mpo;_destroy_proc_labelstruct label
*label
&mac.thead;
labelProcess labelDestroy the label on a process. In this entry point, a
policy module should free any internal storage associated
with label so that it may be
destroyed.&mac.mpo;_destroy_vnode_labelvoid
&mac.mpo;_destroy_vnode_labelstruct label
*label
&mac.thead;
labelVnode labelDestroy the label on a vnode. In this entry point, a
policy module should free any internal storage associated
with label so that it may be
destroyed.&mac.mpo;_copy_mbuf_labelvoid
&mac.mpo;_copy_mbuf_labelstruct label
*srcstruct label
*dest
&mac.thead;
srcSource labeldestDestination labelCopy the label information in
src into
dest.&mac.mpo;_copy_pipe_labelvoid
&mac.mpo;_copy_pipe_labelstruct label
*srcstruct label
*dest
&mac.thead;
srcSource labeldestDestination labelCopy the label information in
src into
dest.&mac.mpo;_copy_vnode_labelvoid
&mac.mpo;_copy_vnode_labelstruct label
*srcstruct label
*dest
&mac.thead;
srcSource labeldestDestination labelCopy the label information in
src into
dest.&mac.mpo;_externalize_cred_labelint
&mac.mpo;_externalize_cred_label
&mac.externalize.paramdefs;
&mac.thead;
&mac.externalize.tbody;
&mac.externalize.para;
&mac.mpo;_externalize_ifnet_labelint
&mac.mpo;_externalize_ifnet_label
&mac.externalize.paramdefs;
&mac.thead;
&mac.externalize.tbody;
&mac.externalize.para;
&mac.mpo;_externalize_pipe_labelint
&mac.mpo;_externalize_pipe_label
&mac.externalize.paramdefs;
&mac.thead;
&mac.externalize.tbody;
&mac.externalize.para;
&mac.mpo;_externalize_socket_labelint
&mac.mpo;_externalize_socket_label
&mac.externalize.paramdefs;
&mac.thead;
&mac.externalize.tbody;
&mac.externalize.para;
&mac.mpo;_externalize_socket_peer_labelint
&mac.mpo;_externalize_socket_peer_label
&mac.externalize.paramdefs;
&mac.thead;
&mac.externalize.tbody;
&mac.externalize.para;
&mac.mpo;_externalize_vnode_labelint
&mac.mpo;_externalize_vnode_label
&mac.externalize.paramdefs;
&mac.thead;
&mac.externalize.tbody;
&mac.externalize.para;
&mac.mpo;_internalize_cred_labelint
&mac.mpo;_internalize_cred_label
&mac.internalize.paramdefs;
&mac.thead;
&mac.internalize.tbody;
&mac.internalize.para;
&mac.mpo;_internalize_ifnet_labelint
&mac.mpo;_internalize_ifnet_label
&mac.internalize.paramdefs;
&mac.thead;
&mac.internalize.tbody;
&mac.internalize.para;
&mac.mpo;_internalize_pipe_labelint
&mac.mpo;_internalize_pipe_label
&mac.internalize.paramdefs;
&mac.thead;
&mac.internalize.tbody;
&mac.internalize.para;
&mac.mpo;_internalize_socket_labelint
&mac.mpo;_internalize_socket_label
&mac.internalize.paramdefs;
&mac.thead;
&mac.internalize.tbody;
&mac.internalize.para;
&mac.mpo;_internalize_vnode_labelint
&mac.mpo;_internalize_vnode_label
&mac.internalize.paramdefs;
&mac.thead;
&mac.internalize.tbody;
&mac.internalize.para;
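
To make the init, destroy, and copy entry points in this section more concrete, here is a minimal userland sketch of a policy that hangs its own storage off a label slot. It assumes a simplified `struct label` matching the earlier description (a fixed-length array of unions holding a pointer and a long); the field names, `SKETCH_MAX_SLOTS`, the fixed slot number, and all `sketch_` identifiers are hypothetical, and `calloc()`/`free()` stand in for kernel allocation.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_MAX_SLOTS 4	/* hypothetical slot count */

/* Simplified userland stand-in for the kernel's struct label. */
struct label {
	union {
		void	*l_ptr;
		long	 l_long;
	} l_perpolicy[SKETCH_MAX_SLOTS];
};

/* Slot the framework would hand out at registration (fixed here). */
static int sketch_slot = 1;

/* Hypothetical per-policy label payload. */
struct sketch_label {
	int	sl_level;
};

#define SLOT(l)	((struct sketch_label *)(l)->l_perpolicy[sketch_slot].l_ptr)

/* Init: allocate zeroed storage and attach it to this policy's slot. */
static void
sketch_init_label(struct label *label)
{
	label->l_perpolicy[sketch_slot].l_ptr =
	    calloc(1, sizeof(struct sketch_label));
}

/* Destroy: release the storage so the label itself can be destroyed. */
static void
sketch_destroy_label(struct label *label)
{
	free(label->l_perpolicy[sketch_slot].l_ptr);
	label->l_perpolicy[sketch_slot].l_ptr = NULL;
}

/* Copy: duplicate the payload in src into dest, if both exist. */
static void
sketch_copy_label(struct label *src, struct label *dest)
{
	assert(src != NULL && dest != NULL);
	if (SLOT(src) == NULL || SLOT(dest) == NULL)
		return;
	*SLOT(dest) = *SLOT(src);
}
```

Because each policy touches only its own slot, several policies can keep independent state on the same object without knowing about one another.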
Label EventsThis class of entry points is used by the MAC framework to
permit policies to maintain label information on kernel
objects. For each labeled kernel object of interest to a MAC
policy, entry points may be registered for relevant life cycle
events. All objects implement initialization, creation, and
destruction hooks. Some objects will also implement
relabeling, allowing user processes to change the labels on
objects. Some objects will also implement object-specific
events, such as label events associated with IP reassembly. A
typical labeled object will have the following life cycle of
entry points:

Label initialization          o
(object-specific wait)         \
Label creation                  o
                                 \
Relabel events,                   o--<--.
Various object-specific,          |     |
Access control events             ~-->--o
                                         \
Label destruction                         o

Label initialization permits policies to allocate memory
and set initial values for labels without context for the use
of the object. The label slot allocated to a policy will be
zeroed by default, so some policies may not need to perform
initialization.Label creation occurs when the kernel structure is
associated with an actual kernel object. For example, mbufs
may be allocated and remain unused in a pool until they are
required. mbuf allocation causes label initialization on the
mbuf to take place, but mbuf creation occurs when the mbuf is
associated with a datagram. Typically, context will be
provided for a creation event, including the circumstances of
the creation, and labels of other relevant objects in the
creation process. For example, when an mbuf is created from a
socket, the socket and its label will be presented to
registered policies in addition to the new mbuf and its label.
Memory allocation in creation events is discouraged, as it may
occur in performance-sensitive parts of the kernel; in
addition, creation calls are not permitted to fail, so a
failure to allocate memory cannot be reported.

Object-specific events do not generally fall into the
other broad classes of label events, but will generally
provide an opportunity to modify or update the label on an
object based on additional context. For example, the label on
an IP fragment reassembly queue may be updated during the
MAC_UPDATE_IPQ entry point as a result of the
acceptance of an additional mbuf to that queue.Access control events are discussed in detail in the
following section.Label destruction permits policies to release storage or
state associated with a label during its association with an
object so that the kernel data structures supporting the
object may be reused or released.In addition to labels associated with specific kernel
objects, an additional class of labels exists: temporary
labels. These labels are used to store update information
submitted by user processes. These labels are initialized and
destroyed as with other label types, but the creation event is
MAC_INTERNALIZE, which accepts a user label
to be converted to an in-kernel representation.File System Object Labeling Event Operations&mac.mpo;_associate_vnode_devfsvoid
&mac.mpo;_associate_vnode_devfsstruct mount
*mpstruct label
*fslabelstruct devfs_dirent
*destruct label
*delabelstruct vnode
*vpstruct label
*vlabel
&mac.thead;
mpDevfs mount pointfslabelDevfs file system label
(mp->mnt_fslabel)deDevfs directory entrydelabelPolicy label associated with
devpvnode associated with
devlabelPolicy label associated with
vpFill in the label (vlabel) for
a newly created devfs vnode based on the devfs directory
entry passed in de and its
label.&mac.mpo;_associate_vnode_extattrint
&mac.mpo;_associate_vnode_extattrstruct mount
*mpstruct label
*fslabelstruct vnode
*vpstruct label
*vlabel
&mac.thead;
mpFile system mount pointfslabelFile system labelvpVnode to labelvlabelPolicy label associated with
vpAttempt to retrieve the label for
vp from the file system extended
attributes. Upon success, the value 0
is returned. Should extended attribute retrieval not be
supported, an accepted fallback is to copy
fslabel into
vlabel. In the event of an error,
an appropriate value for errno should
be returned.&mac.mpo;_associate_vnode_singlelabelvoid
&mac.mpo;_associate_vnode_singlelabelstruct mount
*mpstruct label
*fslabelstruct vnode
*vpstruct label
*vlabel
&mac.thead;
mpFile system mount pointfslabelFile system labelvpVnode to labelvlabelPolicy label associated with
vpOn non-multilabel file systems, this entry point is
called to set the policy label for
vp based on the file system label,
fslabel.&mac.mpo;_create_devfs_devicevoid
&mac.mpo;_create_devfs_devicedev_t devstruct devfs_dirent
*devfs_direntstruct label
*label
&mac.thead;
devDevice corresponding with
devfs_direntdevfs_direntDevfs directory entry to be labeled.labelLabel for devfs_dirent
to be filled in.Fill out the label on a devfs_dirent being created for
the passed device. This call will be made when the device
file system is mounted, regenerated, or a new device is made
available.&mac.mpo;_create_devfs_directoryvoid
&mac.mpo;_create_devfs_directorychar *dirnameint dirnamelenstruct devfs_dirent
*devfs_direntstruct label
*label
&mac.thead;
dirnameName of directory being createddirnamelenLength of string
dirnamedevfs_direntDevfs directory entry for directory being
created.Fill out the label on a devfs_dirent being created for
the passed directory. This call will be made when the device
file system is mounted, regenerated, or a new device
requiring a specific directory hierarchy is made
available.&mac.mpo;_create_devfs_symlinkvoid
&mac.mpo;_create_devfs_symlinkstruct ucred
*credstruct mount
*mpstruct devfs_dirent
*ddstruct label
*ddlabelstruct devfs_dirent
*destruct label
*delabel
&mac.thead;
credSubject credentialmpDevfs mount pointddLink destinationddlabelLabel associated with
dddeSymlink entrydelabelLabel associated with
deFill in the label (delabel) for
a newly created &man.devfs.5; symbolic link entry.&mac.mpo;_create_vnode_extattrint
&mac.mpo;_create_vnode_extattrstruct ucred
*credstruct mount
*mpstruct label
*fslabelstruct vnode
*dvpstruct label
*dlabelstruct vnode
*vpstruct label
*vlabelstruct componentname
*cnp
&mac.thead;
credSubject credentialmpFile system mount pointfslabelFile system labeldvpParent directory vnodedlabelLabel associated with
dvpvpNewly created vnodevlabelPolicy label associated with
vpcnpComponent name for
vpWrite out the label for vp to
the appropriate extended attribute. If the write
succeeds, fill in vlabel with the
label, and return 0. Otherwise,
return an appropriate error.&mac.mpo;_create_mountvoid
&mac.mpo;_create_mountstruct ucred
*credstruct mount
*mpstruct label
*mntlabelstruct label
*fslabel
&mac.thead;
credSubject credentialmpObject; file system being mountedmntlabelPolicy label to be filled in for
mpfslabelPolicy label for the file system
mp mounts.Fill out the labels on the mount point being created by
the passed subject credential. This call will be made when
a new file system is mounted.&mac.mpo;_create_root_mountvoid
&mac.mpo;_create_root_mountstruct ucred
*credstruct mount
*mpstruct label
*mntlabelstruct label
*fslabel
&mac.thead;
See .Fill out the labels on the mount point being created by
the passed subject credential. This call will be made when
the root file system is mounted, after
&mac.mpo;_create_mount.&mac.mpo;_relabel_vnodevoid
&mac.mpo;_relabel_vnodestruct ucred
*credstruct vnode
*vpstruct label
*vnodelabelstruct label
*newlabel
&mac.thead;
credSubject credentialvpvnode to relabelvnodelabelExisting policy label for
vpnewlabelNew, possibly partial label to replace
vnodelabelUpdate the label on the passed vnode given the passed
update vnode label and the passed subject credential.&mac.mpo;_setlabel_vnode_extattrint
&mac.mpo;_setlabel_vnode_extattrstruct ucred
*credstruct vnode
*vpstruct label
*vlabelstruct label
*intlabel
&mac.thead;
credSubject credentialvpVnode for which the label is being
writtenvlabelPolicy label associated with
vpintlabelLabel to write outWrite out the policy from
intlabel to an extended
attribute. This is called from
vop_stdcreatevnode_ea.&mac.mpo;_update_devfsdirentvoid
&mac.mpo;_update_devfsdirentstruct devfs_dirent
*devfs_direntstruct label
*direntlabelstruct vnode
*vpstruct label
*vnodelabel
&mac.thead;
devfs_direntObject; devfs directory entrydirentlabelPolicy label for
devfs_dirent to be
updated.vpParent vnodeLockedvnodelabelPolicy label for
vpUpdate the devfs_dirent label
from the passed devfs vnode label. This call will be made
when a devfs vnode has been successfully relabeled to commit
the label change such that it lasts even if the vnode is
recycled. It will also be made when a symlink is
created in devfs, following a call to
mac_vnode_create_from_vnode to
initialize the vnode label.IPC Object Labeling Event Operations&mac.mpo;_create_mbuf_from_socketvoid
&mac.mpo;_create_mbuf_from_socketstruct socket
*sostruct label
*socketlabelstruct mbuf *mstruct label
*mbuflabel
&mac.thead;
soSocketSocket locking WIPsocketlabelPolicy label for
somObject; mbufmbuflabelPolicy label to fill in for
mSet the label on a newly created mbuf header from the
passed socket label. This call is made when a new datagram
or message is generated by the socket and stored in the
passed mbuf.&mac.mpo;_create_pipevoid
&mac.mpo;_create_pipestruct ucred
*credstruct pipe
*pipestruct label
*pipelabel
&mac.thead;
credSubject credentialpipePipepipelabelPolicy label associated with
pipeSet the label on a newly created pipe from the passed
subject credential. This call is made when a new pipe is
created.&mac.mpo;_create_socketvoid
&mac.mpo;_create_socketstruct ucred
*credstruct socket
*sostruct label
*socketlabel
&mac.thead;
credSubject credentialImmutablesoObject; socket to labelsocketlabelLabel to fill in for
soSet the label on a newly created socket from the passed
subject credential. This call is made when a socket is
created.&mac.mpo;_create_socket_from_socketvoid
&mac.mpo;_create_socket_from_socketstruct socket
*oldsocketstruct label
*oldsocketlabelstruct socket
*newsocketstruct label
*newsocketlabel
&mac.thead;
oldsocketListening socketoldsocketlabelPolicy label associated with
oldsocketnewsocketNew socketnewsocketlabelPolicy label associated with
newsocketLabel a socket, newsocket,
newly &man.accept.2;ed, based on the &man.listen.2;
socket, oldsocket.&mac.mpo;_relabel_pipevoid
&mac.mpo;_relabel_pipestruct ucred
*credstruct pipe
*pipestruct label
*oldlabelstruct label
*newlabel
&mac.thead;
credSubject credentialpipePipeoldlabelCurrent policy label associated with
pipenewlabelPolicy label update to apply to
pipeApply a new label, newlabel, to
pipe.&mac.mpo;_relabel_socketvoid
&mac.mpo;_relabel_socketstruct ucred
*credstruct socket
*sostruct label
*oldlabelstruct label
*newlabel
&mac.thead;
credSubject credentialImmutablesoObject; socketoldlabelCurrent label for
sonewlabelLabel update for
soUpdate the label on a socket from the passed socket
label update.&mac.mpo;_set_socket_peer_from_mbufvoid
&mac.mpo;_set_socket_peer_from_mbufstruct mbuf
*mbufstruct label
*mbuflabelstruct label
*oldlabelstruct label
*newlabel
&mac.thead;
mbufFirst datagram received over socketmbuflabelLabel for mbufoldlabelCurrent label for the socketnewlabelPolicy label to be filled out for the
socketSet the peer label on a stream socket from the passed
mbuf label. This call will be made when the first datagram
is received by the stream socket, with the exception of Unix
domain sockets.&mac.mpo;_set_socket_peer_from_socketvoid
&mac.mpo;_set_socket_peer_from_socketstruct socket
*oldsocketstruct label
*oldsocketlabelstruct socket
*newsocketstruct label
*newsocketpeerlabel
&mac.thead;
oldsocketLocal socketoldsocketlabelPolicy label for
oldsocketnewsocketPeer socketnewsocketpeerlabelPolicy label to fill in for
newsocketSet the peer label on a stream UNIX domain socket from
the passed remote socket endpoint. This call will be made
when the socket pair is connected, and will be made for both
endpoints.Network Object Labeling Event Operations&mac.mpo;_create_bpfdescvoid
&mac.mpo;_create_bpfdescstruct ucred
*credstruct bpf_d
*bpf_dstruct label
*bpflabel
&mac.thead;
credSubject credentialImmutablebpf_dObject; bpf descriptorbpflabelPolicy label to be filled in for
bpf_dSet the label on a newly created BPF descriptor from the
passed subject credential. This call will be made when a
BPF device node is opened by a process with the passed
subject credential.&mac.mpo;_create_ifnetvoid
&mac.mpo;_create_ifnetstruct ifnet
*ifnetstruct label
*ifnetlabel
&mac.thead;
ifnetNetwork interfaceifnetlabelPolicy label to fill in for
ifnetSet the label on a newly created interface. This call
may be made when a new physical interface becomes available
to the system, or when a pseudo-interface is instantiated
during the boot or as a result of a user action.&mac.mpo;_create_ipqvoid
&mac.mpo;_create_ipqstruct mbuf
*fragmentstruct label
*fragmentlabelstruct ipq
*ipqstruct label
*ipqlabel
&mac.thead;
fragmentFirst received IP fragmentfragmentlabelPolicy label for
fragmentipqIP reassembly queue to be labeledipqlabelPolicy label to be filled in for
ipqSet the label on a newly created IP fragment reassembly
queue from the mbuf header of the first received
fragment.&mac.mpo;_create_datagram_from_ipqvoid
&mac.mpo;_create_datagram_from_ipqstruct ipq
*ipqstruct label
*ipqlabelstruct mbuf
*datagramstruct label
*datagramlabel
&mac.thead;
ipqIP reassembly queueipqlabelPolicy label for
ipqdatagramDatagram to be labeleddatagramlabelPolicy label to be filled in for
datagramSet the label on a newly reassembled IP datagram from
the IP fragment reassembly queue from which it was
generated.&mac.mpo;_create_fragmentvoid
&mac.mpo;_create_fragmentstruct mbuf
*datagramstruct label
*datagramlabelstruct mbuf
*fragmentstruct label
*fragmentlabel
&mac.thead;
datagramDatagramdatagramlabelPolicy label for
datagramfragmentFragment to be labeledfragmentlabelPolicy label to be filled in for
fragmentSet the label on the mbuf header of a newly created IP
fragment from the label on the mbuf header of the datagram
it was generated from.&mac.mpo;_create_mbuf_from_mbufvoid
&mac.mpo;_create_mbuf_from_mbufstruct mbuf
*oldmbufstruct label
*oldmbuflabelstruct mbuf
*newmbufstruct label
*newmbuflabel
&mac.thead;
oldmbufExisting (source) mbufoldmbuflabelPolicy label for
oldmbufnewmbufNew mbuf to be labelednewmbuflabelPolicy label to be filled in for
newmbufSet the label on the mbuf header of a newly created
datagram from the mbuf header of an existing datagram. This
call may be made in a number of situations, including when
an mbuf is re-allocated for alignment purposes.&mac.mpo;_create_mbuf_linklayervoid
&mac.mpo;_create_mbuf_linklayerstruct ifnet
*ifnetstruct label
*ifnetlabelstruct mbuf
*mbufstruct label
*mbuflabel
&mac.thead;
ifnetNetwork interfaceifnetlabelPolicy label for
ifnetmbufmbuf header for new datagrammbuflabelPolicy label to be filled in for
mbufSet the label on the mbuf header of a newly created
datagram generated for the purposes of a link layer response
for the passed interface. This call may be made in a number
of situations, including for ARP or ND6 responses in the
IPv4 and IPv6 stacks.&mac.mpo;_create_mbuf_from_bpfdescvoid
&mac.mpo;_create_mbuf_from_bpfdescstruct bpf_d
*bpf_dstruct label
*bpflabelstruct mbuf
*mbufstruct label
*mbuflabel
&mac.thead;
bpf_dBPF descriptorbpflabelPolicy label for
bpf_dmbufNew mbuf to be labeledmbuflabelPolicy label to fill in for
mbufSet the label on the mbuf header of a newly created
datagram generated using the passed BPF descriptor. This
call is made when a write is performed to the BPF device
associated with the passed BPF descriptor.&mac.mpo;_create_mbuf_from_ifnetvoid
&mac.mpo;_create_mbuf_from_ifnetstruct ifnet
*ifnetstruct label
*ifnetlabelstruct mbuf
*mbufstruct label
*mbuflabel
&mac.thead;
ifnetNetwork interfaceifnetlabelPolicy label for
ifnetmbufmbuf header for new datagrammbuflabelPolicy label to be filled in for
mbufSet the label on the mbuf header of a newly created
datagram generated from the passed network interface.&mac.mpo;_create_mbuf_multicast_encapvoid
&mac.mpo;_create_mbuf_multicast_encapstruct mbuf
*oldmbufstruct label
*oldmbuflabelstruct ifnet
*ifnetstruct label
*ifnetlabelstruct mbuf
*newmbufstruct label
*newmbuflabel
&mac.thead;
oldmbufmbuf header for existing datagramoldmbuflabelPolicy label for
oldmbufifnetNetwork interfaceifnetlabelPolicy label for
ifnetnewmbufmbuf header to be labeled for new
datagramnewmbuflabelPolicy label to be filled in for
newmbufSet the label on the mbuf header of a newly created
datagram generated from the existing passed datagram when it
is processed by the passed multicast encapsulation
interface. This call is made when an mbuf is to be
delivered using the virtual interface.&mac.mpo;_create_mbuf_netlayervoid
&mac.mpo;_create_mbuf_netlayerstruct mbuf
*oldmbufstruct label
*oldmbuflabelstruct mbuf
*newmbufstruct label
*newmbuflabel
&mac.thead;
oldmbufReceived datagramoldmbuflabelPolicy label for
oldmbufnewmbufNewly created datagramnewmbuflabelPolicy label for
newmbufSet the label on the mbuf header of a newly created
datagram generated by the IP stack in response to an
existing received datagram (oldmbuf).
This call may be made in a number of situations, including
when responding to ICMP request datagrams.&mac.mpo;_fragment_matchint
&mac.mpo;_fragment_matchstruct mbuf
*fragmentstruct label
*fragmentlabelstruct ipq
*ipqstruct label
*ipqlabel
&mac.thead;
fragmentIP datagram fragmentfragmentlabelPolicy label for
fragmentipqIP fragment reassembly queueipqlabelPolicy label for
ipqDetermine whether an mbuf header containing an IP
datagram fragment (fragment) matches
the label of the passed IP fragment reassembly queue
(ipq). Return
(1) for a successful match, or
(0) for no match. This call is
made when the IP stack attempts to find an existing fragment
reassembly queue for a newly received fragment; if this
fails, a new fragment reassembly queue may be instantiated
for the fragment. Policies may use this entry point to
prevent the reassembly of otherwise matching IP fragments if
policy does not permit them to be reassembled based on the
label or other information.&mac.mpo;_relabel_ifnetvoid
&mac.mpo;_relabel_ifnetstruct ucred
*credstruct ifnet
*ifnetstruct label
*ifnetlabelstruct label
*newlabel
&mac.thead;
credSubject credentialifnetObject; Network interfaceifnetlabelPolicy label for
ifnetnewlabelLabel update to apply to
ifnetUpdate the label of network interface,
ifnet, based on the passed update
label, newlabel, and the passed
subject credential, cred.&mac.mpo;_update_ipqvoid
&mac.mpo;_update_ipqstruct mbuf
*fragmentstruct label
*fragmentlabelstruct ipq
*ipqstruct label
*ipqlabel
&mac.thead;
fragmentIP fragmentfragmentlabelPolicy label for
fragmentipqIP fragment reassembly queueipqlabelPolicy label to be updated for
ipqUpdate the label on an IP fragment reassembly queue
(ipq) based on the acceptance of the
passed IP fragment mbuf header
(fragment).Process Labeling Event Operations&mac.mpo;_create_credvoid
&mac.mpo;_create_credstruct ucred
*parent_credstruct ucred
*child_cred
&mac.thead;
parent_credParent subject credentialchild_credChild subject credentialSet the label of a newly created subject credential from
the passed subject credential. This call will be made when
&man.crcopy.9; is invoked on a newly created struct
ucred. This call should not be confused with a
process forking or creation event.&mac.mpo;_execve_transitionvoid
&mac.mpo;_execve_transitionstruct ucred
*oldstruct ucred
*newstruct vnode
*vpstruct label
*vnodelabel
&mac.thead;
oldExisting subject credentialImmutablenewNew subject credential to be labeledvpFile to executeLockedvnodelabelPolicy label for
vpUpdate the label of a newly created subject credential
(new) from the passed existing
subject credential (old) based on a
label transition caused by executing the passed vnode
(vp). This call occurs when a
process executes the passed vnode and one of the policies
returns a success from the
mpo_execve_will_transition entry point.
Policies may choose to implement this call simply by
invoking mpo_create_cred and passing
the two subject credentials so as not to implement a
transitioning event. Policies should not leave this entry
point unimplemented if they implement
mpo_create_cred, even if they do not
implement
mpo_execve_will_transition.&mac.mpo;_execve_will_transitionint
&mac.mpo;_execve_will_transitionstruct ucred
*oldstruct vnode
*vpstruct label
*vnodelabel
&mac.thead;
oldSubject credential prior to
&man.execve.2;ImmutablevpFile to executevnodelabelPolicy label for
vpDetermine whether the policy will want to perform a
transition event as a result of the execution of the passed
vnode by the passed subject credential. Return
1 if a transition is required,
0 if not. Even if a policy
returns 0, it should behave
correctly in the presence of an unexpected invocation of
mpo_execve_transition, as that call may
happen as a result of another policy requesting a
transition.&mac.mpo;_create_proc0void
&mac.mpo;_create_proc0struct ucred
*cred
&mac.thead;
credSubject credential to be filled inCreate the subject credential of process 0, the parent
of all kernel processes.&mac.mpo;_create_proc1void
&mac.mpo;_create_proc1struct ucred
*cred
&mac.thead;
credSubject credential to be filled inCreate the subject credential of process 1, the parent
of all user processes.&mac.mpo;_relabel_credvoid
&mac.mpo;_relabel_credstruct ucred
*credstruct label
*newlabel
&mac.thead;
credSubject credentialnewlabelLabel update to apply to
credUpdate the label on a subject credential from the passed
update label.Access Control ChecksAccess control entry points permit policy modules to
influence access control decisions made by the kernel.
Generally, although not always, arguments to an access control
entry point will include one or more authorizing credentials and
information (possibly including a label) for any other objects
involved in the operation. An access control entry point may
return 0 to permit the operation, or an &man.errno.2; error
value. The results of invoking the entry point across various
registered policy modules will be composed as follows: if all
modules permit the operation to succeed, success will be
returned. If one or more modules return a failure, a failure will
be returned. If more than one module returns a failure, the
errno value to return to the user will be selected using the
following precedence, implemented by the
error_select() function in
kern_mac.c:Most precedenceEDEADLKEINVALESRCHEACCESLeast precedenceEPERMIf none of the error values returned by all modules are
listed in the precedence chart then an arbitrarily selected
value from the set will be returned. In general, the rules
provide precedence to errors in the following order: kernel
failures, invalid arguments, object not present, access not
permitted, other.&mac.mpo;_check_bpfdesc_receiveint
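The composition rule described above for access control results can be sketched as follows. This is a minimal userspace illustration that follows the precedence chart as given in the text (EDEADLK, EINVAL, ESRCH, EACCES, EPERM). It is not a copy of error_select() from kern_mac.c, and the names `sketch_error_select` and `sketch_compose` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Combine two policy results using the precedence chart from the text.
 * An error not listed in the chart still takes priority over success.
 */
static int
sketch_error_select(int error1, int error2)
{
	static const int precedence[] = {
		EDEADLK, EINVAL, ESRCH, EACCES, EPERM
	};
	size_t i;

	for (i = 0; i < sizeof(precedence) / sizeof(precedence[0]); i++) {
		if (error1 == precedence[i] || error2 == precedence[i])
			return (precedence[i]);
	}
	return (error1 != 0 ? error1 : error2);
}

/* Fold the results of every registered policy into one errno value. */
static int
sketch_compose(const int *results, size_t count)
{
	int error = 0;
	size_t i;

	assert(results != NULL || count == 0);
	for (i = 0; i < count; i++)
		error = sketch_error_select(results[i], error);
	return (error);
}
```

With this fold, a policy that returns ESRCH to hide an object takes priority over another policy's EACCES or EPERM for the same operation, and the operation succeeds only if every policy returns 0.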
&mac.mpo;_check_bpfdesc_receivestruct bpf_d
*bpf_dstruct label
*bpflabelstruct ifnet
*ifnetstruct label
*ifnetlabel
&mac.thead;
bpf_dSubject; BPF descriptorbpflabelPolicy label for
bpf_difnetObject; network interfaceifnetlabelPolicy label for
ifnetDetermine whether the MAC framework should permit
datagrams from the passed interface to be delivered to the
buffers of the passed BPF descriptor. Return
(0) for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatches,
EPERM for lack of privilege.&mac.mpo;_check_kenv_dumpint
&mac.mpo;_check_kenv_dumpstruct ucred
*cred
&mac.thead;
credSubject credentialDetermine whether the subject should be allowed to
retrieve the kernel environment (see &man.kenv.2;).&mac.mpo;_check_kenv_getint
&mac.mpo;_check_kenv_getstruct ucred
*credchar *name
&mac.thead;
credSubject credentialnameKernel environment variable nameDetermine whether the subject should be allowed to
retrieve the value of the specified kernel environment
variable.&mac.mpo;_check_kenv_setint
&mac.mpo;_check_kenv_setstruct ucred
*credchar *name
&mac.thead;
credSubject credentialnameKernel environment variable nameDetermine whether the subject should be allowed to set
the specified kernel environment variable.&mac.mpo;_check_kenv_unsetint
&mac.mpo;_check_kenv_unsetstruct ucred
*credchar *name
&mac.thead;
credSubject credentialnameKernel environment variable nameDetermine whether the subject should be allowed to unset
the specified kernel environment variable.&mac.mpo;_check_kld_loadint
&mac.mpo;_check_kld_loadstruct ucred
*credstruct vnode
*vpstruct label
*vlabel
&mac.thead;
credSubject credentialvpKernel module vnodevlabelLabel associated with
vpDetermine whether the subject should be allowed to load
the specified module file.&mac.mpo;_check_kld_statint
&mac.mpo;_check_kld_statstruct ucred
*cred
&mac.thead;
credSubject credentialDetermine whether the subject should be allowed to
retrieve a list of loaded kernel module files and associated
statistics.&mac.mpo;_check_kld_unloadint
&mac.mpo;_check_kld_unloadstruct ucred
*cred
&mac.thead;
credSubject credentialDetermine whether the subject should be allowed to
unload a kernel module.&mac.mpo;_check_pipe_ioctlint
&mac.mpo;_check_pipe_ioctlstruct ucred
*credstruct pipe
*pipestruct label
*pipelabelunsigned long
cmdvoid *data
&mac.thead;
credSubject credentialpipePipepipelabelPolicy label associated with
pipecmd&man.ioctl.2; commanddata&man.ioctl.2; dataDetermine whether the subject should be allowed to make
the specified &man.ioctl.2; call.&mac.mpo;_check_pipe_pollint
&mac.mpo;_check_pipe_pollstruct ucred
*credstruct pipe
*pipestruct label
*pipelabel
&mac.thead;
credSubject credentialpipePipepipelabelPolicy label associated with
pipeDetermine whether the subject should be allowed to poll
pipe.&mac.mpo;_check_pipe_readint
&mac.mpo;_check_pipe_readstruct ucred
*credstruct pipe
*pipestruct label
*pipelabel
&mac.thead;
credSubject credentialpipePipepipelabelPolicy label associated with
pipeDetermine whether the subject should be allowed read
access to pipe.&mac.mpo;_check_pipe_relabelint
&mac.mpo;_check_pipe_relabelstruct ucred
*credstruct pipe
*pipestruct label
*pipelabelstruct label
*newlabel
&mac.thead;
credSubject credentialpipePipepipelabelCurrent policy label associated with
pipenewlabelLabel update to
pipelabelDetermine whether the subject should be allowed to
relabel pipe.&mac.mpo;_check_pipe_statint
&mac.mpo;_check_pipe_statstruct ucred
*credstruct pipe
*pipestruct label
*pipelabel
&mac.thead;
credSubject credentialpipePipepipelabelPolicy label associated with
pipeDetermine whether the subject should be allowed to
retrieve statistics related to
pipe.&mac.mpo;_check_pipe_writeint
&mac.mpo;_check_pipe_writestruct ucred
*credstruct pipe
*pipestruct label
*pipelabel
&mac.thead;
credSubject credentialpipePipepipelabelPolicy label associated with
pipeDetermine whether the subject should be allowed to write
to pipe.&mac.mpo;_check_socket_bindint
&mac.mpo;_check_socket_bindstruct ucred
*credstruct socket
*socketstruct label
*socketlabelstruct sockaddr
*sockaddr
&mac.thead;
credSubject credentialsocketSocket to be boundsocketlabelPolicy label for
socketsockaddrAddress of
socket&mac.mpo;_check_socket_connectint
&mac.mpo;_check_socket_connectstruct ucred
*credstruct socket
*socketstruct label
*socketlabelstruct sockaddr
*sockaddr
&mac.thead;
credSubject credentialsocketSocket to be connectedsocketlabelPolicy label for
socketsockaddrAddress of
socketDetermine whether the subject credential
(cred) can connect the passed socket
(socket) to the passed socket address
(sockaddr). Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatches,
EPERM for lack of privilege.&mac.mpo;_check_socket_receiveint
&mac.mpo;_check_socket_receivestruct ucred
*credstruct socket
*sostruct label
*socketlabel
&mac.thead;
credSubject credentialsoSocketsocketlabelPolicy label associated with
soDetermine whether the subject should be allowed to
receive information from the socket
so.&mac.mpo;_check_socket_sendint
&mac.mpo;_check_socket_sendstruct ucred
*credstruct socket
*sostruct label
*socketlabel
&mac.thead;
credSubject credentialsoSocketsocketlabelPolicy label associated with
soDetermine whether the subject should be allowed to send
information across the socket
so.&mac.mpo;_check_cred_visibleint
&mac.mpo;_check_cred_visiblestruct ucred
*u1struct ucred
*u2
&mac.thead;
u1Subject credentialu2Object credentialDetermine whether the subject credential
u1 can see other
subjects with the passed subject credential
u2. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatches,
EPERM for lack of privilege, or
ESRCH to hide visibility. This call
may be made in a number of situations, including
inter-process status sysctls used by ps,
and in procfs lookups.&mac.mpo;_check_socket_visibleint
&mac.mpo;_check_socket_visiblestruct ucred
*credstruct socket
*socketstruct label
*socketlabel
&mac.thead;
credSubject credentialsocketObject; socketsocketlabelPolicy label for
socket&mac.mpo;_check_ifnet_relabelint
&mac.mpo;_check_ifnet_relabelstruct ucred
*credstruct ifnet
*ifnetstruct label
*ifnetlabelstruct label
*newlabel
&mac.thead;
credSubject credentialifnetObject; network interfaceifnetlabelExisting policy label for
ifnetnewlabelPolicy label update to later be applied to
ifnetDetermine whether the subject credential can relabel the
passed network interface to the passed label update.&mac.mpo;_check_socket_relabelint
&mac.mpo;_check_socket_relabelstruct ucred
*credstruct socket
*socketstruct label
*socketlabelstruct label
*newlabel
&mac.thead;
credSubject credentialsocketObject; socketsocketlabelExisting policy label for
socketnewlabelLabel update to later be applied to
socketlabelDetermine whether the subject credential can relabel the
passed socket to the passed label update.&mac.mpo;_check_cred_relabelint
&mac.mpo;_check_cred_relabelstruct ucred
*credstruct label
*newlabel
&mac.thead;
credSubject credentialnewlabelLabel update to later be applied to
credDetermine whether the subject credential can relabel
itself to the passed label update.&mac.mpo;_check_vnode_relabelint
&mac.mpo;_check_vnode_relabelstruct ucred
*credstruct vnode
*vpstruct label
*vnodelabelstruct label
*newlabel
&mac.thead;
credSubject credentialImmutablevpObject; vnodeLockedvnodelabelExisting policy label for
vpnewlabelPolicy label update to later be applied to
vpDetermine whether the subject credential can relabel the
passed vnode to the passed label update.&mac.mpo;_check_mount_statint &mac.mpo;_check_mount_statstruct ucred
*credstruct mount
*mpstruct label
*mountlabel
&mac.thead;
credSubject credentialmpObject; file system mountmountlabelPolicy label for
mpDetermine whether the subject credential can see the
results of a statfs performed on the file system. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatches
or EPERM for lack of privilege. This
call may be made in a number of situations, including during
invocations of &man.statfs.2; and related calls, as well as to
determine what file systems to exclude from listings of file
systems, such as when &man.getfsstat.2; is invoked. &mac.mpo;_check_proc_debugint
&mac.mpo;_check_proc_debugstruct ucred
*credstruct proc
*proc
&mac.thead;
credSubject credentialImmutableprocObject; processDetermine whether the subject credential can debug the
passed process. Return 0 for
success, or an errno value for failure.
Suggested failure: EACCES for label
mismatch, EPERM for lack of
privilege, or ESRCH to hide
visibility of the target. This call may be made in a number
of situations, including use of the &man.ptrace.2; and
&man.ktrace.2; APIs, as well as for some types of procfs
operations.&mac.mpo;_check_vnode_accessint
&mac.mpo;_check_vnode_accessstruct ucred
*credstruct vnode
*vpstruct label
*labelint flags
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for
vpflags&man.access.2; flagsDetermine how invocations of &man.access.2; and related
calls by the subject credential should return when performed
on the passed vnode using the passed access flags. This
should generally be implemented using the same semantics
used in &mac.mpo;_check_vnode_open.
Return 0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatches
or EPERM for lack of
privilege.&mac.mpo;_check_vnode_chdirint
&mac.mpo;_check_vnode_chdirstruct ucred
*credstruct vnode
*dvpstruct label
*dlabel
&mac.thead;
credSubject credentialdvpObject; vnode to &man.chdir.2; intodlabelPolicy label for
dvpDetermine whether the subject credential can change the
process working directory to the passed vnode. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_vnode_chrootint
&mac.mpo;_check_vnode_chrootstruct ucred
*credstruct vnode
*dvpstruct label
*dlabel
&mac.thead;
credSubject credentialdvpDirectory vnodedlabelPolicy label associated with
dvpDetermine whether the subject should be allowed to
&man.chroot.2; into the specified directory
(dvp).&mac.mpo;_check_vnode_createint
&mac.mpo;_check_vnode_createstruct ucred
*credstruct vnode
*dvpstruct label
*dlabelstruct componentname
*cnpstruct vattr
*vap
&mac.thead;
credSubject credentialdvpObject; vnodedlabelPolicy label for
dvpcnpComponent name for
dvpvapvnode attributes for vapDetermine whether the subject credential can create a
vnode with the passed parent directory, passed name
information, and passed attribute information. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of privilege.
This call may be made in a number of situations, including
as a result of calls to &man.open.2; with
O_CREAT, &man.mknod.2;, &man.mkfifo.2;, and
others.&mac.mpo;_check_vnode_deleteint
&mac.mpo;_check_vnode_deletestruct ucred
*credstruct vnode
*dvpstruct label
*dlabelstruct vnode
*vpvoid *labelstruct componentname
*cnp
&mac.thead;
credSubject credentialdvpParent directory vnodedlabelPolicy label for
dvpvpObject; vnode to deletelabelPolicy label for
vpcnpComponent name for
vpDetermine whether the subject credential can delete a
vnode from the passed parent directory and passed name
information. Return 0 for
success, or an errno value for failure.
Suggested failure: EACCES for label
mismatch, or EPERM for lack of
privilege. This call may be made in a number of situations,
including as a result of calls to &man.unlink.2; and
&man.rmdir.2;. Policies implementing this entry point
should also implement
&mac.mpo;_check_vnode_rename_to to authorize
deletion of objects as a result of being the target of a
rename.&mac.mpo;_check_vnode_deleteaclint
&mac.mpo;_check_vnode_deleteaclstruct ucred *credstruct vnode *vpstruct label *labelacl_type_t type
&mac.thead;
credSubject credentialImmutablevpObject; vnodeLockedlabelPolicy label for
vptypeACL typeDetermine whether the subject credential can delete the
ACL of passed type from the passed vnode. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_vnode_execint
&mac.mpo;_check_vnode_execstruct ucred
*credstruct vnode
*vpstruct label
*label
&mac.thead;
credSubject credentialvpObject; vnode to executelabelPolicy label for
vpDetermine whether the subject credential can execute the
passed vnode. Determination of execute privilege is made
separately from decisions about any transitioning event.
Return 0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_vnode_getaclint
&mac.mpo;_check_vnode_getaclstruct ucred
*credstruct vnode
*vpstruct label
*labelacl_type_t
type
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for
vptypeACL typeDetermine whether the subject credential can retrieve
the ACL of passed type from the passed vnode. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_vnode_getextattrint
&mac.mpo;_check_vnode_getextattrstruct ucred
*credstruct vnode
*vpstruct label
*labelint
attrnamespaceconst char
*namestruct uio
*uio
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for
vpattrnamespaceExtended attribute namespacenameExtended attribute nameuioI/O structure pointer; see &man.uio.9;Determine whether the subject credential can retrieve
the extended attribute with the passed namespace and name
from the passed vnode. Policies implementing labeling using
extended attributes may be interested in special handling of
operations on those extended attributes. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_vnode_linkint
&mac.mpo;_check_vnode_linkstruct ucred
*credstruct vnode
*dvpstruct label
*dlabelstruct vnode
*vpstruct label
*labelstruct componentname
*cnp
&mac.thead;
credSubject credentialdvpDirectory vnodedlabelPolicy label associated with
dvpvpLink destination vnodelabelPolicy label associated with
vpcnpComponent name for the link being createdDetermine whether the subject should be allowed to
create a link to the vnode vp with
the name specified by cnp.&mac.mpo;_check_vnode_mmapint
&mac.mpo;_check_vnode_mmapstruct ucred
*credstruct vnode
*vpstruct label
*labelint prot
&mac.thead;
credSubject credentialvpVnode to maplabelPolicy label associated with
vpprotMmap protections (see &man.mmap.2;)Determine whether the subject should be allowed to map
the vnode vp with the protections
specified in prot.&mac.mpo;_check_vnode_mmap_downgradevoid
&mac.mpo;_check_vnode_mmap_downgradestruct ucred
*credstruct vnode
*vpstruct label
*labelint *prot
&mac.thead;
credSee
.vplabelprotMmap protections to be downgradedDowngrade the mmap protections based on the subject and
object labels.&mac.mpo;_check_vnode_mprotectint
&mac.mpo;_check_vnode_mprotectstruct ucred
*credstruct vnode
*vpstruct label
*labelint prot
&mac.thead;
credSubject credentialvpMapped vnodeprotMemory protectionsDetermine whether the subject should be allowed to
set the specified memory protections on memory mapped from
the vnode vp.&mac.mpo;_check_vnode_pollint
&mac.mpo;_check_vnode_pollstruct ucred
*active_credstruct ucred
*file_credstruct vnode
*vpstruct label
*label
&mac.thead;
active_credSubject credentialfile_credCredential associated with the struct
filevpPolled vnodelabelPolicy label associated with
vpDetermine whether the subject should be allowed to poll
the vnode vp.&mac.mpo;_check_vnode_rename_fromint
&mac.mpo;_check_vnode_rename_fromstruct ucred
*credstruct vnode
*dvpstruct label
*dlabelstruct vnode
*vpstruct label
*labelstruct componentname
*cnp
&mac.thead;
credSubject credentialdvpDirectory vnodedlabelPolicy label associated with
dvpvpVnode to be renamedlabelPolicy label associated with
vpcnpComponent name for
vpDetermine whether the subject should be allowed to
rename the vnode vp to something
else.&mac.mpo;_check_vnode_rename_toint
&mac.mpo;_check_vnode_rename_tostruct ucred
*credstruct vnode
*dvpstruct label
*dlabelstruct vnode
*vpstruct label
*labelint samedirstruct componentname
*cnp
&mac.thead;
credSubject credentialdvpDirectory vnodedlabelPolicy label associated with
dvpvpOverwritten vnodelabelPolicy label associated with
vpsamedirBoolean; 1 if the source and
destination directories are the samecnpDestination component nameDetermine whether the subject should be allowed to
rename to the vnode vp, into the
directory dvp, or to the name
represented by cnp. If there is no
existing file to overwrite, vp and
label will be NULL.&mac.mpo;_check_socket_listenint
&mac.mpo;_check_socket_listenstruct ucred
*credstruct socket
*socketstruct label
*socketlabel
&mac.thead;
credSubject credentialsocketObject; socketsocketlabelPolicy label for
socketDetermine whether the subject credential can listen on
the passed socket. Return 0 for
success, or an errno value for failure.
Suggested failure: EACCES for label
mismatch, or EPERM for lack of
privilege.&mac.mpo;_check_vnode_lookupint
&mac.mpo;_check_vnode_lookupstruct ucred
*credstruct vnode
*dvpstruct label
*dlabelstruct componentname
*cnp
&mac.thead;
credSubject credentialdvpObject; vnodedlabelPolicy label for
dvpcnpComponent name being looked upDetermine whether the subject credential can perform a
lookup in the passed directory vnode for the passed name.
Return 0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_vnode_openint
&mac.mpo;_check_vnode_openstruct ucred
*credstruct vnode
*vpstruct label
*labelint
acc_mode
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for
vpacc_mode&man.open.2; access modeDetermine whether the subject credential can perform an
open operation on the passed vnode with the passed access
mode. Return 0 for success, or
an errno value for failure. Suggested failure:
EACCES for label mismatch, or
EPERM for lack of privilege.&mac.mpo;_check_vnode_readdirint
&mac.mpo;_check_vnode_readdirstruct ucred
*credstruct vnode
*dvpstruct label
*dlabel
&mac.thead;
credSubject credentialdvpObject; directory vnodedlabelPolicy label for
dvpDetermine whether the subject credential can perform a
readdir operation on the passed
directory vnode. Return 0 for
success, or an errno value for failure.
Suggested failure: EACCES for label
mismatch, or EPERM for lack of
privilege.&mac.mpo;_check_vnode_readlinkint
&mac.mpo;_check_vnode_readlinkstruct ucred
*credstruct vnode
*vpstruct label
*label
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for
vpDetermine whether the subject credential can perform a
readlink operation on the passed
symlink vnode. Return 0 for
success, or an errno value for failure.
Suggested failure: EACCES for label
mismatch, or EPERM for lack of
privilege. This call may be made in a number of situations,
including an explicit readlink call by
the user process, or as a result of an implicit
readlink during a name lookup by the
process.&mac.mpo;_check_vnode_revokeint
&mac.mpo;_check_vnode_revokestruct ucred
*credstruct vnode
*vpstruct label
*label
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for
vpDetermine whether the subject credential can revoke
access to the passed vnode. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_vnode_setaclint
&mac.mpo;_check_vnode_setaclstruct ucred
*credstruct vnode
*vpstruct label
*labelacl_type_t
typestruct acl
*acl
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for
vptypeACL typeaclACLDetermine whether the subject credential can set the
passed ACL of passed type on the passed vnode. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_vnode_setextattrint
&mac.mpo;_check_vnode_setextattrstruct ucred
*credstruct vnode
*vpstruct label
*labelint
attrnamespaceconst char
*namestruct uio
*uio
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for vpattrnamespaceExtended attribute namespacenameExtended attribute nameuioI/O structure pointer; see &man.uio.9;Determine whether the subject credential can set the
extended attribute of passed name and passed namespace on
the passed vnode. Policies implementing security labels
backed into extended attributes may want to provide
additional protections for those attributes. Additionally,
policies should avoid making decisions based on the data
referenced from uio, as there is a
potential race condition between this check and the actual
operation. The uio may also be
NULL if a delete operation is being
performed. Return 0 for success,
or an errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_vnode_setflagsint
&mac.mpo;_check_vnode_setflagsstruct ucred
*credstruct vnode
*vpstruct label
*labelu_long flags
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for
vpflagsFile flags; see &man.chflags.2;Determine whether the subject credential can set the
passed flags on the passed vnode. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_vnode_setmodeint
&mac.mpo;_check_vnode_setmodestruct ucred
*credstruct vnode
*vpstruct label
*labelmode_t mode
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for vpmodeFile mode; see &man.chmod.2;Determine whether the subject credential can set the
passed mode on the passed vnode. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_vnode_setownerint
&mac.mpo;_check_vnode_setownerstruct ucred
*credstruct vnode
*vpstruct label
*labeluid_t uidgid_t gid
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for vpuidUser IDgidGroup IDDetermine whether the subject credential can set the
passed uid and passed gid as file uid and file gid on the
passed vnode. The IDs may be set to (-1)
to request no update. Return 0
for success, or an errno value for
failure. Suggested failure: EACCES
for label mismatch, or EPERM for lack
of privilege.&mac.mpo;_check_vnode_setutimesint
&mac.mpo;_check_vnode_setutimesstruct ucred
*credstruct vnode
*vpstruct label
*labelstruct timespec
atimestruct timespec
mtime
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for
vpatimeAccess time; see &man.utimes.2;mtimeModification time; see &man.utimes.2;Determine whether the subject credential can set the
passed access and modification timestamps on the passed vnode. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_proc_schedint
&mac.mpo;_check_proc_schedstruct ucred
*credstruct proc
*proc
&mac.thead;
credSubject credentialprocObject; processDetermine whether the subject credential can change the
scheduling parameters of the passed process. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
EPERM for lack of privilege, or
ESRCH to limit visibility.See &man.setpriority.2; for more information.&mac.mpo;_check_proc_signalint
&mac.mpo;_check_proc_signalstruct ucred
*credstruct proc
*procint signal
&mac.thead;
credSubject credentialprocObject; processsignalSignal; see &man.kill.2;Determine whether the subject credential can deliver the
passed signal to the passed process. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
EPERM for lack of privilege, or
ESRCH to limit visibility.&mac.mpo;_check_vnode_statint
&mac.mpo;_check_vnode_statstruct ucred
*credstruct vnode
*vpstruct label
*label
&mac.thead;
credSubject credentialvpObject; vnodelabelPolicy label for
vpDetermine whether the subject credential can
stat the passed vnode. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatch,
or EPERM for lack of
privilege.See &man.stat.2; for more information.&mac.mpo;_check_ifnet_transmitint
&mac.mpo;_check_ifnet_transmitstruct ucred
*credstruct ifnet
*ifnetstruct label
*ifnetlabelstruct mbuf
*mbufstruct label
*mbuflabel
&mac.thead;
credSubject credentialifnetNetwork interfaceifnetlabelPolicy label for
ifnetmbufObject; mbuf to be sentmbuflabelPolicy label for
mbufDetermine whether the network interface can transmit the
passed mbuf. Return 0 for
success, or an errno value for failure.
Suggested failure: EACCES for label
mismatch, or EPERM for lack of
privilege.&mac.mpo;_check_socket_deliverint
&mac.mpo;_check_socket_deliverstruct ucred
*credstruct ifnet
*ifnetstruct label
*ifnetlabelstruct mbuf
*mbufstruct label
*mbuflabel
&mac.thead;
credSubject credentialifnetNetwork interfaceifnetlabelPolicy label for
ifnetmbufObject; mbuf to be deliveredmbuflabelPolicy label for
mbufDetermine whether the socket may receive the datagram
stored in the passed mbuf header. Return
0 for success, or an
errno value for failure. Suggested
failures: EACCES for label mismatch,
or EPERM for lack of
privilege.&mac.mpo;_check_socket_visibleint
&mac.mpo;_check_socket_visiblestruct ucred
*credstruct socket
*sostruct label
*socketlabel
&mac.thead;
credSubject credentialImmutablesoObject; socketsocketlabelPolicy label for
soDetermine whether the subject credential cred can "see"
the passed socket (so) using
system monitoring functions, such as those employed by
&man.netstat.8; and &man.sockstat.1;. Return
0 for success, or an
errno value for failure. Suggested
failure: EACCES for label mismatches,
EPERM for lack of privilege, or
ESRCH to hide visibility.&mac.mpo;_check_system_acctint
&mac.mpo;_check_system_acctstruct ucred
*ucredstruct vnode
*vpstruct label
*vlabel
&mac.thead;
ucredSubject credentialvpAccounting file; &man.acct.5;vlabelLabel associated with
vpDetermine whether the subject should be allowed to
enable accounting, based on its label and the label of the
accounting log file.&mac.mpo;_check_system_nfsdint
&mac.mpo;_check_system_nfsdstruct ucred
*cred
&mac.thead;
credSubject credentialDetermine whether the subject should be allowed to call
&man.nfssvc.2;.&mac.mpo;_check_system_rebootint
&mac.mpo;_check_system_rebootstruct ucred
*credint howto
&mac.thead;
credSubject credentialhowtohowto parameter from
&man.reboot.2;Determine whether the subject should be allowed to
reboot the system in the specified manner.&mac.mpo;_check_system_settimeint
&mac.mpo;_check_system_settimestruct ucred
*cred
&mac.thead;
credSubject credentialDetermine whether the subject should be allowed to set the
system clock.&mac.mpo;_check_system_swaponint
&mac.mpo;_check_system_swaponstruct ucred
*credstruct vnode
*vpstruct label
*vlabel
&mac.thead;
credSubject credentialvpSwap devicevlabelLabel associated with
vpDetermine whether the subject should be allowed to add
vp as a swap device.&mac.mpo;_check_system_sysctlint
&mac.mpo;_check_system_sysctlstruct ucred
*credint *nameu_int *namelenvoid *oldsize_t
*oldlenpint inkernelvoid *newsize_t newlen
&mac.thead;
credSubject credentialnameSee &man.sysctl.3;namelenoldoldlenpinkernelBoolean; 1 if called from
kernelnewSee &man.sysctl.3;newlenDetermine whether the subject should be allowed to make
the specified &man.sysctl.3; transaction.Label Management CallsRelabel events occur when a user process has requested
that the label on an object be modified. A two-phase update
occurs: first, an access control check will be performed to
determine if the update is both valid and permitted, and then
the update itself is performed via a separate entry point.
Relabel entry points typically accept the object, object label
reference, and an update label submitted by the process.
Memory allocation during relabel is discouraged, as relabel
calls are not permitted to fail (failure should be reported
earlier in the relabel check).Userland ArchitectureThe TrustedBSD MAC Framework includes a number of
policy-agnostic elements, including MAC library interfaces
for abstractly managing labels, modifications to the system
credential management and login libraries to support the
assignment of MAC labels to users, and a set of tools to
monitor and modify labels on processes, files, and network
interfaces. More details on the user architecture will
be added to this section in the near future.APIs for Policy-Agnostic Label ManagementThe TrustedBSD MAC Framework provides a number of
library and system calls permitting applications to
manage MAC labels on objects using a policy-agnostic
interface. This permits applications to manipulate
labels for a variety of policies without being
written to support specific policies. These interfaces
are used by general-purpose tools such as &man.ifconfig.8;,
&man.ls.1; and &man.ps.1; to view labels on network
interfaces, files, and processes. The APIs also support
MAC management tools including &man.getfmac.8;,
&man.getpmac.8;, &man.setfmac.8;, &man.setfsmac.8;,
and &man.setpmac.8;. The MAC APIs are documented in
&man.mac.3;.Applications handle MAC labels in two forms: an
internalized form used to return and set labels on
processes and objects (mac_t),
and an externalized form based on C strings appropriate for
storage in configuration files, display to the user, or
input from the user. Each MAC label contains a number of
elements, each consisting of a name and value pair.
Policy modules in the kernel bind to specific names
and interpret the values in policy-specific ways. In
the externalized string form, labels are represented
by a comma-delimited list of name and value pairs separated
by the / character. Labels may be
directly converted to and from text using provided APIs;
when retrieving labels from the kernel, internalized
label storage must first be prepared for the desired
label element set. Typically, this is done in one of
two ways: using &man.mac.prepare.3; and an arbitrary
list of desired label elements, or one of the variants
of the call that loads a default element set from the
&man.mac.conf.5; configuration file. Per-object
defaults permit application writers to usefully display
labels associated with objects without being aware of
the policies present in the system.Currently, direct manipulation of label elements
other than by conversion to a text string, string editing,
and conversion back to an internalized label is not supported
by the MAC library. Such interfaces may be added in the
future if they prove necessary for application
writers.Binding of Labels to UsersThe standard user context management interface,
&man.setusercontext.3;, has been modified to retrieve
MAC labels associated with a user's class from
&man.login.conf.5;. These labels are then set along
with other user context when either
LOGIN_SETALL is specified, or when
LOGIN_SETMAC is explicitly
specified.It is expected that, in a future version of FreeBSD,
the MAC label database will be separated from the
login.conf user class abstraction,
and be maintained in a separate database. However, the
&man.setusercontext.3; API should remain the same
following such a change.ConclusionThe TrustedBSD MAC framework permits kernel modules to
augment the system security policy in a highly integrated
manner. They may do this based on existing object properties,
or based on label data that is maintained with the assistance of
the MAC framework. The framework is sufficiently flexible to
implement a variety of policy types, including information flow
security policies such as MLS and Biba, as well as policies
based on existing BSD credentials or file protections. Policy
authors may wish to consult this documentation as well as
existing security modules when implementing a new security
service.
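As a closing illustration of the return convention used by the access control checks documented above, the following is a minimal user-space sketch, not kernel code. The types toy_label and toy_cred, the TOY_WRITE flag, and the function toy_check_vnode_open are hypothetical stand-ins for struct label, struct ucred, and a policy's check entry point; the grade comparison is an assumed toy policy and is neither MLS nor Biba. It only shows the pattern of returning 0 for success, EACCES for a label mismatch, and EPERM for lack of privilege.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct label: one integer grade per object. */
struct toy_label {
	int grade;
};

/* Hypothetical stand-in for struct ucred. */
struct toy_cred {
	struct toy_label label;	/* subject label */
	int privileged;		/* nonzero if the subject holds privilege */
};

#define	TOY_WRITE	0x1	/* assumed access-mode bit for this sketch */

/*
 * Toy model of a check entry point in the style of a vnode open check.
 * Returns 0 on success, EACCES on label mismatch, EPERM on lack of
 * privilege.  Both rules below are assumptions made for illustration.
 */
int
toy_check_vnode_open(const struct toy_cred *cred,
    const struct toy_label *label, int acc_mode)
{
	assert(cred != NULL && label != NULL);

	/* Assumed rule: the subject grade must be at least the object grade. */
	if (cred->label.grade < label->grade)
		return (EACCES);

	/* Assumed rule: writing to a grade-0 object requires privilege. */
	if ((acc_mode & TOY_WRITE) != 0 && label->grade == 0 &&
	    !cred->privileged)
		return (EPERM);

	return (0);
}
```

A real policy module would register such a function in its policy operations structure and would read its label data through the framework rather than from a plain integer.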
diff --git a/en_US.ISO8859-1/books/arch-handbook/newbus/chapter.sgml b/en_US.ISO8859-1/books/arch-handbook/newbus/chapter.sgml
index c20617c4fc..186880d010 100644
--- a/en_US.ISO8859-1/books/arch-handbook/newbus/chapter.sgml
+++ b/en_US.ISO8859-1/books/arch-handbook/newbus/chapter.sgml
@@ -1,360 +1,360 @@
JeroenRuigrok van der Werven (asmodai)asmodai@FreeBSD.orgWritten by HitenPandyahiten@uk.FreeBSD.orgNewbusSpecial thanks to Matthew N. Dodd, Warner Losh, Bill Paul,
Doug Rabson, Mike Smith, Peter Wemm and Scott Long.This chapter explains the Newbus device framework in detail.
-
+ Device DriversPurpose of a Device DriverA device driver is a software component which provides the
interface between the kernel's generic view of a peripheral
(e.g. disk, network adapter) and the actual implementation of the
peripheral. The device driver interface (DDI) is
the defined interface between the kernel and the device driver component.
Types of Device DriversThere was a time in &unix;, and thus in FreeBSD, when
four types of devices were defined:block device driverscharacter device driversnetwork device driverspseudo-device driversBlock devices operated on
fixed-size blocks [of data]. This type of driver depended on the
so-called buffer cache, whose purpose was
to cache accessed blocks of data in a dedicated part of memory.
Often this buffer cache was based on write-behind, which meant that when
data was modified in memory it got synced to disk whenever the system
did its periodic disk flushing, thus optimizing writes.Character devicesHowever, in FreeBSD 4.0 and later the
distinction between block and character devices no longer exists.
Overview of NewbusNewbus is the implementation of a new bus
architecture based on abstraction layers which saw its introduction in
FreeBSD 3.0 when the Alpha port was imported into the source tree. It was
not until 4.0 that it became the default system to use for device
drivers. Its goal is to provide a more object-oriented means of
interconnecting the various busses and devices which a host system
provides to the Operating System.Its main features include amongst others:dynamic attachingeasy modularization of driverspseudo-bussesOne of the most prominent changes is the migration from the flat and
ad-hoc system to a device tree lay-out.At the top level resides the root
device which is the parent to hang all other devices on. For each
architecture, there is typically a single child of root
which has such things as host-to-PCI bridges, etc.
attached to it. For x86, this root device is the
nexus device; for Alpha, the various
models have different top-level devices
corresponding to the different hardware chipsets, including
lca, apecs,
cia and tsunami.A device in the Newbus context represents a single hardware entity
in the system. For instance each PCI device is represented by a Newbus
device. Any device in the system can have children; a device which has
children is often called a bus.
Examples of common busses in the system are ISA and PCI which manage lists
of devices attached to ISA and PCI busses respectively.Often, a connection between different kinds of bus is represented by
a bridge device which normally has one
child for the attached bus. An example of this is a
PCI-to-PCI bridge which is represented by a device
pcibN on the parent PCI bus
and has a child pciN for the
attached bus. This layout simplifies the implementation of the PCI bus
tree, allowing common code to be used for both top-level and bridged
busses.Each device in the Newbus architecture asks its parent to map its
resources. The parent then asks its own parent until the nexus is
reached. So, basically the nexus is the only part of the Newbus system
which knows about all resources.An ISA device might want to map its IO port at
0x230, so it asks its parent, in this case the ISA
bus. The ISA bus hands it over to the PCI-to-ISA bridge which in its turn
asks the PCI bus, which reaches the host-to-PCI bridge and finally the
nexus. The beauty of this transition upwards is that there is room to
translate the requests. For example, the 0x230 IO port
request might become memory-mapped at 0xb0000230 on a
MIPS box by the PCI bridge.Resource allocation can be controlled at any place in the device
tree. For instance on many Alpha platforms, ISA interrupts are managed
separately from PCI interrupts and resource allocations for ISA interrupts
are managed by the Alpha's ISA bus device. On IA-32, ISA and PCI
interrupts are both managed by the top-level nexus device. For both
ports, memory and port address space is managed by a single entity - nexus
for IA-32 and the relevant chipset driver on Alpha (e.g. CIA or tsunami).
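The upward delegation of resource requests described above can be modeled in user space. This is a sketch under stated assumptions, not Newbus code: toy_device, toy_alloc_ioport, and the translate callback are hypothetical names, and the 0xb0000000 offset simply mirrors the MIPS example in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a device node; real Newbus uses device_t. */
struct toy_device {
	const char *name;
	struct toy_device *parent;		/* NULL for the top-level nexus */
	uint32_t (*translate)(uint32_t addr);	/* optional address translation */
};

/* Assumed bridge behaviour: map I/O ports into a memory window. */
uint32_t
toy_pci_bridge_translate(uint32_t addr)
{
	return (0xb0000000u + addr);
}

/*
 * Pass a request for an I/O port up the tree until the nexus (the device
 * with no parent) has been asked, letting each ancestor translate the
 * address.  Returns the address as seen by the nexus; *hops counts the
 * ancestors that were asked.
 */
uint32_t
toy_alloc_ioport(const struct toy_device *child, uint32_t port, int *hops)
{
	const struct toy_device *dev;
	int n = 0;

	assert(child != NULL);
	for (dev = child->parent; dev != NULL; dev = dev->parent) {
		if (dev->translate != NULL)
			port = dev->translate(port);
		n++;
	}
	if (hops != NULL)
		*hops = n;
	return (port);
}
```

Because every ancestor sees the request, any level of the tree could also refuse it or account for it, which is what allows resource allocation to be controlled at any place in the device tree.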
In order to normalize access to memory and port mapped resources,
Newbus integrates the bus_space APIs from NetBSD.
These provide a single API to replace inb/outb and direct memory
reads/writes. The advantage of this is that a single driver can easily
use either memory-mapped registers or port-mapped registers
(some hardware supports both).This support is integrated into the resource allocation mechanism.
When a resource is allocated, a driver can retrieve the associated
bus_space_tag_t and
bus_space_handle_t from the resource.Newbus also allows for definitions of interface methods in files
dedicated to this purpose. These are the .m files
that are found under the src/sys hierarchy.The core of the Newbus system is an extensible
object-based programming model. Each device in the system
has a table of methods which it supports. The system and other devices
use those methods to control the device and request services. The
different methods supported by a device are defined by a number of
interfaces. An interface is simply a group
of related methods which can be implemented by a device.In the Newbus system, the methods for a device are provided by the
various device drivers in the system. When a device is attached to a
driver during auto-configuration, it uses the method
table declared by the driver. A device can later
detach from its driver and
re-attach to a new driver with a new method table.
This allows dynamic replacement of drivers which can be useful for driver
development.The interfaces are described by an interface definition language
similar to the language used to define vnode operations for file systems.
The interface is stored in a methods file (which would normally be named
foo_if.m).Newbus Methods
# Foo subsystem/driver (a comment...)
INTERFACE foo
METHOD int doit {
device_t dev;
};
# DEFAULT is the method that will be used, if a method was not
# provided via: DEVMETHOD()
METHOD void doit_to_child {
device_t dev;
device_t child;
} DEFAULT doit_generic_to_child;
When this interface is compiled, it generates a header file
foo_if.h which contains function
declarations:
int FOO_DOIT(device_t dev);
void FOO_DOIT_TO_CHILD(device_t dev, device_t child);
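How wrappers of this kind dispatch can be modeled in plain C. The following is a user-space sketch, not the output of the interface compiler: toy_device, toy_method_table, and all toy_-prefixed functions are hypothetical. It shows a wrapper that looks up a method in the device's method table and falls back to a default when the driver did not supply one.

```c
#include <assert.h>
#include <stddef.h>

struct toy_device;

/* Hypothetical per-driver method table for the foo interface. */
struct toy_method_table {
	int  (*doit)(struct toy_device *dev);
	void (*doit_to_child)(struct toy_device *dev,
	    struct toy_device *child);
};

/* Hypothetical device: a method table plus a counter used by the demo. */
struct toy_device {
	const struct toy_method_table *methods;
	int touched;
};

/* Default used when a driver leaves doit_to_child unset. */
void
toy_doit_generic_to_child(struct toy_device *dev, struct toy_device *child)
{
	(void)dev;
	child->touched += 1;
}

/* Wrapper in the spirit of FOO_DOIT(): look up the method and call it. */
int
toy_foo_doit(struct toy_device *dev)
{
	assert(dev != NULL && dev->methods != NULL &&
	    dev->methods->doit != NULL);
	return (dev->methods->doit(dev));
}

/* Wrapper in the spirit of FOO_DOIT_TO_CHILD(), with a default fallback. */
void
toy_foo_doit_to_child(struct toy_device *dev, struct toy_device *child)
{
	assert(dev != NULL && dev->methods != NULL && child != NULL);
	if (dev->methods->doit_to_child != NULL)
		dev->methods->doit_to_child(dev, child);
	else
		toy_doit_generic_to_child(dev, child);
}

/* Example driver method: returns a fixed value for the demo. */
int
toy_driver_doit(struct toy_device *dev)
{
	(void)dev;
	return (42);
}
```

Swapping the methods pointer for another table changes the behaviour of every wrapper call on that device, which is the property that lets a device detach from one driver and re-attach to another.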
A source file, foo_if.c is
also created to accompany the automatically generated header file; it
contains implementations of those functions which look up the location
of the relevant functions in the object's method table and call that
function.The system defines two main interfaces. The first fundamental
interface is called device and
includes methods which are relevant to all devices. Methods in the
device interface include
probe,
attach and
detach to control detection of
hardware and shutdown,
suspend and
resume for critical event
notification.The second, more complex interface is
bus. This interface contains
methods suitable for devices which have children, including methods to
access bus-specific per-device information
(&man.bus.generic.read.ivar.9; and
&man.bus.generic.write.ivar.9;), event notification
(child_detached,
driver_added) and resource
management (alloc_resource,
activate_resource,
deactivate_resource,
release_resource).Many methods in the bus interface perform
services for some child of the bus device. These methods normally
use the first two arguments to specify the bus providing the service
and the child device which is requesting the service. To simplify
driver code, many of these methods have accessor functions which
look up the parent and call a method on the parent. For instance the
method
BUS_TEARDOWN_INTR(device_t dev, device_t child, ...)
can be called using the function
bus_teardown_intr(device_t child, ...).Some bus types in the system define additional interfaces to
provide access to bus-specific functionality. For instance, the PCI
bus driver defines the pci interface which has two
methods read_config and
write_config for accessing the
configuration registers of a PCI device.Newbus APIAs the Newbus API is huge, this section makes some effort at
documenting it. More information to come in the next revision of this
document.Important locations in the source hierarchysrc/sys/[arch]/[arch] - Kernel code for a
specific machine architecture resides in this directory, for example
the i386 architecture or the
SPARC64 architecture.src/sys/dev/[bus] - Device support for a
specific [bus] resides in this directory.src/sys/dev/pci - PCI bus support code
resides in this directory.src/sys/[isa|pci] - PCI/ISA device drivers
reside in this directory. The PCI/ISA bus support code used to exist
in this directory in FreeBSD version 4.0.Important structures and type definitionsdevclass_t - This is a type definition of a
pointer to a struct devclass.device_method_t - This is the same as
kobj_method_t (see
src/sys/sys/kobj.h).device_t - This is a type definition of a
pointer to a struct device.
device_t represents a device in the system. It is
a kernel object. See src/sys/sys/bus_private.h
for implementation details.driver_t - This is a type definition which
references struct driver. The
driver struct is a class of the
device kernel object; it also holds data private
to the driver.driver_t implementation
struct driver {
    KOBJ_CLASS_FIELDS;
    void *priv;    /* driver private data */
};
device_state_t - This is a type definition of the enumeration
device_state. It contains the possible states of a Newbus device
before and after the autoconfiguration process.

Device states device_state_t
/*
* src/sys/sys/bus.h
*/
typedef enum device_state {
    DS_NOTPRESENT,    /* not probed or probe failed */
    DS_ALIVE,         /* probe succeeded */
    DS_ATTACHED,      /* attach method called */
    DS_BUSY           /* device is open */
} device_state_t;
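
As a minimal sketch of how these states might be consumed, the following user-space C fragment repeats the enumeration above and adds two small helpers. The helpers device_state_name() and device_state_is_attached() are hypothetical names introduced here for illustration and are not part of the Newbus API; kernel code would query a device with device_get_state(9) instead of redefining the type.

```c
/* Copy of the enumeration from src/sys/sys/bus.h, repeated for illustration. */
typedef enum device_state {
    DS_NOTPRESENT,    /* not probed or probe failed */
    DS_ALIVE,         /* probe succeeded */
    DS_ATTACHED,      /* attach method called */
    DS_BUSY           /* device is open */
} device_state_t;

/* Hypothetical helper: return a printable name for a device state. */
const char *
device_state_name(device_state_t state)
{
    switch (state) {
    case DS_NOTPRESENT:
        return ("not present");
    case DS_ALIVE:
        return ("alive");
    case DS_ATTACHED:
        return ("attached");
    case DS_BUSY:
        return ("busy");
    }
    return ("unknown");
}

/*
 * Hypothetical helper: the states are declared in the order a device
 * passes through them, so "attached or later" is a simple comparison.
 */
int
device_state_is_attached(device_state_t state)
{
    return (state >= DS_ATTACHED);
}
```

The comparison in device_state_is_attached() relies only on the declaration order shown above: DS_BUSY follows DS_ATTACHED, so an open device also counts as attached.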
diff --git a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
index 3ce99fa171..3ce6e041f1 100644
--- a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
+++ b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
@@ -1,944 +1,944 @@
John Baldwin, Robert Watson
$FreeBSD$
2002, 2003 John Baldwin, Robert Watson
SMPng Design Document
-
+ Introduction

This document presents the current design and implementation of
the SMPng Architecture. First, the basic primitives and tools are
introduced. Next, a general architecture for the FreeBSD kernel's
synchronization and execution model is laid out. Then, locking
strategies for specific subsystems are discussed, documenting the
approaches taken to introduce fine-grained synchronization and
parallelism for each subsystem. Finally, detailed implementation
notes are provided to motivate design choices, and make the reader
aware of important implications involving the use of specific
primitives.

This document is a work-in-progress, and will be updated to
reflect ongoing design and implementation activities associated
with the SMPng Project. Many sections currently exist only in
outline form, but will be fleshed out as work proceeds. Updates or
suggestions regarding the document may be directed to the document
editors.

The goal of SMPng is to allow concurrency in the kernel.
The kernel is basically one rather large and complex program. To
make the kernel multi-threaded we use some of the same tools used
to make other programs multi-threaded. These include mutexes,
shared/exclusive locks, semaphores, and condition variables. For
the definitions of these and other SMP-related terms, please see
- the Glossary section of this article.
+ the Glossary section of this article.
-
+ Basic Tools and Locking Fundamentals

Atomic Instructions and Memory Barriers

There are several existing treatments of memory barriers
and atomic instructions, so this section will not include a
lot of detail. To put it simply, one cannot go around reading
variables without a lock if a lock is used to protect writes
to that variable. This becomes obvious when you consider that
memory barriers simply determine relative order of memory
operations; they do not make any guarantee about timing of
memory operations. That is, a memory barrier does not force
the contents of a CPU's local cache or store buffer to flush.
Instead, the memory barrier at lock release simply ensures
that all writes to the protected data will be visible to other
CPUs or devices if the write to release the lock is visible.
The CPU is free to keep that data in its cache or store buffer
as long as it wants. However, if another CPU performs an
atomic instruction on the same datum, the first CPU must
guarantee that the updated value is made visible to the second
CPU along with any other operations that memory barriers may
require.

For example, assuming a simple model where data is
considered visible when it is in main memory (or a global
cache), when an atomic instruction is triggered on one CPU,
other CPUs' store buffers and caches must flush any writes to
that same cache line along with any pending operations behind
a memory barrier.

This requires one to take special care when using an item
protected by atomic instructions. For example, in the sleep
mutex implementation, we have to use an
atomic_cmpset rather than an
atomic_set to turn on the
MTX_CONTESTED bit. The reason is that we
read the value of mtx_lock into a
variable and then make a decision based on that read.
However, the value we read may be stale, or it may change
while we are making our decision. Thus, when the
atomic_set is executed, it may end up
setting the bit on a different value than the one we made the
decision on. Thus, we have to use an
atomic_cmpset to set the value only if
the value we made the decision on is up-to-date and
valid.

Finally, atomic instructions only allow one item to be
updated or read. If one needs to atomically update several
items, then a lock must be used instead. For example, if two
counters must be read and have values that are consistent
relative to each other, then those counters must be protected
by a lock rather than by separate atomic instructions.

Read Locks versus Write Locks

Read locks do not need to be as strong as write locks.
Both types of locks need to ensure that the data they are
accessing is not stale. However, only write access requires
exclusive access. Multiple threads can safely read a value.
Using different types of locks for reads and writes can be
implemented in a number of ways.

First, sx locks can be used in this manner by using an
exclusive lock when writing and a shared lock when reading.
This method is quite straightforward.

A second method is a bit more obscure. You can protect a
datum with multiple locks. Then for reading that data you
simply need to have a read lock of one of the locks. However,
to write to the data, you need to have a write lock of all of
the locks. This can make writing rather expensive but can be
useful when data is accessed in various ways. For example,
the parent process pointer is protected by both the
proctree_lock sx lock and the per-process mutex. Sometimes
the proc lock is easier, as we are just checking to see who the
parent is of a process that we already have locked. However,
other places such as inferior need to
walk the tree of processes via parent pointers. Locking
each process there would be prohibitive, and it would be a pain to
guarantee that the condition you are checking remains valid
for both the check and the actions taken as a result of the
check.

Locking Conditions and Results

If you need a lock to check the state of a variable so
that you can take an action based on the state you read, you
cannot just hold the lock while reading the variable and then
drop the lock before you act on the value you read. Once you
drop the lock, the variable can change, rendering your decision
invalid. Thus, you must hold the lock both while reading the
variable and while performing the action as a result of the
test.
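
The rule above can be sketched in user-space C. This is a minimal illustration rather than kernel code: a POSIX mutex stands in for a kernel mutex, and the names pool_lock, pool_free, pool_take_racy(), and pool_take() are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical resource pool; pool_lock protects pool_free. */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static int pool_free = 1;

/*
 * BROKEN: the lock is dropped between the test and the action, so the
 * decision may be stale by the time it is acted upon.
 */
bool
pool_take_racy(void)
{
    bool available;

    pthread_mutex_lock(&pool_lock);
    available = (pool_free > 0);
    pthread_mutex_unlock(&pool_lock);
    /* Another thread may take the last item here. */
    if (!available)
        return (false);
    pthread_mutex_lock(&pool_lock);
    pool_free--;        /* may drive the count negative */
    pthread_mutex_unlock(&pool_lock);
    return (true);
}

/* Correct: the lock is held across both the test and the action. */
bool
pool_take(void)
{
    bool taken = false;

    pthread_mutex_lock(&pool_lock);
    if (pool_free > 0) {
        pool_free--;
        taken = true;
    }
    pthread_mutex_unlock(&pool_lock);
    return (taken);
}
```

With a single thread both functions behave the same; the difference only shows when two threads call pool_take_racy() at once and both observe pool_free > 0 before either decrements it.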
-
+ General Architecture and Design

Interrupt Handling

Following the pattern of several other multi-threaded UNIX
kernels, FreeBSD deals with interrupt handlers by giving them
their own thread context. Providing a context for interrupt
handlers allows them to block on locks. To help avoid
latency, however, interrupt threads run at real-time kernel
priority. Thus, interrupt handlers should not execute for very
long to avoid starving other kernel threads. In addition,
since multiple handlers may share an interrupt thread,
interrupt handlers should not sleep or use a sleepable lock to
avoid starving another interrupt handler.

The interrupt threads currently in FreeBSD are referred to
as heavyweight interrupt threads. They are called this
because switching to an interrupt thread involves a full
context switch. In the initial implementation, the kernel was
not preemptive and thus interrupts that interrupted a kernel
thread would have to wait until the kernel thread blocked or
returned to userland before they would have an opportunity to
run.

To deal with the latency problems, the kernel in FreeBSD
has been made preemptive. Currently, we only preempt a kernel
thread when we release a sleep mutex or when an interrupt
comes in. However, the plan is to make the FreeBSD kernel
fully preemptive as described below.

Not all interrupt handlers execute in a thread context.
Instead, some handlers execute directly in primary interrupt
context. These interrupt handlers are currently misnamed
"fast" interrupt handlers since the
INTR_FAST flag used in earlier versions
of the kernel is used to mark these handlers. The only
interrupts which currently use these types of interrupt
handlers are clock interrupts and serial I/O device
interrupts. Since these handlers do not have their own
context, they may not acquire blocking locks and thus may only
use spin mutexes.

Finally, there is one optional optimization that can be
added in MD code called lightweight context switches. Since
an interrupt thread executes in a kernel context, it can
borrow the vmspace of any process. Thus, in a lightweight
context switch, the switch to the interrupt thread does not
switch vmspaces but borrows the vmspace of the interrupted
thread. In order to ensure that the vmspace of the
interrupted thread does not disappear out from under us, the
interrupted thread is not allowed to execute until the
interrupt thread is no longer borrowing its vmspace. This can
happen when the interrupt thread either blocks or finishes.
If an interrupt thread blocks, then it will use its own
context when it is made runnable again. Thus, it can release
the interrupted thread.

The cons of this optimization are that it is very
machine specific and complex and thus only worth the effort if
there is a large performance improvement. At this point it is
probably too early to tell, and in fact, it will probably hurt
performance as almost all interrupt handlers will immediately
block on Giant and require a thread fix-up when they block.
Also, an alternative method of interrupt handling has been
proposed by Mike Smith that works like so:

- Each interrupt handler has two parts: a predicate
  which runs in primary interrupt context and a handler
  which runs in its own thread context.

- If an interrupt handler has a predicate, then when an
  interrupt is triggered, the predicate is run. If the
  predicate returns true then the interrupt is assumed to be
  fully handled and the kernel returns from the interrupt.
  If the predicate returns false or there is no predicate,
  then the threaded handler is scheduled to run.

Fitting lightweight context switches into this scheme
might prove rather complicated. Since we may want to change
to this scheme at some point in the future, it is probably
best to defer work on lightweight context switches until we
have settled on the final interrupt handling architecture and
determined how lightweight context switches might or might
not fit into it.

Kernel Preemption and Critical Sections

Kernel Preemption in a Nutshell

Kernel preemption is fairly simple. The basic idea is
that a CPU should always be doing the highest priority work
available. Well, that is the ideal at least. There are a
couple of cases where the expense of achieving the ideal is
not worth being perfect.

Implementing full kernel preemption is very
straightforward: when you schedule a thread to be executed
by putting it on a runqueue, you check to see if its
priority is higher than the currently executing thread. If
so, you initiate a context switch to that thread.

While locks can protect most data in the case of a
preemption, not all of the kernel is preemption safe. For
example, if a thread holding a spin mutex is preempted and the
new thread attempts to grab the same spin mutex, the new
thread may spin forever as the interrupted thread may never
get a chance to execute. Also, some code such as the code
to assign an address space number for a process during
exec() on the Alpha must not be preempted as it supports
the actual context switch code. Preemption is disabled for
these code sections by using a critical section.

Critical Sections

The responsibility of the critical section API is to
prevent context switches inside of a critical section. With
a fully preemptive kernel, every
setrunqueue of a thread other than the
current thread is a preemption point. One implementation is
for critical_enter to set a per-thread
flag that is cleared by its counterpart. If
setrunqueue is called with this flag
set, it does not preempt regardless of the priority of the new
thread relative to the current thread. However, since
critical sections are used in spin mutexes to prevent
context switches and multiple spin mutexes can be acquired,
the critical section API must support nesting. For this
reason the current implementation uses a nesting count
instead of a single per-thread flag.

In order to minimize latency, preemptions inside of a
critical section are deferred rather than dropped. If a
thread that would normally be preempted to outside of a
critical section is made runnable, then a per-thread flag is set
to indicate that there is a pending preemption. When the
outermost critical section is exited, the flag is checked.
If the flag is set, then the current thread is preempted to
allow the higher priority thread to run.

Interrupts pose a problem with regard to spin mutexes.
If a low-level interrupt handler needs a lock, it must
not interrupt any code needing that lock to avoid possible
data structure corruption. Currently, providing this
mechanism is piggybacked onto the critical section API by means
of the cpu_critical_enter and
cpu_critical_exit functions. Currently
this API disables and re-enables interrupts on all of
FreeBSD's current platforms. This approach may not be
purely optimal, but it is simple to understand and simple to
get right. Theoretically, this second API need only be used
for spin mutexes that are used in primary interrupt context.
However, to make the code simpler, it is used for all spin
mutexes and even all critical sections. It may be desirable
to split out the MD API from the MI API and only use it in
conjunction with the MI API in the spin mutex
implementation. If this approach is taken, then the MD API
likely would need a rename to show that it is a separate API
now.

Design Tradeoffs

As mentioned earlier, a couple of trade-offs have been
made to sacrifice cases where perfect preemption may not
always provide the best performance.

The first trade-off is that the preemption code does not
take other CPUs into account. Suppose we have two CPUs, A
and B, with the priority of A's thread as 4 and the priority
of B's thread as 2. If CPU B makes a thread with priority 1
runnable, then in theory, we want CPU A to switch to the new
thread so that we will be running the two highest priority
runnable threads. However, the cost of determining which
CPU to enforce a preemption on as well as actually signaling
that CPU via an IPI along with the synchronization that
would be required would be enormous. Thus, the current code
would instead force CPU B to switch to the higher priority
thread. Note that this still puts the system in a better
position as CPU B is executing a thread of priority 1 rather
than a thread of priority 2.

The second trade-off limits immediate kernel preemption
to real-time priority kernel threads. In the simple case of
preemption defined above, a thread is always preempted
immediately (or as soon as a critical section is exited) if
a higher priority thread is made runnable. However, many
threads executing in the kernel only execute in a kernel
context for a short time before either blocking or returning
to userland. Thus, if the kernel preempts these threads to
run another non-realtime kernel thread, the kernel may
switch out the executing thread just before it is about to
sleep or execute. The cache on the CPU must then adjust to
the new thread. When the kernel returns to the preempted
thread, it must refill all the cache information that was lost.
In addition, two extra context switches are performed that
could be avoided if the kernel deferred the preemption until
the first thread blocked or returned to userland. Thus, by
default, the preemption code will only preempt immediately
if the higher priority thread is a real-time priority
thread.

Turning on full kernel preemption for all kernel threads
has value as a debugging aid since it exposes more race
conditions. It is especially useful on UP systems where many
races are hard to simulate otherwise. Thus, there will be a
kernel option to enable preemption for all kernel threads
that can be used for debugging purposes.

Thread Migration

Simply put, a thread migrates when it moves from one CPU
to another. In a non-preemptive kernel this can only happen
at well-defined points such as when calling
tsleep or returning to userland.
However, in the preemptive kernel, an interrupt can force a
preemption and possible migration at any time. This can have
negative effects on per-CPU data since with the exception of
curthread and curpcb the
data can change whenever you migrate. Since you can
potentially migrate at any time this renders per-CPU data
rather useless. Thus it is desirable to be able to disable
migration for sections of code that need per-CPU data to be
stable.

Critical sections currently prevent migration since they
do not allow context switches. However, this may be too strong
of a requirement to enforce in some cases since a critical
section also effectively blocks interrupt threads on the
current processor. As a result, it may be desirable to
provide an API whereby code may indicate that if the current
thread is preempted it should not migrate to another
CPU.

One possible implementation is to use a per-thread nesting
count td_pinnest along with a
td_pincpu which is updated to the current
CPU on each context switch. Each CPU has its own run queue
that holds threads pinned to that CPU. A thread is pinned
when its nesting count is greater than zero and a thread
starts off unpinned with a nesting count of zero. When a
thread is put on a runqueue, we check to see if it is pinned.
If so, we put it on the per-CPU runqueue, otherwise we put it
on the global runqueue. When
choosethread is called to retrieve the
next thread, it could either always prefer bound threads to
unbound threads or use some sort of bias when comparing
priorities. If the nesting count is only ever written to by
the thread itself and is only read by other threads when the
owning thread is not executing but while holding the
sched_lock, then
td_pinnest will not need any other locks.
The migrate_disable function would
increment the nesting count and
migrate_enable would decrement the
nesting count. Due to the locking requirements specified
above, they will only operate on the current thread and thus
would not need to handle the case of making a thread
migratable that currently resides on a per-CPU run
queue.

It is still debatable if this API is needed or if the
critical section API is sufficient by itself. Many of the
places that need to prevent migration also need to prevent
preemption as well, and in those places a critical section
must be used regardless.

Callouts

The timeout() kernel facility permits
kernel services to register functions for execution as part
of the softclock() software interrupt.
Events are scheduled based on a desired number of clock
ticks, and callbacks to the consumer-provided function
will occur at approximately the right time.

The global list of pending timeout events is protected
by a global spin mutex, callout_lock;
all access to the timeout list must be performed with this
mutex held. When softclock() is
woken up, it scans the list of pending timeouts for those
that should fire. In order to avoid lock order reversal,
the softclock thread will release the
callout_lock mutex when invoking the
provided timeout() callback function.
If the CALLOUT_MPSAFE flag was not set
during registration, then Giant will be grabbed before
invoking the callout, and then released afterwards. The
callout_lock mutex will be re-grabbed
before proceeding. The softclock()
code is careful to leave the list in a consistent state
while releasing the mutex. If DIAGNOSTIC
is enabled, then the time taken to execute each function is
measured, and a warning generated if it exceeds a
threshold.
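
The lock handling described above can be sketched as a small user-space model. This is an illustration under stated assumptions, not the kernel's implementation: POSIX mutexes stand in for the callout_lock spin mutex and for Giant, the pending list is a plain singly linked list, the value given to CALLOUT_MPSAFE is arbitrary, and callout_register(), softclock_model(), and count_fire() are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

#define CALLOUT_MPSAFE  0x1     /* flag value chosen for this sketch only */

struct callout {
    struct callout *c_next;     /* link on callout_list */
    int   c_time;               /* tick at which the callout is due */
    int   c_flags;
    void (*c_func)(void *);
    void *c_arg;
};

static pthread_mutex_t callout_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t Giant = PTHREAD_MUTEX_INITIALIZER;
static struct callout *callout_list;    /* protected by callout_lock */

/* Register c to fire at tick `when` (hypothetical stand-in for timeout()). */
void
callout_register(struct callout *c, int when, int flags,
    void (*func)(void *), void *arg)
{
    c->c_time = when;
    c->c_flags = flags;
    c->c_func = func;
    c->c_arg = arg;
    pthread_mutex_lock(&callout_lock);
    c->c_next = callout_list;
    callout_list = c;
    pthread_mutex_unlock(&callout_lock);
}

/* Fire every callout due at or before `now`; returns how many fired. */
int
softclock_model(int now)
{
    struct callout **cp, *c;
    void (*func)(void *);
    void *arg;
    int fired = 0, mpsafe;

    pthread_mutex_lock(&callout_lock);
    cp = &callout_list;
    while ((c = *cp) != NULL) {
        if (c->c_time > now) {
            cp = &c->c_next;
            continue;
        }
        /* Unlink and copy first so the list is consistent while unlocked. */
        *cp = c->c_next;
        c->c_next = NULL;
        func = c->c_func;
        arg = c->c_arg;
        mpsafe = (c->c_flags & CALLOUT_MPSAFE) != 0;
        pthread_mutex_unlock(&callout_lock);

        if (!mpsafe)
            pthread_mutex_lock(&Giant);
        func(arg);              /* runs without callout_lock held */
        if (!mpsafe)
            pthread_mutex_unlock(&Giant);
        fired++;

        pthread_mutex_lock(&callout_lock);
        cp = &callout_list;     /* the list may have changed; rescan */
    }
    pthread_mutex_unlock(&callout_lock);
    return (fired);
}

/* Example callback: increments the int that arg points to. */
void
count_fire(void *arg)
{
    (*(int *)arg)++;
}
```

Rescanning from the head after every callback is the simplest way to stay correct when a callback registers further callouts; it is a simplification made for this sketch, and a callback that re-registers itself for a tick that is already due would fire again in the same pass.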
-
+ Specific Locking Strategies

Credentials

struct ucred is the kernel's
internal credential structure, and is generally used as the
basis for process-driven access control within the kernel.
BSD-derived systems use a copy-on-write model for credential
data: multiple references may exist for a credential structure,
and when a change needs to be made, the structure is duplicated,
modified, and then the reference replaced. Due to widespread
caching of the credential to implement access control on open,
this results in substantial memory savings. With a move to
fine-grained SMP, this model also saves substantially on
locking operations by requiring that modification only occur
on an unshared credential, avoiding the need for explicit
synchronization when consuming a known-shared
credential.

Credential structures with a single reference are
considered mutable; shared credential structures must not be
modified or a race condition is risked. A mutex,
cr_mtxp, protects the reference
count of struct ucred so as to
maintain consistency. Any use of the structure requires a
valid reference for the duration of the use, or the structure
may be released out from under the illegitimate
consumer.

The struct ucred mutex is a leaf
mutex, and for performance reasons, is implemented via a mutex
pool.

Usually, credentials are used in a read-only manner for access
control decisions, and in this case td_ucred
is generally preferred because it requires no locking. When a
process' credential is updated, the proc lock
must be held across the check and update operations, thus avoiding
races. The process credential p_ucred
must be used for check and update operations to prevent
time-of-check, time-of-use races.

If system call invocations will perform access control after
an update to the process credential, the value of
td_ucred must also be refreshed to
the current process value. This will prevent use of a stale
credential following a change. The kernel automatically
refreshes the td_ucred pointer in
the thread structure from the process
p_ucred whenever a process enters
the kernel, permitting use of a fresh credential for kernel
access control.

File Descriptors and File Descriptor Tables

Details to follow.

Jail Structures

struct prison stores
administrative details pertinent to the maintenance of jails
created using the jail(2) API. This includes the
per-jail hostname, IP address, and related settings. This
structure is reference-counted since pointers to instances of
the structure are shared by many credential structures. A
single mutex, pr_mtx, protects read
and write access to the reference count and all mutable
variables inside the struct prison. Some variables are set only
when the jail is created, and a valid reference to the
struct prison is sufficient to read
these values. The precise locking of each entry is documented
via comments in sys/jail.h.

MAC Framework

The TrustedBSD MAC Framework maintains data in a variety
of kernel objects, in the form of struct
label. In general, labels in kernel objects
are protected by the same lock as the remainder of the kernel
object. For example, the v_label
label in struct vnode is protected
by the vnode lock on the vnode.

In addition to labels maintained in standard kernel objects,
the MAC Framework also maintains a list of registered and
active policies. The policy list is protected by a global
mutex (mac_policy_list_lock) and a busy
count (also protected by the mutex). Since many access
control checks may occur in parallel, entry to the framework
for a read-only access to the policy list requires holding the
mutex while incrementing (and later decrementing) the busy
count. The mutex need not be held for the duration of the
MAC entry operation: some operations, such as label operations
on file system objects, are long-lived. To modify the policy
list, such as during policy registration and de-registration,
the mutex must be held and the busy count must be zero,
to prevent modification of the list while it is in use.

A condition variable,
mac_policy_list_not_busy, is available to
threads that need to wait for the list to become unbusy, but
this condition variable must only be waited on if the caller is
holding no other locks, or a lock order violation may be
possible. The busy count, in effect, acts as a form of
shared/exclusive lock over access to the framework: the difference
is that, unlike with an sx lock, consumers waiting for the list
to become unbusy may be starved, rather than permitting lock
order problems with regard to the busy count and other locks
that may be held on entry to (or inside) the MAC Framework.

Modules

For the module subsystem there exists a single lock that is
used to protect the shared data. This lock is a shared/exclusive
(SX) lock and has a good chance of needing to be acquired (shared
or exclusively), therefore there are a few macros that have been
added to make access to the lock easier. These macros can be
located in sys/module.h and are quite basic
in terms of usage. The main structures protected under this lock
are the module_t structures (when shared)
and the global modulelist_t structure,
modules. One should review the related source code in
kern/kern_module.c to further understand the
locking strategy.

Newbus Device Tree

The newbus system will have one sx lock. Readers will
hold a shared (read) lock (sx_slock(9)) and writers will hold
an exclusive (write) lock (sx_xlock(9)). Internal functions
will not do locking at all. Externally visible ones will lock as
needed.
Items for which it does not matter whether a race is won or lost
will not be locked, since they tend to be read all over the place
(e.g. device_get_softc(9)). There will be relatively few
changes to the newbus data structures, so a single lock should
be sufficient and not impose a performance penalty.

Pipes

...

Processes and Threads

- process hierarchy
- proc locks, references
- thread-specific copies of proc entries to freeze during system
  calls, including td_ucred
- inter-process operations
- process groups and sessions

Scheduler

Lots of references to sched_lock and notes
pointing at specific primitives and related magic elsewhere in the
document.

Select and Poll

The select() and poll() functions permit threads to block
waiting on events on file descriptors--most frequently, whether
or not the file descriptors are readable or writable.

...

SIGIO

The SIGIO service permits a process to request the delivery
of a SIGIO signal to its process group when the read/write status
of specified file descriptors changes. At most one process or
process group is permitted to register for SIGIO from any given
kernel object, and that process or group is referred to as
the owner. Each object supporting SIGIO registration contains
a pointer field that is NULL if the object is not registered, or
points to a struct sigio describing
the registration. This field is protected by a global mutex,
sigio_lock. Callers to SIGIO maintenance
functions must pass in this field by reference so that local
register copies of the field are not made when unprotected by
the lock.

One struct sigio is allocated for
each registered object associated with any process or process
group, and contains back-pointers to the object, owner, signal
information, a credential, and the general disposition of the
registration. Each process or process group contains a list of
registered struct sigio structures,
p_sigiolst for processes, and
pg_sigiolst for process groups.
These lists are protected by the process or process group
locks respectively. Most fields in each struct
sigio are constant for the duration of the
registration, with the exception of the
sio_pgsigio field which links the
struct sigio into the process or
process group list. Developers implementing new kernel
objects supporting SIGIO will, in general, want to avoid
holding structure locks while invoking SIGIO supporting
functions, such as fsetown()
or funsetown() to avoid
defining a lock order between structure locks and the global
SIGIO lock. This is generally possible through use of an
elevated reference count on the structure, such as reliance
on a file descriptor reference to a pipe during a pipe
operation.

Sysctl

The sysctl() MIB service is invoked
from both within the kernel and from userland applications
using a system call. At least two issues are raised in locking:
first, the protection of the structures maintaining the
namespace, and second, interactions with kernel variables and
functions that are accessed by the sysctl interface. Since
sysctl permits the direct export (and modification) of
kernel statistics and configuration parameters, the sysctl
mechanism must become aware of appropriate locking semantics
for those variables. Currently, sysctl makes use of a
single global sx lock to serialize use of sysctl(); however, it
is assumed to operate under Giant and other protections are not
provided. The remainder of this section speculates on locking
and semantic changes to sysctl.

- Need to change the order of operations for sysctls that
  update values from "read old, copyin and copyout, write new" to
  "copyin, lock, read old and write new, unlock, copyout". Normal
  sysctls that just copyout the old value and set a new value
  that they copyin may still be able to follow the old model.
  However, it may be cleaner to use the second model for all of
  the sysctl handlers to avoid lock operations.

- To allow for the common case, a sysctl could embed a
  pointer to a mutex in the SYSCTL_FOO macros and in the struct.
  This would work for most sysctls. For values protected by sx
  locks, spin mutexes, or other locking strategies besides a
  single sleep mutex, SYSCTL_PROC nodes could be used to get the
  locking right.

Taskqueue

The taskqueue's interface has two basic locks associated
with it in order to protect the related shared data. The
taskqueue_queues_mutex is meant to serve as a
lock to protect the taskqueue_queues TAILQ.
The other mutex lock associated with this system is the one in the
struct taskqueue data structure. The
use of the synchronization primitive here is to protect the
integrity of the data in the struct
taskqueue. It should be noted that there are no
separate macros to assist the user in locking down their own work
since these locks are most likely not going to be used outside of
kern/subr_taskqueue.c.
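
The division of labor between the two locks can be sketched with a user-space model. This is only an illustration under assumptions: POSIX mutexes replace the kernel mutexes, hand-rolled singly linked lists replace the queue macros, and every function name ending in _model, along with task_count(), is hypothetical and not taken from kern/subr_taskqueue.c.

```c
#include <pthread.h>
#include <stddef.h>

struct task {
    struct task *ta_next;       /* link on the owning queue */
    int    ta_pending;          /* nonzero while queued */
    void (*ta_func)(void *);
    void  *ta_arg;
};

struct taskqueue {
    struct taskqueue *tq_link;  /* on taskqueue_queues (global mutex) */
    pthread_mutex_t   tq_mutex; /* protects the fields below */
    struct task      *tq_head;
    struct task     **tq_tailp;
};

/* Global lock: protects only the list of all queues. */
static pthread_mutex_t taskqueue_queues_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct taskqueue *taskqueue_queues;

void
taskqueue_init_model(struct taskqueue *tq)
{
    pthread_mutex_init(&tq->tq_mutex, NULL);
    tq->tq_head = NULL;
    tq->tq_tailp = &tq->tq_head;
    pthread_mutex_lock(&taskqueue_queues_mutex);
    tq->tq_link = taskqueue_queues;
    taskqueue_queues = tq;
    pthread_mutex_unlock(&taskqueue_queues_mutex);
}

/* Returns 1 if the task was newly queued, 0 if it was already pending. */
int
taskqueue_enqueue_model(struct taskqueue *tq, struct task *t)
{
    int queued = 0;

    pthread_mutex_lock(&tq->tq_mutex);
    if (!t->ta_pending) {
        t->ta_pending = 1;
        t->ta_next = NULL;
        *tq->tq_tailp = t;
        tq->tq_tailp = &t->ta_next;
        queued = 1;
    }
    pthread_mutex_unlock(&tq->tq_mutex);
    return (queued);
}

/* Run every queued task; the per-queue lock is dropped around each call. */
int
taskqueue_run_model(struct taskqueue *tq)
{
    struct task *t;
    void (*func)(void *);
    void *arg;
    int ran = 0;

    pthread_mutex_lock(&tq->tq_mutex);
    while ((t = tq->tq_head) != NULL) {
        tq->tq_head = t->ta_next;
        if (tq->tq_head == NULL)
            tq->tq_tailp = &tq->tq_head;
        t->ta_pending = 0;
        func = t->ta_func;
        arg = t->ta_arg;
        pthread_mutex_unlock(&tq->tq_mutex);
        func(arg);
        ran++;
        pthread_mutex_lock(&tq->tq_mutex);
    }
    pthread_mutex_unlock(&tq->tq_mutex);
    return (ran);
}

/* Example task function: increments the int that arg points to. */
void
task_count(void *arg)
{
    (*(int *)arg)++;
}
```

In this model the global mutex is touched only when a queue is created, while all enqueue and run activity contends on the per-queue mutex alone, which is the separation the paragraph above describes.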
-
+ Implementation Notes

Details of the Mutex Implementation

- Should we require mutexes to be owned for mtx_destroy()
  since we cannot safely assert that they are unowned by anyone
  else otherwise?

Spin Mutexes

- Use a critical section...

Sleep Mutexes

- Describe the races with contested mutexes
- Why it is safe to read mtx_lock of a contested mutex
  when holding sched_lock.
- Priority propagation

Witness

- What does it do
- How does it work
-
+ Miscellaneous Topics

Interrupt Source and ICU Abstractions

- struct isrc
- pic drivers

Other Random Questions/Topics

- Should we pass an interlock into
  sema_wait?
- Generic turnstiles for sleep mutexes and sx locks.
- Should we have non-sleepable sx locks?
-
+ Glossary
-
+ atomic

An operation is atomic if all of its effects are visible
to other CPUs together when the proper access protocol is
followed. The degenerate case is atomic instructions
provided directly by machine architectures. At a higher
level, if several members of a structure are protected by a
lock, then a set of operations is atomic if they are all
performed while holding the lock without releasing the lock
in between any of the operations.

See also operation.
-
+ block

A thread is blocked when it is waiting on a lock,
resource, or condition. Unfortunately this term is a bit
overloaded as a result.

See also sleep.
-
+ critical section

A section of code that is not allowed to be preempted.
A critical section is entered and exited using the
critical_enter(9) API.
-
+ MD

Machine dependent.

See also MI.
-
+ memory operation

A memory operation reads and/or writes to a memory
location.
-
+ MI

Machine independent.

See also MD.
-
+ operation

See memory operation.
-
+ primary interrupt context

Primary interrupt context refers to the code that runs
when an interrupt occurs. This code can either run an
interrupt handler directly or schedule an asynchronous
interrupt thread to execute the interrupt handlers for a
given interrupt source.

realtime kernel thread

A high priority kernel thread. Currently, the only
realtime priority kernel threads are interrupt threads.

See also thread.
-
+ sleep

A thread is asleep when it is blocked on a condition
variable or a sleep queue via msleep or
tsleep.

See also block.
-
+ sleepable lock

A sleepable lock is a lock that can be held by a thread
which is asleep. Lockmgr locks and sx locks are currently
the only sleepable locks in FreeBSD. Eventually, some sx
locks such as the allproc and proctree locks may become
non-sleepable locks.

See also sleep.
-
+ thread

A kernel thread represented by a struct thread. Threads own
locks and hold a single execution context.