diff --git a/en_US.ISO8859-1/articles/smp/article.sgml b/en_US.ISO8859-1/articles/smp/article.sgml
index c6c78e0feb..14c53dc77c 100644
--- a/en_US.ISO8859-1/articles/smp/article.sgml
+++ b/en_US.ISO8859-1/articles/smp/article.sgml
@@ -1,957 +1,959 @@
%man;
%authors;
%misc;
+
+%freebsd;
]>
SMPng Design Document
John
Baldwin
Robert
Watson
$FreeBSD$
2002
2003
John Baldwin
Robert Watson
This document presents the current design and implementation of
the SMPng Architecture. First, the basic primitives and tools are
introduced. Next, a general architecture for the FreeBSD kernel's
synchronization and execution model is laid out. Then, locking
strategies for specific subsystems are discussed, documenting the
approaches taken to introduce fine-grained synchronization and
parallelism for each subsystem. Finally, detailed implementation
notes are provided to motivate design choices, and make the reader
aware of important implications involving the use of specific
primitives.
Introduction
This document is a work-in-progress, and will be updated to
reflect on-going design and implementation activities associated
with the SMPng Project. Many sections currently exist only in
outline form, but will be fleshed out as work proceeds. Updates or
suggestions regarding the document may be directed to the document
editors.
The goal of SMPng is to allow concurrency in the kernel.
The kernel is basically one rather large and complex program. To
make the kernel multi-threaded we use some of the same tools used
to make other programs multi-threaded. These include mutexes,
shared/exclusive locks, semaphores, and condition variables. For
the definitions of these and other SMP-related terms, please see
the Glossary section of this article.
Basic Tools and Locking Fundamentals
Atomic Instructions and Memory Barriers
There are several existing treatments of memory barriers
and atomic instructions, so this section will not include a
lot of detail. To put it simply, one cannot read a
variable without a lock if a lock is used to protect writes
to that variable. This becomes obvious when you consider that
memory barriers simply determine relative order of memory
operations; they do not make any guarantee about timing of
memory operations. That is, a memory barrier does not force
the contents of a CPU's local cache or store buffer to flush.
Instead, the memory barrier at lock release simply ensures
that all writes to the protected data will be visible to other
CPUs or devices if the write to release the lock is visible.
The CPU is free to keep that data in its cache or store buffer
as long as it wants. However, if another CPU performs an
atomic instruction on the same datum, the first CPU must
guarantee that the updated value is made visible to the second
CPU along with any other operations that memory barriers may
require.
For example, assuming a simple model where data is
considered visible when it is in main memory (or a global
cache), when an atomic instruction is triggered on one CPU,
other CPUs' store buffers and caches must flush any writes to
that same cache line along with any pending operations behind
a memory barrier.
This requires one to take special care when using an item
protected by atomic instructions. For example, in the sleep
mutex implementation, we have to use an
atomic_cmpset rather than an
atomic_set to turn on the
MTX_CONTESTED bit. The reason is that we
read the value of mtx_lock into a
variable and then make a decision based on that read.
However, the value we read may be stale, or it may change
while we are making our decision. Thus, when the
atomic_set is executed, it may end up
setting the bit on a different value than the one we based the
decision on. Therefore, we have to use an
atomic_cmpset to set the value only if
the value we made the decision on is up-to-date and
valid.
Finally, atomic instructions only allow one item to be
updated or read. If one needs to atomically update several
items, then a lock must be used instead. For example, if two
counters must be read and have values that are consistent
relative to each other, then those counters must be protected
by a lock rather than by separate atomic instructions.
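For example, a hedged sketch of the two-counter case (the lock and
counter names are hypothetical; the mutex is assumed to be
initialized elsewhere):

	static struct mtx stats_mtx;
	static long stats_in, stats_out;  /* must stay mutually consistent */

	static void
	stats_record(void)
	{
		mtx_lock(&stats_mtx);
		stats_in++;		/* both updates become visible */
		stats_out++;		/* together, never one without the other */
		mtx_unlock(&stats_mtx);
	}

Readers take the same mutex before reading both counters.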
Read Locks versus Write Locks
Read locks do not need to be as strong as write locks.
Both types of locks need to ensure that the data they are
accessing is not stale. However, only write access requires
exclusive access. Multiple threads can safely read a value.
Using different types of locks for reads and writes can be
implemented in a number of ways.
First, sx locks can be used in this manner by using an
exclusive lock when writing and a shared lock when reading.
This method is quite straightforward.
A second method is a bit more obscure. You can protect a
datum with multiple locks. Then for reading that data you
simply need to have a read lock of one of the locks. However,
to write to the data, you need to have a write lock of all of
the locks. This can make writing rather expensive but can be
useful when data is accessed in various ways. For example,
the parent process pointer is protected by both the
proctree_lock sx lock and the per-process mutex. Sometimes
the proc lock is easier to use, since we are simply checking the
parent of a process that we already have locked. However,
other places, such as inferior, need to
walk the tree of processes via parent pointers; locking
each process would be prohibitively expensive, and it would be
painful to guarantee that the condition being checked remains
valid for both the check and the actions taken as a result of
the check.
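A hedged sketch of the two access styles using the parent-pointer
example (simplified; not the exact kernel code):

	/* Reader: either lock alone is sufficient. */
	static struct proc *
	parent_of(struct proc *p)
	{
		struct proc *pp;

		PROC_LOCK(p);
		pp = p->p_pptr;
		PROC_UNLOCK(p);
		return (pp);
	}

	/* Writer: both locks must be held. */
	static void
	reparent(struct proc *p, struct proc *newparent)
	{
		sx_xlock(&proctree_lock);
		PROC_LOCK(p);
		p->p_pptr = newparent;
		PROC_UNLOCK(p);
		sx_xunlock(&proctree_lock);
	}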
Locking Conditions and Results
If you need a lock to check the state of a variable so
that you can take an action based on the state you read, you
cannot just hold the lock while reading the variable and then
drop the lock before you act on the value you read. Once you
drop the lock, the variable can change, rendering your decision
invalid. Thus, you must hold the lock both while reading the
variable and while performing the action as a result of the
test.
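A hedged illustration of this rule (the structure, state values,
and lock here are all hypothetical):

	static void
	foo_start(struct foo *foo)
	{
		mtx_lock(&foo_mtx);
		if (foo->state == FOO_IDLE)	/* the check... */
			foo->state = FOO_BUSY;	/* ...and the action */
		mtx_unlock(&foo_mtx);
	}

Dropping the lock between the test and the assignment would let
another thread change foo->state in the window.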
General Architecture and Design
Interrupt Handling
- Following the pattern of several other multi-threaded Unix
+ Following the pattern of several other multi-threaded &unix;
kernels, FreeBSD deals with interrupt handlers by giving them
their own thread context. Providing a context for interrupt
handlers allows them to block on locks. To help avoid
latency, however, interrupt threads run at real-time kernel
priority. Thus, interrupt handlers should not execute for very
long to avoid starving other kernel threads. In addition,
since multiple handlers may share an interrupt thread,
interrupt handlers should not sleep or use a sleepable lock to
avoid starving another interrupt handler.
The interrupt threads currently in FreeBSD are referred to
as heavyweight interrupt threads. They are called this
because switching to an interrupt thread involves a full
context switch. In the initial implementation, the kernel was
not preemptive and thus interrupts that interrupted a kernel
thread would have to wait until the kernel thread blocked or
returned to userland before they would have an opportunity to
run.
To deal with the latency problems, the kernel in FreeBSD
has been made preemptive. Currently, we only preempt a kernel
thread when we release a sleep mutex or when an interrupt
comes in. However, the plan is to make the FreeBSD kernel
fully preemptive as described below.
Not all interrupt handlers execute in a thread context.
Instead, some handlers execute directly in primary interrupt
context. These interrupt handlers are currently misnamed fast
interrupt handlers since the
INTR_FAST flag used in earlier versions
of the kernel is used to mark these handlers. The only
interrupts which currently use these types of interrupt
handlers are clock interrupts and serial I/O device
interrupts. Since these handlers do not have their own
context, they may not acquire blocking locks and thus may only
use spin mutexes.
Finally, there is one optional optimization that can be
added in MD code called lightweight context switches. Since
an interrupt thread executes in a kernel context, it can
borrow the vmspace of any process. Thus, in a lightweight
context switch, the switch to the interrupt thread does not
switch vmspaces but borrows the vmspace of the interrupted
thread. In order to ensure that the vmspace of the
interrupted thread does not disappear out from under us, the
interrupted thread is not allowed to execute until the
interrupt thread is no longer borrowing its vmspace. This can
happen when the interrupt thread either blocks or finishes.
If an interrupt thread blocks, then it will use its own
context when it is made runnable again. Thus, it can release
the interrupted thread.
The cons of this optimization are that it is very
machine specific and complex and thus only worth the effort if
there is a large performance improvement. At this point it is
probably too early to tell; in fact, it will probably hurt
performance, as almost all interrupt handlers will immediately
block on Giant and require a thread fix-up when they block.
Also, an alternative method of interrupt handling has been
proposed by Mike Smith that works like so:
Each interrupt handler has two parts: a predicate
which runs in primary interrupt context and a handler
which runs in its own thread context.
If an interrupt handler has a predicate, then when an
interrupt is triggered, the predicate is run. If the
predicate returns true then the interrupt is assumed to be
fully handled and the kernel returns from the interrupt.
If the predicate returns false or there is no predicate,
then the threaded handler is scheduled to run.
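A sketch of how such a two-part handler might look under this
proposal; every name here is hypothetical, since the scheme was
never adopted in this form:

	struct ihandler {
		int	(*ih_predicate)(void *);  /* primary context */
		void	(*ih_handler)(void *);	  /* thread context */
		void	*ih_arg;
	};

	/* Called from primary interrupt context. */
	static void
	ihandler_dispatch(struct ihandler *ih)
	{
		if (ih->ih_predicate != NULL &&
		    ih->ih_predicate(ih->ih_arg))
			return;		/* fully handled; no thread needed */
		ithread_schedule(ih);	/* hypothetical: run ih_handler later */
	}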
Fitting light weight context switches into this scheme
might prove rather complicated. Since we may want to change
to this scheme at some point in the future, it is probably
best to defer work on light weight context switches until we
have settled on the final interrupt handling architecture and
determined how light weight context switches might or might
not fit into it.
Kernel Preemption and Critical Sections
Kernel Preemption in a Nutshell
Kernel preemption is fairly simple. The basic idea is
that a CPU should always be doing the highest priority work
available. Well, that is the ideal at least. There are a
couple of cases where the expense of achieving the ideal is
not worth being perfect.
Implementing full kernel preemption is very
straightforward: when you schedule a thread to be executed
by putting it on a runqueue, you check to see if its
priority is higher than that of the currently executing thread. If
so, you initiate a context switch to that thread.
While locks can protect most data in the case of a
preemption, not all of the kernel is preemption safe. For
example, if a thread holding a spin mutex is preempted and the
new thread attempts to grab the same spin mutex, the new
thread may spin forever as the interrupted thread may never
get a chance to execute. Also, some code, such as the code
to assign an address space number for a process during
exec() on the Alpha, must not be preempted since it supports
the actual context switch code. Preemption is disabled for
these code sections by using a critical section.
Critical Sections
The responsibility of the critical section API is to
prevent context switches inside of a critical section. With
a fully preemptive kernel, every
setrunqueue of a thread other than the
current thread is a preemption point. One implementation is
for critical_enter to set a per-thread
flag that is cleared by its counterpart. If
setrunqueue is called with this flag
set, it does not preempt regardless of the priority of the new
thread relative to the current thread. However, since
critical sections are used in spin mutexes to prevent
context switches and multiple spin mutexes can be acquired,
the critical section API must support nesting. For this
reason the current implementation uses a nesting count
instead of a single per-thread flag.
In order to minimize latency, preemptions inside of a
critical section are deferred rather than dropped. If a
thread that would normally trigger a preemption is made
runnable inside a critical section, then a per-thread flag is set
to indicate that there is a pending preemption. When the
outermost critical section is exited, the flag is checked.
If the flag is set, then the current thread is preempted to
allow the higher priority thread to run.
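A simplified sketch of the nesting count and deferred-preemption
flag (the real kernel code differs in detail; td_pending_preempt is
an assumed field name used only for illustration):

	void
	critical_enter(void)
	{
		curthread->td_critnest++;
	}

	void
	critical_exit(void)
	{
		struct thread *td = curthread;

		if (--td->td_critnest == 0 && td->td_pending_preempt) {
			td->td_pending_preempt = 0;
			mi_switch();	/* take the deferred preemption */
		}
	}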
Interrupts pose a problem with regards to spin mutexes.
If a low-level interrupt handler needs a lock, it must not
interrupt any other code that uses that lock, or data
structure corruption could result. Currently, this
mechanism is piggybacked onto the critical section API by means
of the cpu_critical_enter and
cpu_critical_exit functions. Currently
this API disables and re-enables interrupts on all of
FreeBSD's current platforms. This approach may not be
purely optimal, but it is simple to understand and simple to
get right. Theoretically, this second API need only be used
for spin mutexes that are used in primary interrupt context.
However, to make the code simpler, it is used for all spin
mutexes and even all critical sections. It may be desirable
to split out the MD API from the MI API and only use it in
conjunction with the MI API in the spin mutex
implementation. If this approach is taken, then the MD API
likely would need a rename to show that it is a separate API
now.
Design Tradeoffs
As mentioned earlier, a couple of trade-offs have been
made to sacrifice cases where perfect preemption may not
always provide the best performance.
The first trade-off is that the preemption code does not
take other CPUs into account. Suppose we have two CPUs, A
and B, where the priority of A's thread is 4 and the priority
of B's thread as 2. If CPU B makes a thread with priority 1
runnable, then in theory, we want CPU A to switch to the new
thread so that we will be running the two highest priority
runnable threads. However, the cost of determining which
CPU to enforce a preemption on as well as actually signaling
that CPU via an IPI along with the synchronization that
would be required would be enormous. Thus, the current code
would instead force CPU B to switch to the higher priority
thread. Note that this still puts the system in a better
position as CPU B is executing a thread of priority 1 rather
than a thread of priority 2.
The second trade-off limits immediate kernel preemption
to real-time priority kernel threads. In the simple case of
preemption defined above, a thread is always preempted
immediately (or as soon as a critical section is exited) if
a higher priority thread is made runnable. However, many
threads executing in the kernel only execute in a kernel
context for a short time before either blocking or returning
to userland. Thus, if the kernel preempts these threads to
run another non-realtime kernel thread, the kernel may
switch out the executing thread just before it is about to
sleep or execute. The cache on the CPU must then adjust to
the new thread. When the kernel returns to the interrupted
thread, it must refill all the cache information that was lost.
In addition, two extra context switches are performed that
could be avoided if the kernel deferred the preemption until
the first thread blocked or returned to userland. Thus, by
default, the preemption code will only preempt immediately
if the higher priority thread is a real-time priority
thread.
Turning on full kernel preemption for all kernel threads
has value as a debugging aid since it exposes more race
conditions. It is especially useful on UP systems where many
races are hard to simulate otherwise. Thus, there will be a
kernel option to enable preemption for all kernel threads
that can be used for debugging purposes.
Thread Migration
Simply put, a thread migrates when it moves from one CPU
to another. In a non-preemptive kernel this can only happen
at well-defined points such as when calling
tsleep or returning to userland.
However, in the preemptive kernel, an interrupt can force a
preemption and possible migration at any time. This can have
negative effects on per-CPU data since, with the exception of
curthread and curpcb, the
data can change whenever you migrate. Since you can
potentially migrate at any time this renders per-CPU data
rather useless. Thus it is desirable to be able to disable
migration for sections of code that need per-CPU data to be
stable.
Critical sections currently prevent migration since they
do not allow context switches. However, this may be too strong
a requirement to enforce in some cases, since a critical
section also effectively blocks interrupt threads on the
current processor. As a result, it may be desirable to
provide an API whereby code may indicate that if the current
thread is preempted it should not migrate to another
CPU.
One possible implementation is to use a per-thread nesting
count td_pinnest along with a
td_pincpu which is updated to the current
CPU on each context switch. Each CPU has its own run queue
that holds threads pinned to that CPU. A thread is pinned
when its nesting count is greater than zero and a thread
starts off unpinned with a nesting count of zero. When a
thread is put on a runqueue, we check to see if it is pinned.
If so, we put it on the per-CPU runqueue, otherwise we put it
on the global runqueue. When
choosethread is called to retrieve the
next thread, it could either always prefer bound threads to
unbound threads or use some sort of bias when comparing
priorities. If the nesting count is only ever written to by
the thread itself, and is only read by other threads holding
sched_lock while the owning thread is not executing, then
td_pinnest will not need any other locks.
The migrate_disable function would
increment the nesting count and
migrate_enable would decrement the
nesting count. Due to the locking requirements specified
above, they will only operate on the current thread and thus
would not need to handle the case of making a thread
migrateable that currently resides on a per-CPU run
queue.
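A sketch of the proposed API using the field names from the text
(these are proposals under discussion, not committed code):

	void
	migrate_disable(void)
	{
		curthread->td_pinnest++;	/* pinned to td_pincpu */
	}

	void
	migrate_enable(void)
	{
		curthread->td_pinnest--;	/* unpinned at zero */
	}

Because each function only touches curthread, no extra locking is
needed for td_pinnest under the rules given above.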
It is still debatable if this API is needed or if the
critical section API is sufficient by itself. Many of the
places that need to prevent migration also need to prevent
preemption as well, and in those places a critical section
must be used regardless.
Callouts
The timeout() kernel facility permits
kernel services to register functions for execution as part
of the softclock() software interrupt.
Events are scheduled based on a desired number of clock
ticks, and callbacks to the consumer-provided function
will occur at approximately the right time.
The global list of pending timeout events is protected
by a global spin mutex, callout_lock;
all access to the timeout list must be performed with this
mutex held. When softclock() is
woken up, it scans the list of pending timeouts for those
that should fire. In order to avoid lock order reversal,
the softclock thread will release the
callout_lock mutex when invoking the
provided timeout() callback function.
If the CALLOUT_MPSAFE flag was not set
during registration, then Giant will be grabbed before
invoking the callout, and then released afterwards. The
callout_lock mutex will be re-grabbed
before proceeding. The softclock()
code is careful to leave the list in a consistent state
while releasing the mutex. If DIAGNOSTIC
is enabled, then the time taken to execute each function is
measured, and a warning generated if it exceeds a
threshold.
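A typical consumer of the facility looks something like the
following hedged sketch (sc is a stand-in argument; Giant handling
is as described above):

	static struct callout_handle my_handle;

	static void
	my_timeout_fn(void *arg)
	{
		/* deferred work; runs from softclock(), keep it short */
	}

	/* Schedule my_timeout_fn roughly one second (hz ticks) out. */
	my_handle = timeout(my_timeout_fn, sc, hz);

	/* Cancel the pending call if it has not fired yet. */
	untimeout(my_timeout_fn, sc, my_handle);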
Specific Locking Strategies
Credentials
struct ucred is the kernel's
internal credential structure, and is generally used as the
basis for process-driven access control within the kernel.
BSD-derived systems use a copy-on-write
model for credential
data: multiple references may exist for a credential structure,
and when a change needs to be made, the structure is duplicated,
modified, and then the reference replaced. Due to widespread
caching of the credential to implement access control on open,
this results in substantial memory savings. With a move to
fine-grained SMP, this model also saves substantially on
locking operations by requiring that modification only occur
on an unshared credential, avoiding the need for explicit
synchronization when consuming a known-shared
credential.
Credential structures with a single reference are
considered mutable; shared credential structures must not be
modified or a race condition is risked. A mutex,
cr_mtxp, protects the reference
count of struct ucred so as to
maintain consistency. Any use of the structure requires a
valid reference for the duration of the use, or the structure
may be released out from under the illegitimate
consumer.
The struct ucred mutex is a leaf
mutex, and for performance reasons, is implemented via a mutex
pool.
Usually, credentials are used in a read-only manner for access
control decisions, and in this case td_ucred
is generally preferred because it requires no locking. When a
process' credential is updated, the proc lock
must be held across the check and update operations to avoid
races. The process credential p_ucred
must be used for check and update operations to prevent
time-of-check, time-of-use races.
If system call invocations will perform access control after
an update to the process credential, the value of
td_ucred must also be refreshed to
the current process value. This will prevent use of a stale
credential following a change. The kernel automatically
refreshes the td_ucred pointer in
the thread structure from the process
p_ucred whenever a process enters
the kernel, permitting use of a fresh credential for kernel
access control.
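A hedged sketch of both uses (the function and uid parameter are
illustrative; the credential helpers shown are real kernel
interfaces of this era, and allocation happens before the proc
lock is taken because the lock is a mutex):

	static int
	set_uid_example(struct thread *td, struct proc *p, uid_t uid)
	{
		struct ucred *newcred, *oldcred;

		/* Read-only check: td_ucred needs no locking. */
		if (suser_cred(td->td_ucred, 0) != 0)
			return (EPERM);

		newcred = crget();		/* allocate before locking */
		PROC_LOCK(p);
		oldcred = p->p_ucred;
		crcopy(newcred, oldcred);	/* modify an unshared copy */
		newcred->cr_uid = uid;
		p->p_ucred = newcred;
		PROC_UNLOCK(p);
		crfree(oldcred);		/* drop the replaced reference */
		return (0);
	}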
File Descriptors and File Descriptor Tables
Details to follow.
Jail Structures
struct prison stores
administrative details pertinent to the maintenance of jails
created using the &man.jail.2; API. This includes the
per-jail hostname, IP address, and related settings. This
structure is reference-counted since pointers to instances of
the structure are shared by many credential structures. A
single mutex, pr_mtx, protects read
and write access to the reference count and all mutable
variables inside the struct jail. Some variables are set only
when the jail is created, and a valid reference to the
struct prison is sufficient to read
these values. The precise locking of each entry is documented
via comments in sys/jail.h.
MAC Framework
The TrustedBSD MAC Framework maintains data in a variety
of kernel objects, in the form of struct
label. In general, labels in kernel objects
are protected by the same lock as the remainder of the kernel
object. For example, the v_label
label in struct vnode is protected
by the vnode lock on the vnode.
In addition to labels maintained in standard kernel objects,
the MAC Framework also maintains a list of registered and
active policies. The policy list is protected by a global
mutex (mac_policy_list_lock) and a busy
count (also protected by the mutex). Since many access
control checks may occur in parallel, entry to the framework
for a read-only access to the policy list requires holding the
mutex while incrementing (and later decrementing) the busy
count. The mutex need not be held for the duration of the
MAC entry operation; some operations, such as label operations
on file system objects, are long-lived. To modify the policy
list, such as during policy registration and de-registration,
the mutex must be held and the reference count must be zero,
to prevent modification of the list while it is in use.
A condition variable,
mac_policy_list_not_busy, is available to
threads that need to wait for the list to become unbusy, but
this condition variable must only be waited on if the caller is
holding no other locks, or a lock order violation may be
possible. The busy count, in effect, acts as a form of
shared/exclusive lock over access to the framework: the difference
is that, unlike with an sx lock, consumers waiting for the list
to become unbusy may be starved, rather than permitting lock
order problems with regards to the busy count and other locks
that may be held on entry to (or inside) the MAC Framework.
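A simplified sketch of the busy-count protocol (the counter's name
is assumed for illustration; the lock and condition variable names
are those given in the text):

	static void
	mac_policy_list_busy(void)
	{
		mtx_lock(&mac_policy_list_lock);
		mac_policy_list_busy_count++;
		mtx_unlock(&mac_policy_list_lock);
	}

	static void
	mac_policy_list_unbusy(void)
	{
		mtx_lock(&mac_policy_list_lock);
		if (--mac_policy_list_busy_count == 0)
			cv_broadcast(&mac_policy_list_not_busy);
		mtx_unlock(&mac_policy_list_lock);
	}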
Modules
For the module subsystem there exists a single lock that is
used to protect the shared data. This lock is a shared/exclusive
(sx) lock with a good chance of needing to be acquired (shared
or exclusive), so a few macros have been
added to make access to the lock easier. These macros can be
found in sys/module.h and are quite basic
in terms of usage. The main structures protected under this lock
are the module_t structures (when shared)
and the global modulelist_t structure,
modules. One should review the related source code in
kern/kern_module.c to further understand the
locking strategy.
Newbus Device Tree
The newbus system will have one sx lock. Readers will
hold a shared (read) lock (&man.sx.slock.9;) and writers will hold
an exclusive (write) lock (&man.sx.xlock.9;). Internal functions
will not do locking at all. Externally visible ones will lock as
needed.
Items for which it does not matter whether the race is won or lost will
not be locked, since they tend to be read all over the place
(e.g. &man.device.get.softc.9;). There will be relatively few
changes to the newbus data structures, so a single lock should
be sufficient and not impose a performance penalty.
Pipes
...
Processes and Threads
- process hierarchy
- proc locks, references
- thread-specific copies of proc entries to freeze during system
calls, including td_ucred
- inter-process operations
- process groups and sessions
Scheduler
Lots of references to sched_lock and notes
pointing at specific primitives and related magic elsewhere in the
document.
Select and Poll
The select() and poll() functions permit threads to block
waiting for events on file descriptors; most frequently, whether
or not the file descriptors are readable or writable.
...
SIGIO
The SIGIO service permits a process to request the delivery
of a SIGIO signal to its process group when the read/write status
of specified file descriptors changes. At most one process or
process group is permitted to register for SIGIO from any given
kernel object, and that process or group is referred to as
the owner. Each object supporting SIGIO registration contains a
pointer field that is NULL if the object is not registered, or
points to a struct sigio describing
the registration. This field is protected by a global mutex,
sigio_lock. Callers to SIGIO maintenance
functions must pass in this field by reference
so that local
register copies of the field are not made when unprotected by
the lock.
One struct sigio is allocated for
each registered object associated with any process or process
group, and contains back-pointers to the object, owner, signal
information, a credential, and the general disposition of the
registration. Each process or process group contains a list of
registered struct sigio structures,
p_sigiolst for processes, and
pg_sigiolst for process groups.
These lists are protected by the process or process group
locks respectively. Most fields in each struct
sigio are constant for the duration of the
registration, with the exception of the
sio_pgsigio field, which links the
struct sigio into the process or
process group list. Developers implementing new kernel
objects supporting SIGIO will, in general, want to avoid
holding structure locks while invoking SIGIO supporting
functions, such as fsetown()
or funsetown() to avoid
defining a lock order between structure locks and the global
SIGIO lock. This is generally possible through use of an
elevated reference count on the structure, such as reliance
on a file descriptor reference to a pipe during a pipe
operation.
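For example, a hypothetical object would embed and pass the field
as follows (my_object and mo_sigio are illustrative names):

	struct my_object {
		struct sigio *mo_sigio;	/* protected by sigio_lock */
		/* ... */
	};

	/* Register the given pgid as owner; note the field is
	 * passed by reference, as required above. */
	error = fsetown(pgid, &obj->mo_sigio);

	/* Drop the registration, for example on last close. */
	funsetown(&obj->mo_sigio);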
Sysctl
The sysctl() MIB service is invoked
from both within the kernel and from userland applications
using a system call. At least two issues are raised in locking:
first, the protection of the structures maintaining the
namespace, and second, interactions with kernel variables and
functions that are accessed by the sysctl interface. Since
sysctl permits the direct export (and modification) of
kernel statistics and configuration parameters, the sysctl
mechanism must become aware of appropriate locking semantics
for those variables. Currently, sysctl makes use of a
single global sx lock to serialize use of sysctl(); however, it
is assumed to operate under Giant and other protections are not
provided. The remainder of this section speculates on locking
and semantic changes to sysctl.
- Need to change the order of operations for sysctl's that
update values from read old, copyin and copyout, write new to
copyin, lock, read old and write new, unlock, copyout. Normal
sysctl's that just copyout the old value and set a new value
that they copyin may still be able to follow the old model.
However, it may be cleaner to use the second model for all of
the sysctl handlers to avoid lock operations.
- To allow for the common case, a sysctl could embed a
pointer to a mutex in the SYSCTL_FOO macros and in the struct.
This would work for most sysctl's. For values protected by sx
locks, spin mutexes, or other locking strategies besides a
single sleep mutex, SYSCTL_PROC nodes could be used to get the
locking right.
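As a sketch of that suggestion, a handler for a value guarded by
an sx lock might look like this (foo_lock and foo_count are
hypothetical):

	static int
	sysctl_foo_count(SYSCTL_HANDLER_ARGS)
	{
		int value;

		sx_slock(&foo_lock);	/* read a consistent snapshot */
		value = foo_count;
		sx_sunlock(&foo_lock);
		return (sysctl_handle_int(oidp, &value, 0, req));
	}

	SYSCTL_PROC(_kern, OID_AUTO, foo_count,
	    CTLTYPE_INT | CTLFLAG_RD, NULL, 0,
	    sysctl_foo_count, "I", "current foo count");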
Taskqueue
The taskqueue interface has two basic locks associated
with it in order to protect the related shared data. The
taskqueue_queues_mutex is meant to serve as a
lock to protect the taskqueue_queues TAILQ.
The other mutex lock associated with this system is the one in the
struct taskqueue data structure. The
use of the synchronization primitive here is to protect the
integrity of the data in the struct
taskqueue. It should be noted that there are no
separate macros to assist the user in locking down his/her own work
since these locks are most likely not going to be used outside of
kern/subr_taskqueue.c.
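Typical consumer usage looks like the following (my_task_fn and
sc are illustrative; taskqueue_swi is one of the standard global
queues):

	static void
	my_task_fn(void *context, int pending)
	{
		/* pending counts enqueues coalesced since the last run */
	}

	static struct task my_task;

	TASK_INIT(&my_task, 0, my_task_fn, sc);
	taskqueue_enqueue(taskqueue_swi, &my_task);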
Implementation Notes
Details of the Mutex Implementation
- Should we require mutexes to be owned for mtx_destroy()
since we can not safely assert that they are unowned by anyone
else otherwise?
Spin Mutexes
- Use a critical section...
Sleep Mutexes
- Describe the races with contested mutexes
- Why it is safe to read mtx_lock of a contested mutex
when holding sched_lock.
- Priority propagation
Witness
- What does it do
- How does it work
Miscellaneous Topics
Interrupt Source and ICU Abstractions
- struct isrc
- pic drivers
Other Random Questions/Topics
Should we pass an interlock into
sema_wait?
- Generic turnstiles for sleep mutexes and sx locks.
- Should we have non-sleepable sx locks?
Glossary
atomic
An operation is atomic if all of its effects are visible
to other CPUs together when the proper access protocol is
followed. The degenerate case is an atomic instruction
provided directly by the machine architecture. At a higher
level, if several members of a structure are protected by a
lock, then a set of operations are atomic if they are all
performed while holding the lock without releasing the lock
in between any of the operations.
operation
block
A thread is blocked when it is waiting on a lock,
resource, or condition. Unfortunately this term is a bit
overloaded as a result.
sleep
critical section
A section of code that is not allowed to be preempted.
A critical section is entered and exited using the
&man.critical.enter.9; API.
MD
Machine dependent.
MI
memory operation
A memory operation reads and/or writes to a memory
location.
MI
Machine independent.
MD
operation
memory operation
primary interrupt context
Primary interrupt context refers to the code that runs
when an interrupt occurs. This code can either run an
interrupt handler directly or schedule an asynchronous
interrupt thread to execute the interrupt handlers for a
given interrupt source.
realtime kernel thread
A high priority kernel thread. Currently, the only
realtime priority kernel threads are interrupt threads.
thread
sleep
A thread is asleep when it is blocked on a condition
variable or a sleep queue via msleep or
tsleep.
block
sleepable lock
A sleepable lock is a lock that can be held by a thread
which is asleep. Lockmgr locks and sx locks are currently
the only sleepable locks in FreeBSD. Eventually, some sx
locks such as the allproc and proctree locks may become
non-sleepable locks.
sleep
thread
A kernel thread represented by a struct thread. Threads own
locks and hold a single execution context.
diff --git a/en_US.ISO8859-1/articles/vm-design/article.sgml b/en_US.ISO8859-1/articles/vm-design/article.sgml
index f4f2487d4b..bbe6da9094 100644
--- a/en_US.ISO8859-1/articles/vm-design/article.sgml
+++ b/en_US.ISO8859-1/articles/vm-design/article.sgml
@@ -1,838 +1,841 @@
%man;
+
+%freebsd;
+
]>
Design elements of the FreeBSD VM system
Matthew
Dillon
dillon@apollo.backplane.com
The title is really just a fancy way of saying that I am going to
attempt to describe the whole VM enchilada, hopefully in a way that
everyone can follow. For the last year I have concentrated on a number
of major kernel subsystems within FreeBSD, with the VM and Swap
subsystems being the most interesting and NFS being a necessary
chore. I rewrote only small portions of the code. In the VM
arena the only major rewrite I have done is to the swap subsystem.
Most of my work was cleanup and maintenance, with only moderate code
rewriting and no major algorithmic adjustments within the VM
subsystem. The bulk of the VM subsystem's theoretical base remains
unchanged and a lot of the credit for the modernization effort in the
last few years belongs to John Dyson and David Greenman. Not being a
historian like Kirk, I will not attempt to tag all the various features
with people's names, since I will invariably get it wrong.
This article was originally published in the January 2000 issue of
DaemonNews. This
version of the article may include updates from Matt and other authors
to reflect changes in FreeBSD's VM implementation.
Introduction
Before moving along to the actual design let's spend a little time
on the necessity of maintaining and modernizing any long-living
codebase. In the programming world, algorithms tend to be more
important than code and it is precisely due to BSD's academic roots that
a great deal of attention was paid to algorithm design from the
beginning. More attention paid to the design generally leads to a clean
and flexible codebase that can be fairly easily modified, extended, or
replaced over time. While BSD is considered an old
operating system by some people, those of us who work on it tend to view
it more as a mature
codebase which has various components
modified, extended, or replaced with modern code. It has evolved, and
FreeBSD is at the bleeding edge no matter how old some of the code might
be. This is an important distinction to make and one that is
unfortunately lost to many people. The biggest error a programmer can
make is to not learn from history, and this is precisely the error that
many other modern operating systems have made. NT is the best example
of this, and the consequences have been dire. Linux also makes this
mistake to some degree—enough that we BSD folk can make small
jokes about it every once in a while, anyway. Linux's problem is simply
one of a lack of experience and history to compare ideas against, a
problem that is easily and rapidly being addressed by the Linux
community in the same way it has been addressed in the BSD
community—by continuous code development. The NT folk, on the
- other hand, repeatedly make the same mistakes solved by Unix decades ago
+ other hand, repeatedly make the same mistakes solved by &unix; decades ago
and then spend years fixing them. Over and over again. They have a
severe case of not designed here and we are always
right because our marketing department says so. I have little
tolerance for anyone who cannot learn from history.
Much of the apparent complexity of the FreeBSD design, especially in
the VM/Swap subsystem, is a direct result of having to solve serious
performance issues that occur under various conditions. These issues
are not due to bad algorithmic design but instead rise from
environmental factors. In any direct comparison between platforms,
these issues become most apparent when system resources begin to get
stressed. As I describe FreeBSD's VM/Swap subsystem the reader should
always keep two points in mind. First, the most important aspect of
performance design is what is known as Optimizing the Critical
Path. It is often the case that performance optimizations add a
little bloat to the code in order to make the critical path perform
better. Second, a solid, generalized design outperforms a
heavily-optimized design over the long run. While a generalized design
may end up being slower than a heavily-optimized design when they are
first implemented, the generalized design tends to be easier to adapt to
changing conditions and the heavily-optimized design winds up having to
be thrown away. Any codebase that will survive and be maintainable for
years must therefore be designed properly from the beginning even if it
costs some performance. Twenty years ago people were still arguing that
programming in assembly was better than programming in a high-level
language because it produced code that was ten times as fast. Today,
the fallibility of that argument is obvious—as are the parallels
to algorithmic design and code generalization.
VM Objects
The best way to begin describing the FreeBSD VM system is to look at
it from the perspective of a user-level process. Each user process sees
a single, private, contiguous VM address space containing several types
of memory objects. These objects have various characteristics. Program
code and program data are effectively a single memory-mapped file (the
binary file being run), but program code is read-only while program data
is copy-on-write. Program BSS is just memory allocated and filled with
zeros on demand, called demand zero page fill. Arbitrary files can be
memory-mapped into the address space as well, which is how the shared
library mechanism works. Such mappings can require modifications to
remain private to the process making them. The fork system call adds an
entirely new dimension to the VM management problem on top of the
complexity already given.
A program binary data page (which is a basic copy-on-write page)
illustrates the complexity. A program binary contains a preinitialized
data section which is initially mapped directly from the program file.
When a program is loaded into a process's VM space, this area is
initially memory-mapped and backed by the program binary itself,
allowing the VM system to free/reuse the page and later load it back in
from the binary. The moment a process modifies this data, however, the
VM system must make a private copy of the page for that process. Since
the private copy has been modified, the VM system may no longer free it,
because there is no longer any way to restore it later on.
You will notice immediately that what was originally a simple file
mapping has become much more complex. Data may be modified on a
page-by-page basis whereas the file mapping encompasses many pages at
once. The complexity further increases when a process forks. When a
process forks, the result is two processes—each with their own
private address spaces, including any modifications made by the original
process prior to the call to fork(). It would be
silly for the VM system to make a complete copy of the data at the time
of the fork() because it is quite possible that at
least one of the two processes will only need to read from that page
from then on, allowing the original page to continue to be used. What
was a private page is made copy-on-write again, since each process
(parent and child) expects their own personal post-fork modifications to
remain private to themselves and not affect the other.
FreeBSD manages all of this with a layered VM Object model. The
original binary program file winds up being the lowest VM Object layer.
A copy-on-write layer is pushed on top of that to hold those pages which
had to be copied from the original file. If the program modifies a data
page belonging to the original file the VM system takes a fault and
makes a copy of the page in the higher layer. When a process forks,
additional VM Object layers are pushed on. This might make a little
more sense with a fairly basic example. A fork()
is a common operation for any *BSD system, so this example will consider
a program that starts up, and forks. When the process starts, the VM
system creates an object layer, let's call this A:
+---------------+
| A |
+---------------+
A represents the file—pages may be paged in and out of the
file's physical media as necessary. Paging in from the disk is
reasonable for a program, but we really do not want to page back out and
overwrite the executable. The VM system therefore creates a second
layer, B, that will be physically backed by swap space:
+---------------+
| B |
+---------------+
| A |
+---------------+
On the first write to a page after this, a new page is created in B,
and its contents are initialized from A. All pages in B can be paged in
or out to a swap device. When the program forks, the VM system creates
two new object layers—C1 for the parent, and C2 for the
child—that rest on top of B:
+-------+-------+
| C1 | C2 |
+-------+-------+
| B |
+---------------+
| A |
+---------------+
In this case, let's say a page in B is modified by the original
parent process. The process will take a copy-on-write fault and
duplicate the page in C1, leaving the original page in B untouched.
Now, let's say the same page in B is modified by the child process. The
process will take a copy-on-write fault and duplicate the page in C2.
The original page in B is now completely hidden since both C1 and C2
have a copy, and B could theoretically be destroyed (if it does not
represent a real file). However, this sort of optimization is not
trivial to make because it is so fine-grained. FreeBSD does not make
this optimization. Now, suppose (as is often the case) that the child
process does an exec(). Its current address space
is usually replaced by a new address space representing a new file. In
this case, the C2 layer is destroyed:
+-------+
| C1 |
+-------+-------+
| B |
+---------------+
| A |
+---------------+
In this case, the number of children of B drops to one, and all
accesses to B now go through C1. This means that B and C1 can be
collapsed together. Any pages in B that also exist in C1 are deleted
from B during the collapse. Thus, even though the optimization in the
previous step could not be made, we can recover the dead pages when
either of the processes exit or exec().
This model creates a number of potential problems. The first is that
you can wind up with a relatively deep stack of layered VM Objects which
can cost scanning time and memory when you take a fault. Deep
layering can occur when processes fork and then fork again (either
parent or child). The second problem is that you can wind up with dead,
inaccessible pages deep in the stack of VM Objects. In our last example
if both the parent and child processes modify the same page, they both
get their own private copies of the page and the original page in B is
no longer accessible by anyone. That page in B can be freed.
FreeBSD solves the deep layering problem with a special optimization
called the All Shadowed Case. This case occurs if either
C1 or C2 take sufficient COW faults to completely shadow all pages in B.
Let's say that C1 achieves this. C1 can now bypass B entirely, so rather
than have C1->B->A and C2->B->A we now have C1->A and C2->B->A. But
look what also happened—now B has only one reference (C2), so we
can collapse B and C2 together. The end result is that B is deleted
entirely and we have C1->A and C2->A. It is often the case that B will
contain a large number of pages and neither C1 nor C2 will be able to
completely overshadow it. If we fork again and create a set of D
layers, however, it is much more likely that one of the D layers will
eventually be able to completely overshadow the much smaller dataset
represented by C1 or C2. The same optimization will work at any point in
the graph and the grand result of this is that even on a heavily forked
machine VM Object stacks tend to not get much deeper than 4. This is
true of both the parent and the children and true whether the parent is
doing the forking or whether the children cascade forks.
The dead page problem still exists in the case where C1 or C2 do not
completely overshadow B. Due to our other optimizations this case does
not represent much of a problem and we simply allow the pages to be
dead. If the system runs low on memory it will swap them out, eating a
little swap, but that is it.
The advantage to the VM Object model is that
fork() is extremely fast, since no real data
copying need take place. The disadvantage is that you can build a
relatively complex VM Object layering that slows page fault handling
down a little, and you spend memory managing the VM Object structures.
The optimizations FreeBSD makes prove to reduce the problems enough
that they can be ignored, leaving no real disadvantage.
SWAP Layers
Private data pages are initially either copy-on-write or zero-fill
pages. When a change, and therefore a copy, is made, the original
backing object (usually a file) can no longer be used to save a copy of
the page when the VM system needs to reuse it for other purposes. This
is where SWAP comes in. SWAP is allocated to create backing store for
memory that does not otherwise have it. FreeBSD allocates the swap
management structure for a VM Object only when it is actually needed.
However, the swap management structure has had problems
historically.
Under FreeBSD 3.X the swap management structure preallocates an
array that encompasses the entire object requiring swap backing
store—even if only a few pages of that object are swap-backed.
This creates a kernel memory fragmentation problem when large objects
are mapped, or processes with large runsizes (RSS) fork. Also, in order
to keep track of swap space, a list of holes is kept in
kernel memory, and this tends to get severely fragmented as well. Since
the list of holes is a linear list, the swap allocation and freeing
performance is a non-optimal O(n)-per-page. It also requires kernel
memory allocations to take place during the swap freeing process, and
that creates low memory deadlock problems. The problem is further
exacerbated by holes created due to the interleaving algorithm. Also,
the swap block map can become fragmented fairly easily resulting in
non-contiguous allocations. Kernel memory must also be allocated on the
fly for additional swap management structures when a swapout occurs. It
is evident that there was plenty of room for improvement.
For FreeBSD 4.X, I completely rewrote the swap subsystem. With this
rewrite, swap management structures are allocated through a hash table
rather than a linear array, giving them a fixed allocation size and much
finer granularity. Rather than using a linearly linked list to keep
track of swap space reservations, it now uses a bitmap of swap blocks
arranged in a radix tree structure with free-space hinting in the radix
node structures. This effectively makes swap allocation and freeing an
O(1) operation. The entire radix tree bitmap is also preallocated in
order to avoid having to allocate kernel memory during critical low
memory swapping operations. After all, the system tends to swap when it
is low on memory so we should avoid allocating kernel memory at such
times in order to avoid potential deadlocks. Finally, to reduce
fragmentation the radix tree is capable of allocating large contiguous
chunks at once, skipping over smaller fragmented chunks. I did not take
the final step of having an allocating hint pointer
that would trundle
through a portion of swap as allocations were made in order to further
guarantee contiguous allocations or at least locality of reference, but
I ensured that such an addition could be made.
When to free a page
Since the VM system uses all available memory for disk caching,
there are usually very few truly-free pages. The VM system depends on
being able to properly choose pages which are not in use to reuse for
new allocations. Selecting the optimal pages to free is possibly the
single most important function any VM system can perform, because if it
makes a poor selection, the VM system may be forced to unnecessarily
retrieve pages from disk, seriously degrading system performance.
How much overhead are we willing to suffer in the critical path to
avoid freeing the wrong page? Each wrong choice we make will cost us
hundreds of thousands of CPU cycles and a noticeable stall of the
affected processes, so we are willing to endure a significant amount of
overhead in order to be sure that the right page is chosen. This is why
FreeBSD tends to outperform other systems when memory resources become
stressed.
The free page determination algorithm is built upon a history of the
use of memory pages. To acquire this history, the system takes advantage
of a page-used bit feature that most hardware page tables have.
In any case, the page-used bit is cleared and at some later point
the VM system comes across the page again and sees that the page-used
bit has been set. This indicates that the page is still being actively
used. If the bit is still clear it is an indication that the page is not
being actively used. By testing this bit periodically, a use history (in
the form of a counter) for the physical page is developed. When the VM
system later needs to free up some pages, checking this history becomes
the cornerstone of determining the best candidate page to reuse.
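A simplified sketch of the history update during a queue scan (the
real vm_pageout code is more involved; pmap_ts_referenced() and
the act_count field are real interfaces of this era):

	if (pmap_ts_referenced(m)) {
		/* Page-used bit was set: clear it and strengthen
		 * the page's use history. */
		if (m->act_count < ACT_MAX)
			m->act_count += ACT_ADVANCE;
	} else {
		/* Not referenced since the last scan: decay. */
		if (m->act_count > 0)
			m->act_count -= ACT_DECLINE;
	}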
What if the hardware has no page-used bit?
For those platforms that do not have this feature, the system
actually emulates a page-used bit. It unmaps or protects a page,
forcing a page fault if the page is accessed again. When the page
fault is taken, the system simply marks the page as having been used
and unprotects the page so that it may be used. While taking such page
faults just to determine if a page is being used appears to be an
expensive proposition, it is much less expensive than reusing the page
for some other purpose only to find that a process needs it back and
then have to go to disk.
FreeBSD makes use of several page queues to further refine the
selection of pages to reuse as well as to determine when dirty pages
must be flushed to their backing store. Since page tables are dynamic
entities under FreeBSD, it costs virtually nothing to unmap a page from
the address space of any processes using it. When a page candidate has
been chosen based on the page-use counter, this is precisely what is
done. The system must make a distinction between clean pages which can
theoretically be freed up at any time, and dirty pages which must first
be written to their backing store before being reusable. When a page
candidate has been found it is moved to the inactive queue if it is
dirty, or the cache queue if it is clean. A separate algorithm based on
the dirty-to-clean page ratio determines when dirty pages in the
inactive queue must be flushed to disk. Once this is accomplished, the
flushed pages are moved from the inactive queue to the cache queue. At
this point, pages in the cache queue can still be reactivated by a VM
fault at relatively low cost. However, pages in the cache queue are
considered to be immediately freeable
and will be reused
in an LRU (least-recently used) fashion when the system needs to
allocate new memory.
It is important to note that the FreeBSD VM system attempts to
separate clean and dirty pages for the express reason of avoiding
unnecessary flushes of dirty pages (which eat I/O bandwidth), and it
does not move pages between the various page queues gratuitously when the
memory subsystem is not being stressed. This is why you will see some
systems with very low cache queue counts and high active queue counts
when doing a systat -vm command. As the VM system
becomes more stressed, it makes a greater effort to maintain the various
page queues at the levels determined to be the most effective. An urban
myth has circulated for years that Linux did a better job avoiding
swapouts than FreeBSD, but this in fact is not true. What was actually
occurring was that FreeBSD was proactively paging out unused pages in
order to make room for more disk cache while Linux was keeping unused
pages in core and leaving less memory available for cache and process
pages. I do not know whether this is still true today.
Pre-Faulting and Zeroing Optimizations
Taking a VM fault is not expensive if the underlying page is already
in core and can simply be mapped into the process, but it can become
expensive if you take a whole lot of them on a regular basis. A good
example of this is running a program such as &man.ls.1; or &man.ps.1;
over and over again. If the program binary is mapped into memory but
not mapped into the page table, then all the pages that will be accessed
by the program will have to be faulted in every time the program is run.
This is unnecessary when the pages in question are already in the VM
Cache, so FreeBSD will attempt to pre-populate a process's page tables
with those pages that are already in the VM Cache. One thing that
FreeBSD does not yet do is pre-copy-on-write certain pages on exec. For
example, if you run the &man.ls.1; program while running vmstat
1 you will notice that it always takes a certain number of
page faults, even when you run it over and over again. These are
zero-fill faults, not program code faults (which were pre-faulted in
already). Pre-copying pages on exec or fork is an area that could use
more study.
A large percentage of page faults that occur are zero-fill faults.
You can usually see this by observing the vmstat -s
output. These occur when a process accesses pages in its BSS area. The
BSS area is expected to be initially zero but the VM system does not
bother to allocate any memory at all until the process actually accesses
it. When a fault occurs the VM system must not only allocate a new page,
it must zero it as well. To optimize the zeroing operation the VM system
has the ability to pre-zero pages and mark them as such, and to request
pre-zeroed pages when zero-fill faults occur. The pre-zeroing occurs
whenever the CPU is idle but the number of pages the system pre-zeros is
limited in order to avoid blowing away the memory caches. This is an
excellent example of adding complexity to the VM system in order to
optimize the critical path.
Page Table Optimizations
The page table optimizations make up the most contentious part of
the FreeBSD VM design and they have shown some strain with the advent of
serious use of mmap(). I think this is actually a
feature of most BSDs though I am not sure when it was first introduced.
There are two major optimizations. The first is that hardware page
tables do not contain persistent state but instead can be thrown away at
any time with only a minor amount of management overhead. The second is
that every active page table entry in the system has a governing
pv_entry structure which is tied into the
vm_page structure. FreeBSD can simply iterate
through those mappings that are known to exist while Linux must check
all page tables that might contain a specific
mapping to see if it does, which can achieve O(n^2) overhead in certain
situations. It is because of this that FreeBSD tends to make better
choices on which pages to reuse or swap when memory is stressed, giving
it better performance under load. However, FreeBSD requires kernel
tuning to accommodate large-shared-address-space situations such as
those that can occur in a news system because it may run out of
pv_entry structures.
Both Linux and FreeBSD need work in this area. FreeBSD is trying to
maximize the advantage of a potentially sparse active-mapping model (not
all processes need to map all pages of a shared library, for example),
whereas Linux is trying to simplify its algorithms. FreeBSD generally
has the performance advantage here at the cost of wasting a little extra
memory, but FreeBSD breaks down in the case where a large file is
massively shared across hundreds of processes. Linux, on the other hand,
breaks down in the case where many processes are sparsely-mapping the
same shared library and also runs non-optimally when trying to determine
whether a page can be reused or not.
Page Coloring
We will end with the page coloring optimizations. Page coloring is a
performance optimization designed to ensure that accesses to contiguous
pages in virtual memory make the best use of the processor cache. In
ancient times (i.e. 10+ years ago) processor caches tended to map
virtual memory rather than physical memory. This led to a huge number of
problems including having to clear the cache on every context switch in
some cases, and problems with data aliasing in the cache. Modern
processor caches map physical memory precisely to solve those problems.
This means that two side-by-side pages in a process's address space may
not correspond to two side-by-side pages in the cache. In fact, if you
are not careful, side-by-side pages in virtual memory could wind up using
the same page in the processor cache—leading to cacheable data
being thrown away prematurely and reducing CPU performance. This is true
even with multi-way set-associative caches (though the effect is
mitigated somewhat).
FreeBSD's memory allocation code implements page coloring
optimizations, which means that the memory allocation code will attempt
to locate free pages that are contiguous from the point of view of the
cache. For example, if page 16 of physical memory is assigned to page 0
of a process's virtual memory and the cache can hold 4 pages, the page
coloring code will not assign page 20 of physical memory to page 1 of a
process's virtual memory. It would, instead, assign page 21 of physical
memory. The page coloring code attempts to avoid assigning page 20
because this maps over the same cache memory as page 16 and would result
in non-optimal caching. This code adds a significant amount of
complexity to the VM memory allocation subsystem as you can well
imagine, but the result is well worth the effort. Page coloring makes VM
memory as deterministic as physical memory in regards to cache
performance.
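The arithmetic behind the example above is simple modular math. The following toy C fragment assumes, as the example does, a direct-mapped cache that holds exactly 4 pages; the real allocator derives the number of colors from the actual cache geometry.

#define CACHE_PAGES 4   /* cache holds 4 pages, so there are 4 colors */

/* The cache color of a physical page number. */
static int
page_color(unsigned long physpage)
{
    return ((int)(physpage % CACHE_PAGES));
}

/*
 * Does physical page `phys` have the right color for virtual page
 * `virt`?  Page 16 is color 0, so virtual page 1 wants color 1:
 * physical page 21 qualifies, physical page 20 does not.
 */
static int
color_ok(unsigned long virt, unsigned long phys)
{
    return (page_color(phys) == (int)(virt % CACHE_PAGES));
}

With these definitions, color_ok(1, 21) is true and color_ok(1, 20) is false, matching the example.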
Conclusion
Virtual memory in modern operating systems must address a number of
different issues efficiently and for many different usage patterns. The
modular and algorithmic approach that BSD has historically taken allows
us to study and understand the current implementation as well as
relatively cleanly replace large sections of the code. There have been a
number of improvements to the FreeBSD VM system in the last several
years, and work is ongoing.
Bonus QA session by Allen Briggs
briggs@ninthwonder.com
What is the interleaving algorithm
that you
refer to in your listing of the ills of the FreeBSD 3.X swap
arrangements?
FreeBSD uses a fixed swap interleave which defaults to 4. This
means that FreeBSD reserves space for four swap areas even if you
only have one, two, or three. Since swap is interleaved the linear
address space representing the four swap areas
will be
fragmented if you do not actually have four swap areas. For
example, if you have two swap areas, A and B, FreeBSD's address
space representation for the swap space will be interleaved in
blocks of 16 pages:
A B C D A B C D A B C D A B C D
FreeBSD 3.X uses a sequential list of free
regions
approach to accounting for the free swap areas.
The idea is that large blocks of free linear space can be
represented with a single list node
(kern/subr_rlist.c). But due to the
fragmentation the sequential list winds up being insanely
fragmented. In the above example, completely unused swap will
have A and B shown as free and C and D shown as all allocated.
Each A-B sequence requires its own list node because C and D are
holes, so the node cannot be combined with the next A-B
sequence.
Why do we interleave our swap space instead of just tacking swap
areas onto the end and do something fancier? Because it is a whole
lot easier to allocate linear swaths of an address space and have
the result automatically be interleaved across multiple disks than
it is to try to put that sophistication elsewhere.
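The appeal of this approach is that the mapping from a linear swap block to a device is pure arithmetic. The following C fragment sketches that mapping under the constants used in the example (4 areas, 16-page stripes); the names are illustrative rather than taken from the kernel.

#define NSWDEV       4          /* fixed interleave factor */
#define STRIPE_PAGES 16         /* pages per stripe */

/* Which swap area does linear swap page `blk` live on? */
static int
swap_device(unsigned long blk)
{
    return ((int)((blk / STRIPE_PAGES) % NSWDEV));
}

/* Offset of that page within its swap area. */
static unsigned long
swap_offset(unsigned long blk)
{
    unsigned long stripe = blk / (STRIPE_PAGES * NSWDEV);

    return (stripe * STRIPE_PAGES + blk % STRIPE_PAGES);
}

With only two real areas configured, every block that swap_device() maps to areas 2 or 3 (C and D above) is a hole, which is exactly the fragmentation being described.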
The fragmentation causes other problems. Being a linear list
under 3.X, and having such a huge amount of inherent
fragmentation, allocating and freeing swap winds up being an O(N)
algorithm instead of an O(1) algorithm. Combine that with other
factors (heavy swapping) and you start getting into O(N^2) and
O(N^3) levels of overhead, which is bad. The 3.X system may also
need to allocate KVM during a swap operation to create a new list
node which can lead to a deadlock if the system is trying to
pageout pages in a low-memory situation.
Under 4.X we do not use a sequential list. Instead we use a
radix tree and bitmaps of swap blocks rather than ranged list
nodes. We take the hit of preallocating all the bitmaps required
for the entire swap area up front but it winds up wasting less
memory due to the use of a bitmap (one bit per block) instead of a
linked list of nodes. The use of a radix tree instead of a
sequential list gives us nearly O(1) performance no matter how
fragmented the tree becomes.
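To illustrate just the one-bit-per-block part of that design, here is a deliberately simplified C fragment. It uses a linear scan where the real 4.X allocator layers a radix tree over the bitmaps to find free space in near-constant time, and it assumes 32-bit unsigned ints.

#define NBLOCKS 1024
static unsigned int swapmap[NBLOCKS / 32];  /* one bit per swap block */

/* Allocate one swap block; returns its number, or -1 if swap is full. */
static int
swap_alloc_one(void)
{
    for (int b = 0; b < NBLOCKS; b++) {
        unsigned int mask = 1u << (b % 32);

        if ((swapmap[b / 32] & mask) == 0) {
            swapmap[b / 32] |= mask;    /* mark allocated */
            return (b);
        }
    }
    return (-1);
}

static void
swap_free_one(int b)
{
    swapmap[b / 32] &= ~(1u << (b % 32));
}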
I do not get the following:
It is important to note that the FreeBSD VM system attempts
to separate clean and dirty pages for the express reason of
avoiding unnecessary flushes of dirty pages (which eats I/O
bandwidth), nor does it move pages between the various page
queues gratuitously when the memory subsystem is not being
stressed. This is why you will see some systems with very low
cache queue counts and high active queue counts when doing a
systat -vm command.
How is the separation of clean and dirty (inactive) pages
related to the situation where you see low cache queue counts and
high active queue counts in systat -vm? Do the
systat stats roll the active and dirty pages together for the
active queue count?
Yes, that is confusing. The relationship is goal versus
reality. Our goal is to
separate the pages but the reality is that if we are not in a
memory crunch, we do not really have to.
What this means is that FreeBSD will not try very hard to
separate out dirty pages (inactive queue) from clean pages (cache
queue) when the system is not being stressed, nor will it try to
deactivate pages (active queue -> inactive queue) when the system
is not being stressed, even if they are not being used.
In the &man.ls.1; / vmstat 1 example,
would not some of the page faults be data page faults (COW from
executable file to private page)? I.e., I would expect the page
faults to be some zero-fill and some program data. Or are you
implying that FreeBSD does do pre-COW for the program data?
A COW fault can be either zero-fill or program-data. The
mechanism is the same either way because the backing program-data
is almost certainly already in the cache. I am indeed lumping the
two together. FreeBSD does not pre-COW program data or zero-fill,
but it does pre-map pages that exist in its
cache.
In your section on page table optimizations, can you give a
little more detail about pv_entry and
vm_page (or should vm_page be
vm_pmap—as in 4.4, cf. pp. 180-181 of
McKusick, Bostic, Karels, Quarterman)? Specifically, what kind of
operation/reaction would require scanning the mappings?
How does Linux do in the case where FreeBSD breaks down
(sharing a large file mapping over many processes)?
A vm_page represents an (object,index#)
tuple. A pv_entry represents a hardware page
table entry (pte). If you have five processes sharing the same
physical page, and three of those processes' page tables actually
map the page, that page will be represented by a single
vm_page structure and three
pv_entry structures.
pv_entry structures only represent pages
mapped by the MMU (one pv_entry represents one
pte). This means that when we need to remove all hardware
references to a vm_page (in order to reuse the
page for something else, page it out, clear it, dirty it, and so
forth) we can simply scan the linked list of
pv_entry's associated with that
vm_page to remove or modify the pte's from
their page tables.
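Structurally, the relationship looks roughly like the following C sketch. The field names and the pmap_remove_pte() helper are simplified stand-ins for the real kernel structures and pmap interface.

struct pmap;                            /* opaque per-address-space state */
void pmap_remove_pte(struct pmap *, unsigned long); /* hypothetical helper */

struct pv_entry {
    struct pmap     *pv_pmap;           /* page table holding this pte */
    unsigned long    pv_va;             /* virtual address of the mapping */
    struct pv_entry *pv_next;           /* next mapping of the same page */
};

struct vm_page {
    struct pv_entry *pv_list;           /* all known mappings of the page */
};

/* Remove every hardware mapping of a page before reusing it. */
void
remove_all_mappings(struct vm_page *m)
{
    struct pv_entry *pv;

    for (pv = m->pv_list; pv != NULL; pv = pv->pv_next)
        pmap_remove_pte(pv->pv_pmap, pv->pv_va);
    m->pv_list = NULL;
}

In the five-process example above, the page's pv_list would hold exactly three pv_entry structures, one per page table that actually maps it.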
Under Linux there is no such linked list. In order to remove
all the hardware page table mappings for a
vm_page, Linux must index into every VM object
that might have mapped the page. For
example, if you have 50 processes all mapping the same shared
library and want to get rid of page X in that library, you need to
index into the page table for each of those 50 processes even if
only 10 of them have actually mapped the page. So Linux is
trading off the simplicity of its design against performance.
Many VM algorithms which are O(1) or O(small N) under FreeBSD wind
up being O(N), O(N^2), or worse under Linux. Since the pte's
representing a particular page in an object tend to be at the same
offset in all the page tables they are mapped in, reducing the
number of accesses into the page tables at the same pte offset
will often avoid blowing away the L1 cache line for that offset,
which can lead to better performance.
FreeBSD has added complexity (the pv_entry
scheme) in order to increase performance (to limit page table
accesses to only those pte's that need to be
modified).
But FreeBSD has a scaling problem that Linux does not in that
there are a limited number of pv_entry
structures and this causes problems when you have massive sharing
of data. In this case you may run out of
pv_entry structures even though there is plenty
of free memory available. This can be fixed easily enough by
bumping up the number of pv_entry structures in
the kernel config, but we really need to find a better way to do
it.
In regards to the memory overhead of a page table versus the
pv_entry scheme: Linux uses permanent page tables that are not
thrown away, but does not need a pv_entry for each potentially
mapped pte. FreeBSD uses throwaway page tables but adds in a
pv_entry structure for each
actually-mapped pte. I think memory utilization winds up being
about the same, giving FreeBSD an algorithmic advantage with its
ability to throw away page tables at will with very low
overhead.
Finally, in the page coloring section, it might help to have a
little more description of what you mean here. I did not quite
follow it.
Do you know how an L1 hardware memory cache works? I will
explain: Consider a machine with 16MB of main memory but only 128K
of L1 cache. Generally the way this cache works is that each 128K
block of main memory uses the same 128K of
cache. If you access offset 0 in main memory and then offset
128K in main memory, you can wind up throwing away the
cached data you read from offset 0!
Now, I am simplifying things greatly. What I just described
is what is called a direct mapped
hardware memory
cache. Most modern caches are what are called
2-way-set-associative or 4-way-set-associative caches. The
set associativity allows you to access up to N different memory
regions that overlap the same cache memory without destroying the
previously cached data. But only N.
So if I have a 4-way set associative cache I can access offset
0, offset 128K, 256K and offset 384K and still be able to access
offset 0 again and have it come from the L1 cache. If I then
access offset 512K, however, one of the four previously cached
data objects will be thrown away by the cache.
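In C, the direct-mapped behavior being described is just a modulus. This toy fragment uses the 128K cache from the example and an assumed 32-byte line size; set associativity adds N ways per index, but the collision arithmetic is the same.

#define CACHE_SIZE (128 * 1024)
#define LINE_SIZE  32

/* Cache line index a physical address maps to (direct mapped). */
static unsigned long
cache_line(unsigned long addr)
{
    return ((addr % CACHE_SIZE) / LINE_SIZE);
}

/* cache_line(0) == cache_line(128 * 1024): the two accesses collide. */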
It is extremely important…
extremely important for most of a processor's
memory accesses to be able to come from the L1 cache, because the
L1 cache operates at the processor frequency. The moment you have
an L1 cache miss and have to go to the L2 cache or to main memory,
the processor will stall and potentially sit twiddling its fingers
for hundreds of instructions worth of time
waiting for a read from main memory to complete. Main memory (the
dynamic RAM you stuff into a computer) is
slow when compared to the speed of a modern
processor core.
Ok, so now onto page coloring: All modern memory caches are
what are known as physical caches. They
cache physical memory addresses, not virtual memory addresses.
This allows the cache to be left alone across a process context
switch, which is very important.
But in the Unix world you are dealing with virtual address
spaces, not physical address spaces. Any program you write will
see the virtual address space given to it. The actual
physical pages underlying that virtual
address space are not necessarily physically contiguous! In fact,
you might have two pages that are side by side in a process's
address space which wind up being at offset 0 and offset 128K in
physical memory.
A program normally assumes that two side-by-side pages will be
optimally cached. That is, that you can access data objects in
both pages without having them blow away each other's cache entry.
But this is only true if the physical pages underlying the virtual
address space are contiguous (insofar as the cache is
concerned).
This is what page coloring does. Instead of assigning
random physical pages to virtual addresses,
which may result in non-optimal cache performance, page coloring
assigns reasonably-contiguous physical pages
to virtual addresses. Thus programs can be written under the
assumption that the characteristics of the underlying hardware
cache are the same for their virtual address space as they would
be if the program had been run directly in a physical address
space.
Note that I say reasonably contiguous rather than simply
contiguous. From the point of view of a
128K direct mapped cache, the physical address 0 is the same as
the physical address 128K. So two side-by-side pages in your
virtual address space may wind up being offset 128K and offset
132K in physical memory, but could also easily be offset 128K and
offset 4K in physical memory and still retain the same cache
performance characteristics. So page-coloring does
not have to assign truly contiguous pages of
physical memory to contiguous pages of virtual memory, it just
needs to make sure it assigns contiguous pages from the point of
view of cache performance and operation.
diff --git a/en_US.ISO8859-1/articles/zip-drive/article.sgml b/en_US.ISO8859-1/articles/zip-drive/article.sgml
index 9a9871b060..5330a94857 100644
--- a/en_US.ISO8859-1/articles/zip-drive/article.sgml
+++ b/en_US.ISO8859-1/articles/zip-drive/article.sgml
@@ -1,273 +1,276 @@
%man;
+
+%freebsd;
+
]>
ZIP Drives
Jason
Bacon
acadix@execpc.com
ZIP Drive Basics
ZIP disks are high capacity, removable, magnetic disks, which can be
read or written by ZIP drives from IOMEGA corporation. ZIP disks are
similar to floppy disks, except that they are much faster, and have a
much greater capacity. While floppy disks typically hold 1.44
megabytes, ZIP disks are available in two sizes, namely 100 megabytes
and 250 megabytes. ZIP drives should not be confused with the
super-floppy, a 120 megabyte floppy drive which also handles traditional
1.44 megabyte floppies.
IOMEGA also sells a higher capacity, higher performance drive called
the JAZZ drive. JAZZ drives come in 1 gigabyte and 2 gigabyte
sizes.
ZIP drives are available as internal or external units, using one of
three interfaces:
The SCSI (Small Computer System Interface) interface is the
fastest, most sophisticated, most expandable, and most expensive
interface. The SCSI interface is used by all types of computers
from PC's to RISC workstations to minicomputers, to connect all
types of peripherals such as disk drives, tape drives, scanners, and
so on. SCSI ZIP drives may be internal or external, assuming your
host adapter has an external connector.
If you are using an external SCSI device, it is important
never to connect or disconnect it from the SCSI bus while the
computer is running. Doing so may cause file-system damage on the
disks that remain connected.
If you want maximum performance and easy setup, the SCSI
interface is the best choice. This will probably require adding a
SCSI host adapter, since most PC's (except for high-performance
servers) do not have built-in SCSI support. Each SCSI host adapter
can support either 7 or 15 SCSI devices, depending on the
model.
Each SCSI device has its own controller, and these
controllers are fairly intelligent and well
standardized, so from the operating system's
point of view, all SCSI disk drives look about the same, as do all
SCSI tape drives, etc. To support SCSI devices, the operating
system need only have a driver for the particular host adapter, and
a generic driver for each type of device, i.e. a SCSI disk driver,
SCSI tape driver, and so on. There are some SCSI devices that can
be better utilized with specialized drivers (e.g. DAT tape drives),
but they tend to work OK with the generic driver, too. It is just
that the generic drivers may not support some of the special
features.
Using a SCSI zip drive is simply a matter of determining which
device file in the /dev directory represents
the ZIP drive. This can be determined by looking at the boot
messages while FreeBSD is booting (or in
/var/log/messages after booting), where you
will see a line something like this:
da1: <IOMEGA ZIP 100 D.13> Removable Direct Access SCSI-2 Device
This means that the ZIP drive is represented by the file
/dev/da1.
The IDE (Integrated Drive Electronics) interface is a low-cost
disk drive interface used by many desktop PC's. Most IDE devices
are strictly internal.
Performance of IDE ZIP drives is comparable to SCSI ZIP drives.
(The IDE interface is not as fast as SCSI, but ZIP drive
performance is limited mainly by the mechanics of the drive, not by
the bus interface.)
The drawback of the IDE interface is the limitations it imposes.
Most IDE adapters can only support 2 devices, and IDE interfaces are
not typically designed for the long term. For example, the original
IDE interface would not support hard disks with more than 1024
cylinders, which forced a lot of people to upgrade their hardware
prematurely. If you have plans to expand your PC by adding another
disk, a tape drive, or scanner, you may want to invest in a SCSI
host adapter and a SCSI ZIP drive to avoid problems in the
future.
IDE devices in FreeBSD are prefixed with an a.
For example, an IDE hard disk might be
/dev/ad0, an IDE (ATAPI) CDROM might be
/dev/acd1, and so on.
The parallel port interface is popular for portable external
devices such as external ZIP drives and scanners, because virtually
every computer has a standard parallel port (usually used for
printers). This makes things easy for people to transfer data
between multiple computers by toting around their ZIP drive.
Performance will generally be slower than a SCSI or IDE ZIP
drive, since it is limited by the speed of the parallel port.
Parallel port speed varies considerably between various computers,
and can often be configured in the system BIOS. Some machines will
also require BIOS configuration to operate the parallel port in
bidirectional mode. (Parallel ports were originally designed only
for output to printers.)
Parallel ZIP: The vpo Driver
To use a parallel-port ZIP drive under FreeBSD, the
vpo driver must be configured into the kernel.
Parallel port ZIP drives also have a built-in SCSI controller. The vpo
driver allows the FreeBSD kernel to communicate with the ZIP drive's
SCSI controller through the parallel port.
Since the vpo driver is not a standard part of the kernel (as of
FreeBSD 3.2), you will need to rebuild the kernel to enable this device.
The process of building a kernel is outlined in detail in another
section. The following steps outline the process in brief for the
purpose of enabling the vpo driver:
Run /stand/sysinstall, and install the kernel
source code on your system.
Create a custom kernel configuration that includes the
vpo driver:
&prompt.root; cd /sys/i386/conf
&prompt.root; cp GENERIC MYKERNEL
Edit MYKERNEL, change the
ident line to MYKERNEL, and
uncomment the line describing the vpo driver.
If you have a second parallel port, you may need to copy the
section for ppc0 to create a
ppc1 device. The second parallel port usually
uses IRQ 5 and address 378. Only the IRQ is required in the config
file.
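The copied entry might end up looking something like the line below. This is only a guess at the common case: copy the flags from the ppc0 line in your own configuration file and change just the unit number and the IRQ, since the exact syntax differs between FreeBSD versions.
device ppc1 at isa? irq 5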
If your root hard disk is a SCSI disk, you might run into a
problem with probing order, which will cause the system to attempt
to use the ZIP drive as the root device. This will cause a boot
failure, unless you happen to have a FreeBSD root file-system on
your ZIP disk! In this case, you will need to wire down
the root disk, i.e. force the kernel to bind a
specific device to /dev/da0, the root SCSI
disk. It will then assign the ZIP disk to the next available SCSI
disk, e.g. /dev/da1. To wire down your SCSI hard
drive as da0, change the line
device da0
to
disk da0 at scbus0 target 0 unit 0
You may need to change the target above to match the SCSI ID of
your disk drive. You should also wire down the scbus0 entry to your
controller. For example, if you have an Adaptec 15xx controller,
you would change
controller scbus0
to
controller scbus0 at aha0
Finally, since you are creating a custom kernel configuration,
you can take the opportunity to remove all the unnecessary drivers.
This should be done with a great deal of caution, and only if you
feel confident about making modifications to your kernel
configuration. Removing unnecessary drivers will reduce the kernel
size, leaving more memory available for your applications. To
determine which drivers are not needed, go to the end of the file
/var/log/messages, and look for lines reading
"not found". Then, comment out these devices in your config file.
You can also change other options to reduce the size and increase
the speed of your kernel. Read the section on rebuilding your kernel
for more complete information.
Now it is time to compile the kernel:
&prompt.root; /usr/sbin/config MYKERNEL
&prompt.root; cd ../../compile/MYKERNEL
&prompt.root; make clean depend && make all install
After the kernel is rebuilt, you will need to reboot. Make sure the
ZIP drive is connected to the parallel port before the boot begins. You
should see the ZIP drive show up in the boot messages as device vpo0 or
vpo1, depending on which parallel port the drive is attached to. It
should also show which device file the ZIP drive has been bound to. This
will be /dev/da0 if you have no other SCSI disks in
the system, or /dev/da1 if you have a SCSI hard
disk wired down as the root device.
Mounting ZIP disks
To access the ZIP disk, you simply mount it like any other disk
device. The file-system is represented as slice 4 on the device, so for
SCSI or parallel ZIP disks, you would use:
&prompt.root; mount_msdos /dev/da1s4 /mnt
For IDE ZIP drives, use:
&prompt.root; mount_msdos /dev/ad1s4 /mnt
It will also be helpful to update /etc/fstab to
make mounting easier. Add a line like the following, edited to suit your
system:
/dev/da1s4 /zip msdos rw,noauto 0 0
and create the directory /zip.
Then, you can mount simply by typing
&prompt.root; mount /zip
and unmount by typing
&prompt.root; umount /zip
For more information on the format of
/etc/fstab, see &man.fstab.5;.
You can also create a FreeBSD file-system on the ZIP disk using
&man.newfs.8;. However, the disk will only be usable on a FreeBSD
- system, or perhaps a few other Unix clones that recognize FreeBSD
+ system, or perhaps a few other &unix; clones that recognize FreeBSD
file-systems. (Definitely not DOS or Windows.)
diff --git a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
index c6c78e0feb..14c53dc77c 100644
--- a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
+++ b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
@@ -1,957 +1,959 @@
%man;
%authors;
%misc;
+
+%freebsd;
]>
SMPng Design Document
John
Baldwin
Robert
Watson
$FreeBSD$
2002
2003
John Baldwin
Robert Watson
This document presents the current design and implementation of
the SMPng Architecture. First, the basic primitives and tools are
introduced. Next, a general architecture for the FreeBSD kernel's
synchronization and execution model is laid out. Then, locking
strategies for specific subsystems are discussed, documenting the
approaches taken to introduce fine-grained synchronization and
parallelism for each subsystem. Finally, detailed implementation
notes are provided to motivate design choices, and make the reader
aware of important implications involving the use of specific
primitives.
Introduction
This document is a work-in-progress, and will be updated to
reflect on-going design and implementation activities associated
with the SMPng Project. Many sections currently exist only in
outline form, but will be fleshed out as work proceeds. Updates or
suggestions regarding the document may be directed to the document
editors.
The goal of SMPng is to allow concurrency in the kernel.
The kernel is basically one rather large and complex program. To
make the kernel multi-threaded we use some of the same tools used
to make other programs multi-threaded. These include mutexes,
shared/exclusive locks, semaphores, and condition variables. For
the definitions of these and other SMP-related terms, please see
the section of this article.
Basic Tools and Locking Fundamentals
Atomic Instructions and Memory Barriers
There are several existing treatments of memory barriers
and atomic instructions, so this section will not include a
lot of detail. To put it simply, one can not go around reading
variables without a lock if a lock is used to protect writes
to that variable. This becomes obvious when you consider that
memory barriers simply determine relative order of memory
operations; they do not make any guarantee about timing of
memory operations. That is, a memory barrier does not force
the contents of a CPU's local cache or store buffer to flush.
Instead, the memory barrier at lock release simply ensures
that all writes to the protected data will be visible to other
CPU's or devices if the write to release the lock is visible.
The CPU is free to keep that data in its cache or store buffer
as long as it wants. However, if another CPU performs an
atomic instruction on the same datum, the first CPU must
guarantee that the updated value is made visible to the second
CPU along with any other operations that memory barriers may
require.
For example, assuming a simple model where data is
considered visible when it is in main memory (or a global
cache), when an atomic instruction is triggered on one CPU,
other CPU's store buffers and caches must flush any writes to
that same cache line along with any pending operations behind
a memory barrier.
This requires one to take special care when using an item
protected by atomic instructions. For example, in the sleep
mutex implementation, we have to use an
atomic_cmpset rather than an
atomic_set to turn on the
MTX_CONTESTED bit. The reason is that we
read the value of mtx_lock into a
variable and then make a decision based on that read.
However, the value we read may be stale, or it may change
while we are making our decision. Thus, when the
atomic_set executes, it may end up
setting the bit on a different value than the one we made the
decision on. Thus, we have to use an
atomic_cmpset to set the value only if
the value we made the decision on is up-to-date and
valid.
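A condensed sketch of that pattern in C follows. This is not the actual mtx_lock code; the constants are stand-ins for the real mutex flags, and the real routine handles more cases.

#include <sys/types.h>
#include <machine/atomic.h>

#define MTX_UNOWNED_SKETCH   0x4    /* stand-in for MTX_UNOWNED */
#define MTX_CONTESTED_SKETCH 0x2    /* stand-in for MTX_CONTESTED */

static void
set_contested(volatile u_int *lockp)
{
    u_int v;

    do {
        v = *lockp;                 /* read, then decide based on v */
        if (v == MTX_UNOWNED_SKETCH)
            return;                 /* lock was released meanwhile */
        /* the cmpset fails, and we retry, if *lockp no longer equals v */
    } while (atomic_cmpset_int(lockp, v, v | MTX_CONTESTED_SKETCH) == 0);
}

If another CPU changes the lock word between the read and the cmpset, the cmpset fails and the loop re-reads the fresh value before deciding again.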
Finally, atomic instructions only allow one item to be
updated or read. If one needs to atomically update several
items, then a lock must be used instead. For example, if two
counters must be read and have values that are consistent
relative to each other, then those counters must be protected
by a lock rather than by separate atomic instructions.
Read Locks versus Write Locks
Read locks do not need to be as strong as write locks.
Both types of locks need to ensure that the data they are
accessing is not stale. However, only write access requires
exclusive access. Multiple threads can safely read a value.
Using different types of locks for reads and writes can be
implemented in a number of ways.
First, sx locks can be used in this manner by using an
exclusive lock when writing and a shared lock when reading.
This method is quite straightforward.
A second method is a bit more obscure. You can protect a
datum with multiple locks. Then for reading that data you
simply need to have a read lock of one of the locks. However,
to write to the data, you need to have a write lock of all of
the locks. This can make writing rather expensive but can be
useful when data is accessed in various ways. For example,
the parent process pointer is protected by both the
proctree_lock sx lock and the per-process mutex. Sometimes the
proc lock is easier, as we may only be checking the parent of a
process that we already have locked. However, other places such
as inferior need to walk the tree of
processes via parent pointers, and locking each process along
the way would be prohibitive; it would also make it painful to
guarantee that the condition being checked remains valid for
both the check and the actions taken as a result of it.
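The following C sketch shows the shape of the two-lock method with a pair of sx locks protecting a hypothetical pointer; the names are invented and lock initialization is omitted.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/sx.h>

struct proc;

struct sx lock_a, lock_b;
struct proc *parent_ptr;            /* datum protected by both locks */

struct proc *
read_parent(void)
{
    struct proc *p;

    sx_slock(&lock_a);              /* either lock suffices to read */
    p = parent_ptr;
    sx_sunlock(&lock_a);
    return (p);
}

void
write_parent(struct proc *newp)
{
    sx_xlock(&lock_a);              /* writers must hold both locks, */
    sx_xlock(&lock_b);              /* always in the same order */
    parent_ptr = newp;
    sx_xunlock(&lock_b);
    sx_xunlock(&lock_a);
}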
Locking Conditions and Results
If you need a lock to check the state of a variable so
that you can take an action based on the state you read, you
can not just hold the lock while reading the variable and then
drop the lock before you act on the value you read. Once you
drop the lock, the variable can change rendering your decision
invalid. Thus, you must hold the lock both while reading the
variable and while performing the action as a result of the
test.
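As a C sketch, with a hypothetical mutex, flag, and action:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

extern struct mtx foo_mtx;          /* hypothetical lock and datum */
extern int foo_flag;
void do_action(void);               /* hypothetical action */

void
act_on_flag(void)
{
    mtx_lock(&foo_mtx);
    if (foo_flag) {
        /*
         * Dropping the lock here and re-acquiring it to act would
         * let another thread clear foo_flag in the window, so the
         * action must happen while the lock is still held.
         */
        do_action();
    }
    mtx_unlock(&foo_mtx);
}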
General Architecture and Design
Interrupt Handling
- Following the pattern of several other multi-threaded Unix
+ Following the pattern of several other multi-threaded &unix;
kernels, FreeBSD deals with interrupt handlers by giving them
their own thread context. Providing a context for interrupt
handlers allows them to block on locks. To help avoid
latency, however, interrupt threads run at real-time kernel
priority. Thus, interrupt handlers should not execute for very
long to avoid starving other kernel threads. In addition,
since multiple handlers may share an interrupt thread,
interrupt handlers should not sleep or use a sleepable lock to
avoid starving another interrupt handler.
The interrupt threads currently in FreeBSD are referred to
as heavyweight interrupt threads. They are called this
because switching to an interrupt thread involves a full
context switch. In the initial implementation, the kernel was
not preemptive and thus interrupts that interrupted a kernel
thread would have to wait until the kernel thread blocked or
returned to userland before they would have an opportunity to
run.
To deal with the latency problems, the kernel in FreeBSD
has been made preemptive. Currently, we only preempt a kernel
thread when we release a sleep mutex or when an interrupt
comes in. However, the plan is to make the FreeBSD kernel
fully preemptive as described below.
Not all interrupt handlers execute in a thread context.
Instead, some handlers execute directly in primary interrupt
context. These interrupt handlers are currently misnamed
fast
interrupt handlers since the
INTR_FAST flag used in earlier versions
of the kernel is used to mark these handlers. The only
interrupts which currently use these types of interrupt
handlers are clock interrupts and serial I/O device
interrupts. Since these handlers do not have their own
context, they may not acquire blocking locks and thus may only
use spin mutexes.
Finally, there is one optional optimization that can be
added in MD code called lightweight context switches. Since
an interrupt thread executes in a kernel context, it can
borrow the vmspace of any process. Thus, in a lightweight
context switch, the switch to the interrupt thread does not
switch vmspaces but borrows the vmspace of the interrupted
thread. In order to ensure that the vmspace of the
interrupted thread does not disappear out from under us, the
interrupted thread is not allowed to execute until the
interrupt thread is no longer borrowing its vmspace. This can
happen when the interrupt thread either blocks or finishes.
If an interrupt thread blocks, then it will use its own
context when it is made runnable again. Thus, it can release
the interrupted thread.
The downside of this optimization is that it is very
machine specific and complex, and thus only worth the effort if
there is a large performance improvement. At this point it is
probably too early to tell, and in fact the optimization will
probably hurt performance, as almost all interrupt handlers will
immediately block on Giant and require a thread fix-up when they
block.
Also, an alternative method of interrupt handling has been
proposed by Mike Smith that works like so:
Each interrupt handler has two parts: a predicate
which runs in primary interrupt context and a handler
which runs in its own thread context.
If an interrupt handler has a predicate, then when an
interrupt is triggered, the predicate is run. If the
predicate returns true then the interrupt is assumed to be
fully handled and the kernel returns from the interrupt.
If the predicate returns false or there is no predicate,
then the threaded handler is scheduled to run.
Fitting light weight context switches into this scheme
might prove rather complicated. Since we may want to change
to this scheme at some point in the future, it is probably
best to defer work on light weight context switches until we
have settled on the final interrupt handling architecture and
determined how light weight context switches might or might
not fit into it.
Kernel Preemption and Critical Sections
Kernel Preemption in a Nutshell
Kernel preemption is fairly simple. The basic idea is
that a CPU should always be doing the highest priority work
available. Well, that is the ideal at least. There are a
couple of cases where the expense of achieving the ideal is
not worth being perfect.
Implementing full kernel preemption is very
straightforward: when you schedule a thread to be executed
by putting it on a runqueue, you check to see if its
priority is higher than that of the currently executing thread. If
so, you initiate a context switch to that thread.
While locks can protect most data in the case of a
preemption, not all of the kernel is preemption safe. For
example, if a thread holding a spin mutex is preempted and the
new thread attempts to grab the same spin mutex, the new
thread may spin forever as the interrupted thread may never
get a chance to execute. Also, some code such as the code
to assign an address space number for a process during
exec() on the Alpha needs to not be preempted as it supports
the actual context switch code. Preemption is disabled for
these code sections by using a critical section.
Critical Sections
The responsibility of the critical section API is to
prevent context switches inside of a critical section. With
a fully preemptive kernel, every
setrunqueue of a thread other than the
current thread is a preemption point. One implementation is
for critical_enter to set a per-thread
flag that is cleared by its counterpart. If
setrunqueue is called with this flag
set, it does not preempt regardless of the priority of the new
thread relative to the current thread. However, since
critical sections are used in spin mutexes to prevent
context switches and multiple spin mutexes can be acquired,
the critical section API must support nesting. For this
reason the current implementation uses a nesting count
instead of a single per-thread flag.
In order to minimize latency, preemptions inside of a
critical section are deferred rather than dropped. If a
thread is made runnable that would normally be preempted to
outside of a critical section, then a per-thread flag is set
to indicate that there is a pending preemption. When the
outermost critical section is exited, the flag is checked.
If the flag is set, then the current thread is preempted to
allow the higher priority thread to run.
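In pseudo-C, the nesting count and the deferred-preemption flag described above combine roughly as follows. The two variables shown as globals would live in struct thread, one pair per thread, and preempt() is a hypothetical stand-in for the context switch.

static int crit_nest;           /* per-thread nesting count */
static int owepreempt;          /* per-thread deferred-preemption flag */

static void preempt(void);      /* hypothetical: switch threads now */

void
critical_enter(void)
{
    crit_nest++;
}

void
critical_exit(void)
{
    if (--crit_nest == 0 && owepreempt) {
        owepreempt = 0;
        preempt();              /* run the deferred preemption */
    }
}

/* Called when a higher priority thread is made runnable. */
void
maybe_preempt(void)
{
    if (crit_nest > 0)
        owepreempt = 1;         /* defer until the outermost exit */
    else
        preempt();
}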
Interrupts pose a problem with regards to spin mutexes.
If a low-level interrupt handler needs a lock, it needs to
not interrupt any code needing that lock to avoid possible
data structure corruption. Currently, providing this
mechanism is piggybacked onto critical section API by means
of the cpu_critical_enter and
cpu_critical_exit functions. Currently
this API disables and re-enables interrupts on all of
FreeBSD's current platforms. This approach may not be
purely optimal, but it is simple to understand and simple to
get right. Theoretically, this second API need only be used
for spin mutexes that are used in primary interrupt context.
However, to make the code simpler, it is used for all spin
mutexes and even all critical sections. It may be desirable
to split out the MD API from the MI API and only use it in
conjunction with the MI API in the spin mutex
implementation. If this approach is taken, then the MD API
likely would need a rename to show that it is a separate API
now.
Design Tradeoffs
As mentioned earlier, a couple of trade-offs have been
made to sacrifice cases where perfect preemption may not
always provide the best performance.
The first trade-off is that the preemption code does not
take other CPUs into account. Suppose we have two CPUs, A
and B with the priority of A's thread as 4 and the priority
of B's thread as 2. If CPU B makes a thread with priority 1
runnable, then in theory, we want CPU A to switch to the new
thread so that we will be running the two highest priority
runnable threads. However, the cost of determining which
CPU to enforce a preemption on as well as actually signaling
that CPU via an IPI along with the synchronization that
would be required would be enormous. Thus, the current code
would instead force CPU B to switch to the higher priority
thread. Note that this still puts the system in a better
position as CPU B is executing a thread of priority 1 rather
than a thread of priority 2.
The second trade-off limits immediate kernel preemption
to real-time priority kernel threads. In the simple case of
preemption defined above, a thread is always preempted
immediately (or as soon as a critical section is exited) if
a higher priority thread is made runnable. However, many
threads executing in the kernel only execute in a kernel
context for a short time before either blocking or returning
to userland. Thus, if the kernel preempts these threads to
run another non-realtime kernel thread, the kernel may
switch out the executing thread just before it is about to
sleep or execute. The cache on the CPU must then adjust to
the new thread. When the kernel returns to the interrupted
CPU, it must refill all the cache information that was lost.
In addition, two extra context switches are performed that
could be avoided if the kernel deferred the preemption until
the first thread blocked or returned to userland. Thus, by
default, the preemption code will only preempt immediately
if the higher priority thread is a real-time priority
thread.
Turning on full kernel preemption for all kernel threads
has value as a debugging aid since it exposes more race
conditions. It is especially useful on UP systems where many
races are hard to simulate otherwise. Thus, there will be a
kernel option to enable preemption for all kernel threads
that can be used for debugging purposes.
Thread Migration
Simply put, a thread migrates when it moves from one CPU
to another. In a non-preemptive kernel this can only happen
at well-defined points such as when calling
tsleep or returning to userland.
However, in the preemptive kernel, an interrupt can force a
preemption and possible migration at any time. This can have
negative effects on per-CPU data since with the exception of
curthread and curpcb the
data can change whenever you migrate. Since you can
potentially migrate at any time this renders per-CPU data
rather useless. Thus it is desirable to be able to disable
migration for sections of code that need per-CPU data to be
stable.
Critical sections currently prevent migration since they
do not allow context switches. However, this may be too strong
of a requirement to enforce in some cases since a critical
section also effectively blocks interrupt threads on the
current processor. As a result, it may be desirable to
provide an API whereby code may indicate that if the current
thread is preempted it should not migrate to another
CPU.
One possible implementation is to use a per-thread nesting
count td_pinnest along with a
td_pincpu which is updated to the current
CPU on each context switch. Each CPU has its own run queue
that holds threads pinned to that CPU. A thread is pinned
when its nesting count is greater than zero and a thread
starts off unpinned with a nesting count of zero. When a
thread is put on a runqueue, we check to see if it is pinned.
If so, we put it on the per-CPU runqueue, otherwise we put it
on the global runqueue. When
choosethread is called to retrieve the
next thread, it could either always prefer bound threads to
unbound threads or use some sort of bias when comparing
priorities. If the nesting count is only ever written to by
the thread itself and is only read by other threads when the
owning thread is not executing but while holding the
sched_lock, then
td_pinnest will not need any other locks.
The migrate_disable function would
increment the nesting count and
migrate_enable would decrement the
nesting count. Due to the locking requirements specified
above, they will only operate on the current thread and thus
would not need to handle the case of making a thread
migrateable that currently resides on a per-CPU run
queue.
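A sketch of the proposed API in C, using the field names from the text on a reduced struct thread; the rest of the structure and the run queue handling are omitted.

struct thread {
    int td_pinnest;     /* >0 means pinned to td_pincpu */
    int td_pincpu;      /* updated to the current CPU on each switch */
};

extern struct thread *curthread;

void
migrate_disable(void)
{
    curthread->td_pinnest++;    /* only the owning thread writes this */
}

void
migrate_enable(void)
{
    curthread->td_pinnest--;
}

/* Used by setrunqueue to choose the per-CPU or the global run queue. */
int
thread_pinned(struct thread *td)
{
    return (td->td_pinnest > 0);
}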
It is still debatable if this API is needed or if the
critical section API is sufficient by itself. Many of the
places that need to prevent migration also need to prevent
preemption as well, and in those places a critical section
must be used regardless.
Callouts
The timeout() kernel facility permits
kernel services to register functions for execution as part
of the softclock() software interrupt.
Events are scheduled based on a desired number of clock
ticks, and callbacks to the consumer-provided function
will occur at approximately the right time.
The global list of pending timeout events is protected
by a global spin mutex, callout_lock;
all access to the timeout list must be performed with this
mutex held. When softclock() is
woken up, it scans the list of pending timeouts for those
that should fire. In order to avoid lock order reversal,
the softclock thread will release the
callout_lock mutex when invoking the
provided timeout() callback function.
If the CALLOUT_MPSAFE flag was not set
during registration, then Giant will be grabbed before
invoking the callout, and then released afterwards. The
callout_lock mutex will be re-grabbed
before proceeding. The softclock()
code is careful to leave the list in a consistent state
while releasing the mutex. If DIAGNOSTIC
is enabled, then the time taken to execute each function is
measured, and a warning generated if it exceeds a
threshold.
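A skeleton of that locking dance in C might look as follows; the callout structure is reduced to the fields the sketch needs, and list management and the DIAGNOSTIC timing are omitted.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

extern struct mtx callout_lock;     /* the global spin mutex from the text */
extern struct mtx Giant;

struct callout_sketch {
    void   (*c_func)(void *);       /* consumer-provided callback */
    void    *c_arg;
    struct callout_sketch *c_next;
    int      c_mpsafe;              /* was CALLOUT_MPSAFE set? */
};

void
softclock_sketch(struct callout_sketch *expired)
{
    struct callout_sketch *c;

    mtx_lock_spin(&callout_lock);
    while ((c = expired) != NULL) {
        expired = c->c_next;            /* leave the list consistent first */
        mtx_unlock_spin(&callout_lock); /* avoid lock order reversal */
        if (!c->c_mpsafe)
            mtx_lock(&Giant);           /* no CALLOUT_MPSAFE: run under Giant */
        c->c_func(c->c_arg);
        if (!c->c_mpsafe)
            mtx_unlock(&Giant);
        mtx_lock_spin(&callout_lock);   /* re-grab before continuing */
    }
    mtx_unlock_spin(&callout_lock);
}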
Specific Locking Strategies
Credentials
struct ucred is the kernel's
internal credential structure, and is generally used as the
basis for process-driven access control within the kernel.
BSD-derived systems use a copy-on-write
model for credential
data: multiple references may exist for a credential structure,
and when a change needs to be made, the structure is duplicated,
modified, and then the reference replaced. Due to widespread
caching of the credential to implement access control on open,
this results in substantial memory savings. With a move to
fine-grained SMP, this model also saves substantially on
locking operations by requiring that modification only occur
on an unshared credential, avoiding the need for explicit
synchronization when consuming a known-shared
credential.
Credential structures with a single reference are
considered mutable; shared credential structures must not be
modified or a race condition is risked. A mutex,
cr_mtxp protects the reference
count of struct ucred so as to
maintain consistency. Any use of the structure requires a
valid reference for the duration of the use, or the structure
may be released out from under the illegitimate
consumer.
The struct ucred mutex is a leaf
mutex, and for performance reasons, is implemented via a mutex
pool.
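The reference count manipulation described above amounts to the following C sketch; the structure is reduced here to the two relevant fields, and the real crhold()/crfree() routines do more bookkeeping. The free_ucred() helper is hypothetical.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

struct ucred_sketch {
    int         cr_ref;         /* reference count */
    struct mtx *cr_mtxp;        /* pool mutex protecting cr_ref */
};

void free_ucred(struct ucred_sketch *);     /* hypothetical release helper */

struct ucred_sketch *
crhold_sketch(struct ucred_sketch *cr)
{
    mtx_lock(cr->cr_mtxp);
    cr->cr_ref++;
    mtx_unlock(cr->cr_mtxp);
    return (cr);
}

void
crfree_sketch(struct ucred_sketch *cr)
{
    int last;

    mtx_lock(cr->cr_mtxp);
    last = (--cr->cr_ref == 0);
    mtx_unlock(cr->cr_mtxp);
    if (last)
        free_ucred(cr);         /* no references left: safe to release */
}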
Usually, credentials are used in a read-only manner for access
control decisions, and in this case td_ucred
is generally preferred because it requires no locking. When a
process' credential is updated, the proc lock
must be held across the check and update operations to avoid
races. The process credential p_ucred
must be used for check and update operations to prevent
time-of-check, time-of-use races.
If system call invocations will perform access control after
an update to the process credential, the value of
td_ucred must also be refreshed to
the current process value. This will prevent use of a stale
credential following a change. The kernel automatically
refreshes the td_ucred pointer in
the thread structure from the process
p_ucred whenever a process enters
the kernel, permitting use of a fresh credential for kernel
access control.
File Descriptors and File Descriptor Tables
Details to follow.
Jail Structures
struct prison stores
administrative details pertinent to the maintenance of jails
created using the &man.jail.2; API. This includes the
per-jail hostname, IP address, and related settings. This
structure is reference-counted since pointers to instances of
the structure are shared by many credential structures. A
single mutex, pr_mtx protects read
and write access to the reference count and all mutable
variables inside the struct jail. Some variables are set only
when the jail is created, and a valid reference to the
struct prison is sufficient to read
these values. The precise locking of each entry is documented
via comments in sys/jail.h.
MAC Framework
The TrustedBSD MAC Framework maintains data in a variety
of kernel objects, in the form of struct
label. In general, labels in kernel objects
are protected by the same lock as the remainder of the kernel
object. For example, the v_label
label in struct vnode is protected
by the vnode lock on the vnode.
In addition to labels maintained in standard kernel objects,
the MAC Framework also maintains a list of registered and
active policies. The policy list is protected by a global
mutex (mac_policy_list_lock) and a busy
count (also protected by the mutex). Since many access
control checks may occur in parallel, entry to the framework
for a read-only access to the policy list requires holding the
mutex while incrementing (and later decrementing) the busy
count. The mutex need not be held for the duration of the
MAC entry operation, since some operations, such as label operations
on file system objects, are long-lived. To modify the policy
list, such as during policy registration and de-registration,
the mutex must be held and the reference count must be zero,
to prevent modification of the list while it is in use.
A condition variable,
mac_policy_list_not_busy, is available to
threads that need to wait for the list to become unbusy, but
this condition variable must only be waited on if the caller is
holding no other locks, or a lock order violation may be
possible. The busy count, in effect, acts as a form of
shared/exclusive lock over access to the framework: the difference
is that, unlike with an sx lock, consumers waiting for the list
to become unbusy may be starved, rather than permitting lock
order problems with regards to the busy count and other locks
that may be held on entry to (or inside) the MAC Framework.
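In C, the entry/exit protocol described above reduces to roughly the following; the names mirror the text, but the fragment is illustrative and omits the framework's other state.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

extern struct mtx mac_policy_list_lock;
extern struct cv  mac_policy_list_not_busy;
static int mac_policy_list_busy;        /* the busy count from the text */

/* Read-only entry to the framework for an access control check. */
void
mac_framework_enter(void)
{
    mtx_lock(&mac_policy_list_lock);
    mac_policy_list_busy++;
    mtx_unlock(&mac_policy_list_lock);  /* not held during the check itself */
}

void
mac_framework_exit(void)
{
    mtx_lock(&mac_policy_list_lock);
    if (--mac_policy_list_busy == 0)
        cv_signal(&mac_policy_list_not_busy);
    mtx_unlock(&mac_policy_list_lock);
}

/* Exclusive access for policy (de)registration: wait until unbusy. */
void
mac_policy_list_modify_begin(void)
{
    mtx_lock(&mac_policy_list_lock);    /* caller must hold no other locks */
    while (mac_policy_list_busy > 0)
        cv_wait(&mac_policy_list_not_busy, &mac_policy_list_lock);
}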
Modules
For the module subsystem there exists a single lock that is
used to protect the shared data. This lock is a shared/exclusive
(SX) lock and has a good chance of needing to be acquired (shared
or exclusively), therefore a few macros have been
added to make access to the lock easier. These macros can be
located in sys/module.h and are quite basic
in terms of usage. The main structures protected under this lock
are the module_t structures (when shared)
and the global modulelist_t structure,
modules. One should review the related source code in
kern/kern_module.c to further understand the
locking strategy.
Newbus Device Tree
The newbus system will have one sx lock. Readers will
hold a shared (read) lock (&man.sx.slock.9;) and writers will hold
an exclusive (write) lock (&man.sx.xlock.9;). Internal functions
will not do locking at all. Externally visible ones will lock as
needed.
Those items that do not matter if the race is won or lost will
not be locked, since they tend to be read all over the place
(e.g. &man.device.get.softc.9;). There will be relatively few
changes to the newbus data structures, so a single lock should
be sufficient and not impose a performance penalty.
Pipes
...
Processes and Threads
- process hierarchy
- proc locks, references
- thread-specific copies of proc entries to freeze during system
calls, including td_ucred
- inter-process operations
- process groups and sessions
Scheduler
Lots of references to sched_lock and notes
pointing at specific primitives and related magic elsewhere in the
document.
Select and Poll
The select() and poll() functions permit threads to block
waiting on events on file descriptors--most frequently, whether
or not the file descriptors are readable or writable.
...
SIGIO
The SIGIO service permits processes to request the delivery
of a SIGIO signal to its process group when the read/write status
of specified file descriptors changes. At most one process or
process group is permitted to register for SIGIO from any given
kernel object, and that process or group is referred to as
the owner. Each object supporting SIGIO registration contains a
pointer field that is NULL if the object is not registered, or
points to a struct sigio describing
the registration. This field is protected by a global mutex,
sigio_lock. Callers to SIGIO maintenance
functions must pass in this field by reference
so that local
register copies of the field are not made when unprotected by
the lock.
One struct sigio is allocated for
each registered object associated with any process or process
group, and contains back-pointers to the object, owner, signal
information, a credential, and the general disposition of the
registration. Each process or process group contains a list of
registered struct sigio structures,
p_sigiolst for processes, and
pg_sigiolst for process groups.
These lists are protected by the process or process group
locks respectively. Most fields in each struct
sigio are constant for the duration of the
registration, with the exception of the
sio_pgsigio field which links the
struct sigio into the process or
process group list. Developers implementing new kernel
objects supporting SIGIO will, in general, want to avoid
holding structure locks while invoking SIGIO supporting
functions, such as fsetown()
or funsetown() to avoid
defining a lock order between structure locks and the global
SIGIO lock. This is generally possible through use of an
elevated reference count on the structure, such as reliance
on a file descriptor reference to a pipe during a pipe
operation.
Sysctl
The sysctl() MIB service is invoked
from both within the kernel and from userland applications
using a system call. At least two issues are raised in locking:
first, the protection of the structures maintaining the
namespace, and second, interactions with kernel variables and
functions that are accessed by the sysctl interface. Since
sysctl permits the direct export (and modification) of
kernel statistics and configuration parameters, the sysctl
mechanism must become aware of appropriate locking semantics
for those variables. Currently, sysctl makes use of a
single global sx lock to serialize use of sysctl(); however, it
is assumed to operate under Giant and other protections are not
provided. The remainder of this section speculates on locking
and semantic changes to sysctl.
- Need to change the order of operations for sysctl's that
update values from read old, copyin and copyout, write new to
copyin, lock, read old and write new, unlock, copyout (a
hypothetical handler following the new order is sketched after
these notes). Normal sysctl's that just copyout the old value
and set a new value that they copyin may still be able to follow
the old model. However, it may be cleaner to use the second
model for all of the sysctl handlers to avoid lock
operations.
- To allow for the common case, a sysctl could embed a
pointer to a mutex in the SYSCTL_FOO macros and in the struct.
This would work for most sysctl's. For values protected by sx
locks, spin mutexes, or other locking strategies besides a
single sleep mutex, SYSCTL_PROC nodes could be used to get the
locking right.
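A hypothetical handler following the copyin, lock, read old and write new, unlock, copyout order from the first note might look like this in C; every name here is invented for illustration.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>

int
sysctl_handle_locked_int(struct mtx *mp, int *valp,
    const int *user_new, int *user_oldbuf)
{
    int newv, oldv, error;

    error = copyin(user_new, &newv, sizeof(newv));      /* 1. copyin */
    if (error != 0)
        return (error);
    mtx_lock(mp);                                       /* 2. lock */
    oldv = *valp;                                       /* 3. read old... */
    *valp = newv;                                       /*    ...write new */
    mtx_unlock(mp);                                     /* 4. unlock */
    return (copyout(&oldv, user_oldbuf, sizeof(oldv))); /* 5. copyout */
}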
Taskqueue
The taskqueue's interface has two basic locks associated
with it in order to protect the related shared data. The
taskqueue_queues_mutex is meant to serve as a
lock to protect the taskqueue_queues TAILQ.
The other mutex lock associated with this system is the one in the
struct taskqueue data structure. The
use of the synchronization primitive here is to protect the
integrity of the data in the struct
taskqueue. It should be noted that there are no
separate macros to assist the user in locking down his/her own work
since these locks are most likely not going to be used outside of
kern/subr_taskqueue.c.
Implementation Notes
Details of the Mutex Implementation
- Should we require mutexes to be owned for mtx_destroy()
since we can not safely assert that they are unowned by anyone
else otherwise?
Spin Mutexes
- Use a critical section...
Sleep Mutexes
- Describe the races with contested mutexes
- Why it is safe to read mtx_lock of a contested mutex
when holding sched_lock.
- Priority propagation
Witness
- What does it do
- How does it work
Miscellaneous Topics
Interrupt Source and ICU Abstractions
- struct isrc
- pic drivers
Other Random Questions/Topics
Should we pass an interlock into
sema_wait?
- Generic turnstiles for sleep mutexes and sx locks.
- Should we have non-sleepable sx locks?
Glossary
atomic
An operation is atomic if all of its effects are visible
to other CPUs together when the proper access protocol is
followed. The degenerate case is an atomic instruction
provided directly by the machine architecture. At a higher
level, if several members of a structure are protected by a
lock, then a set of operations are atomic if they are all
performed while holding the lock without releasing the lock
in between any of the operations.
operation
block
A thread is blocked when it is waiting on a lock,
resource, or condition. Unfortunately this term is a bit
overloaded as a result.
sleep
critical section
A section of code that is not allowed to be preempted.
A critical section is entered and exited using the
&man.critical.enter.9; API.
MD
Machine dependent.
MI
memory operation
A memory operation reads and/or writes to a memory
location.
MI
Machine independent.
MD
operation
memory operation
primary interrupt context
Primary interrupt context refers to the code that runs
when an interrupt occurs. This code can either run an
interrupt handler directly or schedule an asynchronous
interrupt thread to execute the interrupt handlers for a
given interrupt source.
realtime kernel thread
A high priority kernel thread. Currently, the only
realtime priority kernel threads are interrupt threads.
thread
sleep
A thread is asleep when it is blocked on a condition
variable or a sleep queue via msleep or
tsleep.
block
sleepable lock
A sleepable lock is a lock that can be held by a thread
which is asleep. Lockmgr locks and sx locks are currently
the only sleepable locks in FreeBSD. Eventually, some sx
locks such as the allproc and proctree locks may become
non-sleepable locks.
sleep
thread
A kernel thread represented by a struct thread. Threads own
locks and hold a single execution context.