diff --git a/en_US.ISO8859-1/articles/smp/article.sgml b/en_US.ISO8859-1/articles/smp/article.sgml
index c6c78e0feb..14c53dc77c 100644
--- a/en_US.ISO8859-1/articles/smp/article.sgml
+++ b/en_US.ISO8859-1/articles/smp/article.sgml
@@ -1,957 +1,959 @@
%man;
%authors;
%misc;
+
+%freebsd;
]>
SMPng Design Document
John
Baldwin
Robert
Watson
$FreeBSD$
2002
2003
John Baldwin
Robert Watson
This document presents the current design and implementation of
the SMPng Architecture. First, the basic primitives and tools are
introduced. Next, a general architecture for the FreeBSD kernel's
synchronization and execution model is laid out. Then, locking
strategies for specific subsystems are discussed, documenting the
approaches taken to introduce fine-grained synchronization and
parallelism for each subsystem. Finally, detailed implementation
notes are provided to motivate design choices, and make the reader
aware of important implications involving the use of specific
primitives.
Introduction
This document is a work-in-progress, and will be updated to
reflect on-going design and implementation activities associated
with the SMPng Project. Many sections currently exist only in
outline form, but will be fleshed out as work proceeds. Updates or
suggestions regarding the document may be directed to the document
editors.
The goal of SMPng is to allow concurrency in the kernel.
The kernel is basically one rather large and complex program. To
make the kernel multi-threaded we use some of the same tools used
to make other programs multi-threaded. These include mutexes,
shared/exclusive locks, semaphores, and condition variables. For
the definitions of these and other SMP-related terms, please see
the Glossary section of this article.
Basic Tools and Locking Fundamentals
Atomic Instructions and Memory Barriers
There are several existing treatments of memory barriers
and atomic instructions, so this section will not include a
lot of detail. To put it simply, one cannot read a
variable without a lock if a lock is used to protect writes
to that variable. This becomes obvious when you consider that
memory barriers simply determine relative order of memory
operations; they do not make any guarantee about timing of
memory operations. That is, a memory barrier does not force
the contents of a CPU's local cache or store buffer to flush.
Instead, the memory barrier at lock release simply ensures
that all writes to the protected data will be visible to other
CPUs or devices if the write to release the lock is visible.
The CPU is free to keep that data in its cache or store buffer
as long as it wants. However, if another CPU performs an
atomic instruction on the same datum, the first CPU must
guarantee that the updated value is made visible to the second
CPU along with any other operations that memory barriers may
require.
For example, assuming a simple model where data is
considered visible when it is in main memory (or a global
cache), when an atomic instruction is triggered on one CPU,
other CPUs' store buffers and caches must flush any writes to
that same cache line along with any pending operations behind
a memory barrier.
This requires one to take special care when using an item
protected by atomic instructions. For example, in the sleep
mutex implementation, we have to use an
atomic_cmpset rather than an
atomic_set to turn on the
MTX_CONTESTED bit. The reason is that we
read the value of mtx_lock into a
variable and then make a decision based on that read.
However, the value we read may be stale, or it may change
while we are making our decision. Thus, when the
atomic_set is executed, it may end up
setting the bit on a different value than the one we based the
decision on. Therefore, we have to use an
atomic_cmpset to set the value only if
the value we made the decision on is up-to-date and
valid.
Finally, atomic instructions only allow one item to be
updated or read. If one needs to atomically update several
items, then a lock must be used instead. For example, if two
counters must be read and have values that are consistent
relative to each other, then those counters must be protected
by a lock rather than by separate atomic instructions.
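For example, a hedged sketch of the two-counter case (the lock and
counter names are hypothetical; the mutex is assumed to be
initialized elsewhere):

	static struct mtx stats_mtx;
	static long stats_in, stats_out;  /* must stay mutually consistent */

	static void
	stats_record(void)
	{
		mtx_lock(&stats_mtx);
		stats_in++;		/* both updates become visible */
		stats_out++;		/* together, never one without the other */
		mtx_unlock(&stats_mtx);
	}

Readers take the same mutex before reading both counters.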
Read Locks versus Write Locks
Read locks do not need to be as strong as write locks.
Both types of locks need to ensure that the data they are
accessing is not stale. However, only write access requires
exclusive access. Multiple threads can safely read a value.
Using different types of locks for reads and writes can be
implemented in a number of ways.
First, sx locks can be used in this manner by using an
exclusive lock when writing and a shared lock when reading.
This method is quite straightforward.
A second method is a bit more obscure. You can protect a
datum with multiple locks. Then for reading that data you
simply need to have a read lock of one of the locks. However,
to write to the data, you need to have a write lock of all of
the locks. This can make writing rather expensive but can be
useful when data is accessed in various ways. For example,
the parent process pointer is protected by both the
proctree_lock sx lock and the per-process mutex. Sometimes
the proc lock is easier to use, since we are simply checking the
parent of a process that we already have locked. However,
other places, such as inferior, need to
walk the tree of processes via parent pointers; locking
each process would be prohibitively expensive, and it would be
painful to guarantee that the condition being checked remains
valid for both the check and the actions taken as a result of
the check.
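A hedged sketch of the two access styles using the parent-pointer
example (simplified; not the exact kernel code):

	/* Reader: either lock alone is sufficient. */
	static struct proc *
	parent_of(struct proc *p)
	{
		struct proc *pp;

		PROC_LOCK(p);
		pp = p->p_pptr;
		PROC_UNLOCK(p);
		return (pp);
	}

	/* Writer: both locks must be held. */
	static void
	reparent(struct proc *p, struct proc *newparent)
	{
		sx_xlock(&proctree_lock);
		PROC_LOCK(p);
		p->p_pptr = newparent;
		PROC_UNLOCK(p);
		sx_xunlock(&proctree_lock);
	}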
Locking Conditions and Results
If you need a lock to check the state of a variable so
that you can take an action based on the state you read, you
cannot just hold the lock while reading the variable and then
drop the lock before you act on the value you read. Once you
drop the lock, the variable can change, rendering your decision
invalid. Thus, you must hold the lock both while reading the
variable and while performing the action as a result of the
test.
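A hedged illustration of this rule (the structure, state values,
and lock here are all hypothetical):

	static void
	foo_start(struct foo *foo)
	{
		mtx_lock(&foo_mtx);
		if (foo->state == FOO_IDLE)	/* the check... */
			foo->state = FOO_BUSY;	/* ...and the action */
		mtx_unlock(&foo_mtx);
	}

Dropping the lock between the test and the assignment would let
another thread change foo->state in the window.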
General Architecture and Design
Interrupt Handling
- Following the pattern of several other multi-threaded Unix
+ Following the pattern of several other multi-threaded &unix;
kernels, FreeBSD deals with interrupt handlers by giving them
their own thread context. Providing a context for interrupt
handlers allows them to block on locks. To help avoid
latency, however, interrupt threads run at real-time kernel
priority. Thus, interrupt handlers should not execute for very
long to avoid starving other kernel threads. In addition,
since multiple handlers may share an interrupt thread,
interrupt handlers should not sleep or use a sleepable lock to
avoid starving another interrupt handler.
The interrupt threads currently in FreeBSD are referred to
as heavyweight interrupt threads. They are called this
because switching to an interrupt thread involves a full
context switch. In the initial implementation, the kernel was
not preemptive and thus interrupts that interrupted a kernel
thread would have to wait until the kernel thread blocked or
returned to userland before they would have an opportunity to
run.
To deal with the latency problems, the kernel in FreeBSD
has been made preemptive. Currently, we only preempt a kernel
thread when we release a sleep mutex or when an interrupt
comes in. However, the plan is to make the FreeBSD kernel
fully preemptive as described below.
Not all interrupt handlers execute in a thread context.
Instead, some handlers execute directly in primary interrupt
context. These interrupt handlers are currently misnamed fast
interrupt handlers since the
INTR_FAST flag used in earlier versions
of the kernel is used to mark these handlers. The only
interrupts which currently use these types of interrupt
handlers are clock interrupts and serial I/O device
interrupts. Since these handlers do not have their own
context, they may not acquire blocking locks and thus may only
use spin mutexes.
Finally, there is one optional optimization that can be
added in MD code called lightweight context switches. Since
an interrupt thread executes in a kernel context, it can
borrow the vmspace of any process. Thus, in a lightweight
context switch, the switch to the interrupt thread does not
switch vmspaces but borrows the vmspace of the interrupted
thread. In order to ensure that the vmspace of the
interrupted thread does not disappear out from under us, the
interrupted thread is not allowed to execute until the
interrupt thread is no longer borrowing its vmspace. This can
happen when the interrupt thread either blocks or finishes.
If an interrupt thread blocks, then it will use its own
context when it is made runnable again. Thus, it can release
the interrupted thread.
The cons of this optimization are that it is very
machine specific and complex and thus only worth the effort if
there is a large performance improvement. At this point it is
probably too early to tell; in fact, it will probably hurt
performance, as almost all interrupt handlers will immediately
block on Giant and require a thread fix-up when they block.
Also, an alternative method of interrupt handling has been
proposed by Mike Smith that works like so:
Each interrupt handler has two parts: a predicate
which runs in primary interrupt context and a handler
which runs in its own thread context.
If an interrupt handler has a predicate, then when an
interrupt is triggered, the predicate is run. If the
predicate returns true then the interrupt is assumed to be
fully handled and the kernel returns from the interrupt.
If the predicate returns false or there is no predicate,
then the threaded handler is scheduled to run.
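A sketch of how such a two-part handler might look under this
proposal; every name here is hypothetical, since the scheme was
never adopted in this form:

	struct ihandler {
		int	(*ih_predicate)(void *);  /* primary context */
		void	(*ih_handler)(void *);	  /* thread context */
		void	*ih_arg;
	};

	/* Called from primary interrupt context. */
	static void
	ihandler_dispatch(struct ihandler *ih)
	{
		if (ih->ih_predicate != NULL &&
		    ih->ih_predicate(ih->ih_arg))
			return;		/* fully handled; no thread needed */
		ithread_schedule(ih);	/* hypothetical: run ih_handler later */
	}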
Fitting light weight context switches into this scheme
might prove rather complicated. Since we may want to change
to this scheme at some point in the future, it is probably
best to defer work on light weight context switches until we
have settled on the final interrupt handling architecture and
determined how light weight context switches might or might
not fit into it.
Kernel Preemption and Critical Sections
Kernel Preemption in a Nutshell
Kernel preemption is fairly simple. The basic idea is
that a CPU should always be doing the highest priority work
available. Well, that is the ideal at least. There are a
couple of cases where the expense of achieving the ideal is
not worth being perfect.
Implementing full kernel preemption is very
straightforward: when you schedule a thread to be executed
by putting it on a runqueue, you check to see if its
priority is higher than that of the currently executing thread. If
so, you initiate a context switch to that thread.
While locks can protect most data in the case of a
preemption, not all of the kernel is preemption safe. For
example, if a thread holding a spin mutex is preempted and the
new thread attempts to grab the same spin mutex, the new
thread may spin forever as the interrupted thread may never
get a chance to execute. Also, some code, such as the code
to assign an address space number for a process during
exec() on the Alpha, must not be preempted since it supports
the actual context switch code. Preemption is disabled for
these code sections by using a critical section.
Critical Sections
The responsibility of the critical section API is to
prevent context switches inside of a critical section. With
a fully preemptive kernel, every
setrunqueue of a thread other than the
current thread is a preemption point. One implementation is
for critical_enter to set a per-thread
flag that is cleared by its counterpart. If
setrunqueue is called with this flag
set, it does not preempt regardless of the priority of the new
thread relative to the current thread. However, since
critical sections are used in spin mutexes to prevent
context switches and multiple spin mutexes can be acquired,
the critical section API must support nesting. For this
reason the current implementation uses a nesting count
instead of a single per-thread flag.
In order to minimize latency, preemptions inside of a
critical section are deferred rather than dropped. If a
thread that would normally trigger a preemption is made
runnable inside a critical section, then a per-thread flag is set
to indicate that there is a pending preemption. When the
outermost critical section is exited, the flag is checked.
If the flag is set, then the current thread is preempted to
allow the higher priority thread to run.
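A simplified sketch of the nesting count and deferred-preemption
flag (the real kernel code differs in detail; td_pending_preempt is
an assumed field name used only for illustration):

	void
	critical_enter(void)
	{
		curthread->td_critnest++;
	}

	void
	critical_exit(void)
	{
		struct thread *td = curthread;

		if (--td->td_critnest == 0 && td->td_pending_preempt) {
			td->td_pending_preempt = 0;
			mi_switch();	/* take the deferred preemption */
		}
	}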
Interrupts pose a problem with regards to spin mutexes.
If a low-level interrupt handler needs a lock, it must not
interrupt any other code that uses that lock, or data
structure corruption could result. Currently, this
mechanism is piggybacked onto the critical section API by means
of the cpu_critical_enter and
cpu_critical_exit functions. Currently
this API disables and re-enables interrupts on all of
FreeBSD's current platforms. This approach may not be
purely optimal, but it is simple to understand and simple to
get right. Theoretically, this second API need only be used
for spin mutexes that are used in primary interrupt context.
However, to make the code simpler, it is used for all spin
mutexes and even all critical sections. It may be desirable
to split out the MD API from the MI API and only use it in
conjunction with the MI API in the spin mutex
implementation. If this approach is taken, then the MD API
likely would need a rename to show that it is a separate API
now.
Design Tradeoffs
As mentioned earlier, a couple of trade-offs have been
made to sacrifice cases where perfect preemption may not
always provide the best performance.
The first trade-off is that the preemption code does not
take other CPUs into account. Suppose we have two CPUs, A
and B, where the priority of A's thread is 4 and the priority
of B's thread as 2. If CPU B makes a thread with priority 1
runnable, then in theory, we want CPU A to switch to the new
thread so that we will be running the two highest priority
runnable threads. However, the cost of determining which
CPU to enforce a preemption on as well as actually signaling
that CPU via an IPI along with the synchronization that
would be required would be enormous. Thus, the current code
would instead force CPU B to switch to the higher priority
thread. Note that this still puts the system in a better
position as CPU B is executing a thread of priority 1 rather
than a thread of priority 2.
The second trade-off limits immediate kernel preemption
to real-time priority kernel threads. In the simple case of
preemption defined above, a thread is always preempted
immediately (or as soon as a critical section is exited) if
a higher priority thread is made runnable. However, many
threads executing in the kernel only execute in a kernel
context for a short time before either blocking or returning
to userland. Thus, if the kernel preempts these threads to
run another non-realtime kernel thread, the kernel may
switch out the executing thread just before it is about to
sleep or execute. The cache on the CPU must then adjust to
the new thread. When the kernel returns to the interrupted
thread, it must refill all the cache information that was lost.
In addition, two extra context switches are performed that
could be avoided if the kernel deferred the preemption until
the first thread blocked or returned to userland. Thus, by
default, the preemption code will only preempt immediately
if the higher priority thread is a real-time priority
thread.
Turning on full kernel preemption for all kernel threads
has value as a debugging aid since it exposes more race
conditions. It is especially useful on UP systems where many
races are hard to simulate otherwise. Thus, there will be a
kernel option to enable preemption for all kernel threads
that can be used for debugging purposes.
Thread Migration
Simply put, a thread migrates when it moves from one CPU
to another. In a non-preemptive kernel this can only happen
at well-defined points such as when calling
tsleep or returning to userland.
However, in the preemptive kernel, an interrupt can force a
preemption and possible migration at any time. This can have
negative effects on per-CPU data since, with the exception of
curthread and curpcb, the
data can change whenever you migrate. Since you can
potentially migrate at any time this renders per-CPU data
rather useless. Thus it is desirable to be able to disable
migration for sections of code that need per-CPU data to be
stable.
Critical sections currently prevent migration since they
do not allow context switches. However, this may be too strong
a requirement to enforce in some cases, since a critical
section also effectively blocks interrupt threads on the
current processor. As a result, it may be desirable to
provide an API whereby code may indicate that if the current
thread is preempted it should not migrate to another
CPU.
One possible implementation is to use a per-thread nesting
count td_pinnest along with a
td_pincpu which is updated to the current
CPU on each context switch. Each CPU has its own run queue
that holds threads pinned to that CPU. A thread is pinned
when its nesting count is greater than zero and a thread
starts off unpinned with a nesting count of zero. When a
thread is put on a runqueue, we check to see if it is pinned.
If so, we put it on the per-CPU runqueue, otherwise we put it
on the global runqueue. When
choosethread is called to retrieve the
next thread, it could either always prefer bound threads to
unbound threads or use some sort of bias when comparing
priorities. If the nesting count is only ever written to by
the thread itself, and is only read by other threads holding
sched_lock while the owning thread is not executing, then
td_pinnest will not need any other locks.
The migrate_disable function would
increment the nesting count and
migrate_enable would decrement the
nesting count. Due to the locking requirements specified
above, they will only operate on the current thread and thus
would not need to handle the case of making a thread
migrateable that currently resides on a per-CPU run
queue.
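A sketch of the proposed API using the field names from the text
(these are proposals under discussion, not committed code):

	void
	migrate_disable(void)
	{
		curthread->td_pinnest++;	/* pinned to td_pincpu */
	}

	void
	migrate_enable(void)
	{
		curthread->td_pinnest--;	/* unpinned at zero */
	}

Because each function only touches curthread, no extra locking is
needed for td_pinnest under the rules given above.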
It is still debatable if this API is needed or if the
critical section API is sufficient by itself. Many of the
places that need to prevent migration also need to prevent
preemption as well, and in those places a critical section
must be used regardless.
Callouts
The timeout() kernel facility permits
kernel services to register functions for execution as part
of the softclock() software interrupt.
Events are scheduled based on a desired number of clock
ticks, and callbacks to the consumer-provided function
will occur at approximately the right time.
The global list of pending timeout events is protected
by a global spin mutex, callout_lock;
all access to the timeout list must be performed with this
mutex held. When softclock() is
woken up, it scans the list of pending timeouts for those
that should fire. In order to avoid lock order reversal,
the softclock thread will release the
callout_lock mutex when invoking the
provided timeout() callback function.
If the CALLOUT_MPSAFE flag was not set
during registration, then Giant will be grabbed before
invoking the callout, and then released afterwards. The
callout_lock mutex will be re-grabbed
before proceeding. The softclock()
code is careful to leave the list in a consistent state
while releasing the mutex. If DIAGNOSTIC
is enabled, then the time taken to execute each function is
measured, and a warning generated if it exceeds a
threshold.
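A typical consumer of the facility looks something like the
following hedged sketch (sc is a stand-in argument; Giant handling
is as described above):

	static struct callout_handle my_handle;

	static void
	my_timeout_fn(void *arg)
	{
		/* deferred work; runs from softclock(), keep it short */
	}

	/* Schedule my_timeout_fn roughly one second (hz ticks) out. */
	my_handle = timeout(my_timeout_fn, sc, hz);

	/* Cancel the pending call if it has not fired yet. */
	untimeout(my_timeout_fn, sc, my_handle);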
Specific Locking Strategies
Credentials
struct ucred is the kernel's
internal credential structure, and is generally used as the
basis for process-driven access control within the kernel.
BSD-derived systems use a copy-on-write
model for credential
data: multiple references may exist for a credential structure,
and when a change needs to be made, the structure is duplicated,
modified, and then the reference replaced. Due to widespread
caching of the credential to implement access control on open,
this results in substantial memory savings. With a move to
fine-grained SMP, this model also saves substantially on
locking operations by requiring that modification only occur
on an unshared credential, avoiding the need for explicit
synchronization when consuming a known-shared
credential.
Credential structures with a single reference are
considered mutable; shared credential structures must not be
modified or a race condition is risked. A mutex,
cr_mtxp, protects the reference
count of struct ucred so as to
maintain consistency. Any use of the structure requires a
valid reference for the duration of the use, or the structure
may be released out from under the illegitimate
consumer.
The struct ucred mutex is a leaf
mutex, and for performance reasons, is implemented via a mutex
pool.
Usually, credentials are used in a read-only manner for access
control decisions, and in this case td_ucred
is generally preferred because it requires no locking. When a
process' credential is updated, the proc lock
must be held across the check and update operations to avoid
races. The process credential p_ucred
must be used for check and update operations to prevent
time-of-check, time-of-use races.
If system call invocations will perform access control after
an update to the process credential, the value of
td_ucred must also be refreshed to
the current process value. This will prevent use of a stale
credential following a change. The kernel automatically
refreshes the td_ucred pointer in
the thread structure from the process
p_ucred whenever a process enters
the kernel, permitting use of a fresh credential for kernel
access control.
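A hedged sketch of both uses (the function and uid parameter are
illustrative; the credential helpers shown are real kernel
interfaces of this era, and allocation happens before the proc
lock is taken because the lock is a mutex):

	static int
	set_uid_example(struct thread *td, struct proc *p, uid_t uid)
	{
		struct ucred *newcred, *oldcred;

		/* Read-only check: td_ucred needs no locking. */
		if (suser_cred(td->td_ucred, 0) != 0)
			return (EPERM);

		newcred = crget();		/* allocate before locking */
		PROC_LOCK(p);
		oldcred = p->p_ucred;
		crcopy(newcred, oldcred);	/* modify an unshared copy */
		newcred->cr_uid = uid;
		p->p_ucred = newcred;
		PROC_UNLOCK(p);
		crfree(oldcred);		/* drop the replaced reference */
		return (0);
	}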
File Descriptors and File Descriptor Tables
Details to follow.
Jail Structures
struct prison stores
administrative details pertinent to the maintenance of jails
created using the &man.jail.2; API. This includes the
per-jail hostname, IP address, and related settings. This
structure is reference-counted since pointers to instances of
the structure are shared by many credential structures. A
single mutex, pr_mtx, protects read
and write access to the reference count and all mutable
variables inside the struct jail. Some variables are set only
when the jail is created, and a valid reference to the
struct prison is sufficient to read
these values. The precise locking of each entry is documented
via comments in sys/jail.h.
MAC Framework
The TrustedBSD MAC Framework maintains data in a variety
of kernel objects, in the form of struct
label. In general, labels in kernel objects
are protected by the same lock as the remainder of the kernel
object. For example, the v_label
label in struct vnode is protected
by the vnode lock on the vnode.
In addition to labels maintained in standard kernel objects,
the MAC Framework also maintains a list of registered and
active policies. The policy list is protected by a global
mutex (mac_policy_list_lock) and a busy
count (also protected by the mutex). Since many access
control checks may occur in parallel, entry to the framework
for a read-only access to the policy list requires holding the
mutex while incrementing (and later decrementing) the busy
count. The mutex need not be held for the duration of the
MAC entry operation; some operations, such as label operations
on file system objects, are long-lived. To modify the policy
list, such as during policy registration and de-registration,
the mutex must be held and the reference count must be zero,
to prevent modification of the list while it is in use.
A condition variable,
mac_policy_list_not_busy, is available to
threads that need to wait for the list to become unbusy, but
this condition variable must only be waited on if the caller is
holding no other locks, or a lock order violation may be
possible. The busy count, in effect, acts as a form of
shared/exclusive lock over access to the framework: the difference
is that, unlike with an sx lock, consumers waiting for the list
to become unbusy may be starved, rather than permitting lock
order problems with regards to the busy count and other locks
that may be held on entry to (or inside) the MAC Framework.
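A simplified sketch of the busy-count protocol (the counter's name
is assumed for illustration; the lock and condition variable names
are those given in the text):

	static void
	mac_policy_list_busy(void)
	{
		mtx_lock(&mac_policy_list_lock);
		mac_policy_list_busy_count++;
		mtx_unlock(&mac_policy_list_lock);
	}

	static void
	mac_policy_list_unbusy(void)
	{
		mtx_lock(&mac_policy_list_lock);
		if (--mac_policy_list_busy_count == 0)
			cv_broadcast(&mac_policy_list_not_busy);
		mtx_unlock(&mac_policy_list_lock);
	}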
Modules
For the module subsystem there exists a single lock that is
used to protect the shared data. This lock is a shared/exclusive
(sx) lock with a good chance of needing to be acquired (shared
or exclusive), so a few macros have been
added to make access to the lock easier. These macros can be
found in sys/module.h and are quite basic
in terms of usage. The main structures protected under this lock
are the module_t structures (when shared)
and the global modulelist_t structure,
modules. One should review the related source code in
kern/kern_module.c to further understand the
locking strategy.
Newbus Device Tree
The newbus system will have one sx lock. Readers will
hold a shared (read) lock (&man.sx.slock.9;) and writers will hold
an exclusive (write) lock (&man.sx.xlock.9;). Internal functions
will not do locking at all. Externally visible ones will lock as
needed.
Items for which it does not matter whether the race is won or lost will
not be locked, since they tend to be read all over the place
(e.g. &man.device.get.softc.9;). There will be relatively few
changes to the newbus data structures, so a single lock should
be sufficient and not impose a performance penalty.
Pipes
...
Processes and Threads
- process hierarchy
- proc locks, references
- thread-specific copies of proc entries to freeze during system
calls, including td_ucred
- inter-process operations
- process groups and sessions
Scheduler
Lots of references to sched_lock and notes
pointing at specific primitives and related magic elsewhere in the
document.
Select and Poll
The select() and poll() functions permit threads to block
waiting for events on file descriptors; most frequently, whether
or not the file descriptors are readable or writable.
...
SIGIO
The SIGIO service permits a process to request the delivery
of a SIGIO signal to its process group when the read/write status
of specified file descriptors changes. At most one process or
process group is permitted to register for SIGIO from any given
kernel object, and that process or group is referred to as
the owner. Each object supporting SIGIO registration contains a
pointer field that is NULL if the object is not registered, or
points to a struct sigio describing
the registration. This field is protected by a global mutex,
sigio_lock. Callers to SIGIO maintenance
functions must pass in this field by reference
so that local
register copies of the field are not made when unprotected by
the lock.
One struct sigio is allocated for
each registered object associated with any process or process
group, and contains back-pointers to the object, owner, signal
information, a credential, and the general disposition of the
registration. Each process or process group contains a list of
registered struct sigio structures,
p_sigiolst for processes, and
pg_sigiolst for process groups.
These lists are protected by the process or process group
locks respectively. Most fields in each struct
sigio are constant for the duration of the
registration, with the exception of the
sio_pgsigio field, which links the
struct sigio into the process or
process group list. Developers implementing new kernel
objects supporting SIGIO will, in general, want to avoid
holding structure locks while invoking SIGIO supporting
functions, such as fsetown()
or funsetown() to avoid
defining a lock order between structure locks and the global
SIGIO lock. This is generally possible through use of an
elevated reference count on the structure, such as reliance
on a file descriptor reference to a pipe during a pipe
operation.
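For example, a hypothetical object would embed and pass the field
as follows (my_object and mo_sigio are illustrative names):

	struct my_object {
		struct sigio *mo_sigio;	/* protected by sigio_lock */
		/* ... */
	};

	/* Register the given pgid as owner; note the field is
	 * passed by reference, as required above. */
	error = fsetown(pgid, &obj->mo_sigio);

	/* Drop the registration, for example on last close. */
	funsetown(&obj->mo_sigio);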
Sysctl
The sysctl() MIB service is invoked
from both within the kernel and from userland applications
using a system call. At least two issues are raised in locking:
first, the protection of the structures maintaining the
namespace, and second, interactions with kernel variables and
functions that are accessed by the sysctl interface. Since
sysctl permits the direct export (and modification) of
kernel statistics and configuration parameters, the sysctl
mechanism must become aware of appropriate locking semantics
for those variables. Currently, sysctl makes use of a
single global sx lock to serialize use of sysctl(); however, it
is assumed to operate under Giant and other protections are not
provided. The remainder of this section speculates on locking
and semantic changes to sysctl.
- Need to change the order of operations for sysctl's that
update values from read old, copyin and copyout, write new to
copyin, lock, read old and write new, unlock, copyout. Normal
sysctl's that just copyout the old value and set a new value
that they copyin may still be able to follow the old model.
However, it may be cleaner to use the second model for all of
the sysctl handlers to avoid lock operations.
- To allow for the common case, a sysctl could embed a
pointer to a mutex in the SYSCTL_FOO macros and in the struct.
This would work for most sysctl's. For values protected by sx
locks, spin mutexes, or other locking strategies besides a
single sleep mutex, SYSCTL_PROC nodes could be used to get the
locking right.
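As a sketch of that suggestion, a handler for a value guarded by
an sx lock might look like this (foo_lock and foo_count are
hypothetical):

	static int
	sysctl_foo_count(SYSCTL_HANDLER_ARGS)
	{
		int value;

		sx_slock(&foo_lock);	/* read a consistent snapshot */
		value = foo_count;
		sx_sunlock(&foo_lock);
		return (sysctl_handle_int(oidp, &value, 0, req));
	}

	SYSCTL_PROC(_kern, OID_AUTO, foo_count,
	    CTLTYPE_INT | CTLFLAG_RD, NULL, 0,
	    sysctl_foo_count, "I", "current foo count");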
Taskqueue
The taskqueue interface has two basic locks associated
with it in order to protect the related shared data. The
taskqueue_queues_mutex is meant to serve as a
lock to protect the taskqueue_queues TAILQ.
The other mutex lock associated with this system is the one in the
struct taskqueue data structure. The
use of the synchronization primitive here is to protect the
integrity of the data in the struct
taskqueue. It should be noted that there are no
separate macros to assist the user in locking down his/her own work
since these locks are most likely not going to be used outside of
kern/subr_taskqueue.c.
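Typical consumer usage looks like the following (my_task_fn and
sc are illustrative; taskqueue_swi is one of the standard global
queues):

	static void
	my_task_fn(void *context, int pending)
	{
		/* pending counts enqueues coalesced since the last run */
	}

	static struct task my_task;

	TASK_INIT(&my_task, 0, my_task_fn, sc);
	taskqueue_enqueue(taskqueue_swi, &my_task);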
Implementation Notes
Details of the Mutex Implementation
- Should we require mutexes to be owned for mtx_destroy()
since we can not safely assert that they are unowned by anyone
else otherwise?
Spin Mutexes
- Use a critical section...
Sleep Mutexes
- Describe the races with contested mutexes
- Why it is safe to read mtx_lock of a contested mutex
when holding sched_lock.
- Priority propagation
Witness
- What does it do
- How does it work
Miscellaneous Topics
Interrupt Source and ICU Abstractions
- struct isrc
- pic drivers
Other Random Questions/Topics
Should we pass an interlock into
sema_wait?
- Generic turnstiles for sleep mutexes and sx locks.
- Should we have non-sleepable sx locks?
Glossary
atomic
An operation is atomic if all of its effects are visible
to other CPUs together when the proper access protocol is
followed. The degenerate case is an atomic instruction
provided directly by the machine architecture. At a higher
level, if several members of a structure are protected by a
lock, then a set of operations are atomic if they are all
performed while holding the lock without releasing the lock
in between any of the operations.
operation
block
A thread is blocked when it is waiting on a lock,
resource, or condition. Unfortunately this term is a bit
overloaded as a result.
sleep
critical section
A section of code that is not allowed to be preempted.
A critical section is entered and exited using the
&man.critical.enter.9; API.
MD
Machine dependent.
MI
memory operation
A memory operation reads and/or writes to a memory
location.
MI
Machine independent.
MD
operation
memory operation
primary interrupt context
Primary interrupt context refers to the code that runs
when an interrupt occurs. This code can either run an
interrupt handler directly or schedule an asynchronous
interrupt thread to execute the interrupt handlers for a
given interrupt source.
realtime kernel thread
A high priority kernel thread. Currently, the only
realtime priority kernel threads are interrupt threads.
thread
sleep
A thread is asleep when it is blocked on a condition
variable or a sleep queue via msleep or
tsleep.
block
sleepable lock
A sleepable lock is a lock that can be held by a thread
which is asleep. Lockmgr locks and sx locks are currently
the only sleepable locks in FreeBSD. Eventually, some sx
locks such as the allproc and proctree locks may become
non-sleepable locks.
sleep
thread
A kernel thread represented by a struct thread. Threads own
locks and hold a single execution context.
diff --git a/en_US.ISO8859-1/articles/vm-design/article.sgml b/en_US.ISO8859-1/articles/vm-design/article.sgml
index f4f2487d4b..bbe6da9094 100644
--- a/en_US.ISO8859-1/articles/vm-design/article.sgml
+++ b/en_US.ISO8859-1/articles/vm-design/article.sgml
@@ -1,838 +1,841 @@
%man;
+
+%freebsd;
+
]>
Design elements of the FreeBSD VM system
Matthew
Dillon
dillon@apollo.backplane.com
The title is really just a fancy way of saying that I am going to
attempt to describe the whole VM enchilada, hopefully in a way that
everyone can follow. For the last year I have concentrated on a number
of major kernel subsystems within FreeBSD, with the VM and Swap
subsystems being the most interesting and NFS being a necessary
chore. I rewrote only small portions of the code. In the VM
arena the only major rewrite I have done is to the swap subsystem.
Most of my work was cleanup and maintenance, with only moderate code
rewriting and no major algorithmic adjustments within the VM
subsystem. The bulk of the VM subsystem's theoretical base remains
unchanged and a lot of the credit for the modernization effort in the
last few years belongs to John Dyson and David Greenman. Not being a
historian like Kirk, I will not attempt to tag all the various features
with people's names, since I will invariably get it wrong.
This article was originally published in the January 2000 issue of
DaemonNews. This
version of the article may include updates from Matt and other authors
to reflect changes in FreeBSD's VM implementation.
Introduction
Before moving along to the actual design let's spend a little time
on the necessity of maintaining and modernizing any long-living
codebase. In the programming world, algorithms tend to be more
important than code and it is precisely due to BSD's academic roots that
a great deal of attention was paid to algorithm design from the
beginning. More attention paid to the design generally leads to a clean
and flexible codebase that can be fairly easily modified, extended, or
replaced over time. While BSD is considered an old
operating system by some people, those of us who work on it tend to view
it more as a mature
codebase which has various components
modified, extended, or replaced with modern code. It has evolved, and
FreeBSD is at the bleeding edge no matter how old some of the code might
be. This is an important distinction to make and one that is
unfortunately lost to many people. The biggest error a programmer can
make is to not learn from history, and this is precisely the error that
many other modern operating systems have made. NT is the best example
of this, and the consequences have been dire. Linux also makes this
mistake to some degree—enough that we BSD folk can make small
jokes about it every once in a while, anyway. Linux's problem is simply
one of a lack of experience and history to compare ideas against, a
problem that is easily and rapidly being addressed by the Linux
community in the same way it has been addressed in the BSD
community—by continuous code development. The NT folk, on the
- other hand, repeatedly make the same mistakes solved by Unix decades ago
+ other hand, repeatedly make the same mistakes solved by &unix; decades ago
and then spend years fixing them. Over and over again. They have a
severe case of not designed here and we are always
right because our marketing department says so. I have little
tolerance for anyone who cannot learn from history.
Much of the apparent complexity of the FreeBSD design, especially in
the VM/Swap subsystem, is a direct result of having to solve serious
performance issues that occur under various conditions. These issues
are not due to bad algorithmic design but instead rise from
environmental factors. In any direct comparison between platforms,
these issues become most apparent when system resources begin to get
stressed. As I describe FreeBSD's VM/Swap subsystem the reader should
always keep two points in mind. First, the most important aspect of
performance design is what is known as Optimizing the Critical
Path. It is often the case that performance optimizations add a
little bloat to the code in order to make the critical path perform
better. Second, a solid, generalized design outperforms a
heavily-optimized design over the long run. While a generalized design
may end up being slower than a heavily-optimized design when they are
first implemented, the generalized design tends to be easier to adapt to
changing conditions and the heavily-optimized design winds up having to
be thrown away. Any codebase that will survive and be maintainable for
years must therefore be designed properly from the beginning even if it
costs some performance. Twenty years ago people were still arguing that
programming in assembly was better than programming in a high-level
language because it produced code that was ten times as fast. Today,
the fallibility of that argument is obvious—as are the parallels
to algorithmic design and code generalization.
VM Objects
The best way to begin describing the FreeBSD VM system is to look at
it from the perspective of a user-level process. Each user process sees
a single, private, contiguous VM address space containing several types
of memory objects. These objects have various characteristics. Program
code and program data are effectively a single memory-mapped file (the
binary file being run), but program code is read-only while program data
is copy-on-write. Program BSS is just memory allocated and filled with
zeros on demand, called demand zero page fill. Arbitrary files can be
memory-mapped into the address space as well, which is how the shared
library mechanism works. Such mappings can require modifications to
remain private to the process making them. The fork system call adds an
entirely new dimension to the VM management problem on top of the
complexity already given.
A program binary data page (which is a basic copy-on-write page)
illustrates the complexity. A program binary contains a preinitialized
data section which is initially mapped directly from the program file.
When a program is loaded into a process's VM space, this area is
initially memory-mapped and backed by the program binary itself,
allowing the VM system to free/reuse the page and later load it back in
from the binary. The moment a process modifies this data, however, the
VM system must make a private copy of the page for that process. Since
the private copy has been modified, the VM system may no longer free it,
because there is no longer any way to restore it later on.
You will notice immediately that what was originally a simple file
mapping has become much more complex. Data may be modified on a
page-by-page basis whereas the file mapping encompasses many pages at
once. The complexity further increases when a process forks. When a
process forks, the result is two processes—each with their own
private address spaces, including any modifications made by the original
process prior to the call to fork(). It would be
silly for the VM system to make a complete copy of the data at the time
of the fork() because it is quite possible that at
least one of the two processes will only need to read from that page
from then on, allowing the original page to continue to be used. What
was a private page is made copy-on-write again, since each process
(parent and child) expects their own personal post-fork modifications to
remain private to themselves and not affect the other.
FreeBSD manages all of this with a layered VM Object model. The
original binary program file winds up being the lowest VM Object layer.
A copy-on-write layer is pushed on top of that to hold those pages which
had to be copied from the original file. If the program modifies a data
page belonging to the original file the VM system takes a fault and
makes a copy of the page in the higher layer. When a process forks,
additional VM Object layers are pushed on. This might make a little
more sense with a fairly basic example. A fork()
is a common operation for any *BSD system, so this example will consider
a program that starts up, and forks. When the process starts, the VM
system creates an object layer, let's call this A:
+---------------+
| A |
+---------------+
A represents the file—pages may be paged in and out of the
file's physical media as necessary. Paging in from the disk is
reasonable for a program, but we really do not want to page back out and
overwrite the executable. The VM system therefore creates a second
layer, B, that will be physically backed by swap space:
+---------------+
| B |
+---------------+
| A |
+---------------+
On the first write to a page after this, a new page is created in B,
and its contents are initialized from A. All pages in B can be paged in
or out to a swap device. When the program forks, the VM system creates
two new object layers—C1 for the parent, and C2 for the
child—that rest on top of B:
+-------+-------+
| C1 | C2 |
+-------+-------+
| B |
+---------------+
| A |
+---------------+
In this case, let's say a page in B is modified by the original
parent process. The process will take a copy-on-write fault and
duplicate the page in C1, leaving the original page in B untouched.
Now, let's say the same page in B is modified by the child process. The
process will take a copy-on-write fault and duplicate the page in C2.
The original page in B is now completely hidden since both C1 and C2
have a copy, and B could theoretically be destroyed (if it does not
represent a real file). However, this sort of optimization is not
trivial to make because it is so fine-grained. FreeBSD does not make
this optimization. Now, suppose (as is often the case) that the child
process does an exec(). Its current address space
is usually replaced by a new address space representing a new file. In
this case, the C2 layer is destroyed:
+-------+
| C1 |
+-------+-------+
| B |
+---------------+
| A |
+---------------+
In this case, the number of children of B drops to one, and all
accesses to B now go through C1. This means that B and C1 can be
collapsed together. Any pages in B that also exist in C1 are deleted
from B during the collapse. Thus, even though the optimization in the
previous step could not be made, we can recover the dead pages when
either of the processes exit or exec().
This model creates a number of potential problems. The first is that
you can wind up with a relatively deep stack of layered VM Objects which
can cost scanning time and memory when you take a fault. Deep
layering can occur when processes fork and then fork again (either
parent or child). The second problem is that you can wind up with dead,
inaccessible pages deep in the stack of VM Objects. In our last example
if both the parent and child processes modify the same page, they both
get their own private copies of the page and the original page in B is
no longer accessible by anyone. That page in B can be freed.
FreeBSD solves the deep layering problem with a special optimization
called the All Shadowed Case. This case occurs if either
C1 or C2 take sufficient COW faults to completely shadow all pages in B.
Let's say that C1 achieves this. C1 can now bypass B entirely, so rather
than have C1->B->A and C2->B->A we now have C1->A and C2->B->A. But
look what also happened—now B has only one reference (C2), so we
can collapse B and C2 together. The end result is that B is deleted
entirely and we have C1->A and C2->A. It is often the case that B will
contain a large number of pages and neither C1 nor C2 will be able to
completely overshadow it. If we fork again and create a set of D
layers, however, it is much more likely that one of the D layers will
eventually be able to completely overshadow the much smaller dataset
represented by C1 or C2. The same optimization will work at any point in
the graph and the grand result of this is that even on a heavily forked
machine VM Object stacks tend to not get much deeper than 4. This is
true of both the parent and the children and true whether the parent is
doing the forking or whether the children cascade forks.
The dead page problem still exists in the case where C1 or C2 do not
completely overshadow B. Due to our other optimizations this case does
not represent much of a problem and we simply allow the pages to be
dead. If the system runs low on memory it will swap them out, eating a
little swap, but that is it.
The advantage to the VM Object model is that
fork() is extremely fast, since no real data
copying need take place. The disadvantage is that you can build a
relatively complex VM Object layering that slows page fault handling
down a little, and you spend memory managing the VM Object structures.
The optimizations FreeBSD makes prove to reduce the problems enough
that they can be ignored, leaving no real disadvantage.
SWAP Layers
Private data pages are initially either copy-on-write or zero-fill
pages. When a change, and therefore a copy, is made, the original
backing object (usually a file) can no longer be used to save a copy of
the page when the VM system needs to reuse it for other purposes. This
is where SWAP comes in. SWAP is allocated to create backing store for
memory that does not otherwise have it. FreeBSD allocates the swap
management structure for a VM Object only when it is actually needed.
However, the swap management structure has had problems
historically.
Under FreeBSD 3.X the swap management structure preallocates an
array that encompasses the entire object requiring swap backing
store—even if only a few pages of that object are swap-backed.
This creates a kernel memory fragmentation problem when large objects
are mapped, or processes with large runsizes (RSS) fork. Also, in order
to keep track of swap space, a list of holes is kept in
kernel memory, and this tends to get severely fragmented as well. Since
the list of holes is a linear list, the swap allocation and freeing
performance is a non-optimal O(n)-per-page. It also requires kernel
memory allocations to take place during the swap freeing process, and
that creates low memory deadlock problems. The problem is further
exacerbated by holes created due to the interleaving algorithm. Also,
the swap block map can become fragmented fairly easily resulting in
non-contiguous allocations. Kernel memory must also be allocated on the
fly for additional swap management structures when a swapout occurs. It
is evident that there was plenty of room for improvement.
For FreeBSD 4.X, I completely rewrote the swap subsystem. With this
rewrite, swap management structures are allocated through a hash table
rather than a linear array, giving them a fixed allocation size and much
finer granularity. Rather than using a linearly linked list to keep
track of swap space reservations, it now uses a bitmap of swap blocks
arranged in a radix tree structure with free-space hinting in the radix
node structures. This effectively makes swap allocation and freeing an
O(1) operation. The entire radix tree bitmap is also preallocated in
order to avoid having to allocate kernel memory during critical low
memory swapping operations. After all, the system tends to swap when it
is low on memory so we should avoid allocating kernel memory at such
times in order to avoid potential deadlocks. Finally, to reduce
fragmentation the radix tree is capable of allocating large contiguous
chunks at once, skipping over smaller fragmented chunks. I did not take
the final step of having an allocating hint pointer
that would trundle
through a portion of swap as allocations were made in order to further
guarantee contiguous allocations or at least locality of reference, but
I ensured that such an addition could be made.
When to free a page
Since the VM system uses all available memory for disk caching,
there are usually very few truly-free pages. The VM system depends on
being able to properly choose pages which are not in use to reuse for
new allocations. Selecting the optimal pages to free is possibly the
single most important function any VM system can perform, because if it
makes a poor selection, the VM system may be forced to unnecessarily
retrieve pages from disk, seriously degrading system performance.
How much overhead are we willing to suffer in the critical path to
avoid freeing the wrong page? Each wrong choice we make will cost us
hundreds of thousands of CPU cycles and a noticeable stall of the
affected processes, so we are willing to endure a significant amount of
overhead in order to be sure that the right page is chosen. This is why
FreeBSD tends to outperform other systems when memory resources become
stressed.
The free page determination algorithm is built upon a history of the
use of memory pages. To acquire this history, the system takes advantage
of a page-used bit feature that most hardware page tables have.
In any case, the page-used bit is cleared and at some later point
the VM system comes across the page again and sees that the page-used
bit has been set. This indicates that the page is still being actively
used. If the bit is still clear it is an indication that the page is not
being actively used. By testing this bit periodically, a use history (in
the form of a counter) for the physical page is developed. When the VM
system later needs to free up some pages, checking this history becomes
the cornerstone of determining the best candidate page to reuse.
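A simplified sketch of the history update during a queue scan (the
real vm_pageout code is more involved; pmap_ts_referenced() and
the act_count field are real interfaces of this era):

	if (pmap_ts_referenced(m)) {
		/* Page-used bit was set: clear it and strengthen
		 * the page's use history. */
		if (m->act_count < ACT_MAX)
			m->act_count += ACT_ADVANCE;
	} else {
		/* Not referenced since the last scan: decay. */
		if (m->act_count > 0)
			m->act_count -= ACT_DECLINE;
	}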
What if the hardware has no page-used bit?
For those platforms that do not have this feature, the system
actually emulates a page-used bit. It unmaps or protects a page,
forcing a page fault if the page is accessed again. When the page
fault is taken, the system simply marks the page as having been used
and unprotects the page so that it may be used. While taking such page
faults just to determine if a page is being used appears to be an
expensive proposition, it is much less expensive than reusing the page
for some other purpose only to find that a process needs it back and
then have to go to disk.
FreeBSD makes use of several page queues to further refine the
selection of pages to reuse as well as to determine when dirty pages
must be flushed to their backing store. Since page tables are dynamic
entities under FreeBSD, it costs virtually nothing to unmap a page from
the address space of any processes using it. When a page candidate has
been chosen based on the page-use counter, this is precisely what is
done. The system must make a distinction between clean pages which can
theoretically be freed up at any time, and dirty pages which must first
be written to their backing store before being reusable. When a page
candidate has been found it is moved to the inactive queue if it is
dirty, or the cache queue if it is clean. A separate algorithm based on
the dirty-to-clean page ratio determines when dirty pages in the
inactive queue must be flushed to disk. Once this is accomplished, the
flushed pages are moved from the inactive queue to the cache queue. At
this point, pages in the cache queue can still be reactivated by a VM
fault at relatively low cost. However, pages in the cache queue are
considered to be immediately freeable
and will be reused
in an LRU (least-recently used) fashion when the system needs to
allocate new memory.
It is important to note that the FreeBSD VM system attempts to
separate clean and dirty pages for the express reason of avoiding
unnecessary flushes of dirty pages (which eat I/O bandwidth), and it
does not move pages between the various page queues gratuitously when the
memory subsystem is not being stressed. This is why you will see some
systems with very low cache queue counts and high active queue counts
when doing a systat -vm command. As the VM system
becomes more stressed, it makes a greater effort to maintain the various
page queues at the levels determined to be the most effective. An urban
myth has circulated for years that Linux did a better job avoiding
swapouts than FreeBSD, but this in fact is not true. What was actually
occurring was that FreeBSD was proactively paging out unused pages in
order to make room for more disk cache while Linux was keeping unused
pages in core and leaving less memory available for cache and process
pages. I do not know whether this is still true today.
Pre-Faulting and Zeroing Optimizations
Taking a VM fault is not expensive if the underlying page is already
in core and can simply be mapped into the process, but it can become
expensive if you take a whole lot of them on a regular basis. A good
example of this is running a program such as &man.ls.1; or &man.ps.1;
over and over again. If the program binary is mapped into memory but
not mapped into the page table, then all the pages that will be accessed
by the program will have to be faulted in every time the program is run.
This is unnecessary when the pages in question are already in the VM
Cache, so FreeBSD will attempt to pre-populate a process's page tables
with those pages that are already in the VM Cache. One thing that
FreeBSD does not yet do is pre-copy-on-write certain pages on exec. For
example, if you run the &man.ls.1; program while running vmstat
1 you will notice that it always takes a certain number of
page faults, even when you run it over and over again. These are
zero-fill faults, not program code faults (which were pre-faulted in
already). Pre-copying pages on exec or fork is an area that could use
more study.
A large percentage of page faults that occur are zero-fill faults.
You can usually see this by observing the vmstat -s
output. These occur when a process accesses pages in its BSS area. The
BSS area is expected to be initially zero but the VM system does not
bother to allocate any memory at all until the process actually accesses
it. When a fault occurs the VM system must not only allocate a new page,
it must zero it as well. To optimize the zeroing operation the VM system
has the ability to pre-zero pages and mark them as such, and to request
pre-zeroed pages when zero-fill faults occur. The pre-zeroing occurs
whenever the CPU is idle but the number of pages the system pre-zeros is
limited in order to avoid blowing away the memory caches. This is an
excellent example of adding complexity to the VM system in order to
optimize the critical path.
Page Table Optimizations
The page table optimizations make up the most contentious part of
the FreeBSD VM design and they have shown some strain with the advent of
serious use of mmap(). I think this is actually a
feature of most BSDs though I am not sure when it was first introduced.
There are two major optimizations. The first is that hardware page
tables do not contain persistent state but instead can be thrown away at
any time with only a minor amount of management overhead. The second is
that every active page table entry in the system has a governing
pv_entry structure which is tied into the
vm_page structure. FreeBSD can simply iterate
through those mappings that are known to exist while Linux must check
all page tables that might contain a specific
mapping to see if it does, which can achieve O(n^2) overhead in certain
situations. It is because of this that FreeBSD tends to make better
choices on which pages to reuse or swap when memory is stressed, giving
it better performance under load. However, FreeBSD requires kernel
tuning to accommodate large-shared-address-space situations such as
those that can occur in a news system because it may run out of
pv_entry structures.
Both Linux and FreeBSD need work in this area. FreeBSD is trying to
maximize the advantage of a potentially sparse active-mapping model (not
all processes need to map all pages of a shared library, for example),
whereas Linux is trying to simplify its algorithms. FreeBSD generally
has the performance advantage here at the cost of wasting a little extra
memory, but FreeBSD breaks down in the case where a large file is
massively shared across hundreds of processes. Linux, on the other hand,
breaks down in the case where many processes are sparsely-mapping the
same shared library and also runs non-optimally when trying to determine
whether a page can be reused or not.
Page Coloring
We will end with the page coloring optimizations. Page coloring is a
performance optimization designed to ensure that accesses to contiguous
pages in virtual memory make the best use of the processor cache. In
ancient times (i.e. 10+ years ago) processor caches tended to map
virtual memory rather than physical memory. This led to a huge number of
problems including having to clear the cache on every context switch in
some cases, and problems with data aliasing in the cache. Modern
processor caches map physical memory precisely to solve those problems.
This means that two side-by-side pages in a process's address space may
not correspond to two side-by-side pages in the cache. In fact, if you
are not careful, side-by-side pages in virtual memory could wind up using
the same page in the processor cache—leading to cacheable data
being thrown away prematurely and reducing CPU performance. This is true
even with multi-way set-associative caches (though the effect is
mitigated somewhat).
FreeBSD's memory allocation code implements page coloring
optimizations, which means that the memory allocation code will attempt
to locate free pages that are contiguous from the point of view of the
cache. For example, if page 16 of physical memory is assigned to page 0
of a process's virtual memory and the cache can hold 4 pages, the page
coloring code will not assign page 20 of physical memory to page 1 of a
process's virtual memory. It would, instead, assign page 21 of physical
memory. The page coloring code attempts to avoid assigning page 20
because this maps over the same cache memory as page 16 and would result
in non-optimal caching. This code adds a significant amount of
complexity to the VM memory allocation subsystem as you can well
imagine, but the result is well worth the effort. Page coloring makes VM
memory as deterministic as physical memory in regards to cache
performance.
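The arithmetic behind the example above is simple modular math. The following toy C fragment assumes, as the example does, a direct-mapped cache that holds exactly 4 pages; the real allocator derives the number of colors from the actual cache geometry.

#define CACHE_PAGES 4   /* cache holds 4 pages, so there are 4 colors */

/* The cache color of a physical page number. */
static int
page_color(unsigned long physpage)
{
    return ((int)(physpage % CACHE_PAGES));
}

/*
 * Does physical page `phys` have the right color for virtual page
 * `virt`?  Page 16 is color 0, so virtual page 1 wants color 1:
 * physical page 21 qualifies, physical page 20 does not.
 */
static int
color_ok(unsigned long virt, unsigned long phys)
{
    return (page_color(phys) == (int)(virt % CACHE_PAGES));
}

With these definitions, color_ok(1, 21) is true and color_ok(1, 20) is false, matching the example.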
Conclusion
Virtual memory in modern operating systems must address a number of
different issues efficiently and for many different usage patterns. The
modular and algorithmic approach that BSD has historically taken allows
us to study and understand the current implementation as well as
relatively cleanly replace large sections of the code. There have been a
number of improvements to the FreeBSD VM system in the last several
years, and work is ongoing.
Bonus QA session by Allen Briggs
briggs@ninthwonder.com
What is the interleaving algorithm
that you
refer to in your listing of the ills of the FreeBSD 3.X swap
arrangements?
FreeBSD uses a fixed swap interleave which defaults to 4. This
means that FreeBSD reserves space for four swap areas even if you
only have one, two, or three. Since swap is interleaved the linear
address space representing the four swap areas
will be
fragmented if you do not actually have four swap areas. For
example, if you have two swap areas, A and B, FreeBSD's address
space representation for the swap space will be interleaved in
blocks of 16 pages:
A B C D A B C D A B C D A B C D
FreeBSD 3.X uses a sequential list of free
regions
approach to accounting for the free swap areas.
The idea is that large blocks of free linear space can be
represented with a single list node
(kern/subr_rlist.c). But due to the
fragmentation the sequential list winds up being insanely
fragmented. In the above example, completely unused swap will
have A and B shown as free and C and D shown as all allocated.
Each A-B sequence requires its own list node because C and D are
holes, so the node cannot be combined with the next A-B
sequence.
Why do we interleave our swap space instead of just tacking swap
areas onto the end and do something fancier? Because it is a whole
lot easier to allocate linear swaths of an address space and have
the result automatically be interleaved across multiple disks than
it is to try to put that sophistication elsewhere.
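The appeal of this approach is that the mapping from a linear swap block to a device is pure arithmetic. The following C fragment sketches that mapping under the constants used in the example (4 areas, 16-page stripes); the names are illustrative rather than taken from the kernel.

#define NSWDEV       4          /* fixed interleave factor */
#define STRIPE_PAGES 16         /* pages per stripe */

/* Which swap area does linear swap page `blk` live on? */
static int
swap_device(unsigned long blk)
{
    return ((int)((blk / STRIPE_PAGES) % NSWDEV));
}

/* Offset of that page within its swap area. */
static unsigned long
swap_offset(unsigned long blk)
{
    unsigned long stripe = blk / (STRIPE_PAGES * NSWDEV);

    return (stripe * STRIPE_PAGES + blk % STRIPE_PAGES);
}

With only two real areas configured, every block that swap_device() maps to areas 2 or 3 (C and D above) is a hole, which is exactly the fragmentation being described.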
The fragmentation causes other problems. Being a linear list
under 3.X, and having such a huge amount of inherent
fragmentation, allocating and freeing swap winds up being an O(N)
algorithm instead of an O(1) algorithm. Combine that with other
factors (heavy swapping) and you start getting into O(N^2) and
O(N^3) levels of overhead, which is bad. The 3.X system may also
need to allocate KVM during a swap operation to create a new list
node which can lead to a deadlock if the system is trying to
pageout pages in a low-memory situation.
Under 4.X we do not use a sequential list. Instead we use a
radix tree and bitmaps of swap blocks rather than ranged list
nodes. We take the hit of preallocating all the bitmaps required
for the entire swap area up front but it winds up wasting less
memory due to the use of a bitmap (one bit per block) instead of a
linked list of nodes. The use of a radix tree instead of a
sequential list gives us nearly O(1) performance no matter how
fragmented the tree becomes.
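To illustrate just the one-bit-per-block part of that design, here is a deliberately simplified C fragment. It uses a linear scan where the real 4.X allocator layers a radix tree over the bitmaps to find free space in near-constant time, and it assumes 32-bit unsigned ints.

#define NBLOCKS 1024
static unsigned int swapmap[NBLOCKS / 32];  /* one bit per swap block */

/* Allocate one swap block; returns its number, or -1 if swap is full. */
static int
swap_alloc_one(void)
{
    for (int b = 0; b < NBLOCKS; b++) {
        unsigned int mask = 1u << (b % 32);

        if ((swapmap[b / 32] & mask) == 0) {
            swapmap[b / 32] |= mask;    /* mark allocated */
            return (b);
        }
    }
    return (-1);
}

static void
swap_free_one(int b)
{
    swapmap[b / 32] &= ~(1u << (b % 32));
}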
I do not get the following:
It is important to note that the FreeBSD VM system attempts
to separate clean and dirty pages for the express reason of
avoiding unnecessary flushes of dirty pages (which eats I/O
bandwidth), nor does it move pages between the various page
queues gratuitously when the memory subsystem is not being
stressed. This is why you will see some systems with very low
cache queue counts and high active queue counts when doing a
systat -vm command.
How is the separation of clean and dirty (inactive) pages
related to the situation where you see low cache queue counts and
high active queue counts in systat -vm? Do the
systat stats roll the active and dirty pages together for the
active queue count?
Yes, that is confusing. The relationship is goal versus
reality. Our goal is to
separate the pages but the reality is that if we are not in a
memory crunch, we do not really have to.
What this means is that FreeBSD will not try very hard to
separate out dirty pages (inactive queue) from clean pages (cache
queue) when the system is not being stressed, nor will it try to
deactivate pages (active queue -> inactive queue) when the system
is not being stressed, even if they are not being used.
In the &man.ls.1; / vmstat 1 example,
would not some of the page faults be data page faults (COW from
executable file to private page)? I.e., I would expect the page
faults to be some zero-fill and some program data. Or are you
implying that FreeBSD does do pre-COW for the program data?
A COW fault can be either zero-fill or program-data. The
mechanism is the same either way because the backing program-data
is almost certainly already in the cache. I am indeed lumping the
two together. FreeBSD does not pre-COW program data or zero-fill,
but it does pre-map pages that exist in its
cache.
In your section on page table optimizations, can you give a
little more detail about pv_entry and
vm_page (or should vm_page be
vm_pmap—as in 4.4, cf. pp. 180-181 of
McKusick, Bostic, Karels, Quarterman)? Specifically, what kind of
operation/reaction would require scanning the mappings?
How does Linux do in the case where FreeBSD breaks down
(sharing a large file mapping over many processes)?
A vm_page represents an (object,index#)
tuple. A pv_entry represents a hardware page
table entry (pte). If you have five processes sharing the same
physical page, and three of those processes' page tables actually
map the page, that page will be represented by a single
vm_page structure and three
pv_entry structures.
pv_entry structures only represent pages
mapped by the MMU (one pv_entry represents one
pte). This means that when we need to remove all hardware
references to a vm_page (in order to reuse the
page for something else, page it out, clear it, dirty it, and so
forth) we can simply scan the linked list of
pv_entry's associated with that
vm_page to remove or modify the pte's from
their page tables.
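Structurally, the relationship looks roughly like the following C sketch. The field names and the pmap_remove_pte() helper are simplified stand-ins for the real kernel structures and pmap interface.

struct pmap;                            /* opaque per-address-space state */
void pmap_remove_pte(struct pmap *, unsigned long); /* hypothetical helper */

struct pv_entry {
    struct pmap     *pv_pmap;           /* page table holding this pte */
    unsigned long    pv_va;             /* virtual address of the mapping */
    struct pv_entry *pv_next;           /* next mapping of the same page */
};

struct vm_page {
    struct pv_entry *pv_list;           /* all known mappings of the page */
};

/* Remove every hardware mapping of a page before reusing it. */
void
remove_all_mappings(struct vm_page *m)
{
    struct pv_entry *pv;

    for (pv = m->pv_list; pv != NULL; pv = pv->pv_next)
        pmap_remove_pte(pv->pv_pmap, pv->pv_va);
    m->pv_list = NULL;
}

In the five-process example above, the page's pv_list would hold exactly three pv_entry structures, one per page table that actually maps it.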
Under Linux there is no such linked list. In order to remove
all the hardware page table mappings for a
vm_page, Linux must index into every VM object
that might have mapped the page. For
example, if you have 50 processes all mapping the same shared
library and want to get rid of page X in that library, you need to
index into the page table for each of those 50 processes even if
only 10 of them have actually mapped the page. So Linux is
trading off the simplicity of its design against performance.
Many VM algorithms which are O(1) or O(small N) under FreeBSD wind
up being O(N), O(N^2), or worse under Linux. Since the pte's
representing a particular page in an object tend to be at the same
offset in all the page tables they are mapped in, reducing the
number of accesses into the page tables at the same pte offset
will often avoid blowing away the L1 cache line for that offset,
which can lead to better performance.
FreeBSD has added complexity (the pv_entry
scheme) in order to increase performance (to limit page table
accesses to only those pte's that need to be
modified).
But FreeBSD has a scaling problem that Linux does not in that
there are a limited number of pv_entry
structures and this causes problems when you have massive sharing
of data. In this case you may run out of
pv_entry structures even though there is plenty
of free memory available. This can be fixed easily enough by
bumping up the number of pv_entry structures in
the kernel config, but we really need to find a better way to do
it.
In regards to the memory overhead of a page table versus the
pv_entry scheme: Linux uses permanent page tables that are not
thrown away, but does not need a pv_entry for each potentially
mapped pte. FreeBSD uses throwaway page tables but adds in a
pv_entry structure for each
actually-mapped pte. I think memory utilization winds up being
about the same, giving FreeBSD an algorithmic advantage with its
ability to throw away page tables at will with very low
overhead.
Finally, in the page coloring section, it might help to have a
little more description of what you mean here. I did not quite
follow it.
Do you know how an L1 hardware memory cache works? I will
explain: Consider a machine with 16MB of main memory but only 128K
of L1 cache. Generally the way this cache works is that each 128K
block of main memory uses the same 128K of
cache. If you access offset 0 in main memory and then offset
128K in main memory, you can wind up throwing away the
cached data you read from offset 0!
Now, I am simplifying things greatly. What I just described
is what is called a direct mapped
hardware memory
cache. Most modern caches are what are called
2-way-set-associative or 4-way-set-associative caches. The
set associativity allows you to access up to N different memory
regions that overlap the same cache memory without destroying the
previously cached data. But only N.
So if I have a 4-way set associative cache I can access offset
0, offset 128K, 256K and offset 384K and still be able to access
offset 0 again and have it come from the L1 cache. If I then
access offset 512K, however, one of the four previously cached
data objects will be thrown away by the cache.
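In C, the direct-mapped behavior being described is just a modulus. This toy fragment uses the 128K cache from the example and an assumed 32-byte line size; set associativity adds N ways per index, but the collision arithmetic is the same.

#define CACHE_SIZE (128 * 1024)
#define LINE_SIZE  32

/* Cache line index a physical address maps to (direct mapped). */
static unsigned long
cache_line(unsigned long addr)
{
    return ((addr % CACHE_SIZE) / LINE_SIZE);
}

/* cache_line(0) == cache_line(128 * 1024): the two accesses collide. */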
It is extremely important…
extremely important for most of a processor's
memory accesses to be able to come from the L1 cache, because the
L1 cache operates at the processor frequency. The moment you have
an L1 cache miss and have to go to the L2 cache or to main memory,
the processor will stall and potentially sit twiddling its fingers
for hundreds of instructions worth of time
waiting for a read from main memory to complete. Main memory (the
dynamic RAM you stuff into a computer) is
slow when compared to the speed of a modern
processor core.
Ok, so now onto page coloring: All modern memory caches are
what are known as physical caches. They
cache physical memory addresses, not virtual memory addresses.
This allows the cache to be left alone across a process context
switch, which is very important.
But in the Unix world you are dealing with virtual address
spaces, not physical address spaces. Any program you write will
see the virtual address space given to it. The actual
physical pages underlying that virtual
address space are not necessarily physically contiguous! In fact,
you might have two pages that are side by side in a process's
address space which wind up being at offset 0 and offset 128K in
physical memory.
A program normally assumes that two side-by-side pages will be
optimally cached. That is, that you can access data objects in
both pages without having them blow away each other's cache entry.
But this is only true if the physical pages underlying the virtual
address space are contiguous (insofar as the cache is
concerned).
This is what page coloring does. Instead of assigning
random physical pages to virtual addresses,
which may result in non-optimal cache performance, page coloring
assigns reasonably-contiguous physical pages
to virtual addresses. Thus programs can be written under the
assumption that the characteristics of the underlying hardware
cache are the same for their virtual address space as they would
be if the program had been run directly in a physical address
space.
Note that I say reasonably contiguous rather than simply
contiguous. From the point of view of a
128K direct mapped cache, the physical address 0 is the same as
the physical address 128K. So two side-by-side pages in your
virtual address space may wind up being offset 128K and offset
132K in physical memory, but could also easily be offset 128K and
offset 4K in physical memory and still retain the same cache
performance characteristics. So page-coloring does
not have to assign truly contiguous pages of
physical memory to contiguous pages of virtual memory, it just
needs to make sure it assigns contiguous pages from the point of
view of cache performance and operation.
diff --git a/en_US.ISO8859-1/articles/zip-drive/article.sgml b/en_US.ISO8859-1/articles/zip-drive/article.sgml
index 9a9871b060..5330a94857 100644
--- a/en_US.ISO8859-1/articles/zip-drive/article.sgml
+++ b/en_US.ISO8859-1/articles/zip-drive/article.sgml
@@ -1,273 +1,276 @@
%man;
+
+%freebsd;
+
]>
ZIP Drives
Jason
Bacon
acadix@execpc.com
ZIP Drive Basics
ZIP disks are high capacity, removable, magnetic disks, which can be
read or written by ZIP drives from IOMEGA corporation. ZIP disks are
similar to floppy disks, except that they are much faster, and have a
much greater capacity. While floppy disks typically hold 1.44
megabytes, ZIP disks are available in two sizes, namely 100 megabytes
and 250 megabytes. ZIP drives should not be confused with the
super-floppy, a 120 megabyte floppy drive which also handles traditional
1.44 megabyte floppies.
IOMEGA also sells a higher capacity, higher performance drive called
the JAZZ drive. JAZZ drives come in 1 gigabyte and 2 gigabyte
sizes.
ZIP drives are available as internal or external units, using one of
three interfaces:
The SCSI (Small Computer System Interface) interface is the
fastest, most sophisticated, most expandable, and most expensive
interface. The SCSI interface is used by all types of computers
from PC's to RISC workstations to minicomputers, to connect all
types of peripherals such as disk drives, tape drives, scanners, and
so on. SCSI ZIP drives may be internal or external, assuming your
host adapter has an external connector.
If you are using an external SCSI device, it is important
never to connect or disconnect it from the SCSI bus while the
computer is running. Doing so may cause file-system damage on the
disks that remain connected.
If you want maximum performance and easy setup, the SCSI
interface is the best choice. This will probably require adding a
SCSI host adapter, since most PC's (except for high-performance
servers) do not have built-in SCSI support. Each SCSI host adapter
can support either 7 or 15 SCSI devices, depending on the
model.
Each SCSI device has its own controller, and these
controllers are fairly intelligent and well
standardized, so from the operating system's
point of view, all SCSI disk drives look about the same, as do all
SCSI tape drives, etc. To support SCSI devices, the operating
system need only have a driver for the particular host adapter, and
a generic driver for each type of device, i.e. a SCSI disk driver,
SCSI tape driver, and so on. There are some SCSI devices that can
be better utilized with specialized drivers (e.g. DAT tape drives),
but they tend to work OK with the generic driver, too. It is just
that the generic drivers may not support some of the special
features.
Using a SCSI zip drive is simply a matter of determining which
device file in the /dev directory represents
the ZIP drive. This can be determined by looking at the boot
messages while FreeBSD is booting (or in
/var/log/messages after booting), where you
will see a line something like this:
da1: <IOMEGA ZIP 100 D.13> Removable Direct Access SCSI-2 Device
This means that the ZIP drive is represented by the file
/dev/da1.
The IDE (Integrated Drive Electronics) interface is a low-cost
disk drive interface used by many desktop PC's. Most IDE devices
are strictly internal.
Performance of IDE ZIP drives is comparable to SCSI ZIP drives.
(The IDE interface is not as fast as SCSI, but ZIP drive
performance is limited mainly by the mechanics of the drive, not by
the bus interface.)
The drawback of the IDE interface is the limitations it imposes.
Most IDE adapters can only support 2 devices, and IDE interfaces are
not typically designed for the long term. For example, the original
IDE interface would not support hard disks with more than 1024
cylinders, which forced a lot of people to upgrade their hardware
prematurely. If you have plans to expand your PC by adding another
disk, a tape drive, or scanner, you may want to invest in a SCSI
host adapter and a SCSI ZIP drive to avoid problems in the
future.
IDE devices in FreeBSD are prefixed with an a.
For example, an IDE hard disk might be
/dev/ad0, an IDE (ATAPI) CDROM might be
/dev/acd1, and so on.
The parallel port interface is popular for portable external
devices such as external ZIP drives and scanners, because virtually
every computer has a standard parallel port (usually used for
printers). This makes things easy for people to transfer data
between multiple computers by toting around their ZIP drive.
Performance will generally be slower than a SCSI or IDE ZIP
drive, since it is limited by the speed of the parallel port.
Parallel port speed varies considerably between various computers,
and can often be configured in the system BIOS. Some machines will
also require BIOS configuration to operate the parallel port in
bidirectional mode. (Parallel ports were originally designed only
for output to printers.)
Parallel ZIP: The vpo Driver
To use a parallel-port ZIP drive under FreeBSD, the
vpo driver must be configured into the kernel.
Parallel port ZIP drives also have a built-in SCSI controller. The vpo
driver allows the FreeBSD kernel to communicate with the ZIP drive's
SCSI controller through the parallel port.
Since the vpo driver is not a standard part of the kernel (as of
FreeBSD 3.2), you will need to rebuild the kernel to enable this device.
The process of building a kernel is outlined in detail in another
section. The following steps outline the process in brief for the
purpose of enabling the vpo driver:
Run /stand/sysinstall, and install the kernel
source code on your system.
Create a custom kernel configuration that includes the
vpo driver:
&prompt.root; cd /sys/i386/conf
&prompt.root; cp GENERIC MYKERNEL
Edit MYKERNEL, change the
ident line to MYKERNEL, and
uncomment the line describing the vpo driver.
If you have a second parallel port, you may need to copy the
section for ppc0 to create a
ppc1 device. The second parallel port usually
uses IRQ 5 and address 378. Only the IRQ is required in the config
file.
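The copied entry might end up looking something like the line below. This is only a guess at the common case: copy the flags from the ppc0 line in your own configuration file and change just the unit number and the IRQ, since the exact syntax differs between FreeBSD versions.
device ppc1 at isa? irq 5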
If your root hard disk is a SCSI disk, you might run into a
problem with probing order, which will cause the system to attempt
to use the ZIP drive as the root device. This will cause a boot
failure, unless you happen to have a FreeBSD root file-system on
your ZIP disk! In this case, you will need to wire down
the root disk, i.e. force the kernel to bind a
specific device to /dev/da0, the root SCSI
disk. It will then assign the ZIP disk to the next available SCSI
disk, e.g. /dev/da1. To wire down your SCSI hard
drive as da0, change the line
device da0
to
disk da0 at scbus0 target 0 unit 0
You may need to change the target above to match the SCSI ID of
your disk drive. You should also wire down the scbus0 entry to your
controller. For example, if you have an Adaptec 15xx controller,
you would change
controller scbus0
to
controller scbus0 at aha0
Finally, since you are creating a custom kernel configuration,
you can take the opportunity to remove all the unnecessary drivers.
This should be done with a great deal of caution, and only if you
feel confident about making modifications to your kernel
configuration. Removing unnecessary drivers will reduce the kernel
size, leaving more memory available for your applications. To
determine which drivers are not needed, go to the end of the file
/var/log/messages, and look for lines reading
"not found". Then, comment out these devices in your config file.
You can also change other options to reduce the size and increase
the speed of your kernel. Read the section on rebuilding your kernel
for more complete information.
Now it is time to compile the kernel:
&prompt.root; /usr/sbin/config MYKERNEL
&prompt.root; cd ../../compile/MYKERNEL
&prompt.root; make clean depend && make all install
After the kernel is rebuilt, you will need to reboot. Make sure the
ZIP drive is connected to the parallel port before the boot begins. You
should see the ZIP drive show up in the boot messages as device vpo0 or
vpo1, depending on which parallel port the drive is attached to. It
should also show which device file the ZIP drive has been bound to. This
will be /dev/da0 if you have no other SCSI disks in
the system, or /dev/da1 if you have a SCSI hard
disk wired down as the root device.
Mounting ZIP disks
To access the ZIP disk, you simply mount it like any other disk
device. The file-system is represented as slice 4 on the device, so for
SCSI or parallel ZIP disks, you would use:
&prompt.root; mount_msdos /dev/da1s4 /mnt
For IDE ZIP drives, use:
&prompt.root; mount_msdos /dev/ad1s4 /mnt
It will also be helpful to update /etc/fstab to
make mounting easier. Add a line like the following, edited to suit your
system:
/dev/da1s4 /zip msdos rw,noauto 0 0
and create the directory /zip.
Then, you can mount simply by typing
&prompt.root; mount /zip
and unmount by typing
&prompt.root; umount /zip
For more information on the format of
/etc/fstab, see &man.fstab.5;.
You can also create a FreeBSD file-system on the ZIP disk using
&man.newfs.8;. However, the disk will only be usable on a FreeBSD
- system, or perhaps a few other Unix clones that recognize FreeBSD
+ system, or perhaps a few other &unix; clones that recognize FreeBSD
file-systems. (Definitely not DOS or Windows.)
diff --git a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
index c6c78e0feb..14c53dc77c 100644
--- a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
+++ b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
@@ -1,957 +1,959 @@
%man;
%authors;
%misc;
+
+%freebsd;
]>
SMPng Design Document
John
Baldwin
Robert
Watson
$FreeBSD$
2002
2003
John Baldwin
Robert Watson
This document presents the current design and implementation of
the SMPng Architecture. First, the basic primitives and tools are
introduced. Next, a general architecture for the FreeBSD kernel's
synchronization and execution model is laid out. Then, locking
strategies for specific subsystems are discussed, documenting the
approaches taken to introduce fine-grained synchronization and
parallelism for each subsystem. Finally, detailed implementation
notes are provided to motivate design choices, and make the reader
aware of important implications involving the use of specific
primitives.
Introduction
This document is a work-in-progress, and will be updated to
reflect on-going design and implementation activities associated
with the SMPng Project. Many sections currently exist only in
outline form, but will be fleshed out as work proceeds. Updates or
suggestions regarding the document may be directed to the document
editors.
The goal of SMPng is to allow concurrency in the kernel.
The kernel is basically one rather large and complex program. To
make the kernel multi-threaded we use some of the same tools used
to make other programs multi-threaded. These include mutexes,
shared/exclusive locks, semaphores, and condition variables. For
the definitions of these and other SMP-related terms, please see
the section of this article.
Basic Tools and Locking Fundamentals
Atomic Instructions and Memory Barriers
There are several existing treatments of memory barriers
and atomic instructions, so this section will not include a
lot of detail. To put it simply, one can not go around reading
variables without a lock if a lock is used to protect writes
to that variable. This becomes obvious when you consider that
memory barriers simply determine relative order of memory
operations; they do not make any guarantee about timing of
memory operations. That is, a memory barrier does not force
the contents of a CPU's local cache or store buffer to flush.
Instead, the memory barrier at lock release simply ensures
that all writes to the protected data will be visible to other
CPU's or devices if the write to release the lock is visible.
The CPU is free to keep that data in its cache or store buffer
as long as it wants. However, if another CPU performs an
atomic instruction on the same datum, the first CPU must
guarantee that the updated value is made visible to the second
CPU along with any other operations that memory barriers may
require.
For example, assuming a simple model where data is
considered visible when it is in main memory (or a global
cache), when an atomic instruction is triggered on one CPU,
other CPU's store buffers and caches must flush any writes to
that same cache line along with any pending operations behind
a memory barrier.
This requires one to take special care when using an item
protected by atomic instructions. For example, in the sleep
mutex implementation, we have to use an
atomic_cmpset rather than an
atomic_set to turn on the
MTX_CONTESTED bit. The reason is that we
read the value of mtx_lock into a
variable and then make a decision based on that read.
However, the value we read may be stale, or it may change
while we are making our decision. Thus, when the
atomic_set executes, it may end up
setting the bit on a different value than the one we made the
decision on. Thus, we have to use an
atomic_cmpset to set the value only if
the value we made the decision on is up-to-date and
valid.
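A condensed sketch of that pattern in C follows. This is not the actual mtx_lock code; the constants are stand-ins for the real mutex flags, and the real routine handles more cases.

#include <sys/types.h>
#include <machine/atomic.h>

#define MTX_UNOWNED_SKETCH   0x4    /* stand-in for MTX_UNOWNED */
#define MTX_CONTESTED_SKETCH 0x2    /* stand-in for MTX_CONTESTED */

static void
set_contested(volatile u_int *lockp)
{
    u_int v;

    do {
        v = *lockp;                 /* read, then decide based on v */
        if (v == MTX_UNOWNED_SKETCH)
            return;                 /* lock was released meanwhile */
        /* the cmpset fails, and we retry, if *lockp no longer equals v */
    } while (atomic_cmpset_int(lockp, v, v | MTX_CONTESTED_SKETCH) == 0);
}

If another CPU changes the lock word between the read and the cmpset, the cmpset fails and the loop re-reads the fresh value before deciding again.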
Finally, atomic instructions only allow one item to be
updated or read. If one needs to atomically update several
items, then a lock must be used instead. For example, if two
counters must be read and have values that are consistent
relative to each other, then those counters must be protected
by a lock rather than by separate atomic instructions.
Read Locks versus Write Locks
Read locks do not need to be as strong as write locks.
Both types of locks need to ensure that the data they are
accessing is not stale. However, only write access requires
exclusive access. Multiple threads can safely read a value.
Using different types of locks for reads and writes can be
implemented in a number of ways.
First, sx locks can be used in this manner by using an
exclusive lock when writing and a shared lock when reading.
This method is quite straightforward.
A second method is a bit more obscure. You can protect a
datum with multiple locks. Then for reading that data you
simply need to have a read lock of one of the locks. However,
to write to the data, you need to have a write lock of all of
the locks. This can make writing rather expensive but can be
useful when data is accessed in various ways. For example,
the parent process pointer is protected by both the
proctree_lock sx lock and the per-process mutex. Sometimes the
proc lock is easier, as we may only be checking the parent of a
process that we already have locked. However, other places such
as inferior need to walk the tree of
processes via parent pointers, and locking each process along
the way would be prohibitive; it would also make it painful to
guarantee that the condition being checked remains valid for
both the check and the actions taken as a result of it.
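The following C sketch shows the shape of the two-lock method with a pair of sx locks protecting a hypothetical pointer; the names are invented and lock initialization is omitted.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/sx.h>

struct proc;

struct sx lock_a, lock_b;
struct proc *parent_ptr;            /* datum protected by both locks */

struct proc *
read_parent(void)
{
    struct proc *p;

    sx_slock(&lock_a);              /* either lock suffices to read */
    p = parent_ptr;
    sx_sunlock(&lock_a);
    return (p);
}

void
write_parent(struct proc *newp)
{
    sx_xlock(&lock_a);              /* writers must hold both locks, */
    sx_xlock(&lock_b);              /* always in the same order */
    parent_ptr = newp;
    sx_xunlock(&lock_b);
    sx_xunlock(&lock_a);
}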
Locking Conditions and Results
If you need a lock to check the state of a variable so
that you can take an action based on the state you read, you
can not just hold the lock while reading the variable and then
drop the lock before you act on the value you read. Once you
drop the lock, the variable can change rendering your decision
invalid. Thus, you must hold the lock both while reading the
variable and while performing the action as a result of the
test.
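As a C sketch, with a hypothetical mutex, flag, and action:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

extern struct mtx foo_mtx;          /* hypothetical lock and datum */
extern int foo_flag;
void do_action(void);               /* hypothetical action */

void
act_on_flag(void)
{
    mtx_lock(&foo_mtx);
    if (foo_flag) {
        /*
         * Dropping the lock here and re-acquiring it to act would
         * let another thread clear foo_flag in the window, so the
         * action must happen while the lock is still held.
         */
        do_action();
    }
    mtx_unlock(&foo_mtx);
}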
General Architecture and Design
Interrupt Handling
- Following the pattern of several other multi-threaded Unix
+ Following the pattern of several other multi-threaded &unix;
kernels, FreeBSD deals with interrupt handlers by giving them
their own thread context. Providing a context for interrupt
handlers allows them to block on locks. To help avoid
latency, however, interrupt threads run at real-time kernel
priority. Thus, interrupt handlers should not execute for very
long to avoid starving other kernel threads. In addition,
since multiple handlers may share an interrupt thread,
interrupt handlers should not sleep or use a sleepable lock to
avoid starving another interrupt handler.
The interrupt threads currently in FreeBSD are referred to
as heavyweight interrupt threads. They are called this
because switching to an interrupt thread involves a full
context switch. In the initial implementation, the kernel was
not preemptive and thus interrupts that interrupted a kernel
thread would have to wait until the kernel thread blocked or
returned to userland before they would have an opportunity to
run.
To deal with the latency problems, the kernel in FreeBSD
has been made preemptive. Currently, we only preempt a kernel
thread when we release a sleep mutex or when an interrupt
comes in. However, the plan is to make the FreeBSD kernel
fully preemptive as described below.
Not all interrupt handlers execute in a thread context.
Instead, some handlers execute directly in primary interrupt
context. These interrupt handlers are currently misnamed
fast
interrupt handlers since the
INTR_FAST flag used in earlier versions
of the kernel is used to mark these handlers. The only
interrupts which currently use these types of interrupt
handlers are clock interrupts and serial I/O device
interrupts. Since these handlers do not have their own
context, they may not acquire blocking locks and thus may only
use spin mutexes.
Finally, there is one optional optimization that can be
added in MD code called lightweight context switches. Since
an interrupt thread executes in a kernel context, it can
borrow the vmspace of any process. Thus, in a lightweight
context switch, the switch to the interrupt thread does not
switch vmspaces but borrows the vmspace of the interrupted
thread. In order to ensure that the vmspace of the
interrupted thread does not disappear out from under us, the
interrupted thread is not allowed to execute until the
interrupt thread is no longer borrowing its vmspace. This can
happen when the interrupt thread either blocks or finishes.
If an interrupt thread blocks, then it will use its own
context when it is made runnable again. Thus, it can release
the interrupted thread.
The downside of this optimization is that it is very
machine specific and complex, and thus only worth the effort if
there is a large performance improvement. At this point it is
probably too early to tell, and in fact the optimization will
probably hurt performance, as almost all interrupt handlers will
immediately block on Giant and require a thread fix-up when they
block.
Also, an alternative method of interrupt handling has been
proposed by Mike Smith that works like so:
Each interrupt handler has two parts: a predicate
which runs in primary interrupt context and a handler
which runs in its own thread context.
If an interrupt handler has a predicate, then when an
interrupt is triggered, the predicate is run. If the
predicate returns true then the interrupt is assumed to be
fully handled and the kernel returns from the interrupt.
If the predicate returns false or there is no predicate,
then the threaded handler is scheduled to run.
Fitting light weight context switches into this scheme
might prove rather complicated. Since we may want to change
to this scheme at some point in the future, it is probably
best to defer work on light weight context switches until we
have settled on the final interrupt handling architecture and
determined how light weight context switches might or might
not fit into it.
Kernel Preemption and Critical Sections
Kernel Preemption in a Nutshell
Kernel preemption is fairly simple. The basic idea is
that a CPU should always be doing the highest priority work
available. Well, that is the ideal at least. There are a
couple of cases where the expense of achieving the ideal is
not worth being perfect.
Implementing full kernel preemption is very
straightforward: when you schedule a thread to be executed
by putting it on a runqueue, you check to see if its
priority is higher than that of the currently executing thread. If
so, you initiate a context switch to that thread.
While locks can protect most data in the case of a
preemption, not all of the kernel is preemption safe. For
example, if a thread holding a spin mutex is preempted and the
new thread attempts to grab the same spin mutex, the new
thread may spin forever as the interrupted thread may never
get a chance to execute. Also, some code such as the code
to assign an address space number for a process during
exec() on the Alpha needs to not be preempted as it supports
the actual context switch code. Preemption is disabled for
these code sections by using a critical section.
Critical Sections
The responsibility of the critical section API is to
prevent context switches inside of a critical section. With
a fully preemptive kernel, every
setrunqueue of a thread other than the
current thread is a preemption point. One implementation is
for critical_enter to set a per-thread
flag that is cleared by its counterpart. If
setrunqueue is called with this flag
set, it does not preempt regardless of the priority of the new
thread relative to the current thread. However, since
critical sections are used in spin mutexes to prevent
context switches and multiple spin mutexes can be acquired,
the critical section API must support nesting. For this
reason the current implementation uses a nesting count
instead of a single per-thread flag.
In order to minimize latency, preemptions inside of a
critical section are deferred rather than dropped. If a
thread is made runnable that would normally be preempted to
outside of a critical section, then a per-thread flag is set
to indicate that there is a pending preemption. When the
outermost critical section is exited, the flag is checked.
If the flag is set, then the current thread is preempted to
allow the higher priority thread to run.
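In pseudo-C, the nesting count and the deferred-preemption flag described above combine roughly as follows. The two variables shown as globals would live in struct thread, one pair per thread, and preempt() is a hypothetical stand-in for the context switch.

static int crit_nest;           /* per-thread nesting count */
static int owepreempt;          /* per-thread deferred-preemption flag */

static void preempt(void);      /* hypothetical: switch threads now */

void
critical_enter(void)
{
    crit_nest++;
}

void
critical_exit(void)
{
    if (--crit_nest == 0 && owepreempt) {
        owepreempt = 0;
        preempt();              /* run the deferred preemption */
    }
}

/* Called when a higher priority thread is made runnable. */
void
maybe_preempt(void)
{
    if (crit_nest > 0)
        owepreempt = 1;         /* defer until the outermost exit */
    else
        preempt();
}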
Interrupts pose a problem with regards to spin mutexes.
If a low-level interrupt handler needs a lock, it needs to
not interrupt any code needing that lock to avoid possible
data structure corruption. Currently, providing this
mechanism is piggybacked onto critical section API by means
of the cpu_critical_enter and
cpu_critical_exit functions. Currently
this API disables and re-enables interrupts on all of
FreeBSD's current platforms. This approach may not be
purely optimal, but it is simple to understand and simple to
get right. Theoretically, this second API need only be used
for spin mutexes that are used in primary interrupt context.
However, to make the code simpler, it is used for all spin
mutexes and even all critical sections. It may be desirable
to split out the MD API from the MI API and only use it in
conjunction with the MI API in the spin mutex
implementation. If this approach is taken, then the MD API
likely would need a rename to show that it is a separate API
now.
Design Tradeoffs
As mentioned earlier, a couple of trade-offs have been
made to sacrifice cases where perfect preemption may not
always provide the best performance.
The first trade-off is that the preemption code does not
take other CPUs into account. Suppose we have two CPUs, A
and B with the priority of A's thread as 4 and the priority
of B's thread as 2. If CPU B makes a thread with priority 1
runnable, then in theory, we want CPU A to switch to the new
thread so that we will be running the two highest priority
runnable threads. However, the cost of determining which
CPU to enforce a preemption on as well as actually signaling
that CPU via an IPI along with the synchronization that
would be required would be enormous. Thus, the current code
would instead force CPU B to switch to the higher priority
thread. Note that this still puts the system in a better
position as CPU B is executing a thread of priority 1 rather
than a thread of priority 2.
The second trade-off limits immediate kernel preemption
to real-time priority kernel threads. In the simple case of
preemption defined above, a thread is always preempted
immediately (or as soon as a critical section is exited) if
a higher priority thread is made runnable. However, many
threads executing in the kernel only execute in a kernel
context for a short time before either blocking or returning
to userland. Thus, if the kernel preempts these threads to
run another non-realtime kernel thread, the kernel may
switch out the executing thread just before it is about to
sleep or execute. The cache on the CPU must then adjust to
the new thread. When the kernel returns to the interrupted
CPU, it must refill all the cache information that was lost.
In addition, two extra context switches are performed that
could be avoided if the kernel deferred the preemption until
the first thread blocked or returned to userland. Thus, by
default, the preemption code will only preempt immediately
if the higher priority thread is a real-time priority
thread.
Turning on full kernel preemption for all kernel threads
has value as a debugging aid since it exposes more race
conditions. It is especially useful on UP systems where many
races are hard to simulate otherwise. Thus, there will be a
kernel option to enable preemption for all kernel threads
that can be used for debugging purposes.
Thread Migration
Simply put, a thread migrates when it moves from one CPU
to another. In a non-preemptive kernel this can only happen
at well-defined points such as when calling
tsleep or returning to userland.
However, in the preemptive kernel, an interrupt can force a
preemption and possible migration at any time. This can have
negative effects on per-CPU data since with the exception of
curthread and curpcb the
data can change whenever you migrate. Since you can
potentially migrate at any time this renders per-CPU data
rather useless. Thus it is desirable to be able to disable
migration for sections of code that need per-CPU data to be
stable.
Critical sections currently prevent migration since they
do not allow context switches. However, this may be too strong
of a requirement to enforce in some cases since a critical
section also effectively blocks interrupt threads on the
current processor. As a result, it may be desirable to
provide an API whereby code may indicate that if the current
thread is preempted it should not migrate to another
CPU.
One possible implementation is to use a per-thread nesting
count td_pinnest along with a
td_pincpu which is updated to the current
CPU on each context switch. Each CPU has its own run queue
that holds threads pinned to that CPU. A thread is pinned
when its nesting count is greater than zero and a thread
starts off unpinned with a nesting count of zero. When a
thread is put on a runqueue, we check to see if it is pinned.
If so, we put it on the per-CPU runqueue, otherwise we put it
on the global runqueue. When
choosethread is called to retrieve the
next thread, it could either always prefer bound threads to
unbound threads or use some sort of bias when comparing
priorities. If the nesting count is only ever written to by
the thread itself and is only read by other threads when the
owning thread is not executing but while holding the
sched_lock, then
td_pinnest will not need any other locks.
The migrate_disable function would
increment the nesting count and
migrate_enable would decrement the
nesting count. Due to the locking requirements specified
above, they will only operate on the current thread and thus
would not need to handle the case of making a thread
migrateable that currently resides on a per-CPU run
queue.
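A sketch of the proposed API in C, using the field names from the text on a reduced struct thread; the rest of the structure and the run queue handling are omitted.

struct thread {
    int td_pinnest;     /* >0 means pinned to td_pincpu */
    int td_pincpu;      /* updated to the current CPU on each switch */
};

extern struct thread *curthread;

void
migrate_disable(void)
{
    curthread->td_pinnest++;    /* only the owning thread writes this */
}

void
migrate_enable(void)
{
    curthread->td_pinnest--;
}

/* Used by setrunqueue to choose the per-CPU or the global run queue. */
int
thread_pinned(struct thread *td)
{
    return (td->td_pinnest > 0);
}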
It is still debatable if this API is needed or if the
critical section API is sufficient by itself. Many of the
places that need to prevent migration also need to prevent
preemption as well, and in those places a critical section
must be used regardless.
Callouts
The timeout() kernel facility permits
kernel services to register functions for execution as part
of the softclock() software interrupt.
Events are scheduled based on a desired number of clock
ticks, and callbacks to the consumer-provided function
will occur at approximately the right time.
The global list of pending timeout events is protected
by a global spin mutex, callout_lock;
all access to the timeout list must be performed with this
mutex held. When softclock() is
woken up, it scans the list of pending timeouts for those
that should fire. In order to avoid lock order reversal,
the softclock thread will release the
callout_lock mutex when invoking the
provided timeout() callback function.
If the CALLOUT_MPSAFE flag was not set
during registration, then Giant will be grabbed before
invoking the callout, and then released afterwards. The
callout_lock mutex will be re-grabbed
before proceeding. The softclock()
code is careful to leave the list in a consistent state
while releasing the mutex. If DIAGNOSTIC
is enabled, then the time taken to execute each function is
measured, and a warning generated if it exceeds a
threshold.
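A skeleton of that locking dance in C might look as follows; the callout structure is reduced to the fields the sketch needs, and list management and the DIAGNOSTIC timing are omitted.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

extern struct mtx callout_lock;     /* the global spin mutex from the text */
extern struct mtx Giant;

struct callout_sketch {
    void   (*c_func)(void *);       /* consumer-provided callback */
    void    *c_arg;
    struct callout_sketch *c_next;
    int      c_mpsafe;              /* was CALLOUT_MPSAFE set? */
};

void
softclock_sketch(struct callout_sketch *expired)
{
    struct callout_sketch *c;

    mtx_lock_spin(&callout_lock);
    while ((c = expired) != NULL) {
        expired = c->c_next;            /* leave the list consistent first */
        mtx_unlock_spin(&callout_lock); /* avoid lock order reversal */
        if (!c->c_mpsafe)
            mtx_lock(&Giant);           /* no CALLOUT_MPSAFE: run under Giant */
        c->c_func(c->c_arg);
        if (!c->c_mpsafe)
            mtx_unlock(&Giant);
        mtx_lock_spin(&callout_lock);   /* re-grab before continuing */
    }
    mtx_unlock_spin(&callout_lock);
}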
Specific Locking Strategies
Credentials
struct ucred is the kernel's
internal credential structure, and is generally used as the
basis for process-driven access control within the kernel.
BSD-derived systems use a copy-on-write
model for credential
data: multiple references may exist for a credential structure,
and when a change needs to be made, the structure is duplicated,
modified, and then the reference replaced. Due to widespread
caching of the credential to implement access control on open,
this results in substantial memory savings. With a move to
fine-grained SMP, this model also saves substantially on
locking operations by requiring that modification only occur
on an unshared credential, avoiding the need for explicit
synchronization when consuming a known-shared
credential.
Credential structures with a single reference are
considered mutable; shared credential structures must not be
modified or a race condition is risked. A mutex,
cr_mtxp protects the reference
count of struct ucred so as to
maintain consistency. Any use of the structure requires a
valid reference for the duration of the use, or the structure
may be released out from under the illegitimate
consumer.
The struct ucred mutex is a leaf
mutex, and for performance reasons, is implemented via a mutex
pool.
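The reference count manipulation described above amounts to the following C sketch; the structure is reduced here to the two relevant fields, and the real crhold()/crfree() routines do more bookkeeping. The free_ucred() helper is hypothetical.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

struct ucred_sketch {
    int         cr_ref;         /* reference count */
    struct mtx *cr_mtxp;        /* pool mutex protecting cr_ref */
};

void free_ucred(struct ucred_sketch *);     /* hypothetical release helper */

struct ucred_sketch *
crhold_sketch(struct ucred_sketch *cr)
{
    mtx_lock(cr->cr_mtxp);
    cr->cr_ref++;
    mtx_unlock(cr->cr_mtxp);
    return (cr);
}

void
crfree_sketch(struct ucred_sketch *cr)
{
    int last;

    mtx_lock(cr->cr_mtxp);
    last = (--cr->cr_ref == 0);
    mtx_unlock(cr->cr_mtxp);
    if (last)
        free_ucred(cr);         /* no references left: safe to release */
}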
Usually, credentials are used in a read-only manner for access
control decisions, and in this case td_ucred
is generally preferred because it requires no locking. When a
process' credential is updated, the proc lock
must be held across the check and update operations to avoid
races. The process credential p_ucred
must be used for check and update operations to prevent
time-of-check, time-of-use races.
If system call invocations will perform access control after
an update to the process credential, the value of
td_ucred must also be refreshed to
the current process value. This will prevent use of a stale
credential following a change. The kernel automatically
refreshes the td_ucred pointer in
the thread structure from the process
p_ucred whenever a process enters
the kernel, permitting use of a fresh credential for kernel
access control.
File Descriptors and File Descriptor Tables
Details to follow.
Jail Structures
struct prison stores
administrative details pertinent to the maintenance of jails
created using the &man.jail.2; API. This includes the
per-jail hostname, IP address, and related settings. This
structure is reference-counted since pointers to instances of
the structure are shared by many credential structures. A
single mutex, pr_mtx protects read
and write access to the reference count and all mutable
variables inside the struct jail. Some variables are set only
when the jail is created, and a valid reference to the
struct prison is sufficient to read
these values. The precise locking of each entry is documented
via comments in sys/jail.h.
MAC Framework
The TrustedBSD MAC Framework maintains data in a variety
of kernel objects, in the form of struct
label. In general, labels in kernel objects
are protected by the same lock as the remainder of the kernel
object. For example, the v_label
label in struct vnode is protected
by the vnode lock on the vnode.
In addition to labels maintained in standard kernel objects,
the MAC Framework also maintains a list of registered and
active policies. The policy list is protected by a global
mutex (mac_policy_list_lock) and a busy
count (also protected by the mutex). Since many access
control checks may occur in parallel, entry to the framework
for a read-only access to the policy list requires holding the
mutex while incrementing (and later decrementing) the busy
count. The mutex need not be held for the duration of the
MAC entry operation, since some operations, such as label operations
on file system objects, are long-lived. To modify the policy
list, such as during policy registration and de-registration,
the mutex must be held and the reference count must be zero,
to prevent modification of the list while it is in use.
A condition variable,
mac_policy_list_not_busy, is available to
threads that need to wait for the list to become unbusy, but
this condition variable must only be waited on if the caller is
holding no other locks, or a lock order violation may be
possible. The busy count, in effect, acts as a form of
shared/exclusive lock over access to the framework: the difference
is that, unlike with an sx lock, consumers waiting for the list
to become unbusy may be starved, rather than permitting lock
order problems with regards to the busy count and other locks
that may be held on entry to (or inside) the MAC Framework.
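In C, the entry/exit protocol described above reduces to roughly the following; the names mirror the text, but the fragment is illustrative and omits the framework's other state.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

extern struct mtx mac_policy_list_lock;
extern struct cv  mac_policy_list_not_busy;
static int mac_policy_list_busy;        /* the busy count from the text */

/* Read-only entry to the framework for an access control check. */
void
mac_framework_enter(void)
{
    mtx_lock(&mac_policy_list_lock);
    mac_policy_list_busy++;
    mtx_unlock(&mac_policy_list_lock);  /* not held during the check itself */
}

void
mac_framework_exit(void)
{
    mtx_lock(&mac_policy_list_lock);
    if (--mac_policy_list_busy == 0)
        cv_signal(&mac_policy_list_not_busy);
    mtx_unlock(&mac_policy_list_lock);
}

/* Exclusive access for policy (de)registration: wait until unbusy. */
void
mac_policy_list_modify_begin(void)
{
    mtx_lock(&mac_policy_list_lock);    /* caller must hold no other locks */
    while (mac_policy_list_busy > 0)
        cv_wait(&mac_policy_list_not_busy, &mac_policy_list_lock);
}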
Modules
For the module subsystem there exists a single lock that is
used to protect the shared data. This lock is a shared/exclusive
(SX) lock and has a good chance of needing to be acquired (shared
or exclusively), therefore a few macros have been
added to make access to the lock easier. These macros can be
located in sys/module.h and are quite basic
in terms of usage. The main structures protected under this lock
are the module_t structures (when shared)
and the global modulelist_t structure,
modules. One should review the related source code in
kern/kern_module.c to further understand the
locking strategy.
Newbus Device Tree
The newbus system will have one sx lock. Readers will
hold a shared (read) lock (&man.sx.slock.9;) and writers will hold
an exclusive (write) lock (&man.sx.xlock.9;). Internal functions
will not do locking at all. Externally visible ones will lock as
needed.
Those items that do not matter if the race is won or lost will
not be locked, since they tend to be read all over the place
(e.g. &man.device.get.softc.9;). There will be relatively few
changes to the newbus data structures, so a single lock should
be sufficient and not impose a performance penalty.
Pipes
...
Processes and Threads
- process hierarchy
- proc locks, references
- thread-specific copies of proc entries to freeze during system
calls, including td_ucred
- inter-process operations
- process groups and sessions
Scheduler
Lots of references to sched_lock and notes
pointing at specific primitives and related magic elsewhere in the
document.
Select and Poll
The select() and poll() functions permit threads to block
waiting on events on file descriptors--most frequently, whether
or not the file descriptors are readable or writable.
...
SIGIO
The SIGIO service permits processes to request the delivery
of a SIGIO signal to its process group when the read/write status
of specified file descriptors changes. At most one process or
process group is permitted to register for SIGIO from any given
kernel object, and that process or group is referred to as
the owner. Each object supporting SIGIO registration contains a
pointer field that is NULL if the object is not registered, or
points to a struct sigio describing
the registration. This field is protected by a global mutex,
sigio_lock. Callers to SIGIO maintenance
functions must pass in this field by reference
so that local
register copies of the field are not made when unprotected by
the lock.
One struct sigio is allocated for
each registered object associated with any process or process
group, and contains back-pointers to the object, owner, signal
information, a credential, and the general disposition of the
registration. Each process or process group contains a list of
registered struct sigio structures,
p_sigiolst for processes, and
pg_sigiolst for process groups.
These lists are protected by the process or process group
locks respectively. Most fields in each struct
sigio are constant for the duration of the
registration, with the exception of the
sio_pgsigio field which links the
struct sigio into the process or
process group list. Developers implementing new kernel
objects supporting SIGIO will, in general, want to avoid
holding structure locks while invoking SIGIO supporting
functions, such as fsetown()
or funsetown() to avoid
defining a lock order between structure locks and the global
SIGIO lock. This is generally possible through use of an
elevated reference count on the structure, such as reliance
on a file descriptor reference to a pipe during a pipe
operation.
Sysctl
The sysctl() MIB service is invoked
from both within the kernel and from userland applications
using a system call. At least two issues are raised in locking:
first, the protection of the structures maintaining the
namespace, and second, interactions with kernel variables and
functions that are accessed by the sysctl interface. Since
sysctl permits the direct export (and modification) of
kernel statistics and configuration parameters, the sysctl
mechanism must become aware of appropriate locking semantics
for those variables. Currently, sysctl makes use of a
single global sx lock to serialize use of sysctl(); however, it
is assumed to operate under Giant and other protections are not
provided. The remainder of this section speculates on locking
and semantic changes to sysctl.
- Need to change the order of operations for sysctl's that
update values from read old, copyin and copyout, write new to
copyin, lock, read old and write new, unlock, copyout (a
hypothetical handler following the new order is sketched after
these notes). Normal sysctl's that just copyout the old value
and set a new value that they copyin may still be able to follow
the old model. However, it may be cleaner to use the second
model for all of the sysctl handlers to avoid lock
operations.
- To allow for the common case, a sysctl could embed a
pointer to a mutex in the SYSCTL_FOO macros and in the struct.
This would work for most sysctl's. For values protected by sx
locks, spin mutexes, or other locking strategies besides a
single sleep mutex, SYSCTL_PROC nodes could be used to get the
locking right.
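A hypothetical handler following the copyin, lock, read old and write new, unlock, copyout order from the first note might look like this in C; every name here is invented for illustration.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>

int
sysctl_handle_locked_int(struct mtx *mp, int *valp,
    const int *user_new, int *user_oldbuf)
{
    int newv, oldv, error;

    error = copyin(user_new, &newv, sizeof(newv));      /* 1. copyin */
    if (error != 0)
        return (error);
    mtx_lock(mp);                                       /* 2. lock */
    oldv = *valp;                                       /* 3. read old... */
    *valp = newv;                                       /*    ...write new */
    mtx_unlock(mp);                                     /* 4. unlock */
    return (copyout(&oldv, user_oldbuf, sizeof(oldv))); /* 5. copyout */
}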
Taskqueue
The taskqueue's interface has two basic locks associated
with it in order to protect the related shared data. The
taskqueue_queues_mutex is meant to serve as a
lock to protect the taskqueue_queues TAILQ.
The other mutex lock associated with this system is the one in the
struct taskqueue data structure. The
use of the synchronization primitive here is to protect the
integrity of the data in the struct
taskqueue. It should be noted that there are no
separate macros to assist the user in locking down his/her own work
since these locks are most likely not going to be used outside of
kern/subr_taskqueue.c.
Implementation Notes
Details of the Mutex Implementation
- Should we require mutexes to be owned for mtx_destroy()
since we can not safely assert that they are unowned by anyone
else otherwise?
Spin Mutexes
- Use a critical section...
Sleep Mutexes
- Describe the races with contested mutexes
- Why it is safe to read mtx_lock of a contested mutex
when holding sched_lock.
- Priority propagation
Witness
- What does it do
- How does it work
Miscellaneous Topics
Interrupt Source and ICU Abstractions
- struct isrc
- pic drivers
Other Random Questions/Topics
Should we pass an interlock into
sema_wait?
- Generic turnstiles for sleep mutexes and sx locks.
- Should we have non-sleepable sx locks?
Glossary
atomic
An operation is atomic if all of its effects are visible
to other CPUs together when the proper access protocol is
followed. The degenerate case is an atomic instruction
provided directly by the machine architecture. At a higher
level, if several members of a structure are protected by a
lock, then a set of operations are atomic if they are all
performed while holding the lock without releasing the lock
in between any of the operations.
operation
block
A thread is blocked when it is waiting on a lock,
resource, or condition. Unfortunately this term is a bit
overloaded as a result.
sleep
critical section
A section of code that is not allowed to be preempted.
A critical section is entered and exited using the
&man.critical.enter.9; API.
MD
Machine dependent.
MI
memory operation
A memory operation reads and/or writes to a memory
location.
MI
Machine independent.
MD
operation
memory operation
primary interrupt context
Primary interrupt context refers to the code that runs
when an interrupt occurs. This code can either run an
interrupt handler directly or schedule an asynchronous
interrupt thread to execute the interrupt handlers for a
given interrupt source.
realtime kernel thread
A high priority kernel thread. Currently, the only
realtime priority kernel threads are interrupt threads.
thread
sleep
A thread is asleep when it is blocked on a condition
variable or a sleep queue via msleep or
tsleep.
block
sleepable lock
A sleepable lock is a lock that can be held by a thread
which is asleep. Lockmgr locks and sx locks are currently
the only sleepable locks in FreeBSD. Eventually, some sx
locks such as the allproc and proctree locks may become
non-sleepable locks.
sleep
thread
A kernel thread represented by a struct thread. Threads own
locks and hold a single execution context.