diff --git a/en_US.ISO8859-1/articles/smp/article.sgml b/en_US.ISO8859-1/articles/smp/article.sgml
index 3f6b233f60..8e264957d4 100644
--- a/en_US.ISO8859-1/articles/smp/article.sgml
+++ b/en_US.ISO8859-1/articles/smp/article.sgml
@@ -1,934 +1,933 @@
%man;
%authors;
]>
SMPng Design Document
John
Baldwin
Robert
Watson
$FreeBSD$
2002
John Baldwin
Robert Watson
This document presents the current design and implementation of
the SMPng Architecture. First, the basic primitives and tools are
introduced. Next, a general architecture for the FreeBSD kernel's
synchronization and execution model is laid out. Then, locking
strategies for specific subsystems are discussed, documenting the
approaches taken to introduce fine-grained synchronization and
parallelism for each subsystem. Finally, detailed implementation
notes are provided to motivate design choices, and make the reader
aware of important implications involving the use of specific
primitives.
Introduction
This document is a work-in-progress, and will be updated to
reflect on-going design and implementation activities associated
with the SMPng Project. Many sections currently exist only in
outline form, but will be fleshed out as work proceeds. Updates or
suggestions regarding the document may be directed to the document
editors.
The goal of SMPng is to allow concurrency in the kernel.
The kernel is basically one rather large and complex program. To
make the kernel multithreaded we use some of the same tools used
to make other programs multithreaded. These include mutexes,
- reader/writer locks, semaphores, and condition variables. For
+ shared/exclusive locks, semaphores, and condition variables. For
definitions of many of the terms, please see
.
Basic Tools and Locking Fundamentals
Atomic Instructions and Memory Barriers
There are several existing treatments of memory barriers
and atomic instructions, so this section will not include a
lot of detail. To put it simply, one cannot go around reading
variables without a lock if a lock is used to protect writes
to that variable. This becomes obvious when you consider that
memory barriers simply determine relative order of memory
operations; they do not make any guarantee about timing of
memory operations. That is, a memory barrier does not force
the contents of a CPU's local cache or store buffer to flush.
Instead, the memory barrier at lock release simply ensures
that all writes to the protected data will be visible to other
CPUs or devices if the write to release the lock is visible.
The CPU is free to keep that data in its cache or store buffer
as long as it wants. However, if another CPU performs an
atomic instruction on the same datum, the first CPU must
guarantee that the updated value is made visible to the second
CPU along with any other operations that memory barriers may
require.
For example, assuming a simple model where data is
considered visible when it is in main memory (or a global
cache), when an atomic instruction is triggered on one CPU,
other CPUs' store buffers and caches must flush any writes to
that same cache line along with any pending operations behind
a memory barrier.
This requires one to take special care when using an item
protected by atomic instructions. For example, in the sleep
mutex implementation, we have to use an
atomic_cmpset rather than an
atomic_set to turn on the
MTX_CONTESTED bit. The reason is that we
read the value of mtx_lock into a
variable and then make a decision based on that read.
However, the value we read may be stale, or it may change
while we are making our decision. Thus, when the
atomic_set executes, it may end up
setting the bit on a different value than the one we made the
decision on. Thus, we have to use an
atomic_cmpset to set the value only if
the value we made the decision on is up-to-date and
valid.
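The compare-and-set idiom described above can be sketched in userland with C11 atomics. This is an illustration only: the lock-word layout and the helper's name are hypothetical, not the real mtx_lock implementation.

```c
#include <stdatomic.h>

/* Hypothetical lock-word flag; stands in for the real MTX_CONTESTED bit. */
#define MTX_CONTESTED 0x1UL

/* Set the contested bit only if the lock word still holds the value we
 * based our decision on. Returns nonzero on success; zero means the word
 * changed underneath us and the caller must re-read and retry. */
static int
set_contested_if_unchanged(_Atomic unsigned long *lockp, unsigned long seen)
{
    return atomic_compare_exchange_strong(lockp, &seen,
        seen | MTX_CONTESTED);
}
```

A plain atomic store here would blindly overwrite whatever value the word held by the time the store executed; the compare-exchange fails instead, forcing the decision to be re-made against fresh data.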
Finally, atomic instructions only allow one item to be
updated or read. If one needs to atomically update several
items, then a lock must be used instead. For example, if two
counters must be read and have values that are consistent
relative to each other, then those counters must be protected
by a lock rather than by separate atomic instructions.
Read Locks versus Write Locks
Read locks do not need to be as strong as write locks.
Both types of locks need to ensure that the data they are
accessing is not stale. However, only write access requires
exclusive access. Multiple threads can safely read a value.
Using different types of locks for reads and writes can be
implemented in a number of ways.
First, sx locks can be used in this manner by using an
exclusive lock when writing and a shared lock when reading.
This method is quite straightforward.
A second method is a bit more obscure. You can protect a
datum with multiple locks. Then for reading that data you
simply need to have a read lock of one of the locks. However,
to write to the data, you need to have a write lock of all of
the locks. This can make writing rather expensive but can be
useful when data is accessed in various ways. For example,
the parent process pointer is protected by both the
proctree_lock sx lock and the per-process mutex. Sometimes
the proc lock is easier to use, as we may just be checking the
parent of a process that we already have locked. However,
other places such as inferior need to
walk the tree of processes via parent pointers; locking
each process along the way would be prohibitively expensive, and
it would be painful to guarantee that the condition being checked
remains valid both for the check and for the actions taken as a
result of the check.
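The two-lock scheme can be sketched as follows; pthread mutexes stand in for proctree_lock and the per-process mutex, and the names are illustrative. A reader may hold either lock, but a writer must hold both, so no reader can observe a torn update.

```c
#include <pthread.h>

static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t proc_lock = PTHREAD_MUTEX_INITIALIZER;
static int parent_id;           /* the multiply-locked datum */

static int
read_parent_with_proc_lock(void)
{
    int id;

    pthread_mutex_lock(&proc_lock);     /* either lock suffices to read */
    id = parent_id;
    pthread_mutex_unlock(&proc_lock);
    return id;
}

static void
reparent(int new_id)
{
    pthread_mutex_lock(&tree_lock);     /* a writer must hold both ... */
    pthread_mutex_lock(&proc_lock);
    parent_id = new_id;
    pthread_mutex_unlock(&proc_lock);   /* ... so no reader can race it */
    pthread_mutex_unlock(&tree_lock);
}
```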
Locking Conditions and Results
If you need a lock to check the state of a variable so
that you can take an action based on the state you read, you
can't just hold the lock while reading the variable and then
drop the lock before you act on the value you read. Once you
drop the lock, the variable can change, rendering your decision
invalid. Thus, you must hold the lock both while reading the
variable and while performing the action as a result of the
test.
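The rule above can be sketched with a pthread mutex; the state values and function name are hypothetical. The check and the action sit inside one lock/unlock pair, so the decision cannot go stale between them.

```c
#include <pthread.h>

static pthread_mutex_t state_mtx = PTHREAD_MUTEX_INITIALIZER;
static int state;               /* 0 = idle, 1 = ready, 2 = done */

/* Hold the lock across both the test and the action; dropping it in
 * between would let another thread invalidate the decision. */
static int
act_if_ready(void)
{
    int acted = 0;

    pthread_mutex_lock(&state_mtx);
    if (state == 1) {           /* the check ...                 */
        state = 2;              /* ... and the action, together  */
        acted = 1;
    }
    pthread_mutex_unlock(&state_mtx);
    return acted;
}
```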
General Architecture and Design
Interrupt Handling
Following the pattern of several other multithreaded Unix
kernels, FreeBSD deals with interrupt handlers by giving them
their own thread context. Providing a context for interrupt
handlers allows them to block on locks. To help avoid
latency, however, interrupt threads run at real-time kernel
priority. Thus, interrupt handlers should not execute for very
long to avoid starving other kernel threads. In addition,
since multiple handlers may share an interrupt thread,
interrupt handlers should not sleep or use a sleepable lock to
avoid starving another interrupt handler.
The interrupt threads currently in FreeBSD are referred to
as heavyweight interrupt threads. They are called this
because switching to an interrupt thread involves a full
context switch. In the initial implementation, the kernel was
not preemptive and thus interrupts that interrupted a kernel
thread would have to wait until the kernel thread blocked or
returned to userland before they would have an opportunity to
run.
To deal with the latency problems, the kernel in FreeBSD
has been made preemptive. Currently, we only preempt a kernel
thread when we release a sleep mutex or when an interrupt
comes in. However, the plan is to make the FreeBSD kernel
fully preemptive as described below.
Not all interrupt handlers execute in a thread context.
Instead, some handlers execute directly in primary interrupt
context. These interrupt handlers are currently misnamed
fast
interrupt handlers since the
INTR_FAST flag used in earlier versions
of the kernel is used to mark these handlers. The only
interrupts which currently use these types of interrupt
handlers are clock interrupts and serial I/O device
interrupts. Since these handlers do not have their own
context, they may not acquire blocking locks and thus may only
use spin mutexes.
Finally, there is one optional optimization that can be
added in MD code called lightweight context switches. Since
an interrupt thread executes in a kernel context, it can
borrow the vmspace of any process. Thus, in a lightweight
context switch, the switch to the interrupt thread does not
switch vmspaces but borrows the vmspace of the interrupted
thread. In order to ensure that the vmspace of the
interrupted thread doesn't disappear out from under us, the
interrupted thread is not allowed to execute until the
interrupt thread is no longer borrowing its vmspace. This can
happen when the interrupt thread either blocks or finishes.
If an interrupt thread blocks, then it will use its own
context when it is made runnable again. Thus, it can release
the interrupted thread.
The cons of this optimization are that it is very
machine specific and complex, and thus only worth the effort if
there is a large performance improvement. At this point it is
probably too early to tell; in fact, it will probably hurt
performance, as almost all interrupt handlers will immediately
block on Giant and require a thread fixup when they block.
Also, an alternative method of interrupt handling has been
proposed by Mike Smith that works like so:
Each interrupt handler has two parts: a predicate
which runs in primary interrupt context and a handler
which runs in its own thread context.
If an interrupt handler has a predicate, then when an
interrupt is triggered, the predicate is run. If the
predicate returns true then the interrupt is assumed to be
fully handled and the kernel returns from the interrupt.
If the predicate returns false or there is no predicate,
then the threaded handler is scheduled to run.
Fitting light weight context switches into this scheme
might prove rather complicated. Since we may want to change
to this scheme at some point in the future, it is probably
best to defer work on light weight context switches until we
have settled on the final interrupt handling architecture and
determined how light weight context switches might or might
not fit into it.
Kernel Preemption and Critical Sections
Kernel Preemption in a Nutshell
Kernel preemption is fairly simple. The basic idea is
that a CPU should always be doing the highest priority work
available. Well, that is the ideal at least. There are a
couple of cases where the expense of achieving the ideal is
not worth being perfect.
Implementing full kernel preemption is very
straightforward: when you schedule a thread to be executed
by putting it on a runqueue, you check to see if its
priority is higher than that of the currently executing thread.
If so, you initiate a context switch to that thread.
While locks can protect most data in the case of a
preemption, not all of the kernel is preemption safe. For
example, if a thread holding a spin mutex is preempted and the
new thread attempts to grab the same spin mutex, the new
thread may spin forever as the interrupted thread may never
get a chance to execute. Also, some code such as the code
to assign an address space number for a process during
exec() on the Alpha needs to not be preempted as it supports
the actual context switch code. Preemption is disabled for
these code sections by using a critical section.
Critical Sections
The responsibility of the critical section API is to
prevent context switches inside of a critical section. With
a fully preemptive kernel, every
setrunqueue of a thread other than the
current thread is a preemption point. One implementation is
for critical_enter to set a per-thread
flag that is cleared by its counterpart. If
setrunqueue is called with this flag
set, it doesn't preempt regardless of the priority of the new
thread relative to the current thread. However, since
critical sections are used in spin mutexes to prevent
context switches and multiple spin mutexes can be acquired,
the critical section API must support nesting. For this
reason the current implementation uses a nesting count
instead of a single per-thread flag.
In order to minimize latency, preemptions inside of a
critical section are deferred rather than dropped. If a
thread is made runnable that would normally be preempted to
outside of a critical section, then a per-thread flag is set
to indicate that there is a pending preemption. When the
outermost critical section is exited, the flag is checked.
If the flag is set, then the current thread is preempted to
allow the higher priority thread to run.
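The nesting count and deferred-preemption flag described above can be sketched as follows. The field and function names are illustrative (suffixed _sketch), and a counter stands in for an actual context switch.

```c
/* Single-thread sketch of the nesting-count scheme. */
struct thread_sketch {
    int td_critnest;    /* nesting count; 0 means preemptible */
    int td_owepreempt;  /* a preemption arrived inside a section */
};

static int preemptions;         /* counts simulated context switches */

static void
critical_enter_sketch(struct thread_sketch *td)
{
    td->td_critnest++;
}

/* Called when a higher-priority thread is made runnable. */
static void
maybe_preempt_sketch(struct thread_sketch *td)
{
    if (td->td_critnest > 0)
        td->td_owepreempt = 1;  /* defer the preemption, don't drop it */
    else
        preemptions++;          /* preempt immediately */
}

static void
critical_exit_sketch(struct thread_sketch *td)
{
    if (--td->td_critnest == 0 && td->td_owepreempt) {
        td->td_owepreempt = 0;
        preemptions++;          /* run the deferred preemption now */
    }
}
```

Note that only the outermost exit performs the deferred switch, which is exactly why a simple boolean flag is insufficient when spin mutexes nest.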
Interrupts pose a problem with regards to spin mutexes.
If a low-level interrupt handler needs a lock, it must not
interrupt any code holding that lock, to avoid possible
data structure corruption. Currently, providing this
mechanism is piggybacked onto critical section API by means
of the cpu_critical_enter and
cpu_critical_exit functions. Currently
this API disables and reenables interrupts on all of
FreeBSD's current platforms. This approach may not be
purely optimal, but it is simple to understand and simple to
get right. Theoretically, this second API need only be used
for spin mutexes that are used in primary interrupt context.
However, to make the code simpler, it is used for all spin
mutexes and even all critical sections. It may be desirable
to split out the MD API from the MI API and only use it in
conjunction with the MI API in the spin mutex
implementation. If this approach is taken, then the MD API
likely would need a rename to show that it is a separate API
now.
Design Tradeoffs
As mentioned earlier, a couple of tradeoffs have been
made, sacrificing perfection in cases where ideal preemption may
not provide the best performance.
The first tradeoff is that the preemption code does not
take other CPUs into account. Suppose we have two CPUs, A
and B with the priority of A's thread as 4 and the priority
of B's thread as 2. If CPU B makes a thread with priority 1
runnable, then in theory, we want CPU A to switch to the new
thread so that we will be running the two highest priority
runnable threads. However, the cost of determining which
CPU to enforce a preemption on as well as actually signaling
that CPU via an IPI along with the synchronization that
would be required would be enormous. Thus, the current code
would instead force CPU B to switch to the higher priority
thread. Note that this still puts the system in a better
position as CPU B is executing a thread of priority 1 rather
than a thread of priority 2.
The second tradeoff limits immediate kernel preemption
to real-time priority kernel threads. In the simple case of
preemption defined above, a thread is always preempted
immediately (or as soon as a critical section is exited) if
a higher priority thread is made runnable. However, many
threads executing in the kernel only execute in a kernel
context for a short time before either blocking or returning
to userland. Thus, if the kernel preempts these threads to
run another non-realtime kernel thread, the kernel may
switch out the executing thread just before it is about to
sleep or execute. The cache on the CPU must then adjust to
the new thread. When the kernel returns to the interrupted
thread, it must refill all the cache information that was lost.
In addition, two extra context switches are performed that
could be avoided if the kernel deferred the preemption until
the first thread blocked or returned to userland. Thus, by
default, the preemption code will only preempt immediately
if the higher priority thread is a real-time priority
thread.
Turning on full kernel preemption for all kernel threads
has value as a debugging aid since it exposes more race
conditions. It is especially useful on UP systems where many
races are hard to simulate otherwise. Thus, there will be a
kernel option to enable preemption for all kernel threads
that can be used for debugging purposes.
Thread Migration
Simply put, a thread migrates when it moves from one CPU
to another. In a non-preemptive kernel this can only happen
at well-defined points such as when calling
tsleep or returning to userland.
However, in the preemptive kernel, an interrupt can force a
preemption and possible migration at any time. This can have
negative effects on per-CPU data since with the exception of
curthread and curpcb the
data can change whenever you migrate. Since you can
potentially migrate at any time this renders per-CPU data
rather useless. Thus it is desirable to be able to disable
migration for sections of code that need per-CPU data to be
stable.
Critical sections currently prevent migration since they
don't allow context switches. However, this may be too strong
of a requirement to enforce in some cases since a critical
section also effectively blocks interrupt threads on the
current processor. As a result, it may be desirable to
provide an API whereby code may indicate that if the current
thread is preempted it should not migrate to another
CPU.
One possible implementation is to use a per-thread nesting
count td_pinnest along with a
td_pincpu which is updated to the current
CPU on each context switch. Each CPU has its own run queue
that holds threads pinned to that CPU. A thread is pinned
when its nesting count is greater than zero and a thread
starts off unpinned with a nesting count of zero. When a
thread is put on a runqueue, we check to see if it is pinned.
If so, we put it on the per-CPU runqueue, otherwise we put it
on the global runqueue. When
choosethread is called to retrieve the
next thread, it could either always prefer bound threads to
unbound threads or use some sort of bias when comparing
priorities. If the nesting count is only ever written to by
the thread itself and is only read by other threads when the
owning thread is not executing but while holding the
sched_lock, then
td_pinnest will not need any other locks.
The migrate_disable function would
increment the nesting count and
migrate_enable would decrement the
nesting count. Due to the locking requirements specified
above, they will only operate on the current thread and thus
would not need to handle the case of making a thread
migratable that currently resides on a per-CPU run
queue.
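The proposed pinning API can be sketched as below. All names follow the hypothetical ones in the text (td_pinnest, td_pincpu); note that in the scheme described, td_pincpu is refreshed at context-switch time, whereas this sketch records it at pin time purely for illustration.

```c
/* Sketch of the proposed per-thread pinning API. */
struct pin_sketch {
    int td_pinnest;     /* > 0 means pinned to td_pincpu */
    int td_pincpu;      /* CPU the thread is pinned to   */
};

static void
migrate_disable_sketch(struct pin_sketch *td, int curcpu)
{
    if (td->td_pinnest++ == 0)
        td->td_pincpu = curcpu;
}

static void
migrate_enable_sketch(struct pin_sketch *td)
{
    td->td_pinnest--;
}

/* setrunqueue-time decision: the per-CPU queue if pinned, the global
 * queue (represented here by -1) otherwise. */
static int
runqueue_for_sketch(const struct pin_sketch *td)
{
    return (td->td_pinnest > 0) ? td->td_pincpu : -1;
}
```

Because only the owning thread writes the nesting count, these operations need no locking of their own, matching the requirement stated above.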
It is still debatable if this API is needed or if the
critical section API is sufficient by itself. Many of the
places that need to prevent migration also need to prevent
preemption as well, and in those places a critical section
must be used regardless.
Callouts
The timeout() kernel facility permits
kernel services to register functions for execution as part
of the softclock() software interrupt.
Events are scheduled based on a desired number of clock
ticks, and callbacks to the consumer-provided function
will occur at approximately the right time.
The global list of pending timeout events is protected
by a global spin mutex, callout_lock;
all access to the timeout list must be performed with this
mutex held. When softclock() is
woken up, it scans the list of pending timeouts for those
that should fire. In order to avoid lock order reversal,
the softclock thread will release the
callout_lock mutex when invoking the
provided timeout() callback function.
If the CALLOUT_MPSAFE flag was not set
during registration, then Giant will be grabbed before
invoking the callout, and then released afterwards. The
callout_lock mutex will be re-grabbed
before proceeding. The softclock()
code is careful to leave the list in a consistent state
while releasing the mutex. If DIAGNOSTIC
is enabled, then the time taken to execute each function is
measured, and a warning generated if it exceeds a
threshold.
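The drop-the-lock-around-the-callback pattern described above can be sketched in userland; a pthread mutex stands in for the callout_lock spin mutex, and all names are illustrative. The key detail is that each entry is unlinked while the lock is held, so the list stays consistent across the unlocked callback.

```c
#include <pthread.h>
#include <stddef.h>

struct callout_sketch {
    struct callout_sketch *next;
    void (*func)(void *);
    void *arg;
};

static pthread_mutex_t callout_lock_sketch = PTHREAD_MUTEX_INITIALIZER;
static struct callout_sketch *callout_list;

static int fired_sketch;        /* sum of callback arguments, for demo */

static void
bump_sketch(void *arg)
{
    fired_sketch += (int)(size_t)arg;
}

static void
callout_register_sketch(struct callout_sketch *c, void (*fn)(void *),
    void *arg)
{
    pthread_mutex_lock(&callout_lock_sketch);
    c->func = fn;
    c->arg = arg;
    c->next = callout_list;
    callout_list = c;
    pthread_mutex_unlock(&callout_lock_sketch);
}

/* Fire every pending callout, leaving the list consistent (the entry is
 * unlinked) before the lock is dropped around each callback. */
static void
softclock_sketch(void)
{
    struct callout_sketch *c;

    pthread_mutex_lock(&callout_lock_sketch);
    while (callout_list != NULL) {
        c = callout_list;
        callout_list = c->next;              /* unlink while locked */
        pthread_mutex_unlock(&callout_lock_sketch);
        c->func(c->arg);                     /* invoke unlocked */
        pthread_mutex_lock(&callout_lock_sketch);
    }
    pthread_mutex_unlock(&callout_lock_sketch);
}
```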
Specific Locking Strategies
Credentials
struct ucred is the system
internal credential structure, and is generally used as the
basis for process-driven access control. BSD-derived systems
use a "copy-on-write" model for credential data: multiple
references may exist for a credential structure, and when a
change needs to be made, the structure is duplicated,
modified, and then the reference replaced. Due to wide-spread
caching of the credential to implement access control on open,
this results in substantial memory savings. With a move to
fine-grained SMP, this model also saves substantially on
locking operations by requiring that modification only occur
on an unshared credential, avoiding the need for explicit
synchronization when consuming a known-shared
credential.
Credential structures with a single reference are
considered mutable; shared credential structures must not be
modified or a race condition is risked. A mutex,
cr_mtxp protects the reference
count of the struct ucred so as to
maintain consistency. Any use of the structure requires a
valid reference for the duration of the use, or the structure
may be released out from under the illegitimate
consumer.
The struct ucred mutex is a leaf
mutex, and for performance reasons, is implemented via a mutex
pool.
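The copy-on-write credential rule can be sketched as follows. A per-structure pthread mutex stands in for the pool mutex, and the names (suffixed _sketch) are illustrative, not the kernel's crhold/crfree/crcopy implementations.

```c
#include <pthread.h>
#include <stdlib.h>

struct ucred_sketch {
    pthread_mutex_t mtx;    /* protects cr_ref only */
    int cr_ref;
    int cr_uid;             /* example payload; no lock once shared */
};

static void
crhold_sketch(struct ucred_sketch *cr)
{
    pthread_mutex_lock(&cr->mtx);
    cr->cr_ref++;
    pthread_mutex_unlock(&cr->mtx);
}

/* Returns 1 when the last reference was dropped and the structure freed. */
static int
crfree_sketch(struct ucred_sketch *cr)
{
    int last;

    pthread_mutex_lock(&cr->mtx);
    last = (--cr->cr_ref == 0);
    pthread_mutex_unlock(&cr->mtx);
    if (last) {
        pthread_mutex_destroy(&cr->mtx);
        free(cr);
    }
    return last;
}

/* Modification is only legal on an unshared credential: a caller with a
 * shared one duplicates it, changes the copy, then swaps references. */
static struct ucred_sketch *
crcopy_sketch(const struct ucred_sketch *old)
{
    struct ucred_sketch *cr = malloc(sizeof(*cr));

    pthread_mutex_init(&cr->mtx, NULL);
    cr->cr_ref = 1;
    cr->cr_uid = old->cr_uid;
    return cr;
}
```

Only the reference count is ever written under the mutex; the payload needs no locking because, by the copy-on-write rule, it is immutable while shared.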
File Descriptors and File Descriptor Tables
Details to follow.
Jail Structures
struct prison stores
administrative details pertinent to the maintenance of jails
created using the &man.jail.2; API. This includes the
per-jail hostname, IP address, and related settings. This
structure is reference-counted since pointers to instances of
the structure are shared by many credential structures. A
single mutex, pr_mtx protects read
and write access to the reference count and all mutable
variables inside the struct jail. Some variables are set only
when the jail is created, and a valid reference to the
struct prison is sufficient to read
these values. The precise locking of each entry is documented
via comments in jail.h.
MAC Framework
The TrustedBSD MAC Framework maintains data in a variety
of kernel objects, in the form of struct
label. In general, labels in kernel objects
are protected by the same lock as the remainder of the kernel
object. For example, the v_label
label in struct vnode is protected
by the vnode lock on the vnode.
In addition to labels maintained in standard kernel objects,
the MAC Framework also maintains a list of registered and
active policies. The policy list is protected by a global
mutex (mac_policy_list_lock) and a busy
count (also protected by the mutex). Since many access
control checks may occur in parallel, entry to the framework
for a read-only access to the policy list requires holding the
mutex while incrementing (and later decrementing) the busy
count. The mutex need not be held for the duration of the
MAC entry operation--some operations, such as label operations
on file system objects--are long-lived. To modify the policy
list, such as during policy registration and deregistration,
the mutex must be held and the reference count must be zero,
to prevent modification of the list while it is in use.
A condition variable,
mac_policy_list_not_busy, is available to
threads that need to wait for the list to become unbusy, but
this condition variable must only be waited on if the caller is
holding no other locks, or a lock order violation may be
possible. The busy count, in effect, acts as a form of
- reader/writer lock over access to the framework: the difference
- is that, unlike with an sxlock, consumers waiting for the list
+ shared/exclusive lock over access to the framework: the difference
+ is that, unlike with an sx lock, consumers waiting for the list
to become unbusy may be starved, rather than permitting lock
order problems with regards to the busy count and other locks
that may be held on entry to (or inside) the MAC Framework.
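The busy-count scheme can be sketched as below. Pthread primitives stand in for the kernel mutex and condition variable, and the names are illustrative rather than the real MAC Framework symbols.

```c
#include <pthread.h>

static pthread_mutex_t policy_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t policy_not_busy = PTHREAD_COND_INITIALIZER;
static int policy_busy;

/* Reader entry: bump the busy count under the mutex, then drop the
 * mutex for the (possibly long-lived) framework operation. */
static void
mac_busy_sketch(void)
{
    pthread_mutex_lock(&policy_lock);
    policy_busy++;
    pthread_mutex_unlock(&policy_lock);
}

static void
mac_unbusy_sketch(void)
{
    pthread_mutex_lock(&policy_lock);
    if (--policy_busy == 0)
        pthread_cond_broadcast(&policy_not_busy);
    pthread_mutex_unlock(&policy_lock);
}

/* Writer side: the caller must hold no other locks, since this wait may
 * be long; the mutex stays held while the list is modified. */
static void
mac_policy_modify_begin_sketch(void)
{
    pthread_mutex_lock(&policy_lock);
    while (policy_busy != 0)
        pthread_cond_wait(&policy_not_busy, &policy_lock);
}

static void
mac_policy_modify_end_sketch(void)
{
    pthread_mutex_unlock(&policy_lock);
}
```

Unlike an sx lock, nothing here prevents a steady stream of readers from starving the writer, which is exactly the tradeoff the text describes.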
Modules
For the module subsystem there exists a single lock that is
used to protect the shared data. This lock is a shared/exclusive
(SX) lock and has a good chance of needing to be acquired (shared
or exclusive); therefore, a few macros have been
added to make access to the lock easier. These macros can be
located in sys/module.h and are quite basic
in terms of usage. The main structures protected under this lock
are the module_t structures (when shared)
and the global modulelist_t structure,
modules. One should review the related source code in
kern/kern_module.c to further understand the
locking strategy.
Newbus Device Tree
The newbus system will have one sx lock. Readers will
lock it &man.sx.slock.9; and writers will lock it
&man.sx.xlock.9;. Internal-only functions will not do locking
at all. The externally visible ones will lock as needed.
Items for which it does not matter whether a race is won or lost
will not be locked, since they tend to be read all over the place
(e.g., &man.device.get.softc.9;). There will be relatively few
changes to the newbus data structures, so a single lock should
be sufficient and not impose a performance penalty.
Pipes
...
Processes and Threads
- process hierarchy
- proc locks, references
- thread-specific copies of proc entries to freeze during system
calls, including td_ucred
- inter-process operations
- process groups and sessions
Scheduler
Lots of references to sched_lock and notes
pointing at specific primitives and related magic elsewhere in the
document.
Select and Poll
The select() and poll() functions permit threads to block
waiting on events on file descriptors--most frequently, whether
or not the file descriptors are readable or writable.
...
SIGIO
The SIGIO service permits processes to request the delivery
of a SIGIO signal to its process group when the read/write status
of specified file descriptors changes. At most one process or
process group is permitted to register for SIGIO from any given
kernel object, and that process or group is referred to as
the owner. Each object supporting SIGIO registration contains
pointer field that is NULL if the object is not registered, or
points to a struct sigio describing
the registration. This field is protected by a global mutex,
sigio_lock. Callers to SIGIO maintenance
functions must pass in this field "by reference" so that local
register copies of the field are not made when unprotected by
the lock.
One struct sigio is allocated for
each registered object associated with any process or process
group, and contains back-pointers to the object, owner, signal
information, a credential, and the general disposition of the
registration. Each process or process group contains a list of
registered struct sigio structures,
p_sigiolst for processes, and
pg_sigiolst for process groups.
These lists are protected by the process or process group
locks respectively. Most fields in each struct
sigio are constant for the duration of the
registration, with the exception of the
sio_pgsigio field which links the
struct sigio into the process or
process group list. Developers implementing new kernel
objects supporting SIGIO will, in general, want to avoid
holding structure locks while invoking SIGIO supporting
functions, such as fsetown()
or funsetown() to avoid
defining a lock order between structure locks and the global
SIGIO lock. This is generally possible through use of an
elevated reference count on the structure, such as reliance
on a file descriptor reference to a pipe during a pipe
operation.
sysctl
The sysctl() MIB service is invoked
from both within the kernel and from userland applications
using a system call. At least two issues are raised in locking:
first, the protection of the structures maintaining the
namespace, and second, interactions with kernel variables and
functions that are accessed by the sysctl interface. Since
sysctl permits the direct export (and modification) of
kernel statistics and configuration parameters, the sysctl
mechanism must become aware of appropriate locking semantics
for those variables. Currently, sysctl makes use of a
- single global sxlock to serialize use
- of sysctl(); however, it is assumed to operate under Giant
- and other protections are not provided. The remainder of
- this section speculates on locking and semantic changes
- to sysctl.
+ single global sx lock to serialize use of sysctl(); however, it
+ is assumed to operate under Giant and other protections are not
+ provided. The remainder of this section speculates on locking
+ and semantic changes to sysctl.
- Need to change the order of operations for sysctl's that
update values from read old, copyin and copyout, write new to
copyin, lock, read old and write new, unlock, copyout. Normal
sysctl's that just copyout the old value and set a new value
that they copyin may still be able to follow the old model.
However, it may be cleaner to use the second model for all of
the sysctl handlers to avoid lock operations.
- To allow for the common case, a sysctl could embed a
pointer to a mutex in the SYSCTL_FOO macros and in the struct.
This would work for most sysctls. For values protected by sx
locks, spin mutexes, or other locking strategies besides a
single sleep mutex, SYSCTL_PROC nodes could be used to get the
locking right.
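The proposed ordering in the first point above (copyin, then read-old/write-new under the lock, then copyout) can be sketched as follows; memcpy stands in for copyin()/copyout(), and all names are illustrative.

```c
#include <pthread.h>
#include <string.h>

static pthread_mutex_t var_mtx = PTHREAD_MUTEX_INITIALIZER;
static int kern_var;            /* the exported kernel variable */

static void
sysctl_update_sketch(const int *newp, int *oldp)
{
    int newval, oldval;

    memcpy(&newval, newp, sizeof(newval));  /* "copyin" before locking  */
    pthread_mutex_lock(&var_mtx);
    oldval = kern_var;                      /* read old ...             */
    kern_var = newval;                      /* ... and write new, as    */
    pthread_mutex_unlock(&var_mtx);         /* one atomic step          */
    memcpy(oldp, &oldval, sizeof(oldval));  /* "copyout" after unlock   */
}
```

Doing the user-memory copies outside the lock matters because copyin()/copyout() can fault and sleep, which is illegal while holding a non-sleepable lock.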
Taskqueue
The taskqueue's interface has two basic locks associated
with it in order to protect the related shared data. The
taskqueue_queues_mutex is meant to serve as a
lock to protect the taskqueue_queues TAILQ.
The other mutex lock associated with this system is the one in the
struct taskqueue data structure. The
use of the synchronization primitive here is to protect the
integrity of the data in the struct
taskqueue. It should be noted that there are no
separate macros to assist the user in locking down his/her own work
since these locks are most likely not going to be used outside of
kern/subr_taskqueue.c.
Implementation Notes
Details of the Mutex Implementation
- Should we require mutexes to be owned for mtx_destroy()
since we can't safely assert that they are unowned by anyone
else otherwise?
Spin Mutexes
- Use a critical section...
Sleep Mutexes
- Describe the races with contested mutexes
- Why it's safe to read mtx_lock of a contested mutex
when holding sched_lock.
- Priority propagation
Witness
- What does it do
- How does it work
Miscellaneous Topics
Interrupt Source and ICU Abstractions
- struct isrc
- pic drivers
Other Random Questions/Topics
Should we pass an interlock into
sema_wait?
- Generic turnstiles for sleep mutexes and sx locks.
- Should we have non-sleepable sx locks?
Definitions
atomic
An operation is atomic if all of its effects are visible
to other CPUs together when the proper access protocol is
followed. In the degenerate case are atomic instructions
provided directly by machine architectures. At a higher
level, if several members of a structure are protected by a
lock, then a set of operations are atomic if they are all
performed while holding the lock without releasing the lock
in between any of the operations.
operation
block
A thread is blocked when it is waiting on a lock,
resource, or condition. Unfortunately this term is a bit
overloaded as a result.
sleep
critical section
A section of code that is not allowed to be preempted.
A critical section is entered and exited using the
&man.critical.enter.9; API.
MD
Machine dependent.
MI
memory operation
A memory operation reads and/or writes to a memory
location.
MI
Machine independent.
MD
operation
memory operation
primary interrupt context
Primary interrupt context refers to the code that runs
when an interrupt occurs. This code can either run an
interrupt handler directly or schedule an asynchronous
interrupt thread to execute the interrupt handlers for a
given interrupt source.
realtime kernel thread
A high priority kernel thread. Currently, the only
realtime priority kernel threads are interrupt threads.
thread
sleep
A thread is asleep when it is blocked on a condition
variable or a sleep queue via msleep or
tsleep.
block
sleepable lock
A sleepable lock is a lock that can be held by a thread
which is asleep. Lockmgr locks and sx locks are currently
the only sleepable locks in FreeBSD. Eventually, some sx
locks such as the allproc and proctree locks may become
non-sleepable locks.
sleep
thread
A kernel thread represented by a struct thread. Threads own
locks and hold a single execution context.
diff --git a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
index 3f6b233f60..8e264957d4 100644
--- a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
+++ b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
@@ -1,934 +1,933 @@
%man;
%authors;
]>
SMPng Design Document
John
Baldwin
Robert
Watson
$FreeBSD$
2002
John Baldwin
Robert Watson
This document presents the current design and implementation of
the SMPng Architecture. First, the basic primitives and tools are
introduced. Next, a general architecture for the FreeBSD kernel's
synchronization and execution model is laid out. Then, locking
strategies for specific subsystems are discussed, documenting the
approaches taken to introduce fine-grained synchronization and
parallelism for each subsystem. Finally, detailed implementation
notes are provided to motivate design choices, and make the reader
aware of important implications involving the use of specific
primitives.
Introduction
This document is a work-in-progress, and will be updated to
reflect on-going design and implementation activities associated
with the SMPng Project. Many sections currently exist only in
outline form, but will be fleshed out as work proceeds. Updates or
suggestions regarding the document may be directed to the document
editors.
The goal of SMPng is to allow concurrency in the kernel.
The kernel is basically one rather large and complex program. To
make the kernel multithreaded we use some of the same tools used
to make other programs multithreaded. These include mutexes,
- reader/writer locks, semaphores, and condition variables. For
+ shared/exclusive locks, semaphores, and condition variables. For
definitions of many of the terms, please see
.
Basic Tools and Locking Fundamentals
Atomic Instructions and Memory Barriers
There are several existing treatments of memory barriers
and atomic instructions, so this section will not include a
lot of detail. To put it simply, one cannot go around reading
variables without a lock if a lock is used to protect writes
to that variable. This becomes obvious when you consider that
memory barriers simply determine relative order of memory
operations; they do not make any guarantee about timing of
memory operations. That is, a memory barrier does not force
the contents of a CPU's local cache or store buffer to flush.
Instead, the memory barrier at lock release simply ensures
that all writes to the protected data will be visible to other
CPUs or devices if the write to release the lock is visible.
The CPU is free to keep that data in its cache or store buffer
as long as it wants. However, if another CPU performs an
atomic instruction on the same datum, the first CPU must
guarantee that the updated value is made visible to the second
CPU along with any other operations that memory barriers may
require.
For example, assuming a simple model where data is
considered visible when it is in main memory (or a global
cache), when an atomic instruction is triggered on one CPU,
other CPUs' store buffers and caches must flush any writes to
that same cache line along with any pending operations behind
a memory barrier.
This requires one to take special care when using an item
protected by atomic instructions. For example, in the sleep
mutex implementation, we have to use an
atomic_cmpset rather than an
atomic_set to turn on the
MTX_CONTESTED bit. The reason is that we
read the value of mtx_lock into a
variable and then make a decision based on that read.
However, the value we read may be stale, or it may change
while we are making our decision. Thus, when the
atomic_set is executed, it may end up
setting the bit on a different value than the one we made the
decision on. Thus, we have to use an
atomic_cmpset to set the value only if
the value we made the decision on is up-to-date and
valid.
Finally, atomic instructions only allow one item to be
updated or read. If one needs to atomically update several
items, then a lock must be used instead. For example, if two
counters must be read and have values that are consistent
relative to each other, then those counters must be protected
by a lock rather than by separate atomic instructions.
Read Locks versus Write Locks
Read locks do not need to be as strong as write locks.
Both types of locks need to ensure that the data they are
accessing is not stale. However, only write access requires
exclusive access. Multiple threads can safely read a value.
Using different types of locks for reads and writes can be
implemented in a number of ways.
First, sx locks can be used in this manner by using an
exclusive lock when writing and a shared lock when reading.
This method is quite straightforward.
A second method is a bit more obscure. You can protect a
datum with multiple locks. Then for reading that data you
simply need to have a read lock of one of the locks. However,
to write to the data, you need to have a write lock of all of
the locks. This can make writing rather expensive but can be
useful when data is accessed in various ways. For example,
the parent process pointer is protected by both the
proctree_lock sx lock and the per-process mutex. Sometimes
the proc lock alone is easier, as when we are simply checking
who the parent of a process that we already have locked is.
However, other places such as inferior need
to walk the tree of processes via parent pointers; locking
each process would be prohibitive, and it would be painful to
guarantee that the condition being checked remains valid
for both the check and the actions taken as a result of the
check.
Locking Conditions and Results
If you need a lock to check the state of a variable so
that you can take an action based on the state you read, you
cannot just hold the lock while reading the variable and then
drop the lock before you act on the value you read. Once you
drop the lock, the variable can change rendering your decision
invalid. Thus, you must hold the lock both while reading the
variable and while performing the action as a result of the
test.
General Architecture and Design
Interrupt Handling
Following the pattern of several other multithreaded Unix
kernels, FreeBSD deals with interrupt handlers by giving them
their own thread context. Providing a context for interrupt
handlers allows them to block on locks. To help avoid
latency, however, interrupt threads run at real-time kernel
priority. Thus, interrupt handlers should not execute for very
long to avoid starving other kernel threads. In addition,
since multiple handlers may share an interrupt thread,
interrupt handlers should not sleep or use a sleepable lock to
avoid starving another interrupt handler.
The interrupt threads currently in FreeBSD are referred to
as heavyweight interrupt threads. They are called this
because switching to an interrupt thread involves a full
context switch. In the initial implementation, the kernel was
not preemptive and thus interrupts that interrupted a kernel
thread would have to wait until the kernel thread blocked or
returned to userland before they would have an opportunity to
run.
To deal with the latency problems, the kernel in FreeBSD
has been made preemptive. Currently, we only preempt a kernel
thread when we release a sleep mutex or when an interrupt
comes in. However, the plan is to make the FreeBSD kernel
fully preemptive as described below.
Not all interrupt handlers execute in a thread context.
Instead, some handlers execute directly in primary interrupt
context. These interrupt handlers are currently misnamed
fast
interrupt handlers since the
INTR_FAST flag used in earlier versions
of the kernel is used to mark these handlers. The only
interrupts which currently use these types of interrupt
handlers are clock interrupts and serial I/O device
interrupts. Since these handlers do not have their own
context, they may not acquire blocking locks and thus may only
use spin mutexes.
Finally, there is one optional optimization that can be
added in MD code called lightweight context switches. Since
an interrupt thread executes in a kernel context, it can
borrow the vmspace of any process. Thus, in a lightweight
context switch, the switch to the interrupt thread does not
switch vmspaces but borrows the vmspace of the interrupted
thread. In order to ensure that the vmspace of the
interrupted thread doesn't disappear out from under us, the
interrupted thread is not allowed to execute until the
interrupt thread is no longer borrowing its vmspace. This can
happen when the interrupt thread either blocks or finishes.
If an interrupt thread blocks, then it will use its own
context when it is made runnable again. Thus, it can release
the interrupted thread.
The cons of this optimization are that it is very
machine specific and complex and thus only worth the effort if
there is a large performance improvement. At this point it is
probably too early to tell, and in fact, will probably hurt
performance as almost all interrupt handlers will immediately
block on Giant and require a thread fixup when they block.
Also, an alternative method of interrupt handling has been
proposed by Mike Smith that works like so:
Each interrupt handler has two parts: a predicate
which runs in primary interrupt context and a handler
which runs in its own thread context.
If an interrupt handler has a predicate, then when an
interrupt is triggered, the predicate is run. If the
predicate returns true then the interrupt is assumed to be
fully handled and the kernel returns from the interrupt.
If the predicate returns false or there is no predicate,
then the threaded handler is scheduled to run.
Fitting light weight context switches into this scheme
might prove rather complicated. Since we may want to change
to this scheme at some point in the future, it is probably
best to defer work on light weight context switches until we
have settled on the final interrupt handling architecture and
determined how light weight context switches might or might
not fit into it.
Kernel Preemption and Critical Sections
Kernel Preemption in a Nutshell
Kernel preemption is fairly simple. The basic idea is
that a CPU should always be doing the highest priority work
available. Well, that is the ideal at least. There are a
couple of cases where the expense of achieving the ideal is
not worth being perfect.
Implementing full kernel preemption is very
straightforward: when you schedule a thread to be executed
by putting it on a runqueue, you check to see if its
priority is higher than that of the currently executing thread. If
so, you initiate a context switch to that thread.
While locks can protect most data in the case of a
preemption, not all of the kernel is preemption safe. For
example, if a thread holding a spin mutex is preempted and the
new thread attempts to grab the same spin mutex, the new
thread may spin forever as the interrupted thread may never
get a chance to execute. Also, some code such as the code
to assign an address space number for a process during
exec() on the Alpha needs to not be preempted as it supports
the actual context switch code. Preemption is disabled for
these code sections by using a critical section.
Critical Sections
The responsibility of the critical section API is to
prevent context switches inside of a critical section. With
a fully preemptive kernel, every
setrunqueue of a thread other than the
current thread is a preemption point. One implementation is
for critical_enter to set a per-thread
flag that is cleared by its counterpart. If
setrunqueue is called with this flag
set, it doesn't preempt regardless of the priority of the new
thread relative to the current thread. However, since
critical sections are used in spin mutexes to prevent
context switches and multiple spin mutexes can be acquired,
the critical section API must support nesting. For this
reason the current implementation uses a nesting count
instead of a single per-thread flag.
In order to minimize latency, preemptions inside of a
critical section are deferred rather than dropped. If a
thread is made runnable that would normally be preempted to
outside of a critical section, then a per-thread flag is set
to indicate that there is a pending preemption. When the
outermost critical section is exited, the flag is checked.
If the flag is set, then the current thread is preempted to
allow the higher priority thread to run.
Interrupts pose a problem with regards to spin mutexes.
If a low-level interrupt handler needs a lock, it needs to
not interrupt any code needing that lock to avoid possible
data structure corruption. Currently, providing this
mechanism is piggybacked onto critical section API by means
of the cpu_critical_enter and
cpu_critical_exit functions. Currently
this API disables and reenables interrupts on all of
FreeBSD's current platforms. This approach may not be
purely optimal, but it is simple to understand and simple to
get right. Theoretically, this second API need only be used
for spin mutexes that are used in primary interrupt context.
However, to make the code simpler, it is used for all spin
mutexes and even all critical sections. It may be desirable
to split out the MD API from the MI API and only use it in
conjunction with the MI API in the spin mutex
implementation. If this approach is taken, then the MD API
likely would need a rename to show that it is a separate API
now.
Design Tradeoffs
As mentioned earlier, a couple of tradeoffs have been
made, sacrificing perfect preemption in cases where it may
not always provide the best performance.
The first tradeoff is that the preemption code does not
take other CPUs into account. Suppose we have two CPUs, A
and B, with the priority of A's thread as 4 and the priority
of B's thread as 2. If CPU B makes a thread with priority 1
runnable, then in theory, we want CPU A to switch to the new
thread so that we will be running the two highest priority
runnable threads. However, the cost of determining which
CPU to enforce a preemption on as well as actually signaling
that CPU via an IPI along with the synchronization that
would be required would be enormous. Thus, the current code
would instead force CPU B to switch to the higher priority
thread. Note that this still puts the system in a better
position as CPU B is executing a thread of priority 1 rather
than a thread of priority 2.
The second tradeoff limits immediate kernel preemption
to real-time priority kernel threads. In the simple case of
preemption defined above, a thread is always preempted
immediately (or as soon as a critical section is exited) if
a higher priority thread is made runnable. However, many
threads executing in the kernel only execute in a kernel
context for a short time before either blocking or returning
to userland. Thus, if the kernel preempts these threads to
run another non-realtime kernel thread, the kernel may
switch out the executing thread just before it is about to
sleep or execute. The cache on the CPU must then adjust to
the new thread. When the kernel returns to the interrupted
thread, it must refill all of the cache information that was lost.
In addition, two extra context switches are performed that
could be avoided if the kernel deferred the preemption until
the first thread blocked or returned to userland. Thus, by
default, the preemption code will only preempt immediately
if the higher priority thread is a real-time priority
thread.
Turning on full kernel preemption for all kernel threads
has value as a debugging aid since it exposes more race
conditions. It is especially useful on UP systems where many
races are hard to simulate otherwise. Thus, there will be a
kernel option to enable preemption for all kernel threads
that can be used for debugging purposes.
Thread Migration
Simply put, a thread migrates when it moves from one CPU
to another. In a non-preemptive kernel this can only happen
at well-defined points such as when calling
tsleep or returning to userland.
However, in the preemptive kernel, an interrupt can force a
preemption and possible migration at any time. This can have
negative effects on per-CPU data since with the exception of
curthread and curpcb the
data can change whenever you migrate. Since you can
potentially migrate at any time this renders per-CPU data
rather useless. Thus it is desirable to be able to disable
migration for sections of code that need per-CPU data to be
stable.
Critical sections currently prevent migration since they
don't allow context switches. However, this may be too strong
a requirement to enforce in some cases since a critical
section also effectively blocks interrupt threads on the
current processor. As a result, it may be desirable to
provide an API whereby code may indicate that if the current
thread is preempted it should not migrate to another
CPU.
One possible implementation is to use a per-thread nesting
count td_pinnest along with a
td_pincpu which is updated to the current
CPU on each context switch. Each CPU has its own run queue
that holds threads pinned to that CPU. A thread is pinned
when its nesting count is greater than zero and a thread
starts off unpinned with a nesting count of zero. When a
thread is put on a runqueue, we check to see if it is pinned.
If so, we put it on the per-CPU runqueue, otherwise we put it
on the global runqueue. When
choosethread is called to retrieve the
next thread, it could either always prefer bound threads to
unbound threads or use some sort of bias when comparing
priorities. If the nesting count is only ever written to by
the thread itself and is only read by other threads when the
owning thread is not executing but while holding the
sched_lock, then
td_pinnest will not need any other locks.
The migrate_disable function would
increment the nesting count and
migrate_enable would decrement the
nesting count. Due to the locking requirements specified
above, they will only operate on the current thread and thus
would not need to handle the case of making a thread
migratable that currently resides on a per-CPU run
queue.
It is still debatable if this API is needed or if the
critical section API is sufficient by itself. Many of the
places that need to prevent migration also need to prevent
preemption as well, and in those places a critical section
must be used regardless.
Callouts
The timeout() kernel facility permits
kernel services to register functions for execution as part
of the softclock() software interrupt.
Events are scheduled based on a desired number of clock
ticks, and callbacks to the consumer-provided function
will occur at approximately the right time.
The global list of pending timeout events is protected
by a global spin mutex, callout_lock;
all access to the timeout list must be performed with this
mutex held. When softclock() is
woken up, it scans the list of pending timeouts for those
that should fire. In order to avoid lock order reversal,
the softclock thread will release the
callout_lock mutex when invoking the
provided timeout() callback function.
If the CALLOUT_MPSAFE flag was not set
during registration, then Giant will be grabbed before
invoking the callout, and then released afterwards. The
callout_lock mutex will be re-grabbed
before proceeding. The softclock()
code is careful to leave the list in a consistent state
while releasing the mutex. If DIAGNOSTIC
is enabled, then the time taken to execute each function is
measured, and a warning generated if it exceeds a
threshold.
Specific Locking Strategies
Credentials
struct ucred is the system
internal credential structure, and is generally used as the
basis for process-driven access control. BSD-derived systems
use a "copy-on-write" model for credential data: multiple
references may exist for a credential structure, and when a
change needs to be made, the structure is duplicated,
modified, and then the reference replaced. Due to widespread
caching of the credential to implement access control on open,
this results in substantial memory savings. With a move to
fine-grained SMP, this model also saves substantially on
locking operations by requiring that modification only occur
on an unshared credential, avoiding the need for explicit
synchronization when consuming a known-shared
credential.
Credential structures with a single reference are
considered mutable; shared credential structures must not be
modified or a race condition is risked. A mutex,
cr_mtxp, protects the reference
count of the struct ucred so as to
maintain consistency. Any use of the structure requires a
valid reference for the duration of the use, or the structure
may be released out from under the illegitimate
consumer.
The struct ucred mutex is a leaf
mutex, and for performance reasons, is implemented via a mutex
pool.
File Descriptors and File Descriptor Tables
Details to follow.
Jail Structures
struct prison stores
administrative details pertinent to the maintenance of jails
created using the &man.jail.2; API. This includes the
per-jail hostname, IP address, and related settings. This
structure is reference-counted since pointers to instances of
the structure are shared by many credential structures. A
single mutex, pr_mtx protects read
and write access to the reference count and all mutable
variables inside the struct jail. Some variables are set only
when the jail is created, and a valid reference to the
struct prison is sufficient to read
these values. The precise locking of each entry is documented
via comments in jail.h.
MAC Framework
The TrustedBSD MAC Framework maintains data in a variety
of kernel objects, in the form of struct
label. In general, labels in kernel objects
are protected by the same lock as the remainder of the kernel
object. For example, the v_label
label in struct vnode is protected
by the vnode lock on the vnode.
In addition to labels maintained in standard kernel objects,
the MAC Framework also maintains a list of registered and
active policies. The policy list is protected by a global
mutex (mac_policy_list_lock) and a busy
count (also protected by the mutex). Since many access
control checks may occur in parallel, entry to the framework
for a read-only access to the policy list requires holding the
mutex while incrementing (and later decrementing) the busy
count. The mutex need not be held for the duration of the
MAC entry operation--some operations, such as label operations
on file system objects--are long-lived. To modify the policy
list, such as during policy registration and deregistration,
the mutex must be held and the reference count must be zero,
to prevent modification of the list while it is in use.
A condition variable,
mac_policy_list_not_busy, is available to
threads that need to wait for the list to become unbusy, but
this condition variable must only be waited on if the caller is
holding no other locks, or a lock order violation may be
possible. The busy count, in effect, acts as a form of
- reader/writer lock over access to the framework: the difference
- is that, unlike with an sxlock, consumers waiting for the list
+ shared/exclusive lock over access to the framework: the difference
+ is that, unlike with an sx lock, consumers waiting for the list
to become unbusy may be starved, rather than permitting lock
order problems with regards to the busy count and other locks
that may be held on entry to (or inside) the MAC Framework.
Modules
For the module subsystem there exists a single lock that is
used to protect the shared data. This lock is a shared/exclusive
(SX) lock and has a good chance of needing to be acquired (shared
or exclusively); therefore, a few macros have been
added to make access to the lock easier. These macros can be
located in sys/module.h and are quite basic
in terms of usage. The main structures protected under this lock
are the module_t structures (when shared)
and the global modulelist_t structure,
modules. One should review the related source code in
kern/kern_module.c to further understand the
locking strategy.
Newbus Device Tree
The newbus system will have one sx lock. Readers will
lock it &man.sx.slock.9; and writers will lock it
&man.sx.xlock.9;. Internal only functions will not do locking
at all. The externally visible ones will lock as needed.
Items for which it does not matter whether the race is won or lost will
not be locked, since they tend to be read all over the place
(eg &man.device.get.softc.9;). There will be relatively few
changes to the newbus data structures, so a single lock should
be sufficient and not impose a performance penalty.
Pipes
...
Processes and Threads
- process hierarchy
- proc locks, references
- thread-specific copies of proc entries to freeze during system
calls, including td_ucred
- inter-process operations
- process groups and sessions
Scheduler
Lots of references to sched_lock and notes
pointing at specific primitives and related magic elsewhere in the
document.
Select and Poll
The select() and poll() functions permit threads to block
waiting on events on file descriptors--most frequently, whether
or not the file descriptors are readable or writable.
...
SIGIO
The SIGIO service permits processes to request the delivery
of a SIGIO signal to its process group when the read/write status
of specified file descriptors changes. At most one process or
process group is permitted to register for SIGIO from any given
kernel object, and that process or group is referred to as
the owner. Each object supporting SIGIO registration contains a
pointer field that is NULL if the object is not registered, or
points to a struct sigio describing
the registration. This field is protected by a global mutex,
sigio_lock. Callers to SIGIO maintenance
functions must pass in this field "by reference" so that local
register copies of the field are not made when unprotected by
the lock.
One struct sigio is allocated for
each registered object associated with any process or process
group, and contains back-pointers to the object, owner, signal
information, a credential, and the general disposition of the
registration. Each process or process group contains a list of
registered struct sigio structures,
p_sigiolst for processes, and
pg_sigiolst for process groups.
These lists are protected by the process or process group
locks respectively. Most fields in each struct
sigio are constant for the duration of the
registration, with the exception of the
sio_pgsigio field which links the
struct sigio into the process or
process group list. Developers implementing new kernel
objects supporting SIGIO will, in general, want to avoid
holding structure locks while invoking SIGIO supporting
functions, such as fsetown()
or funsetown() to avoid
defining a lock order between structure locks and the global
SIGIO lock. This is generally possible through use of an
elevated reference count on the structure, such as reliance
on a file descriptor reference to a pipe during a pipe
operation.
sysctl
The sysctl() MIB service is invoked
from both within the kernel and from userland applications
using a system call. At least two issues are raised in locking:
first, the protection of the structures maintaining the
namespace, and second, interactions with kernel variables and
functions that are accessed by the sysctl interface. Since
sysctl permits the direct export (and modification) of
kernel statistics and configuration parameters, the sysctl
mechanism must become aware of appropriate locking semantics
for those variables. Currently, sysctl makes use of a
- single global sxlock to serialize use
- of sysctl(); however, it is assumed to operate under Giant
- and other protections are not provided. The remainder of
- this section speculates on locking and semantic changes
- to sysctl.
+ single global sx lock to serialize use of sysctl(); however, it
+ is assumed to operate under Giant and other protections are not
+ provided. The remainder of this section speculates on locking
+ and semantic changes to sysctl.
- Need to change the order of operations for sysctls that
update values from read old, copyin and copyout, write new to
copyin, lock, read old and write new, unlock, copyout. Normal
sysctls that just copyout the old value and set a new value
that they copyin may still be able to follow the old model.
However, it may be cleaner to use the second model for all of
the sysctl handlers to avoid lock operations.
- To allow for the common case, a sysctl could embed a
pointer to a mutex in the SYSCTL_FOO macros and in the struct.
This would work for most sysctls. For values protected by sx
locks, spin mutexes, or other locking strategies besides a
single sleep mutex, SYSCTL_PROC nodes could be used to get the
locking right.
Taskqueue
The taskqueue's interface has two basic locks associated
with it in order to protect the related shared data. The
taskqueue_queues_mutex is meant to serve as a
lock to protect the taskqueue_queues TAILQ.
The other mutex lock associated with this system is the one in the
struct taskqueue data structure. The
use of the synchronization primitive here is to protect the
integrity of the data in the struct
taskqueue. It should be noted that there are no
separate macros to assist the user in locking down his/her own work
since these locks are most likely not going to be used outside of
kern/subr_taskqueue.c.
Implementation Notes
Details of the Mutex Implementation
- Should we require mutexes to be owned for mtx_destroy()
since we can't safely assert that they are unowned by anyone
else otherwise?
Spin Mutexes
- Use a critical section...
Sleep Mutexes
- Describe the races with contested mutexes
- Why it's safe to read mtx_lock of a contested mutex
when holding sched_lock.
- Priority propagation
Witness
- What does it do
- How does it work
Miscellaneous Topics
Interrupt Source and ICU Abstractions
- struct isrc
- pic drivers
Other Random Questions/Topics
Should we pass an interlock into
sema_wait?
- Generic turnstiles for sleep mutexes and sx locks.
- Should we have non-sleepable sx locks?
Definitions
atomic
An operation is atomic if all of its effects are visible
to other CPUs together when the proper access protocol is
followed. In the degenerate case are atomic instructions
provided directly by machine architectures. At a higher
level, if several members of a structure are protected by a
lock, then a set of operations are atomic if they are all
performed while holding the lock without releasing the lock
in between any of the operations.
operation
block
A thread is blocked when it is waiting on a lock,
resource, or condition. Unfortunately this term is a bit
overloaded as a result.
sleep
critical section
A section of code that is not allowed to be preempted.
A critical section is entered and exited using the
&man.critical.enter.9; API.
MD
Machine dependent.
MI
memory operation
A memory operation reads and/or writes to a memory
location.
MI
Machine independent.
MD
operation
memory operation
primary interrupt context
Primary interrupt context refers to the code that runs
when an interrupt occurs. This code can either run an
interrupt handler directly or schedule an asynchronous
interrupt thread to execute the interrupt handlers for a
given interrupt source.
realtime kernel thread
A high priority kernel thread. Currently, the only
realtime priority kernel threads are interrupt threads.
thread
sleep
A thread is asleep when it is blocked on a condition
variable or a sleep queue via msleep or
tsleep.
block
sleepable lock
A sleepable lock is a lock that can be held by a thread
which is asleep. Lockmgr locks and sx locks are currently
the only sleepable locks in FreeBSD. Eventually, some sx
locks such as the allproc and proctree locks may become
non-sleepable locks.
sleep
thread
A kernel thread represented by a struct thread. Threads own
locks and hold a single execution context.