diff --git a/en_US.ISO8859-1/articles/smp/article.sgml b/en_US.ISO8859-1/articles/smp/article.sgml index c6c78e0feb..14c53dc77c 100644 --- a/en_US.ISO8859-1/articles/smp/article.sgml +++ b/en_US.ISO8859-1/articles/smp/article.sgml @@ -1,957 +1,959 @@ %man; %authors; %misc; + +%freebsd; ]>
SMPng Design Document John Baldwin Robert Watson $FreeBSD$ 2002 2003 John Baldwin Robert Watson This document presents the current design and implementation of the SMPng Architecture. First, the basic primitives and tools are introduced. Next, a general architecture for the FreeBSD kernel's synchronization and execution model is laid out. Then, locking strategies for specific subsystems are discussed, documenting the approaches taken to introduce fine-grained synchronization and parallelism for each subsystem. Finally, detailed implementation notes are provided to motivate design choices and to make the reader aware of important implications involving the use of specific primitives. Introduction This document is a work-in-progress, and will be updated to reflect on-going design and implementation activities associated with the SMPng Project. Many sections currently exist only in outline form, but will be fleshed out as work proceeds. Updates or suggestions regarding the document may be directed to the document editors. The goal of SMPng is to allow concurrency in the kernel. The kernel is basically one rather large and complex program. To make the kernel multi-threaded we use some of the same tools used to make other programs multi-threaded. These include mutexes, shared/exclusive locks, semaphores, and condition variables. For the definitions of these and other SMP-related terms, please see the Glossary section of this article. Basic Tools and Locking Fundamentals Atomic Instructions and Memory Barriers There are several existing treatments of memory barriers and atomic instructions, so this section will not include a lot of detail. To put it simply, one can not go around reading variables without a lock if a lock is used to protect writes to that variable. This becomes obvious when you consider that memory barriers simply determine the relative order of memory operations; they do not make any guarantee about the timing of memory operations. That is, a memory barrier does not force the contents of a CPU's local cache or store buffer to flush. Instead, the memory barrier at lock release simply ensures that all writes to the protected data will be visible to other CPUs or devices if the write to release the lock is visible. The CPU is free to keep that data in its cache or store buffer as long as it wants. However, if another CPU performs an atomic instruction on the same datum, the first CPU must guarantee that the updated value is made visible to the second CPU along with any other operations that memory barriers may require. For example, assuming a simple model where data is considered visible when it is in main memory (or a global cache), when an atomic instruction is triggered on one CPU, the other CPUs' store buffers and caches must flush any writes to that same cache line along with any pending operations behind a memory barrier. This requires one to take special care when using an item protected by atomic instructions. For example, in the sleep mutex implementation, we have to use an atomic_cmpset rather than an atomic_set to turn on the MTX_CONTESTED bit. The reason is that we read the value of mtx_lock into a variable and then make a decision based on that read. However, the value we read may be stale, or it may change while we are making our decision. Thus, when the atomic_set executes, it may end up setting the bit on a different value than the one we based the decision on. Therefore, we have to use an atomic_cmpset to set the value only if the value we made the decision on is up-to-date and valid. 
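To make the race concrete, here is a minimal sketch of the compare-and-set loop implied by the discussion above. It is illustrative only, not the code in kern/kern_mutex.c, and it assumes the usual (destination, expected, new) shape of the atomic_cmpset family:

    #include <sys/types.h>
    #include <machine/atomic.h>

    /*
     * Hedged sketch: set a contested bit on a lock word whose value may
     * change underneath us.  A plain atomic_set could apply the bit to
     * a value other than the one we examined; the cmpset loop retries
     * until the snapshot we decided on is still the current value.
     */
    static void
    mark_contested(volatile uintptr_t *lockp, uintptr_t contested_bit)
    {
            uintptr_t v;

            do {
                    v = *lockp;     /* snapshot; may go stale immediately */
                    /* decisions based on v would happen here */
            } while (atomic_cmpset_ptr(lockp, v, v | contested_bit) == 0);
            /* success: the bit was set on the exact value we examined */
    }
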
Finally, atomic instructions only allow one item to be updated or read. If one needs to atomically update several items, then a lock must be used instead. For example, if two counters must be read and have values that are consistent relative to each other, then those counters must be protected by a lock rather than by separate atomic instructions. Read Locks versus Write Locks Read locks do not need to be as strong as write locks. Both types of locks need to ensure that the data they are accessing is not stale. However, only write access requires exclusive access. Multiple threads can safely read a value. Different types of locks for reads and writes can be implemented in a number of ways. First, sx locks can be used in this manner by using an exclusive lock when writing and a shared lock when reading. This method is quite straightforward. A second method is a bit more obscure. You can protect a datum with multiple locks. Then for reading that data you simply need to have a read lock of one of the locks. However, to write to the data, you need to have a write lock of all of the locks. This can make writing rather expensive but can be useful when data is accessed in various ways. For example, the parent process pointer is protected by both the proctree_lock sx lock and the per-process mutex. Sometimes the proc lock is easier to use, as when we are simply checking the parent of a process that we already have locked. However, other places, such as inferior, need to walk the tree of processes via parent pointers; there, locking each process would be prohibitively expensive, and it would be painful to guarantee that the condition being checked remains valid for both the check and the actions taken as a result of the check. Locking Conditions and Results If you need a lock to check the state of a variable so that you can take an action based on the state you read, you can not just hold the lock while reading the variable and then drop the lock before you act on the value you read. Once you drop the lock, the variable can change, rendering your decision invalid. Thus, you must hold the lock both while reading the variable and while performing the action as a result of the test. General Architecture and Design Interrupt Handling - Following the pattern of several other multi-threaded Unix + Following the pattern of several other multi-threaded &unix; kernels, FreeBSD deals with interrupt handlers by giving them their own thread context. Providing a context for interrupt handlers allows them to block on locks. To help avoid latency, however, interrupt threads run at real-time kernel priority. Thus, interrupt handlers should not execute for very long to avoid starving other kernel threads. In addition, since multiple handlers may share an interrupt thread, interrupt handlers should not sleep or use a sleepable lock to avoid starving another interrupt handler. The interrupt threads currently in FreeBSD are referred to as heavyweight interrupt threads. They are called this because switching to an interrupt thread involves a full context switch. In the initial implementation, the kernel was not preemptive and thus interrupts that interrupted a kernel thread would have to wait until the kernel thread blocked or returned to userland before they would have an opportunity to run. To deal with the latency problems, the kernel in FreeBSD has been made preemptive. Currently, we only preempt a kernel thread when we release a sleep mutex or when an interrupt comes in. 
However, the plan is to make the FreeBSD kernel fully preemptive as described below. Not all interrupt handlers execute in a thread context. Instead, some handlers execute directly in primary interrupt context. These interrupt handlers are currently misnamed fast interrupt handlers since the INTR_FAST flag used in earlier versions of the kernel is used to mark these handlers. The only interrupts which currently use these types of interrupt handlers are clock interrupts and serial I/O device interrupts. Since these handlers do not have their own context, they may not acquire blocking locks and thus may only use spin mutexes. Finally, there is one optional optimization that can be added in MD code called lightweight context switches. Since an interrupt thread executes in a kernel context, it can borrow the vmspace of any process. Thus, in a lightweight context switch, the switch to the interrupt thread does not switch vmspaces but borrows the vmspace of the interrupted thread. In order to ensure that the vmspace of the interrupted thread does not disappear out from under us, the interrupted thread is not allowed to execute until the interrupt thread is no longer borrowing its vmspace. This can happen when the interrupt thread either blocks or finishes. If an interrupt thread blocks, then it will use its own context when it is made runnable again. Thus, it can release the interrupted thread. The cons of this optimization are that it is very machine specific and complex and thus only worth the effort if there is a large performance improvement. At this point it is probably too early to tell, and in fact, it will probably hurt performance as almost all interrupt handlers will immediately block on Giant and require a thread fix-up when they block. Also, an alternative method of interrupt handling has been proposed by Mike Smith that works like so: Each interrupt handler has two parts: a predicate which runs in primary interrupt context and a handler which runs in its own thread context. If an interrupt handler has a predicate, then when an interrupt is triggered, the predicate is run. If the predicate returns true then the interrupt is assumed to be fully handled and the kernel returns from the interrupt. If the predicate returns false or there is no predicate, then the threaded handler is scheduled to run. Fitting lightweight context switches into this scheme might prove rather complicated. Since we may want to change to this scheme at some point in the future, it is probably best to defer work on lightweight context switches until we have settled on the final interrupt handling architecture and determined how lightweight context switches might or might not fit into it. Kernel Preemption and Critical Sections Kernel Preemption in a Nutshell Kernel preemption is fairly simple. The basic idea is that a CPU should always be doing the highest priority work available. Well, that is the ideal at least. There are a couple of cases where achieving the ideal is not worth the expense. Implementing full kernel preemption is very straightforward: when you schedule a thread to be executed by putting it on a runqueue, you check to see if its priority is higher than that of the currently executing thread. If so, you initiate a context switch to that thread. While locks can protect most data in the case of a preemption, not all of the kernel is preemption safe. 
For example, if a thread holding a spin mutex is preempted and the new thread attempts to grab the same spin mutex, the new thread may spin forever as the interrupted thread may never get a chance to execute. Also, some code such as the code to assign an address space number for a process during exec() on the Alpha needs to not be preempted as it supports the actual context switch code. Preemption is disabled for these code sections by using a critical section. Critical Sections The responsibility of the critical section API is to prevent context switches inside of a critical section. With a fully preemptive kernel, every setrunqueue of a thread other than the current thread is a preemption point. One implementation is for critical_enter to set a per-thread flag that is cleared by its counterpart. If setrunqueue is called with this flag set, it does not preempt regardless of the priority of the new thread relative to the current thread. However, since critical sections are used in spin mutexes to prevent context switches and multiple spin mutexes can be acquired, the critical section API must support nesting. For this reason the current implementation uses a nesting count instead of a single per-thread flag. In order to minimize latency, preemptions inside of a critical section are deferred rather than dropped. If a thread that would normally be preempted to is made runnable while the current thread is inside a critical section, then a per-thread flag is set to indicate that there is a pending preemption. When the outermost critical section is exited, the flag is checked. If the flag is set, then the current thread is preempted to allow the higher priority thread to run. Interrupts pose a problem with regards to spin mutexes. If a low-level interrupt handler needs a lock, it needs to not interrupt any code needing that lock to avoid possible data structure corruption. Currently, this mechanism is piggybacked onto the critical section API by means of the cpu_critical_enter and cpu_critical_exit functions. This API currently disables and re-enables interrupts on all of FreeBSD's platforms. This approach may not be purely optimal, but it is simple to understand and simple to get right. Theoretically, this second API need only be used for spin mutexes that are used in primary interrupt context. However, to make the code simpler, it is used for all spin mutexes and even all critical sections. It may be desirable to split out the MD API from the MI API and only use it in conjunction with the MI API in the spin mutex implementation. If this approach is taken, then the MD API would likely need to be renamed to show that it is a separate API. Design Tradeoffs As mentioned earlier, a couple of trade-offs have been made, sacrificing perfect preemption in cases where it may not provide the best performance. The first trade-off is that the preemption code does not take other CPUs into account. Suppose we have two CPUs, A and B, with the priority of A's thread as 4 and the priority of B's thread as 2. If CPU B makes a thread with priority 1 runnable, then in theory, we want CPU A to switch to the new thread so that we will be running the two highest priority runnable threads. However, the cost of determining which CPU to force the preemption on, signaling that CPU via an IPI, and performing the synchronization that would be required would be enormous. Thus, the current code would instead force CPU B to switch to the higher priority thread. 
Note that this still puts the system in a better position as CPU B is executing a thread of priority 1 rather than a thread of priority 2. The second trade-off limits immediate kernel preemption to real-time priority kernel threads. In the simple case of preemption defined above, a thread is always preempted immediately (or as soon as a critical section is exited) if a higher priority thread is made runnable. However, many threads executing in the kernel only execute in a kernel context for a short time before either blocking or returning to userland. Thus, if the kernel preempts these threads to run another non-realtime kernel thread, the kernel may switch out the executing thread just before it is about to sleep or return to userland. The cache on the CPU must then adjust to the new thread. When the kernel returns to the preempted thread, it must refill all the cache information that was lost. In addition, two extra context switches are performed that could be avoided if the kernel deferred the preemption until the first thread blocked or returned to userland. Thus, by default, the preemption code will only preempt immediately if the higher priority thread is a real-time priority thread. Turning on full kernel preemption for all kernel threads has value as a debugging aid since it exposes more race conditions. It is especially useful on UP systems where many races are hard to simulate otherwise. Thus, there will be a kernel option to enable preemption for all kernel threads that can be used for debugging purposes. Thread Migration Simply put, a thread migrates when it moves from one CPU to another. In a non-preemptive kernel this can only happen at well-defined points such as when calling tsleep or returning to userland. However, in the preemptive kernel, an interrupt can force a preemption and possible migration at any time. This can have negative effects on per-CPU data since, with the exception of curthread and curpcb, the data can change whenever you migrate. Since you can potentially migrate at any time, this renders per-CPU data rather useless. Thus, it is desirable to be able to disable migration for sections of code that need per-CPU data to be stable. Critical sections currently prevent migration since they do not allow context switches. However, this may be too strong a requirement to enforce in some cases since a critical section also effectively blocks interrupt threads on the current processor. As a result, it may be desirable to provide an API whereby code may indicate that if the current thread is preempted it should not migrate to another CPU. One possible implementation is to use a per-thread nesting count td_pinnest along with a td_pincpu which is updated to the current CPU on each context switch. Each CPU has its own run queue that holds threads pinned to that CPU. A thread is pinned when its nesting count is greater than zero; a thread starts off unpinned, with a nesting count of zero. When a thread is put on a runqueue, we check to see if it is pinned. If so, we put it on the per-CPU runqueue, otherwise we put it on the global runqueue. When choosethread is called to retrieve the next thread, it could either always prefer bound threads to unbound threads or use some sort of bias when comparing priorities. If the nesting count is only ever written to by the thread itself and is only read by other threads when the owning thread is not executing but while holding the sched_lock, then td_pinnest will not need any other locks. 
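A minimal sketch of the proposed runqueue check might look like the following. Everything here is hypothetical, since the API described above was a proposal rather than committed code; td_pinnest, td_pincpu, the run queue variables, and the runq_add helper are all stand-ins:

    /*
     * Hypothetical sketch of the pinning check described above.
     * td_pinnest, td_pincpu, the per-CPU run queues, and runq_add are
     * assumptions taken from the proposal, not real kernel API.
     */
    void
    setrunqueue_sketch(struct thread *td)
    {
            if (td->td_pinnest > 0)
                    runq_add(&percpu_runq[td->td_pincpu], td);  /* pinned */
            else
                    runq_add(&global_runq, td);     /* free to migrate */
    }
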
The migrate_disable function would increment the nesting count and migrate_enable would decrement the nesting count. Due to the locking requirements specified above, they will only operate on the current thread and thus would not need to handle the case of making a thread migratable that currently resides on a per-CPU run queue. It is still debatable whether this API is needed or if the critical section API is sufficient by itself. Many of the places that need to prevent migration also need to prevent preemption as well, and in those places a critical section must be used regardless. Callouts The timeout() kernel facility permits kernel services to register functions for execution as part of the softclock() software interrupt. Events are scheduled based on a desired number of clock ticks, and callbacks to the consumer-provided function will occur at approximately the right time. The global list of pending timeout events is protected by a global spin mutex, callout_lock; all access to the timeout list must be performed with this mutex held. When softclock() is woken up, it scans the list of pending timeouts for those that should fire. In order to avoid lock order reversal, the softclock thread will release the callout_lock mutex when invoking the provided timeout() callback function. If the CALLOUT_MPSAFE flag was not set during registration, then Giant will be grabbed before invoking the callout, and then released afterwards. The callout_lock mutex will be re-grabbed before proceeding. The softclock() code is careful to leave the list in a consistent state while releasing the mutex. If DIAGNOSTIC is enabled, then the time taken to execute each function is measured, and a warning is generated if it exceeds a threshold. Specific Locking Strategies Credentials struct ucred is the kernel's internal credential structure, and is generally used as the basis for process-driven access control within the kernel. BSD-derived systems use a copy-on-write model for credential data: multiple references may exist for a credential structure, and when a change needs to be made, the structure is duplicated, modified, and then the reference replaced. Due to widespread caching of the credential to implement access control on open, this results in substantial memory savings. With a move to fine-grained SMP, this model also saves substantially on locking operations by requiring that modification only occur on an unshared credential, avoiding the need for explicit synchronization when consuming a known-shared credential. Credential structures with a single reference are considered mutable; shared credential structures must not be modified or a race condition is risked. A mutex, cr_mtxp, protects the reference count of struct ucred so as to maintain consistency. Any use of the structure requires a valid reference for the duration of the use, or the structure may be released out from under the illegitimate consumer. The struct ucred mutex is a leaf mutex, and for performance reasons, is implemented via a mutex pool. Usually, credentials are used in a read-only manner for access control decisions, and in this case td_ucred is generally preferred because it requires no locking. When a process' credential is updated, the proc lock must be held across the check and update operations to avoid races. The process credential p_ucred must be used for check and update operations to prevent time-of-check, time-of-use races. 
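As an illustration of this copy-on-write update pattern, here is a hedged sketch in the style of the credential KPI (crget, crcopy, crfree). It is simplified from the pattern used by calls such as setuid(), not a verbatim excerpt, and set_some_flag is a hypothetical consumer:

    #include <sys/param.h>
    #include <sys/proc.h>
    #include <sys/ucred.h>

    /*
     * Hedged sketch: update a process credential copy-on-write style.
     * The shared credential is never modified in place; a fresh,
     * unshared copy is modified and swapped in under the proc lock.
     */
    static void
    set_some_flag(struct proc *p)
    {
            struct ucred *newcred, *oldcred;

            newcred = crget();              /* new credential, one reference */
            PROC_LOCK(p);                   /* held across check and update */
            oldcred = p->p_ucred;
            crcopy(newcred, oldcred);       /* duplicate the shared credential */
            /* ... modify newcred here; it is unshared, so no lock needed ... */
            p->p_ucred = newcred;           /* publish the replacement */
            PROC_UNLOCK(p);
            crfree(oldcred);                /* drop reference to the old one */
    }
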
If system call invocations will perform access control after an update to the process credential, the value of td_ucred must also be refreshed to the current process value. This will prevent use of a stale credential following a change. The kernel automatically refreshes the td_ucred pointer in the thread structure from the process p_ucred whenever a process enters the kernel, permitting use of a fresh credential for kernel access control. File Descriptors and File Descriptor Tables Details to follow. Jail Structures struct prison stores administrative details pertinent to the maintenance of jails created using the &man.jail.2; API. This includes the per-jail hostname, IP address, and related settings. This structure is reference-counted since pointers to instances of the structure are shared by many credential structures. A single mutex, pr_mtx, protects read and write access to the reference count and all mutable variables inside struct prison. Some variables are set only when the jail is created, and a valid reference to the struct prison is sufficient to read these values. The precise locking of each entry is documented via comments in sys/jail.h. MAC Framework The TrustedBSD MAC Framework maintains data in a variety of kernel objects, in the form of struct label. In general, labels in kernel objects are protected by the same lock as the remainder of the kernel object. For example, the v_label label in struct vnode is protected by the vnode lock on the vnode. In addition to labels maintained in standard kernel objects, the MAC Framework also maintains a list of registered and active policies. The policy list is protected by a global mutex (mac_policy_list_lock) and a busy count (also protected by the mutex). Since many access control checks may occur in parallel, entry to the framework for a read-only access to the policy list requires holding the mutex while incrementing (and later decrementing) the busy count. The mutex need not be held for the duration of the MAC entry operation; some operations, such as label operations on file system objects, are long-lived. To modify the policy list, such as during policy registration and de-registration, the mutex must be held and the busy count must be zero, to prevent modification of the list while it is in use. A condition variable, mac_policy_list_not_busy, is available to threads that need to wait for the list to become unbusy, but this condition variable must only be waited on if the caller is holding no other locks, or a lock order violation may be possible. The busy count, in effect, acts as a form of shared/exclusive lock over access to the framework: the difference is that, unlike with an sx lock, consumers waiting for the list to become unbusy may be starved, rather than permitting lock order problems with regards to the busy count and other locks that may be held on entry to (or inside) the MAC Framework. Modules For the module subsystem there exists a single lock that is used to protect the shared data. This lock is a shared/exclusive (SX) lock and, since it stands a good chance of needing to be acquired (shared or exclusively), a few macros have been added to make access to the lock easier. These macros can be located in sys/module.h and are quite basic in terms of usage. The main structures protected under this lock are the module_t structures (when shared) and the global modulelist_t structure, modules. One should review the related source code in kern/kern_module.c to further understand the locking strategy. 
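The wrapper-macro pattern the Modules section describes can be sketched as follows. This is a hedged illustration of the idea; the real macro names live in sys/module.h, and the names below are illustrative stand-ins rather than the actual ones:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/sx.h>

    /* Hypothetical names; the actual macros are in sys/module.h. */
    static struct sx modules_sx;            /* protects the module list */

    #define MODLIST_SLOCK()   sx_slock(&modules_sx)   /* readers */
    #define MODLIST_SUNLOCK() sx_sunlock(&modules_sx)
    #define MODLIST_XLOCK()   sx_xlock(&modules_sx)   /* writers */
    #define MODLIST_XUNLOCK() sx_xunlock(&modules_sx)

A reader walking the module list would bracket the traversal with MODLIST_SLOCK()/MODLIST_SUNLOCK(), while registration or unregistration of a module would take the exclusive flavor.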
Newbus Device Tree The newbus system will have one sx lock. Readers will hold a shared (read) lock (&man.sx.slock.9;) and writers will hold an exclusive (write) lock (&man.sx.xlock.9;). Internal functions will not do locking at all. Externally visible ones will lock as needed. Items for which it does not matter whether the race is won or lost will not be locked, since they tend to be read all over the place (e.g. &man.device.get.softc.9;). There will be relatively few changes to the newbus data structures, so a single lock should be sufficient and not impose a performance penalty. Pipes ... Processes and Threads - process hierarchy - proc locks, references - thread-specific copies of proc entries to freeze during system calls, including td_ucred - inter-process operations - process groups and sessions Scheduler Lots of references to sched_lock and notes pointing at specific primitives and related magic elsewhere in the document. Select and Poll The select() and poll() functions permit threads to block waiting on events on file descriptors, most frequently whether or not the file descriptors are readable or writable. ... SIGIO The SIGIO service permits processes to request the delivery of a SIGIO signal to its process group when the read/write status of specified file descriptors changes. At most one process or process group is permitted to register for SIGIO from any given kernel object, and that process or group is referred to as the owner. Each object supporting SIGIO registration contains a pointer field that is NULL if the object is not registered, or points to a struct sigio describing the registration. This field is protected by a global mutex, sigio_lock. Callers to SIGIO maintenance functions must pass in this field by reference so that local register copies of the field are not made when unprotected by the lock. One struct sigio is allocated for each registered object associated with any process or process group, and contains back-pointers to the object, owner, signal information, a credential, and the general disposition of the registration. Each process or process group contains a list of registered struct sigio structures, p_sigiolst for processes, and pg_sigiolst for process groups. These lists are protected by the process or process group locks, respectively. Most fields in each struct sigio are constant for the duration of the registration, with the exception of the sio_pgsigio field, which links the struct sigio into the process or process group list. Developers implementing new kernel objects supporting SIGIO will, in general, want to avoid holding structure locks while invoking SIGIO supporting functions, such as fsetown() or funsetown(), so as to avoid defining a lock order between structure locks and the global SIGIO lock. This is generally possible through use of an elevated reference count on the structure, such as reliance on a file descriptor reference to a pipe during a pipe operation. Sysctl The sysctl() MIB service is invoked from both within the kernel and from userland applications using a system call. At least two issues are raised in locking: first, the protection of the structures maintaining the namespace, and second, interactions with kernel variables and functions that are accessed by the sysctl interface. Since sysctl permits the direct export (and modification) of kernel statistics and configuration parameters, the sysctl mechanism must become aware of appropriate locking semantics for those variables. 
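As a hedged sketch of what such locking awareness can look like, here is a handler in the common SYSCTL_PROC style that takes a mutex around the reads and writes of the exported variable. foo_value and foo_mtx are hypothetical, while sysctl_handle_int and SYSCTL_HANDLER_ARGS are the standard helpers:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/sysctl.h>

    static int foo_value;                   /* hypothetical exported knob */
    static struct mtx foo_mtx;              /* hypothetical lock for it */

    static int
    sysctl_foo(SYSCTL_HANDLER_ARGS)
    {
            int error, val;

            mtx_lock(&foo_mtx);
            val = foo_value;                /* read old under the lock */
            mtx_unlock(&foo_mtx);
            /* copyout of the old value and copyin of the new are unlocked */
            error = sysctl_handle_int(oidp, &val, 0, req);
            if (error != 0 || req->newptr == NULL)
                    return (error);
            mtx_lock(&foo_mtx);
            foo_value = val;                /* write new under the lock */
            mtx_unlock(&foo_mtx);
            return (0);
    }
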
Currently, sysctl makes use of a single global sx lock to serialize use of sysctl(); however, it is assumed to operate under Giant, and other protections are not provided. The remainder of this section speculates on locking and semantic changes to sysctl. - Need to change the order of operations for sysctls that update values from "read old; copyin and copyout; write new" to "copyin; lock; read old and write new; unlock; copyout". Normal sysctls that just copyout the old value and set a new value that they copyin may still be able to follow the old model. However, it may be cleaner to use the second model for all of the sysctl handlers to avoid lock operations. - To allow for the common case, a sysctl could embed a pointer to a mutex in the SYSCTL_FOO macros and in the struct. This would work for most sysctls. For values protected by sx locks, spin mutexes, or other locking strategies besides a single sleep mutex, SYSCTL_PROC nodes could be used to get the locking right. Taskqueue The taskqueue interface has two basic locks associated with it in order to protect the related shared data. The taskqueue_queues_mutex is meant to serve as a lock to protect the taskqueue_queues TAILQ. The other mutex lock associated with this system is the one in the struct taskqueue data structure. The use of the synchronization primitive here is to protect the integrity of the data in the struct taskqueue. It should be noted that there are no separate macros to assist the user in locking down his/her own work since these locks are most likely not going to be used outside of kern/subr_taskqueue.c. Implementation Notes Details of the Mutex Implementation - Should we require mutexes to be owned for mtx_destroy() since we can not safely assert that they are unowned by anyone else otherwise? Spin Mutexes - Use a critical section... Sleep Mutexes - Describe the races with contested mutexes - Why it is safe to read mtx_lock of a contested mutex when holding sched_lock. - Priority propagation Witness - What does it do - How does it work Miscellaneous Topics Interrupt Source and ICU Abstractions - struct isrc - pic drivers Other Random Questions/Topics - Should we pass an interlock into sema_wait? - Generic turnstiles for sleep mutexes and sx locks. - Should we have non-sleepable sx locks? Glossary atomic An operation is atomic if all of its effects are visible to other CPUs together when the proper access protocol is followed. The degenerate case is atomic instructions provided directly by machine architectures. At a higher level, if several members of a structure are protected by a lock, then a set of operations are atomic if they are all performed while holding the lock without releasing the lock in between any of the operations. See also: operation. block A thread is blocked when it is waiting on a lock, resource, or condition. Unfortunately this term is a bit overloaded as a result. See also: sleep. critical section A section of code that is not allowed to be preempted. A critical section is entered and exited using the &man.critical.enter.9; API. MD Machine dependent. See also: MI. memory operation A memory operation reads and/or writes to a memory location. MI Machine independent. See also: MD. operation See: memory operation. primary interrupt context Primary interrupt context refers to the code that runs when an interrupt occurs. This code can either run an interrupt handler directly or schedule an asynchronous interrupt thread to execute the interrupt handlers for a given interrupt source. realtime kernel thread A high priority kernel thread. 
Currently, the only realtime priority kernel threads are interrupt threads. See also: thread. sleep A thread is asleep when it is blocked on a condition variable or a sleep queue via msleep or tsleep. See also: block. sleepable lock A sleepable lock is a lock that can be held by a thread which is asleep. Lockmgr locks and sx locks are currently the only sleepable locks in FreeBSD. Eventually, some sx locks such as the allproc and proctree locks may become non-sleepable locks. See also: sleep. thread A kernel thread represented by a struct thread. Threads own locks and hold a single execution context.
diff --git a/en_US.ISO8859-1/articles/vm-design/article.sgml b/en_US.ISO8859-1/articles/vm-design/article.sgml index f4f2487d4b..bbe6da9094 100644 --- a/en_US.ISO8859-1/articles/vm-design/article.sgml +++ b/en_US.ISO8859-1/articles/vm-design/article.sgml @@ -1,838 +1,841 @@ %man; + +%freebsd; + ]>
Design elements of the FreeBSD VM system Matthew Dillon
dillon@apollo.backplane.com
The title is really just a fancy way of saying that I am going to attempt to describe the whole VM enchilada, hopefully in a way that everyone can follow. For the last year I have concentrated on a number of major kernel subsystems within FreeBSD, with the VM and Swap subsystems being the most interesting and NFS being a necessary chore. I rewrote only small portions of the code. In the VM arena the only major rewrite I have done is to the swap subsystem. Most of my work was cleanup and maintenance, with only moderate code rewriting and no major algorithmic adjustments within the VM subsystem. The bulk of the VM subsystem's theoretical base remains unchanged and a lot of the credit for the modernization effort in the last few years belongs to John Dyson and David Greenman. Not being a historian like Kirk, I will not attempt to tag all the various features with people's names, since I will invariably get it wrong. This article was originally published in the January 2000 issue of DaemonNews. This version of the article may include updates from Matt and other authors to reflect changes in FreeBSD's VM implementation.
Introduction Before moving along to the actual design, let's spend a little time on the necessity of maintaining and modernizing any long-living codebase. In the programming world, algorithms tend to be more important than code and it is precisely due to BSD's academic roots that a great deal of attention was paid to algorithm design from the beginning. More attention paid to the design generally leads to a clean and flexible codebase that can be fairly easily modified, extended, or replaced over time. While BSD is considered an old operating system by some people, those of us who work on it tend to view it more as a mature codebase which has various components modified, extended, or replaced with modern code. It has evolved, and FreeBSD is at the bleeding edge no matter how old some of the code might be. This is an important distinction to make and one that is unfortunately lost to many people. The biggest error a programmer can make is to not learn from history, and this is precisely the error that many other modern operating systems have made. NT is the best example of this, and the consequences have been dire. Linux also makes this mistake to some degree—enough that we BSD folk can make small jokes about it every once in a while, anyway. Linux's problem is simply one of a lack of experience and history to compare ideas against, a problem that is easily and rapidly being addressed by the Linux community in the same way it has been addressed in the BSD community—by continuous code development. The NT folk, on the - other hand, repeatedly make the same mistakes solved by Unix decades ago + other hand, repeatedly make the same mistakes solved by &unix; decades ago and then spend years fixing them. Over and over again. They have a severe case of "not designed here" and "we are always right because our marketing department says so". I have little tolerance for anyone who cannot learn from history. Much of the apparent complexity of the FreeBSD design, especially in the VM/Swap subsystem, is a direct result of having to solve serious performance issues that occur under various conditions. These issues are not due to bad algorithmic design but instead arise from environmental factors. In any direct comparison between platforms, these issues become most apparent when system resources begin to get stressed. As I describe FreeBSD's VM/Swap subsystem the reader should always keep two points in mind. First, the most important aspect of performance design is what is known as Optimizing the Critical Path. It is often the case that performance optimizations add a little bloat to the code in order to make the critical path perform better. Second, a solid, generalized design outperforms a heavily-optimized design over the long run. While a generalized design may end up being slower than a heavily-optimized design when they are first implemented, the generalized design tends to be easier to adapt to changing conditions and the heavily-optimized design winds up having to be thrown away. Any codebase that will survive and be maintainable for years must therefore be designed properly from the beginning even if it costs some performance. Twenty years ago people were still arguing that programming in assembly was better than programming in a high-level language because it produced code that was ten times as fast. Today, the fallacy of that argument is obvious—as are the parallels to algorithmic design and code generalization. 
VM Objects The best way to begin describing the FreeBSD VM system is to look at it from the perspective of a user-level process. Each user process sees a single, private, contiguous VM address space containing several types of memory objects. These objects have various characteristics. Program code and program data are effectively a single memory-mapped file (the binary file being run), but program code is read-only while program data is copy-on-write. Program BSS is just memory allocated and filled with zeros on demand, called demand zero page fill. Arbitrary files can be memory-mapped into the address space as well, which is how the shared library mechanism works. Such mappings can require modifications to remain private to the process making them. The fork system call adds an entirely new dimension to the VM management problem on top of the complexity already given. A program binary data page (which is a basic copy-on-write page) illustrates the complexity. A program binary contains a preinitialized data section which is initially mapped directly from the program file. When a program is loaded into a process's VM space, this area is initially memory-mapped and backed by the program binary itself, allowing the VM system to free/reuse the page and later load it back in from the binary. The moment a process modifies this data, however, the VM system must make a private copy of the page for that process. Since the private copy has been modified, the VM system may no longer free it, because there is no longer any way to restore it later on. You will notice immediately that what was originally a simple file mapping has become much more complex. Data may be modified on a page-by-page basis whereas the file mapping encompasses many pages at once. The complexity further increases when a process forks. When a process forks, the result is two processes—each with its own private address space, including any modifications made by the original process prior to the call to fork(). It would be silly for the VM system to make a complete copy of the data at the time of the fork() because it is quite possible that at least one of the two processes will only need to read from that page from then on, allowing the original page to continue to be used. What was a private page is made copy-on-write again, since each process (parent and child) expects its own personal post-fork modifications to remain private to itself and not affect the other. FreeBSD manages all of this with a layered VM Object model. The original binary program file winds up being the lowest VM Object layer. A copy-on-write layer is pushed on top of that to hold those pages which had to be copied from the original file. If the program modifies a data page belonging to the original file the VM system takes a fault and makes a copy of the page in the higher layer. When a process forks, additional VM Object layers are pushed on. This might make a little more sense with a fairly basic example. A fork() is a common operation for any *BSD system, so this example will consider a program that starts up and forks. When the process starts, the VM system creates an object layer, let's call this A:

    +---------------+
    |       A       |
    +---------------+

A represents the file—pages may be paged in and out of the file's physical media as necessary. Paging in from the disk is reasonable for a program, but we really do not want to page back out and overwrite the executable. 
The VM system therefore creates a second layer, B, that will be physically backed by swap space:

    +---------------+
    |       B       |
    +---------------+
    |       A       |
    +---------------+

On the first write to a page after this, a new page is created in B, and its contents are initialized from A. All pages in B can be paged in or out to a swap device. When the program forks, the VM system creates two new object layers—C1 for the parent, and C2 for the child—that rest on top of B:

    +-------+-------+
    |  C1   |  C2   |
    +-------+-------+
    |       B       |
    +---------------+
    |       A       |
    +---------------+

In this case, let's say a page in B is modified by the original parent process. The process will take a copy-on-write fault and duplicate the page in C1, leaving the original page in B untouched. Now, let's say the same page in B is modified by the child process. The process will take a copy-on-write fault and duplicate the page in C2. The original page in B is now completely hidden since both C1 and C2 have a copy (and B could theoretically be destroyed if it does not represent a real file). However, this sort of optimization is not trivial to make because it is so fine-grained. FreeBSD does not make this optimization. Now, suppose (as is often the case) that the child process does an exec(). Its current address space is usually replaced by a new address space representing a new file. In this case, the C2 layer is destroyed:

    +-------+
    |  C1   |
    +-------+-------+
    |       B       |
    +---------------+
    |       A       |
    +---------------+

In this case, the number of children of B drops to one, and all accesses to B now go through C1. This means that B and C1 can be collapsed together. Any pages in B that also exist in C1 are deleted from B during the collapse. Thus, even though the optimization in the previous step could not be made, we can recover the dead pages when either of the processes exits or calls exec(). This model creates a number of potential problems. The first is that you can wind up with a relatively deep stack of layered VM Objects which can cost scanning time and memory when you take a fault. Deep layering can occur when processes fork and then fork again (either parent or child). The second problem is that you can wind up with dead, inaccessible pages deep in the stack of VM Objects. In our last example, if both the parent and child processes modify the same page, they both get their own private copies of the page and the original page in B is no longer accessible by anyone. That page in B can be freed. FreeBSD solves the deep layering problem with a special optimization called the All Shadowed Case. This case occurs if either C1 or C2 takes sufficient COW faults to completely shadow all pages in B. Let's say that C1 achieves this. C1 can now bypass B entirely, so rather than have C1->B->A and C2->B->A we now have C1->A and C2->B->A. But look what also happened—now B has only one reference (C2), so we can collapse B and C2 together. The end result is that B is deleted entirely and we have C1->A and C2->A. It is often the case that B will contain a large number of pages and neither C1 nor C2 will be able to completely overshadow it. If we fork again and create a set of D layers, however, it is much more likely that one of the D layers will eventually be able to completely overshadow the much smaller dataset represented by C1 or C2. The same optimization will work at any point in the graph and the grand result of this is that even on a heavily forked machine VM Object stacks tend to not get much deeper than 4. 
This is true of both the parent and the children, and true whether the parent is doing the forking or whether the children cascade forks. The dead page problem still exists in the case where C1 or C2 do not completely overshadow B. Due to our other optimizations this case does not represent much of a problem and we simply allow the pages to be dead. If the system runs low on memory it will swap them out, eating a little swap, but that is it. The advantage to the VM Object model is that fork() is extremely fast, since no real data copying need take place. The disadvantage is that you can build a relatively complex VM Object layering that slows page fault handling down a little, and you spend memory managing the VM Object structures. The optimizations FreeBSD makes prove to reduce the problems enough that they can be ignored, leaving no real disadvantage. SWAP Layers Private data pages are initially either copy-on-write or zero-fill pages. When a change, and therefore a copy, is made, the original backing object (usually a file) can no longer be used to save a copy of the page when the VM system needs to reuse it for other purposes. This is where SWAP comes in. SWAP is allocated to create backing store for memory that does not otherwise have it. FreeBSD allocates the swap management structure for a VM Object only when it is actually needed. However, the swap management structure has had problems historically. Under FreeBSD 3.X the swap management structure preallocates an array that encompasses the entire object requiring swap backing store—even if only a few pages of that object are swap-backed. This creates a kernel memory fragmentation problem when large objects are mapped, or processes with large runsizes (RSS) fork. Also, in order to keep track of swap space, a list of holes is kept in kernel memory, and this tends to get severely fragmented as well. Since the list of holes is a linear list, the swap allocation and freeing performance is a non-optimal O(n)-per-page. It also requires kernel memory allocations to take place during the swap freeing process, and that creates low memory deadlock problems. The problem is further exacerbated by holes created due to the interleaving algorithm. Also, the swap block map can become fragmented fairly easily, resulting in non-contiguous allocations. Kernel memory must also be allocated on the fly for additional swap management structures when a swapout occurs. It is evident that there was plenty of room for improvement. For FreeBSD 4.X, I completely rewrote the swap subsystem. With this rewrite, swap management structures are allocated through a hash table rather than a linear array, giving them a fixed allocation size and much finer granularity. Rather than using a linearly linked list to keep track of swap space reservations, it now uses a bitmap of swap blocks arranged in a radix tree structure with free-space hinting in the radix node structures. This effectively makes swap allocation and freeing an O(1) operation. The entire radix tree bitmap is also preallocated in order to avoid having to allocate kernel memory during critical low memory swapping operations. After all, the system tends to swap when it is low on memory, so we should avoid allocating kernel memory at such times in order to avoid potential deadlocks. Finally, to reduce fragmentation the radix tree is capable of allocating large contiguous chunks at once, skipping over smaller fragmented chunks. 
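To illustrate the free-space hinting idea, here is a toy, single-node version of a hinted bitmap allocator. The real allocator (the blist code in kern/subr_blist.c) is a full radix tree and considerably more capable; this sketch only shows why a hint makes repeated allocations cheap:

    #include <stdint.h>

    /*
     * Toy sketch of one radix node: a bitmap of swap blocks plus a hint
     * telling the allocator where free space might begin.  Illustrative
     * only; not the kernel's blist structures.
     */
    typedef struct swnode {
            uint32_t map;   /* one bit per block; 1 = free */
            int      hint;  /* lowest index that might be free */
    } swnode_t;

    static int
    swnode_alloc(swnode_t *n)
    {
            int i;

            for (i = n->hint; i < 32; i++) {
                    if (n->map & (1u << i)) {
                            n->map &= ~(1u << i);   /* mark allocated */
                            n->hint = i + 1;        /* skip known-busy space */
                            return (i);
                    }
            }
            return (-1);                            /* this node is full */
    }

In the real radix tree the hint lives in the internal nodes as well, so whole fragmented subtrees are skipped without being scanned, which is what gives the nearly O(1) behavior described above.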
I did not take the final step of having an allocating hint pointer that would trundle through a portion of swap as allocations were made in order to further guarantee contiguous allocations or at least locality of reference, but I ensured that such an addition could be made. When to free a page Since the VM system uses all available memory for disk caching, there are usually very few truly free pages. The VM system depends on being able to properly choose pages which are not in use to reuse for new allocations. Selecting the optimal pages to free is possibly the single most important function any VM system can perform, because if it makes a poor selection, the VM system may be forced to unnecessarily retrieve pages from disk, seriously degrading system performance. How much overhead are we willing to suffer in the critical path to avoid freeing the wrong page? Each wrong choice we make will cost us hundreds of thousands of CPU cycles and a noticeable stall of the affected processes, so we are willing to endure a significant amount of overhead in order to be sure that the right page is chosen. This is why FreeBSD tends to outperform other systems when memory resources become stressed. The free page determination algorithm is built upon a history of the use of memory pages. To acquire this history, the system takes advantage of a page-used bit feature that most hardware page tables have. In any case, the page-used bit is cleared and at some later point the VM system comes across the page again and sees that the page-used bit has been set. This indicates that the page is still being actively used. If the bit is still clear it is an indication that the page is not being actively used. By testing this bit periodically, a use history (in the form of a counter) for the physical page is developed. When the VM system later needs to free up some pages, checking this history becomes the cornerstone of determining the best candidate page to reuse. What if the hardware has no page-used bit? For those platforms that do not have this feature, the system actually emulates a page-used bit. It unmaps or protects a page, forcing a page fault if the page is accessed again. When the page fault is taken, the system simply marks the page as having been used and unprotects the page so that it may be used. While taking such page faults just to determine if a page is being used appears to be an expensive proposition, it is much less expensive than reusing the page for some other purpose only to find that a process needs it back and then have to go to disk. FreeBSD makes use of several page queues to further refine the selection of pages to reuse as well as to determine when dirty pages must be flushed to their backing store. Since page tables are dynamic entities under FreeBSD, it costs virtually nothing to unmap a page from the address space of any processes using it. When a page candidate has been chosen based on the page-use counter, this is precisely what is done. The system must make a distinction between clean pages which can theoretically be freed up at any time, and dirty pages which must first be written to their backing store before being reusable. When a page candidate has been found it is moved to the inactive queue if it is dirty, or the cache queue if it is clean. A separate algorithm based on the dirty-to-clean page ratio determines when dirty pages in the inactive queue must be flushed to disk. Once this is accomplished, the flushed pages are moved from the inactive queue to the cache queue. 
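A hedged sketch of the scan step just described, using era-appropriate helpers such as pmap_ts_referenced(); the real logic lives in vm_pageout.c and weighs many more factors:

    /*
     * Hedged sketch: age one page's use history and queue it for reuse.
     * The weighting constants and queue transitions are simplified
     * approximations of what vm_pageout.c actually does.
     */
    static void
    consider_page(vm_page_t m)
    {
            if (pmap_ts_referenced(m)) {    /* page-used bit(s) were set? */
                    m->act_count += 3;      /* strengthen the use history */
                    return;                 /* actively used: keep it */
            }
            if (m->act_count > 0) {
                    m->act_count--;         /* history says: not yet */
                    return;
            }
            pmap_remove_all(m);             /* unmap from all address spaces */
            if (m->dirty != 0)
                    vm_page_deactivate(m);  /* inactive: must be flushed first */
            else
                    vm_page_cache(m);       /* cache: immediately freeable */
    }
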
At this point, pages in the cache queue can still be reactivated by a VM fault at relatively low cost. However, pages in the cache queue are considered to be immediately freeable and will be reused in an LRU (least-recently used) fashion when the system needs to allocate new memory. It is important to note that the FreeBSD VM system attempts to separate clean and dirty pages for the express reason of avoiding unnecessary flushes of dirty pages (which eats I/O bandwidth), and it does not move pages between the various page queues gratuitously when the memory subsystem is not being stressed. This is why you will see some systems with very low cache queue counts and high active queue counts when running the systat -vm command. As the VM system becomes more stressed, it makes a greater effort to maintain the various page queues at the levels determined to be the most effective. An urban myth has circulated for years that Linux did a better job avoiding swapouts than FreeBSD, but this in fact is not true. What was actually occurring was that FreeBSD was proactively paging out unused pages in order to make room for more disk cache while Linux was keeping unused pages in core and leaving less memory available for cache and process pages. I do not know whether this is still true today. Pre-Faulting and Zeroing Optimizations Taking a VM fault is not expensive if the underlying page is already in core and can simply be mapped into the process, but it can become expensive if you take a whole lot of them on a regular basis. A good example of this is running a program such as &man.ls.1; or &man.ps.1; over and over again. If the program binary is mapped into memory but not mapped into the page table, then all the pages that will be accessed by the program will have to be faulted in every time the program is run. This is unnecessary when the pages in question are already in the VM Cache, so FreeBSD will attempt to pre-populate a process's page tables with those pages that are already in the VM Cache. One thing that FreeBSD does not yet do is pre-copy-on-write certain pages on exec. For example, if you run the &man.ls.1; program while running vmstat 1 you will notice that it always takes a certain number of page faults, even when you run it over and over again. These are zero-fill faults, not program code faults (which were pre-faulted in already). Pre-copying pages on exec or fork is an area that could use more study. A large percentage of page faults that occur are zero-fill faults. You can usually see this by observing the vmstat -s output. These occur when a process accesses pages in its BSS area. The BSS area is expected to be initially zero but the VM system does not bother to allocate any memory at all until the process actually accesses it. When a fault occurs, the VM system must not only allocate a new page, it must zero it as well. To optimize the zeroing operation the VM system has the ability to pre-zero pages and mark them as such, and to request pre-zeroed pages when zero-fill faults occur. The pre-zeroing occurs whenever the CPU is idle but the number of pages the system pre-zeros is limited in order to avoid blowing away the memory caches. This is an excellent example of adding complexity to the VM system in order to optimize the critical path. Page Table Optimizations The page table optimizations make up the most contentious part of the FreeBSD VM design and they have shown some strain with the advent of serious use of mmap(). 
I think this is actually a feature of most BSDs, though I am not sure when it was first introduced. There are two major optimizations. The first is that hardware page tables do not contain persistent state but instead can be thrown away at any time with only a minor amount of management overhead. The second is that every active page table entry in the system has a governing pv_entry structure which is tied into the vm_page structure. FreeBSD can simply iterate through those mappings that are known to exist while Linux must check all page tables that might contain a specific mapping to see if it does, which can achieve O(n^2) overhead in certain situations. It is because of this that FreeBSD tends to make better choices on which pages to reuse or swap when memory is stressed, giving it better performance under load. However, FreeBSD requires kernel tuning to accommodate large-shared-address-space situations such as those that can occur in a news system because it may run out of pv_entry structures. Both Linux and FreeBSD need work in this area. FreeBSD is trying to maximize the advantage of a potentially sparse active-mapping model (not all processes need to map all pages of a shared library, for example), whereas Linux is trying to simplify its algorithms. FreeBSD generally has the performance advantage here at the cost of wasting a little extra memory, but FreeBSD breaks down in the case where a large file is massively shared across hundreds of processes. Linux, on the other hand, breaks down in the case where many processes are sparsely-mapping the same shared library and also runs non-optimally when trying to determine whether a page can be reused or not. Page Coloring We will end with the page coloring optimizations. Page coloring is a performance optimization designed to ensure that accesses to contiguous pages in virtual memory make the best use of the processor cache. In ancient times (i.e., 10+ years ago) processor caches tended to map virtual memory rather than physical memory. This led to a huge number of problems including having to clear the cache on every context switch in some cases, and problems with data aliasing in the cache. Modern processor caches map physical memory precisely to solve those problems. This means that two side-by-side pages in a process's address space may not correspond to two side-by-side pages in the cache. In fact, if you are not careful, side-by-side pages in virtual memory could wind up using the same page in the processor cache—leading to cacheable data being thrown away prematurely and reducing CPU performance. This is true even with multi-way set-associative caches (though the effect is mitigated somewhat). FreeBSD's memory allocation code implements page coloring optimizations, which means that the memory allocation code will attempt to locate free pages that are contiguous from the point of view of the cache. For example, if page 16 of physical memory is assigned to page 0 of a process's virtual memory and the cache can hold 4 pages, the page coloring code will not assign page 20 of physical memory to page 1 of a process's virtual memory. It would, instead, assign page 21 of physical memory. The page coloring code attempts to avoid assigning page 20 because this maps over the same cache memory as page 16 and would result in non-optimal caching. This code adds a significant amount of complexity to the VM memory allocation subsystem as you can well imagine, but the result is well worth the effort. 
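To make the arithmetic in that example concrete, here is a toy sketch (not the allocator's real code) of the color test; the 4-page cache size is taken from the example above:

    /*
     * Toy sketch of the page-coloring test from the example above.
     * A physical page is a good fit for a virtual page when both fall
     * on the same cache color, i.e. the same index modulo cache size.
     */
    #define CACHE_PAGES 4           /* cache size in pages, per the example */

    static int
    color_matches(int phys_page, int virt_page)
    {
            return (phys_page % CACHE_PAGES == virt_page % CACHE_PAGES);
    }

With this test, physical page 21 matches virtual page 1 (both color 1), while physical page 20 does not (color 0, the color virtual page 0 already occupies via physical page 16).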
Page coloring makes VM memory as deterministic as physical memory with regard to cache performance. Conclusion Virtual memory in modern operating systems must address a number of different issues efficiently and for many different usage patterns. The modular and algorithmic approach that BSD has historically taken allows us to study and understand the current implementation as well as relatively cleanly replace large sections of the code. There have been a number of improvements to the FreeBSD VM system in the last several years, and work is ongoing. Bonus QA session by Allen Briggs <email>briggs@ninthwonder.com</email> What is the interleaving algorithm that you refer to in your listing of the ills of the FreeBSD 3.X swap arrangements? FreeBSD uses a fixed swap interleave, which defaults to 4. This means that FreeBSD reserves space for four swap areas even if you only have one, two, or three. Since swap is interleaved, the linear address space representing the four swap areas will be fragmented if you do not actually have four swap areas. For example, if you have two swap areas, A and B, FreeBSD's address space representation for the swap space will be interleaved in blocks of 16 pages: A B C D A B C D A B C D A B C D FreeBSD 3.X uses a sequential list of free regions approach to accounting for the free swap areas. The idea is that large blocks of free linear space can be represented with a single list node (kern/subr_rlist.c). But due to the fragmentation, the sequential list winds up being insanely fragmented. In the above example, completely unused swap will have A and B shown as free and C and D shown as all allocated. Each A-B sequence requires its own list node because C and D are holes, so the list node cannot be combined with the next A-B sequence. Why do we interleave our swap space instead of just tacking swap areas onto the end and doing something fancier? Because it is a whole lot easier to allocate linear swaths of an address space and have the result automatically be interleaved across multiple disks than it is to try to put that sophistication elsewhere. The fragmentation causes other problems. Since the free map is a linear list under 3.X, and has such a huge amount of inherent fragmentation, allocating and freeing swap winds up being an O(N) algorithm instead of an O(1) algorithm. Combine that with other factors (heavy swapping) and you start getting into O(N^2) and O(N^3) levels of overhead, which is bad. The 3.X system may also need to allocate KVM during a swap operation to create a new list node, which can lead to a deadlock if the system is trying to page out pages in a low-memory situation. Under 4.X we do not use a sequential list. Instead we use a radix tree and bitmaps of swap blocks rather than ranged list nodes. We take the hit of preallocating all the bitmaps required for the entire swap area up front, but it winds up wasting less memory due to the use of a bitmap (one bit per block) instead of a linked list of nodes. The use of a radix tree instead of a sequential list gives us nearly O(1) performance no matter how fragmented the tree becomes. I do not get the following:
It is important to note that the FreeBSD VM system attempts to separate clean and dirty pages for the express reason of avoiding unnecessary flushes of dirty pages (which eat I/O bandwidth), and it does not move pages between the various page queues gratuitously when the memory subsystem is not being stressed. This is why you will see some systems with very low cache queue counts and high active queue counts when doing a systat -vm command.
How is the separation of clean and dirty (inactive) pages related to the situation where you see low cache queue counts and high active queue counts in systat -vm? Do the systat stats roll the active and dirty pages together for the active queue count?
Yes, that is confusing. The relationship is goal versus reality. Our goal is to separate the pages but the reality is that if we are not in a memory crunch, we do not really have to. What this means is that FreeBSD will not try very hard to separate out dirty pages (inactive queue) from clean pages (cache queue) when the system is not being stressed, nor will it try to deactivate pages (active queue -> inactive queue) when the system is not being stressed, even if they are not being used.
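To see this behavior for yourself, compare the page queue counters on an idle system and again under memory load (a minimal illustration; the exact counters and layout vary by release):

	&prompt.root; systat -vm
	&prompt.root; vmstat -s

On an unstressed machine, expect a large active queue count and a small cache queue count; the counts shift toward the inactive and cache queues only when memory pressure forces the separation described above.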
In the &man.ls.1; / vmstat 1 example, would not some of the page faults be data page faults (COW from executable file to private page)? I.e., I would expect the page faults to be some zero-fill and some program data. Or are you implying that FreeBSD does do pre-COW for the program data? A COW fault can be either zero-fill or program-data. The mechanism is the same either way because the backing program-data is almost certainly already in the cache. I am indeed lumping the two together. FreeBSD does not pre-COW program data or zero-fill, but it does pre-map pages that exist in its cache. In your section on page table optimizations, can you give a little more detail about pv_entry and vm_page (or should vm_page be vm_pmap—as in 4.4, cf. pp. 180-181 of McKusick, Bostic, Karels, Quarterman)? Specifically, what kind of operation/reaction would require scanning the mappings? How does Linux do in the case where FreeBSD breaks down (sharing a large file mapping over many processes)? A vm_page represents an (object,index#) tuple. A pv_entry represents a hardware page table entry (pte). If you have five processes sharing the same physical page, and three of those processes' page tables actually map the page, that page will be represented by a single vm_page structure and three pv_entry structures. pv_entry structures only represent pages mapped by the MMU (one pv_entry represents one pte). This means that when we need to remove all hardware references to a vm_page (in order to reuse the page for something else, page it out, clear it, dirty it, and so forth) we can simply scan the linked list of pv_entry's associated with that vm_page to remove or modify the pte's from their page tables. Under Linux there is no such linked list. In order to remove all the hardware page table mappings for a vm_page, Linux must index into every VM object that might have mapped the page. For example, if you have 50 processes all mapping the same shared library and want to get rid of page X in that library, you need to index into the page table for each of those 50 processes even if only 10 of them have actually mapped the page. So Linux is trading off the simplicity of its design against performance. Many VM algorithms which are O(1) or O(small N) under FreeBSD wind up being O(N), O(N^2), or worse under Linux. Since the pte's representing a particular page in an object tend to be at the same offset in all the page tables they are mapped in, reducing the number of accesses into the page tables at the same pte offset will often avoid blowing away the L1 cache line for that offset, which can lead to better performance. FreeBSD has added complexity (the pv_entry scheme) in order to increase performance (to limit page table accesses to only those pte's that need to be modified). But FreeBSD has a scaling problem that Linux does not in that there are a limited number of pv_entry structures and this causes problems when you have massive sharing of data. In this case you may run out of pv_entry structures even though there is plenty of free memory available. This can be fixed easily enough by bumping up the number of pv_entry structures in the kernel config, but we really need to find a better way to do it. In regard to the memory overhead of a page table versus the pv_entry scheme: Linux uses permanent page tables that are not thrown away, but does not need a pv_entry for each potentially mapped pte. FreeBSD uses throwaway page tables but adds in a pv_entry structure for each actually-mapped pte.
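As an aside, the pv_entry walk described above can be sketched in a few lines of C. This is illustrative only; the structure layouts, the vm_offset_t stand-in, and the pte_clear() and free_pv_entry() helpers are simplified placeholders, not the kernel's real pmap and vm definitions:

	#include <sys/queue.h>

	typedef unsigned long vm_offset_t;	/* stands in for the real type */
	struct pmap;				/* opaque per-address-space page table */

	struct pv_entry {
		struct pmap	*pv_pmap;	/* page table containing the pte */
		vm_offset_t	 pv_va;		/* virtual address of the mapping */
		TAILQ_ENTRY(pv_entry) pv_list;	/* next mapping of the same page */
	};

	/* Each vm_page carries the head of its pv list (simplified). */
	TAILQ_HEAD(pv_head, pv_entry);

	void pte_clear(struct pmap *, vm_offset_t);	/* illustrative helper */
	void free_pv_entry(struct pv_entry *);		/* illustrative helper */

	/*
	 * Remove every hardware mapping of a page by walking its pv list,
	 * rather than scanning every process's page tables.
	 */
	void
	page_remove_all(struct pv_head *pvh)
	{
		struct pv_entry *pv;

		while ((pv = TAILQ_FIRST(pvh)) != NULL) {
			pte_clear(pv->pv_pmap, pv->pv_va);
			TAILQ_REMOVE(pvh, pv, pv_list);
			free_pv_entry(pv);
		}
	}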
I think memory utilization winds up being about the same, giving FreeBSD an algorithmic advantage with its ability to throw away page tables at will with very low overhead. Finally, in the page coloring section, it might help to have a little more description of what you mean here. I did not quite follow it. Do you know how an L1 hardware memory cache works? I will explain: Consider a machine with 16MB of main memory but only 128K of L1 cache. Generally the way this cache works is that each 128K block of main memory uses the same 128K of cache. If you access offset 0 in main memory and then offset 128K in main memory you can wind up throwing away the cached data you read from offset 0! Now, I am simplifying things greatly. What I just described is what is called a direct mapped hardware memory cache. Most modern caches are what are called 2-way-set-associative or 4-way-set-associative caches. The set associativity allows you to access up to N different memory regions that overlap the same cache memory without destroying the previously cached data. But only N. So if I have a 4-way set associative cache I can access offset 0, offset 128K, 256K and offset 384K and still be able to access offset 0 again and have it come from the L1 cache. If I then access offset 512K, however, one of the four previously cached data objects will be thrown away by the cache. It is extremely important… extremely important for most of a processor's memory accesses to be able to come from the L1 cache, because the L1 cache operates at the processor frequency. The moment you have an L1 cache miss and have to go to the L2 cache or to main memory, the processor will stall and potentially sit twiddling its thumbs for hundreds of instructions worth of time waiting for a read from main memory to complete. Main memory (the dynamic RAM you stuff into a computer) is slow, when compared to the speed of a modern processor core. Ok, so now onto page coloring: All modern memory caches are what are known as physical caches. They cache physical memory addresses, not virtual memory addresses. This allows the cache to be left alone across a process context switch, which is very important. But in the Unix world you are dealing with virtual address spaces, not physical address spaces. Any program you write will see the virtual address space given to it. The actual physical pages underlying that virtual address space are not necessarily physically contiguous! In fact, you might have two pages that are side by side in a process's address space which wind up being at offset 0 and offset 128K in physical memory. A program normally assumes that two side-by-side pages will be optimally cached. That is, that you can access data objects in both pages without having them blow away each other's cache entry. But this is only true if the physical pages underlying the virtual address space are contiguous (insofar as the cache is concerned). This is what page coloring does. Instead of assigning random physical pages to virtual addresses, which may result in non-optimal cache performance, page coloring assigns reasonably-contiguous physical pages to virtual addresses. Thus programs can be written under the assumption that the characteristics of the underlying hardware cache are the same for their virtual address space as they would be if the program had been run directly in a physical address space. Note that I say reasonably contiguous rather than simply contiguous.
From the point of view of a 128K direct mapped cache, the physical address 0 maps to the same cache memory as the physical address 128K. So two side-by-side pages in your virtual address space may wind up being offset 128K and offset 132K in physical memory, but could also easily be offset 128K and offset 4K in physical memory and still retain the same cache performance characteristics. So page coloring does not have to assign truly contiguous pages of physical memory to contiguous pages of virtual memory; it just needs to make sure it assigns contiguous pages from the point of view of cache performance and operation.
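To make the arithmetic concrete, here is a minimal C sketch of how an allocator can compute a page's color; the constants and names are illustrative for the 128K example above, not the kernel's:

	#define PAGE_SHIFT	12				/* 4K pages */
	#define CACHE_SIZE	(128 * 1024)			/* example 128K direct mapped cache */
	#define NCOLORS		(CACHE_SIZE >> PAGE_SHIFT)	/* 32 colors */

	/* Physical pages with equal colors compete for the same cache lines. */
	static inline unsigned int
	page_color(unsigned long phys_page_number)
	{
		return (phys_page_number % NCOLORS);
	}

A coloring allocator then satisfies a fault on virtual page N by preferring a free physical page whose page_color() equals N % NCOLORS. This is exactly why, in the example above, physical offsets 132K and 4K are interchangeable for the second page: both are color 1, adjacent to the first page's color 0.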
diff --git a/en_US.ISO8859-1/articles/zip-drive/article.sgml b/en_US.ISO8859-1/articles/zip-drive/article.sgml index 9a9871b060..5330a94857 100644 --- a/en_US.ISO8859-1/articles/zip-drive/article.sgml +++ b/en_US.ISO8859-1/articles/zip-drive/article.sgml @@ -1,273 +1,276 @@ %man; + +%freebsd; + ]>
ZIP Drives Jason Bacon
acadix@execpc.com
ZIP Drive Basics ZIP disks are high capacity, removable, magnetic disks, which can be read or written by ZIP drives from IOMEGA Corporation. ZIP disks are similar to floppy disks, except that they are much faster, and have a much greater capacity. While floppy disks typically hold 1.44 megabytes, ZIP disks are available in two sizes, namely 100 megabytes and 250 megabytes. ZIP drives should not be confused with the super-floppy, a 120 megabyte floppy drive which also handles traditional 1.44 megabyte floppies. IOMEGA also sells a higher capacity, higher performance drive called the JAZZ drive. JAZZ drives come in 1 gigabyte and 2 gigabyte sizes. ZIP drives are available as internal or external units, using one of three interfaces: The SCSI (Small Computer System Interface) interface is the fastest, most sophisticated, most expandable, and most expensive interface. The SCSI interface is used by all types of computers from PC's to RISC workstations to minicomputers, to connect all types of peripherals such as disk drives, tape drives, scanners, and so on. SCSI ZIP drives may be internal or external, assuming your host adapter has an external connector. If you are using an external SCSI device, it is important never to connect or disconnect it from the SCSI bus while the computer is running. Doing so may cause file-system damage on the disks that remain connected. If you want maximum performance and easy setup, the SCSI interface is the best choice. This will probably require adding a SCSI host adapter, since most PC's (except for high-performance servers) do not have built-in SCSI support. Each SCSI host adapter can support either 7 or 15 SCSI devices, depending on the model. Each SCSI device has its own controller, and these controllers are fairly intelligent and well standardized, so from the operating system's point of view, all SCSI disk drives look about the same, as do all SCSI tape drives, etc. To support SCSI devices, the operating system need only have a driver for the particular host adapter, and a generic driver for each type of device, i.e. a SCSI disk driver, SCSI tape driver, and so on. There are some SCSI devices that can be better utilized with specialized drivers (e.g. DAT tape drives), but they tend to work OK with the generic driver, too. It is just that the generic drivers may not support some of the special features. Using a SCSI ZIP drive is simply a matter of determining which device file in the /dev directory represents the ZIP drive. This can be determined by looking at the boot messages while FreeBSD is booting (or in /var/log/messages after booting), where you will see a line something like this: da1: <IOMEGA ZIP 100 D.13> Removable Direct Access SCSI-2 Device This means that the ZIP drive is represented by the file /dev/da1. The IDE (Integrated Drive Electronics) interface is a low-cost disk drive interface used by many desktop PC's. Most IDE devices are strictly internal. Performance of IDE ZIP drives is comparable to SCSI ZIP drives. (The IDE interface is not as fast as SCSI, but ZIP drive performance is limited mainly by the mechanics of the drive, not by the bus interface.) The drawback of the IDE interface is the limitations it imposes. Most IDE adapters can only support 2 devices, and IDE interfaces are not typically designed for the long term. For example, the original IDE interface would not support hard disks with more than 1024 cylinders, which forced a lot of people to upgrade their hardware prematurely.
If you have plans to expand your PC by adding another disk, a tape drive, or scanner, you may want to invest in a SCSI host adapter and a SCSI ZIP drive to avoid problems in the future. IDE devices in FreeBSD are prefixed with an a. For example, an IDE hard disk might be /dev/ad0, an IDE (ATAPI) CDROM might be /dev/acd1, and so on. The parallel port interface is popular for portable external devices such as external ZIP drives and scanners, because virtually every computer has a standard parallel port (usually used for printers). This makes it easy to transfer data between multiple computers by toting around a ZIP drive. Performance will generally be slower than a SCSI or IDE ZIP drive, since it is limited by the speed of the parallel port. Parallel port speed varies considerably between various computers, and can often be configured in the system BIOS. Some machines will also require BIOS configuration to operate the parallel port in bidirectional mode. (Parallel ports were originally designed only for output to printers.) Parallel ZIP: The <devicename>vpo</devicename> Driver To use a parallel-port ZIP drive under FreeBSD, the vpo driver must be configured into the kernel. Parallel port ZIP drives also have a built-in SCSI controller. The vpo driver allows the FreeBSD kernel to communicate with the ZIP drive's SCSI controller through the parallel port. Since the vpo driver is not a standard part of the kernel (as of FreeBSD 3.2), you will need to rebuild the kernel to enable this device. The process of building a kernel is outlined in detail in another section. The following steps outline the process in brief for the purpose of enabling the vpo driver: Run /stand/sysinstall, and install the kernel source code on your system. Create a custom kernel configuration that includes the vpo driver: &prompt.root; cd /sys/i386/conf &prompt.root; cp GENERIC MYKERNEL Edit MYKERNEL, change the ident line to MYKERNEL, and uncomment the line describing the vpo driver. If you have a second parallel port, you may need to copy the section for ppc0 to create a ppc1 device. The second parallel port usually uses IRQ 5 and address 378. Only the IRQ is required in the config file. If your root hard disk is a SCSI disk, you might run into a problem with probing order, which will cause the system to attempt to use the ZIP drive as the root device. This will cause a boot failure, unless you happen to have a FreeBSD root file-system on your ZIP disk! In this case, you will need to wire down the root disk, i.e. force the kernel to bind a specific device to /dev/da0, the root SCSI disk. It will then assign the ZIP disk to the next available SCSI disk, e.g. /dev/da1. To wire down your SCSI hard drive as da0, change the line device da0 to disk da0 at scbus0 target 0 unit 0 You may need to change the target above to match the SCSI ID of your disk drive. You should also wire down the scbus0 entry to your controller. For example, if you have an Adaptec 15xx controller, you would change controller scbus0 to controller scbus0 at aha0 Finally, since you are creating a custom kernel configuration, you can take the opportunity to remove all the unnecessary drivers. This should be done with a great deal of caution, and only if you feel confident about making modifications to your kernel configuration. Removing unnecessary drivers will reduce the kernel size, leaving more memory available for your applications.
To determine which drivers are not needed, go to the end of the file /var/log/messages, and look for lines reading "not found". Then, comment out these devices in your config file. You can also change other options to reduce the size and increase the speed of your kernel. Read the section on rebuilding your kernel for more complete information. Now it is time to compile the kernel: &prompt.root; /usr/sbin/config MYKERNEL &prompt.root; cd ../../compile/MYKERNEL &prompt.root; make clean depend && make all install After the kernel is rebuilt, you will need to reboot. Make sure the ZIP drive is connected to the parallel port before the boot begins. You should see the ZIP drive show up in the boot messages as device vpo0 or vpo1, depending on which parallel port the drive is attached to. It should also show which device file the ZIP drive has been bound to. This will be /dev/da0 if you have no other SCSI disks in the system, or /dev/da1 if you have a SCSI hard disk wired down as the root device. Mounting ZIP disks To access the ZIP disk, you simply mount it like any other disk device. The file-system is represented as slice 4 on the device, so for SCSI or parallel ZIP disks, you would use: &prompt.root; mount_msdos /dev/da1s4 /mnt For IDE ZIP drives, use: &prompt.root; mount_msdos /dev/ad1s4 /mnt It will also be helpful to update /etc/fstab to make mounting easier. Add a line like the following, edited to suit your system: /dev/da1s4 /zip msdos rw,noauto 0 0 and create the directory /zip. Then, you can mount simply by typing &prompt.root; mount /zip and unmount by typing &prompt.root; umount /zip For more information on the format of /etc/fstab, see &man.fstab.5;. You can also create a FreeBSD file-system on the ZIP disk using &man.newfs.8;. However, the disk will only be usable on a FreeBSD - system, or perhaps a few other Unix clones that recognize FreeBSD + system, or perhaps a few other &unix; clones that recognize FreeBSD file-systems. (Definitely not DOS or Windows.)
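As a brief illustration, replacing the DOS file-system on the disk with a FreeBSD file-system might look like this (a sketch only, assuming the drive is da1 with the factory single-slice layout; on releases that still have separate block and character devices, &man.newfs.8; expects the raw device, /dev/rda1s4):

	&prompt.root; umount /zip
	&prompt.root; newfs /dev/da1s4
	&prompt.root; mount /dev/da1s4 /zip

If you do this, remember to change the file-system type in the /etc/fstab entry from msdos to ufs.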
diff --git a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml index c6c78e0feb..14c53dc77c 100644 --- a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml +++ b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml @@ -1,957 +1,959 @@ %man; %authors; %misc; + +%freebsd; ]>
SMPng Design Document John Baldwin Robert Watson $FreeBSD$ 2002 2003 John Baldwin Robert Watson This document presents the current design and implementation of the SMPng Architecture. First, the basic primitives and tools are introduced. Next, a general architecture for the FreeBSD kernel's synchronization and execution model is laid out. Then, locking strategies for specific subsystems are discussed, documenting the approaches taken to introduce fine-grained synchronization and parallelism for each subsystem. Finally, detailed implementation notes are provided to motivate design choices, and make the reader aware of important implications involving the use of specific primitives. Introduction This document is a work-in-progress, and will be updated to reflect on-going design and implementation activities associated with the SMPng Project. Many sections currently exist only in outline form, but will be fleshed out as work proceeds. Updates or suggestions regarding the document may be directed to the document editors. The goal of SMPng is to allow concurrency in the kernel. The kernel is basically one rather large and complex program. To make the kernel multi-threaded we use some of the same tools used to make other programs multi-threaded. These include mutexes, shared/exclusive locks, semaphores, and condition variables. For the definitions of these and other SMP-related terms, please see the section of this article. Basic Tools and Locking Fundamentals Atomic Instructions and Memory Barriers There are several existing treatments of memory barriers and atomic instructions, so this section will not include a lot of detail. To put it simply, one can not go around reading variables without a lock if a lock is used to protect writes to that variable. This becomes obvious when you consider that memory barriers simply determine relative order of memory operations; they do not make any guarantee about timing of memory operations. That is, a memory barrier does not force the contents of a CPU's local cache or store buffer to flush. Instead, the memory barrier at lock release simply ensures that all writes to the protected data will be visible to other CPU's or devices if the write to release the lock is visible. The CPU is free to keep that data in its cache or store buffer as long as it wants. However, if another CPU performs an atomic instruction on the same datum, the first CPU must guarantee that the updated value is made visible to the second CPU along with any other operations that memory barriers may require. For example, assuming a simple model where data is considered visible when it is in main memory (or a global cache), when an atomic instruction is triggered on one CPU, other CPU's store buffers and caches must flush any writes to that same cache line along with any pending operations behind a memory barrier. This requires one to take special care when using an item protected by atomic instructions. For example, in the sleep mutex implementation, we have to use an atomic_cmpset rather than an atomic_set to turn on the MTX_CONTESTED bit. The reason is that we read the value of mtx_lock into a variable and then make a decision based on that read. However, the value we read may be stale, or it may change while we are making our decision. Thus, when the atomic_set executed, it may end up setting the bit on another value than the one we made the decision on. Thus, we have to use an atomic_cmpset to set the value only if the value we made the decision on is up-to-date and valid. 
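The compare-and-set retry pattern that this implies looks roughly like the following minimal C sketch. It is illustrative only, not the real sys/mutex.h code; the function name and the flag value are made up for the example:

	#include <sys/types.h>
	#include <machine/atomic.h>	/* atomic_cmpset_int() */

	#define MTX_CONTESTED	0x02	/* illustrative flag value */

	static void
	mark_contested(volatile u_int *lockp)
	{
		u_int v;

		do {
			v = *lockp;	/* snapshot; may be stale by the time we act */
			/*
			 * atomic_cmpset_int() stores (v | MTX_CONTESTED) only if
			 * *lockp still equals v; if another CPU changed the word
			 * in the meantime, the store fails and we retry with a
			 * fresh value.
			 */
		} while (atomic_cmpset_int(lockp, v, v | MTX_CONTESTED) == 0);
	}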
Finally, atomic instructions only allow one item to be updated or read. If one needs to atomically update several items, then a lock must be used instead. For example, if two counters must be read and have values that are consistent relative to each other, then those counters must be protected by a lock rather than by separate atomic instructions. Read Locks versus Write Locks Read locks do not need to be as strong as write locks. Both types of locks need to ensure that the data they are accessing is not stale. However, only write access requires exclusive access. Multiple threads can safely read a value. Using different types of locks for reads and writes can be implemented in a number of ways. First, sx locks can be used in this manner by using an exclusive lock when writing and a shared lock when reading. This method is quite straightforward. A second method is a bit more obscure. You can protect a datum with multiple locks. Then, to read that data, you simply need to hold a read lock on one of the locks. However, to write to the data, you need to hold a write lock on all of the locks. This can make writing rather expensive but can be useful when data is accessed in various ways. For example, the parent process pointer is protected by both the proctree_lock sx lock and the per-process mutex. Sometimes the proc lock is easier, as we are just checking the parent of a process that we already have locked. However, other places such as inferior need to walk the tree of processes via parent pointers, and locking each process would be prohibitive; it would also be a pain to guarantee that the condition being checked remained valid for both the check and the actions taken as a result of the check. Locking Conditions and Results If you need a lock to check the state of a variable so that you can take an action based on the state you read, you can not just hold the lock while reading the variable and then drop the lock before you act on the value you read. Once you drop the lock, the variable can change, rendering your decision invalid. Thus, you must hold the lock both while reading the variable and while performing the action as a result of the test. General Architecture and Design Interrupt Handling - Following the pattern of several other multi-threaded Unix + Following the pattern of several other multi-threaded &unix; kernels, FreeBSD deals with interrupt handlers by giving them their own thread context. Providing a context for interrupt handlers allows them to block on locks. To help avoid latency, however, interrupt threads run at real-time kernel priority. Thus, interrupt handlers should not execute for very long to avoid starving other kernel threads. In addition, since multiple handlers may share an interrupt thread, interrupt handlers should not sleep or use a sleepable lock to avoid starving another interrupt handler. The interrupt threads currently in FreeBSD are referred to as heavyweight interrupt threads. They are called this because switching to an interrupt thread involves a full context switch. In the initial implementation, the kernel was not preemptive and thus interrupts that interrupted a kernel thread would have to wait until the kernel thread blocked or returned to userland before they would have an opportunity to run. To deal with the latency problems, the kernel in FreeBSD has been made preemptive. Currently, we only preempt a kernel thread when we release a sleep mutex or when an interrupt comes in.
However, the plan is to make the FreeBSD kernel fully preemptive as described below. Not all interrupt handlers execute in a thread context. Instead, some handlers execute directly in primary interrupt context. These interrupt handlers are currently misnamed fast interrupt handlers since the INTR_FAST flag used in earlier versions of the kernel is used to mark these handlers. The only interrupts which currently use these types of interrupt handlers are clock interrupts and serial I/O device interrupts. Since these handlers do not have their own context, they may not acquire blocking locks and thus may only use spin mutexes. Finally, there is one optional optimization that can be added in MD code called lightweight context switches. Since an interrupt thread executes in a kernel context, it can borrow the vmspace of any process. Thus, in a lightweight context switch, the switch to the interrupt thread does not switch vmspaces but borrows the vmspace of the interrupted thread. In order to ensure that the vmspace of the interrupted thread does not disappear out from under us, the interrupted thread is not allowed to execute until the interrupt thread is no longer borrowing its vmspace. This can happen when the interrupt thread either blocks or finishes. If an interrupt thread blocks, then it will use its own context when it is made runnable again. Thus, it can release the interrupted thread. The cons of this optimization are that it is very machine specific and complex, and thus only worth the effort if there is a large performance improvement. At this point it is probably too early to tell, and in fact, it will probably hurt performance as almost all interrupt handlers will immediately block on Giant and require a thread fix-up when they block. Also, an alternative method of interrupt handling has been proposed by Mike Smith that works like so: Each interrupt handler has two parts: a predicate which runs in primary interrupt context and a handler which runs in its own thread context. If an interrupt handler has a predicate, then when an interrupt is triggered, the predicate is run. If the predicate returns true then the interrupt is assumed to be fully handled and the kernel returns from the interrupt. If the predicate returns false or there is no predicate, then the threaded handler is scheduled to run. Fitting light weight context switches into this scheme might prove rather complicated. Since we may want to change to this scheme at some point in the future, it is probably best to defer work on light weight context switches until we have settled on the final interrupt handling architecture and determined how light weight context switches might or might not fit into it. Kernel Preemption and Critical Sections Kernel Preemption in a Nutshell Kernel preemption is fairly simple. The basic idea is that a CPU should always be doing the highest priority work available. Well, that is the ideal at least. There are a couple of cases where achieving the ideal is not worth the expense. Implementing full kernel preemption is very straightforward: when you schedule a thread to be executed by putting it on a runqueue, you check to see if its priority is higher than that of the currently executing thread. If so, you initiate a context switch to that thread. While locks can protect most data in the case of a preemption, not all of the kernel is preemption safe.
For example, if a thread holding a spin mutex is preempted and the new thread attempts to grab the same spin mutex, the new thread may spin forever as the interrupted thread may never get a chance to execute. Also, some code, such as the code to assign an address space number for a process during exec() on the Alpha, must not be preempted as it supports the actual context switch code. Preemption is disabled for these code sections by using a critical section. Critical Sections The responsibility of the critical section API is to prevent context switches inside of a critical section. With a fully preemptive kernel, every setrunqueue of a thread other than the current thread is a preemption point. One implementation is for critical_enter to set a per-thread flag that is cleared by its counterpart. If setrunqueue is called with this flag set, it does not preempt regardless of the priority of the new thread relative to the current thread. However, since critical sections are used in spin mutexes to prevent context switches and multiple spin mutexes can be acquired, the critical section API must support nesting. For this reason the current implementation uses a nesting count instead of a single per-thread flag. In order to minimize latency, preemptions inside of a critical section are deferred rather than dropped. If a thread that would normally be preempted to is made runnable while the current thread is inside of a critical section, then a per-thread flag is set to indicate that there is a pending preemption. When the outermost critical section is exited, the flag is checked. If the flag is set, then the current thread is preempted to allow the higher priority thread to run. Interrupts pose a problem with regards to spin mutexes. If a low-level interrupt handler needs a lock, it must not interrupt any code holding that lock, to avoid possible data structure corruption. Currently, providing this mechanism is piggybacked onto the critical section API by means of the cpu_critical_enter and cpu_critical_exit functions. Currently this API disables and re-enables interrupts on all of FreeBSD's current platforms. This approach may not be purely optimal, but it is simple to understand and simple to get right. Theoretically, this second API need only be used for spin mutexes that are used in primary interrupt context. However, to make the code simpler, it is used for all spin mutexes and even all critical sections. It may be desirable to split out the MD API from the MI API and only use it in conjunction with the MI API in the spin mutex implementation. If this approach is taken, then the MD API would likely need a rename to show that it is a separate API now. Design Tradeoffs As mentioned earlier, a couple of trade-offs have been made to sacrifice cases where perfect preemption may not always provide the best performance. The first trade-off is that the preemption code does not take other CPUs into account. Suppose we have two CPU's, A and B, with the priority of A's thread as 4 and the priority of B's thread as 2. If CPU B makes a thread with priority 1 runnable, then in theory, we want CPU A to switch to the new thread so that we will be running the two highest priority runnable threads. However, the cost of determining which CPU to enforce the preemption on, signaling that CPU via an IPI, and performing the synchronization that would be required would be enormous. Thus, the current code would instead force CPU B to switch to the higher priority thread.
Note that this still puts the system in a better position as CPU B is executing a thread of priority 1 rather than a thread of priority 2. The second trade-off limits immediate kernel preemption to real-time priority kernel threads. In the simple case of preemption defined above, a thread is always preempted immediately (or as soon as a critical section is exited) if a higher priority thread is made runnable. However, many threads executing in the kernel only execute in a kernel context for a short time before either blocking or returning to userland. Thus, if the kernel preempts these threads to run another non-realtime kernel thread, the kernel may switch out the executing thread just before it is about to block or return to userland. The cache on the CPU must then adjust to the new thread. When the kernel returns to the preempted thread, it must refill all the cache information that was lost. In addition, two extra context switches are performed that could be avoided if the kernel deferred the preemption until the first thread blocked or returned to userland. Thus, by default, the preemption code will only preempt immediately if the higher priority thread is a real-time priority thread. Turning on full kernel preemption for all kernel threads has value as a debugging aid since it exposes more race conditions. It is especially useful on UP systems where many races are hard to simulate otherwise. Thus, there will be a kernel option to enable preemption for all kernel threads that can be used for debugging purposes. Thread Migration Simply put, a thread migrates when it moves from one CPU to another. In a non-preemptive kernel this can only happen at well-defined points such as when calling tsleep or returning to userland. However, in the preemptive kernel, an interrupt can force a preemption and possible migration at any time. This can have negative effects on per-CPU data since, with the exception of curthread and curpcb, the data can change whenever you migrate. Since you can potentially migrate at any time, this renders per-CPU data rather useless. Thus it is desirable to be able to disable migration for sections of code that need per-CPU data to be stable. Critical sections currently prevent migration since they do not allow context switches. However, this may be too strong a requirement to enforce in some cases since a critical section also effectively blocks interrupt threads on the current processor. As a result, it may be desirable to provide an API whereby code may indicate that if the current thread is preempted it should not migrate to another CPU. One possible implementation is to use a per-thread nesting count td_pinnest along with a td_pincpu which is updated to the current CPU on each context switch. Each CPU has its own run queue that holds threads pinned to that CPU. A thread is pinned when its nesting count is greater than zero, and a thread starts off unpinned with a nesting count of zero. When a thread is put on a runqueue, we check to see if it is pinned. If so, we put it on the per-CPU runqueue; otherwise, we put it on the global runqueue. When choosethread is called to retrieve the next thread, it could either always prefer bound threads to unbound threads or use some sort of bias when comparing priorities. If the nesting count is only ever written to by the thread itself and is only read by other threads when the owning thread is not executing but while holding the sched_lock, then td_pinnest will not need any other locks.
The migrate_disable function would increment the nesting count and migrate_enable would decrement the nesting count. Due to the locking requirements specified above, they will only operate on the current thread and thus would not need to handle the case of making a thread migratable that currently resides on a per-CPU run queue. It is still debatable whether this API is needed or if the critical section API is sufficient by itself. Many of the places that need to prevent migration also need to prevent preemption as well, and in those places a critical section must be used regardless. Callouts The timeout() kernel facility permits kernel services to register functions for execution as part of the softclock() software interrupt. Events are scheduled based on a desired number of clock ticks, and callbacks to the consumer-provided function will occur at approximately the right time. The global list of pending timeout events is protected by a global spin mutex, callout_lock; all access to the timeout list must be performed with this mutex held. When softclock() is woken up, it scans the list of pending timeouts for those that should fire. In order to avoid lock order reversal, the softclock thread will release the callout_lock mutex when invoking the provided timeout() callback function. If the CALLOUT_MPSAFE flag was not set during registration, then Giant will be grabbed before invoking the callout, and then released afterwards. The callout_lock mutex will be re-grabbed before proceeding. The softclock() code is careful to leave the list in a consistent state while releasing the mutex. If DIAGNOSTIC is enabled, then the time taken to execute each function is measured, and a warning is generated if it exceeds a threshold. Specific Locking Strategies Credentials struct ucred is the kernel's internal credential structure, and is generally used as the basis for process-driven access control within the kernel. BSD-derived systems use a copy-on-write model for credential data: multiple references may exist for a credential structure, and when a change needs to be made, the structure is duplicated, modified, and then the reference replaced. Due to widespread caching of the credential to implement access control on open, this results in substantial memory savings. With a move to fine-grained SMP, this model also saves substantially on locking operations by requiring that modification only occur on an unshared credential, avoiding the need for explicit synchronization when consuming a known-shared credential. Credential structures with a single reference are considered mutable; shared credential structures must not be modified or a race condition is risked. A mutex, cr_mtxp, protects the reference count of struct ucred so as to maintain consistency. Any use of the structure requires a valid reference for the duration of the use, or the structure may be released out from under the illegitimate consumer. The struct ucred mutex is a leaf mutex, and for performance reasons, is implemented via a mutex pool. Usually, credentials are used in a read-only manner for access control decisions, and in this case td_ucred is generally preferred because it requires no locking. When a process' credential is updated, the proc lock must be held across the check and update operations to avoid races. The process credential p_ucred must be used for check and update operations to prevent time-of-check, time-of-use races.
If system call invocations will perform access control after an update to the process credential, the value of td_ucred must also be refreshed to the current process value. This will prevent use of a stale credential following a change. The kernel automatically refreshes the td_ucred pointer in the thread structure from the process p_ucred whenever a process enters the kernel, permitting use of a fresh credential for kernel access control. File Descriptors and File Descriptor Tables Details to follow. Jail Structures struct prison stores administrative details pertinent to the maintenance of jails created using the &man.jail.2; API. This includes the per-jail hostname, IP address, and related settings. This structure is reference-counted since pointers to instances of the structure are shared by many credential structures. A single mutex, pr_mtx, protects read and write access to the reference count and all mutable variables inside the struct prison. Some variables are set only when the jail is created, and a valid reference to the struct prison is sufficient to read these values. The precise locking of each entry is documented via comments in sys/jail.h. MAC Framework The TrustedBSD MAC Framework maintains data in a variety of kernel objects, in the form of struct label. In general, labels in kernel objects are protected by the same lock as the remainder of the kernel object. For example, the v_label label in struct vnode is protected by the vnode lock on the vnode. In addition to labels maintained in standard kernel objects, the MAC Framework also maintains a list of registered and active policies. The policy list is protected by a global mutex (mac_policy_list_lock) and a busy count (also protected by the mutex). Since many access control checks may occur in parallel, entry to the framework for a read-only access to the policy list requires holding the mutex while incrementing (and later decrementing) the busy count. The mutex need not be held for the duration of the MAC entry operation: some operations, such as label operations on file system objects, are long-lived. To modify the policy list, such as during policy registration and de-registration, the mutex must be held and the reference count must be zero, to prevent modification of the list while it is in use. A condition variable, mac_policy_list_not_busy, is available to threads that need to wait for the list to become unbusy, but this condition variable must only be waited on if the caller is holding no other locks, or a lock order violation may be possible. The busy count, in effect, acts as a form of shared/exclusive lock over access to the framework: the difference is that, unlike with an sx lock, consumers waiting for the list to become unbusy may be starved, rather than permitting lock order problems with regard to the busy count and other locks that may be held on entry to (or inside) the MAC Framework. Modules For the module subsystem there exists a single lock that is used to protect the shared data. This lock is a shared/exclusive (SX) lock and has a good chance of needing to be acquired (shared or exclusively); therefore a few macros have been added to make access to the lock easier. These macros can be found in sys/module.h and are quite basic in terms of usage. The main structures protected under this lock are the module_t structures (when shared) and the global modulelist_t structure, modules. One should review the related source code in kern/kern_module.c to further understand the locking strategy.
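Both the module lock and the newbus lock described next are sx locks, so a short sketch of the general shared/exclusive pattern may help. The data being protected here is hypothetical; only the sx calls themselves are the standard API:

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/sx.h>

	/* Hypothetical shared data protected by a single sx lock. */
	struct foo {
		struct foo	*f_next;
	};

	static struct sx	 list_lock;
	static struct foo	*foo_list;

	void
	foo_init(void)
	{
		sx_init(&list_lock, "foo list");
	}

	/* Readers hold the shared lock; many can run in parallel. */
	int
	foo_count(void)
	{
		struct foo *fp;
		int count = 0;

		sx_slock(&list_lock);
		for (fp = foo_list; fp != NULL; fp = fp->f_next)
			count++;
		sx_sunlock(&list_lock);
		return (count);
	}

	/* Writers hold the exclusive lock, excluding readers and writers. */
	void
	foo_insert(struct foo *fp)
	{
		sx_xlock(&list_lock);
		fp->f_next = foo_list;
		foo_list = fp;
		sx_xunlock(&list_lock);
	}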
Newbus Device Tree The newbus system will have one sx lock. Readers will hold a shared (read) lock (&man.sx.slock.9;) and writers will hold an exclusive (write) lock (&man.sx.xlock.9;). Internal functions will not do locking at all. Externally visible ones will lock as needed. Those items where it does not matter if the race is won or lost will not be locked, since they tend to be read all over the place (e.g. &man.device.get.softc.9;). There will be relatively few changes to the newbus data structures, so a single lock should be sufficient and not impose a performance penalty. Pipes ... Processes and Threads - process hierarchy - proc locks, references - thread-specific copies of proc entries to freeze during system calls, including td_ucred - inter-process operations - process groups and sessions Scheduler Lots of references to sched_lock and notes pointing at specific primitives and related magic elsewhere in the document. Select and Poll The select() and poll() functions permit threads to block waiting on events on file descriptors--most frequently, whether or not the file descriptors are readable or writable. ... SIGIO The SIGIO service permits processes to request the delivery of a SIGIO signal to its process group when the read/write status of specified file descriptors changes. At most one process or process group is permitted to register for SIGIO from any given kernel object, and that process or group is referred to as the owner. Each object supporting SIGIO registration contains a pointer field that is NULL if the object is not registered, or points to a struct sigio describing the registration. This field is protected by a global mutex, sigio_lock. Callers to SIGIO maintenance functions must pass in this field by reference so that local register copies of the field are not made when unprotected by the lock. One struct sigio is allocated for each registered object associated with any process or process group, and contains back-pointers to the object, owner, signal information, a credential, and the general disposition of the registration. Each process or process group contains a list of registered struct sigio structures, p_sigiolst for processes, and pg_sigiolst for process groups. These lists are protected by the process or process group locks respectively. Most fields in each struct sigio are constant for the duration of the registration, with the exception of the sio_pgsigio field, which links the struct sigio into the process or process group list. Developers implementing new kernel objects supporting SIGIO will, in general, want to avoid holding structure locks while invoking SIGIO supporting functions, such as fsetown() or funsetown(), to avoid defining a lock order between structure locks and the global SIGIO lock. This is generally possible through use of an elevated reference count on the structure, such as reliance on a file descriptor reference to a pipe during a pipe operation. Sysctl The sysctl() MIB service is invoked from both within the kernel and from userland applications using a system call. At least two issues are raised in locking: first, the protection of the structures maintaining the namespace, and second, interactions with kernel variables and functions that are accessed by the sysctl interface. Since sysctl permits the direct export (and modification) of kernel statistics and configuration parameters, the sysctl mechanism must become aware of appropriate locking semantics for those variables.
Currently, sysctl makes use of a single global sx lock to serialize use of sysctl(); however, it is assumed to operate under Giant and other protections are not provided. The remainder of this section speculates on locking and semantic changes to sysctl. - Need to change the order of operations for sysctl's that update values from read old, copyin and copyout, write new to copyin, lock, read old and write new, unlock, copyout. Normal sysctl's that just copyout the old value and set a new value that they copyin may still be able to follow the old model. However, it may be cleaner to use the second model for all of the sysctl handlers to avoid lock operations. - To allow for the common case, a sysctl could embed a pointer to a mutex in the SYSCTL_FOO macros and in the struct. This would work for most sysctl's. For values protected by sx locks, spin mutexes, or other locking strategies besides a single sleep mutex, SYSCTL_PROC nodes could be used to get the locking right. Taskqueue The taskqueue's interface has two basic locks associated with it in order to protect the related shared data. The taskqueue_queues_mutex is meant to serve as a lock to protect the taskqueue_queues TAILQ. The other mutex lock associated with this system is the one in the struct taskqueue data structure. The use of the synchronization primitive here is to protect the integrity of the data in the struct taskqueue. It should be noted that there are no separate macros to assist the user in locking down his/her own work since these locks are most likely not going to be used outside of kern/subr_taskqueue.c. Implementation Notes Details of the Mutex Implementation - Should we require mutexes to be owned for mtx_destroy() since we can not safely assert that they are unowned by anyone else otherwise? Spin Mutexes - Use a critical section... Sleep Mutexes - Describe the races with contested mutexes - Why it is safe to read mtx_lock of a contested mutex when holding sched_lock. - Priority propagation Witness - What does it do - How does it work Miscellaneous Topics Interrupt Source and ICU Abstractions - struct isrc - pic drivers Other Random Questions/Topics Should we pass an interlock into sema_wait? - Generic turnstiles for sleep mutexes and sx locks. - Should we have non-sleepable sx locks? Glossary atomic An operation is atomic if all of its effects are visible to other CPUs together when the proper access protocol is followed. In the degenerate case are atomic instructions provided directly by machine architectures. At a higher level, if several members of a structure are protected by a lock, then a set of operations are atomic if they are all performed while holding the lock without releasing the lock in between any of the operations. operation block A thread is blocked when it is waiting on a lock, resource, or condition. Unfortunately this term is a bit overloaded as a result. sleep critical section A section of code that is not allowed to be preempted. A critical section is entered and exited using the &man.critical.enter.9; API. MD Machine dependent. MI memory operation A memory operation reads and/or writes to a memory location. MI Machine independent. MD operation memory operation primary interrupt context Primary interrupt context refers to the code that runs when an interrupt occurs. This code can either run an interrupt handler directly or schedule an asynchronous interrupt thread to execute the interrupt handlers for a given interrupt source. realtime kernel thread A high priority kernel thread. 
Currently, the only realtime priority kernel threads are interrupt threads. thread sleep A thread is asleep when it is blocked on a condition variable or a sleep queue via msleep or tsleep. block sleepable lock A sleepable lock is a lock that can be held by a thread which is asleep. Lockmgr locks and sx locks are currently the only sleepable locks in FreeBSD. Eventually, some sx locks such as the allproc and proctree locks may become non-sleepable locks. sleep thread A kernel thread represented by a struct thread. Threads own locks and hold a single execution context.