diff --git a/en_US.ISO8859-1/books/arch-handbook/boot/chapter.xml b/en_US.ISO8859-1/books/arch-handbook/boot/chapter.xml
index 798b7bc6d9..2ba2795fb3 100644
--- a/en_US.ISO8859-1/books/arch-handbook/boot/chapter.xml
+++ b/en_US.ISO8859-1/books/arch-handbook/boot/chapter.xml
@@ -1,2396 +1,2396 @@
Bootstrapping and Kernel Initialization

Contributed by Sergey Lyubka. Updated and enhanced by Sergio Andrés Gómez del Real.

Synopsis

This chapter is an overview of the boot and system
initialization processes, starting from the
BIOS (firmware) POST to
the creation of the first user process. Since the initial
steps of system startup are very architecture dependent, the
IA-32 architecture is used as an example.

The &os; boot process can be surprisingly complex. After
control is passed from the BIOS, a
considerable amount of low-level configuration must be done
before the kernel can be loaded and executed. This setup must
be done in a simple and flexible manner, allowing the user a
great deal of customization possibilities.

Overview

The boot process is an extremely machine-dependent
activity. Not only must code be written for every computer
architecture, but there may also be multiple types of booting on
the same architecture. For example, a directory listing of
/usr/src/sys/boot
reveals a great amount of architecture-dependent code. There is
a directory for each of the various supported architectures. In
the x86-specific i386
directory, there are subdirectories for different boot standards
like mbr (Master Boot Record),
gpt (GUID Partition
Table), and efi (Extensible Firmware
Interface). Each boot standard has its own conventions and data
structures. The example that follows shows booting an x86
computer from an MBR hard drive with the &os;
boot0 multi-boot loader stored in the very
first sector. That boot code starts the &os; three-stage boot
process.The key to understanding this process is that it is a series
of stages of increasing complexity. These stages are
boot1, boot2, and
loader (see &man.boot.8; for more detail).
The boot system executes each stage in sequence. The last
stage, loader, is responsible for loading
the &os; kernel. Each stage is examined in the following
sections.Here is an example of the output generated by the
different boot stages. Actual output
may differ from machine to machine:

&os; Component: boot0
Output (may vary):
F1 FreeBSD
F2 BSD
F5 Disk 2

&os; Component: boot2 (this prompt will appear if the user
presses a key just after selecting an OS to boot at the
boot0 stage)
Output (may vary):
>>FreeBSD/i386 BOOT
Default: 1:ad(1,a)/boot/loader
boot:

&os; Component: loader
Output (may vary):
BTX loader 1.00 BTX version is 1.02
Consoles: internal video/keyboard
BIOS drive C: is disk0
BIOS 639kB/2096064kB available memory
FreeBSD/x86 bootstrap loader, Revision 1.1
Console internal video/keyboard
(root@snap.freebsd.org, Thu Jan 16 22:18:05 UTC 2014)
Loading /boot/defaults/loader.conf
/boot/kernel/kernel text=0xed9008 data=0x117d28+0x176650 syms=[0x8+0x137988+0x8+0x1515f8]

&os; Component: kernel
Output (may vary):
Copyright (c) 1992-2013 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 10.0-RELEASE #0 r260789: Thu Jan 16 22:34:59 UTC 2014
root@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610

The BIOS

When the computer powers on, the processor's registers are
set to some predefined values. One of the registers is the
instruction pointer register, and its value
after a power on is well defined: it is a 32-bit value of
0xfffffff0. The instruction pointer register
(also known as the Program Counter) points to code to be
executed by the processor. Another important register is the
cr0 32-bit control register, and its value
just after a reboot is 0. One of
cr0's bits, the PE (Protection Enabled) bit,
indicates whether the processor is running in 32-bit protected
mode or 16-bit real mode. Since this bit is cleared at boot
time, the processor boots in 16-bit real mode. Real mode means,
among other things, that linear and physical addresses are
identical. The reason for the processor not to start
immediately in 32-bit protected mode is backwards compatibility.
In particular, the boot process relies on the services provided
by the BIOS, and the BIOS
itself works in legacy, 16-bit code.The value of 0xfffffff0 is slightly less
than 4 GB, so unless the machine has 4 GB of physical
memory, it cannot point to a valid memory address. The
computer's hardware translates this address so that it points to
a BIOS memory block.The BIOS (Basic Input Output
System) is a chip on the motherboard that has a relatively small
amount of read-only memory (ROM). This
memory contains various low-level routines that are specific to
the hardware supplied with the motherboard. The processor will
first jump to the address 0xfffffff0, which really resides in
the BIOS's memory. Usually this address
contains a jump instruction to the BIOS's
POST routines.The POST (Power On Self Test)
is a set of routines including the memory check, system bus
check, and other low-level initialization so the
CPU can set up the computer properly. The
important step of this stage is determining the boot device.
Modern BIOS implementations permit the
selection of a boot device, allowing booting from a floppy,
CD-ROM, hard disk, or other devices.The very last thing in the POST is the
INT 0x19 instruction. The
INT 0x19 handler reads 512 bytes from the
first sector of the boot device into memory at address
0x7c00. The term
first sector originates from hard drive
architecture, where the magnetic plate is divided into a number
of cylindrical tracks. Tracks are numbered, and every track is
divided into a number (traditionally 63) of sectors. Track numbers
start at 0, but sector numbers start from 1. Track 0 is the
outermost on the magnetic plate, and sector 1, the first sector,
has a special purpose. It is also called the
MBR, or Master Boot Record. The remaining
sectors on the first track are never used.This sector is our boot-sequence starting point. As we will
see, this sector contains a copy of our
boot0 program. A jump is made by the
BIOS to address 0x7c00 so
it starts executing.

The Master Boot Record (boot0)

After control is received from the BIOS
at memory address 0x7c00,
boot0 starts executing. It is the first
piece of code under &os; control. The task of
boot0 is quite simple: scan the partition
table and let the user choose which partition to boot from. The
Partition Table is a special, standard data structure embedded
in the MBR (hence embedded in
boot0) describing the four standard PC
partitions.
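The scanning and selection task just described can be sketched at a higher level. The following Python fragment is an illustrative model only, not the actual implementation (the real boot0 is hand-written IA-32 assembly); the type codes shown are the ones discussed later in this section.

```python
# Illustrative model of boot0's partition scan and selection.
KNOWN_TYPES = {0x07: "NTFS", 0x83: "ext2fs", 0xA5: "FreeBSD"}
ACTIVE = 0x80  # value of the bootable-flag byte for an active slice


def scan_and_select(records, choice):
    """records: the four partition records as dicts with 'type'/'flag'.
    Mimics boot0: clear every active flag, list the recognized types,
    then mark only the user-selected record active."""
    menu = []
    for rec in records:
        rec["flag"] = 0                      # clear active flag in memory copy
        menu.append(KNOWN_TYPES.get(rec["type"], "??"))
    records[choice]["flag"] = ACTIVE         # exactly one active entry remains
    return menu
```

Clearing every flag before setting the chosen one models the guarantee, discussed below, that only one partition entry can be active if the table is written back to disk.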
boot0 resides in the filesystem as
/boot/boot0. It is a small 512-byte file,
and it is exactly what &os;'s installation procedure wrote to
the hard disk's MBR if you chose the
bootmanager option at installation time. Indeed,
boot0 is the
MBR.

As mentioned previously, the INT 0x19
instruction causes the INT 0x19 handler to
load an MBR (boot0) into
memory at address 0x7c00. The source file
for boot0 can be found in
sys/boot/i386/boot0/boot0.S, an admirably
compact piece of code written by Robert Nordier.

A special structure starting from offset
0x1be in the MBR is called
the partition table. It has four records
of 16 bytes each, called partition records,
which represent how the hard disk is partitioned, or, in &os;'s
terminology, sliced. One byte of those 16 says whether a
partition (slice) is bootable or not. Exactly one record must
have that flag set, otherwise boot0's code
will refuse to proceed.A partition record has the following fields:the 1-byte filesystem typethe 1-byte bootable flagthe 6 byte descriptor in CHS formatthe 8 byte descriptor in LBA formatA partition record descriptor contains information about
where exactly the partition resides on the drive. Both
descriptors, LBA and CHS,
describe the same information, but in different ways:
LBA (Logical Block Addressing) has the
starting sector for the partition and the partition's length,
while CHS (Cylinder Head Sector) has
coordinates for the first and last sectors of the partition.
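A hedged sketch of decoding one such 16-byte record follows. The CHS packing (head byte, then the sector in the low 6 bits of the next byte with the cylinder's top two bits above it, then the cylinder's low byte) follows the standard MBR layout; the drive geometry used in chs_to_lba is a hypothetical example, since real geometries vary.

```python
import struct

HEADS, SECS = 16, 63   # hypothetical geometry for the consistency check


def unpack_chs(b: bytes):
    """A 3-byte CHS triple: head, sector (bits 0-5), 10-bit cylinder."""
    head = b[0]
    sector = b[1] & 0x3F
    cylinder = ((b[1] & 0xC0) << 2) | b[2]
    return cylinder, head, sector


def decode_record(rec: bytes):
    """Decode one 16-byte partition record: flag, CHS pair, type, LBA pair."""
    flag = rec[0]
    start_chs = unpack_chs(rec[1:4])
    ptype = rec[4]
    end_chs = unpack_chs(rec[5:8])
    lba_start, lba_len = struct.unpack_from("<II", rec, 8)
    return flag, ptype, start_chs, end_chs, lba_start, lba_len


def chs_to_lba(c, h, s):
    """Classic conversion; sector numbers start at 1, hence (s - 1)."""
    return (c * HEADS + h) * SECS + (s - 1)
```

Given a consistent record, decoding the CHS start and converting it with the drive's real geometry should yield the same value as the LBA start field.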
The partition table ends with the special signature
0xaa55.The MBR must fit into 512 bytes, a single
disk sector. This program uses low-level tricks
like taking advantage of the side effects of certain
instructions and reusing register values from previous
operations to make the most out of the fewest possible
instructions. Care must also be taken when handling the
partition table, which is embedded in the MBR
itself. For these reasons, be very careful when modifying
boot0.S.Note that the boot0.S source file
is assembled as is: instructions are translated
one by one to binary, with no additional information (no
ELF file format, for example). This kind of
low-level control is achieved at link time through special
control flags passed to the linker. For example, the text
section of the program is set to be located at address
0x600. In practice this means that
boot0 must be loaded to memory address
0x600 in order to function properly.It is worth looking at the Makefile for
boot0
(sys/boot/i386/boot0/Makefile), as it
defines some of the run-time behavior of
boot0. For instance, if a terminal
connected to the serial port (COM1) is used for I/O, the macro
SIO must be defined
(-DSIO). -DPXE enables
boot through PXE by pressing
F6. Additionally, the program defines a set of
flags that allow further modification of
its behavior. All of this is illustrated in the
Makefile. For example, look at the
linker directives which command the linker to start the text
section at address 0x600, and to build the
output file as is (strip out any file
formatting):

Let us now start our study of the MBR, or
boot0, starting where execution
begins.Some modifications have been made to some instructions in
favor of better exposition. For example, some macros are
expanded, and some macro tests are omitted when the result of
the test is known. This applies to all of the code examples
shown.This first block of code is the entry point of the program.
It is where the BIOS transfers control.
First, it makes sure that the string operations autoincrement
their pointer operands (the cld instruction).
When in doubt, we refer the reader to the official Intel
manuals, which describe the exact semantics of each
instruction.
Then, as it makes no assumption about the state of the segment
registers, it initializes them. Finally, it sets the stack
pointer register (%sp) to address
0x7c00, so we have a working stack.The next block is responsible for the relocation and
subsequent jump to the relocated code.As boot0 is loaded by the
BIOS to address 0x7C00, it
copies itself to address 0x600 and then
transfers control there (recall that it was linked to execute at
address 0x600). The source address,
0x7c00, is copied to register
%si. The destination address,
0x600, to register %di.
The number of bytes to copy, 512 (the
program's size), is copied to register %cx.
Next, the rep instruction repeats the
instruction that follows, that is, movsb, the
number of times dictated by the %cx register.
The movsb instruction copies the byte pointed
to by %si to the address pointed to by
%di. This is repeated another 511 times. On
each repetition, both the source and destination registers,
%si and %di, are
incremented by one. Thus, upon completion of the 512-byte copy,
%di has the value
0x600+512=
0x800, and %si has the
value 0x7c00+512=
0x7e00; we have thus completed the code
relocation.Next, the destination register
%di is copied to %bp.
%bp gets the value 0x800.
The value 16 is copied to
%cl in preparation for a new string operation
(like our previous movsb). Now,
stosb is executed 16 times. This instruction
copies a 0 value to the address pointed to by
the destination register (%di, which is
0x800), and increments it. This is repeated
another 15 times, so %di ends up with value
0x810. Effectively, this clears the address
range 0x800-0x80f. This
range is used as a (fake) partition table for writing the
MBR back to disk. Finally, the sector field
for the CHS addressing of this fake partition
is given the value 1 and a jump is made to the main function
from the relocated code. Note that until this jump to the
relocated code, any reference to an absolute address was
avoided.The following code block tests whether the drive number
provided by the BIOS should be used, or
the one stored in boot0.This code tests the SETDRV bit
(0x20) in the flags
variable. Recall that register %bp points to
address location 0x800, so the test is done
to the flags variable at address
0x800-69=
0x7bb. This is an example of the type of
modifications that can be done to boot0.
The SETDRV flag is not set by default, but it
can be set in the Makefile. When set, the
drive number stored in the MBR is used
instead of the one provided by the BIOS. We
assume the defaults, and that the BIOS
provided a valid drive number, so we jump to
save_curdrive.The next block saves the drive number provided by the
BIOS, and calls putn to
print a new line on the screen.Note that we assume TEST is not defined,
so the conditional code in it is not assembled and will not
appear in our executable boot0.Our next block implements the actual scanning of the
partition table. It prints to the screen the partition type for
each of the four entries in the partition table. It compares
each type with a list of well-known operating system file
systems. Examples of recognized partition types are
NTFS (&windows;, ID 0x7),
ext2fs (&linux;, ID 0x83), and, of course,
ffs/ufs2 (&os;, ID 0xa5).
The implementation is fairly simple.It is important to note that the active flag for each entry
is cleared, so after the scanning, no
partition entry is active in our memory copy of
boot0. Later, the active flag will be set
for the selected partition. This ensures that only one active
partition exists if the user chooses to write the changes back
to disk.The next block tests for other drives. At startup,
the BIOS writes the number of drives present
in the computer to address 0x475. If there
are any other drives present, boot0 prints
the current drive to screen. The user may command
boot0 to scan partitions on another drive
later.We make the assumption that a single drive is present, so
the jump to print_drive is not performed. We
also assume nothing strange happened, so we jump to
print_prompt.This next block just prints out a prompt followed by the
default option:Finally, a jump is performed to
start_input, where the
BIOS services are used to start a timer and
for reading user input from the keyboard; if the timer expires,
the default option will be selected:An interrupt is requested with number
0x1a and argument 0 in
register %ah. The BIOS
has a predefined set of services, requested by applications as
software-generated interrupts through the int
instruction and receiving arguments in registers (in this case,
%ah). Here, particularly, we are requesting
the number of clock ticks since last midnight; this value is
computed by the BIOS through the
RTC (Real Time Clock). This clock can be
programmed to work at frequencies ranging from 2 Hz to
8192 Hz. The BIOS sets it to
18.2 Hz at startup. When the request is satisfied, a
32-bit result is returned by the BIOS in
registers %cx and %dx
(lower bytes in %dx). This result (the
%dx part) is copied to register
%di, and the value of the
TICKS variable is added to
%di. This variable resides in
boot0 at offset _TICKS
(a negative value) from register %bp (which,
recall, points to 0x800). The default value
of this variable is 0xb6 (182 in decimal).
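The arithmetic behind the 10-second default can be checked directly. This is an illustrative model of the loop condition, not boot0 code:

```python
TICK_HZ = 18.2        # BIOS tick rate set at startup
DEFAULT_TICKS = 0xB6  # boot0's default _TICKS value (182 decimal)

# 182 ticks at 18.2 ticks per second gives the 10-second default timeout.
timeout_seconds = DEFAULT_TICKS / TICK_HZ


def timed_out(start_dx: int, now_dx: int) -> bool:
    """Model of the check: the deadline is the tick count at entry plus
    _TICKS; time is up once the current count exceeds it."""
    return now_dx > start_dx + DEFAULT_TICKS
```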
Now, the idea is that boot0 constantly
requests the time from the BIOS, and when the
value returned in register %dx is greater
than the value stored in %di, the time is up
and the default selection will be made. Since the RTC ticks
18.2 times per second, this condition will be met after 10
seconds (this default behavior can be changed in the
Makefile). Until this time has passed,
boot0 continually asks the
BIOS for any user input; this is done through
int 0x16, argument 1 in
%ah.Whether a key was pressed or the time expired, subsequent
code validates the selection. Based on the selection, the
register %si is set to point to the
appropriate partition entry in the partition table. This new
selection overrides the previous default one. Indeed, it
becomes the new default. Finally, the ACTIVE flag of the
selected partition is set. If it was enabled at compile time,
the in-memory version of boot0 with these
modified values is written back to the MBR on
disk. We leave the details of this implementation to the
reader.We now end our study with the last code block from the
boot0 program:Recall that %si points to the selected
partition entry. This entry tells us where the partition begins
on disk. We assume, of course, that the partition selected is
actually a &os; slice.From now on, we will favor the use of the technically
more accurate term slice rather than
partition.The transfer buffer is set to 0x7c00
(register %bx), and a read for the first
sector of the &os; slice is requested by calling
intx13. We assume that everything went okay,
so a jump to beep is not performed. In
particular, the new sector read must end with the magic sequence
0xaa55. Finally, the value at
%si (the pointer to the selected partition
table) is preserved for use by the next stage, and a jump is
performed to address 0x7c00, where execution
of our next stage (the just-read block) is started.

boot1 Stage

So far we have gone through the following sequence:

- The BIOS did some early hardware initialization, including
  the POST. The MBR (boot0) was loaded from absolute disk
  sector one to address 0x7c00. Execution control was passed
  to that location.

- boot0 relocated itself to the location it was linked to
  execute (0x600), followed by a jump to continue execution
  at the appropriate place. Finally, boot0 loaded the first
  disk sector from the &os; slice to address 0x7c00.
  Execution control was passed to that location.

boot1 is the next step in the
boot-loading sequence. It is the first of three boot stages.
Note that we have been dealing exclusively
with disk sectors. Indeed, the BIOS loads
the absolute first sector, while boot0
loads the first sector of the &os; slice. Both loads are to
address 0x7c00. We can conceptually think of
these disk sectors as containing the files
boot0 and boot1,
respectively, but in reality this is not entirely true for
boot1. Strictly speaking, unlike
boot0, boot1 is not
part of the boot blocks.
(There is a file /boot/boot1, but it
is not written to the beginning of the &os; slice.
Instead, it is concatenated with boot2
to form boot, which
is written to the beginning of the &os;
slice and read at boot time.)
Instead, a single, full-blown file, boot
(/boot/boot), is what ultimately is
written to disk. This file is a combination of
boot1, boot2 and the
Boot Extender (or BTX).
This single file is greater in size than a single sector
(greater than 512 bytes). Fortunately,
boot1 occupies exactly
the first 512 bytes of this single file, so when
boot0 loads the first sector of the &os;
slice (512 bytes), it is actually loading
boot1 and transferring control to
it.The main task of boot1 is to load the
next boot stage. This next stage is somewhat more complex. It
is composed of a server called the Boot Extender,
or BTX, and a client, called
boot2. As we will see, the last boot
stage, loader, is also a client of the
BTX server.Let us now look in detail at what exactly is done by
boot1, starting like we did for
boot0, at its entry point:The entry point at start simply jumps
past a special data area to the label main,
which in turn looks like this:Just like boot0, this
code relocates boot1,
this time to memory address 0x700. However,
unlike boot0, it does not jump there.
boot1 is linked to execute at
address 0x7c00, effectively where it was
loaded in the first place. The reason for this relocation will
be discussed shortly.Next comes a loop that looks for the &os; slice. Although
boot0 loaded boot1
from the &os; slice, no information was passed to it about this
(actually, a pointer to the slice entry was passed in
register %si, but
boot1 does not assume that it was
loaded by boot0; some other
MBR might have loaded it without passing this
information, so it assumes nothing),
so boot1 must rescan the
partition table to find where the &os; slice starts. Therefore
it rereads the MBR:In the code above, register %dl
maintains information about the boot device. This is passed on
by the BIOS and preserved by the
MBR. Numbers 0x80 and
greater tell us that we are dealing with a hard drive, so a
call is made to nread, where the
MBR is read. Arguments to
nread are passed through
%si and %dh. The memory
address at label part4 is copied to
%si. This memory address holds a
fake partition to be used by
nread. The following is the data in the fake
partition:In particular, the LBA for this fake
partition is hardcoded to zero. This is used as an argument to
the BIOS for reading absolute sector one from
the hard drive. Alternatively, CHS addressing could be used.
In this case, the fake partition holds cylinder 0, head 0 and
sector 1, which is equivalent to absolute sector one.Let us now proceed to take a look at
nread:Recall that %si points to the fake
partition. The word
(in the context of 16-bit real mode, a word is 2
bytes)
at offset 0x8 is copied to register
%ax and word at offset 0xa
to %cx. They are interpreted by the
BIOS as the lower 4-byte value denoting the
LBA to be read (the upper four bytes are assumed to be zero).
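How the two words combine into the 32-bit LBA can be sketched as follows; the helper names are invented for illustration:

```python
import struct


def lba_words(record: bytes):
    """Return the two 16-bit words at offsets 0x8 and 0xa of a partition
    record: the values loaded into %ax (low word) and %cx (high word)."""
    lo, hi = struct.unpack_from("<HH", record, 0x8)
    return lo, hi


def lba_value(lo: int, hi: int) -> int:
    """The 4-byte LBA the BIOS sees: high word above the low word."""
    return (hi << 16) | lo
```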
Register %bx holds the memory address where
the MBR will be loaded. The instruction
pushing %cs onto the stack is very
interesting. In this context, it accomplishes nothing.
However, as we will see shortly, boot2, in
conjunction with the BTX server, also uses
xread.1. This mechanism will be discussed in
the next section.The code at xread.1 further calls
the read function, which actually calls the
BIOS asking for the disk sector:Note the long return instruction at the end of this block.
This instruction pops out the %cs register
pushed by nread, and returns. Finally,
nread also returns.With the MBR loaded to memory, the actual
loop for searching the &os; slice begins:If a &os; slice is identified, execution continues at
main.5. Note that when a &os; slice is found
%si points to the appropriate entry in the
partition table, and %dh holds the partition
number. We assume that a &os; slice is found, so we continue
execution at main.5:Recall that at this point, register %si
points to the &os; slice entry in the MBR
partition table, so a call to nread will
effectively read sectors at the beginning of this partition.
The argument passed on register %dh tells
nread to read 16 disk sectors. Recall that
the first 512 bytes, or the first sector of the &os; slice,
coincides with the boot1 program. Also
recall that the file written to the beginning of the &os;
slice is not /boot/boot1, but
/boot/boot. Let us look at the size of
these files in the filesystem:-r--r--r-- 1 root wheel 512B Jan 8 00:15 /boot/boot0
-r--r--r-- 1 root wheel 512B Jan 8 00:15 /boot/boot1
-r--r--r-- 1 root wheel 7.5K Jan 8 00:15 /boot/boot2
-r--r--r-- 1 root wheel 8.0K Jan 8 00:15 /boot/bootBoth boot0 and
boot1 are 512 bytes each, so they fit
exactly in one disk sector.
boot2 is much bigger, holding both
the BTX server and the
boot2 client. Finally, a file called
simply boot is 512 bytes larger than
boot2. This file is a
concatenation of boot1 and
boot2. As already noted,
boot0 is the file written to the absolute
first disk sector (the MBR), and
boot is the file written to the first
sector of the &os; slice; boot1 and
boot2 are not written
to disk. The command used to concatenate
boot1 and boot2 into a
single boot is merely
cat boot1 boot2 > boot.So boot1 occupies exactly the first 512
bytes of boot and, because
boot is written to the first sector of the
&os; slice, boot1 fits exactly in this
first sector. When nread reads the first
16 sectors of the &os; slice, it effectively reads the entire
boot file
(512*16=8192 bytes, exactly the size of
boot).
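The size bookkeeping can be summarized in a few lines; the byte counts come from the file listing above:

```python
SECTOR = 512

BOOT1_SIZE = 512    # /boot/boot1: exactly one sector
BOOT2_SIZE = 7680   # /boot/boot2: 7.5K (reserved bsdlabel space,
                    # the BTX server and the boot2 client)

# boot is the plain concatenation: cat boot1 boot2 > boot
BOOT_SIZE = BOOT1_SIZE + BOOT2_SIZE

sectors_read = BOOT_SIZE // SECTOR   # the 16 sectors that nread fetches
```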
We will see more details about how boot is
formed from boot1 and
boot2 in the next section.Recall that nread uses memory address
0x8c00 as the transfer buffer to hold the
sectors read. This address is conveniently chosen. Indeed,
because boot1 belongs to the first 512
bytes, it ends up in the address range
0x8c00-0x8dff. The 512
bytes that follow (range
0x8e00-0x8fff) are used to
store the bsdlabel (historically known as the disklabel; if you
ever wondered where &os; stores this information, it is in
this region; see &man.bsdlabel.8;).

Starting at address 0x9000 is the
beginning of the BTX server, and immediately
following is the boot2 client. The
BTX server acts as a kernel, and executes in
protected mode in the most privileged level. In contrast, the
BTX clients (boot2, for
example), execute in user mode. We will see how this is
accomplished in the next section. The code after the call to
nread locates the beginning of
boot2 in the memory buffer, and copies it
to memory address 0xc000. This is because
the BTX server arranges
boot2 to execute in a segment starting at
0xa000. We explore this in detail in the
following section.The last code block of boot1 enables
access to memory above 1 MB
(this is necessary for legacy reasons)
and concludes with a jump to the starting point of the
BTX server:

Note that right before the jump, interrupts are
enabled.

The BTX Server

Next in our boot sequence is the
BTX Server. Let us quickly remember how we
got here:

- The BIOS loads the absolute sector one (the MBR, or
  boot0) to address 0x7c00 and jumps there.

- boot0 relocates itself to 0x600, the address it was
  linked to execute, and jumps over there. It then reads the
  first sector of the &os; slice (which consists of boot1)
  into address 0x7c00 and jumps over there.

- boot1 loads the first 16 sectors of the &os; slice into
  address 0x8c00. These 16 sectors, or 8192 bytes, are the
  whole file boot. The file is a concatenation of boot1
  and boot2. boot2, in turn, contains the BTX server and
  the boot2 client. Finally, a jump is made to address
  0x9010, the entry point of the BTX server.

Before studying the BTX Server in detail,
let us further review how the single, all-in-one
boot file is created. The way
boot is built is defined in its
Makefile
(/usr/src/sys/boot/i386/boot2/Makefile).
Let us look at the rule that creates the
boot file:This tells us that boot1 and
boot2 are needed, and the rule simply
concatenates them to produce a single file called
boot. The rules for creating
boot1 are also quite simple:To apply the rule for creating
boot1, boot1.out must
be resolved. This, in turn, depends on the existence of
boot1.o. This last file is simply the
result of assembling our familiar boot1.S,
without linking. Now, the rule for creating
boot1.out is applied. This tells us that
boot1.o should be linked with
start as its entry point, and starting at
address 0x7c00. Finally,
boot1 is created from
boot1.out applying the appropriate rule.
This rule is the objcopy command applied to
boot1.out. Note the flags passed to
objcopy: -S tells it to
strip all relocation and symbolic information;
-O binary indicates the output format, that
is, a simple, unformatted binary file.Having boot1, let us take a look at how
boot2 is constructed:The mechanism for building boot2 is
far more elaborate. Let us point out the most relevant facts.
The dependency list is as follows:Note that initially there is no header file
boot2.h, but its creation depends on
boot1.out, which we already have. The rule
for its creation is a bit terse, but the important thing is that
the output, boot2.h, is something like
this:Recall that boot1 was relocated (i.e.,
copied from 0x7c00 to
0x700). This relocation will now make sense,
because as we will see, the BTX server
reclaims some memory, including the space where
boot1 was originally loaded. However, the
BTX server needs access to
boot1's xread function;
this function, according to the output of
boot2.h, is at location
0x725. Indeed, the
BTX server uses the
xread function from
boot1's relocated code. This function is
now accessible from within the boot2
client.We next build boot2.s from files
boot2.h, boot2.c and
/usr/src/sys/boot/common/ufsread.c. The
rule for this is to compile the code in
boot2.c (which includes
boot2.h and ufsread.c)
into assembly code. Having boot2.s, the
next rule assembles boot2.s, creating the
object file boot2.o. The
next rule directs the linker to link various files
(crt0.o,
boot2.o and sio.o).
Note that the output file, boot2.out, is
linked to execute at address 0x2000. Recall
that boot2 will be executed in user mode,
within a special user segment set up by the
BTX server. This segment starts at
0xa000. Also, remember that the
boot2 portion of boot
was copied to address 0xc000, that is, offset
0x2000 from the start of the user segment, so
boot2 will work properly when we transfer
control to it. Next, boot2.bin is created
from boot2.out by stripping its symbols and
format information; boot2.bin is a raw
binary. Now, note that a file boot2.ldr is
created as a 512-byte file full of zeros. This space is
reserved for the bsdlabel.Now that we have files boot1,
boot2.bin and
boot2.ldr, only the
BTX server is missing before creating the
all-in-one boot file. The
BTX server is located in
/usr/src/sys/boot/i386/btx/btx; it has its
own Makefile with its own set of rules for
building. The important thing to notice is that it is also
compiled as a raw binary, and that it is
linked to execute at address 0x9000. The
details can be found in
/usr/src/sys/boot/i386/btx/btx/Makefile.Having the files that comprise the boot
program, the final step is to merge them.
This is done by a special program called
btxld (source located in
/usr/src/usr.sbin/btxld). Some arguments
to this program include the name of the output file
(boot), its entry point
(0x2000) and its file format
(raw binary). The various files are
finally merged by this utility into the file
boot, which consists of
boot1, boot2, the
bsdlabel and the
BTX server. This file, which takes
exactly 16 sectors, or 8192 bytes, is what is
actually written to the beginning of the &os; slice
during installation. Let us now proceed to study the
BTX server program.The BTX server prepares a simple
environment and switches from 16-bit real mode to 32-bit
protected mode, right before passing control to the client.
This includes initializing and updating the following data
structures:

- The Interrupt Vector Table (IVT) is modified. The IVT
  provides exception and interrupt handlers for real-mode
  code.

- The Interrupt Descriptor Table (IDT) is created. Entries
  are provided for processor exceptions, hardware interrupts,
  two system calls and the V86 interface. The IDT provides
  exception and interrupt handlers for protected-mode code.

- A Task-State Segment (TSS) is created. This is necessary
  because the processor works in the least privileged level
  when executing the client (boot2), but in the most
  privileged level when executing the BTX server.

- The GDT (Global Descriptor Table) is set up. Entries
  (descriptors) are provided for supervisor code and data,
  user code and data, and real-mode code and data.
  (Real-mode code and data are necessary when switching back
  to real mode from protected mode, as suggested by the
  Intel manuals.)

Let us now start studying the actual implementation. Recall
that boot1 made a jump to address
0x9010, the BTX server's
entry point. Before studying program execution there,
note that the BTX server has a special header
at address range 0x9000-0x900f, right before
its entry point. This header is defined as follows:Note the first two bytes are 0xeb and
0xe. In the IA-32 architecture, these two
bytes are interpreted as a relative jump past the header into
the entry point, so in theory, boot1 could
jump here (address 0x9000) instead of address
0x9010. Note that the last field in the
BTX header is a pointer to the client's
(boot2) entry point. This field is patched
at link time.Immediately following the header is the
BTX server's entry point:This code disables interrupts, sets up a working stack
(starting at address 0x1800) and clears the
flags in the EFLAGS register. Note that the
popfl instruction pops out a doubleword (4
bytes) from the stack and places it in the EFLAGS register.
As the value actually popped is 2, the
EFLAGS register is effectively cleared (IA-32 requires that bit
1 of the EFLAGS register always be 1).Our next code block clears (sets to 0)
the memory range 0x5e00-0x8fff. This range
is where the various data structures will be created:Recall that boot1 was originally loaded
to address 0x7c00, so, with this memory
initialization, that copy effectively disappeared. However,
also recall that boot1 was relocated to
0x700, so that copy is
still in memory, and the BTX server will make
use of it.Next, the real-mode IVT (Interrupt Vector
Table) is updated. The IVT is an array of
segment/offset pairs for exception and interrupt handlers. The
BIOS normally maps hardware interrupts to
interrupt vectors 0x8 to
0xf and 0x70 to
0x77 but, as will be seen, the 8259A
Programmable Interrupt Controller, the chip controlling the
actual mapping of hardware interrupts to interrupt vectors, is
programmed to remap these interrupt vectors from
0x8-0xf to 0x20-0x27 and
from 0x70-0x77 to
0x28-0x2f. Thus, interrupt handlers are
provided for interrupt vectors 0x20-0x2f.
The reason the BIOS-provided handlers are not
used directly is because they work in 16-bit real mode, but not
32-bit protected mode. Processor mode will be switched to
32-bit protected mode shortly. However, the
BTX server sets up a mechanism to effectively
use the handlers provided by the BIOS:The next block creates the IDT (Interrupt
Descriptor Table). The IDT is analogous, in
protected mode, to the IVT in real mode.
That is, the IDT describes the various
exception and interrupt handlers used when the processor is
executing in protected mode. In essence, it also consists of an
array of segment/offset pairs, although the structure is
somewhat more complex, because segments in protected mode are
different than in real mode, and various protection mechanisms
apply:Each entry in the IDT is 8 bytes long.
Besides the segment/offset information, they also describe the
segment type, privilege level, and whether the segment is
present in memory or not. The construction is such that
interrupt vectors from 0 to
0xf (exceptions) are handled by function
intx00; vector 0x10 (also
an exception) is handled by intx10; hardware
interrupts, which are later configured to start at interrupt
vector 0x20 all the way to interrupt vector
0x2f, are handled by function
intx20. Lastly, interrupt vector
0x30, which is used for system calls, is
handled by intx30, and vectors
0x31 and 0x32 are handled
by intx31. It must be noted that only
descriptors for interrupt vectors 0x30,
0x31 and 0x32 are given
privilege level 3, the same privilege level as the
boot2 client, which means the client can
execute a software-generated interrupt to these vectors through
the int instruction without failing (this is
the way boot2 uses the services provided by
the BTX server). Also, note that
only software-generated interrupts are
protected from code executing in lesser privilege levels.
Hardware-generated interrupts and processor-generated exceptions
are always handled adequately, regardless
of the actual privileges involved.The next step is to initialize the TSS
(Task-State Segment). The TSS is a hardware
feature that helps the operating system or executive software
implement multitasking functionality through process
abstraction. The IA-32 architecture demands the creation and
use of at least one TSS
if multitasking facilities are used or different privilege
levels are defined. Since the boot2
client is executed in privilege level 3, but the
- BTX server does in privilege level 0, a
+ BTX server runs in privilege level 0, a
TSS must be defined:Note that a value is given for the Privilege Level 0 stack
pointer and stack segment in the TSS. This
is needed because, if an interrupt or exception is received
while executing boot2 in Privilege Level 3,
a change to Privilege Level 0 is automatically performed by the
processor, so a new working stack is needed. Finally, the I/O
Map Base Address field of the TSS is given a
value, which is a 16-bit offset from the beginning of the
TSS to the I/O Permission Bitmap and the
Interrupt Redirection Bitmap.After the IDT and TSS
are created, the processor is ready to switch to protected mode.
This is done in the next block:First, a call is made to setpic to
program the 8259A PIC (Programmable Interrupt
Controller). This chip is connected to multiple hardware
interrupt sources. Upon receiving an interrupt from a device,
it signals the processor with the appropriate interrupt vector.
This can be customized so that specific interrupts are
associated with specific interrupt vectors, as explained before.
Next, the IDTR (Interrupt Descriptor Table
Register) and GDTR (Global Descriptor Table
Register) are loaded with the instructions
lidt and lgdt,
respectively. These registers are loaded with the base address
and limit address for the IDT and
GDT. The following three instructions set
the Protection Enable (PE) bit of the %cr0
register. This effectively switches the processor to 32-bit
protected mode. Next, a long jump is made to
init.8 using segment selector SEL_SCODE,
which selects the Supervisor Code Segment. The processor is
effectively executing in CPL 0, the most privileged level, after
this jump. Finally, the Supervisor Data Segment is selected for
the stack by assigning the segment selector SEL_SDATA to the
%ss register. This data segment also has a
privilege level of 0.Our last code block is responsible for loading the
TR (Task Register) with the segment selector
for the TSS we created earlier, and setting
the User Mode environment before passing execution control to
the boot2 client.Note that the client's environment includes a stack segment
selector and stack pointer (registers %ss and
%esp). Indeed, once the
TR is loaded with the appropriate stack
segment selector (instruction ltr), the stack
pointer is calculated and pushed onto the stack along with the
stack's segment selector. Next, the value
0x202 is pushed onto the stack; it is the
value that the EFLAGS will get when control is passed to the
client. Also, the User Mode code segment selector and the
client's entry point are pushed. Recall that this entry
point is patched in the BTX header at link
time. Finally, segment selectors (stored in register
%ecx) for the segment registers
%gs, %fs, %ds and %es are pushed onto the
stack, along with the value at %edx
(0xa000). Keep in mind the various values
that have been pushed onto the stack (they will be popped out
shortly). Next, values for the remaining general purpose
registers are also pushed onto the stack (note the
loop that pushes the value
0 seven times). Now values start to be
popped off the stack. First, the
popa instruction pops out of the stack the
latest seven values pushed. They are stored in the general
purpose registers in order
%edi, %esi, %ebp, %ebx, %edx, %ecx, %eax.
Then, the various segment selectors pushed are popped into the
various segment registers. Five values still remain on the
stack. They are popped when the iret
instruction is executed. This instruction first pops
the value that was pushed from the BTX
header. This value is a pointer to boot2's
entry point. It is placed in the register
%eip, the instruction pointer register.
Next, the segment selector for the User Code Segment is popped
and copied to register %cs. Remember that
this segment's privilege level is 3, the least privileged
level. This means that we must provide values for the stack of
this privilege level. This is why the processor, besides
further popping the value for the EFLAGS register, does two more
pops out of the stack. These values go to the stack
pointer (%esp) and the stack segment
(%ss). Now, execution continues at
boot2's entry point.It is important to note how the User Code Segment is
defined. This segment's base address is
set to 0xa000. This means that code memory
addresses are relative to address 0xa000;
if code being executed is fetched from address
0x2000, the actual
memory addressed is
0xa000+0x2000=0xc000.boot2 Stageboot2 defines an important structure,
struct bootinfo. This structure is
initialized by boot2 and passed to the
loader, and then further to the kernel. Some fields of this
structure are set by boot2, the rest by the
loader. This structure, among other information, contains the
kernel filename, BIOS hard disk geometry,
BIOS drive number for the boot device, physical
memory available, the envp pointer, etc. The
definition for it is:/usr/include/machine/bootinfo.h:
struct bootinfo {
u_int32_t bi_version;
u_int32_t bi_kernelname; /* represents a char * */
u_int32_t bi_nfs_diskless; /* struct nfs_diskless * */
/* End of fields that are always present. */
#define bi_endcommon bi_n_bios_used
u_int32_t bi_n_bios_used;
u_int32_t bi_bios_geom[N_BIOS_GEOM];
u_int32_t bi_size;
u_int8_t bi_memsizes_valid;
u_int8_t bi_bios_dev; /* bootdev BIOS unit number */
u_int8_t bi_pad[2];
u_int32_t bi_basemem;
u_int32_t bi_extmem;
u_int32_t bi_symtab; /* struct symtab * */
u_int32_t bi_esymtab; /* struct symtab * */
/* Items below only from advanced bootloader */
u_int32_t bi_kernend; /* end of kernel space */
u_int32_t bi_envp; /* environment */
u_int32_t bi_modulep; /* preloaded modules */
};boot2 enters an infinite loop
waiting for user input, then calls load().
If the user does not press anything, the loop exits on a
timeout and load() loads the default
file (/boot/loader). Functions
ino_t lookup(char *filename) and
int xfsread(ino_t inode, void *buf, size_t
nbyte) are used to read the content of a file into
memory. /boot/loader is an
ELF binary, but one where the
ELF header is preceded by
a.out's struct
exec structure. load() scans the
loader's ELF header, loading the content of
/boot/loader into memory, and passing the
execution to the loader's entry:sys/boot/i386/boot2/boot2.c:
__exec((caddr_t)addr, RB_BOOTINFO | (opts & RBX_MASK),
MAKEBOOTDEV(dev_maj[dsk.type], 0, dsk.slice, dsk.unit, dsk.part),
0, 0, 0, VTOP(&bootinfo));loader Stageloader is a
BTX client as well. I will not describe it
here in detail; there is a comprehensive man page written by
Mike Smith, &man.loader.8;. The underlying mechanisms and
BTX were discussed above.The main task for the loader is to boot the kernel. Once
the kernel is loaded into memory, it is called by the
loader:sys/boot/common/boot.c:
/* Call the exec handler from the loader matching the kernel */
module_formats[km->m_loader]->l_exec(km);Kernel InitializationLet us take a look at the command that links the kernel.
This will help identify the exact location where the loader
passes execution to the kernel. This location is the kernel's
actual entry point.sys/conf/Makefile.i386:
ld -elf -Bdynamic -T /usr/src/sys/conf/ldscript.i386 -export-dynamic \
-dynamic-linker /red/herring -o kernel -X locore.o \
<lots of kernel .o files>ELFA few interesting things can be seen here. First, the
kernel is a dynamically linked ELF binary, but the dynamic
linker for the kernel is /red/herring, which is
definitely a bogus file. Second, taking a look at the file
sys/conf/ldscript.i386 gives an idea about
what ld options are used when
compiling a kernel. Reading through the first few lines, the
stringsys/conf/ldscript.i386:
ENTRY(btext)says that the kernel's entry point is the symbol `btext'.
This symbol is defined in locore.s:sys/i386/i386/locore.s:
.text
/**********************************************************************
*
* This is where the bootblocks start us, set the ball rolling...
*
*/
NON_GPROF_ENTRY(btext)First, the register EFLAGS is set to a predefined value of
0x00000002. Then all the segment registers are
initialized:sys/i386/i386/locore.s:
/* Don't trust what the BIOS gives for eflags. */
pushl $PSL_KERNEL
popfl
/*
* Don't trust what the BIOS gives for %fs and %gs. Trust the bootstrap
* to set %cs, %ds, %es and %ss.
*/
mov %ds, %ax
mov %ax, %fs
mov %ax, %gsbtext calls the routines
recover_bootinfo(),
identify_cpu(),
create_pagetables(), which are also defined
in locore.s. Here is a description of what
they do:recover_bootinfoThis routine parses the parameters to the kernel
passed from the bootstrap. The kernel may have been
booted in 3 ways: by the loader, described above, by the
old disk boot blocks, or by the old diskless boot
procedure. This function determines the booting method,
and stores the struct bootinfo
structure into the kernel memory.identify_cpuThis function tries to find out what CPU it is
running on, storing the value found in a variable
_cpu.create_pagetablesThis function allocates and fills out a Page Table
Directory at the top of the kernel memory area.The next steps are enabling VME, if the CPU supports
it: testl $CPUID_VME, R(_cpu_feature)
jz 1f
movl %cr4, %eax
orl $CR4_VME, %eax
movl %eax, %cr4Then, enabling paging:/* Now enable paging */
movl R(_IdlePTD), %eax
movl %eax,%cr3 /* load ptd addr into mmu */
movl %cr0,%eax /* get control word */
orl $CR0_PE|CR0_PG,%eax /* enable paging */
movl %eax,%cr0 /* and let's page NOW! */Because paging is now enabled, the next three lines of code
perform the jump needed to continue execution in the virtualized
address space: pushl $begin /* jump to high virtualized address */
ret
/* now running relocated at KERNBASE where the system is linked to run */
begin:The function init386() is called with
a pointer to the first free physical page, followed by
mi_startup(). init386
is an architecture dependent initialization function, and
mi_startup() is an architecture independent
one (the 'mi_' prefix stands for Machine Independent). The
kernel never returns from mi_startup(), and
by calling it, the kernel finishes booting:sys/i386/i386/locore.s:
movl physfree, %esi
pushl %esi /* value of first for init386(first) */
call _init386 /* wire 386 chip for unix operation */
call _mi_startup /* autoconfiguration, mountroot etc */
hlt /* never returns to here */init386()init386() is defined in
sys/i386/i386/machdep.c and performs
low-level initialization specific to the i386 chip. The
switch to protected mode was performed by the loader. The
loader has created the very first task, in which the kernel
continues to operate. Before looking at the code, consider
the tasks the processor must complete to initialize protected
mode execution:Initialize the kernel tunable parameters, passed from
the bootstrapping program.Prepare the GDT.Prepare the IDT.Initialize the system console.Initialize the DDB, if it is compiled into
the kernel.Initialize the TSS.Prepare the LDT.Set up proc0's pcb.parametersinit386() initializes the tunable
parameters passed from bootstrap by setting the environment
pointer (envp) and calling init_param1().
The envp pointer has been passed from the loader in the
bootinfo structure:sys/i386/i386/machdep.c:
kern_envp = (caddr_t)bootinfo.bi_envp + KERNBASE;
/* Init basic tunables, hz etc */
init_param1();init_param1() is defined in
sys/kern/subr_param.c. That file has a
number of sysctls, and two functions,
init_param1() and
init_param2(), that are called from
init386():sys/kern/subr_param.c:
hz = HZ;
TUNABLE_INT_FETCH("kern.hz", &hz);TUNABLE_<typename>_FETCH is used to fetch the value
from the environment:/usr/src/sys/sys/kernel.h:
#define TUNABLE_INT_FETCH(path, var) getenv_int((path), (var))Sysctl kern.hz is the system clock
tick. Additionally, these sysctls are set by
init_param1(): kern.maxswzone,
kern.maxbcache, kern.maxtsiz, kern.dfldsiz, kern.maxdsiz,
kern.dflssiz, kern.maxssiz, kern.sgrowsiz.Global Descriptor Table (GDT)Then init386() prepares the Global
Descriptor Table (GDT). Every task on an x86 runs in
its own virtual address space, and this space is addressed by
a segment:offset pair. Say, for instance, the current
instruction to be executed by the processor lies at CS:EIP,
then the linear virtual address for that instruction would be
the virtual address of code segment CS + EIP.
For convenience, segments begin at virtual address 0 and end
at a 4Gb boundary. Therefore, the instruction's linear
virtual address for this example would just be the value of
EIP. Segment registers such as CS, DS, etc. are the selectors,
i.e., indexes, into the GDT (to be more precise, an index is not a
selector itself, but the INDEX field of a selector).
FreeBSD's GDT holds descriptors for 15 selectors per
CPU:sys/i386/i386/machdep.c:
union descriptor gdt[NGDT * MAXCPU]; /* global descriptor table */
sys/i386/include/segments.h:
/*
* Entries in the Global Descriptor Table (GDT)
*/
#define GNULL_SEL 0 /* Null Descriptor */
#define GCODE_SEL 1 /* Kernel Code Descriptor */
#define GDATA_SEL 2 /* Kernel Data Descriptor */
#define GPRIV_SEL 3 /* SMP Per-Processor Private Data */
#define GPROC0_SEL 4 /* Task state process slot zero and up */
#define GLDT_SEL 5 /* LDT - eventually one per process */
#define GUSERLDT_SEL 6 /* User LDT */
#define GTGATE_SEL 7 /* Process task switch gate */
#define GBIOSLOWMEM_SEL 8 /* BIOS low memory access (must be entry 8) */
#define GPANIC_SEL 9 /* Task state to consider panic from */
#define GBIOSCODE32_SEL 10 /* BIOS interface (32bit Code) */
#define GBIOSCODE16_SEL 11 /* BIOS interface (16bit Code) */
#define GBIOSDATA_SEL 12 /* BIOS interface (Data) */
#define GBIOSUTIL_SEL 13 /* BIOS interface (Utility) */
#define GBIOSARGS_SEL 14 /* BIOS interface (Arguments) */Note that those #defines are not selectors themselves, but
just the INDEX field of a selector, so they are exactly the
indices of the GDT. For example, an actual selector for the
kernel code (GCODE_SEL) has the value 0x08.Interrupt Descriptor Table
(IDT)The next step is to initialize the Interrupt Descriptor
Table (IDT). This table is referenced by the processor when a
software or hardware interrupt occurs. For example, to make a
system call, a user application issues the
INT 0x80 instruction. This is a software
interrupt, so the processor's hardware looks up a record with
index 0x80 in the IDT. This record points to the routine that
handles this interrupt, in this particular case, this will be
the kernel's syscall gate. The IDT may have a maximum of 256
(0x100) records. The kernel allocates NIDT records for the
IDT, where NIDT is the maximum (256):sys/i386/i386/machdep.c:
static struct gate_descriptor idt0[NIDT];
struct gate_descriptor *idt = &idt0[0]; /* interrupt descriptor table */For each interrupt, an appropriate handler is set. The
syscall gate for INT 0x80 is set as
well:sys/i386/i386/machdep.c:
setidt(0x80, &IDTVEC(int0x80_syscall),
SDT_SYS386TGT, SEL_UPL, GSEL(GCODE_SEL, SEL_KPL));So when a userland application issues the
INT 0x80 instruction, control will transfer
to the function _Xint0x80_syscall, which
is in the kernel code segment and will be executed with
supervisor privileges.Console and DDB are then initialized:DDBsys/i386/i386/machdep.c:
cninit();
/* skipped */
#ifdef DDB
kdb_init();
if (boothowto & RB_KDB)
Debugger("Boot flags requested debugger");
#endifThe Task State Segment is another x86 protected mode
structure; the TSS is used by the hardware to store task
information when a task switch occurs.The Local Descriptor Table is used to reference userland
code and data. Several selectors are defined to point to the
LDT; they are the system call gates and the user code and data
selectors:/usr/include/machine/segments.h:
#define LSYS5CALLS_SEL 0 /* forced by intel BCS */
#define LSYS5SIGR_SEL 1
#define L43BSDCALLS_SEL 2 /* notyet */
#define LUCODE_SEL 3
#define LSOL26CALLS_SEL 4 /* Solaris >= 2.6 system call gate */
#define LUDATA_SEL 5
/* separate stack, es,fs,gs sels ? */
/* #define LPOSIXCALLS_SEL 5*/ /* notyet */
#define LBSDICALLS_SEL 16 /* BSDI system call gate */
#define NLDT (LBSDICALLS_SEL + 1)Next, proc0's Process Control Block
(struct pcb) structure is initialized.
proc0 is a struct proc structure that
describes a kernel process. It is always present while the
kernel is running, therefore it is declared as global:sys/kern/kern_init.c:
struct proc proc0;The structure struct pcb is a part of a
proc structure. It is defined in
/usr/include/machine/pcb.h and has a
process's information specific to the i386 architecture, such
as register values.mi_startup()This function performs a bubble sort of all the system
initialization objects and then calls the entry of each object
one by one:sys/kern/init_main.c:
for (sipp = sysinit; *sipp; sipp++) {
/* ... skipped ... */
/* Call function */
(*((*sipp)->func))((*sipp)->udata);
/* ... skipped ... */
}Although the sysinit framework is described in the Developers'
Handbook, I will discuss its internals here.sysinit objectsEvery system initialization object (sysinit object) is
created by calling the SYSINIT() macro. Let us take the
announce sysinit object as an example. This object
prints the copyright message:sys/kern/init_main.c:
static void
print_caddr_t(void *data __unused)
{
printf("%s", (char *)data);
}
SYSINIT(announce, SI_SUB_COPYRIGHT, SI_ORDER_FIRST, print_caddr_t, copyright)The subsystem ID for this object is SI_SUB_COPYRIGHT
(0x0800001), which comes right after the SI_SUB_CONSOLE
(0x0800000). So, the copyright message will be printed out
first, just after the console initialization.Let us take a look at what exactly the macro
SYSINIT() does. It expands to a
C_SYSINIT() macro. The
C_SYSINIT() macro then expands to a static
struct sysinit structure declaration with
another DATA_SET macro call:/usr/include/sys/kernel.h:
#define C_SYSINIT(uniquifier, subsystem, order, func, ident) \
static struct sysinit uniquifier ## _sys_init = { \ subsystem, \
order, \ func, \ ident \ }; \ DATA_SET(sysinit_set,uniquifier ##
_sys_init);
#define SYSINIT(uniquifier, subsystem, order, func, ident) \
C_SYSINIT(uniquifier, subsystem, order, \
(sysinit_cfunc_t)(sysinit_nfunc_t)func, (void *)ident)The DATA_SET() macro expands to a
MAKE_SET(), and that macro is the point
where all the sysinit magic is hidden:/usr/include/linker_set.h:
#define MAKE_SET(set, sym) \
static void const * const __set_##set##_sym_##sym = &sym; \
__asm(".section .set." #set ",\"aw\""); \
__asm(".long " #sym); \
__asm(".previous")
#endif
#define TEXT_SET(set, sym) MAKE_SET(set, sym)
#define DATA_SET(set, sym) MAKE_SET(set, sym)In our case, the following declaration will occur:static struct sysinit announce_sys_init = {
SI_SUB_COPYRIGHT,
SI_ORDER_FIRST,
(sysinit_cfunc_t)(sysinit_nfunc_t) print_caddr_t,
(void *) copyright
};
static void const *const __set_sysinit_set_sym_announce_sys_init =
&announce_sys_init;
__asm(".section .set.sysinit_set" ",\"aw\"");
__asm(".long " "announce_sys_init");
__asm(".previous");The first __asm instruction will create
an ELF section within the kernel's executable. This will
happen at kernel link time. The section will have the name
.set.sysinit_set. The content of this
section is one 32-bit value, the address of the announce_sys_init
structure, which is what the second
__asm emits. The third
__asm instruction marks the end of the
section. If a directive with the same section name occurred
before, the content, i.e., the 32-bit value, will be appended
to the existing section, thus forming an array of 32-bit
pointers.Running objdump on a kernel
binary, you may notice the presence of such small
sections:&prompt.user; objdump -h /kernel
7 .set.cons_set 00000014 c03164c0 c03164c0 002154c0 2**2
CONTENTS, ALLOC, LOAD, DATA
8 .set.kbddriver_set 00000010 c03164d4 c03164d4 002154d4 2**2
CONTENTS, ALLOC, LOAD, DATA
9 .set.scrndr_set 00000024 c03164e4 c03164e4 002154e4 2**2
CONTENTS, ALLOC, LOAD, DATA
10 .set.scterm_set 0000000c c0316508 c0316508 00215508 2**2
CONTENTS, ALLOC, LOAD, DATA
11 .set.sysctl_set 0000097c c0316514 c0316514 00215514 2**2
CONTENTS, ALLOC, LOAD, DATA
12 .set.sysinit_set 00000664 c0316e90 c0316e90 00215e90 2**2
CONTENTS, ALLOC, LOAD, DATAThis screen dump shows that the size of .set.sysinit_set
section is 0x664 bytes, so 0x664/sizeof(void
*) sysinit objects are compiled into the kernel.
The other sections such as .set.sysctl_set
represent other linker sets.By defining a variable of type struct
linker_set the content of
.set.sysinit_set section will be
collected into that variable:sys/kern/init_main.c:
extern struct linker_set sysinit_set; /* XXX */The struct linker_set is defined as
follows:/usr/include/linker_set.h:
struct linker_set {
int ls_length;
void *ls_items[1]; /* really ls_length of them, trailing NULL */
};The first field will be equal to the number of sysinit
objects, and the second field will be a NULL-terminated array
of pointers to them.Returning to the mi_startup()
discussion, it should now be clear how the sysinit objects
are organized. The mi_startup()
function sorts them and calls each. The very last object is
the system scheduler:/usr/include/sys/kernel.h:
enum sysinit_sub_id {
SI_SUB_DUMMY = 0x0000000, /* not executed; for linker*/
SI_SUB_DONE = 0x0000001, /* processed*/
SI_SUB_CONSOLE = 0x0800000, /* console*/
SI_SUB_COPYRIGHT = 0x0800001, /* first use of console*/
...
SI_SUB_RUN_SCHEDULER = 0xfffffff /* scheduler: no return*/
};The system scheduler sysinit object is defined in the file
sys/vm/vm_glue.c, and the entry point for
that object is scheduler(). That
function is actually an infinite loop, and it represents a
process with PID 0, the swapper process. The proc0 structure,
mentioned before, is used to describe it.The first user process, called init,
is created by the sysinit object
init:sys/kern/init_main.c:
static void
create_init(const void *udata __unused)
{
int error;
int s;
s = splhigh();
error = fork1(&proc0, RFFDG | RFPROC, &initproc);
if (error)
panic("cannot fork init: %d\n", error);
initproc->p_flag |= P_INMEM | P_SYSTEM;
cpu_set_fork_handler(initproc, start_init, NULL);
remrunqueue(initproc);
splx(s);
}
SYSINIT(init,SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL)create_init() allocates a new
process by calling fork1(), but does not
mark it runnable. When this new process is scheduled for
execution by the scheduler,
start_init() will be called. That
function is defined in init_main.c. It
tries to load and exec the init binary,
probing /sbin/init first, then
/sbin/oinit,
/sbin/init.bak, and finally
/stand/sysinstall:sys/kern/init_main.c:
static char init_path[MAXPATHLEN] =
#ifdef INIT_PATH
__XSTRING(INIT_PATH);
#else
"/sbin/init:/sbin/oinit:/sbin/init.bak:/stand/sysinstall";
#endif
diff --git a/en_US.ISO8859-1/books/arch-handbook/scsi/chapter.xml b/en_US.ISO8859-1/books/arch-handbook/scsi/chapter.xml
index 7de627b5b9..dfde154052 100644
--- a/en_US.ISO8859-1/books/arch-handbook/scsi/chapter.xml
+++ b/en_US.ISO8859-1/books/arch-handbook/scsi/chapter.xml
@@ -1,2239 +1,2239 @@
Common Access Method SCSI ControllersSergeyBabkinWritten by MurrayStokelyModifications for Handbook made by SynopsisSCSIThis document assumes that the reader has a general
understanding of device drivers in FreeBSD and of the SCSI
protocol. Much of the information in this document was
extracted from the drivers:ncr (/sys/pci/ncr.c) by
Wolfgang Stanglmeier and Stefan Essersym (/sys/dev/sym/sym_hipd.c) by
Gerard Roudieraic7xxx
(/sys/dev/aic7xxx/aic7xxx.c) by Justin
T. Gibbsand from the CAM code itself (by Justin T. Gibbs, see
/sys/cam/*). When some solution looked the
most logical and was essentially verbatim extracted from the
code by Justin T. Gibbs, I marked it as
recommended.The document is illustrated with examples in
pseudo-code. Although sometimes the examples have many details
and look like real code, it is still pseudo-code. It was
written to demonstrate the concepts in an understandable way.
For a real driver other approaches may be more modular and
efficient. It also abstracts from the hardware details, as well
as issues that would cloud the demonstrated concepts or that are
supposed to be described in the other chapters of the developers
handbook. Such details are commonly shown as calls to functions
with descriptive names, comments or pseudo-statements.
Fortunately real life full-size examples with all the details
can be found in the real drivers.General ArchitectureCommon Access Method (CAM)CAM stands for Common Access Method. It is a generic way to
address the I/O buses in a SCSI-like way. This allows a
separation of the generic device drivers from the drivers
controlling the I/O bus: for example the disk driver becomes
able to control disks on both SCSI, IDE, and/or any other bus so
the disk driver portion does not have to be rewritten (or copied
and modified) for every new I/O bus. Thus the two most
important active entities are:CD-ROMtapeIDEPeripheral Modules - a
driver for peripheral devices (disk, tape, CD-ROM,
etc.)SCSI Interface Modules (SIM) -
Host Bus Adapter drivers for connecting to an I/O bus such
as SCSI or IDE.A peripheral driver receives requests from the OS, converts
them to a sequence of SCSI commands and passes these SCSI
commands to a SCSI Interface Module. The SCSI Interface Module
is responsible for passing these commands to the actual hardware
(or if the actual hardware is not SCSI but, for example, IDE
then also converting the SCSI commands to the native commands of
the hardware).As we are interested in writing a SCSI adapter driver
here, from this point on we will consider everything from the
SIM standpoint.A typical SIM driver needs to include the following
CAM-related header files:#include <cam/cam.h>
#include <cam/cam_ccb.h>
#include <cam/cam_sim.h>
#include <cam/cam_xpt_sim.h>
#include <cam/cam_debug.h>
#include <cam/scsi/scsi_all.h>The first thing each SIM driver must do is register itself
with the CAM subsystem. This is done during the driver's
xxx_attach() function (here and further
xxx_ is used to denote the unique driver name prefix). The
xxx_attach() function itself is called by
the system bus auto-configuration code which we do not describe
here.This is achieved in multiple steps: first it is necessary to
allocate the queue of requests associated with this SIM: struct cam_devq *devq;
if(( devq = cam_simq_alloc(SIZE) )==NULL) {
error; /* some code to handle the error */
}Here SIZE is the size of the queue to be
allocated, the maximal number of requests it could contain. It is
the number of requests that the SIM driver can handle in
parallel on one SCSI card. Commonly it can be calculated
as:SIZE = NUMBER_OF_SUPPORTED_TARGETS * MAX_SIMULTANEOUS_COMMANDS_PER_TARGETNext we create a descriptor of our SIM: struct cam_sim *sim;
if(( sim = cam_sim_alloc(action_func, poll_func, driver_name,
softc, unit, mtx, max_dev_transactions,
max_tagged_dev_transactions, devq) )==NULL) {
cam_simq_free(devq);
error; /* some code to handle the error */
}Note that if we are not able to create a SIM descriptor we
free the devq also because we can do
nothing else with it and we want to conserve memory.If a SCSI card has multiple SCSI
busesSCSIbus
on it then each bus requires its own
cam_sim structure.An interesting question is what to do if a SCSI card has
more than one SCSI bus: do we need one
devq structure per card or per SCSI
bus? The answer given in the comments to the CAM code is:
either way, as the driver's author prefers.The arguments are:action_func - pointer to
the driver's xxx_action function.
static void
xxx_action(struct cam_sim *sim, union ccb *ccb)poll_func - pointer to
the driver's xxx_poll()static void
xxx_poll(struct cam_sim *sim)driver_name - the name of the actual driver,
such as ncr or
wds.softc - pointer to the driver's
internal descriptor for this SCSI card. This pointer will
be used by the driver in future to get private
data.unit - the controller unit number, for example
for controller mps0 this number will be
0mtx - Lock associated with this SIM. For SIMs that don't
know about locking, pass in Giant. For SIMs that do, pass in
the lock used to guard this SIM's data structures. This lock
will be held when xxx_action and xxx_poll are called.max_dev_transactions - maximal number of simultaneous
transactions per SCSI target in the non-tagged mode. This
value will be almost universally equal to 1, with possible
exceptions only for the non-SCSI cards. Also the drivers
that hope to take advantage by preparing one transaction
while another one is executed may set it to 2 but this does
not seem to be worth the complexity.max_tagged_dev_transactions - the same thing, but in the
tagged mode. Tags are the SCSI way to initiate multiple
transactions on a device: each transaction is assigned a
unique tag and the transaction is sent to the device. When
the device completes some transaction it sends back the
result together with the tag so that the SCSI adapter (and
the driver) can tell which transaction was completed. This
argument is also known as the maximal tag depth. It depends
on the abilities of the SCSI adapter.Finally we register the SCSI buses associated with our SCSI
adapter: if(xpt_bus_register(sim, softc, bus_number) != CAM_SUCCESS) {
cam_sim_free(sim, /*free_devq*/ TRUE);
error; /* some code to handle the error */
}If there is one devq structure per
SCSI bus (i.e., we consider a card with multiple buses as
multiple cards with one bus each) then the bus number will
always be 0, otherwise each bus on the SCSI card should get a
distinct number. Each bus needs its own separate structure
cam_sim.After that our controller is completely hooked to the CAM
system. The value of devq can be
discarded now: sim will be passed as an argument in all further
calls from CAM and devq can be derived from it.CAM provides the framework for such asynchronous events.
Some events originate from the lower levels (the SIM drivers),
some events originate from the peripheral drivers, some events
originate from the CAM subsystem itself. Any driver can
register callbacks for some types of the asynchronous events, so
that it would be notified if these events occur.A typical example of such an event is a device reset. Each
transaction and event identifies the devices to which it applies
by means of a path. The target-specific events
normally occur during a transaction with this device. So the
path from that transaction may be re-used to report this event
(this is safe because the event path is copied in the event
reporting routine but not deallocated nor passed anywhere
further). Also it is safe to allocate paths dynamically at any
time including the interrupt routines, although that incurs
certain overhead, and a possible problem with this approach is
that there may be no free memory at that time. For a bus reset
event we need to define a wildcard path including all devices on
the bus. So we can create the path for the future bus reset
events in advance and avoid problems with the future memory
shortage: struct cam_path *path;
if(xpt_create_path(&path, /*periph*/NULL,
cam_sim_path(sim), CAM_TARGET_WILDCARD,
CAM_LUN_WILDCARD) != CAM_REQ_CMP) {
xpt_bus_deregister(cam_sim_path(sim));
cam_sim_free(sim, /*free_devq*/TRUE);
error; /* some code to handle the error */
}
softc->wpath = path;
softc->sim = sim;As you can see the path includes:ID of the peripheral driver (NULL here because we have
none)ID of the SIM driver
(cam_sim_path(sim))SCSI target number of the device (CAM_TARGET_WILDCARD
means all devices)SCSI LUN number of the subdevice (CAM_LUN_WILDCARD means
all LUNs)If the driver cannot allocate this path it will not be able
to work normally, so in that case we dismantle that SCSI
bus.And we save the path pointer in the
softc structure for future use. After
that we save the value of sim (or we can also discard it on the
exit from xxx_probe() if we wish).That is all for a minimalistic initialization. To do things
right there is one more issue left.For a SIM driver there is one particularly interesting
event: when a target device is considered lost. In this case
resetting the SCSI negotiations with this device may be a good
idea. So we register a callback for this event with CAM. The
request is passed to CAM by requesting CAM action on a CAM
control block for this type of request: struct ccb_setasync csa;
xpt_setup_ccb(&csa.ccb_h, path, /*priority*/5);
csa.ccb_h.func_code = XPT_SASYNC_CB;
csa.event_enable = AC_LOST_DEVICE;
csa.callback = xxx_async;
csa.callback_arg = sim;
xpt_action((union ccb *)&csa);Now we take a look at the xxx_action()
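The xxx_async() callback registered above is not shown elsewhere in this section. A minimal sketch might look like the following; the signature is the standard shape of a CAM async callback, while the body and the helper clean_negotiations() (the same hypothetical helper used later in the bus-reset example) are assumptions, not code from any real driver:

```c
static void
xxx_async(void *callback_arg, u_int32_t code, struct cam_path *path, void *arg)
{
	struct cam_sim *sim = (struct cam_sim *) callback_arg;
	struct xxx_softc *softc = (struct xxx_softc *) cam_sim_softc(sim);

	switch (code) {
	case AC_LOST_DEVICE:
		/* The target is gone: forget the negotiated transfer
		 * parameters so that they are re-negotiated from scratch
		 * if a device appears at this ID again (the helper is
		 * hypothetical). */
		clean_negotiations(softc, xpt_path_target_id(path));
		break;
	default:
		break;
	}
}
```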
and xxx_poll() driver entry points.static void
xxx_actionstruct cam_sim *sim,
union ccb *ccbDo some action on request of the CAM subsystem. Sim
describes the SIM for the request, CCB is the request itself.
CCB stands for CAM Control Block. It is a union
of many specific instances, each describing arguments for some
type of transactions. All of these instances share the CCB
header where the common part of arguments is stored.CAM supports the SCSI controllers working in both initiator
(normal) mode and target (simulating a SCSI
device) mode. Here we only consider the part relevant to the
initiator mode.There are a few functions and macros (in other words,
methods) defined to access the public data in the struct
sim:cam_sim_path(sim) - the path ID
(see above)cam_sim_name(sim) - the name of the
simcam_sim_softc(sim) - the pointer to
the softc (driver private data) structure cam_sim_unit(sim) - the unit
number cam_sim_bus(sim) - the bus
IDTo identify the device, xxx_action()
can get the unit number and pointer to its structure softc using
these functions.The type of request is stored in
ccb->ccb_h.func_code. So
generally xxx_action() consists of a big
switch: struct xxx_softc *softc = (struct xxx_softc *) cam_sim_softc(sim);
struct ccb_hdr *ccb_h = &ccb->ccb_h;
int unit = cam_sim_unit(sim);
int bus = cam_sim_bus(sim);
switch(ccb_h->func_code) {
case ...:
...
default:
ccb_h->status = CAM_REQ_INVALID;
xpt_done(ccb);
break;
}As can be seen from the default case (if an unknown command
was received) the return code of the command is set into
ccb->ccb_h.status and the
completed CCB is returned back to CAM by calling
xpt_done(ccb).xpt_done() does not have to be called
from xxx_action(): For example an I/O
request may be enqueued inside the SIM driver and/or its SCSI
controller. Then when the device posts an interrupt
signaling that the processing of this request is complete,
xpt_done() may be called from the interrupt
handling routine.Actually, the CCB status is not only assigned as a return
code but a CCB has some status all the time. Before a CCB is
passed to the xxx_action() routine it gets
the status CAM_REQ_INPROG, meaning that it is in progress. There
are a surprising number of status values defined in
/sys/cam/cam.h which should be able to
represent the status of a request in great detail. More
interesting yet, the status is in fact a bitwise
or of an enumerated status value (the lower 6 bits) and
possible additional flag-like bits (the upper bits). The
enumerated values will be discussed later in more detail. The
summary of them can be found in the Errors Summary section. The
possible status flags are:CAM_DEV_QFRZN - if the SIM driver
gets a serious error (for example, the device does not
respond to the selection or breaks the SCSI protocol) when
processing a CCB it should freeze the request queue by
calling xpt_freeze_simq(), return the
other CCBs for this device that are enqueued but not yet processed
back to the CAM queue, then set this flag for the
troublesome CCB and call xpt_done().
This flag causes the CAM subsystem to unfreeze the queue
after it handles the error.CAM_AUTOSNS_VALID - if the
device returned an error condition and the flag
CAM_DIS_AUTOSENSE is not set in CCB the SIM driver must
execute the REQUEST SENSE command automatically to extract
the sense (extended error information) data from the device.
If this attempt was successful the sense data should be
saved in the CCB and this flag set.CAM_RELEASE_SIMQ - like
CAM_DEV_QFRZN but used in case there is some problem (or
resource shortage) with the SCSI controller itself. Then
all the future requests to the controller should be stopped
by xpt_freeze_simq(). The controller
queue will be restarted after the SIM driver overcomes the
shortage and informs CAM by returning some CCB with this
flag set.CAM_SIM_QUEUED - when SIM puts a
CCB into its request queue this flag should be set (and
removed when this CCB gets dequeued before being returned
back to CAM). This flag is not used anywhere in the CAM
code now, so its purpose is purely diagnostic.CAM_QOS_VALID - The QOS data
is now valid.The function xxx_action() is not
allowed to sleep, so all the synchronization for resource access
must be done using SIM or device queue freezing. Besides the
aforementioned flags the CAM subsystem provides functions
xpt_release_simq() and
xpt_release_devq() to unfreeze the queues
directly, without passing a CCB to CAM.The CCB header contains the following fields:path - path ID for the
requesttarget_id - target device ID for
the requesttarget_lun - LUN ID of the target
devicetimeout - timeout interval for this
command, in millisecondstimeout_ch - a convenience place
for the SIM driver to store the timeout handle (the CAM
subsystem itself does not make any assumptions about
it)flags - various bits of information
about the request spriv_ptr0, spriv_ptr1 - fields reserved
for private use by the SIM driver (such as linking to the
SIM queues or SIM private control blocks); actually, they
exist as unions: spriv_ptr0 and spriv_ptr1 have the type
(void *), spriv_field0 and spriv_field1 have the type
unsigned long, sim_priv.entries[0].bytes and
sim_priv.entries[1].bytes are byte arrays of the size
consistent with the other incarnations of the union and
sim_priv.bytes is one array twice as big.The recommended way of using the SIM private fields of CCB
is to define some meaningful names for them and use these
meaningful names in the driver, like:#define ccb_some_meaningful_name sim_priv.entries[0].bytes
#define ccb_hcb spriv_ptr1 /* for hardware control block */The most common initiator mode requests are:XPT_SCSI_IO - execute an I/O
transactionThe instance struct ccb_scsiio csio of
the union ccb is used to transfer the arguments. They
are:cdb_io - pointer to the SCSI
command buffer or the buffer itselfcdb_len - SCSI command
lengthdata_ptr - pointer to the data
buffer (gets a bit complicated if scatter/gather is
used)dxfer_len - length of the data
to transfersglist_cnt - counter of the
scatter/gather segmentsscsi_status - place to return
the SCSI statussense_data - buffer for the
SCSI sense information if the command returns an error
(the SIM driver is supposed to run the REQUEST SENSE
command automatically in this case if the CCB flag
CAM_DIS_AUTOSENSE is not set)sense_len - the length of that
buffer (if it happens to be larger than the size of
sense_data the SIM driver must silently assume the
smaller value) resid, sense_resid - if the transfer of
data or SCSI sense returned an error these are the
returned counters of the residual (not transferred)
data. They do not seem to be especially meaningful, so
in cases where they are difficult to compute (say,
counting bytes in the SCSI controller's FIFO buffer) an
approximate value will do as well. For a successfully
completed transfer they must be set to
zero.tag_action - the kind of tag to
use:CAM_TAG_ACTION_NONE - do not use tags for this
transactionMSG_SIMPLE_Q_TAG, MSG_HEAD_OF_Q_TAG,
MSG_ORDERED_Q_TAG - value equal to the appropriate
tag message (see /sys/cam/scsi/scsi_message.h); this
gives only the tag type, the SIM driver must assign
the tag value itselfThe general logic of handling this request is the
following:The first thing to do is to check for possible races, to
make sure that the command did not get aborted when it was
sitting in the queue: struct ccb_scsiio *csio = &ccb->csio;
if ((ccb_h->status & CAM_STATUS_MASK) != CAM_REQ_INPROG) {
xpt_done(ccb);
return;
}Also we check that the device is supported at all by our
controller: if(ccb_h->target_id > OUR_MAX_SUPPORTED_TARGET_ID
|| ccb_h->target_id == OUR_SCSI_CONTROLLERS_OWN_ID) {
ccb_h->status = CAM_TID_INVALID;
xpt_done(ccb);
return;
}
if(ccb_h->target_lun > OUR_MAX_SUPPORTED_LUN) {
ccb_h->status = CAM_LUN_INVALID;
xpt_done(ccb);
return;
}Then allocate whatever data structures (such as
card-dependent hardware control
block) we need to process this
request. If we can not then freeze the SIM queue and
remember that we have a pending operation, return the CCB
back and ask CAM to re-queue it. Later when the resources
become available the SIM queue must be unfrozen by returning
a CCB with the CAM_RELEASE_SIMQ bit set
in its status. Otherwise, if all went well, link the CCB
with the hardware control block (HCB) and mark it as
queued. struct xxx_hcb *hcb = allocate_hcb(softc, unit, bus);
if(hcb == NULL) {
softc->flags |= RESOURCE_SHORTAGE;
xpt_freeze_simq(sim, /*count*/1);
ccb_h->status = CAM_REQUEUE_REQ;
xpt_done(ccb);
return;
}
hcb->ccb = ccb; ccb_h->ccb_hcb = (void *)hcb;
ccb_h->status |= CAM_SIM_QUEUED;Extract the target data from CCB into the hardware
control block. Check if we are asked to assign a tag and if
yes then generate a unique tag and build the SCSI tag
messages. The SIM driver is also responsible for
negotiations with the devices to set the maximal mutually
supported bus width, synchronous rate and offset. hcb->target = ccb_h->target_id; hcb->lun = ccb_h->target_lun;
generate_identify_message(hcb);
if( ccb_h->tag_action != CAM_TAG_ACTION_NONE )
generate_unique_tag_message(hcb, ccb_h->tag_action);
if( !target_negotiated(hcb) )
generate_negotiation_messages(hcb);Then set up the SCSI command. The command storage may
be specified in the CCB in many interesting ways, selected
by the CCB flags. The command buffer can be contained in
CCB or pointed to, in the latter case the pointer may be
physical or virtual. Since the hardware commonly needs
a physical address we always convert the address to the
physical one, typically using the busdma API.If a physical address is
requested it is OK to return the CCB with the status
CAM_REQ_INVALID; the current drivers
do that. If necessary a physical address can also be
converted or mapped back to a virtual address, but only with
great pain, so we do not do that. if(ccb_h->flags & CAM_CDB_POINTER) {
/* CDB is a pointer */
if(!(ccb_h->flags & CAM_CDB_PHYS)) {
/* CDB pointer is virtual */
hcb->cmd = vtobus(csio->cdb_io.cdb_ptr);
} else {
/* CDB pointer is physical */
hcb->cmd = csio->cdb_io.cdb_ptr ;
}
} else {
/* CDB is in the ccb (buffer) */
hcb->cmd = vtobus(csio->cdb_io.cdb_bytes);
}
hcb->cmdlen = csio->cdb_len;Now it is time to set up the data. Again, the data
storage may be specified in the CCB in many interesting
ways, selected by the CCB flags. First we get the
direction of the data transfer. The simplest case is if
there is no data to transfer: int dir = (ccb_h->flags & CAM_DIR_MASK);
if (dir == CAM_DIR_NONE)
goto end_data;Then we check if the data is in one chunk or in a
scatter-gather list, and the addresses are physical or
virtual. The SCSI controller may be able to handle only a
limited number of chunks of limited length. If the request
hits this limitation we return an error. We use a special
function to return the CCB, so that the HCB resource shortages
are handled in one place. The functions to add chunks are
driver-dependent, and here we leave them without detailed
implementation. See description of the SCSI command (CDB)
handling for the details on the address-translation issues.
If some variation is too difficult or impossible to
implement with a particular card it is OK to return the
status CAM_REQ_INVALID. Actually, it
seems like the scatter-gather ability is not used anywhere
in the CAM code now. But at least the case for a single
non-scattered virtual buffer must be implemented; it is
actively used by CAM. int rv;
initialize_hcb_for_data(hcb);
if(!(ccb_h->flags & CAM_SCATTER_VALID)) {
/* single buffer */
if(!(ccb_h->flags & CAM_DATA_PHYS)) {
rv = add_virtual_chunk(hcb, csio->data_ptr, csio->dxfer_len, dir);
} else {
rv = add_physical_chunk(hcb, csio->data_ptr, csio->dxfer_len, dir);
}
} else {
int i;
struct bus_dma_segment *segs;
segs = (struct bus_dma_segment *)csio->data_ptr;
if ((ccb_h->flags & CAM_SG_LIST_PHYS) != 0) {
/* The SG list pointer is physical */
rv = setup_hcb_for_physical_sg_list(hcb, segs, csio->sglist_cnt);
} else if (!(ccb_h->flags & CAM_DATA_PHYS)) {
/* SG buffer pointers are virtual */
for (i = 0; i < csio->sglist_cnt; i++) {
rv = add_virtual_chunk(hcb, segs[i].ds_addr,
segs[i].ds_len, dir);
if (rv != CAM_REQ_CMP)
break;
}
} else {
/* SG buffer pointers are physical */
for (i = 0; i < csio->sglist_cnt; i++) {
rv = add_physical_chunk(hcb, segs[i].ds_addr,
segs[i].ds_len, dir);
if (rv != CAM_REQ_CMP)
break;
}
}
}
if(rv != CAM_REQ_CMP) {
/* we expect that add_*_chunk() functions return CAM_REQ_CMP
* if they added a chunk successfully, CAM_REQ_TOO_BIG if
* the request is too big (too many bytes or too many chunks),
* CAM_REQ_INVALID in case of other troubles
*/
free_hcb_and_ccb_done(hcb, ccb, rv);
return;
}
end_data:If disconnection is disabled for this CCB we pass this
information to the hcb: if(ccb_h->flags & CAM_DIS_DISCONNECT)
hcb_disable_disconnect(hcb);If the controller is able to run the REQUEST SENSE command
all by itself then the value of the flag CAM_DIS_AUTOSENSE
should also be passed to it, to prevent automatic REQUEST
SENSE if the CAM subsystem does not want it.The only thing left is to set up the timeout, pass our
hcb to the hardware and return; the rest will be done by the
interrupt handler (or timeout handler). ccb_h->timeout_ch = timeout(xxx_timeout, (caddr_t) hcb,
(ccb_h->timeout * hz) / 1000); /* convert milliseconds to ticks */
put_hcb_into_hardware_queue(hcb);
return;And here is a possible implementation of the function
returning CCB: static void
free_hcb_and_ccb_done(struct xxx_hcb *hcb, union ccb *ccb, u_int32_t status)
{
ccb->ccb_h.ccb_hcb = 0;
if(hcb != NULL) {
struct xxx_softc *softc = hcb->softc;
untimeout(xxx_timeout, (caddr_t) hcb, ccb->ccb_h.timeout_ch);
/* we're about to free a hcb, so the shortage has ended */
if(softc->flags & RESOURCE_SHORTAGE) {
softc->flags &= ~RESOURCE_SHORTAGE;
status |= CAM_RELEASE_SIMQ;
}
free_hcb(hcb); /* also removes hcb from any internal lists */
}
ccb->ccb_h.status = status |
(ccb->ccb_h.status & ~(CAM_STATUS_MASK|CAM_SIM_QUEUED));
xpt_done(ccb);
}XPT_RESET_DEV - send the SCSI
BUS DEVICE RESET message to a deviceNo data is transferred in this CCB except the header;
its most interesting argument is target_id.
Depending on the controller hardware a hardware control
block just like for the XPT_SCSI_IO request may be
constructed (see XPT_SCSI_IO request description) and sent
to the controller or the SCSI controller may be immediately
programmed to send this RESET message to the device or this
request may be just not supported (and return the status
CAM_REQ_INVALID). Also on completion
of the request all the disconnected transactions for this
target must be aborted (probably in the interrupt
routine).Also all the current negotiations for the target are
lost on reset, so they might be cleaned too. Or their
clearing may be deferred, because the target will
request re-negotiation on the next
transaction anyway.XPT_RESET_BUS - send the RESET
signal to the SCSI busNo arguments are passed in the CCB; the only interesting
argument is the SCSI bus indicated by the struct sim
pointer.A minimalistic implementation would forget the SCSI
negotiations for all the devices on the bus and return the
status CAM_REQ_CMP.The proper implementation would in addition actually
reset the SCSI bus (possibly also reset the SCSI controller)
and mark all the CCBs being processed, both those in the
hardware queue and those being disconnected, as done with
the status CAM_SCSI_BUS_RESET. Like: int targ, lun;
struct xxx_hcb *h, *hh;
struct ccb_trans_settings neg;
struct cam_path *path;
/* The SCSI bus reset may take a long time, in this case its completion
* should be checked by interrupt or timeout. But for simplicity
* we assume here that it is really fast.
*/
reset_scsi_bus(softc);
/* drop all enqueued CCBs */
for(h = softc->first_queued_hcb; h != NULL; h = hh) {
hh = h->next;
free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET);
}
/* the clean values of negotiations to report */
neg.bus_width = 8;
neg.sync_period = neg.sync_offset = 0;
neg.valid = (CCB_TRANS_BUS_WIDTH_VALID
| CCB_TRANS_SYNC_RATE_VALID | CCB_TRANS_SYNC_OFFSET_VALID);
/* drop all disconnected CCBs and clean negotiations */
for(targ=0; targ <= OUR_MAX_SUPPORTED_TARGET; targ++) {
clean_negotiations(softc, targ);
/* report the event if possible */
if(xpt_create_path(&path, /*periph*/NULL,
cam_sim_path(sim), targ,
CAM_LUN_WILDCARD) == CAM_REQ_CMP) {
xpt_async(AC_TRANSFER_NEG, path, &neg);
xpt_free_path(path);
}
for(lun=0; lun <= OUR_MAX_SUPPORTED_LUN; lun++)
for(h = softc->first_discon_hcb[targ][lun]; h != NULL; h = hh) {
hh=h->next;
free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET);
}
}
ccb->ccb_h.status = CAM_REQ_CMP;
xpt_done(ccb);
/* report the event */
xpt_async(AC_BUS_RESET, softc->wpath, NULL);
return;Implementing the SCSI bus reset as a function may be a
good idea because it would be re-used by the timeout
function as a last resort if things go
wrong.XPT_ABORT - abort the specified
CCBThe arguments are transferred in the instance
struct ccb_abort cab of the union ccb. The
only argument field in it is:abort_ccb - pointer to the CCB to
be abortedIf the abort is not supported just return the status
CAM_UA_ABORT. This is also the easy way to minimally
implement this call: return CAM_UA_ABORT in any case.The hard way is to implement this request honestly.
First check that abort applies to a SCSI transaction: struct ccb *abort_ccb;
abort_ccb = ccb->cab.abort_ccb;
if(abort_ccb->ccb_h.func_code != XPT_SCSI_IO) {
ccb->ccb_h.status = CAM_UA_ABORT;
xpt_done(ccb);
return;
}Then it is necessary to find this CCB in our queue.
This can be done by walking the list of all our hardware
control blocks in search for one associated with this
CCB: struct xxx_hcb *hcb, *h;
hcb = NULL;
/* We assume that softc->first_hcb is the head of the list of all
* HCBs associated with this bus, including those enqueued for
* processing, being processed by hardware and disconnected ones.
*/
for(h = softc->first_hcb; h != NULL; h = h->next) {
if(h->ccb == abort_ccb) {
hcb = h;
break;
}
}
if(hcb == NULL) {
/* no such CCB in our queue */
ccb->ccb_h.status = CAM_PATH_INVALID;
xpt_done(ccb);
return;
}
Now we look at the current processing status of the HCB.
It may be either sitting in the queue waiting to be sent to
the SCSI bus, being transferred right now, or disconnected
and waiting for the result of the command, or actually
completed by hardware but not yet marked as done by
software. To make sure that we do not get into any races with
hardware we mark the HCB as being aborted, so that if this
HCB is about to be sent to the SCSI bus the SCSI controller
will see this flag and skip it. int hstatus;
/* shown as a function, in case special action is needed to make
* this flag visible to hardware
*/
set_hcb_flags(hcb, HCB_BEING_ABORTED);
abort_again:
hstatus = get_hcb_status(hcb);
switch(hstatus) {
case HCB_SITTING_IN_QUEUE:
remove_hcb_from_hardware_queue(hcb);
/* FALLTHROUGH */
case HCB_COMPLETED:
/* this is an easy case */
free_hcb_and_ccb_done(hcb, abort_ccb, CAM_REQ_ABORTED);
break;If the CCB is being transferred right now we would like
to signal to the SCSI controller in some hardware-dependent
way that we want to abort the current transfer. The SCSI
controller would set the SCSI ATTENTION signal and when the
target responds to it send an ABORT message. We also reset
the timeout to make sure that the target is not sleeping
forever. If the command does not get aborted in some
reasonable time, like 10 seconds, the timeout routine will go
ahead and reset the whole SCSI bus. Since the command
will be aborted in some reasonable time we can just return
the abort request now as successfully completed, and mark
the aborted CCB as aborted (but not mark it as done
yet). case HCB_BEING_TRANSFERRED:
untimeout(xxx_timeout, (caddr_t) hcb, abort_ccb->ccb_h.timeout_ch);
abort_ccb->ccb_h.timeout_ch =
timeout(xxx_timeout, (caddr_t) hcb, 10 * hz);
abort_ccb->ccb_h.status = CAM_REQ_ABORTED;
/* ask the controller to abort that HCB, then generate
* an interrupt and stop
*/
if(signal_hardware_to_abort_hcb_and_stop(hcb) < 0) {
/* oops, we missed the race with hardware, this transaction
* got off the bus before we aborted it, try again */
goto abort_again;
}
break;If the CCB is in the list of disconnected ones then set it up
as an abort request and re-queue it at the front of the hardware
queue. Reset the timeout and report the abort request to be
completed. case HCB_DISCONNECTED:
untimeout(xxx_timeout, (caddr_t) hcb, abort_ccb->ccb_h.timeout_ch);
abort_ccb->ccb_h.timeout_ch =
timeout(xxx_timeout, (caddr_t) hcb, 10 * hz);
put_abort_message_into_hcb(hcb);
put_hcb_at_the_front_of_hardware_queue(hcb);
break;
}
ccb->ccb_h.status = CAM_REQ_CMP;
xpt_done(ccb);
return;That is all for the ABORT request, although there is one
more issue. As the ABORT message cleans all the
ongoing transactions on a LUN we have to mark all the other
active transactions on this LUN as aborted. That should be
done in the interrupt routine, after the transaction gets
aborted.Implementing the CCB abort as a function may be quite a
good idea; this function can be re-used if an I/O
transaction times out. The only difference would be that
the timed-out transaction would return the status
CAM_CMD_TIMEOUT for the timed-out request. Then the case
XPT_ABORT would be small, like this: case XPT_ABORT:
struct ccb *abort_ccb;
abort_ccb = ccb->cab.abort_ccb;
if(abort_ccb->ccb_h.func_code != XPT_SCSI_IO) {
ccb->ccb_h.status = CAM_UA_ABORT;
xpt_done(ccb);
return;
}
if(xxx_abort_ccb(abort_ccb, CAM_REQ_ABORTED) < 0)
/* no such CCB in our queue */
ccb->ccb_h.status = CAM_PATH_INVALID;
else
ccb->ccb_h.status = CAM_REQ_CMP;
xpt_done(ccb);
return;XPT_SET_TRAN_SETTINGS - explicitly
set values of SCSI transfer settingsThe arguments are transferred in the instance
struct ccb_trans_settings cts of the union
ccb:valid - a bitmask showing which
settings should be updated:CCB_TRANS_SYNC_RATE_VALID -
synchronous transfer rateCCB_TRANS_SYNC_OFFSET_VALID -
synchronous offsetCCB_TRANS_BUS_WIDTH_VALID - bus
widthCCB_TRANS_DISC_VALID - set
enable/disable disconnectionCCB_TRANS_TQ_VALID - set
enable/disable tagged queuingflags - consists of two parts,
binary arguments and identification of sub-operations.
The binary arguments are:CCB_TRANS_DISC_ENB - enable
disconnectionCCB_TRANS_TAG_ENB - enable
tagged queuingthe sub-operations are:CCB_TRANS_CURRENT_SETTINGS
- change the current negotiationsCCB_TRANS_USER_SETTINGS -
remember the desired user values sync_period,
sync_offset - self-explanatory, if sync_offset==0
then the asynchronous mode is requested bus_width -
bus width, in bits (not bytes)Two sets of negotiated parameters are supported, the
user settings and the current settings. The user settings
are not really used much in the SIM drivers; this is mostly
just a piece of memory where the upper levels can store (and
later recall) their ideas about the parameters. Setting the
user parameters does not cause re-negotiation of the
transfer rates. But when the SCSI controller does a
negotiation it must never set the values higher than the
user parameters, so they essentially form the upper
boundary.The current settings are, as the name says, current.
Changing them means that the parameters must be
re-negotiated on the next transfer. Again, these
new current settings are not supposed to be
forced on the device; they are just used as the initial step
of negotiations. Also they must be limited by actual
capabilities of the SCSI controller: for example, if the
SCSI controller has 8-bit bus and the request asks to set
16-bit wide transfers this parameter must be silently
truncated to 8-bit transfers before sending it to the
device.One caveat is that the bus width and synchronous
parameters are per target while the disconnection and tag
enabling parameters are per LUN.The recommended implementation is to keep 3 sets of
negotiated (bus width and synchronous transfer)
parameters:user - the user set, as
abovecurrent - those actually in
effectgoal - those requested by
setting of the current
parametersThe code looks like: struct ccb_trans_settings *cts;
int targ, lun;
int flags;
cts = &ccb->cts;
targ = ccb_h->target_id;
lun = ccb_h->target_lun;
flags = cts->flags;
if(flags & CCB_TRANS_USER_SETTINGS) {
if(flags & CCB_TRANS_SYNC_RATE_VALID)
softc->user_sync_period[targ] = cts->sync_period;
if(flags & CCB_TRANS_SYNC_OFFSET_VALID)
softc->user_sync_offset[targ] = cts->sync_offset;
if(flags & CCB_TRANS_BUS_WIDTH_VALID)
softc->user_bus_width[targ] = cts->bus_width;
if(flags & CCB_TRANS_DISC_VALID) {
softc->user_tflags[targ][lun] &= ~CCB_TRANS_DISC_ENB;
softc->user_tflags[targ][lun] |= flags & CCB_TRANS_DISC_ENB;
}
if(flags & CCB_TRANS_TQ_VALID) {
softc->user_tflags[targ][lun] &= ~CCB_TRANS_TQ_ENB;
softc->user_tflags[targ][lun] |= flags & CCB_TRANS_TQ_ENB;
}
}
if(flags & CCB_TRANS_CURRENT_SETTINGS) {
if(flags & CCB_TRANS_SYNC_RATE_VALID)
softc->goal_sync_period[targ] =
max(cts->sync_period, OUR_MIN_SUPPORTED_PERIOD);
if(flags & CCB_TRANS_SYNC_OFFSET_VALID)
softc->goal_sync_offset[targ] =
min(cts->sync_offset, OUR_MAX_SUPPORTED_OFFSET);
if(flags & CCB_TRANS_BUS_WIDTH_VALID)
softc->goal_bus_width[targ] = min(cts->bus_width, OUR_BUS_WIDTH);
if(flags & CCB_TRANS_DISC_VALID) {
softc->current_tflags[targ][lun] &= ~CCB_TRANS_DISC_ENB;
softc->current_tflags[targ][lun] |= flags & CCB_TRANS_DISC_ENB;
}
if(flags & CCB_TRANS_TQ_VALID) {
softc->current_tflags[targ][lun] &= ~CCB_TRANS_TQ_ENB;
softc->current_tflags[targ][lun] |= flags & CCB_TRANS_TQ_ENB;
}
}
ccb->ccb_h.status = CAM_REQ_CMP;
xpt_done(ccb);
return;Then when the next I/O request is processed it will
check whether it has to re-negotiate, for example by calling the
function target_negotiated(hcb). It can be implemented like
this: int
target_negotiated(struct xxx_hcb *hcb)
{
struct softc *softc = hcb->softc;
int targ = hcb->targ;
if( softc->current_sync_period[targ] != softc->goal_sync_period[targ]
|| softc->current_sync_offset[targ] != softc->goal_sync_offset[targ]
|| softc->current_bus_width[targ] != softc->goal_bus_width[targ] )
return 0; /* FALSE */
else
return 1; /* TRUE */
}After the values are re-negotiated the resulting values
must be assigned to both current and goal parameters, so for
future I/O transactions the current and goal parameters
would be the same and
target_negotiated() would return TRUE.
When the card is initialized (in
xxx_attach()) the current negotiation
values must be initialized to narrow asynchronous mode, and the
goal and user values must be initialized to the maximal
values supported by the controller.XPT_GET_TRAN_SETTINGS - get values
of SCSI transfer settingsThis operation is the reverse of XPT_SET_TRAN_SETTINGS.
Fill up the CCB instance
struct ccb_trans_settings cts with data as
requested by the flags CCB_TRANS_CURRENT_SETTINGS or
CCB_TRANS_USER_SETTINGS (if both are set then the existing
drivers return the current settings). Set all the bits in
the valid field.XPT_CALC_GEOMETRY - calculate
logical (BIOS)
geometry of the diskThe arguments are transferred in the instance
struct ccb_calc_geometry ccg of the union
ccb:block_size - input, block
(a.k.a. sector) size in bytesvolume_size - input, volume
size in bytescylinders - output, logical
cylindersheads - output, logical
headssecs_per_track - output,
logical sectors per trackIf the returned geometry differs significantly from what
the SCSI controller BIOS thinks and a disk on
this SCSI controller is used as a boot device the system may not
be able to boot. The typical calculation example taken from
the aic7xxx driver is: struct ccb_calc_geometry *ccg;
u_int32_t size_mb;
u_int32_t secs_per_cylinder;
int extended;
ccg = &ccb->ccg;
size_mb = ccg->volume_size
/ ((1024L * 1024L) / ccg->block_size);
extended = check_cards_EEPROM_for_extended_geometry(softc);
if (size_mb > 1024 && extended) {
ccg->heads = 255;
ccg->secs_per_track = 63;
} else {
ccg->heads = 64;
ccg->secs_per_track = 32;
}
secs_per_cylinder = ccg->heads * ccg->secs_per_track;
ccg->cylinders = ccg->volume_size / secs_per_cylinder;
ccb->ccb_h.status = CAM_REQ_CMP;
xpt_done(ccb);
return;This gives the general idea; the exact calculation
depends on the quirks of the particular BIOS. If the BIOS
provides no way to set the extended translation
flag in EEPROM, this flag should normally be assumed equal to
1. Other popular geometries are: 128 heads, 63 sectors - Symbios controllers
16 heads, 63 sectors - old controllersSome system BIOSes and SCSI BIOSes fight with each other
with variable success; for example, a combination of Symbios
875/895 SCSI and a Phoenix BIOS can give a geometry of 128/63 after
power-up and 255/63 after a hard reset or soft
reboot.XPT_PATH_INQ - path inquiry, in
other words get the SIM driver and SCSI controller (also
known as HBA - Host Bus Adapter) propertiesThe properties are returned in the instance
struct ccb_pathinq cpi of the union
ccb:version_num - the SIM driver version number, now all
drivers use 1hba_inquiry - bitmask of features supported by the
controller:PI_MDP_ABLE - supports MDP message (something from
SCSI3?)PI_WIDE_32 - supports 32 bit wide
SCSIPI_WIDE_16 - supports 16 bit wide
SCSIPI_SDTR_ABLE - can negotiate synchronous transfer
ratePI_LINKED_CDB - supports linked
commandsPI_TAG_ABLE - supports tagged
commandsPI_SOFT_RST - supports soft reset alternative (hard
reset and soft reset are mutually exclusive within a
SCSI bus)target_sprt - flags for target mode support, 0 if
unsupportedhba_misc - miscellaneous controller
features:PIM_SCANHILO - bus scans from high ID to low
IDPIM_NOREMOVE - removable devices not included in
scanPIM_NOINITIATOR - initiator role not
supportedPIM_NOBUSRESET - user has disabled initial BUS
RESEThba_eng_cnt - mysterious HBA engine count, something
related to compression, currently always set to 0vuhba_flags - vendor-unique flags, currently unusedmax_target - maximal supported target ID (7 for
8-bit bus, 15 for 16-bit bus, 127 for Fibre
Channel)max_lun - maximal supported LUN ID (7 for older SCSI
controllers, 63 for newer ones)async_flags - bitmask of installed Async handler,
unused nowhpath_id - highest Path ID in the subsystem, unused
nowunit_number - the controller unit number,
cam_sim_unit(sim)bus_id - the bus number, cam_sim_bus(sim)initiator_id - the SCSI ID of the controller
itselfbase_transfer_speed - nominal transfer speed in KB/s
for asynchronous narrow transfers, equal to 3300 for
SCSIsim_vid - SIM driver's vendor id, a zero-terminated
string of maximal length SIM_IDLEN including the
terminating zerohba_vid - SCSI controller's vendor id, a
zero-terminated string of maximal length HBA_IDLEN
including the terminating zerodev_name - device driver name, a zero-terminated
string of maximal length DEV_IDLEN including the
terminating zero, equal to cam_sim_name(sim)The recommended way of setting the string fields is
to use strncpy, like: strncpy(cpi->dev_name, cam_sim_name(sim), DEV_IDLEN);After setting the values, set the status to CAM_REQ_CMP
and mark the CCB as done.Pollingstatic void
xxx_pollstruct cam_sim *simThe poll function is used to simulate the interrupts when
the interrupt subsystem is not functioning (for example, when
the system has crashed and is creating the system dump). The
CAM subsystem sets the proper interrupt level before calling the
poll routine. So all it needs to do is to call the interrupt
routine (or the other way around, the poll routine may be doing
the real action and the interrupt routine would just call the
poll routine). Why bother with a separate function then?
- Due to different calling conventions. The
+ This has to do with different calling conventions. The
xxx_poll routine gets the struct cam_sim
- pointer as its argument when the PCI interrupt routine by common
+ pointer as its argument while the PCI interrupt routine by common
convention gets pointer to the struct
xxx_softc and the ISA interrupt routine
gets just the device unit number. So the poll routine would
normally look like:static void
xxx_poll(struct cam_sim *sim)
{
xxx_intr((struct xxx_softc *)cam_sim_softc(sim)); /* for PCI device */
}orstatic void
xxx_poll(struct cam_sim *sim)
{
xxx_intr(cam_sim_unit(sim)); /* for ISA device */
}Asynchronous EventsIf an asynchronous event callback has been set up then the
callback function should be defined.static void
ahc_async(void *callback_arg, u_int32_t code, struct cam_path *path, void *arg)callback_arg - the value supplied when registering the
callbackcode - identifies the type of eventpath - identifies the devices to which the event
appliesarg - event-specific argumentImplementation for a single type of event, AC_LOST_DEVICE,
looks like: struct xxx_softc *softc;
struct cam_sim *sim;
int targ;
struct ccb_trans_settings neg;
sim = (struct cam_sim *)callback_arg;
softc = (struct xxx_softc *)cam_sim_softc(sim);
switch (code) {
case AC_LOST_DEVICE:
targ = xpt_path_target_id(path);
if(targ <= OUR_MAX_SUPPORTED_TARGET) {
clean_negotiations(softc, targ);
/* send indication to CAM */
neg.bus_width = 8;
neg.sync_period = neg.sync_offset = 0;
neg.valid = (CCB_TRANS_BUS_WIDTH_VALID
| CCB_TRANS_SYNC_RATE_VALID | CCB_TRANS_SYNC_OFFSET_VALID);
xpt_async(AC_TRANSFER_NEG, path, &neg);
}
break;
default:
break;
}InterruptsThe exact type of the interrupt routine depends on the type
of the peripheral bus (PCI, ISA and so on) to which the SCSI
controller is connected.The interrupt routines of the SIM drivers run at the
interrupt level splcam. So splcam() should
be used in the driver to synchronize activity between the
interrupt routine and the rest of the driver (for a
multiprocessor-aware driver things get yet more interesting but
we ignore this case here). The pseudo-code in this document
happily ignores the problems of synchronization. The real code
must not ignore them. A simple-minded approach is to set
splcam() on the entry to the other routines
and reset it on return thus protecting them by one big critical
section. To make sure that the interrupt level will always
be restored, a wrapper function can be defined, like: static void
xxx_action(struct cam_sim *sim, union ccb *ccb)
{
int s;
s = splcam();
xxx_action1(sim, ccb);
splx(s);
}
static void
xxx_action1(struct cam_sim *sim, union ccb *ccb)
{
... process the request ...
}This approach is simple and robust, but the problem with it
is that interrupts may get blocked for a relatively long time,
and this would negatively affect the system's performance. On
the other hand, the functions of the spl()
family have rather high overhead, so a vast number of tiny
critical sections may not be good either.The conditions handled by the interrupt routine and the
details depend very much on the hardware. We consider the set
of typical conditions.First, we check if a SCSI reset was encountered on the bus
(probably caused by another SCSI controller on the same SCSI
bus). If so, we drop all the enqueued and disconnected requests,
report the events, and re-initialize our SCSI controller. It is
important that during this initialization the controller will
not issue another reset, or else two controllers on the same SCSI
bus could ping-pong resets forever. The case of a fatal
controller error/hang could be handled in the same place, but it
will probably also need to send a RESET signal to the SCSI bus to
reset the status of the connections with the SCSI
devices. int fatal=0;
struct ccb_trans_settings neg;
struct cam_path *path;
if( detected_scsi_reset(softc)
|| (fatal = detected_fatal_controller_error(softc)) ) {
int targ, lun;
struct xxx_hcb *h, *hh;
/* drop all enqueued CCBs */
for(h = softc->first_queued_hcb; h != NULL; h = hh) {
hh = h->next;
free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET);
}
/* the clean values of negotiations to report */
neg.bus_width = 8;
neg.sync_period = neg.sync_offset = 0;
neg.valid = (CCB_TRANS_BUS_WIDTH_VALID
| CCB_TRANS_SYNC_RATE_VALID | CCB_TRANS_SYNC_OFFSET_VALID);
/* drop all disconnected CCBs and clean negotiations */
for(targ=0; targ <= OUR_MAX_SUPPORTED_TARGET; targ++) {
clean_negotiations(softc, targ);
/* report the event if possible */
if(xpt_create_path(&path, /*periph*/NULL,
cam_sim_path(sim), targ,
CAM_LUN_WILDCARD) == CAM_REQ_CMP) {
xpt_async(AC_TRANSFER_NEG, path, &neg);
xpt_free_path(path);
}
for(lun=0; lun <= OUR_MAX_SUPPORTED_LUN; lun++)
for(h = softc->first_discon_hcb[targ][lun]; h != NULL; h = hh) {
hh=h->next;
if(fatal)
free_hcb_and_ccb_done(h, h->ccb, CAM_UNREC_HBA_ERROR);
else
free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET);
}
}
/* report the event */
xpt_async(AC_BUS_RESET, softc->wpath, NULL);
/* re-initialization may take a lot of time, in such case
* its completion should be signaled by another interrupt or
* checked on timeout - but for simplicity we assume here that
* it is really fast
*/
if(!fatal) {
reinitialize_controller_without_scsi_reset(softc);
} else {
reinitialize_controller_with_scsi_reset(softc);
}
schedule_next_hcb(softc);
return;
}If the interrupt is not caused by a controller-wide condition
then probably something has happened to the current hardware
control block. Depending on the hardware there may be other
non-HCB-related events; we just do not consider them here. Then
we analyze what happened to this HCB: struct xxx_hcb *hcb, *h, *hh;
int hcb_status, scsi_status;
int ccb_status;
int targ;
int lun_to_freeze;
hcb = get_current_hcb(softc);
if(hcb == NULL) {
/* either stray interrupt or something went very wrong
* or this is something hardware-dependent
*/
handle as necessary;
return;
}
targ = hcb->target;
hcb_status = get_status_of_current_hcb(softc);First we check if the HCB has completed and if so we check
the returned SCSI status. if(hcb_status == COMPLETED) {
scsi_status = get_completion_status(hcb);Then check whether this status is related to the REQUEST SENSE
command and, if so, handle it in a simple way. if(hcb->flags & DOING_AUTOSENSE) {
if(scsi_status == GOOD) { /* autosense was successful */
hcb->ccb->ccb_h.status |= CAM_AUTOSNS_VALID;
free_hcb_and_ccb_done(hcb, hcb->ccb, CAM_SCSI_STATUS_ERROR);
} else {
autosense_failed:
free_hcb_and_ccb_done(hcb, hcb->ccb, CAM_AUTOSENSE_FAIL);
}
schedule_next_hcb(softc);
return;
}Otherwise the command itself has completed; pay more attention to
the details. If auto-sense is not disabled for this CCB and the
command has failed with sense data, then run the REQUEST SENSE
command to receive that data. hcb->ccb->csio.scsi_status = scsi_status;
calculate_residue(hcb);
if( (hcb->ccb->ccb_h.flags & CAM_DIS_AUTOSENSE)==0
&& ( scsi_status == CHECK_CONDITION
|| scsi_status == COMMAND_TERMINATED) ) {
/* start auto-SENSE */
hcb->flags |= DOING_AUTOSENSE;
setup_autosense_command_in_hcb(hcb);
restart_current_hcb(softc);
return;
}
if(scsi_status == GOOD)
free_hcb_and_ccb_done(hcb, hcb->ccb, CAM_REQ_CMP);
else
free_hcb_and_ccb_done(hcb, hcb->ccb, CAM_SCSI_STATUS_ERROR);
schedule_next_hcb(softc);
return;
}One typical thing would be negotiation events: negotiation
messages received from a SCSI target (in answer to our
negotiation attempt or on the target's initiative) or the target is
unable to negotiate (rejects our negotiation messages or does
not answer them). switch(hcb_status) {
case TARGET_REJECTED_WIDE_NEG:
/* revert to 8-bit bus */
softc->current_bus_width[targ] = softc->goal_bus_width[targ] = 8;
/* report the event */
neg.bus_width = 8;
neg.valid = CCB_TRANS_BUS_WIDTH_VALID;
xpt_async(AC_TRANSFER_NEG, hcb->ccb->ccb_h.path_id, &neg);
continue_current_hcb(softc);
return;
case TARGET_ANSWERED_WIDE_NEG:
{
int wd;
wd = get_target_bus_width_request(softc);
if(wd <= softc->goal_bus_width[targ]) {
/* answer is acceptable */
softc->current_bus_width[targ] =
softc->goal_bus_width[targ] = neg.bus_width = wd;
/* report the event */
neg.valid = CCB_TRANS_BUS_WIDTH_VALID;
xpt_async(AC_TRANSFER_NEG, hcb->ccb->ccb_h.path_id, &neg);
} else {
prepare_reject_message(hcb);
}
}
continue_current_hcb(softc);
return;
case TARGET_REQUESTED_WIDE_NEG:
{
int wd;
wd = get_target_bus_width_request(softc);
wd = min (wd, OUR_BUS_WIDTH);
wd = min (wd, softc->user_bus_width[targ]);
if(wd != softc->current_bus_width[targ]) {
/* the bus width has changed */
softc->current_bus_width[targ] =
softc->goal_bus_width[targ] = neg.bus_width = wd;
/* report the event */
neg.valid = CCB_TRANS_BUS_WIDTH_VALID;
xpt_async(AC_TRANSFER_NEG, hcb->ccb->ccb_h.path_id, &neg);
}
prepare_width_nego_response(hcb, wd);
}
continue_current_hcb(softc);
return;
}Then we handle any errors that could have happened during
auto-sense in the same simple-minded way as before. Otherwise
we look closer at the details again. if(hcb->flags & DOING_AUTOSENSE)
goto autosense_failed;
switch(hcb_status) {The next event we consider is unexpected disconnect, which
is considered normal after an ABORT or BUS DEVICE RESET message
and abnormal in other cases. case UNEXPECTED_DISCONNECT:
if(requested_abort(hcb)) {
/* abort affects all commands on that target+LUN, so
* mark all disconnected HCBs on that target+LUN as aborted too
*/
for(h = softc->first_discon_hcb[hcb->target][hcb->lun];
h != NULL; h = hh) {
hh=h->next;
free_hcb_and_ccb_done(h, h->ccb, CAM_REQ_ABORTED);
}
ccb_status = CAM_REQ_ABORTED;
} else if(requested_bus_device_reset(hcb)) {
int lun;
/* reset affects all commands on that target, so
* mark all disconnected HCBs on that target+LUN as reset
*/
for(lun=0; lun <= OUR_MAX_SUPPORTED_LUN; lun++)
for(h = softc->first_discon_hcb[hcb->target][lun];
h != NULL; h = hh) {
hh=h->next;
free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET);
}
/* send event */
xpt_async(AC_SENT_BDR, hcb->ccb->ccb_h.path_id, NULL);
/* this was the CAM_RESET_DEV request itself, it is completed */
ccb_status = CAM_REQ_CMP;
} else {
calculate_residue(hcb);
ccb_status = CAM_UNEXP_BUSFREE;
/* request the further code to freeze the queue */
hcb->ccb->ccb_h.status |= CAM_DEV_QFRZN;
lun_to_freeze = hcb->lun;
}
break;If the target refuses to accept tags we notify CAM about
that and return all commands for this LUN: case TAGS_REJECTED:
/* report the event */
neg.flags = 0 & ~CCB_TRANS_TAG_ENB;
neg.valid = CCB_TRANS_TQ_VALID;
xpt_async(AC_TRANSFER_NEG, hcb->ccb->ccb_h.path_id, &neg);
ccb_status = CAM_MSG_REJECT_REC;
/* request the further code to freeze the queue */
hcb->ccb->ccb_h.status |= CAM_DEV_QFRZN;
lun_to_freeze = hcb->lun;
break;Then we check a number of other conditions, with processing
basically limited to setting the CCB status: case SELECTION_TIMEOUT:
ccb_status = CAM_SEL_TIMEOUT;
/* request the further code to freeze the queue */
hcb->ccb->ccb_h.status |= CAM_DEV_QFRZN;
lun_to_freeze = CAM_LUN_WILDCARD;
break;
case PARITY_ERROR:
ccb_status = CAM_UNCOR_PARITY;
break;
case DATA_OVERRUN:
case ODD_WIDE_TRANSFER:
ccb_status = CAM_DATA_RUN_ERR;
break;
default:
/* all other errors are handled in a generic way */
ccb_status = CAM_REQ_CMP_ERR;
/* request the further code to freeze the queue */
hcb->ccb->ccb_h.status |= CAM_DEV_QFRZN;
lun_to_freeze = CAM_LUN_WILDCARD;
break;
}Then we check if the error was serious enough to freeze the
input queue until it is processed, and do so if it is: if(hcb->ccb->ccb_h.status & CAM_DEV_QFRZN) {
/* freeze the queue */
xpt_freeze_devq(hcb->ccb->ccb_h.path, /*count*/1);
/* re-queue all commands for this target/LUN back to CAM */
for(h = softc->first_queued_hcb; h != NULL; h = hh) {
hh = h->next;
if(targ == h->targ
&& (lun_to_freeze == CAM_LUN_WILDCARD || lun_to_freeze == h->lun) )
free_hcb_and_ccb_done(h, h->ccb, CAM_REQUEUE_REQ);
}
}
free_hcb_and_ccb_done(hcb, hcb->ccb, ccb_status);
schedule_next_hcb(softc);
return;This concludes the generic interrupt handling although
specific controllers may require some additions.Errors SummaryWhen executing an I/O request many things may go wrong. The
reason for the error can be reported in the CCB status in great
detail. Examples of use are spread throughout this document.
For completeness here is the summary of recommended responses
for the typical error conditions:CAM_RESRC_UNAVAIL - some resource
is temporarily unavailable and the SIM driver cannot
generate an event when it will become available. An example
of this resource would be some intra-controller hardware
resource for which the controller does not generate an
interrupt when it becomes available.CAM_UNCOR_PARITY - unrecovered
parity error occurredCAM_DATA_RUN_ERR - data overrun or
unexpected data phase (going in other direction than
specified in CAM_DIR_MASK) or odd transfer length for wide
transferCAM_SEL_TIMEOUT - selection timeout
occurred (target does not respond)CAM_CMD_TIMEOUT - command timeout
occurred (the timeout function ran)CAM_SCSI_STATUS_ERROR - the device
returned an errorCAM_AUTOSENSE_FAIL - the device
returned an error and the REQUEST SENSE command failedCAM_MSG_REJECT_REC - MESSAGE REJECT
message was receivedCAM_SCSI_BUS_RESET - received SCSI
bus resetCAM_REQ_CMP_ERR -
impossible SCSI phase occurred or something
else as weird or just a generic error if further detail is
not availableCAM_UNEXP_BUSFREE - unexpected
disconnect occurredCAM_BDR_SENT - BUS DEVICE RESET
message was sent to the targetCAM_UNREC_HBA_ERROR - unrecoverable
Host Bus Adapter ErrorCAM_REQ_TOO_BIG - the request was
too large for this controllerCAM_REQUEUE_REQ - this request
should be re-queued to preserve transaction ordering. This
typically occurs when the SIM recognizes an error that
should freeze the queue and must place other queued requests
for the target at the sim level back into the XPT queue.
Typical cases of such errors are selection timeouts, command
timeouts and other similar conditions. In such cases the
troublesome command returns the status indicating the error,
and the other commands which have not been sent to the bus
yet get re-queued.CAM_LUN_INVALID - the LUN ID in the
request is not supported by the SCSI controllerCAM_TID_INVALID - the target ID in
the request is not supported by the SCSI controllerTimeout HandlingWhen the timeout for an HCB expires that request should be
aborted, just like with an XPT_ABORT request. The only
difference is that the returned status of the aborted request should
be CAM_CMD_TIMEOUT instead of CAM_REQ_ABORTED (which is why
the abort is better implemented as a function). But
there is one more possible problem: what if the abort request
itself gets stuck? In this case the SCSI bus should be
reset, just like with an XPT_RESET_BUS request (and the idea
of implementing it as a function called from both places
applies here too). We should also reset the whole SCSI bus if a
device reset request got stuck. So in the end the timeout
function would look like:static void
xxx_timeout(void *arg)
{
struct xxx_hcb *hcb = (struct xxx_hcb *)arg;
struct xxx_softc *softc;
struct ccb_hdr *ccb_h;
softc = hcb->softc;
ccb_h = &hcb->ccb->ccb_h;
if(hcb->flags & HCB_BEING_ABORTED
|| ccb_h->func_code == XPT_RESET_DEV) {
xxx_reset_bus(softc);
} else {
xxx_abort_ccb(hcb->ccb, CAM_CMD_TIMEOUT);
}
}When we abort a request all the other disconnected requests
to the same target/LUN get aborted too. So the question
arises: should we return them with status CAM_REQ_ABORTED or
CAM_CMD_TIMEOUT? The current drivers use CAM_CMD_TIMEOUT. This
seems logical because if one request timed out then probably
something really bad is happening to the device, so if they
were not disturbed they would time out by themselves.
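The timeout escalation policy above can be sketched as follows. The structure fields and status constants here are simplified stand-ins for the real CAM definitions, and xxx_abort_hcb()/xxx_reset_bus() are hypothetical helpers; the point is only the shared-abort-function idea and the escalate-to-bus-reset rule.

```c
/* Simplified stand-ins for the CAM status codes discussed above. */
enum { CAM_REQ_ABORTED = 1, CAM_CMD_TIMEOUT, CAM_BUS_RESET_SENT };

struct xxx_hcb {
    int being_aborted;      /* is an abort already in progress? */
    int is_device_reset;    /* this HCB is itself a BUS DEVICE RESET */
    int done_status;        /* completion status reported to CAM */
};

/* Shared abort implementation: the XPT_ABORT handler would pass
 * CAM_REQ_ABORTED, the timeout handler passes CAM_CMD_TIMEOUT. */
static void
xxx_abort_hcb(struct xxx_hcb *hcb, int status)
{
    hcb->being_aborted = 1;
    hcb->done_status = status;  /* real code also aborts on the wire */
}

static void
xxx_reset_bus(struct xxx_hcb *hcb)
{
    hcb->done_status = CAM_BUS_RESET_SENT;
}

/* Timeout policy from the text: if the abort itself (or a device
 * reset request) got stuck, escalate to a SCSI bus reset; otherwise
 * abort the request with CAM_CMD_TIMEOUT. */
static void
xxx_timeout(struct xxx_hcb *hcb)
{
    if (hcb->being_aborted || hcb->is_device_reset)
        xxx_reset_bus(hcb);
    else
        xxx_abort_hcb(hcb, CAM_CMD_TIMEOUT);
}
```

Because the abort path is one function taking the completion status as a parameter, the XPT_ABORT handler and the timeout handler differ only in the status they pass.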
diff --git a/en_US.ISO8859-1/books/developers-handbook/ipv6/chapter.xml b/en_US.ISO8859-1/books/developers-handbook/ipv6/chapter.xml
index 568cc8ba35..6a51621e89 100644
--- a/en_US.ISO8859-1/books/developers-handbook/ipv6/chapter.xml
+++ b/en_US.ISO8859-1/books/developers-handbook/ipv6/chapter.xml
@@ -1,1638 +1,1638 @@
IPv6 InternalsIPv6/IPsec ImplementationYoshinobuInoueContributed by This section should explain IPv6 and IPsec related
implementation internals. These functionalities are derived
from the KAME
projectIPv6ConformanceThe IPv6 related functions conform, or try to conform,
to the latest set of IPv6 specifications. For future
reference we list some of the relevant documents below
(NOTE: this is not a complete list -
it is too hard to maintain...).For details please refer to the specific chapters in this
document, RFCs, manual pages, or comments in the source
code.Conformance tests have been performed on the KAME STABLE
kit at the TAHI project. Results can be viewed at http://www.tahi.org/report/KAME/.
We also took part in the University of New Hampshire IOL tests (http://www.iol.unh.edu/)
in the past, with our earlier snapshots.RFC1639: FTP Operation Over Big Address Records
(FOOBAR)RFC2428 is preferred over RFC1639. FTP clients
will first try RFC2428, then fall back to RFC1639 if
that fails.RFC1886: DNS Extensions to support IPv6RFC1933: Transition Mechanisms for IPv6 Hosts and
RoutersIPv4 compatible address is not supported.automatic tunneling (described in 4.3 of this
RFC) is not supported.&man.gif.4; interface implements
IPv[46]-over-IPv[46] tunnel in a generic way, and it
covers "configured tunnel" described in the spec.
See 23.5.1.5 in this
document for details.RFC1981: Path MTU Discovery for IPv6RFC2080: RIPng for IPv6usr.sbin/route6d supports this.RFC2292: Advanced Sockets API for IPv6For supported library functions/kernel APIs, see
sys/netinet6/ADVAPI.RFC2362: Protocol Independent Multicast-Sparse Mode
(PIM-SM)RFC2362 defines packet formats for PIM-SM.
draft-ietf-pim-ipv6-01.txt is
written based on this.RFC2373: IPv6 Addressing Architecturesupports node required addresses, and conforms
to the scope requirement.RFC2374: An IPv6 Aggregatable Global Unicast Address
Formatsupports a 64-bit Interface ID length.RFC2375: IPv6 Multicast Address AssignmentsUserland applications use the well-known
addresses assigned in the RFC.RFC2428: FTP Extensions for IPv6 and NATsRFC2428 is preferred over RFC1639. FTP clients
will first try RFC2428, then fall back to RFC1639 if
that fails.RFC2460: IPv6 specificationRFC2461: Neighbor discovery for IPv6See 23.5.1.2 in
this document for details.RFC2462: IPv6 Stateless Address
AutoconfigurationSee 23.5.1.4 in
this document for details.RFC2463: ICMPv6 for IPv6 specificationSee 23.5.1.9 in
this document for details.RFC2464: Transmission of IPv6 Packets over Ethernet
NetworksRFC2465: MIB for IPv6: Textual Conventions and
General GroupNecessary statistics are gathered by the kernel.
Actual IPv6 MIB support is provided as a patchkit
for ucd-snmp.RFC2466: MIB for IPv6: ICMPv6 groupNecessary statistics are gathered by the kernel.
Actual IPv6 MIB support is provided as patchkit for
ucd-snmp.RFC2467: Transmission of IPv6 Packets over FDDI
NetworksRFC2497: Transmission of IPv6 packet over ARCnet
NetworksRFC2553: Basic Socket Interface Extensions for
IPv6IPv4 mapped address (3.7) and special behavior
of IPv6 wildcard bind socket (3.8) are supported.
See 23.5.1.12 in
this document for details.RFC2675: IPv6 JumbogramsSee 23.5.1.7
in this document for details.RFC2710: Multicast Listener Discovery for
IPv6RFC2711: IPv6 router alert optiondraft-ietf-ipngwg-router-renum-08:
Router renumbering for IPv6draft-ietf-ipngwg-icmp-namelookups-02:
IPv6 Name Lookups Through ICMPdraft-ietf-ipngwg-icmp-name-lookups-03:
IPv6 Name Lookups Through ICMPdraft-ietf-pim-ipv6-01.txt: PIM
for IPv6&man.pim6dd.8; implements dense mode.
&man.pim6sd.8; implements sparse mode.draft-itojun-ipv6-tcp-to-anycast-00:
Disconnecting TCP connection toward IPv6 anycast
addressdraft-yamamoto-wideipv6-comm-model-00See 23.5.1.6 in
this document for details.draft-ietf-ipngwg-scopedaddr-format-00.txt:
An Extension of Format for IPv6 Scoped AddressesNeighbor DiscoveryNeighbor Discovery is fairly stable. Currently Address
Resolution, Duplicate Address Detection, and Neighbor
Unreachability Detection are supported. In the near future
we will be adding Proxy Neighbor Advertisement support in
the kernel and an Unsolicited Neighbor Advertisement
transmission command as an admin tool.If DAD fails, the address will be marked "duplicated"
and a message will be logged to syslog (and usually to
console). The "duplicated" mark can be checked with
&man.ifconfig.8;. It is administrators' responsibility to
check for and recover from DAD failures. The behavior
should be improved in the near future.Some network drivers loop multicast packets back
to themselves, even if instructed not to do so (especially in
promiscuous mode). In such cases DAD may fail, because the DAD
engine sees an inbound NS packet (actually from the node
itself) and considers it a sign of a duplicate. You may
want to look at the #if condition marked "heuristics" in
sys/netinet6/nd6_nbr.c:nd6_dad_timer() as a workaround (note
that the code fragment in the "heuristics" section is not spec
conformant).The Neighbor Discovery specification (RFC2461) does not talk
about neighbor cache handling in the following cases:when there was no neighbor cache entry, and the node
received unsolicited RS/NS/NA/redirect packet without
link-layer addressneighbor cache handling on medium without link-layer
address (we need a neighbor cache entry for IsRouter
bit)For the first case, we implemented a workaround based on
discussions on the IETF ipngwg mailing list. For more details,
see the comments in the source code and the email thread started
from (IPng 7155), dated Feb 6 1999.The IPv6 on-link determination rule (RFC2461) is quite
different from assumptions in BSD network code. At this
moment, no on-link determination rule is supported where the
default router list is empty (RFC2461, section 5.2, last
sentence in 2nd paragraph - note that the spec misuses the
words "host" and "node" in several places in the
section).To avoid possible DoS attacks and infinite loops, only
10 options per ND packet are accepted now. Therefore, if you
have 20 prefix options attached to an RA, only the first 10
prefixes will be recognized. If this troubles you, please
ask on the FREEBSD-CURRENT mailing list and/or modify
nd6_maxndopt in sys/netinet6/nd6.c. If
there is high demand we may provide a sysctl knob for the
variable.Scope IndexIPv6 uses scoped addresses. Therefore, it is very
important to specify the scope index (interface index for a
link-local address, or site index for a site-local address)
with an IPv6 address. Without a scope index, a scoped IPv6
address is ambiguous to the kernel, and the kernel will not be
able to determine the outbound interface for a
packet.Ordinary userland applications should use the advanced API
(RFC2292) to specify the scope index, or interface index. For
a similar purpose, the sin6_scope_id member in the sockaddr_in6
structure is defined in RFC2553. However, the semantics of
sin6_scope_id are rather vague. If you care about the
portability of your application, we suggest you use the
advanced API rather than sin6_scope_id.In the kernel, an interface index for a link-local scoped
address is embedded into the 2nd 16-bit word (3rd and 4th bytes)
of the IPv6 address. For example, you may see something
like: fe80:1::200:f8ff:fe01:6317in the routing table and interface address structure
(struct in6_ifaddr). The address above is a link-local
unicast address which belongs to a network interface whose
interface identifier is 1. The embedded index enables us to
identify IPv6 link-local addresses over multiple interfaces
effectively and with only a little code change.Routing daemons and configuration programs, like
&man.route6d.8; and &man.ifconfig.8;, will need to
manipulate the "embedded" scope index. These programs use
routing sockets and ioctls (like SIOCGIFADDR_IN6) and the
kernel API will return IPv6 addresses with the 2nd 16-bit word
filled in. The APIs are for manipulating kernel internal
structures. Programs that use these APIs have to be prepared
for differences between kernels anyway.When you specify a scoped address on the command line,
NEVER write the embedded form (such as ff02:1::1 or
fe80:2::fedc). This is not supposed to work. Always use
the standard form, like ff02::1 or fe80::fedc, with a command line
option for specifying the interface (like ping6 -I ne0
ff02::1). In general, if a command does not
have a command line option to specify the outgoing interface, that
command is not ready to accept scoped addresses. This may
seem to be the opposite of IPv6's premise of supporting the "dentist
office" situation. We believe that the specifications need some
improvement here.Some of the userland tools support an extended numeric IPv6
syntax, as documented in
draft-ietf-ipngwg-scopedaddr-format-00.txt.
You can specify the outgoing link by using the name of the outgoing
interface, like "fe80::1%ne0". This way you will be able to
specify a link-local scoped address without much
trouble.To use this extension in your program, you will need to
use &man.getaddrinfo.3; and &man.getnameinfo.3; with
NI_WITHSCOPEID. The implementation currently assumes a 1-to-1
relationship between a link and an interface, which is
stronger than what specs say.Plug and PlayMost of the IPv6 stateless address autoconfiguration is
implemented in the kernel. Neighbor Discovery functions are
implemented in the kernel as a whole. Router Advertisement
(RA) input for hosts is implemented in the kernel. Router
Solicitation (RS) output for endhosts, RS input for routers,
and RA output for routers are implemented in the
userland.Assignment of link-local, and special
addressesAn IPv6 link-local address is generated from the IEEE802
address (Ethernet MAC address). Each interface is
assigned an IPv6 link-local address automatically, when
the interface comes up (IFF_UP). Also, a direct route for
the link-local address is added to the routing table.Here is an output of the netstat command:Internet6:
Destination Gateway Flags Netif Expire
fe80:1::%ed0/64 link#1 UC ed0
fe80:2::%ep0/64 link#2 UC ep0Interfaces that have no IEEE802 address (pseudo
interfaces like tunnel interfaces, or ppp interfaces) will
borrow an IEEE802 address from other interfaces, such as
Ethernet interfaces, whenever possible. If there is no
IEEE802 hardware attached, a last-resort pseudo-random
value, MD5(hostname), will be used as the source of the link-local
address. If it is not suitable for your usage, you will
need to configure the link-local address manually.If an interface is not capable of handling IPv6 (such
as lacking multicast support), a link-local address will not
be assigned to that interface. See section 2 for
details.Each interface joins the solicited-node multicast address
and the link-local all-nodes multicast address (e.g.,
ff02::1:ff01:6317 and ff02::1, respectively, on the link
the interface is attached to). In addition to a link-local
address, the loopback address (::1) will be assigned to
the loopback interface. Also, ::1/128 and ff01::/32 are
automatically added to the routing table, and the loopback
interface joins the node-local multicast group ff01::1.Stateless address autoconfiguration on HostsIn the IPv6 specification, nodes are separated into two
categories: routers and
hosts. Routers forward packets
addressed to others; hosts do not forward packets.
net.inet6.ip6.forwarding defines whether this node is a
router or a host (router if it is 1, host if it is
0).When a host hears a Router Advertisement from a
router, it may autoconfigure itself by stateless
address autoconfiguration. This behavior can be
controlled by net.inet6.ip6.accept_rtadv (the host
autoconfigures itself if it is set to 1). By
autoconfiguration, the network address prefix for the
receiving interface (usually a global address prefix) is
added. The default route is also configured. Routers
periodically generate Router Advertisement packets. To
request an adjacent router to generate an RA packet, a host
can transmit a Router Solicitation. To generate an RS packet
at any time, use the rtsol command.
The &man.rtsold.8; daemon is also available. &man.rtsold.8;
generates Router Solicitations whenever necessary, and it
works great for nomadic usage (notebooks/laptops). If one
wishes to ignore Router Advertisements, use sysctl to set
net.inet6.ip6.accept_rtadv to 0.To generate Router Advertisements from a router, use
the &man.rtadvd.8; daemon.Note that the IPv6 specification assumes the following
items, and nonconforming cases are left
unspecified:Only hosts will listen to router
advertisementsHosts have a single network interface (except
loopback)Therefore, it is unwise to enable
net.inet6.ip6.accept_rtadv on routers or multi-interface
hosts. A misconfigured node can behave strangely
(nonconforming configuration is allowed for those who would
like to experiment).To summarize the sysctl knobs: accept_rtadv forwarding role of the node
--- --- ---
0 0 host (to be manually configured)
0 1 router
1 0 autoconfigured host
(spec assumes that host has single
interface only, autoconfigured host
with multiple interface is
out-of-scope)
1 1 invalid, or experimental
(out-of-scope of spec)RFC2462 has a validation rule against the incoming RA prefix
information option, in 5.5.3 (e). This is to protect
hosts from malicious (or misconfigured) routers that
advertise a very short prefix lifetime. There was an update
from Jim Bound to the ipngwg mailing list (look for "(ipng
6712)" in the archive), and Jim's update is
implemented.See 23.5.1.2
in the document for the relationship between DAD and
autoconfiguration.Generic Tunnel InterfaceGIF (Generic InterFace) is a pseudo interface for
configured tunnels. Details are described in &man.gif.4;.
Currentlyv6 in v6v6 in v4v4 in v6v4 in v4are available. Use &man.gifconfig.8; to assign the physical
(outer) source and destination addresses to gif interfaces.
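As a sketch (the interface name and all addresses are hypothetical), a v6-over-v4 tunnel might be configured like this:

```sh
# Outer (physical) IPv4 tunnel endpoints, assigned with gifconfig
gifconfig gif0 inet 10.0.0.1 10.99.99.1
# Inner IPv6 addresses carried over the tunnel
ifconfig gif0 inet6 2001:db8:1::1 2001:db8:2::1 prefixlen 128
```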
A configuration that uses the same address family for the inner and
outer IP header (v4 in v4, or v6 in v6) is dangerous: it is
very easy to configure interfaces and routing tables to
perform an infinite level of tunneling. Please be
warned.gif can be configured to be ECN-friendly. See 23.5.4.5 for ECN-friendliness
of tunnels, and &man.gif.4; for how to configure it.If you would like to configure an IPv4-in-IPv6 tunnel
with a gif interface, read &man.gif.4; carefully. You will
need to remove the IPv6 link-local address automatically
assigned to the gif interface.Source Address SelectionThe current source selection rule is scope oriented (there
are some exceptions - see below). For a given destination,
a source IPv6 address is selected by the following
rule:If the source address is explicitly specified by the
user (e.g., via the advanced API), the specified
address is used.If there is an address assigned to the outgoing
interface (which is usually determined by looking up the
routing table) that has the same scope as the
destination address, the address is used.This is the most typical case.If there is no address that satisfies the above
condition, choose a global address assigned to one of
the interfaces on the sending node.If there is no address that satisfies the above
condition, and the destination address is of site-local scope,
choose a site-local address assigned to one of the
interfaces on the sending node.If there is no address that satisfies the above
condition, choose the address associated with the
routing table entry for the destination. This is the
last resort, which may cause scope violation.For instance, ::1 is selected for ff01::1,
fe80:1::200:f8ff:fe01:6317 for fe80:1::2a0:24ff:feab:839b
(note that embedded interface index - described in 23.5.1.3 - helps us
choose the right source address. Those embedded indices
will not appear on the wire). If the outgoing interface has
multiple addresses for the scope, a source is selected on a
longest-match basis (rule 3). Suppose
2001:0DB8:808:1:200:f8ff:fe01:6317 and
2001:0DB8:9:124:200:f8ff:fe01:6317 are given to the outgoing
interface. 2001:0DB8:808:1:200:f8ff:fe01:6317 is chosen as
the source for the destination 2001:0DB8:800::1.Note that the above rule is not documented in the IPv6
spec. It is considered an "up to implementation" item. There
are some cases where we do not use the above rule. One
example is a connected TCP session, where we use the address
kept in the tcb as the source. Another example is the source
address of a Neighbor Advertisement. Under the spec (RFC2461
7.2.2) the NA's source should be the target address of the
corresponding NS. In this case we follow the spec
rather than the above longest-match rule.For new connections (when rule 1 does not apply),
deprecated addresses (addresses with preferred lifetime = 0)
will not be chosen as source address if other choices are
available. If no other choices are available, deprecated
address will be used as a last resort. If there are
multiple choices of deprecated addresses, the above scope
rule will be used to choose from those deprecated addresses.
If you would like to prohibit the use of deprecated addresses
for some reason, configure net.inet6.ip6.use_deprecated to
0. The issue related to deprecated addresses is described in
RFC2462 5.5.4 (NOTE: there is some debate underway in IETF
ipngwg on how to use "deprecated" address).Jumbo PayloadThe Jumbo Payload hop-by-hop option is implemented and
can be used to send IPv6 packets with payloads longer than
65,535 octets. However, no physical interface whose
MTU is more than 65,535 is currently supported, so such payloads can
be seen only on the loopback interface (i.e., lo0).If you want to try jumbo payloads, you first have to
reconfigure the kernel so that the MTU of the loopback
interface is more than 65,535 bytes; add the following to
the kernel configuration file:options "LARGE_LOMTU" #To
test jumbo payloadand recompile the new kernel.Then you can test jumbo payloads by the &man.ping6.8;
command with -b and -s options. The -b option must be
specified to enlarge the size of the socket buffer and the
-s option specifies the length of the packet, which should
be more than 65,535. For example, type as follows:&prompt.user; ping6 -b 70000 -s 68000 ::1The IPv6 specification requires that the Jumbo Payload
option must not be used in a packet that carries a fragment
header. If this condition is broken, an ICMPv6 Parameter
Problem message must be sent to the sender. This specification
is followed, but you cannot usually see an ICMPv6 error
caused by this requirement.When an IPv6 packet is received, the frame length is
checked and compared to the length specified in the payload
length field of the IPv6 header or in the value of the Jumbo
Payload option, if any. If the former is shorter than the
latter, the packet is discarded and statistics are
incremented. You can see the statistics in the output of the
&man.netstat.8; command with the `-s -p ip6' option:&prompt.user; netstat -s -p ip6
ip6:
(snip)
1 with data size < data lengthSo, the kernel does not send an ICMPv6 error unless the
erroneous packet is an actual Jumbo Payload, that is, its
packet size is more than 65,535 bytes. As described above,
currently no physical interface with such a huge MTU is
supported, so the kernel rarely returns an ICMPv6 error.TCP/UDP over jumbograms is not supported at this moment.
This is because we have no medium (other than loopback) to
test this. Contact us if you need this.IPsec does not work on jumbograms. This is due to some
specification twists in supporting AH with jumbograms (the AH
header size influences the payload length, and this makes it
really hard to authenticate an inbound packet with the jumbo
payload option as well as AH).There are fundamental issues in *BSD support for
jumbograms. We would like to address those, but we need
more time to finalize these. To name a few:The mbuf pkthdr.len field is typed as "int" in 4.4BSD,
so it will not hold a jumbogram with len > 2G on 32-bit
architecture CPUs. If we would like to support
jumbogram properly, the field must be expanded to hold
4G + IPv6 header + link-layer header. Therefore, it
must be expanded to at least int64_t (u_int32_t is NOT
enough).We mistakenly use "int" to hold packet lengths in
many places. We need to convert them into a larger
integral type. This needs great care, as we may
experience overflow during packet length
computation.We mistakenly check the ip6_plen field of the IPv6
header for the packet payload length in various places. We
should be checking mbuf pkthdr.len instead. ip6_input()
will perform a sanity check on the jumbo payload option on
input, and we can safely use mbuf pkthdr.len
afterwards.TCP code needs a careful update in a bunch of places,
of course.Loop Prevention in Header ProcessingThe IPv6 specification allows an arbitrary number of extension
headers to be placed in packets. If we implemented the IPv6
packet processing code the way the BSD IPv4 code is
implemented, the kernel stack might overflow due to a long
function call chain. The sys/netinet6 code is carefully designed to
- avoid kernel stack overflow. Because of this, sys/netinet6
+ avoid kernel stack overflow, so sys/netinet6
code defines its own protocol switch structure, as "struct
ip6protosw" (see
netinet6/ip6protosw.h). There is no
such update to the IPv4 part (sys/netinet), for compatibility,
but a small change was added to its pr_input() prototype. So
- "struct ipprotosw" is also defined. Because of this, if you
+ "struct ipprotosw" is also defined. As a result, if you
receive an IPsec-over-IPv4 packet with a massive number of IPsec
headers, the kernel stack may blow up. IPsec-over-IPv6 is okay.
- (Off-course, for those all IPsec headers to be processed,
+ (Of course, for all those IPsec headers to be processed,
each such IPsec header must pass each IPsec check. So an
anonymous attacker will not be able to mount such an
attack.)ICMPv6After RFC2463 was published, the IETF ipngwg decided to
disallow ICMPv6 error packets against ICMPv6 redirects, to
prevent an ICMPv6 storm on a network medium. This is already
implemented in the kernel.ApplicationsFor userland programming, we support the IPv6 socket API as
specified in RFC2553, RFC2292 and upcoming Internet
drafts.TCP/UDP over IPv6 is available and quite stable. You
can enjoy &man.telnet.1;, &man.ftp.1;, &man.rlogin.1;,
&man.rsh.1;, &man.ssh.1;, etc. These applications are
protocol independent; that is, they automatically choose
IPv4 or IPv6 according to DNS.Kernel InternalsWhile ip_forward() calls ip_output(), ip6_forward()
directly calls if_output() since routers must not divide
IPv6 packets into fragments.ICMPv6 should contain as much of the original packet as
possible, up to 1280 octets. UDP6/IP6 port unreach, for instance,
should contain all extension headers and the *unchanged*
UDP6 and IP6 headers. So, all IP6 functions except TCP
never convert network byte order into host byte order, in order to
preserve the original packet.tcp_input(), udp6_input() and icmp6_input() cannot
assume that the IP6 header immediately precedes the transport
headers, due to extension headers. So, in6_cksum() was implemented
to handle packets whose IP6 header and transport header are
not contiguous. Neither TCP/IP6 nor UDP6/IP6 header structures
exist for checksum calculation.To process the IP6 header, extension headers and transport
headers easily, network drivers are now required to store
packets in one internal mbuf or one or more external mbufs.
A typical old driver prepares two internal mbufs for 96 to
204 bytes of data; now, however, such packet data is stored in
one external mbuf.netstat -s -p ip6 tells you whether
or not your driver conforms to this requirement. In the
following example, "cce0" violates the requirement. (For
more information, refer to Section 2.)Mbuf statistics:
317 one mbuf
two or more mbuf::
lo0 = 8
cce0 = 10
3282 one ext mbuf
0 two or more ext mbufEach input function calls IP6_EXTHDR_CHECK in the
beginning to check that the region between the IP6 header
and the transport header is contiguous. IP6_EXTHDR_CHECK calls
m_pullup() only if the mbuf has the M_LOOP flag, that is, if the
packet comes from the loopback interface. m_pullup() is never
called for packets coming from physical network interfaces.Neither the IP nor the IP6 reassembly functions ever call
m_pullup().IPv4 Mapped Address and IPv6 Wildcard SocketRFC2553 describes IPv4 mapped address (3.7) and special
behavior of the IPv6 wildcard bind socket (3.8). The spec
allows you to:Accept IPv4 connections with an AF_INET6 wildcard bind
socket.Transmit IPv4 packets over an AF_INET6 socket by using
a special form of the address like ::ffff:10.1.1.1.But the spec itself is very complicated and does not
specify how the socket layer should behave. Here we call
the former the "listening side" and the latter the
"initiating side", for reference purposes.You can perform a wildcard bind on both of the address
families, on the same port.The following table shows the behavior of FreeBSD
4.x.listening side initiating side
(AF_INET6 wildcard (connection to ::ffff:10.1.1.1)
socket gets IPv4 conn.)
--- ---
FreeBSD 4.x configurable supported
default: enabledThe following sections will give you more details, and
how you can configure the behavior.Comments on listening side:It appears that RFC2553 says too little about the wildcard bind
issue, especially about the port space issue, failure modes and the
relationship between AF_INET/INET6 wildcard binds. There can
be several separate interpretations of this RFC which
conform to it but behave differently. So, to implement a
portable application you should assume nothing about the
behavior in the kernel. Using &man.getaddrinfo.3; is the
safest way. Port number space and wildcard bind issues were
discussed in detail on the ipv6imp mailing list in mid-March
1999, and it appears that there is no concrete consensus
(meaning it is up to implementers). You may want to check the
mailing list archives.If a server application would like to accept IPv4 and
IPv6 connections, there are two alternatives.One is using AF_INET and AF_INET6 sockets (you will need
two sockets). Use &man.getaddrinfo.3; with AI_PASSIVE in
ai_flags, and &man.socket.2; and &man.bind.2; on all the
addresses returned. By opening multiple sockets, you can
accept connections on the socket with the proper address
family. IPv4 connections will be accepted by the AF_INET
socket, and IPv6 connections will be accepted by the AF_INET6
socket.Another way is using one AF_INET6 wildcard bind socket.
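The first alternative (separate AF_INET and AF_INET6 sockets) might be sketched as follows; the service "0" (let the kernel pick any free port) and the lack of error reporting are simplifications for illustration only:

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    struct addrinfo hints, *res, *ai;
    int bound = 0;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* both AF_INET and AF_INET6 */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_PASSIVE;     /* wildcard addresses for bind() */

    /* A real server would pass its service name or port here. */
    if (getaddrinfo(NULL, "0", &hints, &res) != 0)
        return 1;
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        int s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (s < 0)
            continue;
        if (bind(s, ai->ai_addr, ai->ai_addrlen) == 0)
            bound++;   /* a real server keeps s and calls listen() */
        close(s);
    }
    freeaddrinfo(res);
    printf("bound %d socket(s)\n", bound);
    return bound > 0 ? 0 : 1;
}
```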
Use &man.getaddrinfo.3; with AI_PASSIVE in ai_flags and
AF_INET6 in ai_family, and set the first argument
(the hostname) to NULL. Then &man.socket.2; and &man.bind.2; to the
address returned (it should be the IPv6 unspecified address). You
can accept both IPv4 and IPv6 packets via this one
socket.To support only IPv6 traffic on an AF_INET6 wildcard-bound
socket portably, always check the peer address when a
connection is made to the AF_INET6 listening socket. If the
address is an IPv4 mapped address, you may want to reject the
connection. You can check this condition by using the
IN6_IS_ADDR_V4MAPPED() macro.To resolve this issue more easily, there is a
system-dependent &man.setsockopt.2; option, IPV6_BINDV6ONLY,
used like below. int on = 1;
if (setsockopt(s, IPPROTO_IPV6, IPV6_BINDV6ONLY,
&on, sizeof(on)) < 0)
/* handle error */When this call succeeds, this socket will only receive
portable IPv6 application (which works on multiple IPv6
kernels), we believe that the following is the key to the
success:NEVER hardcode AF_INET nor AF_INET6.Use &man.getaddrinfo.3; and &man.getnameinfo.3;
throughout the system. Never use gethostby*(),
getaddrby*(), inet_*() or getipnodeby*(). (To update
existing applications to be IPv6 aware easily, sometimes
getipnodeby*() will be useful. But if possible, try to
rewrite the code to use &man.getaddrinfo.3; and
&man.getnameinfo.3;.)If you would like to connect to a destination, use
&man.getaddrinfo.3; and try all the destinations
returned, like &man.telnet.1; does.Some IPv6 stacks ship with a buggy
&man.getaddrinfo.3;. Ship a minimal working version
with your application and use that as a last
resort.If you would like to use an AF_INET6 socket for both IPv4
and IPv6 outgoing connection, you will need to use
&man.getipnodebyname.3;. When you would like to update your
existing application to be IPv6 aware with minimal effort,
this approach might be chosen. But please note that it is a
temporary solution, because &man.getipnodebyname.3; itself is
not recommended, as it does not handle scoped IPv6 addresses
at all. For IPv6 name resolution, &man.getaddrinfo.3; is
the preferred API. So you should rewrite your application to
use &man.getaddrinfo.3;, when you get the time to do
it.When writing applications that make outgoing
connections, the story becomes much simpler if you treat AF_INET
and AF_INET6 as totally separate address families:
the {set,get}sockopt issues become simpler, and the DNS issues
become simpler. We do not recommend that you rely upon IPv4 mapped
addresses.unified tcp and inpcb codeFreeBSD 4.x uses shared TCP code between IPv4 and IPv6
(from sys/netinet/tcp*) and separate udp4/6 code. It uses a
unified inpcb structure.The platform can be configured to support IPv4 mapped
address. Kernel configuration is summarized as
follows:By default, an AF_INET6 socket will grab IPv4
connections under certain conditions, and can initiate
connections to an IPv4 destination embedded in an IPv4 mapped
IPv6 address.You can disable this on the entire system with sysctl
like below.sysctl
net.inet6.ip6.mapped_addr=0Listening SideEach socket can be configured to support special
AF_INET6 wildcard bind (enabled by default). You can
disable it on a per-socket basis with &man.setsockopt.2;
like below. int on = 1;
if (setsockopt(s, IPPROTO_IPV6, IPV6_BINDV6ONLY,
&on, sizeof(on)) < 0)
/* handle error */A wildcard AF_INET6 socket grabs an IPv4 connection if
and only if the following conditions are
satisfied:there is no AF_INET socket that matches the IPv4
connectionthe AF_INET6 socket is configured to accept IPv4
traffic, i.e., getsockopt(IPV6_BINDV6ONLY) returns
0.There is no problem with open/close ordering.Initiating SideFreeBSD 4.x supports outgoing connections to IPv4
mapped addresses (::ffff:10.1.1.1), if the node is
configured to support IPv4 mapped addresses.sockaddr_storageWhen RFC2553 was about to be finalized, there was
discussion on how struct sockaddr_storage members are named.
One proposal was to prepend "__" to the members (like
"__ss_len"), as they should not be touched. The other
proposal was not to prepend it (like "ss_len") as we need to
touch those members directly. There was no clear consensus
on it.As a result, RFC2553 defines struct sockaddr_storage as
follows: struct sockaddr_storage {
u_char __ss_len; /* address length */
u_char __ss_family; /* address family */
/* and bunch of padding */
};In contrast, the XNET draft defines it as follows: struct sockaddr_storage {
u_char ss_len; /* address length */
u_char ss_family; /* address family */
/* and bunch of padding */
};In December 1999, it was agreed that RFC2553bis should
pick the latter (XNET) definition.The current implementation conforms to the XNET definition,
based on the RFC2553bis discussion.If you look at multiple IPv6 implementations, you will
be able to see both definitions. As a userland programmer,
the most portable way of dealing with it is to:ensure ss_family and/or ss_len are available on the
platform, by using GNU autoconf,have -Dss_family=__ss_family to unify all
occurrences (including header file) into __ss_family,
ornever touch __ss_family. cast to sockaddr * and use
sa_family like: struct sockaddr_storage ss;
family = ((struct sockaddr *)&ss)->sa_familyNetwork DriversThe following two items are now required to be supported by
standard drivers:mbuf clustering requirement. In this stable release,
we changed MINCLSIZE to MHLEN+1 for all the operating
systems in order to make all the drivers behave as we
expect.multicast. If &man.ifmcstat.8; yields no multicast
group for an interface, that interface has to be
patched.If any of the drivers do not support these requirements,
then the drivers cannot be used for IPv6 and/or IPsec
communication. If you find any problem with your card using
IPv6/IPsec, then, please report it to the &a.bugs;.(NOTE: In the past we required all PCMCIA drivers to have
a call to in6_ifattach(). We have no such requirement any
more)TranslatorWe categorize IPv4/IPv6 translators into 4 types:Translator A --- It is used in
the early stage of transition to make it possible to
establish a connection from an IPv6 host in an IPv6 island
to an IPv4 host in the IPv4 ocean.Translator B --- It is used in
the early stage of transition to make it possible to
establish a connection from an IPv4 host in the IPv4 ocean
to an IPv6 host in an IPv6 island.Translator C --- It is used in
the late stage of transition to make it possible to
establish a connection from an IPv4 host in an IPv4 island
to an IPv6 host in the IPv6 ocean.Translator D --- It is used in
the late stage of transition to make it possible to
establish a connection from an IPv6 host in the IPv6 ocean
to an IPv4 host in an IPv4 island.IPsecIPsec is mainly composed of three components.Policy ManagementKey ManagementAH and ESP handlingPolicy ManagementThe kernel implements experimental policy management
code. There are two ways to manage security policy. One is
to configure a per-socket policy using &man.setsockopt.2;. In
this case, policy configuration is described in
&man.ipsec.set.policy.3;. The other is to configure kernel
packet filter-based policy using PF_KEY interface, via
&man.setkey.8;.Policy entries are not re-ordered by their indexes, so
the order in which you add entries is very significant.Key ManagementThe key management code implemented in this kit
(sys/netkey) is a home-brew PFKEY v2 implementation. This
conforms to RFC2367.The home-brew IKE daemon, "racoon" is included in the
kit (kame/kame/racoon). Basically, you will need to run
racoon as a daemon, then set up a policy to require keys (like
ping -P 'out ipsec esp/transport//use').
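For example (the addresses are hypothetical; the policy string is the one shown above), such a policy could be installed with &man.setkey.8;:

```sh
# Use ESP in transport mode for outbound traffic from 10.0.1.1 to
# 10.0.2.2; racoon will negotiate the keys when traffic matches.
setkey -c <<EOF
spdadd 10.0.1.1 10.0.2.2 any -P out ipsec esp/transport//use;
EOF
```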
The kernel will contact racoon daemon as necessary to
exchange keys.AH and ESP HandlingThe IPsec module is implemented as "hooks" into the standard
IPv4/IPv6 processing. When sending a packet,
ip{,6}_output() checks if ESP/AH processing is required by
checking if a matching SPD (Security Policy Database) is
found. If ESP/AH is needed, {esp,ah}{4,6}_output() will be
called and mbuf will be updated accordingly. When a packet
is received, {esp,ah}4_input() will be called based on
protocol number, i.e., (*inetsw[proto])().
{esp,ah}4_input() will decrypt/check the authenticity of the
packet, and strip off the daisy-chained header and padding for
ESP/AH. It is safe to strip off the ESP/AH header on packet
reception, since we will never use the received packet in
"as is" form.By using ESP/AH, TCP4/6 effective data segment size will
be affected by the extra daisy-chained headers inserted by
ESP/AH. Our code takes care of this case.Basic crypto functions can be found in the directory
"sys/crypto". ESP/AH transform are listed in
{esp,ah}_core.c with wrapper functions. If you wish to add
an algorithm, add a wrapper function in {esp,ah}_core.c, and
add your crypto algorithm code into sys/crypto.Tunnel mode is partially supported in this release, with
the following restrictions:The IPsec tunnel is not combined with the GIF generic
tunneling interface. This needs great care, because we
may create an infinite loop between ip_output() and
tunnelifp->if_output(). Opinions vary as to whether it is
better to unify them or not.MTU and Don't Fragment bit (IPv4) considerations
need more checking, but it basically works fine.The authentication model for the AH tunnel must be
revisited. We will need to improve the policy
management engine, eventually.Conformance to RFCs and IDsThe IPsec code in the kernel conforms (or tries to
conform) to the following standards:"old IPsec" specification documented in
rfc182[5-9].txt"new IPsec" specification documented in
rfc240[1-6].txt,
rfc241[01].txt,
rfc2451.txt and
draft-mcdonald-simple-ipsec-api-01.txt
(draft expired, but you can take from
ftp://ftp.kame.net/pub/internet-drafts/). (NOTE:
the IKE specifications, rfc241[7-9].txt, are
implemented in userland, as the "racoon" IKE daemon.)Currently supported algorithms are:old IPsec AHnull crypto checksum (no document, just for
debugging)keyed MD5 with 128bit crypto checksum
(rfc1828.txt)keyed SHA1 with 128bit crypto checksum (no
document)HMAC MD5 with 128bit crypto checksum
(rfc2085.txt)HMAC SHA1 with 128bit crypto checksum (no
document)old IPsec ESPnull encryption (no document, similar to
rfc2410.txt)DES-CBC mode
(rfc1829.txt)new IPsec AHnull crypto checksum (no document, just for
debugging)keyed MD5 with 96bit crypto checksum (no
document)keyed SHA1 with 96bit crypto checksum (no
document)HMAC MD5 with 96bit crypto checksum
(rfc2403.txt)HMAC SHA1 with 96bit crypto checksum
(rfc2404.txt)new IPsec ESPnull encryption
(rfc2410.txt)DES-CBC with derived IV
(draft-ietf-ipsec-ciph-des-derived-01.txt,
draft expired)DES-CBC with explicit IV
(rfc2405.txt)3DES-CBC with explicit IV
(rfc2451.txt)BLOWFISH CBC
(rfc2451.txt)CAST128 CBC
(rfc2451.txt)RC5 CBC
(rfc2451.txt)each of the above can be combined with:ESP authentication with
HMAC-MD5(96bit)ESP authentication with
HMAC-SHA1(96bit)The following algorithms are NOT supported:old IPsec AHHMAC MD5 with 128bit crypto checksum + 64bit
replay prevention
(rfc2085.txt)keyed SHA1 with 160bit crypto checksum + 32bit
padding (rfc1852.txt)IPsec (in the kernel) and IKE (in userland as "racoon") have
been tested at several interoperability test events, and they
are known to interoperate well with many other
implementations. Also, the current IPsec implementation has quite wide
coverage of the IPsec crypto algorithms documented in RFCs (we
cover only algorithms without intellectual property
issues).ECN Consideration on IPsec TunnelsECN-friendly IPsec tunnels are supported as described in
draft-ipsec-ecn-00.txt.The normal IPsec tunnel is described in RFC2401. On
encapsulation, the IPv4 TOS field (or the IPv6 traffic class field)
will be copied from the inner IP header to the outer IP header. On
decapsulation, the outer IP header will simply be dropped. This
decapsulation rule is not compatible with ECN, since the ECN bit
in the outer IP TOS/traffic class field will be lost.To make the IPsec tunnel ECN-friendly, we should modify the
encapsulation and decapsulation procedure. This is
described in
http://www.aciri.org/floyd/papers/draft-ipsec-ecn-00.txt,
chapter 3.The IPsec tunnel implementation can give you three
behaviors, by setting net.inet.ipsec.ecn (or
net.inet6.ipsec6.ecn) to some value:RFC2401: no consideration for ECN (sysctl value
-1)ECN forbidden (sysctl value 0)ECN allowed (sysctl value 1)Note that the behavior is configurable in a per-node
manner, not a per-SA manner (draft-ipsec-ecn-00 wants per-SA
configuration, but it looks like too much to me).The behavior is summarized as follows (see the source code
for more detail):encapsulate decapsulate
--- ---
RFC2401 copy all TOS bits drop TOS bits on outer
from inner to outer. (use inner TOS bits as is)
ECN forbidden copy TOS bits except for ECN drop TOS bits on outer
(masked with 0xfc) from inner (use inner TOS bits as is)
to outer. set ECN bits to 0.
ECN allowed copy TOS bits except for ECN use inner TOS bits with some
CE (masked with 0xfe) from change. if outer ECN CE bit
inner to outer. is 1, enable ECN CE bit on
set ECN CE bit to 0. the inner.The general strategy for configuration is as follows:if both IPsec tunnel endpoints are capable of
ECN-friendly behavior, you had better configure both
ends to "ECN allowed" (sysctl value
1).if the other end is very strict about the TOS bit, use
"RFC2401" (sysctl value -1).in other cases, use "ECN forbidden" (sysctl value
0).The default behavior is "ECN forbidden" (sysctl value
0).For more information, please refer to:
http://www.aciri.org/floyd/papers/draft-ipsec-ecn-00.txt,
RFC2481 (Explicit Congestion Notification),
src/sys/netinet6/{ah,esp}_input.c(Thanks goes to Kenjiro Cho
kjc@csl.sony.co.jp for detailed
analysis)InteroperabilityHere are (some of) the platforms against which the KAME code has
tested IPsec/IKE interoperability in the past. Note that both ends
may have modified their implementations, so use the following
list just for reference purposes.Altiga, Ashley-laurent (vpcom.com), Data Fellows
(F-Secure), Ericsson ACC, FreeS/WAN, HITACHI, IBM &aix;,
IIJ, Intel, µsoft; &windowsnt;, NIST (linux IPsec +
plutoplus), Netscreen, OpenBSD, RedCreek, Routerware, SSH,
Secure Computing, Soliton, Toshiba, VPNet, Yamaha
RT100i
diff --git a/en_US.ISO8859-1/books/developers-handbook/kerneldebug/chapter.xml b/en_US.ISO8859-1/books/developers-handbook/kerneldebug/chapter.xml
index 8fd50f0d68..793d728368 100644
--- a/en_US.ISO8859-1/books/developers-handbook/kerneldebug/chapter.xml
+++ b/en_US.ISO8859-1/books/developers-handbook/kerneldebug/chapter.xml
@@ -1,1075 +1,1075 @@
Kernel DebuggingPaulRichardsContributed by JörgWunschRobertWatsonObtaining a Kernel Crash DumpWhen running a development kernel (e.g., &os.current;), running a
kernel under extreme conditions (e.g., very high load averages,
tens of thousands of connections, an exceedingly high number of
concurrent users, hundreds of &man.jail.8;s, etc.), or using a
new feature or device driver on &os.stable; (e.g.,
PAE), sometimes a kernel will panic. In the
event that it does, this chapter will demonstrate how to extract
useful information out of a crash.A system reboot is inevitable once a kernel panics. Once a
system is rebooted, the contents of the system's physical memory
(RAM) are lost, as well as any bits that were
on the swap device before the panic. To preserve the bits in
physical memory, the kernel makes use of the swap device as a
temporary place to store the bits that are in RAM across a
reboot after a crash. In doing this, when &os; boots after a
crash, a kernel image can now be extracted and debugging can
take place.A swap device that has been configured as a dump
device still acts as a swap device. Dumps to non-swap devices
(such as tapes or CDRWs, for example) are not supported at this time. A
swap device is synonymous with a swap
partition.Several types of kernel crash dumps are available:Full memory dumpsHold the complete contents of physical
memory.MinidumpsHold only memory pages in use by the kernel
(&os; 6.2 and higher).TextdumpsHold captured, scripted, or interactive debugger
output (&os; 7.1 and higher).Minidumps are the default dump type as of &os; 7.0,
and in most cases will capture all necessary information
present in a full memory dump, as most problems can be
isolated only using kernel state.Configuring the Dump DeviceBefore the kernel will dump the contents of its physical
memory to a dump device, a dump device must be configured. A
dump device is specified by using the &man.dumpon.8; command
to tell the kernel where to save kernel crash dumps. The
&man.dumpon.8; program must be called after the swap partition
has been configured with &man.swapon.8;. This is normally
handled by setting the dumpdev variable in
&man.rc.conf.5; to the path of the swap device (the
recommended way to extract a kernel dump) or
AUTO to use the first configured swap
device. The default for dumpdev is
AUTO in HEAD, and was changed to
NO on RELENG_* branches (except for RELENG_7,
which was left set to AUTO).
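In &man.rc.conf.5; this typically looks like the following sketch (the explicit device path is hypothetical; AUTO works in most cases):

```sh
# /etc/rc.conf
dumpdev="AUTO"          # or an explicit swap device such as /dev/ad0s1b
dumpdir="/var/crash"    # where savecore(8) will place the extracted dump
```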
On &os; 9.0-RELEASE and later versions,
bsdinstall will ask whether crash dumps
should be enabled on the target system during the install process.Check /etc/fstab or
&man.swapinfo.8; for a list of swap devices.Make sure the dumpdir
specified in &man.rc.conf.5; exists before a kernel
crash!&prompt.root; mkdir /var/crash
&prompt.root; chmod 700 /var/crashAlso, remember that the contents of
/var/crash are sensitive and very likely
contain confidential information such as passwords.Extracting a Kernel DumpOnce a dump has been written to a dump device, the dump
must be extracted before the swap device is mounted.
To extract a dump
from a dump device, use the &man.savecore.8; program. If
dumpdev has been set in &man.rc.conf.5;,
&man.savecore.8; will be called automatically on the first
multi-user boot after the crash and before the swap device
is mounted. The extracted core is placed in the directory named by
the &man.rc.conf.5; value dumpdir, by
default /var/crash, and will be named
vmcore.0.In the event that there is already a file called
vmcore.0 in
/var/crash (or whatever
dumpdir is set to), &man.savecore.8; will
increment the trailing number for every crash to avoid
overwriting an existing vmcore (e.g.,
vmcore.1). &man.savecore.8; will always
create a symbolic link named vmcore.last
in /var/crash after a dump is saved.
This symbolic link can be used to locate the name of the most
recent dump.The &man.crashinfo.8; utility generates a text file
containing a summary of information from a full memory dump
or minidump. If dumpdev has been set in
&man.rc.conf.5;, &man.crashinfo.8; will be invoked
automatically after &man.savecore.8;. The output is saved
to a file in dumpdir named
core.txt.N.If you are testing a new kernel but need to boot a different one in
order to get your system up and running again, boot it only into single
user mode using the flag at the boot prompt, and
then perform the following steps:&prompt.root; fsck -p
&prompt.root; mount -a -t ufs # make sure /var/crash is writable
&prompt.root; savecore /var/crash /dev/ad0s1b
&prompt.root; exit # exit to multi-userThis instructs &man.savecore.8; to extract a kernel dump
from /dev/ad0s1b and place the contents in
/var/crash. Do not forget to make sure the
destination directory /var/crash has enough
space for the dump. Also, do not forget to specify the correct path to your swap
device, as it is likely different from
/dev/ad0s1b!Testing Kernel Dump ConfigurationThe kernel includes a &man.sysctl.8; node that requests a
kernel panic. This can be used to verify that your system is
properly configured to save kernel crash dumps. You may wish
to remount existing file systems as read-only in single user
mode before triggering the crash to avoid data loss.&prompt.root; shutdown now
...
Enter full pathname of shell or RETURN for /bin/sh:
&prompt.root; mount -a -u -r
&prompt.root; sysctl debug.kdb.panic=1
debug.kdb.panic:panic: kdb_sysctl_panic
...After rebooting, your system should save a dump in
/var/crash along with a matching summary
from &man.crashinfo.8;.Debugging a Kernel Crash Dump with kgdbThis section covers &man.kgdb.1;. The latest version is
included in the devel/gdb package. An older version
is also present in &os; 11 and earlier.To enter into the debugger and begin getting information
from the dump, start kgdb:&prompt.root; kgdb -n NWhere N is the suffix of the
vmcore.N to
examine. To open the most recent dump use:&prompt.root; kgdb -n lastNormally, &man.kgdb.1; should be able to locate the kernel
running at the time the dump was generated. If it is not able to
locate the correct kernel, pass the pathname of the kernel and
dump as two arguments to kgdb:&prompt.root; kgdb /boot/kernel/kernel /var/crash/vmcore.0You can debug the crash dump using the kernel sources just like
you can for any other program.This dump is from a 5.2-BETA kernel and the crash
comes from deep within the kernel. The output below has been
modified to include line numbers on the left. This first trace
inspects the instruction pointer and obtains a back trace. The
address that is used on line 41 for the list
command is the instruction pointer and can be found on line
17. Most developers will request having at least this
information sent to them if you are unable to debug the problem
yourself. If, however, you do solve the problem, make sure that
your patch winds its way into the source tree via a problem
report, the mailing lists, or by committing it yourself! 1:&prompt.root; cd /usr/obj/usr/src/sys/KERNCONF
2:&prompt.root; kgdb kernel.debug /var/crash/vmcore.0
3:GNU gdb 5.2.1 (FreeBSD)
4:Copyright 2002 Free Software Foundation, Inc.
5:GDB is free software, covered by the GNU General Public License, and you are
6:welcome to change it and/or distribute copies of it under certain conditions.
7:Type "show copying" to see the conditions.
8:There is absolutely no warranty for GDB. Type "show warranty" for details.
9:This GDB was configured as "i386-undermydesk-freebsd"...
10:panic: page fault
11:panic messages:
12:---
13:Fatal trap 12: page fault while in kernel mode
14:cpuid = 0; apic id = 00
15:fault virtual address = 0x300
16:fault code: = supervisor read, page not present
17:instruction pointer = 0x8:0xc0713860
18:stack pointer = 0x10:0xdc1d0b70
19:frame pointer = 0x10:0xdc1d0b7c
20:code segment = base 0x0, limit 0xfffff, type 0x1b
21: = DPL 0, pres 1, def32 1, gran 1
22:processor eflags = resume, IOPL = 0
23:current process = 14394 (uname)
24:trap number = 12
25:panic: page fault
26 cpuid = 0;
27:Stack backtrace:
28
29:syncing disks, buffers remaining... 2199 2199 panic: mi_switch: switch in a critical section
30:cpuid = 0;
31:Uptime: 2h43m19s
32:Dumping 255 MB
33: 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240
34:---
35:Reading symbols from /boot/kernel/snd_maestro3.ko...done.
36:Loaded symbols for /boot/kernel/snd_maestro3.ko
37:Reading symbols from /boot/kernel/snd_pcm.ko...done.
38:Loaded symbols for /boot/kernel/snd_pcm.ko
39:#0 doadump () at /usr/src/sys/kern/kern_shutdown.c:240
40:240 dumping++;
41:(kgdb)list *0xc0713860
42:0xc0713860 is in lapic_ipi_wait (/usr/src/sys/i386/i386/local_apic.c:663).
43:658 incr = 0;
44:659 delay = 1;
45:660 } else
46:661 incr = 1;
47:662 for (x = 0; x < delay; x += incr) {
48:663 if ((lapic->icr_lo & APIC_DELSTAT_MASK) == APIC_DELSTAT_IDLE)
49:664 return (1);
50:665 ia32_pause();
51:666 }
52:667 return (0);
53:(kgdb)backtrace
54:#0 doadump () at /usr/src/sys/kern/kern_shutdown.c:240
55:#1 0xc055fd9b in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:372
56:#2 0xc056019d in panic () at /usr/src/sys/kern/kern_shutdown.c:550
57:#3 0xc0567ef5 in mi_switch () at /usr/src/sys/kern/kern_synch.c:470
58:#4 0xc055fa87 in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:312
59:#5 0xc056019d in panic () at /usr/src/sys/kern/kern_shutdown.c:550
60:#6 0xc0720c66 in trap_fatal (frame=0xdc1d0b30, eva=0)
61: at /usr/src/sys/i386/i386/trap.c:821
62:#7 0xc07202b3 in trap (frame=
63: {tf_fs = -1065484264, tf_es = -1065484272, tf_ds = -1065484272, tf_edi = 1, tf_esi = 0, tf_ebp = -602076292, tf_isp = -602076324, tf_ebx = 0, tf_edx = 0, tf_ecx = 1000000, tf_eax = 243, tf_trapno = 12, tf_err = 0, tf_eip = -1066321824, tf_cs = 8, tf_eflags = 65671, tf_esp = 243, tf_ss = 0})
64: at /usr/src/sys/i386/i386/trap.c:250
65:#8 0xc070c9f8 in calltrap () at {standard input}:94
66:#9 0xc07139f3 in lapic_ipi_vectored (vector=0, dest=0)
67: at /usr/src/sys/i386/i386/local_apic.c:733
68:#10 0xc0718b23 in ipi_selected (cpus=1, ipi=1)
69: at /usr/src/sys/i386/i386/mp_machdep.c:1115
70:#11 0xc057473e in kseq_notify (ke=0xcc05e360, cpu=0)
71: at /usr/src/sys/kern/sched_ule.c:520
72:#12 0xc0575cad in sched_add (td=0xcbcf5c80)
73: at /usr/src/sys/kern/sched_ule.c:1366
74:#13 0xc05666c6 in setrunqueue (td=0xcc05e360)
75: at /usr/src/sys/kern/kern_switch.c:422
76:#14 0xc05752f4 in sched_wakeup (td=0xcbcf5c80)
77: at /usr/src/sys/kern/sched_ule.c:999
78:#15 0xc056816c in setrunnable (td=0xcbcf5c80)
79: at /usr/src/sys/kern/kern_synch.c:570
80:#16 0xc0567d53 in wakeup (ident=0xcbcf5c80)
81: at /usr/src/sys/kern/kern_synch.c:411
82:#17 0xc05490a8 in exit1 (td=0xcbcf5b40, rv=0)
83: at /usr/src/sys/kern/kern_exit.c:509
84:#18 0xc0548011 in sys_exit () at /usr/src/sys/kern/kern_exit.c:102
85:#19 0xc0720fd0 in syscall (frame=
86: {tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 0, tf_esi = -1, tf_ebp = -1077940712, tf_isp = -602075788, tf_ebx = 672411944, tf_edx = 10, tf_ecx = 672411600, tf_eax = 1, tf_trapno = 12, tf_err = 2, tf_eip = 671899563, tf_cs = 31, tf_eflags = 642, tf_esp = -1077940740, tf_ss = 47})
87: at /usr/src/sys/i386/i386/trap.c:1010
88:#20 0xc070ca4d in Xint0x80_syscall () at {standard input}:136
89:---Can't read userspace from dump, or kernel process---
90:(kgdb)quitIf your system is crashing regularly and you are running
out of disk space, deleting old vmcore
files in /var/crash could save a
considerable amount of disk space!On-Line Kernel Debugging Using DDBWhile kgdb as an off-line debugger provides a very
high-level user interface, there are some things it cannot do. The
most important are setting breakpoints and single-stepping kernel
code.If you need to do low-level debugging on your kernel, there is an
on-line debugger available called DDB. It allows setting of
breakpoints, single-stepping kernel functions, examining and changing
kernel variables, etc. However, it cannot access kernel source files,
and only has access to the global and static symbols, not to the full
debug information like kgdb does.To configure your kernel to include DDB, add the options
options KDBoptions DDB
to your config file, and rebuild. (See The FreeBSD Handbook for details on
configuring the FreeBSD kernel).Once your DDB kernel is running, there are several ways to enter
DDB. The first and earliest way is to use the boot flag
. The kernel will start up
in debug mode and enter DDB prior to any device probing. Hence you can
even debug the device probe/attach functions. To use this, exit
the loader's boot menu and enter boot -d at
the loader prompt.The second scenario is to drop to the debugger once the
system has booted. There are two simple ways to accomplish
this. If you would like to break to the debugger from the
command prompt, simply type the command:&prompt.root; sysctl debug.kdb.enter=1Alternatively, if you are at the system console, you may use
a hot-key on the keyboard. The default break-to-debugger
sequence is CtrlAltESC. For
syscons, this sequence can be remapped and some of the
distributed maps out there do this, so check to make sure you
know the right sequence to use. There is an option available
for serial consoles that allows the use of a serial line BREAK on the
console line to enter DDB (options BREAK_TO_DEBUGGER
in the kernel config file). It is not the default since there are a lot
of serial adapters around that gratuitously generate a BREAK
condition, for example when pulling the cable.The third way is that any panic condition will branch to DDB if the
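As a sketch, a kernel configuration that allows entering DDB via a serial BREAK might contain the following (see the glossary at the end of this chapter for the trade-offs of each option):

```
# Hypothetical kernel configuration fragment for break-to-debugger entry
options KDB
options DDB
options BREAK_TO_DEBUGGER
```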
kernel is configured to use it. For this reason, it is not wise to
configure a kernel with DDB for a machine running unattended.To obtain the unattended functionality, add:options KDB_UNATTENDEDto the kernel configuration file and rebuild/reinstall.The DDB commands roughly resemble some gdb
commands. The first thing you probably need to do is to set a
breakpoint:break function-name addressNumbers are interpreted as hexadecimal by default, but to make them distinct
from symbol names, hexadecimal numbers starting with the letters
a-f need to be prefixed with 0x
(this is optional for other numbers). Simple expressions are allowed,
for example: function-name + 0x103.To exit the debugger and continue execution,
type:continueTo get a stack trace of the current thread, use:traceTo get a stack trace of an arbitrary thread, specify a
process ID or thread ID as a second argument to
trace.If you want to remove a breakpoint, usedeldel address-expressionThe first form will be accepted immediately after a breakpoint hit,
and deletes the current breakpoint. The second form can remove any
breakpoint, but you need to specify the exact address; this can be
obtained from:show bor:show breakTo single-step the kernel, try:sThis will step into functions, but you can make DDB trace them until
the matching return statement is reached by:nThis is different from gdb's
next statement; it is like gdb's
finish. Pressing n more than once
will cause a continue.To examine data from memory, use (for example):
x/wx 0xf0133fe0,40x/hd db_symtab_spacex/bc termbuf,10x/s stringbuf
for word/halfword/byte access, and hexadecimal/decimal/character/ string
display. The number after the comma is the object count. To display
the next 0x10 items, simply use:x ,10Similarly, use
x/ia foofunc,10
to disassemble the first 0x10 instructions of
foofunc, and display them along with their offset
from the beginning of foofunc.To modify memory, use the write command:w/b termbuf 0xa 0xb 0w/w 0xf0010030 0 0The command modifier
(b/h/w)
specifies the size of the data to be written, the first following
expression is the address to write to and the remainder is interpreted
as data to write to successive memory locations.If you need to know the current registers, use:show regAlternatively, you can display a single register value by e.g.
p $eax
and modify it by:set $eax new-valueShould you need to call some kernel functions from DDB, simply
say:call func(arg1, arg2, ...)The return value will be printed.For a &man.ps.1; style summary of all running processes, use:psNow you have examined why your kernel failed, and you wish to
reboot. Remember that, depending on the severity of previous
malfunctioning, not all parts of the kernel might still be working as
expected. Perform one of the following actions to shut down and reboot
your system:panicThis will cause your kernel to dump core and reboot, so you can
later analyze the core on a higher level with &man.kgdb.1;.call boot(0)Might be a good way to cleanly shut down the running system,
sync() all disks, and finally, in some cases,
reboot. As long as
the disk and filesystem interfaces of the kernel are not damaged, this
could be a good way for an almost clean shutdown.resetThis is the final way out of disaster and almost the same as hitting the
Big Red Button.If you need a short command summary, simply type:helpIt is highly recommended to have a printed copy of the
&man.ddb.4; manual page ready for a debugging
session. Remember that it is hard to read the on-line manual while
single-stepping the kernel.On-Line Kernel Debugging Using Remote GDBThis feature has been supported since FreeBSD 2.2, and it is
actually a very neat one.GDB has already supported remote debugging for
a long time. This is done using a very simple protocol along a serial
line. Unlike the other methods described above, you will need two
machines for doing this. One is the host providing the debugging
environment, including all the sources, and a copy of the kernel binary
with all the symbols in it, and the other one is the target machine that
simply runs a similar copy of the very same kernel (but stripped of the
debugging information).You should configure the kernel in question with config
-g if building the traditional way. If
building the new way, make sure that
makeoptions DEBUG=-g is in the configuration.
In both cases, include in the configuration, and
compile it as usual. This gives a large binary, due to the
debugging information. Copy this kernel to the target machine, strip
the debugging symbols off with strip -x, and boot it
using the boot option. Connect the serial line
of the target machine that has "flags 080" set on its uart device
to any serial line of the debugging host. See &man.uart.4; for
information on how to set the flags on a uart device.
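The build requirements just described might, as a sketch, correspond to a configuration fragment like the following (the exact set of lines depends on whether you build the traditional or the new way):

```
# Hypothetical target kernel configuration fragment for remote GDB
makeoptions     DEBUG=-g        # keep debug symbols in the kernel
options         KDB             # kernel debugger framework
options         GDB             # remote GDB backend
```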
Now, on the debugging machine, go to the compile directory of the target
kernel, and start gdb:&prompt.user; kgdb kernel
GDB is free software and you are welcome to distribute copies of it
under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.16 (i386-unknown-freebsd),
Copyright 1996 Free Software Foundation, Inc...
(kgdb)Initialize the remote debugging session (assuming the first serial
port is being used) by:(kgdb)target remote /dev/cuau0Now, on the target host (the one that entered DDB right before even
starting the device probe), type:Debugger("Boot flags requested debugger")
Stopped at Debugger+0x35: movb $0, edata+0x51bc
db>gdbDDB will respond with:Next trap will enter GDB remote protocol modeEvery time you type gdb, the mode will be toggled
between remote GDB and local DDB. In order to force a next trap
immediately, simply type s (step). Your hosting GDB
will now gain control over the target kernel:Remote debugging using /dev/cuau0
Debugger (msg=0xf01b0383 "Boot flags requested debugger")
at ../../i386/i386/db_interface.c:257
(kgdb)You can use this session almost as any other GDB session, including
full access to the source, running it in gud-mode inside an Emacs window
(which gives you an automatic source code display in another Emacs
window), etc.Debugging a Console DriverSince you need a console driver to run DDB on, things are more
complicated if the console driver itself is failing. You might remember
the use of a serial console (either with modified boot blocks, or by
specifying at the Boot: prompt),
and hook up a standard terminal onto your first serial port. DDB works
on any configured console driver, including a serial
console.Debugging DeadlocksYou may experience so-called deadlocks, situations where
a system stops doing useful work. To provide a helpful bug
report in this situation, use &man.ddb.4; as described in the
previous section. Include the output of ps
and trace for suspected processes in the
report.If possible, consider doing further investigation. The
recipe below is especially useful if you suspect that a deadlock
occurs in the VFS layer. Add these options to the kernel
configuration file.makeoptions DEBUG=-g
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options WITNESS_SKIPSPIN
options DEBUG_LOCKS
options DEBUG_VFS_LOCKS
options DIAGNOSTICWhen a deadlock occurs, in addition to the output of the
ps command, provide information from the
show pcpu, show allpcpu,
show locks, show alllocks,
show lockedvnods and
alltrace.To obtain meaningful backtraces for threaded processes, use
thread thread-id to switch to the thread
stack, and do a backtrace with where.Kernel debugging with Dcons&man.dcons.4; is a very simple console driver that is
not directly connected with any physical devices. It just reads
and writes characters from and to a buffer in a kernel or
loader. Due to its simple nature, it is very useful for kernel
debugging, especially with a &firewire; device. Currently, &os;
provides two ways to interact with the buffer from outside of
the kernel using &man.dconschat.8;.Dcons over &firewire;Most &firewire; (IEEE1394) host controllers are
based on the OHCI specification, which
supports physical access to the host memory. This means that
once the host controller is initialized, we can access the
host memory without the help of software (kernel). We can
exploit this facility for interaction with &man.dcons.4;.
&man.dcons.4; provides similar functionality as a serial
console. It emulates two serial ports, one for the console
and DDB, the other for
- GDB. Because remote memory access is fully
+ GDB. Since remote memory access is fully
handled by the hardware, the &man.dcons.4; buffer is
accessible even when the system crashes.&firewire; devices are not limited to those
integrated into motherboards. PCI cards
exist for desktops, and a cardbus interface can be purchased
for laptops.Enabling &firewire; and Dcons support on the target
machineTo enable &firewire; and Dcons support in the kernel of
the target machine:Make sure your kernel supports
dcons, dcons_crom
and firewire.
Dcons should be statically linked
with the kernel. For dcons_crom and
firewire, modules should be
OK.Make sure physical DMA is enabled.
You may need to add
hw.firewire.phydma_enable=1 to
/boot/loader.conf.Add options for debugging.Add dcons_gdb=1 in
/boot/loader.conf if you use GDB
over &firewire;.Enable dcons in
/etc/ttys.Optionally, to force dcons to
be the high-level console, add
hw.firewire.dcons_crom.force_console=1
to loader.conf.To enable &firewire; and Dcons support in &man.loader.8;
on i386 or amd64:Add
LOADER_FIREWIRE_SUPPORT=YES in
/etc/make.conf and rebuild
&man.loader.8;:&prompt.root; cd /sys/boot/i386 && make clean && make && make installTo enable &man.dcons.4; as an active low-level
console, add boot_multicons="YES" to
/boot/loader.conf.Here are a few configuration examples. A sample kernel
configuration file would contain:device dcons
device dcons_crom
options KDB
options DDB
options GDB
options ALT_BREAK_TO_DEBUGGERAnd a sample /boot/loader.conf
would contain:dcons_crom_load="YES"
dcons_gdb=1
boot_multicons="YES"
hw.firewire.phydma_enable=1
hw.firewire.dcons_crom.force_console=1Enabling &firewire; and Dcons support on the host
machineTo enable &firewire; support in the kernel on the
host machine:&prompt.root; kldload firewireFind out the EUI64 (the unique 64
bit identifier) of the &firewire; host controller, and
use &man.fwcontrol.8; or dmesg to
find the EUI64 of the target machine.Run &man.dconschat.8;, with:&prompt.root; dconschat -e \# -br -G 12345 -t 00-11-22-33-44-55-66-77The following key combinations can be used once
&man.dconschat.8; is running:~.Disconnect~CtrlBALT BREAK~CtrlRRESET target~CtrlZSuspend dconschatAttach remote GDB by starting
&man.kgdb.1; with a remote debugging session:kgdb -r :12345 kernelSome general tipsHere are some general tips:To take full advantage of the speed of &firewire;,
disable other slow console drivers:&prompt.root; conscontrol delete ttyd0 # serial console
&prompt.root; conscontrol delete consolectl # video/keyboardThere exists a GDB mode for
&man.emacs.1;; this is what you will need to add to your
.emacs:(setq gud-gdba-command-name "kgdb -a -a -a -r :12345")
(setq gdb-many-windows t)
(xterm-mouse-mode 1)
M-x gdbaAnd for DDD (devel/ddd):# remote serial protocol
LANG=C ddd --debugger kgdb -r :12345 kernel
# live core debug
LANG=C ddd --debugger kgdb kernel /dev/fwmem0.2Dcons with KVMWe can directly read the &man.dcons.4; buffer via
/dev/mem for live systems, and in the
core dump for crashed systems. These give you similar output
to dmesg -a, but the &man.dcons.4; buffer
includes more information.Using Dcons with KVMTo use &man.dcons.4; with KVM:Dump a &man.dcons.4; buffer of a live system:&prompt.root; dconschat -1Dump a &man.dcons.4; buffer of a crash dump:&prompt.root; dconschat -1 -M vmcore.XXLive core debugging can be done via:&prompt.root; fwcontrol -m target_eui64
&prompt.root; kgdb kernel /dev/fwmem0.2Glossary of Kernel Options for DebuggingThis section provides a brief glossary of compile-time kernel
options used for debugging:options KDB: compiles in the kernel
debugger framework. Required for options DDB
and options GDB. Little or no performance
overhead. By default, the debugger will be entered on panic
instead of an automatic reboot.options KDB_UNATTENDED: change the default
value of the debug.debugger_on_panic sysctl to
0, which controls whether the debugger is entered on panic. When
options KDB is not compiled into the kernel, the
behavior is to automatically reboot on panic; when it is compiled
into the kernel, the default behavior is to drop into the debugger
unless options KDB_UNATTENDED is compiled in.
If you want to leave the kernel debugger compiled into the kernel
but want the system to come back up unless you're on-hand to use
the debugger for diagnostics, use this option.options KDB_TRACE: change the default value
of the debug.trace_on_panic sysctl to 1, which
controls whether the debugger automatically prints a stack trace
on panic. Especially if running with options
KDB_UNATTENDED, this can be helpful to gather basic
debugging information on the serial or firewire console while
still rebooting to recover.options DDB: compile in support for the
console debugger, DDB. This interactive debugger runs on whatever
the active low-level console of the system is, which includes the
video console, serial console, or firewire console. It provides
basic integrated debugging facilities, such as stack tracing,
process and thread listing, dumping of lock state, VM state, file
system state, and kernel memory management. DDB does not require
software running on a second machine or being able to generate a
core dump or full debugging kernel symbols, and provides detailed
diagnostics of the kernel at run-time. Many bugs can be fully
diagnosed using only DDB output. This option depends on
options KDB.options GDB: compile in support for the
remote debugger, GDB, which can operate over serial cable or
firewire. When the debugger is entered, GDB may be attached to
inspect structure contents, generate stack traces, etc. Some
kernel state is more awkward to access than in DDB, which is able
to generate useful summaries of kernel state automatically, such
as automatically walking lock debugging or kernel memory
management structures, and a second machine running the debugger
is required. On the other hand, GDB combines information from
the kernel source and full debugging symbols, and is aware of full
data structure definitions, local variables, and is scriptable.
This option is not required to run GDB on a kernel core dump.
This option depends on options KDB.
options BREAK_TO_DEBUGGER, options
ALT_BREAK_TO_DEBUGGER: allow a break signal or
alternative signal on the console to enter the debugger. If the
system hangs without a panic, this is a useful way to reach the
debugger. Due to the current kernel locking, a break signal
generated on a serial console is significantly more reliable at
getting into the debugger, and is generally recommended. This
option has little or no performance impact.options INVARIANTS: compile into the kernel
a large number of run-time assertion checks and tests, which
constantly test the integrity of kernel data structures and the
invariants of kernel algorithms. These tests can be expensive, so
are not compiled in by default, but help provide useful "fail stop"
behavior, in which certain classes of undesired behavior enter the
debugger before kernel data corruption occurs, making them easier
to debug. Tests include memory scrubbing and use-after-free
testing, which is one of the more significant sources of overhead.
This option depends on options INVARIANT_SUPPORT.
options INVARIANT_SUPPORT: many of the tests
present in options INVARIANTS require modified
data structures or additional kernel symbols to be defined.options WITNESS: this option enables run-time
lock order tracking and verification, and is an invaluable tool for
deadlock diagnosis. WITNESS maintains a graph of acquired lock
orders by lock type, and checks the graph at each acquire for
cycles (implicit or explicit). If a cycle is detected, a warning
and stack trace are generated to the console, indicating that a
potential deadlock might have occurred. WITNESS is required in
order to use the show locks, show
witness and show alllocks DDB
commands. This debug option has significant performance overhead,
which may be somewhat mitigated through the use of options
WITNESS_SKIPSPIN. Detailed documentation may be found in
&man.witness.4;.options WITNESS_SKIPSPIN: disable run-time
checking of spinlock lock order with WITNESS. As spin locks are
acquired most frequently in the scheduler, and scheduler events
occur often, this option can significantly speed up systems
running with WITNESS. This option depends on options
WITNESS.options WITNESS_KDB: change the default
value of the debug.witness.kdb sysctl to 1,
which causes WITNESS to enter the debugger when a lock order
violation is detected, rather than simply printing a warning. This
option depends on options WITNESS.options SOCKBUF_DEBUG: perform extensive
run-time consistency checking on socket buffers, which can be
useful for debugging both socket bugs and race conditions in
protocols and device drivers that interact with sockets. This
option significantly impacts network performance, and may change
the timing in device driver races.options DEBUG_VFS_LOCKS: track lock
acquisition points for lockmgr/vnode locks, expanding the amount
of information displayed by show lockedvnods
in DDB. This option has a measurable performance impact.options DEBUG_MEMGUARD: a replacement for
the &man.malloc.9; kernel memory allocator that uses the VM system
to detect reads or writes from allocated memory after free.
Details may be found in &man.memguard.9;. This option has a
significant performance impact, but can be very helpful in
debugging kernel memory corruption bugs.options DIAGNOSTIC: enable additional, more
expensive diagnostic tests along the lines of options
INVARIANTS.
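Pulling several of these options together, a debugging-oriented kernel configuration might, as a sketch, contain the following (which options to enable should be weighed against the performance notes above):

```
# Hypothetical kernel configuration fragment combining common debug options
makeoptions     DEBUG=-g            # debug symbols for kgdb
options         KDB                 # debugger framework
options         DDB                 # in-kernel console debugger
options         GDB                 # remote GDB backend
options         INVARIANTS          # run-time assertion checks
options         INVARIANT_SUPPORT   # required by INVARIANTS
options         WITNESS             # lock order verification
options         WITNESS_SKIPSPIN    # reduce WITNESS overhead
```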
diff --git a/en_US.ISO8859-1/books/developers-handbook/secure/chapter.xml b/en_US.ISO8859-1/books/developers-handbook/secure/chapter.xml
index 8e00147c82..02b062f20c 100644
--- a/en_US.ISO8859-1/books/developers-handbook/secure/chapter.xml
+++ b/en_US.ISO8859-1/books/developers-handbook/secure/chapter.xml
@@ -1,500 +1,500 @@
Secure ProgrammingMurrayStokelyContributed by SynopsisThis chapter describes some of the security issues that
have plagued &unix; programmers for decades and some of the new
tools available to help programmers avoid writing exploitable
code.Secure Design
MethodologyWriting secure applications takes a very scrupulous and
pessimistic outlook on life. Applications should be run with
the principle of least privilege so that no
process is ever running with more than the bare minimum access
that it needs to accomplish its function. Previously tested
code should be reused whenever possible to avoid common
mistakes that others may have already fixed.One of the pitfalls of the &unix; environment is how easy it
is to make assumptions about the sanity of the environment.
Applications should never trust user input (in all its forms),
system resources, inter-process communication, or the timing of
events. &unix; processes do not execute synchronously so logical
operations are rarely atomic.Buffer OverflowsBuffer Overflows have been around since the very
beginnings of the von Neumann architecture.
buffer overflowvon Neumann
They first gained widespread notoriety in 1988 with the Morris
Internet worm. Unfortunately, the same basic attack remains
Morris Internet worm
effective today.
By far the most common type of buffer overflow attack is based
on corrupting the stack.stackargumentsMost modern computer systems use a stack to pass arguments
to procedures and to store local variables. A stack is a last
in first out (LIFO) buffer in the high memory area of a process
image. When a program invokes a function a new "stack frame" is
LIFOprocess imagestack pointer
created. This stack frame consists of the arguments passed to
the function as well as a dynamic amount of local variable
space. The "stack pointer" is a register that holds the current
stack framestack pointer
location of the top of the stack. Since this value is
constantly changing as new values are pushed onto the top of the
stack, many implementations also provide a "frame pointer" that
is located near the beginning of a stack frame so that local
variables can more easily be addressed relative to this
value. The return address for function
frame pointerprocess imageframe pointerreturn addressstack-overflow
calls is also stored on the stack, and this is the cause of
stack-overflow exploits since overflowing a local variable in a
function can overwrite the return address of that function,
potentially allowing a malicious user to execute any code he or
she wants.Although stack-based attacks are by far the most common,
it would also be possible to overflow a buffer with a heap-based
(malloc/free) attack.The C programming language does not perform automatic
bounds checking on arrays or pointers as many other languages
do. In addition, the standard C library is filled with a
handful of very dangerous functions.strcpy(char *dest, const char
*src)May overflow the dest bufferstrcat(char *dest, const char
*src)May overflow the dest buffergetwd(char *buf)May overflow the buf buffergets(char *s)May overflow the s buffer[vf]scanf(const char *format,
...)May overflow its arguments.realpath(char *path, char
resolved_path[])May overflow the path buffer[v]sprintf(char *str, const char
*format, ...)May overflow the str buffer.Example Buffer OverflowThe following example code contains a buffer overflow
designed to overwrite the return address and skip the
instruction immediately following the function call. (Inspired
by )#include <stdio.h>
void manipulate(char *buffer) {
char newbuffer[80];
strcpy(newbuffer,buffer);
}
int main() {
char ch,buffer[4096];
int i=0;
while ((buffer[i++] = getchar()) != '\n') {};
i=1;
manipulate(buffer);
i=2;
printf("The value of i is : %d\n",i);
return 0;
}Let us examine what the memory image of this process would
look like if we were to input 160 spaces into our little program
before hitting return.[XXX figure here!]Obviously more malicious input can be devised to execute
actual compiled instructions (such as exec(/bin/sh)).Avoiding Buffer OverflowsThe most straightforward solution to the problem of
stack-overflows is to always use length-restricted memory and
string copy functions. strncpy and
strncat are part of the standard C library.
string copy functionsstrncpystring copy functionsstrncat
These functions accept a length value as a parameter which
should be no larger than the size of the destination buffer.
These functions will then copy up to `length' bytes from the
source to the destination. However, there are a number of
problems with these functions. Neither function guarantees NUL
termination if the size of the input buffer is as large as the
destination.NUL termination
The length parameter is also used inconsistently
between strncpy and strncat so it is easy for programmers to get
confused as to their proper usage. There is also a significant
performance loss compared to strcpy when
copying a short string into a large buffer since
strncpy NUL fills up the size
specified.Another memory copy implementation exists
to get around these problems. The
strlcpy and strlcat
functions guarantee that they will always NUL-terminate the
destination string when given a non-zero length argument.string copy functionsstrlcpystring copy functionsstrlcatCompiler based run-time bounds checkingbounds checkingcompiler-basedUnfortunately there is still a very large assortment of
code in public use which blindly copies memory around without
using any of the bounded copy routines we just discussed.
Fortunately, there is a way to help prevent such attacks —
run-time bounds checking, which is implemented by several
C/C++ compilers.ProPoliceStackGuardgccProPolice is one such compiler feature, and is integrated
into &man.gcc.1; versions 4.1 and later. It replaces and
extends the earlier StackGuard &man.gcc.1; extension.ProPolice helps to protect against stack-based buffer
overflows and other attacks by laying pseudo-random numbers in
key areas of the stack before calling any function. When a
function returns, these canaries are checked
and if they are found to have been changed, the executable is
immediately aborted. Thus any attempt to modify the return
address or another variable stored on the stack in order to
get malicious code to run is unlikely to succeed, as the
attacker would have to also manage to leave the pseudo-random
canaries untouched.buffer overflowRecompiling your application with ProPolice is an
effective means of stopping most buffer-overflow attacks, but
it can still be compromised.Library based run-time bounds checkingbounds checkinglibrary-basedCompiler-based mechanisms are completely useless for
binary-only software for which you cannot recompile. For
these situations there are a number of libraries which
re-implement the unsafe functions of the C-library
(strcpy, fscanf,
getwd, etc.) and ensure that these
functions can never write past the stack pointer.libsafelibverifylibparanoiaUnfortunately these library-based defenses have a number
of shortcomings. These libraries only protect against a very
small set of security related issues and they neglect to fix
the actual problem. These defenses may fail if the
application was compiled with -fomit-frame-pointer. Also, the
LD_PRELOAD and LD_LIBRARY_PATH environment variables can be
overwritten/unset by the user.SetUID issuesseteuidThere are at least 6 different IDs associated with any
- given process. Because of this you have to be very careful with
+ given process, and you must therefore be very careful with
the access that your process has at any given time. In
particular, all seteuid applications should give up their
privileges as soon as they are no longer required.user IDsreal user IDuser IDseffective user IDThe real user ID can only be changed by a superuser
process. The login program sets this
when a user initially logs in and it is seldom changed.The effective user ID is set by the
exec() functions if a program has its
seteuid bit set. An application can call
seteuid() at any time to set the effective
user ID to either the real user ID or the saved set-user-ID.
When the effective user ID is set by exec()
functions, the previous value is saved in the saved set-user-ID.Limiting your program's environmentchroot()The traditional method of restricting a process
is with the chroot() system call. This
system call changes the root directory from which all other
paths are referenced for a process and any child processes. For
this call to succeed the process must have execute (search)
permission on the directory being referenced. The new
environment does not actually take effect until you
chdir() into it. It
should also be noted that a process can easily break out of a
chroot environment if it has root privilege. This could be
accomplished by creating device nodes to read kernel memory,
attaching a debugger to a process outside of the &man.chroot.8;
environment, or in
many other creative ways.The behavior of the chroot() system
call can be controlled somewhat with the
kern.chroot_allow_open_directories sysctl
variable. When this value is set to 0,
chroot() will fail with EPERM if there are
any directories open. If set to the default value of 1, then
chroot() will fail with EPERM if there are
any directories open and the process is already subject to a
chroot() call. For any other value, the
check for open directories will be bypassed completely.FreeBSD's jail functionalityjailThe concept of a Jail extends upon the
chroot() by limiting the powers of the
superuser to create a true `virtual server'. Once a prison is
set up, all network communication must take place through the
specified IP address, and the power of "root privilege" in this
jail is severely constrained.While in a prison, any tests of superuser power within the
kernel using the suser() call will fail.
However, some calls to suser() have been
changed to a new interface suser_xxx().
This function is responsible for recognizing or denying access
to superuser power for imprisoned processes.A superuser process within a jailed environment has the
power to:Manipulate credentials with
setuid, seteuid,
setgid, setegid,
setgroups, setreuid,
setregid, setloginSet resource limits with setrlimitModify some sysctl nodes
(kern.hostname)chroot()Set flags on a vnode:
chflags,
fchflagsSet attributes of a vnode such as file
permission, owner, group, size, access time, and modification
time.Bind to privileged ports in the Internet
domain (ports < 1024)Jail is a very useful tool for
running applications in a secure environment but it does have
some shortcomings. Currently, the IPC mechanisms have not been
converted to the suser_xxx interface, so applications
such as MySQL cannot be run within a jail. Superuser access
may have a very limited meaning within a jail, but there is
no way to specify exactly what "very limited" means.&posix;.1e Process CapabilitiesPOSIX.1e Process CapabilitiesTrustedBSD&posix; has released a working draft that adds event
auditing, access control lists, fine grained privileges,
information labeling, and mandatory access control.This is a work in progress and is the focus of the TrustedBSD project. Some
of the initial work has been committed to &os.current;
(cap_set_proc(3)).TrustAn application should never assume that anything about the
user's environment is sane. This includes (but is certainly not
limited to): user input, signals, environment variables,
resources, IPC, mmaps, the filesystem working directory, file
descriptors, the number of open files, etc.positive filteringdata validationYou should never assume that you can catch all forms of
invalid input that a user might supply. Instead, your
application should use positive filtering to only allow a
specific subset of inputs that you deem safe. Improper data
validation has been the cause of many exploits, especially with
CGI scripts on the world wide web. For filenames you need to be
extra careful about paths ("../", "/"), symbolic links, and
shell escape characters.Perl Taint modePerl has a really cool feature called "Taint" mode which
can be used to prevent scripts from using data derived outside
the program in an unsafe way. This mode will check command line
arguments, environment variables, locale information, the
results of certain syscalls (readdir(),
readlink(),
getpwxxx()), and all file input.Race ConditionsA race condition is anomalous behavior caused by the
unexpected dependence on the relative timing of events. In
other words, a programmer incorrectly assumed that a particular
event would always happen before another.race conditionssignalsrace conditionsaccess checksrace conditionsfile opensSome of the common causes of race conditions are signals,
access checks, and file opens. Signals are asynchronous events
by nature, so special care must be taken in dealing with them.
Checking access with access(2) then
open(2) is clearly non-atomic. Users can
move files in between the two calls. Instead, privileged
applications should seteuid() and then call
open() directly. Along the same lines, an
application should always set a proper umask before
open() to obviate the need for spurious
chmod() calls.
diff --git a/en_US.ISO8859-1/books/developers-handbook/sockets/chapter.xml b/en_US.ISO8859-1/books/developers-handbook/sockets/chapter.xml
index 6173905a07..43b32f9b78 100644
--- a/en_US.ISO8859-1/books/developers-handbook/sockets/chapter.xml
+++ b/en_US.ISO8859-1/books/developers-handbook/sockets/chapter.xml
@@ -1,1748 +1,1748 @@
SocketsG. AdamStanislavContributed by SynopsisBSD sockets take interprocess
communications to a new level. It is no longer necessary for
the communicating processes to run on the same machine. They
still can, but they do not have to.Not only do these processes not have to run on the same
machine, they do not have to run under the same operating
system. Thanks to BSD sockets, your FreeBSD
software can smoothly cooperate with a program running on a
&macintosh;, another one running on a &sun; workstation, yet
another one running under &windows; 2000, all connected with an
Ethernet-based local area network.But your software can equally well cooperate with processes
running in another building, or on another continent, inside a
submarine, or a space shuttle.It can also cooperate with processes that are not part of a
computer (at least not in the strict sense of the word), but of
such devices as printers, digital cameras, medical equipment.
Just about anything capable of digital communications.Networking and DiversityWe have already hinted at the diversity
of networking. Many different systems have to talk to each
other. And they have to speak the same language. They also
have to understand the same language the
same way.People often think that body language
is universal. But it is not. Back in my early teens, my father
took me to Bulgaria. We were sitting at a table in a park in
Sofia, when a vendor approached us trying to sell us some
roasted almonds.I had not learned much Bulgarian by then, so, instead of
saying no, I shook my head from side to side, the
universal body language for
no. The vendor quickly started serving us
some almonds.I then remembered I had been told that in Bulgaria shaking
your head sideways meant yes. Quickly, I
started nodding my head up and down. The vendor noticed, took
his almonds, and walked away. To an uninformed observer, I did
not change the body language: I continued using the language of
shaking and nodding my head. What changed was the
meaning of the body language. At first,
the vendor and I interpreted the same language as having
completely different meaning. I had to adjust my own
interpretation of that language so the vendor would
understand.It is the same with computers: The same symbols may have
different, even outright opposite meaning. Therefore, for two
computers to understand each other, they must not only agree on
the same language, but on the same
interpretation of the language.ProtocolsWhile various programming languages tend to have complex
syntax and use a number of multi-letter reserved words (which
makes them easy for the human programmer to understand), the
languages of data communications tend to be very terse. Instead
of multi-byte words, they often use individual
bits. There is a very convincing reason
for it: While data travels inside your
computer at speeds approaching the speed of light, it often
travels considerably slower between two computers.
- Because the languages used in data communications are so
+ As the languages used in data communications are so
terse, we usually refer to them as
protocols rather than languages.As data travels from one computer to another, it always uses
more than one protocol. These protocols are
layered. The data can be compared to the
inside of an onion: You have to peel off several layers of
skin to get to the data. This is best
illustrated with a picture:+----------------+
| Ethernet |
|+--------------+|
|| IP ||
||+------------+||
||| TCP |||
|||+----------+|||
|||| HTTP ||||
||||+--------+||||
||||| PNG |||||
|||||+------+|||||
|||||| Data ||||||
|||||+------+|||||
||||+--------+||||
|||+----------+|||
||+------------+||
|+--------------+|
+----------------+Protocol LayersIn this example, we are trying to get an image from a web
page we are connected to via an Ethernet.The image consists of raw data, which is simply a sequence
of RGB values that our software can process,
i.e., convert into an image and display on our monitor.Alas, our software has no way of knowing how the raw data is
organized: Is it a sequence of RGB values, or
a sequence of grayscale intensities, or perhaps of
CMYK encoded colors? Is the data represented
by 8-bit quanta, or are they 16 bits in size, or perhaps 4 bits?
How many rows and columns does the image consist of? Should
certain pixels be transparent?I think you get the picture...To inform our software how to handle the raw data, it is
encoded as a PNG file. It could be a
GIF, or a JPEG, but it is
a PNG.And PNG is a protocol.At this point, I can hear some of you yelling,
No, it is not! It is a file
format!Well, of course it is a file format. But from the
perspective of data communications, a file format is a protocol:
The file structure is a language, a terse
one at that, communicating to our process
how the data is organized. Ergo, it is a
protocol.Alas, if all we received was the PNG
file, our software would be facing a serious problem: How is it
supposed to know the data is representing an image, as opposed
to some text, or perhaps a sound, or what not? Secondly, how is
it supposed to know the image is in the PNG
format as opposed to GIF, or
JPEG, or some other image format?To obtain that information, we are using another protocol:
HTTP. This protocol can tell us exactly that
the data represents an image, and that it uses the
PNG protocol. It can also tell us some other
things, but let us stay focused on protocol layers here.So, now we have some data wrapped in the
PNG protocol, wrapped in the
HTTP protocol. How did we get it from the
server?By using TCP/IP over Ethernet, that is
how. Indeed, that is three more protocols. Instead of
continuing inside out, I am now going to talk about Ethernet,
simply because it is easier to explain the rest that way.Ethernet is an interesting system of connecting computers in
a local area network
(LAN). Each computer has a network
interface card (NIC), which has
a unique 48-bit ID called its
address. No two Ethernet
NICs in the world have the same
address.These NICs are all connected with each
other. Whenever one computer wants to communicate with another
in the same Ethernet LAN, it sends a message
over the network. Every NIC sees the
message. But as part of the Ethernet
protocol, the data contains the address of
the destination NIC (among other things).
So, only one of all the network interface cards will pay
attention to it, the rest will ignore it.But not all computers are connected to the same network.
Just because we have received the data over our Ethernet does
not mean it originated in our own local area network. It could
have come to us from some other network (which may not even be
Ethernet based) connected with our own network via the
Internet.All data is transferred over the Internet using
IP, which stands for Internet
Protocol. Its basic role is to let us know where
in the world the data has arrived from, and where it is supposed
to go to. It does not guarantee we will
receive the data, only that we will know where it came from
if we do receive it.Even if we do receive the data, IP does
not guarantee we will receive various chunks of data in the same
order the other computer has sent it to us. So, we can receive
the center of our image before we receive the upper left corner
and after the lower right, for example.It is TCP (Transmission Control
Protocol) that asks the sender to resend any lost
data and that places it all into the proper order.All in all, it took five different
protocols for one computer to communicate to another what an
image looks like. We received the data wrapped into the
PNG protocol, which was wrapped into the
HTTP protocol, which was wrapped into the
TCP protocol, which was wrapped into the
IP protocol, which was wrapped into the
Ethernet protocol.Oh, and by the way, there probably were several other
protocols involved somewhere on the way. For example, if our
LAN was connected to the Internet through a
dial-up call, it used the PPP protocol over
the modem which used one (or several) of the various modem
protocols, et cetera, et cetera, et cetera...As a developer you should be asking by now,
How am I supposed to handle it
all?Luckily for you, you are not supposed
to handle it all. You are supposed to
handle some of it, but not all of it. Specifically, you need
not worry about the physical connection (in our case Ethernet
and possibly PPP, etc). Nor do you need to
handle the Internet Protocol, or the Transmission Control
Protocol.In other words, you do not have to do anything to receive
the data from the other computer. Well, you do have to
ask for it, but that is almost as simple as
opening a file.Once you have received the data, it is up to you to figure
out what to do with it. In our case, you would need to
understand the HTTP protocol and the
PNG file structure.To use an analogy, all the internetworking protocols become
a gray area: Not so much because we do not understand how it
works, but because we are no longer concerned about it. The
sockets interface takes care of this gray area for us:+----------------+
|xxxxEthernetxxxx|
|+--------------+|
||xxxxxxIPxxxxxx||
||+------------+||
|||xxxxxTCPxxxx|||
|||+----------+|||
|||| HTTP ||||
||||+--------+||||
||||| PNG |||||
|||||+------+|||||
|||||| Data ||||||
|||||+------+|||||
||||+--------+||||
|||+----------+|||
||+------------+||
|+--------------+|
+----------------+Sockets Covered Protocol LayersWe only need to understand any protocols that tell us how to
interpret the data, not how to
receive it from another process, nor how to
send it to another process.The Sockets ModelBSD sockets are built on the basic &unix;
model: Everything is a file. In our
example, then, sockets would let us receive an HTTP
file, so to speak. It would then be up to us to
extract the PNG file
from it.
- Because of the complexity of internetworking, we cannot just
+ Due to the complexity of internetworking, we cannot just
use the open system call, or
the open() C function. Instead, we need to
take several steps to open a socket.Once we do, however, we can start treating the
socket the same way we treat any
file descriptor: We can
read from it, write to
it, pipe it, and, eventually,
close it.Essential Socket FunctionsWhile FreeBSD offers different functions to work with
sockets, we only need four to
open a socket. And in some cases we only need
two.The Client-Server DifferenceTypically, one of the ends of a socket-based data
communication is a server, the other is a
client.The Common ElementssocketThe one function used by both clients and servers is
&man.socket.2;. It is declared this way:int socket(int domain, int type, int protocol);The return value is of the same type as that of
open, an integer. FreeBSD allocates
its value from the same pool as that of file handles.
That is what allows sockets to be treated the same way as
files.The domain argument tells the
system what protocol family you want
it to use. Many of them exist; some are vendor specific,
others are very common. They are declared in
sys/socket.h.Use PF_INET for
UDP, TCP and other
Internet protocols (IPv4).Five values are defined for the
type argument, again, in
sys/socket.h. All of them start with
SOCK_. The most
common one is SOCK_STREAM, which
tells the system you are asking for a reliable
stream delivery service (which is
TCP when used with
PF_INET).If you asked for SOCK_DGRAM, you
would be requesting a connectionless datagram
delivery service (in our case,
UDP).If you wanted to be in charge of the low-level
protocols (such as IP), or even network
interfaces (e.g., the Ethernet), you would need to specify
SOCK_RAW.Finally, the protocol argument
depends on the previous two arguments, and is not always
meaningful. In that case, use 0 for
its value.The Unconnected SocketNowhere in the socket function
have we specified to what other system we should be
connected. Our newly created socket remains
unconnected.This is on purpose: To use a telephone analogy, we
have just attached a modem to the phone line. We have
neither told the modem to make a call, nor to answer if
the phone rings.sockaddrVarious functions of the sockets family expect the
address of (or pointer to, to use C terminology) a small
area of the memory. The various C declarations in the
sys/socket.h refer to it as
struct sockaddr. This structure is
declared in the same file:/*
* Structure used by kernel to store most
* addresses.
*/
struct sockaddr {
unsigned char sa_len; /* total length */
sa_family_t sa_family; /* address family */
char sa_data[14]; /* actually longer; address value */
};
#define SOCK_MAXADDRLEN 255 /* longest possible addresses */Please note the vagueness with
which the sa_data field is declared,
just as an array of 14 bytes, with
the comment hinting there can be more than
14 of them.This vagueness is quite deliberate. Sockets is a very
powerful interface. While most people perhaps think of it
as nothing more than the Internet interface—and most
applications probably use it for that
nowadays—sockets can be used for just about
any kind of interprocess
communications, of which the Internet (or, more precisely,
IP) is only one.The sys/socket.h refers to the
various types of protocols sockets will handle as
address families, and lists them
right before the definition of
sockaddr:/*
* Address families.
*/
#define AF_UNSPEC 0 /* unspecified */
#define AF_LOCAL 1 /* local to host (pipes, portals) */
#define AF_UNIX AF_LOCAL /* backward compatibility */
#define AF_INET 2 /* internetwork: UDP, TCP, etc. */
#define AF_IMPLINK 3 /* arpanet imp addresses */
#define AF_PUP 4 /* pup protocols: e.g. BSP */
#define AF_CHAOS 5 /* mit CHAOS protocols */
#define AF_NS 6 /* XEROX NS protocols */
#define AF_ISO 7 /* ISO protocols */
#define AF_OSI AF_ISO
#define AF_ECMA 8 /* European computer manufacturers */
#define AF_DATAKIT 9 /* datakit protocols */
#define AF_CCITT 10 /* CCITT protocols, X.25 etc */
#define AF_SNA 11 /* IBM SNA */
#define AF_DECnet 12 /* DECnet */
#define AF_DLI 13 /* DEC Direct data link interface */
#define AF_LAT 14 /* LAT */
#define AF_HYLINK 15 /* NSC Hyperchannel */
#define AF_APPLETALK 16 /* Apple Talk */
#define AF_ROUTE 17 /* Internal Routing Protocol */
#define AF_LINK 18 /* Link layer interface */
#define pseudo_AF_XTP 19 /* eXpress Transfer Protocol (no AF) */
#define AF_COIP 20 /* connection-oriented IP, aka ST II */
#define AF_CNT 21 /* Computer Network Technology */
#define pseudo_AF_RTIP 22 /* Help Identify RTIP packets */
#define AF_IPX 23 /* Novell Internet Protocol */
#define AF_SIP 24 /* Simple Internet Protocol */
#define pseudo_AF_PIP 25 /* Help Identify PIP packets */
#define AF_ISDN 26 /* Integrated Services Digital Network*/
#define AF_E164 AF_ISDN /* CCITT E.164 recommendation */
#define pseudo_AF_KEY 27 /* Internal key-management function */
#define AF_INET6 28 /* IPv6 */
#define AF_NATM 29 /* native ATM access */
#define AF_ATM 30 /* ATM */
#define pseudo_AF_HDRCMPLT 31 /* Used by BPF to not rewrite headers
* in interface output routine
*/
#define AF_NETGRAPH 32 /* Netgraph sockets */
#define AF_SLOW 33 /* 802.3ad slow protocol */
#define AF_SCLUSTER 34 /* Sitara cluster protocol */
#define AF_ARP 35
#define AF_BLUETOOTH 36 /* Bluetooth sockets */
#define AF_MAX 37The one used for IP is
AF_INET. It is a symbol for the constant
2.It is the address family listed
in the sa_family field of
sockaddr that decides how exactly the
vaguely named bytes of sa_data will be
used.Specifically, whenever the address
family is AF_INET, we can
use struct sockaddr_in found in
netinet/in.h, wherever
sockaddr is expected:/*
* Socket address, internet style.
*/
struct sockaddr_in {
uint8_t sin_len;
sa_family_t sin_family;
in_port_t sin_port;
struct in_addr sin_addr;
char sin_zero[8];
};We can visualize its organization this way: 0 1 2 3
+--------+--------+-----------------+
0 | 0 | Family | Port |
+--------+--------+-----------------+
4 | IP Address |
+-----------------------------------+
8 | 0 |
+-----------------------------------+
12 | 0 |
+-----------------------------------+sockaddr_inThe three important fields are
sin_family, which is byte 1 of the
structure, sin_port, a 16-bit value
found in bytes 2 and 3, and sin_addr, a
32-bit integer representation of the IP
address, stored in bytes 4-7.Now, let us try to fill it out. Let us assume we are
trying to write a client for the
daytime protocol, which simply states
that its server will write a text string representing the
current date and time to port 13. We want to use
TCP/IP, so we need to specify
AF_INET in the address family field.
AF_INET is defined as
2. Let us use the
IP address of 192.43.244.18, which is
the time server of the US federal government (time.nist.gov). 0 1 2 3
+--------+--------+-----------------+
0 | 0 | 2 | 13 |
+-----------------+-----------------+
4 | 192.43.244.18 |
+-----------------------------------+
8 | 0 |
+-----------------------------------+
12 | 0 |
+-----------------------------------+Specific example of sockaddr_inBy the way, the sin_addr field is
declared as being of the struct in_addr
type, which is defined in
netinet/in.h:/*
* Internet address (a structure for historical reasons)
*/
struct in_addr {
in_addr_t s_addr;
Here, in_addr_t is a 32-bit
integer.The 192.43.244.18 is just a
convenient notation for expressing a 32-bit integer by
listing all of its 8-bit bytes, starting with the
most significant one.So far, we have viewed sockaddr as
an abstraction. Our computer does not store
short integers as a single 16-bit
entity, but as a sequence of 2 bytes. Similarly, it
stores 32-bit integers as a sequence of 4 bytes.Suppose we coded something like this:sa.sin_family = AF_INET;
sa.sin_port = 13;
sa.sin_addr.s_addr = (((((192 << 8) | 43) << 8) | 244) << 8) | 18;What would the result look like?Well, that depends, of course. On a &pentium;, or
other x86-based computer, it would look like this: 0 1 2 3
+--------+--------+--------+--------+
0 | 0 | 2 | 13 | 0 |
+--------+--------+--------+--------+
4 | 18 | 244 | 43 | 192 |
+-----------------------------------+
8 | 0 |
+-----------------------------------+
12 | 0 |
+-----------------------------------+sockaddr_in on an Intel systemOn a different system, it might look like this: 0 1 2 3
+--------+--------+--------+--------+
0 | 0 | 2 | 0 | 13 |
+--------+--------+--------+--------+
4 | 192 | 43 | 244 | 18 |
+-----------------------------------+
8 | 0 |
+-----------------------------------+
12 | 0 |
+-----------------------------------+sockaddr_in on an MSB systemAnd on a PDP it might look different yet. But the
above two are the most common ways in use today.Ordinarily, programmers who want to write
portable code pretend that these differences do not exist.
And they get away with it (except when they code in
assembly language). Alas, you cannot get away with it
that easily when coding for sockets.Why?Because when communicating with another computer, you
usually do not know whether it stores data most
significant byte (MSB) or
least significant byte
(LSB) first.You might be wondering, So, will
sockets not handle it for
me?It will not.While that answer may surprise you at first, remember
that the general sockets interface only understands the
sa_len and sa_family
fields of the sockaddr structure. You
do not have to worry about the byte order there (of
course, on FreeBSD sa_family is only 1
byte anyway, but many other &unix; systems do not have
sa_len and use 2 bytes for
sa_family, and expect the data in
whatever order is native to the computer).But the rest of the data is just
sa_data[14] as far as sockets goes.
Depending on the address family,
sockets just forwards that data to its destination.Indeed, when we enter a port number, it is because we
want the other computer to know what service we are asking
for. And, when we are the server, we read the port number
so we know what service the other computer is expecting
from us. Either way, sockets only has to forward the port
number as data. It does not interpret it in any
way.Similarly, we enter the IP address
to tell everyone on the way where to send our data to.
Sockets, again, only forwards it as data.That is why we (the programmers,
not the sockets) have to distinguish
between the byte order used by our computer and a
conventional byte order to send the data in to the other
computer.We will call the byte order our computer uses the
host byte order, or just the
host order.There is a convention of sending the multi-byte data
over IP
MSB first. This,
we will refer to as the network byte
order, or simply the network
order.Now, if we compiled the above code for an Intel based
computer, our host byte order would
produce: 0 1 2 3
+--------+--------+--------+--------+
0 | 0 | 2 | 13 | 0 |
+--------+--------+--------+--------+
4 | 18 | 244 | 43 | 192 |
+-----------------------------------+
8 | 0 |
+-----------------------------------+
12 | 0 |
+-----------------------------------+Host byte order on an Intel systemBut the network byte order
requires that we store the data MSB
first: 0 1 2 3
+--------+--------+--------+--------+
0 | 0 | 2 | 0 | 13 |
+--------+--------+--------+--------+
4 | 192 | 43 | 244 | 18 |
+-----------------------------------+
8 | 0 |
+-----------------------------------+
12 | 0 |
+-----------------------------------+Network byte orderUnfortunately, our host order is
the exact opposite of the network
order.We have several ways of dealing with it. One would be
to reverse the values in our
code:sa.sin_family = AF_INET;
sa.sin_port = 13 << 8;
sa.sin_addr.s_addr = (((((18 << 8) | 244) << 8) | 43) << 8) | 192;This will trick our compiler into
storing the data in the network byte
order. In some cases, this is exactly the
way to do it (e.g., when programming in assembly
language). In most cases, however, it can cause a
problem.Suppose you wrote a sockets-based program in C. You
know it is going to run on a &pentium;, so you enter all
your constants in reverse and force them to the
network byte order. It works
well.Then, some day, your trusted old &pentium; becomes a
rusty old &pentium;. You replace it with a system whose
host order is the same as the
network order. You need to recompile
all your software. All of your software continues to
perform well, except the one program you wrote.You have since forgotten that you had forced all of
your constants to the opposite of the host
order. You spend some quality time tearing
out your hair, calling the names of all gods you ever
heard of (and some you made up), hitting your monitor with
a nerf bat, and performing all the other traditional
ceremonies of trying to figure out why something that has
worked so well is suddenly not working at all.Eventually, you figure it out, say a couple of swear
words, and start rewriting your code.Luckily, you are not the first one to face the
problem. Someone else has created the &man.htons.3; and
&man.htonl.3; C functions to convert a
short and long
respectively from the host byte order
to the network byte order, and the
&man.ntohs.3; and &man.ntohl.3; C functions to go the
other way.On MSB-first
systems these functions do nothing. On
LSB-first systems
they convert values to the proper order.So, regardless of what system your software is
compiled on, your data will end up in the correct order if
you use these functions.Client FunctionsTypically, the client initiates the connection to the
server. The client knows which server it is about to call:
It knows its IP address, and it knows the
port the server resides at. It is akin
to you picking up the phone and dialing the number (the
address), then, after someone answers,
asking for the person in charge of wingdings (the
port).connectOnce a client has created a socket, it needs to
connect it to a specific port on a remote system. It uses
&man.connect.2;:int connect(int s, const struct sockaddr *name, socklen_t namelen);The s argument is the socket, i.e.,
the value returned by the socket
function. The name is a pointer to
sockaddr, the structure we have talked
about extensively. Finally, namelen
informs the system how many bytes are in our
sockaddr structure.If connect is successful, it
returns 0. Otherwise it returns
-1 and stores the error code in
errno.There are many reasons why
connect may fail. For example, with
an attempt at an Internet connection, the
IP address may not exist, the remote host may be
down or just too busy, or it may not have a server
listening at the specified port. Or it may outright
refuse any connection
request.Our First ClientWe now know enough to write a very simple client, one
that will get current time from 192.43.244.18 and print
it to stdout./*
* daytime.c
*
* Programmed by G. Adam Stanislav
*/
#include <stdio.h>
#include <string.h>
#include <strings.h>	/* bzero */
#include <unistd.h>	/* read, write, close */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
int main() {
register int s;
register int bytes;
struct sockaddr_in sa;
char buffer[BUFSIZ+1];
if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
perror("socket");
return 1;
}
bzero(&sa, sizeof sa);
sa.sin_family = AF_INET;
sa.sin_port = htons(13);
sa.sin_addr.s_addr = htonl((((((192 << 8) | 43) << 8) | 244) << 8) | 18);
if (connect(s, (struct sockaddr *)&sa, sizeof sa) < 0) {
perror("connect");
close(s);
return 2;
}
while ((bytes = read(s, buffer, BUFSIZ)) > 0)
write(1, buffer, bytes);
close(s);
return 0;
}Go ahead, enter it in your editor, save it as
daytime.c, then compile and run
it:&prompt.user; cc -O3 -o daytime daytime.c
&prompt.user; ./daytime
52079 01-06-19 02:29:25 50 0 1 543.9 UTC(NIST) *
&prompt.user;In this case, the date was June 19, 2001, the time was
02:29:25 UTC. Naturally, your results
will vary.Server FunctionsThe typical server does not initiate the connection.
Instead, it waits for a client to call it and request
services. It does not know when the client will call, nor
how many clients will call. It may be just sitting there,
waiting patiently one moment. The next moment, it can find
itself swamped with requests from a number of clients, all
calling in at the same time.The sockets interface offers three basic functions to
handle this.bindPorts are like extensions to a phone line: After you
dial a number, you dial the extension to get to a specific
person or department.There are 65535 IP ports, but a
server usually processes requests that come in on only one
of them. It is like telling the phone room operator that
we are now at work and available to answer the phone at a
specific extension. We use &man.bind.2; to tell sockets
which port we want to serve.int bind(int s, const struct sockaddr *addr, socklen_t addrlen);Beside specifying the port in addr,
the server may include its IP address.
However, it can just use the symbolic constant
INADDR_ANY to indicate it will serve all
requests to the specified port regardless of what its
IP address is. This symbol, along with
several similar ones, is declared in
netinet/in.h:#define INADDR_ANY (u_int32_t)0x00000000Suppose we were writing a server for the
daytime protocol over
TCP/IP. Recall that
it uses port 13. Our sockaddr_in
structure would look like this: 0 1 2 3
+--------+--------+--------+--------+
0 | 0 | 2 | 0 | 13 |
+--------+--------+--------+--------+
4 | 0 |
+-----------------------------------+
8 | 0 |
+-----------------------------------+
12 | 0 |
+-----------------------------------+Example Server sockaddr_inlistenTo continue our office phone analogy, after you have
told the phone central operator what extension you will be
at, you now walk into your office, and make sure your own
phone is plugged in and the ringer is turned on. Plus,
you make sure your call waiting is activated, so you can
hear the phone ring even while you are talking to
someone.The server ensures all of that with the &man.listen.2;
function.int listen(int s, int backlog);In here, the backlog variable tells
sockets how many incoming requests to accept while you are
busy processing the last request. In other words, it
determines the maximum size of the queue of pending
connections.acceptAfter you hear the phone ringing, you accept the call
by answering the call. You have now established a
connection with your client. This connection remains
active until either you or your client hang up.The server accepts the connection by using the
&man.accept.2; function.int accept(int s, struct sockaddr *addr, socklen_t *addrlen);Note that this time addrlen is a
pointer. This is necessary because in this case it is the
socket that fills out addr, the
sockaddr_in structure.The return value is an integer. Indeed, the
accept function returns a new
socket. You will use this new socket to
communicate with the client.What happens to the old socket? It continues to listen
for more requests (remember the backlog
variable we passed to listen?) until
we close it.Now, the new socket is meant only for communications.
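The listening-socket/accepted-socket split can be seen in miniature with a self-contained sketch. This is not part of the original text: it creates a listening socket on the loopback interface with an ephemeral port (so it runs unprivileged), connects to it from the same process, and passes one byte through the new socket that accept returns, while the original socket keeps listening:

```c
/*
 * accept_sketch.c -- an illustration, not part of the original text.
 * Everything runs over the loopback interface on a kernel-chosen
 * ephemeral port, so no privileges or network access are needed.
 */
#include <strings.h>		/* bzero */
#include <unistd.h>		/* read, write, close */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Pass one byte through a loopback connection; return the byte
 * received, or -1 on any failure (error paths leak descriptors,
 * for brevity).
 */
int
loopback_roundtrip(void)
{
	int lst, cli, acc;
	struct sockaddr_in sa;
	socklen_t len = sizeof sa;
	char sent = 'A', got = 0;

	if ((lst = socket(PF_INET, SOCK_STREAM, 0)) < 0)
		return -1;
	bzero(&sa, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_port = htons(0);		/* 0 = kernel picks a free port */
	sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (bind(lst, (struct sockaddr *)&sa, sizeof sa) < 0 ||
	    listen(lst, 1) < 0 ||
	    getsockname(lst, (struct sockaddr *)&sa, &len) < 0)
		return -1;
	/* The "client" side: connect to the port we were given. */
	if ((cli = socket(PF_INET, SOCK_STREAM, 0)) < 0 ||
	    connect(cli, (struct sockaddr *)&sa, sizeof sa) < 0)
		return -1;
	/* accept hands us a NEW socket; lst keeps listening. */
	len = sizeof sa;
	if ((acc = accept(lst, (struct sockaddr *)&sa, &len)) < 0)
		return -1;
	if (write(cli, &sent, 1) != 1 || read(acc, &got, 1) != 1)
		return -1;
	close(acc);
	close(cli);
	close(lst);
	return got;
}
```

The connect succeeds even though the same thread has not yet called accept, because the kernel completes the handshake for a listening socket on its own; accept then merely collects the already established connection.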
It is fully connected. We cannot pass it to
listen again, trying to accept
additional connections.Our First ServerOur first server will be somewhat more complex than
our first client was: Not only do we have more sockets
functions to use, but we need to write it as a
daemon.This is best achieved by creating a child
process after binding the port. The main
process then exits and returns control to the
shell (or whatever program
invoked it).The child calls listen, then
starts an endless loop, which accepts a connection, serves
it, and eventually closes its socket./*
* daytimed - a port 13 server
*
* Programmed by G. Adam Stanislav
* June 19, 2001
*/
#include <stdio.h>
#include <string.h>
#include <strings.h>	/* bzero */
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#define BACKLOG 4
int main() {
register int s, c;
socklen_t b;
struct sockaddr_in sa;
time_t t;
struct tm *tm;
FILE *client;
if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
perror("socket");
return 1;
}
bzero(&sa, sizeof sa);
sa.sin_family = AF_INET;
sa.sin_port = htons(13);
if (INADDR_ANY)
sa.sin_addr.s_addr = htonl(INADDR_ANY);
if (bind(s, (struct sockaddr *)&sa, sizeof sa) < 0) {
perror("bind");
return 2;
}
switch (fork()) {
case -1:
perror("fork");
return 3;
break;
default:
close(s);
return 0;
break;
case 0:
break;
}
listen(s, BACKLOG);
for (;;) {
b = sizeof sa;
if ((c = accept(s, (struct sockaddr *)&sa, &b)) < 0) {
perror("daytimed accept");
return 4;
}
if ((client = fdopen(c, "w")) == NULL) {
perror("daytimed fdopen");
return 5;
}
if ((t = time(NULL)) == (time_t)-1) {
perror("daytimed time");
return 6;
}
tm = gmtime(&t);
fprintf(client, "%.4i-%.2i-%.2iT%.2i:%.2i:%.2iZ\n",
tm->tm_year + 1900,
tm->tm_mon + 1,
tm->tm_mday,
tm->tm_hour,
tm->tm_min,
tm->tm_sec);
fclose(client);
}
}We start by creating a socket. Then we fill out the
sockaddr_in structure in
sa. Note the conditional use of
INADDR_ANY:if (INADDR_ANY)
sa.sin_addr.s_addr = htonl(INADDR_ANY);Its value is 0. Since we have
just used bzero on the entire
structure, it would be redundant to set it to
0 again. But if we port our code to
some other system where INADDR_ANY is
perhaps not a zero, we need to assign it to
sa.sin_addr.s_addr. Most modern C
compilers are clever enough to notice that
INADDR_ANY is a constant. As long as it
is a zero, they will optimize the entire conditional
statement out of the code.After we have called bind
successfully, we are ready to become a
daemon: We use
fork to create a child process. In
both, the parent and the child, the s
variable is our socket. The parent process will not need
it, so it calls close, then it
returns 0 to inform its own parent it
had terminated successfully.Meanwhile, the child process continues working in the
background. It calls listen and sets
its backlog to 4. It does not need a
large value here because daytime is
not a protocol many clients request all the time, and
because it can process each request instantly
anyway.Finally, the daemon starts an endless loop, which
performs the following steps:Call accept. It waits here
until a client contacts it. At that point, it
receives a new socket, c, which it
can use to communicate with this particular
client.It uses the C function fdopen
to turn the socket from a low-level file
descriptor into a C-style
FILE pointer. This will allow the
use of fprintf later
on.It checks the time, and prints it in the
ISO 8601
format to the client file. It then uses
fclose to close the file. That
will automatically close the socket as
well.We can generalize this, and use
it as a model for many other servers:+-----------------+
| Create Socket |
+-----------------+
|
+-----------------+
| Bind Port | Daemon Process
+-----------------+
| +--------+
+-------------+-->| Init |
| | +--------+
+-----------------+ | |
| Exit | | +--------+
+-----------------+ | | Listen |
| +--------+
| |
| +--------+
| | Accept |
| +--------+
| |
| +--------+
| | Serve |
| +--------+
| |
| +--------+
| | Close |
|<--------+Sequential ServerThis flowchart is good for sequential
servers, i.e., servers that can serve one
client at a time, just as we were able to with our
daytime server. This is only
possible whenever there is no real
conversation going on between the client
and the server: As soon as the server detects a connection
to the client, it sends out some data and closes the
connection. The entire operation may take nanoseconds,
and it is finished.The advantage of this flowchart is that, except for
the brief moment after the parent
forks and before it exits, there is
always only one process active: Our
server does not take up much memory and other system
resources.Note that we have added initialize
daemon in our flowchart. We did not need to
initialize our own daemon, but this is a good place in the
flow of the program to set up any
signal handlers, open any files we
may need, etc.Just about everything in the flow chart can be used
literally on many different servers. The
serve entry is the exception. We
think of it as a black
box, i.e., something you design
specifically for your own server, and just plug it
into the rest.Not all protocols are that simple. Many receive a
request from the client, reply to it, then receive another
- request from the same client. Because of that, they do
+ request from the same client. As a result, they do
not know in advance how long they will be serving the
client. Such servers usually start a new process for each
client. While the new process is serving its client, the
daemon can continue listening for more connections.Now, go ahead, save the above source code as
daytimed.c (it is customary to end
the names of daemons with the letter
d). After you have compiled it, try
running it:&prompt.user; ./daytimed
bind: Permission denied
&prompt.user;What happened here? As you will recall, the
daytime protocol uses port 13. But
all ports below 1024 are reserved to the superuser
(otherwise, anyone could start a daemon pretending to
serve a commonly used port, while causing a security
breach).Try again, this time as the superuser:&prompt.root; ./daytimed
&prompt.root;What... Nothing? Let us try again:&prompt.root; ./daytimed
bind: Address already in use
&prompt.root;Every port can only be bound by one program at a time.
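As an aside, a port can also appear busy for a short while after its previous owner has exited, because old connections linger in the TIME_WAIT state. The customary remedy, shown here as a sketch rather than as part of daytimed.c, is to set the SO_REUSEADDR option with &man.setsockopt.2; before calling bind:

```c
/*
 * reuseaddr_sketch.c -- a sketch, not part of daytimed.c: enable
 * SO_REUSEADDR so a restarted daemon can rebind its port while old
 * connections are still in TIME_WAIT.
 */
#include <sys/types.h>
#include <sys/socket.h>

/* Returns 0 on success, -1 on failure, like setsockopt itself. */
int
enable_reuseaddr(int s)
{
	int on = 1;

	return setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);
}
```

The call belongs between socket and bind; setting it after bind has no effect on the bind that already happened.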
Our first attempt was indeed successful: It started the
child daemon and returned quietly. It is still running
and will continue to run until you either kill it, or any
of its system calls fail, or you reboot the system.Fine, we know it is running in the background. But is
it working? How do we know it is a proper
daytime server? Simple:&prompt.user; telnet localhost 13
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
2001-06-19T21:04:42Z
Connection closed by foreign host.
&prompt.user;telnet tried the new
IPv6, and failed. It retried with
IPv4 and succeeded. The daemon
works.If you have access to another &unix; system via
telnet, you can use it to test
accessing the server remotely. My computer does not have
a static IP address, so this is what I
did:&prompt.user; who
whizkid ttyp0 Jun 19 16:59 (216.127.220.143)
xxx ttyp1 Jun 19 16:06 (xx.xx.xx.xx)
&prompt.user; telnet 216.127.220.143 13
Trying 216.127.220.143...
Connected to r47.bfm.org.
Escape character is '^]'.
2001-06-19T21:31:11Z
Connection closed by foreign host.
&prompt.user;Again, it worked. Will it work using the domain
name?&prompt.user; telnet r47.bfm.org 13
Trying 216.127.220.143...
Connected to r47.bfm.org.
Escape character is '^]'.
2001-06-19T21:31:40Z
Connection closed by foreign host.
&prompt.user;By the way, telnet prints
the Connection closed by foreign host
message after our daemon has closed the socket. This
shows us that, indeed, using
fclose(client); in our code works as
advertised.Helper FunctionsThe FreeBSD C library contains many helper functions for sockets
programming. For example, in our sample client we hard coded
the time.nist.gov
IP address. But we do not always know the
IP address. Even if we do, our software is
more flexible if it allows the user to enter the
IP address, or even the domain name.gethostbynameWhile there is no way to pass the domain name directly to
any of the sockets functions, the FreeBSD C library comes with
the &man.gethostbyname.3; and &man.gethostbyname2.3;
functions, declared in netdb.h.struct hostent * gethostbyname(const char *name);
struct hostent * gethostbyname2(const char *name, int af);Both return a pointer to the hostent
structure, with much information about the domain. For our
purposes, the h_addr_list[0] field of the
structure points at h_length bytes of the
correct address, already stored in the network byte
order.This allows us to create a much more flexible—and
much more useful—version of our
daytime program:/*
* daytime.c
*
* Programmed by G. Adam Stanislav
* 19 June 2001
*/
#include <stdio.h>
#include <string.h>
#include <strings.h>	/* bzero, bcopy */
#include <unistd.h>	/* read, write, close */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
int main(int argc, char *argv[]) {
register int s;
register int bytes;
struct sockaddr_in sa;
struct hostent *he;
char buf[BUFSIZ+1];
char *host;
if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
perror("socket");
return 1;
}
bzero(&sa, sizeof sa);
sa.sin_family = AF_INET;
sa.sin_port = htons(13);
host = (argc > 1) ? (char *)argv[1] : "time.nist.gov";
if ((he = gethostbyname(host)) == NULL) {
herror(host);
return 2;
}
bcopy(he->h_addr_list[0], &sa.sin_addr, he->h_length);
if (connect(s, (struct sockaddr *)&sa, sizeof sa) < 0) {
perror("connect");
return 3;
}
while ((bytes = read(s, buf, BUFSIZ)) > 0)
write(1, buf, bytes);
close(s);
return 0;
}We now can type a domain name (or an IP
address, it works both ways) on the command line, and the
program will try to connect to its
daytime server. Otherwise, it will still
default to time.nist.gov. However,
even in this case we will use
gethostbyname rather than hard coding
192.43.244.18.
That way, even if its IP address changes in
the future, we will still find it.Since it takes virtually no time to get the time from your
local server, you could run daytime
twice in a row: First to get the time from time.nist.gov, the second
time from your own system. You can then compare the results
and see how exact your system clock is:&prompt.user; daytime ; daytime localhost
52080 01-06-20 04:02:33 50 0 0 390.2 UTC(NIST) *
2001-06-20T04:02:35Z
&prompt.user;As you can see, my system was two seconds ahead of the
NIST time.getservbynameSometimes you may not be sure what port a certain service
uses. The &man.getservbyname.3; function, also declared in
netdb.h, comes in very handy in those
cases:struct servent * getservbyname(const char *name, const char *proto);The servent structure contains the
s_port, which contains the proper port,
already in network byte order.Had we not known the correct port for the
daytime service, we could have found it
this way:struct servent *se;
...
if ((se = getservbyname("daytime", "tcp")) == NULL) {
fprintf(stderr, "Cannot determine which port to use.\n");
return 7;
}
sa.sin_port = se->s_port;You usually do know the port. But if you are developing a
new protocol, you may be testing it on an unofficial port.
Some day, you will register the protocol and its port (if
nowhere else, at least in your
/etc/services, which is where
getservbyname looks). Instead of
returning an error in the above code, you just use the
temporary port number. Once you have listed the protocol in
/etc/services, your software will find
its port without you having to rewrite the code.Concurrent ServersUnlike a sequential server, a concurrent
server has to be able to serve more than one client
at a time. For example, a chat server may
be serving a specific client for hours—it cannot wait till
it stops serving a client before it serves the next one.This requires a significant change in our flowchart:+-----------------+
| Create Socket |
+-----------------+
|
+-----------------+
| Bind Port | Daemon Process
+-----------------+
| +--------+
+-------------+-->| Init |
| | +--------+
+-----------------+ | |
| Exit | | +--------+
+-----------------+ | | Listen |
| +--------+
| |
| +--------+
| | Accept |
| +--------+
| | +------------------+
| +------>| Close Top Socket |
| | +------------------+
| +--------+ |
| | Close | +------------------+
| +--------+ | Serve |
| | +------------------+
|<--------+ |
+------------------+
| Close Acc Socket |
+--------+ +------------------+
| Signal | |
+--------+ +------------------+
| Exit |
+------------------+Concurrent ServerWe moved the serve from the
daemon process to its own server
process. However, because each child process
inherits all open files (and a socket is treated just like a
file), the new process inherits not only the
accepted handle, i.e., the
socket returned by the accept call, but
also the top socket, i.e., the one opened
by the top process right at the beginning.However, the server process does not
need this socket and should close it
immediately. Similarly, the daemon process
no longer needs the accepted socket, and
not only should, but must close it—otherwise, it will run out
of available file descriptors sooner or
later.After the server process is done
serving, it should close the accepted
socket. Instead of returning to
accept, it now exits.Under &unix;, a process does not really
exit. Instead, it
returns to its parent. Typically, a parent
process waits for its child process, and
obtains a return value. However, our daemon
process cannot simply stop and wait. That would
defeat the whole purpose of creating additional processes. But
if it never does wait, its children will
become zombies—no longer functional
but still roaming around.For that reason, the daemon process
needs to set signal handlers in its
initialize daemon phase. At least a
SIGCHLD signal has to be processed, so the
daemon can remove the zombie return values from the system and
release the system resources they are taking up.That is why our flowchart now contains a process
signals box, which is not connected to any other
box. By the way, many servers also process
SIGHUP, and typically interpret it as the signal
from the superuser that they should reread their configuration
files. This allows us to change settings without having to kill
and restart these servers.
diff --git a/en_US.ISO8859-1/books/faq/book.xml b/en_US.ISO8859-1/books/faq/book.xml
index 11ebe38646..5942b54a38 100644
--- a/en_US.ISO8859-1/books/faq/book.xml
+++ b/en_US.ISO8859-1/books/faq/book.xml
@@ -1,6431 +1,6431 @@
13-CURRENT">
X">
head/">
X">
12-STABLE">
stable/12/">
X">
11-STABLE">
stable/11/">
]>
Frequently Asked Questions for &os;
&rel2.relx; and &rel.relx;The &os; Documentation Project19951996199719981999200020012002200320042005200620072008200920102011201220132014201520162017201820192020The &os; Documentation Project
&legalnotice;
&tm-attrib.freebsd;
&tm-attrib.adobe;
&tm-attrib.ibm;
&tm-attrib.ieee;
&tm-attrib.intel;
&tm-attrib.linux;
&tm-attrib.microsoft;
&tm-attrib.netbsd;
&tm-attrib.opengroup;
&tm-attrib.sgi;
&tm-attrib.sun;
&tm-attrib.general;
$FreeBSD$This is the Frequently Asked Questions
(FAQ) for &os; versions
&rel.relx; and &rel2.relx;. Every effort has been made to
make this FAQ as informative as possible;
if you have any suggestions as to how it may be improved, send
them to the &a.doc;.The latest version of this document is always available
from the &os;
website. It may also be downloaded as one large
HTML file with HTTP or as
a variety of other formats from the &os; FTP
server.IntroductionWhat is &os;?&os; is a modern operating system for desktops,
laptops, servers, and embedded systems with support for a
large number of platforms.It is based on U.C. Berkeley's
4.4BSD-Lite release, with some
4.4BSD-Lite2 enhancements. It is also
based indirectly on William Jolitz's port of U.C.
Berkeley's Net/2 to the &i386;, known as
386BSD, though very little of the 386BSD
code remains.&os; is used by companies, Internet Service Providers,
researchers, computer professionals, students and home
users all over the world in their work, education and
recreation.For more detailed information on &os;, refer to the
&os;
Handbook.What is the goal of the &os; Project?The goal of the &os; Project is to provide a stable
and fast general purpose operating system that may be used
for any purpose without strings attached.Does the &os; license have any restrictions?Yes. Those restrictions do not control how the code
is used, but how you treat the &os; Project itself.
The license itself is available at
license
and can be summarized like this:Do not claim that you wrote this.Do not sue us if it breaks.Do not remove or modify the license.Many of us have a significant investment in the
project and would certainly not mind a little financial
compensation now and then, but we definitely do not insist
on it. We believe that our first and foremost
mission is to provide code to any and all
comers, and for whatever purpose, so that the code gets
the widest possible use and provides the widest possible
benefit. This, we believe, is one of the most fundamental
goals of Free Software and one that we enthusiastically
support.Code in our source tree which falls under the GNU
General Public License (GPL) or GNU
Library General Public License (LGPL) comes with
slightly more strings attached, though at least on the
side of enforced access rather than the usual opposite.
Due to the additional complexities that can evolve in the
commercial use of GPL software, we do, however, endeavor
to replace such software with submissions under the more
relaxed &os;
license whenever possible.Can &os; replace my current operating system?For most people, yes. But this question is not quite
that cut-and-dried.Most people do not actually use an operating system.
They use applications. The applications are what really
use the operating system. &os; is designed to provide a
robust and full-featured environment for applications. It
supports a wide variety of web browsers, office suites,
email readers, graphics programs, programming
environments, network servers, and much more.
Most of these applications can be
managed through the Ports
Collection.If an application is only available on one operating
system, that operating system cannot just be replaced.
Chances are, there is a very similar application on &os;,
however. As a solid office or Internet server or a
reliable workstation, &os; will almost certainly do
everything you need. Many computer users across the
world, including both novices and experienced &unix;
administrators, use &os; as their only desktop operating
system.Users migrating to &os; from another &unix;-like
environment will find &os; to be similar.
&windows; and &macos; users may be interested in instead
using GhostBSD,
MidnightBSD
or NomadBSD,
three &os;-based desktop distributions. Non-&unix; users
should expect to invest some additional time learning the
&unix; way of doing things. This FAQ
and the &os;
Handbook are excellent places to start.Why is it called &os;?It may be used free of charge, even by commercial
users.Full source for the operating system is freely
available, and the minimum possible restrictions have
been placed upon its use, distribution and
incorporation into other work (commercial or
non-commercial).Anyone who has an improvement or bug fix is free
to submit their code and have it added to the source
tree (subject to one or two obvious
provisions).It is worth pointing out that the word
free is being used in two ways here: one
meaning at no cost and the other meaning
do whatever you like. Apart from
one or two things you cannot do with
the &os; code, for example pretending you wrote it, you
can really do whatever you like with it.What are the differences between &os; and NetBSD,
OpenBSD, and other open source BSD operating
systems?James Howard wrote a good explanation of the history
and differences between the various projects, called The
BSD Family Tree which goes a fair way to
answering this question. Some of the information is out
of date, but the history portion in particular remains
accurate.Most of the BSDs share patches and code, even today.
All of the BSDs have common ancestry.The design goals of &os; are described in , above. The design goals of
the other most popular BSDs may be summarized as
follows:OpenBSD aims for operating system security above
all else. The OpenBSD team wrote &man.ssh.1; and
&man.pf.4;, which have both been ported to
&os;.NetBSD aims to be easily ported to other hardware
platforms.DragonFly BSD is a fork of &os; 4.8 that
has since developed many interesting features of its
own, including the HAMMER file system and support for
user-mode vkernels.What is the latest version of &os;?At any point in the development of &os;, there can be
multiple parallel branches. &rel.relx; releases are made
from the &rel.stable; branch, and &rel2.relx; releases are
made from the &rel2.stable; branch.Up until the release of 12.0, the &rel2.relx; series
was the one known as -STABLE.
However, as of &rel.head.relx;, the &rel2.relx; branch
will be designated for an extended support
status and receive only fixes for major problems, such as
security-related fixes.
Releases are made every
few months. While many people stay more
up-to-date with the &os; sources (see the questions on
&os.current; and &os.stable;) than that, doing so
is more of a commitment, as the sources are a moving
target.More information on &os; releases can be found on the
Release
Engineering page and in &man.release.7;.What is &os;-CURRENT?&os.current;
is the development version of the operating system, which
will in due course become the new &os.stable; branch. As
such, it is really only of interest to developers working
on the system and die-hard hobbyists. See the relevant
section in the Handbook
for details on running
-CURRENT.Users not familiar with &os; should not use
&os.current;. This branch sometimes evolves quite quickly
and, due to mistakes, can be unbuildable at times. People
that use &os.current; are expected to be able to analyze,
debug, and report problems.What is the &os;-STABLE
concept?&os;-STABLE is the development branch
from which major releases are made. Changes go into this
branch at a slower pace and with the general assumption
that they have first been tested in &os;-CURRENT.
However, at any given time, the sources for &os;-STABLE
may or may not be suitable for general use, as it may
uncover bugs and corner cases that were not yet found in
&os;-CURRENT. Users who do not have the resources to
perform testing should instead run the most recent release
of &os;.
&os;-CURRENT, on the other hand, has
been one unbroken line since 2.0 was released.For more
detailed information on branches see &os;
Release Engineering: Creating the Release
Branch, the status of the branches and
the upcoming release schedule can be found on the Release
Engineering Information page.Version &rel121.current;
is the latest release from the &rel.stable; branch; it was
released in &rel121.current.date;. Version &rel1.current;
is the latest release from the &rel2.stable; branch; it
was released in &rel1.current.date;.When are &os; releases made?The &a.re; releases a new major version of &os; about
every 18 months and a new minor version about every 8
months, on average. Release dates are announced well in
advance, so that the people working on the system know
when their projects need to be finished and tested. A
testing period precedes each release, to ensure that the
addition of new features does not compromise the stability
of the release. Many users regard this caution as one of
the best things about &os;, even though waiting for all
the latest goodies to reach -STABLE
can be a little frustrating.More information on the release engineering process
(including a schedule of upcoming releases) can be found
on the release
engineering pages on the &os; Web site.For people who need or want a little more excitement,
binary snapshots are made weekly as discussed
above.When are &os; snapshots made?&os; snapshot
releases are made based on the current state of the
-CURRENT and
-STABLE branches. The goals behind
each snapshot release are:To test the latest version of the installation
software.To give people who would like to run
-CURRENT or
-STABLE but who do not have the
time or bandwidth to follow it on a day-to-day basis
an easy way of bootstrapping it onto their
systems.To preserve a fixed reference point for the code
in question, just in case we break something really
badly later. (Although Subversion normally prevents
anything horrible like this happening.)To ensure that all new features and fixes in need
of testing have the greatest possible number of
potential testers.No claims are made that any
-CURRENT snapshot can be considered
production quality for any purpose.
If a stable and fully tested system is needed,
stick to full releases.Snapshot releases are directly available from snapshot.Official snapshots are generated on a regular
basis for all actively developed branches.Who is responsible for &os;?The key decisions concerning the &os; project, such as
the overall direction of the project and who is allowed to
add code to the source tree, are made by a core
team of 9 people. There is a much larger team of
more than 350 committers
who are authorized to make changes directly to the &os;
source tree.However, most non-trivial changes are discussed in
advance in the mailing
lists, and there are no restrictions on who may
take part in the discussion.Where can I get &os;?Every significant release of &os; is available via
anonymous FTP from the &os;
FTP site:The latest &rel.stable; release,
&rel121.current;-RELEASE, can be found in the &rel121.current;-RELEASE
directory.Snapshot
releases are made monthly for the -CURRENT and -STABLE branches, which serve
purely bleeding-edge testers and
developers.The latest &rel2.stable; release,
&rel1.current;-RELEASE, can be found in the &rel1.current;-RELEASE
directory.Information about obtaining &os; on CD, DVD, and other
media can be found in the
Handbook.How do I access the Problem Report database?The Problem Report database of all user change
requests may be queried by using our web-based PR query
interface.The web-based
problem report submission interface can be used
to submit problem reports through a web browser.Before submitting a problem report, read Writing
&os; Problem Reports, an article on how to write
good problem reports.Documentation and SupportWhat good books are there about &os;?The project produces a wide range of documentation,
available online from this link: https://www.FreeBSD.org/docs.html.
Is the documentation available in other formats, such
as plain text (ASCII), or PDF?Yes. The documentation is available in a number of
different formats and compression schemes on the &os; FTP
site, in the /ftp/doc/
directory.The documentation is categorized in a number of
different ways. These include:The document's name, such as
faq, or
handbook.The document's language and encoding. These are
based on the locale names found under
/usr/share/locale on a &os;
system. The current languages and encodings
are as follows:NameMeaningen_US.ISO8859-1English (United States)bn_BD.ISO10646-1Bengali or Bangla (Bangladesh)da_DK.ISO8859-1Danish (Denmark)de_DE.ISO8859-1German (Germany)el_GR.ISO8859-7Greek (Greece)es_ES.ISO8859-1Spanish (Spain)fr_FR.ISO8859-1French (France)hu_HU.ISO8859-2Hungarian (Hungary)it_IT.ISO8859-15Italian (Italy)ja_JP.eucJPJapanese (Japan, EUC encoding)ko_KR.UTF-8Korean (Korea, UTF-8 encoding)mn_MN.UTF-8Mongolian (Mongolia, UTF-8
encoding)nl_NL.ISO8859-1Dutch (Netherlands)pl_PL.ISO8859-2Polish (Poland)pt_BR.ISO8859-1Portuguese (Brazil)ru_RU.KOI8-RRussian (Russia, KOI8-R encoding)tr_TR.ISO8859-9Turkish (Turkey)zh_CN.UTF-8Simplified Chinese (China, UTF-8
encoding)zh_TW.UTF-8Traditional Chinese (Taiwan, UTF-8
encoding)Some documents may not be available in all
languages.The document's format. We produce the
documentation in a number of different output formats.
Each format has its own advantages and disadvantages.
Some formats are better suited for online reading,
while others are meant to be aesthetically pleasing
when printed on paper. Having the documentation
available in any of these formats ensures that our
readers will be able to read the parts they are
interested in, either on their monitor, or on paper
after printing the documents. The currently available
formats are:FormatMeaninghtml-splitA collection of small, linked, HTML
files.htmlOne large HTML file containing the entire
documentpdfAdobe's Portable Document FormattxtPlain textThe compression and packaging scheme.Where the format is
html-split, the files are
bundled up using &man.tar.1;. The resulting
.tar is then compressed
using the compression schemes detailed in the next
point.All the other formats generate one file. For
example,
article.pdf,
book.html, and so on.These files are then compressed using either
the zip or
bz2 compression schemes.
&man.tar.1; can be used to uncompress these
files.So the PDF version of the Handbook,
compressed using bzip2, will be
stored in a file called
book.pdf.bz2 in the
handbook/ directory.After choosing the format and compression mechanism,
download the
compressed files, uncompress them, and then copy
the appropriate documents into place.For example, the split HTML version of the
FAQ, compressed using &man.bzip2.1;,
can be found in
doc/en_US.ISO8859-1/books/faq/book.html-split.tar.bz2.
To download and uncompress that file, type:&prompt.root; fetch https://download.freebsd.org/ftp/doc/en_US.ISO8859-1/books/faq/book.html-split.tar.bz2
&prompt.root; tar xvf book.html-split.tar.bz2If the file is compressed,
tar will automatically
detect the appropriate format and decompress it correctly,
resulting in a collection of
.html files. The main one is called
index.html, which will contain the
table of contents, introductory material, and links to the
other parts of the document.Where do I find info on the &os; mailing lists? What
&os; news groups are available?Refer to the Handbook
entry on mailing-lists and the Handbook
entry on newsgroups.Are there &os; IRC (Internet Relay Chat)
channels?Yes, most major IRC networks host a &os; chat
channel:Channel #FreeBSDhelp on EFNet
is a channel dedicated to helping &os; users.Channel #FreeBSD on Freenode is
a general help channel with many users at any time.
The conversations have been known to run off-topic for
a while, but priority is given to users with &os;
questions. Other users can help with
the basics, referring to the Handbook whenever
possible and providing links for learning more about
a particular topic. This is primarily an English
speaking channel, though it does have users from all
over the world. Non-native English speakers should
try to ask the question in English first and then
relocate to ##freebsd-lang as
appropriate.Channel #FreeBSD on DALNET is
available at irc.dal.net in
the US and irc.eu.dal.net in
Europe.Channel #FreeBSD on UNDERNET
is available at
us.undernet.org in the US and
eu.undernet.org in Europe.
Since it is a help channel, be prepared to read the
documents you are referred to.Channel #FreeBSD on RUSNET
is a Russian language channel dedicated to
helping &os; users. This is also a good place for
non-technical discussions.Channel #bsdchat on Freenode is
a Traditional Chinese (UTF-8 encoding) language
channel dedicated to helping &os; users.
This is also a good place for non-technical
discussions.The &os; wiki has a good
list of IRC channels.Each of these channels is distinct and not
connected to the others. Since their chat styles differ,
try each to find one suited to your
chat style.Are there any web based forums to discuss &os;?The official &os; forums are located at https://forums.FreeBSD.org/.Where can I get commercial &os; training and
support?iXsystems,
Inc., parent company of the &os;
Mall, provides commercial &os; and TrueOS
software support,
in addition to &os; development and tuning
solutions.BSD Certification Group, Inc. provides system
administration certifications for DragonFly BSD,
&os;, NetBSD, and OpenBSD. Refer to their
site for more information.Any other organizations providing training and support
should contact the Project to be listed here.InstallationNikClaytonnik@FreeBSD.orgWhich platform should I download? I have a 64
bit capable &intel; CPU,
but I only see amd64.&arch.amd64; is the term &os; uses for 64-bit
compatible x86 architectures (also known as "x86-64" or
"x64"). Most modern computers should use &arch.amd64;.
Older hardware should use &arch.i386;. When installing
on a non-x86-compatible architecture, select the
platform which best matches the hardware.Which file do I download to get &os;?On the Getting
&os; page, select [iso] next
to the architecture that matches the hardware.Any of the following can be used:filedescriptiondisc1.isoContains enough to install &os; and
a minimal set of packages.dvd1.isoSimilar to disc1.iso
but with additional packages.memstick.imgA bootable image sufficient for writing to a
USB stick.bootonly.isoA minimal image that requires network access
during installation to completely install
&os;.Full instructions on this procedure and a little bit
more about installation issues in general can be found in
the Handbook
entry on installing &os;.What do I do if the install image does not
boot?This can be caused by not downloading the image in
binary mode when using
FTP.Some FTP clients default their transfer mode to
ascii and attempt to change any
end-of-line characters received to match the
conventions used by the client's system. This will
almost invariably corrupt the boot image. Check the
SHA-256 checksum of the downloaded boot image: if it
does not exactly match the one on the
server, then the download process is suspect.When using a command line FTP client, type
binary at the FTP command prompt
after getting connected to the server and before
starting the download of the image.Where are the instructions for installing &os;?Installation instructions
can be found at Handbook
entry on installing &os;.How can I make my own custom release or install
disk?Customized &os; installation media can be created by
building a custom release. Follow the instructions in the
Release
Engineering article.Can &windows; co-exist with &os;? (x86-specific)If &windows; is installed first, then yes. &os;'s
boot manager can then boot both &windows; and &os;.
If &windows; is installed afterwards, it will
overwrite the boot manager. If that
happens, see the next section.Another operating system destroyed my Boot Manager.
How do I get it back? (x86-specific)This depends upon the boot manager.
The &os; boot selection menu can be reinstalled using
&man.boot0cfg.8;. For example, to restore the boot menu
onto the disk ada0:&prompt.root; boot0cfg -B ada0The non-interactive MBR bootloader can be installed
using &man.gpart.8;:&prompt.root; gpart bootcode -b /boot/mbr ada0For more complex situations, including GPT disks, see
&man.gpart.8;.Do I need to install the source?In general, no. There is nothing in the base system
which requires the presence of the source to operate.
Some ports, like sysutils/lsof, will
not build unless the source is installed. In particular,
if the port builds a kernel module or directly operates on
kernel structures, the source must be installed.Do I need to build a kernel?Usually not. The supplied GENERIC
kernel contains the drivers an ordinary computer will
need. &man.freebsd-update.8;, the &os; binary upgrade
tool, cannot upgrade custom kernels, another reason to
stick with the GENERIC kernel when
possible. For computers with very limited RAM, such as
embedded systems, it may be worthwhile to build a smaller
custom kernel containing just the required drivers.Should I use DES, Blowfish, or MD5 passwords and how
do I specify which form my users receive?&os; uses
SHA512 by
default. DES
passwords are still available for backwards compatibility
with operating systems that
use the less secure password format. &os; also supports
the Blowfish and MD5 password formats. Which
password format to use for new passwords is controlled by
the passwd_format login capability in
/etc/login.conf, which takes values
of des, blf (if
these are available), or md5. See the
&man.login.conf.5; manual page for more information about
login capabilities.What are the limits for FFS file systems?For FFS file systems, the largest file system is
practically limited by the amount of memory required to
&man.fsck.8; the file system. &man.fsck.8; requires one
bit per fragment, which with the default fragment size of
4 KB equates to 32 MB of memory per TB of disk.
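The arithmetic behind these figures can be sketched in a few lines. This is only a rough check, assuming 4 KB fragments, 32 KB blocks, and 8-byte block pointers (so each 32 KB block holds 4096 pointers), matching the figures quoted in this answer:

```python
# Rough check of the FFS limits quoted in this answer.
# Assumptions: 4 KB fragments, 32 KB blocks, 8-byte block
# pointers (so a 32 KB block holds 4096 pointers).

KB = 1024
TB = 1024 ** 4

# fsck needs one bit per fragment: with 4 KB fragments,
# that is 32 MB of fsck memory per TB of disk.
frag = 4 * KB
fsck_bytes_per_tb = (TB // frag) // 8
print(fsck_bytes_per_tb // KB ** 2)        # 32 (MB per TB)

# A 2 GB userland limit (e.g., i386) therefore caps fsck at
# about 64 TB, which the text rounds down to ~60 TB.
print((2 * KB ** 3) // fsck_bytes_per_tb)  # 64 (TB)

# Maximum single-file size with 32 KB blocks and triple
# indirect blocks: 12 direct + 4096 + 4096^2 + 4096^3 blocks.
block = 32 * KB
ptrs = block // 8
max_file = block * (12 + ptrs + ptrs ** 2 + ptrs ** 3)
print(round(max_file / KB ** 5, 2))        # 2.0 (approximately 2 PB)
```

The last figure also agrees with the note below that doubling the block size to 64 KB multiplies the maximum file size by 16: the dominant term block * (block/8)^3 grows by a factor of 2 * 2^3.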
This does mean that on architectures which limit userland
processes to 2 GB (e.g., &i386;), the maximum
&man.fsck.8;'able filesystem is ~60 TB.If there were no &man.fsck.8; memory limit, the
maximum filesystem size would be 2 ^ 64 (blocks)
* 32 KB => 16 Exa * 32 KB => 512
ZettaBytes.The maximum size of a single FFS file is approximately
2 PB with the default block size of 32 KB. Each
32 KB block can point to 4096 blocks. With triple
indirect blocks, the calculation is 32 KB * 12 +
32 KB * 4096 + 32 KB * 4096^2 + 32 KB *
4096^3. Increasing the block size to 64 KB will
increase the max file size by a factor of 16.Why do I get an error message, readin
failed after compiling and booting a new
kernel?The world and kernel are out of sync. This
is not supported. Be sure to use make
buildworld and make
buildkernel to update the kernel.Boot the system by specifying the kernel directly at
the second stage, pressing any key when the
| shows up before loader is
started.Is there a tool to perform post-installation
configuration tasks?Yes. bsdconfig provides a
nice interface to configure &os; post-installation.Hardware CompatibilityGeneralI want to get a piece of hardware for my &os;
system. Which model/brand/type is best?This is discussed continually on the &os; mailing
lists, which is to be expected since hardware changes so
quickly. Read through the Hardware Notes
for &os; &rel121.current;
or &rel1.current;
and search the mailing list archives
before asking about the latest and greatest hardware.
Chances are a discussion about that type of hardware
took place just last week.Before purchasing a laptop, check the archives for
&a.questions;, or possibly a specific
mailing list for a particular hardware type.What are the limits for memory?&os; as an operating system generally supports
as much physical memory (RAM) as the platform it is
running on does. Keep in mind that different platforms
have different limits for memory; for example &i386;
without PAE supports at most
4 GB of memory (and usually less than that because
of PCI address space) and &i386; with PAE supports at
most 64 GB memory. As of &os; 10, AMD64
platforms support up to 4 TB of physical
memory.Why does &os; report less than 4 GB memory when
installed on an &i386; machine?The total address space on &i386; machines is
32-bit, meaning that at most 4 GB of memory is
addressable (can be accessed). Furthermore, some
addresses in this range are reserved by hardware for
different purposes, for example for using and
controlling PCI devices, for accessing video memory, and
so on. Therefore, the total amount of memory usable by
the operating system for its kernel and applications is
limited to significantly less than 4 GB. Usually,
3.2 GB to 3.7 GB is the maximum usable
physical memory in this configuration.To access more than 3.2 GB to 3.7 GB of
installed memory (meaning up to 4 GB but also more
than 4 GB), a special tweak called
PAE must be used. PAE stands for
Physical Address Extension and is a way for 32-bit x86
CPUs to address more than 4 GB of memory. It
remaps the memory that would otherwise be overlaid by
address reservations for hardware devices above the
4 GB range and uses it as additional physical
memory (see &man.pae.4;). Using PAE has some drawbacks;
this mode of memory access is a little bit slower than
the normal (without PAE) mode and loadable modules (see
&man.kld.4;) are not supported. This means all drivers
must be compiled into the kernel.The most common way to enable PAE is to build a new
kernel with the ready-provided kernel
configuration file called PAE,
which is already configured to build a safe kernel.
Note that some entries in this kernel configuration file
are too conservative and some drivers marked as unready
to be used with PAE are actually usable. A rule of
thumb is that if the driver is usable on 64-bit
architectures (like AMD64), it is also usable with PAE.
When creating a custom kernel configuration
file, PAE can be enabled by adding the following
line:options PAEPAE is not much used nowadays because most new x86
hardware also supports running in 64-bit mode, known as
AMD64 or &intel; 64. It has a much larger address
space and does not need such tweaks. &os; supports
AMD64 and it is recommended that this version of &os; be
used instead of the &i386; version if 4 GB or more
memory is required.Architectures and ProcessorsDoes &os; support architectures other than the
x86?Yes. &os; divides support into multiple tiers.
Tier 1 architectures, such as i386 or amd64, are fully
supported. Tiers 2 and 3 are supported on a
best-effort basis. A full explanation of the tier
system is available in the Committer's
Guide.A complete list of supported architectures can be
found on the platforms
page.Does &os; support Symmetric Multiprocessing
(SMP)?&os; supports symmetric multi-processor (SMP) on all
non-embedded platforms (e.g, &arch.i386;, &arch.amd64;,
etc.). SMP is also supported in arm and MIPS kernels,
although some CPUs may not support this. &os;'s SMP
implementation uses fine-grained locking, and
performance scales nearly linearly with the number of
CPUs.&man.smp.4; has more details.What is microcode?
How do I install &intel; CPU microcode updates?Microcode is a method of programmatically
implementing hardware-level instructions. This allows
for CPU bugs to be fixed without replacing the onboard
chip.Install sysutils/devcpu-data,
then add:microcode_update_enable="YES"to /etc/rc.confPeripheralsWhat kind of peripherals does &os; support?See the complete list in the Hardware Notes for &os;
&rel121.current;
or &rel1.current;.Keyboards and MiceIs it possible to use a mouse outside the
X Window system?The default console driver,
&man.vt.4;, provides the ability to use a mouse
pointer in text consoles to cut & paste text. Run
the mouse daemon, &man.moused.8;, and turn on the mouse
pointer in the virtual console:&prompt.root; moused -p /dev/xxxx -t yyyy
&prompt.root; vidcontrol -m onWhere xxxx is the mouse
device name and yyyy is a
protocol type for the mouse. The mouse daemon can
automatically determine the protocol type of most mice,
except old serial mice. Specify the
auto protocol to invoke automatic
detection. If automatic detection does not work, see
the &man.moused.8; manual page for a list of supported
protocol types.For a PS/2 mouse, add
moused_enable="YES" to
/etc/rc.conf to start the mouse
daemon at boot time. Additionally, to
use the mouse daemon on all virtual terminals instead of
just the console, add allscreens_flags="-m
on" to
/etc/rc.conf.When the mouse daemon is running, access to the
mouse must be coordinated between the mouse daemon and
other programs such as X Windows. Refer to the
FAQ
Why does my mouse not work
with X? for more details on this issue.How do I cut and paste text with a mouse in the text
console?It is not possible to remove data using the mouse.
However, it is possible to copy and paste. Once the
mouse daemon is running as described in the previous question, hold down
button 1 (left button) and move the mouse to select a
region of text. Then, press button 2 (middle button) to
paste it at the text cursor. Pressing button 3 (right
button) will extend the selected region
of text.If the mouse does not have a middle button, it is
possible to emulate one or remap buttons using mouse
daemon options. See the &man.moused.8; manual page for
details.My mouse has a fancy wheel and buttons. Can I use
them in &os;?The answer is, unfortunately, It
depends. Mice with additional features
require a specialized driver in most cases. Unless the
mouse device driver or the user program has specific
support for the mouse, it will act just like a standard
two- or three-button mouse.For the possible usage of wheels in the X Window
environment, refer to that section.How do I use my delete key in sh
and csh?For the Bourne Shell, add
the following line to ~/.shrc.
See &man.sh.1; and &man.editrc.5;.bind ^[[3~ ed-delete-next-char # for xtermFor the C Shell, add the
following line to ~/.cshrc.
See &man.csh.1;.bindkey ^[[3~ delete-char # for xtermOther HardwareWorkarounds for no sound from my &man.pcm.4; sound
card?Some sound cards set their output volume to 0 at
every boot. Run the following command every time the
machine boots:&prompt.root; mixer pcm 100 vol 100 cd 100Does &os; support power management on my
laptop?&os; supports the ACPI features
found in modern hardware. Further information can be
found in &man.acpi.4;.TroubleshootingWhy is &os; finding the wrong amount of memory on
&i386; hardware?The most likely reason is the difference between
physical memory addresses and virtual addresses.The convention for most PC hardware is to use the
memory area between 3.5 GB and 4 GB for a
special purpose (usually for PCI). This address space is
used to access PCI hardware. As a result, real physical
memory cannot be accessed through that address space.What happens to the memory that should appear in that
location is hardware dependent. Unfortunately,
some hardware does nothing and the ability to use that
last 500 MB of RAM is entirely lost.Luckily, most hardware remaps the memory to a higher
location so that it can still be used. However, this can
cause some confusion when watching the boot
messages.On a 32-bit version of &os;, the memory appears lost,
since it will be remapped above 4 GB, which a 32-bit
kernel is unable to access. In this case, the solution is
to build a PAE enabled kernel. See the entry on memory
limits for more information.On a 64-bit version of &os;, or when running a
PAE-enabled kernel, &os; will correctly detect and remap
the memory so it is usable. During boot, however, it may
seem as if &os; is detecting more memory than the system
really has, due to the described remapping. This is
normal and the available memory will be corrected as the
boot process completes.Why do my programs occasionally die with
Signal 11 errors?Signal 11 errors are caused when a process has
attempted to access memory which the operating system has
not granted it access to. If something like this is
happening at seemingly random intervals,
start investigating the cause.These problems can usually be attributed to
either:If the problem is occurring only in a specific
custom application, it is
probably a bug in the code.If it is a problem with part of the base &os;
system, it may also be buggy code, but more often than
not these problems are found and fixed long before we
general FAQ readers get to use
these bits of code (that is what -CURRENT is
for).It is probably
not a &os; bug if the
problem occurs compiling a program, but the activity
that the compiler is carrying out changes each
time.For example, if make
buildworld fails while trying
to compile ls.c into
ls.o and, when run again, it fails
in the same place, this is a broken build. Try
updating source and try again. If the compile fails
elsewhere, it is almost certainly due to hardware.In the first case, use a debugger such as
&man.gdb.1; to find the point in the program which is
attempting to access a bogus address and fix
it.In the second case, verify which piece of
hardware is at fault.Common causes of this include:The hard disks might be overheating: Check that
the fans are still working, as the disk and
other hardware might be overheating.The processor running is overheating: This might
be because the processor has been overclocked, or the
fan on the processor might have died. In either case,
ensure that the hardware is running at
what it is specified to run at, at least while trying
to solve this problem. If it is not, clock it back
to the default settings.)Regarding overclocking, it is far
cheaper to have a slow system than a fried system that
needs replacing! Also the community is not
sympathetic to problems on overclocked systems.Dodgy memory: if multiple memory
SIMMS/DIMMS are installed, pull them all out and try
running the machine with each SIMM or DIMM
individually to narrow the problem down to either the
problematic DIMM/SIMM or perhaps even a
combination.Over-optimistic motherboard settings: the BIOS
settings, and some motherboard jumpers, provide
options to set various timings. The defaults
are often sufficient, but sometimes setting the wait
states on RAM too low, or setting the RAM
Speed: Turbo option
will cause strange behavior. One remedy is to
set to BIOS defaults, after noting
the current settings first.Unclean or insufficient power to the motherboard.
Remove any unused I/O boards, hard disks, or
CD-ROMs,
or disconnect the power cable from them, to see if
the power supply can manage a smaller load. Or try
another power supply, preferably one with a little
more power. For instance, if the current power supply
is rated at 250 Watts, try one rated at
300 Watts.Read the section on
Signal 11 for a further
explanation and a discussion on how memory testing
software or hardware can still pass faulty memory. There
is an extensive FAQ on this at the SIG11
problem FAQ.Finally, if none of this has helped, it is possibly
a bug in &os;.
Follow these instructions
to send a problem report.My system crashes with either Fatal trap
12: page fault in kernel mode, or
panic:, and spits out a bunch of
information. What should I do?The &os; developers are interested in these
errors, but need more information than just the error
message. Copy the full crash message. Then consult the
FAQ section on kernel
panics, build a debugging kernel, and get a
backtrace. This might sound difficult, but does not
require any programming skills. Just follow the
instructions.What is the meaning of the error maxproc
limit exceeded by uid %i, please see tuning(7) and
login.conf(5)?The &os; kernel will only allow a certain number of
processes to exist at one time. The number is based on
the kern.maxusers &man.sysctl.8;
variable. kern.maxusers also affects
various other in-kernel limits, such as network buffers.
If the machine is heavily loaded,
increase kern.maxusers. This will
increase these other system limits in addition to the
maximum number of processes.To adjust the kern.maxusers value,
see the File/Process
Limits section of the Handbook. While that
section refers to open files, the same limits apply to
processes.If the machine is lightly loaded but running a very
large number of processes, adjust the
kern.maxproc tunable by defining it in
/boot/loader.conf. The tunable will
not get adjusted until the system is rebooted. For more
information about tuning tunables, see
&man.loader.conf.5;. If these processes are being run by
a single user, adjust
kern.maxprocperuid to be one less than
the new kern.maxproc value. It must
be at least one less because one system program,
&man.init.8;, must always be running.Why do full screen applications on remote machines
misbehave?The remote machine may be setting the terminal type to
something other than xterm, which is
required by the &os; console. Alternatively the kernel
may have the wrong values for the width and height of the
terminal.Check that the value of the TERM
environment variable is xterm. If the
remote machine does not support that, try
vt100.Run stty -a to check what the
kernel thinks the terminal dimensions are. If they are
incorrect, they can be changed by running
stty rows RR cols
CC.Alternatively, if the client machine has
x11/xterm installed, then running
resize will query the terminal for the
correct dimensions and set them.Why does it take so long to connect to my computer via
ssh or
telnet?The symptom: there is a long delay between the time
the TCP connection is established and the time when the
client software asks for a password (or, in
&man.telnet.1;'s case, when a login prompt
appears).The problem: more likely than not, the delay is caused
by the server software trying to resolve the client's IP
address into a hostname. Many servers, including the
Telnet and
SSH servers that come with
&os;, do this to store the hostname in a log file for
future reference by the administrator.The remedy: if the problem occurs whenever connecting
the client computer to any server, the problem
is with the client. If the problem only occurs
when someone connects to the server computer, the
problem is with the server.If the problem is with the client, the only remedy is
to fix the DNS so the server can resolve it. If this is
on a local network, consider it a server problem and keep
reading. If this is on the Internet,
contact your ISP.If the problem is with the server on a
local network, configure the server
to resolve address-to-hostname queries for the local
address range. See &man.hosts.5; and &man.named.8;
for more information. If this is on the
Internet, the problem may be that the local server's
resolver is not functioning correctly. To check, try to
look up another host such as
www.yahoo.com. If it does not
work, that is the problem.Following a fresh install of &os;, it is also possible
that domain and name server information is missing from
/etc/resolv.conf. This will often
cause a delay in SSH, as the
option UseDNS is set to
yes by default in
/etc/ssh/sshd_config. If this is
causing the problem, either fill in the
missing information in
/etc/resolv.conf or set
UseDNS to no in
sshd_config as a temporary
workaround.Why does file: table is full
show up repeatedly in &man.dmesg.8;?This error message indicates that the number of
available file descriptors has been exhausted on the
system. Refer to the kern.maxfiles
section of the Tuning
Kernel Limits section of the Handbook for a
discussion and solution.Why does the clock on my computer keep incorrect
time?The computer has two or more clocks, and &os; has
chosen to use the wrong one.Run &man.dmesg.8;, and check for lines that contain
Timecounter. The one with the highest
quality value is the one that &os; chose.&prompt.root; dmesg | grep Timecounter
Timecounter "i8254" frequency 1193182 Hz quality 0
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
Timecounter "TSC" frequency 2998570050 Hz quality 800
Timecounters tick every 1.000 msecConfirm this by checking the
kern.timecounter.hardware
&man.sysctl.3;.&prompt.root; sysctl kern.timecounter.hardware
kern.timecounter.hardware: ACPI-fastIt may be a broken ACPI timer. The simplest solution
is to disable the ACPI timer in
/boot/loader.conf:debug.acpi.disabled="timer"Or the BIOS may modify the TSC clock—perhaps to
change the speed of the processor when running from
batteries, or going into a power saving mode, but &os; is
unaware of these adjustments, and appears to gain or lose
time.In this example, the i8254 clock is
also available, and can be selected by writing its name to
the kern.timecounter.hardware
&man.sysctl.3;.&prompt.root; sysctl kern.timecounter.hardware=i8254
kern.timecounter.hardware: TSC -> i8254The computer should now start keeping more accurate
time.To have this change automatically run at boot time,
add the following line to
/etc/sysctl.conf:kern.timecounter.hardware=i8254What does the error swap_pager: indefinite
wait buffer: mean?This means that a process is trying to page memory
from
disk, and the page attempt has hung trying to access the
disk for more than 20 seconds. It might be caused by bad
blocks on the disk drive, disk wiring, cables, or any
other disk I/O-related hardware. If the drive itself is
bad, disk errors will appear in
/var/log/messages and in the output
of dmesg. Otherwise, check the cables
and connections.What is a lock order
reversal?The &os; kernel uses a number of resource locks to
arbitrate contention for certain resources. When multiple
kernel threads try to obtain multiple resource locks,
there's always the potential for a deadlock, where two
threads have each obtained one of the locks and block
forever waiting for the other thread to release one of the
other locks. This sort of locking problem can be avoided
if all threads obtain the locks in the same order.A run-time lock diagnostic system called
&man.witness.4;, enabled in &os.current; and disabled by
default for stable branches and releases, detects the
potential for deadlocks due to locking errors, including
errors caused by obtaining multiple resource locks with a
different order from different parts of the kernel. The
&man.witness.4; framework tries to detect this problem as
it happens, and reports it by printing a message to the
system console about a lock order
reversal (often referred to also as
LOR).It is possible to get false positives, as
&man.witness.4; is conservative. A true positive report
does not mean that a system is
deadlocked; instead it should be understood as a warning
that a deadlock could have happened here.Problematic LORs tend to get
fixed quickly, so check the &a.current; before posting
to it.What does Called ... with the following
non-sleepable locks held mean?This means that a function that may sleep was called
while a mutex (or other unsleepable) lock was held.The reason this is an error is because mutexes are not
intended to be held for long periods of time; they are
supposed to only be held to maintain short periods of
synchronization. This programming contract allows device
drivers to use mutexes to synchronize with the rest of the
kernel during interrupts. Interrupts (under &os;) may not
sleep. Hence it is imperative that no subsystem in the
kernel block for an extended period while holding a
mutex.To catch such errors, assertions may be added to the
kernel that interact with the &man.witness.4; subsystem to
emit a warning or fatal error (depending on the system
configuration) when a potentially blocking call is made
while holding a mutex.In summary, such warnings are non-fatal, however with
unfortunate timing they could cause undesirable effects
ranging from a minor blip in the system's responsiveness
to a complete system lockup.For additional information about locking in &os; see
&man.locking.9;.Why does
buildworld/installworld
die with the message touch: not
found?This error does not mean that the &man.touch.1;
utility is missing. The error is instead probably due to
the dates of the files being set sometime in the future.
If the CMOS clock is set to local time, run
adjkerntz -i to adjust
the kernel clock when booting into single-user
mode.User ApplicationsWhere are all the user applications?Refer to the ports
page for info on software packages ported to
&os;.Most ports should work on all supported versions of
&os;. Those that do not are specifically marked as such.
Each time a &os; release is made, a snapshot of the ports
tree at the time of release is also included in the
ports/ directory.&os; supports compressed binary packages to easily
install and uninstall ports. Use &man.pkg.7; to control
the installation of packages.How do I download the Ports tree? Should I be using
Subversion?Any of the methods listed here work:Use portsnap for most use cases. Refer to Using
the Ports Collection for instructions on how to
use this tool.Use Subversion if custom patches to the
ports tree are needed or if running &os.current;.
Refer to Using
Subversion for details.Why can I not build this port on my
&rel2.relx; -, or
&rel.relx; -STABLE machine?If the installed &os; version lags significantly
behind -CURRENT or
-STABLE, update the Ports Collection
using the instructions in Using
the Ports Collection. If the system is
up-to-date, someone might have committed a change to the
port which works for -CURRENT but
which broke the port for -STABLE.
Submit
a bug report, since the Ports Collection is supposed to
work for both the -CURRENT and
-STABLE branches.I just tried to build INDEX using
make index, and it failed. Why?First, make sure that the Ports Collection is
up-to-date. Errors that affect building
INDEX from an up-to-date copy of the
Ports Collection are high-visibility and are thus almost
always fixed immediately.There are rare cases where INDEX
will not build due to odd cases involving
OPTIONS_SET
being set in make.conf. If
you suspect that this is the case, try to make
INDEX with those variables
turned off before reporting it to &a.ports;.I updated the sources, now how do I update my
installed ports?&os; does not include a port upgrading tool, but it
does have some tools to make the upgrade process somewhat
easier. Additional tools are available to simplify
port handling and are described in the Upgrading
Ports section in the &os; Handbook.Do I need to recompile every port each time I perform
a major version update?Yes! While a recent system will run with
software compiled under an older release,
things will randomly crash and fail to work once
other ports are installed or updated.When the system is upgraded, various shared libraries,
loadable modules, and other parts of the system will be
replaced with newer versions. Applications linked against
the older versions may fail to start or, in other cases,
fail to function properly.For more information, see the
section on upgrades in the &os; Handbook.Do I need to recompile every port each time I perform
a minor version update?In general, no. &os; developers do their utmost to
guarantee binary compatibility across all releases with
the same major version number. Any exceptions will be
documented in the Release Notes, and advice given there
should be followed.Why is /bin/sh so minimal? Why
does &os; not use bash or another
shell?Many people need to write shell scripts which will be
portable across many systems. That is why &posix;
specifies the shell and utility commands in great detail.
Most scripts are written in Bourne shell (&man.sh.1;), and
because several important programming interfaces
(&man.make.1;, &man.system.3;, &man.popen.3;, and
analogues in higher-level scripting languages like Perl
and Tcl) are specified to use the Bourne shell to
interpret commands. As the Bourne shell is so often
and widely used, it is important for it to be quick to
start, be deterministic in its behavior, and have a small
memory footprint.The existing implementation is our best effort at
meeting as many of these requirements simultaneously as we
can. To keep /bin/sh small, we have
not provided many of the convenience features that other
shells have. That is why other more featureful shells
like bash, scsh,
&man.tcsh.1;, and zsh are available.
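For a rough, concrete data point, &man.ps.1; can report the current shell's memory use directly (a sketch; repeat under whichever shells are installed to compare them):

```shell
# Print the resident set size (RSS, in kilobytes) of the shell
# running this snippet; run the same line under bash, zsh, etc.
# to compare footprints.
rss=$(ps -o rss= -p $$ | tr -d ' ')
echo "RSS of this shell: ${rss} KB"
```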
Compare the memory utilization of
these shells by looking at the VSZ and
RSS columns in a ps -u
listing.Kernel ConfigurationI would like to customize my kernel. Is it
difficult?Not at all! Check out the kernel
config section of the Handbook.The new kernel will be
installed to the /boot/kernel
directory along with its modules, while the old kernel
and its modules will be moved to the
/boot/kernel.old directory. If
a mistake is made in the
configuration, simply boot the previous version of the
kernel.Why is my kernel so big?GENERIC kernels shipped with &os;
are compiled in debug mode.
Kernels built in debug mode contain debug data in
separate files that are used for debugging.
&os; releases prior to 11.0 store these debug files in
the same directory as the kernel itself,
/boot/kernel/.
In &os; 11.0 and later the debug files are stored in
/usr/lib/debug/boot/kernel/.
Note that there will be little or no performance loss from
running a debug kernel, and it is useful to keep one
around in case of a system panic.When running low on disk space, there
are different options to reduce the size of
/boot/kernel/ and
/usr/lib/debug/.To not install the symbol files,
make sure the following line exists in
/etc/src.conf:WITHOUT_KERNEL_SYMBOLS=yesFor more information see &man.src.conf.5;.If you want to avoid building debug files altogether,
make sure that both of the following are true:This line does not exist in the kernel
configuration file:makeoptions DEBUG=-gDo not run &man.config.8; with
.Either of the above settings will cause the kernel to
be built in debug mode.To build and install only the specified modules, list
them in
/etc/make.conf:MODULES_OVERRIDE= accf_http ipfwReplace accf_http ipfw with a
list of needed modules. Only the listed modules will be
built. This reduces the size of the kernel
directory and decreases the amount of time needed to
build the kernel. For more information, read
/usr/share/examples/etc/make.conf.Unneeded devices can be removed from the kernel
to further reduce the size. See for more information.To put any of these options into effect, follow the
instructions to build
and install the new kernel.For reference, the &os; 11 &arch.amd64; kernel
(/boot/kernel/kernel) is
approximately 25 MB.Why does every kernel I try to build fail to compile,
even GENERIC?There are a number of possible causes for this
problem:The source
tree is different from the one used to build the
currently running system. When attempting an upgrade,
read /usr/src/UPDATING, paying
particular attention to the COMMON
ITEMS section at the end.The make buildkernel did not
complete successfully. The make
buildkernel target relies on files
generated by the make buildworld
target to complete its job correctly.Even when building &os;-STABLE, it is possible
that the source tree was fetched at a time when it was
either being modified or broken.
Only releases are guaranteed to be
buildable, although &os;-STABLE builds fine the
majority of the time. Try re-fetching the source tree
and see if the problem goes away. Try using a
different mirror in case the previous one is having
problems.Which scheduler is in use on a
running system?The name of the scheduler currently being used is
directly available as the value of the
kern.sched.name sysctl:&prompt.user; sysctl kern.sched.name
kern.sched.name: ULEWhat is kern.sched.quantum?kern.sched.quantum is the maximum
number of ticks a process can run without being preempted
in the 4BSD scheduler.Disks, File Systems, and Boot LoadersHow can I add my new hard disk to my &os;
system?See the Adding
Disks section in the &os; Handbook.How do I move my system over to my huge new
disk?The best way is to reinstall the operating system on
the new disk, then move the user data over. This is
highly recommended when tracking
-STABLE for more than one release or
when updating a release instead of installing a new one.
Install booteasy on both disks with &man.boot0cfg.8; and
dual boot until you are happy with the new configuration.
Skip the next paragraph to find out how to move the data
after doing this.Alternatively, partition and label the new disk with
either &man.sade.8; or &man.gpart.8;. If the disks are
MBR-formatted, booteasy can be installed on both disks
with &man.boot0cfg.8; so that the computer can dual boot
to the old or new system after the copying is done.Once the new disk is set up,
the data cannot just be copied. Instead, use tools that
understand device files and system flags, such as
&man.dump.8;. Although it is recommended
to move the data while in single-user mode, it
is not required.When the disks are formatted with
UFS, never use anything but
&man.dump.8; and &man.restore.8; to move the root file
system. These commands should also be used when moving a
single partition to another empty partition. The sequence
of steps to use dump to move the data
from one UFS partition to a new
partition is:newfs the new partition.mount it on a temporary mount
point.cd to that directory.dump the old partition, piping
output to the new one.For example, to move
/dev/ada1s1a with
/mnt as the temporary mount point,
type:&prompt.root; newfs /dev/ada1s1a
&prompt.root; mount /dev/ada1s1a /mnt
&prompt.root; cd /mnt
&prompt.root; dump 0af - / | restore rf -Rearranging partitions with
dump takes a bit more work. To merge a
partition like /var into its parent,
create the new partition large enough for both, move the
parent partition as described above, then move the child
partition into the empty directory that the first move
created:&prompt.root; newfs /dev/ada1s1a
&prompt.root; mount /dev/ada1s1a /mnt
&prompt.root; cd /mnt
&prompt.root; dump 0af - / | restore rf -
&prompt.root; cd var
&prompt.root; dump 0af - /var | restore rf -To split a directory from its parent, say putting
/var on its own partition when it was
not before, create both partitions, then mount the child
partition on the appropriate directory in the temporary
mount point, then move the old single partition:&prompt.root; newfs /dev/ada1s1a
&prompt.root; newfs /dev/ada1s1d
&prompt.root; mount /dev/ada1s1a /mnt
&prompt.root; mkdir /mnt/var
&prompt.root; mount /dev/ada1s1d /mnt/var
&prompt.root; cd /mnt
&prompt.root; dump 0af - / | restore rf -The &man.cpio.1; and &man.pax.1; utilities are also
available for moving user data. These are known to lose
file flag information, so use them with caution.Which partitions can safely use Soft Updates? I have
heard that Soft Updates on / can
cause problems. What about Journaled Soft Updates?Short answer: Soft Updates can usually be safely used
on all partitions.Long answer: Soft Updates has two characteristics
that may be undesirable on certain partitions. First, a
Soft Updates partition has a small chance of losing data
during a system crash. The partition will not be
corrupted as the data will simply be lost. Second, Soft
Updates can cause temporary space shortages.When using Soft Updates, the kernel can take up to
thirty seconds to write changes to the physical disk.
When a large file is deleted the file still resides on
disk until the kernel actually performs the deletion.
This can cause a very simple race condition. Suppose
one large file is deleted and another large file is
immediately created. The first large file is not yet
actually removed from the physical disk, so the disk might
not have enough room for the second large file. This will
produce an error that the partition does not have enough
space, even though a large chunk of space has just been
released. A few seconds later, the file creation works as
expected.If a system should crash after the kernel accepts a
chunk of data for writing to disk, but before that data is
actually written out, data could be lost. This risk is
extremely small, but generally manageable.These issues affect all partitions using Soft Updates.
So, what does this mean for the root partition?Vital information on the root partition changes very
rarely. If the system crashed during the thirty-second
window after such a change is made, it is possible that
data could be lost. This risk is negligible for most
applications, but be aware that it exists. If
the system cannot tolerate this much risk, do not use
Soft Updates on the root file system!/ is traditionally one of the
smallest partitions. If
/tmp is on
/, there may be intermittent
space problems. Symlinking /tmp to
/var/tmp will solve this
problem.Finally, &man.dump.8; does not work in live mode (-L)
on a file system with Journaled Soft Updates
(SU+J).Can I mount other foreign file systems under
&os;?&os; supports a variety of other file systems.UFSUFS CD-ROMs can be mounted directly on &os;.
Mounting disk partitions from Digital UNIX and other
systems that support UFS may be more complex,
depending on the details of the disk partitioning
for the operating system in question.ext2/ext3&os; supports ext2fs and
ext3fs partitions. See
&man.ext2fs.5; for more information.NTFSFUSE based NTFS support is available as a port
(sysutils/fusefs-ntfs). For more
information see ntfs-3g.FAT&os; includes a read-write FAT driver. For more
information, see &man.mount.msdosfs.8;.ZFS&os; includes a port of &sun;'s ZFS driver. The
current recommendation is to use it only on
&arch.amd64; platforms with sufficient memory. For
more information, see &man.zfs.8;.&os; includes the Network File System
NFS and the &os; Ports Collection
provides several FUSE applications to support many other
file systems.How do I mount a secondary DOS partition?The secondary DOS partitions are found after
all the primary partitions. For
example, if E is the
second DOS partition on the second SCSI drive, there will
be a device file for slice 5 in
/dev. To mount it:&prompt.root; mount -t msdosfs /dev/da1s5 /dos/eIs there a cryptographic file system for &os;?Yes, &man.gbde.8; and &man.geli.8;.
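As a minimal sketch of wiring &man.geli.8; into the boot process (the provider name da0p2 and the key file path are hypothetical examples; see &man.geli.8; for the supported rc.conf variables):

```shell
# /etc/rc.conf fragment: attach a geli provider at boot.
# da0p2 and the key file path are examples only.
geli_devices="da0p2"
geli_da0p2_flags="-k /root/da0p2.key"
```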
See the Encrypting
Disk Partitions section of the &os;
Handbook.How do I boot &os; and &linux; using
GRUB?To boot &os; using GRUB,
add the following to either
/boot/grub/menu.lst or
/boot/grub/grub.conf, depending upon
which is used by the &linux; distribution.title &os; 9.1
root (hd0,a)
kernel /boot/loaderWhere hd0,a points to the
root partition on the first disk. To specify
the slice number, use something like this
(hd0,2,a). By default, if the
slice number is omitted, GRUB
searches the first slice
which has the a partition.How do I boot &os; and &linux; using
BootEasy?Install LILO at the start of the &linux; boot
partition instead of in the Master Boot Record. Then boot
LILO from BootEasy.This is recommended when running &windows; and &linux;
as it makes it simpler to get &linux; booting again if
&windows; is reinstalled.How do I change the boot prompt from
??? to something more
meaningful?This cannot be accomplished with the standard boot
manager without rewriting it. There are a number of other
boot managers in the sysutils
category of the Ports Collection.How do I use a new removable drive?If the drive already has a file system on it,
use a command like this:&prompt.root; mount -t msdosfs /dev/da0s1 /mntIf the drive will only be used with &os; systems,
partition it with UFS or
ZFS. This will provide long filename
support, improvement in performance, and stability. If
the drive will be used by other operating systems, a more
portable choice, such as msdosfs, is better.&prompt.root; dd if=/dev/zero of=/dev/da0 count=2
&prompt.root; gpart create -s GPT /dev/da0
&prompt.root; gpart add -t freebsd-ufs /dev/da0Finally, create a new file system:&prompt.root; newfs /dev/da0p1and mount it:&prompt.root; mount /dev/da0p1 /mnt
/etc/fstab (see &man.fstab.5;) so you
can just type mount /mnt in the
future:/dev/da0p1 /mnt ufs rw,noauto 0 0Why do I get Incorrect super
block when mounting a CD?The type of device to mount must be specified. This
is described in the Handbook section on Using
Data CDs.Why do I get Device not
configured when mounting a CD?This generally means that there is no CD in the
drive, or the drive is not visible on the bus.
Refer to the Using
Data CDs section of the Handbook for a detailed
discussion of this issue.Why do all non-English characters in filenames show up
as ? on my CDs when mounted in &os;?The CD probably uses the Joliet
extension for storing information about files and
directories. This is discussed in the Handbook section on
Using
Data CD-ROMs.A CD burned under &os; cannot be read
under any other operating system. Why?This means a raw file was burned to the CD, rather
than creating an ISO 9660 file system. Take a look
at the Handbook section on Using
Data CDs.How can I create an image of a data CD?This is discussed in the Handbook section on Writing
Data to an ISO File System.
For more on working with CD-ROMs, see the Creating
CDs Section in the Storage chapter in the
Handbook.Why can I not mount an audio
CD?Trying to mount an audio CD will produce an error
like cd9660: /dev/cd0: Invalid
argument. This is because
mount only works on file systems.
Audio CDs do not have file systems; they just have data.
Instead, use a program that reads audio CDs, such as the
audio/xmcd package or port.How do I mount a multi-session
CD?By default, &man.mount.8; will attempt to mount the
last data track (session) of a CD. To
load an earlier session, use the
command line argument. Refer to
&man.mount.cd9660.8; for specific examples.How do I let ordinary users mount CD-ROMs, DVDs,
USB drives, and other removable media?As root set
the sysctl variable vfs.usermount to
1.&prompt.root; sysctl vfs.usermount=1To make this persist across reboots, add the line
vfs.usermount=1 to
/etc/sysctl.conf so that it is reset
at system boot time.Users can only mount devices they have read
permissions to. To allow users to mount a device
permissions must be set in
/etc/devfs.conf.For example, to allow users to mount the first USB
drive add:# Allow all users to mount a USB drive.
own /dev/da0 root:operator
perm /dev/da0 0666All users can now mount devices they could read onto a
directory that they own:&prompt.user; mkdir ~/my-mount-point
&prompt.user; mount -t msdosfs /dev/da0 ~/my-mount-pointUnmounting the device is simple:&prompt.user; umount ~/my-mount-pointEnabling vfs.usermount, however,
has negative security implications. A better way to
access &ms-dos; formatted media is to use the
emulators/mtools package in the Ports
Collection.The device name used in the previous examples must
be changed according to the configuration.The du and df
commands show different amounts of disk space available.
What is going on?This is due to how these commands actually work.
du goes through the directory tree,
measures how large each file is, and presents the totals.
df just asks the file system how much
space it has left. They seem to be the same thing, but a
file without a directory entry will affect
df but not
du.When a program is using a file, and the file is
deleted, the file is not really removed from the file
system until the program stops using it. The file is
immediately deleted from the directory listing, however.
As an example, consider a file large enough
to affect the output of
du and df. A
file being viewed with more can be
deleted without causing an error.
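This behavior can be reproduced on any Unix-like system; here tail -f stands in for more as the process holding the file open:

```shell
# Create a file, hold it open with a reader, then delete it.
# The directory entry vanishes at once, but the blocks are not
# freed until the reader exits (which is why df and du disagree).
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/big" bs=1024 count=1024 2>/dev/null
tail -f "$dir/big" >/dev/null 2>&1 &
reader=$!
rm "$dir/big"
ls "$dir/big" 2>/dev/null || echo "no directory entry"
kill "$reader"
```

Running df before and after the kill shows the space reappearing only once the reader process is gone.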
The entry is
removed from the directory so no other program or user can
access it. However, du shows that it
is gone as it has walked the directory tree and the
file is not listed. df shows that it
is still there, as the file system knows that
more is still using that space. Once
the more session ends,
du and df will
agree.This situation is common on web servers. Many people
set up a &os; web server and forget to rotate the log
files. The access log fills up /var.
The new administrator deletes the file, but the system
still complains that the partition is full. Stopping and
restarting the web server program would free the file,
allowing the system to release the disk space. To prevent
this from happening, set up &man.newsyslog.8;.Note that Soft Updates can delay the freeing of disk
space and it can take up to 30 seconds for the
change to be visible.How can I add more swap space?This section of
the Handbook describes how to do this.Why does &os; see my disk as smaller than the
manufacturer says it is?Disk manufacturers calculate gigabytes as a billion
bytes each, whereas &os; calculates them as
1,073,741,824 bytes each. This explains why, for
example, &os;'s boot messages will report a disk that
supposedly has 80 GB as holding
76,319 MB.Also note that &os; will (by default) reserve 8% of the
disk space.How is it possible for a partition to be more than
100% full?A portion of each UFS partition (8%, by default) is
reserved for use by the operating system and the
root user.
&man.df.1; does not count that space when calculating the
Capacity column, so it can exceed 100%.
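With hypothetical round numbers, the arithmetic looks like this:

```shell
# A 1000-block partition with 8% (80 blocks) reserved. root may
# write into the reserve, so Used can exceed Blocks - Reserved,
# pushing Avail negative and Capacity past 100%.
blocks=1000
reserved=80
used=950                                   # root dipped into the reserve
avail=$((blocks - reserved - used))        # 1000 - 80 - 950 = -30
capacity=$((100 * used / (used + avail)))  # 100 * 950 / 920 = 103
echo "Avail=${avail} Capacity=${capacity}%"
```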
Notice that the Blocks
column is always greater than the sum of the
Used and Avail
columns, usually by a factor of 8%.For more details, look up in
&man.tunefs.8;.ZFSWhat is the minimum amount of RAM one should have to
run ZFS?A minimum of 4 GB of RAM is required for comfortable
usage, but individual workloads can vary widely.What is the ZIL and when does it get used?The ZIL (ZFS
intent log) is a write log used to implement posix write
commitment semantics across crashes. Normally writes are
bundled up into transaction groups and written to disk
when filled (Transaction Group Commit).
However syscalls like &man.fsync.2; require a commitment
that the data is written to stable storage before
returning. The ZIL is needed for writes that have been
acknowledged as written but which are not yet on disk as
part of a transaction. The transaction groups are
timestamped. In the event of a crash the last valid
timestamp is found and missing data is merged in from the
ZIL.Do I need a SSD for ZIL?By default, ZFS stores the ZIL in the pool with all
the data. If an application has a heavy write load,
storing the ZIL in a separate device that has very fast
synchronous, sequential write performance can improve
overall system performance. For other workloads, a SSD
is unlikely to make much of an improvement.What is the L2ARC?The L2ARC is a read cache stored on
a fast device such as an SSD. This
cache is not persistent across reboots. Note that RAM is
used as the first layer of cache and the L2ARC is only
needed if there is insufficient RAM.L2ARC needs space in the ARC to index it. So,
perversely, a working set that fits perfectly in the ARC
will not fit perfectly any more if an L2ARC is used because
part of the ARC is holding the L2ARC index, pushing part
of the working set into the L2ARC which is slower than
RAM.Is enabling deduplication advisable?Generally speaking, no.Deduplication takes up a significant amount of RAM and
may slow down read and write disk access times. Unless
one is storing data that is very heavily duplicated, such
as virtual machine images or user backups, it is possible
that deduplication will do more harm than good. Another
consideration is the inability to revert deduplication
status. If data is written when deduplication is enabled,
disabling dedup will not cause those blocks which were
deduplicated to be replicated until they are next
modified.Deduplication can also lead to some unexpected
situations. In particular, deleting files may become much
slower.I cannot delete or create files on my ZFS pool. How
can I fix this?This could happen because the pool is 100% full. ZFS
requires space on the disk to write transaction metadata.
To restore the pool to a usable state, truncate the file
to delete:&prompt.user; truncate -s 0 unimportant-fileFile truncation works because a new transaction is not
started; new spare blocks are created instead.On systems with additional ZFS dataset tuning, such
as deduplication, the space may not be immediately
available.Does ZFS support TRIM for Solid State Drives?ZFS TRIM support was added to &os; 10-CURRENT
with revision r240868. ZFS TRIM
support was added to all &os;-STABLE branches in
r252162 and
r251419, respectively.ZFS TRIM is enabled by default, and can be turned off
by adding this line to
/etc/sysctl.conf:vfs.zfs.trim.enabled=0ZFS TRIM support was added to GELI as of
r286444. Please see
&man.geli.8; and the switch.System AdministrationWhere are the system start-up configuration
files?The primary configuration file is
/etc/defaults/rc.conf which is
described in &man.rc.conf.5;. System startup scripts
such as /etc/rc and
/etc/rc.d, which are described in
&man.rc.8;, include this file. Do not edit this
file! Instead, to edit an entry in
/etc/defaults/rc.conf, copy the line
into /etc/rc.conf and change it
there.For example, to start &man.named.8;, the
included DNS server:&prompt.root; echo 'named_enable="YES"' >> /etc/rc.confTo start up local services, place shell scripts in the
/usr/local/etc/rc.d directory. These
shell scripts should be set executable; the default file
mode is 555.How do I add a user easily?Use the &man.adduser.8; command, or the &man.pw.8;
command for more complicated situations.To remove the user, use the &man.rmuser.8; command or,
if necessary, &man.pw.8;.Why do I keep getting messages like root:
not found after editing
/etc/crontab?This is normally caused by editing the system crontab.
This is not the correct way to do things as the system
crontab has a different format to the per-user crontabs.
The system
crontab has an extra field, specifying which user to run
the command as. &man.cron.8; assumes this user is the
first word of the command to execute. Since no such
command exists, this error message is displayed.To delete the extra, incorrect crontab:&prompt.root; crontab -rWhy do I get the error, you are not in the
correct group to su root when I try to
su to root?This is a security feature. In order to
su to
root, or any
other account with superuser privileges, the user account
must be a member of the
wheel group.
If this feature were not there, anybody with an
account on a system who also found out root's password would be
able to gain superuser level access to the system.To allow someone to su to
root, put
them in the wheel group using
pw:&prompt.root; pw groupmod wheel -m lisaThe above example will add user lisa to the group
wheel.I made a mistake in rc.conf, or
another startup file, and now I cannot edit it because the
file system is read-only. What should I do?Restart the system using boot
-s at the loader prompt to enter single-user
mode. When prompted for a shell pathname, press
Enter and run mount -urw
/ to re-mount the root file system in
read/write mode. You may also need to run mount
-a -t ufs to mount the file system where your
favorite editor is defined. If that editor is on a
network file system, either configure the network manually
before mounting the network file systems, or use an editor
which resides on a local file system, such as
&man.ed.1;.In order to use a full screen editor such as
&man.vi.1; or &man.emacs.1;, run
export TERM=xterm
so that these editors can load the correct data from the
&man.termcap.5; database.After performing these steps, edit
/etc/rc.conf to
fix the syntax error. The error message displayed
immediately after the kernel boot messages should indicate
the number of the line in the file which is at
fault.Why am I having trouble setting up my printer?See the Handbook
entry on printing for troubleshooting
tips.How can I correct the keyboard mappings for my
system?Refer to the Handbook section on using
localization, specifically the section on console
setup.Why can I not get user quotas to work properly?It is possible that the kernel is not configured
to use quotas. In this case,
add the following line to the kernel configuration
file and recompile the kernel:options QUOTARefer to the Handbook
entry on quotas for full details.Do not turn on quotas on
/.Put the quota file on the file system that the
quotas are to be enforced on:

File System   Quota file
/usr          /usr/admin/quotas
/home         /home/admin/quotas
…             …

Does &os; support System V IPC primitives?Yes, &os; supports System V-style IPC, including
shared memory, messages and semaphores, in the
GENERIC kernel. With a custom
kernel, support may be loaded with the
sysvshm.ko,
sysvsem.ko and
sysvmsg.ko kernel modules, or
enabled in the custom kernel by adding the following lines
to the kernel configuration file:options SYSVSHM # enable shared memory
options SYSVSEM # enable for semaphores
options SYSVMSG # enable for messagingRecompile and install the kernel.What other mail-server software can I use instead of
Sendmail?The Sendmail
server is the default mail-server software for &os;, but
it can be replaced with another
MTA installed from the Ports Collection. Available ports
include mail/exim,
mail/postfix, and
mail/qmail. Search the mailing lists
for discussions regarding the advantages and disadvantages
of the available MTAs.I have forgotten the root password! What do I
do?Do not panic! Restart the system, type
boot -s at the
Boot: prompt to enter single-user mode.
At the question about the shell to use, hit
Enter which will display a
&prompt.root; prompt. Enter mount
-urw / to remount the root file system
read/write, then run mount -a to
remount all the file systems. Run passwd
root to change the root password then run
&man.exit.1; to continue booting.If you are still prompted to give the root password when
entering the single-user mode, it means that the console
has been marked as insecure in
/etc/ttys. In this case, it will
be required to boot from a &os; installation disk,
choose the Live CD or
Shell at the beginning of the
install process and issue the commands mentioned above.
Mount the specific partition in this
case and then chroot to it. For example, replace
mount -urw / with
mount /dev/ada0p1 /mnt; chroot /mnt
for a system on
ada0p1.If the root partition cannot be mounted from
single-user mode, it is possible that the partitions are
encrypted and it is impossible to mount them without the
access keys. For more information see the section
about encrypted disks in the &os; Handbook.How do I keep ControlAltDelete
from rebooting the system?When using &man.vt.4;, the default console
driver, this can be done by setting the following
&man.sysctl.8;:&prompt.root; sysctl kern.vt.kbd_reboot=0How do I reformat DOS text files to &unix;
ones?Use this &man.perl.1; command:&prompt.user; perl -i.bak -npe 's/\r\n/\n/g' file(s)where file(s) is one or
more files to process. The modification is done in-place,
with the original file stored with a
.bak extension.Alternatively, use &man.tr.1;:&prompt.user; tr -d '\r' < dos-text-file > unix-filedos-text-file is the file
containing DOS text while
unix-file will contain the
converted output. This can be quite a bit faster than
using perl.Yet another way to reformat DOS text files is to use
the converters/dosunix port from the
Ports Collection. Consult its documentation about the
details.How do I re-read /etc/rc.conf and
re-start /etc/rc without a
reboot?Go into single-user mode and then back to multi-user
mode:&prompt.root; shutdown now
&prompt.root; return
&prompt.root; exitI tried to update my system to the latest
-STABLE, but got
-BETAx,
-RC or
-PRERELEASE! What is going
on?Short answer: it is just a name.
RC stands for Release
Candidate. It signifies that a release is
imminent. In &os;, -PRERELEASE is
typically synonymous with the code freeze before a
release. (For some releases, the
-BETA label was used in the same way
as -PRERELEASE.)Long answer: &os; derives its releases from one of two
places. Major, dot-zero, releases, such as 9.0-RELEASE
are branched from the head of the development stream,
commonly referred to as -CURRENT. Minor releases, such
as 6.3-RELEASE or 5.2-RELEASE, have been snapshots of the
active -STABLE branch.
Starting with 4.3-RELEASE, each release also now has its
own branch which can be tracked by people requiring an
extremely conservative rate of development (typically only
security advisories).When a release is about to be made, the branch from
which it will be derived has to undergo a certain
process. Part of this process is a code freeze. When a
code freeze is initiated, the name of the branch is
changed to reflect that it is about to become a release.
For example, if the branch used to be called 6.2-STABLE,
its name will be changed to 6.3-PRERELEASE to signify the
code freeze and to indicate that extra pre-release testing
should be happening. Bug fixes can still be committed to
be part of the release. When the source code is in shape
for the release the name will be changed to 6.3-RC to
signify that a release is about to be made from it. Once
in the RC stage, only the most critical bugs found can be
fixed. Once the release (6.3-RELEASE in this example) and
release branch have been made, the branch will be renamed
to 6.3-STABLE.For more information on version numbers and the
various Subversion branches, refer to the Release
Engineering article.I tried to install a new kernel, and the
&man.chflags.1; failed. How do I get around this?Short answer: the security level is
greater than 0. Reboot directly to single-user mode to
install the kernel.Long answer: &os; disallows changing system flags at
security levels greater than 0. To check the current
security level:&prompt.root; sysctl kern.securelevelThe security level cannot be lowered in multi-user
mode, so boot to single-user mode to install the kernel,
or change the security level in
/etc/rc.conf then reboot. See the
&man.init.8; manual page for details on
securelevel, and see
/etc/defaults/rc.conf and the
&man.rc.conf.5; manual page for more information on
rc.conf.I cannot change the time on my system by more than one
second! How do I get around this?Short answer: the system is at a security level
greater than 1. Reboot directly to single-user mode to
change the date.Long answer: &os; disallows changing the time by more
than one second at security levels greater than 1. To
check the security level:&prompt.root; sysctl kern.securelevelThe security level cannot be lowered in multi-user
mode. Either boot to single-user mode to change the date
or change the security level in
/etc/rc.conf and reboot. See the
&man.init.8; manual page for details on
securelevel, and see
/etc/defaults/rc.conf and the
&man.rc.conf.5; manual page for more information on
rc.conf.Why is rpc.statd using 256 MB
of memory?No, there is no memory leak, and it is not using
256 MB of memory. For convenience,
rpc.statd maps an obscene amount of
memory into its address space. There is nothing terribly
wrong with this from a technical standpoint; it just
throws off things like &man.top.1; and &man.ps.1;.&man.rpc.statd.8; maps its status file (resident on
/var) into its address space; to save
worrying about remapping the status file later when it
needs to grow, it maps the status file with a generous
size. This is very evident from the source code, where
one can see that the length argument to &man.mmap.2; is
0x10000000, or one sixteenth of the
address space on an IA32, or exactly 256 MB.Why can I not unset the schg file
flag?The system is running at securelevel greater than 0.
Lower the securelevel and try again. For more
information, see the
FAQ entry on securelevel and
the &man.init.8; manual page.What is vnlru?vnlru flushes and frees vnodes when
the system hits the kern.maxvnodes
limit. This kernel thread sits mostly idle, and only
activates when there is a huge amount of RAM and users are
accessing tens of thousands of tiny files.What do the various memory states displayed by
top mean?Active: pages recently
statistically used.Inactive: pages recently
statistically unused.Laundry: pages recently
statistically unused but known to be dirty, that is,
whose contents need to be paged out before they can
be reused.Free: pages without data
content, which can be immediately reused.Wired: pages that are fixed
into memory, usually for kernel purposes, but also
sometimes for special use in processes.Pages are most often written to disk (sort of a VM
sync) when they are in the laundry state, but active or
inactive pages can also be synced. This depends upon the
CPU tracking of the modified bit being available, and in
certain situations there can be an advantage for a block
of VM pages to be synced, regardless of the queue they
belong to. In most common cases, it is best to think of
the laundry queue as a queue of relatively unused
pages that might or might not be in the process of being
written to disk. The inactive queue contains a mix of
clean and dirty pages; clean pages near the head of the
queue are reclaimed immediately to alleviate a free page
shortage, and dirty pages are moved to the laundry queue
for deferred processing.There are some other flags (e.g., busy flag or busy
count) that might modify some of the described
rules.How much free memory is available?There are a couple of kinds of free
memory. The most common is the amount of memory
immediately available without reclaiming memory already
in use. That is the size of the free pages queue plus
some other reserved pages. This amount is exported by the
vm.stats.vm.v_free_count
&man.sysctl.8;, shown, for instance, by &man.top.1;.
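The page count can be converted to megabytes with a line of shell arithmetic; a minimal sketch, assuming the usual 4 KiB page size (confirm with sysctl hw.pagesize) and an example count of 524288 pages:

```shell
# Convert a page count, as reported by vm.stats.vm.v_free_count,
# into megabytes.  524288 is an example value; on a live system use
#   free_pages=$(sysctl -n vm.stats.vm.v_free_count)
free_pages=524288
page_size=4096            # assumed; check with: sysctl -n hw.pagesize
echo "$((free_pages * page_size / 1024 / 1024)) MB free"
```

With these sample numbers the line prints 2048 MB free.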
Another kind of free memory is
the total amount of virtual memory available to userland
processes, which depends on the sum of swap space and
usable memory. Other kinds of free memory
descriptions are also possible, but there is little point
in defining them; what matters is to
make sure that the paging rate is kept low, and to avoid
running out of swap space.What is /var/empty?/var/empty is a directory that
the &man.sshd.8; program uses when performing privilege
separation. The /var/empty
directory is empty, owned by root and has the
schg flag set. This directory should
not be deleted.I just changed
/etc/newsyslog.conf. How can I check
if it does what I expect?To see what &man.newsyslog.8; will do, use the
following:&prompt.user; newsyslog -nrvvMy time is wrong, how can I change the
timezone?Use &man.tzsetup.8;.The X Window System and Virtual ConsolesWhat is the X Window System?The X Window System (commonly X11)
is the most widely available windowing system capable of
running on &unix; or &unix;-like systems, including
&os;. The X.Org
Foundation administers the X
protocol standards, with the current reference
implementation being version 11 release &xorg.version;, so
references are often shortened to
X11.Many implementations are available for different
architectures and operating systems. An implementation of
the server-side code is properly known as an X
server.I want to run &xorg;, how do I go about it?To install &xorg; do one of the following:Use the x11/xorg
meta-port, which builds and installs every &xorg;
component.Use x11/xorg-minimal, which builds
and installs only the necessary &xorg; components.Install &xorg; from &os; packages:&prompt.root; pkg install xorgAfter the installation of &xorg;, follow the
instructions from the X11
Configuration section of the &os;
Handbook.I tried to run X, but I get a
No devices detected. error when I
type startx. What do I do now?The system is probably running at a raised
securelevel. It is not possible to
start X at a raised securelevel because
X requires write access to &man.io.4;. For more
information, see the &man.init.8; manual page.There are two solutions to the problem: set the
securelevel back down to zero or run
&man.xdm.1; (or an alternative display manager) at boot
time before the securelevel is
raised.See for more information
about running &man.xdm.1; at boot time.Why does my mouse not work with X?When using &man.vt.4;, the default console
driver, &os; can be configured to support a mouse pointer
on each virtual screen. To avoid conflicting with X,
&man.vt.4; supports a virtual device called
/dev/sysmouse. All mouse events
received from the real mouse device are written to the
&man.sysmouse.4; device via &man.moused.8;. To use the
mouse on one or more virtual consoles,
and use X, see and set up
&man.moused.8;.Then edit /etc/X11/xorg.conf and
make sure the following lines exist:Section "InputDevice"
Option "Protocol" "SysMouse"
Option "Device" "/dev/sysmouse"
.....Starting with &xorg; version 7.4, the
InputDevice sections in
xorg.conf are ignored in favor of
autodetected devices. To restore the old behavior, add
the following line to the ServerLayout
or ServerFlags section:Option "AutoAddDevices" "false"Some people prefer to use
/dev/mouse under X. To make this
work, /dev/mouse should be linked
to /dev/sysmouse (see
&man.sysmouse.4;) by adding the following line to
/etc/devfs.conf (see
&man.devfs.conf.5;):link sysmouse mouseThis link can be created by restarting &man.devfs.5;
with the following command (as root):&prompt.root; service devfs restartMy mouse has a fancy wheel. Can I use it in X?Yes, if X is configured for a 5-button mouse. To
do this, add the lines Buttons 5
and ZAxisMapping 4 5 to the
InputDevice section of
/etc/X11/xorg.conf, as seen in this
example:Section "InputDevice"
Identifier "Mouse1"
Driver "mouse"
Option "Protocol" "auto"
Option "Device" "/dev/sysmouse"
Option "Buttons" "5"
Option "ZAxisMapping" "4 5"
EndSectionThe mouse can be enabled in
Emacs by adding these
lines to ~/.emacs:;; wheel mouse
(global-set-key [mouse-4] 'scroll-down)
(global-set-key [mouse-5] 'scroll-up)My laptop has a Synaptics touchpad. Can I use it in
X?Yes, after configuring a few things to make
it work.In order to use the Xorg synaptics driver,
first remove moused_enable from
rc.conf.To enable synaptics, add the following line to
/boot/loader.conf:hw.psm.synaptics_support="1"Add the following to
/etc/X11/xorg.conf:Section "InputDevice"
Identifier "Touchpad0"
Driver "synaptics"
Option "Protocol" "psm"
Option "Device" "/dev/psm0"
EndSectionAnd be sure to add the following into the
ServerLayout section:InputDevice "Touchpad0" "SendCoreEvents"How do I use remote X displays?For security reasons, the default setting is to not
allow a machine to remotely open a window.To enable this feature, start
X with the optional
argument:&prompt.user; startx -listen_tcpWhat is a virtual console and how do I make
more?Virtual consoles provide
several simultaneous sessions on the same machine without
doing anything complicated like setting up a network or
running X.When the system starts, it will display a login prompt
on the monitor after displaying all the boot messages.
Type in your login name and password to
start working on the first virtual
console.To start another
session, perhaps to look at documentation for a program
or to read mail while waiting for an
FTP transfer to finish,
hold down Alt and press
F2. This will display the login prompt
for the second virtual
console. To go back to the
original session, press AltF1.The default &os; installation has eight virtual
consoles enabled. AltF1,
AltF2,
AltF3,
and so on will switch between these virtual
consoles.To enable more virtual consoles, edit
/etc/ttys (see &man.ttys.5;) and add
entries for ttyv8 to
ttyvc, after the comment on
Virtual terminals:# Edit the existing entry for ttyv8 in /etc/ttys and change
# "off" to "on".
ttyv8 "/usr/libexec/getty Pc" xterm on secure
ttyv9 "/usr/libexec/getty Pc" xterm on secure
ttyva "/usr/libexec/getty Pc" xterm on secure
ttyvb "/usr/libexec/getty Pc" xterm on secureThe more virtual
terminals, the more resources are used. This can be
problematic on systems with 8 MB RAM or less.
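To see how many terminals are currently switched on, count the on entries; the fragment below is sample data for illustration (on a real system, point grep at /etc/ttys instead):

```shell
# Count enabled virtual terminals in a ttys-style file.  The sample
# written here is made-up data; use /etc/ttys on a real system.
cat > /tmp/ttys.sample <<'EOF'
ttyv0 "/usr/libexec/getty Pc" xterm on secure
ttyv8 "/usr/libexec/getty Pc" xterm on secure
ttyv9 "/usr/libexec/getty Pc" xterm off secure
EOF
grep -c '^ttyv.* on ' /tmp/ttys.sample
```

For this sample the count printed is 2.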
Consider changing secure to
insecure.In order to run an X server, at least one virtual
terminal must be left to off for it
to use. This means that only eleven of the Alt-function
keys can be used as virtual consoles so that one is left
for the X server.For example, to run X and eleven virtual consoles, the
setting for virtual terminal 12 should be:ttyvb "/usr/libexec/getty Pc" xterm off secureThe easiest way to activate the
virtual consoles is to reboot.How do I access the virtual consoles from X?Use CtrlAltFn
to switch back to a virtual console. Press CtrlAltF1
to return to the first virtual console.Once at a text console, use
AltFn
to move between them.To return to the X session, switch to the
virtual console running X. If X was started from the
command line using startx,
the X session will attach to the next unused virtual
console, not the text console from which it was invoked.
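Which vty that is can be worked out from the number of active terminals; a small sketch, using the default of eight enabled consoles as the example count:

```shell
# With N active getty terminals, X attaches to the next unused vty.
# The kernel numbers devices from ttyv0, while the Alt-function keys
# count from F1, so the device index and the key number differ by one.
active=8
echo "X runs on /dev/ttyv$active (press Alt+F$((active + 1)))"
```

With eight active terminals this prints X runs on /dev/ttyv8 (press Alt+F9).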
For eight active virtual terminals, X will
run on the ninth, so use AltF9.How do I start XDM on
boot?There are two schools of thought on how to start
&man.xdm.1;. One school starts xdm
from /etc/ttys (see &man.ttys.5;)
using the supplied example, while the other runs
xdm from
rc.local (see &man.rc.8;) or from an
X script in
/usr/local/etc/rc.d. Both are
equally valid, and one may work in situations where the
other does not. In both cases the result is the same: X
will pop up a graphical login prompt.The &man.ttys.5; method has the advantage of
documenting which vty X will start on and passing the
responsibility of restarting the X server on logout to
&man.init.8;. The &man.rc.8; method makes it easy to
kill xdm if there is
a problem starting the X server.If loaded from &man.rc.8;, xdm
should be started without any arguments.
xdm must start
after &man.getty.8; runs, or else
getty and xdm will
conflict, locking out the console. The best way around
this is to have the script sleep 10 seconds or so then
launch xdm.When starting xdm from
/etc/ttys, there still is a chance of
conflict between xdm and &man.getty.8;.
One way to avoid this is to add the vt
number in
/usr/local/lib/X11/xdm/Xservers::0 local /usr/local/bin/X vt4The above example will direct the X server to run in
/dev/ttyv3. Note the number is
offset by one. The X server counts the vty from one,
whereas the &os; kernel numbers the vty from zero.Why do I get Couldn't open
console when I run
xconsole?When X is started with
startx, the permissions on
/dev/console will
not get changed, resulting in things
like xterm -C and
xconsole not working.This is because of the way console permissions are set
by default. On a multi-user system, one does not
necessarily want just any user to be able to write on the
system console. For users who are logging directly onto a
machine with a VTY, the &man.fbtab.5; file exists to solve
such problems.In a nutshell, make sure an uncommented line of the
form is in /etc/fbtab (see
&man.fbtab.5;):/dev/ttyv0 0600 /dev/consoleIt will ensure that whoever logs in on
/dev/ttyv0 will own the
console.Why does my PS/2 mouse misbehave under X?The mouse and the mouse driver may have become out of
synchronization. In rare cases, the driver may also
erroneously report synchronization errors:psmintr: out of sync (xxxx != yyyy)If this happens, disable the synchronization check
code by setting the driver flags for the PS/2 mouse driver
to 0x100. This is most easily achieved
by adding hint.psm.0.flags="0x100" to
/boot/loader.conf and
rebooting.How do I reverse the mouse buttons?Type
xmodmap -e "pointer = 3 2 1". Add this
command to ~/.xinitrc or
~/.xsession to make it happen
automatically.How do I install a splash screen and where do I find
them?The detailed answer for this question can be found in
the Boot
Time Splash Screens section of the &os;
Handbook.Can I use the Windows keys on my
keyboard in X?Yes. Use &man.xmodmap.1; to
define which functions the keys should perform.Assuming all Windows keyboards are
standard, the keycodes for these three keys are the
following:115 —
Windows key, between the left-hand
Ctrl and Alt
keys116 —
Windows key, to the right of
AltGr117 —
Menu, to the left of the right-hand
CtrlTo have the left Windows key print a
comma, try this:&prompt.root; xmodmap -e "keycode 115 = comma"To have the Windows key-mappings
enabled automatically every time X is started, either put
the xmodmap commands in
~/.xinitrc or, preferably, create
a ~/.xmodmaprc and include the
xmodmap options, one per line, then add
the following line to
~/.xinitrc:xmodmap $HOME/.xmodmaprcFor example, the three keys can be mapped to
F13, F14, and
F15, respectively. This would make it
easy to map them to useful functions within applications
or the window manager.To do this, put the following in
~/.xmodmaprc:keycode 115 = F13
keycode 116 = F14
keycode 117 = F15For the x11-wm/fvwm2 desktop
manager, one could map the keys so that
F13 iconifies or de-iconifies the
window the cursor is in, F14 brings the
window the cursor is in to the front or, if it is already
at the front, pushes it to the back, and
F15 pops up the main Workplace
menu even if the cursor is not on the
desktop, which is useful when no part of
the desktop is visible.The following entries in
~/.fvwmrc implement the
aforementioned setup:Key F13 FTIWS A Iconify
Key F14 FTIWS A RaiseLower
Key F15 A A Menu Workplace NopHow can I get 3D hardware acceleration for
&opengl;?The availability of 3D acceleration depends on the
version of &xorg; and the type of video
chip. For an nVidia chip, use
the binary drivers provided for &os; by installing one of
the following ports:The latest nVidia cards are supported
by the x11/nvidia-driver
port.Older drivers are available as
x11/nvidia-driver-###nVidia provides detailed information on which
card is supported by which driver on their web site: http://www.nvidia.com/object/IO_32667.html.For Matrox G200/G400, check the
x11-drivers/xf86-video-mga
port.For ATI Rage 128 and Radeon see
&man.ati.4x;, &man.r128.4x; and &man.radeon.4x;.NetworkingWhere can I get information on diskless
booting?Diskless booting means that the &os;
box is booted over a network, and reads the necessary
files from a server instead of its hard disk. For full
details, see the
Handbook entry on diskless booting.Can a &os; box be used as a dedicated network
router?Yes. Refer to the Handbook entry on advanced
networking, specifically the section on routing
and gateways.Does &os; support NAT or Masquerading?Yes. For instructions on how to use NAT over a PPP
connection, see the Handbook
entry on PPP. To use NAT over
some other sort of network connection, look at the
natd
section of the Handbook.How can I set up Ethernet aliases?If the alias is on the same subnet as an address
already configured on the interface, add
netmask 0xffffffff to this
command:&prompt.root; ifconfig ed0 alias 192.0.2.2 netmask 0xffffffffOtherwise, specify the network address and
netmask as usual:&prompt.root; ifconfig ed0 alias 172.16.141.5 netmask 0xffffff00More information can be found in the &os; Handbook.Why can I not NFS-mount from a &linux; box?Some versions of the &linux; NFS code only accept
mount requests from a privileged port; try to issue the
following command:&prompt.root; mount -o -P linuxbox:/blah /mntWhy does mountd keep telling me it
can't change attributes and that I
have a bad exports list on my &os;
NFS server?The most frequent problem is not understanding the
correct format of /etc/exports.
Review &man.exports.5; and the NFS
entry in the Handbook, especially the section on configuring
NFS.How do I enable IP multicast support?Install the net/mrouted package
or port and add
mrouted_enable="YES" to
/etc/rc.conf to start this service at
boot time.Why do I have to use the FQDN for hosts on my
site?See the answer in the &os; Handbook.Why do I get an error, Permission
denied, for all networking
operations?If the kernel is compiled with the
IPFIREWALL option, be aware
that the default policy is to deny all packets that are
not explicitly allowed.If the firewall is unintentionally misconfigured,
restore network operability by
typing the following as root:&prompt.root; ipfw add 65534 allow all from any to anyConsider setting
firewall_type="open" in
/etc/rc.conf.For further information on configuring this
firewall, see the Handbook
chapter.Why is my ipfw fwd
rule to redirect a service to another machine not
working?Possibly because network address translation (NAT) is
needed instead of just forwarding packets. A
fwd rule only forwards packets; it does not
actually change the data inside the packet. Consider this
rule:01000 fwd 10.0.0.1 from any to foo 21When a packet with a destination address of
foo arrives at the machine with
this rule, the packet is forwarded to
10.0.0.1, but it still has the
destination address of foo.
The destination address of the packet is
not changed to
10.0.0.1. Most machines would
probably drop a packet that they receive with a
destination address that is not their own. Therefore,
using a fwd rule does not often work the
way the user expects. This behavior is a feature and not
a bug.See the FAQ about
redirecting services, the &man.natd.8; manual, or
one of the several port redirecting utilities in the Ports
Collection for a correct way to do this.How can I redirect service requests from one machine
to another?FTP and other service requests can be redirected with
the sysutils/socket package or port.
Replace the entry for the service in
/etc/inetd.conf to call
socket, as seen in this example for
ftpd:ftp stream tcp nowait nobody /usr/local/bin/socket socket ftp.example.com ftpwhere ftp.example.com and
ftp are the host and port to
redirect to, respectively.Where can I get a bandwidth management tool?There are three bandwidth management tools available
for &os;. &man.dummynet.4; is integrated into &os; as
part of &man.ipfw.4;. ALTQ
has been integrated into &os; as part of &man.pf.4;.
Bandwidth Manager from Emerging
Technologies is a commercial product.Why do I get /dev/bpf0: device not
configured?The running application requires the Berkeley
Packet Filter (&man.bpf.4;), but it was removed from a
custom kernel. Add this to the kernel config file and
build a new kernel:device bpf # Berkeley Packet FilterHow do I mount a disk from a &windows; machine that is
on my network, like smbmount in &linux;?Use the SMBFS toolset. It
includes a set of kernel modifications and a set of
userland programs. The programs and information are
available as &man.mount.smbfs.8; in the base
system.What are these messages about: Limiting
icmp/open port/closed port response in my
log files?This kernel message indicates that some activity is
provoking it to send a large amount of ICMP or TCP reset
(RST) responses. ICMP responses are
often generated as a result of attempted connections to
unused UDP ports. TCP resets are generated as a result of
attempted connections to unopened TCP ports. Among
others, these are the kinds of activities which may cause
these messages:Brute-force denial of service (DoS) attacks (as
opposed to single-packet attacks which exploit a
specific vulnerability).Port scans which attempt to connect to a large
number of ports (as opposed to only trying a few
well-known ports).The first number in the message indicates how many
packets the kernel would have sent if the limit was not in
place, and the second indicates the limit. This limit
is controlled using
net.inet.icmp.icmplim. This example
sets the limit to 300
packets per second:&prompt.root; sysctl net.inet.icmp.icmplim=300To disable these messages
without disabling response
limiting, use
net.inet.icmp.icmplim_output
to disable the output:&prompt.root; sysctl net.inet.icmp.icmplim_output=0Finally, to disable response limiting completely,
set net.inet.icmp.icmplim to
0. Disabling response limiting is
discouraged for the reasons listed above.What are these arp: unknown hardware
address format error messages?This means that some device on the local Ethernet is
using a MAC address in a format that &os; does not
recognize. This is probably caused by someone
experimenting with an Ethernet card somewhere else on the
network. This is most commonly seen on cable modem
networks. It is harmless, and should not affect the
performance of the &os; system.Why do I keep seeing messages like:
192.168.0.10 is on
fxp1 but got reply from 00:15:17:67:cf:82 on
rl0, and how do I disable it?
A packet is coming from outside the network
unexpectedly. To disable them, set
net.link.ether.inet.log_arp_wrong_iface
to 0.How do I compile an IPv6 only kernel?Configure your kernel with these settings:
include GENERIC
ident GENERIC-IPV6ONLY
makeoptions MKMODULESENV+="WITHOUT_INET_SUPPORT="
nooptions INET
nodevice greSecurityWhat is a sandbox?Sandbox is a security term. It can
mean two things:A process which is placed inside a set of virtual
walls that are designed to prevent someone who breaks
into the process from being able to break into the
wider system.The process is only able to run inside the walls.
Since nothing the process does with regard to executing
code is supposed to be able to breach the walls, a
detailed audit of its code is not needed in order to
be able to say certain things about its
security.The walls might be a user ID, for example.
This is the definition used in the &man.security.7;
and &man.named.8; man pages.Take the ntalk service, for
example (see &man.inetd.8;). This service used to run
as user ID root. Now it runs as
user ID tty. The tty user is a sandbox
designed to make it more difficult for someone who has
successfully hacked into the system via
ntalk from being able to hack
beyond that user ID.A process which is placed inside a simulation of
the machine. It means that someone who is able to
break into the process may believe that he can break
into the wider machine but is, in fact, only breaking
into a simulation of that machine and not modifying
any real data.The most common way to accomplish this is to build
a simulated environment in a subdirectory and then run
the processes in that directory chrooted so that
/ for that process is this
directory, not the real / of the
system.Another common use is to mount an underlying file
system read-only and then create a file system layer
on top of it that gives a process a seemingly
writeable view into that file system. The process may
believe it is able to write to those files, but only
the process sees the effects — other processes
in the system do not, necessarily.An attempt is made to make this sort of sandbox so
transparent that the user (or hacker) does not realize
that he is sitting in it.&unix; implements two core sandboxes. One is at the
process level, and one is at the user ID level.Every &unix; process is completely firewalled off from
every other &unix; process. One process cannot modify the
address space of another.A &unix; process is owned by a particular user ID. If
the user ID is not the root user, it serves to
firewall the process off from processes owned by other
users. The user ID is also used to firewall off
on-disk data.What is securelevel?securelevel is a security
mechanism implemented in the kernel. When the securelevel
is positive, the kernel restricts certain tasks; not even
the superuser (root) is allowed to do
them. The securelevel mechanism limits the ability
to:Unset certain file flags, such as
schg (the system immutable
flag).Write to kernel memory via
/dev/mem and
/dev/kmem.Load kernel modules.Alter firewall rules.To check the status of the securelevel on a running
system:&prompt.root; sysctl -n kern.securelevelThe output contains the current value of the
securelevel. If it is greater than 0, at
least some of the securelevel's protections are
enabled.The securelevel of a running system cannot be lowered
as this would defeat its purpose. If a task requires that
the securelevel be non-positive, change the
kern_securelevel and
kern_securelevel_enable variables in
/etc/rc.conf and reboot.For more information on securelevel and the specific
things all the levels do, consult &man.init.8;.Securelevel is not a silver bullet; it has many
known deficiencies. More often than not, it provides a
false sense of security.One of its biggest problems is that in order for it
to be at all effective, all files used in the boot
process up until the securelevel is set must be
protected. If an attacker can get the system to execute
their code prior to the securelevel being set (which
happens quite late in the boot process since some things
the system must do at start-up cannot be done at an
elevated securelevel), its protections are invalidated.
While this task of protecting all files used in the boot
process is not technically impossible, if it is
achieved, system maintenance will become a nightmare
since one would have to take the system down, at least
to single-user mode, to modify a configuration
file.This point and others are often discussed on the
mailing lists, particularly the &a.security;.
Search the archives here
for an extensive discussion. A more fine-grained
mechanism is preferred.What is this UID 0 toor account? Have I been
compromised?Do not worry. toor is an
alternative superuser account, where toor
is root spelled backwards. It is intended to be used with
a non-standard shell so the default shell for root does not need to
change. This is important as shells which are not part of
the base distribution, but are instead installed from
ports or packages, are installed in
/usr/local/bin which, by default,
resides on a different file system. If root's shell is located in
/usr/local/bin and the
file system
containing /usr/local/bin is not
mounted, root will not be able to
log in to fix a problem and will have to reboot into
single-user mode in order to enter the path to a
shell.Some people use toor for day-to-day
root tasks with
a non-standard shell, leaving root, with a standard
shell, for single-user mode or emergencies. By default, a
user cannot log in using toor as it does not have a
password, so log in as root and set a password
for toor before
using it to log in.Serial CommunicationsThis section answers common questions about serial
communications with &os;.How do I get the boot: prompt to show on the serial
console?See this
section of the Handbook.How do I tell if &os; found my serial ports or modem
cards?As the &os; kernel boots, it will probe for the serial
ports for which the kernel is configured.
Either watch the boot messages closely
or run this command after the system is up and
running:&prompt.user; grep -E '^(sio|uart)[0-9]' < /var/run/dmesg.boot
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550AThis example shows two serial ports. The first is on
IRQ4, port address
0x3f8, and has a 16550A-type UART chip.
The second uses the same kind of chip but is on
IRQ3 and is at port address
0x2f8. Internal modem cards are
treated just like serial ports, except that they
always have a modem attached to the
port.The GENERIC kernel includes
support for two serial ports using the same IRQ and port
address settings in the above example. If these settings
are not right for the system, or if there are more modem
cards or serial ports than the kernel is
configured for, reconfigure the kernel. See
building a kernel
for more details.How do I access the serial ports on &os;? (x86-specific)The third serial port, sio2,
or COM3,
is on /dev/cuad2 for dial-out
devices, and on /dev/ttyd2 for
dial-in devices. What is the difference between these two
classes of devices?When
opening /dev/ttydX in blocking mode,
a process will wait for the corresponding
cuadX device to become inactive, and
then wait for the carrier detect line to go active. When
the cuadX device is opened, it makes
sure the serial port is not already in use by the
ttydX device. If the port is
available, it steals it from the
ttydX device. Also, the
cuadX device does not care about
carrier detect. With this scheme and an auto-answer
modem, remote users can log in and local users can still
dial out with the same modem and the system will take care
of all the conflicts.How do I enable support for a multi-port serial
card?The section on kernel configuration provides
information about configuring the kernel. For a
multi-port serial card, place an &man.sio.4; line for each
serial port on the card in the &man.device.hints.5; file.
But place the IRQ specifiers on only one of the entries.
All of the ports on the card should share one IRQ. For
consistency, use the last serial port to specify the IRQ.
Also, specify the following option in the kernel
configuration file:options COM_MULTIPORTThe following /boot/device.hints
example is for an AST 4-port serial card on
IRQ 12:hint.sio.4.at="isa"
hint.sio.4.port="0x2a0"
hint.sio.4.flags="0x701"
hint.sio.5.at="isa"
hint.sio.5.port="0x2a8"
hint.sio.5.flags="0x701"
hint.sio.6.at="isa"
hint.sio.6.port="0x2b0"
hint.sio.6.flags="0x701"
hint.sio.7.at="isa"
hint.sio.7.port="0x2b8"
hint.sio.7.flags="0x701"
hint.sio.7.irq="12"The flags indicate that the master port has minor
number 7 (0x700),
and all the ports share an IRQ
(0x001).Can I set the default serial parameters for a
port?See the Serial
Communications section in the &os;
Handbook.Why can I not run tip or
cu?The built-in &man.tip.1; and
&man.cu.1; utilities can only access the
/var/spool/lock directory via user
uucp and group
dialer.
Use the dialer group to control
who has access to the modem or remote systems by adding
user accounts to dialer.Alternatively, everyone can be configured to run
&man.tip.1; and &man.cu.1; by typing:&prompt.root; chmod 4511 /usr/bin/cu
&prompt.root; chmod 4511 /usr/bin/tipMiscellaneous Questions&os; uses a lot of swap space even when the computer
has free memory left. Why?&os; will proactively move entirely idle, unused pages
of main memory into swap in order to make more main memory
available for active use. This heavy use of swap is
balanced by using the extra free memory for
caching.Note that while &os; is proactive in this regard, it
does not arbitrarily decide to swap pages when the system
is truly idle. Thus, the system will not be all
paged out after leaving it
idle overnight.Why does top show very little free
memory even when I have very few programs running?The simple answer is that free memory is wasted
memory. Any memory that programs do not actively
allocate is used within the &os; kernel as disk cache.
The values shown by &man.top.1; labeled as
Inact and Laundry
are cached data at different
aging levels. This cached data means the system does not
have to access a slow disk again for data it has accessed
recently, thus increasing overall performance. In
general, a low value shown for Free
memory in &man.top.1; is good, provided it is not
very low.Why will chmod not change the
permissions on symlinks?Symlinks do not have permissions, and by default,
&man.chmod.1; will follow symlinks to change the
permissions on the source file, if possible. For
the file foo with a symlink named
bar, this command
will always succeed.&prompt.user; chmod g-w barHowever, the permissions on bar
will not have changed.When changing modes of the file hierarchies rooted in
the files instead of the files themselves, use
either -H or -L together
with -R to make this work. See
&man.chmod.1; and &man.symlink.7; for more
information. -R does a
recursive &man.chmod.1;. Be
careful about specifying directories or symlinks to
directories to &man.chmod.1;. To change the
permissions of a directory referenced by a symlink, use
&man.chmod.1; without any options and follow the symlink
with a trailing slash (/). For
example, if foo is a symlink to
directory bar, to
change the permissions of foo
(actually bar), do
something like:&prompt.user; chmod 555 foo/With the trailing slash, &man.chmod.1; will follow
the symlink, foo, to change the
permissions of the directory,
bar.Can I run DOS binaries under &os;?Yes. A DOS emulation program,
emulators/doscmd, is available in the
&os; Ports Collection.If doscmd will not suffice,
emulators/pcemu
emulates an 8088 and enough BIOS services to run many DOS
text-mode applications. It requires the X Window
System.The Ports Collection also has
emulators/dosbox. The main focus of
this application is emulating old DOS games using the
local file system for files.What do I need to do to translate a &os; document into
my native language?See the Translation
FAQ in the &os; Documentation
Project Primer.Why does my email to any address at FreeBSD.org
bounce?The FreeBSD.org mail
system implements some Postfix
checks on incoming mail and rejects mail that either comes
from misconfigured relays or otherwise appears likely to
be spam. Some of the specific requirements are:The IP address of the SMTP client must
"reverse-resolve" to a forward confirmed
hostname.The fully-qualified hostname given in the
SMTP conversation (either HELO or EHLO) must resolve
to the IP address of the client.Other advice to help mail reach its destination
includes:Mail should be sent in plain text, and messages
sent to mailing lists should generally be no more than
200KB in length.Avoid excessive cross posting. Choose
one mailing list which seems most
relevant and send it there.If you still have trouble with email infrastructure at
FreeBSD.org,
send a note with the details to
postmaster@freebsd.org. Include a
date/time interval so that logs may be reviewed —
and note that we only keep one week's worth of mail logs.
(Be sure to specify the time zone or offset from
UTC.)Where can I find a free &os; account?While &os; does not provide open access to any of
its servers, others do provide open access &unix;
systems. The charge varies and limited services may be
available.Arbornet,
Inc, also known as M-Net,
has been providing open access to &unix; systems since
1983. Starting on an Altos running System III, the site
switched to BSD/OS in 1991. In June of 2000, the site
switched again to &os;. M-Net can be
accessed via telnet and
SSH and provides basic access
to the entire &os; software suite. However, network
access is limited to members and patrons who donate to the
system, which is run as a non-profit organization.
M-Net also provides a bulletin board
system and interactive chat.What is the cute little red guy's name?He does not have one, and is just called the
BSD daemon. If you insist upon using a name,
call him beastie. Note that
beastie is pronounced
BSD.More about the BSD daemon is available on his home
page.Can I use the BSD daemon image?Perhaps. The BSD daemon is copyrighted by Marshall
Kirk McKusick. Check his Statement
on the Use of the BSD Daemon Figure for detailed
usage terms.In summary, the image can be used in a tasteful
manner, for personal use, so long as appropriate credit
is given. Before using the logo commercially, contact
&a.mckusick.email; for permission. More details are
available on the BSD
Daemon's home page.Do you have any BSD daemon images I could use?Xfig and eps drawings are available under
/usr/share/examples/BSD_daemon/.I have seen an acronym or other term on the mailing
lists and I do not understand what it means. Where should
I look?Refer to the &os;
Glossary.Why should I care what color the bikeshed is?The really, really short answer is that you should
not. The somewhat longer answer is that just because you
are capable of building a bikeshed does not mean you
should stop others from building one just because you do
not like the color they plan to paint it. This is a
metaphor indicating that you need not argue about every
little feature just because you know enough to do so.
Some people have commented that the amount of noise
generated by a change is inversely proportional to the
complexity of the change.The longer and more complete answer is that after a
very long argument about whether &man.sleep.1; should take
fractional second arguments, &a.phk.email; posted a long
message entitled A
bike shed (any color will do) on greener
grass.... The appropriate portions of
that message are quoted below.
&a.phk.email; on &a.hackers.name;, October 2,
1999What is it about this bike shed?
Some of you have asked me.It is a long story, or rather it is an old story,
but it is quite short actually. C. Northcote Parkinson
wrote a book in the early 1960s, called
Parkinson's Law, which contains a lot of
insight into the dynamics of management.[snip a bit of commentary on the
book]In the specific example involving the bike shed, the
other vital component is an atomic power-plant, I guess
that illustrates the age of the book.Parkinson shows how you can go into the board of
directors and get approval for building a multi-million
or even billion dollar atomic power plant, but if you
want to build a bike shed you will be tangled up in
endless discussions.Parkinson explains that this is because an atomic
plant is so vast, so expensive and so complicated that
people cannot grasp it, and rather than try, they fall
back on the assumption that somebody else checked all
the details before it got this far. Richard P. Feynmann
gives a couple of interesting, and very much to the
point, examples relating to Los Alamos in his
books.A bike shed on the other hand. Anyone can build one
of those over a weekend, and still have time to watch
the game on TV. So no matter how well prepared, no
matter how reasonable you are with your proposal,
somebody will seize the chance to show that he is doing
his job, that he is paying attention, that he is
here.In Denmark we call it setting your
fingerprint. It is about personal pride and
prestige, it is about being able to point somewhere and
say There! I did
that. It is a strong trait in politicians, but
present in most people given the chance. Just think
about footsteps in wet cement.
The &os; FunniesHow cool is &os;?Q. Has anyone done any temperature testing while
running &os;? I know &linux; runs cooler than DOS, but
have never seen a mention of &os;. It seems to run really
hot.A. No, but we have done numerous taste tests on
blindfolded volunteers who have also had 250 micrograms of
LSD-25 administered beforehand. 35% of the volunteers
said that &os; tasted sort of orange, whereas &linux;
tasted like purple haze. Neither group mentioned any
significant variances in temperature. We eventually had
to throw the results of this survey out entirely anyway
when we found that too many volunteers were wandering out
of the room during the tests, thus skewing the results.
We think most of the volunteers are at Apple now, working
on their new scratch and sniff GUI. It is
a funny old business we are in!Seriously, &os; uses the HLT (halt)
instruction when the system is idle, thus lowering its
energy consumption and therefore the heat it generates.
Also if you have ACPI (Advanced
Configuration and Power Interface) configured, then &os;
can also put the CPU into a low power mode.Who is scratching in my memory banks??Q. Is there anything odd that &os;
does when compiling the kernel which would cause the
memory to make a scratchy sound? When compiling (and for
a brief moment after recognizing the floppy drive upon
startup, as well), a strange scratchy sound emanates from
what appears to be the memory banks.A. Yes! You will see frequent references to
daemons in the BSD documentation, and what
most people do not know is that this refers to genuine,
non-corporeal entities that now possess your computer.
The scratchy sound coming from your memory is actually
high-pitched whispering exchanged among the daemons as
they best decide how to deal with various system
administration tasks.If the noise gets to you, a good fdisk
/mbr from DOS will get rid of them, but do not
be surprised if they react adversely and try to stop you.
In fact, if at any point during the exercise you hear the
satanic voice of Bill Gates coming from the built-in
speaker, take off running and do not ever look back!
Freed from the counterbalancing influence of the BSD
daemons, the twin demons of DOS and &windows; are often
able to re-assert total control over your machine to the
eternal damnation of your soul. Now that you know, given
a choice you would probably prefer to get used to the
scratchy noises, no?How many &os; hackers does it take to change a
lightbulb?One thousand, one hundred and sixty-nine:Twenty-three to complain to -CURRENT about the lights
being out;Four to claim that it is a configuration problem, and
that such matters really belong on -questions;Three to submit PRs about it, one of which is misfiled
under doc and consists only of it's
dark;One to commit an untested lightbulb which breaks
buildworld, then back it out five minutes later;Eight to flame the PR originators for not including
patches in their PRs;Five to complain about buildworld being broken;Thirty-one to answer that it works for them, and they
must have updated at a bad time;One to post a patch for a new lightbulb to
-hackers;One to complain that he had patches for this three
years ago, but when he sent them to -CURRENT they were
just ignored, and he has had bad experiences with the PR
system; besides, the proposed new lightbulb is
non-reflexive;Thirty-seven to scream that lightbulbs do not belong
in the base system, that committers have no right to do
things like this without consulting the Community, and
WHAT IS -CORE DOING ABOUT IT!?Two hundred to complain about the color of the bicycle
shed;Three to point out that the patch breaks
&man.style.9;;Seventeen to complain that the proposed new lightbulb
is under GPL;Five hundred and eighty-six to engage in a flame war
about the comparative advantages of the GPL, the BSD
license, the MIT license, the NPL, and the personal
hygiene of unnamed FSF founders;Seven to move various portions of the thread to -chat
and -advocacy;One to commit the suggested lightbulb, even though it
shines dimmer than the old one;Two to back it out with a furious flame of a commit
message, arguing that &os; is better off in the dark than
with a dim lightbulb;Forty-six to argue vociferously about the backing out
of the dim lightbulb and demanding a statement from
-core;Eleven to request a smaller lightbulb so it will fit
their Tamagotchi if we ever decide to port &os; to that
platform;Seventy-three to complain about the SNR on -hackers
and -chat and unsubscribe in protest;Thirteen to post unsubscribe,
How do I unsubscribe?, or Please
remove me from the list, followed by the usual
footer;One to commit a working lightbulb while everybody is
too busy flaming everybody else to notice;Thirty-one to point out that the new lightbulb would
shine 0.364% brighter if compiled with TenDRA (although it
will have to be reshaped into a cube), and that &os;
should therefore switch to TenDRA instead of GCC;One to complain that the new lightbulb lacks
fairings;Nine (including the PR originators) to ask what
is MFC?;Fifty-seven to complain about the lights being out two
weeks after the bulb has been changed.&a.nik.email; adds:I was laughing quite hard at
this.And then I thought, Hang on,
shouldn't there be '1 to document it.' in that list
somewhere?And then I was enlightened
:-)&a.tabthorpe.email; says:
None, real &os; hackers are
not afraid of the dark!Where does data written to
/dev/null go?It goes into a special data sink in the CPU where it
is converted to heat which is vented through the heatsink
/ fan assembly. This is why CPU cooling is increasingly
important; as people get used to faster processors, they
become careless with their data and more and more of it
ends up in /dev/null, overheating
their CPUs. If you delete /dev/null
(which effectively disables the CPU data sink) your CPU
may run cooler but your system will quickly become
constipated with all that excess data and start to behave
erratically. If you have a fast network connection you
can cool down your CPU by reading data out of
/dev/random and sending it off
somewhere; however you run the risk of overheating your
network connection and / or angering
your ISP, as most of the data will end up getting
converted to heat by their equipment, but they generally
have good cooling, so if you do not overdo it you should
be OK.Paul Robinson adds:There are other methods. As every good sysadmin
knows, it is part of standard practice to send data to the
screen of interesting variety to keep all the pixies that
make up your picture happy. Screen pixies (commonly
mis-typed or re-named as pixels) are
categorized by the type of hat they wear (red, green or
blue) and will hide or appear (thereby showing the color
of their hat) whenever they receive a little piece of
food. Video cards turn data into pixie-food, and then
send them to the pixies — the more expensive the
card, the better the food, so the better behaved the
pixies are. They also need constant stimulation —
this is why screen savers exist.To take your suggestions further, you could just throw
the random data to console, thereby letting the pixies
consume it. This causes no heat to be produced at all,
keeps the pixies happy and gets rid of your data quite
quickly, even if it does make things look a bit messy on
your screen.Incidentally, as an ex-admin of a large ISP who
experienced many problems attempting to maintain a stable
temperature in a server room, I would strongly discourage
people sending the data they do not want out to the
network. The fairies who do the packet switching and
routing get annoyed by it as well.My colleague sits at the computer too much, how
can I prank her?Install games/sl and
wait for her to mistype sl for
ls.Advanced TopicsHow can I learn more about &os;'s internals?See the &os;
Architecture Handbook.Additionally, much general &unix; knowledge is
directly applicable to &os;.How can I contribute to &os;? What can I do to
help?We accept all types of contributions: documentation,
code, and even art. See the article on Contributing
to &os; for specific advice on how to do
this.And thanks for the thought!What are snapshots and releases?There are currently &rel.numbranch; active/semi-active
branches in the &os; Subversion
Repository. (Earlier branches are only changed
very rarely, which is why there are only &rel.numbranch;
active branches of development):&rel2.releng; AKA
&rel2.stable;&rel.releng; AKA
&rel.stable;&rel.head.releng; AKA
-CURRENT AKA
&rel.head;HEAD is not an actual branch tag.
It is a symbolic constant for
the current, non-branched development
stream known as
-CURRENT.Right now, -CURRENT is the
&rel.head.relx; development stream; the &rel.stable;
branch, &rel.releng;, forked off from
-CURRENT in &rel.relengdate; and the
&rel2.stable; branch, &rel2.releng;, forked off from
-CURRENT in &rel2.relengdate;.How can I make the most of the data I see when my
kernel panics?Here is a typical kernel panic:Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x40
fault code = supervisor read, page not present
instruction pointer = 0x8:0xf014a7e5
stack pointer = 0x10:0xf4ed6f24
frame pointer = 0x10:0xf4ed6f28
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 80 (mount)
interrupt mask =
trap number = 12
panic: page faultThis message is not enough. While the instruction
pointer value is important, it is also configuration
dependent as it varies depending on the kernel image.
If it is a GENERIC kernel
image from one of the snapshots, it is possible for
somebody else to track down the offending function, but
for a custom kernel, only you can tell us where the fault
occurred.To proceed:Write down the instruction pointer value. Note
that the 0x8: part at the beginning
is not significant in this case: it is the
0xf0xxxxxx part that we
want.When the system reboots, do the following:&prompt.user; nm -n kernel.that.caused.the.panic | grep f0xxxxxxwhere f0xxxxxx is the
instruction pointer value. The odds are you will not
get an exact match since the symbols in the kernel
symbol table are for the entry points of functions and
the instruction pointer address will be somewhere
inside a function, not at the start. If you do not
get an exact match, omit the last digit from the
instruction pointer value and try again:&prompt.user; nm -n kernel.that.caused.the.panic | grep f0xxxxxIf that does not yield any results, chop off
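This truncate-and-retry search can be scripted. The sketch below is hypothetical: instead of a real kernel it uses a fabricated symbols.txt standing in for saved nm -n output, together with a made-up instruction pointer value.

```shell
# Hypothetical saved output of `nm -n kernel.that.caused.the.panic`:
cat > symbols.txt <<'EOF'
f0149abc T some_other_func
f014a123 T suspected_func
f014b456 T later_func
EOF

# Start from the faulting instruction pointer (without the 0x8: prefix)
addr=f014a7e5
while [ -n "$addr" ]; do
    # Look for symbols matching the current (possibly truncated) address
    matches=$(grep "$addr" symbols.txt || true)
    if [ -n "$matches" ]; then
        printf '%s\n' "$matches"   # prints: f014a123 T suspected_func
        break
    fi
    addr=${addr%?}                 # no match: drop the last hex digit and retry
done
```

The loop stops at the first truncation that matches, yielding the candidate functions near the faulting address.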
another digit. Repeat until there is some sort of
output. The result will be a possible list of
functions which caused the panic. This is a less than
exact mechanism for tracking down the point of
failure, but it is better than nothing.However, the best way to track down the cause of a
panic is by capturing a crash dump, then using
&man.kgdb.1; to generate a stack trace on the crash
dump.In any case, the method is this:Make sure that the following line is included in
the kernel configuration file:makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbolsChange to the /usr/src
directory:&prompt.root; cd /usr/srcCompile the kernel:&prompt.root; make buildkernel KERNCONF=MYKERNELWait for &man.make.1; to finish compiling.&prompt.root; make installkernel KERNCONF=MYKERNELReboot.If KERNCONF is not included,
the GENERIC kernel will instead
be built and installed.The &man.make.1; process will have built two kernels:
/usr/obj/usr/src/sys/MYKERNEL/kernel
and
/usr/obj/usr/src/sys/MYKERNEL/kernel.debug.
kernel was installed as
/boot/kernel/kernel, while
kernel.debug can be used as the
source of debugging symbols for &man.kgdb.1;.To capture a crash dump, edit
/etc/rc.conf and set
dumpdev to point to either the swap
partition or AUTO. This will cause the
&man.rc.8; scripts to use the &man.dumpon.8; command to
enable crash dumps. This command can also be run
manually. After a panic, the crash dump can be recovered
using &man.savecore.8;; if dumpdev is
set in /etc/rc.conf, the &man.rc.8;
scripts will run &man.savecore.8; automatically and put
the crash dump in /var/crash.&os; crash dumps are usually the same size as
physical RAM. Therefore, make sure there is enough
space in /var/crash to hold the
dump. Alternatively, run &man.savecore.8; manually
and have it recover the crash dump to another directory
with more room. It is possible to limit the
size of the crash dump by using options
MAXMEM=N where
N is the size of kernel's
memory usage in KBs. For example, for 1 GB
of RAM, limit the kernel's memory usage to
128 MB, so that the crash dump size
will be 128 MB instead of 1 GB.Once the crash dump has been recovered, get a
stack trace as follows:&prompt.user; kgdb /usr/obj/usr/src/sys/MYKERNEL/kernel.debug /var/crash/vmcore.0(kgdb)backtraceNote that there may be several screens worth of
information. Ideally, use &man.script.1; to
capture all of them. Using the unstripped kernel image
with all the debug symbols should show the exact line of
kernel source code where the panic occurred. The stack
trace is usually read from the bottom up to trace
the exact sequence of events that led to the crash.
&man.kgdb.1; can also be used to print out the contents of
various variables or structures to examine the system
state at the time of the crash.If a second computer is available, &man.kgdb.1; can
be configured to do remote debugging, including setting
breakpoints and single-stepping through the kernel
code.If DDB is enabled and the
kernel drops into the debugger, a panic
and a crash dump can be forced by typing
panic at the ddb
prompt. It may stop in the debugger again during the
panic phase. If it does, type
continue and it will finish the crash
dump.Why has dlsym() stopped working
for ELF executables?The ELF toolchain does not, by default, make the
symbols defined in an executable visible to the dynamic
linker. Consequently dlsym()
searches on handles obtained from calls to
dlopen(NULL, flags) will fail to find
such symbols.To search, using
dlsym(), for symbols present in the
main executable of a process, link the
executable using the -export-dynamic
option to the ELF linker (&man.ld.1;).How can I increase or reduce the kernel address space
on i386?By default, the kernel address space is 1 GB
(2 GB for PAE) for i386. When running a
network-intensive server or using
ZFS, this will probably not be
enough.Add the following line to the kernel configuration
file to increase available space and rebuild the
kernel:options KVA_PAGES=NTo find the correct value of
N, divide the desired address
space size (in megabytes) by four. (For example, it is
512 for 2 GB.)AcknowledgmentsThis innocent little Frequently Asked Questions document has
been written, rewritten, edited, folded, spindled, mutilated,
eviscerated, contemplated, discombobulated, cogitated,
regurgitated, rebuilt, castigated, and reinvigorated over the
last decade, by a cast of hundreds if not thousands.
Repeatedly.We wish to thank every one of the people responsible, and we
encourage you to join
them in making this FAQ even
better.
diff --git a/en_US.ISO8859-1/books/fdp-primer/manpages/chapter.xml b/en_US.ISO8859-1/books/fdp-primer/manpages/chapter.xml
index b2a78f8dc1..c951cbd85e 100644
--- a/en_US.ISO8859-1/books/fdp-primer/manpages/chapter.xml
+++ b/en_US.ISO8859-1/books/fdp-primer/manpages/chapter.xml
@@ -1,746 +1,746 @@
Manual PagesIntroductionManual pages, commonly shortened to
man pages, were conceived as
readily-available reminders for command syntax, device driver
details, or configuration file formats. They have become an
extremely valuable quick-reference from the command line for
users, system administrators, and programmers.Although intended as reference material rather than
tutorials, the EXAMPLES sections of manual pages often
provide detailed use cases.Manual pages are generally shown interactively by the
&man.man.1; command. When the user types
man ls, a search is performed for a manual
page matching ls. The first matching result
is displayed.SectionsManual pages are grouped into sections.
Each section contains manual pages for a specific category of
documentation:Section NumberCategory1General Commands2System Calls3Library Functions4Kernel Interfaces5File Formats6Games7Miscellaneous8System Manager9Kernel DeveloperMarkupVarious markup forms and rendering programs have been used
for manual pages. &os; has used &man.groff.7; and the newer
&man.mandoc.1;. Most existing &os; manual pages, and all new
ones, use the &man.mdoc.7; form of markup. This is a simple
line-based markup that is reasonably expressive. It is mostly
semantic: parts of text are marked up for what they are, rather
than for how they should appear when rendered. There is some
appearance-based markup which is usually best avoided.Manual page source is usually interpreted and displayed to
the screen interactively. The source files can be ordinary text
files or compressed with &man.gzip.1; to save space.Manual pages can also be rendered to other formats,
including PostScript for printing or PDF
generation. See &man.man.1;.Manual Page SectionsManual pages are composed of several standard sections.
Each section has a title in upper case, and the sections for a
particular type of manual page appear in a specific order.
For a category 1 General Command manual page, the sections
are:Section NameDescriptionNAMEName of the commandSYNOPSISFormat of options and argumentsDESCRIPTIONDescription of purpose and usageENVIRONMENTEnvironment settings that affect
operationEXIT STATUSError codes returned on exitEXAMPLESExamples of usageCOMPATIBILITYCompatibility with other implementationsSEE ALSOCross-reference to related manual pagesSTANDARDSCompatibility with standards like POSIXHISTORYHistory of implementationBUGSKnown bugsAUTHORSPeople who created the command or wrote the
manual page.Some sections are optional, and the combination of
sections for a specific type of manual page varies. Examples of
the most common types are shown later in this chapter.Macros&man.mdoc.7; markup is based on
macros. Lines that begin with a dot
contain macro commands, each two or three letters long. For
example, consider this portion of the &man.ls.1; manual
page:
.Dd December 1, 2015
.Dt LS 1
.Sh NAME
.Nm ls
.Nd list directory contents
.Sh SYNOPSIS
.Nm
.Op Fl -libxo
.Op Fl ABCFGHILPRSTUWZabcdfghiklmnopqrstuwxy1,
.Op Fl D Ar format
.Op Ar
.Sh DESCRIPTION
For each operand that names a
.Ar file
of a type other than
directory,
.Nm
displays its name as well as any requested,
associated information.
For each operand that names a
.Ar file
of type directory,
.Nm
displays the names of files contained
within that directory, as well as any requested, associated
information.A Document date and
Document title are defined.A Section header for the NAME
section is defined. Then the Name
of the command and a one-line
Name description are defined.The SYNOPSIS section begins. This section describes
the command-line options and arguments accepted.Name (.Nm) has
already been defined, and repeating it here just displays
the defined value in the text.An Optional Flag called -libxo
is shown. The Fl macro adds a dash to
the beginning of flags, so this appears in the manual
page as --libxo.A long list of optional single-character flags is
shown.An optional -D flag is defined. If
the -D flag is given, it must be
followed by an Argument. The
argument is a format, a string that
tells &man.ls.1; what to display and how to display it.
Details on the format string are given later in the manual
page.
- A final optional argument is defined. Because no name
+ A final optional argument is defined. Since no name
is specified for the argument, the default of
file ... is used.The Section header for the
DESCRIPTION section is defined.When rendered with the command man ls,
the result displayed on the screen looks like this:LS(1) FreeBSD General Commands Manual LS(1)
NAME
ls — list directory contents
SYNOPSIS
ls [--libxo] [-ABCFGHILPRSTUWZabcdfghiklmnopqrstuwxy1,] [-D format]
[file ...]
DESCRIPTION
For each operand that names a file of a type other than directory, ls
displays its name as well as any requested, associated information. For
each operand that names a file of type directory, ls displays the names
of files contained within that directory, as well as any requested,
associated information.Optional values are shown inside square brackets.Markup GuidelinesThe &man.mdoc.7; markup language is not very strict. For
clarity and consistency, the &os; Documentation Project adds
some additional style guidelines:Only the first letter of macros is upper caseAlways use upper case for the first letter of a
macro and lower case for the remaining letters.Begin new sentences on new linesStart a new sentence on a new line, do not begin it
on the same line as an existing sentence.Update .Dd when making non-trivial
changes to a manual pageThe Document date informs the
reader about the last time the manual page was updated.
It is important to update whenever non-trivial changes
are made to the manual pages. Trivial changes like
spelling or punctuation fixes that do not affect usage
can be made without updating
.Dd.Give examplesShow the reader examples when possible. Even
trivial examples are valuable, because what is trivial
to the writer is not necessarily trivial to the reader.
Three examples are a good goal. A trivial example shows
the minimal requirements, a serious example shows actual
use, and an in-depth example demonstrates unusual or
non-obvious functionality.Include the BSD licenseInclude the BSD license on new manual pages. The
preferred license is available from the Committer's
Guide.Markup TricksAdd a space before punctuation on a line with
macros. Example:.Sh SEE ALSO
.Xr geom 4 ,
.Xr boot0cfg 8 ,
.Xr geom 8 ,
.Xr gptboot 8Note how the commas at the end of the
.Xr lines have been placed after a space.
The .Xr macro expects two parameters to
follow it: the name of an external manual page and a section
number. The space separates the punctuation from the section
number. Without the space, the external links would
incorrectly point to section 4, or
8,.Important MacrosSome very common macros will be shown here. For
more usage examples, see &man.mdoc.7;, &man.groff.mdoc.7;, or
search for actual use in
/usr/share/man/man* directories. For
example, to search for examples of the .BdBegin display macro:&prompt.user; find /usr/share/man/man* | xargs zgrep '.Bd'Organizational MacrosSome macros are used to define logical blocks of a
manual page.Organizational MacroUse.ShSection header. Followed by the name of
the section, traditionally all upper case.
Think of these as chapter titles..SsSubsection header. Followed by the name of
the subsection. Used to divide a
.Sh section into
subsections..BlBegin list. Start a list of items..ElEnd a list..BdBegin display. Begin a special area of
text, like an indented area..EdEnd display.Inline MacrosMany macros are used to mark up inline text.Inline MacroUse.NmName. Called with a name as a parameter on the
first use, then used later without the parameter to
display the name that has already been
defined..PaPath to a file. Used to mark up filenames and
directory paths.Sample Manual Page StructuresThis section shows minimal desired man page contents for
several common categories of manual pages.Section 1 or 8 CommandThe preferred basic structure for a section 1 or 8
command:.Dd August 25, 2017
.Dt EXAMPLECMD 8
.Os
.Sh NAME
.Nm examplecmd
.Nd "command to demonstrate section 1 and 8 man pages"
.Sh SYNOPSIS
.Nm
.Op Fl v
.Sh DESCRIPTION
The
.Nm
utility does nothing except demonstrate a trivial but complete
manual page for a section 1 or 8 command.
.Sh SEE ALSO
.Xr exampleconf 5
.Sh AUTHORS
.An Firstname Lastname Aq Mt flastname@example.comSection 4 Device DriverThe preferred basic structure for a section 4 device
driver:.Dd August 25, 2017
.Dt EXAMPLEDRIVER 4
.Os
.Sh NAME
.Nm exampledriver
.Nd "driver to demonstrate section 4 man pages"
.Sh SYNOPSIS
To compile this driver into the kernel, add this line to the
kernel configuration file:
.Bd -ragged -offset indent
.Cd "device exampledriver"
.Ed
.Pp
To load the driver as a module at boot, add this line to
.Xr loader.conf 5 :
.Bd -literal -offset indent
exampledriver_load="YES"
.Ed
.Sh DESCRIPTION
The
.Nm
driver provides an opportunity to show a skeleton or template
file for section 4 manual pages.
.Sh HARDWARE
The
.Nm
driver supports these cards from the aptly-named Nonexistent
Technologies:
.Pp
.Bl -bullet -compact
.It
NT X149.2 (single and dual port)
.It
NT X149.8 (single port)
.El
.Sh DIAGNOSTICS
.Bl -diag
.It "flashing green light"
Something bad happened.
.It "flashing red light"
Something really bad happened.
.It "solid black light"
Power cord is unplugged.
.El
.Sh SEE ALSO
.Xr example 8
.Sh HISTORY
The
.Nm
device driver first appeared in
.Fx 49.2 .
.Sh AUTHORS
.An Firstname Lastname Aq Mt flastname@example.comSection 5 Configuration FileThe preferred basic structure for a section 5
configuration file:.Dd August 25, 2017
.Dt EXAMPLECONF 5
.Os
.Sh NAME
.Nm example.conf
.Nd "config file to demonstrate section 5 man pages"
.Sh DESCRIPTION
.Nm
is an example configuration file.
.Sh SEE ALSO
.Xr example 8
.Sh AUTHORS
.An Firstname Lastname Aq Mt flastname@example.comTestingTesting a new manual page can be challenging. Fortunately
there are some tools that can assist in the task. Some of them,
like &man.man.1;, do not look in the current directory. It is a
good idea to prefix the filename with ./ if
the new manual page is in the current directory. An absolute
path can also be used.Use &man.mandoc.1;'s linter to check for parsing
errors:&prompt.user; mandoc -T lint ./mynewmanpage.8Use textproc/igor to proofread the
manual page:&prompt.user; igor ./mynewmanpage.8Use &man.man.1; to check the final result of your
changes:&prompt.user; man ./mynewmanpage.8You can use &man.col.1; to filter the output of
&man.man.1; and get rid of the backspace characters before
loading the result in your favorite editor for
spell checking:&prompt.user; man ./mynewmanpage.8 | col -b | vim -R -Spell-checking with fully-featured dictionaries is
encouraged, and can be accomplished by using
textproc/hunspell or
textproc/aspell combined with
textproc/en-hunspell or
textproc/en-aspell, respectively.
For instance:&prompt.user; aspell check --lang=en --mode=nroff ./mynewmanpage.8Example Manual Pages to Use as TemplatesSome manual pages are suitable as in-depth examples.Manual PagePath to Source Location&man.cp.1;/usr/src/bin/cp/cp.1&man.vt.4;/usr/src/share/man/man4/vt.4&man.crontab.5;/usr/src/usr.sbin/cron/crontab/crontab.5&man.gpart.8;/usr/src/sbin/geom/class/part/gpart.8ResourcesResources for manual page writers:&man.man.1;&man.mandoc.1;&man.groff.mdoc.7;Practical
UNIX Manuals: mdocHistory
of UNIX Manpages
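The testing steps above can be combined into a small wrapper. This is only a sketch, not an official tool: the `lint_manpage` function name is hypothetical, and the function simply skips any checker (mandoc, igor) that is not installed on the system.

```shell
# Hypothetical helper: run whichever of the checkers described above
# are installed on this system, skipping the ones that are not.
lint_manpage() {
    page=$1
    if [ ! -f "$page" ]; then
        echo "lint_manpage: $page: no such file" >&2
        return 1
    fi
    for checker in "mandoc -T lint" "igor"; do
        tool=${checker%% *}
        if command -v "$tool" >/dev/null 2>&1; then
            echo "==> $checker $page"
            # Word-splitting of $checker into command and flags is
            # intentional here.
            $checker "$page"
        else
            echo "==> $tool not installed, skipping"
        fi
    done
}
```

Usage would look like `lint_manpage ./mynewmanpage.8`, mirroring the `./` prefix recommended above for pages in the current directory.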
diff --git a/en_US.ISO8859-1/books/fdp-primer/po-translations/chapter.xml b/en_US.ISO8859-1/books/fdp-primer/po-translations/chapter.xml
index e89c3bde73..cc5681e7a5 100644
--- a/en_US.ISO8859-1/books/fdp-primer/po-translations/chapter.xml
+++ b/en_US.ISO8859-1/books/fdp-primer/po-translations/chapter.xml
@@ -1,921 +1,921 @@
PO TranslationsIntroductionThe GNU
gettext system offers
translators an easy way to create and maintain translations of
documents. Translatable strings are extracted from the original
document into a PO (Portable Object) file.
Translated versions of the strings are entered with a separate
editor. The strings can be used directly or built into a
complete translated version of the original document.Quick StartThe procedure shown in
is assumed to have
already been performed. The TRANSLATOR
option is required and already enabled by default in the
textproc/docproj port.This example shows the creation of a Spanish translation of
the short Leap
Seconds article.Install a PO EditorA PO editor is needed to edit
translation files. This example uses
editors/poedit.&prompt.root; cd /usr/ports/editors/poedit
&prompt.root; make install cleanInitial SetupWhen a new translation is first created, the directory
structure and Makefile must be created or
copied from the English original:Create a directory for the new translation. The
English article source is in
~/doc/en_US.ISO8859-1/articles/leap-seconds/.
The Spanish translation will go in
~/doc/es_ES.ISO8859-1/articles/leap-seconds/.
The path is the same except for the name of the language
directory.&prompt.user; svn mkdir --parents ~/doc/es_ES.ISO8859-1/articles/leap-seconds/Copy the Makefile from the original
document into the translation directory:&prompt.user; svn cp ~/doc/en_US.ISO8859-1/articles/leap-seconds/Makefile \
~/doc/es_ES.ISO8859-1/articles/leap-seconds/TranslationTranslating a document consists of two steps: extracting
translatable strings from the original document, and entering
translations for those strings. These steps are repeated
until the translator feels that enough of the document has
been translated to produce a usable translated
document.Extract the translatable strings from the original
English version into a PO file:&prompt.user; cd ~/doc/es_ES.ISO8859-1/articles/leap-seconds/
&prompt.user; make poUse a PO editor to enter translations
in the PO file. There are several
different editors available. poedit
from editors/poedit is shown
here.The PO file name is the
two-character language code followed by an underline and a
two-character region code. For Spanish, the file name is
es_ES.po.&prompt.user; poedit es_ES.poGenerating a Translated DocumentGenerate the translated document:&prompt.user; cd ~/doc/es_ES.ISO8859-1/articles/leap-seconds/
&prompt.user; make tranThe name of the generated document matches the name
of the English original, usually
article.xml for articles or
book.xml for books.Check the generated file by rendering it to
HTML and viewing it with a
web browser:&prompt.user; make FORMATS=html
&prompt.user; firefox article.htmlCreating New TranslationsThe first step to creating a new translated document is
locating or creating a directory to hold it. &os; puts
translated documents in a subdirectory named for their
language and region in the format
lang_REGION.
lang is a two-character lowercase
code. It is followed by an underscore character and then the
two-character uppercase REGION
code.
The translations are in subdirectories of the main
documentation directory, here assumed to be
~/doc/ as shown in
. For example, German
translations are located in
~/doc/de_DE.ISO8859-1/, and French
translations are in
~/doc/fr_FR.ISO8859-1/.Each language directory contains separate subdirectories
named for the type of documents, usually
articles/ and
books/.Combining these directory names gives the complete path to
an article or book. For example, the French translation of the
NanoBSD article is in
~/doc/fr_FR.ISO8859-1/articles/nanobsd/,
and the Mongolian translation of the Handbook is in
~/doc/mn_MN.UTF-8/books/handbook/.A new language directory must be created when translating
a document to a new language. If the language directory already
exists, only a subdirectory in the
articles/ or books/
directory is needed.&os; documentation builds are controlled by a
Makefile in the same directory. With
simple articles, the Makefile can often
just be copied verbatim from the original English directory.
The translation process combines multiple separate
book.xml and
chapter.xml files in books into a single
file, so the Makefile for book translations
must be copied and modified.Creating a Spanish Translation of the Porter's
HandbookCreate a new Spanish translation of the
Porter's
Handbook. The original is a book in
~/doc/en_US.ISO8859-1/books/porters-handbook/.The Spanish language books directory
~/doc/es_ES.ISO8859-1/books/ already
exists, so only a new subdirectory for the Porter's
Handbook is needed:&prompt.user; cd ~/doc/es_ES.ISO8859-1/books/
&prompt.user; svn mkdir porters-handbook
A porters-handbookCopy the Makefile from the
original book:&prompt.user; cd ~/doc/es_ES.ISO8859-1/books/porters-handbook
&prompt.user; svn cp ~/doc/en_US.ISO8859-1/books/porters-handbook/Makefile .
A MakefileModify the contents of the
Makefile to only expect a single
book.xml:#
# $FreeBSD$
#
# Build the FreeBSD Porter's Handbook.
#
MAINTAINER=doc@FreeBSD.org
DOC?= book
FORMATS?= html-split
INSTALL_COMPRESSED?= gz
INSTALL_ONLY_COMPRESSED?=
# XML content
SRCS= book.xml
# Images from the cross-document image library
IMAGES_LIB+= callouts/1.png
IMAGES_LIB+= callouts/2.png
IMAGES_LIB+= callouts/3.png
IMAGES_LIB+= callouts/4.png
IMAGES_LIB+= callouts/5.png
IMAGES_LIB+= callouts/6.png
IMAGES_LIB+= callouts/7.png
IMAGES_LIB+= callouts/8.png
IMAGES_LIB+= callouts/9.png
IMAGES_LIB+= callouts/10.png
IMAGES_LIB+= callouts/11.png
IMAGES_LIB+= callouts/12.png
IMAGES_LIB+= callouts/13.png
IMAGES_LIB+= callouts/14.png
IMAGES_LIB+= callouts/15.png
IMAGES_LIB+= callouts/16.png
IMAGES_LIB+= callouts/17.png
IMAGES_LIB+= callouts/18.png
IMAGES_LIB+= callouts/19.png
IMAGES_LIB+= callouts/20.png
IMAGES_LIB+= callouts/21.png
URL_RELPREFIX?= ../../../..
DOC_PREFIX?= ${.CURDIR}/../../..
.include "${DOC_PREFIX}/share/mk/doc.project.mk"Now the document structure is ready for the translator
to begin translating with
make po.Creating a French Translation of the
PGP Keys ArticleCreate a new French translation of the
PGP
Keys article. The original is an article in
~/doc/en_US.ISO8859-1/articles/pgpkeys/.The French language article directory
~/doc/fr_FR.ISO8859-1/articles/
already exists, so only a new subdirectory for the
PGP Keys article is needed:&prompt.user; cd ~/doc/fr_FR.ISO8859-1/articles/
&prompt.user; svn mkdir pgpkeys
A pgpkeysCopy the Makefile from the
original article:&prompt.user; cd ~/doc/fr_FR.ISO8859-1/articles/pgpkeys
&prompt.user; svn cp ~/doc/en_US.ISO8859-1/articles/pgpkeys/Makefile .
A MakefileCheck the contents of the
- Makefile. Because this is a simple
+ Makefile. As this is a simple
article, in this case the Makefile
can be used unchanged. The $&os;...$
version string on the second line will be replaced by the
version control system when this file is committed.#
# $FreeBSD$
#
# Article: PGP Keys
DOC?= article
FORMATS?= html
WITH_ARTICLE_TOC?= YES
INSTALL_COMPRESSED?= gz
INSTALL_ONLY_COMPRESSED?=
SRCS= article.xml
# To build with just key fingerprints, set FINGERPRINTS_ONLY.
URL_RELPREFIX?= ../../../..
DOC_PREFIX?= ${.CURDIR}/../../..
.include "${DOC_PREFIX}/share/mk/doc.project.mk"With the document structure complete, the
PO file can be created with
make po.TranslatingThe gettext system greatly
reduces the number of things that must be tracked by a
translator. Strings to be translated are extracted from the
original document into a PO file. Then a
PO editor is used to enter the translated
versions of each string.The &os; PO translation system does not
overwrite PO files, so the extraction step
can be run at any time to update the PO
file.A PO editor is used to edit the file.
editors/poedit is shown in
these examples because it is simple and has minimal
requirements. Other PO editors offer
features to make the job of translating easier. The Ports
Collection offers several of these editors, including
devel/gtranslator.It is important to preserve the PO file.
It contains all of the work that translators have done.Translating the Porter's Handbook to SpanishEnter Spanish translations of the contents of the Porter's
Handbook.Change to the Spanish Porter's Handbook directory and
update the PO file. The generated
PO file is called
es_ES.po as shown in
.&prompt.user; cd ~/doc/es_ES.ISO8859-1/books/porters-handbook
&prompt.user; make poEnter translations using a PO
editor:&prompt.user; poedit es_ES.poTips for TranslatorsPreserving XML TagsPreserve XML tags that are shown in
the English original.Preserving XML TagsEnglish original:If acronymNTPacronym is not being usedSpanish translation:Si acronymNTPacronym no se utilizaPreserving SpacesPreserve existing spaces at the beginning and end of
strings to be translated. The translated version must have
these spaces also.Verbatim TagsThe contents of some tags should be copied verbatim, not
translated:citerefentrycommandfilenameliteralmanvolnumorgnamepackageprogramlistingpromptrefentrytitlescreenuserinputvarname$FreeBSD$
StringsThe $FreeBSD$ version strings used in
files require special handling. In examples like
, these
strings are not meant to be expanded. The English documents
use &dollar; entities to avoid
including actual literal dollar signs in the file:&dollar;FreeBSD&dollar;The &dollar; entities are not seen
as dollar signs by the version control system and so the
string is not expanded into a version string.When a PO file is created, the
&dollar; entities used in examples are
replaced with actual dollar signs. The resulting literal
$FreeBSD$ string will be
wrongly expanded by the version control system when the file
is committed.The same technique as used in the English documents can be
used in the translation. The &dollar; entity
is used to replace the dollar sign in the translation entered
into the PO editor:&dollar;FreeBSD&dollar;Building a Translated DocumentA translated version of the original document can be created
at any time. Any untranslated portions of the original will be
included in English in the resulting document. Most
PO editors have an indicator that shows how
much of the translation has been completed. This makes it easy
for the translator to see when enough strings have been
translated to make building the final document
worthwhile.Building the Spanish Porter's HandbookBuild and preview the Spanish version of the Porter's
Handbook that was created in an earlier example.
- Build the translated document. Because the original
+ Build the translated document. As the original
is a book, the generated document is
book.xml.&prompt.user; cd ~/doc/es_ES.ISO8859-1/books/porters-handbook
&prompt.user; make tranRender the translated book.xml to
HTML and view it with
Firefox. This is the
same procedure used with the English version of the
documents, and other FORMATS can
be used here in the same way. See .&prompt.user; make FORMATS=html
&prompt.user; firefox book.htmlSubmitting the New TranslationPrepare the new translation files for submission. This
includes adding the files to the version control system, setting
additional properties on them, then creating a diff for
submission.The diff files created by these examples can be attached to
a documentation
bug report or code
review.Spanish Translation of the NanoBSD ArticleAdd a &os; version string comment as the first
line of the PO file:#$FreeBSD$Add the Makefile, the
PO file, and the generated
XML translation to
version control:&prompt.user; cd ~/doc/es_ES.ISO8859-1/articles/nanobsd/
&prompt.user; ls
Makefile article.xml es_ES.po
&prompt.user; svn add Makefile article.xml es_ES.po
A Makefile
A article.xml
A es_ES.poSet the
Subversionsvn:keywords properties on these files
to FreeBSD=%H so
$FreeBSD$ strings are
expanded into the path, revision, date, and author when
committed:&prompt.user; svn propset svn:keywords FreeBSD=%H Makefile article.xml es_ES.po
property 'svn:keywords' set on 'Makefile'
property 'svn:keywords' set on 'article.xml'
property 'svn:keywords' set on 'es_ES.po'Set the MIME types of the files.
These are text/xml for books and
articles, and
text/x-gettext-translation for the
PO file.&prompt.user; svn propset svn:mime-type text/x-gettext-translation es_ES.po
property 'svn:mime-type' set on 'es_ES.po'
&prompt.user; svn propset svn:mime-type text/xml article.xml
property 'svn:mime-type' set on 'article.xml'Create a diff of the new files from the
~/doc/ base directory so the full
path is shown with the filenames. This helps committers
identify the target language directory.&prompt.user; cd ~/doc
&prompt.user; svn diff es_ES.ISO8859-1/articles/nanobsd/ > /tmp/es_nanobsd.diffKorean UTF-8 Translation of the
Explaining-BSD ArticleAdd a &os; version string comment as the first
line of the PO file:#$FreeBSD$Add the Makefile, the
PO file, and the generated
XML translation to
version control:&prompt.user; cd ~/doc/ko_KR.UTF-8/articles/explaining-bsd/
&prompt.user; ls
Makefile article.xml ko_KR.po
&prompt.user; svn add Makefile article.xml ko_KR.po
A Makefile
A article.xml
A ko_KR.poSet the Subversionsvn:keywords properties on these files
to FreeBSD=%H so
$FreeBSD$ strings are
expanded into the path, revision, date, and author when
committed:&prompt.user; svn propset svn:keywords FreeBSD=%H Makefile article.xml ko_KR.po
property 'svn:keywords' set on 'Makefile'
property 'svn:keywords' set on 'article.xml'
property 'svn:keywords' set on 'ko_KR.po'Set the MIME types of the files.
- Because these files use the UTF-8
- character set, that is also specified. To prevent the
+ These files use the UTF-8
+ character set, so that is also specified. To prevent the
version control system from mistaking these files for
binary data, the fbsd:notbinary
property is also set:&prompt.user; svn propset svn:mime-type 'text/x-gettext-translation; charset=UTF-8' ko_KR.po
property 'svn:mime-type' set on 'ko_KR.po'
&prompt.user; svn propset fbsd:notbinary yes ko_KR.po
property 'fbsd:notbinary' set on 'ko_KR.po'
&prompt.user; svn propset svn:mime-type 'text/xml; charset=UTF-8' article.xml
property 'svn:mime-type' set on 'article.xml'
&prompt.user; svn propset fbsd:notbinary yes article.xml
property 'fbsd:notbinary' set on 'article.xml'Create a diff of these new files from the
~/doc/ base directory:&prompt.user; cd ~/doc
&prompt.user; svn diff ko_KR.UTF-8/articles/explaining-bsd > /tmp/ko-explaining.diff
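The lang_REGION naming convention used throughout this chapter can be sketched as a pair of small shell helpers. The function names (po_dir, po_file) are illustrative only, not part of the documentation build system; they assume the two-character language and region codes described earlier.

```shell
# Sketch of the naming convention: a lowercase two-character language
# code, an underscore, an uppercase two-character region code, and
# (for directories) the character-set encoding.
po_dir() {
    # $1 = language code, $2 = region code, $3 = encoding
    lang=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    region=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
    printf '%s_%s.%s\n' "$lang" "$region" "$3"
}

po_file() {
    # The PO file itself carries no encoding suffix, just lang_REGION.po.
    lang=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    region=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
    printf '%s_%s.po\n' "$lang" "$region"
}
```

For example, `po_dir es ES ISO8859-1` prints `es_ES.ISO8859-1` and `po_file ko KR` prints `ko_KR.po`, matching the directory and file names used in the examples above.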
diff --git a/en_US.ISO8859-1/books/fdp-primer/xml-primer/chapter.xml b/en_US.ISO8859-1/books/fdp-primer/xml-primer/chapter.xml
index 96d2c60d0a..a91251c25b 100644
--- a/en_US.ISO8859-1/books/fdp-primer/xml-primer/chapter.xml
+++ b/en_US.ISO8859-1/books/fdp-primer/xml-primer/chapter.xml
@@ -1,1423 +1,1423 @@
XML PrimerMost FDP documentation is written with
markup languages based on XML. This chapter
explains what that means, how to read and understand the
documentation source, and the XML techniques
used.Portions of this section were inspired by Mark Galassi's Get
Going With DocBook.OverviewIn the original days of computers, electronic text was
simple. There were a few character sets like
ASCII or EBCDIC, but that
was about it. Text was text, and what you saw really was what
you got. No frills, no formatting, no intelligence.Inevitably, this was not enough. When text is in a
machine-usable format, machines are expected to be able to use
and manipulate it intelligently. Authors want to indicate that
certain phrases should be emphasized, or added to a glossary, or
made into hyperlinks. Filenames could be shown in a
typewriter style font for viewing on screen, but
as italics when printed, or any of a myriad of
other options for presentation.It was once hoped that Artificial Intelligence (AI) would
make this easy. The computer would read the document and
automatically identify key phrases, filenames, text that the
reader should type in, examples, and more. Unfortunately, real
life has not happened quite like that, and computers still
require assistance before they can meaningfully process
text.More precisely, they need help identifying what is what.
Consider this text:
To remove /tmp/foo, use
&man.rm.1;.&prompt.user; rm /tmp/foo
It is easy to see which parts are filenames, which are
commands to be typed in, which parts are references to manual
pages, and so on. But the computer processing the document
cannot. For this we need markup.Markup is commonly used to describe
adding value or increasing cost.
The term takes on both these meanings when applied to text.
Markup is additional text included in the document,
distinguished from the document's content in some way, so that
programs that process the document can read the markup and use
it when making decisions about the document. Editors can hide
the markup from the user, so the user is not distracted by
it.The extra information stored in the markup
adds value to the document. Adding the
markup to the document must typically be done by a
person—after all, if computers could recognize the text
sufficiently well to add the markup then there would be no need
to add it in the first place. This
increases the cost (the effort required) to
create the document.The previous example is actually represented in this
document like this:<para>To remove <filename>/tmp/foo</filename>, use &man.rm.1;.</para>

<screen>&prompt.user; <userinput>rm /tmp/foo</userinput></screen>The markup is clearly separate from the content.Markup languages define what the markup means and how it
should be interpreted.Of course, one markup language might not be enough. A
markup language for technical documentation has very different
requirements than a markup language that is intended for cookery
recipes. This, in turn, would be very different from a markup
language used to describe poetry. What is really needed is a
first language used to write these other markup languages. A
meta markup language.This is exactly what the eXtensible Markup
Language (XML) is. Many markup languages
have been written in XML, including the two
most used by the FDP,
XHTML and DocBook.Each language definition is more properly called a grammar,
vocabulary, schema or Document Type Definition
(DTD). There are various languages to
specify an XML grammar, or
schema.A schema is a
complete specification of all the elements
that are allowed to appear, the order in which they should
appear, which elements are mandatory, which are optional, and so
forth. This makes it possible to write an
XML parser which reads
in both the schema and a document which claims to conform to the
schema. The parser can then confirm whether or not all the
elements required by the vocabulary are in the document in the
right order, and whether there are any errors in the markup.
This is normally referred to as
validating the document.Validation confirms that the choice of
elements, their ordering, and so on, conforms to that listed
in the grammar. It does not check
whether appropriate markup has been used
for the content. If all the filenames in a document were
marked up as function names, the parser would not flag this as
an error (assuming, of course, that the schema defines
elements for filenames and functions, and that they are
allowed to appear in the same place).Most contributions to the Documentation
Project will be content marked up in either
XHTML or DocBook, rather than alterations to
the schemas. For this reason, this book will not touch on how
to write a vocabulary.Elements, Tags, and AttributesAll the vocabularies written in XML share
certain characteristics. This is hardly surprising, as the
philosophy behind XML will inevitably show
through. One of the most obvious manifestations of this
philosophy is that of content and
elements.Documentation, whether it is a single web page, or a lengthy
book, is considered to consist of content. This content is then
divided and further subdivided into elements. The purpose of
adding markup is to name and identify the boundaries of these
elements for further processing.For example, consider a typical book. At the very top
level, the book is itself an element. This book
element obviously contains chapters, which can be considered to
be elements in their own right. Each chapter will contain more
elements, such as paragraphs, quotations, and footnotes. Each
paragraph might contain further elements, identifying content
that was direct speech, or the name of a character in the
story.It may be helpful to think of this as
chunking content. At the very top level is one
chunk, the book. Look a little deeper, and there are more
chunks, the individual chapters. These are chunked further into
paragraphs, footnotes, character names, and so on.Notice how this differentiation between different elements
of the content can be made without resorting to any
XML terms. It really is surprisingly
straightforward. This could be done with a highlighter pen and
a printout of the book, using different colors to indicate
different chunks of content.Of course, we do not have an electronic highlighter pen, so
we need some other way of indicating which element each piece of
content belongs to. In languages written in
XML (XHTML, DocBook, et
al) this is done by means of tags.A tag is used to identify where a particular element starts,
and where the element ends. The tag is not part of
- the element itself. Because each grammar was
+ the element itself. As each grammar was
normally written to mark up specific types of information, each
one will recognize different elements, and will therefore have
different names for the tags.For an element called
element-name the start tag will
normally look like <element-name>.
The corresponding closing tag for this element is </element-name>.Using an Element (Start and End Tags)XHTML has an element for indicating
that the content enclosed by the element is a paragraph,
called p.<p>This is a paragraph. It starts with the start tag for
the 'p' element, and it will end with the end tag for the 'p'
element.</p>

<p>This is another paragraph. But this one is much shorter.</p>Some elements have no content. For example, in
XHTML, a horizontal line can be included in
the document. For these empty elements,
XML introduced a shorthand form that is
completely equivalent to the two-tag version:Using an Element Without ContentXHTML has an element for indicating a
horizontal rule, called hr. This element
does not wrap content, so it looks like this:<p>One paragraph.</p>

<hr></hr>

<p>This is another paragraph. A horizontal rule separates this
from the previous paragraph.</p>The shorthand version consists of a single tag:<p>One paragraph.</p>

<hr/>

<p>This is another paragraph. A horizontal rule separates this
from the previous paragraph.</p>As shown above, elements can contain other elements. In the
book example earlier, the book element contained all the chapter
elements, which in turn contained all the paragraph elements,
and so on.Elements Within Elements; em<p>This is a simple <em>paragraph</em> where some
of the <em>words</em> have been <em>emphasized</em>.</p>The grammar consists of rules that describe which elements
can contain other elements, and exactly what they can
contain.People often confuse the terms tags and elements, and use
the terms as if they were interchangeable. They are
not.An element is a conceptual part of your document. An
element has a defined start and end. The tags mark where the
element starts and ends.When this document (or anyone else knowledgeable about
XML) refers to
the <p> tag
they mean the literal text consisting of the three characters
<, p, and
>. But the phrase
the p element refers to the
whole element.This distinction is very subtle. But
keep it in mind.Elements can have attributes. An attribute has a name and a
value, and is used for adding extra information to the element.
This might be information that indicates how the content should
be rendered, or might be something that uniquely identifies that
occurrence of the element, or it might be something else.An element's attributes are written
inside the start tag for that element, and
take the form
attribute-name="attribute-value".In XHTML, the p
element has an attribute called
align, which suggests an
alignment (justification) for the paragraph to the program
displaying the XHTML.The align attribute can
take one of four defined values, left,
center, right and
justify. If the attribute is not specified
then the default is left.Using an Element with an Attributep align="left"The inclusion of the align attribute
on this paragraph was superfluous, since the default is left.pp align="center"This may appear in the center.pSome attributes only take specific values, such as
left or justify. Others
allow any value.Single Quotes Around Attributesp align='right'I am on the right!pAttribute values in XML must be enclosed
in either single or double quotes. Double quotes are
traditional. Single quotes are useful when the attribute value
contains double quotes.Information about attributes, elements, and tags is stored
in catalog files. The Documentation Project uses standard
DocBook catalogs and includes additional catalogs for
&os;-specific features. Paths to the catalog files are defined
in an environment variable so they can be found by the document
build tools.To Do…Before running the examples in this document, install
textproc/docproj from the &os; Ports
Collection. This is a meta-port that
downloads and installs the standard programs and supporting
files needed by the Documentation Project. &man.csh.1; users
must use rehash for the shell to recognize
new programs after they have been installed, or log out and
then log back in again.Create example.xml, and enter
this text:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>An Example XHTML File</title>
  </head>

  <body>
    <p>This is a paragraph containing some text.</p>

    <p>This paragraph contains some more text.</p>

    <p align="right">This paragraph might be right-justified.</p>
  </body>
</html>Try to validate this file using an
XML parser.textproc/docproj
includes the xmllint
validating
parser.Use xmllint to validate the
document:&prompt.user; xmllint --valid --noout example.xmlxmllint returns without displaying
any output, showing that the document validated
successfully.See what happens when required elements are omitted.
Delete the line with the
<title> and
</title> tags, and re-run
the validation.&prompt.user; xmllint --valid --noout example.xml
example.xml:5: element head: validity error : Element head content does not follow the DTD, expecting ((script | style | meta | link | object | isindex)* , ((title , (script | style | meta | link | object | isindex)* , (base , (script | style | meta | link | object | isindex)*)?) | (base , (script | style | meta | link | object | isindex)* , title , (script | style | meta | link | object | isindex)*))), got ()This shows that the validation error comes from the
fifth line of the
example.xml file and that the
content of the head is
the part which does not follow the rules of the
XHTML grammar.Then xmllint shows the line where
the error was found and marks the exact character position
with a ^ sign.Replace the title element.The DOCTYPE DeclarationThe beginning of each document can specify the name of the
DTD to which the document conforms. This
DOCTYPE declaration is used by XML parsers to
identify the DTD and ensure that the document
does conform to it.A typical declaration for a document written to conform with
version 1.0 of the XHTML
DTD looks like this:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"That line contains a number of different components.<!The indicator shows
this is an XML declaration.DOCTYPEShows that this is an XML
declaration of the document type.htmlNames the first
element that
will appear in the document.PUBLIC "-//W3C//DTD XHTML 1.0
Transitional//EN"Lists the Formal Public Identifier
(FPI)
for the DTD to which this document
conforms. The XML parser uses this to
find the correct DTD when processing
this document.PUBLIC is not a part of the
FPI, but indicates to the
XML processor how to find the
DTD referenced in the
FPI. Other ways of telling the
XML parser how to find the
DTD are shown later."http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"A local filename or a URL to find
the DTD.>Ends the declaration and returns to the
document.Formal Public Identifiers
(FPIs)It is not necessary to know this, but it is useful
background, and might help debug problems when the
XML processor cannot locate the
DTD.FPIs must follow a specific
syntax:"Owner//Keyword Description//Language"OwnerThe owner of the FPI.The beginning of the string identifies the owner of
the FPI. For example, the
FPI
"ISO 8879:1986//ENTITIES Greek
Symbols//EN" lists
ISO 8879:1986 as being the owner for
the set of entities for Greek symbols.
ISO 8879:1986 is the International
Organization for Standardization
(ISO) number for the
SGML standard, the predecessor (and a
superset) of XML.Otherwise, this string will either look like
-//Owner
or
+//Owner
(notice the only difference is the leading
+ or -).If the string starts with - then
the owner information is unregistered, with a
+ identifying it as
registered.ISO 9070:1991 defines how
registered names are generated. It might be derived
from the number of an ISO
publication, an ISBN code, or an
organization code assigned according to
ISO 6523. Additionally, a
registration authority could be created in order to
assign registered names. The ISO
council delegated this to the American National
Standards Institute (ANSI).
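As an illustration, the two owner forms might look like this (the &os; string matches the FPIs used in the FDP; the ISBN-based string is a hypothetical placeholder, not a real registration):

```
-//FreeBSD//DTD DocBook XML V5.0-Based Extension//EN    owner "-//FreeBSD": unregistered
+//ISBN 0-00-000000-0//DTD Example DTD//EN              owner "+//ISBN ...": registered, ISBN-derived
```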
- Because the &os; Project has not been registered,
+ Since the &os; Project has not been registered,
the owner string is -//&os;. As seen
- in the example, the W3C are not a
+ in the example, the W3C is not a
registered owner either.KeywordThere are several keywords that indicate the type of
information in the file. Some of the most common
keywords are DTD,
ELEMENT, ENTITIES,
and TEXT. DTD is
used only for DTD files,
ELEMENT is usually used for
DTD fragments that contain only
entity or element declarations. TEXT
is used for XML content (text and
tags).DescriptionAny description can be given for the contents
of this file. This may include version numbers or any
short text that is meaningful and unique for the
XML system.LanguageAn ISO two-character code that
identifies the native language for the file.
EN is used for English.catalog FilesWith the syntax above, an XML
processor needs to have some way of turning the
FPI into the name of the file containing
the DTD. A catalog file (typically
called catalog) contains lines that map
FPIs to filenames. For example, if the
catalog file contained the line:PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "1.0/transitional.dtd"The XML processor knows that the
DTD is called
transitional.dtd in the
1.0 subdirectory of the directory that
held catalog.Examine the contents of
/usr/local/share/xml/dtd/xhtml/catalog.xml.
This is the catalog file for the XHTML
DTDs that were installed as part of the
textproc/docproj port.Alternatives to FPIsInstead of using an FPI to indicate the
DTD to which the document conforms (and
therefore, which file on the system contains the
DTD), the filename can be explicitly
specified.The syntax is slightly different:!DOCTYPE html SYSTEM "/path/to/file.dtd"The SYSTEM keyword indicates that the
XML processor should locate the
DTD in a system specific fashion. This
typically (but not always) means the DTD
will be provided as a filename.Using FPIs is preferred for reasons of
portability. If the SYSTEM identifier is
used, then the DTD must be provided and
kept in the same location for everyone.Escaping Back to XMLSome of the underlying XML syntax can be
useful within documents. For example, comments can be included
in the document, and will be ignored by the parser. Comments
are entered using XML syntax. Other uses for
XML syntax will be shown later.XML sections begin with a
<! tag and end with a
>. These sections contain instructions
for the parser rather than elements of the document. Everything
between these tags is XML syntax. The
DOCTYPE
declaration shown earlier is an example of
XML syntax included in the document.CommentsAn XML document may contain comments.
They may appear anywhere as long as they are not inside tags.
They are even allowed in some locations inside the
DTD (e.g., between entity
declarations).XML comments start with the string
<!-- and end with the
string -->.Here are some examples of valid XML
comments:XML Generic Comments<!-- This is inside the comment -->
<!--This is another comment-->
<!-- This is how you
write multiline comments -->
<p>A simple <!-- Comment inside an element's content --> paragraph.</p>XML comments may contain any strings
except --:Erroneous XML Comment<!-- This comment--is wrong -->To Do…Add some comments to
example.xml, and check that the file
still validates using xmllint.Add some invalid comments to
example.xml, and see the error
messages that xmllint gives when it
encounters an invalid comment.EntitiesEntities are a mechanism for assigning names to chunks of
content. As an XML parser processes a
document, any entities it finds are replaced by the content of
the entity.This is a good way to have re-usable, easily changeable
chunks of content in XML documents. It is
also the only way to include one marked up file inside another
using XML.There are two types of entities for two different
situations: general entities and
parameter entities.General EntitiesGeneral entities are used to assign names to reusable
chunks of text. These entities can only be used in the
document. They cannot be used in an
XML context.To include the text of a general entity in the document,
include
&entity-name;
in the text. For example, consider a general entity called
current.version which expands to the
current version number of a product. To use it in the
document, write:<para>The current version of our product is
&current.version;.</para>When the version number changes, edit the definition of
the general entity, replacing the value. Then reprocess the
document.General entities can also be used to enter characters that
could not otherwise be included in an XML
document. For example, < and
& cannot normally appear in an
XML document. The XML
parser sees the < symbol as the start of
a tag. Likewise, when the & symbol is
seen, the next text is expected to be an entity name.These symbols can be included by using two predefined
general entities: &lt; and
&amp;.General entities can only be defined within an
XML context. Such definitions are usually
done immediately after the DOCTYPE declaration.Defining General Entities<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY current.version "3.0-RELEASE">
<!ENTITY last.version "2.2.7-RELEASE">
]>The DOCTYPE declaration has been extended by adding a
square bracket at the end of the first line. The two
entities are then defined over the next two lines, the
square bracket is closed, and then the DOCTYPE declaration
is closed.The square brackets are necessary to indicate that the
DTD indicated by the DOCTYPE declaration is being
extended.Parameter EntitiesParameter entities, like
general
entities, are used to assign names to reusable chunks
of text. But parameter entities can only be used within an
XML
context.Parameter entity definitions are similar to those for
general entities. However, parameter entities are included
with
%entity-name;.
The definition also includes the % between
the ENTITY keyword and the name of the
entity.For a mnemonic, think
Parameter entities use the
Percent symbol.Defining Parameter Entities<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY % entity "<!ENTITY version '1.0'>">
<!-- use the parameter entity -->
%entity;
]>At first sight, parameter entities do not look very
useful, but they make it possible to include other files into
an XML document.To Do…Add a general entity to
example.xml.<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY version "1.1">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>An Example XHTML File</title>
  </head>

  <!-- There may be some comments in here as well -->

  <body>
    <p>This is a paragraph containing some text.</p>

    <p>This paragraph contains some more text.</p>

    <p align="right">This paragraph might be right-justified.</p>

    <p>The current version of this document is: &version;</p>
  </body>
</html>Validate the document using
xmllint.Load example.xml into a web
browser. It may have to be copied to
example.html before the browser
recognizes it as an XHTML
document.Older browsers with simple parsers may not render this
file as expected. The entity reference
&version; may not be replaced by
the version number, or the XML context
closing ]> may not be recognized and
instead shown in the output.The solution is to normalize the
document with an XML normalizer. The
normalizer reads valid XML and writes
equally valid XML which has been
transformed in some way. One way the normalizer
transforms the input is by expanding all the entity
references in the document, replacing the entities with
the text that they represent.xmllint can be used for this. It
also has an option to drop the initial
DTD section so that the closing
]> does not confuse browsers:&prompt.user; xmllint --noent --dropdtd example.xml > example.htmlA normalized copy of the document with entities
expanded is produced in example.html,
ready to load into a web browser.Using Entities to Include FilesBoth
general and
parameter
entities are particularly useful for including one file inside
another.Using General Entities to Include FilesConsider some content for an XML book
organized into files, one file per chapter, called
chapter1.xml,
chapter2.xml, and so forth, with a
book.xml that will contain these
chapters.In order to use the contents of these files as the values
for entities, they are declared with the
SYSTEM keyword. This directs the
XML parser to include the contents of the
named file as the value of the entity.Using General Entities to Include Files<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY chapter.1 SYSTEM "chapter1.xml">
<!ENTITY chapter.2 SYSTEM "chapter2.xml">
<!ENTITY chapter.3 SYSTEM "chapter3.xml">
<!-- And so forth -->
]>
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- Use the entities to load in the chapters -->
&chapter.1;
&chapter.2;
&chapter.3;
</html>When using general entities to include other files
within a document, the files being included
(chapter1.xml,
chapter2.xml, and so on)
must not start with a DOCTYPE
declaration. This is a syntax error because entities are
low-level constructs and they are resolved before any
parsing happens.Using Parameter Entities to Include FilesParameter entities can only be used inside an
XML context. Including a file in an
XML context can be used
to ensure that general entities are reusable.Suppose that there are many chapters in the document, and
these chapters were reused in two different books, each book
organizing the chapters in a different fashion.The entities could be listed at the top of each book, but
that quickly becomes cumbersome to manage.Instead, place the general entity definitions inside one
file, and use a parameter entity to include that file within
the document.Using Parameter Entities to Include FilesPlace the entity definitions in a separate file
called chapters.ent and
containing this text:<!ENTITY chapter.1 SYSTEM "chapter1.xml">
<!ENTITY chapter.2 SYSTEM "chapter2.xml">
<!ENTITY chapter.3 SYSTEM "chapter3.xml">Create a parameter entity to refer to the contents
of the file. Then use the parameter entity to load the file
into the document, which will then make all the general
entities available for use. Then use the general entities
as before:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!-- Define a parameter entity to load in the chapter general entities -->
<!ENTITY % chapters SYSTEM "chapters.ent">
<!-- Now use the parameter entity to load in this file -->
%chapters;
]>
<html xmlns="http://www.w3.org/1999/xhtml">
&chapter.1;
&chapter.2;
&chapter.3;
</html>To Do…Use General Entities to Include FilesCreate three files, para1.xml,
para2.xml, and
para3.xml.Put content like this in each file:<p>This is the first paragraph.</p>Edit example.xml so that it
looks like this:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY version "1.1">
<!ENTITY para1 SYSTEM "para1.xml">
<!ENTITY para2 SYSTEM "para2.xml">
<!ENTITY para3 SYSTEM "para3.xml">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>An Example XHTML File</title>
  </head>

  <body>
    <p>The current version of this document is: &version;</p>

    &para1;
    &para2;
    &para3;
  </body>
</html>Produce example.html by
normalizing example.xml.&prompt.user; xmllint --dropdtd --noent example.xml > example.htmlLoad example.html into the web
browser and confirm that the
paran.xml
files have been included in
example.html.Use Parameter Entities to Include FilesThe previous steps must have completed before this
step.Edit example.xml so that it
looks like this:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY % entities SYSTEM "entities.ent"> %entities;
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>An Example XHTML File</title>
  </head>

  <body>
    <p>The current version of this document is: &version;</p>

    &para1;
    &para2;
    &para3;
  </body>
</html>Create a new file called
entities.ent with this
content:<!ENTITY version "1.1">
<!ENTITY para1 SYSTEM "para1.xml">
<!ENTITY para2 SYSTEM "para2.xml">
<!ENTITY para3 SYSTEM "para3.xml">Produce example.html by
normalizing example.xml.&prompt.user; xmllint --dropdtd --noent example.xml > example.htmlLoad example.html into the web
browser and confirm that the
paran.xml
files have been included in
example.html.Marked SectionsXML provides a mechanism to indicate that
particular pieces of the document should be processed in a
special way. These are called
marked sections.Structure of a Marked Section<![KEYWORD[
Contents of marked section
]]>As expected of an XML construct, a marked
section starts with <!.The first square bracket begins the marked section.KEYWORD describes how this marked
section is to be processed by the parser.The second square bracket indicates the start of the
marked section's content.The marked section is finished by closing the two square
brackets, and then returning to the document context from the
XML context with
>.Marked Section KeywordsCDATAThese keywords denote the marked section's
content model, and allow you to change
it from the default.When an XML parser is processing a
document, it keeps track of the
content model.The content model describes the
content the parser is expecting to see and what it will do
with that content.The CDATA content model is one of the
most useful.CDATA is for
Character Data. When the parser is in this
content model, it expects to see only characters. In this
model the < and
& symbols lose their special status,
and will be treated as ordinary characters.When using CDATA in examples of
text marked up in XML, remember that
the content of CDATA is not validated.
The included text must be checked by other means. For
example, the content could be written in another document,
validated, and then pasted into the
CDATA section.Using a CDATA Marked
SectionHere is an example of how to include some text that contains
many < and &
symbols. The sample text is a fragment of
XHTML. The surrounding text (para and
programlisting) are from DocBook.<![CDATA[<p>This is a sample that shows some of the
elements within <acronym>XHTML</acronym>. Since the angle
brackets are used so many times, it is simpler to say the whole
example is a CDATA marked section than to use the entity names for
the left and right angle brackets throughout.</p>

<ul>
  <li>This is a listitem</li>
  <li>This is a second listitem</li>
  <li>This is a third listitem</li>
</ul>

<p>This is the end of the example.</p>]]>INCLUDE and
IGNOREWhen the keyword is INCLUDE, then the
contents of the marked section will be processed. When the
keyword is IGNORE, the marked section
is ignored and will not be processed. It will not appear in
the output.Using INCLUDE and
IGNORE in Marked Sections<![INCLUDE[
This text will be processed and included.
]]>
<![IGNORE[
This text will not be processed or included.
]]>By itself, this is not too useful. Text to be
removed from the document could be cut out, or wrapped
in comments.It becomes more useful when controlled by
parameter
entities, yet this usage is limited
to entity files.For example, suppose that documentation was produced in
a hard-copy version and an electronic version. Some extra
text is desired in the electronic version content that was
not to appear in the hard-copy.Create an entity file that defines general entities to
include each chapter and guard these definitions with a
parameter entity that can be set to either
INCLUDE or IGNORE to
control whether the entity is defined. After these
conditional general entity definitions, place one more
definition for each general entity to set them to an empty
value. This technique makes use of the fact that entity
definitions cannot be overridden but the first definition
always takes effect. So the inclusion of the chapter is
controlled with the corresponding parameter entity. Set to
INCLUDE, the first general entity
definition will be read and the second one will be ignored.
Set to IGNORE, the first definition will
be ignored and the second one will take effect.Using a Parameter Entity to Control a Marked
Section<!ENTITY % electronic.copy "INCLUDE">
<![%electronic.copy;[
<!ENTITY chap.preface SYSTEM "preface.xml">
]]>
<!ENTITY chap.preface "">When producing the hard-copy version, change the
parameter entity's definition to:<!ENTITY % electronic.copy "IGNORE">To Do…Modify entities.ent to
contain the following:<!ENTITY version "1.1">
<!ENTITY % conditional.text "IGNORE">
<![%conditional.text;[
<!ENTITY para1 SYSTEM "para1.xml">
]]>
<!ENTITY para1 "">
<!ENTITY para2 SYSTEM "para2.xml">
<!ENTITY para3 SYSTEM "para3.xml">Normalize example.xml
and notice that the conditional text is not present in the
output document. Set the parameter entity
guard to INCLUDE and regenerate the
normalized document and the text will appear again.
This method makes sense if there are more
conditional chunks depending on the same condition. For
example, to control generating printed or online
text.ConclusionThat is the conclusion of this XML
primer. For reasons of space and complexity, several things
have not been covered in depth (or at all). However, the
previous sections cover enough XML to
introduce the organization of the FDP
documentation.
diff --git a/en_US.ISO8859-1/books/handbook/basics/chapter.xml b/en_US.ISO8859-1/books/handbook/basics/chapter.xml
index bc31008871..1be3270b7e 100644
--- a/en_US.ISO8859-1/books/handbook/basics/chapter.xml
+++ b/en_US.ISO8859-1/books/handbook/basics/chapter.xml
@@ -1,3417 +1,3417 @@
&os; BasicsSynopsisThis chapter covers the basic commands and functionality of
the &os; operating system. Much of this material is relevant
for any &unix;-like operating system. New &os; users are
encouraged to read through this chapter carefully.After reading this chapter, you will know:How to use and configure virtual consoles.How to create and manage users and groups on
&os;.How &unix; file permissions and &os; file flags
work.The default &os; file system layout.The &os; disk organization.How to mount and unmount file systems.What processes, daemons, and signals are.What a shell is, and how to change the default login
environment.How to use basic text editors.What devices and device nodes are.How to read manual pages for more information.Virtual Consoles and TerminalsUnless &os; has been configured to automatically start a
graphical environment during startup, the system will boot
into a command line login prompt, as seen in this
example:FreeBSD/amd64 (pc3.example.org) (ttyv0)
login:The first line contains some information about the system.
The amd64 indicates that the system in this
example is running a 64-bit version of &os;. The hostname is
pc3.example.org, and
ttyv0 indicates that this is the
system console. The second line is the login
prompt.Since &os; is a multiuser system, it needs some way to
distinguish between different users. This is accomplished by
requiring every user to log into the system before gaining
access to the programs on the system. Every user has a
unique user name and a personal
password.To log into the system console, type the username that
was configured during system installation, as described in
, and press
Enter. Then enter the password associated
with the username and press Enter. The
password is not echoed for security
reasons.Once the correct password is input, the message of the
day (MOTD) will be displayed followed
by a command prompt. Depending upon the shell that was
selected when the user was created, this prompt will be a
#, $, or
% character. The prompt indicates that
the user is now logged into the &os; system console and ready
to try the available commands.Virtual ConsolesWhile the system console can be used to interact with
the system, a user working from the command line at the
keyboard of a &os; system will typically instead log into a
virtual console. This is because system messages are
configured by default to display on the system console.
These messages will appear over the command or file that the
user is working on, making it difficult to concentrate on
the work at hand.By default, &os; is configured to provide several virtual
consoles for inputting commands. Each virtual console has
its own login prompt and shell and it is easy to switch
between virtual consoles. This essentially provides the
command line equivalent of having several windows open at the
same time in a graphical environment.The key combinations
AltF1
through
AltF8
have been reserved by &os; for switching between virtual
consoles. Use
AltF1
to switch to the system console
(ttyv0),
AltF2
to access the first virtual console
(ttyv1),
AltF3
to access the second virtual console
(ttyv2), and so on.
When using &xorg; as a graphical
console, the combination becomes CtrlAltF1 to return to a text-based virtual console.When switching from one console to the next, &os;
manages the screen output. The result is an illusion of
having multiple virtual screens and keyboards that can be used
to type commands for &os; to run. The programs that are
launched in one virtual console do not stop running when
the user switches to a different virtual console.Refer to &man.kbdcontrol.1;, &man.vidcontrol.1;,
&man.atkbd.4;, &man.syscons.4;, and &man.vt.4; for a more
technical description of the &os; console and its keyboard
drivers.In &os;, the number of available virtual consoles is
configured in this section of
/etc/ttys:# name getty type status comments
#
ttyv0 "/usr/libexec/getty Pc" xterm on secure
# Virtual terminals
ttyv1 "/usr/libexec/getty Pc" xterm on secure
ttyv2 "/usr/libexec/getty Pc" xterm on secure
ttyv3 "/usr/libexec/getty Pc" xterm on secure
ttyv4 "/usr/libexec/getty Pc" xterm on secure
ttyv5 "/usr/libexec/getty Pc" xterm on secure
ttyv6 "/usr/libexec/getty Pc" xterm on secure
ttyv7 "/usr/libexec/getty Pc" xterm on secure
ttyv8 "/usr/X11R6/bin/xdm -nodaemon" xterm off secureTo disable a virtual console, put a comment symbol
(#) at the beginning of the line
representing that virtual console. For example, to reduce the
number of available virtual consoles from eight to four, put a
# in front of the last four lines
representing virtual consoles ttyv5
through ttyv8.
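For example, an /etc/ttys edited this way might look like the following sketch (the getty arguments mirror the default listing shown above):

```
# Virtual terminals
ttyv1   "/usr/libexec/getty Pc"          xterm   on  secure
ttyv2   "/usr/libexec/getty Pc"          xterm   on  secure
ttyv3   "/usr/libexec/getty Pc"          xterm   on  secure
ttyv4   "/usr/libexec/getty Pc"          xterm   on  secure
#ttyv5  "/usr/libexec/getty Pc"          xterm   on  secure
#ttyv6  "/usr/libexec/getty Pc"          xterm   on  secure
#ttyv7  "/usr/libexec/getty Pc"          xterm   on  secure
#ttyv8  "/usr/X11R6/bin/xdm -nodaemon"   xterm   off secure
```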
Do not comment out the line for the
system console ttyv0. Note that the last
virtual console (ttyv8) is used to access
the graphical environment if &xorg;
has been installed and configured as described in
.For a detailed description of every column in this file
and the available options for the virtual consoles, refer to
&man.ttys.5;.Single User ModeThe &os; boot menu provides an option labelled as
Boot Single User. If this option is selected,
the system will boot into a special mode known as
single user mode. This mode is typically used
to repair a system that will not boot or to reset the
root password when
it is not known. While in single user mode, networking and
other virtual consoles are not available. However, full
root access to the
system is available, and by default, the
root password is not
needed. For these reasons, physical access to the keyboard is
needed to boot into this mode and determining who has physical
access to the keyboard is something to consider when securing
a &os; system.The settings which control single user mode are found in
this section of /etc/ttys:# name getty type status comments
#
# If console is marked "insecure", then init will ask for the root password
# when going to single-user mode.
console none unknown off secureBy default, the status is set to
secure. This assumes that physical access
to the keyboard is either unimportant or
controlled by a physical security policy. If this setting is
changed to insecure, the assumption is that
the environment itself is insecure because anyone can access
the keyboard. When this line is changed to
insecure, &os; will prompt for the
root password when a
user selects to boot into single user mode.Be careful when changing this setting to
insecure! If the
root password is
forgotten, booting into single user mode is still possible,
but may be difficult for someone who is not familiar with
the &os; booting process.Changing Console Video ModesThe &os; console default video mode may be adjusted to
1024x768, 1280x1024, or any other size supported by the
graphics chip and monitor. To use a different video mode
load the VESA module:&prompt.root; kldload vesaTo determine which video modes are supported by the
hardware, use &man.vidcontrol.1;. To get a list of supported
video modes issue the following:&prompt.root; vidcontrol -i modeThe output of this command lists the video modes that are
supported by the hardware. To select a new video mode,
specify the mode using &man.vidcontrol.1; as the
root user:&prompt.root; vidcontrol MODE_279If the new video mode is acceptable, it can be permanently
set on boot by adding it to
/etc/rc.conf:allscreens_flags="MODE_279"Users and Basic Account Management&os; allows multiple users to use the computer at the same
time. While only one user can sit in front of the screen and
use the keyboard at any one time, any number of users can log
in to the system through the network. To use the system, each
user should have their own user account.This chapter describes:The different types of user accounts on a
&os; system.How to add, remove, and modify user accounts.How to set limits to control the
resources that users and
groups are allowed to access.How to create groups and add users as members of a
group.Account TypesSince all access to the &os; system is achieved using
accounts and all processes are run by users, user and account
management is important.There are three main types of accounts: system accounts,
user accounts, and the superuser account.System AccountsSystem accounts are used to run services such as DNS,
mail, and web servers. The reason for this is security; if
all services ran as the superuser, they could act without
restriction.Examples of system accounts are
daemon,
operator,
bind,
news, and
www.Care must be taken when using the operator group, as
unintended superuser-like access privileges may be
granted, including but not limited to shutdown, reboot,
and access to all items in
/dev.nobody is the
generic unprivileged system account. However, the more
services that use
nobody, the more
files and processes that user will become associated with,
and hence the more privileged that user becomes.User AccountsUser accounts are assigned to real people and are used
to log in and use the system. Every person accessing the
system should have a unique user account. This allows the
administrator to find out who is doing what and prevents
users from clobbering the settings of other users.Each user can set up their own environment to
accommodate their use of the system, by configuring their
default shell, editor, key bindings, and language
settings.Every user account on a &os; system has certain
information associated with it:User nameThe user name is typed at the
login: prompt. Each user must have
a unique user name. There are a number of rules for
creating valid user names which are documented in
&man.passwd.5;. It is recommended to use user names
that consist of eight or fewer, all lower case
characters in order to maintain backwards
compatibility with applications.PasswordEach account has an associated password.User ID (UID)The User ID (UID) is a number
used to uniquely identify the user to the &os; system.
Commands that allow a user name to be specified will
first convert it to the UID. It is
recommended to use a UID less than 65535, since higher
values may cause compatibility issues with some
software.Group ID (GID)The Group ID (GID) is a number
used to uniquely identify the primary group that the
user belongs to. Groups are a mechanism for
controlling access to resources based on a user's
GID rather than their
UID. This can significantly reduce
the size of some configuration files and allows users
to be members of more than one group. It is
recommended to use a GID of 65535 or lower as higher
GIDs may break some software.Login classLogin classes are an extension to the group
mechanism that provide additional flexibility when
tailoring the system to different users. Login
classes are discussed further in
.Password change timeBy default, passwords do not expire. However,
password expiration can be enabled on a per-user
basis, forcing some or all users to change their
passwords after a certain amount of time has
elapsed.Account expiration timeBy default, &os; does not expire accounts. When
creating accounts that need a limited lifespan, such
as student accounts in a school, specify the account
expiry date using &man.pw.8;. After the expiry time
has elapsed, the account cannot be used to log in to
the system, although the account's directories and
files will remain.User's full nameThe user name uniquely identifies the account to
&os;, but does not necessarily reflect the user's real
name. Similar to a comment, this information can
contain spaces and uppercase characters, and can be more
than 8 characters long.Home directoryThe home directory is the full path to a directory
on the system. This is the user's starting directory
when the user logs in. A common convention is to put
all user home directories under /home/username
or /usr/home/username.
Each user stores their personal files and
subdirectories in their own home directory.User shellThe shell provides the user's default environment
for interacting with the system. There are many
different kinds of shells and experienced users will
have their own preferences, which can be reflected in
their account settings.The Superuser Accountaccountssuperuser (root)The superuser account, usually called
root, is used to
manage the system with no limitations on privileges. For
this reason, it should not be used for day-to-day tasks like
sending and receiving mail, general exploration of the
system, or programming.The superuser, unlike other user accounts, can operate
without limits, and misuse of the superuser account may
result in spectacular disasters. User accounts are unable
to destroy the operating system by mistake, so it is
recommended to login as a user account and to only become
the superuser when a command requires extra
privilege.Always double and triple-check any commands issued as
the superuser, since an extra space or missing character can
mean irreparable data loss.There are several ways to gain superuser privilege.
While one can log in as
root, this is
highly discouraged.Instead, use &man.su.1; to become the superuser. If
- is specified when running this command,
the user will also inherit the root user's environment. The
user running this command must be in the
wheel group or
else the command will fail. The user must also know the
password for the
root user
account.In this example, the user only becomes superuser in
order to run make install as this step
requires superuser privilege. Once the command completes,
the user types exit to leave the
superuser account and return to the privilege of their user
account.Install a Program As the Superuser&prompt.user; configure
&prompt.user; make
&prompt.user; su -
Password:
&prompt.root; make install
&prompt.root; exit
&prompt.user;The built-in &man.su.1; framework works well for single
systems or small networks with just one system
administrator. An alternative is to install the
security/sudo package or port. This
software provides activity logging and allows the
administrator to configure which users can run which
commands as the superuser.Managing Accountsaccountsmodifying&os; provides a variety of different commands to manage
user accounts. The most common commands are summarized in
, followed by some
examples of their usage. See the manual page for each utility
for more details and usage examples.
Utilities for Managing User AccountsCommandSummary&man.adduser.8;The recommended command-line application for
adding new users.&man.rmuser.8;The recommended command-line application for
removing users.&man.chpass.1;A flexible tool for changing user database
information.&man.passwd.1;The command-line tool to change user
passwords.&man.pw.8;A powerful and flexible tool for modifying all
aspects of user accounts.
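All of these utilities ultimately edit the same colon-delimited user database described in &man.passwd.5;. As a minimal sketch, the following portable shell fragment pulls a few fields out of an entry in the ten-field master.passwd format; the jru entry is a sample for illustration, not read from a live system:

```shell
# Extract the login name, UID, home directory, and shell from a sample
# entry in master.passwd(5) format (ten colon-delimited fields).
entry='jru:*:1001:1001::0:0:J. Random User:/home/jru:/usr/local/bin/zsh'
printf '%s\n' "$entry" | awk -F: '{print "name="$1, "uid="$3, "home="$9, "shell="$10}'
# prints name=jru uid=1001 home=/home/jru shell=/usr/local/bin/zsh
```

The commands summarized above manipulate these fields through a consistent interface instead of requiring the administrator to edit the files by hand.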
adduseraccountsaddingadduser/usr/share/skelskeleton directoryThe recommended program for adding new users is
&man.adduser.8;. When a new user is added, this program
automatically updates /etc/passwd and
/etc/group. It also creates a home
directory for the new user, copies in the default
configuration files from
/usr/share/skel, and can optionally
mail the new user a welcome message. This utility must be
run as the superuser.The &man.adduser.8; utility is interactive and walks
through the steps for creating a new user account. As seen
in , either input
the required information or press Return
to accept the default value shown in square brackets.
In this example, the user has been invited into the
wheel group,
allowing them to become the superuser with &man.su.1;.
When finished, the utility will prompt to either
create another user or to exit.Adding a User on &os;&prompt.root; adduser
Username: jru
Full name: J. Random User
Uid (Leave empty for default):
Login group [jru]:
Login group is jru. Invite jru into other groups? []: wheel
Login class [default]:
Shell (sh csh tcsh zsh nologin) [sh]: zsh
Home directory [/home/jru]:
Home directory permissions (Leave empty for default):
Use password-based authentication? [yes]:
Use an empty password? (yes/no) [no]:
Use a random password? (yes/no) [no]:
Enter password:
Enter password again:
Lock out the account after creation? [no]:
Username : jru
Password : ****
Full Name : J. Random User
Uid : 1001
Class :
Groups : jru wheel
Home : /home/jru
Shell : /usr/local/bin/zsh
Locked : no
OK? (yes/no): yes
adduser: INFO: Successfully added (jru) to the user database.
Add another user? (yes/no): no
Goodbye!
&prompt.root;Since the password is not echoed when typed, be
careful not to mistype the password when creating the user
account.rmuserrmuseraccountsremovingTo completely remove a user from the system, run
&man.rmuser.8; as the superuser. This command performs the
following steps:Removes the user's &man.crontab.1; entry, if one
exists.Removes any &man.at.1; jobs belonging to the
user.Kills all processes owned by the user.Removes the user from the system's local password
file.Optionally removes the user's home directory, if it
is owned by the user.Removes the incoming mail files belonging to the
user from /var/mail.Removes all files owned by the user from temporary
file storage areas such as
/tmp.Finally, removes the username from all groups to
which it belongs in /etc/group. If
a group becomes empty and the group name is the same as
the username, the group is removed. This complements
the per-user unique groups created by
&man.adduser.8;.&man.rmuser.8; cannot be used to remove superuser
accounts since that is almost always an indication of
massive destruction.By default, an interactive mode is used, as shown
in the following example.rmuser Interactive Account
Removal&prompt.root; rmuser jru
Matching password entry:
jru:*:1001:1001::0:0:J. Random User:/home/jru:/usr/local/bin/zsh
Is this the entry you wish to remove? y
Remove user's home directory (/home/jru)? y
Removing user (jru): mailspool home passwd.
&prompt.root;chpasschpassAny user can use &man.chpass.1; to change their default
shell and personal information associated with their user
account. The superuser can use this utility to change
additional account information for any user.When passed no options, aside from an optional username,
&man.chpass.1; displays an editor containing user
information. When the user exits from the editor, the user
database is updated with the new information.This utility will prompt for the user's password when
exiting the editor, unless the utility is run as the
superuser.In , the
superuser has typed chpass jru and is
now viewing the fields that can be changed for this user.
If jru runs this
command instead, only the last six fields will be displayed
and available for editing. This is shown in
.Using chpass as
Superuser#Changing user database information for jru.
Login: jru
Password: *
Uid [#]: 1001
Gid [# or name]: 1001
Change [month day year]:
Expire [month day year]:
Class:
Home directory: /home/jru
Shell: /usr/local/bin/zsh
Full Name: J. Random User
Office Location:
Office Phone:
Home Phone:
Other information:Using chpass as Regular
User#Changing user database information for jru.
Shell: /usr/local/bin/zsh
Full Name: J. Random User
Office Location:
Office Phone:
Home Phone:
Other information:The commands &man.chfn.1; and &man.chsh.1; are links
to &man.chpass.1;, as are &man.ypchpass.1;,
&man.ypchfn.1;, and &man.ypchsh.1;. Since
NIS support is automatic, specifying
the yp before the command is not
necessary. How to configure NIS is covered in .passwdpasswdaccountschanging passwordAny user can easily change their password using
&man.passwd.1;. To prevent accidental or unauthorized
changes, this command will prompt for the user's original
password before a new password can be set:Changing Your Password&prompt.user; passwd
Changing local password for jru.
Old password:
New password:
Retype new password:
passwd: updating the database...
passwd: doneThe superuser can change any user's password by
specifying the username when running &man.passwd.1;. When
this utility is run as the superuser, it will not prompt for
the user's current password. This allows the password to be
changed when a user cannot remember the original
password.Changing Another User's Password as the
Superuser&prompt.root; passwd jru
Changing local password for jru.
New password:
Retype new password:
passwd: updating the database...
passwd: doneAs with &man.chpass.1;, &man.yppasswd.1; is a link to
&man.passwd.1;, so NIS works with
either command.pwpwThe &man.pw.8; utility can create, remove,
modify, and display users and groups. It functions as a
front end to the system user and group files. &man.pw.8;
has a very powerful set of command line options that make it
suitable for use in shell scripts, but new users may find it
more complicated than the other commands presented in this
section.Managing Groupsgroups/etc/groupsaccountsgroupsA group is a list of users. A group is identified by its
group name and GID. In &os;, the kernel
uses the UID of a process, and the list of
groups it belongs to, to determine what the process is allowed
to do. Most of the time, the GID of a user
or process means the first group in the list.The group name to GID mapping is listed
in /etc/group. This is a plain text file
with four colon-delimited fields. The first field is the
group name, the second is the encrypted password, the third
the GID, and the fourth the comma-delimited
list of members. For a more complete description of the
syntax, refer to &man.group.5;.The superuser can modify /etc/group
using a text editor. Alternatively, &man.pw.8; can be used to
add and edit groups. For example, to add a group called
teamtwo and then
confirm that it exists:Adding a Group Using &man.pw.8;&prompt.root; pw groupadd teamtwo
&prompt.root; pw groupshow teamtwo
teamtwo:*:1100:In this example, 1100 is the
GID of
teamtwo. Right
now, teamtwo has no
members. This command will add
jru as a member of
teamtwo.Adding User Accounts to a New Group Using
&man.pw.8;&prompt.root; pw groupmod teamtwo -M jru
&prompt.root; pw groupshow teamtwo
teamtwo:*:1100:jruThe argument to -M is a comma-delimited
list of users to be added to a new (empty) group or to replace
the members of an existing group. To the user, this group
membership is different from (and in addition to) the user's
primary group listed in the password file. This means that
the user will not show up as a member when using
groupshow with &man.pw.8;, but will show up
when the information is queried via &man.id.1; or a similar
tool. When &man.pw.8; is used to add a user to a group, it
only manipulates /etc/group and does not
attempt to read additional data from
/etc/passwd.Adding a New Member to a Group Using &man.pw.8;&prompt.root; pw groupmod teamtwo -m db
&prompt.root; pw groupshow teamtwo
teamtwo:*:1100:jru,dbIn this example, the argument to -m is a
comma-delimited list of users who are to be added to the
group. Unlike the previous example, these users are appended
to the group and do not replace existing users in the
group.Using &man.id.1; to Determine Group Membership&prompt.user; id jru
uid=1001(jru) gid=1001(jru) groups=1001(jru),1100(teamtwo)In this example,
jru is a member of
the groups jru and
teamtwo.For more information about this command and the format of
/etc/group, refer to &man.pw.8; and
&man.group.5;.PermissionsUNIXIn &os;, every file and directory has an associated set of
permissions and several utilities are available for viewing
and modifying these permissions. Understanding how permissions
work is necessary to make sure that users are able to access
the files that they need and are unable to improperly access
the files used by the operating system or owned by other
users.This section discusses the traditional &unix; permissions
used in &os;. For finer grained file system access control,
refer to .In &unix;, basic permissions are assigned using
three types of access: read, write, and execute. These access
types are used to determine file access for the file's owner,
group, and others (everyone else). The read, write, and execute
permissions can be represented as the letters
r, w, and
x. They can also be represented as binary
numbers as each permission is either on (1) or off
(0). When represented as a number, the
order is always read as rwx, where
r has an on value of 4,
w has an on value of 2
and x has an on value of
1.Table 4.1 summarizes the possible numeric and alphabetic
settings. When reading the Directory
Listing column, a - is used to
represent a permission that is set to off.permissionsfile permissions
&unix; PermissionsValuePermissionDirectory Listing0No read, no write, no execute---1No read, no write, execute--x2No read, write, no execute-w-3No read, write, execute-wx4Read, no write, no executer--5Read, no write, executer-x6Read, write, no executerw-7Read, write, executerwx
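Each octal digit in the table is simply the sum of the enabled bits, where r counts as 4, w as 2, and x as 1. The following minimal sh sketch, which creates a throwaway file purely for illustration, builds the mode 644 this way and applies it with &man.chmod.1;:

```shell
# Each permission class is the sum of its enabled bits: r=4, w=2, x=1.
owner=$((4 + 2))   # rw- : read + write
group=$((4))       # r-- : read only
other=$((4))       # r-- : read only
mode="${owner}${group}${other}"
echo "$mode"       # prints 644

# Apply the computed mode to a temporary file and read it back
# from the first column of a long listing.
f=$(mktemp)
chmod "$mode" "$f"
ls -l "$f" | cut -c1-10   # prints -rw-r--r--
rm -f "$f"
```

Reading the first column of the long listing back confirms the mapping between the octal mode and its rwx representation.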
&man.ls.1;directoriesUse the -l argument to &man.ls.1; to view a
long directory listing that includes a column of information
about a file's permissions for the owner, group, and everyone
else. For example, an ls -l in an arbitrary
directory may show:&prompt.user; ls -l
total 530
-rw-r--r-- 1 root wheel 512 Sep 5 12:31 myfile
-rw-r--r-- 1 root wheel 512 Sep 5 12:31 otherfile
-rw-r--r-- 1 root wheel 7680 Sep 5 12:31 email.txtThe first (leftmost) character in the first column indicates
whether this file is a regular file, a directory, a special
character device, a socket, or any other special pseudo-file
device. In this example, the - indicates a
regular file. The next three characters, rw-
in this example, give the permissions for the owner of the file.
The next three characters, r--, give the
permissions for the group that the file belongs to. The final
three characters, r--, give the permissions
for the rest of the world. A dash means that the permission is
turned off. In this example, the permissions are set so the
owner can read and write to the file, the group can read the
file, and the rest of the world can only read the file.
According to the table above, the permissions for this file
would be 644, where each digit represents the
three parts of the file's permission.How does the system control permissions on devices? &os;
treats most hardware devices as files that programs can open,
read, and write data to. These special device files are
stored in /dev/.Directories are also treated as files. They have read,
write, and execute permissions. The executable bit for a
directory has a slightly different meaning than that of files.
When a directory is marked executable, it means it is possible
to change into that directory using &man.cd.1;. This also
means that it is possible to access the files within that
directory, subject to the permissions on the files
themselves.In order to perform a directory listing, the read permission
must be set on the directory. In order to delete a file that
one knows the name of, it is necessary to have write
and execute permissions to the directory
containing the file.There are more permission bits, but they are primarily used
in special circumstances such as setuid binaries and sticky
directories. For more information on file permissions and how
to set them, refer to &man.chmod.1;.Symbolic PermissionsTomRhodesContributed by permissionssymbolicSymbolic permissions use characters instead of octal
values to assign permissions to files or directories.
Symbolic permissions use the syntax of (who) (action)
(permissions), where the following values are
available:OptionLetterRepresents(who)uUser(who)gGroup owner(who)oOther(who)aAll (world)(action)+Adding permissions(action)-Removing permissions(action)=Explicitly set permissions(permissions)rRead(permissions)wWrite(permissions)xExecute(permissions)tSticky bit(permissions)sSet UID or GIDThese values are used with &man.chmod.1;, but with
letters instead of numbers. For example, the following
command would block other users from accessing
FILE:&prompt.user; chmod go= FILEA comma separated list can be provided when more than one
set of changes to a file must be made. For example, the
following command removes the group and
world write permission on
FILE, and adds the execute
permissions for everyone:&prompt.user; chmod go-w,a+x FILE&os; File FlagsTomRhodesContributed by In addition to file permissions, &os; supports the use of
file flags. These flags add an additional
level of security and control over files, but not directories.
With file flags, even
root can be
prevented from removing or altering files.File flags are modified using &man.chflags.1;. For
example, to enable the system undeletable flag on the file
file1, issue the following
command:&prompt.root; chflags sunlink file1To disable the system undeletable flag, put a
no in front of the
sunlink:&prompt.root; chflags nosunlink file1To view the flags of a file, use -lo with
&man.ls.1;:&prompt.root; ls -lo file1-rw-r--r-- 1 trhodes trhodes sunlnk 0 Mar 1 05:54 file1Several file flags may only be added or removed by the
root user. In other
cases, the file owner may set its file flags. Refer to
&man.chflags.1; and &man.chflags.2; for more
information.The setuid,
setgid, and sticky
PermissionsTomRhodesContributed by Other than the permissions already discussed, there are
three other specific settings that all administrators should
know about. They are the setuid,
setgid, and sticky
permissions.These settings are important for some &unix; operations
as they provide functionality not normally granted to normal
users. To understand them, the difference between the real
user ID and effective user ID must be noted.The real user ID is the UID who owns
or starts the process. The effective UID
is the user ID the process runs as. As an example,
&man.passwd.1; runs with the real user ID when a user changes
their password. However, in order to update the password
database, the command runs as the effective ID of the
root user. This
allows users to change their passwords without seeing a
Permission Denied error.The setuid permission may be set by prefixing a permission
set with the number four (4) as shown in the following
example:&prompt.root; chmod 4755 suidexample.shThe permissions on
suidexample.sh
now look like the following:-rwsr-xr-x 1 trhodes trhodes 63 Aug 29 06:36 suidexample.shNote that an s is now part of the
permission set designated for the file owner, replacing the
executable bit. This allows utilities which need elevated
permissions, such as &man.passwd.1;, to run with them.The nosuid &man.mount.8; option will
cause such binaries to silently fail without alerting
the user. That option is not completely reliable as a
nosuid wrapper may be able to circumvent
it.To view this in real time, open two terminals. On
one, type passwd as a normal user.
While it waits for a new password, check the process
table and look at the user information for
&man.passwd.1;:In terminal A:Changing local password for trhodes
Old Password:In terminal B:&prompt.root; ps aux | grep passwdtrhodes 5232 0.0 0.2 3420 1608 0 R+ 2:10AM 0:00.00 grep passwd
root 5211 0.0 0.2 3620 1724 2 I+ 2:09AM 0:00.01 passwdAlthough &man.passwd.1; is run as a normal user, it is
using the effective UID of
root.The setgid permission performs the
same function as the setuid permission;
except that it alters the group settings. When an application
or utility executes with this setting, it will be granted the
permissions based on the group that owns the file, not the
user who started the process.To set the setgid permission on a
file, provide &man.chmod.1; with a leading two (2):&prompt.root; chmod 2755 sgidexample.shIn the following listing, notice that the
s is now in the field designated for the
group permission settings:-rwxr-sr-x 1 trhodes trhodes 44 Aug 31 01:49 sgidexample.shIn these examples, even though the shell script in
question is an executable file, it will not run with
a different EUID or effective user ID.
This is because shell scripts may not access the
&man.setuid.2; system calls.The setuid and
setgid permission bits may lower system
security, by allowing for elevated permissions. The third
special permission, the sticky bit, can
strengthen the security of a system.When the sticky bit is set on a
directory, it allows file deletion only by the file owner.
This is useful to prevent file deletion in public directories,
such as /tmp, by users
who do not own the file. To utilize this permission, prefix
the permission set with a one (1):&prompt.root; chmod 1777 /tmpThe sticky bit permission will display
as a t at the very end of the permission
set:&prompt.root; ls -al / | grep tmpdrwxrwxrwt 10 root wheel 512 Aug 31 01:49 tmpDirectory Structuredirectory hierarchyThe &os; directory hierarchy is fundamental to obtaining
an overall understanding of the system. The most important
directory is root, or /. This directory is the
first one mounted at boot time and it contains the base system
necessary to prepare the operating system for multi-user
operation. The root directory also contains mount points for
other file systems that are mounted during the transition to
multi-user operation.A mount point is a directory where additional file systems
can be grafted onto a parent file system (usually the root file
system). This is further described in
. Standard mount points
include /usr/, /var/,
/tmp/, /mnt/, and
/cdrom/. These directories are usually
referenced to entries in /etc/fstab. This
file is a table of various file systems and mount points and is
read by the system. Most of the file systems in
/etc/fstab are mounted automatically at
boot time from the script &man.rc.8; unless their entry includes
noauto. Details can be found in
.A complete description of the file system hierarchy is
available in &man.hier.7;. The following table provides a brief
overview of the most common directories.DirectoryDescription/Root directory of the file system./bin/User utilities fundamental to both single-user
and multi-user environments./boot/Programs and configuration files used during
operating system bootstrap./boot/defaults/Default boot configuration files. Refer to
&man.loader.conf.5; for details./dev/Device nodes. Refer to &man.intro.4; for
details./etc/System configuration files and scripts./etc/defaults/Default system configuration files. Refer to
&man.rc.8; for details./etc/mail/Configuration files for mail transport agents
such as &man.sendmail.8;./etc/periodic/Scripts that run daily, weekly, and monthly,
via &man.cron.8;. Refer to &man.periodic.8; for
details./etc/ppp/&man.ppp.8; configuration files./mnt/Empty directory commonly used by system
administrators as a temporary mount point./proc/Process file system. Refer to &man.procfs.5;,
&man.mount.procfs.8; for details./rescue/Statically linked programs for emergency
recovery as described in &man.rescue.8;./root/Home directory for the
root
account./sbin/System programs and administration utilities
fundamental to both single-user and multi-user
environments./tmp/Temporary files which are usually
not preserved across a system
reboot. A memory-based file system is often mounted
at /tmp. This can be automated
using the tmpmfs-related variables of &man.rc.conf.5;
or with an entry in /etc/fstab;
refer to &man.mdmfs.8; for details./usr/The majority of user utilities and
applications./usr/bin/Common utilities, programming tools, and
applications./usr/include/Standard C include files./usr/lib/Archive libraries./usr/libdata/Miscellaneous utility data files./usr/libexec/System daemons and system utilities executed
by other programs./usr/local/Local executables and libraries. Also used as
the default destination for the &os; ports framework.
Within
/usr/local, the
general layout sketched out by &man.hier.7; for
/usr should be
used. Exceptions are the man directory, which is
directly under /usr/local rather than
under /usr/local/share, and
the ports documentation is in share/doc/port./usr/obj/Architecture-specific target tree produced by
building the /usr/src
tree./usr/ports/The &os; Ports Collection (optional)./usr/sbin/System daemons and system utilities executed
by users./usr/share/Architecture-independent files./usr/src/BSD and/or local source files./var/Multi-purpose log, temporary, transient, and
spool files. A memory-based file system is sometimes
mounted at
/var. This can
be automated using the varmfs-related variables in
&man.rc.conf.5; or with an entry in
/etc/fstab; refer to
&man.mdmfs.8; for details./var/log/Miscellaneous system log files./var/mail/User mailbox files./var/spool/Miscellaneous printer and mail system spooling
directories./var/tmp/Temporary files which are usually preserved
across a system reboot, unless
/var is a
memory-based file system./var/yp/NIS maps.Disk OrganizationThe smallest unit of organization that &os; uses to find
files is the filename. Filenames are case-sensitive, which
means that readme.txt and
README.TXT are two separate files. &os;
does not use the extension of a file to determine whether the
file is a program, document, or some other form of data.Files are stored in directories. A directory may contain no
files, or it may contain many hundreds of files. A directory
can also contain other directories, allowing a hierarchy of
directories within one another in order to organize
data.Files and directories are referenced by giving the file or
directory name, followed by a forward slash,
/, followed by any other directory names that
are necessary. For example, if the directory
foo contains a directory
bar which contains the
file readme.txt, the full name, or
path, to the file is
foo/bar/readme.txt. Note that this is
different from &windows; which uses \ to
separate file and directory names. &os; does not use drive
letters, or other drive names in the path. For example, one
would not type c:\foo\bar\readme.txt on
&os;.Directories and files are stored in a file system. Each
file system contains exactly one directory at the very top
level, called the root directory for that
file system. This root directory can contain other directories.
One file system is designated the
root file system or /.
Every other file system is mounted under
the root file system. No matter how many disks are on the &os;
system, every directory appears to be part of the same
disk.Consider three file systems, called A,
B, and C. Each file
system has one root directory, which contains two other
directories, called A1, A2
(and likewise B1, B2 and
C1, C2).Call A the root file system. If
&man.ls.1; is used to view the contents of this directory,
it will show two subdirectories, A1 and
A2. The directory tree looks like
this: /
|
+--- A1
|
`--- A2A file system must be mounted on to a directory in another
file system. When mounting file system B
on to the directory A1, the root directory
of B replaces A1, and
the directories in B appear
accordingly: /
|
+--- A1
| |
| +--- B1
| |
| `--- B2
|
`--- A2Any files that are in the B1 or
B2 directories can be reached with the path
/A1/B1 or
/A1/B2 as necessary. Any
files that were in /A1
have been temporarily hidden. They will reappear if
B is unmounted from
A.If B had been mounted on
A2 then the diagram would look like
this: /
|
+--- A1
|
`--- A2
|
+--- B1
|
`--- B2and the paths would be
/A2/B1 and
/A2/B2
respectively.File systems can be mounted on top of one another.
Continuing the last example, the C file
system could be mounted on top of the B1
directory in the B file system, leading to
this arrangement: /
|
+--- A1
|
`--- A2
|
+--- B1
| |
| +--- C1
| |
| `--- C2
|
`--- B2Or C could be mounted directly on to the
A file system, under the
A1 directory: /
|
+--- A1
| |
| +--- C1
| |
| `--- C2
|
`--- A2
|
+--- B1
|
`--- B2It is entirely possible to have one large root file system,
and not need to create any others. There are some drawbacks to
this approach, and one advantage.Benefits of Multiple File SystemsDifferent file systems can have different
mount options. For example, the root
file system can be mounted read-only, making it impossible
for users to inadvertently delete or edit a critical file.
Separating user-writable file systems, such as
/home, from other
file systems allows them to be mounted
nosuid. This option prevents the
suid/sgid bits
on executables stored on the file system from taking effect,
possibly improving security.&os; automatically optimizes the layout of files on a
file system, depending on how the file system is being used.
So a file system that contains many small files that are
written frequently will have a different optimization to one
that contains fewer, larger files. By having one big file
system this optimization breaks down.&os;'s file systems are robust if power is lost.
However, a power loss at a critical point could still damage
the structure of the file system. By splitting data over
multiple file systems it is more likely that the system will
still come up, making it easier to restore from backup as
necessary.Benefit of a Single File SystemFile systems are a fixed size. If you create a file
system when you install &os; and give it a specific size,
you may later discover that you need to make the partition
bigger. This is not easily accomplished without backing up,
recreating the file system with the new size, and then
restoring the backed up data.&os; features the &man.growfs.8; command, which makes
it possible to increase the size of a file system on the
fly, removing this limitation.File systems are contained in partitions. This does not
have the same meaning as the common usage of the term partition
(for example, &ms-dos; partition), because of &os;'s &unix;
heritage. Each partition is identified by a letter from
a through to h. Each
partition can contain only one file system, which means that
file systems are often described by either their typical mount
point in the file system hierarchy, or the letter of the
partition they are contained in.&os; also uses disk space for
swap space to provide
virtual memory. This allows your
computer to behave as though it has much more memory than it
actually does. When &os; runs out of memory, it moves some of
the data that is not currently being used to the swap space, and
moves it back in (moving something else out) when it needs
it.Some partitions have certain conventions associated with
them.PartitionConventionaNormally contains the root file system.bNormally contains swap space.cNormally the same size as the enclosing slice.
This allows utilities that need to work on the entire
slice, such as a bad block scanner, to work on the
c partition. A file system would not
normally be created on this partition.dPartition d used to have a
special meaning associated with it, although that is now
gone and d may work as any normal
partition.Disks in &os; are divided into slices, referred to in
&windows; as partitions, which are numbered from 1 to 4. These
are then divided into partitions, which contain file systems,
and are labeled using letters.slicespartitionsdangerously dedicatedSlice numbers follow the device name, prefixed with an
s, starting at 1. So
da0s1 is the first slice on
the first SCSI drive. There can only be four physical slices on
a disk, but there can be logical slices inside physical slices
of the appropriate type. These extended slices are numbered
starting at 5, so ada0s5 is
the first extended slice on the first SATA disk. These devices
are used by file systems that expect to occupy a slice.Slices, dangerously dedicated physical
drives, and other drives contain
partitions, which are represented as
letters from a to h. This
letter is appended to the device name, so
da0a is the
a partition on the first
da drive, which is
dangerously dedicated.
ada1s3e is the fifth
partition in the third slice of the second SATA disk
drive.Finally, each disk on the system is identified. A disk name
starts with a code that indicates the type of disk, and then a
number, indicating which disk it is. Unlike slices, disk
numbering starts at 0. Common codes are listed in
.When referring to a partition, include the disk name,
s, the slice number, and then the partition
letter. Examples are shown in
. shows a
conceptual model of a disk layout.When installing &os;, configure the disk slices, create
partitions within the slice to be used for &os;, create a file
system or swap space in each partition, and decide where each
file system will be mounted.
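The naming rules above are purely mechanical, so they can be illustrated with a small script. This is a minimal sketch, not a &os; utility; the devname helper function is hypothetical:

```shell
# Hypothetical helper: compose a FreeBSD-style device node name
# from a drive type, drive number, slice number, and partition letter.
devname() {
    # $1 = drive type (e.g. ada, da), $2 = drive number (counted from 0),
    # $3 = slice number (counted from 1), $4 = partition letter (a-h)
    printf '%s%ss%s%s\n' "$1" "$2" "$3" "$4"
}

devname ada 0 1 a   # → ada0s1a
devname da 1 2 e    # → da1s2e
```

The two calls reproduce the sample names discussed in this section: the a partition on the first slice of the first SATA disk, and the e partition on the second slice of the second SCSI disk.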
Disk Device NamesDrive TypeDrive Device NameSATA and IDE
hard drivesada or
adSCSI hard drives and
USB storage devicesdaSATA and IDE
CD-ROM drivescd or
acdSCSI CD-ROM
drivescdFloppy drivesfdAssorted non-standard CD-ROM
drivesmcd for Mitsumi
CD-ROM and scd for
Sony CD-ROM devicesSCSI tape drivessaIDE tape drivesastRAID drivesExamples include aacd for
&adaptec; AdvancedRAID, mlxd and
mlyd for &mylex;,
amrd for AMI &megaraid;,
idad for Compaq Smart RAID,
twed for &tm.3ware; RAID.
Sample Disk, Slice, and Partition NamesNameMeaningada0s1aThe first partition (a) on the
first slice (s1) on the first
SATA
disk (ada0).da1s2eThe fifth partition (e) on the
second slice (s2) on the second
SCSI disk (da1).Conceptual Model of a DiskThis diagram shows &os;'s view of the first
SATA disk attached to the system. Assume
that the disk is 250 GB in size, and contains an
80 GB slice and a 170 GB slice (&ms-dos;
partitions). The first slice contains a &windows;
NTFS file system, C:,
and the second slice contains a &os; installation. This
example &os; installation has four data partitions and a swap
partition.The four partitions each hold a file system. Partition
a is used for the root file system,
d for /var/,
e for /tmp/, and
f for /usr/.
Partition letter c refers to the entire
slice, and so is not used for ordinary partitions.Mounting and Unmounting File SystemsThe file system is best visualized as a tree, rooted, as it
were, at /.
/dev,
/usr, and the other
directories in the root directory are branches, which may have
their own branches, such as
/usr/local, and so
on.root file systemThere are various reasons to house some of these
directories on separate file systems.
/var contains the
directories log/,
spool/, and various types
of temporary files, and as such, may get filled up. Filling up
the root file system is not a good idea, so splitting
/var from
/ is often
favorable.Another common reason to contain certain directory trees on
other file systems is if they are to be housed on separate
physical disks, or are separate virtual disks, such as Network
File System mounts, described in ,
or CDROM drives.The fstab Filefile systemsmounted with fstabDuring the boot process (), file
systems listed in /etc/fstab are
automatically mounted except for the entries containing
. This file contains entries in the
following format:device/mount-pointfstypeoptionsdumpfreqpassnodeviceAn existing device name as explained in
.mount-pointAn existing directory on which to mount the file
system.fstypeThe file system type to pass to &man.mount.8;. The
default &os; file system is
ufs.optionsEither for read-write file
systems, or for read-only file
systems, followed by any other options that may be
needed. A common option is for
file systems not normally mounted during the boot
sequence. Other options are listed in
&man.mount.8;.dumpfreqUsed by &man.dump.8; to determine which file systems
require dumping. If the field is missing, a value of
zero is assumed.passnoDetermines the order in which file systems should be
checked. File systems that should be skipped should
have their passno set to zero. The
root file system needs to be checked before everything
else and should have its passno set
to one. The other file systems should be set to
values greater than one. If more than one file system
has the same passno, &man.fsck.8;
will attempt to check file systems in parallel if
possible.Refer to &man.fstab.5; for more information on the format
of /etc/fstab and its options.Using &man.mount.8;file systemsmountingFile systems are mounted using &man.mount.8;. The most
basic syntax is as follows:&prompt.root; mount devicemountpointThis command provides many options which are described in
&man.mount.8;. The most commonly used options include:Mount OptionsMount all the file systems listed in
/etc/fstab, except those marked as
noauto, excluded by the
flag, or those that are already
mounted.Do everything except for the actual mount system
call. This option is useful in conjunction with the
flag to determine what &man.mount.8;
is actually trying to do.Force the mount of an unclean file system
(dangerous), or the revocation of write access when
downgrading a file system's mount status from read-write
to read-only.Mount the file system read-only. This is identical
to using .fstypeMount the specified file system type or mount only
file systems of the given type, if
is included. ufs is the default file
system type.Update mount options on the file system.Be verbose.Mount the file system read-write.The following options can be passed to
as a comma-separated list:nosuidDo not interpret setuid or setgid flags on the
file system. This is also a useful security
option.Using &man.umount.8;file systemsunmountingTo unmount a file system use &man.umount.8;. This command
takes one parameter which can be a mountpoint, device name,
or .All forms take to force unmounting,
and for verbosity. Be warned that
is not generally a good idea as it might
crash the computer or damage data on the file system.To unmount all mounted file systems, or just the file
system types listed after , use
or . Note that
does not attempt to unmount the root file
system.Processes and Daemons&os; is a multi-tasking operating system. Each program
running at any one time is called a
process. Every running command starts
at least one new process and there are a number of system
processes that are run by &os;.Each process is uniquely identified by a number called a
process ID (PID).
Similar to files, each process has one owner and group, and
the owner and group permissions are used to determine which
files and devices the process can open. Most processes also
have a parent process that started them. For example, the
shell is a process, and any command started in the shell is a
process which has the shell as its parent process. The
exception is a special process called &man.init.8; which is
always the first process to start at boot time and which always
has a PID of 1.Some programs are not designed to be run with continuous
user input and disconnect from the terminal at the first
opportunity. For example, a web server responds to web
requests, rather than user input. Mail servers are another
example of this type of application. These types of programs
are known as daemons. The term daemon
comes from Greek mythology and represents an entity that is
neither good nor evil, and which invisibly performs useful
tasks. This is why the BSD mascot is the cheerful-looking
daemon with sneakers and a pitchfork.There is a convention to name programs that normally run as
daemons with a trailing d. For example,
BIND is the Berkeley Internet Name
Domain, but the actual program that executes is
named. The
Apache web server program is
httpd and the line printer spooling daemon
is lpd. This is only a naming convention.
For example, the main mail daemon for the
Sendmail application is
sendmail, and not
maild.Viewing ProcessesTo see the processes running on the system, use &man.ps.1;
or &man.top.1;. To display a static list of the currently
running processes, their PIDs, how much
memory they are using, and the command they were started with,
use &man.ps.1;. To display all the running processes and
update the display every few seconds in order to interactively
see what the computer is doing, use &man.top.1;.By default, &man.ps.1; only shows the commands that are
running and owned by the user. For example:&prompt.user; ps
PID TT STAT TIME COMMAND
8203 0 Ss 0:00.59 /bin/csh
8895 0 R+ 0:00.00 psThe output from &man.ps.1; is organized into a number of
columns. The PID column displays the
process ID. PIDs are assigned starting at
1, go up to 99999, then wrap around back to the beginning.
However, a PID is not reassigned if it is
already in use. The TT column shows the
tty the program is running on and STAT
shows the program's state. TIME is the
amount of time the program has been running on the CPU. This
is usually not the elapsed time since the program was started,
as most programs spend a lot of time waiting for things to
happen before they need to spend time on the CPU. Finally,
COMMAND is the command that was used to
start the program.A number of different options are available to change the
information that is displayed. One of the most useful sets is
auxww, where displays
information about all the running processes of all users,
displays the username and memory usage of
the process' owner, displays
information about daemon processes, and
causes &man.ps.1; to display the full command line for each
process, rather than truncating it once it gets too long to
fit on the screen.The output from &man.top.1; is similar:&prompt.user; top
last pid: 9609; load averages: 0.56, 0.45, 0.36 up 0+00:20:03 10:21:46
107 processes: 2 running, 104 sleeping, 1 zombie
CPU: 6.2% user, 0.1% nice, 8.2% system, 0.4% interrupt, 85.1% idle
Mem: 541M Active, 450M Inact, 1333M Wired, 4064K Cache, 1498M Free
ARC: 992M Total, 377M MFU, 589M MRU, 250K Anon, 5280K Header, 21M Other
Swap: 2048M Total, 2048M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
557 root 1 -21 r31 136M 42296K select 0 2:20 9.96% Xorg
8198 dru 2 52 0 449M 82736K select 3 0:08 5.96% kdeinit4
8311 dru 27 30 0 1150M 187M uwait 1 1:37 0.98% firefox
431 root 1 20 0 14268K 1728K select 0 0:06 0.98% moused
9551 dru 1 21 0 16600K 2660K CPU3 3 0:01 0.98% top
2357 dru 4 37 0 718M 141M select 0 0:21 0.00% kdeinit4
8705 dru 4 35 0 480M 98M select 2 0:20 0.00% kdeinit4
8076 dru 6 20 0 552M 113M uwait 0 0:12 0.00% soffice.bin
2623 root 1 30 10 12088K 1636K select 3 0:09 0.00% powerd
2338 dru 1 20 0 440M 84532K select 1 0:06 0.00% kwin
1427 dru 5 22 0 605M 86412K select 1 0:05 0.00% kdeinit4The output is split into two sections. The header (the
first five or six lines) shows the PID of
the last process to run, the system load averages (which are a
measure of how busy the system is), the system uptime (time
since the last reboot) and the current time. The other
figures in the header relate to how many processes are
running, how much memory and swap space has been used, and how
much time the system is spending in different CPU states. If
the ZFS file system module has been loaded,
an ARC line indicates how much data was
read from the memory cache instead of from disk.Below the header is a series of columns containing similar
information to the output from &man.ps.1;, such as the
PID, username, amount of CPU time, and the
command that started the process. By default, &man.top.1;
also displays the amount of memory space taken by the process.
This is split into two columns: one for total size and one for
resident size. Total size is how much memory the application
has needed and the resident size is how much it is actually
using now.&man.top.1; automatically updates the display every two
seconds. A different interval can be specified with
.Killing ProcessesOne way to communicate with any running process or daemon
is to send a signal using &man.kill.1;.
There are a number of different signals; some have a specific
meaning while others are described in the application's
documentation. A user can only send a signal to a process
they own and sending a signal to someone else's process will
result in a permission denied error. The exception is the
root user, who can
send signals to anyone's processes.The operating system can also send a signal to a process.
If an application is badly written and tries to access memory
that it is not supposed to, &os; will send the process the
Segmentation Violation signal
(SIGSEGV). If an application has been
written to use the &man.alarm.3; system call to be alerted
after a period of time has elapsed, it will be sent the
Alarm signal
(SIGALRM).Two signals can be used to stop a process:
SIGTERM and SIGKILL.
SIGTERM is the polite way to kill a process
as the process can read the signal, close any log files it may
have open, and attempt to finish what it is doing before
shutting down. In some cases, a process may ignore
SIGTERM if it is in the middle of some task
that cannot be interrupted.SIGKILL cannot be ignored by a
process. Sending a SIGKILL to a
process will usually stop that process there and then.
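The difference between the two signals can be observed from the shell. In this sketch, assumed to run under a POSIX &man.sh.1;, a child process installs a trap for SIGTERM and exits cleanly when the signal arrives:

```shell
# Start a child that catches SIGTERM; a trapped signal interrupts
# the child's wait and runs the handler.
sh -c 'trap "echo caught SIGTERM; exit 0" TERM
       sleep 30 >/dev/null 2>&1 &
       wait' &
child=$!
sleep 1                 # give the child time to install its trap
kill -s TERM "$child"   # polite request: the trap handler runs
wait "$child"           # the child exited voluntarily with status 0
```

Sending SIGKILL instead (kill -s KILL) would terminate the child immediately, and the trap handler would never get a chance to run.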
There are a few tasks that cannot be
interrupted. For example, if the process is trying to
read from a file that is on another computer on the
network, and the other computer is unavailable, the
process is said to be uninterruptible.
Eventually the process will time out, typically after two
minutes. As soon as this time out occurs the process will
be killed.Other commonly used signals are SIGHUP,
SIGUSR1, and SIGUSR2.
Since these are general purpose signals, different
applications will respond differently.For example, after changing a web server's configuration
file, the web server needs to be told to re-read its
configuration. Restarting httpd would
result in a brief outage period on the web server. Instead,
send the daemon the SIGHUP signal. Be
aware that different daemons will have different behavior, so
refer to the documentation for the daemon to determine if
SIGHUP will achieve the desired
results.Sending a Signal to a ProcessThis example shows how to send a signal to
&man.inetd.8;. The &man.inetd.8; configuration file is
/etc/inetd.conf, and &man.inetd.8; will
re-read this configuration file when it is sent a
SIGHUP.Find the PID of the process to send
the signal to using &man.pgrep.1;. In this example, the
PID for &man.inetd.8; is 198:&prompt.user; pgrep -l inetd
198 inetd -wW
- Use &man.kill.1; to send the signal. Because
+ Use &man.kill.1; to send the signal. As
&man.inetd.8; is owned by
root, use
&man.su.1; to become
root
first.&prompt.user; suPassword:
&prompt.root; /bin/kill -s HUP 198Like most &unix; commands, &man.kill.1; will not print
any output if it is successful. If a signal is sent to a
process not owned by that user, the message
kill: PID: Operation
not permitted will be displayed. Mistyping
the PID will either send the signal to
the wrong process, which could have negative results, or
will send the signal to a PID that is
not currently in use, resulting in the error
kill: PID: No such
process.Why Use /bin/kill?Many shells provide kill as a
built in command, meaning that the shell will send the
signal directly, rather than running
/bin/kill. Be aware that different
shells have a different syntax for specifying the name
of the signal to send. Rather than try to learn all of
them, it can be simpler to specify
/bin/kill.When sending other signals, substitute
TERM or KILL with the
name of the signal.Killing a random process on the system is a bad idea.
In particular, &man.init.8;, PID 1, is
special. Running /bin/kill -s KILL 1 is
a quick, and unrecommended, way to shut down the system.
Always double check the arguments to
&man.kill.1; before pressing
Return.Shellsshellscommand lineA shell provides a command line
interface for interacting with the operating system. A shell
receives commands from the input channel and executes them.
Many shells provide built in functions to help with everyday
tasks such as file management, file globbing, command line
editing, command macros, and environment variables. &os; comes
with several shells, including the Bourne shell (&man.sh.1;) and
the extended C shell (&man.tcsh.1;). Other shells are available
from the &os; Ports Collection, such as
zsh and bash.The shell that is used is really a matter of taste. A C
programmer might feel more comfortable with a C-like shell such
as &man.tcsh.1;. A &linux; user might prefer
bash. Each shell has unique properties that
may or may not work with a user's preferred working environment,
which is why there is a choice of which shell to use.One common shell feature is filename completion. After a
user types the first few letters of a command or filename and
presses Tab, the shell completes the rest of
the command or filename. Consider two files called
foobar and football.
To delete foobar, the user might type
rm foo and press Tab to
complete the filename.But the shell only shows rm foo. It was
unable to complete the filename because both
foobar and football
start with foo. Some shells sound a beep or
show all the choices if more than one name matches. The user
must then type more characters to identify the desired filename.
Typing a t and pressing Tab
again is enough to let the shell determine which filename is
desired and fill in the rest.environment variablesAnother feature of the shell is the use of environment
variables. Environment variables are key/value pairs stored
in the shell's environment. This environment can be read by any
program invoked by the shell, and thus contains a lot of program
configuration. provides a list
of common environment variables and their meanings. Note that
the names of environment variables are always in
uppercase.
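Several common environment variables, such as PATH, hold colon-separated lists that standard tools can split apart for inspection. A minimal POSIX-shell sketch, using a made-up search path rather than the live PATH:

```shell
# Print each directory of a colon-separated search path on its own line.
searchpath=/sbin:/bin:/usr/sbin:/usr/bin   # example value only
printf '%s\n' "$searchpath" | tr ':' '\n'
```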
Common Environment VariablesVariableDescriptionUSERCurrent logged in user's name.PATHColon-separated list of directories to search for
binaries.DISPLAYNetwork name of the
&xorg;
display to connect to, if available.SHELLThe current shell.TERMThe name of the user's type of terminal. Used to
determine the capabilities of the terminal.TERMCAPDatabase entry of the terminal escape codes to
perform various terminal functions.OSTYPEType of operating system.MACHTYPEThe system's CPU architecture.EDITORThe user's preferred text editor.PAGERThe user's preferred utility for viewing text one
page at a time.MANPATHColon-separated list of directories to search for
manual pages.
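One property worth noting: only variables that have been exported into the environment are visible to programs the shell starts. This can be demonstrated with a short sketch in &man.sh.1; syntax (the variable names here are arbitrary):

```shell
# A plain shell variable is private to the current shell, while an
# exported variable is inherited by child processes.
PRIVATE=hello           # shell variable only, not in the environment
export SHARED=world     # placed in the environment
sh -c 'echo "${PRIVATE:-unset} ${SHARED:-unset}"'
# → unset world
```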
Bourne shellsHow to set an environment variable differs between shells.
In &man.tcsh.1; and &man.csh.1;, use
setenv to set environment variables. In
&man.sh.1; and bash, use
export to set the current environment
variables. This example sets the default EDITOR
to /usr/local/bin/emacs for the
&man.tcsh.1; shell:&prompt.user; setenv EDITOR /usr/local/bin/emacsThe equivalent command for bash
would be:&prompt.user; export EDITOR="/usr/local/bin/emacs"To expand an environment variable in order to see its
current setting, type a $ character in front
of its name on the command line. For example,
echo $TERM displays the current
$TERM setting.Shells treat special characters, known as meta-characters,
as special representations of data. The most common
meta-character is *, which represents any
number of characters in a filename. Meta-characters can be used
to perform filename globbing. For example, echo
* is equivalent to ls because
the shell takes all the files that match *
and echo lists them on the command
line.To prevent the shell from interpreting a special character,
escape it from the shell by starting it with a backslash
(\). For example, echo
$TERM prints the terminal setting whereas
echo \$TERM literally prints the string
$TERM.Changing the ShellThe easiest way to permanently change the default shell is
to use chsh. Running this command will
open the editor that is configured in the
EDITOR environment variable, which by default
is set to &man.vi.1;. Change the Shell:
line to the full path of the new shell.Alternatively, use chsh -s, which will set
the specified shell without opening an editor. For example,
to change the shell to bash:&prompt.user; chsh -s /usr/local/bin/bashThe new shell must be present in
/etc/shells. If the shell was
installed from the &os; Ports Collection as described in
, it should be automatically added
to this file. If it is missing, add it using this command,
replacing the path with the path of the shell:&prompt.root; echo /usr/local/bin/bash >> /etc/shellsThen, rerun &man.chsh.1;.Advanced Shell TechniquesTomRhodesWritten by The &unix; shell is not just a command interpreter; it
acts as a powerful tool which allows users to execute
commands, redirect their output, redirect their input and
chain commands together to improve the final command output.
When this functionality is mixed with built in commands, the
user is provided with an environment that can maximize
efficiency.Shell redirection is the action of sending the output or
the input of a command into another command or into a file.
To capture the output of the &man.ls.1; command, for example,
into a file, redirect the output:&prompt.user; ls > directory_listing.txtThe directory contents will now be listed in
directory_listing.txt. Some commands can
be used to read input, such as &man.sort.1;. To sort this
listing, redirect the input:&prompt.user; sort < directory_listing.txtThe input will be sorted and placed on the screen. To
redirect that input into another file, one could redirect the
output of &man.sort.1; by mixing the direction:&prompt.user; sort < directory_listing.txt > sorted.txtIn all of the previous examples, the commands are
performing redirection using file descriptors. Every &unix;
system has file descriptors, which include standard input
(stdin), standard output (stdout), and standard error
(stderr). Each one has a purpose: input may come from a
keyboard or a mouse, output may go to a screen or to paper in a
printer, and error carries diagnostic or error messages. All
three are considered I/O based file
descriptors and are sometimes referred to as streams.Through the use of these descriptors, the shell allows
output and input to be passed around through various commands
and redirected to or from a file. Another method of
redirection is the pipe operator.The &unix; pipe operator, |, allows the
output of one command to be passed directly to
another program. Basically, a pipe allows the standard
output of a command to be passed as standard input to another
command, for example:&prompt.user; cat directory_listing.txt | sort | lessIn that example, the contents of
directory_listing.txt will be sorted and
the output passed to &man.less.1;. This allows the user to
scroll through the output at their own pace and prevent it
from scrolling off the screen.Text Editorstext editorseditors
- Most &os; configuration is done by editing text files.
- Because of this, it is a good idea to become familiar with a
+ Most &os; configuration is done by editing text files, so
+ it is a good idea to become familiar with a
text editor. &os; comes with a few as part of the base system,
and many more are available in the Ports Collection.eeeditors&man.ee.1;A simple editor to learn is &man.ee.1;, which stands for
easy editor. To start this editor, type ee
filename where
filename is the name of the file to
be edited. Once inside the editor, all of the commands for
manipulating the editor's functions are listed at the top of the
display. The caret (^) represents
Ctrl, so ^e expands to
Ctrl+e. To leave &man.ee.1;, press Esc,
then choose the leave editor option from the main
menu. The editor will prompt to save any changes if the file
has been modified.vieditorsemacs&os; also comes with more powerful text editors, such as
&man.vi.1;, as part of the base system. Other editors, like
editors/emacs and
editors/vim, are part of the
&os; Ports Collection. These editors offer more functionality
at the expense of being more complicated to learn. Learning a
more powerful editor such as vim or
Emacs can save more time in the long
run.Many applications which modify files or require typed input
will automatically open a text editor. To change the default
editor, set the EDITOR environment
variable as described in .Devices and Device NodesA device is a term used mostly for hardware-related
activities in a system, including disks, printers, graphics
cards, and keyboards. When &os; boots, the majority of the boot
messages refer to devices being detected. A copy of the boot
messages is saved to
/var/run/dmesg.boot.Each device has a device name and number. For example,
ada0 is the first SATA hard drive,
while kbd0 represents the
keyboard.Most devices in &os; must be accessed through special
files called device nodes, which are located in
/dev.Manual Pagesmanual pagesThe most comprehensive documentation on &os; is in the form
of manual pages. Nearly every program on the system comes with
a short reference manual explaining the basic operation and
available arguments. These manuals can be viewed using
man:&prompt.user; man commandwhere command is the name of the
command to learn about. For example, to learn more about
&man.ls.1;, type:&prompt.user; man lsManual pages are divided into sections which represent the
type of topic. In &os;, the following sections are
available:User commands.System calls and error numbers.Functions in the C libraries.Device drivers.File formats.Games and other diversions.Miscellaneous information.System maintenance and operation commands.System kernel interfaces.In some cases, the same topic may appear in more than one
section of the online manual. For example, there is a
chmod user command and a
chmod() system call. To tell &man.man.1;
which section to display, specify the section number:&prompt.user; man 1 chmodThis will display the manual page for the user command
&man.chmod.1;. References to a particular section of the
online manual are traditionally placed in parentheses in
written documentation, so &man.chmod.1; refers to the user
command and &man.chmod.2; refers to the system call.If the name of the manual page is unknown, use man
-k to search for keywords in the manual page
descriptions:&prompt.user; man -k mailThis command displays a list of commands that have the
keyword mail in their descriptions. This is
equivalent to using &man.apropos.1;.To read the descriptions for all of the commands in
/usr/bin, type:&prompt.user; cd /usr/bin
&prompt.user; man -f * | moreor&prompt.user; cd /usr/bin
&prompt.user; whatis * |moreGNU Info FilesFree Software Foundation&os; includes several applications and utilities produced
by the Free Software Foundation (FSF). In addition to manual
pages, these programs may include hypertext documents called
info files. These can be viewed using
&man.info.1; or, if editors/emacs is
installed, the info mode of
emacs.To use &man.info.1;, type:&prompt.user; infoFor a brief introduction, type h. For
a quick command reference, type ?.
diff --git a/en_US.ISO8859-1/books/handbook/boot/chapter.xml b/en_US.ISO8859-1/books/handbook/boot/chapter.xml
index aa0c741acb..2eead109e5 100644
--- a/en_US.ISO8859-1/books/handbook/boot/chapter.xml
+++ b/en_US.ISO8859-1/books/handbook/boot/chapter.xml
@@ -1,892 +1,892 @@
The &os; Booting ProcessSynopsisbootingbootstrapThe process of starting a computer and loading the operating
system is referred to as the bootstrap process,
or booting. &os;'s boot process provides a great
deal of flexibility in customizing what happens when the system
starts, including the ability to select from different operating
systems installed on the same computer, different versions of
the same operating system, or a different installed
kernel.This chapter details the configuration options that can be
set. It demonstrates how to customize the &os; boot process,
including everything that happens until the &os; kernel has
started, probed for devices, and started &man.init.8;. This
occurs when the text color of the boot messages changes from
bright white to grey.After reading this chapter, you will recognize:The components of the &os; bootstrap system and how they
interact.The options that can be passed to the components in the
&os; bootstrap in order to control the boot process.How to configure a customized boot splash screen.The basics of setting device hints.How to boot into single- and multi-user mode and how to
properly shut down a &os; system.This chapter only describes the boot process for &os;
running on x86 and amd64 systems.&os; Boot ProcessTurning on a computer and starting the operating system
poses an interesting dilemma. By definition, the computer does
not know how to do anything until the operating system is
started. This includes running programs from the disk. If the
computer cannot run a program from the disk without the
operating system, and the operating system programs are on the
disk, how is the operating system started?This problem parallels one in the book The
Adventures of Baron Munchausen. A character had
fallen part way down a manhole, and pulled himself out by
grabbing his bootstraps and lifting. In the early days of
computing, the term bootstrap was applied
to the mechanism used to load the operating system. It has
since become shortened to booting.BIOSBasic Input/Output
SystemBIOSOn x86 hardware, the Basic Input/Output System
(BIOS) is responsible for loading the
operating system. The BIOS looks on the hard
disk for the Master Boot Record (MBR), which
must be located in a specific place on the disk. The
BIOS has enough knowledge to load and run the
MBR, and assumes that the
MBR can then carry out the rest of the tasks
involved in loading the operating system, possibly with the help
of the BIOS.&os; provides for booting from both the older
MBR standard, and the newer GUID Partition
Table (GPT). GPT
partitioning is often found on computers with the Unified
Extensible Firmware Interface (UEFI).
However, &os; can boot from GPT partitions
even on machines with only a legacy BIOS
with &man.gptboot.8;. Work is under way to provide direct
UEFI booting.Master Boot Record
(MBR)Boot ManagerBoot LoaderThe code within the MBR is typically
referred to as a boot manager, especially
when it interacts with the user. The boot manager usually has
more code in the first track of the disk or within the file
system. Examples of boot managers include the standard &os;
boot manager boot0, also called
Boot Easy, and
Grub, which is used by many &linux;
distributions.If only one operating system is installed, the
MBR searches for the first bootable (active)
slice on the disk, and then runs the code on that slice to load
the remainder of the operating system. When multiple operating
systems are present, a different boot manager can be installed
to display a list of operating systems so the user
can select one to boot.The remainder of the &os; bootstrap system is divided into
three stages. The first stage knows just enough to get the
computer into a specific state and run the second stage. The
second stage can do a little bit more, before running the third
stage. The third stage finishes the task of loading the
operating system. The work is split into three stages because
the MBR puts limits on the size of the
programs that can be run at stages one and two. Chaining the
tasks together allows &os; to provide a more flexible
loader.kernel&man.init.8;The kernel is then started and begins to probe for devices
and initialize them for use. Once the kernel boot process is
finished, the kernel passes control to the user process
&man.init.8;, which makes sure the disks are in a usable state,
starts the user-level resource configuration which mounts file
systems, sets up network cards to communicate on the network,
and starts the processes which have been configured to run at
startup.This section describes these stages in more detail and
demonstrates how to interact with the &os; boot process.The Boot ManagerBoot ManagerMaster Boot Record
(MBR)The boot manager code in the MBR is
sometimes referred to as stage zero of
the boot process. By default, &os; uses the
boot0 boot manager.The MBR installed by the &os; installer
is based on /boot/boot0. The size and
capability of boot0 is restricted
to 446 bytes due to the slice table and
0x55AA identifier at the end of the
MBR. If boot0
and multiple operating systems are installed, a message
similar to this example will be displayed at boot time:boot0 ScreenshotF1 Win
F2 FreeBSD
Default: F2Other operating systems will overwrite an existing
MBR if they are installed after &os;. If
this happens, or to replace the existing
MBR with the &os; MBR,
use the following command:&prompt.root; fdisk -B -b /boot/boot0 devicewhere device is the boot disk,
such as ad0 for the first
IDE disk, ad2 for the
first IDE disk on a second
IDE controller, or da0
for the first SCSI disk. To create a
custom configuration of the MBR, refer to
&man.boot0cfg.8;.Stage One and Stage TwoConceptually, the first and second stages are part of the
- same program on the same area of the disk. Because of space
+ same program on the same area of the disk. Due to space
constraints, they have been split into two, but are always
installed together. They are copied from the combined
/boot/boot by the &os; installer or
bsdlabel.These two stages are located outside file systems, in the
first track of the boot slice, starting with the first sector.
This is where boot0, or any other
boot manager, expects to find a program to run which will
continue the boot process.The first stage, boot1, is very
simple, since it can only be 512 bytes in size. It knows just
enough about the &os; bsdlabel, which
stores information about the slice, to find and execute
boot2.Stage two, boot2, is slightly more
sophisticated, and understands the &os; file system enough to
find files. It can provide a simple interface to choose the
kernel or loader to run. It runs
loader, which is much more
sophisticated and provides a boot configuration file. If the
boot process is interrupted at stage two, the following
interactive screen is displayed:boot2 Screenshot>> FreeBSD/i386 BOOT
Default: 0:ad(0,a)/boot/loader
boot:To replace the installed boot1 and
boot2, use bsdlabel,
where diskslice is the disk and
slice to boot from, such as ad0s1 for the
first slice on the first IDE disk:&prompt.root; bsdlabel -B disksliceIf just the disk name is used, such as
ad0, bsdlabel will
create the disk in dangerously dedicated
mode, without slices. This is probably not the
desired action, so double check the
diskslice before pressing
Return.Stage Threeboot-loaderThe loader is the final stage
of the three-stage bootstrap process. It is located on the
file system, usually as
/boot/loader.The loader is intended as an
interactive method for configuration, using a built-in command
set, backed up by a more powerful interpreter which has a more
complex command set.During initialization, loader
will probe for a console and for disks, and figure out which
disk it is booting from. It will set variables accordingly,
and an interpreter is started where user commands can be
passed from a script or interactively.loaderloader configurationThe loader will then read
/boot/loader.rc, which by default reads
in /boot/defaults/loader.conf which sets
reasonable defaults for variables and reads
/boot/loader.conf for local changes to
those variables. loader.rc then acts on
these variables, loading whichever modules and kernel are
selected.Finally, by default, loader
issues a 10 second wait for key presses, and boots the kernel
if it is not interrupted. If interrupted, the user is
presented with a prompt which understands the command set,
where the user may adjust variables, unload all modules, load
modules, and then finally boot or reboot. lists the most commonly
used loader commands. For a
complete discussion of all available commands, refer to
&man.loader.8;.
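The variables that loader.rc acts on are normally set persistently in /boot/loader.conf. A minimal sketch, with variable names taken from loader.conf(5) and the module name chosen purely as an example:

```
# Illustrative /boot/loader.conf entries (variable names per loader.conf(5))
autoboot_delay="5"            # shorten the default 10 second countdown
kernel="kernel.old"           # boot a previously installed kernel
zfs_load="YES"                # example: load a kernel module at boot
```

Any variable set here can still be overridden interactively at the loader prompt before booting.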
Loader Built-In CommandsCommandDescriptionautoboot
secondsProceeds to boot the kernel if not interrupted
within the time span given, in seconds. It displays a
countdown, and the default time span is 10
seconds.boot
-optionskernelnameImmediately proceeds to boot the kernel, with
any specified options or kernel name. Providing a
kernel name on the command-line is only applicable
after an unload has been issued.
Otherwise, the previously-loaded kernel will be
used. If kernelname is not
qualified, it will be searched under
/boot/kernel and
/boot/modules.boot-confGoes through the same automatic configuration of
modules based on specified variables, most commonly
kernel. This only makes sense if
unload is used first, before
changing some variables.help
topicShows help messages read from
/boot/loader.help. If the topic
given is index, the list of
available topics is displayed.include filename
…Reads the specified file and interprets it line
by line. An error immediately stops the
include.load -t
typefilenameLoads the kernel, kernel module, or file of the
type given, with the specified filename. Any
arguments after filename
are passed to the file. If
filename is not qualified, it
will be searched under
/boot/kernel
and /boot/modules.ls -lpathDisplays a listing of files in the given path, or
the root directory, if the path is not specified. If
-l is specified, file sizes will
also be shown.lsdev -vLists all of the devices from which it may be
possible to load modules. If -v is
specified, more details are printed.lsmod -vDisplays loaded modules. If
-v is specified, more details are shown.more filenameDisplays the files specified, with a pause at
each LINES displayed.rebootImmediately reboots the system.set variable, set
variable=valueSets the specified environment variables.unloadRemoves all loaded modules.
Here are some practical examples of loader usage. To boot
the usual kernel in single-user mode
single-user
mode:boot -sTo unload the usual kernel and modules and then load the
previous or another, specified kernel:unloadload /path/to/kernelfileUse the qualified
/boot/GENERIC/kernel to refer to
the default kernel that comes with an installation, or
/boot/kernel.old/kernel, to refer to the
previously installed kernel before a system upgrade or before
configuring a custom kernel.Use the following to load the usual modules with another
kernel. Note that in this case it is not necessary to use the
qualified name:unloadset kernel="mykernel"boot-confTo load an automated kernel configuration script:load -t userconfig_script /boot/kernel.confkernelboot interactionLast Stage&man.init.8;Once the kernel is loaded by either
loader or by
boot2, which bypasses
loader, it examines any boot flags
and adjusts its behavior as necessary. lists the commonly used boot flags.
Refer to &man.boot.8; for more information on the other boot
flags.kernelbootflags
Kernel Interaction During BootOptionDescription-aDuring kernel initialization, ask for the device
to mount as the root file system.-CBoot the root file system from a
CDROM.-sBoot into single-user mode.-vBe more verbose during kernel startup.
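Each of these boot flags has a loader.conf(5) counterpart, so the same behavior can be made persistent instead of being requested at each boot. A sketch, assuming the stock variable names:

```
# Persistent equivalents of the boot flags in /boot/loader.conf
# (see loader.conf(5))
boot_single="YES"      # boot into single-user mode
boot_verbose="YES"     # be more verbose during kernel startup
```

Remove or set these back to "NO" to return to the normal multi-user, quiet boot.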
Once the kernel has finished booting, it passes control to
the user process &man.init.8;, which is located at
/sbin/init, or the program path specified
in the init_path variable in
loader. This is the last stage of the boot
process.The boot sequence makes sure that the file systems
available on the system are consistent. If a
UFS file system is not, and
fsck cannot fix the inconsistencies,
init drops the system into
single-user mode so that the system administrator can resolve
the problem directly. Otherwise, the system boots into
multi-user mode.Single-User Modesingle-user modeconsoleA user can specify this mode by booting with
or by setting the
boot_single variable in
loader. It can also be reached
by running shutdown now from multi-user
mode. Single-user mode begins with this message:Enter full pathname of shell or RETURN for /bin/sh:If the user presses Enter, the system
will enter the default Bourne shell. To specify a different
shell, input the full path to the shell.Single-user mode is usually used to repair a system that
will not boot due to an inconsistent file system or an error
in a boot configuration file. It can also be used to reset
the root password
when it is unknown. These actions are possible as the
single-user mode prompt gives full, local access to the
system and its configuration files. There is no networking
in this mode.While single-user mode is useful for repairing a system,
it poses a security risk unless the system is in a
physically secure location. By default, any user who can
gain physical access to a system will have full control of
that system after booting into single-user mode.If the system console is changed to
insecure in
/etc/ttys, the system will first prompt
for the root
password before initiating single-user mode. This adds a
measure of security while removing the ability to reset the
root password when
it is unknown.Configuring an Insecure Console in
/etc/ttys# name getty type status comments
#
# If console is marked "insecure", then init will ask for the root password
# when going to single-user mode.
console none unknown off insecureAn insecure console means that
physical security to the console is considered to be
insecure, so only someone who knows the root password may use
single-user mode.Multi-User Modemulti-user modeIf init finds the file
systems to be in order, or once the user has finished their
commands in single-user mode and has typed
exit to leave single-user mode, the
system enters multi-user mode, in which it starts the
resource configuration of the system.rc filesThe resource configuration system reads in configuration
defaults from /etc/defaults/rc.conf and
system-specific details from
/etc/rc.conf. It then proceeds to
mount the system file systems listed in
/etc/fstab. It starts up networking
services, miscellaneous system daemons, then the startup
scripts of locally installed packages.To learn more about the resource configuration system,
refer to &man.rc.8; and examine the scripts located in
/etc/rc.d.Configuring Boot Time Splash ScreensJoseph J.BarbishContributed by Typically when a &os; system boots, it displays its progress
as a series of messages at the console. A boot splash screen
creates an alternate boot screen that hides all of the boot
probe and service startup messages. A few boot loader messages,
including the boot options menu and a timed wait countdown
prompt, are displayed at boot time, even when the splash screen
is enabled. The display of the splash screen can be turned off
by hitting any key on the keyboard during the boot
process.There are two basic environments available in &os;. The
first is the default legacy virtual console command line
environment. After the system finishes booting, a console login
prompt is presented. The second environment is a configured
graphical environment. Refer to for more
information on how to install and configure a graphical display
manager and a graphical login manager.Once the system has booted, the splash screen defaults to
being a screen saver. After a time period of non-use, the
splash screen will display and will cycle through steps of
changing intensity of the image, from bright to very dark and
over again. The configuration of the splash screen saver can be
overridden by adding a saver= line to
/etc/rc.conf. Several built-in screen
savers are available and described in &man.splash.4;. The
saver= option only applies to virtual
consoles and has no effect on graphical display managers.By installing the
sysutils/bsd-splash-changer package or port,
a random splash image from a collection will display at boot.
The splash screen function supports 256-colors in the
bitmap (.bmp), ZSoft
PCX (.pcx), or
TheDraw (.bin) formats. The
.bmp, .pcx, or
.bin image has to be placed on the root
partition, for example in /boot. The
splash image files must have a resolution of 320 by 200 pixels
or less in order to work on standard VGA
adapters. For the default boot display resolution of 256-colors
and 320 by 200 pixels or less, add the following lines to
/boot/loader.conf. Replace
splash.bmp with the name of the
bitmap file to use:splash_bmp_load="YES"
bitmap_load="YES"
bitmap_name="/boot/splash.bmp"To use a PCX file instead of a bitmap
file:splash_pcx_load="YES"
bitmap_load="YES"
bitmap_name="/boot/splash.pcx"To instead use ASCII art in the TheDraw (https://en.wikipedia.org/wiki/TheDraw)
format:splash_txt="YES"
bitmap_load="YES"
bitmap_name="/boot/splash.bin"Other interesting loader.conf options
include:beastie_disable="YES"This will stop the boot options menu from being
displayed, but the timed wait count down prompt will still
be present. Even with the display of the boot options
menu disabled, entering an option selection at the timed
wait count down prompt will enact the corresponding boot
option.loader_logo="beastie"This will replace the default words
&os;, which are displayed to the right of
the boot options menu, with the colored beastie
logo.For more information, refer to &man.splash.4;,
&man.loader.conf.5;, and &man.vga.4;.Device HintsTomRhodesContributed by device.hintsDuring initial system startup, the boot &man.loader.8; reads
&man.device.hints.5;. This file stores kernel boot information
known as variables, sometimes referred to as
device hints. These device hints
are used by device drivers for device configuration.Device hints may also be specified at the Stage 3 boot
loader prompt, as demonstrated in .
Variables can be added using set, removed
with unset, and viewed with
show. Variables set in
/boot/device.hints can also be overridden.
Device hints entered at the boot loader are not permanent and
will not be applied on the next reboot.Once the system is booted, &man.kenv.1; can be used to dump
all of the variables.The syntax for /boot/device.hints
is one variable per line, using the hash
# as comment markers. Lines are constructed as
follows:hint.driver.unit.keyword="value"The syntax for the Stage 3 boot loader is:set hint.driver.unit.keyword=valuewhere driver is the device driver name,
unit is the device driver unit number, and
keyword is the hint keyword. The keyword may
consist of the following options:at: specifies the bus which the
device is attached to.port: specifies the start address of
the I/O to be used.irq: specifies the interrupt request
number to be used.drq: specifies the DMA channel
number.maddr: specifies the physical memory
address occupied by the device.flags: sets various flag bits for the
device.disabled: if set to
1 the device is disabled.Since device drivers may accept or require more hints not
listed here, viewing a driver's manual page is recommended.
For more information, refer to &man.device.hints.5;,
&man.kenv.1;, &man.loader.conf.5;, and &man.loader.8;.Shutdown Sequence&man.shutdown.8;Upon controlled shutdown using &man.shutdown.8;,
&man.init.8; will attempt to run the script
/etc/rc.shutdown, and then proceed to send
all processes the TERM signal, and
subsequently the KILL signal to any that do
not terminate in a timely manner.To power down a &os; machine on architectures and systems
that support power management, use
shutdown -p now to turn the power off
immediately. To reboot a &os; system, use
shutdown -r now. One must be
root or a member of
operator in order to
run &man.shutdown.8;. One can also use &man.halt.8; and
&man.reboot.8;. Refer to their manual pages and to
&man.shutdown.8; for more information.Modify group membership by referring to
.Power management requires &man.acpi.4; to be loaded as
a module or statically compiled into a custom kernel.
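The TERM-then-KILL escalation that &man.init.8; performs at shutdown can be sketched in a few lines of sh. The timings here are shortened for illustration and do not reflect init's actual intervals:

```shell
#!/bin/sh
# Sketch of init(8)'s shutdown signalling: send TERM first, and only
# escalate to KILL if the process does not exit in time.
sleep 300 &
pid=$!
kill -TERM "$pid"
wait "$pid" 2>/dev/null
status=$?
# An exit status of 128 + 15 (SIGTERM) shows the process died from TERM,
# so no KILL escalation was needed.
echo "exit status: $status"
```

A process that catches and ignores TERM would still be running at this point, which is when init falls back to the uncatchable KILL signal.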
diff --git a/en_US.ISO8859-1/books/handbook/config/chapter.xml b/en_US.ISO8859-1/books/handbook/config/chapter.xml
index dd0a3d7bc3..f7f24a65f4 100644
--- a/en_US.ISO8859-1/books/handbook/config/chapter.xml
+++ b/en_US.ISO8859-1/books/handbook/config/chapter.xml
@@ -1,3486 +1,3486 @@
Configuration and TuningChernLeeWritten by MikeSmithBased on a tutorial written by MattDillonAlso based on tuning(7) written by Synopsissystem configurationsystem optimizationOne of the important aspects of &os; is proper system
configuration. This chapter explains much of the &os;
configuration process, including some of the parameters which
can be set to tune a &os; system.After reading this chapter, you will know:The basics of rc.conf configuration
and /usr/local/etc/rc.d startup
scripts.How to configure and test a network card.How to configure virtual hosts on network
devices.How to use the various configuration files in
/etc.How to tune &os; using &man.sysctl.8; variables.How to tune disk performance and modify kernel
limitations.Before reading this chapter, you should:Understand &unix; and &os; basics
().Be familiar with the basics of kernel configuration and
compilation ().Starting ServicesTomRhodesContributed by servicesMany users install third party software on &os; from the
Ports Collection and require the installed services to be
started upon system initialization. Services, such as
mail/postfix or
www/apache22 are just two of the many
software packages which may be started during system
initialization. This section explains the procedures available
for starting third party software.In &os;, most included services, such as &man.cron.8;, are
started through the system startup scripts.Extended Application ConfigurationNow that &os; includes rc.d,
configuration of application startup is easier and provides
more features. Using the key words discussed in
, applications can be set to
start after certain other services and extra flags can be
passed through /etc/rc.conf in place of
hard coded flags in the startup script. A basic script may
look similar to the following:#!/bin/sh
#
# PROVIDE: utility
# REQUIRE: DAEMON
# KEYWORD: shutdown
. /etc/rc.subr
name=utility
rcvar=utility_enable
command="/usr/local/sbin/utility"
load_rc_config $name
#
# DO NOT CHANGE THESE DEFAULT VALUES HERE
# SET THEM IN THE /etc/rc.conf FILE
#
utility_enable=${utility_enable-"NO"}
pidfile=${utility_pidfile-"/var/run/utility.pid"}
run_rc_command "$1"This script will ensure that the provided
utility will be started after the
DAEMON pseudo-service. It also provides a
method for setting and tracking the process ID
(PID).This application could then have the following line placed
in /etc/rc.conf:utility_enable="YES"This method allows for easier manipulation of command
line arguments, inclusion of the default functions provided
in /etc/rc.subr, compatibility with
&man.rcorder.8;, and provides for easier configuration via
rc.conf.Using Services to Start ServicesOther services can be started using &man.inetd.8;.
Working with &man.inetd.8; and its configuration is
described in depth in
.In some cases, it may make more sense to use
&man.cron.8; to start system services. This approach
has a number of advantages as &man.cron.8; runs these
processes as the owner of the &man.crontab.5;. This allows
regular users to start and maintain their own
applications.The @reboot feature of &man.cron.8;
may be used in place of the time specification. This causes
the job to run when &man.cron.8; is started, normally during
system initialization.Configuring &man.cron.8;TomRhodesContributed by cronconfigurationOne of the most useful utilities in &os; is
cron. This utility runs in the
background and regularly checks
/etc/crontab for tasks to execute and
searches /var/cron/tabs for custom crontab
files. These files are used to schedule tasks which
cron runs at the specified times.
Each entry in a crontab defines a task to run and is known as a
cron job.Two different types of configuration files are used: the
system crontab, which should not be modified, and user crontabs,
which can be created and edited as needed. The format used by
these files is documented in &man.crontab.5;. The format of the
system crontab, /etc/crontab includes a
who column which does not exist in user
crontabs. In the system crontab,
cron runs the command as the user
specified in this column. In a user crontab, all commands run
as the user who created the crontab.User crontabs allow individual users to schedule their own
tasks. The root user
can also have a user crontab which can be
used to schedule tasks that do not exist in the system
crontab.Here is a sample entry from the system crontab,
/etc/crontab:# /etc/crontab - root's crontab for FreeBSD
#
# $FreeBSD$
#
SHELL=/bin/sh
PATH=/etc:/bin:/sbin:/usr/bin:/usr/sbin
#
#minute hour mday month wday who command
#
*/5 * * * * root /usr/libexec/atrun Lines that begin with the # character
are comments. A comment can be placed in the file as a
reminder of what and why a desired action is performed.
Comments cannot be on the same line as a command or else
they will be interpreted as part of the command; they must
be on a new line. Blank lines are ignored.The equals (=) character is used to
define any environment settings. In this example, it is
used to define the SHELL and
PATH. If the SHELL is
omitted, cron will use the
default Bourne shell. If the PATH is
omitted, the full path must be given to the command or
script to run.This line defines the seven fields used in a system
crontab: minute, hour,
mday, month,
wday, who, and
command. The minute
field is the time in minutes when the specified command will
be run, the hour is the hour when the
specified command will be run, the mday
is the day of the month, month is the
month, and wday is the day of the week.
These fields must be numeric values, representing the
twenty-four hour clock, or a *,
representing all values for that field. The
who field only exists in the system
crontab and specifies which user the command should be run
as. The last field is the command to be executed.This entry defines the values for this cron job. The
*/5, followed by several more
* characters, specifies that
/usr/libexec/atrun is invoked by
root every five
minutes of every hour, of every day and day of the week, of
every month.Commands can include any number of switches. However,
commands which extend to multiple lines need to be broken
with the backslash \ continuation
character.Creating a User CrontabTo create a user crontab, invoke
crontab in editor mode:&prompt.user; crontab -eThis will open the user's crontab using the default text
editor. The first time a user runs this command, it will open
an empty file. Once a user creates a crontab, this command
will open that file for editing.It is useful to add these lines to the top of the crontab
file in order to set the environment variables and to remember
the meanings of the fields in the crontab:SHELL=/bin/sh
PATH=/etc:/bin:/sbin:/usr/bin:/usr/sbin
# Order of crontab fields
# minute hour mday month wday commandThen add a line for each command or script to run,
specifying the time to run the command. This example runs the
specified custom Bourne shell script every day at two in the
afternoon. Since the path to the script is not specified in
PATH, the full path to the script is
given:0 14 * * * /usr/home/dru/bin/mycustomscript.shBefore using a custom script, make sure it is executable
and test it with the limited set of environment variables
set by cron. To replicate the environment that would be
used to run the above cron entry, use:env -i SHELL=/bin/sh PATH=/etc:/bin:/sbin:/usr/bin:/usr/sbin HOME=/home/dru LOGNAME=dru /usr/home/dru/bin/mycustomscript.shThe environment set by cron is discussed in
&man.crontab.5;. Checking that scripts operate correctly in
a cron environment is especially important if they include
any commands that delete files using wildcards.When finished editing the crontab, save the file. It
will automatically be installed and
cron will read the crontab and run
its cron jobs at their specified times. To list the cron jobs
in a crontab, use this command:&prompt.user; crontab -l
0 14 * * * /usr/home/dru/bin/mycustomscript.shTo remove all of the cron jobs in a user crontab:&prompt.user; crontab -r
remove crontab for dru? yManaging Services in &os;TomRhodesContributed by &os; uses the &man.rc.8; system of startup scripts during
system initialization and for managing services. The scripts
listed in /etc/rc.d provide basic services
which can be controlled with the start,
stop, and restart options to
&man.service.8;. For instance, &man.sshd.8; can be restarted
with the following command:&prompt.root; service sshd restartThis procedure can be used to start services on a running
system. Services will be started automatically at boot time
as specified in &man.rc.conf.5;. For example, to enable
&man.natd.8; at system startup, add the following line to
/etc/rc.conf:natd_enable="YES"If a line is already
present, change the NO to
YES. The &man.rc.8; scripts will
automatically load any dependent services during the next boot,
as described below.Since the &man.rc.8; system is primarily intended to start
and stop services at system startup and shutdown time, the
start, stop, and restart
options will only perform their action
if the appropriate /etc/rc.conf variable
is set. For instance, sshd restart will
only work if sshd_enable is set to
YES in /etc/rc.conf.
To start, stop, or restart
a service regardless of the settings
in /etc/rc.conf, these commands should be
prefixed with one. For instance, to restart
&man.sshd.8; regardless of the current
/etc/rc.conf setting, execute the following
command:&prompt.root; service sshd onerestartTo check if a service is enabled in
/etc/rc.conf, run the appropriate
&man.rc.8; script with rcvar. This example
checks to see if &man.sshd.8; is enabled in
/etc/rc.conf:&prompt.root; service sshd rcvar
# sshd
#
sshd_enable="YES"
# (default: "")The # sshd line is output from the
above command, not a
root console.To determine whether or not a service is running, use
status. For instance, to verify that
&man.sshd.8; is running:&prompt.root; service sshd status
sshd is running as pid 433.In some cases, it is also possible to
reload a service. This attempts to send a
signal to an individual service, forcing the service to reload
its configuration files. In most cases, this means sending
the service a SIGHUP signal. Support for
this feature is not included for every service.The &man.rc.8; system is used for network services and it
also contributes to most of the system initialization. For
instance, when the
/etc/rc.d/bgfsck script is executed, it
prints out the following message:Starting background file system checks in 60 seconds.This script is used for background file system checks,
which occur only during system initialization.Many system services depend on other services to function
properly. For example, &man.yp.8; and other
RPC-based services may fail to start until
after the &man.rpcbind.8; service has started. To resolve this
issue, information about dependencies and other meta-data is
included in the comments at the top of each startup script.
The &man.rcorder.8; program is used to parse these comments
during system initialization to determine the order in which
system services should be invoked to satisfy the
dependencies.The following key word must be included in all startup
scripts as it is required by &man.rc.subr.8; to
enable the startup script:PROVIDE: Specifies the services this
file provides.The following key words may be included at the top of each
startup script. They are not strictly necessary, but are
useful as hints to &man.rcorder.8;:REQUIRE: Lists services which are
required for this service. The script containing this key
word will run after the specified
services.BEFORE: Lists services which depend
on this service. The script containing this key word will
run before the specified
services.By carefully setting these keywords for each startup script,
an administrator has a fine-grained level of control of the
startup order of the scripts, without the need for
runlevels used by some &unix; operating
systems.Additional information can be found in &man.rc.8; and
&man.rc.subr.8;. Refer to this article
for instructions on how to create custom &man.rc.8;
scripts.Managing System-Specific Configurationrc filesrc.confThe principal location for system configuration
information is /etc/rc.conf. This file
contains a wide range of configuration information and it is
read at system startup to configure the system. It provides
the configuration information for the
rc* files.The entries in /etc/rc.conf override
the default settings in
/etc/defaults/rc.conf. The file
containing the default settings should not be edited.
Instead, all system-specific changes should be made to
/etc/rc.conf.A number of strategies may be applied in clustered
applications to separate site-wide configuration from
system-specific configuration in order to reduce
administration overhead. The recommended approach is to place
system-specific configuration into
/etc/rc.conf.local. For example, these
entries in /etc/rc.conf apply to all
systems:sshd_enable="YES"
keyrate="fast"
defaultrouter="10.1.1.254"Whereas these entries in
/etc/rc.conf.local apply to this system
only:hostname="node1.example.org"
ifconfig_fxp0="inet 10.1.1.1/8"Distribute /etc/rc.conf to every
system using an application such as
rsync or
puppet, while
/etc/rc.conf.local remains
unique.Upgrading the system will not overwrite
/etc/rc.conf, so system configuration
information will not be lost.Both /etc/rc.conf and
/etc/rc.conf.local
are parsed by &man.sh.1;. This allows system operators to
create complex configuration scenarios. Refer to
&man.rc.conf.5; for further information on this
topic.Setting Up Network Interface CardsMarcFonvieilleContributed by network cardsconfigurationAdding and configuring a network interface card
(NIC) is a common task for any &os;
administrator.Locating the Correct Drivernetwork cardsdriverFirst, determine the model of the NIC
and the chip it uses. &os; supports a wide variety of
NICs. Check the Hardware Compatibility
List for the &os; release to see if the NIC
is supported.If the NIC is supported, determine
the name of the &os; driver for the NIC.
Refer to /usr/src/sys/conf/NOTES and
/usr/src/sys/arch/conf/NOTES
for the list of NIC drivers with some
information about the supported chipsets. When in doubt, read
the manual page of the driver as it will provide more
information about the supported hardware and any known
limitations of the driver.The drivers for common NICs are already
present in the GENERIC kernel, meaning
the NIC should be probed during boot. The
system's boot messages can be viewed by typing
more /var/run/dmesg.boot and using the
spacebar to scroll through the text. In this example, two
Ethernet NICs using the &man.dc.4; driver
are present on the system:dc0: <82c169 PNIC 10/100BaseTX> port 0xa000-0xa0ff mem 0xd3800000-0xd38
000ff irq 15 at device 11.0 on pci0
miibus0: <MII bus> on dc0
bmtphy0: <BCM5201 10/100baseTX PHY> PHY 1 on miibus0
bmtphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
dc0: Ethernet address: 00:a0:cc:da:da:da
dc0: [ITHREAD]
dc1: <82c169 PNIC 10/100BaseTX> port 0x9800-0x98ff mem 0xd3000000-0xd30
000ff irq 11 at device 12.0 on pci0
miibus1: <MII bus> on dc1
bmtphy1: <BCM5201 10/100baseTX PHY> PHY 1 on miibus1
bmtphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
dc1: Ethernet address: 00:a0:cc:da:da:db
dc1: [ITHREAD]If the driver for the NIC is not
present in GENERIC, but a driver is
available, the driver will need to be loaded before the
NIC can be configured and used. This may
be accomplished in one of two ways:The easiest way is to load a kernel module for the
NIC using &man.kldload.8;. To also
automatically load the driver at boot time, add the
appropriate line to
/boot/loader.conf. Not all
NIC drivers are available as
modules.Alternatively, statically compile support for the
NIC into a custom kernel. Refer to
/usr/src/sys/conf/NOTES,
/usr/src/sys/arch/conf/NOTES
and the manual page of the driver to determine which line
to add to the custom kernel configuration file. For more
information about recompiling the kernel, refer to . If the NIC
was detected at boot, the kernel does not need to be
recompiled.Using &windows; NDIS DriversNDISNDISulator&windows; driversµsoft.windows;device driversKLD (kernel loadable
object)Unfortunately, there are still many vendors that do not
provide schematics for their drivers to the open source
community because they regard such information as trade
secrets. Consequently, the developers of &os; and other
operating systems are left with two choices: develop the
drivers by a long and painstaking process of reverse
engineering or use the existing driver binaries available
for µsoft.windows; platforms.&os; provides native support for the
Network Driver Interface Specification
(NDIS). It includes &man.ndisgen.8;
which can be used to convert a &windowsxp; driver into a
- format that can be used on &os;. Because the &man.ndis.4;
+ format that can be used on &os;. As the &man.ndis.4;
driver uses a &windowsxp; binary, it only runs on &i386;
and amd64 systems. PCI, CardBus,
PCMCIA, and USB
devices are supported.To use &man.ndisgen.8;, three things are needed:&os; kernel sources.A &windowsxp; driver binary with a
.SYS extension.A &windowsxp; driver configuration file with a
.INF extension.Download the .SYS and
.INF files for the specific
NIC. Generally, these can be found on
the driver CD or at the vendor's website. The following
examples use W32DRIVER.SYS and
W32DRIVER.INF.The driver bit width must match the version of &os;.
For &os;/i386, use a &windows; 32-bit driver. For
&os;/amd64, a &windows; 64-bit driver is needed.The next step is to compile the driver binary into a
loadable kernel module. As
root, use
&man.ndisgen.8;:&prompt.root; ndisgen /path/to/W32DRIVER.INF/path/to/W32DRIVER.SYSThis command is interactive and prompts for any extra
information it requires. A new kernel module will be
generated in the current directory. Use &man.kldload.8;
to load the new module:&prompt.root; kldload ./W32DRIVER_SYS.koIn addition to the generated kernel module, the
ndis.ko and
if_ndis.ko modules must be loaded.
This should happen automatically when any module that
depends on &man.ndis.4; is loaded. If not, load them
manually, using the following commands:&prompt.root; kldload ndis
&prompt.root; kldload if_ndisThe first command loads the &man.ndis.4; miniport driver
wrapper and the second loads the generated
NIC driver.Check &man.dmesg.8; to see if there were any load
errors. If all went well, the output should be similar to
the following:ndis0: <Wireless-G PCI Adapter> mem 0xf4100000-0xf4101fff irq 3 at device 8.0 on pci1
ndis0: NDIS API version: 5.0
ndis0: Ethernet address: 0a:b1:2c:d3:4e:f5
ndis0: 11b rates: 1Mbps 2Mbps 5.5Mbps 11Mbps
ndis0: 11g rates: 6Mbps 9Mbps 12Mbps 18Mbps 36Mbps 48Mbps 54MbpsFrom here, ndis0 can be
configured like any other NIC.To configure the system to load the &man.ndis.4; modules
at boot time, copy the generated module,
W32DRIVER_SYS.ko, to
/boot/modules. Then, add the following
line to /boot/loader.conf:W32DRIVER_SYS_load="YES"Configuring the Network Cardnetwork cardsconfigurationOnce the right driver is loaded for the
NIC, the card needs to be configured. It
may have been configured at installation time by
&man.bsdinstall.8;.To display the NIC configuration,
enter the following command:&prompt.user; ifconfig
dc0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=80008<VLAN_MTU,LINKSTATE>
ether 00:a0:cc:da:da:da
inet 192.168.1.3 netmask 0xffffff00 broadcast 192.168.1.255
media: Ethernet autoselect (100baseTX <full-duplex>)
status: active
dc1: flags=8802<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=80008<VLAN_MTU,LINKSTATE>
ether 00:a0:cc:da:da:db
inet 10.0.0.1 netmask 0xffffff00 broadcast 10.0.0.255
media: Ethernet 10baseT/UTP
status: no carrier
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
options=3<RXCSUM,TXCSUM>
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x4
inet6 ::1 prefixlen 128
inet 127.0.0.1 netmask 0xff000000
nd6 options=3<PERFORMNUD,ACCEPT_RTADV>In this example, the following devices were
displayed:dc0: The first Ethernet
interface.dc1: The second Ethernet
interface.lo0: The loopback
device.&os; uses the driver name followed by the order in which
the card is detected at boot to name the
NIC. For example,
sis2 is the third
NIC on the system using the &man.sis.4;
driver.In this example, dc0 is up and
running. The key indicators are:UP means that the card is
configured and ready.The card has an Internet (inet)
address, 192.168.1.3.It has a valid subnet mask
(netmask), where
0xffffff00 is the
same as 255.255.255.0.It has a valid broadcast address, 192.168.1.255.The MAC address of the card
(ether) is 00:a0:cc:da:da:da.The physical media selection is on autoselection mode
(media: Ethernet autoselect (100baseTX
<full-duplex>)). In this example,
dc1 is configured to run with
10baseT/UTP media. For more
information on available media types for a driver, refer
to its manual page.The status of the link (status) is
active, indicating that the carrier
signal is detected. For dc1, the
status: no carrier status is normal
when an Ethernet cable is not plugged into the
card.If the &man.ifconfig.8; output had shown something similar
to:dc0: flags=8843<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=80008<VLAN_MTU,LINKSTATE>
ether 00:a0:cc:da:da:da
media: Ethernet autoselect (100baseTX <full-duplex>)
status: activeit would indicate the card has not been configured.The card must be configured as
root. The
NIC configuration can be performed from the
command line with &man.ifconfig.8; but will not persist after
a reboot unless the configuration is also added to
/etc/rc.conf. If a
DHCP server is present on the LAN,
just add this line:ifconfig_dc0="DHCP"Replace dc0 with the correct
value for the system.Once the line has been added, follow the instructions given in
.If the network was configured during installation, some
entries for the NIC(s) may be already
present. Double check /etc/rc.conf
before adding any lines.If there is no DHCP server,
the NIC(s) must be configured manually.
Add a line for each NIC present on the
system, as seen in this example:ifconfig_dc0="inet 192.168.1.3 netmask 255.255.255.0"
ifconfig_dc1="inet 10.0.0.1 netmask 255.255.255.0 media 10baseT/UTP"Replace dc0 and
dc1 and the IP
address information with the correct values for the system.
Refer to the man page for the driver, &man.ifconfig.8;, and
&man.rc.conf.5; for more details about the allowed options and
the syntax of /etc/rc.conf.If the network is not using DNS, edit
/etc/hosts to add the names and
IP addresses of the hosts on the
LAN, if they are not already there. For
more information, refer to &man.hosts.5; and to
/usr/share/examples/etc/hosts.If there is no DHCP server and
access to the Internet is needed, manually configure the
default gateway and the nameserver:&prompt.root; echo 'defaultrouter="your_default_router"' >> /etc/rc.conf
&prompt.root; echo 'nameserver your_DNS_server' >> /etc/resolv.confTesting and TroubleshootingOnce the necessary changes to
/etc/rc.conf are saved, a reboot can be
used to test the network configuration and to verify that the
system restarts without any configuration errors.
Alternatively, apply the settings to the networking system
with this command:&prompt.root; service netif restartIf a default gateway has been set in
/etc/rc.conf, also issue this
command:&prompt.root; service routing restartOnce the networking system has been relaunched, test the
NICs.Testing the Ethernet Cardnetwork cardstestingTo verify that an Ethernet card is configured correctly,
&man.ping.8; the interface itself, and then &man.ping.8;
another machine on the LAN:&prompt.user; ping -c5 192.168.1.3
PING 192.168.1.3 (192.168.1.3): 56 data bytes
64 bytes from 192.168.1.3: icmp_seq=0 ttl=64 time=0.082 ms
64 bytes from 192.168.1.3: icmp_seq=1 ttl=64 time=0.074 ms
64 bytes from 192.168.1.3: icmp_seq=2 ttl=64 time=0.076 ms
64 bytes from 192.168.1.3: icmp_seq=3 ttl=64 time=0.108 ms
64 bytes from 192.168.1.3: icmp_seq=4 ttl=64 time=0.076 ms
--- 192.168.1.3 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.074/0.083/0.108/0.013 ms&prompt.user; ping -c5 192.168.1.2
PING 192.168.1.2 (192.168.1.2): 56 data bytes
64 bytes from 192.168.1.2: icmp_seq=0 ttl=64 time=0.726 ms
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.766 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.700 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.747 ms
64 bytes from 192.168.1.2: icmp_seq=4 ttl=64 time=0.704 ms
--- 192.168.1.2 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.700/0.729/0.766/0.025 msTo test network resolution, use the host name instead
of the IP address. If there is no
DNS server on the network,
/etc/hosts must first be
configured. For this purpose, edit
/etc/hosts to add the names and
IP addresses of the hosts on the
LAN, if they are not already there. For
more information, refer to &man.hosts.5; and to
/usr/share/examples/etc/hosts.Troubleshootingnetwork cardstroubleshootingWhen troubleshooting hardware and software
configurations, check the simple things first. Is the
network cable plugged in? Are the network services properly
configured? Is the firewall configured correctly? Is the
NIC supported by &os;? Before sending
a bug report, always check the Hardware Notes, update the
version of &os; to the latest STABLE version, check the
mailing list archives, and search the Internet.If the card works, yet performance is poor, read
through &man.tuning.7;. Also, check the network
configuration as incorrect network settings can cause slow
connections.Some users experience one or two
device timeout messages, which is
normal for some cards. If they continue, or are bothersome,
determine if the device is conflicting with another device.
Double check the cable connections. Consider trying another
card.To resolve watchdog timeout
errors, first check the network cable. Many cards
require a PCI slot which supports bus
mastering. On some old motherboards, only one
PCI slot allows it, usually slot 0.
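Whether bus mastering is enabled shows up in bit 2 of the PCI command register, which on &os; can be read with &man.pciconf.8;. The following sketch tests that bit against a captured sample value (hypothetical, since the register contents vary per machine):

```shell
# Bit 2 of the PCI command register (offset 0x04) is the bus
# master enable bit. The value below is a hypothetical sample;
# on a real system it could be read with something like
# "pciconf -r pci0:0:11:0 0x04".
cmd_reg=0x02800117
if [ $(( cmd_reg & 0x4 )) -ne 0 ]; then
    echo "bus mastering enabled"
else
    echo "bus mastering disabled"
fi
```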
Check the NIC and the motherboard
documentation to determine if that may be the
problem.No route to host messages occur
if the system is unable to route a packet to the destination
host. This can happen if no default route is specified or
if a cable is unplugged. Check the output of
netstat -rn and make sure there is a
valid route to the host. If there is not, read
.ping: sendto: Permission denied
error messages are often caused by a misconfigured firewall.
If a firewall is enabled on &os; but no rules have been
defined, the default policy is to deny all traffic, even
&man.ping.8;. Refer to
for more information.Sometimes performance of the card is poor or below
average. In these cases, try setting the media
selection mode from autoselect to the
correct media selection. While this works for most
hardware, it may or may not resolve the issue. Again,
check all the network settings, and refer to
&man.tuning.7;.Virtual Hostsvirtual hostsIP
aliasesA common use of &os; is virtual site hosting, where one
server appears to the network as many servers. This is achieved
by assigning multiple network addresses to a single
interface.A given network interface has one real
address, and may have any number of alias
addresses. These aliases are normally added by placing alias
entries in /etc/rc.conf, as seen in this
example:ifconfig_fxp0_alias0="inet xxx.xxx.xxx.xxx netmask xxx.xxx.xxx.xxx"Alias entries must start with
alias0 and be numbered
sequentially, such as
alias0, alias1,
and so on. The configuration process will stop at the first
missing number.The calculation of alias netmasks is important. For a
given interface, there must be one address which correctly
represents the network's netmask. Any other addresses which
fall within this network must have a netmask of all
1s, expressed as either
255.255.255.255 or
0xffffffff.For example, consider the case where the
fxp0 interface is connected to two
networks: 10.1.1.0
with a netmask of
255.255.255.0 and
202.0.75.16 with a
netmask of
255.255.255.240. The
system is to be configured to appear in the ranges
10.1.1.1 through
10.1.1.5 and
202.0.75.17 through
202.0.75.20. Only
the first address in a given network range should have a real
netmask. All the rest
(10.1.1.2 through
10.1.1.5 and
202.0.75.18 through
202.0.75.20) must be
configured with a netmask of
255.255.255.255.The following /etc/rc.conf entries
configure the adapter correctly for this scenario:ifconfig_fxp0="inet 10.1.1.1 netmask 255.255.255.0"
ifconfig_fxp0_alias0="inet 10.1.1.2 netmask 255.255.255.255"
ifconfig_fxp0_alias1="inet 10.1.1.3 netmask 255.255.255.255"
ifconfig_fxp0_alias2="inet 10.1.1.4 netmask 255.255.255.255"
ifconfig_fxp0_alias3="inet 10.1.1.5 netmask 255.255.255.255"
ifconfig_fxp0_alias4="inet 202.0.75.17 netmask 255.255.255.240"
ifconfig_fxp0_alias5="inet 202.0.75.18 netmask 255.255.255.255"
ifconfig_fxp0_alias6="inet 202.0.75.19 netmask 255.255.255.255"
ifconfig_fxp0_alias7="inet 202.0.75.20 netmask 255.255.255.255"A simpler way to express this is with a space-separated list
of IP address ranges. The first address
will be given the
indicated subnet mask and the additional addresses will have a
subnet mask of 255.255.255.255.ifconfig_fxp0_aliases="inet 10.1.1.1-5/24 inet 202.0.75.17-20/28"Configuring System LoggingNiclasZeisingContributed by system loggingsyslog&man.syslogd.8;Generating and reading system logs is an important aspect of
system administration. The information in system logs can be
used to detect hardware and software issues as well as
application and system configuration errors. This information
also plays an important role in security auditing and incident
response. Most system daemons and applications will generate
log entries.&os; provides a system logger,
syslogd, to manage logging. By
default, syslogd is started when the
system boots. This is controlled by the variable
syslogd_enable in
/etc/rc.conf. There are numerous
application arguments that can be set using
syslogd_flags in
/etc/rc.conf. Refer to &man.syslogd.8; for
more information on the available arguments.This section describes how to configure the &os; system
logger for both local and remote logging and how to perform log
rotation and log management.Configuring Local Loggingsyslog.confThe configuration file,
/etc/syslog.conf, controls what
syslogd does with log entries as
they are received. There are several parameters to control
the handling of incoming events. The
facility describes which subsystem
generated the message, such as the kernel or a daemon, and the
level describes the severity of the
event that occurred. This makes it possible to configure if
and where a log message is logged, depending on the facility
and level. It is also possible to take action depending on
the application that sent the message, and in the case of
remote logging, the hostname of the machine generating the
logging event.This configuration file contains one line per action,
where the syntax for each line is a selector field followed by
an action field. The syntax of the selector field is
facility.level which will match log
messages from facility at level
level or higher. It is also
possible to add an optional comparison flag before the level
to specify more precisely what is logged. Multiple selector
fields can be used for the same action, and are separated with
a semicolon (;). Using
* will match everything. The action field
denotes where to send the log message, such as to a file or
remote log host. As an example, here is the default
syslog.conf from &os;:# $&os;$
#
# Spaces ARE valid field separators in this file. However,
# other *nix-like systems still insist on using tabs as field
# separators. If you are sharing this file between systems, you
# may want to use only tabs as field separators here.
# Consult the syslog.conf(5) manpage.
*.err;kern.warning;auth.notice;mail.crit /dev/console
*.notice;authpriv.none;kern.debug;lpr.info;mail.crit;news.err /var/log/messages
security.* /var/log/security
auth.info;authpriv.info /var/log/auth.log
mail.info /var/log/maillog
lpr.info /var/log/lpd-errs
ftp.info /var/log/xferlog
cron.* /var/log/cron
!-devd
*.=debug /var/log/debug.log
*.emerg *
# uncomment this to log all writes to /dev/console to /var/log/console.log
#console.info /var/log/console.log
# uncomment this to enable logging of all log messages to /var/log/all.log
# touch /var/log/all.log and chmod it to mode 600 before it will work
#*.* /var/log/all.log
# uncomment this to enable logging to a remote loghost named loghost
#*.* @loghost
# uncomment these if you're running inn
# news.crit /var/log/news/news.crit
# news.err /var/log/news/news.err
# news.notice /var/log/news/news.notice
# Uncomment this if you wish to see messages produced by devd
# !devd
# *.>=info
!ppp
*.* /var/log/ppp.log
!*In this example:Line 8 matches all messages with a level of
err or higher, as well as
kern.warning,
auth.notice and
mail.crit, and sends these log messages
to the console
(/dev/console).Line 12 matches all messages from the
mail facility at level
info or above and logs the messages to
/var/log/maillog.Line 17 uses a comparison flag (=)
to only match messages at level debug
and logs them to
/var/log/debug.log.Line 33 is an example usage of a program
specification. This makes the rules following it only
valid for the specified program. In this case, only the
messages generated by ppp are
logged to /var/log/ppp.log.The available levels, in order from most to least
critical are emerg,
alert, crit,
err, warning,
notice, info, and
debug.The facilities, in no particular order, are
auth, authpriv,
console, cron,
daemon, ftp,
kern, lpr,
mail, mark,
news, security,
syslog, user,
uucp, and local0 through
local7. Be aware that other operating
systems might have different facilities.To log everything of level notice and
higher to /var/log/daemon.log, add the
following entry:daemon.notice /var/log/daemon.logFor more information about the different levels and
facilities, refer to &man.syslog.3; and &man.syslogd.8;.
For more information about
/etc/syslog.conf, its syntax, and more
advanced usage examples, see &man.syslog.conf.5;.Log Management and Rotationnewsyslognewsyslog.conflog rotationlog managementLog files can grow quickly, taking up disk space and
making it more difficult to locate useful information. Log
management attempts to mitigate this. In &os;,
newsyslog is used to manage log
files. This built-in program periodically rotates and
compresses log files, and optionally creates missing log files
and signals programs when log files are moved. The log files
may be generated by syslogd or by
any other program which generates log files. While
newsyslog is normally run from
&man.cron.8;, it is not a system daemon. In the default
configuration, it runs every hour.To know which actions to take,
newsyslog reads its configuration
file, /etc/newsyslog.conf. This file
contains one line for each log file that
newsyslog manages. Each line
states the file owner, permissions, when to rotate that file,
optional flags that affect log rotation, such as compression,
and programs to signal when the log is rotated. Here is the
default configuration in &os;:# configuration file for newsyslog
# $FreeBSD$
#
# Entries which do not specify the '/pid_file' field will cause the
# syslogd process to be signalled when that log file is rotated. This
# action is only appropriate for log files which are written to by the
# syslogd process (ie, files listed in /etc/syslog.conf). If there
# is no process which needs to be signalled when a given log file is
# rotated, then the entry for that file should include the 'N' flag.
#
# The 'flags' field is one or more of the letters: BCDGJNUXZ or a '-'.
#
# Note: some sites will want to select more restrictive protections than the
# defaults. In particular, it may be desirable to switch many of the 644
# entries to 640 or 600. For example, some sites will consider the
# contents of maillog, messages, and lpd-errs to be confidential. In the
# future, these defaults may change to more conservative ones.
#
# logfilename [owner:group] mode count size when flags [/pid_file] [sig_num]
/var/log/all.log 600 7 * @T00 J
/var/log/amd.log 644 7 100 * J
/var/log/auth.log 600 7 100 @0101T JC
/var/log/console.log 600 5 100 * J
/var/log/cron 600 3 100 * JC
/var/log/daily.log 640 7 * @T00 JN
/var/log/debug.log 600 7 100 * JC
/var/log/kerberos.log 600 7 100 * J
/var/log/lpd-errs 644 7 100 * JC
/var/log/maillog 640 7 * @T00 JC
/var/log/messages 644 5 100 @0101T JC
/var/log/monthly.log 640 12 * $M1D0 JN
/var/log/pflog 600 3 100 * JB /var/run/pflogd.pid
/var/log/ppp.log root:network 640 3 100 * JC
/var/log/devd.log 644 3 100 * JC
/var/log/security 600 10 100 * JC
/var/log/sendmail.st 640 10 * 168 B
/var/log/utx.log 644 3 * @01T05 B
/var/log/weekly.log 640 5 1 $W6D0 JN
/var/log/xferlog 600 7 100 * JCEach line starts with the name of the log to be rotated,
optionally followed by an owner and group for both rotated and
newly created files. The mode field sets
the permissions on the log file and count
denotes how many rotated log files should be kept. The
size and when fields
tell newsyslog when to rotate the
file. A log file is rotated when either its size is larger
than the size field or when the time in the
when field has passed. An asterisk
(*) means that this field is ignored. The
flags field gives further
instructions, such as how to compress the rotated file or to
create the log file if it is missing. The last two fields are
optional and specify the name of the Process ID
(PID) file of a process and a signal number
to send to that process when the file is rotated.For more information on all fields, valid flags, and how
to specify the rotation time, refer to &man.newsyslog.conf.5;.
Since newsyslog is run from
&man.cron.8;, it cannot rotate files more often than it is
scheduled to run from &man.cron.8;.Configuring Remote LoggingTomRhodesContributed by Monitoring the log files of multiple hosts can become
unwieldy as the number of systems increases. Configuring
centralized logging can reduce some of the administrative
burden of log file administration.In &os;, centralized log file aggregation, merging, and
rotation can be configured using
syslogd and
newsyslog. This section
demonstrates an example configuration, where host
A, named logserv.example.com, will
collect logging information for the local network. Host
B, named logclient.example.com,
will be configured to pass logging information to the logging
server.Log Server ConfigurationA log server is a system that has been configured to
accept logging information from other hosts. Before
configuring a log server, check the following:If there is a firewall between the logging server
and any logging clients, ensure that the firewall
ruleset allows UDP port 514 for both
the clients and the server.The logging server and all client machines must
have forward and reverse entries in the local
DNS. If the network does not have a
DNS server, create entries in each
system's /etc/hosts. Proper name
resolution is required so that log entries are not
rejected by the logging server.On the log server, edit
/etc/syslog.conf to specify the name of
the client to receive log entries from, the logging facility
to be used, and the name of the log to store the host's log
entries. This example adds the hostname of
B, logs all facilities, and stores
the log entries in
/var/log/logclient.log.Sample Log Server Configuration+logclient.example.com
*.* /var/log/logclient.logWhen adding multiple log clients, add a similar two-line
entry for each client. More information about the available
facilities may be found in &man.syslog.conf.5;.Next, configure
/etc/rc.conf:syslogd_enable="YES"
syslogd_flags="-a logclient.example.com -v -v"The first entry starts
syslogd at system boot. The
second entry allows log entries from the specified client.
The -v -v increases the verbosity of logged
messages. This is useful for tweaking facilities as
administrators are able to see what type of messages are
being logged under each facility.Multiple -a options may be specified to
allow logging from multiple clients. IP
addresses and whole netblocks may also be specified. Refer
to &man.syslogd.8; for a full list of possible
options.Finally, create the log file:&prompt.root; touch /var/log/logclient.logAt this point, syslogd should
be restarted and verified:&prompt.root; service syslogd restart
&prompt.root; pgrep syslogIf a PID is returned, the server
restarted successfully, and client configuration can begin.
If the server did not restart, consult
/var/log/messages for the error.Log Client ConfigurationA logging client sends log entries to a logging server
on the network. The client also keeps a local copy of its
own logs.Once a logging server has been configured, edit
/etc/rc.conf on the logging
client:syslogd_enable="YES"
syslogd_flags="-s -v -v"The first entry enables
syslogd on boot up. The second
entry prevents logs from being accepted by this client from
other hosts (-s) and increases the
verbosity of logged messages.Next, define the logging server in the client's
/etc/syslog.conf. In this example, all
logged facilities are sent to a remote system, denoted by
the @ symbol, with the specified
hostname:*.* @logserv.example.comAfter saving the edit, restart
syslogd for the changes to take
effect:&prompt.root; service syslogd restartTo test that log messages are being sent across the
network, use &man.logger.1; on the client to send a message
to syslogd:&prompt.root; logger "Test message from logclient"This message should now exist both in
/var/log/messages on the client and
/var/log/logclient.log on the log
server.Debugging Log ServersIf no messages are being received on the log server, the
cause is most likely a network connectivity issue, a
hostname resolution issue, or a typo in a configuration
file. To isolate the cause, ensure that both the logging
server and the logging client are able to
ping each other using the hostname
specified in their /etc/rc.conf. If
this fails, check the network cabling, the firewall ruleset,
and the hostname entries in the DNS
server or /etc/hosts on both the
logging server and clients. Repeat until the
ping is successful from both
hosts.If the ping succeeds on both hosts
but log messages are still not being received, temporarily
increase logging verbosity to narrow down the configuration
issue. In the following example,
/var/log/logclient.log on the logging
server is empty and /var/log/messages
on the logging client does not indicate a reason for the
failure. To increase debugging output, edit the
syslogd_flags entry on the logging server
and issue a restart:syslogd_flags="-d -a logclient.example.com -v -v"&prompt.root; service syslogd restartDebugging data similar to the following will flash on
the console immediately after the restart:logmsg: pri 56, flags 4, from logserv.example.com, msg syslogd: restart
syslogd: restarted
logmsg: pri 6, flags 4, from logserv.example.com, msg syslogd: kernel boot file is /boot/kernel/kernel
Logging to FILE /var/log/messages
syslogd: kernel boot file is /boot/kernel/kernel
cvthname(192.168.1.10)
validate: dgram from IP 192.168.1.10, port 514, name logclient.example.com;
rejected in rule 0 due to name mismatch.In this example, the log messages are being rejected due
to a typo which results in a hostname mismatch. The
client's hostname should be logclient,
not logclien. Fix the typo, issue a
restart, and verify the results:&prompt.root; service syslogd restart
logmsg: pri 56, flags 4, from logserv.example.com, msg syslogd: restart
syslogd: restarted
logmsg: pri 6, flags 4, from logserv.example.com, msg syslogd: kernel boot file is /boot/kernel/kernel
syslogd: kernel boot file is /boot/kernel/kernel
logmsg: pri 166, flags 17, from logserv.example.com,
msg Dec 10 20:55:02 <syslog.err> logserv.example.com syslogd: exiting on signal 2
cvthname(192.168.1.10)
validate: dgram from IP 192.168.1.10, port 514, name logclient.example.com;
accepted in rule 0.
logmsg: pri 15, flags 0, from logclient.example.com, msg Dec 11 02:01:28 trhodes: Test message 2
Logging to FILE /var/log/logclient.log
Logging to FILE /var/log/messagesAt this point, the messages are being properly received
and placed in the correct file.Security ConsiderationsAs with any network service, security requirements
should be considered before implementing a logging server.
Log files may contain sensitive data about services enabled
on the local host, user accounts, and configuration data.
Network data sent from the client to the server will not be
encrypted or password protected. If a need for encryption
exists, consider using security/stunnel,
which will transmit the logging data over an encrypted
tunnel.Local security is also an issue. Log files are not
encrypted during use or after log rotation. Local users may
access log files to gain additional insight into system
configuration. Setting proper permissions on log files is
critical. The built-in log rotator,
newsyslog, supports setting
permissions on newly created and rotated log files. Setting
log files to mode 600 should prevent
unwanted access by local users. Refer to
&man.newsyslog.conf.5; for additional information.Configuration Files/etc
LayoutThere are a number of directories in which configuration
information is kept. These include:/etcGeneric system-specific configuration
information./etc/defaultsDefault versions of system configuration
files./etc/mailExtra &man.sendmail.8; configuration and other
MTA configuration files./etc/pppConfiguration for both user- and kernel-ppp
programs./usr/local/etcConfiguration files for installed applications.
May contain per-application subdirectories./usr/local/etc/rc.d&man.rc.8; scripts for installed
applications./var/dbAutomatically generated system-specific database
files, such as the package database and the
&man.locate.1; database.HostnameshostnameDNS/etc/resolv.confresolv.confHow a &os; system accesses the Internet Domain Name
System (DNS) is controlled by
&man.resolv.conf.5;.The most common entries to
/etc/resolv.conf are:nameserverThe IP address of a name
server the resolver should query. The servers are
queried in the order listed with a maximum of
three.searchSearch list for hostname lookup. This is
normally determined by the domain of the local
hostname.domainThe local domain name.A typical /etc/resolv.conf looks
like this:search example.com
nameserver 147.11.1.11
nameserver 147.11.100.30Only one of the search and
domain options should be used.When using DHCP, &man.dhclient.8;
usually rewrites /etc/resolv.conf
with information received from the DHCP
server./etc/hostshosts/etc/hosts is a simple text
database which works in conjunction with
DNS and
NIS to provide host name to
IP address mappings. Entries for local
computers connected via a LAN can be
added to this file for simplistic naming purposes instead
of setting up a &man.named.8; server. Additionally,
/etc/hosts can be used to provide a
local record of Internet names, reducing the need to query
external DNS servers for commonly
accessed names.# $&os;$
#
#
# Host Database
#
# This file should contain the addresses and aliases for local hosts that
# share this file. Replace 'my.domain' below with the domainname of your
# machine.
#
# In the presence of the domain name service or NIS, this file may
# not be consulted at all; see /etc/nsswitch.conf for the resolution order.
#
#
::1 localhost localhost.my.domain
127.0.0.1 localhost localhost.my.domain
#
# Imaginary network.
#10.0.0.2 myname.my.domain myname
#10.0.0.3 myfriend.my.domain myfriend
#
# According to RFC 1918, you can use the following IP networks for
# private nets which will never be connected to the Internet:
#
# 10.0.0.0 - 10.255.255.255
# 172.16.0.0 - 172.31.255.255
# 192.168.0.0 - 192.168.255.255
#
# In case you want to be able to connect to the Internet, you need
# real official assigned numbers. Do not try to invent your own network
# numbers but instead get one from your network provider (if any) or
# from your regional registry (ARIN, APNIC, LACNIC, RIPE NCC, or AfriNIC.)
#The format of /etc/hosts is as
follows:[Internet address] [official hostname] [alias1] [alias2] ...For example:10.0.0.1 myRealHostname.example.com myRealHostname foobar1 foobar2Consult &man.hosts.5; for more information.Tuning with &man.sysctl.8;sysctltuningwith sysctl&man.sysctl.8; is used to make changes to a running &os;
system. This includes many advanced options of the
TCP/IP stack and virtual memory system
that can dramatically improve performance for an experienced
system administrator. Over five hundred system variables can
be read and set using &man.sysctl.8;.At its core, &man.sysctl.8; serves two functions: to read
and to modify system settings.To view all readable variables:&prompt.user; sysctl -aTo read a particular variable, specify its name:&prompt.user; sysctl kern.maxproc
kern.maxproc: 1044To set a particular variable, use the
variable=value
syntax:&prompt.root; sysctl kern.maxfiles=5000
kern.maxfiles: 2088 -> 5000Settings of sysctl variables are usually either strings,
numbers, or booleans, where a boolean is 1
for yes or 0 for no.To automatically set some variables each time the machine
boots, add them to /etc/sysctl.conf. For
more information, refer to &man.sysctl.conf.5; and
.sysctl.confsysctl.confsysctlThe configuration file for &man.sysctl.8;,
/etc/sysctl.conf, looks much like
/etc/rc.conf. Values are set in a
variable=value form. The specified values
are set after the system goes into multi-user mode. Not all
variables are settable in this mode.For example, to turn off logging of fatal signal exits
and prevent users from seeing processes started by other
users, the following tunables can be set in
/etc/sysctl.conf:# Do not log fatal signal exits (e.g., sig 11)
kern.logsigexit=0
# Prevent users from seeing information about processes that
# are being run under another UID.
security.bsd.see_other_uids=0&man.sysctl.8; Read-onlyTomRhodesContributed by In some cases it may be desirable to modify read-only
&man.sysctl.8; values, which will require a reboot of the
system.For instance, on some laptop models the &man.cardbus.4;
device will not probe memory ranges and will fail with errors
similar to:cbb0: Could not map register memory
device_probe_and_attach: cbb0 attach returned 12The fix requires the modification of a read-only
&man.sysctl.8; setting. Add
to
/boot/loader.conf and reboot. Now
&man.cardbus.4; should work properly.Tuning DisksThe following section will discuss various tuning
mechanisms and options which may be applied to disk
devices. In many cases, disks with mechanical parts,
such as SCSI drives, will be the
bottleneck driving down the overall system performance. While
a solution is to install a drive without mechanical parts,
such as a solid state drive, mechanical drives are not
going away anytime in the near future. When tuning disks,
it is advisable to utilize the features of the &man.iostat.8;
command to test various changes to the system. This
command will allow the user to obtain valuable information
on system I/O.Sysctl Variablesvfs.vmiodirenablevfs.vmiodirenableThe vfs.vmiodirenable &man.sysctl.8;
variable
may be set to either 0 (off) or
1 (on). It is set to
1 by default. This variable controls
how directories are cached by the system. Most directories
are small, using just a single fragment (typically 1 K)
in the file system and typically 512 bytes in the
buffer cache. With this variable turned off, the buffer
cache will only cache a fixed number of directories, even
if the system has a huge amount of memory. When turned on,
this &man.sysctl.8; allows the buffer cache to use the
VM page cache to cache the directories,
making all the memory available for caching directories.
However, the minimum in-core memory used to cache a
directory is the physical page size (typically 4 K)
rather than 512 bytes. Keeping this option enabled
is recommended if the system is running any services which
manipulate large numbers of files. Such services can
include web caches, large mail systems, and news systems.
Keeping this option on will generally not reduce
performance, even with the wasted memory, but one should
experiment to find out.vfs.write_behindvfs.write_behindThe vfs.write_behind &man.sysctl.8;
variable
defaults to 1 (on). This tells the file
system to issue media writes as full clusters are collected,
which typically occurs when writing large sequential files.
This avoids saturating the buffer cache with dirty buffers
when it would not benefit I/O performance. However, this
may stall processes and under certain circumstances should
be turned off.vfs.hirunningspacevfs.hirunningspaceThe vfs.hirunningspace &man.sysctl.8;
variable determines how much outstanding write I/O may be
queued to disk controllers system-wide at any given
instance. The default is usually sufficient, but on
machines with many disks, try bumping it up to four or five
megabytes. Setting too high a value
which exceeds the buffer cache's write threshold can lead
to bad clustering performance. Do not set this value
arbitrarily high as higher write values may add latency to
reads occurring at the same time.There are various other buffer cache and
VM page cache related &man.sysctl.8;
values. Modifying these values is not recommended as the
VM system does a good job of
automatically tuning itself.vm.swap_idle_enabledvm.swap_idle_enabledThe vm.swap_idle_enabled
&man.sysctl.8; variable is useful in large multi-user
systems with many active login users and lots of idle
processes. Such systems tend to generate continuous
pressure on free memory reserves. Turning this feature on
and tweaking the swapout hysteresis (in idle seconds) via
vm.swap_idle_threshold1 and
vm.swap_idle_threshold2 depresses the
priority of memory pages associated with idle processes more
quickly than the normal pageout algorithm. This gives a
helping hand to the pageout daemon. Only turn this option
on if needed, because the tradeoff is essentially that memory
is pre-paged sooner rather than later, which eats more swap
and disk bandwidth. In a small system this option will have a
determinable effect, but in a large system that is already
doing moderate paging, this option allows the
VM system to stage whole processes into
and out of memory easily.hw.ata.wchw.ata.wcTurning off IDE write caching reduces
write bandwidth to IDE disks, but may
sometimes be necessary due to data consistency issues
introduced by hard drive vendors. The problem is that
some IDE drives lie about when a write
completes. With IDE write caching
turned on, IDE hard drives write data
to disk out of order and will sometimes delay writing some
blocks indefinitely when under heavy disk load. A crash or
power failure may cause serious file system corruption.
Check the default on the system by observing the
hw.ata.wc &man.sysctl.8; variable. If
IDE write caching is turned off, one can
set this read-only variable to
1 in
/boot/loader.conf in order to enable
it at boot time.For more information, refer to &man.ata.4;.SCSI_DELAY
(kern.cam.scsi_delay)kern.cam.scsi_delaykernel optionsSCSI DELAYThe SCSI_DELAY kernel configuration
option may be used to reduce system boot times. The
defaults are fairly high and can be responsible for
15 seconds of delay in the boot process.
Reducing it to 5 seconds usually works
with modern drives. The
kern.cam.scsi_delay boot time tunable
should be used. The tunable and kernel configuration
option accept values in terms of
milliseconds and
not seconds.Soft UpdatesSoft Updates&man.tunefs.8;To fine-tune a file system, use &man.tunefs.8;. This
program has many different options. To toggle Soft Updates
on and off, use:&prompt.root; tunefs -n enable /filesystem
&prompt.root; tunefs -n disable /filesystemA file system cannot be modified with &man.tunefs.8; while
it is mounted. A good time to enable Soft Updates is before
any partitions have been mounted, in single-user mode.Soft Updates is recommended for UFS
file systems as it drastically improves meta-data performance,
mainly file creation and deletion, through the use of a memory
cache. There are two downsides to Soft Updates to be aware
of. First, Soft Updates guarantee file system consistency
in the case of a crash, but could easily be several seconds
or even a minute behind updating the physical disk. If the
system crashes, unwritten data may be lost. Secondly, Soft
Updates delay the freeing of file system blocks. If the
root file system is almost full, performing a major update,
such as make installworld, can cause the
file system to run out of space and the update to fail.More Details About Soft UpdatesSoft UpdatesdetailsMeta-data updates are updates to non-content data like
inodes or directories. There are two traditional approaches
to writing a file system's meta-data back to disk.Historically, the default behavior was to write out
meta-data updates synchronously. If a directory changed,
the system waited until the change was actually written to
disk. The file data buffers (file contents) were passed
through the buffer cache and backed up to disk later on
asynchronously. The advantage of this implementation is
that it operates safely. If there is a failure during an
update, meta-data is always in a consistent state. A
file is either created completely or not at all. If the
data blocks of a file did not find their way out of the
buffer cache onto the disk by the time of the crash,
&man.fsck.8; recognizes this and repairs the file system
by setting the file length to 0.
Additionally, the implementation is clear and simple. The
disadvantage is that meta-data changes are slow. For
example, rm -r touches all the files in a
directory sequentially, but each directory change will be
written synchronously to the disk. This includes updates to
the directory itself, to the inode table, and possibly to
indirect blocks allocated by the file. Similar
considerations apply for unrolling large hierarchies using
tar -x.The second approach is to use asynchronous meta-data
updates. This is the default for a UFS
file system mounted with mount -o async.
Since all meta-data updates are also passed through the
buffer cache, they will be intermixed with the updates of
the file content data. The advantage of this
implementation is there is no need to wait until each
meta-data update has been written to disk, so all operations
which cause huge amounts of meta-data updates work much
faster than in the synchronous case. This implementation
is still clear and simple, so there is a low risk for bugs
creeping into the code. The disadvantage is that there is
no guarantee for a consistent state of the file system.
If there is a failure during an operation that updated
large amounts of meta-data, like a power failure or someone
pressing the reset button, the file system will be left
in an unpredictable state. There is no opportunity to
examine the state of the file system when the system comes
up again as the data blocks of a file could already have
been written to the disk while the updates of the inode
table or the associated directory were not. It is
impossible to implement a &man.fsck.8; which is able to
clean up the resulting chaos because the necessary
information is not available on the disk. If the file
system has been damaged beyond repair, the only choice
is to reformat it and restore from backup.The usual solution for this problem is to implement
dirty region logging, which is also
referred to as journaling.
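The commit-then-move ordering at the heart of this scheme can be mimicked with a toy shell sketch. This is purely illustrative, not &os; code: temporary files stand in for the logging region and the proper location, and a real implementation would also force the journal to stable storage with fsync, which plain shell cannot do.

```shell
# Toy sketch of dirty region logging: commit the change to a small
# journal first, then move it to its proper location, then retire it.
set -e
dir=$(mktemp -d)
journal="$dir/journal"   # stands in for the small logging region
data="$dir/data"         # stands in for the proper on-disk location

echo "old contents" > "$data"

# 1. Record the pending update in the journal.  A real file system
#    would fsync it here so it survives a crash.
echo "new contents" > "$journal"

# 2. Apply the update to its proper location.  A crash before this
#    step can be completed by replaying the journal; a crash after it
#    leaves a journal entry that is harmless to replay again.
cp "$journal" "$data"

# 3. Retire the journal entry once the update is in place.
: > "$journal"

result=$(cat "$data")
echo "$result"
rm -r "$dir"
```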
Meta-data updates are still written synchronously, but only
into a small region of the disk. Later on, they are moved
- to their proper location. Because the logging area is a
+ to their proper location. Since the logging area is a
small, contiguous region on the disk, there are no long
distances for the disk heads to move, even during heavy
operations, so these operations are quicker than synchronous
updates. Additionally, the complexity of the implementation
is limited, so the risk of bugs being present is low. A
disadvantage is that all meta-data is written twice, once
into the logging region and once to the proper location, so
performance pessimization might result. On
the other hand, in case of a crash, all pending meta-data
operations can be either quickly rolled back or completed
from the logging area after the system comes up again,
resulting in a fast file system startup.Kirk McKusick, the developer of Berkeley
FFS, solved this problem with Soft
Updates. All pending meta-data updates are kept in memory
and written out to disk in a sorted sequence
(ordered meta-data updates). This has the
effect that, in case of heavy meta-data operations, later
updates to an item catch the earlier ones
which are still in memory and have not already been written
to disk. All operations are generally performed in memory
before the update is written to disk and the data blocks are
sorted according to their position so that they will not be
on the disk ahead of their meta-data. If the system
crashes, an implicit log rewind causes all
operations which were not written to the disk to appear as if
they never happened. A consistent file system state is
maintained that appears to be the one of 30 to 60 seconds
earlier. The algorithm used guarantees that all resources
in use are marked as such in their blocks and inodes.
After a crash, the only resource allocation error that
occurs is that resources are marked as used
which are actually free. &man.fsck.8;
recognizes this situation, and frees the resources that
are no longer used. It is safe to ignore the dirty state
of the file system after a crash by forcibly mounting it
with mount -f. In order to free
resources that may be unused, &man.fsck.8; needs to be run
at a later time. This is the idea behind the
background &man.fsck.8;: at system
startup time, only a snapshot of the
file system is recorded and &man.fsck.8; is run afterwards.
All file systems can then be mounted
dirty, so the system startup proceeds in
multi-user mode. Then, background &man.fsck.8; is
scheduled for all file systems where this is required, to
free resources that may be unused. File systems that do
not use Soft Updates still need the usual foreground
&man.fsck.8;.The advantage is that meta-data operations are nearly
as fast as asynchronous updates and are faster than
logging, which has to write the
meta-data twice. The disadvantages are the complexity of
the code, a higher memory consumption, and some
idiosyncrasies. After a crash, the state of the file
system appears to be somewhat older. In
situations where the standard synchronous approach would
have caused some zero-length files to remain after the
&man.fsck.8;, these files do not exist at all with Soft
Updates because neither the meta-data nor the file contents
have been written to disk. Disk space is not released until
the updates have been written to disk, which may take place
some time after running &man.rm.1;. This may cause problems
when installing large amounts of data on a file system
that does not have enough free space to hold all the files
twice.Tuning Kernel Limitstuningkernel limitsFile/Process Limitskern.maxfileskern.maxfilesThe kern.maxfiles &man.sysctl.8;
variable can be raised or lowered based upon system
requirements. This variable indicates the maximum number
of file descriptors on the system. When the file descriptor
table is full, file: table is full
will show up repeatedly in the system message buffer, which
can be viewed using &man.dmesg.8;.Each open file, socket, or fifo uses one file
descriptor. A large-scale production server may easily
require many thousands of file descriptors, depending on the
kind and number of services running concurrently.In older &os; releases, the default value of
kern.maxfiles is derived from
maxusers in the kernel configuration file.
kern.maxfiles grows proportionally to the
value of maxusers. When compiling a custom
kernel, consider setting this kernel configuration option
according to the use of the system. From this number, the
kernel is given most of its pre-defined limits. Even though
a production machine may not have 256 concurrent users, the
resources needed may be similar to a high-scale web
server.The read-only &man.sysctl.8; variable
kern.maxusers is automatically sized at
boot based on the amount of memory available in the system,
and may be determined at run-time by inspecting the value
of kern.maxusers. Some systems require
larger or smaller values of
kern.maxusers and values of
64, 128, and
256 are not uncommon. Going above
256 is not recommended unless a huge
number of file descriptors is needed. Many of the tunable
values set to their defaults by
kern.maxusers may be individually
overridden at boot-time or run-time in
/boot/loader.conf. Refer to
&man.loader.conf.5; and
/boot/defaults/loader.conf for more
details and some hints.In older releases, the system will auto-tune
maxusers if it is set to
0.
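As a quick arithmetic illustration, the process table size derived from maxusers (via the 20 + 16 * maxusers formula discussed below) can be computed in shell. The sample values here are arbitrary, not read from a live system:

```shell
# Compute the derived maximum process count for a few sample
# maxusers values, per the 20 + 16 * maxusers formula.
for maxusers in 1 32 64 384; do
    maxproc=$((20 + 16 * maxusers))
    echo "maxusers=$maxusers -> maxproc=$maxproc"
done
```

With maxusers set to 64 this yields 1044, matching the figure quoted below.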
The auto-tuning algorithm sets
maxusers equal to the amount of
memory in the system, with a minimum of
32, and a maximum of
384. When
setting this option, set maxusers to
at least 4, especially if the system
runs &xorg; or is used to
compile software. The most important table set by
maxusers is the maximum number of
processes, which is set to
20 + 16 * maxusers. If
maxusers is set to 1,
there can only be
36 simultaneous processes, including
the 18 or so that the system starts up
at boot time and the 15 or so used by
&xorg;. Even a simple task like
reading a manual page will start up nine processes to
filter, decompress, and view it. Setting
maxusers to 64 allows
up to 1044 simultaneous processes, which
should be enough for nearly all uses. If, however, the
proc table full error is displayed
when trying to start another program, or a server is
running with a large number of simultaneous users, increase
the number and rebuild.maxusers does
not limit the number of users which
can log into the machine. It instead sets various table
sizes to reasonable values considering the maximum number
of users on the system and how many processes each user
will be running.kern.ipc.soacceptqueuekern.ipc.soacceptqueueThe kern.ipc.soacceptqueue
&man.sysctl.8; variable limits the size of the listen queue
for accepting new TCP connections. The
default value of 128 is typically too low
for robust handling of new connections on a heavily loaded
web server. For such environments, it is recommended to
increase this value to 1024 or higher. A
service such as &man.sendmail.8;, or
Apache may itself limit the
listen queue size, but will often have a directive in its
configuration file to adjust the queue size. Large listen
queues do a better job of avoiding Denial of Service
(DoS) attacks.Network LimitsThe NMBCLUSTERS kernel configuration
option dictates the amount of network Mbufs available to the
system. A heavily-trafficked server with a low number of
Mbufs will hinder performance. Each cluster represents
approximately 2 K of memory, so a value of
1024 represents 2
megabytes of kernel memory reserved for network buffers. A
simple calculation can be done to figure out how many are
needed. A web server which maxes out at
1000 simultaneous connections where each
connection uses a 6 K receive and 16 K send buffer,
requires approximately 32 MB worth of network buffers
to cover the web server. A good rule of thumb is to multiply
by 2, so
2x32 MB / 2 KB =
64 MB / 2 KB =
32768. Values between
4096 and 32768 are
recommended for machines with greater amounts of memory.
Never specify an arbitrarily high value for this parameter
as it could lead to a boot time crash. To observe network
cluster usage, use with
&man.netstat.1;.The kern.ipc.nmbclusters loader tunable
should be used to tune this at boot time. Only older versions
of &os; will require the use of the
NMBCLUSTERS kernel &man.config.8;
option.For busy servers that make extensive use of the
&man.sendfile.2; system call, it may be necessary to increase
the number of &man.sendfile.2; buffers via the
NSFBUFS kernel configuration option or by
setting its value in /boot/loader.conf
(see &man.loader.8; for details). A common indicator that
this parameter needs to be adjusted is when processes are seen
in the sfbufa state. The &man.sysctl.8;
variable kern.ipc.nsfbufs is read-only.
This parameter nominally scales with
kern.maxusers; however, it may be necessary
to tune it accordingly.Even though a socket has been marked as non-blocking,
calling &man.sendfile.2; on the non-blocking socket may
result in the &man.sendfile.2; call blocking until enough
struct sf_buf's are made
available.net.inet.ip.portrange.*net.inet.ip.portrange.*The net.inet.ip.portrange.*
&man.sysctl.8; variables control the port number ranges
automatically bound to TCP and
UDP sockets. There are three ranges: a
low range, a default range, and a high range. Most network
programs use the default range which is controlled by
net.inet.ip.portrange.first and
net.inet.ip.portrange.last, which default
to 1024 and 5000,
respectively. Bound port ranges are used for outgoing
connections and it is possible to run the system out of
ports under certain circumstances. This most commonly
occurs when running a heavily loaded web proxy. The port
range is not an issue when running a server which handles
mainly incoming connections, such as a web server, or has
a limited number of outgoing connections, such as a mail
relay. For situations where there is a shortage of ports,
it is recommended to increase
net.inet.ip.portrange.last modestly. A
value of 10000, 20000
or 30000 may be reasonable. Consider
firewall effects when changing the port range. Some
firewalls may block large ranges of ports, usually
low-numbered ports, and expect systems to use higher ranges
of ports for outgoing connections. For this reason, it
is not recommended that the value of
net.inet.ip.portrange.first be
lowered.TCP Bandwidth Delay ProductTCP Bandwidth Delay Product
Limitingnet.inet.tcp.inflight.enableTCP bandwidth delay product limiting
can be enabled by setting the
net.inet.tcp.inflight.enable
&man.sysctl.8; variable to 1. This
instructs the system to attempt to calculate the bandwidth
delay product for each connection and limit the amount of
data queued to the network to just the amount required to
maintain optimum throughput.This feature is useful when serving data over modems,
Gigabit Ethernet, high speed WAN links,
or any other link with a high bandwidth delay product,
especially when also using window scaling or when a large
send window has been configured. When enabling this option,
also set net.inet.tcp.inflight.debug to
0 to disable debugging. For production
use, setting net.inet.tcp.inflight.min
to at least 6144 may be beneficial.
Setting high minimums may effectively disable bandwidth
limiting, depending on the link. The limiting feature
reduces the amount of data built up in intermediate route
and switch packet queues and reduces the amount of data
built up in the local host's interface queue. With fewer
queued packets, interactive connections, especially over
slow modems, will operate with lower
Round Trip Times. This feature only
affects server-side data transmission, such as uploading.
It has no effect on data reception or downloading.Adjusting net.inet.tcp.inflight.stab
is not recommended. This parameter
defaults to 20, representing 2 maximal
packets added to the bandwidth delay product window
calculation. The additional window is required to stabilize
the algorithm and improve responsiveness to changing
conditions, but it can also result in higher &man.ping.8;
times over slow links, though still much lower than without
the inflight algorithm. In such cases, try reducing this
parameter to 15, 10,
or 5 and reducing
net.inet.tcp.inflight.min to a value such
as 3500 to get the desired effect.
Reducing these parameters should be done as a last resort
only.Virtual Memorykern.maxvnodesA vnode is the internal representation of a file or
directory. Increasing the number of vnodes available to
the operating system reduces disk I/O. Normally, this is
handled by the operating system and does not need to be
changed. In some cases where disk I/O is a bottleneck and
the system is running out of vnodes, this setting needs
to be increased. The amount of inactive and free
RAM will need to be taken into
account.To see the current number of vnodes in use:&prompt.root; sysctl vfs.numvnodes
vfs.numvnodes: 91349To see the maximum vnodes:&prompt.root; sysctl kern.maxvnodes
kern.maxvnodes: 100000If the current vnode usage is near the maximum, try
increasing kern.maxvnodes by a value of
1000. Keep an eye on the number of
vfs.numvnodes. If it climbs up to the
maximum again, kern.maxvnodes will need
to be increased further. Otherwise, a shift in memory
usage as reported by &man.top.1; should be visible and
more memory should be active.Adding Swap SpaceSometimes a system requires more swap space. This section
describes two methods to increase swap space: adding swap to an
existing partition or new hard drive, and creating a swap file
on an existing partition.For information on how to encrypt swap space, which options
exist, and why it should be done, refer to .Swap on a New Hard Drive or Existing PartitionAdding a new hard drive for swap gives better performance
than using a partition on an existing drive. Setting up
partitions and hard drives is explained in while discusses partition layouts
and swap partition size considerations.Use swapon to add a swap partition to
the system. For example:&prompt.root; swapon /dev/ada1s1bIt is possible to use any partition not currently
mounted, even if it already contains data. Using
swapon on a partition that contains data
will overwrite and destroy that data. Make sure that the
partition to be added as swap is really the intended
partition before running swapon.To automatically add this swap partition on boot, add an
entry to /etc/fstab:/dev/ada1s1b none swap sw 0 0See &man.fstab.5; for an explanation of the entries in
/etc/fstab. More information about
swapon can be found in
&man.swapon.8;.Creating a Swap FileThese examples create a 512M swap file called
/usr/swap0 instead of using a
partition.Using swap files requires that the module needed by
&man.md.4; has either been built into the kernel or has been
loaded before swap is enabled. See
for information about building
a custom kernel.Creating a Swap FileCreate the swap file:&prompt.root; dd if=/dev/zero of=/usr/swap0 bs=1m count=512Set the proper permissions on the new file:&prompt.root; chmod 0600 /usr/swap0Inform the system about the swap file by adding a
line to /etc/fstab:md99 none swap sw,file=/usr/swap0,late 0 0The &man.md.4; device md99 is
used, leaving lower device numbers available for
interactive use.Swap space will be added on system startup. To add
swap space immediately, use &man.swapon.8;:&prompt.root; swapon -aLPower and Resource ManagementHitenPandyaWritten by TomRhodesIt is important to utilize hardware resources in an
efficient manner. Power and resource management allows the
operating system to monitor system limits and to possibly
provide an alert if the system temperature increases
unexpectedly. An early specification for providing power
management was the Advanced Power Management
(APM) facility. APM
controls the power usage of a system based on its activity.
However, it was difficult and inflexible for operating systems
to manage the power usage and thermal properties of a system.
The hardware was managed by the BIOS and the
user had limited configurability and visibility into the power
management settings. The APM
BIOS is supplied by the vendor and is
specific to the hardware platform. An APM
driver in the operating system mediates access to the
APM Software Interface, which allows
management of power levels.There are four major problems in APM.
First, power management is done by the vendor-specific
BIOS, separate from the operating system.
For example, the user can set idle-time values for a hard drive
in the APM BIOS so that,
when exceeded, the BIOS spins down the hard
drive without the consent of the operating system. Second, the
APM logic is embedded in the
BIOS, and it operates outside the scope of
the operating system. This means that users can only fix
problems in the APM
BIOS by flashing a new one into the
ROM, which is a dangerous procedure with the
potential to leave the system in an unrecoverable state if it
fails. Third, APM is a vendor-specific
technology, meaning that there is a lot of duplication of
efforts and bugs found in one vendor's BIOS
may not be solved in others. Lastly, the APM
BIOS did not have enough room to implement a
sophisticated power policy or one that can adapt well to the
purpose of the machine.The Plug and Play BIOS
(PNPBIOS) was unreliable in many situations.
PNPBIOS is 16-bit technology, so the
operating system has to use 16-bit emulation in order to
interface with PNPBIOS methods. &os;
provides an APM driver as
APM should still be used for systems
manufactured at or before the year 2000. The driver is
documented in &man.apm.4;.ACPIAPMThe successor to APM is the Advanced
Configuration and Power Interface (ACPI).
ACPI is a standard written by an alliance of
vendors to provide an interface for hardware resources and power
management. It is a key element in Operating
System-directed configuration and Power Management
as it provides more control and flexibility to the operating
system.This chapter demonstrates how to configure
ACPI on &os;. It then offers some tips on
how to debug ACPI and how to submit a problem
report containing debugging information so that developers can
diagnose and fix ACPI issues.Configuring ACPIIn &os; the &man.acpi.4; driver is loaded by default at
system boot and should not be compiled
into the kernel. This driver cannot be unloaded after boot
because the system bus uses it for various hardware
interactions. However, if the system is experiencing
problems, ACPI can be disabled altogether
by rebooting after setting
hint.acpi.0.disabled="1" in
/boot/loader.conf or by setting this
variable at the loader prompt, as described in .ACPI and APM
cannot coexist and should be used separately. The last one
to load will terminate if the driver notices the other is
running.ACPI can be used to put the system into
a sleep mode with acpiconf, the
-s flag, and a number from
1 to 5. Most users only
need 1 (quick suspend to
RAM) or 3 (suspend to
RAM). Option 5 performs
a soft-off which is the same as running
halt -p.Other options are available using
sysctl. Refer to &man.acpi.4; and
&man.acpiconf.8; for more information.Common ProblemsACPIACPI is present in all modern computers
that conform to the ia32 (x86) and amd64
(AMD) architectures. The full standard has
many features including CPU performance
management, power planes control, thermal zones, various
battery systems, embedded controllers, and bus enumeration.
Most systems implement less than the full standard. For
instance, a desktop system usually only implements bus
enumeration while a laptop might have cooling and battery
management support as well. Laptops also have suspend and
resume, with their own associated complexity.An ACPI-compliant system has various
components. The BIOS and chipset vendors
provide various fixed tables, such as FADT,
in memory that specify things like the APIC
map (used for SMP), config registers, and
simple configuration values. Additionally, a bytecode table,
the Differentiated System Description Table
DSDT, specifies a tree-like name space of
devices and methods.The ACPI driver must parse the fixed
tables, implement an interpreter for the bytecode, and modify
device drivers and the kernel to accept information from the
ACPI subsystem. For &os;, &intel; has
provided an interpreter (ACPI-CA) that is
shared with &linux; and NetBSD. The path to the
ACPI-CA source code is
src/sys/contrib/dev/acpica. The glue
code that allows ACPI-CA to work on &os; is
in src/sys/dev/acpica/Osd. Finally,
drivers that implement various ACPI devices
are found in src/sys/dev/acpica.ACPIproblemsFor ACPI to work correctly, all the
parts have to work correctly. Here are some common problems,
in order of frequency of appearance, and some possible
workarounds or fixes. If a fix does not resolve the issue,
refer to for instructions
on how to submit a bug report.Mouse IssuesIn some cases, resuming from a suspend operation will
cause the mouse to fail. A known workaround is to add
hint.psm.0.flags="0x3000" to
/boot/loader.conf.Suspend/ResumeACPI has three suspend to
RAM (STR) states,
S1-S3, and one suspend
to disk state (STD), called
S4. STD can be
implemented in two separate ways. The
S4BIOS is a
BIOS-assisted suspend to disk and
S4OS is implemented
entirely by the operating system. The normal state the
system is in when plugged in but not powered up is
soft off (S5).Use sysctl hw.acpi to check for the
suspend-related items. These example results are from a
Thinkpad:hw.acpi.supported_sleep_state: S3 S4 S5
hw.acpi.s4bios: 0Use acpiconf -s to test
S3, S4, and
S5. An hw.acpi.s4bios of one
(1) indicates
S4BIOS support instead
of S4 operating system support.When testing suspend/resume, start with
S1, if supported. This state is most
likely to work since it does not require much driver
support. No one has implemented S2,
which is similar to S1. Next, try
S3. This is the deepest
STR state and requires a lot of driver
support to properly reinitialize the hardware.A common problem with suspend/resume is that many device
drivers do not save, restore, or reinitialize their
firmware, registers, or device memory properly. As a first
attempt at debugging the problem, try:&prompt.root; sysctl debug.bootverbose=1
&prompt.root; sysctl debug.acpi.suspend_bounce=1
&prompt.root; acpiconf -s 3This test emulates the suspend/resume cycle of all
device drivers without actually going into
S3 state. In some cases, problems such
as losing firmware state, device watchdog time out, and
retrying forever, can be captured with this method. Note
that the system will not really enter S3
state, which means devices may not lose power, and many
will work fine even if suspend/resume methods are totally
missing, unlike real S3 state.Harder cases require additional hardware, such as a
serial port and cable for debugging through a serial
console, a Firewire port and cable for using &man.dcons.4;,
and kernel debugging skills.To help isolate the problem, unload as many drivers as
possible. If it works, narrow down which driver is the
problem by loading drivers until it fails again. Typically,
binary drivers like nvidia.ko, display
drivers, and USB will have the most
problems while Ethernet interfaces usually work fine. If
drivers can be properly loaded and unloaded, automate this
by putting the appropriate commands in
/etc/rc.suspend and
/etc/rc.resume. Try setting
hw.acpi.reset_video to 1
if the display is messed up after resume. Try setting
longer or shorter values for
hw.acpi.sleep_delay to see if that
helps.Try loading a recent &linux; distribution to see if
suspend/resume works on the same hardware. If it works on
&linux;, it is likely a &os; driver problem. Narrowing down
which driver causes the problem will assist developers in
fixing the problem. Since the ACPI
maintainers rarely maintain other drivers, such as sound
or ATA, any driver problems should also
be posted to the &a.current.name; list and mailed to the
driver maintainer. Advanced users can include debugging
&man.printf.3;s in a problematic driver to track down where
in its resume function it hangs.Finally, try disabling ACPI and
enabling APM instead. If suspend/resume
works with APM, stick with
APM, especially on older hardware
(pre-2000). It took vendors a while to get
ACPI support correct and older hardware
is more likely to have BIOS problems with
ACPI.System HangsMost system hangs are a result of lost interrupts or an
interrupt storm. Chipsets may have problems based on how
the BIOS configures interrupts before
boot, correctness of the APIC
(MADT) table, and routing of the System
Control Interrupt (SCI).interrupt stormsInterrupt storms can be distinguished from lost
interrupts by checking the output of
vmstat -i and looking at the line that
has acpi0. If the counter is increasing
at more than a couple per second, there is an interrupt
storm. If the system appears hung, try breaking to
DDB (CTRL+ALT+ESC on console) and type
show interrupts.APICdisablingWhen dealing with interrupt problems, try disabling
APIC support with
hint.apic.0.disabled="1" in
/boot/loader.conf.PanicsPanics are relatively rare for ACPI
and are the top priority to be fixed. The first step is to
isolate the steps to reproduce the panic, if possible, and
get a backtrace. Follow the advice for enabling
options DDB and setting up a serial
console in or setting
up a dump partition. To get a backtrace in
DDB, use tr. When
handwriting the backtrace, get at least the last five and
the top five lines in the trace.Then, try to isolate the problem by booting with
ACPI disabled. If that works, isolate
the ACPI subsystem by using various
values of debug.acpi.disabled. See
&man.acpi.4; for some examples.System Powers Up After Suspend or ShutdownFirst, try setting
hw.acpi.disable_on_poweroff="0" in
/boot/loader.conf. This keeps
ACPI from disabling various events during
the shutdown process. Some systems need this value set to
1 (the default) for the same reason.
This usually fixes the problem of a system powering up
spontaneously after a suspend or poweroff.BIOS Contains Buggy BytecodeACPIASLSome BIOS vendors provide incorrect
or buggy bytecode. This is usually manifested by kernel
console messages like this:ACPI-1287: *** Error: Method execution failed [\_SB_.PCI0.LPC0.FIGD._STA] \
(Node 0xc3f6d160), AE_NOT_FOUNDOften, these problems may be resolved by updating the
BIOS to the latest revision. Most
console messages are harmless, but if there are other
problems, like the battery status is not working, these
messages are a good place to start looking for
problems.Overriding the Default AMLThe BIOS bytecode, known as
ACPI Machine Language
(AML), is compiled from a source language
called ACPI Source Language
(ASL). The AML is
found in the table known as the Differentiated System
Description Table (DSDT).ACPIASLThe goal of &os; is for everyone to have working
ACPI without any user intervention.
Workarounds are still being developed for common mistakes made
by BIOS vendors. The µsoft;
interpreter (acpi.sys and
acpiec.sys) does not strictly check for
adherence to the standard, and thus many
BIOS vendors who only test
ACPI under &windows; never fix their
ASL. &os; developers continue to identify
and document which non-standard behavior is allowed by
µsoft;'s interpreter and replicate it so that &os; can
work without forcing users to fix the
ASL.To help identify buggy behavior and possibly fix it
manually, a copy can be made of the system's
ASL. To copy the system's
ASL to a specified file name, use
acpidump with -t, to show
the contents of the fixed tables, and -d, to
disassemble the AML:&prompt.root; acpidump -td > my.aslSome AML versions assume the user is
running &windows;. To override this, set
hw.acpi.osname="Windows
2009" in
/boot/loader.conf, using the most recent
&windows; version listed in the ASL.Other workarounds may require my.asl
to be customized. If this file is edited, compile the new
ASL using the following command. Warnings
can usually be ignored, but errors are bugs that will usually
prevent ACPI from working correctly.&prompt.root; iasl -f my.aslIncluding -f forces creation of the
AML, even if there are errors during
compilation. Some errors, such as missing return statements,
are automatically worked around by the &os;
interpreter.The default output filename for iasl is
DSDT.aml. Load this file instead of the
BIOS's buggy copy, which is still present
in flash memory, by editing
/boot/loader.conf as follows:acpi_dsdt_load="YES"
acpi_dsdt_name="/boot/DSDT.aml"Be sure to copy DSDT.aml to
/boot, then reboot the system. If this
fixes the problem, send a &man.diff.1; of the old and new
ASL to &a.acpi.name; so that developers can
work around the buggy behavior in
acpica.Getting and Submitting Debugging InfoNateLawsonWritten by PeterSchultzWith contributions from TomRhodesACPIproblemsACPIdebuggingThe ACPI driver has a flexible
debugging facility. A set of subsystems and the level of
verbosity can be specified. The subsystems to debug are
specified as layers and are broken down into components
(ACPI_ALL_COMPONENTS) and
ACPI hardware support
(ACPI_ALL_DRIVERS). The verbosity of
debugging output is specified as the level and ranges from
just report errors (ACPI_LV_ERROR) to
everything (ACPI_LV_VERBOSE). The level is
a bitmask so multiple options can be set at once, separated by
spaces. In practice, a serial console should be used to log
the output so it is not lost as the console message buffer
flushes. A full list of the individual layers and levels is
found in &man.acpi.4;.Debugging output is not enabled by default. To enable it,
add options ACPI_DEBUG to the custom kernel
configuration file if ACPI is compiled into
the kernel. Add ACPI_DEBUG=1 to
/etc/make.conf to enable it globally. If
a module is used instead of a custom kernel, recompile just
the acpi.ko module as follows:&prompt.root; cd /sys/modules/acpi/acpi && make clean && make ACPI_DEBUG=1Copy the compiled acpi.ko to
/boot/kernel and add the desired level
and layer to /boot/loader.conf. The
entries in this example enable debug messages for all
ACPI components and hardware drivers and
output error messages at the least verbose level:debug.acpi.layer="ACPI_ALL_COMPONENTS ACPI_ALL_DRIVERS"
debug.acpi.level="ACPI_LV_ERROR"If the required information is triggered by a specific
event, such as a suspend and then resume, do not modify
/boot/loader.conf. Instead, use
sysctl to specify the layer and level after
booting and preparing the system for the specific event. The
variables which can be set using sysctl are
named the same as the tunables in
/boot/loader.conf.ACPIproblemsOnce the debugging information is gathered, it can be sent
to &a.acpi.name; so that it can be used by the &os;
ACPI maintainers to identify the root cause
of the problem and to develop a solution.Before submitting debugging information to this mailing
list, ensure the latest BIOS version is
installed and, if available, the embedded controller
firmware version.When submitting a problem report, include the following
information:Description of the buggy behavior, including system
type, model, and anything that causes the bug to appear.
Note as accurately as possible when the bug began
occurring if it is new.The output of dmesg after running
boot -v, including any error messages
generated by the bug.The dmesg output from boot
-v with ACPI disabled,
if disabling ACPI helps to fix the
problem.Output from sysctl hw.acpi. This
lists which features the system offers.The URL to a pasted version of the
system's ASL. Do
not send the ASL
directly to the list as it can be very large. Generate a
copy of the ASL by running this
command:&prompt.root; acpidump -dt > name-system.aslSubstitute the login name for
name and manufacturer/model for
system. For example, use
njl-FooCo6000.asl.Most &os; developers watch the &a.current;, but one should
submit problems to &a.acpi.name; to be sure it is seen. Be
patient when waiting for a response. If the bug is not
immediately apparent, submit a bug report.
When entering a PR,
include the same information as requested above. This helps
developers to track the problem and resolve it. Do not send a
PR without emailing &a.acpi.name; first as
it is likely that the problem has been reported before.ReferencesMore information about ACPI may be
found in the following locations:The &os; ACPI Mailing List Archives
(https://lists.freebsd.org/pipermail/freebsd-acpi/)The ACPI 2.0 Specification (http://acpi.info/spec.htm)&man.acpi.4;, &man.acpi.thermal.4;, &man.acpidump.8;,
&man.iasl.8;, and &man.acpidb.8;
diff --git a/en_US.ISO8859-1/books/handbook/geom/chapter.xml b/en_US.ISO8859-1/books/handbook/geom/chapter.xml
index dcb1e12e3c..a682799543 100644
--- a/en_US.ISO8859-1/books/handbook/geom/chapter.xml
+++ b/en_US.ISO8859-1/books/handbook/geom/chapter.xml
@@ -1,1693 +1,1693 @@
GEOM: Modular Disk Transformation FrameworkTomRhodesWritten by SynopsisGEOMGEOM Disk FrameworkGEOMIn &os;, the GEOM framework permits
access and control to classes, such as Master Boot Records and
BSD labels, through the use of providers, or
the disk devices in /dev. By supporting
various software RAID configurations,
GEOM transparently provides access to the
operating system and operating system utilities.This chapter covers the use of disks under the
GEOM framework in &os;. This includes the
major RAID control utilities which use the
framework for configuration. This chapter is not a definitive
guide to RAID configurations and only
GEOM-supported RAID
classifications are discussed.After reading this chapter, you will know:What type of RAID support is
available through GEOM.How to use the base utilities to configure, maintain,
and manipulate the various RAID
levels.How to mirror, stripe, encrypt, and remotely connect
disk devices through GEOM.How to troubleshoot disks attached to the
GEOM framework.Before reading this chapter, you should:Understand how &os; treats disk devices ().Know how to configure and install a new kernel ().RAID0 - StripingTomRhodesWritten by MurrayStokelyGEOMStripingStriping combines several disk drives into a single volume.
Striping can be performed through the use of hardware
RAID controllers. The
GEOM disk subsystem provides software support
for disk striping, also known as RAID0,
without the need for a RAID disk
controller.In RAID0, data is split into blocks that
are written across all the drives in the array. As seen in the
following illustration, instead of having to wait on the system
to write 256k to one disk, RAID0 can
simultaneously write 64k to each of the four disks in the array,
offering superior I/O performance. This
performance can be enhanced further by using multiple disk
controllers.Disk Striping IllustrationEach disk in a RAID0 stripe must be of
the same size, since I/O requests are
interleaved to read or write to multiple disks in
parallel.RAID0 does not
provide any redundancy. This means that if one disk in the
array fails, all of the data on the disks is lost. If the
data is important, implement a backup strategy that regularly
saves backups to a remote system or device.The process for creating a software,
GEOM-based RAID0 on a &os;
system using commodity disks is as follows. Once the stripe is
created, refer to &man.gstripe.8; for more information on how
to control an existing stripe.Creating a Stripe of Unformatted ATA
DisksLoad the geom_stripe.ko
module:&prompt.root; kldload geom_stripeEnsure that a suitable mount point exists. If this
volume will become a root partition, then temporarily use
another mount point such as
/mnt.Determine the device names for the disks which will
be striped, and create the new stripe device. For example,
to stripe two unused and unpartitioned
ATA disks with device names of
/dev/ad2 and
/dev/ad3:&prompt.root; gstripe label -v st0 /dev/ad2 /dev/ad3
Metadata value stored on /dev/ad2.
Metadata value stored on /dev/ad3.
Done.Write a standard label, also known as a partition table,
on the new volume and install the default bootstrap
code:&prompt.root; bsdlabel -wB /dev/stripe/st0This process should create two other devices in
/dev/stripe in addition to
st0. Those include
st0a and st0c. At
this point, a UFS file system can be
created on st0a using
newfs:&prompt.root; newfs -U /dev/stripe/st0aMany numbers will glide across the screen, and after a
few seconds, the process will be complete. The volume has
been created and is ready to be mounted.To manually mount the created disk stripe:&prompt.root; mount /dev/stripe/st0a /mntTo mount this striped file system automatically during
the boot process, place the volume information in
/etc/fstab. In this example, a
permanent mount point, named stripe, is
created:&prompt.root; mkdir /stripe
&prompt.root; echo "/dev/stripe/st0a /stripe ufs rw 2 2" >> /etc/fstabThe geom_stripe.ko module must also
be automatically loaded during system initialization, by
adding a line to
/boot/loader.conf:&prompt.root; echo 'geom_stripe_load="YES"' >> /boot/loader.confRAID1 - MirroringGEOMDisk MirroringRAID1RAID1, or
mirroring, is the technique of writing
the same data to more than one disk drive. Mirrors are usually
used to guard against data loss due to drive failure. Each
drive in a mirror contains an identical copy of the data. When
an individual drive fails, the mirror continues to work,
providing data from the drives that are still functioning. The
computer keeps running, and the administrator has time to
replace the failed drive without user interruption.Two common situations are illustrated in these examples.
The first creates a mirror out of two new drives and uses it as
a replacement for an existing single drive. The second example
creates a mirror on a single new drive, copies the old drive's
data to it, then inserts the old drive into the mirror. While
this procedure is slightly more complicated, it only requires
one new drive.Traditionally, the two drives in a mirror are identical in
model and capacity, but &man.gmirror.8; does not require that.
Mirrors created with dissimilar drives will have a capacity
equal to that of the smallest drive in the mirror. Extra space
on larger drives will be unused. Drives inserted into the
mirror later must have at least as much capacity as the smallest
drive already in the mirror.The mirroring procedures shown here are non-destructive,
but as with any major disk operation, make a full backup
first.While &man.dump.8; is used in these procedures
to copy file systems, it does not work on file systems with
soft updates journaling. See &man.tunefs.8; for information
on detecting and disabling soft updates journaling.Metadata IssuesMany disk systems store metadata at the end of each disk.
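Where exactly that trailing metadata lives follows from simple arithmetic on the numbers &man.diskinfo.8; reports. This sketch uses the media and sector sizes of the 1 TB example drive used later in this chapter; the values are assumptions for illustration:

```shell
# gmirror(8) stores one block of metadata in a provider's last
# sector. Its LBA follows from mediasize and sectorsize, as
# reported by diskinfo -v (values assumed from the 1 TB example
# drive elsewhere in this chapter).
mediasize=1000204821504   # bytes
sectorsize=512            # bytes per sector

total_sectors=$((mediasize / sectorsize))
metadata_lba=$((total_sectors - 1))

echo "total sectors: $total_sectors"
echo "metadata LBA:  $metadata_lba"
```

The backup GPT partition table occupies those same final sectors, which is why leftover GPT metadata and gmirror metadata end up competing for the end of the disk.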
Old metadata should be erased before reusing the disk for a
mirror. Most problems are caused by two particular types of
leftover metadata: GPT partition tables and
old metadata from a previous mirror.GPT metadata can be erased with
&man.gpart.8;. This example erases both primary and backup
GPT partition tables from disk
ada8:&prompt.root; gpart destroy -F ada8A disk can be removed from an active mirror and the
metadata erased in one step using &man.gmirror.8;. Here, the
example disk ada8 is removed from the
active mirror gm4:&prompt.root; gmirror remove gm4 ada8If the mirror is not running, but old mirror metadata is
still on the disk, use gmirror clear to
remove it:&prompt.root; gmirror clear ada8&man.gmirror.8; stores one block of metadata at the end of
- the disk. Because GPT partition schemes
+ the disk. As GPT partition schemes
also store metadata at the end of the disk, mirroring entire
GPT disks with &man.gmirror.8; is not
recommended. MBR partitioning is used here
because it only stores a partition table at the start of the
disk and does not conflict with the mirror metadata.Creating a Mirror with Two New DisksIn this example, &os; has already been installed on a
single disk, ada0. Two new disks,
ada1 and ada2, have
been connected to the system. A new mirror will be created on
these two disks and used to replace the old single
disk.The geom_mirror.ko kernel module must
either be built into the kernel or loaded at boot- or
run-time. Manually load the kernel module now:&prompt.root; gmirror loadCreate the mirror with the two new drives:&prompt.root; gmirror label -v gm0 /dev/ada1 /dev/ada2gm0 is a user-chosen device name
assigned to the new mirror. After the mirror has been
started, this device name appears in
/dev/mirror/.MBR and
bsdlabel partition tables can now
be created on the mirror with &man.gpart.8;. This example
uses a traditional file system layout, with partitions for
/, swap, /var,
/tmp, and /usr. A
single / and a swap partition
will also work.Partitions on the mirror do not have to be the same size
as those on the existing disk, but they must be large enough
to hold all the data already present on
ada0.&prompt.root; gpart create -s MBR mirror/gm0
&prompt.root; gpart add -t freebsd -a 4k mirror/gm0
&prompt.root; gpart show mirror/gm0
=> 63 156301423 mirror/gm0 MBR (74G)
63 63 - free - (31k)
126 156301299 1 freebsd (74G)
156301425 61 - free - (30k)&prompt.root; gpart create -s BSD mirror/gm0s1
&prompt.root; gpart add -t freebsd-ufs -a 4k -s 2g mirror/gm0s1
&prompt.root; gpart add -t freebsd-swap -a 4k -s 4g mirror/gm0s1
&prompt.root; gpart add -t freebsd-ufs -a 4k -s 2g mirror/gm0s1
&prompt.root; gpart add -t freebsd-ufs -a 4k -s 1g mirror/gm0s1
&prompt.root; gpart add -t freebsd-ufs -a 4k mirror/gm0s1
&prompt.root; gpart show mirror/gm0s1
=> 0 156301299 mirror/gm0s1 BSD (74G)
0 2 - free - (1.0k)
2 4194304 1 freebsd-ufs (2.0G)
4194306 8388608 2 freebsd-swap (4.0G)
12582914 4194304 4 freebsd-ufs (2.0G)
16777218 2097152 5 freebsd-ufs (1.0G)
18874370 137426928 6 freebsd-ufs (65G)
156301298 1 - free - (512B)Make the mirror bootable by installing bootcode in the
MBR and bsdlabel and setting the active
slice:&prompt.root; gpart bootcode -b /boot/mbr mirror/gm0
&prompt.root; gpart set -a active -i 1 mirror/gm0
&prompt.root; gpart bootcode -b /boot/boot mirror/gm0s1Format the file systems on the new mirror, enabling
soft-updates.&prompt.root; newfs -U /dev/mirror/gm0s1a
&prompt.root; newfs -U /dev/mirror/gm0s1d
&prompt.root; newfs -U /dev/mirror/gm0s1e
&prompt.root; newfs -U /dev/mirror/gm0s1fFile systems from the original ada0
disk can now be copied onto the mirror with &man.dump.8; and
&man.restore.8;.&prompt.root; mount /dev/mirror/gm0s1a /mnt
&prompt.root; dump -C16 -b64 -0aL -f - / | (cd /mnt && restore -rf -)
&prompt.root; mount /dev/mirror/gm0s1d /mnt/var
&prompt.root; mount /dev/mirror/gm0s1e /mnt/tmp
&prompt.root; mount /dev/mirror/gm0s1f /mnt/usr
&prompt.root; dump -C16 -b64 -0aL -f - /var | (cd /mnt/var && restore -rf -)
&prompt.root; dump -C16 -b64 -0aL -f - /tmp | (cd /mnt/tmp && restore -rf -)
&prompt.root; dump -C16 -b64 -0aL -f - /usr | (cd /mnt/usr && restore -rf -)Edit /mnt/etc/fstab to point to
the new mirror file systems:# Device Mountpoint FStype Options Dump Pass#
/dev/mirror/gm0s1a / ufs rw 1 1
/dev/mirror/gm0s1b none swap sw 0 0
/dev/mirror/gm0s1d /var ufs rw 2 2
/dev/mirror/gm0s1e /tmp ufs rw 2 2
/dev/mirror/gm0s1f /usr ufs rw 2 2If the geom_mirror.ko kernel module
has not been built into the kernel,
/mnt/boot/loader.conf is edited to load
the module at boot:geom_mirror_load="YES"Reboot the system to test the new mirror and verify that
all data has been copied. The BIOS will
see the mirror as two individual drives rather than a mirror.
- Because the drives are identical, it does not matter which is
+ Since the drives are identical, it does not matter which is
selected to boot.See if there are
problems booting. Powering down and disconnecting the
original ada0 disk will allow it to be
kept as an offline backup.In use, the mirror will behave just like the original
single drive.Creating a Mirror with an Existing DriveIn this example, &os; has already been installed on a
single disk, ada0. A new disk,
ada1, has been connected to the system.
A one-disk mirror will be created on the new disk, the
existing system copied onto it, and then the old disk will be
inserted into the mirror. This slightly complex procedure is
required because gmirror needs to put a
512-byte block of metadata at the end of each disk, and the
existing ada0 has usually had all of its
space already allocated.Load the geom_mirror.ko kernel
module:&prompt.root; gmirror loadCheck the media size of the original disk with
diskinfo:&prompt.root; diskinfo -v ada0 | head -n3
/dev/ada0
512 # sectorsize
1000204821504 # mediasize in bytes (931G)Create a mirror on the new disk. To make certain that the
mirror capacity is not any larger than the original
ada0 drive, &man.gnop.8; is used to
create a fake drive of the exact same size. This drive does
not store any data, but is used only to limit the size of the
mirror. When &man.gmirror.8; creates the mirror, it will
restrict the capacity to the size of
gzero.nop, even if the new
ada1 drive has more space. Note that the
1000204821504 in the second line is
equal to ada0's media size as shown by
diskinfo above.&prompt.root; geom zero load
&prompt.root; gnop create -s 1000204821504 gzero
&prompt.root; gmirror label -v gm0 gzero.nop ada1
&prompt.root; gmirror forget gm0Since gzero.nop does not store any
data, the mirror does not see it as connected. The mirror is
told to forget unconnected components, removing
references to gzero.nop. The result is a
mirror device containing only a single disk,
ada1.After creating gm0, view the
partition table on ada0. This output is
from a 1 TB drive. If there is some unallocated space at
the end of the drive, the contents may be copied directly from
ada0 to the new mirror.However, if the output shows that all of the space on the
disk is allocated, as in the following listing, there is no
space available for the 512-byte mirror metadata at the end of
the disk.&prompt.root; gpart show ada0
=> 63 1953525105 ada0 MBR (931G)
63 1953525105 1 freebsd [active] (931G)In this case, the partition table must be edited to reduce
the capacity by one sector on mirror/gm0.
The procedure will be explained later.In either case, partition tables on the primary disk
should be first copied using gpart backup
and gpart restore.&prompt.root; gpart backup ada0 > table.ada0
&prompt.root; gpart backup ada0s1 > table.ada0s1These commands create two files,
table.ada0 and
table.ada0s1. This example is from a
1 TB drive:&prompt.root; cat table.ada0
MBR 4
1 freebsd 63 1953525105 [active]&prompt.root; cat table.ada0s1
BSD 8
1 freebsd-ufs 0 4194304
2 freebsd-swap 4194304 33554432
4 freebsd-ufs 37748736 50331648
5 freebsd-ufs 88080384 41943040
6 freebsd-ufs 130023424 838860800
7 freebsd-ufs 968884224 984640881If no free space is shown at the end of the disk, the size
of both the slice and the last partition must be reduced by
one sector. Edit the two files, reducing the size of both the
slice and last partition by one. These are the last numbers
in each listing.&prompt.root; cat table.ada0
MBR 4
1 freebsd 63 1953525104 [active]&prompt.root; cat table.ada0s1
BSD 8
1 freebsd-ufs 0 4194304
2 freebsd-swap 4194304 33554432
4 freebsd-ufs 37748736 50331648
5 freebsd-ufs 88080384 41943040
6 freebsd-ufs 130023424 838860800
7 freebsd-ufs 968884224 984640880If at least one sector was unallocated at the end of the
disk, these two files can be used without modification.Now restore the partition table into
mirror/gm0:&prompt.root; gpart restore mirror/gm0 < table.ada0
&prompt.root; gpart restore mirror/gm0s1 < table.ada0s1Check the partition table with
gpart show. This example has
gm0s1a for /,
gm0s1d for /var,
gm0s1e for /usr,
gm0s1f for /data1,
and gm0s1g for
/data2.&prompt.root; gpart show mirror/gm0
=> 63 1953525104 mirror/gm0 MBR (931G)
63 1953525042 1 freebsd [active] (931G)
1953525105 62 - free - (31k)
&prompt.root; gpart show mirror/gm0s1
=> 0 1953525042 mirror/gm0s1 BSD (931G)
0 2097152 1 freebsd-ufs (1.0G)
2097152 16777216 2 freebsd-swap (8.0G)
18874368 41943040 4 freebsd-ufs (20G)
60817408 20971520 5 freebsd-ufs (10G)
81788928 629145600 6 freebsd-ufs (300G)
710934528 1242590514 7 freebsd-ufs (592G)
1953525042 63 - free - (31k)Both the slice and the last partition must have at least
one free block at the end of the disk.Create file systems on these new partitions. The number
of partitions will vary to match the original disk,
ada0.&prompt.root; newfs -U /dev/mirror/gm0s1a
&prompt.root; newfs -U /dev/mirror/gm0s1d
&prompt.root; newfs -U /dev/mirror/gm0s1e
&prompt.root; newfs -U /dev/mirror/gm0s1f
&prompt.root; newfs -U /dev/mirror/gm0s1gMake the mirror bootable by installing bootcode in the
MBR and bsdlabel and setting the active
slice:&prompt.root; gpart bootcode -b /boot/mbr mirror/gm0
&prompt.root; gpart set -a active -i 1 mirror/gm0
&prompt.root; gpart bootcode -b /boot/boot mirror/gm0s1Adjust /etc/fstab to use the new
partitions on the mirror. Back up this file first by copying
it to /etc/fstab.orig.&prompt.root; cp /etc/fstab /etc/fstab.origEdit /etc/fstab, replacing
/dev/ada0 with
mirror/gm0.# Device Mountpoint FStype Options Dump Pass#
/dev/mirror/gm0s1a / ufs rw 1 1
/dev/mirror/gm0s1b none swap sw 0 0
/dev/mirror/gm0s1d /var ufs rw 2 2
/dev/mirror/gm0s1e /usr ufs rw 2 2
/dev/mirror/gm0s1f /data1 ufs rw 2 2
/dev/mirror/gm0s1g /data2 ufs rw 2 2If the geom_mirror.ko kernel module
has not been built into the kernel, edit
/boot/loader.conf to load it at
boot:geom_mirror_load="YES"File systems from the original disk can now be copied onto
the mirror with &man.dump.8; and &man.restore.8;. Each file
system dumped with dump -L will create a
snapshot first, which can take some time.&prompt.root; mount /dev/mirror/gm0s1a /mnt
&prompt.root; dump -C16 -b64 -0aL -f - / | (cd /mnt && restore -rf -)
&prompt.root; mount /dev/mirror/gm0s1d /mnt/var
&prompt.root; mount /dev/mirror/gm0s1e /mnt/usr
&prompt.root; mount /dev/mirror/gm0s1f /mnt/data1
&prompt.root; mount /dev/mirror/gm0s1g /mnt/data2
&prompt.root; dump -C16 -b64 -0aL -f - /usr | (cd /mnt/usr && restore -rf -)
&prompt.root; dump -C16 -b64 -0aL -f - /var | (cd /mnt/var && restore -rf -)
&prompt.root; dump -C16 -b64 -0aL -f - /data1 | (cd /mnt/data1 && restore -rf -)
&prompt.root; dump -C16 -b64 -0aL -f - /data2 | (cd /mnt/data2 && restore -rf -)Restart the system, booting from
ada1. If everything is working, the
system will boot from mirror/gm0, which
now contains the same data as ada0 had
previously. See if
there are problems booting.At this point, the mirror still consists of only the
single ada1 disk.After booting from mirror/gm0
successfully, the final step is inserting
ada0 into the mirror.When ada0 is inserted into the
mirror, its former contents will be overwritten by data from
the mirror. Make certain that
mirror/gm0 has the same contents as
ada0 before adding
ada0 to the mirror. If the contents
previously copied by &man.dump.8; and &man.restore.8; are
not identical to what was on ada0,
revert /etc/fstab to mount the file
systems on ada0, reboot, and start the
whole procedure again.&prompt.root; gmirror insert gm0 ada0
GEOM_MIRROR: Device gm0: rebuilding provider ada0Synchronization between the two disks will start
immediately. Use gmirror status to view
the progress.&prompt.root; gmirror status
Name Status Components
mirror/gm0 DEGRADED ada1 (ACTIVE)
ada0 (SYNCHRONIZING, 64%)After a while, synchronization will finish.GEOM_MIRROR: Device gm0: rebuilding provider ada0 finished.
&prompt.root; gmirror status
Name Status Components
mirror/gm0 COMPLETE ada1 (ACTIVE)
ada0 (ACTIVE)mirror/gm0 now consists
of the two disks ada0 and
ada1, and the contents are automatically
synchronized with each other. In use,
mirror/gm0 will behave just like the
original single drive.TroubleshootingIf the system no longer boots, BIOS
settings may have to be changed to boot from one of the new
mirrored drives. Either mirror drive can be used for booting,
as they contain identical data.If the boot stops with this message, something is wrong
with the mirror device:Mounting from ufs:/dev/mirror/gm0s1a failed with error 19.
Loader variables:
vfs.root.mountfrom=ufs:/dev/mirror/gm0s1a
vfs.root.mountfrom.options=rw
Manual root filesystem specification:
<fstype>:<device> [options]
Mount <device> using filesystem <fstype>
and with the specified (optional) option list.
eg. ufs:/dev/da0s1a
zfs:tank
cd9660:/dev/acd0 ro
(which is equivalent to: mount -t cd9660 -o ro /dev/acd0 /)
? List valid disk boot devices
. Yield 1 second (for background tasks)
<empty line> Abort manual input
mountroot>Forgetting to load the geom_mirror.ko
module in /boot/loader.conf can cause
this problem. To fix it, boot from a &os;
installation media and choose Shell at the
first prompt. Then load the mirror module and mount the
mirror device:&prompt.root; gmirror load
&prompt.root; mount /dev/mirror/gm0s1a /mntEdit /mnt/boot/loader.conf, adding a
line to load the mirror module:geom_mirror_load="YES"Save the file and reboot.Other problems that cause error 19
require more effort to fix. Although the system should boot
from ada0, another prompt to select a
shell will appear if /etc/fstab is
incorrect. Enter ufs:/dev/ada0s1a at the
boot loader prompt and press Enter. Undo the
edits in /etc/fstab then mount the file
systems from the original disk (ada0)
instead of the mirror. Reboot the system and try the
procedure again.Enter full pathname of shell or RETURN for /bin/sh:
&prompt.root; cp /etc/fstab.orig /etc/fstab
&prompt.root; reboot

Recovering from Disk Failure

The benefit of disk mirroring is that an individual disk
can fail without causing the mirror to lose any data. In the
above example, if ada0 fails, the mirror
will continue to work, providing data from the remaining
working drive, ada1.To replace the failed drive, shut down the system and
physically replace the failed drive with a new drive of equal
or greater capacity. Manufacturers use somewhat arbitrary
values when rating drives in gigabytes, and the only way to
really be sure is to compare the total count of sectors shown
by diskinfo -v. A drive with larger
capacity than the mirror will work, although the extra space
on the new drive will not be used.After the computer is powered back up, the mirror will be
running in a degraded mode with only one drive.
The mirror is told to forget drives that are not currently
connected:

&prompt.root; gmirror forget gm0

Any old metadata should be cleared from the replacement
disk using the instructions in
. Then the replacement
disk, ada4 for this example, is inserted
into the mirror:&prompt.root; gmirror insert gm0 /dev/ada4Resynchronization begins when the new drive is inserted
into the mirror. This process of copying mirror data to a new
drive can take a while. Performance of the mirror will be
greatly reduced during the copy, so inserting new drives is
best done when there is low demand on the computer.Progress can be monitored with gmirror
status, which shows drives that are being
synchronized and the percentage of completion. During
resynchronization, the status will be
DEGRADED, changing to
COMPLETE when the process is
finished.

RAID3 - Byte-level Striping with Dedicated Parity

Written by Mark Gladman and Daniel Gerzo. Based on documentation by Tom Rhodes and Murray Stokely.

RAID3 is a method used to combine several
disk drives into a single volume with a dedicated parity disk.
In a RAID3 system, data is split up into a
number of bytes that are written across all the drives in the
array except for one disk which acts as a dedicated parity disk.
This means that disk reads from a RAID3
implementation access all disks in the array. Performance can
be enhanced by using multiple disk controllers. The
RAID3 array provides a fault tolerance of 1
drive, while providing a capacity of 1 - 1/n times the total
capacity of all drives in the array, where n is the number of
hard drives in the array. Such a configuration is mostly
suitable for storing data of larger sizes such as multimedia
files.At least 3 physical hard drives are required to build a
RAID3 array. Each disk must be of the same
size, since I/O requests are interleaved to
read or write to multiple disks in parallel. Also, due to the
nature of RAID3, the number of drives must be
equal to 3, 5, 9, 17, and so on, or 2^n + 1.This section demonstrates how to create a software
RAID3 on a &os; system.While it is theoretically possible to boot from a
RAID3 array on &os;, that configuration is
uncommon and is not advised.

Creating a Dedicated RAID3 Array

In &os;, support for RAID3 is
implemented by the &man.graid3.8; GEOM
class. Creating a dedicated RAID3 array on
&os; requires the following steps.First, load the geom_raid3.ko
kernel module by issuing one of the following
commands:&prompt.root; graid3 loador:&prompt.root; kldload geom_raid3Ensure that a suitable mount point exists. This
command creates a new directory to use as the mount
point:&prompt.root; mkdir /multimediaDetermine the device names for the disks which will be
added to the array, and create the new
RAID3 device. The final device listed
will act as the dedicated parity disk. This example uses
three unpartitioned ATA drives:
ada1 and
ada2 for
data, and
ada3 for
parity.&prompt.root; graid3 label -v gr0 /dev/ada1 /dev/ada2 /dev/ada3
Metadata value stored on /dev/ada1.
Metadata value stored on /dev/ada2.
Metadata value stored on /dev/ada3.
Done.

Partition the newly created gr0
device and put a UFS file system on
it:&prompt.root; gpart create -s GPT /dev/raid3/gr0
&prompt.root; gpart add -t freebsd-ufs /dev/raid3/gr0
&prompt.root; newfs -j /dev/raid3/gr0p1Many numbers will glide across the screen, and after a
bit of time, the process will be complete. The volume has
been created and is ready to be mounted:

&prompt.root; mount /dev/raid3/gr0p1 /multimedia/

The RAID3 array is now ready to
use.Additional configuration is needed to retain this setup
across system reboots.The geom_raid3.ko module must be
loaded before the array can be mounted. To automatically
load the kernel module during system initialization, add
the following line to
/boot/loader.conf:geom_raid3_load="YES"The following volume information must be added to
/etc/fstab in order to
automatically mount the array's file system during the
system boot process:

/dev/raid3/gr0p1 /multimedia ufs rw 2 2

Software RAID Devices

Originally contributed by Warren Block.

Some motherboards and expansion cards add some simple
hardware, usually just a ROM, that allows the
computer to boot from a RAID array. After
booting, access to the RAID array is handled
by software running on the computer's main processor. This
hardware-assisted software
RAID gives RAID
arrays that are not dependent on any particular operating
system, and which are functional even before an operating system
is loaded.Several levels of RAID are supported,
depending on the hardware in use. See &man.graid.8; for a
complete list.&man.graid.8; requires the geom_raid.ko
kernel module, which is included in the
GENERIC kernel starting with &os; 9.1.
If needed, it can be loaded manually with
graid load.

Creating an Array

Software RAID devices often have a menu
that can be entered by pressing special keys when the computer
is booting. The menu can be used to create and delete
RAID arrays. &man.graid.8; can also create
arrays directly from the command line.graid label is used to create a new
array. The motherboard used for this example has an Intel
software RAID chipset, so the Intel
metadata format is specified. The new array is given a label
of gm0, it is a mirror
(RAID1), and uses drives
ada0 and
ada1.Some space on the drives will be overwritten when they
are made into a new array. Back up existing data
first!&prompt.root; graid label Intel gm0 RAID1 ada0 ada1
GEOM_RAID: Intel-a29ea104: Array Intel-a29ea104 created.
GEOM_RAID: Intel-a29ea104: Disk ada0 state changed from NONE to ACTIVE.
GEOM_RAID: Intel-a29ea104: Subdisk gm0:0-ada0 state changed from NONE to ACTIVE.
GEOM_RAID: Intel-a29ea104: Disk ada1 state changed from NONE to ACTIVE.
GEOM_RAID: Intel-a29ea104: Subdisk gm0:1-ada1 state changed from NONE to ACTIVE.
GEOM_RAID: Intel-a29ea104: Array started.
GEOM_RAID: Intel-a29ea104: Volume gm0 state changed from STARTING to OPTIMAL.
Intel-a29ea104 created
GEOM_RAID: Intel-a29ea104: Provider raid/r0 for volume gm0 created.A status check shows the new mirror is ready for
use:&prompt.root; graid status
Name Status Components
raid/r0 OPTIMAL ada0 (ACTIVE (ACTIVE))
ada1 (ACTIVE (ACTIVE))The array device appears in
/dev/raid/. The first array is called
r0. Additional arrays, if present, will
be r1, r2, and so
on.The BIOS menu on some of these devices
can create arrays with special characters in their names. To
avoid problems with those special characters, arrays are given
simple numbered names like r0. To show
the actual labels, like gm0 in the
example above, use &man.sysctl.8;:

&prompt.root; sysctl kern.geom.raid.name_format=1

Multiple Volumes

Some software RAID devices support
more than one volume on an array.
Volumes work like partitions, allowing space on the physical
drives to be split and used in different ways. For example,
Intel software RAID devices support two
volumes. This example creates a 40 G mirror for safely
storing the operating system, followed by a 20 G
RAID0 (stripe) volume for fast temporary
storage:&prompt.root; graid label -S 40G Intel gm0 RAID1 ada0 ada1
&prompt.root; graid add -S 20G gm0 RAID0Volumes appear as additional
rX entries
in /dev/raid/. An array with two volumes
will show r0 and
r1.See &man.graid.8; for the number of volumes supported by
different software RAID devices.

Converting a Single Drive to a Mirror

Under certain specific conditions, it is possible to
convert an existing single drive to a &man.graid.8; array
without reformatting. To avoid data loss during the
conversion, the existing drive must meet these minimum
requirements:The drive must be partitioned with the
MBR partitioning scheme.
GPT or other partitioning schemes with
metadata at the end of the drive will be overwritten and
corrupted by the &man.graid.8; metadata.There must be enough unpartitioned and unused space at
the end of the drive to hold the &man.graid.8; metadata.
This metadata varies in size, but the largest occupies
64 M, so at least that much free space is
recommended.If the drive meets these requirements, start by making a
full backup. Then create a single-drive mirror with that
drive:&prompt.root; graid label Intel gm0 RAID1 ada0 NONE&man.graid.8; metadata was written to the end of the drive
in the unused space. A second drive can now be inserted into
the mirror:&prompt.root; graid insert raid/r0 ada1Data from the original drive will immediately begin to be
copied to the second drive. The mirror will operate in
degraded status until the copy is complete.

Inserting New Drives into the Array

Drives can be inserted into an array as replacements for
drives that have failed or are missing. If there are no
failed or missing drives, the new drive becomes a spare. For
example, inserting a new drive into a working two-drive mirror
results in a two-drive mirror with one spare drive, not a
three-drive mirror.In the example mirror array, data immediately begins to be
copied to the newly-inserted drive. Any existing information
on the new drive will be overwritten.&prompt.root; graid insert raid/r0 ada1
GEOM_RAID: Intel-a29ea104: Disk ada1 state changed from NONE to ACTIVE.
GEOM_RAID: Intel-a29ea104: Subdisk gm0:1-ada1 state changed from NONE to NEW.
GEOM_RAID: Intel-a29ea104: Subdisk gm0:1-ada1 state changed from NEW to REBUILD.
GEOM_RAID: Intel-a29ea104: Subdisk gm0:1-ada1 rebuild start at 0.

Removing Drives from the Array

Individual drives can be permanently removed
from an array and their metadata erased:

&prompt.root; graid remove raid/r0 ada1
GEOM_RAID: Intel-a29ea104: Disk ada1 state changed from ACTIVE to OFFLINE.
GEOM_RAID: Intel-a29ea104: Subdisk gm0:1-[unknown] state changed from ACTIVE to NONE.
GEOM_RAID: Intel-a29ea104: Volume gm0 state changed from OPTIMAL to DEGRADED.

Stopping the Array

An array can be stopped without removing metadata from the
drives. The array will be restarted when the system is
booted.

&prompt.root; graid stop raid/r0

Checking Array Status

Array status can be checked at any time. After a drive
was added to the mirror in the example above, data is being
copied from the original drive to the new drive:&prompt.root; graid status
Name Status Components
raid/r0 DEGRADED ada0 (ACTIVE (ACTIVE))
ada1 (ACTIVE (REBUILD 28%))Some types of arrays, like RAID0 or
CONCAT, may not be shown in the status
report if disks have failed. To see these partially-failed
arrays, add -ga:

&prompt.root; graid status -ga
Name Status Components
Intel-e2d07d9a BROKEN ada6 (ACTIVE (ACTIVE))

Deleting Arrays

Arrays are destroyed by deleting all of the volumes from
them. When the last volume present is deleted, the array is
stopped and metadata is removed from the drives:

&prompt.root; graid delete raid/r0

Deleting Unexpected Arrays

Drives may unexpectedly contain &man.graid.8; metadata,
either from previous use or manufacturer testing.
&man.graid.8; will detect these drives and create an array,
interfering with access to the individual drive. To remove
the unwanted metadata:Boot the system. At the boot menu, select
2 for the loader prompt. Enter:OK set kern.geom.raid.enable=0
OK bootThe system will boot with &man.graid.8;
disabled.Back up all data on the affected drive.As a workaround, &man.graid.8; array detection
can be disabled by addingkern.geom.raid.enable=0to /boot/loader.conf.To permanently remove the &man.graid.8; metadata
from the affected drive, boot a &os; installation
CD-ROM or memory stick, and select
Shell. Use status
to find the name of the array, typically
raid/r0:&prompt.root; graid status
Name Status Components
raid/r0 OPTIMAL ada0 (ACTIVE (ACTIVE))
ada1 (ACTIVE (ACTIVE))Delete the volume by name:&prompt.root; graid delete raid/r0If there is more than one volume shown, repeat the
process for each volume. After the last volume has been
deleted, the array will be destroyed.

Reboot and verify data, restoring from backup if
necessary. After the metadata has been removed, the
kern.geom.raid.enable=0 entry in
/boot/loader.conf can also be
removed.

GEOM Gate Network

GEOM provides a simple mechanism for
providing remote access to devices such as disks,
CDs, and file systems through the use of the
GEOM Gate network daemon,
ggated. The system with the device
runs the server daemon which handles requests made by clients
using ggatec. The devices should not
contain any sensitive data as the connection between the client
and the server is not encrypted.Similar to NFS, which is discussed in
, ggated
is configured using an exports file. This file specifies which
systems are permitted to access the exported resources and what
level of access they are offered. For example, to give the
client 192.168.1.5
read and write access to the fourth slice on the first
SCSI disk, create
/etc/gg.exports with this line:192.168.1.5 RW /dev/da0s4dBefore exporting the device, ensure it is not currently
mounted. Then, start ggated:&prompt.root; ggatedSeveral options are available for specifying an alternate
listening port or changing the default location of the exports
file. Refer to &man.ggated.8; for details.To access the exported device on the client machine, first
use ggatec to specify the
IP address of the server and the device name
of the exported device. If successful, this command will
display a ggate device name to mount. Mount
that specified device name on a free mount point. This example
connects to the /dev/da0s4d partition on
192.168.1.1, then mounts
/dev/ggate0 on
/mnt:&prompt.root; ggatec create -o rw 192.168.1.1 /dev/da0s4d
ggate0
&prompt.root; mount /dev/ggate0 /mntThe device on the server may now be accessed through
/mnt on the client. For more details about
ggatec and a few usage examples, refer to
&man.ggatec.8;.The mount will fail if the device is currently mounted on
either the server or any other client on the network. If
simultaneous access is needed to network resources, use
NFS instead.When the device is no longer needed, unmount it with
umount so that the resource is available to
other clients.

Labeling Disk Devices

During system initialization, the &os; kernel creates
device nodes as devices are found. This method of probing for
devices raises some issues. For instance, what if a new disk
device is added via USB? It is likely that
a flash device may be handed the device name of
da0 and the original
da0 shifted to
da1. This will cause issues mounting
file systems if they are listed in
/etc/fstab which may also prevent the
system from booting.One solution is to chain SCSI devices
in order so a new device added to the SCSI
card will be issued unused device numbers. But what about
USB devices which may replace the primary
SCSI disk? This happens because
USB devices are usually probed before the
SCSI card. One solution is to only insert
these devices after the system has been booted. Another method
is to use only a single ATA drive and never
list the SCSI devices in
/etc/fstab.A better solution is to use glabel to
label the disk devices and use the labels in
- /etc/fstab. Because
- glabel stores the label in the last sector of
- a given provider, the label will remain persistent across
- reboots. By using this label as a device, the file system may
- always be mounted regardless of what device node it is accessed
- through.
+ /etc/fstab.
+ Since glabel stores the label in the last
+ sector of a given provider, the label will remain persistent
+ across reboots. By using this label as a device, the
+ file-system may always be mounted regardless of what
+ device node it is accessed through.
glabel can create both transient and
permanent labels. Only permanent labels are consistent across
reboots. Refer to &man.glabel.8; for more information on the
differences between labels.

Label Types and Examples

Permanent labels can be a generic or a file system label.
Permanent file system labels can be created with
&man.tunefs.8; or &man.newfs.8;. These types of labels are
created in a sub-directory of /dev, and
will be named according to the file system type. For example,
UFS2 file system labels will be created in
/dev/ufs. Generic permanent labels can
be created with glabel label. These are
not file system specific and will be created in
/dev/label.Temporary labels are destroyed at the next reboot. These
labels are created in /dev/label and are
suited to experimentation. A temporary label can be created
using glabel create.To create a permanent label for a
UFS2 file system without destroying any
data, issue the following command:

&prompt.root; tunefs -L home /dev/da3

A label should now exist in /dev/ufs
which may be added to /etc/fstab:/dev/ufs/home /home ufs rw 2 2The file system must not be mounted while attempting
to run tunefs.Now the file system may be mounted:&prompt.root; mount /homeFrom this point on, so long as the
geom_label.ko kernel module is loaded at
boot with /boot/loader.conf or the
GEOM_LABEL kernel option is present,
the device node may change without any ill effect on the
system.File systems may also be created with a default label
by using the -L flag with
newfs. Refer to &man.newfs.8; for
more information.The following command can be used to destroy the
label:&prompt.root; glabel destroy homeThe following example shows how to label the partitions of
a boot disk.

Labeling Partitions on the Boot Disk

By permanently labeling the partitions on the boot disk,
the system should be able to continue to boot normally, even
if the disk is moved to another controller or transferred to
a different system. For this example, it is assumed that a
single ATA disk is used, which is
currently recognized by the system as
ad0. It is also assumed that the
standard &os; partition scheme is used, with
/,
/var,
/usr and
/tmp, as
well as a swap partition.Reboot the system, and at the &man.loader.8; prompt,
press 4 to boot into single user mode.
Then enter the following commands:&prompt.root; glabel label rootfs /dev/ad0s1a
GEOM_LABEL: Label for provider /dev/ad0s1a is label/rootfs
&prompt.root; glabel label var /dev/ad0s1d
GEOM_LABEL: Label for provider /dev/ad0s1d is label/var
&prompt.root; glabel label usr /dev/ad0s1f
GEOM_LABEL: Label for provider /dev/ad0s1f is label/usr
&prompt.root; glabel label tmp /dev/ad0s1e
GEOM_LABEL: Label for provider /dev/ad0s1e is label/tmp
&prompt.root; glabel label swap /dev/ad0s1b
GEOM_LABEL: Label for provider /dev/ad0s1b is label/swap
&prompt.root; exitThe system will continue with multi-user boot. After
the boot completes, edit /etc/fstab and
replace the conventional device names, with their respective
labels. The final /etc/fstab will
look like this:# Device Mountpoint FStype Options Dump Pass#
/dev/label/swap none swap sw 0 0
/dev/label/rootfs / ufs rw 1 1
/dev/label/tmp /tmp ufs rw 2 2
/dev/label/usr /usr ufs rw 2 2
/dev/label/var /var ufs rw 2 2The system can now be rebooted. If everything went
well, it will come up normally and mount
will show:&prompt.root; mount
/dev/label/rootfs on / (ufs, local)
devfs on /dev (devfs, local)
/dev/label/tmp on /tmp (ufs, local, soft-updates)
/dev/label/usr on /usr (ufs, local, soft-updates)
/dev/label/var on /var (ufs, local, soft-updates)The &man.glabel.8; class
supports a label type for UFS file
systems, based on the unique file system id,
ufsid. These labels may be found in
/dev/ufsid and are
created automatically during system startup. It is possible
to use ufsid labels to mount partitions
using /etc/fstab. Use glabel
status to receive a list of file systems and their
corresponding ufsid labels:&prompt.user; glabel status
Name Status Components
ufsid/486b6fc38d330916 N/A ad4s1d
ufsid/486b6fc16926168e N/A ad4s1fIn the above example, ad4s1d
represents /var,
while ad4s1f represents
/usr.
Using the ufsid values shown, these
partitions may now be mounted with the following entries in
/etc/fstab:/dev/ufsid/486b6fc38d330916 /var ufs rw 2 2
/dev/ufsid/486b6fc16926168e /usr ufs rw 2 2Any partitions with ufsid labels can be
mounted in this way, eliminating the need to manually create
permanent labels, while still enjoying the benefits of device
name independent mounting.

UFS Journaling Through GEOM

Support for journals on
UFS file systems is available on &os;. The
implementation is provided through the GEOM
subsystem and is configured using gjournal.
Unlike other file system journaling implementations, the
gjournal method is block based and not
implemented as part of the file system. It is a
GEOM extension.Journaling stores a log of file system transactions, such as
changes that make up a complete disk write operation, before
meta-data and file writes are committed to the disk. This
transaction log can later be replayed to redo file system
transactions, preventing file system inconsistencies.This method provides another mechanism to protect against
data loss and inconsistencies of the file system. Unlike Soft
Updates, which tracks and enforces meta-data updates, and
snapshots, which create an image of the file system, a log is
stored in disk space specifically for this task. For better
performance, the journal may be stored on another disk. In this
configuration, the journal provider or storage device should be
listed after the device to enable journaling on.The GENERIC kernel provides support for
gjournal. To automatically load the
geom_journal.ko kernel module at boot time,
add the following line to
/boot/loader.conf:geom_journal_load="YES"If a custom kernel is used, ensure the following line is in
the kernel configuration file:options GEOM_JOURNALOnce the module is loaded, a journal can be created on a new
file system using the following steps. In this example,
da4 is a new SCSI
disk:&prompt.root; gjournal load
&prompt.root; gjournal label /dev/da4This will load the module and create a
/dev/da4.journal device node on
/dev/da4.A UFS file system may now be created on
the journaled device, then mounted on an existing mount
point:&prompt.root; newfs -O 2 -J /dev/da4.journal
&prompt.root; mount /dev/da4.journal /mntIn the case of several slices, a journal will be created
for each individual slice. For instance, if
ad4s1 and ad4s2 are
both slices, then gjournal will create
ad4s1.journal and
ad4s2.journal.Journaling may also be enabled on current file systems by
using tunefs. However,
always make a backup before attempting to
alter an existing file system. In most cases,
gjournal will fail if it is unable to create
the journal, but this does not protect against data loss
incurred as a result of misusing tunefs.
Refer to &man.gjournal.8; and &man.tunefs.8; for more
information about these commands.It is possible to journal the boot disk of a &os; system.
Refer to the article Implementing UFS
Journaling on a Desktop PC for detailed
instructions.
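The per-slice naming rule described above, where gjournal appends .journal to each provider it labels, can be sketched with plain string handling. The slice names are the ad4s1 and ad4s2 examples from the text; nothing here requires &os; or the gjournal tool itself:

```shell
# gjournal labels each slice separately, producing one ".journal"
# device node per provider. Plain POSIX shell; the slice names are
# the illustrative examples from the text.
for slice in ad4s1 ad4s2; do
  printf '/dev/%s.journal\n' "$slice"
done
# prints:
# /dev/ad4s1.journal
# /dev/ad4s2.journal
```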
diff --git a/en_US.ISO8859-1/books/handbook/multimedia/chapter.xml b/en_US.ISO8859-1/books/handbook/multimedia/chapter.xml
index 53dc3c06b2..e1ae696ce9 100644
--- a/en_US.ISO8859-1/books/handbook/multimedia/chapter.xml
+++ b/en_US.ISO8859-1/books/handbook/multimedia/chapter.xml
@@ -1,1696 +1,1696 @@
Multimedia

Edited by Ross Lippert.

Synopsis

&os; supports a wide variety of sound cards, allowing users
to enjoy high fidelity output from a &os; system. This includes
the ability to record and play back audio in the MPEG Audio Layer
3 (MP3), Waveform Audio File
(WAV), Ogg Vorbis, and other formats. The
&os; Ports Collection contains many applications for editing
recorded audio, adding sound effects, and controlling attached
MIDI devices.&os; also supports the playback of video files and
DVDs. The &os; Ports Collection contains
applications to encode, convert, and playback various video
media.This chapter describes how to configure sound cards, video
playback, TV tuner cards, and scanners on &os;. It also
describes some of the applications which are available for
using these devices.After reading this chapter, you will know how to:Configure a sound card on &os;.Troubleshoot the sound setup.Playback and encode MP3s and other audio.Prepare a &os; system for video playback.Play DVDs, .mpg,
and .avi files.Rip CD and DVD
content into files.Configure a TV card.Install and setup MythTV on &os;Configure an image scanner.Configure a Bluetooth headset.Before reading this chapter, you should:Know how to install applications as described in
.

Setting Up the Sound Card

Contributed by Moses Moore. Enhanced by Marc Fonvieille.

Before beginning the configuration, determine the model of
the sound card and the chip it uses. &os; supports a wide
variety of sound cards. Check the supported audio devices
list of the Hardware
Notes to see if the card is supported and which &os;
driver it uses.kernelconfigurationIn order to use the sound device, its device driver must be
loaded. The easiest way is to load a kernel module for the
sound card with &man.kldload.8;. This example loads the driver
for a built-in audio chipset based on the Intel
specification:&prompt.root; kldload snd_hdaTo automate the loading of this driver at boot time, add the
driver to /boot/loader.conf. The line for
this driver is:snd_hda_load="YES"Other available sound modules are listed in
/boot/defaults/loader.conf. When unsure
which driver to use, load the snd_driver
module:&prompt.root; kldload snd_driverThis is a metadriver which loads all of the most common
sound drivers and can be used to speed up the search for the
correct driver. It is also possible to load all sound drivers
by adding the metadriver to
/boot/loader.conf.To determine which driver was selected for the sound card
after loading the snd_driver metadriver,
type cat /dev/sndstat.

Configuring a Custom Kernel with Sound Support

This section is for users who prefer to statically compile
in support for the sound card in a custom kernel. For more
information about recompiling a kernel, refer to .When using a custom kernel to provide sound support, make
sure that the audio framework driver exists in the custom
kernel configuration file:device soundNext, add support for the sound card. To continue the
example of the built-in audio chipset based on the Intel
specification from the previous section, use the following
line in the custom kernel configuration file:device snd_hdaBe sure to read the manual page of the driver for the
device name to use for the driver.Non-PnP ISA sound cards may require the IRQ and I/O port
settings of the card to be added to
/boot/device.hints. During the boot
process, &man.loader.8; reads this file and passes the
settings to the kernel. For example, an old Creative
&soundblaster; 16 ISA non-PnP card will use the
&man.snd.sbc.4; driver in conjunction with
snd_sb16. For this card, the following
lines must be added to the kernel configuration file:device snd_sbc
device snd_sb16If the card uses the 0x220 I/O port and
IRQ 5, these lines must also be added to
/boot/device.hints:hint.sbc.0.at="isa"
hint.sbc.0.port="0x220"
hint.sbc.0.irq="5"
hint.sbc.0.drq="1"
hint.sbc.0.flags="0x15"The syntax used in /boot/device.hints
is described in &man.sound.4; and the manual page for the
driver of the sound card.The settings shown above are the defaults. In some
cases, the IRQ or other settings may need to be changed to
match the card. Refer to &man.snd.sbc.4; for more information
about this card.

Testing Sound

After loading the required module or rebooting into the
custom kernel, the sound card should be detected. To confirm,
run dmesg | grep pcm. This example is
from a system with a built-in Conexant CX20590 chipset:pcm0: <NVIDIA (0x001c) (HDMI/DP 8ch)> at nid 5 on hdaa0
pcm1: <NVIDIA (0x001c) (HDMI/DP 8ch)> at nid 6 on hdaa0
pcm2: <Conexant CX20590 (Analog 2.0+HP/2.0)> at nid 31,25 and 35,27 on hdaa1The status of the sound card may also be checked using
this command:&prompt.root; cat /dev/sndstat
FreeBSD Audio Driver (newpcm: 64bit 2009061500/amd64)
Installed devices:
pcm0: <NVIDIA (0x001c) (HDMI/DP 8ch)> (play)
pcm1: <NVIDIA (0x001c) (HDMI/DP 8ch)> (play)
pcm2: <Conexant CX20590 (Analog 2.0+HP/2.0)> (play/rec) defaultThe output will vary depending upon the sound card. If no
pcm devices are listed, double-check
that the correct device driver was loaded or compiled into the
kernel. The next section lists some common problems and their
solutions.If all goes well, the sound card should now work in &os;.
If the CD or DVD drive
is properly connected to the sound card, one can insert an
audio CD in the drive and play it with
&man.cdcontrol.1;:&prompt.user; cdcontrol -f /dev/acd0 play 1Audio CDs have specialized encodings
which means that they should not be mounted using
&man.mount.8;.Various applications, such as
audio/workman, provide a friendlier
interface. The audio/mpg123 port can be
installed to listen to MP3 audio files.Another quick way to test the card is to send data to
/dev/dsp:&prompt.user; cat filename > /dev/dspwhere
filename can
be any type of file. This command should produce some noise,
confirming that the sound card is working.The /dev/dsp* device nodes will
be created automatically as needed. When not in use, they
do not exist and will not appear in the output of
&man.ls.1;.

Setting up Bluetooth Sound Devices

Connecting to a Bluetooth device is out of scope for this
chapter. Refer to for more information.

To get a Bluetooth sound sink working with the &os; sound
system, install
audio/virtual_oss first:&prompt.root; pkg install virtual_ossaudio/virtual_oss requires
cuse to be loaded into the kernel:&prompt.root; kldload cuseTo load cuse during system startup, run
this command:&prompt.root; sysrc -f /boot/loader.conf cuse_load=yesTo use headphones as a sound sink with
audio/virtual_oss, users need to create a
virtual device after connecting to a Bluetooth audio
device:&prompt.root; virtual_oss -C 2 -c 2 -r 48000 -b 16 -s 768 -R /dev/null -P /dev/bluetooth/headphones -d dspheadphones in this example is
a hostname from /etc/bluetooth/hosts.
BT_ADDR could be used instead.

Refer to &man.virtual_oss.8; for more information.

Troubleshooting Sound

The following table lists some common error messages and their solutions:

Error: sb_dspwr(XX) timed out
Solution: The I/O port is not set correctly.

Error: bad irq XX
Solution: The IRQ is set incorrectly. Make sure that the set IRQ and the sound IRQ are the same.

Error: xxx: gus pcm not attached, out of memory
Solution: There is not enough available memory to use the device.

Error: xxx: can't open /dev/dsp!
Solution: Type fstat | grep dsp to check if another application is holding the device open. Noteworthy troublemakers are esound and KDE's sound support.
Modern graphics cards often come with their own sound
driver for use with HDMI. This sound
device is sometimes enumerated before the sound card meaning
that the sound card will not be used as the default playback
device. To check if this is the case, run
dmesg and look for
pcm. The output looks something like
this:...
hdac0: HDA Driver Revision: 20100226_0142
hdac1: HDA Driver Revision: 20100226_0142
hdac0: HDA Codec #0: NVidia (Unknown)
hdac0: HDA Codec #1: NVidia (Unknown)
hdac0: HDA Codec #2: NVidia (Unknown)
hdac0: HDA Codec #3: NVidia (Unknown)
pcm0: <HDA NVidia (Unknown) PCM #0 DisplayPort> at cad 0 nid 1 on hdac0
pcm1: <HDA NVidia (Unknown) PCM #0 DisplayPort> at cad 1 nid 1 on hdac0
pcm2: <HDA NVidia (Unknown) PCM #0 DisplayPort> at cad 2 nid 1 on hdac0
pcm3: <HDA NVidia (Unknown) PCM #0 DisplayPort> at cad 3 nid 1 on hdac0
hdac1: HDA Codec #2: Realtek ALC889
pcm4: <HDA Realtek ALC889 PCM #0 Analog> at cad 2 nid 1 on hdac1
pcm5: <HDA Realtek ALC889 PCM #1 Analog> at cad 2 nid 1 on hdac1
pcm6: <HDA Realtek ALC889 PCM #2 Digital> at cad 2 nid 1 on hdac1
pcm7: <HDA Realtek ALC889 PCM #3 Digital> at cad 2 nid 1 on hdac1
...In this example, the graphics card
(NVidia) has been enumerated before the
sound card (Realtek ALC889). To use the
sound card as the default playback device, change
hw.snd.default_unit to the unit that should
be used for playback:&prompt.root; sysctl hw.snd.default_unit=nwhere n is the number of the sound
device to use. In this example, it should be
4. Make this change permanent by adding
the following line to
/etc/sysctl.conf:hw.snd.default_unit=4Utilizing Multiple Sound SourcesMunishChopraContributed by It is often desirable to have multiple sources of sound
that are able to play simultaneously. &os; uses
Virtual Sound Channels to multiplex the sound
card's playback by mixing sound in the kernel.Three &man.sysctl.8; knobs are available for configuring
virtual channels:&prompt.root; sysctl dev.pcm.0.play.vchans=4
&prompt.root; sysctl dev.pcm.0.rec.vchans=4
&prompt.root; sysctl hw.snd.maxautovchans=4This example allocates four virtual channels, which is a
practical number for everyday use. Both
dev.pcm.0.play.vchans=4 and
dev.pcm.0.rec.vchans=4 are configurable
after a device has been attached and represent the number of
virtual channels pcm0 has for playback
and recording. Since the pcm module can
be loaded independently of the hardware drivers,
hw.snd.maxautovchans indicates how many
virtual channels will be given to an audio device when it is
attached. Refer to &man.pcm.4; for more information.The number of virtual channels for a device cannot be
changed while it is in use. First, close any programs using
the device, such as music players or sound daemons.The correct pcm device will
be allocated transparently to a program that
requests /dev/dsp0.Setting Default Values for Mixer ChannelsJosefEl-RayesContributed by The default values for the different mixer channels are
hardcoded in the source code of the &man.pcm.4; driver. While
sound card mixer levels can be changed using &man.mixer.8; or
third-party applications and daemons, this is not a permanent
solution. To instead set default mixer values at the driver
level, define the appropriate values in
/boot/device.hints, as seen in this
example:hint.pcm.0.vol="50"This will set the volume channel to a default value of
50 when the &man.pcm.4; module is
loaded.MP3 AudioChernLeeContributed by This section describes some MP3
players available for &os;, how to rip audio
CD tracks, and how to encode and decode
MP3s.MP3 PlayersA popular graphical MP3 player is
Audacious. It supports
Winamp skins and additional
plugins. The interface is intuitive, with a playlist, graphic
equalizer, and more. Those familiar with
Winamp will find
Audacious simple to use. On &os;,
Audacious can be installed from the
multimedia/audacious port or package.
Audacious is a descendant of XMMS.The audio/mpg123 package or port
provides an alternative, command-line MP3
player. Once installed, specify the MP3
file to play on the command line. If the system has multiple
audio devices, the sound device can also be specified:&prompt.root; mpg123 -a /dev/dsp1.0 Foobar-GreatestHits.mp3
High Performance MPEG 1.0/2.0/2.5 Audio Player for Layers 1, 2 and 3
version 1.18.1; written and copyright by Michael Hipp and others
free software (LGPL) without any warranty but with best wishes
Playing MPEG stream from Foobar-GreatestHits.mp3 ...
MPEG 1.0 layer III, 128 kbit/s, 44100 Hz joint-stereoAdditional MP3 players are available in
the &os; Ports Collection.Ripping CD Audio TracksBefore encoding a CD or
CD track to MP3, the
audio data on the CD must be ripped to the
hard drive. This is done by copying the raw
CD Digital Audio (CDDA)
data to WAV files.The cdda2wav tool, which is installed
with the sysutils/cdrtools suite, can be
used to rip audio information from
CDs.With the audio CD in the drive, the
following command can be issued as
root to rip an
entire CD into individual, per track,
WAV files:&prompt.root; cdda2wav -D 0,1,0 -BIn this example, the
-D 0,1,0 indicates
the SCSI device 0,1,0
containing the CD to rip. Use
cdrecord -scanbus to determine the correct
device parameters for the system.To rip individual tracks, use -t to
specify the track:&prompt.root; cdda2wav -D 0,1,0 -t 7To rip a range of tracks, such as track one to seven,
specify a range:&prompt.root; cdda2wav -D 0,1,0 -t 1+7To rip from an ATAPI
(IDE) CDROM drive,
specify the device name in place of the
SCSI unit numbers. For example, to rip
track 7 from an IDE drive:&prompt.root; cdda2wav -D /dev/acd0 -t 7Alternatively, dd can be used to extract
audio tracks on ATAPI drives, as described
in .Encoding and Decoding MP3sLame is a popular
MP3 encoder which can be installed from the
audio/lame port. Due to patent issues, a
package is not available.The following command will convert the ripped
WAV file
audio01.wav to
audio01.mp3:&prompt.root; lame -h -b 128 --tt "Foo Song Title" --ta "FooBar Artist" --tl "FooBar Album" \
--ty "2014" --tc "Ripped and encoded by Foo" --tg "Genre" audio01.wav audio01.mp3The specified 128 kbit/s is a standard
MP3 bitrate, while 160 and 192 kbit/s
provide higher quality. The higher the bitrate, the larger
the size of the resulting MP3. The
-h turns on the
higher quality but a little slower
mode. The options beginning with
--t indicate ID3 tags, which usually contain
song information, to be embedded within the
MP3 file. Additional encoding options can
be found in the lame manual
page.In order to burn an audio CD from
MP3s, they must first be converted to a
non-compressed file format. XMMS
can be used to convert to the WAV format,
while mpg123 can be used to convert
to the raw Pulse-Code Modulation (PCM)
audio data format.To convert audio01.mp3 using
mpg123, specify the name of the
PCM file:&prompt.user; mpg123 -s audio01.mp3 > audio01.pcmTo use XMMS to convert an
MP3 to WAV format, use
these steps:Converting to WAV Format in
XMMSLaunch XMMS.Right-click the window to bring up the
XMMS menu.Select Preferences under
Options.Change the Output Plugin to Disk Writer
Plugin.Press Configure.Enter or browse to a directory to write the
uncompressed files to.Load the MP3 file into
XMMS as usual, with volume at
100% and EQ settings turned off.Press Play.
XMMS will appear as if it is
playing the MP3, but no music will be
heard. It is actually playing the MP3
to a file.When finished, be sure to set the default Output
Plugin back to what it was before in order to listen to
MP3s again.Both the WAV and PCM
formats can be used with cdrecord.
When using WAV files, there will be a small
tick sound at the beginning of each track. This sound is the
header of the WAV file. The
audio/sox port or package can be used to
remove the header:&prompt.user; sox -t wav -r 44100 -s -w -c 2 track.wav track.rawRefer to for more
information on using a CD burner in
&os;.Video PlaybackRossLippertContributed by Before configuring video playback, determine the model and
chipset of the video card. While
&xorg; supports a wide variety of
video cards, not all provide good playback performance. To
obtain a list of extensions supported by the
&xorg; server using the card, run
xdpyinfo while
&xorg; is running.It is a good idea to have a short MPEG test file for
evaluating various players and options. Since some
DVD applications look for
DVD media in /dev/dvd by
default, or have this device name hardcoded in them, it might be
useful to make a symbolic link to the proper device:&prompt.root; ln -sf /dev/cd0 /dev/dvdDue to the nature of &man.devfs.5;, manually created links
will not persist after a system reboot. In order to recreate
the symbolic link automatically when the system boots, add the
following line to /etc/devfs.conf:link cd0 dvdDVD decryption invokes certain functions
that require write permission to the DVD
device.To enhance the shared memory
&xorg; interface, it is recommended
to increase the values of these &man.sysctl.8;
variables:kern.ipc.shmmax=67108864
kern.ipc.shmall=32768Determining Video CapabilitiesXVideoSDLDGAThere are several possible ways to display video under
&xorg; and what works is largely
hardware dependent. Each method described below will have
varying quality across different hardware.Common video interfaces include:&xorg;: normal output using
shared memory.XVideo: an extension to the
&xorg; interface which
allows video to be directly displayed in drawable objects
through a special acceleration. This extension provides
good quality playback even on low-end machines. The next
section describes how to determine if this extension is
running.SDL: the Simple DirectMedia Layer
is a porting layer for many operating systems, allowing
cross-platform applications to be developed which make
efficient use of sound and graphics.
SDL provides a low-level abstraction to
the hardware which can sometimes be more efficient than
the &xorg; interface. On &os;,
SDL can be installed using the
devel/sdl20 package or port.DGA: the Direct Graphics Access is
an &xorg; extension which
allows a program to bypass the
&xorg; server and directly
- alter the framebuffer. Because it relies on a low level
+ alter the framebuffer. As it relies on a low-level
memory mapping, programs using it must be run as
root. The
DGA extension can be tested and
benchmarked using &man.dga.1;. When
dga is running, it changes the colors
of the display whenever a key is pressed. To quit, press
q.SVGAlib: a low-level console graphics layer.XVideoTo check whether this extension is running, use
xvinfo:&prompt.user; xvinfoXVideo is supported for the card if the result is
similar to:X-Video Extension version 2.2
screen #0
Adaptor #0: "Savage Streams Engine"
number of ports: 1
port base: 43
operations supported: PutImage
supported visuals:
depth 16, visualID 0x22
depth 16, visualID 0x23
number of attributes: 5
"XV_COLORKEY" (range 0 to 16777215)
client settable attribute
client gettable attribute (current value is 2110)
"XV_BRIGHTNESS" (range -128 to 127)
client settable attribute
client gettable attribute (current value is 0)
"XV_CONTRAST" (range 0 to 255)
client settable attribute
client gettable attribute (current value is 128)
"XV_SATURATION" (range 0 to 255)
client settable attribute
client gettable attribute (current value is 128)
"XV_HUE" (range -180 to 180)
client settable attribute
client gettable attribute (current value is 0)
maximum XvImage size: 1024 x 1024
Number of image formats: 7
id: 0x32595559 (YUY2)
guid: 59555932-0000-0010-8000-00aa00389b71
bits per pixel: 16
number of planes: 1
type: YUV (packed)
id: 0x32315659 (YV12)
guid: 59563132-0000-0010-8000-00aa00389b71
bits per pixel: 12
number of planes: 3
type: YUV (planar)
id: 0x30323449 (I420)
guid: 49343230-0000-0010-8000-00aa00389b71
bits per pixel: 12
number of planes: 3
type: YUV (planar)
id: 0x36315652 (RV16)
guid: 52563135-0000-0000-0000-000000000000
bits per pixel: 16
number of planes: 1
type: RGB (packed)
depth: 0
red, green, blue masks: 0x1f, 0x3e0, 0x7c00
id: 0x35315652 (RV15)
guid: 52563136-0000-0000-0000-000000000000
bits per pixel: 16
number of planes: 1
type: RGB (packed)
depth: 0
red, green, blue masks: 0x1f, 0x7e0, 0xf800
id: 0x31313259 (Y211)
guid: 59323131-0000-0010-8000-00aa00389b71
bits per pixel: 6
number of planes: 3
type: YUV (packed)
id: 0x0
guid: 00000000-0000-0000-0000-000000000000
bits per pixel: 0
number of planes: 0
type: RGB (packed)
depth: 1
red, green, blue masks: 0x0, 0x0, 0x0The formats listed, such as YUY2 and YV12, are not
present with every implementation of XVideo and their
absence may hinder some players.If the result instead looks like:X-Video Extension version 2.2
screen #0
no adaptors presentXVideo is probably not supported for the card. This
means that it will be more difficult for the display to meet
the computational demands of rendering video, depending on
the video card and processor.Ports and Packages Dealing with Videovideo portsvideo packagesThis section introduces some of the software available
from the &os; Ports Collection which can be used for video
playback.MPlayer and
MEncoderMPlayer is a command-line
video player with an optional graphical interface which aims
to provide speed and flexibility. Other graphical
front-ends to MPlayer are
available from the &os; Ports Collection.MPlayerMPlayer can be installed
using the multimedia/mplayer package or
port. Several compile options are available and a variety
of hardware checks occur during the build process. For
these reasons, some users prefer to build the port rather
than install the package.When compiling the port, the menu options should be
reviewed to determine the type of support to compile into
the port. If an option is not selected,
MPlayer will not be able to
display that type of video format. Use the arrow keys and
spacebar to select the required formats. When finished,
press Enter to continue the port compile
and installation.By default, the package or port will build the
mplayer command line utility and the
gmplayer graphical utility. To encode
videos, compile the multimedia/mencoder
port. Due to licensing restrictions, a package is not
available for MEncoder.The first time MPlayer is
run, it will create ~/.mplayer in the
user's home directory. This subdirectory contains default
versions of the user-specific configuration files.This section describes only a few common uses. Refer to
mplayer(1) for a complete description of its numerous
options.To play the file
testfile.avi,
specify the video interface with -vo, as
seen in the following examples:&prompt.user; mplayer -vo xv testfile.avi&prompt.user; mplayer -vo sdl testfile.avi&prompt.user; mplayer -vo x11 testfile.avi&prompt.root; mplayer -vo dga testfile.avi&prompt.root; mplayer -vo 'sdl:dga' testfile.aviIt is worth trying all of these options, as their
relative performance depends on many factors and will vary
significantly with hardware.To play a DVD, replace
testfile.avi
with dvd://N -dvd-device DEVICE, where
N is the title number to play and
DEVICE is the device node for the
DVD. For example, to play title 3 from
/dev/dvd:&prompt.root; mplayer -vo xv dvd://3 -dvd-device /dev/dvdThe default DVD device can be
defined during the build of the
MPlayer port by including the
WITH_DVD_DEVICE=/path/to/desired/device
option. By default, the device is
/dev/cd0. More details can be found
in the port's
Makefile.options.To stop, pause, advance, and so on, use a keybinding.
To see the list of keybindings, run mplayer
-h or read mplayer(1).Additional playback options include -fs, which engages fullscreen mode, and
-zoom, which helps performance.Each user can add commonly used options to their
~/.mplayer/config like so:vo=xv
fs=yes
zoom=yesmplayer can be used to rip a
DVD title to a .vob file.
To dump the second title from a
DVD:&prompt.root; mplayer -dumpstream -dumpfile out.vob dvd://2 -dvd-device /dev/dvdThe output file, out.vob, will be
in MPEG format.Anyone wishing to obtain a high level of expertise with
&unix; video should consult mplayerhq.hu/DOCS
as it is technically informative. This documentation should
be considered as required reading before submitting any bug
reports.mencoderBefore using mencoder, it is a good
idea to become familiar with the options described at mplayerhq.hu/DOCS/HTML/en/mencoder.html.
There are innumerable ways to improve quality, lower
bitrate, and change formats, and some of these options may
make the difference between good or bad performance.
Improper combinations of command line options can yield
output files that are unplayable even by
mplayer.Here is an example of a simple copy:&prompt.user; mencoder input.avi -oac copy -ovc copy -o output.aviTo rip to a file, use -dumpstream with
mplayer.To convert
input.avi to
the MPEG4 codec with MP3 audio encoding, first install the
audio/lame port. Due to licensing
restrictions, a package is not available. Once installed,
type:&prompt.user; mencoder input.avi -oac mp3lame -lameopts br=192 \
-ovc lavc -lavcopts vcodec=mpeg4:vhq -o output.aviThis will produce output playable by applications such
as mplayer and
xine.input.avi
can be replaced with and run as root to re-encode a
DVD title directly. Since it may take a
few tries to get the desired result, it is recommended to
instead dump the title to a file and to work on the
file.The xine Video
Playerxine is a video player with a
reusable base library and a modular executable which can be
extended with plugins. It can be installed using the
multimedia/xine package or port.In practice, xine requires
either a fast CPU with a fast video card, or support for the
XVideo extension. The xine video
player performs best on XVideo interfaces.By default, the xine player
starts a graphical user interface. The menus can then be
used to open a specific file.Alternatively, xine may be
invoked from the command line by specifying the name of the
file to play:&prompt.user; xine -g -p mymovie.aviRefer to
xine-project.org/faq for more information and
troubleshooting tips.The Transcode
UtilitiesTranscode provides a suite of
tools for re-encoding video and audio files.
Transcode can be used to merge
video files or repair broken files using command line tools
with stdin/stdout stream interfaces.In &os;, Transcode can be
installed using the multimedia/transcode
package or port. Many users prefer to compile the port as
it provides a menu of compile options for specifying the
support and codecs to compile in. If an option is not
selected, Transcode will not be
able to encode that format. Use the arrow keys and spacebar
to select the required formats. When finished, press
Enter to continue the port compile and
installation.This example demonstrates how to convert a DivX file
into a PAL MPEG-1 file (PAL VCD):&prompt.user; transcode -i input.avi -V --export_prof vcd-pal -o output_vcd
&prompt.user; mplex -f 1 -o output_vcd.mpg output_vcd.m1v output_vcd.mpaThe resulting MPEG file,
output_vcd.mpg,
is ready to be played with
MPlayer. The file can be burned
on a CD media to create a video
CD using a utility such as
multimedia/vcdimager or
sysutils/cdrdao.In addition to the manual page for
transcode, refer to transcoding.org/cgi-bin/transcode
for further information and examples.TV CardsJosefEl-RayesOriginal contribution by MarcFonvieilleEnhanced and adapted by TV cardsTV cards can be used to watch broadcast or cable TV on a
computer. Most cards accept composite video via an
RCA or S-video input, and some cards include an
FM radio tuner.&os; provides support for PCI-based TV cards using a
Brooktree Bt848/849/878/879 video capture chip with the
&man.bktr.4; driver. This driver supports most Pinnacle PCTV
video cards. Before purchasing a TV card, consult &man.bktr.4;
for a list of supported tuners.Loading the DriverIn order to use the card, the &man.bktr.4; driver must be
loaded. To automate this at boot time, add the following line
to /boot/loader.conf:bktr_load="YES"Alternatively, one can statically compile support for
the TV card into a custom kernel. In that case, add the
following lines to the custom kernel configuration
file:device bktr
device iicbus
device iicbb
device smbusThese additional devices are necessary as the card
components are interconnected via an I2C bus. Then, build and
install a new kernel.To test that the tuner is correctly detected, reboot the
system. The TV card should appear in the boot messages, as
seen in this example:bktr0: <BrookTree 848A> mem 0xd7000000-0xd7000fff irq 10 at device 10.0 on pci0
iicbb0: <I2C bit-banging driver> on bti2c0
iicbus0: <Philips I2C bus> on iicbb0 master-only
iicbus1: <Philips I2C bus> on iicbb0 master-only
smbus0: <System Management Bus> on bti2c0
bktr0: Pinnacle/Miro TV, Philips SECAM tuner.The messages will differ according to the hardware. If
necessary, it is possible to override some of the detected
parameters using &man.sysctl.8; or custom kernel configuration
options. For example, to force the tuner to a Philips SECAM
tuner, add the following line to a custom kernel configuration
file:options OVERRIDE_TUNER=6or, use &man.sysctl.8;:&prompt.root; sysctl hw.bt848.tuner=6Refer to &man.bktr.4; for a description of the available
&man.sysctl.8; parameters and kernel options.Useful ApplicationsTo use the TV card, install one of the following
applications:multimedia/fxtv
provides TV-in-a-window and image/audio/video capture
capabilities.multimedia/xawtv
is another TV application with similar features.audio/xmradio
provides an application for using the FM radio tuner of a
TV card.More applications are available in the &os; Ports
Collection.TroubleshootingIf any problems are encountered with the TV card, check
that the video capture chip and the tuner are supported by
&man.bktr.4; and that the right configuration options were
used. For more support or to ask questions about supported TV
cards, refer to the &a.multimedia.name; mailing list.MythTVMythTV is a popular, open source Personal Video Recorder
(PVR) application. This section demonstrates
how to install and set up MythTV on &os;. Refer to mythtv.org/wiki
for more information on how to use MythTV.MythTV requires a frontend and a backend. These components
can either be installed on the same system or on different
machines.The frontend can be installed on &os; using the
multimedia/mythtv-frontend package or port.
&xorg; must also be installed and
configured as described in . Ideally, this
system has a video card that supports X-Video Motion
Compensation (XvMC) and, optionally, a Linux
Infrared Remote Control (LIRC)-compatible
remote.To install both the backend and the frontend on &os;, use
the multimedia/mythtv package or port. A
&mysql; database server is also required and should
automatically be installed as a dependency. Optionally, this
system should have a tuner card and sufficient storage to hold
recorded data.HardwareMythTV uses Video for Linux (V4L) to
access video input devices such as encoders and tuners. In
&os;, MythTV works best with USB DVB-S/C/T
cards as they are well supported by the
multimedia/webcamd package or port which
provides a V4L userland application. Any
Digital Video Broadcasting (DVB) card
supported by webcamd should work
with MythTV. A list of known working cards can be found at
wiki.freebsd.org/WebcamCompat.
Drivers are also available for Hauppauge cards in the
multimedia/pvr250 and
multimedia/pvrxxx ports, but they provide a
non-standard driver interface that does not work with versions
of MythTV greater than 0.23. Due to licensing restrictions,
no packages are available and these two ports must be
compiled.The wiki.freebsd.org/HTPC
page contains a list of all available DVB
drivers.Setting up the MythTV BackendTo install MythTV using binary packages:&prompt.root; pkg install mythtvAlternatively, to install from the Ports Collection:&prompt.root; cd /usr/ports/multimedia/mythtv
&prompt.root; make installOnce installed, set up the MythTV database:&prompt.root; mysql -uroot -p < /usr/local/share/mythtv/database/mc.sqlThen, configure the backend:&prompt.root; mythtv-setupFinally, start the backend:&prompt.root; sysrc mythbackend_enable=yes
&prompt.root; service mythbackend startImage ScannersMarcFonvieilleWritten by image scannersIn &os;, access to image scanners is provided by
SANE (Scanner Access Now Easy), which
is available in the &os; Ports Collection.
SANE will also use some &os; device
drivers to provide access to the scanner hardware.&os; supports both SCSI and
USB scanners. Depending upon the scanner
interface, different device drivers are required. Be sure the
scanner is supported by SANE prior
to performing any configuration. Refer to
http://www.sane-project.org/sane-supported-devices.html
for more information about supported scanners.This chapter describes how to determine if the scanner has
been detected by &os;. It then provides an overview of how to
configure and use SANE on a &os;
system.Checking the ScannerThe GENERIC kernel includes the
device drivers needed to support USB
scanners. Users with a custom kernel should ensure that the
following lines are present in the custom kernel configuration
file:device usb
device uhci
device ohci
device ehci
device xhciTo determine if the USB scanner is
detected, plug it in and use dmesg to
determine whether the scanner appears in the system message
buffer. If it does, it should display a message similar to
this:ugen0.2: <EPSON> at usbus0In this example, an &epson.perfection; 1650
USB scanner was detected on
/dev/ugen0.2.If the scanner uses a SCSI interface,
it is important to know which SCSI
controller board it will use. Depending upon the
SCSI chipset, a custom kernel configuration
file may be needed. The GENERIC kernel
supports the most common SCSI controllers.
Refer to /usr/src/sys/conf/NOTES to
determine the correct line to add to a custom kernel
configuration file. In addition to the
SCSI adapter driver, the following lines
are needed in a custom kernel configuration file:device scbus
device passVerify that the device is displayed in the system message
buffer:pass2 at aic0 bus 0 target 2 lun 0
pass2: <AGFA SNAPSCAN 600 1.10> Fixed Scanner SCSI-2 device
pass2: 3.300MB/s transfersIf the scanner was not powered on at system boot, it is
still possible to manually force detection by performing a
SCSI bus scan with
camcontrol:&prompt.root; camcontrol rescan all
Re-scan of bus 0 was successful
Re-scan of bus 1 was successful
Re-scan of bus 2 was successful
Re-scan of bus 3 was successfulThe scanner should now appear in the
SCSI devices list:&prompt.root; camcontrol devlist
<IBM DDRS-34560 S97B> at scbus0 target 5 lun 0 (pass0,da0)
<IBM DDRS-34560 S97B> at scbus0 target 6 lun 0 (pass1,da1)
<AGFA SNAPSCAN 600 1.10> at scbus1 target 2 lun 0 (pass3)
<PHILIPS CDD3610 CD-R/RW 1.00> at scbus2 target 0 lun 0 (pass2,cd0)Refer to &man.scsi.4; and &man.camcontrol.8; for more
details about SCSI devices on &os;.SANE ConfigurationThe SANE system provides
access to the scanner via backends (graphics/sane-backends).
Refer to http://www.sane-project.org/sane-supported-devices.html
to determine which backend supports the scanner. A
graphical scanning interface is provided by third party
applications like Kooka
(graphics/kooka) or
XSane
(graphics/xsane).
SANE's backends are enough to test
the scanner.To install the backends from a binary package:&prompt.root; pkg install sane-backendsAlternatively, to install from the Ports Collection:&prompt.root; cd /usr/ports/graphics/sane-backends
&prompt.root; make install cleanAfter installing the
graphics/sane-backends port or package, use
sane-find-scanner to check the scanner
detection by the SANE
system:&prompt.root; sane-find-scanner -q
found SCSI scanner "AGFA SNAPSCAN 600 1.10" at /dev/pass3The output should show the interface type of the scanner
and the device node used to attach the scanner to the system.
The vendor and the product model may or may not appear.Some USB scanners require firmware to
be loaded. Refer to sane-find-scanner(1) and sane(7) for
details.Next, check if the scanner will be identified by a
scanning frontend. The SANE
backends include scanimage which can be
used to list the devices and perform an image acquisition.
Use -L to list the scanner devices. The
first example is for a SCSI scanner and the
second is for a USB scanner:&prompt.root; scanimage -L
device `snapscan:/dev/pass3' is a AGFA SNAPSCAN 600 flatbed scanner
&prompt.root; scanimage -L
device 'epson2:libusb:000:002' is a Epson GT-8200 flatbed scannerIn this second example,
epson2 is
the backend name and
libusb:000:002 means
/dev/ugen0.2 is the device node used by the
scanner.If scanimage is unable to identify the
scanner, this message will appear:&prompt.root; scanimage -L
No scanners were identified. If you were expecting something different,
check that the scanner is plugged in, turned on and detected by the
sane-find-scanner tool (if appropriate). Please read the documentation
which came with this software (README, FAQ, manpages).If this happens, edit the backend configuration file in
/usr/local/etc/sane.d/ and define the
scanner device used. For example, if the undetected scanner
model is an &epson.perfection; 1650 and it uses the
epson2 backend, edit
/usr/local/etc/sane.d/epson2.conf. When
editing, add a line specifying the interface and the device
node used. In this case, add the following line:usb /dev/ugen0.2Save the edits and verify that the scanner is identified
with the right backend name and the device node:&prompt.root; scanimage -L
device 'epson2:libusb:000:002' is a Epson GT-8200 flatbed scannerOnce scanimage -L sees the scanner, the
configuration is complete and the scanner is now ready to
use.While scanimage can be used to perform
an image acquisition from the command line, it is often
preferable to use a graphical interface to perform image
scanning. Applications like Kooka
or XSane are popular scanning
frontends. They
offer advanced features such as various scanning modes, color
correction, and batch scans. XSane
is also usable as a GIMP plugin.Scanner PermissionsIn order to have access to the scanner, a user needs read
and write permissions to the device node used by the scanner.
In the previous example, the USB scanner
uses the device node /dev/ugen0.2 which
is really a symlink to the real device node
/dev/usb/0.2.0. The symlink and the
device node are owned, respectively, by the wheel and operator groups. While
adding the user to these groups will allow access to the
scanner, it is considered insecure to add a user to
wheel. A better
solution is to create a group and make the scanner device
accessible to members of this group.This example creates a group called usb:&prompt.root; pw groupadd usbThen, make the /dev/ugen0.2 symlink
and the /dev/usb/0.2.0 device node
accessible to the usb group with write
permissions of 0660 or
0664 by adding the following lines to
/etc/devfs.rules:[system=5]
add path ugen0.2 mode 0660 group usb
add path usb/0.2.0 mode 0660 group usbSince device nodes can change as devices are added or
removed, one may want to give access to all
USB devices using this ruleset instead:[system=5]
add path 'ugen*' mode 0660 group usb
add path 'usb/*' mode 0660 group usbRefer to &man.devfs.rules.5; for more information about
this file.Next, enable the ruleset in /etc/rc.conf:devfs_system_ruleset="system"And, restart the &man.devfs.8; system:&prompt.root; service devfs restartFinally, add the users to usb
in order to allow access to the scanner:&prompt.root; pw groupmod usb -m joeFor more details refer to &man.pw.8;.
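As a quick sanity check after running pw groupmod, the new membership can be verified from the command line; a small sketch using the joe and usb names from the example above:

```shell
# id -Gn prints every group the user belongs to, separated by spaces;
# grep -qx matches the usb group as a whole word on its own line.
if id -Gn joe | tr ' ' '\n' | grep -qx usb; then
    echo "joe is in the usb group"
fi
```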
diff --git a/en_US.ISO8859-1/books/handbook/security/chapter.xml b/en_US.ISO8859-1/books/handbook/security/chapter.xml
index 4bc3279737..d3b277a978 100644
--- a/en_US.ISO8859-1/books/handbook/security/chapter.xml
+++ b/en_US.ISO8859-1/books/handbook/security/chapter.xml
@@ -1,4157 +1,4157 @@
SecurityTomRhodesRewritten by securitySynopsisSecurity, whether physical or virtual, is a topic so broad
that an entire industry has evolved around it. Hundreds of
standard practices have been authored about how to secure
systems and networks, and as a user of &os;, understanding how
to protect against attacks and intruders is a must.In this chapter, several fundamentals and techniques will be
discussed. The &os; system comes with multiple layers of
security, and many more third party utilities may be added to
enhance security.After reading this chapter, you will know:Basic &os; system security concepts.The various crypt mechanisms available in &os;.How to set up one-time password authentication.How to configure TCP Wrapper
for use with &man.inetd.8;.How to set up Kerberos on
&os;.How to configure IPsec and create a
VPN.How to configure and use
OpenSSH on &os;.How to use file system ACLs.How to use pkg to audit
third party software packages installed from the Ports
Collection.How to utilize &os; security advisories.What Process Accounting is and how to enable it on
&os;.How to control user resources using login classes or the
resource limits database.Before reading this chapter, you should:Understand basic &os; and Internet concepts.Additional security topics are covered elsewhere in this
Handbook. For example, Mandatory Access Control is discussed in
and Internet firewalls are discussed in
.IntroductionSecurity is everyone's responsibility. A weak entry point
in any system could allow intruders to gain access to critical
information and cause havoc on an entire network. One of the
core principles of information security is the
CIA triad, which stands for the
Confidentiality, Integrity, and Availability of information
systems.The CIA triad is a bedrock concept of
computer security as customers and users expect their data to be
protected. For example, a customer expects that their credit
card information is securely stored (confidentiality), that
their orders are not changed behind the scenes (integrity), and
that they have access to their order information at all times
(availability).To provide CIA, security professionals
apply a defense in depth strategy. The idea of defense in depth
is to add several layers of security to prevent one single layer
failing and the entire security system collapsing. For example,
a system administrator cannot simply turn on a firewall and
consider the network or system secure. One must also audit
accounts, check the integrity of binaries, and ensure malicious
tools are not installed. To implement an effective security
strategy, one must understand threats and how to defend against
them.What is a threat as it pertains to computer security?
Threats are not limited to remote attackers who attempt to
access a system without permission from a remote location.
Threats also include employees, malicious software, unauthorized
network devices, natural disasters, security vulnerabilities,
and even competing corporations.Systems and networks can be accessed without permission,
sometimes by accident, or by remote attackers, and in some
cases, via corporate espionage or former employees. As a user,
it is important to prepare for and admit when a mistake has led
to a security breach and report possible issues to the security
team. As an administrator, it is important to know of the
threats and be prepared to mitigate them.When applying security to systems, it is recommended to
start by securing the basic accounts and system configuration,
and then to secure the network layer so that it adheres to the
system policy and the organization's security procedures. Many
organizations already have a security policy that covers the
configuration of technology devices. The policy should include
the security configuration of workstations, desktops, mobile
devices, phones, production servers, and development servers.
In many cases, standard operating procedures
(SOPs) already exist. When in doubt, ask the
security team.The rest of this introduction describes how some of these
basic security configurations are performed on a &os; system.
The rest of this chapter describes some specific tools which can
be used when implementing a security policy on a &os;
system.Preventing LoginsIn securing a system, a good starting point is an audit of
accounts. Ensure that root has a strong password and
that this password is not shared. Disable any accounts that
do not need login access.To deny login access to accounts, two methods exist. The
first is to lock the account. This example locks the
toor account:&prompt.root; pw lock toorThe second method is to prevent login access by changing
the shell to /usr/sbin/nologin. Only the
superuser can change the shell for other users:&prompt.root; chsh -s /usr/sbin/nologin toorThe /usr/sbin/nologin shell prevents
the system from assigning a shell to the user when they
attempt to login.Permitted Account EscalationIn some cases, system administration needs to be shared
with other users. &os; has two methods to handle this. The
first one, which is not recommended, is a shared root password
used by members of the wheel group. With this
method, a user types su and enters the
password for root
whenever superuser access is needed. The user should then
type exit to leave privileged access after
finishing the commands that required administrative access.
To add a user to this group, edit
/etc/group and add the user to the end of
the wheel entry. The user must be
separated by a comma character with no space.The second, and recommended, method to permit privilege
escalation is to install the security/sudo
package or port. This software provides additional auditing,
more fine-grained user control, and can be configured to lock
users into running only the specified privileged
commands.After installation, use visudo to edit
/usr/local/etc/sudoers. This example
creates a new webadmin group, adds the
trhodes account to
that group, and configures that group access to restart
apache24:&prompt.root; pw groupadd webadmin -M trhodes -g 6000
&prompt.root; visudo
%webadmin ALL=(ALL) /usr/sbin/service apache24 *Password HashesPasswords are a necessary evil of technology. When they
must be used, they should be complex and a powerful hash
mechanism should be used to encrypt the version that is stored
in the password database. &os; supports the
DES, MD5,
SHA256, SHA512, and
Blowfish hash algorithms in its crypt()
library. The default of SHA512 should not
be changed to a less secure hashing algorithm, but can be
changed to the more secure Blowfish algorithm.Blowfish is not part of AES and is
not considered compliant with any Federal Information
Processing Standards (FIPS). Its use may
not be permitted in some environments.To determine which hash algorithm is used to encrypt a
user's password, the superuser can view the hash for the user
in the &os; password database. Each hash starts with a symbol
which indicates the type of hash mechanism used to encrypt the
password. If DES is used, there is no
beginning symbol. For MD5, the symbol is
$1$. For SHA256, the symbol is
$5$, and for SHA512 the symbol is
$6$. For Blowfish, the symbol is
$2a$. In this example, the password for
dru is hashed using
the default SHA512 algorithm as the hash
starts with $6$. Note that the encrypted
hash, not the password itself, is stored in the password
database:&prompt.root; grep dru /etc/master.passwd
dru:$6$pzIjSvCAn.PBYQBA$PXpSeWPx3g5kscj3IMiM7tUEUSPmGexxta.8Lt9TGSi2lNQqYGKszsBPuGME0:1001:1001::0:0:dru:/usr/home/dru:/bin/cshThe hash mechanism is set in the user's login class. For
this example, the user is in the default
login class and the hash algorithm is set with this line in
/etc/login.conf: :passwd_format=sha512:\To change the algorithm to Blowfish, modify that line to
look like this: :passwd_format=blf:\Then run cap_mkdb /etc/login.conf as
described in . Note that this
change will not affect any existing password hashes. This
means that all passwords should be re-hashed by asking users
to run passwd in order to change their
password.For remote logins, two-factor authentication should be
used. An example of two-factor authentication is
something you have, such as a key, and
something you know, such as the passphrase for
that key. Since OpenSSH is part of
the &os; base system, all network logins should be over an
encrypted connection and use key-based authentication instead
of passwords. For more information, refer to . Kerberos users may need to make
additional changes to implement
OpenSSH in their network. These
changes are described in .Password Policy EnforcementEnforcing a strong password policy for local accounts is a
fundamental aspect of system security. In &os;, password
length, password strength, and password complexity can be
implemented using built-in Pluggable Authentication Modules
(PAM).This section demonstrates how to configure the minimum and
maximum password length and the enforcement of mixed
characters using the pam_passwdqc.so
module. This module is enforced when a user changes their
password.To configure this module, become the superuser and
uncomment the line containing
pam_passwdqc.so in
/etc/pam.d/passwd. Then, edit that line
to match the password policy:password requisite pam_passwdqc.so min=disabled,disabled,disabled,12,10 similar=deny retry=3 enforce=usersThis example sets several requirements for new passwords.
The min setting controls the minimum
password length. It has five values because this module
defines five different types of passwords based on their
complexity. Complexity is defined by the type of characters
that must exist in a password, such as letters, numbers,
symbols, and case. The types of passwords are described in
&man.pam.passwdqc.8;. In this example, the first three types
of passwords are disabled, meaning that passwords that meet
those complexity requirements will not be accepted, regardless
of their length. The 12 sets a minimum
password policy of at least twelve characters, if the password
also contains characters with three types of complexity. The
10 sets the password policy to also allow
passwords of at least ten characters, if the password contains
characters with four types of complexity.The similar setting denies passwords
that are similar to the user's previous password. The
retry setting provides a user with three
opportunities to enter a new password.Once this file is saved, a user changing their password
will see a message similar to the following:&prompt.user; passwd
Changing local password for trhodes
Old Password:
You can now choose the new password.
A valid password should be a mix of upper and lower case letters,
digits and other characters. You can use a 12 character long
password with characters from at least 3 of these 4 classes, or
a 10 character long password containing characters from all the
classes. Characters that form a common pattern are discarded by
the check.
Alternatively, if no one else can see your terminal now, you can
pick this as your password: "trait-useful&knob".
Enter new password:If a password that does not match the policy is entered,
it will be rejected with a warning and the user will have an
opportunity to try again, up to the configured number of
retries.Most password policies require passwords to expire after
so many days. To set a password age time in &os;, set
passwordtime for the user's login class in
/etc/login.conf. The
default login class contains an
example:# :passwordtime=90d:\So, to set an expiry of 90 days for this login class,
remove the comment symbol (#), save the
edit, and run cap_mkdb
/etc/login.conf.To set the expiration on individual users, pass an
expiration date or the number of days to expiry and a username
to pw:&prompt.root; pw usermod -p 30-apr-2015 -n trhodesAs seen here, an expiration date is set in the form of
day, month, and year. For more information, see
&man.pw.8;.Detecting RootkitsA rootkit is any unauthorized
software that attempts to gain root access to a system. Once
installed, this malicious software will normally open up
another avenue of entry for an attacker. Realistically, once
a system has been compromised by a rootkit and an
investigation has been performed, the system should be
reinstalled from scratch. There is tremendous risk that even
the most prudent security or systems engineer will miss
something an attacker left behind.A rootkit does do one thing useful for administrators:
once detected, it is a sign that a compromise happened at some
point. But, these types of applications tend to be very well
hidden. This section demonstrates a tool that can be used to
detect rootkits, security/rkhunter.After installation of this package or port, the system may
be checked using the following command. It will produce a lot
of information and will require some manual pressing of
ENTER:&prompt.root; rkhunter -cAfter the process completes, a status message will be
printed to the screen. This message will include the number
of files checked, suspect files, possible rootkits, and more.
During the check, some generic security warnings may
be produced about hidden files, the
OpenSSH protocol selection, and
known vulnerable versions of installed software. These can be
handled now or after a more detailed analysis has been
performed.Every administrator should know what is running on the
systems they are responsible for. Third-party tools like
rkhunter and
sysutils/lsof, and native commands such
as netstat and ps, can
show a great deal of information on the system. Take notes on
what is normal, ask questions when something seems out of
place, and be paranoid. While preventing a compromise is
ideal, detecting a compromise is a must.Binary VerificationVerification of system files and binaries is important
because it provides the system administration and security
teams information about system changes. A software
application that monitors the system for changes is called an
Intrusion Detection System (IDS).&os; provides native support for a basic
IDS system. While the nightly security
emails will notify an administrator of changes, the
information is stored locally and there is a chance that a
malicious user could modify this information in order to hide
their changes to the system. As such, it is recommended to
create a separate set of binary signatures and store them on a
read-only, root-owned directory or, preferably, on a removable
USB disk or remote
rsync server.The built-in mtree utility can be used
to generate a specification of the contents of a directory. A
seed, or a numeric constant, is used to generate the
specification and is required to check that the specification
has not changed. This makes it possible to determine if a
file or binary has been modified. Since the seed value is
unknown by an attacker, faking or checking the checksum values
of files will be difficult to impossible. The following
example generates a set of SHA256 hashes,
one for each system binary in /bin, and
saves those values to a hidden file in root's home directory,
/root/.bin_chksum_mtree:&prompt.root; mtree -s 3483151339707503 -c -K cksum,sha256digest -p /bin > /root/.bin_chksum_mtree
&prompt.root; mtree: /bin checksum: 3427012225The 3483151339707503 represents
the seed. This value should be remembered, but not
shared.Viewing /root/.bin_cksum_mtree should
yield output similar to the following:# user: root
# machine: dreadnaught
# tree: /bin
# date: Mon Feb 3 10:19:53 2014
# .
/set type=file uid=0 gid=0 mode=0555 nlink=1 flags=none
. type=dir mode=0755 nlink=2 size=1024 \
time=1380277977.000000000
\133 nlink=2 size=11704 time=1380277977.000000000 \
cksum=484492447 \
sha256digest=6207490fbdb5ed1904441fbfa941279055c3e24d3a4049aeb45094596400662a
cat size=12096 time=1380277975.000000000 cksum=3909216944 \
sha256digest=65ea347b9418760b247ab10244f47a7ca2a569c9836d77f074e7a306900c1e69
chflags size=8168 time=1380277975.000000000 cksum=3949425175 \
sha256digest=c99eb6fc1c92cac335c08be004a0a5b4c24a0c0ef3712017b12c89a978b2dac3
chio size=18520 time=1380277975.000000000 cksum=2208263309 \
sha256digest=ddf7c8cb92a58750a675328345560d8cc7fe14fb3ccd3690c34954cbe69fc964
chmod size=8640 time=1380277975.000000000 cksum=2214429708 \
sha256digest=a435972263bf814ad8df082c0752aa2a7bdd8b74ff01431ccbd52ed1e490bbe7The machine's hostname, the date and time the
specification was created, and the name of the user who
created the specification are included in this report. There
is a checksum, size, time, and SHA256
digest for each binary in the directory.To verify that the binary signatures have not changed,
compare the current contents of the directory to the
previously generated specification, and save the results to a
file. This command requires the seed that was used to
generate the original specification:&prompt.root; mtree -s 3483151339707503 -p /bin < /root/.bin_chksum_mtree >> /root/.bin_chksum_output
&prompt.root; mtree: /bin checksum: 3427012225This should produce the same checksum for
/bin that was produced when the
specification was created. If no changes have occurred to the
binaries in this directory, the
/root/.bin_chksum_output output file will
be empty. To simulate a change, change the date on
/bin/cat using touch
and run the verification command again:&prompt.root; touch /bin/cat
&prompt.root; mtree -s 3483151339707503 -p /bin < /root/.bin_chksum_mtree >> /root/.bin_chksum_output
&prompt.root; more /root/.bin_chksum_output
cat changed
modification time expected Fri Sep 27 06:32:55 2013 found Mon Feb 3 10:28:43 2014It is recommended to create specifications for the
directories which contain binaries and configuration files, as
well as any directories containing sensitive data. Typically,
specifications are created for /bin,
/sbin, /usr/bin,
/usr/sbin,
/usr/local/bin,
/etc, and
/usr/local/etc.More advanced IDS systems exist, such
as security/aide. In most cases,
mtree provides the functionality
administrators need. It is important to keep the seed value
and the checksum output hidden from malicious users. More
information about mtree can be found in
&man.mtree.8;.System Tuning for SecurityIn &os;, many system features can be tuned using
sysctl. A few of the security features
which can be tuned to prevent Denial of Service
(DoS) attacks will be covered in this
section. More information about using
sysctl, including how to temporarily change
values and how to make the changes permanent after testing,
can be found in .Any time a setting is changed with
sysctl, the chance to cause undesired
harm is increased, affecting the availability of the system.
All changes should be monitored and, if possible, tried on a
testing system before being used on a production
system.By default, the &os; kernel boots with a security level of
-1. This is called insecure
mode because immutable file flags may be turned off
and all devices may be read from or written to. The security
level will remain at -1 unless it is
altered through sysctl or by a setting in
the startup scripts. The security level may be increased
during system startup by setting
kern_securelevel_enable to
YES in /etc/rc.conf,
and the value of kern_securelevel to the
desired security level. See &man.security.7; and &man.init.8;
for more information on these settings and the available
security levels.Increasing the securelevel can break
Xorg and cause other issues. Be
prepared to do some debugging.The net.inet.tcp.blackhole and
net.inet.udp.blackhole settings can be used
to drop incoming SYN packets on closed
ports without sending a return RST
response. The default behavior is to return an
RST to show a port is closed. Changing the
default provides some level of protection against port scans,
which are used to determine which applications are running on
a system. Set net.inet.tcp.blackhole to
2 and
net.inet.udp.blackhole to
1. Refer to &man.blackhole.4; for more
information about these settings.The net.inet.icmp.drop_redirect and
net.inet.ip.redirect settings help prevent
against redirect attacks. A redirect
attack is a type of DoS which sends mass
numbers of ICMP type 5 packets. Since
these packets are not required, set
net.inet.icmp.drop_redirect to
1 and set
net.inet.ip.redirect to
0.Source routing is a method for detecting and accessing
non-routable addresses on the internal network. This should
be disabled, as such addresses are normally
unroutable by design. To disable this feature, set
net.inet.ip.sourceroute and
net.inet.ip.accept_sourceroute to
0.When a machine on the network needs to send messages to
all hosts on a subnet, an ICMP echo request
message is sent to the broadcast address. However, there is
no reason for an external host to perform such an action. To
reject all external broadcast requests, set
net.inet.icmp.bmcastecho to
0.Some additional settings are documented in
&man.security.7;.One-time Passwordsone-time passwordssecurityone-time passwordsBy default, &os; includes support for One-time Passwords In
Everything (OPIE). OPIE
is designed to prevent replay attacks, in which an attacker
discovers a user's password and uses it to access a system.
Since a password is only used once in OPIE, a
discovered password is of little use to an attacker.
OPIE uses a secure hash and a
challenge/response system to manage passwords. The &os;
implementation uses the MD5 hash by
default.OPIE uses three different types of
passwords. The first is the usual &unix; or Kerberos password.
The second is the one-time password which is generated by
opiekey. The third type of password is the
secret password which is used to generate
one-time passwords. The secret password has nothing to do with,
and should be different from, the &unix; password.There are two other pieces of data that are important to
OPIE. One is the seed or
key, consisting of two letters and five digits.
The other is the iteration count, a number
between 1 and 100. OPIE creates the one-time
password by concatenating the seed and the secret password,
applying the MD5 hash as many times as
specified by the iteration count, and turning the result into
six short English words which represent the one-time password.
The authentication system keeps track of the last one-time
password used, and the user is authenticated if the hash of the
user-provided password is equal to the previous password.
- Because a one-way hash is used, it is impossible to generate
+ Since a one-way hash is used, it is impossible to generate
future one-time passwords if a successfully used password is
captured. The iteration count is decremented after each
successful login to keep the user and the login program in sync.
When the iteration count gets down to 1,
OPIE must be reinitialized.There are a few programs involved in this process. A
one-time password, or a consecutive list of one-time passwords,
is generated by passing an iteration count, a seed, and a secret
password to &man.opiekey.1;. In addition to initializing
OPIE, &man.opiepasswd.1; is used to change
passwords, iteration counts, or seeds. The relevant credential
files in /etc/opiekeys are examined by
&man.opieinfo.1; which prints out the invoking user's current
iteration count and seed.This section describes four different sorts of operations.
The first is how to set up one-time-passwords for the first time
over a secure connection. The second is how to use
opiepasswd over an insecure connection. The
third is how to log in over an insecure connection. The fourth
is how to generate a number of keys which can be written down or
printed out to use at insecure locations.Initializing OPIETo initialize OPIE for the first time,
run this command from a secure location:&prompt.user; opiepasswd -c
Adding unfurl:
Only use this method from the console; NEVER from remote. If you are using
telnet, xterm, or a dial-in, type ^C now or exit with no password.
Then run opiepasswd without the -c parameter.
Using MD5 to compute responses.
Enter new secret pass phrase:
Again new secret pass phrase:
ID unfurl OTP key is 499 to4268
MOS MALL GOAT ARM AVID COEDThe -c sets console mode which assumes
that the command is being run from a secure location, such as
a computer under the user's control or an
SSH session to a computer under the user's
control.When prompted, enter the secret password which will be
used to generate the one-time login keys. This password
should be difficult to guess and should be different than the
password which is associated with the user's login account.
It must be between 10 and 127 characters long. Remember this
password.The ID line lists the login name
(unfurl), default iteration count
(499), and default seed
(to4268). When logging in, the system will
remember these parameters and display them, meaning that they
do not have to be memorized. The last line lists the
generated one-time password which corresponds to those
parameters and the secret password. At the next login, use
this one-time password.Insecure Connection InitializationTo initialize or change the secret password on an
insecure system, a secure connection is needed to some place
where opiekey can be run. This might be a
shell prompt on a trusted machine. An iteration count is
needed, where 100 is probably a good value, and the seed can
either be specified or the randomly-generated one used. On
the insecure connection, the machine being initialized, use
&man.opiepasswd.1;:&prompt.user; opiepasswd
Updating unfurl:
You need the response from an OTP generator.
Old secret pass phrase:
otp-md5 498 to4268 ext
Response: GAME GAG WELT OUT DOWN CHAT
New secret pass phrase:
otp-md5 499 to4269
Response: LINE PAP MILK NELL BUOY TROY
ID mark OTP key is 499 gr4269
LINE PAP MILK NELL BUOY TROYTo accept the default seed, press Return.
Before entering an access password, move over to the secure
connection and give it the same parameters:&prompt.user; opiekey 498 to4268
Using the MD5 algorithm to compute response.
Reminder: Do not use opiekey from telnet or dial-in sessions.
Enter secret pass phrase:
GAME GAG WELT OUT DOWN CHATSwitch back over to the insecure connection, and copy the
generated one-time password over to the relevant
program.Generating a Single One-time PasswordAfter initializing OPIE and logging in,
a prompt like this will be displayed:&prompt.user; telnet example.com
Trying 10.0.0.1...
Connected to example.com
Escape character is '^]'.
FreeBSD/i386 (example.com) (ttypa)
login: <username>
otp-md5 498 gr4269 ext
Password: The OPIE prompt provides a useful
feature. If Return is pressed at the
password prompt, the prompt will turn echo on and display
what is typed. This can be useful when attempting to type in
a password by hand from a printout.MS-DOSWindowsMacOSAt this point, generate the one-time password to answer
this login prompt. This must be done on a trusted system
where it is safe to run &man.opiekey.1;. There are versions
of this command for &windows;, &macos; and &os;. This command
needs the iteration count and the seed as command line
options. Use cut-and-paste from the login prompt on the
machine being logged in to.On the trusted system:&prompt.user; opiekey 498 to4268
Using the MD5 algorithm to compute response.
Reminder: Do not use opiekey from telnet or dial-in sessions.
Enter secret pass phrase:
GAME GAG WELT OUT DOWN CHATOnce the one-time password is generated, continue to log
in.Generating Multiple One-time PasswordsSometimes there is no access to a trusted machine or
secure connection. In this case, it is possible to use
&man.opiekey.1; to generate a number of one-time passwords
beforehand. For example:&prompt.user; opiekey -n 5 30 zz99999
Using the MD5 algorithm to compute response.
Reminder: Do not use opiekey from telnet or dial-in sessions.
Enter secret pass phrase: <secret password>
26: JOAN BORE FOSS DES NAY QUIT
27: LATE BIAS SLAY FOLK MUCH TRIG
28: SALT TIN ANTI LOON NEAL USE
29: RIO ODIN GO BYE FURY TIC
30: GREW JIVE SAN GIRD BOIL PHIThe -n 5 requests five keys in sequence,
and the 30 specifies what the last iteration
number should be. Note that these are printed out in
reverse order of use. The really
paranoid might want to write the results down by hand;
otherwise, print the list. Each line shows both the iteration
count and the one-time password. Scratch off the passwords as
they are used.Restricting Use of &unix; PasswordsOPIE can restrict the use of &unix;
passwords based on the IP address of a login session. The
relevant file is /etc/opieaccess, which
is present by default. Refer to &man.opieaccess.5; for more
information on this file and which security considerations to
be aware of when using it.Here is a sample opieaccess:permit 192.168.0.0 255.255.0.0This line allows users whose IP source address (which is
vulnerable to spoofing) matches the specified value and mask,
to use &unix; passwords at any time.If no rules in opieaccess are
matched, the default is to deny non-OPIE
logins.TCP WrapperTomRhodesWritten
by TCP WrapperTCP Wrapper is a host-based
access control system which extends the abilities of &man.inetd.8;. It can be configured to provide
logging support, return messages, and connection restrictions
for the server daemons under the control of
inetd. Refer to &man.tcpd.8; for
more information about
TCP Wrapper and its features.TCP Wrapper should not be
considered a replacement for a properly configured firewall.
Instead, TCP Wrapper should be used
in conjunction with a firewall and other security enhancements
in order to provide another layer of protection in the
implementation of a security policy.Initial ConfigurationTo enable TCP Wrapper in &os;,
add the following lines to
/etc/rc.conf:inetd_enable="YES"
inetd_flags="-Ww"Then, properly configure
/etc/hosts.allow.Unlike other implementations of
TCP Wrapper, the use of
hosts.deny is deprecated in &os;. All
configuration options should be placed in
/etc/hosts.allow.In the simplest configuration, daemon connection policies
are set to either permit or block, depending on the options in
/etc/hosts.allow. The default
configuration in &os; is to allow all connections to the
daemons started with inetd.Basic configuration usually takes the form of
daemon : address : action, where
daemon is the daemon which
inetd started,
address is a valid hostname,
IP address, or an IPv6 address enclosed in
brackets ([ ]), and action is either
allow or deny.
TCP Wrapper uses a first rule match
semantic, meaning that the configuration file is scanned from
the beginning for a matching rule. When a match is found, the
rule is applied and the search process stops.For example, to allow POP3 connections
via the mail/qpopper daemon, the following
lines should be appended to
hosts.allow:# This line is required for POP3 connections:
qpopper : ALL : allowWhenever this file is edited, restart
inetd:&prompt.root; service inetd restartAdvanced ConfigurationTCP Wrapper provides advanced
options to allow more control over the way connections are
handled. In some cases, it may be appropriate to return a
comment to certain hosts or daemon connections. In other
cases, a log entry should be recorded or an email sent to the
administrator. Other situations may require the use of a
service for local connections only. This is all possible
through the use of configuration options known as wildcards,
expansion characters, and external command execution.Suppose that a situation occurs where a connection should
be denied yet a reason should be sent to the host who
attempted to establish that connection. That action is
possible with twist. When a connection
attempt is made, twist executes a shell
command or script. An example exists in
hosts.allow:# The rest of the daemons are protected.
ALL : ALL \
: severity auth.info \
: twist /bin/echo "You are not welcome to use %d from %h."In this example, the message You are not allowed to
use daemon name from
hostname. will be
returned for any daemon not configured in
hosts.allow. This is useful for sending
a reply back to the connection initiator right after the
established connection is dropped. Any message returned
must be wrapped in quote
(") characters.It may be possible to launch a denial of service attack
on the server if an attacker floods these daemons with
connection requests.Another possibility is to use spawn.
Like twist, spawn implicitly
denies the connection and may be used to run external shell
commands or scripts. Unlike twist,
spawn will not send a reply back to the host
who established the connection. For example, consider the
following configuration:# We do not allow connections from example.com:
ALL : .example.com \
: spawn (/bin/echo %a from %h attempted to access %d >> \
/var/log/connections.log) \
: denyThis will deny all connection attempts from *.example.com and log the
hostname, IP address, and the daemon to
which access was attempted to
/var/log/connections.log. This example
uses the substitution characters %a and
%h. Refer to &man.hosts.access.5; for the
complete list.To match every instance of a daemon, domain, or
IP address, use ALL.
Another wildcard is PARANOID, which
matches any host whose IP
address may be forged because the IP
address differs from its resolved hostname. In this example,
all connection requests to Sendmail
which have an IP address that varies from
its hostname will be denied:# Block possibly spoofed requests to sendmail:
sendmail : PARANOID : denyUsing the PARANOID wildcard will
result in denied connections if the client or server has a
broken DNS setup.To learn more about wildcards and their associated
functionality, refer to &man.hosts.access.5;.When adding new configuration lines, make sure that any
unneeded entries for that daemon are commented out in
hosts.allow.KerberosTillmanHodgsonContributed by MarkMurrayBased on a contribution by Kerberos is a network
authentication protocol which was originally created by the
Massachusetts Institute of Technology (MIT)
as a way to securely provide authentication across a potentially
hostile network. The Kerberos
protocol uses strong cryptography so that both a client and
server can prove their identity without sending any unencrypted
secrets over the network. Kerberos
can be described as an identity-verifying proxy system and as a
trusted third-party authentication system. After a user
authenticates with Kerberos, their
communications can be encrypted to assure privacy and data
integrity.The only function of Kerberos is
to provide the secure authentication of users and servers on the
network. It does not provide authorization or auditing
functions. It is recommended that
Kerberos be used with other security
methods which provide authorization and audit services.The current version of the protocol is version 5, described
in RFC 4120. Several free
implementations of this protocol are available, covering a wide
range of operating systems. MIT continues to
develop their Kerberos package. It
is commonly used in the US as a cryptography
product, and has historically been subject to
US export regulations. In &os;,
MIT Kerberos is
available as the security/krb5 package or
port. The Heimdal Kerberos
implementation was explicitly developed outside of the
US to avoid export regulations. The Heimdal
Kerberos distribution is included in
the base &os; installation, and another distribution with more
configurable options is available as
security/heimdal in the Ports
Collection.In Kerberos users and services
are identified as principals which are contained
within an administrative grouping, called a
realm. A typical user principal would be of the
form
user@REALM
(realms are traditionally uppercase).This section provides a guide on how to set up
Kerberos using the Heimdal
distribution included in &os;.For purposes of demonstrating a
Kerberos installation, the name
spaces will be as follows:The DNS domain (zone) will be
example.org.The Kerberos realm will be
EXAMPLE.ORG.Use real domain names when setting up
Kerberos, even if it will run
internally. This avoids DNS problems and
assures inter-operation with other
Kerberos realms.Setting up a Heimdal KDCKerberos5Key Distribution CenterThe Key Distribution Center (KDC) is
the centralized authentication service that
Kerberos provides, the
trusted third party of the system. It is the
computer that issues Kerberos
tickets, which are used for clients to authenticate to
servers. As the KDC is considered
trusted by all other computers in the
Kerberos realm, it has heightened
security concerns. Direct access to the KDC should be
limited.While running a KDC requires few
computing resources, a dedicated machine acting only as a
KDC is recommended for security
reasons.To begin, install the security/heimdal
package as follows:&prompt.root; pkg install heimdalNext, update /etc/rc.conf using
sysrc as follows:&prompt.root; sysrc kdc_enable=yes
&prompt.root; sysrc kadmind_enable=yesNext, edit /etc/krb5.conf as
follows:[libdefaults]
default_realm = EXAMPLE.ORG
[realms]
EXAMPLE.ORG = {
kdc = kerberos.example.org
admin_server = kerberos.example.org
}
[domain_realm]
.example.org = EXAMPLE.ORGIn this example, the KDC will use the
fully-qualified hostname kerberos.example.org. The
hostname of the KDC must be resolvable in the
DNS.Kerberos can also use the
DNS to locate KDCs, instead of a
[realms] section in
/etc/krb5.conf. For large organizations
that have their own DNS servers, the above
example could be trimmed to:[libdefaults]
default_realm = EXAMPLE.ORG
[domain_realm]
.example.org = EXAMPLE.ORGWith the following lines being included in the
example.org zone
file:_kerberos._udp IN SRV 01 00 88 kerberos.example.org.
_kerberos._tcp IN SRV 01 00 88 kerberos.example.org.
_kpasswd._udp IN SRV 01 00 464 kerberos.example.org.
_kerberos-adm._tcp IN SRV 01 00 749 kerberos.example.org.
_kerberos IN TXT EXAMPLE.ORGIn order for clients to be able to find the
Kerberos services, they
must have either
a fully configured /etc/krb5.conf or a
minimally configured /etc/krb5.conf and a properly configured
DNS server.Next, create the Kerberos
database which contains the keys of all principals (users and
hosts) encrypted with a master password. It is not required
to remember this password as it will be stored in
/var/heimdal/m-key; it would be
reasonable to use a 45-character random password for this
purpose. To create the master key, run
kstash and enter a password:&prompt.root; kstash
Master key: xxxxxxxxxxxxxxxxxxxxxxx
Verifying password - Master key: xxxxxxxxxxxxxxxxxxxxxxxOnce the master key has been created, the database should
be initialized. The Kerberos
administrative tool &man.kadmin.8; can be used on the KDC in a
mode that operates directly on the database, without using the
&man.kadmind.8; network service, as
kadmin -l. This resolves the
chicken-and-egg problem of trying to connect to the database
before it is created. At the kadmin
prompt, use init to create the realm's
initial database:&prompt.root; kadmin -l
kadmin> init EXAMPLE.ORG
Realm max ticket life [unlimited]:Lastly, while still in kadmin, create
the first principal using add. Stick to
the default options for the principal for now, as these can be
changed later with modify.
Type ? at the prompt to see the available
options.kadmin> add tillman
Max ticket life [unlimited]:
Max renewable life [unlimited]:
Principal expiration time [never]:
Password expiration time [never]:
Attributes []:
Password: xxxxxxxx
Verifying password - Password: xxxxxxxxNext, start the KDC services by
running:&prompt.root; service kdc start
&prompt.root; service kadmind startWhile there will not be any kerberized daemons running at
this point, it is possible to confirm that the
KDC is functioning by obtaining a ticket
for the principal that was just created:&prompt.user; kinit tillman
tillman@EXAMPLE.ORG's Password:Confirm that a ticket was successfully obtained using
klist:&prompt.user; klist
Credentials cache: FILE:/tmp/krb5cc_1001
Principal: tillman@EXAMPLE.ORG
Issued Expires Principal
Aug 27 15:37:58 2013 Aug 28 01:37:58 2013 krbtgt/EXAMPLE.ORG@EXAMPLE.ORGThe temporary ticket can be destroyed when the test is
finished:&prompt.user; kdestroyConfiguring a Server to Use
KerberosKerberos5enabling servicesThe first step in configuring a server to use
Kerberos authentication is to
ensure that it has the correct configuration in
/etc/krb5.conf. The version from the
KDC can be used as-is, or it can be
regenerated on the new system.Next, create /etc/krb5.keytab on the
server. This is the main part of Kerberizing a
service — it corresponds to generating a secret shared
between the service and the KDC. The
secret is a cryptographic key, stored in a
keytab. The keytab contains the server's host
key, which allows it and the KDC to verify
each other's identity. It must be transmitted to the server
in a secure fashion, as the security of the server can be
broken if the key is made public. Typically, the
keytab is generated on an administrator's
trusted machine using kadmin, then securely
transferred to the server, e.g., with &man.scp.1;; it can also
be created directly on the server if that is consistent with
the desired security policy. It is very important that the
keytab is transmitted to the server in a secure fashion: if
the key is known by some other party, that party can
impersonate any user to the server! Using
kadmin on the server directly is
convenient, because the entry for the host principal in the
KDC database is also created using
kadmin.Of course, kadmin is a kerberized
service; a Kerberos ticket is
needed to authenticate to the network service, but to ensure
that the user running kadmin is actually
present (and their session has not been hijacked),
kadmin will prompt for the password to get
a fresh ticket. The principal authenticating to the kadmin
service must be permitted to use the kadmin
interface, as specified in
/var/heimdal/kadmind.acl. See the
section titled Remote administration in
info heimdal for details on designing
access control lists. Instead of enabling remote
kadmin access, the administrator could
securely connect to the KDC via the local
console or &man.ssh.1;, and perform administration locally
using kadmin -l.After installing /etc/krb5.conf,
use add --random-key in
kadmin. This adds the server's host
principal to the database, but does not extract a copy of the
host principal key to a keytab. To generate the keytab, use
ext to extract the server's host principal
key to its own keytab:&prompt.root; kadmin
kadmin> add --random-key host/myserver.example.org
Max ticket life [unlimited]:
Max renewable life [unlimited]:
Principal expiration time [never]:
Password expiration time [never]:
Attributes []:
kadmin> ext_keytab host/myserver.example.org
kadmin> exitNote that ext_keytab stores the
extracted key in /etc/krb5.keytab by
default. This is good when being run on the server being
kerberized, but the --keytab
path/to/file argument
should be used when the keytab is being extracted
elsewhere:&prompt.root; kadmin
kadmin> ext_keytab --keytab=/tmp/example.keytab host/myserver.example.org
kadmin> exitThe keytab can then be securely copied to the server
using &man.scp.1; or a removable media. Be sure to specify a
non-default keytab name to avoid inserting unneeded keys into
the system's keytab.At this point, the server can read encrypted messages from
the KDC using its shared key, stored in
krb5.keytab. It is now ready for the
Kerberos-using services to be
enabled. One of the most common such services is
&man.sshd.8;, which supports
Kerberos via the
GSS-API. In
/etc/ssh/sshd_config, add the
line:GSSAPIAuthentication yesAfter making this change, &man.sshd.8; must be restarted
for the new configuration to take effect:
service sshd restart.Configuring a Client to Use
KerberosKerberos5configure clientsAs it was for the server, the client requires
configuration in /etc/krb5.conf. Copy
the file in place (securely) or re-enter it as needed.Test the client by using kinit,
klist, and kdestroy from
the client to obtain, show, and then delete a ticket for an
existing principal. Kerberos
applications should also be able to connect to
Kerberos enabled servers. If that
does not work but obtaining a ticket does, the problem is
likely with the server and not with the client or the
KDC. In the case of kerberized
&man.ssh.1;, GSS-API is disabled by
default, so test using ssh -o
GSSAPIAuthentication=yes
hostname.When testing a Kerberized application, try using a packet
sniffer such as tcpdump to confirm that no
sensitive information is sent in the clear.Various Kerberos client
applications are available. With the advent of a bridge so
that applications using SASL for
authentication can use GSS-API mechanisms
as well, large classes of client applications can use
Kerberos for authentication, from
Jabber clients to IMAP clients..k5login.k5usersUsers within a realm typically have their
Kerberos principal mapped to a
local user account. Occasionally, one needs to grant access
to a local user account to someone who does not have a
matching Kerberos principal. For
example, tillman@EXAMPLE.ORG may need
access to the local user account webdevelopers. Other
principals may also need access to that local account.The .k5login and
.k5users files, placed in a user's home
directory, can be used to solve this problem. For example, if
the following .k5login is placed in the
home directory of webdevelopers, both principals
listed will have access to that account without requiring a
shared password:tillman@example.org
jdoe@example.orgRefer to &man.ksu.1; for more information about
.k5users.MIT DifferencesThe major difference between the MIT
and Heimdal implementations is that kadmin
has a different, but equivalent, set of commands and uses a
different protocol. If the KDC is
MIT, the Heimdal version of
kadmin cannot be used to administer the
KDC remotely, and vice versa.Client applications may also use slightly different
command line options to accomplish the same tasks. Following
the instructions at http://web.mit.edu/Kerberos/www/
is recommended. Be careful of path issues: the
MIT port installs into
/usr/local/ by default, and the &os;
system applications run instead of the
MIT versions if PATH lists
the system directories first.When using MIT Kerberos as a KDC on
&os;, the following edits should also be made to
rc.conf:kdc_program="/usr/local/sbin/kdc"
kadmind_program="/usr/local/sbin/kadmind"
kdc_flags=""
kdc_enable="YES"
kadmind_enable="YES"Kerberos Tips, Tricks, and
TroubleshootingWhen configuring and troubleshooting
Kerberos, keep the following points
in mind:When using either Heimdal or MIT
Kerberos from ports, ensure
that the PATH lists the port's versions of
the client applications before the system versions.If all the computers in the realm do not have
synchronized time settings, authentication may fail.
describes how to synchronize
clocks using NTP.If the hostname is changed, the host/ principal must be
changed and the keytab updated. This also applies to
special keytab entries like the HTTP/ principal used for
Apache's www/mod_auth_kerb.All hosts in the realm must be both forward and
reverse resolvable in DNS or, at a
minimum, exist in /etc/hosts. CNAMEs
will work, but the A and PTR records must be correct and
in place. The error message for unresolvable hosts is not
intuitive: Kerberos5 refuses authentication
because Read req failed: Key table entry not
found.Some operating systems that act as clients to the
KDC do not set the permissions for
ksu to be setuid root. This means that
ksu does not work. This is a
permissions problem, not a KDC
error.With MIT
Kerberos, to allow a principal
to have a ticket life longer than the default lifetime of
ten hours, use modify_principal at the
&man.kadmin.8; prompt to change the
maxlife of both the principal in
question and the
krbtgt
principal. The principal can then use
kinit -l to request a ticket with a
longer lifetime.When running a packet sniffer on the
KDC to aid in troubleshooting while
running kinit from a workstation, the
Ticket Granting Ticket (TGT) is sent
immediately, even before the password is typed. This is
because the Kerberos server
freely transmits a TGT to any
unauthorized request. However, every
TGT is encrypted in a key derived from
the user's password. When a user types their password, it
is not sent to the KDC, it is instead
used to decrypt the TGT that
kinit already obtained. If the
decryption process results in a valid ticket with a valid
time stamp, the user has valid
Kerberos credentials. These
credentials include a session key for establishing secure
communications with the
Kerberos server in the future,
as well as the actual TGT, which is
encrypted with the Kerberos
server's own key. This second layer of encryption allows
the Kerberos server to verify
the authenticity of each TGT.Host principals can have a longer ticket lifetime. If
the user principal has a lifetime of a week but the host
being connected to has a lifetime of nine hours, the user
cache will have an expired host principal and the ticket
cache will not work as expected.When setting up krb5.dict to
prevent specific bad passwords from being used as
described in &man.kadmind.8;, remember that it only
applies to principals that have a password policy assigned
to them. The format used in
krb5.dict is one string per line.
Creating a symbolic link to
/usr/share/dict/words might be
useful.Mitigating Kerberos
LimitationsKerberos5limitations and shortcomingsSince Kerberos is an all-or-nothing
approach, every service enabled on the network must
either be modified to work with
Kerberos or be otherwise secured
against network attacks. This is to prevent user credentials
from being stolen and re-used. An example is when
Kerberos is enabled on all remote
shells but the non-Kerberized POP3 mail
server sends passwords in plain text.The KDC is a single point of failure.
By design, the KDC must be as secure as its
master password database. The KDC should
have absolutely no other services running on it and should be
physically secure. The danger is high because
Kerberos stores all passwords
encrypted with the same master key which is stored as a file
on the KDC.A compromised master key is not quite as bad as one might
fear. The master key is only used to encrypt the
Kerberos database and as a seed for
the random number generator. As long as access to the
KDC is secure, an attacker cannot do much
with the master key.If the KDC is unavailable, network
services are unusable as authentication cannot be performed.
This can be alleviated with a single master
KDC and one or more slaves, and with
careful implementation of secondary or fall-back
authentication using PAM.Kerberos allows users, hosts
and services to authenticate between themselves. It does not
have a mechanism to authenticate the
KDC to the users, hosts, or services. This
means that a trojaned kinit could record
all user names and passwords. File system integrity checking
tools like security/tripwire can
alleviate this.Resources and Further InformationKerberos5external resources
The Kerberos
FAQDesigning
an Authentication System: a Dialog in Four
ScenesRFC
4120, The Kerberos Network
Authentication Service (V5)MIT
Kerberos home
pageHeimdal
Kerberos project wiki
pageOpenSSLTomRhodesWritten
by securityOpenSSLOpenSSL is an open source
implementation of the SSL and
TLS protocols. It provides an encryption
transport layer on top of the normal communications layer,
allowing it to be intertwined with many network applications and
services.The version of OpenSSL included
in &os; supports the Secure Sockets Layer 3.0 (SSLv3)
and Transport Layer Security 1.0/1.1/1.2 (TLSv1/TLSv1.1/TLSv1.2)
network security
protocols and can be used as a general cryptographic
library. In &os; 12.0-RELEASE and above, OpenSSL also supports
Transport Layer Security 1.3 (TLSv1.3).OpenSSL is often used to encrypt
authentication of mail clients and to secure web based
transactions such as credit card payments. Some ports, such as
www/apache24 and
databases/postgresql11-server, include a
compile option for building with
OpenSSL. If selected, the port will
add support using OpenSSL from the
base system. To instead have the port compile against
OpenSSL from the
security/openssl port, add the following to
/etc/make.conf:DEFAULT_VERSIONS+= ssl=opensslAnother common use of OpenSSL is
to provide certificates for use with software applications.
Certificates can be used to verify the credentials of a company
or individual. If a certificate has not been signed by an
external Certificate Authority
(CA), such as http://www.verisign.com,
the application that uses the certificate will produce a
warning. There is a cost associated with obtaining a signed
certificate and using a signed certificate is not mandatory as
certificates can be self-signed. However, using an external
authority will prevent warnings and can put users at
ease.This section demonstrates how to create and use certificates
on a &os; system. Refer to for an
example of how to create a CA for signing
one's own certificates.For more information about SSL, read the
free OpenSSL
Cookbook.Generating CertificatesOpenSSLcertificate generationTo generate a certificate that will be signed by an
external CA, issue the following command
and input the information requested at the prompts. This
input information will be written to the certificate. At the
Common Name prompt, input the fully
qualified name for the system that will use the certificate.
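The same request can also be produced non-interactively by passing the Distinguished Name on the command line with -subj, which is convenient in scripts. This is a sketch; the subject values are examples and should be replaced with real ones:

```shell
# Create a private key and certificate request without prompting;
# all subject fields below are example values.
openssl req -new -nodes -sha256 -newkey rsa:2048 \
    -keyout cert.key -out req.pem \
    -subj "/C=US/ST=PA/L=Pittsburgh/O=My Company/CN=localhost.example.org"

# Display the subject recorded in the request to verify the Common Name.
openssl req -in req.pem -noout -subject
```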
If this name does not match the server, the application
verifying the certificate will issue a warning to the user,
rendering the verification provided by the certificate as
useless.&prompt.root; openssl req -new -nodes -out req.pem -keyout cert.key -sha256 -newkey rsa:2048
Generating a 2048 bit RSA private key
..................+++
.............................................................+++
writing new private key to 'cert.key'
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:PA
Locality Name (eg, city) []:Pittsburgh
Organization Name (eg, company) [Internet Widgits Pty Ltd]:My Company
Organizational Unit Name (eg, section) []:Systems Administrator
Common Name (eg, YOUR name) []:localhost.example.org
Email Address []:trhodes@FreeBSD.org
Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:Another NameOther options, such as the expire time and alternate
encryption algorithms, are available when creating a
certificate. A complete list of options is described in
&man.openssl.1;.This command will create two files in the current
directory. The certificate request,
req.pem, can be sent to a
CA who will validate the entered
credentials, sign the request, and return the signed
certificate. The second file,
cert.key, is the private key for the
certificate and should be stored in a secure location. If
this falls in the hands of others, it can be used to
impersonate the user or the server.Alternately, if a signature from a CA
is not required, a self-signed certificate can be created.
First, generate the RSA key:&prompt.root; openssl genrsa -out cert.key 2048
Generating RSA private key, 2048 bit long modulus
.............................................+++
.................................................................................................................+++
e is 65537 (0x10001)Use this key to create a self-signed certificate.
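As with the certificate request, the prompts can be skipped by supplying -subj. A sketch of the whole self-signed flow as two non-interactive commands, with example subject values:

```shell
# Generate the key and a self-signed certificate without prompting;
# the subject fields are example values.
openssl genrsa -out cert.key 2048
openssl req -new -x509 -days 365 -sha256 -key cert.key -out cert.crt \
    -subj "/C=US/ST=PA/L=Pittsburgh/O=My Company/CN=localhost.example.org"

# Show the subject and validity period of the new certificate.
openssl x509 -in cert.crt -noout -subject -dates
```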
Follow the usual prompts for creating a certificate:&prompt.root; openssl req -new -x509 -days 365 -key cert.key -out cert.crt -sha256
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:PA
Locality Name (eg, city) []:Pittsburgh
Organization Name (eg, company) [Internet Widgits Pty Ltd]:My Company
Organizational Unit Name (eg, section) []:Systems Administrator
Common Name (e.g. server FQDN or YOUR name) []:localhost.example.org
Email Address []:trhodes@FreeBSD.orgThis will create two new files in the current directory: a
private key file
cert.key, and the certificate itself,
cert.crt. These should be placed in a
directory, preferably under /etc/ssl/,
which is readable only by root. Permissions of
0700 are appropriate for these files and
can be set using chmod.Using CertificatesOne use for a certificate is to encrypt connections to the
Sendmail mail server in order to
prevent the use of clear text authentication.Some mail clients will display an error if the user has
not installed a local copy of the certificate. Refer to the
documentation included with the software for more
information on certificate installation.In &os; 10.0-RELEASE and above, it is possible to create a
self-signed certificate for
Sendmail automatically. To enable
this, add the following lines to
/etc/rc.conf:sendmail_enable="YES"
sendmail_cert_create="YES"
sendmail_cert_cn="localhost.example.org"This will automatically create a self-signed certificate,
/etc/mail/certs/host.cert, a signing key,
/etc/mail/certs/host.key, and a
CA certificate,
/etc/mail/certs/cacert.pem. The
certificate will use the Common Name
specified in . After saving
the edits, restart Sendmail:&prompt.root; service sendmail restartIf all went well, there will be no error messages in
/var/log/maillog. For a simple test,
connect to the mail server's listening port using
telnet:&prompt.root; telnet example.com 25
Trying 192.0.34.166...
Connected to example.com.
Escape character is '^]'.
220 example.com ESMTP Sendmail 8.14.7/8.14.7; Fri, 18 Apr 2014 11:50:32 -0400 (EDT)
ehlo example.com
250-example.com Hello example.com [192.0.34.166], pleased to meet you
250-ENHANCEDSTATUSCODES
250-PIPELINING
250-8BITMIME
250-SIZE
250-DSN
250-ETRN
250-AUTH LOGIN PLAIN
250-STARTTLS
250-DELIVERBY
250 HELP
quit
221 2.0.0 example.com closing connection
Connection closed by foreign host.If the STARTTLS line appears in the
output, everything is working correctly.VPN over
IPsecNikClaytonnik@FreeBSD.orgWritten by Hiten M.Pandyahmp@FreeBSD.orgWritten by IPsecInternet Protocol Security (IPsec) is a
set of protocols which sit on top of the Internet Protocol
(IP) layer. It allows two or more hosts to
communicate in a secure manner by authenticating and encrypting
each IP packet of a communication session.
The &os; IPsec network stack is based on the
http://www.kame.net/
implementation and supports both IPv4 and
IPv6 sessions.IPsecESPIPsecAHIPsec is composed of the following
sub-protocols:Encapsulated Security Payload
(ESP): this protocol
protects the IP packet data from third
party interference by encrypting the contents using
symmetric cryptography algorithms such as Blowfish and
3DES.Authentication Header
(AH): this protocol
protects the IP packet header from third
party interference and spoofing by computing a cryptographic
checksum and hashing the IP packet
header fields with a secure hashing function. This is then
followed by an additional header that contains the hash, to
allow the information in the packet to be
authenticated.IP Payload Compression Protocol
(IPComp): this protocol
tries to increase communication performance by compressing
the IP payload in order to reduce the
amount of data sent.These protocols can either be used together or separately,
depending on the environment.VPNvirtual private networkVPNIPsec supports two modes of operation.
The first mode, Transport Mode, protects
communications between two hosts. The second mode,
Tunnel Mode, is used to build virtual
tunnels, commonly known as Virtual Private Networks
(VPNs). Consult &man.ipsec.4; for detailed
information on the IPsec subsystem in
&os;.IPsec support is enabled by default on
&os; 11 and later. For previous versions of &os;, add
these options to a custom kernel configuration file and rebuild
the kernel using the instructions in :kernel optionsIPSECoptions IPSEC #IP security
device cryptokernel optionsIPSEC_DEBUGIf IPsec debugging support is desired,
the following kernel option should also be added:options IPSEC_DEBUG #debug for IP securityThe rest of this chapter demonstrates the process of
setting up an IPsec VPN
between a home network and a corporate network. In the example
scenario:Both sites are connected to the Internet through a
gateway that is running &os;.The gateway on each network has at least one external
IP address. In this example, the
corporate LAN's external
IP address is 172.16.5.4 and the home
LAN's external IP
address is 192.168.1.12.The internal addresses of the two networks can be either
public or private IP addresses. However,
the address space must not collide. For example, both
networks cannot use 192.168.1.x. In this
example, the corporate LAN's internal
IP address is 10.246.38.1 and the home
LAN's internal IP
address is 10.0.0.5.Configuring a VPN on &os;TomRhodestrhodes@FreeBSD.orgWritten by To begin, security/ipsec-tools must be
installed from the Ports Collection. This software provides a
number of applications which support the configuration.The next requirement is to create two &man.gif.4;
pseudo-devices which will be used to tunnel packets and allow
both networks to communicate properly. As root, run the following
commands, replacing internal and
external with the real IP
addresses of the internal and external interfaces of the two
gateways:&prompt.root; ifconfig gif0 create
&prompt.root; ifconfig gif0 internal1 internal2
&prompt.root; ifconfig gif0 tunnel external1 external2Verify the setup on each gateway, using
ifconfig. Here is the output from Gateway
1:gif0: flags=8051 mtu 1280
tunnel inet 172.16.5.4 --> 192.168.1.12
inet6 fe80::2e0:81ff:fe02:5881%gif0 prefixlen 64 scopeid 0x6
inet 10.246.38.1 --> 10.0.0.5 netmask 0xffffff00Here is the output from Gateway 2:gif0: flags=8051 mtu 1280
tunnel inet 192.168.1.12 --> 172.16.5.4
inet 10.0.0.5 --> 10.246.38.1 netmask 0xffffff00
inet6 fe80::250:bfff:fe3a:c1f%gif0 prefixlen 64 scopeid 0x4Once complete, both internal IP
addresses should be reachable using &man.ping.8;:priv-net&prompt.root; ping 10.0.0.5
PING 10.0.0.5 (10.0.0.5): 56 data bytes
64 bytes from 10.0.0.5: icmp_seq=0 ttl=64 time=42.786 ms
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=19.255 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=20.440 ms
64 bytes from 10.0.0.5: icmp_seq=3 ttl=64 time=21.036 ms
--- 10.0.0.5 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/stddev = 19.255/25.879/42.786/9.782 mscorp-net&prompt.root; ping 10.246.38.1
PING 10.246.38.1 (10.246.38.1): 56 data bytes
64 bytes from 10.246.38.1: icmp_seq=0 ttl=64 time=28.106 ms
64 bytes from 10.246.38.1: icmp_seq=1 ttl=64 time=42.917 ms
64 bytes from 10.246.38.1: icmp_seq=2 ttl=64 time=127.525 ms
64 bytes from 10.246.38.1: icmp_seq=3 ttl=64 time=119.896 ms
64 bytes from 10.246.38.1: icmp_seq=4 ttl=64 time=154.524 ms
--- 10.246.38.1 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max/stddev = 28.106/94.594/154.524/49.814 msAs expected, both sides have the ability to send and
receive ICMP packets from the privately
configured addresses. Next, both gateways must be told how to
route packets in order to correctly send traffic from either
network. The following commands will achieve this
goal:corp-net&prompt.root; route add 10.0.0.0 10.0.0.5 255.255.255.0
add net 10.0.0.0: gateway 10.0.0.5priv-net&prompt.root; route add 10.246.38.0 10.246.38.1 255.255.255.0
add net 10.246.38.0: gateway 10.246.38.1At this point, internal machines should be reachable from
each gateway as well as from machines behind the gateways.
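The route commands above do not persist across a reboot. Assuming persistence is wanted, equivalent static routes could be added to rc.conf on each gateway; this sketch shows the corporate gateway's side:

```
# /etc/rc.conf on the corporate gateway (illustrative):
static_routes="vpn"
route_vpn="-net 10.0.0.0/24 10.0.0.5"
```

The home gateway would carry the mirror-image entry, routing 10.246.38.0/24 via 10.246.38.1.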
Again, use &man.ping.8; to confirm:corp-net&prompt.root; ping 10.0.0.8
PING 10.0.0.8 (10.0.0.8): 56 data bytes
64 bytes from 10.0.0.8: icmp_seq=0 ttl=63 time=92.391 ms
64 bytes from 10.0.0.8: icmp_seq=1 ttl=63 time=21.870 ms
64 bytes from 10.0.0.8: icmp_seq=2 ttl=63 time=198.022 ms
64 bytes from 10.0.0.8: icmp_seq=3 ttl=63 time=22.241 ms
64 bytes from 10.0.0.8: icmp_seq=4 ttl=63 time=174.705 ms
--- 10.0.0.8 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max/stddev = 21.870/101.846/198.022/74.001 mspriv-net&prompt.root; ping 10.246.38.107
PING 10.246.38.107 (10.246.38.107): 56 data bytes
64 bytes from 10.246.38.107: icmp_seq=0 ttl=64 time=53.491 ms
64 bytes from 10.246.38.107: icmp_seq=1 ttl=64 time=23.395 ms
64 bytes from 10.246.38.107: icmp_seq=2 ttl=64 time=23.865 ms
64 bytes from 10.246.38.107: icmp_seq=3 ttl=64 time=21.145 ms
64 bytes from 10.246.38.107: icmp_seq=4 ttl=64 time=36.708 ms
--- 10.246.38.107 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max/stddev = 21.145/31.721/53.491/12.179 msSetting up the tunnels is the easy part. Configuring a
secure link is a more in-depth process. The following
configuration uses pre-shared (PSK)
RSA keys. Other than the
IP addresses, the
/usr/local/etc/racoon/racoon.conf on both
gateways will be identical and look similar to:path pre_shared_key "/usr/local/etc/racoon/psk.txt"; #location of pre-shared key file
log debug; #log verbosity setting: set to 'notify' when testing and debugging is complete
padding # options are not to be changed
{
maximum_length 20;
randomize off;
strict_check off;
exclusive_tail off;
}
timer # timing options. change as needed
{
counter 5;
interval 20 sec;
persend 1;
# natt_keepalive 15 sec;
phase1 30 sec;
phase2 15 sec;
}
listen # address [port] that racoon will listen on
{
isakmp 172.16.5.4 [500];
isakmp_natt 172.16.5.4 [4500];
}
remote 192.168.1.12 [500]
{
exchange_mode main,aggressive;
doi ipsec_doi;
situation identity_only;
my_identifier address 172.16.5.4;
peers_identifier address 192.168.1.12;
lifetime time 8 hour;
passive off;
proposal_check obey;
# nat_traversal off;
generate_policy off;
proposal {
encryption_algorithm blowfish;
hash_algorithm md5;
authentication_method pre_shared_key;
lifetime time 30 sec;
dh_group 1;
}
}
sainfo (address 10.246.38.0/24 any address 10.0.0.0/24 any) # address $network/$netmask $type address $network/$netmask $type ( $type being any or esp)
{ # $network must be the two internal networks you are joining.
pfs_group 1;
lifetime time 36000 sec;
encryption_algorithm blowfish,3des;
authentication_algorithm hmac_md5,hmac_sha1;
compression_algorithm deflate;
}For descriptions of each available option, refer to the
manual page for racoon.conf.The Security Policy Database (SPD)
needs to be configured so that &os; and
racoon are able to encrypt and
decrypt network traffic between the hosts.This can be achieved with a shell script, similar to the
following, on the corporate gateway. This file will be used
during system initialization and should be saved as
/usr/local/etc/racoon/setkey.conf.flush;
spdflush;
# To the home network
spdadd 10.246.38.0/24 10.0.0.0/24 any -P out ipsec esp/tunnel/172.16.5.4-192.168.1.12/use;
spdadd 10.0.0.0/24 10.246.38.0/24 any -P in ipsec esp/tunnel/192.168.1.12-172.16.5.4/use;Once in place, racoon may be
started on both gateways using the following command:&prompt.root; /usr/local/sbin/racoon -F -f /usr/local/etc/racoon/racoon.conf -l /var/log/racoon.logThe output should be similar to the following:corp-net&prompt.root; /usr/local/sbin/racoon -F -f /usr/local/etc/racoon/racoon.conf
Foreground mode.
2006-01-30 01:35:47: INFO: begin Identity Protection mode.
2006-01-30 01:35:48: INFO: received Vendor ID: KAME/racoon
2006-01-30 01:35:55: INFO: received Vendor ID: KAME/racoon
2006-01-30 01:36:04: INFO: ISAKMP-SA established 172.16.5.4[500]-192.168.1.12[500] spi:623b9b3bd2492452:7deab82d54ff704a
2006-01-30 01:36:05: INFO: initiate new phase 2 negotiation: 172.16.5.4[0]192.168.1.12[0]
2006-01-30 01:36:09: INFO: IPsec-SA established: ESP/Tunnel 192.168.1.12[0]->172.16.5.4[0] spi=28496098(0x1b2d0e2)
2006-01-30 01:36:09: INFO: IPsec-SA established: ESP/Tunnel 172.16.5.4[0]->192.168.1.12[0] spi=47784998(0x2d92426)
2006-01-30 01:36:13: INFO: respond new phase 2 negotiation: 172.16.5.4[0]192.168.1.12[0]
2006-01-30 01:36:18: INFO: IPsec-SA established: ESP/Tunnel 192.168.1.12[0]->172.16.5.4[0] spi=124397467(0x76a279b)
2006-01-30 01:36:18: INFO: IPsec-SA established: ESP/Tunnel 172.16.5.4[0]->192.168.1.12[0] spi=175852902(0xa7b4d66)To ensure the tunnel is working properly, switch to
another console and use &man.tcpdump.1; to view network
traffic using the following command. Replace
em0 with the network interface card as
required:&prompt.root; tcpdump -i em0 host 172.16.5.4 and dst 192.168.1.12Data similar to the following should appear on the
console. If not, there is an issue and debugging the
returned data will be required.01:47:32.021683 IP corporatenetwork.com > 192.168.1.12.privatenetwork.com: ESP(spi=0x02acbf9f,seq=0xa)
01:47:33.022442 IP corporatenetwork.com > 192.168.1.12.privatenetwork.com: ESP(spi=0x02acbf9f,seq=0xb)
01:47:34.024218 IP corporatenetwork.com > 192.168.1.12.privatenetwork.com: ESP(spi=0x02acbf9f,seq=0xc)At this point, both networks should be available and seem
to be part of the same network. Most likely both networks are
protected by a firewall. To allow traffic to flow between
them, rules need to be added to pass packets. For the
&man.ipfw.8; firewall, add the following lines to the firewall
configuration file:ipfw add 00201 allow log esp from any to any
ipfw add 00202 allow log ah from any to any
ipfw add 00203 allow log ipencap from any to any
ipfw add 00204 allow log udp from any 500 to anyThe rule numbers may need to be altered depending on the
current host configuration.For users of &man.pf.4; or &man.ipf.8;, the following
rules should do the trick:pass in quick proto esp from any to any
pass in quick proto ah from any to any
pass in quick proto ipencap from any to any
pass in quick proto udp from any port = 500 to any port = 500
pass in quick on gif0 from any to any
pass out quick proto esp from any to any
pass out quick proto ah from any to any
pass out quick proto ipencap from any to any
pass out quick proto udp from any port = 500 to any port = 500
pass out quick on gif0 from any to anyFinally, to allow the machine to start support for the
VPN during system initialization, add the
following lines to /etc/rc.conf:ipsec_enable="YES"
ipsec_program="/usr/local/sbin/setkey"
ipsec_file="/usr/local/etc/racoon/setkey.conf" # allows setting up spd policies on boot
racoon_enable="yes"OpenSSHChernLeeContributed
by OpenSSHsecurityOpenSSHOpenSSH is a set of network
connectivity tools used to provide secure access to remote
machines. Additionally, TCP/IP connections
can be tunneled or forwarded securely through
SSH connections.
OpenSSH encrypts all traffic to
effectively eliminate eavesdropping, connection hijacking, and
other network-level attacks.OpenSSH is maintained by the
OpenBSD project and is installed by default in &os;. It is
compatible with both SSH version 1 and 2
protocols.When data is sent over the network in an unencrypted form,
network sniffers anywhere in between the client and server can
steal user/password information or data transferred during the
session. OpenSSH offers a variety of
authentication and encryption methods to prevent this from
happening. More information about
OpenSSH is available from http://www.openssh.com/.This section provides an overview of the built-in client
utilities to securely access other systems and securely transfer
files from a &os; system. It then describes how to configure a
SSH server on a &os; system. More
information is available in the man pages mentioned in this
chapter.Using the SSH Client UtilitiesOpenSSHclientTo log into a SSH server, use
ssh and specify a username that exists on
that server and the IP address or hostname
of the server. If this is the first time a connection has
been made to the specified server, the user will be prompted
to first verify the server's fingerprint:&prompt.root; ssh user@example.com
The authenticity of host 'example.com (10.0.0.1)' can't be established.
ECDSA key fingerprint is 25:cc:73:b5:b3:96:75:3d:56:19:49:d2:5c:1f:91:3b.
Are you sure you want to continue connecting (yes/no)? yes
Permanently added 'example.com' (ECDSA) to the list of known hosts.
Password for user@example.com: user_passwordSSH utilizes a key fingerprint system
to verify the authenticity of the server when the client
connects. When the user accepts the key's fingerprint by
typing yes when connecting for the first
time, a copy of the key is saved to
.ssh/known_hosts in the user's home
directory. Future attempts to login are verified against the
saved key and ssh will display an alert if
the server's key does not match the saved key. If this
occurs, the user should first verify why the key has changed
before continuing with the connection.By default, recent versions of
OpenSSH only accept
SSHv2 connections. The client
will use version 2 if possible and will fall back to version 1
if the server does not support version 2. To force
ssh to only use the specified protocol,
include -1 or -2.
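Rather than typing protocol options on every invocation, client defaults can be kept per host in ~/.ssh/config. A minimal sketch, assuming the example.com host used above; the Protocol keyword applies to older OpenSSH releases that still support both protocol versions, and &man.ssh.config.5; lists all available keywords:

```
# ~/.ssh/config (illustrative)
Host example.com
    User user
    Protocol 2
```

With this in place, ssh example.com logs in as user and negotiates only version 2.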
Additional options are described in &man.ssh.1;.OpenSSHsecure copy&man.scp.1;Use &man.scp.1; to securely copy a file to or from a
remote machine. This example copies
COPYRIGHT on the remote system to a file
of the same name in the current directory of the local
system:&prompt.root; scp user@example.com:/COPYRIGHT COPYRIGHT
Password for user@example.com: *******
COPYRIGHT 100% |*****************************| 4735
00:00
&prompt.root;Since the fingerprint was already verified for this host,
the server's key is automatically checked before prompting for
the user's password.The arguments passed to scp are similar
to cp. The file or files to copy is the
first argument and the destination to copy to is the second.
Since the file is fetched over the network, one or more of the
file arguments takes the form
user@host:remote_path. Be
aware when copying directories recursively that
scp uses -r, whereas
cp uses -R.To open an interactive session for copying files, use
sftp. Refer to &man.sftp.1; for a list of
available commands while in an sftp
session.Key-based AuthenticationInstead of using passwords, a client can be configured
to connect to the remote machine using keys. To generate
RSA
authentication keys, use ssh-keygen. To
generate a public and private key pair, specify the type of
key and follow the prompts. It is recommended to protect
the keys with a memorable, but hard to guess
passphrase.&prompt.user; ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:54Xm9Uvtv6H4NOo6yjP/YCfODryvUU7yWHzMqeXwhq8 user@host.example.com
The key's randomart image is:
+---[RSA 2048]----+
| |
| |
| |
| . o.. |
| .S*+*o |
| . O=Oo . . |
| = Oo= oo..|
| .oB.* +.oo.|
| =OE**.o..=|
+----[SHA256]-----+Type a passphrase here. It can contain spaces and
symbols.Retype the passphrase to verify it.The private key
is stored in ~/.ssh/id_rsa
and the public key
is stored in ~/.ssh/id_rsa.pub.
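For reference, authorized_keys uses the same single-line format as id_rsa.pub, one key per line. The entry below is purely illustrative, with the key material truncated:

```
# type, base64 key material, comment -- one key per line (illustrative)
ssh-rsa AAAAB3NzaC1yc2EAAA...truncated... user@host.example.com
```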
The
public key must be copied to
~/.ssh/authorized_keys on the remote
machine for key-based authentication to
work.Many users believe that keys are secure by design and
will use a key without a passphrase. This is
dangerous behavior. An
administrator can verify that a key pair is protected by a
passphrase by viewing the private key manually. If the
private key file contains the word
ENCRYPTED, the key owner is using a
passphrase. In addition, to better secure end users,
from may be placed in the public key
file. For example, adding
from="192.168.10.5" in front of the
ssh-rsa
prefix will only allow that specific user to log in from
that IP address.The options and files vary with different versions of
OpenSSH.
To avoid problems, consult &man.ssh-keygen.1;.If a passphrase is used, the user is prompted for
the passphrase each time a connection is made to the server.
To load SSH keys into memory and remove
the need to type the passphrase each time, use
&man.ssh-agent.1; and &man.ssh-add.1;.Authentication is handled by
ssh-agent, using the private keys that
are loaded into it. ssh-agent
can be used to launch another application like a
shell or a window manager.To use ssh-agent in a shell, start it
with a shell as an argument. Add the identity by
running ssh-add and entering the
passphrase for the private key.
The user will then be able to ssh
to any host that has the corresponding public key installed.
For example:&prompt.user; ssh-agent csh
&prompt.user; ssh-add
Enter passphrase for key '/usr/home/user/.ssh/id_rsa':
Identity added: /usr/home/user/.ssh/id_rsa (/usr/home/user/.ssh/id_rsa)
&prompt.user;Enter the passphrase for the key.To use ssh-agent in
&xorg;, add an entry for it in
~/.xinitrc. This provides the
ssh-agent services to all programs
launched in &xorg;. An example
~/.xinitrc might look like this:exec ssh-agent startxfce4This launches ssh-agent, which in
turn launches XFCE, every time
&xorg; starts. Once
&xorg; has been restarted so that
the changes can take effect, run ssh-add
to load all of the SSH keys.SSH TunnelingOpenSSHtunnelingOpenSSH has the ability to
create a tunnel to encapsulate another protocol in an
encrypted session.The following command tells ssh to
create a tunnel for
telnet:&prompt.user; ssh -2 -N -f -L 5023:localhost:23 user@foo.example.com
&prompt.user;This example uses the following options:-2: Forces ssh to use version 2 to
connect to the server.-N: Indicates no command, or tunnel only. If omitted,
ssh initiates a normal
session.-f: Forces ssh to run in the
background.-L: Indicates a local tunnel in
localport:remotehost:remoteport
format.user@foo.example.com: The login name to use on the specified remote
SSH server.An SSH tunnel works by creating a
listen socket on localhost on the
specified localport. It then forwards
any connections received on localport via
the SSH connection to the specified
remotehost:remoteport. In the example,
port 5023 on the client is forwarded to
port 23 on the remote machine. Since
port 23 is used by telnet, this
creates an encrypted telnet
session through an SSH tunnel.This method can be used to wrap any number of insecure
TCP protocols such as
SMTP, POP3, and
FTP, as seen in the following
examples.Create a Secure Tunnel for
SMTP&prompt.user; ssh -2 -N -f -L 5025:localhost:25 user@mailserver.example.com
user@mailserver.example.com's password: *****
&prompt.user; telnet localhost 5025
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 mailserver.example.com ESMTPThis can be used in conjunction with
ssh-keygen and additional user accounts
to create a more seamless SSH tunneling
environment. Keys can be used in place of typing a
password, and the tunnels can be run as a separate
user.Secure Access of a POP3
ServerIn this example, there is an SSH
server that accepts connections from the outside. On the
same network resides a mail server running a
POP3 server. To check email in a
secure manner, create an SSH connection
to the SSH server and tunnel through to
the mail server:&prompt.user; ssh -2 -N -f -L 2110:mail.example.com:110 user@ssh-server.example.com
user@ssh-server.example.com's password: ******Once the tunnel is up and running, point the email
client to send POP3 requests to
localhost on port 2110. This
connection will be forwarded securely across the tunnel to
mail.example.com.Bypassing a FirewallSome firewalls
filter both incoming and outgoing connections. For
example, a firewall might limit access from remote
machines to ports 22 and 80 to only allow
SSH and web surfing. This prevents
access to any other service which uses a port other than
22 or 80.The solution is to create an SSH
connection to a machine outside of the network's firewall
and use it to tunnel to the desired service:&prompt.user; ssh -2 -N -f -L 8888:music.example.com:8000 user@unfirewalled-system.example.org
user@unfirewalled-system.example.org's password: *******In this example, a streaming Ogg Vorbis client can now
be pointed to localhost port
8888, which will be forwarded over to
music.example.com on port 8000,
successfully bypassing the firewall.Enabling the SSH ServerOpenSSHenablingIn addition to providing built-in SSH
client utilities, a &os; system can be configured as an
SSH server, accepting connections from
other SSH clients.To see if sshd is operating,
use the &man.service.8; command:&prompt.root; service sshd statusIf the service is not running, add the following line to
/etc/rc.conf.sshd_enable="YES"This will start sshd, the
daemon program for OpenSSH, the
next time the system boots. To start it now:&prompt.root; service sshd startThe first time sshd starts on a
&os; system, the system's host keys will be automatically
created and the fingerprint will be displayed on the console.
Provide users with the fingerprint so that they can verify it
the first time they connect to the server.Refer to &man.sshd.8; for the list of available options
when starting sshd and a more
complete discussion about authentication, the login process,
and the various configuration files.At this point, the sshd should
be available to all users with a username and password on
the system.SSH Server SecurityWhile sshd is the most widely
used remote administration facility for &os;, brute force
and drive by attacks are common to any system exposed to
public networks. Several additional parameters are available
to prevent the success of these attacks and will be described
in this section.It is a good idea to limit which users can log into the
SSH server and from where using the
AllowUsers keyword in the
OpenSSH server configuration file.
For example, to only allow root to log in from
192.168.1.32, add
this line to /etc/ssh/sshd_config:AllowUsers root@192.168.1.32To allow admin
to log in from anywhere, list that user without specifying an
IP address:AllowUsers adminMultiple users should be listed on the same line, like
so:AllowUsers root@192.168.1.32 adminAfter making changes to
/etc/ssh/sshd_config,
tell sshd to reload its
configuration file by running:&prompt.root; service sshd reloadWhen this keyword is used, it is important to list each
user that needs to log into this machine. Any user that is
not specified in that line will be locked out. Also, the
keywords used in the OpenSSH
server configuration file are case-sensitive. If the
keyword is not spelled correctly, including its case, it
will be ignored. Always test changes to this file to make
sure that the edits are working as expected. Refer to
&man.sshd.config.5; to verify the spelling and use of the
available keywords.In addition, users may be forced to use two-factor
authentication via the use of a public and private key. When
required, the user may generate a key pair through the use
of &man.ssh-keygen.1; and send the administrator the public
key. This key file will be placed in the
authorized_keys as described above in
the client section. To force the users to use keys only,
the following option may be configured:AuthenticationMethods publickeyDo not confuse /etc/ssh/sshd_config
with /etc/ssh/ssh_config (note the
extra d in the first filename). The
first file configures the server and the second file
configures the client. Refer to &man.ssh.config.5; for a
listing of the available client settings.Access Control ListsTomRhodesContributed
by ACLAccess Control Lists (ACLs) extend the
standard &unix; permission model in a &posix;.1e compatible way.
This permits an administrator to take advantage of a more
fine-grained permissions model.The &os; GENERIC kernel provides
ACL support for UFS file
systems. Users who prefer to compile a custom kernel must
include the following option in their custom kernel
configuration file:options UFS_ACLIf this option is not compiled in, a warning message will be
displayed when attempting to mount a file system with
ACL support. ACLs rely on
extended attributes which are natively supported in
UFS2.This chapter describes how to enable
ACL support and provides some usage
examples.Enabling ACL SupportACLs are enabled by the mount-time
administrative flag, acls, which may be added
to /etc/fstab. The mount-time flag can
also be automatically set in a persistent manner using
&man.tunefs.8; to modify a superblock ACLs
flag in the file system header. In general, it is preferred
to use the superblock flag for several reasons:The superblock flag cannot be changed by a remount
using as it requires a complete
umount and fresh
mount. This means that
ACLs cannot be enabled on the root file
system after boot. It also means that
ACL support on a file system cannot be
changed while the system is in use.Setting the superblock flag causes the file system to
always be mounted with ACLs enabled,
even if there is not an fstab entry
or if the devices re-order. This prevents accidental
mounting of the file system without ACL
support.It is desirable to discourage accidental mounting
without ACLs enabled because nasty things
can happen if ACLs are enabled, then
disabled, then re-enabled without flushing the extended
attributes. In general, once ACLs are
enabled on a file system, they should not be disabled, as
the resulting file protections may not be compatible with
those intended by the users of the system, and re-enabling
ACLs may re-attach the previous
ACLs to files that have since had their
permissions changed, resulting in unpredictable
behavior.File systems with ACLs enabled will
show a plus (+) sign in their permission
settings:drwx------ 2 robert robert 512 Dec 27 11:54 private
drwxrwx---+ 2 robert robert 512 Dec 23 10:57 directory1
drwxrwx---+ 2 robert robert 512 Dec 22 10:20 directory2
drwxrwx---+ 2 robert robert 512 Dec 27 11:57 directory3
drwxr-xr-x 2 robert robert 512 Nov 10 11:54 public_htmlIn this example, directory1,
directory2, and
directory3 are all taking advantage of
ACLs, whereas private
and public_html are not.Using ACLsFile system ACLs can be viewed using
getfacl. For instance, to view the
ACL settings on
test:&prompt.user; getfacl test
#file:test
#owner:1001
#group:1001
user::rw-
group::r--
other::r--To change the ACL settings on this
file, use setfacl. To remove all of the
currently defined ACLs from a file or file
system, include -b. However, the preferred
method is to use -k as it leaves the basic
fields required for ACLs to work.&prompt.user; setfacl -k testTo modify the default ACL entries, use
-m:&prompt.user; setfacl -m u:trhodes:rwx,group:web:r--,o::--- testIn this example, there were no pre-defined entries, as
they were removed by the previous command. This command
restores the default options and assigns the options listed.
If a user or group is added which does not exist on the
system, an Invalid argument error will
be displayed.Refer to &man.getfacl.1; and &man.setfacl.1; for more
information about the options available for these
commands.Monitoring Third Party Security IssuesTomRhodesContributed
by pkgIn recent years, the security world has made many
improvements to how vulnerability assessment is handled. The
threat of system intrusion increases as third party utilities
are installed and configured for virtually any operating
system available today.Vulnerability assessment is a key factor in security.
While &os; releases advisories for the base system, doing so
for every third party utility is beyond the &os; Project's
capability. There is a way to mitigate third party
vulnerabilities and warn administrators of known security
issues. A &os; add on utility known as
pkg includes options explicitly for
this purpose.pkg polls a database for security
issues. The database is updated and maintained by the &os;
Security Team and ports developers.Please refer to instructions
for installing
pkg.Installation provides &man.periodic.8; configuration files
for maintaining the pkg audit
database, and provides a programmatic method of keeping it
updated. This functionality is enabled if
daily_status_security_pkgaudit_enable
is set to YES in &man.periodic.conf.5;.
Ensure that daily security run emails, which are sent to
root's email account,
are being read.After installation, and to audit third party utilities as
part of the Ports Collection at any time, an administrator may
choose to update the database and view known vulnerabilities
of installed packages by invoking:&prompt.root; pkg audit -Fpkg displays messages
for any published vulnerabilities in installed packages:Affected package: cups-base-1.1.22.0_1
Type of problem: cups-base -- HPGL buffer overflow vulnerability.
Reference: <https://www.FreeBSD.org/ports/portaudit/40a3bca2-6809-11d9-a9e7-0001020eed82.html>
1 problem(s) in your installed packages found.
You are advised to update or deinstall the affected package(s) immediately.By pointing a web browser to the displayed
URL, an administrator may obtain more
information about the vulnerability. This will include the
versions affected, by &os; port version, along with other web
sites which may contain security advisories.pkg is a powerful utility
and is extremely useful when coupled with
ports-mgmt/portmaster.&os; Security AdvisoriesTomRhodesContributed
by &os; Security AdvisoriesLike many producers of quality operating systems, the &os;
Project has a security team which is responsible for
determining the End-of-Life (EoL) date for
each &os; release and to provide security updates for supported
releases which have not yet reached their
EoL. More information about the &os;
security team and the supported releases is available on the
&os; security
page.One task of the security team is to respond to reported
security vulnerabilities in the &os; operating system. Once a
vulnerability is confirmed, the security team verifies the steps
necessary to fix the vulnerability and updates the source code
with the fix. It then publishes the details as a
Security Advisory. Security
advisories are published on the &os;
website and mailed to the
&a.security-notifications.name;, &a.security.name;, and
&a.announce.name; mailing lists.This section describes the format of a &os; security
advisory.Format of a Security AdvisoryHere is an example of a &os; security advisory:=============================================================================
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
=============================================================================
FreeBSD-SA-14:04.bind Security Advisory
The FreeBSD Project
Topic: BIND remote denial of service vulnerability
Category: contrib
Module: bind
Announced: 2014-01-14
Credits: ISC
Affects: FreeBSD 8.x and FreeBSD 9.x
Corrected: 2014-01-14 19:38:37 UTC (stable/9, 9.2-STABLE)
2014-01-14 19:42:28 UTC (releng/9.2, 9.2-RELEASE-p3)
2014-01-14 19:42:28 UTC (releng/9.1, 9.1-RELEASE-p10)
2014-01-14 19:38:37 UTC (stable/8, 8.4-STABLE)
2014-01-14 19:42:28 UTC (releng/8.4, 8.4-RELEASE-p7)
2014-01-14 19:42:28 UTC (releng/8.3, 8.3-RELEASE-p14)
CVE Name: CVE-2014-0591
For general information regarding FreeBSD Security Advisories,
including descriptions of the fields above, security branches, and the
following sections, please visit <URL:http://security.FreeBSD.org/>.
I. Background
BIND 9 is an implementation of the Domain Name System (DNS) protocols.
The named(8) daemon is an Internet Domain Name Server.
II. Problem Description
Because of a defect in handling queries for NSEC3-signed zones, BIND can
crash with an "INSIST" failure in name.c when processing queries possessing
certain properties. This issue only affects authoritative nameservers with
at least one NSEC3-signed zone. Recursive-only servers are not at risk.
III. Impact
An attacker who can send a specially crafted query could cause named(8)
to crash, resulting in a denial of service.
IV. Workaround
No workaround is available, but systems not running authoritative DNS service
with at least one NSEC3-signed zone using named(8) are not vulnerable.
V. Solution
Perform one of the following:
1) Upgrade your vulnerable system to a supported FreeBSD stable or
release / security branch (releng) dated after the correction date.
2) To update your vulnerable system via a source code patch:
The following patches have been verified to apply to the applicable
FreeBSD release branches.
a) Download the relevant patch from the location below, and verify the
detached PGP signature using your PGP utility.
[FreeBSD 8.3, 8.4, 9.1, 9.2-RELEASE and 8.4-STABLE]
# fetch http://security.FreeBSD.org/patches/SA-14:04/bind-release.patch
# fetch http://security.FreeBSD.org/patches/SA-14:04/bind-release.patch.asc
# gpg --verify bind-release.patch.asc
[FreeBSD 9.2-STABLE]
# fetch http://security.FreeBSD.org/patches/SA-14:04/bind-stable-9.patch
# fetch http://security.FreeBSD.org/patches/SA-14:04/bind-stable-9.patch.asc
# gpg --verify bind-stable-9.patch.asc
b) Execute the following commands as root:
# cd /usr/src
# patch < /path/to/patch
Recompile the operating system using buildworld and installworld as
described in <URL:https://www.FreeBSD.org/handbook/makeworld.html>.
Restart the applicable daemons, or reboot the system.
3) To update your vulnerable system via a binary patch:
Systems running a RELEASE version of FreeBSD on the i386 or amd64
platforms can be updated via the freebsd-update(8) utility:
# freebsd-update fetch
# freebsd-update install
VI. Correction details
The following list contains the correction revision numbers for each
affected branch.
Branch/path Revision
- -------------------------------------------------------------------------
stable/8/ r260646
releng/8.3/ r260647
releng/8.4/ r260647
stable/9/ r260646
releng/9.1/ r260647
releng/9.2/ r260647
- -------------------------------------------------------------------------
To see which files were modified by a particular revision, run the
following command, replacing NNNNNN with the revision number, on a
machine with Subversion installed:
# svn diff -cNNNNNN --summarize svn://svn.freebsd.org/base
Or visit the following URL, replacing NNNNNN with the revision number:
<URL:https://svnweb.freebsd.org/base?view=revision&revision=NNNNNN>
VII. References
<URL:https://kb.isc.org/article/AA-01078>
<URL:http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-0591>
The latest revision of this advisory is available at
<URL:http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc>
-----BEGIN PGP SIGNATURE-----
iQIcBAEBCgAGBQJS1ZTYAAoJEO1n7NZdz2rnOvQP/2/68/s9Cu35PmqNtSZVVxVG
ZSQP5EGWx/lramNf9566iKxOrLRMq/h3XWcC4goVd+gZFrvITJSVOWSa7ntDQ7TO
XcinfRZ/iyiJbs/Rg2wLHc/t5oVSyeouyccqODYFbOwOlk35JjOTMUG1YcX+Zasg
ax8RV+7Zt1QSBkMlOz/myBLXUjlTZ3Xg2FXVsfFQW5/g2CjuHpRSFx1bVNX6ysoG
9DT58EQcYxIS8WfkHRbbXKh9I1nSfZ7/Hky/kTafRdRMrjAgbqFgHkYTYsBZeav5
fYWKGQRJulYfeZQ90yMTvlpF42DjCC3uJYamJnwDIu8OhS1WRBI8fQfr9DRzmRua
OK3BK9hUiScDZOJB6OqeVzUTfe7MAA4/UwrDtTYQ+PqAenv1PK8DZqwXyxA9ThHb
zKO3OwuKOVHJnKvpOcr+eNwo7jbnHlis0oBksj/mrq2P9m2ueF9gzCiq5Ri5Syag
Wssb1HUoMGwqU0roS8+pRpNC8YgsWpsttvUWSZ8u6Vj/FLeHpiV3mYXPVMaKRhVm
067BA2uj4Th1JKtGleox+Em0R7OFbCc/9aWC67wiqI6KRyit9pYiF3npph+7D5Eq
7zPsUdDd+qc+UTiLp3liCRp5w6484wWdhZO6wRtmUgxGjNkxFoNnX8CitzF8AaqO
UWWemqWuz3lAZuORQ9KX
=OQzQ
-----END PGP SIGNATURE-----Every security advisory uses the following format:Each security advisory is signed by the
PGP key of the Security Officer. The
public key for the Security Officer can be verified at
.The name of the security advisory always begins with
FreeBSD-SA- (for FreeBSD Security
Advisory), followed by the year in two digit format
(14:), followed by the advisory number
for that year (04.), followed by the
name of the affected application or subsystem
(bind). The advisory shown here is the
fourth advisory for 2014 and it affects
BIND.The Topic field summarizes the
vulnerability.The Category refers to the
affected part of the system which may be one of
core, contrib, or
ports. The core
category means that the vulnerability affects a core
component of the &os; operating system. The
contrib category means that the
vulnerability affects software included with &os;,
such as BIND. The
ports category indicates that the
vulnerability affects software available through the Ports
Collection.The Module field refers to the
component location. In this example, the
bind module is affected; therefore,
this vulnerability affects an application installed with
the operating system.The Announced field reflects the
date the security advisory was published. This means
that the security team has verified that the problem
exists and that a patch has been committed to the &os;
source code repository.The Credits field gives credit to
the individual or organization who noticed the
vulnerability and reported it.The Affects field explains which
releases of &os; are affected by this
vulnerability.The Corrected field indicates the
date, time, time offset, and releases that were
corrected. The section in parentheses shows each branch
for which the fix has been merged, and the version number
of the corresponding release from that branch. The
release identifier itself includes the version number
and, if appropriate, the patch level. The patch level is
the letter p followed by a number,
indicating the sequence number of the patch, allowing
users to track which patches have already been applied to
the system.The CVE Name field lists the
advisory number, if one exists, in the public cve.mitre.org
security vulnerabilities database.The Background field provides a
description of the affected module.The Problem Description field
explains the vulnerability. This can include
information about the flawed code and how the utility
could be maliciously used.The Impact field describes what
type of impact the problem could have on a system.The Workaround field indicates if
a workaround is available to system administrators who
cannot immediately patch the system.The Solution field provides the
instructions for patching the affected system. This is a
step-by-step, tested, and verified method for getting a
system patched and working securely.The Correction Details field
displays each affected Subversion branch with the revision
number that contains the corrected code.The References field offers sources
of additional information regarding the
vulnerability.Process AccountingTomRhodesContributed
by Process AccountingProcess accounting is a security method in which an
administrator may keep track of system resources used and
their allocation among users, provide for system monitoring,
and minimally track a user's commands.Process accounting has both positive and negative points.
One of the positives is that an intrusion may be narrowed down
to the point of entry. A negative is the amount of logs
generated by process accounting, and the disk space they may
require. This section walks an administrator through the basics
of process accounting.If more fine-grained accounting is needed, refer to
.Enabling and Utilizing Process AccountingBefore using process accounting, it must be enabled using
the following commands:&prompt.root; sysrc accounting_enable=yes
&prompt.root; service accounting startThe accounting information is stored in files located in
/var/account, which is automatically created,
if necessary, the first time the accounting service starts.
These files contain sensitive information, including all the
commands issued by all users. Write access to the files is
limited to root,
and read access is limited to root and members of the
wheel group.
To also prevent members of wheel from reading the files,
change the mode of the /var/account
directory to allow access only by root.Once enabled, accounting will begin to track information
such as CPU statistics and executed
commands. All accounting logs are in a non-human readable
format which can be viewed using sa. If
issued without any options, sa prints
information relating to the number of per-user calls, the
total elapsed time in minutes, total CPU
and user time in minutes, and the average number of
I/O operations. Refer to &man.sa.8; for
the list of available options which control the output.To display the commands issued by users, use
lastcomm. For example, this command
prints out all usage of ls by trhodes on the
ttyp1 terminal:&prompt.root; lastcomm ls trhodes ttyp1Many other useful options exist and are explained in
&man.lastcomm.1;, &man.acct.5;, and &man.sa.8;.Resource LimitsTomRhodesContributed
by Resource limits&os; provides several methods for an administrator to
limit the amount of system resources an individual may use.
Disk quotas limit the amount of disk space available to users.
Quotas are discussed in .quotaslimiting usersquotasdisk quotasLimits to other resources, such as CPU
and memory, can be set using either a flat file or a command to
configure a resource limits database. The traditional method
defines login classes by editing
/etc/login.conf. While this method is
still supported, any changes require a multi-step process of
editing this file, rebuilding the resource database, making
necessary changes to /etc/master.passwd,
and rebuilding the password database. This can become time
consuming, depending upon the number of users to
configure.rctl can be used to provide a more
fine-grained method for controlling resource limits. This
command supports more than user limits as it can also be used to
set resource constraints on processes and jails.This section demonstrates both methods for controlling
resources, beginning with the traditional method.Configuring Login Classeslimiting usersaccountslimiting/etc/login.confIn the traditional method, login classes and the resource
limits to apply to a login class are defined in
/etc/login.conf. Each user account can
be assigned to a login class, where default
is the default login class. Each login class has a set of
login capabilities associated with it. A login capability is
a
name=value
pair, where name is a well-known
identifier and value is an
arbitrary string which is processed accordingly depending on
the name.Whenever /etc/login.conf is edited,
the /etc/login.conf.db must be updated
by executing the following command:&prompt.root; cap_mkdb /etc/login.confResource limits differ from the default login capabilities
in two ways. First, for every limit, there is a
soft and hard
limit. A soft limit may be adjusted by the user or
application, but may not be set higher than the hard limit.
The hard limit may be lowered by the user, but can only be
raised by the superuser. Second, most resource limits apply
per process to a specific user. lists the most commonly
used resource limits. All of the available resource limits
and capabilities are described in detail in
&man.login.conf.5;.limiting userscoredumpsizelimiting userscputimelimiting usersfilesizelimiting usersmaxproclimiting usersmemorylockedlimiting usersmemoryuselimiting usersopenfileslimiting userssbsizelimiting usersstacksize
Login Class Resource LimitsResource LimitDescriptioncoredumpsizeThe limit on the size of a core file generated by
a program is subordinate to other limits on disk
usage, such as filesize or disk
quotas. This limit is often used as a less severe
method of controlling disk space consumption. Since
users do not generate core files themselves, and often do not
delete them, this setting may save them from running
out of disk space should a large program
crash.cputimeThe maximum amount of CPU time
a user's process may consume. Offending processes
will be killed by the kernel. This is a limit on
CPU time
consumed, not the percentage of the
CPU as displayed in some of the
fields generated by top and
ps.filesizeThe maximum size of a file the user may own.
Unlike disk quotas (), this
limit is enforced on individual files, not the set of
all files a user owns.maxprocThe maximum number of foreground and background
processes a user can run. This limit may not be
larger than the system limit specified by
kern.maxproc. Setting this limit
too small may hinder a user's productivity as some
tasks, such as compiling a large program, start lots
of processes.memorylockedThe maximum amount of memory a process may
request to be locked into main memory using
&man.mlock.2;. Some system-critical programs, such as
&man.amd.8;, lock into main memory so that if the
system begins to swap, they do not contribute to disk
thrashing.memoryuseThe maximum amount of memory a process may
consume at any given time. It includes both core
memory and swap usage. This is not a catch-all limit
for restricting memory consumption, but is a good
start.openfilesThe maximum number of files a process may have
open. In &os;, files are used to represent sockets
and IPC channels, so be careful not
to set this too low. The system-wide limit for this
is defined by
kern.maxfiles.sbsizeThe limit on the amount of network memory a user
may consume. This can be generally used to limit
network communications.stacksizeThe maximum size of a process stack. This alone
is not sufficient to limit the amount of memory a
program may use, so it should be used in conjunction
with other limits.
There are a few other things to remember when setting
resource limits:Processes started at system startup by
/etc/rc are assigned to the
daemon login class.Although the default
/etc/login.conf is a good source of
reasonable values for most limits, they may not be
appropriate for every system. Setting a limit too high
may open the system up to abuse, while setting it too low
may put a strain on productivity.&xorg; takes a lot of
resources and encourages users to run more programs
simultaneously.Many limits apply to individual processes, not the
user as a whole. For example, setting
openfiles to 50
means that each process the user runs may open up to
50 files. The total amount of files a
user may open is the value of openfiles
multiplied by the value of maxproc.
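As a sketch of that worst case, using the openfiles value of 50 from the example above and a hypothetical maxproc of 10:

```shell
# Worst-case total open files for one user: the per-process
# openfiles limit multiplied by the maxproc process cap.
openfiles=50   # per-process open file limit (example value)
maxproc=10     # hypothetical process limit
echo "$((openfiles * maxproc))"   # prints 500
```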
This also applies to memory consumption.For further information on resource limits and login
classes and capabilities in general, refer to
&man.cap.mkdb.1;, &man.getrlimit.2;, and
&man.login.conf.5;.Enabling and Configuring Resource LimitsThe kern.racct.enable tunable must be
set to a non-zero value. Custom kernels require specific
configuration:options RACCT
options RCTLOnce the system has rebooted into the new kernel,
rctl may be used to set rules for the
system.Rule syntax is controlled through the use of a subject,
subject-id, resource, and action, as seen in this example
rule:user:trhodes:maxproc:deny=10/userIn this rule, the subject is user, the
subject-id is trhodes, the resource,
maxproc, is the maximum number of
processes, and the action is deny, which
blocks any new processes from being created. This means that
the user, trhodes, will be constrained to
no greater than 10 processes. Other
possible actions include logging to the console, passing a
notification to &man.devd.8;, or sending a SIGTERM to the
process.Some care must be taken when adding rules. Since this
user is constrained to 10 processes, this
example will prevent the user from performing other tasks
after logging in and executing a
screen session. Once a resource limit has
been hit, an error will be printed, as in this example:&prompt.user; man test
/usr/bin/man: Cannot fork: Resource temporarily unavailable
eval: Cannot fork: Resource temporarily unavailableAs another example, a jail can be prevented from exceeding
a memory limit. This rule could be written as:&prompt.root; rctl -a jail:httpd:memoryuse:deny=2G/jailRules will persist across reboots if they have been added
to /etc/rctl.conf. The format is a rule,
without the preceding command. For example, the previous rule
could be added as:# Block jail from using more than 2G memory:
jail:httpd:memoryuse:deny=2G/jailTo remove a rule, use rctl to remove it
from the list:&prompt.root; rctl -r user:trhodes:maxproc:deny=10/userA method for removing all rules is documented in
&man.rctl.8;. However, if removing all rules for a single
user is required, this command may be issued:&prompt.root; rctl -r user:trhodesMany other resources exist which can be used to exert
additional control over various subjects.
See &man.rctl.8; to learn about them.Shared Administration with SudoTomRhodesContributed
by SecuritySudoSystem administrators often need the ability to grant
enhanced permissions to users so they may perform privileged
tasks. The idea that team members are provided access
to a &os; system to perform their specific tasks opens up unique
challenges to every administrator. These team members only
need a subset of access beyond normal end user levels; however,
they almost always tell management they are unable to
perform their tasks without superuser access. Thankfully, there
is no reason to provide such access to end users because tools
exist to manage this exact requirement.Up to this point, the security chapter has covered
permitting access to authorized users and attempting to prevent
unauthorized access. Another problem arises once authorized
users have access to the system resources. In many cases, some
users may need access to application startup scripts, or a team
of administrators need to maintain the system. Traditionally,
the standard users and groups, file permissions, and even the
&man.su.1; command would manage this access. As
applications required more access and more users needed to use
system resources, a better solution was required. The most used
application is currently Sudo.Sudo allows administrators
to configure more rigid access to system commands
and provide for some advanced logging features.
As a tool, it is available from the Ports Collection as
security/sudo or by use of
the &man.pkg.8; utility. To use the &man.pkg.8; tool:&prompt.root; pkg install sudoAfter the installation is complete, the installed
visudo will open the configuration file with
a text editor. Using visudo is highly
recommended as it comes with a built-in syntax checker to verify
there are no errors before the file is saved.The configuration file is made up of several small sections
which allow for extensive configuration. In the following
example, web application maintainer, user1, needs to start,
stop, and restart the web application known as
webservice. To
grant this user permission to perform these tasks, add
this line to the end of
/usr/local/etc/sudoers:user1 ALL=(ALL) /usr/sbin/service webservice *The user may now start webservice
using this command:&prompt.user; sudo /usr/sbin/service webservice startThis configuration allows a single user access to the
webservice service; however, in most
organizations, there is an entire web team in charge of managing
the service. A single line can also give access to an entire
group. These steps will create a web group, add a user to this
group, and allow all members of the group to manage the
service:&prompt.root; pw groupadd -g 6001 -n webteamUsing the same &man.pw.8; command, the user is added to
the webteam group:&prompt.root; pw groupmod -m user1 -n webteamFinally, this line in
/usr/local/etc/sudoers allows any
member of the webteam group to manage
webservice:%webteam ALL=(ALL) /usr/sbin/service webservice *Unlike &man.su.1;, Sudo only
requires the end user's password. This has the advantage that
users do not need to share passwords; shared passwords are a common
finding in security audits and a poor practice in general.Users permitted to run applications with
Sudo only enter their own passwords.
This is more secure and gives better control than &man.su.1;,
where the root
password is entered and the user acquires all
root
permissions.Most organizations are moving or have moved toward a two
factor authentication model. In these cases, the user may not
have a password to enter. Sudo
provides for these cases with the NOPASSWD
variable. Adding it to the configuration above will allow all
members of the webteam group to
manage the service without the password requirement:%webteam ALL=(ALL) NOPASSWD: /usr/sbin/service webservice *Logging OutputAn advantage to implementing
Sudo is the ability to enable
session logging. Using the built-in log mechanisms
and the included sudoreplay
command, all commands initiated through
Sudo are logged for later
verification. To enable this feature, add a default log
directory entry; this example uses a user variable.
Several other log filename conventions exist; consult the
manual page for sudoreplay for
additional information.Defaults iolog_dir=/var/log/sudo-io/%{user}This directory will be created automatically after the
logging is configured. It is best to let the system create
the directory with default permissions just to be safe. In
addition, this entry also logs administrators who use
the sudoreplay command. To
change this behavior, read and uncomment the logging options
inside sudoers.Once this directive has been added to the
sudoers file, any user configuration can
be updated with the request to log access. In the example
shown, the updated webteam entry
would have the following additional changes:%webteam ALL=(ALL) NOPASSWD: LOG_INPUT: LOG_OUTPUT: /usr/sbin/service webservice *From this point on, all webteam
members altering the status of the
webservice application
will be logged. The list of previous and current sessions
can be displayed with:&prompt.root; sudoreplay -lIn the output, to replay a specific session, search for
the TSID= entry, and pass that to
sudoreplay with no other options to
replay the session at normal speed. For example:&prompt.root; sudoreplay user1/00/00/02While sessions are logged, any administrator is able to
remove sessions, leaving only the question of why they did
so. It is worthwhile to add a daily check through an
intrusion detection system (IDS) or
similar software so that other administrators are alerted to
manual alterations.The sudoreplay utility is extremely extensible.
Consult the documentation for more information.
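Putting the pieces of this section together, a complete sudoers fragment for the hypothetical webteam setup might look like the following; the group name, service name, and log path are the examples used above.

```
# Store sudo session I/O logs in one directory per user
Defaults iolog_dir=/var/log/sudo-io/%{user}

# Let the web team manage the service without a password,
# recording input and output for later sudoreplay review
%webteam ALL=(ALL) NOPASSWD: LOG_INPUT: LOG_OUTPUT: /usr/sbin/service webservice *
```

As noted earlier, edit the file with visudo so that syntax errors are caught before the file is saved.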
diff --git a/en_US.ISO8859-1/books/handbook/serialcomms/chapter.xml b/en_US.ISO8859-1/books/handbook/serialcomms/chapter.xml
index 9913ad3755..db13e71ef6 100644
--- a/en_US.ISO8859-1/books/handbook/serialcomms/chapter.xml
+++ b/en_US.ISO8859-1/books/handbook/serialcomms/chapter.xml
@@ -1,2191 +1,2191 @@
Serial CommunicationsSynopsisserial communications&unix; has always had support for serial communications as
the very first &unix; machines relied on serial lines for user
input and output. Things have changed a lot from the days
when the average terminal consisted of a 10-character-per-second
serial printer and a keyboard. This chapter covers some of the
ways serial communications can be used on &os;.After reading this chapter, you will know:How to connect terminals to a &os; system.How to use a modem to dial out to remote hosts.How to allow remote users to login to a &os; system
with a modem.How to boot a &os; system from a serial console.Before reading this chapter, you should:Know how to configure and
install a custom kernel.Understand &os; permissions
and processes.Have access to the technical manual for the serial
hardware to be used with &os;.Serial Terminology and HardwareThe following terms are often used in serial
communications:bpsBits per
Secondbits-per-second
(bps) is the rate at which data is
transmitted.DTEData Terminal
EquipmentDTE
(DTE) is one of two endpoints in a
serial communication. An example would be a
computer.DCEData Communications
EquipmentDCE
(DCE) is the other endpoint in a
serial communication. Typically, it is a modem or serial
terminal.RS-232The original standard which defined hardware serial
communications. It has since been renamed to
TIA-232.When referring to communication data rates, this section
does not use the term baud. Baud refers
to the number of electrical state transitions made in a period
of time, while bps is the correct term to
use.To connect a serial terminal to a &os; system, a serial port
on the computer and the proper cable to connect to the serial
device are needed. Users who are already familiar with serial
hardware and cabling can safely skip this section.Serial Cables and PortsThere are several different kinds of serial cables. The
two most common types are null-modem cables and standard
RS-232 cables. The documentation for the
hardware should describe the type of cable required.These two types of cables differ in how the wires are
connected to the connector. Each wire represents a signal,
with the defined signals summarized in . A standard serial
cable passes all of the RS-232C signals
straight through. For example, the Transmitted
Data pin on one end of the cable goes to the
Transmitted Data pin on the other end. This is
the type of cable used to connect a modem to the &os; system,
and is also appropriate for some terminals.A null-modem cable switches the Transmitted
Data pin of the connector on one end with the
Received Data pin on the other end. The
connector can be either a DB-25 or a
DB-9.A null-modem cable can be constructed using the pin
connections summarized in ,
, and . While the standard calls for
a straight-through pin 1 to pin 1 Protective
Ground line, it is often omitted. Some terminals
work using only pins 2, 3, and 7, while others require
different configurations. When in doubt, refer to the
documentation for the hardware.null-modem cable
RS-232C Signal NamesAcronymsNamesRDReceived DataTDTransmitted DataDTRData Terminal ReadyDSRData Set ReadyDCDData Carrier DetectSGSignal GroundRTSRequest to SendCTSClear to Send
When one pin at one end connects to a pair of pins at
the other end, it is usually implemented with one short wire
between the pair of pins in their connector and a long wire
to the other single pin.Serial ports are the devices through which data is
transferred between the &os; host computer and the terminal.
Several kinds of serial ports exist. Before purchasing or
constructing a cable, make sure it will fit the ports on the
terminal and on the &os; system.Most terminals have DB-25 ports.
Personal computers may have DB-25 or
DB-9 ports. A multiport serial card may
have RJ-12 or RJ-45
ports. See the documentation that accompanied the hardware
for specifications on the kind of port or visually verify the
type of port.In &os;, each serial port is accessed through an entry in
/dev. There are two different kinds of
entries:Call-in ports are named
/dev/ttyuN
where N is the port number,
starting from zero. If a terminal is connected to the
first serial port (COM1), use
/dev/ttyu0 to refer to the terminal.
If the terminal is on the second serial port
(COM2), use
/dev/ttyu1, and so forth. Generally,
the call-in port is used for terminals. Call-in ports
require that the serial line assert the Data
Carrier Detect signal to work correctly.Call-out ports are named
/dev/cuauN
on &os; versions 8.X and higher and
/dev/cuadN
on &os; versions 7.X and lower. Call-out ports are
usually not used for terminals, but are used for modems.
The call-out port can be used if the serial cable or the
terminal does not support the Data Carrier
Detect signal.&os; also provides initialization devices
(/dev/ttyuN.init
and
/dev/cuauN.init
or
/dev/cuadN.init)
and locking devices
(/dev/ttyuN.lock
and
/dev/cuauN.lock
or
/dev/cuadN.lock).
The initialization devices are used to initialize
communications port parameters each time a port is opened,
such as crtscts for modems which use
RTS/CTS signaling for flow control. The
locking devices are used to lock flags on ports to prevent
users or programs from changing certain parameters. Refer to
&man.termios.4;, &man.sio.4;, and &man.stty.1; for information
on terminal settings, locking and initializing devices, and
setting terminal options, respectively.Serial Port ConfigurationBy default, &os; supports four serial ports which are
commonly known as COM1,
COM2, COM3, and
COM4. &os; also supports dumb multi-port
serial interface cards, such as the BocaBoard 1008 and 2016,
as well as more intelligent multi-port cards such as those
made by Digiboard. However, the default kernel only looks for
the standard COM ports.To see if the system recognizes the serial ports, look for
system boot messages that start with
uart:&prompt.root; grep uart /var/run/dmesg.bootIf the system does not recognize all of the needed serial
ports, additional entries can be added to
/boot/device.hints. This file already
contains hint.uart.0.* entries for
COM1 and hint.uart.1.*
entries for COM2. When adding a port
entry for COM3 use
0x3E8, and for COM4
use 0x2E8. Common IRQ
addresses are 5 for
COM3 and 9 for
COM4.ttyucuauTo determine the default set of terminal
I/O settings used by the port, specify its
device name. This example determines the settings for the
call-in port on COM2:&prompt.root; stty -a -f /dev/ttyu1System-wide initialization of serial devices is controlled
by /etc/rc.d/serial. This file affects
the default settings of serial devices. To change the
settings for a device, use stty. By
default, the changed settings are in effect until the device
is closed; when the device is reopened, it goes back to the
default set. To permanently change the default set, open and
adjust the settings of the initialization device. For
example, to turn on clocal mode, 8-bit
communication, and XON/XOFF flow control for
ttyu5, type:&prompt.root; stty -f /dev/ttyu5.init clocal cs8 ixon ixoffrc filesrc.serialTo prevent certain settings from being changed by an
application, make adjustments to the locking device. For
example, to lock the speed of ttyu5 to
57600 bps, type:&prompt.root; stty -f /dev/ttyu5.lock 57600Now, any application that opens ttyu5
and tries to change the speed of the port will be stuck with
57600 bps.TerminalsSeanKellyContributed by terminalsTerminals provide a convenient and low-cost way to access
a &os; system when not at the computer's console or on a
connected network. This section describes how to use terminals
with &os;.The original &unix; systems did not have consoles. Instead,
users logged in and ran programs through terminals that were
connected to the computer's serial ports.The ability to establish a login session on a serial port
still exists in nearly every &unix;-like operating system
today, including &os;. By using a terminal attached to an
unused serial port, a user can log in and run any text program
that can normally be run on the console or in an
xterm window.Many terminals can be attached to a &os; system. An older
spare computer can be used as a terminal wired into a more
powerful computer running &os;. This can turn what might
otherwise be a single-user computer into a powerful
multiple-user system.&os; supports three types of terminals:Dumb terminalsDumb terminals are specialized hardware that connect
to computers over serial lines. They are called
dumb because they have only enough
computational power to display, send, and receive text.
No programs can be run on these devices. Instead, dumb
terminals connect to a computer that runs the needed
programs.There are hundreds of kinds of dumb terminals made by
many manufacturers, and just about any kind will work with
&os;. Some high-end terminals can even display graphics,
but only certain software packages can take advantage of
these advanced features.Dumb terminals are popular in work environments where
workers do not need access to graphical
applications.Computers Acting as TerminalsSince a dumb terminal has just enough ability to
display, send, and receive text, any spare computer can
be a dumb terminal. All that is needed is the proper
cable and some terminal emulation
software to run on the computer.This configuration can be useful. For example, if one
user is busy working at the &os; system's console, another
user can do some text-only work at the same time from a
less powerful personal computer hooked up as a terminal to
the &os; system.There are at least two utilities in the base system of
&os; that can be used to work through a serial connection:
&man.cu.1; and &man.tip.1;.For example, to connect from a client system that runs
&os; to the serial connection of another system:&prompt.root; cu -l /dev/cuauNPorts are numbered starting from zero. This means that
COM1 is
/dev/cuau0.Additional programs are available through the Ports
Collection, such as
comms/minicom.X TerminalsX terminals are the most sophisticated kind of
terminal available. Instead of connecting to a serial
port, they usually connect to a network like Ethernet.
Instead of being relegated to text-only applications, they
can display any &xorg;
application.This chapter does not cover the setup, configuration,
or use of X terminals.Terminal ConfigurationThis section describes how to configure a &os; system to
enable a login session on a serial terminal. It assumes that
the system recognizes the serial port to which the terminal is
connected and that the terminal is connected with the correct
cable.In &os;, init reads
/etc/ttys and starts a
getty process on the available terminals.
The getty process is responsible for
reading a login name and starting the login
program. The ports on the &os; system which allow logins are
listed in /etc/ttys. For example, the
first virtual console, ttyv0, has an
entry in this file, allowing logins on the console. This file
also contains entries for the other virtual consoles, serial
ports, and pseudo-ttys. For a hardwired terminal, the serial
port's /dev entry is listed without the
/dev part. For example,
/dev/ttyv0 is listed as
ttyv0.The default /etc/ttys configures
support for the first four serial ports,
ttyu0 through
ttyu3:ttyu0 "/usr/libexec/getty std.9600" dialup off secure
ttyu1 "/usr/libexec/getty std.9600" dialup off secure
ttyu2 "/usr/libexec/getty std.9600" dialup off secure
ttyu3 "/usr/libexec/getty std.9600" dialup off secureWhen attaching a terminal to one of those ports, modify
the default entry to set the required speed and terminal type,
to turn the device on and, if needed, to
change the port's secure setting. If the
terminal is connected to another port, add an entry for the
port. configures two terminals in
/etc/ttys. The first entry configures a
Wyse-50 connected to COM2. The second
entry configures an old computer running
Procomm terminal software emulating
a VT-100 terminal. The computer is connected to the sixth
serial port on a multi-port serial card.Configuring Terminal Entriesttyu1 "/usr/libexec/getty std.38400" wy50 on insecure
ttyu5 "/usr/libexec/getty std.19200" vt100 on insecureThe first field specifies the device name of the
serial terminal.The second field tells getty to
initialize and open the line, set the line speed, prompt
for a user name, and then execute the
login program. The optional
getty type configures
characteristics on the terminal line, like
bps rate and parity. The available
getty types are listed in
/etc/gettytab. In almost all
cases, the getty types that start with
std will work for hardwired terminals
as these entries ignore parity. There is a
std entry for each
bps rate from 110 to 115200. Refer
to &man.gettytab.5; for more information.When setting the getty type, make sure to match the
communications settings used by the terminal. For this
example, the Wyse-50 uses no parity and connects at
38400 bps. The computer uses no parity and
connects at 19200 bps.The third field is the type of terminal. For
dial-up ports, unknown or
dialup is typically used since users
may dial up with practically any type of terminal or
software. Since the terminal type does not change for
hardwired terminals, a real terminal type from
/etc/termcap can be specified. For
this example, the Wyse-50 uses the real terminal type
while the computer running
Procomm is set to emulate a
VT-100.The fourth field specifies if the port should be
enabled. To enable logins on this port, this field must
be set to on.The final field is used to specify whether the port
is secure. Marking a port as secure
means that it is trusted enough to allow root to log in from that
port. Insecure ports do not allow root logins. On an
insecure port, users must log in from unprivileged
accounts and then use su or a similar
mechanism to gain superuser privileges, as described in
. For security
reasons, it is recommended to change this setting to
insecure.After making any changes to
/etc/ttys, send a SIGHUP (hangup) signal
to the init process to force it to re-read
its configuration file:&prompt.root; kill -HUP 1Since init is always the first process
run on a system, it always has a process ID
of 1.If everything is set up correctly, all cables are in
place, and the terminals are powered up, a
getty process should now be running on each
terminal and login prompts should be available on each
terminal.Troubleshooting the ConnectionEven with the most meticulous attention to detail,
something could still go wrong while setting up a terminal.
Here is a list of common symptoms and some suggested
fixes.If no login prompt appears, make sure the terminal is
plugged in and powered up. If it is a personal computer
acting as a terminal, make sure it is running terminal
emulation software on the correct serial port.Make sure the cable is connected firmly to both the
terminal and the &os; computer. Make sure it is the right
kind of cable.Make sure the terminal and &os; agree on the
bps rate and parity settings. For a video
display terminal, make sure the contrast and brightness
controls are turned up. If it is a printing terminal, make
sure paper and ink are in good supply.Use ps to make sure that a
getty process is running and serving the
terminal. For example, the following listing shows that a
getty is running on the second serial port,
ttyu1, and is using the
std.38400 entry in
/etc/gettytab:&prompt.root; ps -axww|grep ttyu
22189 d1 Is+ 0:00.03 /usr/libexec/getty std.38400 ttyu1If no getty process is running, make
sure the port is enabled in /etc/ttys.
Remember to run kill -HUP 1 after modifying
/etc/ttys.If the getty process is running but the
terminal still does not display a login prompt, or if it
displays a prompt but will not accept typed input, the
terminal or cable may not support hardware handshaking. Try
changing the entry in /etc/ttys from
std.38400 to
3wire.38400, then run kill -HUP
1 after modifying /etc/ttys.
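For instance, the modified entry might read as follows (the wy50 terminal type and the insecure flag are illustrative; keep whatever terminal type and secure setting the port already uses):

```
ttyu1 "/usr/libexec/getty 3wire.38400" wy50 on insecure
```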
The 3wire entry is similar to
std, but ignores hardware handshaking. The
baud rate may need to be reduced or software flow control
enabled when using 3wire to prevent buffer
overflows.If garbage appears instead of a login prompt, make sure
the terminal and &os; agree on the bps rate
and parity settings. Check the getty
processes to make sure the correct
getty type is in use. If not, edit
/etc/ttys and run kill
-HUP 1.If characters appear doubled and the password appears when
typed, switch the terminal, or the terminal emulation
software, from half duplex or local
echo to full duplex.Dial-in ServiceGuyHelmerContributed by SeanKellyAdditions by dial-in serviceConfiguring a &os; system for dial-in service is similar to
configuring terminals, except that modems are used instead of
terminal devices. &os; supports both external and internal
modems.External modems are more convenient because they often can
be configured via parameters stored in non-volatile
RAM and they usually provide lighted
indicators that display the state of important
RS-232 signals, indicating whether the modem
is operating properly.Internal modems usually lack non-volatile
RAM, so their configuration may be limited to
setting DIP switches. If the internal modem
has any signal indicator lights, they are difficult to view when
the system's cover is in place.modemWhen using an external modem, a proper cable is needed. A
standard RS-232C serial cable should
suffice.&os; needs the RTS and
CTS signals for flow control at speeds above
2400 bps, the CD signal to detect when a
call has been answered or the line has been hung up, and the
DTR signal to reset the modem after a session
is complete. Some cables are wired without all of the needed
signals, so if a login session does not go away when the line
hangs up, there may be a problem with the cable. Refer to for more information about these
signals.Like other &unix;-like operating systems, &os; uses the
hardware signals to find out when a call has been answered or a
line has been hung up and to hang up and reset the modem after a
call. &os; avoids sending commands to the modem or watching for
status reports from the modem.&os; supports the NS8250,
NS16450, NS16550, and
NS16550A-based RS-232C
(CCITT V.24) communications interfaces. The
8250 and 16450 devices have single-character buffers. The 16550
device provides a 16-character buffer, which allows for better
system performance. Bugs in plain 16550 devices prevent the use
of the 16-character buffer, so use 16550A devices if possible.
- Because single-character-buffer devices require more work by the
+ As single-character-buffer devices require more work by the
operating system than the 16-character-buffer devices,
16550A-based serial interface cards are preferred. If the
system has many active serial ports or will have a heavy load,
16550A-based cards are better for low-error-rate
communications.The rest of this section demonstrates how to configure a
modem to receive incoming connections, how to communicate with
the modem, and offers some troubleshooting tips.Modem ConfigurationgettyAs with terminals, init spawns a
getty process for each configured serial
port used for dial-in connections. When a user dials the
modem's line and the modems connect, the Carrier
Detect signal is reported by the modem. The kernel
notices that the carrier has been detected and instructs
getty to open the port and display a
login: prompt at the specified initial line
speed. In a typical configuration, if garbage characters are
received, usually due to the modem's connection speed being
different than the configured speed, getty
tries adjusting the line speeds until it receives reasonable
characters. After the user enters their login name,
getty executes login,
which completes the login process by asking for the user's
password and then starting the user's shell./usr/bin/loginThere are two schools of thought regarding dial-up modems.
One configuration method is to set the modems and systems so
that no matter at what speed a remote user dials in, the
dial-in RS-232 interface runs at a locked
speed. The benefit of this configuration is that the remote
user always sees a system login prompt immediately. The
downside is that the system does not know what a user's true
data rate is, so full-screen programs like
Emacs will not adjust their
screen-painting methods to make their response better for
slower connections.The second method is to configure the
RS-232 interface to vary its speed based on
- the remote user's connection speed. Because
+ the remote user's connection speed. As
getty does not understand any particular
modem's connection speed reporting, it gives a
login: message at an initial speed and
watches the characters that come back in response. If the
user sees junk, they should press Enter until
they see a recognizable prompt. If the data rates do not
match, getty sees anything the user types
as junk, tries the next speed, and gives the
login: prompt again. This procedure normally
only takes a keystroke or two before the user sees a good
prompt. This login sequence does not look as clean as the
locked-speed method, but a user on a low-speed connection
should receive better interactive response from full-screen
programs.When locking a modem's data communications rate at a
particular speed, no changes to
/etc/gettytab should be needed. However,
for a matching-speed configuration, additional entries may be
required in order to define the speeds to use for the modem.
This example configures a 14.4 Kbps modem with a top
interface speed of 19.2 Kbps using 8-bit, no parity
connections. It configures getty to start
the communications rate for a V.32bis connection at
19.2 Kbps, then cycles through 9600 bps,
2400 bps, 1200 bps, 300 bps, and back to
19.2 Kbps. Communications rate cycling is implemented
with the nx= (next table) capability. Each
line uses a tc= (table continuation) entry
to pick up the rest of the settings for a particular data
rate.#
# Additions for a V.32bis Modem
#
um|V300|High Speed Modem at 300,8-bit:\
:nx=V19200:tc=std.300:
un|V1200|High Speed Modem at 1200,8-bit:\
:nx=V300:tc=std.1200:
uo|V2400|High Speed Modem at 2400,8-bit:\
:nx=V1200:tc=std.2400:
up|V9600|High Speed Modem at 9600,8-bit:\
:nx=V2400:tc=std.9600:
uq|V19200|High Speed Modem at 19200,8-bit:\
:nx=V9600:tc=std.19200:For a 28.8 Kbps modem, or to take advantage of
compression on a 14.4 Kbps modem, use a higher
communications rate, as seen in this example:#
# Additions for a V.32bis or V.34 Modem
# Starting at 57.6 Kbps
#
vm|VH300|Very High Speed Modem at 300,8-bit:\
:nx=VH57600:tc=std.300:
vn|VH1200|Very High Speed Modem at 1200,8-bit:\
:nx=VH300:tc=std.1200:
vo|VH2400|Very High Speed Modem at 2400,8-bit:\
:nx=VH1200:tc=std.2400:
vp|VH9600|Very High Speed Modem at 9600,8-bit:\
:nx=VH2400:tc=std.9600:
vq|VH57600|Very High Speed Modem at 57600,8-bit:\
:nx=VH9600:tc=std.57600:For a slow CPU or a heavily loaded
system without 16550A-based serial ports, this configuration
may produce sio silo errors at 57.6 Kbps./etc/ttysThe configuration of /etc/ttys is
similar to , but a different
argument is passed to getty and
dialup is used for the terminal type.
Replace xxx with the process
init will run on the device:ttyu0 "/usr/libexec/getty xxx" dialup onThe dialup terminal type can be
changed. For example, setting vt102 as the
default terminal type allows users to use
VT102 emulation on their remote
systems.For a locked-speed configuration, specify the speed with
a valid type listed in /etc/gettytab.
This example is for a modem whose port speed is locked at
19.2 Kbps:ttyu0 "/usr/libexec/getty std.19200" dialup onIn a matching-speed configuration, the entry needs to
reference the appropriate beginning auto-baud
entry in /etc/gettytab. To continue the
example for a matching-speed modem that starts at
19.2 Kbps, use this entry:ttyu0 "/usr/libexec/getty V19200" dialup onAfter editing /etc/ttys, wait until
the modem is properly configured and connected before
signaling init:&prompt.root; kill -HUP 1rc filesrc.serialHigh-speed modems, like V.32,
V.32bis, and V.34
modems, use hardware (RTS/CTS) flow
control. Use stty to set the hardware flow
control flag for the modem port. This example sets the
crtscts flag on COM2's
dial-in and dial-out initialization devices:&prompt.root; stty -f /dev/ttyu1.init crtscts
&prompt.root; stty -f /dev/cuau1.init crtsctsTroubleshootingThis section provides a few tips for troubleshooting a
dial-up modem that will not connect to a &os; system.Hook up the modem to the &os; system and boot the system.
If the modem has status indication lights, watch to see
whether the modem's DTR indicator lights
when the login: prompt appears on the
system's console. If it lights up, that should mean that &os;
has started a getty process on the
appropriate communications port and is waiting for the modem
to accept a call.If the DTR indicator does not light,
log in to the &os; system through the console and type
ps ax to see if &os; is running a
getty process on the correct port: 114 ?? I 0:00.10 /usr/libexec/getty V19200 ttyu0If the second column contains a d0
instead of a ?? and the modem has not
accepted a call yet, this means that getty
has completed its open on the communications port. This could
indicate a problem with the cabling or a misconfigured modem
because getty should not be able to open
the communications port until the carrier detect signal has
been asserted by the modem.If no getty processes are waiting to
open the port, double-check that the entry for the port is
correct in /etc/ttys. Also, check
/var/log/messages to see if there are
any log messages from init or
getty.Next, try dialing into the system. Be sure to use 8 bits,
no parity, and 1 stop bit on the remote system. If a prompt
does not appear right away, or the prompt shows garbage, try
pressing Enter about once per second. If
there is still no login: prompt,
try sending a BREAK. When using a
high-speed modem, try dialing again after locking the
dialing modem's interface speed.If there is still no login: prompt, check
/etc/gettytab again and double-check
that:The initial capability name specified in the entry in
/etc/ttys matches the name of a
capability in /etc/gettytab.Each nx= entry matches another
gettytab capability name.Each tc= entry matches another
gettytab capability name.If the modem on the &os; system will not answer, make
sure that the modem is configured to answer the phone when
DTR is asserted. If the modem seems to be
configured correctly, verify that the
DTR line is asserted by checking the
modem's indicator lights.If it still does not work, try sending an email
to the &a.questions; describing the modem and the
problem.Dial-out Servicedial-out serviceThe following are tips for getting the host to connect over
the modem to another computer. This is appropriate for
establishing a terminal session with a remote host.This kind of connection can be helpful to get a file on the
Internet if there are problems using PPP. If PPP is not
working, use the terminal session to FTP the needed file. Then
use zmodem to transfer it to the machine.Using a Stock Hayes ModemA generic Hayes dialer is built into
tip. Use at=hayes in
/etc/remote.The Hayes driver is not smart enough to recognize some of
the advanced features of newer modems, such as the messages
BUSY, NO DIALTONE, or
CONNECT 115200. Turn those messages off
when using tip with
ATX0&W.The dial timeout for tip is 60
seconds. The modem should use something less, or else
tip will think there is a communication
problem. Try ATS7=45&W.Using AT Commands/etc/remoteCreate a direct entry in
/etc/remote. For example, if the modem
is hooked up to the first serial port,
/dev/cuau0, use the following
line:cuau0:dv=/dev/cuau0:br#19200:pa=noneUse the highest bps rate the modem
supports in the br capability. Then, type
tip cuau0 to connect to the modem.Or, use cu as root with the following
command:&prompt.root; cu -lline -sspeedline is the serial port, such
as /dev/cuau0, and
speed is the speed, such as
57600. When finished entering the AT
commands, type ~. to exit.The @ Sign Does Not WorkThe @ sign in the phone number
capability tells tip to look in
/etc/phones for a phone number. But, the
@ sign is also a special character in
capability files like /etc/remote, so it
needs to be escaped with a backslash:pn=\@Dialing from the Command LinePut a generic entry in
/etc/remote. For example:tip115200|Dial any phone number at 115200 bps:\
:dv=/dev/cuau0:br#115200:at=hayes:pa=none:du:
tip57600|Dial any phone number at 57600 bps:\
:dv=/dev/cuau0:br#57600:at=hayes:pa=none:du:This should now work:&prompt.root; tip -115200 5551234Users who prefer cu over
tip can use a generic
cu entry:cu115200|Use cu to dial any number at 115200bps:\
:dv=/dev/cuau1:br#57600:at=hayes:pa=none:du:and type:&prompt.root; cu 5551234 -s 115200Setting the bps RatePut in an entry for tip1200 or
cu1200, but go ahead and use whatever
bps rate is appropriate with the
br capability.
tip thinks a good default is 1200 bps,
which is why it looks for a tip1200 entry.
1200 bps does not have to be used, though.Accessing a Number of Hosts Through a Terminal
ServerRather than waiting until connected and typing
CONNECT host
each time, use tip's cm
capability. For example, these entries in
/etc/remote will let you type
tip pain or tip muffin
to connect to the hosts pain or
muffin, and tip
deep13 to connect to the terminal server.pain|pain.deep13.com|Forrester's machine:\
:cm=CONNECT pain\n:tc=deep13:
muffin|muffin.deep13.com|Frank's machine:\
:cm=CONNECT muffin\n:tc=deep13:
deep13:Gizmonics Institute terminal server:\
:dv=/dev/cuau2:br#38400:at=hayes:du:pa=none:pn=5551234:Using More Than One Line with
tipThis is often a problem where a university has several
modem lines and several thousand students trying to use
them.Make an entry in /etc/remote and use
@ for the pn
capability:big-university:\
:pn=\@:tc=dialout
dialout:\
:dv=/dev/cuau3:br#9600:at=courier:du:pa=none:Then, list the phone numbers in
/etc/phones:big-university 5551111
big-university 5551112
big-university 5551113
big-university 5551114tip will try each number in the listed
order, then give up. To keep retrying, run
tip in a while
loop.Using the Force CharacterCtrlP is the default force character,
used to tell tip that the next character is
literal data. The force character can be set to any other
character with the ~s escape, which means
set a variable.Type
~sforce=single-char
followed by a newline. single-char
is any single character. If
single-char is left out, then the
force character is the null character, which is accessed by
typing
Ctrl2
or CtrlSpace. A pretty good value for
single-char is
ShiftCtrl6, which is only used on some terminal
servers.To change the force character, specify the following in
~/.tiprc:force=single-charUpper Case CharactersThis happens when
CtrlA is pressed, which is tip's
raise character, specially designed for people
with broken caps-lock keys. Use ~s to set
raisechar to something reasonable. It can
be set to be the same as the force character, if neither
feature is used.Here is a sample ~/.tiprc for
Emacs users who need to type
Ctrl2 and CtrlA:force=^^
raisechar=^^The ^^ is
ShiftCtrl6.File Transfers with tipWhen talking to another &unix;-like operating system,
files can be sent and received using ~p
(put) and ~t (take). These commands run
cat and echo on the
remote system to accept and send files. The syntax is ~p local-file [remote-file] and ~t remote-file [local-file]. There is no error checking, so another protocol, like
zmodem, should probably be used.Using zmodem with
tip?To receive files, start the sending program on the remote
end. Then, type ~C rz to begin receiving
them locally.To send files, start the receiving program on the remote
end. Then, type ~C sz
files to send them to the
remote system.Setting Up the Serial ConsoleKazutakaYOKOTAContributed by BillPaulBased on a document by serial console&os; has the ability to boot a system with a dumb
terminal on a serial port as a console. This configuration is
useful for system administrators who wish to install &os; on
machines that have no keyboard or monitor attached, and
developers who want to debug the kernel or device
drivers.As described in , &os; employs a three-stage
bootstrap. The first two stages are in the boot block
code which is stored at the beginning of the &os; slice on the
boot disk. The boot block then loads and runs the boot loader
as the third stage code.In order to set up booting from a serial console, the boot
block code, the boot loader code, and the kernel need to be
configured.Quick Serial Console ConfigurationThis section provides a fast overview of setting up the
serial console. This procedure can be used when the dumb
terminal is connected to COM1.Configuring a Serial Console on
COM1Connect the serial cable to
COM1 and the controlling
terminal.To configure boot messages to display on the serial
console, issue the following command as the
superuser:&prompt.root; echo 'console="comconsole"' >> /boot/loader.confEdit /etc/ttys and change
off to on and
dialup to vt100 for
the ttyu0 entry. Otherwise, a
password will not be required to connect via the serial
console, resulting in a potential security hole.Reboot the system to see if the changes took
effect.If a different configuration is required, see the next
section for a more in-depth configuration explanation.In-Depth Serial Console ConfigurationThis section provides a more detailed explanation of the
steps needed to set up a serial console in &os;.Configuring a Serial ConsolePrepare a serial cable.null-modem cableUse either a null-modem cable or a standard serial
cable and a null-modem adapter. See for a discussion on serial
cables.Unplug the keyboard.Many systems probe for the keyboard during the
Power-On Self-Test (POST) and will
generate an error if the keyboard is not detected. Some
machines will refuse to boot until the keyboard is plugged
in.If the computer complains about the error, but boots
anyway, no further configuration is needed.If the computer refuses to boot without a keyboard
attached, configure the BIOS so that it
ignores this error. Consult the motherboard's manual for
details on how to do this.Try setting the keyboard to Not
installed in the BIOS.
This setting tells the BIOS not to
probe for a keyboard at power-on so it should not
complain if the keyboard is absent. If that option is
not present in the BIOS, look for a
Halt on Error option instead. Setting
this to All but Keyboard or to No
Errors will have the same effect.If the system has a &ps2; mouse, unplug it as well.
&ps2; mice share some hardware with the keyboard and
leaving the mouse plugged in can fool the keyboard probe
into thinking the keyboard is still there.While most systems will boot without a keyboard,
quite a few will not boot without a graphics adapter.
Some systems can be configured to boot with no graphics
adapter by changing the graphics adapter
setting in the BIOS configuration to
Not installed. Other systems do not
support this option and will refuse to boot if there is
no display hardware in the system. With these machines,
leave some kind of graphics card plugged in, even if it
is just a junky mono board. A monitor does not need to
be attached.Plug a dumb terminal, an old computer with a modem
program, or the serial port on another &unix; box into the
serial port.Add the appropriate hint.sio.*
entries to /boot/device.hints for the
serial port. Some multi-port cards also require kernel
configuration options. Refer to &man.sio.4; for the
required options and device hints for each supported
serial port.Create boot.config in the root
directory of the a partition on the
boot drive.This file instructs the boot block code how to boot
the system. In order to activate the serial console, one
or more of the following options are needed. When using
multiple options, include them all on the same
line:
-h	Toggles between the internal and serial
consoles. Use this to switch console devices. For
instance, if booting from the internal (video) console,
use -h to direct the boot loader
and the kernel to use the serial port as its console
device. Alternatively, if booting from the serial
port, use -h to tell the boot
loader and the kernel to use the video display as
the console instead.
-D	Toggles between the single and dual console
configurations. In the single configuration, the
console will be either the internal console (video
display) or the serial port, depending on the state
of -h. In the dual console
configuration, both the video display and the
serial port will become the console at the same
time, regardless of the state of
-h. However, the dual console
configuration takes effect only while the boot
block is running. Once the boot loader gets
control, the console specified by
-h becomes the only
console.
-P	Makes the boot block probe the keyboard. If no
keyboard is found, the -D and
-h options are automatically
set.Due to space constraints in the current
version of the boot blocks, -P is
capable of detecting extended keyboards only.
Keyboards with fewer than 101 keys and without F11
and F12 keys may not be detected. Keyboards on
some laptops may not be properly found because of
this limitation. If this is the case, do not use
-P.Use either -P to select the console
automatically or -h to activate the
serial console. Refer to &man.boot.8; and
&man.boot.config.5; for more details.The options, except for -P, are
passed to the boot loader. The boot loader will
determine whether the internal video or the serial port
should become the console by examining the state of
-h. This means that if
-D is specified but
-h is not specified in /boot.config, the
serial port can be used as the console only during the
boot block as the boot loader will use the internal video
display as the console.Boot the machine.When &os; starts, the boot blocks echo the contents of
/boot.config to the console. For
example:/boot.config: -P
Keyboard: noThe second line appears only if -P is
in /boot.config and indicates the
presence or absence of the keyboard. These messages go
to either the serial or internal console, or both,
depending on the option in
/boot.config:

Options                 Message goes to
none                    internal console
-h                      serial console
-D                      serial and internal consoles
-Dh                     serial and internal consoles
-P, keyboard present    internal console
-P, keyboard absent     serial console

After the message, there will be a small pause before
the boot blocks continue loading the boot loader and
before any further messages are printed to the console.
Under normal circumstances, there is no need to interrupt
the boot blocks, but one can do so in order to make sure
things are set up correctly.Press any key, other than Enter, at
the console to interrupt the boot process. The boot
blocks will then prompt for further action:>> FreeBSD/i386 BOOT
Default: 0:ad(0,a)/boot/loader
boot:Verify that the above message appears on either the
serial or internal console, or both, according to the
options in /boot.config. If the
message appears in the correct console, press
Enter to continue the boot
process.If there is no prompt on the serial terminal,
something is wrong with the settings. Enter
-h then Enter or
Return to tell the boot block (and then
the boot loader and the kernel) to choose the serial port
for the console. Once the system is up, go back and check
what went wrong.During the third stage of the boot process, one can still
switch between the internal console and the serial console by
setting appropriate environment variables in the boot loader.
See &man.loader.8; for more
information.This line in /boot/loader.conf or
/boot/loader.conf.local configures the
boot loader and the kernel to send their boot messages to
the serial console, regardless of the options in
/boot.config:console="comconsole"That line should be the first line of
/boot/loader.conf so that boot messages
are displayed on the serial console as early as
possible.If that line does not exist, or if it is set to
console="vidconsole", the boot loader and
the kernel will use whichever console is indicated by
in the boot block. See
&man.loader.conf.5; for more information.At the moment, the boot loader has no option
equivalent to in the boot block, and
there is no provision to automatically select the internal
console and the serial console based on the presence of the
keyboard.While it is not required, it is possible to provide a
login prompt over the serial line. To
configure this, edit the entry for the serial port in
/etc/ttys using the instructions in
. If the speed of the serial
port has been changed, change std.9600 to
match the new setting.Setting a Faster Serial Port SpeedBy default, the serial port settings are 9600 baud, 8
bits, no parity, and 1 stop bit. To change the default
console speed, use one of the following options:Edit /etc/make.conf and set
BOOT_COMCONSOLE_SPEED to the new
console speed. Then, recompile and install the boot
blocks and the boot loader:&prompt.root; cd /sys/boot
&prompt.root; make clean
&prompt.root; make
&prompt.root; make installIf the serial console is configured in some other way
than by booting with -h, or if the serial
console used by the kernel is different from the one used
by the boot blocks, add the following option, with the
desired speed, to a custom kernel configuration file and
compile a new kernel:options CONSPEED=19200Add the
-S19200 boot
option to /boot.config, replacing
19200 with the speed to
use.Add the following options to
/boot/loader.conf. Replace
115200 with the speed to
use.boot_multicons="YES"
boot_serial="YES"
comconsole_speed="115200"
console="comconsole,vidconsole"Entering the DDB Debugger from the Serial LineTo configure the ability to drop into the kernel debugger
from the serial console, add the following options to a custom
kernel configuration file and compile the kernel using the
instructions in . Note that
while this is useful for remote diagnostics, it is also
dangerous if a spurious BREAK is generated on the serial port.
Refer to &man.ddb.4; and &man.ddb.8; for more information
about the kernel debugger.options BREAK_TO_DEBUGGER
options DDB
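Once a kernel built with these options is running, a BREAK received on the console serial port drops the system into DDB. From a tip session attached to that port, the ~# escape transmits a BREAK (the com1 entry name is an assumption; use the /etc/remote entry for the console port):

```
&prompt.root; tip com1
connected
~#
```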
diff --git a/en_US.ISO8859-1/books/handbook/zfs/chapter.xml b/en_US.ISO8859-1/books/handbook/zfs/chapter.xml
index c6d89f091a..ef1064a438 100644
--- a/en_US.ISO8859-1/books/handbook/zfs/chapter.xml
+++ b/en_US.ISO8859-1/books/handbook/zfs/chapter.xml
@@ -1,4424 +1,4424 @@
The Z File System (ZFS)TomRhodesWritten by AllanJudeWritten by BenedictReuschlingWritten by WarrenBlockWritten by The Z File System, or
ZFS, is an advanced file system designed to
overcome many of the major problems found in previous
designs.Originally developed at &sun;, ongoing open source
ZFS development has moved to the OpenZFS Project.ZFS has three major design goals:Data integrity: All data includes a
checksum of the data.
When data is written, the checksum is calculated and written
along with it. When that data is later read back, the
checksum is calculated again. If the checksums do not match,
a data error has been detected. ZFS will
attempt to automatically correct errors when data redundancy
is available.Pooled storage: physical storage devices are added to a
pool, and storage space is allocated from that shared pool.
Space is available to all file systems, and can be increased
by adding new storage devices to the pool.Performance: multiple caching mechanisms provide increased
performance. ARC is an
advanced memory-based read cache. A second level of
disk-based read cache can be added with
L2ARC, and disk-based
synchronous write cache is available with
ZIL.A complete list of features and terminology is shown in
.What Makes ZFS DifferentZFS is significantly different from any
previous file system because it is more than just a file system.
Combining the traditionally separate roles of volume manager and
file system provides ZFS with unique
advantages. The file system is now aware of the underlying
structure of the disks. Traditional file systems could only be
created on a single disk at a time. If there were two disks
then two separate file systems would have to be created. In a
traditional hardware RAID configuration, this
problem was avoided by presenting the operating system with a
single logical disk made up of the space provided by a number of
physical disks, on top of which the operating system placed a
file system. Even in the case of software
RAID solutions like those provided by
GEOM, the UFS file system
living on top of the RAID transform believed
that it was dealing with a single device.
ZFS's combination of the volume manager and
the file system solves this and allows the creation of many file
systems all sharing a pool of available storage. One of the
biggest advantages to ZFS's awareness of the
physical layout of the disks is that existing file systems can
be grown automatically when additional disks are added to the
pool. This new space is then made available to all of the file
systems. ZFS also has a number of different
properties that can be applied to each file system, giving many
advantages to creating a number of different file systems and
datasets rather than a single monolithic file system.Quick Start GuideThere is a startup mechanism that allows &os; to mount
ZFS pools during system initialization. To
enable it, add this line to
/etc/rc.conf:zfs_enable="YES"Then start the service:&prompt.root; service zfs startThe examples in this section assume three
SCSI disks with the device names
da0,
da1, and
da2. Users
of SATA hardware should instead use
ada device
names.Single Disk PoolTo create a simple, non-redundant pool using a single
disk device:&prompt.root; zpool create example/dev/da0To view the new pool, review the output of
df:&prompt.root; df
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/ad0s1a 2026030 235230 1628718 13% /
devfs 1 1 0 100% /dev
/dev/ad0s1d 54098308 1032846 48737598 2% /usr
example 17547136 0 17547136 0% /exampleThis output shows that the example pool
has been created and mounted. It is now accessible as a file
system. Files can be created on it and users can browse
it:&prompt.root; cd /example
&prompt.root; ls
&prompt.root; touch testfile
&prompt.root; ls -al
total 4
drwxr-xr-x 2 root wheel 3 Aug 29 23:15 .
drwxr-xr-x 21 root wheel 512 Aug 29 23:12 ..
-rw-r--r-- 1 root wheel 0 Aug 29 23:15 testfileHowever, this pool is not taking advantage of any
ZFS features. To create a dataset on this
pool with compression enabled:&prompt.root; zfs create example/compressed
&prompt.root; zfs set compression=gzip example/compressedThe example/compressed dataset is now a
ZFS compressed file system. Try copying
some large files to
/example/compressed.Compression can be disabled with:&prompt.root; zfs set compression=off example/compressedTo unmount a file system, use
zfs umount and then verify with
df:&prompt.root; zfs umount example/compressed
&prompt.root; df
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/ad0s1a 2026030 235232 1628716 13% /
devfs 1 1 0 100% /dev
/dev/ad0s1d 54098308 1032864 48737580 2% /usr
example 17547008 0 17547008 0% /exampleTo re-mount the file system to make it accessible again,
use zfs mount and verify with
df:&prompt.root; zfs mount example/compressed
&prompt.root; df
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/ad0s1a 2026030 235234 1628714 13% /
devfs 1 1 0 100% /dev
/dev/ad0s1d 54098308 1032864 48737580 2% /usr
example 17547008 0 17547008 0% /example
example/compressed 17547008 0 17547008 0% /example/compressedThe pool and file system may also be observed by viewing
the output from mount:&prompt.root; mount
/dev/ad0s1a on / (ufs, local)
devfs on /dev (devfs, local)
/dev/ad0s1d on /usr (ufs, local, soft-updates)
example on /example (zfs, local)
example/compressed on /example/compressed (zfs, local)After creation, ZFS datasets can be
used like any file systems. However, many other features are
available which can be set on a per-dataset basis. In the
example below, a new file system called
data is created. Important files will be
stored here, so it is configured to keep two copies of each
data block:&prompt.root; zfs create example/data
&prompt.root; zfs set copies=2 example/dataIt is now possible to see the data and space utilization
by issuing df:&prompt.root; df
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/ad0s1a 2026030 235234 1628714 13% /
devfs 1 1 0 100% /dev
/dev/ad0s1d 54098308 1032864 48737580 2% /usr
example 17547008 0 17547008 0% /example
example/compressed 17547008 0 17547008 0% /example/compressed
example/data 17547008 0 17547008 0% /example/dataNotice that each file system on the pool has the same
amount of available space. This is the reason for using
df in these examples, to show that the file
systems use only the amount of space they need and all draw
from the same pool. ZFS eliminates
concepts such as volumes and partitions, and allows multiple
file systems to occupy the same pool.To destroy the file systems and then destroy the pool as
it is no longer needed:&prompt.root; zfs destroy example/compressed
&prompt.root; zfs destroy example/data
&prompt.root; zpool destroy exampleRAID-ZDisks fail. One method of avoiding data loss from disk
failure is to implement RAID.
ZFS supports this feature in its pool
design. RAID-Z pools require three or more
disks but provide more usable space than mirrored
pools.This example creates a RAID-Z pool,
specifying the disks to add to the pool:&prompt.root; zpool create storage raidz da0 da1 da2&sun; recommends that the number of devices used in a
RAID-Z configuration be between three and
nine. For environments requiring a single pool consisting
of 10 disks or more, consider breaking it up into smaller
RAID-Z groups. If only two disks are
available and redundancy is a requirement, consider using a
ZFS mirror. Refer to &man.zpool.8; for
more details.The previous example created the
storage zpool. This example makes a new
file system called home in that
pool:&prompt.root; zfs create storage/homeCompression and keeping extra copies of directories
and files can be enabled:&prompt.root; zfs set copies=2 storage/home
&prompt.root; zfs set compression=gzip storage/homeTo make this the new home directory for users, copy the
user data to this directory and create the appropriate
symbolic links:&prompt.root; cp -rp /home/* /storage/home
&prompt.root; rm -rf /home /usr/home
&prompt.root; ln -s /storage/home /home
&prompt.root; ln -s /storage/home /usr/homeUser data is now stored on the freshly-created
/storage/home. Test by adding a new user
and logging in as that user.Try creating a file system snapshot which can be rolled
back later:&prompt.root; zfs snapshot storage/home@08-30-08Snapshots can only be made of a full file system, not a
single directory or file.The @ character is a delimiter between
the file system or volume name and the snapshot name. If an important
directory has been accidentally deleted, the file system can
be backed up, then rolled back to an earlier snapshot when the
directory still existed:&prompt.root; zfs rollback storage/home@08-30-08To list all available snapshots, run
ls in the file system's
.zfs/snapshot directory. For example, to
see the previously taken snapshot:&prompt.root; ls /storage/home/.zfs/snapshotIt is possible to write a script to perform regular
snapshots on user data. However, over time, snapshots can
consume a great deal of disk space. The previous snapshot can
be removed using the command:&prompt.root; zfs destroy storage/home@08-30-08After testing, /storage/home can be
made the real /home using this
command:&prompt.root; zfs set mountpoint=/home storage/homeRun df and mount to
confirm that the system now treats the file system as the real
/home:&prompt.root; mount
/dev/ad0s1a on / (ufs, local)
devfs on /dev (devfs, local)
/dev/ad0s1d on /usr (ufs, local, soft-updates)
storage on /storage (zfs, local)
storage/home on /home (zfs, local)
&prompt.root; df
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/ad0s1a 2026030 235240 1628708 13% /
devfs 1 1 0 100% /dev
/dev/ad0s1d 54098308 1032826 48737618 2% /usr
storage 26320512 0 26320512 0% /storage
storage/home 26320512 0 26320512 0% /homeThis completes the RAID-Z
configuration. Daily status updates about the file systems
created can be generated as part of the nightly
&man.periodic.8; runs. Add this line to
/etc/periodic.conf:daily_status_zfs_enable="YES"Recovering RAID-ZEvery software RAID has a method of
monitoring its state. The status of
RAID-Z devices may be viewed with this
command:&prompt.root; zpool status -xIf all pools are
Online and everything
is normal, the message shows:all pools are healthyIf there is an issue, perhaps a disk is in the
Offline state, the
pool state will look similar to: pool: storage
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
storage DEGRADED 0 0 0
raidz1 DEGRADED 0 0 0
da0 ONLINE 0 0 0
da1 OFFLINE 0 0 0
da2 ONLINE 0 0 0
errors: No known data errorsThis indicates that the device was previously taken
offline by the administrator with this command:&prompt.root; zpool offline storage da1Now the system can be powered down to replace
da1. When the system is back online,
the failed disk can be replaced in the pool:&prompt.root; zpool replace storage da1From here, the status may be checked again, this time
without -x so that all pools are
shown:&prompt.root; zpool status storage
pool: storage
state: ONLINE
scrub: resilver completed with 0 errors on Sat Aug 30 19:44:11 2008
config:
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
raidz1 ONLINE 0 0 0
da0 ONLINE 0 0 0
da1 ONLINE 0 0 0
da2 ONLINE 0 0 0
errors: No known data errorsIn this example, everything is normal.Data VerificationZFS uses checksums to verify the
integrity of stored data. These are enabled automatically
upon creation of file systems.Checksums can be disabled, but it is
not recommended! Checksums take very
little storage space and provide data integrity. Many
ZFS features will not work properly with
checksums disabled. There is no noticeable performance gain
from disabling these checksums.Checksum verification is known as
scrubbing. Verify the data integrity of
the storage pool with this command:&prompt.root; zpool scrub storageThe duration of a scrub depends on the amount of data
stored. Larger amounts of data will take proportionally
longer to verify. Scrubs are very I/O
intensive, and only one scrub is allowed to run at a time.
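Because only one scrub is allowed at a time, an automation script should check whether a scrub is already in progress before requesting another. The following is a minimal sketch, not part of the Handbook's examples: it runs against sample zpool status text via a here-document so it works without a live pool, with the real command shown in a comment.

```shell
# Check for a scrub already in progress before starting another.
# The here-document stands in for `zpool status storage` output so
# the sketch runs without a real pool.
scrub_running() {
    # Live version: zpool status "$1" | grep -q 'scrub in progress'
    grep -q 'scrub in progress' <<EOF
  pool: storage
 state: ONLINE
  scan: scrub in progress since Wed Feb 19 20:52:54 2014
EOF
}

if scrub_running storage; then
    echo "scrub already in progress"
else
    # zpool scrub storage
    echo "starting scrub"
fi
```

On a live system, replacing the here-document with the real zpool status pipeline makes this usable from &man.cron.8;.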
After the scrub completes, the status can be viewed with
status:&prompt.root; zpool status storage
pool: storage
state: ONLINE
scrub: scrub completed with 0 errors on Sat Jan 26 19:57:37 2013
config:
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
raidz1 ONLINE 0 0 0
da0 ONLINE 0 0 0
da1 ONLINE 0 0 0
da2 ONLINE 0 0 0
errors: No known data errorsThe completion date of the last scrub operation is
displayed to help track when another scrub is required.
Routine scrubs help protect data from silent corruption and
ensure the integrity of the pool.Refer to &man.zfs.8; and &man.zpool.8; for other
ZFS options.zpool AdministrationZFS administration is divided between two
main utilities. The zpool utility controls
the operation of the pool and deals with adding, removing,
replacing, and managing disks. The
zfs utility
deals with creating, destroying, and managing datasets,
both file systems and
volumes.Creating and Destroying Storage PoolsCreating a ZFS storage pool
(zpool) involves making a number of
decisions that are relatively permanent because the structure
of the pool cannot be changed after the pool has been created.
The most important decision is what types of vdevs into which
to group the physical disks. See the list of
vdev types for details
about the possible options. After the pool has been created,
most vdev types do not allow additional disks to be added to
the vdev. The exceptions are mirrors, which allow additional
disks to be added to the vdev, and stripes, which can be
upgraded to mirrors by attaching an additional disk to the
vdev. Although additional vdevs can be added to expand a
pool, the layout of the pool cannot be changed after pool
creation. Instead, the data must be backed up and the
pool destroyed and recreated.Create a simple mirror pool:&prompt.root; zpool create mypool mirror /dev/ada1 /dev/ada2
&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada1 ONLINE 0 0 0
ada2 ONLINE 0 0 0
errors: No known data errorsMultiple vdevs can be created at once. Specify multiple
groups of disks separated by the vdev type keyword,
mirror in this example:&prompt.root; zpool create mypool mirror /dev/ada1 /dev/ada2 mirror /dev/ada3 /dev/ada4
&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada1 ONLINE 0 0 0
ada2 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
ada3 ONLINE 0 0 0
ada4 ONLINE 0 0 0
errors: No known data errorsPools can also be constructed using partitions rather than
whole disks. Putting ZFS in a separate
partition allows the same disk to have other partitions for
other purposes. In particular, partitions with bootcode and
file systems needed for booting can be added. This allows
booting from disks that are also members of a pool. There is
no performance penalty on &os; when using a partition rather
than a whole disk. Using partitions also allows the
administrator to under-provision the
disks, using less than the full capacity. If a future
replacement disk of the same nominal size as the original
actually has a slightly smaller capacity, the smaller
partition will still fit, and the replacement disk can still
be used.Create a
RAID-Z2 pool using
partitions:&prompt.root; zpool create mypool raidz2 /dev/ada0p3 /dev/ada1p3 /dev/ada2p3 /dev/ada3p3 /dev/ada4p3 /dev/ada5p3
&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
ada2p3 ONLINE 0 0 0
ada3p3 ONLINE 0 0 0
ada4p3 ONLINE 0 0 0
ada5p3 ONLINE 0 0 0
errors: No known data errorsA pool that is no longer needed can be destroyed so that
the disks can be reused. Destroying a pool involves first
unmounting all of the datasets in that pool. If the datasets
are in use, the unmount operation will fail and the pool will
not be destroyed. The destruction of the pool can be forced
with -f, but this can cause undefined
behavior in applications which had open files on those
datasets.Adding and Removing DevicesThere are two cases for adding disks to a zpool: attaching
a disk to an existing vdev with
zpool attach, or adding vdevs to the pool
with zpool add. Only some
vdev types allow disks to
be added to the vdev after creation.A pool created with a single disk lacks redundancy.
Corruption can be detected but
not repaired, because there is no other copy of the data.
The copies property may
be able to recover from a small failure such as a bad sector,
but does not provide the same level of protection as mirroring
or RAID-Z. Starting with a pool consisting
of a single disk vdev, zpool attach can be
used to add an additional disk to the vdev, creating a mirror.
zpool attach can also be used to add
additional disks to a mirror group, increasing redundancy and
read performance. If the disks being used for the pool are
partitioned, replicate the layout of the first disk onto the
second; gpart backup and
gpart restore can be used to make this
process easier.Upgrade the single disk (stripe) vdev
ada0p3 to a mirror by attaching
ada1p3:&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
errors: No known data errors
&prompt.root; zpool attach mypool ada0p3 ada1p3
Make sure to wait until resilver is done before rebooting.
If you boot from pool 'mypool', you may need to update
boot code on newly attached disk 'ada1p3'.
Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0
&prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1
bootcode written to ada1
&prompt.root; zpool status
pool: mypool
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Fri May 30 08:19:19 2014
527M scanned out of 781M at 47.9M/s, 0h0m to go
527M resilvered, 67.53% done
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0 (resilvering)
errors: No known data errors
&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: resilvered 781M in 0h0m with 0 errors on Fri May 30 08:15:58 2014
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
errors: No known data errorsWhen adding disks to the existing vdev is not an option,
as for RAID-Z, an alternative method is to
add another vdev to the pool. Additional vdevs provide higher
performance, distributing writes across the vdevs. Each vdev
is responsible for providing its own redundancy. It is
possible, but discouraged, to mix vdev types, like
mirror and RAID-Z.
Adding a non-redundant vdev to a pool containing mirror or
RAID-Z vdevs risks the data on the entire
pool. Writes are distributed, so the failure of the
non-redundant disk will result in the loss of a fraction of
every block that has been written to the pool.Data is striped across each of the vdevs. For example,
with two mirror vdevs, this is effectively a
RAID 10 that stripes writes across two sets
of mirrors. Space is allocated so that each vdev reaches 100%
full at the same time. There is a performance penalty if the
vdevs have different amounts of free space, as a
disproportionate amount of the data is written to the less
full vdev.When attaching additional devices to a boot pool, remember
to update the bootcode.Attach a second mirror group (ada2p3
and ada3p3) to the existing
mirror:&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: resilvered 781M in 0h0m with 0 errors on Fri May 30 08:19:35 2014
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
errors: No known data errors
&prompt.root; zpool add mypool mirror ada2p3 ada3p3
&prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada2
bootcode written to ada2
&prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada3
bootcode written to ada3
&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 30 08:29:51 2014
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
ada2p3 ONLINE 0 0 0
ada3p3 ONLINE 0 0 0
errors: No known data errorsCurrently, vdevs cannot be removed from a pool, and disks
can only be removed from a mirror if there is enough remaining
redundancy. If only one disk in a mirror group remains, it
ceases to be a mirror and reverts to being a stripe, risking
the entire pool if that remaining disk fails.Remove a disk from a three-way mirror group:&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 30 08:29:51 2014
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
ada2p3 ONLINE 0 0 0
errors: No known data errors
&prompt.root; zpool detach mypool ada2p3
&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 30 08:29:51 2014
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
errors: No known data errorsChecking the Status of a PoolPool status is important. If a drive goes offline or a
read, write, or checksum error is detected, the corresponding
error count increases. The status output
shows the configuration and status of each device in the pool
and the status of the entire pool. Actions that need to be
taken and details about the last scrub
are also shown.&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: scrub repaired 0 in 2h25m with 0 errors on Sat Sep 14 04:25:50 2013
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
ada2p3 ONLINE 0 0 0
ada3p3 ONLINE 0 0 0
ada4p3 ONLINE 0 0 0
ada5p3 ONLINE 0 0 0
errors: No known data errorsClearing ErrorsWhen an error is detected, the read, write, or checksum
counts are incremented. The error message can be cleared and
the counts reset with zpool clear
mypool. Clearing the
error state can be important for automated scripts that alert
the administrator when the pool encounters an error. Further
errors may not be reported if the old errors are not
cleared.Replacing a Functioning DeviceThere are a number of situations where it may be
desirable to replace one disk with a different disk. When
replacing a working disk, the process keeps the old disk
online during the replacement. The pool never enters a
degraded state,
reducing the risk of data loss.
zpool replace copies all of the data from
the old disk to the new one. After the operation completes,
the old disk is disconnected from the vdev. If the new disk
is larger than the old disk, it may be possible to grow the
zpool, using the new space. See Growing a Pool.Replace a functioning device in the pool:&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
errors: No known data errors
&prompt.root; zpool replace mypool ada1p3 ada2p3
Make sure to wait until resilver is done before rebooting.
If you boot from pool 'zroot', you may need to update
boot code on newly attached disk 'ada2p3'.
Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0
&prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada2
&prompt.root; zpool status
pool: mypool
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Jun 2 14:21:35 2014
604M scanned out of 781M at 46.5M/s, 0h0m to go
604M resilvered, 77.39% done
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
replacing-1 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
ada2p3 ONLINE 0 0 0 (resilvering)
errors: No known data errors
&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: resilvered 781M in 0h0m with 0 errors on Mon Jun 2 14:21:52 2014
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada2p3 ONLINE 0 0 0
errors: No known data errorsDealing with Failed DevicesWhen a disk in a pool fails, the vdev to which the disk
belongs enters the
degraded state. All
of the data is still available, but performance may be reduced
because missing data must be calculated from the available
redundancy. To restore the vdev to a fully functional state,
the failed physical device must be replaced.
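Scripts that watch for this condition can identify the failed member by matching the UNAVAIL state in zpool status output. This is a minimal sketch, not a Handbook command: the sample text is modeled on the degraded-pool output shown in the example below, and a live version would pipe zpool status mypool into awk instead.

```shell
# List devices that `zpool status` reports as UNAVAIL, so an alerting
# script knows which member must be replaced. Sample text stands in
# for live output; substitute `zpool status mypool` on a real system.
sample='        NAME                     STATE     READ WRITE CKSUM
        mypool                   DEGRADED     0     0     0
          mirror-0               DEGRADED     0     0     0
            ada0p3               ONLINE       0     0     0
            316502962686821739   UNAVAIL      0     0     0  was /dev/ada1p3'

printf '%s\n' "$sample" | awk '$2 == "UNAVAIL" { print $1 }'
```

The printed field is the device name or, for a disk that can no longer be opened, its GUID.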
ZFS is then instructed to begin the
resilver operation.
Data that was on the failed device is recalculated from
available redundancy and written to the replacement device.
After completion, the vdev returns to
online status.If the vdev does not have any redundancy, or if multiple
devices have failed and there is not enough redundancy to
compensate, the pool enters the
faulted state. If a
sufficient number of devices cannot be reconnected to the
pool, the pool becomes inoperative and data must be restored
from backups.When replacing a failed disk, the name of the failed disk
is replaced with the GUID of the device.
A new device name parameter for
zpool replace is not required if the
replacement device has the same device name.Replace a failed disk using
zpool replace:&prompt.root; zpool status
pool: mypool
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://illumos.org/msg/ZFS-8000-2Q
scan: none requested
config:
NAME STATE READ WRITE CKSUM
mypool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ada0p3 ONLINE 0 0 0
316502962686821739 UNAVAIL 0 0 0 was /dev/ada1p3
errors: No known data errors
&prompt.root; zpool replace mypool 316502962686821739 ada2p3
&prompt.root; zpool status
pool: mypool
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Jun 2 14:52:21 2014
641M scanned out of 781M at 49.3M/s, 0h0m to go
640M resilvered, 82.04% done
config:
NAME STATE READ WRITE CKSUM
mypool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ada0p3 ONLINE 0 0 0
replacing-1 UNAVAIL 0 0 0
15732067398082357289 UNAVAIL 0 0 0 was /dev/ada1p3/old
ada2p3 ONLINE 0 0 0 (resilvering)
errors: No known data errors
&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: resilvered 781M in 0h0m with 0 errors on Mon Jun 2 14:52:38 2014
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada2p3 ONLINE 0 0 0
errors: No known data errorsScrubbing a PoolIt is recommended that pools be
scrubbed regularly,
ideally at least once every month. The
scrub operation is very disk-intensive and
will reduce performance while running. Avoid high-demand
periods when scheduling a scrub, or use vfs.zfs.scrub_delay
to adjust the relative priority of the
scrub to prevent it from interfering with other
workloads.&prompt.root; zpool scrub mypool
&prompt.root; zpool status
pool: mypool
state: ONLINE
scan: scrub in progress since Wed Feb 19 20:52:54 2014
116G scanned out of 8.60T at 649M/s, 3h48m to go
0 repaired, 1.32% done
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
ada2p3 ONLINE 0 0 0
ada3p3 ONLINE 0 0 0
ada4p3 ONLINE 0 0 0
ada5p3 ONLINE 0 0 0
errors: No known data errorsIn the event that a scrub operation needs to be cancelled,
issue zpool scrub -s
mypool.Self-HealingThe checksums stored with data blocks enable the file
system to self-heal. This feature will
automatically repair data whose checksum does not match the
one recorded on another device that is part of the storage
pool. Consider, for example, a mirror with two disks where one drive is
starting to malfunction and cannot properly store the data any
more. This is even worse when the data has not been accessed
for a long time, as with long term archive storage.
Traditional file systems need to run algorithms that check and
repair the data like &man.fsck.8;. These commands take time,
and in severe cases, an administrator has to manually decide
which repair operation must be performed. When
ZFS detects a data block with a checksum
that does not match, it tries to read the data from the mirror
disk. If that disk can provide the correct data, it will not
only give that data to the application requesting it, but also
correct the wrong data on the disk that had the bad checksum.
This happens without any interaction from a system
administrator during normal pool operation.The next example demonstrates this self-healing behavior.
A mirrored pool of disks /dev/ada0 and
/dev/ada1 is created.&prompt.root; zpool create healer mirror /dev/ada0 /dev/ada1
&prompt.root; zpool status healer
pool: healer
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
healer ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0 ONLINE 0 0 0
ada1 ONLINE 0 0 0
errors: No known data errors
&prompt.root; zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
healer 960M 92.5K 960M - - 0% 0% 1.00x ONLINE -Some important data to be protected from data errors
using the self-healing feature is copied to the pool. A
checksum of the pool is created for later comparison.&prompt.root; cp /some/important/data /healer
&prompt.root; zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
healer 960M 67.7M 892M 7% 1.00x ONLINE -
&prompt.root; sha1 /healer > checksum.txt
&prompt.root; cat checksum.txt
SHA1 (/healer) = 2753eff56d77d9a536ece6694bf0a82740344d1fData corruption is simulated by writing random data to the
beginning of one of the disks in the mirror. To prevent
ZFS from healing the data as soon as it is
detected, the pool is exported before the corruption and
imported again afterwards.This is a dangerous operation that can destroy vital
data. It is shown here for demonstration purposes only
and should not be attempted during normal operation of a
storage pool. Nor should this intentional corruption
example be run on any disk with a different file system on
it. Do not use any disk device names other than the
ones that are part of the pool. Make certain that proper
backups of the pool are created before running the
command!&prompt.root; zpool export healer
&prompt.root; dd if=/dev/random of=/dev/ada1 bs=1m count=200
200+0 records in
200+0 records out
209715200 bytes transferred in 62.992162 secs (3329227 bytes/sec)
&prompt.root; zpool import healerThe pool status shows that one device has experienced an
error. Note that applications reading data from the pool did
not receive any incorrect data. ZFS
provided data from the ada0 device with
the correct checksums. The device with the wrong checksum can
be found easily as the CKSUM column
contains a nonzero value.&prompt.root; zpool status healer
pool: healer
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-4J
scan: none requested
config:
NAME STATE READ WRITE CKSUM
healer ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0 ONLINE 0 0 0
ada1 ONLINE 0 0 1
errors: No known data errorsThe error was detected and handled by using the redundancy
present in the unaffected ada0 mirror
disk. A checksum comparison with the original one will reveal
whether the pool is consistent again.&prompt.root; sha1 /healer >> checksum.txt
&prompt.root; cat checksum.txt
SHA1 (/healer) = 2753eff56d77d9a536ece6694bf0a82740344d1f
SHA1 (/healer) = 2753eff56d77d9a536ece6694bf0a82740344d1fThe two checksums that were generated before and after the
intentional tampering with the pool data still match. This
shows how ZFS is capable of detecting and
correcting any errors automatically when the checksums differ.
Note that this is only possible when there is enough
redundancy present in the pool. A pool consisting of a single
device has no self-healing capabilities. That is also the
reason why checksums are so important in
ZFS and should not be disabled for any
reason. No &man.fsck.8; or similar file system consistency
check program is required to detect and correct this and the
pool was still available during the time there was a problem.
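The per-block repair logic described above can be illustrated with a toy sketch. This is in no way how ZFS is implemented internally: two plain files stand in for the mirror halves, and sha256sum (sha256 on &os;) stands in for the block checksums that ZFS maintains itself.

```shell
# Toy model of mirror self-healing: when one copy's checksum no longer
# matches, rewrite it from the copy that still verifies. Plain files
# and sha256sum are stand-ins; ZFS does this per block, internally.
printf 'good data\n' > /tmp/mirror_a
printf 'good data\n' > /tmp/mirror_b
expected=$(sha256sum /tmp/mirror_a | cut -d' ' -f1)

# Simulate silent corruption of one half of the mirror.
printf 'bad data\n' > /tmp/mirror_b

if [ "$(sha256sum /tmp/mirror_b | cut -d' ' -f1)" != "$expected" ]; then
    cp /tmp/mirror_a /tmp/mirror_b    # heal the bad copy from the good one
    echo "repaired /tmp/mirror_b"
fi
```

As in the pool example, the repair is driven entirely by a checksum mismatch against a copy that still verifies.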
A scrub operation is now required to overwrite the corrupted
data on ada1.&prompt.root; zpool scrub healer
&prompt.root; zpool status healer
pool: healer
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-4J
scan: scrub in progress since Mon Dec 10 12:23:30 2012
10.4M scanned out of 67.0M at 267K/s, 0h3m to go
9.63M repaired, 15.56% done
config:
NAME STATE READ WRITE CKSUM
healer ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0 ONLINE 0 0 0
ada1 ONLINE 0 0 627 (repairing)
errors: No known data errorsThe scrub operation reads data from
ada0 and rewrites any data with an
incorrect checksum on ada1. This is
indicated by the (repairing) output from
zpool status. After the operation is
complete, the pool status changes to:&prompt.root; zpool status healer
pool: healer
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-4J
scan: scrub repaired 66.5M in 0h2m with 0 errors on Mon Dec 10 12:26:25 2012
config:
NAME STATE READ WRITE CKSUM
healer ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0 ONLINE 0 0 0
ada1 ONLINE 0 0 2.72K
errors: No known data errorsAfter the scrub operation completes and all the data
has been synchronized from ada0 to
ada1, the error messages can be
cleared from the pool
status by running zpool clear.&prompt.root; zpool clear healer
&prompt.root; zpool status healer
pool: healer
state: ONLINE
scan: scrub repaired 66.5M in 0h2m with 0 errors on Mon Dec 10 12:26:25 2012
config:
NAME STATE READ WRITE CKSUM
healer ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0 ONLINE 0 0 0
ada1 ONLINE 0 0 0
errors: No known data errorsThe pool is now back to a fully working state and all the
errors have been cleared.Growing a PoolThe usable size of a redundant pool is limited by the
capacity of the smallest device in each vdev. The smallest
device can be replaced with a larger device. After completing
a replace or
resilver operation,
the pool can grow to use the capacity of the new device. For
example, consider a mirror of a 1 TB drive and a
2 TB drive. The usable space is 1 TB. When the
1 TB drive is replaced with another 2 TB drive, the
resilvering process copies the existing data onto the new
drive. As
both of the devices now have 2 TB capacity, the mirror's
available space can be grown to 2 TB.Expansion is triggered by using
zpool online -e on each device. After
expansion of all devices, the additional space becomes
available to the pool.Importing and Exporting PoolsPools are exported before moving them
to another system. All datasets are unmounted, and each
device is marked as exported but still locked so it cannot be
used by other disk subsystems. This allows pools to be
imported on other machines, other
operating systems that support ZFS, and
even different hardware architectures (with some caveats, see
&man.zpool.8;). When a dataset has open files,
zpool export -f can be used to force the
export of a pool. Use this with caution. The datasets are
forcibly unmounted, potentially resulting in unexpected
behavior by the applications which had open files on those
datasets.Export a pool that is not in use:&prompt.root; zpool export mypoolImporting a pool automatically mounts the datasets. This
may not be the desired behavior, and can be prevented with
zpool import -N.
zpool import -o sets temporary properties
for this import only.
zpool import altroot= allows importing a
pool with a base mount point instead of the root of the file
system. If the pool was last used on a different system and
was not properly exported, an import might have to be forced
with zpool import -f.
zpool import -a imports all pools that do
not appear to be in use by another system.List all available pools for import:&prompt.root; zpool import
pool: mypool
id: 9930174748043525076
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
mypool ONLINE
ada2p3 ONLINEImport the pool with an alternative root directory:&prompt.root; zpool import -o altroot=/mntmypool
&prompt.root; zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 110K 47.0G 31K /mnt/mypoolUpgrading a Storage PoolAfter upgrading &os;, or if a pool has been imported from
a system using an older version of ZFS, the
pool can be manually upgraded to the latest version of
ZFS to support newer features. Consider
whether the pool may ever need to be imported on an older
system before upgrading. Upgrading is a one-way process.
Older pools can be upgraded, but pools with newer features
cannot be downgraded.Upgrade a v28 pool to support
Feature Flags:&prompt.root; zpool status
pool: mypool
state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on software that does not support feature
flags.
scan: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0 ONLINE 0 0 0
ada1 ONLINE 0 0 0
errors: No known data errors
&prompt.root; zpool upgrade
This system supports ZFS pool feature flags.
The following pools are formatted with legacy version numbers and can
be upgraded to use feature flags. After being upgraded, these pools
will no longer be accessible by software that does not support feature
flags.
VER POOL
--- ------------
28 mypool
Use 'zpool upgrade -v' for a list of available legacy versions.
Every feature flags pool has all supported features enabled.
&prompt.root; zpool upgrade mypool
This system supports ZFS pool feature flags.
Successfully upgraded 'mypool' from version 28 to feature flags.
Enabled the following features on 'mypool':
async_destroy
empty_bpobj
lz4_compress
multi_vdev_crash_dumpThe newer features of ZFS will not be
available until zpool upgrade has
completed. zpool upgrade -v can be used to
see what new features will be provided by upgrading, as well
as which features are already supported.Upgrade a pool to support additional feature flags:&prompt.root; zpool status
pool: mypool
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: none requested
config:
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0 ONLINE 0 0 0
ada1 ONLINE 0 0 0
errors: No known data errors
&prompt.root; zpool upgrade
This system supports ZFS pool feature flags.
All pools are formatted using feature flags.
Some supported features are not enabled on the following pools. Once a
feature is enabled the pool may become incompatible with software
that does not support the feature. See zpool-features(7) for details.
POOL FEATURE
---------------
zstore
multi_vdev_crash_dump
spacemap_histogram
enabled_txg
hole_birth
extensible_dataset
bookmarks
filesystem_limits
&prompt.root; zpool upgrade mypool
This system supports ZFS pool feature flags.
Enabled the following features on 'mypool':
spacemap_histogram
enabled_txg
hole_birth
extensible_dataset
bookmarks
filesystem_limitsThe boot code on systems that boot from a pool must be
updated to support the new pool version. Use
gpart bootcode on the partition that
contains the boot code. There are two types of bootcode
available, depending on how the system boots:
GPT (the most common option) and
EFI (for more modern systems).For legacy boot using GPT, use the following
command:&prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1For systems using EFI to boot, execute the following
command:&prompt.root; gpart bootcode -p /boot/boot1.efifat -i 1 ada1Apply the bootcode to all bootable disks in the pool.
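Before writing the bootcode, the partition layout can be verified with gpart show; the index of the boot partition is the value to pass with -i. The disk name here is an example:

```shell
# Show the partition table of the example disk ada1. The index of
# the freebsd-boot (legacy GPT boot) or efi partition is the value
# to pass to gpart bootcode with -i.
gpart show ada1
```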
See &man.gpart.8; for more information.Displaying Recorded Pool HistoryCommands that modify the pool are recorded. Recorded
actions include the creation of datasets, changing properties,
or replacement of a disk. This history is useful for
reviewing how a pool was created and which user performed a
specific action and when. History is not kept in a log file,
but is part of the pool itself. The command to review this
history is aptly named
zpool history:&prompt.root; zpool history
History for 'tank':
2013-02-26.23:02:35 zpool create tank mirror /dev/ada0 /dev/ada1
2013-02-27.18:50:58 zfs set atime=off tank
2013-02-27.18:51:09 zfs set checksum=fletcher4 tank
2013-02-27.18:51:18 zfs create tank/backupThe output shows zpool and
zfs commands that were executed on the pool
along with a timestamp. Only commands that alter the pool in
some way are recorded. Commands like
zfs list are not included. When no pool
name is specified, the history of all pools is
displayed.zpool history can show even more
information when the options -i or -l
are provided. -i
displays user-initiated events as well as internally logged
ZFS events.&prompt.root; zpool history -i
History for 'tank':
2013-02-26.23:02:35 [internal pool create txg:5] pool spa 28; zfs spa 28; zpl 5;uts 9.1-RELEASE 901000 amd64
2013-02-27.18:50:53 [internal property set txg:50] atime=0 dataset = 21
2013-02-27.18:50:58 zfs set atime=off tank
2013-02-27.18:51:04 [internal property set txg:53] checksum=7 dataset = 21
2013-02-27.18:51:09 zfs set checksum=fletcher4 tank
2013-02-27.18:51:13 [internal create txg:55] dataset = 39
2013-02-27.18:51:18 zfs create tank/backupMore details can be shown by adding -l.
History records are shown in a long format, including
information like the name of the user who issued the command
and the hostname on which the change was made.&prompt.root; zpool history -l
History for 'tank':
2013-02-26.23:02:35 zpool create tank mirror /dev/ada0 /dev/ada1 [user 0 (root) on :global]
2013-02-27.18:50:58 zfs set atime=off tank [user 0 (root) on myzfsbox:global]
2013-02-27.18:51:09 zfs set checksum=fletcher4 tank [user 0 (root) on myzfsbox:global]
2013-02-27.18:51:18 zfs create tank/backup [user 0 (root) on myzfsbox:global]The output shows that the
root user created
the mirrored pool with disks
/dev/ada0 and
/dev/ada1. The hostname
myzfsbox is also
shown in the commands after the pool's creation. The hostname
display becomes important when the pool is exported from one
system and imported on another. The commands that are issued
on the other system can clearly be distinguished by the
hostname that is recorded for each command.Both options to zpool history can be
combined to give the most detailed information possible for
any given pool. Pool history provides valuable information
when tracking down the actions that were performed or when
more detailed output is needed for debugging.Performance MonitoringA built-in monitoring system can display pool
I/O statistics in real time. It shows the
amount of free and used space on the pool, how many read and
write operations are being performed per second, and how much
I/O bandwidth is currently being utilized.
By default, all pools in the system are monitored and
displayed. A pool name can be provided to limit monitoring to
just that pool. A basic example:&prompt.root; zpool iostat
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
data 288G 1.53T 2 11 11.3K 57.1KTo continuously monitor I/O activity, a
number can be specified as the last parameter, indicating an
interval in seconds to wait between updates. The next
statistic line is printed after each interval. Press
Ctrl+C to stop this continuous monitoring.
Alternatively, give a second number on the command line after
the interval to specify the total number of statistics to
display.Even more detailed I/O statistics can
be displayed with -v. Each device in the
pool is shown with a statistics line. This is useful in
seeing how many read and write operations are being performed
on each device, and can help determine if any individual
device is slowing down the pool. This example shows a
mirrored pool with two devices:&prompt.root; zpool iostat -v
capacity operations bandwidth
pool alloc free read write read write
----------------------- ----- ----- ----- ----- ----- -----
data 288G 1.53T 2 12 9.23K 61.5K
mirror 288G 1.53T 2 12 9.23K 61.5K
ada1 - - 0 4 5.61K 61.7K
ada2 - - 1 4 5.04K 61.7K
----------------------- ----- ----- ----- ----- ----- -----Splitting a Storage PoolA pool consisting of one or more mirror vdevs can be split
into two pools. Unless otherwise specified, the last member
of each mirror is detached and used to create a new pool
containing the same data. The operation should first be
attempted with -n. The details of the
proposed operation are displayed without it actually being
performed. This helps confirm that the operation will do what
the user intends.zfs AdministrationThe zfs utility is responsible for
creating, destroying, and managing all ZFS
datasets that exist within a pool. The pool is managed using
zpool.Creating and Destroying DatasetsUnlike traditional disks and volume managers, space in
ZFS is not
preallocated. With traditional file systems, after all of the
space is partitioned and assigned, there is no way to add an
additional file system without adding a new disk. With
ZFS, new file systems can be created at any
time. Each dataset
has properties including features like compression,
deduplication, caching, and quotas, as well as other useful
properties like readonly, case sensitivity, network file
sharing, and a mount point. Datasets can be nested inside
each other, and child datasets will inherit properties from
their parents. Each dataset can be administered,
delegated,
replicated,
snapshotted,
jailed, and destroyed as a
unit. There are many advantages to creating a separate
dataset for each different type or set of files. The only
drawbacks to having an extremely large number of datasets is
that some commands like zfs list will be
slower, and the mounting of hundreds or even thousands of
datasets can slow the &os; boot process.Create a new dataset and enable LZ4
compression on it:&prompt.root; zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 781M 93.2G 144K none
mypool/ROOT 777M 93.2G 144K none
mypool/ROOT/default 777M 93.2G 777M /
mypool/tmp 176K 93.2G 176K /tmp
mypool/usr 616K 93.2G 144K /usr
mypool/usr/home 184K 93.2G 184K /usr/home
mypool/usr/ports 144K 93.2G 144K /usr/ports
mypool/usr/src 144K 93.2G 144K /usr/src
mypool/var 1.20M 93.2G 608K /var
mypool/var/crash 148K 93.2G 148K /var/crash
mypool/var/log 178K 93.2G 178K /var/log
mypool/var/mail 144K 93.2G 144K /var/mail
mypool/var/tmp 152K 93.2G 152K /var/tmp
&prompt.root; zfs create -o compress=lz4 mypool/usr/mydataset
&prompt.root; zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 781M 93.2G 144K none
mypool/ROOT 777M 93.2G 144K none
mypool/ROOT/default 777M 93.2G 777M /
mypool/tmp 176K 93.2G 176K /tmp
mypool/usr 704K 93.2G 144K /usr
mypool/usr/home 184K 93.2G 184K /usr/home
mypool/usr/mydataset 87.5K 93.2G 87.5K /usr/mydataset
mypool/usr/ports 144K 93.2G 144K /usr/ports
mypool/usr/src 144K 93.2G 144K /usr/src
mypool/var 1.20M 93.2G 610K /var
mypool/var/crash 148K 93.2G 148K /var/crash
mypool/var/log 178K 93.2G 178K /var/log
mypool/var/mail 144K 93.2G 144K /var/mail
mypool/var/tmp 152K 93.2G 152K /var/tmpDestroying a dataset is much quicker than deleting all
of the files that reside on the dataset, as it does not
involve scanning all of the files and updating all of the
corresponding metadata.Destroy the previously-created dataset:&prompt.root; zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 880M 93.1G 144K none
mypool/ROOT 777M 93.1G 144K none
mypool/ROOT/default 777M 93.1G 777M /
mypool/tmp 176K 93.1G 176K /tmp
mypool/usr 101M 93.1G 144K /usr
mypool/usr/home 184K 93.1G 184K /usr/home
mypool/usr/mydataset 100M 93.1G 100M /usr/mydataset
mypool/usr/ports 144K 93.1G 144K /usr/ports
mypool/usr/src 144K 93.1G 144K /usr/src
mypool/var 1.20M 93.1G 610K /var
mypool/var/crash 148K 93.1G 148K /var/crash
mypool/var/log 178K 93.1G 178K /var/log
mypool/var/mail 144K 93.1G 144K /var/mail
mypool/var/tmp 152K 93.1G 152K /var/tmp
&prompt.root; zfs destroy mypool/usr/mydataset
&prompt.root; zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 781M 93.2G 144K none
mypool/ROOT 777M 93.2G 144K none
mypool/ROOT/default 777M 93.2G 777M /
mypool/tmp 176K 93.2G 176K /tmp
mypool/usr 616K 93.2G 144K /usr
mypool/usr/home 184K 93.2G 184K /usr/home
mypool/usr/ports 144K 93.2G 144K /usr/ports
mypool/usr/src 144K 93.2G 144K /usr/src
mypool/var 1.21M 93.2G 612K /var
mypool/var/crash 148K 93.2G 148K /var/crash
mypool/var/log 178K 93.2G 178K /var/log
mypool/var/mail 144K 93.2G 144K /var/mail
mypool/var/tmp 152K 93.2G 152K /var/tmpIn modern versions of ZFS,
zfs destroy is asynchronous, and the free
space might take several minutes to appear in the pool. Use
zpool get freeing
poolname to see the
freeing property, indicating how many
datasets are having their blocks freed in the background.
If there are child datasets, like
snapshots or other
datasets, then the parent cannot be destroyed. To destroy a
dataset and all of its children, use -r to
recursively destroy the dataset and all of its children.
Use -n -v to list datasets
and snapshots that would be destroyed by this operation, but
do not actually destroy anything. Space that would be
reclaimed by destruction of snapshots is also shown.Creating and Destroying VolumesA volume is a special type of dataset. Rather than being
mounted as a file system, it is exposed as a block device
under
/dev/zvol/poolname/dataset.
This allows the volume to be used for other file systems, to
back the disks of a virtual machine, or to be exported using
protocols like iSCSI or
HAST.A volume can be formatted with any file system, or used
without a file system to store raw data. To the user, a
volume appears to be a regular disk. Putting ordinary file
systems on these zvols provides features
that ordinary disks or file systems do not normally have. For
example, using the compression property on a 250 MB
volume allows creation of a compressed FAT
file system.&prompt.root; zfs create -V 250m -o compression=on tank/fat32
&prompt.root; zfs list tank
NAME USED AVAIL REFER MOUNTPOINT
tank 258M 670M 31K /tank
&prompt.root; newfs_msdos -F32 /dev/zvol/tank/fat32
&prompt.root; mount -t msdosfs /dev/zvol/tank/fat32 /mnt
&prompt.root; df -h /mnt | grep fat32
Filesystem Size Used Avail Capacity Mounted on
/dev/zvol/tank/fat32 249M 24k 249M 0% /mnt
&prompt.root; mount | grep fat32
/dev/zvol/tank/fat32 on /mnt (msdosfs, local)Destroying a volume is much the same as destroying a
regular file system dataset. The operation is nearly
instantaneous, but it may take several minutes for the free
space to be reclaimed in the background.Renaming a DatasetThe name of a dataset can be changed with
zfs rename. The parent of a dataset can
also be changed with this command. Renaming a dataset to be
under a different parent dataset will change the value of
those properties that are inherited from the parent dataset.
When a dataset is renamed, it is unmounted and then remounted
in the new location (which is inherited from the new parent
dataset). This behavior can be prevented with
-u.Rename a dataset and move it to be under a different
parent dataset:&prompt.root; zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 780M 93.2G 144K none
mypool/ROOT 777M 93.2G 144K none
mypool/ROOT/default 777M 93.2G 777M /
mypool/tmp 176K 93.2G 176K /tmp
mypool/usr 704K 93.2G 144K /usr
mypool/usr/home 184K 93.2G 184K /usr/home
mypool/usr/mydataset 87.5K 93.2G 87.5K /usr/mydataset
mypool/usr/ports 144K 93.2G 144K /usr/ports
mypool/usr/src 144K 93.2G 144K /usr/src
mypool/var 1.21M 93.2G 614K /var
mypool/var/crash 148K 93.2G 148K /var/crash
mypool/var/log 178K 93.2G 178K /var/log
mypool/var/mail 144K 93.2G 144K /var/mail
mypool/var/tmp 152K 93.2G 152K /var/tmp
&prompt.root; zfs rename mypool/usr/mydataset mypool/var/newname
&prompt.root; zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 780M 93.2G 144K none
mypool/ROOT 777M 93.2G 144K none
mypool/ROOT/default 777M 93.2G 777M /
mypool/tmp 176K 93.2G 176K /tmp
mypool/usr 616K 93.2G 144K /usr
mypool/usr/home 184K 93.2G 184K /usr/home
mypool/usr/ports 144K 93.2G 144K /usr/ports
mypool/usr/src 144K 93.2G 144K /usr/src
mypool/var 1.29M 93.2G 614K /var
mypool/var/crash 148K 93.2G 148K /var/crash
mypool/var/log 178K 93.2G 178K /var/log
mypool/var/mail 144K 93.2G 144K /var/mail
mypool/var/newname 87.5K 93.2G 87.5K /var/newname
mypool/var/tmp 152K 93.2G 152K /var/tmpSnapshots can also be renamed like this. Due to the
nature of snapshots, they cannot be renamed into a different
parent dataset. To rename a recursive snapshot, specify
-r, and all snapshots with the same name in
child datasets will also be renamed.&prompt.root; zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
mypool/var/newname@first_snapshot 0 - 87.5K -
&prompt.root; zfs rename mypool/var/newname@first_snapshot new_snapshot_name
&prompt.root; zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
mypool/var/newname@new_snapshot_name 0 - 87.5K -Setting Dataset PropertiesEach ZFS dataset has a number of
properties that control its behavior. Most properties are
automatically inherited from the parent dataset, but can be
overridden locally. Set a property on a dataset with
zfs set
property=value dataset. Most
properties have a limited set of valid values;
zfs get will display each possible property
and valid values. Most properties can be reverted to their
inherited values using zfs inherit.User-defined properties can also be set. They become part
of the dataset configuration and can be used to provide
additional information about the dataset or its contents. To
distinguish these custom properties from the ones supplied as
part of ZFS, a colon (:)
is used to create a custom namespace for the property.&prompt.root; zfs set custom:costcenter=1234tank
&prompt.root; zfs get custom:costcenter tank
NAME PROPERTY VALUE SOURCE
tank custom:costcenter 1234 localTo remove a custom property, use
zfs inherit with -r. If
the custom property is not defined in any of the parent
datasets, it will be removed completely (although the changes
are still recorded in the pool's history).&prompt.root; zfs inherit -r custom:costcentertank
&prompt.root; zfs get custom:costcenter tank
NAME PROPERTY VALUE SOURCE
tank custom:costcenter - -
&prompt.root; zfs get all tank | grep custom:costcenter
&prompt.root;Getting and Setting Share PropertiesTwo commonly used and useful dataset properties are the
NFS and SMB share
options. Setting these defines whether and how
ZFS datasets may be shared on the network.
At present, only setting sharing via NFS is
supported on &os;. To get the current status of
a share, enter:&prompt.root; zfs get sharenfs mypool/usr/home
NAME PROPERTY VALUE SOURCE
mypool/usr/home sharenfs on local
&prompt.root; zfs get sharesmb mypool/usr/home
NAME PROPERTY VALUE SOURCE
mypool/usr/home sharesmb off localTo enable sharing of a dataset, enter:&prompt.root; zfs set sharenfs=on mypool/usr/homeIt is also possible to set additional options for sharing
datasets through NFS, such as
-alldirs, -maproot, and
-network. To set additional options to a
dataset shared through NFS, enter:&prompt.root; zfs set sharenfs="-alldirs,-maproot=root,-network=192.168.1.0/24" mypool/usr/homeManaging SnapshotsSnapshots are one
of the most powerful features of ZFS. A
snapshot provides a read-only, point-in-time copy of the
dataset. With Copy-On-Write (COW),
snapshots can be created quickly by preserving the older
version of the data on disk. If no snapshots exist, space is
reclaimed for future use when data is rewritten or deleted.
Snapshots preserve disk space by recording only the
differences between the current dataset and a previous
version. Snapshots are allowed only on whole datasets, not on
individual files or directories. When a snapshot is created
from a dataset, everything contained in it is duplicated.
This includes the file system properties, files, directories,
permissions, and so on. Snapshots use no additional space
when they are first created, only consuming space as the
blocks they reference are changed. Recursive snapshots taken
with -r create a snapshot with the same name
on the dataset and all of its children, providing a consistent
moment-in-time snapshot of all of the file systems. This can
be important when an application has files on multiple
datasets that are related or dependent upon each other.
Without snapshots, a backup would have copies of the files
from different points in time.Snapshots in ZFS provide a variety of
features that even other file systems with snapshot
functionality lack. A typical example of snapshot use is to
have a quick way of backing up the current state of the file
system when a risky action like a software installation or a
system upgrade is performed. If the action fails, the
snapshot can be rolled back and the system has the same state
as when the snapshot was created. If the upgrade was
successful, the snapshot can be deleted to free up space.
Without snapshots, a failed upgrade often requires a restore
from backup, which is tedious, time consuming, and may require
downtime during which the system cannot be used. Snapshots
can be rolled back quickly, even while the system is running
in normal operation, with little or no downtime. The time
savings are enormous with multi-terabyte storage systems and
the time required to copy the data from backup. Snapshots are
not a replacement for a complete backup of a pool, but can be
used as a quick and easy way to store a copy of the dataset at
a specific point in time.Creating SnapshotsSnapshots are created with zfs snapshot
dataset@snapshotname.
Adding -r creates a snapshot recursively,
with the same name on all child datasets.Create a recursive snapshot of the entire pool:&prompt.root; zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
mypool 780M 93.2G 144K none
mypool/ROOT 777M 93.2G 144K none
mypool/ROOT/default 777M 93.2G 777M /
mypool/tmp 176K 93.2G 176K /tmp
mypool/usr 616K 93.2G 144K /usr
mypool/usr/home 184K 93.2G 184K /usr/home
mypool/usr/ports 144K 93.2G 144K /usr/ports
mypool/usr/src 144K 93.2G 144K /usr/src
mypool/var 1.29M 93.2G 616K /var
mypool/var/crash 148K 93.2G 148K /var/crash
mypool/var/log 178K 93.2G 178K /var/log
mypool/var/mail 144K 93.2G 144K /var/mail
mypool/var/newname 87.5K 93.2G 87.5K /var/newname
mypool/var/newname@new_snapshot_name 0 - 87.5K -
mypool/var/tmp 152K 93.2G 152K /var/tmp
&prompt.root; zfs snapshot -r mypool@my_recursive_snapshot
&prompt.root; zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
mypool@my_recursive_snapshot 0 - 144K -
mypool/ROOT@my_recursive_snapshot 0 - 144K -
mypool/ROOT/default@my_recursive_snapshot 0 - 777M -
mypool/tmp@my_recursive_snapshot 0 - 176K -
mypool/usr@my_recursive_snapshot 0 - 144K -
mypool/usr/home@my_recursive_snapshot 0 - 184K -
mypool/usr/ports@my_recursive_snapshot 0 - 144K -
mypool/usr/src@my_recursive_snapshot 0 - 144K -
mypool/var@my_recursive_snapshot 0 - 616K -
mypool/var/crash@my_recursive_snapshot 0 - 148K -
mypool/var/log@my_recursive_snapshot 0 - 178K -
mypool/var/mail@my_recursive_snapshot 0 - 144K -
mypool/var/newname@new_snapshot_name 0 - 87.5K -
mypool/var/newname@my_recursive_snapshot 0 - 87.5K -
mypool/var/tmp@my_recursive_snapshot 0 - 152K -Snapshots are not shown by a normal
zfs list operation. To list snapshots,
-t snapshot is appended to
zfs list. -t all
displays both file systems and snapshots.Snapshots are not mounted directly, so no path is shown
in the MOUNTPOINT column. There is no
mention of available disk space in the
AVAIL column, as snapshots cannot be
written to after they are created. Compare the snapshot
to the original dataset from which it was created:&prompt.root; zfs list -rt all mypool/usr/home
NAME USED AVAIL REFER MOUNTPOINT
mypool/usr/home 184K 93.2G 184K /usr/home
mypool/usr/home@my_recursive_snapshot 0 - 184K -Displaying both the dataset and the snapshot together
reveals how snapshots work in
COW fashion. They save
only the changes (delta) that were made
and not the complete file system contents all over again.
This means that snapshots take little space when few changes
are made. Space usage can be made even more apparent by
copying a file to the dataset, then making a second
snapshot:&prompt.root; cp /etc/passwd/var/tmp
&prompt.root; zfs snapshot mypool/var/tmp@after_cp
&prompt.root; zfs list -rt all mypool/var/tmp
NAME USED AVAIL REFER MOUNTPOINT
mypool/var/tmp 206K 93.2G 118K /var/tmp
mypool/var/tmp@my_recursive_snapshot 88K - 152K -
mypool/var/tmp@after_cp 0 - 118K -The second snapshot contains only the changes to the
dataset after the copy operation. This yields enormous
space savings. Notice that the size of the snapshot
mypool/var/tmp@my_recursive_snapshot
also changed in the USED
column to indicate the changes between itself and the
snapshot taken afterwards.Comparing SnapshotsZFS provides a built-in command to compare the
differences in content between two snapshots. This is
helpful when many snapshots were taken over time and the
user wants to see how the file system has changed over time.
For example, zfs diff lets a user find
the latest snapshot that still contains a file that was
accidentally deleted. Doing this for the two snapshots that
were created in the previous section yields this
output:&prompt.root; zfs list -rt all mypool/var/tmp
NAME USED AVAIL REFER MOUNTPOINT
mypool/var/tmp 206K 93.2G 118K /var/tmp
mypool/var/tmp@my_recursive_snapshot 88K - 152K -
mypool/var/tmp@after_cp 0 - 118K -
&prompt.root; zfs diff mypool/var/tmp@my_recursive_snapshot
M /var/tmp/
+ /var/tmp/passwdThe command lists the changes between the specified
snapshot (in this case
mypool/var/tmp@my_recursive_snapshot)
and the live file system. The first column shows the
type of change:

+  The path or file was added.
-  The path or file was deleted.
M  The path or file was modified.
R  The path or file was renamed.

Comparing the output with the table, it becomes clear
that passwd
was added after the snapshot
mypool/var/tmp@my_recursive_snapshot
was created. This also resulted in a modification to the
parent directory mounted at
/var/tmp.Comparing two snapshots is helpful when using the
ZFS replication feature to transfer a
dataset to a different host for backup purposes.Compare two snapshots by providing the full dataset name
and snapshot name of both datasets:&prompt.root; cp /var/tmp/passwd /var/tmp/passwd.copy
&prompt.root; zfs snapshot mypool/var/tmp@diff_snapshot
&prompt.root; zfs diff mypool/var/tmp@my_recursive_snapshot mypool/var/tmp@diff_snapshot
M /var/tmp/
+ /var/tmp/passwd
+ /var/tmp/passwd.copy
&prompt.root; zfs diff mypool/var/tmp@my_recursive_snapshot mypool/var/tmp@after_cp
M /var/tmp/
+ /var/tmp/passwdA backup administrator can compare two snapshots
received from the sending host and determine the actual
changes in the dataset. See the
Replication section for
more information.Snapshot RollbackWhen at least one snapshot is available, the dataset can be
rolled back to it at any time. Most of the time this is the
case when the current state of the dataset is no longer
required and an older version is preferred. Scenarios such
as local development tests gone wrong, botched system
updates hampering the system's overall functionality, or the
requirement to restore accidentally deleted files or
directories are all too common occurrences. Luckily,
rolling back a snapshot is just as easy as typing
zfs rollback
snapshotname.
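When some of the current data might still be needed later, a snapshot of the present state can be taken immediately before rolling back, so the rollback itself does not lose anything; the dataset and snapshot names below are examples:

```shell
# Preserve the current state first, then roll back. Both the
# dataset name and the snapshot names are examples.
zfs snapshot mypool/var/tmp@pre_rollback
zfs rollback mypool/var/tmp@my_recursive_snapshot
```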
Depending on how many changes are involved, the operation
may take some time to complete. During that time,
the dataset always remains in a consistent state, much like
a database that conforms to ACID principles performing a
rollback. This happens while the dataset is live and
accessible, without requiring downtime. Once the snapshot
has been rolled back, the dataset has the same state as it
had when the snapshot was originally taken. All other data
in that dataset that was not part of the snapshot is
discarded. Taking a snapshot of the current state of the
dataset before rolling back to a previous one is a good idea
when some data is required later. This way, the user can
roll back and forth between snapshots without losing data
that is still valuable.In the first example, a snapshot is rolled back because
of a careless rm operation that removes
more data than was intended.&prompt.root; zfs list -rt all mypool/var/tmp
NAME USED AVAIL REFER MOUNTPOINT
mypool/var/tmp 262K 93.2G 120K /var/tmp
mypool/var/tmp@my_recursive_snapshot 88K - 152K -
mypool/var/tmp@after_cp 53.5K - 118K -
mypool/var/tmp@diff_snapshot 0 - 120K -
&prompt.root; ls /var/tmp
passwd passwd.copy vi.recover
&prompt.root; rm /var/tmp/passwd*
&prompt.root; ls /var/tmp
vi.recoverAt this point, the user realized that too many files
were deleted and wants them back. ZFS
provides an easy way to get them back using rollbacks, but
only when snapshots of important data are performed on a
regular basis. To get the files back and start over from
the last snapshot, issue the command:&prompt.root; zfs rollback mypool/var/tmp@diff_snapshot
&prompt.root; ls /var/tmp
passwd passwd.copy vi.recoverThe rollback operation restored the dataset to the state
of the last snapshot. It is also possible to roll back to a
snapshot that was taken much earlier and has other snapshots
that were created after it. When trying to do this,
ZFS will issue this warning:&prompt.root; zfs list -rt snapshot mypool/var/tmp
NAME USED AVAIL REFER MOUNTPOINT
mypool/var/tmp@my_recursive_snapshot 88K - 152K -
mypool/var/tmp@after_cp 53.5K - 118K -
mypool/var/tmp@diff_snapshot 0 - 120K -
&prompt.root; zfs rollback mypool/var/tmp@my_recursive_snapshot
cannot rollback to 'mypool/var/tmp@my_recursive_snapshot': more recent snapshots exist
use '-r' to force deletion of the following snapshots:
mypool/var/tmp@after_cp
mypool/var/tmp@diff_snapshotThis warning means that snapshots exist between the
current state of the dataset and the snapshot to which the
user wants to roll back. To complete the rollback, these
snapshots must be deleted. ZFS cannot
track all the changes between different states of the
dataset, because snapshots are read-only.
ZFS will not delete the affected
snapshots unless the user specifies -r to
indicate that this is the desired action. If that is the
intention, and the consequences of losing all intermediate
snapshots are understood, the command can be issued:&prompt.root; zfs rollback -r mypool/var/tmp@my_recursive_snapshot
&prompt.root; zfs list -rt snapshot mypool/var/tmp
NAME USED AVAIL REFER MOUNTPOINT
mypool/var/tmp@my_recursive_snapshot 8K - 152K -
&prompt.root; ls /var/tmp
vi.recoverThe output from zfs list -t snapshot
confirms that the intermediate snapshots
were removed as a result of
zfs rollback -r.

Restoring Individual Files from Snapshots

Snapshots are mounted in a hidden directory under the
parent dataset:
.zfs/snapshot/snapshotname.
By default, these directories will not be displayed even
when a standard ls -a is issued.
Although the directory is not displayed, it is there
nevertheless and can be accessed like any normal directory.
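For example, even while hidden, the directory's contents can be listed by naming the path explicitly (using the dataset path from the examples above):

```shell
# .zfs does not appear in ls -a output, but can be entered by name:
ls /var/tmp/.zfs/snapshot
```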
The property named snapdir controls
whether these hidden directories show up in a directory
listing. Setting the property to visible
allows them to appear in the output of ls
and other commands that deal with directory contents.

&prompt.root; zfs get snapdir mypool/var/tmp
NAME PROPERTY VALUE SOURCE
mypool/var/tmp snapdir hidden default
&prompt.root; ls -a /var/tmp
. .. passwd vi.recover
&prompt.root; zfs set snapdir=visible mypool/var/tmp
&prompt.root; ls -a /var/tmp
. .. .zfs passwd vi.recover

Individual files can easily be restored to a previous
state by copying them from the snapshot back to the parent
dataset. The directory structure below
.zfs/snapshot has a directory named
exactly like the snapshots taken earlier to make it easier
to identify them. In the next example, it is assumed that a
file is to be restored from the hidden
.zfs directory by copying it from the
snapshot that contained the latest version of the
file:

&prompt.root; rm /var/tmp/passwd
&prompt.root; ls -a /var/tmp
. .. .zfs vi.recover
&prompt.root; ls /var/tmp/.zfs/snapshot
after_cp my_recursive_snapshot
&prompt.root; ls /var/tmp/.zfs/snapshot/after_cp
passwd vi.recover
&prompt.root; cp /var/tmp/.zfs/snapshot/after_cp/passwd /var/tmp

When ls .zfs/snapshot was issued, the
snapdir property might have been set to
hidden, but it would still be possible to list the contents
of that directory. It is up to the administrator to decide
whether these directories will be displayed. It is possible
to display these for certain datasets and prevent it for
others. Copying files or directories from this hidden
.zfs/snapshot is simple enough. Trying
it the other way around results in this error:

&prompt.root; cp /etc/rc.conf /var/tmp/.zfs/snapshot/after_cp/
cp: /var/tmp/.zfs/snapshot/after_cp/rc.conf: Read-only file system

The error reminds the user that snapshots are read-only
and cannot be changed after creation. Files cannot be
copied into or removed from snapshot directories because
that would change the state of the dataset they
represent.

Snapshots consume space based on how much the parent
file system has changed since the time of the snapshot. The
written property of a snapshot tracks how
much space is being used by the snapshot.

Snapshots are destroyed and the space reclaimed with
zfs destroy
dataset@snapshot.
Adding -r recursively removes all snapshots
with the same name under the parent dataset. Adding
-n -v to the command displays a list of the
snapshots that would be deleted and an estimate of how much
space would be reclaimed without performing the actual
destroy operation.

Managing Clones

A clone is a copy of a snapshot that is treated more like
a regular dataset. Unlike a snapshot, a clone is not read
only, is mounted, and can have its own properties. Once a
clone has been created using zfs clone, the
snapshot it was created from cannot be destroyed. The
child/parent relationship between the clone and the snapshot
can be reversed using zfs promote. After a
clone has been promoted, the snapshot becomes a child of the
clone, rather than of the original parent dataset. This will
change how the space is accounted, but not actually change the
amount of space consumed. The clone can be mounted at any
point within the ZFS file system hierarchy,
not just below the original location of the snapshot.

To demonstrate the clone feature, this example dataset is
used:

&prompt.root; zfs list -rt all camino/home/joe
NAME USED AVAIL REFER MOUNTPOINT
camino/home/joe 108K 1.3G 87K /usr/home/joe
camino/home/joe@plans 21K - 85.5K -
camino/home/joe@backup 0K - 87K -

A typical use for clones is to experiment with a specific
dataset while keeping the snapshot around to fall back to in
case something goes wrong. Since snapshots cannot be
changed, a read/write clone of a snapshot is created. After
the desired result is achieved in the clone, the clone can be
promoted to a dataset and the old file system removed. This
is not strictly necessary, as the clone and dataset can
coexist without problems.

&prompt.root; zfs clone camino/home/joe@backup camino/home/joenew
&prompt.root; ls /usr/home/joe*
/usr/home/joe:
backup.txz plans.txt
/usr/home/joenew:
backup.txz plans.txt
&prompt.root; df -h /usr/home
Filesystem Size Used Avail Capacity Mounted on
usr/home/joe 1.3G 31k 1.3G 0% /usr/home/joe
usr/home/joenew 1.3G 31k 1.3G 0% /usr/home/joenew

After a clone is created it is an exact copy of the state
the dataset was in when the snapshot was taken. The clone can
now be changed independently from its originating dataset.
The only connection between the two is the snapshot.
ZFS records this connection in the property
origin. Once the dependency between the
snapshot and the clone has been removed by promoting the clone
using zfs promote, the
origin of the clone is removed as it is now
an independent dataset. This example demonstrates it:

&prompt.root; zfs get origin camino/home/joenew
NAME PROPERTY VALUE SOURCE
camino/home/joenew origin camino/home/joe@backup -
&prompt.root; zfs promote camino/home/joenew
&prompt.root; zfs get origin camino/home/joenew
NAME PROPERTY VALUE SOURCE
camino/home/joenew origin - -

After making some changes, such as copying
loader.conf to the promoted clone, the old
dataset becomes obsolete and the promoted clone can replace
it. This can be
achieved by two consecutive commands: zfs
destroy on the old dataset and zfs
rename on the clone to name it like the old
dataset (it could also get an entirely different name).

&prompt.root; cp /boot/defaults/loader.conf /usr/home/joenew
&prompt.root; zfs destroy -f camino/home/joe
&prompt.root; zfs rename camino/home/joenew camino/home/joe
&prompt.root; ls /usr/home/joe
backup.txz loader.conf plans.txt
&prompt.root; df -h /usr/home
Filesystem Size Used Avail Capacity Mounted on
usr/home/joe 1.3G 128k 1.3G 0% /usr/home/joe

The cloned snapshot is now handled like an ordinary
dataset. It contains all the data from the original snapshot
plus the files that were added to it like
loader.conf. Clones can be used in
different scenarios to provide useful features to ZFS users.
For example, jails could be provided as snapshots containing
different sets of installed applications. Users can clone
these snapshots and add their own applications as they see
fit. Once they are satisfied with the changes, the clones can
be promoted to full datasets and provided to end users to work
with like they would with a real dataset. This saves time and
administrative overhead when providing these jails.

Replication

Keeping data on a single pool in one location exposes
it to risks like theft and natural or human disasters. Making
regular backups of the entire pool is vital.
ZFS provides a built-in serialization
feature that can send a stream representation of the data to
standard output. Using this technique, it is possible to not
only store the data on another pool connected to the local
system, but also to send it over a network to another system.
Snapshots are the basis for this replication (see the section
on ZFS
snapshots). The commands used for replicating data
are zfs send and
zfs receive.

These examples demonstrate ZFS
replication with these two pools:

&prompt.root; zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
backup 960M 77K 896M - - 0% 0% 1.00x ONLINE -
mypool 984M 43.7M 940M - - 0% 4% 1.00x ONLINE -

The pool named mypool is the
primary pool where data is written to and read from on a
regular basis. A second pool,
backup is used as a standby in case
the primary pool becomes unavailable. Note that this
fail-over is not done automatically by ZFS,
but must be manually done by a system administrator when
needed. A snapshot is used to provide a consistent version of
the file system to be replicated. Once a snapshot of
mypool has been created, it can be
copied to the backup pool. Only
snapshots can be replicated. Changes made since the most
recent snapshot will not be included.

&prompt.root; zfs snapshot mypool@backup1
&prompt.root; zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
mypool@backup1 0 - 43.6M -

Now that a snapshot exists, zfs send
can be used to create a stream representing the contents of
the snapshot. This stream can be stored as a file or received
by another pool. The stream is written to standard output,
but must be redirected to a file or pipe or an error is
produced:

&prompt.root; zfs send mypool@backup1
Error: Stream can not be written to a terminal.
You must redirect standard output.

To back up a dataset with zfs send,
redirect to a file located on the mounted backup pool. Ensure
that the pool has enough free space to accommodate the size of
the snapshot being sent, which means all of the data contained
in the snapshot, not just the changes from the previous
snapshot.

&prompt.root; zfs send mypool@backup1 > /backup/backup1
&prompt.root; zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
backup 960M 63.7M 896M - - 0% 6% 1.00x ONLINE -
mypool 984M 43.7M 940M - - 0% 4% 1.00x ONLINE -

The zfs send transferred all the data
in the snapshot called backup1 to
the pool named backup. Creating
and sending these snapshots can be done automatically with a
&man.cron.8; job.

Instead of storing the backups as archive files,
ZFS can receive them as a live file system,
allowing the backed up data to be accessed directly. To get
to the actual data contained in those streams,
zfs receive is used to transform the
streams back into files and directories. The example below
combines zfs send and
zfs receive using a pipe to copy the data
from one pool to another. The data can be used directly on
the receiving pool after the transfer is complete. A dataset
can only be replicated to an empty dataset.

&prompt.root; zfs snapshot mypool@replica1
&prompt.root; zfs send -v mypool@replica1 | zfs receive backup/mypool
send from @ to mypool@replica1 estimated size is 50.1M
total estimated size is 50.1M
TIME SENT SNAPSHOT
&prompt.root; zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
backup 960M 63.7M 896M - - 0% 6% 1.00x ONLINE -
mypool 984M 43.7M 940M - - 0% 4% 1.00x ONLINE -

Incremental Backups

zfs send can also determine the
difference between two snapshots and send only the
differences between the two. This saves disk space and
transfer time. For example:

&prompt.root; zfs snapshot mypool@replica2
&prompt.root; zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
mypool@replica1 5.72M - 43.6M -
mypool@replica2 0 - 44.1M -
&prompt.root; zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
backup 960M 61.7M 898M - - 0% 6% 1.00x ONLINE -
mypool 960M 50.2M 910M - - 0% 5% 1.00x ONLINE -

A second snapshot called
replica2 was created. This
second snapshot contains only the changes that were made to
the file system between now and the previous snapshot,
replica1. Using
zfs send -i and indicating the pair of
snapshots generates an incremental replica stream containing
only the data that has changed. This can only succeed if
the initial snapshot already exists on the receiving
side.

&prompt.root; zfs send -v -i mypool@replica1 mypool@replica2 | zfs receive backup/mypool
send from @replica1 to mypool@replica2 estimated size is 5.02M
total estimated size is 5.02M
TIME SENT SNAPSHOT
&prompt.root; zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
backup 960M 80.8M 879M - - 0% 8% 1.00x ONLINE -
mypool 960M 50.2M 910M - - 0% 5% 1.00x ONLINE -
&prompt.root; zfs list
NAME USED AVAIL REFER MOUNTPOINT
backup 55.4M 240G 152K /backup
backup/mypool 55.3M 240G 55.2M /backup/mypool
mypool 55.6M 11.6G 55.0M /mypool
&prompt.root; zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
backup/mypool@replica1 104K - 50.2M -
backup/mypool@replica2 0 - 55.2M -
mypool@replica1 29.9K - 50.0M -
mypool@replica2 0 - 55.0M -

The incremental stream was successfully transferred.
Only the data that had changed was replicated, rather than
the entirety of replica1. Only
the differences were sent, which took much less time to
transfer and saved disk space by not copying the complete
pool each time. This is useful when having to rely on slow
networks or when costs per transferred byte must be
considered.

A new file system,
backup/mypool, is available with
all of the files and data from the pool
mypool. If -p
is specified, the properties of the dataset will be copied,
including compression settings, quotas, and mount points.
When -R is specified, all child datasets of
the indicated dataset will be copied, along with all of
their properties. Sending and receiving can be automated so
that regular backups are created on the second pool.

Sending Encrypted Backups over SSH

Sending streams over the network is a good way to keep a
remote backup, but it does come with a drawback. Data sent
over the network link is not encrypted, allowing anyone to
intercept and transform the streams back into data without
the knowledge of the sending user. This is undesirable,
especially when sending the streams over the internet to a
remote host. SSH can be used to
securely encrypt data sent over a network connection. Since
ZFS only requires the stream to be
redirected from standard output, it is relatively easy to
pipe it through SSH. To keep the
contents of the file system encrypted in transit and on the
remote system, consider using PEFS.

A few settings and security precautions must be
completed first. Only the necessary steps required for the
zfs send operation are shown here. For
more information on SSH, see
.

This configuration is required:

Passwordless SSH access
between sending and receiving host using
SSH keysNormally, the privileges of the
root user are
needed to send and receive streams. This requires
logging in to the receiving system as
root.
However, logging in as
root is
disabled by default for security reasons. The
ZFS Delegation
system can be used to allow a
non-root user
on each system to perform the respective send and
receive operations.

On the sending system:

&prompt.root; zfs allow -u someuser send,snapshot mypool

To mount the pool, the unprivileged user must own
the directory, and regular users must be allowed to
mount file systems. On the receiving system:

&prompt.root; sysctl vfs.usermount=1
vfs.usermount: 0 -> 1
&prompt.root; echo vfs.usermount=1 >> /etc/sysctl.conf
&prompt.root; zfs create recvpool/backup
&prompt.root; zfs allow -u someuser create,mount,receive recvpool/backup
&prompt.root; chown someuser /recvpool/backup

The unprivileged user now has the ability to receive and
mount datasets, and the home
dataset can be replicated to the remote system:

&prompt.user; zfs snapshot -r mypool/home@monday
&prompt.user; zfs send -R mypool/home@monday | ssh someuser@backuphost zfs recv -dvu recvpool/backup

A recursive snapshot called
monday is made of the file system
dataset home that resides on the
pool mypool. Then it is sent
with zfs send -R to include the dataset,
all child datasets, snapshots, clones, and settings in the
stream. The output is piped to the waiting
zfs receive on the remote host
backuphost through
SSH. Using a fully qualified
domain name or IP address is recommended. The receiving
machine writes the data to the
backup dataset on the
recvpool pool. Adding -d
to zfs recv
overwrites the name of the pool on the receiving side with
the name of the snapshot. -u causes the
file systems to not be mounted on the receiving side. When
-v is included, more detail about the
transfer is shown, including elapsed time and the amount of
data transferred.

Dataset, User, and Group Quotas

Dataset quotas are
used to restrict the amount of space that can be consumed
by a particular dataset.
Reference Quotas work
in very much the same way, but only count the space
used by the dataset itself, excluding snapshots and child
datasets. Similarly,
user and
group quotas can be
used to prevent users or groups from using all of the
space in the pool or dataset.

The following examples assume that the users already
exist in the system. Before adding a user to the system,
make sure to create their home dataset first and set the
mountpoint to
/home/bob.
Then, create the user and make the home directory point to
the dataset's location. This will
properly set owner and group permissions without shadowing any
pre-existing home directory paths that might exist.

To enforce a dataset quota of 10 GB for
storage/home/bob:

&prompt.root; zfs set quota=10G storage/home/bob

To enforce a reference quota of 10 GB for
storage/home/bob:

&prompt.root; zfs set refquota=10G storage/home/bob

To remove a quota of 10 GB for
storage/home/bob:

&prompt.root; zfs set quota=none storage/home/bob

The general format is
userquota@user=size,
and the user's name must be in one of these formats:

- POSIX compatible name such as joe.
- POSIX numeric ID such as 789.
- SID name such as joe.bloggs@example.com.
- SID numeric ID such as S-1-123-456-789.

For example, to enforce a user quota of 50 GB for the
user named joe:

&prompt.root; zfs set userquota@joe=50G

To remove any quota:

&prompt.root; zfs set userquota@joe=none

User quota properties are not displayed by
zfs get all.
Non-root users can
only see their own quotas unless they have been granted the
userquota privilege. Users with this
privilege are able to view and set everyone's quota.

The general format for setting a group quota is:
groupquota@group=size.

To set the quota for the group
firstgroup to 50 GB,
use:

&prompt.root; zfs set groupquota@firstgroup=50G

To remove the quota for the group
firstgroup, or to make sure that
one is not set, instead use:

&prompt.root; zfs set groupquota@firstgroup=none

As with the user quota property,
non-root users can
only see the quotas associated with the groups to which they
belong. However,
root or a user with
the groupquota privilege can view and set
all quotas for all groups.

To display the amount of space used by each user on
a file system or snapshot along with any quotas, use
zfs userspace. For group information, use
zfs groupspace. For more information about
supported options or how to display only specific options,
refer to &man.zfs.1;.

Users with sufficient privileges, and
root, can list the
quota for storage/home/bob using:

&prompt.root; zfs get quota storage/home/bob

Reservations

Reservations
guarantee a minimum amount of space will always be available
on a dataset. The reserved space will not be available to any
other dataset. This feature can be especially useful to
ensure that free space is available for an important dataset
or log files.

The general format of the reservation
property is
reservation=size,
so to set a reservation of 10 GB on
storage/home/bob, use:

&prompt.root; zfs set reservation=10G storage/home/bob

To clear any reservation:

&prompt.root; zfs set reservation=none storage/home/bob

The same principle can be applied to the
refreservation property for setting a
Reference
Reservation, with the general format
refreservation=size.

This command shows any reservations or refreservations
that exist on storage/home/bob:

&prompt.root; zfs get reservation storage/home/bob
&prompt.root; zfs get refreservation storage/home/bob

Compression

ZFS provides transparent compression.
Compressing data at the block level as it is written not only
saves space, but can also increase disk throughput. If data
is compressed by 25% and the compressed data is written to
the disk at the same rate as the uncompressed version, the
result is an effective write speed of 125%. Compression
can also be a great alternative to
Deduplication
because it does not require additional memory.

ZFS offers several different
compression algorithms, each with different trade-offs. With
the introduction of LZ4 compression in
ZFS v5000, it is possible to enable
compression for the entire pool without the large performance
trade-off of other algorithms. The biggest advantage to
LZ4 is the early abort
feature. If LZ4 does not achieve at least
12.5% compression in the first part of the data, the block is
written uncompressed to avoid wasting CPU cycles trying to
compress data that is either already compressed or
uncompressible. For details about the different compression
algorithms available in ZFS, see the
Compression entry
in the terminology section.

The administrator can monitor the effectiveness of
compression using a number of dataset properties.

&prompt.root; zfs get used,compressratio,compression,logicalused mypool/compressed_dataset
NAME PROPERTY VALUE SOURCE
mypool/compressed_dataset used 449G -
mypool/compressed_dataset compressratio 1.11x -
mypool/compressed_dataset compression lz4 local
mypool/compressed_dataset logicalused 496G -

The dataset is currently using 449 GB of space (the
used property). Without compression, it would have taken
496 GB of space (the logicalused
property). This results in the 1.11:1 compression
ratio.

Compression can have an unexpected side effect when
combined with
User Quotas.
User quotas restrict how much space a user can consume on a
dataset, but the measurements are based on how much space is
used after compression. So if a user has
a quota of 10 GB, and writes 10 GB of compressible
data, they will still be able to store additional data. If
they later update a file, say a database, with more or less
compressible data, the amount of space available to them will
change. This can result in the odd situation where a user did
not increase the actual amount of data (the
logicalused property), but the change in
compression caused them to reach their quota limit.

Compression can have a similar unexpected interaction with
backups. Quotas are often used to limit how much data can be
stored to ensure there is sufficient backup space available.
However since quotas do not consider compression, more data
may be written than would fit with uncompressed
backups.

Zstandard Compression

In OpenZFS 2.0, a new compression
algorithm was added. Zstandard (Zstd)
offers higher compression ratios than the default
LZ4 while offering much greater speeds
than the alternative, gzip.
OpenZFS 2.0 is available starting with
&os; 12.1-RELEASE via
sysutils/openzfs and has been the
default in &os; 13-CURRENT since September 2020, and
will be in &os; 13.0-RELEASE.

Zstd provides a large selection of
compression levels, providing fine-grained control over
performance versus compression ratio. One of the main
advantages of Zstd is that the
decompression speed is independent of the compression
level. For data that is written once but read many times,
Zstd allows the use of the highest
compression levels without a read performance
penalty.

Even when data is updated frequently, there are often
performance gains that come from enabling compression. One
of the biggest advantages comes from the compressed ARC
feature. ZFS's Adaptive Replacement
Cache (ARC) caches the compressed version
of the data in RAM, decompressing it each
time it is needed. This allows the same amount of
RAM to store more data and metadata,
increasing the cache hit ratio.

ZFS offers 19 levels of
Zstd compression, each offering
incrementally more space savings in exchange for slower
compression. The default level is
zstd-3 and offers greater compression
than LZ4 without being significantly
slower. Levels above 10 require significant amounts of
memory to compress each block, so they are discouraged on
systems with less than 16 GB of RAM.
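Selecting a level is done through the compression property; a brief sketch with hypothetical dataset names:

```shell
# Default Zstd level (zstd-3) for general-purpose data:
zfs set compression=zstd mypool/projects
# A high level for write-once, read-many archive data:
zfs set compression=zstd-19 mypool/archive
```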
ZFS also implements a selection of the
Zstd fast levels,
which get correspondingly faster but offer lower
compression ratios. ZFS supports
zstd-fast-1 through
zstd-fast-10,
zstd-fast-20 through
zstd-fast-100 in increments of 10, and
finally zstd-fast-500 and
zstd-fast-1000 which provide minimal
compression, but offer very high performance.

If ZFS is not able to allocate the required memory to
compress a block with Zstd, it will fall
back to storing the block uncompressed. This is unlikely
to happen outside of the highest levels of
Zstd on systems that are memory
constrained. The sysctl
kstat.zfs.misc.zstd.compress_alloc_fail
counts how many times this has occurred since the
ZFS module was loaded.

Deduplication

When enabled,
deduplication
uses the checksum of each block to detect duplicate blocks.
When a new block is a duplicate of an existing block,
ZFS writes an additional reference to the
existing data instead of the whole duplicate block.
Tremendous space savings are possible if the data contains
many duplicated files or repeated information. Be warned:
deduplication requires an extremely large amount of memory,
and most of the space savings can be had without the extra
cost by enabling compression instead.

To activate deduplication, set the
dedup property on the target pool:

&prompt.root; zfs set dedup=on pool

Only new data being written to the pool will be
deduplicated. Data that has already been written to the pool
will not be deduplicated merely by activating this option. A
pool with a freshly activated deduplication property will look
like this example:

&prompt.root; zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
pool 2.84G 2.19M 2.83G - - 0% 0% 1.00x ONLINE -

The DEDUP column shows the actual rate
of deduplication for the pool. A value of
1.00x shows that data has not been
deduplicated yet. In the next example, the ports tree is
copied three times into different directories on the
deduplicated pool created above.

&prompt.root; for d in dir1 dir2 dir3; do
> mkdir $d && cp -R /usr/ports $d &
> done

Redundant data is detected and deduplicated:

&prompt.root; zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
pool 2.84G 20.9M 2.82G - - 0% 0% 3.00x ONLINE -

The DEDUP column shows a factor of
3.00x. Multiple copies of the ports tree
data were detected and deduplicated, using only a third of the
space. The potential for space savings can be enormous, but
comes at the cost of having enough memory to keep track of the
deduplicated blocks.Deduplication is not always beneficial, especially when
the data on a pool is not redundant.
ZFS can show potential space savings by
simulating deduplication on an existing pool:

&prompt.root; zdb -S pool
Simulated DDT histogram:
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 2.58M 289G 264G 264G 2.58M 289G 264G 264G
2 206K 12.6G 10.4G 10.4G 430K 26.4G 21.6G 21.6G
4 37.6K 692M 276M 276M 170K 3.04G 1.26G 1.26G
8 2.18K 45.2M 19.4M 19.4M 20.0K 425M 176M 176M
16 174 2.83M 1.20M 1.20M 3.33K 48.4M 20.4M 20.4M
32 40 2.17M 222K 222K 1.70K 97.2M 9.91M 9.91M
64 9 56K 10.5K 10.5K 865 4.96M 948K 948K
128 2 9.50K 2K 2K 419 2.11M 438K 438K
256 5 61.5K 12K 12K 1.90K 23.0M 4.47M 4.47M
1K 2 1K 1K 1K 2.98K 1.49M 1.49M 1.49M
Total 2.82M 303G 275G 275G 3.20M 319G 287G 287G
dedup = 1.05, compress = 1.11, copies = 1.00, dedup * compress / copies = 1.16

After zdb -S finishes analyzing the
pool, it shows the space reduction ratio that would be
achieved by activating deduplication. In this case,
1.16 is a very poor space saving ratio that
is mostly provided by compression. Activating deduplication
on this pool would not save any significant amount of space,
and is not worth the amount of memory required to enable
deduplication. Using the formula
ratio = dedup * compress / copies,
system administrators can plan the storage allocation,
deciding whether the workload will contain enough duplicate
blocks to justify the memory requirements. If the data is
reasonably compressible, the space savings may be very good.
Enabling compression first is recommended, and compression can
also provide greatly increased performance. Only enable
deduplication in cases where the additional savings will be
considerable and there is sufficient memory for the DDT.

ZFS and Jails

zfs jail and the corresponding
jailed property are used to delegate a
ZFS dataset to a
Jail.
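As a sketch, assuming a jail with ID 1 and a hypothetical dataset mypool/data/jail1, the delegation might look like:

```shell
# Mark the dataset as manageable from inside a jail, then attach it:
zfs set jailed=on mypool/data/jail1
zfs jail 1 mypool/data/jail1
# Detach it again when no longer needed:
zfs unjail 1 mypool/data/jail1
```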
zfs jail jailid
attaches a dataset to the specified jail, and
zfs unjail detaches it. For the dataset to
be controlled from within a jail, the
jailed property must be set. Once a
dataset is jailed, it can no longer be mounted on the
host because it may have mount points that would compromise
the security of the host.

Delegated Administration

A comprehensive permission delegation system allows
unprivileged users to perform ZFS
administration functions. For example, if each user's home
directory is a dataset, users can be given permission to create
and destroy snapshots of their home directories. A backup user
can be given permission to use replication features. A usage
statistics script can be allowed to run with access only to the
space utilization data for all users. It is even possible to
delegate the ability to delegate permissions. Permission
delegation is possible for each subcommand and most
properties.

Delegating Dataset Creation

zfs allow
someuser create
mydataset gives the
specified user permission to create child datasets under the
selected parent dataset. There is a caveat: creating a new
dataset involves mounting it. That requires setting the
&os; vfs.usermount &man.sysctl.8; to
1 to allow non-root users to mount a
file system. There is another restriction aimed at preventing
abuse: non-root
users must own the mountpoint where the file system is to be
mounted.

Delegating Permission Delegation

zfs allow
someuser allow
mydataset gives the
specified user the ability to assign any permission they have
on the target dataset, or its children, to other users. If a
user has the snapshot permission and the
allow permission, that user can then grant
the snapshot permission to other
users.

Advanced Topics

Tuning

There are a number of tunables that can be adjusted to
make ZFS perform best for different
workloads.

vfs.zfs.arc_max
- Maximum size of the ARC.
The default is all RAM but 1 GB,
or 5/8 of all RAM, whichever is more.
However, a lower value should be used if the system will
be running any other daemons or processes that may require
memory. This value can be adjusted at runtime with
&man.sysctl.8; and can be set in
/boot/loader.conf or
/etc/sysctl.conf.

vfs.zfs.arc_meta_limit
- Limit the portion of the
ARC
that can be used to store metadata. The default is one
fourth of vfs.zfs.arc_max. Increasing
this value will improve performance if the workload
involves operations on a large number of files and
directories, or frequent metadata operations, at the cost
of less file data fitting in the ARC.
This value can be adjusted at runtime with &man.sysctl.8;
and can be set in
/boot/loader.conf or
/etc/sysctl.conf.

vfs.zfs.arc_min
- Minimum size of the ARC.
The default is one half of
vfs.zfs.arc_meta_limit. Adjust this
value to prevent other applications from pressuring out
the entire ARC.
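For instance, to guarantee a minimum ARC size (the value is in bytes; the 512 MB figure is an arbitrary example):

```shell
# Keep at least 512 MB in the ARC, effective immediately:
sysctl vfs.zfs.arc_min=536870912
```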
This value can be adjusted at runtime with &man.sysctl.8;
and can be set in
/boot/loader.conf or
/etc/sysctl.conf.

vfs.zfs.vdev.cache.size
- A preallocated amount of memory reserved as a cache for
each device in the pool. The total amount of memory used
will be this value multiplied by the number of devices.
This value can only be adjusted at boot time, and is set
in /boot/loader.conf.

vfs.zfs.min_auto_ashift
- Minimum ashift (sector size) that
will be used automatically at pool creation time. The
value is a power of two. The default value of
9 represents
2^9 = 512, a sector size of 512 bytes.
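For example, before creating a pool on 4 KB-sector drives (device name hypothetical):

```shell
# Force at least 4 KB (2^12) blocks for pools created from now on:
sysctl vfs.zfs.min_auto_ashift=12
zpool create mypool /dev/ada0
```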
To avoid write amplification and get
the best performance, set this value to the largest sector
size used by a device in the pool.

Many drives have 4 KB sectors. Using the default
ashift of 9 with
these drives results in write amplification on these
devices. Data that could be contained in a single
4 KB write must instead be written in eight 512-byte
writes. ZFS tries to read the native
sector size from all devices when creating a pool, but
many drives with 4 KB sectors report that their
sectors are 512 bytes for compatibility. Setting
vfs.zfs.min_auto_ashift to
12 (2^12 = 4096)
before creating a pool forces ZFS to
use 4 KB blocks for best performance on these
drives.Forcing 4 KB blocks is also useful on pools where
disk upgrades are planned. Future disks are likely to use
4 KB sectors, and ashift values
cannot be changed after a pool is created.In some specific cases, the smaller 512-byte block
size might be preferable. When used with 512-byte disks
for databases, or as storage for virtual machines, less
data is transferred during small random reads. This can
provide better performance, especially when using a
smaller ZFS record size.vfs.zfs.prefetch_disable
- Disable prefetch. A value of 0 is
enabled and 1 is disabled. The default
is 0, unless the system has less than
4 GB of RAM. Prefetch works by
reading larger blocks than were requested into the
ARC
in hopes that the data will be needed soon. If the
workload has a large number of random reads, disabling
prefetch may actually improve performance by reducing
unnecessary reads. This value can be adjusted at any time
with &man.sysctl.8;.vfs.zfs.vdev.trim_on_init
- Control whether new devices added to the pool have the
TRIM command run on them. This ensures
the best performance and longevity for
SSDs, but takes extra time. If the
device has already been secure erased, disabling this
setting will make the addition of the new device faster.
This value can be adjusted at any time with
&man.sysctl.8;.vfs.zfs.vdev.max_pending
- Limit the number of pending I/O requests per device.
A higher value will keep the device command queue full
and may give higher throughput. A lower value will reduce
latency. This value can be adjusted at any time with
&man.sysctl.8;.vfs.zfs.top_maxinflight
- Maximum number of outstanding I/Os per top-level
vdev. Limits the
depth of the command queue to prevent high latency. The
limit is per top-level vdev, meaning the limit applies to
each mirror,
RAID-Z, or
other vdev independently. This value can be adjusted at
any time with &man.sysctl.8;.vfs.zfs.l2arc_write_max
- Limit the amount of data written to the L2ARC
per second. This tunable is designed to extend the
longevity of SSDs by limiting the
amount of data written to the device. This value can be
adjusted at any time with &man.sysctl.8;.vfs.zfs.l2arc_write_boost
- The value of this tunable is added to vfs.zfs.l2arc_write_max
and increases the write speed to the
SSD until the first block is evicted
from the L2ARC.
This Turbo Warmup Phase is designed to
reduce the performance loss from an empty L2ARC
after a reboot. This value can be adjusted at any time
with &man.sysctl.8;.vfs.zfs.scrub_delay
- Number of ticks to delay between each I/O during a
scrub.
To ensure that a scrub does not
interfere with the normal operation of the pool, if any
other I/O is happening, the
scrub will delay between each command.
This value controls the limit on the total
IOPS (I/Os Per Second) generated by the
scrub. The granularity of the setting
is determined by the value of kern.hz
which defaults to 1000 ticks per second. This setting may
be changed, resulting in a different effective
IOPS limit. The default value is
4, resulting in a limit of:
1000 ticks/sec / 4 =
250 IOPS. Using a value of
20 would give a limit of:
1000 ticks/sec / 20 =
50 IOPS. The speed of
scrub is only limited when there has
been recent activity on the pool, as determined by vfs.zfs.scan_idle.
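The effective limit described above can be reproduced with shell arithmetic. The helper name is hypothetical:

```shell
# Effective scrub IOPS limit: kern.hz ticks per second divided
# by the vfs.zfs.scrub_delay tick count between I/Os.
scrub_iops_limit() {
    hz=$1
    delay=$2
    echo $(( hz / delay ))
}

scrub_iops_limit 1000 4    # default delay of 4 -> 250 IOPS
scrub_iops_limit 1000 20   # delay of 20 -> 50 IOPS
```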
This value can be adjusted at any time with
&man.sysctl.8;.vfs.zfs.resilver_delay
- Number of milliseconds of delay inserted between
each I/O during a
resilver. To
ensure that a resilver does not interfere with the normal
operation of the pool, if any other I/O is happening, the
resilver will delay between each command. This value
controls the limit of total IOPS (I/Os
Per Second) generated by the resilver. The granularity of
the setting is determined by the value of
kern.hz which defaults to 1000 ticks
per second. This setting may be changed, resulting in a
different effective IOPS limit. The
default value is 2, resulting in a limit of:
1000 ticks/sec / 2 =
500 IOPS. Returning the pool to
an Online state may
be more important if another device failing could
Fault the pool,
causing data loss. A value of 0 will give the resilver
operation the same priority as other operations, speeding
the healing process. The speed of resilver is only
limited when there has been other recent activity on the
pool, as determined by vfs.zfs.scan_idle.
This value can be adjusted at any time with
&man.sysctl.8;.vfs.zfs.scan_idle
- Number of milliseconds since the last operation before
the pool is considered idle. When the pool is idle the
rate limiting for scrub
and
resilver are
disabled. This value can be adjusted at any time with
&man.sysctl.8;.vfs.zfs.txg.timeout
- Maximum number of seconds between
transaction groups.
The current transaction group will be written to the pool
and a fresh transaction group started if this amount of
time has elapsed since the previous transaction group. A
transaction group may be triggered earlier if enough data
is written. The default value is 5 seconds. A larger
value may improve read performance by delaying
asynchronous writes, but this may cause uneven performance
when the transaction group is written. This value can be
adjusted at any time with &man.sysctl.8;.ZFS on i386Some of the features provided by ZFS
are memory intensive, and may require tuning for maximum
efficiency on systems with limited
RAM.MemoryAs a bare minimum, the total system memory should be at
least one gigabyte. The amount of recommended
RAM depends upon the size of the pool and
which ZFS features are used. A general
rule of thumb is 1 GB of RAM for every 1 TB of
storage. If the deduplication feature is used, a general
rule of thumb is 5 GB of RAM per TB of storage to be
deduplicated. While some users successfully use
ZFS with less RAM,
systems under heavy load may panic due to memory exhaustion.
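One reading of the rules of thumb above can be sketched as shell arithmetic: 1 GB of RAM per TB of ordinary storage plus 5 GB per TB of deduplicated storage. The helper is illustrative only; real requirements depend on the workload:

```shell
# Rule-of-thumb RAM sizing (illustrative): 1 GB per TB of
# ordinary storage, 5 GB per TB of deduplicated storage.
recommended_ram_gb() {
    pool_tb=$1
    dedup_tb=$2
    echo $(( (pool_tb - dedup_tb) + dedup_tb * 5 ))
}

recommended_ram_gb 10 0   # 10 TB pool, no dedup
recommended_ram_gb 4 2    # 4 TB pool, 2 TB of it deduplicated
```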
Further tuning may be required for systems with less than
the recommended RAM requirements.Kernel ConfigurationDue to the address space limitations of the
&i386; platform, ZFS users on the
&i386; architecture must add this option to a
custom kernel configuration file, rebuild the kernel, and
reboot:options KVA_PAGES=512This expands the kernel address space, allowing
the vm.kvm_size tunable to be pushed
beyond the currently imposed limit of 1 GB, or the
limit of 2 GB for PAE. To find the
most suitable value for this option, divide the desired
address space in megabytes by four. In this example, it
is 512 for 2 GB.Loader TunablesThe kmem address space can be
increased on all &os; architectures. On a test system with
1 GB of physical memory, success was achieved with
these options added to
/boot/loader.conf, and the system
restarted:vm.kmem_size="330M"
vm.kmem_size_max="330M"
vfs.zfs.arc_max="40M"
vfs.zfs.vdev.cache.size="5M"For a more detailed list of recommendations for
ZFS-related tuning, see .Additional ResourcesOpenZFSFreeBSD
Wiki - ZFS TuningOracle
Solaris ZFS Administration
GuideCalomel
Blog - ZFS Raidz Performance, Capacity
and IntegrityZFS Features and TerminologyZFS is a fundamentally different file
system because it is more than just a file system.
ZFS combines the roles of file system and
volume manager, enabling additional storage devices to be added
to a live system and having the new space available on all of
the existing file systems in that pool immediately. By
combining the traditionally separate roles,
ZFS is able to overcome previous limitations
that prevented RAID groups being able to
grow. Each top level device in a pool is called a
vdev, which can be a simple disk or a
RAID transformation such as a mirror or
RAID-Z array. ZFS file
systems (called datasets) each have access
to the combined free space of the entire pool. As blocks are
allocated from the pool, the space available to each file system
decreases. This approach avoids the common pitfall with
extensive partitioning where free space becomes fragmented
across the partitions.poolA storage pool is the most
basic building block of ZFS. A pool
is made up of one or more vdevs, the underlying devices
that store the data. A pool is then used to create one
or more file systems (datasets) or block devices
(volumes). These datasets and volumes share the pool of
remaining free space. Each pool is uniquely identified
by a name and a GUID. The features
available are determined by the ZFS
version number on the pool.vdev TypesA pool is made up of one or more vdevs, which
themselves can be a single disk or a group of disks, in
the case of a RAID transform. When
multiple vdevs are used, ZFS spreads
data across the vdevs to increase performance and
maximize usable space.
Disk
- The most basic type of vdev is a standard block
device. This can be an entire disk (such as
/dev/ada0
or
/dev/da0)
or a partition
(/dev/ada0p3).
On &os;, there is no performance penalty for using
a partition rather than the entire disk. This
differs from recommendations made by the Solaris
documentation.Using an entire disk as part of a bootable
pool is strongly discouraged, as this may render
the pool unbootable. Likewise, you should not
use an entire disk as part of a mirror or
RAID-Z vdev. This is
because it is impossible to reliably determine
the size of an unpartitioned disk at boot time
and because there is no place to put boot
code.File
- In addition to disks, ZFS
pools can be backed by regular files; this is
especially useful for testing and experimentation.
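A minimal sketch of preparing file-backed vdevs for experimentation. The file paths and pool name are made up, and the zpool command itself is shown commented out since it requires a ZFS-capable system:

```shell
# Create two sparse 128 MB backing files (the minimum vdev size).
truncate -s 128M /tmp/zdisk0 /tmp/zdisk1

# On a ZFS-capable system, the files can then back a test pool:
# zpool create testpool mirror /tmp/zdisk0 /tmp/zdisk1
```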
Use the full path to the file as the device path
in zpool create. All vdevs
must be at least 128 MB in size.Mirror
- When creating a mirror, specify the
mirror keyword followed by the
list of member devices for the mirror. A mirror
consists of two or more devices; all data will be
written to all member devices. A mirror vdev will
only hold as much data as its smallest member. A
mirror vdev can withstand the failure of all but
one of its members without losing any data.A regular single disk vdev can be upgraded
to a mirror vdev at any time with
zpool
attach.RAID-Z
- ZFS implements
RAID-Z, a variation on standard
RAID-5 that offers better
distribution of parity and eliminates the
RAID-5 write
hole in which the data and parity
information become inconsistent after an
unexpected restart. ZFS
supports three levels of RAID-Z
which provide varying levels of redundancy in
exchange for decreasing levels of usable storage.
The types are named RAID-Z1
through RAID-Z3 based on the
number of parity devices in the array and the
number of disks which can fail while the pool
remains operational.In a RAID-Z1 configuration
with four disks, each 1 TB, usable storage is
3 TB and the pool will still be able to
operate in degraded mode with one faulted disk.
If an additional disk goes offline before the
faulted disk is replaced and resilvered, all data
in the pool can be lost.In a RAID-Z3 configuration
with eight disks of 1 TB, the volume will
provide 5 TB of usable space and still be
able to operate with three faulted disks. &sun;
recommends no more than nine disks in a single
vdev. If the configuration has more disks, it is
recommended to divide them into separate vdevs and
the pool data will be striped across them.A configuration of two
RAID-Z2 vdevs consisting of 8
disks each would create something similar to a
RAID-60 array. A
RAID-Z group's storage capacity
is approximately the size of the smallest disk
multiplied by the number of non-parity disks.
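The capacity rule can be sketched as shell arithmetic. The helper name is made up for illustration:

```shell
# Approximate RAID-Z capacity: smallest disk multiplied by the
# number of non-parity disks.
raidz_capacity_tb() {
    disks=$1
    parity=$2
    smallest_tb=$3
    echo $(( (disks - parity) * smallest_tb ))
}

raidz_capacity_tb 4 1 1   # four 1 TB disks in RAID-Z1
raidz_capacity_tb 8 3 1   # eight 1 TB disks in RAID-Z3
```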
Four 1 TB disks in RAID-Z1
have an effective size of approximately 3 TB,
and an array of eight 1 TB disks in
RAID-Z3 will yield 5 TB of
usable space.Spare
- ZFS has a special pseudo-vdev
type for keeping track of available hot spares.
Note that installed hot spares are not deployed
automatically; they must manually be configured to
replace the failed device using
zpool replace.Log
- ZFS Log Devices, also known
as ZFS Intent Log (ZIL)
move the intent log from the regular pool devices
to a dedicated device, typically an
SSD. Having a dedicated log
device can significantly improve the performance
of applications with a high volume of synchronous
writes, especially databases. Log devices can be
mirrored, but RAID-Z is not
supported. If multiple log devices are used,
writes will be load balanced across them.Cache
- Adding a cache vdev to a pool will add the
storage of the cache to the L2ARC.
Cache devices cannot be mirrored. Since a cache
device only stores additional copies of existing
data, there is no risk of data loss.Transaction Group
(TXG)Transaction Groups are the way changed blocks are
grouped together and eventually written to the pool.
Transaction groups are the atomic unit that
ZFS uses to assert consistency. Each
transaction group is assigned a unique 64-bit
consecutive identifier. There can be up to three active
transaction groups at a time, one in each of these three
states:
Open - When a new
transaction group is created, it is in the open
state, and accepts new writes. There is always
a transaction group in the open state; however, the
transaction group may refuse new writes if it has
reached a limit. Once the open transaction group
has reached a limit, or the vfs.zfs.txg.timeout
has been reached, the transaction group advances
to the next state.Quiescing - A short state
that allows any pending operations to finish while
not blocking the creation of a new open
transaction group. Once all of the transactions
in the group have completed, the transaction group
advances to the final state.Syncing - All of the data
in the transaction group is written to stable
storage. This process will in turn modify other
data, such as metadata and space maps, that will
also need to be written to stable storage. The
process of syncing involves multiple passes. The
first and biggest pass consists of all of the changed
data blocks, followed by the metadata, which may take
multiple passes to complete. Since allocating
space for the data blocks generates new metadata,
the syncing state cannot finish until a pass
completes that does not allocate any additional
space. The syncing state is also where
synctasks are completed.
Synctasks are administrative operations, such as
creating or destroying snapshots and datasets,
that modify the uberblock. Once the
sync state is complete, the transaction group in
the quiescing state is advanced to the syncing
state.
All administrative functions, such as snapshots,
are written as part of the transaction group. When a
synctask is created, it is added to the currently open
transaction group, and that group is advanced as quickly
as possible to the syncing state to reduce the
latency of administrative commands.Adaptive Replacement
Cache (ARC)ZFS uses an Adaptive Replacement
Cache (ARC), rather than a more
traditional Least Recently Used (LRU)
cache. An LRU cache is a simple list
of items in the cache, sorted by when each object was
most recently used. New items are added to the top of
the list. When the cache is full, items from the
bottom of the list are evicted to make room for more
active objects. An ARC consists of
four lists: the Most Recently Used
(MRU) and Most Frequently Used
(MFU) objects, plus a ghost list for
each. These ghost lists track recently evicted objects
to prevent them from being added back to the cache.
This increases the cache hit ratio by avoiding objects
that have a history of only being used occasionally.
Another advantage of using both an
MRU and MFU is
that scanning an entire file system would normally evict
all data from an MRU or
LRU cache in favor of this freshly
accessed content. With ZFS, there is
also an MFU that only tracks the most
frequently used objects, and the cache of the most
commonly accessed blocks remains.L2ARCL2ARC is the second level
of the ZFS caching system. The
primary ARC is stored in
RAM. Since the amount of
available RAM is often limited,
ZFS can also use
cache vdevs.
Solid State Disks (SSDs) are often
used as these cache devices due to their higher speed
and lower latency compared to traditional spinning
disks. L2ARC is entirely optional,
but having one will significantly increase read speeds
for files that are cached on the SSD
instead of having to be read from the regular disks.
L2ARC can also speed up deduplication
because a DDT that does not fit in
RAM but does fit in the
L2ARC will be much faster than a
DDT that must be read from disk. The
rate at which data is added to the cache devices is
limited to prevent prematurely wearing out
SSDs with too many writes. Until the
cache is full (the first block has been evicted to make
room), writing to the L2ARC is
limited to the sum of the write limit and the boost
limit, and afterwards limited to the write limit. A
pair of &man.sysctl.8; values control these rate limits.
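The combined rate can be sketched as shell arithmetic. The helper name and the 8 MB/s figures below are illustrative, not the tunables' defaults:

```shell
# L2ARC fill rate in bytes/sec: write_max plus write_boost while
# the cache is still warming up, write_max alone afterwards.
l2arc_rate() {
    write_max=$1
    write_boost=$2
    warming=$3   # 1 until the first block is evicted
    if [ "$warming" -eq 1 ]; then
        echo $(( write_max + write_boost ))
    else
        echo "$write_max"
    fi
}

l2arc_rate 8388608 8388608 1   # warmup: 16 MB/s
l2arc_rate 8388608 8388608 0   # steady state: 8 MB/s
```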
vfs.zfs.l2arc_write_max
controls how many bytes are written to the cache per
second, while vfs.zfs.l2arc_write_boost
adds to this limit during the
Turbo Warmup Phase (Write Boost).ZILZIL accelerates synchronous
transactions by using storage devices like
SSDs that are faster than those used
in the main storage pool. When an application requests
a synchronous write (a guarantee that the data has been
safely stored to disk rather than merely cached to be
written later), the data is written to the faster
ZIL storage, then later flushed out
to the regular disks. This greatly reduces latency and
improves performance. Only synchronous workloads like
databases will benefit from a ZIL.
Regular asynchronous writes such as copying files will
not use the ZIL at all.Copy-On-WriteUnlike a traditional file system, when data is
overwritten on ZFS, the new data is
written to a different block rather than overwriting the
old data in place. Only when this write is complete is
the metadata then updated to point to the new location.
In the event of a shorn write (a system crash or power
loss in the middle of writing a file), the entire
original contents of the file are still available and
the incomplete write is discarded. This also means that
ZFS does not require a &man.fsck.8;
after an unexpected shutdown.DatasetDataset is the generic term
for a ZFS file system, volume,
snapshot or clone. Each dataset has a unique name in
the format
poolname/path@snapshot.
The root of the pool is technically a dataset as well.
Child datasets are named hierarchically like
directories. For example,
mypool/home, the home
dataset, is a child of mypool
and inherits properties from it. This can be expanded
further by creating
mypool/home/user. This
grandchild dataset will inherit properties from the
parent and grandparent. Properties on a child can be
set to override the defaults inherited from the parents
and grandparents. Administration of datasets and their
children can be
delegated.File systemA ZFS dataset is most often used
as a file system. Like most other file systems, a
ZFS file system is mounted somewhere
in the system's directory hierarchy and contains files
and directories of its own with permissions, flags, and
other metadata.VolumeIn addition to regular file system datasets,
ZFS can also create volumes, which
are block devices. Volumes have many of the same
features, including copy-on-write, snapshots, clones,
and checksumming. Volumes can be useful for running
other file system formats on top of
ZFS, such as UFS
virtualization, or exporting iSCSI
extents.SnapshotThe
copy-on-write
(COW) design of
ZFS allows for nearly instantaneous,
consistent snapshots with arbitrary names. After taking
a snapshot of a dataset, or a recursive snapshot of a
parent dataset that will include all child datasets, new
data is written to new blocks, but the old blocks are
not reclaimed as free space. The snapshot contains
the original version of the file system, and the live
file system contains any changes made since the snapshot
was taken. No additional space is used. As new data is
written to the live file system, new blocks are
allocated to store this data. The apparent size of the
snapshot will grow as the blocks are no longer used in
the live file system, but only in the snapshot. These
snapshots can be mounted read only to allow for the
recovery of previous versions of files. It is also
possible to
rollback a live
file system to a specific snapshot, undoing any changes
that took place after the snapshot was taken. Each
block in the pool has a reference counter which keeps
track of how many snapshots, clones, datasets, or
volumes make use of that block. As files and snapshots
are deleted, the reference count is decremented. When a
block is no longer referenced, it is reclaimed as free
space. Snapshots can also be marked with a
hold. When a
snapshot is held, any attempt to destroy it will return
an EBUSY error. Each snapshot can
have multiple holds, each with a unique name. The
release command
removes the hold so the snapshot can be deleted. Snapshots
can be taken on volumes, but they can only be cloned or
rolled back, not mounted independently.CloneSnapshots can also be cloned. A clone is a
writable version of a snapshot, allowing the file system
to be forked as a new dataset. As with a snapshot, a
clone initially consumes no additional space. As
new data is written to a clone and new blocks are
allocated, the apparent size of the clone grows. When
blocks are overwritten in the cloned file system or
volume, the reference count on the previous block is
decremented. The snapshot upon which a clone is based
cannot be deleted because the clone depends on it. The
snapshot is the parent, and the clone is the child.
Clones can be promoted, reversing
this dependency and making the clone the parent and the
previous parent the child. This operation requires no
- additional space. Because the amount of space used by
+ additional space. Since the amount of space used by
the parent and child is reversed, existing quotas and
reservations might be affected.ChecksumEvery block that is allocated is also checksummed.
The checksum algorithm used is a per-dataset property,
see set.
The checksum of each block is transparently validated as
it is read, allowing ZFS to detect
silent corruption. If the data that is read does not
match the expected checksum, ZFS will
attempt to recover the data from any available
redundancy, such as mirrors or RAID-Z.
Validation of all checksums can be triggered with scrub.
Checksum algorithms include:
fletcher2fletcher4sha256
The fletcher algorithms are faster,
but sha256 is a strong cryptographic
hash and has a much lower chance of collisions at the
cost of some performance. Checksums can be disabled,
but it is not recommended.CompressionEach dataset has a compression property, which
defaults to off. This property can be set to one of a
number of compression algorithms. This will cause all
new data that is written to the dataset to be
compressed. Beyond a reduction in space used, read and
write throughput often increases because fewer blocks
are read or written.
LZ4 -
Added in ZFS pool version
5000 (feature flags), LZ4 is
now the recommended compression algorithm.
LZ4 compresses approximately
50% faster than LZJB when
operating on compressible data, and is over three
times faster when operating on uncompressible
data. LZ4 also decompresses
approximately 80% faster than
LZJB. On modern
CPUs, LZ4
can often compress at over 500 MB/s, and
decompress at over 1.5 GB/s (per single CPU
core).LZJB -
The default compression algorithm. Created by
Jeff Bonwick (one of the original creators of
ZFS). LZJB
offers good compression with less
CPU overhead compared to
GZIP. In the future, the
default compression algorithm will likely change
to LZ4.GZIP -
A popular stream compression algorithm available
in ZFS. One of the main
advantages of using GZIP is its
configurable level of compression. When setting
the compress property, the
administrator can choose the level of compression,
ranging from gzip1, the lowest
level of compression, to gzip9,
the highest level of compression. This gives the
administrator control over how much
CPU time to trade for saved
disk space.ZLE -
Zero Length Encoding is a special compression
algorithm that only compresses continuous runs of
zeros. This compression algorithm is only useful
when the dataset contains large blocks of
zeros.CopiesWhen set to a value greater than 1, the
copies property instructs
ZFS to maintain multiple copies of
each block in the
File System
or
Volume. Setting
this property on important datasets provides additional
redundancy from which to recover a block that does not
match its checksum. In pools without redundancy, the
copies feature is the only form of redundancy. The
copies feature can recover from a single bad sector or
other forms of minor corruption, but it does not protect
the pool from the loss of an entire disk.DeduplicationChecksums make it possible to detect duplicate
blocks of data as they are written. With deduplication,
the reference count of an existing, identical block is
increased, saving storage space. To detect duplicate
blocks, a deduplication table (DDT)
is kept in memory. The table contains a list of unique
checksums, the location of those blocks, and a reference
count. When new data is written, the checksum is
calculated and compared to the list. If a match is
found, the existing block is used. The
SHA256 checksum algorithm is used
with deduplication to provide a secure cryptographic
hash. Deduplication is tunable. If
dedup is on, then
a matching checksum is assumed to mean that the data is
identical. If dedup is set to
verify, then the data in the two
blocks will be checked byte-for-byte to ensure it is
actually identical. If the data is not identical, the
hash collision will be noted and the two blocks will be
- stored separately. Because DDT must
+ stored separately. As DDT must
store the hash of each unique block, it consumes a very
large amount of memory. A general rule of thumb is
5-6 GB of RAM per 1 TB of deduplicated data.
In situations where it is not practical to have enough
RAM to keep the entire
DDT in memory, performance will
suffer greatly as the DDT must be
read from disk before each new block is written.
Deduplication can use L2ARC to store
the DDT, providing a middle ground
between fast system memory and slower disks. Consider
using compression instead, which often provides nearly
as much space savings without the additional memory
requirement.ScrubInstead of a consistency check like &man.fsck.8;,
ZFS has scrub.
scrub reads all data blocks stored on
the pool and verifies their checksums against the known
good checksums stored in the metadata. A periodic check
of all the data stored on the pool ensures the recovery
of any corrupted blocks before they are needed. A scrub
is not required after an unclean shutdown, but is
recommended at least once every three months. The
checksum of each block is verified as blocks are read
during normal use, but a scrub makes certain that even
infrequently used blocks are checked for silent
corruption. Data security is improved, especially in
archival storage situations. The relative priority of
scrub can be adjusted with vfs.zfs.scrub_delay
to prevent the scrub from degrading the performance of
other workloads on the pool.Dataset QuotaZFS provides very fast and
accurate dataset, user, and group space accounting in
addition to quotas and space reservations. This gives
the administrator fine grained control over how space is
allocated and allows space to be reserved for critical
file systems.
ZFS supports different types of
quotas: the dataset quota, the reference
quota (refquota), the
user
quota, and the
group
quota.Quotas limit the amount of space that a dataset
and all of its descendants, including snapshots of the
dataset, child datasets, and the snapshots of those
datasets, can consume.Quotas cannot be set on volumes, as the
volsize property acts as an
implicit quota.Reference
QuotaA reference quota limits the amount of space a
dataset can consume by enforcing a hard limit. However,
this hard limit includes only space that the dataset
references and does not include space used by
descendants, such as file systems or snapshots.User
QuotaUser quotas are useful to limit the amount of space
that can be used by the specified user.Group
QuotaThe group quota limits the amount of space that a
specified group can consume.Dataset
ReservationThe reservation property makes
it possible to guarantee a minimum amount of space for a
specific dataset and its descendants. If a 10 GB
reservation is set on
storage/home/bob, and another
dataset tries to use all of the free space, at least
10 GB of space is reserved for this dataset. If a
snapshot is taken of
storage/home/bob, the space used by
that snapshot is counted against the reservation. The
refreservation
property works in a similar way, but it
excludes descendants like
snapshots.
Reservations of any sort are useful in many
situations, such as planning and testing the
suitability of disk space allocation in a new system,
or ensuring that enough space is available on file
systems for audio logs or system recovery procedures
and files.Reference
ReservationThe refreservation property
makes it possible to guarantee a minimum amount of
space for the use of a specific dataset
excluding its descendants. This
means that if a 10 GB reservation is set on
storage/home/bob, and another
dataset tries to use all of the free space, at least
10 GB of space is reserved for this dataset. In
contrast to a regular
reservation,
space used by snapshots and descendant datasets is not
counted against the reservation. For example, if a
snapshot is taken of
storage/home/bob, enough disk space
must exist outside of the
refreservation amount for the
operation to succeed. Descendants of the main data set
are not counted in the refreservation
amount and so do not encroach on the space set.ResilverWhen a disk fails and is replaced, the new disk
must be filled with the data that was lost. The process
of using the parity information distributed across the
remaining drives to calculate and write the missing data
to the new drive is called
resilvering.OnlineA pool or vdev in the Online
state has all of its member devices connected and fully
operational. Individual devices in the
Online state are functioning
normally.OfflineIndividual devices can be put in an
Offline state by the administrator if
there is sufficient redundancy to avoid putting the pool
or vdev into a
Faulted state.
An administrator may choose to offline a disk in
preparation for replacing it, or to make it easier to
identify.DegradedA pool or vdev in the Degraded
state has one or more disks that have been disconnected
or have failed. The pool is still usable, but if
additional devices fail, the pool could become
unrecoverable. Reconnecting the missing devices or
replacing the failed disks will return the pool to an
Online state
after the reconnected or new device has completed the
Resilver
process.FaultedA pool or vdev in the Faulted
state is no longer operational. The data on it can no
longer be accessed. A pool or vdev enters the
Faulted state when the number of
missing or failed devices exceeds the level of
redundancy in the vdev. If missing devices can be
reconnected, the pool will return to an
Online state. If
there is insufficient redundancy to compensate for the
number of failed disks, then the contents of the pool
are lost and must be restored from backups.