diff --git a/en_US.ISO8859-1/books/arch-handbook/boot/chapter.xml b/en_US.ISO8859-1/books/arch-handbook/boot/chapter.xml index 798b7bc6d9..2ba2795fb3 100644 --- a/en_US.ISO8859-1/books/arch-handbook/boot/chapter.xml +++ b/en_US.ISO8859-1/books/arch-handbook/boot/chapter.xml @@ -1,2396 +1,2396 @@ Bootstrapping and Kernel Initialization Sergey Lyubka Contributed by Sergio Andrés Gómez del Real Updated and enhanced by Synopsis This chapter is an overview of the boot and system initialization processes, starting from the BIOS (firmware) POST, to the first user process creation. Since the initial steps of system startup are very architecture dependent, the IA-32 architecture is used as an example. The &os; boot process can be surprisingly complex. After control is passed from the BIOS, a considerable amount of low-level configuration must be done before the kernel can be loaded and executed. This setup must be done in a simple and flexible manner, allowing the user a great deal of customization. Overview The boot process is an extremely machine-dependent activity. Not only must code be written for every computer architecture, but there may also be multiple types of booting on the same architecture. For example, a directory listing of /usr/src/sys/boot reveals a great amount of architecture-dependent code. There is a directory for each of the various supported architectures. In the x86-specific i386 directory, there are subdirectories for different boot standards like mbr (Master Boot Record), gpt (GUID Partition Table), and efi (Extensible Firmware Interface). Each boot standard has its own conventions and data structures. The example that follows shows booting an x86 computer from an MBR hard drive with the &os; boot0 multi-boot loader stored in the very first sector. That boot code starts the &os; three-stage boot process. The key to understanding this process is that it is a series of stages of increasing complexity. These stages are boot1, boot2, and loader (see &man.boot.8; for more detail). The boot system executes each stage in sequence. The last stage, loader, is responsible for loading the &os; kernel. Each stage is examined in the following sections. Here is an example of the output generated by the different boot stages. Actual output may differ from machine to machine: &os; Component Output (may vary) boot0 F1 FreeBSD F2 BSD F5 Disk 2 boot2 This prompt will appear if the user presses a key just after selecting an OS to boot at the boot0 stage. >>FreeBSD/i386 BOOT Default: 1:ad(1,a)/boot/loader boot: loader BTX loader 1.00 BTX version is 1.02 Consoles: internal video/keyboard BIOS drive C: is disk0 BIOS 639kB/2096064kB available memory FreeBSD/x86 bootstrap loader, Revision 1.1 Console internal video/keyboard (root@snap.freebsd.org, Thu Jan 16 22:18:05 UTC 2014) Loading /boot/defaults/loader.conf /boot/kernel/kernel text=0xed9008 data=0x117d28+0x176650 syms=[0x8+0x137988+0x8+0x1515f8] kernel Copyright (c) 1992-2013 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 10.0-RELEASE #0 r260789: Thu Jan 16 22:34:59 UTC 2014 root@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64 FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610 The <acronym>BIOS</acronym> When the computer powers on, the processor's registers are set to some predefined values. One of the registers is the instruction pointer register, and its value after a power on is well defined: it is a 32-bit value of 0xfffffff0. The instruction pointer register (also known as the Program Counter) points to code to be executed by the processor. Another important register is the cr0 32-bit control register, and its value just after a reboot is 0. One of cr0's bits, the PE (Protection Enabled) bit, indicates whether the processor is running in 32-bit protected mode or 16-bit real mode. Since this bit is cleared at boot time, the processor boots in 16-bit real mode. Real mode means, among other things, that linear and physical addresses are identical. The reason the processor does not start immediately in 32-bit protected mode is backwards compatibility. In particular, the boot process relies on the services provided by the BIOS, and the BIOS itself works in legacy, 16-bit code. The value of 0xfffffff0 is slightly less than 4 GB, so unless the machine has 4 GB of physical memory, it cannot point to a valid memory address. The computer's hardware translates this address so that it points to a BIOS memory block. The BIOS (Basic Input Output System) is a chip on the motherboard that has a relatively small amount of read-only memory (ROM). This memory contains various low-level routines that are specific to the hardware supplied with the motherboard. The processor will first jump to the address 0xfffffff0, which really resides in the BIOS's memory. Usually this address contains a jump instruction to the BIOS's POST routines. The POST (Power On Self Test) is a set of routines including the memory check, system bus check, and other low-level initialization so the CPU can set up the computer properly. The important step of this stage is determining the boot device. Modern BIOS implementations permit the selection of a boot device, allowing booting from a floppy, CD-ROM, hard disk, or other devices. The very last thing in the POST is the INT 0x19 instruction. The INT 0x19 handler reads 512 bytes from the first sector of the boot device into the memory at address 0x7c00. The term first sector originates from hard drive architecture, where the magnetic plate is divided into a number of cylindrical tracks. Tracks are numbered, and every track is divided into a number (usually 63) of sectors. Track numbers start at 0, but sector numbers start from 1. Track 0 is the outermost on the magnetic plate, and sector 1, the first sector, has a special purpose. It is also called the MBR, or Master Boot Record. The remaining sectors on the first track are never used. This sector is our boot-sequence starting point. As we will see, this sector contains a copy of our boot0 program. A jump is made by the BIOS to address 0x7c00, so the boot code starts executing. The Master Boot Record (<literal>boot0</literal>) After control is received from the BIOS at memory address 0x7c00, boot0 starts executing. It is the first piece of code under &os; control. The task of boot0 is quite simple: scan the partition table and let the user choose which partition to boot from. The Partition Table is a special, standard data structure embedded in the MBR (hence embedded in boot0) describing the four standard PC partitions.
boot0 resides in the filesystem as /boot/boot0. It is a small 512-byte file, and it is exactly what &os;'s installation procedure wrote to the hard disk's MBR if you chose the bootmanager option at installation time. Indeed, boot0 is the MBR. As mentioned previously, the INT 0x19 instruction causes the INT 0x19 handler to load an MBR (boot0) into memory at address 0x7c00. The source file for boot0 can be found in sys/boot/i386/boot0/boot0.S - which is an awesome piece of code written by Robert Nordier. A special structure starting from offset 0x1be in the MBR is called the partition table. It has four records of 16 bytes each, called partition records, which represent how the hard disk is partitioned, or, in &os;'s terminology, sliced. One byte of those 16 says whether a partition (slice) is bootable or not. Exactly one record must have that flag set, otherwise boot0's code will refuse to proceed. A partition record has the following fields: the 1-byte filesystem type the 1-byte bootable flag the 6 byte descriptor in CHS format the 8 byte descriptor in LBA format A partition record descriptor contains information about where exactly the partition resides on the drive. Both descriptors, LBA and CHS, describe the same information, but in different ways: LBA (Logical Block Addressing) has the starting sector for the partition and the partition's length, while CHS (Cylinder Head Sector) has coordinates for the first and last sectors of the partition. The partition table ends with the special signature 0xaa55. The MBR must fit into 512 bytes, a single disk sector. This program uses low-level tricks like taking advantage of the side effects of certain instructions and reusing register values from previous operations to make the most out of the fewest possible instructions. Care must also be taken when handling the partition table, which is embedded in the MBR itself. For these reasons, be very careful when modifying boot0.S. Note that the boot0.S source file is assembled as is: instructions are translated one by one to binary, with no additional information (no ELF file format, for example). This kind of low-level control is achieved at link time through special control flags passed to the linker. For example, the text section of the program is set to be located at address 0x600. In practice this means that boot0 must be loaded to memory address 0x600 in order to function properly. It is worth looking at the Makefile for boot0 (sys/boot/i386/boot0/Makefile), as it defines some of the run-time behavior of boot0. For instance, if a terminal connected to the serial port (COM1) is used for I/O, the macro SIO must be defined (-DSIO). -DPXE enables boot through PXE by pressing F6. Additionally, the program defines a set of flags that allow further modification of its behavior. All of this is illustrated in the Makefile. For example, look at the linker directives which command the linker to start the text section at address 0x600, and to build the output file as is (strip out any file formatting):
<filename>sys/boot/i386/boot0/Makefile</filename> BOOT_BOOT0_ORG?=0x600 LDFLAGS=-e start -Ttext ${BOOT_BOOT0_ORG} \ -Wl,-N,-S,--oformat,binary
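To make the partition table layout concrete, here is the 16-byte partition record and the table that contains it, sketched as C structures. This is an illustration only (the field names are not taken from the &os; sources), but the offsets and the 0xaa55 signature match the description above. A helper shows the standard CHS-to-LBA conversion, which is why both descriptors in a record can express the same location:

#include <stdint.h>

/* One 16-byte partition record, as described above. */
struct mbr_partition {
        uint8_t  flag;          /* bootable flag: 0x80 = active */
        uint8_t  chs_first[3];  /* CHS descriptor of first sector */
        uint8_t  type;          /* filesystem type: 0xa5 = FreeBSD */
        uint8_t  chs_last[3];   /* CHS descriptor of last sector */
        uint32_t lba_start;     /* LBA of first sector (little-endian) */
        uint32_t lba_size;      /* number of sectors (little-endian) */
} __attribute__((packed));

/* The partition table: four records starting at offset 0x1be of the
 * MBR, followed by the 0xaa55 signature at offset 0x1fe. */
struct mbr_partition_table {
        struct mbr_partition part[4];
        uint16_t signature;     /* must be 0xaa55 */
} __attribute__((packed));

/* Standard CHS-to-LBA conversion: both descriptors name the same
 * sectors, given the drive geometry (heads, sectors per track).
 * Note that sector numbers are 1-based, hence the "sect - 1". */
static inline uint32_t
chs_to_lba(uint32_t cyl, uint32_t head, uint32_t sect,
    uint32_t nheads, uint32_t nsects)
{
        return ((cyl * nheads + head) * nsects + (sect - 1));
}

In particular, cylinder 0, head 0, sector 1 always converts to LBA 0, regardless of geometry.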
Let us now start our study of the MBR, or boot0, at the point where execution begins. Some instructions have been modified in favor of better exposition: for example, some macros are expanded, and some macro tests are omitted when the result of the test is known. This applies to all of the code examples shown.
<filename>sys/boot/i386/boot0/boot0.S</filename> start: cld # String ops inc xorw %ax,%ax # Zero movw %ax,%es # Address movw %ax,%ds # data movw %ax,%ss # Set up movw $0x7c00,%sp # stack
This first block of code is the entry point of the program. It is where the BIOS transfers control. First, it makes sure that the string operations autoincrement their pointer operands (the cld instruction; when in doubt, we refer the reader to the official Intel manuals, which describe the exact semantics of each instruction). Then, as it makes no assumptions about the state of the segment registers, it initializes them. Finally, it sets the stack pointer register (%sp) to address 0x7c00, so we have a working stack. The next block is responsible for the relocation and subsequent jump to the relocated code.
<filename>sys/boot/i386/boot0/boot0.S</filename> movw $0x7c00,%si # Source movw $0x600,%di # Destination movw $512,%cx # Byte count rep # Relocate movsb # code movw %di,%bp # Address variables movb $16,%cl # Bytes to clear rep # Zero stosb # them incb -0xe(%di) # Set the S field to 1 jmp main-0x7c00+0x600 # Jump to relocated code
As boot0 is loaded by the BIOS to address 0x7C00, it copies itself to address 0x600 and then transfers control there (recall that it was linked to execute at address 0x600). The source address, 0x7c00, is copied to register %si. The destination address, 0x600, to register %di. The number of bytes to copy, 512 (the program's size), is copied to register %cx. Next, the rep instruction repeats the instruction that follows, that is, movsb, the number of times dictated by the %cx register. The movsb instruction copies the byte pointed to by %si to the address pointed to by %di. This is repeated another 511 times. On each repetition, both the source and destination registers, %si and %di, are incremented by one. Thus, upon completion of the 512-byte copy, %di has the value 0x600+512=0x800, and %si has the value 0x7c00+512=0x7e00; we have thus completed the code relocation. Next, the destination register %di is copied to %bp. %bp gets the value 0x800. The value 16 is copied to %cl in preparation for a new string operation (like our previous movsb). Now, stosb is executed 16 times. This instruction copies a 0 value to the address pointed to by the destination register (%di, which is 0x800), and increments it. This is repeated another 15 times, so %di ends up with value 0x810. Effectively, this clears the address range 0x800-0x80f. This range is used as a (fake) partition table for writing the MBR back to disk. Finally, the sector field for the CHS addressing of this fake partition is given the value 1, and a jump is made to the main function from the relocated code. Note that until this jump to the relocated code, any reference to an absolute address was avoided.
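Expressed in C, the whole relocation sequence amounts to the following sketch. This is purely illustrative; the real boot0 has no C runtime and runs in 16-bit real mode:

#include <stdint.h>
#include <string.h>

/* Illustration of the rep movsb / rep stosb sequence above. */
static void
relocate_boot0(void)
{
        uint8_t *fake_part = (uint8_t *)0x800;

        /* rep movsb: copy the 512-byte MBR from 0x7c00 to 0x600. */
        memcpy((void *)0x600, (void *)0x7c00, 512);

        /* rep stosb: clear the 16-byte fake partition at 0x800. */
        memset(fake_part, 0, 16);

        /* incb -0xe(%di): set the CHS sector field (at 0x802) to 1. */
        fake_part[2] = 1;

        /* jmp main-0x7c00+0x600: continue in the relocated copy. */
}

The following code block tests whether the drive number provided by the BIOS should be used, or the one stored in boot0.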
<filename>sys/boot/i386/boot0/boot0.S</filename> main: testb $SETDRV,-69(%bp) # Set drive number? jnz disable_update # Yes testb %dl,%dl # Drive number valid? js save_curdrive # Possibly (0x80 set)
This code tests the SETDRV bit (0x20) in the flags variable. Recall that register %bp points to address location 0x800, so the test is performed on the flags variable at address 0x800-69=0x7bb. This is an example of the type of modifications that can be done to boot0. The SETDRV flag is not set by default, but it can be set in the Makefile. When set, the drive number stored in the MBR is used instead of the one provided by the BIOS. We assume the defaults, and that the BIOS provided a valid drive number, so we jump to save_curdrive. The next block saves the drive number provided by the BIOS, and calls putn to print a new line on the screen.
<filename>sys/boot/i386/boot0/boot0.S</filename> save_curdrive: movb %dl, (%bp) # Save drive number pushw %dx # Also in the stack #ifdef TEST /* test code, print internal bios drive */ rolb $1, %dl movw $drive, %si call putkey #endif callw putn # Print a newline
Note that we assume TEST is not defined, so the conditional code in it is not assembled and will not appear in our executable boot0. Our next block implements the actual scanning of the partition table. It prints to the screen the partition type for each of the four entries in the partition table. It compares each type with a list of well-known operating system file systems. Examples of recognized partition types are NTFS (&windows;, ID 0x7), ext2fs (&linux;, ID 0x83), and, of course, ffs/ufs2 (&os;, ID 0xa5). The implementation is fairly simple.
<filename>sys/boot/i386/boot0/boot0.S</filename> movw $(partbl+0x4),%bx # Partition table (+4) xorw %dx,%dx # Item number read_entry: movb %ch,-0x4(%bx) # Zero active flag (ch == 0) btw %dx,_FLAGS(%bp) # Entry enabled? jnc next_entry # No movb (%bx),%al # Load type test %al, %al # skip empty partition jz next_entry movw $bootable_ids,%di # Lookup tables movb $(TLEN+1),%cl # Number of entries repne # Locate scasb # type addw $(TLEN-1), %di # Adjust movb (%di),%cl # Partition addw %cx,%di # description callw putx # Display it next_entry: incw %dx # Next item addb $0x10,%bl # Next entry jnc read_entry # Till done
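In C terms, the scan behaves roughly like the sketch below. The ID table is abbreviated, and the structure and names are illustrative, not from the &os; sources:

#include <stdint.h>
#include <stdio.h>

/* Abbreviated mirror of boot0's bootable_ids lookup table. */
static const struct { uint8_t id; const char *name; } bootable_ids[] = {
        { 0x07, "NTFS" },       /* Windows */
        { 0x83, "ext2fs" },     /* Linux */
        { 0xa5, "FreeBSD" },    /* ffs/ufs2 */
};

/* One 16-byte partition record (see the structure shown earlier). */
struct mbr_partition {
        uint8_t  flag, chs_first[3], type, chs_last[3];
        uint32_t lba_start, lba_size;
};

static void
scan_partition_table(struct mbr_partition part[4])
{
        for (int i = 0; i < 4; i++) {
                part[i].flag = 0;       /* zero every active flag */
                if (part[i].type == 0)
                        continue;       /* skip empty entries */
                /* Print a menu line for each recognized type
                 * (boot0 also prints a fallback for unknown IDs). */
                for (size_t j = 0;
                    j < sizeof(bootable_ids) / sizeof(bootable_ids[0]); j++)
                        if (bootable_ids[j].id == part[i].type)
                                printf("F%d %s\n", i + 1,
                                    bootable_ids[j].name);
        }
}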
It is important to note that the active flag for each entry is cleared, so after the scanning, no partition entry is active in our memory copy of boot0. Later, the active flag will be set for the selected partition. This ensures that only one active partition exists if the user chooses to write the changes back to disk. The next block tests for other drives. At startup, the BIOS writes the number of drives present in the computer to address 0x475. If there are any other drives present, boot0 prints the current drive to screen. The user may command boot0 to scan partitions on another drive later.
<filename>sys/boot/i386/boot0/boot0.S</filename> popw %ax # Drive number subb $0x79,%al # Does next cmpb 0x475,%al # drive exist? (from BIOS?) jb print_drive # Yes decw %ax # Already drive 0? jz print_prompt # Yes
We make the assumption that a single drive is present, so the jump to print_drive is not performed. We also assume nothing strange happened, so we jump to print_prompt. This next block just prints out a prompt followed by the default option:
<filename>sys/boot/i386/boot0/boot0.S</filename> print_prompt: movw $prompt,%si # Display callw putstr # prompt movb _OPT(%bp),%dl # Display decw %si # default callw putkey # key jmp start_input # Skip beep
Finally, a jump is performed to start_input, where the BIOS services are used to start a timer and to read user input from the keyboard; if the timer expires, the default option will be selected:
<filename>sys/boot/i386/boot0/boot0.S</filename> start_input: xorb %ah,%ah # BIOS: Get int $0x1a # system time movw %dx,%di # Ticks when addw _TICKS(%bp),%di # timeout read_key: movb $0x1,%ah # BIOS: Check int $0x16 # for keypress jnz got_key # Have input xorb %ah,%ah # BIOS: int 0x1a, 00 int $0x1a # get system time cmpw %di,%dx # Timeout? jb read_key # No
An interrupt is requested with number 0x1a and argument 0 in register %ah. The BIOS has a predefined set of services, requested by applications as software-generated interrupts through the int instruction, with arguments passed in registers (in this case, %ah). Here, particularly, we are requesting the number of clock ticks since last midnight; this value is computed by the BIOS through the RTC (Real Time Clock). This clock can be programmed to work at frequencies ranging from 2 Hz to 8192 Hz. The BIOS sets it to 18.2 Hz at startup. When the request is satisfied, a 32-bit result is returned by the BIOS in registers %cx and %dx (lower bytes in %dx). This result (the %dx part) is copied to register %di, and the value of the TICKS variable is added to %di. This variable resides in boot0 at offset _TICKS (a negative value) from register %bp (which, recall, points to 0x800). The default value of this variable is 0xb6 (182 in decimal). Now, the idea is that boot0 constantly requests the time from the BIOS, and when the value returned in register %dx is greater than the value stored in %di, the time is up and the default selection will be made. Since the RTC ticks 18.2 times per second, this condition will be met after 10 seconds (this default behavior can be changed in the Makefile). Until this time has passed, boot0 continually asks the BIOS for any user input; this is done through int 0x16, argument 1 in %ah. Whether a key was pressed or the time expired, subsequent code validates the selection. Based on the selection, the register %si is set to point to the appropriate partition entry in the partition table. This new selection overrides the previous default one. Indeed, it becomes the new default. Finally, the ACTIVE flag of the selected partition is set. If it was enabled at compile time, the in-memory version of boot0 with these modified values is written back to the MBR on disk. We leave the details of this implementation to the reader.
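The timeout arithmetic is easy to verify with a few lines of C:

#include <stdio.h>

int
main(void)
{
        const double rtc_hz = 18.2;     /* BIOS default tick rate */
        const unsigned ticks = 0xb6;    /* default _TICKS value (182) */

        /* boot0 waits until the BIOS tick count advances by _TICKS. */
        printf("timeout = %.1f seconds\n", ticks / rtc_hz);  /* 10.0 */
        return (0);
}

We now end our study with the last code block from the boot0 program: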
<filename>sys/boot/i386/boot0/boot0.S</filename> movw $0x7c00,%bx # Address for read movb $0x2,%ah # Read sector callw intx13 # from disk jc beep # If error cmpw $0xaa55,0x1fe(%bx) # Bootable? jne beep # No pushw %si # Save ptr to selected part. callw putn # Leave some space popw %si # Restore, next stage uses it jmp *%bx # Invoke bootstrap
Recall that %si points to the selected partition entry. This entry tells us where the partition begins on disk. We assume, of course, that the partition selected is actually a &os; slice. From now on, we will favor the use of the technically more accurate term slice rather than partition. The transfer buffer is set to 0x7c00 (register %bx), and a read for the first sector of the &os; slice is requested by calling intx13. We assume that everything went okay, so a jump to beep is not performed. In particular, the new sector read must end with the magic sequence 0xaa55. Finally, the value at %si (the pointer to the selected partition entry) is preserved for use by the next stage, and a jump is performed to address 0x7c00, where execution of our next stage (the just-read block) is started.
<literal>boot1</literal> Stage So far we have gone through the following sequence: The BIOS did some early hardware initialization, including the POST. The MBR (boot0) was loaded from absolute disk sector one to address 0x7c00. Execution control was passed to that location. boot0 relocated itself to the location it was linked to execute (0x600), followed by a jump to continue execution at the appropriate place. Finally, boot0 loaded the first disk sector from the &os; slice to address 0x7c00. Execution control was passed to that location. boot1 is the next step in the boot-loading sequence. It is the first of three boot stages. Note that we have been dealing exclusively with disk sectors. Indeed, the BIOS loads the absolute first sector, while boot0 loads the first sector of the &os; slice. Both loads are to address 0x7c00. We can conceptually think of these disk sectors as containing the files boot0 and boot1, respectively, but in reality this is not entirely true for boot1. Strictly speaking, unlike boot0, boot1 is not part of the boot blocks (there is a file /boot/boot1, but it is not written to the beginning of the &os; slice; instead, it is concatenated with boot2 to form boot, which is written to the beginning of the &os; slice and read at boot time). Instead, a single, full-blown file, boot (/boot/boot), is what ultimately is written to disk. This file is a combination of boot1, boot2 and the Boot Extender (or BTX). This single file is greater in size than a single sector (greater than 512 bytes). Fortunately, boot1 occupies exactly the first 512 bytes of this single file, so when boot0 loads the first sector of the &os; slice (512 bytes), it is actually loading boot1 and transferring control to it. The main task of boot1 is to load the next boot stage. This next stage is somewhat more complex. It is composed of a server called the Boot Extender, or BTX, and a client, called boot2. As we will see, the last boot stage, loader, is also a client of the BTX server. Let us now look in detail at what exactly is done by boot1, starting like we did for boot0, at its entry point:
<filename>sys/boot/i386/boot2/boot1.S</filename> start: jmp main
The entry point at start simply jumps past a special data area to the label main, which in turn looks like this:
<filename>sys/boot/i386/boot2/boot1.S</filename> main: cld # String ops inc xor %cx,%cx # Zero mov %cx,%es # Address mov %cx,%ds # data mov %cx,%ss # Set up mov $start,%sp # stack mov %sp,%si # Source mov $0x700,%di # Destination incb %ch # Word count rep # Copy movsw # code
Just like boot0, this code relocates boot1, this time to memory address 0x700. However, unlike boot0, it does not jump there. boot1 is linked to execute at address 0x7c00, effectively where it was loaded in the first place. The reason for this relocation will be discussed shortly. Next comes a loop that looks for the &os; slice. Although boot0 loaded boot1 from the &os; slice, no information was passed to it about this (actually, we did pass a pointer to the slice entry in register %si, but boot1 does not assume that it was loaded by boot0; some other MBR might have loaded it without passing this information, so it assumes nothing), so boot1 must rescan the partition table to find where the &os; slice starts. To do this, it rereads the MBR:
<filename>sys/boot/i386/boot2/boot1.S</filename> mov $part4,%si # Partition cmpb $0x80,%dl # Hard drive? jb main.4 # No movb $0x1,%dh # Block count callw nread # Read MBR
In the code above, register %dl maintains information about the boot device. This is passed on by the BIOS and preserved by the MBR. Numbers 0x80 and greater tell us that we are dealing with a hard drive, so a call is made to nread, where the MBR is read. Arguments to nread are passed through %si and %dh. The memory address at label part4 is copied to %si. This memory address holds a fake partition to be used by nread. The following is the data in the fake partition:
<filename>sys/boot/i386/boot2/boot1.S</filename> part4: .byte 0x80, 0x00, 0x01, 0x00 .byte 0xa5, 0xfe, 0xff, 0xff .byte 0x00, 0x00, 0x00, 0x00 .byte 0x50, 0xc3, 0x00, 0x00
In particular, the LBA for this fake partition is hardcoded to zero. This is used as an argument to the BIOS for reading absolute sector one from the hard drive. Alternatively, CHS addressing could be used. In this case, the fake partition holds cylinder 0, head 0 and sector 1, which is equivalent to absolute sector one. Let us now proceed to take a look at nread:
<filename>sys/boot/i386/boot2/boot1.S</filename> nread: mov $0x8c00,%bx # Transfer buffer mov 0x8(%si),%ax # Get mov 0xa(%si),%cx # LBA push %cs # Read from callw xread.1 # disk jnc return # If success, return
Recall that %si points to the fake partition. The word (in the context of 16-bit real mode, a word is 2 bytes) at offset 0x8 is copied to register %ax, and the word at offset 0xa to %cx. They are interpreted by the BIOS as the lower 4 bytes of the LBA to be read (the upper four bytes are assumed to be zero). Register %bx holds the memory address where the MBR will be loaded. The instruction pushing %cs onto the stack is very interesting. In this context, it accomplishes nothing. However, as we will see shortly, boot2, in conjunction with the BTX server, also uses xread.1. This mechanism will be discussed in the next section. The code at xread.1 further calls the read function, which actually calls the BIOS asking for the disk sector:
<filename>sys/boot/i386/boot2/boot1.S</filename> xread.1: pushl $0x0 # absolute push %cx # block push %ax # number push %es # Address of push %bx # transfer buffer xor %ax,%ax # Number of movb %dh,%al # blocks to push %ax # transfer push $0x10 # Size of packet mov %sp,%bp # Packet pointer callw read # Read from disk lea 0x10(%bp),%sp # Clear stack lret # To far caller
Note the long return instruction at the end of this block. This instruction pops out the %cs register pushed by nread, and returns. Finally, nread also returns.
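The words pushed by xread.1 build, in memory, the packet consumed by the BIOS INT 0x13 extended read (function 0x42). Its layout is the standard EDD Disk Address Packet, sketched here as a C structure; %bp is left pointing at it, and the offsets match the pushes shown above:

#include <stdint.h>

/* The packet that xread.1 assembles on the stack, lowest address
 * first (the last push ends up at the lowest address). */
struct disk_packet {
        uint8_t  size;          /* packet size: 0x10 */
        uint8_t  reserved;      /* always 0 */
        uint16_t count;         /* sectors to transfer (from %dh) */
        uint16_t buf_offset;    /* transfer buffer offset (%bx) */
        uint16_t buf_segment;   /* transfer buffer segment (%es) */
        uint64_t lba;           /* starting sector; upper dword is 0 */
} __attribute__((packed));

With the MBR loaded into memory, the actual loop for searching the &os; slice begins: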
<filename>sys/boot/i386/boot2/boot1.S</filename> mov $0x1,%cx # Two passes main.1: mov $0x8dbe,%si # Partition table movb $0x1,%dh # Partition main.2: cmpb $0xa5,0x4(%si) # Our partition type? jne main.3 # No jcxz main.5 # If second pass testb $0x80,(%si) # Active? jnz main.5 # Yes main.3: add $0x10,%si # Next entry incb %dh # Partition cmpb $0x5,%dh # In table? jb main.2 # Yes dec %cx # Do two jcxz main.1 # passes
If a &os; slice is identified, execution continues at main.5. Note that when a &os; slice is found, %si points to the appropriate entry in the partition table, and %dh holds the partition number. We assume that a &os; slice is found, so we continue execution at main.5:
<filename>sys/boot/i386/boot2/boot1.S</filename> main.5: mov %dx,0x900 # Save args movb $0x10,%dh # Sector count callw nread # Read disk mov $0x9000,%bx # BTX mov 0xa(%bx),%si # Get BTX length and set add %bx,%si # %si to start of boot2.bin mov $0xc000,%di # Client page 2 mov $0xa200,%cx # Byte sub %si,%cx # count rep # Relocate movsb # client
Recall that at this point, register %si points to the &os; slice entry in the MBR partition table, so a call to nread will effectively read sectors at the beginning of this partition. The argument passed in register %dh tells nread to read 16 disk sectors. Recall that the first 512 bytes, or the first sector, of the &os; slice coincide with the boot1 program. Also recall that the file written to the beginning of the &os; slice is not /boot/boot1, but /boot/boot. Let us look at the size of these files in the filesystem: -r--r--r-- 1 root wheel 512B Jan 8 00:15 /boot/boot0 -r--r--r-- 1 root wheel 512B Jan 8 00:15 /boot/boot1 -r--r--r-- 1 root wheel 7.5K Jan 8 00:15 /boot/boot2 -r--r--r-- 1 root wheel 8.0K Jan 8 00:15 /boot/boot Both boot0 and boot1 are 512 bytes each, so they fit exactly in one disk sector. boot2 is much bigger, holding both the BTX server and the boot2 client. Finally, a file called simply boot is 512 bytes larger than boot2. This file is a concatenation of boot1 and boot2. As already noted, boot0 is the file written to the absolute first disk sector (the MBR), and boot is the file written to the first sector of the &os; slice; boot1 and boot2 are not written to disk. The command used to concatenate boot1 and boot2 into a single boot is merely cat boot1 boot2 > boot. So boot1 occupies exactly the first 512 bytes of boot and, because boot is written to the first sector of the &os; slice, boot1 fits exactly in this first sector. When nread reads the first 16 sectors of the &os; slice, it effectively reads the entire boot file (512*16=8192 bytes, exactly the size of boot). We will see more details about how boot is formed from boot1 and boot2 in the next section. Recall that nread uses memory address 0x8c00 as the transfer buffer to hold the sectors read. This address is conveniently chosen. Indeed, because boot1 occupies the first 512 bytes of boot, it ends up in the address range 0x8c00-0x8dff. The 512 bytes that follow (range 0x8e00-0x8fff) are used to store the bsdlabel (historically known as disklabel; if you ever wondered where &os; stores this information, it is in this region; see &man.bsdlabel.8;). Starting at address 0x9000 is the beginning of the BTX server, and immediately following is the boot2 client. The BTX server acts as a kernel, and executes in protected mode in the most privileged level. In contrast, the BTX clients (boot2, for example) execute in user mode. We will see how this is accomplished in the next section. The code after the call to nread locates the beginning of boot2 in the memory buffer, and copies it to memory address 0xc000. This is because the BTX server arranges boot2 to execute in a segment starting at 0xa000. We explore this in detail in the following section. The last code block of boot1 enables access to memory above 1MB (this is necessary for legacy reasons) and concludes with a jump to the starting point of the BTX server:
<filename>sys/boot/i386/boot2/boot1.S</filename> seta20: cli # Disable interrupts seta20.1: dec %cx # Timeout? jz seta20.3 # Yes inb $0x64,%al # Get status testb $0x2,%al # Busy? jnz seta20.1 # Yes movb $0xd1,%al # Command: Write outb %al,$0x64 # output port seta20.2: inb $0x64,%al # Get status testb $0x2,%al # Busy? jnz seta20.2 # Yes movb $0xdf,%al # Enable outb %al,$0x60 # A20 seta20.3: sti # Enable interrupts jmp 0x9010 # Start BTX
Note that right before the jump, interrupts are enabled.
The <acronym>BTX</acronym> Server Next in our boot sequence is the BTX Server. Let us quickly remember how we got here: The BIOS loads the absolute sector one (the MBR, or boot0) to address 0x7c00 and jumps there. boot0 relocates itself to 0x600, the address it was linked to execute, and jumps there. It then reads the first sector of the &os; slice (which consists of boot1) into address 0x7c00 and jumps there. boot1 loads the first 16 sectors of the &os; slice into address 0x8c00. These 16 sectors, or 8192 bytes, are the whole file boot. The file is a concatenation of boot1 and boot2. boot2, in turn, contains the BTX server and the boot2 client. Finally, a jump is made to address 0x9010, the entry point of the BTX server. Before studying the BTX Server in detail, let us further review how the single, all-in-one boot file is created. The way boot is built is defined in its Makefile (/usr/src/sys/boot/i386/boot2/Makefile). Let us look at the rule that creates the boot file:
<filename>sys/boot/i386/boot2/Makefile</filename> boot: boot1 boot2 cat boot1 boot2 > boot
This tells us that boot1 and boot2 are needed, and the rule simply concatenates them to produce a single file called boot. The rules for creating boot1 are also quite simple:
<filename>sys/boot/i386/boot2/Makefile</filename> boot1: boot1.out objcopy -S -O binary boot1.out boot1 boot1.out: boot1.o ld -e start -Ttext 0x7c00 -o boot1.out boot1.o
To apply the rule for creating boot1, boot1.out must be resolved. This, in turn, depends on the existence of boot1.o. This last file is simply the result of assembling our familiar boot1.S, without linking. Now, the rule for creating boot1.out is applied. This tells us that boot1.o should be linked with start as its entry point, and starting at address 0x7c00. Finally, boot1 is created from boot1.out applying the appropriate rule. This rule is the objcopy command applied to boot1.out. Note the flags passed to objcopy: -S tells it to strip all relocation and symbolic information; -O binary indicates the output format, that is, a simple, unformatted binary file. Having boot1, let us take a look at how boot2 is constructed:
<filename>sys/boot/i386/boot2/Makefile</filename> boot2: boot2.ld @set -- `ls -l boot2.ld`; x=$$((7680-$$5)); \ echo "$$x bytes available"; test $$x -ge 0 dd if=boot2.ld of=boot2 obs=7680 conv=osync boot2.ld: boot2.ldr boot2.bin ../btx/btx/btx btxld -v -E 0x2000 -f bin -b ../btx/btx/btx -l boot2.ldr \ -o boot2.ld -P 1 boot2.bin boot2.ldr: dd if=/dev/zero of=boot2.ldr bs=512 count=1 boot2.bin: boot2.out objcopy -S -O binary boot2.out boot2.bin boot2.out: ../btx/lib/crt0.o boot2.o sio.o ld -Ttext 0x2000 -o boot2.out boot2.o: boot2.s ${CC} ${ACFLAGS} -c boot2.s boot2.s: boot2.c boot2.h ${.CURDIR}/../../common/ufsread.c ${CC} ${CFLAGS} -S -o boot2.s.tmp ${.CURDIR}/boot2.c sed -e '/align/d' -e '/nop/d' < boot2.s.tmp > boot2.s rm -f boot2.s.tmp boot2.h: boot1.out ${NM} -t d ${.ALLSRC} | awk '/([0-9])+ T xread/ \ { x = $$1 - ORG1; \ printf("#define XREADORG %#x\n", REL1 + x) }' \ ORG1=`printf "%d" ${ORG1}` \ REL1=`printf "%d" ${REL1}` > ${.TARGET}
The mechanism for building boot2 is far more elaborate. Let us point out the most relevant facts. The dependency list is as follows:
<filename>sys/boot/i386/boot2/Makefile</filename> boot2: boot2.ld boot2.ld: boot2.ldr boot2.bin ${BTXDIR}/btx/btx boot2.bin: boot2.out boot2.out: ${BTXDIR}/lib/crt0.o boot2.o sio.o boot2.o: boot2.s boot2.s: boot2.c boot2.h ${.CURDIR}/../../common/ufsread.c boot2.h: boot1.out
Note that initially there is no header file boot2.h, but its creation depends on boot1.out, which we already have. The rule for its creation is a bit terse, but the important thing is that the output, boot2.h, is something like this:
<filename>sys/boot/i386/boot2/boot2.h</filename> #define XREADORG 0x725
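The value is just arithmetic on two known addresses: boot1 is linked at 0x7c00 (ORG1) and relocated to 0x700 (REL1), so a symbol at linked address 0x7c25 lands at 0x725 after relocation. The following sketch reproduces what the awk rule computes (the 0x7c25 figure is inferred from this sample output):

#include <stdio.h>

int
main(void)
{
        const unsigned org1 = 0x7c00;   /* boot1's linked address */
        const unsigned rel1 = 0x700;    /* boot1's relocated address */
        const unsigned xread = 0x7c25;  /* xread's linked address */

        /* Same computation as the awk rule in the Makefile. */
        printf("#define XREADORG %#x\n", rel1 + (xread - org1));
        return (0);
}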
Recall that boot1 was relocated (i.e., copied from 0x7c00 to 0x700). This relocation will now make sense, because as we will see, the BTX server reclaims some memory, including the space where boot1 was originally loaded. However, the BTX server needs access to boot1's xread function; this function, according to the output of boot2.h, is at location 0x725. Indeed, the BTX server uses the xread function from boot1's relocated code. This function is now accessible from within the boot2 client. We next build boot2.s from files boot2.h, boot2.c and /usr/src/sys/boot/common/ufsread.c. The rule for this is to compile the code in boot2.c (which includes boot2.h and ufsread.c) into assembly code. Having boot2.s, the next rule assembles boot2.s, creating the object file boot2.o. The next rule directs the linker to link various files (crt0.o, boot2.o and sio.o). Note that the output file, boot2.out, is linked to execute at address 0x2000. Recall that boot2 will be executed in user mode, within a special user segment set up by the BTX server. This segment starts at 0xa000. Also, remember that the boot2 portion of boot was copied to address 0xc000, that is, offset 0x2000 from the start of the user segment, so boot2 will work properly when we transfer control to it. Next, boot2.bin is created from boot2.out by stripping its symbols and format information; boot2.bin is a raw binary. Now, note that a file boot2.ldr is created as a 512-byte file full of zeros. This space is reserved for the bsdlabel. Now that we have files boot1, boot2.bin and boot2.ldr, only the BTX server is missing before creating the all-in-one boot file. The BTX server is located in /usr/src/sys/boot/i386/btx/btx; it has its own Makefile with its own set of rules for building. The important thing to notice is that it is also compiled as a raw binary, and that it is linked to execute at address 0x9000. The details can be found in /usr/src/sys/boot/i386/btx/btx/Makefile. Having the files that comprise the boot program, the final step is to merge them. This is done by a special program called btxld (source located in /usr/src/usr.sbin/btxld). Some arguments to this program include the name of the output file (boot), its entry point (0x2000) and its file format (raw binary). The various files are finally merged by this utility into the file boot, which consists of boot1, boot2, the bsdlabel and the BTX server. This file, which takes exactly 16 sectors, or 8192 bytes, is what is actually written to the beginning of the &os; slice during installation. Let us now proceed to study the BTX server program. The BTX server prepares a simple environment and switches from 16-bit real mode to 32-bit protected mode, right before passing control to the client. This includes initializing and updating the following data structures: Modifies the Interrupt Vector Table (IVT). The IVT provides exception and interrupt handlers for Real-Mode code. The Interrupt Descriptor Table (IDT) is created. Entries are provided for processor exceptions, hardware interrupts, two system calls and the V86 interface. The IDT provides exception and interrupt handlers for Protected-Mode code. A Task-State Segment (TSS) is created. This is necessary because the processor works in the least privileged level when executing the client (boot2), but in the most privileged level when executing the BTX server. The GDT (Global Descriptor Table) is set up. Entries (descriptors) are provided for supervisor code and data, user code and data, and real-mode code and data.
Real-mode code and data are necessary when switching back to real mode from protected mode, as suggested by the Intel manuals. Let us now start studying the actual implementation. Recall that boot1 made a jump to address 0x9010, the BTX server's entry point. Before studying program execution there, note that the BTX server has a special header at address range 0x9000-0x900f, right before its entry point. This header is defined as follows:
<filename>sys/boot/i386/btx/btx/btx.S</filename> start: # Start of code /* * BTX header. */ btx_hdr: .byte 0xeb # Machine ID .byte 0xe # Header size .ascii "BTX" # Magic .byte 0x1 # Major version .byte 0x2 # Minor version .byte BTX_FLAGS # Flags .word PAG_CNT-MEM_ORG>>0xc # Paging control .word break-start # Text size .long 0x0 # Entry address
Note the first two bytes are 0xeb and 0xe. In the IA-32 architecture, these two bytes are interpreted as a relative jump past the header into the entry point, so in theory, boot1 could jump here (address 0x9000) instead of address 0x9010. Note that the last field in the BTX header is a pointer to the client's (boot2) entry point. This field is patched at link time. Immediately following the header is the BTX server's entry point:
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Initialization routine. */ init: cli # Disable interrupts xor %ax,%ax # Zero/segment mov %ax,%ss # Set up mov $0x1800,%sp # stack mov %ax,%es # Address mov %ax,%ds # data pushl $0x2 # Clear popfl # flags
This code disables interrupts, sets up a working stack (starting at address 0x1800) and clears the flags in the EFLAGS register. Note that the popfl instruction pops out a doubleword (4 bytes) from the stack and places it in the EFLAGS register. As the value actually popped is 2, the EFLAGS register is effectively cleared (IA-32 requires that bit 1 of the EFLAGS register always be 1). Our next code block clears (sets to 0) the memory range 0x5e00-0x8fff. This range is where the various data structures will be created:
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Initialize memory. */ mov $0x5e00,%di # Memory to initialize mov $(0x9000-0x5e00)/2,%cx # Words to zero rep # Zero-fill stosw # memory
Recall that boot1 was originally loaded to address 0x7c00, so, with this memory initialization, that copy effectively disappeared. However, also recall that boot1 was relocated to 0x700, so that copy is still in memory, and the BTX server will make use of it. Next, the real-mode IVT (Interrupt Vector Table) is updated. The IVT is an array of segment/offset pairs for exception and interrupt handlers. The BIOS normally maps hardware interrupts to interrupt vectors 0x8 to 0xf and 0x70 to 0x77 but, as will be seen, the 8259A Programmable Interrupt Controller, the chip controlling the actual mapping of hardware interrupts to interrupt vectors, is programmed to remap these interrupt vectors from 0x8-0xf to 0x20-0x27 and from 0x70-0x77 to 0x28-0x2f. Thus, interrupt handlers are provided for interrupt vectors 0x20-0x2f. The reason the BIOS-provided handlers are not used directly is because they work in 16-bit real mode, but not 32-bit protected mode. Processor mode will be switched to 32-bit protected mode shortly. However, the BTX server sets up a mechanism to effectively use the handlers provided by the BIOS:
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Update real mode IDT for reflecting hardware interrupts. */ mov $intr20,%bx # Address first handler mov $0x10,%cx # Number of handlers mov $0x20*4,%di # First real mode IDT entry init.0: mov %bx,(%di) # Store IP inc %di # Address next inc %di # entry stosw # Store CS add $4,%bx # Next handler loop init.0 # Next IRQ
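In C terms, the loop does roughly the following. This is a sketch only: the real-mode IVT lives at linear address 0, and each entry is an offset:segment pair of 16-bit words:

#include <stdint.h>

/* Point real-mode vectors 0x20-0x2f at BTX's own 4-byte stubs,
 * which all live in segment 0 starting at intr20. */
static void
redirect_hw_vectors(uint16_t intr20_offset)
{
        volatile uint16_t *ivt = (volatile uint16_t *)0;

        for (int vec = 0x20; vec <= 0x2f; vec++) {
                ivt[vec * 2] = intr20_offset;   /* handler IP */
                ivt[vec * 2 + 1] = 0;           /* handler CS */
                intr20_offset += 4;             /* next stub */
        }
}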
The next block creates the IDT (Interrupt Descriptor Table). The IDT is analogous, in protected mode, to the IVT in real mode. That is, the IDT describes the various exception and interrupt handlers used when the processor is executing in protected mode. In essence, it also consists of an array of segment/offset pairs, although the structure is somewhat more complex, because segments in protected mode are different than in real mode, and various protection mechanisms apply:
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Create IDT. */ mov $0x5e00,%di # IDT's address mov $idtctl,%si # Control string init.1: lodsb # Get entry cbw # count xchg %ax,%cx # as word jcxz init.4 # If done lodsb # Get segment xchg %ax,%dx # P:DPL:type lodsw # Get control xchg %ax,%bx # set lodsw # Get handler offset mov $SEL_SCODE,%dh # Segment selector init.2: shr %bx # Handle this int? jnc init.3 # No mov %ax,(%di) # Set handler offset mov %dh,0x2(%di) # and selector mov %dl,0x5(%di) # Set P:DPL:type add $0x4,%ax # Next handler init.3: lea 0x8(%di),%di # Next entry loop init.2 # Till set done jmp init.1 # Continue
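Each 8-byte entry being written by this loop has the standard IA-32 gate layout, sketched here in C with illustrative field names:

#include <stdint.h>

/* One protected-mode gate descriptor, as filled in by init.2. */
struct gate_descriptor {
        uint16_t offset_lo;     /* handler offset, bits 0..15 */
        uint16_t selector;      /* code segment selector (SEL_SCODE) */
        uint8_t  reserved;      /* unused for interrupt gates */
        uint8_t  p_dpl_type;    /* present bit, DPL and gate type */
        uint16_t offset_hi;     /* handler offset, bits 16..31 */
} __attribute__((packed));

The loop stores the handler offset at bytes 0-1, the selector at bytes 2-3 and the P:DPL:type byte at byte 5; these are exactly the stores made through (%di), 0x2(%di) and 0x5(%di) above.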
Besides the segment/offset information, each 8-byte entry also describes the segment type, the privilege level, and whether the segment is present in memory or not. The construction is such that interrupt vectors from 0 to 0xf (exceptions) are handled by function intx00; vector 0x10 (also an exception) is handled by intx10; hardware interrupts, which are later configured to start at interrupt vector 0x20 all the way to interrupt vector 0x2f, are handled by function intx20. Lastly, interrupt vector 0x30, which is used for system calls, is handled by intx30, and vectors 0x31 and 0x32 are handled by intx31. It must be noted that only the descriptors for interrupt vectors 0x30, 0x31 and 0x32 are given privilege level 3, the same privilege level as the boot2 client, which means the client can execute a software-generated interrupt to these vectors through the int instruction without failing (this is the way boot2 uses the services provided by the BTX server). Also, note that only software-generated interrupts are protected from code executing in lesser privilege levels. Hardware-generated interrupts and processor-generated exceptions are always handled adequately, regardless of the actual privileges involved. The next step is to initialize the TSS (Task-State Segment). The TSS is a hardware feature that helps the operating system or executive software implement multitasking functionality through process abstraction. The IA-32 architecture demands the creation and use of at least one TSS if multitasking facilities are used or different privilege levels are defined. Since the boot2 client is executed in privilege level 3, but the BTX server runs in privilege level 0, a TSS must be defined:
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Initialize TSS. */ init.4: movb $_ESP0H,TSS_ESP0+1(%di) # Set ESP0 movb $SEL_SDATA,TSS_SS0(%di) # Set SS0 movb $_TSSIO,TSS_MAP(%di) # Set I/O bit map base
Note that a value is given for the Privilege Level 0 stack pointer and stack segment in the TSS. This is needed because, if an interrupt or exception is received while executing boot2 in Privilege Level 3, a change to Privilege Level 0 is automatically performed by the processor, so a new working stack is needed. Finally, the I/O Map Base Address field of the TSS is given a value, which is a 16-bit offset from the beginning of the TSS to the I/O Permission Bitmap and the Interrupt Redirection Bitmap. After the IDT and TSS are created, the processor is ready to switch to protected mode. This is done in the next block:
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Bring up the system. */ mov $0x2820,%bx # Set protected mode callw setpic # IRQ offsets lidt idtdesc # Set IDT lgdt gdtdesc # Set GDT mov %cr0,%eax # Switch to protected inc %ax # mode mov %eax,%cr0 # ljmp $SEL_SCODE,$init.8 # To 32-bit code .code32 init.8: xorl %ecx,%ecx # Zero movb $SEL_SDATA,%cl # To 32-bit movw %cx,%ss # stack
First, a call is made to setpic to program the 8259A PIC (Programmable Interrupt Controller). This chip is connected to multiple hardware interrupt sources. Upon receiving an interrupt from a device, it signals the processor with the appropriate interrupt vector. This can be customized so that specific interrupts are associated with specific interrupt vectors, as explained before. Next, the IDTR (Interrupt Descriptor Table Register) and GDTR (Global Descriptor Table Register) are loaded with the instructions lidt and lgdt, respectively. These registers are loaded with the base address and limit of the IDT and GDT. The following three instructions set the Protection Enable (PE) bit of the %cr0 register. This effectively switches the processor to 32-bit protected mode. Next, a long jump is made to init.8 using segment selector SEL_SCODE, which selects the Supervisor Code Segment. The processor is effectively executing in CPL 0, the most privileged level, after this jump. Finally, the Supervisor Data Segment is selected for the stack by assigning the segment selector SEL_SDATA to the %ss register. This data segment also has a privilege level of 0. Our last code block is responsible for loading the TR (Task Register) with the segment selector for the TSS we created earlier, and setting the User Mode environment before passing execution control to the boot2 client.
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Launch user task. */ movb $SEL_TSS,%cl # Set task ltr %cx # register movl $0xa000,%edx # User base address movzwl %ss:BDA_MEM,%eax # Get free memory shll $0xa,%eax # To bytes subl $ARGSPACE,%eax # Less arg space subl %edx,%eax # Less base movb $SEL_UDATA,%cl # User data selector pushl %ecx # Set SS pushl %eax # Set ESP push $0x202 # Set flags (IF set) push $SEL_UCODE # Set CS pushl btx_hdr+0xc # Set EIP pushl %ecx # Set GS pushl %ecx # Set FS pushl %ecx # Set DS pushl %ecx # Set ES pushl %edx # Set EAX movb $0x7,%cl # Set remaining init.9: push $0x0 # general loop init.9 # registers popa # and initialize popl %es # Initialize popl %ds # user popl %fs # segment popl %gs # registers iret # To user mode
Note that the client's environment includes a stack segment selector and stack pointer (registers %ss and %esp). Indeed, once the TR is loaded with the TSS segment selector (instruction ltr), the stack pointer is calculated and pushed onto the stack along with the stack's segment selector. Next, the value 0x202 is pushed onto the stack; it is the value that the EFLAGS register will get when control is passed to the client. Also, the User Mode code segment selector and the client's entry point are pushed. Recall that this entry point is patched in the BTX header at link time. Finally, segment selectors (stored in register %ecx) for the segment registers %gs, %fs, %ds and %es are pushed onto the stack, along with the value in %edx (0xa000). Keep in mind the various values that have been pushed onto the stack (they will be popped out shortly). Next, values for the remaining general purpose registers are also pushed onto the stack (note the loop that pushes the value 0 seven times). Now the values start to be popped off the stack. First, the popa instruction pops the last eight values pushed; seven of them are stored in the general purpose registers, in the order %edi, %esi, %ebp, %ebx, %edx, %ecx, %eax (the slot corresponding to %esp is discarded). Then, the various segment selectors pushed are popped into the various segment registers. Five values still remain on the stack. They are popped when the iret instruction is executed. This instruction first pops the value that was pushed from the BTX header. This value is a pointer to boot2's entry point. It is placed in the register %eip, the instruction pointer register. Next, the segment selector for the User Code Segment is popped and copied to register %cs. Remember that this segment's privilege level is 3, the least privileged level. This means that we must provide values for the stack of this privilege level. This is why the processor, besides further popping the value for the EFLAGS register, does two more pops off the stack. These values go to the stack pointer (%esp) and the stack segment (%ss). Now, execution continues at boot2's entry point. It is important to note how the User Code Segment is defined. This segment's base address is set to 0xa000. This means that code memory addresses are relative to address 0xa000; if code being executed is fetched from address 0x2000, the actual memory addressed is 0xa000+0x2000=0xc000.
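For reference, the five values consumed by that final iret form the following frame on the supervisor stack. This is a sketch, lowest address first, in the order the processor pops them on a privilege-level change:

#include <stdint.h>

/* Stack frame consumed by iret when returning to a lower privilege. */
struct iret_frame {
        uint32_t eip;       /* client entry point, from the BTX header */
        uint32_t cs;        /* SEL_UCODE: user code, privilege level 3 */
        uint32_t eflags;    /* 0x202: interrupts enabled */
        uint32_t esp;       /* initial user-mode stack pointer */
        uint32_t ss;        /* SEL_UDATA: user data, privilege level 3 */
};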
<application>boot2</application> Stage boot2 defines an important structure, struct bootinfo. This structure is initialized by boot2 and passed to the loader, and then further to the kernel. Some fields of this structure are set by boot2, the rest by the loader. This structure, among other information, contains the kernel filename, the BIOS hard disk geometry, the BIOS drive number for the boot device, the physical memory available, the envp pointer, etc. The definition for it is: /usr/include/machine/bootinfo.h: struct bootinfo { u_int32_t bi_version; u_int32_t bi_kernelname; /* represents a char * */ u_int32_t bi_nfs_diskless; /* struct nfs_diskless * */ /* End of fields that are always present. */ #define bi_endcommon bi_n_bios_used u_int32_t bi_n_bios_used; u_int32_t bi_bios_geom[N_BIOS_GEOM]; u_int32_t bi_size; u_int8_t bi_memsizes_valid; u_int8_t bi_bios_dev; /* bootdev BIOS unit number */ u_int8_t bi_pad[2]; u_int32_t bi_basemem; u_int32_t bi_extmem; u_int32_t bi_symtab; /* struct symtab * */ u_int32_t bi_esymtab; /* struct symtab * */ /* Items below only from advanced bootloader */ u_int32_t bi_kernend; /* end of kernel space */ u_int32_t bi_envp; /* environment */ u_int32_t bi_modulep; /* preloaded modules */ }; boot2 enters an infinite loop waiting for user input, then calls load(). If the user does not press anything, the loop breaks by a timeout, so load() will load the default file (/boot/loader). Functions ino_t lookup(char *filename) and int xfsread(ino_t inode, void *buf, size_t nbyte) are used to read the content of a file into memory. /boot/loader is an ELF binary, but its ELF header is prepended with a.out's struct exec structure. load() scans the loader's ELF header, loading the content of /boot/loader into memory, and passing the execution to the loader's entry: sys/boot/i386/boot2/boot2.c: __exec((caddr_t)addr, RB_BOOTINFO | (opts & RBX_MASK), MAKEBOOTDEV(dev_maj[dsk.type], 0, dsk.slice, dsk.unit, dsk.part), 0, 0, 0, VTOP(&bootinfo)); <application>loader</application> Stage loader is a BTX client as well. I will not describe it here in detail; there is a comprehensive man page written by Mike Smith, &man.loader.8;. The underlying mechanisms and BTX were discussed above. The main task for the loader is to boot the kernel. When the kernel is loaded into memory, it is called by the loader: sys/boot/common/boot.c: /* Call the exec handler from the loader matching the kernel */ module_formats[km->m_loader]->l_exec(km); Kernel Initialization Let us take a look at the command that links the kernel. This will help identify the exact location where the loader passes execution to the kernel. This location is the kernel's actual entry point. sys/conf/Makefile.i386: ld -elf -Bdynamic -T /usr/src/sys/conf/ldscript.i386 -export-dynamic \ -dynamic-linker /red/herring -o kernel -X locore.o \ <lots of kernel .o files> A few interesting things can be seen here. First, the kernel is an ELF dynamically linked binary, but the dynamic linker for the kernel is /red/herring, which is definitely a bogus file. Second, taking a look at the file sys/conf/ldscript.i386 gives an idea about what ld options are used when compiling a kernel. Reading through the first few lines, the string sys/conf/ldscript.i386: ENTRY(btext) says that the kernel's entry point is the symbol btext. This symbol is defined in locore.s: sys/i386/i386/locore.s: .text /********************************************************************** * * This is where the bootblocks start us, set the ball rolling...
* */ NON_GPROF_ENTRY(btext) First, the register EFLAGS is set to a predefined value of 0x00000002. Then all the segment registers are initialized: sys/i386/i386/locore.s: /* Don't trust what the BIOS gives for eflags. */ pushl $PSL_KERNEL popfl /* * Don't trust what the BIOS gives for %fs and %gs. Trust the bootstrap * to set %cs, %ds, %es and %ss. */ mov %ds, %ax mov %ax, %fs mov %ax, %gs btext calls the routines recover_bootinfo(), identify_cpu(), create_pagetables(), which are also defined in locore.s. Here is a description of what they do: recover_bootinfo This routine parses the parameters to the kernel passed from the bootstrap. The kernel may have been booted in 3 ways: by the loader, described above, by the old disk boot blocks, or by the old diskless boot procedure. This function determines the booting method, and stores the struct bootinfo structure into the kernel memory. identify_cpu This function tries to find out what CPU it is running on, storing the value found in the variable _cpu. create_pagetables This function allocates and fills out a Page Table Directory at the top of the kernel memory area. The next steps are enabling VME, if the CPU supports it: testl $CPUID_VME, R(_cpu_feature) jz 1f movl %cr4, %eax orl $CR4_VME, %eax movl %eax, %cr4 Then, enabling paging: /* Now enable paging */ movl R(_IdlePTD), %eax movl %eax,%cr3 /* load ptd addr into mmu */ movl %cr0,%eax /* get control word */ orl $CR0_PE|CR0_PG,%eax /* enable paging */ movl %eax,%cr0 /* and let's page NOW! */ The next three lines of code are needed because paging is now enabled, so a jump is required to continue execution in the virtualized address space: pushl $begin /* jump to high virtualized address */ ret /* now running relocated at KERNBASE where the system is linked to run */ begin: The function init386() is called with a pointer to the first free physical page, and after that mi_startup(). init386 is an architecture dependent initialization function, and mi_startup() is an architecture independent one (the 'mi_' prefix stands for Machine Independent). The kernel never returns from mi_startup(), and by calling it, the kernel finishes booting: sys/i386/i386/locore.s: movl physfree, %esi pushl %esi /* value of first for init386(first) */ call _init386 /* wire 386 chip for unix operation */ call _mi_startup /* autoconfiguration, mountroot etc */ hlt /* never returns to here */ <function>init386()</function> init386() is defined in sys/i386/i386/machdep.c and performs low-level initialization specific to the i386 chip. The switch to protected mode was performed by the loader. The loader has created the very first task, in which the kernel continues to operate. Before looking at the code, consider the tasks the processor must complete to initialize protected mode execution: Initialize the kernel tunable parameters, passed from the bootstrapping program. Prepare the GDT. Prepare the IDT. Initialize the system console. Initialize the DDB, if it is compiled into the kernel. Initialize the TSS. Prepare the LDT. Set up proc0's pcb. init386() initializes the tunable parameters passed from bootstrap by setting the environment pointer (envp) and calling init_param1(). The envp pointer has been passed from the loader in the bootinfo structure: sys/i386/i386/machdep.c: kern_envp = (caddr_t)bootinfo.bi_envp + KERNBASE; /* Init basic tunables, hz etc */ init_param1(); init_param1() is defined in sys/kern/subr_param.c.
<function>init386()</function> init386() is defined in sys/i386/i386/machdep.c and performs low-level initialization specific to the i386 chip. The switch to protected mode was performed by the loader. The loader has created the very first task, in which the kernel continues to operate. Before looking at the code, consider the tasks the processor must complete to initialize protected mode execution: Initialize the kernel tunable parameters, passed from the bootstrapping program. Prepare the GDT. Prepare the IDT. Initialize the system console. Initialize the DDB, if it is compiled into the kernel. Initialize the TSS. Prepare the LDT. Set up proc0's pcb. parameters init386() initializes the tunable parameters passed from bootstrap by setting the environment pointer (envp) and calling init_param1(). The envp pointer has been passed from the loader in the bootinfo structure: sys/i386/i386/machdep.c: kern_envp = (caddr_t)bootinfo.bi_envp + KERNBASE; /* Init basic tunables, hz etc */ init_param1(); init_param1() is defined in sys/kern/subr_param.c. That file has a number of sysctls, and two functions, init_param1() and init_param2(), that are called from init386(): sys/kern/subr_param.c: hz = HZ; TUNABLE_INT_FETCH("kern.hz", &hz); TUNABLE_<typename>_FETCH is used to fetch the value from the environment: /usr/src/sys/sys/kernel.h: #define TUNABLE_INT_FETCH(path, var) getenv_int((path), (var)) Sysctl kern.hz is the system clock tick. Additionally, these sysctls are set by init_param1(): kern.maxswzone, kern.maxbcache, kern.maxtsiz, kern.dfldsiz, kern.maxdsiz, kern.dflssiz, kern.maxssiz, kern.sgrowsiz. Global Descriptor Table (GDT) Then init386() prepares the Global Descriptor Table (GDT). Every task on an x86 is running in its own virtual address space, and this space is addressed by a segment:offset pair. If, for instance, the current instruction to be executed by the processor lies at CS:EIP, then the linear virtual address for that instruction is the virtual address of the code segment CS plus EIP. For convenience, segments begin at virtual address 0 and end at a 4 GB boundary. Therefore, the instruction's linear virtual address for this example would just be the value of EIP. Segment registers such as CS and DS are the selectors, i.e., indexes, into the GDT (to be more precise, an index is not a selector itself, but the INDEX field of a selector). FreeBSD's GDT holds descriptors for 15 selectors per CPU: sys/i386/i386/machdep.c: union descriptor gdt[NGDT * MAXCPU]; /* global descriptor table */ sys/i386/include/segments.h: /* * Entries in the Global Descriptor Table (GDT) */ #define GNULL_SEL 0 /* Null Descriptor */ #define GCODE_SEL 1 /* Kernel Code Descriptor */ #define GDATA_SEL 2 /* Kernel Data Descriptor */ #define GPRIV_SEL 3 /* SMP Per-Processor Private Data */ #define GPROC0_SEL 4 /* Task state process slot zero and up */ #define GLDT_SEL 5 /* LDT - eventually one per process */ #define GUSERLDT_SEL 6 /* User LDT */ #define GTGATE_SEL 7 /* Process task switch gate */ #define GBIOSLOWMEM_SEL 8 /* BIOS low memory access (must be entry 8) */ #define GPANIC_SEL 9 /* Task state to consider panic from */ #define GBIOSCODE32_SEL 10 /* BIOS interface (32bit Code) */ #define GBIOSCODE16_SEL 11 /* BIOS interface (16bit Code) */ #define GBIOSDATA_SEL 12 /* BIOS interface (Data) */ #define GBIOSUTIL_SEL 13 /* BIOS interface (Utility) */ #define GBIOSARGS_SEL 14 /* BIOS interface (Arguments) */ Note that those #defines are not selectors themselves, but just the INDEX field of a selector, so they are exactly the indices of the GDT. For example, an actual selector for the kernel code (GCODE_SEL) has the value 0x08.
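Since the INDEX field occupies the upper bits of a selector, the numeric value of a selector is easy to compute from the table index. The following standalone sketch mimics the GSEL() macro from sys/i386/include/segments.h (the defines are re-created here for illustration):

#include <stdio.h>

/* A selector is the GDT index shifted left by 3, with a table-indicator
 * bit (0 = GDT) and a requested privilege level in the low two bits.
 * These defines mirror sys/i386/include/segments.h. */
#define SEL_KPL 0 /* kernel privilege level */
#define SEL_UPL 3 /* user privilege level */
#define GSEL(s, r) (((s) << 3) | (r))

#define GCODE_SEL 1 /* Kernel Code Descriptor */
#define GDATA_SEL 2 /* Kernel Data Descriptor */

int
main(void)
{
    printf("kernel code selector: 0x%02x\n", GSEL(GCODE_SEL, SEL_KPL)); /* 0x08 */
    printf("kernel data selector: 0x%02x\n", GSEL(GDATA_SEL, SEL_KPL)); /* 0x10 */
    return 0;
}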
Interrupt Descriptor Table (IDT) The next step is to initialize the Interrupt Descriptor Table (IDT). This table is referenced by the processor when a software or hardware interrupt occurs. For example, to make a system call, a user application issues the INT 0x80 instruction. This is a software interrupt, so the processor's hardware looks up a record with index 0x80 in the IDT. This record points to the routine that handles this interrupt; in this particular case, it is the kernel's syscall gate. The IDT may have a maximum of 256 (0x100) records. The kernel allocates NIDT records for the IDT, where NIDT is the maximum (256): sys/i386/i386/machdep.c: static struct gate_descriptor idt0[NIDT]; struct gate_descriptor *idt = &idt0[0]; /* interrupt descriptor table */ For each interrupt, an appropriate handler is set. The syscall gate for INT 0x80 is set as well: sys/i386/i386/machdep.c: setidt(0x80, &IDTVEC(int0x80_syscall), SDT_SYS386TGT, SEL_UPL, GSEL(GCODE_SEL, SEL_KPL)); So when a userland application issues the INT 0x80 instruction, control will transfer to the function _Xint0x80_syscall, which is in the kernel code segment and will be executed with supervisor privileges. The console and DDB are then initialized: DDB sys/i386/i386/machdep.c: cninit(); /* skipped */ #ifdef DDB kdb_init(); if (boothowto & RB_KDB) Debugger("Boot flags requested debugger"); #endif The Task State Segment (TSS) is another x86 protected mode structure; it is used by the hardware to store task information when a task switch occurs. The Local Descriptor Table (LDT) is used to reference userland code and data. Several selectors are defined to point into the LDT: the system call gates and the user code and data selectors: /usr/include/machine/segments.h: #define LSYS5CALLS_SEL 0 /* forced by intel BCS */ #define LSYS5SIGR_SEL 1 #define L43BSDCALLS_SEL 2 /* notyet */ #define LUCODE_SEL 3 #define LSOL26CALLS_SEL 4 /* Solaris >= 2.6 system call gate */ #define LUDATA_SEL 5 /* separate stack, es,fs,gs sels ? */ /* #define LPOSIXCALLS_SEL 5*/ /* notyet */ #define LBSDICALLS_SEL 16 /* BSDI system call gate */ #define NLDT (LBSDICALLS_SEL + 1) Next, proc0's Process Control Block (struct pcb) structure is initialized. proc0 is a struct proc structure that describes a kernel process. It is always present while the kernel is running, therefore it is declared as global: sys/kern/kern_init.c: struct proc proc0; The structure struct pcb is a part of a proc structure. It is defined in /usr/include/machine/pcb.h and contains a process's information specific to the i386 architecture, such as register values. <function>mi_startup()</function> This function performs a bubble sort of all the system initialization objects and then calls the entry of each object one by one: sys/kern/init_main.c: for (sipp = sysinit; *sipp; sipp++) { /* ... skipped ... */ /* Call function */ (*((*sipp)->func))((*sipp)->udata); /* ... skipped ... */ } Although the sysinit framework is described in the Developers' Handbook, I will discuss its internals here. sysinit objects Every system initialization object (sysinit object) is created by calling a SYSINIT() macro. Let us take the announce sysinit object as an example. This object prints the copyright message: sys/kern/init_main.c: static void print_caddr_t(void *data __unused) { printf("%s", (char *)data); } SYSINIT(announce, SI_SUB_COPYRIGHT, SI_ORDER_FIRST, print_caddr_t, copyright) The subsystem ID for this object is SI_SUB_COPYRIGHT (0x0800001), which comes right after SI_SUB_CONSOLE (0x0800000). So, the copyright message will be printed out first, just after the console initialization.
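A driver or kernel module can hook into the same mechanism. As a hypothetical example (the subsystem and order values are just a plausible choice, not taken from any real driver), a hook that prints a message during boot could be declared like this:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/systm.h>

/* hypothetical sysinit hook; runs when the SI_SUB_KLD subsystem
 * level is reached during mi_startup() */
static void
hello_init(void *data __unused)
{
    printf("hello from a sysinit hook\n");
}
SYSINIT(hello, SI_SUB_KLD, SI_ORDER_ANY, hello_init, NULL)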
Let us take a look at what exactly the macro SYSINIT() does. It expands to a C_SYSINIT() macro. The C_SYSINIT() macro then expands to a static struct sysinit structure declaration with another DATA_SET macro call: /usr/include/sys/kernel.h: #define C_SYSINIT(uniquifier, subsystem, order, func, ident) \ static struct sysinit uniquifier ## _sys_init = { \ subsystem, \ order, \ func, \ ident \ }; \ DATA_SET(sysinit_set,uniquifier ## _sys_init); #define SYSINIT(uniquifier, subsystem, order, func, ident) \ C_SYSINIT(uniquifier, subsystem, order, \ (sysinit_cfunc_t)(sysinit_nfunc_t)func, (void *)ident) The DATA_SET() macro expands to a MAKE_SET(), and that macro is the point where all the sysinit magic is hidden: /usr/include/linker_set.h: #define MAKE_SET(set, sym) \ static void const * const __set_##set##_sym_##sym = &sym; \ __asm(".section .set." #set ",\"aw\""); \ __asm(".long " #sym); \ __asm(".previous") #endif #define TEXT_SET(set, sym) MAKE_SET(set, sym) #define DATA_SET(set, sym) MAKE_SET(set, sym) In our case, the following declaration will occur: static struct sysinit announce_sys_init = { SI_SUB_COPYRIGHT, SI_ORDER_FIRST, (sysinit_cfunc_t)(sysinit_nfunc_t) print_caddr_t, (void *) copyright }; static void const *const __set_sysinit_set_sym_announce_sys_init = &announce_sys_init; __asm(".section .set.sysinit_set" ",\"aw\""); __asm(".long " "announce_sys_init"); __asm(".previous"); The first __asm instruction will create an ELF section within the kernel's executable. This will happen at kernel link time. The section will have the name .set.sysinit_set. The content of this section is one 32-bit value, the address of the announce_sys_init structure, and that is what the second __asm emits. The third __asm instruction marks the end of the section. If a directive with the same section name occurred before, the content, i.e., the 32-bit value, will be appended to the existing section, thus forming an array of 32-bit pointers. Running objdump on a kernel binary, you may notice the presence of such small sections: &prompt.user; objdump -h /kernel 7 .set.cons_set 00000014 c03164c0 c03164c0 002154c0 2**2 CONTENTS, ALLOC, LOAD, DATA 8 .set.kbddriver_set 00000010 c03164d4 c03164d4 002154d4 2**2 CONTENTS, ALLOC, LOAD, DATA 9 .set.scrndr_set 00000024 c03164e4 c03164e4 002154e4 2**2 CONTENTS, ALLOC, LOAD, DATA 10 .set.scterm_set 0000000c c0316508 c0316508 00215508 2**2 CONTENTS, ALLOC, LOAD, DATA 11 .set.sysctl_set 0000097c c0316514 c0316514 00215514 2**2 CONTENTS, ALLOC, LOAD, DATA 12 .set.sysinit_set 00000664 c0316e90 c0316e90 00215e90 2**2 CONTENTS, ALLOC, LOAD, DATA This screen dump shows that the size of the .set.sysinit_set section is 0x664 bytes, so 0x664/sizeof(void *) sysinit objects are compiled into the kernel. The other sections such as .set.sysctl_set represent other linker sets. By defining a variable of type struct linker_set the content of the .set.sysinit_set section will be collected into that variable: sys/kern/init_main.c: extern struct linker_set sysinit_set; /* XXX */ The struct linker_set is defined as follows: /usr/include/linker_set.h: struct linker_set { int ls_length; void *ls_items[1]; /* really ls_length of them, trailing NULL */ }; The first field holds the number of sysinit objects, and the second is a NULL-terminated array of pointers to them. Returning to the mi_startup() discussion, it should now be clear how the sysinit objects are organized. The mi_startup() function sorts them and calls each one.
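For illustration, the sorting step could look like the following simplified sketch. This is not the actual mi_startup() code; the structure is reduced to the fields that matter for ordering:

/* simplified stand-in for the kernel's struct sysinit */
struct sysinit {
    unsigned int subsystem; /* e.g., SI_SUB_CONSOLE */
    unsigned int order;     /* e.g., SI_ORDER_FIRST */
    void (*func)(void *);
    void *udata;
};

/* bubble sort the set by subsystem, then by order within a subsystem */
static void
sort_sysinits(struct sysinit **set, int n)
{
    int i, j;
    struct sysinit *tmp;

    for (i = 0; i < n - 1; i++)
        for (j = 0; j < n - 1 - i; j++)
            if (set[j]->subsystem > set[j + 1]->subsystem ||
                (set[j]->subsystem == set[j + 1]->subsystem &&
                set[j]->order > set[j + 1]->order)) {
                tmp = set[j];
                set[j] = set[j + 1];
                set[j + 1] = tmp;
            }
}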
The very last object is the system scheduler: /usr/include/sys/kernel.h: enum sysinit_sub_id { SI_SUB_DUMMY = 0x0000000, /* not executed; for linker*/ SI_SUB_DONE = 0x0000001, /* processed*/ SI_SUB_CONSOLE = 0x0800000, /* console*/ SI_SUB_COPYRIGHT = 0x0800001, /* first use of console*/ ... SI_SUB_RUN_SCHEDULER = 0xfffffff /* scheduler: no return*/ }; The system scheduler sysinit object is defined in the file sys/vm/vm_glue.c, and the entry point for that object is scheduler(). That function is actually an infinite loop, and it represents a process with PID 0, the swapper process. The proc0 structure, mentioned before, is used to describe it. The first user process, called init, is created by the sysinit object init: sys/kern/init_main.c: static void create_init(const void *udata __unused) { int error; int s; s = splhigh(); error = fork1(&proc0, RFFDG | RFPROC, &initproc); if (error) panic("cannot fork init: %d\n", error); initproc->p_flag |= P_INMEM | P_SYSTEM; cpu_set_fork_handler(initproc, start_init, NULL); remrunqueue(initproc); splx(s); } SYSINIT(init, SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL) create_init() allocates a new process by calling fork1(), but does not mark it runnable. When this new process is scheduled for execution by the scheduler, start_init() will be called. That function is defined in init_main.c. It tries to load and exec the init binary, probing /sbin/init first, then /sbin/oinit, /sbin/init.bak, and finally /stand/sysinstall: sys/kern/init_main.c: static char init_path[MAXPATHLEN] = #ifdef INIT_PATH __XSTRING(INIT_PATH); #else "/sbin/init:/sbin/oinit:/sbin/init.bak:/stand/sysinstall"; #endif
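The probing over the colon-separated list can be pictured with a small userland analogue (the real start_init() uses the in-kernel exec machinery rather than execv(), so this is only an illustration of the logic):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char init_path[] = "/sbin/init:/sbin/oinit:/sbin/init.bak:/stand/sysinstall";
    char *p, *last;

    /* try each candidate in order; execv() returns only on failure */
    for (p = strtok_r(init_path, ":", &last); p != NULL;
        p = strtok_r(NULL, ":", &last)) {
        char *argv[] = { p, NULL };
        execv(p, argv);
        fprintf(stderr, "cannot exec %s, trying next\n", p);
    }
    return 1; /* no init could be started */
}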
diff --git a/en_US.ISO8859-1/books/arch-handbook/scsi/chapter.xml b/en_US.ISO8859-1/books/arch-handbook/scsi/chapter.xml index 7de627b5b9..dfde154052 100644 --- a/en_US.ISO8859-1/books/arch-handbook/scsi/chapter.xml +++ b/en_US.ISO8859-1/books/arch-handbook/scsi/chapter.xml @@ -1,2239 +1,2239 @@ Common Access Method SCSI Controllers Sergey Babkin Written by Murray Stokely Modifications for Handbook made by Synopsis SCSI This document assumes that the reader has a general understanding of device drivers in FreeBSD and of the SCSI protocol. Much of the information in this document was extracted from the drivers: ncr (/sys/pci/ncr.c) by Wolfgang Stanglmeier and Stefan Esser sym (/sys/dev/sym/sym_hipd.c) by Gerard Roudier aic7xxx (/sys/dev/aic7xxx/aic7xxx.c) by Justin T. Gibbs and from the CAM code itself (by Justin T. Gibbs, see /sys/cam/*). When some solution looked the most logical and was essentially verbatim extracted from the code by Justin T. Gibbs, I marked it as recommended. The document is illustrated with examples in pseudo-code. Although sometimes the examples have many details and look like real code, it is still pseudo-code. It was written to demonstrate the concepts in an understandable way. For a real driver other approaches may be more modular and efficient. It also abstracts away the hardware details, as well as issues that would cloud the demonstrated concepts or that are supposed to be described in the other chapters of the Developers' Handbook. Such details are commonly shown as calls to functions with descriptive names, comments or pseudo-statements. Fortunately real life full-size examples with all the details can be found in the real drivers. General Architecture Common Access Method (CAM) CAM stands for Common Access Method. It is a generic way to address the I/O buses in a SCSI-like way. This allows a separation of the generic device drivers from the drivers controlling the I/O bus: for example the disk driver becomes able to control disks on SCSI, IDE, or any other bus, so the disk driver portion does not have to be rewritten (or copied and modified) for every new I/O bus. Thus the two most important active entities are: CD-ROM tape IDE Peripheral Modules - drivers for peripheral devices (disk, tape, CD-ROM, etc.) SCSI Interface Modules (SIM) - Host Bus Adapter drivers for connecting to an I/O bus such as SCSI or IDE. A peripheral driver receives requests from the OS, converts them to a sequence of SCSI commands and passes these SCSI commands to a SCSI Interface Module. The SCSI Interface Module is responsible for passing these commands to the actual hardware (or if the actual hardware is not SCSI but, for example, IDE then also converting the SCSI commands to the native commands of the hardware). As we are interested in writing a SCSI adapter driver here, from this point on we will consider everything from the SIM standpoint. A typical SIM driver needs to include the following CAM-related header files: #include <cam/cam.h> #include <cam/cam_ccb.h> #include <cam/cam_sim.h> #include <cam/cam_xpt_sim.h> #include <cam/cam_debug.h> #include <cam/scsi/scsi_all.h> The first thing each SIM driver must do is register itself with the CAM subsystem. This is done during the driver's xxx_attach() function (here and further xxx_ is used to denote the unique driver name prefix). The xxx_attach() function itself is called by the system bus auto-configuration code which we do not describe here.
This is achieved in multiple steps: first it is necessary to allocate the queue of requests associated with this SIM: struct cam_devq *devq; if(( devq = cam_simq_alloc(SIZE) )==NULL) { error; /* some code to handle the error */ } Here SIZE is the size of the queue to be allocated, the maximal number of requests it can contain. It is the number of requests that the SIM driver can handle in parallel on one SCSI card. Commonly it can be calculated as: SIZE = NUMBER_OF_SUPPORTED_TARGETS * MAX_SIMULTANEOUS_COMMANDS_PER_TARGET Next we create a descriptor of our SIM: struct cam_sim *sim; if(( sim = cam_sim_alloc(action_func, poll_func, driver_name, softc, unit, mtx, max_dev_transactions, max_tagged_dev_transactions, devq) )==NULL) { cam_simq_free(devq); error; /* some code to handle the error */ } Note that if we are not able to create a SIM descriptor we free the devq also, because we can do nothing else with it and we want to conserve memory. If a SCSI card has multiple SCSI buses on it then each bus requires its own cam_sim structure. An interesting question is what to do if a SCSI card has more than one SCSI bus: do we need one devq structure per card or per SCSI bus? The answer given in the comments to the CAM code is: either way, as the driver's author prefers. The arguments are: action_func - pointer to the driver's xxx_action function. static void xxx_action struct cam_sim *sim, union ccb *ccb poll_func - pointer to the driver's xxx_poll() static void xxx_poll struct cam_sim *sim driver_name - the name of the actual driver, such as ncr or wds. softc - pointer to the driver's internal descriptor for this SCSI card. This pointer will be used by the driver in future to get private data. unit - the controller unit number, for example for controller mps0 this number will be 0 mtx - Lock associated with this SIM. For SIMs that don't know about locking, pass in Giant. For SIMs that do, pass in the lock used to guard this SIM's data structures. This lock will be held when xxx_action and xxx_poll are called. max_dev_transactions - maximal number of simultaneous transactions per SCSI target in the non-tagged mode. This value will be almost universally equal to 1, with possible exceptions only for the non-SCSI cards. Also, drivers that hope to gain an advantage by preparing one transaction while another is executed may set it to 2, but this does not seem to be worth the complexity. max_tagged_dev_transactions - the same thing, but in the tagged mode. Tags are the SCSI way to initiate multiple transactions on a device: each transaction is assigned a unique tag and the transaction is sent to the device. When the device completes some transaction it sends back the result together with the tag so that the SCSI adapter (and the driver) can tell which transaction was completed. This argument is also known as the maximal tag depth. It depends on the abilities of the SCSI adapter. Finally we register the SCSI buses associated with our SCSI adapter: if(xpt_bus_register(sim, softc, bus_number) != CAM_SUCCESS) { cam_sim_free(sim, /*free_devq*/ TRUE); error; /* some code to handle the error */ } If there is one devq structure per SCSI bus (i.e., we consider a card with multiple buses as multiple cards with one bus each) then the bus number will always be 0, otherwise each bus on the SCSI card should get a distinct number. Each bus needs its own separate structure cam_sim. After that our controller is completely hooked to the CAM system.
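Putting the steps together, the CAM-related part of a hypothetical xxx_attach() could be shaped as in the following sketch. The softc layout, the transaction limits and the error values are assumptions of this example, not requirements:

static int
xxx_cam_attach(struct xxx_softc *softc, int unit)
{
    struct cam_devq *devq;
    struct cam_sim *sim;

    /* the request queue, sized as discussed above */
    if ((devq = cam_simq_alloc(SIZE)) == NULL)
        return (ENOMEM);

    /* one cam_sim per SCSI bus; the limits 1/8 are illustrative */
    sim = cam_sim_alloc(xxx_action, xxx_poll, "xxx", softc, unit,
        &softc->mtx, /*max_dev_transactions*/ 1,
        /*max_tagged_dev_transactions*/ 8, devq);
    if (sim == NULL) {
        cam_simq_free(devq);
        return (ENOMEM);
    }

    if (xpt_bus_register(sim, softc, /*bus*/ 0) != CAM_SUCCESS) {
        cam_sim_free(sim, /*free_devq*/ TRUE);
        return (ENXIO);
    }
    softc->sim = sim;
    return (0);
}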
The value of devq can be discarded now: sim will be passed as an argument in all further calls from CAM and devq can be derived from it. Various asynchronous events may occur during operation, and CAM provides a framework for them. Some events originate from the lower levels (the SIM drivers), some events originate from the peripheral drivers, some events originate from the CAM subsystem itself. Any driver can register callbacks for some types of asynchronous events, so that it will be notified when these events occur. A typical example of such an event is a device reset. Each transaction and event identifies the devices to which it applies by means of a path. The target-specific events normally occur during a transaction with this device. So the path from that transaction may be re-used to report this event (this is safe because the event path is copied in the event reporting routine but is neither deallocated nor passed anywhere further). Also it is safe to allocate paths dynamically at any time, including from interrupt routines, although that incurs certain overhead; a possible problem with this approach is that there may be no free memory at that time. For a bus reset event we need to define a wildcard path including all devices on the bus. So we can create the path for future bus reset events in advance and avoid problems with a possible future memory shortage: struct cam_path *path; if(xpt_create_path(&path, /*periph*/NULL, cam_sim_path(sim), CAM_TARGET_WILDCARD, CAM_LUN_WILDCARD) != CAM_REQ_CMP) { xpt_bus_deregister(cam_sim_path(sim)); cam_sim_free(sim, /*free_devq*/TRUE); error; /* some code to handle the error */ } softc->wpath = path; softc->sim = sim; As you can see the path includes: ID of the peripheral driver (NULL here because we have none) ID of the SIM driver (cam_sim_path(sim)) SCSI target number of the device (CAM_TARGET_WILDCARD means all devices) SCSI LUN number of the subdevice (CAM_LUN_WILDCARD means all LUNs) If the driver cannot allocate this path it will not be able to work normally, so in that case we dismantle that SCSI bus. And we save the path pointer in the softc structure for future use. After that we save the value of sim (or we can also discard it on exit from xxx_attach() if we wish). That is all for a minimalistic initialization. To do things right there is one more issue left. For a SIM driver there is one particularly interesting event: when a target device is considered lost. In this case resetting the SCSI negotiations with this device may be a good idea. So we register a callback for this event with CAM. The request is passed to CAM by requesting CAM action on a CAM control block for this type of request: struct ccb_setasync csa; xpt_setup_ccb(&csa.ccb_h, path, /*priority*/5); csa.ccb_h.func_code = XPT_SASYNC_CB; csa.event_enable = AC_LOST_DEVICE; csa.callback = xxx_async; csa.callback_arg = sim; xpt_action((union ccb *)&csa); Now we take a look at the xxx_action() and xxx_poll() driver entry points. static void xxx_action struct cam_sim *sim, union ccb *ccb Do some action on request of the CAM subsystem. Sim describes the SIM for the request, CCB is the request itself. CCB stands for CAM Control Block. It is a union of many specific instances, each describing arguments for some type of transaction. All of these instances share the CCB header where the common part of arguments is stored. CAM supports the SCSI controllers working in both initiator (normal) mode and target (simulating a SCSI device) mode. Here we only consider the part relevant to the initiator mode.
There are a few functions and macros (in other words, methods) defined to access the public data in the struct cam_sim: cam_sim_path(sim) - the path ID (see above) cam_sim_name(sim) - the name of the sim cam_sim_softc(sim) - the pointer to the softc (driver private data) structure cam_sim_unit(sim) - the unit number cam_sim_bus(sim) - the bus ID To identify the device, xxx_action() can get the unit number and the pointer to its structure softc using these functions. The type of request is stored in ccb->ccb_h.func_code. So generally xxx_action() consists of a big switch: struct xxx_softc *softc = (struct xxx_softc *) cam_sim_softc(sim); struct ccb_hdr *ccb_h = &ccb->ccb_h; int unit = cam_sim_unit(sim); int bus = cam_sim_bus(sim); switch(ccb_h->func_code) { case ...: ... default: ccb_h->status = CAM_REQ_INVALID; xpt_done(ccb); break; } As can be seen from the default case (if an unknown command was received), the return code of the command is set into ccb->ccb_h.status and the completed CCB is returned back to CAM by calling xpt_done(ccb). xpt_done() does not have to be called from xxx_action(): for example an I/O request may be enqueued inside the SIM driver and/or its SCSI controller. Then when the device posts an interrupt signaling that the processing of this request is complete, xpt_done() may be called from the interrupt handling routine. Actually, the CCB status is not only assigned as a return code; a CCB has some status all the time. Before the CCB is passed to the xxx_action() routine it gets the status CAM_REQ_INPROG, meaning that it is in progress. There are a surprising number of status values defined in /sys/cam/cam.h which should be able to represent the status of a request in great detail. More interesting yet, the status is in fact a bitwise OR of an enumerated status value (the lower 6 bits) and possible additional flag-like bits (the upper bits). The enumerated values will be discussed later in more detail. The summary of them can be found in the Errors Summary section. The possible status flags are: CAM_DEV_QFRZN - if the SIM driver gets a serious error (for example, the device does not respond to the selection or breaks the SCSI protocol) when processing a CCB it should freeze the request queue by calling xpt_freeze_simq(), return the other CCBs enqueued but not yet processed for this device back to the CAM queue, then set this flag for the troublesome CCB and call xpt_done(). This flag causes the CAM subsystem to unfreeze the queue after it handles the error. CAM_AUTOSNS_VALID - if the device returned an error condition and the flag CAM_DIS_AUTOSENSE is not set in the CCB, the SIM driver must execute the REQUEST SENSE command automatically to extract the sense (extended error information) data from the device. If this attempt was successful the sense data should be saved in the CCB and this flag set. CAM_RELEASE_SIMQ - like CAM_DEV_QFRZN but used in case there is some problem (or resource shortage) with the SCSI controller itself. Then all future requests to the controller should be stopped by xpt_freeze_simq(). The controller queue will be restarted after the SIM driver overcomes the shortage and informs CAM by returning some CCB with this flag set. CAM_SIM_QUEUED - when the SIM puts a CCB into its request queue this flag should be set (and removed when this CCB gets dequeued before being returned back to CAM). This flag is not used anywhere in the CAM code now, so its purpose is purely diagnostic. CAM_QOS_VALID - The QOS data is now valid.
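To make the split between the enumerated part and the flag bits concrete, here is a small sketch (handle_scsi_error() is a hypothetical helper, not a CAM function):

/* the SIM reports a SCSI error and notes that it froze the device queue */
ccb->ccb_h.status = CAM_SCSI_STATUS_ERROR | CAM_DEV_QFRZN;
xpt_done(ccb);

/* ... a consumer later tests only the enumerated part ... */
if ((ccb->ccb_h.status & CAM_STATUS_MASK) == CAM_SCSI_STATUS_ERROR)
    handle_scsi_error(ccb); /* hypothetical helper */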
The function xxx_action() is not allowed to sleep, so all the synchronization for resource access must be done using SIM or device queue freezing. Besides the aforementioned flags the CAM subsystem provides functions xpt_release_simq() and xpt_release_devq() to unfreeze the queues directly, without passing a CCB to CAM. The CCB header contains the following fields: path - path ID for the request target_id - target device ID for the request target_lun - LUN ID of the target device timeout - timeout interval for this command, in milliseconds timeout_ch - a convenience place for the SIM driver to store the timeout handle (the CAM subsystem itself does not make any assumptions about it) flags - various bits of information about the request spriv_ptr0, spriv_ptr1 - fields reserved for private use by the SIM driver (such as linking to the SIM queues or SIM private control blocks); actually, they exist as unions: spriv_ptr0 and spriv_ptr1 have the type (void *), spriv_field0 and spriv_field1 have the type unsigned long, sim_priv.entries[0].bytes and sim_priv.entries[1].bytes are byte arrays of the size consistent with the other incarnations of the union and sim_priv.bytes is one array, twice as big. The recommended way of using the SIM private fields of the CCB is to define some meaningful names for them and use these meaningful names in the driver, like: #define ccb_some_meaningful_name sim_priv.entries[0].bytes #define ccb_hcb spriv_ptr1 /* for hardware control block */ The most common initiator mode requests are: XPT_SCSI_IO - execute an I/O transaction The instance struct ccb_scsiio csio of the union ccb is used to transfer the arguments. They are: cdb_io - pointer to the SCSI command buffer or the buffer itself cdb_len - SCSI command length data_ptr - pointer to the data buffer (gets a bit complicated if scatter/gather is used) dxfer_len - length of the data to transfer sglist_cnt - counter of the scatter/gather segments scsi_status - place to return the SCSI status sense_data - buffer for the SCSI sense information if the command returns an error (the SIM driver is supposed to run the REQUEST SENSE command automatically in this case if the CCB flag CAM_DIS_AUTOSENSE is not set) sense_len - the length of that buffer (if it happens to be larger than the size of sense_data the SIM driver must silently assume the smaller value) resid, sense_resid - if the transfer of data or SCSI sense returned an error these are the returned counters of the residual (not transferred) data. They do not seem to be especially meaningful, so in a case when they are difficult to compute (say, counting bytes in the SCSI controller's FIFO buffer) an approximate value will do as well. For a successfully completed transfer they must be set to zero. tag_action - the kind of tag to use: CAM_TAG_ACTION_NONE - do not use tags for this transaction MSG_SIMPLE_Q_TAG, MSG_HEAD_OF_Q_TAG, MSG_ORDERED_Q_TAG - value equal to the appropriate tag message (see /sys/cam/scsi/scsi_message.h); this gives only the tag type, the SIM driver must assign the tag value itself
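Before walking through the handling logic, here is a rough sketch of how these fields might arrive at the SIM for a simple read request: a single virtual data buffer, auto-sense allowed and simple-queue tagging. The buffer name is illustrative; in reality the peripheral driver fills these fields in:

struct ccb_scsiio *csio = &ccb->csio;

ccb->ccb_h.flags = CAM_DIR_IN;        /* data flows from the device */
csio->cdb_len = 10;                   /* e.g., a READ(10) command */
csio->data_ptr = (u_int8_t *)buf;     /* one non-scattered virtual buffer */
csio->dxfer_len = 8192;
csio->sense_len = sizeof(csio->sense_data);
csio->tag_action = MSG_SIMPLE_Q_TAG;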
The general logic of handling this request is the following: The first thing to do is to check for possible races, to make sure that the command did not get aborted when it was sitting in the queue: struct ccb_scsiio *csio = &ccb->csio; if ((ccb_h->status & CAM_STATUS_MASK) != CAM_REQ_INPROG) { xpt_done(ccb); return; } Also we check that the device is supported at all by our controller: if(ccb_h->target_id > OUR_MAX_SUPPORTED_TARGET_ID || ccb_h->target_id == OUR_SCSI_CONTROLLERS_OWN_ID) { ccb_h->status = CAM_TID_INVALID; xpt_done(ccb); return; } if(ccb_h->target_lun > OUR_MAX_SUPPORTED_LUN) { ccb_h->status = CAM_LUN_INVALID; xpt_done(ccb); return; } Then allocate whatever data structures (such as a card-dependent hardware control block) we need to process this request. If we cannot, then freeze the SIM queue and remember that we have a pending operation, return the CCB back and ask CAM to re-queue it. Later when the resources become available the SIM queue must be unfrozen by returning a ccb with the CAM_RELEASE_SIMQ bit set in its status. Otherwise, if all went well, link the CCB with the hardware control block (HCB) and mark it as queued. struct xxx_hcb *hcb = allocate_hcb(softc, unit, bus); if(hcb == NULL) { softc->flags |= RESOURCE_SHORTAGE; xpt_freeze_simq(sim, /*count*/1); ccb_h->status = CAM_REQUEUE_REQ; xpt_done(ccb); return; } hcb->ccb = ccb; ccb_h->ccb_hcb = (void *)hcb; ccb_h->status |= CAM_SIM_QUEUED; Extract the target data from the CCB into the hardware control block. Check if we are asked to assign a tag and if so, generate a unique tag and build the SCSI tag messages. The SIM driver is also responsible for negotiations with the devices to set the maximal mutually supported bus width, synchronous rate and offset. hcb->target = ccb_h->target_id; hcb->lun = ccb_h->target_lun; generate_identify_message(hcb); if( ccb_h->tag_action != CAM_TAG_ACTION_NONE ) generate_unique_tag_message(hcb, ccb_h->tag_action); if( !target_negotiated(hcb) ) generate_negotiation_messages(hcb); Then set up the SCSI command. The command storage may be specified in the CCB in many interesting ways, controlled by the CCB flags. The command buffer can be contained in the CCB or pointed to; in the latter case the pointer may be physical or virtual. Since the hardware commonly needs a physical address we always convert the address to the physical one, typically using the busdma API. If a physical address is requested it is OK to return the CCB with the status CAM_REQ_INVALID; the current drivers do that. If necessary a physical address can also be converted or mapped back to a virtual address, but with big pain, so we do not do that. if(ccb_h->flags & CAM_CDB_POINTER) { /* CDB is a pointer */ if(!(ccb_h->flags & CAM_CDB_PHYS)) { /* CDB pointer is virtual */ hcb->cmd = vtobus(csio->cdb_io.cdb_ptr); } else { /* CDB pointer is physical */ hcb->cmd = csio->cdb_io.cdb_ptr ; } } else { /* CDB is in the ccb (buffer) */ hcb->cmd = vtobus(csio->cdb_io.cdb_bytes); } hcb->cmdlen = csio->cdb_len; Now it is time to set up the data. Again, the data storage may be specified in the CCB in many interesting ways, controlled by the CCB flags. First we get the direction of the data transfer.
The simplest case is if there is no data to transfer: int dir = (ccb_h->flags & CAM_DIR_MASK); if (dir == CAM_DIR_NONE) goto end_data; Then we check if the data is in one chunk or in a scatter-gather list, and whether the addresses are physical or virtual. The SCSI controller may be able to handle only a limited number of chunks of limited length. If the request hits this limitation we return an error. We use a special function to return the CCB, to handle HCB resource shortages in one place. The functions to add chunks are driver-dependent, and here we leave them without detailed implementation (one possible shape for add_virtual_chunk() is sketched at the end of this discussion). See the description of the SCSI command (CDB) handling for the details on the address-translation issues. If some variation is too difficult or impossible to implement with a particular card it is OK to return the status CAM_REQ_INVALID. Actually, it seems like the scatter-gather ability is not used anywhere in the CAM code now. But at least the case for a single non-scattered virtual buffer must be implemented; it is actively used by CAM. int rv; initialize_hcb_for_data(hcb); if(!(ccb_h->flags & CAM_SCATTER_VALID)) { /* single buffer */ if(!(ccb_h->flags & CAM_DATA_PHYS)) { rv = add_virtual_chunk(hcb, csio->data_ptr, csio->dxfer_len, dir); } else { rv = add_physical_chunk(hcb, csio->data_ptr, csio->dxfer_len, dir); } } else { int i; struct bus_dma_segment *segs; segs = (struct bus_dma_segment *)csio->data_ptr; if ((ccb_h->flags & CAM_SG_LIST_PHYS) != 0) { /* The SG list pointer is physical */ rv = setup_hcb_for_physical_sg_list(hcb, segs, csio->sglist_cnt); } else if (!(ccb_h->flags & CAM_DATA_PHYS)) { /* SG buffer pointers are virtual */ for (i = 0; i < csio->sglist_cnt; i++) { rv = add_virtual_chunk(hcb, segs[i].ds_addr, segs[i].ds_len, dir); if (rv != CAM_REQ_CMP) break; } } else { /* SG buffer pointers are physical */ for (i = 0; i < csio->sglist_cnt; i++) { rv = add_physical_chunk(hcb, segs[i].ds_addr, segs[i].ds_len, dir); if (rv != CAM_REQ_CMP) break; } } } if(rv != CAM_REQ_CMP) { /* we expect that add_*_chunk() functions return CAM_REQ_CMP * if they added a chunk successfully, CAM_REQ_TOO_BIG if * the request is too big (too many bytes or too many chunks), * CAM_REQ_INVALID in case of other troubles */ free_hcb_and_ccb_done(hcb, ccb, rv); return; } end_data: If disconnection is disabled for this CCB we pass this information to the hcb: if(ccb_h->flags & CAM_DIS_DISCONNECT) hcb_disable_disconnect(hcb); If the controller is able to run the REQUEST SENSE command all by itself then the value of the flag CAM_DIS_AUTOSENSE should also be passed to it, to prevent automatic REQUEST SENSE if the CAM subsystem does not want it.
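As promised above, here is one possible shape for add_virtual_chunk(): split the virtual range at page boundaries and record the physical address and length of every piece. The hcb chunk list, MAX_CHUNKS and the handling of dir are assumptions of this sketch:

static int
add_virtual_chunk(struct xxx_hcb *hcb, void *va, u_int32_t len, int dir)
{
    /* dir would be recorded in the hardware-specific descriptor;
     * omitted here for brevity */
    while (len > 0) {
        /* bytes left in the current physical page */
        u_int32_t chunk = PAGE_SIZE - ((uintptr_t)va & PAGE_MASK);

        if (chunk > len)
            chunk = len;
        if (hcb->nchunks >= MAX_CHUNKS)
            return (CAM_REQ_TOO_BIG);
        hcb->chunks[hcb->nchunks].addr = vtobus(va);
        hcb->chunks[hcb->nchunks].len = chunk;
        hcb->nchunks++;
        va = (char *)va + chunk;
        len -= chunk;
    }
    return (CAM_REQ_CMP);
}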
The only thing left is to set up the timeout, pass our hcb to the hardware and return; the rest will be done by the interrupt handler (or timeout handler). ccb_h->timeout_ch = timeout(xxx_timeout, (caddr_t) hcb, (ccb_h->timeout * hz) / 1000); /* convert milliseconds to ticks */ put_hcb_into_hardware_queue(hcb); return; And here is a possible implementation of the function returning the CCB: static void free_hcb_and_ccb_done(struct xxx_hcb *hcb, union ccb *ccb, u_int32_t status) { struct xxx_softc *softc; ccb->ccb_h.ccb_hcb = 0; if(hcb != NULL) { softc = hcb->softc; untimeout(xxx_timeout, (caddr_t) hcb, ccb->ccb_h.timeout_ch); /* we're about to free an hcb, so the shortage has ended */ if(softc->flags & RESOURCE_SHORTAGE) { softc->flags &= ~RESOURCE_SHORTAGE; status |= CAM_RELEASE_SIMQ; } free_hcb(hcb); /* also removes hcb from any internal lists */ } ccb->ccb_h.status = status | (ccb->ccb_h.status & ~(CAM_STATUS_MASK|CAM_SIM_QUEUED)); xpt_done(ccb); } XPT_RESET_DEV - send the SCSI BUS DEVICE RESET message to a device There is no data transferred in the CCB except the header, and the most interesting argument in it is target_id. Depending on the controller hardware, a hardware control block just like the one for the XPT_SCSI_IO request may be constructed (see the XPT_SCSI_IO request description) and sent to the controller, or the SCSI controller may be immediately programmed to send this RESET message to the device, or this request may simply be unsupported (returning the status CAM_REQ_INVALID). Also, on completion of the request all the disconnected transactions for this target must be aborted (probably in the interrupt routine). Also all the current negotiations for the target are lost on reset, so they might be cleaned too. Or their clearing may be deferred, because the target would request re-negotiation on the next transaction anyway. XPT_RESET_BUS - send the RESET signal to the SCSI bus No arguments are passed in the CCB; the only interesting argument is the SCSI bus indicated by the struct cam_sim pointer. A minimalistic implementation would forget the SCSI negotiations for all the devices on the bus and return the status CAM_REQ_CMP. The proper implementation would in addition actually reset the SCSI bus (possibly also resetting the SCSI controller) and mark all the CCBs being processed, both those in the hardware queue and those being disconnected, as done with the status CAM_SCSI_BUS_RESET. Like: int targ, lun; struct xxx_hcb *h, *hh; struct ccb_trans_settings neg; struct cam_path *path; /* The SCSI bus reset may take a long time, in this case its completion * should be checked by interrupt or timeout. But for simplicity * we assume here that it is really fast.
*/ reset_scsi_bus(softc); /* drop all enqueued CCBs */ for(h = softc->first_queued_hcb; h != NULL; h = hh) { hh = h->next; free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET); } /* the clean values of negotiations to report */ neg.bus_width = 8; neg.sync_period = neg.sync_offset = 0; neg.valid = (CCB_TRANS_BUS_WIDTH_VALID | CCB_TRANS_SYNC_RATE_VALID | CCB_TRANS_SYNC_OFFSET_VALID); /* drop all disconnected CCBs and clean negotiations */ for(targ=0; targ <= OUR_MAX_SUPPORTED_TARGET; targ++) { clean_negotiations(softc, targ); /* report the event if possible */ if(xpt_create_path(&path, /*periph*/NULL, cam_sim_path(sim), targ, CAM_LUN_WILDCARD) == CAM_REQ_CMP) { xpt_async(AC_TRANSFER_NEG, path, &neg); xpt_free_path(path); } for(lun=0; lun <= OUR_MAX_SUPPORTED_LUN; lun++) for(h = softc->first_discon_hcb[targ][lun]; h != NULL; h = hh) { hh=h->next; free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET); } } ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); /* report the event */ xpt_async(AC_BUS_RESET, softc->wpath, NULL); return; Implementing the SCSI bus reset as a function may be a good idea because it would be re-used by the timeout function as a last resort if things go wrong. XPT_ABORT - abort the specified CCB The arguments are transferred in the instance struct ccb_abort cab of the union ccb. The only argument field in it is: abort_ccb - pointer to the CCB to be aborted If the abort is not supported just return the status CAM_UA_ABORT. This is also the easy way to minimally implement this call: return CAM_UA_ABORT in any case. The hard way is to implement this request honestly. First check that the abort applies to a SCSI transaction: union ccb *abort_ccb; abort_ccb = ccb->cab.abort_ccb; if(abort_ccb->ccb_h.func_code != XPT_SCSI_IO) { ccb->ccb_h.status = CAM_UA_ABORT; xpt_done(ccb); return; } Then it is necessary to find this CCB in our queue. This can be done by walking the list of all our hardware control blocks in search of one associated with this CCB: struct xxx_hcb *hcb, *h; hcb = NULL; /* We assume that softc->first_hcb is the head of the list of all * HCBs associated with this bus, including those enqueued for * processing, being processed by hardware and disconnected ones. */ for(h = softc->first_hcb; h != NULL; h = h->next) { if(h->ccb == abort_ccb) { hcb = h; break; } } if(hcb == NULL) { /* no such CCB in our queue */ ccb->ccb_h.status = CAM_PATH_INVALID; xpt_done(ccb); return; } Now we look at the current processing status of the HCB. It may be sitting in the queue waiting to be sent to the SCSI bus, being transferred right now, disconnected and waiting for the result of the command, or actually completed by hardware but not yet marked as done by software. To make sure that we do not get into any races with the hardware we mark the HCB as being aborted, so that if this HCB is about to be sent to the SCSI bus the SCSI controller will see this flag and skip it. int hstatus; /* shown as a function, in case special action is needed to make * this flag visible to hardware */ set_hcb_flags(hcb, HCB_BEING_ABORTED); abort_again: hstatus = get_hcb_status(hcb); switch(hstatus) { case HCB_SITTING_IN_QUEUE: remove_hcb_from_hardware_queue(hcb); /* FALLTHROUGH */ case HCB_COMPLETED: /* this is an easy case */ free_hcb_and_ccb_done(hcb, abort_ccb, CAM_REQ_ABORTED); break; If the CCB is being transferred right now we would like to signal to the SCSI controller in some hardware-dependent way that we want to abort the current transfer.
The SCSI controller would set the SCSI ATTENTION signal and, when the target responds to it, send an ABORT message. We also reset the timeout to make sure that the target is not sleeping forever. If the command does not get aborted in some reasonable time, like 10 seconds, the timeout routine will go ahead and reset the whole SCSI bus. Since the command will be aborted in some reasonable time we can just return the abort request now as successfully completed, and mark the aborted CCB as aborted (but not mark it as done yet). case HCB_BEING_TRANSFERRED: untimeout(xxx_timeout, (caddr_t) hcb, abort_ccb->ccb_h.timeout_ch); abort_ccb->ccb_h.timeout_ch = timeout(xxx_timeout, (caddr_t) hcb, 10 * hz); abort_ccb->ccb_h.status = CAM_REQ_ABORTED; /* ask the controller to abort that HCB, then generate * an interrupt and stop */ if(signal_hardware_to_abort_hcb_and_stop(hcb) < 0) { /* oops, we missed the race with hardware, this transaction * got off the bus before we aborted it, try again */ goto abort_again; } break; If the CCB is in the list of disconnected transactions, then set it up as an abort request and re-queue it at the front of the hardware queue. Reset the timeout and report the abort request to be completed. case HCB_DISCONNECTED: untimeout(xxx_timeout, (caddr_t) hcb, abort_ccb->ccb_h.timeout_ch); abort_ccb->ccb_h.timeout_ch = timeout(xxx_timeout, (caddr_t) hcb, 10 * hz); put_abort_message_into_hcb(hcb); put_hcb_at_the_front_of_hardware_queue(hcb); break; } ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); return; That is all for the ABORT request, although there is one more issue. As the ABORT message cleans all the ongoing transactions on a LUN we have to mark all the other active transactions on this LUN as aborted. That should be done in the interrupt routine, after the transaction gets aborted. Implementing the CCB abort as a function may be quite a good idea; this function can be re-used if an I/O transaction times out. The only difference would be that the timed out transaction would return the status CAM_CMD_TIMEOUT for the timed out request. Then the case XPT_ABORT would be small, like this: case XPT_ABORT: union ccb *abort_ccb; abort_ccb = ccb->cab.abort_ccb; if(abort_ccb->ccb_h.func_code != XPT_SCSI_IO) { ccb->ccb_h.status = CAM_UA_ABORT; xpt_done(ccb); return; } if(xxx_abort_ccb(abort_ccb, CAM_REQ_ABORTED) < 0) /* no such CCB in our queue */ ccb->ccb_h.status = CAM_PATH_INVALID; else ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); return; XPT_SET_TRAN_SETTINGS - explicitly set values of SCSI transfer settings The arguments are transferred in the instance struct ccb_trans_settings cts of the union ccb: valid - a bitmask showing which settings should be updated: CCB_TRANS_SYNC_RATE_VALID - synchronous transfer rate CCB_TRANS_SYNC_OFFSET_VALID - synchronous offset CCB_TRANS_BUS_WIDTH_VALID - bus width CCB_TRANS_DISC_VALID - set enable/disable disconnection CCB_TRANS_TQ_VALID - set enable/disable tagged queuing flags - consists of two parts, binary arguments and identification of sub-operations. The binary arguments are: CCB_TRANS_DISC_ENB - enable disconnection CCB_TRANS_TAG_ENB - enable tagged queuing the sub-operations are: CCB_TRANS_CURRENT_SETTINGS - change the current negotiations CCB_TRANS_USER_SETTINGS - remember the desired user values sync_period, sync_offset - self-explanatory, if sync_offset==0 then the asynchronous mode is requested bus_width - bus width, in bits (not bytes) Two sets of negotiated parameters are supported, the user settings and the current settings.
The user settings are not really used much in the SIM drivers; this is mostly just a piece of memory where the upper levels can store (and later recall) their ideas about the parameters. Setting the user parameters does not cause re-negotiation of the transfer rates. But when the SCSI controller does a negotiation it must never set the values higher than the user parameters, so it is essentially the top boundary. The current settings are, as the name says, current. Changing them means that the parameters must be re-negotiated on the next transfer. Again, these new current settings are not supposed to be forced on the device; they are just used as the initial step of negotiations. Also they must be limited by the actual capabilities of the SCSI controller: for example, if the SCSI controller has an 8-bit bus and the request asks to set 16-bit wide transfers, this parameter must be silently truncated to 8-bit transfers before sending it to the device. One caveat is that the bus width and synchronous parameters are per target while the disconnection and tag enabling parameters are per LUN. The recommended implementation is to keep 3 sets of negotiated (bus width and synchronous transfer) parameters: user - the user set, as above current - those actually in effect goal - those requested by setting of the current parameters The code looks like: struct ccb_trans_settings *cts; int targ, lun; int flags; cts = &ccb->cts; targ = ccb_h->target_id; lun = ccb_h->target_lun; flags = cts->flags; if(flags & CCB_TRANS_USER_SETTINGS) { if(flags & CCB_TRANS_SYNC_RATE_VALID) softc->user_sync_period[targ] = cts->sync_period; if(flags & CCB_TRANS_SYNC_OFFSET_VALID) softc->user_sync_offset[targ] = cts->sync_offset; if(flags & CCB_TRANS_BUS_WIDTH_VALID) softc->user_bus_width[targ] = cts->bus_width; if(flags & CCB_TRANS_DISC_VALID) { softc->user_tflags[targ][lun] &= ~CCB_TRANS_DISC_ENB; softc->user_tflags[targ][lun] |= flags & CCB_TRANS_DISC_ENB; } if(flags & CCB_TRANS_TQ_VALID) { softc->user_tflags[targ][lun] &= ~CCB_TRANS_TAG_ENB; softc->user_tflags[targ][lun] |= flags & CCB_TRANS_TAG_ENB; } } if(flags & CCB_TRANS_CURRENT_SETTINGS) { if(flags & CCB_TRANS_SYNC_RATE_VALID) softc->goal_sync_period[targ] = max(cts->sync_period, OUR_MIN_SUPPORTED_PERIOD); if(flags & CCB_TRANS_SYNC_OFFSET_VALID) softc->goal_sync_offset[targ] = min(cts->sync_offset, OUR_MAX_SUPPORTED_OFFSET); if(flags & CCB_TRANS_BUS_WIDTH_VALID) softc->goal_bus_width[targ] = min(cts->bus_width, OUR_BUS_WIDTH); if(flags & CCB_TRANS_DISC_VALID) { softc->current_tflags[targ][lun] &= ~CCB_TRANS_DISC_ENB; softc->current_tflags[targ][lun] |= flags & CCB_TRANS_DISC_ENB; } if(flags & CCB_TRANS_TQ_VALID) { softc->current_tflags[targ][lun] &= ~CCB_TRANS_TAG_ENB; softc->current_tflags[targ][lun] |= flags & CCB_TRANS_TAG_ENB; } } ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); return; Then when the next I/O request is processed it will check whether it has to re-negotiate, for example by calling the function target_negotiated(hcb).
It can be implemented like this: int target_negotiated(struct xxx_hcb *hcb) { struct xxx_softc *softc = hcb->softc; int targ = hcb->target; if( softc->current_sync_period[targ] != softc->goal_sync_period[targ] || softc->current_sync_offset[targ] != softc->goal_sync_offset[targ] || softc->current_bus_width[targ] != softc->goal_bus_width[targ] ) return 0; /* FALSE */ else return 1; /* TRUE */ } After the values are re-negotiated the resulting values must be assigned to both the current and goal parameters, so for future I/O transactions the current and goal parameters will be the same and target_negotiated() will return TRUE. When the card is initialized (in xxx_attach()) the current negotiation values must be initialized to narrow asynchronous mode, and the goal and user values must be initialized to the maximal values supported by the controller. XPT_GET_TRAN_SETTINGS - get values of SCSI transfer settings This operation is the reverse of XPT_SET_TRAN_SETTINGS. Fill up the CCB instance struct ccb_trans_settings cts with data as requested by the flags CCB_TRANS_CURRENT_SETTINGS or CCB_TRANS_USER_SETTINGS (if both are set then the existing drivers return the current settings). Set all the bits in the valid field. XPT_CALC_GEOMETRY - calculate the logical (BIOS) geometry of the disk The arguments are transferred in the instance struct ccb_calc_geometry ccg of the union ccb: block_size - input, block (a.k.a. sector) size in bytes volume_size - input, volume size in bytes cylinders - output, logical cylinders heads - output, logical heads secs_per_track - output, logical sectors per track If the returned geometry differs enough from what the SCSI controller BIOS thinks, and a disk on this SCSI controller is used as a boot disk, the system may not be able to boot. The typical calculation example taken from the aic7xxx driver is: struct ccb_calc_geometry *ccg; u_int32_t size_mb; u_int32_t secs_per_cylinder; int extended; ccg = &ccb->ccg; size_mb = ccg->volume_size / ((1024L * 1024L) / ccg->block_size); extended = check_cards_EEPROM_for_extended_geometry(softc); if (size_mb > 1024 && extended) { ccg->heads = 255; ccg->secs_per_track = 63; } else { ccg->heads = 64; ccg->secs_per_track = 32; } secs_per_cylinder = ccg->heads * ccg->secs_per_track; ccg->cylinders = ccg->volume_size / secs_per_cylinder; ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); return; This gives the general idea; the exact calculation depends on the quirks of the particular BIOS. If the BIOS provides no way to set the extended translation flag in EEPROM, this flag should normally be assumed equal to 1. Other popular geometries are: 128 heads, 63 sectors - Symbios controllers 16 heads, 63 sectors - old controllers Some system BIOSes and SCSI BIOSes fight with each other with variable success, for example a combination of Symbios 875/895 SCSI and Phoenix BIOS can give geometry 128/63 after power up and 255/63 after a hard reset or soft reboot.
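To make the arithmetic concrete, here is a standalone check of the extended-translation branch for a hypothetical 4 GB volume with 512-byte blocks; the numbers follow the calculation style quoted above:

#include <stdio.h>

int
main(void)
{
    unsigned long block_size = 512;
    unsigned long volume_size = 8388608; /* in blocks: 4 GB / 512 */
    unsigned long size_mb, heads, secs_per_track;
    unsigned long secs_per_cylinder, cylinders;

    size_mb = volume_size / ((1024L * 1024L) / block_size); /* 4096 MB */
    if (size_mb > 1024) { /* extended translation assumed enabled */
        heads = 255;
        secs_per_track = 63;
    } else {
        heads = 64;
        secs_per_track = 32;
    }
    secs_per_cylinder = heads * secs_per_track;  /* 16065 */
    cylinders = volume_size / secs_per_cylinder; /* 522 */
    printf("C/H/S = %lu/%lu/%lu\n", cylinders, heads, secs_per_track);
    return 0;
}

For this volume the program prints C/H/S = 522/255/63.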
XPT_PATH_INQ - path inquiry, in other words get the SIM driver and SCSI controller (also known as HBA - Host Bus Adapter) properties The properties are returned in the instance struct ccb_pathinq cpi of the union ccb: version_num - the SIM driver version number, now all drivers use 1 hba_inquiry - bitmask of features supported by the controller: PI_MDP_ABLE - supports MDP message (something from SCSI3?) PI_WIDE_32 - supports 32 bit wide SCSI PI_WIDE_16 - supports 16 bit wide SCSI PI_SDTR_ABLE - can negotiate synchronous transfer rate PI_LINKED_CDB - supports linked commands PI_TAG_ABLE - supports tagged commands PI_SOFT_RST - supports soft reset alternative (hard reset and soft reset are mutually exclusive within a SCSI bus) target_sprt - flags for target mode support, 0 if unsupported hba_misc - miscellaneous controller features: PIM_SCANHILO - bus scans from high ID to low ID PIM_NOREMOVE - removable devices not included in scan PIM_NOINITIATOR - initiator role not supported PIM_NOBUSRESET - user has disabled initial BUS RESET hba_eng_cnt - mysterious HBA engine count, something related to compression, now is always set to 0 vuhba_flags - vendor-unique flags, unused now max_target - maximal supported target ID (7 for 8-bit bus, 15 for 16-bit bus, 127 for Fibre Channel) max_lun - maximal supported LUN ID (7 for older SCSI controllers, 63 for newer ones) async_flags - bitmask of installed Async handler, unused now hpath_id - highest Path ID in the subsystem, unused now unit_number - the controller unit number, cam_sim_unit(sim) bus_id - the bus number, cam_sim_bus(sim) initiator_id - the SCSI ID of the controller itself base_transfer_speed - nominal transfer speed in KB/s for asynchronous narrow transfers, equals to 3300 for SCSI sim_vid - SIM driver's vendor id, a zero-terminated string of maximal length SIM_IDLEN including the terminating zero hba_vid - SCSI controller's vendor id, a zero-terminated string of maximal length HBA_IDLEN including the terminating zero dev_name - device driver name, a zero-terminated string of maximal length DEV_IDLEN including the terminating zero, equal to cam_sim_name(sim) The recommended way of setting the string fields is using strncpy, like: strncpy(cpi->dev_name, cam_sim_name(sim), DEV_IDLEN); After setting the values set the status to CAM_REQ_CMP and mark the CCB as done. Polling static void xxx_poll struct cam_sim *sim The poll function is used to simulate the interrupts when the interrupt subsystem is not functioning (for example, when the system has crashed and is creating the system dump). The CAM subsystem sets the proper interrupt level before calling the poll routine. So all it needs to do is to call the interrupt routine (or the other way around, the poll routine may be doing the real action and the interrupt routine would just call the poll routine). Why bother about a separate function then? This has to do with different calling conventions. The xxx_poll routine gets the struct cam_sim pointer as its argument while the PCI interrupt routine by common convention gets a pointer to the struct xxx_softc and the ISA interrupt routine gets just the device unit number. So the poll routine would normally look like: static void xxx_poll(struct cam_sim *sim) { xxx_intr((struct xxx_softc *)cam_sim_softc(sim)); /* for PCI device */ } or static void xxx_poll(struct cam_sim *sim) { xxx_intr(cam_sim_unit(sim)); /* for ISA device */ } Asynchronous Events If an asynchronous event callback has been set up then the callback function should be defined.
static void ahc_async(void *callback_arg, u_int32_t code, struct cam_path *path, void *arg) callback_arg - the value supplied when registering the callback code - identifies the type of event path - identifies the devices to which the event applies arg - event-specific argument Implementation for a single type of event, AC_LOST_DEVICE, looks like: struct xxx_softc *softc; struct cam_sim *sim; int targ; struct ccb_trans_settings neg; sim = (struct cam_sim *)callback_arg; softc = (struct xxx_softc *)cam_sim_softc(sim); switch (code) { case AC_LOST_DEVICE: targ = xpt_path_target_id(path); if(targ <= OUR_MAX_SUPPORTED_TARGET) { clean_negotiations(softc, targ); /* send indication to CAM */ neg.bus_width = 8; neg.sync_period = neg.sync_offset = 0; neg.valid = (CCB_TRANS_BUS_WIDTH_VALID | CCB_TRANS_SYNC_RATE_VALID | CCB_TRANS_SYNC_OFFSET_VALID); xpt_async(AC_TRANSFER_NEG, path, &neg); } break; default: break; } Interrupts The exact type of the interrupt routine depends on the type of the peripheral bus (PCI, ISA and so on) to which the SCSI controller is connected. The interrupt routines of the SIM drivers run at the interrupt level splcam. So splcam() should be used in the driver to synchronize activity between the interrupt routine and the rest of the driver (for a multiprocessor-aware driver things get yet more interesting, but we ignore this case here). The pseudo-code in this document happily ignores the problems of synchronization. The real code must not ignore them. A simple-minded approach is to set splcam() on entry to the other routines and reset it on return, thus protecting them by one big critical section. To make sure that the interrupt level will always be restored a wrapper function can be defined, like: static void xxx_action(struct cam_sim *sim, union ccb *ccb) { int s; s = splcam(); xxx_action1(sim, ccb); splx(s); } static void xxx_action1(struct cam_sim *sim, union ccb *ccb) { ... process the request ... } This approach is simple and robust, but the problem with it is that interrupts may get blocked for a relatively long time and this would negatively affect the system's performance. On the other hand the functions of the spl() family have rather high overhead, so a vast number of tiny critical sections may not be good either. The conditions handled by the interrupt routine and the details depend very much on the hardware. We consider a set of typical conditions. First, we check if a SCSI reset was encountered on the bus (probably caused by another SCSI controller on the same SCSI bus). If so we drop all the enqueued and disconnected requests, report the events and re-initialize our SCSI controller. It is important that during this initialization the controller will not issue another reset, or else two controllers on the same SCSI bus could ping-pong resets forever. The case of a fatal controller error/hang could be handled in the same place, but it will probably also need to send the RESET signal to the SCSI bus to reset the status of the connections with the SCSI devices.
int fatal=0; struct ccb_trans_settings neg; struct cam_path *path; if( detected_scsi_reset(softc) || (fatal = detected_fatal_controller_error(softc)) ) { int targ, lun; struct xxx_hcb *h, *hh; /* drop all enqueued CCBs */ for(h = softc->first_queued_hcb; h != NULL; h = hh) { hh = h->next; free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET); } /* the clean values of negotiations to report */ neg.bus_width = 8; neg.sync_period = neg.sync_offset = 0; neg.valid = (CCB_TRANS_BUS_WIDTH_VALID | CCB_TRANS_SYNC_RATE_VALID | CCB_TRANS_SYNC_OFFSET_VALID); /* drop all disconnected CCBs and clean negotiations */ for(targ=0; targ <= OUR_MAX_SUPPORTED_TARGET; targ++) { clean_negotiations(softc, targ); /* report the event if possible */ if(xpt_create_path(&path, /*periph*/NULL, cam_sim_path(sim), targ, CAM_LUN_WILDCARD) == CAM_REQ_CMP) { xpt_async(AC_TRANSFER_NEG, path, &neg); xpt_free_path(path); } for(lun=0; lun <= OUR_MAX_SUPPORTED_LUN; lun++) for(h = softc->first_discon_hcb[targ][lun]; h != NULL; h = hh) { hh=h->next; if(fatal) free_hcb_and_ccb_done(h, h->ccb, CAM_UNREC_HBA_ERROR); else free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET); } } /* report the event */ xpt_async(AC_BUS_RESET, softc->wpath, NULL); /* re-initialization may take a lot of time, in such case * its completion should be signaled by another interrupt or * checked on timeout - but for simplicity we assume here that * it is really fast */ if(!fatal) { reinitialize_controller_without_scsi_reset(softc); } else { reinitialize_controller_with_scsi_reset(softc); } schedule_next_hcb(softc); return; } If the interrupt is not caused by a controller-wide condition then probably something has happened to the current hardware control block. Depending on the hardware there may be other non-HCB-related events; we just do not consider them here. Then we analyze what happened to this HCB: struct xxx_hcb *hcb, *h, *hh; int hcb_status, scsi_status; int ccb_status; int targ; int lun_to_freeze; hcb = get_current_hcb(softc); if(hcb == NULL) { /* either stray interrupt or something went very wrong * or this is something hardware-dependent */ handle as necessary; return; } targ = hcb->target; hcb_status = get_status_of_current_hcb(softc); First we check if the HCB has completed, and if so we check the returned SCSI status. if(hcb_status == COMPLETED) { scsi_status = get_completion_status(hcb); Then we look whether this status is related to the REQUEST SENSE command and if so handle it in a simple way: if(hcb->flags & DOING_AUTOSENSE) { if(scsi_status == GOOD) { /* autosense was successful */ hcb->ccb->ccb_h.status |= CAM_AUTOSNS_VALID; free_hcb_and_ccb_done(hcb, hcb->ccb, CAM_SCSI_STATUS_ERROR); } else { autosense_failed: free_hcb_and_ccb_done(hcb, hcb->ccb, CAM_AUTOSENSE_FAIL); } schedule_next_hcb(softc); return; } Otherwise the command itself has completed; pay more attention to the details. If auto-sense is not disabled for this CCB and the command has failed with sense data, then run the REQUEST SENSE command to receive that data:
        hcb->ccb->csio.scsi_status = scsi_status;
        calculate_residue(hcb);

        if( (hcb->ccb->ccb_h.flags & CAM_DIS_AUTOSENSE)==0
        && ( scsi_status == CHECK_CONDITION
                || scsi_status == COMMAND_TERMINATED) ) {
            /* start auto-SENSE */
            hcb->flags |= DOING_AUTOSENSE;
            setup_autosense_command_in_hcb(hcb);
            restart_current_hcb(softc);
            return;
        }
        if(scsi_status == GOOD)
            free_hcb_and_ccb_done(hcb, hcb->ccb, CAM_REQ_CMP);
        else
            free_hcb_and_ccb_done(hcb, hcb->ccb, CAM_SCSI_STATUS_ERROR);
        schedule_next_hcb(softc);
        return;
    }

One typical set of events is negotiation events: negotiation messages received from a SCSI target (in answer to our negotiation attempt or on the target's initiative), or the target being unable to negotiate (it rejects our negotiation messages or does not answer them).

    switch(hcb_status) {
    case TARGET_REJECTED_WIDE_NEG:
        /* revert to 8-bit bus */
        softc->current_bus_width[targ] =
            softc->goal_bus_width[targ] = 8;
        /* report the event */
        neg.bus_width = 8;
        neg.valid = CCB_TRANS_BUS_WIDTH_VALID;
        xpt_async(AC_TRANSFER_NEG, hcb->ccb->ccb_h.path, &neg);
        continue_current_hcb(softc);
        return;
    case TARGET_ANSWERED_WIDE_NEG:
        {
            int wd;

            wd = get_target_bus_width_request(softc);
            if(wd <= softc->goal_bus_width[targ]) {
                /* answer is acceptable */
                softc->current_bus_width[targ] =
                    softc->goal_bus_width[targ] = neg.bus_width = wd;
                /* report the event */
                neg.valid = CCB_TRANS_BUS_WIDTH_VALID;
                xpt_async(AC_TRANSFER_NEG, hcb->ccb->ccb_h.path, &neg);
            } else {
                prepare_reject_message(hcb);
            }
        }
        continue_current_hcb(softc);
        return;
    case TARGET_REQUESTED_WIDE_NEG:
        {
            int wd;

            wd = get_target_bus_width_request(softc);
            wd = min (wd, OUR_BUS_WIDTH);
            wd = min (wd, softc->user_bus_width[targ]);

            if(wd != softc->current_bus_width[targ]) {
                /* the bus width has changed */
                softc->current_bus_width[targ] =
                    softc->goal_bus_width[targ] = neg.bus_width = wd;
                /* report the event */
                neg.valid = CCB_TRANS_BUS_WIDTH_VALID;
                xpt_async(AC_TRANSFER_NEG, hcb->ccb->ccb_h.path, &neg);
            }
            prepare_width_nego_response(hcb, wd);
        }
        continue_current_hcb(softc);
        return;
    }

Then we handle any errors that could have happened during auto-sense in the same simple-minded way as before. Otherwise we look closer at the details again.

    if(hcb->flags & DOING_AUTOSENSE)
        goto autosense_failed;

    switch(hcb_status) {

The next event we consider is an unexpected disconnect, which is considered normal after an ABORT or BUS DEVICE RESET message and abnormal in other cases.
    case UNEXPECTED_DISCONNECT:
        if(requested_abort(hcb)) {
            /* an abort affects all commands on that target+LUN, so
             * mark all disconnected HCBs on that target+LUN as aborted too
             */
            for(h = softc->first_discon_hcb[hcb->target][hcb->lun];
                    h != NULL; h = hh) {
                hh=h->next;
                free_hcb_and_ccb_done(h, h->ccb, CAM_REQ_ABORTED);
            }
            ccb_status = CAM_REQ_ABORTED;
        } else if(requested_bus_device_reset(hcb)) {
            int lun;

            /* a reset affects all commands on that target, so
             * mark all disconnected HCBs on that target+LUN as reset
             */
            for(lun=0; lun <= OUR_MAX_SUPPORTED_LUN; lun++)
                for(h = softc->first_discon_hcb[hcb->target][lun];
                        h != NULL; h = hh) {
                    hh=h->next;
                    free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET);
                }

            /* send event */
            xpt_async(AC_SENT_BDR, hcb->ccb->ccb_h.path, NULL);

            /* this was the CAM_RESET_DEV request itself, it is completed */
            ccb_status = CAM_REQ_CMP;
        } else {
            calculate_residue(hcb);
            ccb_status = CAM_UNEXP_BUSFREE;
            /* request the further code to freeze the queue */
            hcb->ccb->ccb_h.status |= CAM_DEV_QFRZN;
            lun_to_freeze = hcb->lun;
        }
        break;

If the target refuses to accept tags, we notify CAM about that and return all commands for this LUN:

    case TAGS_REJECTED:
        /* report the event */
        neg.flags = 0;  /* i.e., CCB_TRANS_TAG_ENB is cleared: no tags */
        neg.valid = CCB_TRANS_TQ_VALID;
        xpt_async(AC_TRANSFER_NEG, hcb->ccb->ccb_h.path, &neg);

        ccb_status = CAM_MSG_REJECT_REC;
        /* request the further code to freeze the queue */
        hcb->ccb->ccb_h.status |= CAM_DEV_QFRZN;
        lun_to_freeze = hcb->lun;
        break;

Then we check a number of other conditions, with processing basically limited to setting the CCB status:

    case SELECTION_TIMEOUT:
        ccb_status = CAM_SEL_TIMEOUT;
        /* request the further code to freeze the queue */
        hcb->ccb->ccb_h.status |= CAM_DEV_QFRZN;
        lun_to_freeze = CAM_LUN_WILDCARD;
        break;
    case PARITY_ERROR:
        ccb_status = CAM_UNCOR_PARITY;
        break;
    case DATA_OVERRUN:
    case ODD_WIDE_TRANSFER:
        ccb_status = CAM_DATA_RUN_ERR;
        break;
    default:
        /* all other errors are handled in a generic way */
        ccb_status = CAM_REQ_CMP_ERR;
        /* request the further code to freeze the queue */
        hcb->ccb->ccb_h.status |= CAM_DEV_QFRZN;
        lun_to_freeze = CAM_LUN_WILDCARD;
        break;
    }

Then we check whether the error was serious enough to warrant freezing the input queue until it is processed, and do so if it is:

    if(hcb->ccb->ccb_h.status & CAM_DEV_QFRZN) {
        /* freeze the queue */
        xpt_freeze_devq(hcb->ccb->ccb_h.path, /*count*/1);

        /* re-queue all commands for this target/LUN back to CAM */
        for(h = softc->first_queued_hcb; h != NULL; h = hh) {
            hh = h->next;

            if(targ == h->target
            && (lun_to_freeze == CAM_LUN_WILDCARD || lun_to_freeze == h->lun) )
                free_hcb_and_ccb_done(h, h->ccb, CAM_REQUEUE_REQ);
        }
    }
    free_hcb_and_ccb_done(hcb, hcb->ccb, ccb_status);
    schedule_next_hcb(softc);
    return;

This concludes the generic interrupt handling, although specific controllers may require some additions.

Errors Summary

SCSIerrors

When executing an I/O request many things may go wrong. The cause of an error can be reported in the CCB status in great detail. Examples of use are spread throughout this document. For completeness, here is a summary of the recommended responses for the typical error conditions:

CAM_RESRC_UNAVAIL - some resource is temporarily unavailable and the SIM driver cannot generate an event when it becomes available. An example of this resource would be some intra-controller hardware resource for which the controller does not generate an interrupt when it becomes available.
CAM_UNCOR_PARITY - an unrecovered parity error occurred

CAM_DATA_RUN_ERR - a data overrun, an unexpected data phase (going in a direction other than specified in CAM_DIR_MASK), or an odd transfer length for a wide transfer

CAM_SEL_TIMEOUT - a selection timeout occurred (the target does not respond)

CAM_CMD_TIMEOUT - a command timeout occurred (the timeout function ran)

CAM_SCSI_STATUS_ERROR - the device returned an error

CAM_AUTOSENSE_FAIL - the device returned an error and the REQUEST SENSE COMMAND failed

CAM_MSG_REJECT_REC - a MESSAGE REJECT message was received

CAM_SCSI_BUS_RESET - a SCSI bus reset was received

CAM_REQ_CMP_ERR - an impossible SCSI phase occurred or something else equally weird, or just a generic error if further detail is not available

CAM_UNEXP_BUSFREE - an unexpected disconnect occurred

CAM_BDR_SENT - a BUS DEVICE RESET message was sent to the target

CAM_UNREC_HBA_ERROR - an unrecoverable Host Bus Adapter error

CAM_REQ_TOO_BIG - the request was too large for this controller

CAM_REQUEUE_REQ - this request should be re-queued to preserve transaction ordering. This typically occurs when the SIM recognizes an error that should freeze the queue and must place other queued requests for the target at the SIM level back into the XPT queue. Typical cases of such errors are selection timeouts, command timeouts and other similar conditions. In such cases the troublesome command returns a status indicating the error, and the other commands, which have not been sent to the bus yet, get re-queued.

CAM_LUN_INVALID - the LUN ID in the request is not supported by the SCSI controller

CAM_TID_INVALID - the target ID in the request is not supported by the SCSI controller

Timeout Handling

When the timeout for an HCB expires, that request should be aborted, just like with an XPT_ABORT request. The only difference is that the returned status of the aborted request should be CAM_CMD_TIMEOUT instead of CAM_REQ_ABORTED (that is why the abort is best implemented as a function; a possible skeleton is sketched at the end of this section). But there is one more possible problem: what if the abort request itself gets stuck? In this case the SCSI bus should be reset, just like with an XPT_RESET_BUS request (and the idea of implementing it as a function called from both places applies here too). We should also reset the whole SCSI bus if a device reset request gets stuck. So, finally, the timeout function would look like:

    static void
    xxx_timeout(void *arg)
    {
        struct xxx_hcb *hcb = (struct xxx_hcb *)arg;
        struct xxx_softc *softc;
        struct ccb_hdr *ccb_h;

        softc = hcb->softc;
        ccb_h = &hcb->ccb->ccb_h;

        if(hcb->flags & HCB_BEING_ABORTED
        || ccb_h->func_code == XPT_RESET_DEV) {
            xxx_reset_bus(softc);
        } else {
            xxx_abort_ccb(hcb->ccb, CAM_CMD_TIMEOUT);
        }
    }

When we abort a request, all the other disconnected requests to the same target/LUN get aborted too. So the question arises: should we return them with status CAM_REQ_ABORTED or CAM_CMD_TIMEOUT? The current drivers use CAM_CMD_TIMEOUT. This seems logical, because if one request timed out, then probably something really bad is happening to the device, so if they were not disturbed, they would time out by themselves.
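For illustration, here is a minimal sketch of what the shared xxx_abort_ccb() function mentioned above might look like. It follows the same pseudo-code conventions as the rest of this chapter: find_hcb_by_ccb(), hcb_is_in_queue(), dequeue_hcb() and issue_abort_message() are hypothetical helpers of the imaginary "xxx" driver, not part of the CAM API, and the propagation of the completion status to the interrupt routine is elided for brevity:

    static void
    xxx_abort_ccb(union ccb *abort_ccb, cam_status status)
    {
        struct xxx_hcb *hcb;
        struct xxx_softc *softc;

        /* find the HCB that carries the CCB to be aborted */
        hcb = find_hcb_by_ccb(abort_ccb);       /* hypothetical lookup */
        softc = hcb->softc;

        if(hcb_is_in_queue(hcb)) {
            /* not on the bus yet - just return it with the given status */
            dequeue_hcb(softc, hcb);
            free_hcb_and_ccb_done(hcb, hcb->ccb, status);
            schedule_next_hcb(softc);
            return;
        }

        /* active or disconnected: request an abort on the SCSI bus and
         * remember that this HCB is being aborted, so that the timeout
         * routine escalates to a bus reset if the abort itself gets stuck
         */
        hcb->flags |= HCB_BEING_ABORTED;
        issue_abort_message(softc, hcb);

        /* completion arrives as an unexpected disconnect; the interrupt
         * routine then returns this CCB and the disconnected CCBs on the
         * same target/LUN with the status passed here (CAM_REQ_ABORTED
         * for XPT_ABORT, CAM_CMD_TIMEOUT for a timeout)
         */
    }

Both the XPT_ABORT handler and xxx_timeout() can then call this one function, differing only in the status argument they pass.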
diff --git a/en_US.ISO8859-1/books/developers-handbook/ipv6/chapter.xml b/en_US.ISO8859-1/books/developers-handbook/ipv6/chapter.xml index 568cc8ba35..6a51621e89 100644 --- a/en_US.ISO8859-1/books/developers-handbook/ipv6/chapter.xml +++ b/en_US.ISO8859-1/books/developers-handbook/ipv6/chapter.xml @@ -1,1638 +1,1638 @@ IPv6 Internals IPv6/IPsec Implementation Yoshinobu Inoue Contributed by

This section should explain IPv6 and IPsec related implementation internals. These functionalities are derived from the KAME project.

IPv6 Conformance

The IPv6 related functions conform, or try to conform, to the latest set of IPv6 specifications. For future reference we list some of the relevant documents below (NOTE: this is not a complete list - it is too hard to maintain...). For details please refer to the specific chapters in this document, the RFCs, manual pages, or comments in the source code.

Conformance tests have been performed on the KAME STABLE kit at the TAHI project. Results can be viewed at http://www.tahi.org/report/KAME/. We also attended the University of New Hampshire IOL tests (http://www.iol.unh.edu/) in the past, with our past snapshots.

RFC1639: FTP Operation Over Big Address Records (FOOBAR) - RFC2428 is preferred over RFC1639. FTP clients will first try RFC2428, then RFC1639 if that fails.

RFC1886: DNS Extensions to support IPv6

RFC1933: Transition Mechanisms for IPv6 Hosts and Routers - The IPv4 compatible address is not supported. Automatic tunneling (described in 4.3 of this RFC) is not supported. The &man.gif.4; interface implements an IPv[46]-over-IPv[46] tunnel in a generic way, and it covers the "configured tunnel" described in the spec. See 23.5.1.5 in this document for details.

RFC1981: Path MTU Discovery for IPv6

RFC2080: RIPng for IPv6 - usr.sbin/route6d supports this.

RFC2292: Advanced Sockets API for IPv6 - For supported library functions/kernel APIs, see sys/netinet6/ADVAPI.

RFC2362: Protocol Independent Multicast-Sparse Mode (PIM-SM) - RFC2362 defines packet formats for PIM-SM. draft-ietf-pim-ipv6-01.txt is written based on this.

RFC2373: IPv6 Addressing Architecture - Supports node required addresses, and conforms to the scope requirement.

RFC2374: An IPv6 Aggregatable Global Unicast Address Format - Supports the 64-bit length of Interface ID.

RFC2375: IPv6 Multicast Address Assignments - Userland applications use the well-known addresses assigned in the RFC.

RFC2428: FTP Extensions for IPv6 and NATs - RFC2428 is preferred over RFC1639. FTP clients will first try RFC2428, then RFC1639 if that fails.

RFC2460: IPv6 specification

RFC2461: Neighbor discovery for IPv6 - See 23.5.1.2 in this document for details.

RFC2462: IPv6 Stateless Address Autoconfiguration - See 23.5.1.4 in this document for details.

RFC2463: ICMPv6 for IPv6 specification - See 23.5.1.9 in this document for details.

RFC2464: Transmission of IPv6 Packets over Ethernet Networks

RFC2465: MIB for IPv6: Textual Conventions and General Group - Necessary statistics are gathered by the kernel. Actual IPv6 MIB support is provided as a patchkit for ucd-snmp.

RFC2466: MIB for IPv6: ICMPv6 group - Necessary statistics are gathered by the kernel. Actual IPv6 MIB support is provided as a patchkit for ucd-snmp.

RFC2467: Transmission of IPv6 Packets over FDDI Networks

RFC2497: Transmission of IPv6 Packets over ARCnet Networks

RFC2553: Basic Socket Interface Extensions for IPv6 - The IPv4 mapped address (3.7) and the special behavior of the IPv6 wildcard bind socket (3.8) are supported. See 23.5.1.12 in this document for details.

RFC2675: IPv6 Jumbograms - See 23.5.1.7 in this document for details.
RFC2710: Multicast Listener Discovery for IPv6

RFC2711: IPv6 router alert option

draft-ietf-ipngwg-router-renum-08: Router renumbering for IPv6

draft-ietf-ipngwg-icmp-namelookups-02: IPv6 Name Lookups Through ICMP

draft-ietf-ipngwg-icmp-name-lookups-03: IPv6 Name Lookups Through ICMP

draft-ietf-pim-ipv6-01.txt: PIM for IPv6 - &man.pim6dd.8; implements dense mode. &man.pim6sd.8; implements sparse mode.

draft-itojun-ipv6-tcp-to-anycast-00: Disconnecting TCP connection toward IPv6 anycast address

draft-yamamoto-wideipv6-comm-model-00 - See 23.5.1.6 in this document for details.

draft-ietf-ipngwg-scopedaddr-format-00.txt: An Extension of Format for IPv6 Scoped Addresses

Neighbor Discovery

Neighbor Discovery is fairly stable. Currently Address Resolution, Duplicated Address Detection, and Neighbor Unreachability Detection are supported. In the near future we will be adding Proxy Neighbor Advertisement support in the kernel and an Unsolicited Neighbor Advertisement transmission command as an admin tool.

If DAD fails, the address will be marked "duplicated" and a message will be logged to syslog (and usually to the console). The "duplicated" mark can be checked with &man.ifconfig.8;. It is the administrator's responsibility to check for and recover from DAD failures. The behavior should be improved in the near future.

Some network drivers loop multicast packets back to the node itself, even if instructed not to do so (especially in promiscuous mode). In such cases DAD may fail, because the DAD engine sees an inbound NS packet (actually from the node itself) and considers it a sign of a duplicate. You may want to look at the #if condition marked "heuristics" in sys/netinet6/nd6_nbr.c:nd6_dad_timer() as a workaround (note that the code fragment in the "heuristics" section is not spec conformant).

The Neighbor Discovery specification (RFC2461) does not talk about neighbor cache handling in the following cases: (1) when there was no neighbor cache entry and the node received an unsolicited RS/NS/NA/redirect packet without a link-layer address, and (2) neighbor cache handling on a medium without link-layer addresses (we need a neighbor cache entry for the IsRouter bit). For the first case, we implemented a workaround based on discussions on the IETF ipngwg mailing list. For more details, see the comments in the source code and the email thread started from (IPng 7155), dated Feb 6 1999.

The IPv6 on-link determination rule (RFC2461) is quite different from the assumptions in the BSD network code. At this moment, no on-link determination rule is supported where the default router list is empty (RFC2461, section 5.2, last sentence in the 2nd paragraph - note that the spec misuses the words "host" and "node" in several places in the section).

To avoid possible DoS attacks and infinite loops, only 10 options in an ND packet are accepted now. Therefore, if you have 20 prefix options attached to an RA, only the first 10 prefixes will be recognized. If this troubles you, please raise it on the FREEBSD-CURRENT mailing list and/or modify nd6_maxndopt in sys/netinet6/nd6.c. If there is high demand, we may provide a sysctl knob for the variable.

Scope Index

IPv6 uses scoped addresses. Therefore, it is very important to specify the scope index (interface index for a link-local address, or site index for a site-local address) with an IPv6 address. Without a scope index, a scoped IPv6 address is ambiguous to the kernel, and the kernel will not be able to determine the outbound interface for a packet. Ordinary userland applications should use the advanced API (RFC2292) to specify the scope index, or interface index, as the hypothetical sketch below illustrates.
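To give a concrete feel for the advanced API, here is a minimal, hypothetical sketch (not taken from the KAME sources) that sends one UDP datagram to the link-scope all-nodes multicast address ff02::1 and selects the outgoing link by attaching RFC2292 IPV6_PKTINFO ancillary data. The interface name ne0 and the destination port are arbitrary placeholders:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <net/if.h>         /* if_nametoindex() */
    #include <string.h>
    #include <unistd.h>
    #include <err.h>

    int
    main(void)
    {
        int s;
        struct sockaddr_in6 dst;
        struct msghdr msg;
        struct iovec iov;
        struct cmsghdr *cm;
        struct in6_pktinfo *pi;
        union {                 /* keep the cmsg buffer aligned */
            struct cmsghdr hdr;
            unsigned char buf[CMSG_SPACE(sizeof(struct in6_pktinfo))];
        } cmsgbuf;
        char payload[] = "hello";

        if ((s = socket(AF_INET6, SOCK_DGRAM, 0)) < 0)
            err(1, "socket");

        memset(&dst, 0, sizeof(dst));
        dst.sin6_family = AF_INET6;
        dst.sin6_len = sizeof(dst);         /* BSD-specific member */
        dst.sin6_port = htons(9);           /* discard port, for example */
        inet_pton(AF_INET6, "ff02::1", &dst.sin6_addr);

        /* select the outgoing link via RFC2292 ancillary data */
        memset(&cmsgbuf, 0, sizeof(cmsgbuf));
        cm = &cmsgbuf.hdr;
        cm->cmsg_len = CMSG_LEN(sizeof(struct in6_pktinfo));
        cm->cmsg_level = IPPROTO_IPV6;
        cm->cmsg_type = IPV6_PKTINFO;
        pi = (struct in6_pktinfo *)CMSG_DATA(cm);
        pi->ipi6_ifindex = if_nametoindex("ne0");   /* 0 if no such interface */
        /* pi->ipi6_addr is left as :: so the kernel picks the source */

        iov.iov_base = payload;
        iov.iov_len = sizeof(payload);
        memset(&msg, 0, sizeof(msg));
        msg.msg_name = &dst;
        msg.msg_namelen = sizeof(dst);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cmsgbuf.buf;
        msg.msg_controllen = sizeof(cmsgbuf.buf);

        if (sendmsg(s, &msg, 0) < 0)
            err(1, "sendmsg");
        close(s);
        return 0;
    }

This achieves the same effect as the "fe80::1%ne0" extended syntax described below, but through the advanced API rather than the non-standard textual form.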
For a similar purpose, the sin6_scope_id member of the sockaddr_in6 structure is defined in RFC2553. However, the semantics of sin6_scope_id are rather vague. If you care about the portability of your application, we suggest using the advanced API rather than sin6_scope_id.

In the kernel, the interface index for a link-local scoped address is embedded into the 2nd 16-bit word (the 3rd and 4th bytes) of the IPv6 address. For example, you may see something like: fe80:1::200:f8ff:fe01:6317 in the routing table and the interface address structure (struct in6_ifaddr). The address above is a link-local unicast address which belongs to a network interface whose interface identifier is 1. The embedded index enables us to identify IPv6 link-local addresses over multiple interfaces effectively and with only a little code change.

Routing daemons and configuration programs, like &man.route6d.8; and &man.ifconfig.8;, will need to manipulate the "embedded" scope index. These programs use routing sockets and ioctls (like SIOCGIFADDR_IN6), and the kernel API will return IPv6 addresses with the 2nd 16-bit word filled in. These APIs are for manipulating kernel internal structures. Programs that use them have to be prepared for differences between kernels anyway.

When you specify a scoped address on the command line, NEVER write the embedded form (such as ff02:1::1 or fe80:2::fedc). This is not supposed to work. Always use the standard form, like ff02::1 or fe80::fedc, with a command line option for specifying the interface (like ping6 -I ne0 ff02::1). In general, if a command does not have a command line option to specify the outgoing interface, that command is not ready to accept scoped addresses. This may seem contrary to IPv6's premise of supporting the "dentist office" situation. We believe that the specifications need some improvement here.

Some of the userland tools support the extended numeric IPv6 syntax, as documented in draft-ietf-ipngwg-scopedaddr-format-00.txt. You can specify the outgoing link by using the name of the outgoing interface, like "fe80::1%ne0". This way you will be able to specify a link-local scoped address without much trouble. To use this extension in your program, you will need to use &man.getaddrinfo.3; and &man.getnameinfo.3; with NI_WITHSCOPEID. The implementation currently assumes a 1-to-1 relationship between a link and an interface, which is stronger than what the specs say.
If it is not suitable for your usage, you will need to configure the link-local address manually. If an interface is not capable of handling IPv6 (such as lacking multicast support), a link-local address will not be assigned to that interface. See section 2 for details.

Each interface joins the solicited-node multicast address and the link-local all-nodes multicast address (e.g., fe80::1:ff01:6317 and ff02::1, respectively, on the link to which the interface is attached). In addition to a link-local address, the loopback address (::1) will be assigned to the loopback interface. Also, ::1/128 and ff01::/32 are automatically added to the routing table, and the loopback interface joins the node-local multicast group ff01::1.

Stateless address autoconfiguration on Hosts

In the IPv6 specification, nodes are separated into two categories: routers and hosts. Routers forward packets addressed to others; hosts do not forward packets. net.inet6.ip6.forwarding defines whether this node is a router or a host (a router if it is 1, a host if it is 0).

When a host hears a Router Advertisement from a router, it may autoconfigure itself by stateless address autoconfiguration. This behavior can be controlled by net.inet6.ip6.accept_rtadv (the host autoconfigures itself if it is set to 1). By autoconfiguration, the network address prefix for the receiving interface (usually a global address prefix) is added. The default route is also configured. Routers periodically generate Router Advertisement packets. To request an adjacent router to generate an RA packet, a host can transmit a Router Solicitation. To generate an RS packet at any time, use the rtsol command. The &man.rtsold.8; daemon is also available. &man.rtsold.8; generates Router Solicitations whenever necessary, and it works great for nomadic usage (notebooks/laptops). If one wishes to ignore Router Advertisements, use sysctl to set net.inet6.ip6.accept_rtadv to 0.

To generate Router Advertisements from a router, use the &man.rtadvd.8; daemon.

Note that the IPv6 specification assumes the following items, and nonconforming cases are left unspecified: only hosts will listen to router advertisements, and hosts have a single network interface (except loopback). Therefore, it is unwise to enable net.inet6.ip6.accept_rtadv on routers or on multi-interface hosts. A misconfigured node can behave strangely (nonconforming configuration is allowed for those who would like to do some experiments).

To summarize the sysctl knobs:

    accept_rtadv  forwarding  role of the node
    ---           ---         ---
    0             0           host (to be manually configured)
    0             1           router
    1             0           autoconfigured host (the spec assumes that
                              hosts have a single interface only; an
                              autoconfigured host with multiple interfaces
                              is out of scope)
    1             1           invalid, or experimental
                              (out of scope of the spec)

RFC2462 has a validation rule for incoming RA prefix information options, in 5.5.3 (e). This is to protect hosts from malicious (or misconfigured) routers that advertise a very short prefix lifetime. There was an update from Jim Bound to the ipngwg mailing list (look for "(ipng 6712)" in the archive), and Jim's update has been implemented.

See 23.5.1.2 in the document for the relationship between DAD and autoconfiguration.

Generic Tunnel Interface

GIF (Generic InterFace) is a pseudo interface for configured tunnels. Details are described in &man.gif.4;. Currently v6 in v6, v6 in v4, v4 in v6, and v4 in v4 are available. Use &man.gifconfig.8; to assign the physical (outer) source and destination addresses to gif interfaces. A configuration that uses the same address family for the inner and outer IP headers (v4 in v4, or v6 in v6) is dangerous.
It is very easy to configure interfaces and routing tables to perform an infinite level of tunneling. Please be warned.

gif can be configured to be ECN-friendly. See 23.5.4.5 for the ECN-friendliness of tunnels, and &man.gif.4; for how to configure it.

If you would like to configure an IPv4-in-IPv6 tunnel with a gif interface, read &man.gif.4; carefully. You will need to remove the IPv6 link-local address that is automatically assigned to the gif interface.

Source Address Selection

The current source address selection rule is scope oriented (there are some exceptions - see below). For a given destination, a source IPv6 address is selected by the following rules: (1) If the source address is explicitly specified by the user (e.g., via the advanced API), the specified address is used. (2) If there is an address assigned to the outgoing interface (which is usually determined by looking up the routing table) that has the same scope as the destination address, the address is used. This is the most typical case. (3) If there is no address that satisfies the above condition, choose a global address assigned to one of the interfaces on the sending node. (4) If there is no address that satisfies the above condition, and the destination address is of site-local scope, choose a site-local address assigned to one of the interfaces on the sending node. (5) If there is no address that satisfies the above condition, choose the address associated with the routing table entry for the destination. This is the last resort, which may cause a scope violation.

For instance, ::1 is selected for ff01::1, and fe80:1::200:f8ff:fe01:6317 for fe80:1::2a0:24ff:feab:839b (note that the embedded interface index - described in 23.5.1.3 - helps us choose the right source address; those embedded indices will not be on the wire). If the outgoing interface has multiple addresses for the scope, a source is selected on a longest-match basis (rule 3). Suppose 2001:0DB8:808:1:200:f8ff:fe01:6317 and 2001:0DB8:9:124:200:f8ff:fe01:6317 are given to the outgoing interface. 2001:0DB8:808:1:200:f8ff:fe01:6317 is chosen as the source for the destination 2001:0DB8:800::1.

Note that the above rule is not documented in the IPv6 spec. It is considered an "up to implementation" item. There are some cases where we do not use the above rule. One example is a connected TCP session, where we use the address kept in the tcb as the source. Another example is the source address for a Neighbor Advertisement. Under the spec (RFC2461 7.2.2), the NA's source should be the target address of the corresponding NS. In this case we follow the spec rather than the above longest-match rule.

For new connections (when rule 1 does not apply), deprecated addresses (addresses with preferred lifetime = 0) will not be chosen as the source address if other choices are available. If no other choices are available, a deprecated address will be used as a last resort. If there are multiple choices of deprecated addresses, the above scope rule will be used to choose from them. If you would like to prohibit the use of deprecated addresses for some reason, configure net.inet6.ip6.use_deprecated to 0. The issues related to deprecated addresses are described in RFC2462 5.5.4 (NOTE: there is some debate underway in the IETF ipngwg on how to use "deprecated" addresses).

Jumbo Payload

The Jumbo Payload hop-by-hop option is implemented and can be used to send IPv6 packets with payloads longer than 65,535 octets. But currently no physical interface whose MTU is more than 65,535 is supported, so such payloads can be seen only on the loopback interface (i.e., lo0).
If you want to try jumbo payloads, you first have to reconfigure the kernel so that the MTU of the loopback interface is more than 65,535 bytes; add the following to the kernel configuration file:

    options "LARGE_LOMTU" #To test jumbo payload

and recompile the new kernel.

Then you can test jumbo payloads with the &man.ping6.8; command with the -b and -s options. The -b option must be specified to enlarge the size of the socket buffer, and the -s option specifies the length of the packet, which should be more than 65,535. For example, type as follows:

    &prompt.user; ping6 -b 70000 -s 68000 ::1

The IPv6 specification requires that the Jumbo Payload option must not be used in a packet that carries a fragment header. If this condition is broken, an ICMPv6 Parameter Problem message must be sent to the sender. The specification is followed, but you cannot usually see an ICMPv6 error caused by this requirement.

When an IPv6 packet is received, the frame length is checked and compared to the length specified in the payload length field of the IPv6 header or in the value of the Jumbo Payload option, if any. If the former is shorter than the latter, the packet is discarded and statistics are incremented. You can see the statistics in the output of the &man.netstat.8; command with the `-s -p ip6' option:

    &prompt.user; netstat -s -p ip6
    ip6:
            (snip)
            1 with data size < data length

So, the kernel does not send an ICMPv6 error unless the erroneous packet is an actual Jumbo Payload, that is, its packet size is more than 65,535 bytes. As described above, currently no physical interface with such a huge MTU is supported, so it rarely returns an ICMPv6 error.

TCP/UDP over jumbograms is not supported at this moment. This is because we have no medium (other than loopback) to test this. Contact us if you need this.

IPsec does not work on jumbograms. This is due to some specification twists in supporting AH with jumbograms (the AH header size influences the payload length, and this makes it really hard to authenticate an inbound packet with the jumbo payload option as well as AH).

There are fundamental issues in *BSD support for jumbograms. We would like to address them, but we need more time to finalize the work. To name a few: the mbuf pkthdr.len field is typed as "int" in 4.4BSD, so it will not hold a jumbogram with len > 2G on 32-bit architecture CPUs. If we want to support jumbograms properly, the field must be expanded to hold 4G + IPv6 header + link-layer header. Therefore, it must be expanded to at least int64_t (u_int32_t is NOT enough). We mistakenly use "int" to hold packet lengths in many places. We need to convert them into a larger integral type. This needs great care, as we may experience overflow during packet length computation. We mistakenly check the ip6_plen field of the IPv6 header for the packet payload length in various places. We should be checking mbuf pkthdr.len instead. ip6_input() performs a sanity check on the jumbo payload option on input, and we can safely use mbuf pkthdr.len afterwards. The TCP code needs a careful update in a bunch of places, of course.

Loop Prevention in Header Processing

The IPv6 specification allows an arbitrary number of extension headers to be placed on packets. If we implement IPv6 packet processing code in the way the BSD IPv4 code is implemented, the kernel stack may overflow due to a long function call chain. sys/netinet6 code is carefully designed to - avoid kernel stack overflow.
Because of this, sys/netinet6 + avoid kernel stack overflow, so sys/netinet6 code defines its own protocol switch structure, as "struct ip6protosw" (see netinet6/ip6protosw.h). There is no such update to the IPv4 part (sys/netinet) for compatibility, but a small change is added to its pr_input() prototype. So - "struct ipprotosw" is also defined. Because of this, if you + "struct ipprotosw" is also defined. As a result, if you receive an IPsec-over-IPv4 packet with a massive number of IPsec headers, the kernel stack may blow up. IPsec-over-IPv6 is okay. - (Off-course, for those all IPsec headers to be processed, + (Of course, for all those IPsec headers to be processed, each such IPsec header must pass each IPsec check. So an anonymous attacker will not be able to do such an attack.)

ICMPv6

After RFC2463 was published, the IETF ipngwg decided to disallow ICMPv6 error packets against ICMPv6 redirects, to prevent an ICMPv6 storm on a network medium. This is already implemented in the kernel.

Applications

For userland programming, we support the IPv6 socket API as specified in RFC2553, RFC2292 and upcoming Internet drafts. TCP/UDP over IPv6 is available and quite stable. You can enjoy &man.telnet.1;, &man.ftp.1;, &man.rlogin.1;, &man.rsh.1;, &man.ssh.1;, etc. These applications are protocol independent. That is, they automatically choose IPv4 or IPv6 according to DNS.

Kernel Internals

While ip_forward() calls ip_output(), ip6_forward() directly calls if_output(), since routers must not divide IPv6 packets into fragments.

ICMPv6 should contain the original packet as long as possible, up to 1280 bytes. A UDP6/IP6 port unreach, for instance, should contain all extension headers and the *unchanged* UDP6 and IP6 headers. So, all IP6 functions except TCP never convert network byte order into host byte order, to preserve the original packet.

tcp_input(), udp6_input() and icmp6_input() cannot assume that the IP6 header immediately precedes the transport headers, due to extension headers. So, in6_cksum() was implemented to handle packets whose IP6 header and transport header are not contiguous. Neither TCP/IP6 nor UDP6/IP6 header structures exist for checksum calculation.

To process the IP6 header, extension headers and transport headers easily, network drivers are now required to store packets in one internal mbuf or in one or more external mbufs. A typical old driver prepares two internal mbufs for 96 - 204 bytes of data; now, however, such packet data is stored in one external mbuf. netstat -s -p ip6 tells you whether or not your driver conforms to this requirement. In the following example, "cce0" violates the requirement. (For more information, refer to Section 2.)

    Mbuf statistics:
            317 one mbuf
            two or more mbuf::
                    lo0 = 8
                    cce0 = 10
            3282 one ext mbuf
            0 two or more ext mbuf

Each input function calls IP6_EXTHDR_CHECK at the beginning to check if the region between the IP6 header and its own header is contiguous. IP6_EXTHDR_CHECK calls m_pullup() only if the mbuf has the M_LOOP flag, that is, the packet comes from the loopback interface. m_pullup() is never called for packets coming from physical network interfaces. Neither the IP nor the IP6 reassembly functions call m_pullup().

IPv4 Mapped Address and IPv6 Wildcard Socket

RFC2553 describes the IPv4 mapped address (3.7) and the special behavior of the IPv6 wildcard bind socket (3.8). The spec allows you to: Accept IPv4 connections by an AF_INET6 wildcard bind socket. Transmit IPv4 packets over an AF_INET6 socket by using a special form of the address like ::ffff:10.1.1.1.
but the spec itself is very complicated and does not specify how the socket layer should behave. Here we call the former "the listening side" and the latter "the initiating side", for reference purposes. You can perform a wildcard bind on both of the address families, on the same port. The following table shows the behavior of FreeBSD 4.x:

                  listening side               initiating side
                  (AF_INET6 wildcard socket    (connection to
                  gets IPv4 conn.)             ::ffff:10.1.1.1)
    ---           ---                          ---
    FreeBSD 4.x   configurable                 supported
                  (default: enabled)

The following sections will give you more details on how you can configure the behavior.

Comments on the listening side: It seems that RFC2553 says too little about the wildcard bind issue, especially about the port space issue, failure modes and the relationship between AF_INET/INET6 wildcard binds. There can be several separate interpretations of this RFC which conform to it but behave differently. So, to implement a portable application, you should assume nothing about the behavior in the kernel. Using &man.getaddrinfo.3; is the safest way. Port number space and wildcard bind issues were discussed in detail on the ipv6imp mailing list in mid March 1999, and it seems that there is no concrete consensus (meaning it is up to implementers). You may want to check the mailing list archives.

If a server application would like to accept IPv4 and IPv6 connections, there are two alternatives. One is using AF_INET and AF_INET6 sockets (you will need two sockets). Use &man.getaddrinfo.3; with AI_PASSIVE in ai_flags, and &man.socket.2; and &man.bind.2; to all the addresses returned. By opening multiple sockets, you can accept connections on the socket with the proper address family. IPv4 connections will be accepted by the AF_INET socket, and IPv6 connections will be accepted by the AF_INET6 socket.

Another way is using one AF_INET6 wildcard bind socket. Use &man.getaddrinfo.3; with AI_PASSIVE in ai_flags and with AF_INET6 in ai_family, and set the 1st argument, hostname, to NULL. Then &man.socket.2; and &man.bind.2; to the address returned (it should be the IPv6 unspecified address). You can accept both IPv4 and IPv6 packets via this one socket.

To portably support only IPv6 traffic on an AF_INET6 wildcard bound socket, always check the peer address when a connection is made toward the AF_INET6 listening socket. If the address is an IPv4 mapped address, you may want to reject the connection. You can check the condition by using the IN6_IS_ADDR_V4MAPPED() macro. To resolve this issue more easily, there is a system dependent &man.setsockopt.2; option, IPV6_BINDV6ONLY, used like below:

    int on = 1;

    if (setsockopt(s, IPPROTO_IPV6, IPV6_BINDV6ONLY,
            (char *)&on, sizeof (on)) < 0)
        /* handle the error */

When this call succeeds, the socket receives only IPv6 packets.

Comments on the initiating side: Advice to application implementers: to implement a portable IPv6 application (which works on multiple IPv6 kernels), we believe that the following is the key to success: NEVER hardcode AF_INET or AF_INET6. Use &man.getaddrinfo.3; and &man.getnameinfo.3; throughout the system. Never use gethostby*(), getaddrby*(), inet_*() or getipnodeby*(). (To update existing applications to be IPv6 aware easily, getipnodeby*() will sometimes be useful. But if possible, try to rewrite the code to use &man.getaddrinfo.3; and &man.getnameinfo.3;.) If you would like to connect to a destination, use &man.getaddrinfo.3; and try all the destinations returned, like &man.telnet.1; does. Some IPv6 stacks ship with a buggy &man.getaddrinfo.3;.
Ship a minimal working version with your application and use that as a last resort.

If you would like to use an AF_INET6 socket for both IPv4 and IPv6 outgoing connections, you will need to use &man.getipnodebyname.3;. When you would like to update your existing application to be IPv6 aware with minimal effort, this approach might be chosen. But please note that it is a temporary solution, because &man.getipnodebyname.3; itself is not recommended, as it does not handle scoped IPv6 addresses at all. For IPv6 name resolution, &man.getaddrinfo.3; is the preferred API. So you should rewrite your application to use &man.getaddrinfo.3; when you get the time to do it.

When writing applications that make outgoing connections, the story is much simpler if you treat AF_INET and AF_INET6 as totally separate address families. The {set,get}sockopt issues become simpler, and the DNS issues will be made simpler. We do not recommend relying upon IPv4 mapped addresses.

unified tcp and inpcb code

FreeBSD 4.x uses shared tcp code between IPv4 and IPv6 (from sys/netinet/tcp*) and separate udp4/6 code. It uses a unified inpcb structure. The platform can be configured to support IPv4 mapped addresses. The kernel configuration is summarized as follows: By default, an AF_INET6 socket will grab IPv4 connections under certain conditions, and can initiate a connection to an IPv4 destination embedded in an IPv4 mapped IPv6 address. You can disable it for the entire system with sysctl like below:

    sysctl net.inet6.ip6.mapped_addr=0

Listening Side

Each socket can be configured to support the special AF_INET6 wildcard bind (enabled by default). You can disable it on a per-socket basis with &man.setsockopt.2; like below:

    int on = 1;

    if (setsockopt(s, IPPROTO_IPV6, IPV6_BINDV6ONLY,
            (char *)&on, sizeof (on)) < 0)
        /* handle the error */

A wildcard AF_INET6 socket grabs an IPv4 connection if and only if the following conditions are satisfied: there is no AF_INET socket that matches the IPv4 connection, and the AF_INET6 socket is configured to accept IPv4 traffic, i.e., getsockopt(IPV6_BINDV6ONLY) returns 0. There is no problem with open/close ordering.

Initiating Side

FreeBSD 4.x supports outgoing connections to IPv4 mapped addresses (::ffff:10.1.1.1), if the node is configured to support IPv4 mapped addresses.

sockaddr_storage

When RFC2553 was about to be finalized, there was discussion on how struct sockaddr_storage members should be named. One proposal was to prepend "__" to the members (like "__ss_len"), as they should not be touched. The other proposal was not to prepend it (like "ss_len"), as we need to touch those members directly. There was no clear consensus on it. As a result, RFC2553 defines struct sockaddr_storage as follows:

    struct sockaddr_storage {
        u_char  __ss_len;       /* address length */
        u_char  __ss_family;    /* address family */
        /* and bunch of padding */
    };

On the contrary, the XNET draft defines it as follows:

    struct sockaddr_storage {
        u_char  ss_len;         /* address length */
        u_char  ss_family;      /* address family */
        /* and bunch of padding */
    };

In December 1999, it was agreed that RFC2553bis should pick the latter (XNET) definition. The current implementation conforms to the XNET definition, based on the RFC2553bis discussion. If you look at multiple IPv6 implementations, you will be able to see both definitions. As a userland programmer, the most portable way of dealing with it is to: (1) ensure ss_family and/or ss_len are available on the platform, by using GNU autoconf, (2) have -Dss_family=__ss_family to unify all occurrences (including header files) into __ss_family, or (3) never touch __ss_family;
cast to sockaddr * and use sa_family, like:

    struct sockaddr_storage ss;
    family = ((struct sockaddr *)&ss)->sa_family

Network Drivers

The following two items are now required to be supported by standard drivers: (1) the mbuf clustering requirement. In this stable release, we changed MINCLSIZE into MHLEN+1 for all the operating systems in order to make all the drivers behave as we expect. (2) multicast. If &man.ifmcstat.8; yields no multicast group for an interface, that interface has to be patched.

If any of the drivers do not support these requirements, then they cannot be used for IPv6 and/or IPsec communication. If you find any problem with your card when using IPv6/IPsec, then please report it to the &a.bugs;. (NOTE: In the past we required all PCMCIA drivers to have a call to in6_ifattach(). We have no such requirement any more.)

Translator

We categorize IPv4/IPv6 translators into 4 types:

Translator A --- It is used in the early stage of transition to make it possible to establish a connection from an IPv6 host in an IPv6 island to an IPv4 host in the IPv4 ocean.

Translator B --- It is used in the early stage of transition to make it possible to establish a connection from an IPv4 host in the IPv4 ocean to an IPv6 host in an IPv6 island.

Translator C --- It is used in the late stage of transition to make it possible to establish a connection from an IPv4 host in an IPv4 island to an IPv6 host in the IPv6 ocean.

Translator D --- It is used in the late stage of transition to make it possible to establish a connection from an IPv6 host in the IPv6 ocean to an IPv4 host in an IPv4 island.

IPsec

IPsec is mainly organized into three components: Policy Management, Key Management, and AH and ESP handling.

Policy Management

The kernel implements experimental policy management code. There are two ways to manage security policy. One is to configure a per-socket policy using &man.setsockopt.2;. In this case, the policy configuration is described in &man.ipsec.set.policy.3;. The other is to configure a kernel packet filter-based policy using the PF_KEY interface, via &man.setkey.8;. Policy entries are not re-ordered by their indexes, so the order in which you add entries is very significant.

Key Management

The key management code implemented in this kit (sys/netkey) is a home-brew PFKEY v2 implementation. This conforms to RFC2367. The home-brew IKE daemon, "racoon", is included in the kit (kame/kame/racoon). Basically you will need to run racoon as a daemon, then set up a policy to require keys (like ping -P 'out ipsec esp/transport//use'). The kernel will contact the racoon daemon as necessary to exchange keys.

AH and ESP Handling

The IPsec module is implemented as "hooks" into the standard IPv4/IPv6 processing. When sending a packet, ip{,6}_output() checks if ESP/AH processing is required by checking if a matching SPD (Security Policy Database) entry is found. If ESP/AH is needed, {esp,ah}{4,6}_output() will be called and the mbuf will be updated accordingly. When a packet is received, {esp,ah}4_input() will be called based on the protocol number, i.e., (*inetsw[proto])(). {esp,ah}4_input() will decrypt/check the authenticity of the packet, and strip off the daisy-chained header and padding for ESP/AH. It is safe to strip off the ESP/AH header on packet reception, since we will never use the received packet in "as is" form.

By using ESP/AH, the TCP4/6 effective data segment size will be affected by the extra daisy-chained headers inserted by ESP/AH. Our code takes care of this case.

Basic crypto functions can be found in the directory "sys/crypto".
ESP/AH transforms are listed in {esp,ah}_core.c with wrapper functions. If you wish to add some algorithm, add a wrapper function in {esp,ah}_core.c, and add your crypto algorithm code into sys/crypto.

Tunnel mode is partially supported in this release, with the following restrictions: the IPsec tunnel is not combined with the GIF generic tunneling interface. It needs great care because we may create an infinite loop between ip_output() and tunnelifp->if_output(). Opinions vary on whether it is better to unify them or not. MTU and Don't Fragment bit (IPv4) considerations need more checking, but it basically works fine. The authentication model for the AH tunnel must be revisited. We will need to improve the policy management engine, eventually.

Conformance to RFCs and IDs

The IPsec code in the kernel conforms (or, tries to conform) to the following standards: the "old IPsec" specification documented in rfc182[5-9].txt, and the "new IPsec" specification documented in rfc240[1-6].txt, rfc241[01].txt, rfc2451.txt and draft-mcdonald-simple-ipsec-api-01.txt (the draft has expired, but you can get it from ftp://ftp.kame.net/pub/internet-drafts/). (NOTE: the IKE specifications, rfc241[7-9].txt, are implemented in userland, as the "racoon" IKE daemon.)

Currently supported algorithms are:

old IPsec AH:
    null crypto checksum (no document, just for debugging)
    keyed MD5 with 128bit crypto checksum (rfc1828.txt)
    keyed SHA1 with 128bit crypto checksum (no document)
    HMAC MD5 with 128bit crypto checksum (rfc2085.txt)
    HMAC SHA1 with 128bit crypto checksum (no document)

old IPsec ESP:
    null encryption (no document, similar to rfc2410.txt)
    DES-CBC mode (rfc1829.txt)

new IPsec AH:
    null crypto checksum (no document, just for debugging)
    keyed MD5 with 96bit crypto checksum (no document)
    keyed SHA1 with 96bit crypto checksum (no document)
    HMAC MD5 with 96bit crypto checksum (rfc2403.txt)
    HMAC SHA1 with 96bit crypto checksum (rfc2404.txt)

new IPsec ESP:
    null encryption (rfc2410.txt)
    DES-CBC with derived IV (draft-ietf-ipsec-ciph-des-derived-01.txt, draft expired)
    DES-CBC with explicit IV (rfc2405.txt)
    3DES-CBC with explicit IV (rfc2451.txt)
    BLOWFISH CBC (rfc2451.txt)
    CAST128 CBC (rfc2451.txt)
    RC5 CBC (rfc2451.txt)
    each of the above can be combined with:
        ESP authentication with HMAC-MD5 (96bit)
        ESP authentication with HMAC-SHA1 (96bit)

The following algorithms are NOT supported:

old IPsec AH:
    HMAC MD5 with 128bit crypto checksum + 64bit replay prevention (rfc2085.txt)
    keyed SHA1 with 160bit crypto checksum + 32bit padding (rfc1852.txt)

IPsec (in the kernel) and IKE (in userland as "racoon") have been tested at several interoperability test events, and they are known to interoperate well with many other implementations. Also, the current IPsec implementation has quite wide coverage of the IPsec crypto algorithms documented in RFCs (we cover only algorithms without intellectual property issues).

ECN Consideration on IPsec Tunnels

ECN-friendly IPsec tunneling is supported, as described in draft-ipsec-ecn-00.txt. The normal IPsec tunnel is described in RFC2401. On encapsulation, the IPv4 TOS field (or the IPv6 traffic class field) will be copied from the inner IP header to the outer IP header. On decapsulation the outer IP header will be simply dropped. The decapsulation rule is not compatible with ECN, since the ECN bit of the outer IP TOS/traffic class field will be lost. To make IPsec tunnels ECN-friendly, we should modify the encapsulation and decapsulation procedures. These are described in http://www.aciri.org/floyd/papers/draft-ipsec-ecn-00.txt, chapter 3.
The IPsec tunnel implementation can give you three behaviors, by setting net.inet.ipsec.ecn (or net.inet6.ipsec6.ecn) to some value: RFC2401: no consideration for ECN (sysctl value -1); ECN forbidden (sysctl value 0); ECN allowed (sysctl value 1). Note that the behavior is configurable in a per-node manner, not a per-SA manner (draft-ipsec-ecn-00 wants per-SA configuration, but it looks like too much to me). The behavior is summarized as follows (see the source code for more detail):

                   encapsulate                      decapsulate
    ---            ---                              ---
    RFC2401        copy all TOS bits from inner     drop TOS bits on outer
                   to outer.                        (use inner TOS bits as is)
    ECN forbidden  copy TOS bits except for ECN     drop TOS bits on outer
                   (masked with 0xfc) from inner    (use inner TOS bits as is)
                   to outer. set ECN bits to 0.
    ECN allowed    copy TOS bits except for ECN     use inner TOS bits with some
                   CE (masked with 0xfe) from       change. if outer ECN CE bit
                   inner to outer.                  is 1, enable ECN CE bit on
                   set ECN CE bit to 0.             the inner.

The general strategy for configuration is as follows: if both IPsec tunnel endpoints are capable of ECN-friendly behavior, you had better configure both ends to "ECN allowed" (sysctl value 1); if the other end is very strict about TOS bits, use "RFC2401" (sysctl value -1); in other cases, use "ECN forbidden" (sysctl value 0). The default behavior is "ECN forbidden" (sysctl value 0). For more information, please refer to: http://www.aciri.org/floyd/papers/draft-ipsec-ecn-00.txt, RFC2481 (Explicit Congestion Notification), and src/sys/netinet6/{ah,esp}_input.c. (Thanks go to Kenjiro Cho kjc@csl.sony.co.jp for the detailed analysis.)

Interoperability

Here are (some of) the platforms with which the KAME code has tested IPsec/IKE interoperability in the past. Note that both ends may have modified their implementations, so use the following list just for reference purposes: Altiga, Ashley-laurent (vpcom.com), Data Fellows (F-Secure), Ericsson ACC, FreeS/WAN, HITACHI, IBM &aix;, IIJ, Intel, &microsoft; &windowsnt;, NIST (linux IPsec + plutoplus), Netscreen, OpenBSD, RedCreek, Routerware, SSH, Secure Computing, Soliton, Toshiba, VPNet, Yamaha RT100i

diff --git a/en_US.ISO8859-1/books/developers-handbook/kerneldebug/chapter.xml b/en_US.ISO8859-1/books/developers-handbook/kerneldebug/chapter.xml index 8fd50f0d68..793d728368 100644 --- a/en_US.ISO8859-1/books/developers-handbook/kerneldebug/chapter.xml +++ b/en_US.ISO8859-1/books/developers-handbook/kerneldebug/chapter.xml @@ -1,1075 +1,1075 @@ Kernel Debugging PaulRichardsContributed by JörgWunsch RobertWatson

Obtaining a Kernel Crash Dump

When running a development kernel (e.g., &os.current;), a kernel under extreme conditions (e.g., very high load averages, tens of thousands of connections, an exceedingly high number of concurrent users, hundreds of &man.jail.8;s, etc.), or a new feature or device driver on &os.stable; (e.g., PAE), a kernel will sometimes panic. In the event that it does, this chapter will demonstrate how to extract useful information from a crash.

A system reboot is inevitable once a kernel panics. Once a system is rebooted, the contents of a system's physical memory (RAM) are lost, as well as any bits that were on the swap device before the panic. To preserve the bits in physical memory, the kernel makes use of the swap device as a temporary place to store the bits that are in RAM across a reboot after a crash. In doing this, when &os; boots after a crash, a kernel image can now be extracted and debugging can take place.

A swap device that has been configured as a dump device still acts as a swap device.
Dumps to non-swap devices (such as tapes or CDRWs, for example) are not supported at this time. A swap device is synonymous with a swap partition.

Several types of kernel crash dumps are available:

Full memory dumps - hold the complete contents of physical memory.

Minidumps - hold only memory pages in use by the kernel (&os; 6.2 and higher).

Textdumps - hold captured, scripted, or interactive debugger output (&os; 7.1 and higher).

Minidumps are the default dump type as of &os; 7.0, and in most cases will capture all necessary information present in a full memory dump, as most problems can be isolated using only kernel state.

Configuring the Dump Device

Before the kernel will dump the contents of its physical memory to a dump device, a dump device must be configured. A dump device is specified by using the &man.dumpon.8; command to tell the kernel where to save kernel crash dumps. The &man.dumpon.8; program must be called after the swap partition has been configured with &man.swapon.8;. This is normally handled by setting the dumpdev variable in &man.rc.conf.5; to the path of the swap device (the recommended way to extract a kernel dump) or AUTO to use the first configured swap device. The default for dumpdev is AUTO in HEAD, and changed to NO on RELENG_* branches (except for RELENG_7, which was left set to AUTO). On &os; 9.0-RELEASE and later versions, bsdinstall will ask whether crash dumps should be enabled on the target system during the install process.

Check /etc/fstab or &man.swapinfo.8; for a list of swap devices.

Make sure the dumpdir specified in &man.rc.conf.5; exists before a kernel crash!

    &prompt.root; mkdir /var/crash
    &prompt.root; chmod 700 /var/crash

Also, remember that the contents of /var/crash are sensitive and very likely contain confidential information such as passwords.

Extracting a Kernel Dump

Once a dump has been written to a dump device, the dump must be extracted before the swap device is mounted. To extract a dump from a dump device, use the &man.savecore.8; program. If dumpdev has been set in &man.rc.conf.5;, &man.savecore.8; will be called automatically on the first multi-user boot after the crash and before the swap device is mounted. The extracted core is placed in the directory given by the &man.rc.conf.5; value dumpdir, by default /var/crash, and will be named vmcore.0.

In the event that there is already a file called vmcore.0 in /var/crash (or whatever dumpdir is set to), the kernel will increment the trailing number for every crash to avoid overwriting an existing vmcore (e.g., vmcore.1). &man.savecore.8; will always create a symbolic link named vmcore.last in /var/crash after a dump is saved. This symbolic link can be used to locate the name of the most recent dump.

The &man.crashinfo.8; utility generates a text file containing a summary of information from a full memory dump or minidump. If dumpdev has been set in &man.rc.conf.5;, &man.crashinfo.8; will be invoked automatically after &man.savecore.8;. The output is saved to a file in dumpdir named core.txt.N.

If you are testing a new kernel but need to boot a different one in order to get your system up and running again, boot it only into single user mode using the appropriate flag at the boot prompt, and then perform the following steps:

    &prompt.root; fsck -p
    &prompt.root; mount -a -t ufs       # make sure /var/crash is writable
    &prompt.root; savecore /var/crash /dev/ad0s1b
    &prompt.root; exit                  # exit to multi-user

This instructs &man.savecore.8; to extract a kernel dump from /dev/ad0s1b and place the contents in /var/crash.
Do not forget to make sure the destination directory /var/crash has enough space for the dump. Also, do not forget to specify the correct path to your swap device as it is likely different from /dev/ad0s1b!

Testing Kernel Dump Configuration

The kernel includes a &man.sysctl.8; node that requests a kernel panic. This can be used to verify that your system is properly configured to save kernel crash dumps. You may wish to remount existing file systems as read-only in single user mode before triggering the crash to avoid data loss.

&prompt.root; shutdown now
...
Enter full pathname of shell or RETURN for /bin/sh:
&prompt.root; mount -a -u -r
&prompt.root; sysctl debug.kdb.panic=1
debug.kdb.panic:panic: kdb_sysctl_panic
...

After rebooting, your system should save a dump in /var/crash along with a matching summary from &man.crashinfo.8;.

Debugging a Kernel Crash Dump with kgdb

This section covers &man.kgdb.1;. The latest version is included in the devel/gdb port. An older version is also present in &os; 11 and earlier.

To enter the debugger and begin getting information from the dump, start kgdb:

&prompt.root; kgdb -n N

where N is the suffix of the vmcore.N to examine. To open the most recent dump use:

&prompt.root; kgdb -n last

Normally, &man.kgdb.1; should be able to locate the kernel running at the time the dump was generated. If it is not able to locate the correct kernel, pass the pathname of the kernel and dump as two arguments to kgdb:

&prompt.root; kgdb /boot/kernel/kernel /var/crash/vmcore.0

You can debug the crash dump using the kernel sources just like you can for any other program.

This dump is from a 5.2-BETA kernel and the crash comes from deep within the kernel. The output below has been modified to include line numbers on the left. This first trace inspects the instruction pointer and obtains a back trace. The address that is used on line 41 for the list command is the instruction pointer and can be found on line 17. Most developers will request having at least this information sent to them if you are unable to debug the problem yourself. If, however, you do solve the problem, make sure that your patch winds its way into the source tree via a problem report, mailing lists, or by being able to commit it!

1:&prompt.root; cd /usr/obj/usr/src/sys/KERNCONF
2:&prompt.root; kgdb kernel.debug /var/crash/vmcore.0
3:GNU gdb 5.2.1 (FreeBSD)
4:Copyright 2002 Free Software Foundation, Inc.
5:GDB is free software, covered by the GNU General Public License, and you are
6:welcome to change it and/or distribute copies of it under certain conditions.
7:Type "show copying" to see the conditions.
8:There is absolutely no warranty for GDB. Type "show warranty" for details.
9:This GDB was configured as "i386-undermydesk-freebsd"...
10:panic: page fault
11:panic messages:
12:---
13:Fatal trap 12: page fault while in kernel mode
14:cpuid = 0; apic id = 00
15:fault virtual address = 0x300
16:fault code = supervisor read, page not present
17:instruction pointer = 0x8:0xc0713860
18:stack pointer = 0x10:0xdc1d0b70
19:frame pointer = 0x10:0xdc1d0b7c
20:code segment = base 0x0, limit 0xfffff, type 0x1b
21: = DPL 0, pres 1, def32 1, gran 1
22:processor eflags = resume, IOPL = 0
23:current process = 14394 (uname)
24:trap number = 12
25:panic: page fault
26:cpuid = 0;
27:Stack backtrace:
28:
29:syncing disks, buffers remaining... 2199 2199 panic: mi_switch: switch in a critical section
30:cpuid = 0;
31:Uptime: 2h43m19s
32:Dumping 255 MB
33: 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240
34:---
35:Reading symbols from /boot/kernel/snd_maestro3.ko...done.
36:Loaded symbols for /boot/kernel/snd_maestro3.ko
37:Reading symbols from /boot/kernel/snd_pcm.ko...done.
38:Loaded symbols for /boot/kernel/snd_pcm.ko
39:#0 doadump () at /usr/src/sys/kern/kern_shutdown.c:240
40:240 dumping++;
41:(kgdb) list *0xc0713860
42:0xc0713860 is in lapic_ipi_wait (/usr/src/sys/i386/i386/local_apic.c:663).
43:658 incr = 0;
44:659 delay = 1;
45:660 } else
46:661 incr = 1;
47:662 for (x = 0; x < delay; x += incr) {
48:663 if ((lapic->icr_lo & APIC_DELSTAT_MASK) == APIC_DELSTAT_IDLE)
49:664 return (1);
50:665 ia32_pause();
51:666 }
52:667 return (0);
53:(kgdb) backtrace
54:#0 doadump () at /usr/src/sys/kern/kern_shutdown.c:240
55:#1 0xc055fd9b in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:372
56:#2 0xc056019d in panic () at /usr/src/sys/kern/kern_shutdown.c:550
57:#3 0xc0567ef5 in mi_switch () at /usr/src/sys/kern/kern_synch.c:470
58:#4 0xc055fa87 in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:312
59:#5 0xc056019d in panic () at /usr/src/sys/kern/kern_shutdown.c:550
60:#6 0xc0720c66 in trap_fatal (frame=0xdc1d0b30, eva=0)
61: at /usr/src/sys/i386/i386/trap.c:821
62:#7 0xc07202b3 in trap (frame=
63: {tf_fs = -1065484264, tf_es = -1065484272, tf_ds = -1065484272, tf_edi = 1, tf_esi = 0, tf_ebp = -602076292, tf_isp = -602076324, tf_ebx = 0, tf_edx = 0, tf_ecx = 1000000, tf_eax = 243, tf_trapno = 12, tf_err = 0, tf_eip = -1066321824, tf_cs = 8, tf_eflags = 65671, tf_esp = 243, tf_ss = 0})
64: at /usr/src/sys/i386/i386/trap.c:250
65:#8 0xc070c9f8 in calltrap () at {standard input}:94
66:#9 0xc07139f3 in lapic_ipi_vectored (vector=0, dest=0)
67: at /usr/src/sys/i386/i386/local_apic.c:733
68:#10 0xc0718b23 in ipi_selected (cpus=1, ipi=1)
69: at /usr/src/sys/i386/i386/mp_machdep.c:1115
70:#11 0xc057473e in kseq_notify (ke=0xcc05e360, cpu=0)
71: at /usr/src/sys/kern/sched_ule.c:520
72:#12 0xc0575cad in sched_add (td=0xcbcf5c80)
73: at /usr/src/sys/kern/sched_ule.c:1366
74:#13 0xc05666c6 in setrunqueue (td=0xcc05e360)
75: at /usr/src/sys/kern/kern_switch.c:422
76:#14 0xc05752f4 in sched_wakeup (td=0xcbcf5c80)
77: at /usr/src/sys/kern/sched_ule.c:999
78:#15 0xc056816c in setrunnable (td=0xcbcf5c80)
79: at /usr/src/sys/kern/kern_synch.c:570
80:#16 0xc0567d53 in wakeup (ident=0xcbcf5c80)
81: at /usr/src/sys/kern/kern_synch.c:411
82:#17 0xc05490a8 in exit1 (td=0xcbcf5b40, rv=0)
83: at /usr/src/sys/kern/kern_exit.c:509
84:#18 0xc0548011 in sys_exit () at /usr/src/sys/kern/kern_exit.c:102
85:#19 0xc0720fd0 in syscall (frame=
86: {tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 0, tf_esi = -1, tf_ebp = -1077940712, tf_isp = -602075788, tf_ebx = 672411944, tf_edx = 10, tf_ecx = 672411600, tf_eax = 1, tf_trapno = 12, tf_err = 2, tf_eip = 671899563, tf_cs = 31, tf_eflags = 642, tf_esp = -1077940740, tf_ss = 47})
87: at /usr/src/sys/i386/i386/trap.c:1010
88:#20 0xc070ca4d in Xint0x80_syscall () at {standard input}:136
89:---Can't read userspace from dump, or kernel process---
90:(kgdb) quit

If your system is crashing regularly and you are running out of disk space, deleting old vmcore files in /var/crash could save a considerable amount of disk space!

On-Line Kernel Debugging Using DDB

While kgdb as an off-line debugger provides a very high level of user interface, there are some things it cannot do.
The most important of these are breakpointing and single-stepping kernel code.

If you need to do low-level debugging on your kernel, there is an on-line debugger available called DDB. It allows setting of breakpoints, single-stepping kernel functions, examining and changing kernel variables, etc. However, it cannot access kernel source files, and only has access to the global and static symbols, not to the full debug information like kgdb does.

To configure your kernel to include DDB, add the lines

options KDB
options DDB

to your config file, and rebuild. (See The FreeBSD Handbook for details on configuring the FreeBSD kernel).

Once your DDB kernel is running, there are several ways to enter DDB. The first, and earliest, way is to use the -d boot flag. The kernel will start up in debug mode and enter DDB prior to any device probing. Hence you can even debug the device probe/attach functions. To use this, exit the loader's boot menu and enter boot -d at the loader prompt.

The second scenario is to drop to the debugger once the system has booted. There are two simple ways to accomplish this. If you would like to break to the debugger from the command prompt, simply type the command:

&prompt.root; sysctl debug.kdb.enter=1

Alternatively, if you are at the system console, you may use a hot-key on the keyboard. The default break-to-debugger sequence is Ctrl+Alt+ESC. For syscons, this sequence can be remapped, and some of the distributed maps out there do this, so check to make sure you know the right sequence to use. There is an option available for serial consoles that allows the use of a serial line BREAK on the console line to enter DDB (options BREAK_TO_DEBUGGER in the kernel config file). It is not the default since there are a lot of serial adapters around that gratuitously generate a BREAK condition, for example when pulling the cable.

The third way is that any panic condition will branch to DDB if the kernel is configured to use it. For this reason, it is not wise to configure a kernel with DDB for a machine running unattended. To obtain the unattended functionality, add:

options KDB_UNATTENDED

to the kernel configuration file and rebuild/reinstall.

The DDB commands roughly resemble some gdb commands. The first thing you probably need to do is to set a breakpoint:

break function-name
break address

Numbers are taken as hexadecimal by default, but to make them distinct from symbol names, hexadecimal numbers starting with the letters a-f need to be preceded with 0x (this is optional for other numbers). Simple expressions are allowed, for example: function-name + 0x103.

To exit the debugger and continue execution, type:

continue

To get a stack trace of the current thread, use:

trace

To get a stack trace of an arbitrary thread, specify a process ID or thread ID as a second argument to trace.

If you want to remove a breakpoint, use

del
del address-expression

The first form will be accepted immediately after a breakpoint hit, and deletes the current breakpoint. The second form can remove any breakpoint, but you need to specify the exact address; this can be obtained from:

show b

or:

show break

To single-step the kernel, try:

s

This will step into functions, but you can make DDB trace them until the matching return statement is reached by:

n

This is different from gdb's next statement; it is like gdb's finish. Pressing n more than once will cause a continue.
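Tying these commands together, a short interactive exchange might look like the following. This is an illustrative sketch only; the function name is an arbitrary example, and the exact prompt output varies between releases:

db> break vfs_domount
db> continue
(the system runs until the breakpoint is hit)
db> trace
db> del
db> continue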
To examine data from memory, use (for example):

x/wx 0xf0133fe0,40
x/hd db_symtab_space
x/bc termbuf,10
x/s stringbuf

for word/halfword/byte access, and hexadecimal/decimal/character/string display. The number after the comma is the object count.

To display the next 0x10 items, simply use:

x ,10

Similarly, use

x/ia foofunc,10

to disassemble the first 0x10 instructions of foofunc, and display them along with their offset from the beginning of foofunc.

To modify memory, use the write command:

w/b termbuf 0xa 0xb 0
w/w 0xf0010030 0 0

The command modifier (b/h/w) specifies the size of the data to be written, the first following expression is the address to write to, and the remainder is interpreted as data to write to successive memory locations.

If you need to know the current registers, use:

show reg

Alternatively, you can display a single register value by e.g.

p $eax

and modify it by:

set $eax new-value

Should you need to call some kernel functions from DDB, simply say:

call func(arg1, arg2, ...)

The return value will be printed.

For a &man.ps.1; style summary of all running processes, use:

ps

Now you have examined why your kernel failed, and you wish to reboot. Remember that, depending on the severity of previous malfunctioning, not all parts of the kernel might still be working as expected. Perform one of the following actions to shut down and reboot your system:

panic: This will cause your kernel to dump core and reboot, so you can later analyze the core on a higher level with &man.kgdb.1;.

call boot(0): Might be a good way to cleanly shut down the running system, sync() all disks, and finally, in some cases, reboot. As long as the disk and filesystem interfaces of the kernel are not damaged, this could be a good way for an almost clean shutdown.

reset: This is the final way out of disaster and almost the same as hitting the Big Red Button.

If you need a short command summary, simply type:

help

It is highly recommended to have a printed copy of the &man.ddb.4; manual page ready for a debugging session. Remember that it is hard to read the on-line manual while single-stepping the kernel.

On-Line Kernel Debugging Using Remote GDB

This feature has been supported since FreeBSD 2.2, and it is actually a very neat one.

GDB has supported remote debugging for a long time. This is done using a very simple protocol along a serial line. Unlike the other methods described above, you will need two machines for doing this. One is the host providing the debugging environment, including all the sources, and a copy of the kernel binary with all the symbols in it; the other one is the target machine that simply runs a similar copy of the very same kernel (but stripped of the debugging information).

You should configure the kernel in question with config -g if building the traditional way. If building the new way, make sure that makeoptions DEBUG=-g is in the configuration. In both cases, include options GDB in the configuration, and compile it as usual. This gives a large binary, due to the debugging information. Copy this kernel to the target machine, strip the debugging symbols off with strip -x, and boot it using the -d boot option. Connect the serial line of the target machine that has "flags 080" set on its uart device to any serial line of the debugging host. See &man.uart.4; for information on how to set the flags on an uart device.
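With &man.uart.4;, the debug-port flag is normally set through a loader hint rather than a config-file line. A minimal sketch for /boot/device.hints, assuming the first uart is to be the debug port (the unit number is an example; adjust it for your hardware, and see &man.uart.4; for the authoritative flag values):

hint.uart.0.flags="0x80"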
Now, on the debugging machine, go to the compile directory of the target kernel, and start gdb:

&prompt.user; kgdb kernel
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.16 (i386-unknown-freebsd), Copyright 1996 Free Software Foundation, Inc...
(kgdb)

Initialize the remote debugging session (assuming the first serial port is being used) by:

(kgdb) target remote /dev/cuau0

Now, on the target host (the one that entered DDB right before even starting the device probe), type gdb at the DDB prompt:

Debugger("Boot flags requested debugger")
Stopped at      Debugger+0x35:  movb    $0, edata+0x51bc
db> gdb

DDB will respond with:

Next trap will enter GDB remote protocol mode

Every time you type gdb, the mode will be toggled between remote GDB and local DDB. In order to force a next trap immediately, simply type s (step). Your hosting GDB will now gain control over the target kernel:

Remote debugging using /dev/cuau0
Debugger (msg=0xf01b0383 "Boot flags requested debugger")
    at ../../i386/i386/db_interface.c:257
(kgdb)

You can use this session almost like any other GDB session, including full access to the source, running it in gud-mode inside an Emacs window (which gives you an automatic source code display in another Emacs window), etc.

Debugging a Console Driver

Since you need a console driver to run DDB on, things are more complicated if the console driver itself is failing. You might remember the use of a serial console (either with modified boot blocks, or by specifying -h at the Boot: prompt), and hook up a standard terminal onto your first serial port. DDB works on any configured console driver, including a serial console.

Debugging Deadlocks

You may experience so-called deadlocks, a situation where a system stops doing useful work. To provide a helpful bug report in this situation, use &man.ddb.4; as described in the previous section. Include the output of ps and trace for suspected processes in the report.

If possible, consider doing further investigation. The recipe below is especially useful if you suspect that a deadlock occurs in the VFS layer. Add these options to the kernel configuration file:

makeoptions DEBUG=-g
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options WITNESS_SKIPSPIN
options DEBUG_LOCKS
options DEBUG_VFS_LOCKS
options DIAGNOSTIC

When a deadlock occurs, in addition to the output of the ps command, provide information from the show pcpu, show allpcpu, show locks, show alllocks, show lockedvnods and alltrace commands. To obtain meaningful backtraces for threaded processes, use thread thread-id to switch to the thread stack, and do a backtrace with where.

Kernel debugging with Dcons

&man.dcons.4; is a very simple console driver that is not directly connected with any physical devices. It just reads and writes characters from and to a buffer in a kernel or loader. Due to its simple nature, it is very useful for kernel debugging, especially with a &firewire; device. Currently, &os; provides two ways to interact with the buffer from outside of the kernel using &man.dconschat.8;.

Dcons over &firewire;

Most &firewire; (IEEE1394) host controllers are based on the OHCI specification that supports physical access to the host memory. This means that once the host controller is initialized, we can access the host memory without the help of software (kernel). We can exploit this facility for interaction with &man.dcons.4;.
&man.dcons.4; provides functionality similar to that of a serial console. It emulates two serial ports, one for the console and DDB, the other for GDB. Since remote memory access is fully handled by the hardware, the &man.dcons.4; buffer is accessible even when the system crashes.

&firewire; devices are not limited to those integrated into motherboards. PCI cards exist for desktops, and a cardbus interface can be purchased for laptops.

Enabling &firewire; and Dcons support on the target machine

To enable &firewire; and Dcons support in the kernel of the target machine:

Make sure your kernel supports dcons, dcons_crom and firewire. Dcons should be statically linked with the kernel. For dcons_crom and firewire, modules should be OK.

Make sure physical DMA is enabled. You may need to add hw.firewire.phydma_enable=1 to /boot/loader.conf.

Add options for debugging.

Add dcons_gdb=1 in /boot/loader.conf if you use GDB over &firewire;.

Enable dcons in /etc/ttys.

Optionally, to force dcons to be the high-level console, add hw.firewire.dcons_crom.force_console=1 to loader.conf.

To enable &firewire; and Dcons support in &man.loader.8; on i386 or amd64, add LOADER_FIREWIRE_SUPPORT=YES in /etc/make.conf and rebuild &man.loader.8;:

&prompt.root; cd /sys/boot/i386 && make clean && make && make install

To enable &man.dcons.4; as an active low-level console, add boot_multicons="YES" to /boot/loader.conf.

Here are a few configuration examples. A sample kernel configuration file would contain:

device dcons
device dcons_crom
options KDB
options DDB
options GDB
options ALT_BREAK_TO_DEBUGGER

And a sample /boot/loader.conf would contain:

dcons_crom_load="YES"
dcons_gdb=1
boot_multicons="YES"
hw.firewire.phydma_enable=1
hw.firewire.dcons_crom.force_console=1

Enabling &firewire; and Dcons support on the host machine

To enable &firewire; support in the kernel on the host machine:

&prompt.root; kldload firewire

Find out the EUI64 (the unique 64 bit identifier) of the &firewire; host controller, and use &man.fwcontrol.8; or dmesg to find the EUI64 of the target machine.

Run &man.dconschat.8;, with:

&prompt.root; dconschat -e \# -br -G 12345 -t 00-11-22-33-44-55-66-77

The following key combinations can be used once &man.dconschat.8; is running:

~ .        Disconnect
~ Ctrl+B   ALT BREAK
~ Ctrl+R   RESET target
~ Ctrl+Z   Suspend dconschat

Attach remote GDB by starting &man.kgdb.1; with a remote debugging session:

kgdb -r :12345 kernel

Some general tips

Here are some general tips:

To take full advantage of the speed of &firewire;, disable other slow console drivers:

&prompt.root; conscontrol delete ttyd0      # serial console
&prompt.root; conscontrol delete consolectl # video/keyboard

There exists a GDB mode for &man.emacs.1;; this is what you will need to add to your .emacs:

(setq gud-gdba-command-name "kgdb -a -a -a -r :12345")
(setq gdb-many-windows t)
(xterm-mouse-mode 1)
M-x gdba

And for DDD (devel/ddd):

# remote serial protocol
LANG=C ddd --debugger kgdb -r :12345 kernel
# live core debug
LANG=C ddd --debugger kgdb kernel /dev/fwmem0.2

Dcons with KVM

We can directly read the &man.dcons.4; buffer via /dev/mem for live systems, and in the core dump for crashed systems. These give you similar output to dmesg -a, but the &man.dcons.4; buffer includes more information.
Using Dcons with KVM

To use &man.dcons.4; with KVM:

Dump a &man.dcons.4; buffer of a live system:

&prompt.root; dconschat -1

Dump a &man.dcons.4; buffer of a crash dump:

&prompt.root; dconschat -1 -M vmcore.XX

Live core debugging can be done via:

&prompt.root; fwcontrol -m target_eui64
&prompt.root; kgdb kernel /dev/fwmem0.2

Glossary of Kernel Options for Debugging

This section provides a brief glossary of compile-time kernel options used for debugging:

options KDB: compiles in the kernel debugger framework. Required for options DDB and options GDB. Little or no performance overhead. By default, the debugger will be entered on panic instead of an automatic reboot.

options KDB_UNATTENDED: change the default value of the debug.debugger_on_panic sysctl to 0, which controls whether the debugger is entered on panic. When options KDB is not compiled into the kernel, the behavior is to automatically reboot on panic; when it is compiled into the kernel, the default behavior is to drop into the debugger unless options KDB_UNATTENDED is compiled in. If you want to leave the kernel debugger compiled into the kernel but want the system to come back up unless you're on-hand to use the debugger for diagnostics, use this option.

options KDB_TRACE: change the default value of the debug.trace_on_panic sysctl to 1, which controls whether the debugger automatically prints a stack trace on panic. Especially if running with options KDB_UNATTENDED, this can be helpful to gather basic debugging information on the serial or firewire console while still rebooting to recover.

options DDB: compile in support for the console debugger, DDB. This interactive debugger runs on whatever the active low-level console of the system is, which includes the video console, serial console, or firewire console. It provides basic integrated debugging facilities, such as stack tracing, process and thread listing, dumping of lock state, VM state, file system state, and kernel memory management. DDB does not require software running on a second machine or being able to generate a core dump or full debugging kernel symbols, and provides detailed diagnostics of the kernel at run-time. Many bugs can be fully diagnosed using only DDB output. This option depends on options KDB.

options GDB: compile in support for the remote debugger, GDB, which can operate over serial cable or firewire. When the debugger is entered, GDB may be attached to inspect structure contents, generate stack traces, etc. Some kernel state is more awkward to access than in DDB, which is able to generate useful summaries of kernel state automatically, such as automatically walking lock debugging or kernel memory management structures, and a second machine running the debugger is required. On the other hand, GDB combines information from the kernel source and full debugging symbols, and is aware of full data structure definitions, local variables, and is scriptable. This option is not required to run GDB on a kernel core dump. This option depends on options KDB.

options BREAK_TO_DEBUGGER, options ALT_BREAK_TO_DEBUGGER: allow a break signal or alternative signal on the console to enter the debugger. If the system hangs without a panic, this is a useful way to reach the debugger. Due to the current kernel locking, a break signal generated on a serial console is significantly more reliable at getting into the debugger, and is generally recommended. This option has little or no performance impact.
options INVARIANTS: compile into the kernel a large number of run-time assertion checks and tests, which constantly test the integrity of kernel data structures and the invariants of kernel algorithms. These tests can be expensive, so are not compiled in by default, but help provide useful "fail stop" behavior, in which certain classes of undesired behavior enter the debugger before kernel data corruption occurs, making them easier to debug. Tests include memory scrubbing and use-after-free testing, which is one of the more significant sources of overhead. This option depends on options INVARIANT_SUPPORT.

options INVARIANT_SUPPORT: many of the tests present in options INVARIANTS require modified data structures or additional kernel symbols to be defined.

options WITNESS: this option enables run-time lock order tracking and verification, and is an invaluable tool for deadlock diagnosis. WITNESS maintains a graph of acquired lock orders by lock type, and checks the graph at each acquire for cycles (implicit or explicit). If a cycle is detected, a warning and stack trace are generated to the console, indicating that a potential deadlock might have occurred. WITNESS is required in order to use the show locks, show witness and show alllocks DDB commands. This debug option has significant performance overhead, which may be somewhat mitigated through the use of options WITNESS_SKIPSPIN. Detailed documentation may be found in &man.witness.4;.

options WITNESS_SKIPSPIN: disable run-time checking of spinlock lock order with WITNESS. As spin locks are acquired most frequently in the scheduler, and scheduler events occur often, this option can significantly speed up systems running with WITNESS. This option depends on options WITNESS.

options WITNESS_KDB: change the default value of the debug.witness.kdb sysctl to 1, which causes WITNESS to enter the debugger when a lock order violation is detected, rather than simply printing a warning. This option depends on options WITNESS.

options SOCKBUF_DEBUG: perform extensive run-time consistency checking on socket buffers, which can be useful for debugging both socket bugs and race conditions in protocols and device drivers that interact with sockets. This option significantly impacts network performance, and may change the timing in device driver races.

options DEBUG_VFS_LOCKS: track lock acquisition points for lockmgr/vnode locks, expanding the amount of information displayed by show lockedvnods in DDB. This option has a measurable performance impact.

options DEBUG_MEMGUARD: a replacement for the &man.malloc.9; kernel memory allocator that uses the VM system to detect reads or writes from allocated memory after free. Details may be found in &man.memguard.9;. This option has a significant performance impact, but can be very helpful in debugging kernel memory corruption bugs.

options DIAGNOSTIC: enable additional, more expensive diagnostic tests along the lines of options INVARIANTS.

diff --git a/en_US.ISO8859-1/books/developers-handbook/secure/chapter.xml b/en_US.ISO8859-1/books/developers-handbook/secure/chapter.xml
index 8e00147c82..02b062f20c 100644
--- a/en_US.ISO8859-1/books/developers-handbook/secure/chapter.xml
+++ b/en_US.ISO8859-1/books/developers-handbook/secure/chapter.xml
@@ -1,500 +1,500 @@

Secure Programming

Murray Stokely Contributed by

Synopsis

This chapter describes some of the security issues that have plagued &unix; programmers for decades and some of the new tools available to help programmers avoid writing exploitable code.
Secure Design Methodology

Writing secure applications takes a very scrutinous and pessimistic outlook on life. Applications should be run with the principle of least privilege so that no process is ever running with more than the bare minimum access that it needs to accomplish its function. Previously tested code should be reused whenever possible to avoid common mistakes that others may have already fixed.

One of the pitfalls of the &unix; environment is how easy it is to make assumptions about the sanity of the environment. Applications should never trust user input (in all its forms), system resources, inter-process communication, or the timing of events. &unix; processes do not execute synchronously, so logical operations are rarely atomic.

Buffer Overflows

buffer overflow von Neumann

Buffer Overflows have been around since the very beginnings of the von Neumann architecture. They first gained widespread notoriety in 1988 with the Morris Internet worm. Unfortunately, the same basic attack remains effective today.

Morris Internet worm stack arguments

By far the most common type of buffer overflow attack is based on corrupting the stack. Most modern computer systems use a stack to pass arguments to procedures and to store local variables. A stack is a last in first out (LIFO) buffer in the high memory area of a process image. When a program invokes a function, a new "stack frame" is created. This stack frame consists of the arguments passed to the function as well as a dynamic amount of local variable space.

LIFO process image stack pointer

The "stack pointer" is a register that holds the current location of the top of the stack. Since this value is constantly changing as new values are pushed onto the top of the stack, many implementations also provide a "frame pointer" that is located near the beginning of a stack frame so that local variables can more easily be addressed relative to this value.

frame pointer return address stack-overflow

The return address for function calls is also stored on the stack, and this is the cause of stack-overflow exploits, since overflowing a local variable in a function can overwrite the return address of that function, potentially allowing a malicious user to execute any code he or she wants.

Although stack-based attacks are by far the most common, it would also be possible to overrun the stack with a heap-based (malloc/free) attack.

The C programming language does not perform automatic bounds checking on arrays or pointers as many other languages do. In addition, the standard C library is filled with a handful of very dangerous functions.

strcpy(char *dest, const char *src)              May overflow the dest buffer
strcat(char *dest, const char *src)              May overflow the dest buffer
getwd(char *buf)                                 May overflow the buf buffer
gets(char *s)                                    May overflow the s buffer
[vf]scanf(const char *format, ...)               May overflow its arguments
realpath(char *path, char resolved_path[])       May overflow the path buffer
[v]sprintf(char *str, const char *format, ...)   May overflow the str buffer

Example Buffer Overflow

The following example code contains a buffer overflow designed to overwrite the return address and skip the instruction immediately following the function call.
(Inspired by )

#include <stdio.h>
#include <string.h>	/* needed for strcpy(); missing from the original listing */

void manipulate(char *buffer) {
  char newbuffer[80];
  strcpy(newbuffer,buffer);
}

int main() {
  char ch,buffer[4096];
  int i=0;

  while ((buffer[i++] = getchar()) != '\n') {};

  i=1;
  manipulate(buffer);
  i=2;
  printf("The value of i is : %d\n",i);
  return 0;
}

Let us examine what the memory image of this process would look like if we were to input 160 spaces into our little program before hitting return.

[XXX figure here!]

Obviously more malicious input can be devised to execute actual compiled instructions (such as exec(/bin/sh)).

Avoiding Buffer Overflows

The most straightforward solution to the problem of stack-overflows is to always use length restricted memory and string copy functions. strncpy and strncat are part of the standard C library.

string copy functions strncpy strncat

These functions accept a length value as a parameter which should be no larger than the size of the destination buffer. These functions will then copy up to `length' bytes from the source to the destination. However there are a number of problems with these functions. Neither function guarantees NUL termination if the size of the input buffer is as large as the destination. The length parameter is also used inconsistently between strncpy and strncat, so it is easy for programmers to get confused as to their proper usage. There is also a significant performance loss compared to strcpy when copying a short string into a large buffer, since strncpy NUL fills up to the size specified.

NUL termination

Another memory copy implementation exists to get around these problems. The strlcpy and strlcat functions guarantee that they will always null terminate the destination string when given a non-zero length argument.

string copy functions strlcpy strlcat

Compiler based run-time bounds checking

bounds checking compiler-based

Unfortunately there is still a very large assortment of code in public use which blindly copies memory around without using any of the bounded copy routines we just discussed. Fortunately, there is a way to help prevent such attacks — run-time bounds checking, which is implemented by several C/C++ compilers.

ProPolice StackGuard gcc

ProPolice is one such compiler feature, and is integrated into &man.gcc.1; versions 4.1 and later. It replaces and extends the earlier StackGuard &man.gcc.1; extension. ProPolice helps to protect against stack-based buffer overflows and other attacks by laying pseudo-random numbers in key areas of the stack before calling any function. When a function returns, these canaries are checked and if they are found to have been changed the executable is immediately aborted. Thus any attempt to modify the return address or other variable stored on the stack in an attempt to get malicious code to run is unlikely to succeed, as the attacker would have to also manage to leave the pseudo-random canaries untouched.

buffer overflow

Recompiling your application with ProPolice is an effective means of stopping most buffer-overflow attacks, but it can still be compromised.

Library based run-time bounds checking

bounds checking library-based

Compiler-based mechanisms are completely useless for binary-only software for which you cannot recompile. For these situations there are a number of libraries which re-implement the unsafe functions of the C library (strcpy, fscanf, getwd, etc.) and ensure that these functions can never write past the stack pointer.
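Before turning to the shortcomings of these library-based defenses, it is worth seeing how the vulnerable manipulate() function from the earlier example looks when rewritten with the bounded functions discussed above. This is a minimal sketch using strlcpy, which is part of &os;'s libc; the truncation handling is left as a stub:

#include <string.h>

void manipulate(const char *buffer) {
  char newbuffer[80];

  /* Copies at most sizeof(newbuffer) - 1 bytes and always NUL terminates. */
  if (strlcpy(newbuffer, buffer, sizeof(newbuffer)) >= sizeof(newbuffer)) {
    /* The input did not fit and was truncated; handle as appropriate. */
  }
}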
libsafe libverify libparanoia

Unfortunately these library-based defenses have a number of shortcomings. These libraries only protect against a very small set of security related issues and they neglect to fix the actual problem. These defenses may fail if the application was compiled with -fomit-frame-pointer. Also, the LD_PRELOAD and LD_LIBRARY_PATH environment variables can be overwritten/unset by the user.

SetUID issues

seteuid

There are at least 6 different IDs associated with any given process, and you must therefore be very careful with the access that your process has at any given time. In particular, all seteuid applications should give up their privileges as soon as they are no longer required.

user IDs real user ID effective user ID

The real user ID can only be changed by a superuser process. The login program sets this when a user initially logs in and it is seldom changed.

The effective user ID is set by the exec() functions if a program has its seteuid bit set. An application can call seteuid() at any time to set the effective user ID to either the real user ID or the saved set-user-ID. When the effective user ID is set by exec() functions, the previous value is saved in the saved set-user-ID.

Limiting your program's environment

chroot()

The traditional method of restricting a process is with the chroot() system call. This system call changes the root directory from which all other paths are referenced for a process and any child processes. For this call to succeed the process must have execute (search) permission on the directory being referenced. The new environment does not actually take effect until you chdir() into your new environment. It should also be noted that a process can easily break out of a chroot environment if it has root privilege. This could be accomplished by creating device nodes to read kernel memory, attaching a debugger to a process outside of the &man.chroot.8; environment, or in many other creative ways.

The behavior of the chroot() system call can be controlled somewhat with the kern.chroot_allow_open_directories sysctl variable. When this value is set to 0, chroot() will fail with EPERM if there are any directories open. If set to the default value of 1, then chroot() will fail with EPERM if there are any directories open and the process is already subject to a chroot() call. For any other value, the check for open directories will be bypassed completely.

FreeBSD's jail functionality

jail

The concept of a Jail extends upon the chroot() by limiting the powers of the superuser to create a true `virtual server'. Once a prison is set up all network communication must take place through the specified IP address, and the power of "root privilege" in this jail is severely constrained.

While in a prison, any tests of superuser power within the kernel using the suser() call will fail. However, some calls to suser() have been changed to a new interface suser_xxx(). This function is responsible for recognizing or denying access to superuser power for imprisoned processes.

A superuser process within a jailed environment has the power to:

Manipulate credentials with setuid, seteuid, setgid, setegid, setgroups, setreuid, setregid, setlogin

Set resource limits with setrlimit

Modify some sysctl nodes (kern.hostname)

chroot()

Set flags on a vnode: chflags, fchflags

Set attributes of a vnode such as file permission, owner, group, size, access time, and modification time.
Bind to privileged ports in the Internet domain (ports < 1024)

Jail is a very useful tool for running applications in a secure environment but it does have some shortcomings. Currently, the IPC mechanisms have not been converted to the suser_xxx() interface, so applications such as MySQL cannot be run within a jail. Superuser access may have a very limited meaning within a jail, but there is no way to specify exactly what "very limited" means.

&posix;.1e Process Capabilities

POSIX.1e Process Capabilities TrustedBSD

&posix; has released a working draft that adds event auditing, access control lists, fine grained privileges, information labeling, and mandatory access control. This is a work in progress and is the focus of the TrustedBSD project. Some of the initial work has been committed to &os.current; (cap_set_proc(3)).

Trust

An application should never assume that anything about the user's environment is sane. This includes (but is certainly not limited to): user input, signals, environment variables, resources, IPC, mmaps, the filesystem working directory, file descriptors, the number of open files, etc.

positive filtering data validation

You should never assume that you can catch all forms of invalid input that a user might supply. Instead, your application should use positive filtering to only allow a specific subset of inputs that you deem safe. Improper data validation has been the cause of many exploits, especially with CGI scripts on the world wide web. For filenames you need to be extra careful about paths ("../", "/"), symbolic links, and shell escape characters.

Perl Taint mode

Perl has a really cool feature called "Taint" mode which can be used to prevent scripts from using data derived outside the program in an unsafe way. This mode will check command line arguments, environment variables, locale information, the results of certain syscalls (readdir(), readlink(), getpwxxx()), and all file input.

Race Conditions

A race condition is anomalous behavior caused by the unexpected dependence on the relative timing of events. In other words, a programmer incorrectly assumed that a particular event would always happen before another.

race conditions signals access checks file opens

Some of the common causes of race conditions are signals, access checks, and file opens. Signals are asynchronous events by nature so special care must be taken in dealing with them. Checking access with access(2) then open(2) is clearly non-atomic. Users can move files in between the two calls. Instead, privileged applications should seteuid() and then call open() directly. Along the same lines, an application should always set a proper umask before open() to obviate the need for spurious chmod() calls.

diff --git a/en_US.ISO8859-1/books/developers-handbook/sockets/chapter.xml b/en_US.ISO8859-1/books/developers-handbook/sockets/chapter.xml
index 6173905a07..43b32f9b78 100644
--- a/en_US.ISO8859-1/books/developers-handbook/sockets/chapter.xml
+++ b/en_US.ISO8859-1/books/developers-handbook/sockets/chapter.xml
@@ -1,1748 +1,1748 @@

Sockets

G. Adam Stanislav Contributed by

Synopsis

BSD sockets take interprocess communications to a new level. It is no longer necessary for the communicating processes to run on the same machine. They still can, but they do not have to. Not only do these processes not have to run on the same machine, they do not have to run under the same operating system.
Thanks to BSD sockets, your FreeBSD software can smoothly cooperate with a program running on a &macintosh;, another one running on a &sun; workstation, yet another one running under &windows; 2000, all connected with an Ethernet-based local area network.

But your software can equally well cooperate with processes running in another building, or on another continent, inside a submarine, or a space shuttle. It can also cooperate with processes that are not part of a computer (at least not in the strict sense of the word), but of such devices as printers, digital cameras, medical equipment. Just about anything capable of digital communications.

Networking and Diversity

We have already hinted at the diversity of networking. Many different systems have to talk to each other. And they have to speak the same language. They also have to understand the same language the same way.

People often think that body language is universal. But it is not. Back in my early teens, my father took me to Bulgaria. We were sitting at a table in a park in Sofia, when a vendor approached us trying to sell us some roasted almonds.

I had not learned much Bulgarian by then, so, instead of saying no, I shook my head from side to side, the universal body language for no. The vendor quickly started serving us some almonds.

I then remembered I had been told that in Bulgaria shaking your head sideways meant yes. Quickly, I started nodding my head up and down. The vendor noticed, took his almonds, and walked away. To an uninformed observer, I did not change the body language: I continued using the language of shaking and nodding my head. What changed was the meaning of the body language. At first, the vendor and I interpreted the same language as having completely different meaning. I had to adjust my own interpretation of that language so the vendor would understand.

It is the same with computers: The same symbols may have different, even outright opposite meaning. Therefore, for two computers to understand each other, they must not only agree on the same language, but on the same interpretation of the language.

Protocols

While various programming languages tend to have complex syntax and use a number of multi-letter reserved words (which makes them easy for the human programmer to understand), the languages of data communications tend to be very terse. Instead of multi-byte words, they often use individual bits. There is a very convincing reason for it: While data travels inside your computer at speeds approaching the speed of light, it often travels considerably slower between two computers.

As the languages used in data communications are so terse, we usually refer to them as protocols rather than languages.

As data travels from one computer to another, it always uses more than one protocol. These protocols are layered. The data can be compared to the inside of an onion: You have to peel off several layers of skin to get to the data. This is best illustrated with a picture:

+----------------+
|    Ethernet    |
|+--------------+|
||      IP      ||
||+------------+||
|||    TCP     |||
|||+----------+|||
||||   HTTP   ||||
||||+--------+||||
|||||  PNG   |||||
|||||+------+|||||
|||||| Data ||||||
|||||+------+|||||
||||+--------+||||
|||+----------+|||
||+------------+||
|+--------------+|
+----------------+

Protocol Layers

In this example, we are trying to get an image from a web page we are connected to via an Ethernet.
The image consists of raw data, which is simply a sequence of RGB values that our software can process, i.e., convert into an image and display on our monitor. Alas, our software has no way of knowing how the raw data is organized: Is it a sequence of RGB values, or a sequence of grayscale intensities, or perhaps of CMYK encoded colors? Is the data represented by 8-bit quanta, or are they 16 bits in size, or perhaps 4 bits? How many rows and columns does the image consist of? Should certain pixels be transparent? I think you get the picture... To inform our software how to handle the raw data, it is encoded as a PNG file. It could be a GIF, or a JPEG, but it is a PNG. And PNG is a protocol. At this point, I can hear some of you yelling, No, it is not! It is a file format! Well, of course it is a file format. But from the perspective of data communications, a file format is a protocol: The file structure is a language, a terse one at that, communicating to our process how the data is organized. Ergo, it is a protocol. Alas, if all we received was the PNG file, our software would be facing a serious problem: How is it supposed to know the data is representing an image, as opposed to some text, or perhaps a sound, or what not? Secondly, how is it supposed to know the image is in the PNG format as opposed to GIF, or JPEG, or some other image format? To obtain that information, we are using another protocol: HTTP. This protocol can tell us exactly that the data represents an image, and that it uses the PNG protocol. It can also tell us some other things, but let us stay focused on protocol layers here. So, now we have some data wrapped in the PNG protocol, wrapped in the HTTP protocol. How did we get it from the server? By using TCP/IP over Ethernet, that is how. Indeed, that is three more protocols. Instead of continuing inside out, I am now going to talk about Ethernet, simply because it is easier to explain the rest that way. Ethernet is an interesting system of connecting computers in a local area network (LAN). Each computer has a network interface card (NIC), which has a unique 48-bit ID called its address. No two Ethernet NICs in the world have the same address. These NICs are all connected with each other. Whenever one computer wants to communicate with another in the same Ethernet LAN, it sends a message over the network. Every NIC sees the message. But as part of the Ethernet protocol, the data contains the address of the destination NIC (among other things). So, only one of all the network interface cards will pay attention to it, the rest will ignore it. But not all computers are connected to the same network. Just because we have received the data over our Ethernet does not mean it originated in our own local area network. It could have come to us from some other network (which may not even be Ethernet based) connected with our own network via the Internet. All data is transferred over the Internet using IP, which stands for Internet Protocol. Its basic role is to let us know where in the world the data has arrived from, and where it is supposed to go to. It does not guarantee we will receive the data, only that we will know where it came from if we do receive it. Even if we do receive the data, IP does not guarantee we will receive various chunks of data in the same order the other computer has sent it to us. So, we can receive the center of our image before we receive the upper left corner and after the lower right, for example. 
It is TCP (Transmission Control Protocol) that asks the sender to resend any lost data and that places it all into the proper order.

All in all, it took five different protocols for one computer to communicate to another what an image looks like. We received the data wrapped into the PNG protocol, which was wrapped into the HTTP protocol, which was wrapped into the TCP protocol, which was wrapped into the IP protocol, which was wrapped into the Ethernet protocol.

Oh, and by the way, there probably were several other protocols involved somewhere on the way. For example, if our LAN was connected to the Internet through a dial-up call, it used the PPP protocol over the modem which used one (or several) of the various modem protocols, et cetera, et cetera, et cetera...

As a developer you should be asking by now, How am I supposed to handle it all?

Luckily for you, you are not supposed to handle it all. You are supposed to handle some of it, but not all of it. Specifically, you need not worry about the physical connection (in our case Ethernet and possibly PPP, etc). Nor do you need to handle the Internet Protocol, or the Transmission Control Protocol.

In other words, you do not have to do anything to receive the data from the other computer. Well, you do have to ask for it, but that is almost as simple as opening a file.

Once you have received the data, it is up to you to figure out what to do with it. In our case, you would need to understand the HTTP protocol and the PNG file structure.

To use an analogy, all the internetworking protocols become a gray area: Not so much because we do not understand how it works, but because we are no longer concerned about it. The sockets interface takes care of this gray area for us:

+----------------+
|xxxxEthernetxxxx|
|+--------------+|
||xxxxxxIPxxxxxx||
||+------------+||
|||xxxxxTCPxxxx|||
|||+----------+|||
||||   HTTP   ||||
||||+--------+||||
|||||  PNG   |||||
|||||+------+|||||
|||||| Data ||||||
|||||+------+|||||
||||+--------+||||
|||+----------+|||
||+------------+||
|+--------------+|
+----------------+

Sockets Covered Protocol Layers

We only need to understand any protocols that tell us how to interpret the data, not how to receive it from another process, nor how to send it to another process.

The Sockets Model

BSD sockets are built on the basic &unix; model: Everything is a file. In our example, then, sockets would let us receive an HTTP file, so to speak. It would then be up to us to extract the PNG file from it.

Due to the complexity of internetworking, we cannot just use the open system call, or the open() C function. Instead, we need to take several steps to open a socket.

Once we do, however, we can start treating the socket the same way we treat any file descriptor: We can read from it, write to it, pipe it, and, eventually, close it.

Essential Socket Functions

While FreeBSD offers different functions to work with sockets, we only need four to open a socket. And in some cases we only need two.

The Client-Server Difference

Typically, one of the ends of a socket-based data communication is a server, the other is a client.

The Common Elements

socket

The one function used by both clients and servers is &man.socket.2;. It is declared this way:

int socket(int domain, int type, int protocol);

The return value is of the same type as that of open, an integer. FreeBSD allocates its value from the same pool as that of file handles.
That is what allows sockets to be treated the same way as files.

The domain argument tells the system what protocol family you want it to use. Many of them exist, some are vendor specific, others are very common. They are declared in sys/socket.h.

Use PF_INET for UDP, TCP and other Internet protocols (IPv4).

Five values are defined for the type argument, again, in sys/socket.h. All of them start with SOCK_. The most common one is SOCK_STREAM, which tells the system you are asking for a reliable stream delivery service (which is TCP when used with PF_INET). If you asked for SOCK_DGRAM, you would be requesting a connectionless datagram delivery service (in our case, UDP). If you wanted to be in charge of the low-level protocols (such as IP), or even network interfaces (e.g., the Ethernet), you would need to specify SOCK_RAW.

Finally, the protocol argument depends on the previous two arguments, and is not always meaningful. In that case, use 0 for its value.

The Unconnected Socket

Nowhere, in the socket function have we specified to what other system we should be connected. Our newly created socket remains unconnected. This is on purpose: To use a telephone analogy, we have just attached a modem to the phone line. We have neither told the modem to make a call, nor to answer if the phone rings.

sockaddr

Various functions of the sockets family expect the address of (or pointer to, to use C terminology) a small area of the memory. The various C declarations in the sys/socket.h refer to it as struct sockaddr. This structure is declared in the same file:

/*
 * Structure used by kernel to store most
 * addresses.
 */
struct sockaddr {
	unsigned char	sa_len;		/* total length */
	sa_family_t	sa_family;	/* address family */
	char		sa_data[14];	/* actually longer; address value */
};
#define	SOCK_MAXADDRLEN	255		/* longest possible addresses */

Please note the vagueness with which the sa_data field is declared, just as an array of 14 bytes, with the comment hinting there can be more than 14 of them. This vagueness is quite deliberate. Sockets is a very powerful interface. While most people perhaps think of it as nothing more than the Internet interface—and most applications probably use it for that nowadays—sockets can be used for just about any kind of interprocess communications, of which the Internet (or, more precisely, IP) is only one.

The sys/socket.h refers to the various types of protocols sockets will handle as address families, and lists them right before the definition of sockaddr:

/*
 * Address families.
 */
#define	AF_UNSPEC	0		/* unspecified */
#define	AF_LOCAL	1		/* local to host (pipes, portals) */
#define	AF_UNIX		AF_LOCAL	/* backward compatibility */
#define	AF_INET		2		/* internetwork: UDP, TCP, etc. */
#define	AF_IMPLINK	3		/* arpanet imp addresses */
#define	AF_PUP		4		/* pup protocols: e.g. BSP */
#define	AF_CHAOS	5		/* mit CHAOS protocols */
#define	AF_NS		6		/* XEROX NS protocols */
#define	AF_ISO		7		/* ISO protocols */
#define	AF_OSI		AF_ISO
#define	AF_ECMA		8		/* European computer manufacturers */
#define	AF_DATAKIT	9		/* datakit protocols */
#define	AF_CCITT	10		/* CCITT protocols, X.25 etc */
#define	AF_SNA		11		/* IBM SNA */
#define	AF_DECnet	12		/* DECnet */
#define	AF_DLI		13		/* DEC Direct data link interface */
#define	AF_LAT		14		/* LAT */
#define	AF_HYLINK	15		/* NSC Hyperchannel */
#define	AF_APPLETALK	16		/* Apple Talk */
#define	AF_ROUTE	17		/* Internal Routing Protocol */
#define	AF_LINK		18		/* Link layer interface */
#define	pseudo_AF_XTP	19		/* eXpress Transfer Protocol (no AF) */
#define	AF_COIP		20		/* connection-oriented IP, aka ST II */
#define	AF_CNT		21		/* Computer Network Technology */
#define	pseudo_AF_RTIP	22		/* Help Identify RTIP packets */
#define	AF_IPX		23		/* Novell Internet Protocol */
#define	AF_SIP		24		/* Simple Internet Protocol */
#define	pseudo_AF_PIP	25		/* Help Identify PIP packets */
#define	AF_ISDN		26		/* Integrated Services Digital Network*/
#define	AF_E164		AF_ISDN		/* CCITT E.164 recommendation */
#define	pseudo_AF_KEY	27		/* Internal key-management function */
#define	AF_INET6	28		/* IPv6 */
#define	AF_NATM		29		/* native ATM access */
#define	AF_ATM		30		/* ATM */
#define	pseudo_AF_HDRCMPLT 31		/* Used by BPF to not rewrite headers
					 * in interface output routine */
#define	AF_NETGRAPH	32		/* Netgraph sockets */
#define	AF_SLOW		33		/* 802.3ad slow protocol */
#define	AF_SCLUSTER	34		/* Sitara cluster protocol */
#define	AF_ARP		35
#define	AF_BLUETOOTH	36		/* Bluetooth sockets */
#define	AF_MAX		37

The one used for IP is AF_INET. It is a symbol for the constant 2.

It is the address family listed in the sa_family field of sockaddr that decides how exactly the vaguely named bytes of sa_data will be used.

Specifically, whenever the address family is AF_INET, we can use struct sockaddr_in found in netinet/in.h, wherever sockaddr is expected:

/*
 * Socket address, internet style.
 */
struct sockaddr_in {
	uint8_t		sin_len;
	sa_family_t	sin_family;
	in_port_t	sin_port;
	struct in_addr	sin_addr;
	char		sin_zero[8];
};

We can visualize its organization this way:

      0        1        2        3
   +--------+--------+-----------------+
 0 |    0   | Family |       Port      |
   +--------+--------+-----------------+
 4 |             IP Address            |
   +-----------------------------------+
 8 |                 0                 |
   +-----------------------------------+
12 |                 0                 |
   +-----------------------------------+

sockaddr_in

The three important fields are sin_family, which is byte 1 of the structure, sin_port, a 16-bit value found in bytes 2 and 3, and sin_addr, a 32-bit integer representation of the IP address, stored in bytes 4-7.

Now, let us try to fill it out. Let us assume we are trying to write a client for the daytime protocol, which simply states that its server will write a text string representing the current date and time to port 13. We want to use TCP/IP, so we need to specify AF_INET in the address family field. AF_INET is defined as 2. Let us use the IP address of 192.43.244.18, which is the time server of US federal government (time.nist.gov).
      0        1        2        3
   +--------+--------+-----------------+
 0 |    0   |    2   |        13       |
   +-----------------+-----------------+
 4 |           192.43.244.18           |
   +-----------------------------------+
 8 |                 0                 |
   +-----------------------------------+
12 |                 0                 |
   +-----------------------------------+

Specific example of sockaddr_in

By the way, the sin_addr field is declared as being of the struct in_addr type, which is defined in netinet/in.h:

/*
 * Internet address (a structure for historical reasons)
 */
struct in_addr {
	in_addr_t s_addr;
};

Here, in_addr_t is a 32-bit integer.

The 192.43.244.18 is just a convenient notation of expressing a 32-bit integer by listing all of its 8-bit bytes, starting with the most significant one.

So far, we have viewed sockaddr as an abstraction. Our computer does not store short integers as a single 16-bit entity, but as a sequence of 2 bytes. Similarly, it stores 32-bit integers as a sequence of 4 bytes.

Suppose we coded something like this:

sa.sin_family = AF_INET;
sa.sin_port = 13;
sa.sin_addr.s_addr = (((((192 << 8) | 43) << 8) | 244) << 8) | 18;

What would the result look like? Well, that depends, of course. On a &pentium;, or other x86, based computer, it would look like this:

      0        1        2        3
   +--------+--------+--------+--------+
 0 |    0   |    2   |   13   |    0   |
   +--------+--------+--------+--------+
 4 |   18   |   244  |   43   |   192  |
   +-----------------------------------+
 8 |                 0                 |
   +-----------------------------------+
12 |                 0                 |
   +-----------------------------------+

sockaddr_in on an Intel system

On a different system, it might look like this:

      0        1        2        3
   +--------+--------+--------+--------+
 0 |    0   |    2   |    0   |   13   |
   +--------+--------+--------+--------+
 4 |   192  |   43   |   244  |   18   |
   +-----------------------------------+
 8 |                 0                 |
   +-----------------------------------+
12 |                 0                 |
   +-----------------------------------+

sockaddr_in on an MSB system

And on a PDP it might look different yet. But the above two are the most common ways in use today.

Ordinarily, wanting to write portable code, programmers pretend that these differences do not exist. And they get away with it (except when they code in assembly language). Alas, you cannot get away with it that easily when coding for sockets.

Why? Because when communicating with another computer, you usually do not know whether it stores data most significant byte (MSB) or least significant byte (LSB) first.

You might be wondering, So, will sockets not handle it for me?

It will not.

While that answer may surprise you at first, remember that the general sockets interface only understands the sa_len and sa_family fields of the sockaddr structure. You do not have to worry about the byte order there (of course, on FreeBSD sa_family is only 1 byte anyway, but many other &unix; systems do not have sa_len and use 2 bytes for sa_family, and expect the data in whatever order is native to the computer). But the rest of the data is just sa_data[14] as far as sockets goes. Depending on the address family, sockets just forwards that data to its destination.

Indeed, when we enter a port number, it is because we want the other computer to know what service we are asking for. And, when we are the server, we read the port number so we know what service the other computer is expecting from us. Either way, sockets only has to forward the port number as data. It does not interpret it in any way.

Similarly, we enter the IP address to tell everyone on the way where to send our data to. Sockets, again, only forwards it as data.
That is why we (the programmers, not the sockets) have to distinguish between the byte order used by our computer and a conventional byte order to send the data in to the other computer.

We will call the byte order our computer uses the host byte order, or just the host order.

There is a convention of sending the multi-byte data over IP MSB first. This, we will refer to as the network byte order, or simply the network order.

Now, if we compiled the above code for an Intel based computer, our host byte order would produce:

      0        1        2        3
    +--------+--------+--------+--------+
  0 |   0    |   2    |   13   |   0    |
    +--------+--------+--------+--------+
  4 |   18   |  244   |   43   |  192   |
    +-----------------------------------+
  8 |                 0                 |
    +-----------------------------------+
 12 |                 0                 |
    +-----------------------------------+

Host byte order on an Intel system

But the network byte order requires that we store the data MSB first:

      0        1        2        3
    +--------+--------+--------+--------+
  0 |   0    |   2    |   0    |   13   |
    +--------+--------+--------+--------+
  4 |  192   |   43   |  244   |   18   |
    +-----------------------------------+
  8 |                 0                 |
    +-----------------------------------+
 12 |                 0                 |
    +-----------------------------------+

Network byte order

Unfortunately, our host order is the exact opposite of the network order. We have several ways of dealing with it. One would be to reverse the values in our code:

sa.sin_family      = AF_INET;
sa.sin_port        = 13 << 8;
sa.sin_addr.s_addr = (((((18 << 8) | 244) << 8) | 43) << 8) | 192;

This will trick our compiler into storing the data in the network byte order. In some cases, this is exactly the way to do it (e.g., when programming in assembly language). In most cases, however, it can cause a problem.

Suppose you wrote a sockets-based program in C. You know it is going to run on a &pentium;, so you enter all your constants in reverse and force them to the network byte order. It works well. Then, some day, your trusted old &pentium; becomes a rusty old &pentium;. You replace it with a system whose host order is the same as the network order. You need to recompile all your software. All of your software continues to perform well, except the one program you wrote.

You have since forgotten that you had forced all of your constants to the opposite of the host order. You spend some quality time tearing out your hair, calling the names of all gods you ever heard of (and some you made up), hitting your monitor with a nerf bat, and performing all the other traditional ceremonies of trying to figure out why something that has worked so well is suddenly not working at all.

Eventually, you figure it out, say a couple of swear words, and start rewriting your code.

Luckily, you are not the first one to face the problem. Someone else has created the &man.htons.3; and &man.htonl.3; C functions to convert a short and long respectively from the host byte order to the network byte order, and the &man.ntohs.3; and &man.ntohl.3; C functions to go the other way.

On MSB-first systems these functions do nothing. On LSB-first systems they convert values to the proper order. So, regardless of what system your software is compiled on, your data will end up in the correct order if you use these functions.
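If you want to see what these functions do on your own system, here is a small test program. It is not part of the original text, just a sketch that packs the same port and address as above and dumps the resulting bytes:

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htons() and htonl() live here */

int
main() {
	uint16_t port;
	uint32_t addr;
	unsigned char *p;

	/* Port 13 and 192.43.244.18, converted to network order. */
	port = htons(13);
	addr = htonl((192u << 24) | (43u << 16) | (244u << 8) | 18u);

	/* Examine the individual bytes as they sit in memory. */
	p = (unsigned char *)&port;
	printf("port: %d %d\n", p[0], p[1]);

	p = (unsigned char *)&addr;
	printf("address: %d %d %d %d\n", p[0], p[1], p[2], p[3]);

	return 0;
}

Whether it is compiled on an LSB or an MSB system, it should print 0 13 and 192 43 244 18, because htons and htonl always produce the network order.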
<function>connect</function>

Once a client has created a socket, it needs to connect it to a specific port on a remote system. It uses &man.connect.2;:

int connect(int s, const struct sockaddr *name, socklen_t namelen);

The s argument is the socket, i.e., the value returned by the socket function. The name is a pointer to sockaddr, the structure we have talked about extensively. Finally, namelen informs the system how many bytes are in our sockaddr structure.

If connect is successful, it returns 0. Otherwise it returns -1 and stores the error code in errno.

There are many reasons why connect may fail. For example, when attempting an Internet connection, the IP address may not exist, or the remote host may be down, or just too busy, or it may not have a server listening at the specified port. Or it may outright refuse the connection.

Our First Client

We now know enough to write a very simple client, one that will get the current time from 192.43.244.18 and print it to stdout.

/*
 * daytime.c
 *
 * Programmed by G. Adam Stanislav
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

int
main() {
  register int s;
  register int bytes;
  struct sockaddr_in sa;
  char buffer[BUFSIZ+1];

  if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
    perror("socket");
    return 1;
  }

  bzero(&sa, sizeof sa);

  sa.sin_family = AF_INET;
  sa.sin_port = htons(13);
  sa.sin_addr.s_addr = htonl((((((192 << 8) | 43) << 8) | 244) << 8) | 18);
  if (connect(s, (struct sockaddr *)&sa, sizeof sa) < 0) {
    perror("connect");
    close(s);
    return 2;
  }

  while ((bytes = read(s, buffer, BUFSIZ)) > 0)
    write(1, buffer, bytes);

  close(s);
  return 0;
}

Go ahead, enter it in your editor, save it as daytime.c, then compile and run it:

&prompt.user; cc -O3 -o daytime daytime.c
&prompt.user; ./daytime

52079 01-06-19 02:29:25 50 0 1 543.9 UTC(NIST) *
&prompt.user;

In this case, the date was June 19, 2001, the time was 02:29:25 UTC. Naturally, your results will vary.

Server Functions

The typical server does not initiate the connection. Instead, it waits for a client to call it and request services. It does not know when the client will call, nor how many clients will call. It may be just sitting there, waiting patiently, one moment. The next moment, it can find itself swamped with requests from a number of clients, all calling in at the same time.

The sockets interface offers three basic functions to handle this.

<function>bind</function>

Ports are like extensions to a phone line: After you dial a number, you dial the extension to get to a specific person or department. There are 65535 IP ports, but a server usually processes requests that come in on only one of them. It is like telling the phone room operator that we are now at work and available to answer the phone at a specific extension.

We use &man.bind.2; to tell sockets which port we want to serve.

int bind(int s, const struct sockaddr *addr, socklen_t addrlen);

Besides specifying the port in addr, the server may include its IP address. However, it can just use the symbolic constant INADDR_ANY to indicate it will serve all requests to the specified port regardless of what its IP address is. This symbol, along with several similar ones, is declared in netinet/in.h:

#define INADDR_ANY      (u_int32_t)0x00000000

Suppose we were writing a server for the daytime protocol over TCP/IP. Recall that it uses port 13.
Our sockaddr_in structure would look like this:

      0        1        2        3
    +--------+--------+--------+--------+
  0 |   0    |   2    |   0    |   13   |
    +--------+--------+--------+--------+
  4 |                 0                 |
    +-----------------------------------+
  8 |                 0                 |
    +-----------------------------------+
 12 |                 0                 |
    +-----------------------------------+

Example Server sockaddr_in

<function>listen</function>

To continue our office phone analogy, after you have told the phone central operator what extension you will be at, you now walk into your office, and make sure your own phone is plugged in and the ringer is turned on. Plus, you make sure your call waiting is activated, so you can hear the phone ring even while you are talking to someone.

The server ensures all of that with the &man.listen.2; function.

int listen(int s, int backlog);

Here, the backlog variable tells sockets how many incoming requests to accept while you are busy processing the last request. In other words, it determines the maximum size of the queue of pending connections.

<function>accept</function>

After you hear the phone ringing, you accept the call by picking it up. You have now established a connection with your client. This connection remains active until either you or your client hang up.

The server accepts the connection by using the &man.accept.2; function.

int accept(int s, struct sockaddr *addr, socklen_t *addrlen);

Note that this time addrlen is a pointer. This is necessary because in this case it is the socket that fills out addr, the sockaddr_in structure.

The return value is an integer. Indeed, accept returns a new socket. You will use this new socket to communicate with the client.

What happens to the old socket? It continues to listen for more requests (remember the backlog variable we passed to listen?) until we close it.

Now, the new socket is meant only for communications. It is fully connected. We cannot pass it to listen again, trying to accept additional connections.

Our First Server

Our first server will be somewhat more complex than our first client was: Not only do we have more sockets functions to use, but we need to write it as a daemon.

This is best achieved by creating a child process after binding the port. The main process then exits and returns control to the shell (or whatever program invoked it).

The child calls listen, then starts an endless loop, which accepts a connection, serves it, and eventually closes its socket.

/*
 * daytimed - a port 13 server
 *
 * Programmed by G. Adam Stanislav
 * June 19, 2001
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define BACKLOG 4

int
main() {
  register int s, c;
  socklen_t b;
  struct sockaddr_in sa;
  time_t t;
  struct tm *tm;
  FILE *client;

  if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
    perror("socket");
    return 1;
  }

  bzero(&sa, sizeof sa);

  sa.sin_family = AF_INET;
  sa.sin_port   = htons(13);

  if (INADDR_ANY)
    sa.sin_addr.s_addr = htonl(INADDR_ANY);

  if (bind(s, (struct sockaddr *)&sa, sizeof sa) < 0) {
    perror("bind");
    return 2;
  }

  switch (fork()) {
    case -1:
      perror("fork");
      return 3;
      break;
    default:
      close(s);
      return 0;
      break;
    case 0:
      break;
  }

  listen(s, BACKLOG);

  for (;;) {
    b = sizeof sa;

    if ((c = accept(s, (struct sockaddr *)&sa, &b)) < 0) {
      perror("daytimed accept");
      return 4;
    }

    if ((client = fdopen(c, "w")) == NULL) {
      perror("daytimed fdopen");
      return 5;
    }

    if ((t = time(NULL)) < 0) {
      perror("daytimed time");
      return 6;
    }

    tm = gmtime(&t);
    fprintf(client, "%.4i-%.2i-%.2iT%.2i:%.2i:%.2iZ\n",
      tm->tm_year + 1900,
      tm->tm_mon + 1,
      tm->tm_mday,
      tm->tm_hour,
      tm->tm_min,
      tm->tm_sec);

    fclose(client);
  }
}

We start by creating a socket. Then we fill out the sockaddr_in structure in sa. Note the conditional use of INADDR_ANY:

if (INADDR_ANY)
  sa.sin_addr.s_addr = htonl(INADDR_ANY);

Its value is 0. Since we have just used bzero on the entire structure, it would be redundant to set it to 0 again. But if we port our code to some other system where INADDR_ANY is perhaps not a zero, we need to assign it to sa.sin_addr.s_addr. Most modern C compilers are clever enough to notice that INADDR_ANY is a constant. As long as it is a zero, they will optimize the entire conditional statement out of the code.

After we have called bind successfully, we are ready to become a daemon: We use fork to create a child process. In both the parent and the child, the s variable is our socket. The parent process will not need it, so it calls close, then it returns 0 to inform its own parent that it terminated successfully.

Meanwhile, the child process continues working in the background. It calls listen and sets its backlog to 4. It does not need a large value here because daytime is not a protocol many clients request all the time, and because it can process each request instantly anyway.

Finally, the daemon starts an endless loop, which performs the following steps:

Call accept. It waits here until a client contacts it. At that point, it receives a new socket, c, which it can use to communicate with this particular client.

It uses the C function fdopen to turn the socket from a low-level file descriptor into a C-style FILE pointer. This will allow the use of fprintf later on.

It checks the time, and prints it in the ISO 8601 format to the client file. It then uses fclose to close the file. That will automatically close the socket as well.

We can generalize this, and use it as a model for many other servers:

         +-----------------+
         |  Create Socket  |
         +-----------------+
                  |
         +-----------------+
         |    Bind Port    |        Daemon Process
         +-----------------+
                  |                    +--------+
         +--------+------------------->|  Init  |
         |                             +--------+
         v                                 |
    +--------+                         +--------+
    |  Exit  |                         | Listen |
    +--------+                         +--------+
                                           |
                                       +--------+
                                  +--->| Accept |
                                  |    +--------+
                                  |        |
                                  |    +--------+
                                  |    | Serve  |
                                  |    +--------+
                                  |        |
                                  |    +--------+
                                  +----| Close  |
                                       +--------+

Sequential Server

This flowchart is good for sequential servers, i.e., servers that can serve one client at a time, just as we were able to with our daytime server.
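Because accept fills out addr, the daemon knows who connected the moment the call returns. Our daytimed simply ignores that information, but here is a minimal, hypothetical sketch of how it could be examined, using the standard inet_ntoa and ntohs routines. The log_client helper and the values faked in main are our own illustration, not part of the chapter's code:

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>	/* inet_ntoa() and ntohs() */
#include <netinet/in.h>

/*
 * Print the peer's address and port. Hypothetical helper,
 * not part of daytimed.
 */
static void
log_client(const struct sockaddr_in *sa)
{
	printf("connection from %s, port %d\n",
	    inet_ntoa(sa->sin_addr), ntohs(sa->sin_port));
}

int
main() {
	struct sockaddr_in sa;

	/* Fake the values accept() would have filled out. */
	bzero(&sa, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_port = htons(1234);
	sa.sin_addr.s_addr = htonl((192u << 24) | (43u << 16) | (244u << 8) | 18u);

	log_client(&sa);
	return 0;
}

In daytimed, a call like log_client(&sa) right after accept succeeds would print one such line per connection.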
A sequential server like this is only possible whenever there is no real conversation going on between the client and the server: As soon as the server detects a connection to the client, it sends out some data and closes the connection. The entire operation may take nanoseconds, and it is finished.

The advantage of this flowchart is that, except for the brief moment after the parent forks and before it exits, there is always only one process active: Our server does not take up much memory and other system resources.

Note that we have added initialize daemon in our flowchart. We did not need to initialize our own daemon, but this is a good place in the flow of the program to set up any signal handlers, open any files we may need, etc.

Just about everything in the flow chart can be used literally on many different servers. The serve entry is the exception. We think of it as a black box, i.e., something you design specifically for your own server, and just plug it into the rest.

Not all protocols are that simple. Many receive a request from the client, reply to it, then receive another request from the same client. As a result, they do not know in advance how long they will be serving the client. Such servers usually start a new process for each client. While the new process is serving its client, the daemon can continue listening for more connections.

Now, go ahead, save the above source code as daytimed.c (it is customary to end the names of daemons with the letter d). After you have compiled it, try running it:

&prompt.user; ./daytimed
bind: Permission denied
&prompt.user;

What happened here? As you will recall, the daytime protocol uses port 13. But all ports below 1024 are reserved for the superuser (otherwise, anyone could start a daemon pretending to serve a commonly used port, while causing a security breach).

Try again, this time as the superuser:

&prompt.root; ./daytimed
&prompt.root;

What... Nothing? Let us try again:

&prompt.root; ./daytimed
bind: Address already in use
&prompt.root;

Every port can only be bound by one program at a time. Our first attempt was indeed successful: It started the child daemon and returned quietly. It is still running and will continue to run until you either kill it, or any of its system calls fail, or you reboot the system.

Fine, we know it is running in the background. But is it working? How do we know it is a proper daytime server? Simple:

&prompt.user; telnet localhost 13
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
2001-06-19T21:04:42Z
Connection closed by foreign host.
&prompt.user;

telnet tried the new IPv6, and failed. It retried with IPv4 and succeeded. The daemon works.

If you have access to another &unix; system via telnet, you can use it to test accessing the server remotely. My computer does not have a static IP address, so this is what I did:

&prompt.user; who
whizkid          ttyp0   Jun 19 16:59   (216.127.220.143)
xxx              ttyp1   Jun 19 16:06   (xx.xx.xx.xx)
&prompt.user; telnet 216.127.220.143 13
Trying 216.127.220.143...
Connected to r47.bfm.org.
Escape character is '^]'.
2001-06-19T21:31:11Z
Connection closed by foreign host.
&prompt.user;

Again, it worked. Will it work using the domain name?

&prompt.user; telnet r47.bfm.org 13
Trying 216.127.220.143...
Connected to r47.bfm.org.
Escape character is '^]'.
2001-06-19T21:31:40Z
Connection closed by foreign host.
&prompt.user;

By the way, telnet prints the Connection closed by foreign host message after our daemon has closed the socket. This shows us that, indeed, using fclose(client); in our code works as advertised.

Helper Functions

The FreeBSD C library contains many helper functions for sockets programming. For example, in our sample client we hard coded the time.nist.gov IP address. But we do not always know the IP address. Even if we do, our software is more flexible if it allows the user to enter the IP address, or even the domain name.

<function>gethostbyname</function>

While there is no way to pass the domain name directly to any of the sockets functions, the FreeBSD C library comes with the &man.gethostbyname.3; and &man.gethostbyname2.3; functions, declared in netdb.h.

struct hostent * gethostbyname(const char *name);
struct hostent * gethostbyname2(const char *name, int af);

Both return a pointer to the hostent structure, with much information about the domain. For our purposes, the h_addr_list[0] field of the structure points at h_length bytes of the correct address, already stored in the network byte order.

This allows us to create a much more flexible, and much more useful, version of our daytime program:

/*
 * daytime.c
 *
 * Programmed by G. Adam Stanislav
 * 19 June 2001
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

int
main(int argc, char *argv[]) {
  register int s;
  register int bytes;
  struct sockaddr_in sa;
  struct hostent *he;
  char buf[BUFSIZ+1];
  char *host;

  if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
    perror("socket");
    return 1;
  }

  bzero(&sa, sizeof sa);

  sa.sin_family = AF_INET;
  sa.sin_port = htons(13);

  host = (argc > 1) ? (char *)argv[1] : "time.nist.gov";

  if ((he = gethostbyname(host)) == NULL) {
    herror(host);
    return 2;
  }

  bcopy(he->h_addr_list[0], &sa.sin_addr, he->h_length);

  if (connect(s, (struct sockaddr *)&sa, sizeof sa) < 0) {
    perror("connect");
    return 3;
  }

  while ((bytes = read(s, buf, BUFSIZ)) > 0)
    write(1, buf, bytes);

  close(s);
  return 0;
}

We now can type a domain name (or an IP address, it works both ways) on the command line, and the program will try to connect to its daytime server.

Otherwise, it will still default to time.nist.gov. However, even in this case we will use gethostbyname rather than hard coding 192.43.244.18. That way, even if its IP address changes in the future, we will still find it.

Since it takes virtually no time to get the time from your local server, you could run daytime twice in a row: First to get the time from time.nist.gov, the second time from your own system. You can then compare the results and see how exact your system clock is:

&prompt.user; daytime ; daytime localhost

52080 01-06-20 04:02:33 50 0 0 390.2 UTC(NIST) *
2001-06-20T04:02:35Z
&prompt.user;

As you can see, my system was two seconds ahead of the NIST time.

<function>getservbyname</function>

Sometimes you may not be sure what port a certain service uses. The &man.getservbyname.3; function, also declared in netdb.h, comes in very handy in those cases:

struct servent * getservbyname(const char *name, const char *proto);

The servent structure contains the s_port field, which holds the proper port, already in network byte order.

Had we not known the correct port for the daytime service, we could have found it this way:

struct servent *se;
...
if ((se = getservbyname("daytime", "tcp")) == NULL) {
  fprintf(stderr, "Cannot determine which port to use.\n");
  return 7;
}

sa.sin_port = se->s_port;

You usually do know the port. But if you are developing a new protocol, you may be testing it on an unofficial port. Some day, you will register the protocol and its port (if nowhere else, at least in your /etc/services, which is where getservbyname looks). Instead of returning an error in the above code, you just use the temporary port number. Once you have listed the protocol in /etc/services, your software will find its port without you having to rewrite the code.

Concurrent Servers

Unlike a sequential server, a concurrent server has to be able to serve more than one client at a time. For example, a chat server may be serving a specific client for hours; it cannot wait till it stops serving a client before it serves the next one.

This requires a significant change in our flowchart:

         +-----------------+
         |  Create Socket  |
         +-----------------+
                  |
         +-----------------+
         |    Bind Port    |        Daemon Process
         +-----------------+
                  |                    +--------+
         +--------+------------------->|  Init  |
         |                             +--------+
         v                                 |
    +--------+                         +--------+
    |  Exit  |                         | Listen |
    +--------+                         +--------+
                                           |             Server Process
                                       +--------+     +------------------+
                                  +--->| Accept |---->| Close Top Socket |
                                  |    +--------+     +------------------+
                                  |        |                   |
                                  |    +--------+     +------------------+
                                  +----| Close  |     |      Serve       |
                                       +--------+     +------------------+
                                                               |
                                       +--------+     +------------------+
                                       | Signal |     | Close Acc Socket |
                                       +--------+     +------------------+
                                                               |
                                                      +------------------+
                                                      |       Exit       |
                                                      +------------------+

Concurrent Server

We moved the serve from the daemon process to its own server process. However, because each child process inherits all open files (and a socket is treated just like a file), the new process inherits not only the accepted handle, i.e., the socket returned by the accept call, but also the top socket, i.e., the one opened by the top process right at the beginning.

However, the server process does not need this socket and should close it immediately. Similarly, the daemon process no longer needs the accepted socket, and not only should, but must close it; otherwise, it will run out of available file descriptors sooner or later.

After the server process is done serving, it should close the accepted socket. Instead of returning to accept, it now exits.

Under &unix;, a process does not really exit. Instead, it returns to its parent. Typically, a parent process waits for its child process, and obtains a return value. However, our daemon process cannot simply stop and wait. That would defeat the whole purpose of creating additional processes. But if it never does wait, its children will become zombies: no longer functional but still roaming around.

For that reason, the daemon process needs to set signal handlers in its initialize daemon phase. At least a SIGCHLD signal has to be processed, so the daemon can remove the zombie return values from the system and release the system resources they are taking up.

That is why our flowchart now contains a process signals box, which is not connected to any other box. By the way, many servers also process SIGHUP, and typically interpret it as a signal from the superuser that they should reread their configuration files. This allows us to change settings without having to kill and restart these servers.
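To make the concurrent flowchart concrete, here is a minimal sketch of a fork-per-client daemon. It is our own illustration, not code from this chapter: the port number 13013 is an arbitrary test value, the serve step is reduced to a single write, and the fork-into-background step of daytimed is omitted for brevity:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <netinet/in.h>

/*
 * The "process signals" box: collect dead children
 * so they do not become zombies.
 */
static void
reap(int sig)
{
	(void)sig;
	while (waitpid(-1, NULL, WNOHANG) > 0)
		;
}

int
main() {
	register int s, c;
	socklen_t b;
	struct sockaddr_in sa;
	struct sigaction act;

	if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
		perror("socket");
		return 1;
	}

	bzero(&sa, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_port = htons(13013);	/* arbitrary test port */

	if (bind(s, (struct sockaddr *)&sa, sizeof sa) < 0) {
		perror("bind");
		return 2;
	}

	/* The "initialize daemon" phase: install the SIGCHLD handler. */
	bzero(&act, sizeof act);
	sigemptyset(&act.sa_mask);
	act.sa_handler = reap;
	act.sa_flags = SA_RESTART;
	sigaction(SIGCHLD, &act, NULL);

	listen(s, 4);

	for (;;) {
		b = sizeof sa;
		if ((c = accept(s, (struct sockaddr *)&sa, &b)) < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by SIGCHLD */
			perror("accept");
			return 3;
		}
		switch (fork()) {
		case -1:
			perror("fork");
			close(c);
			break;
		case 0:				/* server process */
			close(s);		/* close the top socket */
			write(c, "hello\n", 6);	/* the "serve" black box */
			close(c);		/* close the accepted socket */
			_exit(0);
		default:			/* daemon process */
			close(c);		/* must close the accepted socket */
			break;
		}
	}
}

The reap handler is the process signals box from the flowchart: each time a server process exits, waitpid collects its return value so no zombies accumulate, and the WNOHANG loop handles the case where several children exit before the daemon gets the signal.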
diff --git a/en_US.ISO8859-1/books/faq/book.xml b/en_US.ISO8859-1/books/faq/book.xml
index 11ebe38646..5942b54a38 100644
--- a/en_US.ISO8859-1/books/faq/book.xml
+++ b/en_US.ISO8859-1/books/faq/book.xml
@@ -1,6431 +1,6431 @@

Frequently Asked Questions for &os; &rel2.relx; and &rel.relx;

The &os; Documentation Project

1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 The &os; Documentation Project

&legalnotice; &tm-attrib.freebsd; &tm-attrib.adobe; &tm-attrib.ibm; &tm-attrib.ieee; &tm-attrib.intel; &tm-attrib.linux; &tm-attrib.microsoft; &tm-attrib.netbsd; &tm-attrib.opengroup; &tm-attrib.sgi; &tm-attrib.sun; &tm-attrib.general;

$FreeBSD$

This is the Frequently Asked Questions (FAQ) for &os; versions &rel.relx; and &rel2.relx;. Every effort has been made to make this FAQ as informative as possible; if you have any suggestions as to how it may be improved, send them to the &a.doc;.

The latest version of this document is always available from the &os; website. It may also be downloaded as one large HTML file with HTTP or as a variety of other formats from the &os; FTP server.

Introduction

What is &os;?

&os; is a modern operating system for desktops, laptops, servers, and embedded systems with support for a large number of platforms.

It is based on U.C. Berkeley's 4.4BSD-Lite release, with some 4.4BSD-Lite2 enhancements. It is also based indirectly on William Jolitz's port of U.C. Berkeley's Net/2 to the &i386;, known as 386BSD, though very little of the 386BSD code remains.

&os; is used by companies, Internet Service Providers, researchers, computer professionals, students and home users all over the world in their work, education and recreation.

For more detailed information on &os;, refer to the &os; Handbook.

What is the goal of the &os; Project?

The goal of the &os; Project is to provide a stable and fast general purpose operating system that may be used for any purpose without strings attached.

Does the &os; license have any restrictions?

Yes. Those restrictions do not control how the code is used, but how to treat the &os; Project itself. The license itself is available at license and can be summarized like this:

Do not claim that you wrote this.

Do not sue us if it breaks.

Do not remove or modify the license.

Many of us have a significant investment in the project and would certainly not mind a little financial compensation now and then, but we definitely do not insist on it. We believe that our first and foremost mission is to provide code to any and all comers, and for whatever purpose, so that the code gets the widest possible use and provides the widest possible benefit. This, we believe, is one of the most fundamental goals of Free Software and one that we enthusiastically support.

Code in our source tree which falls under the GNU General Public License (GPL) or GNU Library General Public License (LGPL) comes with slightly more strings attached, though at least on the side of enforced access rather than the usual opposite. Due to the additional complexities that can evolve in the commercial use of GPL software, we do, however, endeavor to replace such software with submissions under the more relaxed &os; license whenever possible.

Can &os; replace my current operating system?

For most people, yes. But this question is not quite that cut-and-dried.

Most people do not actually use an operating system.
They use applications. The applications are what really use the operating system. &os; is designed to provide a robust and full-featured environment for applications. It supports a wide variety of web browsers, office suites, email readers, graphics programs, programming environments, network servers, and much more. Most of these applications can be managed through the Ports Collection.

If an application is only available on one operating system, that operating system cannot just be replaced. Chances are, there is a very similar application on &os;, however. As a solid office or Internet server or a reliable workstation, &os; will almost certainly do everything you need. Many computer users across the world, including both novices and experienced &unix; administrators, use &os; as their only desktop operating system.

Users migrating to &os; from another &unix;-like environment will find &os; to be similar. &windows; and &macos; users may be interested in instead using GhostBSD, MidnightBSD or NomadBSD, three &os;-based desktop distributions. Non-&unix; users should expect to invest some additional time learning the &unix; way of doing things. This FAQ and the &os; Handbook are excellent places to start.

Why is it called &os;?

It may be used free of charge, even by commercial users.

Full source for the operating system is freely available, and the minimum possible restrictions have been placed upon its use, distribution and incorporation into other work (commercial or non-commercial).

Anyone who has an improvement or bug fix is free to submit their code and have it added to the source tree (subject to one or two obvious provisions).

It is worth pointing out that the word free is being used in two ways here: one meaning at no cost and the other meaning do whatever you like. Apart from one or two things you cannot do with the &os; code, for example pretending you wrote it, you can really do whatever you like with it.

What are the differences between &os; and NetBSD, OpenBSD, and other open source BSD operating systems?

James Howard wrote a good explanation of the history and differences between the various projects, called The BSD Family Tree, which goes a fair way to answering this question. Some of the information is out of date, but the history portion in particular remains accurate.

Most of the BSDs share patches and code, even today. All of the BSDs have common ancestry.

The design goals of &os; are described above. The design goals of the other most popular BSDs may be summarized as follows:

OpenBSD aims for operating system security above all else. The OpenBSD team wrote &man.ssh.1; and &man.pf.4;, which have both been ported to &os;.

NetBSD aims to be easily ported to other hardware platforms.

DragonFly BSD is a fork of &os; 4.8 that has since developed many interesting features of its own, including the HAMMER file system and support for user-mode vkernels.

What is the latest version of &os;?

At any point in the development of &os;, there can be multiple parallel branches. &rel.relx; releases are made from the &rel.stable; branch, and &rel2.relx; releases are made from the &rel2.stable; branch.

Up until the release of 12.0, the &rel2.relx; series was the one known as -STABLE. However, as of &rel.head.relx;, the &rel2.relx; branch will be designated for an extended support status and receive only fixes for major problems, such as security-related fixes.

Releases are made every few months.
While many people stay more up-to-date with the &os; sources (see the questions on &os.current; and &os.stable;) than that, doing so is more of a commitment, as the sources are a moving target.

More information on &os; releases can be found on the Release Engineering page and in &man.release.7;.

What is &os;-CURRENT?

&os.current; is the development version of the operating system, which will in due course become the new &os.stable; branch. As such, it is really only of interest to developers working on the system and die-hard hobbyists. See the relevant section in the Handbook for details on running -CURRENT.

Users not familiar with &os; should not use &os.current;. This branch sometimes evolves quite quickly and due to mistakes can be un-buildable at times. People who use &os.current; are expected to be able to analyze, debug, and report problems.

What is the &os;-STABLE concept?

&os;-STABLE is the development branch from which major releases are made. Changes go into this branch at a slower pace and with the general assumption that they have first been tested in &os;-CURRENT. However, at any given time, the sources for &os;-STABLE may or may not be suitable for general use, as it may uncover bugs and corner cases that were not yet found in &os;-CURRENT. Users who do not have the resources to perform testing should instead run the most recent release of &os;. &os;-CURRENT, on the other hand, has been one unbroken line since 2.0 was released.

For more detailed information on branches see &os; Release Engineering: Creating the Release Branch. The status of the branches and the upcoming release schedule can be found on the Release Engineering Information page.

Version &rel121.current; is the latest release from the &rel.stable; branch; it was released in &rel121.current.date;. Version &rel1.current; is the latest release from the &rel2.stable; branch; it was released in &rel1.current.date;.

When are &os; releases made?

The &a.re; releases a new major version of &os; about every 18 months and a new minor version about every 8 months, on average. Release dates are announced well in advance, so that the people working on the system know when their projects need to be finished and tested. A testing period precedes each release, to ensure that the addition of new features does not compromise the stability of the release. Many users regard this caution as one of the best things about &os;, even though waiting for all the latest goodies to reach -STABLE can be a little frustrating.

More information on the release engineering process (including a schedule of upcoming releases) can be found on the release engineering pages on the &os; Web site.

For people who need or want a little more excitement, binary snapshots are made weekly as discussed above.

When are &os; snapshots made?

&os; snapshot releases are made based on the current state of the -CURRENT and -STABLE branches. The goals behind each snapshot release are:

To test the latest version of the installation software.

To give people who would like to run -CURRENT or -STABLE but who do not have the time or bandwidth to follow it on a day-to-day basis an easy way of bootstrapping it onto their systems.

To preserve a fixed reference point for the code in question, just in case we break something really badly later. (Although Subversion normally prevents anything horrible like this happening.)

To ensure that all new features and fixes in need of testing have the greatest possible number of potential testers.
No claims are made that any -CURRENT snapshot can be considered production quality for any purpose. If a stable and fully tested system is needed, stick to full releases.

Snapshot releases are directly available from snapshot.

Official snapshots are generated on a regular basis for all actively developed branches.

Who is responsible for &os;?

The key decisions concerning the &os; project, such as the overall direction of the project and who is allowed to add code to the source tree, are made by a core team of 9 people. There is a much larger team of more than 350 committers who are authorized to make changes directly to the &os; source tree.

However, most non-trivial changes are discussed in advance in the mailing lists, and there are no restrictions on who may take part in the discussion.

Where can I get &os;?

Every significant release of &os; is available via anonymous FTP from the &os; FTP site:

The latest &rel.stable; release, &rel121.current;-RELEASE, can be found in the &rel121.current;-RELEASE directory.

Snapshot releases are made monthly for the -CURRENT and -STABLE branches, these being of service purely to bleeding-edge testers and developers.

The latest &rel2.stable; release, &rel1.current;-RELEASE, can be found in the &rel1.current;-RELEASE directory.

Information about obtaining &os; on CD, DVD, and other media can be found in the Handbook.

How do I access the Problem Report database?

The Problem Report database of all user change requests may be queried by using our web-based PR query interface.

The web-based problem report submission interface can be used to submit problem reports through a web browser.

Before submitting a problem report, read Writing &os; Problem Reports, an article on how to write good problem reports.

Documentation and Support

What good books are there about &os;?

The project produces a wide range of documentation, available online from this link: https://www.FreeBSD.org/docs.html.

Is the documentation available in other formats, such as plain text (ASCII), or PDF?

Yes. The documentation is available in a number of different formats and compression schemes on the &os; FTP site, in the /ftp/doc/ directory.

The documentation is categorized in a number of different ways. These include:

The document's name, such as faq, or handbook.

The document's language and encoding. These are based on the locale names found under /usr/share/locale on a &os; system. The current languages and encodings are as follows:

Name                 Meaning
en_US.ISO8859-1      English (United States)
bn_BD.ISO10646-1     Bengali or Bangla (Bangladesh)
da_DK.ISO8859-1      Danish (Denmark)
de_DE.ISO8859-1      German (Germany)
el_GR.ISO8859-7      Greek (Greece)
es_ES.ISO8859-1      Spanish (Spain)
fr_FR.ISO8859-1      French (France)
hu_HU.ISO8859-2      Hungarian (Hungary)
it_IT.ISO8859-15     Italian (Italy)
ja_JP.eucJP          Japanese (Japan, EUC encoding)
ko_KR.UTF-8          Korean (Korea, UTF-8 encoding)
mn_MN.UTF-8          Mongolian (Mongolia, UTF-8 encoding)
nl_NL.ISO8859-1      Dutch (Netherlands)
pl_PL.ISO8859-2      Polish (Poland)
pt_BR.ISO8859-1      Portuguese (Brazil)
ru_RU.KOI8-R         Russian (Russia, KOI8-R encoding)
tr_TR.ISO8859-9      Turkish (Turkey)
zh_CN.UTF-8          Simplified Chinese (China, UTF-8 encoding)
zh_TW.UTF-8          Traditional Chinese (Taiwan, UTF-8 encoding)

Some documents may not be available in all languages.

The document's format. We produce the documentation in a number of different output formats. Each format has its own advantages and disadvantages.
Some formats are better suited for online reading, while others are meant to be aesthetically pleasing when printed on paper. Having the documentation available in any of these formats ensures that our readers will be able to read the parts they are interested in, either on their monitor, or on paper after printing the documents. The currently available formats are:

Format       Meaning
html-split   A collection of small, linked, HTML files.
html         One large HTML file containing the entire document.
pdf          Adobe's Portable Document Format.
txt          Plain text.

The compression and packaging scheme. Where the format is html-split, the files are bundled up using &man.tar.1;. The resulting .tar is then compressed using the compression schemes detailed in the next point.

All the other formats generate one file. For example, article.pdf, book.html, and so on.

These files are then compressed using either the zip or bz2 compression schemes. &man.tar.1; can be used to uncompress these files.

So the PDF version of the Handbook, compressed using bzip2, will be stored in a file called book.pdf.bz2 in the handbook/ directory.

After choosing the format and compression mechanism, download the compressed files, uncompress them, and then copy the appropriate documents into place.

For example, the split HTML version of the FAQ, compressed using &man.bzip2.1;, can be found in doc/en_US.ISO8859-1/books/faq/book.html-split.tar.bz2. To download and uncompress that file, type:

&prompt.root; fetch https://download.freebsd.org/ftp/doc/en_US.ISO8859-1/books/faq/book.html-split.tar.bz2
&prompt.root; tar xvf book.html-split.tar.bz2

If the file is compressed, tar will automatically detect the appropriate format and decompress it correctly, resulting in a collection of .html files. The main one is called index.html, which will contain the table of contents, introductory material, and links to the other parts of the document.

Where do I find info on the &os; mailing lists? What &os; news groups are available?

Refer to the Handbook entry on mailing-lists and the Handbook entry on newsgroups.

Are there &os; IRC (Internet Relay Chat) channels?

Yes, most major IRC networks host a &os; chat channel:

Channel #FreeBSDhelp on EFNet is a channel dedicated to helping &os; users.

Channel #FreeBSD on Freenode is a general help channel with many users at any time. The conversations have been known to run off-topic for a while, but priority is given to users with &os; questions. Other users can help with the basics, referring to the Handbook whenever possible and providing links for learning more about a particular topic. This is primarily an English speaking channel, though it does have users from all over the world. Non-native English speakers should try to ask the question in English first and then relocate to ##freebsd-lang as appropriate.

Channel #FreeBSD on DALNET is available at irc.dal.net in the US and irc.eu.dal.net in Europe.

Channel #FreeBSD on UNDERNET is available at us.undernet.org in the US and eu.undernet.org in Europe. Since it is a help channel, be prepared to read the documents you are referred to.

Channel #FreeBSD on RUSNET is a Russian language channel dedicated to helping &os; users. This is also a good place for non-technical discussions.

Channel #bsdchat on Freenode is a Traditional Chinese (UTF-8 encoding) language channel dedicated to helping &os; users. This is also a good place for non-technical discussions.

The &os; wiki has a good list of IRC channels.

Each of these channels is distinct and not connected to the others.
Since their chat styles differ, try each to find one suited to your chat style.

Are there any web-based forums to discuss &os;?

The official &os; forums are located at https://forums.FreeBSD.org/.

Where can I get commercial &os; training and support?

iXsystems, Inc., parent company of the &os; Mall, provides commercial &os; and TrueOS software support, in addition to &os; development and tuning solutions.

BSD Certification Group, Inc. provides system administration certifications for DragonFly BSD, &os;, NetBSD, and OpenBSD. Refer to their site for more information.

Any other organizations providing training and support should contact the Project to be listed here.

Installation

Nik Clayton
nik@FreeBSD.org
Which platform should I download? I have a 64-bit capable &intel; CPU, but I only see amd64.

&arch.amd64; is the term &os; uses for 64-bit compatible x86 architectures (also known as "x86-64" or "x64"). Most modern computers should use &arch.amd64;. Older hardware should use &arch.i386;. When installing on a non-x86-compatible architecture, select the platform which best matches the hardware.

Which file do I download to get &os;?

On the Getting &os; page, select [iso] next to the architecture that matches the hardware.

Any of the following can be used:

file           description
disc1.iso      Contains enough to install &os; and a minimal set of packages.
dvd1.iso       Similar to disc1.iso but with additional packages.
memstick.img   A bootable image sufficient for writing to a USB stick.
bootonly.iso   A minimal image that requires network access during installation to completely install &os;.

Full instructions on this procedure and a little bit more about installation issues in general can be found in the Handbook entry on installing &os;.

What do I do if the install image does not boot?

This can be caused by not downloading the image in binary mode when using FTP.

Some FTP clients default their transfer mode to ascii and attempt to change any end-of-line characters received to match the conventions used by the client's system. This will almost invariably corrupt the boot image. Check the SHA-256 checksum of the downloaded boot image: if it does not exactly match the checksum on the server, the download process is suspect.

When using a command line FTP client, type binary at the FTP command prompt after getting connected to the server and before starting the download of the image.

Where are the instructions for installing &os;?

Installation instructions can be found at Handbook entry on installing &os;.

How can I make my own custom release or install disk?

Customized &os; installation media can be created by building a custom release. Follow the instructions in the Release Engineering article.

Can &windows; co-exist with &os;? (x86-specific)

If &windows; is installed first, then yes. &os;'s boot manager will then manage to boot &windows; and &os;. If &windows; is installed afterwards, it will overwrite the boot manager. If that happens, see the next section.

Another operating system destroyed my Boot Manager. How do I get it back? (x86-specific)

This depends upon the boot manager. The &os; boot selection menu can be reinstalled using &man.boot0cfg.8;. For example, to restore the boot menu onto the disk ada0:

&prompt.root; boot0cfg -B ada0

The non-interactive MBR bootloader can be installed using &man.gpart.8;:

&prompt.root; gpart bootcode -b /boot/mbr ada0

For more complex situations, including GPT disks, see &man.gpart.8;.

Do I need to install the source?

In general, no. There is nothing in the base system which requires the presence of the source to operate. Some ports, like sysutils/lsof, will not build unless the source is installed. In particular, if the port builds a kernel module or directly operates on kernel structures, the source must be installed.

Do I need to build a kernel?

Usually not. The supplied GENERIC kernel contains the drivers an ordinary computer will need. &man.freebsd-update.8;, the &os; binary upgrade tool, cannot upgrade custom kernels, another reason to stick with the GENERIC kernel when possible. For computers with very limited RAM, such as embedded systems, it may be worthwhile to build a smaller custom kernel containing just the required drivers.
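As a sketch of such a stripped-down configuration, one might start from GENERIC and subtract. The file name MYKERNEL and the removed entries below are only examples; check your own GENERIC for what can safely go:

# /usr/src/sys/amd64/conf/MYKERNEL (hypothetical name)
include GENERIC
ident   MYKERNEL

# Subtract drivers this machine does not need (examples only):
nodevice        snd_hda
nooptions       KDTRACE_HOOKS

The kernel is then built and installed with make buildkernel KERNCONF=MYKERNEL and make installkernel KERNCONF=MYKERNEL, as described in the Handbook.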
Should I use DES, Blowfish, or MD5 passwords and how do I specify which form my users receive? &os; uses SHA512 by default. DES passwords are still available for backwards compatibility with operating systems that still use the less secure password format. &os; also supports the Blowfish and MD5 password formats. Which password format to use for new passwords is controlled by the passwd_format login capability in /etc/login.conf, which takes values of des, blf (if these are available) or md5. See the &man.login.conf.5; manual page for more information about login capabilities. What are the limits for FFS file systems? For FFS file systems, the largest file system is practically limited by the amount of memory required to &man.fsck.8; the file system. &man.fsck.8; requires one bit per fragment, which with the default fragment size of 4 KB equates to 32 MB of memory per TB of disk. This does mean that on architectures which limit userland processes to 2 GB (e.g., &i386;), the maximum &man.fsck.8;'able filesystem is ~60 TB. If there was not a &man.fsck.8; memory limit the maximum filesystem size would be 2 ^ 64 (blocks) * 32 KB => 16 Exa * 32 KB => 512 ZettaBytes. The maximum size of a single FFS file is approximately 2 PB with the default block size of 32 KB. Each 32 KB block can point to 4096 blocks. With triple indirect blocks, the calculation is 32 KB * 12 + 32 KB * 4096 + 32 KB * 4096^2 + 32 KB * 4096^3. Increasing the block size to 64 KB will increase the max file size by a factor of 16. Why do I get an error message, readin failed after compiling and booting a new kernel? The world and kernel are out of sync. This is not supported. Be sure to use make buildworld and make buildkernel to update the kernel. Boot the system by specifying the kernel directly at the second stage, pressing any key when the | shows up before loader is started. Is there a tool to perform post-installation configuration tasks? Yes. bsdconfig provides a nice interface to configure &os; post-installation.
Hardware Compatibility

General

I want to get a piece of hardware for my &os; system. Which model/brand/type is best?

This is discussed continually on the &os; mailing lists, which is to be expected since hardware changes so quickly. Read through the Hardware Notes for &os; &rel121.current; or &rel1.current; and search the mailing list archives before asking about the latest and greatest hardware. Chances are a discussion about that type of hardware took place just last week.

Before purchasing a laptop, check the archives for &a.questions;, or possibly a specific mailing list for a particular hardware type.

What are the limits for memory?

&os; as an operating system generally supports as much physical memory (RAM) as the platform it is running on does. Keep in mind that different platforms have different limits for memory; for example, &i386; without PAE supports at most 4 GB of memory (and usually less than that because of PCI address space) and &i386; with PAE supports at most 64 GB of memory. As of &os; 10, AMD64 platforms support up to 4 TB of physical memory.

Why does &os; report less than 4 GB memory when installed on an &i386; machine?

The total address space on &i386; machines is 32-bit, meaning that at most 4 GB of memory is addressable (can be accessed). Furthermore, some addresses in this range are reserved by hardware for different purposes, for example for using and controlling PCI devices, for accessing video memory, and so on. Therefore, the total amount of memory usable by the operating system for its kernel and applications is limited to significantly less than 4 GB. Usually, 3.2 GB to 3.7 GB is the maximum usable physical memory in this configuration.

To access more than 3.2 GB to 3.7 GB of installed memory (meaning up to 4 GB but also more than 4 GB), a special tweak called PAE must be used. PAE stands for Physical Address Extension and is a way for 32-bit x86 CPUs to address more than 4 GB of memory. It remaps the memory that would otherwise be overlaid by address reservations for hardware devices above the 4 GB range and uses it as additional physical memory (see &man.pae.4;).

Using PAE has some drawbacks; this mode of memory access is a little bit slower than the normal (without PAE) mode and loadable modules (see &man.kld.4;) are not supported. This means all drivers must be compiled into the kernel.

The most common way to enable PAE is to build a new kernel with the special ready-provided kernel configuration file called PAE, which is already configured to build a safe kernel. Note that some entries in this kernel configuration file are too conservative and some drivers marked as unready to be used with PAE are actually usable. A rule of thumb is that if the driver is usable on 64-bit architectures (like AMD64), it is also usable with PAE. When creating a custom kernel configuration file, PAE can be enabled by adding the following line:

options PAE

PAE is not much used nowadays because most new x86 hardware also supports running in 64-bit mode, known as AMD64 or &intel; 64. It has a much larger address space and does not need such tweaks. &os; supports AMD64 and it is recommended that this version of &os; be used instead of the &i386; version if 4 GB or more memory is required.

Architectures and Processors

Does &os; support architectures other than the x86?

Yes. &os; divides support into multiple tiers. Tier 1 architectures, such as i386 or amd64, are fully supported. Tiers 2 and 3 are supported on a best-effort basis.
A full explanation of the tier system is available in the Committer's Guide. A complete list of supported architectures can be found on the platforms page. Does &os; support Symmetric Multiprocessing (SMP)? &os; supports symmetric multi-processor (SMP) on all non-embedded platforms (e.g, &arch.i386;, &arch.amd64;, etc.). SMP is also supported in arm and MIPS kernels, although some CPUs may not support this. &os;'s SMP implementation uses fine-grained locking, and performance scales nearly linearly with number of CPUs. &man.smp.4; has more details. What is microcode? How do I install &intel; CPU microcode updates? Microcode is a method of programmatically implementing hardware level instructions. This allows for CPU bugs to be fixed without replacing the on board chip. Install sysutils/devcpu-data, then add: microcode_update_enable="YES" to /etc/rc.conf Peripherals What kind of peripherals does &os; support? See the complete list in the Hardware Notes for &os; &rel121.current; or &rel1.current;. Keyboards and Mice Is it possible to use a mouse outside the X Window system? The default console driver, &man.vt.4;, provides the ability to use a mouse pointer in text consoles to cut & paste text. Run the mouse daemon, &man.moused.8;, and turn on the mouse pointer in the virtual console: &prompt.root; moused -p /dev/xxxx -t yyyy &prompt.root; vidcontrol -m on Where xxxx is the mouse device name and yyyy is a protocol type for the mouse. The mouse daemon can automatically determine the protocol type of most mice, except old serial mice. Specify the auto protocol to invoke automatic detection. If automatic detection does not work, see the &man.moused.8; manual page for a list of supported protocol types. For a PS/2 mouse, add moused_enable="YES" to /etc/rc.conf to start the mouse daemon at boot time. Additionally, to use the mouse daemon on all virtual terminals instead of just the console, add allscreens_flags="-m on" to /etc/rc.conf. When the mouse daemon is running, access to the mouse must be coordinated between the mouse daemon and other programs such as X Windows. Refer to the FAQ Why does my mouse not work with X? for more details on this issue. How do I cut and paste text with a mouse in the text console? It is not possible to remove data using the mouse. However, it is possible to copy and paste. Once the mouse daemon is running as described in the previous question, hold down button 1 (left button) and move the mouse to select a region of text. Then, press button 2 (middle button) to paste it at the text cursor. Pressing button 3 (right button) will extend the selected region of text. If the mouse does not have a middle button, it is possible to emulate one or remap buttons using mouse daemon options. See the &man.moused.8; manual page for details. My mouse has a fancy wheel and buttons. Can I use them in &os;? The answer is, unfortunately, It depends. These mice with additional features require specialized driver in most cases. Unless the mouse device driver or the user program has specific support for the mouse, it will act just like a standard two, or three button mouse. For the possible usage of wheels in the X Window environment, refer to that section. How do I use my delete key in sh and csh? For the Bourne Shell, add the following lines to ~/.shrc. See &man.sh.1; and &man.editrc.5;. bind ^[[3~ ed-delete-next-char # for xterm For the C Shell, add the following lines to ~/.cshrc. See &man.csh.1;. 
bindkey ^[[3~ delete-char # for xterm Other Hardware Workarounds for no sound from my &man.pcm.4; sound card? Some sound cards set their output volume to 0 at every boot. Run the following command every time the machine boots: &prompt.root; mixer pcm 100 vol 100 cd 100 Does &os; support power management on my laptop? &os; supports the ACPI features found in modern hardware. Further information can be found in &man.acpi.4;. Troubleshooting Why is &os; finding the wrong amount of memory on &i386; hardware? The most likely reason is the difference between physical memory addresses and virtual addresses. The convention for most PC hardware is to use the memory area between 3.5 GB and 4 GB for a special purpose (usually for PCI). This address space is used to access PCI hardware. As a result real, physical memory cannot be accessed by that address space. What happens to the memory that should appear in that location is hardware dependent. Unfortunately, some hardware does nothing and the ability to use that last 500 MB of RAM is entirely lost. Luckily, most hardware remaps the memory to a higher location so that it can still be used. However, this can cause some confusion when watching the boot messages. On a 32-bit version of &os;, the memory appears lost, since it will be remapped above 4 GB, which a 32-bit kernel is unable to access. In this case, the solution is to build a PAE enabled kernel. See the entry on memory limits for more information. On a 64-bit version of &os;, or when running a PAE-enabled kernel, &os; will correctly detect and remap the memory so it is usable. During boot, however, it may seem as if &os; is detecting more memory than the system really has, due to the described remapping. This is normal and the available memory will be corrected as the boot process completes. Why do my programs occasionally die with Signal 11 errors? Signal 11 errors are caused when a process has attempted to access memory which the operating system has not granted it access to. If something like this is happening at seemingly random intervals, start investigating the cause. These problems can usually be attributed to either: If the problem is occurring only in a specific custom application, it is probably a bug in the code. If it is a problem with part of the base &os; system, it may also be buggy code, but more often than not these problems are found and fixed long before us general FAQ readers get to use these bits of code (that is what -CURRENT is for). It is probably not a &os; bug if the problem occurs compiling a program, but the activity that the compiler is carrying out changes each time. For example, if make buildworld fails while trying to compile ls.c into ls.o and, when run again, it fails in the same place, this is a broken build. Try updating source and try again. If the compile fails elsewhere, it is almost certainly due to hardware. In the first case, use a debugger such as &man.gdb.1; to find the point in the program which is attempting to access a bogus address and fix it. In the second case, verify which piece of hardware is at fault. Common causes of this include: The hard disks might be overheating: Check that the fans are still working, as the disk and other hardware might be overheating. The processor running is overheating: This might be because the processor has been overclocked, or the fan on the processor might have died. In either case, ensure that the hardware is running at what it is specified to run at, at least while trying to solve this problem. 
If it is not, clock it back to the default settings. Regarding overclocking, it is far cheaper to have a slow system than a fried system that needs replacing! Also, the community is not sympathetic to problems on overclocked systems. Dodgy memory: if multiple memory SIMMs/DIMMs are installed, pull them all out and try running the machine with each SIMM or DIMM individually to narrow the problem down to either the problematic DIMM/SIMM or perhaps even a combination. Over-optimistic motherboard settings: the BIOS settings, and some motherboard jumpers, provide options to set various timings. The defaults are often sufficient, but sometimes setting the wait states on RAM too low, or setting the RAM Speed: Turbo option will cause strange behavior. One option is to reset the BIOS to its defaults, after noting the current settings first. Unclean or insufficient power to the motherboard. Remove any unused I/O boards, hard disks, or CD-ROMs, or disconnect the power cable from them, to see if the power supply can manage a smaller load. Or try another power supply, preferably one with a little more power. For instance, if the current power supply is rated at 250 Watts, try one rated at 300 Watts. Read the section on Signal 11 for a further explanation and a discussion on how memory testing software or hardware can still pass faulty memory. There is an extensive FAQ on this at the SIG11 problem FAQ. Finally, if none of this has helped, it is possibly a bug in &os;. Follow these instructions to send a problem report. My system crashes with either Fatal trap 12: page fault in kernel mode, or panic:, and spits out a bunch of information. What should I do? The &os; developers are interested in these errors, but need more information than just the error message. Copy the full crash message. Then consult the FAQ section on kernel panics, build a debugging kernel, and get a backtrace. This might sound difficult, but does not require any programming skills. Just follow the instructions. What is the meaning of the error maxproc limit exceeded by uid %i, please see tuning(7) and login.conf(5)? The &os; kernel will only allow a certain number of processes to exist at one time. The number is based on the kern.maxusers &man.sysctl.8; variable. kern.maxusers also affects various other in-kernel limits, such as network buffers. If the machine is heavily loaded, increase kern.maxusers. This will increase these other system limits in addition to the maximum number of processes. To adjust the kern.maxusers value, see the File/Process Limits section of the Handbook. While that section refers to open files, the same limits apply to processes. If the machine is lightly loaded but running a very large number of processes, adjust the kern.maxproc tunable by defining it in /boot/loader.conf. The tunable will not get adjusted until the system is rebooted. For more information about tuning tunables, see &man.loader.conf.5;. If these processes are being run by a single user, adjust kern.maxprocperuid to be one less than the new kern.maxproc value. It must be at least one less because one system program, &man.init.8;, must always be running. Why do full screen applications on remote machines misbehave? The remote machine may be setting the terminal type to something other than xterm, which is required by the &os; console. Alternatively, the kernel may have the wrong values for the width and height of the terminal. Check that the value of the TERM environment variable is xterm. If the remote machine does not support that, try vt100.
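For example, one quick way to inspect the terminal type in the current session and, if necessary, override it is shown below for an sh-compatible shell (vt100 here is just an illustrative fallback value):
&prompt.user; echo $TERM
&prompt.user; TERM=vt100; export TERM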
Run stty -a to check what the kernel thinks the terminal dimensions are. If they are incorrect, they can be changed by running stty rows RR cols CC. Alternatively, if the client machine has x11/xterm installed, then running resize will query the terminal for the correct dimensions and set them. Why does it take so long to connect to my computer via ssh or telnet? The symptom: there is a long delay between the time the TCP connection is established and the time when the client software asks for a password (or, in &man.telnet.1;'s case, when a login prompt appears). The problem: more likely than not, the delay is caused by the server software trying to resolve the client's IP address into a hostname. Many servers, including the Telnet and SSH servers that come with &os;, do this to store the hostname in a log file for future reference by the administrator. The remedy: if the problem occurs whenever connecting the client computer to any server, the problem is with the client. If the problem only occurs when someone connects to the server computer, the problem is with the server. If the problem is with the client, the only remedy is to fix the DNS so the server can resolve it. If this is on a local network, consider it a server problem and keep reading. If this is on the Internet, contact your ISP. If the problem is with the server on a local network, configure the server to resolve address-to-hostname queries for the local address range. See &man.hosts.5; and &man.named.8; for more information. If this is on the Internet, the problem may be that the local server's resolver is not functioning correctly. To check, try to look up another host such as www.yahoo.com. If it does not work, that is the problem. Following a fresh install of &os;, it is also possible that domain and name server information is missing from /etc/resolv.conf. This will often cause a delay in SSH, as the option UseDNS is set to yes by default in /etc/ssh/sshd_config. If this is causing the problem, either fill in the missing information in /etc/resolv.conf or set UseDNS to no in sshd_config as a temporary workaround. Why does file: table is full show up repeatedly in &man.dmesg.8;? This error message indicates that the number of available file descriptors has been exhausted on the system. Refer to the kern.maxfiles section of the Tuning Kernel Limits section of the Handbook for a discussion and solution. Why does the clock on my computer keep incorrect time? The computer has two or more clocks, and &os; has chosen to use the wrong one. Run &man.dmesg.8;, and check for lines that contain Timecounter. The one with the highest quality value is the one that &os; chose. &prompt.root; dmesg | grep Timecounter Timecounter "i8254" frequency 1193182 Hz quality 0 Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000 Timecounter "TSC" frequency 2998570050 Hz quality 800 Timecounters tick every 1.000 msec Confirm this by checking the kern.timecounter.hardware &man.sysctl.3;. &prompt.root; sysctl kern.timecounter.hardware kern.timecounter.hardware: ACPI-fast It may be a broken ACPI timer. The simplest solution is to disable the ACPI timer in /boot/loader.conf: debug.acpi.disabled="timer" Or the BIOS may modify the TSC clock, perhaps to change the speed of the processor when running from batteries or when going into a power saving mode. &os; is unaware of these adjustments and appears to gain or lose time.
In this example, the i8254 clock is also available, and can be selected by writing its name to the kern.timecounter.hardware &man.sysctl.3;. &prompt.root; sysctl kern.timecounter.hardware=i8254 kern.timecounter.hardware: TSC -> i8254 The computer should now start keeping more accurate time. To have this change automatically run at boot time, add the following line to /etc/sysctl.conf: kern.timecounter.hardware=i8254 What does the error swap_pager: indefinite wait buffer: mean? This means that a process is trying to page memory from disk, and the page attempt has hung trying to access the disk for more than 20 seconds. It might be caused by bad blocks on the disk drive, disk wiring, cables, or any other disk I/O-related hardware. If the drive itself is bad, disk errors will appear in /var/log/messages and in the output of dmesg. Otherwise, check the cables and connections. What is a lock order reversal? The &os; kernel uses a number of resource locks to arbitrate contention for certain resources. When multiple kernel threads try to obtain multiple resource locks, there's always the potential for a deadlock, where two threads have each obtained one of the locks and block forever waiting for the other thread to release one of the other locks. This sort of locking problem can be avoided if all threads obtain the locks in the same order. A run-time lock diagnostic system called &man.witness.4;, enabled in &os.current; and disabled by default for stable branches and releases, detects the potential for deadlocks due to locking errors, including errors caused by obtaining multiple resource locks with a different order from different parts of the kernel. The &man.witness.4; framework tries to detect this problem as it happens, and reports it by printing a message to the system console about a lock order reversal (often also referred to as LOR). It is possible to get false positives, as &man.witness.4; is conservative. A true positive report does not mean that a system is deadlocked; instead it should be understood as a warning that a deadlock could have happened here. Problematic LORs tend to get fixed quickly, so check the &a.current; before posting to it. What does Called ... with the following non-sleepable locks held mean? This means that a function that may sleep was called while a mutex (or other unsleepable) lock was held. The reason this is an error is that mutexes are not intended to be held for long periods of time; they are supposed to only be held to maintain short periods of synchronization. This programming contract allows device drivers to use mutexes to synchronize with the rest of the kernel during interrupts. Interrupts (under &os;) may not sleep. Hence it is imperative that no subsystem in the kernel block for an extended period while holding a mutex. To catch such errors, assertions may be added to the kernel that interact with the &man.witness.4; subsystem to emit a warning or fatal error (depending on the system configuration) when a potentially blocking call is made while holding a mutex. In summary, such warnings are non-fatal; however, with unfortunate timing they could cause undesirable effects ranging from a minor blip in the system's responsiveness to a complete system lockup. For additional information about locking in &os;, see &man.locking.9;. Why does buildworld/installworld die with the message touch: not found? This error does not mean that the &man.touch.1; utility is missing.
The error is instead probably due to the dates of the files being set sometime in the future. If the CMOS clock is set to local time, run adjkerntz -i to adjust the kernel clock when booting into single-user mode. User Applications Where are all the user applications? Refer to the ports page for info on software packages ported to &os;. Most ports should work on all supported versions of &os;. Those that do not are specifically marked as such. Each time a &os; release is made, a snapshot of the ports tree at the time of release is also included in the ports/ directory. &os; supports compressed binary packages to easily install and uninstall ports. Use &man.pkg.7; to control the installation of packages. How do I download the Ports tree? Should I be using Subversion? Any of the methods listed here work: Use portsnap for most use cases. Refer to Using the Ports Collection for instructions on how to use this tool. Use Subversion if custom patches to the ports tree are needed or if running &os.current;. Refer to Using Subversion for details. Why can I not build this port on my &rel2.relx; -, or &rel.relx; -STABLE machine? If the installed &os; version lags significantly behind -CURRENT or -STABLE, update the Ports Collection using the instructions in Using the Ports Collection. If the system is up-to-date, someone might have committed a change to the port which works for -CURRENT but which broke the port for -STABLE. Submit a bug report, since the Ports Collection is supposed to work for both the -CURRENT and -STABLE branches. I just tried to build INDEX using make index, and it failed. Why? First, make sure that the Ports Collection is up-to-date. Errors that affect building INDEX from an up-to-date copy of the Ports Collection are high-visibility and are thus almost always fixed immediately. There are rare cases where INDEX will not build due to odd cases involving OPTIONS_SET being set in make.conf. If you suspect that this is the case, try to make INDEX with those variables turned off before reporting it to &a.ports;. I updated the sources; now how do I update my installed ports? &os; does not include a port upgrading tool, but it does have some tools to make the upgrade process somewhat easier. Additional tools are available to simplify port handling and are described in the Upgrading Ports section in the &os; Handbook. Do I need to recompile every port each time I perform a major version update? Yes! While a recent system will run with software compiled under an older release, things will randomly crash and fail to work once other ports are installed or updated. When the system is upgraded, various shared libraries, loadable modules, and other parts of the system will be replaced with newer versions. Applications linked against the older versions may fail to start or, in other cases, fail to function properly. For more information, see the section on upgrades in the &os; Handbook. Do I need to recompile every port each time I perform a minor version update? In general, no. &os; developers do their utmost to guarantee binary compatibility across all releases with the same major version number. Any exceptions will be documented in the Release Notes, and advice given there should be followed. Why is /bin/sh so minimal? Why does &os; not use bash or another shell? Many people need to write shell scripts which will be portable across many systems. That is why &posix; specifies the shell and utility commands in great detail.
Most scripts are written in Bourne shell (&man.sh.1;), and because several important programming interfaces (&man.make.1;, &man.system.3;, &man.popen.3;, and analogues in higher-level scripting languages like Perl and Tcl) are specified to use the Bourne shell to interpret commands. As the Bourne shell is so often and widely used, it is important for it to be quick to start, be deterministic in its behavior, and have a small memory footprint. The existing implementation is our best effort at meeting as many of these requirements simultaneously as we can. To keep /bin/sh small, we have not provided many of the convenience features that other shells have. That is why other more featureful shells like bash, scsh, &man.tcsh.1;, and zsh are available. Compare the memory utilization of these shells by looking at the VSZ and RSS columns in a ps -u listing. Kernel Configuration I would like to customize my kernel. Is it difficult? Not at all! Check out the kernel config section of the Handbook. The new kernel will be installed to the /boot/kernel directory along with its modules, while the old kernel and its modules will be moved to the /boot/kernel.old directory. If a mistake is made in the configuration, simply boot the previous version of the kernel. Why is my kernel so big? GENERIC kernels shipped with &os; are compiled in debug mode. Kernels built in debug mode store their debugging data in separate files. &os; releases prior to 11.0 store these debug files in the same directory as the kernel itself, /boot/kernel/. In &os; 11.0 and later, the debug files are stored in /usr/lib/debug/boot/kernel/. Note that there will be little or no performance loss from running a debug kernel, and it is useful to keep one around in case of a system panic. When running low on disk space, there are different options to reduce the size of /boot/kernel/ and /usr/lib/debug/. To not install the symbol files, make sure the following line exists in /etc/src.conf: WITHOUT_KERNEL_SYMBOLS=yes For more information, see &man.src.conf.5;. If you want to avoid building debug files altogether, make sure that both of the following are true: This line does not exist in the kernel configuration file: makeoptions DEBUG=-g Do not run &man.config.8; with -g. Either of the above settings will cause the kernel to be built in debug mode. To build and install only the specified modules, list them in /etc/make.conf: MODULES_OVERRIDE= accf_http ipfw Replace accf_http ipfw with a list of needed modules. Only the listed modules will be built. This reduces the size of the kernel directory and decreases the amount of time needed to build the kernel. For more information, read /usr/share/examples/etc/make.conf. Unneeded devices can be removed from the kernel to further reduce the size. See for more information. To put any of these options into effect, follow the instructions to build and install the new kernel. For reference, the &os; 11 &arch.amd64; kernel (/boot/kernel/kernel) is approximately 25 MB. Why does every kernel I try to build fail to compile, even GENERIC? There are a number of possible causes for this problem: The source tree is different from the one used to build the currently running system. When attempting an upgrade, read /usr/src/UPDATING, paying particular attention to the COMMON ITEMS section at the end. The make buildkernel did not complete successfully.
The make buildkernel target relies on files generated by the make buildworld target to complete its job correctly. Even when building &os;-STABLE, it is possible that the source tree was fetched at a time when it was either being modified or it was broken. Only releases are guaranteed to be buildable, although &os;-STABLE builds fine the majority of the time. Try re-fetching the source tree and see if the problem goes away. Try using a different mirror in case the previous one is having problems. Which scheduler is in use on a running system? The name of the scheduler currently being used is directly available as the value of the kern.sched.name sysctl: &prompt.user; sysctl kern.sched.name kern.sched.name: ULE What is kern.sched.quantum? kern.sched.quantum is the maximum number of ticks a process can run without being preempted in the 4BSD scheduler. Disks, File Systems, and Boot Loaders How can I add my new hard disk to my &os; system? See the Adding Disks section in the &os; Handbook. How do I move my system over to my huge new disk? The best way is to reinstall the operating system on the new disk, then move the user data over. This is highly recommended when tracking -STABLE for more than one release or when updating a release instead of installing a new one. Install booteasy on both disks with &man.boot0cfg.8; and dual boot until you are happy with the new configuration. Skip the next paragraph to find out how to move the data after doing this. Alternatively, partition and label the new disk with either &man.sade.8; or &man.gpart.8;. If the disks are MBR-formatted, booteasy can be installed on both disks with &man.boot0cfg.8; so that the computer can dual boot to the old or new system after the copying is done. Once the new disk is set up, the data cannot just be copied. Instead, use tools that understand device files and system flags, such as &man.dump.8;. Although it is recommended to move the data while in single-user mode, it is not required. When the disks are formatted with UFS, never use anything but &man.dump.8; and &man.restore.8; to move the root file system. These commands should also be used when moving a single partition to another empty partition. The sequence of steps to use dump to move the data from one UFS partition to a new partition is: newfs the new partition. mount it on a temporary mount point. cd to that directory. dump the old partition, piping output to the new one. For example, to move /dev/ada1s1a with /mnt as the temporary mount point, type: &prompt.root; newfs /dev/ada1s1a &prompt.root; mount /dev/ada1s1a /mnt &prompt.root; cd /mnt &prompt.root; dump 0af - / | restore rf - Rearranging partitions with dump takes a bit more work.
To merge a partition like /var into its parent, create the new partition large enough for both, move the parent partition as described above, then move the child partition into the empty directory that the first move created: &prompt.root; newfs /dev/ada1s1a &prompt.root; mount /dev/ada1s1a /mnt &prompt.root; cd /mnt &prompt.root; dump 0af - / | restore rf - &prompt.root; cd var &prompt.root; dump 0af - /var | restore rf - To split a directory from its parent, say putting /var on its own partition when it was not before, create both partitions, then mount the child partition on the appropriate directory in the temporary mount point, then move the old single partition: &prompt.root; newfs /dev/ada1s1a &prompt.root; newfs /dev/ada1s1d &prompt.root; mount /dev/ada1s1a /mnt &prompt.root; mkdir /mnt/var &prompt.root; mount /dev/ada1s1d /mnt/var &prompt.root; cd /mnt &prompt.root; dump 0af - / | restore rf - The &man.cpio.1; and &man.pax.1; utilities are also available for moving user data. These are known to lose file flag information, so use them with caution. Which partitions can safely use Soft Updates? I have heard that Soft Updates on / can cause problems. What about Journaled Soft Updates? Short answer: Soft Updates can usually be safely used on all partitions. Long answer: Soft Updates has two characteristics that may be undesirable on certain partitions. First, a Soft Updates partition has a small chance of losing data during a system crash. The partition will not be corrupted; the data will simply be lost. Second, Soft Updates can cause temporary space shortages. When using Soft Updates, the kernel can take up to thirty seconds to write changes to the physical disk. When a large file is deleted, the file still resides on disk until the kernel actually performs the deletion. This can cause a very simple race condition. Suppose one large file is deleted and another large file is immediately created. The first large file is not yet actually removed from the physical disk, so the disk might not have enough room for the second large file. This will produce an error that the partition does not have enough space, even though a large chunk of space has just been released. A few seconds later, the file creation works as expected. If a system should crash after the kernel accepts a chunk of data for writing to disk, but before that data is actually written out, data could be lost. This risk is extremely small, but generally manageable. These issues affect all partitions using Soft Updates. So, what does this mean for the root partition? Vital information on the root partition changes very rarely. If the system crashed during the thirty-second window after such a change is made, it is possible that data could be lost. This risk is negligible for most applications, but be aware that it exists. If the system cannot tolerate this much risk, do not use Soft Updates on the root file system! / is traditionally one of the smallest partitions. If /tmp is on /, there may be intermittent space problems. Symlinking /tmp to /var/tmp will solve this problem. Finally, &man.dump.8; does not work in live mode (-L) on a file system with Journaled Soft Updates (SU+J). Can I mount other foreign file systems under &os;? &os; supports a variety of other file systems. UFS UFS CD-ROMs can be mounted directly on &os;. Mounting disk partitions from Digital UNIX and other systems that support UFS may be more complex, depending on the details of the disk partitioning for the operating system in question.
ext2/ext3 &os; supports ext2fs and ext3fs partitions. See &man.ext2fs.5; for more information. NTFS FUSE-based NTFS support is available as a port (sysutils/fusefs-ntfs). For more information, see ntfs-3g. FAT &os; includes a read-write FAT driver. For more information, see &man.mount.msdosfs.8;. ZFS &os; includes a port of &sun;'s ZFS driver. The current recommendation is to use it only on &arch.amd64; platforms with sufficient memory. For more information, see &man.zfs.8;. &os; includes the Network File System (NFS), and the &os; Ports Collection provides several FUSE applications to support many other file systems. How do I mount a secondary DOS partition? The secondary DOS partitions are found after all the primary partitions. For example, if E is the second DOS partition on the second SCSI drive, there will be a device file for slice 5 in /dev. To mount it: &prompt.root; mount -t msdosfs /dev/da1s5 /dos/e Is there a cryptographic file system for &os;? Yes, &man.gbde.8; and &man.geli.8;. See the Encrypting Disk Partitions section of the &os; Handbook. How do I boot &os; and &linux; using GRUB? To boot &os; using GRUB, add the following to either /boot/grub/menu.lst or /boot/grub/grub.conf, depending upon which is used by the &linux; distribution. title &os; 9.1 root (hd0,a) kernel /boot/loader Where hd0,a points to the root partition on the first disk. To specify the slice number, use something like this (hd0,2,a). By default, if the slice number is omitted, GRUB searches the first slice which has the a partition. How do I boot &os; and &linux; using BootEasy? Install LILO at the start of the &linux; boot partition instead of in the Master Boot Record. Then boot LILO from BootEasy. This is recommended when running &windows; and &linux; as it makes it simpler to get &linux; booting again if &windows; is reinstalled. How do I change the boot prompt from ??? to something more meaningful? This cannot be accomplished with the standard boot manager without rewriting it. There are a number of other boot managers in the sysutils category of the Ports Collection. How do I use a new removable drive? If the drive already has a file system on it, use a command like this: &prompt.root; mount -t msdosfs /dev/da0s1 /mnt If the drive will only be used with &os; systems, partition it with UFS or ZFS. This will provide long filename support and improvements in performance and stability. If the drive will be used by other operating systems, a more portable choice, such as msdosfs, is better. &prompt.root; dd if=/dev/zero of=/dev/da0 count=2 &prompt.root; gpart create -s GPT /dev/da0 &prompt.root; gpart add -t freebsd-ufs /dev/da0 Finally, create a new file system: &prompt.root; newfs /dev/da0p1 and mount it: &prompt.root; mount /dev/da0p1 /mnt It is a good idea to add a line to /etc/fstab (see &man.fstab.5;) so you can just type mount /mnt in the future: /dev/da0p1 /mnt ufs rw,noauto 0 0 Why do I get Incorrect super block when mounting a CD? The type of device to mount must be specified. This is described in the Handbook section on Using Data CDs. Why do I get Device not configured when mounting a CD? This generally means that there is no CD in the drive, or the drive is not visible on the bus. Refer to the Using Data CDs section of the Handbook for a detailed discussion of this issue. Why do all non-English characters in filenames show up as ? on my CDs when mounted in &os;? The CD probably uses the Joliet extension for storing information about files and directories.
This is discussed in the Handbook section on Using Data CD-ROMs. A CD burned under &os; cannot be read under any other operating system. Why? This means a raw file was burned to the CD, rather than creating an ISO 9660 file system. Take a look at the Handbook section on Using Data CDs. How can I create an image of a data CD? This is discussed in the Handbook section on Writing Data to an ISO File System. For more on working with CD-ROMs, see the Creating CDs Section in the Storage chapter in the Handbook. Why can I not mount an audio CD? Trying to mount an audio CD will produce an error like cd9660: /dev/cd0: Invalid argument. This is because mount only works on file systems. Audio CDs do not have file systems; they just have data. Instead, use a program that reads audio CDs, such as the audio/xmcd package or port. How do I mount a multi-session CD? By default, &man.mount.8; will attempt to mount the last data track (session) of a CD. To load an earlier session, pass the appropriate command line argument; refer to &man.mount.cd9660.8; for specific examples. How do I let ordinary users mount CD-ROMs, DVDs, USB drives, and other removable media? As root, set the sysctl variable vfs.usermount to 1. &prompt.root; sysctl vfs.usermount=1 To make this persist across reboots, add the line vfs.usermount=1 to /etc/sysctl.conf so that it is set at system boot time. Users can only mount devices they have read permissions to. To allow users to mount a device, permissions must be set in /etc/devfs.conf. For example, to allow users to mount the first USB drive, add: # Allow all users to mount a USB drive. own /dev/da0 root:operator perm /dev/da0 0666 All users can now mount devices they could read onto a directory that they own: &prompt.user; mkdir ~/my-mount-point &prompt.user; mount -t msdosfs /dev/da0 ~/my-mount-point Unmounting the device is simple: &prompt.user; umount ~/my-mount-point Enabling vfs.usermount, however, has negative security implications. A better way to access &ms-dos;-formatted media is to use the emulators/mtools package in the Ports Collection. The device name used in the previous examples must be changed according to the configuration. The du and df commands show different amounts of disk space available. What is going on? This is due to how these commands actually work. du goes through the directory tree, measures how large each file is, and presents the totals. df just asks the file system how much space it has left. They seem to be the same thing, but a file without a directory entry will affect df but not du. When a program is using a file, and the file is deleted, the file is not really removed from the file system until the program stops using it. The file is immediately deleted from the directory listing, however. As an example, consider a file large enough to affect the output of du and df. A file being viewed with more can be deleted without causing an error. The entry is removed from the directory so no other program or user can access it. However, du shows that it is gone as it has walked the directory tree and the file is not listed. df shows that it is still there, as the file system knows that more is still using that space. Once the more session ends, du and df will agree. This situation is common on web servers. Many people set up a &os; web server and forget to rotate the log files. The access log fills up /var. The new administrator deletes the file, but the system still complains that the partition is full.
Stopping and restarting the web server program would free the file, allowing the system to release the disk space. To prevent this from happening, set up &man.newsyslog.8;. Note that Soft Updates can delay the freeing of disk space and it can take up to 30 seconds for the change to be visible. How can I add more swap space? This section of the Handbook describes how to do this. Why does &os; see my disk as smaller than the manufacturer says it is? Disk manufacturers calculate gigabytes as a billion bytes each, whereas &os; calculates them as 1,073,741,824 bytes each. This explains why, for example, &os;'s boot messages will report a disk that supposedly has 80 GB as holding 76,319 MB. Also note that &os; will (by default) reserve 8% of the disk space. How is it possible for a partition to be more than 100% full? A portion of each UFS partition (8%, by default) is reserved for use by the operating system and the root user. &man.df.1; does not count that space when calculating the Capacity column, so it can exceed 100%. Notice that the Blocks column is always greater than the sum of the Used and Avail columns, usually by a factor of 8%. For more details, see &man.tunefs.8;. ZFS What is the minimum amount of RAM one should have to run ZFS? A minimum of 4 GB of RAM is required for comfortable usage, but individual workloads can vary widely. What is the ZIL and when does it get used? The ZIL (ZFS intent log) is a write log used to implement &posix; write commitment semantics across crashes. Normally writes are bundled up into transaction groups and written to disk when filled (Transaction Group Commit). However, syscalls like &man.fsync.2; require a commitment that the data is written to stable storage before returning. The ZIL is needed for writes that have been acknowledged as written but which are not yet on disk as part of a transaction. The transaction groups are timestamped. In the event of a crash, the last valid timestamp is found and missing data is merged in from the ZIL. Do I need an SSD for the ZIL? By default, ZFS stores the ZIL in the pool with all the data. If an application has a heavy write load, storing the ZIL on a separate device that has very fast synchronous, sequential write performance can improve overall system performance. For other workloads, an SSD is unlikely to make much of an improvement. What is the L2ARC? The L2ARC is a read cache stored on a fast device such as an SSD. This cache is not persistent across reboots. Note that RAM is used as the first layer of cache and the L2ARC is only needed if there is insufficient RAM. L2ARC needs space in the ARC to index it. So, perversely, a working set that fits perfectly in the ARC will not fit perfectly any more if an L2ARC is used because part of the ARC is holding the L2ARC index, pushing part of the working set into the L2ARC, which is slower than RAM. Is enabling deduplication advisable? Generally speaking, no. Deduplication takes up a significant amount of RAM and may slow down read and write disk access times. Unless one is storing data that is very heavily duplicated, such as virtual machine images or user backups, it is possible that deduplication will do more harm than good. Another consideration is the inability to revert deduplication status. If data is written when deduplication is enabled, disabling dedup will not cause those blocks which were deduplicated to be replicated until they are next modified. Deduplication can also lead to some unexpected situations. In particular, deleting files may become much slower.
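Before enabling deduplication, it can be worth estimating how much space it would actually save. As a rough sketch (tank is an illustrative pool name), &man.zdb.8; can simulate deduplication on the existing data and print a histogram along with the expected dedup ratio:
&prompt.root; zdb -S tank
A reported ratio close to 1.00 suggests that deduplication would consume RAM without saving much space.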
I cannot delete or create files on my ZFS pool. How can I fix this? This could happen because the pool is 100% full. ZFS requires space on the disk to write transaction metadata. To restore the pool to a usable state, truncate the file to delete: &prompt.user; truncate -s 0 unimportant-file File truncation works because a new transaction is not started; new spare blocks are created instead. On systems with additional ZFS dataset tuning, such as deduplication, the space may not be immediately available. Does ZFS support TRIM for Solid State Drives? ZFS TRIM support was added to &os; 10-CURRENT with revision r240868. ZFS TRIM support was added to all &os;-STABLE branches in r252162 and r251419, respectively. ZFS TRIM is enabled by default, and can be turned off by adding this line to /etc/sysctl.conf: vfs.zfs.trim.enabled=0 ZFS TRIM support was added to GELI as of r286444. Please see &man.geli.8; for the relevant switch. System Administration Where are the system start-up configuration files? The primary configuration file is /etc/defaults/rc.conf, which is described in &man.rc.conf.5;. System startup scripts such as /etc/rc and /etc/rc.d, which are described in &man.rc.8;, include this file. Do not edit this file! Instead, to edit an entry in /etc/defaults/rc.conf, copy the line into /etc/rc.conf and change it there. For example, to start &man.named.8;, the included DNS server: &prompt.root; echo 'named_enable="YES"' >> /etc/rc.conf To start up local services, place shell scripts in the /usr/local/etc/rc.d directory. These shell scripts should be set executable; the default file mode is 555. How do I add a user easily? Use the &man.adduser.8; command, or the &man.pw.8; command for more complicated situations. To remove the user, use the &man.rmuser.8; command or, if necessary, &man.pw.8;. Why do I keep getting messages like root: not found after editing /etc/crontab? This is normally caused by editing the system crontab. This is not the correct way to do things as the system crontab has a different format from the per-user crontabs. The system crontab has an extra field, specifying which user to run the command as. &man.cron.8; assumes this user is the first word of the command to execute. Since no such command exists, this error message is displayed. To delete the extra, incorrect crontab: &prompt.root; crontab -r Why do I get the error, you are not in the correct group to su root when I try to su to root? This is a security feature. In order to su to root, or any other account with superuser privileges, the user account must be a member of the wheel group. If this feature were not there, anybody with an account on a system who also found out root's password would be able to gain superuser-level access to the system. To allow someone to su to root, put them in the wheel group using pw: &prompt.root; pw groupmod wheel -m lisa The above example will add user lisa to the group wheel. I made a mistake in rc.conf, or another startup file, and now I cannot edit it because the file system is read-only. What should I do? Restart the system using boot -s at the loader prompt to enter single-user mode. When prompted for a shell pathname, press Enter and run mount -urw / to re-mount the root file system in read/write mode. You may also need to run mount -a -t ufs to mount the file system where your favorite editor is defined.
If that editor is on a network file system, either configure the network manually before mounting the network file systems, or use an editor which resides on a local file system, such as &man.ed.1;. In order to use a full screen editor such as &man.vi.1; or &man.emacs.1;, run export TERM=xterm so that these editors can load the correct data from the &man.termcap.5; database. After performing these steps, edit /etc/rc.conf to fix the syntax error. The error message displayed immediately after the kernel boot messages should indicate the number of the line in the file which is at fault. Why am I having trouble setting up my printer? See the Handbook entry on printing for troubleshooting tips. How can I correct the keyboard mappings for my system? Refer to the Handbook section on using localization, specifically the section on console setup. Why can I not get user quotas to work properly? It is possible that the kernel is not configured to use quotas. In this case, add the following line to the kernel configuration file and recompile the kernel: options QUOTA Refer to the Handbook entry on quotas for full details. Do not turn on quotas on /. Put the quota file on the file system that the quotas are to be enforced on:
File System    Quota file
/usr           /usr/admin/quotas
/home          /home/admin/quotas
Does &os; support System V IPC primitives? Yes, &os; supports System V-style IPC, including shared memory, messages, and semaphores, in the GENERIC kernel. With a custom kernel, support may be loaded with the sysvshm.ko, sysvsem.ko and sysvmsg.ko kernel modules, or enabled in the custom kernel by adding the following lines to the kernel configuration file: options SYSVSHM # enable shared memory options SYSVSEM # enable semaphores options SYSVMSG # enable messaging Recompile and install the kernel. What other mail-server software can I use instead of Sendmail? The Sendmail server is the default mail-server software for &os;, but it can be replaced with another MTA installed from the Ports Collection. Available ports include mail/exim, mail/postfix, and mail/qmail. Search the mailing lists for discussions regarding the advantages and disadvantages of the available MTAs. I have forgotten the root password! What do I do? Do not panic! Restart the system, type boot -s at the Boot: prompt to enter single-user mode. At the question about the shell to use, hit Enter, which will display a &prompt.root; prompt. Enter mount -urw / to remount the root file system read/write, then run mount -a to remount all the file systems. Run passwd root to change the root password, then run &man.exit.1; to continue booting. If you are still prompted to give the root password when entering single-user mode, it means that the console has been marked as insecure in /etc/ttys. In this case, it will be required to boot from a &os; installation disk, choose the Live CD or Shell at the beginning of the install process and issue the commands mentioned above. Mount the specific partition in this case and then chroot to it. For example, replace mount -urw / with mount /dev/ada0p1 /mnt; chroot /mnt for a system on ada0p1. If the root partition cannot be mounted from single-user mode, it is possible that the partitions are encrypted and it is impossible to mount them without the access keys. For more information, see the section about encrypted disks in the &os; Handbook. How do I keep ControlAltDelete from rebooting the system?
When using &man.vt.4;, the default console driver, this can be done by setting the following &man.sysctl.8;: &prompt.root; sysctl kern.vt.kbd_reboot=0 How do I reformat DOS text files to &unix; ones? Use this &man.perl.1; command: &prompt.user; perl -i.bak -npe 's/\r\n/\n/g' file(s) where file(s) is one or more files to process. The modification is done in-place, with the original file stored with a .bak extension. Alternatively, use &man.tr.1;: &prompt.user; tr -d '\r' < dos-text-file > unix-file dos-text-file is the file containing DOS text while unix-file will contain the converted output. This can be quite a bit faster than using perl. Yet another way to reformat DOS text files is to use the converters/dosunix port from the Ports Collection. Consult its documentation about the details. How do I re-read /etc/rc.conf and re-start /etc/rc without a reboot? Go into single-user mode and then back to multi-user mode: &prompt.root; shutdown now &prompt.root; return &prompt.root; exit I tried to update my system to the latest -STABLE, but got -BETAx, -RC or -PRERELEASE! What is going on? Short answer: it is just a name. RC stands for Release Candidate. It signifies that a release is imminent. In &os;, -PRERELEASE is typically synonymous with the code freeze before a release. (For some releases, the -BETA label was used in the same way as -PRERELEASE.) Long answer: &os; derives its releases from one of two places. Major, dot-zero, releases, such as 9.0-RELEASE, are branched from the head of the development stream, commonly referred to as -CURRENT. Minor releases, such as 6.3-RELEASE or 5.2-RELEASE, have been snapshots of the active -STABLE branch. Starting with 4.3-RELEASE, each release also now has its own branch, which can be tracked by people requiring an extremely conservative rate of development (typically only security advisories). When a release is about to be made, the branch from which it will be derived has to undergo a certain process. Part of this process is a code freeze. When a code freeze is initiated, the name of the branch is changed to reflect that it is about to become a release. For example, if the branch used to be called 6.2-STABLE, its name will be changed to 6.3-PRERELEASE to signify the code freeze and indicate that extra pre-release testing should be happening. Bug fixes can still be committed to be part of the release. When the source code is in shape for the release, the name will be changed to 6.3-RC to signify that a release is about to be made from it. Once in the RC stage, only the most critical bugs found can be fixed. Once the release (6.3-RELEASE in this example) and release branch have been made, the branch will be renamed to 6.3-STABLE. For more information on version numbers and the various Subversion branches, refer to the Release Engineering article. I tried to install a new kernel, and the &man.chflags.1; failed. How do I get around this? Short answer: the security level is greater than 0. Reboot directly to single-user mode to install the kernel. Long answer: &os; disallows changing system flags at security levels greater than 0. To check the current security level: &prompt.root; sysctl kern.securelevel The security level cannot be lowered in multi-user mode, so boot to single-user mode to install the kernel, or change the security level in /etc/rc.conf then reboot. See the &man.init.8; manual page for details on securelevel, and see /etc/defaults/rc.conf and the &man.rc.conf.5; manual page for more information on rc.conf.
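For illustration, a minimal sketch of the rc.conf knobs involved follows; the values shown are examples only, and &man.rc.conf.5; describes their exact semantics:
kern_securelevel_enable="YES"  # raise the security level during boot
kern_securelevel="1"           # level to raise it to; set the enable knob to "NO" to stop raising it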
I cannot change the time on my system by more than one second! How do I get around this? Short answer: the system is at a security level greater than 1. Reboot directly to single-user mode to change the date. Long answer: &os; disallows changing the time by more than one second at security levels greater than 1. To check the security level: &prompt.root; sysctl kern.securelevel The security level cannot be lowered in multi-user mode. Either boot to single-user mode to change the date or change the security level in /etc/rc.conf and reboot. See the &man.init.8; manual page for details on securelevel, and see /etc/defaults/rc.conf and the &man.rc.conf.5; manual page for more information on rc.conf. Why is rpc.statd using 256 MB of memory? No, there is no memory leak, and it is not using 256 MB of memory. For convenience, rpc.statd maps an obscene amount of memory into its address space. There is nothing terribly wrong with this from a technical standpoint; it just throws off things like &man.top.1; and &man.ps.1;. &man.rpc.statd.8; maps its status file (resident on /var) into its address space; to save worrying about remapping the status file later when it needs to grow, it maps the status file with a generous size. This is very evident from the source code, where one can see that the length argument to &man.mmap.2; is 0x10000000, or one sixteenth of the address space on an IA32, or exactly 256 MB. Why can I not unset the schg file flag? The system is running at securelevel greater than 0. Lower the securelevel and try again. For more information, see the FAQ entry on securelevel and the &man.init.8; manual page. What is vnlru? vnlru flushes and frees vnodes when the system hits the kern.maxvnodes limit. This kernel thread sits mostly idle, and only activates when there is a huge amount of RAM and users are accessing tens of thousands of tiny files. What do the various memory states displayed by top mean?
Active: pages recently statistically used.
Inactive: pages recently statistically unused.
Laundry: pages recently statistically unused but known to be dirty, that is, whose contents need to be paged out before they can be reused.
Free: pages without data content, which can be immediately reused.
Wired: pages that are fixed into memory, usually for kernel purposes, but also sometimes for special use in processes.
Pages are most often written to disk (sort of a VM sync) when they are in the laundry state, but active or inactive pages can also be synced. This depends upon the CPU tracking of the modified bit being available, and in certain situations there can be an advantage for a block of VM pages to be synced, regardless of the queue they belong to. In most common cases, it is best to think of the laundry queue as a queue of relatively unused pages that might or might not be in the process of being written to disk. The inactive queue contains a mix of clean and dirty pages; clean pages near the head of the queue are reclaimed immediately to alleviate a free page shortage, and dirty pages are moved to the laundry queue for deferred processing. There are some other flags (e.g., busy flag or busy count) that might modify some of the described rules. How much free memory is available? There are a couple of kinds of free memory. The most common is the amount of memory immediately available without reclaiming memory already in use. That is the size of the free pages queue plus some other reserved pages.
This amount is exported by the vm.stats.vm.v_free_count &man.sysctl.8;, shown, for instance, by &man.top.1;. Another kind of free memory is the total amount of virtual memory available to userland processes, which depends on the sum of swap space and usable memory. Other descriptions of free memory are also possible, but they are of limited use; what matters is keeping the paging rate low and avoiding running out of swap space. What is /var/empty? /var/empty is a directory that the &man.sshd.8; program uses when performing privilege separation. The /var/empty directory is empty, owned by root, and has the schg flag set. This directory should not be deleted. I just changed /etc/newsyslog.conf. How can I check if it does what I expect? To see what &man.newsyslog.8; will do, use the following: &prompt.user; newsyslog -nrvv My time is wrong; how can I change the timezone? Use &man.tzsetup.8;. The X Window System and Virtual Consoles What is the X Window System? The X Window System (commonly X11) is the most widely available windowing system capable of running on &unix; or &unix;-like systems, including &os;. The X.Org Foundation administers the X protocol standards; the current reference implementation is version 11 release &xorg.version;, so references are often shortened to X11. Many implementations are available for different architectures and operating systems. An implementation of the server-side code is properly known as an X server. I want to run &xorg;, how do I go about it? To install &xorg;, do one of the following: Use the x11/xorg meta-port, which builds and installs every &xorg; component. Use x11/xorg-minimal, which builds and installs only the necessary &xorg; components. Install &xorg; from &os; packages: &prompt.root; pkg install xorg After the installation of &xorg;, follow the instructions from the X11 Configuration section of the &os; Handbook. I tried to run X, but I get a No devices detected. error when I type startx. What do I do now? The system is probably running at a raised securelevel. It is not possible to start X at a raised securelevel because X requires write access to &man.io.4;. For more information, see the &man.init.8; manual page. There are two solutions to the problem: set the securelevel back down to zero or run &man.xdm.1; (or an alternative display manager) at boot time before the securelevel is raised. See the entry on starting XDM on boot for more information about running &man.xdm.1; at boot time. Why does my mouse not work with X? When using &man.vt.4;, the default console driver, &os; can be configured to support a mouse pointer on each virtual screen. To avoid conflicting with X, &man.vt.4; supports a virtual device called /dev/sysmouse. All mouse events received from the real mouse device are written to the &man.sysmouse.4; device via &man.moused.8;. To use the mouse on one or more virtual consoles and use X, set up &man.moused.8; as described in the earlier entry on using a mouse outside of X. Then edit /etc/X11/xorg.conf and make sure the following lines exist: Section "InputDevice" Option "Protocol" "SysMouse" Option "Device" "/dev/sysmouse" ..... Starting with &xorg; version 7.4, the InputDevice sections in xorg.conf are ignored in favor of autodetected devices. To restore the old behavior, add the following line to the ServerLayout or ServerFlags section: Option "AutoAddDevices" "false" Some people prefer to use /dev/mouse under X.
To make this work, /dev/mouse should be linked to /dev/sysmouse (see &man.sysmouse.4;) by adding the following line to /etc/devfs.conf (see &man.devfs.conf.5;): link sysmouse mouse This link can be created by restarting &man.devfs.5; with the following command (as root): &prompt.root; service devfs restart My mouse has a fancy wheel. Can I use it in X? Yes, if X is configured for a 5-button mouse. To do this, add the lines Buttons 5 and ZAxisMapping 4 5 to the InputDevice section of /etc/X11/xorg.conf, as seen in this example: Section "InputDevice" Identifier "Mouse1" Driver "mouse" Option "Protocol" "auto" Option "Device" "/dev/sysmouse" Option "Buttons" "5" Option "ZAxisMapping" "4 5" EndSection The mouse can be enabled in Emacs by adding these lines to ~/.emacs: ;; wheel mouse (global-set-key [mouse-4] 'scroll-down) (global-set-key [mouse-5] 'scroll-up) My laptop has a Synaptics touchpad. Can I use it in X? Yes, after configuring a few things to make it work. In order to use the Xorg synaptics driver, first remove moused_enable from rc.conf. To enable synaptics, add the following line to /boot/loader.conf: hw.psm.synaptics_support="1" Add the following to /etc/X11/xorg.conf: Section "InputDevice" Identifier "Touchpad0" Driver "synaptics" Option "Protocol" "psm" Option "Device" "/dev/psm0" EndSection And be sure to add the following into the ServerLayout section: InputDevice "Touchpad0" "SendCoreEvents" How do I use remote X displays? For security reasons, the default setting is to not allow a machine to remotely open a window. To enable this feature, start X with the optional -listen_tcp argument: &prompt.user; startx -listen_tcp What is a virtual console and how do I make more? Virtual consoles provide several simultaneous sessions on the same machine without doing anything complicated like setting up a network or running X. When the system starts, it will display a login prompt on the monitor after displaying all the boot messages. Type in your login name and password to start working on the first virtual console. To start another session, perhaps to look at documentation for a program or to read mail while waiting for an FTP transfer to finish, hold down Alt and press F2. This will display the login prompt for the second virtual console. To go back to the original session, press AltF1. The default &os; installation has eight virtual consoles enabled. AltF1, AltF2, AltF3, and so on will switch between these virtual consoles. To enable more virtual consoles, edit /etc/ttys (see &man.ttys.5;) and add entries for ttyv8 to ttyvc, after the comment on Virtual terminals: # Edit the existing entry for ttyv8 in /etc/ttys and change # "off" to "on". ttyv8 "/usr/libexec/getty Pc" xterm on secure ttyv9 "/usr/libexec/getty Pc" xterm on secure ttyva "/usr/libexec/getty Pc" xterm on secure ttyvb "/usr/libexec/getty Pc" xterm on secure The more virtual terminals, the more resources that are used. This can be problematic on systems with 8 MB RAM or less. Consider changing secure to insecure. In order to run an X server, at least one virtual terminal must be left set to off for it to use. This means that only eleven of the Alt-function keys can be used as virtual consoles so that one is left for the X server. For example, to run X and eleven virtual consoles, the setting for virtual terminal 12 should be: ttyvb "/usr/libexec/getty Pc" xterm off secure The easiest way to activate the virtual consoles is to reboot. How do I access the virtual consoles from X? Use CtrlAltFn to switch back to a virtual console.
Press CtrlAltF1 to return to the first virtual console. Once at a text console, use AltFn to move between them. To return to the X session, switch to the virtual console running X. If X was started from the command line using startx, the X session will attach to the next unused virtual console, not the text console from which it was invoked. For eight active virtual terminals, X will run on the ninth, so use AltF9. How do I start XDM on boot? There are two schools of thought on how to start &man.xdm.1;. One school starts xdm from /etc/ttys (see &man.ttys.5;) using the supplied example, while the other runs xdm from rc.local (see &man.rc.8;) or from an X script in /usr/local/etc/rc.d. Both are equally valid, and one may work in situations where the other does not. In both cases the result is the same: X will pop up a graphical login prompt. The &man.ttys.5; method has the advantage of documenting which vty X will start on and passing the responsibility of restarting the X server on logout to &man.init.8;. The &man.rc.8; method makes it easy to kill xdm if there is a problem starting the X server. If loaded from &man.rc.8;, xdm should be started without any arguments. xdm must start after &man.getty.8; runs, or else getty and xdm will conflict, locking out the console. The best way around this is to have the script sleep 10 seconds or so then launch xdm. When starting xdm from /etc/ttys, there still is a chance of conflict between xdm and &man.getty.8;. One way to avoid this is to add the vt number in /usr/local/lib/X11/xdm/Xservers: :0 local /usr/local/bin/X vt4 The above example will direct the X server to run in /dev/ttyv3. Note the number is offset by one. The X server counts the vty from one, whereas the &os; kernel numbers the vty from zero. Why do I get Couldn't open console when I run xconsole? When X is started with startx, the permissions on /dev/console will not get changed, resulting in things like xterm -C and xconsole not working. This is because of the way console permissions are set by default. On a multi-user system, one does not necessarily want just any user to be able to write on the system console. For users who are logging directly onto a machine with a VTY, the &man.fbtab.5; file exists to solve such problems. In a nutshell, make sure an uncommented line of the following form is in /etc/fbtab (see &man.fbtab.5;): /dev/ttyv0 0600 /dev/console It will ensure that whoever logs in on /dev/ttyv0 will own the console. Why does my PS/2 mouse misbehave under X? The mouse and the mouse driver may have fallen out of synchronization. In rare cases, the driver may also erroneously report synchronization errors: psmintr: out of sync (xxxx != yyyy) If this happens, disable the synchronization check code by setting the driver flags for the PS/2 mouse driver to 0x100. This is most easily achieved by adding hint.psm.0.flags="0x100" to /boot/loader.conf and rebooting. How do I reverse the mouse buttons? Type xmodmap -e "pointer = 3 2 1". Add this command to ~/.xinitrc or ~/.xsession to make it happen automatically. How do I install a splash screen and where do I find them? The detailed answer for this question can be found in the Boot Time Splash Screens section of the &os; Handbook. Can I use the Windows keys on my keyboard in X? Yes. Use &man.xmodmap.1; to define which functions the keys should perform.
Assuming all Windows keyboards are standard, the keycodes for these three keys are the following: 115Windows key, between the left-hand Ctrl and Alt keys 116Windows key, to the right of AltGr 117Menu, to the left of the right-hand Ctrl To have the left Windows key print a comma, try this. &prompt.root; xmodmap -e "keycode 115 = comma" To have the Windows key-mappings enabled automatically every time X is started, either put the xmodmap commands in ~/.xinitrc or, preferably, create a ~/.xmodmaprc and include the xmodmap options, one per line, then add the following line to ~/.xinitrc: xmodmap $HOME/.xmodmaprc For example, to map the 3 keys to be F13, F14, and F15, respectively. This would make it easy to map them to useful functions within applications or the window manager. To do this, put the following in ~/.xmodmaprc. keycode 115 = F13 keycode 116 = F14 keycode 117 = F15 For the x11-wm/fvwm2 desktop manager, one could map the keys so that F13 iconifies or de-iconifies the window the cursor is in, F14 brings the window the cursor is in to the front or, if it is already at the front, pushes it to the back, and F15 pops up the main Workplace menu even if the cursor is not on the desktop, which is useful when no part of the desktop is visible. The following entries in ~/.fvwmrc implement the aforementioned setup: Key F13 FTIWS A Iconify Key F14 FTIWS A RaiseLower Key F15 A A Menu Workplace Nop How can I get 3D hardware acceleration for &opengl;? The availability of 3D acceleration depends on the version of &xorg; and the type of video chip. For an nVidia chip, use the binary drivers provided for &os; by installing one of the following ports: The latest versions of nVidia cards are supported by the x11/nvidia-driver port. Older drivers are available as x11/nvidia-driver-### nVidia provides detailed information on which card is supported by which driver on their web site: http://www.nvidia.com/object/IO_32667.html. For Matrox G200/G400, check the x11-drivers/xf86-video-mga port. For ATI Rage 128 and Radeon see &man.ati.4x;, &man.r128.4x; and &man.radeon.4x;. Networking Where can I get information on diskless booting? Diskless booting means that the &os; box is booted over a network, and reads the necessary files from a server instead of its hard disk. For full details, see the Handbook entry on diskless booting. Can a &os; box be used as a dedicated network router? Yes. Refer to the Handbook entry on advanced networking, specifically the section on routing and gateways. Does &os; support NAT or Masquerading? Yes. For instructions on how to use NAT over a PPP connection, see the Handbook entry on PPP. To use NAT over some other sort of network connection, look at the natd section of the Handbook. How can I set up Ethernet aliases? If the alias is on the same subnet as an address already configured on the interface, add netmask 0xffffffff to this command: &prompt.root; ifconfig ed0 alias 192.0.2.2 netmask 0xffffffff Otherwise, specify the network address and netmask as usual: &prompt.root; ifconfig ed0 alias 172.16.141.5 netmask 0xffffff00 More information can be found in the &os; Handbook. Why can I not NFS-mount from a &linux; box? Some versions of the &linux; NFS code only accept mount requests from a privileged port; try to issue the following command: &prompt.root; mount -o -P linuxbox:/blah /mnt Why does mountd keep telling me it can't change attributes and that I have a bad exports list on my &os; NFS server? 
The most frequent problem is not understanding the correct format of /etc/exports. Review &man.exports.5; and the NFS entry in the Handbook, especially the section on configuring NFS. How do I enable IP multicast support? Install the net/mrouted package or port and add mrouted_enable="YES" to /etc/rc.conf to start this service at boot time. Why do I have to use the FQDN for hosts on my site? See the answer in the &os; Handbook. Why do I get an error, Permission denied, for all networking operations? If the kernel is compiled with the IPFIREWALL option, be aware that the default policy is to deny all packets that are not explicitly allowed. If the firewall is unintentionally misconfigured, restore network operability by typing the following as root: &prompt.root; ipfw add 65534 allow all from any to any Consider setting firewall_type="open" in /etc/rc.conf. For further information on configuring this firewall, see the Handbook chapter. Why is my ipfw fwd rule to redirect a service to another machine not working? Possibly because network address translation (NAT) is needed instead of just forwarding packets. A fwd rule only forwards packets; it does not actually change the data inside the packet. Consider this rule: 01000 fwd 10.0.0.1 from any to foo 21 When a packet with a destination address of foo arrives at the machine with this rule, the packet is forwarded to 10.0.0.1, but it still has the destination address of foo. The destination address of the packet is not changed to 10.0.0.1. Most machines would probably drop a packet that they receive with a destination address that is not their own. Therefore, using a fwd rule does not often work the way the user expects. This behavior is a feature and not a bug. See the FAQ about redirecting services, the &man.natd.8; manual, or one of the several port redirecting utilities in the Ports Collection for a correct way to do this. How can I redirect service requests from one machine to another? FTP and other service requests can be redirected with the sysutils/socket package or port. Replace the entry for the service in /etc/inetd.conf to call socket, as seen in this example for ftpd: ftp stream tcp nowait nobody /usr/local/bin/socket socket ftp.example.com ftp where ftp.example.com and ftp are the host and port to redirect to, respectively. Where can I get a bandwidth management tool? There are three bandwidth management tools available for &os;. &man.dummynet.4; is integrated into &os; as part of &man.ipfw.4;. ALTQ has been integrated into &os; as part of &man.pf.4;. Bandwidth Manager from Emerging Technologies is a commercial product. Why do I get /dev/bpf0: device not configured? The running application requires the Berkeley Packet Filter (&man.bpf.4;), but it was removed from a custom kernel. Add this to the kernel config file and build a new kernel: device bpf # Berkeley Packet Filter How do I mount a disk from a &windows; machine that is on my network, like smbmount in &linux;? Use the SMBFS toolset. It includes a set of kernel modifications and a set of userland programs. The programs and information are available as &man.mount.smbfs.8; in the base system. What are these messages about: Limiting icmp/open port/closed port response in my log files? This kernel message indicates that some activity is provoking it to send a large amount of ICMP or TCP reset (RST) responses. ICMP responses are often generated as a result of attempted connections to unused UDP ports.
TCP resets are generated as a result of attempted connections to unopened TCP ports. Among others, these are the kinds of activities which may cause these messages: Brute-force denial of service (DoS) attacks (as opposed to single-packet attacks which exploit a specific vulnerability). Port scans which attempt to connect to a large number of ports (as opposed to only trying a few well-known ports). The first number in the message indicates how many packets the kernel would have sent if the limit was not in place, and the second indicates the limit. This limit is controlled using net.inet.icmp.icmplim. This example sets the limit to 300 packets per second: &prompt.root; sysctl net.inet.icmp.icmplim=300 To disable these messages without disabling response limiting, use net.inet.icmp.icmplim_output to disable the output: &prompt.root; sysctl net.inet.icmp.icmplim_output=0 Finally, to disable response limiting completely, set net.inet.icmp.icmplim to 0. Disabling response limiting is discouraged for the reasons listed above. What are these arp: unknown hardware address format error messages? This means that some device on the local Ethernet is using a MAC address in a format that &os; does not recognize. This is probably caused by someone experimenting with an Ethernet card somewhere else on the network. This is most commonly seen on cable modem networks. It is harmless, and should not affect the performance of the &os; system. Why do I keep seeing messages like: 192.168.0.10 is on fxp1 but got reply from 00:15:17:67:cf:82 on rl0, and how do I disable it? - Because a packet is coming from outside the network + A packet is coming from outside the network unexpectedly. To disable them, set net.link.ether.inet.log_arp_wrong_iface to 0. How do I compile an IPv6-only kernel? Configure your kernel with these settings: include GENERIC ident GENERIC-IPV6ONLY makeoptions MKMODULESENV+="WITHOUT_INET_SUPPORT=" nooptions INET nodevice gre Security What is a sandbox? Sandbox is a security term. It can mean two things: A process which is placed inside a set of virtual walls that are designed to prevent someone who breaks into the process from being able to break into the wider system. The process is only able to run inside the walls. Since nothing the process does in regards to executing code is supposed to be able to breach the walls, a detailed audit of its code is not needed in order to be able to say certain things about its security. The walls might be a user ID, for example. This is the definition used in the &man.security.7; and &man.named.8; man pages. Take the ntalk service, for example (see &man.inetd.8;). This service used to run as user ID root. Now it runs as user ID tty. The tty user is a sandbox designed to make it more difficult for someone who has successfully hacked into the system via ntalk from being able to hack beyond that user ID. A process which is placed inside a simulation of the machine. It means that someone who is able to break into the process may believe that he can break into the wider machine but is, in fact, only breaking into a simulation of that machine and not modifying any real data. The most common way to accomplish this is to build a simulated environment in a subdirectory and then run the processes in that directory chrooted (so that / for that process is this directory, not the real / of the system).
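As a minimal sketch of this approach (the /sandbox path and the choice of shell are illustrative assumptions, not a prescribed layout): &prompt.root; mkdir -p /sandbox/bin &prompt.root; cp /rescue/sh /sandbox/bin/sh &prompt.root; chroot /sandbox /bin/sh Because /rescue/sh is statically linked, no shared libraries have to be copied into the sandbox for it to run; a dynamically linked program would also need its libraries and the run-time linker copied in. See &man.chroot.8; for details.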
Another common use is to mount an underlying file system read-only and then create a file system layer on top of it that gives a process a seemingly writeable view into that file system. The process may believe it is able to write to those files, but only the process sees the effects — other processes in the system do not, necessarily. An attempt is made to make this sort of sandbox so transparent that the user (or hacker) does not realize that he is sitting in it. &unix; implements two core sandboxes. One is at the process level, and one is at the userid level. Every &unix; process is completely firewalled off from every other &unix; process. One process cannot modify the address space of another. A &unix; process is owned by a particular userid. If the user ID is not the root user, it serves to firewall the process off from processes owned by other users. The user ID is also used to firewall off on-disk data. What is securelevel? securelevel is a security mechanism implemented in the kernel. When the securelevel is positive, the kernel restricts certain tasks; not even the superuser (root) is allowed to do them. The securelevel mechanism limits the ability to: Unset certain file flags, such as schg (the system immutable flag). Write to kernel memory via /dev/mem and /dev/kmem. Load kernel modules. Alter firewall rules. To check the status of the securelevel on a running system: &prompt.root; sysctl -n kern.securelevel The output contains the current value of the securelevel. If it is greater than 0, at least some of the securelevel's protections are enabled. The securelevel of a running system cannot be lowered as this would defeat its purpose. If a task requires that the securelevel be non-positive, change the kern_securelevel and kern_securelevel_enable variables in /etc/rc.conf and reboot. For more information on securelevel and the specific things all the levels do, consult &man.init.8;. Securelevel is not a silver bullet; it has many known deficiencies. More often than not, it provides a false sense of security. One of its biggest problems is that in order for it to be at all effective, all files used in the boot process up until the securelevel is set must be protected. If an attacker can get the system to execute their code prior to the securelevel being set (which happens quite late in the boot process since some things the system must do at start-up cannot be done at an elevated securelevel), its protections are invalidated. While this task of protecting all files used in the boot process is not technically impossible, if it is achieved, system maintenance will become a nightmare since one would have to take the system down, at least to single-user mode, to modify a configuration file. This point and others are often discussed on the mailing lists, particularly the &a.security;. Search the archives here for an extensive discussion. A more fine-grained mechanism is preferred. What is this UID 0 toor account? Have I been compromised? Do not worry. toor is an alternative superuser account, where toor is root spelled backwards. It is intended to be used with a non-standard shell so the default shell for root does not need to change. This is important as shells which are not part of the base distribution, but are instead installed from ports or packages, are installed in /usr/local/bin which, by default, resides on a different file system. 
If root's shell is located in /usr/local/bin and the file system containing /usr/local/bin is not mounted, root will not be able to log in to fix a problem and will have to reboot into single-user mode in order to enter the path to a shell. Some people use toor for day-to-day root tasks with a non-standard shell, leaving root, with a standard shell, for single-user mode or emergencies. By default, a user cannot log in using toor as it does not have a password, so log in as root and set a password for toor before using it to log in. Serial Communications This section answers common questions about serial communications with &os;. How do I get the boot: prompt to show on the serial console? See this section of the Handbook. How do I tell if &os; found my serial ports or modem cards? As the &os; kernel boots, it will probe for the serial ports for which the kernel is configured. Either watch the boot messages closely or run this command after the system is up and running: &prompt.user; grep -E '^(sio|uart)[0-9]' < /var/run/dmesg.boot sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0 sio0: type 16550A sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0 sio1: type 16550A This example shows two serial ports. The first is on IRQ4, port address 0x3f8, and has a 16550A-type UART chip. The second uses the same kind of chip but is on IRQ3 and is at port address 0x2f8. Internal modem cards are treated just like serial ports, except that they always have a modem attached to the port. The GENERIC kernel includes support for two serial ports using the same IRQ and port address settings in the above example. If these settings are not right for the system, or if there are more modem cards or serial ports than the kernel is configured for, reconfigure the kernel using the instructions in building a kernel. How do I access the serial ports on &os;? (x86-specific) The third serial port, sio2, or COM3, is on /dev/cuad2 for dial-out devices, and on /dev/ttyd2 for dial-in devices. What is the difference between these two classes of devices? When opening /dev/ttydX in blocking mode, a process will wait for the corresponding cuadX device to become inactive, and then wait for the carrier detect line to go active. When the cuadX device is opened, it makes sure the serial port is not already in use by the ttydX device. If the port is available, it steals it from the ttydX device. Also, the cuadX device does not care about carrier detect. With this scheme and an auto-answer modem, remote users can log in and local users can still dial out with the same modem and the system will take care of all the conflicts. How do I enable support for a multi-port serial card? The section on kernel configuration provides information about configuring the kernel. For a multi-port serial card, place an &man.sio.4; line for each serial port on the card in the &man.device.hints.5; file. But place the IRQ specifiers on only one of the entries. All of the ports on the card should share one IRQ. For consistency, use the last serial port to specify the IRQ.
Also, specify the following option in the kernel configuration file: options COM_MULTIPORT The following /boot/device.hints example is for an AST 4-port serial card on IRQ 12: hint.sio.4.at="isa" hint.sio.4.port="0x2a0" hint.sio.4.flags="0x701" hint.sio.5.at="isa" hint.sio.5.port="0x2a8" hint.sio.5.flags="0x701" hint.sio.6.at="isa" hint.sio.6.port="0x2b0" hint.sio.6.flags="0x701" hint.sio.7.at="isa" hint.sio.7.port="0x2b8" hint.sio.7.flags="0x701" hint.sio.7.irq="12" The flags indicate that the master port has minor number 7 (0x700), and all the ports share an IRQ (0x001). Can I set the default serial parameters for a port? See the Serial Communications section in the &os; Handbook. Why can I not run tip or cu? The built-in &man.tip.1; and &man.cu.1; utilities can only access the /var/spool/lock directory via user uucp and group dialer. Use the dialer group to control who has access to the modem or remote systems by adding user accounts to dialer. Alternatively, everyone can be configured to run &man.tip.1; and &man.cu.1; by typing: &prompt.root; chmod 4511 /usr/bin/cu &prompt.root; chmod 4511 /usr/bin/tip Miscellaneous Questions &os; uses a lot of swap space even when the computer has free memory left. Why? &os; will proactively move entirely idle, unused pages of main memory into swap in order to make more main memory available for active use. This heavy use of swap is balanced by using the extra free memory for caching. Note that while &os; is proactive in this regard, it does not arbitrarily decide to swap pages when the system is truly idle. Thus, the system will not be all paged out after leaving it idle overnight. Why does top show very little free memory even when I have very few programs running? The simple answer is that free memory is wasted memory. Any memory that programs do not actively allocate is used within the &os; kernel as disk cache. The values shown by &man.top.1; labeled as Inact and Laundry are cached data at different aging levels. This cached data means the system does not have to access a slow disk again for data it has accessed recently, thus increasing overall performance. In general, a low value shown for Free memory in &man.top.1; is good, provided it is not very low. Why will chmod not change the permissions on symlinks? Symlinks do not have permissions, and by default, &man.chmod.1; will follow symlinks to change the permissions on the source file, if possible. For the file foo with a symlink named bar, this command will always succeed. &prompt.user; chmod g-w bar However, the permissions on bar will not have changed. When changing modes of the file hierarchies rooted in the files instead of the files themselves, use either -H or -L together with -R to make this work. See &man.chmod.1; and &man.symlink.7; for more information. -R does a recursive &man.chmod.1;. Be careful about specifying directories or symlinks to directories to &man.chmod.1;. To change the permissions of a directory referenced by a symlink, use &man.chmod.1; without any options and follow the symlink with a trailing slash (/). For example, if foo is a symlink to directory bar, to change the permissions of foo (actually bar), do something like: &prompt.user; chmod 555 foo/ With the trailing slash, &man.chmod.1; will follow the symlink, foo, to change the permissions of the directory, bar. Can I run DOS binaries under &os;? Yes. A DOS emulation program, emulators/doscmd, is available in the &os; Ports Collection.
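Assuming the package name matches the port, as is usual, doscmd can be installed either as a prebuilt package or from the port (a sketch): &prompt.root; pkg install doscmd or: &prompt.root; cd /usr/ports/emulators/doscmd &prompt.root; make install clean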
If doscmd will not suffice, emulators/pcemu emulates an 8088 and enough BIOS services to run many DOS text-mode applications. It requires the X Window System. The Ports Collection also has emulators/dosbox. The main focus of this application is emulating old DOS games using the local file system for files. What do I need to do to translate a &os; document into my native language? See the Translation FAQ in the &os; Documentation Project Primer. Why does my email to any address at FreeBSD.org bounce? The FreeBSD.org mail system implements some Postfix checks on incoming mail and rejects mail that is either from misconfigured relays or otherwise appears likely to be spam. Some of the specific requirements are: The IP address of the SMTP client must "reverse-resolve" to a forward confirmed hostname. The fully-qualified hostname given in the SMTP conversation (either HELO or EHLO) must resolve to the IP address of the client. Other advice to help mail reach its destination includes: Mail should be sent in plain text, and messages sent to mailing lists should generally be no more than 200KB in length. Avoid excessive cross posting. Choose one mailing list which seems most relevant and send it there. If you still have trouble with email infrastructure at FreeBSD.org, send a note with the details to postmaster@freebsd.org. Include a date/time interval so that logs may be reviewed; note that we only keep one week's worth of mail logs. (Be sure to specify the time zone or offset from UTC.) Where can I find a free &os; account? While &os; does not provide open access to any of their servers, others do provide open access &unix; systems. The charge varies and limited services may be available. Arbornet, Inc., also known as M-Net, has been providing open access to &unix; systems since 1983. Starting on an Altos running System III, the site switched to BSD/OS in 1991. In June of 2000, the site switched again to &os;. M-Net can be accessed via telnet and SSH and provides basic access to the entire &os; software suite. However, network access is limited to members and patrons who donate to the system, which is run as a non-profit organization. M-Net also provides a bulletin board system and interactive chat. What is the cute little red guy's name? He does not have one, and is just called the BSD daemon. If you insist upon using a name, call him beastie. Note that beastie is pronounced BSD. More about the BSD daemon is available on his home page. Can I use the BSD daemon image? Perhaps. The BSD daemon is copyrighted by Marshall Kirk McKusick. Check his Statement on the Use of the BSD Daemon Figure for detailed usage terms. In summary, the image can be used in a tasteful manner, for personal use, so long as appropriate credit is given. Before using the logo commercially, contact &a.mckusick.email; for permission. More details are available on the BSD Daemon's home page. Do you have any BSD daemon images I could use? Xfig and eps drawings are available under /usr/share/examples/BSD_daemon/. I have seen an acronym or other term on the mailing lists and I do not understand what it means. Where should I look? Refer to the &os; Glossary. Why should I care what color the bikeshed is? The really, really short answer is that you should not. The somewhat longer answer is that just because you are capable of building a bikeshed does not mean you should stop others from building one just because you do not like the color they plan to paint it.
This is a metaphor indicating that you need not argue about every little feature just because you know enough to do so. Some people have commented that the amount of noise generated by a change is inversely proportional to the complexity of the change. The longer and more complete answer is that after a very long argument about whether &man.sleep.1; should take fractional second arguments, &a.phk.email; posted a long message entitled A bike shed (any color will do) on greener grass.... The appropriate portions of that message are quoted below.
&a.phk.email; on &a.hackers.name;, October 2, 1999 What is it about this bike shed? Some of you have asked me. It is a long story, or rather it is an old story, but it is quite short actually. C. Northcote Parkinson wrote a book in the early 1960s, called Parkinson's Law, which contains a lot of insight into the dynamics of management. [snip a bit of commentary on the book] In the specific example involving the bike shed, the other vital component is an atomic power-plant, I guess that illustrates the age of the book. Parkinson shows how you can go into the board of directors and get approval for building a multi-million or even billion dollar atomic power plant, but if you want to build a bike shed you will be tangled up in endless discussions. Parkinson explains that this is because an atomic plant is so vast, so expensive and so complicated that people cannot grasp it, and rather than try, they fall back on the assumption that somebody else checked all the details before it got this far. Richard P. Feynmann gives a couple of interesting, and very much to the point, examples relating to Los Alamos in his books. A bike shed on the other hand. Anyone can build one of those over a weekend, and still have time to watch the game on TV. So no matter how well prepared, no matter how reasonable you are with your proposal, somebody will seize the chance to show that he is doing his job, that he is paying attention, that he is here. In Denmark we call it setting your fingerprint. It is about personal pride and prestige, it is about being able to point somewhere and say There! I did that. It is a strong trait in politicians, but present in most people given the chance. Just think about footsteps in wet cement.
The &os; Funnies How cool is &os;? Q. Has anyone done any temperature testing while running &os;? I know &linux; runs cooler than DOS, but have never seen a mention of &os;. It seems to run really hot. A. No, but we have done numerous taste tests on blindfolded volunteers who have also had 250 micrograms of LSD-25 administered beforehand. 35% of the volunteers said that &os; tasted sort of orange, whereas &linux; tasted like purple haze. Neither group mentioned any significant variances in temperature. We eventually had to throw the results of this survey out entirely anyway when we found that too many volunteers were wandering out of the room during the tests, thus skewing the results. We think most of the volunteers are at Apple now, working on their new scratch and sniff GUI. It is a funny old business we are in! Seriously, &os; uses the HLT (halt) instruction when the system is idle thus lowering its energy consumption and therefore the heat it generates. Also if you have ACPI (Advanced Configuration and Power Interface) configured, then &os; can also put the CPU into a low power mode. Who is scratching in my memory banks?? Q. Is there anything odd that &os; does when compiling the kernel which would cause the memory to make a scratchy sound? When compiling (and for a brief moment after recognizing the floppy drive upon startup, as well), a strange scratchy sound emanates from what appears to be the memory banks. A. Yes! You will see frequent references to daemons in the BSD documentation, and what most people do not know is that this refers to genuine, non-corporeal entities that now possess your computer. The scratchy sound coming from your memory is actually high-pitched whispering exchanged among the daemons as they best decide how to deal with various system administration tasks. If the noise gets to you, a good fdisk /mbr from DOS will get rid of them, but do not be surprised if they react adversely and try to stop you. In fact, if at any point during the exercise you hear the satanic voice of Bill Gates coming from the built-in speaker, take off running and do not ever look back! Freed from the counterbalancing influence of the BSD daemons, the twin demons of DOS and &windows; are often able to re-assert total control over your machine to the eternal damnation of your soul. Now that you know, given a choice you would probably prefer to get used to the scratchy noises, no? How many &os; hackers does it take to change a lightbulb? 
One thousand, one hundred and sixty-nine: Twenty-three to complain to -CURRENT about the lights being out; Four to claim that it is a configuration problem, and that such matters really belong on -questions; Three to submit PRs about it, one of which is misfiled under doc and consists only of it's dark; One to commit an untested lightbulb which breaks buildworld, then back it out five minutes later; Eight to flame the PR originators for not including patches in their PRs; Five to complain about buildworld being broken; Thirty-one to answer that it works for them, and they must have updated at a bad time; One to post a patch for a new lightbulb to -hackers; One to complain that he had patches for this three years ago, but when he sent them to -CURRENT they were just ignored, and he has had bad experiences with the PR system; besides, the proposed new lightbulb is non-reflexive; Thirty-seven to scream that lightbulbs do not belong in the base system, that committers have no right to do things like this without consulting the Community, and WHAT IS -CORE DOING ABOUT IT!? Two hundred to complain about the color of the bicycle shed; Three to point out that the patch breaks &man.style.9;; Seventeen to complain that the proposed new lightbulb is under GPL; Five hundred and eighty-six to engage in a flame war about the comparative advantages of the GPL, the BSD license, the MIT license, the NPL, and the personal hygiene of unnamed FSF founders; Seven to move various portions of the thread to -chat and -advocacy; One to commit the suggested lightbulb, even though it shines dimmer than the old one; Two to back it out with a furious flame of a commit message, arguing that &os; is better off in the dark than with a dim lightbulb; Forty-six to argue vociferously about the backing out of the dim lightbulb and demanding a statement from -core; Eleven to request a smaller lightbulb so it will fit their Tamagotchi if we ever decide to port &os; to that platform; Seventy-three to complain about the SNR on -hackers and -chat and unsubscribe in protest; Thirteen to post unsubscribe, How do I unsubscribe?, or Please remove me from the list, followed by the usual footer; One to commit a working lightbulb while everybody is too busy flaming everybody else to notice; Thirty-one to point out that the new lightbulb would shine 0.364% brighter if compiled with TenDRA (although it will have to be reshaped into a cube), and that &os; should therefore switch to TenDRA instead of GCC; One to complain that the new lightbulb lacks fairings; Nine (including the PR originators) to ask what is MFC?; Fifty-seven to complain about the lights being out two weeks after the bulb has been changed. &a.nik.email; adds: I was laughing quite hard at this. And then I thought, Hang on, shouldn't there be '1 to document it.' in that list somewhere? And then I was enlightened :-) &a.tabthorpe.email; says: None, real &os; hackers are not afraid of the dark! Where does data written to /dev/null go? It goes into a special data sink in the CPU where it is converted to heat which is vented through the heatsink / fan assembly. This is why CPU cooling is increasingly important; as people get used to faster processors, they become careless with their data and more and more of it ends up in /dev/null, overheating their CPUs. If you delete /dev/null (which effectively disables the CPU data sink) your CPU may run cooler but your system will quickly become constipated with all that excess data and start to behave erratically. 
If you have a fast network connection, you can cool down your CPU by reading data out of /dev/random and sending it off somewhere; however you run the risk of overheating your network connection and/or angering your ISP, as most of the data will end up getting converted to heat by their equipment, but they generally have good cooling, so if you do not overdo it you should be OK. Paul Robinson adds: There are other methods. As every good sysadmin knows, it is part of standard practice to send data to the screen of interesting variety to keep all the pixies that make up your picture happy. Screen pixies (commonly mis-typed or re-named as pixels) are categorized by the type of hat they wear (red, green or blue) and will hide or appear (thereby showing the color of their hat) whenever they receive a little piece of food. Video cards turn data into pixie-food, and then send them to the pixies — the more expensive the card, the better the food, so the better behaved the pixies are. They also need constant stimulation — this is why screen savers exist. To take your suggestions further, you could just throw the random data to console, thereby letting the pixies consume it. This causes no heat to be produced at all, keeps the pixies happy and gets rid of your data quite quickly, even if it does make things look a bit messy on your screen. Incidentally, as an ex-admin of a large ISP who experienced many problems attempting to maintain a stable temperature in a server room, I would strongly discourage people sending the data they do not want out to the network. The fairies who do the packet switching and routing get annoyed by it as well. My colleague sits at the computer too much, how can I prank her? Install games/sl and wait for her to mistype sl for ls. Advanced Topics How can I learn more about &os;'s internals? See the &os; Architecture Handbook. Additionally, much general &unix; knowledge is directly applicable to &os;. How can I contribute to &os;? What can I do to help? We accept all types of contributions: documentation, code, and even art. See the article on Contributing to &os; for specific advice on how to do this. And thanks for the thought! What are snapshots and releases? There are currently &rel.numbranch; active/semi-active branches in the &os; Subversion Repository. (Earlier branches are only changed very rarely, which is why there are only &rel.numbranch; active branches of development): &rel2.releng; AKA &rel2.stable; &rel.releng; AKA &rel.stable; &rel.head.releng; AKA -CURRENT AKA &rel.head; HEAD is not an actual branch tag. It is a symbolic constant for the current, non-branched development stream known as -CURRENT. Right now, -CURRENT is the &rel.head.relx; development stream; the &rel.stable; branch, &rel.releng;, forked off from -CURRENT in &rel.relengdate; and the &rel2.stable; branch, &rel2.releng;, forked off from -CURRENT in &rel2.relengdate;. How can I make the most of the data I see when my kernel panics? Here is a typical kernel panic: Fatal trap 12: page fault while in kernel mode fault virtual address = 0x40 fault code = supervisor read, page not present instruction pointer = 0x8:0xf014a7e5 stack pointer = 0x10:0xf4ed6f24 frame pointer = 0x10:0xf4ed6f28 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, def32 1, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 80 (mount) interrupt mask = trap number = 12 panic: page fault This message is not enough.
While the instruction pointer value is important, it is also configuration dependent as it varies depending on the kernel image. If it is a GENERIC kernel image from one of the snapshots, it is possible for somebody else to track down the offending function, but for a custom kernel, only you can tell us where the fault occurred. To proceed: Write down the instruction pointer value. Note that the 0x8: part at the beginning is not significant in this case: it is the 0xf0xxxxxx part that we want. When the system reboots, do the following: &prompt.user; nm -n kernel.that.caused.the.panic | grep f0xxxxxx where f0xxxxxx is the instruction pointer value. The odds are you will not get an exact match since the symbols in the kernel symbol table are for the entry points of functions and the instruction pointer address will be somewhere inside a function, not at the start. If you do not get an exact match, omit the last digit from the instruction pointer value and try again: &prompt.user; nm -n kernel.that.caused.the.panic | grep f0xxxxx If that does not yield any results, chop off another digit. Repeat until there is some sort of output. The result will be a possible list of functions which caused the panic. This is a less than exact mechanism for tracking down the point of failure, but it is better than nothing. However, the best way to track down the cause of a panic is by capturing a crash dump, then using &man.kgdb.1; to generate a stack trace on the crash dump. In any case, the method is this: Make sure that the following line is included in the kernel configuration file: makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbols Change to the /usr/src directory: &prompt.root; cd /usr/src Compile the kernel: &prompt.root; make buildkernel KERNCONF=MYKERNEL Wait for &man.make.1; to finish compiling, then install the new kernel: &prompt.root; make installkernel KERNCONF=MYKERNEL Reboot. If KERNCONF is not included, the GENERIC kernel will instead be built and installed. The &man.make.1; process will have built two kernels: /usr/obj/usr/src/sys/MYKERNEL/kernel and /usr/obj/usr/src/sys/MYKERNEL/kernel.debug. kernel was installed as /boot/kernel/kernel, while kernel.debug can be used as the source of debugging symbols for &man.kgdb.1;. To capture a crash dump, edit /etc/rc.conf and set dumpdev to point to either the swap partition or AUTO. This will cause the &man.rc.8; scripts to use the &man.dumpon.8; command to enable crash dumps. This command can also be run manually. After a panic, the crash dump can be recovered using &man.savecore.8;; if dumpdev is set in /etc/rc.conf, the &man.rc.8; scripts will run &man.savecore.8; automatically and put the crash dump in /var/crash. &os; crash dumps are usually the same size as physical RAM. Therefore, make sure there is enough space in /var/crash to hold the dump. Alternatively, run &man.savecore.8; manually and have it recover the crash dump to another directory with more room. It is possible to limit the size of the crash dump by using options MAXMEM=N where N is the size of the kernel's memory usage in KBs. For example, for 1 GB of RAM, limit the kernel's memory usage to 128 MB, so that the crash dump size will be 128 MB instead of 1 GB. Once the crash dump has been recovered, get a stack trace as follows: &prompt.user; kgdb /usr/obj/usr/src/sys/MYKERNEL/kernel.debug /var/crash/vmcore.0 (kgdb) backtrace Note that there may be several screens worth of information. Ideally, use &man.script.1; to capture all of them.
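For example, the whole &man.kgdb.1; session can be recorded with &man.script.1; (the output file name here is arbitrary): &prompt.user; script /tmp/kgdb-output &prompt.user; kgdb /usr/obj/usr/src/sys/MYKERNEL/kernel.debug /var/crash/vmcore.0 (kgdb) backtrace (kgdb) quit &prompt.user; exit Everything displayed between the script and exit commands is saved in /tmp/kgdb-output.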
Using the unstripped kernel image with all the debug symbols should show the exact line of kernel source code where the panic occurred. The stack trace is usually read from the bottom up to trace the exact sequence of events that led to the crash. &man.kgdb.1; can also be used to print out the contents of various variables or structures to examine the system state at the time of the crash. If a second computer is available, &man.kgdb.1; can be configured to do remote debugging, including setting breakpoints and single-stepping through the kernel code. If DDB is enabled and the kernel drops into the debugger, a panic and a crash dump can be forced by typing panic at the ddb prompt. It may stop in the debugger again during the panic phase. If it does, type continue and it will finish the crash dump. Why has dlsym() stopped working for ELF executables? The ELF toolchain does not, by default, make the symbols defined in an executable visible to the dynamic linker. Consequently dlsym() searches on handles obtained from calls to dlopen(NULL, flags) will fail to find such symbols. To search, using dlsym(), for symbols present in the main executable of a process, link the executable using the --export-dynamic option to the ELF linker (&man.ld.1;). How can I increase or reduce the kernel address space on i386? By default, the kernel address space is 1 GB (2 GB for PAE) for i386. When running a network-intensive server or using ZFS, this will probably not be enough. Add the following line to the kernel configuration file to increase available space and rebuild the kernel: options KVA_PAGES=N To find the correct value of N, divide the desired address space size (in megabytes) by four. (For example, it is 512 for 2 GB.) Acknowledgments This innocent little Frequently Asked Questions document has been written, rewritten, edited, folded, spindled, mutilated, eviscerated, contemplated, discombobulated, cogitated, regurgitated, rebuilt, castigated, and reinvigorated over the last decade, by a cast of hundreds if not thousands. Repeatedly. We wish to thank every one of the people responsible, and we encourage you to join them in making this FAQ even better.
diff --git a/en_US.ISO8859-1/books/fdp-primer/manpages/chapter.xml b/en_US.ISO8859-1/books/fdp-primer/manpages/chapter.xml index b2a78f8dc1..c951cbd85e 100644 --- a/en_US.ISO8859-1/books/fdp-primer/manpages/chapter.xml +++ b/en_US.ISO8859-1/books/fdp-primer/manpages/chapter.xml @@ -1,746 +1,746 @@ Manual Pages Introduction Manual pages, commonly shortened to man pages, were conceived as readily-available reminders for command syntax, device driver details, or configuration file formats. They have become an extremely valuable quick-reference from the command line for users, system administrators, and programmers. Although intended as reference material rather than tutorials, the EXAMPLES sections of manual pages often provide detailed use cases. Manual pages are generally shown interactively by the &man.man.1; command. When the user types man ls, a search is performed for a manual page matching ls. The first matching result is displayed. Sections Manual pages are grouped into sections. Each section contains manual pages for a specific category of documentation: Section Number Category 1 General Commands 2 System Calls 3 Library Functions 4 Kernel Interfaces 5 File Formats 6 Games 7 Miscellaneous 8 System Manager 9 Kernel Developer Markup Various markup forms and rendering programs have been used for manual pages. &os; has used &man.groff.7; and the newer &man.mandoc.1;. Most existing &os; manual pages, and all new ones, use the &man.mdoc.7; form of markup. This is a simple line-based markup that is reasonably expressive. It is mostly semantic: parts of text are marked up for what they are, rather than for how they should appear when rendered. There is some appearance-based markup which is usually best avoided. Manual page source is usually interpreted and displayed to the screen interactively. The source files can be ordinary text files or compressed with &man.gzip.1; to save space. Manual pages can also be rendered to other formats, including PostScript for printing or PDF generation. See &man.man.1;. Manual Page Sections Manual pages are composed of several standard sections. Each section has a title in upper case, and the sections for a particular type of manual page appear in a specific order. For a category 1 General Command manual page, the sections are: Section Name Description NAME Name of the command SYNOPSIS Format of options and arguments DESCRIPTION Description of purpose and usage ENVIRONMENT Environment settings that affect operation EXIT STATUS Error codes returned on exit EXAMPLES Examples of usage COMPATIBILITY Compatibility with other implementations SEE ALSO Cross-reference to related manual pages STANDARDS Compatibility with standards like POSIX HISTORY History of implementation BUGS Known bugs AUTHORS People who created the command or wrote the manual page. Some sections are optional, and the combination of sections for a specific type of manual page varies. Examples of the most common types are shown later in this chapter. Macros &man.mdoc.7; markup is based on macros. Lines that begin with a dot contain macro commands, each two or three letters long. For example, consider this portion of the &man.ls.1; manual page: .Dd December 1, 2015 .Dt LS 1 .Sh NAME .Nm ls .Nd list directory contents .Sh SYNOPSIS .Nm .Op Fl -libxo .Op Fl ABCFGHILPRSTUWZabcdfghiklmnopqrstuwxy1, .Op Fl D Ar format .Op Ar .Sh DESCRIPTION For each operand that names a .Ar file of a type other than directory, .Nm displays its name as well as any requested, associated information.
For each operand that names a .Ar file of type directory, .Nm displays the names of files contained within that directory, as well as any requested, associated information. A Document date and Document title are defined. A Section header for the NAME section is defined. Then the Name of the command and a one-line Name description are defined. The SYNOPSIS section begins. This section describes the command-line options and arguments accepted. Name (.Nm) has already been defined, and repeating it here just displays the defined value in the text. An Optional Flag called -libxo is shown. The Fl macro adds a dash to the beginning of flags, so this appears in the manual page as --libxo. A long list of optional single-character flags is shown. An optional -D flag is defined. If the -D flag is given, it must be followed by an Argument. The argument is a format, a string that tells &man.ls.1; what to display and how to display it. Details on the format string are given later in the manual page. - A final optional argument is defined. Because no name + A final optional argument is defined. Since no name is specified for the argument, the default of file ... is used. The Section header for the DESCRIPTION section is defined. When rendered with the command man ls, the result displayed on the screen looks like this: LS(1) FreeBSD General Commands Manual LS(1) NAME ls — list directory contents SYNOPSIS ls [--libxo] [-ABCFGHILPRSTUWZabcdfghiklmnopqrstuwxy1,] [-D format] [file ...] DESCRIPTION For each operand that names a file of a type other than directory, ls displays its name as well as any requested, associated information. For each operand that names a file of type directory, ls displays the names of files contained within that directory, as well as any requested, associated information. Optional values are shown inside square brackets. Markup Guidelines The &man.mdoc.7; markup language is not very strict. For clarity and consistency, the &os; Documentation project adds some additional style guidelines: Only the first letter of macros is upper case Always use upper case for the first letter of a macro and lower case for the remaining letters. Begin new sentences on new lines Start a new sentence on a new line; do not begin it on the same line as an existing sentence. Update .Dd when making non-trivial changes to a manual page The Document date informs the reader about the last time the manual page was updated. It is important to update whenever non-trivial changes are made to the manual pages. Trivial changes like spelling or punctuation fixes that do not affect usage can be made without updating .Dd. Give examples Show the reader examples when possible. Even trivial examples are valuable, because what is trivial to the writer is not necessarily trivial to the reader. Three examples are a good goal. A trivial example shows the minimal requirements, a serious example shows actual use, and an in-depth example demonstrates unusual or non-obvious functionality. Include the BSD license Include the BSD license on new manual pages. The preferred license is available from the Committer's Guide. Markup Tricks Add a space before punctuation on a line with macros. Example: .Sh SEE ALSO .Xr geom 4 , .Xr boot0cfg 8 , .Xr geom 8 , .Xr gptboot 8 Note how the commas at the end of the .Xr lines have been placed after a space. The .Xr macro expects two parameters to follow it: the name of an external manual page and a section number. The space separates the punctuation from the section number.
Without the space, the external links would incorrectly point to section 4, or 8,. Important Macros Some very common macros will be shown here. For more usage examples, see &man.mdoc.7;, &man.groff.mdoc.7;, or search for actual use in /usr/share/man/man* directories. For example, to search for examples of the .Bd Begin display macro: &prompt.user; find /usr/share/man/man* | xargs zgrep '.Bd' Organizational Macros Some macros are used to define logical blocks of a manual page. Organizational Macro Use .Sh Section header. Followed by the name of the section, traditionally all upper case. Think of these as chapter titles. .Ss Subsection header. Followed by the name of the subsection. Used to divide a .Sh section into subsections. .Bl Begin list. Start a list of items. .El End a list. .Bd Begin display. Begin a special area of text, like an indented area. .Ed End display. Inline Macros Many macros are used to mark up inline text. Inline Macro Use .Nm Name. Called with a name as a parameter on the first use, then used later without the parameter to display the name that has already been defined. .Pa Path to a file. Used to mark up filenames and directory paths. Sample Manual Page Structures This section shows minimal desired man page contents for several common categories of manual pages. Section 1 or 8 Command The preferred basic structure for a section 1 or 8 command: .Dd August 25, 2017 .Dt EXAMPLECMD 8 .Os .Sh NAME .Nm examplecmd .Nd "command to demonstrate section 1 and 8 man pages" .Sh SYNOPSIS .Nm .Op Fl v .Sh DESCRIPTION The .Nm utility does nothing except demonstrate a trivial but complete manual page for a section 1 or 8 command. .Sh SEE ALSO .Xr exampleconf 5 .Sh AUTHORS .An Firstname Lastname Aq Mt flastname@example.com Section 4 Device Driver The preferred basic structure for a section 4 device driver: .Dd August 25, 2017 .Dt EXAMPLEDRIVER 4 .Os .Sh NAME .Nm exampledriver .Nd "driver to demonstrate section 4 man pages" .Sh SYNOPSIS To compile this driver into the kernel, add this line to the kernel configuration file: .Bd -ragged -offset indent .Cd "device exampledriver" .Ed .Pp To load the driver as a module at boot, add this line to .Xr loader.conf 5 : .Bd -literal -offset indent exampledriver_load="YES" .Ed .Sh DESCRIPTION The .Nm driver provides an opportunity to show a skeleton or template file for section 4 manual pages. .Sh HARDWARE The .Nm driver supports these cards from the aptly-named Nonexistent Technologies: .Pp .Bl -bullet -compact .It NT X149.2 (single and dual port) .It NT X149.8 (single port) .El .Sh DIAGNOSTICS .Bl -diag .It "flashing green light" Something bad happened. .It "flashing red light" Something really bad happened. .It "solid black light" Power cord is unplugged. .El .Sh SEE ALSO .Xr example 8 .Sh HISTORY The .Nm device driver first appeared in .Fx 49.2 . .Sh AUTHORS .An Firstname Lastname Aq Mt flastname@example.com Section 5 Configuration File The preferred basic structure for a section 5 configuration file: .Dd August 25, 2017 .Dt EXAMPLECONF 5 .Os .Sh NAME .Nm example.conf .Nd "config file to demonstrate section 5 man pages" .Sh DESCRIPTION .Nm is an example configuration file. .Sh SEE ALSO .Xr example 8 .Sh AUTHORS .An Firstname Lastname Aq Mt flastname@example.com Testing Testing a new manual page can be challenging. Fortunately there are some tools that can assist in the task. Some of them, like &man.man.1;, do not look in the current directory. It is a good idea to prefix the filename with ./ if the new manual page is in the current directory. 
An absolute path can also be used. Use &man.mandoc.1;'s linter to check for parsing errors: &prompt.user; mandoc -T lint ./mynewmanpage.8 Use textproc/igor to proofread the manual page: &prompt.user; igor ./mynewmanpage.8 Use &man.man.1; to check the final result of your changes: &prompt.user; man ./mynewmanpage.8 You can use &man.col.1; to filter the output of &man.man.1; and get rid of the backspace characters before loading the result in your favorite editor for spell checking: &prompt.user; man ./mynewmanpage.8 | col -b | vim -R - Spell-checking with fully-featured dictionaries is encouraged, and can be accomplished by using textproc/hunspell or textproc/aspell combined with textproc/en-hunspell or textproc/en-aspell, respectively. For instance: &prompt.user; aspell check --lang=en --mode=nroff ./mynewmanpage.8 Example Manual Pages to Use as Templates Some manual pages are suitable as in-depth examples. Manual Page Path to Source Location &man.cp.1; /usr/src/bin/cp/cp.1 &man.vt.4; /usr/src/share/man/man4/vt.4 &man.crontab.5; /usr/src/usr.sbin/cron/crontab/crontab.5 &man.gpart.8; /usr/src/sbin/geom/class/part/gpart.8 Resources Resources for manual page writers: &man.man.1; &man.mandoc.1; &man.groff.mdoc.7; Practical UNIX Manuals: mdoc History of UNIX Manpages diff --git a/en_US.ISO8859-1/books/fdp-primer/po-translations/chapter.xml b/en_US.ISO8859-1/books/fdp-primer/po-translations/chapter.xml index e89c3bde73..cc5681e7a5 100644 --- a/en_US.ISO8859-1/books/fdp-primer/po-translations/chapter.xml +++ b/en_US.ISO8859-1/books/fdp-primer/po-translations/chapter.xml @@ -1,921 +1,921 @@ <acronym>PO</acronym> Translations Introduction The GNU gettext system offers translators an easy way to create and maintain translations of documents. Translatable strings are extracted from the original document into a PO (Portable Object) file. Translated versions of the strings are entered with a separate editor. The strings can be used directly or built into a complete translated version of the original document. Quick Start The setup procedure described earlier in this book is assumed to have already been performed. The TRANSLATOR option is required and already enabled by default in the textproc/docproj port. This example shows the creation of a Spanish translation of the short Leap Seconds article. Install a <acronym>PO</acronym> Editor A PO editor is needed to edit translation files. This example uses editors/poedit. &prompt.root; cd /usr/ports/editors/poedit &prompt.root; make install clean Initial Setup When a new translation is first created, the directory structure and Makefile must be created or copied from the English original: Create a directory for the new translation. The English article source is in ~/doc/en_US.ISO8859-1/articles/leap-seconds/. The Spanish translation will go in ~/doc/es_ES.ISO8859-1/articles/leap-seconds/. The path is the same except for the name of the language directory. &prompt.user; svn mkdir --parents ~/doc/es_ES.ISO8859-1/articles/leap-seconds/ Copy the Makefile from the original document into the translation directory: &prompt.user; svn cp ~/doc/en_US.ISO8859-1/articles/leap-seconds/Makefile \ ~/doc/es_ES.ISO8859-1/articles/leap-seconds/ Translation Translating a document consists of two steps: extracting translatable strings from the original document, and entering translations for those strings. These steps are repeated until the translator feels that enough of the document has been translated to produce a usable translated document.
Extract the translatable strings from the original English version into a PO file: &prompt.user; cd ~/doc/es_ES.ISO8859-1/articles/leap-seconds/ &prompt.user; make po Use a PO editor to enter translations in the PO file. There are several different editors available. poedit from editors/poedit is shown here. The PO file name is the two-character language code followed by an underline and a two-character region code. For Spanish, the file name is es_ES.po. &prompt.user; poedit es_ES.po Generating a Translated Document Generate the translated document: &prompt.user; cd ~/doc/es_ES.ISO8859-1/articles/leap-seconds/ &prompt.user; make tran The name of the generated document matches the name of the English original, usually article.xml for articles or book.xml for books. Check the generated file by rendering it to HTML and viewing it with a web browser: &prompt.user; make FORMATS=html &prompt.user; firefox article.html Creating New Translations The first step to creating a new translated document is locating or creating a directory to hold it. &os; puts translated documents in a subdirectory named for their language and region in the format lang_REGION. lang is a two-character lowercase code. It is followed by an underscore character and then the two-character uppercase REGION code. Language Names Language Region Translated Directory Name PO File Name Character Set English United States en_US.ISO8859-1 en_US.po ISO 8859-1 Bengali Bangladesh bn_BD.UTF-8 bn_BD.po UTF-8 Danish Denmark da_DK.ISO8859-1 da_DK.po ISO 8859-1 German Germany de_DE.ISO8859-1 de_DE.po ISO 8859-1 Greek Greece el_GR.ISO8859-7 el_GR.po ISO 8859-7 Spanish Spain es_ES.ISO8859-1 es_ES.po ISO 8859-1 French France fr_FR.ISO8859-1 fr_FR.po ISO 8859-1 Hungarian Hungary hu_HU.ISO8859-2 hu_HU.po ISO 8859-2 Italian Italy it_IT.ISO8859-15 it_IT.po ISO 8859-15 Japanese Japan ja_JP.eucJP ja_JP.po EUC JP Korean Korea ko_KR.UTF-8 ko_KR.po UTF-8 Mongolian Mongolia mn_MN.UTF-8 mn_MN.po UTF-8 Dutch Netherlands nl_NL.ISO8859-1 nl_NL.po ISO 8859-1 Polish Poland pl_PL.ISO8859-2 pl_PL.po ISO 8859-2 Portuguese Brazil pt_BR.ISO8859-1 pt_BR.po ISO 8859-1 Russian Russia ru_RU.KOI8-R ru_RU.po KOI8-R Turkish Turkey tr_TR.ISO8859-9 tr_TR.po ISO 8859-9 Chinese China zh_CN.UTF-8 zh_CN.po UTF-8 Chinese Taiwan zh_TW.UTF-8 zh_TW.po UTF-8
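Whatever the language, the contents of a PO file are plain text: each translatable string from the original document appears in a msgid entry, and the translation is entered in the matching msgstr entry below it. As a purely illustrative sketch, a fragment of es_ES.po might look like this (the strings shown are invented): #: article.translate.xml:67 msgid "When was the last leap second?" msgstr "¿Cuándo fue el último segundo intercalar?" An untranslated string simply has an empty msgstr until the translator fills it in.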
The translations are in subdirectories of the main documentation directory, here assumed to be ~/doc/ as shown in . For example, German translations are located in ~/doc/de_DE.ISO8859-1/, and French translations are in ~/doc/fr_FR.ISO8859-1/. Each language directory contains separate subdirectories named for the type of documents, usually articles/ and books/. Combining these directory names gives the complete path to an article or book. For example, the French translation of the NanoBSD article is in ~/doc/fr_FR.ISO8859-1/articles/nanobsd/, and the Mongolian translation of the Handbook is in ~/doc/mn_MN.UTF-8/books/handbook/. A new language directory must be created when translating a document to a new language. If the language directory already exists, only a subdirectory in the articles/ or books/ directory is needed. &os; documentation builds are controlled by a Makefile in the same directory. With simple articles, the Makefile can often just be copied verbatim from the original English directory. The translation process combines multiple separate book.xml and chapter.xml files in books into a single file, so the Makefile for book translations must be copied and modified. Creating a Spanish Translation of the Porter's Handbook Create a new Spanish translation of the Porter's Handbook. The original is a book in ~/doc/en_US.ISO8859-1/books/porters-handbook/. The Spanish language books directory ~/doc/es_ES.ISO8859-1/books/ already exists, so only a new subdirectory for the Porter's Handbook is needed: &prompt.user; cd ~/doc/es_ES.ISO8859-1/books/ &prompt.user; svn mkdir porters-handbook A porters-handbook Copy the Makefile from the original book: &prompt.user; cd ~/doc/es_ES.ISO8859-1/books/porters-handbook &prompt.user; svn cp ~/doc/en_US.ISO8859-1/books/porters-handbook/Makefile . A Makefile Modify the contents of the Makefile to only expect a single book.xml: # # $FreeBSD$ # # Build the FreeBSD Porter's Handbook. # MAINTAINER=doc@FreeBSD.org DOC?= book FORMATS?= html-split INSTALL_COMPRESSED?= gz INSTALL_ONLY_COMPRESSED?= # XML content SRCS= book.xml # Images from the cross-document image library IMAGES_LIB+= callouts/1.png IMAGES_LIB+= callouts/2.png IMAGES_LIB+= callouts/3.png IMAGES_LIB+= callouts/4.png IMAGES_LIB+= callouts/5.png IMAGES_LIB+= callouts/6.png IMAGES_LIB+= callouts/7.png IMAGES_LIB+= callouts/8.png IMAGES_LIB+= callouts/9.png IMAGES_LIB+= callouts/10.png IMAGES_LIB+= callouts/11.png IMAGES_LIB+= callouts/12.png IMAGES_LIB+= callouts/13.png IMAGES_LIB+= callouts/14.png IMAGES_LIB+= callouts/15.png IMAGES_LIB+= callouts/16.png IMAGES_LIB+= callouts/17.png IMAGES_LIB+= callouts/18.png IMAGES_LIB+= callouts/19.png IMAGES_LIB+= callouts/20.png IMAGES_LIB+= callouts/21.png URL_RELPREFIX?= ../../../.. DOC_PREFIX?= ${.CURDIR}/../../.. .include "${DOC_PREFIX}/share/mk/doc.project.mk" Now the document structure is ready for the translator to begin translating with make po. Creating a French Translation of the <acronym>PGP</acronym> Keys Article Create a new French translation of the PGP Keys article. The original is an article in ~/doc/en_US.ISO8859-1/articles/pgpkeys/. The French language article directory ~/doc/fr_FR.ISO8859-1/articles/ already exists, so only a new subdirectory for the PGP Keys article is needed: &prompt.user; cd ~/doc/fr_FR.ISO8859-1/articles/ &prompt.user; svn mkdir pgpkeys A pgpkeys Copy the Makefile from the original article: &prompt.user; cd ~/doc/fr_FR.ISO8859-1/articles/pgpkeys &prompt.user; svn cp ~/doc/en_US.ISO8859-1/articles/pgpkeys/Makefile . 
A Makefile Check the contents of the - Makefile. Because this is a simple + Makefile. As this is a simple article, in this case the Makefile can be used unchanged. The $&os;...$ version string on the second line will be replaced by the version control system when this file is committed. # # $FreeBSD$ # # Article: PGP Keys DOC?= article FORMATS?= html WITH_ARTICLE_TOC?= YES INSTALL_COMPRESSED?= gz INSTALL_ONLY_COMPRESSED?= SRCS= article.xml # To build with just key fingerprints, set FINGERPRINTS_ONLY. URL_RELPREFIX?= ../../../.. DOC_PREFIX?= ${.CURDIR}/../../.. .include "${DOC_PREFIX}/share/mk/doc.project.mk" With the document structure complete, the PO file can be created with make po.
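As a quick check that the setup works, the extraction step can be run immediately (a sketch; the ls output is illustrative):

&prompt.user; cd ~/doc/fr_FR.ISO8859-1/articles/pgpkeys
&prompt.user; make po
&prompt.user; ls *.po
fr_FR.po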
Translating The gettext system greatly reduces the number of things that must be tracked by a translator. Strings to be translated are extracted from the original document into a PO file. Then a PO editor is used to enter the translated versions of each string. The &os; PO translation system does not overwrite PO files, so the extraction step can be run at any time to update the PO file. A PO editor is used to edit the file. editors/poedit is shown in these examples because it is simple and has minimal requirements. Other PO editors offer features to make the job of translating easier. The Ports Collection offers several of these editors, including devel/gtranslator. It is important to preserve the PO file. It contains all of the work that translators have done. Translating the Porter's Handbook to Spanish Enter Spanish translations of the contents of the Porter's Handbook. Change to the Spanish Porter's Handbook directory and update the PO file. The generated PO file is called es_ES.po as shown in . &prompt.user; cd ~/doc/es_ES.ISO8859-1/books/porters-handbook &prompt.user; make po Enter translations using a PO editor: &prompt.user; poedit es_ES.po Tips for Translators Preserving <acronym>XML</acronym> Tags Preserve XML tags that are shown in the English original. Preserving <acronym>XML</acronym> Tags English original: If acronymNTPacronym is not being used Spanish translation: Si acronymNTPacronym no se utiliza Preserving Spaces Preserve existing spaces at the beginning and end of strings to be translated. The translated version must have these spaces also. Verbatim Tags The contents of some tags should be copied verbatim, not translated: citerefentry command filename literal manvolnum orgname package programlisting prompt refentrytitle screen userinput varname <literal>$FreeBSD$</literal> Strings The $FreeBSD$ version strings used in files require special handling. In examples like , these strings are not meant to be expanded. The English documents use &dollar; entities to avoid including actual literal dollar signs in the file: &dollar;FreeBSD&dollar; The &dollar; entities are not seen as dollar signs by the version control system and so the string is not expanded into a version string. When a PO file is created, the &dollar; entities used in examples are replaced with actual dollar signs. The resulting literal $FreeBSD$ string will be wrongly expanded by the version control system when the file is committed. The same technique as used in the English documents can be used in the translation. The &dollar; is used to replace the dollar sign in the translation entered into the PO editor: &dollar;FreeBSD&dollar; Building a Translated Document A translated version of the original document can be created at any time. Any untranslated portions of the original will be included in English in the resulting document. Most PO editors have an indicator that shows how much of the translation has been completed. This makes it easy for the translator to see when enough strings have been translated to make building the final document worthwhile. Building the Spanish Porter's Handbook Build and preview the Spanish version of the Porter's Handbook that was created in an earlier example. - Build the translated document. Because the original + Build the translated document. As the original is a book, the generated document is book.xml. &prompt.user; cd ~/doc/es_ES.ISO8859-1/books/porters-handbook &prompt.user; make tran Render the translated book.xml to HTML and view it with Firefox. 
This is the same procedure used with the English version of the documents, and other FORMATS can be used here in the same way. See . &prompt.user; make FORMATS=html &prompt.user; firefox book.html Submitting the New Translation Prepare the new translation files for submission. This includes adding the files to the version control system, setting additional properties on them, then creating a diff for submission. The diff files created by these examples can be attached to a documentation bug report or code review. Spanish Translation of the NanoBSD Article Add a &os; version string comment as the first line of the PO file: #$FreeBSD$ Add the Makefile, the PO file, and the generated XML translation to version control: &prompt.user; cd ~/doc/es_ES.ISO8859-1/articles/nanobsd/ &prompt.user; ls Makefile article.xml es_ES.po &prompt.user; svn add Makefile article.xml es_ES.po A Makefile A article.xml A es_ES.po Set the Subversion svn:keywords properties on these files to FreeBSD=%H so $FreeBSD$ strings are expanded into the path, revision, date, and author when committed: &prompt.user; svn propset svn:keywords FreeBSD=%H Makefile article.xml es_ES.po property 'svn:keywords' set on 'Makefile' property 'svn:keywords' set on 'article.xml' property 'svn:keywords' set on 'es_ES.po' Set the MIME types of the files. These are text/xml for books and articles, and text/x-gettext-translation for the PO file. &prompt.user; svn propset svn:mime-type text/x-gettext-translation es_ES.po property 'svn:mime-type' set on 'es_ES.po' &prompt.user; svn propset svn:mime-type text/xml article.xml property 'svn:mime-type' set on 'article.xml' Create a diff of the new files from the ~/doc/ base directory so the full path is shown with the filenames. This helps committers identify the target language directory. &prompt.user; cd ~/doc svn diff es_ES.ISO8859-1/articles/nanobsd/ > /tmp/es_nanobsd.diff Korean <acronym>UTF-8</acronym> Translation of the Explaining-BSD Article Add a &os; version string comment as the first line of the PO file: #$FreeBSD$ Add the Makefile, the PO file, and the generated XML translation to version control: &prompt.user; cd ~/doc/ko_KR.UTF-8/articles/explaining-bsd/ &prompt.user; ls Makefile article.xml ko_KR.po &prompt.user; svn add Makefile article.xml ko_KR.po A Makefile A article.xml A ko_KR.po Set the Subversion svn:keywords properties on these files to FreeBSD=%H so $FreeBSD$ strings are expanded into the path, revision, date, and author when committed: &prompt.user; svn propset svn:keywords FreeBSD=%H Makefile article.xml ko_KR.po property 'svn:keywords' set on 'Makefile' property 'svn:keywords' set on 'article.xml' property 'svn:keywords' set on 'ko_KR.po' Set the MIME types of the files. - Because these files use the UTF-8 - character set, that is also specified. To prevent the + These files use the UTF-8 + character set, so that is also specified. 
To prevent the version control system from mistaking these files for binary data, the fbsd:notbinary property is also set: &prompt.user; svn propset svn:mime-type 'text/x-gettext-translation; charset=UTF-8' ko_KR.po property 'svn:mime-type' set on 'ko_KR.po' &prompt.user; svn propset fbsd:notbinary yes ko_KR.po property 'fbsd:notbinary' set on 'ko_KR.po' &prompt.user; svn propset svn:mime-type 'text/xml; charset=UTF-8' article.xml property 'svn:mime-type' set on 'article.xml' &prompt.user; svn propset fbsd:notbinary yes article.xml property 'fbsd:notbinary' set on 'article.xml' Create a diff of these new files from the ~/doc/ base directory: &prompt.user; cd ~/doc svn diff ko_KR.UTF-8/articles/explaining-bsd > /tmp/ko-explaining.diff
diff --git a/en_US.ISO8859-1/books/fdp-primer/xml-primer/chapter.xml b/en_US.ISO8859-1/books/fdp-primer/xml-primer/chapter.xml index 96d2c60d0a..a91251c25b 100644 --- a/en_US.ISO8859-1/books/fdp-primer/xml-primer/chapter.xml +++ b/en_US.ISO8859-1/books/fdp-primer/xml-primer/chapter.xml @@ -1,1423 +1,1423 @@ XML Primer Most FDP documentation is written with markup languages based on XML. This chapter explains what that means, how to read and understand the documentation source, and the XML techniques used. Portions of this section were inspired by Mark Galassi's Get Going With DocBook. Overview In the original days of computers, electronic text was simple. There were a few character sets like ASCII or EBCDIC, but that was about it. Text was text, and what you saw really was what you got. No frills, no formatting, no intelligence. Inevitably, this was not enough. When text is in a machine-usable format, machines are expected to be able to use and manipulate it intelligently. Authors want to indicate that certain phrases should be emphasized, or added to a glossary, or made into hyperlinks. Filenames could be shown in a typewriter style font for viewing on screen, but as italics when printed, or any of a myriad of other options for presentation. It was once hoped that Artificial Intelligence (AI) would make this easy. The computer would read the document and automatically identify key phrases, filenames, text that the reader should type in, examples, and more. Unfortunately, real life has not happened quite like that, and computers still require assistance before they can meaningfully process text. More precisely, they need help identifying what is what. Consider this text:
To remove /tmp/foo, use &man.rm.1;. &prompt.user; rm /tmp/foo
It is easy to see which parts are filenames, which are commands to be typed in, which parts are references to manual pages, and so on. But the computer processing the document cannot. For this we need markup. Markup is commonly used to describe adding value or increasing cost. The term takes on both these meanings when applied to text. Markup is additional text included in the document, distinguished from the document's content in some way, so that programs that process the document can read the markup and use it when making decisions about the document. Editors can hide the markup from the user, so the user is not distracted by it. The extra information stored in the markup adds value to the document. Adding the markup to the document must typically be done by a person—after all, if computers could recognize the text sufficiently well to add the markup then there would be no need to add it in the first place. This increases the cost (the effort required) to create the document. The previous example is actually represented in this document like this:

<para>To remove <filename>/tmp/foo</filename>, use &man.rm.1;.</para>

<screen>&prompt.user; <userinput>rm /tmp/foo</userinput></screen>

The markup is clearly separate from the content. Markup languages define what the markup means and how it should be interpreted. Of course, one markup language might not be enough. A markup language for technical documentation has very different requirements than a markup language that is intended for cookery recipes. This, in turn, would be very different from a markup language used to describe poetry. What is really needed is a first language used to write these other markup languages. A meta markup language. This is exactly what the eXtensible Markup Language (XML) is. Many markup languages have been written in XML, including the two most used by the FDP, XHTML and DocBook. Each language definition is more properly called a grammar, vocabulary, schema or Document Type Definition (DTD). There are various languages to specify an XML grammar, or schema. A schema is a complete specification of all the elements that are allowed to appear, the order in which they should appear, which elements are mandatory, which are optional, and so forth. This makes it possible to write an XML parser which reads in both the schema and a document which claims to conform to the schema. The parser can then confirm whether or not all the elements required by the vocabulary are in the document in the right order, and whether there are any errors in the markup. This is normally referred to as validating the document. Validation confirms that the choice of elements, their ordering, and so on, conforms to that listed in the grammar. It does not check whether appropriate markup has been used for the content. If all the filenames in a document were marked up as function names, the parser would not flag this as an error (assuming, of course, that the schema defines elements for filenames and functions, and that they are allowed to appear in the same place). Most contributions to the Documentation Project will be content marked up in either XHTML or DocBook, rather than alterations to the schemas. For this reason, this book will not touch on how to write a vocabulary.
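Validation is demonstrated hands-on later in this chapter. As a minimal preview, a validating parser such as xmllint can check a document against the schema named in its DOCTYPE declaration (document.xml is a placeholder name here); no output means no validation errors were found:

&prompt.user; xmllint --valid --noout document.xml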
Elements, Tags, and Attributes All the vocabularies written in XML share certain characteristics. This is hardly surprising, as the philosophy behind XML will inevitably show through. One of the most obvious manifestations of this philosophy is that of content and elements. Documentation, whether it is a single web page, or a lengthy book, is considered to consist of content. This content is then divided and further subdivided into elements. The purpose of adding markup is to name and identify the boundaries of these elements for further processing. For example, consider a typical book. At the very top level, the book is itself an element. This book element obviously contains chapters, which can be considered to be elements in their own right. Each chapter will contain more elements, such as paragraphs, quotations, and footnotes. Each paragraph might contain further elements, identifying content that was direct speech, or the name of a character in the story. It may be helpful to think of this as chunking content. At the very top level is one chunk, the book. Look a little deeper, and there are more chunks, the individual chapters. These are chunked further into paragraphs, footnotes, character names, and so on. Notice how this differentiation between different elements of the content can be made without resorting to any XML terms. It really is surprisingly straightforward. This could be done with a highlighter pen and a printout of the book, using different colors to indicate different chunks of content. Of course, we do not have an electronic highlighter pen, so we need some other way of indicating which element each piece of content belongs to. In languages written in XML (XHTML, DocBook, et al) this is done by means of tags. A tag is used to identify where a particular element starts, and where the element ends. The tag is not part of - the element itself. Because each grammar was + the element itself. As each grammar was normally written to mark up specific types of information, each one will recognize different elements, and will therefore have different names for the tags. For an element called element-name the start tag will normally look like <element-name>. The corresponding closing tag for this element is </element-name>. Using an Element (Start and End Tags) XHTML has an element for indicating that the content enclosed by the element is a paragraph, called p.

<p>This is a paragraph.  It starts with the start tag for
  the 'p' element, and it will end with the end tag for the 'p'
  element.</p>

<p>This is another paragraph.  But this one is much
  shorter.</p>

Some elements have no content. For example, in XHTML, a horizontal line can be included in the document. For these empty elements, XML introduced a shorthand form that is completely equivalent to the two-tag version: Using an Element Without Content XHTML has an element for indicating a horizontal rule, called hr. This element does not wrap content, so it looks like this:

<p>One paragraph.</p>

<hr></hr>

<p>This is another paragraph.  A horizontal rule separates
  this from the previous paragraph.</p>

The shorthand version consists of a single tag:

<p>One paragraph.</p>

<hr/>

<p>This is another paragraph.  A horizontal rule separates
  this from the previous paragraph.</p>

As shown above, elements can contain other elements. In the book example earlier, the book element contained all the chapter elements, which in turn contained all the paragraph elements, and so on.
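The chunking of the book example can itself be sketched in markup. This is a schematic illustration with DocBook-like element names, not a complete valid document:

<book>
  <chapter>
    <para>A paragraph in the first chapter.</para>
    <para>Another paragraph.</para>
  </chapter>

  <chapter>
    <para>The first paragraph of the second chapter.</para>
  </chapter>
</book>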
Elements Within Elements; <tag>em</tag>

<p>This is a simple <em>paragraph</em> where some
  of the <em>words</em> have been
  <em>emphasized</em>.</p>

The grammar consists of rules that describe which elements can contain other elements, and exactly what they can contain. People often confuse the terms tags and elements, and use the terms as if they were interchangeable. They are not. An element is a conceptual part of your document. An element has a defined start and end. The tags mark where the element starts and ends. When this document (or anyone else knowledgeable about XML) refers to the p tag they mean the literal text consisting of the three characters <, p, and >. But the phrase the p element refers to the whole element. This distinction is very subtle. But keep it in mind. Elements can have attributes. An attribute has a name and a value, and is used for adding extra information to the element. This might be information that indicates how the content should be rendered, or might be something that uniquely identifies that occurrence of the element, or it might be something else. An element's attributes are written inside the start tag for that element, and take the form attribute-name="attribute-value". In XHTML, the p element has an attribute called align, which suggests an alignment (justification) for the paragraph to the program displaying the XHTML. The align attribute can take one of four defined values, left, center, right and justify. If the attribute is not specified then the default is left. Using an Element with an Attribute

<p align="left">The inclusion of the align attribute
  on this paragraph was superfluous, since the default is
  left.</p>

<p align="center">This may appear in the center.</p>

Some attributes only take specific values, such as left or justify. Others allow any value. Single Quotes Around Attributes

<p align='right'>I am on the right!</p>

Attribute values in XML must be enclosed in either single or double quotes. Double quotes are traditional. Single quotes are useful when the attribute value contains double quotes. Information about attributes, elements, and tags is stored in catalog files. The Documentation Project uses standard DocBook catalogs and includes additional catalogs for &os;-specific features. Paths to the catalog files are defined in an environment variable so they can be found by the document build tools. To Do… Before running the examples in this document, install textproc/docproj from the &os; Ports Collection. This is a meta-port that downloads and installs the standard programs and supporting files needed by the Documentation Project. &man.csh.1; users must use rehash for the shell to recognize new programs after they have been installed, or log out and then log back in again. Create example.xml, and enter this text:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>An Example XHTML File</title>
  </head>

  <body>
    <p>This is a paragraph containing some text.</p>

    <p>This paragraph contains some more text.</p>

    <p align="right">This paragraph might be right-justified.</p>
  </body>
</html>

Try to validate this file using an XML parser. textproc/docproj includes the xmllint validating parser. Use xmllint to validate the document: &prompt.user; xmllint --valid --noout example.xml xmllint returns without displaying any output, showing that the document validated successfully. See what happens when required elements are omitted.
Delete the line with the title and title tags, and re-run the validation. &prompt.user; xmllint --valid --noout example.xml example.xml:5: element head: validity error : Element head content does not follow the DTD, expecting ((script | style | meta | link | object | isindex)* , ((title , (script | style | meta | link | object | isindex)* , (base , (script | style | meta | link | object | isindex)*)?) | (base , (script | style | meta | link | object | isindex)* , title , (script | style | meta | link | object | isindex)*))), got () This shows that the validation error comes from the fifth line of the example.xml file and that the content of the head is the part which does not follow the rules of the XHTML grammar. Then xmllint shows the line where the error was found and marks the exact character position with a ^ sign. Replace the title element. The DOCTYPE Declaration The beginning of each document can specify the name of the DTD to which the document conforms. This DOCTYPE declaration is used by XML parsers to identify the DTD and ensure that the document does conform to it. A typical declaration for a document written to conform with version 1.0 of the XHTML DTD looks like this: !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" That line contains a number of different components. <! The indicator shows this is an XML declaration. DOCTYPE Shows that this is an XML declaration of the document type. html Names the first element that will appear in the document. PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" Lists the Formal Public Identifier (FPI) Formal Public Identifier for the DTD to which this document conforms. The XML parser uses this to find the correct DTD when processing this document. PUBLIC is not a part of the FPI, but indicates to the XML processor how to find the DTD referenced in the FPI. Other ways of telling the XML parser how to find the DTD are shown later. "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" A local filename or a URL to find the DTD. > Ends the declaration and returns to the document. Formal Public Identifiers (<acronym>FPI</acronym>s) Formal Public Identifier It is not necessary to know this, but it is useful background, and might help debug problems when the XML processor cannot locate the DTD. FPIs must follow a specific syntax: "Owner//Keyword Description//Language" Owner The owner of the FPI. The beginning of the string identifies the owner of the FPI. For example, the FPI "ISO 8879:1986//ENTITIES Greek Symbols//EN" lists ISO 8879:1986 as being the owner for the set of entities for Greek symbols. ISO 8879:1986 is the International Organization for Standardization (ISO) number for the SGML standard, the predecessor (and a superset) of XML. Otherwise, this string will either look like -//Owner or +//Owner (notice the only difference is the leading + or -). If the string starts with - then the owner information is unregistered, with a + identifying it as registered. ISO 9070:1991 defines how registered names are generated. It might be derived from the number of an ISO publication, an ISBN code, or an organization code assigned according to ISO 6523. Additionally, a registration authority could be created in order to assign registered names. The ISO council delegated this to the American National Standards Institute (ANSI). - Because the &os; Project has not been registered, + Since the &os; Project has not been registered, the owner string is -//&os;. 
As seen - in the example, the W3C are not a + in the example, the W3C is not a registered owner either. Keyword There are several keywords that indicate the type of information in the file. Some of the most common keywords are DTD, ELEMENT, ENTITIES, and TEXT. DTD is used only for DTD files, ELEMENT is usually used for DTD fragments that contain only entity or element declarations. TEXT is used for XML content (text and tags). Description Any description can be given for the contents of this file. This may include version numbers or any short text that is meaningful and unique for the XML system. Language An ISO two-character code that identifies the native language for the file. EN is used for English. <filename>catalog</filename> Files With the syntax above, an XML processor needs to have some way of turning the FPI into the name of the file containing the DTD. A catalog file (typically called catalog) contains lines that map FPIs to filenames. For example, if the catalog file contained the line: PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "1.0/transitional.dtd" The XML processor knows that the DTD is called transitional.dtd in the 1.0 subdirectory of the directory that held catalog. Examine the contents of /usr/local/share/xml/dtd/xhtml/catalog.xml. This is the catalog file for the XHTML DTDs that were installed as part of the textproc/docproj port. Alternatives to <acronym>FPI</acronym>s Instead of using an FPI to indicate the DTD to which the document conforms (and therefore, which file on the system contains the DTD), the filename can be explicitly specified. The syntax is slightly different: !DOCTYPE html SYSTEM "/path/to/file.dtd" The SYSTEM keyword indicates that the XML processor should locate the DTD in a system specific fashion. This typically (but not always) means the DTD will be provided as a filename. Using FPIs is preferred for reasons of portability. If the SYSTEM identifier is used, then the DTD must be provided and kept in the same location for everyone. Escaping Back to <acronym>XML</acronym> Some of the underlying XML syntax can be useful within documents. For example, comments can be included in the document, and will be ignored by the parser. Comments are entered using XML syntax. Other uses for XML syntax will be shown later. XML sections begin with a <! tag and end with a >. These sections contain instructions for the parser rather than elements of the document. Everything between these tags is XML syntax. The DOCTYPE declaration shown earlier is an example of XML syntax included in the document. Comments An XML document may contain comments. They may appear anywhere as long as they are not inside tags. They are even allowed in some locations inside the DTD (e.g., between entity declarations). XML comments start with the string <!-- and end with the string -->. Here are some examples of valid XML comments: <acronym>XML</acronym> Generic Comments <!-- This is inside the comment --> <!--This is another comment--> <!-- This is how you write multiline comments --> <p>A simple <!-- Comment inside an element's content --> paragraph.</p> XML comments may contain any strings except --: Erroneous <acronym>XML</acronym> Comment <!-- This comment--is wrong --> To Do… Add some comments to example.xml, and check that the file still validates using xmllint. Add some invalid comments to example.xml, and see the error messages that xmllint gives when it encounters an invalid comment. Entities Entities are a mechanism for assigning names to chunks of content. 
As an XML parser processes a document, any entities it finds are replaced by the content of the entity. This is a good way to have re-usable, easily changeable chunks of content in XML documents. It is also the only way to include one marked up file inside another using XML. There are two types of entities for two different situations: general entities and parameter entities. General Entities General entities are used to assign names to reusable chunks of text. These entities can only be used in the document. They cannot be used in an XML context. To include the text of a general entity in the document, include &entity-name; in the text. For example, consider a general entity called current.version which expands to the current version number of a product. To use it in the document, write: paraThe current version of our product is &current.version;.para When the version number changes, edit the definition of the general entity, replacing the value. Then reprocess the document. General entities can also be used to enter characters that could not otherwise be included in an XML document. For example, < and & cannot normally appear in an XML document. The XML parser sees the < symbol as the start of a tag. Likewise, when the & symbol is seen, the next text is expected to be an entity name. These symbols can be included by using two predefined general entities: &lt; and &amp;. General entities can only be defined within an XML context. Such definitions are usually done immediately after the DOCTYPE declaration. Defining General Entities <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [ <!ENTITY current.version "3.0-RELEASE"> <!ENTITY last.version "2.2.7-RELEASE"> ]> The DOCTYPE declaration has been extended by adding a square bracket at the end of the first line. The two entities are then defined over the next two lines, the square bracket is closed, and then the DOCTYPE declaration is closed. The square brackets are necessary to indicate that the DTD indicated by the DOCTYPE declaration is being extended. Parameter Entities Parameter entities, like general entities, are used to assign names to reusable chunks of text. But parameter entities can only be used within an XML context. Parameter entity definitions are similar to those for general entities. However, parameter entities are included with %entity-name;. The definition also includes the % between the ENTITY keyword and the name of the entity. For a mnemonic, think Parameter entities use the Percent symbol. Defining Parameter Entities <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [ <!ENTITY % entity "<!ENTITY version '1.0'>"> <!-- use the parameter entity --> %entity; ]> At first sight, parameter entities do not look very useful, but they make it possible to include other files into an XML document. To Do… Add a general entity to example.xml. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [ <!ENTITY version "1.1"> ]> html xmlns="http://www.w3.org/1999/xhtml" head titleAn Example XHTML Filetitle head <!-- There may be some comments in here as well --> body pThis is a paragraph containing some text.p pThis paragraph contains some more text.p p align="right"This paragraph might be right-justified.p pThe current version of this document is: &version;p body html Validate the document using xmllint. 
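The same invocation shown earlier in this chapter applies here:

&prompt.user; xmllint --valid --noout example.xml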
Load example.xml into a web browser. It may have to be copied to example.html before the browser recognizes it as an XHTML document. Older browsers with simple parsers may not render this file as expected. The entity reference &version; may not be replaced by the version number, or the XML context closing ]> may not be recognized and instead shown in the output. The solution is to normalize the document with an XML normalizer. The normalizer reads valid XML and writes equally valid XML which has been transformed in some way. One way the normalizer transforms the input is by expanding all the entity references in the document, replacing the entities with the text that they represent. xmllint can be used for this. It also has an option to drop the initial DTD section so that the closing ]> does not confuse browsers: &prompt.user; xmllint --noent --dropdtd example.xml > example.html A normalized copy of the document with entities expanded is produced in example.html, ready to load into a web browser. Using Entities to Include Files Both general and parameter entities are particularly useful for including one file inside another. Using General Entities to Include Files Consider some content for an XML book organized into files, one file per chapter, called chapter1.xml, chapter2.xml, and so forth, with a book.xml that will contain these chapters. In order to use the contents of these files as the values for entities, they are declared with the SYSTEM keyword. This directs the XML parser to include the contents of the named file as the value of the entity. Using General Entities to Include Files <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [ <!ENTITY chapter.1 SYSTEM "chapter1.xml"> <!ENTITY chapter.2 SYSTEM "chapter2.xml"> <!ENTITY chapter.3 SYSTEM "chapter3.xml"> <!-- And so forth --> ]> html xmlns="http://www.w3.org/1999/xhtml" <!-- Use the entities to load in the chapters --> &chapter.1; &chapter.2; &chapter.3; html When using general entities to include other files within a document, the files being included (chapter1.xml, chapter2.xml, and so on) must not start with a DOCTYPE declaration. This is a syntax error because entities are low-level constructs and they are resolved before any parsing happens. Using Parameter Entities to Include Files Parameter entities can only be used inside an XML context. Including a file in an XML context can be used to ensure that general entities are reusable. Suppose that there are many chapters in the document, and these chapters were reused in two different books, each book organizing the chapters in a different fashion. The entities could be listed at the top of each book, but that quickly becomes cumbersome to manage. Instead, place the general entity definitions inside one file, and use a parameter entity to include that file within the document. Using Parameter Entities to Include Files Place the entity definitions in a separate file called chapters.ent and containing this text: <!ENTITY chapter.1 SYSTEM "chapter1.xml"> <!ENTITY chapter.2 SYSTEM "chapter2.xml"> <!ENTITY chapter.3 SYSTEM "chapter3.xml"> Create a parameter entity to refer to the contents of the file. Then use the parameter entity to load the file into the document, which will then make all the general entities available for use. 
Then use the general entities as before: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [ <!-- Define a parameter entity to load in the chapter general entities --> <!ENTITY % chapters SYSTEM "chapters.ent"> <!-- Now use the parameter entity to load in this file --> %chapters; ]> html xmlns="http://www.w3.org/1999/xhtml" &chapter.1; &chapter.2; &chapter.3; html To Do… Use General Entities to Include Files Create three files, para1.xml, para2.xml, and para3.xml. Put content like this in each file: pThis is the first paragraph.p Edit example.xml so that it looks like this: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [ <!ENTITY version "1.1"> <!ENTITY para1 SYSTEM "para1.xml"> <!ENTITY para2 SYSTEM "para2.xml"> <!ENTITY para3 SYSTEM "para3.xml"> ]> html xmlns="http://www.w3.org/1999/xhtml" head titleAn Example XHTML Filetitle head body pThe current version of this document is: &version;p &para1; &para2; &para3; body html Produce example.html by normalizing example.xml. &prompt.user; xmllint --dropdtd --noent example.xml > example.html Load example.html into the web browser and confirm that the paran.xml files have been included in example.html. Use Parameter Entities to Include Files The previous steps must have completed before this step. Edit example.xml so that it looks like this: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [ <!ENTITY % entities SYSTEM "entities.ent"> %entities; ]> html xmlns="http://www.w3.org/1999/xhtml" head titleAn Example XHTML Filetitle head body pThe current version of this document is: &version;p &para1; &para2; &para3; body html Create a new file called entities.ent with this content: <!ENTITY version "1.1"> <!ENTITY para1 SYSTEM "para1.xml"> <!ENTITY para2 SYSTEM "para2.xml"> <!ENTITY para3 SYSTEM "para3.xml"> Produce example.html by normalizing example.xml. &prompt.user; xmllint --dropdtd --noent example.xml > example.html Load example.html into the web browser and confirm that the paran.xml files have been included in example.html. Marked Sections XML provides a mechanism to indicate that particular pieces of the document should be processed in a special way. These are called marked sections. Structure of a Marked Section <![KEYWORD[ Contents of marked section ]]> As expected of an XML construct, a marked section starts with <!. The first square bracket begins the marked section. KEYWORD describes how this marked section is to be processed by the parser. The second square bracket indicates the start of the marked section's content. The marked section is finished by closing the two square brackets, and then returning to the document context from the XML context with >. Marked Section Keywords <literal>CDATA</literal> These keywords denote the marked sections content model, and allow you to change it from the default. When an XML parser is processing a document, it keeps track of the content model. The content model describes the content the parser is expecting to see and what it will do with that content. The CDATA content model is one of the most useful. CDATA is for Character Data. When the parser is in this content model, it expects to see only characters. In this model the < and & symbols lose their special status, and will be treated as ordinary characters. 
When using CDATA in examples of text marked up in XML, remember that the content of CDATA is not validated. The included text must be checked with other means. For example, the content could be written in another document, validated, and then pasted into the CDATA section. Using a <literal>CDATA</literal> Marked Section

<para>Here is an example of how to include some text
  that contains many <literal>&lt;</literal>
  and <literal>&amp;</literal> symbols.  The sample
  text is a fragment of <acronym>XHTML</acronym>.
  The surrounding text (<para> and <programlisting>) are from
  DocBook.</para>

<programlisting><![CDATA[<p>This is a sample that shows some of the
elements within <acronym>XHTML</acronym>.  Since the angle
brackets are used so many times, it is simpler to say the whole
example is a CDATA marked section than to use the entity names
for the left and right angle brackets throughout.</p>

<ul>
  <li>This is a listitem</li>
  <li>This is a second listitem</li>
  <li>This is a third listitem</li>
</ul>

<p>This is the end of the example.</p>]]></programlisting>

<literal>INCLUDE</literal> and <literal>IGNORE</literal> When the keyword is INCLUDE, then the contents of the marked section will be processed. When the keyword is IGNORE, the marked section is ignored and will not be processed. It will not appear in the output. Using <literal>INCLUDE</literal> and <literal>IGNORE</literal> in Marked Sections <![INCLUDE[ This text will be processed and included. ]]> <![IGNORE[ This text will not be processed or included. ]]> By itself, this is not too useful. Text to be removed from the document could be cut out, or wrapped in comments. It becomes more useful when controlled by parameter entities, yet this usage is limited to entity files. For example, suppose that documentation was produced in a hard-copy version and an electronic version. Some extra text is desired in the electronic version that is not to appear in the hard-copy. Create an entity file that defines general entities to include each chapter and guard these definitions with a parameter entity that can be set to either INCLUDE or IGNORE to control whether the entity is defined. After these conditional general entity definitions, place one more definition for each general entity to set them to an empty value. This technique makes use of the fact that entity definitions cannot be overridden; the first definition always takes effect. So the inclusion of the chapter is controlled with the corresponding parameter entity. Set to INCLUDE, the first general entity definition will be read and the second one will be ignored. Set to IGNORE, the first definition will be ignored and the second one will take effect. Using a Parameter Entity to Control a Marked Section <!ENTITY % electronic.copy "INCLUDE"> <![%electronic.copy;[ <!ENTITY chap.preface SYSTEM "preface.xml"> ]]> <!ENTITY chap.preface ""> When producing the hard-copy version, change the parameter entity's definition to: <!ENTITY % electronic.copy "IGNORE"> To Do… Modify entities.ent to contain the following: <!ENTITY version "1.1"> <!ENTITY % conditional.text "IGNORE"> <![%conditional.text;[ <!ENTITY para1 SYSTEM "para1.xml"> ]]> <!ENTITY para1 ""> <!ENTITY para2 SYSTEM "para2.xml"> <!ENTITY para3 SYSTEM "para3.xml"> Normalize example.xml and notice that the conditional text is not present in the output document. Set the parameter entity guard to INCLUDE and regenerate the normalized document and the text will appear again. This method makes sense if there are more conditional chunks depending on the same condition.
For example, it can control whether printed or online text is generated. Conclusion That is the conclusion of this XML primer. For reasons of space and complexity, several things have not been covered in depth (or at all). However, the previous sections cover enough XML to introduce the organization of the FDP documentation.
diff --git a/en_US.ISO8859-1/books/handbook/basics/chapter.xml b/en_US.ISO8859-1/books/handbook/basics/chapter.xml index bc31008871..1be3270b7e 100644 --- a/en_US.ISO8859-1/books/handbook/basics/chapter.xml +++ b/en_US.ISO8859-1/books/handbook/basics/chapter.xml @@ -1,3417 +1,3417 @@ &os; Basics Synopsis This chapter covers the basic commands and functionality of the &os; operating system. Much of this material is relevant for any &unix;-like operating system. New &os; users are encouraged to read through this chapter carefully. After reading this chapter, you will know: How to use and configure virtual consoles. How to create and manage users and groups on &os;. How &unix; file permissions and &os; file flags work. The default &os; file system layout. The &os; disk organization. How to mount and unmount file systems. What processes, daemons, and signals are. What a shell is, and how to change the default login environment. How to use basic text editors. What devices and device nodes are. How to read manual pages for more information. Virtual Consoles and Terminals virtual consoles terminals console Unless &os; has been configured to automatically start a graphical environment during startup, the system will boot into a command line login prompt, as seen in this example: FreeBSD/amd64 (pc3.example.org) (ttyv0) login: The first line contains some information about the system. The amd64 indicates that the system in this example is running a 64-bit version of &os;. The hostname is pc3.example.org, and ttyv0 indicates that this is the system console. The second line is the login prompt. Since &os; is a multiuser system, it needs some way to distinguish between different users. This is accomplished by requiring every user to log into the system before gaining access to the programs on the system. Every user has a unique name username and a personal password. To log into the system console, type the username that was configured during system installation, as described in , and press Enter. Then enter the password associated with the username and press Enter. The password is not echoed for security reasons. Once the correct password is input, the message of the day (MOTD) will be displayed followed by a command prompt. Depending upon the shell that was selected when the user was created, this prompt will be a #, $, or % character. The prompt indicates that the user is now logged into the &os; system console and ready to try the available commands. Virtual Consoles While the system console can be used to interact with the system, a user working from the command line at the keyboard of a &os; system will typically instead log into a virtual console. This is because system messages are configured by default to display on the system console. These messages will appear over the command or file that the user is working on, making it difficult to concentrate on the work at hand. By default, &os; is configured to provide several virtual consoles for inputting commands. Each virtual console has its own login prompt and shell and it is easy to switch between virtual consoles. This essentially provides the command line equivalent of having several windows open at the same time in a graphical environment. The key combinations AltF1 through AltF8 have been reserved by &os; for switching between virtual consoles. Use AltF1 to switch to the system console (ttyv0), AltF2 to access the first virtual console (ttyv1), AltF3 to access the second virtual console (ttyv2), and so on. 
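The active console can also be selected from the command line. As a sketch (run as root from one of the virtual consoles), &man.vidcontrol.1; can switch to a terminal by number:

&prompt.root; vidcontrol -s 2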
When using &xorg; as a graphical console, the combination becomes CtrlAltF1 to return to a text-based virtual console. When switching from one console to the next, &os; manages the screen output. The result is an illusion of having multiple virtual screens and keyboards that can be used to type commands for &os; to run. The programs that are launched in one virtual console do not stop running when the user switches to a different virtual console. Refer to &man.kbdcontrol.1;, &man.vidcontrol.1;, &man.atkbd.4;, &man.syscons.4;, and &man.vt.4; for a more technical description of the &os; console and its keyboard drivers. In &os;, the number of available virtual consoles is configured in this section of /etc/ttys: # name getty type status comments # ttyv0 "/usr/libexec/getty Pc" xterm on secure # Virtual terminals ttyv1 "/usr/libexec/getty Pc" xterm on secure ttyv2 "/usr/libexec/getty Pc" xterm on secure ttyv3 "/usr/libexec/getty Pc" xterm on secure ttyv4 "/usr/libexec/getty Pc" xterm on secure ttyv5 "/usr/libexec/getty Pc" xterm on secure ttyv6 "/usr/libexec/getty Pc" xterm on secure ttyv7 "/usr/libexec/getty Pc" xterm on secure ttyv8 "/usr/X11R6/bin/xdm -nodaemon" xterm off secure To disable a virtual console, put a comment symbol (#) at the beginning of the line representing that virtual console. For example, to reduce the number of available virtual consoles from eight to four, put a # in front of the last four lines representing virtual consoles ttyv5 through ttyv8. Do not comment out the line for the system console ttyv0. Note that the last virtual console (ttyv8) is used to access the graphical environment if &xorg; has been installed and configured as described in . For a detailed description of every column in this file and the available options for the virtual consoles, refer to &man.ttys.5;. Single User Mode The &os; boot menu provides an option labelled as Boot Single User. If this option is selected, the system will boot into a special mode known as single user mode. This mode is typically used to repair a system that will not boot or to reset the root password when it is not known. While in single user mode, networking and other virtual consoles are not available. However, full root access to the system is available, and by default, the root password is not needed. For these reasons, physical access to the keyboard is needed to boot into this mode and determining who has physical access to the keyboard is something to consider when securing a &os; system. The settings which control single user mode are found in this section of /etc/ttys: # name getty type status comments # # If console is marked "insecure", then init will ask for the root password # when going to single-user mode. console none unknown off secure By default, the status is set to secure. This assumes that who has physical access to the keyboard is either not important or it is controlled by a physical security policy. If this setting is changed to insecure, the assumption is that the environment itself is insecure because anyone can access the keyboard. When this line is changed to insecure, &os; will prompt for the root password when a user selects to boot into single user mode. Be careful when changing this setting to insecure! If the root password is forgotten, booting into single user mode is still possible, but may be difficult for someone who is not familiar with the &os; booting process. 
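As a sketch of the edit described above, requiring the root password when entering single user mode means changing the console line in /etc/ttys to read:

console none                            unknown off insecure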
Changing Console Video Modes The &os; console default video mode may be adjusted to 1024x768, 1280x1024, or any other size supported by the graphics chip and monitor. To use a different video mode load the VESA module: &prompt.root; kldload vesa To determine which video modes are supported by the hardware, use &man.vidcontrol.1;. To get a list of supported video modes issue the following: &prompt.root; vidcontrol -i mode The output of this command lists the video modes that are supported by the hardware. To select a new video mode, specify the mode using &man.vidcontrol.1; as the root user: &prompt.root; vidcontrol MODE_279 If the new video mode is acceptable, it can be permanently set on boot by adding it to /etc/rc.conf: allscreens_flags="MODE_279" Users and Basic Account Management &os; allows multiple users to use the computer at the same time. While only one user can sit in front of the screen and use the keyboard at any one time, any number of users can log in to the system through the network. To use the system, each user should have their own user account. This chapter describes: The different types of user accounts on a &os; system. How to add, remove, and modify user accounts. How to set limits to control the resources that users and groups are allowed to access. How to create groups and add users as members of a group. Account Types Since all access to the &os; system is achieved using accounts and all processes are run by users, user and account management is important. There are three main types of accounts: system accounts, user accounts, and the superuser account. System Accounts accounts system System accounts are used to run services such as DNS, mail, and web servers. The reason for this is security; if all services ran as the superuser, they could act without restriction. accounts daemon accounts operator Examples of system accounts are daemon, operator, bind, news, and www. Care must be taken when using the operator group, as unintended superuser-like access privileges may be granted, including but not limited to shutdown, reboot, and access to all items in /dev in the group. accounts nobody nobody is the generic unprivileged system account. However, the more services that use nobody, the more files and processes that user will become associated with, and hence the more privileged that user becomes. User Accounts accounts user User accounts are assigned to real people and are used to log in and use the system. Every person accessing the system should have a unique user account. This allows the administrator to find out who is doing what and prevents users from clobbering the settings of other users. Each user can set up their own environment to accommodate their use of the system, by configuring their default shell, editor, key bindings, and language settings. Every user account on a &os; system has certain information associated with it: User name The user name is typed at the login: prompt. Each user must have a unique user name. There are a number of rules for creating valid user names which are documented in &man.passwd.5;. It is recommended to use user names that consist of eight or fewer, all lower case characters in order to maintain backwards compatibility with applications. Password Each account has an associated password. User ID (UID) The User ID (UID) is a number used to uniquely identify the user to the &os; system. Commands that allow a user name to be specified will first convert it to the UID. 
It is recommended to use a UID less than 65535, since higher values may cause compatibility issues with some software. Group ID (GID) The Group ID (GID) is a number used to uniquely identify the primary group that the user belongs to. Groups are a mechanism for controlling access to resources based on a user's GID rather than their UID. This can significantly reduce the size of some configuration files and allows users to be members of more than one group. It is recommended to use a GID of 65535 or lower as higher GIDs may break some software. Login class Login classes are an extension to the group mechanism that provide additional flexibility when tailoring the system to different users. Login classes are discussed further in . Password change time By default, passwords do not expire. However, password expiration can be enabled on a per-user basis, forcing some or all users to change their passwords after a certain amount of time has elapsed. Account expiration time By default, &os; does not expire accounts. When creating accounts that need a limited lifespan, such as student accounts in a school, specify the account expiry date using &man.pw.8;. After the expiry time has elapsed, the account cannot be used to log in to the system, although the account's directories and files will remain. User's full name The user name uniquely identifies the account to &os;, but does not necessarily reflect the user's real name. Similar to a comment, this information can contain spaces, uppercase characters, and be more than 8 characters long. Home directory The home directory is the full path to a directory on the system. This is the user's starting directory when the user logs in. A common convention is to put all user home directories under /home/username or /usr/home/username. Each user stores their personal files and subdirectories in their own home directory. User shell The shell provides the user's default environment for interacting with the system. There are many different kinds of shells and experienced users will have their own preferences, which can be reflected in their account settings. The Superuser Account accounts superuser (root) The superuser account, usually called root, is used to manage the system with no limitations on privileges. For this reason, it should not be used for day-to-day tasks like sending and receiving mail, general exploration of the system, or programming. The superuser, unlike other user accounts, can operate without limits, and misuse of the superuser account may result in spectacular disasters. User accounts are unable to destroy the operating system by mistake, so it is recommended to login as a user account and to only become the superuser when a command requires extra privilege. Always double and triple-check any commands issued as the superuser, since an extra space or missing character can mean irreparable data loss. There are several ways to gain superuser privilege. While one can log in as root, this is highly discouraged. Instead, use &man.su.1; to become the superuser. If - is specified when running this command, the user will also inherit the root user's environment. The user running this command must be in the wheel group or else the command will fail. The user must also know the password for the root user account. In this example, the user only becomes superuser in order to run make install as this step requires superuser privilege. 
Once the command completes, the user types exit to leave the superuser account and return to the privilege of their user account. Install a Program As the Superuser &prompt.user; configure &prompt.user; make &prompt.user; su - Password: &prompt.root; make install &prompt.root; exit &prompt.user; The built-in &man.su.1; framework works well for single systems or small networks with just one system administrator. An alternative is to install the security/sudo package or port. This software provides activity logging and allows the administrator to configure which users can run which commands as the superuser. Managing Accounts accounts modifying &os; provides a variety of different commands to manage user accounts. The most common commands are summarized in , followed by some examples of their usage. See the manual page for each utility for more details and usage examples. Utilities for Managing User Accounts Command Summary &man.adduser.8; The recommended command-line application for adding new users. &man.rmuser.8; The recommended command-line application for removing users. &man.chpass.1; A flexible tool for changing user database information. &man.passwd.1; The command-line tool to change user passwords. &man.pw.8; A powerful and flexible tool for modifying all aspects of user accounts.
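The account expiration time described earlier can be set from the command line with &man.pw.8;. As a brief sketch, using a hypothetical user name and date (see &man.pw.8; for the accepted date formats): &prompt.root; pw usermod jru -e 01-May-2025 After that date, jru will no longer be able to log in, although the account's directories and files will remain.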
<command>adduser</command> accounts adding adduser /usr/share/skel skeleton directory The recommended program for adding new users is &man.adduser.8;. When a new user is added, this program automatically updates /etc/passwd and /etc/group. It also creates a home directory for the new user, copies in the default configuration files from /usr/share/skel, and can optionally mail the new user a welcome message. This utility must be run as the superuser. The &man.adduser.8; utility is interactive and walks through the steps for creating a new user account. As seen in , either input the required information or press Return to accept the default value shown in square brackets. In this example, the user has been invited into the wheel group, allowing them to become the superuser with &man.su.1;. When finished, the utility will prompt to either create another user or to exit. Adding a User on &os; &prompt.root; adduser Username: jru Full name: J. Random User Uid (Leave empty for default): Login group [jru]: Login group is jru. Invite jru into other groups? []: wheel Login class [default]: Shell (sh csh tcsh zsh nologin) [sh]: zsh Home directory [/home/jru]: Home directory permissions (Leave empty for default): Use password-based authentication? [yes]: Use an empty password? (yes/no) [no]: Use a random password? (yes/no) [no]: Enter password: Enter password again: Lock out the account after creation? [no]: Username : jru Password : **** Full Name : J. Random User Uid : 1001 Class : Groups : jru wheel Home : /home/jru Shell : /usr/local/bin/zsh Locked : no OK? (yes/no): yes adduser: INFO: Successfully added (jru) to the user database. Add another user? (yes/no): no Goodbye! &prompt.root; Since the password is not echoed when typed, be careful to not mistype the password when creating the user account. <command>rmuser</command> rmuser accounts removing To completely remove a user from the system, run &man.rmuser.8; as the superuser. This command performs the following steps: Removes the user's &man.crontab.1; entry, if one exists. Removes any &man.at.1; jobs belonging to the user. Kills all processes owned by the user. Removes the user from the system's local password file. Optionally removes the user's home directory, if it is owned by the user. Removes the incoming mail files belonging to the user from /var/mail. Removes all files owned by the user from temporary file storage areas such as /tmp. Finally, removes the username from all groups to which it belongs in /etc/group. If a group becomes empty and the group name is the same as the username, the group is removed. This complements the per-user unique groups created by &man.adduser.8;. &man.rmuser.8; cannot be used to remove superuser accounts since that is almost always an indication of massive destruction. By default, an interactive mode is used, as shown in the following example. <command>rmuser</command> Interactive Account Removal &prompt.root; rmuser jru Matching password entry: jru:*:1001:1001::0:0:J. Random User:/home/jru:/usr/local/bin/zsh Is this the entry you wish to remove? y Remove user's home directory (/home/jru)? y Removing user (jru): mailspool home passwd. &prompt.root; <command>chpass</command> chpass Any user can use &man.chpass.1; to change their default shell and personal information associated with their user account. The superuser can use this utility to change additional account information for any user. When passed no options, aside from an optional username, &man.chpass.1; displays an editor containing user information. 
When the user exits from the editor, the user database is updated with the new information. This utility will prompt for the user's password when exiting the editor, unless the utility is run as the superuser. In , the superuser has typed chpass jru and is now viewing the fields that can be changed for this user. If jru runs this command instead, only the last six fields will be displayed and available for editing. This is shown in . Using <command>chpass</command> as Superuser #Changing user database information for jru. Login: jru Password: * Uid [#]: 1001 Gid [# or name]: 1001 Change [month day year]: Expire [month day year]: Class: Home directory: /home/jru Shell: /usr/local/bin/zsh Full Name: J. Random User Office Location: Office Phone: Home Phone: Other information: Using <command>chpass</command> as Regular User #Changing user database information for jru. Shell: /usr/local/bin/zsh Full Name: J. Random User Office Location: Office Phone: Home Phone: Other information: The commands &man.chfn.1; and &man.chsh.1; are links to &man.chpass.1;, as are &man.ypchpass.1;, &man.ypchfn.1;, and &man.ypchsh.1;. Since NIS support is automatic, specifying the yp before the command is not necessary. How to configure NIS is covered in . <command>passwd</command> passwd accounts changing password Any user can easily change their password using &man.passwd.1;. To prevent accidental or unauthorized changes, this command will prompt for the user's original password before a new password can be set: Changing Your Password &prompt.user; passwd Changing local password for jru. Old password: New password: Retype new password: passwd: updating the database... passwd: done The superuser can change any user's password by specifying the username when running &man.passwd.1;. When this utility is run as the superuser, it will not prompt for the user's current password. This allows the password to be changed when a user cannot remember the original password. Changing Another User's Password as the Superuser &prompt.root; passwd jru Changing local password for jru. New password: Retype new password: passwd: updating the database... passwd: done As with &man.chpass.1;, &man.yppasswd.1; is a link to &man.passwd.1;, so NIS works with either command. <command>pw</command> pw The &man.pw.8; utility can create, remove, modify, and display users and groups. It functions as a front end to the system user and group files. &man.pw.8; has a very powerful set of command line options that make it suitable for use in shell scripts, but new users may find it more complicated than the other commands presented in this section.
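Because &man.pw.8; takes all of its input from the command line, it is well suited to scripted account creation. A minimal sketch, assuming a hypothetical user name, shell, and full name: &prompt.root; pw useradd -n jru -m -s /usr/local/bin/zsh -c "J. Random User" Here -n names the account, -m creates and populates the home directory, -s sets the login shell, and -c sets the full name. See &man.pw.8; for the many other options.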
Managing Groups groups /etc/groups accounts groups A group is a list of users. A group is identified by its group name and GID. In &os;, the kernel uses the UID of a process, and the list of groups it belongs to, to determine what the process is allowed to do. Most of the time, the GID of a user or process refers to the first group in that list. The group name to GID mapping is listed in /etc/group. This is a plain text file with four colon-delimited fields. The first field is the group name, the second is the encrypted password, the third the GID, and the fourth the comma-delimited list of members. For a more complete description of the syntax, refer to &man.group.5;. The superuser can modify /etc/group using a text editor. Alternatively, &man.pw.8; can be used to add and edit groups. For example, to add a group called teamtwo and then confirm that it exists: Adding a Group Using &man.pw.8; &prompt.root; pw groupadd teamtwo &prompt.root; pw groupshow teamtwo teamtwo:*:1100: In this example, 1100 is the GID of teamtwo. Right now, teamtwo has no members. This command will add jru as a member of teamtwo. Adding User Accounts to a New Group Using &man.pw.8; &prompt.root; pw groupmod teamtwo -M jru &prompt.root; pw groupshow teamtwo teamtwo:*:1100:jru The argument to is a comma-delimited list of users to be added to a new (empty) group or to replace the members of an existing group. To the user, this group membership is different from (and in addition to) the user's primary group listed in the password file. This means that the user will not show up as a member when using with &man.pw.8;, but will show up when the information is queried via &man.id.1; or a similar tool. When &man.pw.8; is used to add a user to a group, it only manipulates /etc/group and does not attempt to read additional data from /etc/passwd. Adding a New Member to a Group Using &man.pw.8; &prompt.root; pw groupmod teamtwo -m db &prompt.root; pw groupshow teamtwo teamtwo:*:1100:jru,db In this example, the argument to is a comma-delimited list of users who are to be added to the group. Unlike the previous example, these users are appended to the group and do not replace existing users in the group. Using &man.id.1; to Determine Group Membership &prompt.user; id jru uid=1001(jru) gid=1001(jru) groups=1001(jru), 1100(teamtwo) In this example, jru is a member of the groups jru and teamtwo. For more information about this command and the format of /etc/group, refer to &man.pw.8; and &man.group.5;.
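&man.pw.8; can also remove members from a group. As a sketch continuing the example above, to remove db from teamtwo (the -d argument takes a comma-delimited list of members to remove): &prompt.root; pw groupmod teamtwo -d db &prompt.root; pw groupshow teamtwo should then show teamtwo:*:1100:jru again.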
Permissions UNIX In &os;, every file and directory has an associated set of permissions and several utilities are available for viewing and modifying these permissions. Understanding how permissions work is necessary to make sure that users are able to access the files that they need and are unable to improperly access the files used by the operating system or owned by other users. This section discusses the traditional &unix; permissions used in &os;. For finer grained file system access control, refer to . In &unix;, basic permissions are assigned using three types of access: read, write, and execute. These access types are used to determine file access to the file's owner, group, and others (everyone else). The read, write, and execute permissions can be represented as the letters r, w, and x. They can also be represented as binary numbers as each permission is either on (1) or off (0). When represented as a number, the order is always read as rwx, where r has an on value of 4, w has an on value of 2, and x has an on value of 1. Table 4.1 summarizes the possible numeric and alphabetic possibilities. When reading the Directory Listing column, a - is used to represent a permission that is set to off. permissions file permissions &unix; Permissions Value Permission Directory Listing 0 No read, no write, no execute --- 1 No read, no write, execute --x 2 No read, write, no execute -w- 3 No read, write, execute -wx 4 Read, no write, no execute r-- 5 Read, no write, execute r-x 6 Read, write, no execute rw- 7 Read, write, execute rwx
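To apply a value from this table with &man.chmod.1;, combine one digit for each of the owner, the group, and others. For example, the octal value 644 grants the owner read and write (4+2) and gives the group and everyone else read only (4). A brief sketch using a hypothetical file name: &prompt.user; chmod 644 myfile The resulting permissions would display as -rw-r--r--, as in the directory listing discussed below.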
&man.ls.1; directories Use the argument to &man.ls.1; to view a long directory listing that includes a column of information about a file's permissions for the owner, group, and everyone else. For example, an ls -l in an arbitrary directory may show: &prompt.user; ls -l total 530 -rw-r--r-- 1 root wheel 512 Sep 5 12:31 myfile -rw-r--r-- 1 root wheel 512 Sep 5 12:31 otherfile -rw-r--r-- 1 root wheel 7680 Sep 5 12:31 email.txt The first (leftmost) character in the first column indicates whether this file is a regular file, a directory, a special character device, a socket, or any other special pseudo-file device. In this example, the - indicates a regular file. The next three characters, rw- in this example, give the permissions for the owner of the file. The next three characters, r--, give the permissions for the group that the file belongs to. The final three characters, r--, give the permissions for the rest of the world. A dash means that the permission is turned off. In this example, the permissions are set so the owner can read and write to the file, the group can read the file, and the rest of the world can only read the file. According to the table above, the permissions for this file would be 644, where each digit represents the three parts of the file's permission. How does the system control permissions on devices? &os; treats most hardware devices as a file that programs can open, read, and write data to. These special device files are stored in /dev/. Directories are also treated as files. They have read, write, and execute permissions. The executable bit for a directory has a slightly different meaning than that of files. When a directory is marked executable, it means it is possible to change into that directory using &man.cd.1;. This also means that it is possible to access the files within that directory, subject to the permissions on the files themselves. In order to perform a directory listing, the read permission must be set on the directory. In order to delete a file that one knows the name of, it is necessary to have write and execute permissions to the directory containing the file. There are more permission bits, but they are primarily used in special circumstances such as setuid binaries and sticky directories. For more information on file permissions and how to set them, refer to &man.chmod.1;. Symbolic Permissions Tom Rhodes Contributed by permissions symbolic Symbolic permissions use characters instead of octal values to assign permissions to files or directories. Symbolic permissions use the syntax of (who) (action) (permissions), where the following values are available: Option Letter Represents (who) u User (who) g Group owner (who) o Other (who) a All (world) (action) + Adding permissions (action) - Removing permissions (action) = Explicitly set permissions (permissions) r Read (permissions) w Write (permissions) x Execute (permissions) t Sticky bit (permissions) s Set UID or GID These values are used with &man.chmod.1;, but with letters instead of numbers. For example, the following command would block other users from accessing FILE: &prompt.user; chmod go= FILE A comma separated list can be provided when more than one set of changes to a file must be made. For example, the following command removes the group and world write permission on FILE, and adds the execute permissions for everyone: &prompt.user; chmod go-w,a+x FILE &os; File Flags Tom Rhodes Contributed by In addition to file permissions, &os; supports the use of file flags. 
These flags add an additional level of security and control over files, but not directories. With file flags, even root can be prevented from removing or altering files. File flags are modified using &man.chflags.1;. For example, to enable the system undeletable flag on the file file1, issue the following command: &prompt.root; chflags sunlink file1 To disable the system undeletable flag, put a no in front of the : &prompt.root; chflags nosunlink file1 To view the flags of a file, use with &man.ls.1;: &prompt.root; ls -lo file1 -rw-r--r-- 1 trhodes trhodes sunlnk 0 Mar 1 05:54 file1 Several file flags may only be added or removed by the root user. In other cases, the file owner may set its file flags. Refer to &man.chflags.1; and &man.chflags.2; for more information. The <literal>setuid</literal>, <literal>setgid</literal>, and <literal>sticky</literal> Permissions Tom Rhodes Contributed by Other than the permissions already discussed, there are three other specific settings that all administrators should know about. They are the setuid, setgid, and sticky permissions. These settings are important for some &unix; operations as they provide functionality not normally granted to normal users. To understand them, the difference between the real user ID and effective user ID must be noted. The real user ID is the UID who owns or starts the process. The effective UID is the user ID the process runs as. As an example, &man.passwd.1; runs with the real user ID when a user changes their password. However, in order to update the password database, the command runs as the effective ID of the root user. This allows users to change their passwords without seeing a Permission Denied error. The setuid permission may be set by prefixing a permission set with the number four (4) as shown in the following example: &prompt.root; chmod 4755 suidexample.sh The permissions on suidexample.sh now look like the following: -rwsr-xr-x 1 trhodes trhodes 63 Aug 29 06:36 suidexample.sh Note that a s is now part of the permission set designated for the file owner, replacing the executable bit. This allows utilities which need elevated permissions, such as &man.passwd.1;. The nosuid &man.mount.8; option will cause such binaries to silently fail without alerting the user. That option is not completely reliable as a nosuid wrapper may be able to circumvent it. To view this in real time, open two terminals. On one, type passwd as a normal user. While it waits for a new password, check the process table and look at the user information for &man.passwd.1;: In terminal A: Changing local password for trhodes Old Password: In terminal B: &prompt.root; ps aux | grep passwd trhodes 5232 0.0 0.2 3420 1608 0 R+ 2:10AM 0:00.00 grep passwd root 5211 0.0 0.2 3620 1724 2 I+ 2:09AM 0:00.01 passwd Although &man.passwd.1; is run as a normal user, it is using the effective UID of root. The setgid permission performs the same function as the setuid permission; except that it alters the group settings. When an application or utility executes with this setting, it will be granted the permissions based on the group that owns the file, not the user who started the process. 
To set the setgid permission on a file, provide &man.chmod.1; with a leading two (2): &prompt.root; chmod 2755 sgidexample.sh In the following listing, notice that the s is now in the field designated for the group permission settings: -rwxr-sr-x 1 trhodes trhodes 44 Aug 31 01:49 sgidexample.sh In these examples, even though the shell script in question is an executable file, it will not run with a different EUID or effective user ID. This is because shell scripts may not access the &man.setuid.2; system calls. The setuid and setgid permission bits may lower system security, by allowing for elevated permissions. The third special permission, the sticky bit, can strengthen the security of a system. When the sticky bit is set on a directory, it allows file deletion only by the file owner. This is useful to prevent file deletion in public directories, such as /tmp, by users who do not own the file. To utilize this permission, prefix the permission set with a one (1): &prompt.root; chmod 1777 /tmp The sticky bit permission will display as a t at the very end of the permission set: &prompt.root; ls -al / | grep tmp drwxrwxrwt 10 root wheel 512 Aug 31 01:49 tmp
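Since setuid and setgid programs run with elevated privileges, administrators sometimes audit which files have these bits set. One way to do this, as a sketch, is with &man.find.1;: &prompt.root; find / -type f \( -perm -4000 -o -perm -2000 \) -ls Here -perm -4000 matches setuid files and -perm -2000 matches setgid files.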
Directory Structure directory hierarchy The &os; directory hierarchy is fundamental to obtaining an overall understanding of the system. The most important directory is the root directory, /. This directory is the first one mounted at boot time and it contains the base system necessary to prepare the operating system for multi-user operation. The root directory also contains mount points for other file systems that are mounted during the transition to multi-user operation. A mount point is a directory where additional file systems can be grafted onto a parent file system (usually the root file system). This is further described in . Standard mount points include /usr/, /var/, /tmp/, /mnt/, and /cdrom/. These directories are usually referenced to entries in /etc/fstab. This file is a table of various file systems and mount points and is read by the system. Most of the file systems in /etc/fstab are mounted automatically at boot time from the script &man.rc.8; unless their entry includes . Details can be found in . A complete description of the file system hierarchy is available in &man.hier.7;. The following table provides a brief overview of the most common directories. Directory Description / Root directory of the file system. /bin/ User utilities fundamental to both single-user and multi-user environments. /boot/ Programs and configuration files used during operating system bootstrap. /boot/defaults/ Default boot configuration files. Refer to &man.loader.conf.5; for details. /dev/ Device nodes. Refer to &man.intro.4; for details. /etc/ System configuration files and scripts. /etc/defaults/ Default system configuration files. Refer to &man.rc.8; for details. /etc/mail/ Configuration files for mail transport agents such as &man.sendmail.8;. /etc/periodic/ Scripts that run daily, weekly, and monthly via &man.cron.8;. Refer to &man.periodic.8; for details. /etc/ppp/ &man.ppp.8; configuration files. /mnt/ Empty directory commonly used by system administrators as a temporary mount point. /proc/ Process file system. Refer to &man.procfs.5;, &man.mount.procfs.8; for details. /rescue/ Statically linked programs for emergency recovery as described in &man.rescue.8;. /root/ Home directory for the root account. /sbin/ System programs and administration utilities fundamental to both single-user and multi-user environments. /tmp/ Temporary files which are usually not preserved across a system reboot. A memory-based file system is often mounted at /tmp. This can be automated using the tmpmfs-related variables of &man.rc.conf.5; or with an entry in /etc/fstab; refer to &man.mdmfs.8; for details. /usr/ The majority of user utilities and applications. /usr/bin/ Common utilities, programming tools, and applications. /usr/include/ Standard C include files. /usr/lib/ Archive libraries. /usr/libdata/ Miscellaneous utility data files. /usr/libexec/ System daemons and system utilities executed by other programs. /usr/local/ Local executables and libraries. Also used as the default destination for the &os; ports framework. Within /usr/local, the general layout sketched out by &man.hier.7; for /usr should be used. Exceptions are the man directory, which is directly under /usr/local rather than under /usr/local/share, and the ports documentation is in share/doc/port. /usr/obj/ Architecture-specific target tree produced by building the /usr/src tree. /usr/ports/ The &os; Ports Collection (optional). /usr/sbin/ System daemons and system utilities executed by users. /usr/share/ Architecture-independent files.
/usr/src/ BSD and/or local source files. /var/ Multi-purpose log, temporary, transient, and spool files. A memory-based file system is sometimes mounted at /var. This can be automated using the varmfs-related variables in &man.rc.conf.5; or with an entry in /etc/fstab; refer to &man.mdmfs.8; for details. /var/log/ Miscellaneous system log files. /var/mail/ User mailbox files. /var/spool/ Miscellaneous printer and mail system spooling directories. /var/tmp/ Temporary files which are usually preserved across a system reboot, unless /var is a memory-based file system. /var/yp/ NIS maps. Disk Organization The smallest unit of organization that &os; uses to find files is the filename. Filenames are case-sensitive, which means that readme.txt and README.TXT are two separate files. &os; does not use the extension of a file to determine whether the file is a program, document, or some other form of data. Files are stored in directories. A directory may contain no files, or it may contain many hundreds of files. A directory can also contain other directories, allowing a hierarchy of directories within one another in order to organize data. Files and directories are referenced by giving the file or directory name, followed by a forward slash, /, followed by any other directory names that are necessary. For example, if the directory foo contains a directory bar which contains the file readme.txt, the full name, or path, to the file is foo/bar/readme.txt. Note that this is different from &windows; which uses \ to separate file and directory names. &os; does not use drive letters, or other drive names in the path. For example, one would not type c:\foo\bar\readme.txt on &os;. Directories and files are stored in a file system. Each file system contains exactly one directory at the very top level, called the root directory for that file system. This root directory can contain other directories. One file system is designated the root file system or /. Every other file system is mounted under the root file system. No matter how many disks are on the &os; system, every directory appears to be part of the same disk. Consider three file systems, called A, B, and C. Each file system has one root directory, which contains two other directories, called A1, A2 (and likewise B1, B2 and C1, C2). Call A the root file system. If &man.ls.1; is used to view the contents of this directory, it will show two subdirectories, A1 and A2. The directory tree looks like this: / | +--- A1 | `--- A2 A file system must be mounted on to a directory in another file system. When mounting file system B on to the directory A1, the root directory of B replaces A1, and the directories in B appear accordingly: / | +--- A1 | | | +--- B1 | | | `--- B2 | `--- A2 Any files that are in the B1 or B2 directories can be reached with the path /A1/B1 or /A1/B2 as necessary. Any files that were in /A1 have been temporarily hidden. They will reappear if B is unmounted from A. If B had been mounted on A2 then the diagram would look like this: / | +--- A1 | `--- A2 | +--- B1 | `--- B2 and the paths would be /A2/B1 and /A2/B2 respectively. File systems can be mounted on top of one another. 
Continuing the last example, the C file system could be mounted on top of the B1 directory in the B file system, leading to this arrangement: / | +--- A1 | | | +--- B1 | | | +--- C1 | | | `--- C2 | `--- B2 Or C could be mounted directly onto the A file system, under the A1 directory: / | +--- A1 | | | +--- C1 | | | `--- C2 | `--- A2 | +--- B1 | `--- B2 It is entirely possible to have one large root file system, and not need to create any others. There are some drawbacks to this approach, and one advantage. Benefits of Multiple File Systems Different file systems can have different mount options. For example, the root file system can be mounted read-only, making it impossible for users to inadvertently delete or edit a critical file. Separating user-writable file systems, such as /home, from other file systems allows them to be mounted nosuid. This option prevents the setuid/setgid bits on executables stored on the file system from taking effect, possibly improving security. &os; automatically optimizes the layout of files on a file system, depending on how the file system is being used. So a file system that contains many small files that are written frequently will have a different optimization to one that contains fewer, larger files. By having one big file system this optimization breaks down. &os;'s file systems are robust if power is lost. However, a power loss at a critical point could still damage the structure of the file system. By splitting data over multiple file systems it is more likely that the system will still come up, making it easier to restore from backup as necessary. Benefit of a Single File System File systems are a fixed size. If you create a file system when you install &os; and give it a specific size, you may later discover that you need to make the partition bigger. This is not easily accomplished without backing up, recreating the file system with the new size, and then restoring the backed up data. &os; features the &man.growfs.8; command, which makes it possible to increase the size of a file system on the fly, removing this limitation. File systems are contained in partitions. This does not have the same meaning as the common usage of the term partition (for example, &ms-dos; partition), because of &os;'s &unix; heritage. Each partition is identified by a letter from a through to h. Each partition can contain only one file system, which means that file systems are often described by either their typical mount point in the file system hierarchy, or the letter of the partition they are contained in. &os; also uses disk space for swap space to provide virtual memory. This allows your computer to behave as though it has much more memory than it actually does. When &os; runs out of memory, it moves some of the data that is not currently being used to the swap space, and moves it back in (moving something else out) when it needs it. Some partitions have certain conventions associated with them. Partition Convention a Normally contains the root file system. b Normally contains swap space. c Normally the same size as the enclosing slice. This allows utilities that need to work on the entire slice, such as a bad block scanner, to work on the c partition. A file system would not normally be created on this partition. d Partition d used to have a special meaning associated with it, although that is now gone and d may work as any normal partition. Disks in &os; are divided into slices, referred to in &windows; as partitions, which are numbered from 1 to 4.
These are then divided into partitions, which contain file systems, and are labeled using letters. slices partitions dangerously dedicated Slice numbers follow the device name, prefixed with an s, starting at 1. So da0s1 is the first slice on the first SCSI drive. There can only be four physical slices on a disk, but there can be logical slices inside physical slices of the appropriate type. These extended slices are numbered starting at 5, so ada0s5 is the first extended slice on the first SATA disk. These devices are used by file systems that expect to occupy a slice. Slices, dangerously dedicated physical drives, and other drives contain partitions, which are represented as letters from a to h. This letter is appended to the device name, so da0a is the a partition on the first da drive, which is dangerously dedicated. ada1s3e is the fifth partition in the third slice of the second SATA disk drive. Finally, each disk on the system is identified. A disk name starts with a code that indicates the type of disk, and then a number, indicating which disk it is. Unlike slices, disk numbering starts at 0. Common codes are listed in . When referring to a partition, include the disk name, s, the slice number, and then the partition letter. Examples are shown in . shows a conceptual model of a disk layout. When installing &os;, configure the disk slices, create partitions within the slice to be used for &os;, create a file system or swap space in each partition, and decide where each file system will be mounted. Disk Device Names Drive Type Drive Device Name SATA and IDE hard drives ada or ad SCSI hard drives and USB storage devices da SATA and IDE CD-ROM drives cd or acd SCSI CD-ROM drives cd Floppy drives fd Assorted non-standard CD-ROM drives mcd for Mitsumi CD-ROM and scd for Sony CD-ROM devices SCSI tape drives sa IDE tape drives ast RAID drives Examples include aacd for &adaptec; AdvancedRAID, mlxd and mlyd for &mylex;, amrd for AMI &megaraid;, idad for Compaq Smart RAID, twed for &tm.3ware; RAID.
Sample Disk, Slice, and Partition Names Name Meaning ada0s1a The first partition (a) on the first slice (s1) on the first SATA disk (ada0). da1s2e The fifth partition (e) on the second slice (s2) on the second SCSI disk (da1). Conceptual Model of a Disk This diagram shows &os;'s view of the first SATA disk attached to the system. Assume that the disk is 250 GB in size, and contains an 80 GB slice and a 170 GB slice (&ms-dos; partitions). The first slice contains a &windows; NTFS file system, C:, and the second slice contains a &os; installation. This example &os; installation has four data partitions and a swap partition. The four partitions each hold a file system. Partition a is used for the root file system, d for /var/, e for /tmp/, and f for /usr/. Partition letter c refers to the entire slice, and so is not used for ordinary partitions.
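To see how an actual disk is divided, the &man.gpart.8; utility can display the partitioning of a device. A brief sketch, assuming the first SATA disk from the example above: &prompt.root; gpart show ada0 The output depends entirely on how the disk is partitioned and will differ from machine to machine.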
Mounting and Unmounting File Systems The file system is best visualized as a tree, rooted, as it were, at /. /dev, /usr, and the other directories in the root directory are branches, which may have their own branches, such as /usr/local, and so on. root file system There are various reasons to house some of these directories on separate file systems. /var contains the directories log/, spool/, and various types of temporary files, and as such, may get filled up. Filling up the root file system is not a good idea, so splitting /var from / is often favorable. Another common reason to contain certain directory trees on other file systems is if they are to be housed on separate physical disks, or are separate virtual disks, such as Network File System mounts, described in , or CDROM drives. The <filename>fstab</filename> File file systems mounted with fstab During the boot process (), file systems listed in /etc/fstab are automatically mounted except for the entries containing . This file contains entries in the following format: device /mount-point fstype options dumpfreq passno device An existing device name as explained in . mount-point An existing directory on which to mount the file system. fstype The file system type to pass to &man.mount.8;. The default &os; file system is ufs. options Either for read-write file systems, or for read-only file systems, followed by any other options that may be needed. A common option is for file systems not normally mounted during the boot sequence. Other options are listed in &man.mount.8;. dumpfreq Used by &man.dump.8; to determine which file systems require dumping. If the field is missing, a value of zero is assumed. passno Determines the order in which file systems should be checked. File systems that should be skipped should have their passno set to zero. The root file system needs to be checked before everything else and should have its passno set to one. The other file systems should be set to values greater than one. If more than one file system has the same passno, &man.fsck.8; will attempt to check file systems in parallel if possible. Refer to &man.fstab.5; for more information on the format of /etc/fstab and its options. Using &man.mount.8; file systems mounting File systems are mounted using &man.mount.8;. The most basic syntax is as follows: &prompt.root; mount device mountpoint This command provides many options which are described in &man.mount.8;. The most commonly used options include: Mount Options Mount all the file systems listed in /etc/fstab, except those marked as noauto, excluded by the flag, or those that are already mounted. Do everything except for the actual mount system call. This option is useful in conjunction with the flag to determine what &man.mount.8; is actually trying to do. Force the mount of an unclean file system (dangerous), or the revocation of write access when downgrading a file system's mount status from read-write to read-only. Mount the file system read-only. This is identical to using . fstype Mount the specified file system type or mount only file systems of the given type, if is included. ufs is the default file system type. Update mount options on the file system. Be verbose. Mount the file system read-write. The following options can be passed to as a comma-separated list: nosuid Do not interpret setuid or setgid flags on the file system. This is also a useful security option. Using &man.umount.8; file systems unmounting To unmount a file system, use &man.umount.8;.
This command takes one parameter which can be a mountpoint, device name, or . All forms take to force unmounting, and for verbosity. Be warned that is not generally a good idea as it might crash the computer or damage data on the file system. To unmount all mounted file systems, or just the file system types listed after , use or . Note that does not attempt to unmount the root file system. Processes and Daemons &os; is a multi-tasking operating system. Each program running at any one time is called a process. Every running command starts at least one new process and there are a number of system processes that are run by &os;. Each process is uniquely identified by a number called a process ID (PID). Similar to files, each process has one owner and group, and the owner and group permissions are used to determine which files and devices the process can open. Most processes also have a parent process that started them. For example, the shell is a process, and any command started in the shell is a process which has the shell as its parent process. The exception is a special process called &man.init.8; which is always the first process to start at boot time and which always has a PID of 1. Some programs are not designed to be run with continuous user input and disconnect from the terminal at the first opportunity. For example, a web server responds to web requests, rather than user input. Mail servers are another example of this type of application. These types of programs are known as daemons. The term daemon comes from Greek mythology and represents an entity that is neither good nor evil, and which invisibly performs useful tasks. This is why the BSD mascot is the cheerful-looking daemon with sneakers and a pitchfork. There is a convention to name programs that normally run as daemons with a trailing d. For example, BIND is the Berkeley Internet Name Domain, but the actual program that executes is named. The Apache web server program is httpd and the line printer spooling daemon is lpd. This is only a naming convention. For example, the main mail daemon for the Sendmail application is sendmail, and not maild. Viewing Processes To see the processes running on the system, use &man.ps.1; or &man.top.1;. To display a static list of the currently running processes, their PIDs, how much memory they are using, and the command they were started with, use &man.ps.1;. To display all the running processes and update the display every few seconds in order to interactively see what the computer is doing, use &man.top.1;. By default, &man.ps.1; only shows the commands that are running and owned by the user. For example: &prompt.user; ps PID TT STAT TIME COMMAND 8203 0 Ss 0:00.59 /bin/csh 8895 0 R+ 0:00.00 ps The output from &man.ps.1; is organized into a number of columns. The PID column displays the process ID. PIDs are assigned starting at 1, go up to 99999, then wrap around back to the beginning. However, a PID is not reassigned if it is already in use. The TT column shows the tty the program is running on and STAT shows the program's state. TIME is the amount of time the program has been running on the CPU. This is usually not the elapsed time since the program was started, as most programs spend a lot of time waiting for things to happen before they need to spend time on the CPU. Finally, COMMAND is the command that was used to start the program. A number of different options are available to change the information that is displayed. 
One of the most useful sets is auxww, where displays information about all the running processes of all users, displays the username and memory usage of the process' owner, displays information about daemon processes, and causes &man.ps.1; to display the full command line for each process, rather than truncating it once it gets too long to fit on the screen. The output from &man.top.1; is similar: &prompt.user; top last pid: 9609; load averages: 0.56, 0.45, 0.36 up 0+00:20:03 10:21:46 107 processes: 2 running, 104 sleeping, 1 zombie CPU: 6.2% user, 0.1% nice, 8.2% system, 0.4% interrupt, 85.1% idle Mem: 541M Active, 450M Inact, 1333M Wired, 4064K Cache, 1498M Free ARC: 992M Total, 377M MFU, 589M MRU, 250K Anon, 5280K Header, 21M Other Swap: 2048M Total, 2048M Free PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 557 root 1 -21 r31 136M 42296K select 0 2:20 9.96% Xorg 8198 dru 2 52 0 449M 82736K select 3 0:08 5.96% kdeinit4 8311 dru 27 30 0 1150M 187M uwait 1 1:37 0.98% firefox 431 root 1 20 0 14268K 1728K select 0 0:06 0.98% moused 9551 dru 1 21 0 16600K 2660K CPU3 3 0:01 0.98% top 2357 dru 4 37 0 718M 141M select 0 0:21 0.00% kdeinit4 8705 dru 4 35 0 480M 98M select 2 0:20 0.00% kdeinit4 8076 dru 6 20 0 552M 113M uwait 0 0:12 0.00% soffice.bin 2623 root 1 30 10 12088K 1636K select 3 0:09 0.00% powerd 2338 dru 1 20 0 440M 84532K select 1 0:06 0.00% kwin 1427 dru 5 22 0 605M 86412K select 1 0:05 0.00% kdeinit4 The output is split into two sections. The header (the first five or six lines) shows the PID of the last process to run, the system load averages (which are a measure of how busy the system is), the system uptime (time since the last reboot) and the current time. The other figures in the header relate to how many processes are running, how much memory and swap space has been used, and how much time the system is spending in different CPU states. If the ZFS file system module has been loaded, an ARC line indicates how much data was read from the memory cache instead of from disk. Below the header is a series of columns containing similar information to the output from &man.ps.1;, such as the PID, username, amount of CPU time, and the command that started the process. By default, &man.top.1; also displays the amount of memory space taken by the process. This is split into two columns: one for total size and one for resident size. Total size is how much memory the application has needed and the resident size is how much it is actually using now. &man.top.1; automatically updates the display every two seconds. A different interval can be specified with . Killing Processes One way to communicate with any running process or daemon is to send a signal using &man.kill.1;. There are a number of different signals; some have a specific meaning while others are described in the application's documentation. A user can only send a signal to a process they own and sending a signal to someone else's process will result in a permission denied error. The exception is the root user, who can send signals to anyone's processes. The operating system can also send a signal to a process. If an application is badly written and tries to access memory that it is not supposed to, &os; will send the process the Segmentation Violation signal (SIGSEGV). If an application has been written to use the &man.alarm.3; system call to be alerted after a period of time has elapsed, it will be sent the Alarm signal (SIGALRM). Two signals can be used to stop a process: SIGTERM and SIGKILL. 
SIGTERM is the polite way to kill a process as the process can read the signal, close any log files it may have open, and attempt to finish what it is doing before shutting down. In some cases, a process may ignore SIGTERM if it is in the middle of some task that cannot be interrupted. SIGKILL cannot be ignored by a process. Sending a SIGKILL to a process will usually stop that process there and then. There are a few tasks that cannot be interrupted. For example, if the process is trying to read from a file that is on another computer on the network, and the other computer is unavailable, the process is said to be uninterruptible. Eventually the process will time out, typically after two minutes. As soon as this time out occurs the process will be killed. Other commonly used signals are SIGHUP, SIGUSR1, and SIGUSR2. Since these are general purpose signals, different applications will respond differently. For example, after changing a web server's configuration file, the web server needs to be told to re-read its configuration. Restarting httpd would result in a brief outage period on the web server. Instead, send the daemon the SIGHUP signal. Be aware that different daemons will have different behavior, so refer to the documentation for the daemon to determine if SIGHUP will achieve the desired results. Sending a Signal to a Process This example shows how to send a signal to &man.inetd.8;. The &man.inetd.8; configuration file is /etc/inetd.conf, and &man.inetd.8; will re-read this configuration file when it is sent a SIGHUP. Find the PID of the process to send the signal to using &man.pgrep.1;. In this example, the PID for &man.inetd.8; is 198: &prompt.user; pgrep -l inetd 198 inetd -wW Use &man.kill.1; to send the signal. As &man.inetd.8; is owned by root, use &man.su.1; to become root first. &prompt.user; su Password: &prompt.root; /bin/kill -s HUP 198 Like most &unix; commands, &man.kill.1; will not print any output if it is successful. If a signal is sent to a process not owned by that user, the message kill: PID: Operation not permitted will be displayed. Mistyping the PID will either send the signal to the wrong process, which could have negative results, or will send the signal to a PID that is not currently in use, resulting in the error kill: PID: No such process. Why Use <command>/bin/kill</command>? Many shells provide kill as a built-in command, meaning that the shell will send the signal directly, rather than running /bin/kill. Be aware that different shells have a different syntax for specifying the name of the signal to send. Rather than try to learn all of them, it can be simpler to specify /bin/kill. When sending other signals, substitute TERM or KILL with the name of the signal. Killing a random process on the system is a bad idea. In particular, &man.init.8;, PID 1, is special. Running /bin/kill -s KILL 1 is a quick, and unrecommended, way to shut down the system. Always double-check the arguments to &man.kill.1; before pressing Return. Shells shells command line A shell provides a command line interface for interacting with the operating system. A shell receives commands from the input channel and executes them. Many shells provide built-in functions to help with everyday tasks such as file management, file globbing, command line editing, command macros, and environment variables. &os; comes with several shells, including the Bourne shell (&man.sh.1;) and the extended C shell (&man.tcsh.1;).
Other shells are available from the &os; Ports Collection, such as zsh and bash. The shell that is used is really a matter of taste. A C programmer might feel more comfortable with a C-like shell such as &man.tcsh.1;. A &linux; user might prefer bash. Each shell has unique properties that may or may not work with a user's preferred working environment, which is why there is a choice of which shell to use. One common shell feature is filename completion. After a user types the first few letters of a command or filename and presses Tab, the shell completes the rest of the command or filename. Consider two files called foobar and football. To delete foobar, the user might type rm foo and press Tab to complete the filename. But the shell only shows rm foo. It was unable to complete the filename because both foobar and football start with foo. Some shells sound a beep or show all the choices if more than one name matches. The user must then type more characters to identify the desired filename. Typing a t and pressing Tab again is enough to let the shell determine which filename is desired and fill in the rest. environment variables Another feature of the shell is the use of environment variables. Environment variables are a variable/key pair stored in the shell's environment. This environment can be read by any program invoked by the shell, and thus contains a lot of program configuration. provides a list of common environment variables and their meanings. Note that the names of environment variables are always in uppercase. Common Environment Variables Variable Description USER Current logged in user's name. PATH Colon-separated list of directories to search for binaries. DISPLAY Network name of the &xorg; display to connect to, if available. SHELL The current shell. TERM The name of the user's type of terminal. Used to determine the capabilities of the terminal. TERMCAP Database entry of the terminal escape codes to perform various terminal functions. OSTYPE Type of operating system. MACHTYPE The system's CPU architecture. EDITOR The user's preferred text editor. PAGER The user's preferred utility for viewing text one page at a time. MANPATH Colon-separated list of directories to search for manual pages.
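Before changing any of these variables, it can be helpful to see what is already set. The &man.env.1; utility prints every variable in the current environment along with its value: &prompt.user; env The output depends on the shell and the login session.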
Bourne shells How to set an environment variable differs between shells. In &man.tcsh.1; and &man.csh.1;, use setenv to set environment variables. In &man.sh.1; and bash, use export to set the current environment variables. This example sets the default EDITOR to /usr/local/bin/emacs for the &man.tcsh.1; shell: &prompt.user; setenv EDITOR /usr/local/bin/emacs The equivalent command for bash would be: &prompt.user; export EDITOR="/usr/local/bin/emacs" To expand an environment variable in order to see its current setting, type a $ character in front of its name on the command line. For example, echo $TERM displays the current $TERM setting. Shells treat special characters, known as meta-characters, as special representations of data. The most common meta-character is *, which represents any number of characters in a filename. Meta-characters can be used to perform filename globbing. For example, echo * is equivalent to ls because the shell takes all the files that match * and echo lists them on the command line. To prevent the shell from interpreting a special character, escape it from the shell by starting it with a backslash (\). For example, echo $TERM prints the terminal setting whereas echo \$TERM literally prints the string $TERM. Changing the Shell The easiest way to permanently change the default shell is to use chsh. Running this command will open the editor that is configured in the EDITOR environment variable, which by default is set to &man.vi.1;. Change the Shell: line to the full path of the new shell. Alternately, use chsh -s which will set the specified shell without opening an editor. For example, to change the shell to bash: &prompt.user; chsh -s /usr/local/bin/bash The new shell must be present in /etc/shells. If the shell was installed from the &os; Ports Collection as described in , it should be automatically added to this file. If it is missing, add it using this command, replacing the path with the path of the shell: &prompt.root; echo /usr/local/bin/bash >> /etc/shells Then, rerun &man.chsh.1;. Advanced Shell Techniques Tom Rhodes Written by The &unix; shell is not just a command interpreter, it acts as a powerful tool which allows users to execute commands, redirect their output, redirect their input and chain commands together to improve the final command output. When this functionality is mixed with built in commands, the user is provided with an environment that can maximize efficiency. Shell redirection is the action of sending the output or the input of a command into another command or into a file. To capture the output of the &man.ls.1; command, for example, into a file, redirect the output: &prompt.user; ls > directory_listing.txt The directory contents will now be listed in directory_listing.txt. Some commands can be used to read input, such as &man.sort.1;. To sort this listing, redirect the input: &prompt.user; sort < directory_listing.txt The input will be sorted and placed on the screen. To redirect that input into another file, one could redirect the output of &man.sort.1; by mixing the direction: &prompt.user; sort < directory_listing.txt > sorted.txt In all of the previous examples, the commands are performing redirection using file descriptors. Every &unix; system has file descriptors, which include standard input (stdin), standard output (stdout), and standard error (stderr). Each one has a purpose, where input could be a keyboard or a mouse, something that provides input. Output could be a screen or paper in a printer. 
And error would be anything that is used for diagnostic or error messages. All three are considered I/O based file descriptors and sometimes considered streams. Through the use of these descriptors, the shell allows output and input to be passed around through various commands and redirected to or from a file. Another method of redirection is the pipe operator. The &unix; pipe operator, | allows the output of one command to be directly passed or directed to another program. Basically, a pipe allows the standard output of a command to be passed as standard input to another command, for example: &prompt.user; cat directory_listing.txt | sort | less In that example, the contents of directory_listing.txt will be sorted and the output passed to &man.less.1;. This allows the user to scroll through the output at their own pace and prevent it from scrolling off the screen.
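One stream not shown above is standard error. In Bourne-compatible shells such as &man.sh.1;, file descriptor 2 refers to stderr, and it can be redirected independently of standard output. A small sketch using hypothetical file names: &prompt.user; du /etc > du_output.txt 2> du_errors.txt To merge both streams into a single file, point stderr at the same place as stdout: &prompt.user; du /etc > du_all.txt 2>&1 Note that &man.csh.1; and &man.tcsh.1; use a different syntax (>&) to combine the two streams.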
Text Editors text editors editors Most &os; configuration is done by editing text files, so it is a good idea to become familiar with a text editor. &os; comes with a few as part of the base system, and many more are available in the Ports Collection. ee editors &man.ee.1; A simple editor to learn is &man.ee.1;, which stands for easy editor. To start this editor, type ee filename where filename is the name of the file to be edited. Once inside the editor, all of the commands for manipulating the editor's functions are listed at the top of the display. The caret (^) represents Ctrl, so ^e expands to Ctrl e . To leave &man.ee.1;, press Esc, then choose the leave editor option from the main menu. The editor will prompt to save any changes if the file has been modified. vi editors emacs &os; also comes with more powerful text editors, such as &man.vi.1;, as part of the base system. Other editors, like editors/emacs and editors/vim, are part of the &os; Ports Collection. These editors offer more functionality at the expense of being more complicated to learn. Learning a more powerful editor such as vim or Emacs can save more time in the long run. Many applications which modify files or require typed input will automatically open a text editor. To change the default editor, set the EDITOR environment variable as described in . Devices and Device Nodes A device is a term used mostly for hardware-related activities in a system, including disks, printers, graphics cards, and keyboards. When &os; boots, the majority of the boot messages refer to devices being detected. A copy of the boot messages is saved to /var/run/dmesg.boot. Each device has a device name and number. For example, ada0 is the first SATA hard drive, while kbd0 represents the keyboard. Most devices in &os; must be accessed through special files called device nodes, which are located in /dev. Manual Pages manual pages The most comprehensive documentation on &os; is in the form of manual pages. Nearly every program on the system comes with a short reference manual explaining the basic operation and available arguments. These manuals can be viewed using man: &prompt.user; man command where command is the name of the command to learn about. For example, to learn more about &man.ls.1;, type: &prompt.user; man ls Manual pages are divided into sections which represent the type of topic. In &os;, the following sections are available: User commands. System calls and error numbers. Functions in the C libraries. Device drivers. File formats. Games and other diversions. Miscellaneous information. System maintenance and operation commands. System kernel interfaces. In some cases, the same topic may appear in more than one section of the online manual. For example, there is a chmod user command and a chmod() system call. To tell &man.man.1; which section to display, specify the section number: &prompt.user; man 1 chmod This will display the manual page for the user command &man.chmod.1;. References to a particular section of the online manual are traditionally placed in parentheses in written documentation, so &man.chmod.1; refers to the user command and &man.chmod.2; refers to the system call. If the name of the manual page is unknown, use man -k to search for keywords in the manual page descriptions: &prompt.user; man -k mail This command displays a list of commands that have the keyword mail in their descriptions.
This is equivalent to using &man.apropos.1;. To read the descriptions for all of the commands in /usr/bin, type: &prompt.user; cd /usr/bin &prompt.user; man -f * | more or &prompt.user; cd /usr/bin &prompt.user; whatis * | more GNU Info Files Free Software Foundation &os; includes several applications and utilities produced by the Free Software Foundation (FSF). In addition to manual pages, these programs may include hypertext documents called info files. These can be viewed using &man.info.1; or, if editors/emacs is installed, the info mode of emacs. To use &man.info.1;, type: &prompt.user; info For a brief introduction, type h. For a quick command reference, type ?.
diff --git a/en_US.ISO8859-1/books/handbook/boot/chapter.xml b/en_US.ISO8859-1/books/handbook/boot/chapter.xml index aa0c741acb..2eead109e5 100644 --- a/en_US.ISO8859-1/books/handbook/boot/chapter.xml +++ b/en_US.ISO8859-1/books/handbook/boot/chapter.xml @@ -1,892 +1,892 @@ The &os; Booting Process Synopsis booting bootstrap The process of starting a computer and loading the operating system is referred to as the bootstrap process, or booting. &os;'s boot process provides a great deal of flexibility in customizing what happens when the system starts, including the ability to select from different operating systems installed on the same computer, different versions of the same operating system, or a different installed kernel. This chapter details the configuration options that can be set. It demonstrates how to customize the &os; boot process, including everything that happens until the &os; kernel has started, probed for devices, and started &man.init.8;. This occurs when the text color of the boot messages changes from bright white to grey. After reading this chapter, you will recognize: The components of the &os; bootstrap system and how they interact. The options that can be passed to the components in the &os; bootstrap in order to control the boot process. How to configure a customized boot splash screen. The basics of setting device hints. How to boot into single- and multi-user mode and how to properly shut down a &os; system. This chapter only describes the boot process for &os; running on x86 and amd64 systems. &os; Boot Process Turning on a computer and starting the operating system poses an interesting dilemma. By definition, the computer does not know how to do anything until the operating system is started. This includes running programs from the disk. If the computer can not run a program from the disk without the operating system, and the operating system programs are on the disk, how is the operating system started? This problem parallels one in the book The Adventures of Baron Munchausen. A character had fallen part way down a manhole, and pulled himself out by grabbing his bootstraps and lifting. In the early days of computing, the term bootstrap was applied to the mechanism used to load the operating system. It has since become shortened to booting. BIOS Basic Input/Output SystemBIOS On x86 hardware, the Basic Input/Output System (BIOS) is responsible for loading the operating system. The BIOS looks on the hard disk for the Master Boot Record (MBR), which must be located in a specific place on the disk. The BIOS has enough knowledge to load and run the MBR, and assumes that the MBR can then carry out the rest of the tasks involved in loading the operating system, possibly with the help of the BIOS. &os; provides for booting from both the older MBR standard, and the newer GUID Partition Table (GPT). GPT partitioning is often found on computers with the Unified Extensible Firmware Interface (UEFI). However, &os; can boot from GPT partitions even on machines with only a legacy BIOS with &man.gptboot.8;. Work is under way to provide direct UEFI booting. Master Boot Record (MBR) Boot Manager Boot Loader The code within the MBR is typically referred to as a boot manager, especially when it interacts with the user. The boot manager usually has more code in the first track of the disk or within the file system. Examples of boot managers include the standard &os; boot manager boot0, also called Boot Easy, and Grub, which is used by many &linux; distributions. 
If only one operating system is installed, the MBR searches for the first bootable (active) slice on the disk, and then runs the code on that slice to load the remainder of the operating system. When multiple operating systems are present, a different boot manager can be installed to display a list of operating systems so the user can select one to boot. The remainder of the &os; bootstrap system is divided into three stages. The first stage knows just enough to get the computer into a specific state and run the second stage. The second stage can do a little bit more, before running the third stage. The third stage finishes the task of loading the operating system. The work is split into three stages because the MBR puts limits on the size of the programs that can be run at stages one and two. Chaining the tasks together allows &os; to provide a more flexible loader. kernel &man.init.8; The kernel is then started and begins to probe for devices and initialize them for use. Once the kernel boot process is finished, the kernel passes control to the user process &man.init.8;, which makes sure the disks are in a usable state, starts the user-level resource configuration which mounts file systems, sets up network cards to communicate on the network, and starts the processes which have been configured to run at startup. This section describes these stages in more detail and demonstrates how to interact with the &os; boot process. The Boot Manager Boot Manager Master Boot Record (MBR) The boot manager code in the MBR is sometimes referred to as stage zero of the boot process. By default, &os; uses the boot0 boot manager. The MBR installed by the &os; installer is based on /boot/boot0. The size and capability of boot0 are restricted to 446 bytes due to the slice table and 0x55AA identifier at the end of the MBR. If boot0 and multiple operating systems are installed, a message similar to this example will be displayed at boot time: <filename>boot0</filename> Screenshot F1 Win F2 FreeBSD Default: F2 Other operating systems will overwrite an existing MBR if they are installed after &os;. If this happens, or to replace the existing MBR with the &os; MBR, use the following command: &prompt.root; fdisk -B -b /boot/boot0 device where device is the boot disk, such as ad0 for the first IDE disk, ad2 for the first IDE disk on a second IDE controller, or da0 for the first SCSI disk. To create a custom configuration of the MBR, refer to &man.boot0cfg.8;. Stage One and Stage Two Conceptually, the first and second stages are part of the same program on the same area of the disk. Due to space constraints, they have been split into two, but are always installed together. They are copied from the combined /boot/boot by the &os; installer or bsdlabel. These two stages are located outside file systems, in the first track of the boot slice, starting with the first sector. This is where boot0, or any other boot manager, expects to find a program to run which will continue the boot process. The first stage, boot1, is very simple, since it can only be 512 bytes in size. It knows just enough about the &os; bsdlabel, which stores information about the slice, to find and execute boot2. Stage two, boot2, is slightly more sophisticated, and understands the &os; file system enough to find files. It can provide a simple interface to choose the kernel or loader to run. It runs loader, which is much more sophisticated and provides a boot configuration file.
If the boot process is interrupted at stage two, the following interactive screen is displayed: <filename>boot2</filename> Screenshot >> FreeBSD/i386 BOOT Default: 0:ad(0,a)/boot/loader boot: To replace the installed boot1 and boot2, use bsdlabel, where diskslice is the disk and slice to boot from, such as ad0s1 for the first slice on the first IDE disk: &prompt.root; bsdlabel -B diskslice If just the disk name is used, such as ad0, bsdlabel will create the disk in dangerously dedicated mode, without slices. This is probably not the desired action, so double check the diskslice before pressing Return. Stage Three boot-loader The loader is the final stage of the three-stage bootstrap process. It is located on the file system, usually as /boot/loader. The loader is intended as an interactive method for configuration, using a built-in command set, backed up by a more powerful interpreter which has a more complex command set. During initialization, loader will probe for a console and for disks, and figure out which disk it is booting from. It will set variables accordingly, and an interpreter is started where user commands can be passed from a script or interactively. loader loader configuration The loader will then read /boot/loader.rc, which by default reads in /boot/defaults/loader.conf, which sets reasonable defaults for variables, and then reads /boot/loader.conf for local changes to those variables. loader.rc then acts on these variables, loading whichever modules and kernel are selected. Finally, by default, loader issues a 10-second wait for key presses, and boots the kernel if it is not interrupted. If interrupted, the user is presented with a prompt which understands the command set, where the user may adjust variables, unload all modules, load modules, and then finally boot or reboot. lists the most commonly used loader commands. For a complete discussion of all available commands, refer to &man.loader.8;. Loader Built-In Commands Variable Description autoboot seconds Proceeds to boot the kernel if not interrupted within the time span given, in seconds. It displays a countdown, and the default time span is 10 seconds. boot -options kernelname Immediately proceeds to boot the kernel, with any specified options or kernel name. Providing a kernel name on the command-line is only applicable after an unload has been issued. Otherwise, the previously-loaded kernel will be used. If kernelname is not qualified, it will be searched under /boot/kernel and /boot/modules. boot-conf Goes through the same automatic configuration of modules based on specified variables, most commonly kernel. This only makes sense if unload is used first, before changing some variables. help topic Shows help messages read from /boot/loader.help. If the topic given is index, the list of available topics is displayed. include filename Reads the specified file and interprets it line by line. An error immediately stops the include. load -t type filename Loads the kernel, kernel module, or file of the type given, with the specified filename. Any arguments after filename are passed to the file. If filename is not qualified, it will be searched under /boot/kernel and /boot/modules. ls -l path Displays a listing of files in the given path, or the root directory, if the path is not specified. If -l is specified, file sizes will also be shown. lsdev -v Lists all of the devices from which it may be possible to load modules. If -v is specified, more details are printed. lsmod -v Displays loaded modules.
If -v is specified, more details are shown. more filename Displays the files specified, with a pause at each LINES displayed. reboot Immediately reboots the system. set variable, set variable=value Sets the specified environment variables. unload Removes all loaded modules.
Here are some practical examples of loader usage. To boot the usual kernel in single-user mode single-user mode: boot -s To unload the usual kernel and modules, and then load the previous or another, specified kernel: unload load /path/to/kernelfile Use the qualified /boot/GENERIC/kernel to refer to the default kernel that comes with an installation, or /boot/kernel.old/kernel, to refer to the previously installed kernel before a system upgrade or before configuring a custom kernel. Use the following to load the usual modules with another kernel. Note that in this case the qualified name is not necessary: unload set kernel="mykernel" boot-conf To load an automated kernel configuration script: load -t userconfig_script /boot/kernel.conf kernel boot interaction
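Settings made at the loader prompt last only for the current boot. To make a kernel choice permanent instead, the corresponding variable can be set in /boot/loader.conf. This is a minimal sketch, assuming the previous kernel was preserved in /boot/kernel.old:

kernel="kernel.old"  # boot /boot/kernel.old/kernel on every boot

Remove the line again once the usual kernel has been repaired or replaced.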
Last Stage &man.init.8; Once the kernel is loaded by either loader or by boot2, which bypasses loader, it examines any boot flags and adjusts its behavior as necessary. lists the commonly used boot flags. Refer to &man.boot.8; for more information on the other boot flags. kernel bootflags Kernel Interaction During Boot Option Description -a During kernel initialization, ask for the device to mount as the root file system. -C Boot the root file system from a CDROM. -s Boot into single-user mode. -v Be more verbose during kernel startup.
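Several of these flags have &man.loader.conf.5; counterparts, so a flag that is wanted on every boot does not have to be typed at the loader prompt each time. A minimal sketch, assuming the variable names documented in &man.loader.conf.5;:

boot_single="YES"   # equivalent to the -s boot flag
boot_verbose="YES"  # equivalent to the -v boot flag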
Once the kernel has finished booting, it passes control to the user process &man.init.8;, which is located at /sbin/init, or the program path specified in the init_path variable in loader. This is the last stage of the boot process. The boot sequence makes sure that the file systems available on the system are consistent. If a UFS file system is not consistent and fsck cannot fix the inconsistencies, init drops the system into single-user mode so that the system administrator can resolve the problem directly. Otherwise, the system boots into multi-user mode. Single-User Mode single-user mode console A user can specify this mode by booting with -s or by setting the boot_single variable in loader. It can also be reached by running shutdown now from multi-user mode. Single-user mode begins with this message: Enter full pathname of shell or RETURN for /bin/sh: If the user presses Enter, the system will enter the default Bourne shell. To specify a different shell, input the full path to the shell. Single-user mode is usually used to repair a system that will not boot due to an inconsistent file system or an error in a boot configuration file. It can also be used to reset the root password when it is unknown. These actions are possible as the single-user mode prompt gives full, local access to the system and its configuration files. There is no networking in this mode. While single-user mode is useful for repairing a system, it poses a security risk unless the system is in a physically secure location. By default, any user who can gain physical access to a system will have full control of that system after booting into single-user mode. If the system console is changed to insecure in /etc/ttys, the system will first prompt for the root password before initiating single-user mode. This adds a measure of security while removing the ability to reset the root password when it is unknown. Configuring an Insecure Console in <filename>/etc/ttys</filename> # name getty type status comments # # If console is marked "insecure", then init will ask for the root password # when going to single-user mode. console none unknown off insecure An insecure console means that physical security to the console is considered to be insecure, so only someone who knows the root password may use single-user mode. Multi-User Mode multi-user mode If init finds the file systems to be in order, or once the user has finished their commands in single-user mode and has typed exit to leave single-user mode, the system enters multi-user mode, in which it starts the resource configuration of the system. rc files The resource configuration system reads in configuration defaults from /etc/defaults/rc.conf and system-specific details from /etc/rc.conf. It then proceeds to mount the system file systems listed in /etc/fstab. It starts up networking services, miscellaneous system daemons, then the startup scripts of locally installed packages. To learn more about the resource configuration system, refer to &man.rc.8; and examine the scripts located in /etc/rc.d.
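The order in which those scripts will run can be previewed with &man.rcorder.8;, which only reads the dependency comments in each script and is therefore safe to run on a live system (the second path matters once ports have installed their own startup scripts): &prompt.root; rcorder /etc/rc.d/* /usr/local/etc/rc.d/*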
Configuring Boot Time Splash Screens Joseph J. Barbish Contributed by Typically when a &os; system boots, it displays its progress as a series of messages at the console. A boot splash screen creates an alternate boot screen that hides all of the boot probe and service startup messages. A few boot loader messages, including the boot options menu and a timed wait countdown prompt, are displayed at boot time, even when the splash screen is enabled. The display of the splash screen can be turned off by pressing any key on the keyboard during the boot process. There are two basic environments available in &os;. The first is the default legacy virtual console command line environment. After the system finishes booting, a console login prompt is presented. The second environment is a configured graphical environment. Refer to for more information on how to install and configure a graphical display manager and a graphical login manager. Once the system has booted, the splash screen defaults to being a screen saver. After a time period of non-use, the splash screen will display and cycle through steps of changing image intensity, from bright to very dark and back again. The configuration of the splash screen saver can be overridden by adding a saver= line to /etc/rc.conf. Several built-in screen savers are available and described in &man.splash.4;. The saver= option only applies to virtual consoles and has no effect on graphical display managers. By installing the sysutils/bsd-splash-changer package or port, a random splash image from a collection will display at boot. The splash screen function supports 256-color images in the bitmap (.bmp), ZSoft PCX (.pcx), or TheDraw (.bin) formats. The .bmp, .pcx, or .bin image has to be placed on the root partition, for example in /boot. The splash image files must have a resolution of 320 by 200 pixels or less in order to work on standard VGA adapters. For the default boot display resolution of 256-colors and 320 by 200 pixels or less, add the following lines to /boot/loader.conf. Replace splash.bmp with the name of the bitmap file to use: splash_bmp_load="YES" bitmap_load="YES" bitmap_name="/boot/splash.bmp" To use a PCX file instead of a bitmap file: splash_pcx_load="YES" bitmap_load="YES" bitmap_name="/boot/splash.pcx" To instead use ASCII art in the https://en.wikipedia.org/wiki/TheDraw format: splash_txt="YES" bitmap_load="YES" bitmap_name="/boot/splash.bin" Other interesting loader.conf options include: beastie_disable="YES" This will stop the boot options menu from being displayed, but the timed wait count down prompt will still be present. Even with the display of the boot options menu disabled, entering an option selection at the timed wait count down prompt will enact the corresponding boot option. loader_logo="beastie" This will replace the default words &os;, which are displayed to the right of the boot options menu, with the colored beastie logo. For more information, refer to &man.splash.4;, &man.loader.conf.5;, and &man.vga.4;. Device Hints Tom Rhodes Contributed by device.hints During initial system startup, the boot &man.loader.8; reads &man.device.hints.5;. This file stores kernel boot information known as variables, sometimes referred to as device hints. These device hints are used by device drivers for device configuration. Device hints may also be specified at the Stage 3 boot loader prompt, as demonstrated in . Variables can be added using set, removed with unset, and viewed with show.
Variables set in /boot/device.hints can also be overridden. Device hints entered at the boot loader are not permanent and will not be applied on the next reboot. Once the system is booted, &man.kenv.1; can be used to dump all of the variables. The syntax for /boot/device.hints is one variable per line, using the hash # as comment markers. Lines are constructed as follows: hint.driver.unit.keyword="value" The syntax for the Stage 3 boot loader is: set hint.driver.unit.keyword=value where driver is the device driver name, unit is the device driver unit number, and keyword is the hint keyword. The keyword may consist of the following options: at: specifies the bus which the device is attached to. port: specifies the start address of the I/O to be used. irq: specifies the interrupt request number to be used. drq: specifies the DMA channel number. maddr: specifies the physical memory address occupied by the device. flags: sets various flag bits for the device. disabled: if set to 1 the device is disabled. Since device drivers may accept or require more hints not listed here, viewing a driver's manual page is recommended. For more information, refer to &man.device.hints.5;, &man.kenv.1;, &man.loader.conf.5;, and &man.loader.8;. Shutdown Sequence &man.shutdown.8; Upon controlled shutdown using &man.shutdown.8;, &man.init.8; will attempt to run the script /etc/rc.shutdown, and then proceed to send all processes the TERM signal, and subsequently the KILL signal to any that do not terminate in a timely manner. To power down a &os; machine on architectures and systems that support power management, use shutdown -p now to turn the power off immediately. To reboot a &os; system, use shutdown -r now. One must be root or a member of operator in order to run &man.shutdown.8;. One can also use &man.halt.8; and &man.reboot.8;. Refer to their manual pages and to &man.shutdown.8; for more information. Modify group membership by referring to . Power management requires &man.acpi.4; to be loaded as a module or statically compiled into a custom kernel.
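In practice, a shutdown is often scheduled a few minutes ahead so that logged-in users are warned before the system goes down. For example, to power the machine off in ten minutes while broadcasting a warning (the message text is arbitrary): &prompt.root; shutdown -p +10 "Powering off for hardware maintenance"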
diff --git a/en_US.ISO8859-1/books/handbook/config/chapter.xml b/en_US.ISO8859-1/books/handbook/config/chapter.xml index dd0a3d7bc3..f7f24a65f4 100644 --- a/en_US.ISO8859-1/books/handbook/config/chapter.xml +++ b/en_US.ISO8859-1/books/handbook/config/chapter.xml @@ -1,3486 +1,3486 @@ Configuration and Tuning Chern Lee Written by Mike Smith Based on a tutorial written by Matt Dillon Also based on tuning(7) written by Synopsis system configuration system optimization One of the important aspects of &os; is proper system configuration. This chapter explains much of the &os; configuration process, including some of the parameters which can be set to tune a &os; system. After reading this chapter, you will know: The basics of rc.conf configuration and /usr/local/etc/rc.d startup scripts. How to configure and test a network card. How to configure virtual hosts on network devices. How to use the various configuration files in /etc. How to tune &os; using &man.sysctl.8; variables. How to tune disk performance and modify kernel limitations. Before reading this chapter, you should: Understand &unix; and &os; basics (). Be familiar with the basics of kernel configuration and compilation (). Starting Services Tom Rhodes Contributed by services Many users install third party software on &os; from the Ports Collection and require the installed services to be started upon system initialization. Services, such as mail/postfix or www/apache22 are just two of the many software packages which may be started during system initialization. This section explains the procedures available for starting third party software. In &os;, most included services, such as &man.cron.8;, are started through the system startup scripts. Extended Application Configuration Now that &os; includes rc.d, configuration of application startup is easier and provides more features. Using the key words discussed in , applications can be set to start after certain other services and extra flags can be passed through /etc/rc.conf in place of hard coded flags in the startup script. A basic script may look similar to the following: #!/bin/sh # # PROVIDE: utility # REQUIRE: DAEMON # KEYWORD: shutdown . /etc/rc.subr name=utility rcvar=utility_enable command="/usr/local/sbin/utility" load_rc_config $name # # DO NOT CHANGE THESE DEFAULT VALUES HERE # SET THEM IN THE /etc/rc.conf FILE # utility_enable=${utility_enable-"NO"} pidfile=${utility_pidfile-"/var/run/utility.pid"} run_rc_command "$1" This script will ensure that the provided utility will be started after the DAEMON pseudo-service. It also provides a method for setting and tracking the process ID (PID). This application could then have the following line placed in /etc/rc.conf: utility_enable="YES" This method allows for easier manipulation of command line arguments, inclusion of the default functions provided in /etc/rc.subr, compatibility with &man.rcorder.8;, and provides for easier configuration via rc.conf. Using Services to Start Services Other services can be started using &man.inetd.8;. Working with &man.inetd.8; and its configuration is described in depth in . In some cases, it may make more sense to use &man.cron.8; to start system services. This approach has a number of advantages as &man.cron.8; runs these processes as the owner of the &man.crontab.5;. This allows regular users to start and maintain their own applications. The @reboot feature of &man.cron.8;, may be used in place of the time specification. 
This causes the job to run when &man.cron.8; is started, normally during system initialization. Configuring &man.cron.8; Tom Rhodes Contributed by cron configuration One of the most useful utilities in &os; is cron. This utility runs in the background and regularly checks /etc/crontab for tasks to execute and searches /var/cron/tabs for custom crontab files. These files are used to schedule tasks which cron runs at the specified times. Each entry in a crontab defines a task to run and is known as a cron job. Two different types of configuration files are used: the system crontab, which should not be modified, and user crontabs, which can be created and edited as needed. The format used by these files is documented in &man.crontab.5;. The format of the system crontab, /etc/crontab includes a who column which does not exist in user crontabs. In the system crontab, cron runs the command as the user specified in this column. In a user crontab, all commands run as the user who created the crontab. User crontabs allow individual users to schedule their own tasks. The root user can also have a user crontab which can be used to schedule tasks that do not exist in the system crontab. Here is a sample entry from the system crontab, /etc/crontab: # /etc/crontab - root's crontab for FreeBSD # # $FreeBSD$ # SHELL=/bin/sh PATH=/etc:/bin:/sbin:/usr/bin:/usr/sbin # #minute hour mday month wday who command # */5 * * * * root /usr/libexec/atrun Lines that begin with the # character are comments. A comment can be placed in the file as a reminder of what and why a desired action is performed. Comments cannot be on the same line as a command or else they will be interpreted as part of the command; they must be on a new line. Blank lines are ignored. The equals (=) character is used to define any environment settings. In this example, it is used to define the SHELL and PATH. If the SHELL is omitted, cron will use the default Bourne shell. If the PATH is omitted, the full path must be given to the command or script to run. This line defines the seven fields used in a system crontab: minute, hour, mday, month, wday, who, and command. The minute field is the time in minutes when the specified command will be run, the hour is the hour when the specified command will be run, the mday is the day of the month, month is the month, and wday is the day of the week. These fields must be numeric values, representing the twenty-four hour clock, or a *, representing all values for that field. The who field only exists in the system crontab and specifies which user the command should be run as. The last field is the command to be executed. This entry defines the values for this cron job. The */5, followed by several more * characters, specifies that /usr/libexec/atrun is invoked by root every five minutes of every hour, of every day and day of the week, of every month. Commands can include any number of switches. However, commands which extend to multiple lines need to be broken with the backslash \ continuation character. Creating a User Crontab To create a user crontab, invoke crontab in editor mode: &prompt.user; crontab -e This will open the user's crontab using the default text editor. The first time a user runs this command, it will open an empty file. Once a user creates a crontab, this command will open that file for editing. 
It is useful to add these lines to the top of the crontab file in order to set the environment variables and to remember the meanings of the fields in the crontab: SHELL=/bin/sh PATH=/etc:/bin:/sbin:/usr/bin:/usr/sbin # Order of crontab fields # minute hour mday month wday command Then add a line for each command or script to run, specifying the time to run the command. This example runs the specified custom Bourne shell script every day at two in the afternoon. Since the path to the script is not specified in PATH, the full path to the script is given: 0 14 * * * /usr/home/dru/bin/mycustomscript.sh Before using a custom script, make sure it is executable and test it with the limited set of environment variables set by cron. To replicate the environment that would be used to run the above cron entry, use: env -i SHELL=/bin/sh PATH=/etc:/bin:/sbin:/usr/bin:/usr/sbin HOME=/home/dru LOGNAME=dru /usr/home/dru/bin/mycustomscript.sh The environment set by cron is discussed in &man.crontab.5;. Checking that scripts operate correctly in a cron environment is especially important if they include any commands that delete files using wildcards. When finished editing the crontab, save the file. It will automatically be installed and cron will read the crontab and run its cron jobs at their specified times. To list the cron jobs in a crontab, use this command: &prompt.user; crontab -l 0 14 * * * /usr/home/dru/bin/mycustomscript.sh To remove all of the cron jobs in a user crontab: &prompt.user; crontab -r remove crontab for dru? y Managing Services in &os; Tom Rhodes Contributed by &os; uses the &man.rc.8; system of startup scripts during system initialization and for managing services. The scripts listed in /etc/rc.d provide basic services which can be controlled with the start, stop, and restart options to &man.service.8;. For instance, &man.sshd.8; can be restarted with the following command: &prompt.root; service sshd restart This procedure can be used to start services on a running system. Services will be started automatically at boot time as specified in &man.rc.conf.5;. For example, to enable &man.natd.8; at system startup, add the following line to /etc/rc.conf: natd_enable="YES" If a line is already present, change the NO to YES. The &man.rc.8; scripts will automatically load any dependent services during the next boot, as described below. Since the &man.rc.8; system is primarily intended to start and stop services at system startup and shutdown time, the start, stop, and restart options will only perform their action if the appropriate /etc/rc.conf variable is set. For instance, sshd restart will only work if sshd_enable is set to YES in /etc/rc.conf. To start, stop, or restart a service regardless of the settings in /etc/rc.conf, these commands should be prefixed with one. For instance, to restart &man.sshd.8; regardless of the current /etc/rc.conf setting, execute the following command: &prompt.root; service sshd onerestart To check if a service is enabled in /etc/rc.conf, run the appropriate &man.rc.8; script with rcvar. This example checks to see if &man.sshd.8; is enabled in /etc/rc.conf: &prompt.root; service sshd rcvar # sshd # sshd_enable="YES" # (default: "") The # sshd line is output from the above command, not a root console. To determine whether or not a service is running, use status. For instance, to verify that &man.sshd.8; is running: &prompt.root; service sshd status sshd is running as pid 433. In some cases, it is also possible to reload a service.
This attempts to send a signal to an individual service, forcing the service to reload its configuration files. In most cases, this means sending the service a SIGHUP signal. Support for this feature is not included for every service. The &man.rc.8; system is used for network services and it also contributes to most of the system initialization. For instance, when the /etc/rc.d/bgfsck script is executed, it prints out the following message: Starting background file system checks in 60 seconds. This script is used for background file system checks, which occur only during system initialization. Many system services depend on other services to function properly. For example, &man.yp.8; and other RPC-based services may fail to start until after the &man.rpcbind.8; service has started. To resolve this issue, information about dependencies and other meta-data is included in the comments at the top of each startup script. The &man.rcorder.8; program is used to parse these comments during system initialization to determine the order in which system services should be invoked to satisfy the dependencies. The following key word must be included in all startup scripts as it is required by &man.rc.subr.8; to enable the startup script: PROVIDE: Specifies the services this file provides. The following key words may be included at the top of each startup script. They are not strictly necessary, but are useful as hints to &man.rcorder.8;: REQUIRE: Lists services which are required for this service. The script containing this key word will run after the specified services. BEFORE: Lists services which depend on this service. The script containing this key word will run before the specified services. By carefully setting these keywords for each startup script, an administrator has a fine-grained level of control of the startup order of the scripts, without the need for runlevels used by some &unix; operating systems. Additional information can be found in &man.rc.8; and &man.rc.subr.8;. Refer to this article for instructions on how to create custom &man.rc.8; scripts. Managing System-Specific Configuration rc files rc.conf The principal location for system configuration information is /etc/rc.conf. This file contains a wide range of configuration information and it is read at system startup to configure the system. It provides the configuration information for the rc* files. The entries in /etc/rc.conf override the default settings in /etc/defaults/rc.conf. The file containing the default settings should not be edited. Instead, all system-specific changes should be made to /etc/rc.conf. A number of strategies may be applied in clustered applications to separate site-wide configuration from system-specific configuration in order to reduce administration overhead. The recommended approach is to place system-specific configuration into /etc/rc.conf.local. For example, these entries in /etc/rc.conf apply to all systems: sshd_enable="YES" keyrate="fast" defaultrouter="10.1.1.254" Whereas these entries in /etc/rc.conf.local apply to this system only: hostname="node1.example.org" ifconfig_fxp0="inet 10.1.1.1/8" Distribute /etc/rc.conf to every system using an application such as rsync or puppet, while /etc/rc.conf.local remains unique. Upgrading the system will not overwrite /etc/rc.conf, so system configuration information will not be lost. Both /etc/rc.conf and /etc/rc.conf.local are parsed by &man.sh.1;. This allows system operators to create complex configuration scenarios. 
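Because these files are interpreted by &man.sh.1;, ordinary shell constructs such as variables can be used in them. A minimal sketch (the leading-underscore name is just a local shell variable rather than an rc.conf knob, and the addresses are purely illustrative):

# Define one value and reuse it in several rc.conf settings
_gw="10.1.1.254"
defaultrouter="${_gw}"
static_routes="backup"
route_backup="-net 192.168.2.0/24 ${_gw}"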
Refer to &man.rc.conf.5; for further information on this topic. Setting Up Network Interface Cards Marc Fonvieille Contributed by network cards configuration Adding and configuring a network interface card (NIC) is a common task for any &os; administrator. Locating the Correct Driver network cards driver First, determine the model of the NIC and the chip it uses. &os; supports a wide variety of NICs. Check the Hardware Compatibility List for the &os; release to see if the NIC is supported. If the NIC is supported, determine the name of the &os; driver for the NIC. Refer to /usr/src/sys/conf/NOTES and /usr/src/sys/arch/conf/NOTES for the list of NIC drivers with some information about the supported chipsets. When in doubt, read the manual page of the driver as it will provide more information about the supported hardware and any known limitations of the driver. The drivers for common NICs are already present in the GENERIC kernel, meaning the NIC should be probed during boot. The system's boot messages can be viewed by typing more /var/run/dmesg.boot and using the spacebar to scroll through the text. In this example, two Ethernet NICs using the &man.dc.4; driver are present on the system: dc0: <82c169 PNIC 10/100BaseTX> port 0xa000-0xa0ff mem 0xd3800000-0xd38000ff irq 15 at device 11.0 on pci0 miibus0: <MII bus> on dc0 bmtphy0: <BCM5201 10/100baseTX PHY> PHY 1 on miibus0 bmtphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto dc0: Ethernet address: 00:a0:cc:da:da:da dc0: [ITHREAD] dc1: <82c169 PNIC 10/100BaseTX> port 0x9800-0x98ff mem 0xd3000000-0xd30000ff irq 11 at device 12.0 on pci0 miibus1: <MII bus> on dc1 bmtphy1: <BCM5201 10/100baseTX PHY> PHY 1 on miibus1 bmtphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto dc1: Ethernet address: 00:a0:cc:da:da:db dc1: [ITHREAD] If the driver for the NIC is not present in GENERIC, but a driver is available, the driver will need to be loaded before the NIC can be configured and used. This may be accomplished in one of two ways: The easiest way is to load a kernel module for the NIC using &man.kldload.8;. To also automatically load the driver at boot time, add the appropriate line to /boot/loader.conf. Not all NIC drivers are available as modules. Alternatively, statically compile support for the NIC into a custom kernel. Refer to /usr/src/sys/conf/NOTES, /usr/src/sys/arch/conf/NOTES and the manual page of the driver to determine which line to add to the custom kernel configuration file. For more information about recompiling the kernel, refer to . If the NIC was detected at boot, the kernel does not need to be recompiled. Using &windows; <acronym>NDIS</acronym> Drivers NDIS NDISulator &windows; drivers µsoft.windows; device drivers KLD (kernel loadable object) Unfortunately, there are still many vendors that do not provide schematics for their drivers to the open source community because they regard such information as trade secrets. Consequently, the developers of &os; and other operating systems are left with two choices: develop the drivers by a long and painstaking process of reverse engineering, or use the existing driver binaries available for µsoft.windows; platforms. &os; provides native support for the Network Driver Interface Specification (NDIS). It includes &man.ndisgen.8; which can be used to convert a &windowsxp; driver into a format that can be used on &os;. As the &man.ndis.4; driver uses a &windowsxp; binary, it only runs on &i386; and amd64 systems.
PCI, CardBus, PCMCIA, and USB devices are supported. To use &man.ndisgen.8;, three things are needed: &os; kernel sources. A &windowsxp; driver binary with a .SYS extension. A &windowsxp; driver configuration file with a .INF extension. Download the .SYS and .INF files for the specific NIC. Generally, these can be found on the driver CD or at the vendor's website. The following examples use W32DRIVER.SYS and W32DRIVER.INF. The driver bit width must match the version of &os;. For &os;/i386, use a &windows; 32-bit driver. For &os;/amd64, a &windows; 64-bit driver is needed. The next step is to compile the driver binary into a loadable kernel module. As root, use &man.ndisgen.8;: &prompt.root; ndisgen /path/to/W32DRIVER.INF /path/to/W32DRIVER.SYS This command is interactive and prompts for any extra information it requires. A new kernel module will be generated in the current directory. Use &man.kldload.8; to load the new module: &prompt.root; kldload ./W32DRIVER_SYS.ko In addition to the generated kernel module, the ndis.ko and if_ndis.ko modules must be loaded. This should happen automatically when any module that depends on &man.ndis.4; is loaded. If not, load them manually, using the following commands: &prompt.root; kldload ndis &prompt.root; kldload if_ndis The first command loads the &man.ndis.4; miniport driver wrapper and the second loads the generated NIC driver. Check &man.dmesg.8; to see if there were any load errors. If all went well, the output should be similar to the following: ndis0: <Wireless-G PCI Adapter> mem 0xf4100000-0xf4101fff irq 3 at device 8.0 on pci1 ndis0: NDIS API version: 5.0 ndis0: Ethernet address: 0a:b1:2c:d3:4e:f5 ndis0: 11b rates: 1Mbps 2Mbps 5.5Mbps 11Mbps ndis0: 11g rates: 6Mbps 9Mbps 12Mbps 18Mbps 36Mbps 48Mbps 54Mbps From here, ndis0 can be configured like any other NIC. To configure the system to load the &man.ndis.4; modules at boot time, copy the generated module, W32DRIVER_SYS.ko, to /boot/modules. Then, add the following line to /boot/loader.conf: W32DRIVER_SYS_load="YES" Configuring the Network Card network cards configuration Once the right driver is loaded for the NIC, the card needs to be configured. It may have been configured at installation time by &man.bsdinstall.8;. To display the NIC configuration, enter the following command: &prompt.user; ifconfig dc0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=80008<VLAN_MTU,LINKSTATE> ether 00:a0:cc:da:da:da inet 192.168.1.3 netmask 0xffffff00 broadcast 192.168.1.255 media: Ethernet autoselect (100baseTX <full-duplex>) status: active dc1: flags=8802<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=80008<VLAN_MTU,LINKSTATE> ether 00:a0:cc:da:da:db inet 10.0.0.1 netmask 0xffffff00 broadcast 10.0.0.255 media: Ethernet 10baseT/UTP status: no carrier lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384 options=3<RXCSUM,TXCSUM> inet6 fe80::1%lo0 prefixlen 64 scopeid 0x4 inet6 ::1 prefixlen 128 inet 127.0.0.1 netmask 0xff000000 nd6 options=3<PERFORMNUD,ACCEPT_RTADV> In this example, the following devices were displayed: dc0: The first Ethernet interface. dc1: The second Ethernet interface. lo0: The loopback device. &os; uses the driver name followed by the order in which the card is detected at boot to name the NIC. For example, sis2 is the third NIC on the system using the &man.sis.4; driver. In this example, dc0 is up and running. The key indicators are: UP means that the card is configured and ready. 
The card has an Internet (inet) address, 192.168.1.3. It has a valid subnet mask (netmask), where 0xffffff00 is the same as 255.255.255.0. It has a valid broadcast address, 192.168.1.255. The MAC address of the card (ether) is 00:a0:cc:da:da:da. The physical media selection is on autoselection mode (media: Ethernet autoselect (100baseTX <full-duplex>)). In this example, dc1 is configured to run with 10baseT/UTP media. For more information on available media types for a driver, refer to its manual page. The status of the link (status) is active, indicating that the carrier signal is detected. For dc1, the status: no carrier status is normal when an Ethernet cable is not plugged into the card. If the &man.ifconfig.8; output had shown something similar to: dc0: flags=8843<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=80008<VLAN_MTU,LINKSTATE> ether 00:a0:cc:da:da:da media: Ethernet autoselect (100baseTX <full-duplex>) status: active it would indicate the card has not been configured. The card must be configured as root. The NIC configuration can be performed from the command line with &man.ifconfig.8; but will not persist after a reboot unless the configuration is also added to /etc/rc.conf. If a DHCP server is present on the LAN, just add this line: ifconfig_dc0="DHCP" Replace dc0 with the correct value for the system. Once the line is added, follow the instructions given in . If the network was configured during installation, some entries for the NIC(s) may already be present. Double check /etc/rc.conf before adding any lines. If there is no DHCP server, the NIC(s) must be configured manually. Add a line for each NIC present on the system, as seen in this example: ifconfig_dc0="inet 192.168.1.3 netmask 255.255.255.0" ifconfig_dc1="inet 10.0.0.1 netmask 255.255.255.0 media 10baseT/UTP" Replace dc0 and dc1 and the IP address information with the correct values for the system. Refer to the man page for the driver, &man.ifconfig.8;, and &man.rc.conf.5; for more details about the allowed options and the syntax of /etc/rc.conf. If the network is not using DNS, edit /etc/hosts to add the names and IP addresses of the hosts on the LAN, if they are not already there. For more information, refer to &man.hosts.5; and to /usr/share/examples/etc/hosts. If there is no DHCP server and access to the Internet is needed, manually configure the default gateway and the nameserver: &prompt.root; echo 'defaultrouter="your_default_router"' >> /etc/rc.conf &prompt.root; echo 'nameserver your_DNS_server' >> /etc/resolv.conf Testing and Troubleshooting Once the necessary changes to /etc/rc.conf are saved, a reboot can be used to test the network configuration and to verify that the system restarts without any configuration errors. Alternatively, apply the settings to the networking system with this command: &prompt.root; service netif restart If a default gateway has been set in /etc/rc.conf, also issue this command: &prompt.root; service routing restart Once the networking system has been relaunched, test the NICs.
Testing the Ethernet Card network cards testing To verify that an Ethernet card is configured correctly, &man.ping.8; the interface itself, and then &man.ping.8; another machine on the LAN: &prompt.user; ping -c5 192.168.1.3 PING 192.168.1.3 (192.168.1.3): 56 data bytes 64 bytes from 192.168.1.3: icmp_seq=0 ttl=64 time=0.082 ms 64 bytes from 192.168.1.3: icmp_seq=1 ttl=64 time=0.074 ms 64 bytes from 192.168.1.3: icmp_seq=2 ttl=64 time=0.076 ms 64 bytes from 192.168.1.3: icmp_seq=3 ttl=64 time=0.108 ms 64 bytes from 192.168.1.3: icmp_seq=4 ttl=64 time=0.076 ms --- 192.168.1.3 ping statistics --- 5 packets transmitted, 5 packets received, 0% packet loss round-trip min/avg/max/stddev = 0.074/0.083/0.108/0.013 ms &prompt.user; ping -c5 192.168.1.2 PING 192.168.1.2 (192.168.1.2): 56 data bytes 64 bytes from 192.168.1.2: icmp_seq=0 ttl=64 time=0.726 ms 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.766 ms 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.700 ms 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.747 ms 64 bytes from 192.168.1.2: icmp_seq=4 ttl=64 time=0.704 ms --- 192.168.1.2 ping statistics --- 5 packets transmitted, 5 packets received, 0% packet loss round-trip min/avg/max/stddev = 0.700/0.729/0.766/0.025 ms To test network resolution, use the host name instead of the IP address. If there is no DNS server on the network, /etc/hosts must first be configured. For this purpose, edit /etc/hosts to add the names and IP addresses of the hosts on the LAN, if they are not already there. For more information, refer to &man.hosts.5; and to /usr/share/examples/etc/hosts. Troubleshooting network cards troubleshooting When troubleshooting hardware and software configurations, check the simple things first. Is the network cable plugged in? Are the network services properly configured? Is the firewall configured correctly? Is the NIC supported by &os;? Before sending a bug report, always check the Hardware Notes, update the version of &os; to the latest STABLE version, check the mailing list archives, and search the Internet. If the card works, yet performance is poor, read through &man.tuning.7;. Also, check the network configuration as incorrect network settings can cause slow connections. Some users experience one or two device timeout messages, which is normal for some cards. If they continue, or are bothersome, determine if the device is conflicting with another device. Double check the cable connections. Consider trying another card. To resolve watchdog timeout errors, first check the network cable. Many cards require a PCI slot which supports bus mastering. On some old motherboards, only one PCI slot allows it, usually slot 0. Check the NIC and the motherboard documentation to determine if that may be the problem. No route to host messages occur if the system is unable to route a packet to the destination host. This can happen if no default route is specified or if a cable is unplugged. Check the output of netstat -rn and make sure there is a valid route to the host. If there is not, read . ping: sendto: Permission denied error messages are often caused by a misconfigured firewall. If a firewall is enabled on &os; but no rules have been defined, the default policy is to deny all traffic, even &man.ping.8;. Refer to for more information. Sometimes performance of the card is poor or below average. In these cases, try setting the media selection mode from autoselect to the correct media selection. While this works for most hardware, it may or may not resolve the issue.
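As a sketch of that last suggestion, the media can be forced at run time and, if that helps, made permanent in /etc/rc.conf (the interface name and media values are illustrative; the driver's manual page lists what the hardware actually supports):

&prompt.root; ifconfig dc0 media 100baseTX mediaopt full-duplex

ifconfig_dc0="inet 192.168.1.3 netmask 255.255.255.0 media 100baseTX mediaopt full-duplex"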
Again, check all the network settings, and refer to &man.tuning.7;. Virtual Hosts virtual hosts IP aliases A common use of &os; is virtual site hosting, where one server appears to the network as many servers. This is achieved by assigning multiple network addresses to a single interface. A given network interface has one real address, and may have any number of alias addresses. These aliases are normally added by placing alias entries in /etc/rc.conf, as seen in this example: ifconfig_fxp0_alias0="inet xxx.xxx.xxx.xxx netmask xxx.xxx.xxx.xxx" Alias entries must start with alias0 using a sequential number such as alias0, alias1, and so on. The configuration process will stop at the first missing number. The calculation of alias netmasks is important. For a given interface, there must be one address which correctly represents the network's netmask. Any other addresses which fall within this network must have a netmask of all 1s, expressed as either 255.255.255.255 or 0xffffffff. For example, consider the case where the fxp0 interface is connected to two networks: 10.1.1.0 with a netmask of 255.255.255.0 and 202.0.75.16 with a netmask of 255.255.255.240. The system is to be configured to appear in the ranges 10.1.1.1 through 10.1.1.5 and 202.0.75.17 through 202.0.75.20. Only the first address in a given network range should have a real netmask. All the rest (10.1.1.2 through 10.1.1.5 and 202.0.75.18 through 202.0.75.20) must be configured with a netmask of 255.255.255.255. The following /etc/rc.conf entries configure the adapter correctly for this scenario: ifconfig_fxp0="inet 10.1.1.1 netmask 255.255.255.0" ifconfig_fxp0_alias0="inet 10.1.1.2 netmask 255.255.255.255" ifconfig_fxp0_alias1="inet 10.1.1.3 netmask 255.255.255.255" ifconfig_fxp0_alias2="inet 10.1.1.4 netmask 255.255.255.255" ifconfig_fxp0_alias3="inet 10.1.1.5 netmask 255.255.255.255" ifconfig_fxp0_alias4="inet 202.0.75.17 netmask 255.255.255.240" ifconfig_fxp0_alias5="inet 202.0.75.18 netmask 255.255.255.255" ifconfig_fxp0_alias6="inet 202.0.75.19 netmask 255.255.255.255" ifconfig_fxp0_alias7="inet 202.0.75.20 netmask 255.255.255.255" A simpler way to express this is with a space-separated list of IP address ranges. The first address will be given the indicated subnet mask and the additional addresses will have a subnet mask of 255.255.255.255. ifconfig_fxp0_aliases="inet 10.1.1.1-5/24 inet 202.0.75.17-20/28" Configuring System Logging Niclas Zeising Contributed by system logging syslog &man.syslogd.8; Generating and reading system logs is an important aspect of system administration. The information in system logs can be used to detect hardware and software issues as well as application and system configuration errors. This information also plays an important role in security auditing and incident response. Most system daemons and applications will generate log entries. &os; provides a system logger, syslogd, to manage logging. By default, syslogd is started when the system boots. This is controlled by the variable syslogd_enable in /etc/rc.conf. There are numerous application arguments that can be set using syslogd_flags in /etc/rc.conf. Refer to &man.syslogd.8; for more information on the available arguments. This section describes how to configure the &os; system logger for both local and remote logging and how to perform log rotation and log management. Configuring Local Logging syslog.conf The configuration file, /etc/syslog.conf, controls what syslogd does with log entries as they are received. 
There are several parameters to control the handling of incoming events. The facility describes which subsystem generated the message, such as the kernel or a daemon, and the level describes the severity of the event that occurred. This makes it possible to configure if and where a log message is logged, depending on the facility and level. It is also possible to take action depending on the application that sent the message, and in the case of remote logging, the hostname of the machine generating the logging event. This configuration file contains one line per action, where the syntax for each line is a selector field followed by an action field. The syntax of the selector field is facility.level which will match log messages from facility at level level or higher. It is also possible to add an optional comparison flag before the level to specify more precisely what is logged. Multiple selector fields can be used for the same action, and are separated with a semicolon (;). Using * will match everything. The action field denotes where to send the log message, such as to a file or remote log host. As an example, here is the default syslog.conf from &os;: # $&os;$ # # Spaces ARE valid field separators in this file. However, # other *nix-like systems still insist on using tabs as field # separators. If you are sharing this file between systems, you # may want to use only tabs as field separators here. # Consult the syslog.conf(5) manpage. *.err;kern.warning;auth.notice;mail.crit /dev/console *.notice;authpriv.none;kern.debug;lpr.info;mail.crit;news.err /var/log/messages security.* /var/log/security auth.info;authpriv.info /var/log/auth.log mail.info /var/log/maillog lpr.info /var/log/lpd-errs ftp.info /var/log/xferlog cron.* /var/log/cron !-devd *.=debug /var/log/debug.log *.emerg * # uncomment this to log all writes to /dev/console to /var/log/console.log #console.info /var/log/console.log # uncomment this to enable logging of all log messages to /var/log/all.log # touch /var/log/all.log and chmod it to mode 600 before it will work #*.* /var/log/all.log # uncomment this to enable logging to a remote loghost named loghost #*.* @loghost # uncomment these if you're running inn # news.crit /var/log/news/news.crit # news.err /var/log/news/news.err # news.notice /var/log/news/news.notice # Uncomment this if you wish to see messages produced by devd # !devd # *.>=info !ppp *.* /var/log/ppp.log !* In this example: Line 8 matches all messages with a level of err or higher, as well as kern.warning, auth.notice and mail.crit, and sends these log messages to the console (/dev/console). Line 12 matches all messages from the mail facility at level info or above and logs the messages to /var/log/maillog. Line 17 uses a comparison flag (=) to only match messages at level debug and logs them to /var/log/debug.log. Line 33 is an example usage of a program specification. This makes the rules following it only valid for the specified program. In this case, only the messages generated by ppp are logged to /var/log/ppp.log. The available levels, in order from most to least critical are emerg, alert, crit, err, warning, notice, info, and debug. The facilities, in no particular order, are auth, authpriv, console, cron, daemon, ftp, kern, lpr, mail, mark, news, security, syslog, user, uucp, and local0 through local7. Be aware that other operating systems might have different facilities. 
To log everything of level notice and higher to /var/log/daemon.log, add the following entry: daemon.notice /var/log/daemon.log For more information about the different levels and facilities, refer to &man.syslog.3; and &man.syslogd.8;. For more information about /etc/syslog.conf, its syntax, and more advanced usage examples, see &man.syslog.conf.5;. Log Management and Rotation newsyslog newsyslog.conf log rotation log management Log files can grow quickly, taking up disk space and making it more difficult to locate useful information. Log management attempts to mitigate this. In &os;, newsyslog is used to manage log files. This built-in program periodically rotates and compresses log files, and optionally creates missing log files and signals programs when log files are moved. The log files may be generated by syslogd or by any other program which generates log files. While newsyslog is normally run from &man.cron.8;, it is not a system daemon. In the default configuration, it runs every hour. To know which actions to take, newsyslog reads its configuration file, /etc/newsyslog.conf. This file contains one line for each log file that newsyslog manages. Each line states the file owner, permissions, when to rotate that file, optional flags that affect log rotation, such as compression, and programs to signal when the log is rotated. Here is the default configuration in &os;: # configuration file for newsyslog # $FreeBSD$ # # Entries which do not specify the '/pid_file' field will cause the # syslogd process to be signalled when that log file is rotated. This # action is only appropriate for log files which are written to by the # syslogd process (ie, files listed in /etc/syslog.conf). If there # is no process which needs to be signalled when a given log file is # rotated, then the entry for that file should include the 'N' flag. # # The 'flags' field is one or more of the letters: BCDGJNUXZ or a '-'. # # Note: some sites will want to select more restrictive protections than the # defaults. In particular, it may be desirable to switch many of the 644 # entries to 640 or 600. For example, some sites will consider the # contents of maillog, messages, and lpd-errs to be confidential. In the # future, these defaults may change to more conservative ones. # # logfilename [owner:group] mode count size when flags [/pid_file] [sig_num] /var/log/all.log 600 7 * @T00 J /var/log/amd.log 644 7 100 * J /var/log/auth.log 600 7 100 @0101T JC /var/log/console.log 600 5 100 * J /var/log/cron 600 3 100 * JC /var/log/daily.log 640 7 * @T00 JN /var/log/debug.log 600 7 100 * JC /var/log/kerberos.log 600 7 100 * J /var/log/lpd-errs 644 7 100 * JC /var/log/maillog 640 7 * @T00 JC /var/log/messages 644 5 100 @0101T JC /var/log/monthly.log 640 12 * $M1D0 JN /var/log/pflog 600 3 100 * JB /var/run/pflogd.pid /var/log/ppp.log root:network 640 3 100 * JC /var/log/devd.log 644 3 100 * JC /var/log/security 600 10 100 * JC /var/log/sendmail.st 640 10 * 168 B /var/log/utx.log 644 3 * @01T05 B /var/log/weekly.log 640 5 1 $W6D0 JN /var/log/xferlog 600 7 100 * JC Each line starts with the name of the log to be rotated, optionally followed by an owner and group for both rotated and newly created files. The mode field sets the permissions on the log file and count denotes how many rotated log files should be kept. The size and when fields tell newsyslog when to rotate the file. A log file is rotated when either its size is larger than the size field or when the time in the when field has passed. 
An asterisk (*) means that this field is ignored. The flags field gives further instructions, such as how to compress the rotated file or to create the log file if it is missing. The last two fields are optional and specify the name of the Process ID (PID) file of a process and a signal number to send to that process when the file is rotated. For more information on all fields, valid flags, and how to specify the rotation time, refer to &man.newsyslog.conf.5;. Since newsyslog is run from &man.cron.8;, it cannot rotate files more often than it is scheduled to run. Configuring Remote Logging Tom Rhodes Contributed by Monitoring the log files of multiple hosts can become unwieldy as the number of systems increases. Configuring centralized logging can reduce some of the administrative burden of log file administration. In &os;, centralized log file aggregation, merging, and rotation can be configured using syslogd and newsyslog. This section demonstrates an example configuration, where host A, named logserv.example.com, will collect logging information for the local network. Host B, named logclient.example.com, will be configured to pass logging information to the logging server. Log Server Configuration A log server is a system that has been configured to accept logging information from other hosts. Before configuring a log server, check the following: If there is a firewall between the logging server and any logging clients, ensure that the firewall ruleset allows UDP port 514 for both the clients and the server. The logging server and all client machines must have forward and reverse entries in the local DNS. If the network does not have a DNS server, create entries in each system's /etc/hosts. Proper name resolution is required so that log entries are not rejected by the logging server. On the log server, edit /etc/syslog.conf to specify the name of the client to receive log entries from, the logging facility to be used, and the name of the log to store the host's log entries. This example adds the hostname of B, logs all facilities, and stores the log entries in /var/log/logclient.log. Sample Log Server Configuration +logclient.example.com *.* /var/log/logclient.log When adding multiple log clients, add a similar two-line entry for each client. More information about the available facilities may be found in &man.syslog.conf.5;. Next, configure /etc/rc.conf: syslogd_enable="YES" syslogd_flags="-a logclient.example.com -v -v" The first entry starts syslogd at system boot. The second entry allows log entries from the specified client. The -v -v increases the verbosity of logged messages. This is useful for tweaking facilities as administrators are able to see what type of messages are being logged under each facility. Multiple -a options may be specified to allow logging from multiple clients. IP addresses and whole netblocks may also be specified. Refer to &man.syslogd.8; for a full list of possible options. Finally, create the log file: &prompt.root; touch /var/log/logclient.log At this point, syslogd should be restarted and verified: &prompt.root; service syslogd restart &prompt.root; pgrep syslog If a PID is returned, the server restarted successfully, and client configuration can begin. If the server did not restart, consult /var/log/messages for the error. Log Client Configuration A logging client sends log entries to a logging server on the network. The client also keeps a local copy of its own logs.
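If the example network has no DNS server, the required name resolution can instead be provided by adding entries to /etc/hosts on both machines. The addresses below are assumptions for illustration only; substitute the real addresses of the hosts: 192.168.1.5 logserv.example.com logserv 192.168.1.10 logclient.example.com logclient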
Once a logging server has been configured, edit /etc/rc.conf on the logging client: syslogd_enable="YES" syslogd_flags="-s -v -v" The first entry enables syslogd on boot up. The second entry prevents this client from accepting logs from other hosts (-s) and increases the verbosity of logged messages. Next, define the logging server in the client's /etc/syslog.conf. In this example, all logged facilities are sent to a remote system, denoted by the @ symbol, with the specified hostname: *.* @logserv.example.com After saving the edit, restart syslogd for the changes to take effect: &prompt.root; service syslogd restart To test that log messages are being sent across the network, use &man.logger.1; on the client to send a message to syslogd: &prompt.root; logger "Test message from logclient" This message should now exist both in /var/log/messages on the client and /var/log/logclient.log on the log server. Debugging Log Servers If no messages are being received on the log server, the cause is most likely a network connectivity issue, a hostname resolution issue, or a typo in a configuration file. To isolate the cause, ensure that both the logging server and the logging client are able to ping each other using the hostname specified in their /etc/rc.conf. If this fails, check the network cabling, the firewall ruleset, and the hostname entries in the DNS server or /etc/hosts on both the logging server and clients. Repeat until the ping is successful from both hosts. If the ping succeeds on both hosts but log messages are still not being received, temporarily increase logging verbosity to narrow down the configuration issue. In the following example, /var/log/logclient.log on the logging server is empty and /var/log/messages on the logging client does not indicate a reason for the failure. To increase debugging output, edit the syslogd_flags entry on the logging server and issue a restart: syslogd_flags="-d -a logclient.example.com -v -v" &prompt.root; service syslogd restart Debugging data similar to the following will flash on the console immediately after the restart: logmsg: pri 56, flags 4, from logserv.example.com, msg syslogd: restart syslogd: restarted logmsg: pri 6, flags 4, from logserv.example.com, msg syslogd: kernel boot file is /boot/kernel/kernel Logging to FILE /var/log/messages syslogd: kernel boot file is /boot/kernel/kernel cvthname(192.168.1.10) validate: dgram from IP 192.168.1.10, port 514, name logclient.example.com; rejected in rule 0 due to name mismatch. In this example, the log messages are being rejected due to a typo which results in a hostname mismatch. The client's hostname should be logclient, not logclien. Fix the typo, issue a restart, and verify the results: &prompt.root; service syslogd restart logmsg: pri 56, flags 4, from logserv.example.com, msg syslogd: restart syslogd: restarted logmsg: pri 6, flags 4, from logserv.example.com, msg syslogd: kernel boot file is /boot/kernel/kernel syslogd: kernel boot file is /boot/kernel/kernel logmsg: pri 166, flags 17, from logserv.example.com, msg Dec 10 20:55:02 <syslog.err> logserv.example.com syslogd: exiting on signal 2 cvthname(192.168.1.10) validate: dgram from IP 192.168.1.10, port 514, name logclient.example.com; accepted in rule 0. logmsg: pri 15, flags 0, from logclient.example.com, msg Dec 11 02:01:28 trhodes: Test message 2 Logging to FILE /var/log/logclient.log Logging to FILE /var/log/messages At this point, the messages are being properly received and placed in the correct file.
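When verifying that a particular selector works, &man.logger.1; can also send a test message at an explicit facility and level with -p, and tag it with -t. The tag and priority in this sketch are arbitrary assumptions: &prompt.root; logger -p user.err -t logtest "Test message at user.err" If /etc/syslog.conf routes user.err to a particular file, the tagged message should appear there on the client and, with remote logging configured, on the server as well.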
Security Considerations As with any network service, security requirements should be considered before implementing a logging server. Log files may contain sensitive data about services enabled on the local host, user accounts, and configuration data. Network data sent from the client to the server will not be encrypted or password protected. If a need for encryption exists, consider using security/stunnel, which will transmit the logging data over an encrypted tunnel. Local security is also an issue. Log files are not encrypted during use or after log rotation. Local users may access log files to gain additional insight into system configuration. Setting proper permissions on log files is critical. The built-in log rotator, newsyslog, supports setting permissions on newly created and rotated log files. Setting log files to mode 600 should prevent unwanted access by local users. Refer to &man.newsyslog.conf.5; for additional information. Configuration Files <filename>/etc</filename> Layout There are a number of directories in which configuration information is kept. These include: /etc Generic system-specific configuration information. /etc/defaults Default versions of system configuration files. /etc/mail Extra &man.sendmail.8; configuration and other MTA configuration files. /etc/ppp Configuration for both user- and kernel-ppp programs. /usr/local/etc Configuration files for installed applications. May contain per-application subdirectories. /usr/local/etc/rc.d &man.rc.8; scripts for installed applications. /var/db Automatically generated system-specific database files, such as the package database and the &man.locate.1; database. Hostnames hostname DNS <filename>/etc/resolv.conf</filename> resolv.conf How a &os; system accesses the Internet Domain Name System (DNS) is controlled by &man.resolv.conf.5;. The most common entries to /etc/resolv.conf are: nameserver The IP address of a name server the resolver should query. The servers are queried in the order listed with a maximum of three. search Search list for hostname lookup. This is normally determined by the domain of the local hostname. domain The local domain name. A typical /etc/resolv.conf looks like this: search example.com nameserver 147.11.1.11 nameserver 147.11.100.30 Only one of the search and domain options should be used. When using DHCP, &man.dhclient.8; usually rewrites /etc/resolv.conf with information received from the DHCP server. <filename>/etc/hosts</filename> hosts /etc/hosts is a simple text database which works in conjunction with DNS and NIS to provide host name to IP address mappings. Entries for local computers connected via a LAN can be added to this file for simplistic naming purposes instead of setting up a &man.named.8; server. Additionally, /etc/hosts can be used to provide a local record of Internet names, reducing the need to query external DNS servers for commonly accessed names. # $&os;$ # # # Host Database # # This file should contain the addresses and aliases for local hosts that # share this file. Replace 'my.domain' below with the domainname of your # machine. # # In the presence of the domain name service or NIS, this file may # not be consulted at all; see /etc/nsswitch.conf for the resolution order. # # ::1 localhost localhost.my.domain 127.0.0.1 localhost localhost.my.domain # # Imaginary network. 
#10.0.0.2 myname.my.domain myname #10.0.0.3 myfriend.my.domain myfriend # # According to RFC 1918, you can use the following IP networks for # private nets which will never be connected to the Internet: # # 10.0.0.0 - 10.255.255.255 # 172.16.0.0 - 172.31.255.255 # 192.168.0.0 - 192.168.255.255 # # In case you want to be able to connect to the Internet, you need # real official assigned numbers. Do not try to invent your own network # numbers but instead get one from your network provider (if any) or # from your regional registry (ARIN, APNIC, LACNIC, RIPE NCC, or AfriNIC.) # The format of /etc/hosts is as follows: [Internet address] [official hostname] [alias1] [alias2] ... For example: 10.0.0.1 myRealHostname.example.com myRealHostname foobar1 foobar2 Consult &man.hosts.5; for more information. Tuning with &man.sysctl.8; sysctl tuning with sysctl &man.sysctl.8; is used to make changes to a running &os; system. This includes many advanced options of the TCP/IP stack and virtual memory system that can dramatically improve performance for an experienced system administrator. Over five hundred system variables can be read and set using &man.sysctl.8;. At its core, &man.sysctl.8; serves two functions: to read and to modify system settings. To view all readable variables: &prompt.user; sysctl -a To read a particular variable, specify its name: &prompt.user; sysctl kern.maxproc kern.maxproc: 1044 To set a particular variable, use the variable=value syntax: &prompt.root; sysctl kern.maxfiles=5000 kern.maxfiles: 2088 -> 5000 Settings of sysctl variables are usually either strings, numbers, or booleans, where a boolean is 1 for yes or 0 for no. To automatically set some variables each time the machine boots, add them to /etc/sysctl.conf. For more information, refer to &man.sysctl.conf.5;. <filename>sysctl.conf</filename> sysctl.conf sysctl The configuration file for &man.sysctl.8;, /etc/sysctl.conf, looks much like /etc/rc.conf. Values are set in a variable=value form. The specified values are set after the system goes into multi-user mode. Not all variables are settable in this mode. For example, to turn off logging of fatal signal exits and prevent users from seeing processes started by other users, the following tunables can be set in /etc/sysctl.conf: # Do not log fatal signal exits (e.g., sig 11) kern.logsigexit=0 # Prevent users from seeing information about processes that # are being run under another UID. security.bsd.see_other_uids=0 &man.sysctl.8; Read-only Tom Rhodes Contributed by In some cases it may be desirable to modify read-only &man.sysctl.8; values, which will require a reboot of the system. For instance, on some laptop models the &man.cardbus.4; device will not probe memory ranges and will fail with errors similar to: cbb0: Could not map register memory device_probe_and_attach: cbb0 attach returned 12 The fix requires the modification of a read-only &man.sysctl.8; setting. Add the appropriate tunable to /boot/loader.conf and reboot. Now &man.cardbus.4; should work properly. Tuning Disks The following section will discuss various tuning mechanisms and options which may be applied to disk devices. In many cases, disks with mechanical parts, such as SCSI drives, will be the bottleneck driving down the overall system performance. While a solution is to install a drive without mechanical parts, such as a solid state drive, mechanical drives are not going away anytime in the near future.
When tuning disks, it is advisable to utilize the features of the &man.iostat.8; command to test various changes to the system. This command will allow the user to obtain valuable information on system I/O. Sysctl Variables <varname>vfs.vmiodirenable</varname> vfs.vmiodirenable The vfs.vmiodirenable &man.sysctl.8; variable may be set to either 0 (off) or 1 (on). It is set to 1 by default. This variable controls how directories are cached by the system. Most directories are small, using just a single fragment (typically 1 K) in the file system and typically 512 bytes in the buffer cache. With this variable turned off, the buffer cache will only cache a fixed number of directories, even if the system has a huge amount of memory. When turned on, this &man.sysctl.8; allows the buffer cache to use the VM page cache to cache the directories, making all the memory available for caching directories. However, the minimum in-core memory used to cache a directory is the physical page size (typically 4 K) rather than 512 bytes. Keeping this option enabled is recommended if the system is running any services which manipulate large numbers of files. Such services can include web caches, large mail systems, and news systems. Keeping this option on will generally not reduce performance, even with the wasted memory, but one should experiment to find out. <varname>vfs.write_behind</varname> vfs.write_behind The vfs.write_behind &man.sysctl.8; variable defaults to 1 (on). This tells the file system to issue media writes as full clusters are collected, which typically occurs when writing large sequential files. This avoids saturating the buffer cache with dirty buffers when it would not benefit I/O performance. However, this may stall processes and under certain circumstances should be turned off. <varname>vfs.hirunningspace</varname> vfs.hirunningspace The vfs.hirunningspace &man.sysctl.8; variable determines how much outstanding write I/O may be queued to disk controllers system-wide at any given instant. The default is usually sufficient, but on machines with many disks, try bumping it up to four or five megabytes. Setting a value which exceeds the buffer cache's write threshold can lead to bad clustering performance. Do not set this value arbitrarily high as higher write values may add latency to reads occurring at the same time. There are various other buffer cache and VM page cache related &man.sysctl.8; values. Modifying these values is not recommended as the VM system does a good job of automatically tuning itself. <varname>vm.swap_idle_enabled</varname> vm.swap_idle_enabled The vm.swap_idle_enabled &man.sysctl.8; variable is useful in large multi-user systems with many active login users and lots of idle processes. Such systems tend to generate continuous pressure on free memory reserves. Turning this feature on and tweaking the swapout hysteresis (in idle seconds) via vm.swap_idle_threshold1 and vm.swap_idle_threshold2 depresses the priority of memory pages associated with idle processes more quickly than the normal pageout algorithm. This gives a helping hand to the pageout daemon. Only turn this option on if needed, because the tradeoff is essentially that memory is pre-paged sooner rather than later, which eats more swap and disk bandwidth. In a small system this option will have a determinable effect, but in a large system that is already doing moderate paging, this option allows the VM system to stage whole processes into and out of memory easily.
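A practical way to evaluate any of these variables is to change it while observing disk activity with &man.iostat.8; under a representative workload. The following sketch assumes vfs.write_behind is the variable under test; the choice of variable and workload is illustrative only: &prompt.root; sysctl vfs.write_behind=0 vfs.write_behind: 1 -> 0 &prompt.root; iostat -w 1 If throughput or latency degrades, restore the previous value with sysctl vfs.write_behind=1.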
<varname>hw.ata.wc</varname> hw.ata.wc Turning off IDE write caching reduces write bandwidth to IDE disks, but may sometimes be necessary due to data consistency issues introduced by hard drive vendors. The problem is that some IDE drives lie about when a write completes. With IDE write caching turned on, IDE hard drives write data to disk out of order and will sometimes delay writing some blocks indefinitely when under heavy disk load. A crash or power failure may cause serious file system corruption. Check the default on the system by observing the hw.ata.wc &man.sysctl.8; variable. If IDE write caching is turned off, one can set this read-only variable to 1 in /boot/loader.conf in order to enable it at boot time. For more information, refer to &man.ata.4;. <literal>SCSI_DELAY</literal> (<varname>kern.cam.scsi_delay</varname>) kern.cam.scsi_delay kernel options SCSI DELAY The SCSI_DELAY kernel configuration option may be used to reduce system boot times. The defaults are fairly high and can be responsible for 15 seconds of delay in the boot process. Reducing it to 5 seconds usually works with modern drives. The kern.cam.scsi_delay boot time tunable should be used. The tunable and kernel configuration option accept values in terms of milliseconds and not seconds. Soft Updates Soft Updates &man.tunefs.8; To fine-tune a file system, use &man.tunefs.8;. This program has many different options. To toggle Soft Updates on and off, use: &prompt.root; tunefs -n enable /filesystem &prompt.root; tunefs -n disable /filesystem A file system cannot be modified with &man.tunefs.8; while it is mounted. A good time to enable Soft Updates is before any partitions have been mounted, in single-user mode. Soft Updates is recommended for UFS file systems as it drastically improves meta-data performance, mainly file creation and deletion, through the use of a memory cache. There are two downsides to Soft Updates to be aware of. First, Soft Updates guarantee file system consistency in the case of a crash, but could easily be several seconds or even a minute behind updating the physical disk. If the system crashes, unwritten data may be lost. Secondly, Soft Updates delay the freeing of file system blocks. If the root file system is almost full, performing a major update, such as make installworld, can cause the file system to run out of space and the update to fail. More Details About Soft Updates Soft Updates details Meta-data updates are updates to non-content data like inodes or directories. There are two traditional approaches to writing a file system's meta-data back to disk. Historically, the default behavior was to write out meta-data updates synchronously. If a directory changed, the system waited until the change was actually written to disk. The file data buffers (file contents) were passed through the buffer cache and backed up to disk later on asynchronously. The advantage of this implementation is that it operates safely. If there is a failure during an update, meta-data is always in a consistent state. A file is either created completely or not at all. If the data blocks of a file did not find their way out of the buffer cache onto the disk by the time of the crash, &man.fsck.8; recognizes this and repairs the file system by setting the file length to 0. Additionally, the implementation is clear and simple. The disadvantage is that meta-data changes are slow. For example, rm -r touches all the files in a directory sequentially, but each directory change will be written synchronously to the disk. 
This includes updates to the directory itself, to the inode table, and possibly to indirect blocks allocated by the file. Similar considerations apply for unrolling large hierarchies using tar -x. The second approach is to use asynchronous meta-data updates. This is the default for a UFS file system mounted with mount -o async. Since all meta-data updates are also passed through the buffer cache, they will be intermixed with the updates of the file content data. The advantage of this implementation is there is no need to wait until each meta-data update has been written to disk, so all operations which cause huge amounts of meta-data updates work much faster than in the synchronous case. This implementation is still clear and simple, so there is a low risk for bugs creeping into the code. The disadvantage is that there is no guarantee for a consistent state of the file system. If there is a failure during an operation that updated large amounts of meta-data, like a power failure or someone pressing the reset button, the file system will be left in an unpredictable state. There is no opportunity to examine the state of the file system when the system comes up again as the data blocks of a file could already have been written to the disk while the updates of the inode table or the associated directory were not. It is impossible to implement a &man.fsck.8; which is able to clean up the resulting chaos because the necessary information is not available on the disk. If the file system has been damaged beyond repair, the only choice is to reformat it and restore from backup. The usual solution for this problem is to implement dirty region logging, which is also referred to as journaling. Meta-data updates are still written synchronously, but only into a small region of the disk. Later on, they are moved to their proper location. Since the logging area is a small, contiguous region on the disk, there are no long distances for the disk heads to move, even during heavy operations, so these operations are quicker than synchronous updates. Additionally, the complexity of the implementation is limited, so the risk of bugs being present is low. A disadvantage is that all meta-data is written twice, once into the logging region and once to the proper location, so a performance pessimization might result. On the other hand, in case of a crash, all pending meta-data operations can be either quickly rolled back or completed from the logging area after the system comes up again, resulting in a fast file system startup. Kirk McKusick, the developer of Berkeley FFS, solved this problem with Soft Updates. All pending meta-data updates are kept in memory and written out to disk in a sorted sequence (ordered meta-data updates). This has the effect that, in case of heavy meta-data operations, later updates to an item catch the earlier ones which are still in memory and have not yet been written to disk. All operations are generally performed in memory before the update is written to disk and the data blocks are sorted according to their position so that they will not be on the disk ahead of their meta-data. If the system crashes, an implicit log rewind causes all operations which were not written to the disk to appear as if they never happened. A consistent file system state is maintained that appears to be the one of 30 to 60 seconds earlier. The algorithm used guarantees that all resources in use are marked as such in their blocks and inodes.
After a crash, the only resource allocation error that occurs is that resources are marked as used which are actually free. &man.fsck.8; recognizes this situation, and frees the resources that are no longer used. It is safe to ignore the dirty state of the file system after a crash by forcibly mounting it with mount -f. In order to free resources that may be unused, &man.fsck.8; needs to be run at a later time. This is the idea behind the background &man.fsck.8;: at system startup time, only a snapshot of the file system is recorded and &man.fsck.8; is run afterwards. All file systems can then be mounted dirty, so the system startup proceeds in multi-user mode. Then, background &man.fsck.8; is scheduled for all file systems where this is required, to free resources that may be unused. File systems that do not use Soft Updates still need the usual foreground &man.fsck.8;. The advantage is that meta-data operations are nearly as fast as asynchronous updates and are faster than logging, which has to write the meta-data twice. The disadvantages are the complexity of the code, a higher memory consumption, and some idiosyncrasies. After a crash, the state of the file system appears to be somewhat older. In situations where the standard synchronous approach would have caused some zero-length files to remain after the &man.fsck.8;, these files do not exist at all with Soft Updates because neither the meta-data nor the file contents have been written to disk. Disk space is not released until the updates have been written to disk, which may take place some time after running &man.rm.1;. This may cause problems when installing large amounts of data on a file system that does not have enough free space to hold all the files twice. Tuning Kernel Limits tuning kernel limits File/Process Limits <varname>kern.maxfiles</varname> kern.maxfiles The kern.maxfiles &man.sysctl.8; variable can be raised or lowered based upon system requirements. This variable indicates the maximum number of file descriptors on the system. When the file descriptor table is full, file: table is full will show up repeatedly in the system message buffer, which can be viewed using &man.dmesg.8;. Each open file, socket, or fifo uses one file descriptor. A large-scale production server may easily require many thousands of file descriptors, depending on the kind and number of services running concurrently. In older &os; releases, the default value of kern.maxfiles is derived from maxusers in the kernel configuration file. kern.maxfiles grows proportionally to the value of maxusers. When compiling a custom kernel, consider setting this kernel configuration option according to the use of the system. From this number, the kernel is given most of its pre-defined limits. Even though a production machine may not have 256 concurrent users, the resources needed may be similar to a high-scale web server. The read-only &man.sysctl.8; variable kern.maxusers is automatically sized at boot based on the amount of memory available in the system, and its value may be inspected at run-time. Some systems require larger or smaller values of kern.maxusers and values of 64, 128, and 256 are not uncommon. Going above 256 is not recommended unless a huge number of file descriptors is needed. Many of the tunable values set to their defaults by kern.maxusers may be individually overridden at boot-time or run-time in /boot/loader.conf. Refer to &man.loader.conf.5; and /boot/defaults/loader.conf for more details and some hints.
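As a hypothetical example of such an override, the following lines in /boot/loader.conf pin the scale factor and raise the file descriptor limit at boot; the values are purely illustrative, not recommendations: kern.maxusers="256" kern.maxfiles="65536" After a reboot, the result can be confirmed with sysctl kern.maxusers kern.maxfiles.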
In older releases, the system will auto-tune maxusers if it is set to 0. The auto-tuning algorithm sets maxusers equal to the amount of memory in the system, with a minimum of 32, and a maximum of 384. When setting this option, set maxusers to at least 4, especially if the system runs &xorg; or is used to compile software. The most important table set by maxusers is the maximum number of processes, which is set to 20 + 16 * maxusers. If maxusers is set to 1, there can only be 36 simultaneous processes, including the 18 or so that the system starts up at boot time and the 15 or so used by &xorg;. Even a simple task like reading a manual page will start up nine processes to filter, decompress, and view it. Setting maxusers to 64 allows up to 1044 simultaneous processes, which should be enough for nearly all uses. If, however, the proc table full error is displayed when trying to start another program, or a server is running with a large number of simultaneous users, increase the number and rebuild. maxusers does not limit the number of users which can log into the machine. It instead sets various table sizes to reasonable values considering the maximum number of users on the system and how many processes each user will be running. <varname>kern.ipc.soacceptqueue</varname> kern.ipc.soacceptqueue The kern.ipc.soacceptqueue &man.sysctl.8; variable limits the size of the listen queue for accepting new TCP connections. The default value of 128 is typically too low for robust handling of new connections on a heavily loaded web server. For such environments, it is recommended to increase this value to 1024 or higher. A service such as &man.sendmail.8; or Apache may itself limit the listen queue size, but will often have a directive in its configuration file to adjust the queue size. Large listen queues do a better job of avoiding Denial of Service (DoS) attacks. Network Limits The NMBCLUSTERS kernel configuration option dictates the number of network Mbufs available to the system. On a heavily-trafficked server, a low number of Mbufs will hinder performance. Each cluster represents approximately 2 K of memory, so a value of 1024 represents 2 megabytes of kernel memory reserved for network buffers. A simple calculation can be done to figure out how many are needed. A web server which maxes out at 1000 simultaneous connections where each connection uses a 16 K receive and 16 K send buffer, requires approximately 32 MB worth of network buffers to cover the web server. A good rule of thumb is to multiply by 2, so 2 x 32 MB = 64 MB, and 64 MB / 2 K = 32768 clusters. Values between 4096 and 32768 are recommended for machines with greater amounts of memory. Never specify an arbitrarily high value for this parameter as it could lead to a boot time crash. To observe network cluster usage, use &man.netstat.1; with the -m option. The kern.ipc.nmbclusters loader tunable should be used to tune this at boot time. Only older versions of &os; will require the use of the NMBCLUSTERS kernel &man.config.8; option. For busy servers that make extensive use of the &man.sendfile.2; system call, it may be necessary to increase the number of &man.sendfile.2; buffers via the NSFBUFS kernel configuration option or by setting its value in /boot/loader.conf (see &man.loader.8; for details). A common indicator that this parameter needs to be adjusted is when processes are seen in the sfbufa state. The &man.sysctl.8; variable kern.ipc.nsfbufs is read-only.
This parameter nominally scales with kern.maxusers; however, it may be necessary to tune it accordingly. Even though a socket has been marked as non-blocking, calling &man.sendfile.2; on the non-blocking socket may result in the &man.sendfile.2; call blocking until enough struct sf_buf's are made available. <varname>net.inet.ip.portrange.*</varname> net.inet.ip.portrange.* The net.inet.ip.portrange.* &man.sysctl.8; variables control the port number ranges automatically bound to TCP and UDP sockets. There are three ranges: a low range, a default range, and a high range. Most network programs use the default range which is controlled by net.inet.ip.portrange.first and net.inet.ip.portrange.last, which default to 1024 and 5000, respectively. Bound port ranges are used for outgoing connections and it is possible to run the system out of ports under certain circumstances. This most commonly occurs when running a heavily loaded web proxy. The port range is not an issue when running a server which handles mainly incoming connections, such as a web server, or has a limited number of outgoing connections, such as a mail relay. For situations where there is a shortage of ports, it is recommended to increase net.inet.ip.portrange.last modestly. A value of 10000, 20000 or 30000 may be reasonable. Consider firewall effects when changing the port range. Some firewalls may block large ranges of ports, usually low-numbered ports, and expect systems to use higher ranges of ports for outgoing connections. For this reason, it is not recommended that the value of net.inet.ip.portrange.first be lowered. <literal>TCP</literal> Bandwidth Delay Product TCP Bandwidth Delay Product Limiting net.inet.tcp.inflight.enable TCP bandwidth delay product limiting can be enabled by setting the net.inet.tcp.inflight.enable &man.sysctl.8; variable to 1. This instructs the system to attempt to calculate the bandwidth delay product for each connection and limit the amount of data queued to the network to just the amount required to maintain optimum throughput. This feature is useful when serving data over modems, Gigabit Ethernet, high speed WAN links, or any other link with a high bandwidth delay product, especially when also using window scaling or when a large send window has been configured. When enabling this option, also set net.inet.tcp.inflight.debug to 0 to disable debugging. For production use, setting net.inet.tcp.inflight.min to at least 6144 may be beneficial. Setting high minimums may effectively disable bandwidth limiting, depending on the link. The limiting feature reduces the amount of data built up in intermediate route and switch packet queues and reduces the amount of data built up in the local host's interface queue. With fewer queued packets, interactive connections, especially over slow modems, will operate with lower round trip times. This feature only affects server-side data transmission, such as uploading. It has no effect on data reception or downloading. Adjusting net.inet.tcp.inflight.stab is not recommended. This parameter defaults to 20, representing 2 maximal packets added to the bandwidth delay product window calculation. The additional window is required to stabilize the algorithm and improve responsiveness to changing conditions, but it can also result in higher &man.ping.8; times over slow links, though still much lower than without the inflight algorithm.
In such cases, try reducing this parameter to 15, 10, or 5 and reducing net.inet.tcp.inflight.min to a value such as 3500 to get the desired effect. Reducing these parameters should be done as a last resort only. Virtual Memory <varname>kern.maxvnodes</varname> A vnode is the internal representation of a file or directory. Increasing the number of vnodes available to the operating system reduces disk I/O. Normally, this is handled by the operating system and does not need to be changed. In some cases where disk I/O is a bottleneck and the system is running out of vnodes, this setting needs to be increased. The amount of inactive and free RAM will need to be taken into account. To see the current number of vnodes in use: &prompt.root; sysctl vfs.numvnodes vfs.numvnodes: 91349 To see the maximum vnodes: &prompt.root; sysctl kern.maxvnodes kern.maxvnodes: 100000 If the current vnode usage is near the maximum, try increasing kern.maxvnodes by a value of 1000. Keep an eye on the number of vfs.numvnodes. If it climbs up to the maximum again, kern.maxvnodes will need to be increased further. Otherwise, a shift in memory usage as reported by &man.top.1; should be visible and more memory should be active. Adding Swap Space Sometimes a system requires more swap space. This section describes two methods to increase swap space: adding swap to an existing partition or new hard drive, and creating a swap file on an existing partition. For information on how to encrypt swap space, which options exist, and why it should be done, refer to the section on encrypting swap space elsewhere in this book. Swap on a New Hard Drive or Existing Partition Adding a new hard drive for swap gives better performance than using a partition on an existing drive. Setting up partitions and hard drives is explained elsewhere in this book, as are partition layouts and swap partition size considerations. Use swapon to add a swap partition to the system. For example: &prompt.root; swapon /dev/ada1s1b It is possible to use any partition not currently mounted, even if it already contains data. Using swapon on a partition that contains data will overwrite and destroy that data. Make sure that the partition to be added as swap is really the intended partition before running swapon. To automatically add this swap partition on boot, add an entry to /etc/fstab: /dev/ada1s1b none swap sw 0 0 See &man.fstab.5; for an explanation of the entries in /etc/fstab. More information about swapon can be found in &man.swapon.8;. Creating a Swap File These examples create a 512M swap file called /usr/swap0 instead of using a partition. Using swap files requires that the module needed by &man.md.4; has either been built into the kernel or has been loaded before swap is enabled. See the chapter on configuring the &os; kernel for information about building a custom kernel. Creating a Swap File Create the swap file: &prompt.root; dd if=/dev/zero of=/usr/swap0 bs=1m count=512 Set the proper permissions on the new file: &prompt.root; chmod 0600 /usr/swap0 Inform the system about the swap file by adding a line to /etc/fstab: md99 none swap sw,file=/usr/swap0,late 0 0 The &man.md.4; device md99 is used, leaving lower device numbers available for interactive use. Swap space will be added on system startup. To add swap space immediately, use &man.swapon.8;: &prompt.root; swapon -aL Power and Resource Management Hiten Pandya Written by Tom Rhodes It is important to utilize hardware resources in an efficient manner.
Power and resource management allows the operating system to monitor system limits and to possibly provide an alert if the system temperature increases unexpectedly. An early specification for providing power management was the Advanced Power Management (APM) facility. APM controls the power usage of a system based on its activity. However, it was difficult and inflexible for operating systems to manage the power usage and thermal properties of a system. The hardware was managed by the BIOS and the user had limited configurability and visibility into the power management settings. The APM BIOS is supplied by the vendor and is specific to the hardware platform. An APM driver in the operating system mediates access to the APM Software Interface, which allows management of power levels. There are four major problems in APM. First, power management is done by the vendor-specific BIOS, separate from the operating system. For example, the user can set idle-time values for a hard drive in the APM BIOS so that, when exceeded, the BIOS spins down the hard drive without the consent of the operating system. Second, the APM logic is embedded in the BIOS, and it operates outside the scope of the operating system. This means that users can only fix problems in the APM BIOS by flashing a new one into the ROM, which is a dangerous procedure with the potential to leave the system in an unrecoverable state if it fails. Third, APM is a vendor-specific technology, meaning that there is a lot of duplication of efforts and bugs found in one vendor's BIOS may not be solved in others. Lastly, the APM BIOS did not have enough room to implement a sophisticated power policy or one that can adapt well to the purpose of the machine. The Plug and Play BIOS (PNPBIOS) was unreliable in many situations. PNPBIOS is 16-bit technology, so the operating system has to use 16-bit emulation in order to interface with PNPBIOS methods. &os; provides an APM driver, as APM may still be needed for systems manufactured at or before the year 2000. The driver is documented in &man.apm.4;. ACPI APM The successor to APM is the Advanced Configuration and Power Interface (ACPI). ACPI is a standard written by an alliance of vendors to provide an interface for hardware resources and power management. It is a key element in Operating System-directed configuration and Power Management as it provides more control and flexibility to the operating system. This chapter demonstrates how to configure ACPI on &os;. It then offers some tips on how to debug ACPI and how to submit a problem report containing debugging information so that developers can diagnose and fix ACPI issues. Configuring <acronym>ACPI</acronym> In &os; the &man.acpi.4; driver is loaded by default at system boot and should not be compiled into the kernel. This driver cannot be unloaded after boot because the system bus uses it for various hardware interactions. However, if the system is experiencing problems, ACPI can be disabled altogether by rebooting after setting hint.acpi.0.disabled="1" in /boot/loader.conf or by setting this variable at the loader prompt. ACPI and APM cannot coexist and should be used separately. The last one to load will terminate if the driver notices the other is running. ACPI can be used to put the system into a sleep mode with acpiconf, the -s flag, and a number from 1 to 5. Most users only need 1 (quick suspend to RAM) or 3 (suspend to RAM). Option 5 performs a soft-off which is the same as running halt -p.
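For example, assuming the hardware supports the S3 sleep state, the following command suspends the system to RAM: &prompt.root; acpiconf -s 3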
Other options are available using sysctl. Refer to &man.acpi.4; and &man.acpiconf.8; for more information. Common Problems ACPI ACPI is present in all modern computers that conform to the ia32 (x86) and amd64 (AMD) architectures. The full standard has many features including CPU performance management, power planes control, thermal zones, various battery systems, embedded controllers, and bus enumeration. Most systems implement less than the full standard. For instance, a desktop system usually only implements bus enumeration while a laptop might have cooling and battery management support as well. Laptops also have suspend and resume, with their own associated complexity. An ACPI-compliant system has various components. The BIOS and chipset vendors provide various fixed tables, such as FADT, in memory that specify things like the APIC map (used for SMP), config registers, and simple configuration values. Additionally, a bytecode table, the Differentiated System Description Table (DSDT), specifies a tree-like name space of devices and methods. The ACPI driver must parse the fixed tables, implement an interpreter for the bytecode, and modify device drivers and the kernel to accept information from the ACPI subsystem. For &os;, &intel; has provided an interpreter (ACPI-CA) that is shared with &linux; and NetBSD. The path to the ACPI-CA source code is src/sys/contrib/dev/acpica. The glue code that allows ACPI-CA to work on &os; is in src/sys/dev/acpica/Osd. Finally, drivers that implement various ACPI devices are found in src/sys/dev/acpica. ACPI problems For ACPI to work correctly, all the parts have to work correctly. Here are some common problems, in order of frequency of appearance, and some possible workarounds or fixes. If a fix does not resolve the issue, refer to Getting and Submitting Debugging Info, below, for instructions on how to submit a bug report. Mouse Issues In some cases, resuming from a suspend operation will cause the mouse to fail. A known work around is to add hint.psm.0.flags="0x3000" to /boot/loader.conf. Suspend/Resume ACPI has three suspend to RAM (STR) states, S1-S3, and one suspend to disk state (STD), called S4. STD can be implemented in two separate ways. The S4BIOS is a BIOS-assisted suspend to disk and S4OS is implemented entirely by the operating system. The normal state the system is in when plugged in but not powered up is soft off (S5). Use sysctl hw.acpi to check for the suspend-related items. These example results are from a Thinkpad: hw.acpi.supported_sleep_state: S3 S4 S5 hw.acpi.s4bios: 0 Use acpiconf -s to test S3, S4, and S5. A value of one (1) indicates S4BIOS support instead of S4 operating system support. When testing suspend/resume, start with S1, if supported. This state is most likely to work since it does not require much driver support. No one has implemented S2, which is similar to S1. Next, try S3. This is the deepest STR state and requires a lot of driver support to properly reinitialize the hardware. A common problem with suspend/resume is that many device drivers do not save, restore, or reinitialize their firmware, registers, or device memory properly. As a first attempt at debugging the problem, try: &prompt.root; sysctl debug.bootverbose=1 &prompt.root; sysctl debug.acpi.suspend_bounce=1 &prompt.root; acpiconf -s 3 This test emulates the suspend/resume cycle of all device drivers without actually going into S3 state. In some cases, problems such as losing firmware state, device watchdog time out, and retrying forever, can be captured with this method.
Note that the system will not really enter S3 state, which means devices may not lose power, and many will work fine even if suspend/resume methods are totally missing, unlike real S3 state. Harder cases require additional hardware, such as a serial port and cable for debugging through a serial console, a Firewire port and cable for using &man.dcons.4;, and kernel debugging skills. To help isolate the problem, unload as many drivers as possible. If it works, narrow down which driver is the problem by loading drivers until it fails again. Typically, binary drivers like nvidia.ko, display drivers, and USB will have the most problems while Ethernet interfaces usually work fine. If drivers can be properly loaded and unloaded, automate this by putting the appropriate commands in /etc/rc.suspend and /etc/rc.resume. Try setting hw.acpi.reset_video to 1 if the display is messed up after resume. Try setting longer or shorter values for hw.acpi.sleep_delay to see if that helps. Try loading a recent &linux; distribution to see if suspend/resume works on the same hardware. If it works on &linux;, it is likely a &os; driver problem. Narrowing down which driver causes the problem will assist developers in fixing the problem. Since the ACPI maintainers rarely maintain other drivers, such as sound or ATA, any driver problems should also be posted to the &a.current.name; list and mailed to the driver maintainer. Advanced users can include debugging &man.printf.3;s in a problematic driver to track down where in its resume function it hangs. Finally, try disabling ACPI and enabling APM instead. If suspend/resume works with APM, stick with APM, especially on older hardware (pre-2000). It took vendors a while to get ACPI support correct and older hardware is more likely to have BIOS problems with ACPI. System Hangs Most system hangs are a result of lost interrupts or an interrupt storm. Chipsets may have problems based on how the BIOS configures interrupts before boot, correctness of the APIC (MADT) table, and routing of the System Control Interrupt (SCI). interrupt storms Interrupt storms can be distinguished from lost interrupts by checking the output of vmstat -i and looking at the line that has acpi0. If the counter is increasing at more than a couple per second, there is an interrupt storm. If the system appears hung, try breaking to DDB (CTRL+ALT+ESC on the console) and type show interrupts. APIC disabling When dealing with interrupt problems, try disabling APIC support with hint.apic.0.disabled="1" in /boot/loader.conf. Panics Panics are relatively rare for ACPI and are the top priority to be fixed. The first step is to isolate the steps to reproduce the panic, if possible, and get a backtrace. Follow the advice for enabling options DDB and setting up a serial console, or set up a dump partition. To get a backtrace in DDB, use tr. When handwriting the backtrace, get at least the last five and the top five lines in the trace. Then, try to isolate the problem by booting with ACPI disabled. If that works, isolate the ACPI subsystem by using various values of debug.acpi.disabled. See &man.acpi.4; for some examples. System Powers Up After Suspend or Shutdown First, try setting hw.acpi.disable_on_poweroff="0" in /boot/loader.conf. This keeps ACPI from disabling various events during the shutdown process. Some systems need this value set to 1 (the default) for the same reason. This usually fixes the problem of a system powering up spontaneously after a suspend or poweroff. BIOS Contains Buggy Bytecode ACPI ASL Some BIOS vendors provide incorrect or buggy bytecode.
This is usually manifested by kernel console messages like this: ACPI-1287: *** Error: Method execution failed [\\_SB_.PCI0.LPC0.FIGD._STA] \\ (Node 0xc3f6d160), AE_NOT_FOUND Often, these problems may be resolved by updating the BIOS to the latest revision. Most console messages are harmless, but if there are other problems, like the battery status is not working, these messages are a good place to start looking for problems. Overriding the Default <acronym>AML</acronym> The BIOS bytecode, known as ACPI Machine Language (AML), is compiled from a source language called ACPI Source Language (ASL). The AML is found in the table known as the Differentiated System Description Table (DSDT). ACPI ASL The goal of &os; is for everyone to have working ACPI without any user intervention. Workarounds are still being developed for common mistakes made by BIOS vendors. The &microsoft; interpreter (acpi.sys and acpiec.sys) does not strictly check for adherence to the standard, and thus many BIOS vendors who only test ACPI under &windows; never fix their ASL. &os; developers continue to identify and document which non-standard behavior is allowed by &microsoft;'s interpreter and replicate it so that &os; can work without forcing users to fix the ASL. To help identify buggy behavior and possibly fix it manually, a copy can be made of the system's ASL. To copy the system's ASL to a specified file name, use acpidump with -t, to show the contents of the fixed tables, and -d, to disassemble the AML: &prompt.root; acpidump -td > my.asl Some AML versions assume the user is running &windows;. To override this, set hw.acpi.osname="Windows 2009" in /boot/loader.conf, using the most recent &windows; version listed in the ASL. Other workarounds may require my.asl to be customized. If this file is edited, compile the new ASL using the following command. Warnings can usually be ignored, but errors are bugs that will usually prevent ACPI from working correctly. &prompt.root; iasl -f my.asl Including -f forces creation of the AML, even if there are errors during compilation. Some errors, such as missing return statements, are automatically worked around by the &os; interpreter. The default output filename for iasl is DSDT.aml. Load this file instead of the BIOS's buggy copy, which is still present in flash memory, by editing /boot/loader.conf as follows: acpi_dsdt_load="YES" acpi_dsdt_name="/boot/DSDT.aml" Be sure to copy DSDT.aml to /boot, then reboot the system. If this fixes the problem, send a &man.diff.1; of the old and new ASL to &a.acpi.name; so that developers can work around the buggy behavior in acpica.
To enable it, add options ACPI_DEBUG to the custom kernel configuration file if ACPI is compiled into the kernel. Add ACPI_DEBUG=1 to /etc/make.conf to enable it globally. If a module is used instead of a custom kernel, recompile just the acpi.ko module as follows: &prompt.root; cd /sys/modules/acpi/acpi && make clean && make ACPI_DEBUG=1 Copy the compiled acpi.ko to /boot/kernel and add the desired level and layer to /boot/loader.conf. The entries in this example enable debug messages for all ACPI components and hardware drivers and output error messages at the least verbose level: debug.acpi.layer="ACPI_ALL_COMPONENTS ACPI_ALL_DRIVERS" debug.acpi.level="ACPI_LV_ERROR" If the required information is triggered by a specific event, such as a suspend and then resume, do not modify /boot/loader.conf. Instead, use sysctl to specify the layer and level after booting and preparing the system for the specific event. The variables which can be set using sysctl are named the same as the tunables in /boot/loader.conf. ACPI problems Once the debugging information is gathered, it can be sent to &a.acpi.name; so that it can be used by the &os; ACPI maintainers to identify the root cause of the problem and to develop a solution. Before submitting debugging information to this mailing list, ensure the latest BIOS version is installed and, if available, the embedded controller firmware version. When submitting a problem report, include the following information: Description of the buggy behavior, including system type, model, and anything that causes the bug to appear. Note as accurately as possible when the bug began occurring if it is new. The output of dmesg after running boot -v, including any error messages generated by the bug. The dmesg output from boot -v with ACPI disabled, if disabling ACPI helps to fix the problem. Output from sysctl hw.acpi. This lists which features the system offers. The URL to a pasted version of the system's ASL. Do not send the ASL directly to the list as it can be very large. Generate a copy of the ASL by running this command: &prompt.root; acpidump -dt > name-system.asl Substitute the login name for name and manufacturer/model for system. For example, use njl-FooCo6000.asl. Most &os; developers watch the &a.current;, but one should submit problems to &a.acpi.name; to be sure it is seen. Be patient when waiting for a response. If the bug is not immediately apparent, submit a bug report. When entering a PR, include the same information as requested above. This helps developers to track the problem and resolve it. Do not send a PR without emailing &a.acpi.name; first as it is likely that the problem has been reported before. References More information about ACPI may be found in the following locations: The &os; ACPI Mailing List Archives (https://lists.freebsd.org/pipermail/freebsd-acpi/) The ACPI 2.0 Specification (http://acpi.info/spec.htm) &man.acpi.4;, &man.acpi.thermal.4;, &man.acpidump.8;, &man.iasl.8;, and &man.acpidb.8; diff --git a/en_US.ISO8859-1/books/handbook/geom/chapter.xml b/en_US.ISO8859-1/books/handbook/geom/chapter.xml index dcb1e12e3c..a682799543 100644 --- a/en_US.ISO8859-1/books/handbook/geom/chapter.xml +++ b/en_US.ISO8859-1/books/handbook/geom/chapter.xml @@ -1,1693 +1,1693 @@ GEOM: Modular Disk Transformation Framework Tom Rhodes Written by Synopsis GEOM GEOM Disk Framework GEOM In &os;, the GEOM framework permits access and control to classes, such as Master Boot Records and BSD labels, through the use of providers, or the disk devices in /dev. 
By supporting various software RAID configurations, GEOM transparently provides access to the operating system and operating system utilities. This chapter covers the use of disks under the GEOM framework in &os;. This includes the major RAID control utilities which use the framework for configuration. This chapter is not a definitive guide to RAID configurations and only GEOM-supported RAID classifications are discussed. After reading this chapter, you will know: What type of RAID support is available through GEOM. How to use the base utilities to configure, maintain, and manipulate the various RAID levels. How to mirror, stripe, encrypt, and remotely connect disk devices through GEOM. How to troubleshoot disks attached to the GEOM framework. Before reading this chapter, you should: Understand how &os; treats disk devices (). Know how to configure and install a new kernel (). RAID0 - Striping Tom Rhodes Written by Murray Stokely GEOM Striping Striping combines several disk drives into a single volume. Striping can be performed through the use of hardware RAID controllers. The GEOM disk subsystem provides software support for disk striping, also known as RAID0, without the need for a RAID disk controller. In RAID0, data is split into blocks that are written across all the drives in the array. As seen in the following illustration, instead of having to wait on the system to write 256k to one disk, RAID0 can simultaneously write 64k to each of the four disks in the array, offering superior I/O performance. This performance can be enhanced further by using multiple disk controllers. Disk Striping Illustration Each disk in a RAID0 stripe must be of the same size, since I/O requests are interleaved to read or write to multiple disks in parallel. RAID0 does not provide any redundancy. This means that if one disk in the array fails, all of the data on the disks is lost. If the data is important, implement a backup strategy that regularly saves backups to a remote system or device. The process for creating a software, GEOM-based RAID0 on a &os; system using commodity disks is as follows. Once the stripe is created, refer to &man.gstripe.8; for more information on how to control an existing stripe. Creating a Stripe of Unformatted <acronym>ATA</acronym> Disks Load the geom_stripe.ko module: &prompt.root; kldload geom_stripe Ensure that a suitable mount point exists. If this volume will become a root partition, then temporarily use another mount point such as /mnt. Determine the device names for the disks which will be striped, and create the new stripe device. For example, to stripe two unused and unpartitioned ATA disks with device names of /dev/ad2 and /dev/ad3: &prompt.root; gstripe label -v st0 /dev/ad2 /dev/ad3 Metadata value stored on /dev/ad2. Metadata value stored on /dev/ad3. Done. Write a standard label, also known as a partition table, on the new volume and install the default bootstrap code: &prompt.root; bsdlabel -wB /dev/stripe/st0 This process should create two other devices in /dev/stripe in addition to st0. Those include st0a and st0c. At this point, a UFS file system can be created on st0a using newfs: &prompt.root; newfs -U /dev/stripe/st0a Many numbers will glide across the screen, and after a few seconds, the process will be complete. The volume has been created and is ready to be mounted. 
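Before mounting, the state of the new stripe can be checked with gstripe status. The output below is illustrative and will vary with the device names in use: &prompt.root; gstripe status Name Status Components stripe/st0 UP ad2 ad3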
To manually mount the created disk stripe: &prompt.root; mount /dev/stripe/st0a /mnt To mount this striped file system automatically during the boot process, place the volume information in /etc/fstab. In this example, a permanent mount point, named stripe, is created: &prompt.root; mkdir /stripe &prompt.root; echo "/dev/stripe/st0a /stripe ufs rw 2 2" \ >> /etc/fstab The geom_stripe.ko module must also be automatically loaded during system initialization, by adding a line to /boot/loader.conf: &prompt.root; echo 'geom_stripe_load="YES"' >> /boot/loader.conf RAID1 - Mirroring GEOM Disk Mirroring RAID1 RAID1, or mirroring, is the technique of writing the same data to more than one disk drive. Mirrors are usually used to guard against data loss due to drive failure. Each drive in a mirror contains an identical copy of the data. When an individual drive fails, the mirror continues to work, providing data from the drives that are still functioning. The computer keeps running, and the administrator has time to replace the failed drive without user interruption. Two common situations are illustrated in these examples. The first creates a mirror out of two new drives and uses it as a replacement for an existing single drive. The second example creates a mirror on a single new drive, copies the old drive's data to it, then inserts the old drive into the mirror. While this procedure is slightly more complicated, it only requires one new drive. Traditionally, the two drives in a mirror are identical in model and capacity, but &man.gmirror.8; does not require that. Mirrors created with dissimilar drives will have a capacity equal to that of the smallest drive in the mirror. Extra space on larger drives will be unused. Drives inserted into the mirror later must have at least as much capacity as the smallest drive already in the mirror. The mirroring procedures shown here are non-destructive, but as with any major disk operation, make a full backup first. While &man.dump.8; is used in these procedures to copy file systems, it does not work on file systems with soft updates journaling. See &man.tunefs.8; for information on detecting and disabling soft updates journaling. Metadata Issues Many disk systems store metadata at the end of each disk. Old metadata should be erased before reusing the disk for a mirror. Most problems are caused by two particular types of leftover metadata: GPT partition tables and old metadata from a previous mirror. GPT metadata can be erased with &man.gpart.8;. This example erases both primary and backup GPT partition tables from disk ada8: &prompt.root; gpart destroy -F ada8 A disk can be removed from an active mirror and the metadata erased in one step using &man.gmirror.8;. Here, the example disk ada8 is removed from the active mirror gm4: &prompt.root; gmirror remove gm4 ada8 If the mirror is not running, but old mirror metadata is still on the disk, use gmirror clear to remove it: &prompt.root; gmirror clear ada8 &man.gmirror.8; stores one block of metadata at the end of - the disk. Because GPT partition schemes + the disk. As GPT partition schemes also store metadata at the end of the disk, mirroring entire GPT disks with &man.gmirror.8; is not recommended. MBR partitioning is used here because it only stores a partition table at the start of the disk and does not conflict with the mirror metadata. Creating a Mirror with Two New Disks In this example, &os; has already been installed on a single disk, ada0. Two new disks, ada1 and ada2, have been connected to the system. 
A new mirror will be created on these two disks and used to replace the old single disk. The geom_mirror.ko kernel module must either be built into the kernel or loaded at boot- or run-time. Manually load the kernel module now: &prompt.root; gmirror load Create the mirror with the two new drives: &prompt.root; gmirror label -v gm0 /dev/ada1 /dev/ada2 gm0 is a user-chosen device name assigned to the new mirror. After the mirror has been started, this device name appears in /dev/mirror/. MBR and bsdlabel partition tables can now be created on the mirror with &man.gpart.8;. This example uses a traditional file system layout, with partitions for /, swap, /var, /tmp, and /usr. A single / and a swap partition will also work. Partitions on the mirror do not have to be the same size as those on the existing disk, but they must be large enough to hold all the data already present on ada0. &prompt.root; gpart create -s MBR mirror/gm0 &prompt.root; gpart add -t freebsd -a 4k mirror/gm0 &prompt.root; gpart show mirror/gm0 => 63 156301423 mirror/gm0 MBR (74G) 63 63 - free - (31k) 126 156301299 1 freebsd (74G) 156301425 61 - free - (30k) &prompt.root; gpart create -s BSD mirror/gm0s1 &prompt.root; gpart add -t freebsd-ufs -a 4k -s 2g mirror/gm0s1 &prompt.root; gpart add -t freebsd-swap -a 4k -s 4g mirror/gm0s1 &prompt.root; gpart add -t freebsd-ufs -a 4k -s 2g mirror/gm0s1 &prompt.root; gpart add -t freebsd-ufs -a 4k -s 1g mirror/gm0s1 &prompt.root; gpart add -t freebsd-ufs -a 4k mirror/gm0s1 &prompt.root; gpart show mirror/gm0s1 => 0 156301299 mirror/gm0s1 BSD (74G) 0 2 - free - (1.0k) 2 4194304 1 freebsd-ufs (2.0G) 4194306 8388608 2 freebsd-swap (4.0G) 12582914 4194304 4 freebsd-ufs (2.0G) 16777218 2097152 5 freebsd-ufs (1.0G) 18874370 137426928 6 freebsd-ufs (65G) 156301298 1 - free - (512B) Make the mirror bootable by installing bootcode in the MBR and bsdlabel and setting the active slice: &prompt.root; gpart bootcode -b /boot/mbr mirror/gm0 &prompt.root; gpart set -a active -i 1 mirror/gm0 &prompt.root; gpart bootcode -b /boot/boot mirror/gm0s1 Format the file systems on the new mirror, enabling soft-updates. &prompt.root; newfs -U /dev/mirror/gm0s1a &prompt.root; newfs -U /dev/mirror/gm0s1d &prompt.root; newfs -U /dev/mirror/gm0s1e &prompt.root; newfs -U /dev/mirror/gm0s1f File systems from the original ada0 disk can now be copied onto the mirror with &man.dump.8; and &man.restore.8;. &prompt.root; mount /dev/mirror/gm0s1a /mnt &prompt.root; dump -C16 -b64 -0aL -f - / | (cd /mnt && restore -rf -) &prompt.root; mount /dev/mirror/gm0s1d /mnt/var &prompt.root; mount /dev/mirror/gm0s1e /mnt/tmp &prompt.root; mount /dev/mirror/gm0s1f /mnt/usr &prompt.root; dump -C16 -b64 -0aL -f - /var | (cd /mnt/var && restore -rf -) &prompt.root; dump -C16 -b64 -0aL -f - /tmp | (cd /mnt/tmp && restore -rf -) &prompt.root; dump -C16 -b64 -0aL -f - /usr | (cd /mnt/usr && restore -rf -) Edit /mnt/etc/fstab to point to the new mirror file systems: # Device Mountpoint FStype Options Dump Pass# /dev/mirror/gm0s1a / ufs rw 1 1 /dev/mirror/gm0s1b none swap sw 0 0 /dev/mirror/gm0s1d /var ufs rw 2 2 /dev/mirror/gm0s1e /tmp ufs rw 2 2 /dev/mirror/gm0s1f /usr ufs rw 2 2 If the geom_mirror.ko kernel module has not been built into the kernel, /mnt/boot/loader.conf is edited to load the module at boot: geom_mirror_load="YES" Reboot the system to test the new mirror and verify that all data has been copied. The BIOS will see the mirror as two individual drives rather than a mirror. 
- Because the drives are identical, it does not matter which is + Since the drives are identical, it does not matter which is selected to boot. See if there are problems booting. Powering down and disconnecting the original ada0 disk will allow it to be kept as an offline backup. In use, the mirror will behave just like the original single drive. Creating a Mirror with an Existing Drive In this example, &os; has already been installed on a single disk, ada0. A new disk, ada1, has been connected to the system. A one-disk mirror will be created on the new disk, the existing system copied onto it, and then the old disk will be inserted into the mirror. This slightly complex procedure is required because gmirror needs to put a 512-byte block of metadata at the end of each disk, and the existing ada0 has usually had all of its space already allocated. Load the geom_mirror.ko kernel module: &prompt.root; gmirror load Check the media size of the original disk with diskinfo: &prompt.root; diskinfo -v ada0 | head -n3 /dev/ada0 512 # sectorsize 1000204821504 # mediasize in bytes (931G) Create a mirror on the new disk. To make certain that the mirror capacity is not any larger than the original ada0 drive, &man.gnop.8; is used to create a fake drive of the exact same size. This drive does not store any data, but is used only to limit the size of the mirror. When &man.gmirror.8; creates the mirror, it will restrict the capacity to the size of gzero.nop, even if the new ada1 drive has more space. Note that the 1000204821504 in the second line is equal to ada0's media size as shown by diskinfo above. &prompt.root; geom zero load &prompt.root; gnop create -s 1000204821504 gzero &prompt.root; gmirror label -v gm0 gzero.nop ada1 &prompt.root; gmirror forget gm0 Since gzero.nop does not store any data, the mirror does not see it as connected. The mirror is told to forget unconnected components, removing references to gzero.nop. The result is a mirror device containing only a single disk, ada1. After creating gm0, view the partition table on ada0. This output is from a 1 TB drive. If there is some unallocated space at the end of the drive, the contents may be copied directly from ada0 to the new mirror. However, if the output shows that all of the space on the disk is allocated, as in the following listing, there is no space available for the 512-byte mirror metadata at the end of the disk. &prompt.root; gpart show ada0 => 63 1953525105 ada0 MBR (931G) 63 1953525105 1 freebsd [active] (931G) In this case, the partition table must be edited to reduce the capacity by one sector on mirror/gm0. The procedure will be explained later. In either case, partition tables on the primary disk should be first copied using gpart backup and gpart restore. &prompt.root; gpart backup ada0 > table.ada0 &prompt.root; gpart backup ada0s1 > table.ada0s1 These commands create two files, table.ada0 and table.ada0s1. This example is from a 1 TB drive: &prompt.root; cat table.ada0 MBR 4 1 freebsd 63 1953525105 [active] &prompt.root; cat table.ada0s1 BSD 8 1 freebsd-ufs 0 4194304 2 freebsd-swap 4194304 33554432 4 freebsd-ufs 37748736 50331648 5 freebsd-ufs 88080384 41943040 6 freebsd-ufs 130023424 838860800 7 freebsd-ufs 968884224 984640881 If no free space is shown at the end of the disk, the size of both the slice and the last partition must be reduced by one sector. Edit the two files, reducing the size of both the slice and last partition by one. These are the last numbers in each listing. 
&prompt.root; cat table.ada0 MBR 4 1 freebsd 63 1953525104 [active] &prompt.root; cat table.ada0s1 BSD 8 1 freebsd-ufs 0 4194304 2 freebsd-swap 4194304 33554432 4 freebsd-ufs 37748736 50331648 5 freebsd-ufs 88080384 41943040 6 freebsd-ufs 130023424 838860800 7 freebsd-ufs 968884224 984640880 If at least one sector was unallocated at the end of the disk, these two files can be used without modification. Now restore the partition table into mirror/gm0: &prompt.root; gpart restore mirror/gm0 < table.ada0 &prompt.root; gpart restore mirror/gm0s1 < table.ada0s1 Check the partition table with gpart show. This example has gm0s1a for /, gm0s1d for /var, gm0s1e for /usr, gm0s1f for /data1, and gm0s1g for /data2. &prompt.root; gpart show mirror/gm0 => 63 1953525104 mirror/gm0 MBR (931G) 63 1953525042 1 freebsd [active] (931G) 1953525105 62 - free - (31k) &prompt.root; gpart show mirror/gm0s1 => 0 1953525042 mirror/gm0s1 BSD (931G) 0 2097152 1 freebsd-ufs (1.0G) 2097152 16777216 2 freebsd-swap (8.0G) 18874368 41943040 4 freebsd-ufs (20G) 60817408 20971520 5 freebsd-ufs (10G) 81788928 629145600 6 freebsd-ufs (300G) 710934528 1242590514 7 freebsd-ufs (592G) 1953525042 63 - free - (31k) Both the slice and the last partition must have at least one free block at the end of the disk. Create file systems on these new partitions. The number of partitions will vary to match the original disk, ada0. &prompt.root; newfs -U /dev/mirror/gm0s1a &prompt.root; newfs -U /dev/mirror/gm0s1d &prompt.root; newfs -U /dev/mirror/gm0s1e &prompt.root; newfs -U /dev/mirror/gm0s1f &prompt.root; newfs -U /dev/mirror/gm0s1g Make the mirror bootable by installing bootcode in the MBR and bsdlabel and setting the active slice: &prompt.root; gpart bootcode -b /boot/mbr mirror/gm0 &prompt.root; gpart set -a active -i 1 mirror/gm0 &prompt.root; gpart bootcode -b /boot/boot mirror/gm0s1 Adjust /etc/fstab to use the new partitions on the mirror. Back up this file first by copying it to /etc/fstab.orig. &prompt.root; cp /etc/fstab /etc/fstab.orig Edit /etc/fstab, replacing /dev/ada0 with mirror/gm0. # Device Mountpoint FStype Options Dump Pass# /dev/mirror/gm0s1a / ufs rw 1 1 /dev/mirror/gm0s1b none swap sw 0 0 /dev/mirror/gm0s1d /var ufs rw 2 2 /dev/mirror/gm0s1e /usr ufs rw 2 2 /dev/mirror/gm0s1f /data1 ufs rw 2 2 /dev/mirror/gm0s1g /data2 ufs rw 2 2 If the geom_mirror.ko kernel module has not been built into the kernel, edit /boot/loader.conf to load it at boot: geom_mirror_load="YES" File systems from the original disk can now be copied onto the mirror with &man.dump.8; and &man.restore.8;. Each file system dumped with dump -L will create a snapshot first, which can take some time. &prompt.root; mount /dev/mirror/gm0s1a /mnt &prompt.root; dump -C16 -b64 -0aL -f - / | (cd /mnt && restore -rf -) &prompt.root; mount /dev/mirror/gm0s1d /mnt/var &prompt.root; mount /dev/mirror/gm0s1e /mnt/usr &prompt.root; mount /dev/mirror/gm0s1f /mnt/data1 &prompt.root; mount /dev/mirror/gm0s1g /mnt/data2 &prompt.root; dump -C16 -b64 -0aL -f - /usr | (cd /mnt/usr && restore -rf -) &prompt.root; dump -C16 -b64 -0aL -f - /var | (cd /mnt/var && restore -rf -) &prompt.root; dump -C16 -b64 -0aL -f - /data1 | (cd /mnt/data1 && restore -rf -) &prompt.root; dump -C16 -b64 -0aL -f - /data2 | (cd /mnt/data2 && restore -rf -) Restart the system, booting from ada1. If everything is working, the system will boot from mirror/gm0, which now contains the same data as ada0 had previously. See if there are problems booting. 
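One way to verify that the system really booted from the mirror is to examine the output of &man.mount.8;. The listing below is only a sketch; the file systems and options shown will match the layout created above and may differ in detail: &prompt.root; mount /dev/mirror/gm0s1a on / (ufs, local) devfs on /dev (devfs, local) /dev/mirror/gm0s1d on /var (ufs, local, soft-updates) /dev/mirror/gm0s1e on /usr (ufs, local, soft-updates) /dev/mirror/gm0s1f on /data1 (ufs, local, soft-updates) /dev/mirror/gm0s1g on /data2 (ufs, local, soft-updates)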
At this point, the mirror still consists of only the single ada1 disk. After booting from mirror/gm0 successfully, the final step is inserting ada0 into the mirror. When ada0 is inserted into the mirror, its former contents will be overwritten by data from the mirror. Make certain that mirror/gm0 has the same contents as ada0 before adding ada0 to the mirror. If the contents previously copied by &man.dump.8; and &man.restore.8; are not identical to what was on ada0, revert /etc/fstab to mount the file systems on ada0, reboot, and start the whole procedure again. &prompt.root; gmirror insert gm0 ada0 GEOM_MIRROR: Device gm0: rebuilding provider ada0 Synchronization between the two disks will start immediately. Use gmirror status to view the progress. &prompt.root; gmirror status Name Status Components mirror/gm0 DEGRADED ada1 (ACTIVE) ada0 (SYNCHRONIZING, 64%) After a while, synchronization will finish. GEOM_MIRROR: Device gm0: rebuilding provider ada0 finished. &prompt.root; gmirror status Name Status Components mirror/gm0 COMPLETE ada1 (ACTIVE) ada0 (ACTIVE) mirror/gm0 now consists of the two disks ada0 and ada1, and the contents are automatically synchronized with each other. In use, mirror/gm0 will behave just like the original single drive. Troubleshooting If the system no longer boots, BIOS settings may have to be changed to boot from one of the new mirrored drives. Either mirror drive can be used for booting, as they contain identical data. If the boot stops with this message, something is wrong with the mirror device: Mounting from ufs:/dev/mirror/gm0s1a failed with error 19. Loader variables: vfs.root.mountfrom=ufs:/dev/mirror/gm0s1a vfs.root.mountfrom.options=rw Manual root filesystem specification: <fstype>:<device> [options] Mount <device> using filesystem <fstype> and with the specified (optional) option list. eg. ufs:/dev/da0s1a zfs:tank cd9660:/dev/acd0 ro (which is equivalent to: mount -t cd9660 -o ro /dev/acd0 /) ? List valid disk boot devices . Yield 1 second (for background tasks) <empty line> Abort manual input mountroot> Forgetting to load the geom_mirror.ko module in /boot/loader.conf can cause this problem. To fix it, boot from a &os; installation media and choose Shell at the first prompt. Then load the mirror module and mount the mirror device: &prompt.root; gmirror load &prompt.root; mount /dev/mirror/gm0s1a /mnt Edit /mnt/boot/loader.conf, adding a line to load the mirror module: geom_mirror_load="YES" Save the file and reboot. Other problems that cause error 19 require more effort to fix. Although the system should boot from ada0, another prompt to select a shell will appear if /etc/fstab is incorrect. Enter ufs:/dev/ada0s1a at the boot loader prompt and press Enter. Undo the edits in /etc/fstab then mount the file systems from the original disk (ada0) instead of the mirror. Reboot the system and try the procedure again. Enter full pathname of shell or RETURN for /bin/sh: &prompt.root; cp /etc/fstab.orig /etc/fstab &prompt.root; reboot Recovering from Disk Failure The benefit of disk mirroring is that an individual disk can fail without causing the mirror to lose any data. In the above example, if ada0 fails, the mirror will continue to work, providing data from the remaining working drive, ada1. To replace the failed drive, shut down the system and physically replace the failed drive with a new drive of equal or greater capacity. 
Manufacturers use somewhat arbitrary values when rating drives in gigabytes, and the only way to really be sure is to compare the total count of sectors shown by diskinfo -v. A drive with larger capacity than the mirror will work, although the extra space on the new drive will not be used. After the computer is powered back up, the mirror will be running in a degraded mode with only one drive. The mirror is told to forget drives that are not currently connected: &prompt.root; gmirror forget gm0 Any old metadata should be cleared from the replacement disk using the instructions in . Then the replacement disk, ada4 for this example, is inserted into the mirror: &prompt.root; gmirror insert gm0 /dev/ada4 Resynchronization begins when the new drive is inserted into the mirror. This process of copying mirror data to a new drive can take a while. Performance of the mirror will be greatly reduced during the copy, so inserting new drives is best done when there is low demand on the computer. Progress can be monitored with gmirror status, which shows drives that are being synchronized and the percentage of completion. During resynchronization, the status will be DEGRADED, changing to COMPLETE when the process is finished. <acronym>RAID</acronym>3 - Byte-level Striping with Dedicated Parity Mark Gladman Written by Daniel Gerzo Tom Rhodes Based on documentation by Murray Stokely GEOM RAID3 RAID3 is a method used to combine several disk drives into a single volume with a dedicated parity disk. In a RAID3 system, data is split up into a number of bytes that are written across all the drives in the array except for one disk which acts as a dedicated parity disk. This means that disk reads from a RAID3 implementation access all disks in the array. Performance can be enhanced by using multiple disk controllers. The RAID3 array provides a fault tolerance of 1 drive, while providing a capacity of 1 - 1/n times the total capacity of all drives in the array, where n is the number of hard drives in the array. Such a configuration is mostly suitable for storing data of larger sizes such as multimedia files. At least 3 physical hard drives are required to build a RAID3 array. Each disk must be of the same size, since I/O requests are interleaved to read or write to multiple disks in parallel. Also, due to the nature of RAID3, the number of drives must be equal to 3, 5, 9, 17, and so on, or 2^n + 1. This section demonstrates how to create a software RAID3 on a &os; system. While it is theoretically possible to boot from a RAID3 array on &os;, that configuration is uncommon and is not advised. Creating a Dedicated <acronym>RAID</acronym>3 Array In &os;, support for RAID3 is implemented by the &man.graid3.8; GEOM class. Creating a dedicated RAID3 array on &os; requires the following steps. First, load the geom_raid3.ko kernel module by issuing one of the following commands: &prompt.root; graid3 load or: &prompt.root; kldload geom_raid3 Ensure that a suitable mount point exists. This command creates a new directory to use as the mount point: &prompt.root; mkdir /multimedia Determine the device names for the disks which will be added to the array, and create the new RAID3 device. The final device listed will act as the dedicated parity disk. This example uses three unpartitioned ATA drives: ada1 and ada2 for data, and ada3 for parity. &prompt.root; graid3 label -v gr0 /dev/ada1 /dev/ada2 /dev/ada3 Metadata value stored on /dev/ada1. Metadata value stored on /dev/ada2. Metadata value stored on /dev/ada3. Done. 
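Before partitioning the new device, its state can be checked with the status command of &man.graid3.8;. This output is a sketch and will vary with the disks in use: &prompt.root; graid3 status Name Status Components raid3/gr0 COMPLETE ada1 ada2 ada3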
Partition the newly created gr0 device and put a UFS file system on it: &prompt.root; gpart create -s GPT /dev/raid3/gr0 &prompt.root; gpart add -t freebsd-ufs /dev/raid3/gr0 &prompt.root; newfs -j /dev/raid3/gr0p1 Many numbers will glide across the screen, and after a bit of time, the process will be complete. The volume has been created and is ready to be mounted: &prompt.root; mount /dev/raid3/gr0p1 /multimedia/ The RAID3 array is now ready to use. Additional configuration is needed to retain this setup across system reboots. The geom_raid3.ko module must be loaded before the array can be mounted. To automatically load the kernel module during system initialization, add the following line to /boot/loader.conf: geom_raid3_load="YES" The following volume information must be added to /etc/fstab in order to automatically mount the array's file system during the system boot process: /dev/raid3/gr0p1 /multimedia ufs rw 2 2 Software <acronym>RAID</acronym> Devices Warren Block Originally contributed by GEOM Software RAID Devices Hardware-assisted RAID Some motherboards and expansion cards add some simple hardware, usually just a ROM, that allows the computer to boot from a RAID array. After booting, access to the RAID array is handled by software running on the computer's main processor. This hardware-assisted software RAID gives RAID arrays that are not dependent on any particular operating system, and which are functional even before an operating system is loaded. Several levels of RAID are supported, depending on the hardware in use. See &man.graid.8; for a complete list. &man.graid.8; requires the geom_raid.ko kernel module, which is included in the GENERIC kernel starting with &os; 9.1. If needed, it can be loaded manually with graid load. Creating an Array Software RAID devices often have a menu that can be entered by pressing special keys when the computer is booting. The menu can be used to create and delete RAID arrays. &man.graid.8; can also create arrays directly from the command line. graid label is used to create a new array. The motherboard used for this example has an Intel software RAID chipset, so the Intel metadata format is specified. The new array is given a label of gm0, it is a mirror (RAID1), and uses drives ada0 and ada1. Some space on the drives will be overwritten when they are made into a new array. Back up existing data first! &prompt.root; graid label Intel gm0 RAID1 ada0 ada1 GEOM_RAID: Intel-a29ea104: Array Intel-a29ea104 created. GEOM_RAID: Intel-a29ea104: Disk ada0 state changed from NONE to ACTIVE. GEOM_RAID: Intel-a29ea104: Subdisk gm0:0-ada0 state changed from NONE to ACTIVE. GEOM_RAID: Intel-a29ea104: Disk ada1 state changed from NONE to ACTIVE. GEOM_RAID: Intel-a29ea104: Subdisk gm0:1-ada1 state changed from NONE to ACTIVE. GEOM_RAID: Intel-a29ea104: Array started. GEOM_RAID: Intel-a29ea104: Volume gm0 state changed from STARTING to OPTIMAL. Intel-a29ea104 created GEOM_RAID: Intel-a29ea104: Provider raid/r0 for volume gm0 created. A status check shows the new mirror is ready for use: &prompt.root; graid status Name Status Components raid/r0 OPTIMAL ada0 (ACTIVE (ACTIVE)) ada1 (ACTIVE (ACTIVE)) The array device appears in /dev/raid/. The first array is called r0. Additional arrays, if present, will be r1, r2, and so on. The BIOS menu on some of these devices can create arrays with special characters in their names. To avoid problems with those special characters, arrays are given simple numbered names like r0. 
To show the actual labels, like gm0 in the example above, use &man.sysctl.8;: &prompt.root; sysctl kern.geom.raid.name_format=1 Multiple Volumes Some software RAID devices support more than one volume on an array. Volumes work like partitions, allowing space on the physical drives to be split and used in different ways. For example, Intel software RAID devices support two volumes. This example creates a 40 G mirror for safely storing the operating system, followed by a 20 G RAID0 (stripe) volume for fast temporary storage: &prompt.root; graid label -S 40G Intel gm0 RAID1 ada0 ada1 &prompt.root; graid add -S 20G gm0 RAID0 Volumes appear as additional rX entries in /dev/raid/. An array with two volumes will show r0 and r1. See &man.graid.8; for the number of volumes supported by different software RAID devices. Converting a Single Drive to a Mirror Under certain conditions, it is possible to convert an existing single drive to a &man.graid.8; array without reformatting. To avoid data loss during the conversion, the existing drive must meet these minimum requirements: The drive must be partitioned with the MBR partitioning scheme. GPT or other partitioning schemes with metadata at the end of the drive will be overwritten and corrupted by the &man.graid.8; metadata. There must be enough unpartitioned and unused space at the end of the drive to hold the &man.graid.8; metadata. This metadata varies in size, but the largest occupies 64 M, so at least that much free space is recommended. If the drive meets these requirements, start by making a full backup. Then create a single-drive mirror with that drive: &prompt.root; graid label Intel gm0 RAID1 ada0 NONE &man.graid.8; metadata was written to the end of the drive in the unused space. A second drive can now be inserted into the mirror: &prompt.root; graid insert raid/r0 ada1 Data from the original drive will immediately begin to be copied to the second drive. The mirror will operate in degraded status until the copy is complete. Inserting New Drives into the Array Drives can be inserted into an array as replacements for drives that have failed or are missing. If there are no failed or missing drives, the new drive becomes a spare. For example, inserting a new drive into a working two-drive mirror results in a two-drive mirror with one spare drive, not a three-drive mirror. In the example mirror array, data immediately begins to be copied to the newly-inserted drive. Any existing information on the new drive will be overwritten. &prompt.root; graid insert raid/r0 ada1 GEOM_RAID: Intel-a29ea104: Disk ada1 state changed from NONE to ACTIVE. GEOM_RAID: Intel-a29ea104: Subdisk gm0:1-ada1 state changed from NONE to NEW. GEOM_RAID: Intel-a29ea104: Subdisk gm0:1-ada1 state changed from NEW to REBUILD. GEOM_RAID: Intel-a29ea104: Subdisk gm0:1-ada1 rebuild start at 0. Removing Drives from the Array Individual drives can be permanently removed from an array and their metadata erased: &prompt.root; graid remove raid/r0 ada1 GEOM_RAID: Intel-a29ea104: Disk ada1 state changed from ACTIVE to OFFLINE. GEOM_RAID: Intel-a29ea104: Subdisk gm0:1-[unknown] state changed from ACTIVE to NONE. GEOM_RAID: Intel-a29ea104: Volume gm0 state changed from OPTIMAL to DEGRADED. Stopping the Array An array can be stopped without removing metadata from the drives. The array will be restarted when the system is booted. &prompt.root; graid stop raid/r0 Checking Array Status Array status can be checked at any time.
After a drive was added to the mirror in the example above, data is being copied from the original drive to the new drive: &prompt.root; graid status Name Status Components raid/r0 DEGRADED ada0 (ACTIVE (ACTIVE)) ada1 (ACTIVE (REBUILD 28%)) Some types of arrays, like RAID0 or CONCAT, may not be shown in the status report if disks have failed. To see these partially-failed arrays, add -ga: &prompt.root; graid status -ga Name Status Components Intel-e2d07d9a BROKEN ada6 (ACTIVE (ACTIVE)) Deleting Arrays Arrays are destroyed by deleting all of the volumes from them. When the last volume present is deleted, the array is stopped and metadata is removed from the drives: &prompt.root; graid delete raid/r0 Deleting Unexpected Arrays Drives may unexpectedly contain &man.graid.8; metadata, either from previous use or manufacturer testing. &man.graid.8; will detect these drives and create an array, interfering with access to the individual drive. To remove the unwanted metadata: Boot the system. At the boot menu, select 2 for the loader prompt. Enter: OK set kern.geom.raid.enable=0 OK boot The system will boot with &man.graid.8; disabled. Back up all data on the affected drive. As a workaround, &man.graid.8; array detection can be disabled by adding kern.geom.raid.enable=0 to /boot/loader.conf. To permanently remove the &man.graid.8; metadata from the affected drive, boot a &os; installation CD-ROM or memory stick, and select Shell. Use graid status to find the name of the array, typically raid/r0: &prompt.root; graid status Name Status Components raid/r0 OPTIMAL ada0 (ACTIVE (ACTIVE)) ada1 (ACTIVE (ACTIVE)) Delete the volume by name: &prompt.root; graid delete raid/r0 If there is more than one volume shown, repeat the process for each volume. After the last volume has been deleted, the array will be destroyed. Reboot and verify data, restoring from backup if necessary. After the metadata has been removed, the kern.geom.raid.enable=0 entry in /boot/loader.conf can also be removed. <acronym>GEOM</acronym> Gate Network GEOM provides a simple mechanism for remote access to devices such as disks, CDs, and file systems through the use of the GEOM Gate network daemon, ggated. The system with the device runs the server daemon which handles requests made by clients using ggatec. The devices should not contain any sensitive data as the connection between the client and the server is not encrypted. Similar to NFS, which is discussed in , ggated is configured using an exports file. This file specifies which systems are permitted to access the exported resources and what level of access they are offered. For example, to give the client 192.168.1.5 read and write access to the fourth slice on the first SCSI disk, create /etc/gg.exports with this line: 192.168.1.5 RW /dev/da0s4d Before exporting the device, ensure it is not currently mounted. Then, start ggated: &prompt.root; ggated Several options are available for specifying an alternate listening port or changing the default location of the exports file. Refer to &man.ggated.8; for details. To access the exported device on the client machine, first use ggatec to specify the IP address of the server and the device name of the exported device. If successful, this command will display a ggate device name to mount. Mount that specified device name on a free mount point.
This example connects to the /dev/da0s4d partition on 192.168.1.1, then mounts /dev/ggate0 on /mnt: &prompt.root; ggatec create -o rw 192.168.1.1 /dev/da0s4d ggate0 &prompt.root; mount /dev/ggate0 /mnt The device on the server may now be accessed through /mnt on the client. For more details about ggatec and a few usage examples, refer to &man.ggatec.8;. The mount will fail if the device is currently mounted on either the server or any other client on the network. If simultaneous access is needed to network resources, use NFS instead. When the device is no longer needed, unmount it with umount so that the resource is available to other clients. Labeling Disk Devices GEOM Disk Labels During system initialization, the &os; kernel creates device nodes as devices are found. This method of probing for devices raises some issues. For instance, what if a new disk device is added via USB? It is likely that a flash device may be handed the device name of da0 and the original da0 shifted to da1. This will cause issues mounting file systems if they are listed in /etc/fstab, which may also prevent the system from booting. One solution is to chain SCSI devices in order so a new device added to the SCSI card will be issued unused device numbers. But what about USB devices which may replace the primary SCSI disk? This happens because USB devices are usually probed before the SCSI card. One solution is to only insert these devices after the system has been booted. Another method is to use only a single ATA drive and never list the SCSI devices in /etc/fstab. A better solution is to use glabel to label the disk devices and use the labels in - /etc/fstab. Because - glabel stores the label in the last sector of - a given provider, the label will remain persistent across - reboots. By using this label as a device, the file system may - always be mounted regardless of what device node it is accessed - through. + /etc/fstab. + Since glabel stores the label in the last + sector of a given provider, the label will remain persistent + across reboots. By using this label as a device, the + file system may always be mounted regardless of what + device node it is accessed through. glabel can create both transient and permanent labels. Only permanent labels are consistent across reboots. Refer to &man.glabel.8; for more information on the differences between labels. Label Types and Examples Permanent labels can be a generic or a file system label. Permanent file system labels can be created with &man.tunefs.8; or &man.newfs.8;. These types of labels are created in a sub-directory of /dev, and will be named according to the file system type. For example, UFS2 file system labels will be created in /dev/ufs. Generic permanent labels can be created with glabel label. These are not file system specific and will be created in /dev/label. Temporary labels are destroyed at the next reboot. These labels are created in /dev/label and are suited to experimentation. A temporary label can be created using glabel create. To create a permanent label for a UFS2 file system without destroying any data, issue the following command: &prompt.root; tunefs -L home /dev/da3 A label should now exist in /dev/ufs which may be added to /etc/fstab: /dev/ufs/home /home ufs rw 2 2 The file system must not be mounted while attempting to run tunefs.
Now the file system may be mounted: &prompt.root; mount /home From this point on, so long as the geom_label.ko kernel module is loaded at boot with /boot/loader.conf or the GEOM_LABEL kernel option is present, the device node may change without any ill effect on the system. File systems may also be created with a default label by using the -L flag with newfs. Refer to &man.newfs.8; for more information. The following command can be used to destroy the label: &prompt.root; glabel destroy home The following example shows how to label the partitions of a boot disk. Labeling Partitions on the Boot Disk By permanently labeling the partitions on the boot disk, the system should be able to continue to boot normally, even if the disk is moved to another controller or transferred to a different system. For this example, it is assumed that a single ATA disk is used, which is currently recognized by the system as ad0. It is also assumed that the standard &os; partition scheme is used, with /, /var, /usr and /tmp, as well as a swap partition. Reboot the system, and at the &man.loader.8; prompt, press 4 to boot into single user mode. Then enter the following commands: &prompt.root; glabel label rootfs /dev/ad0s1a GEOM_LABEL: Label for provider /dev/ad0s1a is label/rootfs &prompt.root; glabel label var /dev/ad0s1d GEOM_LABEL: Label for provider /dev/ad0s1d is label/var &prompt.root; glabel label usr /dev/ad0s1f GEOM_LABEL: Label for provider /dev/ad0s1f is label/usr &prompt.root; glabel label tmp /dev/ad0s1e GEOM_LABEL: Label for provider /dev/ad0s1e is label/tmp &prompt.root; glabel label swap /dev/ad0s1b GEOM_LABEL: Label for provider /dev/ad0s1b is label/swap &prompt.root; exit The system will continue with multi-user boot. After the boot completes, edit /etc/fstab and replace the conventional device names with their respective labels. The final /etc/fstab will look like this: # Device Mountpoint FStype Options Dump Pass# /dev/label/swap none swap sw 0 0 /dev/label/rootfs / ufs rw 1 1 /dev/label/tmp /tmp ufs rw 2 2 /dev/label/usr /usr ufs rw 2 2 /dev/label/var /var ufs rw 2 2 The system can now be rebooted. If everything went well, it will come up normally and mount will show: &prompt.root; mount /dev/label/rootfs on / (ufs, local) devfs on /dev (devfs, local) /dev/label/tmp on /tmp (ufs, local, soft-updates) /dev/label/usr on /usr (ufs, local, soft-updates) /dev/label/var on /var (ufs, local, soft-updates) The &man.glabel.8; class supports a label type for UFS file systems, based on the unique file system id, ufsid. These labels may be found in /dev/ufsid and are created automatically during system startup. It is possible to use ufsid labels to mount partitions using /etc/fstab. Use glabel status to receive a list of file systems and their corresponding ufsid labels: &prompt.user; glabel status Name Status Components ufsid/486b6fc38d330916 N/A ad4s1d ufsid/486b6fc16926168e N/A ad4s1f In the above example, ad4s1d represents /var, while ad4s1f represents /usr. Using the ufsid values shown, these partitions may now be mounted with the following entries in /etc/fstab: /dev/ufsid/486b6fc38d330916 /var ufs rw 2 2 /dev/ufsid/486b6fc16926168e /usr ufs rw 2 2 Any partitions with ufsid labels can be mounted in this way, eliminating the need to manually create permanent labels, while still enjoying the benefits of device name independent mounting. UFS Journaling Through <acronym>GEOM</acronym> GEOM Journaling Support for journals on UFS file systems is available on &os;.
The implementation is provided through the GEOM subsystem and is configured using gjournal. Unlike other file system journaling implementations, the gjournal method is block based and not implemented as part of the file system. It is a GEOM extension. Journaling stores a log of file system transactions, such as changes that make up a complete disk write operation, before meta-data and file writes are committed to the disk. This transaction log can later be replayed to redo file system transactions, preventing file system inconsistencies. This method provides another mechanism to protect against data loss and inconsistencies of the file system. Unlike Soft Updates, which tracks and enforces meta-data updates, and snapshots, which create an image of the file system, a log is stored in disk space specifically for this task. For better performance, the journal may be stored on another disk. In this configuration, the journal provider or storage device should be listed after the device to enable journaling on. The GENERIC kernel provides support for gjournal. To automatically load the geom_journal.ko kernel module at boot time, add the following line to /boot/loader.conf: geom_journal_load="YES" If a custom kernel is used, ensure the following line is in the kernel configuration file: options GEOM_JOURNAL Once the module is loaded, a journal can be created on a new file system using the following steps. In this example, da4 is a new SCSI disk: &prompt.root; gjournal load &prompt.root; gjournal label /dev/da4 This will load the module and create a /dev/da4.journal device node on /dev/da4. A UFS file system may now be created on the journaled device, then mounted on an existing mount point: &prompt.root; newfs -O 2 -J /dev/da4.journal &prompt.root; mount /dev/da4.journal /mnt In the case of several slices, a journal will be created for each individual slice. For instance, if ad4s1 and ad4s2 are both slices, then gjournal will create ad4s1.journal and ad4s2.journal. Journaling may also be enabled on current file systems by using tunefs. However, always make a backup before attempting to alter an existing file system. In most cases, gjournal will fail if it is unable to create the journal, but this does not protect against data loss incurred as a result of misusing tunefs. Refer to &man.gjournal.8; and &man.tunefs.8; for more information about these commands. It is possible to journal the boot disk of a &os; system. Refer to the article Implementing UFS Journaling on a Desktop PC for detailed instructions. diff --git a/en_US.ISO8859-1/books/handbook/multimedia/chapter.xml b/en_US.ISO8859-1/books/handbook/multimedia/chapter.xml index 53dc3c06b2..e1ae696ce9 100644 --- a/en_US.ISO8859-1/books/handbook/multimedia/chapter.xml +++ b/en_US.ISO8859-1/books/handbook/multimedia/chapter.xml @@ -1,1696 +1,1696 @@ Multimedia Ross Lippert Edited by Synopsis &os; supports a wide variety of sound cards, allowing users to enjoy high fidelity output from a &os; system. This includes the ability to record and play back audio in the MPEG Audio Layer 3 (MP3), Waveform Audio File (WAV), Ogg Vorbis, and other formats. The &os; Ports Collection contains many applications for editing recorded audio, adding sound effects, and controlling attached MIDI devices. &os; also supports the playback of video files and DVDs. The &os; Ports Collection contains applications to encode, convert, and playback various video media. This chapter describes how to configure sound cards, video playback, TV tuner cards, and scanners on &os;. 
It also describes some of the applications which are available for using these devices. After reading this chapter, you will know how to: Configure a sound card on &os;. Troubleshoot the sound setup. Play back and encode MP3s and other audio. Prepare a &os; system for video playback. Play DVDs, .mpg, and .avi files. Rip CD and DVD content into files. Configure a TV card. Install and set up MythTV on &os;. Configure an image scanner. Configure a Bluetooth headset. Before reading this chapter, you should: Know how to install applications as described in . Setting Up the Sound Card Moses Moore Contributed by Marc Fonvieille Enhanced by PCI sound cards Before beginning the configuration, determine the model of the sound card and the chip it uses. &os; supports a wide variety of sound cards. Check the supported audio devices list of the Hardware Notes to see if the card is supported and which &os; driver it uses. kernel configuration In order to use the sound device, its device driver must be loaded. The easiest way is to load a kernel module for the sound card with &man.kldload.8;. This example loads the driver for a built-in audio chipset based on the Intel specification: &prompt.root; kldload snd_hda To automate the loading of this driver at boot time, add the driver to /boot/loader.conf. The line for this driver is: snd_hda_load="YES" Other available sound modules are listed in /boot/defaults/loader.conf. When unsure which driver to use, load the snd_driver module: &prompt.root; kldload snd_driver This is a metadriver which loads all of the most common sound drivers and can be used to speed up the search for the correct driver. It is also possible to load all sound drivers by adding the metadriver to /boot/loader.conf. To determine which driver was selected for the sound card after loading the snd_driver metadriver, type cat /dev/sndstat. Configuring a Custom Kernel with Sound Support This section is for users who prefer to statically compile in support for the sound card in a custom kernel. For more information about recompiling a kernel, refer to . When using a custom kernel to provide sound support, make sure that the audio framework driver exists in the custom kernel configuration file: device sound Next, add support for the sound card. To continue the example of the built-in audio chipset based on the Intel specification from the previous section, use the following line in the custom kernel configuration file: device snd_hda Be sure to read the manual page of the driver for the device name to use for the driver. Non-PnP ISA sound cards may require the IRQ and I/O port settings of the card to be added to /boot/device.hints. During the boot process, &man.loader.8; reads this file and passes the settings to the kernel. For example, an old Creative &soundblaster; 16 ISA non-PnP card will use the &man.snd.sbc.4; driver in conjunction with snd_sb16. For this card, the following lines must be added to the kernel configuration file: device snd_sbc device snd_sb16 If the card uses the 0x220 I/O port and IRQ 5, these lines must also be added to /boot/device.hints: hint.sbc.0.at="isa" hint.sbc.0.port="0x220" hint.sbc.0.irq="5" hint.sbc.0.drq="1" hint.sbc.0.flags="0x15" The syntax used in /boot/device.hints is described in &man.sound.4; and the manual page for the driver of the sound card. The settings shown above are the defaults. In some cases, the IRQ or other settings may need to be changed to match the card. Refer to &man.snd.sbc.4; for more information about this card.
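Before moving on to testing, it can be useful to confirm that a sound driver is actually present. When the driver was loaded as a module, &man.kldstat.8; will list it; this output is only a sketch and the identifiers and sizes will differ: &prompt.root; kldstat | grep snd 8 1 0xffffffff82a31000 2ae38 snd_hda.ko Statically compiled drivers will not appear in this list, but are reported by the kernel at boot, as shown in the next section.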
Testing Sound After loading the required module or rebooting into the custom kernel, the sound card should be detected. To confirm, run dmesg | grep pcm. This example is from a system with a built-in Conexant CX20590 chipset: pcm0: <NVIDIA (0x001c) (HDMI/DP 8ch)> at nid 5 on hdaa0 pcm1: <NVIDIA (0x001c) (HDMI/DP 8ch)> at nid 6 on hdaa0 pcm2: <Conexant CX20590 (Analog 2.0+HP/2.0)> at nid 31,25 and 35,27 on hdaa1 The status of the sound card may also be checked using this command: &prompt.root; cat /dev/sndstat FreeBSD Audio Driver (newpcm: 64bit 2009061500/amd64) Installed devices: pcm0: <NVIDIA (0x001c) (HDMI/DP 8ch)> (play) pcm1: <NVIDIA (0x001c) (HDMI/DP 8ch)> (play) pcm2: <Conexant CX20590 (Analog 2.0+HP/2.0)> (play/rec) default The output will vary depending upon the sound card. If no pcm devices are listed, double-check that the correct device driver was loaded or compiled into the kernel. The next section lists some common problems and their solutions. If all goes well, the sound card should now work in &os;. If the CD or DVD drive is properly connected to the sound card, one can insert an audio CD in the drive and play it with &man.cdcontrol.1;: &prompt.user; cdcontrol -f /dev/acd0 play 1 Audio CDs have specialized encodings which means that they should not be mounted using &man.mount.8;. Various applications, such as audio/workman, provide a friendlier interface. The audio/mpg123 port can be installed to listen to MP3 audio files. Another quick way to test the card is to send data to /dev/dsp: &prompt.user; cat filename > /dev/dsp where filename can be any type of file. This command should produce some noise, confirming that the sound card is working. The /dev/dsp* device nodes will be created automatically as needed. When not in use, they do not exist and will not appear in the output of &man.ls.1;. Setting up Bluetooth Sound Devices Bluetooth audio Connecting to a Bluetooth device is out of scope for this chapter. Refer to for more information. To get Bluetooth sound sink working with FreeBSD's sound system, users have to install audio/virtual_oss first: &prompt.root; pkg install virtual_oss audio/virtual_oss requires cuse to be loaded into the kernel: &prompt.root; kldload cuse To load cuse during system startup, run this command: &prompt.root; sysrc -f /boot/loader.conf cuse_load=yes To use headphones as a sound sink with audio/virtual_oss, users need to create a virtual device after connecting to a Bluetooth audio device: &prompt.root; virtual_oss -C 2 -c 2 -r 48000 -b 16 -s 768 -R /dev/null -P /dev/bluetooth/headphones -d dsp headphones in this example is a hostname from /etc/bluetooth/hosts. BT_ADDR could be used instead. Refer to &man.virtual_oss.8; for more information. Troubleshooting Sound device nodes I/O port IRQ DSP lists some common error messages and their solutions: Common Error Messages Error Solution sb_dspwr(XX) timed out The I/O port is not set correctly. bad irq XX The IRQ is set incorrectly. Make sure that the set IRQ and the sound IRQ are the same. xxx: gus pcm not attached, out of memory There is not enough available memory to use the device. xxx: can't open /dev/dsp! Type fstat | grep dsp to check if another application is holding the device open. Noteworthy troublemakers are esound and KDE's sound support.
Modern graphics cards often come with their own sound driver for use with HDMI. This sound device is sometimes enumerated before the sound card meaning that the sound card will not be used as the default playback device. To check if this is the case, run dmesg and look for pcm. The output looks something like this: ... hdac0: HDA Driver Revision: 20100226_0142 hdac1: HDA Driver Revision: 20100226_0142 hdac0: HDA Codec #0: NVidia (Unknown) hdac0: HDA Codec #1: NVidia (Unknown) hdac0: HDA Codec #2: NVidia (Unknown) hdac0: HDA Codec #3: NVidia (Unknown) pcm0: <HDA NVidia (Unknown) PCM #0 DisplayPort> at cad 0 nid 1 on hdac0 pcm1: <HDA NVidia (Unknown) PCM #0 DisplayPort> at cad 1 nid 1 on hdac0 pcm2: <HDA NVidia (Unknown) PCM #0 DisplayPort> at cad 2 nid 1 on hdac0 pcm3: <HDA NVidia (Unknown) PCM #0 DisplayPort> at cad 3 nid 1 on hdac0 hdac1: HDA Codec #2: Realtek ALC889 pcm4: <HDA Realtek ALC889 PCM #0 Analog> at cad 2 nid 1 on hdac1 pcm5: <HDA Realtek ALC889 PCM #1 Analog> at cad 2 nid 1 on hdac1 pcm6: <HDA Realtek ALC889 PCM #2 Digital> at cad 2 nid 1 on hdac1 pcm7: <HDA Realtek ALC889 PCM #3 Digital> at cad 2 nid 1 on hdac1 ... In this example, the graphics card (NVidia) has been enumerated before the sound card (Realtek ALC889). To use the sound card as the default playback device, change hw.snd.default_unit to the unit that should be used for playback: &prompt.root; sysctl hw.snd.default_unit=n where n is the number of the sound device to use. In this example, it should be 4. Make this change permanent by adding the following line to /etc/sysctl.conf: hw.snd.default_unit=4
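Whether the value is changed at runtime or set in /etc/sysctl.conf, the active default unit can be confirmed at any time: &prompt.root; sysctl hw.snd.default_unit hw.snd.default_unit: 4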
Utilizing Multiple Sound Sources Munish Chopra Contributed by It is often desirable to have multiple sources of sound that are able to play simultaneously. &os; uses Virtual Sound Channels to multiplex the sound card's playback by mixing sound in the kernel. Three &man.sysctl.8; knobs are available for configuring virtual channels: &prompt.root; sysctl dev.pcm.0.play.vchans=4 &prompt.root; sysctl dev.pcm.0.rec.vchans=4 &prompt.root; sysctl hw.snd.maxautovchans=4 This example allocates four virtual channels, which is a practical number for everyday use. Both dev.pcm.0.play.vchans=4 and dev.pcm.0.rec.vchans=4 are configurable after a device has been attached and represent the number of virtual channels pcm0 has for playback and recording. Since the pcm module can be loaded independently of the hardware drivers, hw.snd.maxautovchans indicates how many virtual channels will be given to an audio device when it is attached. Refer to &man.pcm.4; for more information. The number of virtual channels for a device cannot be changed while it is in use. First, close any programs using the device, such as music players or sound daemons. The correct pcm device will automatically be allocated transparently to a program that requests /dev/dsp0. Setting Default Values for Mixer Channels Josef El-Rayes Contributed by The default values for the different mixer channels are hardcoded in the source code of the &man.pcm.4; driver. While sound card mixer levels can be changed using &man.mixer.8; or third-party applications and daemons, this is not a permanent solution. To instead set default mixer values at the driver level, define the appropriate values in /boot/device.hints, as seen in this example: hint.pcm.0.vol="50" This will set the volume channel to a default value of 50 when the &man.pcm.4; module is loaded.
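For one-off adjustments on a running system, &man.mixer.8; can still be used on top of these defaults. As a brief illustration, assuming a device whose volume channel currently sits at the default of 50 (the exact channels and levels will vary): &prompt.root; mixer vol 85 Setting the mixer vol from 50:50 to 85:85.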
MP3 Audio Chern Lee Contributed by This section describes some MP3 players available for &os;, how to rip audio CD tracks, and how to encode and decode MP3s. MP3 Players A popular graphical MP3 player is Audacious. It supports Winamp skins and additional plugins. The interface is intuitive, with a playlist, graphic equalizer, and more. Those familiar with Winamp will find Audacious simple to use. On &os;, Audacious can be installed from the multimedia/audacious port or package. Audacious is a descendant of XMMS. The audio/mpg123 package or port provides an alternative, command-line MP3 player. Once installed, specify the MP3 file to play on the command line. If the system has multiple audio devices, the sound device can also be specified: &prompt.root; mpg123 -a /dev/dsp1.0 Foobar-GreatestHits.mp3 High Performance MPEG 1.0/2.0/2.5 Audio Player for Layers 1, 2 and 3 version 1.18.1; written and copyright by Michael Hipp and others free software (LGPL) without any warranty but with best wishes Playing MPEG stream from Foobar-GreatestHits.mp3 ... MPEG 1.0 layer III, 128 kbit/s, 44100 Hz joint-stereo Additional MP3 players are available in the &os; Ports Collection. Ripping <acronym>CD</acronym> Audio Tracks Before encoding a CD or CD track to MP3, the audio data on the CD must be ripped to the hard drive. This is done by copying the raw CD Digital Audio (CDDA) data to WAV files. The cdda2wav tool, which is installed with the sysutils/cdrtools suite, can be used to rip audio information from CDs. With the audio CD in the drive, the following command can be issued as root to rip an entire CD into individual, per track, WAV files: &prompt.root; cdda2wav -D 0,1,0 -B In this example, -D 0,1,0 indicates the SCSI device 0,1,0 containing the CD to rip. Use cdrecord -scanbus to determine the correct device parameters for the system. To rip individual tracks, use -t to specify the track: &prompt.root; cdda2wav -D 0,1,0 -t 7 To rip a range of tracks, such as track one to seven, specify a range: &prompt.root; cdda2wav -D 0,1,0 -t 1+7 To rip from an ATAPI (IDE) CDROM drive, specify the device name in place of the SCSI unit numbers. For example, to rip track 7 from an IDE drive: &prompt.root; cdda2wav -D /dev/acd0 -t 7 Alternately, dd can be used to extract audio tracks on ATAPI drives, as described in . Encoding and Decoding MP3s Lame is a popular MP3 encoder which can be installed from the audio/lame port. Due to patent issues, a package is not available. The following command will convert the ripped WAV file audio01.wav to audio01.mp3: &prompt.root; lame -h -b 128 --tt "Foo Song Title" --ta "FooBar Artist" --tl "FooBar Album" \ --ty "2014" --tc "Ripped and encoded by Foo" --tg "Genre" audio01.wav audio01.mp3 The specified 128 kbits is a standard MP3 bitrate, while the 160 and 192 bitrates provide higher quality. The higher the bitrate, the larger the size of the resulting MP3. The -h turns on the higher quality but a little slower mode. The options beginning with --t indicate ID3 tags, which usually contain song information, to be embedded within the MP3 file. Additional encoding options can be found in the lame manual page. In order to burn an audio CD from MP3s, they must first be converted to a non-compressed file format. XMMS can be used to convert to the WAV format, while mpg123 can be used to convert to the raw Pulse-Code Modulation (PCM) audio data format.
To convert audio01.mp3 using mpg123, specify the name of the PCM file: &prompt.root; mpg123 -s audio01.mp3 > audio01.pcm To use XMMS to convert an MP3 to WAV format, use these steps: Converting to <acronym>WAV</acronym> Format in <application>XMMS</application> Launch XMMS. Right-click the window to bring up the XMMS menu. Select Preferences under Options. Change the Output Plugin to Disk Writer Plugin. Press Configure. Enter or browse to a directory to write the uncompressed files to. Load the MP3 file into XMMS as usual, with volume at 100% and EQ settings turned off. Press Play. XMMS will appear as if it is playing the MP3, but no music will be heard. It is actually playing the MP3 to a file. When finished, be sure to set the default Output Plugin back to what it was before in order to listen to MP3s again. Both the WAV and PCM formats can be used with cdrecord. When using WAV files, there will be a small tick sound at the beginning of each track. This sound is the header of the WAV file. The audio/sox port or package can be used to remove the header: &prompt.user; sox -t wav -r 44100 -s -w -c 2 track.wav track.raw Refer to for more information on using a CD burner in &os;. Video Playback Ross Lippert Contributed by Before configuring video playback, determine the model and chipset of the video card. While &xorg; supports a wide variety of video cards, not all provide good playback performance. To obtain a list of extensions supported by the &xorg; server using the card, run xdpyinfo while &xorg; is running. It is a good idea to have a short MPEG test file for evaluating various players and options. Since some DVD applications look for DVD media in /dev/dvd by default, or have this device name hardcoded in them, it might be useful to make a symbolic link to the proper device: &prompt.root; ln -sf /dev/cd0 /dev/dvd Due to the nature of &man.devfs.5;, manually created links will not persist after a system reboot. In order to recreate the symbolic link automatically when the system boots, add the following line to /etc/devfs.conf: link cd0 dvd DVD decryption invokes certain functions that require write permission to the DVD device. To enhance the shared memory &xorg; interface, it is recommended to increase the values of these &man.sysctl.8; variables: kern.ipc.shmmax=67108864 kern.ipc.shmall=32768 Determining Video Capabilities XVideo SDL DGA There are several possible ways to display video under &xorg; and what works is largely hardware dependent. Each method described below will have varying quality across different hardware. Common video interfaces include: &xorg;: normal output using shared memory. XVideo: an extension to the &xorg; interface which allows video to be directly displayed in drawable objects through a special acceleration. This extension provides good quality playback even on low-end machines. The next section describes how to determine if this extension is running. SDL: the Simple Directmedia Layer is a porting layer for many operating systems, allowing cross-platform applications to be developed which make efficient use of sound and graphics. SDL provides a low-level abstraction to the hardware which can sometimes be more efficient than the &xorg; interface. On &os;, SDL can be installed using the devel/sdl20 package or port. DGA: the Direct Graphics Access is an &xorg; extension which allows a program to bypass the &xorg; server and directly - alter the framebuffer. Because it relies on a low level
As it relies on a low-level memory mapping, programs using it must be run as root. The DGA extension can be tested and benchmarked using &man.dga.1;. When dga is running, it changes the colors of the display whenever a key is pressed. To quit, press q. SVGAlib: a low level console graphics layer. XVideo To check whether this extension is running, use xvinfo: &prompt.user; xvinfo XVideo is supported for the card if the result is similar to: X-Video Extension version 2.2 screen #0 Adaptor #0: "Savage Streams Engine" number of ports: 1 port base: 43 operations supported: PutImage supported visuals: depth 16, visualID 0x22 depth 16, visualID 0x23 number of attributes: 5 "XV_COLORKEY" (range 0 to 16777215) client settable attribute client gettable attribute (current value is 2110) "XV_BRIGHTNESS" (range -128 to 127) client settable attribute client gettable attribute (current value is 0) "XV_CONTRAST" (range 0 to 255) client settable attribute client gettable attribute (current value is 128) "XV_SATURATION" (range 0 to 255) client settable attribute client gettable attribute (current value is 128) "XV_HUE" (range -180 to 180) client settable attribute client gettable attribute (current value is 0) maximum XvImage size: 1024 x 1024 Number of image formats: 7 id: 0x32595559 (YUY2) guid: 59555932-0000-0010-8000-00aa00389b71 bits per pixel: 16 number of planes: 1 type: YUV (packed) id: 0x32315659 (YV12) guid: 59563132-0000-0010-8000-00aa00389b71 bits per pixel: 12 number of planes: 3 type: YUV (planar) id: 0x30323449 (I420) guid: 49343230-0000-0010-8000-00aa00389b71 bits per pixel: 12 number of planes: 3 type: YUV (planar) id: 0x36315652 (RV16) guid: 52563135-0000-0000-0000-000000000000 bits per pixel: 16 number of planes: 1 type: RGB (packed) depth: 0 red, green, blue masks: 0x1f, 0x3e0, 0x7c00 id: 0x35315652 (RV15) guid: 52563136-0000-0000-0000-000000000000 bits per pixel: 16 number of planes: 1 type: RGB (packed) depth: 0 red, green, blue masks: 0x1f, 0x7e0, 0xf800 id: 0x31313259 (Y211) guid: 59323131-0000-0010-8000-00aa00389b71 bits per pixel: 6 number of planes: 3 type: YUV (packed) id: 0x0 guid: 00000000-0000-0000-0000-000000000000 bits per pixel: 0 number of planes: 0 type: RGB (packed) depth: 1 red, green, blue masks: 0x0, 0x0, 0x0 The formats listed, such as YUV2 and YUV12, are not present with every implementation of XVideo and their absence may hinder some players. If the result instead looks like: X-Video Extension version 2.2 screen #0 no adaptors present XVideo is probably not supported for the card. This means that it will be more difficult for the display to meet the computational demands of rendering video, depending on the video card and processor. Ports and Packages Dealing with Video video ports video packages This section introduces some of the software available from the &os; Ports Collection which can be used for video playback. <application>MPlayer</application> and <application>MEncoder</application> MPlayer is a command-line video player with an optional graphical interface which aims to provide speed and flexibility. Other graphical front-ends to MPlayer are available from the &os; Ports Collection. MPlayer MPlayer can be installed using the multimedia/mplayer package or port. Several compile options are available and a variety of hardware checks occur during the build process. For these reasons, some users prefer to build the port rather than install the package. 
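For example, either of the following is a routine way to install MPlayer, using the multimedia/mplayer origin mentioned above. The package installs prebuilt defaults, while building the port presents the options menu: &prompt.root; pkg install mplayer Alternatively, to build from the Ports Collection: &prompt.root; cd /usr/ports/multimedia/mplayer &prompt.root; make install clean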
When compiling the port, the menu options should be reviewed to determine the type of support to compile into the port. If an option is not selected, MPlayer will not be able to display that type of video format. Use the arrow keys and spacebar to select the required formats. When finished, press Enter to continue the port compile and installation. By default, the package or port will build the mplayer command line utility and the gmplayer graphical utility. To encode videos, compile the multimedia/mencoder port. Due to licensing restrictions, a package is not available for MEncoder. The first time MPlayer is run, it will create ~/.mplayer in the user's home directory. This subdirectory contains default versions of the user-specific configuration files. This section describes only a few common uses. Refer to mplayer(1) for a complete description of its numerous options. To play the file testfile.avi, specify the video interface with -vo, as seen in the following examples: &prompt.user; mplayer -vo xv testfile.avi &prompt.user; mplayer -vo sdl testfile.avi &prompt.user; mplayer -vo x11 testfile.avi &prompt.root; mplayer -vo dga testfile.avi &prompt.root; mplayer -vo 'sdl:dga' testfile.avi It is worth trying all of these options, as their relative performance depends on many factors and will vary significantly with hardware. To play a DVD, replace testfile.avi with dvd://N -dvd-device DEVICE, where N is the title number to play and DEVICE is the device node for the DVD. For example, to play title 3 from /dev/dvd: &prompt.root; mplayer -vo xv dvd://3 -dvd-device /dev/dvd The default DVD device can be defined during the build of the MPlayer port by including the WITH_DVD_DEVICE=/path/to/desired/device option. By default, the device is /dev/cd0. More details can be found in the port's Makefile.options. To stop, pause, advance, and so on, use a keybinding. To see the list of keybindings, run mplayer -h or read mplayer(1). Additional playback options include -fs, which engages fullscreen mode, and -zoom, which helps performance. Each user can add commonly used options to their ~/.mplayer/config like so: vo=xv fs=yes zoom=yes mplayer can be used to rip a DVD title to a .vob. To dump the second title from a DVD: &prompt.root; mplayer -dumpstream -dumpfile out.vob dvd://2 -dvd-device /dev/dvd The output file, out.vob, will be in MPEG format. Anyone wishing to obtain a high level of expertise with &unix; video should consult mplayerhq.hu/DOCS as it is technically informative. This documentation should be considered as required reading before submitting any bug reports. mencoder Before using mencoder, it is a good idea to become familiar with the options described at mplayerhq.hu/DOCS/HTML/en/mencoder.html. There are innumerable ways to improve quality, lower bitrate, and change formats, and some of these options may make the difference between good and bad performance. Improper combinations of command line options can yield output files that are unplayable even by mplayer. Here is an example of a simple copy: &prompt.user; mencoder input.avi -oac copy -ovc copy -o output.avi To rip to a file, use -dumpstream and -dumpfile with mplayer, as shown earlier. To convert input.avi to the MPEG4 codec with MP3 audio encoding, first install the audio/lame port. Due to licensing restrictions, a package is not available. Once installed, type: &prompt.user; mencoder input.avi -oac mp3lame -lameopts br=192 \ -ovc lavc -lavcopts vcodec=mpeg4:vhq -o output.avi This will produce output playable by applications such as mplayer and xine.
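As a quick sanity check, the re-encoded file can be played back with the same player used above: &prompt.user; mplayer output.avi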
input.avi can be replaced with dvd://N -dvd-device DEVICE and run as root to re-encode a DVD title directly. Since it may take a few tries to get the desired result, it is recommended to instead dump the title to a file and to work on the file. The <application>xine</application> Video Player xine is a video player with a reusable base library and a modular executable which can be extended with plugins. It can be installed using the multimedia/xine package or port. In practice, xine requires either a fast CPU with a fast video card, or support for the XVideo extension. The xine video player performs best on XVideo interfaces. By default, the xine player starts a graphical user interface. The menus can then be used to open a specific file. Alternatively, xine may be invoked from the command line by specifying the name of the file to play: &prompt.user; xine -g -p mymovie.avi Refer to xine-project.org/faq for more information and troubleshooting tips. The <application>Transcode</application> Utilities Transcode provides a suite of tools for re-encoding video and audio files. Transcode can be used to merge video files or repair broken files using command line tools with stdin/stdout stream interfaces. In &os;, Transcode can be installed using the multimedia/transcode package or port. Many users prefer to compile the port as it provides a menu of compile options for specifying the support and codecs to compile in. If an option is not selected, Transcode will not be able to encode that format. Use the arrow keys and spacebar to select the required formats. When finished, press Enter to continue the port compile and installation. This example demonstrates how to convert a DivX file into a PAL MPEG-1 file (PAL VCD): &prompt.user; transcode -i input.avi -V --export_prof vcd-pal -o output_vcd &prompt.user; mplex -f 1 -o output_vcd.mpg output_vcd.m1v output_vcd.mpa The resulting MPEG file, output_vcd.mpg, is ready to be played with MPlayer. The file can be burned to CD media to create a video CD using a utility such as multimedia/vcdimager or sysutils/cdrdao. In addition to the manual page for transcode, refer to transcoding.org/cgi-bin/transcode for further information and examples. TV Cards Josef El-Rayes Original contribution by Marc Fonvieille Enhanced and adapted by TV cards TV cards can be used to watch broadcast or cable TV on a computer. Most cards accept composite video via an RCA or S-video input and some cards include an FM radio tuner. &os; provides support for PCI-based TV cards using a Brooktree Bt848/849/878/879 video capture chip with the &man.bktr.4; driver. This driver supports most Pinnacle PCTV video cards. Before purchasing a TV card, consult &man.bktr.4; for a list of supported tuners. Loading the Driver In order to use the card, the &man.bktr.4; driver must be loaded. To automate this at boot time, add the following line to /boot/loader.conf: bktr_load="YES" Alternatively, one can statically compile support for the TV card into a custom kernel. In that case, add the following lines to the custom kernel configuration file: device bktr device iicbus device iicbb device smbus These additional devices are necessary as the card components are interconnected via an I2C bus. Then, build and install a new kernel. To test that the tuner is correctly detected, reboot the system.
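Once the system is back up, the driver's attach messages can also be reviewed at any time by searching the kernel message buffer; for example: &prompt.root; dmesg | grep bktr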
The TV card should appear in the boot messages, as seen in this example: bktr0: <BrookTree 848A> mem 0xd7000000-0xd7000fff irq 10 at device 10.0 on pci0 iicbb0: <I2C bit-banging driver> on bti2c0 iicbus0: <Philips I2C bus> on iicbb0 master-only iicbus1: <Philips I2C bus> on iicbb0 master-only smbus0: <System Management Bus> on bti2c0 bktr0: Pinnacle/Miro TV, Philips SECAM tuner. The messages will differ according to the hardware. If necessary, it is possible to override some of the detected parameters using &man.sysctl.8; or custom kernel configuration options. For example, to force the tuner to a Philips SECAM tuner, add the following line to a custom kernel configuration file: options OVERRIDE_TUNER=6 or, use &man.sysctl.8;: &prompt.root; sysctl hw.bt848.tuner=6 Refer to &man.bktr.4; for a description of the available &man.sysctl.8; parameters and kernel options. Useful Applications To use the TV card, install one of the following applications: multimedia/fxtv provides TV-in-a-window and image/audio/video capture capabilities. multimedia/xawtv is another TV application with similar features. audio/xmradio provides an application for using the FM radio tuner of a TV card. More applications are available in the &os; Ports Collection. Troubleshooting If any problems are encountered with the TV card, check that the video capture chip and the tuner are supported by &man.bktr.4; and that the right configuration options were used. For more support or to ask questions about supported TV cards, refer to the &a.multimedia.name; mailing list. MythTV MythTV is a popular, open source Personal Video Recorder (PVR) application. This section demonstrates how to install and set up MythTV on &os;. Refer to mythtv.org/wiki for more information on how to use MythTV. MythTV requires a frontend and a backend. These components can either be installed on the same system or on different machines. The frontend can be installed on &os; using the multimedia/mythtv-frontend package or port. &xorg; must also be installed and configured as described in . Ideally, this system has a video card that supports X-Video Motion Compensation (XvMC) and, optionally, a Linux Infrared Remote Control (LIRC)-compatible remote. To install both the backend and the frontend on &os;, use the multimedia/mythtv package or port. A &mysql; database server is also required and should automatically be installed as a dependency. Optionally, this system should have a tuner card and sufficient storage to hold recorded data. Hardware MythTV uses Video for Linux (V4L) to access video input devices such as encoders and tuners. In &os;, MythTV works best with USB DVB-S/C/T cards as they are well supported by the multimedia/webcamd package or port which provides a V4L userland application. Any Digital Video Broadcasting (DVB) card supported by webcamd should work with MythTV. A list of known working cards can be found at wiki.freebsd.org/WebcamCompat. Drivers are also available for Hauppauge cards in the multimedia/pvr250 and multimedia/pvrxxx ports, but they provide a non-standard driver interface that does not work with versions of MythTV greater than 0.23. Due to licensing restrictions, no packages are available and these two ports must be compiled. The wiki.freebsd.org/HTPC page contains a list of all available DVB drivers.
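As a starting point for such cards, webcamd can be installed and enabled at boot. This is a minimal sketch, assuming a supported USB DVB tuner is attached: &prompt.root; pkg install webcamd &prompt.root; sysrc webcamd_enable=yes &prompt.root; service webcamd start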
Setting up the MythTV Backend To install MythTV using binary packages: &prompt.root; pkg install mythtv Alternatively, to install from the Ports Collection: &prompt.root; cd /usr/ports/multimedia/mythtv &prompt.root; make install Once installed, set up the MythTV database: &prompt.root; mysql -uroot -p < /usr/local/share/mythtv/database/mc.sql Then, configure the backend: &prompt.root; mythtv-setup Finally, start the backend: &prompt.root; sysrc mythbackend_enable=yes &prompt.root; service mythbackend start Image Scanners Marc Fonvieille Written by image scanners In &os;, access to image scanners is provided by SANE (Scanner Access Now Easy), which is available in the &os; Ports Collection. SANE will also use some &os; device drivers to provide access to the scanner hardware. &os; supports both SCSI and USB scanners. Depending upon the scanner interface, different device drivers are required. Be sure the scanner is supported by SANE prior to performing any configuration. Refer to http://www.sane-project.org/sane-supported-devices.html for more information about supported scanners. This chapter describes how to determine if the scanner has been detected by &os;. It then provides an overview of how to configure and use SANE on a &os; system. Checking the Scanner The GENERIC kernel includes the device drivers needed to support USB scanners. Users with a custom kernel should ensure that the following lines are present in the custom kernel configuration file: device usb device uhci device ohci device ehci device xhci To determine if the USB scanner is detected, plug it in and use dmesg to determine whether the scanner appears in the system message buffer. If it does, it should display a message similar to this: ugen0.2: <EPSON> at usbus0 In this example, an &epson.perfection; 1650 USB scanner was detected on /dev/ugen0.2. If the scanner uses a SCSI interface, it is important to know which SCSI controller board it will use. Depending upon the SCSI chipset, a custom kernel configuration file may be needed. The GENERIC kernel supports the most common SCSI controllers. Refer to /usr/src/sys/conf/NOTES to determine the correct line to add to a custom kernel configuration file. In addition to the SCSI adapter driver, the following lines are needed in a custom kernel configuration file: device scbus device pass Verify that the device is displayed in the system message buffer: pass2 at aic0 bus 0 target 2 lun 0 pass2: <AGFA SNAPSCAN 600 1.10> Fixed Scanner SCSI-2 device pass2: 3.300MB/s transfers If the scanner was not powered-on at system boot, it is still possible to manually force detection by performing a SCSI bus scan with camcontrol: &prompt.root; camcontrol rescan all Re-scan of bus 0 was successful Re-scan of bus 1 was successful Re-scan of bus 2 was successful Re-scan of bus 3 was successful The scanner should now appear in the SCSI devices list: &prompt.root; camcontrol devlist <IBM DDRS-34560 S97B> at scbus0 target 5 lun 0 (pass0,da0) <IBM DDRS-34560 S97B> at scbus0 target 6 lun 0 (pass1,da1) <AGFA SNAPSCAN 600 1.10> at scbus1 target 2 lun 0 (pass3) <PHILIPS CDD3610 CD-R/RW 1.00> at scbus2 target 0 lun 0 (pass2,cd0) Refer to &man.scsi.4; and &man.camcontrol.8; for more details about SCSI devices on &os;. <application>SANE</application> Configuration The SANE system provides the access to the scanner via backends (graphics/sane-backends). Refer to http://www.sane-project.org/sane-supported-devices.html to determine which backend supports the scanner. 
A graphical scanning interface is provided by third party applications like Kooka (graphics/kooka) or XSane (graphics/xsane). SANE's backends are enough to test the scanner. To install the backends from the binary package: &prompt.root; pkg install sane-backends Alternatively, to install from the Ports Collection: &prompt.root; cd /usr/ports/graphics/sane-backends &prompt.root; make install clean After installing the graphics/sane-backends port or package, use sane-find-scanner to check whether the scanner is detected by the SANE system: &prompt.root; sane-find-scanner -q found SCSI scanner "AGFA SNAPSCAN 600 1.10" at /dev/pass3 The output should show the interface type of the scanner and the device node used to attach the scanner to the system. The vendor and the product model may or may not appear. Some USB scanners require firmware to be loaded. Refer to sane-find-scanner(1) and sane(7) for details. Next, check if the scanner will be identified by a scanning frontend. The SANE backends include scanimage which can be used to list the devices and perform an image acquisition. Use -L to list the scanner devices. The first example is for a SCSI scanner and the second is for a USB scanner: &prompt.root; scanimage -L device `snapscan:/dev/pass3' is a AGFA SNAPSCAN 600 flatbed scanner &prompt.root; scanimage -L device 'epson2:libusb:000:002' is a Epson GT-8200 flatbed scanner In this second example, epson2 is the backend name and libusb:000:002 means /dev/ugen0.2 is the device node used by the scanner. If scanimage is unable to identify the scanner, this message will appear: &prompt.root; scanimage -L No scanners were identified. If you were expecting something different, check that the scanner is plugged in, turned on and detected by the sane-find-scanner tool (if appropriate). Please read the documentation which came with this software (README, FAQ, manpages). If this happens, edit the backend configuration file in /usr/local/etc/sane.d/ and define the scanner device used. For example, if the undetected scanner model is an &epson.perfection; 1650 and it uses the epson2 backend, edit /usr/local/etc/sane.d/epson2.conf. When editing, add a line specifying the interface and the device node used. In this case, add the following line: usb /dev/ugen0.2 Save the edits and verify that the scanner is identified with the right backend name and the device node: &prompt.root; scanimage -L device 'epson2:libusb:000:002' is a Epson GT-8200 flatbed scanner Once scanimage -L sees the scanner, the configuration is complete and the scanner is now ready to use. While scanimage can be used to perform an image acquisition from the command line, it is often preferable to use a graphical interface to perform image scanning. Applications like Kooka or XSane are popular scanning frontends. They offer advanced features such as various scanning modes, color correction, and batch scans. XSane is also usable as a GIMP plugin. Scanner Permissions In order to have access to the scanner, a user needs read and write permissions to the device node used by the scanner. In the previous example, the USB scanner uses the device node /dev/ugen0.2 which is really a symlink to the real device node /dev/usb/0.2.0. The symlink and the device node are owned, respectively, by the wheel and operator groups. While adding the user to these groups will allow access to the scanner, it is considered insecure to add a user to wheel. A better solution is to create a group and make the scanner device accessible to members of this group.
This example creates a group called usb: &prompt.root; pw groupadd usb Then, make the /dev/ugen0.2 symlink and the /dev/usb/0.2.0 device node accessible to the usb group with write permissions of 0660 or 0666 by adding the following lines to /etc/devfs.rules: [system=5] add path ugen0.2 mode 0660 group usb add path usb/0.2.0 mode 0666 group usb Since the device node can change when devices are added or removed, it may be preferable to give access to all USB devices using this ruleset instead: [system=5] add path 'ugen*' mode 0660 group usb add path 'usb/*' mode 0666 group usb Refer to &man.devfs.rules.5; for more information about this file. Next, enable the ruleset in /etc/rc.conf: devfs_system_ruleset="system" Then, restart the &man.devfs.8; system: &prompt.root; service devfs restart Finally, add users to the usb group in order to allow access to the scanner: &prompt.root; pw groupmod usb -m joe For more details refer to &man.pw.8;.
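To confirm the change, display the group entry; joe is the example user added above, and the GID shown will vary: &prompt.root; pw groupshow usb usb:*:1001:joe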
diff --git a/en_US.ISO8859-1/books/handbook/security/chapter.xml b/en_US.ISO8859-1/books/handbook/security/chapter.xml index 4bc3279737..d3b277a978 100644 --- a/en_US.ISO8859-1/books/handbook/security/chapter.xml +++ b/en_US.ISO8859-1/books/handbook/security/chapter.xml @@ -1,4157 +1,4157 @@ Security Tom Rhodes Rewritten by security Synopsis Security, whether physical or virtual, is a topic so broad that an entire industry has evolved around it. Hundreds of standard practices have been authored about how to secure systems and networks, and as a user of &os;, understanding how to protect against attacks and intruders is a must. In this chapter, several fundamentals and techniques will be discussed. The &os; system comes with multiple layers of security, and many more third party utilities may be added to enhance security. After reading this chapter, you will know: Basic &os; system security concepts. The various crypt mechanisms available in &os;. How to set up one-time password authentication. How to configure TCP Wrapper for use with &man.inetd.8;. How to set up Kerberos on &os;. How to configure IPsec and create a VPN. How to configure and use OpenSSH on &os;. How to use file system ACLs. How to use pkg to audit third party software packages installed from the Ports Collection. How to utilize &os; security advisories. What Process Accounting is and how to enable it on &os;. How to control user resources using login classes or the resource limits database. Before reading this chapter, you should: Understand basic &os; and Internet concepts. Additional security topics are covered elsewhere in this Handbook. For example, Mandatory Access Control is discussed in and Internet firewalls are discussed in . Introduction Security is everyone's responsibility. A weak entry point in any system could allow intruders to gain access to critical information and cause havoc on an entire network. One of the core principles of information security is the CIA triad, which stands for the Confidentiality, Integrity, and Availability of information systems. The CIA triad is a bedrock concept of computer security as customers and users expect their data to be protected. For example, a customer expects that their credit card information is securely stored (confidentiality), that their orders are not changed behind the scenes (integrity), and that they have access to their order information at all times (availability). To provide CIA, security professionals apply a defense in depth strategy. The idea of defense in depth is to add several layers of security to prevent a single layer from failing and the entire security system from collapsing. For example, a system administrator cannot simply turn on a firewall and consider the network or system secure. One must also audit accounts, check the integrity of binaries, and ensure malicious tools are not installed. To implement an effective security strategy, one must understand threats and how to defend against them. What is a threat as it pertains to computer security? Threats are not limited to remote attackers who attempt to access a system without permission. Threats also include employees, malicious software, unauthorized network devices, natural disasters, security vulnerabilities, and even competing corporations. Systems and networks can be accessed without permission, sometimes by accident, or by remote attackers, and in some cases, via corporate espionage or former employees.
As a user, it is important to prepare for and admit when a mistake has led to a security breach and report possible issues to the security team. As an administrator, it is important to know the threats and be prepared to mitigate them. When applying security to systems, it is recommended to start by securing the basic accounts and system configuration, and then to secure the network layer so that it adheres to the system policy and the organization's security procedures. Many organizations already have a security policy that covers the configuration of technology devices. The policy should include the security configuration of workstations, desktops, mobile devices, phones, production servers, and development servers. In many cases, standard operating procedures (SOPs) already exist. When in doubt, ask the security team. The rest of this introduction describes how some of these basic security configurations are performed on a &os; system. The rest of this chapter describes some specific tools which can be used when implementing a security policy on a &os; system. Preventing Logins In securing a system, a good starting point is an audit of accounts. Ensure that root has a strong password and that this password is not shared. Disable any accounts that do not need login access. To deny login access to accounts, two methods exist. The first is to lock the account. This example locks the toor account: &prompt.root; pw lock toor The second method is to prevent login access by changing the shell to /usr/sbin/nologin. Only the superuser can change the shell for other users: &prompt.root; chsh -s /usr/sbin/nologin toor The /usr/sbin/nologin shell prevents the system from assigning a shell to the user when they attempt to log in. Permitted Account Escalation In some cases, system administration needs to be shared with other users. &os; has two methods to handle this. The first one, which is not recommended, is a shared root password used by members of the wheel group. With this method, a user types su and enters the root password whenever superuser access is needed. The user should then type exit to leave privileged access after finishing the commands that required administrative access. To add a user to this group, edit /etc/group and add the user to the end of the wheel entry. Users must be separated by a comma character, with no spaces. The second, and recommended, method to permit privilege escalation is to install the security/sudo package or port. This software provides additional auditing, more fine-grained user control, and can be configured to lock users into running only the specified privileged commands. After installation, use visudo to edit /usr/local/etc/sudoers. This example creates a new webadmin group, adds the trhodes account to that group, and configures that group with access to restart apache24: &prompt.root; pw groupadd webadmin -M trhodes -g 6000 &prompt.root; visudo %webadmin ALL=(ALL) /usr/sbin/service apache24 * Password Hashes Passwords are a necessary evil of technology. When they must be used, they should be complex and a powerful hash mechanism should be used to encrypt the version that is stored in the password database. &os; supports the DES, MD5, SHA256, SHA512, and Blowfish hash algorithms in its crypt() library. The default of SHA512 should not be changed to a less secure hashing algorithm, but can be changed to the more secure Blowfish algorithm.
Blowfish is not part of AES and is not considered compliant with any Federal Information Processing Standards (FIPS). Its use may not be permitted in some environments. To determine which hash algorithm is used to encrypt a user's password, the superuser can view the hash for the user in the &os; password database. Each hash starts with a symbol which indicates the type of hash mechanism used to encrypt the password. If DES is used, there is no beginning symbol. For MD5, the symbol is $1$. For SHA256, the symbol is $5$, and for SHA512, it is $6$. For Blowfish, the symbol is $2a$. In this example, the password for dru is hashed using the default SHA512 algorithm as the hash starts with $6$. Note that the encrypted hash, not the password itself, is stored in the password database: &prompt.root; grep dru /etc/master.passwd dru:$6$pzIjSvCAn.PBYQBA$PXpSeWPx3g5kscj3IMiM7tUEUSPmGexxta.8Lt9TGSi2lNQqYGKszsBPuGME0:1001:1001::0:0:dru:/usr/home/dru:/bin/csh The hash mechanism is set in the user's login class. For this example, the user is in the default login class and the hash algorithm is set with this line in /etc/login.conf: :passwd_format=sha512:\ To change the algorithm to Blowfish, modify that line to look like this: :passwd_format=blf:\ Then run cap_mkdb /etc/login.conf as described in . Note that this change will not affect any existing password hashes. This means that all passwords should be re-hashed by asking users to run passwd in order to change their password. For remote logins, two-factor authentication should be used. An example of two-factor authentication is something you have, such as a key, and something you know, such as the passphrase for that key. Since OpenSSH is part of the &os; base system, all network logins should be over an encrypted connection and use key-based authentication instead of passwords. For more information, refer to . Kerberos users may need to make additional changes to implement OpenSSH in their network. These changes are described in . Password Policy Enforcement Enforcing a strong password policy for local accounts is a fundamental aspect of system security. In &os;, password length, password strength, and password complexity can be implemented using built-in Pluggable Authentication Modules (PAM). This section demonstrates how to configure the minimum and maximum password length and the enforcement of mixed characters using the pam_passwdqc.so module. This module is enforced when a user changes their password. To configure this module, become the superuser and uncomment the line containing pam_passwdqc.so in /etc/pam.d/passwd. Then, edit that line to match the password policy: password requisite pam_passwdqc.so min=disabled,disabled,disabled,12,10 similar=deny retry=3 enforce=users This example sets several requirements for new passwords. The min setting controls the minimum password length. It has five values because this module defines five different types of passwords based on their complexity. Complexity is defined by the type of characters that must exist in a password, such as letters, numbers, symbols, and case. The types of passwords are described in &man.pam.passwdqc.8;. In this example, the first three types of passwords are disabled, meaning that passwords that meet those complexity requirements will not be accepted, regardless of their length. The 12 sets a minimum password policy of at least twelve characters, if the password also contains characters with three types of complexity.
The 10 sets the password policy to also allow passwords of at least ten characters, if the password contains characters with four types of complexity. The similar setting denies passwords that are similar to the user's previous password. The retry setting provides a user with three opportunities to enter a new password. Once this file is saved, a user changing their password will see a message similar to the following: &prompt.user; passwd Changing local password for trhodes Old Password: You can now choose the new password. A valid password should be a mix of upper and lower case letters, digits and other characters. You can use a 12 character long password with characters from at least 3 of these 4 classes, or a 10 character long password containing characters from all the classes. Characters that form a common pattern are discarded by the check. Alternatively, if no one else can see your terminal now, you can pick this as your password: "trait-useful&knob". Enter new password: If a password that does not match the policy is entered, it will be rejected with a warning and the user will have an opportunity to try again, up to the configured number of retries. Most password policies require passwords to expire after a set number of days. To set a password age time in &os;, set passwordtime for the user's login class in /etc/login.conf. The default login class contains an example: # :passwordtime=90d:\ So, to set an expiry of 90 days for this login class, remove the comment symbol (#), save the edit, and run cap_mkdb /etc/login.conf. To set the expiration on individual users, pass an expiration date or the number of days to expiry and a username to pw: &prompt.root; pw usermod -p 30-apr-2015 -n trhodes As seen here, an expiration date is set in the form of day, month, and year. For more information, see &man.pw.8;. Detecting Rootkits A rootkit is any unauthorized software that attempts to gain root access to a system. Once installed, this malicious software will normally open up another avenue of entry for an attacker. Realistically, once a system has been compromised by a rootkit and an investigation has been performed, the system should be reinstalled from scratch. There is tremendous risk that even the most prudent security or systems engineer will miss something an attacker left behind. A rootkit does do one thing useful for administrators: once detected, it is a sign that a compromise happened at some point. But, these types of applications tend to be very well hidden. This section demonstrates a tool that can be used to detect rootkits, security/rkhunter. After installation of this package or port, the system may be checked using the following command. It will produce a lot of information and will require some manual pressing of ENTER: &prompt.root; rkhunter -c After the process completes, a status message will be printed to the screen. This message will include the number of files checked, suspect files, possible rootkits, and more. During the check, some generic security warnings may be produced about hidden files, the OpenSSH protocol selection, and known vulnerable versions of installed software. These can be handled now or after a more detailed analysis has been performed. Every administrator should know what is running on the systems they are responsible for. Third-party tools like rkhunter and sysutils/lsof, and native commands such as netstat and ps, can show a great deal of information on the system. Take notes on what is normal, ask questions when something seems out of place, and be paranoid.
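For example, a baseline of listening sockets and running processes can be recorded with base system tools and compared against later output; the file names here are arbitrary: &prompt.root; sockstat -4 -6 -l > /root/baseline.sockstat &prompt.root; ps auxww > /root/baseline.ps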
While preventing a compromise is ideal, detecting a compromise is a must. Binary Verification Verification of system files and binaries is important because it provides the system administration and security teams with information about system changes. A software application that monitors the system for changes is called an Intrusion Detection System (IDS). &os; provides native support for a basic IDS. While the nightly security emails will notify an administrator of changes, the information is stored locally and there is a chance that a malicious user could modify this information in order to hide their changes to the system. As such, it is recommended to create a separate set of binary signatures and store them in a read-only, root-owned directory or, preferably, on a removable USB disk or remote rsync server. The built-in mtree utility can be used to generate a specification of the contents of a directory. A seed, or a numeric constant, is used to generate the specification and is required to check that the specification has not changed. This makes it possible to determine if a file or binary has been modified. Since the seed value is unknown by an attacker, faking or checking the checksum values of files will be difficult to impossible. The following example generates a set of SHA256 hashes, one for each system binary in /bin, and saves those values to a hidden file in root's home directory, /root/.bin_chksum_mtree: &prompt.root; mtree -s 3483151339707503 -c -K cksum,sha256digest -p /bin > /root/.bin_chksum_mtree &prompt.root; mtree: /bin checksum: 3427012225 The 3483151339707503 represents the seed. This value should be remembered, but not shared. Viewing /root/.bin_chksum_mtree should yield output similar to the following: # user: root # machine: dreadnaught # tree: /bin # date: Mon Feb 3 10:19:53 2014 # . /set type=file uid=0 gid=0 mode=0555 nlink=1 flags=none . type=dir mode=0755 nlink=2 size=1024 \ time=1380277977.000000000 \133 nlink=2 size=11704 time=1380277977.000000000 \ cksum=484492447 \ sha256digest=6207490fbdb5ed1904441fbfa941279055c3e24d3a4049aeb45094596400662a cat size=12096 time=1380277975.000000000 cksum=3909216944 \ sha256digest=65ea347b9418760b247ab10244f47a7ca2a569c9836d77f074e7a306900c1e69 chflags size=8168 time=1380277975.000000000 cksum=3949425175 \ sha256digest=c99eb6fc1c92cac335c08be004a0a5b4c24a0c0ef3712017b12c89a978b2dac3 chio size=18520 time=1380277975.000000000 cksum=2208263309 \ sha256digest=ddf7c8cb92a58750a675328345560d8cc7fe14fb3ccd3690c34954cbe69fc964 chmod size=8640 time=1380277975.000000000 cksum=2214429708 \ sha256digest=a435972263bf814ad8df082c0752aa2a7bdd8b74ff01431ccbd52ed1e490bbe7 The machine's hostname, the date and time the specification was created, and the name of the user who created the specification are included in this report. There is a checksum, size, time, and SHA256 digest for each binary in the directory. To verify that the binary signatures have not changed, compare the current contents of the directory to the previously generated specification, and save the results to a file. This command requires the seed that was used to generate the original specification: &prompt.root; mtree -s 3483151339707503 -p /bin < /root/.bin_chksum_mtree >> /root/.bin_chksum_output &prompt.root; mtree: /bin checksum: 3427012225 This should produce the same checksum for /bin that was produced when the specification was created. If no changes have occurred to the binaries in this directory, the /root/.bin_chksum_output output file will be empty.
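This comparison lends itself to automation, such as a nightly &man.cron.8; job. The following is a minimal sh sketch, not a turn-key solution, reusing the seed and file locations from the example above; the temporary file and mail recipient are placeholders: #!/bin/sh # Compare /bin against the stored specification; differences arrive on # stdout, while the informational checksum on stderr is discarded here. mtree -s 3483151339707503 -p /bin < /root/.bin_chksum_mtree > /tmp/mtree.out 2>/dev/null # Mail the report only when something changed (the output file is non-empty). if [ -s /tmp/mtree.out ]; then mail -s "mtree verification report" root < /tmp/mtree.out fi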
To simulate a change, change the date on /bin/cat using touch and run the verification command again: &prompt.root; touch /bin/cat &prompt.root; mtree -s 3483151339707503 -p /bin < /root/.bin_chksum_mtree >> /root/.bin_chksum_output &prompt.root; more /root/.bin_chksum_output cat changed modification time expected Fri Sep 27 06:32:55 2013 found Mon Feb 3 10:28:43 2014 It is recommended to create specifications for the directories which contain binaries and configuration files, as well as any directories containing sensitive data. Typically, specifications are created for /bin, /sbin, /usr/bin, /usr/sbin, /usr/local/bin, /etc, and /usr/local/etc. More advanced IDS systems exist, such as security/aide. In most cases, mtree provides the functionality administrators need. It is important to keep the seed value and the checksum output hidden from malicious users. More information about mtree can be found in &man.mtree.8;. System Tuning for Security In &os;, many system features can be tuned using sysctl. A few of the security features which can be tuned to prevent Denial of Service (DoS) attacks will be covered in this section. More information about using sysctl, including how to temporarily change values and how to make the changes permanent after testing, can be found in . Any time a setting is changed with sysctl, the chance of causing undesired harm increases, affecting the availability of the system. All changes should be monitored and, if possible, tried on a testing system before being used on a production system. By default, the &os; kernel boots with a security level of -1. This is called insecure mode because immutable file flags may be turned off and all devices may be read from or written to. The security level will remain at -1 unless it is altered through sysctl or by a setting in the startup scripts. The security level may be increased during system startup by setting kern_securelevel_enable to YES in /etc/rc.conf, and the value of kern_securelevel to the desired security level. See &man.security.7; and &man.init.8; for more information on these settings and the available security levels. Increasing the securelevel can break Xorg and cause other issues. Be prepared to do some debugging. The net.inet.tcp.blackhole and net.inet.udp.blackhole settings can be used to drop incoming SYN packets on closed ports without sending a return RST response. The default behavior is to return an RST to show a port is closed. Changing the default provides some level of protection against port scans, which are used to determine which applications are running on a system. Set net.inet.tcp.blackhole to 2 and net.inet.udp.blackhole to 1. Refer to &man.blackhole.4; for more information about these settings. The net.inet.icmp.drop_redirect and net.inet.ip.redirect settings help protect against redirect attacks. A redirect attack is a type of DoS which sends large numbers of ICMP type 5 packets. Since these packets are not required, set net.inet.icmp.drop_redirect to 1 and set net.inet.ip.redirect to 0. Source routing is a method for detecting and accessing non-routable addresses on the internal network. This should be disabled as non-routable addresses are normally not routable on purpose. To disable this feature, set net.inet.ip.sourceroute and net.inet.ip.accept_sourceroute to 0. When a machine on the network needs to send messages to all hosts on a subnet, an ICMP echo request message is sent to the broadcast address. However, there is no reason for an external host to perform such an action.
To reject all external broadcast requests, set net.inet.icmp.bmcastecho to 0. Some additional settings are documented in &man.security.7;. One-time Passwords one-time passwords security one-time passwords By default, &os; includes support for One-time Passwords In Everything (OPIE). OPIE is designed to prevent replay attacks, in which an attacker discovers a user's password and uses it to access a system. Since a password is only used once in OPIE, a discovered password is of little use to an attacker. OPIE uses a secure hash and a challenge/response system to manage passwords. The &os; implementation uses the MD5 hash by default. OPIE uses three different types of passwords. The first is the usual &unix; or Kerberos password. The second is the one-time password which is generated by opiekey. The third type of password is the secret password which is used to generate one-time passwords. The secret password has nothing to do with, and should be different from, the &unix; password. There are two other pieces of data that are important to OPIE. One is the seed or key, consisting of two letters and five digits. The other is the iteration count, a number between 1 and 100. OPIE creates the one-time password by concatenating the seed and the secret password, applying the MD5 hash as many times as specified by the iteration count, and turning the result into six short English words which represent the one-time password. The authentication system keeps track of the last one-time password used, and the user is authenticated if the hash of the user-provided password is equal to the previous password. - Because a one-way hash is used, it is impossible to generate + Since a one-way hash is used, it is impossible to generate future one-time passwords if a successfully used password is captured. The iteration count is decremented after each successful login to keep the user and the login program in sync. When the iteration count gets down to 1, OPIE must be reinitialized. There are a few programs involved in this process. A one-time password, or a consecutive list of one-time passwords, is generated by passing an iteration count, a seed, and a secret password to &man.opiekey.1;. In addition to initializing OPIE, &man.opiepasswd.1; is used to change passwords, iteration counts, or seeds. The relevant credential files in /etc/opiekeys are examined by &man.opieinfo.1; which prints out the invoking user's current iteration count and seed. This section describes four different sorts of operations. The first is how to set up one-time passwords for the first time over a secure connection. The second is how to use opiepasswd over an insecure connection. The third is how to log in over an insecure connection. The fourth is how to generate a number of keys which can be written down or printed out to use at insecure locations. Initializing <acronym>OPIE</acronym> To initialize OPIE for the first time, run this command from a secure location: &prompt.user; opiepasswd -c Adding unfurl: Only use this method from the console; NEVER from remote. If you are using telnet, xterm, or a dial-in, type ^C now or exit with no password. Then run opiepasswd without the -c parameter. Using MD5 to compute responses. Enter new secret pass phrase: Again new secret pass phrase: ID unfurl OTP key is 499 to4268 MOS MALL GOAT ARM AVID COED The -c sets console mode which assumes that the command is being run from a secure location, such as a computer under the user's control or an SSH session to a computer under the user's control.
When prompted, enter the secret password which will be used to generate the one-time login keys. This password should be difficult to guess and should be different from the password which is associated with the user's login account. It must be between 10 and 127 characters long. Remember this password. The ID line lists the login name (unfurl), default iteration count (499), and default seed (to4268). When logging in, the system will remember these parameters and display them, meaning that they do not have to be memorized. The last line lists the generated one-time password which corresponds to those parameters and the secret password. At the next login, use this one-time password. Insecure Connection Initialization To initialize or change the secret password on an insecure system, a secure connection is needed to some place where opiekey can be run. This might be a shell prompt on a trusted machine. An iteration count is needed, where 100 is probably a good value, and the seed can either be specified or the randomly-generated one used. On the insecure connection, the machine being initialized, use &man.opiepasswd.1;: &prompt.user; opiepasswd Updating unfurl: You need the response from an OTP generator. Old secret pass phrase: otp-md5 498 to4268 ext Response: GAME GAG WELT OUT DOWN CHAT New secret pass phrase: otp-md5 499 to4269 Response: LINE PAP MILK NELL BUOY TROY ID mark OTP key is 499 gr4269 LINE PAP MILK NELL BUOY TROY To accept the default seed, press Return. Before entering an access password, move over to the secure connection and give it the same parameters: &prompt.user; opiekey 498 to4268 Using the MD5 algorithm to compute response. Reminder: Do not use opiekey from telnet or dial-in sessions. Enter secret pass phrase: GAME GAG WELT OUT DOWN CHAT Switch back over to the insecure connection, and copy the generated one-time password over to the relevant program. Generating a Single One-time Password After initializing OPIE and logging in, a prompt like this will be displayed: &prompt.user; telnet example.com Trying 10.0.0.1... Connected to example.com Escape character is '^]'. FreeBSD/i386 (example.com) (ttypa) login: <username> otp-md5 498 gr4269 ext Password: The OPIE prompt provides a useful feature. If Return is pressed at the password prompt, the prompt will turn echo on and display what is typed. This can be useful when attempting to type in a password by hand from a printout. MS-DOS Windows MacOS At this point, generate the one-time password to answer this login prompt. This must be done on a trusted system where it is safe to run &man.opiekey.1;. There are versions of this command for &windows;, &macos; and &os;. This command needs the iteration count and the seed as command line options. Use cut-and-paste from the login prompt on the machine being logged in to. On the trusted system: &prompt.user; opiekey 498 to4268 Using the MD5 algorithm to compute response. Reminder: Do not use opiekey from telnet or dial-in sessions. Enter secret pass phrase: GAME GAG WELT OUT DOWN CHAT Once the one-time password is generated, continue to log in. Generating Multiple One-time Passwords Sometimes there is no access to a trusted machine or secure connection. In this case, it is possible to use &man.opiekey.1; to generate a number of one-time passwords beforehand. For example: &prompt.user; opiekey -n 5 30 zz99999 Using the MD5 algorithm to compute response. Reminder: Do not use opiekey from telnet or dial-in sessions.
Enter secret pass phrase: <secret password> 26: JOAN BORE FOSS DES NAY QUIT 27: LATE BIAS SLAY FOLK MUCH TRIG 28: SALT TIN ANTI LOON NEAL USE 29: RIO ODIN GO BYE FURY TIC 30: GREW JIVE SAN GIRD BOIL PHI The -n 5 requests five keys in sequence, and 30 specifies what the last iteration number should be. Note that these are printed out in reverse order of use. The really paranoid might want to write the results down by hand; otherwise, print the list. Each line shows both the iteration count and the one-time password. Scratch off the passwords as they are used. Restricting Use of &unix; Passwords OPIE can restrict the use of &unix; passwords based on the IP address of a login session. The relevant file is /etc/opieaccess, which is present by default. Refer to &man.opieaccess.5; for more information on this file and the security considerations to be aware of when using it. Here is a sample opieaccess: permit 192.168.0.0 255.255.0.0 This line allows users whose IP source address (which is vulnerable to spoofing) matches the specified value and mask to use &unix; passwords at any time. If no rules in opieaccess are matched, the default is to deny non-OPIE logins. TCP Wrapper TomRhodesWritten by TCP Wrapper TCP Wrapper is a host-based access control system which extends the abilities of &man.inetd.8;. It can be configured to provide logging support, return messages, and connection restrictions for the server daemons under the control of inetd. Refer to &man.tcpd.8; for more information about TCP Wrapper and its features. TCP Wrapper should not be considered a replacement for a properly configured firewall. Instead, TCP Wrapper should be used in conjunction with a firewall and other security enhancements in order to provide another layer of protection in the implementation of a security policy. Initial Configuration To enable TCP Wrapper in &os;, add the following lines to /etc/rc.conf: inetd_enable="YES" inetd_flags="-Ww" Then, properly configure /etc/hosts.allow. Unlike other implementations of TCP Wrapper, the use of hosts.deny is deprecated in &os;. All configuration options should be placed in /etc/hosts.allow. In the simplest configuration, daemon connection policies are set to either permit or block, depending on the options in /etc/hosts.allow. The default configuration in &os; is to allow all connections to the daemons started with inetd. Basic configuration usually takes the form of daemon : address : action, where daemon is the daemon which inetd started, address is a valid hostname, IP address, or an IPv6 address enclosed in brackets ([ ]), and action is either allow or deny. TCP Wrapper uses a first-rule-match semantic, meaning that the configuration file is scanned from the beginning for a matching rule. When a match is found, the rule is applied and the search process stops. For example, to allow POP3 connections via the mail/qpopper daemon, the following lines should be appended to hosts.allow: # This line is required for POP3 connections: qpopper : ALL : allow Whenever this file is edited, restart inetd: &prompt.root; service inetd restart Advanced Configuration TCP Wrapper provides advanced options to allow more control over the way connections are handled. In some cases, it may be appropriate to return a comment to certain hosts or daemon connections. In other cases, a log entry should be recorded or an email sent to the administrator. Other situations may require the use of a service for local connections only.
This is all possible through the use of configuration options known as wildcards, expansion characters, and external command execution. Suppose that a situation occurs where a connection should be denied yet a reason should be sent to the host who attempted to establish that connection. That action is possible with twist. When a connection attempt is made, twist executes a shell command or script. An example exists in hosts.allow: # The rest of the daemons are protected. ALL : ALL \ : severity auth.info \ : twist /bin/echo "You are not welcome to use %d from %h." In this example, the message You are not welcome to use daemon name from hostname. will be returned for any daemon not configured in hosts.allow. This is useful for sending a reply back to the connection initiator right after the established connection is dropped. Any message returned must be wrapped in quote (") characters. It may be possible to launch a denial of service attack on the server if an attacker floods these daemons with connection requests. Another possibility is to use spawn. Like twist, spawn implicitly denies the connection and may be used to run external shell commands or scripts. Unlike twist, spawn will not send a reply back to the host who established the connection. For example, consider the following configuration: # We do not allow connections from example.com: ALL : .example.com \ : spawn (/bin/echo %a from %h attempted to access %d >> \ /var/log/connections.log) \ : deny This will deny all connection attempts from *.example.com and log the hostname, IP address, and the daemon to which access was attempted to /var/log/connections.log. This example uses the substitution characters %a and %h. Refer to &man.hosts.access.5; for the complete list. To match every instance of a daemon, domain, or IP address, use ALL. Another wildcard is PARANOID which may be used to match any host which provides an IP address that may be forged because the IP address differs from its resolved hostname. In this example, all connection requests to Sendmail which have an IP address that varies from its hostname will be denied: # Block possibly spoofed requests to sendmail: sendmail : PARANOID : deny Using the PARANOID wildcard will result in denied connections if the client or server has a broken DNS setup. To learn more about wildcards and their associated functionality, refer to &man.hosts.access.5;. When adding new configuration lines, make sure that any unneeded entries for that daemon are commented out in hosts.allow. <application>Kerberos</application> Tillman Hodgson Contributed by Mark Murray Based on a contribution by Kerberos is a network authentication protocol which was originally created by the Massachusetts Institute of Technology (MIT) as a way to securely provide authentication across a potentially hostile network. The Kerberos protocol uses strong cryptography so that both a client and server can prove their identity without sending any unencrypted secrets over the network. Kerberos can be described as an identity-verifying proxy system and as a trusted third-party authentication system. After a user authenticates with Kerberos, their communications can be encrypted to assure privacy and data integrity. The only function of Kerberos is to provide the secure authentication of users and servers on the network. It does not provide authorization or auditing functions. It is recommended that Kerberos be used with other security methods which provide authorization and audit services. The current version of the protocol is version 5, described in RFC 4120.
Several free implementations of this protocol are available, covering a wide range of operating systems. MIT continues to develop their Kerberos package. It is commonly used in the US as a cryptography product, and has historically been subject to US export regulations. In &os;, MIT Kerberos is available as the security/krb5 package or port. The Heimdal Kerberos implementation was explicitly developed outside of the US to avoid export regulations. The Heimdal Kerberos distribution is included in the base &os; installation, and another distribution with more configurable options is available as security/heimdal in the Ports Collection. In Kerberos users and services are identified as principals which are contained within an administrative grouping, called a realm. A typical user principal would be of the form user@REALM (realms are traditionally uppercase). This section provides a guide on how to set up Kerberos using the Heimdal distribution included in &os;. For purposes of demonstrating a Kerberos installation, the name spaces will be as follows: The DNS domain (zone) will be example.org. The Kerberos realm will be EXAMPLE.ORG. Use real domain names when setting up Kerberos, even if it will run internally. This avoids DNS problems and assures inter-operation with other Kerberos realms. Setting up a Heimdal <acronym>KDC</acronym> Kerberos5 Key Distribution Center The Key Distribution Center (KDC) is the centralized authentication service that Kerberos provides, the trusted third party of the system. It is the computer that issues Kerberos tickets, which are used for clients to authenticate to - servers. Because the KDC is considered + servers. As the KDC is considered trusted by all other computers in the Kerberos realm, it has heightened security concerns. Direct access to the KDC should be limited. While running a KDC requires few computing resources, a dedicated machine acting only as a KDC is recommended for security reasons. To begin, install the security/heimdal package as follows: &prompt.root; pkg install heimdal Next, update /etc/rc.conf using sysrc as follows: &prompt.root; sysrc kdc_enable=yes &prompt.root; sysrc kadmind_enable=yes Next, edit /etc/krb5.conf as follows: [libdefaults] default_realm = EXAMPLE.ORG [realms] EXAMPLE.ORG = { kdc = kerberos.example.org admin_server = kerberos.example.org } [domain_realm] .example.org = EXAMPLE.ORG In this example, the KDC will use the fully-qualified hostname kerberos.example.org. The hostname of the KDC must be resolvable in the DNS. Kerberos can also use the DNS to locate KDCs, instead of a [realms] section in /etc/krb5.conf. For large organizations that have their own DNS servers, the above example could be trimmed to: [libdefaults] default_realm = EXAMPLE.ORG [domain_realm] .example.org = EXAMPLE.ORG With the following lines being included in the example.org zone file: _kerberos._udp IN SRV 01 00 88 kerberos.example.org. _kerberos._tcp IN SRV 01 00 88 kerberos.example.org. _kpasswd._udp IN SRV 01 00 464 kerberos.example.org. _kerberos-adm._tcp IN SRV 01 00 749 kerberos.example.org. _kerberos IN TXT EXAMPLE.ORG In order for clients to be able to find the Kerberos services, they must have either a fully configured /etc/krb5.conf or a minimally configured /etc/krb5.conf and a properly configured DNS server. Next, create the Kerberos database which contains the keys of all principals (users and hosts) encrypted with a master password. 
It is not required to remember this password as it will be stored in /var/heimdal/m-key; it would be reasonable to use a 45-character random password for this purpose. To create the master key, run kstash and enter a password: &prompt.root; kstash Master key: xxxxxxxxxxxxxxxxxxxxxxx Verifying password - Master key: xxxxxxxxxxxxxxxxxxxxxxx Once the master key has been created, the database should be initialized. The Kerberos administrative tool &man.kadmin.8; can be used on the KDC in a mode that operates directly on the database, without using the &man.kadmind.8; network service, as kadmin -l. This resolves the chicken-and-egg problem of trying to connect to the database before it is created. At the kadmin prompt, use init to create the realm's initial database: &prompt.root; kadmin -l kadmin> init EXAMPLE.ORG Realm max ticket life [unlimited]: Lastly, while still in kadmin, create the first principal using add. Stick to the default options for the principal for now, as these can be changed later with modify. Type ? at the prompt to see the available options. kadmin> add tillman Max ticket life [unlimited]: Max renewable life [unlimited]: Principal expiration time [never]: Password expiration time [never]: Attributes []: Password: xxxxxxxx Verifying password - Password: xxxxxxxx Next, start the KDC services by running: &prompt.root; service kdc start &prompt.root; service kadmind start While there will not be any kerberized daemons running at this point, it is possible to confirm that the KDC is functioning by obtaining a ticket for the principal that was just created: &prompt.user; kinit tillman tillman@EXAMPLE.ORG's Password: Confirm that a ticket was successfully obtained using klist: &prompt.user; klist Credentials cache: FILE:/tmp/krb5cc_1001 Principal: tillman@EXAMPLE.ORG Issued Expires Principal Aug 27 15:37:58 2013 Aug 28 01:37:58 2013 krbtgt/EXAMPLE.ORG@EXAMPLE.ORG The temporary ticket can be destroyed when the test is finished: &prompt.user; kdestroy Configuring a Server to Use <application>Kerberos</application> Kerberos5 enabling services The first step in configuring a server to use Kerberos authentication is to ensure that it has the correct configuration in /etc/krb5.conf. The version from the KDC can be used as-is, or it can be regenerated on the new system. Next, create /etc/krb5.keytab on the server. This is the main part of Kerberizing a service, as it corresponds to generating a secret shared between the service and the KDC. The secret is a cryptographic key, stored in a keytab. The keytab contains the server's host key, which allows it and the KDC to verify each other's identity. It must be transmitted to the server in a secure fashion, as the security of the server can be broken if the key is made public. Typically, the keytab is generated on an administrator's trusted machine using kadmin, then securely transferred to the server, e.g., with &man.scp.1;; it can also be created directly on the server if that is consistent with the desired security policy. It is very important that the keytab is transmitted to the server in a secure fashion: if the key is known by some other party, that party can impersonate any user to the server! Using kadmin on the server directly is convenient, because the entry for the host principal in the KDC database is also created using kadmin.
Of course, kadmin is a kerberized service; a Kerberos ticket is needed to authenticate to the network service, but to ensure that the user running kadmin is actually present (and their session has not been hijacked), kadmin will prompt for the password to get a fresh ticket. The principal authenticating to the kadmin service must be permitted to use the kadmin interface, as specified in /var/heimdal/kadmind.acl. See the section titled Remote administration in info heimdal for details on designing access control lists. Instead of enabling remote kadmin access, the administrator could securely connect to the KDC via the local console or &man.ssh.1;, and perform administration locally using kadmin -l. After installing /etc/krb5.conf, use add --random-key in kadmin. This adds the server's host principal to the database, but does not extract a copy of the host principal key to a keytab. To generate the keytab, use ext to extract the server's host principal key to its own keytab: &prompt.root; kadmin kadmin> add --random-key host/myserver.example.org Max ticket life [unlimited]: Max renewable life [unlimited]: Principal expiration time [never]: Password expiration time [never]: Attributes []: kadmin> ext_keytab host/myserver.example.org kadmin> exit Note that ext_keytab stores the extracted key in /etc/krb5.keytab by default. This is good when being run on the server being kerberized, but the --keytab path/to/file argument should be used when the keytab is being extracted elsewhere: &prompt.root; kadmin kadmin> ext_keytab --keytab=/tmp/example.keytab host/myserver.example.org kadmin> exit The keytab can then be securely copied to the server using &man.scp.1; or removable media. Be sure to specify a non-default keytab name to avoid inserting unneeded keys into the system's keytab. At this point, the server can read encrypted messages from the KDC using its shared key, stored in krb5.keytab. It is now ready for the Kerberos-using services to be enabled. One of the most common such services is &man.sshd.8;, which supports Kerberos via the GSS-API. In /etc/ssh/sshd_config, add the line: GSSAPIAuthentication yes After making this change, &man.sshd.8; must be restarted for the new configuration to take effect: service sshd restart. Configuring a Client to Use <application>Kerberos</application> Kerberos5 configure clients As it was for the server, the client requires configuration in /etc/krb5.conf. Copy the file in place (securely) or re-enter it as needed. Test the client by using kinit, klist, and kdestroy from the client to obtain, show, and then delete a ticket for an existing principal. Kerberos applications should also be able to connect to Kerberos-enabled servers. If that does not work but obtaining a ticket does, the problem is likely with the server and not with the client or the KDC. In the case of kerberized &man.ssh.1;, GSS-API is disabled by default, so test using ssh -o GSSAPIAuthentication=yes hostname. When testing a Kerberized application, try using a packet sniffer such as tcpdump to confirm that no sensitive information is sent in the clear. Various Kerberos client applications are available. With the advent of a bridge so that applications using SASL for authentication can use GSS-API mechanisms as well, large classes of client applications can use Kerberos for authentication, from Jabber clients to IMAP clients. .k5login .k5users Users within a realm typically have their Kerberos principal mapped to a local user account.
Occasionally, one needs to grant access to a local user account to someone who does not have a matching Kerberos principal. For example, tillman@EXAMPLE.ORG may need access to the local user account webdevelopers. Other principals may also need access to that local account. The .k5login and .k5users files, placed in a user's home directory, can be used to solve this problem. For example, if the following .k5login is placed in the home directory of webdevelopers, both principals listed will have access to that account without requiring a shared password: tillman@example.org jdoe@example.org Refer to &man.ksu.1; for more information about .k5users. <acronym>MIT</acronym> Differences The major difference between the MIT and Heimdal implementations is that kadmin has a different, but equivalent, set of commands and uses a different protocol. If the KDC is MIT, the Heimdal version of kadmin cannot be used to administer the KDC remotely, and vice versa. Client applications may also use slightly different command line options to accomplish the same tasks. Following the instructions at http://web.mit.edu/Kerberos/www/ is recommended. Be careful of path issues: the MIT port installs into /usr/local/ by default, and the &os; system applications run instead of the MIT versions if PATH lists the system directories first. When using MIT Kerberos as a KDC on &os;, the following edits should also be made to rc.conf: kdc_program="/usr/local/sbin/kdc" kadmind_program="/usr/local/sbin/kadmind" kdc_flags="" kdc_enable="YES" kadmind_enable="YES" <application>Kerberos</application> Tips, Tricks, and Troubleshooting When configuring and troubleshooting Kerberos, keep the following points in mind: When using either Heimdal or MIT Kerberos from ports, ensure that the PATH lists the port's versions of the client applications before the system versions. If all the computers in the realm do not have synchronized time settings, authentication may fail. describes how to synchronize clocks using NTP. If the hostname is changed, the host/ principal must be changed and the keytab updated. This also applies to special keytab entries like the HTTP/ principal used for Apache's www/mod_auth_kerb. All hosts in the realm must be both forward and reverse resolvable in DNS or, at a minimum, exist in /etc/hosts. CNAMEs will work, but the A and PTR records must be correct and in place. The error message for unresolvable hosts is not intuitive: Kerberos5 refuses authentication because Read req failed: Key table entry not found. Some operating systems that act as clients to the KDC do not set the permissions for ksu to be setuid root. This means that ksu does not work. This is a permissions problem, not a KDC error. With MIT Kerberos, to allow a principal to have a ticket life longer than the default lifetime of ten hours, use modify_principal at the &man.kadmin.8; prompt to change the maxlife of both the principal in question and the krbtgt principal. The principal can then use kinit -l to request a ticket with a longer lifetime. When running a packet sniffer on the KDC to aid in troubleshooting while running kinit from a workstation, the Ticket Granting Ticket (TGT) is sent immediately, even before the password is typed. This is because the Kerberos server freely transmits a TGT to any unauthorized request. However, every TGT is encrypted in a key derived from the user's password. When a user types their password, it is not sent to the KDC, it is instead used to decrypt the TGT that kinit already obtained. 
If the decryption process results in a valid ticket with a valid time stamp, the user has valid Kerberos credentials. These credentials include a session key for establishing secure communications with the Kerberos server in the future, as well as the actual TGT, which is encrypted with the Kerberos server's own key. This second layer of encryption allows the Kerberos server to verify the authenticity of each TGT. Host principals can have a longer ticket lifetime. If the user principal has a lifetime of a week but the host being connected to has a lifetime of nine hours, the user cache will have an expired host principal and the ticket cache will not work as expected. When setting up krb5.dict to prevent specific bad passwords from being used as described in &man.kadmind.8;, remember that it only applies to principals that have a password policy assigned to them. The format used in krb5.dict is one string per line. Creating a symbolic link to /usr/share/dict/words might be useful. Mitigating <application>Kerberos</application> Limitations Kerberos5 limitations and shortcomings Since Kerberos is an all-or-nothing approach, every service enabled on the network must either be modified to work with Kerberos or be otherwise secured against network attacks. This is to prevent user credentials from being stolen and re-used. An example is when Kerberos is enabled on all remote shells but the non-Kerberized POP3 mail server sends passwords in plain text. The KDC is a single point of failure. By design, the KDC must be as secure as its master password database. The KDC should have absolutely no other services running on it and should be physically secure. The danger is high because Kerberos stores all passwords encrypted with the same master key which is stored as a file on the KDC. A compromised master key is not quite as bad as one might fear. The master key is only used to encrypt the Kerberos database and as a seed for the random number generator. As long as access to the KDC is secure, an attacker cannot do much with the master key. If the KDC is unavailable, network services are unusable as authentication cannot be performed. This can be alleviated with a single master KDC and one or more slaves, and with careful implementation of secondary or fall-back authentication using PAM. Kerberos allows users, hosts, and services to authenticate between themselves. It does not have a mechanism to authenticate the KDC to the users, hosts, or services. This means that a trojaned kinit could record all user names and passwords. File system integrity checking tools like security/tripwire can alleviate this. Resources and Further Information Kerberos5 external resources The Kerberos FAQ Designing an Authentication System: a Dialog in Four Scenes RFC 4120, The Kerberos Network Authentication Service (V5) MIT Kerberos home page Heimdal Kerberos project wiki page OpenSSL TomRhodesWritten by security OpenSSL OpenSSL is an open source implementation of the SSL and TLS protocols. It provides an encryption transport layer on top of the normal communications layer, allowing it to be intertwined with many network applications and services. The version of OpenSSL included in &os; supports the Secure Sockets Layer 3.0 (SSLv3) and Transport Layer Security 1.0/1.1/1.2 (TLSv1/TLSv1.1/TLSv1.2) network security protocols and can be used as a general cryptographic library. In &os; 12.0-RELEASE and above, OpenSSL also supports Transport Layer Security 1.3 (TLSv1.3).
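The protocol versions actually available depend on which OpenSSL the system is using. As a quick check, the version string of the library that is first in the path can be printed; this is a minimal example, and the output shown here is illustrative and will differ from system to system: &prompt.user; openssl version OpenSSL 1.1.1k-freebsd 24 Aug 2021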
OpenSSL is often used to encrypt authentication of mail clients and to secure web based transactions such as credit card payments. Some ports, such as www/apache24 and databases/postgresql11-server, include a compile option for building with OpenSSL. If selected, the port will add support using OpenSSL from the base system. To instead have the port compile against OpenSSL from the security/openssl port, add the following to /etc/make.conf: DEFAULT_VERSIONS+= ssl=openssl Another common use of OpenSSL is to provide certificates for use with software applications. Certificates can be used to verify the credentials of a company or individual. If a certificate has not been signed by an external Certificate Authority (CA), such as http://www.verisign.com, the application that uses the certificate will produce a warning. There is a cost associated with obtaining a signed certificate, and using a signed certificate is not mandatory as certificates can be self-signed. However, using an external authority will prevent warnings and can put users at ease. This section demonstrates how to create and use certificates on a &os; system. Refer to for an example of how to create a CA for signing one's own certificates. For more information about SSL, read the free OpenSSL Cookbook. Generating Certificates OpenSSL certificate generation To generate a certificate that will be signed by an external CA, issue the following command and input the information requested at the prompts. This information will be written to the certificate. At the Common Name prompt, input the fully qualified name for the system that will use the certificate. If this name does not match the server, the application verifying the certificate will issue a warning to the user, rendering the verification provided by the certificate useless. &prompt.root; openssl req -new -nodes -out req.pem -keyout cert.key -sha256 -newkey rsa:2048 Generating a 2048 bit RSA private key ..................+++ .............................................................+++ writing new private key to 'cert.key' ----- You are about to be asked to enter information that will be incorporated into your certificate request. What you are about to enter is what is called a Distinguished Name or a DN. There are quite a few fields but you can leave some blank For some fields there will be a default value, If you enter '.', the field will be left blank. ----- Country Name (2 letter code) [AU]:US State or Province Name (full name) [Some-State]:PA Locality Name (eg, city) []:Pittsburgh Organization Name (eg, company) [Internet Widgits Pty Ltd]:My Company Organizational Unit Name (eg, section) []:Systems Administrator Common Name (eg, YOUR name) []:localhost.example.org Email Address []:trhodes@FreeBSD.org Please enter the following 'extra' attributes to be sent with your certificate request A challenge password []: An optional company name []:Another Name Other options, such as the expire time and alternate encryption algorithms, are available when creating a certificate. A complete list of options is described in &man.openssl.1;. This command will create two files in the current directory. The certificate request, req.pem, can be sent to a CA who will validate the entered credentials, sign the request, and return the signed certificate. The second file, cert.key, is the private key for the certificate and should be stored in a secure location. If it falls into the hands of others, it can be used to impersonate the user or the server.
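Before sending req.pem to the CA, it can be worth confirming that the request contains the intended information. The following command is a minimal example using the request file generated above; it checks the signature on the request and prints its fields for review: &prompt.user; openssl req -in req.pem -noout -text -verify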
Alternately, if a signature from a CA is not required, a self-signed certificate can be created. First, generate the RSA key: &prompt.root; openssl genrsa -rand -genkey -out cert.key 2048 0 semi-random bytes loaded Generating RSA private key, 2048 bit long modulus .............................................+++ .................................................................................................................+++ e is 65537 (0x10001) Use this key to create a self-signed certificate. Follow the usual prompts for creating a certificate: &prompt.root; openssl req -new -x509 -days 365 -key cert.key -out cert.crt -sha256 You are about to be asked to enter information that will be incorporated into your certificate request. What you are about to enter is what is called a Distinguished Name or a DN. There are quite a few fields but you can leave some blank For some fields there will be a default value, If you enter '.', the field will be left blank. ----- Country Name (2 letter code) [AU]:US State or Province Name (full name) [Some-State]:PA Locality Name (eg, city) []:Pittsburgh Organization Name (eg, company) [Internet Widgits Pty Ltd]:My Company Organizational Unit Name (eg, section) []:Systems Administrator Common Name (e.g. server FQDN or YOUR name) []:localhost.example.org Email Address []:trhodes@FreeBSD.org This will create two new files in the current directory: a private key file cert.key, and the certificate itself, cert.crt. These should be placed in a directory, preferably under /etc/ssl/, which is readable only by root. Permissions of 0700 are appropriate for these files and can be set using chmod. Using Certificates One use for a certificate is to encrypt connections to the Sendmail mail server in order to prevent the use of clear text authentication. Some mail clients will display an error if the user has not installed a local copy of the certificate. Refer to the documentation included with the software for more information on certificate installation. In &os; 10.0-RELEASE and above, it is possible to create a self-signed certificate for Sendmail automatically. To enable this, add the following lines to /etc/rc.conf: sendmail_enable="YES" sendmail_cert_create="YES" sendmail_cert_cn="localhost.example.org" This will automatically create a self-signed certificate, /etc/mail/certs/host.cert, a signing key, /etc/mail/certs/host.key, and a CA certificate, /etc/mail/certs/cacert.pem. The certificate will use the Common Name specified in sendmail_cert_cn. After saving the edits, restart Sendmail: &prompt.root; service sendmail restart If all went well, there will be no error messages in /var/log/maillog. For a simple test, connect to the mail server's listening port using telnet: &prompt.root; telnet example.com 25 Trying 192.0.34.166... Connected to example.com. Escape character is '^]'. 220 example.com ESMTP Sendmail 8.14.7/8.14.7; Fri, 18 Apr 2014 11:50:32 -0400 (EDT) ehlo example.com 250-example.com Hello example.com [192.0.34.166], pleased to meet you 250-ENHANCEDSTATUSCODES 250-PIPELINING 250-8BITMIME 250-SIZE 250-DSN 250-ETRN 250-AUTH LOGIN PLAIN 250-STARTTLS 250-DELIVERBY 250 HELP quit 221 2.0.0 example.com closing connection Connection closed by foreign host. If the STARTTLS line appears in the output, everything is working correctly. <acronym>VPN</acronym> over <acronym>IPsec</acronym> Nik Clayton
nik@FreeBSD.org
Written by
Hiten M. Pandya
hmp@FreeBSD.org
Written by
IPsec Internet Protocol Security (IPsec) is a set of protocols which sit on top of the Internet Protocol (IP) layer. It allows two or more hosts to communicate in a secure manner by authenticating and encrypting each IP packet of a communication session. The &os; IPsec network stack is based on the http://www.kame.net/ implementation and supports both IPv4 and IPv6 sessions. IPsec ESP IPsec AH IPsec is composed of the following sub-protocols: Encapsulated Security Payload (ESP): this protocol protects the IP packet data from third party interference by encrypting the contents using symmetric cryptography algorithms such as Blowfish and 3DES. Authentication Header (AH): this protocol protects the IP packet header from third party interference and spoofing by computing a cryptographic checksum and hashing the IP packet header fields with a secure hashing function. This is then followed by an additional header that contains the hash, to allow the information in the packet to be authenticated. IP Payload Compression Protocol (IPComp): this protocol tries to increase communication performance by compressing the IP payload in order to reduce the amount of data sent. These protocols can either be used together or separately, depending on the environment. VPN virtual private network VPN IPsec supports two modes of operation. The first mode, Transport Mode, protects communications between two hosts. The second mode, Tunnel Mode, is used to build virtual tunnels, commonly known as Virtual Private Networks (VPNs). Consult &man.ipsec.4; for detailed information on the IPsec subsystem in &os;. IPsec support is enabled by default on &os; 11 and later. For previous versions of &os;, add these options to a custom kernel configuration file and rebuild the kernel using the instructions in : kernel options IPSEC options IPSEC #IP security device crypto kernel options IPSEC_DEBUG If IPsec debugging support is desired, the following kernel option should also be added: options IPSEC_DEBUG #debug for IP security The rest of this chapter demonstrates the process of setting up an IPsec VPN between a home network and a corporate network. In the example scenario: Both sites are connected to the Internet through a gateway that is running &os;. The gateway on each network has at least one external IP address. In this example, the corporate LAN's external IP address is 172.16.5.4 and the home LAN's external IP address is 192.168.1.12. The internal addresses of the two networks can be either public or private IP addresses. However, the address space must not collide. For example, both networks cannot use 192.168.1.x. In this example, the corporate LAN's internal IP address is 10.246.38.1 and the home LAN's internal IP address is 10.0.0.5. Configuring a <acronym>VPN</acronym> on &os; Tom Rhodes
trhodes@FreeBSD.org
Written by
To begin, security/ipsec-tools must be installed from the Ports Collection. This software provides a number of applications which support the configuration. The next requirement is to create two &man.gif.4; pseudo-devices which will be used to tunnel packets and allow both networks to communicate properly. As root, run the following commands, replacing internal and external with the real IP addresses of the internal and external interfaces of the two gateways: &prompt.root; ifconfig gif0 create &prompt.root; ifconfig gif0 internal1 internal2 &prompt.root; ifconfig gif0 tunnel external1 external2 Verify the setup on each gateway using ifconfig. Here is the output from Gateway 1: gif0: flags=8051 mtu 1280 tunnel inet 172.16.5.4 --> 192.168.1.12 inet6 fe80::2e0:81ff:fe02:5881%gif0 prefixlen 64 scopeid 0x6 inet 10.246.38.1 --> 10.0.0.5 netmask 0xffffff00 Here is the output from Gateway 2: gif0: flags=8051 mtu 1280 tunnel inet 192.168.1.12 --> 172.16.5.4 inet 10.0.0.5 --> 10.246.38.1 netmask 0xffffff00 inet6 fe80::250:bfff:fe3a:c1f%gif0 prefixlen 64 scopeid 0x4 Once complete, both internal IP addresses should be reachable using &man.ping.8;: priv-net&prompt.root; ping 10.0.0.5 PING 10.0.0.5 (10.0.0.5): 56 data bytes 64 bytes from 10.0.0.5: icmp_seq=0 ttl=64 time=42.786 ms 64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=19.255 ms 64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=20.440 ms 64 bytes from 10.0.0.5: icmp_seq=3 ttl=64 time=21.036 ms --- 10.0.0.5 ping statistics --- 4 packets transmitted, 4 packets received, 0% packet loss round-trip min/avg/max/stddev = 19.255/25.879/42.786/9.782 ms corp-net&prompt.root; ping 10.246.38.1 PING 10.246.38.1 (10.246.38.1): 56 data bytes 64 bytes from 10.246.38.1: icmp_seq=0 ttl=64 time=28.106 ms 64 bytes from 10.246.38.1: icmp_seq=1 ttl=64 time=42.917 ms 64 bytes from 10.246.38.1: icmp_seq=2 ttl=64 time=127.525 ms 64 bytes from 10.246.38.1: icmp_seq=3 ttl=64 time=119.896 ms 64 bytes from 10.246.38.1: icmp_seq=4 ttl=64 time=154.524 ms --- 10.246.38.1 ping statistics --- 5 packets transmitted, 5 packets received, 0% packet loss round-trip min/avg/max/stddev = 28.106/94.594/154.524/49.814 ms As expected, both sides have the ability to send and receive ICMP packets from the privately configured addresses. Next, both gateways must be told how to route packets in order to correctly send traffic from either network. The following commands will achieve this goal; the second line in each pair is the output printed by &man.route.8;: corp-net&prompt.root; route add 10.0.0.0 10.0.0.5 255.255.255.0 add net 10.0.0.0: gateway 10.0.0.5 priv-net&prompt.root; route add 10.246.38.0 10.246.38.1 255.255.255.0 add host 10.246.38.0: gateway 10.246.38.1 At this point, internal machines should be reachable from each gateway as well as from machines behind the gateways.
Again, use &man.ping.8; to confirm: corp-net&prompt.root; ping 10.0.0.8 PING 10.0.0.8 (10.0.0.8): 56 data bytes 64 bytes from 10.0.0.8: icmp_seq=0 ttl=63 time=92.391 ms 64 bytes from 10.0.0.8: icmp_seq=1 ttl=63 time=21.870 ms 64 bytes from 10.0.0.8: icmp_seq=2 ttl=63 time=198.022 ms 64 bytes from 10.0.0.8: icmp_seq=3 ttl=63 time=22.241 ms 64 bytes from 10.0.0.8: icmp_seq=4 ttl=63 time=174.705 ms --- 10.0.0.8 ping statistics --- 5 packets transmitted, 5 packets received, 0% packet loss round-trip min/avg/max/stddev = 21.870/101.846/198.022/74.001 ms priv-net&prompt.root; ping 10.246.38.107 PING 10.246.38.107 (10.246.38.107): 56 data bytes 64 bytes from 10.246.38.107: icmp_seq=0 ttl=64 time=53.491 ms 64 bytes from 10.246.38.107: icmp_seq=1 ttl=64 time=23.395 ms 64 bytes from 10.246.38.107: icmp_seq=2 ttl=64 time=23.865 ms 64 bytes from 10.246.38.107: icmp_seq=3 ttl=64 time=21.145 ms 64 bytes from 10.246.38.107: icmp_seq=4 ttl=64 time=36.708 ms --- 10.246.38.107 ping statistics --- 5 packets transmitted, 5 packets received, 0% packet loss round-trip min/avg/max/stddev = 21.145/31.721/53.491/12.179 ms Setting up the tunnels is the easy part. Configuring a secure link is a more in-depth process. The following configuration uses pre-shared keys (PSK). Other than the IP addresses, the /usr/local/etc/racoon/racoon.conf on both gateways will be identical and look similar to: path pre_shared_key "/usr/local/etc/racoon/psk.txt"; #location of pre-shared key file log debug; #log verbosity setting: set to 'notify' when testing and debugging is complete padding # options are not to be changed { maximum_length 20; randomize off; strict_check off; exclusive_tail off; } timer # timing options. change as needed { counter 5; interval 20 sec; persend 1; # natt_keepalive 15 sec; phase1 30 sec; phase2 15 sec; } listen # address [port] that racoon will listen on { isakmp 172.16.5.4 [500]; isakmp_natt 172.16.5.4 [4500]; } remote 192.168.1.12 [500] { exchange_mode main,aggressive; doi ipsec_doi; situation identity_only; my_identifier address 172.16.5.4; peers_identifier address 192.168.1.12; lifetime time 8 hour; passive off; proposal_check obey; # nat_traversal off; generate_policy off; proposal { encryption_algorithm blowfish; hash_algorithm md5; authentication_method pre_shared_key; lifetime time 30 sec; dh_group 1; } } sainfo (address 10.246.38.0/24 any address 10.0.0.0/24 any) # address $network/$netmask $type address $network/$netmask $type ( $type being any or esp) { # $network must be the two internal networks you are joining. pfs_group 1; lifetime time 36000 sec; encryption_algorithm blowfish,3des; authentication_algorithm hmac_md5,hmac_sha1; compression_algorithm deflate; } For descriptions of each available option, refer to the manual page for racoon.conf. The Security Policy Database (SPD) needs to be configured so that &os; and racoon are able to encrypt and decrypt network traffic between the hosts. This can be achieved with a shell script, similar to the following, on the corporate gateway. This file will be used during system initialization and should be saved as /usr/local/etc/racoon/setkey.conf.
flush; spdflush; # To the home network spdadd 10.246.38.0/24 10.0.0.0/24 any -P out ipsec esp/tunnel/172.16.5.4-192.168.1.12/use; spdadd 10.0.0.0/24 10.246.38.0/24 any -P in ipsec esp/tunnel/192.168.1.12-172.16.5.4/use; Once in place, racoon may be started on both gateways using the following command: &prompt.root; /usr/local/sbin/racoon -F -f /usr/local/etc/racoon/racoon.conf -l /var/log/racoon.log The output should be similar to the following: corp-net&prompt.root; /usr/local/sbin/racoon -F -f /usr/local/etc/racoon/racoon.conf Foreground mode. 2006-01-30 01:35:47: INFO: begin Identity Protection mode. 2006-01-30 01:35:48: INFO: received Vendor ID: KAME/racoon 2006-01-30 01:35:55: INFO: received Vendor ID: KAME/racoon 2006-01-30 01:36:04: INFO: ISAKMP-SA established 172.16.5.4[500]-192.168.1.12[500] spi:623b9b3bd2492452:7deab82d54ff704a 2006-01-30 01:36:05: INFO: initiate new phase 2 negotiation: 172.16.5.4[0]192.168.1.12[0] 2006-01-30 01:36:09: INFO: IPsec-SA established: ESP/Tunnel 192.168.1.12[0]->172.16.5.4[0] spi=28496098(0x1b2d0e2) 2006-01-30 01:36:09: INFO: IPsec-SA established: ESP/Tunnel 172.16.5.4[0]->192.168.1.12[0] spi=47784998(0x2d92426) 2006-01-30 01:36:13: INFO: respond new phase 2 negotiation: 172.16.5.4[0]192.168.1.12[0] 2006-01-30 01:36:18: INFO: IPsec-SA established: ESP/Tunnel 192.168.1.12[0]->172.16.5.4[0] spi=124397467(0x76a279b) 2006-01-30 01:36:18: INFO: IPsec-SA established: ESP/Tunnel 172.16.5.4[0]->192.168.1.12[0] spi=175852902(0xa7b4d66) To ensure the tunnel is working properly, switch to another console and use &man.tcpdump.1; to view network traffic using the following command. Replace em0 with the network interface card as required: &prompt.root; tcpdump -i em0 host 172.16.5.4 and dst 192.168.1.12 Data similar to the following should appear on the console. If not, there is an issue and debugging the returned data will be required. 01:47:32.021683 IP corporatenetwork.com > 192.168.1.12.privatenetwork.com: ESP(spi=0x02acbf9f,seq=0xa) 01:47:33.022442 IP corporatenetwork.com > 192.168.1.12.privatenetwork.com: ESP(spi=0x02acbf9f,seq=0xb) 01:47:34.024218 IP corporatenetwork.com > 192.168.1.12.privatenetwork.com: ESP(spi=0x02acbf9f,seq=0xc) At this point, both networks should be available and seem to be part of the same network. Most likely both networks are protected by a firewall. To allow traffic to flow between them, rules need to be added to pass packets. For the &man.ipfw.8; firewall, add the following lines to the firewall configuration file: ipfw add 00201 allow log esp from any to any ipfw add 00202 allow log ah from any to any ipfw add 00203 allow log ipencap from any to any ipfw add 00204 allow log udp from any 500 to any The rule numbers may need to be altered depending on the current host configuration. 
For users of &man.pf.4; or &man.ipf.8;, the following rules should do the trick: pass in quick proto esp from any to any pass in quick proto ah from any to any pass in quick proto ipencap from any to any pass in quick proto udp from any port = 500 to any port = 500 pass in quick on gif0 from any to any pass out quick proto esp from any to any pass out quick proto ah from any to any pass out quick proto ipencap from any to any pass out quick proto udp from any port = 500 to any port = 500 pass out quick on gif0 from any to any Finally, to allow the machine to start support for the VPN during system initialization, add the following lines to /etc/rc.conf: ipsec_enable="YES" ipsec_program="/usr/local/sbin/setkey" ipsec_file="/usr/local/etc/racoon/setkey.conf" # allows setting up spd policies on boot racoon_enable="yes"
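To verify that the kernel holds the negotiated keys and policies, the contents of the Security Association Database and the Security Policy Database can be dumped with &man.setkey.8;. This is a quick verification sketch rather than part of the configuration itself; entries will only appear after racoon has completed its negotiation: &prompt.root; setkey -D &prompt.root; setkey -DP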
OpenSSH ChernLeeContributed by OpenSSH security OpenSSH OpenSSH is a set of network connectivity tools used to provide secure access to remote machines. Additionally, TCP/IP connections can be tunneled or forwarded securely through SSH connections. OpenSSH encrypts all traffic to effectively eliminate eavesdropping, connection hijacking, and other network-level attacks. OpenSSH is maintained by the OpenBSD project and is installed by default in &os;. It is compatible with both SSH version 1 and 2 protocols. When data is sent over the network in an unencrypted form, network sniffers anywhere in between the client and server can steal user/password information or data transferred during the session. OpenSSH offers a variety of authentication and encryption methods to prevent this from happening. More information about OpenSSH is available from http://www.openssh.com/. This section provides an overview of the built-in client utilities to securely access other systems and securely transfer files from a &os; system. It then describes how to configure a SSH server on a &os; system. More information is available in the man pages mentioned in this chapter. Using the SSH Client Utilities OpenSSH client To log into a SSH server, use ssh and specify a username that exists on that server and the IP address or hostname of the server. If this is the first time a connection has been made to the specified server, the user will be prompted to first verify the server's fingerprint: &prompt.root; ssh user@example.com The authenticity of host 'example.com (10.0.0.1)' can't be established. ECDSA key fingerprint is 25:cc:73:b5:b3:96:75:3d:56:19:49:d2:5c:1f:91:3b. Are you sure you want to continue connecting (yes/no)? yes Permanently added 'example.com' (ECDSA) to the list of known hosts. Password for user@example.com: user_password SSH utilizes a key fingerprint system to verify the authenticity of the server when the client connects. When the user accepts the key's fingerprint by typing yes when connecting for the first time, a copy of the key is saved to .ssh/known_hosts in the user's home directory. Future attempts to log in are verified against the saved key and ssh will display an alert if the server's key does not match the saved key. If this occurs, the user should first verify why the key has changed before continuing with the connection. By default, recent versions of OpenSSH only accept SSHv2 connections. The client will use version 2 if possible and will fall back to version 1 if the server does not support version 2. To force ssh to only use the specified protocol, include -1 or -2. Additional options are described in &man.ssh.1;. OpenSSH secure copy &man.scp.1; Use &man.scp.1; to securely copy a file to or from a remote machine. This example copies COPYRIGHT on the remote system to a file of the same name in the current directory of the local system: &prompt.root; scp user@example.com:/COPYRIGHT COPYRIGHT Password for user@example.com: ******* COPYRIGHT 100% |*****************************| 4735 00:00 &prompt.root; Since the fingerprint was already verified for this host, the server's key is automatically checked before prompting for the user's password. The arguments passed to scp are similar to cp. The file or files to copy is the first argument and the destination to copy to is the second. Since the file is fetched over the network, one or more of the file arguments takes the form user@host:path_to_remote_file. Be aware when copying directories recursively that scp uses -r, whereas cp uses -R.
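Copying in the other direction works the same way, with the remote form as the destination argument. This hypothetical example pushes a local directory to the remote machine using the recursive flag just mentioned; the path names are illustrative: &prompt.user; scp -r ~/documents user@example.com:backup/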
To open an interactive session for copying files, use sftp. Refer to &man.sftp.1; for a list of available commands while in an sftp session. Key-based Authentication Instead of using passwords, a client can be configured to connect to the remote machine using keys. To generate RSA authentication keys, use ssh-keygen. To generate a public and private key pair, specify the type of key and follow the prompts. It is recommended to protect the keys with a memorable but hard-to-guess passphrase. &prompt.user; ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/home/user/.ssh/id_rsa): Enter passphrase (empty for no passphrase): Enter same passphrase again: Your identification has been saved in /home/user/.ssh/id_rsa. Your public key has been saved in /home/user/.ssh/id_rsa.pub. The key fingerprint is: SHA256:54Xm9Uvtv6H4NOo6yjP/YCfODryvUU7yWHzMqeXwhq8 user@host.example.com The key's randomart image is: +---[RSA 2048]----+ | | | | | | | . o.. | | .S*+*o | | . O=Oo . . | | = Oo= oo..| | .oB.* +.oo.| | =OE**.o..=| +----[SHA256]-----+ Type a passphrase here. It can contain spaces and symbols. Retype the passphrase to verify it. The private key is stored in ~/.ssh/id_rsa and the public key is stored in ~/.ssh/id_rsa.pub. The public key must be copied to ~/.ssh/authorized_keys on the remote machine for key-based authentication to work. Many users believe that keys are secure by design and will use a key without a passphrase. This is dangerous behavior. An administrator can verify that a key pair is protected by a passphrase by viewing the private key manually. If the private key file contains the word ENCRYPTED, the key owner is using a passphrase. In addition, to better secure end users, the from option may be placed in the public key file. For example, adding from="192.168.10.5" in front of the ssh-rsa prefix will only allow that specific user to log in from that IP address. The options and files vary with different versions of OpenSSH. To avoid problems, consult &man.ssh-keygen.1;. If a passphrase is used, the user is prompted for the passphrase each time a connection is made to the server. To load SSH keys into memory and remove the need to type the passphrase each time, use &man.ssh-agent.1; and &man.ssh-add.1;. Authentication is handled by ssh-agent, using the private keys that are loaded into it. ssh-agent can be used to launch another application like a shell or a window manager. To use ssh-agent in a shell, start it with a shell as an argument. Add the identity by running ssh-add and entering the passphrase for the private key. The user will then be able to ssh to any host that has the corresponding public key installed. For example: &prompt.user; ssh-agent csh &prompt.user; ssh-add Enter passphrase for key '/usr/home/user/.ssh/id_rsa': Identity added: /usr/home/user/.ssh/id_rsa (/usr/home/user/.ssh/id_rsa) &prompt.user; Enter the passphrase for the key. To use ssh-agent in &xorg;, add an entry for it in ~/.xinitrc. This provides the ssh-agent services to all programs launched in &xorg;. An example ~/.xinitrc might look like this: exec ssh-agent startxfce4 This launches ssh-agent, which in turn launches XFCE, every time &xorg; starts. Once &xorg; has been restarted so that the changes can take effect, run ssh-add to load all of the SSH keys. <acronym>SSH</acronym> Tunneling OpenSSH tunneling OpenSSH has the ability to create a tunnel to encapsulate another protocol in an encrypted session.
The following command tells ssh to create a tunnel for telnet: &prompt.user; ssh -2 -N -f -L 5023:localhost:23 user@foo.example.com &prompt.user; This example uses the following options: -2 Forces ssh to use version 2 to connect to the server. -N Indicates no command, or tunnel only. If omitted, ssh initiates a normal session. -f Forces ssh to run in the background. -L Indicates a local tunnel in localport:remotehost:remoteport format. user@foo.example.com The login name to use on the specified remote SSH server. An SSH tunnel works by creating a listen socket on localhost on the specified localport. It then forwards any connections received on localport via the SSH connection to the specified remotehost:remoteport. In the example, port 5023 on the client is forwarded to port 23 on the remote machine. Since port 23 is used by telnet, this creates an encrypted telnet session through an SSH tunnel. This method can be used to wrap any number of insecure TCP protocols such as SMTP, POP3, and FTP, as seen in the following examples. Create a Secure Tunnel for <acronym>SMTP</acronym> &prompt.user; ssh -2 -N -f -L 5025:localhost:25 user@mailserver.example.com user@mailserver.example.com's password: ***** &prompt.user; telnet localhost 5025 Trying 127.0.0.1... Connected to localhost. Escape character is '^]'. 220 mailserver.example.com ESMTP This can be used in conjunction with ssh-keygen and additional user accounts to create a more seamless SSH tunneling environment. Keys can be used in place of typing a password, and the tunnels can be run as a separate user. Secure Access of a <acronym>POP3</acronym> Server In this example, there is an SSH server that accepts connections from the outside. On the same network resides a mail server running a POP3 server. To check email in a secure manner, create an SSH connection to the SSH server and tunnel through to the mail server: &prompt.user; ssh -2 -N -f -L 2110:mail.example.com:110 user@ssh-server.example.com user@ssh-server.example.com's password: ****** Once the tunnel is up and running, point the email client to send POP3 requests to localhost on port 2110. This connection will be forwarded securely across the tunnel to mail.example.com. Bypassing a Firewall Some firewalls filter both incoming and outgoing connections. For example, a firewall might limit access from remote machines to ports 22 and 80 to only allow SSH and web surfing. This prevents access to any other service which uses a port other than 22 or 80. The solution is to create an SSH connection to a machine outside of the network's firewall and use it to tunnel to the desired service: &prompt.user; ssh -2 -N -f -L 8888:music.example.com:8000 user@unfirewalled-system.example.org user@unfirewalled-system.example.org's password: ******* In this example, a streaming Ogg Vorbis client can now be pointed to localhost port 8888, which will be forwarded over to music.example.com on port 8000, successfully bypassing the firewall. Enabling the SSH Server OpenSSH enabling In addition to providing built-in SSH client utilities, a &os; system can be configured as an SSH server, accepting connections from other SSH clients. To see if sshd is operating, use the &man.service.8; command: &prompt.root; service sshd status If the service is not running, add the following line to /etc/rc.conf. sshd_enable="YES" This will start sshd, the daemon program for OpenSSH, the next time the system boots.
To start it now: &prompt.root; service sshd start The first time sshd starts on a &os; system, the system's host keys will be automatically created and the fingerprint will be displayed on the console. Provide users with the fingerprint so that they can verify it the first time they connect to the server. Refer to &man.sshd.8; for the list of available options when starting sshd and a more complete discussion about authentication, the login process, and the various configuration files. At this point, sshd should be available to all users with a username and password on the system. SSH Server Security While sshd is the most widely used remote administration facility for &os;, brute-force and drive-by attacks are common to any system exposed to public networks. Several additional parameters are available to prevent the success of these attacks and will be described in this section. It is a good idea to limit which users can log into the SSH server and from where using the AllowUsers keyword in the OpenSSH server configuration file. For example, to only allow root to log in from 192.168.1.32, add this line to /etc/ssh/sshd_config: AllowUsers root@192.168.1.32 To allow admin to log in from anywhere, list that user without specifying an IP address: AllowUsers admin Multiple users should be listed on the same line, like so: AllowUsers root@192.168.1.32 admin After making changes to /etc/ssh/sshd_config, tell sshd to reload its configuration file by running: &prompt.root; service sshd reload When this keyword is used, it is important to list each user that needs to log into this machine. Any user that is not specified in that line will be locked out. Also, the keywords used in the OpenSSH server configuration file are case-sensitive. If the keyword is not spelled correctly, including its case, it will be ignored. Always test changes to this file to make sure that the edits are working as expected. Refer to &man.sshd.config.5; to verify the spelling and use of the available keywords. In addition, users may be forced to use two-factor authentication via the use of a public and private key. When required, the user may generate a key pair through the use of &man.ssh-keygen.1; and send the administrator the public key. This key file will be placed in authorized_keys as described above in the client section. To force the users to use keys only, the following option may be configured: AuthenticationMethods publickey Do not confuse /etc/ssh/sshd_config with /etc/ssh/ssh_config (note the extra d in the first filename). The first file configures the server and the second file configures the client. Refer to &man.ssh.config.5; for a listing of the available client settings. Access Control Lists TomRhodesContributed by ACL Access Control Lists (ACLs) extend the standard &unix; permission model in a &posix;.1e compatible way. This permits an administrator to take advantage of a more fine-grained permissions model. The &os; GENERIC kernel provides ACL support for UFS file systems. Users who prefer to compile a custom kernel must include the following option in their custom kernel configuration file: options UFS_ACL If this option is not compiled in, a warning message will be displayed when attempting to mount a file system with ACL support. ACLs rely on extended attributes which are natively supported in UFS2. This chapter describes how to enable ACL support and provides some usage examples.
Enabling <acronym>ACL</acronym> Support ACLs are enabled by the mount-time administrative flag, acls, which may be added to /etc/fstab. The mount-time flag can also be automatically set in a persistent manner using &man.tunefs.8; to modify a superblock ACLs flag in the file system header. In general, it is preferred to use the superblock flag for several reasons: The superblock flag cannot be changed by a remount using mount -u as it requires a complete umount and fresh mount. This means that ACLs cannot be enabled on the root file system after boot. It also means that ACL support on a file system cannot be changed while the system is in use. Setting the superblock flag causes the file system to always be mounted with ACLs enabled, even if there is not an fstab entry or if the devices re-order. This prevents accidental mounting of the file system without ACL support. It is desirable to discourage accidental mounting without ACLs enabled because nasty things can happen if ACLs are enabled, then disabled, then re-enabled without flushing the extended attributes. In general, once ACLs are enabled on a file system, they should not be disabled, as the resulting file protections may not be compatible with those intended by the users of the system, and re-enabling ACLs may re-attach the previous ACLs to files that have since had their permissions changed, resulting in unpredictable behavior. File systems with ACLs enabled will show a plus (+) sign in their permission settings: drwx------ 2 robert robert 512 Dec 27 11:54 private drwxrwx---+ 2 robert robert 512 Dec 23 10:57 directory1 drwxrwx---+ 2 robert robert 512 Dec 22 10:20 directory2 drwxrwx---+ 2 robert robert 512 Dec 27 11:57 directory3 drwxr-xr-x 2 robert robert 512 Nov 10 11:54 public_html In this example, directory1, directory2, and directory3 are all taking advantage of ACLs, whereas private and public_html are not. Using <acronym>ACL</acronym>s File system ACLs can be viewed using getfacl. For instance, to view the ACL settings on test: &prompt.user; getfacl test #file:test #owner:1001 #group:1001 user::rw- group::r-- other::r-- To change the ACL settings on this file, use setfacl. To remove all of the currently defined ACLs from a file or file system, include -k. However, the preferred method is to use -b as it leaves the basic fields required for ACLs to work. &prompt.user; setfacl -k test To modify the default ACL entries, use -m: &prompt.user; setfacl -m u:trhodes:rwx,group:web:r--,o::--- test In this example, there were no pre-defined entries, as they were removed by the previous command. This command restores the default options and assigns the options listed. If a user or group is added which does not exist on the system, an Invalid argument error will be displayed. Refer to &man.getfacl.1; and &man.setfacl.1; for more information about the options available for these commands. Monitoring Third Party Security Issues TomRhodesContributed by pkg In recent years, the security world has made many improvements to how vulnerability assessment is handled. The threat of system intrusion increases as third party utilities are installed and configured for virtually any operating system available today. Vulnerability assessment is a key factor in security. While &os; releases advisories for the base system, doing so for every third party utility is beyond the &os; Project's capability. There is a way to mitigate third party vulnerabilities and warn administrators of known security issues.
A &os; add-on utility known as pkg includes options explicitly for this purpose. pkg polls a database for security issues. The database is updated and maintained by the &os; Security Team and ports developers. Please refer to the instructions for installing pkg. Installation provides &man.periodic.8; configuration files for maintaining the pkg audit database, and provides a programmatic method of keeping it updated. This functionality is enabled if daily_status_security_pkgaudit_enable is set to YES in &man.periodic.conf.5;. Ensure that daily security run emails, which are sent to root's email account, are being read. After installation, an administrator can update the database and view known vulnerabilities of installed third party packages at any time by invoking: &prompt.root; pkg audit -F pkg displays messages for any published vulnerabilities in installed packages: Affected package: cups-base-1.1.22.0_1 Type of problem: cups-base -- HPGL buffer overflow vulnerability. Reference: <https://www.FreeBSD.org/ports/portaudit/40a3bca2-6809-11d9-a9e7-0001020eed82.html> 1 problem(s) in your installed packages found. You are advised to update or deinstall the affected package(s) immediately. By pointing a web browser to the displayed URL, an administrator may obtain more information about the vulnerability. This will include the versions affected, by &os; port version, along with other web sites which may contain security advisories. pkg is a powerful utility and is extremely useful when coupled with ports-mgmt/portmaster. &os; Security Advisories TomRhodesContributed by &os; Security Advisories Like many producers of quality operating systems, the &os; Project has a security team which is responsible for determining the End-of-Life (EoL) date for each &os; release and for providing security updates for supported releases which have not yet reached their EoL. More information about the &os; security team and the supported releases is available on the &os; security page. One task of the security team is to respond to reported security vulnerabilities in the &os; operating system. Once a vulnerability is confirmed, the security team verifies the steps necessary to fix the vulnerability and updates the source code with the fix. It then publishes the details as a Security Advisory. Security advisories are published on the &os; website and mailed to the &a.security-notifications.name;, &a.security.name;, and &a.announce.name; mailing lists. This section describes the format of a &os; security advisory.
Format of a Security Advisory Here is an example of a &os; security advisory: ============================================================================= -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 ============================================================================= FreeBSD-SA-14:04.bind Security Advisory The FreeBSD Project Topic: BIND remote denial of service vulnerability Category: contrib Module: bind Announced: 2014-01-14 Credits: ISC Affects: FreeBSD 8.x and FreeBSD 9.x Corrected: 2014-01-14 19:38:37 UTC (stable/9, 9.2-STABLE) 2014-01-14 19:42:28 UTC (releng/9.2, 9.2-RELEASE-p3) 2014-01-14 19:42:28 UTC (releng/9.1, 9.1-RELEASE-p10) 2014-01-14 19:38:37 UTC (stable/8, 8.4-STABLE) 2014-01-14 19:42:28 UTC (releng/8.4, 8.4-RELEASE-p7) 2014-01-14 19:42:28 UTC (releng/8.3, 8.3-RELEASE-p14) CVE Name: CVE-2014-0591 For general information regarding FreeBSD Security Advisories, including descriptions of the fields above, security branches, and the following sections, please visit <URL:http://security.FreeBSD.org/>. I. Background BIND 9 is an implementation of the Domain Name System (DNS) protocols. The named(8) daemon is an Internet Domain Name Server. II. Problem Description Because of a defect in handling queries for NSEC3-signed zones, BIND can crash with an "INSIST" failure in name.c when processing queries possessing certain properties. This issue only affects authoritative nameservers with at least one NSEC3-signed zone. Recursive-only servers are not at risk. III. Impact An attacker who can send a specially crafted query could cause named(8) to crash, resulting in a denial of service. IV. Workaround No workaround is available, but systems not running authoritative DNS service with at least one NSEC3-signed zone using named(8) are not vulnerable. V. Solution Perform one of the following: 1) Upgrade your vulnerable system to a supported FreeBSD stable or release / security branch (releng) dated after the correction date. 2) To update your vulnerable system via a source code patch: The following patches have been verified to apply to the applicable FreeBSD release branches. a) Download the relevant patch from the location below, and verify the detached PGP signature using your PGP utility. [FreeBSD 8.3, 8.4, 9.1, 9.2-RELEASE and 8.4-STABLE] # fetch http://security.FreeBSD.org/patches/SA-14:04/bind-release.patch # fetch http://security.FreeBSD.org/patches/SA-14:04/bind-release.patch.asc # gpg --verify bind-release.patch.asc [FreeBSD 9.2-STABLE] # fetch http://security.FreeBSD.org/patches/SA-14:04/bind-stable-9.patch # fetch http://security.FreeBSD.org/patches/SA-14:04/bind-stable-9.patch.asc # gpg --verify bind-stable-9.patch.asc b) Execute the following commands as root: # cd /usr/src # patch < /path/to/patch Recompile the operating system using buildworld and installworld as described in <URL:https://www.FreeBSD.org/handbook/makeworld.html>. Restart the applicable daemons, or reboot the system. 3) To update your vulnerable system via a binary patch: Systems running a RELEASE version of FreeBSD on the i386 or amd64 platforms can be updated via the freebsd-update(8) utility: # freebsd-update fetch # freebsd-update install VI. Correction details The following list contains the correction revision numbers for each affected branch. 
Branch/path Revision - ------------------------------------------------------------------------- stable/8/ r260646 releng/8.3/ r260647 releng/8.4/ r260647 stable/9/ r260646 releng/9.1/ r260647 releng/9.2/ r260647 - ------------------------------------------------------------------------- To see which files were modified by a particular revision, run the following command, replacing NNNNNN with the revision number, on a machine with Subversion installed: # svn diff -cNNNNNN --summarize svn://svn.freebsd.org/base Or visit the following URL, replacing NNNNNN with the revision number: <URL:https://svnweb.freebsd.org/base?view=revision&revision=NNNNNN> VII. References <URL:https://kb.isc.org/article/AA-01078> <URL:http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-0591> The latest revision of this advisory is available at <URL:http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc> -----BEGIN PGP SIGNATURE----- iQIcBAEBCgAGBQJS1ZTYAAoJEO1n7NZdz2rnOvQP/2/68/s9Cu35PmqNtSZVVxVG ZSQP5EGWx/lramNf9566iKxOrLRMq/h3XWcC4goVd+gZFrvITJSVOWSa7ntDQ7TO XcinfRZ/iyiJbs/Rg2wLHc/t5oVSyeouyccqODYFbOwOlk35JjOTMUG1YcX+Zasg ax8RV+7Zt1QSBkMlOz/myBLXUjlTZ3Xg2FXVsfFQW5/g2CjuHpRSFx1bVNX6ysoG 9DT58EQcYxIS8WfkHRbbXKh9I1nSfZ7/Hky/kTafRdRMrjAgbqFgHkYTYsBZeav5 fYWKGQRJulYfeZQ90yMTvlpF42DjCC3uJYamJnwDIu8OhS1WRBI8fQfr9DRzmRua OK3BK9hUiScDZOJB6OqeVzUTfe7MAA4/UwrDtTYQ+PqAenv1PK8DZqwXyxA9ThHb zKO3OwuKOVHJnKvpOcr+eNwo7jbnHlis0oBksj/mrq2P9m2ueF9gzCiq5Ri5Syag Wssb1HUoMGwqU0roS8+pRpNC8YgsWpsttvUWSZ8u6Vj/FLeHpiV3mYXPVMaKRhVm 067BA2uj4Th1JKtGleox+Em0R7OFbCc/9aWC67wiqI6KRyit9pYiF3npph+7D5Eq 7zPsUdDd+qc+UTiLp3liCRp5w6484wWdhZO6wRtmUgxGjNkxFoNnX8CitzF8AaqO UWWemqWuz3lAZuORQ9KX =OQzQ -----END PGP SIGNATURE----- Every security advisory uses the following format: Each security advisory is signed by the PGP key of the Security Officer. The public key for the Security Officer can be verified at . The name of the security advisory always begins with FreeBSD-SA- (for FreeBSD Security Advisory), followed by the year in two digit format (14:), followed by the advisory number for that year (04.), followed by the name of the affected application or subsystem (bind). The advisory shown here is the fourth advisory for 2014 and it affects BIND. The Topic field summarizes the vulnerability. The Category refers to the affected part of the system which may be one of core, contrib, or ports. The core category means that the vulnerability affects a core component of the &os; operating system. The contrib category means that the vulnerability affects software included with &os;, such as BIND. The ports category indicates that the vulnerability affects software available through the Ports Collection. The Module field refers to the component location. In this example, the bind module is affected; therefore, this vulnerability affects an application installed with the operating system. The Announced field reflects the date the security advisory was published. This means that the security team has verified that the problem exists and that a patch has been committed to the &os; source code repository. The Credits field gives credit to the individual or organization who noticed the vulnerability and reported it. The Affects field explains which releases of &os; are affected by this vulnerability. The Corrected field indicates the date, time, time offset, and releases that were corrected. The section in parentheses shows each branch for which the fix has been merged, and the version number of the corresponding release from that branch. 
The release identifier itself includes the version number and, if appropriate, the patch level. The patch level is the letter p followed by a number, indicating the sequence number of the patch, allowing users to track which patches have already been applied to the system. The CVE Name field lists the advisory number, if one exists, in the public cve.mitre.org security vulnerabilities database. The Background field provides a description of the affected module. The Problem Description field explains the vulnerability. This can include information about the flawed code and how the utility could be maliciously used. The Impact field describes what type of impact the problem could have on a system. The Workaround field indicates if a workaround is available to system administrators who cannot immediately patch the system. The Solution field provides the instructions for patching the affected system. This is a step-by-step, tested and verified method for getting a system patched and working securely. The Correction Details field displays each affected Subversion branch with the revision number that contains the corrected code. The References field offers sources of additional information regarding the vulnerability. Process Accounting TomRhodesContributed by Process Accounting Process accounting is a security method in which an administrator may keep track of system resources used and their allocation among users, provide for system monitoring, and minimally track a user's commands. Process accounting has both positive and negative points. One of the positives is that an intrusion may be narrowed down to the point of entry. A negative is the volume of logs generated by process accounting and the disk space they may require. This section walks an administrator through the basics of process accounting. If more fine-grained accounting is needed, refer to . Enabling and Utilizing Process Accounting Before using process accounting, it must be enabled using the following commands: &prompt.root; sysrc accounting_enable=yes &prompt.root; service accounting start The accounting information is stored in files located in /var/account, which is automatically created, if necessary, the first time the accounting service starts. These files contain sensitive information, including all the commands issued by all users. Write access to the files is limited to root, and read access is limited to root and members of the wheel group. To also prevent members of wheel from reading the files, change the mode of the /var/account directory to allow access only by root. Once enabled, accounting will begin to track information such as CPU statistics and executed commands. All accounting logs are in a non-human-readable format which can be viewed using sa. If issued without any options, sa prints information relating to the number of per-user calls, the total elapsed time in minutes, total CPU and user time in minutes, and the average number of I/O operations. Refer to &man.sa.8; for the list of available options which control the output. To display the commands issued by users, use lastcomm. For example, this command prints out all usage of ls by trhodes on the ttyp1 terminal: &prompt.root; lastcomm ls trhodes ttyp1 Many other useful options exist and are explained in &man.lastcomm.1;, &man.acct.5;, and &man.sa.8;. Resource Limits TomRhodesContributed by Resource limits &os; provides several methods for an administrator to limit the amount of system resources an individual may use.
Disk quotas limit the amount of disk space available to users. Quotas are discussed in . quotas limiting users quotas disk quotas Limits to other resources, such as CPU and memory, can be set using either a flat file or a command to configure a resource limits database. The traditional method defines login classes by editing /etc/login.conf. While this method is still supported, any changes require a multi-step process of editing this file, rebuilding the resource database, making necessary changes to /etc/master.passwd, and rebuilding the password database. This can become time consuming, depending upon the number of users to configure. rctl can be used to provide a more fine-grained method for controlling resource limits. This command supports more than user limits as it can also be used to set resource constraints on processes and jails. This section demonstrates both methods for controlling resources, beginning with the traditional method. Configuring Login Classes limiting users accounts limiting /etc/login.conf In the traditional method, login classes and the resource limits to apply to a login class are defined in /etc/login.conf. Each user account can be assigned to a login class, where default is the default login class. Each login class has a set of login capabilities associated with it. A login capability is a name=value pair, where name is a well-known identifier and value is an arbitrary string which is processed accordingly depending on the name. Whenever /etc/login.conf is edited, the /etc/login.conf.db must be updated by executing the following command: &prompt.root; cap_mkdb /etc/login.conf Resource limits differ from the default login capabilities in two ways. First, for every limit, there is a soft and hard limit. A soft limit may be adjusted by the user or application, but may not be set higher than the hard limit. The hard limit may be lowered by the user, but can only be raised by the superuser. Second, most resource limits apply per process to a specific user. lists the most commonly used resource limits. All of the available resource limits and capabilities are described in detail in &man.login.conf.5;. limiting users coredumpsize limiting users cputime limiting users filesize limiting users maxproc limiting users memorylocked limiting users memoryuse limiting users openfiles limiting users sbsize limiting users stacksize Login Class Resource Limits Resource Limit Description coredumpsize The limit on the size of a core file generated by a program is subordinate to other limits on disk usage, such as filesize or disk quotas. This limit is often used as a less severe method of controlling disk space consumption. Since users do not generate core files and often do not delete them, this setting may save them from running out of disk space should a large program crash. cputime The maximum amount of CPU time a user's process may consume. Offending processes will be killed by the kernel. This is a limit on CPU time consumed, not the percentage of the CPU as displayed in some of the fields generated by top and ps. filesize The maximum size of a file the user may own. Unlike disk quotas (), this limit is enforced on individual files, not the set of all files a user owns. maxproc The maximum number of foreground and background processes a user can run. This limit may not be larger than the system limit specified by kern.maxproc. Setting this limit too small may hinder a user's productivity as some tasks, such as compiling a large program, start lots of processes. 
memorylocked The maximum amount of memory a process may request to be locked into main memory using &man.mlock.2;. Some system-critical programs, such as &man.amd.8;, lock into main memory so that if the system begins to swap, they do not contribute to disk thrashing. memoryuse The maximum amount of memory a process may consume at any given time. It includes both core memory and swap usage. This is not a catch-all limit for restricting memory consumption, but is a good start. openfiles The maximum number of files a process may have open. In &os;, files are used to represent sockets and IPC channels, so be careful not to set this too low. The system-wide limit for this is defined by kern.maxfiles. sbsize The limit on the amount of network memory a user may consume. This can generally be used to limit network communications. stacksize The maximum size of a process stack. This alone is not sufficient to limit the amount of memory a program may use, so it should be used in conjunction with other limits.
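As an illustrative sketch (the class name limited and all of the values shown are hypothetical, not part of the stock configuration), a custom login class using several of these limits could be appended to /etc/login.conf, inheriting any unspecified capabilities from the default class:

limited:\
	:cputime=1h30m:\
	:maxproc=64:\
	:openfiles=128:\
	:filesize=100m:\
	:tc=default:

Remember to rebuild /etc/login.conf.db with cap_mkdb after saving the file, as described above.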
There are a few other things to remember when setting resource limits: Processes started at system startup by /etc/rc are assigned to the daemon login class. Although the default /etc/login.conf is a good source of reasonable values for most limits, those values may not be appropriate for every system. Setting a limit too high may open the system up to abuse, while setting it too low may put a strain on productivity. &xorg; takes a lot of resources and encourages users to run more programs simultaneously. Many limits apply to individual processes, not the user as a whole. For example, setting openfiles to 50 means that each process the user runs may open up to 50 files. The total number of files a user may open is the value of openfiles multiplied by the value of maxproc. This also applies to memory consumption. For further information on resource limits and login classes and capabilities in general, refer to &man.cap.mkdb.1;, &man.getrlimit.2;, and &man.login.conf.5;.
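A user account can then be placed in a class with &man.pw.8;. As a sketch, assuming the hypothetical limited class from the earlier example already exists in /etc/login.conf:

&prompt.root; pw usermod trhodes -L limited

Accounts with no explicit login class remain in the default class.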
Enabling and Configuring Resource Limits The kern.racct.enable tunable must be set to a non-zero value; a sketch of one way to do this appears at the end of this section. Custom kernels require specific configuration: options RACCT options RCTL Once the system has rebooted into the new kernel, rctl may be used to set rules for the system. Rule syntax is controlled through the use of a subject, subject-id, resource, and action, as seen in this example rule: user:trhodes:maxproc:deny=10/user In this rule, the subject is user, the subject-id is trhodes, the resource, maxproc, is the maximum number of processes, and the action is deny, which blocks any new processes from being created. This means that the user, trhodes, will be constrained to no greater than 10 processes. Other possible actions include logging to the console, passing a notification to &man.devd.8;, or sending a SIGTERM to the process. Some care must be taken when adding rules. Since this user is constrained to 10 processes, this example will prevent the user from performing other tasks after logging in and executing a screen session. Once a resource limit has been hit, an error will be printed, as in this example: &prompt.user; man test /usr/bin/man: Cannot fork: Resource temporarily unavailable eval: Cannot fork: Resource temporarily unavailable As another example, a jail can be prevented from exceeding a memory limit. This rule could be written as: &prompt.root; rctl -a jail:httpd:memoryuse:deny=2G/jail Rules will persist across reboots if they have been added to /etc/rctl.conf. The format is a rule, without the preceding command. For example, the previous rule could be added as: # Block jail from using more than 2G memory: jail:httpd:memoryuse:deny=2G/jail To remove a rule, use rctl to remove it from the list: &prompt.root; rctl -r user:trhodes:maxproc:deny=10/user A method for removing all rules is documented in &man.rctl.8;. However, if removing all rules for a single user is required, this command may be issued: &prompt.root; rctl -r user:trhodes Many other resources exist which can be used to exert additional control over various subjects. See &man.rctl.8; to learn about them.
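As a minimal sketch of satisfying the kern.racct.enable requirement mentioned at the beginning of this section (assuming a kernel built with RACCT and RCTL support), add this line to /boot/loader.conf and reboot:

kern.racct.enable=1

Once accounting is active, the current resource usage of a subject can be inspected in human-readable form:

&prompt.root; rctl -hu user:trhodes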
Shared Administration with Sudo TomRhodesContributed by Security Sudo System administrators often need the ability to grant enhanced permissions to users so they may perform privileged tasks. The idea that team members are provided access to a &os; system to perform their specific tasks opens up unique challenges to every administrator. These team members only need a subset of access beyond normal end user levels; however, they almost always tell management they are unable to perform their tasks without superuser access. Thankfully, there is no reason to provide such access to end users because tools exist to manage this exact requirement. Up to this point, the security chapter has covered permitting access to authorized users and attempting to prevent unauthorized access. Another problem arises once authorized users have access to the system resources. In many cases, some users may need access to application startup scripts, or a team of administrators needs to maintain the system. Traditionally, the standard users and groups, file permissions, and even the &man.su.1; command would manage this access. As applications required more access and more users needed to use system resources, a better solution was required. The most commonly used application for this purpose is currently Sudo. Sudo allows administrators to configure more rigid access to system commands and provides some advanced logging features. As a tool, it is available from the Ports Collection as security/sudo or by use of the &man.pkg.8; utility. To use the &man.pkg.8; tool: &prompt.root; pkg install sudo After the installation is complete, run the installed visudo to open the configuration file with a text editor. Using visudo is highly recommended as it comes with a built-in syntax checker to verify there are no errors before the file is saved. The configuration file is made up of several small sections which allow for extensive configuration. In the following example, the web application maintainer, user1, needs to start, stop, and restart the web application known as webservice. To grant this user permission to perform these tasks, add this line to the end of /usr/local/etc/sudoers: user1 ALL=(ALL) /usr/sbin/service webservice * The user may now start webservice using this command: &prompt.user; sudo /usr/sbin/service webservice start This configuration allows a single user access to the webservice service; however, in most organizations, there is an entire web team in charge of managing the service. A single line can also give access to an entire group. These steps will create a webteam group, add a user to this group, and allow all members of the group to manage the service: &prompt.root; pw groupadd -g 6001 -n webteam Using the same &man.pw.8; command, the user is added to the webteam group: &prompt.root; pw groupmod -m user1 -n webteam Finally, this line in /usr/local/etc/sudoers allows any member of the webteam group to manage webservice: %webteam ALL=(ALL) /usr/sbin/service webservice * Unlike &man.su.1;, Sudo only requires the end user's password. This avoids shared passwords, which are a common finding in security audits and bad practice all the way around. Users permitted to run applications with Sudo only enter their own passwords. This is more secure and gives better control than &man.su.1;, where the root password is entered and the user acquires all root permissions. Most organizations are moving or have moved toward a two-factor authentication model. In these cases, the user may not have a password to enter.
Sudo provides for these cases with the NOPASSWD variable. Adding it to the configuration above will allow all members of the webteam group to manage the service without the password requirement: %webteam ALL=(ALL) NOPASSWD: /usr/sbin/service webservice * Logging Output An advantage to implementing Sudo is the ability to enable session logging. Using the built-in log mechanisms and the included sudoreplay command, all commands initiated through Sudo are logged for later verification. To enable this feature, add a default log directory entry; this example uses a user variable. Several other log filename conventions exist; consult the manual page for sudoreplay for additional information. Defaults iolog_dir=/var/log/sudo-io/%{user} This directory will be created automatically after the logging is configured. It is best to let the system create the directory with default permissions, just to be safe. In addition, this entry will also log administrators who use the sudoreplay command. To change this behavior, read and uncomment the logging options inside sudoers. Once this directive has been added to the sudoers file, any user configuration can be updated with the request to log access. In the example shown, the updated webteam entry would have the following additional changes: %webteam ALL=(ALL) NOPASSWD: LOG_INPUT: LOG_OUTPUT: /usr/sbin/service webservice * From this point on, all webteam members altering the status of the webservice application will be logged. The list of previous and current sessions can be displayed with: &prompt.root; sudoreplay -l In the output, to replay a specific session, search for the TSID= entry, and pass that to sudoreplay with no other options to replay the session at normal speed. For example: &prompt.root; sudoreplay user1/00/00/02 While sessions are logged, any administrator is able to remove session logs, leaving only the question of why they did so. It is worthwhile to add a daily check through an intrusion detection system (IDS) or similar software so that other administrators are alerted to manual alterations. sudoreplay is extremely extensible. Consult the documentation for more information.
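As the list of permitted commands grows, sudoers command aliases can keep entries readable. This sketch (the alias name WEBCMDS is hypothetical) grants the same access as the final webteam entry above:

Cmnd_Alias WEBCMDS = /usr/sbin/service webservice *
%webteam ALL=(ALL) NOPASSWD: LOG_INPUT: LOG_OUTPUT: WEBCMDS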
diff --git a/en_US.ISO8859-1/books/handbook/serialcomms/chapter.xml b/en_US.ISO8859-1/books/handbook/serialcomms/chapter.xml index 9913ad3755..db13e71ef6 100644 --- a/en_US.ISO8859-1/books/handbook/serialcomms/chapter.xml +++ b/en_US.ISO8859-1/books/handbook/serialcomms/chapter.xml @@ -1,2191 +1,2191 @@ Serial Communications Synopsis serial communications &unix; has always had support for serial communications as the very first &unix; machines relied on serial lines for user input and output. Things have changed a lot from the days when the average terminal consisted of a 10-character-per-second serial printer and a keyboard. This chapter covers some of the ways serial communications can be used on &os;. After reading this chapter, you will know: How to connect terminals to a &os; system. How to use a modem to dial out to remote hosts. How to allow remote users to log in to a &os; system with a modem. How to boot a &os; system from a serial console. Before reading this chapter, you should: Know how to configure and install a custom kernel. Understand &os; permissions and processes. Have access to the technical manual for the serial hardware to be used with &os;. Serial Terminology and Hardware The following terms are often used in serial communications: bps Bits per Secondbits-per-second (bps) is the rate at which data is transmitted. DTE Data Terminal EquipmentDTE (DTE) is one of two endpoints in a serial communication. An example would be a computer. DCE Data Communications EquipmentDCE (DCE) is the other endpoint in a serial communication. Typically, it is a modem or serial terminal. RS-232 The original standard which defined hardware serial communications. It has since been renamed to TIA-232. When referring to communication data rates, this section does not use the term baud. Baud refers to the number of electrical state transitions made in a period of time, while bps is the correct term to use. To connect a serial terminal to a &os; system, a serial port on the computer and the proper cable to connect to the serial device are needed. Users who are already familiar with serial hardware and cabling can safely skip this section. Serial Cables and Ports There are several different kinds of serial cables. The two most common types are null-modem cables and standard RS-232 cables. The documentation for the hardware should describe the type of cable required. These two types of cables differ in how the wires are connected to the connector. Each wire represents a signal, with the defined signals summarized in . A standard serial cable passes all of the RS-232C signals straight through. For example, the Transmitted Data pin on one end of the cable goes to the Transmitted Data pin on the other end. This is the type of cable used to connect a modem to the &os; system, and is also appropriate for some terminals. A null-modem cable switches the Transmitted Data pin of the connector on one end with the Received Data pin on the other end. The connector can be either a DB-25 or a DB-9. A null-modem cable can be constructed using the pin connections summarized in , , and . While the standard calls for a straight-through pin 1 to pin 1 Protective Ground line, it is often omitted. Some terminals work using only pins 2, 3, and 7, while others require different configurations. When in doubt, refer to the documentation for the hardware.
null-modem cable <acronym>RS-232C</acronym> Signal Names Acronyms Names RD Received Data TD Transmitted Data DTR Data Terminal Ready DSR Data Set Ready DCD Data Carrier Detect SG Signal Ground RTS Request to Send CTS Clear to Send
DB-25 to DB-25 Null-Modem Cable Signal Pin # Pin # Signal SG 7 connects to 7 SG TD 2 connects to 3 RD RD 3 connects to 2 TD RTS 4 connects to 5 CTS CTS 5 connects to 4 RTS DTR 20 connects to 6 DSR DTR 20 connects to 8 DCD DSR 6 connects to 20 DTR DCD 8 connects to 20 DTR
DB-9 to DB-9 Null-Modem Cable Signal Pin # Pin # Signal RD 2 connects to 3 TD TD 3 connects to 2 RD DTR 4 connects to 6 DSR DTR 4 connects to 1 DCD SG 5 connects to 5 SG DSR 6 connects to 4 DTR DCD 1 connects to 4 DTR RTS 7 connects to 8 CTS CTS 8 connects to 7 RTS
DB-9 to DB-25 Null-Modem Cable Signal Pin # Pin # Signal RD 2 connects to 2 TD TD 3 connects to 3 RD DTR 4 connects to 6 DSR DTR 4 connects to 8 DCD SG 5 connects to 7 SG DSR 6 connects to 20 DTR DCD 1 connects to 20 DTR RTS 7 connects to 5 CTS CTS 8 connects to 4 RTS
When one pin at one end connects to a pair of pins at the other end, it is usually implemented with one short wire between the pair of pins in their connector and a long wire to the other single pin. Serial ports are the devices through which data is transferred between the &os; host computer and the terminal. Several kinds of serial ports exist. Before purchasing or constructing a cable, make sure it will fit the ports on the terminal and on the &os; system. Most terminals have DB-25 ports. Personal computers may have DB-25 or DB-9 ports. A multiport serial card may have RJ-12 or RJ-45 ports. See the documentation that accompanied the hardware for specifications on the kind of port or visually verify the type of port. In &os;, each serial port is accessed through an entry in /dev. There are two different kinds of entries: Call-in ports are named /dev/ttyuN where N is the port number, starting from zero. If a terminal is connected to the first serial port (COM1), use /dev/ttyu0 to refer to the terminal. If the terminal is on the second serial port (COM2), use /dev/ttyu1, and so forth. Generally, the call-in port is used for terminals. Call-in ports require that the serial line assert the Data Carrier Detect signal to work correctly. Call-out ports are named /dev/cuauN on &os; versions 8.X and higher and /dev/cuadN on &os; versions 7.X and lower. Call-out ports are usually not used for terminals, but are used for modems. The call-out port can be used if the serial cable or the terminal does not support the Data Carrier Detect signal. &os; also provides initialization devices (/dev/ttyuN.init and /dev/cuauN.init or /dev/cuadN.init) and locking devices (/dev/ttyuN.lock and /dev/cuauN.lock or /dev/cuadN.lock). The initialization devices are used to initialize communications port parameters each time a port is opened, such as crtscts for modems which use RTS/CTS signaling for flow control. The locking devices are used to lock flags on ports to prevent users or programs changing certain parameters. Refer to &man.termios.4;, &man.sio.4;, and &man.stty.1; for information on terminal settings, locking and initializing devices, and setting terminal options, respectively.
Serial Port Configuration By default, &os; supports four serial ports which are commonly known as COM1, COM2, COM3, and COM4. &os; also supports dumb multi-port serial interface cards, such as the BocaBoard 1008 and 2016, as well as more intelligent multi-port cards such as those made by Digiboard. However, the default kernel only looks for the standard COM ports. To see if the system recognizes the serial ports, look for system boot messages that start with uart: &prompt.root; grep uart /var/run/dmesg.boot If the system does not recognize all of the needed serial ports, additional entries can be added to /boot/device.hints. This file already contains hint.uart.0.* entries for COM1 and hint.uart.1.* entries for COM2. When adding a port entry for COM3 use the port address 0x3E8, and for COM4 use 0x2E8. Common IRQ values are 5 for COM3 and 9 for COM4; a sketch of suitable entries appears at the end of this section. ttyu cuau To determine the default set of terminal I/O settings used by the port, specify its device name. This example determines the settings for the call-in port on COM2: &prompt.root; stty -a -f /dev/ttyu1 System-wide initialization of serial devices is controlled by /etc/rc.d/serial. This file affects the default settings of serial devices. To change the settings for a device, use stty. By default, the changed settings are in effect until the device is closed and when the device is reopened, it goes back to the default set. To permanently change the default set, open and adjust the settings of the initialization device. For example, to turn on clocal mode, 8-bit communication, and XON/XOFF flow control for ttyu5, type: &prompt.root; stty -f /dev/ttyu5.init clocal cs8 ixon ixoff rc files rc.serial To prevent certain settings from being changed by an application, make adjustments to the locking device. For example, to lock the speed of ttyu5 to 57600 bps, type: &prompt.root; stty -f /dev/ttyu5.lock 57600 Now, any application that opens ttyu5 and tries to change the speed of the port will be stuck with 57600 bps.
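As a sketch of such additions (the values follow the conventional ISA assignments given above; verify them against the actual hardware), entries for COM3 in /boot/device.hints might look like:

hint.uart.2.at="isa"
hint.uart.2.port="0x3E8"
hint.uart.2.irq="5"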
Terminals Sean Kelly Contributed by terminals Terminals provide a convenient and low-cost way to access a &os; system when not at the computer's console or on a connected network. This section describes how to use terminals with &os;. The original &unix; systems did not have consoles. Instead, users logged in and ran programs through terminals that were connected to the computer's serial ports. The ability to establish a login session on a serial port still exists in nearly every &unix;-like operating system today, including &os;. By using a terminal attached to an unused serial port, a user can log in and run any text program that can normally be run on the console or in an xterm window. Many terminals can be attached to a &os; system. An older spare computer can be used as a terminal wired into a more powerful computer running &os;. This can turn what might otherwise be a single-user computer into a powerful multiple-user system. &os; supports three types of terminals: Dumb terminals Dumb terminals are specialized hardware that connect to computers over serial lines. They are called dumb because they have only enough computational power to display, send, and receive text. No programs can be run on these devices. Instead, dumb terminals connect to a computer that runs the needed programs. There are hundreds of kinds of dumb terminals made by many manufacturers, and just about any kind will work with &os;. Some high-end terminals can even display graphics, but only certain software packages can take advantage of these advanced features. Dumb terminals are popular in work environments where workers do not need access to graphical applications. Computers Acting as Terminals Since a dumb terminal has just enough ability to display, send, and receive text, any spare computer can be a dumb terminal. All that is needed is the proper cable and some terminal emulation software to run on the computer. This configuration can be useful. For example, if one user is busy working at the &os; system's console, another user can do some text-only work at the same time from a less powerful personal computer hooked up as a terminal to the &os; system. There are at least two utilities in the base-system of &os; that can be used to work through a serial connection: &man.cu.1; and &man.tip.1;. For example, to connect from a client system that runs &os; to the serial connection of another system: &prompt.root; cu -l /dev/cuauN Ports are numbered starting from zero. This means that COM1 is /dev/cuau0. Additional programs are available through the Ports Collection, such as comms/minicom. X Terminals X terminals are the most sophisticated kind of terminal available. Instead of connecting to a serial port, they usually connect to a network like Ethernet. Instead of being relegated to text-only applications, they can display any &xorg; application. This chapter does not cover the setup, configuration, or use of X terminals. Terminal Configuration This section describes how to configure a &os; system to enable a login session on a serial terminal. It assumes that the system recognizes the serial port to which the terminal is connected and that the terminal is connected with the correct cable. In &os;, init reads /etc/ttys and starts a getty process on the available terminals. The getty process is responsible for reading a login name and starting the login program. The ports on the &os; system which allow logins are listed in /etc/ttys. 
For example, the first virtual console, ttyv0, has an entry in this file, allowing logins on the console. This file also contains entries for the other virtual consoles, serial ports, and pseudo-ttys. For a hardwired terminal, the serial port's /dev entry is listed without the /dev part. For example, /dev/ttyv0 is listed as ttyv0. The default /etc/ttys configures support for the first four serial ports, ttyu0 through ttyu3: ttyu0 "/usr/libexec/getty std.9600" dialup off secure ttyu1 "/usr/libexec/getty std.9600" dialup off secure ttyu2 "/usr/libexec/getty std.9600" dialup off secure ttyu3 "/usr/libexec/getty std.9600" dialup off secure When attaching a terminal to one of those ports, modify the default entry to set the required speed and terminal type, to turn the device on and, if needed, to change the port's secure setting. If the terminal is connected to another port, add an entry for the port. configures two terminals in /etc/ttys. The first entry configures a Wyse-50 connected to COM2. The second entry configures an old computer running Procomm terminal software emulating a VT-100 terminal. The computer is connected to the sixth serial port on a multi-port serial card. Configuring Terminal Entries ttyu1 "/usr/libexec/getty std.38400" wy50 on insecure ttyu5 "/usr/libexec/getty std.19200" vt100 on insecure The first field specifies the device name of the serial terminal. The second field tells getty to initialize and open the line, set the line speed, prompt for a user name, and then execute the login program. The optional getty type configures characteristics on the terminal line, like bps rate and parity. The available getty types are listed in /etc/gettytab. In almost all cases, the getty types that start with std will work for hardwired terminals as these entries ignore parity. There is a std entry for each bps rate from 110 to 115200. Refer to &man.gettytab.5; for more information. When setting the getty type, make sure to match the communications settings used by the terminal. For this example, the Wyse-50 uses no parity and connects at 38400 bps. The computer uses no parity and connects at 19200 bps. The third field is the type of terminal. For dial-up ports, unknown or dialup is typically used since users may dial up with practically any type of terminal or software. Since the terminal type does not change for hardwired terminals, a real terminal type from /etc/termcap can be specified. For this example, the Wyse-50 uses the real terminal type while the computer running Procomm is set to emulate a VT-100. The fourth field specifies if the port should be enabled. To enable logins on this port, this field must be set to on. The final field is used to specify whether the port is secure. Marking a port as secure means that it is trusted enough to allow root to login from that port. Insecure ports do not allow root logins. On an insecure port, users must login from unprivileged accounts and then use su or a similar mechanism to gain superuser privileges, as described in . For security reasons, it is recommended to change this setting to insecure. After making any changes to /etc/ttys, send a SIGHUP (hangup) signal to the init process to force it to re-read its configuration file: &prompt.root; kill -HUP 1 Since init is always the first process run on a system, it always has a process ID of 1. 
If everything is set up correctly, all cables are in place, and the terminals are powered up, a getty process should now be running on each terminal and login prompts should be available on each terminal. Troubleshooting the Connection Even with the most meticulous attention to detail, something could still go wrong while setting up a terminal. Here is a list of common symptoms and some suggested fixes. If no login prompt appears, make sure the terminal is plugged in and powered up. If it is a personal computer acting as a terminal, make sure it is running terminal emulation software on the correct serial port. Make sure the cable is connected firmly to both the terminal and the &os; computer. Make sure it is the right kind of cable. Make sure the terminal and &os; agree on the bps rate and parity settings. For a video display terminal, make sure the contrast and brightness controls are turned up. If it is a printing terminal, make sure paper and ink are in good supply. Use ps to make sure that a getty process is running and serving the terminal. For example, the following listing shows that a getty is running on the second serial port, ttyu1, and is using the std.38400 entry in /etc/gettytab: &prompt.root; ps -axww|grep ttyu 22189 d1 Is+ 0:00.03 /usr/libexec/getty std.38400 ttyu1 If no getty process is running, make sure the port is enabled in /etc/ttys. Remember to run kill -HUP 1 after modifying /etc/ttys. If the getty process is running but the terminal still does not display a login prompt, or if it displays a prompt but will not accept typed input, the terminal or cable may not support hardware handshaking. Try changing the entry in /etc/ttys from std.38400 to 3wire.38400, then run kill -HUP 1 after modifying /etc/ttys. The 3wire entry is similar to std, but ignores hardware handshaking. The baud rate may need to be reduced or software flow control enabled when using 3wire to prevent buffer overflows. If garbage appears instead of a login prompt, make sure the terminal and &os; agree on the bps rate and parity settings. Check the getty processes to make sure the correct getty type is in use. If not, edit /etc/ttys and run kill -HUP 1. If characters appear doubled and the password appears when typed, switch the terminal, or the terminal emulation software, from half duplex or local echo to full duplex. Dial-in Service Guy Helmer Contributed by Sean Kelly Additions by dial-in service Configuring a &os; system for dial-in service is similar to configuring terminals, except that modems are used instead of terminal devices. &os; supports both external and internal modems. External modems are more convenient because they often can be configured via parameters stored in non-volatile RAM and they usually provide lighted indicators that display the state of important RS-232 signals, indicating whether the modem is operating properly. Internal modems usually lack non-volatile RAM, so their configuration may be limited to setting DIP switches. If the internal modem has any signal indicator lights, they are difficult to view when the system's cover is in place. modem When using an external modem, a proper cable is needed. A standard RS-232C serial cable should suffice. &os; needs the RTS and CTS signals for flow control at speeds above 2400 bps, the CD signal to detect when a call has been answered or the line has been hung up, and the DTR signal to reset the modem after a session is complete. 
Some cables are wired without all of the needed signals, so if a login session does not go away when the line hangs up, there may be a problem with the cable. Refer to for more information about these signals. Like other &unix;-like operating systems, &os; uses the hardware signals to find out when a call has been answered or a line has been hung up and to hang up and reset the modem after a call. &os; avoids sending commands to the modem or watching for status reports from the modem. &os; supports the NS8250, NS16450, NS16550, and NS16550A-based RS-232C (CCITT V.24) communications interfaces. The 8250 and 16450 devices have single-character buffers. The 16550 device provides a 16-character buffer, which allows for better system performance. Bugs in plain 16550 devices prevent the use of the 16-character buffer, so use 16550A devices if possible. As single-character-buffer devices require more work by the operating system than the 16-character-buffer devices, 16550A-based serial interface cards are preferred. If the system has many active serial ports or will have a heavy load, 16550A-based cards are better for low-error-rate communications. The rest of this section demonstrates how to configure a modem to receive incoming connections, how to communicate with the modem, and offers some troubleshooting tips. Modem Configuration getty As with terminals, init spawns a getty process for each configured serial port used for dial-in connections. When a user dials the modem's line and the modems connect, the Carrier Detect signal is reported by the modem. The kernel notices that the carrier has been detected and instructs getty to open the port and display a login: prompt at the specified initial line speed. In a typical configuration, if garbage characters are received, usually due to the modem's connection speed being different than the configured speed, getty tries adjusting the line speeds until it receives reasonable characters. After the user enters their login name, getty executes login, which completes the login process by asking for the user's password and then starting the user's shell. /usr/bin/login There are two schools of thought regarding dial-up modems. One configuration method is to set the modems and systems so that no matter at what speed a remote user dials in, the dial-in RS-232 interface runs at a locked speed. The benefit of this configuration is that the remote user always sees a system login prompt immediately. The downside is that the system does not know what a user's true data rate is, so full-screen programs like Emacs will not adjust their screen-painting methods to make their response better for slower connections. The second method is to configure the RS-232 interface to vary its speed based on the remote user's connection speed. As getty does not understand any particular modem's connection speed reporting, it gives a login: message at an initial speed and watches the characters that come back in response. If the user sees junk, they should press Enter until they see a recognizable prompt. If the data rates do not match, getty sees anything the user types as junk, tries the next speed, and gives the login: prompt again. This procedure normally only takes a keystroke or two before the user sees a good prompt.
This login sequence does not look as clean as the locked-speed method, but a user on a low-speed connection should receive better interactive response from full-screen programs. When locking a modem's data communications rate at a particular speed, no changes to /etc/gettytab should be needed. However, for a matching-speed configuration, additional entries may be required in order to define the speeds to use for the modem. This example configures a 14.4 Kbps modem with a top interface speed of 19.2 Kbps using 8-bit, no parity connections. It configures getty to start the communications rate for a V.32bis connection at 19.2 Kbps, then cycles through 9600 bps, 2400 bps, 1200 bps, 300 bps, and back to 19.2 Kbps. Communications rate cycling is implemented with the nx= (next table) capability. Each line uses a tc= (table continuation) entry to pick up the rest of the settings for a particular data rate. # # Additions for a V.32bis Modem # um|V300|High Speed Modem at 300,8-bit:\ :nx=V19200:tc=std.300: un|V1200|High Speed Modem at 1200,8-bit:\ :nx=V300:tc=std.1200: uo|V2400|High Speed Modem at 2400,8-bit:\ :nx=V1200:tc=std.2400: up|V9600|High Speed Modem at 9600,8-bit:\ :nx=V2400:tc=std.9600: uq|V19200|High Speed Modem at 19200,8-bit:\ :nx=V9600:tc=std.19200: For a 28.8 Kbps modem, or to take advantage of compression on a 14.4 Kbps modem, use a higher communications rate, as seen in this example: # # Additions for a V.32bis or V.34 Modem # Starting at 57.6 Kbps # vm|VH300|Very High Speed Modem at 300,8-bit:\ :nx=VH57600:tc=std.300: vn|VH1200|Very High Speed Modem at 1200,8-bit:\ :nx=VH300:tc=std.1200: vo|VH2400|Very High Speed Modem at 2400,8-bit:\ :nx=VH1200:tc=std.2400: vp|VH9600|Very High Speed Modem at 9600,8-bit:\ :nx=VH2400:tc=std.9600: vq|VH57600|Very High Speed Modem at 57600,8-bit:\ :nx=VH9600:tc=std.57600: For a slow CPU or a heavily loaded system without 16550A-based serial ports, this configuration may produce sio silo errors at 57.6 Kbps. /etc/ttys The configuration of /etc/ttys is similar to , but a different argument is passed to getty and dialup is used for the terminal type. Replace xxx with the process init will run on the device: ttyu0 "/usr/libexec/getty xxx" dialup on The dialup terminal type can be changed. For example, setting vt102 as the default terminal type allows users to use VT102 emulation on their remote systems. For a locked-speed configuration, specify the speed with a valid type listed in /etc/gettytab. This example is for a modem whose port speed is locked at 19.2 Kbps: ttyu0 "/usr/libexec/getty std.19200" dialup on In a matching-speed configuration, the entry needs to reference the appropriate beginning auto-baud entry in /etc/gettytab. To continue the example for a matching-speed modem that starts at 19.2 Kbps, use this entry: ttyu0 "/usr/libexec/getty V19200" dialup on After editing /etc/ttys, wait until the modem is properly configured and connected before signaling init: &prompt.root; kill -HUP 1 rc files rc.serial High-speed modems, like V.32, V.32bis, and V.34 modems, use hardware (RTS/CTS) flow control. Use stty to set the hardware flow control flag for the modem port. This example sets the crtscts flag on COM2's dial-in and dial-out initialization devices: &prompt.root; stty -f /dev/ttyu1.init crtscts &prompt.root; stty -f /dev/cuau1.init crtscts Troubleshooting This section provides a few tips for troubleshooting a dial-up modem that will not connect to a &os; system. Hook up the modem to the &os; system and boot the system. 
If the modem has status indication lights, watch to see whether the modem's DTR indicator lights when the login: prompt appears on the system's console. If it lights up, that should mean that &os; has started a getty process on the appropriate communications port and is waiting for the modem to accept a call. If the DTR indicator does not light, login to the &os; system through the console and type ps ax to see if &os; is running a getty process on the correct port: 114 ?? I 0:00.10 /usr/libexec/getty V19200 ttyu0 If the second column contains a d0 instead of a ?? and the modem has not accepted a call yet, this means that getty has completed its open on the communications port. This could indicate a problem with the cabling or a misconfigured modem because getty should not be able to open the communications port until the carrier detect signal has been asserted by the modem. If no getty processes are waiting to open the port, double-check that the entry for the port is correct in /etc/ttys. Also, check /var/log/messages to see if there are any log messages from init or getty. Next, try dialing into the system. Be sure to use 8 bits, no parity, and 1 stop bit on the remote system. If a prompt does not appear right away, or the prompt shows garbage, try pressing Enter about once per second. If there is still no login: prompt, try sending a BREAK. When using a high-speed modem, try dialing again after locking the dialing modem's interface speed. If there is still no login: prompt, check /etc/gettytab again and double-check that: The initial capability name specified in the entry in /etc/ttys matches the name of a capability in /etc/gettytab. Each nx= entry matches another gettytab capability name. Each tc= entry matches another gettytab capability name. If the modem on the &os; system will not answer, make sure that the modem is configured to answer the phone when DTR is asserted. If the modem seems to be configured correctly, verify that the DTR line is asserted by checking the modem's indicator lights. If it still does not work, try sending an email to the &a.questions; describing the modem and the problem. Dial-out Service dial-out service The following are tips for getting the host to connect over the modem to another computer. This is appropriate for establishing a terminal session with a remote host. This kind of connection can be helpful to get a file on the Internet if there are problems using PPP. If PPP is not working, use the terminal session to FTP the needed file. Then use zmodem to transfer it to the machine. Using a Stock Hayes Modem A generic Hayes dialer is built into tip. Use at=hayes in /etc/remote. The Hayes driver is not smart enough to recognize some of the advanced features of newer modems, such as the BUSY, NO DIALTONE, or CONNECT 115200 messages. Turn those messages off when using tip with ATX0&W. The dial timeout for tip is 60 seconds. The modem should use something less, or else tip will think there is a communication problem. Try ATS7=45&W. Using <literal>AT</literal> Commands /etc/remote Create a direct entry in /etc/remote. For example, if the modem is hooked up to the first serial port, /dev/cuau0, use the following line: cuau0:dv=/dev/cuau0:br#19200:pa=none Use the highest bps rate the modem supports in the br capability. Then, type tip cuau0 to connect to the modem. Or, use cu as root with the following command: &prompt.root; cu -lline -sspeed line is the serial port, such as /dev/cuau0, and speed is the speed, such as 57600.
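For example, assuming the modem is on the first serial port and supports a rate of 57600 bps:

&prompt.root; cu -l /dev/cuau0 -s 57600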
When finished entering the AT commands, type ~. to exit. The <literal>@</literal> Sign Does Not Work The @ sign in the phone number capability tells tip to look in /etc/phones for a phone number. But, the @ sign is also a special character in capability files like /etc/remote, so it needs to be escaped with a backslash: pn=\@ Dialing from the Command Line Put a generic entry in /etc/remote. For example: tip115200|Dial any phone number at 115200 bps:\ :dv=/dev/cuau0:br#115200:at=hayes:pa=none:du: tip57600|Dial any phone number at 57600 bps:\ :dv=/dev/cuau0:br#57600:at=hayes:pa=none:du: This should now work: &prompt.root; tip -115200 5551234 Users who prefer cu over tip can use a generic cu entry: cu115200|Use cu to dial any number at 115200 bps:\ :dv=/dev/cuau1:br#115200:at=hayes:pa=none:du: and type: &prompt.root; cu 5551234 -s 115200 Setting the <acronym>bps</acronym> Rate Put in an entry for tip1200 or cu1200, but go ahead and use whatever bps rate is appropriate with the br capability. tip thinks a good default is 1200 bps which is why it looks for a tip1200 entry. 1200 bps does not have to be used, though. Accessing a Number of Hosts Through a Terminal Server Rather than waiting until connected and typing CONNECT host each time, use tip's cm capability. For example, these entries in /etc/remote will let you type tip pain or tip muffin to connect to the hosts pain or muffin, and tip deep13 to connect to the terminal server. pain|pain.deep13.com|Forrester's machine:\ :cm=CONNECT pain\n:tc=deep13: muffin|muffin.deep13.com|Frank's machine:\ :cm=CONNECT muffin\n:tc=deep13: deep13:Gizmonics Institute terminal server:\ :dv=/dev/cuau2:br#38400:at=hayes:du:pa=none:pn=5551234: Using More Than One Line with <command>tip</command> This is often a problem where a university has several modem lines and several thousand students trying to use them. Make an entry in /etc/remote and use @ for the pn capability: big-university:\ :pn=\@:tc=dialout dialout:\ :dv=/dev/cuau3:br#9600:at=courier:du:pa=none: Then, list the phone numbers in /etc/phones: big-university 5551111 big-university 5551112 big-university 5551113 big-university 5551114 tip will try each number in the listed order, then give up. To keep retrying, run tip in a while loop. Using the Force Character Ctrl P is the default force character, used to tell tip that the next character is literal data. The force character can be set to any other character with the ~s escape, which means set a variable. Type ~sforce=single-char followed by a newline. single-char is any single character. If single-char is left out, then the force character is the null character, which is accessed by typing Ctrl2 or CtrlSpace . A pretty good value for single-char is Shift Ctrl 6 , which is only used on some terminal servers. To change the force character, specify the following in ~/.tiprc: force=single-char Upper Case Characters This happens when Ctrl A is pressed, which is tip's raise character, specially designed for people with broken caps-lock keys. Use ~s to set raisechar to something reasonable. It can be set to be the same as the force character, if neither feature is used. Here is a sample ~/.tiprc for Emacs users who need to type Ctrl 2 and Ctrl A : force=^^ raisechar=^^ The ^^ is ShiftCtrl6 . File Transfers with <command>tip</command> When talking to another &unix;-like operating system, files can be sent and received using ~p (put) and ~t (take). These commands run cat and echo on the remote system to accept and send files.
The syntax is: ~p local-file remote-file ~t remote-file local-file There is no error checking, so another protocol, like zmodem, should probably be used. Using <application>zmodem</application> with <command>tip</command>? To receive files, start the sending program on the remote end. Then, type ~C rz to begin receiving them locally. To send files, start the receiving program on the remote end. Then, type ~C sz files to send them to the remote system. Setting Up the Serial Console Kazutaka YOKOTA Contributed by Bill Paul Based on a document by serial console &os; has the ability to boot a system with a dumb terminal on a serial port as a console. This configuration is useful for system administrators who wish to install &os; on machines that have no keyboard or monitor attached, and developers who want to debug the kernel or device drivers. As described in , &os; employs a three stage bootstrap. The first two stages are in the boot block code which is stored at the beginning of the &os; slice on the boot disk. The boot block then loads and runs the boot loader as the third stage code. In order to set up booting from a serial console, the boot block code, the boot loader code, and the kernel need to be configured. Quick Serial Console Configuration This section provides a fast overview of setting up the serial console. This procedure can be used when the dumb terminal is connected to COM1. Configuring a Serial Console on <filename>COM1</filename> Connect the serial cable to COM1 and the controlling terminal. To configure boot messages to display on the serial console, issue the following command as the superuser: &prompt.root; echo 'console="comconsole"' >> /boot/loader.conf Edit /etc/ttys and change off to on and dialup to vt100 for the ttyu0 entry. Otherwise, a password will not be required to connect via the serial console, resulting in a potential security hole. Reboot the system to see if the changes took effect. If a different configuration is required, see the next section for a more in-depth configuration explanation. In-Depth Serial Console Configuration This section provides a more detailed explanation of the steps needed to setup a serial console in &os;. Configuring a Serial Console Prepare a serial cable. null-modem cable Use either a null-modem cable or a standard serial cable and a null-modem adapter. See for a discussion on serial cables. Unplug the keyboard. Many systems probe for the keyboard during the Power-On Self-Test (POST) and will generate an error if the keyboard is not detected. Some machines will refuse to boot until the keyboard is plugged in. If the computer complains about the error, but boots anyway, no further configuration is needed. If the computer refuses to boot without a keyboard attached, configure the BIOS so that it ignores this error. Consult the motherboard's manual for details on how to do this. Try setting the keyboard to Not installed in the BIOS. This setting tells the BIOS not to probe for a keyboard at power-on so it should not complain if the keyboard is absent. If that option is not present in the BIOS, look for an Halt on Error option instead. Setting this to All but Keyboard or to No Errors will have the same effect. If the system has a &ps2; mouse, unplug it as well. &ps2; mice share some hardware with the keyboard and leaving the mouse plugged in can fool the keyboard probe into thinking the keyboard is still there. While most systems will boot without a keyboard, quite a few will not boot without a graphics adapter. 
Some systems can be configured to boot with no graphics adapter by changing the graphics adapter setting in the BIOS configuration to Not installed. Other systems do not support this option and will refuse to boot if there is no display hardware in the system. With these machines, leave some kind of graphics card plugged in, even if it is just a junky mono board. A monitor does not need to be attached. Plug a dumb terminal, an old computer with a modem program, or the serial port on another &unix; box into the serial port. Add the appropriate hint.sio.* entries to /boot/device.hints for the serial port. Some multi-port cards also require kernel configuration options. Refer to &man.sio.4; for the required options and device hints for each supported serial port. Create boot.config in the root directory of the a partition on the boot drive. This file instructs the boot block code how to boot the system. In order to activate the serial console, one or more of the following options are needed. When using multiple options, include them all on the same line:
-h Toggles between the internal and serial consoles. Use this to switch console devices. For instance, to boot from the internal (video) console, use -h to direct the boot loader and the kernel to use the serial port as its console device. Alternatively, to boot from the serial port, use -h to tell the boot loader and the kernel to use the video display as the console instead.
-D Toggles between the single and dual console configurations. In the single configuration, the console will be either the internal console (video display) or the serial port, depending on the state of -h. In the dual console configuration, both the video display and the serial port will become the console at the same time, regardless of the state of -h. However, the dual console configuration takes effect only while the boot block is running. Once the boot loader gets control, the console specified by -h becomes the only console.
-P Makes the boot block probe the keyboard. If no keyboard is found, the -D and -h options are automatically set. Due to space constraints in the current version of the boot blocks, -P is capable of detecting extended keyboards only. Keyboards with less than 101 keys and without F11 and F12 keys may not be detected. Keyboards on some laptops may not be properly found because of this limitation. If this is the case, do not use -P; instead, activate the serial console explicitly with -h.
Refer to &man.boot.8; and &man.boot.config.5; for more details. The options, except for -P, are passed to the boot loader. The boot loader will determine whether the internal video or the serial port should become the console by examining the state of -h. This means that if -P is specified but -h is not specified in /boot.config, the serial port can be used as the console only during the boot block as the boot loader will use the internal video display as the console. Boot the machine. When &os; starts, the boot blocks echo the contents of /boot.config to the console. For example: /boot.config: -P Keyboard: no The second line appears only if -P is in /boot.config and indicates the presence or absence of the keyboard.
These messages go to either the serial or internal console, or both, depending on the options in /boot.config:
Options Message goes to
none internal console
-h serial console
-D serial and internal consoles
-Dh serial and internal consoles
-P, keyboard present internal console
-P, keyboard absent serial console
After the message, there will be a small pause before the boot blocks continue loading the boot loader and before any further messages are printed to the console. Under normal circumstances, there is no need to interrupt the boot blocks, but one can do so in order to make sure things are set up correctly. Press any key, other than Enter, at the console to interrupt the boot process. The boot blocks will then prompt for further action: >> FreeBSD/i386 BOOT Default: 0:ad(0,a)/boot/loader boot: Verify that the above message appears on either the serial or internal console, or both, according to the options in /boot.config. If the message appears in the correct console, press Enter to continue the boot process. If there is no prompt on the serial terminal, something is wrong with the settings. Enter -h then press Enter or Return to tell the boot block (and then the boot loader and the kernel) to choose the serial port for the console. Once the system is up, go back and check what went wrong. During the third stage of the boot process, one can still switch between the internal console and the serial console by setting appropriate environment variables in the boot loader. See &man.loader.8; for more information. This line in /boot/loader.conf or /boot/loader.conf.local configures the boot loader and the kernel to send their boot messages to the serial console, regardless of the options in /boot.config: console="comconsole" That line should be the first line of /boot/loader.conf so that boot messages are displayed on the serial console as early as possible. If that line does not exist, or if it is set to console="vidconsole", the boot loader and the kernel will use whichever console is indicated by -h in the boot block. See &man.loader.conf.5; for more information. At the moment, the boot loader has no option equivalent to -P in the boot block, and there is no provision to automatically select the internal console and the serial console based on the presence of the keyboard. While it is not required, it is possible to provide a login prompt over the serial line. To configure this, edit the entry for the serial port in /etc/ttys using the instructions in . If the speed of the serial port has been changed, change std.9600 to match the new setting. Setting a Faster Serial Port Speed By default, the serial port settings are 9600 baud, 8 bits, no parity, and 1 stop bit. To change the default console speed, use one of the following options: Edit /etc/make.conf and set BOOT_COMCONSOLE_SPEED to the new console speed. Then, recompile and install the boot blocks and the boot loader: &prompt.root; cd /sys/boot &prompt.root; make clean &prompt.root; make &prompt.root; make install If the serial console is configured in some other way than by booting with -h, or if the serial console used by the kernel is different from the one used by the boot blocks, add the following option, with the desired speed, to a custom kernel configuration file and compile a new kernel: options CONSPEED=19200 Add the -S19200 boot option to /boot.config, replacing 19200 with the speed to use. Add the following options to /boot/loader.conf. Replace 115200 with the speed to use.
boot_multicons="YES"
boot_serial="YES"
comconsole_speed="115200"
console="comconsole,vidconsole"

Entering the DDB Debugger from the Serial Line To configure the ability to drop into the kernel debugger from the serial console, add the following options to a custom kernel configuration file and compile the kernel using the instructions in the kernel configuration chapter of this book. Note that while this is useful for remote diagnostics, it is also dangerous if a spurious BREAK is generated on the serial port. Refer to &man.ddb.4; and &man.ddb.8; for more information about the kernel debugger.

options BREAK_TO_DEBUGGER
options DDB
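As a usage sketch, assuming the remote machine is attached with &man.tip.1; and that com1 is the configured alias for the serial line (both assumptions, not part of the configuration above): once a kernel built with these options is running, sending a BREAK from the terminal program drops the system into DDB:

&prompt.root; tip com1
connected
~#        (the ~# escape in tip sends a BREAK; with BREAK_TO_DEBUGGER the kernel stops in the debugger)
db>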
diff --git a/en_US.ISO8859-1/books/handbook/zfs/chapter.xml b/en_US.ISO8859-1/books/handbook/zfs/chapter.xml index c6d89f091a..ef1064a438 100644 --- a/en_US.ISO8859-1/books/handbook/zfs/chapter.xml +++ b/en_US.ISO8859-1/books/handbook/zfs/chapter.xml @@ -1,4424 +1,4424 @@ The Z File System (<acronym>ZFS</acronym>) Tom Rhodes Written by Allan Jude Written by Benedict Reuschling Written by Warren Block Written by The Z File System, or ZFS, is an advanced file system designed to overcome many of the major problems found in previous designs. Originally developed at &sun;, ongoing open source ZFS development has moved to the OpenZFS Project. ZFS has three major design goals: Data integrity: All data includes a checksum of the data. When data is written, the checksum is calculated and written along with it. When that data is later read back, the checksum is calculated again. If the checksums do not match, a data error has been detected. ZFS will attempt to automatically correct errors when data redundancy is available. Pooled storage: physical storage devices are added to a pool, and storage space is allocated from that shared pool. Space is available to all file systems, and can be increased by adding new storage devices to the pool. Performance: multiple caching mechanisms provide increased performance. ARC is an advanced memory-based read cache. A second level of disk-based read cache can be added with L2ARC, and disk-based synchronous write cache is available with ZIL. A complete list of features and terminology is shown in the ZFS Features and Terminology section at the end of this chapter. What Makes <acronym>ZFS</acronym> Different ZFS is significantly different from any previous file system because it is more than just a file system. Combining the traditionally separate roles of volume manager and file system provides ZFS with unique advantages. The file system is now aware of the underlying structure of the disks. Traditional file systems could only be created on a single disk at a time. If there were two disks then two separate file systems would have to be created. In a traditional hardware RAID configuration, this problem was avoided by presenting the operating system with a single logical disk made up of the space provided by a number of physical disks, on top of which the operating system placed a file system. Even in the case of software RAID solutions like those provided by GEOM, the UFS file system living on top of the RAID transform believed that it was dealing with a single device. ZFS's combination of the volume manager and the file system solves this and allows the creation of many file systems all sharing a pool of available storage. One of the biggest advantages to ZFS's awareness of the physical layout of the disks is that existing file systems can be grown automatically when additional disks are added to the pool. This new space is then made available to all of the file systems. ZFS also has a number of different properties that can be applied to each file system, giving many advantages to creating a number of different file systems and datasets rather than a single monolithic file system. Quick Start Guide There is a startup mechanism that allows &os; to mount ZFS pools during system initialization. To enable it, add this line to /etc/rc.conf: zfs_enable="YES" Then start the service: &prompt.root; service zfs start The examples in this section assume three SCSI disks with the device names da0, da1, and da2. Users of SATA hardware should instead use ada device names.
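Before creating a pool, it can be helpful to confirm which disk devices the kernel has attached. One way is &man.camcontrol.8; (a quick sketch; the hardware listed here is representative and will vary from machine to machine):

&prompt.root; camcontrol devlist
<SEAGATE ST336607LW 0007>  at scbus0 target 0 lun 0 (pass0,da0)
<SEAGATE ST336607LW 0007>  at scbus0 target 1 lun 0 (pass1,da1)
<SEAGATE ST336607LW 0007>  at scbus0 target 2 lun 0 (pass2,da2)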
Single Disk Pool To create a simple, non-redundant pool using a single disk device: &prompt.root; zpool create example /dev/da0 To view the new pool, review the output of df: &prompt.root; df Filesystem 1K-blocks Used Avail Capacity Mounted on /dev/ad0s1a 2026030 235230 1628718 13% / devfs 1 1 0 100% /dev /dev/ad0s1d 54098308 1032846 48737598 2% /usr example 17547136 0 17547136 0% /example This output shows that the example pool has been created and mounted. It is now accessible as a file system. Files can be created on it and users can browse it: &prompt.root; cd /example &prompt.root; ls &prompt.root; touch testfile &prompt.root; ls -al total 4 drwxr-xr-x 2 root wheel 3 Aug 29 23:15 . drwxr-xr-x 21 root wheel 512 Aug 29 23:12 .. -rw-r--r-- 1 root wheel 0 Aug 29 23:15 testfile However, this pool is not taking advantage of any ZFS features. To create a dataset on this pool with compression enabled: &prompt.root; zfs create example/compressed &prompt.root; zfs set compression=gzip example/compressed The example/compressed dataset is now a ZFS compressed file system. Try copying some large files to /example/compressed. Compression can be disabled with: &prompt.root; zfs set compression=off example/compressed To unmount a file system, use zfs umount and then verify with df: &prompt.root; zfs umount example/compressed &prompt.root; df Filesystem 1K-blocks Used Avail Capacity Mounted on /dev/ad0s1a 2026030 235232 1628716 13% / devfs 1 1 0 100% /dev /dev/ad0s1d 54098308 1032864 48737580 2% /usr example 17547008 0 17547008 0% /example To re-mount the file system to make it accessible again, use zfs mount and verify with df: &prompt.root; zfs mount example/compressed &prompt.root; df Filesystem 1K-blocks Used Avail Capacity Mounted on /dev/ad0s1a 2026030 235234 1628714 13% / devfs 1 1 0 100% /dev /dev/ad0s1d 54098308 1032864 48737580 2% /usr example 17547008 0 17547008 0% /example example/compressed 17547008 0 17547008 0% /example/compressed The pool and file system may also be observed by viewing the output from mount: &prompt.root; mount /dev/ad0s1a on / (ufs, local) devfs on /dev (devfs, local) /dev/ad0s1d on /usr (ufs, local, soft-updates) example on /example (zfs, local) example/compressed on /example/compressed (zfs, local) After creation, ZFS datasets can be used like any file systems. However, many other features are available which can be set on a per-dataset basis. In the example below, a new file system called data is created. Important files will be stored here, so it is configured to keep two copies of each data block: &prompt.root; zfs create example/data &prompt.root; zfs set copies=2 example/data It is now possible to see the data and space utilization by issuing df: &prompt.root; df Filesystem 1K-blocks Used Avail Capacity Mounted on /dev/ad0s1a 2026030 235234 1628714 13% / devfs 1 1 0 100% /dev /dev/ad0s1d 54098308 1032864 48737580 2% /usr example 17547008 0 17547008 0% /example example/compressed 17547008 0 17547008 0% /example/compressed example/data 17547008 0 17547008 0% /example/data Notice that each file system on the pool has the same amount of available space. This is the reason for using df in these examples, to show that the file systems use only the amount of space they need and all draw from the same pool. ZFS eliminates concepts such as volumes and partitions, and allows multiple file systems to occupy the same pool. 
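The properties set above can be confirmed at any time with zfs get; a brief illustration for the example/data dataset (the output shown is typical):

&prompt.root; zfs get compression,copies example/data
NAME          PROPERTY     VALUE  SOURCE
example/data  compression  off    default
example/data  copies       2      local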
To destroy the file systems and then destroy the pool as it is no longer needed: &prompt.root; zfs destroy example/compressed &prompt.root; zfs destroy example/data &prompt.root; zpool destroy example RAID-Z Disks fail. One method of avoiding data loss from disk failure is to implement RAID. ZFS supports this feature in its pool design. RAID-Z pools require three or more disks but provide more usable space than mirrored pools. This example creates a RAID-Z pool, specifying the disks to add to the pool: &prompt.root; zpool create storage raidz da0 da1 da2 &sun; recommends that the number of devices used in a RAID-Z configuration be between three and nine. For environments requiring a single pool consisting of 10 disks or more, consider breaking it up into smaller RAID-Z groups. If only two disks are available and redundancy is a requirement, consider using a ZFS mirror. Refer to &man.zpool.8; for more details. The previous example created the storage zpool. This example makes a new file system called home in that pool: &prompt.root; zfs create storage/home Compression and keeping extra copies of directories and files can be enabled: &prompt.root; zfs set copies=2 storage/home &prompt.root; zfs set compression=gzip storage/home To make this the new home directory for users, copy the user data to this directory and create the appropriate symbolic links: &prompt.root; cp -rp /home/* /storage/home &prompt.root; rm -rf /home /usr/home &prompt.root; ln -s /storage/home /home &prompt.root; ln -s /storage/home /usr/home User data is now stored on the freshly-created /storage/home. Test by adding a new user and logging in as that user. Try creating a file system snapshot which can be rolled back later: &prompt.root; zfs snapshot storage/home@08-30-08 Snapshots can only be made of a full file system, not a single directory or file. The @ character is a delimiter between the file system or volume name and the name of the snapshot. If an important directory has been accidentally deleted, the file system can be backed up, then rolled back to an earlier snapshot when the directory still existed: &prompt.root; zfs rollback storage/home@08-30-08 To list all available snapshots, run ls in the file system's .zfs/snapshot directory. For example, to see the previously taken snapshot: &prompt.root; ls /storage/home/.zfs/snapshot It is possible to write a script to perform regular snapshots on user data. However, over time, snapshots can consume a great deal of disk space. The previous snapshot can be removed using the command: &prompt.root; zfs destroy storage/home@08-30-08 After testing, /storage/home can be made the real /home using this command: &prompt.root; zfs set mountpoint=/home storage/home Run df and mount to confirm that the system now treats the file system as the real /home: &prompt.root; mount /dev/ad0s1a on / (ufs, local) devfs on /dev (devfs, local) /dev/ad0s1d on /usr (ufs, local, soft-updates) storage on /storage (zfs, local) storage/home on /home (zfs, local) &prompt.root; df Filesystem 1K-blocks Used Avail Capacity Mounted on /dev/ad0s1a 2026030 235240 1628708 13% / devfs 1 1 0 100% /dev /dev/ad0s1d 54098308 1032826 48737618 2% /usr storage 26320512 0 26320512 0% /storage storage/home 26320512 0 26320512 0% /home This completes the RAID-Z configuration. Daily status updates about the file systems created can be generated as part of the nightly &man.periodic.8; runs.
Add this line to /etc/periodic.conf: daily_status_zfs_enable="YES" Recovering <acronym>RAID-Z</acronym> Every software RAID has a method of monitoring its state. The status of RAID-Z devices may be viewed with this command: &prompt.root; zpool status -x If all pools are Online and everything is normal, the message shows: all pools are healthy If there is an issue, such as a disk in the Offline state, the pool status will look similar to: pool: storage state: DEGRADED status: One or more devices has been taken offline by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Online the device using 'zpool online' or replace the device with 'zpool replace'. scrub: none requested config: NAME STATE READ WRITE CKSUM storage DEGRADED 0 0 0 raidz1 DEGRADED 0 0 0 da0 ONLINE 0 0 0 da1 OFFLINE 0 0 0 da2 ONLINE 0 0 0 errors: No known data errors This indicates that the device was previously taken offline by the administrator with this command: &prompt.root; zpool offline storage da1 Now the system can be powered down to replace da1. When the system is back online, the failed disk can be replaced in the pool: &prompt.root; zpool replace storage da1 From here, the status may be checked again, this time without -x so that all pools are shown: &prompt.root; zpool status storage pool: storage state: ONLINE scrub: resilver completed with 0 errors on Sat Aug 30 19:44:11 2008 config: NAME STATE READ WRITE CKSUM storage ONLINE 0 0 0 raidz1 ONLINE 0 0 0 da0 ONLINE 0 0 0 da1 ONLINE 0 0 0 da2 ONLINE 0 0 0 errors: No known data errors In this example, everything is normal. Data Verification ZFS uses checksums to verify the integrity of stored data. These are enabled automatically upon creation of file systems. Checksums can be disabled, but it is not recommended! Checksums take very little storage space and provide data integrity. Many ZFS features will not work properly with checksums disabled. There is no noticeable performance gain from disabling these checksums. Checksum verification is known as scrubbing. Verify the data integrity of the storage pool with this command: &prompt.root; zpool scrub storage The duration of a scrub depends on the amount of data stored. Larger amounts of data will take proportionally longer to verify. Scrubs are very I/O intensive, and only one scrub is allowed to run at a time. After the scrub completes, the status can be viewed with zpool status: &prompt.root; zpool status storage pool: storage state: ONLINE scrub: scrub completed with 0 errors on Sat Jan 26 19:57:37 2013 config: NAME STATE READ WRITE CKSUM storage ONLINE 0 0 0 raidz1 ONLINE 0 0 0 da0 ONLINE 0 0 0 da1 ONLINE 0 0 0 da2 ONLINE 0 0 0 errors: No known data errors The completion date of the last scrub operation is displayed to help track when another scrub is required. Routine scrubs help protect data from silent corruption and ensure the integrity of the pool. Refer to &man.zfs.8; and &man.zpool.8; for other ZFS options. <command>zpool</command> Administration ZFS administration is divided between two main utilities. The zpool utility controls the operation of the pool and deals with adding, removing, replacing, and managing disks. The zfs utility deals with creating, destroying, and managing datasets, both file systems and volumes. Creating and Destroying Storage Pools Creating a ZFS storage pool (zpool) involves making a number of decisions that are relatively permanent because the structure of the pool cannot be changed after the pool has been created.
The most important decision is which types of vdevs to group the physical disks into. See the list of vdev types for details about the possible options. After the pool has been created, most vdev types do not allow additional disks to be added to the vdev. The exceptions are mirrors, which allow additional disks to be added to the vdev, and stripes, which can be upgraded to mirrors by attaching an additional disk to the vdev. Although additional vdevs can be added to expand a pool, the layout of the pool cannot be changed after pool creation. Instead, the data must be backed up and the pool destroyed and recreated. Create a simple mirror pool: &prompt.root; zpool create mypool mirror /dev/ada1 /dev/ada2 &prompt.root; zpool status pool: mypool state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada1 ONLINE 0 0 0 ada2 ONLINE 0 0 0 errors: No known data errors Multiple vdevs can be created at once. Specify multiple groups of disks separated by the vdev type keyword, mirror in this example: &prompt.root; zpool create mypool mirror /dev/ada1 /dev/ada2 mirror /dev/ada3 /dev/ada4 &prompt.root; zpool status pool: mypool state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada1 ONLINE 0 0 0 ada2 ONLINE 0 0 0 mirror-1 ONLINE 0 0 0 ada3 ONLINE 0 0 0 ada4 ONLINE 0 0 0 errors: No known data errors Pools can also be constructed using partitions rather than whole disks. Putting ZFS in a separate partition allows the same disk to have other partitions for other purposes. In particular, partitions with bootcode and file systems needed for booting can be added. This allows booting from disks that are also members of a pool. There is no performance penalty on &os; when using a partition rather than a whole disk. Using partitions also allows the administrator to under-provision the disks, using less than the full capacity. If a future replacement disk of the same nominal size as the original actually has a slightly smaller capacity, the smaller partition will still fit, and the replacement disk can still be used. Create a RAID-Z2 pool using partitions: &prompt.root; zpool create mypool raidz2 /dev/ada0p3 /dev/ada1p3 /dev/ada2p3 /dev/ada3p3 /dev/ada4p3 /dev/ada5p3 &prompt.root; zpool status pool: mypool state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 raidz2-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 ada1p3 ONLINE 0 0 0 ada2p3 ONLINE 0 0 0 ada3p3 ONLINE 0 0 0 ada4p3 ONLINE 0 0 0 ada5p3 ONLINE 0 0 0 errors: No known data errors A pool that is no longer needed can be destroyed so that the disks can be reused. Destroying a pool involves first unmounting all of the datasets in that pool. If the datasets are in use, the unmount operation will fail and the pool will not be destroyed. The destruction of the pool can be forced with -f, but this can cause undefined behavior in applications which had open files on those datasets. Adding and Removing Devices There are two cases for adding disks to a zpool: attaching a disk to an existing vdev with zpool attach, or adding vdevs to the pool with zpool add. Only some vdev types allow disks to be added to the vdev after creation. A pool created with a single disk lacks redundancy. Corruption can be detected but not repaired, because there is no other copy of the data.
The copies property may be able to recover from a small failure such as a bad sector, but does not provide the same level of protection as mirroring or RAID-Z. Starting with a pool consisting of a single disk vdev, zpool attach can be used to add an additional disk to the vdev, creating a mirror. zpool attach can also be used to add additional disks to a mirror group, increasing redundancy and read performance. If the disks being used for the pool are partitioned, replicate the layout of the first disk onto the second. gpart backup and gpart restore can be used to make this process easier. Upgrade the single disk (stripe) vdev ada0p3 to a mirror by attaching ada1p3: &prompt.root; zpool status pool: mypool state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 errors: No known data errors &prompt.root; zpool attach mypool ada0p3 ada1p3 Make sure to wait until resilver is done before rebooting. If you boot from pool 'mypool', you may need to update boot code on newly attached disk 'ada1p3'. Assuming you use GPT partitioning and 'da0' is your new boot disk you may use the following command: gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0 &prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1 bootcode written to ada1 &prompt.root; zpool status pool: mypool state: ONLINE status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Fri May 30 08:19:19 2014 527M scanned out of 781M at 47.9M/s, 0h0m to go 527M resilvered, 67.53% done config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 ada1p3 ONLINE 0 0 0 (resilvering) errors: No known data errors &prompt.root; zpool status pool: mypool state: ONLINE scan: resilvered 781M in 0h0m with 0 errors on Fri May 30 08:15:58 2014 config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 ada1p3 ONLINE 0 0 0 errors: No known data errors When adding disks to the existing vdev is not an option, as for RAID-Z, an alternative method is to add another vdev to the pool. Additional vdevs provide higher performance, distributing writes across the vdevs. Each vdev is responsible for providing its own redundancy. It is possible, but discouraged, to mix vdev types, like mirror and RAID-Z. Adding a non-redundant vdev to a pool containing mirror or RAID-Z vdevs risks the data on the entire pool. Writes are distributed, so the failure of the non-redundant disk will result in the loss of a fraction of every block that has been written to the pool. Data is striped across each of the vdevs. For example, with two mirror vdevs, this is effectively a RAID 10 that stripes writes across two sets of mirrors. Space is allocated so that each vdev reaches 100% full at the same time. There is a performance penalty if the vdevs have different amounts of free space, as a disproportionate amount of the data is written to the less full vdev. When attaching additional devices to a boot pool, remember to update the bootcode.
Attach a second mirror group (ada2p3 and ada3p3) to the existing mirror: &prompt.root; zpool status pool: mypool state: ONLINE scan: resilvered 781M in 0h0m with 0 errors on Fri May 30 08:19:35 2014 config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 ada1p3 ONLINE 0 0 0 errors: No known data errors &prompt.root; zpool add mypool mirror ada2p3 ada3p3 &prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada2 bootcode written to ada2 &prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada3 bootcode written to ada3 &prompt.root; zpool status pool: mypool state: ONLINE scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 30 08:29:51 2014 config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 ada1p3 ONLINE 0 0 0 mirror-1 ONLINE 0 0 0 ada2p3 ONLINE 0 0 0 ada3p3 ONLINE 0 0 0 errors: No known data errors Currently, vdevs cannot be removed from a pool, and disks can only be removed from a mirror if there is enough remaining redundancy. If only one disk in a mirror group remains, it ceases to be a mirror and reverts to being a stripe, risking the entire pool if that remaining disk fails. Remove a disk from a three-way mirror group: &prompt.root; zpool status pool: mypool state: ONLINE scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 30 08:29:51 2014 config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 ada1p3 ONLINE 0 0 0 ada2p3 ONLINE 0 0 0 errors: No known data errors &prompt.root; zpool detach mypool ada2p3 &prompt.root; zpool status pool: mypool state: ONLINE scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 30 08:29:51 2014 config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 ada1p3 ONLINE 0 0 0 errors: No known data errors Checking the Status of a Pool Pool status is important. If a drive goes offline or a read, write, or checksum error is detected, the corresponding error count increases. The status output shows the configuration and status of each device in the pool and the status of the entire pool. Actions that need to be taken and details about the last scrub are also shown. &prompt.root; zpool status pool: mypool state: ONLINE scan: scrub repaired 0 in 2h25m with 0 errors on Sat Sep 14 04:25:50 2013 config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 raidz2-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 ada1p3 ONLINE 0 0 0 ada2p3 ONLINE 0 0 0 ada3p3 ONLINE 0 0 0 ada4p3 ONLINE 0 0 0 ada5p3 ONLINE 0 0 0 errors: No known data errors Clearing Errors When an error is detected, the read, write, or checksum counts are incremented. The error message can be cleared and the counts reset with zpool clear mypool. Clearing the error state can be important for automated scripts that alert the administrator when the pool encounters an error. Further errors may not be reported if the old errors are not cleared. Replacing a Functioning Device There are a number of situations where it may be desirable to replace one disk with a different disk. When replacing a working disk, the process keeps the old disk online during the replacement. The pool never enters a degraded state, reducing the risk of data loss. zpool replace copies all of the data from the old disk to the new one. After the operation completes, the old disk is disconnected from the vdev. If the new disk is larger than the old disk, it may be possible to grow the zpool, using the new space. See Growing a Pool. 
Replace a functioning device in the pool: &prompt.root; zpool status pool: mypool state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 ada1p3 ONLINE 0 0 0 errors: No known data errors &prompt.root; zpool replace mypool ada1p3 ada2p3 Make sure to wait until resilver is done before rebooting. If you boot from pool 'zroot', you may need to update boot code on newly attached disk 'ada2p3'. Assuming you use GPT partitioning and 'da0' is your new boot disk you may use the following command: gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0 &prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada2 &prompt.root; zpool status pool: mypool state: ONLINE status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Mon Jun 2 14:21:35 2014 604M scanned out of 781M at 46.5M/s, 0h0m to go 604M resilvered, 77.39% done config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 replacing-1 ONLINE 0 0 0 ada1p3 ONLINE 0 0 0 ada2p3 ONLINE 0 0 0 (resilvering) errors: No known data errors &prompt.root; zpool status pool: mypool state: ONLINE scan: resilvered 781M in 0h0m with 0 errors on Mon Jun 2 14:21:52 2014 config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 ada2p3 ONLINE 0 0 0 errors: No known data errors Dealing with Failed Devices When a disk in a pool fails, the vdev to which the disk belongs enters the degraded state. All of the data is still available, but performance may be reduced because missing data must be calculated from the available redundancy. To restore the vdev to a fully functional state, the failed physical device must be replaced. ZFS is then instructed to begin the resilver operation. Data that was on the failed device is recalculated from available redundancy and written to the replacement device. After completion, the vdev returns to online status. If the vdev does not have any redundancy, or if multiple devices have failed and there is not enough redundancy to compensate, the pool enters the faulted state. If a sufficient number of devices cannot be reconnected to the pool, the pool becomes inoperative and data must be restored from backups. When replacing a failed disk, the name of the failed disk is replaced with the GUID of the device. A new device name parameter for zpool replace is not required if the replacement device has the same device name. Replace a failed disk using zpool replace: &prompt.root; zpool status pool: mypool state: DEGRADED status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Attach the missing device and online it using 'zpool online'. see: http://illumos.org/msg/ZFS-8000-2Q scan: none requested config: NAME STATE READ WRITE CKSUM mypool DEGRADED 0 0 0 mirror-0 DEGRADED 0 0 0 ada0p3 ONLINE 0 0 0 316502962686821739 UNAVAIL 0 0 0 was /dev/ada1p3 errors: No known data errors &prompt.root; zpool replace mypool 316502962686821739 ada2p3 &prompt.root; zpool status pool: mypool state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. 
scan: resilver in progress since Mon Jun 2 14:52:21 2014 641M scanned out of 781M at 49.3M/s, 0h0m to go 640M resilvered, 82.04% done config: NAME STATE READ WRITE CKSUM mypool DEGRADED 0 0 0 mirror-0 DEGRADED 0 0 0 ada0p3 ONLINE 0 0 0 replacing-1 UNAVAIL 0 0 0 15732067398082357289 UNAVAIL 0 0 0 was /dev/ada1p3/old ada2p3 ONLINE 0 0 0 (resilvering) errors: No known data errors &prompt.root; zpool status pool: mypool state: ONLINE scan: resilvered 781M in 0h0m with 0 errors on Mon Jun 2 14:52:38 2014 config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 ada2p3 ONLINE 0 0 0 errors: No known data errors Scrubbing a Pool It is recommended that pools be scrubbed regularly, ideally at least once every month. The scrub operation is very disk-intensive and will reduce performance while running. Avoid high-demand periods when scheduling a scrub, or use vfs.zfs.scrub_delay to adjust the relative priority of the scrub to prevent it from interfering with other workloads. &prompt.root; zpool scrub mypool &prompt.root; zpool status pool: mypool state: ONLINE scan: scrub in progress since Wed Feb 19 20:52:54 2014 116G scanned out of 8.60T at 649M/s, 3h48m to go 0 repaired, 1.32% done config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 raidz2-0 ONLINE 0 0 0 ada0p3 ONLINE 0 0 0 ada1p3 ONLINE 0 0 0 ada2p3 ONLINE 0 0 0 ada3p3 ONLINE 0 0 0 ada4p3 ONLINE 0 0 0 ada5p3 ONLINE 0 0 0 errors: No known data errors In the event that a scrub operation needs to be cancelled, issue zpool scrub -s mypool. Self-Healing The checksums stored with data blocks enable the file system to self-heal. This feature will automatically repair data whose checksum does not match the one recorded on another device that is part of the storage pool. Consider, for example, a mirror with two disks where one drive is starting to malfunction and can no longer properly store the data. This is even worse when the data has not been accessed for a long time, as with long term archive storage. Traditional file systems need to run algorithms that check and repair the data like &man.fsck.8;. These commands take time, and in severe cases, an administrator has to manually decide which repair operation must be performed. When ZFS detects a data block with a checksum that does not match, it tries to read the data from the mirror disk. If that disk can provide the correct data, it will not only give that data to the application requesting it, but also correct the wrong data on the disk that had the bad checksum. This happens without any interaction from a system administrator during normal pool operation. The next example demonstrates this self-healing behavior. A mirrored pool of disks /dev/ada0 and /dev/ada1 is created. &prompt.root; zpool create healer mirror /dev/ada0 /dev/ada1 &prompt.root; zpool status healer pool: healer state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM healer ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0 ONLINE 0 0 0 ada1 ONLINE 0 0 0 errors: No known data errors &prompt.root; zpool list NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT healer 960M 92.5K 960M - - 0% 0% 1.00x ONLINE - Some important data that is to be protected from data errors using the self-healing feature is copied to the pool. A checksum of the pool is created for later comparison.
&prompt.root; cp /some/important/data /healer &prompt.root; zfs list NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT healer 960M 67.7M 892M 7% 1.00x ONLINE - &prompt.root; sha1 /healer > checksum.txt &prompt.root; cat checksum.txt SHA1 (/healer) = 2753eff56d77d9a536ece6694bf0a82740344d1f Data corruption is simulated by writing random data to the beginning of one of the disks in the mirror. To prevent ZFS from healing the data as soon as it is detected, the pool is exported before the corruption and imported again afterwards. This is a dangerous operation that can destroy vital data. It is shown here for demonstrational purposes only and should not be attempted during normal operation of a storage pool. Nor should this intentional corruption example be run on any disk with a different file system on it. Do not use any disk device names other than the ones that are part of the pool. Make certain that proper backups of the pool are created before running the command! &prompt.root; zpool export healer &prompt.root; dd if=/dev/random of=/dev/ada1 bs=1m count=200 200+0 records in 200+0 records out 209715200 bytes transferred in 62.992162 secs (3329227 bytes/sec) &prompt.root; zpool import healer The pool status shows that one device has experienced an error. Note that applications reading data from the pool did not receive any incorrect data. ZFS provided data from the ada0 device with the correct checksums. The device with the wrong checksum can be found easily as the CKSUM column contains a nonzero value. &prompt.root; zpool status healer pool: healer state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://illumos.org/msg/ZFS-8000-4J scan: none requested config: NAME STATE READ WRITE CKSUM healer ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0 ONLINE 0 0 0 ada1 ONLINE 0 0 1 errors: No known data errors The error was detected and handled by using the redundancy present in the unaffected ada0 mirror disk. A checksum comparison with the original one will reveal whether the pool is consistent again. &prompt.root; sha1 /healer >> checksum.txt &prompt.root; cat checksum.txt SHA1 (/healer) = 2753eff56d77d9a536ece6694bf0a82740344d1f SHA1 (/healer) = 2753eff56d77d9a536ece6694bf0a82740344d1f The two checksums that were generated before and after the intentional tampering with the pool data still match. This shows how ZFS is capable of detecting and correcting any errors automatically when the checksums differ. Note that this is only possible when there is enough redundancy present in the pool. A pool consisting of a single device has no self-healing capabilities. That is also the reason why checksums are so important in ZFS and should not be disabled for any reason. No &man.fsck.8; or similar file system consistency check program is required to detect and correct this, and the pool remained available while there was a problem. A scrub operation is now required to overwrite the corrupted data on ada1. &prompt.root; zpool scrub healer &prompt.root; zpool status healer pool: healer state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://illumos.org/msg/ZFS-8000-4J scan: scrub in progress since Mon Dec 10 12:23:30 2012 10.4M scanned out of 67.0M at 267K/s, 0h3m to go 9.63M repaired, 15.56% done config: NAME STATE READ WRITE CKSUM healer ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0 ONLINE 0 0 0 ada1 ONLINE 0 0 627 (repairing) errors: No known data errors The scrub operation reads data from ada0 and rewrites any data with an incorrect checksum on ada1. This is indicated by the (repairing) output from zpool status. After the operation is complete, the pool status changes to: &prompt.root; zpool status healer pool: healer state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://illumos.org/msg/ZFS-8000-4J scan: scrub repaired 66.5M in 0h2m with 0 errors on Mon Dec 10 12:26:25 2012 config: NAME STATE READ WRITE CKSUM healer ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0 ONLINE 0 0 0 ada1 ONLINE 0 0 2.72K errors: No known data errors After the scrub operation completes and all the data has been synchronized from ada0 to ada1, the error messages can be cleared from the pool status by running zpool clear. &prompt.root; zpool clear healer &prompt.root; zpool status healer pool: healer state: ONLINE scan: scrub repaired 66.5M in 0h2m with 0 errors on Mon Dec 10 12:26:25 2012 config: NAME STATE READ WRITE CKSUM healer ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0 ONLINE 0 0 0 ada1 ONLINE 0 0 0 errors: No known data errors The pool is now back to a fully working state and all the errors have been cleared. Growing a Pool The usable size of a redundant pool is limited by the capacity of the smallest device in each vdev. The smallest device can be replaced with a larger device. After completing a replace or resilver operation, the pool can grow to use the capacity of the new device. For example, consider a mirror of a 1 TB drive and a 2 TB drive. The usable space is 1 TB. When the 1 TB drive is replaced with another 2 TB drive, the resilvering process copies the existing data onto the new drive. As both of the devices now have 2 TB capacity, the mirror's available space can be grown to 2 TB. Expansion is triggered by using zpool online -e on each device. After expansion of all devices, the additional space becomes available to the pool. Importing and Exporting Pools Pools are exported before moving them to another system. All datasets are unmounted, and each device is marked as exported but still locked so it cannot be used by other disk subsystems. This allows pools to be imported on other machines, other operating systems that support ZFS, and even different hardware architectures (with some caveats, see &man.zpool.8;). When a dataset has open files, zpool export -f can be used to force the export of a pool. Use this with caution. The datasets are forcibly unmounted, potentially resulting in unexpected behavior by the applications which had open files on those datasets. Export a pool that is not in use: &prompt.root; zpool export mypool Importing a pool automatically mounts the datasets. This may not be the desired behavior, and can be prevented with zpool import -N. zpool import -o sets temporary properties for this import only.
zpool import altroot= allows importing a pool with a base mount point instead of the root of the file system. If the pool was last used on a different system and was not properly exported, an import might have to be forced with zpool import -f. zpool import -a imports all pools that do not appear to be in use by another system. List all available pools for import: &prompt.root; zpool import pool: mypool id: 9930174748043525076 state: ONLINE action: The pool can be imported using its name or numeric identifier. config: mypool ONLINE ada2p3 ONLINE Import the pool with an alternative root directory: &prompt.root; zpool import -o altroot=/mnt mypool &prompt.root; zfs list NAME USED AVAIL REFER MOUNTPOINT mypool 110K 47.0G 31K /mnt/mypool Upgrading a Storage Pool After upgrading &os;, or if a pool has been imported from a system using an older version of ZFS, the pool can be manually upgraded to the latest version of ZFS to support newer features. Consider whether the pool may ever need to be imported on an older system before upgrading. Upgrading is a one-way process. Older pools can be upgraded, but pools with newer features cannot be downgraded. Upgrade a v28 pool to support Feature Flags: &prompt.root; zpool status pool: mypool state: ONLINE status: The pool is formatted using a legacy on-disk format. The pool can still be used, but some features are unavailable. action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on software that does not support feat flags. scan: none requested config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0 ONLINE 0 0 0 ada1 ONLINE 0 0 0 errors: No known data errors &prompt.root; zpool upgrade This system supports ZFS pool feature flags. The following pools are formatted with legacy version numbers and can be upgraded to use feature flags. After being upgraded, these pools will no longer be accessible by software that does not support feature flags. VER POOL --- ------------ 28 mypool Use 'zpool upgrade -v' for a list of available legacy versions. Every feature flags pool has all supported features enabled. &prompt.root; zpool upgrade mypool This system supports ZFS pool feature flags. Successfully upgraded 'mypool' from version 28 to feature flags. Enabled the following features on 'mypool': async_destroy empty_bpobj lz4_compress multi_vdev_crash_dump The newer features of ZFS will not be available until zpool upgrade has completed. zpool upgrade -v can be used to see what new features will be provided by upgrading, as well as which features are already supported. Upgrade a pool to support additional feature flags: &prompt.root; zpool status pool: mypool state: ONLINE status: Some supported features are not enabled on the pool. The pool can still be used, but some features are unavailable. action: Enable all features using 'zpool upgrade'. Once this is done, the pool may no longer be accessible by software that does not support the features. See zpool-features(7) for details. scan: none requested config: NAME STATE READ WRITE CKSUM mypool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 ada0 ONLINE 0 0 0 ada1 ONLINE 0 0 0 errors: No known data errors &prompt.root; zpool upgrade This system supports ZFS pool feature flags. All pools are formatted using feature flags. Some supported features are not enabled on the following pools. Once a feature is enabled the pool may become incompatible with software that does not support the feature. See zpool-features(7) for details.
POOL FEATURE --------------- zstore multi_vdev_crash_dump spacemap_histogram enabled_txg hole_birth extensible_dataset bookmarks filesystem_limits &prompt.root; zpool upgrade mypool This system supports ZFS pool feature flags. Enabled the following features on 'mypool': spacemap_histogram enabled_txg hole_birth extensible_dataset bookmarks filesystem_limits The boot code on systems that boot from a pool must be updated to support the new pool version. Use gpart bootcode on the partition that contains the boot code. There are two types of bootcode available, depending on the way the system boots: GPT (the most common option) and EFI (for more modern systems). For legacy boot using GPT, use the following command: &prompt.root; gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1 For systems using EFI to boot, execute the following command: &prompt.root; gpart bootcode -p /boot/boot1.efifat -i 1 ada1 Apply the bootcode to all bootable disks in the pool. See &man.gpart.8; for more information. Displaying Recorded Pool History Commands that modify the pool are recorded. Recorded actions include the creation of datasets, changing properties, or replacement of a disk. This history is useful for reviewing how a pool was created and which user performed a specific action and when. History is not kept in a log file, but is part of the pool itself. The command to review this history is aptly named zpool history: &prompt.root; zpool history History for 'tank': 2013-02-26.23:02:35 zpool create tank mirror /dev/ada0 /dev/ada1 2013-02-27.18:50:58 zfs set atime=off tank 2013-02-27.18:51:09 zfs set checksum=fletcher4 tank 2013-02-27.18:51:18 zfs create tank/backup The output shows zpool and zfs commands that were executed on the pool along with a timestamp. Only commands that alter the pool in some way are recorded. Commands like zfs list are not included. When no pool name is specified, the history of all pools is displayed. zpool history can show even more information when the options -i or -l are provided. -i displays user-initiated events as well as internally logged ZFS events. &prompt.root; zpool history -i History for 'tank': 2013-02-26.23:02:35 [internal pool create txg:5] pool spa 28; zfs spa 28; zpl 5;uts 9.1-RELEASE 901000 amd64 2013-02-27.18:50:53 [internal property set txg:50] atime=0 dataset = 21 2013-02-27.18:50:58 zfs set atime=off tank 2013-02-27.18:51:04 [internal property set txg:53] checksum=7 dataset = 21 2013-02-27.18:51:09 zfs set checksum=fletcher4 tank 2013-02-27.18:51:13 [internal create txg:55] dataset = 39 2013-02-27.18:51:18 zfs create tank/backup More details can be shown by adding -l. History records are shown in a long format, including information like the name of the user who issued the command and the hostname on which the change was made. &prompt.root; zpool history -l History for 'tank': 2013-02-26.23:02:35 zpool create tank mirror /dev/ada0 /dev/ada1 [user 0 (root) on :global] 2013-02-27.18:50:58 zfs set atime=off tank [user 0 (root) on myzfsbox:global] 2013-02-27.18:51:09 zfs set checksum=fletcher4 tank [user 0 (root) on myzfsbox:global] 2013-02-27.18:51:18 zfs create tank/backup [user 0 (root) on myzfsbox:global] The output shows that the root user created the mirrored pool with disks /dev/ada0 and /dev/ada1. The hostname myzfsbox is also shown in the commands after the pool's creation. The hostname display becomes important when the pool is exported from one system and imported on another.
The commands that are issued on the other system can clearly be distinguished by the hostname that is recorded for each command. Both options to zpool history can be combined to give the most detailed information possible for any given pool. Pool history provides valuable information when tracking down the actions that were performed or when more detailed output is needed for debugging. Performance Monitoring A built-in monitoring system can display pool I/O statistics in real time. It shows the amount of free and used space on the pool, how many read and write operations are being performed per second, and how much I/O bandwidth is currently being utilized. By default, all pools in the system are monitored and displayed. A pool name can be provided to limit monitoring to just that pool. A basic example: &prompt.root; zpool iostat capacity operations bandwidth pool alloc free read write read write ---------- ----- ----- ----- ----- ----- ----- data 288G 1.53T 2 11 11.3K 57.1K To continuously monitor I/O activity, a number can be specified as the last parameter, indicating an interval in seconds to wait between updates. The next statistic line is printed after each interval. Press Ctrl+C to stop this continuous monitoring. Alternatively, give a second number on the command line after the interval to specify the total number of statistics to display. Even more detailed I/O statistics can be displayed with -v. Each device in the pool is shown with a statistics line. This is useful in seeing how many read and write operations are being performed on each device, and can help determine if any individual device is slowing down the pool. This example shows a mirrored pool with two devices: &prompt.root; zpool iostat -v capacity operations bandwidth pool alloc free read write read write ----------------------- ----- ----- ----- ----- ----- ----- data 288G 1.53T 2 12 9.23K 61.5K mirror 288G 1.53T 2 12 9.23K 61.5K ada1 - - 0 4 5.61K 61.7K ada2 - - 1 4 5.04K 61.7K ----------------------- ----- ----- ----- ----- ----- ----- Splitting a Storage Pool A pool consisting of one or more mirror vdevs can be split into two pools. Unless otherwise specified, the last member of each mirror is detached and used to create a new pool containing the same data. The operation should first be attempted with -n. The details of the proposed operation are displayed without it actually being performed. This helps confirm that the operation will do what the user intends. <command>zfs</command> Administration The zfs utility is responsible for creating, destroying, and managing all ZFS datasets that exist within a pool. The pool is managed using zpool. Creating and Destroying Datasets Unlike traditional disks and volume managers, space in ZFS is not preallocated. With traditional file systems, after all of the space is partitioned and assigned, there is no way to add an additional file system without adding a new disk. With ZFS, new file systems can be created at any time. Each dataset has properties including features like compression, deduplication, caching, and quotas, as well as other useful properties like readonly, case sensitivity, network file sharing, and a mount point. Datasets can be nested inside each other, and child datasets will inherit properties from their parents. Each dataset can be administered, delegated, replicated, snapshotted, jailed, and destroyed as a unit. There are many advantages to creating a separate dataset for each different type or set of files.
The only drawbacks to having an extremely large number of datasets are that some commands like zfs list will be slower, and the mounting of hundreds or even thousands of datasets can slow the &os; boot process. Create a new dataset and enable LZ4 compression on it: &prompt.root; zfs list NAME USED AVAIL REFER MOUNTPOINT mypool 781M 93.2G 144K none mypool/ROOT 777M 93.2G 144K none mypool/ROOT/default 777M 93.2G 777M / mypool/tmp 176K 93.2G 176K /tmp mypool/usr 616K 93.2G 144K /usr mypool/usr/home 184K 93.2G 184K /usr/home mypool/usr/ports 144K 93.2G 144K /usr/ports mypool/usr/src 144K 93.2G 144K /usr/src mypool/var 1.20M 93.2G 608K /var mypool/var/crash 148K 93.2G 148K /var/crash mypool/var/log 178K 93.2G 178K /var/log mypool/var/mail 144K 93.2G 144K /var/mail mypool/var/tmp 152K 93.2G 152K /var/tmp &prompt.root; zfs create -o compress=lz4 mypool/usr/mydataset &prompt.root; zfs list NAME USED AVAIL REFER MOUNTPOINT mypool 781M 93.2G 144K none mypool/ROOT 777M 93.2G 144K none mypool/ROOT/default 777M 93.2G 777M / mypool/tmp 176K 93.2G 176K /tmp mypool/usr 704K 93.2G 144K /usr mypool/usr/home 184K 93.2G 184K /usr/home mypool/usr/mydataset 87.5K 93.2G 87.5K /usr/mydataset mypool/usr/ports 144K 93.2G 144K /usr/ports mypool/usr/src 144K 93.2G 144K /usr/src mypool/var 1.20M 93.2G 610K /var mypool/var/crash 148K 93.2G 148K /var/crash mypool/var/log 178K 93.2G 178K /var/log mypool/var/mail 144K 93.2G 144K /var/mail mypool/var/tmp 152K 93.2G 152K /var/tmp Destroying a dataset is much quicker than deleting all of the files that reside on the dataset, as it does not involve scanning all of the files and updating all of the corresponding metadata. Destroy the previously-created dataset: &prompt.root; zfs list NAME USED AVAIL REFER MOUNTPOINT mypool 880M 93.1G 144K none mypool/ROOT 777M 93.1G 144K none mypool/ROOT/default 777M 93.1G 777M / mypool/tmp 176K 93.1G 176K /tmp mypool/usr 101M 93.1G 144K /usr mypool/usr/home 184K 93.1G 184K /usr/home mypool/usr/mydataset 100M 93.1G 100M /usr/mydataset mypool/usr/ports 144K 93.1G 144K /usr/ports mypool/usr/src 144K 93.1G 144K /usr/src mypool/var 1.20M 93.1G 610K /var mypool/var/crash 148K 93.1G 148K /var/crash mypool/var/log 178K 93.1G 178K /var/log mypool/var/mail 144K 93.1G 144K /var/mail mypool/var/tmp 152K 93.1G 152K /var/tmp &prompt.root; zfs destroy mypool/usr/mydataset &prompt.root; zfs list NAME USED AVAIL REFER MOUNTPOINT mypool 781M 93.2G 144K none mypool/ROOT 777M 93.2G 144K none mypool/ROOT/default 777M 93.2G 777M / mypool/tmp 176K 93.2G 176K /tmp mypool/usr 616K 93.2G 144K /usr mypool/usr/home 184K 93.2G 184K /usr/home mypool/usr/ports 144K 93.2G 144K /usr/ports mypool/usr/src 144K 93.2G 144K /usr/src mypool/var 1.21M 93.2G 612K /var mypool/var/crash 148K 93.2G 148K /var/crash mypool/var/log 178K 93.2G 178K /var/log mypool/var/mail 144K 93.2G 144K /var/mail mypool/var/tmp 152K 93.2G 152K /var/tmp In modern versions of ZFS, zfs destroy is asynchronous, and the free space might take several minutes to appear in the pool. Use zpool get freeing poolname to see the freeing property, indicating how many datasets are having their blocks freed in the background. If there are child datasets, like snapshots or other datasets, then the parent cannot be destroyed. To destroy a dataset and all of its children, use -r to recursively destroy the dataset and all of its children. Use -n -v to list datasets and snapshots that would be destroyed by this operation, but do not actually destroy anything. Space that would be reclaimed by destruction of snapshots is also shown.
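As a hedged illustration of such a dry run, reusing the mypool/usr/mydataset name from the example above (the snapshot name and the reported sizes here are hypothetical):

&prompt.root; zfs destroy -rnv mypool/usr/mydataset
would destroy mypool/usr/mydataset@first_snapshot
would destroy mypool/usr/mydataset
would reclaim 100M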
Creating and Destroying Volumes A volume is a special type of dataset. Rather than being mounted as a file system, it is exposed as a block device under /dev/zvol/poolname/dataset. This allows the volume to be used for other file systems, to back the disks of a virtual machine, or to be exported using protocols like iSCSI or HAST. A volume can be formatted with any file system, or used without a file system to store raw data. To the user, a volume appears to be a regular disk. Putting ordinary file systems on these zvols provides features that ordinary disks or file systems do not normally have. For example, using the compression property on a 250 MB volume allows creation of a compressed FAT file system. &prompt.root; zfs create -V 250m -o compression=on tank/fat32 &prompt.root; zfs list tank NAME USED AVAIL REFER MOUNTPOINT tank 258M 670M 31K /tank &prompt.root; newfs_msdos -F32 /dev/zvol/tank/fat32 &prompt.root; mount -t msdosfs /dev/zvol/tank/fat32 /mnt &prompt.root; df -h /mnt | grep fat32 Filesystem Size Used Avail Capacity Mounted on /dev/zvol/tank/fat32 249M 24k 249M 0% /mnt &prompt.root; mount | grep fat32 /dev/zvol/tank/fat32 on /mnt (msdosfs, local) Destroying a volume is much the same as destroying a regular file system dataset. The operation is nearly instantaneous, but it may take several minutes for the free space to be reclaimed in the background. Renaming a Dataset The name of a dataset can be changed with zfs rename. The parent of a dataset can also be changed with this command. Renaming a dataset to be under a different parent dataset will change the value of those properties that are inherited from the parent dataset. When a dataset is renamed, it is unmounted and then remounted in the new location (which is inherited from the new parent dataset). This behavior can be prevented with -u. Rename a dataset and move it to be under a different parent dataset: &prompt.root; zfs list NAME USED AVAIL REFER MOUNTPOINT mypool 780M 93.2G 144K none mypool/ROOT 777M 93.2G 144K none mypool/ROOT/default 777M 93.2G 777M / mypool/tmp 176K 93.2G 176K /tmp mypool/usr 704K 93.2G 144K /usr mypool/usr/home 184K 93.2G 184K /usr/home mypool/usr/mydataset 87.5K 93.2G 87.5K /usr/mydataset mypool/usr/ports 144K 93.2G 144K /usr/ports mypool/usr/src 144K 93.2G 144K /usr/src mypool/var 1.21M 93.2G 614K /var mypool/var/crash 148K 93.2G 148K /var/crash mypool/var/log 178K 93.2G 178K /var/log mypool/var/mail 144K 93.2G 144K /var/mail mypool/var/tmp 152K 93.2G 152K /var/tmp &prompt.root; zfs rename mypool/usr/mydataset mypool/var/newname &prompt.root; zfs list NAME USED AVAIL REFER MOUNTPOINT mypool 780M 93.2G 144K none mypool/ROOT 777M 93.2G 144K none mypool/ROOT/default 777M 93.2G 777M / mypool/tmp 176K 93.2G 176K /tmp mypool/usr 616K 93.2G 144K /usr mypool/usr/home 184K 93.2G 184K /usr/home mypool/usr/ports 144K 93.2G 144K /usr/ports mypool/usr/src 144K 93.2G 144K /usr/src mypool/var 1.29M 93.2G 614K /var mypool/var/crash 148K 93.2G 148K /var/crash mypool/var/log 178K 93.2G 178K /var/log mypool/var/mail 144K 93.2G 144K /var/mail mypool/var/newname 87.5K 93.2G 87.5K /var/newname mypool/var/tmp 152K 93.2G 152K /var/tmp Snapshots can also be renamed like this. Due to the nature of snapshots, they cannot be renamed into a different parent dataset. To rename a snapshot recursively, specify -r, and all snapshots with the same name in child datasets will also be renamed.
&prompt.root; zfs list -t snapshot NAME USED AVAIL REFER MOUNTPOINT mypool/var/newname@first_snapshot 0 - 87.5K - &prompt.root; zfs rename mypool/var/newname@first_snapshot new_snapshot_name &prompt.root; zfs list -t snapshot NAME USED AVAIL REFER MOUNTPOINT mypool/var/newname@new_snapshot_name 0 - 87.5K - Setting Dataset Properties Each ZFS dataset has a number of properties that control its behavior. Most properties are automatically inherited from the parent dataset, but can be overridden locally. Set a property on a dataset with zfs set property=value dataset. Most properties have a limited set of valid values; zfs get will display each possible property and its valid values. Most properties can be reverted to their inherited values using zfs inherit. User-defined properties can also be set. They become part of the dataset configuration and can be used to provide additional information about the dataset or its contents. To distinguish these custom properties from the ones supplied as part of ZFS, a colon (:) is used to create a custom namespace for the property. &prompt.root; zfs set custom:costcenter=1234 tank &prompt.root; zfs get custom:costcenter tank NAME PROPERTY VALUE SOURCE tank custom:costcenter 1234 local To remove a custom property, use zfs inherit with -r. If the custom property is not defined in any of the parent datasets, it will be removed completely (although the changes are still recorded in the pool's history). &prompt.root; zfs inherit -r custom:costcenter tank &prompt.root; zfs get custom:costcenter tank NAME PROPERTY VALUE SOURCE tank custom:costcenter - - &prompt.root; zfs get all tank | grep custom:costcenter &prompt.root; Getting and Setting Share Properties Two commonly used and useful dataset properties are the NFS and SMB share options. Setting these defines if and how ZFS datasets may be shared on the network. At present, only setting sharing via NFS is supported on &os;. To get the current status of a share, enter: &prompt.root; zfs get sharenfs mypool/usr/home NAME PROPERTY VALUE SOURCE mypool/usr/home sharenfs on local &prompt.root; zfs get sharesmb mypool/usr/home NAME PROPERTY VALUE SOURCE mypool/usr/home sharesmb off local To enable sharing of a dataset, enter: &prompt.root; zfs set sharenfs=on mypool/usr/home It is also possible to set additional options for sharing datasets through NFS, such as -alldirs, -maproot, and -network. To set additional options to a dataset shared through NFS, enter: &prompt.root; zfs set sharenfs="-alldirs,-maproot=root,-network=192.168.1.0/24" mypool/usr/home Managing Snapshots Snapshots are one of the most powerful features of ZFS. A snapshot provides a read-only, point-in-time copy of the dataset. With Copy-On-Write (COW), snapshots can be created quickly by preserving the older version of the data on disk. If no snapshots exist, space is reclaimed for future use when data is rewritten or deleted. Snapshots preserve disk space by recording only the differences between the current dataset and a previous version. Snapshots are allowed only on whole datasets, not on individual files or directories. When a snapshot is created from a dataset, everything contained in it is duplicated. This includes the file system properties, files, directories, permissions, and so on. Snapshots use no additional space when they are first created, only consuming space as the blocks they reference are changed.
Recursive snapshots taken with -r create a snapshot with the same name on the dataset and all of its children, providing a consistent moment-in-time snapshot of all of the file systems. This can be important when an application has files on multiple datasets that are related or dependent upon each other. Without snapshots, a backup would have copies of the files from different points in time. Snapshots in ZFS provide a variety of features that even other file systems with snapshot functionality lack. A typical example of snapshot use is to have a quick way of backing up the current state of the file system when a risky action like a software installation or a system upgrade is performed. If the action fails, the snapshot can be rolled back and the system has the same state as when the snapshot was created. If the upgrade was successful, the snapshot can be deleted to free up space. Without snapshots, a failed upgrade often requires a restore from backup, which is tedious, time-consuming, and may require downtime during which the system cannot be used. Snapshots can be rolled back quickly, even while the system is running in normal operation, with little or no downtime. The time savings are enormous with multi-terabyte storage systems compared to the time required to copy the data from backup. Snapshots are not a replacement for a complete backup of a pool, but can be used as a quick and easy way to store a copy of the dataset at a specific point in time. Creating Snapshots Snapshots are created with zfs snapshot dataset@snapshotname. Adding -r creates a snapshot recursively, with the same name on all child datasets. Create a recursive snapshot of the entire pool: &prompt.root; zfs list -t all NAME USED AVAIL REFER MOUNTPOINT mypool 780M 93.2G 144K none mypool/ROOT 777M 93.2G 144K none mypool/ROOT/default 777M 93.2G 777M / mypool/tmp 176K 93.2G 176K /tmp mypool/usr 616K 93.2G 144K /usr mypool/usr/home 184K 93.2G 184K /usr/home mypool/usr/ports 144K 93.2G 144K /usr/ports mypool/usr/src 144K 93.2G 144K /usr/src mypool/var 1.29M 93.2G 616K /var mypool/var/crash 148K 93.2G 148K /var/crash mypool/var/log 178K 93.2G 178K /var/log mypool/var/mail 144K 93.2G 144K /var/mail mypool/var/newname 87.5K 93.2G 87.5K /var/newname mypool/var/newname@new_snapshot_name 0 - 87.5K - mypool/var/tmp 152K 93.2G 152K /var/tmp &prompt.root; zfs snapshot -r mypool@my_recursive_snapshot &prompt.root; zfs list -t snapshot NAME USED AVAIL REFER MOUNTPOINT mypool@my_recursive_snapshot 0 - 144K - mypool/ROOT@my_recursive_snapshot 0 - 144K - mypool/ROOT/default@my_recursive_snapshot 0 - 777M - mypool/tmp@my_recursive_snapshot 0 - 176K - mypool/usr@my_recursive_snapshot 0 - 144K - mypool/usr/home@my_recursive_snapshot 0 - 184K - mypool/usr/ports@my_recursive_snapshot 0 - 144K - mypool/usr/src@my_recursive_snapshot 0 - 144K - mypool/var@my_recursive_snapshot 0 - 616K - mypool/var/crash@my_recursive_snapshot 0 - 148K - mypool/var/log@my_recursive_snapshot 0 - 178K - mypool/var/mail@my_recursive_snapshot 0 - 144K - mypool/var/newname@new_snapshot_name 0 - 87.5K - mypool/var/newname@my_recursive_snapshot 0 - 87.5K - mypool/var/tmp@my_recursive_snapshot 0 - 152K - Snapshots are not shown by a normal zfs list operation. To list snapshots, -t snapshot is appended to zfs list. -t all displays both file systems and snapshots. Snapshots are not mounted directly, so no path is shown in the MOUNTPOINT column. There is no mention of available disk space in the AVAIL column, as snapshots cannot be written to after they are created.
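As an aside, whether snapshots appear in plain zfs list output is also governed by the pool-level listsnapshots property, which defaults to off. A minimal sketch, assuming the example pool mypool: &prompt.root; zpool set listsnapshots=on mypool After this, zfs list includes snapshots without -t snapshot, at the cost of much longer listings on pools with many snapshots.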
Compare the snapshot to the original dataset from which it was created: &prompt.root; zfs list -rt all mypool/usr/home NAME USED AVAIL REFER MOUNTPOINT mypool/usr/home 184K 93.2G 184K /usr/home mypool/usr/home@my_recursive_snapshot 0 - 184K - Displaying both the dataset and the snapshot together reveals how snapshots work in COW fashion. They save only the changes (delta) that were made and not the complete file system contents all over again. This means that snapshots take little space when few changes are made. Space usage can be made even more apparent by copying a file to the dataset, then making a second snapshot: &prompt.root; cp /etc/passwd /var/tmp &prompt.root; zfs snapshot mypool/var/tmp@after_cp &prompt.root; zfs list -rt all mypool/var/tmp NAME USED AVAIL REFER MOUNTPOINT mypool/var/tmp 206K 93.2G 118K /var/tmp mypool/var/tmp@my_recursive_snapshot 88K - 152K - mypool/var/tmp@after_cp 0 - 118K - The second snapshot contains only the changes to the dataset after the copy operation. This yields enormous space savings. Notice that the size of the snapshot mypool/var/tmp@my_recursive_snapshot also changed in the USED column to indicate the changes between itself and the snapshot taken afterwards. Comparing Snapshots ZFS provides a built-in command to compare the differences in content between two snapshots. This is helpful when many snapshots were taken and the user wants to see how the file system has changed over time. For example, zfs diff lets a user find the latest snapshot that still contains a file that was accidentally deleted. Doing this for the two snapshots that were created in the previous section yields this output: &prompt.root; zfs list -rt all mypool/var/tmp NAME USED AVAIL REFER MOUNTPOINT mypool/var/tmp 206K 93.2G 118K /var/tmp mypool/var/tmp@my_recursive_snapshot 88K - 152K - mypool/var/tmp@after_cp 0 - 118K - &prompt.root; zfs diff mypool/var/tmp@my_recursive_snapshot M /var/tmp/ + /var/tmp/passwd The command lists the changes between the specified snapshot (in this case mypool/var/tmp@my_recursive_snapshot) and the live file system. The first column shows the type of change: + The path or file was added. - The path or file was deleted. M The path or file was modified. R The path or file was renamed. Comparing the output with the table, it becomes clear that passwd was added after the snapshot mypool/var/tmp@my_recursive_snapshot was created. This also resulted in a modification to the parent directory mounted at /var/tmp. Comparing two snapshots is helpful when using the ZFS replication feature to transfer a dataset to a different host for backup purposes. Compare two snapshots by providing the full dataset name and snapshot name of both datasets: &prompt.root; cp /var/tmp/passwd /var/tmp/passwd.copy &prompt.root; zfs snapshot mypool/var/tmp@diff_snapshot &prompt.root; zfs diff mypool/var/tmp@my_recursive_snapshot mypool/var/tmp@diff_snapshot M /var/tmp/ + /var/tmp/passwd + /var/tmp/passwd.copy &prompt.root; zfs diff mypool/var/tmp@my_recursive_snapshot mypool/var/tmp@after_cp M /var/tmp/ + /var/tmp/passwd A backup administrator can compare two snapshots received from the sending host and determine the actual changes in the dataset. See the Replication section for more information. Snapshot Rollback When at least one snapshot is available, the dataset can be rolled back to it at any time. Most of the time this is the case when the current state of the dataset is no longer required and an older version is preferred.
Scenarios in which a local development test has gone wrong, a botched system update hampers the system's overall functionality, or accidentally deleted files or directories must be restored are all too common occurrences. Luckily, rolling back a snapshot is just as easy as typing zfs rollback snapshotname. How long the operation takes depends on how many changes are involved. During that time, the dataset always remains in a consistent state, much like a database that conforms to ACID principles performing a rollback. This happens while the dataset is live and accessible, without requiring downtime. Once the snapshot has been rolled back, the dataset has the same state as it had when the snapshot was originally taken. All other data in that dataset that was not part of the snapshot is discarded. Taking a snapshot of the current state of the dataset before rolling back to a previous one is a good idea when some data is required later. This way, the user can roll back and forth between snapshots without losing data that is still valuable. In the first example, a snapshot is rolled back because of a careless rm operation that removes more data than intended. &prompt.root; zfs list -rt all mypool/var/tmp NAME USED AVAIL REFER MOUNTPOINT mypool/var/tmp 262K 93.2G 120K /var/tmp mypool/var/tmp@my_recursive_snapshot 88K - 152K - mypool/var/tmp@after_cp 53.5K - 118K - mypool/var/tmp@diff_snapshot 0 - 120K - &prompt.root; ls /var/tmp passwd passwd.copy vi.recover &prompt.root; rm /var/tmp/passwd* &prompt.root; ls /var/tmp vi.recover At this point, the user realizes that too many files were deleted and wants them back. ZFS provides an easy way to get them back using rollbacks, but only when snapshots of important data are taken on a regular basis. To get the files back and start over from the last snapshot, issue the command: &prompt.root; zfs rollback mypool/var/tmp@diff_snapshot &prompt.root; ls /var/tmp passwd passwd.copy vi.recover The rollback operation restored the dataset to the state of the last snapshot. It is also possible to roll back to a snapshot that was taken much earlier and has other snapshots that were created after it. When trying to do this, ZFS will issue this warning: &prompt.root; zfs list -rt snapshot mypool/var/tmp NAME USED AVAIL REFER MOUNTPOINT mypool/var/tmp@my_recursive_snapshot 88K - 152K - mypool/var/tmp@after_cp 53.5K - 118K - mypool/var/tmp@diff_snapshot 0 - 120K - &prompt.root; zfs rollback mypool/var/tmp@my_recursive_snapshot cannot rollback to 'mypool/var/tmp@my_recursive_snapshot': more recent snapshots exist use '-r' to force deletion of the following snapshots: mypool/var/tmp@after_cp mypool/var/tmp@diff_snapshot This warning means that snapshots exist between the current state of the dataset and the snapshot to which the user wants to roll back. To complete the rollback, these snapshots must be deleted. ZFS cannot track all the changes between different states of the dataset, because snapshots are read-only. ZFS will not delete the affected snapshots unless the user specifies -r to indicate that this is the desired action.
If that is the intention, and the consequences of losing all intermediate snapshots are understood, the command can be issued: &prompt.root; zfs rollback -r mypool/var/tmp@my_recursive_snapshot &prompt.root; zfs list -rt snapshot mypool/var/tmp NAME USED AVAIL REFER MOUNTPOINT mypool/var/tmp@my_recursive_snapshot 8K - 152K - &prompt.root; ls /var/tmp vi.recover The output from zfs list -t snapshot confirms that the intermediate snapshots were removed as a result of zfs rollback -r. Restoring Individual Files from Snapshots Snapshots are mounted in a hidden directory under the parent dataset: .zfs/snapshot/snapshotname. By default, these directories will not be displayed even when a standard ls -a is issued. Although the directory is not displayed, it is there nevertheless and can be accessed like any normal directory. The property named snapdir controls whether these hidden directories show up in a directory listing. Setting the property to visible allows them to appear in the output of ls and other commands that deal with directory contents. &prompt.root; zfs get snapdir mypool/var/tmp NAME PROPERTY VALUE SOURCE mypool/var/tmp snapdir hidden default &prompt.root; ls -a /var/tmp . .. passwd vi.recover &prompt.root; zfs set snapdir=visible mypool/var/tmp &prompt.root; ls -a /var/tmp . .. .zfs passwd vi.recover Individual files can easily be restored to a previous state by copying them from the snapshot back to the parent dataset. The directory structure below .zfs/snapshot has a directory named exactly like the snapshots taken earlier to make it easier to identify them. In the next example, it is assumed that a file is to be restored from the hidden .zfs directory by copying it from the snapshot that contained the latest version of the file: &prompt.root; rm /var/tmp/passwd &prompt.root; ls -a /var/tmp . .. .zfs vi.recover &prompt.root; ls /var/tmp/.zfs/snapshot after_cp my_recursive_snapshot &prompt.root; ls /var/tmp/.zfs/snapshot/after_cp passwd vi.recover &prompt.root; cp /var/tmp/.zfs/snapshot/after_cp/passwd /var/tmp Note that even when the snapdir property is set to hidden, it is still possible to list the contents of .zfs/snapshot with ls. It is up to the administrator to decide whether these directories will be displayed. It is possible to display these for certain datasets and prevent it for others. Copying files or directories from this hidden .zfs/snapshot is simple enough. Trying it the other way around results in this error: &prompt.root; cp /etc/rc.conf /var/tmp/.zfs/snapshot/after_cp/ cp: /var/tmp/.zfs/snapshot/after_cp/rc.conf: Read-only file system The error reminds the user that snapshots are read-only and cannot be changed after creation. Files cannot be copied into or removed from snapshot directories because that would change the state of the dataset they represent. Snapshots consume space based on how much the parent file system has changed since the time of the snapshot. The written property of a snapshot tracks how much space is being used by the snapshot. Snapshots are destroyed and the space reclaimed with zfs destroy dataset@snapshot. Adding -r recursively removes all snapshots with the same name under the parent dataset. Adding -n -v to the command displays a list of the snapshots that would be deleted and an estimate of how much space would be reclaimed, without performing the actual destroy operation. Managing Clones A clone is a copy of a snapshot that is treated more like a regular dataset.
Unlike a snapshot, a clone is not read-only, is mounted, and can have its own properties. Once a clone has been created using zfs clone, the snapshot it was created from cannot be destroyed. The child/parent relationship between the clone and the snapshot can be reversed using zfs promote. After a clone has been promoted, the snapshot becomes a child of the clone, rather than of the original parent dataset. This will change how the space is accounted, but not actually change the amount of space consumed. The clone can be mounted at any point within the ZFS file system hierarchy, not just below the original location of the snapshot. To demonstrate the clone feature, this example dataset is used: &prompt.root; zfs list -rt all camino/home/joe NAME USED AVAIL REFER MOUNTPOINT camino/home/joe 108K 1.3G 87K /usr/home/joe camino/home/joe@plans 21K - 85.5K - camino/home/joe@backup 0K - 87K - A typical use for clones is to experiment with a specific dataset while keeping the snapshot around to fall back to in case something goes wrong. Since snapshots cannot be changed, a read/write clone of a snapshot is created. After the desired result is achieved in the clone, the clone can be promoted to a dataset and the old file system removed. This is not strictly necessary, as the clone and dataset can coexist without problems. &prompt.root; zfs clone camino/home/joe@backup camino/home/joenew &prompt.root; ls /usr/home/joe* /usr/home/joe: backup.txz plans.txt /usr/home/joenew: backup.txz plans.txt &prompt.root; df -h /usr/home Filesystem Size Used Avail Capacity Mounted on usr/home/joe 1.3G 31k 1.3G 0% /usr/home/joe usr/home/joenew 1.3G 31k 1.3G 0% /usr/home/joenew After a clone is created, it is an exact copy of the state the dataset was in when the snapshot was taken. The clone can now be changed independently from its originating dataset. The only connection between the two is the snapshot. ZFS records this connection in the property origin. Once the dependency between the snapshot and the clone has been removed by promoting the clone using zfs promote, the origin of the clone is removed, as it is now an independent dataset. This example demonstrates it: &prompt.root; zfs get origin camino/home/joenew NAME PROPERTY VALUE SOURCE camino/home/joenew origin camino/home/joe@backup - &prompt.root; zfs promote camino/home/joenew &prompt.root; zfs get origin camino/home/joenew NAME PROPERTY VALUE SOURCE camino/home/joenew origin - - After making some changes, such as copying loader.conf to the promoted clone, the old directory becomes obsolete. Instead, the promoted clone can replace it. This can be achieved by two consecutive commands: zfs destroy on the old dataset and zfs rename on the clone to give it the name of the old dataset (it could also get an entirely different name). &prompt.root; cp /boot/defaults/loader.conf /usr/home/joenew &prompt.root; zfs destroy -f camino/home/joe &prompt.root; zfs rename camino/home/joenew camino/home/joe &prompt.root; ls /usr/home/joe backup.txz loader.conf plans.txt &prompt.root; df -h /usr/home Filesystem Size Used Avail Capacity Mounted on usr/home/joe 1.3G 128k 1.3G 0% /usr/home/joe The promoted clone is now handled like an ordinary dataset. It contains all the data from the original snapshot plus the files that were added to it, like loader.conf. Clones can be used in different scenarios to provide useful features to ZFS users. For example, jails could be provided as snapshots containing different sets of installed applications.
Users can clone these snapshots and add their own applications as they see fit. Once they are satisfied with the changes, the clones can be promoted to full datasets and provided to end users to work with like they would with a real dataset. This saves time and administrative overhead when providing these jails. Replication Keeping data on a single pool in one location exposes it to risks like theft and natural or human disasters. Making regular backups of the entire pool is vital. ZFS provides a built-in serialization feature that can send a stream representation of the data to standard output. Using this technique, it is possible to not only store the data on another pool connected to the local system, but also to send it over a network to another system. Snapshots are the basis for this replication (see the section on ZFS snapshots). The commands used for replicating data are zfs send and zfs receive. These examples demonstrate ZFS replication with these two pools: &prompt.root; zpool list NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT backup 960M 77K 896M - - 0% 0% 1.00x ONLINE - mypool 984M 43.7M 940M - - 0% 4% 1.00x ONLINE - The pool named mypool is the primary pool where data is written to and read from on a regular basis. A second pool, backup, is used as a standby in case the primary pool becomes unavailable. Note that this fail-over is not done automatically by ZFS, but must be done manually by a system administrator when needed. A snapshot is used to provide a consistent version of the file system to be replicated. Once a snapshot of mypool has been created, it can be copied to the backup pool. Only snapshots can be replicated. Changes made since the most recent snapshot will not be included. &prompt.root; zfs snapshot mypool@backup1 &prompt.root; zfs list -t snapshot NAME USED AVAIL REFER MOUNTPOINT mypool@backup1 0 - 43.6M - Now that a snapshot exists, zfs send can be used to create a stream representing the contents of the snapshot. This stream can be stored as a file or received by another pool. The stream is written to standard output, but must be redirected to a file or pipe, or an error is produced: &prompt.root; zfs send mypool@backup1 Error: Stream can not be written to a terminal. You must redirect standard output. To back up a dataset with zfs send, redirect to a file located on the mounted backup pool. Ensure that the pool has enough free space to accommodate the size of the snapshot being sent, which means all of the data contained in the snapshot, not just the changes from the previous snapshot. &prompt.root; zfs send mypool@backup1 > /backup/backup1 &prompt.root; zpool list NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT backup 960M 63.7M 896M - - 0% 6% 1.00x ONLINE - mypool 984M 43.7M 940M - - 0% 4% 1.00x ONLINE - The zfs send command transferred all the data in the snapshot called backup1 to the pool named backup. Creating and sending these snapshots can be done automatically with a &man.cron.8; job. Instead of storing the backups as archive files, ZFS can receive them as a live file system, allowing the backed-up data to be accessed directly. To get to the actual data contained in those streams, zfs receive is used to transform the streams back into files and directories. The example below combines zfs send and zfs receive using a pipe to copy the data from one pool to another. The data can be used directly on the receiving pool after the transfer is complete. A dataset can only be replicated to an empty dataset.
&prompt.root; zfs snapshot mypool@replica1 &prompt.root; zfs send -v mypool@replica1 | zfs receive backup/mypool send from @ to mypool@replica1 estimated size is 50.1M total estimated size is 50.1M TIME SENT SNAPSHOT &prompt.root; zpool list NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT backup 960M 63.7M 896M - - 0% 6% 1.00x ONLINE - mypool 984M 43.7M 940M - - 0% 4% 1.00x ONLINE - Incremental Backups zfs send can also determine the difference between two snapshots and send only that difference. This saves disk space and transfer time. For example: &prompt.root; zfs snapshot mypool@replica2 &prompt.root; zfs list -t snapshot NAME USED AVAIL REFER MOUNTPOINT mypool@replica1 5.72M - 43.6M - mypool@replica2 0 - 44.1M - &prompt.root; zpool list NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT backup 960M 61.7M 898M - - 0% 6% 1.00x ONLINE - mypool 960M 50.2M 910M - - 0% 5% 1.00x ONLINE - A second snapshot called replica2 was created. This second snapshot contains only the changes made to the file system between the previous snapshot, replica1, and now. Using zfs send -i and indicating the pair of snapshots generates an incremental replica stream containing only the data that has changed. This can only succeed if the initial snapshot already exists on the receiving side. &prompt.root; zfs send -v -i mypool@replica1 mypool@replica2 | zfs receive backup/mypool send from @replica1 to mypool@replica2 estimated size is 5.02M total estimated size is 5.02M TIME SENT SNAPSHOT &prompt.root; zpool list NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT backup 960M 80.8M 879M - - 0% 8% 1.00x ONLINE - mypool 960M 50.2M 910M - - 0% 5% 1.00x ONLINE - &prompt.root; zfs list NAME USED AVAIL REFER MOUNTPOINT backup 55.4M 240G 152K /backup backup/mypool 55.3M 240G 55.2M /backup/mypool mypool 55.6M 11.6G 55.0M /mypool &prompt.root; zfs list -t snapshot NAME USED AVAIL REFER MOUNTPOINT backup/mypool@replica1 104K - 50.2M - backup/mypool@replica2 0 - 55.2M - mypool@replica1 29.9K - 50.0M - mypool@replica2 0 - 55.0M - The incremental stream was successfully transferred. Only the data that had changed was replicated, rather than the entirety of replica1. Only the differences were sent, which took much less time to transfer and saved disk space by not copying the complete pool each time. This is useful when having to rely on slow networks or when costs per transferred byte must be considered. A new file system, backup/mypool, is available with all of the files and data from the pool mypool. If -p is specified, the properties of the dataset will be copied, including compression settings, quotas, and mount points. When -R is specified, all child datasets of the indicated dataset will be copied, along with all of their properties. Sending and receiving can be automated so that regular backups are created on the second pool. Sending Encrypted Backups over <application>SSH</application> Sending streams over the network is a good way to keep a remote backup, but it does come with a drawback. Data sent over the network link is not encrypted, allowing anyone to intercept and transform the streams back into data without the knowledge of the sending user. This is undesirable, especially when sending the streams over the internet to a remote host. SSH can be used to securely encrypt data sent over a network connection. Since ZFS only requires the stream to be redirected from standard output, it is relatively easy to pipe it through SSH.
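In its simplest form, and assuming for illustration only that direct root logins were permitted on a remote host named backuphost holding a pool named backup (both names are placeholders), the pipe would look like this: &prompt.root; zfs send mypool@replica1 | ssh root@backuphost zfs receive backup/mypool The configuration described below avoids that root login entirely by delegating the needed permissions to a regular user.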
To keep the contents of the file system encrypted in transit and on the remote system, consider using PEFS. A few settings and security precautions must be completed first. Only the steps required for the zfs send operation are shown here. For more information on SSH, see the OpenSSH section of the &os; Handbook. This configuration is required: Passwordless SSH access between sending and receiving hosts using SSH keys Normally, the privileges of the root user are needed to send and receive streams. This requires logging in to the receiving system as root. However, logging in as root is disabled by default for security reasons. The ZFS Delegation system can be used to allow a non-root user on each system to perform the respective send and receive operations. On the sending system: &prompt.root; zfs allow -u someuser send,snapshot mypool To mount the pool, the unprivileged user must own the directory, and regular users must be allowed to mount file systems. On the receiving system: &prompt.root; sysctl vfs.usermount=1 vfs.usermount: 0 -> 1 &prompt.root; echo vfs.usermount=1 >> /etc/sysctl.conf &prompt.root; zfs create recvpool/backup &prompt.root; zfs allow -u someuser create,mount,receive recvpool/backup &prompt.root; chown someuser /recvpool/backup The unprivileged user now has the ability to receive and mount datasets, and the home dataset can be replicated to the remote system: &prompt.user; zfs snapshot -r mypool/home@monday &prompt.user; zfs send -R mypool/home@monday | ssh someuser@backuphost zfs recv -dvu recvpool/backup A recursive snapshot called monday is made of the file system dataset home that resides on the pool mypool. Then it is sent with zfs send -R to include the dataset, all child datasets, snapshots, clones, and settings in the stream. The output is piped to the waiting zfs receive on the remote host backuphost through SSH. Using a fully qualified domain name or IP address is recommended. The receiving machine writes the data to the backup dataset on the recvpool pool. Adding -d to zfs recv overwrites the name of the pool on the receiving side with the name of the snapshot. -u causes the file systems to not be mounted on the receiving side. When -v is included, more detail about the transfer is shown, including the elapsed time and the amount of data transferred. Dataset, User, and Group Quotas Dataset quotas are used to restrict the amount of space that can be consumed by a particular dataset. Reference Quotas work in very much the same way, but only count the space used by the dataset itself, excluding snapshots and child datasets. Similarly, user and group quotas can be used to prevent users or groups from using all of the space in the pool or dataset. The following examples assume that the users already exist in the system. Before adding a user to the system, make sure to create their home dataset first and set the mountpoint to /home/bob. Then, create the user and make the home directory point to the dataset's location. This will properly set owner and group permissions without shadowing any pre-existing home directory paths that might exist. To enforce a dataset quota of 10 GB for storage/home/bob: &prompt.root; zfs set quota=10G storage/home/bob To enforce a reference quota of 10 GB for storage/home/bob: &prompt.root; zfs set refquota=10G storage/home/bob To remove the quota for storage/home/bob: &prompt.root; zfs set quota=none storage/home/bob The general format is userquota@user=size, and the user's name must be in one of these formats: POSIX compatible name such as joe. POSIX numeric ID such as 789.
SID name such as joe.bloggs@example.com. SID numeric ID such as S-1-123-456-789. For example, to enforce a user quota of 50 GB for the user named joe: &prompt.root; zfs set userquota@joe=50G storage/home/bob To remove any quota: &prompt.root; zfs set userquota@joe=none storage/home/bob User quota properties are not displayed by zfs get all. Non-root users can only see their own quotas unless they have been granted the userquota privilege. Users with this privilege are able to view and set everyone's quota. The general format for setting a group quota is: groupquota@group=size. To set the quota for the group firstgroup to 50 GB, use: &prompt.root; zfs set groupquota@firstgroup=50G storage/home/bob To remove the quota for the group firstgroup, or to make sure that one is not set, instead use: &prompt.root; zfs set groupquota@firstgroup=none storage/home/bob As with the user quota property, non-root users can only see the quotas associated with the groups to which they belong. However, root or a user with the groupquota privilege can view and set all quotas for all groups. To display the amount of space used by each user on a file system or snapshot along with any quotas, use zfs userspace. For group information, use zfs groupspace. For more information about supported options or how to display only specific options, refer to &man.zfs.8;. Users with sufficient privileges, and root, can list the quota for storage/home/bob using: &prompt.root; zfs get quota storage/home/bob Reservations Reservations guarantee a minimum amount of space will always be available on a dataset. The reserved space will not be available to any other dataset. This feature can be especially useful to ensure that free space is available for an important dataset or log files. The general format of the reservation property is reservation=size, so to set a reservation of 10 GB on storage/home/bob, use: &prompt.root; zfs set reservation=10G storage/home/bob To clear any reservation: &prompt.root; zfs set reservation=none storage/home/bob The same principle can be applied to the refreservation property for setting a Reference Reservation, with the general format refreservation=size. This command shows any reservations or refreservations that exist on storage/home/bob: &prompt.root; zfs get reservation storage/home/bob &prompt.root; zfs get refreservation storage/home/bob Compression ZFS provides transparent compression. Compressing data at the block level as it is written not only saves space, but can also increase disk throughput. If data is compressed by 25% and the compressed data is written to the disk at the same rate as the uncompressed version, the result is an effective write speed of 125%. Compression can also be a great alternative to Deduplication because it does not require additional memory. ZFS offers several different compression algorithms, each with different trade-offs. With the introduction of LZ4 compression in ZFS v5000, it is possible to enable compression for the entire pool without the large performance trade-off of other algorithms. The biggest advantage to LZ4 is the early abort feature. If LZ4 does not achieve at least 12.5% compression in the first part of the data, the block is written uncompressed to avoid wasting CPU cycles trying to compress data that is either already compressed or uncompressible. For details about the different compression algorithms available in ZFS, see the Compression entry in the terminology section. The administrator can monitor the effectiveness of compression using a number of dataset properties.
&prompt.root; zfs get used,compressratio,compression,logicalused mypool/compressed_dataset NAME PROPERTY VALUE SOURCE mypool/compressed_dataset used 449G - mypool/compressed_dataset compressratio 1.11x - mypool/compressed_dataset compression lz4 local mypool/compressed_dataset logicalused 496G - The dataset is currently using 449 GB of space (the used property). Without compression, it would have taken 496 GB of space (the logicalused property). This results in the 1.11:1 compression ratio. Compression can have an unexpected side effect when combined with User Quotas. User quotas restrict how much space a user can consume on a dataset, but the measurements are based on how much space is used after compression. So if a user has a quota of 10 GB, and writes 10 GB of compressible data, they will still be able to store additional data. If they later update a file, say a database, with more or less compressible data, the amount of space available to them will change. This can result in the odd situation where a user did not increase the actual amount of data (the logicalused property), but the change in compression caused them to reach their quota limit. Compression can have a similar unexpected interaction with backups. Quotas are often used to limit how much data can be stored to ensure there is sufficient backup space available. However, since quotas do not consider compression, more data may be written than would fit with uncompressed backups. Zstandard Compression In OpenZFS 2.0, a new compression algorithm was added. Zstandard (Zstd) offers higher compression ratios than the default LZ4 while offering much greater speeds than the alternative, gzip. OpenZFS 2.0 is available starting with &os; 12.1-RELEASE via sysutils/openzfs, has been the default in &os; 13-CURRENT since September 2020, and will be the default in &os; 13.0-RELEASE. Zstd provides a large selection of compression levels, offering fine-grained control over performance versus compression ratio. One of the main advantages of Zstd is that the decompression speed is independent of the compression level. For data that is written once but read many times, Zstd allows the use of the highest compression levels without a read performance penalty. Even when data is updated frequently, there are often performance gains that come from enabling compression. One of the biggest advantages comes from the compressed ARC feature. ZFS's Adaptive Replacement Cache (ARC) caches the compressed version of the data in RAM, decompressing it each time it is needed. This allows the same amount of RAM to store more data and metadata, increasing the cache hit ratio. ZFS offers 19 levels of Zstd compression, each offering incrementally more space savings in exchange for slower compression. The default level is zstd-3 and offers greater compression than LZ4 without being significantly slower. Levels above 10 require significant amounts of memory to compress each block, so they are discouraged on systems with less than 16 GB of RAM. ZFS also implements a selection of the Zstd fast levels, which get correspondingly faster but offer lower compression ratios. ZFS supports zstd-fast-1 through zstd-fast-10, zstd-fast-20 through zstd-fast-100 in increments of 10, and finally zstd-fast-500 and zstd-fast-1000, which provide minimal compression but offer very high performance. If ZFS is not able to allocate the required memory to compress a block with Zstd, it will fall back to storing the block uncompressed.
This is unlikely to happen outside of the highest levels of Zstd on systems that are memory constrained. The sysctl kstat.zfs.misc.zstd.compress_alloc_fail counts how many times this has occurred since the ZFS module was loaded. Deduplication When enabled, deduplication uses the checksum of each block to detect duplicate blocks. When a new block is a duplicate of an existing block, ZFS writes an additional reference to the existing data instead of the whole duplicate block. Tremendous space savings are possible if the data contains many duplicated files or repeated information. Be warned: deduplication requires an extremely large amount of memory, and most of the space savings can be had without the extra cost by enabling compression instead. To activate deduplication, set the dedup property on the target pool: &prompt.root; zfs set dedup=on pool Only new data being written to the pool will be deduplicated. Data that has already been written to the pool will not be deduplicated merely by activating this option. A pool with a freshly activated deduplication property will look like this example: &prompt.root; zpool list NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT pool 2.84G 2.19M 2.83G - - 0% 0% 1.00x ONLINE - The DEDUP column shows the actual rate of deduplication for the pool. A value of 1.00x shows that data has not been deduplicated yet. In the next example, the ports tree is copied three times into different directories on the deduplicated pool created above. &prompt.root; for d in dir1 dir2 dir3; do > mkdir $d && cp -R /usr/ports $d & > done Redundant data is detected and deduplicated: &prompt.root; zpool list NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT pool 2.84G 20.9M 2.82G - - 0% 0% 3.00x ONLINE - The DEDUP column shows a factor of 3.00x. Multiple copies of the ports tree data were detected and deduplicated, using only a third of the space. The potential for space savings can be enormous, but comes at the cost of having enough memory to keep track of the deduplicated blocks. Deduplication is not always beneficial, especially when the data on a pool is not redundant. ZFS can show potential space savings by simulating deduplication on an existing pool: &prompt.root; zdb -S pool Simulated DDT histogram: bucket allocated referenced ______ ______________________________ ______________________________ refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE ------ ------ ----- ----- ----- ------ ----- ----- ----- 1 2.58M 289G 264G 264G 2.58M 289G 264G 264G 2 206K 12.6G 10.4G 10.4G 430K 26.4G 21.6G 21.6G 4 37.6K 692M 276M 276M 170K 3.04G 1.26G 1.26G 8 2.18K 45.2M 19.4M 19.4M 20.0K 425M 176M 176M 16 174 2.83M 1.20M 1.20M 3.33K 48.4M 20.4M 20.4M 32 40 2.17M 222K 222K 1.70K 97.2M 9.91M 9.91M 64 9 56K 10.5K 10.5K 865 4.96M 948K 948K 128 2 9.50K 2K 2K 419 2.11M 438K 438K 256 5 61.5K 12K 12K 1.90K 23.0M 4.47M 4.47M 1K 2 1K 1K 1K 2.98K 1.49M 1.49M 1.49M Total 2.82M 303G 275G 275G 3.20M 319G 287G 287G dedup = 1.05, compress = 1.11, copies = 1.00, dedup * compress / copies = 1.16 After zdb -S finishes analyzing the pool, it shows the space reduction ratio that would be achieved by activating deduplication. In this case, 1.16 is a very poor space saving ratio that is mostly provided by compression. Activating deduplication on this pool would not save any significant amount of space, and is not worth the amount of memory required to enable deduplication.
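As a rough worked estimate of that memory cost (assuming the commonly cited figure of roughly 320 bytes of RAM per unique block in the deduplication table), the 2.82M allocated blocks reported above would require about 2.82 million * 320 bytes, or roughly 900 MB of RAM, for the DDT alone. Actual requirements vary with block size and pool layout, so treat this only as a planning sketch.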
Using the formula ratio = dedup * compress / copies, system administrators can plan the storage allocation, deciding whether the workload will contain enough duplicate blocks to justify the memory requirements. If the data is reasonably compressible, the space savings may be very good. Enabling compression first is recommended, and compression can also provide greatly increased performance. Only enable deduplication in cases where the additional savings will be considerable and there is sufficient memory for the DDT. <acronym>ZFS</acronym> and Jails zfs jail and the corresponding jailed property are used to delegate a ZFS dataset to a Jail. zfs jail jailid attaches a dataset to the specified jail, and zfs unjail detaches it. For the dataset to be controlled from within a jail, the jailed property must be set. Once a dataset is jailed, it can no longer be mounted on the host because it may have mount points that would compromise the security of the host. Delegated Administration A comprehensive permission delegation system allows unprivileged users to perform ZFS administration functions. For example, if each user's home directory is a dataset, users can be given permission to create and destroy snapshots of their home directories. A backup user can be given permission to use replication features. A usage statistics script can be allowed to run with access only to the space utilization data for all users. It is even possible to delegate the ability to delegate permissions. Permission delegation is possible for each subcommand and most properties. Delegating Dataset Creation zfs allow someuser create mydataset gives the specified user permission to create child datasets under the selected parent dataset. There is a caveat: creating a new dataset involves mounting it. That requires setting the &os; vfs.usermount &man.sysctl.8; to 1 to allow non-root users to mount a file system. There is another restriction aimed at preventing abuse: non-root users must own the mountpoint where the file system is to be mounted. Delegating Permission Delegation zfs allow someuser allow mydataset gives the specified user the ability to assign any permission they have on the target dataset, or its children, to other users. If a user has the snapshot permission and the allow permission, that user can then grant the snapshot permission to other users. Advanced Topics Tuning There are a number of tunables that can be adjusted to make ZFS perform best for different workloads. vfs.zfs.arc_max - Maximum size of the ARC. The default is all RAM less 1 GB, or 5/8 of all RAM, whichever is more. However, a lower value should be used if the system will be running any other daemons or processes that may require memory. This value can be adjusted at runtime with &man.sysctl.8; and can be set in /boot/loader.conf or /etc/sysctl.conf. vfs.zfs.arc_meta_limit - Limit the portion of the ARC that can be used to store metadata. The default is one fourth of vfs.zfs.arc_max. Increasing this value will improve performance if the workload involves operations on a large number of files and directories, or frequent metadata operations, at the cost of less file data fitting in the ARC. This value can be adjusted at runtime with &man.sysctl.8; and can be set in /boot/loader.conf or /etc/sysctl.conf. vfs.zfs.arc_min - Minimum size of the ARC. The default is one half of vfs.zfs.arc_meta_limit. Adjust this value to prevent other applications from pressuring out the entire ARC.
This value can be adjusted at runtime with &man.sysctl.8; and can be set in /boot/loader.conf or /etc/sysctl.conf. vfs.zfs.vdev.cache.size - A preallocated amount of memory reserved as a cache for each device in the pool. The total amount of memory used will be this value multiplied by the number of devices. This value can only be adjusted at boot time, and is set in /boot/loader.conf. vfs.zfs.min_auto_ashift - Minimum ashift (sector size) that will be used automatically at pool creation time. The value is a power of two. The default value of 9 represents 2^9 = 512, a sector size of 512 bytes. To avoid write amplification and get the best performance, set this value to the largest sector size used by a device in the pool. Many drives have 4 KB sectors. Using the default ashift of 9 with these drives results in write amplification. Data that could be contained in a single 4 KB write must instead be written in eight 512-byte writes. ZFS tries to read the native sector size from all devices when creating a pool, but many drives with 4 KB sectors report that their sectors are 512 bytes for compatibility. Setting vfs.zfs.min_auto_ashift to 12 (2^12 = 4096) before creating a pool forces ZFS to use 4 KB blocks for best performance on these drives. Forcing 4 KB blocks is also useful on pools where disk upgrades are planned. Future disks are likely to use 4 KB sectors, and ashift values cannot be changed after a pool is created. In some specific cases, the smaller 512-byte block size might be preferable. When used with 512-byte disks for databases, or as storage for virtual machines, less data is transferred during small random reads. This can provide better performance, especially when using a smaller ZFS record size. vfs.zfs.prefetch_disable - Disable prefetch. A value of 0 is enabled and 1 is disabled. The default is 0, unless the system has less than 4 GB of RAM. Prefetch works by reading larger blocks than were requested into the ARC in hopes that the data will be needed soon. If the workload has a large number of random reads, disabling prefetch may actually improve performance by reducing unnecessary reads. This value can be adjusted at any time with &man.sysctl.8;. vfs.zfs.vdev.trim_on_init - Control whether new devices added to the pool have the TRIM command run on them. This ensures the best performance and longevity for SSDs, but takes extra time. If the device has already been secure erased, disabling this setting will make the addition of the new device faster. This value can be adjusted at any time with &man.sysctl.8;. vfs.zfs.vdev.max_pending - Limit the number of pending I/O requests per device. A higher value will keep the device command queue full and may give higher throughput. A lower value will reduce latency. This value can be adjusted at any time with &man.sysctl.8;. vfs.zfs.top_maxinflight - Maximum number of outstanding I/Os per top-level vdev. Limits the depth of the command queue to prevent high latency. The limit is per top-level vdev, meaning the limit applies to each mirror, RAID-Z, or other vdev independently. This value can be adjusted at any time with &man.sysctl.8;. vfs.zfs.l2arc_write_max - Limit the amount of data written to the L2ARC per second. This tunable is designed to extend the longevity of SSDs by limiting the amount of data written to the device. This value can be adjusted at any time with &man.sysctl.8;.
vfs.zfs.l2arc_write_boost - The value of this tunable is added to vfs.zfs.l2arc_write_max and increases the write speed to the SSD until the first block is evicted from the L2ARC. This Turbo Warmup Phase is designed to reduce the performance loss from an empty L2ARC after a reboot. This value can be adjusted at any time with &man.sysctl.8;. vfs.zfs.scrub_delay - Number of ticks to delay between each I/O during a scrub. To ensure that a scrub does not interfere with the normal operation of the pool, if any other I/O is happening the scrub will delay between each command. This value controls the limit on the total IOPS (I/Os Per Second) generated by the scrub. The granularity of the setting is determined by the value of kern.hz, which defaults to 1000 ticks per second. This setting may be changed, resulting in a different effective IOPS limit. The default value is 4, resulting in a limit of: 1000 ticks/sec / 4 = 250 IOPS. Using a value of 20 would give a limit of: 1000 ticks/sec / 20 = 50 IOPS. The speed of scrub is only limited when there has been recent activity on the pool, as determined by vfs.zfs.scan_idle. This value can be adjusted at any time with &man.sysctl.8;. vfs.zfs.resilver_delay - Number of milliseconds of delay inserted between each I/O during a resilver. To ensure that a resilver does not interfere with the normal operation of the pool, if any other I/O is happening the resilver will delay between each command. This value controls the limit of total IOPS (I/Os Per Second) generated by the resilver. The granularity of the setting is determined by the value of kern.hz, which defaults to 1000 ticks per second. This setting may be changed, resulting in a different effective IOPS limit. The default value is 2, resulting in a limit of: 1000 ticks/sec / 2 = 500 IOPS. Returning the pool to an Online state may be more important if another device failing could Fault the pool, causing data loss. A value of 0 will give the resilver operation the same priority as other operations, speeding the healing process. The speed of resilver is only limited when there has been other recent activity on the pool, as determined by vfs.zfs.scan_idle. This value can be adjusted at any time with &man.sysctl.8;. vfs.zfs.scan_idle - Number of milliseconds since the last operation before the pool is considered idle. When the pool is idle, the rate limiting for scrub and resilver is disabled. This value can be adjusted at any time with &man.sysctl.8;. vfs.zfs.txg.timeout - Maximum number of seconds between transaction groups. The current transaction group will be written to the pool and a fresh transaction group started if this amount of time has elapsed since the previous transaction group. A transaction group may be triggered earlier if enough data is written. The default value is 5 seconds. A larger value may improve read performance by delaying asynchronous writes, but this may cause uneven performance when the transaction group is written. This value can be adjusted at any time with &man.sysctl.8;. <acronym>ZFS</acronym> on i386 Some of the features provided by ZFS are memory intensive, and may require tuning for maximum efficiency on systems with limited RAM. Memory As a bare minimum, the total system memory should be at least one gigabyte. The amount of recommended RAM depends upon the size of the pool and which ZFS features are used. A general rule of thumb is 1 GB of RAM for every 1 TB of storage.
If the deduplication feature is used, a general rule of thumb is 5 GB of RAM per TB of storage to be deduplicated. While some users successfully use ZFS with less RAM, systems under heavy load may panic due to memory exhaustion. Further tuning may be required for systems with less than the recommended amount of RAM. Kernel Configuration Due to the address space limitations of the &i386; platform, ZFS users on the &i386; architecture must add this option to a custom kernel configuration file, rebuild the kernel, and reboot: options KVA_PAGES=512 This expands the kernel address space, allowing the vm.kvm_size tunable to be pushed beyond the currently imposed limit of 1 GB, or the limit of 2 GB for PAE. To find the most suitable value for this option, divide the desired address space in megabytes by four. In this example, it is 512 for 2 GB. Loader Tunables The kmem address space can be increased on all &os; architectures. On a test system with 1 GB of physical memory, success was achieved with these options added to /boot/loader.conf, and the system restarted: vm.kmem_size="330M" vm.kmem_size_max="330M" vfs.zfs.arc_max="40M" vfs.zfs.vdev.cache.size="5M" For a more detailed list of recommendations for ZFS-related tuning, see the resources listed in the next section. Additional Resources OpenZFS FreeBSD Wiki - ZFS Tuning Oracle Solaris ZFS Administration Guide Calomel Blog - ZFS Raidz Performance, Capacity and Integrity <acronym>ZFS</acronym> Features and Terminology ZFS is a fundamentally different file system because it is more than just a file system. ZFS combines the roles of file system and volume manager, enabling additional storage devices to be added to a live system and having the new space available on all of the existing file systems in that pool immediately. By combining the traditionally separate roles, ZFS is able to overcome previous limitations that prevented RAID groups from being able to grow. Each top level device in a pool is called a vdev, which can be a simple disk or a RAID transformation such as a mirror or RAID-Z array. ZFS file systems (called datasets) each have access to the combined free space of the entire pool. As blocks are allocated from the pool, the space available to each file system decreases. This approach avoids the common pitfall with extensive partitioning where free space becomes fragmented across the partitions. pool A storage pool is the most basic building block of ZFS. A pool is made up of one or more vdevs, the underlying devices that store the data. A pool is then used to create one or more file systems (datasets) or block devices (volumes). These datasets and volumes share the pool of remaining free space. Each pool is uniquely identified by a name and a GUID. The features available are determined by the ZFS version number on the pool. vdev Types A pool is made up of one or more vdevs, which themselves can be a single disk or a group of disks, in the case of a RAID transform. When multiple vdevs are used, ZFS spreads data across the vdevs to increase performance and maximize usable space. Disk - The most basic type of vdev is a standard block device. This can be an entire disk (such as /dev/ada0 or /dev/da0) or a partition (/dev/ada0p3). On &os;, there is no performance penalty for using a partition rather than the entire disk. This differs from recommendations made by the Solaris documentation. Using an entire disk as part of a bootable pool is strongly discouraged, as this may render the pool unbootable. Likewise, you should not use an entire disk as part of a mirror or RAID-Z vdev.
This is because it is impossible to reliably determine the size of an unpartitioned disk at boot time and because there is no place to put boot code. File - In addition to disks, ZFS pools can be backed by regular files; this is especially useful for testing and experimentation. Use the full path to the file as the device path in zpool create. All vdevs must be at least 128 MB in size. Mirror - When creating a mirror, specify the mirror keyword followed by the list of member devices for the mirror. A mirror consists of two or more devices; all data will be written to all member devices. A mirror vdev will only hold as much data as its smallest member. A mirror vdev can withstand the failure of all but one of its members without losing any data. A regular single disk vdev can be upgraded to a mirror vdev at any time with zpool attach. RAID-Z - ZFS implements RAID-Z, a variation on standard RAID-5 that offers better distribution of parity and eliminates the RAID-5 write hole in which the data and parity information become inconsistent after an unexpected restart. ZFS supports three levels of RAID-Z which provide varying levels of redundancy in exchange for decreasing levels of usable storage. The types are named RAID-Z1 through RAID-Z3 based on the number of parity devices in the array and the number of disks which can fail while the pool remains operational. In a RAID-Z1 configuration with four disks, each 1 TB, usable storage is 3 TB and the pool will still be able to operate in degraded mode with one faulted disk. If an additional disk goes offline before the faulted disk is replaced and resilvered, all data in the pool can be lost. In a RAID-Z3 configuration with eight disks of 1 TB, the volume will provide 5 TB of usable space and still be able to operate with three faulted disks. &sun; recommends no more than nine disks in a single vdev. If the configuration has more disks, it is recommended to divide them into separate vdevs and the pool data will be striped across them. A configuration of two RAID-Z2 vdevs consisting of 8 disks each would create something similar to a RAID-60 array. A RAID-Z group's storage capacity is approximately the size of the smallest disk multiplied by the number of non-parity disks. Four 1 TB disks in RAID-Z1 have an effective size of approximately 3 TB, and an array of eight 1 TB disks in RAID-Z3 will yield 5 TB of usable space. Spare - ZFS has a special pseudo-vdev type for keeping track of available hot spares. Note that installed hot spares are not deployed automatically; they must manually be configured to replace the failed device using zpool replace. Log - ZFS Log Devices, also known as ZFS Intent Log (ZIL), move the intent log from the regular pool devices to a dedicated device, typically an SSD. Having a dedicated log device can significantly improve the performance of applications with a high volume of synchronous writes, especially databases. Log devices can be mirrored, but RAID-Z is not supported. If multiple log devices are used, writes will be load balanced across them. Cache - Adding a cache vdev to a pool will add the storage of the cache to the L2ARC. Cache devices cannot be mirrored. Since a cache device only stores additional copies of existing data, there is no risk of data loss. Transaction Group (TXG) Transaction Groups are the way changed blocks are grouped together and eventually written to the pool. Transaction groups are the atomic unit that ZFS uses to assert consistency.
Each transaction group is assigned a unique 64-bit consecutive identifier. There can be up to three active transaction groups at a time, one in each of these three states: Open - When a new transaction group is created, it is in the open state and accepts new writes. There is always a transaction group in the open state; however, the transaction group may refuse new writes if it has reached a limit. Once the open transaction group has reached a limit, or the vfs.zfs.txg.timeout has been reached, the transaction group advances to the next state. Quiescing - A short state that allows any pending operations to finish while not blocking the creation of a new open transaction group. Once all of the transactions in the group have completed, the transaction group advances to the final state. Syncing - All of the data in the transaction group is written to stable storage. This process will in turn modify other data, such as metadata and space maps, that will also need to be written to stable storage. The process of syncing involves multiple passes. The first pass writes all of the changed data blocks and is the biggest; it is followed by the metadata, which may take multiple passes to complete. Since allocating space for the data blocks generates new metadata, the syncing state cannot finish until a pass completes that does not allocate any additional space. The syncing state is also where synctasks are completed. Synctasks are administrative operations, such as creating or destroying snapshots and datasets, that modify the uberblock. Once the syncing state is complete, the transaction group in the quiescing state is advanced to the syncing state. All administrative functions, such as snapshot, are written as part of the transaction group. When a synctask is created, it is added to the currently open transaction group, and that group is advanced as quickly as possible to the syncing state to reduce the latency of administrative commands. Adaptive Replacement Cache (ARC) ZFS uses an Adaptive Replacement Cache (ARC), rather than a more traditional Least Recently Used (LRU) cache. An LRU cache is a simple list of items in the cache, sorted by when each object was most recently used. New items are added to the top of the list. When the cache is full, items from the bottom of the list are evicted to make room for more active objects. An ARC consists of four lists: the Most Recently Used (MRU) and Most Frequently Used (MFU) objects, plus a ghost list for each. These ghost lists track recently evicted objects to prevent them from being added back to the cache. This increases the cache hit ratio by avoiding objects that have a history of only being used occasionally. Another advantage of using both an MRU and MFU is that scanning an entire file system would normally evict all data from an MRU or LRU cache in favor of this freshly accessed content. With ZFS, there is also an MFU that only tracks the most frequently used objects, and the cache of the most commonly accessed blocks remains. L2ARC L2ARC is the second level of the ZFS caching system. The primary ARC is stored in RAM. Since the amount of available RAM is often limited, ZFS can also use cache vdevs. Solid State Disks (SSDs) are often used as these cache devices due to their higher speed and lower latency compared to traditional spinning disks. L2ARC is entirely optional, but having one will significantly increase read speeds for files that are cached on the SSD instead of having to be read from the regular disks.
L2ARC L2ARC is the second level of the ZFS caching system. The primary ARC is stored in RAM. Since the amount of available RAM is often limited, ZFS can also use cache vdevs. Solid State Disks (SSDs) are often used as these cache devices due to their higher speed and lower latency compared to traditional spinning disks. L2ARC is entirely optional, but having one will significantly increase read speeds for files that are cached on the SSD instead of having to be read from the regular disks. L2ARC can also speed up deduplication because a DDT that does not fit in RAM but does fit in the L2ARC will be much faster than a DDT that must be read from disk. The rate at which data is added to the cache devices is limited to prevent prematurely wearing out SSDs with too many writes. Until the cache is full (the first block has been evicted to make room), writing to the L2ARC is limited to the sum of the write limit and the boost limit, and afterwards limited to the write limit. A pair of &man.sysctl.8; values control these rate limits. vfs.zfs.l2arc_write_max controls how many bytes are written to the cache per second, while vfs.zfs.l2arc_write_boost adds to this limit during the Turbo Warmup Phase (Write Boost). ZIL ZIL accelerates synchronous transactions by using storage devices like SSDs that are faster than those used in the main storage pool. When an application requests a synchronous write (a guarantee that the data has been safely stored to disk rather than merely cached to be written later), the data is written to the faster ZIL storage, then later flushed out to the regular disks. This greatly reduces latency and improves performance. Only synchronous workloads like databases will benefit from a ZIL. Regular asynchronous writes such as copying files will not use the ZIL at all. Copy-On-Write Unlike a traditional file system, when data is overwritten on ZFS, the new data is written to a different block rather than overwriting the old data in place. Only when this write is complete is the metadata then updated to point to the new location. In the event of a shorn write (a system crash or power loss in the middle of writing a file), the entire original contents of the file are still available and the incomplete write is discarded. This also means that ZFS does not require a &man.fsck.8; after an unexpected shutdown. Dataset Dataset is the generic term for a ZFS file system, volume, snapshot or clone. Each dataset has a unique name in the format poolname/path@snapshot. The root of the pool is technically a dataset as well. Child datasets are named hierarchically like directories. For example, mypool/home, the home dataset, is a child of mypool and inherits properties from it. This can be expanded further by creating mypool/home/user. This grandchild dataset will inherit properties from the parent and grandparent. Properties on a child can be set to override the defaults inherited from the parents and grandparents. Administration of datasets and their children can be delegated. File system A ZFS dataset is most often used as a file system. Like most other file systems, a ZFS file system is mounted somewhere in the system's directory hierarchy and contains files and directories of its own with permissions, flags, and other metadata. Volume In addition to regular file system datasets, ZFS can also create volumes, which are block devices. Volumes have many of the same features, including copy-on-write, snapshots, clones, and checksumming. Volumes can be useful for running other file system formats on top of ZFS, such as UFS virtualization, or exporting iSCSI extents.
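A brief sketch of the dataset hierarchy and volumes just described; the pool and dataset names are hypothetical:

&prompt.root; zfs create mypool/home          # child dataset, inherits properties from mypool
&prompt.root; zfs create mypool/home/user     # grandchild dataset
&prompt.root; zfs create -V 4G mypool/vol0    # 4 GB volume, exposed as /dev/zvol/mypool/vol0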
Snapshot The copy-on-write (COW) design of ZFS allows for nearly instantaneous, consistent snapshots with arbitrary names. After taking a snapshot of a dataset, or a recursive snapshot of a parent dataset that will include all child datasets, new data is written to new blocks, but the old blocks are not reclaimed as free space. The snapshot contains the original version of the file system, and the live file system contains any changes made since the snapshot was taken. No additional space is used. As new data is written to the live file system, new blocks are allocated to store this data. The apparent size of the snapshot will grow as the blocks are no longer used in the live file system, but only in the snapshot. These snapshots can be mounted read only to allow for the recovery of previous versions of files. It is also possible to roll back a live file system to a specific snapshot, undoing any changes that took place after the snapshot was taken. Each block in the pool has a reference counter which keeps track of how many snapshots, clones, datasets, or volumes make use of that block. As files and snapshots are deleted, the reference count is decremented. When a block is no longer referenced, it is reclaimed as free space. Snapshots can also be marked with a hold. When a snapshot is held, any attempt to destroy it will return an EBUSY error. Each snapshot can have multiple holds, each with a unique name. The release command removes the hold so the snapshot can be deleted. Snapshots can be taken on volumes, but they can only be cloned or rolled back, not mounted independently. Clone Snapshots can also be cloned. A clone is a writable version of a snapshot, allowing the file system to be forked as a new dataset. As with a snapshot, a clone initially consumes no additional space. As new data is written to a clone and new blocks are allocated, the apparent size of the clone grows. When blocks are overwritten in the cloned file system or volume, the reference count on the previous block is decremented. The snapshot upon which a clone is based cannot be deleted because the clone depends on it. The snapshot is the parent, and the clone is the child. Clones can be promoted, reversing this dependency and making the clone the parent and the previous parent the child. This operation requires no additional space. Since the amount of space used by the parent and child is reversed, existing quotas and reservations might be affected. Checksum Every block that is allocated is also checksummed. The checksum algorithm used is a per-dataset property, see zfs set. The checksum of each block is transparently validated as it is read, allowing ZFS to detect silent corruption. If the data that is read does not match the expected checksum, ZFS will attempt to recover the data from any available redundancy, like mirrors or RAID-Z. Validation of all checksums can be triggered with scrub. Checksum algorithms include: fletcher2 fletcher4 sha256 The fletcher algorithms are faster, but sha256 is a strong cryptographic hash and has a much lower chance of collisions at the cost of some performance. Checksums can be disabled, but it is not recommended. Compression Each dataset has a compression property, which defaults to off. This property can be set to one of a number of compression algorithms. This will cause all new data that is written to the dataset to be compressed. Beyond a reduction in space used, read and write throughput often increases because fewer blocks are read or written. LZ4 - Added in ZFS pool version 5000 (feature flags), LZ4 is now the recommended compression algorithm. LZ4 compresses approximately 50% faster than LZJB when operating on compressible data, and is over three times faster when operating on incompressible data.
LZ4 also decompresses approximately 80% faster than LZJB. On modern CPUs, LZ4 can often compress at over 500 MB/s, and decompress at over 1.5 GB/s (per single CPU core). LZJB - The default compression algorithm. Created by Jeff Bonwick (one of the original creators of ZFS). LZJB offers good compression with less CPU overhead compared to GZIP. In the future, the default compression algorithm will likely change to LZ4. GZIP - A popular stream compression algorithm available in ZFS. One of the main advantages of using GZIP is its configurable level of compression. When setting the compression property, the administrator can choose the level of compression, ranging from gzip-1, the lowest level of compression, to gzip-9, the highest level of compression. This gives the administrator control over how much CPU time to trade for saved disk space. ZLE - Zero Length Encoding is a special compression algorithm that only compresses continuous runs of zeros. This compression algorithm is only useful when the dataset contains large blocks of zeros. Copies When set to a value greater than 1, the copies property instructs ZFS to maintain multiple copies of each block in the File System or Volume. Setting this property on important datasets provides additional redundancy from which to recover a block that does not match its checksum. In pools without redundancy, the copies feature is the only form of redundancy. The copies feature can recover from a single bad sector or other forms of minor corruption, but it does not protect the pool from the loss of an entire disk. Deduplication Checksums make it possible to detect duplicate blocks of data as they are written. With deduplication, the reference count of an existing, identical block is increased, saving storage space. To detect duplicate blocks, a deduplication table (DDT) is kept in memory. The table contains a list of unique checksums, the location of those blocks, and a reference count. When new data is written, the checksum is calculated and compared to the list. If a match is found, the existing block is used. The SHA256 checksum algorithm is used with deduplication to provide a secure cryptographic hash. Deduplication is tunable. If dedup is on, then a matching checksum is assumed to mean that the data is identical. If dedup is set to verify, then the data in the two blocks will be checked byte-for-byte to ensure it is actually identical. If the data is not identical, the hash collision will be noted and the two blocks will be stored separately. As the DDT must store the hash of each unique block, it consumes a very large amount of memory. A general rule of thumb is 5-6 GB of RAM per 1 TB of deduplicated data. In situations where it is not practical to have enough RAM to keep the entire DDT in memory, performance will suffer greatly as the DDT must be read from disk before each new block is written. Deduplication can use L2ARC to store the DDT, providing a middle ground between fast system memory and slower disks. Consider using compression instead, which often provides nearly as much space savings without the additional memory requirement.
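The snapshot, clone, and per-dataset property mechanisms described above map onto a handful of commands; the sketch below uses hypothetical dataset names and a hypothetical hold tag, keepme:

&prompt.root; zfs snapshot mypool/data@before-upgrade
&prompt.root; zfs hold keepme mypool/data@before-upgrade    # destroying the snapshot now returns EBUSY
&prompt.root; zfs release keepme mypool/data@before-upgrade
&prompt.root; zfs rollback mypool/data@before-upgrade       # undo changes made since the snapshot
&prompt.root; zfs clone mypool/data@before-upgrade mypool/fork
&prompt.root; zfs promote mypool/fork                       # reverse the parent/child dependency
&prompt.root; zfs set checksum=sha256 mypool/data
&prompt.root; zfs set compression=lz4 mypool/data
&prompt.root; zfs set copies=2 mypool/data
&prompt.root; zfs set dedup=verify mypool/data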
Scrub Instead of a consistency check like &man.fsck.8;, ZFS has scrub. scrub reads all data blocks stored on the pool and verifies their checksums against the known good checksums stored in the metadata. A periodic check of all the data stored on the pool ensures the recovery of any corrupted blocks before they are needed. A scrub is not required after an unclean shutdown, but is recommended at least once every three months. The checksum of each block is verified as blocks are read during normal use, but a scrub makes certain that even infrequently used blocks are checked for silent corruption. Data security is improved, especially in archival storage situations. The relative priority of scrub can be adjusted with vfs.zfs.scrub_delay to prevent the scrub from degrading the performance of other workloads on the pool. Dataset Quota ZFS provides very fast and accurate dataset, user, and group space accounting in addition to quotas and space reservations. This gives the administrator fine-grained control over how space is allocated and allows space to be reserved for critical file systems. ZFS supports different types of quotas: the dataset quota, the reference quota (refquota), the user quota, and the group quota. Quotas limit the amount of space that a dataset and all of its descendants, including snapshots of the dataset, child datasets, and the snapshots of those datasets, can consume. Quotas cannot be set on volumes, as the volsize property acts as an implicit quota. Reference Quota A reference quota limits the amount of space a dataset can consume by enforcing a hard limit. However, this hard limit includes only space that the dataset references and does not include space used by descendants, such as file systems or snapshots. User Quota User quotas are useful to limit the amount of space that can be used by the specified user. Group Quota The group quota limits the amount of space that a specified group can consume. Dataset Reservation The reservation property makes it possible to guarantee a minimum amount of space for a specific dataset and its descendants. If a 10 GB reservation is set on storage/home/bob, and another dataset tries to use all of the free space, at least 10 GB of space is reserved for this dataset. If a snapshot is taken of storage/home/bob, the space used by that snapshot is counted against the reservation. The refreservation property works in a similar way, but it excludes descendants like snapshots. Reservations of any sort are useful in many situations, such as planning and testing the suitability of disk space allocation in a new system, or ensuring that enough space is available on file systems for audio logs or system recovery procedures and files. Reference Reservation The refreservation property makes it possible to guarantee a minimum amount of space for the use of a specific dataset excluding its descendants. This means that if a 10 GB reservation is set on storage/home/bob, and another dataset tries to use all of the free space, at least 10 GB of space is reserved for this dataset. In contrast to a regular reservation, space used by snapshots and descendant datasets is not counted against the reservation. For example, if a snapshot is taken of storage/home/bob, enough disk space must exist outside of the refreservation amount for the operation to succeed. Descendants of the main dataset are not counted in the refreservation amount and so do not encroach on the space set aside. Resilver When a disk fails and is replaced, the new disk must be filled with the data that was lost. The process of using the parity information distributed across the remaining drives to calculate and write the missing data to the new drive is called resilvering.
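A sketch of starting a scrub and of the quota and reservation properties described above; the dataset names follow the hypothetical storage/home/bob example:

&prompt.root; zpool scrub storage                        # verify all checksums on the pool
&prompt.root; zfs set quota=10G storage/home/bob         # dataset plus all of its descendants
&prompt.root; zfs set refquota=10G storage/home/bob      # referenced space only, excluding descendants
&prompt.root; zfs set userquota@bob=50G storage/home
&prompt.root; zfs set groupquota@staff=100G storage/home
&prompt.root; zfs set reservation=10G storage/home/bob
&prompt.root; zfs set refreservation=10G storage/home/bob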
Online A pool or vdev in the Online state has all of its member devices connected and fully operational. Individual devices in the Online state are functioning normally. Offline Individual devices can be put in an Offline state by the administrator if there is sufficient redundancy to avoid putting the pool or vdev into a Faulted state. An administrator may choose to offline a disk in preparation for replacing it, or to make it easier to identify. Degraded A pool or vdev in the Degraded state has one or more disks that have been disconnected or have failed. The pool is still usable, but if additional devices fail, the pool could become unrecoverable. Reconnecting the missing devices or replacing the failed disks will return the pool to an Online state after the reconnected or new device has completed the Resilver process. Faulted A pool or vdev in the Faulted state is no longer operational. The data on it can no longer be accessed. A pool or vdev enters the Faulted state when the number of missing or failed devices exceeds the level of redundancy in the vdev. If missing devices can be reconnected, the pool will return to an Online state. If there is insufficient redundancy to compensate for the number of failed disks, then the contents of the pool are lost and must be restored from backups.
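These state transitions are driven by a few zpool commands; a minimal sketch with hypothetical pool and device names:

&prompt.root; zpool offline mypool ada2       # administratively offline a disk
&prompt.root; zpool online mypool ada2        # bring it back online
&prompt.root; zpool replace mypool ada3 ada7  # replace a failed disk; resilvering begins
&prompt.root; zpool status mypool             # report the state of the pool and its devices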