diff --git a/en_US.ISO8859-1/books/handbook/vinum/chapter.sgml b/en_US.ISO8859-1/books/handbook/vinum/chapter.sgml
index 27cc87dc40..728160fb7f 100644
--- a/en_US.ISO8859-1/books/handbook/vinum/chapter.sgml
+++ b/en_US.ISO8859-1/books/handbook/vinum/chapter.sgml
@@ -1,1475 +1,1475 @@

The Vinum Volume Manager

Originally written by Greg Lehey

Synopsis

No matter what disks you have, there are always potential problems: They can be too small. They can be too slow. They can be too unreliable.

One way some users safeguard themselves against such issues is through the use of multiple, and sometimes redundant, disks. In addition to supporting various cards and controllers for hardware RAID systems, the base FreeBSD system includes the Vinum Volume Manager, a block device driver that implements virtual disk drives. Vinum provides more flexibility, performance, and reliability than traditional disk storage, and implements RAID-0, RAID-1, and RAID-5 models both individually and in combination.

This chapter provides an overview of potential problems with traditional disk storage, and an introduction to the Vinum Volume Manager.

Disks Are Too Small

Vinum is a so-called Volume Manager, a virtual disk driver that addresses these three problems. Let us look at them in more detail. Various solutions to these problems have been proposed and implemented:

Disks are getting bigger, but so are data storage requirements. Often you will find you want a file system that is bigger than the disks you have available. Admittedly, this problem is not as acute as it was ten years ago, but it still exists. Some systems have solved this by creating an abstract device which stores its data on a number of disks.

Access Bottlenecks

Modern systems frequently need to access data in a highly concurrent manner.
For example, large FTP or HTTP servers can maintain thousands of concurrent sessions and have multiple 100 Mbit/s connections to the outside world, well beyond the sustained transfer rate of most disks. Current disk drives can transfer data sequentially at up to 70 MB/s, but this value is of little importance in an environment where many independent processes access a drive, where they may achieve only a fraction of these values. In such cases it is more interesting to view the problem from the viewpoint of the disk subsystem: the important parameter is the load that a transfer places on the subsystem, in other words the time for which a transfer occupies the drives involved in the transfer. In any disk transfer, the drive must first position the heads, wait for the first sector to pass under the read head, and then perform the transfer. These actions can be considered to be atomic: it does not make any sense to interrupt them. Consider a typical transfer of about 10 kB: the current generation of high-performance disks can position the heads in an average of 3.5 ms. The fastest drives spin at 15,000 rpm, so the average rotational latency (half a revolution) is 2 ms. At 70 MB/s, the transfer itself takes about 150 μs, almost nothing compared to the positioning time. In such a case, the effective transfer rate drops to a little over 1 MB/s and is clearly highly dependent on the transfer size. The traditional and obvious solution to this bottleneck is more spindles: rather than using one large disk, it uses several smaller disks with the same aggregate storage space. Each disk is capable of positioning and transferring independently, so the effective throughput increases by a factor close to the number of disks used. The exact throughput improvement is, of course, smaller than the number of disks involved: although each drive is capable of transferring in parallel, there is no way to ensure that the requests are evenly distributed across the drives. 
Inevitably the load on one drive will be higher than on another.

The evenness of the load on the disks is strongly dependent on the way the data is shared across the drives. In the following discussion, it is convenient to think of the disk storage as a large number of data sectors which are addressable by number, rather like the pages in a book. The most obvious method is to divide the virtual disk into groups of consecutive sectors the size of the individual physical disks and store them in this manner, rather like taking a large book and tearing it into smaller sections. This method is called concatenation and has the advantage that the disks are not required to have any specific size relationships. It works well when the access to the virtual disk is spread evenly about its address space. When access is concentrated on a smaller area, the improvement is less marked. The figure below illustrates the sequence in which storage units are allocated in a concatenated organization.
Concatenated Organization
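The address mapping behind concatenation can be sketched in a few lines of Python. This is only an illustration of the idea described above, not Vinum's code: the function name `concat_locate` and the disk sizes are assumptions made for the example.

```python
from typing import List, Tuple


def concat_locate(sector: int, disk_sizes: List[int]) -> Tuple[int, int]:
    """Map a virtual sector number to (disk index, sector on that disk).

    disk_sizes holds the size of each physical disk in sectors; the
    disks may all differ in size, as concatenation allows.
    """
    if sector < 0:
        raise ValueError("sector must be non-negative")
    for disk, size in enumerate(disk_sizes):
        if sector < size:
            return disk, sector
        # Skip past this disk and keep looking on the next one.
        sector -= size
    raise ValueError("sector lies beyond the end of the virtual disk")


# Three hypothetical disks of unequal size (in sectors).
sizes = [1000, 500, 2000]
print(concat_locate(999, sizes))   # last sector of the first disk
print(concat_locate(1000, sizes))  # first sector of the second disk
print(concat_locate(1600, sizes))  # somewhere on the third disk
```

Because consecutive virtual sectors stay on the same disk until it is full, accesses concentrated in a small address range all land on one drive, which is the hot-spot behaviour described above.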
An alternative mapping is to divide the address space into smaller, equal-sized components and store them sequentially on different devices. For example, the first 256 sectors may be stored on the first disk, the next 256 sectors on the next disk and so on. After filling the last disk, the process repeats until the disks are full. This mapping is called striping or RAID-0. (RAID stands for Redundant Array of Inexpensive Disks and offers various forms of fault tolerance, though the term is somewhat misleading in the case of RAID-0, which provides no redundancy.) Striping requires somewhat more effort to locate the data, and it can cause additional I/O load where a transfer is spread over multiple disks, but it can also provide a more constant load across the disks. The figure below illustrates the sequence in which storage units are allocated in a striped organization.
Striped Organization
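The striped mapping can be sketched the same way. The 256-sector stripe size comes from the example in the text; the function name `stripe_locate` and the four-disk layout are assumptions for illustration only.

```python
from typing import Tuple


def stripe_locate(sector: int, n_disks: int, stripe_sectors: int = 256) -> Tuple[int, int]:
    """Map a virtual sector number to (disk index, sector on that disk)."""
    if sector < 0 or n_disks < 1 or stripe_sectors < 1:
        raise ValueError("invalid arguments")
    # Which stripe does the sector fall in, and where inside that stripe?
    stripe_no, within = divmod(sector, stripe_sectors)
    # Stripes are dealt out to the disks in turn, one row at a time.
    row, disk = divmod(stripe_no, n_disks)
    return disk, row * stripe_sectors + within


# With four disks, sectors 0-255 go to disk 0, 256-511 to disk 1, and so on.
for s in (0, 255, 256, 1023, 1024, 1030):
    print(s, stripe_locate(s, n_disks=4))
```

A run of consecutive sectors longer than one stripe crosses a disk boundary, which is the source of the additional I/O load mentioned above.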
Data Integrity

The final problem with current disks is that they are unreliable. Although disk drive reliability has increased tremendously over the last few years, disk drives are still the most likely core component of a server to fail. When they do, the results can be catastrophic: replacing a failed disk drive and restoring data to it can take days.

The traditional way to approach this problem has been mirroring, keeping two copies of the data on different physical hardware. Since the advent of the RAID levels, this technique has also been called RAID level 1 or RAID-1. Any write to the volume writes to both locations; a read can be satisfied from either, so if one drive fails, the data is still available on the other drive.

Mirroring has two problems:

The price. It requires twice as much disk storage as a non-redundant solution.

The performance impact. Writes must be performed to both drives, so they take up twice the bandwidth of a non-mirrored volume. Reads do not suffer from a performance penalty: it even looks as if they are faster.

An alternative solution is parity, implemented in the RAID levels 2, 3, 4 and 5. Of these, RAID-5 is the most interesting. As implemented by Vinum, a RAID-5 plex is a variant of a striped plex which dedicates one block of each stripe to the parity of the other blocks. As required by RAID-5, the location of this parity block changes from one stripe to the next. In the figure below, the numbers in the data blocks indicate the relative block numbers.
RAID-5 Organization
Compared to mirroring, RAID-5 has the advantage of requiring significantly less storage space. Read access is similar to that of striped organizations, but write access is significantly slower, approximately 25% of the read performance. If one drive fails, the array can continue to operate in degraded mode: a read from one of the remaining accessible drives continues normally, but a read from the failed drive is recalculated from the corresponding block from all the remaining drives.
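The recalculation performed in degraded mode relies on the parity block being the bytewise XOR of the data blocks in its stripe. The sketch below shows that property in Python. It is not Vinum's implementation, and the rotation rule in `parity_disk` is a hypothetical choice: the text only states that the parity location changes from stripe to stripe.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Return the bytewise XOR of equally sized blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_disk(stripe_no: int, n_disks: int) -> int:
    """Hypothetical rotation rule: parity moves one disk per stripe."""
    return stripe_no % n_disks


# One stripe on a four-disk array: three data blocks plus one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Simulate losing the drive holding data[1]: XOR the surviving data
# blocks with the parity block to recalculate the missing block.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

Every read of a block on the failed drive needs a read from each remaining drive, which is why degraded mode is slower than normal operation.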
Vinum Objects In order to address these problems, Vinum implements a four-level hierarchy of objects: The most visible object is the virtual disk, called a volume. Volumes have essentially the same properties as a &unix; disk drive, though there are some minor differences. They have no size limitations. Volumes are composed of plexes, each of which represent the total address space of a volume. This level in the hierarchy thus provides redundancy. Think of plexes as individual disks in a mirrored array, each containing the same data. Since Vinum exists within the &unix; disk storage framework, it would be possible to use &unix; partitions as the building block for multi-disk plexes, but in fact this turns out to be too inflexible: &unix; disks can have only a limited number of partitions. Instead, Vinum subdivides a single &unix; partition (the drive) into contiguous areas called subdisks, which it uses as building blocks for plexes. Subdisks reside on Vinum drives, currently &unix; partitions. Vinum drives can contain any number of subdisks. With the exception of a small area at the beginning of the drive, which is used for storing configuration and state information, the entire drive is available for data storage. The following sections describe the way these objects provide the functionality required of Vinum. Volume Size Considerations Plexes can include multiple subdisks spread over all drives in the Vinum configuration. As a result, the size of an individual drive does not limit the size of a plex, and thus of a volume. Redundant Data Storage Vinum implements mirroring by attaching multiple plexes to a volume. Each plex is a representation of the data in a volume. A volume may contain between one and eight plexes. 
Although a plex represents the complete data of a volume, it is possible for parts of the representation to be physically missing, either by design (by not defining a subdisk for parts of the plex) or by accident (as a result of the failure of a drive). As long as at least one plex can provide the data for the complete address range of the volume, the volume is fully functional. Performance Issues Vinum implements both concatenation and striping at the plex level: A concatenated plex uses the address space of each subdisk in turn. A striped plex stripes the data across each subdisk. The subdisks must all have the same size, and there must be at least two subdisks in order to distinguish it from a concatenated plex. Which Plex Organization? The version of Vinum supplied with FreeBSD &rel.current; implements two kinds of plex: Concatenated plexes are the most flexible: they can contain any number of subdisks, and the subdisks may be of different length. The plex may be extended by adding additional subdisks. They require less CPU time than striped plexes, though the difference in CPU overhead is not measurable. On the other hand, they are most susceptible to hot spots, where one disk is very active and others are idle. The greatest advantage of striped (RAID-0) plexes is that they reduce hot spots: by choosing an optimum sized stripe (about 256 kB), you can even out the load on the component drives. The disadvantages of this approach are (fractionally) more complex code and restrictions on subdisks: they must be all the same size, and extending a plex by adding new subdisks is so complicated that Vinum currently does not implement it. Vinum imposes an additional, trivial restriction: a striped plex must have at least two subdisks, since otherwise it is indistinguishable from a concatenated plex. summarizes the advantages and disadvantages of each plex organization. 
Vinum Plex Organizations

Plex type     Minimum subdisks  Can add subdisks  Must be equal size  Application
concatenated  1                 yes               no                  Large data storage with maximum placement flexibility and moderate performance
striped       2                 no                yes                 High performance in combination with highly concurrent access
Some Examples

Vinum maintains a configuration database which describes the objects known to an individual system. Initially, the user creates the configuration database from one or more configuration files with the aid of the &man.vinum.8; utility program. Vinum stores a copy of its configuration database on each disk slice (which Vinum calls a device) under its control. This database is updated on each state change, so that a restart accurately restores the state of each Vinum object.

The Configuration File

The configuration file describes individual Vinum objects. The definition of a simple volume might be:

    drive a device /dev/da3h
    volume myvol
      plex org concat
        sd length 512m drive a

This file describes four Vinum objects:

The drive line describes a disk partition (drive) and its location relative to the underlying hardware. It is given the symbolic name a. This separation of the symbolic names from the device names allows disks to be moved from one location to another without confusion.

The volume line describes a volume. The only required attribute is the name, in this case myvol.

The plex line defines a plex. The only required parameter is the organization, in this case concat. No name is necessary: the system automatically generates a name from the volume name by adding the suffix .px, where x is the number of the plex in the volume. Thus this plex will be called myvol.p0.

The sd line describes a subdisk. The minimum specifications are the name of a drive on which to store it, and the length of the subdisk. As with plexes, no name is necessary: the system automatically assigns names derived from the plex name by adding the suffix .sx, where x is the number of the subdisk in the plex. Thus Vinum gives this subdisk the name myvol.p0.s0.
After processing this file, &man.vinum.8; produces the following output:

    &prompt.root; vinum -> create config1
    Configuration summary
    Drives:         1 (4 configured)
    Volumes:        1 (4 configured)
    Plexes:         1 (8 configured)
    Subdisks:       1 (16 configured)

    D a                     State: up       Device /dev/da3h        Avail: 2061/2573 MB (80%)

    V myvol                 State: up       Plexes:       1 Size:        512 MB

    P myvol.p0            C State: up       Subdisks:     1 Size:        512 MB

    S myvol.p0.s0           State: up       PO:        0  B Size:        512 MB

This output shows the brief listing format of &man.vinum.8;. It is represented graphically in the figure below.
A Simple Vinum Volume
This figure, and the ones which follow, represent a volume, which contains the plexes, which in turn contain the subdisks. In this trivial example, the volume contains one plex, and the plex contains one subdisk. This particular volume has no specific advantage over a conventional disk partition. It contains a single plex, so it is not redundant. The plex contains a single subdisk, so there is no difference in storage allocation from a conventional disk partition. The following sections illustrate various more interesting configuration methods.
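The default naming rules for plexes and subdisks described above can be written down as two small helpers. The function names are hypothetical and exist only for this illustration; the suffix rules (.px for plexes, .sx for subdisks) are the ones given in the text.

```python
def plex_name(volume: str, plex_no: int) -> str:
    """Default plex name: the volume name plus the suffix .pN."""
    return f"{volume}.p{plex_no}"


def subdisk_name(volume: str, plex_no: int, sd_no: int) -> str:
    """Default subdisk name: the plex name plus the suffix .sN."""
    return f"{plex_name(volume, plex_no)}.s{sd_no}"


# The single-plex, single-subdisk volume from the example above.
print(plex_name("myvol", 0))
print(subdisk_name("myvol", 0, 0))
```

Since each name embeds the name of its parent, the position of an object in the hierarchy can be read directly from a listing.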
Increased Resilience: Mirroring

The resilience of a volume can be increased by mirroring. When laying out a mirrored volume, it is important to ensure that the subdisks of each plex are on different drives, so that a drive failure will not take down both plexes. The following configuration mirrors a volume:

    drive b device /dev/da4h
    volume mirror
      plex org concat
        sd length 512m drive a
      plex org concat
        sd length 512m drive b

In this example, it was not necessary to specify a definition of drive a again, since Vinum keeps track of all objects in its configuration database. After processing this definition, the configuration looks like:

    Drives:         2 (4 configured)
    Volumes:        2 (4 configured)
    Plexes:         3 (8 configured)
    Subdisks:       3 (16 configured)

    D a                     State: up       Device /dev/da3h        Avail: 1549/2573 MB (60%)
    D b                     State: up       Device /dev/da4h        Avail: 2061/2573 MB (80%)

    V myvol                 State: up       Plexes:       1 Size:        512 MB
    V mirror                State: up       Plexes:       2 Size:        512 MB

    P myvol.p0            C State: up       Subdisks:     1 Size:        512 MB
    P mirror.p0           C State: up       Subdisks:     1 Size:        512 MB
    P mirror.p1           C State: initializing     Subdisks:     1 Size:        512 MB

    S myvol.p0.s0           State: up       PO:        0  B Size:        512 MB
    S mirror.p0.s0          State: up       PO:        0  B Size:        512 MB
    S mirror.p1.s0          State: empty    PO:        0  B Size:        512 MB

The figure below shows the structure graphically.
A Mirrored Vinum Volume
In this example, each plex contains the full 512 MB of address space. As in the previous example, each plex contains only a single subdisk.
Optimizing Performance

The mirrored volume in the previous example is more resistant to failure than an unmirrored volume, but its performance is lower: each write to the volume requires a write to both drives, using up a greater proportion of the total disk bandwidth. Performance considerations demand a different approach: instead of mirroring, the data is striped across as many disk drives as possible. The following configuration shows a volume with a plex striped across four disk drives:

    drive c device /dev/da5h
    drive d device /dev/da6h
    volume striped
      plex org striped 512k
        sd length 128m drive a
        sd length 128m drive b
        sd length 128m drive c
        sd length 128m drive d

As before, it is not necessary to define the drives which are already known to Vinum. After processing this definition, the configuration looks like:

    Drives:         4 (4 configured)
    Volumes:        3 (4 configured)
    Plexes:         4 (8 configured)
    Subdisks:       7 (16 configured)

    D a                     State: up       Device /dev/da3h        Avail: 1421/2573 MB (55%)
    D b                     State: up       Device /dev/da4h        Avail: 1933/2573 MB (75%)
    D c                     State: up       Device /dev/da5h        Avail: 2445/2573 MB (95%)
    D d                     State: up       Device /dev/da6h        Avail: 2445/2573 MB (95%)

    V myvol                 State: up       Plexes:       1 Size:        512 MB
    V mirror                State: up       Plexes:       2 Size:        512 MB
    V striped               State: up       Plexes:       1 Size:        512 MB

    P myvol.p0            C State: up       Subdisks:     1 Size:        512 MB
    P mirror.p0           C State: up       Subdisks:     1 Size:        512 MB
    P mirror.p1           C State: initializing     Subdisks:     1 Size:        512 MB
    P striped.p0            State: up       Subdisks:     4 Size:        512 MB

    S myvol.p0.s0           State: up       PO:        0  B Size:        512 MB
    S mirror.p0.s0          State: up       PO:        0  B Size:        512 MB
    S mirror.p1.s0          State: empty    PO:        0  B Size:        512 MB
    S striped.p0.s0         State: up       PO:        0  B Size:        128 MB
    S striped.p0.s1         State: up       PO:      512 kB Size:        128 MB
    S striped.p0.s2         State: up       PO:     1024 kB Size:        128 MB
    S striped.p0.s3         State: up       PO:     1536 kB Size:        128 MB
A Striped Vinum Volume
This volume is represented in the figure above. The darkness of the stripes indicates the position within the plex address space: the lightest stripes come first, the darkest last.
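The Avail figures in the listings of this section follow from simple bookkeeping: each subdisk reduces the free space on its drive by its length. The sketch below reproduces the numbers after the striped volume has been added. The 2573 MB drive size and the subdisk lengths are taken from the listings; the dictionary layout, the function name and the rounding of the percentage are assumptions for illustration.

```python
DRIVE_MB = 2573  # usable size of each drive, as shown in the listings

# Subdisk lengths (MB) allocated on each drive by the three example volumes:
# myvol (512 on a), mirror (512 on a and on b), striped (128 on a, b, c, d).
allocations = {
    "a": [512, 512, 128],
    "b": [512, 128],
    "c": [128],
    "d": [128],
}


def avail(drive: str) -> tuple:
    """Return (free MB, free percentage) for a drive."""
    free = DRIVE_MB - sum(allocations[drive])
    # Rounding to the nearest percent is an assumption; it matches the listings.
    return free, round(100 * free / DRIVE_MB)


for name in sorted(allocations):
    free, pct = avail(name)
    print(f"D {name}  Avail: {free}/{DRIVE_MB} MB ({pct}%)")
```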
Resilience and Performance

With sufficient hardware, it is possible to build volumes which show both increased resilience and increased performance compared to standard &unix; partitions. A typical configuration file might be:

    volume raid10
      plex org striped 512k
        sd length 102480k drive a
        sd length 102480k drive b
        sd length 102480k drive c
        sd length 102480k drive d
        sd length 102480k drive e
      plex org striped 512k
        sd length 102480k drive c
        sd length 102480k drive d
        sd length 102480k drive e
        sd length 102480k drive a
        sd length 102480k drive b

The subdisks of the second plex are offset by two drives from those of the first plex: this helps ensure that writes do not go to the same subdisks even if a transfer goes over two drives.

The figure below represents the structure of this volume.
A Mirrored, Striped Vinum Volume
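The two-drive offset between the plexes of the raid10 volume is a rotation of the drive list. The following sketch generates the configuration text shown earlier in this section; the helper names are hypothetical, and the output is only meant to mirror the example, not to replace writing the file by hand.

```python
from typing import List


def striped_plex(drives: List[str], rotate: int = 0,
                 sd_length: str = "102480k", stripe: str = "512k") -> List[str]:
    """Return config lines for one striped plex, starting `rotate` drives in."""
    order = drives[rotate:] + drives[:rotate]
    lines = [f"  plex org striped {stripe}"]
    lines += [f"    sd length {sd_length} drive {d}" for d in order]
    return lines


def raid10_config(name: str, drives: List[str], offset: int = 2) -> str:
    """Return a two-plex mirrored, striped volume definition."""
    lines = [f"volume {name}"]
    lines += striped_plex(drives)                 # first plex: a, b, c, d, e
    lines += striped_plex(drives, rotate=offset)  # second plex: c, d, e, a, b
    return "\n".join(lines)


print(raid10_config("raid10", ["a", "b", "c", "d", "e"]))
```

With this rotation, the subdisk at a given position in the first plex never shares a drive with the subdisk at the same or the adjacent position in the second plex.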
Object Naming As described above, Vinum assigns default names to plexes and subdisks, although they may be overridden. Overriding the default names is not recommended: experience with the VERITAS volume manager, which allows arbitrary naming of objects, has shown that this flexibility does not bring a significant advantage, and it can cause confusion. Names may contain any non-blank character, but it is recommended to restrict them to letters, digits and the underscore characters. The names of volumes, plexes and subdisks may be up to 64 characters long, and the names of drives may be up to 32 characters long. Vinum objects are assigned device nodes in the hierarchy /dev/vinum. The configuration shown above would cause Vinum to create the following device nodes: The control devices /dev/vinum/control and /dev/vinum/controld, which are used by &man.vinum.8; and the Vinum daemon respectively. Block and character device entries for each volume. These are the main devices used by Vinum. The block device names are the name of the volume, while the character device names follow the BSD tradition of prepending the letter r to the name. Thus the configuration above would include the block devices /dev/vinum/myvol, /dev/vinum/mirror, /dev/vinum/striped, /dev/vinum/raid5 and /dev/vinum/raid10, and the character devices /dev/vinum/rmyvol, /dev/vinum/rmirror, /dev/vinum/rstriped, /dev/vinum/rraid5 and /dev/vinum/rraid10. There is obviously a problem here: it is possible to have two volumes called r and rr, but there will be a conflict creating the device node /dev/vinum/rr: is it a character device for volume r or a block device for volume rr? Currently Vinum does not address this conflict: the first-defined volume will get the name. A directory /dev/vinum/drive with entries for each drive. These entries are in fact symbolic links to the corresponding disk nodes. A directory /dev/vinum/volume with entries for each volume. 
It contains subdirectories for each plex, which in turn contain subdirectories for their component subdisks.

The directories /dev/vinum/plex, /dev/vinum/sd, and /dev/vinum/rsd, which contain block device nodes for each plex and block and character device nodes respectively for each subdisk.

For example, consider the following configuration file:

    drive drive1 device /dev/sd1h
    drive drive2 device /dev/sd2h
    drive drive3 device /dev/sd3h
    drive drive4 device /dev/sd4h
    volume s64 setupstate
      plex org striped 64k
        sd length 100m drive drive1
        sd length 100m drive drive2
        sd length 100m drive drive3
        sd length 100m drive drive4

After processing this file, &man.vinum.8; creates the following structure in /dev/vinum:

    brwx------  1 root  wheel   25, 0x40000001 Apr 13 16:46 Control
    brwx------  1 root  wheel   25, 0x40000002 Apr 13 16:46 control
    brwx------  1 root  wheel   25, 0x40000000 Apr 13 16:46 controld
    drwxr-xr-x  2 root  wheel       512 Apr 13 16:46 drive
    drwxr-xr-x  2 root  wheel       512 Apr 13 16:46 plex
    crwxr-xr--  1 root  wheel   91,   2 Apr 13 16:46 rs64
    drwxr-xr-x  2 root  wheel       512 Apr 13 16:46 rsd
    drwxr-xr-x  2 root  wheel       512 Apr 13 16:46 rvol
    brwxr-xr--  1 root  wheel   25,   2 Apr 13 16:46 s64
    drwxr-xr-x  2 root  wheel       512 Apr 13 16:46 sd
    drwxr-xr-x  3 root  wheel       512 Apr 13 16:46 vol

    /dev/vinum/drive:
    total 0
    lrwxr-xr-x  1 root  wheel  9 Apr 13 16:46 drive1 -> /dev/sd1h
    lrwxr-xr-x  1 root  wheel  9 Apr 13 16:46 drive2 -> /dev/sd2h
    lrwxr-xr-x  1 root  wheel  9 Apr 13 16:46 drive3 -> /dev/sd3h
    lrwxr-xr-x  1 root  wheel  9 Apr 13 16:46 drive4 -> /dev/sd4h

    /dev/vinum/plex:
    total 0
    brwxr-xr--  1 root  wheel   25, 0x10000002 Apr 13 16:46 s64.p0

    /dev/vinum/rsd:
    total 0
    crwxr-xr--  1 root  wheel   91, 0x20000002 Apr 13 16:46 s64.p0.s0
    crwxr-xr--  1 root  wheel   91, 0x20100002 Apr 13 16:46 s64.p0.s1
    crwxr-xr--  1 root  wheel   91, 0x20200002 Apr 13 16:46 s64.p0.s2
    crwxr-xr--  1 root  wheel   91, 0x20300002 Apr 13 16:46 s64.p0.s3

    /dev/vinum/rvol:
    total 0
    crwxr-xr--  1 root  wheel   91,   2 Apr 13 16:46 s64

    /dev/vinum/sd:
    total 0
    brwxr-xr--  1 root  wheel   25, 0x20000002 Apr 13 16:46 s64.p0.s0
    brwxr-xr--  1 root  wheel   25, 0x20100002 Apr 13 16:46 s64.p0.s1
    brwxr-xr--  1 root  wheel   25, 0x20200002 Apr 13 16:46 s64.p0.s2
    brwxr-xr--  1 root  wheel   25, 0x20300002 Apr 13 16:46 s64.p0.s3

    /dev/vinum/vol:
    total 1
    brwxr-xr--  1 root  wheel   25,   2 Apr 13 16:46 s64
    drwxr-xr-x  3 root  wheel       512 Apr 13 16:46 s64.plex

    /dev/vinum/vol/s64.plex:
    total 1
    brwxr-xr--  1 root  wheel   25, 0x10000002 Apr 13 16:46 s64.p0
    drwxr-xr-x  2 root  wheel       512 Apr 13 16:46 s64.p0.sd

    /dev/vinum/vol/s64.plex/s64.p0.sd:
    total 0
    brwxr-xr--  1 root  wheel   25, 0x20000002 Apr 13 16:46 s64.p0.s0
    brwxr-xr--  1 root  wheel   25, 0x20100002 Apr 13 16:46 s64.p0.s1
    brwxr-xr--  1 root  wheel   25, 0x20200002 Apr 13 16:46 s64.p0.s2
    brwxr-xr--  1 root  wheel   25, 0x20300002 Apr 13 16:46 s64.p0.s3

Although it is recommended that plexes and subdisks should not be allocated specific names, Vinum drives must be named. This makes it possible to move a drive to a different location and still recognize it automatically. Drive names may be up to 32 characters long.

Creating File Systems

Volumes appear to the system to be identical to disks, with one exception. Unlike &unix; drives, Vinum does not partition volumes, which thus do not contain a partition table. This has required modification to some disk utilities, notably &man.newfs.8;, which previously tried to interpret the last letter of a Vinum volume name as a partition identifier. For example, a disk drive may have a name like /dev/ad0a or /dev/da2h. These names represent the first partition (a) on the first (0) IDE disk (ad) and the eighth partition (h) on the third (2) SCSI disk (da) respectively. By contrast, a Vinum volume might be called /dev/vinum/concat, a name which has no relationship with a partition name.

Normally, &man.newfs.8; interprets the name of the disk and complains if it cannot understand it.
For example:

    &prompt.root; newfs /dev/vinum/concat
    newfs: /dev/vinum/concat: can't figure out file system partition

The following is only valid for FreeBSD versions prior to 5.0:

In order to create a file system on this volume, use the -v option to &man.newfs.8;:

    &prompt.root; newfs -v /dev/vinum/concat

Configuring Vinum

The GENERIC kernel does not contain Vinum. It is possible to build a special kernel which includes Vinum, but this is not recommended. The standard way to start Vinum is as a kernel module (kld). You do not even need to use &man.kldload.8; for Vinum: when you start &man.vinum.8;, it checks whether the module has been loaded, and if it has not, it loads it automatically.

Startup

Vinum stores configuration information on the disk slices in essentially the same form as in the configuration files. When reading from the configuration database, Vinum recognizes a number of keywords which are not allowed in the configuration files. For example, a disk configuration might contain the following text:

    volume myvol state up
    volume bigraid state down
    plex name myvol.p0 state up org concat vol myvol
    plex name myvol.p1 state up org concat vol myvol
    plex name myvol.p2 state init org striped 512b vol myvol
    plex name bigraid.p0 state initializing org raid5 512b vol bigraid
    sd name myvol.p0.s0 drive a plex myvol.p0 state up len 1048576b driveoffset 265b plexoffset 0b
    sd name myvol.p0.s1 drive b plex myvol.p0 state up len 1048576b driveoffset 265b plexoffset 1048576b
    sd name myvol.p1.s0 drive c plex myvol.p1 state up len 1048576b driveoffset 265b plexoffset 0b
    sd name myvol.p1.s1 drive d plex myvol.p1 state up len 1048576b driveoffset 265b plexoffset 1048576b
    sd name myvol.p2.s0 drive a plex myvol.p2 state init len 524288b driveoffset 1048841b plexoffset 0b
    sd name myvol.p2.s1 drive b plex myvol.p2 state init len 524288b driveoffset 1048841b plexoffset 524288b
    sd name myvol.p2.s2 drive c plex myvol.p2 state init len 524288b driveoffset 1048841b plexoffset 1048576b
    sd name myvol.p2.s3 drive d plex myvol.p2 state init len 524288b driveoffset 1048841b plexoffset 1572864b
    sd name bigraid.p0.s0 drive a plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 0b
    sd name bigraid.p0.s1 drive b plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 4194304b
    sd name bigraid.p0.s2 drive c plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 8388608b
    sd name bigraid.p0.s3 drive d plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 12582912b
    sd name bigraid.p0.s4 drive e plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 16777216b

The obvious differences here are the presence of explicit location information and naming (both of which are also allowed, but discouraged, for use by the user) and the information on the states (which are not available to the user). Vinum does not store information about drives in the configuration information: it finds the drives by scanning the configured disk drives for partitions with a Vinum label. This enables Vinum to identify drives correctly even if they have been assigned different &unix; drive IDs.

Automatic Startup

In order to start Vinum automatically when you boot the system, ensure that you have the following line in your /etc/rc.conf:

    start_vinum="YES"		# set to YES to start vinum

If you do not have a file /etc/rc.conf, create one with this content. This will cause the system to load the Vinum kld at startup, and to start any objects mentioned in the configuration. This is done before mounting file systems, so it is possible to automatically &man.fsck.8; and mount file systems on Vinum volumes.

When you start Vinum with the vinum start command, Vinum reads the configuration database from one of the Vinum drives. Under normal circumstances, each drive contains an identical copy of the configuration database, so it does not matter which drive is read.
After a crash, however, Vinum must determine which drive was updated most recently and read the configuration from this drive. It then updates the configuration if necessary from progressively older drives. Using Vinum for the Root Filesystem For a machine that has fully-mirrored filesystems using Vinum, it is desirable to also mirror the root filesystem. Setting up such a configuration is less trivial than mirroring an arbitrary filesystem because: The root filesystem must be available very early during the boot process, so the Vinum infrastructure must already be available at this time. The volume containing the root filesystem also contains the system bootstrap and the kernel, which must be read using the host system's native utilities (e. g. the BIOS on PC-class machines) which often cannot be taught about the details of Vinum. In the following sections, the term root volume is generally used to describe the Vinum volume that contains the root filesystem. It is probably a good idea to use the name "root" for this volume, but this is not technically required in any way. All command examples in the following sections assume this name though. Starting up Vinum Early Enough for the Root Filesystem There are several measures to take for this to happen: Vinum must be available in the kernel at boot-time. Thus, the method to start Vinum automatically described in is not applicable to accomplish this task, and the start_vinum parameter must actually not be set when the following setup is being arranged. The first option would be to compile Vinum statically into the kernel, so it is available all the time, but this is usually not desirable. There is another option as well, to have /boot/loader () load the vinum kernel module early, before starting the kernel. This can be accomplished by putting the line: vinum_load="YES" into the file /boot/loader.conf. Vinum must be initialized early since it needs to supply the volume for the root filesystem. 
By default, the Vinum kernel part does not look for drives that might contain Vinum volume information until the administrator (or one of the startup scripts) issues a vinum start command.

The following paragraphs outline the steps needed for FreeBSD 5.X and above. The setup required for FreeBSD 4.X differs, and is described in a later section.

By placing the line:

    vinum.autostart="YES"

into /boot/loader.conf, Vinum is instructed to automatically scan all drives for Vinum information as part of the kernel startup.

Note that it is not necessary to instruct the kernel where to look for the root filesystem. /boot/loader looks up the name of the root device in /etc/fstab, and passes this information on to the kernel.
When it comes to mount the root filesystem, the kernel figures out from the
- devicename provided which driver to ask to translate this
+ device name provided which driver to ask to translate this
into the internal device ID (major/minor number).

Making a Vinum-based Root Volume Accessible to the Bootstrap

Since the current FreeBSD bootstrap is only 7.5 KB of code, and already has the burden of reading files (like /boot/loader) from the UFS filesystem, it is all but impossible to also teach it about internal Vinum structures so that it could parse the Vinum configuration data and work out the elements of a boot volume itself. Thus, some tricks are necessary to provide the bootstrap code with the illusion of a standard "a" partition that contains the root filesystem.

For this to be possible at all, the following requirements must be met for the root volume:

The root volume must not be striped or RAID-5.

The root volume must not contain more than one concatenated subdisk per plex.

Note that it is desirable and possible that there are multiple plexes, each containing one replica of the root filesystem.
The bootstrap process will, however, only use one of these replicas for finding the bootstrap and all the files, until the kernel eventually mounts the root filesystem itself. Each single subdisk within these plexes will then need its own "a" partition illusion, for the respective device to become bootable. It is not strictly necessary that each of these faked "a" partitions is located at the same offset within its device, compared with other devices containing plexes of the root volume. However, it is probably a good idea to create the Vinum volumes that way so the resulting mirrored devices are symmetric, to avoid confusion. In order to set up these "a" partitions, for each device containing part of the root volume, the following needs to be done: The location (offset from the beginning of the device) and size of this device's subdisk that is part of the root volume need to be examined, using the command: &prompt.root; vinum l -rv root Note that Vinum offsets and sizes are measured in bytes. They must be divided by 512 in order to obtain the block numbers that are to be used in the disklabel command. Run the command: &prompt.root; disklabel -e devname for each device that participates in the root volume. devname must be either the name of the disk (like da0) for disks without a slice (also known as fdisk) table, or the name of the slice (like ad0s1). If there is already an "a" partition on the device (presumably containing a pre-Vinum root filesystem), it should be renamed to something else, so it remains accessible (just in case), but will no longer be used by default to bootstrap the system. Note that active partitions (like a root filesystem currently mounted) cannot be renamed, so this must be executed either when booted from a Fixit medium, or in a two-step process, where (in a mirrored situation) the disk that the system was not booted from is manipulated first.
Then, the offset of the Vinum partition on this device (if any) must be added to the offset of the respective root volume subdisk on this device. The resulting value will become the "offset" value for the new "a" partition. The "size" value for this partition can be taken verbatim from the calculation above. The "fstype" should be 4.2BSD. The "fsize", "bsize", and "cpg" values should preferably be chosen to match the actual filesystem, though they are fairly unimportant within this context. That way, a new "a" partition will be established that overlaps the Vinum partition on this device. Note that disklabel will only allow this overlap if the Vinum partition has properly been marked using the "vinum" fstype. That is all! A faked "a" partition now exists on each device that has one replica of the root volume. It is highly recommended to verify the result again, using a command like: &prompt.root; fsck -n /dev/devnamea It should be remembered that all files containing control information must be relative to the root filesystem in the Vinum volume which, when setting up a new Vinum root volume, might not match the root filesystem that is currently active. So in particular, the files /etc/fstab and /boot/loader.conf need to be taken care of. At the next reboot, the bootstrap should figure out the appropriate control information from the new Vinum-based root filesystem, and act accordingly. At the end of the kernel initialization process, after all devices have been announced, the prominent notice that shows the success of this setup is a message like: Mounting root from ufs:/dev/vinum/root Example of a Vinum-based Root Setup After the Vinum root volume has been set up, the output of vinum l -rv root could look like: ...
Subdisk root.p0.s0:
  Size: 125829120 bytes (120 MB)
  State: up
  Plex root.p0 at offset 0 (0 B)
  Drive disk0 (/dev/da0h) at offset 135680 (132 kB)
Subdisk root.p1.s0:
  Size: 125829120 bytes (120 MB)
  State: up
  Plex root.p1 at offset 0 (0 B)
  Drive disk1 (/dev/da1h) at offset 135680 (132 kB)
The value to note is 135680 for the offset (relative to partition /dev/da0h). This translates to 265 512-byte disk blocks in disklabel's terms. Likewise, the size of this root volume is 245760 512-byte blocks. /dev/da1h, containing the second replica of this root volume, has a symmetric setup. The disklabel for these devices might look like: ...
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a:   245760      281    4.2BSD     2048 16384     0   # (Cyl.    0*- 15*)
  c: 71771688        0    unused        0     0         # (Cyl.    0 - 4467*)
  h: 71771672       16     vinum                        # (Cyl.    0*- 4467*)
It can be observed that the "size" parameter for the faked "a" partition matches the value outlined above, while the "offset" parameter is the sum of the offset within the Vinum partition "h", and the offset of this partition within the device (or slice). This is a typical setup that is necessary to avoid the problem described in . It can also be seen that the entire "a" partition is completely within the "h" partition containing all the Vinum data for this device. Note that in the above example, the entire device is dedicated to Vinum, and there is no leftover pre-Vinum root partition, since this has been a newly set-up disk that was only ever meant to be part of a Vinum configuration. Troubleshooting If something goes wrong, a way is needed to recover from the situation. The following list contains a few known pitfalls and solutions. System Bootstrap Loads, but System Does Not Boot If for any reason the system does not continue to boot, the bootstrap can be interrupted by pressing the space key at the 10-seconds warning. The loader variables (like vinum.autostart) can be examined using the show command, and manipulated using the set or unset commands.
If the only problem was that the Vinum kernel module was not yet in the list of modules to load automatically, a simple load vinum will help. When ready, the boot process can be continued with boot -as. The options will request the kernel to ask for the root filesystem to mount (-a), and make the boot process stop in single-user mode (-s), where the root filesystem is mounted read-only. That way, even if only one plex of a multi-plex volume has been mounted, no data inconsistency between plexes is being risked. At the prompt asking for a root filesystem to mount, any device that contains a valid root filesystem can be entered. If /etc/fstab has been set up correctly, the default should be something like ufs:/dev/vinum/root. A typical alternate choice would be something like ufs:da0d, which could be a hypothetical partition that contains the pre-Vinum root filesystem. Care should be taken if one of the alias "a" partitions is entered here, since these are actually references to the subdisks of the Vinum root device: in a mirrored setup, this would only mount one piece of a mirrored root device. If this filesystem is to be mounted read-write later on, it is necessary to remove the other plex(es) of the Vinum root volume since these plexes would otherwise carry inconsistent data. Only Primary Bootstrap Loads If /boot/loader fails to load, but the primary bootstrap still loads (visible by a single dash in the left column of the screen right after the boot process starts), an attempt can be made to interrupt the primary bootstrap at this point, using the space key. This will make the bootstrap stop in stage two, see . An attempt can be made here to boot off an alternate partition, like the partition containing the previous root filesystem that has been moved away from "a" above. Nothing Boots, the Bootstrap Panics This situation will happen if the bootstrap has been destroyed by the Vinum installation.
Unfortunately, Vinum currently leaves only 4 KB free at the beginning of its partition before starting to write its Vinum header information. However, the stage one and two bootstraps plus the disklabel embedded between them currently require 8 KB. So if a Vinum partition was started at offset 0 within a slice or disk that was meant to be bootable, the Vinum setup will trash the bootstrap. Similarly, if the above situation has been recovered, for example by booting from a Fixit medium, and the bootstrap has been re-installed using disklabel -B as described in , the bootstrap will trash the Vinum header, and Vinum will no longer find its disk(s). Though no actual Vinum configuration data or data in Vinum volumes will be trashed by this, and it would be possible to recover all the data by entering exactly the same Vinum configuration data again, the situation is hard to fix. It would be necessary to move the entire Vinum partition by at least 4 KB, so that the Vinum header and the system bootstrap no longer collide. Differences for FreeBSD 4.X Under FreeBSD 4.X, some internal functions required to make Vinum automatically scan all disks are missing, and the code that figures out the internal ID of the root device is not smart enough to handle a name like /dev/vinum/root automatically. Therefore, things are a little different here. Vinum must explicitly be told which disks to scan, using a line like the following one in /boot/loader.conf: vinum.drives="/dev/da0 /dev/da1" It is important that all drives that could possibly contain Vinum data are mentioned. It does no harm if more drives are listed, nor is it necessary to add each slice and/or partition explicitly, since Vinum will scan all slices and partitions of the named drives for valid Vinum headers.
Since the routines used to parse the name of the root filesystem and derive the device ID (major/minor number) are only prepared to handle classical device names like /dev/ad0s1a, they cannot make any sense out of a root volume name like /dev/vinum/root. For that reason, Vinum itself needs to set up, during its own initialization, the internal kernel parameter that holds the ID of the root device. This is requested by passing the name of the root volume in the loader variable vinum.root. The entry in /boot/loader.conf to accomplish this looks like: vinum.root="root" Now, when the kernel initialization tries to find out the root device to mount, it checks whether some kernel module has already pre-initialized the kernel parameter for it. If that is the case, and the device claiming the root device matches the major number of the driver as figured out from the name of the root device string being passed (that is, "vinum" in our case), it will use the pre-allocated device ID, instead of trying to figure out one itself. That way, during the usual automatic startup, it can continue to mount the Vinum root volume for the root filesystem. However, when boot -a has been used to request that the name of the root device be entered manually, it must be noted that this routine still cannot actually parse a name entered there that refers to a Vinum volume. If any device name is entered that does not refer to a Vinum device, the mismatch between the major numbers of the pre-allocated root parameter and the driver as figured out from the given name will make this routine enter its normal parser, so entering a string like ufs:da0d will work as expected. Note, however, that if this fails, it is no longer possible to re-enter a string like ufs:vinum/root, since it cannot be parsed. The only way out is to reboot and start over. (At the askroot prompt, the initial /dev/ can always be omitted.)
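As a cross-check of the disklabel arithmetic used in the root-volume example earlier in this section, the conversion from Vinum's byte values to the "size" and "offset" fields of the faked "a" partition can be sketched as follows. This is only an illustration: the function name fake_a_partition and its parameter names are hypothetical and not part of Vinum or disklabel. The input values are the ones shown in the example output (subdisk offset 135680 bytes, subdisk size 125829120 bytes, Vinum partition "h" starting at block 16).

```python
SECTOR_SIZE = 512  # bytes per disklabel block, as stated in the text


def fake_a_partition(subdisk_offset_bytes, subdisk_size_bytes,
                     vinum_partition_offset_blocks):
    """Return (size, offset) in 512-byte blocks for the faked "a" partition.

    Hypothetical helper: it only reproduces the arithmetic described in
    the text and does not read or write any disklabel.
    """
    if subdisk_offset_bytes % SECTOR_SIZE or subdisk_size_bytes % SECTOR_SIZE:
        raise ValueError("Vinum offset and size must be multiples of 512 bytes")
    # Vinum reports bytes; disklabel wants 512-byte blocks.
    size_blocks = subdisk_size_bytes // SECTOR_SIZE
    # Offset of the Vinum partition within the device (or slice), plus the
    # offset of the root subdisk within the Vinum partition.
    offset_blocks = (vinum_partition_offset_blocks
                     + subdisk_offset_bytes // SECTOR_SIZE)
    return size_blocks, offset_blocks


size, offset = fake_a_partition(135680, 125829120, 16)
print(size, offset)  # 245760 281
```

The result matches the "a" line of the example disklabel (size 245760, offset 281). The same calculation applies to each device that holds a plex of the root volume.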