diff --git a/en_US.ISO8859-1/articles/solid-state/article.sgml b/en_US.ISO8859-1/articles/solid-state/article.sgml index 57175186c5..dc2bfe6db6 100644 --- a/en_US.ISO8859-1/articles/solid-state/article.sgml +++ b/en_US.ISO8859-1/articles/solid-state/article.sgml @@ -1,629 +1,638 @@ %man; + + +%trademarks; ]>
FreeBSD and Solid State Devices John Kozubik
john@kozubik.com
$FreeBSD$ 2001 The FreeBSD Documentation Project + + &tm-attrib.freebsd; + &tm-attrib.m-systems; + &tm-attrib.general; + + &legalnotice; This article covers the use of solid state disk devices in FreeBSD to create embedded systems. Embedded systems have the advantage of increased stability due to the lack of integral moving parts (hard drives). Account must be taken, however, of the generally limited disk space available in such systems and of the finite write endurance of the storage medium. Specific topics to be covered include the types and attributes of solid state media suitable for disk use in FreeBSD, kernel options that are of interest in such an environment, the rc.diskless mechanisms that automate the initialization of such systems and the need for read-only filesystems, and building filesystems from scratch. The article will conclude with some general strategies for small and read-only FreeBSD environments.
Solid State Disk Devices The scope of this article will be limited to solid state disk devices made from flash memory. Flash memory is a solid state memory (no moving parts) that is non-volatile (the memory maintains data even after all power sources have been disconnected). Flash memory can withstand tremendous physical shock and is reasonably fast (the flash memory solutions covered in this article are slightly slower than a EIDE hard disk for write operations, and much faster for read operations). One very important aspect of flash memory, the ramifications of which will be discussed later in this article, is that each sector has a limited rewrite capacity. You can only write, erase, and write again to a sector of flash memory a certain number of times before the sector becomes permanently unusable. Although many flash memory products automatically map bad blocks, and although some even distribute write operations evenly throughout the unit, the fact remains that there exists a limit to the amount of writing that can be done to the device. Competitive units have between 1,000,000 and 10,000,000 writes per sector in their specification. This figure varies due to the temperature of the environment. Specifically, we will be discussing ATA compatible compact-flash - units and the M-Systems Disk-On-Chip flash memory unit. ATA compatible + units and the M-Systems &diskonchip; flash memory unit. ATA compatible compact-flash cards are quite popular as storage media for digital cameras. Of particular interest is the fact that they pin out directly to the IDE bus and are compatible with the ATA command set. Therefore, with a very simple and low-cost adaptor, these devices can be attached directly to an IDE bus in a computer. Once implemented in this manner, operating systems such as FreeBSD see the device as a normal hard disk - (albeit small). The M-Systems Disk-On-Chip product is based on the same + (albeit small). The M-Systems &diskonchip; product is based on the same underlying flash memory technology as ATA compatible compact-flash cards, but resides in a DIP form factor and is not ATA compatible. To use such a device, not only must you install it on a motherboard that - has a Disk-On-Chip socket, you must also build the `fla` driver into any + has a &diskonchip; socket, you must also build the `fla` driver into any FreeBSD kernel you wish to use it with. Further, there is critical, manufacturer-specific data residing in the boot sector of this device, so you must take care not to install the FreeBSD (or any other) boot loader when using this. Other solid state disk solutions do exist, but their expense, obscurity, and relative unease of use places them beyond the scope of this article. Kernel Options A few kernel options are of specific interest to those creating an embedded FreeBSD system. First, all embedded FreeBSD systems that use flash memory as system disk will be interested in memory disks and memory filesystems. Because of the limited number of writes that can be done to flash memory, the disk and the filesystems on the disk will most likely be mounted read-only. In this environment, filesystems such as /tmp and /var are mounted as memory filesystems to allow the system to create logs and update counters and temporary files. Memory filesystems are a critical component to a successful solid state FreeBSD implementation. 
You should make sure the following lines exist in your kernel configuration file: options MFS # Memory Filesystem options MD_ROOT # md device usable as a potential root device pseudo-device md # memory disk - Second, if you will be using the M-Systems Disk-On-Chip product, you + Second, if you will be using the M-Systems &diskonchip; product, you must also include this line: device fla0 at isa? <filename>rc.diskless</filename> and Read-Only Filesystems The post-boot initialization of an embedded FreeBSD system is controlled by /etc/rc.diskless2 (/etc/rc.diskless1 is for BOOTP diskless boot). This initialization script is invoked by placing a line in /etc/rc.conf as follows: diskless_mount=/etc/rc.diskless2 rc.diskless2 mounts /var as a memory filesystem, makes a configurable list of directories in /var with the &man.mkdir.1; command, changes modes on some of those directories, and extracts a list of device entries to copy to a writable (again, a memory filesystem) /dev partition. In the execution of /etc/rc.diskless2, one other rc.conf variable comes into play - varsize. The /etc/rc.diskless2 file creates a /var partition based on the value of this variable in rc.conf: varsize=8192 Remember that this value is in sectors. The creation of the /dev partition by /etc/rc.diskless2, however, is governed by a hard-coded value of 4096 sectors. It is trivial to change this entry in the /etc/rc.diskless2 file itself, although you should not need more space than that for /dev. It is important to remember that the /etc/rc.diskless2 script assumes that you have already removed your conventional /tmp partition and replaced it with a symbolic link to /var/tmp. Because tmp is one of the directories created in /var by the /etc/rc.diskless2 script, and because /var is a memory filesystem (which is mounted read-write), /tmp will now be a directory that is read-write as well. The fact that /var and /dev are read-write filesystems is an important distinction, as the / partition (and any other partitions you may have on your flash media) should be mounted read-only. Remember that in we detailed the limitations of flash memory - specifically the limited write capability. The importance of not mounting filesystems on flash media read-write, and the importance of not using a swap file, cannot be overstated. A swap file on a busy system can burn through a piece of flash media in less than one year. Heavy logging or temporary file creation and destruction can do the same. Therefore, in addition to removing the swap and /proc entries from your /etc/fstab file, you should also change the Options field for each filesystem to ro as follows: # Device Mountpoint FStype Options Dump Pass# /dev/ad0s1a / ufs ro 1 1 A few applications in the average system will immediately begin to fail as a result of this change. For instance, ports will not install from the ports tree because the /var/db/port.mkversion file does not exist. cron will not run properly as a result of missing cron tabs in the /var created by /etc/rc.diskless2, and syslog and dhcp will encounter problems as well as a result of the read-only filesystem and missing items in the /var that /etc/rc.diskless2 has created. These are only temporary problems though, and are addressed, along with solutions to the execution of other common software packages in . 
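Recall that /etc/rc.diskless2 assumes /tmp has been replaced by a symbolic link into the memory-backed /var. While you are still preparing the system (that is, while / is still mounted read-write), this can be set up with something like the following minimal sketch:

&prompt.root; rm -rf /tmp
&prompt.root; ln -s /var/tmp /tmp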
An important thing to remember is that a filesystem that was mounted read-only with /etc/fstab can be made read-write at any time by issuing the command: &prompt.root; /sbin/mount -uw partition and can be toggled back to read-only with the command: &prompt.root; /sbin/mount -ur partition Building a File System From Scratch Because ATA compatible compact-flash cards are seen by FreeBSD as - normal IDE hard drives, as is a M-Systems Disk-On-Chip product (when you + normal IDE hard drives, as is a M-Systems &diskonchip; product (when you are running a kernel with the fla driver built in) you could theoretically install FreeBSD from the network using the kern and mfsroot floppies or from a CD. Other than the fact that you should not write a boot-loader of any kind to the M-Systems device, no special instructions are needed. However, even a small installation of FreeBSD using normal installation procedures can produce a system in size of greater than 200 megabytes. Because most people will be using smaller flash memory devices (128 megabytes is considered fairly large - 32 or even 16 megabytes is common) an installation using normal mechanisms is not possible—there is simply not enough disk space for even the smallest of conventional installations. The easiest way to overcome this space limitation is to install FreeBSD using conventional means to a normal hard disk. After the installation is complete, pare down the operating system to a size that will fit onto your flash media, then tar the entire filesystem. The following steps will guide you through the process of preparing a piece of flash memory for your tarred filesystem. Remember, because a normal installation is not being performed, operations such as partitioning, labeling, file-system creation, etc. need to be performed by hand. In addition to the kern and mfsroot floppy disks, you will also need to use - the fixit floppy. If you are using a M-Systems Disk-On-Chip, the kernel + the fixit floppy. If you are using a M-Systems &diskonchip;, the kernel on your kern floppy must have the fla option detailed in compiled into it. Please see for instructions on creating a new kernel for kern.flp. Partitioning your flash media device After booting with the kern and mfsroot floppies, choose custom from the installation menu. In the custom installation menu, choose partition. In the partition menu, you should delete all existing partitions using the d key. After deleting all existing partitions, create a partition using the c key and accept the default value for the size of the partition. When asked for the type of the partition, make sure the value is set to 165. Now write this partition table to the disk by pressing the w key (this is a hidden option on this screen). When presented with a menu to choose a boot manager, take care to select None if you are using an - M-Systems Disk-On-Chip. If you are using an ATA compatible compact + M-Systems &diskonchip;. If you are using an ATA compatible compact flash card, you should choose the FreeBSD Boot Manager. Now press the q key to quit the partition menu. You will be shown the boot manager menu once more - repeat the choice you made earlier. Creating filesystems on your flash memory device Exit the custom installation menu, and from the main installation menu choose the fixit option. 
After entering the fixit environment, enter the following commands: ATA compatible - Disk-On-Chip + &diskonchip; &prompt.root; mknod /dev/ad0a c 116 0 &prompt.root; mknod /dev/ad0c c 116 2 &prompt.root; disklabel -e /dev/ad0c &prompt.root; mknod /dev/fla0a c 102 0 &prompt.root; mknod /dev/fla0c c 102 2 &prompt.root; disklabel -e /dev/fla0c At this point you will have entered the vi editor under the - auspices of the disklabel command. If you are using Disk-On-Chip, + auspices of the disklabel command. If you are using &diskonchip;, the first step will be to change the type value near the beginning of the file from ESDI to DOC2K. Next, regardless of whether you are using - Disk-On-Chip or ATA compatible compact flash media, you need to add + &diskonchip; or ATA compatible compact flash media, you need to add an a: line at the end of the file. This a: line should look like: a: 123456 0 4.2BSD 0 0 Where 123456 is a number that is exactly the same as the number in the existing c: entry for size. Basically you are duplicating the existing c: line as an a: line, making sure that fstype is 4.2BSD. Save the file and exit. ATA compatible - Disk-On-Chip + &diskonchip; &prompt.root; disklabel -B -r /dev/ad0c &prompt.root; newfs /dev/ad0a &prompt.root; disklabel -B -r /dev/fla0c &prompt.root; newfs /dev/fla0a Placing your filesystem on the flash media Mount the newly prepared flash media: ATA compatible - Disk-On-Chip + &diskonchip; &prompt.root; mount /dev/ad0a /flash &prompt.root; mount /dev/fla0a /flash Bring this machine up on the network so we may transfer our tar file and explode it onto our flash media filesystem. One example of how to do this is: &prompt.root; ifconfig xl0 192.168.0.10 netmask 255.255.255.0 &prompt.root; route add default 192.168.0.1 Now that the machine is on the network, transfer your tar file. You may be faced with a bit of a dilemma at this point - if your flash memory part is 128 megabytes, for instance, and your tar file is larger than 64 megabytes, you cannot have your tar file on the flash media at the same time as you explode it - you will run out of space. One solution to this problem, if you are using FTP, is to untar the file while it is transferred over FTP. If you perform your transfer in this manner, you will never have the tar file and the tar contents on your disk at the same time: ftp> get tarfile.tar "| tar xvf -" If your tarfile is gzipped, you can accomplish this as well: ftp> get tarfile.tar "| zcat | tar xvf -" After the contents of your tarred filesystem are on your flash memory filesystem, you can unmount the flash memory and reboot: &prompt.root; cd / &prompt.root; umount /flash &prompt.root; exit Assuming that you configured your filesystem correctly when it was built on the normal hard disk (with your filesystems mounted read-only, and with the necessary options compiled into the kernel) you should now be successfully booting your FreeBSD embedded system. Building a <filename>kern.flp</filename> Installation Floppy with the fla Driver This section of the article is relevant only to those using - M-Systems Disk-On-Chip flash media. + M-Systems &diskonchip; flash media. It is possible that your kern.flp boot floppy does not have a kernel with the fla driver - compiled into it necessary for the system to recognize the Disk-On-Chip. + compiled into it necessary for the system to recognize the &diskonchip;. 
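Building a suitably small kernel with the fla driver follows the usual kernel build procedure. A minimal sketch, assuming a pared-down configuration file named FLA (the name is purely illustrative) that contains the device fla0 at isa? line shown in the kernel options section:

&prompt.root; cd /sys/i386/conf
&prompt.root; config FLA
&prompt.root; cd ../../compile/FLA
&prompt.root; make depend
&prompt.root; make

Strip as many options and drivers from the configuration as possible so that the resulting kernel fits on the floppy.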
If you have booted off of the installation floppies and are told that no disks are present, then you are probably lacking the fla driver in your kernel. After you have built a kernel with fla support that is smaller than 1.4 megabytes, you can create a custom kern.flp floppy image with it by following these instructions: Obtain an existing kern.flp image file &prompt.root; vnconfig vn0c kern.flp &prompt.root; mount /dev/vn0c /mnt Place your kernel file into /mnt, replacing the existing one &prompt.root; vnconfig -d vn0c Your kern.flp file now has your new kernel on it. System Strategies for Small and Read Only Environments In , it was pointed out that the /var filesystem constructed by /etc/rc.diskless2 and the presence of a read-only root filesystem causes problems with many common software packages used with FreeBSD. In this article, suggestions for successfully running cron, syslog, ports installations, and the Apache web server will be provided. cron In /etc/rc.diskless2 there is a variable named var_dirs. This variable consists of a space-delimited list of directories that will be created inside of /var after it is mounted as a memory filesystem. cron and cron/tabs are not in that list, and without those directories, cron will complain. By inserting cron, cron/tabs, and perhaps even at, and at/jobs as elements of that variable, you will facilitate the running of the &man.cron.8; and &man.at.1; daemons. However, this still does not solve the problem of maintaining cron tabs across reboots. When the system reboots, the /var filesystem that is in memory will disappear and any cron tabs you may have had in it will also disappear. Therefore, one solution would be to create cron tabs for the users that need them, mount your / filesystem as read-write and copy those cron tabs to somewhere safe, like /etc/tabs, then add a line to the end of /etc/rc.diskless2 that copies those crontabs into /var/cron/tabs after that directory has been created during system initialization. You may also need to add a line that changes modes and permissions on the directories you create and the files you copy with /etc/rc.diskless2. syslog syslog.conf specifies the locations of certain log files that exist in /var/log. These files are not created by /etc/rc.diskless2 upon system initialization. Therefore, somewhere in /etc/rc.diskless2, after the section that creates the directories in /var, you will need to add something like this: &prompt.root; touch /var/log/security /var/log/maillog /var/log/cron /var/log/messages &prompt.root; chmod 0644 /var/log/* You will also need to add the log directory to the list of directories that /etc/rc.diskless2 creates. ports installation Before discussing the changes necessary to successfully use the ports tree, a reminder is necessary regarding the read-only nature of your filesystems on the flash media. Since they are read-only, you will need to temporarily mount them read-write using the mount syntax shown in . You should always remount those filesystems read-only when you are done with any maintenance - unnecessary writes to the flash media could considerably shorten its lifespan. To make it possible to enter a ports directory and successfully run make install, it is necessary for the file /var/db/port.mkversion to exist, and that it has a correct date in it. Further, we must create a packages directory on a non-memory filesystem that will keep track of our packages across reboots. 
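Once the port.mkversion and package database adjustments described below are in place, a typical maintenance session for installing a port might look like the following sketch (the port directory is purely illustrative):

&prompt.root; /sbin/mount -uw /
&prompt.root; cd /usr/ports/category/someport
&prompt.root; make install
&prompt.root; /sbin/mount -ur /

Note the final remount to read-only, per the warning above.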
Because it is necessary to mount your filesystems as read-write for the installation of a package anyway, it is sensible to assume that an area on the flash media can also be used for package information to be written to. First, create a package database directory. This is normally in /var/db/pkg, but we cannot place it there as it will disappear every time the system is booted. &prompt.root; mkdir /etc/pkg Now, add a line to /etc/rc.diskless2 that links the /etc/pkg directory to /var/db/pkg. An example: &prompt.root; ln -s /etc/pkg /var/db/pkg Add another line in /etc/rc.diskless2 that creates and populates /var/db/port.mkversion &prompt.root; touch /var/db/port.mkversion &prompt.root; chmod 0644 /var/db/port.mkversion &prompt.root; echo 20010412 >> /var/db/port.mkversion where 20010412 is a date that is appropriate for your particular release of FreeBSD Now, any time that you mount your filesystems as read-write and install a package, the make install will work because it finds a suitable /var/db/port.mkversion, and package information will be written successfully to /etc/pkg (because the filesystem will, at that time, be mounted read-write) which will always be available to the operating system as /var/db/pkg. Apache Web Server Apache keeps pid files and logs in apache_install/logs. Since this directory doubtless exists on a read-only filesystem, this will not work. It is necessary to add a new directory to the /etc/rc.diskless2 list of directories to create in /var, to link apache_install/logs to /var/log/apache. It is also necessary to set permissions and ownership on this new directory. First, add the directory log/apache to the list of directories to be created in /etc/rc.diskless2. Second, add these commands to /etc/rc.diskless2 after the directory creation section: &prompt.root; chmod 0774 /var/log/apache &prompt.root; chown nobody:nobody /var/log/apache Finally, remove the existing apache_install/logs directory, and replace it with a link: &prompt.root; rm -rf (apache_install)/logs &prompt.root; ln -s /var/log/apache (apache_install)/logs
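Pulling the suggestions from this section together, the lines appended to /etc/rc.diskless2 (after its directory creation section) might end up looking roughly like the following sketch; treat it as illustrative and adapt the paths, permissions, and date to your own setup:

# restore saved crontabs into the freshly created /var/cron/tabs
cp -p /etc/tabs/* /var/cron/tabs
# create the log files syslogd expects
touch /var/log/security /var/log/maillog /var/log/cron /var/log/messages
chmod 0644 /var/log/*
# Apache log directory (assumes log/apache was added to var_dirs)
chmod 0774 /var/log/apache
chown nobody:nobody /var/log/apache
# persistent package database and port.mkversion
ln -s /etc/pkg /var/db/pkg
touch /var/db/port.mkversion
chmod 0644 /var/db/port.mkversion
echo 20010412 >> /var/db/port.mkversion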
diff --git a/en_US.ISO8859-1/articles/storage-devices/article.sgml b/en_US.ISO8859-1/articles/storage-devices/article.sgml index ad4d0c9ae1..cc7fd0cafa 100644 --- a/en_US.ISO8859-1/articles/storage-devices/article.sgml +++ b/en_US.ISO8859-1/articles/storage-devices/article.sgml @@ -1,2639 +1,2647 @@ %man; %authors; + +%trademarks; ]>
Storage Devices Wilko Bulte
wilko@FreeBSD.org
$FreeBSD$ + + + &tm-attrib.freebsd; + &tm-attrib.general; + + This article covers the use of storage devices with FreeBSD.
Using ESDI hard disks Copyright © 1995, &a.wilko;. 24 September 1995. ESDI is an acronym that means Enhanced Small Device Interface. It is loosely based on the good old ST506/412 interface originally devised by Seagate Technology, the makers of the first affordable 5.25" winchester disk. The acronym says Enhanced, and rightly so. In the first place the speed of the interface is higher, 10 or 15 Mbits/second instead of the 5 Mbits/second of ST412 interfaced drives. Secondly some higher level commands are added, making the ESDI interface somewhat smarter to the operating system driver writers. It is by no means as smart as SCSI by the way. ESDI is standardized by ANSI. Capacities of the drives are boosted by putting more sectors on each track. Typical is 35 sectors per track, high capacity drives I have seen were up to 54 sectors/track. Although ESDI has been largely obsoleted by IDE and SCSI interfaces, the availability of free or cheap surplus drives makes them ideal for low (or now) budget systems. Concepts of ESDI Physical connections The ESDI interface uses two cables connected to each drive. One cable is a 34 pin flat cable edge connector that carries the command and status signals from the controller to the drive and vice-versa. The command cable is daisy chained between all the drives. So, it forms a bus onto which all drives are connected. The second cable is a 20 pin flat cable edge connector that carries the data to and from the drive. This cable is radially connected, so each drive has its own direct connection to the controller. To the best of my knowledge PC ESDI controllers are limited to using a maximum of 2 drives per controller. This is compatibility feature(?) left over from the WD1003 standard that reserves only a single bit for device addressing. Device addressing On each command cable a maximum of 7 devices and 1 controller can be present. To enable the controller to uniquely identify which drive it addresses, each ESDI device is equipped with jumpers or switches to select the devices address. On PC type controllers the first drive is set to address 0, the second disk to address 1. Always make sure you set each disk to an unique address! So, on a PC with its two drives/controller maximum the first drive is drive 0, the second is drive 1. Termination The daisy chained command cable (the 34 pin cable remember?) needs to be terminated at the last drive on the chain. For this purpose ESDI drives come with a termination resistor network that can be removed or disabled by a jumper when it is not used. So, one and only one drive, the one at the farthest end of the command cable has its terminator installed/enabled. The controller automatically terminates the other end of the cable. Please note that this implies that the controller must be at one end of the cable and not in the middle. Using ESDI disks with FreeBSD Why is ESDI such a pain to get working in the first place? People who tried ESDI disks with FreeBSD are known to have developed a profound sense of frustration. A combination of factors works against you to produce effects that are hard to understand when you have never seen them before. This has also led to the popular legend ESDI and FreeBSD is a plain NO-GO. The following sections try to list all the pitfalls and solutions. ESDI speed variants As briefly mentioned before, ESDI comes in two speed flavors. The older drives and controllers use a 10 Mbits/second data transfer rate. Newer stuff uses 15 Mbits/second. 
It is not hard to imagine that 15 Mbits/second drive cause problems on controllers laid out for 10 Mbits/second. As always, consult your controller and drive documentation to see if things match. Stay on track Mainstream ESDI drives use 34 to 36 sectors per track. Most (older) controllers cannot handle more than this number of sectors. Newer, higher capacity, drives use higher numbers of sectors per track. For instance, I own a 670 MB drive that has 54 sectors per track. In my case, the controller could not handle this number of sectors. It proved to work well except that it only used 35 sectors on each track. This meant losing a lot of disk space. Once again, check the documentation of your hardware for more info. Going out-of-spec like in the example might or might not work. Give it a try or get another more capable controller. Hard or soft sectoring Most ESDI drives allow hard or soft sectoring to be selected using a jumper. Hard sectoring means that the drive will produce a sector pulse on the start of each new sector. The controller uses this pulse to tell when it should start to write or read. Hard sectoring allows a selection of sector size (normally 256, 512 or 1024 bytes per formatted sector). FreeBSD uses 512 byte sectors. The number of sectors per track also varies while still using the same number of bytes per formatted sector. The number of unformatted bytes per sector varies, dependent on your controller it needs more or less overhead bytes to work correctly. Pushing more sectors on a track of course gives you more usable space, but might give problems if your controller needs more bytes than the drive offers. In case of soft sectoring, the controller itself determines where to start/stop reading or writing. For ESDI hard sectoring is the default (at least on everything I came across). I never felt the urge to try soft sectoring. In general, experiment with sector settings before you install FreeBSD because you need to re-run the low-level format after each change. Low level formatting ESDI drives need to be low level formatted before they are usable. A reformat is needed whenever you figgle with the number of sectors/track jumpers or the physical orientation of the drive (horizontal, vertical). So, first think, then format. The format time must not be underestimated, for big disks it can take hours. After a low level format, a surface scan is done to find and flag bad sectors. Most disks have a manufacturer bad block list listed on a piece of paper or adhesive sticker. In addition, on most disks the list is also written onto the disk. Please use the manufacturer's list. It is much easier to remap a defect now than after FreeBSD is installed. Stay away from low-level formatters that mark all sectors of a track as bad as soon as they find one bad sector. Not only does this waste space, it also and more importantly causes you grief with bad144 (see the section on bad144). Translations Translations, although not exclusively a ESDI-only problem, might give you real trouble. Translations come in multiple flavors. Most of them have in common that they attempt to work around the limitations posed upon disk geometries by the original IBM PC/AT design (thanks IBM!). First of all there is the (in)famous 1024 cylinder limit. For a system to be able to boot, the stuff (whatever operating system) must be in the first 1024 cylinders of a disk. Only 10 bits are available to encode the cylinder number. For the number of sectors the limit is 64 (0-63). 
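Taken together with the 16 head limit of the original interface (discussed next), a quick back-of-the-envelope calculation shows how small a bootable disk had to be (using 63 usable sectors per track and 512 byte sectors):

1024 cylinders x 16 heads x 63 sectors x 512 bytes = 528,482,304 bytes, roughly 504 MB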
When you combine the 1024 cylinder limit with the 16 head limit (also a design feature) you max out at fairly limited disk sizes. To work around this problem, the manufacturers of ESDI PC controllers added a BIOS prom extension on their boards. This BIOS extension handles disk I/O for booting (and for some operating systems all disk I/O) by using translation. For instance, a big drive might be presented to the system as having 32 heads and 64 sectors/track. The result is that the number of cylinders is reduced to something below 1024 and is therefore usable by the system without problems. It is noteworthy to know that FreeBSD does not use the BIOS after its kernel has started. More on this later. A second reason for translations is the fact that most older system BIOSes could only handle drives with 17 sectors per track (the old ST412 standard). Newer system BIOSes usually have a user-defined drive type (in most cases this is drive type 47). Whatever you do to translations after reading this document, keep in mind that if you have multiple operating systems on the same disk, all must use the same translation While on the subject of translations, I have seen one controller type (but there are probably more like this) offer the option to logically split a drive in multiple partitions as a BIOS option. I had select 1 drive == 1 partition because this controller wrote this info onto the disk. On power-up it read the info and presented itself to the system based on the info from the disk. Spare sectoring Most ESDI controllers offer the possibility to remap bad sectors. During/after the low-level format of the disk bad sectors are marked as such, and a replacement sector is put in place (logically of course) of the bad one. In most cases the remapping is done by using N-1 sectors on each track for actual data storage, and sector N itself is the spare sector. N is the total number of sectors physically available on the track. The idea behind this is that the operating system sees a perfect disk without bad sectors. In the case of FreeBSD this concept is not usable. The problem is that the translation from bad to good is performed by the BIOS of the ESDI controller. FreeBSD, being a true 32 bit operating system, does not use the BIOS after it has been booted. Instead, it has device drivers that talk directly to the hardware. So: do not use spare sectoring, bad block remapping or whatever it may be called by the controller manufacturer when you want to use the disk for FreeBSD. Bad block handling The preceding section leaves us with a problem. The controller's bad block handling is not usable and still FreeBSD's filesystems assume perfect media without any flaws. To solve this problem, FreeBSD use the bad144 tool. Bad144 (named after a Digital Equipment standard for bad block handling) scans a FreeBSD slice for bad blocks. Having found these bad blocks, it writes a table with the offending block numbers to the end of the FreeBSD slice. When the disk is in operation, the disk accesses are checked against the table read from the disk. Whenever a block number is requested that is in the bad144 list, a replacement block (also from the end of the FreeBSD slice) is used. In this way, the bad144 replacement scheme presents perfect media to the FreeBSD filesystems. There are a number of potential pitfalls associated with the use of bad144. First of all, the slice cannot have more than 126 bad sectors. 
If your drive has a high number of bad sectors, you might need to divide it into multiple FreeBSD slices each containing less than 126 bad sectors. Stay away from low-level format programs that mark every sector of a track as bad when they find a flaw on the track. As you can imagine, the 126 limit is quickly reached when the low-level format is done this way. Second, if the slice contains the root filesystem, the slice should be within the 1024 cylinder BIOS limit. During the boot process the bad144 list is read using the BIOS and this only succeeds when the list is within the 1024 cylinder limit. The restriction is not that only the root filesystem must be within the 1024 cylinder limit, but rather the entire slice that contains the root filesystem. Kernel configuration ESDI disks are handled by the same wddriver as IDE and ST412 MFM disks. The wd driver should work for all WD1003 compatible interfaces. Most hardware is jumperable for one of two different I/O address ranges and IRQ lines. This allows you to have two wd type controllers in one system. When your hardware allows non-standard strappings, you can use these with FreeBSD as long as you enter the correct info into the kernel config file. An example from the kernel config file (they live in /sys/i386/conf BTW). # First WD compatible controller controller wdc0 at isa? port "IO_WD1" bio irq 14 vector wdintr disk wd0 at wdc0 drive 0 disk wd1 at wdc0 drive 1 # Second WD compatible controller controller wdc1 at isa? port "IO_WD2" bio irq 15 vector wdintr disk wd2 at wdc1 drive 0 disk wd3 at wdc1 drive 1 Particulars on ESDI hardware Adaptec 2320 controllers I successfully installed FreeBSD onto a ESDI disk controlled by a ACB-2320. No other operating system was present on the disk. To do so I low level formatted the disk using NEFMT.EXE (ftpable from www.adaptec.com) and answered NO to the question whether the disk should be formatted with a spare sector on each track. The BIOS on the ACD-2320 was disabled. I used the free configurable option in the system BIOS to allow the BIOS to boot it. Before using NEFMT.EXE I tried to format the disk using the ACB-2320 BIOS built-in formatter. This proved to be a show stopper, because it did not give me an option to disable spare sectoring. With spare sectoring enabled the FreeBSD installation process broke down on the bad144 run. Please check carefully which ACB-232xy variant you have. The x is either 0 or 2, indicating a controller without or with a floppy controller on board. The y is more interesting. It can either be a blank, a A-8 or a D. A blank indicates a plain 10 Mbits/second controller. An A-8 indicates a 15 Mbits/second controller capable of handling 52 sectors/track. A D means a 15 Mbits/second controller that can also handle drives with > 36 sectors/track (also 52?). All variations should be capable of using 1:1 interleaving. Use 1:1, FreeBSD is fast enough to handle it. Western Digital WD1007 controllers I successfully installed FreeBSD onto a ESDI disk controlled by a WD1007 controller. To be precise, it was a WD1007-WA2. Other variations of the WD1007 do exist. To get it to work, I had to disable the sector translation and the WD1007's onboard BIOS. This implied I could not use the low-level formatter built into this BIOS. Instead, I grabbed WDFMT.EXE from www.wdc.com Running this formatted my drive just fine. Ultrastor U14F controllers According to multiple reports from the net, Ultrastor ESDI boards work OK with FreeBSD. I lack any further info on particular settings. 
Further reading If you intend to do some serious ESDI hacking, you might want to have the official standard at hand: The latest ANSI X3T10 committee document is: Enhanced Small Device Interface (ESDI) [X3.170-1990/X3.170a-1991] [X3T10/792D Rev 11] On Usenet the newsgroup comp.periphs is a noteworthy place to look for more info. The World Wide Web (WWW) also proves to be a very handy info source: For info on Adaptec ESDI controllers see . For info on Western Digital controllers see . Thanks to... Andrew Gordon for sending me an Adaptec 2320 controller and ESDI disk for testing. What is SCSI? Copyright © 1995, &a.wilko;. July 6, 1996. SCSI is an acronym for Small Computer Systems Interface. It is an ANSI standard that has become one of the leading I/O buses in the computer industry. The foundation of the SCSI standard was laid by Shugart Associates (the same guys that gave the world the first mini floppy disks) when they introduced the SASI bus (Shugart Associates Standard Interface). After some time an industry effort was started to come to a more strict standard allowing devices from different vendors to work together. This effort was recognized in the ANSI SCSI-1 standard. The SCSI-1 standard (approximately 1985) is rapidly becoming obsolete. The current standard is SCSI-2 (see Further reading), with SCSI-3 on the drawing boards. In addition to a physical interconnection standard, SCSI defines a logical (command set) standard to which disk devices must adhere. This standard is called the Common Command Set (CCS) and was developed more or less in parallel with ANSI SCSI-1. SCSI-2 includes the (revised) CCS as part of the standard itself. The commands are dependent on the type of device at hand. It does not make much sense of course to define a Write command for a scanner. The SCSI bus is a parallel bus, which comes in a number of variants. The oldest and most used is an 8 bit wide bus, with single-ended signals, carried on 50 wires. (If you do not know what single-ended means, do not worry, that is what this document is all about.) Modern designs also use 16 bit wide buses, with differential signals. This allows transfer speeds of 20Mbytes/second, on cables lengths of up to 25 meters. SCSI-2 allows a maximum bus width of 32 bits, using an additional cable. Quickly emerging are Ultra SCSI (also called Fast-20) and Ultra2 (also called Fast-40). Fast-20 is 20 million transfers per second (20 Mbytes/sec on a 8 bit bus), Fast-40 is 40 million transfers per second (40 Mbytes/sec on a 8 bit bus). Most hard drives sold today are single-ended Ultra SCSI (8 or 16 bits). Of course the SCSI bus not only has data lines, but also a number of control signals. A very elaborate protocol is part of the standard to allow multiple devices to share the bus in an efficient manner. In SCSI-2, the data is always checked using a separate parity line. In pre-SCSI-2 designs parity was optional. In SCSI-3 even faster bus types are introduced, along with a serial SCSI busses that reduces the cabling overhead and allows a higher maximum bus length. You might see names like SSA and fibre channel in this context. None of the serial buses are currently in widespread use (especially not in the typical FreeBSD environment). For this reason the serial bus types are not discussed any further. As you could have guessed from the description above, SCSI devices are intelligent. They have to be to adhere to the SCSI standard (which is over 2 inches thick BTW). 
So, for a hard disk drive for instance you do not specify a head/cylinder/sector to address a particular block, but simply the number of the block you want. Elaborate caching schemes, automatic bad block replacement etc are all made possible by this intelligent device approach. On a SCSI bus, each possible pair of devices can communicate. Whether their function allows this is another matter, but the standard does not restrict it. To avoid signal contention, the 2 devices have to arbitrate for the bus before using it. The philosophy of SCSI is to have a standard that allows older-standard devices to work with newer-standard ones. So, an old SCSI-1 device should normally work on a SCSI-2 bus. I say Normally, because it is not absolutely sure that the implementation of an old device follows the (old) standard closely enough to be acceptable on a new bus. Modern devices are usually more well-behaved, because the standardization has become more strict and is better adhered to by the device manufacturers. Generally speaking, the chances of getting a working set of devices on a single bus is better when all the devices are SCSI-2 or newer. This implies that you do not have to dump all your old stuff when you get that shiny 80GB disk: I own a system on which a pre-SCSI-1 disk, a SCSI-2 QIC tape unit, a SCSI-1 helical scan tape unit and 2 SCSI-1 disks work together quite happily. From a performance standpoint you might want to separate your older and newer (=faster) devices however. This is especially advantageous if you have an Ultra160 host adapter where you should separate your U160 devices from the Fast and Wide SCSI-2 devices. Components of SCSI As said before, SCSI devices are smart. The idea is to put the knowledge about intimate hardware details onto the SCSI device itself. In this way, the host system does not have to worry about things like how many heads a hard disks has, or how many tracks there are on a specific tape device. If you are curious, the standard specifies commands with which you can query your devices on their hardware particulars. FreeBSD uses this capability during boot to check out what devices are connected and whether they need any special treatment. The advantage of intelligent devices is obvious: the device drivers on the host can be made in a much more generic fashion, there is no longer a need to change (and qualify!) drivers for every odd new device that is introduced. For cabling and connectors there is a golden rule: get good stuff. With bus speeds going up all the time you will save yourself a lot of grief by using good material. So, gold plated connectors, shielded cabling, sturdy connector hoods with strain reliefs etc are the way to go. Second golden rule: do no use cables longer than necessary. I once spent 3 days hunting down a problem with a flaky machine only to discover that shortening the SCSI bus by 1 meter solved the problem. And the original bus length was well within the SCSI specification. SCSI bus types From an electrical point of view, there are two incompatible bus types: single-ended and differential. This means that there are two different main groups of SCSI devices and controllers, which cannot be mixed on the same bus. It is possible however to use special converter hardware to transform a single-ended bus into a differential one (and vice versa). The differences between the bus types are explained in the next sections. In lots of SCSI related documentation there is a sort of jargon in use to abbreviate the different bus types. 
A small list: FWD: Fast Wide Differential FND: Fast Narrow Differential SE: Single Ended FN: Fast Narrow etc. With a minor amount of imagination one can usually imagine what is meant. Wide is a bit ambiguous, it can indicate 16 or 32 bit buses. As far as I know, the 32 bit variant is not (yet) in use, so wide normally means 16 bit. Fast means that the timing on the bus is somewhat different, so that on a narrow (8 bit) bus 10 Mbytes/sec are possible instead of 5 Mbytes/sec for slow SCSI. As discussed before, bus speeds of 20 and 40 million transfers/second are also emerging (Fast-20 == Ultra SCSI and Fast-40 == Ultra2 SCSI). The data lines > 8 are only used for data transfers and device addressing. The transfers of commands and status messages etc are only performed on the lowest 8 data lines. The standard allows narrow devices to operate on a wide bus. The usable bus width is negotiated between the devices. You have to watch your device addressing closely when mixing wide and narrow. Single ended buses A single-ended SCSI bus uses signals that are either 5 Volts or 0 Volts (indeed, TTL levels) and are relative to a COMMON ground reference. A singled ended 8 bit SCSI bus has approximately 25 ground lines, who are all tied to a single rail on all devices. A standard single ended bus has a maximum length of 6 meters. If the same bus is used with fast-SCSI devices, the maximum length allowed drops to 3 meters. Fast-SCSI means that instead of 5Mbytes/sec the bus allows 10Mbytes/sec transfers. Fast-20 (Ultra SCSI) and Fast-40 allow for 20 and 40 million transfers/second respectively. So, F20 is 20 Mbytes/second on a 8 bit bus, 40 Mbytes/second on a 16 bit bus etc. For F20 the max bus length is 1.5 meters, for F40 it becomes 0.75 meters. Be aware that F20 is pushing the limits quite a bit, so you will quickly find out if your SCSI bus is electrically sound. If some devices on your bus use fast to communicate your bus must adhere to the length restrictions for fast buses! It is obvious that with the newer fast-SCSI devices the bus length can become a real bottleneck. This is why the differential SCSI bus was introduced in the SCSI-2 standard. For connector pinning and connector types please refer to the SCSI-2 standard (see Further reading) itself, connectors etc are listed there in painstaking detail. Beware of devices using non-standard cabling. For instance Apple uses a 25pin D-type connecter (like the one on serial ports and parallel printers). Considering that the official SCSI bus needs 50 pins you can imagine the use of this connector needs some creative cabling. The reduction of the number of ground wires they used is a bad idea, you better stick to 50 pins cabling in accordance with the SCSI standard. For Fast-20 and 40 do not even think about buses like this. Differential buses A differential SCSI bus has a maximum length of 25 meters. Quite a difference from the 3 meters for a single-ended fast-SCSI bus. The idea behind differential signals is that each bus signal has its own return wire. So, each signal is carried on a (preferably twisted) pair of wires. The voltage difference between these two wires determines whether the signal is asserted or de-asserted. To a certain extent the voltage difference between ground and the signal wire pair is not relevant (do not try 10 kVolts though). It is beyond the scope of this document to explain why this differential idea is so much better. Just accept that electrically seen the use of differential signals gives a much better noise margin. 
You will normally find differential buses in use for inter-cabinet connections. Because of the lower cost single ended is mostly used for shorter buses like inside cabinets. There is nothing that stops you from using differential stuff with FreeBSD, as long as you use a controller that has device driver support in FreeBSD. As an example, Adaptec marketed the AHA1740 as a single ended board, whereas the AHA1744 was differential. The software interface to the host is identical for both. Terminators Terminators in SCSI terminology are resistor networks that are used to get a correct impedance matching. Impedance matching is important to get clean signals on the bus, without reflections or ringing. If you once made a long distance telephone call on a bad line you probably know what reflections are. With 20Mbytes/sec traveling over your SCSI bus, you do not want signals echoing back. Terminators come in various incarnations, with more or less sophisticated designs. Of course, there are internal and external variants. Many SCSI devices come with a number of sockets in which a number of resistor networks can (must be!) installed. If you remove terminators from a device, carefully store them. You will need them when you ever decide to reconfigure your SCSI bus. There is enough variation in even these simple tiny things to make finding the exact replacement a frustrating business. There are also SCSI devices that have a single jumper to enable or disable a built-in terminator. There are special terminators you can stick onto a flat cable bus. Others look like external connectors, or a connector hood without a cable. So, lots of choice as you can see. There is much debate going on if and when you should switch from simple resistor (passive) terminators to active terminators. Active terminators contain slightly more elaborate circuit to give cleaner bus signals. The general consensus seems to be that the usefulness of active termination increases when you have long buses and/or fast devices. If you ever have problems with your SCSI buses you might consider trying an active terminator. Try to borrow one first, they reputedly are quite expensive. Please keep in mind that terminators for differential and single-ended buses are not identical. You should not mix the two variants. OK, and now where should you install your terminators? This is by far the most misunderstood part of SCSI. And it is by far the simplest. The rule is: every single line on the SCSI bus has 2 (two) terminators, one at each end of the bus. So, two and not one or three or whatever. Do yourself a favor and stick to this rule. It will save you endless grief, because wrong termination has the potential to introduce highly mysterious bugs. (Note the potential here; the nastiest part is that it may or may not work.) A common pitfall is to have an internal (flat) cable in a machine and also an external cable attached to the controller. It seems almost everybody forgets to remove the terminators from the controller. The terminator must now be on the last external device, and not on the controller! In general, every reconfiguration of a SCSI bus must pay attention to this. Termination is to be done on a per-line basis. This means if you have both narrow and wide buses connected to the same host adapter, you need to enable termination on the higher 8 bits of the bus on the adapter (as well as the last devices on each bus, of course). What I did myself is remove all terminators from my SCSI devices and controllers. 
I own a couple of external terminators, for both the Centronics-type external cabling and for the internal flat cable connectors. This makes reconfiguration much easier. On modern devices, sometimes integrated terminators are used. These things are special purpose integrated circuits that can be enabled or disabled with a control pin. It is not necessary to physically remove them from a device. You may find them on newer host adapters, sometimes they are software configurable, using some sort of setup tool. Some will even auto-detect the cables attached to the connectors and automatically set up the termination as necessary. At any rate, consult your documentation! Terminator power The terminators discussed in the previous chapter need power to operate properly. On the SCSI bus, a line is dedicated to this purpose. So, simple huh? Not so. Each device can provide its own terminator power to the terminator sockets it has on-device. But if you have external terminators, or when the device supplying the terminator power to the SCSI bus line is switched off you are in trouble. The idea is that initiators (these are devices that initiate actions on the bus, a discussion follows) must supply terminator power. All SCSI devices are allowed (but not required) to supply terminator power. To allow for un-powered devices on a bus, the terminator power must be supplied to the bus via a diode. This prevents the backflow of current to un-powered devices. To prevent all kinds of nastiness, the terminator power is usually fused. As you can imagine, fuses might blow. This can, but does not have to, lead to a non functional bus. If multiple devices supply terminator power, a single blown fuse will not put you out of business. A single supplier with a blown fuse certainly will. Clever external terminators sometimes have a LED indication that shows whether terminator power is present. In newer designs auto-restoring fuses that reset themselves after some time are sometimes used. Device addressing Because the SCSI bus is, ehh, a bus there must be a way to distinguish or address the different devices connected to it. This is done by means of the SCSI or target ID. Each device has a unique target ID. You can select the ID to which a device must respond using a set of jumpers, or a dip switch, or something similar. Some SCSI host adapters let you change the target ID from the boot menu. (Yet some others will not let you change the ID from 7.) Consult the documentation of your device for more information. Beware of multiple devices configured to use the same ID. Chaos normally reigns in this case. A pitfall is that one of the devices sharing the same ID sometimes even manages to answer to I/O requests! For an 8 bit bus, a maximum of 8 targets is possible. The maximum is 8 because the selection is done bitwise using the 8 data lines on the bus. For wide buses this increases to the number of data lines (usually 16). A narrow SCSI device can not communicate with a SCSI device with a target ID larger than 7. This means it is generally not a good idea to move your SCSI host adapter's target ID to something higher than 7 (or your CDROM will stop working). The higher the SCSI target ID, the higher the priority the devices has. When it comes to arbitration between devices that want to use the bus at the same time, the device that has the highest SCSI ID will win. This also means that the SCSI host adapter usually uses target ID 7. 
Note however that the lower 8 IDs have higher priorities than the higher 8 IDs on a wide-SCSI bus. Thus, the order of target IDs is: [7 6 .. 1 0 15 14 .. 9 8] on a wide-SCSI system. (If you are wondering why the lower 8 have higher priority, read the previous paragraph for a hint.) For a further subdivision, the standard allows for Logical Units or LUNs for short. A single target ID may have multiple LUNs. For example, a tape device including a tape changer may have LUN 0 for the tape device itself, and LUN 1 for the tape changer. In this way, the host system can address each of the functional units of the tape changer as desired. Bus layout SCSI buses are linear. So, not shaped like Y-junctions, star topologies, rings, cobwebs or whatever else people might want to invent. One of the most common mistakes is for people with wide-SCSI host adapters to connect devices on all three connecters (external connector, internal wide connector, internal narrow connector). Do not do that. It may appear to work if you are really lucky, but I can almost guarantee that your system will stop functioning at the most unfortunate moment (this is also known as Murphy's law). You might notice that the terminator issue discussed earlier becomes rather hairy if your bus is not linear. Also, if you have more connectors than devices on your internal SCSI cable, make sure you attach devices on connectors on both ends instead of using the connectors in the middle and let one or both ends dangle. This will screw up the termination of the bus. The electrical characteristics, its noise margins and ultimately the reliability of it all are tightly related to linear bus rule. Stick to the linear bus rule! Using SCSI with FreeBSD About translations, BIOSes and magic... As stated before, you should first make sure that you have a electrically sound bus. When you want to use a SCSI disk on your PC as boot disk, you must aware of some quirks related to PC BIOSes. The PC BIOS in its first incarnation used a low level physical interface to the hard disk. So, you had to tell the BIOS (using a setup tool or a BIOS built-in setup) how your disk physically looked like. This involved stating number of heads, number of cylinders, number of sectors per track, obscure things like precompensation and reduced write current cylinder etc. One might be inclined to think that since SCSI disks are smart you can forget about this. Alas, the arcane setup issue is still present today. The system BIOS needs to know how to access your SCSI disk with the head/cyl/sector method in order to load the FreeBSD kernel during boot. The SCSI host adapter or SCSI controller you have put in your AT/EISA/PCI/whatever bus to connect your disk therefore has its own on-board BIOS. During system startup, the SCSI BIOS takes over the hard disk interface routines from the system BIOS. To fool the system BIOS, the system setup is normally set to No hard disk present. Obvious, is it not? The SCSI BIOS itself presents to the system a so called translated drive. This means that a fake drive table is constructed that allows the PC to boot the drive. This translation is often (but not always) done using a pseudo drive with 64 heads and 32 sectors per track. By varying the number of cylinders, the SCSI BIOS adapts to the actual drive size. It is useful to note that 32 * 64 / 2 = the size of your drive in megabytes. The division by 2 is to get from disk blocks that are normally 512 bytes in size to Kbytes. Right. All is well now?! No, it is not. 
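Before moving on to the next quirk, it is worth spelling out the arithmetic behind that formula:

64 heads x 32 sectors x 512 bytes = 1,048,576 bytes, i.e. exactly 1 MB per translated cylinder

which is why the translated cylinder count equals the drive size in megabytes.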
The system BIOS has another quirk you might run into. The number of cylinders of a bootable hard disk cannot be greater than 1024. Using the translation above, this is a show-stopper for disks greater than 1 GB. With disk capacities going up all the time this is causing problems. Fortunately, the solution is simple: just use another translation, e.g. with 128 heads instead of 32. In most cases new SCSI BIOS versions are available to upgrade older SCSI host adapters. Some newer adapters have an option, in the form of a jumper or software setup selection, to switch the translation the SCSI BIOS uses. It is very important that all operating systems on the disk use the same translation to get the right idea about where to find the relevant partitions. So, when installing FreeBSD you must answer any questions about heads/cylinders etc using the translated values your host adapter uses. Failing to observe the translation issue might lead to un-bootable systems or operating systems overwriting each others partitions. Using fdisk you should be able to see all partitions. You might have heard some talk of lying devices? Older FreeBSD kernels used to report the geometry of SCSI disks when booting. An example from one of my systems: aha0 targ 0 lun 0: <MICROP 1588-15MB1057404HSP4> da0: 636MB (1303250 total sec), 1632 cyl, 15 head, 53 sec, bytes/sec 512 Newer kernels usually do not report this information. e.g. (bt0:0:0): "SEAGATE ST41651 7574" type 0 fixed SCSI 2 da0(bt0:0:0): Direct-Access 1350MB (2766300 512 byte sectors) Why has this changed? This info is retrieved from the SCSI disk itself. Newer disks often use a technique called zone bit recording. The idea is that on the outer cylinders of the drive there is more space so more sectors per track can be put on them. This results in disks that have more tracks on outer cylinders than on the inner cylinders and, last but not least, have more capacity. You can imagine that the value reported by the drive when inquiring about the geometry now becomes suspect at best, and nearly always misleading. When asked for a geometry, it is nearly always better to supply the geometry used by the BIOS, or if the BIOS is never going to know about this disk, (e.g. it is not a booting disk) to supply a fictitious geometry that is convenient. SCSI subsystem design FreeBSD uses a layered SCSI subsystem. For each different controller card a device driver is written. This driver knows all the intimate details about the hardware it controls. The driver has a interface to the upper layers of the SCSI subsystem through which it receives its commands and reports back any status. On top of the card drivers there are a number of more generic drivers for a class of devices. More specific: a driver for tape devices (abbreviation: sa, for serial access), magnetic disks (da, for direct access), CDROMs (cd) etc. In case you are wondering where you can find this stuff, it all lives in /sys/cam/scsi. See the man pages in section 4 for more details. The multi level design allows a decoupling of low-level bit banging and more high level stuff. Adding support for another piece of hardware is a much more manageable problem. Kernel configuration Dependent on your hardware, the kernel configuration file must contain one or more lines describing your host adapter(s). This includes I/O addresses, interrupts etc. Consult the manual page for your adapter driver to get more info. Apart from that, check out /sys/i386/conf/LINT for an overview of a kernel config file. 
LINT contains every possible option you can dream of. It does not imply LINT will actually get you to a working kernel at all. Although it is probably stating the obvious: the kernel config file should reflect your actual hardware setup. So, interrupts, I/O addresses etc must match the kernel config file. During system boot messages will be displayed to indicate whether the configured hardware was actually found. Note that most of the EISA/PCI drivers (namely ahb, ahc, ncr and amd will automatically obtain the correct parameters from the host adapters themselves at boot time; thus, you just need to write, for instance, controller ahc0. An example loosely based on the FreeBSD 2.2.5-Release kernel config file LINT with some added comments (between []): # SCSI host adapters: `aha', `ahb', `aic', `bt', `nca' # # aha: Adaptec 154x # ahb: Adaptec 174x # ahc: Adaptec 274x/284x/294x # aic: Adaptec 152x and sound cards using the Adaptec AIC-6360 (slow!) # amd: AMD 53c974 based SCSI cards (e.g., Tekram DC-390 and 390T) # bt: Most Buslogic controllers # nca: ProAudioSpectrum cards using the NCR 5380 or Trantor T130 # ncr: NCR/Symbios 53c810/815/825/875 etc based SCSI cards # uha: UltraStore 14F and 34F # sea: Seagate ST01/02 8 bit controller (slow!) # wds: Western Digital WD7000 controller (no scatter/gather!). # [For an Adaptec AHA274x/284x/294x/394x etc controller] controller ahc0 [For an NCR/Symbios 53c875 based controller] controller ncr0 [For an Ultrastor adapter] controller uha0 at isa? port "IO_UHA0" bio irq ? drq 5 vector uhaintr # Map SCSI buses to specific SCSI adapters controller scbus0 at ahc0 controller scbus2 at ncr0 controller scbus1 at uha0 # The actual SCSI devices disk da0 at scbus0 target 0 unit 0 [SCSI disk 0 is at scbus 0, LUN 0] disk da1 at scbus0 target 1 [implicit LUN 0 if omitted] disk da2 at scbus1 target 3 [SCSI disk on the uha0] disk da3 at scbus2 target 4 [SCSI disk on the ncr0] tape sa1 at scbus0 target 6 [SCSI tape at target 6] device cd0 at scbus? [the first ever CDROM found, no wiring] The example above tells the kernel to look for a ahc (Adaptec 274x) controller, then for an NCR/Symbios board, and so on. The lines following the controller specifications tell the kernel to configure specific devices but only attach them when they match the target ID and LUN specified on the corresponding bus. Wired down devices get first shot at the unit numbers so the first non wired down device, is allocated the unit number one greater than the highest wired down unit number for that kind of device. So, if you had a SCSI tape at target ID 2 it would be configured as sa2, as the tape at target ID 6 is wired down to unit number 1. Wired down devices need not be found to get their unit number. The unit number for a wired down device is reserved for that device, even if it is turned off at boot time. This allows the device to be turned on and brought on-line at a later time, without rebooting. Notice that a device's unit number has no relationship with its target ID on the SCSI bus. Below is another example of a kernel config file as used by FreeBSD version < 2.0.5. The difference with the first example is that devices are not wired down. Wired down means that you specify which SCSI target belongs to which device. A kernel built to the config file below will attach the first SCSI disk it finds to da0, the second disk to da1 etc. If you ever removed or added a disk, all other devices of the same type (disk in this case) would move around. 
This implies you have to change /etc/fstab each time. Although the old style still works, you are strongly recommended to use this new feature. It will save you a lot of grief whenever you shift your hardware around on the SCSI buses. So, when you re-use your old trusty config file after upgrading from a pre-FreeBSD2.0.5.R system check this out. [driver for Adaptec 174x] controller ahb0 at isa? bio irq 11 vector ahbintr [for Adaptec 154x] controller aha0 at isa? port "IO_AHA0" bio irq 11 drq 5 vector ahaintr [for Seagate ST01/02] controller sea0 at isa? bio irq 5 iomem 0xc8000 iosiz 0x2000 vector seaintr controller scbus0 device da0 [support for 4 SCSI harddisks, da0 up da3] device sa0 [support for 2 SCSI tapes] [for the CDROM] device cd0 #Only need one of these, the code dynamically grows Both examples support SCSI disks. If during boot more devices of a specific type (e.g. da disks) are found than are configured in the booting kernel, the system will simply allocate more devices, incrementing the unit number starting at the last number wired down. If there are no wired down devices then counting starts at unit 0. Use man 4 scsi to check for the latest info on the SCSI subsystem. For more detailed info on host adapter drivers use e.g., man 4 ahc for info on the Adaptec 294x driver. Tuning your SCSI kernel setup Experience has shown that some devices are slow to respond to INQUIRY commands after a SCSI bus reset (which happens at boot time). An INQUIRY command is sent by the kernel on boot to see what kind of device (disk, tape, CDROM etc.) is connected to a specific target ID. This process is called device probing by the way. To work around the slow response problem, FreeBSD allows a tunable delay time before the SCSI devices are probed following a SCSI bus reset. You can set this delay time in your kernel configuration file using a line like: options SCSI_DELAY=15 #Be pessimistic about Joe SCSI device This line sets the delay time to 15 seconds. On my own system I had to use 3 seconds minimum to get my trusty old CDROM drive to be recognized. Start with a high value (say 30 seconds or so) when you have problems with device recognition. If this helps, tune it back until it just stays working. Rogue SCSI devices Although the SCSI standard tries to be complete and concise, it is a complex standard and implementing things correctly is no easy task. Some vendors do a better job then others. This is exactly where the rogue devices come into view. Rogues are devices that are recognized by the FreeBSD kernel as behaving slightly (...) non-standard. Rogue devices are reported by the kernel when booting. An example for two of my cartridge tape units: Feb 25 21:03:34 yedi /kernel: ahb0 targ 5 lun 0: <TANDBERG TDC 3600 -06:> Feb 25 21:03:34 yedi /kernel: sa0: Tandberg tdc3600 is a known rogue Mar 29 21:16:37 yedi /kernel: aha0 targ 5 lun 0: <ARCHIVE VIPER 150 21247-005> Mar 29 21:16:37 yedi /kernel: sa1: Archive Viper 150 is a known rogue For instance, there are devices that respond to all LUNs on a certain target ID, even if they are actually only one device. It is easy to see that the kernel might be fooled into believing that there are 8 LUNs at that particular target ID. The confusion this causes is left as an exercise to the reader. The SCSI subsystem of FreeBSD recognizes devices with bad habits by looking at the INQUIRY response they send when probed. 
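If you are curious what a particular device actually returns, a CAM-based system will let you issue the INQUIRY yourself. A minimal sketch, assuming the device of interest happens to be da0:

# Print the vendor, product and revision strings da0 returns to INQUIRY.
&prompt.root; camcontrol inquiry da0

The same strings are what the workaround entries described below are matched against.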
Because the INQUIRY response also includes the version number of the device firmware, it is even possible that for different firmware versions different workarounds are used. See e.g. /sys/cam/scsi/scsi_sa.c and /sys/cam/scsi/scsi_all.c for more info on how this is done. This scheme works fine, but keep in mind that it of course only works for devices that are known to be weird. If you are the first to connect your bogus Mumbletech SCSI CDROM you might be the one that has to define which workaround is needed. After you got your Mumbletech working, please send the required workaround to the FreeBSD development team for inclusion in the next release of FreeBSD. Other Mumbletech owners will be grateful to you. Multiple LUN devices In some cases you come across devices that use multiple logical units (LUNs) on a single SCSI ID. In most cases FreeBSD only probes devices for LUN 0. An example are so called bridge boards that connect 2 non-SCSI hard disks to a SCSI bus (e.g. an Emulex MD21 found in old Sun systems). This means that any devices with LUNs != 0 are not normally found during device probe on system boot. To work around this problem you must add an appropriate entry in /sys/cam/scsi and rebuild your kernel. Look for a struct that is initialized like below: (FIXME: which file? Do these entries still exist in this form now that we use CAM?) { T_DIRECT, T_FIXED, "MAXTOR", "XT-4170S", "B5A", "mx1", SC_ONE_LU } For your Mumbletech BRIDGE2000 that has more than one LUN, acts as a SCSI disk and has firmware revision 123 you would add something like: { T_DIRECT, T_FIXED, "MUMBLETECH", "BRIDGE2000", "123", "da", SC_MORE_LUS } The kernel on boot scans the inquiry data it receives against the table and acts accordingly. See the source for more info. Tagged command queuing Modern SCSI devices, particularly magnetic disks, support what is called tagged command queuing (TCQ). In a nutshell, TCQ allows the device to have multiple I/O requests outstanding at the same time. Because the device is intelligent, it can optimize its operations (like head positioning) based on its own request queue. On SCSI devices like RAID (Redundant Array of Independent Disks) arrays the TCQ function is indispensable to take advantage of the device's inherent parallelism. Each I/O request is uniquely identified by a tag (hence the name tagged command queuing) and this tag is used by FreeBSD to see which I/O in the device drivers queue is reported as complete by the device. It should be noted however that TCQ requires device driver support and that some devices implemented it not quite right in their firmware. This problem bit me once, and it leads to highly mysterious problems. In such cases, try to disable TCQ. Bus-master host adapters Most, but not all, SCSI host adapters are bus mastering controllers. This means that they can do I/O on their own without putting load onto the host CPU for data movement. This is of course an advantage for a multitasking operating system like FreeBSD. It must be noted however that there might be some rough edges. For instance an Adaptec 1542 controller can be set to use different transfer speeds on the host bus (ISA or AT in this case). The controller is settable to different rates because not all motherboards can handle the higher speeds. Problems like hang-ups, bad data etc might be the result of using a higher data transfer rate then your motherboard can stomach. The solution is of course obvious: switch to a lower data transfer rate and try if that works better. 
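Coming back to the tagged command queuing paragraph above: on a CAM-based system you do not need to rebuild anything just to experiment with TCQ. A hedged sketch, assuming the suspect disk is da0:

# Show how many tagged openings (outstanding tagged commands) da0 is allowed.
&prompt.root; camcontrol tags da0 -v
# Drop to a single outstanding command, which effectively serializes the I/O.
&prompt.root; camcontrol tags da0 -N 1

Depending on the controller driver, camcontrol negotiate may also let you turn tagged queuing off entirely; see man camcontrol.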
In the case of an Adaptec 1542, there is an option that can be put into the kernel config file to allow dynamic determination of the right (read: fastest feasible) transfer rate. This option is disabled by default:

options "TUNE_1542" #dynamic tune of bus DMA speed

Check the manual pages for the host adapter that you use. Or better still, use the ultimate documentation (read: the driver source).

Tracking down problems

The following list is an attempt to give a guideline for the most common SCSI problems and their solutions. It is by no means complete.

Check for loose connectors and cables.

Check and double-check the location and number of your terminators.

Check that your bus has at least one supplier of terminator power (especially with external terminators).

Check that no target ID is used twice.

Check that all devices to be used are powered up.

Make a minimal bus configuration with as few devices as possible.

If possible, configure your host adapter to use slow bus speeds.

Disable tagged command queuing to make things as simple as possible (for an NCR host adapter based system see man ncrcontrol).

If you can compile a kernel, make one with the SCSIDEBUG option, and try accessing the device with debugging turned on for that device. If your device does not even probe at startup, you may have to define the address of the failing device, and the desired debug level, in /sys/cam/cam_debug.h. If it probes but just does not work, you can use the &man.camcontrol.8; command to dynamically set a debug level for it in a running kernel (if CAMDEBUG is defined). This will give you copious debugging output with which to confuse the gurus. See man camcontrol for more exact information. Also look at man 4 pass.

Further reading

If you intend to do some serious SCSI hacking, you might want to have the official standard at hand: Approved American National Standards can be purchased from ANSI at
13th Floor
11 West 42nd Street
New York, NY 10036
Sales Dept: (212) 642-4900
You can also buy many ANSI standards and most committee draft documents from Global Engineering Documents,
15 Inverness Way East
Englewood, CO 80112-5704
Phone: (800) 854-7179
Outside USA and Canada: (303) 792-2181
Fax: (303) 792-2192
Many X3T10 draft documents are available electronically on the SCSI BBS (719-574-0424) and on the ncrinfo.ncr.com anonymous FTP site. Latest X3T10 committee documents are: AT Attachment (ATA or IDE) [X3.221-1994] (Approved) ATA Extensions (ATA-2) [X3T10/948D Rev 2i] Enhanced Small Device Interface (ESDI) [X3.170-1990/X3.170a-1991] (Approved) Small Computer System Interface — 2 (SCSI-2) [X3.131-1994] (Approved) SCSI-2 Common Access Method Transport and SCSI Interface Module (CAM) [X3T10/792D Rev 11] Other publications that might provide you with additional information are: SCSI: Understanding the Small Computer System Interface, written by NCR Corporation. Available from: Prentice Hall, Englewood Cliffs, NJ, 07632 Phone: (201) 767-5937 ISBN 0-13-796855-8 Basics of SCSI, a SCSI tutorial written by Ancot Corporation Contact Ancot for availability information at: Phone: (415) 322-5322 Fax: (415) 322-0455 SCSI Interconnection Guide Book, an AMP publication (dated 4/93, Catalog 65237) that lists the various SCSI connectors and suggests cabling schemes. Available from AMP at (800) 522-6752 or (717) 564-0100 Fast Track to SCSI, A Product Guide written by Fujitsu. Available from: Prentice Hall, Englewood Cliffs, NJ, 07632 Phone: (201) 767-5937 ISBN 0-13-307000-X The SCSI Bench Reference, The SCSI Encyclopedia, and the SCSI Tutor, ENDL Publications, 14426 Black Walnut Court, Saratoga CA, 95070 Phone: (408) 867-6642 Zadian SCSI Navigator (quick ref. book) and Discover the Power of SCSI (First book along with a one-hour video and tutorial book), Zadian Software, Suite 214, 1210 S. Bascom Ave., San Jose, CA 92128, (408) 293-0800 On Usenet the newsgroups comp.periphs.scsi and comp.periphs are noteworthy places to look for more info. You can also find the SCSI-FAQ there, which is posted periodically. Most major SCSI device and host adapter suppliers operate FTP sites and/or BBS systems. They may be valuable sources of information about the devices you own.
* Disk/tape controllers * SCSI * IDE * Floppy Hard drives SCSI hard drives Contributed by &a.asami;. 17 February 1998. As mentioned in the SCSI section, virtually all SCSI hard drives sold today are SCSI-2 compliant and thus will work fine as long as you connect them to a supported SCSI host adapter. Most problems people encounter are either due to badly designed cabling (cable too long, star topology, etc.), insufficient termination, or defective parts. Please refer to the SCSI section first if your SCSI hard drive is not working. However, there are a couple of things you may want to take into account before you purchase SCSI hard drives for your system. Rotational speed Rotational speeds of SCSI drives sold today range from around 4,500RPM to 15,000RPM. Most of them are either 7,200RPM or 10,000RPM, with 15,000RPM becoming affordable (June 2002). Even though the 10,000RPM drives can generally transfer data faster, they run considerably hotter than their 7,200RPM counterparts. A large fraction of today's disk drive malfunctions are heat-related. If you do not have very good cooling in your PC case, you may want to stick with 7,200RPM or slower drives. Note that newer drives, with higher areal recording densities, can deliver much more bits per rotation than older ones. Today's top-of-line 7,200RPM drives can sustain a throughput comparable to 10,000RPM drives of one or two model generations ago. The number to find on the spec sheet for bandwidth is internal data (or transfer) rate. It is usually in megabits/sec so divide it by 8 and you will get the rough approximation of how much megabytes/sec you can get out of the drive. (If you are a speed maniac and want a 15,000RPM drive for your cute little PC, be my guest; however, those drives become extremely hot. Do not even think about it if you do not have a fan blowing air directly at the drive or a properly ventilated disk enclosure.) Obviously, the latest 15,000RPM drives and 10,000RPM drives can deliver more data than the latest 7,200RPM drives, so if absolute bandwidth is the necessity for your applications, you have little choice but to get the faster drives. Also, if you need low latency, faster drives are better; not only do they usually have lower average seek times, but also the rotational delay is one place where slow-spinning drives can never beat a faster one. (The average rotational latency is half the time it takes to rotate the drive once; thus, it is 2 milliseconds for 15,000RPM, 3ms for 10,000RPM drives, 4.2ms for 7,200RPM drives and 5.6ms for 5,400RPM drives.) Latency is seek time plus rotational delay. Make sure you understand whether you need low latency or more accesses per second, though; in the latter case (e.g., news servers), it may not be optimal to purchase one big fast drive. You can achieve similar or even better results by using the ccd (concatenated disk) driver to create a striped disk array out of multiple slower drives for comparable overall cost. Make sure you have adequate air flow around the drive, especially if you are going to use a fast-spinning drive. You generally need at least 1/2” (1.25cm) of spacing above and below a drive. Understand how the air flows through your PC case. Most cases have the power supply suck the air out of the back. See where the air flows in, and put the drive where it will have the largest volume of cool air flowing around it. You may need to seal some unwanted holes or add a new fan for effective cooling. Another consideration is noise. 
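As an aside before the noise discussion continues: the rotational latency figures above follow directly from the spindle speed, so you can recompute them for any drive. The little awk one-liner below is pure arithmetic, nothing FreeBSD-specific:

# Average rotational latency in ms is half a revolution: 60000 / RPM / 2.
&prompt.root; awk 'BEGIN {
    split("5400 7200 10000 15000", rpm)
    for (i = 1; i <= 4; i++)
        printf "%5d RPM: %.1f ms\n", rpm[i], 60000 / rpm[i] / 2
}'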
Many 10,000 or faster drives generate a high-pitched whine which is quite unpleasant to most people. That, plus the extra fans often required for cooling, may make 10,000 or faster drives unsuitable for some office and home environments. Form factor Most SCSI drives sold today are of 3.5” form factor. They come in two different heights; 1.6” (half-height) or 1” (low-profile). The half-height drive is the same height as a CDROM drive. However, do not forget the spacing rule mentioned in the previous section. If you have three standard 3.5” drive bays, you will not be able to put three half-height drives in there (without frying them, that is). Interface The majority of SCSI hard drives sold today are Ultra, Ultra-wide, or Ultra160 SCSI. As of this writing (June 2002), the first Ultra320 host adapters and devices become available. The maximum bandwidth of Ultra SCSI is 20MB/sec, and Ultra-wide SCSI is 40MB/sec. Ultra160 can transfer 160MB/sec and Ultra320 can transfer 320MB/sec. There is no difference in max cable length between Ultra and Ultra-wide; however, the more devices you have on the same bus, the sooner you will start having bus integrity problems. Unless you have a well-designed disk enclosure, it is not easy to make more than 5 or 6 Ultra SCSI drives work on a single bus. On the other hand, if you need to connect many drives, going for Fast-wide SCSI may not be a bad idea. That will have the same max bandwidth as Ultra (narrow) SCSI, while electronically it is much easier to get it right. My advice would be: if you want to connect many disks, get wide or Ultra160 SCSI drives; they usually cost a little more but it may save you down the road. (Besides, if you can not afford the cost difference, you should not be building a disk array.) There are two variant of wide SCSI drives; 68-pin and 80-pin SCA (Single Connector Attach). The SCA drives do not have a separate 4-pin power connector, and also read the SCSI ID settings through the 80-pin connector. If you are really serious about building a large storage system, get SCA drives and a good SCA enclosure (dual power supply with at least one extra fan). They are more electronically sound than 68-pin counterparts because there is no stub of the SCSI bus inside the disk canister as in arrays built from 68-pin drives. They are easier to install too (you just need to screw the drive in the canister, instead of trying to squeeze in your fingers in a tight place to hook up all the little cables (like the SCSI ID and disk activity LED lines). * IDE hard drives Tape drives Contributed by &a.jmb;. 2 July 1996. General tape access commands &man.mt.1; provides generic access to the tape drives. Some of the more common commands are rewind, erase, and status. See the &man.mt.1; manual page for a detailed description. Controller Interfaces There are several different interfaces that support tape drives. The interfaces are SCSI, IDE, Floppy and Parallel Port. A wide variety of tape drives are available for these interfaces. Controllers are discussed in Disk/tape controllers. SCSI drives The &man.st.4; driver provides support for 8mm (Exabyte), 4mm (DAT: Digital Audio Tape), QIC (Quarter-Inch Cartridge), DLT (Digital Linear Tape), QIC Mini cartridge and 9-track (remember the big reels that you see spinning in Hollywood computer rooms) tape drives. See the &man.st.4; manual page for a detailed description. The drives listed below are currently being used by members of the FreeBSD community. They are not the only drives that will work with FreeBSD. 
They just happen to be the ones that we use. 4mm (DAT: Digital Audio Tape) Archive Python 28454 Archive Python 04687 HP C1533A HP C1534A HP 35450A HP 35470A HP 35480A SDT-5000 Wangtek 6200 8mm (Exabyte) EXB-8200 EXB-8500 EXB-8505 QIC (Quarter-Inch Cartridge) Archive Anaconda 2750 Archive Viper 60 Archive Viper 150 Archive Viper 2525 Tandberg TDC 3600 Tandberg TDC 3620 Tandberg TDC 3800 Tandberg TDC 4222 Wangtek 5525ES DLT (Digital Linear Tape) Digital TZ87 Mini-Cartridge Conner CTMS 3200 Exabyte 2501 Autoloaders/Changers Hewlett-Packard HP C1553A Autoloading DDS2 * IDE drives Floppy drives Conner 420R * Parallel port drives Detailed Information Archive Anaconda 2750 The boot message identifier for this drive is ARCHIVE ANCDA 2750 28077 -003 type 1 removable SCSI 2 This is a QIC tape drive. Native capacity is 1.35GB when using QIC-1350 tapes. This drive will read and write QIC-150 (DC6150), QIC-250 (DC6250), and QIC-525 (DC6525) tapes as well. Data transfer rate is 350kB/s using &man.dump.8;. Rates of 530kB/s have been reported when using Amanda Production of this drive has been discontinued. The SCSI bus connector on this tape drive is reversed from that on most other SCSI devices. Make sure that you have enough SCSI cable to twist the cable one-half turn before and after the Archive Anaconda tape drive, or turn your other SCSI devices upside-down. Two kernel code changes are required to use this drive. This drive will not work as delivered. If you have a SCSI-2 controller, short jumper 6. Otherwise, the drive behaves are a SCSI-1 device. When operating as a SCSI-1 device, this drive, locks the SCSI bus during some tape operations, including: fsf, rewind, and rewoffl. If you are using the NCR SCSI controllers, patch the file /usr/src/sys/pci/ncr.c (as shown below). Build and install a new kernel. *** 4831,4835 **** }; ! if (np->latetime>4) { /* ** Although we tried to wake it up, --- 4831,4836 ---- }; ! if (np->latetime>1200) { /* ** Although we tried to wake it up, Reported by: &a.jmb; Archive Python 28454 The boot message identifier for this drive is ARCHIVE Python 28454-XXX4ASB type 1 removable SCSI 2 density code 0x8c, 512-byte blocks This is a DDS-1 tape drive. Native capacity is 2.5GB on 90m tapes. Data transfer rate is XXX. This drive was repackaged by Sun Microsystems as model 595-3067. Reported by: Bob Bishop rb@gid.co.uk Throughput is in the 1.5 MByte/sec range, however this will drop if the disks and tape drive are on the same SCSI controller. Reported by: Robert E. Seastrom rs@seastrom.com Archive Python 04687 The boot message identifier for this drive is ARCHIVE Python 04687-XXX 6580 Removable Sequential Access SCSI-2 device This is a DAT-DDS-2 drive. Native capacity is 4GB when using 120m tapes. This drive supports hardware data compression. Switch 4 controls MRS (Media Recognition System). MRS tapes have stripes on the transparent leader. Switch 4 off enables MRS, on disables MRS. Parity is controlled by switch 5. Switch 5 on to enable parity control. Compression is enabled with Switch 6 off. It is possible to override compression with the SCSI MODE SELECT command (see &man.mt.1;). Data transfer rate is 800kB/s. Archive Viper 60 The boot message identifier for this drive is ARCHIVE VIPER 60 21116 -007 type 1 removable SCSI 1 This is a QIC tape drive. Native capacity is 60MB. Data transfer rate is XXX. Production of this drive has been discontinued. 
Reported by: Philippe Regnauld regnauld@hsc.fr Archive Viper 150 The boot message identifier for this drive is ARCHIVE VIPER 150 21531 -004 Archive Viper 150 is a known rogue type 1 removable SCSI 1. A multitude of firmware revisions exist for this drive. Your drive may report different numbers (e.g 21247 -005. This is a QIC tape drive. Native capacity is 150/250MB. Both 150MB (DC6150) and 250MB (DC6250) tapes have the recording format. The 250MB tapes are approximately 67% longer than the 150MB tapes. This drive can read 120MB tapes as well. It can not write 120MB tapes. Data transfer rate is 100kB/s This drive reads and writes DC6150 (150MB) and DC6250 (250MB) tapes. This drives quirks are known and pre-compiled into the SCSI tape device driver (&man.st.4;). Under FreeBSD 2.2-CURRENT, use mt blocksize 512 to set the blocksize. (The particular drive had firmware revision 21247 -005. Other firmware revisions may behave differently) Previous versions of FreeBSD did not have this problem. Production of this drive has been discontinued. Reported by: Pedro A M Vazquez vazquez@IQM.Unicamp.BR &a.msmith; Archive Viper 2525 The boot message identifier for this drive is ARCHIVE VIPER 2525 25462 -011 type 1 removable SCSI 1 This is a QIC tape drive. Native capacity is 525MB. Data transfer rate is 180kB/s at 90 inches/sec. The drive reads QIC-525, QIC-150, QIC-120 and QIC-24 tapes. Writes QIC-525, QIC-150, and QIC-120. Firmware revisions prior to 25462 -011 are bug ridden and will not function properly. Production of this drive has been discontinued. Conner 420R The boot message identifier for this drive is Conner tape. This is a floppy controller, mini cartridge tape drive. Native capacity is XXXX Data transfer rate is XXX The drive uses QIC-80 tape cartridges. Reported by: Mark Hannon mark@seeware.DIALix.oz.au Conner CTMS 3200 The boot message identifier for this drive is CONNER CTMS 3200 7.00 type 1 removable SCSI 2. This is a mini cartridge tape drive. Native capacity is XXXX Data transfer rate is XXX The drive uses QIC-3080 tape cartridges. Reported by: Thomas S. Traylor tst@titan.cs.mci.com <ulink url="http://www.digital.com/info/Customer-Update/931206004.txt.html">DEC TZ87</ulink> The boot message identifier for this drive is DEC TZ87 (C) DEC 9206 type 1 removable SCSI 2 density code 0x19 This is a DLT tape drive. Native capacity is 10GB. This drive supports hardware data compression. Data transfer rate is 1.2MB/s. This drive is identical to the Quantum DLT2000. The drive firmware can be set to emulate several well-known drives, including an Exabyte 8mm drive. Reported by: &a.wilko; <ulink url="http://www.Exabyte.COM:80/Products/Minicartridge/2501/Rfeatures.html">Exabyte EXB-2501</ulink> The boot message identifier for this drive is EXABYTE EXB-2501 This is a mini-cartridge tape drive. Native capacity is 1GB when using MC3000XL mini cartridges. Data transfer rate is XXX This drive can read and write DC2300 (550MB), DC2750 (750MB), MC3000 (750MB), and MC3000XL (1GB) mini cartridges. WARNING: This drive does not meet the SCSI-2 specifications. The drive locks up completely in response to a SCSI MODE_SELECT command unless there is a formatted tape in the drive. Before using this drive, set the tape blocksize with &prompt.root; mt -f /dev/st0ctl.0 blocksize 1024 Before using a mini cartridge for the first time, the mini cartridge must be formated. 
FreeBSD 2.1.0-RELEASE and earlier: &prompt.root; /sbin/scsi -f /dev/rst0.ctl -s 600 -c "4 0 0 0 0 0" (Alternatively, fetch a copy of the scsiformat shell script from FreeBSD 2.1.5/2.2.) FreeBSD 2.1.5 and later: &prompt.root; /sbin/scsiformat -q -w /dev/rst0.ctl Right now, this drive cannot really be recommended for FreeBSD. Reported by: Bob Beaulieu ez@eztravel.com Exabyte EXB-8200 The boot message identifier for this drive is EXABYTE EXB-8200 252X type 1 removable SCSI 1 This is an 8mm tape drive. Native capacity is 2.3GB. Data transfer rate is 270kB/s. This drive is fairly slow in responding to the SCSI bus during boot. A custom kernel may be required (set SCSI_DELAY to 10 seconds). There are a large number of firmware configurations for this drive, some have been customized to a particular vendor's hardware. The firmware can be changed via EPROM replacement. Production of this drive has been discontinued. Reported by: &a.msmith; Exabyte EXB-8500 The boot message identifier for this drive is EXABYTE EXB-8500-85Qanx0 0415 type 1 removable SCSI 2 This is an 8mm tape drive. Native capacity is 5GB. Data transfer rate is 300kB/s. Reported by: Greg Lehey grog@lemis.de <ulink url="http://www.Exabyte.COM:80/Products/8mm/8505XL/Rfeatures.html">Exabyte EXB-8505</ulink> The boot message identifier for this drive is EXABYTE EXB-85058SQANXR1 05B0 type 1 removable SCSI 2 This is an 8mm tape drive which supports compression, and is upward compatible with the EXB-5200 and EXB-8500. Native capacity is 5GB. The drive supports hardware data compression. Data transfer rate is 300kB/s. Reported by: Glen Foster gfoster@gfoster.com Hewlett-Packard HP C1533A The boot message identifier for this drive is HP C1533A 9503 type 1 removable SCSI 2. This is a DDS-2 tape drive. DDS-2 means hardware data compression and narrower tracks for increased data capacity. Native capacity is 4GB when using 120m tapes. This drive supports hardware data compression. Data transfer rate is 510kB/s. This drive is used in Hewlett-Packard's SureStore 6000eU and 6000i tape drives and C1533A DDS-2 DAT drive. The drive has a block of 8 dip switches. The proper settings for FreeBSD are: 1 ON; 2 ON; 3 OFF; 4 ON; 5 ON; 6 ON; 7 ON; 8 ON. switch 1 switch 2 Result On On Compression enabled at power-on, with host control On Off Compression enabled at power-on, no host control Off On Compression disabled at power-on, with host control Off Off Compression disabled at power-on, no host control Switch 3 controls MRS (Media Recognition System). MRS tapes have stripes on the transparent leader. These identify the tape as DDS (Digital Data Storage) grade media. Tapes that do not have the stripes will be treated as write-protected. Switch 3 OFF enables MRS. Switch 3 ON disables MRS. See HP SureStore Tape Products and Hewlett-Packard Disk and Tape Technical Information for more information on configuring this drive. Warning: Quality control on these drives varies greatly. One FreeBSD core-team member has returned 2 of these drives. Neither lasted more than 5 months. Reported by: &a.se; Hewlett-Packard HP 1534A The boot message identifier for this drive is HP HP35470A T503 type 1 removable SCSI 2 Sequential-Access density code 0x13, variable blocks. This is a DDS-1 tape drive. DDS-1 is the original DAT tape format. Native capacity is 2GB when using 90m tapes. Data transfer rate is 183kB/s. 
The same mechanism is used in Hewlett-Packard's SureStore 2000i tape drive, C35470A DDS format DAT drive, C1534A DDS format DAT drive and HP C1536A DDS format DAT drive. The HP C1534A DDS format DAT drive has two indicator lights, one green and one amber. The green one indicates tape action: slow flash during load, steady when loaded, fast flash during read/write operations. The amber one indicates warnings: slow flash when cleaning is required or tape is nearing the end of its useful life, steady indicates an hard fault. (factory service required?) Reported by Gary Crutcher gcrutchr@nightflight.com Hewlett-Packard HP C1553A Autoloading DDS2 The boot message identifier for this drive is "". This is a DDS-2 tape drive with a tape changer. DDS-2 means hardware data compression and narrower tracks for increased data capacity. Native capacity is 24GB when using 120m tapes. This drive supports hardware data compression. Data transfer rate is 510kB/s (native). This drive is used in Hewlett-Packard's SureStore 12000e tape drive. The drive has two selectors on the rear panel. The selector closer to the fan is SCSI id. The other selector should be set to 7. There are four internal switches. These should be set: 1 ON; 2 ON; 3 ON; 4 OFF. At present the kernel drivers do not automatically change tapes at the end of a volume. This shell script can be used to change tapes: #!/bin/sh PATH="/sbin:/usr/sbin:/bin:/usr/bin"; export PATH usage() { echo "Usage: dds_changer [123456ne] raw-device-name echo "1..6 = Select cartridge" echo "next cartridge" echo "eject magazine" exit 2 } if [ $# -ne 2 ] ; then usage fi cdb3=0 cdb4=0 cdb5=0 case $1 in [123456]) cdb3=$1 cdb4=1 ;; n) ;; e) cdb5=0x80 ;; ?) usage ;; esac scsi -f $2 -s 100 -c "1b 0 0 $cdb3 $cdb4 $cdb5" Hewlett-Packard HP 35450A The boot message identifier for this drive is HP HP35450A -A C620 type 1 removable SCSI 2 Sequential-Access density code 0x13 This is a DDS-1 tape drive. DDS-1 is the original DAT tape format. Native capacity is 1.2GB. Data transfer rate is 160kB/s. Reported by: Mark Thompson mark.a.thompson@pobox.com Hewlett-Packard HP 35470A The boot message identifier for this drive is HP HP35470A 9 09 type 1 removable SCSI 2 This is a DDS-1 tape drive. DDS-1 is the original DAT tape format. Native capacity is 2GB when using 90m tapes. Data transfer rate is 183kB/s. The same mechanism is used in Hewlett-Packard's SureStore 2000i tape drive, C35470A DDS format DAT drive, C1534A DDS format DAT drive, and HP C1536A DDS format DAT drive. Warning: Quality control on these drives varies greatly. One FreeBSD core-team member has returned 5 of these drives. None lasted more than 9 months. Reported by: David Dawes dawes@rf900.physics.usyd.edu.au (9 09) Hewlett-Packard HP 35480A The boot message identifier for this drive is HP HP35480A 1009 type 1 removable SCSI 2 Sequential-Access density code 0x13. This is a DDS-DC tape drive. DDS-DC is DDS-1 with hardware data compression. DDS-1 is the original DAT tape format. Native capacity is 2GB when using 90m tapes. It cannot handle 120m tapes. This drive supports hardware data compression. Please refer to the section on HP C1533A for the proper switch settings. Data transfer rate is 183kB/s. This drive is used in Hewlett-Packard's SureStore 5000eU and 5000i tape drives and C35480A DDS format DAT drive.. This drive will occasionally hang during a tape eject operation (mt offline). Pressing the front panel button will eject the tape and bring the tape drive back to life. WARNING: HP 35480-03110 only. 
On at least two occasions this tape drive when used with FreeBSD 2.1.0, an IBM Server 320 and an 2940W SCSI controller resulted in all SCSI disk partitions being lost. The problem has not be analyzed or resolved at this time. <ulink url="http://www.sel.sony.com/SEL/ccpg/storage/tape/t5000.html">Sony SDT-5000</ulink> There are at least two significantly different models: one is a DDS-1 and the other DDS-2. The DDS-1 version is SDT-5000 3.02. The DDS-2 version is SONY SDT-5000 327M. The DDS-2 version has a 1MB cache. This cache is able to keep the tape streaming in almost any circumstances. The boot message identifier for this drive is SONY SDT-5000 3.02 type 1 removable SCSI 2 Sequential-Access density code 0x13 Native capacity is 4GB when using 120m tapes. This drive supports hardware data compression. Data transfer rate is depends upon the model or the drive. The rate is 630kB/s for the SONY SDT-5000 327M while compressing the data. For the SONY SDT-5000 3.02, the data transfer rate is 225kB/s. In order to get this drive to stream, set the blocksize to 512 bytes (mt blocksize 512) reported by Kenneth Merry ken@ulc199.residence.gatech.edu. SONY SDT-5000 327M information reported by Charles Henrich henrich@msu.edu. Reported by: &a.jmz; Tandberg TDC 3600 The boot message identifier for this drive is TANDBERG TDC 3600 =08: type 1 removable SCSI 2 This is a QIC tape drive. Native capacity is 150/250MB. This drive has quirks which are known and work around code is present in the SCSI tape device driver (&man.st.4;). Upgrading the firmware to XXX version will fix the quirks and provide SCSI 2 capabilities. Data transfer rate is 80kB/s. IBM and Emerald units will not work. Replacing the firmware EPROM of these units will solve the problem. Reported by: &a.msmith; Tandberg TDC 3620 This is very similar to the Tandberg TDC 3600 drive. Reported by: &a.joerg; Tandberg TDC 3800 The boot message identifier for this drive is TANDBERG TDC 3800 =04Y Removable Sequential Access SCSI-2 device This is a QIC tape drive. Native capacity is 525MB. Reported by: &a.jhs; Tandberg TDC 4222 The boot message identifier for this drive is TANDBERG TDC 4222 =07 type 1 removable SCSI 2 This is a QIC tape drive. Native capacity is 2.5GB. The drive will read all cartridges from the 60 MB (DC600A) upwards, and write 150 MB (DC6150) upwards. Hardware compression is optionally supported for the 2.5 GB cartridges. This drives quirks are known and pre-compiled into the SCSI tape device driver (&man.st.4;) beginning with FreeBSD 2.2-CURRENT. For previous versions of FreeBSD, use mt to read one block from the tape, rewind the tape, and then execute the backup program (mt fsr 1; mt rewind; dump ...) Data transfer rate is 600kB/s (vendor claim with compression), 350 KB/s can even be reached in start/stop mode. The rate decreases for smaller cartridges. Reported by: &a.joerg; Wangtek 5525ES The boot message identifier for this drive is WANGTEK 5525ES SCSI REV7 3R1 type 1 removable SCSI 1 density code 0x11, 1024-byte blocks This is a QIC tape drive. Native capacity is 525MB. Data transfer rate is 180kB/s. The drive reads 60, 120, 150, and 525MB tapes. The drive will not write 60MB (DC600 cartridge) tapes. In order to overwrite 120 and 150 tapes reliably, first erase (mt erase) the tape. 120 and 150 tapes used a wider track (fewer tracks per tape) than 525MB tapes. 
The extra width of the previous tracks is not overwritten, as a result the new data lies in a band surrounded on both sides by the previous data unless the tape have been erased. This drives quirks are known and pre-compiled into the SCSI tape device driver (&man.st.4;). Other firmware revisions that are known to work are: M75D Reported by: Marc van Kempen marc@bowtie.nl REV73R1 Andrew Gordon Andrew.Gordon@net-tel.co.uk M75D Wangtek 6200 The boot message identifier for this drive is WANGTEK 6200-HS 4B18 type 1 removable SCSI 2 Sequential-Access density code 0x13 This is a DDS-1 tape drive. Native capacity is 2GB using 90m tapes. Data transfer rate is 150kB/s. Reported by: Tony Kimball alk@Think.COM * Problem drives CDROM drives Contributed by &a.obrien;. 23 November 1997. Generally speaking those in The FreeBSD Project prefer SCSI CDROM drives over IDE CDROM drives. However not all SCSI CDROM drives are equal. Some feel the quality of some SCSI CDROM drives have been deteriorating to that of IDE CDROM drives. Toshiba used to be the favored stand-by, but many on the SCSI mailing list have found displeasure with the 12x speed XM-5701TA as its volume (when playing audio CDROMs) is not controllable by the various audio player software. Another area where SCSI CDROM manufacturers are cutting corners is adherence to the SCSI specification. Many SCSI CDROMs will respond to multiple LUNs for its target address. Known violators include the 6x Teac CD-56S 1.0D.
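If you suspect a drive of this kind of multiple-LUN behavior, it is easy to see what CAM actually attached. A small sketch; the device names are whatever your system assigns:

# List every peripheral CAM attached, one line per bus:target:lun.
# A CD-ROM that answers on all LUNs of its target shows up once per probed LUN.
&prompt.root; camcontrol devlist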
diff --git a/en_US.ISO8859-1/articles/vinum/article.sgml b/en_US.ISO8859-1/articles/vinum/article.sgml index f3caf01a8e..a3d2098834 100644 --- a/en_US.ISO8859-1/articles/vinum/article.sgml +++ b/en_US.ISO8859-1/articles/vinum/article.sgml @@ -1,2542 +1,2550 @@ + + +%trademarks; + Vinum"> %man; ]>
Bootstrapping Vinum: A Foundation for Reliable Servers Robert A. Van Valzah 2001 Robert A. Van Valzah - $Date: 2003-08-27 07:13:11 $ GMT - $Id: article.sgml,v 1.13 2003-08-27 07:13:11 blackend Exp $ + $Date: 2003-10-18 10:39:16 $ GMT + $Id: article.sgml,v 1.14 2003-10-18 10:39:16 simon Exp $ + + &tm-attrib.freebsd; + &tm-attrib.general; + In the most abstract sense, these instructions show how to build a pair of disk drives where either one is adequate to keep your server running if the other fails. Life is better if they are both working, but your server will never die unless both disk drives die at once. If you choose ATAPI drives and use a fairly generic kernel, you can be confident that either of these drives can be plugged into most any main board to produce a working server in a pinch. The drives need not be identical. These techniques work equally well with SCSI drives as they do with ATAPI, but I will focus on ATAPI here because main boards with this interface are ubiquitous. After building the foundation of a reliable server as shown here, you can expand to as many disk drives as necessary to build the failure-resilient server of your dreams.
Introduction Any machine that is going to provide reliable service needs to have either redundant components on-line or a pool of off-line spares that can be promptly swapped in. Commodity PC hardware makes it affordable for even small organizations to have some spare parts available that could be pressed into service following the failure of production equipment. In many organizations, a failed power supply, NIC, memory, or main board could easily be swapped with a standby in a matter of minutes and be ready to return to production work. If a disk drive fails, however, it often has to be restored from a tape backup. This may take many hours. With disk drive capacities rising faster than tape drive capacities, the time needed to restore a failed disk drive seems to increase as technology progresses. &vinum.ap; is a volume manager for FreeBSD that provides a standard block I/O layer interface to the filesystem code just as any hardware device driver would. It works by managing partitions of type vinum and allows you to subdivide and group the space in such partitions into logical devices called volumes that can be used in the same way as disk partitions. Volumes can be configured for resilience, performance, or both. Experienced system administrators will immediately recognize the benefits of being able to configure each filesystem to match the way it is most often used. In some ways, Vinum is similar to &man.ccd.4;, but it is far more flexible and robust in the face of failures. It is only slightly more difficult to set up than &man.ccd.4;. &man.ccd.4; may meet your needs if you are only interested in concatenation.
Terminology Discussion of storage management can get very tricky simply because of the terminology involved. As we will see below, the terms disk, slice, partition, subdisk, and volume each refer to different things that present the same interface to a kernel function like swapping. The potential for confusion is compounded because the objects that these terms represent can be nested inside each other. I will refer to a physical disk drive as a spindle. A partition here means a BSD partition as maintained by disklabel. It does not refer to slices or BIOS partitions as maintained by fdisk.
Vinum Objects

Vinum defines a hierarchy of four objects that it uses to manage storage (see the figure below). Different combinations of these objects are used to achieve failure resilience, performance, and/or extra capacity. I will give a whirlwind tour of the objects here; see the Vinum web site for a more thorough description.
Vinum Objects and Architecture

+-----+------+------+
| UFS | swap | Etc. |
+---+-+------+----+-+
|   | volume      | |
+ V +-------------+ +
| i | plex        | |
+ n +-------------+ +
| u | subdisk     | |
+ m +-------------+ +
|   | drive       | |
+---+-------------+-+
| Block I/O devices |
+-------------------+

Vinum Objects and Architecture
The top object, a vinum volume, implements a virtual disk that provides a standard block I/O layer interface to other parts of the kernel. The bottom object, a vinum drive, uses this same interface to request I/O from physical devices below it. In between these two (from top to bottom) we have objects called a vinum plex and a vinum subdisk. As you can probably guess from the name, a vinum subdisk is a contiguous subset of the space available on a vinum drive. It lets you subdivide a vinum drive in much the same way that a disk BSD partition lets you subdivide a BIOS slice. A plex allows subdisks to be grouped together making the space of all subdisks available as a single object. A plex can be organized with its constituent subdisks concatenated or striped. Both organizations are useful for spreading I/O requests across spindles since plexes reside on distinct spindles. A striped plex will switch spindles each time a multiple of the stripe size is reached. A concatenated plex will switch spindles only when the end of a subdisk is reached. An important characteristic of a Vinum volume is that it can be made up of more than one plex. In this case, writes go to all plexes and a read may be satisfied by any plex. Configuring two or more plexes on distinct spindles yields a volume that is resilient to failure. Vinum maintains a configuration that defines instances of the above objects and the way they are related to each other. This configuration is automatically written to all spindles under Vinum management whenever it changes.
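To make the hierarchy concrete, here is a sketch of the kind of configuration file that vinum create reads. The drive names match the ones chosen later in this article, but the device nodes and the partition letter are assumptions; your layout will differ:

# Each partition of type vinum becomes a Vinum drive with a stable name.
drive YouCrazy device /dev/ad0s1h
drive UpWindow device /dev/ad2s1h

# A mirrored volume: two plexes, each with one subdisk on its own spindle.
# "length 0" means use all remaining space on that drive.
volume home
  plex org concat
    sd length 0 drive YouCrazy
  plex org concat
    sd length 0 drive UpWindow

Feeding such a file to vinum create defines the objects, and Vinum then records the resulting configuration on both drives as described above.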
Vinum Volume/Plex Organization

Although Vinum can manage any number of spindles, I will only cover scenarios with two spindles here for simplification. The table below shows how two spindles organized with Vinum compare to two spindles without Vinum.

Characteristics of Two Spindles Organized with Vinum

Organization               Total Capacity                            Failure Resilient   Peak Read Performance   Peak Write Performance
Concatenated Plexes        Unchanged, but appears as a single drive  No                  Unchanged               Unchanged
Striped Plexes (RAID-0)    Unchanged, but appears as a single drive  No                  2x                      2x
Mirrored Volumes (RAID-1)  1/2, appearing as a single drive          Yes                 2x                      Unchanged
The table above shows that striping yields the same capacity and lack of failure resilience as concatenation, but it has better peak read and write performance. Hence we will not be using concatenation in any of the examples here. Mirrored volumes provide the benefits of improved peak read performance and failure resilience, but this comes at a loss in capacity. Both concatenation and striping bring their benefits over a single spindle at the cost of an increased likelihood of failure, since more than one spindle is now involved. When three or more spindles are present, Vinum also supports rotated, block-interleaved parity (also called RAID-5), which provides better capacity than mirroring (but not quite as good as striping), better read performance than both mirroring and striping, and good failure resilience. There is, however, a substantial decrease in write performance with RAID-5. Most of the benefits become more pronounced with five or more spindles. The organizations described above may be combined to provide benefits that no single organization can match. For example, mirroring and striping can be combined to provide failure resilience with very fast read performance.
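As an illustration of that last point, a mirrored-and-striped volume is simply a volume with two striped plexes that live on disjoint sets of spindles. The sketch below assumes four hypothetical drives named a through d and picks an arbitrary 512 KB stripe and 512 MB subdisk size:

volume fast
  plex org striped 512k
    sd length 512m drive a
    sd length 512m drive c
  plex org striped 512k
    sd length 512m drive b
    sd length 512m drive d

Writes go to both plexes, reads can be satisfied by either, and the volume survives the loss of any single spindle.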
Vinum History Vinum is a standard part of even a "minimum" FreeBSD distribution and it has been standard since 3.0-RELEASE. The official pronunciation of the name is VEE-noom. &vinum.ap; was inspired by the Veritas Volume Manager, but was not derived from it. The name is a play on that history and the Latin adage In Vino Veritas (Vino is the ablative form of Vinum). Literally translated, that is Truth lies in wine hinting that drunkards have a hard time lying. I have been using it in production on six different servers for over two years with no data loss. Like the rest of FreeBSD, Vinum provides rock-stable performance. (On a personal note, I have seen Vinum panic when I misconfigured something, but I have never had any trouble in normal operation.) Greg Lehey wrote Vinum for FreeBSD, but he is seeking help in porting it to NetBSD and OpenBSD. Just like the rest of FreeBSD, Vinum is undergoing continuous development. Several subtle, but significant bugs have been fixed in recent releases. It is always best to use the most recent code base that meets your stability requirements.
Vinum Deployment Strategy Vinum, coupled with prudent partition management, lets you keep warm-spare spindles on-line so that failures are transparent to users. Failed spindles can be replaced during regular maintenance periods or whenever it is convenient. When all spindles are working, the server benefits from increased performance and capacity. Having redundant copies of your home directory does not help you if the spindle holding root, /usr, or swap fails on your server. Hence I focus here on building a simple foundation for a failure-resilient server covering the root, /usr, /home, and swap partitions. Vinum mirroring does not remove the need for making backups! Mirroring cannot help you recover from site disasters or the dreaded rm -r -f / command.
Why Bootstrap Vinum?

It is possible to add Vinum to a server configuration after it is already in production use, but this is much harder than designing for it from the start. Ironically, Vinum is not supported by /stand/sysinstall and hence you cannot install /usr right onto a Vinum volume. Vinum currently does not support the root filesystem (this feature is in development). Hence it is a bit tricky to get started using Vinum, but these instructions take you through the process of planning for Vinum, installing FreeBSD without it, and then beginning to use it. I have come to call this whole process bootstrapping Vinum; that is, the process of getting Vinum initially installed and operating to the point where you have met your resilience or performance goals. My purpose here is to document a Vinum bootstrapping method that I have found works well for me.
Vinum Benefits The server foundation scenario I have chosen here allows me to show you examples of configuring for resilience on /usr and /home. Yet Vinum provides benefits other than resilience--namely performance, capacity, and manageability. It can significantly improve disk performance (especially under multi-user loads). Vinum can easily concatenate many smaller disks to produce the illusion of a single larger disk (but my server foundation scenario does not allow me to illustrate these benefits here). For servers with many spindles, Vinum provides substantial benefits in volume management, particularly when coupled with hot-pluggable hardware. Data can be moved from spindle to spindle while the system is running without loss of production time. Again, details of this will not be given here, but once you get your feet wet with Vinum, other documentation will help you do things like this. See "The Vinum Volume Manager" for a technical introduction to Vinum, &man.vinum.8; for a description of the vinum command, and &man.vinum.4; for a description of the vinum device driver and the way Vinum objects are named. Breaking up your disk space into smaller and smaller partitions has the benefit of allowing you to tune for the most common type of access and tends to keep disk hogs within their pens. However it also causes some loss in total available disk space due to fragmentation.
Server Operation in Degraded Mode Some disk failures in this two-spindle scenario will result in Vinum automatically routing all disk I/O to the remaining good spindle. Others will require brief manual intervention on the console to configure the server for degraded mode operation and a quick reboot. Other than actual hardware repairs, most recovery work can be done while the server is running in multi-user degraded mode so there is as little production impact from failures as possible. I give the instructions in needed to configure the server for degraded mode operation in those cases where Vinum cannot do it automatically. I also give the instructions needed to return to normal operation once the failed hardware is repaired. You might call these instructions Vinum failure recovery techniques. I recommend practicing using these instructions by recovering from simulated failures. For each failure scenario, I also give tips below for simulating a failure even when your hardware is working well. Even a minimum Vinum system as described in below can be a good place to experiment with recovery techniques without impacting production equipment.
Hardware RAID vs. Vinum (Software RAID) Manual intervention is sometimes required to configure a server for degraded mode because Vinum is implemented in software that runs after the FreeBSD kernel is loaded. One disadvantage of such software RAID solutions is that there is nothing that can be done to hide spindle failures from the BIOS or the FreeBSD boot sequence. Hence the manual reconfiguration of the server for degraded operation mentioned above just informs the BIOS and boot sequence of failed spindles. Hardware RAID solutions generally have an advantage in that they require no such reconfiguration since spindle failures are hidden from the BIOS and boot sequence. Hardware RAID, however, may have some disadvantages that can be significant in some cases: The hardware RAID controller itself may become a single point of failure for the system. The data is usually kept in a proprietary format so that a disk drive cannot be simply plugged into another main board and booted. You often cannot mix and match drives with different sizes and interfaces. You are often limited to the number of drives supported by the hardware RAID controller (often only four or eight). In other words, &vinum.ap; may offer advantages in that there is no single point of failure, the drives can boot on most any main board, and you are free to mix and match as many drives using whatever interface you choose. Keep your kernel fairly generic (or at least keep /kernel.GENERIC around). This will improve the chances that you can come back up on foreign hardware more quickly. The pros and cons discussed above suggest that the root filesystem and swap partition are good candidates for hardware RAID if available. This is especially true for servers where it is difficult for administrators to get console access (recall that this is sometimes required to configure a server for degraded mode operation). A server with only software RAID is well suited to office and home environments where an administrator can be close at hand. A common myth is that hardware RAID is always faster than software RAID. Since it runs on the host CPU, Vinum often has more CPU power and memory available than a dedicated RAID controller would have. If performance is a prime concern, it is best to benchmark your application running on your CPU with your spindles using both hardware and software RAID systems before making a decision.
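If you want a quick first impression before setting up a proper benchmark, raw sequential throughput can be measured with nothing but the base system. This is only a sketch; it ignores the multi-user random I/O that usually matters most, and the file name and size are arbitrary:

# Crude sequential write and read test; dd reports bytes/sec when it finishes.
# Run it on an otherwise idle filesystem living on the volume under test.
&prompt.root; dd if=/dev/zero of=/home/ddtest bs=1m count=512
&prompt.root; dd if=/home/ddtest of=/dev/null bs=1m
&prompt.root; rm /home/ddtest

Remember that the buffer cache will flatter the read numbers unless the file is comfortably larger than RAM.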
Hardware for Vinum These instructions may be timely since commodity PC hardware can now easily host several hundred gigabytes of reasonably high-performance disk space at a low price. Many disk drive manufactures now sell 7,200 RPM disk drives with quite low seek times and high transfer rates through ATA-100 interfaces, all at very attractive prices. Four such drives, attached to a suitable main board and configured with Vinum and prudent partitioning, yields a failure-resilient, high performance disk server at a very reasonable cost. However, you can indeed get started with Vinum very simply. A minimum system can be as simple as an old CPU (even a 486 is fine) and a pair of drives that are 500 MB or more. They need not be the same size or even use the same interface (i.e., it is fine to mix ATAPI and SCSI). So get busy and give this a try today! You will have the foundation of a failure-resilient server running in an hour or so!
Bootstrapping Phases Greg Lehey suggested this bootstrapping method. It uses knowledge of how Vinum internally allocates disk space to avoid copying data. Instead, Vinum objects are configured so that they occupy the same disk space where /stand/sysinstall built filesystems. The filesystems are thus embedded within Vinum objects without copying. There are several distinct phases to the Vinum bootstrapping procedure. Each of these phases is presented in a separate section below. The section starts with a general overview of the phase and its goals. It then gives example steps for the two-spindle scenario presented here and advice on how to adapt them for your server. (If you are reading for a general understanding of Vinum bootstrapping, the example sections for each phase can safely be skipped.) The remainder of this section gives an overview of the entire bootstrapping process. Phase 1 involves planning and preparation. We will balance requirements for the server against available resources and make design tradeoffs. We will plan the transition from no Vinum to Vinum on just one spindle, to Vinum on two spindles. In phase 2, we will install a minimum FreeBSD system on a single spindle using partitions of type 4.2BSD (regular UFS filesystems). Phase 3 will embed the non-root filesystems from phase 2 in Vinum objects. Note that Vinum will be up and running at this point, but it cannot yet provide any resilience since it only has one spindle on which to store data. Finally in phase 4, we configure Vinum on a second spindle and make a backup copy of the root filesystem. This will give us resilience on all filesystems.
Bootstrapping Phase 1: Planning and Preparation Our goal in this phase is to define the different partitions we will need and examine their requirements. We will also look at available disk drives and controllers and allocate partitions to them. Finally, we will determine the size of each partition and its use during the bootstrapping process. After this planning is complete, we can optionally prepare to use some tools that will make bootstrapping Vinum easier. Several key questions must be answered in this planning phase: What filesystem and partitions will be needed? How will they be used? How will we name each spindle? How will the partitions be ordered for each spindle? How will partitions be assigned to the spindles? How will partitions be configured? Resilience or performance? What technique will be used to achieve resilience? What spindles will be used? How will they be configured on the available controllers? How much space is required for each partition?
Phase 1 Example In this example, I will assume a scenario where we are building a minimal foundation for a failure-resilient server. Hence we will need at least root, /usr, /home, and swap partitions. The root, /usr, and /home filesystems all need resilience since the server will not be much good without them. The swap partition needs performance first and generally does not need resilience since nothing it holds needs to be retained across a reboot.
Spindle Naming The kernel would refer to the master spindle on the primary and secondary ATA controllers as /dev/ad0 and /dev/ad2 respectively. This assumes that you have not removed the line options ATA_STATIC_ID from your kernel configuration. But Vinum also needs to have a name for each spindle that will stay the same name regardless of how it is attached to the CPU (i.e., if the drive moves, the Vinum name moves with the drive). Some recovery techniques documented below suggest moving a spindle from the secondary ATA controller to the primary ATA controller. (Indeed, the flexibility of making such moves is a key benefit of Vinum especially if you are managing a large number of spindles.) After such a drive/controller swap, the kernel will see what used to be /dev/ad2 as /dev/ad0 but Vinum will still call it by whatever name it had when it was attached to /dev/ad2 (i.e., when it was created or first made known to Vinum). Since connections can change, it is best to give each spindle a unique, abstract name that gives no hint of how it is attached. Avoid names that suggest a manufacturer, model number, physical location, or membership in a sequence (e.g. avoid names like upper, lower, etc., alpha, beta, etc., SCSI1, SCSI2, etc., or Seagate1, Seagate2 etc.). Such names are likely to lose their uniqueness or get out of sequence someday even if they seem like great names today. Once you have picked names for your spindles, label them with a permanent marker. If you have hot-swappable hardware, write the names on the sleds in which the spindles are mounted. This will significantly reduce the likelihood of error when you are moving spindles around later as part of failure recovery or routine system management procedures. In the instructions that follow, Vinum will name the root spindle YouCrazy and the rootback spindle UpWindow. I will only use /dev/ad0 when I want to refer to whichever of the two spindles is currently attached as /dev/ad0.
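Once Vinum is running (phase 3 and later), you can always check which physical device each abstract drive name currently maps to. A quick sanity check, using the drive names from this example (your output will differ); vinum ld lists the drives:

&prompt.root; vinum ld
D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%)
D UpWindow State: up Device /dev/ad2s1h Avail: 2096/3915 MB (53%)

If a spindle has been moved to a different controller, the Device column is where the change shows up; the Vinum drive name stays the same.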
Partition Ordering Modern disk drives operate with fairly uniform areal density across the surface of the disk. That implies that more data is available under the heads without seeking on the outer cylinders than on the inner cylinders. We will allocate partitions most critical to system performance from these outer cylinders as /stand/sysinstall generally does. The root filesystem is traditionally the outermost, even though it generally is not as critical to system performance as others. (However root can have a larger impact on performance if it contains /tmp and /var as it does in this example.) The FreeBSD boot loaders assume that the root filesystem lives in the a partition. There is no requirement that the a partition start on the outermost cylinders, but this convention makes it easier to manage disk labels. Swap performance is critical so it comes next on our way toward the center. I/O operations here tend to be large and contiguous. Having as much data under the heads as possible avoids seeking while swapping. With all the smaller partitions out of the way, we finish up the disk with /home and /usr. Access patterns here tend not to be as intense as for other filesystems (especially if there is an abundant supply of RAM and read cache hit rates are high). If the pair of spindles you have are large enough to allow for more than /home and /usr, it is fine to plan for additional filesystems here.
Assigning Partitions to Spindles We will want to assign partitions to these spindles so that either can fail without loss of data on filesystems configured for resilience. Reliability on /usr and /home is best achieved using Vinum mirroring. Resilience will have to come differently, however, for the root filesystem since Vinum is not a part of the FreeBSD boot sequence. Here we will have to settle for two identical partitions with a periodic copy from the primary to the backup secondary. The kernel already has support for interleaved swap across all available partitions so there is no need for help from Vinum here. /stand/sysinstall will automatically configure /etc/fstab for all swap partitions given. The &vinum.ap; bootstrapping method given below requires a pair of spindles that I will call the root spindle and the rootback spindle. The rootback spindle must be the same size or larger than the root spindle. These instructions first allocate all space on the root spindle and then allocate exactly that amount of space on a rootback spindle. (After &vinum.ap; is bootstrapped, there is nothing special about either of these spindles--they are interchangeable.) You can later use the remaining space on the rootback spindle for other filesystems. If you have more than two spindles, the bootvinum Perl script and the procedure below will help you initialize them for use with &vinum.ap;. However you will have to figure out how to assign partitions to them on your own.
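As noted above, the kernel's swap interleaving needs no help from Vinum; it only requires that both swap partitions appear as ordinary entries in /etc/fstab, which /stand/sysinstall writes for you. For this example the relevant lines would look roughly like:

/dev/ad0s1b		none		swap	sw		0	0
/dev/ad2s1b		none		swap	sw		0	0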
Assigning Space to Partitions For this example, I will use two spindles: one with 4,124,673 blocks (about 2 GB) on /dev/ad0 and one with 8,420,769 blocks (about 4 GB) on /dev/ad2. It is best to configure your two spindles on separate controllers so that both can operate in parallel and so that you will have failure resilience in case a controller dies. Note that mirrored volume write performance will be halved in cases where both spindles share a controller that requires they operate serially (as is often the case with ATA controllers). One spindle will be the master on the primary ATA controller and the other will be the master on the secondary ATA controller. Recall that we will be allocating space on the smaller spindle first and the larger spindle second.
Assigning Partitions on the Root Spindle We will allocate 200,000 blocks (about 93 MB) for a root filesystem on each spindle (/dev/ad0s1a and /dev/ad2s1a). We will initially allocate 200,265 blocks for a swap partition on each spindle, giving a total of about 186 MB of swap space (/dev/ad0s1b and /dev/ad2s1b). We will lose 265 blocks from each swap partition as part of the bootstrapping process. This is the size of the space used by Vinum to store configuration information. The space will be taken from swap and given to a vinum partition but will be unavailable for Vinum subdisks. I have done the partition allocation in nice round numbers of blocks just to emphasize where the 265 blocks go. There is nothing wrong with allocating space in MB if that is more convenient for you. This leaves 4,124,673 - 200,000 - 200,265 = 3,724,408 blocks (about 1,818 MB) on the root spindle for the filesystems that will later be embedded in Vinum (/dev/ad0s1e and /dev/ad0s1f). From this, allocate 1,000,000 blocks (about 488 MB) for /home and the remaining 2,724,408 blocks (about 1,330 MB) for /usr; the 265 blocks of Vinum configuration information come out of the swap partition as described above. See the figure below for a graphical view. The left-hand side shows what spindle ad0 will look like at the end of phase 2. The right-hand side shows what it will look like at the end of phase 3.
Spindle ad0 Before and After Vinum

       ad0 Before Vinum       Offset (blocks)       ad0 After Vinum
  +----------------------+ <--      0--> +----------------------+
  | root                 |               | root                 |
  | /dev/ad0s1a          |               | /dev/ad0s1a          |
  +----------------------+ <-- 200000--> +----------------------+
  | swap                 |               | swap                 |
  | /dev/ad0s1b          |               | /dev/ad0s1b          |
  |                      |     400000--> +----------------------+
  |                      |               | Vinum drive YouCrazy |
  |                      |               | /dev/ad0s1h          |
  +----------------------+ <-- 400265--> +-----------------+    |
  | /home                |               | Vinum sd        |    |
  | /dev/ad0s1e          |               | home.p0.s0      |    |
  +----------------------+ <--1400265--> +-----------------+    |
  | /usr                 |               | Vinum sd        |    |
  | /dev/ad0s1f          |               | usr.p0.s0       |    |
  +----------------------+ <--4124673--> +-----------------+----+
                           Not to scale

Spindle /dev/ad0 Before and After Vinum
Assigning Partitions on the Rootback Spindle The /rootback and swap partition sizes on the rootback spindle must match the root and swap partition sizes on the root spindle. That leaves 8,420,769 - 200,000 - 200,265 = 8,020,504 blocks for the Vinum partition. Mirrors of /home and /usr receive the same allocation as on the root spindle. That will leave an extra 2 GB or so that we can deal with later. See the figure below for a graphical view. The left-hand side shows what spindle ad2 will look like at the beginning of phase 4. The right-hand side shows what it will look like at the end.
Spindle ad2 Before and After Vinum

       ad2 Before Vinum       Offset (blocks)       ad2 After Vinum
  +----------------------+ <--      0--> +----------------------+
  | /rootback            |               | /rootback            |
  | /dev/ad2s1e          |               | /dev/ad2s1a          |
  +----------------------+ <-- 200000--> +----------------------+
  | swap                 |               | swap                 |
  | /dev/ad2s1b          |               | /dev/ad2s1b          |
  |                      |     400000--> +----------------------+
  |                      |               | Vinum drive UpWindow |
  |                      |               | /dev/ad2s1h          |
  +----------------------+ <-- 400265--> +-----------------+    |
  | /NOFUTURE            |               | Vinum sd        |    |
  | /dev/ad2s1f          |               | home.p1.s0      |    |
  |                      |    1400265--> +-----------------+    |
  |                      |               | Vinum sd        |    |
  |                      |               | usr.p1.s0       |    |
  |                      |    4124673--> +-----------------+    |
  |                      |               | Vinum sd        |    |
  |                      |               | hope.p0.s0      |    |
  +----------------------+ <--8420769--> +-----------------+----+
                           Not to scale

Spindle ad2 Before and After Vinum
Preparation of Tools The bootvinum Perl script given below will make the Vinum bootstrapping process much easier if you can run it on the machine being bootstrapped. It is over 200 lines and you would not want to type it in. At this point, I recommend that you copy it to a floppy or arrange some alternative method of making it readily available later when it is needed. For example: &prompt.root; fdformat -f 1440 /dev/fd0 &prompt.root; newfs_msdos -f 1440 /dev/fd0 &prompt.root; mount_msdos /dev/fd0 /mnt &prompt.root; cp /usr/share/examples/vinum/bootvinum /mnt XXX Someday, I would like this script to live in /usr/share/examples/vinum. Till then, please use this link to get a copy.
Bootstrapping Phase 2: Minimal OS Installation Our goal in this phase is to complete the smallest possible FreeBSD installation in such a way that we can later install Vinum. We will use only partitions of type 4.2BSD (i.e., regular UFS file systems) since that is the only type supported by /stand/sysinstall.
Phase 2 Example Start up the FreeBSD installation process by running /stand/sysinstall from installation media as you normally would. Fdisk partition all spindles as needed. Make sure to select BootMgr for all spindles. Partition the root spindle with appropriate block allocations as described above in . For this example on a 2 GB spindle, I will use 200,000 blocks for root, 200,265 blocks for swap, 1,000,000 blocks for /home, and the rest of the spindle (2,724,408 blocks) for /usr. (/stand/sysinstall should automatically assign these to /dev/ad0s1a, /dev/ad0s1b, /dev/ad0s1e, and /dev/ad0s1f by default.) If you prefer Soft Updates as I do and you are using 4.4-RELEASE or better, this is a good time to enable them. Partition the rootback spindle with the appropriate block allocations as described above in . For this example on a 4 GB spindle, I will use 200,000 blocks for /rootback, 200,265 blocks for swap, and the rest of the spindle (8,020,504 blocks) for /NOFUTURE. (/stand/sysinstall should automatically assign these to /dev/ad2s1e, /dev/ad2s1b, and /dev/ad2s1f by default.) We do not really want to have a /NOFUTURE UFS filesystem (we want a vinum partition instead), but that is the best choice we have for the space given the limitations of /stand/sysinstall. Mount point names beginning with NOFUTURE and rootback serve as sentinels to the bootstrapping script presented in below. Partition any other spindles with swap if desired and a single /NOFUTURExx filesystem. Select a minimum system install for now even if you want to end up with more distributions loaded later. Do not worry about system configuration options at this point--get Vinum set up and get the partitions in the right places first. Exit /stand/sysinstall and reboot. Do a quick test to verify that the minimum installation was successful. The left-hand side of above and the left-hand side of above show how the disks will look at this point.
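A quick way to verify the allocations before moving on is to look at the label that /stand/sysinstall wrote for the root spindle. A sketch of roughly what to expect for this example (the extra fsize/bsize columns and geometry header are omitted here):

&prompt.root; disklabel ad0s1
#        size   offset    fstype
  a:   200000        0    4.2BSD    # /
  b:   200265   200000      swap
  c:  4124673        0    unused    # entire slice
  e:  1000000   400265    4.2BSD    # /home
  f:  2724408  1400265    4.2BSD    # /usr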
Bootstrapping Phase 3: Root Spindle Setup Our goal in this phase is get Vinum set up and running on the root spindle. We will embed the existing /usr and /home filesystems in a Vinum partition. Note that the Vinum volumes created will not yet be failure-resilient since we have only one underlying Vinum drive to hold them. The resulting system will automatically start Vinum as it boots to multi-user mode.
Phase 3 Example Login as root. We will need a directory in the root filesystem in which to keep a few files that will be used in the Vinum bootstrapping process. &prompt.root; mkdir /bootvinum &prompt.root; cd /bootvinum Several files need to be prepared for use in bootstrapping. I have written a Perl script that makes all the required files for you. Copy this script to /bootvinum by floppy disk, tape, network, or any convenient means and then run it. (If you cannot get this script copied onto the machine being bootstrapped, then see below for a manual alternative.) &prompt.root; cp /mnt/bootvinum . &prompt.root; ./bootvinum bootvinum produces no output when run successfully. If you get any errors, something may have gone wrong when you were creating partitions with /stand/sysinstall above. Running bootvinum will: Create /etc/fstab.vinum based on what it finds in your existing /etc/fstab Create new disk labels for each spindle mentioned in /etc/fstab and keep copies of the current disk labels Create files needed as input to vinum for building Vinum objects on each spindle Create many alternates to /etc/fstab.vinum that might come in handy should a spindle fail You may want to take a look at these files to learn more about the disk partitioning required for Vinum or to learn more about the commands needed to create Vinum objects. We now need to install new spindle partitioning for /dev/ad0. This requires that /dev/ad0s1b not be in use for swapping so we have to reboot in single-user mode. First, reboot the system. &prompt.root; reboot Next, enter single-user mode. Hit [Enter] to boot immediately, or any other key for command prompt. Booting [kernel] in 8 seconds... Type '?' for a list of commands, 'help' for more detailed help. ok boot -s In single-user mode, install the new partitioning created above. &prompt.root; cd /bootvinum &prompt.root; disklabel -R ad0s1 disklabel.ad0s1 &prompt.root; disklabel -R ad2s1 disklabel.ad2s1 If you have additional spindles, repeat the above commands as appropriate for them. We are about to start Vinum for the first time. It is going to want to create several device nodes under /dev/vinum so we will need to mount the root filesystem for read/write access. &prompt.root; fsck -p / &prompt.root; mount / Now it is time to create the Vinum objects that will embed the existing non-root filesystems on the root spindle in a Vinum partition. This will load the Vinum kernel module and start Vinum as a side effect. &prompt.root; vinum create create.YouCrazy You should see a list of Vinum objects created that looks like the following: 1 drives: D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%) 2 volumes: V home State: up Plexes: 1 Size: 488 MB V usr State: up Plexes: 1 Size: 1330 MB 2 plexes: P home.p0 C State: up Subdisks: 1 Size: 488 MB P usr.p0 C State: up Subdisks: 1 Size: 1330 MB 2 subdisks: S home.p0.s0 State: up PO: 0 B Size: 488 MB S usr.p0.s0 State: up PO: 0 B Size: 1330 MB You should also see several kernel messages which state that the Vinum objects you have created are now up. Our non-root filesystems should now be embedded in a Vinum partition and hence available through Vinum volumes. It is important to test that this embedding worked. &prompt.root; fsck -n /dev/vinum/home &prompt.root; fsck -n /dev/vinum/usr This should produce no errors. If it does produce errors do not fix them. Instead, go back and examine the root spindle partition tables before and after Vinum to see if you can spot the error. 
You can back out the partition table changes by using disklabel -R with the disklabel.*.b4vinum files. While we have the root filesystem mounted read/write, this is a good time to install /etc/fstab. &prompt.root; mv /etc/fstab /etc/fstab.b4vinum &prompt.root; cp /etc/fstab.vinum /etc/fstab We are now done with tasks requiring single-user mode, so it is safe to go multi-user from here on. &prompt.root; ^D Login as root. Edit /etc/rc.conf and add this line: start_vinum="YES"
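Once the system is back in multi-user mode, it is worth a moment to confirm that the new /etc/fstab and the Vinum startup are doing their jobs; a quick check (output omitted here, and it will vary):

&prompt.root; mount | grep vinum
&prompt.root; vinum list

You should see /home and /usr mounted from /dev/vinum volumes and all Vinum objects in state up.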
Bootstrapping Phase 4: Rootback Spindle Setup Our goal in this phase is to get redundant copies of all data from the root spindle to the rootback spindle. We will first create the necessary Vinum objects on the rootback spindle. Then we will ask Vinum to copy the data from the root spindle to the rootback spindle. Finally, we use dump and restore to copy the root filesystem.
Phase 4 Example Now that Vinum is running on the root spindle, we can bring it up on the rootback spindle so that our Vinum volumes can become failure-resilient. &prompt.root; cd /bootvinum &prompt.root; vinum create create.UpWindow You should see a list of Vinum objects created that looks like the following: 2 drives: D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%) D UpWindow State: up Device /dev/ad2s1h Avail: 2096/3915 MB (53%) 2 volumes: V home State: up Plexes: 2 Size: 488 MB V usr State: up Plexes: 2 Size: 1330 MB 4 plexes: P home.p0 C State: up Subdisks: 1 Size: 488 MB P usr.p0 C State: up Subdisks: 1 Size: 1330 MB P home.p1 C State: faulty Subdisks: 1 Size: 488 MB P usr.p1 C State: faulty Subdisks: 1 Size: 1330 MB 4 subdisks: S home.p0.s0 State: up PO: 0 B Size: 488 MB S usr.p0.s0 State: up PO: 0 B Size: 1330 MB S home.p1.s0 State: stale PO: 0 B Size: 488 MB S usr.p1.s0 State: stale PO: 0 B Size: 1330 MB You should also see several kernel messages which state that some of the Vinum objects you have created are now up while others are faulty or stale. Now we ask Vinum to copy each of the subdisks on drive YouCrazy to drive UpWindow. This will change the state of the newly created Vinum subdisks from stale to up. It will also change the state of the newly created Vinum plexes from faulty to up. First, we do the new subdisk we added to /home. &prompt.root; vinum start -w home.p1.s0 reviving home.p1.s0 (time passes . . . ) home.p1.s0 is up by force home.p1 is up home.p1.s0 is up My 5,400 RPM EIDE spindles copied at about 3.5 MBytes/sec. Your mileage may vary. Next we do the new subdisk we added to /usr. &prompt.root; vinum start -w usr.p1.s0 reviving usr.p1.s0 (time passes . . . ) usr.p1.s0 is up by force usr.p1 is up usr.p1.s0 is up All Vinum objects should be in state up at this point. The output of vinum list should look like the following: 2 drives: D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%) D UpWindow State: up Device /dev/ad2s1h Avail: 2096/3915 MB (53%) 2 volumes: V home State: up Plexes: 2 Size: 488 MB V usr State: up Plexes: 2 Size: 1330 MB 4 plexes: P home.p0 C State: up Subdisks: 1 Size: 488 MB P usr.p0 C State: up Subdisks: 1 Size: 1330 MB P home.p1 C State: up Subdisks: 1 Size: 488 MB P usr.p1 C State: up Subdisks: 1 Size: 1330 MB 4 subdisks: S home.p0.s0 State: up PO: 0 B Size: 488 MB S usr.p0.s0 State: up PO: 0 B Size: 1330 MB S home.p1.s0 State: up PO: 0 B Size: 488 MB S usr.p1.s0 State: up PO: 0 B Size: 1330 MB Copy the root filesystem so that you will have a backup. &prompt.root; cd /rootback &prompt.root; dump 0f - / | restore rf - &prompt.root; rm restoresymtable &prompt.root; cd / You may see errors like this: ./tmp/rstdir1001216411: (inode 558) not found on tape cannot find directory inode 265 abort? [yn] n expected next file 492, got 491 They seem to cause no harm. I suspect they are a consequence of dumping the filesystem containing /tmp and/or the pipe connecting dump and restore. Make a directory on which we can mount a damaged root filesystem during the recovery process. &prompt.root; mkdir /rootbad Remove sentinel mount points that are now unused. &prompt.root; rmdir /NOFUTURE* Create empty &vinum.ap; drives on remaining spindles. &prompt.root; vinum create create.ThruBank &prompt.root; ... At this point, the reliable server foundation is complete. The right-hand side of above and the right-hand side of above show how the disks will look. You may want to do a quick reboot to multi-user and give it a quick test drive. 
This is also a good point to complete installation of other distributions beyond the minimal install. Add packages, ports, and users as required. Configure /etc/rc.conf as required. After you have completed your server configuration, remember to do one more copy of root to /rootback as shown above before placing the server into production. Make a schedule to refresh /rootback periodically. It may be a good idea to mount /rootback read-only for normal operation of the server. This does, however, complicate the periodic refresh a bit. Do not forget to watch /var/log/messages carefully for errors. Vinum may automatically avoid failed hardware in a way that users do not notice. You must watch for such failures and get them repaired before a second failure results in data loss. You may see Vinum noting damaged objects at server boot time.
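One way to handle the periodic refresh of /rootback mentioned above is a small root cron job. The script below is only a sketch: the path /root/bin/refresh_rootback and the weekly schedule are arbitrary choices, and it assumes /rootback is normally mounted read-only from /etc/fstab.

#!/bin/sh
# refresh_rootback: copy the live root filesystem onto /rootback
mount -u -o rw /rootback || exit 1      # allow writes during the refresh
cd /rootback || exit 1
dump 0f - / | restore rf -              # same copy used during bootstrapping
rm -f restoresymtable
cd /
mount -u -o ro /rootback                # back to read-only

A matching entry in root's crontab (crontab -e) might be:

0 3 * * 0	/root/bin/refresh_rootback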
Where to Go from Here? Now that you have established the foundation of a reliable server, there are several things you might want to try next.
Make a Vinum Volume with Remaining Space Following are the steps to create another Vinum volume with space remaining on the rootback spindle. This volume will not be resilient to spindle failure since it has only one plex on a single spindle. Create a file with the following contents: volume hope plex name hope.p0 org concat volume hope sd name hope.p0.s0 drive UpWindow plex hope.p0 len 0 Specifying a length of 0 for the hope.p0.s0 subdisk asks Vinum to use whatever space is left available on the underlying drive. Feed these commands into vinum . &prompt.root; vinum create filename Now we newfs the volume and mount it. &prompt.root; newfs -v /dev/vinum/hope &prompt.root; mkdir /hope &prompt.root; mount /dev/vinum/hope /hope Edit /etc/fstab if you want /hope mounted at boot time.
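If you do want /hope mounted at boot, the /etc/fstab entry follows the same pattern as the other Vinum volumes; roughly:

/dev/vinum/hope		/hope		ufs	rw		2	2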
Try Out More Vinum Commands You might already be familiar with vinum list to get a list of all Vinum objects. Try following it with -v to see more detail. If you have more spindles and you want to bring them up as concatenated, mirrored, or striped volumes, then give vinum concat drivelist, vinum mirror drivelist, or vinum stripe drivelist a try. See &man.vinum.8; for sample configurations and important performance considerations before settling on a final organization for your additional spindles. The failure recovery instructions below will also give you some experience using more Vinum commands.
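For example, if two additional spindles were already initialized as Vinum drives ThruBank and OutSnakes in phase 4, a mirrored volume over them can be described with the same configuration-file style used for hope above; a sketch, not a recipe:

volume scratch
plex name scratch.p0 org concat volume scratch
sd name scratch.p0.s0 drive ThruBank plex scratch.p0 len 0
plex name scratch.p1 org concat volume scratch
sd name scratch.p1.s0 drive OutSnakes plex scratch.p1 len 0

Feed the file to vinum create, then newfs -v /dev/vinum/scratch and mount it as usual. If the two drives differ in size, give the subdisks explicit, equal lengths instead of len 0 so that the plexes mirror each other completely.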
Failure Scenarios This section contains descriptions of various failure scenarios. For each scenario, there is a subsection on how to configure your server for degraded mode operation, how to recover from the failure, how to exit degraded mode, and how to simulate the failure. Make a hard copy of these instructions and leave them inside the CPU case, being careful not to interfere with ventilation.
Root filesystem on ad0 unusable, rest of drive ok We assume here that the boot blocks and disk label on /dev/ad0 are ok. If your BIOS can boot from a drive other than C:, you may be able to get around this limitation.
Configure Server for Degraded Mode Use BootMgr to load kernel from /dev/ad2s1a. Hit F5 in BootMgr to select Drive 1. Hit F1 to select FreeBSD. After the kernel is loaded, hit any key but enter to interrupt the boot sequence. Boot into single-user mode and allow explicit entry of a root filesystem. Hit [Enter] to boot immediately, or any other key for command prompt. Booting [kernel] in 8 seconds... Type '?' for a list of commands, 'help' for more detailed help. ok boot -as Select /rootback as your root filesystem. Manual root filesystem specification: <fstype>:<device> Mount <device> using filesystem <fstype> e.g. ufs:/dev/da0s1a ? List valid disk boot devices <empty line> Abort manual input mountroot> ufs:/dev/ad2s1a Now that you are in single-user mode, change /etc/fstab to avoid the bad root filesystem. If you used the bootvinum Perl script from below, then these commands should configure your server for degraded mode. &prompt.root; fsck -p / &prompt.root; mount / &prompt.root; cd /etc &prompt.root; mv fstab fstab.bak &prompt.root; cp fstab_ad0s1_root_bad fstab &prompt.root; cd / &prompt.root; mount -o ro / &prompt.root; vinum start &prompt.root; fsck -p &prompt.root; ^D
Recovery Restore /dev/ad0s1a from backups or copy /rootback to it with these commands: &prompt.root; umount /rootbad &prompt.root; newfs /dev/ad0s1a &prompt.root; tunefs -n enable /dev/ad0s1a &prompt.root; mount /rootbad &prompt.root; cd /rootbad &prompt.root; dump 0f - / | restore rf - &prompt.root; rm restoresymtable
Exiting Degraded Mode Enter single-user mode. &prompt.root; shutdown now Put /etc/fstab back to normal and reboot. &prompt.root; cd /rootbad/etc &prompt.root; rm fstab &prompt.root; mv fstab.bak fstab &prompt.root; reboot Reboot and hit F1 to boot from /dev/ad0 when prompted by BootMgr.
Simulation This kind of failure can be simulated by shutting down to single-user mode and then booting as shown above.
Drive ad2 Fails This section deals with the total failure of /dev/ad2.
Configure Server for Degraded Mode After the kernel is loaded, hit any key but Enter to interrupt the boot sequence. Boot into single-user mode. Hit [Enter] to boot immediately, or any other key for command prompt. Booting [kernel] in 8 seconds... Type '?' for a list of commands, 'help' for more detailed help. ok boot -s Change /etc/fstab to avoid the bad drive. If you used the bootvinum Perl script from below, then these commands should configure your server for degraded mode. &prompt.root; fsck -p / &prompt.root; mount / &prompt.root; cd /etc &prompt.root; mv fstab fstab.bak &prompt.root; cp fstab_only_have_ad0s1 fstab &prompt.root; cd / &prompt.root; mount -o ro / &prompt.root; vinum start &prompt.root; fsck -p &prompt.root; ^D If you do not have modified versions of /etc/fstab that are ready for use, then you can use ed to make one. Alternatively, you can fsck and mount /usr and then use your favorite editor.
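If you do end up in ed, commenting out every fstab line that references the failed spindle is a one-liner; a sketch for this example (the pattern would change with your device names):

&prompt.root; ed /etc/fstab
g/^\/dev\/ad2/s/^/#/
w
q

ed prints only the file size when it reads and writes the file; anything else indicates a problem.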
Recovery We assume here that your server is up and running multi-user in degraded mode on just /dev/ad0 and that you have a new spindle now on /dev/ad2 ready to go. You will need a new spindle with enough room to hold root and swap partitions plus a Vinum partition large enough to hold /home and /usr. Create a BIOS partition (slice) on the new spindle. &prompt.root; /stand/sysinstall Select Custom. Select Partition. Select ad2. Create a FreeBSD (type 165) slice large enough to hold everything mentioned above. Write changes. Yes, you are absolutely sure. Select BootMgr. Quit Partitioning. Exit /stand/sysinstall. Create disk label partitioning based on current /dev/ad0 partitioning. &prompt.root; disklabel ad0 > /tmp/ad0 &prompt.root; disklabel -e ad2 This will drop you into your favorite editor. Copy the lines for the a and b partitions from /tmp/ad0 to the ad2 disklabel. Add the size of the a and b partitions to find the proper offset for the h partition. Subtract this offset from the size of the c partition to find the proper size for the h partition. Define an h partition with the size and offset calculated above. Set the fstype column to vinum. Save the file and quit your editor. Tell Vinum about the new drive. Ask Vinum to start an editor with a copy of the current configuration. &prompt.root; vinum create Uncomment the drive line referring to drive UpWindow and set device to /dev/ad2s1h. Save the file and quit your editor. Now that Vinum has two spindles again, revive the mirrors. &prompt.root; vinum start -w usr.p1.s0 &prompt.root; vinum start -w home.p1.s0 Now we need to restore /rootback to a current copy of the root filesystem. These commands will accomplish this. &prompt.root; newfs /dev/ad2s1a &prompt.root; tunefs -n enable /dev/ad2s1a &prompt.root; mount /dev/ad2s1a /mnt &prompt.root; cd /mnt &prompt.root; dump 0f - / | restore rf - &prompt.root; rm restoresymtable &prompt.root; cd / &prompt.root; umount /mnt
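To make the disk label arithmetic in the steps above concrete with this example's numbers: the a and b partitions copied from /tmp/ad0 are 200,000 blocks each (swap has already given up its 265 blocks), so the h partition starts at offset 200,000 + 200,000 = 400,000. With a replacement spindle the same size as the original (c = 8,420,769 blocks), the h size is 8,420,769 - 400,000 = 8,020,769 blocks. The resulting label line and the edited Vinum drive line would look roughly like:

  h:  8020769   400000     vinum

drive UpWindow device /dev/ad2s1h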
Exiting Degraded Mode Enter single-user mode. &prompt.root; shutdown now Return /etc/fstab to its normal state and reboot. &prompt.root; cd /etc &prompt.root; rm fstab &prompt.root; mv fstab.bak fstab &prompt.root; reboot
Simulation You can simulate this kind of failure by unplugging /dev/ad2, write-protecting it, or by this procedure: Shutdown to single-user mode. Unmount all non-root filesystems. Clobber any existing Vinum configuration and partitioning on /dev/ad2. &prompt.root; vinum stop &prompt.root; dd if=/dev/zero of=/dev/ad2s1h count=512 &prompt.root; dd if=/dev/zero of=/dev/ad2 count=512
Drive ad0 Fails Some BIOSes can boot from drive 1 or drive 2 (often called C: or D:), while others can boot only from drive 1. If your BIOS can boot from either, the fastest road to recovery might be to boot directly from /dev/ad2 in single-user mode and install /etc/fstab_only_have_ad2s1 as /etc/fstab. You would then have to adapt the /dev/ad2 failure recovery instructions from above. If your BIOS can only boot from drive 1, then you will have to unplug the surviving drive UpWindow from the controller for /dev/ad2 and plug it into the controller for /dev/ad0. Then continue with the instructions for /dev/ad2 failure recovery above.
bootvinum Perl Script The bootvinum Perl script below reads /etc/fstab and current drive partitioning. It then writes several files in the current directory and several variants of /etc/fstab in /etc. These files significantly simplify the installation of Vinum and recovery from spindle failures. #!/usr/bin/perl -w use strict; use FileHandle; -my $config_tag1 = '$Id: article.sgml,v 1.13 2003-08-27 07:13:11 blackend Exp $'; +my $config_tag1 = '$Id: article.sgml,v 1.14 2003-10-18 10:39:16 simon Exp $'; # Copyright (C) 2001 Robert A. Van Valzah # # Bootstrap Vinum # # Read /etc/fstab and current partitioning for all spindles mentioned there. # Generate files needed to mirror all filesystems on root spindle. # A new partition table for each spindle # Input for the vinum create command to create Vinum objects on each spindle # A copy of fstab mounting Vinum volumes instead of BSD partitions # Copies of fstab altered for server's degraded modes of operation # See handbook for instructions on how to use the the files generated. # N.B. This bootstrapping method shrinks size of swap partition by the size # of Vinum's on-disk configuration (265 sectors). It embeds existing file # systems on the root spindle in Vinum objects without having to copy them. # Thanks to Greg Lehey for suggesting this bootstrapping method. # Expectations: # The root spindle must contain at least root, swap, and /usr partitions # The rootback spindle must have matching /rootback and swap partitions # Other spindles should only have a /NOFUTURE* filesystem and maybe swap # File systems named /NOFUTURE* will be replaced with Vinum drives # Change configuration variables below to suit your taste my $vip = 'h'; # VInum Partition my @drv = ('YouCrazy', 'UpWindow', 'ThruBank', # Vinum DRiVe names 'OutSnakes', 'MeWild', 'InMovie', 'HomeJames', 'DownPrices', 'WhileBlind'); # No configuration variables beyond this point my %vols; # One entry per Vinum volume to be created my @spndl; # One entry per SPiNDLe my $rsp; # Root SPindle (as in /dev/$rsp) my $rbsp; # RootBack SPindle (as in /dev/$rbsp) my $cfgsiz = 265; # Size of Vinum on-disk configuration info in sectors my $nxtpas = 2; # Next fsck pass number for non-root filesystems # Parse fstab, generating the version we'll need for Vinum and noting # spindles in use. 
my $fsin = "/etc/fstab"; #my $fsin = "simu/fstab"; open(FSIN, "$fsin") || die("Couldn't open $fsin: $!\n"); my $fsout = "/etc/fstab.vinum"; open(FSOUT, ">$fsout") || die("Couldn't open $fsout for writing: $!\n"); while (<FSIN>) { my ($dev, $mnt, $fstyp, $opt, $dump, $pass) = split; next if $dev =~ /^#/; if ($mnt eq '/' || $mnt eq '/rootback' || $mnt =~ /^\/NOFUTURE/) { my $dn = substr($dev, 5, length($dev)-6); # Device Name without /dev/ push(@spndl, $dn) unless grep($_ eq $dn, @spndl); $rsp = $dn if $mnt eq '/'; next if $mnt =~ /^\/NOFUTURE/; } # Move /rootback from partition e to a if ($mnt =~ /^\/rootback/) { $dev =~ s/e$/a/; $pass = 1; $rbsp = substr($dev, 5, length($dev)-6); print FSOUT "$dev\t\t$mnt\t$fstyp\t$opt\t\t$dump\t$pass\n"; next; } # Move non-root filesystems on smallest spindle into Vinum if (defined($rsp) && $dev =~ /^\/dev\/$rsp/ && $dev =~ /[d-h]$/) { $pass = $nxtpas++; print FSOUT "/dev/vinum$mnt\t\t$mnt\t\t$fstyp\t$opt\t\t$dump\t$pass\n"; $vols{$dev}->{mnt} = substr($mnt, 1); next; } print FSOUT $_; } close(FSOUT); die("Found more spindles than we have abstract names\n") if $#spndl > $#drv; die("Didn't find a root partition!\n") if !defined($rsp); die("Didn't find a /rootback partition!\n") if !defined($rbsp); # Table of server's Degraded Modes # One row per mode with hash keys # fn FileName # xpr eXPRession needed to convert fstab lines for this mode # cm1 CoMment 1 describing this mode # cm2 CoMment 2 describing this mode # FH FileHandle (dynamically initialized below) my @DM = ( { cm1 => "When we only have $rsp, comment out lines using $rbsp", fn => "/etc/fstab_only_have_$rsp", xpr => "s:^/dev/$rbsp:#\$&:", }, { cm1 => "When we only have $rbsp, comment out lines using $rsp and", cm2 => "rootback becomes root", fn => "/etc/fstab_only_have_$rbsp", xpr => "s:^/dev/$rsp:#\$&: || s:/rootback:/\t:", }, { cm1 => "When only $rsp root is bad, /rootback becomes root and", cm2 => "root becomes /rootbad", fn => "/etc/fstab_${rsp}_root_bad", xpr => "s:\t/\t:\t/rootbad: || s:/rootback:/\t:", }, ); # Initialize output FileHandles and write comments foreach my $dm (@DM) { my $fh = new FileHandle; $fh->open(">$dm->{fn}") || die("Can't write $dm->{fn}: $!\n"); print $fh "# $dm->{cm1}\n" if $dm->{cm1}; print $fh "# $dm->{cm2}\n" if $dm->{cm2}; $dm->{FH} = $fh; } # Parse the Vinum version of fstab written above and write versions needed # for server's degraded modes. 
open(FSOUT, "$fsout") || die("Couldn't open $fsout: $!\n"); while (<FSOUT>) { my $line = $_; foreach my $dm (@DM) { $_ = $line; eval $dm->{xpr}; print {$dm->{FH}} $_; } } # Parse partition table for each spindle and write versions needed for Vinum my $rootsiz; # ROOT partition SIZe my $swapsiz; # SWAP partition SIZe my $rspminoff; # Root SPindle MINimum OFFset of non-root, non-swap, non-c parts my $rspsiz; # Root SPindle SIZe my $rbspsiz; # RootBack SPindle SIZe foreach my $i (0..$#spndl) { my $dlin = "disklabel $spndl[$i] |"; # my $dlin = "simu/disklabel.$spndl[$i]"; open(DLIN, "$dlin") || die("Couldn't open $dlin: $!\n"); my $dlout = "disklabel.$spndl[$i]"; open(DLOUT, ">$dlout") || die("Couldn't open $dlout for writing: $!\n"); my $dlb4 = "$dlout.b4vinum"; open(DLB4, ">$dlb4") || die("Couldn't open $dlb4 for writing: $!\n"); my $minoff; # MINimum OFFset of non-root, non-swap, non-c partitions my $totsiz = 0; # TOTal SIZe of all non-root, non-swap, non-c partitions my $swapspndl = 0; # True if SWAP partition on this SPiNDLe while (<DLIN>) { print DLB4 $_; my ($part, $siz, $off, $fstyp, $fsiz, $bsiz, $bps) = split; if ($part && $part eq 'a:' && $spndl[$i] eq $rsp) { $rootsiz = $siz; } if ($part && $part eq 'e:' && $spndl[$i] eq $rbsp) { if ($rootsiz != $siz) { die("Rootback size ($siz) != root size ($rootsiz)\n"); } } if ($part && $part eq 'c:') { $rspsiz = $siz if $spndl[$i] eq $rsp; $rbspsiz = $siz if $spndl[$i] eq $rbsp; } # Make swap partition $cfgsiz sectors smaller if ($part && $part eq 'b:') { if ($spndl[$i] eq $rsp) { $swapsiz = $siz; } else { if ($swapsiz != $siz) { die("Swap partition sizes unequal across spindles\n"); } } printf DLOUT "%4s%9d%9d%10s\n", $part, $siz-$cfgsiz, $off, $fstyp; $swapspndl = 1; next; } # Move rootback spindle e partitions to a if ($part && $part eq 'e:' && $spndl[$i] eq $rbsp) { printf DLOUT "%4s%9d%9d%10s%9d%6d%6d\n", 'a:', $siz, $off, $fstyp, $fsiz, $bsiz, $bps; next; } # Delete non-root, non-swap, non-c partitions but note their minimum # offset and total size that're needed below. if ($part && $part =~ /^[d-h]:$/) { $minoff = $off unless $minoff; $minoff = $off if $off < $minoff; $totsiz += $siz; if ($spndl[$i] eq $rsp) { # If doing spindle containing root my $dev = "/dev/$spndl[$i]" . substr($part, 0, 1); $vols{$dev}->{siz} = $siz; $vols{$dev}->{off} = $off; $rspminoff = $minoff; } next; } print DLOUT $_; } if ($swapspndl) { # If there was a swap partition on this spindle # Make a Vinum partition the size of all non-root, non-swap, # non-c partitions + the size of Vinum's on-disk configuration. # Set its offset so that the start of the first subdisk it contains # coincides with the first filesystem we're embedding in Vinum. printf DLOUT "%4s%9d%9d%10s\n", "$vip:", $totsiz+$cfgsiz, $minoff-$cfgsiz, 'vinum'; } else { # No need to mess with size size and offset if there was no swap printf DLOUT "%4s%9d%9d%10s\n", "$vip:", $totsiz, $minoff, 'vinum'; } } die("Swap partition not found\n") unless $swapsiz; die("Swap partition not larger than $cfgsiz blocks\n") unless $swapsiz>$cfgsiz; die("Rootback spindle size not >= root spindle size\n") unless $rbspsiz>=$rspsiz; # Generate input to vinum create command needed for each spindle. 
foreach my $i (0..$#spndl) { my $cfn = "create.$drv[$i]"; # Create File Name open(CF, ">$cfn") || die("Can't open $cfn for writing: $!\n"); print CF "drive $drv[$i] device /dev/$spndl[$i]$vip\n"; next unless $spndl[$i] eq $rsp || $spndl[$i] eq $rbsp; foreach my $dev (keys(%vols)) { my $mnt = $vols{$dev}->{mnt}; my $siz = $vols{$dev}->{siz}; my $off = $vols{$dev}->{off}-$rspminoff+$cfgsiz; print CF "volume $mnt\n" if $spndl[$i] eq $rsp; print CF <<EOF; plex name $mnt.p$i org concat volume $mnt sd name $mnt.p$i.s0 drive $drv[$i] plex $mnt.p$i len ${siz}s driveoffset ${off}s EOF } } Manual Vinum Bootstrapping The bootvinum Perl script in makes life easier, but it may be necessary to manually perform some or all of the steps that it automates. This appendix describes how you would manually mimic the script. Make a copy of /etc/fstab to be customized. &prompt.root; cp /etc/fstab /etc/fstab.vinum Edit /etc/fstab.vinum. Change the device column of non-root partitions on the root spindle to /dev/vinum/mnt. Change the pass column of non-root partitions on the root spindle to 2, 3, etc. Delete any lines with mountpoint matching /NOFUTURE*. Change the device column of /rootback from e to a. Change the pass column of /rootback to 1. Prepare disklabels for editing: &prompt.root; cd /bootvinum &prompt.root; disklabel ad0s1 > disklabel.ad0s1 &prompt.root; cp disklabel.ad0s1 disklabel.ad0s1.b4vinum &prompt.root; disklabel ad2s1 > disklabel.ad2s1 &prompt.root; cp disklabel.ad2s1 disklabel.ad2s1.b4vinum Edit /etc/disklabel.ad?s1. On the root spindle: Decrease the size of the b partition by 265 blocks. Note the size and offset of the a and b partitions. Note the smallest offset for partitions d-h. Note the size and offset for all non-root, non-swap partitions (/home was probably on e and /usr was probably on f). Delete partitions d-h. Create a new h partition with offset 265 blocks less than the smallest offset for partitions d-h noted above. Set its size to the size of the c partition less the smallest offset for partitions d-h noted above + 265 blocks. Vinum can use any partition other than c. It is not strictly necessary to use h for all your Vinum partitions, but it is good practice to be consistent across all spindles. Set the fstype of this new partition to vinum. On the rootback spindle: Move the e partition to a. Verify that the size of the a and b partitions matches the root spindle. Note the smallest offset for partitions d-h. Delete partitions d-h. Create a new h partition with offset 265 blocks less than the smallest offset noted above for partitions d-h. Set its size to the size of the c partition less the smallest offset for partitions d-h noted above + 265 blocks. Set the fstype of this new partition to vinum. Create a file named create.YouCrazy that contains: drive YouCrazy device /dev/ad0s1h volume home plex name home.p0 org concat volume home sd name home.p0.s0 drive YouCrazy plex home.p0 len $hl driveoffset $ho volume usr plex name usr.p0 org concat volume usr sd name usr.p0.s0 drive YouCrazy plex usr.p0 len $ul driveoffset $uo Where: $hl is the length noted above for /home. $ho is the offset noted above for /home less the smallest offset noted above + 265 blocks. $ul is the length noted above for /usr. $uo is the offset noted above for /usr less the smallest offset noted above + 265 blocks. 
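Plugging this example's numbers into the template above: the smallest d-h offset was 400,265, /home was 1,000,000 blocks at offset 400,265, and /usr was 2,724,408 blocks at offset 1,400,265, so $ho = 400,265 - 400,265 + 265 = 265 and $uo = 1,400,265 - 400,265 + 265 = 1,000,265. A sketch of the resulting create.YouCrazy (the s suffix means sectors, i.e. blocks):

drive YouCrazy device /dev/ad0s1h
volume home
plex name home.p0 org concat volume home
sd name home.p0.s0 drive YouCrazy plex home.p0 len 1000000s driveoffset 265s
volume usr
plex name usr.p0 org concat volume usr
sd name usr.p0.s0 drive YouCrazy plex usr.p0 len 2724408s driveoffset 1000265s

The same lengths and offsets go into create.UpWindow below, with drive UpWindow and the p1 plex and subdisk names.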
Create a file named create.UpWindow containing: drive UpWindow device /dev/ad2s1h plex name home.p1 org concat volume home sd name home.p1.s0 drive UpWindow plex home.p1 len $hl driveoffset $ho plex name usr.p1 org concat volume usr sd name usr.p1.s0 drive UpWindow plex usr.p1 len $ul driveoffset $uo Where $hl, $ho, $ul, and $uo are set as above. Acknowledgements I would like to thank Greg Lehey for writing &vinum.ap; and for providing very helpful comments on early drafts. Several others made helpful suggestions after reviewing later drafts including Dag-Erling Smørgrav, Michael Splendoria, Chern Lee, Stefan Aeschbacher, Fleming Froekjaer, Bernd Walter, Aleksey Baranov, and Doug Swarin.
diff --git a/en_US.ISO8859-1/articles/vm-design/article.sgml b/en_US.ISO8859-1/articles/vm-design/article.sgml index bbe6da9094..c77ab30396 100644 --- a/en_US.ISO8859-1/articles/vm-design/article.sgml +++ b/en_US.ISO8859-1/articles/vm-design/article.sgml @@ -1,841 +1,851 @@ %man; %freebsd; + +%trademarks; ]>
Design elements of the FreeBSD VM system Matthew Dillon
dillon@apollo.backplane.com
+ + &tm-attrib.freebsd; + &tm-attrib.linux; + &tm-attrib.microsoft; + &tm-attrib.opengroup; + &tm-attrib.general; + + The title is really just a fancy way of saying that I am going to attempt to describe the whole VM enchilada, hopefully in a way that everyone can follow. For the last year I have concentrated on a number of major kernel subsystems within FreeBSD, with the VM and Swap subsystems being the most interesting and NFS being a necessary chore. I rewrote only small portions of the code. In the VM arena the only major rewrite I have done is to the swap subsystem. Most of my work was cleanup and maintenance, with only moderate code rewriting and no major algorithmic adjustments within the VM subsystem. The bulk of the VM subsystem's theoretical base remains unchanged and a lot of the credit for the modernization effort in the last few years belongs to John Dyson and David Greenman. Not being a historian like Kirk I will not attempt to tag all the various features with peoples names, since I will invariably get it wrong. This article was originally published in the January 2000 issue of DaemonNews. This version of the article may include updates from Matt and other authors to reflect changes in FreeBSD's VM implementation.
Introduction Before moving along to the actual design let's spend a little time on the necessity of maintaining and modernizing any long-living codebase. In the programming world, algorithms tend to be more important than code and it is precisely due to BSD's academic roots that a great deal of attention was paid to algorithm design from the beginning. More attention paid to the design generally leads to a clean and flexible codebase that can be fairly easily modified, extended, or replaced over time. While BSD is considered an old operating system by some people, those of us who work on it tend to view it more as a mature codebase which has various components modified, extended, or replaced with modern code. It has evolved, and FreeBSD is at the bleeding edge no matter how old some of the code might be. This is an important distinction to make and one that is unfortunately lost to many people. The biggest error a programmer can make is to not learn from history, and this is precisely the error that - many other modern operating systems have made. NT is the best example + many other modern operating systems have made. &windowsnt; is the best example of this, and the consequences have been dire. Linux also makes this mistake to some degree—enough that we BSD folk can make small jokes about it every once in a while, anyway. Linux's problem is simply one of a lack of experience and history to compare ideas against, a problem that is easily and rapidly being addressed by the Linux community in the same way it has been addressed in the BSD - community—by continuous code development. The NT folk, on the + community—by continuous code development. The &windowsnt; folk, on the other hand, repeatedly make the same mistakes solved by &unix; decades ago and then spend years fixing them. Over and over again. They have a severe case of not designed here and we are always right because our marketing department says so. I have little tolerance for anyone who cannot learn from history. Much of the apparent complexity of the FreeBSD design, especially in the VM/Swap subsystem, is a direct result of having to solve serious performance issues that occur under various conditions. These issues are not due to bad algorithmic design but instead rise from environmental factors. In any direct comparison between platforms, these issues become most apparent when system resources begin to get stressed. As I describe FreeBSD's VM/Swap subsystem the reader should always keep two points in mind. First, the most important aspect of performance design is what is known as Optimizing the Critical Path. It is often the case that performance optimizations add a little bloat to the code in order to make the critical path perform better. Second, a solid, generalized design outperforms a heavily-optimized design over the long run. While a generalized design may end up being slower than an heavily-optimized design when they are first implemented, the generalized design tends to be easier to adapt to changing conditions and the heavily-optimized design winds up having to be thrown away. Any codebase that will survive and be maintainable for years must therefore be designed properly from the beginning even if it costs some performance. Twenty years ago people were still arguing that programming in assembly was better than programming in a high-level language because it produced code that was ten times as fast. Today, the fallibility of that argument is obvious—as are the parallels to algorithmic design and code generalization. 
VM Objects The best way to begin describing the FreeBSD VM system is to look at it from the perspective of a user-level process. Each user process sees a single, private, contiguous VM address space containing several types of memory objects. These objects have various characteristics. Program code and program data are effectively a single memory-mapped file (the binary file being run), but program code is read-only while program data is copy-on-write. Program BSS is just memory allocated and filled with zeros on demand, called demand zero page fill. Arbitrary files can be memory-mapped into the address space as well, which is how the shared library mechanism works. Such mappings can require modifications to remain private to the process making them. The fork system call adds an entirely new dimension to the VM management problem on top of the complexity already given. A program binary data page (which is a basic copy-on-write page) illustrates the complexity. A program binary contains a preinitialized data section which is initially mapped directly from the program file. When a program is loaded into a process's VM space, this area is initially memory-mapped and backed by the program binary itself, allowing the VM system to free/reuse the page and later load it back in from the binary. The moment a process modifies this data, however, the VM system must make a private copy of the page for that process. Since the private copy has been modified, the VM system may no longer free it, because there is no longer any way to restore it later on. You will notice immediately that what was originally a simple file mapping has become much more complex. Data may be modified on a page-by-page basis whereas the file mapping encompasses many pages at once. The complexity further increases when a process forks. When a process forks, the result is two processes—each with their own private address spaces, including any modifications made by the original process prior to the call to fork(). It would be silly for the VM system to make a complete copy of the data at the time of the fork() because it is quite possible that at least one of the two processes will only need to read from that page from then on, allowing the original page to continue to be used. What was a private page is made copy-on-write again, since each process (parent and child) expects their own personal post-fork modifications to remain private to themselves and not effect the other. FreeBSD manages all of this with a layered VM Object model. The original binary program file winds up being the lowest VM Object layer. A copy-on-write layer is pushed on top of that to hold those pages which had to be copied from the original file. If the program modifies a data page belonging to the original file the VM system takes a fault and makes a copy of the page in the higher layer. When a process forks, additional VM Object layers are pushed on. This might make a little more sense with a fairly basic example. A fork() is a common operation for any *BSD system, so this example will consider a program that starts up, and forks. When the process starts, the VM system creates an object layer, let's call this A: +---------------+ | A | +---------------+ A picture A represents the file—pages may be paged in and out of the file's physical media as necessary. Paging in from the disk is reasonable for a program, but we really do not want to page back out and overwrite the executable. 
The VM system therefore creates a second layer, B, that will be physically backed by swap space: +---------------+ | B | +---------------+ | A | +---------------+ On the first write to a page after this, a new page is created in B, and its contents are initialized from A. All pages in B can be paged in or out to a swap device. When the program forks, the VM system creates two new object layers—C1 for the parent, and C2 for the child—that rest on top of B: +-------+-------+ | C1 | C2 | +-------+-------+ | B | +---------------+ | A | +---------------+ In this case, let's say a page in B is modified by the original parent process. The process will take a copy-on-write fault and duplicate the page in C1, leaving the original page in B untouched. Now, let's say the same page in B is modified by the child process. The process will take a copy-on-write fault and duplicate the page in C2. The original page in B is now completely hidden since both C1 and C2 have a copy and B could theoretically be destroyed if it does not represent a real file). However, this sort of optimization is not trivial to make because it is so fine-grained. FreeBSD does not make this optimization. Now, suppose (as is often the case) that the child process does an exec(). Its current address space is usually replaced by a new address space representing a new file. In this case, the C2 layer is destroyed: +-------+ | C1 | +-------+-------+ | B | +---------------+ | A | +---------------+ In this case, the number of children of B drops to one, and all accesses to B now go through C1. This means that B and C1 can be collapsed together. Any pages in B that also exist in C1 are deleted from B during the collapse. Thus, even though the optimization in the previous step could not be made, we can recover the dead pages when either of the processes exit or exec(). This model creates a number of potential problems. The first is that you can wind up with a relatively deep stack of layered VM Objects which can cost scanning time and memory when you take a fault. Deep layering can occur when processes fork and then fork again (either parent or child). The second problem is that you can wind up with dead, inaccessible pages deep in the stack of VM Objects. In our last example if both the parent and child processes modify the same page, they both get their own private copies of the page and the original page in B is no longer accessible by anyone. That page in B can be freed. FreeBSD solves the deep layering problem with a special optimization called the All Shadowed Case. This case occurs if either C1 or C2 take sufficient COW faults to completely shadow all pages in B. Lets say that C1 achieves this. C1 can now bypass B entirely, so rather then have C1->B->A and C2->B->A we now have C1->A and C2->B->A. But look what also happened—now B has only one reference (C2), so we can collapse B and C2 together. The end result is that B is deleted entirely and we have C1->A and C2->A. It is often the case that B will contain a large number of pages and neither C1 nor C2 will be able to completely overshadow it. If we fork again and create a set of D layers, however, it is much more likely that one of the D layers will eventually be able to completely overshadow the much smaller dataset represented by C1 or C2. The same optimization will work at any point in the graph and the grand result of this is that even on a heavily forked machine VM Object stacks tend to not get much deeper then 4. 
This is true of both the parent and the children and true whether the parent is doing the forking or whether the children cascade forks. The dead page problem still exists in the case where C1 or C2 do not completely overshadow B. Due to our other optimizations this case does not represent much of a problem and we simply allow the pages to be dead. If the system runs low on memory it will swap them out, eating a little swap, but that is it. The advantage to the VM Object model is that fork() is extremely fast, since no real data copying need take place. The disadvantage is that you can build a relatively complex VM Object layering that slows page fault handling down a little, and you spend memory managing the VM Object structures. The optimizations FreeBSD makes proves to reduce the problems enough that they can be ignored, leaving no real disadvantage. SWAP Layers Private data pages are initially either copy-on-write or zero-fill pages. When a change, and therefore a copy, is made, the original backing object (usually a file) can no longer be used to save a copy of the page when the VM system needs to reuse it for other purposes. This is where SWAP comes in. SWAP is allocated to create backing store for memory that does not otherwise have it. FreeBSD allocates the swap management structure for a VM Object only when it is actually needed. However, the swap management structure has had problems historically. Under FreeBSD 3.X the swap management structure preallocates an array that encompasses the entire object requiring swap backing store—even if only a few pages of that object are swap-backed. This creates a kernel memory fragmentation problem when large objects are mapped, or processes with large runsizes (RSS) fork. Also, in order to keep track of swap space, a list of holes is kept in kernel memory, and this tends to get severely fragmented as well. Since the list of holes is a linear list, the swap allocation and freeing performance is a non-optimal O(n)-per-page. It also requires kernel memory allocations to take place during the swap freeing process, and that creates low memory deadlock problems. The problem is further exacerbated by holes created due to the interleaving algorithm. Also, the swap block map can become fragmented fairly easily resulting in non-contiguous allocations. Kernel memory must also be allocated on the fly for additional swap management structures when a swapout occurs. It is evident that there was plenty of room for improvement. For FreeBSD 4.X, I completely rewrote the swap subsystem. With this rewrite, swap management structures are allocated through a hash table rather than a linear array giving them a fixed allocation size and much finer granularity. Rather then using a linearly linked list to keep track of swap space reservations, it now uses a bitmap of swap blocks arranged in a radix tree structure with free-space hinting in the radix node structures. This effectively makes swap allocation and freeing an O(1) operation. The entire radix tree bitmap is also preallocated in order to avoid having to allocate kernel memory during critical low memory swapping operations. After all, the system tends to swap when it is low on memory so we should avoid allocating kernel memory at such times in order to avoid potential deadlocks. Finally, to reduce fragmentation the radix tree is capable of allocating large contiguous chunks at once, skipping over smaller fragmented chunks. 
I did not take the final step of having an allocating hint pointer that would trundle through a portion of swap as allocations were made in order to further guarantee contiguous allocations or at least locality of reference, but I ensured that such an addition could be made. When to free a page Since the VM system uses all available memory for disk caching, there are usually very few truly-free pages. The VM system depends on being able to properly choose pages which are not in use to reuse for new allocations. Selecting the optimal pages to free is possibly the single-most important function any VM system can perform because if it makes a poor selection, the VM system may be forced to unnecessarily retrieve pages from disk, seriously degrading system performance. How much overhead are we willing to suffer in the critical path to avoid freeing the wrong page? Each wrong choice we make will cost us hundreds of thousands of CPU cycles and a noticeable stall of the affected processes, so we are willing to endure a significant amount of overhead in order to be sure that the right page is chosen. This is why FreeBSD tends to outperform other systems when memory resources become stressed. The free page determination algorithm is built upon a history of the use of memory pages. To acquire this history, the system takes advantage of a page-used bit feature that most hardware page tables have. In any case, the page-used bit is cleared and at some later point the VM system comes across the page again and sees that the page-used bit has been set. This indicates that the page is still being actively used. If the bit is still clear it is an indication that the page is not being actively used. By testing this bit periodically, a use history (in the form of a counter) for the physical page is developed. When the VM system later needs to free up some pages, checking this history becomes the cornerstone of determining the best candidate page to reuse. What if the hardware has no page-used bit? For those platforms that do not have this feature, the system actually emulates a page-used bit. It unmaps or protects a page, forcing a page fault if the page is accessed again. When the page fault is taken, the system simply marks the page as having been used and unprotects the page so that it may be used. While taking such page faults just to determine if a page is being used appears to be an expensive proposition, it is much less expensive than reusing the page for some other purpose only to find that a process needs it back and then have to go to disk. FreeBSD makes use of several page queues to further refine the selection of pages to reuse as well as to determine when dirty pages must be flushed to their backing store. Since page tables are dynamic entities under FreeBSD, it costs virtually nothing to unmap a page from the address space of any processes using it. When a page candidate has been chosen based on the page-use counter, this is precisely what is done. The system must make a distinction between clean pages which can theoretically be freed up at any time, and dirty pages which must first be written to their backing store before being reusable. When a page candidate has been found it is moved to the inactive queue if it is dirty, or the cache queue if it is clean. A separate algorithm based on the dirty-to-clean page ratio determines when dirty pages in the inactive queue must be flushed to disk. Once this is accomplished, the flushed pages are moved from the inactive queue to the cache queue. 
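The use-history scan and the queue placement described above can be summarized with a small C sketch. The constants and names here are illustrative assumptions rather than the real vm_pageout code, but the shape is the same: sample and clear the page-used bit, age a counter, and once a page finally looks idle send it to the inactive queue if it is dirty or toward the cache queue if it is clean.

#include <stdbool.h>

enum queue { ACTIVE, INACTIVE, CACHE };

/* Hypothetical per-page bookkeeping, much simplified. */
struct page {
    int        act_count;      /* accumulated use history                   */
    bool       referenced;     /* hardware page-used bit (or its emulation) */
    bool       dirty;          /* must be flushed before reuse              */
    enum queue queue;
};

#define ACT_ADVANCE 3           /* illustrative values only */
#define ACT_DECLINE 1
#define ACT_MAX     64

/* One pageout-daemon pass over an active page. */
static void
scan_active_page(struct page *p)
{
    if (p->referenced) {
        p->referenced = false;              /* rearm for the next pass */
        p->act_count += ACT_ADVANCE;
        if (p->act_count > ACT_MAX)
            p->act_count = ACT_MAX;
        return;
    }
    if (p->act_count > 0) {
        p->act_count -= ACT_DECLINE;        /* still has some history */
        return;
    }
    /* No recent use left: dirty pages go to the inactive queue to be
     * flushed later; clean pages can head for the cache queue. */
    p->queue = p->dirty ? INACTIVE : CACHE;
}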
At this point, pages in the cache queue can still be reactivated by a VM fault at relatively low cost. However, pages in the cache queue are considered to be immediately freeable and will be reused in an LRU (least-recently used) fashion when the system needs to allocate new memory. It is important to note that the FreeBSD VM system attempts to separate clean and dirty pages for the express reason of avoiding unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does it move pages between the various page queues gratuitously when the memory subsystem is not being stressed. This is why you will see some systems with very low cache queue counts and high active queue counts when doing a systat -vm command. As the VM system becomes more stressed, it makes a greater effort to maintain the various page queues at the levels determined to be the most effective. An urban myth has circulated for years that Linux did a better job avoiding swapouts than FreeBSD, but this in fact is not true. What was actually occurring was that FreeBSD was proactively paging out unused pages in order to make room for more disk cache while Linux was keeping unused pages in core and leaving less memory available for cache and process pages. I do not know whether this is still true today. Pre-Faulting and Zeroing Optimizations Taking a VM fault is not expensive if the underlying page is already in core and can simply be mapped into the process, but it can become expensive if you take a whole lot of them on a regular basis. A good example of this is running a program such as &man.ls.1; or &man.ps.1; over and over again. If the program binary is mapped into memory but not mapped into the page table, then all the pages that will be accessed by the program will have to be faulted in every time the program is run. This is unnecessary when the pages in question are already in the VM Cache, so FreeBSD will attempt to pre-populate a process's page tables with those pages that are already in the VM Cache. One thing that FreeBSD does not yet do is pre-copy-on-write certain pages on exec. For example, if you run the &man.ls.1; program while running vmstat 1 you will notice that it always takes a certain number of page faults, even when you run it over and over again. These are zero-fill faults, not program code faults (which were pre-faulted in already). Pre-copying pages on exec or fork is an area that could use more study. A large percentage of page faults that occur are zero-fill faults. You can usually see this by observing the vmstat -s output. These occur when a process accesses pages in its BSS area. The BSS area is expected to be initially zero but the VM system does not bother to allocate any memory at all until the process actually accesses it. When a fault occurs the VM system must not only allocate a new page, it must zero it as well. To optimize the zeroing operation the VM system has the ability to pre-zero pages and mark them as such, and to request pre-zeroed pages when zero-fill faults occur. The pre-zeroing occurs whenever the CPU is idle but the number of pages the system pre-zeros is limited in order to avoid blowing away the memory caches. This is an excellent example of adding complexity to the VM system in order to optimize the critical path. Page Table Optimizations The page table optimizations make up the most contentious part of the FreeBSD VM design and they have shown some strain with the advent of serious use of mmap(). 
I think this is actually a feature of most BSDs though I am not sure when it was first introduced. There are two major optimizations. The first is that hardware page tables do not contain persistent state but instead can be thrown away at any time with only a minor amount of management overhead. The second is that every active page table entry in the system has a governing pv_entry structure which is tied into the vm_page structure. FreeBSD can simply iterate through those mappings that are known to exist while Linux must check all page tables that might contain a specific mapping to see if it does, which can achieve O(n^2) overhead in certain situations. It is because of this that FreeBSD tends to make better choices on which pages to reuse or swap when memory is stressed, giving it better performance under load. However, FreeBSD requires kernel tuning to accommodate large-shared-address-space situations such as those that can occur in a news system because it may run out of pv_entry structures. Both Linux and FreeBSD need work in this area. FreeBSD is trying to maximize the advantage of a potentially sparse active-mapping model (not all processes need to map all pages of a shared library, for example), whereas Linux is trying to simplify its algorithms. FreeBSD generally has the performance advantage here at the cost of wasting a little extra memory, but FreeBSD breaks down in the case where a large file is massively shared across hundreds of processes. Linux, on the other hand, breaks down in the case where many processes are sparsely-mapping the same shared library and also runs non-optimally when trying to determine whether a page can be reused or not. Page Coloring We will end with the page coloring optimizations. Page coloring is a performance optimization designed to ensure that accesses to contiguous pages in virtual memory make the best use of the processor cache. In ancient times (i.e. 10+ years ago) processor caches tended to map virtual memory rather than physical memory. This led to a huge number of problems including having to clear the cache on every context switch in some cases, and problems with data aliasing in the cache. Modern processor caches map physical memory precisely to solve those problems. This means that two side-by-side pages in a processes address space may not correspond to two side-by-side pages in the cache. In fact, if you are not careful side-by-side pages in virtual memory could wind up using the same page in the processor cache—leading to cacheable data being thrown away prematurely and reducing CPU performance. This is true even with multi-way set-associative caches (though the effect is mitigated somewhat). FreeBSD's memory allocation code implements page coloring optimizations, which means that the memory allocation code will attempt to locate free pages that are contiguous from the point of view of the cache. For example, if page 16 of physical memory is assigned to page 0 of a process's virtual memory and the cache can hold 4 pages, the page coloring code will not assign page 20 of physical memory to page 1 of a process's virtual memory. It would, instead, assign page 21 of physical memory. The page coloring code attempts to avoid assigning page 20 because this maps over the same cache memory as page 16 and would result in non-optimal caching. This code adds a significant amount of complexity to the VM memory allocation subsystem as you can well imagine, but the result is well worth the effort. 
Page Coloring makes VM memory as deterministic as physical memory with regard to cache performance.

Conclusion

Virtual memory in modern operating systems must address a number of different issues efficiently and for many different usage patterns. The modular and algorithmic approach that BSD has historically taken allows us to study and understand the current implementation as well as relatively cleanly replace large sections of the code. There have been a number of improvements to the FreeBSD VM system in the last several years, and work is ongoing.

Bonus QA session by Allen Briggs <email>briggs@ninthwonder.com</email>

What is the interleaving algorithm that you refer to in your listing of the ills of the FreeBSD 3.X swap arrangements?

FreeBSD uses a fixed swap interleave which defaults to 4. This means that FreeBSD reserves space for four swap areas even if you only have one, two, or three. Since swap is interleaved, the linear address space representing the four swap areas will be fragmented if you do not actually have four swap areas. For example, if you have two swap areas, A and B, FreeBSD's address space representation for that swap area will be interleaved in blocks of 16 pages:

A B C D A B C D A B C D A B C D

FreeBSD 3.X uses a sequential list of free regions approach to accounting for the free swap areas. The idea is that large blocks of free linear space can be represented with a single list node (kern/subr_rlist.c). But due to the fragmentation the sequential list winds up being insanely fragmented. In the above example, completely unused swap will have A and B shown as free and C and D shown as all allocated. Each A-B sequence requires a list node to account for it because C and D are holes, so the list node cannot be combined with the next A-B sequence. Why do we interleave our swap space instead of just tacking swap areas onto the end and doing something fancier? Because it is a whole lot easier to allocate linear swaths of an address space and have the result automatically be interleaved across multiple disks than it is to try to put that sophistication elsewhere. The fragmentation causes other problems. Being a linear list under 3.X, and having such a huge amount of inherent fragmentation, allocating and freeing swap winds up being an O(N) algorithm instead of an O(1) algorithm. Combine that with other factors (heavy swapping) and you start getting into O(N^2) and O(N^3) levels of overhead, which is bad. The 3.X system may also need to allocate KVM during a swap operation to create a new list node, which can lead to a deadlock if the system is trying to page out pages in a low-memory situation. Under 4.X we do not use a sequential list. Instead we use a radix tree and bitmaps of swap blocks rather than ranged list nodes. We take the hit of preallocating all the bitmaps required for the entire swap area up front, but it winds up wasting less memory due to the use of a bitmap (one bit per block) instead of a linked list of nodes. The use of a radix tree instead of a sequential list gives us nearly O(1) performance no matter how fragmented the tree becomes.

I do not get the following:
It is important to note that the FreeBSD VM system attempts to separate clean and dirty pages for the express reason of avoiding unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does it move pages between the various page queues gratuitously when the memory subsystem is not being stressed. This is why you will see some systems with very low cache queue counts and high active queue counts when doing a systat -vm command.
How is the separation of clean and dirty (inactive) pages related to the situation where you see low cache queue counts and high active queue counts in systat -vm? Do the systat stats roll the active and dirty pages together for the active queue count?
Yes, that is confusing. The relationship is goal versus reality. Our goal is to separate the pages, but the reality is that if we are not in a memory crunch, we do not really have to. What this means is that FreeBSD will not try very hard to separate out dirty pages (inactive queue) from clean pages (cache queue) when the system is not being stressed, nor will it try to deactivate pages (active queue -> inactive queue) when the system is not being stressed, even if they are not being used.
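In other words, queue maintenance is driven by memory pressure. A minimal sketch of that policy in C, with invented names and an invented threshold, might look like this; under light load the early return dominates, which is why an unstressed system shows a large active queue and a small cache queue in systat -vm.

#include <stdbool.h>

#define FREE_TARGET 1024       /* illustrative; real targets are computed at boot */

struct vmstats {
    long free_pages;
    long cache_pages;
};

/* Only work hard at deactivating and laundering pages when the
 * free+cache pool has actually dropped below its target. */
static bool
pageout_needed(const struct vmstats *vs)
{
    return (vs->free_pages + vs->cache_pages < FREE_TARGET);
}

static void
pageout_daemon_pass(const struct vmstats *vs,
    void (*deactivate_some)(void), void (*launder_some)(void))
{
    if (!pageout_needed(vs))
        return;                /* unstressed: leave the queues alone      */
    deactivate_some();         /* active -> inactive, guided by use history */
    launder_some();            /* flush dirty inactive pages toward cache   */
}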
In the &man.ls.1; / vmstat 1 example, would not some of the page faults be data page faults (COW from executable file to private page)? I.e., I would expect the page faults to be some zero-fill and some program data. Or are you implying that FreeBSD does do pre-COW for the program data? A COW fault can be either zero-fill or program-data. The mechanism is the same either way because the backing program-data is almost certainly already in the cache. I am indeed lumping the two together. FreeBSD does not pre-COW program data or zero-fill, but it does pre-map pages that exist in its cache. In your section on page table optimizations, can you give a little more detail about pv_entry and vm_page (or should vm_page be vm_pmap—as in 4.4, cf. pp. 180-181 of McKusick, Bostic, Karels, Quarterman)? Specifically, what kind of operation/reaction would require scanning the mappings? How does Linux do in the case where FreeBSD breaks down (sharing a large file mapping over many processes)? A vm_page represents an (object,index#) tuple. A pv_entry represents a hardware page table entry (pte). If you have five processes sharing the same physical page, and three of those processes' page tables actually map the page, that page will be represented by a single vm_page structure and three pv_entry structures. pv_entry structures only represent pages mapped by the MMU (one pv_entry represents one pte). This means that when we need to remove all hardware references to a vm_page (in order to reuse the page for something else, page it out, clear it, dirty it, and so forth) we can simply scan the linked list of pv_entry's associated with that vm_page to remove or modify the pte's from their page tables. Under Linux there is no such linked list. In order to remove all the hardware page table mappings for a vm_page, Linux must index into every VM object that might have mapped the page. For example, if you have 50 processes all mapping the same shared library and want to get rid of page X in that library, you need to index into the page table for each of those 50 processes even if only 10 of them have actually mapped the page. So Linux is trading off the simplicity of its design against performance. Many VM algorithms which are O(1) or O(small N) under FreeBSD wind up being O(N), O(N^2), or worse under Linux. Since the pte's representing a particular page in an object tend to be at the same offset in all the page tables they are mapped in, reducing the number of accesses into the page tables at the same pte offset will often avoid blowing away the L1 cache line for that offset, which can lead to better performance. FreeBSD has added complexity (the pv_entry scheme) in order to increase performance (to limit page table accesses to only those pte's that need to be modified). But FreeBSD has a scaling problem that Linux does not in that there are a limited number of pv_entry structures and this causes problems when you have massive sharing of data. In this case you may run out of pv_entry structures even though there is plenty of free memory available. This can be fixed easily enough by bumping up the number of pv_entry structures in the kernel config, but we really need to find a better way to do it. Regarding the memory overhead of a page table versus the pv_entry scheme: Linux uses permanent page tables that are not throwaway, but does not need a pv_entry for each potentially mapped pte. FreeBSD uses throwaway page tables but adds in a pv_entry structure for each actually-mapped pte. I think memory utilization winds up being about the same, giving FreeBSD an algorithmic advantage with its ability to throw away page tables at will with very low overhead.
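A pared-down sketch of the relationship in C follows. These are hypothetical simplifications; the real structures in the VM and pmap code carry much more state, and the pte_invalidate() helper is a stub standing in for "clear the pte and flush the TLB entry".

#include <stddef.h>

struct pmap;                         /* a process's page tables (opaque here) */
struct vm_object;                    /* backing object (opaque here)          */

/* One pv_entry per pte that actually maps the page. */
struct pv_entry {
    struct pmap     *pv_pmap;        /* whose page tables               */
    void            *pv_va;          /* where the page is mapped there  */
    struct pv_entry *pv_next;        /* next mapping of the same page   */
};

/* One vm_page per (object, index) tuple, i.e. per page of data. */
struct vm_page {
    struct vm_object *object;
    long              pindex;
    struct pv_entry  *pv_list;       /* every existing hardware mapping */
};

/* Stub: would clear the pte in that pmap and flush the TLB entry. */
static void
pte_invalidate(struct pmap *pm, void *va)
{
    (void)pm;
    (void)va;
}

/*
 * Remove every hardware mapping of one physical page.  Only mappings
 * that really exist are visited: five processes sharing the page but
 * only three mapping it means walking exactly three pv_entry's.
 */
static void
remove_all_mappings(struct vm_page *m)
{
    for (struct pv_entry *pv = m->pv_list; pv != NULL; pv = pv->pv_next)
        pte_invalidate(pv->pv_pmap, pv->pv_va);
    m->pv_list = NULL;               /* the entries would be freed here */
}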
Finally, in the page coloring section, it might help to have a little more description of what you mean here. I did not quite follow it. Do you know how an L1 hardware memory cache works? I will explain: Consider a machine with 16MB of main memory but only 128K of L1 cache. Generally the way this cache works is that each 128K block of main memory uses the same 128K of cache. If you access offset 0 in main memory and then offset 128K in main memory you can wind up throwing away the cached data you read from offset 0! Now, I am simplifying things greatly. What I just described is what is called a direct mapped hardware memory cache. Most modern caches are what are called 2-way-set-associative or 4-way-set-associative caches. The set associativity allows you to access up to N different memory regions that overlap the same cache memory without destroying the previously cached data. But only N. So if I have a 4-way set associative cache I can access offset 0, offset 128K, offset 256K, and offset 384K and still be able to access offset 0 again and have it come from the L1 cache. If I then access offset 512K, however, one of the four previously cached data objects will be thrown away by the cache. It is extremely important… extremely important for most of a processor's memory accesses to be able to come from the L1 cache, because the L1 cache operates at the processor frequency. The moment you have an L1 cache miss and have to go to the L2 cache or to main memory, the processor will stall and potentially sit twiddling its fingers for hundreds of instructions worth of time waiting for a read from main memory to complete. Main memory (the dynamic RAM you stuff into a computer) is slow, when compared to the speed of a modern processor core. OK, so now on to page coloring: All modern memory caches are what are known as physical caches. They cache physical memory addresses, not virtual memory addresses. This allows the cache to be left alone across a process context switch, which is very important. - But in the Unix world you are dealing with virtual address + But in the &unix; world you are dealing with virtual address spaces, not physical address spaces. Any program you write will see the virtual address space given to it. The actual physical pages underlying that virtual address space are not necessarily physically contiguous! In fact, you might have two pages that are side by side in a process's address space which wind up being at offset 0 and offset 128K in physical memory. A program normally assumes that two side-by-side pages will be optimally cached. That is, that you can access data objects in both pages without having them blow away each other's cache entry. But this is only true if the physical pages underlying the virtual address space are contiguous (insofar as the cache is concerned). This is what page coloring does. Instead of assigning random physical pages to virtual addresses, which may result in non-optimal cache performance, page coloring assigns reasonably-contiguous physical pages to virtual addresses. Thus programs can be written under the assumption that the characteristics of the underlying hardware cache are the same for their virtual address space as they would be if the program had been run directly in a physical address space. Note that I say reasonably contiguous rather than simply contiguous.
From the point of view of a 128K direct mapped cache, the physical address 0 is the same as the physical address 128K. So two side-by-side pages in your virtual address space may wind up being offset 128K and offset 132K in physical memory, but could also easily be offset 128K and offset 4K in physical memory and still retain the same cache performance characteristics. So page coloring does not have to assign truly contiguous pages of physical memory to contiguous pages of virtual memory; it just needs to make sure it assigns contiguous pages from the point of view of cache performance and operation.
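The arithmetic behind both the direct-mapped example and the earlier page 16/20/21 example is simply a modulo over the number of page-sized cache slots. The following short C program is an illustration under that simplification; the constants and function names are assumptions for this sketch, not the allocator's real interface.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE  4096
#define CACHE_SIZE (4 * PAGE_SIZE)             /* a cache that holds 4 pages */
#define NCOLORS    (CACHE_SIZE / PAGE_SIZE)

/* The cache "color" of a physical page: which slot range it occupies. */
static int
page_color(long phys_page)
{
    return ((int)(phys_page % NCOLORS));
}

/* Prefer a free physical page whose color matches the virtual page. */
static long
pick_colored_page(long virt_page, const bool *is_free, long nphys)
{
    int want = (int)(virt_page % NCOLORS);

    for (long p = 0; p < nphys; p++)
        if (is_free[p] && page_color(p) == want)
            return (p);
    for (long p = 0; p < nphys; p++)           /* fall back to any free page */
        if (is_free[p])
            return (p);
    return (-1);
}

int
main(void)
{
    bool is_free[32];

    for (long p = 0; p < 32; p++)
        is_free[p] = (p >= 20);                /* physical pages 20..31 free */

    /* Virtual page 0 already holds physical page 16 (color 0), so for
     * virtual page 1 we want color 1: physical page 21, not page 20. */
    printf("virtual page 1 -> physical page %ld\n",
        pick_colored_page(1, is_free, 32));

    /* The pages at 132K and at 4K share a color, so either one can sit
     * "next to" the page at 128K as far as the cache is concerned. */
    printf("color at 132K = %d, color at 4K = %d\n",
        page_color(132 * 1024 / PAGE_SIZE), page_color(4 * 1024 / PAGE_SIZE));
    return (0);
}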
diff --git a/en_US.ISO8859-1/articles/zip-drive/article.sgml b/en_US.ISO8859-1/articles/zip-drive/article.sgml index 5330a94857..eba18e15d6 100644 --- a/en_US.ISO8859-1/articles/zip-drive/article.sgml +++ b/en_US.ISO8859-1/articles/zip-drive/article.sgml @@ -1,276 +1,287 @@ %man; %freebsd; + +%trademarks; ]>
- ZIP Drives + &iomegazip; Drives Jason Bacon
acadix@execpc.com
+ + + &tm-attrib.freebsd; + &tm-attrib.adaptec; + &tm-attrib.iomega; + &tm-attrib.microsoft; + &tm-attrib.opengroup; + &tm-attrib.general; +
- ZIP Drive Basics + &iomegazip; Drive Basics - ZIP disks are high capacity, removable, magnetic disks, which can be + &iomegazip; disks are high capacity, removable, magnetic disks, which can be read or written by ZIP drives from IOMEGA corporation. ZIP disks are similar to floppy disks, except that they are much faster, and have a much greater capacity. While floppy disks typically hold 1.44 megabytes, ZIP disks are available in two sizes, namely 100 megabytes and 250 megabytes. ZIP drives should not be confused with the super-floppy, a 120 megabyte floppy drive which also handles traditional 1.44 megabyte floppies. IOMEGA also sells a higher capacity, higher performance drive called - the JAZZ drive. JAZZ drives come in 1 gigabyte and 2 gigabyte + the &jaz;/JAZZ drive. Jaz drives come in 1 gigabyte and 2 gigabyte sizes. ZIP drives are available as internal or external units, using one of three interfaces: The SCSI (Small Computer Standard Interface) interface is the fastest, most sophisticated, most expandable, and most expensive interface. The SCSI interface is used by all types of computers from PC's to RISC workstations to minicomputers, to connect all types of peripherals such as disk drives, tape drives, scanners, and so on. SCSI ZIP drives may be internal or external, assuming your host adapter has an external connector. If you are using an external SCSI device, it is important never to connect or disconnect it from the SCSI bus while the computer is running. Doing so may cause file-system damage on the disks that remain connected. If you want maximum performance and easy setup, the SCSI interface is the best choice. This will probably require adding a SCSI host adapter, since most PC's (except for high-performance servers) do not have built-in SCSI support. Each SCSI host adapter can support either 7 or 15 SCSI devices, depending on the model. Each SCSI device has its own controller, and these controllers are fairly intelligent and well standardized, (the second `S' in SCSI is for Standard) so from the operating system's point of view, all SCSI disk drives look about the same, as do all SCSI tape drives, etc. To support SCSI devices, the operating system need only have a driver for the particular host adapter, and a generic driver for each type of device, i.e. a SCSI disk driver, SCSI tape driver, and so on. There are some SCSI devices that can be better utilized with specialized drivers (e.g. DAT tape drives), but they tend to work OK with the generic driver, too. It is just that the generic drivers may not support some of the special features. Using a SCSI zip drive is simply a matter of determining which device file in the /dev directory represents the ZIP drive. This can be determined by looking at the boot messages while FreeBSD is booting (or in /var/log/messages after booting), where you will see a line something like this: da1: <IOMEGA ZIP 100 D.13> Removable Direct Access SCSI-2 Device This means that the ZIP drive is represented by the file /dev/da1. The IDE (Integrated Drive Electronics) interface is a low-cost disk drive interface used by many desktop PC's. Most IDE devices are strictly internal. Performance of IDE ZIP drives is comparable to SCSI ZIP drives. (The IDE interface is not as fast as SCSI, but ZIP drives performance is limited mainly by the mechanics of the drive, not by the bus interface.) The drawback of the IDE interface is the limitations it imposes. 
Most IDE adapters can only support 2 devices, and IDE interfaces are not typically designed for the long term. For example, the original IDE interface would not support hard disks with more than 1024 cylinders, which forced a lot of people to upgrade their hardware prematurely. If you have plans to expand your PC by adding another disk, a tape drive, or scanner, you may want to invest in a SCSI host adapter and a SCSI ZIP drive to avoid problems in the future. IDE devices in FreeBSD are prefixed with an a. For example, an IDE hard disk might be /dev/ad0, an IDE (ATAPI) CDROM might be /dev/acd1, and so on. The parallel port interface is popular for portable external devices such as external ZIP drives and scanners, because virtually every computer has a standard parallel port (usually used for printers). This makes it easy for people to transfer data between multiple computers by toting around their ZIP drive. Performance will generally be slower than that of a SCSI or IDE ZIP drive, since it is limited by the speed of the parallel port. Parallel port speed varies considerably between various computers, and can often be configured in the system BIOS. Some machines will also require BIOS configuration to operate the parallel port in bidirectional mode. (Parallel ports were originally designed only for output to printers.) Parallel ZIP: The <devicename>vpo</devicename> Driver To use a parallel-port ZIP drive under FreeBSD, the vpo driver must be configured into the kernel. Parallel port ZIP drives also have a built-in SCSI controller. The vpo driver allows the FreeBSD kernel to communicate with the ZIP drive's SCSI controller through the parallel port. Since the vpo driver is not a standard part of the kernel (as of FreeBSD 3.2), you will need to rebuild the kernel to enable this device. The process of building a kernel is outlined in detail in another section. The following steps outline the process in brief for the purpose of enabling the vpo driver: Run /stand/sysinstall, and install the kernel source code on your system. Create a custom kernel configuration that includes the vpo driver: &prompt.root; cd /sys/i386/conf &prompt.root; cp GENERIC MYKERNEL Edit MYKERNEL, change the ident line to MYKERNEL, and uncomment the line describing the vpo driver. If you have a second parallel port, you may need to copy the section for ppc0 to create a ppc1 device. The second parallel port usually uses IRQ 5 and address 378. Only the IRQ is required in the config file. If your root hard disk is a SCSI disk, you might run into a problem with probing order, which will cause the system to attempt to use the ZIP drive as the root device. This will cause a boot failure, unless you happen to have a FreeBSD root file-system on your ZIP disk! In this case, you will need to wire down the root disk, i.e. force the kernel to bind a specific device to /dev/da0, the root SCSI disk. It will then assign the ZIP disk to the next available SCSI disk, e.g. /dev/da1. To wire down your SCSI hard drive as da0, change the line device da0 to disk da0 at scbus0 target 0 unit 0 You may need to change the target above to match the SCSI ID of your disk drive. You should also wire down the scbus0 entry to your - controller. 
For example, if you have an &adaptec; 15xx controller, you would change controller scbus0 to controller scbus0 at aha0 Finally, since you are creating a custom kernel configuration, you can take the opportunity to remove all the unnecessary drivers. This should be done with a great deal of caution, and only if you feel confident about making modifications to your kernel configuration. Removing unnecessary drivers will reduce the kernel size, leaving more memory available for your applications. To determine which drivers are not needed, go to the end of the file /var/log/messages, and look for lines reading "not found". Then, comment out these devices in your config file. You can also change other options to reduce the size and increase the speed of your kernel. Read the section on rebuilding your kernel for more complete information. Now it is time to compile the kernel: &prompt.root; /usr/sbin/config MYKERNEL &prompt.root; cd ../../compile/MYKERNEL &prompt.root; make clean depend && make all install After the kernel is rebuilt, you will need to reboot. Make sure the ZIP drive is connected to the parallel port before the boot begins. You should see the ZIP drive show up in the boot messages as device vpo0 or vpo1, depending on which parallel port the drive is attached to. It should also show which device file the ZIP drive has been bound to. This will be /dev/da0 if you have no other SCSI disks in the system, or /dev/da1 if you have a SCSI hard disk wired down as the root device. Mounting ZIP disks To access the ZIP disk, you simply mount it like any other disk device. The file-system is represented as slice 4 on the device, so for SCSI or parallel ZIP disks, you would use: &prompt.root; mount_msdos /dev/da1s4 /mnt For IDE ZIP drives, use: &prompt.root; mount_msdos /dev/ad1s4 /mnt It will also be helpful to update /etc/fstab to make mounting easier. Add a line like the following, edited to suit your system: /dev/da1s4 /zip msdos rw,noauto 0 0 and create the directory /zip. Then, you can mount simply by typing &prompt.root; mount /zip and unmount by typing &prompt.root; umount /zip For more information on the format of /etc/fstab, see &man.fstab.5;. You can also create a FreeBSD file-system on the ZIP disk using &man.newfs.8;. However, the disk will only be usable on a FreeBSD system, or perhaps a few other &unix; clones that recognize FreeBSD - file-systems. (Definitely not DOS or Windows.) + file-systems. (Definitely not DOS or &windows;.)