Characteristics of Two Spindles Organized with Vinum

  Organization                 Total Capacity                             Failure Resilient   Peak Read Performance   Peak Write Performance
  Concatenated Plexes          Unchanged, but appears as a single drive   No                  Unchanged               Unchanged
  Striped Plexes (RAID-0)      Unchanged, but appears as a single drive   No                  2x                      2x
  Mirrored Volumes (RAID-1)    1/2, appearing as a single drive           Yes                 2x                      Unchanged

The table above shows that striping yields
the same capacity and lack of failure resilience
as concatenation, but it has better peak read and write performance.
Hence we will not be using concatenation in any of the examples here.
Mirrored volumes provide the benefits of improved peak read performance
and failure resilience--but this comes at a loss in capacity.

Both concatenation and striping bring their benefits over a
single spindle at the cost of increased likelihood of failure since
more than one spindle is now involved.

When three or more spindles are present,
Vinum also supports rotated,
block-interleaved parity (also called RAID-5)
that provides better
capacity than mirroring (but not quite as good as striping), better
read performance than both mirroring and striping,
and good failure resilience.
There is, however,
a substantial decrease in write performance with RAID-5.
Most of the benefits become more pronounced with five or more
spindles.
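As a rough illustration (not part of the configurations used later in this article), a RAID-5 volume could be described to vinum create along these lines, assuming three Vinum drives with the hypothetical names d1, d2, and d3 have already been defined; stripe size and the initialization of the parity before use are covered in &man.vinum.8;:

volume safe
plex name safe.p0 org raid5 512k volume safe
sd name safe.p0.s0 drive d1 plex safe.p0 len 0
sd name safe.p0.s1 drive d2 plex safe.p0 len 0
sd name safe.p0.s2 drive d3 plex safe.p0 len 0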
The organizations described above may be combined to provide benefits that no single organization can match.
For example, mirroring and striping can be combined to provide
failure-resilience with very fast read performance.
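A sketch of that idea for vinum create, assuming four equally sized Vinum drives with the hypothetical names d1 through d4 (again, an illustration only, not a configuration used later in this article):

volume mirstripe
plex name mirstripe.p0 org striped 512k volume mirstripe
sd name mirstripe.p0.s0 drive d1 plex mirstripe.p0 len 0
sd name mirstripe.p0.s1 drive d3 plex mirstripe.p0 len 0
plex name mirstripe.p1 org striped 512k volume mirstripe
sd name mirstripe.p1.s0 drive d2 plex mirstripe.p1 len 0
sd name mirstripe.p1.s1 drive d4 plex mirstripe.p1 len 0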
Vinum History

Vinum is a standard part of even a "minimum" FreeBSD distribution and
it has been standard since 3.0-RELEASE.
The official pronunciation of the name is
VEE-noom.

&vinum.ap; was inspired by the Veritas Volume Manager, but
was not derived from it.
The name is a play on that history and the Latin adage
In Vino Veritas
(Vino is the ablative form of
Vinum).
Literally translated, that is Truth lies in wine hinting that
drunkards have a hard time lying.
I have been using it in production on six different servers for
over two years with no data loss.
Like the rest of FreeBSD, Vinum
provides rock-stable performance.
(On a personal note, I have seen Vinum
panic when I misconfigured something, but I have
never had any trouble in normal operation.)
Greg Lehey wrote
Vinum for FreeBSD,
but he is seeking
help in porting it to NetBSD and OpenBSD.

Just like the rest of FreeBSD, Vinum
is undergoing continuous
development.
Several subtle, but significant bugs have been fixed in recent
releases.
It is always best to use the most recent code base that meets your
stability requirements.

Vinum Deployment Strategy

Vinum,
coupled with prudent partition management, lets you
keep warm-spare spindles on-line so that failures
are transparent to users. Failed spindles can be replaced
during regular maintenance periods or whenever it is convenient.
When all spindles are working, the server benefits from increased
performance and capacity.

Having redundant copies of your home directory does not
help you if the spindle holding root,
/usr, or swap fails on your server.
Hence I focus here on building a simple
foundation for a failure-resilient server covering the root,
/usr,
/home, and swap partitions.

Vinum
mirroring does not remove the need for making backups!
Mirroring cannot help you recover from site disasters
or the dreaded
rm -r -f / command.
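For example, a periodic dump of the mirrored filesystems to off-line media still makes sense. A minimal sketch, assuming a tape drive at the hypothetical device /dev/nsa0:

&prompt.root; dump 0uaf /dev/nsa0 /home
&prompt.root; dump 0uaf /dev/nsa0 /usr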
Why Bootstrap Vinum?

It is possible to add Vinum to a server configuration after
it is already in production use, but this is much harder than
designing for it from the start. Ironically,
Vinum is not supported by
/stand/sysinstall
and hence you cannot install
/usr right onto a
Vinum volume.

Vinum currently does not support the root filesystem (this feature is in development).

Hence it is a bit
tricky to get started using
Vinum, but these instructions
take you though the process of planning for
Vinum, installing FreeBSD
without it, and then beginning to use it.

I have come to call this whole process bootstrapping Vinum.
That is, the process of getting Vinum
initially installed
and operating to the point where you have met your resilience
or performance goals. My purpose here is to document a
Vinum
bootstrapping method that I have found works well for me.

Vinum Benefits

The server foundation scenario I have chosen here allows me
to show you examples of configuring for resilience on
/usr and
/home.
Yet Vinum
provides benefits other than resilience--namely
performance, capacity, and manageability.
It can significantly improve disk performance (especially
under multi-user loads).
Vinum
can easily concatenate many smaller disks to produce the
illusion of a single larger disk (but my server foundation
scenario does not allow me to illustrate these benefits here).

For servers with many spindles, Vinum
provides substantial
benefits in volume management, particularly when coupled with
hot-pluggable hardware. Data can be moved from spindle to
spindle while the system is running without loss of production
time. Again, details of this will not be given here, but once
you get your feet wet with Vinum,
other documentation will help you do things like this.
See
"The Vinum
Volume Manager" for a technical introduction to
Vinum,
&man.vinum.8; for a description of the vinum
command, and
&man.vinum.4;
for a description of the vinum device
driver and the way Vinum
objects are named.

Breaking up your disk space into smaller and smaller partitions
has the benefit of allowing you to tune for the most common
type of access and tends to keep disk hogs within their pens.
However it also causes some loss in total available disk space
due to fragmentation.

Server Operation in Degraded Mode

Some disk failures in this two-spindle scenario will result in
Vinum
automatically routing
all disk I/O to the remaining good spindle.
Others will require brief manual intervention on the console
to configure the server for degraded mode operation and a quick reboot.
Other than actual hardware repairs, most recovery work
can be done while the server is running in multi-user degraded
mode so there is as little production impact
from failures as possible.

I give the instructions needed to
configure the server for degraded mode operation
in those cases where Vinum
cannot do it automatically.
I also give the instructions needed to
return to normal operation once the failed hardware is repaired.
You might call these instructions Vinum
failure recovery techniques.

I recommend practicing these instructions
by recovering from simulated failures.
For each failure scenario, I also give tips below for simulating
a failure even when your hardware is working well.
Even a minimum Vinum
system as described below can be a good place to experiment with recovery techniques without impacting production equipment.

Hardware RAID vs. Vinum (Software RAID)

Manual intervention is sometimes required to configure a server for
degraded mode because
Vinum
is implemented in software that runs after the FreeBSD
kernel is loaded. One disadvantage of such
software RAID
solutions is that there is nothing that can be done to hide spindle
failures from the BIOS or the FreeBSD boot sequence. Hence
the manual reconfiguration of the server
for degraded operation mentioned
above just informs the BIOS and boot sequence of failed
spindles.
Hardware RAID solutions generally have an
advantage in that they require no such reconfiguration since
spindle failures are hidden from the BIOS and boot sequence.Hardware RAID, however, may have some disadvantages that can
be significant in some cases:
The hardware RAID controller itself may become a single
point of failure for the system.
The data is usually kept in a proprietary
format so that a disk drive cannot be simply plugged
into another main board and booted.
You often cannot mix and
match drives with different sizes and interfaces.
You are often limited to the number of drives supported by the
hardware RAID controller (often only four or eight).
In other words, &vinum.ap; may offer advantages in that
there is no single point of failure,
the drives can boot on most any main board, and
you are free to mix and match as many drives using
whatever interface you choose.Keep your kernel fairly generic (or at least keep
/kernel.GENERIC around).
This will improve the chances that you can come back up on
foreign hardware more quickly.

The pros and cons discussed above suggest
that the root filesystem and swap partition are good
candidates for hardware RAID if available.
This is especially true for servers where it is difficult for
administrators to get console access (recall that this is sometimes
required to configure a server for degraded mode operation).
A server with only software RAID is well suited to office and home
environments where an administrator can be close at hand.

A common myth is that hardware RAID is always faster
than software RAID.
Since it runs on the host CPU, Vinum
often has more CPU power and memory available than a
dedicated RAID controller would have.
If performance is a prime concern, it is best to benchmark
your application running on your CPU with your spindles using
both hardware and software RAID systems before making
a decision.
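As a crude starting point only, you could compare raw sequential read throughput of the same volume under each RAID implementation; this is a sketch and is no substitute for benchmarking your real application workload:

&prompt.root; dd if=/dev/vinum/usr of=/dev/null bs=1m count=1024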
Hardware for Vinum

These instructions may be timely since commodity PC hardware can now easily host several hundred gigabytes of reasonably
high-performance disk space at a low price. Many disk
drive manufacturers now sell 7,200 RPM disk drives with quite
low seek times and high transfer rates through ATA-100
interfaces, all at very attractive prices. Four such drives,
attached to a suitable main board and configured with
Vinum
and prudent partitioning, yields a failure-resilient, high
performance disk server at a very reasonable cost.

However, you can indeed get started with
Vinum very simply.
A minimum system can be as simple as
an old CPU (even a 486 is fine) and a pair of drives
that are 500 MB or more. They need not be the same size or
even use the same interface (i.e., it is fine to mix ATAPI and
SCSI). So get busy and give this a try today! You will have
the foundation of a failure-resilient server running in an
hour or so!

Bootstrapping Phases

Greg Lehey suggested this bootstrapping method.
It uses knowledge of how Vinum
internally allocates disk space to avoid copying data.
Instead, Vinum
objects are configured so that they occupy the
same disk space where /stand/sysinstall built
filesystems.
The filesystems are thus embedded within
Vinum objects without copying.

There are several distinct phases to the
Vinum bootstrapping
procedure. Each of these phases is presented in a separate section below.
The section starts with a general overview of the phase and its goals.
It then gives example steps for the two-spindle scenario
presented here and advice on how to adapt them for your server.
(If you are reading for a general understanding
of Vinum
bootstrapping, the example sections for each phase
can safely be skipped.)
The remainder of this section gives
an overview of the entire bootstrapping process.

Phase 1 involves planning and preparation.
We will balance requirements
for the server against available resources and make design
tradeoffs.
We will plan the transition from no
Vinum to
Vinum
on just one spindle, to Vinum
on two spindles.

In phase 2, we will install a minimum FreeBSD system on a
single spindle using partitions of type
4.2BSD (regular UFS filesystems).

Phase 3 will embed the non-root filesystems from phase 2 in
Vinum objects.
Note that Vinum will be up and
running at this point,
but it cannot yet provide any resilience since it only has
one spindle on which to store data.

Finally, in phase 4, we configure Vinum
on a second spindle and make a backup copy of the root filesystem.
This will give us resilience on all filesystems.

Bootstrapping Phase 1: Planning and Preparation

Our goal in this phase is to define the different partitions
we will need and examine their requirements.
We will also look at available disk drives and controllers and allocate
partitions to them.
Finally, we will determine the size of
each partition and its use during the bootstrapping process.
After this planning is complete, we can optionally prepare to use some
tools that will make bootstrapping Vinum
easier.

Several key questions must be answered in this planning phase:

- What filesystems and partitions will be needed?
- How will they be used?
- How will we name each spindle?
- How will the partitions be ordered for each spindle?
- How will partitions be assigned to the spindles?
- How will partitions be configured? Resilience or performance?
- What technique will be used to achieve resilience?
- What spindles will be used?
- How will they be configured on the available controllers?
- How much space is required for each partition?

Phase 1 Example

In this example, I will assume a scenario
where we are building
a minimal foundation for a failure-resilient server.
Hence we will need at least root,
/usr,
/home,
and swap partitions.
The root,
/usr, and
/home filesystems all need resilience since the
server will not be much good without them.
The swap partition needs performance first and
generally does
not need resilience since nothing it holds needs to be retained
across a reboot.

Spindle Naming

The kernel would refer to the master spindle on
the primary and secondary ATA controllers as
/dev/ad0 and
/dev/ad2 respectively.
This assumes that you have not removed the line
options ATA_STATIC_ID
from your kernel configuration.
But Vinum
also needs to have a name for each spindle
that will stay the same name regardless
of how it is attached to the CPU (i.e., if the drive moves, the
Vinum name moves with the drive).

Some recovery techniques documented below suggest
moving a spindle from
the secondary ATA controller to the primary ATA controller.
(Indeed, the flexibility of making such moves is a key benefit
of Vinum
especially if you are managing a large number of spindles.)
After such a drive/controller swap,
the kernel will see what used to be
/dev/ad2 as
/dev/ad0
but Vinum
will still call
it by whatever name it had when it was attached to
/dev/ad2
(i.e., when it was created or first made known to
Vinum).

Since connections can change, it is best to give
each spindle a unique, abstract
name that gives no hint of how it is attached.
Avoid names that suggest a manufacturer, model number,
physical location, or membership in a sequence
(e.g. avoid names like
upper, lower, etc.,
alpha, beta, etc.,
SCSI1, SCSI2, etc., or
Seagate1, Seagate2 etc.).
Such names are likely to lose their uniqueness or
get out of sequence
someday even if they seem like great names today.

Once you have picked names for your spindles,
label them with a permanent marker.
If you have hot-swappable hardware, write the names on the sleds
in which the spindles are mounted.
This will significantly reduce the likelihood of
error when you are moving spindles around later as
part of failure recovery or routine system management
procedures.

In the instructions that follow,
Vinum
will name the root spindle YouCrazy
and the rootback spindle UpWindow.
I will only use /dev/ad0
when I want to refer to whichever
of the two spindles is currently attached as
/dev/ad0.

Partition Ordering

Modern disk drives operate with fairly uniform areal
density across the surface of the disk.
That implies that more data is available under the heads without
seeking on the outer cylinders than on the inner cylinders.
We will allocate partitions most critical to system performance
from these outer cylinders as
/stand/sysinstall generally does.

The root filesystem is traditionally the outermost, even though
it generally is not as critical to system performance as others.
(However root can have a larger impact on performance if it contains
/tmp and /var as it
does in this example.)
The FreeBSD boot loaders assume that the
root filesystem lives in the a partition.
There is no requirement that the a
partition start on the outermost cylinders, but this
convention makes it easier to manage disk labels.

Swap performance is critical so it comes next on our way toward
the center.
I/O operations here tend to be large and contiguous.
Having as much data under the heads as possible avoids seeking
while swapping.

With all the smaller partitions out of the way, we finish
up the disk with
/home and
/usr.
Access patterns here tend not to be as intense as for other
filesystems (especially if there is an abundant supply of RAM
and read cache hit rates are high).

If the pair of spindles you have is large enough to allow
for more than
/home and
/usr,
it is fine to plan for additional filesystems here.

Assigning Partitions to Spindles

We will want to assign
partitions to these spindles so that either can fail
without loss of data on filesystems configured for
resilience.

Reliability on
/usr and
/home
is best achieved using Vinum
mirroring.
Resilience will have to come differently, however, for the root
filesystem since Vinum
is not a part of the FreeBSD boot sequence.
Here we will have to settle for two identical
partitions with a periodic copy from the primary to the
backup secondary.

The kernel already has support for interleaved swap across
all available partitions so there is no need for help from
Vinum here.
/stand/sysinstall
will automatically configure /etc/fstab
for all swap partitions given.
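For the two spindles used in this example, the relevant /etc/fstab lines would look roughly like this (options and ordering may differ slightly on your system):

/dev/ad0s1b    none    swap    sw    0    0
/dev/ad2s1b    none    swap    sw    0    0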
The &vinum.ap; bootstrapping method given below requires a pair of spindles that I will call the
root spindle and the
rootback spindle.

The rootback spindle must be the same size or larger than the root spindle.

These instructions first allocate all space on the root
spindle and then allocate exactly that amount of space on
a rootback spindle.
(After &vinum.ap; is bootstrapped, there is nothing special
about either of these spindles--they are interchangeable.)
You can later use the remaining space on the rootback spindle for
other filesystems.

If you have more than two spindles, the
bootvinum Perl script and the procedure
below will help you initialize them for use with &vinum.ap;.
However you will have to figure out how to assign partitions
to them on your own.

Assigning Space to Partitions

For this example, I will use two spindles: one with
4,124,673 blocks (about 2 GB) on /dev/ad0
and one with 8,420,769 blocks (about 4 GB) on
/dev/ad2.

It is best to configure your two spindles on separate
controllers so that both can operate in parallel and
so that you will have failure resilience in case a
controller dies.
Note that mirrored volume write performance will be halved
in cases where both spindles share a controller that requires
they operate serially (as is often the case with ATA controllers).
One spindle will be the master on the primary ATA
controller and the other will be the master on the
secondary ATA controller.

Recall that we will be allocating space on the smaller spindle first and the larger spindle second.

Assigning Partitions on the Root Spindle

We will allocate 200,000 blocks (about 93 MB)
for a root filesystem on each spindle
(/dev/ad0s1a and
/dev/ad2s1a).
We will initially allocate 200,265 blocks for a swap partition
on each spindle,
giving a total of about 186 MB of
swap space (/dev/ad0s1b and
/dev/ad2s1b).

We will lose 265 blocks from each swap partition
as part of the bootstrapping process.
This is the size of the space used by
Vinum to store configuration
information.
The space will be taken from swap and given to a vinum
partition but will be unavailable for
Vinum subdisks.

I have done the partition allocation in nice round
numbers of blocks just to emphasize where the 265 blocks go.
There is nothing wrong with allocating space in MB if that is
more convenient for you.

This leaves 4,124,673 - 200,000 - 200,265 = 3,724,408 blocks
(about 1,818 MB) on the root spindle for
Vinum
partitions (/dev/ad0s1e and
/dev/ad0s1f).
From this, allocate the 265 blocks for
Vinum configuration information,
1,000,000 blocks (about 488 MB)
for /home, and the remaining
2,724,408 blocks (about 1,330 MB) for
/usr.
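Putting these numbers together, the disk label planned for the root spindle looks roughly like this before Vinum is installed (fragment and block size columns omitted; the comments are mine):

#        size   offset    fstype
  a:   200000        0    4.2BSD   # root
  b:   200265   200000      swap
  c:  4124673        0    unused   # entire slice
  e:  1000000   400265    4.2BSD   # /home
  f:  2724408  1400265    4.2BSD   # /usr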
See the figure below to see this graphically.

The left-hand side of
below shows what spindle ad0 will
look like at the end of phase 2.
The right-hand side shows what it will look like at the
end of phase 3.

Spindle ad0 Before and After Vinum

   ad0 Before Vinum          Offset (blocks)        ad0 After Vinum
+----------------------+ <-- 0--> +----------------------+
| root | | root |
| /dev/ad0s1a | | /dev/ad0s1a |
+----------------------+ <-- 200000--> +----------------------+
| swap | | swap |
| /dev/ad0s1b | | /dev/ad0s1b |
| | 400000--> +----------------------+
| | | Vinum drive YouCrazy |
| | | /dev/ad0s1h |
+----------------------+ <-- 400265--> +-----------------+ |
| /home | | Vinum sd | |
| /dev/ad0s1e | | home.p0.s0 | |
+----------------------+ <--1400265--> +-----------------+ |
| /usr | | Vinum sd | |
| /dev/ad0s1f | | usr.p0.s0 | |
+----------------------+ <--4124673--> +-----------------+----+
                               Not to scale

Assigning Partitions on the Rootback Spindle

The /rootback and swap partition sizes
on the rootback spindle must
match the root and swap partition sizes on the root spindle.
That leaves 8,420,769 - 200,000 - 200,265 = 8,020,504
blocks for the Vinum partition.
Mirrors of /home and
/usr receive the same allocation as on
the root spindle.
That will leave an extra 2 GB or so that we can deal
with later.
See the figure below to see this graphically.

The left-hand side of
below shows what spindle ad2 will
look like at the beginning of phase 4.
The right-hand side shows what it will look like at the end.

Spindle ad2 Before and After Vinum

   ad2 Before Vinum          Offset (blocks)        ad2 After Vinum
+----------------------+ <-- 0--> +----------------------+
| /rootback | | /rootback |
| /dev/ad2s1e | | /dev/ad2s1a |
+----------------------+ <-- 200000--> +----------------------+
| swap | | swap |
| /dev/ad2s1b | | /dev/ad2s1b |
| | 400000--> +----------------------+
| | | Vinum drive UpWindow |
| | | /dev/ad2s1h |
+----------------------+ <-- 400265--> +-----------------+ |
| /NOFUTURE | | Vinum sd | |
| /dev/ad2s1f | | home.p1.s0 | |
| | 1400265--> +-----------------+ |
| | | Vinum sd | |
| | | usr.p1.s0 | |
| | 4124673--> +-----------------+ |
| | | Vinum sd | |
| | | hope.p0.s0 | |
+----------------------+ <--8420769--> +-----------------+----+
                               Not to scale

Preparation of Tools

The bootvinum Perl script given below
will make the
Vinum bootstrapping process much
easier if you can run it on the machine being bootstrapped.
It is over 200 lines and you would not want to type it in.
At this point, I recommend that you
copy it to a floppy or arrange some
alternative method of making it readily available
so that it can be available later when needed.
For example:

&prompt.root; fdformat -f 1440 /dev/fd0
&prompt.root; newfs_msdos -f 1440 /dev/fd0
&prompt.root; mount_msdos /dev/fd0 /mnt
&prompt.root; cp /usr/share/examples/vinum/bootvinum /mnt

XXX Someday, I would like this script to live in
/usr/share/examples/vinum.
Till then, please use this
link
to get a copy.

Bootstrapping Phase 2: Minimal OS Installation

Our goal in this phase is to complete the smallest possible
FreeBSD installation in such a way that we can later install
Vinum.
We will use only
partitions of type 4.2BSD (i.e., regular UFS file
systems) since that is the only type supported by
/stand/sysinstall.

Phase 2 Example

Start up the FreeBSD installation process by running /stand/sysinstall from installation media as you normally would.

Fdisk partition all spindles as needed. Make sure to select BootMgr for all spindles.

Partition the root spindle with appropriate block allocations as described above.
For this example on a 2 GB spindle, I will use
200,000 blocks for root, 200,265 blocks for swap,
1,000,000 blocks for /home, and
the rest of the spindle (2,724,408 blocks) for
/usr.
(/stand/sysinstall
should automatically assign these to
/dev/ad0s1a,
/dev/ad0s1b,
/dev/ad0s1e, and
/dev/ad0s1f
by default.)

If you prefer Soft Updates as I do and you are
using 4.4-RELEASE or better, this is a good time to enable
them.

Partition the rootback spindle with the appropriate block allocations as described above.
For this example on a 4 GB spindle, I will use
200,000 blocks for /rootback,
200,265 blocks for swap, and
the rest of the spindle (8,020,504 blocks) for
/NOFUTURE.
(/stand/sysinstall
should automatically assign these to
/dev/ad2s1e,
/dev/ad2s1b, and
/dev/ad2s1f by default.)

We do not really want to have a
/NOFUTURE UFS filesystem (we
want a vinum partition instead), but that is the
best choice we have for the space given the limitations of
/stand/sysinstall.
Mount point names beginning with NOFUTURE
and rootback
serve as sentinels to the bootstrapping
script presented below.

Partition any other spindles with swap if desired and a single /NOFUTURExx filesystem.

Select a minimum system install for now even if you want to end up with more distributions loaded later.

Do not worry about system configuration options at this point--get Vinum set up and get the partitions in the right places first.

Exit /stand/sysinstall and reboot.
Do a quick test to verify that the minimum
installation was successful.

The left-hand sides of the two figures above show how the disks will look at this point.

Bootstrapping Phase 3: Root Spindle Setup

Our goal in this phase is to get Vinum
set up and running on the
root spindle.
We will embed the existing
/usr and
/home filesystems in a
Vinum partition.
Note that the Vinum
volumes created will not yet be
failure-resilient since we have
only one underlying Vinum
drive to hold them.
The resulting system will automatically start
Vinum as it boots to multi-user mode.

Phase 3 Example

Login as root.

We will need a directory in the root filesystem in
which to keep a few files that will be used in the
Vinum
bootstrapping process.

&prompt.root; mkdir /bootvinum
&prompt.root; cd /bootvinum

Several files need to be prepared for use in bootstrapping.
I have written a Perl script that makes all the required
files for you.
Copy this script to /bootvinum by
floppy disk, tape, network, or any convenient means and
then run it.
(If you cannot get this script copied onto the machine being
bootstrapped, then see below for a manual alternative.)

&prompt.root; cp /mnt/bootvinum .
&prompt.root; ./bootvinum

bootvinum produces no output
when run successfully.
If you get any errors,
something may have gone wrong when you were creating
partitions with
/stand/sysinstall above.

Running bootvinum will:

- Create /etc/fstab.vinum based on what it finds in your existing /etc/fstab
- Create new disk labels for each spindle mentioned in /etc/fstab and keep copies of the current disk labels
- Create files needed as input to vinum for building Vinum objects on each spindle
- Create many alternates to /etc/fstab.vinum that might come in handy should a spindle fail

You may want to take a look at these files to learn more
about the disk partitioning required for
Vinum or to learn more about the
commands needed to create
Vinum objects.

We now need to install new spindle partitioning for
/dev/ad0.
This requires that
/dev/ad0s1b not be in use for
swapping so we have to reboot in single-user mode.

First, reboot the system.

&prompt.root; reboot

Next, enter single-user mode.

Hit [Enter] to boot immediately, or any other key for command prompt.
Booting [kernel] in 8 seconds...
Type '?' for a list of commands, 'help' for more detailed help.
ok boot -s

In single-user mode, install the new partitioning
created above.&prompt.root; cd /bootvinum
&prompt.root; disklabel -R ad0s1 disklabel.ad0s1
&prompt.root; disklabel -R ad2s1 disklabel.ad2s1

If you have additional spindles, repeat the above commands as appropriate for them.

We are about to start Vinum
for the first time.
It is going to want to create several device nodes under
/dev/vinum so we will need to mount the
root filesystem for read/write access.

&prompt.root; fsck -p /
&prompt.root; mount /

Now it is time to create the Vinum
objects that
will embed the existing non-root filesystems on
the root spindle in a
Vinum partition.
This will load the Vinum
kernel module and start Vinum
as a side effect.

&prompt.root; vinum create create.YouCrazy

You should see a list of Vinum objects created that looks like the following:

1 drives:
D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%)
2 volumes:
V home State: up Plexes: 1 Size: 488 MB
V usr State: up Plexes: 1 Size: 1330 MB
2 plexes:
P home.p0 C State: up Subdisks: 1 Size: 488 MB
P usr.p0 C State: up Subdisks: 1 Size: 1330 MB
2 subdisks:
S home.p0.s0 State: up PO: 0 B Size: 488 MB
S usr.p0.s0 State: up PO: 0 B Size: 1330 MB
You should also see several kernel messages
which state that the Vinum
objects you have created are now up.

Our non-root filesystems should now be embedded in a
Vinum partition and
hence available through Vinum
volumes.
It is important to test that this embedding worked.

&prompt.root; fsck -n /dev/vinum/home
&prompt.root; fsck -n /dev/vinum/usr

This should produce no errors.
If it does produce errors do not fix them.
Instead, go back and examine the root spindle partition tables
before and after Vinum
to see if you can spot the error.
You can back out the partition table changes by using
disklabel -R with the
disklabel.*.b4vinum files.

While we have the root filesystem mounted read/write, this is a good time to install /etc/fstab.

&prompt.root; mv /etc/fstab /etc/fstab.b4vinum
&prompt.root; cp /etc/fstab.vinum /etc/fstab

We are now done with tasks requiring single-user mode, so it is safe to go multi-user from here on.

&prompt.root; ^D

Login as root.

Edit /etc/rc.conf and add this line:

start_vinum="YES"

Bootstrapping Phase 4: Rootback Spindle Setup

Our goal in this phase is to get redundant copies of all data
from the root spindle to the rootback spindle.
We will first create the necessary Vinum
objects on the rootback spindle.
Then we will ask Vinum
to copy the data from the root spindle to the
rootback spindle.
Finally, we use dump and restore
to copy the root filesystem.

Phase 4 Example

Now that Vinum
is running on the root spindle, we can bring
it up on the rootback spindle so that our
Vinum volumes can become
failure-resilient.

&prompt.root; cd /bootvinum
&prompt.root; vinum create create.UpWindow

You should see a list of Vinum objects created that looks like the following:

2 drives:
D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%)
D UpWindow State: up Device /dev/ad2s1h Avail: 2096/3915 MB (53%)
2 volumes:
V home State: up Plexes: 2 Size: 488 MB
V usr State: up Plexes: 2 Size: 1330 MB
4 plexes:
P home.p0 C State: up Subdisks: 1 Size: 488 MB
P usr.p0 C State: up Subdisks: 1 Size: 1330 MB
P home.p1 C State: faulty Subdisks: 1 Size: 488 MB
P usr.p1 C State: faulty Subdisks: 1 Size: 1330 MB
4 subdisks:
S home.p0.s0 State: up PO: 0 B Size: 488 MB
S usr.p0.s0 State: up PO: 0 B Size: 1330 MB
S home.p1.s0 State: stale PO: 0 B Size: 488 MB
S usr.p1.s0 State: stale PO: 0 B Size: 1330 MB

You should also see several kernel messages which state that some of the Vinum objects you have created are now up while others are faulty or stale.

Now we ask Vinum
to copy each of the subdisks on drive
YouCrazy to drive UpWindow.
This will change the state of the newly created
Vinum subdisks
from stale to up.
It will also change the state of the newly created
Vinum plexes
from faulty to up.

First, we do the new subdisk we added to /home.

&prompt.root; vinum start -w home.p1.s0
reviving home.p1.s0
(time passes . . . )
home.p1.s0 is up by force
home.p1 is up
home.p1.s0 is up
My 5,400 RPM EIDE spindles copied at about 3.5 MBytes/sec.
Your mileage may vary.
Next we do the new subdisk we added to /usr.

&prompt.root; vinum start -w usr.p1.s0
reviving usr.p1.s0
(time passes . . . )
usr.p1.s0 is up by force
usr.p1 is up
usr.p1.s0 is up

All Vinum
objects should be in state up at this point.
The output of
vinum list should look
like the following:

2 drives:
D YouCrazy State: up Device /dev/ad0s1h Avail: 0/1818 MB (0%)
D UpWindow State: up Device /dev/ad2s1h Avail: 2096/3915 MB (53%)
2 volumes:
V home State: up Plexes: 2 Size: 488 MB
V usr State: up Plexes: 2 Size: 1330 MB
4 plexes:
P home.p0 C State: up Subdisks: 1 Size: 488 MB
P usr.p0 C State: up Subdisks: 1 Size: 1330 MB
P home.p1 C State: up Subdisks: 1 Size: 488 MB
P usr.p1 C State: up Subdisks: 1 Size: 1330 MB
4 subdisks:
S home.p0.s0 State: up PO: 0 B Size: 488 MB
S usr.p0.s0 State: up PO: 0 B Size: 1330 MB
S home.p1.s0 State: up PO: 0 B Size: 488 MB
S usr.p1.s0 State: up PO: 0 B Size: 1330 MB

Copy the root filesystem so that you will have a backup.

&prompt.root; cd /rootback
&prompt.root; dump 0f - / | restore rf -
&prompt.root; rm restoresymtable
&prompt.root; cd /

You may see errors like this:

./tmp/rstdir1001216411: (inode 558) not found on tape
cannot find directory inode 265
abort? [yn] n
expected next file 492, got 491

They seem to cause no harm.
I suspect they are a consequence of dumping the filesystem
containing /tmp and/or the pipe
connecting dump and
restore.

Make a directory on which we can mount a damaged root filesystem during the recovery process.

&prompt.root; mkdir /rootbad

Remove sentinel mount points that are now unused.

&prompt.root; rmdir /NOFUTURE*

Create empty &vinum.ap; drives on remaining spindles.

&prompt.root; vinum create create.ThruBank
&prompt.root; ...

At this point, the reliable server foundation is complete. The right-hand sides of the two figures above show how the disks will look.

You may want to do a quick reboot to multi-user and give it
a quick test drive.
This is also a good point to complete installation
of other distributions beyond the minimal install.
Add packages, ports, and users as required.
Configure /etc/rc.conf as required.

After you have completed your server configuration, remember to do one more copy of root to /rootback as shown above before placing the server into production.

Make a schedule to refresh /rootback periodically.

It may be a good idea to mount
/rootback read-only for normal operation
of the server.
This does, however, complicate the periodic refresh a bit.
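One way to handle the refresh with a read-only /rootback is to remount it read/write just for the copy; this is a sketch of the idea, not a command sequence from the original procedure:

&prompt.root; mount -u -o rw /rootback
&prompt.root; cd /rootback
&prompt.root; dump 0f - / | restore rf -
&prompt.root; rm restoresymtable
&prompt.root; cd /
&prompt.root; mount -u -o ro /rootback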
Do not forget to watch /var/log/messages carefully for errors.
Vinum
may automatically avoid failed hardware in a way that users
do not notice.
You must watch for such failures and get them repaired before a
second failure results in data loss.
You may see
Vinum noting damaged objects
at server boot time.

Where to Go from Here?

Now that you have established the foundation of a reliable server, there are several things you might want to try next.

Make a Vinum Volume with Remaining Space

Following are the steps to create another
Vinum volume with space remaining
on the rootback spindle.

This volume will not be resilient to spindle failure since it has only one plex on a single spindle.

Create a file with the following contents:

volume hope
plex name hope.p0 org concat volume hope
sd name hope.p0.s0 drive UpWindow plex hope.p0 len 0

Specifying a length of 0 for
the hope.p0.s0 subdisk
asks Vinum
to use whatever space is left available on the underlying
drive.

Feed these commands into vinum create.

&prompt.root; vinum create filename

Now we newfs the volume and mount it.

&prompt.root; newfs -v /dev/vinum/hope
&prompt.root; mkdir /hope
&prompt.root; mount /dev/vinum/hope /hope

Edit /etc/fstab if you want /hope mounted at boot time.
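The line would look something like the following sketch (dump and pass columns to taste):

/dev/vinum/hope    /hope    ufs    rw    2    2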
Try Out More Vinum Commands

You might already be familiar with vinum list, which shows all Vinum objects. Try the verbose options described in &man.vinum.8; to see more detail.

If you have more spindles and you want to bring them up as
concatenated, mirrored, or striped volumes, then give
vinum concat drivelist, vinum mirror drivelist, or vinum stripe drivelist a try.

See &man.vinum.8; for sample configurations and important
performance considerations before settling on a final organization
for your additional spindles.

The failure recovery instructions below will also give you
some experience using more Vinum
commands.

Failure Scenarios

This section contains descriptions of various failure scenarios.
For each scenario, there is a subsection on how to configure your
server for degraded mode operation, how to recover from the failure,
how to exit degraded mode, and how to simulate the failure.

Make a hard copy of these instructions and leave them inside the CPU case, being careful not to interfere with ventilation.

Root filesystem on ad0 unusable, rest of drive ok

We assume here that the boot blocks and disk label on
/dev/ad0 are ok.
If your BIOS can boot from a drive other than
C:, you may be able to get around this
limitation.

Configure Server for Degraded Mode

Use BootMgr to load kernel from
/dev/ad2s1a.

Hit F5 in BootMgr to select Drive 1.

Hit F1 to select FreeBSD.

After the kernel is loaded, hit any key but Enter to interrupt the boot sequence. Boot into single-user mode and allow explicit entry of a root filesystem.

Hit [Enter] to boot immediately, or any other key for command prompt.
Booting [kernel] in 8 seconds...
Type '?' for a list of commands, 'help' for more detailed help.
ok boot -as

Select /rootback as your root filesystem.

Manual root filesystem specification:
<fstype>:<device> Mount <device> using filesystem <fstype>
e.g. ufs:/dev/da0s1a
? List valid disk boot devices
<empty line> Abort manual input
mountroot> ufs:/dev/ad2s1a

Now that you are in single-user mode, change
/etc/fstab to avoid the
bad root filesystem.

If you used the bootvinum Perl script given below, then these commands should configure your server for degraded mode.

&prompt.root; fsck -p /
&prompt.root; mount /
&prompt.root; cd /etc
&prompt.root; mv fstab fstab.bak
&prompt.root; cp fstab_ad0s1_root_bad fstab
&prompt.root; cd /
&prompt.root; mount -o ro /
&prompt.root; vinum start
&prompt.root; fsck -p
&prompt.root; ^D

Recovery

Restore /dev/ad0s1a from
backups or copy
/rootback to it with these commands:

&prompt.root; umount /rootbad
&prompt.root; newfs /dev/ad0s1a
&prompt.root; tunefs -n enable /dev/ad0s1a
&prompt.root; mount /rootbad
&prompt.root; cd /rootbad
&prompt.root; dump 0f - / | restore rf -
&prompt.root; rm restoresymtable

Exiting Degraded Mode

Enter single-user mode.

&prompt.root; shutdown now

Put /etc/fstab back to
normal and reboot.&prompt.root; cd /rootbad/etc
&prompt.root; rm fstab
&prompt.root; mv fstab.bak fstab
&prompt.root; reboot

Reboot and hit F1 to boot from /dev/ad0 when prompted by BootMgr.

Simulation

This kind of failure can be simulated by shutting down to single-user mode and then booting as shown above.

Drive ad2 Fails

This section deals with the total failure of
/dev/ad2.

Configure Server for Degraded Mode

After the kernel is loaded, hit any key but Enter to interrupt the boot sequence. Boot into single-user mode.

Hit [Enter] to boot immediately, or any other key for command prompt.
Booting [kernel] in 8 seconds...
Type '?' for a list of commands, 'help' for more detailed help.
ok boot -s

Change /etc/fstab to avoid the bad drive. If you used the bootvinum Perl script given below, then these commands should configure your server for degraded mode.

&prompt.root; fsck -p /
&prompt.root; mount /
&prompt.root; cd /etc
&prompt.root; mv fstab fstab.bak
&prompt.root; cp fstab_only_have_ad0s1 fstab
&prompt.root; cd /
&prompt.root; mount -o ro /
&prompt.root; vinum start
&prompt.root; fsck -p
&prompt.root; ^D

If you do not have modified versions of /etc/fstab that are ready for use, then you can use ed to make one. Alternatively, you can fsck and mount /usr and then use your favorite editor.
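For example, the following ed session comments out every /etc/fstab line that refers to the failed spindle; treat it as a sketch and adapt the pattern to your device names:

&prompt.root; ed /etc/fstab
g/^\/dev\/ad2/s/^/#/
w
q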
Recovery

We assume here that your server is up and running multi-user in degraded mode on just
/dev/ad0 and that you have
a new spindle now on
/dev/ad2 ready to go.

You will need a new spindle with enough room to hold root and swap partitions plus a Vinum partition large enough to hold /home and /usr.

Create a BIOS partition (slice) on the new spindle.

&prompt.root; /stand/sysinstall

- Select Custom.
- Select Partition.
- Select ad2.
- Create a FreeBSD (type 165) slice large enough to hold everything mentioned above.
- Write changes. Yes, you are absolutely sure.
- Select BootMgr.
- Quit Partitioning.
- Exit /stand/sysinstall.

Create disk label partitioning based on current /dev/ad0 partitioning.

&prompt.root; disklabel ad0 > /tmp/ad0
&prompt.root; disklabel -e ad2

This will drop you into your favorite editor.

- Copy the lines for the a and b partitions from /tmp/ad0 to the ad2 disklabel.
- Add the size of the a and b partitions to find the proper offset for the h partition.
- Subtract this offset from the size of the c partition to find the proper size for the h partition.
- Define an h partition with the size and offset calculated above.
- Set the fstype column to vinum.
- Save the file and quit your editor.
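With the example sizes used in this article, the arithmetic works out as follows: a is 200,000 blocks and b is 200,000 blocks (swap was shrunk by 265 blocks during bootstrapping), so the h partition starts at offset 400,000; if the replacement spindle's c partition is 8,420,769 blocks, h gets 8,420,769 - 400,000 = 8,020,769 blocks. The edited label would then contain lines roughly like:

  a:   200000        0    4.2BSD
  b:   200000   200000      swap
  c:  8420769        0    unused
  h:  8020769   400000     vinum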
Tell Vinum about the new drive. Ask Vinum to start an editor with a copy of the current configuration.

&prompt.root; vinum create

Uncomment the drive line referring to drive UpWindow and set device to /dev/ad2s1h. Save the file and quit your editor.

Now that Vinum has two spindles again, revive the mirrors.

&prompt.root; vinum start -w usr.p1.s0
&prompt.root; vinum start -w home.p1.s0

Now we need to restore
/rootback to a current copy of the
root filesystem.
These commands will accomplish this.

&prompt.root; newfs /dev/ad2s1a
&prompt.root; tunefs -n enable /dev/ad2s1a
&prompt.root; mount /dev/ad2s1a /mnt
&prompt.root; cd /mnt
&prompt.root; dump 0f - / | restore rf -
&prompt.root; rm restoresymtable
&prompt.root; cd /
&prompt.root; umount /mnt

Exiting Degraded Mode

Enter single-user mode.

&prompt.root; shutdown now

Return /etc/fstab to its normal state and reboot.

&prompt.root; cd /etc
&prompt.root; rm fstab
&prompt.root; mv fstab.bak fstab
&prompt.root; reboot

Simulation

You can simulate this kind of failure by unplugging
/dev/ad2, write-protecting it,
or by this procedure:

- Shutdown to single-user mode.
- Unmount all non-root filesystems.
- Clobber any existing Vinum configuration and partitioning on /dev/ad2.

&prompt.root; vinum stop
&prompt.root; dd if=/dev/zero of=/dev/ad2s1h count=512
&prompt.root; dd if=/dev/zero of=/dev/ad2 count=512

Drive ad0 Fails

Some BIOSes can boot from drive 1 or drive 2 (often called
C: or D:),
while others can boot only from drive 1.
If your BIOS can boot from either, the fastest road to recovery
might be to boot directly from /dev/ad2
in single-user mode and
install /etc/fstab_only_have_ad2s1 as
/etc/fstab.
You would then have to adapt the /dev/ad2
failure recovery instructions from above.

If your BIOS can only boot from drive 1, then you will have to
unplug drive YouCrazy from the controller for
/dev/ad2 and plug it
into the controller for /dev/ad0.
Then continue with the instructions for
/dev/ad2 failure recovery
above.

bootvinum Perl Script

The bootvinum Perl script below reads /etc/fstab
and current drive partitioning.
It then writes several files in the current directory and several
variants of /etc/fstab in /etc.
These files significantly simplify the installation of
Vinum and recovery from
spindle failures.

#!/usr/bin/perl -w
use strict;
use FileHandle;
my $config_tag1 = '$Id: article.sgml,v 1.14 2003-10-18 10:39:16 simon Exp $';
# Copyright (C) 2001 Robert A. Van Valzah
#
# Bootstrap Vinum
#
# Read /etc/fstab and current partitioning for all spindles mentioned there.
# Generate files needed to mirror all filesystems on root spindle.
# A new partition table for each spindle
# Input for the vinum create command to create Vinum objects on each spindle
# A copy of fstab mounting Vinum volumes instead of BSD partitions
# Copies of fstab altered for server's degraded modes of operation
# See handbook for instructions on how to use the files generated.
# N.B. This bootstrapping method shrinks size of swap partition by the size
# of Vinum's on-disk configuration (265 sectors). It embeds existing file
# systems on the root spindle in Vinum objects without having to copy them.
# Thanks to Greg Lehey for suggesting this bootstrapping method.
# Expectations:
# The root spindle must contain at least root, swap, and /usr partitions
# The rootback spindle must have matching /rootback and swap partitions
# Other spindles should only have a /NOFUTURE* filesystem and maybe swap
# File systems named /NOFUTURE* will be replaced with Vinum drives
# Change configuration variables below to suit your taste
my $vip = 'h'; # VInum Partition
my @drv = ('YouCrazy', 'UpWindow', 'ThruBank', # Vinum DRiVe names
'OutSnakes', 'MeWild', 'InMovie', 'HomeJames', 'DownPrices', 'WhileBlind');
# No configuration variables beyond this point
my %vols; # One entry per Vinum volume to be created
my @spndl; # One entry per SPiNDLe
my $rsp; # Root SPindle (as in /dev/$rsp)
my $rbsp; # RootBack SPindle (as in /dev/$rbsp)
my $cfgsiz = 265; # Size of Vinum on-disk configuration info in sectors
my $nxtpas = 2; # Next fsck pass number for non-root filesystems
# Parse fstab, generating the version we'll need for Vinum and noting
# spindles in use.
my $fsin = "/etc/fstab";
#my $fsin = "simu/fstab";
open(FSIN, "$fsin") || die("Couldn't open $fsin: $!\n");
my $fsout = "/etc/fstab.vinum";
open(FSOUT, ">$fsout") || die("Couldn't open $fsout for writing: $!\n");
while (<FSIN>) {
my ($dev, $mnt, $fstyp, $opt, $dump, $pass) = split;
next if $dev =~ /^#/;
if ($mnt eq '/' || $mnt eq '/rootback' || $mnt =~ /^\/NOFUTURE/) {
my $dn = substr($dev, 5, length($dev)-6); # Device Name without /dev/
push(@spndl, $dn) unless grep($_ eq $dn, @spndl);
$rsp = $dn if $mnt eq '/';
next if $mnt =~ /^\/NOFUTURE/;
}
# Move /rootback from partition e to a
if ($mnt =~ /^\/rootback/) {
$dev =~ s/e$/a/;
$pass = 1;
$rbsp = substr($dev, 5, length($dev)-6);
print FSOUT "$dev\t\t$mnt\t$fstyp\t$opt\t\t$dump\t$pass\n";
next;
}
# Move non-root filesystems on smallest spindle into Vinum
if (defined($rsp) && $dev =~ /^\/dev\/$rsp/ && $dev =~ /[d-h]$/) {
$pass = $nxtpas++;
print FSOUT "/dev/vinum$mnt\t\t$mnt\t\t$fstyp\t$opt\t\t$dump\t$pass\n";
$vols{$dev}->{mnt} = substr($mnt, 1);
next;
}
print FSOUT $_;
}
close(FSOUT);
die("Found more spindles than we have abstract names\n") if $#spndl > $#drv;
die("Didn't find a root partition!\n") if !defined($rsp);
die("Didn't find a /rootback partition!\n") if !defined($rbsp);
# Table of server's Degraded Modes
# One row per mode with hash keys
# fn FileName
# xpr eXPRession needed to convert fstab lines for this mode
# cm1 CoMment 1 describing this mode
# cm2 CoMment 2 describing this mode
# FH FileHandle (dynamically initialized below)
my @DM = (
{ cm1 => "When we only have $rsp, comment out lines using $rbsp",
fn => "/etc/fstab_only_have_$rsp",
xpr => "s:^/dev/$rbsp:#\$&:",
},
{ cm1 => "When we only have $rbsp, comment out lines using $rsp and",
cm2 => "rootback becomes root",
fn => "/etc/fstab_only_have_$rbsp",
xpr => "s:^/dev/$rsp:#\$&: || s:/rootback:/\t:",
},
{ cm1 => "When only $rsp root is bad, /rootback becomes root and",
cm2 => "root becomes /rootbad",
fn => "/etc/fstab_${rsp}_root_bad",
xpr => "s:\t/\t:\t/rootbad: || s:/rootback:/\t:",
},
);
# Initialize output FileHandles and write comments
foreach my $dm (@DM) {
my $fh = new FileHandle;
$fh->open(">$dm->{fn}") || die("Can't write $dm->{fn}: $!\n");
print $fh "# $dm->{cm1}\n" if $dm->{cm1};
print $fh "# $dm->{cm2}\n" if $dm->{cm2};
$dm->{FH} = $fh;
}
# Parse the Vinum version of fstab written above and write versions needed
# for server's degraded modes.
open(FSOUT, "$fsout") || die("Couldn't open $fsout: $!\n");
while (<FSOUT>) {
my $line = $_;
foreach my $dm (@DM) {
$_ = $line;
eval $dm->{xpr};
print {$dm->{FH}} $_;
}
}
# Parse partition table for each spindle and write versions needed for Vinum
my $rootsiz; # ROOT partition SIZe
my $swapsiz; # SWAP partition SIZe
my $rspminoff; # Root SPindle MINimum OFFset of non-root, non-swap, non-c parts
my $rspsiz; # Root SPindle SIZe
my $rbspsiz; # RootBack SPindle SIZe
foreach my $i (0..$#spndl) {
my $dlin = "disklabel $spndl[$i] |";
# my $dlin = "simu/disklabel.$spndl[$i]";
open(DLIN, "$dlin") || die("Couldn't open $dlin: $!\n");
my $dlout = "disklabel.$spndl[$i]";
open(DLOUT, ">$dlout") || die("Couldn't open $dlout for writing: $!\n");
my $dlb4 = "$dlout.b4vinum";
open(DLB4, ">$dlb4") || die("Couldn't open $dlb4 for writing: $!\n");
my $minoff; # MINimum OFFset of non-root, non-swap, non-c partitions
my $totsiz = 0; # TOTal SIZe of all non-root, non-swap, non-c partitions
my $swapspndl = 0; # True if SWAP partition on this SPiNDLe
while (<DLIN>) {
print DLB4 $_;
my ($part, $siz, $off, $fstyp, $fsiz, $bsiz, $bps) = split;
if ($part && $part eq 'a:' && $spndl[$i] eq $rsp) {
$rootsiz = $siz;
}
if ($part && $part eq 'e:' && $spndl[$i] eq $rbsp) {
if ($rootsiz != $siz) {
die("Rootback size ($siz) != root size ($rootsiz)\n");
}
}
if ($part && $part eq 'c:') {
$rspsiz = $siz if $spndl[$i] eq $rsp;
$rbspsiz = $siz if $spndl[$i] eq $rbsp;
}
# Make swap partition $cfgsiz sectors smaller
if ($part && $part eq 'b:') {
if ($spndl[$i] eq $rsp) {
$swapsiz = $siz;
} else {
if ($swapsiz != $siz) {
die("Swap partition sizes unequal across spindles\n");
}
}
printf DLOUT "%4s%9d%9d%10s\n", $part, $siz-$cfgsiz, $off, $fstyp;
$swapspndl = 1;
next;
}
# Move rootback spindle e partitions to a
if ($part && $part eq 'e:' && $spndl[$i] eq $rbsp) {
printf DLOUT "%4s%9d%9d%10s%9d%6d%6d\n", 'a:', $siz, $off, $fstyp,
$fsiz, $bsiz, $bps;
next;
}
# Delete non-root, non-swap, non-c partitions but note their minimum
# offset and total size that're needed below.
if ($part && $part =~ /^[d-h]:$/) {
$minoff = $off unless $minoff;
$minoff = $off if $off < $minoff;
$totsiz += $siz;
if ($spndl[$i] eq $rsp) { # If doing spindle containing root
my $dev = "/dev/$spndl[$i]" . substr($part, 0, 1);
$vols{$dev}->{siz} = $siz;
$vols{$dev}->{off} = $off;
$rspminoff = $minoff;
}
next;
}
print DLOUT $_;
}
if ($swapspndl) { # If there was a swap partition on this spindle
# Make a Vinum partition the size of all non-root, non-swap,
# non-c partitions + the size of Vinum's on-disk configuration.
# Set its offset so that the start of the first subdisk it contains
# coincides with the first filesystem we're embedding in Vinum.
printf DLOUT "%4s%9d%9d%10s\n", "$vip:", $totsiz+$cfgsiz, $minoff-$cfgsiz,
'vinum';
} else {
# No need to mess with size and offset if there was no swap
printf DLOUT "%4s%9d%9d%10s\n", "$vip:", $totsiz, $minoff,
'vinum';
}
}
die("Swap partition not found\n") unless $swapsiz;
die("Swap partition not larger than $cfgsiz blocks\n") unless $swapsiz>$cfgsiz;
die("Rootback spindle size not >= root spindle size\n") unless $rbspsiz>=$rspsiz;
# Generate input to vinum create command needed for each spindle.
foreach my $i (0..$#spndl) {
my $cfn = "create.$drv[$i]"; # Create File Name
open(CF, ">$cfn") || die("Can't open $cfn for writing: $!\n");
print CF "drive $drv[$i] device /dev/$spndl[$i]$vip\n";
next unless $spndl[$i] eq $rsp || $spndl[$i] eq $rbsp;
foreach my $dev (keys(%vols)) {
my $mnt = $vols{$dev}->{mnt};
my $siz = $vols{$dev}->{siz};
my $off = $vols{$dev}->{off}-$rspminoff+$cfgsiz;
print CF "volume $mnt\n" if $spndl[$i] eq $rsp;
print CF <<EOF;
plex name $mnt.p$i org concat volume $mnt
sd name $mnt.p$i.s0 drive $drv[$i] plex $mnt.p$i len ${siz}s driveoffset ${off}s
EOF
}
}

Manual Vinum Bootstrapping

The bootvinum Perl script above makes life easier, but
it may be necessary to manually perform some or all of the steps that
it automates.
This appendix describes how you would manually mimic the script.

Make a copy of /etc/fstab to be customized.

&prompt.root; cp /etc/fstab /etc/fstab.vinum

Edit /etc/fstab.vinum.

- Change the device column of non-root partitions on the root spindle to /dev/vinum/mnt.
- Change the pass column of non-root partitions on the root spindle to 2, 3, etc.
- Delete any lines with mountpoint matching /NOFUTURE*.
- Change the device column of /rootback from e to a.
- Change the pass column of /rootback to 1.

Prepare disklabels for editing:

&prompt.root; cd /bootvinum
&prompt.root; disklabel ad0s1 > disklabel.ad0s1
&prompt.root; cp disklabel.ad0s1 disklabel.ad0s1.b4vinum
&prompt.root; disklabel ad2s1 > disklabel.ad2s1
&prompt.root; cp disklabel.ad2s1 disklabel.ad2s1.b4vinum

Edit the disklabel.ad?s1 files.

On the root spindle:

- Decrease the size of the b partition by 265 blocks.
- Note the size and offset of the a and b partitions.
- Note the smallest offset for partitions d-h.
- Note the size and offset for all non-root, non-swap partitions (/home was probably on e and /usr was probably on f).
- Delete partitions d-h.
- Create a new h partition with offset 265 blocks less than the smallest offset for partitions d-h noted above. Set its size to the size of the c partition less the smallest offset for partitions d-h noted above + 265 blocks.

Vinum can use any partition other than c. It is not strictly necessary to use h for all your Vinum partitions, but it is good practice to be consistent across all spindles.

- Set the fstype of this new partition to vinum.

On the rootback spindle:

- Move the e partition to a.
- Verify that the size of the a and b partitions matches the root spindle.
- Note the smallest offset for partitions d-h.
- Delete partitions d-h.
- Create a new h partition with offset 265 blocks less than the smallest offset noted above for partitions d-h. Set its size to the size of the c partition less the smallest offset for partitions d-h noted above + 265 blocks.
- Set the fstype of this new partition to vinum.

Create a file named create.YouCrazy that contains:

drive YouCrazy device /dev/ad0s1h
volume home
plex name home.p0 org concat volume home
sd name home.p0.s0 drive YouCrazy plex home.p0 len $hl driveoffset $ho
volume usr
plex name usr.p0 org concat volume usr
sd name usr.p0.s0 drive YouCrazy plex usr.p0 len $ul driveoffset $uo

Where:

- $hl is the length noted above for /home.
- $ho is the offset noted above for /home less the smallest offset noted above + 265 blocks.
- $ul is the length noted above for /usr.
- $uo is the offset noted above for /usr less the smallest offset noted above + 265 blocks.
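With the example numbers from this article ($hl = 1,000,000, $ho = 400,265 - 400,265 + 265 = 265, $ul = 2,724,408, $uo = 1,400,265 - 400,265 + 265 = 1,000,265), create.YouCrazy would come out roughly as follows; your own values come from your disk labels:

drive YouCrazy device /dev/ad0s1h
volume home
plex name home.p0 org concat volume home
sd name home.p0.s0 drive YouCrazy plex home.p0 len 1000000s driveoffset 265s
volume usr
plex name usr.p0 org concat volume usr
sd name usr.p0.s0 drive YouCrazy plex usr.p0 len 2724408s driveoffset 1000265s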
Create a file named create.UpWindow containing:

drive UpWindow device /dev/ad2s1h
plex name home.p1 org concat volume home
sd name home.p1.s0 drive UpWindow plex home.p1 len $hl driveoffset $ho
plex name usr.p1 org concat volume usr
sd name usr.p1.s0 drive UpWindow plex usr.p1 len $ul driveoffset $uo

Where $hl, $ho, $ul, and $uo are set as above.

Acknowledgements

I would like to thank Greg Lehey for writing &vinum.ap; and for
providing very helpful comments on early drafts.
Several others made helpful suggestions after reviewing later drafts
including
Dag-Erling Smørgrav,
Michael Splendoria,
Chern Lee,
Stefan Aeschbacher,
Fleming Froekjaer,
Bernd Walter,
Aleksey Baranov, and
Doug Swarin.
Design elements of the FreeBSD VM system

Matthew Dillon, dillon@apollo.backplane.com
The title is really just a fancy way of saying that I am going to
attempt to describe the whole VM enchilada, hopefully in a way that
everyone can follow. For the last year I have concentrated on a number
of major kernel subsystems within FreeBSD, with the VM and Swap
subsystems being the most interesting and NFS being a necessary
chore. I rewrote only small portions of the code. In the VM
arena the only major rewrite I have done is to the swap subsystem.
Most of my work was cleanup and maintenance, with only moderate code
rewriting and no major algorithmic adjustments within the VM
subsystem. The bulk of the VM subsystem's theoretical base remains
unchanged and a lot of the credit for the modernization effort in the
last few years belongs to John Dyson and David Greenman. Not being a
historian like Kirk I will not attempt to tag all the various features
with people's names, since I will invariably get it wrong.

This article was originally published in the January 2000 issue of
DaemonNews. This
version of the article may include updates from Matt and other authors
to reflect changes in FreeBSD's VM implementation.

Introduction

Before moving along to the actual design let's spend a little time
on the necessity of maintaining and modernizing any long-living
codebase. In the programming world, algorithms tend to be more
important than code and it is precisely due to BSD's academic roots that
a great deal of attention was paid to algorithm design from the
beginning. More attention paid to the design generally leads to a clean
and flexible codebase that can be fairly easily modified, extended, or
replaced over time. While BSD is considered an old
operating system by some people, those of us who work on it tend to view
it more as a mature codebase which has various components
modified, extended, or replaced with modern code. It has evolved, and
FreeBSD is at the bleeding edge no matter how old some of the code might
be. This is an important distinction to make and one that is
unfortunately lost to many people. The biggest error a programmer can
make is to not learn from history, and this is precisely the error that
- many other modern operating systems have made. NT is the best example
+ many other modern operating systems have made. &windowsnt; is the best example
of this, and the consequences have been dire. Linux also makes this
mistake to some degree—enough that we BSD folk can make small
jokes about it every once in a while, anyway. Linux's problem is simply
one of a lack of experience and history to compare ideas against, a
problem that is easily and rapidly being addressed by the Linux
community in the same way it has been addressed in the BSD
- community—by continuous code development. The NT folk, on the
+ community—by continuous code development. The &windowsnt; folk, on the
other hand, repeatedly make the same mistakes solved by &unix; decades ago
and then spend years fixing them. Over and over again. They have a
severe case of "not designed here" and "we are always
right because our marketing department says so". I have little
tolerance for anyone who cannot learn from history.Much of the apparent complexity of the FreeBSD design, especially in
the VM/Swap subsystem, is a direct result of having to solve serious
performance issues that occur under various conditions. These issues
are not due to bad algorithmic design but instead rise from
environmental factors. In any direct comparison between platforms,
these issues become most apparent when system resources begin to get
stressed. As I describe FreeBSD's VM/Swap subsystem the reader should
always keep two points in mind. First, the most important aspect of
performance design is what is known as Optimizing the Critical
Path. It is often the case that performance optimizations add a
little bloat to the code in order to make the critical path perform
better. Second, a solid, generalized design outperforms a
heavily-optimized design over the long run. While a generalized design
may end up being slower than a heavily-optimized design when they are
first implemented, the generalized design tends to be easier to adapt to
changing conditions and the heavily-optimized design winds up having to
be thrown away. Any codebase that will survive and be maintainable for
years must therefore be designed properly from the beginning even if it
costs some performance. Twenty years ago people were still arguing that
programming in assembly was better than programming in a high-level
language because it produced code that was ten times as fast. Today,
the fallibility of that argument is obvious—as are the parallels
to algorithmic design and code generalization.
VM Objects
The best way to begin describing the FreeBSD VM system is to look at
it from the perspective of a user-level process. Each user process sees
a single, private, contiguous VM address space containing several types
of memory objects. These objects have various characteristics. Program
code and program data are effectively a single memory-mapped file (the
binary file being run), but program code is read-only while program data
is copy-on-write. Program BSS is just memory allocated and filled with
zeros on demand, called demand zero page fill. Arbitrary files can be
memory-mapped into the address space as well, which is how the shared
library mechanism works. Such mappings can require modifications to
remain private to the process making them. The fork system call adds an
entirely new dimension to the VM management problem on top of the
complexity already given.A program binary data page (which is a basic copy-on-write page)
illustrates the complexity. A program binary contains a preinitialized
data section which is initially mapped directly from the program file.
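The same copy-on-write behavior is visible from user level through an ordinary private file mapping. The sketch below is purely illustrative (the file name is hypothetical and error handling is minimal): it writes through a MAP_PRIVATE mapping and then shows that the file underneath is left untouched:
/*
 * Illustrative sketch only: a private (copy-on-write) file mapping.
 * "data.bin" is a hypothetical pre-existing, non-empty file.
 */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	struct stat st;
	int fd = open("data.bin", O_RDWR);
	if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
		return 1;
	/* Reads are initially backed by the file itself. */
	char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	char before = p[0];
	p[0] = 'X';	/* first write: the kernel copies this page privately */
	char on_disk;
	pread(fd, &on_disk, 1, 0);
	printf("mapping now sees '%c'; the file still holds '%c' (was '%c')\n",
	    p[0], on_disk, before);
	return 0;
}
Once that page differs from the file it can no longer be thrown away and re-read from the file, which is exactly the situation described next.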
When a program is loaded into a process's VM space, this area is
initially memory-mapped and backed by the program binary itself,
allowing the VM system to free/reuse the page and later load it back in
from the binary. The moment a process modifies this data, however, the
VM system must make a private copy of the page for that process. Since
the private copy has been modified, the VM system may no longer free it,
because there is no longer any way to restore it later on.You will notice immediately that what was originally a simple file
mapping has become much more complex. Data may be modified on a
page-by-page basis whereas the file mapping encompasses many pages at
once. The complexity further increases when a process forks. When a
process forks, the result is two processes—each with their own
private address spaces, including any modifications made by the original
process prior to the call to fork(). It would be
silly for the VM system to make a complete copy of the data at the time
of the fork() because it is quite possible that at
least one of the two processes will only need to read from that page
from then on, allowing the original page to continue to be used. What
was a private page is made copy-on-write again, since each process
(parent and child) expects their own personal post-fork modifications to
remain private to themselves and not affect the other.
FreeBSD manages all of this with a layered VM Object model. The
original binary program file winds up being the lowest VM Object layer.
A copy-on-write layer is pushed on top of that to hold those pages which
had to be copied from the original file. If the program modifies a data
page belonging to the original file the VM system takes a fault and
makes a copy of the page in the higher layer. When a process forks,
additional VM Object layers are pushed on. This might make a little
more sense with a fairly basic example. A fork()
is a common operation for any *BSD system, so this example will consider
a program that starts up, and forks. When the process starts, the VM
system creates an object layer, let's call this A:
+---------------+
| A |
+---------------+
A picture
A represents the file—pages may be paged in and out of the
file's physical media as necessary. Paging in from the disk is
reasonable for a program, but we really do not want to page back out and
overwrite the executable. The VM system therefore creates a second
layer, B, that will be physically backed by swap space:
+---------------+
| B |
+---------------+
| A |
+---------------+
On the first write to a page after this, a new page is created in B,
and its contents are initialized from A. All pages in B can be paged in
or out to a swap device. When the program forks, the VM system creates
two new object layers—C1 for the parent, and C2 for the
child—that rest on top of B:
+-------+-------+
| C1 | C2 |
+-------+-------+
| B |
+---------------+
| A |
+---------------+
In this case, let's say a page in B is modified by the original
parent process. The process will take a copy-on-write fault and
duplicate the page in C1, leaving the original page in B untouched.
Now, let's say the same page in B is modified by the child process. The
process will take a copy-on-write fault and duplicate the page in C2.
The original page in B is now completely hidden since both C1 and C2
have a copy (and B could theoretically be destroyed if it does not
represent a real file). However, this sort of optimization is not
trivial to make because it is so fine-grained. FreeBSD does not make
this optimization. Now, suppose (as is often the case) that the child
process does an exec(). Its current address space
is usually replaced by a new address space representing a new file. In
this case, the C2 layer is destroyed:
+-------+
| C1 |
+-------+-------+
| B |
+---------------+
| A |
+---------------+
In this case, the number of children of B drops to one, and all
accesses to B now go through C1. This means that B and C1 can be
collapsed together. Any pages in B that also exist in C1 are deleted
from B during the collapse. Thus, even though the optimization in the
previous step could not be made, we can recover the dead pages when
either of the processes exits or calls exec().
This model creates a number of potential problems. The first is that
you can wind up with a relatively deep stack of layered VM Objects which
can cost scanning time and memory when you take a fault. Deep
layering can occur when processes fork and then fork again (either
parent or child). The second problem is that you can wind up with dead,
inaccessible pages deep in the stack of VM Objects. In our last example
if both the parent and child processes modify the same page, they both
get their own private copies of the page and the original page in B is
no longer accessible by anyone. That page in B can be freed.FreeBSD solves the deep layering problem with a special optimization
called the All Shadowed Case. This case occurs if either
C1 or C2 takes sufficient COW faults to completely shadow all pages in B.
Let's say that C1 achieves this. C1 can now bypass B entirely, so rather
than have C1->B->A and C2->B->A we now have C1->A and C2->B->A. But
look what also happened—now B has only one reference (C2), so we
can collapse B and C2 together. The end result is that B is deleted
entirely and we have C1->A and C2->A. It is often the case that B will
contain a large number of pages and neither C1 nor C2 will be able to
completely overshadow it. If we fork again and create a set of D
layers, however, it is much more likely that one of the D layers will
eventually be able to completely overshadow the much smaller dataset
represented by C1 or C2. The same optimization will work at any point in
the graph and the grand result of this is that even on a heavily forked
machine VM Object stacks tend not to get much deeper than 4. This is
true of both the parent and the children and true whether the parent is
doing the forking or whether the children cascade forks.The dead page problem still exists in the case where C1 or C2 do not
completely overshadow B. Due to our other optimizations this case does
not represent much of a problem and we simply allow the pages to be
dead. If the system runs low on memory it will swap them out, eating a
little swap, but that is it.The advantage to the VM Object model is that
fork() is extremely fast, since no real data
copying need take place. The disadvantage is that you can build a
relatively complex VM Object layering that slows page fault handling
down a little, and you spend memory managing the VM Object structures.
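A user-level sketch (illustrative only, not kernel code) shows what this buys: the fork() below returns essentially immediately even though the parent has dirtied 64MB, and the child's later write is never seen by the parent because the child gets its own copy of only the page it touches:
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NBYTES	(64 * 1024 * 1024)

int
main(void)
{
	char *buf = malloc(NBYTES);
	if (buf == NULL)
		return 1;
	memset(buf, 'A', NBYTES);	/* parent dirties every page */

	pid_t pid = fork();	/* no data is copied here, only COW bookkeeping */
	if (pid == 0) {
		buf[0] = 'B';	/* child faults in a private copy of one page */
		printf("child sees  '%c'\n", buf[0]);
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	printf("parent sees '%c'\n", buf[0]);	/* still 'A' */
	free(buf);
	return 0;
}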
The optimizations FreeBSD makes prove to reduce the problems enough
that they can be ignored, leaving no real disadvantage.
SWAP Layers
Private data pages are initially either copy-on-write or zero-fill
pages. When a change, and therefore a copy, is made, the original
backing object (usually a file) can no longer be used to save a copy of
the page when the VM system needs to reuse it for other purposes. This
is where SWAP comes in. SWAP is allocated to create backing store for
memory that does not otherwise have it. FreeBSD allocates the swap
management structure for a VM Object only when it is actually needed.
However, the swap management structure has had problems
historically.Under FreeBSD 3.X the swap management structure preallocates an
array that encompasses the entire object requiring swap backing
store—even if only a few pages of that object are swap-backed.
This creates a kernel memory fragmentation problem when large objects
are mapped, or processes with large runsizes (RSS) fork. Also, in order
to keep track of swap space, a list of holes is kept in
kernel memory, and this tends to get severely fragmented as well. Since
the list of holes is a linear list, the swap allocation and freeing
performance is a non-optimal O(n)-per-page. It also requires kernel
memory allocations to take place during the swap freeing process, and
that creates low memory deadlock problems. The problem is further
exacerbated by holes created due to the interleaving algorithm. Also,
the swap block map can become fragmented fairly easily resulting in
non-contiguous allocations. Kernel memory must also be allocated on the
fly for additional swap management structures when a swapout occurs. It
is evident that there was plenty of room for improvement.For FreeBSD 4.X, I completely rewrote the swap subsystem. With this
rewrite, swap management structures are allocated through a hash table
rather than a linear array giving them a fixed allocation size and much
finer granularity. Rather than using a linearly linked list to keep
track of swap space reservations, it now uses a bitmap of swap blocks
arranged in a radix tree structure with free-space hinting in the radix
node structures. This effectively makes swap allocation and freeing an
O(1) operation. The entire radix tree bitmap is also preallocated in
order to avoid having to allocate kernel memory during critical low
memory swapping operations. After all, the system tends to swap when it
is low on memory so we should avoid allocating kernel memory at such
times in order to avoid potential deadlocks. Finally, to reduce
fragmentation the radix tree is capable of allocating large contiguous
chunks at once, skipping over smaller fragmented chunks. I did not take
the final step of having an allocating hint pointer that would trundle
through a portion of swap as allocations were made in order to further
guarantee contiguous allocations or at least locality of reference, but
I ensured that such an addition could be made.
When to free a page
Since the VM system uses all available memory for disk caching,
there are usually very few truly-free pages. The VM system depends on
being able to properly choose pages which are not in use to reuse for
new allocations. Selecting the optimal pages to free is possibly the
single-most important function any VM system can perform because if it
makes a poor selection, the VM system may be forced to unnecessarily
retrieve pages from disk, seriously degrading system performance.How much overhead are we willing to suffer in the critical path to
avoid freeing the wrong page? Each wrong choice we make will cost us
hundreds of thousands of CPU cycles and a noticeable stall of the
affected processes, so we are willing to endure a significant amount of
overhead in order to be sure that the right page is chosen. This is why
FreeBSD tends to outperform other systems when memory resources become
stressed.The free page determination algorithm is built upon a history of the
use of memory pages. To acquire this history, the system takes advantage
of a page-used bit feature that most hardware page tables have.In any case, the page-used bit is cleared and at some later point
the VM system comes across the page again and sees that the page-used
bit has been set. This indicates that the page is still being actively
used. If the bit is still clear it is an indication that the page is not
being actively used. By testing this bit periodically, a use history (in
the form of a counter) for the physical page is developed. When the VM
system later needs to free up some pages, checking this history becomes
the cornerstone of determining the best candidate page to reuse.
What if the hardware has no page-used bit?
For those platforms that do not have this feature, the system
actually emulates a page-used bit. It unmaps or protects a page,
forcing a page fault if the page is accessed again. When the page
fault is taken, the system simply marks the page as having been used
and unprotects the page so that it may be used. While taking such page
faults just to determine if a page is being used appears to be an
expensive proposition, it is much less expensive than reusing the page
for some other purpose only to find that a process needs it back and
then have to go to disk.FreeBSD makes use of several page queues to further refine the
selection of pages to reuse as well as to determine when dirty pages
must be flushed to their backing store. Since page tables are dynamic
entities under FreeBSD, it costs virtually nothing to unmap a page from
the address space of any processes using it. When a page candidate has
been chosen based on the page-use counter, this is precisely what is
done. The system must make a distinction between clean pages which can
theoretically be freed up at any time, and dirty pages which must first
be written to their backing store before being reusable. When a page
candidate has been found it is moved to the inactive queue if it is
dirty, or the cache queue if it is clean. A separate algorithm based on
the dirty-to-clean page ratio determines when dirty pages in the
inactive queue must be flushed to disk. Once this is accomplished, the
flushed pages are moved from the inactive queue to the cache queue. At
this point, pages in the cache queue can still be reactivated by a VM
fault at relatively low cost. However, pages in the cache queue are
considered to be immediately freeable and will be reused
in an LRU (least-recently used) fashion when the system needs to
allocate new memory.It is important to note that the FreeBSD VM system attempts to
separate clean and dirty pages for the express reason of avoiding
unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does
it move pages between the various page queues gratuitously when the
memory subsystem is not being stressed. This is why you will see some
systems with very low cache queue counts and high active queue counts
when doing a systat -vm command. As the VM system
becomes more stressed, it makes a greater effort to maintain the various
page queues at the levels determined to be the most effective. An urban
myth has circulated for years that Linux did a better job avoiding
swapouts than FreeBSD, but this in fact is not true. What was actually
occurring was that FreeBSD was proactively paging out unused pages in
order to make room for more disk cache while Linux was keeping unused
pages in core and leaving less memory available for cache and process
pages. I do not know whether this is still true today.
Pre-Faulting and Zeroing Optimizations
Taking a VM fault is not expensive if the underlying page is already
in core and can simply be mapped into the process, but it can become
expensive if you take a whole lot of them on a regular basis. A good
example of this is running a program such as &man.ls.1; or &man.ps.1;
over and over again. If the program binary is mapped into memory but
not mapped into the page table, then all the pages that will be accessed
by the program will have to be faulted in every time the program is run.
This is unnecessary when the pages in question are already in the VM
Cache, so FreeBSD will attempt to pre-populate a process's page tables
with those pages that are already in the VM Cache. One thing that
FreeBSD does not yet do is pre-copy-on-write certain pages on exec. For
example, if you run the &man.ls.1; program while running vmstat
1 you will notice that it always takes a certain number of
page faults, even when you run it over and over again. These are
zero-fill faults, not program code faults (which were pre-faulted in
already). Pre-copying pages on exec or fork is an area that could use
more study.
A large percentage of page faults that occur are zero-fill faults.
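A small user-level sketch (illustrative; exact counts vary with the C library and page size) makes the effect visible: each page of a large uninitialized array costs one fault the first time it is touched, and getrusage(2) reports those as minor faults:
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <unistd.h>

static char bss_area[8 * 1024 * 1024];	/* uninitialized, so it lives in BSS */

int
main(void)
{
	struct rusage before, after;
	long pagesize = sysconf(_SC_PAGESIZE);

	getrusage(RUSAGE_SELF, &before);
	for (size_t i = 0; i < sizeof(bss_area); i += pagesize)
		bss_area[i] = 1;	/* first touch: one zero-fill fault per page */
	getrusage(RUSAGE_SELF, &after);

	printf("minor faults while touching BSS: %ld\n",
	    after.ru_minflt - before.ru_minflt);
	return 0;
}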
You can usually see this by observing the vmstat -s
output. These occur when a process accesses pages in its BSS area. The
BSS area is expected to be initially zero but the VM system does not
bother to allocate any memory at all until the process actually accesses
it. When a fault occurs the VM system must not only allocate a new page,
it must zero it as well. To optimize the zeroing operation the VM system
has the ability to pre-zero pages and mark them as such, and to request
pre-zeroed pages when zero-fill faults occur. The pre-zeroing occurs
whenever the CPU is idle but the number of pages the system pre-zeros is
limited in order to avoid blowing away the memory caches. This is an
excellent example of adding complexity to the VM system in order to
optimize the critical path.
Page Table Optimizations
The page table optimizations make up the most contentious part of
the FreeBSD VM design and they have shown some strain with the advent of
serious use of mmap(). I think this is actually a
feature of most BSDs though I am not sure when it was first introduced.
There are two major optimizations. The first is that hardware page
tables do not contain persistent state but instead can be thrown away at
any time with only a minor amount of management overhead. The second is
that every active page table entry in the system has a governing
pv_entry structure which is tied into the
vm_page structure. FreeBSD can simply iterate
through those mappings that are known to exist while Linux must check
all page tables that might contain a specific
mapping to see if it does, which can achieve O(n^2) overhead in certain
situations. It is because of this that FreeBSD tends to make better
choices on which pages to reuse or swap when memory is stressed, giving
it better performance under load. However, FreeBSD requires kernel
tuning to accommodate large-shared-address-space situations such as
those that can occur in a news system because it may run out of
pv_entry structures.Both Linux and FreeBSD need work in this area. FreeBSD is trying to
maximize the advantage of a potentially sparse active-mapping model (not
all processes need to map all pages of a shared library, for example),
whereas Linux is trying to simplify its algorithms. FreeBSD generally
has the performance advantage here at the cost of wasting a little extra
memory, but FreeBSD breaks down in the case where a large file is
massively shared across hundreds of processes. Linux, on the other hand,
breaks down in the case where many processes are sparsely-mapping the
same shared library and also runs non-optimally when trying to determine
whether a page can be reused or not.
Page Coloring
We will end with the page coloring optimizations. Page coloring is a
performance optimization designed to ensure that accesses to contiguous
pages in virtual memory make the best use of the processor cache. In
ancient times (i.e. 10+ years ago) processor caches tended to map
virtual memory rather than physical memory. This led to a huge number of
problems including having to clear the cache on every context switch in
some cases, and problems with data aliasing in the cache. Modern
processor caches map physical memory precisely to solve those problems.
This means that two side-by-side pages in a process's address space may
not correspond to two side-by-side pages in the cache. In fact, if you
are not careful side-by-side pages in virtual memory could wind up using
the same page in the processor cache—leading to cacheable data
being thrown away prematurely and reducing CPU performance. This is true
even with multi-way set-associative caches (though the effect is
mitigated somewhat).FreeBSD's memory allocation code implements page coloring
optimizations, which means that the memory allocation code will attempt
to locate free pages that are contiguous from the point of view of the
cache. For example, if page 16 of physical memory is assigned to page 0
of a process's virtual memory and the cache can hold 4 pages, the page
coloring code will not assign page 20 of physical memory to page 1 of a
process's virtual memory. It would, instead, assign page 21 of physical
memory. The page coloring code attempts to avoid assigning page 20
because this maps over the same cache memory as page 16 and would result
in non-optimal caching. This code adds a significant amount of
complexity to the VM memory allocation subsystem as you can well
imagine, but the result is well worth the effort. Page Coloring makes VM
memory as deterministic as physical memory in regards to cache
performance.
Conclusion
Virtual memory in modern operating systems must address a number of
different issues efficiently and for many different usage patterns. The
modular and algorithmic approach that BSD has historically taken allows
us to study and understand the current implementation as well as
relatively cleanly replace large sections of the code. There have been a
number of improvements to the FreeBSD VM system in the last several
years, and work is ongoing.
Bonus QA session by Allen Briggs
briggs@ninthwonder.com
What is the interleaving algorithm that you
refer to in your listing of the ills of the FreeBSD 3.X swap
arrangements?
FreeBSD uses a fixed swap interleave which defaults to 4. This
means that FreeBSD reserves space for four swap areas even if you
only have one, two, or three. Since swap is interleaved the linear
address space representing the four swap areas will be
fragmented if you do not actually have four swap areas. For
example, if you have two swap areas A and B FreeBSD's address
space representation for that swap area will be interleaved in
blocks of 16 pages:
A B C D A B C D A B C D A B C D
FreeBSD 3.X uses a sequential list of free
regions approach to accounting for the free swap areas.
The idea is that large blocks of free linear space can be
represented with a single list node
(kern/subr_rlist.c). But due to the
fragmentation the sequential list winds up being insanely
fragmented. In the above example, completely unused swap will
have A and B shown as free and C and D shown as
all allocated. Each A-B sequence requires a list
node to account for because C and D are holes, so the list node
cannot be combined with the next A-B sequence.
Why do we interleave our swap space instead of just tacking swap
areas onto the end and do something fancier? Because it is a whole
lot easier to allocate linear swaths of an address space and have
the result automatically be interleaved across multiple disks than
it is to try to put that sophistication elsewhere.The fragmentation causes other problems. Being a linear list
under 3.X, and having such a huge amount of inherent
fragmentation, allocating and freeing swap winds up being an O(N)
algorithm instead of an O(1) algorithm. Combined with other
factors (heavy swapping), you start getting into O(N^2) and
O(N^3) levels of overhead, which is bad. The 3.X system may also
need to allocate KVM during a swap operation to create a new list
node which can lead to a deadlock if the system is trying to
pageout pages in a low-memory situation.Under 4.X we do not use a sequential list. Instead we use a
radix tree and bitmaps of swap blocks rather than ranged list
nodes. We take the hit of preallocating all the bitmaps required
for the entire swap area up front but it winds up wasting less
memory due to the use of a bitmap (one bit per block) instead of a
linked list of nodes. The use of a radix tree instead of a
sequential list gives us nearly O(1) performance no matter how
fragmented the tree becomes.
I do not get the following:
It is important to note that the FreeBSD VM system attempts
to separate clean and dirty pages for the express reason of
avoiding unnecessary flushes of dirty pages (which eats I/O
bandwidth), nor does it move pages between the various page
queues gratuitously when the memory subsystem is not being
stressed. This is why you will see some systems with very low
cache queue counts and high active queue counts when doing a
systat -vm command.
How is the separation of clean and dirty (inactive) pages
related to the situation where you see low cache queue counts and
high active queue counts in systat -vm? Do the
systat stats roll the active and dirty pages together for the
active queue count?
Yes, that is confusing. The relationship is
goal versus reality. Our goal is to
separate the pages but the reality is that if we are not in a
memory crunch, we do not really have to.What this means is that FreeBSD will not try very hard to
separate out dirty pages (inactive queue) from clean pages (cache
queue) when the system is not being stressed, nor will it try to
deactivate pages (active queue -> inactive queue) when the system
is not being stressed, even if they are not being used.
In the &man.ls.1; / vmstat 1 example,
would not some of the page faults be data page faults (COW from
executable file to private page)? I.e., I would expect the page
faults to be some zero-fill and some program data. Or are you
implying that FreeBSD does do pre-COW for the program data?
A COW fault can be either zero-fill or program-data. The
mechanism is the same either way because the backing program-data
is almost certainly already in the cache. I am indeed lumping the
two together. FreeBSD does not pre-COW program data or zero-fill,
but it does pre-map pages that exist in its
cache.
In your section on page table optimizations, can you give a
little more detail about pv_entry and
vm_page (or should vm_page be
vm_pmap—as in 4.4, cf. pp. 180-181 of
McKusick, Bostic, Karels, Quarterman)? Specifically, what kind of
operation/reaction would require scanning the mappings?
How does Linux do in the case where FreeBSD breaks down
(sharing a large file mapping over many processes)?
A vm_page represents an (object,index#)
tuple. A pv_entry represents a hardware page
table entry (pte). If you have five processes sharing the same
physical page, and three of those processes' page tables actually
map the page, that page will be represented by a single
vm_page structure and three
pv_entry structures.
pv_entry structures only represent pages
mapped by the MMU (one pv_entry represents one
pte). This means that when we need to remove all hardware
references to a vm_page (in order to reuse the
page for something else, page it out, clear it, dirty it, and so
forth) we can simply scan the linked list of
pv_entry's associated with that
vm_page to remove or modify the pte's from
their page tables.Under Linux there is no such linked list. In order to remove
all the hardware page table mappings for a
vm_page, Linux must index into every VM object
that might have mapped the page. For
example, if you have 50 processes all mapping the same shared
library and want to get rid of page X in that library, you need to
index into the page table for each of those 50 processes even if
only 10 of them have actually mapped the page. So Linux is
trading off the simplicity of its design against performance.
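To make the FreeBSD side of that trade-off concrete, here is a deliberately simplified sketch; the field names and the pmap helper are invented for illustration, and the real kernel structures carry far more state (list heads, locking, reference counts):
struct pmap;			/* opaque: one per address space */
struct vm_object;		/* opaque: the backing object */

/* Stand-in for the machine-dependent code that clears a single pte. */
static void
pmap_clear_pte(struct pmap *pm, void *va)
{
	(void)pm;
	(void)va;
}

/* One pv_entry per pte that currently maps the page. */
struct pv_entry {
	struct pmap	*pv_pmap;	/* which address space maps it */
	void		*pv_va;		/* at which virtual address */
	struct pv_entry	*pv_next;
};

/* One vm_page per (object, index) pair. */
struct vm_page {
	struct vm_object *object;
	unsigned long	 pindex;
	struct pv_entry	*pv_list;	/* all known hardware mappings */
};

/*
 * Removing every hardware mapping of a page is a walk of the mappings
 * that actually exist, not a search of every page table in the system.
 * (The real code would also free each pv_entry here.)
 */
static void
page_remove_all(struct vm_page *m)
{
	for (struct pv_entry *pv = m->pv_list; pv != NULL; pv = pv->pv_next)
		pmap_clear_pte(pv->pv_pmap, pv->pv_va);
	m->pv_list = NULL;
}
The shape of that walk is the point: only the ptes that actually exist are ever visited.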
Many VM algorithms which are O(1) or (small N) under FreeBSD wind
up being O(N), O(N^2), or worse under Linux. Since the pte's
representing a particular page in an object tend to be at the same
offset in all the page tables they are mapped in, reducing the
number of accesses into the page tables at the same pte offset
will often avoid blowing away the L1 cache line for that offset,
which can lead to better performance.FreeBSD has added complexity (the pv_entry
scheme) in order to increase performance (to limit page table
accesses to only those pte's that need to be
modified).But FreeBSD has a scaling problem that Linux does not in that
there are a limited number of pv_entry
structures and this causes problems when you have massive sharing
of data. In this case you may run out of
pv_entry structures even though there is plenty
of free memory available. This can be fixed easily enough by
bumping up the number of pv_entry structures in
the kernel config, but we really need to find a better way to do
it.
In regards to the memory overhead of a page table versus the
pv_entry scheme: Linux uses
permanent page tables that are not throwaway, but
does not need a pv_entry for each potentially
mapped pte. FreeBSD uses throwaway page tables but
adds in a pv_entry structure for each
actually-mapped pte. I think memory utilization winds up being
about the same, giving FreeBSD an algorithmic advantage with its
ability to throw away page tables at will with very low
overhead.
Finally, in the page coloring section, it might help to have a
little more description of what you mean here. I did not quite
follow it.
Do you know how an L1 hardware memory cache works? I will
explain: Consider a machine with 16MB of main memory but only 128K
of L1 cache. Generally the way this cache works is that each 128K
block of main memory uses the same 128K of
cache. If you access offset 0 in main memory and then offset
128K in main memory you can wind up throwing away the
cached data you read from offset 0!Now, I am simplifying things greatly. What I just described
is what is called a direct mapped hardware memory
cache. Most modern caches are what are called
2-way-set-associative or 4-way-set-associative caches. The
set-associativity allows you to access up to N different memory
regions that overlap the same cache memory without destroying the
previously cached data. But only N.So if I have a 4-way set associative cache I can access offset
0, offset 128K, 256K and offset 384K and still be able to access
offset 0 again and have it come from the L1 cache. If I then
access offset 512K, however, one of the four previously cached
data objects will be thrown away by the cache.It is extremely important…
extremely important for most of a processor's
memory accesses to be able to come from the L1 cache, because the
L1 cache operates at the processor frequency. The moment you have
an L1 cache miss and have to go to the L2 cache or to main memory,
the processor will stall and potentially sit twiddling its fingers
for hundreds of instructions worth of time
waiting for a read from main memory to complete. Main memory (the
dynamic ram you stuff into a computer) is
slow, when compared to the speed of a modern
processor core.Ok, so now onto page coloring: All modern memory caches are
what are known as physical caches. They
cache physical memory addresses, not virtual memory addresses.
This allows the cache to be left alone across a process context
switch, which is very important.
- But in the Unix world you are dealing with virtual address
+ But in the &unix; world you are dealing with virtual address
spaces, not physical address spaces. Any program you write will
see the virtual address space given to it. The actual
physical pages underlying that virtual
address space are not necessarily physically contiguous! In fact,
you might have two pages that are side by side in a process's
address space which wind up being at offset 0 and offset 128K in
physical memory.A program normally assumes that two side-by-side pages will be
optimally cached. That is, that you can access data objects in
both pages without having them blow away each other's cache entry.
But this is only true if the physical pages underlying the virtual
address space are contiguous (insofar as the cache is
concerned).This is what Page coloring does. Instead of assigning
random physical pages to virtual addresses,
which may result in non-optimal cache performance, Page coloring
assigns reasonably-contiguous physical pages
to virtual addresses. Thus programs can be written under the
assumption that the characteristics of the underlying hardware
cache are the same for their virtual address space as they would
be if the program had been run directly in a physical address
space.Note that I say reasonably contiguous rather
than simply contiguous. From the point of view of a
128K direct mapped cache, the physical address 0 is the same as
the physical address 128K. So two side-by-side pages in your
virtual address space may wind up being offset 128K and offset
132K in physical memory, but could also easily be offset 128K and
offset 4K in physical memory and still retain the same cache
performance characteristics. So page-coloring does
not have to assign truly contiguous pages of
physical memory to contiguous pages of virtual memory, it just
needs to make sure it assigns contiguous pages from the point of
view of cache performance and operation.
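As a rough illustration of the arithmetic involved (a sketch, not the allocator itself), using the 128K direct-mapped cache from the example above and assuming 4K pages, a page's color is simply its physical page number modulo the number of page-sized slots in the cache:
#include <stdio.h>

#define PAGE_SIZE	4096UL
#define CACHE_SIZE	(128UL * 1024)		/* direct mapped, as above */
#define NCOLORS		(CACHE_SIZE / PAGE_SIZE)	/* 32 page colors */

static unsigned long
page_color(unsigned long paddr)
{
	return (paddr / PAGE_SIZE) % NCOLORS;
}

int
main(void)
{
	/* Physically adjacent pages get consecutive colors... */
	printf("paddr 0x00000 -> color %lu\n", page_color(0x00000UL));
	printf("paddr 0x01000 -> color %lu\n", page_color(0x01000UL));

	/* ...while pages 128K apart share a color and evict each other. */
	printf("paddr 0x20000 -> color %lu\n", page_color(0x20000UL));
	return 0;
}
Page coloring tries to hand out physical pages whose colors advance with the virtual page number, so that virtually contiguous data does not collide in the cache.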
diff --git a/en_US.ISO8859-1/articles/zip-drive/article.sgml b/en_US.ISO8859-1/articles/zip-drive/article.sgml
index 5330a94857..eba18e15d6 100644
--- a/en_US.ISO8859-1/articles/zip-drive/article.sgml
+++ b/en_US.ISO8859-1/articles/zip-drive/article.sgml
@@ -1,276 +1,287 @@
%man;
%freebsd;
+
+%trademarks;
]>
- ZIP Drives
+ &iomegazip; Drives
Jason Bacon
acadix@execpc.com
+
+
+ &tm-attrib.freebsd;
+ &tm-attrib.adaptec;
+ &tm-attrib.iomega;
+ &tm-attrib.microsoft;
+ &tm-attrib.opengroup;
+ &tm-attrib.general;
+
- ZIP Drive Basics
+ &iomegazip; Drive Basics
- ZIP disks are high capacity, removable, magnetic disks, which can be
+ &iomegazip; disks are high capacity, removable, magnetic disks, which can be
read or written by ZIP drives from IOMEGA corporation. ZIP disks are
similar to floppy disks, except that they are much faster, and have a
much greater capacity. While floppy disks typically hold 1.44
megabytes, ZIP disks are available in two sizes, namely 100 megabytes
and 250 megabytes. ZIP drives should not be confused with the
super-floppy, a 120 megabyte floppy drive which also handles traditional
1.44 megabyte floppies.IOMEGA also sells a higher capacity, higher performance drive called
- the JAZZ drive. JAZZ drives come in 1 gigabyte and 2 gigabyte
+ the &jaz;/JAZZ drive. Jaz drives come in 1 gigabyte and 2 gigabyte
sizes.ZIP drives are available as internal or external units, using one of
three interfaces:
The SCSI (Small Computer System Interface) interface is the
fastest, most sophisticated, most expandable, and most expensive
interface. The SCSI interface is used by all types of computers
from PC's to RISC workstations to minicomputers, to connect all
types of peripherals such as disk drives, tape drives, scanners, and
so on. SCSI ZIP drives may be internal or external, assuming your
host adapter has an external connector.If you are using an external SCSI device, it is important
never to connect or disconnect it from the SCSI bus while the
computer is running. Doing so may cause file-system damage on the
disks that remain connected.If you want maximum performance and easy setup, the SCSI
interface is the best choice. This will probably require adding a
SCSI host adapter, since most PC's (except for high-performance
servers) do not have built-in SCSI support. Each SCSI host adapter
can support either 7 or 15 SCSI devices, depending on the
model.Each SCSI device has its own controller, and these
controllers are fairly intelligent and well standardized (SCSI is,
after all, a standard interface), so from the operating system's
point of view, all SCSI disk drives look about the same, as do all
SCSI tape drives, etc. To support SCSI devices, the operating
system need only have a driver for the particular host adapter, and
a generic driver for each type of device, i.e. a SCSI disk driver,
SCSI tape driver, and so on. There are some SCSI devices that can
be better utilized with specialized drivers (e.g. DAT tape drives),
but they tend to work OK with the generic driver, too. It is just
that the generic drivers may not support some of the special
features.
Using a SCSI ZIP drive is simply a matter of determining which
device file in the /dev directory represents
the ZIP drive. This can be determined by looking at the boot
messages while FreeBSD is booting (or in
/var/log/messages after booting), where you
will see a line something like this:
da1: <IOMEGA ZIP 100 D.13> Removable Direct Access SCSI-2 Device
This means that the ZIP drive is represented by the file
/dev/da1.
The IDE (Integrated Drive Electronics) interface is a low-cost
disk drive interface used by many desktop PC's. Most IDE devices
are strictly internal.Performance of IDE ZIP drives is comparable to SCSI ZIP drives.
(The IDE interface is not as fast as SCSI, but ZIP drives
performance is limited mainly by the mechanics of the drive, not by
the bus interface.)The drawback of the IDE interface is the limitations it imposes.
Most IDE adapters can only support 2 devices, and IDE interfaces are
not typically designed for the long term. For example, the original
IDE interface would not support hard disks with more than 1024
cylinders, which forced a lot of people to upgrade their hardware
prematurely. If you have plans to expand your PC by adding another
disk, a tape drive, or scanner, you may want to invest in a SCSI
host adapter and a SCSI ZIP drive to avoid problems in the
future.
IDE devices in FreeBSD are prefixed with an a.
For example, an IDE hard disk might be
/dev/ad0, an IDE (ATAPI) CDROM might be
/dev/acd1, and so on.
The parallel port interface is popular for portable external
devices such as external ZIP drives and scanners, because virtually
every computer has a standard parallel port (usually used for
printers). This makes things easy for people to transfer data
between multiple computers by toting around their ZIP drive.Performance will generally be slower than a SCSI or IDE ZIP
drive, since it is limited by the speed of the parallel port.
Parallel port speed varies considerably between various computers,
and can often be configured in the system BIOS. Some machines will
also require BIOS configuration to operate the parallel port in
bidirectional mode. (Parallel ports were originally designed only
for output to printers.)
Parallel ZIP: The vpo Driver
To use a parallel-port ZIP drive under FreeBSD, the
vpo driver must be configured into the kernel.
Parallel port ZIP drives also have a built-in SCSI controller. The vpo
driver allows the FreeBSD kernel to communicate with the ZIP drive's
SCSI controller through the parallel port.Since the vpo driver is not a standard part of the kernel (as of
FreeBSD 3.2), you will need to rebuild the kernel to enable this device.
The process of building a kernel is outlined in detail in another
section. The following steps outline the process in brief for the
purpose of enabling the vpo driver:
Run /stand/sysinstall, and install the kernel
source code on your system.
Create a custom kernel configuration that includes the vpo driver:
&prompt.root; cd /sys/i386/conf
&prompt.root; cp GENERIC MYKERNEL
Edit MYKERNEL, change the
ident line to MYKERNEL, and
uncomment the line describing the vpo driver.If you have a second parallel port, you may need to copy the
section for ppc0 to create a
ppc1 device. The second parallel port usually
uses IRQ 5 and address 378. Only the IRQ is required in the config
file.If your root hard disk is a SCSI disk, you might run into a
problem with probing order, which will cause the system to attempt
to use the ZIP drive as the root device. This will cause a boot
failure, unless you happen to have a FreeBSD root file-system on
your ZIP disk! In this case, you will need to wire
down the root disk, i.e. force the kernel to bind a
specific device to /dev/da0, the root SCSI
disk. It will then assign the ZIP disk to the next available SCSI
disk, e.g. /dev/da1. To wire down your SCSI hard
drive as da0, change the line
device da0
to
disk da0 at scbus0 target 0 unit 0
You may need to change the target above to match the SCSI ID of
your disk drive. You should also wire down the scbus0 entry to your
- controller. For example, if you have an Adaptec 15xx controller,
+ controller. For example, if you have an &adaptec; 15xx controller,
you would change
controller scbus0
to
controller scbus0 at aha0
Finally, since you are creating a custom kernel configuration,
you can take the opportunity to remove all the unnecessary drivers.
This should be done with a great deal of caution, and only if you
feel confident about making modifications to your kernel
configuration. Removing unnecessary drivers will reduce the kernel
size, leaving more memory available for your applications. To
determine which drivers are not needed, go to the end of the file
/var/log/messages, and look for lines reading
"not found". Then, comment out these devices in your config file.
You can also change other options to reduce the size and increase
the speed of your kernel. Read the section on rebuilding your kernel
for more complete information.Now it is time to compile the kernel:&prompt.root; /usr/sbin/config MYKERNEL
&prompt.root; cd ../../compile/MYKERNEL
&prompt.root; make clean depend && make all install
After the kernel is rebuilt, you will need to reboot. Make sure the
ZIP drive is connected to the parallel port before the boot begins. You
should see the ZIP drive show up in the boot messages as device vpo0 or
vpo1, depending on which parallel port the drive is attached to. It
should also show which device file the ZIP drive has been bound to. This
will be /dev/da0 if you have no other SCSI disks in
the system, or /dev/da1 if you have a SCSI hard
disk wired down as the root device.
Mounting ZIP disks
To access the ZIP disk, you simply mount it like any other disk
device. The file-system is represented as slice 4 on the device, so for
SCSI or parallel ZIP disks, you would use:
&prompt.root; mount_msdos /dev/da1s4 /mnt
For IDE ZIP drives, use:
&prompt.root; mount_msdos /dev/ad1s4 /mnt
It will also be helpful to update /etc/fstab to
make mounting easier. Add a line like the following, edited to suit your
system:
/dev/da1s4 /zip msdos rw,noauto 0 0
and create the directory /zip.
Then, you can mount simply by typing
&prompt.root; mount /zip
and unmount by typing
&prompt.root; umount /zip
For more information on the format of
/etc/fstab, see &man.fstab.5;.You can also create a FreeBSD file-system on the ZIP disk using
&man.newfs.8;. However, the disk will only be usable on a FreeBSD
system, or perhaps a few other &unix; clones that recognize FreeBSD
- file-systems. (Definitely not DOS or Windows.)
+ file-systems. (Definitely not DOS or &windows;.)