diff --git a/en_US.ISO8859-1/articles/committers-guide/article.xml b/en_US.ISO8859-1/articles/committers-guide/article.xml index e05587a219..db04ddd5f0 100644 --- a/en_US.ISO8859-1/articles/committers-guide/article.xml +++ b/en_US.ISO8859-1/articles/committers-guide/article.xml @@ -1,5655 +1,5655 @@ ]>
Committer's Guide The &os; Documentation Project 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 The &os; Documentation Project &tm-attrib.freebsd; &tm-attrib.coverity; &tm-attrib.ibm; &tm-attrib.intel; &tm-attrib.sparc; &tm-attrib.general; $FreeBSD$ $FreeBSD$ This document provides information for the &os; committer community. All new committers should read this document before they start, and existing committers are strongly encouraged to review it from time to time. Almost all &os; developers have commit rights to one or more repositories. However, a few developers do not, and some of the information here applies to them as well. (For instance, some people only have rights to work with the Problem Report database). Please see for more information. This document may also be of interest to members of the &os; community who want to learn more about how the project works. Administrative Details Login Methods &man.ssh.1;, protocol 2 only Main Shell Host freefall.FreeBSD.org SMTP Host smtp.FreeBSD.org:587 (see also ). src/ Subversion Root svn+ssh://repo.FreeBSD.org/base (see also ). doc/ Subversion Root svn+ssh://repo.FreeBSD.org/doc (see also ). ports/ Subversion Root svn+ssh://repo.FreeBSD.org/ports (see also ). Internal Mailing Lists developers (technically called all-developers), doc-developers, doc-committers, ports-developers, ports-committers, src-developers, src-committers. (Each project repository has its own -developers and -committers mailing lists. Archives for these lists can be found in the files /local/mail/repository-name-developers-archive and /local/mail/repository-name-committers-archive on the FreeBSD.org cluster.) Core Team monthly reports /home/core/public/monthly-reports on the FreeBSD.org cluster. Ports Management Team monthly reports /home/portmgr/public/monthly-reports on the FreeBSD.org cluster. 
Noteworthy src/ SVN Branches stable/n (n-STABLE), head (-CURRENT) &man.ssh.1; is required to connect to the project hosts. For more information, see . Useful links: &os; Project Internal Pages &os; Project Hosts &os; Project Administrative Groups Open<acronym>PGP</acronym> Keys for &os; Cryptographic keys conforming to the OpenPGP (Pretty Good Privacy) standard are used by the &os; project to authenticate committers. Messages carrying important information like public SSH keys can be signed with the OpenPGP key to prove that they are really from the committer. See PGP & GPG: Email for the Practical Paranoid by Michael Lucas and for more information. Creating a Key Existing keys can be used, but should be checked with doc/head/share/pgpkeys/checkkey.sh first. In this case, make sure the key has a &os; user ID. For those who do not yet have an OpenPGP key, or need a new key to meet &os; security requirements, here we show how to generate one. Install security/gnupg. Enter these lines in ~/.gnupg/gpg.conf to set minimum acceptable defaults: fixed-list-mode keyid-format 0xlong personal-digest-preferences SHA512 SHA384 SHA256 SHA224 default-preference-list SHA512 SHA384 SHA256 SHA224 AES256 AES192 AES CAST5 BZIP2 ZLIB ZIP Uncompressed use-agent verify-options show-uid-validity list-options show-uid-validity sig-notation issuer-fpr@notations.openpgp.fifthhorseman.net=%g cert-digest-algo SHA512 Generate a key: &prompt.user; gpg --full-gen-key gpg (GnuPG) 2.1.8; Copyright (C) 2015 Free Software Foundation, Inc. This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Warning: using insecure memory! Please select what kind of key you want: (1) RSA and RSA (default) (2) DSA and Elgamal (3) DSA (sign only) (4) RSA (sign only) Your selection? 1 RSA keys may be between 1024 and 4096 bits long. What keysize do you want? (2048) 2048 Requested keysize is 2048 bits Please specify how long the key should be valid. 
0 = key does not expire <n> = key expires in n days <n>w = key expires in n weeks <n>m = key expires in n months <n>y = key expires in n years Key is valid for? (0) 3y Key expires at Wed Nov 4 17:20:20 2015 MST Is this correct? (y/N) y GnuPG needs to construct a user ID to identify your key. Real name: Chucky Daemon Email address: notreal@example.com Comment: You selected this USER-ID: "Chucky Daemon <notreal@example.com>" Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? o You need a Passphrase to protect your secret key. 2048-bit keys with a three-year expiration provide adequate protection at present (2013-12). describes the situation in more detail. A three year key lifespan is short enough to obsolete keys weakened by advancing computer power, but long enough to reduce key management problems. Use your real name here, preferably matching that shown on government-issued ID to make it easier for others to verify your identity. Text that may help others identify you can be entered in the Comment section. After the email address is entered, a passphrase is requested. Methods of creating a secure passphrase are contentious. Rather than suggest a single way, here are some links to sites that describe various methods: , , , . Protect the private key and passphrase. If either the private key or passphrase may have been compromised or disclosed, immediately notify accounts@FreeBSD.org and revoke the key. Committing the new key is shown in . Kerberos and LDAP web Password for &os; Cluster The &os; cluster requires a Kerberos password to access certain services. The Kerberos password also serves as the LDAP web password, since LDAP is proxying to Kerberos in the cluster. 
Some of the services which require this include: Bugzilla Jenkins To create a new Kerberos account in the &os; cluster, or to reset a Kerberos password for an existing account using a random password generator: &prompt.user; ssh kpasswd.freebsd.org This must be done from a machine outside of the &os;.org cluster. A Kerberos password can also be set manually by logging into freefall.FreeBSD.org and running: &prompt.user; kpasswd Unless the Kerberos-authenticated services of the &os;.org cluster have been used previously, Client unknown will be shown. This error means that the ssh kpasswd.freebsd.org method shown above must be used first to initialize the Kerberos account. Commit Bit Types The &os; repository has a number of components which, when combined, support the basic operating system source, documentation, third party application ports infrastructure, and various maintained utilities. When &os; commit bits are allocated, the areas of the tree where the bit may be used are specified. Generally, the areas associated with a bit reflect who authorized the allocation of the commit bit. Additional areas of authority may be added at a later date: when this occurs, the committer should follow normal commit bit allocation procedures for that area of the tree, seeking approval from the appropriate entity and possibly getting a mentor for that area for some period of time. Committer Type Responsible Tree Components src core@ src/, doc/ subject to appropriate review doc doceng@ doc/, ports/, src/ documentation ports portmgr@ ports/ Commit bits allocated prior to the development of the notion of areas of authority may be appropriate for use in many parts of the tree. However, common sense dictates that a committer who has not previously worked in an area of the tree seek review prior to committing, seek approval from the appropriate responsible party, and/or work with a mentor. 
Since the rules regarding code maintenance differ by area of the tree, this is as much for the benefit of the committer working in an area of less familiarity as it is for others working on the tree. Committers are encouraged to seek review for their work as part of the normal development process, regardless of the area of the tree where the work is occurring. Policy for Committer Activity in Other Trees All committers may modify base/head/share/misc/committers-*.dot, base/head/usr.bin/calendar/calendars/calendar.freebsd, and ports/head/astro/xearth/files. doc committers may commit documentation changes to src files, such as man pages, READMEs, fortune databases, calendar files, and comment fixes without approval from a src committer, subject to the normal care and tending of commits. Any committer may make changes to any other tree with an "Approved by" from a non-mentored committer with the appropriate bit. Committers can acquire an additional bit by the usual process of finding a mentor who will propose them to core, doceng, or portmgr, as appropriate. When approved, they will be added to 'access' and the normal mentoring period will ensue, which will involve a continuing of Approved by for some period. "Approved by" is only acceptable from non-mentored src committers -- mentored committers can provide a "Reviewed by" but not an "Approved by". Subversion Primer New committers are assumed to already be familiar with the basic operation of Subversion. If not, start by reading the Subversion Book. Introduction The &os; source repository switched from CVS to Subversion on May 31st, 2008. The first real SVN commit is r179447. The &os; doc/www repository switched from CVS to Subversion on May 19th, 2012. The first real SVN commit is r38821. The &os; ports repository switched from CVS to Subversion on July 14th, 2012. The first real SVN commit is r300894. 
Subversion can be installed from the &os; Ports Collection by issuing this command: &prompt.root; pkg install subversion Getting Started There are a few ways to obtain a working copy of the tree from Subversion. This section will explain them. Direct Checkout The first is to check out directly from the main repository. For the src tree, use: &prompt.user; svn checkout svn+ssh://repo.freebsd.org/base/head /usr/src For the doc tree, use: &prompt.user; svn checkout svn+ssh://repo.freebsd.org/doc/head /usr/doc For the ports tree, use: &prompt.user; svn checkout svn+ssh://repo.freebsd.org/ports/head /usr/ports Though the remaining examples in this document are written with the workflow of working with the src tree in mind, the underlying concepts are the same for working with the doc and the ports tree. Ports-related Subversion operations are listed in . The src command above checks out a CURRENT source tree as /usr/src/; any other target directory on the local filesystem can be used instead. Omitting the final argument of that command causes the working copy, in this case, to be named head, but that can be renamed safely. svn+ssh means the SVN protocol tunnelled over SSH. The name of the server is repo.freebsd.org, base is the path to the repository, and head is the subdirectory within the repository. If your &os; login name is different from the login name used on the local machine, either include it in the URL (for example svn+ssh://jarjar@repo.freebsd.org/base/head), or add an entry to ~/.ssh/config in the form: Host repo.freebsd.org User jarjar This is the simplest method, but it is hard to tell just yet how much load it will place on the repository. svn diff does not require access to the server, as SVN stores a reference copy of every file in the working copy. This, however, means that Subversion working copies are very large. <literal>RELENG_*</literal> Branches and General Layout In svn+ssh://repo.freebsd.org/base, base refers to the source tree.
Similarly, ports refers to the ports tree, and so on. These are separate repositories with their own change number sequences, access controls and commit mail. For the base repository, HEAD refers to the -CURRENT tree. For example, head/bin/ls is what would go into /usr/src/bin/ls in a release. Some key locations are: /head/ which corresponds to HEAD, also known as -CURRENT. /stable/n which corresponds to RELENG_n. /releng/n.n which corresponds to RELENG_n_n. /release/n.n.n which corresponds to RELENG_n_n_n_RELEASE. /vendor* is the vendor branch import work area. This directory itself does not contain branches, however its subdirectories do. This contrasts with the stable, releng and release directories. /projects and /user feature a branch work area. As above, the /user directory does not contain branches itself. &os; Documentation Project Branches and Layout In svn+ssh://repo.freebsd.org/doc, doc refers to the repository root of the source tree. In general, most &os; Documentation Project work will be done within the head/ branch of the documentation source tree. &os; documentation is written and/or translated to various languages, each in a separate directory in the head/ branch. Each translation set contains several subdirectories for the various parts of the &os; Documentation Project. A few noteworthy directories are: /articles/ contains the source code for articles written by various &os; contributors. /books/ contains the source code for the different books, such as the &os; Handbook. /htdocs/ contains the source code for the &os; website. &os; Ports Tree Branches and Layout In svn+ssh://repo.freebsd.org/ports, ports refers to the repository root of the ports tree. In general, most &os; port work will be done within the head/ branch of the ports tree which is the actual ports tree used to install software. 
Some other key locations are: /branches/RELENG_n_n_n which corresponds to RELENG_n_n_n is used to merge back security updates in preparation for a release. /tags/RELEASE_n_n_n which corresponds to RELEASE_n_n_n represents a release tag of the ports tree. /tags/RELEASE_n_EOL represents the end of life tag of a specific &os; branch. Daily Use This section will explain how to perform common day-to-day operations with Subversion. Help SVN has built in help documentation. It can be accessed by typing: &prompt.user; svn help Additional information can be found in the Subversion Book. Checkout As seen earlier, to check out the &os; head branch: &prompt.user; svn checkout svn+ssh://repo.freebsd.org/base/head /usr/src At some point, more than just HEAD will probably be useful, for instance when merging changes to stable/7. Therefore, it may be useful to have a partial checkout of the complete tree (a full checkout would be very painful). To do this, first check out the root of the repository: &prompt.user; svn checkout --depth=immediates svn+ssh://repo.freebsd.org/base This will give base with all the files it contains (at the time of writing, just ROADMAP.txt) and empty subdirectories for head, stable, vendor and so on. Expanding the working copy is possible. Just change the depth of the various subdirectories: &prompt.user; svn up --set-depth=infinity base/head &prompt.user; svn up --set-depth=immediates base/release base/releng base/stable The above command will pull down a full copy of head, plus empty copies of every release tag, every releng branch, and every stable branch. If at a later date merging to 7-STABLE is required, expand the working copy: &prompt.user; svn up --set-depth=infinity base/stable/7 Subtrees do not have to be expanded completely. 
For instance, to expand only stable/7/sys and later expand the rest of stable/7: &prompt.user; svn up --set-depth=infinity base/stable/7/sys &prompt.user; svn up --set-depth=infinity base/stable/7 Updating the tree with svn update will only update what was previously asked for (in this case, head and stable/7); it will not pull down the whole tree. Anonymous Checkout It is possible to anonymously check out the &os; repository with Subversion. This will give access to a read-only tree that can be updated, but not committed back to the main repository. To do this, use: &prompt.user; svn co https://svn.FreeBSD.org/base/head /usr/src More details on using Subversion this way can be found in Using Subversion. Updating the Tree To update a working copy to either the latest revision, or a specific revision: &prompt.user; svn update &prompt.user; svn update -r12345 Status To view the local changes that have been made to the working copy: &prompt.user; svn status To show local changes and files that are out-of-date: &prompt.user; svn status --show-updates Editing and Committing SVN does not need to be told in advance about file editing. To commit all changes in the current directory and all subdirectories: &prompt.user; svn commit To commit all changes in, for example, lib/libfetch/ and usr.bin/fetch/ in a single operation: &prompt.user; svn commit lib/libfetch usr.bin/fetch There is also a commit wrapper for the ports tree that handles the properties and sanity-checks the changes: &prompt.user; /usr/ports/Tools/scripts/psvn commit Adding and Removing Files Before adding files, get a copy of auto-props.txt (there is also a ports tree specific version) and add it to ~/.subversion/config according to the instructions in the file. If you added something before reading this, use svn rm --keep-local on the just-added files, fix your config file, and re-add them. The initial config file is created when you first run an svn command, even something as simple as svn help.
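As a quick sanity check before the first svn add, the config file can be tested for an active auto-props switch. This is only a sketch: the function name auto_props_enabled is made up for illustration, and it relies on the standard Subversion behavior that [auto-props] entries are applied only when enable-auto-props = yes is set, uncommented, in the [miscellany] section. It does not check which section the line is in, nor that the entries from auto-props.txt are actually present.

```shell
#!/bin/sh
# Succeed if the given Subversion config file has an uncommented
# "enable-auto-props = yes" line. Defaults to ~/.subversion/config.
# (Illustrative helper; not part of Subversion or the FreeBSD tools.)
auto_props_enabled() {
    config=${1:-"$HOME/.subversion/config"}
    # Lines starting with '#' are comments and deliberately do not match.
    grep -Eq '^[[:space:]]*enable-auto-props[[:space:]]*=[[:space:]]*yes[[:space:]]*$' "$config"
}
```

It might be used as auto_props_enabled || echo "fix ~/.subversion/config first" before adding new files.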
Files are added to a SVN repository with svn add. To add a file named foo, edit it, then: &prompt.user; svn add foo Most new source files should include a $&os;$ string near the start of the file. On commit, svn will expand the $&os;$ string, adding the file path, revision number, date and time of commit, and the username of the committer. Files which cannot be modified may be committed without the $&os;$ string. Files can be removed with svn remove: &prompt.user; svn remove foo Subversion does not require deleting the file before using svn rm, and indeed complains if that happens. It is possible to add directories with svn add: &prompt.user; mkdir bar &prompt.user; svn add bar However, svn mkdir makes this easier by combining the creation of the directory and the adding of it: &prompt.user; svn mkdir bar Like files, directories are removed with svn rm. There is no separate command specifically for removing directories. &prompt.user; svn rm bar Copying and Moving Files This command creates a copy of foo.c named bar.c, with the new file also under version control and with the full history of foo.c: &prompt.user; svn copy foo.c bar.c This is usually preferred to copying the file with cp and adding it to the repository with svn add, because a file added that way does not inherit the original one's history. To move or rename a file: &prompt.user; svn move foo.c bar.c Log and Annotate svn log shows revisions and commit messages, most recent first, for files or directories. When used on a directory, all revisions that affected the directory and files within that directory are shown. svn annotate, or equally svn praise or svn blame, shows the most recent revision number and who committed that revision for each line of a file. Diffs svn diff displays changes to the working copy. Diffs generated by SVN are unified and include new files by default in the diff output.
svn diff can show the changes between two revisions of the same file: &prompt.user; svn diff -r179453:179454 ROADMAP.txt It can also show all changes for a specific changeset. This command shows what changes were made to the current directory and all subdirectories in changeset 179454: &prompt.user; svn diff -c179454 . Reverting Local changes (including additions and deletions) can be reverted using svn revert. It does not update out-of-date files, but just replaces them with pristine copies of the original version. Conflicts If an svn update resulted in a merge conflict, Subversion will remember which files have conflicts and refuse to commit any changes to those files until explicitly told that the conflicts have been resolved. The simple, not yet deprecated procedure is: &prompt.user; svn resolved foo However, the preferred procedure is: &prompt.user; svn resolve --accept=working foo The two examples are equivalent. Possible values for --accept are: working: use the version in your working directory (which one presumes has been edited to resolve the conflicts). base: use a pristine copy of the version you had before svn update, discarding your own changes, the conflicting changes, and possibly other intervening changes as well. mine-full: use what you had before svn update, including your own changes, but discarding the conflicting changes, and possibly other intervening changes as well. theirs-full: use the version that was retrieved when you did svn update, discarding your own changes. Advanced Use Sparse Checkouts SVN allows sparse, or partial, checkouts of a directory by adding --depth to a svn checkout. Valid arguments to --depth are: empty: the directory itself without any of its contents. files: the directory and any files it contains. immediates: the directory and any files and directories it contains, but none of the subdirectories' contents. infinity: the directory and everything beneath it. The --depth option applies to many other commands, including svn commit, svn revert, and svn diff.
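A sparse checkout might be wrapped as in this sketch, which rejects any depth other than the four values listed above before calling svn checkout. The function name sparse_checkout and the SVN override variable are illustrative assumptions, not part of Subversion or of any &os; tooling, and the ~/freebsd target in the usage note is only an example path.

```shell
#!/bin/sh
# Check out a URL at a given depth, refusing depth values that
# svn checkout does not accept.
# SVN may be overridden to preview the command, for example
# SVN="echo svn"; it is intentionally left unquoted so that it is
# split into a command and its leading arguments.
# (Illustrative helper; not part of Subversion or the FreeBSD tools.)
sparse_checkout() {
    depth=$1 url=$2 dir=$3
    case $depth in
        empty|files|immediates|infinity) ;;
        *) echo "unknown depth: $depth" >&2; return 1 ;;
    esac
    ${SVN:-svn} checkout --depth="$depth" "$url" "$dir"
}
```

For instance, sparse_checkout empty svn+ssh://repo.freebsd.org/base ~/freebsd would create a working copy of the repository root with none of its contents, ready to be deepened selectively.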
Since --depth is sticky, there is a --set-depth option for svn update that will change the selected depth. Thus, given the working copy produced by the previous example: &prompt.user; cd ~/freebsd &prompt.user; svn update --set-depth=immediates . The above command will populate the working copy in ~/freebsd with ROADMAP.txt and empty subdirectories, and nothing will happen when svn update is executed on the subdirectories. However, this command will set the depth for head (in this case) to infinity, and fully populate it: &prompt.user; svn update --set-depth=infinity head Direct Operation Certain operations can be performed directly on the repository without touching the working copy. Specifically, this applies to any operation that does not require editing a file, including: log, diff mkdir remove, copy, rename propset, propedit, propdel merge Branching is very fast. This command would be used to branch RELENG_8: &prompt.user; svn copy svn+ssh://repo.freebsd.org/base/head svn+ssh://repo.freebsd.org/base/stable/8 This is equivalent to these commands which take minutes and hours as opposed to seconds, depending on your network connection: &prompt.user; svn checkout --depth=immediates svn+ssh://repo.freebsd.org/base &prompt.user; cd base &prompt.user; svn update --set-depth=infinity head &prompt.user; svn copy head stable/8 &prompt.user; svn commit stable/8 Merging with <acronym>SVN</acronym> This section deals with merging code from one branch to another (typically, from head to a stable branch). In all examples below, $FSVN refers to the location of the &os; Subversion repository, svn+ssh://repo.freebsd.org/base/. About Merge Tracking From the user's perspective, merge tracking information (or mergeinfo) is stored in a property called svn:mergeinfo, which is a comma-separated list of revisions and ranges of revisions that have been merged. When set on a file, it applies only to that file. 
When set on a directory, it applies to that directory and its descendants (files and directories) except for those that have their own svn:mergeinfo. It is not inherited. For instance, stable/6/contrib/openpam/ does not implicitly inherit mergeinfo from stable/6/, or stable/6/contrib/. Doing so would make partial checkouts very hard to manage. Instead, mergeinfo is explicitly propagated down the tree. For merging something into branch/foo/bar/, these rules apply: If branch/foo/bar/ does not already have a mergeinfo record, but a direct ancestor (for instance, branch/foo/) does, then that record will be propagated down to branch/foo/bar/ before information about the current merge is recorded. Information about the current merge will not be propagated back up that ancestor. If a direct descendant of branch/foo/bar/ (for instance, branch/foo/bar/baz/) already has a mergeinfo record, information about the current merge will be propagated down to it. If you consider the case where a revision changes several separate parts of the tree (for example, branch/foo/bar/ and branch/foo/quux/), but you only want to merge some of it (for example, branch/foo/bar/), you will see that these rules make sense. If mergeinfo was propagated up, it would seem like that revision had also been merged to branch/foo/quux/, when in fact it had not been. Selecting the Source and Target Branch When Merging Merging to stable/ branches should originate from head/. For example: &prompt.user; svn merge -c r123456 ^/head/ stable/11 &prompt.user; svn commit stable/11 Merges to releng/ branches should always originate from the corresponding stable/ branch. 
For example: &prompt.user; svn merge -c r123456 ^/stable/11 releng/11.0 &prompt.user; svn commit releng/11.0 Committers are only permitted to commit to the releng/ branches during a release cycle after receiving approval from the Release Engineering Team, after which only the Security Officer may commit to a releng/ branch for a Security Advisory or Errata Notice. All merges are merged to and committed from the root of the branch. All merges look like: &prompt.user; svn merge -c r123456 ^/head/ checkout &prompt.user; svn commit checkout Note that checkout must be a complete checkout of the branch to which the merge occurs. &prompt.user; svn merge -c r123456 ^/stable/10 releng/10.0 Preparing the Merge Target - Because of the mergeinfo propagation issues described + Due to the mergeinfo propagation issues described earlier, it is very important to never merge changes into a sparse working copy. Always use a full checkout of the branch being merged into. For instance, when merging from HEAD to 7, use a full checkout of stable/7: &prompt.user; cd stable/7 &prompt.user; svn up --set-depth=infinity The target directory must also be up-to-date and must not contain any uncommitted changes or stray files. Identifying Revisions Identifying revisions to be merged is a must. If the target already has complete mergeinfo, ask SVN for a list: &prompt.user; cd stable/6/contrib/openpam &prompt.user; svn mergeinfo --show-revs=eligible $FSVN/head/contrib/openpam If the target does not have complete mergeinfo, check the log for the merge source. Merging Now, let us start merging! 
The Principles For example, to merge: revision $R, in directory $target in stable branch $B, from directory $source in head, where $FSVN is svn+ssh://repo.freebsd.org/base. Assuming that revisions $P and $Q have already been merged, and that the current directory is an up-to-date working copy of stable/$B, the existing mergeinfo looks like this: &prompt.user; svn propget svn:mergeinfo -R $target $target - /head/$source:$P,$Q Merging is done like so: &prompt.user; svn merge -c$R $FSVN/head/$source $target Checking the results of this is possible with svn diff. The svn:mergeinfo now looks like: &prompt.user; svn propget svn:mergeinfo -R $target $target - /head/$source:$P,$Q,$R If the results are not exactly as shown, assistance may be required before committing as mistakes may have been made, or there may be something wrong with the existing mergeinfo, or there may be a bug in Subversion. Practical Example As a practical example, consider this scenario. The changes to netmap.4 in r238987 are to be merged from CURRENT to 9-STABLE. The file resides in head/share/man/man4. According to , this is also where to do the merge. Note that in this example all paths are relative to the top of the svn repository. For more information on the directory layout, see . The first step is to inspect the existing mergeinfo. &prompt.user; svn propget svn:mergeinfo -R stable/9/share/man/man4 Take a quick note of how it looks before moving on to the next step, doing the actual merge: &prompt.user; svn merge -c r238987 svn+ssh://repo.freebsd.org/base/head/share/man/man4 stable/9/share/man/man4 --- Merging r238987 into 'stable/9/share/man/man4': U stable/9/share/man/man4/netmap.4 --- Recording mergeinfo for merge of r238987 into 'stable/9/share/man/man4': U stable/9/share/man/man4 Check that the revision number of the merged revision has been added. Once this is verified, the only thing left is the actual commit.
&prompt.user; svn commit stable/9/share/man/man4 Precautions Before Committing As always, build world (or appropriate parts of it). Check the changes with svn diff and svn stat. Make sure all the files that should have been added or deleted were in fact added or deleted. Take a closer look at any property change (marked by an M in the second column of svn stat). Normally, no svn:mergeinfo properties should be anywhere except the target directory (or directories). If something looks fishy, ask for help. Committing Make sure to commit a top-level directory to have the mergeinfo included as well. Do not specify individual files on the command line. For more information about committing files in general, see the relevant section of this primer. Vendor Imports with <acronym>SVN</acronym> Please read this entire section before starting a vendor import. Patches to vendor code fall into two categories: Vendor patches: these are patches that have been issued by the vendor, or that have been extracted from the vendor's version control system, and that address issues which cannot wait until the next vendor release. &os; patches: these are patches that modify the vendor code to address &os;-specific issues. The nature of a patch dictates where it should be committed: Vendor patches must be committed to the vendor branch, and merged from there to head. If the patch addresses an issue in a new release that is currently being imported, it must not be committed along with the new release: the release must be imported and tagged first, then the patch can be applied and committed. There is no need to re-tag the vendor sources after committing the patch. &os; patches are committed directly to head. Preparing the Tree If importing for the first time after the switch to Subversion, flattening and cleaning up the vendor tree is necessary, as well as bootstrapping the merge history in the main tree.
Flattening During the conversion from CVS to Subversion, vendor branches were imported with the same layout as the main tree. This means that the pf vendor sources ended up in vendor/pf/dist/contrib/pf. The vendor source is best directly in vendor/pf/dist. To flatten the pf tree: &prompt.user; cd vendor/pf/dist/contrib/pf &prompt.user; svn mv $(svn list) ../.. &prompt.user; cd ../.. &prompt.user; svn rm contrib &prompt.user; svn propdel -R svn:mergeinfo . &prompt.user; svn commit The propdel bit is necessary because starting with 1.5, Subversion will automatically add svn:mergeinfo to any directory that is copied or moved. In this case, as nothing is being merged from the deleted tree, they just get in the way. Tags may be flattened as well (3, 4, 3.5 etc.); the procedure is exactly the same, only changing dist to 3.5 or similar, and putting the svn commit off until the end of the process. Cleaning Up The dist tree can be cleaned up as necessary. Disabling keyword expansion is recommended, as it makes no sense on unmodified vendor code and in some cases it can even be harmful. OpenSSH, for example, includes two files that originated with &os; and still contain the original version tags. To do this: &prompt.user; svn propdel svn:keywords -R . &prompt.user; svn commit Bootstrapping Merge History If importing for the first time after the switch to Subversion, bootstrap svn:mergeinfo on the target directory in the main tree to the revision that corresponds to the last related change to the vendor tree, prior to importing new sources: &prompt.user; cd head/contrib/pf &prompt.user; svn merge --record-only svn+ssh://repo.freebsd.org/base/vendor/pf/dist@180876 . &prompt.user; svn commit Importing New Sources With two commits—one for the import itself and one for the tag—this step can optionally be repeated for every upstream release between the last import and the current import. 
Preparing the Vendor Sources

Subversion is able to store a full distribution in the vendor tree. So, import everything, but merge only what is required.

An svn add is required to add any files that were added since the last vendor import, and an svn rm is required to remove any that were removed since. Preparing sorted lists of the contents of the vendor tree and of the sources that are about to be imported is recommended, to facilitate the process.

&prompt.user; cd vendor/pf/dist
&prompt.user; svn list -R | grep -v '/$' | sort >../old
&prompt.user; cd ../pf-4.3
&prompt.user; find . -type f | cut -c 3- | sort >../new

With these two files, comm -23 ../old ../new will list removed files (files only in old), while comm -13 ../old ../new will list added files (files only in new).

Importing into the Vendor Tree

Now, the sources must be copied into dist and the svn add and svn rm commands are used as needed:

&prompt.user; cd vendor/pf/pf-4.3
&prompt.user; tar cf - . | tar xf - -C ../dist
&prompt.user; cd ../dist
&prompt.user; comm -23 ../old ../new | xargs svn rm
&prompt.user; comm -13 ../old ../new | xargs svn add --parents

If any directories were removed, they will have to be removed manually with svn rm. Nothing will break if they are not, but they will remain in the tree.

Check properties on any new files. All text files should have svn:eol-style set to native. All binary files should have svn:mime-type set to application/octet-stream unless there is a more appropriate media type. Executable files should have svn:executable set to *. No other properties should exist on any file in the tree.

Committing is now possible. However, it is good practice to make sure that everything is okay by using the svn stat and svn diff commands.

Tagging

Once committed, vendor releases are tagged for future reference.
The best and quickest way to do this is directly in the repository:

&prompt.user; svn cp svn+ssh://repo.freebsd.org/base/vendor/pf/dist svn+ssh://repo.freebsd.org/base/vendor/pf/4.3

Once that is complete, svn up the working copy of vendor/pf to get the new tag, although this is rarely needed.

If creating the tag in the working copy of the tree, svn:mergeinfo results must be removed:

&prompt.user; cd vendor/pf
&prompt.user; svn cp dist 4.3
&prompt.user; svn propdel svn:mergeinfo -R 4.3

Merging to Head

&prompt.user; cd head/contrib/pf
&prompt.user; svn up
&prompt.user; svn merge --accept=postpone svn+ssh://repo.freebsd.org/base/vendor/pf/dist .

The --accept=postpone option tells Subversion not to complain about merge conflicts, as they will be handled manually.

The cvs2svn changeover occurred on June 3, 2008. When performing vendor merges for packages which were already present and converted by the cvs2svn process, the command used to merge /vendor/package_name/dist to /head/package_location (for example, head/contrib/sendmail) must use -c REV to indicate the revision to merge from the /vendor tree. For example:

&prompt.user; svn checkout svn+ssh://repo.freebsd.org/base/head/contrib/sendmail
&prompt.user; cd sendmail
&prompt.user; svn merge -c r261190 '^/vendor/sendmail/dist' .

^ is an alias for the repository path.

If using the Zsh shell, the ^ must be escaped with \ or quoted.

It is necessary to resolve any merge conflicts.

Make sure that any files that were added or removed in the vendor tree have been properly added or removed in the main tree.

To check diffs against the vendor branch:

&prompt.user; svn diff --no-diff-deleted --old=svn+ssh://repo.freebsd.org/base/vendor/pf/dist --new=.

The --no-diff-deleted option tells Subversion not to complain about files that are in the vendor tree but not in the main tree. These are things that would have been removed before the vendor import, like the vendor's makefiles and configure scripts.
With CVS, once a file was off the vendor branch, it could not be put back. With Subversion, there is no concept of on or off the vendor branch. If a file previously had local modifications, all that has to be done to make it stop showing up in diffs against the vendor tree is to remove any left-over cruft like &os; version tags, which is much easier.

If any changes are required for the world to build with the new sources, make them now, and keep testing until everything builds and runs perfectly.

Committing the Vendor Import

Committing is now possible! Everything must be committed in one go. If done properly, the tree will move from a consistent state with old code, to a consistent state with new code.

From Scratch

Importing into the Vendor Tree

This section is an example of importing and tagging byacc into head.

First, prepare the directory in vendor:

&prompt.user; svn co --depth immediates $FSVN/vendor
&prompt.user; cd vendor
&prompt.user; svn mkdir byacc
&prompt.user; svn mkdir byacc/dist

Now, import the sources into the dist directory. Once the files are in place, svn add the new ones, then svn commit and tag the imported version. To save time and bandwidth, direct remote committing and tagging is possible:

&prompt.user; svn cp -m "Tag byacc 20120115" $FSVN/vendor/byacc/dist $FSVN/vendor/byacc/20120115

Merging to head

Because this is a new file, copy it for the merge:

&prompt.user; svn cp -m "Import byacc to contrib" $FSVN/vendor/byacc/dist $FSVN/head/contrib/byacc

Working normally on newly imported sources is still possible.
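The comparison of sorted file lists used when preparing vendor sources can be rehearsed on throwaway data before it is run against a real working copy. The following is a minimal sketch, assuming a POSIX-like shell with mktemp, sort, and comm available; the file names (pf.c, pf_old.c, pf_new.c) and the temporary directory are made up for illustration and do not come from any real vendor tree.

```shell
#!/bin/sh
# Rehearse the old/new list comparison on sample data (hypothetical file names).
set -e
LC_ALL=C; export LC_ALL        # sort and comm must agree on collation order
work=$(mktemp -d)

# "old" stands in for the sorted `svn list -R` output of the vendor dist tree.
printf '%s\n' Makefile pf.c pf_old.c | sort > "$work/old"
# "new" stands in for the sorted `find` output of the unpacked release.
printf '%s\n' Makefile pf.c pf_new.c | sort > "$work/new"

# Lines only in old: candidates for `svn rm`.
removed=$(comm -23 "$work/old" "$work/new")
# Lines only in new: candidates for `svn add`.
added=$(comm -13 "$work/old" "$work/new")

echo "removed: $removed"
echo "added: $added"
rm -rf "$work"
```

Reviewing both lists before piping them to xargs makes it easier to catch a mis-sorted or truncated list while nothing has been scheduled for addition or deletion yet.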
Reverting a Commit

Reverting a commit to a previous version is fairly easy:

&prompt.user; svn merge -r179454:179453 ROADMAP.txt
&prompt.user; svn commit

Change number syntax, with negative meaning a reverse change, can also be used:

&prompt.user; svn merge -c -179454 ROADMAP.txt
&prompt.user; svn commit

This can also be done directly in the repository:

&prompt.user; svn merge -r179454:179453 svn+ssh://repo.freebsd.org/base/ROADMAP.txt

It is important to ensure that the mergeinfo is correct when reverting a file to permit svn mergeinfo --eligible to work as expected.

Reverting the deletion of a file is slightly different. Copying the version of the file that predates the deletion is required. For example, to restore a file that was deleted in revision N, restore version N-1:

&prompt.user; svn copy svn+ssh://repo.freebsd.org/base/ROADMAP.txt@179454
&prompt.user; svn commit

or, equally:

&prompt.user; svn copy svn+ssh://repo.freebsd.org/base/ROADMAP.txt@179454 svn+ssh://repo.freebsd.org/base

Do not simply recreate the file manually and svn add it, as this will cause history to be lost.

Fixing Mistakes

While we can do surgery in an emergency, do not plan on having mistakes fixed behind the scenes. Plan on mistakes remaining in the logs forever. Be sure to check the output of svn status and svn diff before committing.

Mistakes will happen, but they can generally be fixed without disruption.

Take a case of adding a file in the wrong location. The right thing to do is to svn move the file to the correct location and commit. This causes just a couple of lines of metadata in the repository journal, and the logs are all linked up correctly.

The wrong thing to do is to delete the file and then svn add an independent copy in the correct location. Instead of a couple of lines of text, the repository journal grows an entire new copy of the file. This is a waste.
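The two spellings of a revert shown under Reverting a Commit describe the same reverse merge: -r N:N-1 and -c -N. As a rough illustration, the sketch below prints both forms for a given revision so they can be reviewed before anything is run. The revert_cmds function is a hypothetical helper written for this example, not part of Subversion, and it only echoes commands; it does not contact a repository.

```shell
#!/bin/sh
# Print, without running them, the two equivalent reverse-merge commands.
# revert_cmds is a hypothetical helper for illustration only.
revert_cmds() {
    rev=$1       # revision to undo, digits only (no leading "r")
    target=$2    # path inside the working copy
    prev=$((rev - 1))
    echo "svn merge -r${rev}:${prev} ${target}"
    echo "svn merge -c -${rev} ${target}"
}

revert_cmds 179454 ROADMAP.txt
```

Either printed command, followed by svn commit, reverts the change in the working copy.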
Using a Subversion Mirror

There is a serious disadvantage to this method: every time something is to be committed, a svn relocate to the main repository has to be done, remembering to svn relocate back to the mirror after the commit. Also, since svn relocate only works between repositories that have the same UUID, some hacking of the local repository's UUID has to occur before it is possible to start using it.

Checkout from a Mirror

Check out a working copy from a mirror by substituting the mirror's URL for svn+ssh://repo.freebsd.org/base. This can be an official mirror or a mirror maintained by using svnsync.

Setting up a svnsync Mirror

Avoid setting up a svnsync mirror unless there is a very good reason for it. Most of the time a git mirror is a better alternative. Starting a fresh mirror from scratch takes a long time. Expect a minimum of 10 hours for high speed connectivity. If international links are involved, expect this to take four to ten times longer.

One way to limit the time required is to grab a seed file. It is large (~1GB) but will consume less network traffic and take less time to fetch than svnsync will.

Extract the file and update it:

&prompt.user; tar xf svnmirror-base-r261170.tar.xz
&prompt.user; svnsync sync file:///home/svnmirror/base

Now, set that up to run from &man.cron.8;, do checkouts locally, set up a svnserve server for local machines to talk to, etc.

The seed mirror is set to fetch from svn://svn.freebsd.org/base. The configuration for the mirror is stored in revprop 0 on the local mirror. To see the configuration, try:

&prompt.user; svn proplist -v --revprop -r 0 file:///home/svnmirror/base

Use svn propset to change things.

Committing High-ASCII Data

Files that have high-ASCII bits are considered binary files in SVN, so the pre-commit checks fail and indicate that the mime-type property should be set to application/octet-stream.
However, the use of this is discouraged, so please do not set it. The best approach is to avoid high-ASCII data entirely, so that the file can be read everywhere with any text editor. If it cannot be avoided, set the fbsd:notbinary property with propset instead of changing the mime-type:

&prompt.user; svn propset fbsd:notbinary yes foo.data

Maintaining a Project Branch

A project branch is one that is synced to head (or another branch) and is used to develop a project, which is then committed back to head. In SVN, dolphin branching is used for this. A dolphin branch is one that diverges for a while and is finally committed back to the original branch. During development, code migrates in one direction only (from head to the branch). No code is committed back to head until the end. After the branch is committed back at the end, it is dead (although a new branch with the same name can be created after the dead one is deleted).

As per https://people.FreeBSD.org/~peter/svn_notes.txt, work that is intended to be merged back into HEAD should be in base/projects/. If the work is beneficial to the &os; community in some way but not intended to be merged directly back into HEAD then the proper location is base/user/username/. This page contains further details.

To create a project branch:

&prompt.user; svn copy svn+ssh://repo.freebsd.org/base/head svn+ssh://repo.freebsd.org/base/projects/spif

To merge changes from HEAD back into the project branch:

&prompt.user; cd copy_of_spif
&prompt.user; svn merge svn+ssh://repo.freebsd.org/base/head
&prompt.user; svn commit

It is important to resolve any merge conflicts before committing.

Some Tips

In commit logs etc., rev 179872 is spelled r179872 as per convention.
Speeding up svn is possible by adding these entries to ~/.ssh/config:

Host *
ControlPath ~/.ssh/sockets/master-%l-%r@%h:%p
ControlMaster auto
ControlPersist yes

and then typing

mkdir ~/.ssh/sockets

Checking out a working copy with a stock Subversion client without &os;-specific patches (OPTIONS_SET=FREEBSD_TEMPLATE) will mean that $FreeBSD$ tags will not be expanded. Once the correct version has been installed, trick Subversion into expanding them like so:

&prompt.user; svn propdel -R svn:keywords .
&prompt.user; svn revert -R .

This will wipe out uncommitted patches.

It is possible to automatically fill the "Sponsored by" and "MFC after" commit log fields by setting "freebsd-sponsored-by" and "freebsd-mfc-after" fields in the "[miscellany]" section of the ~/.subversion/config configuration file. For example:

freebsd-sponsored-by = The FreeBSD Foundation
freebsd-mfc-after = 2 weeks

Setup, Conventions, and Traditions

There are a number of things to do as a new developer. The first set of steps is specific to committers only. These steps must be done by a mentor for those who are not committers.

For New Committers

Those who have been given commit rights to the &os; repositories must follow these steps.

Get mentor approval before committing each of these changes!

The .ent and .xml files mentioned below exist in the &os; Documentation Project SVN repository at svn+ssh://repo.FreeBSD.org/doc/.

New files that do not have the FreeBSD=%H svn:keywords property will be rejected when attempting to commit them to the repository. Be sure to read regarding adding and removing files. Verify that ~/.subversion/config contains the necessary auto-props entries from auto-props.txt mentioned there.

All src commits go to &os.current; first before being merged to &os.stable;. The &os.stable; branch must maintain ABI and API compatibility with earlier versions of that branch. Do not merge changes that break this compatibility.
Steps for New Committers

Add an Author Entity

doc/head/share/xml/authors.ent — Add an author entity. Later steps depend on this entity, and missing this step will cause the doc/ build to fail. This is a relatively easy task, but remains a good first test of version control skills.

Update the List of Developers and Contributors

doc/head/en_US.ISO8859-1/articles/contributors/contrib.committers.xml — Add an entry to the Developers section of the Contributors List. Entries are sorted by last name.

doc/head/en_US.ISO8859-1/articles/contributors/contrib.additional.xml — Remove the entry from the Additional Contributors section. Entries are sorted by first name.

Add a News Item

doc/head/share/xml/news.xml — Add an entry. Look for the other entries that announce new committers and follow the format. Use the date from the commit bit approval email from core@FreeBSD.org.

Add a PGP Key

doc/head/share/pgpkeys/pgpkeys.ent and doc/head/share/pgpkeys/pgpkeys-developers.xml - Add your PGP or GnuPG key. Those who do not yet have a key should see .

&a.des.email; has written a shell script (doc/head/share/pgpkeys/addkey.sh) to make this easier. See the README file for more information.

Use doc/head/share/pgpkeys/checkkey.sh to verify that keys meet minimal best-practices standards.

After adding and checking a key, add both updated files to source control and then commit them. Entries in this file are sorted by last name.

It is very important to have a current PGP/GnuPG key in the repository. The key may be required for positive identification of a committer. For example, the &a.admins; might need it for account recovery. A complete keyring of FreeBSD.org users is available for download from https://www.FreeBSD.org/doc/pgpkeyring.txt.

Update Mentor and Mentee Information

base/head/share/misc/committers-repository.dot — Add an entry to the current committers section, where repository is doc, ports, or src, depending on the commit privileges granted.
Add an entry for each additional mentor/mentee relationship in the bottom section.

Generate a Kerberos Password

See to generate or set a Kerberos password for use with other &os; services like the bug tracking database.

Optional: Enable Wiki Account

&os; Wiki Account — A wiki account allows sharing projects and ideas. Those who do not yet have an account can follow instructions on the AboutWiki Page to obtain one. Contact wiki-admin@FreeBSD.org if you need help with your Wiki account.

Optional: Update Wiki Information

Wiki Information - After gaining access to the wiki, some people add entries to the How We Got Here, IRC Nicks, and Dogs of FreeBSD pages.

Optional: Update Ports with Personal Information

ports/astro/xearth/files/freebsd.committers.markers and src/usr.bin/calendar/calendars/calendar.freebsd - Some people add entries for themselves to these files to show where they are located or the date of their birthday.

Optional: Prevent Duplicate Mailings

Subscribers to &a.svn-src-all.name;, &a.svn-ports-all.name; or &a.svn-doc-all.name; might wish to unsubscribe to avoid receiving duplicate copies of commit messages and followups.

For Everyone

Introduce yourself to the other developers, otherwise no one will have any idea who you are or what you are working on. The introduction need not be a comprehensive biography; just write a paragraph or two about who you are, what you plan to be working on as a developer in &os;, and who will be your mentor. Email this to the &a.developers; and you will be on your way!

Log into freefall.FreeBSD.org and create a /var/forward/user file (where user is your username) containing the e-mail address to which mail addressed to yourusername@FreeBSD.org should be forwarded. This includes all of the commit messages as well as any other mail addressed to the &a.committers; and the &a.developers;.
Really large mailboxes which have taken up permanent residence on freefall may get truncated without warning if space needs to be freed, so forward it or save it elsewhere. If your e-mail system uses SPF with strict rules, you should whitelist mx2.FreeBSD.org from SPF checks. Due to the severe load dealing with SPAM places on the central mail servers that do the mailing list processing, the front-end server does do some basic checks and will drop some messages based on these checks. At the moment proper DNS information for the connecting host is the only check in place but that may change. Some people blame these checks for bouncing valid email. To have these checks turned off for your email, create a file named ~/.spam_lover on freefall.FreeBSD.org. Those who are developers but not committers will not be subscribed to the committers or developers mailing lists. The subscriptions are derived from the access rights. SMTP Access Setup For those willing to send e-mail messages through the FreeBSD.org infrastructure, follow the instructions below: Point your mail client at smtp.FreeBSD.org:587. Enable STARTTLS. Ensure your From: address is set to yourusername@FreeBSD.org. For authentication, you can use your &os; Kerberos username and password (see ). The yourusername/mail principal is preferred, as it is only valid for authenticating to mail resources. Do not include @FreeBSD.org when entering in your username. Additional Notes Will only accept mail from yourusername@FreeBSD.org. If you are authenticated as one user, you are not permitted to send mail from another. A header will be appended with the SASL username: (Authenticated sender: username). Host has various rate limits in place to cut down on brute force attempts. Using a Local MTA to Forward Emails to the &os;.org SMTP Service It is also possible to use a local MTA to forward locally sent emails to the &os;.org SMTP servers. 
Using Postfix

To tell a local Postfix instance that anything from yourusername@FreeBSD.org should be forwarded to the &os;.org servers, add this to your main.cf:

sender_dependent_relayhost_maps = hash:/usr/local/etc/postfix/relayhost_maps
smtp_sasl_auth_enable = yes
smtp_sasl_security_options = noanonymous
smtp_sasl_password_maps = hash:/usr/local/etc/postfix/sasl_passwd
smtp_use_tls = yes

Create /usr/local/etc/postfix/relayhost_maps with the following content:

yourusername@FreeBSD.org [smtp.freebsd.org]:587

Create /usr/local/etc/postfix/sasl_passwd with the following content:

[smtp.freebsd.org]:587 yourusername:yourpassword

If the email server is used by other people, you may want to prevent them from sending e-mails from your address. To achieve this, add this to your main.cf:

smtpd_sender_login_maps = hash:/usr/local/etc/postfix/sender_login_maps
smtpd_sender_restrictions = reject_known_sender_login_mismatch

Create /usr/local/etc/postfix/sender_login_maps with the following content:

yourusername@FreeBSD.org yourlocalusername

Where yourlocalusername is the SASL username used to connect to the local instance of Postfix.

Mentors

All new developers have a mentor assigned to them for the first few months. A mentor is responsible for teaching the mentee the rules and conventions of the project and guiding their first steps in the developer community. The mentor is also personally responsible for the mentee's actions during this initial period.

For committers: do not commit anything without first getting mentor approval. Document that approval with an Approved by: line in the commit message.

When the mentor decides that a mentee has learned the ropes and is ready to commit on their own, the mentor announces it with a commit to conf/mentors.
This file is in the svnadmin branch of each repository: src base/svnadmin/conf/mentors doc doc/svnadmin/conf/mentors ports ports/svnadmin/conf/mentors New committers should aim to complete enough commits that their mentor is comfortable releasing them from mentorship within the first year. If they are still under mentorship, the appropriate management body (core, doceng, or portmgr) should attempt to ensure that there are no barriers preventing completion. If the committer is unable to satisfy their mentor of readiness by a year and a half their commit bit may be converted to project membership. Pre-Commit Review Code review is one way to increase the quality of software. The following guidelines apply to commits to the head (-CURRENT) branch of the src repository. Other branches and the ports and docs trees have their own review policies, but these guidelines generally apply to commits requiring review: All non-trivial changes should be reviewed before they are committed to the repository. Reviews may be conducted by email, in Bugzilla, in Phabricator, or by another mechanism. Where possible, reviews should be public. The developer responsible for a code change is also responsible for making all necessary review-related changes. Code review can be an iterative process, which continues until the patch is ready to be committed. Specifically, once a patch is sent out for review, it should receive an explicit looks good before it is committed. So long as it is explicit, this can take whatever form makes sense for the review method. Timeouts are not a substitute for review. Sometimes code reviews will take longer than you would hope for, especially for larger features. Accepted ways to speed up review times for your patches are: Review other people's patches. If you help out, everybody will be more willing to do the same for you; goodwill is our currency. Ping the patch. 
If it is urgent, provide reasons why it is important to you to get this patch landed and ping it every couple of days. If it is not urgent, the common courtesy ping rate is one week. Remember that you are asking for valuable time from other professional developers. Ask for help on mailing lists, IRC, etc. Others may be able to either help you directly, or suggest a reviewer. Split your patch into multiple smaller patches that build on each other. The smaller your patch, the higher the probability that somebody will take a quick look at it. When making large changes, it is helpful to keep this in mind from the beginning of the effort as breaking large changes into smaller ones is often difficult after the fact. Developers should participate in code reviews as both reviewers and reviewees. If someone is kind enough to review your code, you should return the favor for someone else. Note that while anyone is welcome to review and give feedback on a patch, only an appropriate subject-matter expert can approve a change. This will usually be a committer who works with the code in question on a regular basis. In some cases, no subject-matter expert may be available. In those cases, a review by an experienced developer is sufficient when coupled with appropriate testing. Commit Log Messages This section contains some suggestions and traditions for how commit logs are formatted. As well as including an informative message with each commit, some additional information may be needed. This information consists of one or more lines containing the key word or phrase, a colon, tabs for formatting, and then the additional information. The key words or phrases are: PR: The problem report (if any) which is affected (typically, by being closed) by this commit. Multiple PRs may be specified on one line, separated by commas or spaces. Submitted by: The name and e-mail address of the person that submitted the fix; for developers, just the username on the &os; cluster. 
If the submitter is the maintainer of the port being committed, include "(maintainer)" after the email address. Avoid obfuscating the email address of the submitter as this adds additional work when searching logs. Reviewed by: The name and e-mail address of the person or people that reviewed the change; for developers, just the username on the &os; cluster. If a patch was submitted to a mailing list for review, and the review was favorable, then just include the list name. Approved by: The name and e-mail address of the person or people that approved the change; for developers, just the username on the &os; cluster. It is customary to get prior approval for a commit if it is to an area of the tree to which you do not usually commit. In addition, during the run up to a new release all commits must be approved by the release engineering team. While under mentorship, get mentor approval before the commit. Enter the mentor's username in this field, and note that they are a mentor: Approved by: username-of-mentor (mentor) If a team approved these commits then include the team name followed by the username of the approver in parentheses. For example: Approved by: re (username) Obtained from: The name of the project (if any) from which the code was obtained. Do not use this line for the name of an individual person. Sponsored by: Sponsoring organizations for this change, if any. Separate multiple organizations with commas. If only a portion of the work was sponsored, or different amounts of sponsorship were provided to different authors, please give appropriate credit in parentheses after each sponsor name. For example, Example.com (alice, code refactoring), Wormulon (bob), Momcorp (cindy) shows that Alice was sponsored by Example.com to do code refactoring, while Wormulon sponsored Bob's work and Momcorp sponsored Cindy's work. Other authors were either not sponsored or chose not to list sponsorship. 
MFC after: To receive an e-mail reminder to MFC at a later date, specify the number of days, weeks, or months after which an MFC is planned.

MFC to: If the commit should be merged to a subset of stable branches, specify the branch names.

MFC with: If the commit should be merged together with a previous one in a single MFC commit (for example, where this commit corrects a bug in the previous change), specify the corresponding revision number.

Relnotes: If the change is a candidate for inclusion in the release notes for the next release from the branch, set to yes.

Security: If the change is related to a security vulnerability or security exposure, include one or more references or a description of the issue. If possible, include a VuXML URL or a CVE ID.

Event: The description for the event where this commit was made. If this is a recurring event, add the year or even the month to it. For example, this could be FooBSDcon 2019. The idea behind this line is to give recognition to conferences, gatherings, and other types of meetups and to show that these are useful to have. Please do not use the Sponsored by: line for this as that is meant for organizations sponsoring certain features or developers working on them.

Differential Revision: The full URL of the Phabricator review. This line must be the last line. For example: https://reviews.freebsd.org/D1708.

Commit Log for a Commit Based on a PR

The commit is based on a patch from a PR submitted by John Smith. The commit message PR and Submitted by fields are filled in.

...

PR: 12345
Submitted by: John Smith <John.Smith@example.com>

Commit Log for a Commit Needing Review

The virtual memory system is being changed. Patches were posted to the appropriate mailing list (in this case, freebsd-arch) and the changes have been approved.

...

Reviewed by: -arch

Commit Log for a Commit Needing Approval

Commit a port, after working with the listed MAINTAINER, who said to go ahead and commit.

...
Approved by: abc (maintainer)

Where abc is the account name of the person who approved.

Commit Log for a Commit Bringing in Code from OpenBSD

Committing some code based on work done in the OpenBSD project.

...

Obtained from: OpenBSD

Commit Log for a Change to &os.current; with a Planned Commit to &os.stable; to Follow at a Later Date.

Committing some code which will be merged from &os.current; into the &os.stable; branch after two weeks.

...

MFC after: 2 weeks

Where 2 is the number of days, weeks, or months after which an MFC is planned. The weeks option may be day, days, week, weeks, month, months.

It is often necessary to combine these.

Consider the situation where a user has submitted a PR containing code from the NetBSD project. Looking at the PR, the developer sees it is not an area of the tree they normally work in, so they have the change reviewed by the arch mailing list. Since the change is complex, the developer opts to MFC after one month to allow adequate testing.

The extra information to include in the commit would look something like:

Example Combined Commit Log

PR: 54321
Submitted by: John Smith <John.Smith@example.com>
Reviewed by: -arch
Obtained from: NetBSD
MFC after: 1 month
Relnotes: yes

Preferred License for New Files

The &os; Project's full license policy can be found at https://www.FreeBSD.org/internal/software-license.html. The rest of this section is intended to help you get started. As a rule, when in doubt, ask. It is much easier to give advice than to fix the source tree.

The &os; Project suggests and uses this text as the preferred license scheme:

/*-
 * SPDX-License-Identifier: BSD-2-Clause-FreeBSD
 *
 * Copyright (c) [year] [your name]
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * [id for your version control system, if any]
 */

The &os; project strongly discourages the so-called "advertising clause" in new code. Due to the large number of contributors to the &os; project, complying with this clause has become difficult for many commercial vendors. If you have code in the tree with the advertising clause, please consider removing it. In fact, please consider using the above license for your code.

The &os; project discourages completely new licenses and variations on the standard licenses. New licenses require the approval of the &a.core; to reside in the main repository. The more different licenses are used in the tree, the more problems this causes for those wishing to use the code, typically through unintended consequences of a poorly worded license.

Project policy dictates that code under some non-BSD licenses must be placed only in specific sections of the repository, and in some cases, compilation must be conditional or even disabled by default.
For example, the GENERIC kernel must be compiled only under licenses identical to or substantially similar to the BSD license. Software licensed under the GPL, APSL, CDDL, and similar licenses must not be compiled into GENERIC. Developers are reminded that in open source, getting "open" right is just as important as getting "source" right, as improper handling of intellectual property has serious consequences. Any questions or concerns should immediately be brought to the attention of the core team. Keeping Track of Licenses Granted to the &os; Project Some software and data in the repositories are used under a special license granted to the &os; project. The Terminus fonts for use with &man.vt.4; are a case in point. Here the author, Dimitar Zhekov, has allowed us to use the "Terminus BSD Console" font under a 2-clause BSD license rather than the regular Open Font License he normally uses. It is clearly sensible to keep a record of any such license grants. To that end, the &a.core; has decided to keep an archive of them. Whenever the &os; project is granted a special license we require the &a.core; to be notified. Developers involved in arranging such a license grant should send details to the &a.core;, including: Contact details for people or organizations granting the special license. What files, directories etc. in the repositories are covered by the license grant, including the revision numbers where any specially licensed material was committed. The date from which the license takes effect. Unless otherwise agreed, this will be the date the license was issued by the authors of the software in question. The license text. A note of any restrictions, limitations or exceptions that apply specifically to &os;'s usage of the licensed material. Any other relevant information.
Once the &a.core; is satisfied that all the necessary details have been gathered and are correct, the secretary will send a PGP-signed acknowledgement of receipt including the license details. This receipt will be persistently archived and serve as our permanent record of the license grant. The license archive should contain only details of license grants; this is not the place for any discussions around licensing or other subjects. Access to data within the license archive will be available on request to the &a.core;. Developer Relations When working directly on your own code or on code which is already well established as your responsibility, there is probably little need to check with other committers before jumping in with a commit. The same applies when working on a bug in an area of the system which is clearly orphaned (and there are a few such areas, to our shame). When modifying parts of the system which are maintained, formally or informally, consider asking for review just as a developer would have before becoming a committer. For ports, contact the listed MAINTAINER in the Makefile. To determine if an area of the tree is maintained, check the MAINTAINERS file at the root of the tree. If nobody is listed, scan the revision history to see who has committed changes in the past. An example script that lists each person who has committed to a given file, along with the number of commits each person has made, can be found on freefall at ~eadler/bin/whodid. If queries go unanswered or the committer otherwise indicates a lack of interest in the area affected, go ahead and commit it. Avoid sending private emails to maintainers. Other people might be interested in the conversation, not just the final output. If there is any doubt about a commit for any reason at all, have it reviewed before committing. Better to have it flamed then and there rather than when it is part of the repository.
If a commit does result in controversy erupting, it may be advisable to consider backing the change out again until the matter is settled. Remember, with a version control system we can always change it back. Do not impugn the intentions of others. If they see a different solution to a problem, or even a different problem, it is probably not because they are stupid, because they have questionable parentage, or because they are trying to destroy hard work, personal image, or &os;, but basically because they have a different outlook on the world. Different is good. Disagree honestly. Argue your position from its merits, be honest about any shortcomings it may have, and be open to seeing their solution, or even their vision of the problem. Accept correction. We are all fallible. When you have made a mistake, apologize and get on with life. Do not beat yourself up, and certainly do not beat up others for your mistake. Do not waste time on embarrassment or recrimination; just fix the problem and move on. Ask for help. Seek out (and give) peer reviews. One of the ways open source software is supposed to excel is in the number of eyeballs applied to it; this does not apply if nobody will review code. If in Doubt... When unsure about something, whether it be a technical issue or a project convention, be sure to ask. If you stay silent you will never make progress. If it relates to a technical issue, ask on the public mailing lists. Avoid the temptation to email the individual person that knows the answer. This way everyone will be able to learn from the question and the answer. For project-specific or administrative questions ask, in order: Your mentor or former mentor. An experienced committer on IRC, email, etc. Any team with a "hat", as they can give you a definitive answer. If still not sure, ask on &a.developers;.
Once your question is answered, if no one pointed you to documentation that spelled out the answer to your question, document it, as others will have the same question. Bugzilla The &os; Project utilizes Bugzilla for tracking bugs and change requests. If you commit a fix or suggestion found in the PR database, be sure to close the PR. It is also considered nice if you take time to close any PRs associated with your commits, if appropriate. Committers with non-&os;.org Bugzilla accounts can have the old account merged with the &os;.org account by following these steps: Log in using your old account. Open a new bug. Choose Services as the Product, and Bug Tracker as the Component. In the bug description, list the accounts you wish to be merged. Log in using your &os;.org account and post a comment to the newly opened bug to confirm ownership. See for more details on how to generate or set a password for your &os;.org account. If there are more than two accounts to merge, post comments from each of them. You can find out more about Bugzilla at: &os; Problem Report Handling Guidelines https://www.FreeBSD.org/support.html Phabricator The &os; Project utilizes Phabricator for code review requests. See the CodeReview wiki page for details. Committers with non-&os;.org Phabricator accounts can have the old account renamed to the &os;.org account by following these steps: Change your Phabricator account email to your &os;.org email. Open a new bug on our bug tracker using your &os;.org account; see for more information. Choose Services as the Product, and Code Review as the Component. In the bug description, request that your Phabricator account be renamed, and provide a link to your Phabricator user. For example, https://reviews.freebsd.org/p/bob_example.com/ Phabricator accounts cannot be merged; please do not open a new account. Who's Who Besides the repository meisters, there are other &os; project members and teams whom you will probably get to know in your role as a committer.
Briefly, and by no means all-inclusively, these are: &a.doceng; doceng is the group responsible for the documentation build infrastructure, approving new documentation committers, and ensuring that the &os; website and documentation on the FTP site is up to date with respect to the subversion tree. It is not a conflict resolution body. The vast majority of documentation related discussion takes place on the &a.doc;. More details regarding the doceng team can be found in its charter. Committers interested in contributing to the documentation should familiarize themselves with the Documentation Project Primer. &a.re.members.email; These are the members of the &a.re;. This team is responsible for setting release deadlines and controlling the release process. During code freezes, the release engineers have final authority on all changes to the system for whichever branch is pending release status. If there is something you want merged from &os.current; to &os.stable; (whatever values those may have at any given time), these are the people to talk to about it. &a.so.email; &a.so; is the &os; Security Officer and oversees the &a.security-officer;. &a.wollman.email; If you need advice on obscure network internals or are not sure of some potential change to the networking subsystem you have in mind, Garrett is someone to talk to. Garrett is also very knowledgeable on the various standards applicable to &os;. &a.committers; &a.svn-src-all.name;, &a.svn-ports-all.name; and &a.svn-doc-all.name; are the mailing lists that the version control system uses to send commit messages to. Never send email directly to these lists. Only send replies to this list when they are short and are directly related to a commit. &a.developers; All committers are subscribed to -developers. This list was created to be a forum for the committers community issues. Examples are Core voting, announcements, etc. The &a.developers; is for the exclusive use of &os; committers. 
To develop &os;, committers must have the ability to openly discuss matters that will be resolved before they are publicly announced. Frank discussions of work in progress are not suitable for open publication and may harm &os;. All &os; committers are expected not to publish or forward messages from the &a.developers; outside the list membership without permission of all of the authors. Violators will be removed from the &a.developers;, resulting in a suspension of commit privileges. Repeated or flagrant violations may result in permanent revocation of commit privileges. This list is not intended as a place for code reviews or for any technical discussion. In fact, using it as such hurts the &os; Project as it gives a sense of a closed list where general decisions affecting all of the &os;-using community are made without being open. Last, but not least: never, ever email the &a.developers; and CC:/BCC: another &os; list. Never, ever email another &os; email list and CC:/BCC: the &a.developers;. Doing so can greatly diminish the benefits of this list. SSH Quick-Start Guide If you do not wish to type your password in every time you use &man.ssh.1;, and you use keys to authenticate, &man.ssh-agent.1; is there for your convenience. If you want to use &man.ssh-agent.1;, make sure that you run it before running other applications. X users, for example, usually do this from their .xsession or .xinitrc. See &man.ssh-agent.1; for details. Generate a key pair using &man.ssh-keygen.1;. The key pair will wind up in your $HOME/.ssh/ directory. Only ECDSA, Ed25519 or RSA keys are supported. Send your public key ($HOME/.ssh/id_ecdsa.pub, $HOME/.ssh/id_ed25519.pub, or $HOME/.ssh/id_rsa.pub) to the person setting you up as a committer so it can be put into yourlogin in /etc/ssh-keys/ on freefall. Now &man.ssh-add.1; can be used for authentication once per session.
It prompts for the private key's pass phrase, and then stores it in the authentication agent (&man.ssh-agent.1;). Use ssh-add -d to remove keys stored in the agent. Test with a simple remote command: ssh freefall.FreeBSD.org ls /usr. For more information, see security/openssh-portable, &man.ssh.1;, &man.ssh-add.1;, &man.ssh-agent.1;, &man.ssh-keygen.1;, and &man.scp.1;. For information on adding, changing, or removing &man.ssh.1; keys, see this article. &coverity; Availability for &os; Committers All &os; developers can obtain access to Coverity analysis results of all &os; Project software. All who are interested in obtaining access to the analysis results of the automated Coverity runs, can sign up at Coverity Scan. The &os; wiki includes a mini-guide for developers who are interested in working with the &coverity; analysis reports: https://wiki.freebsd.org/CoverityPrevent. Please note that this mini-guide is only readable by &os; developers, so if you cannot access this page, you will have to ask someone to add you to the appropriate Wiki access list. Finally, all &os; developers who are going to use &coverity; are always encouraged to ask for more details and usage information, by posting any questions to the mailing list of the &os; developers. The &os; Committers' Big List of Rules Everyone involved with the &os; project is expected to abide by the Code of Conduct available from https://www.FreeBSD.org/internal/code-of-conduct.html. As committers, you form the public face of the project, and how you behave has a vital impact on the public perception of it. This guide expands on the parts of the Code of Conduct specific to committers. Respect other committers. Respect other contributors. Discuss any significant change before committing. Respect existing maintainers (if listed in the MAINTAINER field in Makefile or in MAINTAINER in the top-level directory). Any disputed change must be backed out pending resolution of the dispute if requested by a maintainer. 
Security related changes may override a maintainer's wishes at the Security Officer's discretion. Changes go to &os.current; before &os.stable; unless specifically permitted by the release engineer or unless they are not applicable to &os.current;. Any non-trivial or non-urgent change which is applicable should also be allowed to sit in &os.current; for at least 3 days before merging so that it can be given sufficient testing. The release engineer has the same authority over the &os.stable; branch as outlined for the maintainer in rule #5. Do not fight in public with other committers; it looks bad. Respect all code freezes and read the committers and developers mailing lists in a timely manner so you know when a code freeze is in effect. When in doubt on any procedure, ask first! Test your changes before committing them. Do not commit to contributed software without explicit approval from the respective maintainers. As noted, breaking some of these rules can be grounds for suspension or, upon repeated offense, permanent removal of commit privileges. Individual members of core have the power to temporarily suspend commit privileges until core as a whole has the chance to review the issue. In case of an emergency (a committer doing damage to the repository), a temporary suspension may also be done by the repository meisters. Only a 2/3 majority of core has the authority to suspend commit privileges for longer than a week or to remove them permanently. This rule does not exist to set core up as a bunch of cruel dictators who can dispose of committers as casually as empty soda cans, but to give the project a kind of safety fuse. If someone is out of control, it is important to be able to deal with this immediately rather than be paralyzed by debate. In all cases, a committer whose privileges are suspended or revoked is entitled to a hearing by core, the total duration of the suspension being determined at that time. 
A committer whose privileges are suspended may also request a review of the decision after 30 days and every 30 days thereafter (unless the total suspension period is less than 30 days). A committer whose privileges have been revoked entirely may request a review after a period of 6 months has elapsed. This review policy is strictly informal and, in all cases, core reserves the right to either act on or disregard requests for review if they feel their original decision to be the right one. In all other aspects of project operation, core is a subset of committers and is bound by the same rules. Just because someone is in core this does not mean that they have special dispensation to step outside any of the lines painted here; core's special powers only kick in when it acts as a group, not on an individual basis. As individuals, the core team members are all committers first and core second. Details Respect other committers. This means that you need to treat other committers as the peer-group developers that they are. Despite our occasional attempts to prove the contrary, one does not get to be a committer by being stupid and nothing rankles more than being treated that way by one of your peers. Whether we always feel respect for one another or not (and everyone has off days), we still have to treat other committers with respect at all times, on public forums and in private email. Being able to work together long term is this project's greatest asset, one far more important than any set of changes to the code, and turning arguments about code into issues that affect our long-term ability to work harmoniously together is just not worth the trade-off by any conceivable stretch of the imagination. To comply with this rule, do not send email when you are angry or otherwise behave in a manner which is likely to strike others as needlessly confrontational. 
First calm down, then think about how to communicate in the most effective fashion for convincing the other persons that your side of the argument is correct; do not just blow off some steam so you can feel better in the short term at the cost of a long-term flame war. Not only is this very bad energy economics, but repeated displays of public aggression which impair our ability to work well together will be dealt with severely by the project leadership and may result in suspension or termination of your commit privileges. The project leadership will take into account both public and private communications brought before it. It will not seek the disclosure of private communications, but it will take them into account if they are volunteered by the committers involved in the complaint. All of this is never an option which the project's leadership enjoys in the slightest, but unity comes first. No amount of code or good advice is worth trading that away. Respect other contributors. You were not always a committer. At one time you were a contributor. Remember that at all times. Remember what it was like trying to get help and attention. Do not forget that your work as a contributor was very important to you. Remember what it was like. Do not discourage, belittle, or demean contributors. Treat them with respect. They are our committers in waiting. They are every bit as important to the project as committers. Their contributions are as valid and as important as your own. After all, you made many contributions before you became a committer. Always remember that. Consider the points raised under and apply them also to contributors. Discuss any significant change before committing. The repository is not where changes are initially submitted for correctness or argued over; that happens first in the mailing lists or by use of the Phabricator service. The commit will only happen once something resembling consensus has been reached.
This does not mean that permission is required before correcting every obvious syntax error or manual page misspelling, just that it is good to develop a feel for when a proposed change is not quite such a no-brainer and requires some feedback first. People really do not mind sweeping changes if the result is something clearly better than what they had before, they just do not like being surprised by those changes. The very best way of making sure that things are on the right track is to have code reviewed by one or more other committers. When in doubt, ask for review! Respect existing maintainers if listed. Many parts of &os; are not owned in the sense that any specific individual will jump up and yell if you commit a change to their area, but it still pays to check first. One convention we use is to put a maintainer line in the Makefile for any package or subtree which is being actively maintained by one or more people; see https://www.FreeBSD.org/doc/en_US.ISO8859-1/books/developers-handbook/policies.html for documentation on this. Where sections of code have several maintainers, commits to affected areas by one maintainer need to be reviewed by at least one other maintainer. In cases where the maintainer-ship of something is not clear, look at the repository logs for the files in question and see if someone has been working recently or predominantly in that area. Any disputed change must be backed out pending resolution of the dispute if requested by a maintainer. Security related changes may override a maintainer's wishes at the Security Officer's discretion. This may be hard to swallow in times of conflict (when each side is convinced that they are in the right, of course) but a version control system makes it unnecessary to have an ongoing dispute raging when it is far easier to simply reverse the disputed change, get everyone calmed down again and then try to figure out what is the best way to proceed. 
If the change turns out to be the best thing after all, it can be easily brought back. If it turns out not to be, then the users did not have to live with the bogus change in the tree while everyone was busily debating its merits. People very rarely call for back-outs in the repository since discussion generally exposes bad or controversial changes before the commit even happens, but on such rare occasions the back-out should be done without argument so that we can get immediately on to the topic of figuring out whether it was bogus or not. Changes go to &os.current; before &os.stable; unless specifically permitted by the release engineer or unless they are not applicable to &os.current;. Any non-trivial or non-urgent change which is applicable should also be allowed to sit in &os.current; for at least 3 days before merging so that it can be given sufficient testing. The release engineer has the same authority over the &os.stable; branch as outlined in rule #5. This is another do not argue about it issue since it is the release engineer who is ultimately responsible (and gets beaten up) if a change turns out to be bad. Please respect this and give the release engineer your full cooperation when it comes to the &os.stable; branch. The management of &os.stable; may frequently seem to be overly conservative to the casual observer, but also bear in mind the fact that conservatism is supposed to be the hallmark of &os.stable; and different rules apply there than in &os.current;. There is also really no point in having &os.current; be a testing ground if changes are merged over to &os.stable; immediately. Changes need a chance to be tested by the &os.current; developers, so allow some time to elapse before merging unless the &os.stable; fix is critical, time sensitive or so obvious as to make further testing unnecessary (spelling fixes to manual pages, obvious bug/typo fixes, etc.) In other words, apply common sense. 
Changes to the security branches (for example, releng/9.3) must be approved by a member of the &a.security-officer;, or in some cases, by a member of the &a.re;. Do not fight in public with other committers; it looks bad. This project has a public image to uphold and that image is very important to all of us, especially if we are to continue to attract new members. There will be occasions when, despite everyone's very best attempts at self-control, tempers are lost and angry words are exchanged. The best thing that can be done in such cases is to minimize the effects of this until everyone has cooled back down. Do not air angry words in public and do not forward private correspondence or other private communications to public mailing lists, mail aliases, instant messaging channels or social media sites. What people say one-to-one is often much less sugar-coated than what they would say in public, and such communications therefore have no place there - they only serve to inflame an already bad situation. If the person sending a flame-o-gram at least had the grace to send it privately, then have the grace to keep it private yourself. If you feel you are being unfairly treated by another developer, and it is causing you anguish, bring the matter up with core rather than taking it public. Core will do its best to play peace makers and get things back to sanity. In cases where the dispute involves a change to the codebase and the participants do not appear to be reaching an amicable agreement, core may appoint a mutually-agreeable third party to resolve the dispute. All parties involved must then agree to be bound by the decision reached by this third party. Respect all code freezes and read the committers and developers mailing list on a timely basis so you know when a code freeze is in effect. 
Committing unapproved changes during a code freeze is a really big mistake and committers are expected to keep up-to-date on what is going on before jumping in after a long absence and committing 10 megabytes worth of accumulated stuff. People who abuse this on a regular basis will have their commit privileges suspended until they get back from the &os; Happy Reeducation Camp we run in Greenland. When in doubt on any procedure, ask first! Many mistakes are made because someone is in a hurry and just assumes they know the right way of doing something. If you have not done it before, chances are good that you do not actually know the way we do things and really need to ask first or you are going to completely embarrass yourself in public. There is no shame in asking "how in the heck do I do this?" We already know you are an intelligent person; otherwise, you would not be a committer. Test your changes before committing them. This may sound obvious, but if it really were so obvious then we probably would not see so many cases of people clearly not doing this. If your changes are to the kernel, make sure you can still compile both GENERIC and LINT. If your changes are anywhere else, make sure you can still make world. If your changes are to a branch, make sure your testing occurs with a machine which is running that code. If you have a change which also may break another architecture, be sure to test on all supported architectures. Please refer to the &os; Internal Page for a list of available resources. As other architectures are added to the &os; supported platforms list, the appropriate shared testing resources will be made available. Do not commit to contributed software without explicit approval from the respective maintainers. Contributed software is anything under the src/contrib, src/crypto, or src/sys/contrib trees. The trees mentioned above are for contributed software usually imported onto a vendor branch.
Committing something there may cause unnecessary headaches when importing newer versions of the software. As a general rule, consider sending patches upstream to the vendor. Patches may be committed to FreeBSD first with permission of the maintainer. Reasons for modifying upstream software range from wanting strict control over a tightly coupled dependency to lack of portability in the canonical repository's distribution of their code. Regardless of the reason, effort to minimize the maintenance burden of a fork is helpful to fellow maintainers. Avoid committing trivial or cosmetic changes to files since it makes every merge thereafter more difficult: such patches need to be manually re-verified on every import. If a particular piece of software lacks a maintainer, you are encouraged to take up ownership. If you are unsure of the current maintainership, email &a.arch; and ask. Policy on Multiple Architectures &os; has added several new architecture ports during recent release cycles and is truly no longer an &i386;-centric operating system. In an effort to make it easier to keep &os; portable across the platforms we support, core has developed this mandate:
Our 32-bit reference platform is &arch.i386;, and our 64-bit reference platform is &arch.amd64;. Major design work (including major API and ABI changes) must prove itself on at least one 32-bit and at least one 64-bit platform, preferably the primary reference platforms, before it may be committed to the source tree.
The &arch.i386; and &arch.amd64; platforms were chosen due to being more readily available to developers and as representatives of more diverse processor and system designs - big versus little endian, register file versus register stack, different DMA and cache implementations, hardware page tables versus software TLB management etc. We will continue to re-evaluate this policy as cost and availability of the 64-bit platforms change. Developers should also be aware of our Tier Policy for the long term support of hardware architectures. The rules here are intended to provide guidance during the development process, and are distinct from the requirements for features and architectures listed in that section. The Tier rules for feature support on architectures at release-time are more strict than the rules for changes during the development process.
Other Suggestions When committing documentation changes, use a spell checker before committing. For all XML docs, verify that the formatting directives are correct by running make lint and textproc/igor. For manual pages, run sysutils/manck and textproc/igor over the manual page to verify all of the cross references and file references are correct and that the man page has all of the appropriate MLINKs installed. Do not mix style fixes with new functionality. A style fix is any change which does not modify the functionality of the code. Mixing the changes obfuscates the functionality change when asking for differences between revisions, which can hide any new bugs. Do not include whitespace changes with content changes in commits to doc/. The extra clutter in the diffs makes the translators' job much more difficult. Instead, make any style or whitespace changes in separate commits that are clearly labeled as such in the commit message. Deprecating Features When it is necessary to remove functionality from software in the base system, follow these guidelines whenever possible: Mention is made in the manual page and possibly the release notes that the option, utility, or interface is deprecated. Use of the deprecated feature generates a warning. The option, utility, or interface is preserved until the next major (point zero) release. The option, utility, or interface is removed and no longer documented. It is now obsolete. It is also generally a good idea to note its removal in the release notes. Privacy and Confidentiality Most &os; business is done in public. &os; is an open project, which means that not only can anyone use the source code, but also that most of the development process is open to public scrutiny. Certain sensitive matters must remain private or held under embargo. There unfortunately cannot be complete transparency. As a &os; developer you will have a certain degree of privileged access to information.
Consequently, you are expected to respect certain requirements for confidentiality. Sometimes the need for confidentiality comes from external collaborators or has a specific time limit. Mostly though, it is a matter of not releasing private communications. The Security Officer has sole control over the release of security advisories. Where there are security problems that affect many different operating systems, &os; frequently depends on early access to be able to prepare advisories for coordinated release. Unless &os; developers can be trusted to maintain security, such early access will not be made available. The Security Officer is responsible for controlling pre-release access to information about vulnerabilities, and for timing the release of all advisories. The Security Officer may request help under condition of confidentiality from any developer with relevant knowledge to prepare security fixes. Communications with Core are kept confidential for as long as necessary. Communications to core will initially be treated as confidential. Eventually, however, most of Core's business will be summarized into the monthly or quarterly core reports. Care will be taken to avoid publicizing any sensitive details. Records of some particularly sensitive subjects may not be reported on at all and will be retained only in Core's private archives. Non-disclosure Agreements may be required for access to certain commercially sensitive data. Access to certain commercially sensitive data may only be available under a Non-Disclosure Agreement. The FreeBSD Foundation legal staff must be consulted before any binding agreements are entered into. Private communications must not be made public without permission. Beyond the specific requirements above there is a general expectation not to publish private communications between developers without the consent of all parties involved.
Ask permission before forwarding a message to a public mailing list, or posting it to a forum or website that can be accessed by anyone other than the original correspondents. Communications on project-only or restricted access channels must be kept private. As with personal communications, certain internal communications channels, including &os; committer-only mailing lists and restricted-access IRC channels, are considered private communications. Permission is required to publish material from these sources. Core may approve publication. Where it is impractical to obtain permission due to the number of correspondents or where permission to publish is unreasonably withheld, Core may approve release of such private matters that merit more general publication.
Support for Multiple Architectures &os; is a highly portable operating system intended to function on many different types of hardware architectures. Maintaining clean separation of Machine Dependent (MD) and Machine Independent (MI) code, as well as minimizing MD code, is an important part of our strategy to remain agile with regards to current hardware trends. Each new hardware architecture supported by &os; adds substantially to the cost of code maintenance, toolchain support, and release engineering. It also dramatically increases the cost of effective testing of kernel changes. As such, there is strong motivation to differentiate between classes of support for various architectures while remaining strong in a few key architectures that are seen as the &os; target audience. Statement of General Intent The &os; Project targets "production quality commercial off-the-shelf (COTS) workstation, server, and high-end embedded systems". By retaining a focus on a narrow set of architectures of interest in these environments, the &os; Project is able to maintain high levels of quality, stability, and performance, as well as minimize the load on various support teams on the project, such as the ports team, documentation team, security officer, and release engineering teams. Diversity in hardware support broadens the options for &os; consumers by offering new features and usage opportunities, but these benefits must always be carefully considered in terms of the real-world maintenance cost associated with additional platform support. The &os; Project differentiates platform targets into four tiers. Each tier includes a list of guarantees consumers may rely on as well as obligations by the Project and developers to fulfill those guarantees. These lists define the minimum guarantees for each tier. The Project and developers may provide additional levels of support beyond the minimum guarantees for a given tier, but such additional support is not guaranteed. 
Each platform target is assigned to a specific tier for each stable branch. As a result, a platform target might be assigned to different tiers on concurrent stable branches. Platform Targets Support for a hardware platform consists of two components: kernel support and userland Application Binary Interfaces (ABIs). Kernel platform support includes things needed to run a &os; kernel on a hardware platform such as machine-dependent virtual memory management and device drivers. A userland ABI specifies an interface for user processes to interact with a &os; kernel and base system libraries. A userland ABI includes system call interfaces, the layout and semantics of public data structures, and the layout and semantics of arguments passed to subroutines. Some components of an ABI may be defined by specifications such as the layout of C++ exception objects or calling conventions for C functions. A &os; kernel also uses an ABI (sometimes referred to as the Kernel Binary Interface (KBI)) which includes the semantics and layouts of public data structures and the layout and semantics of arguments to public functions within the kernel itself. A &os; kernel may support multiple userland ABIs. For example, &os;'s amd64 kernel supports &os; amd64 and i386 userland ABIs as well as Linux x86_64 and i386 userland ABIs. A &os; kernel should support a native ABI as the default ABI. The native ABI generally shares certain properties with the kernel ABI such as the C calling convention, sizes of basic types, etc. Tiers are defined for both kernels and userland ABIs. In the common case, a platform's kernel and &os; ABIs are assigned to the same tier. Tier 1: Fully-Supported Architectures Tier 1 platforms are the most mature &os; platforms. They are supported by the security officer, release engineering, and port management teams. 
Tier 1 architectures are expected to be Production Quality with respect to all aspects of the &os; operating system, including installation and development environments. The &os; Project provides the following guarantees to consumers of Tier 1 platforms: Official &os; release images will be provided by the release engineering team. Binary updates and source patches for Security Advisories and Errata Notices will be provided for supported releases. Source patches for Security Advisories will be provided for supported branches. Binary updates and source patches for cross-platform Security Advisories will typically be provided at the time of the announcement. Changes to userland ABIs will generally include compatibility shims to ensure correct operation of binaries compiled against any stable branch where the platform is Tier 1. These shims might not be enabled in the default install. If compatibility shims are not provided for an ABI change, the lack of shims will be clearly documented in the release notes. Changes to certain portions of the kernel ABI will include compatibility shims to ensure correct operation of kernel modules compiled against the oldest supported release on the branch. Note that not all parts of the kernel ABI are protected. Official binary packages for third party software will be provided by the ports team. For embedded architectures, these packages may be cross-built from a different architecture. Most relevant ports should either build or have the appropriate filters to prevent inappropriate ones from building. New features which are not inherently platform-specific will be fully functional on all Tier 1 architectures. Features and compatibility shims used by binaries compiled against older stable branches may be removed in newer major versions. Such removals will be clearly documented in the release notes. Tier 1 platforms should be fully documented. Basic operations will be documented in the &os; Handbook. 
Tier 1 platforms will be included in the source tree. Tier 1 platforms should be self-hosting either via the in-tree toolchain or an external toolchain. If an external toolchain is required, official binary packages for an external toolchain will be provided. To maintain maturity of Tier 1 platforms, the &os; Project will maintain the following resources to support development: Build and test automation support either in the FreeBSD.org cluster or some other location easily available for all developers. Embedded platforms may substitute an emulator available in the FreeBSD.org cluster for actual hardware. Inclusion in the make universe and make tinderbox targets. Dedicated hardware in one of the &os; clusters for package building (either natively or via qemu-user). Collectively, developers are required to provide the following to maintain the Tier 1 status of a platform: Changes to the source tree should not knowingly break the build of a Tier 1 platform. Tier 1 architectures must have a mature, healthy ecosystem of users and active developers. Developers should be able to build packages on commonly available, non-embedded Tier 1 systems. This can mean either native builds if non-embedded systems are commonly available for the platform in question, or it can mean cross-builds hosted on some other Tier 1 architecture. Changes cannot break the userland ABI. If an ABI change is required, ABI compatibility for existing binaries should be provided via use of symbol versioning or shared library version bumps. Changes merged to stable branches cannot break the protected portions of the kernel ABI. If a kernel ABI change is required, the change should be modified to preserve functionality of existing kernel modules. Tier 2: Developmental and Niche Architectures Tier 2 platforms are functional, but less mature &os; platforms. They are not supported by the security officer, release engineering, and port management teams. 
Tier 2 platforms may be Tier 1 platform candidates that are still under active development. Architectures reaching end of life may also be moved from Tier 1 status to Tier 2 status as the availability of resources to continue to maintain the system in a Production Quality state diminishes. Well-supported niche architectures may also be Tier 2. The &os; Project provides the following guarantees to consumers of Tier 2 platforms: The ports infrastructure should include basic support for Tier 2 architectures sufficient to support building ports and packages. This includes support for basic packages such as ports-mgmt/pkg, but there is no guarantee that arbitrary ports will be buildable or functional. New features which are not inherently platform-specific should be feasible on all Tier 2 architectures if not implemented. Tier 2 platforms will be included in the source tree. Tier 2 platforms should be self-hosting either via the in-tree toolchain or an external toolchain. If an external toolchain is required, official binary packages for an external toolchain will be provided. Tier 2 platforms should provide functional kernels and userlands even if an official release distribution is not provided. To maintain maturity of Tier 2 platforms, the &os; Project will maintain the following resources to support development: Inclusion in the make universe and make tinderbox targets. Collectively, developers are required to provide the following to maintain the Tier 2 status of a platform: Changes to the source tree should not knowingly break the build of a Tier 2 platform. Tier 2 architectures must have an active ecosystem of users and developers. While changes are permitted to break the userland ABI, the ABI should not be broken gratuitously. Significant userland ABI changes should be restricted to major versions. New features that are not yet implemented on Tier 2 architectures should provide a means of disabling them on those architectures. 
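The make universe and make tinderbox targets named in the Tier 1 and Tier 2 resource lists build every supported architecture, which can take many hours. The following is a minimal sketch of how a developer might restrict such a build to a few architectures before committing a change. It assumes a source checkout in /usr/src and that the top-level Makefile honors a TARGETS variable for limiting the architecture list; the universe_cmd helper and the SRCDIR variable are hypothetical names introduced only for this illustration. The helper prints the command instead of running it, so the invocation can be reviewed first.

```shell
#!/bin/sh
# Sketch only: print (rather than run) a make invocation that limits a
# universe or tinderbox build to selected architectures.
# universe_cmd and SRCDIR are hypothetical names used for illustration.
universe_cmd() {
    target="$1"   # "universe" or "tinderbox"
    shift
    # The remaining arguments are architecture names, passed via TARGETS
    # (assumed to be honored by the top-level src Makefile).
    printf 'make -C %s %s TARGETS="%s"\n' "${SRCDIR:-/usr/src}" "$target" "$*"
}

# Example: show the command for a tinderbox run limited to two architectures.
universe_cmd tinderbox amd64 arm64
```

Printing the command keeps the sketch safe to try on any machine; piping the output to sh would start the actual build.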
Tier 3: Experimental Architectures Tier 3 platforms have at least partial &os; support. They are not supported by the security officer, release engineering, and port management teams. Tier 3 platforms are architectures in the early stages of development, for non-mainstream hardware platforms, or which are considered legacy systems unlikely to see broad future use. Initial support for Tier 3 platforms may exist in a separate repository rather than the main source repository. The &os; Project provides no guarantees to consumers of Tier 3 platforms and is not committed to maintaining resources to support development. Tier 3 platforms may not always be buildable, nor are any kernel or userland ABIs considered stable. Tier 4: Unsupported Architectures Tier 4 platforms are not supported in any form by the project. All systems not otherwise classified are Tier 4 systems. When a platform transitions to Tier 4, all support for the platform is removed from the source and ports trees. Note that ports support should remain as long as the platform is supported in a branch supported by ports. Policy on Changing the Tier of an Architecture Systems may only be moved from one tier to another by approval of the &os; Core Team, which shall make that decision in collaboration with the Security Officer, Release Engineering, and ports management teams. For a platform to be promoted to a higher tier, any missing support guarantees must be satisfied before the promotion is completed. Ports Specific FAQ Adding a New Port How do I add a new port? First, please read the section about repository copies. The easiest way to add a new port is the addport script located in the ports/Tools/scripts directory. It adds a port from the directory specified, determining the category automatically from the port Makefile. It also adds an entry to the port's category Makefile. It was written by &a.mharo.email;, &a.will.email;, and &a.garga.email;. 
When sending questions about this script to the &a.ports;, please also CC &a.crees.email;, the current maintainer. Any other things I need to know when I add a new port? Check the port to make sure it compiles and packages correctly. This is the recommended sequence: &prompt.root; make install &prompt.root; make package &prompt.root; make deinstall &prompt.root; pkg add package you built above &prompt.root; make deinstall &prompt.root; make reinstall &prompt.root; make package The Porter's Handbook contains more detailed instructions. Use &man.portlint.1; to check the syntax of the port. You do not necessarily have to eliminate all warnings, but make sure you have fixed the simple ones. If the port came from a submitter who has not contributed to the Project before, add that person's name to the Additional Contributors section of the &os; Contributors List. Close the PR if the port came in as a PR. To close a PR, change the state to Issue Resolved and the resolution to Fixed. Removing an Existing Port How do I remove an existing port? First, please read the section about repository copies. Before you remove the port, you have to verify there are no other ports depending on it. Make sure there is no dependency on the port in the ports collection: The port's PKGNAME appears in exactly one line in a recent INDEX file. No other port contains any reference to the port's directory or PKGNAME in its Makefile. When using Git, consider using git grep; it is much faster than grep -r. Then, remove the port: Remove the port's files and directory with svn remove. Remove the SUBDIR listing of the port in the parent directory Makefile. Add an entry to ports/MOVED. Search for entries in ports/security/vuxml/vuln.xml and adjust them accordingly. In particular, check for previous packages with the new name whose version could include the new port. Remove the port from ports/LEGAL if it is there. Alternatively, you can use the rmport script, from ports/Tools/scripts.
This script was written by &a.vd.email;. When sending questions about this script to the &a.ports;, please also CC &a.crees.email;, the current maintainer. Re-adding a Deleted Port How do I re-add a deleted port? This is essentially the reverse of deleting a port. Do not use svn add to add the port. Follow these steps. If they are unclear, or are not working, ask for help, do not just svn add the port. Figure out when the port was removed. Use this list, or look for the port on freshports, and then copy the last living revision of the port: &prompt.user; cd /usr/ports/category &prompt.user; svn cp 'svn+ssh://repo.freebsd.org/ports/head/category/portname/@XXXXXX' portname Pick the revision that is just before the removal. For example, if the revision where it was removed is 269874, use 269873. It is also possible to specify a date. In that case, pick a date that is before the removal but after the last commit to the port. &prompt.user; cd /usr/ports/category &prompt.user; svn cp 'svn+ssh://repo.freebsd.org/ports/head/category/portname/@{YYYY-MM-DD}' portname Make the changes necessary to get the port working again. If it was deleted because the distfiles are no longer available, either volunteer to host the distfiles, or find someone else to do so. If some files have been added, or were removed during the resurrection process, use svn add or svn remove to make sure all the files in the port will be committed. Restore the SUBDIR listing of the port in the parent directory Makefile, keeping the entries sorted. Delete the port entry from ports/MOVED. If the port had an entry in ports/LEGAL, restore it. svn commit these changes, preferably in one step. The addport script mentioned in now detects when the port to add has previously existed, and attempts to handle all except the ports/LEGAL step automatically. Repository Copies When do we need a repository copy? 
When you want to add a port that is related to any port that is already in the tree in a separate directory, you have to do a repository copy. Here related means it is a different version or a slightly modified version. Examples are print/ghostscript* (different versions) and x11-wm/windowmaker* (English-only and internationalized version). Another example is when a port is moved from one subdirectory to another, or when the name of a directory must be changed because the authors renamed their software even though it is a descendant of a port already in the tree. What do I need to do? With Subversion, a repo copy can be done by any committer: Doing a repo copy: Verify that the target directory does not exist. Use svn up to make certain the original files, directories, and checkout information are current. Use svn move or svn copy to do the repo copy. Upgrade the copied port to the new version. Remember to add or change the PKGNAMEPREFIX or PKGNAMESUFFIX so there are no duplicate ports with the same name. In some rare cases it may be necessary to change the PORTNAME instead of adding PKGNAMEPREFIX or PKGNAMESUFFIX, but this is only done when it is really needed; for example, when using an existing port as the base for a very similar program with a different name, or when upgrading a port to a new upstream version which actually changes the distribution name, like the transition from textproc/libxml to textproc/libxml2. In most cases, adding or changing PKGNAMEPREFIX or PKGNAMESUFFIX suffices. Add the new subdirectory to the SUBDIR listing in the parent directory Makefile. You can run make checksubdirs in the parent directory to check this. If the port changed categories, modify the CATEGORIES line of the port's Makefile accordingly. Add an entry to ports/MOVED, if you remove the original port. Commit all changes in one commit. When removing a port: Perform a thorough check of the ports collection for any dependencies on the old port location/name, and update them.
Running grep on INDEX is not enough because some ports have dependencies enabled by compile-time options. A full grep -r of the ports collection is recommended. Remove the old port and the old SUBDIR entry. Add an entry to ports/MOVED. After repo moves (rename operations where a port is copied and the old location is removed): Follow the same steps that are outlined in the previous two entries, to activate the new location of the port and remove the old one. Ports Freeze What is a ports freeze? A ports freeze was a restricted state the ports tree was put in before a release. It was used to ensure a higher quality for the packages shipped with a release. It usually lasted a couple of weeks. During that time, build problems were fixed, and the release packages were built. This practice is no longer used, as the packages for the releases are built from the current stable, quarterly branch. For more information on how to merge commits to the quarterly branch, see . Quarterly Branches What is the procedure to request authorization for merging a commit to the quarterly branch? When doing the commit, add the branch name to the MFH: line, for example: MFH: 2014Q1 It will automatically notify the &a.ports-secteam; and the &a.portmgr;. They will then decide if the commit can be merged and answer with the procedure. If the commit has already been made, send an email to the &a.ports-secteam; and the &a.portmgr; with the revision number and a small description of why the commit needs to be merged. If the MFH is covered by a blanket approval, please explain why with a couple of words on the MFH line, so that the reviewing team can skip this commit and save time. For example: MFH: 2014Q1 (runtime fix) MFH: 2014Q1 (browser blanket) The list of blanket approvals is available in . Are there any changes that can be merged without asking for approval? 
The following blanket approvals for merging to the quarterly branches are in effect: This blanket approval also applies to direct commits for ports that have been removed from head. These fixes must be tested on the quarterly branch. Fixes that do not result in a change in contents of the resulting package. For example: pkg-descr: WWW: URL updates (existing 404, moved or incorrect). Build, runtime or packaging fixes, if the quarterly branch version is currently broken. Missing dependencies (detected, linked against but not registered via *_DEPENDS). Fixing shebangs, stripping installed libraries and binaries, and plist fixes. Backport of security and reliability fixes which only result in PORTREVISION bumps and no changes to enabled features. For example, adding a patch fixing a buffer overflow. Minor version changes that do nothing but fix security or crash-related issues. Adding/fixing CONFLICTS. Web browsers, browser plugins, and their required dependencies. Commits that are not covered by these blanket approvals always require explicit approval of either &a.ports-secteam; or &a.portmgr;. What is the procedure for merging commits to the quarterly branch? A script is provided to automate merging a specific commit: ports/Tools/scripts/mfh. It is used as follows: &prompt.user; /usr/ports/Tools/scripts/mfh 380362 U 2015Q1 Checked out revision 380443. A 2015Q1/security Updating '2015Q1/security/rubygem-sshkit': A 2015Q1/security/rubygem-sshkit A 2015Q1/security/rubygem-sshkit/Makefile A 2015Q1/security/rubygem-sshkit/distinfo A 2015Q1/security/rubygem-sshkit/pkg-descr Updated to revision 380443.
--- Merging r380362 into '2015Q1': U 2015Q1/security/rubygem-sshkit/Makefile U 2015Q1/security/rubygem-sshkit/distinfo --- Recording mergeinfo for merge of r380362 into '2015Q1': U 2015Q1 --- Recording mergeinfo for merge of r380362 into '2015Q1/security': G 2015Q1/security --- Eliding mergeinfo from '2015Q1/security': U 2015Q1/security --- Recording mergeinfo for merge of r380362 into '2015Q1/security/rubygem-sshkit': G 2015Q1/security/rubygem-sshkit --- Eliding mergeinfo from '2015Q1/security/rubygem-sshkit': U 2015Q1/security/rubygem-sshkit M 2015Q1 M 2015Q1/security/rubygem-sshkit/Makefile M 2015Q1/security/rubygem-sshkit/distinfo Index: 2015Q1/security/rubygem-sshkit/Makefile =================================================================== --- 2015Q1/security/rubygem-sshkit/Makefile (revision 380443) +++ 2015Q1/security/rubygem-sshkit/Makefile (working copy) @@ -2,7 +2,7 @@ # $FreeBSD$ PORTNAME= sshkit -PORTVERSION= 1.6.1 +PORTVERSION= 1.7.0 CATEGORIES= security rubygems MASTER_SITES= RG Index: 2015Q1/security/rubygem-sshkit/distinfo =================================================================== --- 2015Q1/security/rubygem-sshkit/distinfo (revision 380443) +++ 2015Q1/security/rubygem-sshkit/distinfo (working copy) @@ -1,2 +1,2 @@ -SHA256 (rubygem/sshkit-1.6.1.gem) = 8ca67e46bb4ea50fdb0553cda77552f3e41b17a5aa919877d93875dfa22c03a7 -SIZE (rubygem/sshkit-1.6.1.gem) = 135680 +SHA256 (rubygem/sshkit-1.7.0.gem) = 90effd1813363bae7355f4a45ebc8335a8ca74acc8d0933ba6ee6d40f281a2cf +SIZE (rubygem/sshkit-1.7.0.gem) = 136192 Index: 2015Q1 =================================================================== --- 2015Q1 (revision 380443) +++ 2015Q1 (working copy) Property changes on: 2015Q1 ___________________________________________________________________ Modified: svn:mergeinfo Merged /head:r380362 Do you want to commit? 
(no = start a shell) [y/n] At that point, the script will either open a shell for you to fix things, or open your text editor with the commit message all prepared and then commit the merge. The script assumes that you can connect to repo.FreeBSD.org with SSH directly, so if your local login name is different from your &os; cluster account name, you need a few lines in your ~/.ssh/config: Host *.freebsd.org User freebsd-login The script is also able to merge more than one revision at a time. If there have been other updates to the port since the branch was created that have not been merged because they were not security related, add the revisions to the mfh command line in the order they were committed. The new commit log message will contain the combined log messages from all the original commits. These messages must be edited to show what is actually being done with the new commit. &prompt.user; /usr/ports/Tools/scripts/mfh r407208 r407713 r407722 r408567 r408943 r410728 The mfh script can also take an optional first argument, the branch where the merge is being done. Only the latest quarterly branch is supported, so specifying the branch is discouraged. To be safe, the script will give a warning if the quarterly branch is not the latest: &prompt.user; /usr/ports/Tools/scripts/mfh 2016Q1 r407208 r407713 /!\ The latest branch is 2016Q2, do you really want to commit to 2016Q1? [y/n] Creating a New Category What is the procedure for creating a new category? Please see Proposing a New Category in the Porter's Handbook. Once that procedure has been followed and the PR has been assigned to the &a.portmgr;, it is their decision whether or not to approve it. If they do, it is their responsibility to: Perform any needed moves. (This only applies to physical categories.) Update the VALID_CATEGORIES definition in ports/Mk/bsd.port.mk. Assign the PR back to you. What do I need to do to implement a new physical category? Upgrade each moved port's Makefile.
Do not connect the new category to the build yet. To do this, you will need to: Change the port's CATEGORIES (this was the point of the exercise, remember?) The new category is listed first. This will help to ensure that the PKGORIGIN is correct. Run a make describe. Since the top-level make index that you will be running in a few steps is an iteration of make describe over the entire ports hierarchy, catching any errors here will save you having to re-run that step later on. If you want to be really thorough, now might be a good time to run &man.portlint.1;. Check that the PKGORIGINs are correct. The ports system uses each port's CATEGORIES entry to create its PKGORIGIN, which is used to connect installed packages to the port directory they were built from. If this entry is wrong, common port tools like &man.pkg.version.1; and &man.portupgrade.1; fail. To do this, use the chkorigin.sh tool: env PORTSDIR=/path/to/ports sh -e /path/to/ports/Tools/scripts/chkorigin.sh. This will check every port in the ports tree, even those not connected to the build, so you can run it directly after the move operation. Hint: do not forget to look at the PKGORIGINs of any slave ports of the ports you just moved! On your own local system, test the proposed changes: first, comment out the SUBDIR entries in the old ports' categories' Makefiles; then enable building the new category in ports/Makefile. Run make checksubdirs in the affected category directories to check the SUBDIR entries. Next, in the ports/ directory, run make index. This can take over 40 minutes on even modern systems; however, it is a necessary step to prevent problems for other people. Once this is done, you can commit the updated ports/Makefile to connect the new category to the build and also commit the Makefile changes for the old category or categories. Add appropriate entries to ports/MOVED. Update the documentation by modifying: the list of categories in the Porter's Handbook doc/en_US.ISO8859-1/htdocs/ports. 
Note that these are now displayed by sub-groups, as specified in doc/en_US.ISO8859-1/htdocs/ports/categories.descriptions. (Note: these are in the docs repository, not the ports repository.) If you are not a docs committer, you will need to submit a PR for this. Only once all the above have been done, and no one is any longer reporting problems with the new ports, should the old ports be deleted from their previous locations in the repository. It is not necessary to manually update the ports web pages to reflect the new category. This is done automatically via the change to en_US.ISO8859-1/htdocs/ports/categories and the automated rebuild of INDEX. What do I need to do to implement a new virtual category? This is much simpler than a physical category. Only a few modifications are needed: the list of categories in the Porter's Handbook and en_US.ISO8859-1/htdocs/ports/categories. Miscellaneous Questions Are there changes that can be committed without asking the maintainer for approval? Blanket approval for most ports applies to these types of fixes: Most infrastructure changes to a port (that is, modernizing, but not changing the functionality). For example, the blanket covers converting to new USES macros, enabling verbose builds, and switching to new ports system syntaxes. Trivial and tested build and runtime fixes. Documentation or metadata changes to ports, like pkg-descr or COMMENT. Exceptions to this are anything maintained by the &a.portmgr; or the &a.security-officer;. No unauthorized commits may ever be made to ports maintained by those groups. How do I know if my port is building correctly or not? The packages are built multiple times each week. If a port fails, the maintainer will receive an email from pkg-fallout@FreeBSD.org. Reports for all the package builds (official, experimental, and non-regression) are aggregated at pkg-status.FreeBSD.org. I added a new port. Do I need to add it to the INDEX? No.
The file can either be generated by running make index, or a pre-generated version can be downloaded with make fetchindex. Are there any other files I am not allowed to touch? Any file directly under ports/, or any file under a subdirectory that starts with an uppercase letter (Mk/, Tools/, etc.). In particular, the &a.portmgr; is very protective of ports/Mk/bsd.port*.mk so do not commit changes to those files unless you want to face their wrath. What is the proper procedure for updating the checksum for a port distfile when the file changes without a version change? When the checksum for a distribution file is updated due to the author updating the file without changing the port revision, the commit message includes a summary of the relevant diffs between the original and new distfile to ensure that the distfile has not been corrupted or maliciously altered. If the current version of the port has been in the ports tree for a while, a copy of the old distfile will usually be available on the ftp servers; otherwise the author or maintainer should be contacted to find out why the distfile has changed. How can an experimental test build of the ports tree (exp-run) be requested? An exp-run must be completed before patches with a significant ports impact are committed. The patch can be against the ports tree or the base system. Full package builds will be done with the patches provided by the submitter, and the submitter is required to fix detected problems (fallout) before commit. Go to the Bugzilla new PR page. Select the product your patch is about. Fill in the bug report as normal. Remember to attach the patch. If at the top it says Show Advanced Fields click on it. It will now say Hide Advanced Fields. Many new fields will be available. If it already says Hide Advanced Fields, no need to do anything. In the Flags section, set the exp-run one to ?. As for all other fields, hovering the mouse over any field shows more details. Submit. Wait for the build to run. 
&a.portmgr; will reply with any fallout. Depending on the fallout: If there is no fallout, the procedure stops here, and the change can be committed, pending any other approval required. If there is fallout, it must be fixed, either by fixing the ports directly in the ports tree, or adding to the submitted patch. When this is done, go back to step 6 saying the fallout was fixed and wait for the exp-run to be run again. Repeat as long as there are broken ports. Issues Specific to Developers Who Are Not Committers A few people who have access to the &os; machines do not have commit bits. Almost all of this document will apply to these developers as well (except things specific to commits and the mailing list memberships that go with them). In particular, we recommend that you read: Administrative Details Conventions Get your mentor to add you to the Additional Contributors (doc/en_US.ISO8859-1/articles/contributors/contrib.additional.xml), if you are not already listed there. Developer Relations SSH Quick-Start Guide The &os; Committers' Big List of Rules Information About &ga; As of December 12, 2012, &ga; was enabled on the &os; Project website to collect anonymized statistics about usage of the site. The information collected is valuable to the &os; Documentation Project, to identify various problems on the &os; website. &ga; General Policy The &os; Project takes visitor privacy very seriously. As such, the &os; Project website honors the Do Not Track header before fetching the tracking code from Google. For more information, please see the &os; Privacy Policy. &ga; access is not arbitrarily allowed; access must be requested, voted on by the &a.doceng;, and explicitly granted. Requests for &ga; data must include a specific purpose. For example, a valid reason for requesting access would be to see the most frequently used web browsers when viewing &os; web pages to ensure page rendering speeds are acceptable.
Conversely, to see what web browsers are most frequently used (without stating why) would be rejected. All requests must include the timeframe for which the data would be required. For example, it must be explicitly stated if the requested data would be needed for a timeframe covering a span of 3 weeks, or if the request would be one-time only. Any request for &ga; data without a clear, reasonable reason beneficial to the &os; Project will be rejected. Data Available Through &ga; A few examples of the types of &ga; data available include: Commonly used web browsers Page load times Site access by language Miscellaneous Questions How do I add a new file to a branch? To add a file onto a branch, simply checkout or update to the branch you want to add to and then add the file using the add operation as you normally would. This works fine for the doc and ports trees. The src tree uses SVN and requires more care because of the mergeinfo properties. See the Subversion Primer for details on how to perform an MFC. How do I access people.FreeBSD.org to put up personal or project information? people.FreeBSD.org is the same as freefall.FreeBSD.org. Just create a public_html directory. Anything you place in that directory will automatically be visible under https://people.FreeBSD.org/. Where are the mailing list archives stored? The mailing lists are archived under /local/mail on freefall.FreeBSD.org. I would like to mentor a new committer. What process do I need to follow? See the New Account Creation Procedure document on the internal pages. Benefits and Perks for &os; Committers Recognition Recognition as a competent software engineer is the longest lasting value. In addition, getting a chance to work with some of the best people that every engineer would dream of meeting is a great perk! FreeBSD Mall &os; committers can get a free 4-CD or DVD set at conferences from &os; Mall, Inc.. 
<acronym>IRC</acronym> In addition, developers may request a cloaked hostmask for their account on the Freenode IRC network in the form of freebsd/developer/freefall name or freebsd/developer/NickServ name. To request a cloak, send an email to &a.irc.email; with your requested hostmask and NickServ account name. <systemitem class="domainname">Gandi.net</systemitem> Gandi provides website hosting, cloud computing, domain registration, and X.509 certificate services. Gandi offers an E-rate discount to all &os; developers. Send mail to non-profit@gandi.net using your @freebsd.org mail address, and indicate your Gandi handle.
diff --git a/en_US.ISO8859-1/articles/contributors/contrib.develinmemoriam.xml b/en_US.ISO8859-1/articles/contributors/contrib.develinmemoriam.xml index cb5064aa5c..eda3c20709 100644 --- a/en_US.ISO8859-1/articles/contributors/contrib.develinmemoriam.xml +++ b/en_US.ISO8859-1/articles/contributors/contrib.develinmemoriam.xml @@ -1,179 +1,179 @@ Bruce D. Evans (1991 - 2019; RIP 2019) Bruce was a programming giant who made FreeBSD his home. Back before FreeBSD and Linux there was Minix, a toy "unix" written by Andy Tannenbaum, released in 1987, sold with complete sources on three floppy disks, for $99. Bruce ported Minix to the i386 around 1989. Linus Torvalds used Minix/386 to develop his own kernel, and Bruce was the first person he thanked in the release-announcement. When Bill Jolitz released 386BSD 0.1 in 1992, Bruce was listed as a contributor. Bruce co-founded the FreeBSD project, and served on core.0, but he was never partisan, and over the years many other projects have benefitted from his patches, advice and wisdom. Code reviews from Bruce came in three flavours, "mild", "brucified" and "brucifiction", but they were never personal: It was always only about the code, the mistakes, the sloppy thinking, the missing historical context, the ambiguous standards - and the style(9) transgressions. - Because Bruce gave more code reviews than anybody else in + As Bruce gave more code reviews than anybody else in the history of the FreeBSD project, the commit logs hide the true scale of his impact until you pay attention to "Submitted by", "Reviewed by" and "Pointed out by". Being hard of hearing, Bruce did not attend conferences. The notable exception was the 1999 BSDcon in California, where his core team colleagues greeted him with "We're not worthy!" in Wayne's World fashion. Twenty years later we're still not. Kurt Lidl (2015 - 2019; RIP 2019) Kurt first got involved with BSD while it was still a project at the University of California at Berkeley. 
Shortly after personalized license plates became available in Maryland, he got "BSDWZRD". He began contributing to FreeBSD shortly after the conception of the project. He became a FreeBSD source committer in October 2015. Kurt's most well known FreeBSD project was &man.blacklistd.8; which blocks and releases ports on demand to avoid DoS abuse. He has also made many other bug fixes and enhancements to DTrace, boot loaders, and other bits and pieces of the FreeBSD infrastructure. Earlier work included the game XTank, an author on RFC 2516 "A Method for Transmitting PPP Over Ethernet (PPPoE)", and the USENIX paper "Drinking from the Firehose: Multicast USENET News". Frank Durda IV (1995 - 2003; RIP 2018) Frank had been around the project since the very early days, contributing code to the 1.x line before becoming a committer. Andrey A. Chernov (1993 - 2017; RIP 2017) Andrey's contributions to &os; cannot be overstated. Having been involved for a long time, there is hardly an area which he did not touch. Jürgen Lock (2006 - 2015; RIP 2015) Jürgen made a number of contributions to &os;, including work on libvirt, the graphics stack, and QEMU. Jürgen's contributions and helpfulness were appreciated by people around the world. That work continues to improve the lives of thousands every day. &a.alexbl.email; (2006 - 2011; RIP 2012) Alexander was best known as a major contributor to &os;'s Python ports and a founding member of &a.python; as well as his work on XMMS2. &a.jb.email; (1997 - 2009; RIP 2009) John made major contributions to FreeBSD, the best known of which is the import of the &man.dtrace.1; code. John's unique sense of humor and plain-spokenness either ruffled feathers or made him quick friends. At the end of his life, he had moved to a rural area and was attempting to live with as minimal impact to the planet as possible, while at the same time still working in the high-tech area. 
&a.jmz.email; (1994 - 2009; RIP 2009) Jean-Marc was an astrophysicist who made important contributions to the modeling of the atmospheres of both planets and comets at l'Observatoire de Besançon in Besançon, France. While there, he participated in the conception and construction of the Vega tricanal spectrometer that studied Halley's Comet. He had also been a long-time contributor to FreeBSD. &a.itojun.email; (1997 - 2001; RIP 2008) Known to everyone as itojun, Jun-ichiro Hagino was a core researcher at the KAME Project, which aimed to provide IPv6 and IPsec technology in freely redistributable form. Much of this code was incorporated into FreeBSD. Without his efforts, the state of IPv6 on the Internet would be much different. &a.cg.email; (1999 - 2005; RIP 2005) Cameron was a unique individual who contributed to the project despite serious physical disabilities. He was responsible for a complete rewrite of our sound system during the late 1990s. Many of those who corresponded with him had no idea of his limited mobility, due to his cheerful spirit and willingness to help others. &a.alane.email; (2002 - 2003; RIP 2003) Alan was a major contributor to the KDE on FreeBSD group. In addition, he maintained many other difficult and time-consuming ports such as autoconf, CUPS, and python. Alan's path was not an easy one but his passion for FreeBSD, and dedication to programming excellence, won him many friends. diff --git a/en_US.ISO8859-1/articles/ldap-auth/article.xml b/en_US.ISO8859-1/articles/ldap-auth/article.xml index d1957adb4f..26ecdf7f29 100644 --- a/en_US.ISO8859-1/articles/ldap-auth/article.xml +++ b/en_US.ISO8859-1/articles/ldap-auth/article.xml @@ -1,972 +1,972 @@
LDAP Authentication Toby Burress
kurin@causa-sui.net
2007 2008 The FreeBSD Documentation Project &tm-attrib.freebsd; &tm-attrib.general; $FreeBSD$ $FreeBSD$ This document is intended as a guide for the configuration of an LDAP server (principally an OpenLDAP server) for authentication on &os;. This is useful for situations where many servers need the same user accounts, for example as a replacement for NIS.
Preface This document is intended to give the reader enough of an understanding of LDAP to configure an LDAP server. This document also explains how to use net/nss_ldap and security/pam_ldap so that services on client machines can use the LDAP server. When finished, the reader should be able to configure and deploy a &os; server that can host an LDAP directory, and to configure and deploy a &os; server which can authenticate against an LDAP directory. This article is not intended to be an exhaustive account of the security, robustness, or best practice considerations for configuring LDAP or the other services discussed herein. While the author takes care to do everything correctly, they do not address security issues beyond a general scope. This article should be considered to lay the theoretical groundwork only, and any actual implementation should be accompanied by careful requirement analysis. Configuring LDAP LDAP stands for Lightweight Directory Access Protocol and is a subset of the X.500 Directory Access Protocol. Its most recent specifications are in RFC4510 and friends. Essentially it is a database that expects to be read from more often than it is written to. The LDAP server OpenLDAP will be used in the examples in this document; while the principles here should be generally applicable to many different servers, most of the concrete administration is OpenLDAP-specific. There are several server versions in ports, for example net/openldap24-server. Client servers will need the corresponding net/openldap24-client libraries. There are (basically) two areas of the LDAP service which need configuration. The first is setting up a server to receive connections properly, and the second is adding entries to the server's directory so that &os; tools know how to interact with it. Setting Up the Server for Connections This section is specific to OpenLDAP. If you are using another server, you will need to consult that server's documentation. 
Installing <application>OpenLDAP</application> First, install OpenLDAP: Installing <application>OpenLDAP</application> &prompt.root; cd /usr/ports/net/openldap24-server &prompt.root; make install clean This installs the slapd and slurpd binaries, along with the required OpenLDAP libraries. Configuring <application>OpenLDAP</application> Next we must configure OpenLDAP. You will want to require encryption in your connections to the LDAP server; otherwise your users' passwords will be transferred in plain text, which is considered insecure. The tools we will be using support two very similar kinds of encryption, SSL and TLS. TLS stands for Transport Layer Security. Services that employ TLS tend to connect on the same ports as the same services without TLS; thus an SMTP server which supports TLS will listen for connections on port 25, and an LDAP server will listen on 389. SSL stands for Secure Sockets Layer, and services that implement SSL do not listen on the same ports as their non-SSL counterparts. Thus SMTPS listens on port 465 (not 25), HTTPS listens on 443, and LDAPS on 636. SSL uses a different port than TLS because a TLS connection begins as plain text, and switches to encrypted traffic after the STARTTLS directive. SSL connections are encrypted from the beginning. Other than that there are no substantial differences between the two. We will adjust OpenLDAP to use TLS, as SSL is considered deprecated. Once OpenLDAP is installed via ports, the following configuration parameters in /usr/local/etc/openldap/slapd.conf will enable TLS: security ssf=128 TLSCertificateFile /path/to/your/cert.crt TLSCertificateKeyFile /path/to/your/cert.key TLSCACertificateFile /path/to/your/cacert.crt Here, ssf=128 tells OpenLDAP to require 128-bit encryption for all connections, both search and update. 
This parameter may be configured based on the security needs of your site, but you will rarely need to weaken it, as most LDAP client libraries support strong encryption. The cert.crt, cert.key, and cacert.crt files are necessary for clients to authenticate you as the valid LDAP server. If you simply want a server that runs, you can create a self-signed certificate with OpenSSL: Generating an RSA Key &prompt.user; openssl genrsa -out cert.key 1024 Generating RSA private key, 1024 bit long modulus ....................++++++ ...++++++ e is 65537 (0x10001) &prompt.user; openssl req -new -key cert.key -out cert.csr At this point you should be prompted for some values. You may enter whatever values you like; however, it is important that the Common Name value be the fully qualified domain name of the OpenLDAP server. In our case, and the examples here, the server is server.example.org. Incorrectly setting this value will cause clients to fail when making connections. This can be the cause of great frustration, so ensure that you follow these steps closely. Finally, the certificate signing request needs to be signed: Self-signing the Certificate &prompt.user; openssl x509 -req -in cert.csr -days 365 -signkey cert.key -out cert.crt Signature ok subject=/C=AU/ST=Some-State/O=Internet Widgits Pty Ltd Getting Private key This will create a self-signed certificate that can be used for the directives in slapd.conf, where cert.crt and cacert.crt are the same file. If you are going to use many OpenLDAP servers (for replication via slurpd) you will want to see to generate a CA key and use it to sign individual server certificates. Once this is done, put the following in /etc/rc.conf: slapd_enable="YES" Then run /usr/local/etc/rc.d/slapd start. This should start OpenLDAP. Confirm that it is listening on 389 with &prompt.user; sockstat -4 -p 389 ldap slapd 3261 7 tcp4 *:389 *:* Configuring the Client Install the net/openldap24-client port for the OpenLDAP libraries. 
The client machines will always have OpenLDAP libraries since that is all security/pam_ldap and net/nss_ldap support, at least for the moment. The configuration file for the OpenLDAP libraries is /usr/local/etc/openldap/ldap.conf. Edit this file to contain the following values: base dc=example,dc=org uri ldap://server.example.org/ ssl start_tls tls_cacert /path/to/your/cacert.crt It is important that your clients have access to cacert.crt, otherwise they will not be able to connect. There are two files called ldap.conf. The first is this file, which is for the OpenLDAP libraries and defines how to talk to the server. The second is /usr/local/etc/ldap.conf, and is for pam_ldap. At this point you should be able to run ldapsearch -Z on the client machine; -Z means use TLS. If you encounter an error, then something is configured wrong; most likely it is your certificates. Use &man.openssl.1;'s s_client and s_server to ensure you have them configured and signed properly. Entries in the Database Authentication against an LDAP directory is generally accomplished by attempting to bind to the directory as the connecting user. This is done by establishing a simple bind on the directory with the user name supplied. If there is an entry with the uid equal to the user name and that entry's userPassword attribute matches the password supplied, then the bind is successful. The first thing we have to do is figure out where in the directory our users will live. The base entry for our database is dc=example,dc=org. The default location for users that most clients seem to expect is something like ou=people,base, so that is what will be used here. However, keep in mind that this is configurable. So the ldif entry for the people organizational unit will look like: dn: ou=people,dc=example,dc=org objectClass: top objectClass: organizationalUnit ou: people All users will be created as subentries of this organizational unit. 
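Organizational-unit entries all share the shape shown for ou=people, so a small shell function can generate them. This is only a minimal sketch: the function name ou_ldif is invented for illustration and is not part of OpenLDAP, and the output is plain LDIF text that still has to be loaded into the directory separately.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenLDAP): print the LDIF for an
# organizational unit named $1 placed directly under the base DN $2.
ou_ldif() {
    ou_name=$1
    base_dn=$2
    printf 'dn: ou=%s,%s\n' "$ou_name" "$base_dn"
    printf 'objectClass: top\n'
    printf 'objectClass: organizationalUnit\n'
    printf 'ou: %s\n' "$ou_name"
}
```

Running `ou_ldif people dc=example,dc=org` prints the same four lines as the people entry; calling it with a different unit name yields the matching entry for that unit.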
Some thought might be given to the object class your users will belong to. Most tools by default will use people, which is fine if you simply want to provide entries against which to authenticate. However, if you are going to store user information in the LDAP database as well, you will probably want to use inetOrgPerson, which has many useful attributes. In either case, the relevant schemas need to be loaded in slapd.conf. For this example we will use the person object class. If you are using inetOrgPerson, the steps are basically identical, except that the sn attribute is required. To add a user testuser, the ldif would be: dn: uid=tuser,ou=people,dc=example,dc=org objectClass: person objectClass: posixAccount objectClass: shadowAccount objectClass: top uidNumber: 10000 gidNumber: 10000 homeDirectory: /home/tuser loginShell: /bin/csh uid: tuser cn: tuser I start my LDAP users' UIDs at 10000 to avoid collisions with system accounts; you can configure whatever number you wish here, as long as it is less than 65536. We also need group entries. They are as configurable as user entries, but we will use the defaults below: dn: ou=groups,dc=example,dc=org objectClass: top objectClass: organizationalUnit ou: groups dn: cn=tuser,ou=groups,dc=example,dc=org objectClass: posixGroup objectClass: top gidNumber: 10000 cn: tuser To enter these into your database, you can use slapadd or ldapadd on a file containing these entries. Alternatively, you can use sysutils/ldapvi. The ldapsearch utility on the client machine should now return these entries. If it does, your database is properly configured to be used as an LDAP authentication server. Client Configuration The client should already have OpenLDAP libraries from , but if you are installing several client machines you will need to install net/openldap24-client on each of them. &os; requires two ports to be installed to authenticate against an LDAP server, security/pam_ldap and net/nss_ldap. 
Authentication security/pam_ldap is configured via /usr/local/etc/ldap.conf. This is a different file than the OpenLDAP library functions' configuration file, /usr/local/etc/openldap/ldap.conf; however, it takes many of the same options; in fact it is a superset of that file. For the rest of this section, references to ldap.conf will mean /usr/local/etc/ldap.conf. Thus, we will want to copy all of our original configuration parameters from openldap/ldap.conf to the new ldap.conf. Once this is done, we want to tell security/pam_ldap what to look for on the directory server. We are identifying our users with the uid attribute. To configure this (though it is the default), set the pam_login_attribute directive in ldap.conf: Setting <literal>pam_login_attribute</literal> pam_login_attribute uid With this set, security/pam_ldap will search the entire LDAP directory under base for the value uid=username. If it finds one and only one entry, it will attempt to bind as that user with the password it was given. If it binds correctly, then it will allow access. Otherwise it will fail. Users whose shell is not in /etc/shells will not be able to log in. This is particularly important when Bash is set as the user shell on the LDAP server. Bash is not included with a default installation of &os;. When installed from a package or port, it is located at /usr/local/bin/bash. Verify that the path to the shell on the server is set correctly: &prompt.user; getent passwd username There are two choices when the output shows /bin/bash in the last column. The first is to change the user's entry on the LDAP server to /usr/local/bin/bash. The second option is to create a symlink on the LDAP client computer so Bash is found at the correct location: &prompt.root; ln -s /usr/local/bin/bash /bin/bash Make sure that /etc/shells contains entries for both /usr/local/bin/bash and /bin/bash. The user will then be able to log in to the system with Bash as their shell. 
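The /etc/shells requirement described here can also be checked from a script. The following is a minimal sketch for a POSIX sh; the function names shell_is_allowed and login_shell_of are made up for this illustration, and the optional second argument exists only so the check can be tried against a scratch file instead of the real /etc/shells.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the shell path $1 appears as a whole
# line in the shells file $2 (defaults to /etc/shells).
shell_is_allowed() {
    shell_path=$1
    shells_file=${2:-/etc/shells}
    # -F: fixed string, -x: match the entire line, -q: no output
    grep -Fxq -- "$shell_path" "$shells_file"
}

# Hypothetical helper: print the login shell (seventh passwd field)
# of user $1 as resolved by getent.
login_shell_of() {
    getent passwd "$1" | cut -d: -f7
}
```

For example, `shell_is_allowed "$(login_shell_of username)" || echo "login shell missing from /etc/shells"` reports a user whose directory-supplied shell would prevent them from logging in.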
PAM PAM, which stands for Pluggable Authentication Modules, is the method by which &os; authenticates most of its sessions. To tell &os; we wish to use an LDAP server, we will have to add a line to the appropriate PAM file. Most of the time the appropriate PAM file is /etc/pam.d/sshd, if you want to use SSH (remember to set the relevant options in /etc/ssh/sshd_config, otherwise SSH will not use PAM). To use PAM for authentication, add the line auth sufficient /usr/local/lib/pam_ldap.so no_warn Exactly where this line shows up in the file and which options appear in the fourth column determine the exact behavior of the authentication mechanism; see &man.pam.d.5; With this configuration you should be able to authenticate a user against an LDAP directory. PAM will perform a bind with your credentials, and if successful will tell SSH to allow access. However it is not a good idea to allow every user in the directory into every client machine. With the current configuration, all that a user needs to log into a machine is an LDAP entry. Fortunately there are a few ways to restrict user access. ldap.conf supports a pam_groupdn directive; every account that connects to this machine needs to be a member of the group specified here. For example, if you have pam_groupdn cn=servername,ou=accessgroups,dc=example,dc=org in ldap.conf, then only members of that group will be able to log in. There are a few things to bear in mind, however. Members of this group are specified in one or more memberUid attributes, and each attribute must have the full distinguished name of the member. So memberUid: someuser will not work; it must be: memberUid: uid=someuser,ou=people,dc=example,dc=org Additionally, this directive is not checked in PAM during authentication, it is checked during account management, so you will need a second line in your PAM files under account. This will require, in turn, every user to be listed in the group, which is not necessarily what we want. 
To avoid blocking users that are not in LDAP, you should enable the ignore_unknown_user attribute. Finally, you should set the ignore_authinfo_unavail option so that you are not locked out of every computer when the LDAP server is unavailable. Your pam.d/sshd might then end up looking like this: Sample <filename>pam.d/sshd</filename> auth required pam_nologin.so no_warn auth sufficient pam_opie.so no_warn no_fake_prompts auth requisite pam_opieaccess.so no_warn allow_local auth sufficient /usr/local/lib/pam_ldap.so no_warn auth required pam_unix.so no_warn try_first_pass account required pam_login_access.so account required /usr/local/lib/pam_ldap.so no_warn ignore_authinfo_unavail ignore_unknown_user Since we are adding these lines specifically to pam.d/sshd, this will only have an effect on SSH sessions. LDAP users will be unable to log in at the console. To change this behavior, examine the other files in /etc/pam.d and modify them accordingly. Name Service Switch NSS is the service that maps attributes to names. So, for example, if a file is owned by user 1001, an application will query NSS for the name of 1001, and it might get bob or ted or whatever the user's name is. Now that our user information is kept in LDAP, we need to tell NSS to look there when queried. The net/nss_ldap port does this. It uses the same configuration file as security/pam_ldap, and should not need any extra parameters once it is installed. Instead, what is left is simply to edit /etc/nsswitch.conf to take advantage of the directory. Simply replace the following lines: group: compat passwd: compat with group: files ldap passwd: files ldap This will allow you to map usernames to UIDs and UIDs to usernames. Congratulations! You should now have working LDAP authentication. Caveats Unfortunately, as of the time this was written &os; did not support changing user passwords with &man.passwd.1;. 
- Because of this, most administrators are left to implement a + As a result of this, most administrators are left to implement a solution themselves. I provide some examples here. Note that if you write your own password change script, there are some security issues you should be made aware of; see Shell Script for Changing Passwords This script does hardly any error checking, but more important it is very cavalier about how it stores your passwords. If you do anything like this, at least adjust the security.bsd.see_other_uids sysctl value: &prompt.root; sysctl security.bsd.see_other_uids=0 A more flexible (and probably more secure) approach can be used by writing a custom program, or even a web interface. The following is part of a Ruby library that can change LDAP passwords. It sees use both on the command line, and on the web. Ruby Script for Changing Passwords Although not guaranteed to be free of security holes (the password is kept in memory, for example) this is cleaner and more flexible than a simple sh script. Security Considerations Now that your machines (and possibly other services) are authenticating against your LDAP server, this server needs to be protected at least as well as /etc/master.passwd would be on a regular server, and possibly even more so since a broken or cracked LDAP server would break every client service. Remember, this section is not exhaustive. You should continually review your configuration and procedures for improvements. Setting Attributes Read-only Several attributes in LDAP should be read-only. If left writable by the user, for example, a user could change his uidNumber attribute to 0 and get root access! To begin with, the userPassword attribute should not be world-readable. By default, anyone who can connect to the LDAP server can read this attribute. 
To disable this, put the following in slapd.conf: Hide Passwords access to dn.subtree="ou=people,dc=example,dc=org" attrs=userPassword by self write by anonymous auth by * none access to * by self write by * read This will disallow reading of the userPassword attribute, while still allowing users to change their own passwords. Additionally, you'll want to keep users from changing some of their own attributes. By default, users can change any attribute (except for those which the LDAP schemas themselves deny changes), such as uidNumber. To close this hole, modify the above to Read-only Attributes access to dn.subtree="ou=people,dc=example,dc=org" attrs=userPassword by self write by anonymous auth by * none access to attrs=homeDirectory,uidNumber,gidNumber by * read access to * by self write by * read This will stop users from being able to masquerade as other users. <systemitem class="username">root</systemitem> Account Definition Often the root or manager account for the LDAP service will be defined in the configuration file. OpenLDAP supports this, for example, and it works, but it can lead to trouble if slapd.conf is compromised. It may be better to use this only to bootstrap yourself into LDAP, and then define a root account there. Even better is to define accounts that have limited permissions, and omit a root account entirely. For example, users that can add or remove user accounts are added to one group, but they cannot themselves change the membership of this group. Such a security policy would help mitigate the effects of a leaked password. Creating a Management Group Say you want your IT department to be able to change home directories for users, but you do not want all of them to be able to add or remove users. 
The way to do this is to add a group for these admins: Creating a Management Group dn: cn=homemanagement,dc=example,dc=org objectClass: top objectClass: posixGroup cn: homemanagement gidNumber: 121 # required for posixGroup memberUid: uid=tuser,ou=people,dc=example,dc=org memberUid: uid=user2,ou=people,dc=example,dc=org And then change the permissions attributes in slapd.conf: ACLs for a Home Directory Management Group access to dn.subtree="ou=people,dc=example,dc=org" attr=homeDirectory by dn="cn=homemanagement,dc=example,dc=org" dnattr=memberUid write Now tuser and user2 can change other users' home directories. In this example we have given a subset of administrative power to certain users without giving them power in other domains. The idea is that soon no single user account has the power of a root account, but every power root had is had by at least one user. The root account then becomes unnecessary and can be removed. Password Storage By default OpenLDAP will store the value of the userPassword attribute as it stores any other data: in the clear. Most of the time it is base 64 encoded, which provides enough protection to keep an honest administrator from knowing your password, but little else. It is a good idea, then, to store passwords in a more secure format, such as SSHA (salted SHA). This is done by whatever program you use to change users' passwords. Useful Aids There are a few other programs that might be useful, particularly if you have many users and do not want to configure everything manually. security/pam_mkhomedir is a PAM module that always succeeds; its purpose is to create home directories for users which do not have them. If you have dozens of client servers and hundreds of users, it is much easier to use this and set up skeleton directories than to prepare every home directory. sysutils/cpu is a &man.pw.8;-like utility that can be used to manage users in the LDAP directory. You can call it directly, or wrap scripts around it. 
It can handle both TLS (with the flag) and SSL (directly). sysutils/ldapvi is a great utility for editing LDAP values in an LDIF-like syntax. The directory (or subsection of the directory) is presented in the editor chosen by the EDITOR environment variable. This makes it easy to enable large-scale changes in the directory without having to write a custom tool. security/openssh-portable has the ability to contact an LDAP server to verify SSH keys. This is extremely nice if you have many servers and do not want to copy your public keys across all of them. <application>OpenSSL</application> Certificates for LDAP If you are hosting two or more LDAP servers, you will probably not want to use self-signed certificates, since each client will have to be configured to work with each certificate. While this is possible, it is not nearly as simple as creating your own certificate authority, and signing your servers' certificates with that. The steps here are presented as they are with very little attempt at explaining what is going on—further explanation can be found in &man.openssl.1; and its friends. To create a certificate authority, we simply need a self-signed certificate and key. The steps for this again are Creating a Certificate &prompt.user; openssl genrsa -out root.key 1024 &prompt.user; openssl req -new -key root.key -out root.csr &prompt.user; openssl x509 -req -days 1024 -in root.csr -signkey root.key -out root.crt These will be your root CA key and certificate. You will probably want to encrypt the key and store it in a cool, dry place; anyone with access to it can masquerade as one of your LDAP servers. Next, using the first two steps above create a key ldap-server-one.key and certificate signing request ldap-server-one.csr. Once you sign the signing request with root.key, you will be able to use ldap-server-one.* on your LDAP servers. 
Do not forget to use the fully qualified domain name for the common name attribute when generating the certificate signing request; otherwise clients will reject a connection with you, and it can be very tricky to diagnose. To sign the key, use -CA and -CAkey instead of -signkey: Signing as a Certificate Authority &prompt.user; openssl x509 -req -days 1024 \ -in ldap-server-one.csr -CA root.crt -CAkey root.key \ -out ldap-server-one.crt The resulting file will be the certificate that you can use on your LDAP servers. Finally, for clients to trust all your servers, distribute root.crt (the certificate, not the key!) to each client, and specify it in the TLSCACertificateFile directive in ldap.conf.
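Before distributing a newly signed certificate, it may be worth confirming that it really chains to the root CA. The following is a minimal sketch, assuming the file names used in this section and an OpenSSL whose verify subcommand prints "OK" after the file name on success; the function name verify_server_cert is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if certificate $2 verifies against
# the CA certificate $1. The printed result is checked instead of the
# exit status, on the assumption that success is reported as "<file>: OK".
verify_server_cert() {
    ca_cert=$1
    server_cert=$2
    openssl verify -CAfile "$ca_cert" "$server_cert" 2>/dev/null | grep -q ': OK$'
}
```

For example, `verify_server_cert root.crt ldap-server-one.crt && echo "certificate chains to the root CA"` gives a quick check on the server before the same root.crt is copied to the clients.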
diff --git a/en_US.ISO8859-1/articles/linux-emulation/article.xml b/en_US.ISO8859-1/articles/linux-emulation/article.xml index d3e6c8742e..489b88168c 100644 --- a/en_US.ISO8859-1/articles/linux-emulation/article.xml +++ b/en_US.ISO8859-1/articles/linux-emulation/article.xml @@ -1,2545 +1,2545 @@
&linux; emulation in &os; Roman Divacky
rdivacky@FreeBSD.org
&tm-attrib.adobe; &tm-attrib.ibm; &tm-attrib.freebsd; &tm-attrib.linux; &tm-attrib.netbsd; &tm-attrib.realnetworks; &tm-attrib.oracle; &tm-attrib.sun; &tm-attrib.general; $FreeBSD$ $FreeBSD$ This master's thesis deals with updating the &linux; emulation layer (the so-called Linuxulator). The task was to update the layer to match the functionality of &linux; 2.6. As a reference implementation, the &linux; 2.6.16 kernel was chosen. The concept is loosely based on the NetBSD implementation. Most of the work was done in the summer of 2006 as a part of the Google Summer of Code student program. The focus was on bringing the NPTL (new &posix; thread library) support into the emulation layer, including TLS (thread local storage), futexes (fast user space mutexes), PID mangling, and some other minor things. Many small problems were identified and fixed in the process. My work was integrated into the main &os; source repository and will be shipped in the upcoming 7.0R release. We, the emulation development team, are working on making the &linux; 2.6 emulation the default emulation layer in &os;.
Introduction In the last few years the open source &unix; based operating systems started to be widely deployed on server and client machines. Among these operating systems I would like to point out two: &os;, for its BSD heritage, time proven code base and many interesting features, and &linux;, for its wide user base, enthusiastic open developer community and support from large companies. &os; tends to be used on server class machines serving heavy duty networking tasks with less usage on desktop class machines for ordinary users. &linux; has the same usage on servers, but it is used much more by home based users. This leads to a situation where there are many binary only programs available for &linux; that lack support for &os;. Naturally, a need for the ability to run &linux; binaries on a &os; system arises and this is what this thesis deals with: the emulation of the &linux; kernel in the &os; operating system. During the Summer of 2006 Google Inc. sponsored a project which focused on extending the &linux; emulation layer (the so-called Linuxulator) in &os; to include &linux; 2.6 facilities. This thesis is written as a part of this project. A look inside… In this section we are going to describe every operating system in question: how they deal with syscalls, trapframes etc., all the low-level stuff. We also describe the way they understand common &unix; primitives like what a PID is, what a thread is, etc. In the third subsection we talk about how &unix; on &unix; emulation could be done in general. What is &unix; &unix; is an operating system with a long history that has influenced almost every other operating system currently in use. Starting in the 1960s, its development continues to this day (although in different projects). &unix; development soon forked into two main branches: the BSDs and the System III/V families. They influenced each other, growing a common &unix; standard.
Among the contributions originating in BSD we can name virtual memory, TCP/IP networking, FFS, and many others. The System V branch contributed SysV interprocess communication primitives, copy-on-write, etc. &unix; itself does not exist any more but its ideas have been used by many other operating systems world wide thus forming the so-called &unix;-like operating systems. These days the most influential ones are &linux;, Solaris, and possibly (to some extent) &os;. There are in-company &unix; derivatives (AIX, HP-UX etc.), but these have been more and more migrated to the aforementioned systems. Let us summarize typical &unix; characteristics. Technical details Every running program constitutes a process that represents a state of the computation. A running process is divided between kernel-space and user-space. Some operations can be done only from kernel space (dealing with hardware etc.), but the process should spend most of its lifetime in the user space. The kernel is where the management of the processes, hardware, and low-level details takes place. The kernel provides a standard unified &unix; API to the user space. The most important ones are covered below. Communication between kernel and user space process Common &unix; API defines a syscall as a way to issue commands from a user space process to the kernel. The most common implementation is either by using an interrupt or a specialized instruction (think of SYSENTER/SYSCALL instructions for ia32). Syscalls are defined by a number. For example in &os;, the syscall number 85 is the &man.swapon.2; syscall and the syscall number 132 is &man.mkfifo.2;. Some syscalls need parameters, which are passed from the user-space to the kernel-space in various ways (implementation dependent). Syscalls are synchronous. Another possible way to communicate is by using a trap. Traps occur asynchronously after some event occurs (division by zero, page fault etc.).
A trap can be transparent for a process (page fault) or can result in a reaction like sending a signal (division by zero). Communication between processes There are other APIs (System V IPC, shared memory etc.) but the single most important API is signal. Signals are sent by processes or by the kernel and received by processes. Some signals can be ignored or handled by a user supplied routine, some result in a predefined action that cannot be altered or ignored. Process management The kernel instantiates the first process in the system (the so-called init). Every running process can create an identical copy of itself using the &man.fork.2; syscall. Some slightly modified versions of this syscall were introduced but the basic semantics are the same. Every running process can morph into some other process using the &man.exec.3; syscall. Some modifications of this syscall were introduced but all serve the same basic purpose. Processes end their lives by calling the &man.exit.2; syscall. Every process is identified by a unique number called PID. Every process has a defined parent (identified by its PID). Thread management Traditional &unix; does not define any API nor implementation for threading, while &posix; defines its threading API but the implementation is undefined. Traditionally there were two ways of implementing threads: handling them as separate processes (1:1 threading) or enveloping the whole thread group in one process and managing the threading in userspace (1:N threading). Comparing main features of each approach: 1:1 threading - heavyweight threads - the scheduling cannot be altered by the user (slightly mitigated by the &posix; API) + no syscall wrapping necessary + can utilize multiple CPUs 1:N threading + lightweight threads + scheduling can be easily altered by the user - syscalls must be wrapped - cannot utilize more than one CPU What is &os;? The &os; project is one of the oldest open source operating systems currently available for daily use.
It is a direct descendant of the genuine &unix; so it could be claimed that it is a true &unix; although licensing issues do not permit that. The start of the project dates back to the early 1990s when a crew of fellow BSD users patched the 386BSD operating system. Based on this patchkit a new operating system arose named &os; for its liberal license. Another group created the NetBSD operating system with different goals in mind. We will focus on &os;. &os; is a modern &unix;-based operating system with all the features of &unix;. Preemptive multitasking, multiuser facilities, TCP/IP networking, memory protection, symmetric multiprocessing support, virtual memory with merged VM and buffer cache: they are all there. One of the interesting and extremely useful features is the ability to emulate other &unix;-like operating systems. As of December 2006 and 7-CURRENT development, the following emulation functionalities are supported: &os;/i386 emulation on &os;/amd64 &os;/i386 emulation on &os;/ia64 &linux;-emulation of &linux; operating system on &os; NDIS-emulation of Windows networking drivers interface NetBSD-emulation of NetBSD operating system PECoff-support for PECoff &os; executables SVR4-emulation of System V revision 4 &unix; Actively developed emulations are the &linux; layer and various &os;-on-&os; layers. Others are not supposed to work properly nor be usable these days. Technical details &os; is a traditional flavor of &unix; in the sense of dividing the run of processes into two halves: kernel space and user space run. There are two types of process entry to the kernel: a syscall and a trap. There is only one way to return. In the subsequent sections we will describe the three gates to/from the kernel. The whole description applies to the i386 architecture as the Linuxulator only exists there but the concept is similar on other architectures. The information was taken from [1] and the source code.
System entries &os; has an abstraction called an execution class loader, which is a wedge into the &man.execve.2; syscall. This employs a structure sysentvec, which describes an executable ABI. It contains things like an errno translation table, a signal translation table, and various functions to serve syscall needs (stack fixup, coredumping, etc.). Every ABI the &os; kernel wants to support must define this structure, as it is used later in the syscall processing code and at some other places. System entries are handled by trap handlers, where we can access both the kernel-space and the user-space at once. Syscalls Syscalls on &os; are issued by executing interrupt 0x80 with register %eax set to a desired syscall number with arguments passed on the stack. When a process issues an interrupt 0x80, the int0x80 syscall trap handler is invoked (defined in sys/i386/i386/exception.s), which prepares arguments (i.e. copies them on to the stack) for a call to a C function &man.syscall.2; (defined in sys/i386/i386/trap.c), which processes the passed in trapframe. The processing consists of preparing the syscall (depending on the sysvec entry), determining if the syscall is a 32-bit or a 64-bit one (which changes the size of the parameters), then the parameters are copied, including the syscall. Next, the actual syscall function is executed with processing of the return code (special cases for ERESTART and EJUSTRETURN errors). Finally a userret() is scheduled, switching the process back to the userspace. The parameters to the actual syscall handler are passed in the form of struct thread *td, struct syscall args * arguments where the second parameter is a pointer to the copied in structure of parameters. Traps Handling of traps in &os; is similar to the handling of syscalls. Whenever a trap occurs, an assembler handler is called. It is chosen from among alltraps, alltraps with regs pushed or calltrap depending on the type of the trap.
This handler prepares arguments for a call to a C function trap() (defined in sys/i386/i386/trap.c), which then processes the occurred trap. After the processing it might send a signal to the process and/or exit to userland using userret(). Exits Exits from kernel to userspace happen using the assembler routine doreti regardless of whether the kernel was entered via a trap or via a syscall. This restores the program status from the stack and returns to the userspace. &unix; primitives &os; operating system adheres to the traditional &unix; scheme, where every process has a unique identification number, the so called PID (Process ID). PID numbers are allocated either linearly or randomly ranging from 0 to PID_MAX. The allocation of PID numbers is done using linear searching of PID space. Every thread in a process receives the same PID number as result of the &man.getpid.2; call. There are currently two ways to implement threading in &os;. The first way is M:N threading followed by the 1:1 threading model. The default library used is M:N threading (libpthread) and you can switch at runtime to 1:1 threading (libthr). The plan is to switch to 1:1 library by default soon. Although those two libraries use the same kernel primitives, they are accessed through different API(es). The M:N library uses the kse_* family of syscalls while the 1:1 library uses the thr_* family of - syscalls. Because of this, there is no general concept of + syscalls. Due to this, there is no general concept of thread ID shared between kernel and userspace. Of course, both threading libraries implement the pthread thread ID API. Every kernel thread (as described by struct thread) has td tid identifier but this is not directly accessible from userland and solely serves the kernel's needs. It is also used for 1:1 threading library as pthread's thread ID but handling of this is internal to the library and cannot be relied on. As stated previously there are two implementations of threading in &os;. 
The M:N library divides the work between kernel space and userspace. A thread is an entity that gets scheduled in the kernel but it can represent a varying number of userspace threads. M userspace threads get mapped to N kernel threads thus saving resources while keeping the ability to exploit multiprocessor parallelism. Further information about the implementation can be obtained from the man page or [1]. The 1:1 library directly maps a userland thread to a kernel thread thus greatly simplifying the scheme. Neither of these designs implements a fairness mechanism (such a mechanism was implemented but it was removed recently because it caused serious slowdown and made the code more difficult to deal with). What is &linux; &linux; is a &unix;-like kernel originally developed by Linus Torvalds, and now being contributed to by a massive crowd of programmers all around the world. From its humble beginnings to today, with wide support from companies such as IBM or Google, &linux; is associated with its fast development pace, full hardware support and benevolent dictator model of organization. &linux; development started in 1991 as a hobbyist project at University of Helsinki in Finland. Since then it has obtained all the features of a modern &unix;-like OS: multiprocessing, multiuser support, virtual memory, networking, basically everything is there. There are also highly advanced features like virtualization etc. As of 2006 &linux; seems to be the most widely used open source operating system with support from independent software vendors like Oracle, RealNetworks, Adobe, etc. Most of the commercial software distributed for &linux; can only be obtained in a binary form so recompilation for other operating systems is impossible. Most of the &linux; development happens in a Git version control system. Git is a distributed system so there is no central source of the &linux; code, but some branches are considered prominent and official.
The version number scheme implemented by &linux; consists of four numbers A.B.C.D. Currently development happens in 2.6.C.D, where C represents the major version, where new features are added or changed, while D is a minor version for bugfixes only. More information can be obtained from [3]. Technical details &linux; follows the traditional &unix; scheme of dividing the run of a process in two halves: the kernel and user space. The kernel can be entered in two ways: via a trap or via a syscall. The return is handled only in one way. The further description applies to &linux; 2.6 on the &i386; architecture. This information was taken from [2]. Syscalls Syscalls in &linux; are performed (in userspace) using syscallX macros where X substitutes a number representing the number of parameters of the given syscall. This macro translates to code that loads the %eax register with the number of the syscall and executes interrupt 0x80. After this syscall return is called, which translates negative return values to positive errno values and sets res to -1 in case of an error. Whenever the interrupt 0x80 is called the process enters the kernel in the system call trap handler. This routine saves all registers on the stack and calls the selected syscall entry. Note that the &linux; calling convention expects parameters to the syscall to be passed via registers as shown here: first parameter -> %ebx, second parameter -> %ecx, third parameter -> %edx, fourth parameter -> %esi, fifth parameter -> %edi, sixth parameter -> %ebp. There are some exceptions to this, where &linux; uses a different calling convention (most notably the clone syscall). Traps The trap handlers are introduced in arch/i386/kernel/traps.c and most of these handlers live in arch/i386/kernel/entry.S, where handling of the traps happens. Exits Return from the syscall is managed by syscall_exit, which checks for the process having unfinished work, then checks whether we used user-supplied selectors.
If this happens stack fixing is applied and finally the registers are restored from the stack and the process returns to the userspace. &unix; primitives In the 2.6 version, the &linux; operating system redefined some of the traditional &unix; primitives, notably PID, TID and thread. PID is defined not to be unique for every process, so for some processes (threads) &man.getppid.2; returns the same value. Unique identification of process is provided by TID. This is because NPTL (New &posix; Thread Library) defines threads to be normal processes (so called 1:1 threading). Spawning a new process in &linux; 2.6 happens using the clone syscall (fork variants are reimplemented using it). This clone syscall defines a set of flags that affect behavior of the cloning process regarding thread implementation. The semantic is a bit fuzzy as there is no single flag telling the syscall to create a thread. Implemented clone flags are: CLONE_VM - processes share their memory space CLONE_FS - share umask, cwd and namespace CLONE_FILES - share open files CLONE_SIGHAND - share signal handlers and blocked signals CLONE_PARENT - share parent CLONE_THREAD - be thread (further explanation below) CLONE_NEWNS - new namespace CLONE_SYSVSEM - share SysV undo structures CLONE_SETTLS - setup TLS at supplied address CLONE_PARENT_SETTID - set TID in the parent CLONE_CHILD_CLEARTID - clear TID in the child CLONE_CHILD_SETTID - set TID in the child CLONE_PARENT sets the real parent to the parent of the caller. This is useful for threads because if thread A creates thread B we want thread B to be parented to the parent of the whole thread group. CLONE_THREAD does exactly the same thing as CLONE_PARENT, CLONE_VM and CLONE_SIGHAND, rewrites PID to be the same as PID of the caller, sets exit signal to be none and enters the thread group. CLONE_SETTLS sets up GDT entries for TLS handling. The CLONE_*_*TID set of flags sets/clears user supplied address to TID or 0. 
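The flag semantics described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the bit values are the ones found in the &linux; 2.6 headers, but they are redefined locally with an EX_ prefix so the fragment builds on any system, and the helper functions ex_nptl_thread_flags() and ex_child_pid() are hypothetical names that do not exist in &linux; or &os;.

```c
#include <assert.h>

/* Bit values as in the Linux 2.6 headers; the EX_ prefix is illustrative. */
#define EX_CLONE_VM             0x00000100UL
#define EX_CLONE_FS             0x00000200UL
#define EX_CLONE_FILES          0x00000400UL
#define EX_CLONE_SIGHAND        0x00000800UL
#define EX_CLONE_THREAD         0x00010000UL
#define EX_CLONE_SYSVSEM        0x00040000UL
#define EX_CLONE_SETTLS         0x00080000UL
#define EX_CLONE_PARENT_SETTID  0x00100000UL
#define EX_CLONE_CHILD_CLEARTID 0x00200000UL

/* Hypothetical helper: the flag set a threading library would pass
 * to clone when it creates a new thread in the caller's group. */
unsigned long
ex_nptl_thread_flags(void)
{
	return (EX_CLONE_VM | EX_CLONE_FS | EX_CLONE_FILES |
	    EX_CLONE_SIGHAND | EX_CLONE_THREAD | EX_CLONE_SETTLS |
	    EX_CLONE_PARENT_SETTID | EX_CLONE_CHILD_CLEARTID |
	    EX_CLONE_SYSVSEM);
}

/* Hypothetical helper: the PID the new task reports. With CLONE_THREAD
 * the PID is rewritten to the caller's PID; without it the new task's
 * PID equals its freshly allocated TID. */
long
ex_child_pid(unsigned long flags, long caller_pid, long new_tid)
{
	assert(new_tid > 0);	/* every task gets its own TID */
	if (flags & EX_CLONE_THREAD)
		return (caller_pid);
	return (new_tid);
}
```

With these definitions, a task created with the thread flag set reports the PID of its creator while keeping its own TID, whereas a task created without EX_CLONE_THREAD gets a PID equal to its TID.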
As you can see the CLONE_THREAD does most of the work and does not seem to fit the scheme very well. The original intention is unclear (even to the authors, according to comments in the code) but I think originally there was one threading flag, which was then parcelled among many other flags but this separation was never fully finished. It is also unclear what this partition is good for as glibc does not use it, so only hand-written use of the clone permits a programmer to access these features. For non-threaded programs the PID and TID are the same. For threaded programs the first thread's PID and TID are the same and every created thread shares the same PID and gets assigned a unique TID (because CLONE_THREAD is passed in); the parent is also shared by all processes forming this threaded program. The code that implements &man.pthread.create.3; in NPTL defines the clone flags like this: int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGNAL | CLONE_SETTLS | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM #if __ASSUME_NO_CLONE_DETACHED == 0 | CLONE_DETACHED #endif | 0); The CLONE_SIGNAL is defined like #define CLONE_SIGNAL (CLONE_SIGHAND | CLONE_THREAD); the last 0 means no signal is sent when any of the threads exits. What is emulation According to a dictionary definition, emulation is the ability of a program or device to imitate another program or device. This is achieved by providing the same reaction to a given stimulus as the emulated object. In practice, the software world mostly sees three types of emulation: a program used to emulate a machine (QEMU, various game console emulators etc.), software emulation of a hardware facility (OpenGL emulators, floating point unit emulation etc.) and operating system emulation (either in the kernel of the operating system or as a userspace program). Emulation is usually used where using the original component is not feasible or not possible at all.
For example someone might want to use a program developed for a different operating system than they use. Then emulation comes in handy. Sometimes there is no other way but to use emulation - e.g. when the hardware device you try to use does not exist (yet/anymore) then there is no other way but emulation. This happens often when porting an operating system to a new (non-existent) platform. Sometimes it is just cheaper to emulate. Looking from an implementation point of view, there are two main approaches to the implementation of emulation. You can either emulate the whole thing - accepting possible inputs of the original object, maintaining inner state and emitting correct output based on the state and/or input. This kind of emulation does not require any special conditions and basically can be implemented anywhere for any device/program. The drawback is that implementing such emulation is quite difficult, time-consuming and error-prone. In some cases we can use a simpler approach. Imagine you want to emulate a printer that prints from left to right on a printer that prints from right to left. It is obvious that there is no need for a complex emulation layer but simply reversing of the printed text is sufficient. Sometimes the emulating environment is very similar to the emulated one so just a thin layer of some translation is necessary to provide fully working emulation! As you can see this is much less demanding to implement, so less time-consuming and error-prone than the previous approach. But the necessary condition is that the two environments must be similar enough. The third approach combines the two previous. Most of the time the objects do not provide the same capabilities so in a case of emulating the more powerful one on the less powerful we have to emulate the missing features with full emulation described above. 
This master's thesis deals with emulation of &unix; on &unix;, which is exactly the case where only a thin layer of translation is sufficient to provide full emulation. The &unix; API consists of a set of syscalls, which are usually self contained and do not affect some global kernel state. There are a few syscalls that affect inner state but this can be dealt with by providing some structures that maintain the extra state. No emulation is perfect and emulations tend to lack some parts but this usually does not cause any serious drawbacks. Imagine a game console emulator that emulates everything but music output. No doubt that the games are playable and one can use the emulator. It might not be as comfortable as the original game console but it is an acceptable compromise between price and comfort. The same goes for the &unix; API. Most programs can live with a very limited set of syscalls working. Those syscalls tend to be the oldest ones (&man.read.2;/&man.write.2;, &man.fork.2; family, &man.signal.3; handling, &man.exit.3;, &man.socket.2; API) hence they are easy to emulate because their semantics are shared among all &unix;es that exist today. Emulation How emulation works in &os; As stated earlier, &os; supports running binaries from several other &unix;es. This works because &os; has an abstraction called the execution class loader. This wedges into the &man.execve.2; syscall, so when &man.execve.2; is about to execute a binary it examines its type. There are basically two types of binaries in &os;: shell-like text scripts, which are identified by #! as their first two characters, and normal (typically ELF) binaries, which are a representation of a compiled executable object. The vast majority (one could say all of them) of binaries in &os; are of type ELF. ELF files contain a header, which specifies the OS ABI for this ELF file. By reading this information, the operating system can accurately determine what type of binary the given file is.
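The type examination described above can be sketched in a few lines of userland C. This is a simplified illustration, not the kernel's code: it looks only at the #! prefix and at the EI_OSABI byte of the ELF identification array, and it assumes that no other information is needed to tell the ABIs apart. The function names ex_elf_osabi() and ex_classify() and the EX_ constants are hypothetical; the numeric values (byte offset 7, 3 for &linux;, 9 for &os;) are those assigned by the ELF specification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EX_EI_NIDENT        16	/* size of the ELF identification array */
#define EX_EI_OSABI         7	/* index of the OS ABI byte */
#define EX_ELFOSABI_LINUX   3
#define EX_ELFOSABI_FREEBSD 9

/* Return the OS ABI byte of an ELF image, or -1 if buf is not ELF. */
int
ex_elf_osabi(const unsigned char *buf, size_t len)
{
	static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

	if (buf == NULL || len < EX_EI_NIDENT)
		return (-1);
	if (memcmp(buf, magic, sizeof(magic)) != 0)
		return (-1);
	return (buf[EX_EI_OSABI]);
}

/* Classify the first bytes of a file the way the text describes. */
const char *
ex_classify(const unsigned char *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	if (len >= 2 && buf[0] == '#' && buf[1] == '!')
		return ("script");
	switch (ex_elf_osabi(buf, len)) {
	case -1:
		return ("unknown");
	case EX_ELFOSABI_FREEBSD:
		return ("freebsd");
	case EX_ELFOSABI_LINUX:
		return ("linux");
	default:
		return ("other-elf");
	}
}
```

Passing the first sixteen bytes of a file to ex_classify() is enough for this sketch; a real loader has to handle more cases, for example binaries whose OS ABI byte is left at zero.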
Every OS ABI must be registered in the &os; kernel. This applies to the &os; native OS ABI, as well. So when &man.execve.2; executes a binary it iterates through the list of registered ABIs and when it finds the right one it starts to use the information contained in the OS ABI description (its syscall table, errno translation table, etc.). So every time the process calls a syscall, it uses its own set of syscalls instead of some global one. This effectively provides a very elegant and easy way of supporting execution of various binary formats. The nature of emulation of different OSes (and also some other subsystems) led developers to invent an event handler mechanism. There are various places in the kernel where a list of event handlers is called. Every subsystem can register an event handler and they are called accordingly. For example, when a process exits there is a handler called that possibly cleans up whatever the subsystem needs to be cleaned. Those simple facilities provide basically everything that is needed for the emulation infrastructure and in fact these are basically the only things necessary to implement the &linux; emulation layer. Common primitives in the &os; kernel Emulation layers need some support from the operating system. I am going to describe some of the supported primitives in the &os; operating system. Locking primitives Contributed by: &a.attilio.email; The &os; synchronization primitive set is based on the idea of supplying a rather large number of different primitives so that the best one can be used for every particular, appropriate situation. From a high-level point of view you can consider three kinds of synchronization primitives in the &os; kernel: atomic operations and memory barriers locks scheduling barriers Below there are descriptions for the 3 families. For every lock, you should really check the linked manpage (where possible) for more detailed explanations.
Atomic operations and memory barriers Atomic operations are implemented through a set of functions performing simple arithmetic on memory operands in an atomic way with respect to external events (interrupts, preemption, etc.). Atomic operations can guarantee atomicity just on small data types (on the order of the size of the architecture's C long data type), so they should rarely be used directly in end-level code, except for very simple operations (like flag setting in a bitmap, for example). In fact, it is rather simple and common to write incorrect semantics based on just atomic operations (usually referred to as lock-less). The &os; kernel offers a way to perform atomic operations in conjunction with a memory barrier. The memory barriers will guarantee that an atomic operation will happen following some specified ordering with respect to other memory accesses. For example, if we need an atomic operation to happen just after all other pending writes (in terms of instructions reordering buffers activities) are completed, we need to explicitly use a memory barrier in conjunction with this atomic operation. So it is simple to understand why memory barriers play a key role in building higher-level locks (such as refcounts, mutexes, etc.). For a detailed explanation of atomic operations, please refer to &man.atomic.9;. It is worth noting, however, that atomic operations (and memory barriers as well) should ideally only be used for building front-end locks (such as mutexes). Refcounts Refcounts are interfaces for handling reference counters. They are implemented through atomic operations and are intended to be used just for cases where the reference counter is the only thing to be protected, so even something like a spin-mutex is discouraged. Using the refcount interface for structures where a mutex is already used is often wrong since we should probably enclose the reference counter in some already protected paths.
A manpage discussing refcount does not exist currently, just check sys/refcount.h for an overview of the existing API. Locks The &os; kernel has many classes of locks. Every lock is defined by some peculiar properties, but probably the most important is the event linked to contesting holders (or in other terms, the behavior of threads unable to acquire the lock). &os;'s locking scheme presents three different behaviors for contenders (the numbers are not arbitrary): (1) spinning, (2) blocking, (3) sleeping. Spinning locks Spin locks let waiters spin for as long as they cannot acquire the lock. An important point to deal with is that a thread contending on a spin lock is not descheduled. Since the &os; kernel is preemptive, this exposes spin locks to the risk of deadlocks, which can be solved only by disabling interrupts while they are held. For this and other reasons (like lack of priority propagation support, poor load balancing between CPUs, etc.), spin locks are intended to protect very small paths of code, or ideally not to be used at all if not explicitly requested (explained later). Blocking Block locks let waiters be descheduled and blocked until the lock owner drops the lock and wakes up one or more contenders. In order to avoid starvation issues, blocking locks do priority propagation from the waiters to the owner. Block locks must be implemented through the turnstile interface and are intended to be the most used kind of locks in the kernel, if no particular conditions are met. Sleeping Sleep locks let waiters be descheduled and fall asleep until the lock holder drops the lock and wakes up one or more waiters. Since sleep locks are intended to protect large paths of code and to cater for asynchronous events, they do not do any form of priority propagation. They must be implemented through the &man.sleepqueue.9; interface.
The order used to acquire locks is very important, not only because of the possibility of deadlock due to lock order reversals, but also because lock acquisition should follow specific rules linked to the nature of the locks. Looking at the list above, the practical rule is that if a thread holds a lock of level n (where the level is the number listed next to the kind of lock) it is not allowed to acquire a lock of a higher level, since this would break the specified semantic for a path. For example, if a thread holds a block lock (level 2), it is allowed to acquire a spin lock (level 1) but not a sleep lock (level 3), since block locks are intended to protect smaller paths than sleep locks (these rules are not about atomic operations or scheduling barriers, however). This is a list of locks with their respective behaviors: spin mutex - spinning - &man.mutex.9; sleep mutex - blocking - &man.mutex.9; pool mutex - blocking - &man.mtx.pool.9; sleep family - sleeping - &man.sleep.9; pause tsleep msleep msleep spin msleep rw msleep sx condvar - sleeping - &man.condvar.9; rwlock - blocking - &man.rwlock.9; sxlock - sleeping - &man.sx.9; lockmgr - sleeping - &man.lockmgr.9; semaphores - sleeping - &man.sema.9; Among these locks only mutexes, sxlocks, rwlocks and lockmgrs are intended to handle recursion, but currently recursion is only supported by mutexes and lockmgrs. Scheduling barriers Scheduling barriers are intended to be used in order to drive scheduling of threads. They consist mainly of three different stubs: critical sections (and preemption) sched_bind sched_pin Generally, these should be used only in a particular context and even if they can often replace locks, they should be avoided because they prevent problems from being diagnosed with locking debugging tools (such as &man.witness.4;). Critical sections The &os; kernel has been made preemptive basically to deal with interrupt threads.
In fact, in order to avoid high interrupt latency, time-sharing priority threads can be preempted by interrupt threads (in this way, interrupt threads do not need to wait to be scheduled as the normal path would require). Preemption, however, introduces new racing points that need to be handled as well. Often, the simplest way to deal with preemption is to disable it completely. A critical section defines a piece of code (delimited by the pair of functions &man.critical.enter.9; and &man.critical.exit.9;) where preemption is guaranteed not to happen (until the protected code is fully executed). This can often replace a lock effectively but should be used carefully in order not to lose the whole advantage that preemption brings. sched_pin/sched_unpin Another way to deal with preemption is the sched_pin() interface. If a piece of code is enclosed in the sched_pin() and sched_unpin() pair of functions, it is guaranteed that the respective thread, even if it can be preempted, will always be executed on the same CPU. Pinning is very effective in the particular case when we have to access per-CPU data and we assume other threads will not change those data. Under that assumption a critical section would be a stronger condition than our code needs. sched_bind/sched_unbind sched_bind is an API used to bind a thread to a particular CPU for as long as it executes the code, until a sched_unbind function call unbinds it. This feature has a key role in situations where you cannot trust the current state of CPUs (for example, at very early stages of boot), as you want to prevent your thread from migrating to inactive CPUs. Since sched_bind and sched_unbind manipulate internal scheduler structures, they need to be enclosed in sched_lock acquisition/releasing when used. Proc structure Various emulation layers sometimes require some additional per-process data. An emulation layer can manage separate structures (a list, a tree etc.) 
containing these data for every process but this tends to be slow and memory consuming. To solve this problem the &os; proc structure contains p_emuldata, which is a void pointer to some emulation layer specific data. This proc entry is protected by the proc mutex. The &os; proc structure contains a p_sysent entry that identifies which ABI this process is running. In fact, it is a pointer to the sysentvec described above. So by comparing this pointer to the address where the sysentvec structure for the given ABI is stored we can effectively determine whether the process belongs to our emulation layer. The code typically looks like: if (__predict_true(p->p_sysent != &elf_linux_sysvec)) return; As you can see, we use the __predict_true modifier to collapse the most common case (&os; process) to a simple return operation, thus preserving high performance. This code should be turned into a macro because currently it is not very flexible, i.e. we do not support &linux;64 emulation nor A.OUT &linux; processes on i386. VFS The &os; VFS subsystem is very complex but the &linux; emulation layer uses just a small subset via a well defined API. It can operate either on vnodes or on file handlers. A vnode represents a virtual node, i.e. a representation of a node in VFS. Another representation is a file handler, which represents an opened file from the perspective of a process. A file handler can represent a socket or an ordinary file. A file handler contains a pointer to its vnode. More than one file handler can point to the same vnode. namei The &man.namei.9; routine is a central entry point to pathname lookup and translation. It traverses the path component by component from the starting point to the end point using the lookup function, which is internal to VFS. The &man.namei.9; routine can cope with symlinks, absolute and relative paths. When a path is looked up using &man.namei.9; it is entered into the name cache. This behavior can be suppressed. 
This routine is used all over the kernel and its performance is critical. vn_fullpath The &man.vn.fullpath.9; function makes a best effort to traverse the VFS name cache and return a path for a given (locked) vnode. This process is unreliable but works just fine for the most common cases. The unreliability comes from the fact that it relies on the VFS cache (it does not traverse the on-medium structures), it does not work with hardlinks, etc. This routine is used in several places in the Linuxulator. Vnode operations fgetvp - given a thread and a file descriptor number it returns the associated vnode &man.vn.lock.9; - locks a vnode vn_unlock - unlocks a vnode &man.VOP.READDIR.9; - reads a directory referenced by a vnode &man.VOP.GETATTR.9; - gets attributes of a file or a directory referenced by a vnode &man.VOP.LOOKUP.9; - looks up a path to a given directory &man.VOP.OPEN.9; - opens a file referenced by a vnode &man.VOP.CLOSE.9; - closes a file referenced by a vnode &man.vput.9; - decrements the use count for a vnode and unlocks it &man.vrele.9; - decrements the use count for a vnode &man.vref.9; - increments the use count for a vnode File handler operations fget - given a thread and a file descriptor number it returns the associated file handler and references it fdrop - drops a reference to a file handler fhold - references a file handler &linux; emulation layer -MD part This section deals with the implementation of the &linux; emulation layer in the &os; operating system. It first describes the machine dependent part, talking about how and where interaction between userland and kernel is implemented. It talks about syscalls, signals, ptrace, traps, and stack fixup. This part discusses i386 but it is written generally so other architectures should not differ very much. The next part is the machine independent part of the Linuxulator. This section only covers i386 and ELF handling. A.OUT is obsolete and untested. 
Syscall handling Syscall handling is mostly written in linux_sysvec.c, which covers most of the routines pointed out in the sysentvec structure. When a &linux; process running on &os; issues a syscall, the general syscall routine calls the &linux; prepsyscall routine for the &linux; ABI. &linux; prepsyscall &linux; passes arguments to syscalls via registers (that is why it is limited to 6 parameters on i386) while &os; uses the stack. The &linux; prepsyscall routine must copy parameters from registers to the stack. The order of the registers is: %ebx, %ecx, %edx, %esi, %edi, %ebp. The catch is that this is true for only most of the syscalls. Some (most notably clone) use a different order, but this is luckily easy to fix by inserting a dummy parameter in the linux_clone prototype. Syscall writing Every syscall implemented in the Linuxulator must have its prototype with various flags in syscalls.master. The form of the file is: ... 2 AUE_FORK STD { int linux_fork(void); } ... 6 AUE_CLOSE NOPROTO { int close(int fd); } ... The first column represents the syscall number. The second column is for auditing support. The third column represents the syscall type. It is one of STD, OBSOL, NOPROTO or UNIMPL. STD is a standard syscall with full prototype and implementation. OBSOL is obsolete and defines just the prototype. NOPROTO means that the syscall is implemented elsewhere so do not prepend ABI prefix, etc. UNIMPL means that the syscall will be substituted with the nosys syscall (a syscall just printing out a message about the syscall not being implemented and returning ENOSYS). From syscalls.master a script generates three files: linux_syscall.h, linux_proto.h and linux_sysent.c. The linux_syscall.h contains definitions of syscall names and their numerical value, e.g.: ... #define LINUX_SYS_linux_fork 2 ... #define LINUX_SYS_close 6 ... 
The linux_proto.h contains structure definitions of arguments to every syscall, e.g.: struct linux_fork_args { register_t dummy; }; And finally, linux_sysent.c contains the structure describing the system entry table, used to actually dispatch a syscall, e.g.: { 0, (sy_call_t *)linux_fork, AUE_FORK, NULL, 0, 0 }, /* 2 = linux_fork */ { AS(close_args), (sy_call_t *)close, AUE_CLOSE, NULL, 0, 0 }, /* 6 = close */ As you can see linux_fork is implemented in the Linuxulator itself so the definition is of STD type and has no argument, which is exhibited by the dummy argument structure. On the other hand close is just an alias for the real &os; &man.close.2; so it has no linux arguments structure associated and in the system entry table it is not prefixed with linux as it calls the real &man.close.2; in the kernel. Dummy syscalls The &linux; emulation layer is not complete, as some syscalls are not implemented properly and some are not implemented at all. The emulation layer employs a facility to mark unimplemented syscalls with the DUMMY macro. These dummy definitions reside in linux_dummy.c in the form DUMMY(syscall);, which is then translated to various syscall auxiliary files, and the implementation consists of printing a message saying that this syscall is not implemented. The UNIMPL prototype is not used because we want to be able to identify the name of the syscall that was called in order to know which syscalls are more important to implement. Signal handling Signal handling is done generally in the &os; kernel for all binary compatibilities with a call to a compat-dependent layer. The &linux; compatibility layer defines the linux_sendsig routine for this purpose. &linux; sendsig This routine first checks whether the signal has been installed with SA_SIGINFO, in which case it calls the linux_rt_sendsig routine instead. Furthermore, it allocates (or reuses an already existing) signal handle context, then it builds a list of arguments for the signal handler. 
It translates the signal number based on the signal translation table, assigns a handler, and translates the sigset. Then it saves context for the sigreturn routine (various registers, translated trap number and signal mask). Finally, it copies out the signal context to the userspace and prepares context for the actual signal handler to run. linux_rt_sendsig This routine is similar to linux_sendsig; only the signal context preparation is different. It adds siginfo, ucontext, and some &posix; parts. It might be worth considering whether those two functions could be merged, with the benefit of less code duplication and possibly even faster execution. linux_sigreturn This syscall is used for return from the signal handler. It does some security checks and restores the original process context. It also unmasks the signal in the process signal mask. Ptrace Many &unix; derivatives implement the &man.ptrace.2; syscall in order to allow various tracking and debugging features. This facility enables the tracing process to obtain various information about the traced process, like register dumps, any memory from the process address space, etc. and also to trace the process, for example by stepping over an instruction or between system entries (syscalls and traps). &man.ptrace.2; also lets you set various information in the traced process (registers etc.). &man.ptrace.2; is a &unix;-wide standard implemented in most &unix;es around the world. &linux; emulation in &os; implements the &man.ptrace.2; facility in linux_ptrace.c. It contains the routines for converting registers between &linux; and &os; and the actual &man.ptrace.2; syscall emulation. The syscall is a long switch block that implements its counterpart in &os; for every &man.ptrace.2; command. The &man.ptrace.2; commands are mostly the same between &linux; and &os; so usually just a small modification is needed. 
For example, PT_GETREGS in &linux; operates on direct data while &os; uses a pointer to the data so after performing a (native) &man.ptrace.2; syscall, a copyout must be done to preserve &linux; semantics. The &man.ptrace.2; implementation in the Linuxulator has some known weaknesses. There have been panics seen when using strace (which is a &man.ptrace.2; consumer) in the Linuxulator environment. Also PT_SYSCALL is not implemented. Traps Whenever a &linux; process running in the emulation layer traps, the trap itself is handled transparently with the only exception of the trap translation. &linux; and &os; differ in opinion on what a trap is so this is dealt with here. The code is actually very short: static int translate_traps(int signal, int trap_code) { if (signal != SIGBUS) return signal; switch (trap_code) { case T_PROTFLT: case T_TSSFLT: case T_DOUBLEFLT: case T_PAGEFLT: return SIGSEGV; default: return signal; } } Stack fixup The RTLD run-time link-editor expects so called AUX tags on the stack during an execve so a fixup must be done to ensure this. Of course, every RTLD system is different so the emulation layer must provide its own stack fixup routine to do this. So does the Linuxulator. The elf_linux_fixup simply copies out AUX tags to the stack and adjusts the stack of the user space process to point right after those tags. So RTLD works in a smart way. A.OUT support The &linux; emulation layer on i386 also supports &linux; A.OUT binaries. Pretty much everything described in the previous sections must be implemented for A.OUT support (besides trap translation and signal sending). The support for A.OUT binaries is no longer maintained; in particular, the 2.6 emulation does not work with it, but this does not cause any problem, as the linux-base in ports probably does not support A.OUT binaries at all. This support will probably be removed in the future. Most of the stuff necessary for loading &linux; A.OUT binaries is in the imgact_linux.c file. 
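Returning to the trap translation shown earlier in this section, the same logic can be exercised from userland. This is only a sketch: the trap codes below are hypothetical stand-ins (the real ones come from the kernel's trap header and are not reproduced here), T_OTHER and translate_traps_example are made up for illustration, and the first parameter is renamed to sig to avoid shadowing signal().

```c
#include <assert.h>
#include <signal.h>

/*
 * Stand-in trap codes. The real values live in the kernel headers; the
 * numbers here are hypothetical and only need to be distinct.
 */
enum {
	T_PROTFLT = 1,
	T_PAGEFLT,
	T_DOUBLEFLT,
	T_TSSFLT,
	T_OTHER		/* hypothetical: any trap not listed above */
};

/* Same decision logic as the translate_traps() quoted in the text. */
int
translate_traps(int sig, int trap_code)
{
	/* Only SIGBUS is ever rewritten; other signals pass through. */
	if (sig != SIGBUS)
		return (sig);

	switch (trap_code) {
	case T_PROTFLT:
	case T_TSSFLT:
	case T_DOUBLEFLT:
	case T_PAGEFLT:
		/* These faults are reported to the Linux process as SIGSEGV. */
		return (SIGSEGV);
	default:
		return (sig);
	}
}

/* Small usage example. */
void
translate_traps_example(void)
{
	assert(translate_traps(SIGBUS, T_PAGEFLT) == SIGSEGV);
	assert(translate_traps(SIGBUS, T_OTHER) == SIGBUS);
}
```

The point of the early return is that the trap code is consulted only for SIGBUS, so every other signal is delivered unchanged.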
&linux; emulation layer -MI part This section talks about the machine independent part of the Linuxulator. It covers the emulation infrastructure needed for &linux; 2.6 emulation, the thread local storage (TLS) implementation (on i386) and futexes. Then we talk briefly about some syscalls. Description of NPTL One of the major areas of progress in development of &linux; 2.6 was threading. Prior to 2.6, the &linux; threading support was implemented in the linuxthreads library. The library was a partial implementation of &posix; threading. The threading was implemented using separate processes for each thread using the clone syscall to let them share the address space (and other things). The main weaknesses of this approach were that every thread had a different PID, signal handling was broken (from the pthreads perspective), etc. Also the performance was not very good (use of SIGUSR signals for thread synchronization, kernel resource consumption, etc.) so to overcome these problems a new threading system was developed and named NPTL. The NPTL library focused on two things but a third thing came along so it is usually considered a part of NPTL. Those two things were embedding of threads into a process structure and futexes. The additional third thing was TLS, which is not directly required by NPTL but the whole NPTL userland library depends on it. Those improvements yielded much improved performance and standards conformance. NPTL is a standard threading library in &linux; systems these days. The &os; Linuxulator implementation approaches NPTL in three main areas: TLS, futexes and PID mangling, which is meant to simulate the &linux; threads. Further sections describe each of these areas. &linux; 2.6 emulation infrastructure These sections deal with the way &linux; threads are managed and how we simulate that in &os;. Runtime determining of 2.6 emulation The &linux; emulation layer in &os; supports runtime setting of the emulated version. 
This is done via &man.sysctl.8;, namely compat.linux.osrelease. Setting this &man.sysctl.8; affects runtime behavior of the emulation layer. When set to 2.6.x it sets the value of linux_use_linux26 while setting it to something else keeps it unset. This variable (plus per-prison variables of the very same kind) determines whether 2.6 infrastructure (mainly PID mangling) is used in the code or not. The version setting is done system-wide and this affects all &linux; processes. The &man.sysctl.8; should not be changed while any &linux; binary is running as it might harm things. &linux; processes and thread identifiers The semantics of &linux; threading are a little confusing and use entirely different nomenclature from &os;. A process in &linux; consists of a struct task embedding two identifier fields - PID and TGID. PID is not a process ID but a thread ID. The TGID identifies a thread group, in other words a process. For a single-threaded process the PID equals the TGID. The thread in NPTL is just an ordinary process that happens to have TGID not equal to PID and have a group leader not equal to itself (and shared VM etc. of course). Everything else happens in the same way as to an ordinary process. There is no separation of a shared status to some external structure like in &os;. This creates some duplication of information and possible data inconsistency. The &linux; kernel seems to use task -> group information in some places and task information elsewhere and it is really not very consistent and looks error-prone. Every NPTL thread is created by a call to the clone syscall with a specific set of flags (more in the next subsection). The NPTL implements strict 1:1 threading. In &os; we emulate NPTL threads with ordinary &os; processes that share VM space, etc. and the PID gymnastics are just mimicked in the emulation specific structure attached to the process. 
The structure attached to the process looks like: struct linux_emuldata { pid_t pid; int *child_set_tid; /* in clone(): Child's TID to set on clone */ int *child_clear_tid;/* in clone(): Child's TID to clear on exit */ struct linux_emuldata_shared *shared; int pdeath_signal; /* parent death signal */ LIST_ENTRY(linux_emuldata) threads; /* list of linux threads */ }; The PID is used to identify the &os; process that attaches this structure. The child_set_tid and child_clear_tid are used for TID address copyout when a process is created and when it exits. The shared pointer points to a structure shared among threads. The pdeath_signal variable identifies the parent death signal and the threads pointer is used to link this structure to the list of threads. The linux_emuldata_shared structure looks like: struct linux_emuldata_shared { int refs; pid_t group_pid; LIST_HEAD(, linux_emuldata) threads; /* head of list of linux threads */ }; The refs is a reference counter used to determine when we can free the structure to avoid memory leaks. The group_pid identifies the PID ( = TGID) of the whole process ( = thread group). The threads pointer is the head of the list of threads in the process. The linux_emuldata structure can be obtained from the process using em_find. The prototype of the function is: struct linux_emuldata *em_find(struct proc *, int locked); Here, proc is the process we want the emuldata structure from and the locked parameter determines whether we want to lock or not. The accepted values are EMUL_DOLOCK and EMUL_DOUNLOCK. More about locking later. PID mangling - Because of the described different view knowing what a + As there is a difference in view as what to the idea of a process ID and thread ID is between &os; and &linux; we have to translate the view somehow. We do it by PID mangling. This means that we fake what a PID (=TGID) and TID (=PID) is between kernel and userland. 
The rule of thumb is that in kernel (in Linuxulator) PID = PID and TGID = shared -> group_pid and to userland we present PID = shared -> group_pid and TID = proc -> p_pid. The PID member of the linux_emuldata structure is a &os; PID. The above affects mainly the getpid, getppid and gettid syscalls, where we use PID/TGID respectively. In copyout of TIDs in child_clear_tid and child_set_tid we copy out the &os; PID. Clone syscall The clone syscall is the way threads are created in &linux;. The syscall prototype looks like this: int linux_clone(l_int flags, void *stack, void *parent_tidptr, int dummy, void * child_tidptr); The flags parameter tells the syscall how exactly the processes should be cloned. As described above, &linux; can create processes sharing various things independently, for example two processes can share file descriptors but not VM, etc. The last byte of the flags parameter is the exit signal of the newly created process. The stack parameter, if non-NULL, tells where the thread stack is; if it is NULL we are supposed to copy-on-write the calling process stack (i.e. do what the normal &man.fork.2; routine does). The parent_tidptr parameter is used as an address for copying out the process PID (i.e. thread id) once the process is sufficiently instantiated but is not runnable yet. The dummy parameter is here because of the very strange calling convention of this syscall on i386. It uses the registers directly and does not let the compiler do it, which results in the need for a dummy parameter. The child_tidptr parameter is used as an address for copying out the PID once the process has finished forking and when the process exits. The syscall itself proceeds by setting corresponding flags depending on the flags passed in. For example, CLONE_VM maps to RFMEM (sharing of VM), etc. The only nit here is CLONE_FS and CLONE_FILES because &os; does not allow setting these separately so we fake it by not setting RFFDG (copying of fd table and other fs information) if either of these is defined. 
This does not cause any problems, because those flags are always set together. After setting the flags the process is forked using the internal fork1 routine; the process is instrumented not to be put on a run queue, i.e. not to be set runnable. After the forking is done we possibly reparent the newly created process to emulate CLONE_PARENT semantics. The next part is creating the emulation data. Threads in &linux; do not signal their parents so we set the exit signal to 0 to disable this. After that, child_set_tid and child_clear_tid are set, enabling the functionality later in the code. At this point we copy out the PID to the address specified by parent_tidptr. The setting of the process stack is done by simply rewriting the thread frame %esp register (%rsp on amd64). The next part is setting up TLS for the newly created process. After this &man.vfork.2; semantics might be emulated and finally the newly created process is put on a run queue and its PID is copied out to the parent process via the clone return value. The clone syscall can be, and in fact is, used for emulating the classic &man.fork.2; and &man.vfork.2; syscalls. Newer glibc in the case of a 2.6 kernel uses clone to implement the &man.fork.2; and &man.vfork.2; syscalls. Locking The locking is implemented to be per-subsystem because we do not expect a lot of contention on these. There are two locks: emul_lock used to protect manipulating of linux_emuldata and emul_shared_lock used to manipulate linux_emuldata_shared. The emul_lock is a nonsleepable blocking mutex while emul_shared_lock is a - sleepable blocking sx_lock. Because of + sleepable blocking sx_lock. Due to the per-subsystem locking we can coalesce some locks and that is why em_find offers the non-locking access. TLS This section deals with TLS also known as thread local storage. Introduction to threading Threads in computer science are entities within a process that can be scheduled independently from each other. 
The threads in the process share process wide data (file descriptors, etc.) but also have their own stack for their own data. Sometimes there is a need for process-wide data specific to a given thread. Imagine a name of the thread in execution or something like that. The traditional &unix; threading API, pthreads, provides a way to do it via &man.pthread.key.create.3;, &man.pthread.setspecific.3; and &man.pthread.getspecific.3;, where a thread can create a key to the thread local data and use &man.pthread.setspecific.3; or &man.pthread.getspecific.3; to manipulate those data. You can easily see that this is not the most comfortable way this could be accomplished. So various producers of C/C++ compilers introduced a better way. They defined a new modifier keyword __thread that specifies that a variable is thread specific. A new method of accessing such variables was developed as well (at least on i386). The pthreads method tends to be implemented in userspace as a trivial lookup table. The performance of such a solution is not very good. So the new method uses (on i386) segment registers to address a segment where the TLS area is stored, so the actual access to a thread variable is just a matter of prefixing the address with the segment register and thus addressing via it. The segment registers are usually %gs and %fs acting like segment selectors. Every thread has its own area where the thread local data are stored and the segment must be loaded on every context switch. This method is very fast and used almost exclusively in the whole i386 &unix; world. Both &os; and &linux; implement this approach and it yields very good results. The only drawback is the need to reload the segment on every context switch, which can slow down context switches. &os; tries to avoid this overhead by using only 1 segment descriptor for this while &linux; uses 3. 
Interestingly, almost nothing uses more than 1 descriptor (only Wine seems to use 2) so &linux; pays this unnecessary price for context switches. Segments on i386 The i386 architecture implements so-called segments. A segment is a description of an area of memory: the base address (bottom) of the memory area, the end of it (ceiling), type, protection, etc. The memory described by a segment can be accessed using segment selector registers (%cs, %ds, %ss, %es, %fs, %gs). For example, let us suppose we have a segment, selected by %gs, whose base address is 0x1234 (with some length) and this code: mov %edx,%gs:0x10 This will store the content of the %edx register into memory location 0x1244. Some segment registers have a special use, for example %cs is used for the code segment and %ss is used for the stack segment but %fs and %gs are generally unused. Segments are either stored in a global GDT table or in a local LDT table. The LDT is accessed via an entry in the GDT. The LDT can store more types of segments. The LDT can be per process. Both tables define up to 8191 entries. Implementation on &linux; i386 There are two main ways of setting up TLS in &linux;. It can be set when cloning a process using the clone syscall or the process can call set_thread_area. When a process passes the CLONE_SETTLS flag to clone, the kernel expects the memory pointed to by the %esi register to hold a &linux; user space representation of a segment, which gets translated to the machine representation of a segment and loaded into a GDT slot. The GDT slot can be specified with a number or -1 can be used meaning that the system itself should choose the first free slot. In practice, the vast majority of programs use only one TLS entry and do not care about the number of the entry. We exploit this in the emulation and in fact depend on it. Emulation of &linux; TLS i386 Loading of TLS for the current thread happens by calling set_thread_area while loading TLS for a second process in clone is done in the separate block in clone. 
Those two functions are very similar. The only difference is the actual loading of the GDT segment, which happens on the next context switch for the newly created process while set_thread_area must load this directly. The code basically does the following. It copies the &linux; form segment descriptor from the userland. The code checks for the number of the descriptor but because this differs between &os; and &linux; we fake it a little. We only support indexes of 6, 3 and -1. The 6 is the genuine &linux; number, 3 is the genuine &os; one and -1 means autoselection. Then we set the descriptor number to constant 3 and copy this out to the userspace. We rely on the userspace process using the number from the descriptor but this works most of the time (we have never seen a case where this did not work) as the userspace process typically passes in 1. Then we convert the descriptor from the &linux; form to a machine dependent form (i.e. operating system independent form) and copy this to the &os; defined segment descriptor. Finally we can load it. We assign the descriptor to the thread's PCB (process control block) and load the %gs segment using load_gs. This loading must be done in a critical section so that nothing can interrupt us. The CLONE_SETTLS case works exactly like this except that the loading using load_gs is not performed. The segment used for this (segment number 3) is shared for this use between &os; processes and &linux; processes so the &linux; emulation layer does not add any overhead over plain &os;. amd64 The amd64 implementation is similar to the i386 one but there was initially no 32bit segment descriptor used for this purpose (hence not even native 32bit TLS users worked) so we had to add such a segment and implement its loading on every context switch (when a flag signaling use of 32bit is set). Apart from this the TLS loading is exactly the same; only the segment numbers are different and the descriptor format and the loading differ slightly. 
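The descriptor-number handling in the i386 TLS emulation described earlier can be sketched in isolation. This is a minimal userland illustration, not the Linuxulator code: the macro names and fake_tls_entry_number are hypothetical, and only the three accepted values (6, 3 and -1) and the constant result 3 are taken from the text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical names; the values are the ones quoted in the text. */
#define LINUX_TLS_ENTRY		6	/* genuine Linux slot number */
#define FREEBSD_TLS_ENTRY	3	/* genuine FreeBSD slot number */
#define TLS_ENTRY_AUTO		(-1)	/* let the system choose */

/*
 * Validate the entry number requested by userland and rewrite it to the
 * single slot the emulation actually uses. Returns 0 on success, or
 * EINVAL (leaving the value untouched) for an unsupported index.
 */
int
fake_tls_entry_number(int *entry_number)
{
	assert(entry_number != NULL);

	switch (*entry_number) {
	case LINUX_TLS_ENTRY:
	case FREEBSD_TLS_ENTRY:
	case TLS_ENTRY_AUTO:
		/* Whatever was asked for, report the one slot we have. */
		*entry_number = FREEBSD_TLS_ENTRY;
		return (0);
	default:
		return (EINVAL);
	}
}
```

Collapsing every accepted request onto one slot is what lets the emulation share that descriptor with native processes, at the cost of relying on userland to use the number that was copied back out.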
Futexes Introduction to synchronization Threads need some kind of synchronization and &posix; provides some of them: mutexes for mutual exclusion, read-write locks for mutual exclusion with biased ratio of reads and writes and condition variables for signaling a status change. It is interesting to note that the &posix; threading API lacks support for semaphores. The implementations of those synchronization routines are heavily dependent on the type of threading support we have. In the pure 1:M (userspace) model the implementation can be done solely in userspace and thus be very fast (the condition variables will probably end up being implemented using signals, i.e. not fast) and simple. In the 1:1 model, the situation is also quite clear - the threads must be synchronized using kernel facilities (which is very slow because a syscall must be performed). The mixed M:N scenario just combines the first and second approach or relies solely on the kernel. Thread synchronization is a vital part of thread-enabled programming and its performance can affect the resulting program a lot. Recent benchmarks on the &os; operating system showed that an improved sx_lock implementation yielded a 40% speedup in ZFS (a heavy sx user); this is in-kernel stuff but it shows clearly how important the performance of synchronization primitives is. Threaded programs should be written with as little contention on locks as possible. Otherwise, instead of - doing useful work the thread just waits on a lock. Because + doing useful work the thread just waits on a lock. As a result of this, the most well written threaded programs show little lock contention. Futexes introduction &linux; implements 1:1 threading, i.e. it has to use in-kernel synchronization primitives. As stated earlier, well written threaded programs have little lock contention. So a typical sequence could be performed as two atomic increments/decrements of a mutex reference counter, which is very fast, as presented by the following example: pthread_mutex_lock(&mutex); .... 
pthread_mutex_unlock(&mutex); 1:1 threading forces us to perform two syscalls for those mutex calls, which is very slow. The solution &linux; 2.6 implements is called futexes. Futexes implement the check for contention in userspace and call kernel primitives only in the case of contention. Thus the typical case takes place without any kernel intervention. This yields a reasonably fast and flexible implementation of synchronization primitives. Futex API The futex syscall looks like this: int futex(void *uaddr, int op, int val, struct timespec *timeout, void *uaddr2, int val3); In this example uaddr is an address of the mutex in userspace, op is an operation we are about to perform and the other parameters have per-operation meaning. Futexes implement the following operations: FUTEX_WAIT, FUTEX_WAKE, FUTEX_FD, FUTEX_REQUEUE, FUTEX_CMP_REQUEUE, FUTEX_WAKE_OP. FUTEX_WAIT This operation verifies that the value val is written at address uaddr. If not, EWOULDBLOCK is returned, otherwise the thread is queued on the futex and gets suspended. If the argument timeout is non-zero it specifies the maximum time for the sleeping, otherwise the sleeping is infinite. FUTEX_WAKE This operation takes a futex at uaddr and wakes up the first val threads queued on this futex. FUTEX_FD This operation associates a file descriptor with a given futex. FUTEX_REQUEUE This operation takes val threads queued on the futex at uaddr, wakes them up, and takes the next val2 threads and requeues them on the futex at uaddr2. FUTEX_CMP_REQUEUE This operation does the same as FUTEX_REQUEUE but it checks that val3 equals val first. FUTEX_WAKE_OP This operation performs an atomic operation on val3 (which contains some other encoded value) and uaddr. Then it wakes up val threads on the futex at uaddr and if the atomic operation returned a positive number it wakes up val2 threads on the futex at uaddr2. 
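To make the role of val3 in FUTEX_WAKE_OP more concrete, here is a minimal userland sketch of an encoded operation. It assumes the bit layout used by the FUTEX_OP() macro in the Linux futex header (operation in the top four bits, comparator in the next four, then two signed 12-bit arguments); that layout is an assumption about Linux, not a statement about the Linuxulator source. The names futex_op_encode, futex_op_apply and futex_op_example are hypothetical, the update here is not atomic as it must be in the kernel, and the shift flag and the AND-style operation are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Operation and comparator codes, assumed to match the Linux header. */
enum { OP_SET = 0, OP_ADD = 1, OP_OR = 2, OP_XOR = 4 };
enum { CMP_EQ = 0, CMP_NE = 1, CMP_LT = 2, CMP_LE = 3, CMP_GT = 4, CMP_GE = 5 };

/* Interpret the low 12 bits of v as a signed number. */
static int
sign_extend12(uint32_t v)
{
	v &= 0xfff;
	return ((v & 0x800) ? (int)v - 0x1000 : (int)v);
}

/* Pack op, oparg, cmp and cmparg into one 32-bit word. */
uint32_t
futex_op_encode(int op, int oparg, int cmp, int cmparg)
{
	return ((((uint32_t)op & 0xf) << 28) | (((uint32_t)cmp & 0xf) << 24) |
	    (((uint32_t)oparg & 0xfff) << 12) | ((uint32_t)cmparg & 0xfff));
}

/*
 * Apply the encoded operation to *word and compare the OLD value with
 * cmparg. Returns 1 or 0 for the comparison result, or -1 if the
 * operation or comparator is not one this sketch handles.
 */
int
futex_op_apply(uint32_t encoded, int *word)
{
	int op = (int)((encoded >> 28) & 0xf);
	int cmp = (int)((encoded >> 24) & 0xf);
	int oparg = sign_extend12(encoded >> 12);
	int cmparg = sign_extend12(encoded);
	int oldval = *word;

	switch (op) {
	case OP_SET: *word = oparg; break;
	case OP_ADD: *word = oldval + oparg; break;
	case OP_OR:  *word = oldval | oparg; break;
	case OP_XOR: *word = oldval ^ oparg; break;
	default:     return (-1);
	}

	switch (cmp) {
	case CMP_EQ: return (oldval == cmparg);
	case CMP_NE: return (oldval != cmparg);
	case CMP_LT: return (oldval < cmparg);
	case CMP_LE: return (oldval <= cmparg);
	case CMP_GT: return (oldval > cmparg);
	case CMP_GE: return (oldval >= cmparg);
	default:     return (-1);
	}
}

/* Usage: add 3 to the word and ask whether the old value was above 4. */
void
futex_op_example(void)
{
	int word = 5;

	assert(futex_op_apply(futex_op_encode(OP_ADD, 3, CMP_GT, 4), &word) == 1);
	assert(word == 8);
}
```

The comparison is made against the value the word held before the update, which is what decides whether the second set of waiters is woken.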
The operations implemented in FUTEX_WAKE_OP:

FUTEX_OP_SET
FUTEX_OP_ADD
FUTEX_OP_OR
FUTEX_OP_AND
FUTEX_OP_XOR

There is no val2 parameter in the futex prototype. The val2 value is taken from the struct timespec *timeout parameter for the operations FUTEX_REQUEUE, FUTEX_CMP_REQUEUE and FUTEX_WAKE_OP.

Futex emulation in &os;

The futex emulation in &os; is taken from NetBSD and further extended by us. It is placed in the linux_futex.c and linux_futex.h files. The futex structure looks like:

struct futex {
	void *f_uaddr;
	int f_refcount;
	LIST_ENTRY(futex) f_list;
	TAILQ_HEAD(lf_waiting_paroc, waiting_proc) f_waiting_proc;
};

And the structure waiting_proc is:

struct waiting_proc {
	struct thread *wp_t;
	struct futex *wp_new_futex;
	TAILQ_ENTRY(waiting_proc) wp_list;
};

futex_get / futex_put

A futex is obtained using the futex_get function, which searches a linear list of futexes and returns the found one or creates a new futex. When releasing a futex from use we call the futex_put function, which decreases the reference counter of the futex; if the refcount reaches zero the futex is released.

futex_sleep

When a futex queues a thread for sleeping it creates a waiting_proc structure and puts this structure on the list inside the futex structure, then it just performs a &man.tsleep.9; to suspend the thread. The sleep can be timed out. After &man.tsleep.9; returns (the thread was woken up or it timed out) the waiting_proc structure is removed from the list and is destroyed. All this is done in the futex_sleep function. If we got woken up from futex_wake we have wp_new_futex set so we sleep on it. This way the actual requeueing is done in this function.

futex_wake

Waking up a thread sleeping on a futex is performed in the futex_wake function. First in this function we mimic the strange &linux; behavior, where it wakes up N threads for all operations, the only exception is that the REQUEUE operations are performed on N+1 threads.
But this usually does not make any difference as we are waking up all threads. Next, in a loop, we wake up n threads and after this we check if there is a new futex for requeueing. If so, we requeue up to n2 threads on the new futex. This cooperates with futex_sleep.

futex_wake_op

The FUTEX_WAKE_OP operation is quite complicated. First we obtain two futexes at addresses uaddr and uaddr2, then we perform the atomic operation using val3 and uaddr2. Then val waiters on the first futex are woken up and if the atomic operation condition holds we wake up val2 (i.e. timeout) waiters on the second futex.

futex atomic operation

The atomic operation takes two parameters, encoded_op and uaddr. The encoded operation encodes the operation itself, comparing value, operation argument, and comparing argument. The pseudocode for the operation is like this one:

oldval = *uaddr2
*uaddr2 = oldval OP oparg

And this is done atomically. First the number at uaddr is copied in and the operation is done. The code handles page faults and if no page fault occurs oldval is compared to the cmparg argument with the cmp comparator.

Futex locking

Futex implementation uses two lock lists protecting sx_lock and global locks (either Giant or another sx_lock). Every operation is performed locked from the start to the very end.

Various syscalls implementation

In this section I am going to describe some smaller syscalls that are worth mentioning because their implementation is not obvious or those syscalls are interesting from another point of view.

*at family of syscalls

During development of the &linux; 2.6.16 kernel, the *at syscalls were added. Those syscalls (openat for example) work exactly like their at-less counterparts with the slight exception of the dirfd parameter. This parameter changes where the given file, on which the syscall is to be performed, is. When the filename parameter is absolute dirfd is ignored but when the path to the file is relative, it comes into play.
The dirfd parameter is a directory relative to which the relative pathname is checked. The dirfd parameter is a file descriptor of some directory or AT_FDCWD. So for example the openat syscall can be like this:

file descriptor 123 = /tmp/foo/, current working directory = /tmp/

openat(123, "/tmp/bah", flags, mode) /* opens /tmp/bah */
openat(123, "bah", flags, mode) /* opens /tmp/foo/bah */
openat(AT_FDCWD, "bah", flags, mode) /* opens /tmp/bah */
openat(stdio, "bah", flags, mode) /* returns error because stdio is not a directory */

This infrastructure is necessary to avoid races when opening files outside the working directory. Imagine that a process consists of two threads, thread A and thread B. Thread A issues open("/tmp/foo/bah", flags, mode) and before returning it gets preempted and thread B runs. Thread B does not care about the needs of thread A and renames or removes /tmp/foo/. We got a race. To avoid this we can open /tmp/foo and use it as dirfd for the openat syscall. This also enables the user to implement per-thread working directories.

The &linux; family of *at syscalls contains: linux_openat, linux_mkdirat, linux_mknodat, linux_fchownat, linux_futimesat, linux_fstatat64, linux_unlinkat, linux_renameat, linux_linkat, linux_symlinkat, linux_readlinkat, linux_fchmodat and linux_faccessat. All these are implemented using the modified &man.namei.9; routine and a simple wrapping layer.

Implementation

The implementation is done by altering the &man.namei.9; routine (described above) to take an additional parameter dirfd in its nameidata structure, which specifies the starting point of the pathname lookup instead of using the current working directory every time. The resolution of dirfd from file descriptor number to a vnode is done in native *at syscalls.
When dirfd is AT_FDCWD the dvp entry in nameidata structure is NULL but when dirfd is a different number we obtain a file for this file descriptor, check whether this file is valid and if there is vnode attached to it then we get a vnode. Then we check this vnode for being a directory. In the actual &man.namei.9; routine we simply substitute the dvp vnode for dp variable in the &man.namei.9; function, which determines the starting point. The &man.namei.9; is not used directly but via a trace of different functions on various levels. For example the openat goes like this: openat() --> kern_openat() --> vn_open() -> namei() For this reason kern_open and vn_open must be altered to incorporate the additional dirfd parameter. No compat layer is created for those because there are not many users of this and the users can be easily converted. This general implementation enables &os; to implement their own *at syscalls. This is being discussed right now. Ioctl The ioctl interface is quite fragile due to its generality. We have to bear in mind that devices differ between &linux; and &os; so some care must be applied to do ioctl emulation work right. The ioctl handling is implemented in linux_ioctl.c, where linux_ioctl function is defined. This function simply iterates over sets of ioctl handlers to find a handler that implements a given command. The ioctl syscall has three parameters, the file descriptor, command and an argument. The command is a 16-bit number, which in theory is divided into high 8 bits determining class of the ioctl command and low 8 bits, which are the actual command within the given set. The emulation takes advantage of this division. We implement handlers for each set, like sound_handler or disk_handler. Each handler has a maximum command and a minimum command defined, which is used for determining what handler is used. 
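The range-based handler lookup described above can be sketched in C as follows. This is only an illustration of the idea, not the code from linux_ioctl.c: the structure layout, the function names (find_handler, dispatch_ioctl, disk_ioctl, sound_ioctl) and the command ranges are hypothetical and were chosen just to show how a minimum/maximum command pair selects a handler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type: returns 0 on success. */
typedef int (*ioctl_fn)(int fd, unsigned int cmd, void *arg);

/* Hypothetical table entry: one handler covers commands in [low, high]. */
struct ioctl_handler {
	ioctl_fn	func;
	unsigned int	low;	/* minimum command handled */
	unsigned int	high;	/* maximum command handled */
};

/* Stub handlers; a real handler would translate the command and argument. */
static int
disk_ioctl(int fd, unsigned int cmd, void *arg)
{
	(void)fd; (void)cmd; (void)arg;
	return (0);
}

static int
sound_ioctl(int fd, unsigned int cmd, void *arg)
{
	(void)fd; (void)cmd; (void)arg;
	return (0);
}

/* The ranges are made up: high 8 bits select the set, low 8 bits the command. */
static const struct ioctl_handler handlers[] = {
	{ disk_ioctl,  0x0300, 0x03ff },
	{ sound_ioctl, 0x5000, 0x50ff },
};

/* Return the first handler whose range contains cmd, or NULL if none does. */
const struct ioctl_handler *
find_handler(unsigned int cmd)
{
	size_t i;

	for (i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++) {
		if (cmd >= handlers[i].low && cmd <= handlers[i].high)
			return (&handlers[i]);
	}
	return (NULL);
}

/* Dispatch a command; unknown commands fail with -1 in this sketch. */
int
dispatch_ioctl(int fd, unsigned int cmd, void *arg)
{
	const struct ioctl_handler *h = find_handler(cmd);

	if (h == NULL)
		return (-1);
	assert(h->func != NULL);
	return (h->func(fd, cmd, arg));
}
```

With this layout, a command that belongs to one set but is numbered inside another set's range would be routed to the wrong handler, so such a handler has to recognize and serve the foreign commands itself.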
There are slight problems with this approach because &linux; does not use the set division consistently so sometimes ioctls for a different set are inside a set they should not belong to (SCSI generic ioctls inside cdrom set, etc.). &os; currently does not implement many &linux; ioctls (compared to NetBSD, for example) but the plan is to port those from NetBSD. The trend is to use &linux; ioctls even in the native &os; drivers because of the easy porting of applications. Debugging Every syscall should be debuggable. For this purpose we introduce a small infrastructure. We have the ldebug facility, which tells whether a given syscall should be debugged (settable via a sysctl). For printing we have LMSG and ARGS macros. Those are used for altering a printable string for uniform debugging messages. Conclusion Results As of April 2007 the &linux; emulation layer is capable of emulating the &linux; 2.6.16 kernel quite well. The remaining problems concern futexes, unfinished *at family of syscalls, problematic signals delivery, missing epoll and inotify and probably some bugs we have not discovered yet. Despite this we are capable of running basically all the &linux; programs included in &os; Ports Collection with Fedora Core 4 at 2.6.16 and there are some rudimentary reports of success with Fedora Core 6 at 2.6.16. The Fedora Core 6 linux_base was recently committed enabling some further testing of the emulation layer and giving us some more hints where we should put our effort in implementing missing stuff. We are able to run the most used applications like www/linux-firefox, net-im/skype and some games from the Ports Collection. Some of the programs exhibit bad behavior under 2.6 emulation but this is currently under investigation and hopefully will be fixed soon. The only big application that is known not to work is the &linux; &java; Development Kit and this is because of the requirement of epoll facility which is not directly related to the &linux; kernel 2.6. 
We hope to enable 2.6.16 emulation by default some time after &os; 7.0 is released at least to expose the 2.6 emulation parts for some wider testing. Once this is done we can switch to Fedora Core 6 linux_base, which is the ultimate plan.

Future work

Future work should focus on fixing the remaining issues with futexes, implementing the rest of the *at family of syscalls, fixing the signal delivery and possibly implementing the epoll and inotify facilities.

We hope to be able to run the most important programs flawlessly soon, so we will be able to switch to the 2.6 emulation by default and make Fedora Core 6 the default linux_base because our currently used Fedora Core 4 is not supported any more.

The other possible goal is to share our code with NetBSD and DragonFly BSD. NetBSD has some support for 2.6 emulation but it is far from finished and not really tested. DragonFly BSD has expressed some interest in porting the 2.6 improvements.

Generally, as &linux; develops we would like to keep up with their development, implementing newly added syscalls. Splice comes to mind first. Some already implemented syscalls are also heavily crippled, for example mremap and others. Some performance improvements can also be made, finer grained locking and others.

Team

I cooperated on this project with (in alphabetical order):

&a.jhb.email;
&a.kib.email;
Emmanuel Dreyfus
Scot Hetzel
&a.jkim.email;
&a.netchild.email;
&a.ssouhlal.email;
Li Xiao
&a.davidxu.email;

I would like to thank all those people for their advice, code reviews and general support.

Literature

Marshall Kirk McKusick - George V. Neville-Neil. Design and Implementation of the &os; operating system. Addison-Wesley, 2005.
https://tldp.org
https://www.kernel.org
diff --git a/en_US.ISO8859-1/articles/serial-uart/article.xml b/en_US.ISO8859-1/articles/serial-uart/article.xml index e57b052b9e..2ddbfbe2aa 100644 --- a/en_US.ISO8859-1/articles/serial-uart/article.xml +++ b/en_US.ISO8859-1/articles/serial-uart/article.xml @@ -1,2433 +1,2433 @@
Serial and UART Tutorial Frank Durda
uhclem@FreeBSD.org
&tm-attrib.freebsd; &tm-attrib.microsoft; &tm-attrib.general; $FreeBSD$ $FreeBSD$ This article talks about using serial hardware with FreeBSD.
The UART: What it is and how it works

Copyright © 1996 &a.uhclem.email;, All Rights Reserved. 13 January 1996.

The Universal Asynchronous Receiver/Transmitter (UART) controller is the key component of the serial communications subsystem of a computer. The UART takes bytes of data and transmits the individual bits in a sequential fashion. At the destination, a second UART re-assembles the bits into complete bytes.

Serial transmission is commonly used with modems and for non-networked communication between computers, terminals and other devices.

There are two primary forms of serial transmission: Synchronous and Asynchronous. Depending on the modes that are supported by the hardware, the name of the communication sub-system will usually include an A if it supports Asynchronous communications, and an S if it supports Synchronous communications. Both forms are described below.

Some common acronyms are:
UART Universal Asynchronous Receiver/Transmitter
USART Universal Synchronous-Asynchronous Receiver/Transmitter
Synchronous Serial Transmission Synchronous serial transmission requires that the sender and receiver share a clock with one another, or that the sender provide a strobe or other timing signal so that the receiver knows when to read the next bit of the data. In most forms of serial Synchronous communication, if there is no data available at a given instant to transmit, a fill character must be sent instead so that data is always being transmitted. Synchronous communication is usually more efficient because only data bits are transmitted between sender and receiver, and synchronous communication can be more costly if extra wiring and circuits are required to share a clock signal between the sender and receiver. A form of Synchronous transmission is used with printers and fixed disk devices in that the data is sent on one set of wires while a clock or strobe is sent on a different wire. Printers and fixed disk devices are not normally serial devices because most fixed disk interface standards send an entire word of data for each clock or strobe signal by using a separate wire for each bit of the word. In the PC industry, these are known as Parallel devices. The standard serial communications hardware in the PC does not support Synchronous operations. This mode is described here for comparison purposes only. Asynchronous Serial Transmission Asynchronous transmission allows data to be transmitted without the sender having to send a clock signal to the receiver. Instead, the sender and receiver must agree on timing parameters in advance and special bits are added to each word which are used to synchronize the sending and receiving units. When a word is given to the UART for Asynchronous transmissions, a bit called the "Start Bit" is added to the beginning of each word that is to be transmitted. 
The Start Bit is used to alert the receiver that a word of data is about to be sent, and to force the clock in the receiver into synchronization with the clock in the transmitter. These two clocks must be accurate enough to not have the frequency drift by more than 10% during the transmission of the remaining bits in the word. (This requirement was set in the days of mechanical teleprinters and is easily met by modern electronic equipment.) After the Start Bit, the individual bits of the word of data are sent, with the Least Significant Bit (LSB) being sent first. Each bit in the transmission is transmitted for exactly the same amount of time as all of the other bits, and the receiver looks at the wire at approximately halfway through the period assigned to each bit to determine if the bit is a 1 or a 0. For example, if it takes two seconds to send each bit, the receiver will examine the signal to determine if it is a 1 or a 0 after one second has passed, then it will wait two seconds and then examine the value of the next bit, and so on. The sender does not know when the receiver has looked at the value of the bit. The sender only knows when the clock says to begin transmitting the next bit of the word. When the entire data word has been sent, the transmitter may add a Parity Bit that the transmitter generates. The Parity Bit may be used by the receiver to perform simple error checking. Then at least one Stop Bit is sent by the transmitter. When the receiver has received all of the bits in the data word, it may check for the Parity Bits (both sender and receiver must agree on whether a Parity Bit is to be used), and then the receiver looks for a Stop Bit. If the Stop Bit does not appear when it is supposed to, the UART considers the entire word to be garbled and will report a Framing Error to the host processor when the data word is read. 
The usual cause of a Framing Error is that the sender and receiver clocks were not running at the same speed, or that the signal was interrupted.

Regardless of whether the data was received correctly or not, the UART automatically discards the Start, Parity and Stop bits. If the sender and receiver are configured identically, these bits are not passed to the host.

If another word is ready for transmission, the Start Bit for the new word can be sent as soon as the Stop Bit for the previous word has been sent.

- Because asynchronous data is self
+ As asynchronous data is self
synchronizing, if there is no data to transmit, the transmission line can be idle.

Other UART Functions

In addition to the basic job of converting data from parallel to serial for transmission and from serial to parallel on reception, a UART will usually provide additional circuits for signals that can be used to indicate the state of the transmission media, and to regulate the flow of data in the event that the remote device is not prepared to accept more data. For example, when the device connected to the UART is a modem, the modem may report the presence of a carrier on the phone line while the computer may be able to instruct the modem to reset itself or to not take calls by raising or lowering one or more of these extra signals. The function of each of these additional signals is defined in the EIA RS232-C standard.

The RS232-C and V.24 Standards

In most computer systems, the UART is connected to circuitry that generates signals that comply with the EIA RS232-C specification. There is also a CCITT standard named V.24 that mirrors the specifications included in RS232-C.

RS232-C Bit Assignments (Marks and Spaces)

In RS232-C, a value of 1 is called a Mark and a value of 0 is called a Space. When a communication line is idle, the line is said to be Marking, or transmitting continuous 1 values.

The Start bit always has a value of 0 (a Space). The Stop Bit always has a value of 1 (a Mark).
This means that there will always be a Mark (1) to Space (0) transition on the line at the start of every word, even when multiple words are transmitted back to back. This guarantees that sender and receiver can resynchronize their clocks regardless of the content of the data bits that are being transmitted.

The idle time between Stop and Start bits does not have to be an exact multiple (including zero) of the bit rate of the communication link, but most UARTs are designed this way for simplicity.

In RS232-C, the "Marking" signal (a 1) is represented by a voltage between -2 VDC and -12 VDC, and a "Spacing" signal (a 0) is represented by a voltage between 0 and +12 VDC. The transmitter is supposed to send +12 VDC or -12 VDC, and the receiver is supposed to allow for some voltage loss in long cables. Some transmitters in low power devices (like portable computers) use only +5 VDC and -5 VDC, but these values are still acceptable to an RS232-C receiver, provided that the cable lengths are short.

RS232-C Break Signal

RS232-C also specifies a signal called a Break, which is caused by sending continuous Spacing values (no Start or Stop bits). When there is no electricity present on the data circuit, the line is considered to be sending Break.

The Break signal must be of a duration longer than the time it takes to send a complete byte plus Start, Stop and Parity bits. Most UARTs can distinguish between a Framing Error and a Break, but if the UART cannot do this, the Framing Error detection can be used to identify Breaks.

In the days of teleprinters, when numerous printers around the country were wired in series (such as news services), any unit could cause a Break by temporarily opening the entire circuit so that no current flowed. This was used to allow a location with urgent news to interrupt some other location that was currently sending information.

In modern systems there are two types of Break signals.
If the Break is longer than 1.6 seconds, it is considered a "Modem Break", and some modems can be programmed to terminate the conversation and go on-hook or enter the modem's command mode when the modem detects this signal. If the Break is shorter than 1.6 seconds, it signifies a Data Break and it is up to the remote computer to respond to this signal. Sometimes this form of Break is used as an Attention or Interrupt signal and sometimes it is accepted as a substitute for the ASCII CONTROL-C character.

Marks and Spaces are also equivalent to Holes and No Holes in paper tape systems.

Breaks cannot be generated from paper tape or from any other byte value, since bytes are always sent with Start and Stop bits. The UART is usually capable of generating the continuous Spacing signal in response to a special command from the host processor.

RS232-C DTE and DCE Devices

The RS232-C specification defines two types of equipment: the Data Terminal Equipment (DTE) and the Data Carrier Equipment (DCE). Usually, the DTE device is the terminal (or computer), and the DCE is a modem. Across the phone line at the other end of a conversation, the receiving modem is also a DCE device and the computer that is connected to that modem is a DTE device. The DCE device receives signals on the pins that the DTE device transmits on, and vice versa.

When two devices that are both DTE or both DCE must be connected together without a modem or a similar media translator between them, a NULL modem must be used. The NULL modem electrically re-arranges the cabling so that the transmitter output is connected to the receiver input on the other device, and vice versa. Similar translations are performed on all of the control signals so that each device will see what it thinks are DCE (or DTE) signals from the other device.

The number of signals generated by the DTE and DCE devices is not symmetrical. The DTE device generates fewer signals for the DCE device than the DTE device receives from the DCE.
RS232-C Pin Assignments

The EIA RS232-C specification (and the ITU equivalent, V.24) calls for a twenty-five pin connector (usually a DB25) and defines the purpose of most of the pins in that connector.

In the IBM Personal Computer and similar systems, a subset of RS232-C signals is provided via nine pin connectors (DB9). The signals that are not included on the PC connector deal mainly with synchronous operation, and this transmission mode is not supported by the UART that IBM selected for use in the IBM PC.

Depending on the computer manufacturer, a DB25, a DB9, or both types of connector may be used for RS232-C communications. (The IBM PC also uses a DB25 connector for the parallel printer interface which causes some confusion.)

Below is a table of the RS232-C signal assignments in the DB25 and DB9 connectors.

DB25 RS232-C Pin | DB9 IBM PC Pin | EIA Circuit Symbol | CCITT Circuit Symbol | Common Name | Signal Source | Description
1 | - | AA | 101 | PG/FG | - | Frame/Protective Ground
2 | 3 | BA | 103 | TD | DTE | Transmit Data
3 | 2 | BB | 104 | RD | DCE | Receive Data
4 | 7 | CA | 105 | RTS | DTE | Request to Send
5 | 8 | CB | 106 | CTS | DCE | Clear to Send
6 | 6 | CC | 107 | DSR | DCE | Data Set Ready
7 | 5 | AB | 102 | SG/GND | - | Signal Ground
8 | 1 | CF | 109 | DCD/CD | DCE | Data Carrier Detect
9 | - | - | - | - | - | Reserved for Test
10 | - | - | - | - | - | Reserved for Test
11 | - | - | - | - | - | Reserved for Test
12 | - | CI | 122 | SRLSD | DCE | Sec. Recv. Line Signal Detector
13 | - | SCB | 121 | SCTS | DCE | Secondary Clear to Send
14 | - | SBA | 118 | STD | DTE | Secondary Transmit Data
15 | - | DB | 114 | TSET | DCE | Trans. Sig. Element Timing
16 | - | SBB | 119 | SRD | DCE | Secondary Received Data
17 | - | DD | 115 | RSET | DCE | Receiver Signal Element Timing
18 | - | - | 141 | LOOP | DTE | Local Loopback
19 | - | SCA | 120 | SRS | DTE | Secondary Request to Send
20 | 4 | CD | 108.2 | DTR | DTE | Data Terminal Ready
21 | - | - | - | RDL | DTE | Remote Digital Loopback
22 | 9 | CE | 125 | RI | DCE | Ring Indicator
23 | - | CH | 111 | DSRS | DTE | Data Signal Rate Selector
24 | - | DA | 113 | TSET | DTE | Trans. Sig. Element Timing
25 | - | - | 142 | - | DCE | Test Mode

Bits, Baud and Symbols

Baud is a measurement of transmission speed in
- asynchronous communication. Because of advances in modem
+ asynchronous communication. Due to advances in modem
communication technology, this term is frequently misused when describing the data rates in newer devices.

Traditionally, a Baud Rate represents the number of bits that are actually being sent over the media, not the amount of data that is actually moved from one DTE device to the other. The Baud count includes the overhead bits Start, Stop and Parity that are generated by the sending UART and removed by the receiving UART. This means that seven-bit words of data actually take 10 bits to be completely transmitted. Therefore, a modem capable of moving 300 bits per second from one place to another can normally only move 30 7-bit words per second if Parity is used and one Start and Stop bit are present.

If 8-bit data words are used and Parity bits are also used, the data rate falls to 27.27 words per second, because it now takes 11 bits to send the eight-bit words, and the modem still only sends 300 bits per second.

The formula for converting bytes per second into a baud rate and vice versa was simple until error-correcting modems came along. These modems receive the serial stream of bits from the UART in the host computer (even when internal modems are used the data is still frequently serialized) and convert the bits back into bytes. These bytes are then combined into packets and sent over the phone line using a Synchronous transmission method. This means that the Stop, Start, and Parity bits added by the UART in the DTE (the computer) are removed by the sending modem before transmission.
When these bytes are received by the remote modem, the remote modem adds Start, Stop and Parity bits to the words, converts them to a serial format and then sends them to the receiving UART in the remote computer, which then strips the Start, Stop and Parity bits.

The reason all these extra conversions are done is so that the two modems can perform error correction, which means that the receiving modem is able to ask the sending modem to resend a block of data that was not received with the correct checksum. This checking is handled by the modems, and the DTE devices are usually unaware that the process is occurring.

By stripping the Start, Stop and Parity bits, the additional bits of data that the two modems must share between themselves to perform error-correction are mostly concealed from the effective transmission rate seen by the sending and receiving DTE equipment. For example, if a modem sends ten 7-bit words to another modem without including the Start, Stop and Parity bits, the sending modem will be able to add 30 bits of its own information that the receiving modem can use to do error-correction without impacting the transmission speed of the real data.

The use of the term Baud is further confused by modems that perform compression. A single 8-bit word passed over the telephone line might represent a dozen words that were transmitted to the sending modem. The receiving modem will expand the data back to its original content and pass that data to the receiving DTE.

Modern modems also include buffers that allow the rate at which bits move across the phone line (DCE to DCE) to differ from the speed at which the bits move between the DTE and DCE on both ends of the conversation. Normally the speed between the DTE and DCE is higher than the DCE to DCE speed because of the use of compression by the modems.
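The framing arithmetic used earlier in this section (10 bits on the wire for a 7-bit word with Parity, 11 bits for an 8-bit word with Parity) can be written down as a small C sketch. The function names frame_bits and words_per_second are hypothetical and exist only for this illustration; the sketch assumes a plain asynchronous link with no error correction or compression.

```c
#include <assert.h>

/*
 * Bits actually sent for one word on an asynchronous link:
 * one Start bit, the data bits, an optional Parity bit and the Stop bits.
 */
unsigned int
frame_bits(unsigned int data_bits, int parity, unsigned int stop_bits)
{
	assert(data_bits > 0);
	assert(stop_bits > 0);
	return (1 + data_bits + (parity ? 1 : 0) + stop_bits);
}

/* Words per second carried by a link that moves bps bits per second. */
double
words_per_second(unsigned int bps, unsigned int data_bits, int parity,
    unsigned int stop_bits)
{
	return ((double)bps / frame_bits(data_bits, parity, stop_bits));
}
```

For a 300 bps link, words_per_second(300, 7, 1, 1) gives 30 words per second and words_per_second(300, 8, 1, 1) gives about 27.27, matching the figures quoted in the text.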
- Because the number of bits needed to describe a byte
+ As the number of bits needed to describe a byte
varies during the trip between the two machines, and differing bits-per-second speeds are present on the DTE-DCE and DCE-DCE links, the usage of the term Baud to describe the overall communication speed causes problems and can misrepresent the true transmission speed. So Bits Per Second (bps) is the correct term to use to describe the transmission rate seen at the DCE to DCE interface and Baud or Bits Per Second are acceptable terms to use when a connection is made between two systems with a wired connection, or if a modem is in use that is not performing error-correction or compression.

Modern high speed modems (2400, 9600, 14,400, and 19,200bps) in reality still operate at or below 2400 baud, or more accurately, 2400 Symbols per second. High speed modems are able to encode more bits of data into each Symbol using a technique called Constellation Stuffing, which is why the effective bits per second rate of the modem is higher, but the modem continues to operate within the limited audio bandwidth that the telephone system provides. Modems operating at 28,800 and higher speeds have variable Symbol rates, but the technique is the same.

The IBM Personal Computer UART

Starting with the original IBM Personal Computer, IBM selected the National Semiconductor INS8250 UART for use in the IBM PC Parallel/Serial Adapter. Subsequent generations of compatible computers from IBM and other vendors continued to use the INS8250 or improved versions of the National Semiconductor UART family.

National Semiconductor UART Family Tree

There have been several versions and subsequent generations of the INS8250 UART. Each major version is described below.

INS8250  -> INS8250B
  \
   \
    \-> INS8250A -> INS82C50A
             \
              \
               \-> NS16450 -> NS16C450
                        \
                         \
                          \-> NS16550 -> NS16550A -> PC16550D

INS8250

This part was used in the original IBM PC and IBM PC/XT.
The original name for this part was the INS8250 ACE (Asynchronous Communications Element) and it is made from NMOS technology.

The 8250 uses eight I/O ports and has a one-byte send and a one-byte receive buffer. This original UART has several race conditions and other flaws. The original IBM BIOS includes code to work around these flaws, but this made the BIOS dependent on the flaws being present, so subsequent parts like the 8250A, 16450 or 16550 could not be used in the original IBM PC or IBM PC/XT.

INS8250-B

This is a slower-speed version of the INS8250 made from NMOS technology. It contains the same problems as the original INS8250.

INS8250A

An improved version of the INS8250 using XMOS technology with various functional flaws corrected. The INS8250A was used initially in PC clone computers by vendors who used
- clean BIOS designs. Because of the
+ clean BIOS designs. Due to the
corrections in the chip, this part could not be used with a BIOS compatible with the INS8250 or INS8250B.

INS82C50A

This is a CMOS version (low power consumption) of the INS8250A and has similar functional characteristics.

NS16450

Same as INS8250A with improvements so it can be used with faster CPU bus designs. IBM used this part in the IBM AT and updated the IBM BIOS to no longer rely on the bugs in the INS8250.

NS16C450

This is a CMOS version (low power consumption) of the NS16450.

NS16550

Same as NS16450 with a 16-byte send and receive buffer but the buffer design was flawed and could not be reliably used.

NS16550A

Same as NS16550 with the buffer flaws corrected. The 16550A and its successors have become the most popular UART design in the PC industry, mainly due to its ability to reliably handle higher data rates on operating systems with sluggish interrupt response times.

NS16C552

This component consists of two NS16C550A CMOS UARTs in a single package.

PC16550D

Same as NS16550A with subtle flaws corrected.
This is revision D of the 16550 family and is the latest design available from National Semiconductor.

The NS16550AF and the PC16550D are the same thing

National reorganized their part numbering system a few years ago, and the NS16550AFN no longer exists by that name. (If you have a NS16550AFN, look at the date code on the part, which is a four digit number that usually starts with a nine. The first two digits of the number are the year, and the last two digits are the week in that year when the part was packaged. If you have a NS16550AFN, it is probably a few years old.)

The new numbers are like PC16550DV, with minor differences in the suffix letters depending on the package material and its shape. (A description of the numbering system can be found below.)

It is important to understand that in some stores, you may pay $15(US) for a NS16550AFN made in 1990 while in the next bin are the new PC16550DN parts with minor fixes that National has made since the AFN part was in production. The PC16550DN was probably made in the past six months and it costs half as much (as low as $5(US) in volume) as the NS16550AFN because these parts are readily available.

As the supply of NS16550AFN chips continues to shrink, the price will probably continue to increase until more people discover and accept that the PC16550DN really has the same function as the old part number.

National Semiconductor Part Numbering System

The older NSnnnnnrqp part numbers are now of the format PCnnnnnrgp.

The r is the revision field. The current revision of the 16550 from National Semiconductor is D.

The p is the package-type field. The types are:

"F" QFP (quad flat pack) L lead type
"N" DIP (dual inline package) through hole straight lead type
"V" LPCC (lead plastic chip carrier) J lead type

The g is the product grade field. If an I precedes the package-type letter, it indicates an industrial grade part, which has higher specs than a standard part but not as high as a Military Specification (Milspec) component.
This is an optional field. So what we used to call a NS16550AFN (DIP Package) is now called a PC16550DN or PC16550DIN. Other Vendors and Similar UARTs Over the years, the 8250, 8250A, 16450 and 16550 have been licensed or copied by other chip vendors. In the case of the 8250, 8250A and 16450, the exact circuit (the megacell) was licensed to many vendors, including Western Digital and Intel. Other vendors reverse-engineered the part or produced emulations that had similar behavior. In internal modems, the modem designer will frequently emulate the 8250A/16450 with the modem microprocessor, and the emulated UART will frequently have a hidden buffer - consisting of several hundred bytes. Because of the size of + consisting of several hundred bytes. Due to the size of the buffer, these emulations can be as reliable as a 16550A in their ability to handle high speed data. However, most operating systems will still report that the UART is only an 8250A or 16450, and may not make effective use of the extra buffering present in the emulated UART unless special drivers are used. Some modem makers are driven by market forces to abandon a design that has hundreds of bytes of buffer and instead use a 16550A UART so that the product will compare favorably in market comparisons even though the effective performance may be lowered by this action. A common misconception is that all parts with 16550A written on them are identical in performance. There are differences, and in some cases, outright flaws in most of these 16550A clones. When the NS16550 was developed, National Semiconductor obtained several patents on the design and also limited licensing, making it harder for other - vendors to provide a chip with similar features. Because of + vendors to provide a chip with similar features. As a result of the patents, reverse-engineered designs and emulations had to avoid infringing the claims covered by the patents.
Consequently, these copies almost never perform exactly the same as the NS16550A or PC16550D, which are the parts most computer and modem makers want to buy, although they are sometimes unwilling to pay the price required to get the genuine part. Some of the differences in the clone 16550A parts are unimportant, while others can prevent the device from being used at all with a given operating system or driver. These differences may show up when using other drivers, or when particular combinations of events occur that were not well tested or considered in the &windows; driver. This is because most modem vendors and 16550-clone makers use the Microsoft drivers from &windows; for Workgroups 3.11 and the &microsoft; &ms-dos; utility as the primary tests for compatibility with the NS16550A. These over-simplistic criteria mean that if a different operating system is used, problems could appear due to subtle differences between the clones and genuine components. National Semiconductor has made available a program named COMTEST that performs compatibility tests independent of any OS drivers. It should be remembered that the purpose of this type of program is to demonstrate the flaws in the products of the competition, so the program will report major as well as extremely subtle differences in behavior in the part being tested. In a series of tests performed by the author of this document in 1994, components made by National Semiconductor, TI, StarTech, and CMD as well as megacells and emulations embedded in internal modems were tested with COMTEST. A difference count for some of these components is listed - below. Because these tests were performed in 1994, they may + below. Since these tests were performed in 1994, they may not reflect the current performance of the given product from a vendor. It should be noted that COMTEST normally aborts when an excessive number or certain types of problems have been detected.
As part of this testing, COMTEST was modified so that it would not abort no matter how many differences were encountered. Vendor Part Number Errors (aka "differences" reported) National (PC16550DV) 0 National (NS16550AFN) 0 National (NS16C552V) 0 TI (TL16550AFN) 3 CMD (16C550PE) 19 StarTech (ST16C550J) 23 Rockwell Reference modem with internal 16550 or an emulation (RC144DPi/C3000-25) 117 Sierra Modem with an internal 16550 (SC11951/SC11351) 91 To date, the author of this document has not found any non-National parts that report zero differences using the COMTEST program. It should also be noted that National has had five versions of the 16550 over the years and the newest parts behave a bit differently than the classic NS16550AFN that is considered the benchmark for functionality. COMTEST appears to turn a blind eye to the differences within the National product line and reports no errors on the National parts (except for the original 16550) even when there are official erratas that describe bugs in the A, B and C revisions of the parts, so this bias in COMTEST must be taken into account. It is important to understand that a simple count of differences from COMTEST does not reveal a lot about what differences are important and which are not. For example, about half of the differences reported in the two modems listed above that have internal UARTs were caused by the clone UARTs not supporting five- and six-bit character modes. The real 16550, 16450, and 8250 UARTs all support these modes and COMTEST checks the functionality of these modes so over fifty differences are reported. However, almost no modern modem supports five- or six-bit characters, particularly those with error-correction and compression capabilities. This means that the differences related to five- and six-bit character modes can be discounted. Many of the differences COMTEST reports have to do with timing. 
In many of the clone designs, when the host reads from one port, the status bits in some other port may not update in the same amount of time (some faster, some slower) as a real NS16550AFN and COMTEST looks for these differences. This means that the number of differences can be misleading in that one device may only have one or two differences but they are extremely serious, and some other device that updates the status registers faster or slower than the reference part (that would probably never affect the operation of a properly written driver) could have dozens of differences reported. COMTEST can be used as a screening tool to alert the administrator to the presence of potentially incompatible components that might cause problems or have to be handled as a special case. If you run COMTEST on a 16550 that is in a modem or a modem is attached to the serial port, you need to first issue a ATE0&W command to the modem so that the modem will not echo any of the test characters. If you forget to do this, COMTEST will report at least this one difference: Error (6)...Timeout interrupt failed: IIR = c1 LSR = 61 8250/16450/16550 Registers The 8250/16450/16550 UART occupies eight contiguous I/O port addresses. In the IBM PC, there are two defined locations for these eight ports and they are known collectively as COM1 and COM2. The makers of PC-clones and add-on cards have created two additional areas known as COM3 and COM4, but these extra COM ports conflict with other hardware on some systems. The most common conflict is with video adapters that provide IBM 8514 emulation. COM1 is located from 0x3f8 to 0x3ff and normally uses IRQ 4. COM2 is located from 0x2f8 to 0x2ff and normally uses IRQ 3. COM3 is located from 0x3e8 to 0x3ef and has no standardized IRQ. COM4 is located from 0x2e8 to 0x2ef and has no standardized IRQ. A description of the I/O ports of the 8250/16450/16550 UART is provided below. 
I/O Port Access Allowed Description +0x00 write (DLAB==0) Transmit Holding Register (THR). Information written to this port is treated as data words and will be transmitted by the UART. +0x00 read (DLAB==0) Receive Buffer Register (RBR). Any data words received by the UART from the serial link are accessed by the host by reading this port. +0x00 write/read (DLAB==1) Divisor Latch LSB (DLL). The master input clock (in the IBM PC, the master clock is 1.8432MHz) is divided by this value and the resulting clock will determine the baud rate of the UART. This register holds bits 0 through 7 of the divisor. +0x01 write/read (DLAB==1) Divisor Latch MSB (DLH). The master input clock (in the IBM PC, the master clock is 1.8432MHz) is divided by this value and the resulting clock will determine the baud rate of the UART. This register holds bits 8 through 15 of the divisor. +0x01 write/read (DLAB==0) Interrupt Enable Register (IER). The 8250/16450/16550 UART classifies events into one of four categories. Each category can be configured to generate an interrupt when any of the events occurs. The 8250/16450/16550 UART generates a single external interrupt signal regardless of how many events in the enabled categories have occurred. It is up to the host processor to respond to the interrupt and then poll the enabled interrupt categories (usually all categories have interrupts enabled) to determine the true cause(s) of the interrupt. Bit 7 Reserved, always 0. Bit 6 Reserved, always 0. Bit 5 Reserved, always 0. Bit 4 Reserved, always 0. Bit 3 Enable Modem Status Interrupt (EDSSI). Setting this bit to "1" allows the UART to generate an interrupt when a change occurs on one or more of the status lines. Bit 2 Enable Receiver Line Status Interrupt (ELSI). Setting this bit to "1" causes the UART to generate an interrupt when an error (or a BREAK signal) has been detected in the incoming data.
Bit 1 Enable Transmitter Holding Register Empty Interrupt (ETBEI) Setting this bit to "1" causes the UART to generate an interrupt when the UART has room for one or more additional characters that are to be transmitted. Bit 0 Enable Received Data Available Interrupt (ERBFI) Setting this bit to "1" causes the UART to generate an interrupt when the UART has received enough characters to exceed the trigger level of the FIFO, or the FIFO timer has expired (stale data), or a single character has been received when the FIFO is disabled. +0x02 write FIFO Control Register (FCR) (This port does not exist on the 8250 and 16450 UART.) Bit 7 Receiver Trigger Bit #1 Bit 6 Receiver Trigger Bit #0These two bits control at what point the receiver is to generate an interrupt when the FIFO is active. 7 6 How many words are received before an interrupt is generated 0 0 1 0 1 4 1 0 8 1 1 14 Bit 5 Reserved, always 0. Bit 4 Reserved, always 0. Bit 3 DMA Mode Select. If Bit 0 is set to "1" (FIFOs enabled), setting this bit changes the operation of the -RXRDY and -TXRDY signals from Mode 0 to Mode 1. Bit 2 Transmit FIFO Reset. When a "1" is written to this bit, the contents of the FIFO are discarded. Any word currently being transmitted will be sent intact. This function is useful in aborting transfers. Bit 1 Receiver FIFO Reset. When a "1" is written to this bit, the contents of the FIFO are discarded. Any word currently being assembled in the shift register will be received intact. Bit 0 16550 FIFO Enable. When set, both the transmit and receive FIFOs are enabled. Any contents in the holding register, shift registers or FIFOs are lost when FIFOs are enabled or disabled. +0x02 read Interrupt Identification Register Bit 7 FIFOs enabled. On the 8250/16450 UART, this bit is zero. Bit 6 FIFOs enabled. On the 8250/16450 UART, this bit is zero. Bit 5 Reserved, always 0. Bit 4 Reserved, always 0. Bit 3 Interrupt ID Bit #2. On the 8250/16450 UART, this bit is zero. 
Bit 2 Interrupt ID Bit #1 Bit 1 Interrupt ID Bit #0.These three bits combine to report the category of event that caused the interrupt that is in progress. These categories have priorities, so if multiple categories of events occur at the same time, the UART will report the more important events first and the host must resolve the events in the order they are reported. All events that caused the current interrupt must be resolved before any new interrupts will be generated. (This is a limitation of the PC architecture.) 2 1 0 Priority Description 0 1 1 First Received Error (OE, PE, BI, or FE) 0 1 0 Second Received Data Available 1 1 0 Second Trigger level identification (Stale data in receive buffer) 0 0 1 Third Transmitter has room for more words (THRE) 0 0 0 Fourth Modem Status Change (-CTS, -DSR, -RI, or -DCD) Bit 0 Interrupt Pending Bit. If this bit is set to "0", then at least one interrupt is pending. +0x03 write/read Line Control Register (LCR) Bit 7 Divisor Latch Access Bit (DLAB). When set, access to the data transmit/receive register (THR/RBR) and the Interrupt Enable Register (IER) is disabled. Any access to these ports is now redirected to the Divisor Latch Registers. Setting this bit, loading the Divisor Registers, and clearing DLAB should be done with interrupts disabled. Bit 6 Set Break. When set to "1", the transmitter begins to transmit continuous Spacing until this bit is set to "0". This overrides any bits of characters that are being transmitted. Bit 5 Stick Parity. When parity is enabled, setting this bit causes parity to always be "1" or "0", based on the value of Bit 4. Bit 4 Even Parity Select (EPS). When parity is enabled and Bit 5 is "0", setting this bit causes even parity to be transmitted and expected. Otherwise, odd parity is used. Bit 3 Parity Enable (PEN). When set to "1", a parity bit is inserted between the last bit of the data and the Stop Bit. The UART will also expect parity to be present in the received data. 
Bit 2 Number of Stop Bits (STB). If set to "1" and using 5-bit data words, 1.5 Stop Bits are transmitted and expected in each data word. For 6, 7 and 8-bit data words, 2 Stop Bits are transmitted and expected. When this bit is set to "0", one Stop Bit is used on each data word. Bit 1 Word Length Select Bit #1 (WLSB1) Bit 0 Word Length Select Bit #0 (WLSB0) Together these bits specify the number of bits in each data word. 1 0 Word Length 0 0 5 Data Bits 0 1 6 Data Bits 1 0 7 Data Bits 1 1 8 Data Bits +0x04 write/read Modem Control Register (MCR) Bit 7 Reserved, always 0. Bit 6 Reserved, always 0. Bit 5 Reserved, always 0. Bit 4 Loop-Back Enable. When set to "1", the UART transmitter and receiver are internally connected together to allow diagnostic operations. In addition, the UART modem control outputs are connected to the UART modem control inputs. CTS is connected to RTS, DTR is connected to DSR, OUT1 is connected to RI, and OUT 2 is connected to DCD. Bit 3 OUT 2. An auxiliary output that the host processor may set high or low. In the IBM PC serial adapter (and most clones), OUT 2 is used to tri-state (disable) the interrupt signal from the 8250/16450/16550 UART. Bit 2 OUT 1. An auxiliary output that the host processor may set high or low. This output is not used on the IBM PC serial adapter. Bit 1 Request to Send (RTS). When set to "1", the output of the UART -RTS line is Low (Active). Bit 0 Data Terminal Ready (DTR). When set to "1", the output of the UART -DTR line is Low (Active). +0x05 write/read Line Status Register (LSR) Bit 7 Error in Receiver FIFO. On the 8250/16450 UART, this bit is zero. This bit is set to "1" when any of the bytes in the FIFO have one or more of the following error conditions: PE, FE, or BI. Bit 6 Transmitter Empty (TEMT). When set to "1", there are no words remaining in the transmit FIFO or the transmit shift register. The transmitter is completely idle. Bit 5 Transmitter Holding Register Empty (THRE). 
When set to "1", the FIFO (or holding register) now has room for at least one additional word to transmit. The transmitter may still be transmitting when this bit is set to "1". Bit 4 Break Interrupt (BI). The receiver has detected a Break signal. Bit 3 Framing Error (FE). A Start Bit was detected but the Stop Bit did not appear at the expected time. The received word is probably garbled. Bit 2 Parity Error (PE). The parity bit was incorrect for the word received. Bit 1 Overrun Error (OE). A new word was received and there was no room in the receive buffer. The newly-arrived word in the shift register is discarded. On 8250/16450 UARTs, the word in the holding register is discarded and the newly-arrived word is put in the holding register. Bit 0 Data Ready (DR). One or more words are in the receive FIFO that the host may read. A word must be completely received and moved from the shift register into the FIFO (or holding register for 8250/16450 designs) before this bit is set. +0x06 write/read Modem Status Register (MSR) Bit 7 Data Carrier Detect (DCD). Reflects the state of the DCD line on the UART. Bit 6 Ring Indicator (RI). Reflects the state of the RI line on the UART. Bit 5 Data Set Ready (DSR). Reflects the state of the DSR line on the UART. Bit 4 Clear To Send (CTS). Reflects the state of the CTS line on the UART. Bit 3 Delta Data Carrier Detect (DDCD). Set to "1" if the -DCD line has changed state one or more times since the last time the MSR was read by the host. Bit 2 Trailing Edge Ring Indicator (TERI). Set to "1" if the -RI line has had a low to high transition since the last time the MSR was read by the host. Bit 1 Delta Data Set Ready (DDSR). Set to "1" if the -DSR line has changed state one or more times since the last time the MSR was read by the host. Bit 0 Delta Clear To Send (DCTS). Set to "1" if the -CTS line has changed state one or more times since the last time the MSR was read by the host. +0x07 write/read Scratch Register (SCR).
This register performs no function in the UART. Any value can be written by the host to this location and read by the host later on. Beyond the 16550A UART Although National Semiconductor has not offered any components compatible with the 16550 that provide additional features, various other vendors have. Some of these components are described below. It should be understood that to effectively utilize these improvements, drivers may have to be provided by the chip vendor since most of the popular operating systems do not support features beyond those provided by the 16550. ST16650 By default this part is similar to the NS16550A, but an extended 32-byte send and receive buffer can be optionally enabled. Made by StarTech. TIL16660 By default this part behaves similarly to the NS16550A, but an extended 64-byte send and receive buffer can be optionally enabled. Made by Texas Instruments. Hayes ESP This proprietary plug-in card contains a 2048-byte send and receive buffer, and supports data rates to 230.4Kbit/sec. Made by Hayes. In addition to these dumb UARTs, many vendors produce intelligent serial communication boards. This type of design usually provides a microprocessor that interfaces with several UARTs, processes and buffers the data, and then alerts the - main PC processor when necessary. Because the UARTs are not + main PC processor when necessary. As the UARTs are not directly accessed by the PC processor in this type of communication system, it is not necessary for the vendor to use UARTs that are compatible with the 8250, 16450, or the 16550 UART. This leaves the designer free to use components that may have better performance characteristics.
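The Divisor Latch and Line Control Register descriptions in the register table can be illustrated with a short sketch. It is written in Python purely for illustration and is not taken from any driver. It assumes the conventional 16x sampling clock of the 8250 family (baud rate = master clock / (16 x divisor)), which this article does not state explicitly, and the function names divisor_for_baud and lcr_value are hypothetical.

```python
# Illustrative sketch only; the function names are hypothetical.
MASTER_CLOCK_HZ = 1843200  # IBM PC master clock quoted in the register table


def divisor_for_baud(baud, clock_hz=MASTER_CLOCK_HZ):
    """Return (DLL, DLH) for the requested baud rate.

    Assumption: the UART samples at 16x the baud rate, so
    divisor = clock / (16 * baud), rounded to the nearest integer.
    """
    divisor = round(clock_hz / (16 * baud))
    if not 1 <= divisor <= 0xFFFF:
        raise ValueError("baud rate %d is out of range" % baud)
    # DLL holds bits 0 through 7 of the divisor, DLH holds bits 8 through 15.
    return divisor & 0xFF, divisor >> 8


def lcr_value(data_bits=8, parity=None, two_stop_bits=False):
    """Build a Line Control Register value from the bit layout in the table."""
    if data_bits not in (5, 6, 7, 8):
        raise ValueError("data_bits must be 5, 6, 7 or 8")
    if parity not in (None, "odd", "even"):
        raise ValueError("parity must be None, 'odd' or 'even'")
    value = data_bits - 5      # bits 1-0: word length select
    if two_stop_bits:
        value |= 0x04          # bit 2: STB
    if parity is not None:
        value |= 0x08          # bit 3: PEN
    if parity == "even":
        value |= 0x10          # bit 4: EPS
    return value
```

Under these assumptions, 9600 baud corresponds to a divisor of 12 (DLL 0x0C, DLH 0x00), and 8 data bits with no parity and one stop bit corresponds to an LCR value of 0x03. A driver would set DLAB (LCR bit 7), write DLL and DLH, and then write the LCR value with DLAB cleared.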
Configuring the <filename>sio</filename> driver The sio driver provides support for NS8250-, NS16450-, NS16550- and NS16550A-based EIA RS-232C (CCITT V.24) communications interfaces. Several multiport cards are supported as well. See the &man.sio.4; manual page for detailed technical documentation. Digi International (DigiBoard) PC/8 Contributed by &a.awebster.email;. 26 August 1995. Here is a config snippet from a machine with a Digi International PC/8 with 16550. It has 8 modems connected to these 8 lines, and they work just great. Do not forget to add options COM_MULTIPORT or it will not work very well! device sio4 at isa? port 0x100 flags 0xb05 device sio5 at isa? port 0x108 flags 0xb05 device sio6 at isa? port 0x110 flags 0xb05 device sio7 at isa? port 0x118 flags 0xb05 device sio8 at isa? port 0x120 flags 0xb05 device sio9 at isa? port 0x128 flags 0xb05 device sio10 at isa? port 0x130 flags 0xb05 device sio11 at isa? port 0x138 flags 0xb05 irq 9 The trick in setting this up is that the MSB of the flags represents the last SIO port, in this case 11, so the flags are 0xb05. Boca 16 Contributed by &a.whiteside.email;. 26 August 1995. The procedures to make a Boca 16 port board work with FreeBSD are pretty straightforward, but you will need a couple things to make it work: You either need the kernel sources installed so you can recompile the necessary options or you will need someone else to compile it for you. The 2.0.5 default kernel does not come with multiport support enabled and you will need to add a device entry for each port anyway. Two, you will need to know the interrupt and IO setting for your Boca Board so you can set these options properly in the kernel. One important note — the actual UART chips for the Boca 16 are in the connector box, not on the internal board itself. So if you have it unplugged, probes of those ports will fail. I have never tested booting with the box unplugged and plugging it back in, and I suggest you do not either.
If you do not already have a custom kernel configuration file set up, refer to the Kernel Configuration chapter of the FreeBSD Handbook for general procedures. The following are the specifics for the Boca 16 board and assume you are using the kernel name MYKERNEL and editing with vi. Add the line options COM_MULTIPORT to the config file. Where the current device sioN lines are, you will need to add 16 more devices. The following example is for a Boca Board with an interrupt of 3, and a base IO address 100h. The IO address for each port is +8 hexadecimal from the previous port, thus the 100h, 108h, 110h... addresses. device sio1 at isa? port 0x100 flags 0x1005 device sio2 at isa? port 0x108 flags 0x1005 device sio3 at isa? port 0x110 flags 0x1005 device sio4 at isa? port 0x118 flags 0x1005 … device sio15 at isa? port 0x170 flags 0x1005 device sio16 at isa? port 0x178 flags 0x1005 irq 3 The flags entry must be changed from this example unless you are using the exact same sio assignments. Flags are set according to 0xMYY where M indicates the minor number of the master port (the last port on a Boca 16) and YY indicates if FIFO is enabled or disabled (enabled), IRQ sharing is used (yes) and if there is an AST/4 compatible IRQ control register (no). In this example, flags 0x1005 indicates that the master port is sio16. If I added another board and assigned sio17 through sio28, the flags for all 16 ports on that board would be 0x1C05, where 1C indicates the minor number of the master port. Do not change the 05 setting. Save and complete the kernel configuration, recompile, install and reboot.
Presuming you have successfully installed the recompiled kernel and have it set to the correct address and IRQ, your boot message should indicate the successful probe of the Boca ports as follows: (obviously the sio numbers, IO and IRQ could be different) sio1 at 0x100-0x107 flags 0x1005 on isa sio1: type 16550A (multiport) sio2 at 0x108-0x10f flags 0x1005 on isa sio2: type 16550A (multiport) sio3 at 0x110-0x117 flags 0x1005 on isa sio3: type 16550A (multiport) sio4 at 0x118-0x11f flags 0x1005 on isa sio4: type 16550A (multiport) sio5 at 0x120-0x127 flags 0x1005 on isa sio5: type 16550A (multiport) sio6 at 0x128-0x12f flags 0x1005 on isa sio6: type 16550A (multiport) sio7 at 0x130-0x137 flags 0x1005 on isa sio7: type 16550A (multiport) sio8 at 0x138-0x13f flags 0x1005 on isa sio8: type 16550A (multiport) sio9 at 0x140-0x147 flags 0x1005 on isa sio9: type 16550A (multiport) sio10 at 0x148-0x14f flags 0x1005 on isa sio10: type 16550A (multiport) sio11 at 0x150-0x157 flags 0x1005 on isa sio11: type 16550A (multiport) sio12 at 0x158-0x15f flags 0x1005 on isa sio12: type 16550A (multiport) sio13 at 0x160-0x167 flags 0x1005 on isa sio13: type 16550A (multiport) sio14 at 0x168-0x16f flags 0x1005 on isa sio14: type 16550A (multiport) sio15 at 0x170-0x177 flags 0x1005 on isa sio15: type 16550A (multiport) sio16 at 0x178-0x17f irq 3 flags 0x1005 on isa sio16: type 16550A (multiport master) If the messages go by too fast to see, &prompt.root; dmesg | more will show you the boot messages. Next, appropriate entries in /dev for the devices must be made using the /dev/MAKEDEV script. This step can be omitted if you are running FreeBSD 5.X with a kernel that has &man.devfs.5; support compiled in. 
If you do need to create the /dev entries, run the following as root: &prompt.root; cd /dev &prompt.root; ./MAKEDEV tty1 &prompt.root; ./MAKEDEV cua1 (everything in between) &prompt.root; ./MAKEDEV ttyg &prompt.root; ./MAKEDEV cuag If you do not want or need call-out devices for some reason, you can dispense with making the cua* devices. If you want a quick and sloppy way to make sure the devices are working, you can simply plug a modem into each port and (as root) &prompt.root; echo at > ttyd* for each device you have made. You should see the RX lights flash for each working port. Support for Cheap Multi-UART Cards Contributed by Helge Oldach hmo@sep.hamburg.com, September 1999 Ever wondered about FreeBSD support for your $20 multi-I/O card with two (or more) COM ports, sharing IRQs? Here is how: Usually the only option to support this kind of board is to use a distinct IRQ for each port. For example, if your CPU board has an on-board COM1 port (aka sio0–I/O address 0x3F8 and IRQ 4) and you have an extension board with two UARTs, you will commonly need to configure them as COM2 (aka sio1–I/O address 0x2F8 and IRQ 3), and the third port (aka sio2) as I/O 0x3E8 and IRQ 5. Obviously this is a waste of IRQ resources, as it should be basically possible to run both extension board ports using a single IRQ with the COM_MULTIPORT configuration described in the previous sections. Such cheap I/O boards commonly have a 4 by 3 jumper matrix for the COM ports, similar to the following: o o o * Port A | o * o * Port B | o * o o IRQ 2 3 4 5 Shown here is port A wired for IRQ 5 and port B wired for IRQ 3. The IRQ columns on your specific board may vary—other boards may supply jumpers for IRQs 3, 4, 5, and 7 instead. One could conclude that wiring both ports for IRQ 3 using a handcrafted wire-made jumper covering all three connection points in the IRQ 3 column would solve the issue, but no.
You cannot duplicate IRQ 3 because the output drivers of each UART are wired in a totem pole fashion, so if one of the UARTs drives IRQ 3, the output signal will not be what you would expect. Depending on the implementation of the extension board or your motherboard, the IRQ 3 line will continuously stay up, or always stay low. You need to decouple the IRQ drivers for the two UARTs, so that the IRQ line of the board only goes up if (and only if) one of the UARTs asserts a IRQ, and stays low otherwise. The solution was proposed by Joerg Wunsch j@ida.interface-business.de: To solder up a wired-or consisting of two diodes (Germanium or Schottky-types strongly preferred) and a 1 kOhm resistor. Here is the schematic, starting from the 4 by 3 jumper field above: Diode +---------->|-------+ / | o * o o | 1 kOhm Port A +----|######|-------+ o * o o | | Port B `-------------------+ ==+== o * o o | Ground \ | +--------->|-------+ IRQ 2 3 4 5 Diode The cathodes of the diodes are connected to a common point, together with a 1 kOhm pull-down resistor. It is essential to connect the resistor to ground to avoid floating of the IRQ line on the bus. Now we are ready to configure a kernel. Staying with this example, we would configure: # standard on-board COM1 port device sio0 at isa? port "IO_COM1" flags 0x10 # patched-up multi-I/O extension board options COM_MULTIPORT device sio1 at isa? port "IO_COM2" flags 0x205 device sio2 at isa? port "IO_COM3" flags 0x205 irq 3 Note that the flags setting for sio1 and sio2 is truly essential; refer to &man.sio.4; for details. (Generally, the 2 in the "flags" attribute refers to sio2 which holds the IRQ, and you surely want a 5 low nibble.) 
With kernel verbose mode turned on this should yield something similar to this: sio0: irq maps: 0x1 0x11 0x1 0x1 sio0 at 0x3f8-0x3ff irq 4 flags 0x10 on isa sio0: type 16550A sio1: irq maps: 0x1 0x9 0x1 0x1 sio1 at 0x2f8-0x2ff flags 0x205 on isa sio1: type 16550A (multiport) sio2: irq maps: 0x1 0x9 0x1 0x1 sio2 at 0x3e8-0x3ef irq 3 flags 0x205 on isa sio2: type 16550A (multiport master) Though /sys/i386/isa/sio.c is somewhat cryptic with its use of the irq maps array above, the basic idea is that you observe 0x1 in the first, third, and fourth place. This means that the corresponding IRQ was set upon output and cleared after, which is just what we would expect. If your kernel does not display this behavior, most likely there is something wrong with your wiring. Configuring the <filename>cy</filename> driver Contributed by Alex Nash. 6 June 1996. The Cyclades multiport cards are based on the cy driver instead of the usual sio driver used by other multiport cards. Configuration is a simple matter of: Add the cy device to your kernel configuration (note that your irq and iomem settings may differ). device cy0 at isa? irq 10 iomem 0xd4000 iosiz 0x2000 Rebuild and install the new kernel. Make the device nodes by typing (the following example assumes an 8-port board) You can omit this part if you are running FreeBSD 5.X with &man.devfs.5;. : &prompt.root; cd /dev &prompt.root; for i in 0 1 2 3 4 5 6 7;do ./MAKEDEV cuac$i ttyc$i;done If appropriate, add dialup entries to /etc/ttys by duplicating serial device (ttyd) entries and using ttyc in place of ttyd. For example: ttyc0 "/usr/libexec/getty std.38400" unknown on insecure ttyc1 "/usr/libexec/getty std.38400" unknown on insecure ttyc2 "/usr/libexec/getty std.38400" unknown on insecure … ttyc7 "/usr/libexec/getty std.38400" unknown on insecure Reboot with the new kernel. Configuring the <filename>si</filename> driver Contributed by &a.nsayer.email;. 25 March 1998. 
The Specialix SI/XIO and SX multiport cards use the si driver. A single machine can have up to 4 host cards. The following host cards are supported: ISA SI/XIO host card (2 versions), EISA SI/XIO host card, PCI SI/XIO host card, ISA SX host card, PCI SX host card. Although the SX and SI/XIO host cards look markedly different, their functionality is basically the same. The host cards do not use I/O locations, but instead require a 32K chunk of memory. The factory configuration for ISA cards places this at 0xd0000-0xd7fff. They also require an IRQ. PCI cards will, of course, auto-configure themselves. You can attach up to 4 external modules to each host card. The external modules contain either 4 or 8 serial ports. They come in the following varieties: SI 4 or 8 port modules. Up to 57600 bps on each port supported. XIO 8 port modules. Up to 115200 bps on each port supported. One type of XIO module has 7 serial and 1 parallel port. SXDC 8 port modules. Up to 921600 bps on each port supported. Like XIO, a module is available with one parallel port as well. To configure an ISA host card, add the following line to your kernel configuration file, changing the numbers as appropriate: device si0 at isa? iomem 0xd0000 irq 11 Valid IRQ numbers are 9, 10, 11, 12 and 15 for SX ISA host cards and 11, 12 and 15 for SI/XIO ISA host cards. To configure an EISA or PCI host card, use this line: device si0 After adding the configuration entry, rebuild and install your new kernel. The following step is not necessary if you are using &man.devfs.5; in FreeBSD 5.X. After rebooting with the new kernel, you need to make the device nodes in /dev. The MAKEDEV script will take care of this for you.
Count how many total ports you have and type: &prompt.root; cd /dev &prompt.root; ./MAKEDEV ttyAnn cuaAnn (where nn is the number of ports) If you want login prompts to appear on these ports, you will need to add lines like this to /etc/ttys: ttyA01 "/usr/libexec/getty std.9600" vt100 on insecure Change the terminal type as appropriate. For modems, dialup or unknown is fine.
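As a recap of the COM_MULTIPORT flags values quoted in the sio examples earlier in this article (0xb05, 0x1005, 0x1C05 and 0x205), the following sketch shows how they follow from the 0xMYY layout described for the Boca 16. It is illustrative Python, not part of any FreeBSD tool. The function name multiport_flags, the constant LOW_BYTE, and the one-byte range check are assumptions made for this example; refer to &man.sio.4; for the authoritative meaning of each flag bit.

```python
# Illustrative only: reproduces the flags values quoted in the sio examples.
LOW_BYTE = 0x05  # the "05" that the Boca 16 text says not to change


def multiport_flags(master_minor):
    """Return the flags value for ports whose master port is sio<master_minor>."""
    # Assumption: the master minor number occupies a single byte above LOW_BYTE.
    if not 0 <= master_minor <= 0xFF:
        raise ValueError("master minor number must fit in one byte")
    return (master_minor << 8) | LOW_BYTE


for minor in (11, 16, 28, 2):
    print("master sio%d -> flags 0x%x" % (minor, multiport_flags(minor)))
```

Every port that shares the interrupt carries the same flags value; only the master port's device line also names the irq.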
diff --git a/en_US.ISO8859-1/articles/solid-state/article.xml b/en_US.ISO8859-1/articles/solid-state/article.xml index 232a5d59f4..8de18f207c 100644 --- a/en_US.ISO8859-1/articles/solid-state/article.xml +++ b/en_US.ISO8859-1/articles/solid-state/article.xml @@ -1,498 +1,498 @@
&os; and Solid State Devices John Kozubik
john@kozubik.com
2001 2009 The FreeBSD Documentation Project &tm-attrib.freebsd; &tm-attrib.general; &legalnotice; $FreeBSD$ $FreeBSD$ This article covers the use of solid state disk devices in &os; to create embedded systems. Embedded systems have the advantage of increased stability due to the lack of integral moving parts (hard drives). Account must be taken, however, for the generally low disk space available in the system and the durability of the storage medium. Specific topics to be covered include the types and attributes of solid state media suitable for disk use in &os;, kernel options that are of interest in such an environment, the rc.initdiskless mechanisms that automate the initialization of such systems and the need for read-only filesystems, and building filesystems from scratch. The article will conclude with some general strategies for small and read-only &os; environments.
Solid State Disk Devices The scope of this article will be limited to solid state disk devices made from flash memory. Flash memory is a solid state memory (no moving parts) that is non-volatile (the memory maintains data even after all power sources have been disconnected). Flash memory can withstand tremendous physical shock and is reasonably fast (the flash memory solutions covered in this article are slightly slower than an EIDE hard disk for write operations, and much faster for read operations). One very important aspect of flash memory, the ramifications of which will be discussed later in this article, is that each sector has a limited rewrite capacity. You can only write, erase, and write again to a sector of flash memory a certain number of times before the sector becomes permanently unusable. Although many flash memory products automatically map bad blocks, and although some even distribute write operations evenly throughout the unit, the fact remains that there exists a limit to the amount of writing that can be done to the device. Competitive units have between 1,000,000 and 10,000,000 writes per sector in their specification. This figure varies due to the temperature of the environment. Specifically, we will be discussing ATA compatible compact-flash units, which are quite popular as storage media for digital cameras. Of particular interest is the fact that they pin out directly to the IDE bus and are compatible with the ATA command set. Therefore, with a very simple and low-cost adaptor, these devices can be attached directly to an IDE bus in a computer. Once implemented in this manner, operating systems such as &os; see the device as a normal hard disk (albeit small). Other solid state disk solutions do exist, but their expense, obscurity, and relative difficulty of use place them beyond the scope of this article. Kernel Options A few kernel options are of specific interest to those creating an embedded &os; system.
All embedded &os; systems that use flash memory as system disk will be interested in memory disks and memory filesystems. - Because of the limited number of writes that can be done to + As a result of the limited number of writes that can be done to flash memory, the disk and the filesystems on the disk will most likely be mounted read-only. In this environment, filesystems such as /tmp and /var are mounted as memory filesystems to allow the system to create logs and update counters and temporary files. Memory filesystems are a critical component to a successful solid state &os; implementation. You should make sure the following lines exist in your kernel configuration file: options MFS # Memory Filesystem options MD_ROOT # md device usable as a potential root device pseudo-device md # memory disk The <literal>rc</literal> Subsystem and Read-Only Filesystems The post-boot initialization of an embedded &os; system is controlled by /etc/rc.initdiskless. /etc/rc.d/var mounts /var as a memory filesystem, makes a configurable list of directories in /var with the &man.mkdir.1; command, and changes modes on some of those directories. In the execution of /etc/rc.d/var, one other rc.conf variable comes into play – varsize. A /var partition is created by /etc/rc.d/var based on the value of this variable in rc.conf: varsize=8192 Remember that this value is in sectors by default. The fact that /var is a read-write filesystem is an important distinction, as the / partition (and any other partitions you may have on your flash media) should be mounted read-only. Remember that in we detailed the limitations of flash memory - specifically the limited write capability. The importance of not mounting filesystems on flash media read-write, and the importance of not using a swap file, cannot be overstated. A swap file on a busy system can burn through a piece of flash media in less than one year. Heavy logging or temporary file creation and destruction can do the same. 
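As a rough illustration of why swap and heavy logging wear flash out so quickly, the following Python sketch estimates how long a single sector lasts when it is rewritten at a fixed interval. It uses the 1,000,000 to 10,000,000 writes-per-sector range quoted earlier; the 30 second rewrite interval, the absence of wear leveling, and the function name are assumptions made only for this example.

```python
# Rough wear estimate for one flash sector, assuming NO wear leveling.
SECONDS_PER_DAY = 24 * 60 * 60


def sector_lifetime_days(rated_writes, seconds_between_rewrites):
    """Days until one sector reaches its rated number of writes."""
    return rated_writes * seconds_between_rewrites / SECONDS_PER_DAY


# Hypothetical workload: the same sector is rewritten every 30 seconds.
low_end = sector_lifetime_days(1_000_000, 30)
high_end = sector_lifetime_days(10_000_000, 30)

print(f"{low_end:.0f} days, {high_end:.0f} days")  # prints "347 days, 3472 days"
```

Under these assumptions a unit at the low end of the specification loses the sector in just under a year, which is consistent with the warning above. Wear leveling, where the device provides it, spreads the writes out and stretches this figure.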
Therefore, in addition to removing the swap entry from your /etc/fstab, you should also change the Options field for each filesystem to ro as follows: # Device Mountpoint FStype Options Dump Pass# /dev/ad0s1a / ufs ro 1 1 A few applications in the average system will immediately begin to fail as a result of this change. For instance, cron will not run properly as a result of missing cron tabs in the /var created by /etc/rc.d/var, and syslog and dhcp will encounter problems as well as a result of the read-only filesystem and missing items in the /var that /etc/rc.d/var has created. These are only temporary problems though, and are addressed, along with solutions to the execution of other common software packages in . An important thing to remember is that a filesystem that was mounted read-only with /etc/fstab can be made read-write at any time by issuing the command: &prompt.root; /sbin/mount -uw partition and can be toggled back to read-only with the command: &prompt.root; /sbin/mount -ur partition Building a File System from Scratch - Because ATA compatible compact-flash cards are seen by &os; + Since ATA compatible compact-flash cards are seen by &os; as normal IDE hard drives, you could theoretically install &os; from the network using the kern and mfsroot floppies or from a CD. However, even a small installation of &os; using normal installation procedures can produce a system in size of greater - than 200 megabytes. Because most people will be using smaller + than 200 megabytes. Most people will be using smaller flash memory devices (128 megabytes is considered fairly large - - 32 or even 16 megabytes is common) an installation using normal + 32 or even 16 megabytes is common), so an installation using normal mechanisms is not possible—there is simply not enough disk space for even the smallest of conventional installations. The easiest way to overcome this space limitation is to install &os; using conventional means to a normal hard disk. 
After the installation is complete, pare down the operating system to a size that will fit onto your flash media, then tar the entire filesystem. The following steps will guide you through the process of preparing a piece of flash memory for your tarred filesystem. Remember, because a normal installation is not being performed, operations such as partitioning, labeling, file-system creation, etc. need to be performed by hand. In addition to the kern and mfsroot floppy disks, you will also need to use the fixit floppy. Partitioning Your Flash Media Device After booting with the kern and mfsroot floppies, choose custom from the installation menu. In the custom installation menu, choose partition. In the partition menu, you should delete all existing partitions using d. After deleting all existing partitions, create a partition using c and accept the default value for the size of the partition. When asked for the type of the partition, make sure the value is set to 165. Now write this partition table to the disk by pressing w (this is a hidden option on this screen). If you are using an ATA compatible compact flash card, you should choose the &os; Boot Manager. Now press q to quit the partition menu. You will be shown the boot manager menu once more - repeat the choice you made earlier. Creating Filesystems on Your Flash Memory Device Exit the custom installation menu, and from the main installation menu choose the fixit option. After entering the fixit environment, enter the following command: &prompt.root; disklabel -e /dev/ad0c At this point you will have entered the vi editor under the auspices of the disklabel command. Next, you need to add an a: line at the end of the file. This a: line should look like: a: 123456 0 4.2BSD 0 0 Where 123456 is a number that is exactly the same as the number in the existing c: entry for size. Basically you are duplicating the existing c: line as an a: line, making sure that fstype is 4.2BSD. Save the file and exit. 
&prompt.root; disklabel -B -r /dev/ad0c &prompt.root; newfs /dev/ad0a Placing Your Filesystem on the Flash Media Mount the newly prepared flash media: &prompt.root; mount /dev/ad0a /flash Bring this machine up on the network so we may transfer our tar file and explode it onto our flash media filesystem. One example of how to do this is: &prompt.root; ifconfig xl0 192.168.0.10 netmask 255.255.255.0 &prompt.root; route add default 192.168.0.1 Now that the machine is on the network, transfer your tar file. You may be faced with a bit of a dilemma at this point - if your flash memory part is 128 megabytes, for instance, and your tar file is larger than 64 megabytes, you cannot have your tar file on the flash media at the same time as you explode it - you will run out of space. One solution to this problem, if you are using FTP, is to untar the file while it is transferred over FTP. If you perform your transfer in this manner, you will never have the tar file and the tar contents on your disk at the same time: ftp> get tarfile.tar "| tar xvf -" If your tarfile is gzipped, you can accomplish this as well: ftp> get tarfile.tar "| zcat | tar xvf -" After the contents of your tarred filesystem are on your flash memory filesystem, you can unmount the flash memory and reboot: &prompt.root; cd / &prompt.root; umount /flash &prompt.root; exit Assuming that you configured your filesystem correctly when it was built on the normal hard disk (with your filesystems mounted read-only, and with the necessary options compiled into the kernel) you should now be successfully booting your &os; embedded system. System Strategies for Small and Read Only Environments In , it was pointed out that the /var filesystem constructed by /etc/rc.d/var and the presence of a read-only root filesystem causes problems with many common software packages used with &os;. In this article, suggestions for successfully running cron, syslog, ports installations, and the Apache web server will be provided. 
Cron Upon boot, /var gets populated by /etc/rc.d/var using the list from /etc/mtree/BSD.var.dist, so the cron, cron/tabs, at, and a few other standard directories get created. However, this does not solve the problem of maintaining cron tabs across reboots. When the system reboots, the /var filesystem that is in memory will disappear and any cron tabs you may have had in it will also disappear. Therefore, one solution would be to create cron tabs for the users that need them, mount your / filesystem as read-write and copy those cron tabs to somewhere safe, like /etc/tabs, then add a line to the end of /etc/rc.initdiskless that copies those crontabs into /var/cron/tabs after that directory has been created during system initialization. You may also need to add a line that changes modes and permissions on the directories you create and the files you copy with /etc/rc.initdiskless. Syslog syslog.conf specifies the locations of certain log files that exist in /var/log. These files are not created by /etc/rc.d/var upon system initialization. Therefore, somewhere in /etc/rc.d/var, after the section that creates the directories in /var, you will need to add something like this: &prompt.root; touch /var/log/security /var/log/maillog /var/log/cron /var/log/messages &prompt.root; chmod 0644 /var/log/* Ports Installation Before discussing the changes necessary to successfully use the ports tree, a reminder is necessary regarding the read-only nature of your filesystems on the flash media. Since they are read-only, you will need to temporarily mount them read-write using the mount syntax shown in . You should always remount those filesystems read-only when you are done with any maintenance - unnecessary writes to the flash media could considerably shorten its lifespan. To make it possible to enter a ports directory and successfully run make install, we must create a packages directory on a non-memory filesystem that will keep track of - our packages across reboots. 
Because it is necessary to mount + our packages across reboots. As it is necessary to mount your filesystems as read-write for the installation of a package anyway, it is sensible to assume that an area on the flash media can also be used for package information to be written to. First, create a package database directory. This is normally in /var/db/pkg, but we cannot place it there as it will disappear every time the system is booted. &prompt.root; mkdir /etc/pkg Now, add a line to /etc/rc.d/var that links the /etc/pkg directory to /var/db/pkg. An example: &prompt.root; ln -s /etc/pkg /var/db/pkg Now, any time that you mount your filesystems as read-write and install a package, the make install will work, and package information will be written successfully to /etc/pkg (because the filesystem will, at that time, be mounted read-write) which will always be available to the operating system as /var/db/pkg. Apache Web Server The steps in this section are only necessary if Apache is set up to write its pid or log information outside of /var. By default, Apache keeps its pid file in /var/run/httpd.pid and its log files in /var/log. It is now assumed that Apache keeps its log files in a directory apache_log_dir outside of /var. When this directory lives on a read-only filesystem, Apache will not be able to save any log files, and may have problems working. If so, it is necessary to add a new directory to the list of directories in /etc/rc.d/var to create in /var, and to link apache_log_dir to /var/log/apache. It is also necessary to set permissions and ownership on this new directory. First, add the directory log/apache to the list of directories to be created in /etc/rc.d/var. 
Second, add these commands to /etc/rc.d/var after the directory creation section: &prompt.root; chmod 0774 /var/log/apache &prompt.root; chown nobody:nobody /var/log/apache Finally, remove the existing apache_log_dir directory, and replace it with a link: &prompt.root; rm -rf apache_log_dir &prompt.root; ln -s /var/log/apache apache_log_dir
diff --git a/en_US.ISO8859-1/articles/vm-design/article.xml b/en_US.ISO8859-1/articles/vm-design/article.xml index 2cf7e001eb..79b56d296c 100644 --- a/en_US.ISO8859-1/articles/vm-design/article.xml +++ b/en_US.ISO8859-1/articles/vm-design/article.xml @@ -1,899 +1,899 @@
Design elements of the &os; VM system Matthew Dillon
dillon@apollo.backplane.com
&tm-attrib.freebsd; &tm-attrib.linux; &tm-attrib.microsoft; &tm-attrib.opengroup; &tm-attrib.general; $FreeBSD$ $FreeBSD$ The title is really just a fancy way of saying that I am going to attempt to describe the whole VM enchilada, hopefully in a way that everyone can follow. For the last year I have concentrated on a number of major kernel subsystems within &os;, with the VM and Swap subsystems being the most interesting and NFS being a necessary chore. I rewrote only small portions of the code. In the VM arena the only major rewrite I have done is to the swap subsystem. Most of my work was cleanup and maintenance, with only moderate code rewriting and no major algorithmic adjustments within the VM subsystem. The bulk of the VM subsystem's theoretical base remains unchanged and a lot of the credit for the modernization effort in the last few years belongs to John Dyson and David Greenman. Not being a historian like Kirk I will not attempt to tag all the various features with people's names, since I will invariably get it wrong. This article was originally published in the January 2000 issue of DaemonNews. This version of the article may include updates from Matt and other authors to reflect changes in &os;'s VM implementation.
Introduction Before moving along to the actual design let's spend a little time on the necessity of maintaining and modernizing any long-living codebase. In the programming world, algorithms tend to be more important than code and it is precisely due to BSD's academic roots that a great deal of attention was paid to algorithm design from the beginning. More attention paid to the design generally leads to a clean and flexible codebase that can be fairly easily modified, extended, or replaced over time. While BSD is considered an old operating system by some people, those of us who work on it tend to view it more as a mature codebase which has various components modified, extended, or replaced with modern code. It has evolved, and &os; is at the bleeding edge no matter how old some of the code might be. This is an important distinction to make and one that is unfortunately lost to many people. The biggest error a programmer can make is to not learn from history, and this is precisely the error that many other modern operating systems have made. &windowsnt; is the best example of this, and the consequences have been dire. Linux also makes this mistake to some degree—enough that we BSD folk can make small jokes about it every once in a while, anyway. Linux's problem is simply one of a lack of experience and history to compare ideas against, a problem that is easily and rapidly being addressed by the Linux community in the same way it has been addressed in the BSD community—by continuous code development. The &windowsnt; folk, on the other hand, repeatedly make the same mistakes solved by &unix; decades ago and then spend years fixing them. Over and over again. They have a severe case of not designed here and we are always right because our marketing department says so. I have little tolerance for anyone who cannot learn from history. 
Much of the apparent complexity of the &os; design, especially in the VM/Swap subsystem, is a direct result of having to solve serious performance issues that occur under various conditions. These issues are not due to bad algorithmic design but instead arise from environmental factors. In any direct comparison between platforms, these issues become most apparent when system resources begin to get stressed. As I describe &os;'s VM/Swap subsystem the reader should always keep two points in mind: The most important aspect of performance design is what is known as Optimizing the Critical Path. It is often the case that performance optimizations add a little bloat to the code in order to make the critical path perform better. A solid, generalized design outperforms a heavily-optimized design over the long run. While a generalized design may end up being slower than a heavily-optimized design when they are first implemented, the generalized design tends to be easier to adapt to changing conditions and the heavily-optimized design winds up having to be thrown away. Any codebase that will survive and be maintainable for years must therefore be designed properly from the beginning even if it costs some performance. Twenty years ago people were still arguing that programming in assembly was better than programming in a high-level language because it produced code that was ten times as fast. Today, the fallibility of that argument is obvious, as are the parallels to algorithmic design and code generalization. VM Objects The best way to begin describing the &os; VM system is to look at it from the perspective of a user-level process. Each user process sees a single, private, contiguous VM address space containing several types of memory objects. These objects have various characteristics. Program code and program data are effectively a single memory-mapped file (the binary file being run), but program code is read-only while program data is copy-on-write.
Program BSS is just memory allocated and filled with zeros on demand, called demand zero page fill. Arbitrary files can be memory-mapped into the address space as well, which is how the shared library mechanism works. Such mappings can require modifications to remain private to the process making them. The fork system call adds an entirely new dimension to the VM management problem on top of the complexity already given. A program binary data page (which is a basic copy-on-write page) illustrates the complexity. A program binary contains a preinitialized data section which is initially mapped directly from the program file. When a program is loaded into a process's VM space, this area is initially memory-mapped and backed by the program binary itself, allowing the VM system to free/reuse the page and later load it back in from the binary. The moment a process modifies this data, however, the VM system must make a private copy of the page for that process. Since the private copy has been modified, the VM system may no longer free it, because there is no longer any way to restore it later on. You will notice immediately that what was originally a simple file mapping has become much more complex. Data may be modified on a page-by-page basis whereas the file mapping encompasses many pages at once. The complexity further increases when a process forks. When a process forks, the result is two processes—each with their own private address spaces, including any modifications made by the original process prior to the call to fork(). It would be silly for the VM system to make a complete copy of the data at the time of the fork() because it is quite possible that at least one of the two processes will only need to read from that page from then on, allowing the original page to continue to be used. 
What was a private page is made copy-on-write again, since each process (parent and child) expects its own post-fork modifications to remain private to itself and not affect the other. &os; manages all of this with a layered VM Object model. The original binary program file winds up being the lowest VM Object layer. A copy-on-write layer is pushed on top of that to hold those pages which had to be copied from the original file. If the program modifies a data page belonging to the original file the VM system takes a fault and makes a copy of the page in the higher layer. When a process forks, additional VM Object layers are pushed on. This might make a little more sense with a fairly basic example. A fork() is a common operation for any *BSD system, so this example will consider a program that starts up and forks. When the process starts, the VM system creates an object layer, let's call this A:

+---------------+
|       A       |
+---------------+

A represents the file: pages may be paged in and out of the file's physical media as necessary. Paging in from the disk is reasonable for a program, but we really do not want to page back out and overwrite the executable. The VM system therefore creates a second layer, B, that will be physically backed by swap space:

+---------------+
|       B       |
+---------------+
|       A       |
+---------------+

On the first write to a page after this, a new page is created in B, and its contents are initialized from A. All pages in B can be paged in or out to a swap device. When the program forks, the VM system creates two new object layers, C1 for the parent and C2 for the child, that rest on top of B:

+-------+-------+
|   C1  |   C2  |
+-------+-------+
|       B       |
+---------------+
|       A       |
+---------------+

In this case, let's say a page in B is modified by the original parent process. The process will take a copy-on-write fault and duplicate the page in C1, leaving the original page in B untouched.
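The layering described so far can be sketched in a few lines of Python. This is only a toy model and not how the kernel represents VM Objects: the class name, the dictionary of pages, and the page contents are all invented for illustration, and a write simply replaces the whole page instead of copying it first and then modifying it.

```python
class VMObject:
    """One layer in a shadow chain; `backing` is the layer underneath."""

    def __init__(self, name, backing=None, pages=None):
        self.name = name
        self.backing = backing
        self.pages = dict(pages or {})  # page index -> contents

    def lookup(self, index):
        """Walk down the chain and return (owning layer, contents)."""
        layer = self
        while layer is not None:
            if index in layer.pages:
                return layer, layer.pages[index]
            layer = layer.backing
        raise KeyError(index)

    def write(self, index, contents):
        """Copy-on-write: the new contents always land in the top layer."""
        self.pages[index] = contents


a = VMObject("A", pages={0: "text", 1: "initial data"})  # the program file
b = VMObject("B", backing=a)                             # swap-backed layer
b.write(1, "data modified before fork")                  # first write lands in B

c1 = VMObject("C1", backing=b)  # parent after fork
c2 = VMObject("C2", backing=b)  # child after fork

c1.write(1, "parent's private data")  # copy-on-write fault in the parent

print(c1.lookup(1)[0].name)  # C1: the parent sees its own copy
print(c2.lookup(1)[0].name)  # B: the child still sees the pre-fork page
print(c2.lookup(0)[0].name)  # A: unmodified pages come from the file
```

The lookup loop is also why deep stacks of layers cost time on a fault: every miss in an upper layer means one more step down the chain.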
Now, let's say the same page in B is modified by the child process. The process will take a copy-on-write fault and duplicate the page in C2. The original page in B is now completely hidden since both C1 and C2 have a copy and B could theoretically be destroyed if it does not represent a real file; however, this sort of optimization is not trivial to make because it is so fine-grained. &os; does not make this optimization. Now, suppose (as is often the case) that the child process does an exec(). Its current address space is usually replaced by a new address space representing a new file. In this case, the C2 layer is destroyed:

+-------+
|   C1  |
+-------+-------+
|       B       |
+---------------+
|       A       |
+---------------+

In this case, the number of children of B drops to one, and all accesses to B now go through C1. This means that B and C1 can be collapsed together. Any pages in B that also exist in C1 are deleted from B during the collapse. Thus, even though the optimization in the previous step could not be made, we can recover the dead pages when either of the processes exits or calls exec(). This model creates a number of potential problems. The first is that you can wind up with a relatively deep stack of layered VM Objects which can cost scanning time and memory when you take a fault. Deep layering can occur when processes fork and then fork again (either parent or child). The second problem is that you can wind up with dead, inaccessible pages deep in the stack of VM Objects. In our last example, if both the parent and child processes modify the same page, they both get their own private copies of the page and the original page in B is no longer accessible by anyone. That page in B can be freed. &os; solves the deep layering problem with a special optimization called the All Shadowed Case. This case occurs if either C1 or C2 takes sufficient COW faults to completely shadow all pages in B. Let's say that C1 achieves this.
C1 can now bypass B entirely, so rather than have C1->B->A and C2->B->A we now have C1->A and C2->B->A. But look what also happened: now B has only one reference (C2), so we can collapse B and C2 together. The end result is that B is deleted entirely and we have C1->A and C2->A. It is often the case that B will contain a large number of pages and neither C1 nor C2 will be able to completely overshadow it. If we fork again and create a set of D layers, however, it is much more likely that one of the D layers will eventually be able to completely overshadow the much smaller dataset represented by C1 or C2. The same optimization will work at any point in the graph and the grand result of this is that even on a heavily forked machine VM Object stacks tend to not get much deeper than 4. This is true of both the parent and the children and true whether the parent is doing the forking or whether the children cascade forks. The dead page problem still exists in the case where C1 or C2 do not completely overshadow B. Due to our other optimizations this case does not represent much of a problem and we simply allow the pages to be dead. If the system runs low on memory it will swap them out, eating a little swap, but that is it. The advantage to the VM Object model is that fork() is extremely fast, since no real data copying need take place. The disadvantage is that you can build a relatively complex VM Object layering that slows page fault handling down a little, and you spend memory managing the VM Object structures. The optimizations &os; makes reduce the problems enough that they can be ignored, leaving no real disadvantage. SWAP Layers Private data pages are initially either copy-on-write or zero-fill pages. When a change, and therefore a copy, is made, the original backing object (usually a file) can no longer be used to save a copy of the page when the VM system needs to reuse it for other purposes. This is where SWAP comes in.
SWAP is allocated to create backing store for memory that does not otherwise have it. &os; allocates the swap management structure for a VM Object only when it is actually needed. However, the swap management structure has had problems historically: Under &os; 3.X the swap management structure preallocates an array that encompasses the entire object requiring swap backing store, even if only a few pages of that object are swap-backed. This creates a kernel memory fragmentation problem when large objects are mapped, or processes with large runsizes (RSS) fork. Also, in order to keep track of swap space, a list of holes is kept in kernel memory, and this tends to get severely fragmented as well. Since the list of holes is a linear list, the swap allocation and freeing performance is a non-optimal O(n)-per-page. It requires kernel memory allocations to take place during the swap freeing process, and that creates low memory deadlock problems. The problem is further exacerbated by holes created due to the interleaving algorithm. Also, the swap block map can become fragmented fairly easily resulting in non-contiguous allocations. Kernel memory must also be allocated on the fly for additional swap management structures when a swapout occurs. It is evident from that list that there was plenty of room for improvement. For &os; 4.X, I completely rewrote the swap subsystem: Swap management structures are allocated through a hash table rather than a linear array, giving them a fixed allocation size and much finer granularity. Rather than using a linearly linked list to keep track of swap space reservations, it now uses a bitmap of swap blocks arranged in a radix tree structure with free-space hinting in the radix node structures. This effectively makes swap allocation and freeing an O(1) operation. The entire radix tree bitmap is also preallocated in order to avoid having to allocate kernel memory during critical low memory swapping operations.
After all, the system tends to swap when it is low on memory so we should avoid allocating kernel memory at such times in order to avoid potential deadlocks. To reduce fragmentation the radix tree is capable of allocating large contiguous chunks at once, skipping over smaller fragmented chunks. I did not take the final step of having an allocating hint pointer that would trundle through a portion of swap as allocations were made in order to further guarantee contiguous allocations or at least locality of reference, but I ensured that such an addition could be made. When to free a page Since the VM system uses all available memory for disk caching, there are usually very few truly-free pages. The VM system depends on being able to properly choose pages which are not in use to reuse for new allocations. Selecting the optimal pages to free is possibly the single-most important function any VM system can perform because if it makes a poor selection, the VM system may be forced to unnecessarily retrieve pages from disk, seriously degrading system performance. How much overhead are we willing to suffer in the critical path to avoid freeing the wrong page? Each wrong choice we make will cost us hundreds of thousands of CPU cycles and a noticeable stall of the affected processes, so we are willing to endure a significant amount of overhead in order to be sure that the right page is chosen. This is why &os; tends to outperform other systems when memory resources become stressed. The free page determination algorithm is built upon a history of the use of memory pages. To acquire this history, the system takes advantage of a page-used bit feature that most hardware page tables have. In any case, the page-used bit is cleared and at some later point the VM system comes across the page again and sees that the page-used bit has been set. This indicates that the page is still being actively used. 
If the bit is still clear it is an indication that the page is not being actively used. By testing this bit periodically, a use history (in the form of a counter) for the physical page is developed. When the VM system later needs to free up some pages, checking this history becomes the cornerstone of determining the best candidate page to reuse. What if the hardware has no page-used bit? For those platforms that do not have this feature, the system actually emulates a page-used bit. It unmaps or protects a page, forcing a page fault if the page is accessed again. When the page fault is taken, the system simply marks the page as having been used and unprotects the page so that it may be used. While taking such page faults just to determine if a page is being used appears to be an expensive proposition, it is much less expensive than reusing the page for some other purpose only to find that a process needs it back and then have to go to disk. &os; makes use of several page queues to further refine the selection of pages to reuse as well as to determine when dirty pages must be flushed to their backing store. Since page tables are dynamic entities under &os;, it costs virtually nothing to unmap a page from the address space of any processes using it. When a page candidate has been chosen based on the page-use counter, this is precisely what is done. The system must make a distinction between clean pages which can theoretically be freed up at any time, and dirty pages which must first be written to their backing store before being reusable. When a page candidate has been found it is moved to the inactive queue if it is dirty, or the cache queue if it is clean. A separate algorithm based on the dirty-to-clean page ratio determines when dirty pages in the inactive queue must be flushed to disk. Once this is accomplished, the flushed pages are moved from the inactive queue to the cache queue. 
At this point, pages in the cache queue can still be reactivated by a VM fault at relatively low cost. However, pages in the cache queue are considered to be immediately freeable and will be reused in an LRU (least-recently used) fashion when the system needs to allocate new memory. It is important to note that the &os; VM system attempts to separate clean and dirty pages for the express reason of avoiding unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does it move pages between the various page queues gratuitously when the memory subsystem is not being stressed. This is why you will see some systems with very low cache queue counts and high active queue counts when doing a systat -vm command. As the VM system becomes more stressed, it makes a greater effort to maintain the various page queues at the levels determined to be the most effective. An urban myth has circulated for years that Linux did a better job avoiding swapouts than &os;, but this in fact is not true. What was actually occurring was that &os; was proactively paging out unused pages in order to make room for more disk cache while Linux was keeping unused pages in core and leaving less memory available for cache and process pages. I do not know whether this is still true today. Pre-Faulting and Zeroing Optimizations Taking a VM fault is not expensive if the underlying page is already in core and can simply be mapped into the process, but it can become expensive if you take a whole lot of them on a regular basis. A good example of this is running a program such as &man.ls.1; or &man.ps.1; over and over again. If the program binary is mapped into memory but not mapped into the page table, then all the pages that will be accessed by the program will have to be faulted in every time the program is run. 
This is unnecessary when the pages in question are already in the VM Cache, so &os; will attempt to pre-populate a process's page tables with those pages that are already in the VM Cache. One thing that &os; does not yet do is pre-copy-on-write certain pages on exec. For example, if you run the &man.ls.1; program while running vmstat 1 you will notice that it always takes a certain number of page faults, even when you run it over and over again. These are zero-fill faults, not program code faults (which were pre-faulted in already). Pre-copying pages on exec or fork is an area that could use more study. A large percentage of page faults that occur are zero-fill faults. You can usually see this by observing the vmstat -s output. These occur when a process accesses pages in its BSS area. The BSS area is expected to be initially zero but the VM system does not bother to allocate any memory at all until the process actually accesses it. When a fault occurs the VM system must not only allocate a new page, it must zero it as well. To optimize the zeroing operation the VM system has the ability to pre-zero pages and mark them as such, and to request pre-zeroed pages when zero-fill faults occur. The pre-zeroing occurs whenever the CPU is idle but the number of pages the system pre-zeros is limited in order to avoid blowing away the memory caches. This is an excellent example of adding complexity to the VM system in order to optimize the critical path. Page Table Optimizations The page table optimizations make up the most contentious part of the &os; VM design and they have shown some strain with the advent of serious use of mmap(). I think this is actually a feature of most BSDs though I am not sure when it was first introduced. There are two major optimizations. The first is that hardware page tables do not contain persistent state but instead can be thrown away at any time with only a minor amount of management overhead. 
The second is that every active page table entry in the system has a governing pv_entry structure which is tied into the vm_page structure. &os; can simply iterate through those mappings that are known to exist while Linux must check all page tables that might contain a specific mapping to see if it does, which can achieve O(n^2) overhead in certain situations. It is because of this that &os; tends to make better choices on which pages to reuse or swap when memory is stressed, giving it better performance under load. However, &os; requires kernel tuning to accommodate large-shared-address-space situations such as those that can occur in a news system because it may run out of pv_entry structures. Both Linux and &os; need work in this area. &os; is trying to maximize the advantage of a potentially sparse active-mapping model (not all processes need to map all pages of a shared library, for example), whereas Linux is trying to simplify its algorithms. &os; generally has the performance advantage here at the cost of wasting a little extra memory, but &os; breaks down in the case where a large file is massively shared across hundreds of processes. Linux, on the other hand, breaks down in the case where many processes are sparsely-mapping the same shared library and also runs non-optimally when trying to determine whether a page can be reused or not. Page Coloring We will end with the page coloring optimizations. Page coloring is a performance optimization designed to ensure that accesses to contiguous pages in virtual memory make the best use of the processor cache. In ancient times (i.e. 10+ years ago) processor caches tended to map virtual memory rather than physical memory. This led to a huge number of problems including having to clear the cache on every context switch in some cases, and problems with data aliasing in the cache. Modern processor caches map physical memory precisely to solve those problems. 
This means that two side-by-side pages in a process's address space may not correspond to two side-by-side pages in the cache. In fact, if you are not careful, side-by-side pages in virtual memory could wind up using the same page in the processor cache, leading to cacheable data being thrown away prematurely and reducing CPU performance. This is true even with multi-way set-associative caches (though the effect is mitigated somewhat). &os;'s memory allocation code implements page coloring optimizations, which means that the memory allocation code will attempt to locate free pages that are contiguous from the point of view of the cache. For example, if page 16 of physical memory is assigned to page 0 of a process's virtual memory and the cache can hold 4 pages, the page coloring code will not assign page 20 of physical memory to page 1 of a process's virtual memory. It would, instead, assign page 21 of physical memory. The page coloring code attempts to avoid assigning page 20 because this maps over the same cache memory as page 16 and would result in non-optimal caching. This code adds a significant amount of complexity to the VM memory allocation subsystem as you can well imagine, but the result is well worth the effort. Page coloring makes VM memory as deterministic as physical memory with regard to cache performance.

Conclusion

Virtual memory in modern operating systems must address a number of different issues efficiently and for many different usage patterns. The modular and algorithmic approach that BSD has historically taken allows us to study and understand the current implementation as well as relatively cleanly replace large sections of the code. There have been a number of improvements to the &os; VM system in the last several years, and work is ongoing.

Bonus QA session by Allen Briggs <email>briggs@ninthwonder.com</email>

What is the interleaving algorithm that you refer to in your listing of the ills of the &os; 3.X swap arrangements?
&os; uses a fixed swap interleave which defaults to 4. This means that &os; reserves space for four swap areas even if you only have one, two, or three. Since swap is interleaved, the linear address space representing the four swap areas will be fragmented if you do not actually have four swap areas. For example, if you have two swap areas A and B, &os;'s address space representation for that swap area will be interleaved in blocks of 16 pages:

A B C D A B C D A B C D A B C D

&os; 3.X uses a sequential list of free regions approach to accounting for the free swap areas. The idea is that large blocks of free linear space can be represented with a single list node (kern/subr_rlist.c). But due to the fragmentation the sequential list winds up being insanely fragmented. In the above example, completely unused swap will have A and B shown as free and C and D shown as all allocated. Each A-B sequence requires its own list node because C and D are holes, so the list node cannot be combined with the next A-B sequence. Why do we interleave our swap space instead of just tack swap
- areas onto the end and do something fancier? Because it is a whole
+ areas onto the end and do something fancier? It is a whole
lot easier to allocate linear swaths of an address space and have the result automatically be interleaved across multiple disks than it is to try to put that sophistication elsewhere. The fragmentation causes other problems. Being a linear list under 3.X, and having such a huge amount of inherent fragmentation, allocating and freeing swap winds up being an O(N) algorithm instead of an O(1) algorithm. Combine that with other factors (heavy swapping) and you start getting into O(N^2) and O(N^3) levels of overhead, which is bad. The 3.X system may also need to allocate KVM during a swap operation to create a new list node, which can lead to a deadlock if the system is trying to page out pages in a low-memory situation. Under 4.X we do not use a sequential list.
Instead we use a radix tree and bitmaps of swap blocks rather than ranged list nodes. We take the hit of preallocating all the bitmaps required for the entire swap area up front but it winds up wasting less memory due to the use of a bitmap (one bit per block) instead of a linked list of nodes. The use of a radix tree instead of a sequential list gives us nearly O(1) performance no matter how fragmented the tree becomes. How is the separation of clean and dirty (inactive) pages related to the situation where you see low cache queue counts and high active queue counts in systat -vm? Do the systat stats roll the active and dirty pages together for the active queue count? I do not get the following:
It is important to note that the &os; VM system attempts to separate clean and dirty pages for the express reason of avoiding unnecessary flushes of dirty pages (which eats I/O bandwidth), nor does it move pages between the various page queues gratuitously when the memory subsystem is not being stressed. This is why you will see some systems with very low cache queue counts and high active queue counts when doing a systat -vm command.
Yes, that is confusing. The relationship is goal versus reality. Our goal is to separate the pages but the reality is that if we are not in a memory crunch, we do not really have to. What this means is that &os; will not try very hard to separate out dirty pages (inactive queue) from clean pages (cache queue) when the system is not being stressed, nor will it try to deactivate pages (active queue -> inactive queue) when the system is not being stressed, even if they are not being used.
In the &man.ls.1; / vmstat 1 example, would not some of the page faults be data page faults (COW from executable file to private page)? I.e., I would expect the page faults to be some zero-fill and some program data. Or are you implying that &os; does do pre-COW for the program data?

A COW fault can be either zero-fill or program-data. The mechanism is the same either way because the backing program-data is almost certainly already in the cache. I am indeed lumping the two together. &os; does not pre-COW program data or zero-fill, but it does pre-map pages that exist in its cache.

In your section on page table optimizations, can you give a little more detail about pv_entry and vm_page (or should vm_page be vm_pmap, as in 4.4; cf. pp. 180-181 of McKusick, Bostic, Karels, Quarterman)? Specifically, what kind of operation/reaction would require scanning the mappings? How does Linux do in the case where &os; breaks down (sharing a large file mapping over many processes)?

A vm_page represents an (object,index#) tuple. A pv_entry represents a hardware page table entry (pte). If you have five processes sharing the same physical page, and three of those processes' page tables actually map the page, that page will be represented by a single vm_page structure and three pv_entry structures. pv_entry structures only represent pages mapped by the MMU (one pv_entry represents one pte). This means that when we need to remove all hardware references to a vm_page (in order to reuse the page for something else, page it out, clear it, dirty it, and so forth) we can simply scan the linked list of pv_entry's associated with that vm_page to remove or modify the pte's from their page tables. Under Linux there is no such linked list. In order to remove all the hardware page table mappings for a vm_page Linux must index into every VM object that might have mapped the page.
For example, if you have 50 processes all mapping the same shared library and want to get rid of page X in that library, you need to index into the page table for each of those 50 processes even if only 10 of them have actually mapped the page. So Linux is trading off the simplicity of its design against performance. Many VM algorithms which are O(1) or (small N) under &os; wind up being O(N), O(N^2), or worse under Linux. Since the pte's representing a particular page in an object tend to be at the same offset in all the page tables they are mapped in, reducing the number of accesses into the page tables at the same pte offset will often avoid blowing away the L1 cache line for that offset, which can lead to better performance. &os; has added complexity (the pv_entry scheme) in order to increase performance (to limit page table accesses to only those pte's that need to be modified). But &os; has a scaling problem that Linux does not in that there are a limited number of pv_entry structures and this causes problems when you have massive sharing of data. In this case you may run out of pv_entry structures even though there is plenty of free memory available. This can be fixed easily enough by bumping up the number of pv_entry structures in the kernel config, but we really need to find a better way to do it. With regard to the memory overhead of a page table versus the pv_entry scheme: Linux uses permanent page tables that are not throw-away, but does not need a pv_entry for each potentially mapped pte. &os; uses throw-away page tables but adds in a pv_entry structure for each actually-mapped pte. I think memory utilization winds up being about the same, giving &os; an algorithmic advantage with its ability to throw away page tables at will with very low overhead.

Finally, in the page coloring section, it might help to have a little more description of what you mean here. I did not quite follow it.

Do you know how an L1 hardware memory cache works?
I will explain: Consider a machine with 16MB of main memory but only 128K of L1 cache. Generally the way this cache works is that each 128K block of main memory uses the same 128K of cache. If you access offset 0 in main memory and then offset 128K in main memory you can wind up throwing away the cached data you read from offset 0! Now, I am simplifying things greatly. What I just described is what is called a direct mapped hardware memory cache. Most modern caches are what are called 2-way-set-associative or 4-way-set-associative caches. The set-associativity allows you to access up to N different memory regions that overlap the same cache memory without destroying the previously cached data. But only N. So if I have a 4-way set associative cache I can access offset 0, offset 128K, offset 256K and offset 384K and still be able to access offset 0 again and have it come from the L1 cache. If I then access offset 512K, however, one of the four previously cached data objects will be thrown away by the cache. It is extremely important… extremely important for most of a processor's memory accesses to be able to come from the L1 cache, because the L1 cache operates at the processor frequency. The moment you have an L1 cache miss and have to go to the L2 cache or to main memory, the processor will stall and potentially sit twiddling its fingers for hundreds of instructions' worth of time waiting for a read from main memory to complete. Main memory (the dynamic RAM you stuff into a computer) is slow when compared to the speed of a modern processor core. OK, so now onto page coloring: All modern memory caches are what are known as physical caches. They cache physical memory addresses, not virtual memory addresses. This allows the cache to be left alone across a process context switch, which is very important. But in the &unix; world you are dealing with virtual address spaces, not physical address spaces. Any program you write will see the virtual address space given to it.
The actual physical pages underlying that virtual address space are not necessarily physically contiguous! In fact, you might have two pages that are side by side in a process's address space which wind up being at offset 0 and offset 128K in physical memory. A program normally assumes that two side-by-side pages will be optimally cached. That is, that you can access data objects in both pages without having them blow away each other's cache entry. But this is only true if the physical pages underlying the virtual address space are contiguous (insofar as the cache is concerned). This is what page coloring does. Instead of assigning random physical pages to virtual addresses, which may result in non-optimal cache performance, page coloring assigns reasonably-contiguous physical pages to virtual addresses. Thus programs can be written under the assumption that the characteristics of the underlying hardware cache are the same for their virtual address space as they would be if the program had been run directly in a physical address space. Note that I say reasonably contiguous rather than simply contiguous. From the point of view of a 128K direct mapped cache, the physical address 0 is the same as the physical address 128K. So two side-by-side pages in your virtual address space may wind up being offset 128K and offset 132K in physical memory, but could also easily be offset 128K and offset 4K in physical memory and still retain the same cache performance characteristics. So page coloring does not have to assign truly contiguous pages of physical memory to contiguous pages of virtual memory; it just needs to make sure it assigns contiguous pages from the point of view of cache performance and operation.
diff --git a/en_US.ISO8859-1/books/arch-handbook/boot/chapter.xml b/en_US.ISO8859-1/books/arch-handbook/boot/chapter.xml index 65f8c6cfd0..798b7bc6d9 100644 --- a/en_US.ISO8859-1/books/arch-handbook/boot/chapter.xml +++ b/en_US.ISO8859-1/books/arch-handbook/boot/chapter.xml @@ -1,2396 +1,2396 @@ Bootstrapping and Kernel Initialization Sergey Lyubka Contributed by Sergio Andrés Gómez del Real Updated and enhanced by Synopsis BIOS firmware POST IA-32 booting system initialization This chapter is an overview of the boot and system initialization processes, starting from the BIOS (firmware) POST, to the first user process creation. Since the initial steps of system startup are very architecture dependent, the IA-32 architecture is used as an example. The &os; boot process can be surprisingly complex. After control is passed from the BIOS, a considerable amount of low-level configuration must be done before the kernel can be loaded and executed. This setup must be done in a simple and flexible manner, allowing the user a great deal of customization possibilities. Overview The boot process is an extremely machine-dependent activity. Not only must code be written for every computer architecture, but there may also be multiple types of booting on the same architecture. For example, a directory listing of /usr/src/sys/boot reveals a great amount of architecture-dependent code. There is a directory for each of the various supported architectures. In the x86-specific i386 directory, there are subdirectories for different boot standards like mbr (Master Boot Record), gpt (GUID Partition Table), and efi (Extensible Firmware Interface). Each boot standard has its own conventions and data structures. The example that follows shows booting an x86 computer from an MBR hard drive with the &os; boot0 multi-boot loader stored in the very first sector. That boot code starts the &os; three-stage boot process. 
The key to understanding this process is that it is a series of stages of increasing complexity. These stages are boot1, boot2, and loader (see &man.boot.8; for more detail). The boot system executes each stage in sequence. The last stage, loader, is responsible for loading the &os; kernel. Each stage is examined in the following sections. Here is an example of the output generated by the different boot stages. Actual output may differ from machine to machine: &os; Component Output (may vary) boot0 F1 FreeBSD F2 BSD F5 Disk 2 boot2 This prompt will appear if the user presses a key just after selecting an OS to boot at the boot0 stage. >>FreeBSD/i386 BOOT Default: 1:ad(1,a)/boot/loader boot: loader BTX loader 1.00 BTX version is 1.02 Consoles: internal video/keyboard BIOS drive C: is disk0 BIOS 639kB/2096064kB available memory FreeBSD/x86 bootstrap loader, Revision 1.1 Console internal video/keyboard (root@snap.freebsd.org, Thu Jan 16 22:18:05 UTC 2014) Loading /boot/defaults/loader.conf /boot/kernel/kernel text=0xed9008 data=0x117d28+0x176650 syms=[0x8+0x137988+0x8+0x1515f8] kernel Copyright (c) 1992-2013 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation. FreeBSD 10.0-RELEASE #0 r260789: Thu Jan 16 22:34:59 UTC 2014 root@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64 FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610 The <acronym>BIOS</acronym> When the computer powers on, the processor's registers are set to some predefined values. One of the registers is the instruction pointer register, and its value after a power on is well defined: it is a 32-bit value of 0xfffffff0. The instruction pointer register (also known as the Program Counter) points to code to be executed by the processor. 
Another important register is the cr0 32-bit control register, and its value just after a reboot is 0. One of cr0's bits, the PE (Protection Enabled) bit, indicates whether the processor is running in 32-bit protected mode or 16-bit real mode. Since this bit is cleared at boot time, the processor boots in 16-bit real mode. Real mode means, among other things, that linear and physical addresses are identical. The reason for the processor not to start immediately in 32-bit protected mode is backwards compatibility. In particular, the boot process relies on the services provided by the BIOS, and the BIOS itself works in legacy, 16-bit code. The value of 0xfffffff0 is slightly less than 4 GB, so unless the machine has 4 GB of physical memory, it cannot point to a valid memory address. The computer's hardware translates this address so that it points to a BIOS memory block. The BIOS (Basic Input Output System) is a chip on the motherboard that has a relatively small amount of read-only memory (ROM). This memory contains various low-level routines that are specific to the hardware supplied with the motherboard. The processor will first jump to the address 0xfffffff0, which really resides in the BIOS's memory. Usually this address contains a jump instruction to the BIOS's POST routines. The POST (Power On Self Test) is a set of routines including the memory check, system bus check, and other low-level initialization so the CPU can set up the computer properly. The important step of this stage is determining the boot device. Modern BIOS implementations permit the selection of a boot device, allowing booting from a floppy, CD-ROM, hard disk, or other devices. The very last thing in the POST is the INT 0x19 instruction. The INT 0x19 handler reads 512 bytes from the first sector of the boot device into memory at address 0x7c00. The term first sector originates from hard drive architecture, where the magnetic plate is divided into a number of cylindrical tracks.
Tracks are numbered, and every track is divided into a number (usually 64) of sectors. Track numbers start at 0, but sector numbers start from 1. Track 0 is the outermost on the magnetic plate, and sector 1, the first sector, has a special purpose. It is also called the MBR, or Master Boot Record. The remaining sectors on the first track are never used. This sector is our boot-sequence starting point. As we will see, this sector contains a copy of our boot0 program. The BIOS jumps to address 0x7c00, and boot0 starts executing.

The Master Boot Record (<literal>boot0</literal>) MBR

After control is received from the BIOS at memory address 0x7c00, boot0 starts executing. It is the first piece of code under &os; control. The task of boot0 is quite simple: scan the partition table and let the user choose which partition to boot from. The Partition Table is a special, standard data structure embedded in the MBR (hence embedded in boot0) describing the four standard PC partitions. boot0 resides in the filesystem as /boot/boot0. It is a small 512-byte file, and it is exactly what &os;'s installation procedure wrote to the hard disk's MBR if you chose the bootmanager option at installation time. Indeed, boot0 is the MBR. As mentioned previously, the INT 0x19 instruction causes the INT 0x19 handler to load an MBR (boot0) into memory at address 0x7c00. The source file for boot0 can be found in sys/boot/i386/boot0/boot0.S, which is an awesome piece of code written by Robert Nordier. A special structure starting from offset 0x1be in the MBR is called the partition table. It has four records of 16 bytes each, called partition records, which represent how the hard disk is partitioned, or, in &os;'s terminology, sliced. One byte of those 16 says whether a partition (slice) is bootable or not. Exactly one record must have that flag set, otherwise boot0's code will refuse to proceed.
A partition record has the following fields: the 1-byte filesystem type; the 1-byte bootable flag; the 6-byte descriptor in CHS format; and the 8-byte descriptor in LBA format. A partition record descriptor contains information about where exactly the partition resides on the drive. Both descriptors, LBA and CHS, describe the same information, but in different ways: LBA (Logical Block Addressing) has the starting sector for the partition and the partition's length, while CHS (Cylinder Head Sector) has coordinates for the first and last sectors of the partition. The partition table ends with the special signature 0xaa55. The MBR must fit into 512 bytes, a single disk sector. This program uses low-level tricks like taking advantage of the side effects of certain instructions and reusing register values from previous operations to make the most out of the fewest possible instructions. Care must also be taken when handling the partition table, which is embedded in the MBR itself. For these reasons, be very careful when modifying boot0.S. Note that the boot0.S source file is assembled as is: instructions are translated one by one to binary, with no additional information (no ELF file format, for example). This kind of low-level control is achieved at link time through special control flags passed to the linker. For example, the text section of the program is set to be located at address 0x600. In practice this means that boot0 must be loaded to memory address 0x600 in order to function properly. It is worth looking at the Makefile for boot0 (sys/boot/i386/boot0/Makefile), as it defines some of the run-time behavior of boot0. For instance, if a terminal connected to the serial port (COM1) is used for I/O, the macro SIO must be defined (-DSIO). -DPXE enables boot through PXE by pressing F6. Additionally, the program defines a set of flags that allow further modification of its behavior. All of this is illustrated in the Makefile.
For example, look at the linker directives which command the linker to start the text section at address 0x600, and to build the output file as is (strip out any file formatting):
<filename>sys/boot/i386/boot0/Makefile</filename>
      BOOT_BOOT0_ORG?=0x600
      LDFLAGS=-e start -Ttext ${BOOT_BOOT0_ORG} \
      -Wl,-N,-S,--oformat,binary
Let us now start our study of the MBR, or boot0, starting where execution begins. Some modifications have been made to some instructions in favor of better exposition. For example, some macros are expanded, and some macro tests are omitted when the result of the test is known. This applies to all of the code examples shown.
<filename>sys/boot/i386/boot0/boot0.S</filename>
start:
      cld                       # String ops inc
      xorw %ax,%ax              # Zero
      movw %ax,%es              # Address
      movw %ax,%ds              #  data
      movw %ax,%ss              # Set up
      movw $0x7c00,%sp          #  stack
This first block of code is the entry point of the program. It is where the BIOS transfers control. First, it makes sure that the string operations autoincrement their pointer operands (the cld instruction). (When in doubt, we refer the reader to the official Intel manuals, which describe the exact semantics for each instruction.) Then, as it makes no assumption about the state of the segment registers, it initializes them. Finally, it sets the stack pointer register (%sp) to address 0x7c00, so we have a working stack. The next block is responsible for the relocation and subsequent jump to the relocated code.
<filename>sys/boot/i386/boot0/boot0.S</filename>
      movw $0x7c00,%si          # Source
      movw $0x600,%di           # Destination
      movw $512,%cx             # Byte count
      rep                       # Relocate
      movsb                     #  code
      movw %di,%bp              # Address variables
      movb $16,%cl              # Bytes to clear
      rep                       # Zero
      stosb                     #  them
      incb -0xe(%di)            # Set the S field to 1
      jmp main-0x7c00+0x600     # Jump to relocated code
As boot0 is loaded by the BIOS to address 0x7c00, it copies itself to address 0x600 and then transfers control there (recall that it was linked to execute at address 0x600). The source address, 0x7c00, is copied to register %si. The destination address, 0x600, is copied to register %di. The number of bytes to copy, 512 (the program's size), is copied to register %cx. Next, the rep instruction repeats the instruction that follows, that is, movsb, the number of times dictated by the %cx register. The movsb instruction copies the byte pointed to by %si to the address pointed to by %di. This is repeated another 511 times. On each repetition, both the source and destination registers, %si and %di, are incremented by one. Thus, upon completion of the 512-byte copy, %di has the value 0x600+512=0x800, and %si has the value 0x7c00+512=0x7e00; we have thus completed the code relocation. Next, the destination register %di is copied to %bp. %bp gets the value 0x800. The value 16 is copied to %cl in preparation for a new string operation (like our previous movsb). Now, stosb is executed 16 times. This instruction copies a 0 value to the address pointed to by the destination register (%di, which is 0x800), and increments it. This is repeated another 15 times, so %di ends up with value 0x810. Effectively, this clears the address range 0x800-0x80f. This range is used as a (fake) partition table for writing the MBR back to disk. Finally, the sector field for the CHS addressing of this fake partition is given the value 1 and a jump is made to the main function from the relocated code. Note that until this jump to the relocated code, any reference to an absolute address was avoided. The following code block tests whether the drive number provided by the BIOS should be used, or the one stored in boot0.
<filename>sys/boot/i386/boot0/boot0.S</filename> main: testb $SETDRV,-69(%bp) # Set drive number? jnz disable_update # Yes testb %dl,%dl # Drive number valid? js save_curdrive # Possibly (0x80 set)
This code tests the SETDRV bit (0x20) in the flags variable. Recall that register %bp points to address location 0x800, so the test is done to the flags variable at address 0x800-69= 0x7bb. This is an example of the type of modifications that can be done to boot0. The SETDRV flag is not set by default, but it can be set in the Makefile. When set, the drive number stored in the MBR is used instead of the one provided by the BIOS. We assume the defaults, and that the BIOS provided a valid drive number, so we jump to save_curdrive. The next block saves the drive number provided by the BIOS, and calls putn to print a new line on the screen.
<filename>sys/boot/i386/boot0/boot0.S</filename> save_curdrive: movb %dl, (%bp) # Save drive number pushw %dx # Also in the stack #ifdef TEST /* test code, print internal bios drive */ rolb $1, %dl movw $drive, %si call putkey #endif callw putn # Print a newline
Note that we assume TEST is not defined, so the conditional code in it is not assembled and will not appear in our executable boot0. Our next block implements the actual scanning of the partition table. It prints to the screen the partition type for each of the four entries in the partition table. It compares each type with a list of well-known operating system file systems. Examples of recognized partition types are NTFS (&windows;, ID 0x7), ext2fs (&linux;, ID 0x83), and, of course, ffs/ufs2 (&os;, ID 0xa5). The implementation is fairly simple.
<filename>sys/boot/i386/boot0/boot0.S</filename> movw $(partbl+0x4),%bx # Partition table (+4) xorw %dx,%dx # Item number read_entry: movb %ch,-0x4(%bx) # Zero active flag (ch == 0) btw %dx,_FLAGS(%bp) # Entry enabled? jnc next_entry # No movb (%bx),%al # Load type test %al, %al # skip empty partition jz next_entry movw $bootable_ids,%di # Lookup tables movb $(TLEN+1),%cl # Number of entries repne # Locate scasb # type addw $(TLEN-1), %di # Adjust movb (%di),%cl # Partition addw %cx,%di # description callw putx # Display it next_entry: incw %dx # Next item addb $0x10,%bl # Next entry jnc read_entry # Till done
It is important to note that the active flag for each entry is cleared, so after the scanning, no partition entry is active in our memory copy of boot0. Later, the active flag will be set for the selected partition. This ensures that only one active partition exists if the user chooses to write the changes back to disk. The next block tests for other drives. At startup, the BIOS writes the number of drives present in the computer to address 0x475. If there are any other drives present, boot0 prints the current drive to screen. The user may command boot0 to scan partitions on another drive later.
<filename>sys/boot/i386/boot0/boot0.S</filename> popw %ax # Drive number subb $0x79,%al # Does next cmpb 0x475,%al # drive exist? (from BIOS?) jb print_drive # Yes decw %ax # Already drive 0? jz print_prompt # Yes
We make the assumption that a single drive is present, so the jump to print_drive is not performed. We also assume nothing strange happened, so we jump to print_prompt. This next block just prints out a prompt followed by the default option:
<filename>sys/boot/i386/boot0/boot0.S</filename> print_prompt: movw $prompt,%si # Display callw putstr # prompt movb _OPT(%bp),%dl # Display decw %si # default callw putkey # key jmp start_input # Skip beep
Finally, a jump is performed to start_input, where the BIOS services are used to start a timer and for reading user input from the keyboard; if the timer expires, the default option will be selected:
<filename>sys/boot/i386/boot0/boot0.S</filename> start_input: xorb %ah,%ah # BIOS: Get int $0x1a # system time movw %dx,%di # Ticks when addw _TICKS(%bp),%di # timeout read_key: movb $0x1,%ah # BIOS: Check int $0x16 # for keypress jnz got_key # Have input xorb %ah,%ah # BIOS: int 0x1a, 00 int $0x1a # get system time cmpw %di,%dx # Timeout? jb read_key # No
An interrupt is requested with number 0x1a and argument 0 in register %ah. The BIOS has a predefined set of services, requested by applications as software-generated interrupts through the int instruction and receiving arguments in registers (in this case, %ah). Here, particularly, we are requesting the number of clock ticks since last midnight; this value is computed by the BIOS through the RTC (Real Time Clock). This clock can be programmed to work at frequencies ranging from 2 Hz to 8192 Hz. The BIOS sets it to 18.2 Hz at startup. When the request is satisfied, a 32-bit result is returned by the BIOS in registers %cx and %dx (lower bytes in %dx). This result (the %dx part) is copied to register %di, and the value of the TICKS variable is added to %di. This variable resides in boot0 at offset _TICKS (a negative value) from register %bp (which, recall, points to 0x800). The default value of this variable is 0xb6 (182 in decimal). Now, the idea is that boot0 constantly requests the time from the BIOS, and when the value returned in register %dx is greater than the value stored in %di, the time is up and the default selection will be made. Since the RTC ticks 18.2 times per second, this condition will be met after 10 seconds (this default behavior can be changed in the Makefile). Until this time has passed, boot0 continually asks the BIOS for any user input; this is done through int 0x16, argument 1 in %ah. Whether a key was pressed or the time expired, subsequent code validates the selection. Based on the selection, the register %si is set to point to the appropriate partition entry in the partition table. This new selection overrides the previous default one. Indeed, it becomes the new default. Finally, the ACTIVE flag of the selected partition is set. If it was enabled at compile time, the in-memory version of boot0 with these modified values is written back to the MBR on disk. We leave the details of this implementation to the reader. 
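The timeout arithmetic described above can be checked with a few lines of Python. This is only a sketch under the figures quoted in the text (18.2 ticks per second, a default TICKS value of 0xb6); the function names are hypothetical, and only the low 16 bits of the tick counter are modeled, as in boot0.

```python
TICKS_PER_SECOND = 18.2   # tick rate quoted in the text

def timeout_seconds(ticks):
    """Convert a TICKS value into seconds."""
    return ticks / TICKS_PER_SECOND

def deadline(now_low16, ticks):
    """Model addw _TICKS(%bp),%di: a 16-bit addition that wraps around."""
    return (now_low16 + ticks) & 0xFFFF

def keep_waiting(dx, di):
    """Model cmpw %di,%dx followed by jb read_key (unsigned 'below')."""
    return dx < di

default_timeout = timeout_seconds(0xb6)
```

With the default value, default_timeout comes out at ten seconds, matching the behavior described above.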
We now end our study with the last code block from the boot0 program:
<filename>sys/boot/i386/boot0/boot0.S</filename> movw $0x7c00,%bx # Address for read movb $0x2,%ah # Read sector callw intx13 # from disk jc beep # If error cmpw $0xaa55,0x1fe(%bx) # Bootable? jne beep # No pushw %si # Save ptr to selected part. callw putn # Leave some space popw %si # Restore, next stage uses it jmp *%bx # Invoke bootstrap
Recall that %si points to the selected partition entry. This entry tells us where the partition begins on disk. We assume, of course, that the partition selected is actually a &os; slice. From now on, we will favor the use of the technically more accurate term slice rather than partition. The transfer buffer is set to 0x7c00 (register %bx), and a read for the first sector of the &os; slice is requested by calling intx13. We assume that everything went okay, so a jump to beep is not performed. In particular, the new sector read must end with the magic sequence 0xaa55. Finally, the value of %si (the pointer to the selected partition table entry) is preserved for use by the next stage, and a jump is performed to address 0x7c00, where execution of our next stage (the just-read block) is started.
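The bootability test in this last block (cmpw $0xaa55,0x1fe(%bx)) amounts to reading a little-endian 16-bit word from the last two bytes of the sector. A minimal Python sketch follows; the function name is_bootable is hypothetical.

```python
SIGNATURE = 0xaa55
SIGNATURE_OFFSET = 0x1fe   # last two bytes of a 512-byte sector

def is_bootable(sector: bytes) -> bool:
    """Return True if the sector ends with the 0xaa55 signature (stored as 55 aa)."""
    if len(sector) < 512:
        return False
    word = int.from_bytes(sector[SIGNATURE_OFFSET:SIGNATURE_OFFSET + 2], "little")
    return word == SIGNATURE

good = bytes(510) + b"\x55\xaa"
bad = bytes(512)
```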
<literal>boot1</literal> Stage So far we have gone through the following sequence: The BIOS did some early hardware initialization, including the POST. The MBR (boot0) was loaded from absolute disk sector one to address 0x7c00. Execution control was passed to that location. boot0 relocated itself to the location it was linked to execute (0x600), followed by a jump to continue execution at the appropriate place. Finally, boot0 loaded the first disk sector from the &os; slice to address 0x7c00. Execution control was passed to that location. boot1 is the next step in the boot-loading sequence. It is the first of three boot stages. Note that we have been dealing exclusively with disk sectors. Indeed, the BIOS loads the absolute first sector, while boot0 loads the first sector of the &os; slice. Both loads are to address 0x7c00. We can conceptually think of these disk sectors as containing the files boot0 and boot1, respectively, but in reality this is not entirely true for boot1. Strictly speaking, unlike boot0, boot1 is not part of the boot blocks. (There is a file /boot/boot1, but it is not what is written to the beginning of the &os; slice. Instead, it is concatenated with boot2 to form boot, which is written to the beginning of the &os; slice and read at boot time.) A single, full-blown file, boot (/boot/boot), is what ultimately is written to disk. This file is a combination of boot1, boot2 and the Boot Extender (or BTX). This single file is greater in size than a single sector (greater than 512 bytes). Fortunately, boot1 occupies exactly the first 512 bytes of this single file, so when boot0 loads the first sector of the &os; slice (512 bytes), it is actually loading boot1 and transferring control to it. The main task of boot1 is to load the next boot stage. This next stage is somewhat more complex. It is composed of a server called the Boot Extender, or BTX, and a client, called boot2. 
As we will see, the last boot stage, loader, is also a client of the BTX server. Let us now look in detail at what exactly is done by boot1, starting like we did for boot0, at its entry point:
<filename>sys/boot/i386/boot2/boot1.S</filename> start: jmp main
The entry point at start simply jumps past a special data area to the label main, which in turn looks like this:
<filename>sys/boot/i386/boot2/boot1.S</filename> main: cld # String ops inc xor %cx,%cx # Zero mov %cx,%es # Address mov %cx,%ds # data mov %cx,%ss # Set up mov $start,%sp # stack mov %sp,%si # Source mov $0x700,%di # Destination incb %ch # Word count rep # Copy movsw # code
Just like boot0, this code relocates boot1, this time to memory address 0x700. However, unlike boot0, it does not jump there. boot1 is linked to execute at address 0x7c00, effectively where it was loaded in the first place. The reason for this relocation will be discussed shortly. Next comes a loop that looks for the &os; slice. Although boot0 loaded boot1 from the &os; slice, no information was passed to it about this, so boot1 must rescan the partition table to find where the &os; slice starts. (Actually, we did pass a pointer to the slice entry in register %si. However, boot1 does not assume that it was loaded by boot0, since some other MBR might have loaded it without passing this information, so it assumes nothing.) Therefore it rereads the MBR:
<filename>sys/boot/i386/boot2/boot1.S</filename> mov $part4,%si # Partition cmpb $0x80,%dl # Hard drive? jb main.4 # No movb $0x1,%dh # Block count callw nread # Read MBR
In the code above, register %dl maintains information about the boot device. This is passed on by the BIOS and preserved by the MBR. Numbers 0x80 and greater tell us that we are dealing with a hard drive, so a call is made to nread, where the MBR is read. Arguments to nread are passed through %si and %dh. The memory address at label part4 is copied to %si. This memory address holds a fake partition to be used by nread. The following is the data in the fake partition:
<filename>sys/boot/i386/boot2/boot1.S</filename> part4: .byte 0x80, 0x00, 0x01, 0x00 .byte 0xa5, 0xfe, 0xff, 0xff .byte 0x00, 0x00, 0x00, 0x00 .byte 0x50, 0xc3, 0x00, 0x00
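The sixteen bytes at part4 can be decoded with the partition record layout described earlier (flag, starting CHS, type, ending CHS, then LBA start and length). The Python sketch below assumes the conventional MBR packing of CHS values (head, then sector in the low 6 bits with the top 2 cylinder bits above it, then the low cylinder byte); the helper names are hypothetical.

```python
import struct

PART4 = bytes([0x80, 0x00, 0x01, 0x00,
               0xa5, 0xfe, 0xff, 0xff,
               0x00, 0x00, 0x00, 0x00,
               0x50, 0xc3, 0x00, 0x00])

def decode_chs(raw):
    """Decode a packed 3-byte CHS address into (cylinder, head, sector)."""
    head = raw[0]
    sector = raw[1] & 0x3f
    cylinder = ((raw[1] & 0xc0) << 2) | raw[2]
    return cylinder, head, sector

def decode_entry(raw):
    """Split a 16-byte partition record into its fields."""
    flag, chs_first, ptype, chs_last, lba_start, lba_length = struct.unpack("<B3sB3sII", raw)
    return {
        "flag": flag,
        "type": ptype,
        "chs_first": decode_chs(chs_first),
        "chs_last": decode_chs(chs_last),
        "lba_start": lba_start,
        "lba_length": lba_length,
    }

entry = decode_entry(PART4)
```

Decoded this way, the record describes an active entry of type 0xa5 that starts at cylinder 0, head 0, sector 1, with an LBA start of 0.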
In particular, the LBA for this fake partition is hardcoded to zero. This is used as an argument to the BIOS for reading absolute sector one from the hard drive. Alternatively, CHS addressing could be used. In this case, the fake partition holds cylinder 0, head 0 and sector 1, which is equivalent to absolute sector one. Let us now proceed to take a look at nread:
<filename>sys/boot/i386/boot2/boot1.S</filename> nread: mov $0x8c00,%bx # Transfer buffer mov 0x8(%si),%ax # Get mov 0xa(%si),%cx # LBA push %cs # Read from callw xread.1 # disk jnc return # If success, return
Recall that %si points to the fake partition. The word at offset 0x8 is copied to register %ax and the word at offset 0xa to %cx (in the context of 16-bit real mode, a word is 2 bytes). They are interpreted by the BIOS as the lower 4-byte value denoting the LBA to be read (the upper four bytes are assumed to be zero). Register %bx holds the memory address where the MBR will be loaded. The instruction pushing %cs onto the stack is very interesting. In this context, it accomplishes nothing. However, as we will see shortly, boot2, in conjunction with the BTX server, also uses xread.1. This mechanism will be discussed in the next section. The code at xread.1 further calls the read function, which actually calls the BIOS asking for the disk sector:
<filename>sys/boot/i386/boot2/boot1.S</filename> xread.1: pushl $0x0 # absolute push %cx # block push %ax # number push %es # Address of push %bx # transfer buffer xor %ax,%ax # Number of movb %dh,%al # blocks to push %ax # transfer push $0x10 # Size of packet mov %sp,%bp # Packet pointer callw read # Read from disk lea 0x10(%bp),%sp # Clear stack lret # To far caller
Note the long return instruction at the end of this block. This instruction pops out the %cs register pushed by nread, and returns. Finally, nread also returns. With the MBR loaded to memory, the actual loop for searching the &os; slice begins:
<filename>sys/boot/i386/boot2/boot1.S</filename> mov $0x1,%cx # Two passes main.1: mov $0x8dbe,%si # Partition table movb $0x1,%dh # Partition main.2: cmpb $0xa5,0x4(%si) # Our partition type? jne main.3 # No jcxz main.5 # If second pass testb $0x80,(%si) # Active? jnz main.5 # Yes main.3: add $0x10,%si # Next entry incb %dh # Partition cmpb $0x5,%dh # In table? jb main.2 # Yes dec %cx # Do two jcxz main.1 # passes
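The control flow of this loop is compact, so a higher-level model may help. The Python sketch below mirrors the two passes: the first pass accepts only an active entry of type 0xa5, the second accepts any entry of type 0xa5. The function name and the (flag, type) tuple representation are illustrative assumptions, not part of boot1.

```python
FREEBSD_TYPE = 0xa5
ACTIVE = 0x80

def find_slice(entries):
    """Return the 1-based number of the chosen entry, or None if no 0xa5 entry exists.

    entries is a list of (flag, type) tuples, one per partition table slot.
    """
    for active_only in (True, False):          # %cx == 1, then %cx == 0
        for number, (flag, ptype) in enumerate(entries, start=1):
            if ptype != FREEBSD_TYPE:
                continue                        # jne main.3
            if not active_only or flag & ACTIVE:
                return number                   # jcxz main.5 / jnz main.5
    return None

prefers_active = find_slice([(0, 0x07), (0, 0xa5), (0x80, 0xa5), (0, 0)])
falls_back = find_slice([(0x80, 0x07), (0, 0xa5), (0, 0), (0, 0)])
```

The returned number corresponds to the value left in %dh when the jump to main.5 is taken.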
If a &os; slice is identified, execution continues at main.5. Note that when a &os; slice is found %si points to the appropriate entry in the partition table, and %dh holds the partition number. We assume that a &os; slice is found, so we continue execution at main.5:
<filename>sys/boot/i386/boot2/boot1.S</filename> main.5: mov %dx,0x900 # Save args movb $0x10,%dh # Sector count callw nread # Read disk mov $0x9000,%bx # BTX mov 0xa(%bx),%si # Get BTX length and set add %bx,%si # %si to start of boot2.bin mov $0xc000,%di # Client page 2 mov $0xa200,%cx # Byte sub %si,%cx # count rep # Relocate movsb # client
Recall that at this point, register %si points to the &os; slice entry in the MBR partition table, so a call to nread will effectively read sectors at the beginning of this partition. The argument passed on register %dh tells nread to read 16 disk sectors. Recall that the first 512 bytes, or the first sector of the &os; slice, coincides with the boot1 program. Also recall that the file written to the beginning of the &os; slice is not /boot/boot1, but /boot/boot. Let us look at the size of these files in the filesystem:
-r--r--r-- 1 root wheel 512B Jan 8 00:15 /boot/boot0
-r--r--r-- 1 root wheel 512B Jan 8 00:15 /boot/boot1
-r--r--r-- 1 root wheel 7.5K Jan 8 00:15 /boot/boot2
-r--r--r-- 1 root wheel 8.0K Jan 8 00:15 /boot/boot
Both boot0 and boot1 are 512 bytes each, so they fit exactly in one disk sector. boot2 is much bigger, holding both the BTX server and the boot2 client. Finally, a file called simply boot is 512 bytes larger than boot2. This file is a concatenation of boot1 and boot2. As already noted, boot0 is the file written to the absolute first disk sector (the MBR), and boot is the file written starting at the first sector of the &os; slice; boot1 and boot2 are not written to disk. The command used to concatenate boot1 and boot2 into a single boot is merely cat boot1 boot2 > boot. So boot1 occupies exactly the first 512 bytes of boot and, because boot is written starting at the first sector of the &os; slice, boot1 fits exactly in this first sector. When nread reads the first 16 sectors of the &os; slice, it effectively reads the entire boot file: 512*16=8192 bytes, exactly the size of boot. We will see more details about how boot is formed from boot1 and boot2 in the next section. Recall that nread uses memory address 0x8c00 as the transfer buffer to hold the sectors read. This address is conveniently chosen. Indeed, because boot1 belongs to the first 512 bytes, it ends up in the address range 0x8c00-0x8dff. 
The 512 bytes that follow (range 0x8e00-0x8fff) are used to store the bsdlabel (historically known as disklabel; if you ever wondered where &os; stored this information, it is in this region; see &man.bsdlabel.8;). Starting at address 0x9000 is the beginning of the BTX server, and immediately following is the boot2 client. The BTX server acts as a kernel, and executes in protected mode in the most privileged level. In contrast, the BTX clients (boot2, for example), execute in user mode. We will see how this is accomplished in the next section. The code after the call to nread locates the beginning of boot2 in the memory buffer, and copies it to memory address 0xc000. This is because the BTX server arranges boot2 to execute in a segment starting at 0xa000. We explore this in detail in the following section. The last code block of boot1 enables access to memory above 1MB (this is necessary for legacy reasons) and concludes with a jump to the starting point of the BTX server:
<filename>sys/boot/i386/boot2/boot1.S</filename> seta20: cli # Disable interrupts seta20.1: dec %cx # Timeout? jz seta20.3 # Yes inb $0x64,%al # Get status testb $0x2,%al # Busy? jnz seta20.1 # Yes movb $0xd1,%al # Command: Write outb %al,$0x64 # output port seta20.2: inb $0x64,%al # Get status testb $0x2,%al # Busy? jnz seta20.2 # Yes movb $0xdf,%al # Enable outb %al,$0x60 # A20 seta20.3: sti # Enable interrupts jmp 0x9010 # Start BTX
Note that right before the jump, interrupts are enabled.
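Before leaving boot1, it may help to recap the addresses produced by its two reads into the transfer buffer at 0x8c00. The Python sketch below only redoes the arithmetic from this section; the region helper is hypothetical, and the partition table offset 0x1be is the conventional location of the table inside an MBR.

```python
SECTOR = 512
BUFFER = 0x8c00            # transfer buffer used by nread

def region(first_sector, count):
    """Return the (first, last) addresses covered by count sectors in the buffer."""
    start = BUFFER + first_sector * SECTOR
    return start, start + count * SECTOR - 1

# First read: the MBR. Its partition table sits at offset 0x1be.
partition_table = BUFFER + 0x1be

# Second read: the 16 sectors of the boot file.
boot1_region = region(0, 1)
bsdlabel_region = region(1, 1)
btx_and_boot2_region = region(2, 14)
total_bytes = 16 * SECTOR
```

Note that partition_table works out to the constant loaded into %si at main.1, and that the third region starts at 0x9000, where the BTX server begins.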
The <acronym>BTX</acronym> Server Next in our boot sequence is the BTX Server. Let us quickly recall how we got here: The BIOS loads the absolute sector one (the MBR, or boot0), to address 0x7c00 and jumps there. boot0 relocates itself to 0x600, the address it was linked to execute, and jumps over there. It then reads the first sector of the &os; slice (which consists of boot1) into address 0x7c00 and jumps over there. boot1 loads the first 16 sectors of the &os; slice into address 0x8c00. These 16 sectors, or 8192 bytes, are the whole file boot. The file is a concatenation of boot1 and boot2. boot2, in turn, contains the BTX server and the boot2 client. Finally, a jump is made to address 0x9010, the entry point of the BTX server. Before studying the BTX Server in detail, let us further review how the single, all-in-one boot file is created. The way boot is built is defined in its Makefile (/usr/src/sys/boot/i386/boot2/Makefile). Let us look at the rule that creates the boot file:
<filename>sys/boot/i386/boot2/Makefile</filename> boot: boot1 boot2 cat boot1 boot2 > boot
This tells us that boot1 and boot2 are needed, and the rule simply concatenates them to produce a single file called boot. The rules for creating boot1 are also quite simple:
<filename>sys/boot/i386/boot2/Makefile</filename> boot1: boot1.out objcopy -S -O binary boot1.out boot1 boot1.out: boot1.o ld -e start -Ttext 0x7c00 -o boot1.out boot1.o
To apply the rule for creating boot1, boot1.out must be resolved. This, in turn, depends on the existence of boot1.o. This last file is simply the result of assembling our familiar boot1.S, without linking. Now, the rule for creating boot1.out is applied. This tells us that boot1.o should be linked with start as its entry point, and starting at address 0x7c00. Finally, boot1 is created from boot1.out applying the appropriate rule. This rule is the objcopy command applied to boot1.out. Note the flags passed to objcopy: -S tells it to strip all relocation and symbolic information; -O binary indicates the output format, that is, a simple, unformatted binary file. Having boot1, let us take a look at how boot2 is constructed:
<filename>sys/boot/i386/boot2/Makefile</filename> boot2: boot2.ld @set -- `ls -l boot2.ld`; x=$$((7680-$$5)); \ echo "$$x bytes available"; test $$x -ge 0 dd if=boot2.ld of=boot2 obs=7680 conv=osync boot2.ld: boot2.ldr boot2.bin ../btx/btx/btx btxld -v -E 0x2000 -f bin -b ../btx/btx/btx -l boot2.ldr \ -o boot2.ld -P 1 boot2.bin boot2.ldr: dd if=/dev/zero of=boot2.ldr bs=512 count=1 boot2.bin: boot2.out objcopy -S -O binary boot2.out boot2.bin boot2.out: ../btx/lib/crt0.o boot2.o sio.o ld -Ttext 0x2000 -o boot2.out boot2.o: boot2.s ${CC} ${ACFLAGS} -c boot2.s boot2.s: boot2.c boot2.h ${.CURDIR}/../../common/ufsread.c ${CC} ${CFLAGS} -S -o boot2.s.tmp ${.CURDIR}/boot2.c sed -e '/align/d' -e '/nop/d' < boot2.s.tmp > boot2.s rm -f boot2.s.tmp boot2.h: boot1.out ${NM} -t d ${.ALLSRC} | awk '/([0-9])+ T xread/ \ { x = $$1 - ORG1; \ printf("#define XREADORG %#x\n", REL1 + x) }' \ ORG1=`printf "%d" ${ORG1}` \ REL1=`printf "%d" ${REL1}` > ${.TARGET}
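The size check and padding in the first rule explain why the final boot file is exactly 16 sectors long. The Python sketch below models that rule only; the function names are hypothetical, and the example size of 7000 bytes for boot2.ld is made up for illustration.

```python
SECTOR = 512
BOOT2_MAX = 7680           # limit used by the boot2 rule (15 sectors)

def bytes_available(ld_size):
    """Model x=$((7680-size)); the rule fails when this is negative."""
    return BOOT2_MAX - ld_size

def padded_size(ld_size):
    """Model dd obs=7680 conv=osync: the last output block is padded to 7680 bytes."""
    if bytes_available(ld_size) < 0:
        raise ValueError("boot2.ld is too large")
    blocks = max(1, -(-ld_size // BOOT2_MAX))   # ceiling division
    return blocks * BOOT2_MAX

example_ld_size = 7000     # hypothetical size of boot2.ld
boot_size = SECTOR + padded_size(example_ld_size)   # boot1 followed by boot2
```

Whatever the size of boot2.ld, as long as the check passes, boot2 is padded to 7680 bytes, and boot1 plus boot2 adds up to 8192 bytes.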
The mechanism for building boot2 is far more elaborate. Let us point out the most relevant facts. The dependency list is as follows:
<filename>sys/boot/i386/boot2/Makefile</filename> boot2: boot2.ld boot2.ld: boot2.ldr boot2.bin ${BTXDIR}/btx/btx boot2.bin: boot2.out boot2.out: ${BTXDIR}/lib/crt0.o boot2.o sio.o boot2.o: boot2.s boot2.s: boot2.c boot2.h ${.CURDIR}/../../common/ufsread.c boot2.h: boot1.out
Note that initially there is no header file boot2.h, but its creation depends on boot1.out, which we already have. The rule for its creation is a bit terse, but the important thing is that the output, boot2.h, is something like this:
<filename>sys/boot/i386/boot2/boot2.h</filename> #define XREADORG 0x725
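The awk fragment in the boot2.h rule rebases the address of xread from where boot1 is linked to where boot1 is relocated. A Python sketch of that computation follows, taking ORG1 and REL1 to be the link address (0x7c00) and relocation address (0x700) given in the text. The symbol address 0x7c25 is inferred from the 0x725 shown above, not read from a real nm listing.

```python
ORG1 = 0x7c00   # address boot1 is linked to execute at
REL1 = 0x700    # address boot1 relocates itself to

def xreadorg(xread_linked_addr):
    """Rebase a symbol address from the linked image to the relocated copy."""
    return REL1 + (xread_linked_addr - ORG1)

assumed_xread_addr = 0x7c25   # hypothetical nm output, inferred from XREADORG
value = xreadorg(assumed_xread_addr)
```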
Recall that boot1 was relocated (i.e., copied from 0x7c00 to 0x700). This relocation will now make sense, because as we will see, the BTX server reclaims some memory, including the space where boot1 was originally loaded. However, the BTX server needs access to boot1's xread function; this function, according to the output of boot2.h, is at location 0x725. Indeed, the BTX server uses the xread function from boot1's relocated code. This function is now accessible from within the boot2 client. We next build boot2.s from files boot2.h, boot2.c and /usr/src/sys/boot/common/ufsread.c. The rule for this is to compile the code in boot2.c (which includes boot2.h and ufsread.c) into assembly code. Having boot2.s, the next rule assembles boot2.s, creating the object file boot2.o. The next rule directs the linker to link various files (crt0.o, boot2.o and sio.o). Note that the output file, boot2.out, is linked to execute at address 0x2000. Recall that boot2 will be executed in user mode, within a special user segment set up by the BTX server. This segment starts at 0xa000. Also, remember that the boot2 portion of boot was copied to address 0xc000, that is, offset 0x2000 from the start of the user segment, so boot2 will work properly when we transfer control to it. Next, boot2.bin is created from boot2.out by stripping its symbols and format information; boot2.bin is a raw binary. Now, note that a file boot2.ldr is created as a 512-byte file full of zeros. This space is reserved for the bsdlabel. Now that we have files boot1, boot2.bin and boot2.ldr, only the BTX server is missing before creating the all-in-one boot file. The BTX server is located in /usr/src/sys/boot/i386/btx/btx; it has its own Makefile with its own set of rules for building. The important thing to notice is that it is also compiled as a raw binary, and that it is linked to execute at address 0x9000. The details can be found in /usr/src/sys/boot/i386/btx/btx/Makefile. 
Having the files that comprise the boot program, the final step is to merge them. This is done by a special program called btxld (source located in /usr/src/usr.sbin/btxld). Some arguments to this program include the name of the output file (boot), its entry point (0x2000) and its file format (raw binary). The various files are finally merged by this utility into the file boot, which consists of boot1, boot2, the bsdlabel and the BTX server. This file, which takes exactly 16 sectors, or 8192 bytes, is what is actually written to the beginning of the &os; slice during installation. Let us now proceed to study the BTX server program. The BTX server prepares a simple environment and switches from 16-bit real mode to 32-bit protected mode, right before passing control to the client. This includes initializing and updating the following data structures: The Interrupt Vector Table (IVT) is modified. The IVT provides exception and interrupt handlers for real-mode code. The Interrupt Descriptor Table (IDT) is created. Entries are provided for processor exceptions, hardware interrupts, two system calls and the V86 interface. The IDT provides exception and interrupt handlers for protected-mode code. A Task-State Segment (TSS) is created. This is necessary because the processor works in the least privileged level when executing the client (boot2), but in the most privileged level when executing the BTX server. The GDT (Global Descriptor Table) is set up. Entries (descriptors) are provided for supervisor code and data, user code and data, and real-mode code and data. Real-mode code and data are necessary when switching back to real mode from protected mode, as suggested by the Intel manuals. Let us now start studying the actual implementation. Recall that boot1 made a jump to address 0x9010, the BTX server's entry point. Before studying program execution there, note that the BTX server has a special header at address range 0x9000-0x900f, right before its entry point. 
This header is defined as follows:
<filename>sys/boot/i386/btx/btx/btx.S</filename> start: # Start of code /* * BTX header. */ btx_hdr: .byte 0xeb # Machine ID .byte 0xe # Header size .ascii "BTX" # Magic .byte 0x1 # Major version .byte 0x2 # Minor version .byte BTX_FLAGS # Flags .word PAG_CNT-MEM_ORG>>0xc # Paging control .word break-start # Text size .long 0x0 # Entry address
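The sizes of the header fields can be added up to confirm where the code after the header starts. The Python sketch below describes the fields with a struct format string (an illustrative encoding of the .byte, .ascii, .word and .long directives above) and also computes where the first two bytes would lead if executed as a short jump, whose displacement is relative to the address of the following instruction.

```python
import struct

BTX_BASE = 0x9000
# machine ID, header size, "BTX", major, minor, flags, paging control, text size, entry
BTX_HDR_FORMAT = "<BB3sBBBHHI"

header_size = struct.calcsize(BTX_HDR_FORMAT)
entry_point = BTX_BASE + header_size

# 0xeb is the opcode of a short jump; 0x0e is its 8-bit displacement.
jump_length = 2
jump_target = BTX_BASE + jump_length + 0x0e
```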
Note the first two bytes are 0xeb and 0xe. In the IA-32 architecture, these two bytes are interpreted as a relative jump past the header into the entry point, so in theory, boot1 could jump here (address 0x9000) instead of address 0x9010. Note that the last field in the BTX header is a pointer to the client's (boot2) entry point. This field is patched at link time. Immediately following the header is the BTX server's entry point:
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Initialization routine. */ init: cli # Disable interrupts xor %ax,%ax # Zero/segment mov %ax,%ss # Set up mov $0x1800,%sp # stack mov %ax,%es # Address mov %ax,%ds # data pushl $0x2 # Clear popfl # flags
This code disables interrupts, sets up a working stack (starting at address 0x1800) and clears the flags in the EFLAGS register. Note that the popfl instruction pops out a doubleword (4 bytes) from the stack and places it in the EFLAGS register. As the value actually popped is 2, the EFLAGS register is effectively cleared (IA-32 requires that bit 1 of the EFLAGS register, which corresponds to the value 2, always be 1). Our next code block clears (sets to 0) the memory range 0x5e00-0x8fff. This range is where the various data structures will be created:
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Initialize memory. */ mov $0x5e00,%di # Memory to initialize mov $(0x9000-0x5e00)/2,%cx # Words to zero rep # Zero-fill stosw # memory
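The operand arithmetic in this block is easy to verify. The short Python sketch below computes the word count given to rep stosw and checks which of the two copies of boot1 falls inside the cleared range; the helper name is_cleared is hypothetical.

```python
CLEAR_START = 0x5e00
CLEAR_END = 0x9000          # first address that is not cleared

word_count = (CLEAR_END - CLEAR_START) // 2    # value loaded into %cx
last_cleared = CLEAR_START + 2 * word_count - 1

def is_cleared(addr):
    """Return True if addr lies in the zero-filled range."""
    return CLEAR_START <= addr < CLEAR_END

original_boot1_wiped = is_cleared(0x7c00)
relocated_boot1_wiped = is_cleared(0x700)
```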
Recall that boot1 was originally loaded to address 0x7c00, so, with this memory initialization, that copy effectively disappeared. However, also recall that boot1 was relocated to 0x700, so that copy is still in memory, and the BTX server will make use of it. Next, the real-mode IVT (Interrupt Vector Table) is updated. The IVT is an array of segment/offset pairs for exception and interrupt handlers. The BIOS normally maps hardware interrupts to interrupt vectors 0x8 to 0xf and 0x70 to 0x77 but, as will be seen, the 8259A Programmable Interrupt Controller, the chip controlling the actual mapping of hardware interrupts to interrupt vectors, is programmed to remap these interrupt vectors from 0x8-0xf to 0x20-0x27 and from 0x70-0x77 to 0x28-0x2f. Thus, interrupt handlers are provided for interrupt vectors 0x20-0x2f. The reason the BIOS-provided handlers are not used directly is that they work in 16-bit real mode, but not in 32-bit protected mode. Processor mode will be switched to 32-bit protected mode shortly. However, the BTX server sets up a mechanism to effectively use the handlers provided by the BIOS:
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Update real mode IDT for reflecting hardware interrupts. */ mov $intr20,%bx # Address first handler mov $0x10,%cx # Number of handlers mov $0x20*4,%di # First real mode IDT entry init.0: mov %bx,(%di) # Store IP inc %di # Address next inc %di # entry stosw # Store CS add $4,%bx # Next handler loop init.0 # Next IRQ
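The loop above writes sixteen 4-byte entries, each holding a 16-bit offset followed by a 16-bit segment. The Python sketch below models the result. The address used for intr20 is a made-up value for illustration, and the segment is taken to be 0 because %ax still holds zero at this point.

```python
def fill_ivt(ivt, intr20, first_vector=0x20, count=0x10):
    """Write offset:segment pairs for count vectors, with handlers 4 bytes apart."""
    for i in range(count):
        addr = (first_vector + i) * 4               # each entry is 4 bytes
        offset = intr20 + 4 * i                     # add $4,%bx
        ivt[addr:addr + 2] = offset.to_bytes(2, "little")    # Store IP
        ivt[addr + 2:addr + 4] = (0).to_bytes(2, "little")   # Store CS (zero)
    return ivt

ASSUMED_INTR20 = 0x9100     # hypothetical address of the intr20 label
ivt = fill_ivt(bytearray(0x400), ASSUMED_INTR20)
```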
The next block creates the IDT (Interrupt Descriptor Table). The IDT is analogous, in protected mode, to the IVT in real mode. That is, the IDT describes the various exception and interrupt handlers used when the processor is executing in protected mode. In essence, it also consists of an array of segment/offset pairs, although the structure is somewhat more complex, because segments in protected mode are different than in real mode, and various protection mechanisms apply:
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Create IDT. */ mov $0x5e00,%di # IDT's address mov $idtctl,%si # Control string init.1: lodsb # Get entry cbw # count xchg %ax,%cx # as word jcxz init.4 # If done lodsb # Get segment xchg %ax,%dx # P:DPL:type lodsw # Get control xchg %ax,%bx # set lodsw # Get handler offset mov $SEL_SCODE,%dh # Segment selector init.2: shr %bx # Handle this int? jnc init.3 # No mov %ax,(%di) # Set handler offset mov %dh,0x2(%di) # and selector mov %dl,0x5(%di) # Set P:DPL:type add $0x4,%ax # Next handler init.3: lea 0x8(%di),%di # Next entry loop init.2 # Till set done jmp init.1 # Continue
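The three stores in init.2 fill in part of an 8-byte gate descriptor: the handler offset at byte 0, the selector at byte 2 and the P:DPL:type byte at byte 5. The remaining bytes stay zero from the earlier memory clearing. The Python sketch below packs such a descriptor. The selector value 0x8 and the attribute bytes 0x8e and 0xee (conventional encodings for a present 32-bit interrupt gate with DPL 0 and DPL 3) are assumptions for illustration; the real values come from SEL_SCODE and the idtctl control string, which are not shown here.

```python
import struct

ASSUMED_SEL_SCODE = 0x8     # assumption: supervisor code selector

def make_gate(offset, selector, attr):
    """Pack an 8-byte gate: offset low, selector, reserved byte, attributes, offset high."""
    return struct.pack("<HHBBH", offset & 0xffff, selector, 0, attr, offset >> 16)

def gate_dpl(gate):
    """Extract the descriptor privilege level from the attribute byte."""
    return (gate[5] >> 5) & 0x3

supervisor_gate = make_gate(0x9abc, ASSUMED_SEL_SCODE, 0x8e)
user_gate = make_gate(0x9abc, ASSUMED_SEL_SCODE, 0xee)
```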
Each entry in the IDT is 8 bytes long. Besides the segment/offset information, they also describe the segment type, privilege level, and whether the segment is present in memory or not. The construction is such that interrupt vectors from 0 to 0xf (exceptions) are handled by function intx00; vector 0x10 (also an exception) is handled by intx10; hardware interrupts, which are later configured to start at interrupt vector 0x20 all the way to interrupt vector 0x2f, are handled by function intx20. Lastly, interrupt vector 0x30, which is used for system calls, is handled by intx30, and vectors 0x31 and 0x32 are handled by intx31. It must be noted that only descriptors for interrupt vectors 0x30, 0x31 and 0x32 are given privilege level 3, the same privilege level as the boot2 client, which means the client can execute a software-generated interrupt to these vectors through the int instruction without failing (this is the way boot2 uses the services provided by the BTX server). Also, note that only software-generated interrupts are protected from code executing in lesser privilege levels. Hardware-generated interrupts and processor-generated exceptions are always handled adequately, regardless of the actual privileges involved. The next step is to initialize the TSS (Task-State Segment). The TSS is a hardware feature that helps the operating system or executive software implement multitasking functionality through process abstraction. The IA-32 architecture demands the creation and use of at least one TSS if multitasking facilities are used or different privilege levels are defined. Since the boot2 client is executed in privilege level 3, but the BTX server runs in privilege level 0, a TSS must be defined:
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Initialize TSS. */ init.4: movb $_ESP0H,TSS_ESP0+1(%di) # Set ESP0 movb $SEL_SDATA,TSS_SS0(%di) # Set SS0 movb $_TSSIO,TSS_MAP(%di) # Set I/O bit map base
Note that a value is given for the Privilege Level 0 stack pointer and stack segment in the TSS. This is needed because, if an interrupt or exception is received while executing boot2 in Privilege Level 3, a change to Privilege Level 0 is automatically performed by the processor, so a new working stack is needed. Finally, the I/O Map Base Address field of the TSS is given a value, which is a 16-bit offset from the beginning of the TSS to the I/O Permission Bitmap and the Interrupt Redirection Bitmap. After the IDT and TSS are created, the processor is ready to switch to protected mode. This is done in the next block:
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Bring up the system. */ mov $0x2820,%bx # Set protected mode callw setpic # IRQ offsets lidt idtdesc # Set IDT lgdt gdtdesc # Set GDT mov %cr0,%eax # Switch to protected inc %ax # mode mov %eax,%cr0 # ljmp $SEL_SCODE,$init.8 # To 32-bit code .code32 init.8: xorl %ecx,%ecx # Zero movb $SEL_SDATA,%cl # To 32-bit movw %cx,%ss # stack
First, a call is made to setpic to program the 8259A PIC (Programmable Interrupt Controller). This chip is connected to multiple hardware interrupt sources. Upon receiving an interrupt from a device, it signals the processor with the appropriate interrupt vector. This can be customized so that specific interrupts are associated with specific interrupt vectors, as explained before. Next, the IDTR (Interrupt Descriptor Table Register) and GDTR (Global Descriptor Table Register) are loaded with the instructions lidt and lgdt, respectively. These registers are loaded with the base address and limit address for the IDT and GDT. The following three instructions set the Protection Enable (PE) bit of the %cr0 register. This effectively switches the processor to 32-bit protected mode. Next, a long jump is made to init.8 using segment selector SEL_SCODE, which selects the Supervisor Code Segment. The processor is effectively executing in CPL 0, the most privileged level, after this jump. Finally, the Supervisor Data Segment is selected for the stack by assigning the segment selector SEL_SDATA to the %ss register. This data segment also has a privilege level of 0. Our last code block is responsible for loading the TR (Task Register) with the segment selector for the TSS we created earlier, and setting the User Mode environment before passing execution control to the boot2 client.
<filename>sys/boot/i386/btx/btx/btx.S</filename> /* * Launch user task. */ movb $SEL_TSS,%cl # Set task ltr %cx # register movl $0xa000,%edx # User base address movzwl %ss:BDA_MEM,%eax # Get free memory shll $0xa,%eax # To bytes subl $ARGSPACE,%eax # Less arg space subl %edx,%eax # Less base movb $SEL_UDATA,%cl # User data selector pushl %ecx # Set SS pushl %eax # Set ESP push $0x202 # Set flags (IF set) push $SEL_UCODE # Set CS pushl btx_hdr+0xc # Set EIP pushl %ecx # Set GS pushl %ecx # Set FS pushl %ecx # Set DS pushl %ecx # Set ES pushl %edx # Set EAX movb $0x7,%cl # Set remaining init.9: push $0x0 # general loop init.9 # registers popa # and initialize popl %es # Initialize popl %ds # user popl %fs # segment popl %gs # registers iret # To user mode
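The stack pointer arithmetic in the block above can be restated in C. This is only a sketch: the function name and parameters are hypothetical, and the base memory size and argument space are passed in rather than assumed, since their actual values are not shown in this excerpt.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical restatement of the ESP computation: convert the base
 * memory size from kilobytes to bytes (the shll $0xa step), then
 * subtract the argument space and the user base address.
 */
uint32_t
sketch_user_esp(uint16_t base_mem_kb, uint32_t arg_space, uint32_t user_base)
{
	uint32_t bytes = (uint32_t)base_mem_kb << 10;

	/* The result must not wrap below zero. */
	assert(bytes >= arg_space + user_base);
	return (bytes - arg_space - user_base);
}
```

The result is an offset relative to the user base address, which is why the base is subtracted: the client's segments start at that base.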
Note that the client's environment includes a stack segment selector and stack pointer (registers %ss and %esp). Indeed, once the TR is loaded with the TSS segment selector (instruction ltr), the stack pointer is calculated and pushed onto the stack along with the stack's segment selector. Next, the value 0x202 is pushed onto the stack; it is the value that the EFLAGS will get when control is passed to the client. Also, the User Mode code segment selector and the client's entry point are pushed. Recall that this entry point is patched in the BTX header at link time. Finally, segment selectors (stored in register %ecx) for the segment registers %gs, %fs, %ds and %es are pushed onto the stack, along with the value at %edx (0xa000). Keep in mind the various values that have been pushed onto the stack (they will be popped out shortly). Next, values for the remaining general purpose registers are also pushed onto the stack (note the loop that pushes the value 0 seven times). Now the values are popped off the stack. First, the popa instruction pops the eight most recently pushed values: the seven zeros and the value of %edx pushed just before them. They are loaded into the general purpose registers in the order %edi, %esi, %ebp, %ebx, %edx, %ecx, %eax; the slot that corresponds to %esp is discarded, so %eax receives 0xa000. Then, the various segment selectors pushed are popped into the various segment registers. Five values still remain on the stack. They are popped when the iret instruction is executed. This instruction first pops the value that was pushed from the BTX header. This value is a pointer to boot2's entry point. It is placed in the register %eip, the instruction pointer register. Next, the segment selector for the User Code Segment is popped and copied to register %cs. Remember that this segment's privilege level is 3, the least privileged level. This means that we must provide values for the stack of this privilege level. This is why the processor, besides further popping the value for the EFLAGS register, performs two more pops from the stack. 
These values go to the stack pointer (%esp) and the stack segment (%ss). Now, execution continues at boot2's entry point. It is important to note how the User Code Segment is defined. This segment's base address is set to 0xa000. This means that code memory addresses are relative to address 0xa000; if code being executed is fetched from address 0x2000, the actual memory addressed is 0xa000+0x2000=0xc000.
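The address arithmetic above can be written as a tiny C sketch. The function name sketch_linear_address is hypothetical, and the seg_limit parameter is only an illustrative bound; the actual limit of the User Code Segment is not given in this excerpt.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: translate a segment-relative offset into a
 * linear address by adding the segment base.  The limit check stands
 * in for the check the processor performs on each access.
 */
uint32_t
sketch_linear_address(uint32_t seg_base, uint32_t seg_limit, uint32_t offset)
{
	assert(offset <= seg_limit);
	return (seg_base + offset);
}
```

Calling sketch_linear_address(0xa000, 0xffff, 0x2000) reproduces the 0xc000 result from the text, with 0xffff chosen arbitrarily as the limit.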
<application>boot2</application> Stage boot2 defines an important structure, struct bootinfo. This structure is initialized by boot2 and passed to the loader, and then further to the kernel. Some fields of this structure are set by boot2, the rest by the loader. This structure, among other information, contains the kernel filename, BIOS hard disk geometry, BIOS drive number for the boot device, physical memory available, the envp pointer, etc. The definition for it is: /usr/include/machine/bootinfo.h: struct bootinfo { u_int32_t bi_version; u_int32_t bi_kernelname; /* represents a char * */ u_int32_t bi_nfs_diskless; /* struct nfs_diskless * */ /* End of fields that are always present. */ #define bi_endcommon bi_n_bios_used u_int32_t bi_n_bios_used; u_int32_t bi_bios_geom[N_BIOS_GEOM]; u_int32_t bi_size; u_int8_t bi_memsizes_valid; u_int8_t bi_bios_dev; /* bootdev BIOS unit number */ u_int8_t bi_pad[2]; u_int32_t bi_basemem; u_int32_t bi_extmem; u_int32_t bi_symtab; /* struct symtab * */ u_int32_t bi_esymtab; /* struct symtab * */ /* Items below only from advanced bootloader */ u_int32_t bi_kernend; /* end of kernel space */ u_int32_t bi_envp; /* environment */ u_int32_t bi_modulep; /* preloaded modules */ }; boot2 enters into an infinite loop waiting for user input, then calls load(). If the user does not press anything, the loop ends on a timeout, so load() will load the default file (/boot/loader). Functions ino_t lookup(char *filename) and int xfsread(ino_t inode, void *buf, size_t nbyte) are used to read the content of a file into memory. /boot/loader is an ELF binary, but one where the ELF header is preceded by a.out's struct exec structure. 
load() scans the loader's ELF header, loading the content of /boot/loader into memory, and passing the execution to the loader's entry: sys/boot/i386/boot2/boot2.c: __exec((caddr_t)addr, RB_BOOTINFO | (opts & RBX_MASK), MAKEBOOTDEV(dev_maj[dsk.type], 0, dsk.slice, dsk.unit, dsk.part), 0, 0, 0, VTOP(&bootinfo)); <application>loader</application> Stage loader is a BTX client as well. I will not describe it here in detail, there is a comprehensive man page written by Mike Smith, &man.loader.8;. The underlying mechanisms and BTX were discussed above. The main task for the loader is to boot the kernel. When the kernel is loaded into memory, it is being called by the loader: sys/boot/common/boot.c: /* Call the exec handler from the loader matching the kernel */ module_formats[km->m_loader]->l_exec(km); Kernel Initialization Let us take a look at the command that links the kernel. This will help identify the exact location where the loader passes execution to the kernel. This location is the kernel's actual entry point. sys/conf/Makefile.i386: ld -elf -Bdynamic -T /usr/src/sys/conf/ldscript.i386 -export-dynamic \ -dynamic-linker /red/herring -o kernel -X locore.o \ <lots of kernel .o files> ELF A few interesting things can be seen here. First, the kernel is an ELF dynamically linked binary, but the dynamic linker for kernel is /red/herring, which is definitely a bogus file. Second, taking a look at the file sys/conf/ldscript.i386 gives an idea about what ld options are used when compiling a kernel. Reading through the first few lines, the string sys/conf/ldscript.i386: ENTRY(btext) says that a kernel's entry point is the symbol `btext'. This symbol is defined in locore.s: sys/i386/i386/locore.s: .text /********************************************************************** * * This is where the bootblocks start us, set the ball rolling... * */ NON_GPROF_ENTRY(btext) First, the register EFLAGS is set to a predefined value of 0x00000002. 
Then all the segment registers are initialized: sys/i386/i386/locore.s: /* Don't trust what the BIOS gives for eflags. */ pushl $PSL_KERNEL popfl /* * Don't trust what the BIOS gives for %fs and %gs. Trust the bootstrap * to set %cs, %ds, %es and %ss. */ mov %ds, %ax mov %ax, %fs mov %ax, %gs btext calls the routines recover_bootinfo(), identify_cpu(), and create_pagetables(), which are also defined in locore.s. Here is a description of what they do: recover_bootinfo This routine parses the parameters to the kernel passed from the bootstrap. The kernel may have been booted in 3 ways: by the loader, described above, by the old disk boot blocks, or by the old diskless boot procedure. This function determines the booting method, and stores the struct bootinfo structure into the kernel memory. identify_cpu This function tries to find out what CPU it is running on, storing the value found in a variable _cpu. create_pagetables This function allocates and fills out a Page Table Directory at the top of the kernel memory area. The next step is enabling VME, if the CPU supports it: testl $CPUID_VME, R(_cpu_feature) jz 1f movl %cr4, %eax orl $CR4_VME, %eax movl %eax, %cr4 Then, enabling paging: /* Now enable paging */ movl R(_IdlePTD), %eax movl %eax,%cr3 /* load ptd addr into mmu */ movl %cr0,%eax /* get control word */ orl $CR0_PE|CR0_PG,%eax /* enable paging */ movl %eax,%cr0 /* and let's page NOW! */ The next three lines of code are needed because paging is now enabled, so a jump is required to continue execution in the virtualized address space: pushl $begin /* jump to high virtualized address */ ret /* now running relocated at KERNBASE where the system is linked to run */ begin: The function init386() is called with a pointer to the first free physical page, followed by a call to mi_startup(). init386() is an architecture dependent initialization function, and mi_startup() is an architecture independent one (the 'mi_' prefix stands for Machine Independent). 
The kernel never returns from mi_startup(), and by calling it, the kernel finishes booting: sys/i386/i386/locore.s: movl physfree, %esi pushl %esi /* value of first for init386(first) */ call _init386 /* wire 386 chip for unix operation */ call _mi_startup /* autoconfiguration, mountroot etc */ hlt /* never returns to here */ <function>init386()</function> init386() is defined in sys/i386/i386/machdep.c and performs low-level initialization specific to the i386 chip. The switch to protected mode was performed by the loader. The loader has created the very first task, in which the kernel continues to operate. Before looking at the code, consider the tasks the processor must complete to initialize protected mode execution: Initialize the kernel tunable parameters, passed from the bootstrapping program. Prepare the GDT. Prepare the IDT. Initialize the system console. Initialize the DDB, if it is compiled into kernel. Initialize the TSS. Prepare the LDT. Set up proc0's pcb. parameters init386() initializes the tunable parameters passed from bootstrap by setting the environment pointer (envp) and calling init_param1(). The envp pointer has been passed from loader in the bootinfo structure: sys/i386/i386/machdep.c: kern_envp = (caddr_t)bootinfo.bi_envp + KERNBASE; /* Init basic tunables, hz etc */ init_param1(); init_param1() is defined in sys/kern/subr_param.c. That file has a number of sysctls, and two functions, init_param1() and init_param2(), that are called from init386(): sys/kern/subr_param.c: hz = HZ; TUNABLE_INT_FETCH("kern.hz", &hz); TUNABLE_<typename>_FETCH is used to fetch the value from the environment: /usr/src/sys/sys/kernel.h: #define TUNABLE_INT_FETCH(path, var) getenv_int((path), (var)) Sysctl kern.hz is the system clock tick. Additionally, these sysctls are set by init_param1(): kern.maxswzone, kern.maxbcache, kern.maxtsiz, kern.dfldsiz, kern.maxdsiz, kern.dflssiz, kern.maxssiz, kern.sgrowsiz. 
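A userland analogy may help show the pattern in which a compiled-in default is optionally overridden by a tunable. The sketch below is not the kernel's getenv_int(); it assumes a hypothetical environment held as a NULL-terminated array of "name=value" strings, and the function name sketch_tunable_int_fetch is invented for illustration. Like the excerpt, where hz is first set to HZ and then fetched, it leaves the variable untouched when the name is absent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userland stand-in for a tunable fetch.  Looks up
 * "name=value" in env and, if the value parses as an integer, stores
 * it in *var and returns 1.  Otherwise *var keeps its default and the
 * function returns 0.
 */
int
sketch_tunable_int_fetch(const char *const *env, const char *name, int *var)
{
	size_t n;

	assert(env != NULL && name != NULL && var != NULL);
	n = strlen(name);
	for (; *env != NULL; env++) {
		if (strncmp(*env, name, n) == 0 && (*env)[n] == '=') {
			const char *val = *env + n + 1;
			char *end;
			long v = strtol(val, &end, 0);

			if (end == val || *end != '\0')
				return (0);	/* not a clean integer */
			*var = (int)v;
			return (1);
		}
	}
	return (0);			/* name not present */
}
```

A caller would assign the default first and then call the fetch function, so a missing or malformed entry silently keeps the default.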
Global Descriptors Table (GDT) Then init386() prepares the Global Descriptors Table (GDT). Every task on an x86 is running in its own virtual address space, and this space is addressed by a segment:offset pair. Say, for instance, the current instruction to be executed by the processor lies at CS:EIP, then the linear virtual address for that instruction would be the virtual address of code segment CS + EIP. For convenience, segments begin at virtual address 0 and end at a 4Gb boundary. Therefore, the instruction's linear virtual address for this example would just be the value of EIP. Segment registers such as CS, DS etc are the selectors, i.e., indexes, into GDT (to be more precise, an index is not a selector itself, but the INDEX field of a selector). FreeBSD's GDT holds descriptors for 15 selectors per CPU: sys/i386/i386/machdep.c: union descriptor gdt[NGDT * MAXCPU]; /* global descriptor table */ sys/i386/include/segments.h: /* * Entries in the Global Descriptor Table (GDT) */ #define GNULL_SEL 0 /* Null Descriptor */ #define GCODE_SEL 1 /* Kernel Code Descriptor */ #define GDATA_SEL 2 /* Kernel Data Descriptor */ #define GPRIV_SEL 3 /* SMP Per-Processor Private Data */ #define GPROC0_SEL 4 /* Task state process slot zero and up */ #define GLDT_SEL 5 /* LDT - eventually one per process */ #define GUSERLDT_SEL 6 /* User LDT */ #define GTGATE_SEL 7 /* Process task switch gate */ #define GBIOSLOWMEM_SEL 8 /* BIOS low memory access (must be entry 8) */ #define GPANIC_SEL 9 /* Task state to consider panic from */ #define GBIOSCODE32_SEL 10 /* BIOS interface (32bit Code) */ #define GBIOSCODE16_SEL 11 /* BIOS interface (16bit Code) */ #define GBIOSDATA_SEL 12 /* BIOS interface (Data) */ #define GBIOSUTIL_SEL 13 /* BIOS interface (Utility) */ #define GBIOSARGS_SEL 14 /* BIOS interface (Arguments) */ Note that those #defines are not selectors themselves, but just a field INDEX of a selector, so they are exactly the indices of the GDT. 
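The relationship between a GDT index and a selector value can be sketched in C. On x86, a selector holds the table index in bits 3 to 15, a table indicator in bit 2 (0 for the GDT), and the requested privilege level in bits 0 and 1. The function name sketch_gsel is hypothetical; the kernel's own GSEL() macro, which appears in the setidt() call in this chapter, serves a comparable purpose.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: build a GDT selector from a descriptor index
 * and a requested privilege level (0 = kernel, 3 = user).  The table
 * indicator bit is left at 0, which selects the GDT.
 */
uint16_t
sketch_gsel(unsigned int index, unsigned int rpl)
{
	assert(index < 8192 && rpl <= 3);
	return ((uint16_t)((index << 3) | rpl));
}
```

With index 1 (GCODE_SEL) and privilege level 0, this yields 0x08.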
For example, an actual selector for the kernel code (GCODE_SEL) has the value 0x08. Interrupt Descriptor Table (IDT) The next step is to initialize the Interrupt Descriptor Table (IDT). This table is referenced by the processor when a software or hardware interrupt occurs. For example, to make a system call, a user application issues the INT 0x80 instruction. This is a software interrupt, so the processor's hardware looks up a record with index 0x80 in the IDT. This record points to the routine that handles this interrupt; in this particular case, it is the kernel's syscall gate. The IDT may have a maximum of 256 (0x100) records. The kernel allocates NIDT records for the IDT, where NIDT is the maximum (256): sys/i386/i386/machdep.c: static struct gate_descriptor idt0[NIDT]; struct gate_descriptor *idt = &idt0[0]; /* interrupt descriptor table */ For each interrupt, an appropriate handler is set. The syscall gate for INT 0x80 is set as well: sys/i386/i386/machdep.c: setidt(0x80, &IDTVEC(int0x80_syscall), SDT_SYS386TGT, SEL_UPL, GSEL(GCODE_SEL, SEL_KPL)); So when a userland application issues the INT 0x80 instruction, control will transfer to the function _Xint0x80_syscall, which is in the kernel code segment and will be executed with supervisor privileges. Console and DDB are then initialized: DDB sys/i386/i386/machdep.c: cninit(); /* skipped */ #ifdef DDB kdb_init(); if (boothowto & RB_KDB) Debugger("Boot flags requested debugger"); #endif The Task State Segment is another x86 protected mode structure; the TSS is used by the hardware to store task information when a task switch occurs. The Local Descriptors Table is used to reference userland code and data. 
Several selectors are defined to point to the LDT, they are the system call gates and the user code and data selectors: /usr/include/machine/segments.h: #define LSYS5CALLS_SEL 0 /* forced by intel BCS */ #define LSYS5SIGR_SEL 1 #define L43BSDCALLS_SEL 2 /* notyet */ #define LUCODE_SEL 3 #define LSOL26CALLS_SEL 4 /* Solaris >= 2.6 system call gate */ #define LUDATA_SEL 5 /* separate stack, es,fs,gs sels ? */ /* #define LPOSIXCALLS_SEL 5*/ /* notyet */ #define LBSDICALLS_SEL 16 /* BSDI system call gate */ #define NLDT (LBSDICALLS_SEL + 1) Next, proc0's Process Control Block (struct pcb) structure is initialized. proc0 is a struct proc structure that describes a kernel process. It is always present while the kernel is running, therefore it is declared as global: sys/kern/kern_init.c: struct proc proc0; The structure struct pcb is a part of a proc structure. It is defined in /usr/include/machine/pcb.h and has a process's information specific to the i386 architecture, such as registers values. <function>mi_startup()</function> This function performs a bubble sort of all the system initialization objects and then calls the entry of each object one by one: sys/kern/init_main.c: for (sipp = sysinit; *sipp; sipp++) { /* ... skipped ... */ /* Call function */ (*((*sipp)->func))((*sipp)->udata); /* ... skipped ... */ } Although the sysinit framework is described in the Developers' Handbook, I will discuss the internals of it. sysinit objects Every system initialization object (sysinit object) is created by calling a SYSINIT() macro. Let us take as example an announce sysinit object. This object prints the copyright message: sys/kern/init_main.c: static void print_caddr_t(void *data __unused) { printf("%s", (char *)data); } SYSINIT(announce, SI_SUB_COPYRIGHT, SI_ORDER_FIRST, print_caddr_t, copyright) The subsystem ID for this object is SI_SUB_COPYRIGHT (0x0800001), which comes right after the SI_SUB_CONSOLE (0x0800000). 
So, the copyright message will be printed out first, just after the console initialization. Let us take a look at what exactly the macro SYSINIT() does. It expands to a C_SYSINIT() macro. The C_SYSINIT() macro then expands to a static struct sysinit structure declaration with another DATA_SET macro call: /usr/include/sys/kernel.h: #define C_SYSINIT(uniquifier, subsystem, order, func, ident) \ static struct sysinit uniquifier ## _sys_init = { \ subsystem, \ order, \ func, \ ident \ }; \ DATA_SET(sysinit_set,uniquifier ## _sys_init); #define SYSINIT(uniquifier, subsystem, order, func, ident) \ C_SYSINIT(uniquifier, subsystem, order, \ (sysinit_cfunc_t)(sysinit_nfunc_t)func, (void *)ident) The DATA_SET() macro expands to a MAKE_SET(), and that macro is the point where all the sysinit magic is hidden: /usr/include/linker_set.h: #define MAKE_SET(set, sym) \ static void const * const __set_##set##_sym_##sym = &sym; \ __asm(".section .set." #set ",\"aw\""); \ __asm(".long " #sym); \ __asm(".previous") #endif #define TEXT_SET(set, sym) MAKE_SET(set, sym) #define DATA_SET(set, sym) MAKE_SET(set, sym) In our case, the following declaration will occur: static struct sysinit announce_sys_init = { SI_SUB_COPYRIGHT, SI_ORDER_FIRST, (sysinit_cfunc_t)(sysinit_nfunc_t) print_caddr_t, (void *) copyright }; static void const *const __set_sysinit_set_sym_announce_sys_init = &announce_sys_init; __asm(".section .set.sysinit_set" ",\"aw\""); __asm(".long " "announce_sys_init"); __asm(".previous"); The first __asm instruction will create an ELF section within the kernel's executable. This will happen at kernel link time. The section will have the name .set.sysinit_set. The content of this section is one 32-bit value, the address of announce_sys_init structure, and that is what the second __asm is. The third __asm instruction marks the end of a section. 
If a directive with the same section name occurred before, the content, i.e., the 32-bit value, will be appended to the existing section, thus forming an array of 32-bit pointers. Running objdump on a kernel binary, you may notice the presence of such small sections: &prompt.user; objdump -h /kernel 7 .set.cons_set 00000014 c03164c0 c03164c0 002154c0 2**2 CONTENTS, ALLOC, LOAD, DATA 8 .set.kbddriver_set 00000010 c03164d4 c03164d4 002154d4 2**2 CONTENTS, ALLOC, LOAD, DATA 9 .set.scrndr_set 00000024 c03164e4 c03164e4 002154e4 2**2 CONTENTS, ALLOC, LOAD, DATA 10 .set.scterm_set 0000000c c0316508 c0316508 00215508 2**2 CONTENTS, ALLOC, LOAD, DATA 11 .set.sysctl_set 0000097c c0316514 c0316514 00215514 2**2 CONTENTS, ALLOC, LOAD, DATA 12 .set.sysinit_set 00000664 c0316e90 c0316e90 00215e90 2**2 CONTENTS, ALLOC, LOAD, DATA This screen dump shows that the size of the .set.sysinit_set section is 0x664 bytes, so 0x664/sizeof(void *) sysinit objects (409 with 4-byte pointers) are compiled into the kernel. The other sections such as .set.sysctl_set represent other linker sets. By defining a variable of type struct linker_set the content of the .set.sysinit_set section will be collected into that variable: sys/kern/init_main.c: extern struct linker_set sysinit_set; /* XXX */ The struct linker_set is defined as follows: /usr/include/linker_set.h: struct linker_set { int ls_length; void *ls_items[1]; /* really ls_length of them, trailing NULL */ }; The first member will be equal to the number of sysinit objects, and the second member will be a NULL-terminated array of pointers to them. Returning to the mi_startup() discussion, it should now be clear how the sysinit objects are organized. The mi_startup() function sorts them and calls each. 
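The sort-then-call behavior described for mi_startup() can be sketched in plain C. This is an illustration under stated assumptions, not the kernel's code: struct sketch_sysinit, sketch_sort(), sketch_run() and the sketch_record() callback are all hypothetical names, and the array of pointers stands in for the contents of the linker set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct sysinit. */
struct sketch_sysinit {
	unsigned int subsystem;		/* major ordering key */
	unsigned int order;		/* minor ordering key */
	void (*func)(const void *);	/* entry point */
	const void *udata;		/* argument for func */
};

char sketch_trace[16];			/* records the call order */

/* Example callback: append the first character of udata to the trace. */
void
sketch_record(const void *udata)
{
	size_t len = strlen(sketch_trace);

	if (len + 1 < sizeof(sketch_trace)) {
		sketch_trace[len] = *(const char *)udata;
		sketch_trace[len + 1] = '\0';
	}
}

/* Bubble sort the pointer array by (subsystem, order), ascending. */
void
sketch_sort(struct sketch_sysinit **set, size_t n)
{
	size_t i, j;

	assert(set != NULL);
	for (i = 0; i + 1 < n; i++) {
		for (j = 0; j + 1 < n - i; j++) {
			struct sketch_sysinit *a = set[j], *b = set[j + 1];

			if (a->subsystem > b->subsystem ||
			    (a->subsystem == b->subsystem &&
			    a->order > b->order)) {
				set[j] = b;
				set[j + 1] = a;
			}
		}
	}
}

/* Call each object's entry point in sorted order. */
void
sketch_run(struct sketch_sysinit **set, size_t n)
{
	size_t i;

	assert(set != NULL);
	for (i = 0; i < n; i++)
		set[i]->func(set[i]->udata);
}
```

Because only the pointers are swapped, the objects themselves can stay wherever the linker placed them.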
The very last object is the system scheduler: /usr/include/sys/kernel.h: enum sysinit_sub_id { SI_SUB_DUMMY = 0x0000000, /* not executed; for linker*/ SI_SUB_DONE = 0x0000001, /* processed*/ SI_SUB_CONSOLE = 0x0800000, /* console*/ SI_SUB_COPYRIGHT = 0x0800001, /* first use of console*/ ... SI_SUB_RUN_SCHEDULER = 0xfffffff /* scheduler: no return*/ }; The system scheduler sysinit object is defined in the file sys/vm/vm_glue.c, and the entry point for that object is scheduler(). That function is actually an infinite loop, and it represents a process with PID 0, the swapper process. The proc0 structure, mentioned before, is used to describe it. The first user process, called init, is created by the sysinit object init: sys/kern/init_main.c: static void create_init(const void *udata __unused) { int error; int s; s = splhigh(); error = fork1(&proc0, RFFDG | RFPROC, &initproc); if (error) panic("cannot fork init: %d\n", error); initproc->p_flag |= P_INMEM | P_SYSTEM; cpu_set_fork_handler(initproc, start_init, NULL); remrunqueue(initproc); splx(s); } SYSINIT(init,SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL) The create_init() allocates a new process by calling fork1(), but does not mark it runnable. When this new process is scheduled for execution by the scheduler, the start_init() will be called. That function is defined in init_main.c. It tries to load and exec the init binary, probing /sbin/init first, then /sbin/oinit, /sbin/init.bak, and finally /stand/sysinstall: sys/kern/init_main.c: static char init_path[MAXPATHLEN] = #ifdef INIT_PATH __XSTRING(INIT_PATH); #else "/sbin/init:/sbin/oinit:/sbin/init.bak:/stand/sysinstall"; #endif
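The probing order follows from the colon-separated init_path string. As a sketch of how such a list can be walked, the hypothetical helper below extracts the candidate at a given position; it is not the code used by start_init(), and its name and signature are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the idx-th colon-separated component of
 * path_list into buf.  Returns 1 on success, 0 if there is no such
 * component or it does not fit in buf.
 */
int
sketch_init_candidate(const char *path_list, size_t idx, char *buf,
    size_t buflen)
{
	const char *start = path_list;
	const char *end;
	size_t len;

	assert(path_list != NULL && buf != NULL && buflen > 0);
	for (;;) {
		end = strchr(start, ':');
		len = (end != NULL) ? (size_t)(end - start) : strlen(start);
		if (idx == 0)
			break;
		if (end == NULL)
			return (0);	/* ran out of components */
		start = end + 1;
		idx--;
	}
	if (len >= buflen)
		return (0);		/* component too long for buf */
	memcpy(buf, start, len);
	buf[len] = '\0';
	return (1);
}
```

A caller would try index 0, 1, 2 and so on until either an exec attempt succeeds or the helper returns 0.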
diff --git a/en_US.ISO8859-1/books/arch-handbook/driverbasics/chapter.xml b/en_US.ISO8859-1/books/arch-handbook/driverbasics/chapter.xml index 6e5551873b..9826e3a1d9 100644 --- a/en_US.ISO8859-1/books/arch-handbook/driverbasics/chapter.xml +++ b/en_US.ISO8859-1/books/arch-handbook/driverbasics/chapter.xml @@ -1,423 +1,423 @@ Writing FreeBSD Device Drivers Murray Stokely Written by Jörg Wunsch Based on intro(4) manual page by Introduction device driver pseudo-device This chapter provides a brief introduction to writing device drivers for FreeBSD. A device in this context is a term used mostly for hardware-related stuff that belongs to the system, like disks, printers, or a graphics display with its keyboard. A device driver is the software component of the operating system that controls a specific device. There are also so-called pseudo-devices where a device driver emulates the behavior of a device in software without any particular underlying hardware. Device drivers can be compiled into the system statically or loaded on demand through the dynamic kernel linker facility `kld'. device nodes Most devices in a &unix;-like operating system are accessed through device-nodes, sometimes also called special files. These files are usually located under the directory /dev in the filesystem hierarchy. Device drivers can roughly be broken down into two categories; character and network device drivers. Dynamic Kernel Linker Facility - KLD kernel linking dynamic kernel loadable modules (KLD) The kld interface allows system administrators to dynamically add and remove functionality from a running system. This allows device driver writers to load their new changes into a running kernel without constantly rebooting to test changes. 
kernel modules loading kernel modules unloading kernel modules listing The kld interface is used through: kldload - loads a new kernel module kldunload - unloads a kernel module kldstat - lists loaded modules Skeleton Layout of a kernel module /* * KLD Skeleton * Inspired by Andrew Reiter's Daemonnews article */ #include <sys/types.h> #include <sys/module.h> #include <sys/systm.h> /* uprintf */ #include <sys/errno.h> #include <sys/param.h> /* defines used in kernel.h */ #include <sys/kernel.h> /* types used in module initialization */ /* * Load handler that deals with the loading and unloading of a KLD. */ static int skel_loader(struct module *m, int what, void *arg) { int err = 0; switch (what) { case MOD_LOAD: /* kldload */ uprintf("Skeleton KLD loaded.\n"); break; case MOD_UNLOAD: uprintf("Skeleton KLD unloaded.\n"); break; default: err = EOPNOTSUPP; break; } return(err); } /* Declare this module to the rest of the kernel */ static moduledata_t skel_mod = { "skel", skel_loader, NULL }; DECLARE_MODULE(skeleton, skel_mod, SI_SUB_KLD, SI_ORDER_ANY); Makefile &os; provides a system makefile to simplify compiling a kernel module. SRCS=skeleton.c KMOD=skeleton .include <bsd.kmod.mk> Running make with this makefile will create a file skeleton.ko that can be loaded into the kernel by typing: &prompt.root; kldload -v ./skeleton.ko Character Devices character devices A character device driver is one that transfers data directly to and from a user process. This is the most common type of device driver and there are plenty of simple examples in the source tree. This simple example pseudo-device remembers whatever values are written to it and can then echo them back when read. 
Example of a Sample Echo Pseudo-Device Driver for &os; 10.X - 12.X /* * Simple Echo pseudo-device KLD * * Murray Stokely * Søren (Xride) Straarup * Eitan Adler */ #include <sys/types.h> #include <sys/module.h> #include <sys/systm.h> /* uprintf */ #include <sys/param.h> /* defines used in kernel.h */ #include <sys/kernel.h> /* types used in module initialization */ #include <sys/conf.h> /* cdevsw struct */ #include <sys/uio.h> /* uio struct */ #include <sys/malloc.h> #define BUFFERSIZE 255 /* Function prototypes */ static d_open_t echo_open; static d_close_t echo_close; static d_read_t echo_read; static d_write_t echo_write; /* Character device entry points */ static struct cdevsw echo_cdevsw = { .d_version = D_VERSION, .d_open = echo_open, .d_close = echo_close, .d_read = echo_read, .d_write = echo_write, .d_name = "echo", }; struct s_echo { char msg[BUFFERSIZE + 1]; int len; }; /* vars */ static struct cdev *echo_dev; static struct s_echo *echomsg; MALLOC_DECLARE(M_ECHOBUF); MALLOC_DEFINE(M_ECHOBUF, "echobuffer", "buffer for echo module"); /* * This function is called by the kld[un]load(2) system calls to * determine what actions to take when a module is loaded or unloaded. 
*/ static int echo_loader(struct module *m __unused, int what, void *arg __unused) { int error = 0; switch (what) { case MOD_LOAD: /* kldload */ error = make_dev_p(MAKEDEV_CHECKNAME | MAKEDEV_WAITOK, &echo_dev, &echo_cdevsw, 0, UID_ROOT, GID_WHEEL, 0600, "echo"); if (error != 0) break; echomsg = malloc(sizeof(*echomsg), M_ECHOBUF, M_WAITOK | M_ZERO); printf("Echo device loaded.\n"); break; case MOD_UNLOAD: destroy_dev(echo_dev); free(echomsg, M_ECHOBUF); printf("Echo device unloaded.\n"); break; default: error = EOPNOTSUPP; break; } return (error); } static int echo_open(struct cdev *dev __unused, int oflags __unused, int devtype __unused, struct thread *td __unused) { int error = 0; uprintf("Opened device \"echo\" successfully.\n"); return (error); } static int echo_close(struct cdev *dev __unused, int fflag __unused, int devtype __unused, struct thread *td __unused) { uprintf("Closing device \"echo\".\n"); return (0); } /* * The read function just takes the buf that was saved via * echo_write() and returns it to userland for accessing. * uio(9) */ static int echo_read(struct cdev *dev __unused, struct uio *uio, int ioflag __unused) { size_t amt; int error; /* * How big is this read operation? Either as big as the user wants, * or as big as the remaining data. Note that the 'len' does not * include the trailing null character. */ amt = MIN(uio->uio_resid, uio->uio_offset >= echomsg->len + 1 ? 0 : echomsg->len + 1 - uio->uio_offset); if ((error = uiomove(echomsg->msg, amt, uio)) != 0) uprintf("uiomove failed!\n"); return (error); } /* * echo_write takes in a character string and saves it * to buf for later accessing. */ static int echo_write(struct cdev *dev __unused, struct uio *uio, int ioflag __unused) { size_t amt; int error; /* * We either write from the beginning or are appending -- do * not allow random access. 
*/ if (uio->uio_offset != 0 && (uio->uio_offset != echomsg->len)) return (EINVAL); /* This is a new message, reset length */ if (uio->uio_offset == 0) echomsg->len = 0; /* Copy the string in from user memory to kernel memory */ amt = MIN(uio->uio_resid, (BUFFERSIZE - echomsg->len)); error = uiomove(echomsg->msg + uio->uio_offset, amt, uio); /* Now we need to null terminate and record the length */ echomsg->len = uio->uio_offset; echomsg->msg[echomsg->len] = 0; if (error != 0) uprintf("Write failed: bad address!\n"); return (error); } DEV_MODULE(echo, echo_loader, NULL); With this driver loaded try: &prompt.root; echo -n "Test Data" > /dev/echo &prompt.root; cat /dev/echo Opened device "echo" successfully. Test Data Closing device "echo". Real hardware devices are described in the next chapter. Block Devices (Are Gone) block devices Other &unix; systems may support a second type of disk device known as block devices. Block devices are disk devices for which the kernel provides caching. This caching makes block-devices almost unusable, or at least dangerously unreliable. The caching will reorder the sequence of write operations, depriving the application of the ability to know the exact disk contents at any one instant in time. This makes predictable and reliable crash recovery of on-disk data structures (filesystems, databases, etc.) impossible. Since writes may be delayed, there is no way the kernel can report to the application which particular write operation encountered a write error, this further compounds the consistency problem. For this reason, no serious applications rely on block devices, and in fact, almost all applications which access disks directly take great pains to specify that character - (or raw) devices should always be used. Because + (or raw) devices should always be used. 
As the implementation of the aliasing of each disk (partition) to two devices with different semantics significantly complicated - the relevant kernel code &os; dropped support for cached disk + the relevant kernel code, &os; dropped support for cached disk devices as part of the modernization of the disk I/O infrastructure. Network Drivers network devices Drivers for network devices do not use device nodes in order to be accessed. Their selection is based on other decisions made inside the kernel and instead of calling open(), use of a network device is generally introduced by using the system call socket(2). For more information see ifnet(9), the source of the loopback device, and Bill Paul's network drivers. diff --git a/en_US.ISO8859-1/books/arch-handbook/isa/chapter.xml b/en_US.ISO8859-1/books/arch-handbook/isa/chapter.xml index 97bd2822c5..04de498a3f 100644 --- a/en_US.ISO8859-1/books/arch-handbook/isa/chapter.xml +++ b/en_US.ISO8859-1/books/arch-handbook/isa/chapter.xml @@ -1,2514 +1,2514 @@ ISA Device Drivers SergeyBabkinWritten by MurrayStokelyModifications for Handbook made by ValentinoVaschetto WylieStilwell Synopsis ISA device driverISA This chapter introduces the issues relevant to writing a driver for an ISA device. The pseudo-code presented here is rather detailed and reminiscent of the real code but is still only pseudo-code. It avoids the details irrelevant to the subject of the discussion. The real-life examples can be found in the source code of real drivers. In particular the drivers ep and aha are good sources of information. Basic Information A typical ISA driver would need the following include files: #include <sys/module.h> #include <sys/bus.h> #include <machine/bus.h> #include <machine/resource.h> #include <sys/rman.h> #include <isa/isavar.h> #include <isa/pnpvar.h> They describe the things specific to the ISA and generic bus subsystem. 
object-oriented The bus subsystem is implemented in an object-oriented fashion, its main structures are accessed by associated method functions. bus methods The list of bus methods implemented by an ISA driver is like one for any other bus. For a hypothetical driver named xxx they would be: static void xxx_isa_identify (driver_t *, device_t); Normally used for bus drivers, not device drivers. But for ISA devices this method may have special use: if the device provides some device-specific (non-PnP) way to auto-detect devices this routine may implement it. static int xxx_isa_probe (device_t dev); Probe for a device at a known (or PnP) location. This routine can also accommodate device-specific auto-detection of parameters for partially configured devices. static int xxx_isa_attach (device_t dev); Attach and initialize device. static int xxx_isa_detach (device_t dev); Detach device before unloading the driver module. static int xxx_isa_shutdown (device_t dev); Execute shutdown of the device before system shutdown. static int xxx_isa_suspend (device_t dev); Suspend the device before the system goes to the power-save state. May also abort transition to the power-save state. static int xxx_isa_resume (device_t dev); Resume the device activity after return from power-save state. xxx_isa_probe() and xxx_isa_attach() are mandatory, the rest of the routines are optional, depending on the device's needs. The driver is linked to the system with the following set of descriptions. 
/* table of supported bus methods */ static device_method_t xxx_isa_methods[] = { /* list all the bus method functions supported by the driver */ /* omit the unsupported methods */ DEVMETHOD(device_identify, xxx_isa_identify), DEVMETHOD(device_probe, xxx_isa_probe), DEVMETHOD(device_attach, xxx_isa_attach), DEVMETHOD(device_detach, xxx_isa_detach), DEVMETHOD(device_shutdown, xxx_isa_shutdown), DEVMETHOD(device_suspend, xxx_isa_suspend), DEVMETHOD(device_resume, xxx_isa_resume), DEVMETHOD_END }; static driver_t xxx_isa_driver = { "xxx", xxx_isa_methods, sizeof(struct xxx_softc), }; static devclass_t xxx_devclass; DRIVER_MODULE(xxx, isa, xxx_isa_driver, xxx_devclass, load_function, load_argument); softc Here struct xxx_softc is a device-specific structure that contains private driver data and descriptors for the driver's resources. The bus code automatically allocates one softc descriptor per device as needed. kernel module If the driver is implemented as a loadable module then load_function() is called to do driver-specific initialization or clean-up when the driver is loaded or unloaded and load_argument is passed as one of its arguments. If the driver does not support dynamic loading (in other words it must always be linked into the kernel) then these values should be set to 0 and the last definition would look like: DRIVER_MODULE(xxx, isa, xxx_isa_driver, xxx_devclass, 0, 0); PnP If the driver is for a device which supports PnP then a table of supported PnP IDs must be defined. The table consists of a list of PnP IDs supported by this driver and human-readable descriptions of the hardware types and models having these IDs. 
It looks like: static struct isa_pnp_id xxx_pnp_ids[] = { /* a line for each supported PnP ID */ { 0x12345678, "Our device model 1234A" }, { 0x12345679, "Our device model 1234B" }, { 0, NULL }, /* end of table */ }; If the driver does not support PnP devices it still needs an empty PnP ID table, like: static struct isa_pnp_id xxx_pnp_ids[] = { { 0, NULL }, /* end of table */ }; <varname remap="structname">device_t</varname> Pointer device_t is the pointer type for the device structure. Here we consider only the methods interesting from the device driver writer's standpoint. The methods to manipulate values in the device structure are: device_t device_get_parent(dev) Get the parent bus of a device. driver_t device_get_driver(dev) Get pointer to its driver structure. char *device_get_name(dev) Get the driver name, such as "xxx" for our example. int device_get_unit(dev) Get the unit number (units are numbered from 0 for the devices associated with each driver). char *device_get_nameunit(dev) Get the device name including the unit number, such as xxx0, xxx1 and so on. char *device_get_desc(dev) Get the device description. Normally it describes the exact model of device in human-readable form. device_set_desc(dev, desc) Set the description. This makes the device description point to the string desc which may not be deallocated or changed after that. device_set_desc_copy(dev, desc) Set the description. The description is copied into an internal dynamically allocated buffer, so the string desc may be changed afterwards without adverse effects. void *device_get_softc(dev) Get pointer to the device descriptor (struct xxx_softc) associated with this device. u_int32_t device_get_flags(dev) Get the flags specified for the device in the configuration file. A convenience function device_printf(dev, fmt, ...) may be used to print the messages from the device driver. It automatically prepends the unitname and colon to the message. 
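The naming convention described above (driver name plus unit number, and the prefix that device_printf() adds) can be sketched in plain userland C. This is only an illustrative model, not kernel code: struct fake_device, fake_nameunit() and fake_device_format() are hypothetical names invented for this sketch, and the real routines operate on device_t inside the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's device structure. */
struct fake_device {
	const char *name;	/* driver name, such as "xxx" */
	int unit;		/* unit number, counted from 0 per driver */
};

/* Build "<name><unit>", the form device_get_nameunit() is described to return. */
int
fake_nameunit(const struct fake_device *dev, char *buf, size_t len)
{
	assert(dev != NULL && buf != NULL);
	return (snprintf(buf, len, "%s%d", dev->name, dev->unit));
}

/* Build "<name><unit>: <msg>", modelling the prefix device_printf() prepends. */
int
fake_device_format(const struct fake_device *dev, char *buf, size_t len,
    const char *msg)
{
	assert(dev != NULL && buf != NULL && msg != NULL);
	return (snprintf(buf, len, "%s%d: %s", dev->name, dev->unit, msg));
}
```

With a device named "xxx" and unit 0, the first helper yields the string xxx0 and the second yields a message that starts with xxx0 followed by a colon, which is why driver messages in the system log can be traced back to a specific unit.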
The device_t methods are implemented in the file kern/subr_bus.c. Configuration File and the Order of Identifying and Probing During Auto-Configuration ISAprobing The ISA devices are described in the kernel configuration file like: device xxx0 at isa? port 0x300 irq 10 drq 5 iomem 0xd0000 flags 0x1 sensitive IRQ The values of port, IRQ and so on are converted to the resource values associated with the device. They are optional, depending on the device's needs and abilities for auto-configuration. For example, some devices do not need DRQ at all and some allow the driver to read the IRQ setting from the device configuration ports. If a machine has multiple ISA buses the exact bus may be specified in the configuration line, like isa0 or isa1, otherwise the device would be searched for on all the ISA buses. sensitive is a resource requesting that this device must be probed before all non-sensitive devices. It is supported but does not seem to be used in any current driver. For legacy ISA devices in many cases the drivers are still able to detect the configuration parameters. But each device to be configured in the system must have a config line. If two devices of some type are installed in the system but there is only one configuration line for the corresponding driver, i.e.: device xxx0 at isa? then only one device will be configured. But for the devices supporting automatic identification by means of Plug-n-Play or some proprietary protocol one configuration line is enough to configure all the devices in the system, like the one above or just simply: device xxx at isa? If a driver supports both auto-identified and legacy devices and both kinds are installed at once in one machine then it is enough to describe only the legacy devices in the config file. The auto-identified devices will be added automatically.
When an ISA bus is auto-configured the events happen as follows: All the drivers' identify routines (including the PnP identify routine which identifies all the PnP devices) are called in random order. As they identify the devices they add them to the list on the ISA bus. Normally the drivers' identify routines associate their drivers with the new devices. The PnP identify routine does not know about the other drivers yet so it does not associate any with the new devices it adds. The PnP devices are put to sleep using the PnP protocol to prevent them from being probed as legacy devices. The probe routines of non-PnP devices marked as sensitive are called. If the probe for a device succeeds, the attach routine is called for it. The probe and attach routines of all non-PnP devices are called likewise. The PnP devices are brought back from the sleep state and assigned the resources they request: I/O and memory address ranges, IRQs and DRQs, all of them not conflicting with the attached legacy devices. Then for each PnP device the probe routines of all the present ISA drivers are called. The first one that claims the device gets attached. It is possible that multiple drivers would claim the device with different priority; in this case, the highest-priority driver wins. The probe routines must call ISA_PNP_PROBE() to compare the actual PnP ID with the list of the IDs supported by the driver and return failure if the ID is not in the table. That means that absolutely every driver, even the ones not supporting any PnP devices, must call ISA_PNP_PROBE(), at least with an empty PnP ID table, to return failure on unknown PnP devices. The probe routine returns a positive value (the error code) on error, and zero or a negative value on success. The negative return values are used when a PnP device supports multiple interfaces. For example, an older compatibility interface and a newer advanced interface which are supported by different drivers.
Then both drivers would detect the device. The driver which returns a higher value in the probe routine takes precedence (in other words, the driver returning 0 has highest precedence, returning -1 is next, returning -2 is after it and so on). In result the devices which support only the old interface will be handled by the old driver (which should return -1 from the probe routine) while the devices supporting the new interface as well will be handled by the new driver (which should return 0 from the probe routine). If multiple drivers return the same value then the one called first wins. So if a driver returns value 0 it may be sure that it won the priority arbitration. The device-specific identify routines can also assign not a driver but a class of drivers to the device. Then all the drivers in the class are probed for this device, like the case with PnP. This feature is not implemented in any existing driver and is not considered further in this document. - Because the PnP devices are disabled when probing the + As the PnP devices are disabled when probing the legacy devices they will not be attached twice (once as legacy and once as PnP). But in case of device-dependent identify routines it is the responsibility of the driver to make sure that the same device will not be attached by the driver twice: once as legacy user-configured and once as auto-identified. Another practical consequence for the auto-identified devices (both PnP and device-specific) is that the flags can not be passed to them from the kernel configuration file. So they must either not use the flags at all or use the flags from the device unit 0 for all the auto-identified devices or use the sysctl interface instead of flags. Other unusual configurations may be accommodated by accessing the configuration resources directly with functions of families resource_query_*() and resource_*_value(). Their implementations are located in kern/subr_bus.c. 
The old IDE disk driver i386/isa/wd.c contains examples of such use. But the standard means of configuration must always be preferred. Leave parsing the configuration resources to the bus configuration code. Resources resources device driverresources The information that a user enters into the kernel configuration file is processed and passed to the kernel as configuration resources. This information is parsed by the bus configuration code and transformed into a value of structure device_t and the bus resources associated with it. The drivers may access the configuration resources directly using functions resource_* for more complex cases of configuration. However, generally this is neither needed nor recommended, so this issue is not discussed further here. The bus resources are associated with each device. They are identified by type and number within the type. For the ISA bus the following types are defined: DMA channel SYS_RES_IRQ - interrupt number SYS_RES_DRQ - ISA DMA channel number SYS_RES_MEMORY - range of device memory mapped into the system memory space SYS_RES_IOPORT - range of device I/O registers The enumeration within types starts from 0, so if a device has two memory regions it would have resources of type SYS_RES_MEMORY numbered 0 and 1. The resource type has nothing to do with the C language type, all the resource values have the C language type unsigned long and must be cast as necessary. The resource numbers do not have to be contiguous, although for ISA they normally would be. The permitted resource numbers for ISA devices are: IRQ: 0-1 DRQ: 0-1 MEMORY: 0-3 IOPORT: 0-7 All the resources are represented as ranges, with a start value and count. For IRQ and DRQ resources the count would normally be equal to 1. The values for memory refer to the physical addresses. Three types of activities can be performed on resources: set/get allocate/release activate/deactivate Setting sets the range used by the resource. 
Allocation reserves the requested range so that no other driver can reserve it (after checking that no other driver has already reserved this range). Activation makes the resource accessible to the driver by doing whatever is necessary for that (for example, for memory it would be mapping into the kernel virtual address space). The functions to manipulate resources are: int bus_set_resource(device_t dev, int type, int rid, u_long start, u_long count) Set a range for a resource. Returns 0 if successful, error code otherwise. Normally, this function will return an error only if one of type, rid, start or count has a value that falls out of the permitted range. dev - driver's device type - type of resource, SYS_RES_* rid - resource number (ID) within type start, count - resource range int bus_get_resource(device_t dev, int type, int rid, u_long *startp, u_long *countp) Get the range of a resource. Returns 0 if successful, error code if the resource is not defined yet. u_long bus_get_resource_start(device_t dev, int type, int rid) u_long bus_get_resource_count(device_t dev, int type, int rid) Convenience functions to get only the start or count. Return 0 in case of error, so if the resource start has 0 among the legitimate values it would be impossible to tell if the value is 0 or an error occurred. Luckily, no ISA resources for add-on drivers may have a start value equal to 0. void bus_delete_resource(device_t dev, int type, int rid) Delete a resource, making it undefined. struct resource * bus_alloc_resource(device_t dev, int type, int *rid, u_long start, u_long end, u_long count, u_int flags) Allocate a resource as a range of count values not allocated by anyone else, somewhere between start and end. Alas, alignment is not supported. If the resource was not set yet it is automatically created.
The special values of start 0 and end ~0 (all ones) mean that the fixed values previously set by bus_set_resource() must be used instead: start and count as themselves and end=(start+count). In this case, if the resource was not defined before then an error is returned. Although rid is passed by reference it is not set anywhere by the resource allocation code of the ISA bus. (The other buses may use a different approach and modify it). Flags are a bitmap, the flags interesting for the caller are: RF_ACTIVE - causes the resource to be automatically activated after allocation. RF_SHAREABLE - resource may be shared at the same time by multiple drivers. RF_TIMESHARE - resource may be time-shared by multiple drivers, i.e., allocated at the same time by many but activated only by one at any given moment of time. Returns 0 on error. The allocated values may be obtained from the returned handle using the rman_get_*() methods. int bus_release_resource(device_t dev, int type, int rid, struct resource *r) Release the resource, r is the handle returned by bus_alloc_resource(). Returns 0 on success, error code otherwise. int bus_activate_resource(device_t dev, int type, int rid, struct resource *r) int bus_deactivate_resource(device_t dev, int type, int rid, struct resource *r) Activate or deactivate the resource. Return 0 on success, error code otherwise. If the resource is time-shared and currently activated by another driver then EBUSY is returned. int bus_setup_intr(device_t dev, struct resource *r, int flags, driver_intr_t *handler, void *arg, void **cookiep) int bus_teardown_intr(device_t dev, struct resource *r, void *cookie) Associate or disassociate the interrupt handler with a device. Return 0 on success, error code otherwise. r - the activated resource handler describing the IRQ flags - the interrupt priority level, one of: INTR_TYPE_TTY - terminals and other similar character-type devices. To mask them use spltty().
(INTR_TYPE_TTY | INTR_TYPE_FAST) - terminal type devices with small input buffer, critical to the data loss on input (such as the old-fashioned serial ports). To mask them use spltty(). INTR_TYPE_BIO - block-type devices, except those on the CAM controllers. To mask them use splbio(). INTR_TYPE_CAM - CAM (Common Access Method) bus controllers. To mask them use splcam(). INTR_TYPE_NET - network interface controllers. To mask them use splimp(). INTR_TYPE_MISC - miscellaneous devices. There is no other way to mask them than by splhigh() which masks all interrupts. When an interrupt handler executes all the other interrupts matching its priority level will be masked. The only exception is the MISC level for which no other interrupts are masked and which is not masked by any other interrupt. handler - pointer to the handler function, the type driver_intr_t is defined as void driver_intr_t(void *) arg - the argument passed to the handler to identify this particular device. It is cast from void* to any real type by the handler. The old convention for the ISA interrupt handlers was to use the unit number as argument, the new (recommended) convention is using a pointer to the device softc structure. cookie[p] - the value received from setup() is used to identify the handler when passed to teardown() A number of methods are defined to operate on the resource handlers (struct resource *). Those of interest to the device driver writers are: u_long rman_get_start(r) u_long rman_get_end(r) Get the start and end of allocated resource range. void *rman_get_virtual(r) Get the virtual address of activated memory resource. Bus Memory Mapping In many cases data is exchanged between the driver and the device through the memory. Two variants are possible: (a) memory is located on the device card (b) memory is the main memory of the computer In case (a) the driver always copies the data back and forth between the on-card memory and the main memory as necessary. 
To map the on-card memory into the kernel virtual address space the physical address and length of the on-card memory must be defined as a SYS_RES_MEMORY resource. That resource can then be allocated and activated, and its virtual address obtained using rman_get_virtual(). The older drivers used the function pmap_mapdev() for this purpose, which should not be used directly any more. Now it is one of the internal steps of resource activation. Most of the ISA cards will have their memory configured for physical location somewhere in range 640KB-1MB. Some of the ISA cards require larger memory ranges which should be placed somewhere under 16MB (because of the 24-bit address limitation on the ISA bus). In that case if the machine has more memory than the start address of the device memory (in other words, they overlap) a memory hole must be configured at the address range used by devices. Many BIOSes allow configuration of a memory hole of 1MB starting at 14MB or 15MB. FreeBSD can handle the memory holes properly if the BIOS reports them properly (this feature may be broken on old BIOSes). In case (b) just the address of the data is sent to the device, and the device uses DMA to actually access the data in the main memory. Two limitations are present: First, ISA cards can only access memory below 16MB. Second, the contiguous pages in virtual address space may not be contiguous in physical address space, so the device may have to do scatter/gather operations. The bus subsystem provides ready solutions for some of these problems, the rest has to be done by the drivers themselves. Two structures are used for DMA memory allocation, bus_dma_tag_t and bus_dmamap_t. Tag describes the properties required for the DMA memory. Map represents a memory block allocated according to these properties. Multiple maps may be associated with the same tag. Tags are organized into a tree-like hierarchy with inheritance of the properties. 
A child tag inherits all the requirements of its parent tag, and may make them stricter but never looser. Normally one top-level tag (with no parent) is created for each device unit. If multiple memory areas with different requirements are needed for each device then a tag for each of them may be created as a child of the parent tag. The tags can be used to create a map in two ways. First, a chunk of contiguous memory conformant with the tag requirements may be allocated (and later may be freed). This is normally used to allocate relatively long-lived areas of memory for communication with the device. Loading of such memory into a map is trivial: it is always considered as one chunk in the appropriate physical memory range. Second, an arbitrary area of virtual memory may be loaded into a map. Each page of this memory will be checked for conformance to the map requirement. If it conforms then it is left at its original location. If it does not then a fresh conformant bounce page is allocated and used as intermediate storage. When writing, the data from the non-conformant original pages is copied to their bounce pages first and then transferred from the bounce pages to the device. When reading, the data goes from the device to the bounce pages and is then copied to the non-conformant original pages. The process of copying between the original and bounce pages is called synchronization. This is normally used on a per-transfer basis: the buffer for each transfer is loaded, the transfer is done, and the buffer is unloaded. The functions working on the DMA memory are: int bus_dma_tag_create(bus_dma_tag_t parent, bus_size_t alignment, bus_size_t boundary, bus_addr_t lowaddr, bus_addr_t highaddr, bus_dma_filter_t *filter, void *filterarg, bus_size_t maxsize, int nsegments, bus_size_t maxsegsz, int flags, bus_dma_tag_t *dmat) Create a new tag. Returns 0 on success, the error code otherwise. parent - parent tag, or NULL to create a top-level tag.
alignment - required physical alignment of the memory area to be allocated for this tag. Use value 1 for no specific alignment. Applies only to the future bus_dmamem_alloc() but not bus_dmamap_create() calls. boundary - physical address boundary that must not be crossed when allocating the memory. Use value 0 for no boundary. Applies only to the future bus_dmamem_alloc() but not bus_dmamap_create() calls. Must be a power of 2. If the memory is planned to be used in non-cascaded DMA mode (i.e., the DMA addresses will be supplied not by the device itself but by the ISA DMA controller) then the boundary must be no larger than 64KB (64*1024) due to the limitations of the DMA hardware. lowaddr, highaddr - the names are slightly misleading; these values are used to limit the permitted range of physical addresses used to allocate the memory. The exact meaning varies depending on the planned future use: For bus_dmamem_alloc() all the addresses from 0 to lowaddr-1 are considered permitted, the higher ones are forbidden. For bus_dmamap_create() all the addresses outside the inclusive range [lowaddr; highaddr] are considered accessible. The addresses of pages inside the range are passed to the filter function which decides if they are accessible. If no filter function is supplied then the whole range is considered inaccessible. For the ISA devices the normal values (with no filter function) are: lowaddr = BUS_SPACE_MAXADDR_24BIT highaddr = BUS_SPACE_MAXADDR filter, filterarg - the filter function and its argument. If NULL is passed for filter then the whole range [lowaddr; highaddr] is considered inaccessible when doing bus_dmamap_create(). Otherwise the physical address of each attempted page in range [lowaddr; highaddr] is passed to the filter function which decides if it is accessible. The prototype of the filter function is: int filterfunc(void *arg, bus_addr_t paddr). It must return 0 if the page is accessible, non-zero otherwise.
maxsize - the maximal size of memory (in bytes) that may be allocated through this tag. In case it is difficult to estimate or could be arbitrarily big, the value for ISA devices would be BUS_SPACE_MAXSIZE_24BIT. nsegments - maximal number of scatter-gather segments supported by the device. If unrestricted then the value BUS_SPACE_UNRESTRICTED should be used. This value is recommended for the parent tags, the actual restrictions would then be specified for the descendant tags. Tags with nsegments equal to BUS_SPACE_UNRESTRICTED may not be used to actually load maps, they may be used only as parent tags. The practical limit for nsegments seems to be about 250-300, higher values will cause kernel stack overflow (the hardware can not normally support that many scatter-gather buffers anyway). maxsegsz - maximal size of a scatter-gather segment supported by the device. The maximal value for ISA device would be BUS_SPACE_MAXSIZE_24BIT. flags - a bitmap of flags. The only interesting flags are: BUS_DMA_ALLOCNOW - requests to allocate all the potentially needed bounce pages when creating the tag. BUS_DMA_ISA - mysterious flag used only on Alpha machines. It is not defined for the i386 machines. Probably it should be used by all the ISA drivers for Alpha machines but it looks like there are no such drivers yet. dmat - pointer to the storage for the new tag to be returned. int bus_dma_tag_destroy(bus_dma_tag_t dmat) Destroy a tag. Returns 0 on success, the error code otherwise. dmat - the tag to be destroyed. int bus_dmamem_alloc(bus_dma_tag_t dmat, void** vaddr, int flags, bus_dmamap_t *mapp) Allocate an area of contiguous memory described by the tag. The size of memory to be allocated is tag's maxsize. Returns 0 on success, the error code otherwise. The result still has to be loaded by bus_dmamap_load() before being used to get the physical address of the memory. 
dmat - the tag vaddr - pointer to the storage for the kernel virtual address of the allocated area to be returned. flags - a bitmap of flags. The only interesting flag is: BUS_DMA_NOWAIT - if the memory is not immediately available return the error. If this flag is not set then the routine is allowed to sleep until the memory becomes available. mapp - pointer to the storage for the new map to be returned. void bus_dmamem_free(bus_dma_tag_t dmat, void *vaddr, bus_dmamap_t map) Free the memory allocated by bus_dmamem_alloc(). At present, freeing of the memory allocated with ISA restrictions is - not implemented. Because of this the recommended model + not implemented. Due to this the recommended model of use is to keep and re-use the allocated areas for as long as possible. Do not lightly free some area and then shortly allocate it again. That does not mean that bus_dmamem_free() should not be used at all: hopefully it will be properly implemented soon. dmat - the tag vaddr - the kernel virtual address of the memory map - the map of the memory (as returned from bus_dmamem_alloc()) int bus_dmamap_create(bus_dma_tag_t dmat, int flags, bus_dmamap_t *mapp) Create a map for the tag, to be used in bus_dmamap_load() later. Returns 0 on success, the error code otherwise. dmat - the tag flags - theoretically, a bit map of flags. But no flags are defined yet, so at present it will be always 0. mapp - pointer to the storage for the new map to be returned int bus_dmamap_destroy(bus_dma_tag_t dmat, bus_dmamap_t map) Destroy a map. Returns 0 on success, the error code otherwise. dmat - the tag to which the map is associated map - the map to be destroyed int bus_dmamap_load(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, bus_size_t buflen, bus_dmamap_callback_t *callback, void *callback_arg, int flags) Load a buffer into the map (the map must be previously created by bus_dmamap_create() or bus_dmamem_alloc()). 
All the pages of the buffer are checked for conformance to the tag requirements and bounce pages are allocated for those that do not conform. An array of physical segment descriptors is built and passed to the callback routine. This callback routine is then expected to handle it in some way. The number of bounce buffers in the system is limited, so if the bounce buffers are needed but not immediately available the request will be queued and the callback will be called when the bounce buffers become available. Returns 0 if the callback was executed immediately or EINPROGRESS if the request was queued for future execution. In the latter case the synchronization with the queued callback routine is the responsibility of the driver. dmat - the tag map - the map buf - kernel virtual address of the buffer buflen - length of the buffer callback, callback_arg - the callback function and its argument The prototype of the callback function is: void callback(void *arg, bus_dma_segment_t *seg, int nseg, int error) arg - the same as callback_arg passed to bus_dmamap_load() seg - array of the segment descriptors nseg - number of descriptors in array error - indication of the segment number overflow: if it is set to EFBIG then the buffer did not fit into the maximal number of segments permitted by the tag. In this case only the permitted number of descriptors will be in the array. Handling of this situation is up to the driver: depending on the desired semantics it can either consider this an error or split the buffer in two and handle the second part separately. Each entry in the segments array contains the fields: ds_addr - physical bus address of the segment ds_len - length of the segment void bus_dmamap_unload(bus_dma_tag_t dmat, bus_dmamap_t map) Unload the map. dmat - tag map - loaded map void bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t map, bus_dmasync_op_t op) Synchronize a loaded buffer with its bounce pages before and after physical transfer to or from the device.
This is the function that does all the necessary copying of data between the original buffer and its mapped version. The buffers must be synchronized both before and after doing the transfer. dmat - tag map - loaded map op - type of synchronization operation to perform: BUS_DMASYNC_PREREAD - before reading from device into buffer BUS_DMASYNC_POSTREAD - after reading from device into buffer BUS_DMASYNC_PREWRITE - before writing the buffer to device BUS_DMASYNC_POSTWRITE - after writing the buffer to device As of now PREREAD and POSTWRITE are null operations but that may change in the future, so they must not be ignored in the driver. Synchronization is not needed for the memory obtained from bus_dmamem_alloc(). Before calling the callback function from bus_dmamap_load() the segment array is stored in the stack. And it gets pre-allocated for the - maximal number of segments allowed by the tag. Because of + maximal number of segments allowed by the tag. As a result of this the practical limit for the number of segments on i386 architecture is about 250-300 (the kernel stack is 4KB minus the size of the user structure, size of a segment array - entry is 8 bytes, and some space must be left). Because the + entry is 8 bytes, and some space must be left). Since the array is allocated based on the maximal number this value must not be set higher than really needed. Fortunately, for most of hardware the maximal supported number of segments is much lower. But if the driver wants to handle buffers with a very large number of scatter-gather segments it should do that in portions: load part of the buffer, transfer it to the device, load next part of the buffer, and so on. Another practical consequence is that the number of segments may limit the size of the buffer. If all the pages in the buffer happen to be physically non-contiguous then the maximal supported buffer size for that fragmented case would be (nsegments * page_size). 
For example, if a maximal number of 10 segments is supported then on i386 the maximal guaranteed supported buffer size would be 40K. If a higher size is desired then special tricks should be used in the driver. If the hardware does not support scatter-gather at all or the driver wants to support some buffer size even if it is heavily fragmented then the solution is to allocate a contiguous buffer in the driver and use it as intermediate storage if the original buffer does not fit. The typical call sequences for a map depend on how the map is used. The characters -> are used to show the flow of time. For a buffer which stays practically fixed during all the time between attachment and detachment of a device: bus_dmamem_alloc -> bus_dmamap_load -> ...use buffer... -> -> bus_dmamap_unload -> bus_dmamem_free For a buffer that changes frequently and is passed from outside the driver: bus_dmamap_create -> -> bus_dmamap_load -> bus_dmamap_sync(PRE...) -> do transfer -> -> bus_dmamap_sync(POST...) -> bus_dmamap_unload -> ... -> bus_dmamap_load -> bus_dmamap_sync(PRE...) -> do transfer -> -> bus_dmamap_sync(POST...) -> bus_dmamap_unload -> -> bus_dmamap_destroy When loading a map created by bus_dmamem_alloc() the passed address and size of the buffer must be the same as used in bus_dmamem_alloc(). In this case it is guaranteed that the whole buffer will be mapped as one segment (so the callback may be based on this assumption) and the request will be executed immediately (EINPROGRESS will never be returned). All the callback needs to do in this case is to save the physical address. A typical example would be: static void alloc_callback(void *arg, bus_dma_segment_t *seg, int nseg, int error) { *(bus_addr_t *)arg = seg[0].ds_addr; } ... int error; struct somedata { .... }; struct somedata *vsomedata; /* virtual address */ bus_addr_t psomedata; /* physical bus-relative address */ bus_dma_tag_t tag_somedata; bus_dmamap_t map_somedata; ... 
error=bus_dma_tag_create(parent_tag, alignment, boundary, lowaddr, highaddr, /*filter*/ NULL, /*filterarg*/ NULL, /*maxsize*/ sizeof(struct somedata), /*nsegments*/ 1, /*maxsegsz*/ sizeof(struct somedata), /*flags*/ 0, &tag_somedata); if(error) return error; error = bus_dmamem_alloc(tag_somedata, &vsomedata, /* flags*/ 0, &map_somedata); if(error) return error; bus_dmamap_load(tag_somedata, map_somedata, (void *)vsomedata, sizeof (struct somedata), alloc_callback, (void *) &psomedata, /*flags*/0); Looks a bit long and complicated but that is the way to do it. The practical consequence is: if multiple memory areas are allocated always together it would be a really good idea to combine them all into one structure and allocate as one (if the alignment and boundary limitations permit). When loading an arbitrary buffer into the map created by bus_dmamap_create() special measures must be taken to synchronize with the callback in case it would be delayed. The code would look like: { int s; int error; s = splsoftvm(); error = bus_dmamap_load( dmat, dmamap, buffer_ptr, buffer_len, callback, /*callback_arg*/ buffer_descriptor, /*flags*/0); if (error == EINPROGRESS) { /* * Do whatever is needed to ensure synchronization * with callback. Callback is guaranteed not to be started * until we do splx() or tsleep(). */ } splx(s); } Two possible approaches for the processing of requests are: 1. If requests are completed by marking them explicitly as done (such as the CAM requests) then it would be simpler to put all the further processing into the callback driver which would mark the request when it is done. Then not much extra synchronization is needed. For the flow control reasons it may be a good idea to freeze the request queue until this request gets completed. 2. If requests are completed when the function returns (such as classic read or write requests on character devices) then a synchronization flag should be set in the buffer descriptor and tsleep() called. 
Later when the callback gets called it will do its processing and check this synchronization flag. If it is set then the callback should issue a wakeup. In this approach the callback function could either do all the needed processing (just like the previous case) or simply save the segments array in the buffer descriptor. Then after callback completes the calling function could use this saved segments array and do all the processing. DMA Direct Memory Access (DMA) The Direct Memory Access (DMA) is implemented in the ISA bus through the DMA controller (actually, two of them but that is an irrelevant detail). To make the early ISA devices simple and cheap the logic of the bus control and address generation was concentrated in the DMA controller. Fortunately, FreeBSD provides a set of functions that mostly hide the annoying details of the DMA controller from the device drivers. The simplest case is for the fairly intelligent devices. Like the bus master devices on PCI they can generate the bus cycles and memory addresses all by themselves. The only thing they really need from the DMA controller is bus arbitration. So for this purpose they pretend to be cascaded slave DMA controllers. And the only thing needed from the system DMA controller is to enable the cascaded mode on a DMA channel by calling the following function when attaching the driver: void isa_dmacascade(int channel_number) All the further activity is done by programming the device. When detaching the driver no DMA-related functions need to be called. For the simpler devices things get more complicated. The functions used are: int isa_dma_acquire(int chanel_number) Reserve a DMA channel. Returns 0 on success or EBUSY if the channel was already reserved by this or a different driver. Most of the ISA devices are not able to share DMA channels anyway, so normally this function is called when attaching a device. 
This reservation was made redundant by the modern interface of bus resources but still must be used in addition to the latter. If it is not used, other DMA routines will panic later. int isa_dma_release(int chanel_number) Release a previously reserved DMA channel. No transfers must be in progress when the channel is released (in addition the device must not try to initiate transfer after the channel is released). void isa_dmainit(int chan, u_int bouncebufsize) Allocate a bounce buffer for use with the specified channel. The requested size of the buffer cannot exceed 64KB. This bounce buffer will be automatically used later if a transfer buffer happens to be not physically contiguous or outside of the memory accessible by the ISA bus or crossing the 64KB boundary. If the transfers will be always done from buffers which conform to these conditions (such as those allocated by bus_dmamem_alloc() with proper limitations) then isa_dmainit() does not have to be called. But it is quite convenient to transfer arbitrary data using the DMA controller. The bounce buffer will automatically take care of the scatter-gather issues. chan - channel number bouncebufsize - size of the bounce buffer in bytes void isa_dmastart(int flags, caddr_t addr, u_int nbytes, int chan) Prepare to start a DMA transfer. This function must be called to set up the DMA controller before actually starting transfer on the device. It checks that the buffer is contiguous and falls into the ISA memory range; if not, the bounce buffer is used automatically. If a bounce buffer is required but was not set up by isa_dmainit() or is too small for the requested transfer size then the system will panic. In case of a write request with a bounce buffer the data will be automatically copied to the bounce buffer. flags - a bitmask determining the type of operation to be done. The direction bits B_READ and B_WRITE are mutually exclusive. 
B_READ - read from the ISA bus into memory B_WRITE - write from the memory to the ISA bus B_RAW - if set then the DMA controller will remember the buffer and after the end of transfer will automatically re-initialize itself to repeat transfer of the same buffer again (of course, the driver may change the data in the buffer before initiating another transfer in the device). If not set then the parameters will work only for one transfer, and isa_dmastart() will have to be called again before initiating the next transfer. Using B_RAW makes sense only if the bounce buffer is not used. addr - virtual address of the buffer nbytes - length of the buffer. Must be less or equal to 64KB. Length of 0 is not allowed: the DMA controller will understand it as 64KB while the kernel code will understand it as 0 and that would cause unpredictable effects. For channels number 4 and higher the length must be even because these channels transfer 2 bytes at a time. In case of an odd length the last byte will not be transferred. chan - channel number void isa_dmadone(int flags, caddr_t addr, int nbytes, int chan) Synchronize the memory after device reports that transfer is done. If that was a read operation with a bounce buffer then the data will be copied from the bounce buffer to the original buffer. Arguments are the same as for isa_dmastart(). Flag B_RAW is permitted but it does not affect isa_dmadone() in any way. int isa_dmastatus(int channel_number) Returns the number of bytes left in the current transfer to be transferred. In case the flag B_READ was set in isa_dmastart() the number returned will never be equal to zero. At the end of transfer it will be automatically reset back to the length of buffer. The normal use is to check the number of bytes left after the device signals that the transfer is completed. If the number of bytes is not 0 then something probably went wrong with that transfer. 
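The length rules stated for isa_dmastart() above (non-zero, at most 64KB, even for channels 4 and higher) can be collected into one check that a driver runs before starting a transfer. The following is only a sketch: xxx_dma_len_ok() and XXX_ISA_DMA_MAX_BYTES are hypothetical names invented for illustration and are not part of the kernel interface.

```c
#include <assert.h>

/* Hypothetical constant: the 64KB limit quoted in the text. */
#define XXX_ISA_DMA_MAX_BYTES	(64u * 1024u)

/*
 * Hypothetical helper, not a kernel function: returns 1 if nbytes is an
 * acceptable transfer length for isa_dmastart() on channel chan, else 0.
 */
static int
xxx_dma_len_ok(int chan, unsigned int nbytes)
{
	assert(chan >= 0 && chan <= 7);	/* ISA provides DMA channels 0-7 */

	if (nbytes == 0)		/* the controller would read 0 as 64KB */
		return 0;
	if (nbytes > XXX_ISA_DMA_MAX_BYTES)
		return 0;
	if (chan >= 4 && (nbytes & 1) != 0)	/* these channels move 2 bytes at a time */
		return 0;
	return 1;
}
```

A driver could call such a check just before isa_dmastart() and fail the request with an error instead of risking the unpredictable effects described above.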
int isa_dmastop(int channel_number) Aborts the current transfer and returns the number of bytes left untransferred. xxx_isa_probe This function probes if a device is present. If the driver supports auto-detection of some part of device configuration (such as interrupt vector or memory address) this auto-detection must be done in this routine. As for any other bus, if the device cannot be detected or is detected but failed the self-test or some other problem happened then it returns a positive value of error. The value ENXIO must be returned if the device is not present. Other error values may mean other conditions. Zero or negative values mean success. Most of the drivers return zero as success. The negative return values are used when a PnP device supports multiple interfaces. For example, an older compatibility interface and a newer advanced interface which are supported by different drivers. Then both drivers would detect the device. The driver which returns a higher value in the probe routine takes precedence (in other words, the driver returning 0 has highest precedence, one returning -1 is next, one returning -2 is after it and so on). In result the devices which support only the old interface will be handled by the old driver (which should return -1 from the probe routine) while the devices supporting the new interface as well will be handled by the new driver (which should return 0 from the probe routine). The device descriptor struct xxx_softc is allocated by the system before calling the probe routine. If the probe routine returns an error the descriptor will be automatically deallocated by the system. So if a probing error occurs the driver must make sure that all the resources it used during probe are deallocated and that nothing keeps the descriptor from being safely deallocated. If the probe completes successfully the descriptor will be preserved by the system and later passed to the routine xxx_isa_attach(). 
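The precedence rule for probe return values described above (positive means failure, and among zero or negative values the highest one wins) can be illustrated with a toy model. This is a user-space sketch only: xxx_pick_driver() is a hypothetical function written for this illustration, it is not how the bus code is implemented, and the tie-breaking rule (the first candidate wins) is an arbitrary assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Toy model of probe arbitration. probe_results[i] is the value returned
 * by the probe routine of the i-th candidate driver. Returns the index of
 * the winning driver, or -1 if every probe returned a positive error.
 */
static int
xxx_pick_driver(const int *probe_results, size_t n)
{
	int best = -1;
	size_t i;

	assert(probe_results != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (probe_results[i] > 0)	/* positive value: probe failed */
			continue;
		/* keep the highest non-positive value seen so far */
		if (best < 0 || probe_results[i] > probe_results[best])
			best = (int)i;
	}
	return best;
}
```

With an old driver returning -1 and a new driver returning 0 for the same device, the model selects the new driver; if the new driver returns ENXIO because the device lacks the new interface, the old driver is selected instead.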
If a driver returns a negative value it cannot be sure that it will have the highest priority and its attach routine will be called. So in this case it also must release all the resources before returning and if necessary allocate them again in the attach routine. When xxx_isa_probe() returns 0 releasing the resources before returning is also a good idea and a well-behaved driver should do so. But in cases where there is some problem with releasing the resources the driver is allowed to keep resources between returning 0 from the probe routine and execution of the attach routine. A typical probe routine starts with getting the device descriptor and unit: struct xxx_softc *sc = device_get_softc(dev); int unit = device_get_unit(dev); int pnperror; int error = 0; sc->dev = dev; /* link it back */ sc->unit = unit; Then check for the PnP devices. The check is carried out by a table containing the list of PnP IDs supported by this driver and human-readable descriptions of the device models corresponding to these IDs. pnperror=ISA_PNP_PROBE(device_get_parent(dev), dev, xxx_pnp_ids); if(pnperror == ENXIO) return ENXIO; The logic of ISA_PNP_PROBE is the following: If this card (device unit) was not detected as PnP then ENOENT will be returned. If it was detected as PnP but its detected ID does not match any of the IDs in the table then ENXIO is returned. Finally, if it has PnP support and it matches one of the IDs in the table, 0 is returned and the appropriate description from the table is set by device_set_desc(). If a driver supports only PnP devices then the condition would look like: if(pnperror != 0) return pnperror; No special treatment is required for the drivers which do not support PnP because they pass an empty PnP ID table and will always get ENXIO if called on a PnP card. The probe routine normally needs at least some minimal set of resources, such as I/O port number to find the card and probe it. 
Depending on the hardware the driver may be able to discover the other necessary resources automatically. The PnP devices have all the resources pre-set by the PnP subsystem, so the driver does not need to discover them by itself. Typically the minimal information required to get access to the device is the I/O port number. Some devices then allow the rest of the information to be read from the device configuration registers (though not all devices do that). So first we try to get the port start value: sc->port0 = bus_get_resource_start(dev, SYS_RES_IOPORT, 0 /*rid*/); if(sc->port0 == 0) return ENXIO; The base port address is saved in the structure softc for future use. If it will be used very often then calling the resource function each time would be prohibitively slow. If we do not get a port we just return an error. Some device drivers can instead be clever and try to probe all the possible ports, like this: /* table of all possible base I/O port addresses for this device */ static struct xxx_allports { u_short port; /* port address */ short used; /* flag: if this port is already used by some unit */ } xxx_allports[] = { { 0x300, 0 }, { 0x320, 0 }, { 0x340, 0 }, { 0, 0 } /* end of table */ }; ... int port, i; ... 
port = bus_get_resource_start(dev, SYS_RES_IOPORT, 0 /*rid*/); if(port !=0 ) { for(i=0; xxx_allports[i].port!=0; i++) { if(xxx_allports[i].used || xxx_allports[i].port != port) continue; /* found it */ xxx_allports[i].used = 1; /* do probe on a known port */ return xxx_really_probe(dev, port); } return ENXIO; /* port is unknown or already used */ } /* we get here only if we need to guess the port */ for(i=0; xxx_allports[i].port!=0; i++) { if(xxx_allports[i].used) continue; /* mark as used - even if we find nothing at this port * at least we won't probe it in future */ xxx_allports[i].used = 1; error = xxx_really_probe(dev, xxx_allports[i].port); if(error == 0) /* found a device at that port */ return 0; } /* probed all possible addresses, none worked */ return ENXIO; Of course, normally the driver's identify() routine should be used for such things. But there may be one valid reason why it may be better to be done in probe(): if this probe would drive some other sensitive device crazy. The probe routines are ordered with consideration of the sensitive flag: the sensitive devices get probed first and the rest of the devices later. But the identify() routines are called before any probes, so they show no respect to the sensitive devices and may upset them. Now, after we got the starting port we need to set the port count (except for PnP devices) because the kernel does not have this information in the configuration file. 
if(pnperror /* only for non-PnP devices */ && bus_set_resource(dev, SYS_RES_IOPORT, 0, sc->port0, XXX_PORT_COUNT)<0) return ENXIO; Finally allocate and activate a piece of port address space (special values of start and end mean use those we set by bus_set_resource()): sc->port0_rid = 0; sc->port0_r = bus_alloc_resource(dev, SYS_RES_IOPORT, &sc->port0_rid, /*start*/ 0, /*end*/ ~0, /*count*/ 0, RF_ACTIVE); if(sc->port0_r == NULL) return ENXIO; Now having access to the port-mapped registers we can poke the device in some way and check if it reacts like it is expected to. If it does not then there is probably some other device or no device at all at this address. Normally drivers do not set up the interrupt handlers until the attach routine. Instead they do probes in the polling mode using the DELAY() function for timeout. The probe routine must never hang forever, all the waits for the device must be done with timeouts. If the device does not respond within the time it is probably broken or misconfigured and the driver must return error. When determining the timeout interval give the device some extra time to be on the safe side: although DELAY() is supposed to delay for the same amount of time on any machine it has some margin of error, depending on the exact CPU. If the probe routine really wants to check that the interrupts really work it may configure and probe the interrupts too. But that is not recommended. /* implemented in some very device-specific way */ if(error = xxx_probe_ports(sc)) goto bad; /* will deallocate the resources before returning */ The function xxx_probe_ports() may also set the device description depending on the exact model of device it discovers. But if there is only one supported device model this can be as well done in a hardcoded way. Of course, for the PnP devices the PnP support sets the description from the table automatically. 
if(pnperror) device_set_desc(dev, "Our device model 1234"); Then the probe routine should either discover the ranges of all the resources by reading the device configuration registers or make sure that they were set explicitly by the user. We will consider it with an example of on-board memory. The probe routine should be as non-intrusive as possible, so allocation and check of functionality of the rest of resources (besides the ports) would be better left to the attach routine. The memory address may be specified in the kernel configuration file or on some devices it may be pre-configured in non-volatile configuration registers. If both sources are available and different, which one should be used? Probably if the user bothered to set the address explicitly in the kernel configuration file they know what they are doing and this one should take precedence. An example of implementation could be: /* try to find out the config address first */ sc->mem0_p = bus_get_resource_start(dev, SYS_RES_MEMORY, 0 /*rid*/); if(sc->mem0_p == 0) { /* nope, not specified by user */ sc->mem0_p = xxx_read_mem0_from_device_config(sc); if(sc->mem0_p == 0) /* can't get it from device config registers either */ goto bad; } else { if(xxx_set_mem0_address_on_device(sc) < 0) goto bad; /* device does not support that address */ } /* just like the port, set the memory size, * for some devices the memory size would not be constant * but should be read from the device configuration registers instead * to accommodate different models of devices. Another option would * be to let the user set the memory size as "msize" configuration * resource which will be automatically handled by the ISA bus. 
*/ if(pnperror) { /* only for non-PnP devices */ sc->mem0_size = bus_get_resource_count(dev, SYS_RES_MEMORY, 0 /*rid*/); if(sc->mem0_size == 0) /* not specified by user */ sc->mem0_size = xxx_read_mem0_size_from_device_config(sc); if(sc->mem0_size == 0) { /* suppose this is a very old model of device without * auto-configuration features and the user gave no preference, * so assume the minimalistic case * (of course, the real value will vary with the driver) */ sc->mem0_size = 8*1024; } if(xxx_set_mem0_size_on_device(sc) < 0) goto bad; /* device does not support that size */ if(bus_set_resource(dev, SYS_RES_MEMORY, /*rid*/0, sc->mem0_p, sc->mem0_size)<0) goto bad; } else { sc->mem0_size = bus_get_resource_count(dev, SYS_RES_MEMORY, 0 /*rid*/); } Resources for IRQ and DRQ are easy to check by analogy. If all went well then release all the resources and return success. xxx_free_resources(sc); return 0; Finally, handle the troublesome situations. All the resources should be deallocated before returning. We make use of the fact that before the structure softc is passed to us it gets zeroed out, so we can find out if some resource was allocated: then its descriptor is non-zero. bad: xxx_free_resources(sc); if(error) return error; else /* exact error is unknown */ return ENXIO; That would be all for the probe routine. 
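The cleanup idiom used at the bad: label relies on the descriptor starting out zeroed, so that a non-zero field means "allocated". A user-space sketch of the same idea follows, assuming malloc() as a stand-in for bus resources; struct xxx_demo_softc, xxx_demo_probe() and xxx_demo_free_resources() are hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a driver descriptor. */
struct xxx_demo_softc {
	void *port0_r;	/* plays the role of the port resource */
	void *mem0_r;	/* plays the role of the memory resource */
};

/* Free whatever is non-zero and zero it again, so repeated calls are safe. */
static void
xxx_demo_free_resources(struct xxx_demo_softc *sc)
{
	assert(sc != NULL);
	if (sc->mem0_r != NULL) {
		free(sc->mem0_r);
		sc->mem0_r = NULL;
	}
	if (sc->port0_r != NULL) {
		free(sc->port0_r);
		sc->port0_r = NULL;
	}
}

/* Simulated probe; fail_at_mem forces a failure after the ports were taken. */
static int
xxx_demo_probe(struct xxx_demo_softc *sc, int fail_at_mem)
{
	memset(sc, 0, sizeof(*sc));	/* mimics the zeroed descriptor */

	sc->port0_r = malloc(16);
	if (sc->port0_r == NULL)
		goto bad;
	if (fail_at_mem)
		goto bad;		/* mem0_r is still zero here */
	sc->mem0_r = malloc(16);
	if (sc->mem0_r == NULL)
		goto bad;

	xxx_demo_free_resources(sc);	/* success path releases everything too */
	return 0;
bad:
	xxx_demo_free_resources(sc);	/* skips anything never allocated */
	return ENXIO;
}
```

Because every field is reset after it is freed, the same cleanup function can be reached from any failure point, and calling it a second time does nothing.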
Freeing of resources is done from multiple places, so it is moved to a function which may look like: static void xxx_free_resources(sc) struct xxx_softc *sc; { /* check every resource and free if not zero */ /* interrupt handler */ if(sc->intr_r) { bus_teardown_intr(sc->dev, sc->intr_r, sc->intr_cookie); bus_release_resource(sc->dev, SYS_RES_IRQ, sc->intr_rid, sc->intr_r); sc->intr_r = 0; } /* all kinds of memory maps we could have allocated */ if(sc->data_p) { bus_dmamap_unload(sc->data_tag, sc->data_map); sc->data_p = 0; } if(sc->data) { /* sc->data_map may be legitimately equal to 0 */ /* the map will also be freed */ bus_dmamem_free(sc->data_tag, sc->data, sc->data_map); sc->data = 0; } if(sc->data_tag) { bus_dma_tag_destroy(sc->data_tag); sc->data_tag = 0; } ... free other maps and tags if we have them ... if(sc->parent_tag) { bus_dma_tag_destroy(sc->parent_tag); sc->parent_tag = 0; } /* release all the bus resources */ if(sc->mem0_r) { bus_release_resource(sc->dev, SYS_RES_MEMORY, sc->mem0_rid, sc->mem0_r); sc->mem0_r = 0; } ... if(sc->port0_r) { bus_release_resource(sc->dev, SYS_RES_IOPORT, sc->port0_rid, sc->port0_r); sc->port0_r = 0; } } xxx_isa_attach The attach routine actually connects the driver to the system if the probe routine returned success and the system had chosen to attach that driver. If the probe routine returned 0 then the attach routine may expect to receive the device structure softc intact, as it was set by the probe routine. Also if the probe routine returns 0 it may expect that the attach routine for this device shall be called at some point in the future. If the probe routine returns a negative value then the driver may make none of these assumptions. The attach routine returns 0 if it completed successfully or error code otherwise. The attach routine starts just like the probe routine, with getting some frequently used data into more accessible variables. 
struct xxx_softc *sc = device_get_softc(dev); int unit = device_get_unit(dev); int error = 0; Then allocate and activate all the necessary - resources. Because normally the port range will be released + resources. As normally the port range will be released before returning from probe, it has to be allocated again. We expect that the probe routine had properly set all the resource ranges, as well as saved them in the structure softc. If the probe routine had left some resource allocated then it does not need to be allocated again (which would be considered an error). sc->port0_rid = 0; sc->port0_r = bus_alloc_resource(dev, SYS_RES_IOPORT, &sc->port0_rid, /*start*/ 0, /*end*/ ~0, /*count*/ 0, RF_ACTIVE); if(sc->port0_r == NULL) return ENXIO; /* on-board memory */ sc->mem0_rid = 0; sc->mem0_r = bus_alloc_resource(dev, SYS_RES_MEMORY, &sc->mem0_rid, /*start*/ 0, /*end*/ ~0, /*count*/ 0, RF_ACTIVE); if(sc->mem0_r == NULL) goto bad; /* get its virtual address */ sc->mem0_v = rman_get_virtual(sc->mem0_r); The DMA request channel (DRQ) is allocated likewise. To initialize it use functions of the isa_dma*() family. For example: isa_dmacascade(sc->drq0); The interrupt request line (IRQ) is a bit special. Besides allocation the driver's interrupt handler should be associated with it. Historically in the old ISA drivers the argument passed by the system to the interrupt handler was the device unit number. But in modern drivers the convention suggests passing the pointer to structure softc. The important reason is that when the structures softc are allocated dynamically then getting the unit number from softc is easy while getting softc from the unit number is difficult. Also this convention makes the drivers for different buses look more uniform and allows them to share the code: each bus gets its own probe, attach, detach and other bus-specific routines while the bulk of the driver code may be shared among them. 
sc->intr_rid = 0; sc->intr_r = bus_alloc_resource(dev, SYS_RES_IRQ, &sc->intr_rid, /*start*/ 0, /*end*/ ~0, /*count*/ 0, RF_ACTIVE); if(sc->intr_r == NULL) goto bad; /* * XXX_INTR_TYPE is supposed to be defined depending on the type of * the driver, for example as INTR_TYPE_CAM for a CAM driver */ error = bus_setup_intr(dev, sc->intr_r, XXX_INTR_TYPE, (driver_intr_t *) xxx_intr, (void *) sc, &sc->intr_cookie); if(error) goto bad; If the device needs to do DMA to the main memory then this memory should be allocated as described before: error=bus_dma_tag_create(NULL, /*alignment*/ 4, /*boundary*/ 0, /*lowaddr*/ BUS_SPACE_MAXADDR_24BIT, /*highaddr*/ BUS_SPACE_MAXADDR, /*filter*/ NULL, /*filterarg*/ NULL, /*maxsize*/ BUS_SPACE_MAXSIZE_24BIT, /*nsegments*/ BUS_SPACE_UNRESTRICTED, /*maxsegsz*/ BUS_SPACE_MAXSIZE_24BIT, /*flags*/ 0, &sc->parent_tag); if(error) goto bad; /* many things get inherited from the parent tag * sc->data is supposed to point to the structure with the shared data, * for example for a ring buffer it could be: * struct { * u_short rd_pos; * u_short wr_pos; * char bf[XXX_RING_BUFFER_SIZE]; * } *data; */ error=bus_dma_tag_create(sc->parent_tag, 1, 0, BUS_SPACE_MAXADDR, 0, /*filter*/ NULL, /*filterarg*/ NULL, /*maxsize*/ sizeof(* sc->data), /*nsegments*/ 1, /*maxsegsz*/ sizeof(* sc->data), /*flags*/ 0, &sc->data_tag); if(error) goto bad; error = bus_dmamem_alloc(sc->data_tag, &sc->data, /* flags*/ 0, &sc->data_map); if(error) goto bad; /* xxx_alloc_callback() just saves the physical address at * the pointer passed as its argument, in this case &sc->data_p. * See details in the section on bus memory mapping. 
* It can be implemented like: * * static void * xxx_alloc_callback(void *arg, bus_dma_segment_t *seg, * int nseg, int error) * { * *(bus_addr_t *)arg = seg[0].ds_addr; * } */ bus_dmamap_load(sc->data_tag, sc->data_map, (void *)sc->data, sizeof (* sc->data), xxx_alloc_callback, (void *) &sc->data_p, /*flags*/0); After all the necessary resources are allocated the device should be initialized. The initialization may include testing that all the expected features are functional. if(xxx_initialize(sc) < 0) goto bad; The bus subsystem will automatically print on the console the device description set by probe. But if the driver wants to print some extra information about the device it may do so, for example: device_printf(dev, "has on-card FIFO buffer of %d bytes\n", sc->fifosize); If the initialization routine experiences any problems then printing messages about them before returning error is also recommended. The final step of the attach routine is attaching the device to its functional subsystem in the kernel. The exact way to do it depends on the type of the driver: a character device, a block device, a network device, a CAM SCSI bus device and so on. If all went well then return success. error = xxx_attach_subsystem(sc); if(error) goto bad; return 0; Finally, handle the troublesome situations. All the resources should be deallocated before returning an error. We make use of the fact that before the structure softc is passed to us it gets zeroed out, so we can find out if some resource was allocated: then its descriptor is non-zero. bad: xxx_free_resources(sc); if(error) return error; else /* exact error is unknown */ return ENXIO; That would be all for the attach routine. xxx_isa_detach If this function is present in the driver and the driver is compiled as a loadable module then the driver gets the ability to be unloaded. This is an important feature if the hardware supports hot plug. 
But the ISA bus does not support hot plug, so this feature is not particularly important for the ISA devices. The ability to unload a driver may be useful when debugging it, but in many cases installation of the new version of the driver would be required only after the old version somehow wedges the system and a reboot will be needed anyway, so the efforts spent on writing the detach routine may not be worth it. Another argument that unloading would allow upgrading the drivers on a production machine seems to be mostly theoretical. Installing a new version of a driver is a dangerous operation which should never be performed on a production machine (and which is not permitted when the system is running in secure mode). Still, the detach routine may be provided for the sake of completeness. The detach routine returns 0 if the driver was successfully detached or the error code otherwise. The logic of detach is a mirror of the attach. The first thing to do is to detach the driver from its kernel subsystem. If the device is currently open then the driver has two choices: refuse to be detached or forcibly close and proceed with detach. The choice used depends on the ability of the particular kernel subsystem to do a forced close and on the preferences of the driver's author. Generally the forced close seems to be the preferred alternative. struct xxx_softc *sc = device_get_softc(dev); int error; error = xxx_detach_subsystem(sc); if(error) return error; Next the driver may want to reset the hardware to some consistent state. That includes stopping any ongoing transfers, disabling the DMA channels and interrupts to avoid memory corruption by the device. For most of the drivers this is exactly what the shutdown routine does, so if it is included in the driver we can just call it. xxx_isa_shutdown(dev); And finally release all the resources and return success. xxx_free_resources(sc); return 0; xxx_isa_shutdown This routine is called when the system is about to be shut down. 
It is expected to bring the hardware to some consistent state. For most of the ISA devices no special action is required, so the function is not really necessary because the device will be re-initialized on reboot anyway. But some devices have to be shut down with a special procedure, to make sure that they will be properly detected after soft reboot (this is especially true for many devices with proprietary identification protocols). In any case disabling DMA and interrupts in the device registers and stopping any ongoing transfers is a good idea. The exact action depends on the hardware, so we do not consider it here in any detail. xxx_intr interrupt handler The interrupt handler is called when an interrupt is received which may be from this particular device. The ISA bus does not support interrupt sharing (except in some special cases) so in practice if the interrupt handler is called then the interrupt almost for sure came from its device. Still, the interrupt handler must poll the device registers and make sure that the interrupt was generated by its device. If not it should just return. The old convention for the ISA drivers was getting the device unit number as an argument. This is obsolete, and the new drivers receive whatever argument was specified for them in the attach routine when calling bus_setup_intr(). By the new convention it should be the pointer to the structure softc. So the interrupt handler commonly starts as: static void xxx_intr(struct xxx_softc *sc) { It runs at the interrupt priority level specified by the interrupt type parameter of bus_setup_intr(). That means that all the other interrupts of the same type as well as all the software interrupts are disabled. 
To avoid races it is commonly written as a loop: while(xxx_interrupt_pending(sc)) { xxx_process_interrupt(sc); xxx_acknowledge_interrupt(sc); } The interrupt handler has to acknowledge interrupt to the device only but not to the interrupt controller, the system takes care of the latter. diff --git a/en_US.ISO8859-1/books/arch-handbook/pccard/chapter.xml b/en_US.ISO8859-1/books/arch-handbook/pccard/chapter.xml index a9a2753d9a..59261a9568 100644 --- a/en_US.ISO8859-1/books/arch-handbook/pccard/chapter.xml +++ b/en_US.ISO8859-1/books/arch-handbook/pccard/chapter.xml @@ -1,367 +1,367 @@ PC Card PC Card CardBus This chapter will talk about the FreeBSD mechanisms for writing a device driver for a PC Card or CardBus device. However, at present it just documents how to add a new device to an existing pccard driver. Adding a Device Device drivers know what devices they support. There is a table of supported devices in the kernel that drivers use to attach to a device. Overview CIS PC Cards are identified in one of two ways, both based on the Card Information Structure (CIS) stored on the card. The first method is to use numeric manufacturer and product numbers. The second method is to use the human readable strings that are also contained in the CIS. The PC Card bus uses a centralized database and some macros to facilitate a design pattern to help the driver writer match devices to his driver. Original equipment manufacturers (OEMs) often develop a reference design for a PC Card product, then sell this design to other companies to market. Those companies refine the design, market the product to their target audience or geographic area, and put their own name plate onto the card. The refinements to the physical card are typically very minor, if any changes are made at all. To strengthen their brand, these vendors place their company name in the human readable strings in the CIS space, but leave the manufacturer and product IDs unchanged. 
NetGear Linksys D-Link - Because of this practice, FreeBSD drivers usually rely on + Due to this practice, FreeBSD drivers usually rely on numeric IDs for device identification. Using numeric IDs and a centralized database complicates adding IDs and support for cards to the system. One must carefully check to see who really made the card, especially when it appears that the vendor who made the card might already have a different manufacturer ID listed in the central database. Linksys, D-Link, and NetGear are a number of US manufacturers of LAN hardware that often sell the same design. These same designs can be sold in Japan under names such as Buffalo and Corega. Often, these devices will all have the same manufacturer and product IDs. The PC Card bus code keeps a central database of card information, but not which driver is associated with them, in /sys/dev/pccard/pccarddevs. It also provides a set of macros that allow one to easily construct simple entries in the table the driver uses to claim devices. Finally, some really low end devices do not contain manufacturer identification at all. These devices must be detected by matching the human readable CIS strings. While it would be nice if we did not need this method as a fallback, it is necessary for some very low end CD-ROM players and Ethernet cards. This method should generally be avoided, but a number of devices are listed in this section because they were added prior to the recognition of the OEM nature of the PC Card business. When adding new devices, prefer using the numeric method. Format of <filename>pccarddevs</filename> There are four sections in the pccarddevs files. The first section lists the manufacturer numbers for vendors that use them. This section is sorted in numerical order. The next section has all of the products that are used by these vendors, along with their product ID numbers and a description string. 
The description string typically is not used (instead we set the device's description based on the human readable CIS, even if we match on the numeric version). These two sections are then repeated for devices that use the string matching method. Finally, C-style comments enclosed in /* and */ characters are allowed anywhere in the file. The first section of the file contains the vendor IDs. Please keep this list sorted in numeric order. Also, please coordinate changes to this file because we share it with NetBSD to help facilitate a common clearing house for this information. For example, here are the first few vendor IDs: vendor FUJITSU 0x0004 Fujitsu Corporation vendor NETGEAR_2 0x000b Netgear vendor PANASONIC 0x0032 Matsushita Electric Industrial Co. vendor SANDISK 0x0045 Sandisk Corporation Chances are very good that the NETGEAR_2 entry is really an OEM that NETGEAR purchased cards from and the author of support for those cards was unaware at the time that Netgear was using someone else's ID. These entries are fairly straightforward. The vendor keyword denotes the kind of line that this is, followed by the name of the vendor. This name will be repeated later in pccarddevs, as well as used in the driver's match tables, so keep it short and a valid C identifier. A numeric ID in hex identifies the manufacturer. Do not add IDs of the form 0xffffffff or 0xffff because these are reserved IDs (the former is no ID set while the latter is sometimes seen in extremely poor quality cards to try to indicate none). Finally there is a string description of the company that makes the card. This string is not used in FreeBSD for anything but commentary purposes. The second section of the file contains the products. As shown in this example, the format is similar to the vendor lines: /* Allied Telesis K.K. 
*/ product ALLIEDTELESIS LA_PCM 0x0002 Allied Telesis LA-PCM /* Archos */ product ARCHOS ARC_ATAPI 0x0043 MiniCD The product keyword is followed by the vendor name, repeated from above. This is followed by the product name, which is used by the driver and should be a valid C identifier, but may also start with a number. As with the vendors, the hex product ID for this card follows the same convention for 0xffffffff and 0xffff. Finally, there is a string description of the device itself. This string typically is not used in FreeBSD, since FreeBSD's pccard bus driver will construct a string from the human readable CIS entries, but it can be used in the rare cases where this is somehow insufficient. The products are in alphabetical order by manufacturer, then numerical order by product ID. They have a C comment before each manufacturer's entries and there is a blank line between entries. The third section is like the previous vendor section, but with all of the manufacturer numeric IDs set to -1, meaning match anything found in the FreeBSD pccard bus code. Since these are C identifiers, their names must be unique. Otherwise the format is identical to the first section of the file. The final section contains the entries for those cards that must be identified by string entries. This section's format is a little different from the generic section: product ADDTRON AWP100 { "Addtron", "AWP-100&spWireless&spPCMCIA", "Version&sp01.02", NULL } product ALLIEDTELESIS WR211PCM { "Allied&spTelesis&spK.K.", "WR211PCM", NULL, NULL } Allied Telesis WR211PCM The familiar product keyword is followed by the vendor name and the card name, just as in the second section of the file. Here the format deviates from that used earlier. There is a {} grouping, followed by a number of strings. These strings correspond to the vendor, product, and extra information that is defined in a CIS_INFO tuple. 
These strings are filtered by the program that generates pccarddevs.h to replace &sp with a real space. NULL strings mean that the corresponding part of the entry should be ignored. The example shown here contains a bad entry. It should not contain the version number unless that is critical for the operation of the card. Sometimes vendors will have many different versions of the card in the field that all work, in which case that information only makes it harder for someone with a similar card to use it with FreeBSD. Sometimes it is necessary when a vendor wishes to sell many different parts under the same brand due to market considerations (availability, price, and so forth). Then it can be critical to disambiguating the card in those rare cases where the vendor kept the same manufacturer/product pair. Regular expression matching is not available at this time. Sample Probe Routine PC Card probe To understand how to add a device to the list of supported devices, one must understand the probe and/or match routines that many drivers have. It is complicated a little in FreeBSD 5.x because there is a compatibility layer for OLDCARD present as well. Since only the window-dressing is different, an idealized version will be presented here. static const struct pccard_product wi_pccard_products[] = { PCMCIA_CARD(3COM, 3CRWE737A, 0), PCMCIA_CARD(BUFFALO, WLI_PCM_S11, 0), PCMCIA_CARD(BUFFALO, WLI_CF_S11G, 0), PCMCIA_CARD(TDK, LAK_CD011WL, 0), { NULL } }; static int wi_pccard_probe(dev) device_t dev; { const struct pccard_product *pp; if ((pp = pccard_product_lookup(dev, wi_pccard_products, sizeof(wi_pccard_products[0]), NULL)) != NULL) { if (pp->pp_name != NULL) device_set_desc(dev, pp->pp_name); return (0); } return (ENXIO); } Here we have a simple pccard probe routine that matches a few devices. As stated above, the name may vary (if it is not foo_pccard_probe() it will be foo_pccard_match()). 
The function pccard_product_lookup() is a generalized function that walks the table and returns a pointer to the first entry that it matches. Some drivers may use this mechanism to convey additional information about some cards to the rest of the driver, so there may be some variance in the table. The only requirement is that each row of the table must have a struct pccard_product as the first element. Looking at the table wi_pccard_products, one notices that all the entries are of the form PCMCIA_CARD(foo, bar, baz). The foo part is the manufacturer ID from pccarddevs. The bar part is the product ID. baz is the expected function number for this card. Many pccards can have multiple functions, and some way to disambiguate function 1 from function 0 is needed. You may see PCMCIA_CARD_D, which includes the device description from pccarddevs. You may also see PCMCIA_CARD2 and PCMCIA_CARD2_D which are used when you need to match both CIS strings and manufacturer numbers, in the "use the default description" and "take the description from pccarddevs" flavors, respectively. Putting it All Together To add a new device, one must first obtain the identification information from the device. The easiest way to do this is to insert the device into a PC Card or CF slot and issue devinfo -v. Sample output: cbb1 pnpinfo vendor=0x104c device=0xac51 subvendor=0x1265 subdevice=0x0300 class=0x060700 at slot=10 function=1 cardbus1 pccard1 unknown pnpinfo manufacturer=0x026f product=0x030c cisvendor="BUFFALO" cisproduct="WLI2-CF-S11" function_type=6 at function=0 manufacturer and product are the numeric IDs for this product, while cisvendor and cisproduct are the product description strings from the CIS. Since we prefer the numeric option, first try to construct an entry based on that. The above card has been slightly fictionalized for the purpose of this example.
The vendor is BUFFALO, which we see already has an entry: vendor BUFFALO 0x026f BUFFALO (Melco Corporation) But there is no entry for this particular card. Instead we find: /* BUFFALO */ product BUFFALO WLI_PCM_S11 0x0305 BUFFALO AirStation 11Mbps WLAN product BUFFALO LPC_CF_CLT 0x0307 BUFFALO LPC-CF-CLT product BUFFALO LPC3_CLT 0x030a BUFFALO LPC3-CLT Ethernet Adapter product BUFFALO WLI_CF_S11G 0x030b BUFFALO AirStation 11Mbps CF WLAN To add the device, we can just add this entry to pccarddevs: product BUFFALO WLI2_CF_S11G 0x030c BUFFALO AirStation ultra 802.11b CF Once these steps are complete, the card can be added to the driver. That is a simple operation of adding one line: static const struct pccard_product wi_pccard_products[] = { PCMCIA_CARD(3COM, 3CRWE737A, 0), PCMCIA_CARD(BUFFALO, WLI_PCM_S11, 0), PCMCIA_CARD(BUFFALO, WLI_CF_S11G, 0), + PCMCIA_CARD(BUFFALO, WLI2_CF_S11G, 0), PCMCIA_CARD(TDK, LAK_CD011WL, 0), { NULL } }; Note that I have included a '+' at the start of the line that I added, but that is simply to highlight the line. Do not add it to the actual driver. Once you have added the line, you can recompile your kernel or module and test it. If the device is recognized and works, please submit a patch. If it does not work, please figure out what is needed to make it work and submit a patch. If the device is not recognized at all, you have done something wrong and should recheck each step. If you are a FreeBSD src committer, and everything appears to be working, then you can commit the changes to the tree. However, there are some minor tricky things to be considered. pccarddevs must be committed to the tree first. Then pccarddevs.h must be regenerated and committed as a second step, ensuring that the right $FreeBSD$ tag is in the latter file. Finally, commit the additions to the driver. Submitting a New Device Please do not send entries for new devices to the author directly.
Instead, submit them as a PR and send the author the PR number for his records. This ensures that entries are not lost. When submitting a PR, it is unnecessary to include the pccarddevs.h diffs in the patch, since those will be regenerated. It is necessary to include a description of the device, as well as the patches to the client driver. If you do not know the name, use OEM99 as the name, and the author will adjust OEM99 accordingly after investigation. Committers should not commit OEM99, but instead find the highest OEM entry and commit one more than that. diff --git a/en_US.ISO8859-1/books/arch-handbook/scsi/chapter.xml b/en_US.ISO8859-1/books/arch-handbook/scsi/chapter.xml index c325840bed..7de627b5b9 100644 --- a/en_US.ISO8859-1/books/arch-handbook/scsi/chapter.xml +++ b/en_US.ISO8859-1/books/arch-handbook/scsi/chapter.xml @@ -1,2239 +1,2239 @@ Common Access Method SCSI Controllers Written by Sergey Babkin. Modifications for Handbook made by Murray Stokely. Synopsis SCSI This document assumes that the reader has a general understanding of device drivers in FreeBSD and of the SCSI protocol. Much of the information in this document was extracted from the drivers: ncr (/sys/pci/ncr.c) by Wolfgang Stanglmeier and Stefan Esser sym (/sys/dev/sym/sym_hipd.c) by Gerard Roudier aic7xxx (/sys/dev/aic7xxx/aic7xxx.c) by Justin T. Gibbs and from the CAM code itself (by Justin T. Gibbs, see /sys/cam/*). When some solution looked the most logical and was essentially verbatim extracted from the code by Justin T. Gibbs, I marked it as recommended. The document is illustrated with examples in pseudo-code. Although sometimes the examples have many details and look like real code, it is still pseudo-code. It was written to demonstrate the concepts in an understandable way. For a real driver other approaches may be more modular and efficient.
It also abstracts from the hardware details, as well as issues that would cloud the demonstrated concepts or that are supposed to be described in the other chapters of the developers handbook. Such details are commonly shown as calls to functions with descriptive names, comments or pseudo-statements. Fortunately real life full-size examples with all the details can be found in the real drivers. General Architecture Common Access Method (CAM) CAM stands for Common Access Method. It is a generic way to address the I/O buses in a SCSI-like way. This allows a separation of the generic device drivers from the drivers controlling the I/O bus: for example the disk driver becomes able to control disks on both SCSI, IDE, and/or any other bus so the disk driver portion does not have to be rewritten (or copied and modified) for every new I/O bus. Thus the two most important active entities are: CD-ROM tape IDE Peripheral Modules - a driver for peripheral devices (disk, tape, CD-ROM, etc.) SCSI Interface Modules (SIM) - a Host Bus Adapter drivers for connecting to an I/O bus such as SCSI or IDE. A peripheral driver receives requests from the OS, converts them to a sequence of SCSI commands and passes these SCSI commands to a SCSI Interface Module. The SCSI Interface Module is responsible for passing these commands to the actual hardware (or if the actual hardware is not SCSI but, for example, IDE then also converting the SCSI commands to the native commands of the hardware). - Because we are interested in writing a SCSI adapter driver + As we are interested in writing a SCSI adapter driver here, from this point on we will consider everything from the SIM standpoint. 
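The division of labor described above can be sketched in plain C. This is only an illustration under stated assumptions: struct ex_request, struct ex_sim and all ex_ functions are hypothetical stand-ins, not the CAM API. The peripheral side turns a block read into a SCSI READ(10) command, and the SIM side is the only code that would touch the bus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified request: NOT the real union ccb. */
struct ex_request {
	uint8_t	cdb[10];	/* SCSI command descriptor block */
	size_t	cdb_len;
	int	done;		/* set by the SIM once it accepted the command */
};

/* Hypothetical SIM: a single entry point that takes SCSI commands. */
struct ex_sim {
	const char *name;
	void	(*action)(struct ex_sim *sim, struct ex_request *req);
	unsigned accepted;	/* number of commands handed to the bus */
};

/* Peripheral side: express "read nblocks at lba" as a READ(10) CDB. */
static void
ex_build_read10(struct ex_request *req, uint32_t lba, uint16_t nblocks)
{
	assert(req != NULL);
	memset(req, 0, sizeof(*req));
	req->cdb[0] = 0x28;			/* READ(10) opcode */
	req->cdb[2] = (lba >> 24) & 0xff;	/* LBA, big-endian */
	req->cdb[3] = (lba >> 16) & 0xff;
	req->cdb[4] = (lba >> 8) & 0xff;
	req->cdb[5] = lba & 0xff;
	req->cdb[7] = (nblocks >> 8) & 0xff;	/* transfer length */
	req->cdb[8] = nblocks & 0xff;
	req->cdb_len = 10;
}

/* SIM side: a real driver would program the adapter hardware here. */
static void
ex_sim_action(struct ex_sim *sim, struct ex_request *req)
{
	sim->accepted++;
	req->done = 1;
}

/* The peripheral driver never touches the bus; it only calls the SIM. */
static void
ex_disk_read(struct ex_sim *sim, struct ex_request *req, uint32_t lba,
    uint16_t nblocks)
{
	ex_build_read10(req, lba, nblocks);
	sim->action(sim, req);
}
```

Because ex_disk_read() depends only on the action pointer, the same peripheral code would work with a SIM for any other bus, which is the point of the separation.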
A typical SIM driver needs to include the following CAM-related header files: #include <cam/cam.h> #include <cam/cam_ccb.h> #include <cam/cam_sim.h> #include <cam/cam_xpt_sim.h> #include <cam/cam_debug.h> #include <cam/scsi/scsi_all.h> The first thing each SIM driver must do is register itself with the CAM subsystem. This is done during the driver's xxx_attach() function (here and further xxx_ is used to denote the unique driver name prefix). The xxx_attach() function itself is called by the system bus auto-configuration code which we do not describe here. This is achieved in multiple steps: first it is necessary to allocate the queue of requests associated with this SIM: struct cam_devq *devq; if(( devq = cam_simq_alloc(SIZE) )==NULL) { error; /* some code to handle the error */ } Here SIZE is the size of the queue to be allocated, maximal number of requests it could contain. It is the number of requests that the SIM driver can handle in parallel on one SCSI card. Commonly it can be calculated as: SIZE = NUMBER_OF_SUPPORTED_TARGETS * MAX_SIMULTANEOUS_COMMANDS_PER_TARGET Next we create a descriptor of our SIM: struct cam_sim *sim; if(( sim = cam_sim_alloc(action_func, poll_func, driver_name, softc, unit, mtx, max_dev_transactions, max_tagged_dev_transactions, devq) )==NULL) { cam_simq_free(devq); error; /* some code to handle the error */ } Note that if we are not able to create a SIM descriptor we free the devq also because we can do nothing else with it and we want to conserve memory. If a SCSI card has multiple SCSI buses on it then each bus requires its own cam_sim structure. An interesting question is what to do if a SCSI card has more than one SCSI bus, do we need one devq structure per card or per SCSI bus? The answer given in the comments to the CAM code is: either way, as the driver's author prefers. The arguments are: action_func - pointer to the driver's xxx_action function.
static void xxx_action struct cam_sim *sim, union ccb *ccb poll_func - pointer to the driver's xxx_poll() static void xxx_poll struct cam_sim *sim driver_name - the name of the actual driver, such as ncr or wds. softc - pointer to the driver's internal descriptor for this SCSI card. This pointer will be used by the driver in future to get private data. unit - the controller unit number, for example for controller mps0 this number will be 0 mtx - Lock associated with this SIM. For SIMs that don't know about locking, pass in Giant. For SIMs that do, pass in the lock used to guard this SIM's data structures. This lock will be held when xxx_action and xxx_poll are called. max_dev_transactions - maximal number of simultaneous transactions per SCSI target in the non-tagged mode. This value will be almost universally equal to 1, with possible exceptions only for the non-SCSI cards. Also the drivers that hope to take advantage by preparing one transaction while another one is executed may set it to 2 but this does not seem to be worth the complexity. max_tagged_dev_transactions - the same thing, but in the tagged mode. Tags are the SCSI way to initiate multiple transactions on a device: each transaction is assigned a unique tag and the transaction is sent to the device. When the device completes some transaction it sends back the result together with the tag so that the SCSI adapter (and the driver) can tell which transaction was completed. This argument is also known as the maximal tag depth. It depends on the abilities of the SCSI adapter. 
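The tagged mode described in the last argument can be illustrated with a small sketch of the bookkeeping a driver might keep per target. Everything here is hypothetical: the ex_ names, the fixed tag depth of 4, and the choice to reuse the table index as the tag value are assumptions made for the example, not something CAM prescribes.

```c
#include <assert.h>
#include <stddef.h>

#define EX_MAX_TAGS	4	/* hypothetical tag depth of the adapter */

/* Hypothetical per-target table: the slot index doubles as the tag value. */
struct ex_tag_table {
	void	*inflight[EX_MAX_TAGS];	/* transaction owning the tag, or NULL */
};

/* Assign a free tag to a transaction; -1 means the tag depth is exhausted. */
static int
ex_tag_alloc(struct ex_tag_table *t, void *xact)
{
	int tag;

	assert(xact != NULL);
	for (tag = 0; tag < EX_MAX_TAGS; tag++) {
		if (t->inflight[tag] == NULL) {
			t->inflight[tag] = xact;
			return (tag);
		}
	}
	return (-1);
}

/* The device reported completion for "tag": find and release its owner. */
static void *
ex_tag_complete(struct ex_tag_table *t, int tag)
{
	void *xact;

	if (tag < 0 || tag >= EX_MAX_TAGS)
		return (NULL);
	xact = t->inflight[tag];
	t->inflight[tag] = NULL;	/* the tag may be reused from now on */
	return (xact);
}
```

When ex_tag_alloc() returns -1 the target already has the maximal number of tagged transactions outstanding, which is the situation max_tagged_dev_transactions is meant to prevent.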
Finally we register the SCSI buses associated with our SCSI adapter: if(xpt_bus_register(sim, softc, bus_number) != CAM_SUCCESS) { cam_sim_free(sim, /*free_devq*/ TRUE); error; /* some code to handle the error */ } If there is one devq structure per SCSI bus (i.e., we consider a card with multiple buses as multiple cards with one bus each) then the bus number will always be 0, otherwise each bus on the SCSI card should get a distinct number. Each bus needs its own separate structure cam_sim. After that our controller is completely hooked to the CAM system. The value of devq can be discarded now: sim will be passed as an argument in all further calls from CAM and devq can be derived from it. CAM provides a framework for asynchronous events. Some events originate from the lower levels (the SIM drivers), some events originate from the peripheral drivers, some events originate from the CAM subsystem itself. Any driver can register callbacks for some types of the asynchronous events, so that it would be notified if these events occur. A typical example of such an event is a device reset. Each transaction and event identifies the devices to which it applies by the means of path. The target-specific events normally occur during a transaction with this device. So the path from that transaction may be re-used to report this event (this is safe because the event path is copied in the event reporting routine but not deallocated nor passed anywhere further). Also it is safe to allocate paths dynamically at any time including the interrupt routines, although that incurs certain overhead, and a possible problem with this approach is that there may be no free memory at that time. For a bus reset event we need to define a wildcard path including all devices on the bus.
So we can create the path for the future bus reset events in advance and avoid problems with the future memory shortage: struct cam_path *path; if(xpt_create_path(&path, /*periph*/NULL, cam_sim_path(sim), CAM_TARGET_WILDCARD, CAM_LUN_WILDCARD) != CAM_REQ_CMP) { xpt_bus_deregister(cam_sim_path(sim)); cam_sim_free(sim, /*free_devq*/TRUE); error; /* some code to handle the error */ } softc->wpath = path; softc->sim = sim; As you can see the path includes: ID of the peripheral driver (NULL here because we have none) ID of the SIM driver (cam_sim_path(sim)) SCSI target number of the device (CAM_TARGET_WILDCARD means all devices) SCSI LUN number of the subdevice (CAM_LUN_WILDCARD means all LUNs) If the driver can not allocate this path it will not be able to work normally, so in that case we dismantle that SCSI bus. And we save the path pointer in the softc structure for future use. After that we save the value of sim (or we can also discard it on the exit from xxx_probe() if we wish). That is all for a minimalistic initialization. To do things right there is one more issue left. For a SIM driver there is one particularly interesting event: when a target device is considered lost. In this case resetting the SCSI negotiations with this device may be a good idea. So we register a callback for this event with CAM. The request is passed to CAM by requesting CAM action on a CAM control block for this type of request: struct ccb_setasync csa; xpt_setup_ccb(&csa.ccb_h, path, /*priority*/5); csa.ccb_h.func_code = XPT_SASYNC_CB; csa.event_enable = AC_LOST_DEVICE; csa.callback = xxx_async; csa.callback_arg = sim; xpt_action((union ccb *)&csa); Now we take a look at the xxx_action() and xxx_poll() driver entry points. static void xxx_action struct cam_sim *sim, union ccb *ccb Do some action on request of the CAM subsystem. Sim describes the SIM for the request, CCB is the request itself. CCB stands for CAM Control Block. 
It is a union of many specific instances, each describing arguments for some type of transactions. All of these instances share the CCB header where the common part of arguments is stored. CAM supports the SCSI controllers working in both initiator (normal) mode and target (simulating a SCSI device) mode. Here we only consider the part relevant to the initiator mode. There are a few functions and macros (in other words, methods) defined to access the public data in the struct cam_sim: cam_sim_path(sim) - the path ID (see above) cam_sim_name(sim) - the name of the sim cam_sim_softc(sim) - the pointer to the softc (driver private data) structure cam_sim_unit(sim) - the unit number cam_sim_bus(sim) - the bus ID To identify the device, xxx_action() can get the unit number and pointer to its structure softc using these functions. The type of request is stored in ccb->ccb_h.func_code. So generally xxx_action() consists of a big switch: struct xxx_softc *softc = (struct xxx_softc *) cam_sim_softc(sim); struct ccb_hdr *ccb_h = &ccb->ccb_h; int unit = cam_sim_unit(sim); int bus = cam_sim_bus(sim); switch(ccb_h->func_code) { case ...: ... default: ccb_h->status = CAM_REQ_INVALID; xpt_done(ccb); break; } As can be seen from the default case (if an unknown command was received) the return code of the command is set into ccb->ccb_h.status and the completed CCB is returned back to CAM by calling xpt_done(ccb). xpt_done() does not have to be called from xxx_action(): for example, an I/O request may be enqueued inside the SIM driver and/or its SCSI controller. Then when the device posts an interrupt signaling that the processing of this request is complete, xpt_done() may be called from the interrupt handling routine. Actually, the CCB status is not only assigned as a return code but a CCB has some status all the time. Before CCB is passed to the xxx_action() routine it gets the status CAM_REQ_INPROG meaning that it is in progress.
There are a surprising number of status values defined in /sys/cam/cam.h which should be able to represent the status of a request in great detail. More interesting yet, the status is in fact a bitwise or of an enumerated status value (the lower 6 bits) and possible additional flag-like bits (the upper bits). The enumerated values will be discussed later in more detail. The summary of them can be found in the Errors Summary section. The possible status flags are: CAM_DEV_QFRZN - if the SIM driver gets a serious error (for example, the device does not respond to the selection or breaks the SCSI protocol) when processing a CCB it should freeze the request queue by calling xpt_freeze_simq(), return the other enqueued but not processed yet CCBs for this device back to the CAM queue, then set this flag for the troublesome CCB and call xpt_done(). This flag causes the CAM subsystem to unfreeze the queue after it handles the error. CAM_AUTOSNS_VALID - if the device returned an error condition and the flag CAM_DIS_AUTOSENSE is not set in CCB the SIM driver must execute the REQUEST SENSE command automatically to extract the sense (extended error information) data from the device. If this attempt was successful the sense data should be saved in the CCB and this flag set. CAM_RELEASE_SIMQ - like CAM_DEV_QFRZN but used in case there is some problem (or resource shortage) with the SCSI controller itself. Then all the future requests to the controller should be stopped by xpt_freeze_simq(). The controller queue will be restarted after the SIM driver overcomes the shortage and informs CAM by returning some CCB with this flag set. CAM_SIM_QUEUED - when SIM puts a CCB into its request queue this flag should be set (and removed when this CCB gets dequeued before being returned back to CAM). This flag is not used anywhere in the CAM code now, so its purpose is purely diagnostic. CAM_QOS_VALID - The QOS data is now valid. 
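Because the status word combines an enumerated value with flag bits, a driver has to mask before comparing and has to preserve the flags when it changes the enumerated part. The following sketch shows that handling in isolation. The EX_ names and their numeric values are illustrative stand-ins; the only property taken from the text is that the enumerated value occupies the lower 6 bits. The real definitions live in /sys/cam/cam.h and should be used in a driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative stand-ins only; see /sys/cam/cam.h for the real names
 * and values.  Only the 6-bit width of the enumerated part comes from
 * the text, the rest is assumed for the example.
 */
#define EX_STATUS_MASK		0x3fu	/* lower 6 bits: enumerated value */
#define EX_REQ_INPROG		0x00u	/* assumed value */
#define EX_REQ_CMP		0x01u	/* assumed value */
#define EX_FLAG_DEV_QFRZN	0x40u	/* assumed bit position */
#define EX_FLAG_SIM_QUEUED	0x200u	/* assumed bit position */

/* Extract the enumerated status, ignoring any flag bits. */
static uint32_t
ex_status_code(uint32_t status)
{
	return (status & EX_STATUS_MASK);
}

/* Extract only the flag-like upper bits. */
static uint32_t
ex_status_flags(uint32_t status)
{
	return (status & ~EX_STATUS_MASK);
}

/* Replace the enumerated status while keeping the flags untouched. */
static uint32_t
ex_set_code(uint32_t status, uint32_t code)
{
	assert((code & ~EX_STATUS_MASK) == 0);
	return ((status & ~EX_STATUS_MASK) | code);
}
```

Comparing the whole status word against an enumerated value would fail as soon as any flag is set, which is why the mask is applied first.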
The function xxx_action() is not allowed to sleep, so all the synchronization for resource access must be done using SIM or device queue freezing. Besides the aforementioned flags the CAM subsystem provides functions xpt_release_simq() and xpt_release_devq() to unfreeze the queues directly, without passing a CCB to CAM. The CCB header contains the following fields: path - path ID for the request target_id - target device ID for the request target_lun - LUN ID of the target device timeout - timeout interval for this command, in milliseconds timeout_ch - a convenience place for the SIM driver to store the timeout handle (the CAM subsystem itself does not make any assumptions about it) flags - various bits of information about the request spriv_ptr0, spriv_ptr1 - fields reserved for private use by the SIM driver (such as linking to the SIM queues or SIM private control blocks); actually, they exist as unions: spriv_ptr0 and spriv_ptr1 have the type (void *), spriv_field0 and spriv_field1 have the type unsigned long, sim_priv.entries[0].bytes and sim_priv.entries[1].bytes are byte arrays of the size consistent with the other incarnations of the union and sim_priv.bytes is one array, twice bigger. The recommended way of using the SIM private fields of CCB is to define some meaningful names for them and use these meaningful names in the driver, like: #define ccb_some_meaningful_name sim_priv.entries[0].bytes #define ccb_hcb spriv_ptr1 /* for hardware control block */ The most common initiator mode requests are: XPT_SCSI_IO - execute an I/O transaction The instance struct ccb_scsiio csio of the union ccb is used to transfer the arguments. 
They are: cdb_io - pointer to the SCSI command buffer or the buffer itself cdb_len - SCSI command length data_ptr - pointer to the data buffer (gets a bit complicated if scatter/gather is used) dxfer_len - length of the data to transfer sglist_cnt - counter of the scatter/gather segments scsi_status - place to return the SCSI status sense_data - buffer for the SCSI sense information if the command returns an error (the SIM driver is supposed to run the REQUEST SENSE command automatically in this case if the CCB flag CAM_DIS_AUTOSENSE is not set) sense_len - the length of that buffer (if it happens to be higher than size of sense_data the SIM driver must silently assume the smaller value) resid, sense_resid - if the transfer of data or SCSI sense returned an error these are the returned counters of the residual (not transferred) data. They do not seem to be especially meaningful, so in a case when they are difficult to compute (say, counting bytes in the SCSI controller's FIFO buffer) an approximate value will do as well. For a successfully completed transfer they must be set to zero. 
tag_action - the kind of tag to use: CAM_TAG_ACTION_NONE - do not use tags for this transaction MSG_SIMPLE_Q_TAG, MSG_HEAD_OF_Q_TAG, MSG_ORDERED_Q_TAG - value equal to the appropriate tag message (see /sys/cam/scsi/scsi_message.h); this gives only the tag type, the SIM driver must assign the tag value itself The general logic of handling this request is the following: The first thing to do is to check for possible races, to make sure that the command did not get aborted when it was sitting in the queue: struct ccb_scsiio *csio = &ccb->csio; if ((ccb_h->status & CAM_STATUS_MASK) != CAM_REQ_INPROG) { xpt_done(ccb); return; } Also we check that the device is supported at all by our controller: if(ccb_h->target_id > OUR_MAX_SUPPORTED_TARGET_ID || ccb_h->target_id == OUR_SCSI_CONTROLLERS_OWN_ID) { ccb_h->status = CAM_TID_INVALID; xpt_done(ccb); return; } if(ccb_h->target_lun > OUR_MAX_SUPPORTED_LUN) { ccb_h->status = CAM_LUN_INVALID; xpt_done(ccb); return; } Then allocate whatever data structures (such as a card-dependent hardware control block) we need to process this request. If we cannot, then freeze the SIM queue and remember that we have a pending operation, return the CCB and ask CAM to re-queue it. Later when the resources become available the SIM queue must be unfrozen by returning a ccb with the CAM_RELEASE_SIMQ bit set in its status. Otherwise, if all went well, link the CCB with the hardware control block (HCB) and mark it as queued. struct xxx_hcb *hcb = allocate_hcb(softc, unit, bus); if(hcb == NULL) { softc->flags |= RESOURCE_SHORTAGE; xpt_freeze_simq(sim, /*count*/1); ccb_h->status = CAM_REQUEUE_REQ; xpt_done(ccb); return; } hcb->ccb = ccb; ccb_h->ccb_hcb = (void *)hcb; ccb_h->status |= CAM_SIM_QUEUED; Extract the target data from CCB into the hardware control block. Check if we are asked to assign a tag and if yes then generate a unique tag and build the SCSI tag messages.
The SIM driver is also responsible for negotiations with the devices to set the maximal mutually supported bus width, synchronous rate and offset. hcb->target = ccb_h->target_id; hcb->lun = ccb_h->target_lun; generate_identify_message(hcb); if( csio->tag_action != CAM_TAG_ACTION_NONE ) generate_unique_tag_message(hcb, csio->tag_action); if( !target_negotiated(hcb) ) generate_negotiation_messages(hcb); Then set up the SCSI command. The command storage may be specified in the CCB in many interesting ways, specified by the CCB flags. The command buffer can be contained in CCB or pointed to, in the latter case the pointer may be physical or virtual. Since the hardware commonly needs a physical address we always convert the address to the physical one, typically using the busdma API. If a physical address is requested it is OK to return the CCB with the status CAM_REQ_INVALID, the current drivers do that. If necessary a physical address can also be converted or mapped back to a virtual address, but only with great difficulty, so we do not do that. if(ccb_h->flags & CAM_CDB_POINTER) { /* CDB is a pointer */ if(!(ccb_h->flags & CAM_CDB_PHYS)) { /* CDB pointer is virtual */ hcb->cmd = vtobus(csio->cdb_io.cdb_ptr); } else { /* CDB pointer is physical */ hcb->cmd = csio->cdb_io.cdb_ptr; } } else { /* CDB is in the ccb (buffer) */ hcb->cmd = vtobus(csio->cdb_io.cdb_bytes); } hcb->cmdlen = csio->cdb_len; Now it is time to set up the data. Again, the data storage may be specified in the CCB in many interesting ways, specified by the CCB flags. First we get the direction of the data transfer. The simplest case is if there is no data to transfer: int dir = (ccb_h->flags & CAM_DIR_MASK); if (dir == CAM_DIR_NONE) goto end_data; Then we check if the data is in one chunk or in a scatter-gather list, and whether the addresses are physical or virtual. The SCSI controller may be able to handle only a limited number of chunks of limited length. If the request hits this limitation we return an error.
We use a special function to return the CCB so that HCB resource shortages are handled in one place. The functions to add chunks are driver-dependent, and here we leave them without detailed implementation. See the description of the SCSI command (CDB) handling for the details on the address-translation issues. If some variation is too difficult or impossible to implement with a particular card it is OK to return the status CAM_REQ_INVALID. Actually, it seems like the scatter-gather ability is not used anywhere in the CAM code now. But at least the case for a single non-scattered virtual buffer must be implemented, it is actively used by CAM. int rv; initialize_hcb_for_data(hcb); if(!(ccb_h->flags & CAM_SCATTER_VALID)) { /* single buffer */ if(!(ccb_h->flags & CAM_DATA_PHYS)) { rv = add_virtual_chunk(hcb, csio->data_ptr, csio->dxfer_len, dir); } else { rv = add_physical_chunk(hcb, csio->data_ptr, csio->dxfer_len, dir); } } else { int i; struct bus_dma_segment *segs; segs = (struct bus_dma_segment *)csio->data_ptr; if ((ccb_h->flags & CAM_SG_LIST_PHYS) != 0) { /* The SG list pointer is physical */ rv = setup_hcb_for_physical_sg_list(hcb, segs, csio->sglist_cnt); } else if (!(ccb_h->flags & CAM_DATA_PHYS)) { /* SG buffer pointers are virtual */ for (i = 0; i < csio->sglist_cnt; i++) { rv = add_virtual_chunk(hcb, segs[i].ds_addr, segs[i].ds_len, dir); if (rv != CAM_REQ_CMP) break; } } else { /* SG buffer pointers are physical */ for (i = 0; i < csio->sglist_cnt; i++) { rv = add_physical_chunk(hcb, segs[i].ds_addr, segs[i].ds_len, dir); if (rv != CAM_REQ_CMP) break; } } } if(rv != CAM_REQ_CMP) { /* we expect that add_*_chunk() functions return CAM_REQ_CMP * if they added a chunk successfully, CAM_REQ_TOO_BIG if * the request is too big (too many bytes or too many chunks), * CAM_REQ_INVALID in case of other troubles */ free_hcb_and_ccb_done(hcb, ccb, rv); return; } end_data: If disconnection is disabled for this CCB we pass this information to the hcb: if(ccb_h->flags &
CAM_DIS_DISCONNECT) hcb_disable_disconnect(hcb); If the controller is able to run the REQUEST SENSE command all by itself then the value of the flag CAM_DIS_AUTOSENSE should also be passed to it, to prevent automatic REQUEST SENSE if the CAM subsystem does not want it. The only thing left is to set up the timeout, pass our hcb to the hardware and return, the rest will be done by the interrupt handler (or timeout handler). ccb_h->timeout_ch = timeout(xxx_timeout, (caddr_t) hcb, (ccb_h->timeout * hz) / 1000); /* convert milliseconds to ticks */ put_hcb_into_hardware_queue(hcb); return; And here is a possible implementation of the function returning CCB: static void free_hcb_and_ccb_done(struct xxx_hcb *hcb, union ccb *ccb, u_int32_t status) { ccb->ccb_h.ccb_hcb = NULL; if(hcb != NULL) { /* dereference hcb only after the NULL check */ struct xxx_softc *softc = hcb->softc; untimeout(xxx_timeout, (caddr_t) hcb, ccb->ccb_h.timeout_ch); /* we're about to free a hcb, so the shortage has ended */ if(softc->flags & RESOURCE_SHORTAGE) { softc->flags &= ~RESOURCE_SHORTAGE; status |= CAM_RELEASE_SIMQ; } free_hcb(hcb); /* also removes hcb from any internal lists */ } ccb->ccb_h.status = status | (ccb->ccb_h.status & ~(CAM_STATUS_MASK|CAM_SIM_QUEUED)); xpt_done(ccb); } XPT_RESET_DEV - send the SCSI BUS DEVICE RESET message to a device There is no data transferred in CCB except the header and the most interesting argument of it is target_id. Depending on the controller hardware a hardware control block just like for the XPT_SCSI_IO request may be constructed (see XPT_SCSI_IO request description) and sent to the controller or the SCSI controller may be immediately programmed to send this RESET message to the device or this request may be just not supported (and return the status CAM_REQ_INVALID). Also on completion of the request all the disconnected transactions for this target must be aborted (probably in the interrupt routine). Also all the current negotiations for the target are lost on reset, so they might be cleaned too.
Or their clearing may be deferred, because anyway the target would request re-negotiation on the next transaction. XPT_RESET_BUS - send the RESET signal to the SCSI bus No arguments are passed in the CCB, the only interesting argument is the SCSI bus indicated by the struct sim pointer. A minimalistic implementation would forget the SCSI negotiations for all the devices on the bus and return the status CAM_REQ_CMP. The proper implementation would in addition actually reset the SCSI bus (possibly also reset the SCSI controller) and mark all the CCBs being processed, both those in the hardware queue and those being disconnected, as done with the status CAM_SCSI_BUS_RESET. Like: int targ, lun; struct xxx_hcb *h, *hh; struct ccb_trans_settings neg; struct cam_path *path; /* The SCSI bus reset may take a long time, in this case its completion * should be checked by interrupt or timeout. But for simplicity * we assume here that it is really fast. */ reset_scsi_bus(softc); /* drop all enqueued CCBs */ for(h = softc->first_queued_hcb; h != NULL; h = hh) { hh = h->next; free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET); } /* the clean values of negotiations to report */ neg.bus_width = 8; neg.sync_period = neg.sync_offset = 0; neg.valid = (CCB_TRANS_BUS_WIDTH_VALID | CCB_TRANS_SYNC_RATE_VALID | CCB_TRANS_SYNC_OFFSET_VALID); /* drop all disconnected CCBs and clean negotiations */ for(targ=0; targ <= OUR_MAX_SUPPORTED_TARGET; targ++) { clean_negotiations(softc, targ); /* report the event if possible */ if(xpt_create_path(&path, /*periph*/NULL, cam_sim_path(sim), targ, CAM_LUN_WILDCARD) == CAM_REQ_CMP) { xpt_async(AC_TRANSFER_NEG, path, &neg); xpt_free_path(path); } for(lun=0; lun <= OUR_MAX_SUPPORTED_LUN; lun++) for(h = softc->first_discon_hcb[targ][lun]; h != NULL; h = hh) { hh=h->next; free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET); } } ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); /* report the event */ xpt_async(AC_BUS_RESET, softc->wpath, NULL); return;
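The loops above copy h->next into hh before calling free_hcb_and_ccb_done(), because that call frees the HCB and its next pointer must not be read afterwards. A minimal standalone sketch of the same pattern follows; struct node, push_node() and drain_list() are hypothetical names used only for illustration and are not part of CAM:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list node standing in for struct xxx_hcb. */
struct node {
    struct node *next;
    int id;
};

/* Push a new node on the front of the list and return the new head.
 * If allocation fails the old head is returned unchanged. */
struct node *
push_node(struct node *head, int id)
{
    struct node *n = malloc(sizeof(*n));

    if (n == NULL)
        return head;
    n->id = id;
    n->next = head;
    return n;
}

/* Free every node and return how many were released.  The successor
 * is saved before free() because the node is invalid afterwards. */
int
drain_list(struct node **headp)
{
    struct node *n, *nn;
    int count = 0;

    assert(headp != NULL);
    for (n = *headp; n != NULL; n = nn) {
        nn = n->next;   /* save the link first */
        free(n);
        count++;
    }
    *headp = NULL;
    return count;
}
```

Writing the loop as for(n = head; n != NULL; n = n->next) with the free inside the body would read freed memory on every iteration, which is why the driver code keeps the second pointer.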
Implementing the SCSI bus reset as a function may be a good idea because it would be re-used by the timeout function as a last resort if things go wrong. XPT_ABORT - abort the specified CCB The arguments are transferred in the instance struct ccb_abort cab of the union ccb. The only argument field in it is: abort_ccb - pointer to the CCB to be aborted If the abort is not supported just return the status CAM_UA_ABORT. This is also the easy way to minimally implement this call: return CAM_UA_ABORT in any case. The hard way is to implement this request honestly. First check that abort applies to a SCSI transaction: union ccb *abort_ccb; abort_ccb = ccb->cab.abort_ccb; if(abort_ccb->ccb_h.func_code != XPT_SCSI_IO) { ccb->ccb_h.status = CAM_UA_ABORT; xpt_done(ccb); return; } Then it is necessary to find this CCB in our queue. This can be done by walking the list of all our hardware control blocks in search for one associated with this CCB: struct xxx_hcb *hcb, *h; hcb = NULL; /* We assume that softc->first_hcb is the head of the list of all * HCBs associated with this bus, including those enqueued for * processing, being processed by hardware and disconnected ones. */ for(h = softc->first_hcb; h != NULL; h = h->next) { if(h->ccb == abort_ccb) { hcb = h; break; } } if(hcb == NULL) { /* no such CCB in our queue */ ccb->ccb_h.status = CAM_PATH_INVALID; xpt_done(ccb); return; } Now we look at the current processing status of the HCB. It may be either sitting in the queue waiting to be sent to the SCSI bus, being transferred right now, or disconnected and waiting for the result of the command, or actually completed by hardware but not yet marked as done by software. To make sure that we do not get in any races with hardware we mark the HCB as being aborted, so that if this HCB is about to be sent to the SCSI bus the SCSI controller will see this flag and skip it.
int hstatus; /* shown as a function, in case special action is needed to make * this flag visible to hardware */ set_hcb_flags(hcb, HCB_BEING_ABORTED); abort_again: hstatus = get_hcb_status(hcb); switch(hstatus) { case HCB_SITTING_IN_QUEUE: remove_hcb_from_hardware_queue(hcb); /* FALLTHROUGH */ case HCB_COMPLETED: /* this is an easy case */ free_hcb_and_ccb_done(hcb, abort_ccb, CAM_REQ_ABORTED); break; If the CCB is being transferred right now we would like to signal to the SCSI controller in some hardware-dependent way that we want to abort the current transfer. The SCSI controller would set the SCSI ATTENTION signal and when the target responds to it send an ABORT message. We also reset the timeout to make sure that the target is not sleeping forever. If the command would not get aborted in some reasonable time like 10 seconds the timeout routine would go - ahead and reset the whole SCSI bus. Because the command + ahead and reset the whole SCSI bus. Since the command will be aborted in some reasonable time we can just return the abort request now as successfully completed, and mark the aborted CCB as aborted (but not mark it as done yet). case HCB_BEING_TRANSFERRED: untimeout(xxx_timeout, (caddr_t) hcb, abort_ccb->ccb_h.timeout_ch); abort_ccb->ccb_h.timeout_ch = timeout(xxx_timeout, (caddr_t) hcb, 10 * hz); abort_ccb->ccb_h.status = CAM_REQ_ABORTED; /* ask the controller to abort that HCB, then generate * an interrupt and stop */ if(signal_hardware_to_abort_hcb_and_stop(hcb) < 0) { /* oops, we missed the race with hardware, this transaction * got off the bus before we aborted it, try again */ goto abort_again; } break; If the CCB is in the list of disconnected then set it up as an abort request and re-queue it at the front of hardware queue. Reset the timeout and report the abort request to be completed. 
case HCB_DISCONNECTED: untimeout(xxx_timeout, (caddr_t) hcb, abort_ccb->ccb_h.timeout_ch); abort_ccb->ccb_h.timeout_ch = timeout(xxx_timeout, (caddr_t) hcb, 10 * hz); put_abort_message_into_hcb(hcb); put_hcb_at_the_front_of_hardware_queue(hcb); break; } ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); return; That is all for the ABORT request, although there is one - more issue. Because the ABORT message cleans all the + more issue. As the ABORT message cleans all the ongoing transactions on a LUN we have to mark all the other active transactions on this LUN as aborted. That should be done in the interrupt routine, after the transaction gets aborted. Implementing the CCB abort as a function may be quite a good idea, this function can be re-used if an I/O transaction times out. The only difference would be that the timed out transaction would return the status CAM_CMD_TIMEOUT for the timed out request. Then the case XPT_ABORT would be small, like this: case XPT_ABORT: { union ccb *abort_ccb; abort_ccb = ccb->cab.abort_ccb; if(abort_ccb->ccb_h.func_code != XPT_SCSI_IO) { ccb->ccb_h.status = CAM_UA_ABORT; xpt_done(ccb); return; } if(xxx_abort_ccb(abort_ccb, CAM_REQ_ABORTED) < 0) /* no such CCB in our queue */ ccb->ccb_h.status = CAM_PATH_INVALID; else ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); return; } XPT_SET_TRAN_SETTINGS - explicitly set values of SCSI transfer settings The arguments are transferred in the instance struct ccb_trans_settings cts of the union ccb: valid - a bitmask showing which settings should be updated: CCB_TRANS_SYNC_RATE_VALID - synchronous transfer rate CCB_TRANS_SYNC_OFFSET_VALID - synchronous offset CCB_TRANS_BUS_WIDTH_VALID - bus width CCB_TRANS_DISC_VALID - set enable/disable disconnection CCB_TRANS_TQ_VALID - set enable/disable tagged queuing flags - consists of two parts, binary arguments and identification of sub-operations.
The binary arguments are: CCB_TRANS_DISC_ENB - enable disconnection CCB_TRANS_TAG_ENB - enable tagged queuing the sub-operations are: CCB_TRANS_CURRENT_SETTINGS - change the current negotiations CCB_TRANS_USER_SETTINGS - remember the desired user values sync_period, sync_offset - self-explanatory, if sync_offset==0 then the asynchronous mode is requested bus_width - bus width, in bits (not bytes) Two sets of negotiated parameters are supported, the user settings and the current settings. The user settings are not really used much in the SIM drivers, this is mostly just a piece of memory where the upper levels can store (and later recall) their ideas about the parameters. Setting the user parameters does not cause re-negotiation of the transfer rates. But when the SCSI controller does a negotiation it must never set the values higher than the user parameters, so they are essentially the upper bound. The current settings are, as the name says, current. Changing them means that the parameters must be re-negotiated on the next transfer. Again, these new current settings are not supposed to be forced on the device, they are only used as the initial step of negotiations. Also they must be limited by the actual capabilities of the SCSI controller: for example, if the SCSI controller has an 8-bit bus and the request asks to set 16-bit wide transfers this parameter must be silently truncated to 8-bit transfers before sending it to the device. One caveat is that the bus width and synchronous parameters are per target while the disconnection and tag enabling parameters are per LUN.
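The silent truncation described above can be sketched as a small standalone function. This is only an illustration under stated assumptions: struct xfer_params, clamp_goal() and the three limit values are hypothetical and describe an imaginary 8-bit controller, not any real driver. Note that a smaller synchronous period means a faster transfer, so the period is limited from below while width and offset are limited from above.

```c
#include <assert.h>

/* Hypothetical limits of an imaginary controller, for illustration only. */
#define OUR_BUS_WIDTH             8   /* bits */
#define OUR_MAX_SUPPORTED_OFFSET  15
#define OUR_MIN_SUPPORTED_PERIOD  25  /* smaller period = faster transfer */

/* Hypothetical container for one set of negotiation parameters. */
struct xfer_params {
    unsigned bus_width;
    unsigned sync_period;
    unsigned sync_offset;
};

static unsigned umin(unsigned a, unsigned b) { return a < b ? a : b; }
static unsigned umax(unsigned a, unsigned b) { return a > b ? a : b; }

/* Limit a requested parameter set to what the controller can do.
 * The request is never rejected, only silently reduced. */
struct xfer_params
clamp_goal(struct xfer_params req)
{
    struct xfer_params goal;

    goal.bus_width = umin(req.bus_width, OUR_BUS_WIDTH);
    goal.sync_offset = umin(req.sync_offset, OUR_MAX_SUPPORTED_OFFSET);
    goal.sync_period = umax(req.sync_period, OUR_MIN_SUPPORTED_PERIOD);
    return goal;
}
```

A request for 16-bit transfers with period 10 and offset 32 would come back as 8 bits, period 25 and offset 15; the result is only the starting point of the negotiation, the device may still answer with lower values.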
The recommended implementation is to keep 3 sets of negotiated (bus width and synchronous transfer) parameters: user - the user set, as above current - those actually in effect goal - those requested by setting of the current parameters The code looks like: struct ccb_trans_settings *cts; int targ, lun; int flags, valid; cts = &ccb->cts; targ = ccb_h->target_id; lun = ccb_h->target_lun; flags = cts->flags; valid = cts->valid; /* the *_VALID bits live in the valid field */ if(flags & CCB_TRANS_USER_SETTINGS) { if(valid & CCB_TRANS_SYNC_RATE_VALID) softc->user_sync_period[targ] = cts->sync_period; if(valid & CCB_TRANS_SYNC_OFFSET_VALID) softc->user_sync_offset[targ] = cts->sync_offset; if(valid & CCB_TRANS_BUS_WIDTH_VALID) softc->user_bus_width[targ] = cts->bus_width; if(valid & CCB_TRANS_DISC_VALID) { softc->user_tflags[targ][lun] &= ~CCB_TRANS_DISC_ENB; softc->user_tflags[targ][lun] |= flags & CCB_TRANS_DISC_ENB; } if(valid & CCB_TRANS_TQ_VALID) { softc->user_tflags[targ][lun] &= ~CCB_TRANS_TAG_ENB; softc->user_tflags[targ][lun] |= flags & CCB_TRANS_TAG_ENB; } } if(flags & CCB_TRANS_CURRENT_SETTINGS) { if(valid & CCB_TRANS_SYNC_RATE_VALID) softc->goal_sync_period[targ] = max(cts->sync_period, OUR_MIN_SUPPORTED_PERIOD); if(valid & CCB_TRANS_SYNC_OFFSET_VALID) softc->goal_sync_offset[targ] = min(cts->sync_offset, OUR_MAX_SUPPORTED_OFFSET); if(valid & CCB_TRANS_BUS_WIDTH_VALID) softc->goal_bus_width[targ] = min(cts->bus_width, OUR_BUS_WIDTH); if(valid & CCB_TRANS_DISC_VALID) { softc->current_tflags[targ][lun] &= ~CCB_TRANS_DISC_ENB; softc->current_tflags[targ][lun] |= flags & CCB_TRANS_DISC_ENB; } if(valid & CCB_TRANS_TQ_VALID) { softc->current_tflags[targ][lun] &= ~CCB_TRANS_TAG_ENB; softc->current_tflags[targ][lun] |= flags & CCB_TRANS_TAG_ENB; } } ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); return; Then when the next I/O request is processed it will check whether it has to re-negotiate, for example by calling the function target_negotiated(hcb).
It can be implemented like this: int target_negotiated(struct xxx_hcb *hcb) { struct xxx_softc *softc = hcb->softc; int targ = hcb->target; if( softc->current_sync_period[targ] != softc->goal_sync_period[targ] || softc->current_sync_offset[targ] != softc->goal_sync_offset[targ] || softc->current_bus_width[targ] != softc->goal_bus_width[targ] ) return 0; /* FALSE */ else return 1; /* TRUE */ } After the values are re-negotiated the resulting values must be assigned to both current and goal parameters, so for future I/O transactions the current and goal parameters would be the same and target_negotiated() would return TRUE. When the card is initialized (in xxx_attach()) the current negotiation values must be initialized to narrow asynchronous mode, and the goal and user values must be initialized to the maximal values supported by the controller. XPT_GET_TRAN_SETTINGS - get values of SCSI transfer settings This operation is the reverse of XPT_SET_TRAN_SETTINGS. Fill up the CCB instance struct ccb_trans_settings cts with data as requested by the flags CCB_TRANS_CURRENT_SETTINGS or CCB_TRANS_USER_SETTINGS (if both are set then the existing drivers return the current settings). Set all the bits in the valid field. XPT_CALC_GEOMETRY - calculate logical (BIOS) geometry of the disk The arguments are transferred in the instance struct ccb_calc_geometry ccg of the union ccb: block_size - input, block (A.K.A sector) size in bytes volume_size - input, volume size in blocks cylinders - output, logical cylinders heads - output, logical heads secs_per_track - output, logical sectors per track If the returned geometry differs enough from what the SCSI controller BIOS thinks and a disk on this SCSI controller is used as bootable the system may not be able to boot.
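To make the inputs and outputs listed above concrete, here is a standalone sketch of one possible translation. It is an illustration under assumptions, not the code of any particular driver: struct geometry and calc_geometry() are hypothetical names, the volume size is assumed to be a count of blocks, and the 64 heads/32 sectors and 255 heads/63 sectors schemes are assumed as the normal and extended translations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type mirroring the output fields of the CCB. */
struct geometry {
    uint32_t cylinders;
    uint32_t heads;
    uint32_t secs_per_track;
};

/*
 * blocks     - volume size as a count of blocks (assumption)
 * block_size - block size in bytes, at most 1 MB here
 * extended   - nonzero if extended translation is enabled
 */
struct geometry
calc_geometry(uint64_t blocks, uint32_t block_size, int extended)
{
    struct geometry g;
    uint64_t blocks_per_mb, size_mb;

    assert(block_size > 0 && block_size <= 1024 * 1024);
    blocks_per_mb = (1024UL * 1024UL) / block_size;
    size_mb = blocks / blocks_per_mb;

    if (size_mb > 1024 && extended) {
        g.heads = 255;
        g.secs_per_track = 63;
    } else {
        g.heads = 64;
        g.secs_per_track = 32;
    }
    /* whole cylinders only; a partial last cylinder is dropped */
    g.cylinders = (uint32_t)(blocks / (g.heads * g.secs_per_track));
    return g;
}
```

For a 4 GB disk with 512-byte blocks (8388608 blocks) this gives 522 cylinders with extended translation and 4096 cylinders without it, which shows how far the two views of the same disk can diverge.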
The typical calculation example taken from the aic7xxx driver is: struct ccb_calc_geometry *ccg; u_int32_t size_mb; u_int32_t secs_per_cylinder; int extended; ccg = &ccb->ccg; size_mb = ccg->volume_size / ((1024L * 1024L) / ccg->block_size); extended = check_cards_EEPROM_for_extended_geometry(softc); if (size_mb > 1024 && extended) { ccg->heads = 255; ccg->secs_per_track = 63; } else { ccg->heads = 64; ccg->secs_per_track = 32; } secs_per_cylinder = ccg->heads * ccg->secs_per_track; ccg->cylinders = ccg->volume_size / secs_per_cylinder; ccb->ccb_h.status = CAM_REQ_CMP; xpt_done(ccb); return; This gives the general idea, the exact calculation depends on the quirks of the particular BIOS. If the BIOS provides no way to set the extended translation flag in EEPROM this flag should normally be assumed equal to 1. Other popular geometries are: 128 heads, 63 sectors - Symbios controllers 16 heads, 63 sectors - old controllers Some system BIOSes and SCSI BIOSes fight with each other with variable success, for example a combination of Symbios 875/895 SCSI and Phoenix BIOS can give geometry 128/63 after power up and 255/63 after a hard reset or soft reboot. XPT_PATH_INQ - path inquiry, in other words get the SIM driver and SCSI controller (also known as HBA - Host Bus Adapter) properties The properties are returned in the instance struct ccb_pathinq cpi of the union ccb: version_num - the SIM driver version number, now all drivers use 1 hba_inquiry - bitmask of features supported by the controller: PI_MDP_ABLE - supports MDP message (something from SCSI3?)
PI_WIDE_32 - supports 32 bit wide SCSI PI_WIDE_16 - supports 16 bit wide SCSI PI_SDTR_ABLE - can negotiate synchronous transfer rate PI_LINKED_CDB - supports linked commands PI_TAG_ABLE - supports tagged commands PI_SOFT_RST - supports soft reset alternative (hard reset and soft reset are mutually exclusive within a SCSI bus) target_sprt - flags for target mode support, 0 if unsupported hba_misc - miscellaneous controller features: PIM_SCANHILO - bus scans from high ID to low ID PIM_NOREMOVE - removable devices not included in scan PIM_NOINITIATOR - initiator role not supported PIM_NOBUSRESET - user has disabled initial BUS RESET hba_eng_cnt - mysterious HBA engine count, something related to compression, now is always set to 0 vuhba_flags - vendor-unique flags, unused now max_target - maximal supported target ID (7 for 8-bit bus, 15 for 16-bit bus, 127 for Fibre Channel) max_lun - maximal supported LUN ID (7 for older SCSI controllers, 63 for newer ones) async_flags - bitmask of installed Async handler, unused now hpath_id - highest Path ID in the subsystem, unused now unit_number - the controller unit number, cam_sim_unit(sim) bus_id - the bus number, cam_sim_bus(sim) initiator_id - the SCSI ID of the controller itself base_transfer_speed - nominal transfer speed in KB/s for asynchronous narrow transfers, equals to 3300 for SCSI sim_vid - SIM driver's vendor id, a zero-terminated string of maximal length SIM_IDLEN including the terminating zero hba_vid - SCSI controller's vendor id, a zero-terminated string of maximal length HBA_IDLEN including the terminating zero dev_name - device driver name, a zero-terminated string of maximal length DEV_IDLEN including the terminating zero, equal to cam_sim_name(sim) The recommended way of setting the string fields is using strncpy, like: strncpy(cpi->dev_name, cam_sim_name(sim), DEV_IDLEN); After setting the values set the status to CAM_REQ_CMP and mark the CCB as done. 
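One detail worth keeping in mind with the string fields: strncpy() does not write a terminating zero when the source is as long as or longer than the given size. A minimal sketch of a copy that always terminates the destination follows; copy_id() and EXAMPLE_IDLEN are hypothetical names used only here, the latter standing in for SIM_IDLEN, HBA_IDLEN or DEV_IDLEN.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical field size standing in for SIM_IDLEN and friends. */
#define EXAMPLE_IDLEN 8

/* Copy src into a fixed-size field of dstlen bytes, truncating if
 * needed, and always leave the field zero-terminated. */
void
copy_id(char *dst, const char *src, size_t dstlen)
{
    assert(dstlen > 0);
    strncpy(dst, src, dstlen - 1);  /* pads with zeros if src is short */
    dst[dstlen - 1] = '\0';         /* terminate even when truncated */
}
```

Where it is available, strlcpy() gives the same truncate-and-terminate behaviour in a single call.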
Polling static void xxx_poll struct cam_sim *sim The poll function is used to simulate the interrupts when the interrupt subsystem is not functioning (for example, when the system has crashed and is creating the system dump). The CAM subsystem sets the proper interrupt level before calling the poll routine. So all it needs to do is to call the interrupt routine (or the other way around, the poll routine may be doing the real action and the interrupt routine would just call the poll routine). Why bother about a separate function then? - Because of different calling conventions. The + Due to different calling conventions. The xxx_poll routine gets the struct cam_sim pointer as its argument when the PCI interrupt routine by common convention gets pointer to the struct xxx_softc and the ISA interrupt routine gets just the device unit number. So the poll routine would normally look as: static void xxx_poll(struct cam_sim *sim) { xxx_intr((struct xxx_softc *)cam_sim_softc(sim)); /* for PCI device */ } or static void xxx_poll(struct cam_sim *sim) { xxx_intr(cam_sim_unit(sim)); /* for ISA device */ } Asynchronous Events If an asynchronous event callback has been set up then the callback function should be defined. 
static void ahc_async(void *callback_arg, u_int32_t code, struct cam_path *path, void *arg) callback_arg - the value supplied when registering the callback code - identifies the type of event path - identifies the devices to which the event applies arg - event-specific argument Implementation for a single type of event, AC_LOST_DEVICE, looks like: struct xxx_softc *softc; struct cam_sim *sim; int targ; struct ccb_trans_settings neg; sim = (struct cam_sim *)callback_arg; softc = (struct xxx_softc *)cam_sim_softc(sim); switch (code) { case AC_LOST_DEVICE: targ = xpt_path_target_id(path); if(targ <= OUR_MAX_SUPPORTED_TARGET) { clean_negotiations(softc, targ); /* send indication to CAM */ neg.bus_width = 8; neg.sync_period = neg.sync_offset = 0; neg.valid = (CCB_TRANS_BUS_WIDTH_VALID | CCB_TRANS_SYNC_RATE_VALID | CCB_TRANS_SYNC_OFFSET_VALID); xpt_async(AC_TRANSFER_NEG, path, &neg); } break; default: break; } Interrupts SCSIinterrupts The exact type of the interrupt routine depends on the type of the peripheral bus (PCI, ISA and so on) to which the SCSI controller is connected. The interrupt routines of the SIM drivers run at the interrupt level splcam. So splcam() should be used in the driver to synchronize activity between the interrupt routine and the rest of the driver (for a multiprocessor-aware driver things get yet more interesting but we ignore this case here). The pseudo-code in this document happily ignores the problems of synchronization. The real code must not ignore them. A simple-minded approach is to set splcam() on the entry to the other routines and reset it on return thus protecting them by one big critical section. To make sure that the interrupt level will be always restored a wrapper function can be defined, like: static void xxx_action(struct cam_sim *sim, union ccb *ccb) { int s; s = splcam(); xxx_action1(sim, ccb); splx(s); } static void xxx_action1(struct cam_sim *sim, union ccb *ccb) { ... process the request ... 
} This approach is simple and robust but the problem with it is that interrupts may get blocked for a relatively long time and this would negatively affect the system's performance. On the other hand the functions of the spl() family have rather high overhead, so vast amount of tiny critical sections may not be good either. The conditions handled by the interrupt routine and the details depend very much on the hardware. We consider the set of typical conditions. First, we check if a SCSI reset was encountered on the bus (probably caused by another SCSI controller on the same SCSI bus). If so we drop all the enqueued and disconnected requests, report the events and re-initialize our SCSI controller. It is important that during this initialization the controller will not issue another reset or else two controllers on the same SCSI bus could ping-pong resets forever. The case of fatal controller error/hang could be handled in the same place, but it will probably need also sending RESET signal to the SCSI bus to reset the status of the connections with the SCSI devices. 
int fatal=0; struct ccb_trans_settings neg; struct cam_path *path; if( detected_scsi_reset(softc) || (fatal = detected_fatal_controller_error(softc)) ) { int targ, lun; struct xxx_hcb *h, *hh; /* drop all enqueued CCBs */ for(h = softc->first_queued_hcb; h != NULL; h = hh) { hh = h->next; free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET); } /* the clean values of negotiations to report */ neg.bus_width = 8; neg.sync_period = neg.sync_offset = 0; neg.valid = (CCB_TRANS_BUS_WIDTH_VALID | CCB_TRANS_SYNC_RATE_VALID | CCB_TRANS_SYNC_OFFSET_VALID); /* drop all disconnected CCBs and clean negotiations */ for(targ=0; targ <= OUR_MAX_SUPPORTED_TARGET; targ++) { clean_negotiations(softc, targ); /* report the event if possible */ if(xpt_create_path(&path, /*periph*/NULL, cam_sim_path(sim), targ, CAM_LUN_WILDCARD) == CAM_REQ_CMP) { xpt_async(AC_TRANSFER_NEG, path, &neg); xpt_free_path(path); } for(lun=0; lun <= OUR_MAX_SUPPORTED_LUN; lun++) for(h = softc->first_discon_hcb[targ][lun]; h != NULL; h = hh) { hh=h->next; if(fatal) free_hcb_and_ccb_done(h, h->ccb, CAM_UNREC_HBA_ERROR); else free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET); } } /* report the event */ xpt_async(AC_BUS_RESET, softc->wpath, NULL); /* re-initialization may take a lot of time, in such case * its completion should be signaled by another interrupt or * checked on timeout - but for simplicity we assume here that * it is really fast */ if(!fatal) { reinitialize_controller_without_scsi_reset(softc); } else { reinitialize_controller_with_scsi_reset(softc); } schedule_next_hcb(softc); return; } If interrupt is not caused by a controller-wide condition then probably something has happened to the current hardware control block. Depending on the hardware there may be other non-HCB-related events, we just do not consider them here. 
Then we analyze what happened to this HCB: struct xxx_hcb *hcb, *h, *hh; int hcb_status, scsi_status; int ccb_status; int targ; int lun_to_freeze; hcb = get_current_hcb(softc); if(hcb == NULL) { /* either stray interrupt or something went very wrong * or this is something hardware-dependent */ handle as necessary; return; } targ = hcb->target; hcb_status = get_status_of_current_hcb(softc); First we check if the HCB has completed and if so we check the returned SCSI status. if(hcb_status == COMPLETED) { scsi_status = get_completion_status(hcb); Then look if this status is related to the REQUEST SENSE command and if so handle it in a simple way. if(hcb->flags & DOING_AUTOSENSE) { if(scsi_status == GOOD) { /* autosense was successful */ hcb->ccb->ccb_h.status |= CAM_AUTOSNS_VALID; free_hcb_and_ccb_done(hcb, hcb->ccb, CAM_SCSI_STATUS_ERROR); } else { autosense_failed: free_hcb_and_ccb_done(hcb, hcb->ccb, CAM_AUTOSENSE_FAIL); } schedule_next_hcb(softc); return; } Else the command itself has completed, pay more attention to details. If auto-sense is not disabled for this CCB and the command has failed with sense data then run REQUEST SENSE command to receive that data. hcb->ccb->csio.scsi_status = scsi_status; calculate_residue(hcb); if( (hcb->ccb->ccb_h.flags & CAM_DIS_AUTOSENSE)==0 && ( scsi_status == CHECK_CONDITION || scsi_status == COMMAND_TERMINATED) ) { /* start auto-SENSE */ hcb->flags |= DOING_AUTOSENSE; setup_autosense_command_in_hcb(hcb); restart_current_hcb(softc); return; } if(scsi_status == GOOD) free_hcb_and_ccb_done(hcb, hcb->ccb, CAM_REQ_CMP); else free_hcb_and_ccb_done(hcb, hcb->ccb, CAM_SCSI_STATUS_ERROR); schedule_next_hcb(softc); return; } One typical thing would be negotiation events: negotiation messages received from a SCSI target (in answer to our negotiation attempt or by target's initiative) or the target is unable to negotiate (rejects our negotiation messages or does not answer them). 
switch(hcb_status) { case TARGET_REJECTED_WIDE_NEG: /* revert to 8-bit bus */ softc->current_bus_width[targ] = softc->goal_bus_width[targ] = 8; /* report the event */ neg.bus_width = 8; neg.valid = CCB_TRANS_BUS_WIDTH_VALID; xpt_async(AC_TRANSFER_NEG, hcb->ccb->ccb_h.path, &neg); continue_current_hcb(softc); return; case TARGET_ANSWERED_WIDE_NEG: { int wd; wd = get_target_bus_width_request(softc); if(wd <= softc->goal_bus_width[targ]) { /* answer is acceptable */ softc->current_bus_width[targ] = softc->goal_bus_width[targ] = neg.bus_width = wd; /* report the event */ neg.valid = CCB_TRANS_BUS_WIDTH_VALID; xpt_async(AC_TRANSFER_NEG, hcb->ccb->ccb_h.path, &neg); } else { prepare_reject_message(hcb); } } continue_current_hcb(softc); return; case TARGET_REQUESTED_WIDE_NEG: { int wd; wd = get_target_bus_width_request(softc); wd = min (wd, OUR_BUS_WIDTH); wd = min (wd, softc->user_bus_width[targ]); if(wd != softc->current_bus_width[targ]) { /* the bus width has changed */ softc->current_bus_width[targ] = softc->goal_bus_width[targ] = neg.bus_width = wd; /* report the event */ neg.valid = CCB_TRANS_BUS_WIDTH_VALID; xpt_async(AC_TRANSFER_NEG, hcb->ccb->ccb_h.path, &neg); } prepare_width_nego_response(hcb, wd); } continue_current_hcb(softc); return; } Then we handle any errors that could have happened during auto-sense in the same simple-minded way as before. Otherwise we look closer at the details again. if(hcb->flags & DOING_AUTOSENSE) goto autosense_failed; switch(hcb_status) { The next event we consider is an unexpected disconnect, which is considered normal after an ABORT or BUS DEVICE RESET message and abnormal in other cases. 
case UNEXPECTED_DISCONNECT: if(requested_abort(hcb)) { /* abort affects all commands on that target+LUN, so * mark all disconnected HCBs on that target+LUN as aborted too */ for(h = softc->first_discon_hcb[hcb->target][hcb->lun]; h != NULL; h = hh) { hh=h->next; free_hcb_and_ccb_done(h, h->ccb, CAM_REQ_ABORTED); } ccb_status = CAM_REQ_ABORTED; } else if(requested_bus_device_reset(hcb)) { int lun; /* reset affects all commands on that target, so * mark all disconnected HCBs on that target as reset */ for(lun=0; lun <= OUR_MAX_SUPPORTED_LUN; lun++) for(h = softc->first_discon_hcb[hcb->target][lun]; h != NULL; h = hh) { hh=h->next; free_hcb_and_ccb_done(h, h->ccb, CAM_SCSI_BUS_RESET); } /* send event */ xpt_async(AC_SENT_BDR, hcb->ccb->ccb_h.path, NULL); /* this was the XPT_RESET_DEV request itself, it is completed */ ccb_status = CAM_REQ_CMP; } else { calculate_residue(hcb); ccb_status = CAM_UNEXP_BUSFREE; /* request the further code to freeze the queue */ hcb->ccb->ccb_h.status |= CAM_DEV_QFRZN; lun_to_freeze = hcb->lun; } break; If the target refuses to accept tags we notify CAM about that and return all commands for this LUN: case TAGS_REJECTED: /* report the event */ neg.flags = 0 & ~CCB_TRANS_TAG_ENB; neg.valid = CCB_TRANS_TQ_VALID; xpt_async(AC_TRANSFER_NEG, hcb->ccb->ccb_h.path, &neg); ccb_status = CAM_MSG_REJECT_REC; /* request the further code to freeze the queue */ hcb->ccb->ccb_h.status |= CAM_DEV_QFRZN; lun_to_freeze = hcb->lun; break; Then we check a number of other conditions, with processing basically limited to setting the CCB status: case SELECTION_TIMEOUT: ccb_status = CAM_SEL_TIMEOUT; /* request the further code to freeze the queue */ hcb->ccb->ccb_h.status |= CAM_DEV_QFRZN; lun_to_freeze = CAM_LUN_WILDCARD; break; case PARITY_ERROR: ccb_status = CAM_UNCOR_PARITY; break; case DATA_OVERRUN: case ODD_WIDE_TRANSFER: ccb_status = CAM_DATA_RUN_ERR; break; default: /* all other errors are handled in a generic way */ ccb_status = 
CAM_REQ_CMP_ERR; /* request the further code to freeze the queue */ hcb->ccb->ccb_h.status |= CAM_DEV_QFRZN; lun_to_freeze = CAM_LUN_WILDCARD; break; } Then we check if the error was serious enough to freeze the input queue until it gets processed and do so if it is: if(hcb->ccb->ccb_h.status & CAM_DEV_QFRZN) { /* freeze the queue */ xpt_freeze_devq(hcb->ccb->ccb_h.path, /*count*/1); /* re-queue all commands for this target/LUN back to CAM */ for(h = softc->first_queued_hcb; h != NULL; h = hh) { hh = h->next; if(targ == h->target && (lun_to_freeze == CAM_LUN_WILDCARD || lun_to_freeze == h->lun) ) free_hcb_and_ccb_done(h, h->ccb, CAM_REQUEUE_REQ); } } free_hcb_and_ccb_done(hcb, hcb->ccb, ccb_status); schedule_next_hcb(softc); return; This concludes the generic interrupt handling although specific controllers may require some additions. Errors Summary SCSIerrors When executing an I/O request many things may go wrong. The reason for an error can be reported in the CCB status in great detail. Examples of use are spread throughout this document. For completeness here is the summary of recommended responses for the typical error conditions: CAM_RESRC_UNAVAIL - some resource is temporarily unavailable and the SIM driver cannot generate an event when it will become available. An example of this resource would be some intra-controller hardware resource for which the controller does not generate an interrupt when it becomes available. 
CAM_UNCOR_PARITY - unrecovered parity error occurred CAM_DATA_RUN_ERR - data overrun or unexpected data phase (going in other direction than specified in CAM_DIR_MASK) or odd transfer length for wide transfer CAM_SEL_TIMEOUT - selection timeout occurred (target does not respond) CAM_CMD_TIMEOUT - command timeout occurred (the timeout function ran) CAM_SCSI_STATUS_ERROR - the device returned an error CAM_AUTOSENSE_FAIL - the device returned an error and the REQUEST SENSE command failed CAM_MSG_REJECT_REC - MESSAGE REJECT message was received CAM_SCSI_BUS_RESET - received SCSI bus reset CAM_REQ_CMP_ERR - impossible SCSI phase occurred or something else as weird or just a generic error if further detail is not available CAM_UNEXP_BUSFREE - unexpected disconnect occurred CAM_BDR_SENT - BUS DEVICE RESET message was sent to the target CAM_UNREC_HBA_ERROR - unrecoverable Host Bus Adapter Error CAM_REQ_TOO_BIG - the request was too large for this controller CAM_REQUEUE_REQ - this request should be re-queued to preserve transaction ordering. This typically occurs when the SIM recognizes an error that should freeze the queue and must place other queued requests for the target at the SIM level back into the XPT queue. Typical cases of such errors are selection timeouts, command timeouts and other like conditions. In such cases the troublesome command returns the status indicating the error, and the other commands which have not been sent to the bus yet get re-queued. CAM_LUN_INVALID - the LUN ID in the request is not supported by the SCSI controller CAM_TID_INVALID - the target ID in the request is not supported by the SCSI controller Timeout Handling When the timeout for an HCB expires that request should be aborted, just like with an XPT_ABORT request. The only difference is that the returned status of the aborted request should be CAM_CMD_TIMEOUT instead of CAM_REQ_ABORTED (which is why the abort is better implemented as a function). 
But there is one more possible problem: what if the abort request itself will get stuck? In this case the SCSI bus should be reset, just like with an XPT_RESET_BUS request (and the idea about implementing it as a function called from both places applies here too). Also we should reset the whole SCSI bus if a device reset request got stuck. So after all the timeout function would look like: static void xxx_timeout(void *arg) { struct xxx_hcb *hcb = (struct xxx_hcb *)arg; struct xxx_softc *softc; struct ccb_hdr *ccb_h; softc = hcb->softc; ccb_h = &hcb->ccb->ccb_h; if(hcb->flags & HCB_BEING_ABORTED || ccb_h->func_code == XPT_RESET_DEV) { xxx_reset_bus(softc); } else { xxx_abort_ccb(hcb->ccb, CAM_CMD_TIMEOUT); } } When we abort a request all the other disconnected requests to the same target/LUN get aborted too. So there appears a question, should we return them with status CAM_REQ_ABORTED or CAM_CMD_TIMEOUT? The current drivers use CAM_CMD_TIMEOUT. This seems logical because if one request got timed out then probably something really bad is happening to the device, so if they would not be disturbed they would time out by themselves. diff --git a/en_US.ISO8859-1/books/arch-handbook/usb/chapter.xml b/en_US.ISO8859-1/books/arch-handbook/usb/chapter.xml index 6fa02b2d59..1d34c3192b 100644 --- a/en_US.ISO8859-1/books/arch-handbook/usb/chapter.xml +++ b/en_US.ISO8859-1/books/arch-handbook/usb/chapter.xml @@ -1,721 +1,721 @@ USB Devices Nick Hibma Written by Murray Stokely Modifications for Handbook made by Introduction Universal Serial Bus (USB) NetBSD The Universal Serial Bus (USB) is a new way of attaching devices to personal computers. The bus architecture features two-way communication and has been developed as a response to devices becoming smarter and requiring more interaction with the host. USB support is included in all current PC chipsets and is therefore available in all recently built PCs. 
Apple's introduction of the USB-only iMac has been a major incentive for hardware manufacturers to produce USB versions of their devices. The future PC specifications specify that all legacy connectors on PCs should be replaced by one or more USB connectors, providing generic plug and play capabilities. Support for USB hardware was available at a very early stage in NetBSD and was developed by Lennart Augustsson for the NetBSD project. The code has been ported to FreeBSD and we are currently maintaining a shared code base. For the implementation of the USB subsystem a number of features of USB are important. Lennart Augustsson has done most of the implementation of the USB support for the NetBSD project. Many thanks for this incredible amount of work. Many thanks also to Ardy and Dirk for their comments and proofreading of this paper. Devices connect to ports on the computer directly or on devices called hubs, forming a treelike device structure. The devices can be connected and disconnected at run time. Devices can suspend themselves and trigger resumes of the host system As the devices can be powered from the bus, the host software has to keep track of power budgets for each hub. Different quality of service requirements by the different device types together with the maximum of 126 devices that can be connected to the same bus, require proper scheduling of transfers on the shared bus to take full advantage of the 12Mbps bandwidth available. (over 400Mbps with USB 2.0) Devices are intelligent and contain easily accessible information about themselves The development of drivers for the USB subsystem and devices connected to it is supported by the specifications that have been developed and will be developed. These specifications are publicly available from the USB home pages. 
Apple has been very strong in pushing for standards-based drivers, by making drivers for the generic classes available in their operating system MacOS and discouraging the use of separate drivers for each new device. This chapter tries to collate essential information for a basic understanding of the USB 2.0 implementation stack in FreeBSD/NetBSD. It is recommended however to read it together with the relevant 2.0 specifications and other developer resources: USB 2.0 Specification (http://www.usb.org/developers/docs/usb20_docs/) Universal Host Controller Interface (UHCI) Specification (ftp://ftp.netbsd.org/pub/NetBSD/misc/blymn/uhci11d.pdf) Open Host Controller Interface (OHCI) Specification (ftp://ftp.compaq.com/pub/supportinformation/papers/hcir1_0a.pdf) Developer section of USB home page (http://www.usb.org/developers/) Structure of the USB Stack The USB support in FreeBSD can be split into three layers. The lowest layer contains the host controller driver, providing a generic interface to the hardware and its scheduling facilities. It supports initialisation of the hardware, scheduling of transfers and handling of completed and/or failed transfers. Each host controller driver implements a virtual hub providing hardware independent access to the registers controlling the root ports on the back of the machine. The middle layer handles the device connection and disconnection, basic initialisation of the device, driver selection, the communication channels (pipes) and does resource management. This services layer also controls the default pipes and the device requests transferred over them. The top layer contains the individual drivers supporting specific (classes of) devices. These drivers implement the protocol that is used over the pipes other than the default pipe. They also implement additional functionality to make the device available to other parts of the kernel or userland. They use the USB driver interface (USBDI) exposed by the services layer. 
Host Controllers USBhost controllers The host controller (HC) controls the transmission of packets on the bus. Frames of 1 millisecond are used. At the start of each frame the host controller generates a Start of Frame (SOF) packet. The SOF packet is used to synchronise to the start of the frame and to keep track of the frame number. Within each frame packets are transferred, either from host to device (out) or from device to host (in). Transfers are always initiated by the host (polled transfers). Therefore there can only be one host per USB bus. Each transfer of a packet has a status stage in which the recipient of the data can return either ACK (acknowledge reception), NAK (retry), STALL (error condition) or nothing (garbled data stage, device not available or disconnected). Section 8.5 of the USB 2.0 Specification explains the details of packets in more detail. Four different types of transfers can occur on a USB bus: control, bulk, interrupt and isochronous. The types of transfers and their characteristics are described below. Large transfers between the device on the USB bus and the device driver are split up into multiple packets by the host controller or the HC driver. Device requests (control transfers) to the default endpoints are special. They consist of two or three phases: SETUP, DATA (optional) and STATUS. The set-up packet is sent to the device. If there is a data phase, the direction of the data packet(s) is given in the set-up packet. The direction in the status phase is the opposite of the direction during the data phase, or IN if there was no data phase. The host controller hardware also provides registers with the current status of the root ports and the changes that have occurred since the last reset of the status change register. Access to these registers is provided through a virtualised hub as suggested in the USB specification. The virtual hub must comply with the hub device class given in chapter 11 of that specification. 
It must provide a default pipe through which device requests can be sent to it. It returns the standard and hub class specific set of descriptors. It should also provide an interrupt pipe that reports changes happening at its ports. There are currently two specifications for host controllers available: Universal Host Controller Interface (UHCI) from Intel and Open Host Controller Interface (OHCI) from Compaq, Microsoft, and National Semiconductor. The UHCI specification has been designed to reduce hardware complexity by requiring the host controller driver to supply a complete schedule of the transfers for each frame. OHCI type controllers are much more independent by providing a more abstract interface doing a lot of work themselves. UHCI USB UHCI The UHCI host controller maintains a framelist with 1024 pointers to per frame data structures. It understands two different data types: transfer descriptors (TD) and queue heads (QH). Each TD represents a packet to be communicated to or from a device endpoint. QHs are a means to group TDs (and QHs) together. Each transfer consists of one or more packets. The UHCI driver splits large transfers into multiple packets. For every transfer, apart from isochronous transfers, a QH is allocated. For every type of transfer these QHs are collected at a QH for that type. Isochronous transfers have to be executed first because of the fixed latency requirement and are directly referred to by the pointer in the framelist. The last isochronous TD refers to the QH for interrupt transfers for that frame. All QHs for interrupt transfers point at the QH for control transfers, which in turn points at the QH for bulk transfers. The following diagram gives a graphical overview of this: This results in the following schedule being run in each frame. After fetching the pointer for the current frame from the framelist the controller first executes the TDs for all the isochronous packets in that frame. 
The last of these TDs refers to the QH for the interrupt transfers for that frame. The host controller will then descend from that QH to the QHs for the individual interrupt transfers. After finishing that queue, the QH for the interrupt transfers will refer the controller to the QH for all control transfers. It will execute all the subqueues scheduled there, followed by all the transfers queued at the bulk QH. To facilitate the handling of finished or failed transfers different types of interrupts are generated by the hardware at the end of each frame. In the last TD for a transfer the Interrupt-On-Completion bit is set by the HC driver to flag an interrupt when the transfer has completed. An error interrupt is flagged if a TD reaches its maximum error count. If the short packet detect bit is set in a TD and less than the set packet length is transferred this interrupt is flagged to notify the controller driver of the completed transfer. It is the host controller driver's task to find out which transfer has completed or produced an error. When called the interrupt service routine will locate all the finished transfers and call their callbacks. Refer to the UHCI Specification for a more elaborate description. OHCI USB OHCI Programming an OHCI host controller is much simpler. The controller assumes that a set of endpoints is available, and is aware of scheduling priorities and the ordering of the types of transfers in a frame. The main data structure used by the host controller is the endpoint descriptor (ED) to which a queue of transfer descriptors (TDs) is attached. The ED contains the maximum packet size allowed for an endpoint and the controller hardware does the splitting into packets. The pointers to the data buffers are updated after each transfer and when the start and end pointer are equal, the TD is retired to the done-queue. The four types of endpoints (interrupt, isochronous, control, and bulk) have their own queues. 
Control and bulk endpoints are queued each at their own queue. Interrupt EDs are queued in a tree, with the level in the tree defining the frequency at which they run. The schedule being run by the host controller in each frame looks as follows. The controller will first run the non-periodic control and bulk queues, up to a time limit set by the HC driver. Then the interrupt transfers for that frame number are run, by using the lower five bits of the frame number as an index into level 0 of the tree of interrupt EDs. At the end of this tree the isochronous EDs are connected and these are traversed subsequently. The isochronous TDs contain the frame number of the first frame the transfer should be run in. After all the periodic transfers have been run, the control and bulk queues are traversed again. Periodically the interrupt service routine is called to process the done queue and call the callbacks for each transfer and reschedule interrupt and isochronous endpoints. See the OHCI Specification for a more elaborate description. The middle layer provides access to the device in a controlled way and maintains resources in use by the different drivers and the services layer. The layer takes care of the following aspects: The device configuration information The pipes to communicate with a device Probing and attaching and detaching from a device. USB Device Information Device Configuration Information Each device provides different levels of configuration information. Each device has one or more configurations, of which one is selected during probe/attach. A configuration provides power and bandwidth requirements. Within each configuration there can be multiple interfaces. A device interface is a collection of endpoints. For example USB speakers can have an interface for the audio data (Audio Class) and an interface for the knobs, dials and buttons (HID Class). All interfaces in a configuration are active at the same time and can be attached to by different drivers. 
Each interface can have alternates, providing different quality of service parameters. In cameras, for example, this is used to provide different frame sizes and numbers of frames per second. Within each interface, 0 or more endpoints can be specified. Endpoints are the unidirectional access points for communicating with a device. They provide buffers to temporarily store incoming or outgoing data from the device. Each endpoint has a unique address within a configuration, the endpoint's number plus its direction. The default endpoint, endpoint 0, is not part of any interface and available in all configurations. It is managed by the services layer and not directly available to device drivers. This hierarchical configuration information is described in the device by a standard set of descriptors (see section 9.6 of the USB specification). They can be requested through the Get Descriptor Request. The services layer caches these descriptors to avoid unnecessary transfers on the USB bus. Access to the descriptors is provided through function calls. Device descriptors: General information about the device, like Vendor, Product and Revision Id, supported device class, subclass and protocol if applicable, maximum packet size for the default endpoint, etc. Configuration descriptors: The number of interfaces in this configuration, suspend and resume functionality supported and power requirements. Interface descriptors: interface class, subclass and protocol if applicable, number of alternate settings for the interface and the number of endpoints. Endpoint descriptors: Endpoint address, direction and type, maximum packet size supported and polling frequency if type is interrupt endpoint. There is no descriptor for the default endpoint (endpoint 0) and it is never counted in an interface descriptor. String descriptors: In the other descriptors string indices are supplied for some fields. These can be used to retrieve descriptive strings, possibly in multiple languages. 
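The endpoint address, direction and type fields listed above can be pulled apart with a few bit masks. The following is a minimal sketch in C and not the code used by the FreeBSD USB stack: the bit positions (bit 7 of bEndpointAddress for the direction, the low four bits for the endpoint number, the low two bits of bmAttributes for the transfer type) follow chapter 9 of the USB specification, while struct ep_desc and the ep_number(), ep_is_in() and ep_type() helpers are hypothetical names made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical, simplified view of a standard endpoint descriptor. */
struct ep_desc {
	uint8_t  bEndpointAddress;	/* bit 7: direction, bits 3..0: number */
	uint8_t  bmAttributes;		/* bits 1..0: transfer type */
	uint16_t wMaxPacketSize;	/* largest packet the endpoint accepts */
	uint8_t  bInterval;		/* polling interval (interrupt endpoints) */
};

/* Transfer type encoding of bmAttributes bits 1..0. */
enum ep_xfer_type {
	EP_CONTROL = 0,
	EP_ISOCHRONOUS = 1,
	EP_BULK = 2,
	EP_INTERRUPT = 3
};

/* Endpoint number, without the direction bit. */
static inline unsigned
ep_number(const struct ep_desc *ed)
{
	return (ed->bEndpointAddress & 0x0f);
}

/* Non-zero when data flows from device to host (an IN endpoint). */
static inline int
ep_is_in(const struct ep_desc *ed)
{
	return ((ed->bEndpointAddress & 0x80) != 0);
}

/* Transfer type served by this endpoint. */
static inline enum ep_xfer_type
ep_type(const struct ep_desc *ed)
{
	return ((enum ep_xfer_type)(ed->bmAttributes & 0x03));
}
```

With these helpers, an endpoint descriptor carrying address 0x81 and attributes 0x03 decodes as interrupt endpoint 1 in the IN direction, which is the kind of endpoint a HID interface typically exposes.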
Class specifications can add their own descriptor types that are available through the GetDescriptor Request. Pipes Communication to endpoints on a device flows through so-called pipes. Drivers submit transfers for an endpoint to its pipe and provide a callback to be called on completion or failure of the transfer (asynchronous transfers) or wait for completion (synchronous transfers). Transfers to an endpoint are serialised in the pipe. A transfer can either complete, fail or time-out (if a time-out has been set). There are two types of time-outs for transfers. Time-outs can happen due to time-out on the USB bus (milliseconds). These time-outs are seen as failures and can be due to disconnection of the device. A second form of time-out is implemented in software and is triggered when a transfer does not complete within a specified amount of time (seconds). These are caused by a device acknowledging negatively (NAK) the transferred packets. The cause for this is the device not being ready to receive data, buffer under- or overrun or protocol errors. If a transfer over a pipe is larger than the maximum packet size specified in the associated endpoint descriptor, the host controller (OHCI) or the HC driver (UHCI) will split the transfer into packets of maximum packet size, with the last packet possibly smaller than the maximum packet size. Sometimes it is not a problem for a device to return less data than requested. For example a bulk-in transfer to a modem might request 200 bytes of data, but the modem has only 5 bytes available at that time. The driver can set the short packet (SPD) flag. It allows the host controller to accept a packet even if the amount of data transferred is less than requested. This flag is only valid for in-transfers, as the amount of data to be sent to a device is always known beforehand. If an unrecoverable error occurs in a device during a transfer the pipe is stalled. 
Before any more data is accepted or sent the driver needs to resolve the cause of the stall and clear the endpoint stall condition by sending the clear endpoint halt device request over the default pipe. The default endpoint should never stall. There are four different types of endpoints and corresponding pipes: Control pipe / default pipe: There is one control pipe per device, connected to the default endpoint (endpoint 0). The pipe carries the device requests and associated data. The difference between transfers over the default pipe and other pipes is that the protocol for the transfers is described in the USB specification. These requests are used to reset and configure the device. A basic set of commands that must be supported by each device is provided in chapter 9 of the USB specification. The commands supported on this pipe can be extended by a device class specification to support additional functionality. Bulk pipe: This is the USB equivalent to a raw transmission medium. Interrupt pipe: The host sends a request for data to the device and if the device has nothing to send, it will NAK the data packet. Interrupt transfers are scheduled at a frequency specified when creating the pipe. Isochronous pipe: These pipes are intended for isochronous data, for example video or audio streams, with fixed latency, but no guaranteed delivery. Some support for pipes of this type is available in the current implementation. Packets in control, bulk and interrupt transfers are retried if an error occurs during transmission or the device acknowledges the packet negatively (NAK) due to, for example, lack of buffer space to store the incoming data. Isochronous packets are however not retried in case of failed delivery or NAK of a packet as this might violate the timing constraints. The availability of the necessary bandwidth is calculated during the creation of the pipe. Transfers are scheduled within frames of 1 millisecond. 
The bandwidth allocation within a frame is prescribed by the USB specification, section 5.6 [2]. Isochronous and interrupt transfers are allowed to consume up to 90% of the bandwidth within a frame. Packets for control and bulk transfers are scheduled after all isochronous and interrupt packets and will consume all the remaining bandwidth. More information on scheduling of transfers and bandwidth reclamation can be found in chapter 5 of the USB specification, section 1.3 of the UHCI specification, and section 3.4.2 of the OHCI specification. Device Probe and Attach USB probe After the notification by the hub that a new device has been connected, the services layer switches on the port, providing the device with 100 mA of current. At this point the device is in its default state and listening to device address 0. The services layer will proceed to retrieve the various descriptors through the default pipe. After that it will send a Set Address request to move the device away from the default device address (address 0). Multiple device drivers might be able to support the device. For example a modem driver might be able to support an ISDN TA through the AT compatibility interface. A driver for that specific model of the ISDN adapter might however be able to provide much better support for this device. To support this flexibility, the probes return priorities indicating their level of support. Support for a specific revision of a product ranks the highest and the generic driver the lowest priority. It might also be that multiple drivers could attach to one device if there are multiple interfaces within one configuration. Each driver only needs to support a subset of the interfaces. The probing for a driver for a newly attached device checks first for device specific drivers. If not found, the probe code iterates over all supported configurations until a driver attaches in a configuration. 
To support devices with multiple drivers on different interfaces, the probe iterates over all interfaces in a configuration that have not yet been claimed by a driver. Configurations that exceed the power budget for the hub are ignored. During attach the driver should initialise the device to its proper state, but not reset it, as this will make the device disconnect itself from the bus and restart the probing process for it. To avoid consuming unnecessary bandwidth, drivers should not claim the interrupt pipe at attach time, but should postpone allocating the pipe until the file is opened and the data is actually used. When the file is closed the pipe should be closed again, even though the device might still be attached. Device Disconnect and Detach USB disconnect A device driver should expect to receive errors during any transaction with the device. The design of USB supports and encourages the disconnection of devices at any point in time. Drivers should make sure that they do the right thing when the device disappears. Furthermore a device that has been disconnected and reconnected will not be reattached at the same device instance. This might change in the future when more devices support serial numbers (see the device descriptor) or other means of defining an identity for a device have been developed. The disconnection of a device is signaled by a hub in the interrupt packet delivered to the hub driver. The status change information indicates which port has seen a connection change. The device detach methods of all device drivers for the device connected on that port are called and the structures cleaned up. If the port status indicates that in the meantime a device has been connected to that port, the procedure for probing and attaching the device will be started. A device reset will produce a disconnect-connect sequence on the hub and will be handled as described above. 
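The probe priorities described in the Device Probe and Attach section can be pictured as picking the strongest match among the candidate drivers. The following is a simplified model in C, not the actual FreeBSD probe code: the match_level values, struct candidate, the two example probe routines, the vendor and product numbers, and pick_driver() are all hypothetical and exist only to illustrate the idea that a model-specific driver outranks a more generic one.

```c
#include <stddef.h>

/* Hypothetical match levels, ordered from weakest to strongest support. */
enum match_level {
	MATCH_NONE = 0,		/* driver cannot handle the device */
	MATCH_GENERIC,		/* catch-all generic driver */
	MATCH_CLASS,		/* driver for a whole class of devices */
	MATCH_PRODUCT,		/* driver for this vendor/product */
	MATCH_PRODUCT_REV	/* driver for this exact product revision */
};

struct candidate {
	const char *name;
	enum match_level (*probe)(unsigned vendor, unsigned product,
	    unsigned rev);
};

/* Made-up IDs standing in for one specific ISDN adapter model. */
#define EX_VENDOR	0x1234
#define EX_PRODUCT	0x0001

/* Modem driver: accepts anything that offers an AT-style interface. */
static enum match_level
modem_probe(unsigned vendor, unsigned product, unsigned rev)
{
	(void)vendor; (void)product; (void)rev;
	return (MATCH_CLASS);
}

/* Model-specific driver: only matches the one adapter it knows. */
static enum match_level
isdn_probe(unsigned vendor, unsigned product, unsigned rev)
{
	(void)rev;
	if (vendor == EX_VENDOR && product == EX_PRODUCT)
		return (MATCH_PRODUCT);
	return (MATCH_NONE);
}

static const struct candidate example_drivers[] = {
	{ "modem", modem_probe },
	{ "isdn_ta", isdn_probe },
};

/* Return the candidate with the highest match level, or NULL if none. */
static const struct candidate *
pick_driver(const struct candidate *c, size_t n, unsigned vendor,
    unsigned product, unsigned rev)
{
	const struct candidate *best = NULL;
	enum match_level best_level = MATCH_NONE;
	size_t i;

	for (i = 0; i < n; i++) {
		enum match_level l = c[i].probe(vendor, product, rev);

		if (l > best_level) {
			best_level = l;
			best = &c[i];
		}
	}
	return (best);
}
```

In this model the ISDN adapter is claimed by the isdn_ta entry, while any other device falls back to the modem entry. The real stack additionally iterates over configurations and unclaimed interfaces, which this sketch leaves out.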
USB Drivers Protocol Information The protocol used over pipes other than the default pipe is undefined by the USB specification. Information on this can be found from various sources. The most accurate source is the developer's section on the USB home pages. From these pages, a growing number of device class specifications are available. These specifications specify what a compliant device should look like from a driver perspective, basic functionality it needs to provide and the protocol that is to be used over the communication channels. The USB specification includes the description of the Hub Class. A class specification for Human Interface Devices (HID) has been created to cater for keyboards, tablets, bar-code readers, buttons, knobs, switches, etc. A third example is the class specification for mass storage devices. For a full list of device classes see the developers section on the USB home pages. For many devices the protocol information has not yet been published however. Information on the protocol being used might be available from the company making the device. Some companies will require you to sign a Non-Disclosure Agreement (NDA) before giving you the specifications. This in most cases precludes making the driver open source. Another good source of information is the Linux driver sources, as a number of companies have started to provide drivers for Linux for their devices. It is always a good idea to contact the authors of those drivers for their source of information. Example: Human Interface Devices The specification for the Human Interface Devices like keyboards, mice, tablets, buttons, dials, etc. is referred to in other device class specifications and is used in many devices. For example audio speakers provide endpoints to the digital to analogue converters and possibly an extra pipe for a microphone. They also provide a HID endpoint in a separate interface for the buttons and dials on the front of the device. 
The same is true for the monitor control class. It is straightforward to build support for these interfaces through the available kernel and userland libraries together with the HID class driver or the generic driver. Another device that serves as an example for interfaces within one configuration driven by different device drivers is a cheap keyboard with a built-in legacy mouse port. To avoid the cost of including the hardware for a USB hub in the device, manufacturers combined the mouse data received from the PS/2 port on the back of the keyboard and the key presses from the keyboard into two separate interfaces in the same configuration. The mouse and keyboard drivers each attach to the appropriate interface and allocate the pipes to the two independent endpoints. USB firmware Example: Firmware download Many devices that have been developed are based on a general purpose processor with an - additional USB core added to it. Because the development of + additional USB core added to it. Since the development of drivers and firmware for USB devices is still very new, many devices require the downloading of the firmware after they have been connected. The procedure followed is straightforward. The device identifies itself through a vendor and product Id. The first driver probes and attaches to it and downloads the firmware into it. After that the device soft resets itself and the driver is detached. After a short pause the device announces its presence on the bus. The device will have changed its vendor/product/revision Id to reflect the fact that it has been supplied with firmware and as a consequence a second driver will probe it and attach to it. An example of these types of devices is the ActiveWire I/O board, based on the EZ-USB chip. For this chip a generic firmware downloader is available. The firmware downloaded into the ActiveWire board changes the revision Id. 
It will then perform a soft reset of the USB part of the EZ-USB chip to disconnect from the USB bus and again reconnect. Example: Mass Storage Devices Support for mass storage devices is mainly built around existing protocols. The Iomega USB Zipdrive is based on the SCSI version of their drive. The SCSI commands and status messages are wrapped in blocks and transferred over the bulk pipes to and from the device, emulating a SCSI controller over the USB wire. ATAPI and UFI commands are supported in a similar fashion. ATAPI The Mass Storage Specification supports two different types of wrapping of the command block. The initial attempt was based on sending the command and status through the default pipe and using bulk transfers for the data to be moved between the host and the device. Based on experience a second approach was designed that was based on wrapping the command and status blocks and sending them over the bulk out and in endpoints. The specification specifies exactly what has to happen when and what has to be done in case an error condition is encountered. The biggest challenge when writing drivers for these devices is to fit a USB-based protocol into the existing support for mass storage devices. CAM provides hooks to do this in a fairly straightforward way. ATAPI is less simple as historically the IDE interface has never had many different appearances. The support for the USB floppy from Y-E Data is again less straightforward as a new command set has been designed.
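The second wrapping approach, in which the command block travels over the bulk out endpoint, can be illustrated with a minimal sketch that packs a SCSI command into a wrapper. The 31-byte layout used here (signature, tag, data transfer length, flags, LUN, command length, command bytes) follows the Command Block Wrapper of the Bulk-Only Transport specification as an assumption, since the text above does not give the layout; the function name cbw_build and the constants are hypothetical and do not come from any FreeBSD driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CBW_SIGNATURE		0x43425355u	/* bytes "USBC" on the wire */
#define CBW_SIZE		31		/* fixed wrapper length */
#define CBW_FLAG_DATA_IN	0x80		/* device-to-host data phase */
#define CBW_MAX_CDB		16		/* room for the wrapped command */

/* Store a 32-bit value in little-endian byte order. */
static void
put_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)((v >> 8) & 0xff);
	p[2] = (uint8_t)((v >> 16) & 0xff);
	p[3] = (uint8_t)((v >> 24) & 0xff);
}

/*
 * Wrap a command block (for example a SCSI CDB) for the bulk out pipe.
 * Returns 0 on success and -1 if the arguments cannot be represented.
 */
int
cbw_build(uint8_t out[CBW_SIZE], uint32_t tag, uint32_t data_len,
    int data_in, uint8_t lun, const uint8_t *cdb, uint8_t cdb_len)
{
	if (cdb == NULL || cdb_len == 0 || cdb_len > CBW_MAX_CDB || lun > 15)
		return (-1);

	memset(out, 0, CBW_SIZE);		/* unused command bytes stay zero */
	put_le32(out + 0, CBW_SIGNATURE);
	put_le32(out + 4, tag);			/* echoed back in the status block */
	put_le32(out + 8, data_len);		/* bytes expected in the data phase */
	out[12] = data_in ? CBW_FLAG_DATA_IN : 0;
	out[13] = lun;
	out[14] = cdb_len;
	memcpy(out + 15, cdb, cdb_len);
	return (0);
}
```

A driver following this scheme would send the resulting 31 bytes on the bulk out endpoint, move the data over the bulk pipe in the direction given by the flag, and then read the status block from the bulk in endpoint, matching it to the command by its tag.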