diff --git a/en_US.ISO8859-1/books/arch-handbook/boot/chapter.sgml b/en_US.ISO8859-1/books/arch-handbook/boot/chapter.sgml new file mode 100644 index 0000000000..3f88803fb2 --- /dev/null +++ b/en_US.ISO8859-1/books/arch-handbook/boot/chapter.sgml @@ -0,0 +1,970 @@ + + + + + + + Sergey + Lyubka + Contributed by + + + + Bootstrapping and kernel initialization + + + Synopsis + + This chapter is an overview of the boot and system + initialization process, starting from the BIOS (firmware) POST, + to the first user process creation. Since the initial steps of + system startup are very architecture dependent, the IA-32 + architecture is used as an example. + + + + Overview + + A computer running FreeBSD can boot by several methods, + although the most common method, booting from a harddisk where + the OS is installed, will be discussed here. The boot process + is divided into several steps: + + + BIOS POST + boot0 stage + boot2 stage + loader stage + kernel initialization + + + The boot0 and boot2 stages are also referred to as + bootstrap stages 1 and 2 in &man.boot.8; as + the first steps in FreeBSD's 3-stage bootstrapping procedure. + Various information is printed on the screen at each stage, so + visually you may recognize them using the table that follows. + Please note that the actual data may differ from machine to + machine: + + + + + + may vary BIOS + (firmware) messages + + + +F1 FreeBSD +F2 BSD +F5 Disk 2 + + boot0 + + + +>>FreeBSD/i386 BOOT +Default: 1:ad(1,a)/boot/loader +boot: + + + boot2This prompt will appear + if the user presses a key just after selecting an OS to + boot at the boot0 + stage. 
+ + + +BTX loader 1.0 BTX version is 1.01 +BIOS drive A: is disk0 +BIOS drive C: is disk1 +BIOS 639kB/64512kB available memory +FreeBSD/i386 bootstrap loader, Revision 0.8 +Console internal video/keyboard +(jkh@bento.freebsd.org, Mon Nov 20 11:41:23 GMT 2000) +/kernel text=0x1234 data=0x2345 syms=[0x4+0x3456] +Hit [Enter] to boot immediately, or any other key for command prompt +Booting [kernel] in 9 seconds..._ + + loader + + + +Copyright (c) 1992-2002 The FreeBSD Project. +Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 + The Regents of the University of California. All rights reserved. +FreeBSD 4.6-RC #0: Sat May 4 22:49:02 GMT 2002 + devnull@kukas:/usr/obj/usr/src/sys/DEVNULL +Timecounter "i8254" frequency 1193182 Hz + + kernel + + + + + + + + + BIOS POST + + When the PC powers on, the processor's registers are set + with some predefined values. One of the registers is the + instruction pointer register, and its value + after a power on is well defined: it is a 32-bit value of + 0xfffffff0. The instruction pointer register points to code to + be executed by the processor. Another register is the + cr0 32-bit control register, and its value + just after the reboot is 0. One of cr0's bits, the PE bit + (Protection Enable), indicates whether the processor is running + in protected or real mode. Since at boot time this bit is + cleared, the processor boots in real mode. Real mode means, + among other things, that linear and physical addresses are + identical. + + The value of 0xfffffff0 is slightly less than 4Gb, so unless + the machine has 4Gb of physical memory, it cannot point to a valid + memory address. The computer's hardware translates this address + so that it points to a BIOS memory block. + + BIOS stands for Basic Input Output + System, and it is a chip on the motherboard that has + a relatively small amount of read-only memory (ROM). 
This + memory contains various low-level routines that are specific to + the hardware supplied with the motherboard. So, the processor + will first jump to the address 0xfffffff0, which really resides + in the BIOS's memory. Usually this address contains a jump + instruction to the BIOS's POST routines. + + POST stands for Power On Self Test. + This is a set of routines including the memory check, system bus + check and other low-level initialization, so that the CPU can set up + the computer properly. The important step of this stage is + determining the boot device. All modern BIOSes allow the boot + device to be set manually, so you can boot from a floppy, + CD-ROM, harddisk etc. + + The very last thing in the POST is the INT + 0x19 instruction. That instruction reads 512 bytes + from the first sector of the boot device into the memory at address + 0x7c00. The term first sector originates + from harddrive architecture, where the magnetic plate is divided + into a number of cylindrical tracks. Tracks are numbered, and + every track is divided into a number (usually 64) of sectors. Track + number 0 is the outermost on the magnetic plate, and sector 1, + the first sector (tracks, or, cylinders, are numbered starting + from 0, but sectors start from 1), has a special meaning. + It is also called the Master Boot Record, or MBR. The remaining + sectors on the first track are never used. (Some + utilities such as &man.disklabel.8; may store information in + this area, mostly in the second + sector.) + + + + + boot0 stage + + Take a look at the file /boot/boot0. + This is a small 512-byte file, and it is exactly what FreeBSD's + installation procedure wrote to your harddisk's MBR if you chose + the "bootmanager" option at installation time. + + As mentioned previously, the INT 0x19 + instruction loads an MBR, i.e. the boot0 + content, into the memory at address 0x7c00. 
Taking a look at + the file sys/boot/i386/boot0/boot0.s can + give a guess at what is happening there - this is the boot + manager, which is an awesome piece of code written by Robert + Nordier. + + The MBR, or, boot0, has a special + structure starting from offset 0x1be, called the + partition table. It has 4 records of 16 + bytes each, called partition records, which + represent how the harddisk(s) are partitioned, or, in FreeBSD's + terminology, sliced. One byte of those 16 says whether a + partition (slice) is bootable or not. Exactly one record must + have that flag set, otherwise boot0's code + will refuse to proceed. + + A partition record has the following fields: + + + the 1-byte filesystem type + the 1-byte bootable flag + the 6 byte descriptor in CHS + format + the 8 byte descriptor in LBA + format + + + A partition record descriptor has the information about + where exactly the partition resides on the drive. Both + descriptors, LBA and CHS, describe the same information, but in + different ways: LBA (Logical Block Addressing) has the starting + sector for the partition and the partition's length, while CHS + (Cylinder Head Sector) has coordinates for the first and last + sectors of the partition. + + The boot manager scans the partition table and prints the + menu on the screen so the user can select what disk and what + slice to boot. By pressing an appropriate key, + boot0 performs the following + actions: + + + modifies the bootable flag for the selected + partition to make it bootable, and clears the + previous + + saves itself to disk to remember what partition + (slice) has been selected so to use it as the default on the + next boot + + loads the first sector of the selected partition + (slice) into memory and jumps there + + + What kind of data should reside on the very first sector of + a bootable partition (slice), in our case, a FreeBSD slice? As + you may have already guessed, it is + boot2. 
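The partition table described in the boot0 section can be illustrated with a small C sketch. It is not taken from boot0.s, which is written in assembly. The struct and function names (part_rec, mbr_parse, mbr_active) are hypothetical, and the byte offsets inside a record and the flag value 0x80 follow the conventional PC MBR layout, which the text above does not spell out; treat them as assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MBR_PTABLE_OFF 0x1be   /* partition table offset, as stated in the text */
#define MBR_NPARTS     4       /* four records of 16 bytes each */
#define MBR_ACTIVE     0x80    /* assumed value of the bootable flag */

/* Hypothetical decoded form of one partition record. */
struct part_rec {
    uint8_t  bootable;   /* 1-byte bootable flag */
    uint8_t  type;       /* 1-byte filesystem type */
    uint32_t lba_start;  /* LBA descriptor: starting sector */
    uint32_t lba_count;  /* LBA descriptor: length in sectors */
};

/* Read a little-endian 32-bit value from an unaligned byte buffer. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Decode the four partition records of a 512-byte MBR image. */
static void mbr_parse(const uint8_t *sector, struct part_rec out[MBR_NPARTS])
{
    assert(sector != NULL);
    for (int i = 0; i < MBR_NPARTS; i++) {
        const uint8_t *rec = sector + MBR_PTABLE_OFF + 16 * i;

        out[i].bootable  = rec[0];
        /* rec[1..3] and rec[5..7] hold the 6-byte CHS descriptor (skipped). */
        out[i].type      = rec[4];
        out[i].lba_start = le32(rec + 8);
        out[i].lba_count = le32(rec + 12);
    }
}

/* Return the index of the single active record, or -1 if there is not exactly one. */
static int mbr_active(const struct part_rec recs[MBR_NPARTS])
{
    int found = -1;

    for (int i = 0; i < MBR_NPARTS; i++) {
        if (recs[i].bootable == MBR_ACTIVE) {
            if (found != -1)
                return -1;      /* more than one record is flagged */
            found = i;
        }
    }
    return found;
}
```

Only the LBA descriptor is decoded here because, as noted above, the CHS descriptor carries the same information in a different form. Returning -1 for zero or multiple active records mirrors the rule that exactly one record must have the flag set.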
+ + + + + boot2 stage + + You might wonder, why boot2 comes after boot0, and not + boot1. Actually, there is a 512-byte file called + boot1 in the directory + /boot as well. It is used for booting from + a floppy. When booting from a floppy, + boot1 plays the same role as + boot0 for a harddisk: it locates boot2 and + runs it. + + You may have realized that a file + /boot/mbr exists as well. It is a + simplified version of boot0. The code in + mbr does not provide a menu for the user, + it just blindly boots the partition marked active. + + The code implementing boot2 resides in + sys/boot/i386/boot2/, and the executable + itself is in /boot. The files boot0 and + boot2 that are in /boot are not used by the + bootstrap, but by utilities such as + boot0cfg. The actual position for + boot0 is in the MBR. For boot2 it is the beginning of a + bootable FreeBSD slice. These locations are not under the + filesystem's control, so they are invisible to commands like + ls. + + The main task for boot2 is to load the file + /boot/loader, which is the third stage in + the bootstrapping procedure. The code in boot2 cannot use any + services like open() and + read(), since the kernel is not yet loaded. + It must scan the harddisk, knowing about the filesystem + structure, find the file /boot/loader, read + it into memory using a BIOS service, and then pass the execution + to the loader's entry point. + + Besides that, boot2 prompts for user input so the loader can + be booted from different disk, unit, slice and partition. + + The boot2 binary is created in special way: + sys/boot/i386/boot2/Makefile +boot2: boot2.ldr boot2.bin ${BTX}/btx/btx + btxld -v -E ${ORG2} -f bin -b ${BTX}/btx/btx -l boot2.ldr \ + -o boot2.ld -P 1 boot2.bin + + This Makefile snippet shows that &man.btxld.8; is used to + link the binary. BTX, which stands for BooT eXtender, is a + piece of code that provides a protected mode environment for the + program, called the client, that it is linked with. 
So boot2 is + a BTX client, i.e. it uses the service provided by BTX. + + The btxld utility is the linker. + It links two binaries together. The difference between + &man.btxld.8; and &man.ld.1; is that + ld usually links object files into a + shared object or executable, while + btxld links an object file with the + BTX, producing the binary file suitable to be put on the + beginning of the partition for the system boot. + + boot0 passes the execution to BTX's entry point. BTX then + switches the processor to protected mode, and prepares a simple + environment before calling the client. This includes: + + + virtual v86 mode. That means, the BTX is a v86 + monitor. Real mode instructions like pushf, popf, cli, sti, if + called by the client, will work. + + Interrupt Descriptor Table (IDT) is set up so + all hardware interrupts are routed to the default BIOS's + handlers, and interrupt 0x30 is set up to be the syscall + gate. + + Two system calls: exec and + exit, are defined: + + sys/boot/i386/btx/lib/btxsys.s: + .set INT_SYS,0x30 # Interrupt number +# +# System call: exit +# +__exit: xorl %eax,%eax # BTX system + int $INT_SYS # call 0x0 +# +# System call: exec +# +__exec: movl $0x1,%eax # BTX system + int $INT_SYS # call 0x1 + + + BTX creates a Global Descriptor Table (GDT): + + sys/boot/i386/btx/btx/btx.s: +gdt: .word 0x0,0x0,0x0,0x0 # Null entry + .word 0xffff,0x0,0x9a00,0xcf # SEL_SCODE + .word 0xffff,0x0,0x9200,0xcf # SEL_SDATA + .word 0xffff,0x0,0x9a00,0x0 # SEL_RCODE + .word 0xffff,0x0,0x9200,0x0 # SEL_RDATA + .word 0xffff,MEM_USR,0xfa00,0xcf# SEL_UCODE + .word 0xffff,MEM_USR,0xf200,0xcf# SEL_UDATA + .word _TSSLM,MEM_TSS,0x8900,0x0 # SEL_TSS + + The client's code and data start from address MEM_USR + (0xa000), and a selector (SEL_UCODE) points to the client's code + segment. The SEL_UCODE descriptor has Descriptor Privilege + Level (DPL) 3, which is the lowest privilege level. 
But the + INT 0x30 instruction handler resides in a + segment pointed to by the SEL_SCODE (supervisor code) selector, + as shown by the code that creates an IDT: + + mov $SEL_SCODE,%dh # Segment selector +init.2: shr %bx # Handle this int? + jnc init.3 # No + mov %ax,(%di) # Set handler offset + mov %dh,0x2(%di) # and selector + mov %dl,0x5(%di) # Set P:DPL:type + add $0x4,%ax # Next handler + + So, when the client calls __exec(), the + code will be executed with the highest privileges. This allows + the kernel to change the protected mode data structures, such as + page tables, GDT, IDT, etc., later, if needed. + + boot2 defines an important structure, struct + bootinfo. This structure is initialized by boot2 and + passed to the loader, and then further to the kernel. Some + fields of this structure are set by boot2, the rest by the + loader. This structure, among other information, contains the + kernel filename, BIOS harddisk geometry, BIOS drive number for + boot device, physical memory available, envp + pointer etc. The definition for it is: + + /usr/include/machine/bootinfo.h +struct bootinfo { + u_int32_t bi_version; + u_int32_t bi_kernelname; /* represents a char * */ + u_int32_t bi_nfs_diskless; /* struct nfs_diskless * */ + /* End of fields that are always present. */ +#define bi_endcommon bi_n_bios_used + u_int32_t bi_n_bios_used; + u_int32_t bi_bios_geom[N_BIOS_GEOM]; + u_int32_t bi_size; + u_int8_t bi_memsizes_valid; + u_int8_t bi_bios_dev; /* bootdev BIOS unit number */ + u_int8_t bi_pad[2]; + u_int32_t bi_basemem; + u_int32_t bi_extmem; + u_int32_t bi_symtab; /* struct symtab * */ + u_int32_t bi_esymtab; /* struct symtab * */ + /* Items below only from advanced bootloader */ + u_int32_t bi_kernend; /* end of kernel space */ + u_int32_t bi_envp; /* environment */ + u_int32_t bi_modulep; /* preloaded modules */ +}; + + boot2 enters into an infinite loop waiting for user input, + then calls load(). 
If the user does not + press anything, the loop is broken by a timeout, so + load() will load the default file + (/boot/loader). Functions ino_t + lookup(char *filename) and int xfsread(ino_t + inode, void *buf, size_t nbyte) are used to read the + content of a file into memory. /boot/loader + is an ELF binary, but one in which the ELF header is prepended with + a.out's struct exec structure. + load() scans the loader's ELF header, loading + the content of /boot/loader into memory, and + passing the execution to the loader's entry: + + sys/boot/i386/boot2/boot2.c: + __exec((caddr_t)addr, RB_BOOTINFO | (opts & RBX_MASK), + MAKEBOOTDEV(dev_maj[dsk.type], 0, dsk.slice, dsk.unit, dsk.part), + 0, 0, 0, VTOP(&bootinfo)); + + + + + <application>loader</application> stage + + loader is a BTX client as well. + I will not describe it here in detail, since there is a comprehensive + manpage written by Mike Smith, &man.loader.8;. The underlying + mechanisms and BTX were discussed above. + + The main task for the loader is to boot the kernel. When + the kernel is loaded into memory, it is called by the + loader: + + sys/boot/common/boot.c: + /* Call the exec handler from the loader matching the kernel */ + module_formats[km->m_loader]->l_exec(km); + + + + Kernel initialization + + Where exactly is the execution passed by the loader, + i.e. what is the kernel's actual entry point? Let us take a + look at the command that links the kernel: + + sys/conf/Makefile.i386: +ld -elf -Bdynamic -T /usr/src/sys/conf/ldscript.i386 -export-dynamic \ +-dynamic-linker /red/herring -o kernel -X locore.o \ +<lots of kernel .o files> + + A few interesting things can be seen in this line. First, + the kernel is an ELF dynamically linked binary, but the dynamic + linker for kernel is /red/herring, which is + definitely a bogus file. Second, taking a look at the file + sys/conf/ldscript.i386 gives an idea about + what ld options are used when + compiling a kernel. 
Reading through the first few lines, the + string + + sys/conf/ldscript.i386: +ENTRY(btext) + + says that a kernel's entry point is the symbol `btext'. + This symbol is defined in locore.s: + + sys/i386/i386/locore.s: + .text +/********************************************************************** + * + * This is where the bootblocks start us, set the ball rolling... + * + */ +NON_GPROF_ENTRY(btext) + + First what is done is the register EFLAGS is set to a + predefined value of 0x00000002, and then all the segment + registers are initialized: + + sys/i386/i386/locore.s +/* Don't trust what the BIOS gives for eflags. */ + pushl $PSL_KERNEL + popfl + +/* + * Don't trust what the BIOS gives for %fs and %gs. Trust the bootstrap + * to set %cs, %ds, %es and %ss. + */ + mov %ds, %ax + mov %ax, %fs + mov %ax, %gs + + btext calls the routines + recover_bootinfo(), + identify_cpu(), + create_pagetables(), which are also defined + in locore.s. Here is a description of what + they do: + + + + + + recover_bootinfo + + This routine parses the parameters to the kernel + passed from the bootstrap. The kernel may have been + booted in 3 ways: by the loader, described above, by the + old disk boot blocks, and by the old diskless boot + procedure. This function determines the booting method, + and stores the struct bootinfo + structure into the kernel memory. + + + identify_cpu This + functions tries to find out what CPU it is running on, + storing the value found in a variable + _cpu. + + + create_pagetables + This function allocates and fills out a Page Table Directory + at the top of the kernel memory area. 
+ + + + The next steps are enabling VME, if the CPU supports it: + + testl $CPUID_VME, R(_cpu_feature) + jz 1f + movl %cr4, %eax + orl $CR4_VME, %eax + movl %eax, %cr4 + + Then, enabling paging: + /* Now enable paging */ + movl R(_IdlePTD), %eax + movl %eax,%cr3 /* load ptd addr into mmu */ + movl %cr0,%eax /* get control word */ + orl $CR0_PE|CR0_PG,%eax /* enable paging */ + movl %eax,%cr0 /* and let's page NOW! */ + + The next three lines of code are because the paging was set, + so the jump is needed to continue the execution in virtualized + address space: + + pushl $begin /* jump to high virtualized address */ + ret + +/* now running relocated at KERNBASE where the system is linked to run */ +begin: + + The function init386() is called, with + a pointer to the first free physical page, after that + mi_startup(). init386 + is an architecture dependent initialization function, and + mi_startup() is an architecture independent + one (the 'mi_' prefix stands for Machine Independent). The + kernel never returns from mi_startup(), and + by calling it, the kernel finishes booting: + + sys/i386/i386/locore.s: + movl physfree, %esi + pushl %esi /* value of first for init386(first) */ + call _init386 /* wire 386 chip for unix operation */ + call _mi_startup /* autoconfiguration, mountroot etc */ + hlt /* never returns to here */ + + + <function>init386()</function> + + init386() is defined in + sys/i386/i386/machdep.c and performs + low-level initialization, specific to the i386 chip. The + switch to protected mode was performed by the loader. The + loader has created the very first task, in which the kernel + continues to operate. Before running straight away to the + code, I will enumerate the tasks the processor must complete + to initialize protected mode execution: + + + Initialize the kernel tunable parameters, passed from + the bootstrapping program. + Prepare the GDT. + Prepare the IDT. + Initialize the system console. 
+ Initialize the DDB, if it is compiled into kernel. + + Initialize the TSS. + Prepare the LDT. + Setup proc0's pcb. + + + + What init386() first does is + initialize the tunable parameters passed from bootstrap. This + is done by setting the environment pointer (envp) and calling + init_param1(). The envp pointer has been + passed from loader in the bootinfo + structure: + + sys/i386/i386/machdep.c: + kern_envp = (caddr_t)bootinfo.bi_envp + KERNBASE; + + /* Init basic tunables, hz etc */ + init_param1(); + + init_param1() is defined in + sys/kern/subr_param.c. That file has a + number of sysctls, and two functions, + init_param1() and + init_param2(), that are called from + init386(): + + sys/kern/subr_param.c + hz = HZ; + TUNABLE_INT_FETCH("kern.hz", &hz); + + TUNABLE_<typename>_FETCH is used to fetch the value + from the environment: + + /usr/src/sys/sys/kernel.h +#define TUNABLE_INT_FETCH(path, var) getenv_int((path), (var)) + + + Sysctl "kern.hz" is the system clock tick. Along with + this, the following sysctls are set by + init_param1(): kern.maxswzone, + kern.maxbcache, kern.maxtsiz, kern.dfldsiz, kern.dflssiz, + kern.maxssiz, kern.sgrowsiz. + + Then init386() prepares the Global + Descriptors Table (GDT). Every task on an x86 is running in + its own virtual address space, and this space is addressed by + a segment:offset pair. Say, for instance, the current + instruction to be executed by the processor lies at CS:EIP, + then the linear virtual address for that instruction would be + "the virtual address of code segment CS" + EIP. For + convenience, segments begin at virtual address 0 and end at a + 4Gb boundary. Therefore, the instruction's linear virtual + address for this example would just be the value of EIP. + Segment registers such as CS, DS etc are the selectors, + i.e. indexes, into GDT (to be more precise, an index is not a + selector itself, but the INDEX field of a selector). 
+ FreeBSD's GDT holds descriptors for 15 selectors per + CPU: + + sys/i386/i386/machdep.c: +union descriptor gdt[NGDT * MAXCPU]; /* global descriptor table */ + +sys/i386/include/segments.h: +/* + * Entries in the Global Descriptor Table (GDT) + */ +#define GNULL_SEL 0 /* Null Descriptor */ +#define GCODE_SEL 1 /* Kernel Code Descriptor */ +#define GDATA_SEL 2 /* Kernel Data Descriptor */ +#define GPRIV_SEL 3 /* SMP Per-Processor Private Data */ +#define GPROC0_SEL 4 /* Task state process slot zero and up */ +#define GLDT_SEL 5 /* LDT - eventually one per process */ +#define GUSERLDT_SEL 6 /* User LDT */ +#define GTGATE_SEL 7 /* Process task switch gate */ +#define GBIOSLOWMEM_SEL 8 /* BIOS low memory access (must be entry 8) */ +#define GPANIC_SEL 9 /* Task state to consider panic from */ +#define GBIOSCODE32_SEL 10 /* BIOS interface (32bit Code) */ +#define GBIOSCODE16_SEL 11 /* BIOS interface (16bit Code) */ +#define GBIOSDATA_SEL 12 /* BIOS interface (Data) */ +#define GBIOSUTIL_SEL 13 /* BIOS interface (Utility) */ +#define GBIOSARGS_SEL 14 /* BIOS interface (Arguments) */ + + Note that those #defines are not selectors themselves, but + just a field INDEX of a selector, so they are exactly the + indices of the GDT. for example, an actual selector for the + kernel code (GCODE_SEL) has the value 0x08. + + The next step is to initialize the Interrupt Descriptor + Table (IDT). This table is to be referenced by the processor + when a software or hardware interrupt occurs. For example, to + make a system call, user application issues the INT + 0x80 instruction. This is a software interrupt, so + the processor's hardware looks up a record with index 0x80 in + the IDT. This record points to the routine that handles this + interrupt, in this particular case, this will be the kernel's + syscall gate. The IDT may have a maximum of 256 (0x100) + records. 
The kernel allocates NIDT records for the IDT, where + NIDT is the maximum (256): + + sys/i386/i386/machdep.c: +static struct gate_descriptor idt0[NIDT]; +struct gate_descriptor *idt = &idt0[0]; /* interrupt descriptor table */ + + + For each interrupt, an appropriate handler is set. The + syscall gate for INT 0x80 is set as + well: + + sys/i386/i386/machdep.c: + setidt(0x80, &IDTVEC(int0x80_syscall), + SDT_SYS386TGT, SEL_UPL, GSEL(GCODE_SEL, SEL_KPL)); + + So when a userland application issues the INT + 0x80 instruction, control will transfer to the + function _Xint0x80_syscall, which is in + the kernel code segment and will be executed with supervisor + privileges. + + Console and DDB are then initialized: + + sys/i386/i386/machdep.c: + cninit(); +/* skipped */ +#ifdef DDB + kdb_init(); + if (boothowto & RB_KDB) + Debugger("Boot flags requested debugger"); +#endif + + The Task State Segment is another x86 protected mode + structure, the TSS is used by the hardware to store task + information when a task switch occurs. + + The Local Descriptors Table is used to reference userland + code and data. Several selectors are defined to point to the + LDT, they are the system call gates and the user code and data + selectors: + + /usr/include/machine/segments.h +#define LSYS5CALLS_SEL 0 /* forced by intel BCS */ +#define LSYS5SIGR_SEL 1 +#define L43BSDCALLS_SEL 2 /* notyet */ +#define LUCODE_SEL 3 +#define LSOL26CALLS_SEL 4 /* Solaris >= 2.6 system call gate */ +#define LUDATA_SEL 5 +/* separate stack, es,fs,gs sels ? */ +/* #define LPOSIXCALLS_SEL 5*/ /* notyet */ +#define LBSDICALLS_SEL 16 /* BSDI system call gate */ +#define NLDT (LBSDICALLS_SEL + 1) + + + Next, proc0's Process Control Block (struct + pcb) structure is initialized. proc0 is a + struct proc structure that describes a kernel + process. 
It is always present while the kernel is running, + therefore it is declared as global: + + sys/kern/kern_init.c: + struct proc proc0; + + The structure struct pcb is a part of a + proc structure. It is defined in + /usr/include/machine/pcb.h and has a + process's information specific to the i386 architecture, such as + registers values. + + + + + <function>mi_startup()</function> + + This function performs a bubble sort of all the system + initialization objects and then calls the entry of each object + one by one: + + sys/kern/init_main.c: + for (sipp = sysinit; *sipp; sipp++) { + + /* ... skipped ... */ + + /* Call function */ + (*((*sipp)->func))((*sipp)->udata); + /* ... skipped ... */ + } + + Although the sysinit framework is described in the + Developers' Handbook, I will discuss the internals of it. + + Every system initialization object (sysinit object) is + created by calling a SYSINIT() macro. Let us take as example an + announce sysinit object. This object prints + the copyright message: + + sys/kern/init_main.c: +static void +print_caddr_t(void *data __unused) +{ + printf("%s", (char *)data); +} +SYSINIT(announce, SI_SUB_COPYRIGHT, SI_ORDER_FIRST, print_caddr_t, copyright) + + The subsystem ID for this object is SI_SUB_COPYRIGHT + (0x0800001), which comes right after the SI_SUB_CONSOLE + (0x0800000). So, the copyright message will be printed out + first, just after the console initialization. + + Let us take a look at what exactly the macro + SYSINIT() does. It expands to a + C_SYSINIT() macro. 
The + C_SYSINIT() macro then expands to a static + struct sysinit structure declaration with + another DATA_SET macro call: + /usr/include/sys/kernel.h: + #define C_SYSINIT(uniquifier, subsystem, order, func, ident) \ + static struct sysinit uniquifier ## _sys_init = { \ + subsystem, \ + order, \ + func, \ + ident \ + }; \ + DATA_SET(sysinit_set,uniquifier ## _sys_init); + +#define SYSINIT(uniquifier, subsystem, order, func, ident) \ + C_SYSINIT(uniquifier, subsystem, order, \ + (sysinit_cfunc_t)(sysinit_nfunc_t)func, (void *)ident) + + The DATA_SET() macro expands to a + MAKE_SET(), and that macro is the point where + all the sysinit magic is hidden: + + /usr/include/linker_set.h +#define MAKE_SET(set, sym) \ + static void const * const __set_##set##_sym_##sym = &sym; \ + __asm(".section .set." #set ",\"aw\""); \ + __asm(".long " #sym); \ + __asm(".previous") +#endif +#define TEXT_SET(set, sym) MAKE_SET(set, sym) +#define DATA_SET(set, sym) MAKE_SET(set, sym) + + In our case, the following declaration will occur: + + static struct sysinit announce_sys_init = { + SI_SUB_COPYRIGHT, + SI_ORDER_FIRST, + (sysinit_cfunc_t)(sysinit_nfunc_t) print_caddr_t, + (void *) copyright +}; + +static void const *const __set_sysinit_set_sym_announce_sys_init = + &announce_sys_init; +__asm(".section .set.sysinit_set" ",\"aw\""); +__asm(".long " "announce_sys_init"); +__asm(".previous"); + + The first __asm instruction will create + an ELF section within the kernel's executable. This will happen + at kernel link time. The section will have the name + ".set.sysinit_set". The content of this section is one 32-bit + value, the address of announce_sys_init structure, and that is + what the second __asm is. The third + __asm instruction marks the end of a section. + If a directive with the same section name occurred before, the + content, i.e. the 32-bit value, will be appended to the existing + section, thus forming an array of 32-bit pointers. 
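The array of pointers formed this way is what mi_startup() later sorts and walks. The following is a minimal portable sketch of that idea, not the kernel's code: the names sysinit_sketch, sysinit_sort and sysinit_run are hypothetical, and the set is assumed to be an ordinary NULL-terminated C array rather than a linker-generated section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sysinit: what to call, and when. */
struct sysinit_sketch {
    unsigned int subsystem;       /* e.g. SI_SUB_CONSOLE = 0x0800000 */
    unsigned int order;           /* position within the subsystem */
    void (*func)(const void *);   /* entry point of the object */
    const void *udata;            /* argument passed to func */
};

/*
 * Sort a NULL-terminated array of pointers in place by
 * (subsystem, order), using a simple exchange sort.
 */
static void sysinit_sort(struct sysinit_sketch **set)
{
    assert(set != NULL);
    for (struct sysinit_sketch **a = set; *a != NULL; a++) {
        for (struct sysinit_sketch **b = a + 1; *b != NULL; b++) {
            if ((*a)->subsystem < (*b)->subsystem ||
                ((*a)->subsystem == (*b)->subsystem &&
                 (*a)->order <= (*b)->order))
                continue;         /* already in the right order */
            struct sysinit_sketch *tmp = *a;
            *a = *b;
            *b = tmp;
        }
    }
}

/* Sort the set, then call each object's entry point in turn. */
static void sysinit_run(struct sysinit_sketch **set)
{
    sysinit_sort(set);
    for (struct sysinit_sketch **p = set; *p != NULL; p++) {
        if ((*p)->func != NULL)
            (*p)->func((*p)->udata);
    }
}
```

Because the sort key is the pair (subsystem, order), an object registered with SI_SUB_COPYRIGHT runs right after everything registered with SI_SUB_CONSOLE, regardless of the order in which the linker happened to place the pointers.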
+ + Running objdump on a kernel + binary, you may notice the presence of such small sections: + + &prompt.user; objdump -h /kernel + 7 .set.cons_set 00000014 c03164c0 c03164c0 002154c0 2**2 + CONTENTS, ALLOC, LOAD, DATA + 8 .set.kbddriver_set 00000010 c03164d4 c03164d4 002154d4 2**2 + CONTENTS, ALLOC, LOAD, DATA + 9 .set.scrndr_set 00000024 c03164e4 c03164e4 002154e4 2**2 + CONTENTS, ALLOC, LOAD, DATA + 10 .set.scterm_set 0000000c c0316508 c0316508 00215508 2**2 + CONTENTS, ALLOC, LOAD, DATA + 11 .set.sysctl_set 0000097c c0316514 c0316514 00215514 2**2 + CONTENTS, ALLOC, LOAD, DATA + 12 .set.sysinit_set 00000664 c0316e90 c0316e90 00215e90 2**2 + CONTENTS, ALLOC, LOAD, DATA + + This screen dump shows that the size of the .set.sysinit_set + section is 0x664 bytes, so 0x664/sizeof(void + *) sysinit objects are compiled into the kernel. The + other sections such as .set.sysctl_set + represent other linker sets. + + By defining a variable of type struct + linker_set the content of the + .set.sysinit_set section will be "collected" + into that variable: + sys/kern/init_main.c: + extern struct linker_set sysinit_set; /* XXX */ + + The struct linker_set is defined as + follows: + + /usr/include/linker_set.h: + struct linker_set { + int ls_length; + void *ls_items[1]; /* really ls_length of them, trailing NULL */ +}; + + The first field will be equal to the number of sysinit + objects, and the second field will be a NULL-terminated array of + pointers to them. + + Returning to the mi_startup() + discussion, it should now be clear how the sysinit objects are + organized. The mi_startup() function + sorts them and calls each. The very last object is the system + scheduler: + + /usr/include/sys/kernel.h: +enum sysinit_sub_id { + SI_SUB_DUMMY = 0x0000000, /* not executed; for linker*/ + SI_SUB_DONE = 0x0000001, /* processed*/ + SI_SUB_CONSOLE = 0x0800000, /* console*/ + SI_SUB_COPYRIGHT = 0x0800001, /* first use of console*/ +... 
+ SI_SUB_RUN_SCHEDULER = 0xfffffff /* scheduler: no return*/ +}; + + The system scheduler sysinit object is defined in the file + sys/vm/vm_glue.c, and the entry point for + that object is scheduler(). That function + is actually an infinite loop, and it represents a process with + PID 0, the swapper process. The proc0 structure, mentioned + before, is used to describe it. + + The first user process, called init, is + created by the sysinit object "init": + + sys/kern/init_main.c: +static void +create_init(const void *udata __unused) +{ + int error; + int s; + + s = splhigh(); + error = fork1(&proc0, RFFDG | RFPROC, &initproc); + if (error) + panic("cannot fork init: %d\n", error); + initproc->p_flag |= P_INMEM | P_SYSTEM; + cpu_set_fork_handler(initproc, start_init, NULL); + remrunqueue(initproc); + splx(s); +} +SYSINIT(init,SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL) + + The create_init() allocates a new process + by calling fork1(), but does not mark it + runnable. When this new process is scheduled for execution by the + scheduler, the start_init() will be called. + That function is defined in init_main.c. 
It + tries to load and exec the init binary, + probing /sbin/init first, then + /sbin/oinit, + /sbin/init.bak, and finally + /stand/sysinstall: + + sys/kern/init_main.c: +static char init_path[MAXPATHLEN] = +#ifdef INIT_PATH + __XSTRING(INIT_PATH); +#else + "/sbin/init:/sbin/oinit:/sbin/init.bak:/stand/sysinstall"; +#endif + + + + + + + diff --git a/en_US.ISO8859-1/books/developers-handbook/boot/chapter.sgml b/en_US.ISO8859-1/books/developers-handbook/boot/chapter.sgml new file mode 100644 index 0000000000..3f88803fb2 --- /dev/null +++ b/en_US.ISO8859-1/books/developers-handbook/boot/chapter.sgml @@ -0,0 +1,970 @@ + + + + + + + Sergey + Lyubka + Contributed by + + + + Bootstrapping and kernel initialization + + + Synopsis + + This chapter is an overview of the boot and system + initialization process, starting from the BIOS (firmware) POST, + to the first user process creation. Since the initial steps of + system startup are very architecture dependent, the IA-32 + architecture is used as an example. + + + + Overview + + A computer running FreeBSD can boot by several methods, + although the most common method, booting from a harddisk where + the OS is installed, will be discussed here. The boot process + is divided into several steps: + + + BIOS POST + boot0 stage + boot2 stage + loader stage + kernel initialization + + + The boot0 and boot2 stages are also referred to as + bootstrap stages 1 and 2 in &man.boot.8; as + the first steps in FreeBSD's 3-stage bootstrapping procedure. + Various information is printed on the screen at each stage, so + visually you may recognize them using the table that follows. + Please note that the actual data may differ from machine to + machine: + + + + + + may vary BIOS + (firmware) messages + + + +F1 FreeBSD +F2 BSD +F5 Disk 2 + + boot0 + + + +>>FreeBSD/i386 BOOT +Default: 1:ad(1,a)/boot/loader +boot: + + + boot2This prompt will appear + if the user presses a key just after selecting an OS to + boot at the boot0 + stage. 
 + + + +BTX loader 1.0 BTX version is 1.01 +BIOS drive A: is disk0 +BIOS drive C: is disk1 +BIOS 639kB/64512kB available memory +FreeBSD/i386 bootstrap loader, Revision 0.8 +Console internal video/keyboard +(jkh@bento.freebsd.org, Mon Nov 20 11:41:23 GMT 2000) +/kernel text=0x1234 data=0x2345 syms=[0x4+0x3456] +Hit [Enter] to boot immediately, or any other key for command prompt +Booting [kernel] in 9 seconds..._ + + loader + + + +Copyright (c) 1992-2002 The FreeBSD Project. +Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 + The Regents of the University of California. All rights reserved. +FreeBSD 4.6-RC #0: Sat May 4 22:49:02 GMT 2002 + devnull@kukas:/usr/obj/usr/src/sys/DEVNULL +Timecounter "i8254" frequency 1193182 Hz + + kernel + + + + + + + + + BIOS POST + + When the PC powers on, the processor's registers are set + to some predefined values. One of the registers is the + instruction pointer register, and its value + after a power on is well defined: it is a 32-bit value of + 0xfffffff0. The instruction pointer register points to code to + be executed by the processor. Another register is the + cr0 32-bit control register, whose PE bit is clear + just after a reset. The PE + (Protection Enable) bit indicates whether the processor is running + in protected or real mode. Since this bit is + cleared at boot time, the processor boots in real mode. Real mode means, + among other things, that linear and physical addresses are + identical. + + The value of 0xfffffff0 is slightly less than 4 GB, so unless + the machine has 4 GB of physical memory, it cannot point to a valid + memory address. The computer's hardware translates this address + so that it points to a BIOS memory block. + + BIOS stands for Basic Input Output + System, and it is a chip on the motherboard that has + a relatively small amount of read-only memory (ROM).
This + memory contains various low-level routines that are specific to + the hardware supplied with the motherboard. So, the processor + will first jump to the address 0xfffffff0, which really resides + in the BIOS's memory. Usually this address contains a jump + instruction to the BIOS's POST routines. + + POST stands for Power On Self Test. + This is a set of routines including the memory check, system bus + check and other low-level checks so that the CPU can initialize + the computer properly. An important step at this stage is + determining the boot device. All modern BIOSes allow the boot + device to be set manually, so you can boot from a floppy, + CD-ROM, harddisk etc. + + The very last thing in the POST is the INT + 0x19 instruction. That instruction reads 512 bytes + from the first sector of the boot device into the memory at address + 0x7c00. The term first sector originates + from harddrive architecture, where the magnetic plate is divided + into a number of cylindrical tracks. Tracks are numbered, and + every track is divided into a number (usually 64) of sectors. Track + number 0 is the outermost on the magnetic plate, and sector 1, + the first sector (tracks, or cylinders, are numbered starting + from 0, but sectors are numbered starting from 1), has a special meaning. + It is also called the Master Boot Record, or MBR. The remaining + sectors on the first track are never used. (Some + utilities such as &man.disklabel.8; may store information in + this area, mostly in the second + sector.) + + + + + boot0 stage + + Take a look at the file /boot/boot0. + This is a small 512-byte file, and it is exactly what FreeBSD's + installation procedure wrote to your harddisk's MBR if you chose + the "bootmanager" option at installation time. + + As mentioned previously, the INT 0x19 + instruction loads an MBR, i.e. the boot0 + content, into the memory at address 0x7c00.
Taking a look at + the file sys/boot/i386/boot0/boot0.s can + give a guess at what is happening there - this is the boot + manager, which is an awesome piece of code written by Robert + Nordier. + + The MBR, or, boot0, has a special + structure starting from offset 0x1be, called the + partition table. It has 4 records of 16 + bytes each, called partition records, which + represent how the harddisk(s) are partitioned, or, in FreeBSD's + terminology, sliced. One byte of those 16 says whether a + partition (slice) is bootable or not. Exactly one record must + have that flag set, otherwise boot0's code + will refuse to proceed. + + A partition record has the following fields: + + + the 1-byte filesystem type + the 1-byte bootable flag + the 6 byte descriptor in CHS + format + the 8 byte descriptor in LBA + format + + + A partition record descriptor has the information about + where exactly the partition resides on the drive. Both + descriptors, LBA and CHS, describe the same information, but in + different ways: LBA (Logical Block Addressing) has the starting + sector for the partition and the partition's length, while CHS + (Cylinder Head Sector) has coordinates for the first and last + sectors of the partition. + + The boot manager scans the partition table and prints the + menu on the screen so the user can select what disk and what + slice to boot. By pressing an appropriate key, + boot0 performs the following + actions: + + + modifies the bootable flag for the selected + partition to make it bootable, and clears the + previous + + saves itself to disk to remember what partition + (slice) has been selected so to use it as the default on the + next boot + + loads the first sector of the selected partition + (slice) into memory and jumps there + + + What kind of data should reside on the very first sector of + a bootable partition (slice), in our case, a FreeBSD slice? As + you may have already guessed, it is + boot2. 
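The partition table layout described above can be illustrated with a small C sketch that decodes the four records from a 512-byte MBR image and looks for the single bootable one. This is not the boot0 code (which is written in assembly); the names `part_record`, `mbr_record` and `mbr_active_slice` are hypothetical, and the byte offsets inside a record follow the conventional PC MBR layout (flag at byte 0, type at byte 4, LBA start and length in the last 8 bytes), which is an assumption not spelled out in the text.

```c
#include <assert.h>
#include <stdint.h>

#define MBR_SIZE        512
#define MBR_PART_OFFSET 0x1be   /* start of the partition table */
#define MBR_PART_COUNT  4       /* four records ... */
#define MBR_PART_SIZE   16      /* ... of 16 bytes each */
#define MBR_ACTIVE      0x80    /* value of the bootable flag when set */

/* Hypothetical decoded view of one 16-byte partition record. */
struct part_record {
    uint8_t  bootable;   /* 0x80 if the slice is bootable */
    uint8_t  type;       /* filesystem type, e.g. 0xa5 for FreeBSD */
    uint32_t lba_start;  /* first sector of the slice */
    uint32_t lba_count;  /* length of the slice in sectors */
};

/* Read a 32-bit little-endian value regardless of host byte order. */
uint32_t
le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
        (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Decode record idx (0..3) from a 512-byte MBR image. */
struct part_record
mbr_record(const uint8_t mbr[MBR_SIZE], int idx)
{
    const uint8_t *p;
    struct part_record r;

    assert(idx >= 0 && idx < MBR_PART_COUNT);
    p = mbr + MBR_PART_OFFSET + idx * MBR_PART_SIZE;
    r.bootable = p[0];          /* bytes 1..3: CHS of the first sector */
    r.type = p[4];              /* bytes 5..7: CHS of the last sector */
    r.lba_start = le32(p + 8);
    r.lba_count = le32(p + 12);
    return r;
}

/* Return the index of the only bootable record, or -1 if there is not exactly one. */
int
mbr_active_slice(const uint8_t mbr[MBR_SIZE])
{
    int i, found = -1;

    for (i = 0; i < MBR_PART_COUNT; i++) {
        if (mbr_record(mbr, i).bootable == MBR_ACTIVE) {
            if (found != -1)
                return -1;      /* more than one bootable record */
            found = i;
        }
    }
    return found;
}
```

Given a buffer holding the first sector of a disk, `mbr_active_slice()` mirrors the "exactly one record must have that flag set" rule, and `mbr_record()` yields the LBA start sector from which the next stage would be read.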
+ + + + + boot2 stage + + You might wonder, why boot2 comes after boot0, and not + boot1. Actually, there is a 512-byte file called + boot1 in the directory + /boot as well. It is used for booting from + a floppy. When booting from a floppy, + boot1 plays the same role as + boot0 for a harddisk: it locates boot2 and + runs it. + + You may have realized that a file + /boot/mbr exists as well. It is a + simplified version of boot0. The code in + mbr does not provide a menu for the user, + it just blindly boots the partition marked active. + + The code implementing boot2 resides in + sys/boot/i386/boot2/, and the executable + itself is in /boot. The files boot0 and + boot2 that are in /boot are not used by the + bootstrap, but by utilities such as + boot0cfg. The actual position for + boot0 is in the MBR. For boot2 it is the beginning of a + bootable FreeBSD slice. These locations are not under the + filesystem's control, so they are invisible to commands like + ls. + + The main task for boot2 is to load the file + /boot/loader, which is the third stage in + the bootstrapping procedure. The code in boot2 cannot use any + services like open() and + read(), since the kernel is not yet loaded. + It must scan the harddisk, knowing about the filesystem + structure, find the file /boot/loader, read + it into memory using a BIOS service, and then pass the execution + to the loader's entry point. + + Besides that, boot2 prompts for user input so the loader can + be booted from different disk, unit, slice and partition. + + The boot2 binary is created in special way: + sys/boot/i386/boot2/Makefile +boot2: boot2.ldr boot2.bin ${BTX}/btx/btx + btxld -v -E ${ORG2} -f bin -b ${BTX}/btx/btx -l boot2.ldr \ + -o boot2.ld -P 1 boot2.bin + + This Makefile snippet shows that &man.btxld.8; is used to + link the binary. BTX, which stands for BooT eXtender, is a + piece of code that provides a protected mode environment for the + program, called the client, that it is linked with. 
So boot2 is + a BTX client, i.e. it uses the services provided by BTX. + + The btxld utility is the linker. + It links two binaries together. The difference between + &man.btxld.8; and &man.ld.1; is that + ld usually links object files into a + shared object or executable, while + btxld links an object file with the + BTX, producing a binary file suitable to be put at the + beginning of the partition for the system boot. + + boot0 passes the execution to BTX's entry point. BTX then + switches the processor to protected mode, and prepares a simple + environment before calling the client. This includes: + + + virtual v86 mode. That means, the BTX is a v86 + monitor. Real mode instructions like pushf, popf, cli, sti, if + called by the client, will work. + + Interrupt Descriptor Table (IDT) is set up so + all hardware interrupts are routed to the BIOS's default + handlers, and interrupt 0x30 is set up to be the syscall + gate. + + Two system calls: exec and + exit, are defined: + + sys/boot/i386/btx/lib/btxsys.s: + .set INT_SYS,0x30 # Interrupt number +# +# System call: exit +# +__exit: xorl %eax,%eax # BTX system + int $INT_SYS # call 0x0 +# +# System call: exec +# +__exec: movl $0x1,%eax # BTX system + int $INT_SYS # call 0x1 + + + BTX creates a Global Descriptor Table (GDT): + + sys/boot/i386/btx/btx/btx.s: +gdt: .word 0x0,0x0,0x0,0x0 # Null entry + .word 0xffff,0x0,0x9a00,0xcf # SEL_SCODE + .word 0xffff,0x0,0x9200,0xcf # SEL_SDATA + .word 0xffff,0x0,0x9a00,0x0 # SEL_RCODE + .word 0xffff,0x0,0x9200,0x0 # SEL_RDATA + .word 0xffff,MEM_USR,0xfa00,0xcf# SEL_UCODE + .word 0xffff,MEM_USR,0xf200,0xcf# SEL_UDATA + .word _TSSLM,MEM_TSS,0x8900,0x0 # SEL_TSS + + The client's code and data start from address MEM_USR + (0xa000), and a selector (SEL_UCODE) points to the client's code + segment. The SEL_UCODE descriptor has Descriptor Privilege + Level (DPL) 3, which is the lowest privilege level.
But the + INT 0x30 instruction handler resides in a + segment pointed to by the SEL_SCODE (supervisor code) selector, + as shown by the code that creates an IDT: + + mov $SEL_SCODE,%dh # Segment selector +init.2: shr %bx # Handle this int? + jnc init.3 # No + mov %ax,(%di) # Set handler offset + mov %dh,0x2(%di) # and selector + mov %dl,0x5(%di) # Set P:DPL:type + add $0x4,%ax # Next handler + + So, when the client calls __exec(), the + code will be executed with the highest privileges. This allows + the kernel to change the protected mode data structures, such as + page tables, GDT, IDT, etc. later, if needed. + + boot2 defines an important structure, struct + bootinfo. This structure is initialized by boot2 and + passed to the loader, and then further to the kernel. Some + fields of this structure are set by boot2, the rest by the + loader. This structure, among other information, contains the + kernel filename, BIOS harddisk geometry, BIOS drive number for + boot device, physical memory available, envp + pointer etc. The definition for it is: + + /usr/include/machine/bootinfo.h +struct bootinfo { + u_int32_t bi_version; + u_int32_t bi_kernelname; /* represents a char * */ + u_int32_t bi_nfs_diskless; /* struct nfs_diskless * */ + /* End of fields that are always present. */ +#define bi_endcommon bi_n_bios_used + u_int32_t bi_n_bios_used; + u_int32_t bi_bios_geom[N_BIOS_GEOM]; + u_int32_t bi_size; + u_int8_t bi_memsizes_valid; + u_int8_t bi_bios_dev; /* bootdev BIOS unit number */ + u_int8_t bi_pad[2]; + u_int32_t bi_basemem; + u_int32_t bi_extmem; + u_int32_t bi_symtab; /* struct symtab * */ + u_int32_t bi_esymtab; /* struct symtab * */ + /* Items below only from advanced bootloader */ + u_int32_t bi_kernend; /* end of kernel space */ + u_int32_t bi_envp; /* environment */ + u_int32_t bi_modulep; /* preloaded modules */ +}; + + boot2 enters into an infinite loop waiting for user input, + then calls load().
If the user does not + press anything, the loop breaks on a timeout, so + load() will load the default file + (/boot/loader). Functions ino_t + lookup(char *filename) and int xfsread(ino_t + inode, void *buf, size_t nbyte) are used to read the + content of a file into memory. /boot/loader + is an ELF binary, but its ELF header is preceded by + a.out's struct exec structure. + load() scans the loader's ELF header, loading + the content of /boot/loader into memory, and + passing the execution to the loader's entry: + + sys/boot/i386/boot2/boot2.c: + __exec((caddr_t)addr, RB_BOOTINFO | (opts & RBX_MASK), + MAKEBOOTDEV(dev_maj[dsk.type], 0, dsk.slice, dsk.unit, dsk.part), + 0, 0, 0, VTOP(&bootinfo)); + + + + + <application>loader</application> stage + + loader is a BTX client as well. + I will not describe it here in detail; there is a comprehensive + manpage written by Mike Smith, &man.loader.8;. The underlying + mechanisms and BTX were discussed above. + + The main task for the loader is to boot the kernel. Once + the kernel is loaded into memory, it is called by the + loader: + + sys/boot/common/boot.c: + /* Call the exec handler from the loader matching the kernel */ + module_formats[km->m_loader]->l_exec(km); + + + + Kernel initialization + + Where exactly does the loader pass the execution, + i.e. what is the kernel's actual entry point? Let us take a + look at the command that links the kernel: + + sys/conf/Makefile.i386: +ld -elf -Bdynamic -T /usr/src/sys/conf/ldscript.i386 -export-dynamic \ +-dynamic-linker /red/herring -o kernel -X locore.o \ +<lots of kernel .o files> + + A few interesting things can be seen in this line. First, + the kernel is an ELF dynamically linked binary, but the dynamic + linker for kernel is /red/herring, which is + definitely a bogus file. Second, taking a look at the file + sys/conf/ldscript.i386 gives an idea about + what ld options are used when + compiling a kernel.
Reading through the first few lines, the + string + + sys/conf/ldscript.i386: +ENTRY(btext) + + says that a kernel's entry point is the symbol `btext'. + This symbol is defined in locore.s: + + sys/i386/i386/locore.s: + .text +/********************************************************************** + * + * This is where the bootblocks start us, set the ball rolling... + * + */ +NON_GPROF_ENTRY(btext) + + First, the EFLAGS register is set to a + predefined value of 0x00000002, and then all the segment + registers are initialized: + + sys/i386/i386/locore.s +/* Don't trust what the BIOS gives for eflags. */ + pushl $PSL_KERNEL + popfl + +/* + * Don't trust what the BIOS gives for %fs and %gs. Trust the bootstrap + * to set %cs, %ds, %es and %ss. + */ + mov %ds, %ax + mov %ax, %fs + mov %ax, %gs + + btext calls the routines + recover_bootinfo(), + identify_cpu(), and + create_pagetables(), which are also defined + in locore.s. Here is a description of what + they do: + + + + + + recover_bootinfo + + This routine parses the parameters to the kernel + passed from the bootstrap. The kernel may have been + booted in 3 ways: by the loader, described above, by the + old disk boot blocks, and by the old diskless boot + procedure. This function determines the booting method, + and stores the struct bootinfo + structure into the kernel memory. + + + identify_cpu This + function tries to find out what CPU it is running on, + storing the value found in a variable + _cpu. + + + create_pagetables + This function allocates and fills out a Page Table Directory + at the top of the kernel memory area.
+ + + + The next steps are enabling VME, if the CPU supports it: + + testl $CPUID_VME, R(_cpu_feature) + jz 1f + movl %cr4, %eax + orl $CR4_VME, %eax + movl %eax, %cr4 + + Then, enabling paging: + /* Now enable paging */ + movl R(_IdlePTD), %eax + movl %eax,%cr3 /* load ptd addr into mmu */ + movl %cr0,%eax /* get control word */ + orl $CR0_PE|CR0_PG,%eax /* enable paging */ + movl %eax,%cr0 /* and let's page NOW! */ + + The next three lines of code are because the paging was set, + so the jump is needed to continue the execution in virtualized + address space: + + pushl $begin /* jump to high virtualized address */ + ret + +/* now running relocated at KERNBASE where the system is linked to run */ +begin: + + The function init386() is called, with + a pointer to the first free physical page, after that + mi_startup(). init386 + is an architecture dependent initialization function, and + mi_startup() is an architecture independent + one (the 'mi_' prefix stands for Machine Independent). The + kernel never returns from mi_startup(), and + by calling it, the kernel finishes booting: + + sys/i386/i386/locore.s: + movl physfree, %esi + pushl %esi /* value of first for init386(first) */ + call _init386 /* wire 386 chip for unix operation */ + call _mi_startup /* autoconfiguration, mountroot etc */ + hlt /* never returns to here */ + + + <function>init386()</function> + + init386() is defined in + sys/i386/i386/machdep.c and performs + low-level initialization, specific to the i386 chip. The + switch to protected mode was performed by the loader. The + loader has created the very first task, in which the kernel + continues to operate. Before running straight away to the + code, I will enumerate the tasks the processor must complete + to initialize protected mode execution: + + + Initialize the kernel tunable parameters, passed from + the bootstrapping program. + Prepare the GDT. + Prepare the IDT. + Initialize the system console. 
+ Initialize the DDB, if it is compiled into kernel. + + Initialize the TSS. + Prepare the LDT. + Setup proc0's pcb. + + + + What init386() first does is + initialize the tunable parameters passed from bootstrap. This + is done by setting the environment pointer (envp) and calling + init_param1(). The envp pointer has been + passed from loader in the bootinfo + structure: + + sys/i386/i386/machdep.c: + kern_envp = (caddr_t)bootinfo.bi_envp + KERNBASE; + + /* Init basic tunables, hz etc */ + init_param1(); + + init_param1() is defined in + sys/kern/subr_param.c. That file has a + number of sysctls, and two functions, + init_param1() and + init_param2(), that are called from + init386(): + + sys/kern/subr_param.c + hz = HZ; + TUNABLE_INT_FETCH("kern.hz", &hz); + + TUNABLE_<typename>_FETCH is used to fetch the value + from the environment: + + /usr/src/sys/sys/kernel.h +#define TUNABLE_INT_FETCH(path, var) getenv_int((path), (var)) + + + Sysctl "kern.hz" is the system clock tick. Along with + this, the following sysctls are set by + init_param1(): kern.maxswzone, + kern.maxbcache, kern.maxtsiz, kern.dfldsiz, kern.dflssiz, + kern.maxssiz, kern.sgrowsiz. + + Then init386() prepares the Global + Descriptors Table (GDT). Every task on an x86 is running in + its own virtual address space, and this space is addressed by + a segment:offset pair. Say, for instance, the current + instruction to be executed by the processor lies at CS:EIP, + then the linear virtual address for that instruction would be + "the virtual address of code segment CS" + EIP. For + convenience, segments begin at virtual address 0 and end at a + 4Gb boundary. Therefore, the instruction's linear virtual + address for this example would just be the value of EIP. + Segment registers such as CS, DS etc are the selectors, + i.e. indexes, into GDT (to be more precise, an index is not a + selector itself, but the INDEX field of a selector). 
 + FreeBSD's GDT holds descriptors for 15 selectors per + CPU: + + sys/i386/i386/machdep.c: +union descriptor gdt[NGDT * MAXCPU]; /* global descriptor table */ + +sys/i386/include/segments.h: +/* + * Entries in the Global Descriptor Table (GDT) + */ +#define GNULL_SEL 0 /* Null Descriptor */ +#define GCODE_SEL 1 /* Kernel Code Descriptor */ +#define GDATA_SEL 2 /* Kernel Data Descriptor */ +#define GPRIV_SEL 3 /* SMP Per-Processor Private Data */ +#define GPROC0_SEL 4 /* Task state process slot zero and up */ +#define GLDT_SEL 5 /* LDT - eventually one per process */ +#define GUSERLDT_SEL 6 /* User LDT */ +#define GTGATE_SEL 7 /* Process task switch gate */ +#define GBIOSLOWMEM_SEL 8 /* BIOS low memory access (must be entry 8) */ +#define GPANIC_SEL 9 /* Task state to consider panic from */ +#define GBIOSCODE32_SEL 10 /* BIOS interface (32bit Code) */ +#define GBIOSCODE16_SEL 11 /* BIOS interface (16bit Code) */ +#define GBIOSDATA_SEL 12 /* BIOS interface (Data) */ +#define GBIOSUTIL_SEL 13 /* BIOS interface (Utility) */ +#define GBIOSARGS_SEL 14 /* BIOS interface (Arguments) */ + + Note that those #defines are not selectors themselves, but + just the INDEX field of a selector, so they are exactly the + indices of the GDT. For example, an actual selector for the + kernel code (GCODE_SEL) has the value 0x08. + + The next step is to initialize the Interrupt Descriptor + Table (IDT). This table is referenced by the processor + when a software or hardware interrupt occurs. For example, to + make a system call, a user application issues the INT + 0x80 instruction. This is a software interrupt, so + the processor's hardware looks up a record with index 0x80 in + the IDT. This record points to the routine that handles this + interrupt; in this particular case, this will be the kernel's + syscall gate. The IDT may have a maximum of 256 (0x100) + records.
The kernel allocates NIDT records for the IDT, where + NIDT is the maximum (256): + + sys/i386/i386/machdep.c: +static struct gate_descriptor idt0[NIDT]; +struct gate_descriptor *idt = &idt0[0]; /* interrupt descriptor table */ + + + For each interrupt, an appropriate handler is set. The + syscall gate for INT 0x80 is set as + well: + + sys/i386/i386/machdep.c: + setidt(0x80, &IDTVEC(int0x80_syscall), + SDT_SYS386TGT, SEL_UPL, GSEL(GCODE_SEL, SEL_KPL)); + + So when a userland application issues the INT + 0x80 instruction, control will transfer to the + function _Xint0x80_syscall, which is in + the kernel code segment and will be executed with supervisor + privileges. + + Console and DDB are then initialized: + + sys/i386/i386/machdep.c: + cninit(); +/* skipped */ +#ifdef DDB + kdb_init(); + if (boothowto & RB_KDB) + Debugger("Boot flags requested debugger"); +#endif + + The Task State Segment is another x86 protected mode + structure, the TSS is used by the hardware to store task + information when a task switch occurs. + + The Local Descriptors Table is used to reference userland + code and data. Several selectors are defined to point to the + LDT, they are the system call gates and the user code and data + selectors: + + /usr/include/machine/segments.h +#define LSYS5CALLS_SEL 0 /* forced by intel BCS */ +#define LSYS5SIGR_SEL 1 +#define L43BSDCALLS_SEL 2 /* notyet */ +#define LUCODE_SEL 3 +#define LSOL26CALLS_SEL 4 /* Solaris >= 2.6 system call gate */ +#define LUDATA_SEL 5 +/* separate stack, es,fs,gs sels ? */ +/* #define LPOSIXCALLS_SEL 5*/ /* notyet */ +#define LBSDICALLS_SEL 16 /* BSDI system call gate */ +#define NLDT (LBSDICALLS_SEL + 1) + + + Next, proc0's Process Control Block (struct + pcb) structure is initialized. proc0 is a + struct proc structure that describes a kernel + process. 
It is always present while the kernel is running, + therefore it is declared as global: + + sys/kern/kern_init.c: + struct proc proc0; + + The structure struct pcb is a part of a + proc structure. It is defined in + /usr/include/machine/pcb.h and has a + process's information specific to the i386 architecture, such as + registers values. + + + + + <function>mi_startup()</function> + + This function performs a bubble sort of all the system + initialization objects and then calls the entry of each object + one by one: + + sys/kern/init_main.c: + for (sipp = sysinit; *sipp; sipp++) { + + /* ... skipped ... */ + + /* Call function */ + (*((*sipp)->func))((*sipp)->udata); + /* ... skipped ... */ + } + + Although the sysinit framework is described in the + Developers' Handbook, I will discuss the internals of it. + + Every system initialization object (sysinit object) is + created by calling a SYSINIT() macro. Let us take as example an + announce sysinit object. This object prints + the copyright message: + + sys/kern/init_main.c: +static void +print_caddr_t(void *data __unused) +{ + printf("%s", (char *)data); +} +SYSINIT(announce, SI_SUB_COPYRIGHT, SI_ORDER_FIRST, print_caddr_t, copyright) + + The subsystem ID for this object is SI_SUB_COPYRIGHT + (0x0800001), which comes right after the SI_SUB_CONSOLE + (0x0800000). So, the copyright message will be printed out + first, just after the console initialization. + + Let us take a look at what exactly the macro + SYSINIT() does. It expands to a + C_SYSINIT() macro. 
The + C_SYSINIT() macro then expands to a static + struct sysinit structure declaration with + another DATA_SET macro call: + /usr/include/sys/kernel.h: + #define C_SYSINIT(uniquifier, subsystem, order, func, ident) \ + static struct sysinit uniquifier ## _sys_init = { \ + subsystem, \ + order, \ + func, \ + ident \ + }; \ + DATA_SET(sysinit_set,uniquifier ## _sys_init); + +#define SYSINIT(uniquifier, subsystem, order, func, ident) \ + C_SYSINIT(uniquifier, subsystem, order, \ + (sysinit_cfunc_t)(sysinit_nfunc_t)func, (void *)ident) + + The DATA_SET() macro expands to a + MAKE_SET(), and that macro is the point where + all the sysinit magic is hidden: + + /usr/include/linker_set.h +#define MAKE_SET(set, sym) \ + static void const * const __set_##set##_sym_##sym = &sym; \ + __asm(".section .set." #set ",\"aw\""); \ + __asm(".long " #sym); \ + __asm(".previous") +#endif +#define TEXT_SET(set, sym) MAKE_SET(set, sym) +#define DATA_SET(set, sym) MAKE_SET(set, sym) + + In our case, the following declaration will occur: + + static struct sysinit announce_sys_init = { + SI_SUB_COPYRIGHT, + SI_ORDER_FIRST, + (sysinit_cfunc_t)(sysinit_nfunc_t) print_caddr_t, + (void *) copyright +}; + +static void const *const __set_sysinit_set_sym_announce_sys_init = + &announce_sys_init; +__asm(".section .set.sysinit_set" ",\"aw\""); +__asm(".long " "announce_sys_init"); +__asm(".previous"); + + The first __asm instruction will create + an ELF section within the kernel's executable. This will happen + at kernel link time. The section will have the name + ".set.sysinit_set". The content of this section is one 32-bit + value, the address of the announce_sys_init structure, and that is + what the second __asm is. The third + __asm instruction marks the end of a section. + If a directive with the same section name occurred before, the + content, i.e. the 32-bit value, will be appended to the existing + section, thus forming an array of 32-bit pointers.
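To make the role of that array of pointers concrete, here is a minimal, portable C sketch of what a consumer such as mi_startup() does with it: sort the pointers, then call each entry. It does not use a real linker set; the NULL-terminated array is built by hand by the caller. Every name prefixed with `demo_` is hypothetical, and sorting by subsystem first and order second is an assumption based on the two fields of the SYSINIT() macro.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct sysinit. */
struct demo_sysinit {
    unsigned int subsystem;     /* major key, e.g. SI_SUB_CONSOLE */
    unsigned int order;         /* minor key within a subsystem */
    void (*func)(const void *); /* entry point of the object */
    const void *udata;          /* argument passed to func */
};

/* Demo log: each call of demo_record() appends one character to it. */
char demo_log[16];
size_t demo_log_len;

/* Demo entry point: record the character that udata points to. */
void
demo_record(const void *udata)
{
    if (demo_log_len < sizeof(demo_log) - 1)
        demo_log[demo_log_len++] = *(const char *)udata;
}

/*
 * Sort a NULL-terminated array of pointers by (subsystem, order) with a
 * simple exchange sort, then call each entry in turn.
 * Returns the number of entries called.
 */
int
demo_run_sysinits(struct demo_sysinit **set)
{
    struct demo_sysinit **sipp, **xipp, *save;
    int ncalled = 0;

    assert(set != NULL);
    for (sipp = set; *sipp != NULL; sipp++) {
        for (xipp = sipp + 1; *xipp != NULL; xipp++) {
            if ((*sipp)->subsystem < (*xipp)->subsystem ||
                ((*sipp)->subsystem == (*xipp)->subsystem &&
                (*sipp)->order <= (*xipp)->order))
                continue;       /* this pair is already in order */
            save = *sipp;
            *sipp = *xipp;
            *xipp = save;
        }
    }
    for (sipp = set; *sipp != NULL; sipp++) {
        (*sipp)->func((*sipp)->udata);
        ncalled++;
    }
    return ncalled;
}
```

If three objects with subsystem values 0xfffffff, 0x0800001 and 0x0800000 are placed in the array in that order, the sketch calls the 0x0800000 (console) entry first and the 0xfffffff (scheduler) entry last, regardless of the order in which they were listed.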
 + + Running objdump on a kernel + binary, you may notice the presence of such small sections: + + &prompt.user; objdump -h /kernel + 7 .set.cons_set 00000014 c03164c0 c03164c0 002154c0 2**2 + CONTENTS, ALLOC, LOAD, DATA + 8 .set.kbddriver_set 00000010 c03164d4 c03164d4 002154d4 2**2 + CONTENTS, ALLOC, LOAD, DATA + 9 .set.scrndr_set 00000024 c03164e4 c03164e4 002154e4 2**2 + CONTENTS, ALLOC, LOAD, DATA + 10 .set.scterm_set 0000000c c0316508 c0316508 00215508 2**2 + CONTENTS, ALLOC, LOAD, DATA + 11 .set.sysctl_set 0000097c c0316514 c0316514 00215514 2**2 + CONTENTS, ALLOC, LOAD, DATA + 12 .set.sysinit_set 00000664 c0316e90 c0316e90 00215e90 2**2 + CONTENTS, ALLOC, LOAD, DATA + + This screen dump shows that the size of the .set.sysinit_set + section is 0x664 bytes, so 0x664/sizeof(void + *) sysinit objects are compiled into the kernel. The + other sections such as .set.sysctl_set + represent other linker sets. + + By defining a variable of type struct + linker_set the content of the + .set.sysinit_set section will be "collected" + into that variable: + sys/kern/init_main.c: + extern struct linker_set sysinit_set; /* XXX */ + + The struct linker_set is defined as + follows: + + /usr/include/linker_set.h: + struct linker_set { + int ls_length; + void *ls_items[1]; /* really ls_length of them, trailing NULL */ +}; + + The first member will be equal to the number of sysinit + objects, and the second member will be a NULL-terminated array of + pointers to them. + + Returning to the mi_startup() + discussion, it should now be clear how the sysinit objects are + organized. The mi_startup() function + sorts them and calls each. The very last object is the system + scheduler: + + /usr/include/sys/kernel.h: +enum sysinit_sub_id { + SI_SUB_DUMMY = 0x0000000, /* not executed; for linker*/ + SI_SUB_DONE = 0x0000001, /* processed*/ + SI_SUB_CONSOLE = 0x0800000, /* console*/ + SI_SUB_COPYRIGHT = 0x0800001, /* first use of console*/ +...
+ SI_SUB_RUN_SCHEDULER = 0xfffffff /* scheduler: no return*/ +}; + + The system scheduler sysinit object is defined in the file + sys/vm/vm_glue.c, and the entry point for + that object is scheduler(). That function + is actually an infinite loop, and it represents a process with + PID 0, the swapper process. The proc0 structure, mentioned + before, is used to describe it. + + The first user process, called init, is + created by the sysinit object "init": + + sys/kern/init_main.c: +static void +create_init(const void *udata __unused) +{ + int error; + int s; + + s = splhigh(); + error = fork1(&proc0, RFFDG | RFPROC, &initproc); + if (error) + panic("cannot fork init: %d\n", error); + initproc->p_flag |= P_INMEM | P_SYSTEM; + cpu_set_fork_handler(initproc, start_init, NULL); + remrunqueue(initproc); + splx(s); +} +SYSINIT(init,SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL) + + The create_init() allocates a new process + by calling fork1(), but does not mark it + runnable. When this new process is scheduled for execution by the + scheduler, the start_init() will be called. + That function is defined in init_main.c. It + tries to load and exec the init binary, + probing /sbin/init first, then + /sbin/oinit, + /sbin/init.bak, and finally + /stand/sysinstall: + + sys/kern/init_main.c: +static char init_path[MAXPATHLEN] = +#ifdef INIT_PATH + __XSTRING(INIT_PATH); +#else + "/sbin/init:/sbin/oinit:/sbin/init.bak:/stand/sysinstall"; +#endif + + + + + + +