Index: head/share/man/man4/ddb.4 =================================================================== --- head/share/man/man4/ddb.4 (revision 353428) +++ head/share/man/man4/ddb.4 (revision 353429) @@ -1,1611 +1,1628 @@ .\" .\" Mach Operating System .\" Copyright (c) 1991,1990 Carnegie Mellon University .\" Copyright (c) 2007 Robert N. M. Watson .\" All Rights Reserved. .\" .\" Permission to use, copy, modify and distribute this software and its .\" documentation is hereby granted, provided that both the copyright .\" notice and this permission notice appear in all copies of the .\" software, derivative works or modified versions, and any portions .\" thereof, and that both notices appear in supporting documentation. .\" .\" CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS" .\" CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND FOR .\" ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE. .\" .\" Carnegie Mellon requests users of this software to return to .\" .\" Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU .\" School of Computer Science .\" Carnegie Mellon University .\" Pittsburgh PA 15213-3890 .\" .\" any improvements or extensions that they make and grant Carnegie Mellon .\" the rights to redistribute these changes. .\" .\" changed a \# to #, since groff choked on it. .\" .\" HISTORY .\" ddb.4,v .\" Revision 1.1 1993/07/15 18:41:02 brezak .\" Man page for DDB .\" .\" Revision 2.6 92/04/08 08:52:57 rpd .\" Changes from OSF. .\" [92/01/17 14:19:22 jsb] .\" Changes for OSF debugger modifications. .\" [91/12/12 tak] .\" .\" Revision 2.5 91/06/25 13:50:22 rpd .\" Added some watchpoint explanation. .\" [91/06/25 rpd] .\" .\" Revision 2.4 91/06/17 15:47:31 jsb .\" Added documentation for continue/c, match, search, and watchpoints. .\" I've not actually explained what a watchpoint is; maybe Rich can .\" do that (hint, hint). .\" [91/06/17 10:58:08 jsb] .\" .\" Revision 2.3 91/05/14 17:04:23 mrt .\" Correcting copyright .\" .\" Revision 2.2 91/02/14 14:10:06 mrt .\" Changed to new Mach copyright .\" [91/02/12 18:10:12 mrt] .\" .\" Revision 2.2 90/08/30 14:23:15 dbg .\" Created. .\" [90/08/30 dbg] .\" .\" $FreeBSD$ .\" -.Dd September 9, 2019 +.Dd October 10, 2019 .Dt DDB 4 .Os .Sh NAME .Nm ddb .Nd interactive kernel debugger .Sh SYNOPSIS In order to enable kernel debugging facilities include: .Bd -ragged -offset indent .Cd options KDB .Cd options DDB .Ed .Pp To prevent activation of the debugger on kernel .Xr panic 9 : .Bd -ragged -offset indent .Cd options KDB_UNATTENDED .Ed .Pp In order to print a stack trace of the current thread on the console for a panic: .Bd -ragged -offset indent .Cd options KDB_TRACE .Ed .Pp To print the numerical value of symbols in addition to the symbolic representation, define: .Bd -ragged -offset indent .Cd options DDB_NUMSYM .Ed .Pp To enable the .Xr gdb 1 backend, so that remote debugging with .Xr kgdb 1 is possible, include: .Bd -ragged -offset indent .Cd options GDB .Ed .Sh DESCRIPTION The .Nm kernel debugger is an interactive debugger with a syntax inspired by .Xr gdb 1 . If linked into the running kernel, it can be invoked locally with the .Ql debug .Xr keymap 5 action, usually mapped to Ctrl+Alt+Esc, or by setting the .Va debug.kdb.enter sysctl to 1. The debugger is also invoked on kernel .Xr panic 9 if the .Va debug.debugger_on_panic .Xr sysctl 8 MIB variable is set non-zero, which is the default unless the .Dv KDB_UNATTENDED option is specified. .Pp The current location is called .Va dot . The .Va dot is displayed with a hexadecimal format at a prompt. The commands .Ic examine and .Ic write update .Va dot to the address of the last line examined or the last location modified, and set .Va next to the address of the next location to be examined or changed. Other commands do not change .Va dot , and set .Va next to be the same as .Va dot . .Pp The general command syntax is: .Ar command Ns Op Li / Ns Ar modifier .Oo Ar addr Oc Ns Op , Ns Ar count .Pp A blank line repeats the previous command from the address .Va next with count 1 and no modifiers. Specifying .Ar addr sets .Va dot to the address. Omitting .Ar addr uses .Va dot . A missing .Ar count is taken to be 1 for printing commands or infinity for stack traces. A .Ar count of -1 is equivalent to a missing .Ar count . Options that are supplied but not supported by the given .Ar command are usually ignored. .Pp The .Nm debugger has a pager feature (like the .Xr more 1 command) for the output. If an output line exceeds the number set in the .Va lines variable, it displays .Dq Li --More-- and waits for a response. The valid responses for it are: .Pp .Bl -tag -compact -width ".Li SPC" .It Li SPC one more page .It Li RET one more line .It Li q abort the current command, and return to the command input mode .El .Pp Finally, .Nm provides a small (currently 10 items) command history, and offers simple .Nm emacs Ns -style command line editing capabilities. In addition to the .Nm emacs control keys, the usual .Tn ANSI arrow keys may be used to browse through the history buffer, and move the cursor within the current line. .Sh COMMANDS .Bl -tag -width indent -compact .It Xo .Ic examine Ns Op Li / Ns Cm AISabcdghilmorsuxz ... .Oo Ar addr Oc Ns Op , Ns Ar count .Xc .It Xo .Ic x Ns Op Li / Ns Cm AISabcdghilmorsuxz ... .Oo Ar addr Oc Ns Op , Ns Ar count .Xc Display the addressed locations according to the formats in the modifier. Multiple modifier formats display multiple locations. If no format is specified, the last format specified for this command is used. .Pp The format characters are: .Bl -tag -compact -width indent .It Cm b look at by bytes (8 bits) .It Cm h look at by half words (16 bits) .It Cm l look at by long words (32 bits) .It Cm g look at by quad words (64 bits) .It Cm a print the location being displayed .It Cm A print the location with a line number if possible .It Cm x display in unsigned hex .It Cm z display in signed hex .It Cm o display in unsigned octal .It Cm d display in signed decimal .It Cm u display in unsigned decimal .It Cm r display in current radix, signed .It Cm c display low 8 bits as a character. Non-printing characters are displayed as an octal escape code (e.g., .Ql \e000 ) . .It Cm s display the null-terminated string at the location. Non-printing characters are displayed as octal escapes. .It Cm m display in unsigned hex with character dump at the end of each line. The location is also displayed in hex at the beginning of each line. .It Cm i display as a disassembled instruction .It Cm I display as an disassembled instruction with possible alternate formats depending on the machine. On i386, this selects the alternate format for the instruction decoding (16 bits in a 32-bit code segment and vice versa). .It Cm S display a symbol name for the pointer stored at the address .El .Pp .It Ic xf Examine forward: execute an .Ic examine command with the last specified parameters to it except that the next address displayed by it is used as the start address. .Pp .It Ic xb Examine backward: execute an .Ic examine command with the last specified parameters to it except that the last start address subtracted by the size displayed by it is used as the start address. .Pp .It Ic print Ns Op Li / Ns Cm acdoruxz .It Ic p Ns Op Li / Ns Cm acdoruxz Print .Ar addr Ns s according to the modifier character (as described above for .Cm examine ) . Valid formats are: .Cm a , x , z , o , d , u , r , and .Cm c . If no modifier is specified, the last one specified to it is used. The argument .Ar addr can be a string, in which case it is printed as it is. For example: .Bd -literal -offset indent print/x "eax = " $eax "\enecx = " $ecx "\en" .Ed .Pp will print like: .Bd -literal -offset indent eax = xxxxxx ecx = yyyyyy .Ed .Pp .It Xo .Ic write Ns Op Li / Ns Cm bhl .Ar addr expr1 Op Ar expr2 ... .Xc .It Xo .Ic w Ns Op Li / Ns Cm bhl .Ar addr expr1 Op Ar expr2 ... .Xc Write the expressions specified after .Ar addr on the command line at succeeding locations starting with .Ar addr . The write unit size can be specified in the modifier with a letter .Cm b (byte), .Cm h (half word) or .Cm l (long word) respectively. If omitted, long word is assumed. .Pp .Sy Warning : since there is no delimiter between expressions, strange things may happen. It is best to enclose each expression in parentheses. .Pp .It Ic set Li $ Ns Ar variable Oo Li = Oc Ar expr Set the named variable or register with the value of .Ar expr . Valid variable names are described below. .Pp .It Ic break Ns Oo Li / Ns Cm u Oc Oo Ar addr Oc Ns Op , Ns Ar count .It Ic b Ns Oo Li / Ns Cm u Oc Oo Ar addr Oc Ns Op , Ns Ar count Set a break point at .Ar addr . If .Ar count is supplied, the .Ic continue command will not stop at this break point on the first .Ar count \- 1 times that it is hit. If the break point is set, a break point number is printed with .Ql # . This number can be used in deleting the break point or adding conditions to it. .Pp If the .Cm u modifier is specified, this command sets a break point in user address space. Without the .Cm u option, the address is considered to be in the kernel space, and a wrong space address is rejected with an error message. This modifier can be used only if it is supported by machine dependent routines. .Pp .Sy Warning : If a user text is shadowed by a normal user space debugger, user space break points may not work correctly. Setting a break point at the low-level code paths may also cause strange behavior. .Pp .It Ic delete Op Ar addr .It Ic d Op Ar addr .It Ic delete Li # Ns Ar number .It Ic d Li # Ns Ar number Delete the specified break point. The break point can be specified by a break point number with .Ql # , or by using the same .Ar addr specified in the original .Ic break command, or by omitting .Ar addr to get the default address of .Va dot . .Pp .It Ic watch Oo Ar addr Oc Ns Op , Ns Ar size Set a watchpoint for a region. Execution stops when an attempt to modify the region occurs. The .Ar size argument defaults to 4. If you specify a wrong space address, the request is rejected with an error message. .Pp .Sy Warning : Attempts to watch wired kernel memory may cause unrecoverable error in some systems such as i386. Watchpoints on user addresses work best. .Pp .It Ic hwatch Oo Ar addr Oc Ns Op , Ns Ar size Set a hardware watchpoint for a region if supported by the architecture. Execution stops when an attempt to modify the region occurs. The .Ar size argument defaults to 4. .Pp .Sy Warning : The hardware debug facilities do not have a concept of separate address spaces like the watch command does. Use .Ic hwatch for setting watchpoints on kernel address locations only, and avoid its use on user mode address spaces. .Pp .It Ic dhwatch Oo Ar addr Oc Ns Op , Ns Ar size Delete specified hardware watchpoint. .Pp .It Ic step Ns Oo Li / Ns Cm p Oc Ns Op , Ns Ar count .It Ic s Ns Oo Li / Ns Cm p Oc Ns Op , Ns Ar count Single step .Ar count times. If the .Cm p modifier is specified, print each instruction at each step. Otherwise, only print the last instruction. .Pp .Sy Warning : depending on machine type, it may not be possible to single-step through some low-level code paths or user space code. On machines with software-emulated single-stepping (e.g., pmax), stepping through code executed by interrupt handlers will probably do the wrong thing. .Pp .It Ic continue Ns Op Li / Ns Cm c .It Ic c Ns Op Li / Ns Cm c Continue execution until a breakpoint or watchpoint. If the .Cm c modifier is specified, count instructions while executing. Some machines (e.g., pmax) also count loads and stores. .Pp .Sy Warning : when counting, the debugger is really silently single-stepping. This means that single-stepping on low-level code may cause strange behavior. .Pp .It Ic until Ns Op Li / Ns Cm p Stop at the next call or return instruction. If the .Cm p modifier is specified, print the call nesting depth and the cumulative instruction count at each call or return. Otherwise, only print when the matching return is hit. .Pp .It Ic next Ns Op Li / Ns Cm p .It Ic match Ns Op Li / Ns Cm p Stop at the matching return instruction. If the .Cm p modifier is specified, print the call nesting depth and the cumulative instruction count at each call or return. Otherwise, only print when the matching return is hit. .Pp .It Xo .Ic trace Ns Op Li / Ns Cm u .Op Ar pid | tid Ns .Op , Ns Ar count .Xc .It Xo .Ic t Ns Op Li / Ns Cm u .Op Ar pid | tid Ns .Op , Ns Ar count .Xc .It Xo .Ic where Ns Op Li / Ns Cm u .Op Ar pid | tid Ns .Op , Ns Ar count .Xc .It Xo .Ic bt Ns Op Li / Ns Cm u .Op Ar pid | tid Ns .Op , Ns Ar count .Xc Stack trace. The .Cm u option traces user space; if omitted, .Ic trace only traces kernel space. The optional argument .Ar count is the number of frames to be traced. If .Ar count is omitted, all frames are printed. .Pp .Sy Warning : User space stack trace is valid only if the machine dependent code supports it. .Pp .It Xo .Ic search Ns Op Li / Ns Cm bhl .Ar addr .Ar value .Op Ar mask Ns .Op , Ns Ar count .Xc Search memory for .Ar value . The optional .Ar count argument limits the search. .\" .Pp .It Xo .Ic findstack .Ar addr .Xc Prints the thread address for a thread kernel-mode stack of which contains the specified address. If the thread is not found, search the thread stack cache and prints the cached stack address. Otherwise, prints nothing. .Pp .It Ic show Cm all procs Ns Op Li / Ns Cm a .It Ic ps Ns Op Li / Ns Cm a Display all process information. The process information may not be shown if it is not supported in the machine, or the bottom of the stack of the target process is not in the main memory at that time. The .Cm a modifier will print command line arguments for each process. .\" .Pp .It Ic show Cm all trace .It Ic alltrace Show a stack trace for every thread in the system. .Pp .It Ic show Cm all ttys Show all TTY's within the system. Output is similar to .Xr pstat 8 , but also includes the address of the TTY structure. .\" .Pp .It Ic show Cm all vnets Show the same output as "show vnet" does, but lists all virtualized network stacks within the system. .\" .Pp .It Ic show Cm allchains Show the same information like "show lockchain" does, but for every thread in the system. .\" .Pp .It Ic show Cm alllocks Show all locks that are currently held. This command is only available if .Xr witness 4 is included in the kernel. .\" .Pp .It Ic show Cm allpcpu The same as "show pcpu", but for every CPU present in the system. .\" .Pp .It Ic show Cm allrman Show information related with resource management, including interrupt request lines, DMA request lines, I/O ports, I/O memory addresses, and Resource IDs. .\" .Pp .It Ic show Cm apic Dump data about APIC IDT vector mappings. .\" .Pp .It Ic show Cm breaks Show breakpoints set with the "break" command. .\" .Pp .It Ic show Cm bio Ar addr Show information about the bio structure .Vt struct bio present at .Ar addr . See the .Pa sys/bio.h header file and .Xr g_bio 9 for more details on the exact meaning of the structure fields. .\" .Pp .It Ic show Cm buffer Ar addr Show information about the buf structure .Vt struct buf present at .Ar addr . See the .Pa sys/buf.h header file for more details on the exact meaning of the structure fields. .\" .Pp .It Ic show Cm callout Ar addr Show information about the callout structure .Vt struct callout present at .Ar addr . .\" .Pp .It Ic show Cm cbstat Show brief information about the TTY subsystem. .\" .Pp .It Ic show Cm cdev Without argument, show the list of all created cdev's, consisting of devfs node name and struct cdev address. When address of cdev is supplied, show some internal devfs state of the cdev. .\" .Pp .It Ic show Cm conifhk Lists hooks currently waiting for completion in run_interrupt_driven_config_hooks(). .\" .Pp .It Ic show Cm cpusets Print numbered root and assigned CPU affinity sets. See .Xr cpuset 2 for more details. .\" .Pp .It Ic show Cm cyrixreg Show registers specific to the Cyrix processor. .\" .Pp .It Ic show Cm devmap Prints the contents of the static device mapping table. Currently only available on the ARM architecture. .\" .Pp .It Ic show Cm domain Ar addr Print protocol domain structure .Vt struct domain at address .Ar addr . See the .Pa sys/domain.h header file for more details on the exact meaning of the structure fields. .\" .Pp .It Ic show Cm ffs Op Ar addr Show brief information about ffs mount at the address .Ar addr , if argument is given. Otherwise, provides the summary about each ffs mount. .\" .Pp .It Ic show Cm file Ar addr Show information about the file structure .Vt struct file present at address .Ar addr . .\" .Pp .It Ic show Cm files Show information about every file structure in the system. .\" .Pp .It Ic show Cm freepages Show the number of physical pages in each of the free lists. .\" .Pp .It Ic show Cm geom Op Ar addr If the .Ar addr argument is not given, displays the entire GEOM topology. If .Ar addr is given, displays details about the given GEOM object (class, geom, provider or consumer). .\" .Pp .It Ic show Cm idt Show IDT layout. The first column specifies the IDT vector. The second one is the name of the interrupt/trap handler. Those functions are machine dependent. .\" .Pp .It Ic show Cm igi_list Ar addr Show information about the IGMP structure .Vt struct igmp_ifsoftc present at .Ar addr . .\" .Pp .It Ic show Cm inodedeps Op Ar addr Show brief information about each inodedep structure. If .Ar addr is given, only inodedeps belonging to the fs located at the supplied address are shown. .\" .Pp .It Ic show Cm inpcb Ar addr Show information on IP Control Block .Vt struct in_pcb present at .Ar addr . .\" .Pp .It Ic show Cm intr Dump information about interrupt handlers. .\" .Pp .It Ic show Cm intrcnt Dump the interrupt statistics. .\" .Pp .It Ic show Cm irqs Show interrupt lines and their respective kernel threads. .\" .Pp .It Ic show Cm jails Show the list of .Xr jail 8 instances. In addition to what .Xr jls 8 shows, also list kernel internal details. .\" .Pp .It Ic show Cm lapic Show information from the local APIC registers for this CPU. .\" .Pp .It Ic show Cm lock Ar addr Show lock structure. The output format is as follows: .Bl -tag -width "flags" .It Ic class: Class of the lock. Possible types include .Xr mutex 9 , .Xr rmlock 9 , .Xr rwlock 9 , .Xr sx 9 . .It Ic name: Name of the lock. .It Ic flags: Flags passed to the lock initialization function. .Em flags values are lock class specific. .It Ic state: Current state of a lock. .Em state values are lock class specific. .It Ic owner: Lock owner. .El .\" .Pp .It Ic show Cm lockchain Ar addr Show all threads a particular thread at address .Ar addr is waiting on based on non-spin locks. .\" .Pp .It Ic show Cm lockedbufs Show the same information as "show buf", but for every locked .Vt struct buf object. .\" .Pp .It Ic show Cm lockedvnods List all locked vnodes in the system. .\" .Pp .It Ic show Cm locks Prints all locks that are currently acquired. This command is only available if .Xr witness 4 is included in the kernel. .\" .Pp .It Ic show Cm locktree .\" .Pp -.It Ic show Cm malloc +.It Ic show Cm malloc Ns Op Li / Ns Cm i Prints .Xr malloc 9 memory allocator statistics. -The output format is as follows: +If the +.Cm i +modifier is specified, format output as machine-parseable comma-separated +values ("CSV"). +The output columns are as follows: .Pp .Bl -tag -compact -offset indent -width "Requests" .It Ic Type Specifies a type of memory. It is the same as a description string used while defining the given memory type with .Xr MALLOC_DECLARE 9 . .It Ic InUse Number of memory allocations of the given type, for which .Xr free 9 has not been called yet. .It Ic MemUse Total memory consumed by the given allocation type. .It Ic Requests Number of memory allocation requests for the given memory type. .El .Pp The same information can be gathered in userspace with .Dq Nm vmstat Fl m . .\" .Pp .It Ic show Cm map Ns Oo Li / Ns Cm f Oc Ar addr Prints the VM map at .Ar addr . If the .Cm f modifier is specified the complete map is printed. .\" .Pp .It Ic show Cm msgbuf Print the system's message buffer. It is the same output as in the .Dq Nm dmesg case. It is useful if you got a kernel panic, attached a serial cable to the machine and want to get the boot messages from before the system hang. .\" .It Ic show Cm mount Displays short info about all currently mounted file systems. .Pp .It Ic show Cm mount Ar addr Displays details about the given mount point. .\" .Pp .It Ic show Cm object Ns Oo Li / Ns Cm f Oc Ar addr Prints the VM object at .Ar addr . If the .Cm f option is specified the complete object is printed. .\" .Pp .It Ic show Cm panic Print the panic message if set. .\" .Pp .It Ic show Cm page Show statistics on VM pages. .\" .Pp .It Ic show Cm pageq Show statistics on VM page queues. .\" .Pp .It Ic show Cm pciregs Print PCI bus registers. The same information can be gathered in userspace by running .Dq Nm pciconf Fl lv . .\" .Pp .It Ic show Cm pcpu Print current processor state. The output format is as follows: .Pp .Bl -tag -compact -offset indent -width "spin locks held:" .It Ic cpuid Processor identifier. .It Ic curthread Thread pointer, process identifier and the name of the process. .It Ic curpcb Control block pointer. .It Ic fpcurthread FPU thread pointer. .It Ic idlethread Idle thread pointer. .It Ic APIC ID CPU identifier coming from APIC. .It Ic currentldt LDT pointer. .It Ic spin locks held Names of spin locks held. .El .\" .Pp .It Ic show Cm pgrpdump Dump process groups present within the system. .\" .Pp .It Ic show Cm proc Op Ar addr If no .Op Ar addr is specified, print information about the current process. Otherwise, show information about the process at address .Ar addr . .\" .Pp .It Ic show Cm procvm Show process virtual memory layout. .\" .Pp .It Ic show Cm protosw Ar addr Print protocol switch structure .Vt struct protosw at address .Ar addr . .\" .Pp .It Ic show Cm registers Ns Op Li / Ns Cm u Display the register set. If the .Cm u modifier is specified, it displays user registers instead of kernel registers or the currently saved one. .Pp .Sy Warning : The support of the .Cm u modifier depends on the machine. If not supported, incorrect information will be displayed. .\" .Pp .It Ic show Cm rman Ar addr Show resource manager object .Vt struct rman at address .Ar addr . Addresses of particular pointers can be gathered with "show allrman" command. .\" .Pp .It Ic show Cm route Ar addr Show route table result for destination .Ar addr . At this time, INET and INET6 formatted addresses are supported. .\" .Pp .It Ic show Cm routetable Oo Ar af Oc Show full route table or tables. If .Ar af is specified, show only routes for the given numeric address family. If no argument is specified, dump the route table for all address families. .\" .Pp .It Ic show Cm rtc Show real time clock value. Useful for long debugging sessions. .\" .Pp .It Ic show Cm sleepchain Deprecated. Now an alias for .Ic show Cm lockchain . .\" .Pp .It Ic show Cm sleepq .It Ic show Cm sleepqueue Both commands provide the same functionality. They show sleepqueue .Vt struct sleepqueue structure. Sleepqueues are used within the .Fx kernel to implement sleepable synchronization primitives (thread holding a lock might sleep or be context switched), which at the time of writing are: .Xr condvar 9 , .Xr sx 9 and standard .Xr msleep 9 interface. .\" .Pp .It Ic show Cm sockbuf Ar addr .It Ic show Cm socket Ar addr Those commands print .Vt struct sockbuf and .Vt struct socket objects placed at .Ar addr . Output consists of all values present in structures mentioned. For exact interpretation and more details, visit .Pa sys/socket.h header file. .\" .Pp .It Ic show Cm sysregs Show system registers (e.g., .Li cr0-4 on i386.) Not present on some platforms. .\" .Pp .It Ic show Cm tcpcb Ar addr Print TCP control block .Vt struct tcpcb lying at address .Ar addr . For exact interpretation of output, visit .Pa netinet/tcp.h header file. .\" .Pp .It Ic show Cm thread Op Ar addr | tid If no .Ar addr or .Ar tid is specified, show detailed information about current thread. Otherwise, print information about the thread with ID .Ar tid or kernel address .Ar addr . (If the argument is a decimal number, it is assumed to be a tid.) .\" .Pp .It Ic show Cm threads Show all threads within the system. Output format is as follows: .Pp .Bl -tag -compact -offset indent -width "Second column" .It Ic First column Thread identifier (TID) .It Ic Second column Thread structure address .It Ic Third column Backtrace. .El .\" .Pp .It Ic show Cm tty Ar addr Display the contents of a TTY structure in a readable form. .\" .Pp .It Ic show Cm turnstile Ar addr Show turnstile .Vt struct turnstile structure at address .Ar addr . Turnstiles are structures used within the .Fx kernel to implement synchronization primitives which, while holding a specific type of lock, cannot sleep or context switch to another thread. Currently, those are: .Xr mutex 9 , .Xr rwlock 9 , .Xr rmlock 9 . .\" .Pp -.It Ic show Cm uma +.It Ic show Cm uma Ns Op Li / Ns Cm i Show UMA allocator statistics. -Output consists five columns: +If the +.Cm i +modifier is specified, format output as machine-parseable comma-separated +values ("CSV"). +The output contains the following columns: .Pp -.Bl -tag -compact -offset indent -width "Requests" +.Bl -tag -compact -offset indent -width "Total Mem" .It Cm "Zone" Name of the UMA zone. The same string that was passed to .Xr uma_zcreate 9 as a first argument. .It Cm "Size" Size of a given memory object (slab). .It Cm "Used" Number of slabs being currently used. .It Cm "Free" Number of free slabs within the UMA zone. .It Cm "Requests" Number of allocations requests to the given zone. +.It Cm "Total Mem" +Total memory in use (either allocated or free) by a zone, in bytes. +.It Cm "XFree" +Number of free slabs within the UMA zone that were freed on a different NUMA +domain than allocated. +(The count in the +.Cm "Free" +column is inclusive of +.Cm "XFree" . ) .El .Pp -The very same information might be gathered in the userspace +The same information might be gathered in the userspace with the help of .Dq Nm vmstat Fl z . .\" .Pp .It Ic show Cm unpcb Ar addr Shows UNIX domain socket private control block .Vt struct unpcb present at the address .Ar addr . .\" .Pp .It Ic show Cm vmochk Prints, whether the internal VM objects are in a map somewhere and none have zero ref counts. .\" .Pp .It Ic show Cm vmopag This is supposed to show physical addresses consumed by a VM object. Currently, it is not possible to use this command when .Xr witness 4 is compiled in the kernel. .\" .Pp .It Ic show Cm vnet Ar addr Prints virtualized network stack .Vt struct vnet structure present at the address .Ar addr . .\" .Pp .It Ic show Cm vnode Op Ar addr Prints vnode .Vt struct vnode structure lying at .Op Ar addr . For the exact interpretation of the output, look at the .Pa sys/vnode.h header file. .\" .Pp .It Ic show Cm vnodebufs Ar addr Shows clean/dirty buffer lists of the vnode located at .Ar addr . .\" .Pp .It Ic show Cm vpath Ar addr Walk the namecache to lookup the pathname of the vnode located at .Ar addr . .\" .Pp .It Ic show Cm watches Displays all watchpoints. Shows watchpoints set with "watch" command. .\" .Pp .It Ic show Cm witness Shows information about lock acquisition coming from the .Xr witness 4 subsystem. .\" .Pp .It Ic gdb Toggles between remote GDB and DDB mode. In remote GDB mode, another machine is required that runs .Xr gdb 1 using the remote debug feature, with a connection to the serial console port on the target machine. Currently only available on the i386 architecture. .Pp .It Ic halt Halt the system. .Pp .It Ic kill Ar sig pid Send signal .Ar sig to process .Ar pid . The signal is acted on upon returning from the debugger. This command can be used to kill a process causing resource contention in the case of a hung system. See .Xr signal 3 for a list of signals. Note that the arguments are reversed relative to .Xr kill 2 . .Pp .It Ic reboot Op Ar seconds .It Ic reset Op Ar seconds Hard reset the system. If the optional argument .Ar seconds is given, the debugger will wait for this long, at most a week, before rebooting. .Pp .It Ic help Print a short summary of the available commands and command abbreviations. .Pp .It Ic capture on .It Ic capture off .It Ic capture reset .It Ic capture status .Nm supports a basic output capture facility, which can be used to retrieve the results of debugging commands from userspace using .Xr sysctl 3 . .Ic capture on enables output capture; .Ic capture off disables capture. .Ic capture reset will clear the capture buffer and disable capture. .Ic capture status will report current buffer use, buffer size, and disposition of output capture. .Pp Userspace processes may inspect and manage .Nm capture state using .Xr sysctl 8 : .Pp .Va debug.ddb.capture.bufsize may be used to query or set the current capture buffer size. .Pp .Va debug.ddb.capture.maxbufsize may be used to query the compile-time limit on the capture buffer size. .Pp .Va debug.ddb.capture.bytes may be used to query the number of bytes of output currently in the capture buffer. .Pp .Va debug.ddb.capture.data returns the contents of the buffer as a string to an appropriately privileged process. .Pp This facility is particularly useful in concert with the scripting and .Xr textdump 4 facilities, allowing scripted debugging output to be captured and committed to disk as part of a textdump for later analysis. The contents of the capture buffer may also be inspected in a kernel core dump using .Xr kgdb 1 . .Pp .It Ic run .It Ic script .It Ic scripts .It Ic unscript Run, define, list, and delete scripts. See the .Sx SCRIPTING section for more information on the scripting facility. .Pp .It Ic textdump dump .It Ic textdump set .It Ic textdump status .It Ic textdump unset Use the .Ic textdump dump command to immediately perform a textdump. More information may be found in .Xr textdump 4 . The .Ic textdump set command may be used to force the next kernel core dump to be a textdump rather than a traditional memory dump or minidump. .Ic textdump status reports whether a textdump has been scheduled. .Ic textdump unset cancels a request to perform a textdump as the next kernel core dump. .Pp .It Ic thread Ar addr | tid Switch the debugger to the thread with ID .Ar tid , if the argument is a decimal number, or address .Ar addr , otherwise. .El .Sh VARIABLES The debugger accesses registers and variables as .Li $ Ns Ar name . Register names are as in the .Dq Ic show Cm registers command. Some variables are suffixed with numbers, and may have some modifier following a colon immediately after the variable name. For example, register variables can have a .Cm u modifier to indicate user register (e.g., .Dq Li $eax:u ) . .Pp Built-in variables currently supported are: .Pp .Bl -tag -width ".Va tabstops" -compact .It Va radix Input and output radix. .It Va maxoff Addresses are printed as .Dq Ar symbol Ns Li + Ns Ar offset unless .Ar offset is greater than .Va maxoff . .It Va maxwidth The width of the displayed line. .It Va lines The number of lines. It is used by the built-in pager. Setting it to 0 disables paging. .It Va tabstops Tab stop width. .It Va work Ns Ar xx Work variable; .Ar xx can take values from 0 to 31. .El .Sh EXPRESSIONS Most expression operators in C are supported except .Ql ~ , .Ql ^ , and unary .Ql & . Special rules in .Nm are: .Bl -tag -width ".No Identifiers" .It Identifiers The name of a symbol is translated to the value of the symbol, which is the address of the corresponding object. .Ql \&. and .Ql \&: can be used in the identifier. If supported by an object format dependent routine, .Sm off .Oo Ar filename : Oc Ar func : lineno , .Sm on .Oo Ar filename : Oc Ns Ar variable , and .Oo Ar filename : Oc Ns Ar lineno can be accepted as a symbol. .It Numbers Radix is determined by the first two letters: .Ql 0x : hex, .Ql 0o : octal, .Ql 0t : decimal; otherwise, follow current radix. .It Li \&. .Va dot .It Li + .Va next .It Li .. address of the start of the last line examined. Unlike .Va dot or .Va next , this is only changed by .Ic examine or .Ic write command. .It Li ' last address explicitly specified. .It Li $ Ns Ar variable Translated to the value of the specified variable. It may be followed by a .Ql \&: and modifiers as described above. .It Ar a Ns Li # Ns Ar b A binary operator which rounds up the left hand side to the next multiple of right hand side. .It Li * Ns Ar expr Indirection. It may be followed by a .Ql \&: and modifiers as described above. .El .Sh SCRIPTING .Nm supports a basic scripting facility to allow automating tasks or responses to specific events. Each script consists of a list of DDB commands to be executed sequentially, and is assigned a unique name. Certain script names have special meaning, and will be automatically run on various .Nm events if scripts by those names have been defined. .Pp The .Ic script command may be used to define a script by name. Scripts consist of a series of .Nm commands separated with the .Ql \&; character. For example: .Bd -literal -offset indent script kdb.enter.panic=bt; show pcpu script lockinfo=show alllocks; show lockedvnods .Ed .Pp The .Ic scripts command lists currently defined scripts. .Pp The .Ic run command execute a script by name. For example: .Bd -literal -offset indent run lockinfo .Ed .Pp The .Ic unscript command may be used to delete a script by name. For example: .Bd -literal -offset indent unscript kdb.enter.panic .Ed .Pp These functions may also be performed from userspace using the .Xr ddb 8 command. .Pp Certain scripts are run automatically, if defined, for specific .Nm events. The follow scripts are run when various events occur: .Bl -tag -width kdb.enter.powerfail .It Va kdb.enter.acpi The kernel debugger was entered as a result of an .Xr acpi 4 event. .It Va kdb.enter.bootflags The kernel debugger was entered at boot as a result of the debugger boot flag being set. .It Va kdb.enter.break The kernel debugger was entered as a result of a serial or console break. .It Va kdb.enter.cam The kernel debugger was entered as a result of a .Xr CAM 4 event. .It Va kdb.enter.mac The kernel debugger was entered as a result of an assertion failure in the .Xr mac_test 4 module of the TrustedBSD MAC Framework. .It Va kdb.enter.ndis The kernel debugger was entered as a result of an .Xr ndis 4 breakpoint event. .It Va kdb.enter.netgraph The kernel debugger was entered as a result of a .Xr netgraph 4 event. .It Va kdb.enter.panic .Xr panic 9 was called. .It Va kdb.enter.powerfail The kernel debugger was entered as a result of a powerfail NMI on the sparc64 platform. .It Va kdb.enter.powerpc The kernel debugger was entered as a result of an unimplemented interrupt type on the powerpc platform. .It Va kdb.enter.sysctl The kernel debugger was entered as a result of the .Va debug.kdb.enter sysctl being set. .It Va kdb.enter.trapsig The kernel debugger was entered as a result of a trapsig event on the sparc64 platform. .It Va kdb.enter.unionfs The kernel debugger was entered as a result of an assertion failure in the union file system. .It Va kdb.enter.unknown The kernel debugger was entered, but no reason has been set. .It Va kdb.enter.vfslock The kernel debugger was entered as a result of a VFS lock violation. .It Va kdb.enter.watchdog The kernel debugger was entered as a result of a watchdog firing. .It Va kdb.enter.witness The kernel debugger was entered as a result of a .Xr witness 4 violation. .El .Pp In the event that none of these scripts is found, .Nm will attempt to execute a default script: .Bl -tag -width kdb.enter.powerfail .It Va kdb.enter.default The kernel debugger was entered, but a script exactly matching the reason for entering was not defined. This can be used as a catch-all to handle cases not specifically of interest; for example, .Va kdb.enter.witness might be defined to have special handling, and .Va kdb.enter.default might be defined to simply panic and reboot. .El .Sh HINTS On machines with an ISA expansion bus, a simple NMI generation card can be constructed by connecting a push button between the A01 and B01 (CHCHK# and GND) card fingers. Momentarily shorting these two fingers together may cause the bridge chipset to generate an NMI, which causes the kernel to pass control to .Nm . Some bridge chipsets do not generate a NMI on CHCHK#, so your mileage may vary. The NMI allows one to break into the debugger on a wedged machine to diagnose problems. Other bus' bridge chipsets may be able to generate NMI using bus specific methods. There are many PCI and PCIe add-in cards which can generate NMI for debugging. Modern server systems typically use IPMI to generate signals to enter the debugger. The .Va devel/ipmitool port can be used to send the .Cd chassis power diag command which delivers an NMI to the processor. Embedded systems often use JTAG for debugging, but rarely use it in combination with .Nm . .Pp For serial consoles, you can enter the debugger by sending a BREAK condition on the serial line if .Cd options BREAK_TO_DEBUGGER is specified in the kernel. Most terminal emulation programs can send a break sequence with a special key sequence or via a menu item. However, in some setups, sending the break can be difficult to arrange or happens spuriously, so if the kernel contains .Cd options ALT_BREAK_TO_DEBUGGER then the sequence of CR TILDE CTRL-B enters the debugger; CR TILDE CTRL-P causes a panic instead of entering the debugger; and CR TILDE CTRL-R causes an immediate reboot. In all the above sequences, CR is a Carriage Return and is usually sent by hitting the Enter or Return key. TILDE is the ASCII tilde character (~). CTRL-x is Control x created by hitting the control key and then x and then releasing both. .Pp The break to enter the debugger behavior may be enabled at run-time by setting the .Xr sysctl 8 .Va debug.kdb.break_to_debugger to 1. The alternate sequence to enter the debugger behavior may be enabled at run-time by setting the .Xr sysctl 8 .Va debug.kdb.alt_break_to_debugger to 1. The debugger may be entered by setting the .Xr sysctl 8 .Va debug.kdb.enter to 1. .Sh FILES Header files mentioned in this manual page can be found below .Pa /usr/include directory. .Pp .Bl -dash -compact .It .Pa sys/buf.h .It .Pa sys/domain.h .It .Pa netinet/in_pcb.h .It .Pa sys/socket.h .It .Pa sys/vnode.h .El .Sh SEE ALSO .Xr gdb 1 , .Xr kgdb 1 , .Xr acpi 4 , .Xr CAM 4 , .Xr mac_test 4 , .Xr ndis 4 , .Xr netgraph 4 , .Xr textdump 4 , .Xr witness 4 , .Xr ddb 8 , .Xr sysctl 8 , .Xr panic 9 .Sh HISTORY The .Nm debugger was developed for Mach, and ported to .Bx 386 0.1 . This manual page translated from .Xr man 7 macros by .An Garrett Wollman . .Pp .An Robert N. M. Watson added support for .Nm output capture, .Xr textdump 4 and scripting in .Fx 7.1 . Index: head/sys/kern/kern_malloc.c =================================================================== --- head/sys/kern/kern_malloc.c (revision 353428) +++ head/sys/kern/kern_malloc.c (revision 353429) @@ -1,1318 +1,1373 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1987, 1991, 1993 * The Regents of the University of California. * Copyright (c) 2005-2009 Robert N. M. Watson * Copyright (c) 2008 Otto Moerbeek (mallocarray) * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)kern_malloc.c 8.3 (Berkeley) 1/4/94 */ /* * Kernel malloc(9) implementation -- general purpose kernel memory allocator * based on memory types. Back end is implemented using the UMA(9) zone * allocator. A set of fixed-size buckets are used for smaller allocations, * and a special UMA allocation interface is used for larger allocations. * Callers declare memory types, and statistics are maintained independently * for each memory type. Statistics are maintained per-CPU for performance * reasons. See malloc(9) and comments in malloc.h for a detailed * description. */ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include "opt_vm.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef DEBUG_MEMGUARD #include #endif #ifdef DEBUG_REDZONE #include #endif #if defined(INVARIANTS) && defined(__i386__) #include #endif #include #ifdef KDTRACE_HOOKS #include bool __read_frequently dtrace_malloc_enabled; dtrace_malloc_probe_func_t __read_mostly dtrace_malloc_probe; #endif #if defined(INVARIANTS) || defined(MALLOC_MAKE_FAILURES) || \ defined(DEBUG_MEMGUARD) || defined(DEBUG_REDZONE) #define MALLOC_DEBUG 1 #endif /* * When realloc() is called, if the new size is sufficiently smaller than * the old size, realloc() will allocate a new, smaller block to avoid * wasting memory. 'Sufficiently smaller' is defined as: newsize <= * oldsize / 2^n, where REALLOC_FRACTION defines the value of 'n'. */ #ifndef REALLOC_FRACTION #define REALLOC_FRACTION 1 /* new block if <= half the size */ #endif /* * Centrally define some common malloc types. */ MALLOC_DEFINE(M_CACHE, "cache", "Various Dynamically allocated caches"); MALLOC_DEFINE(M_DEVBUF, "devbuf", "device driver memory"); MALLOC_DEFINE(M_TEMP, "temp", "misc temporary data buffers"); static struct malloc_type *kmemstatistics; static int kmemcount; #define KMEM_ZSHIFT 4 #define KMEM_ZBASE 16 #define KMEM_ZMASK (KMEM_ZBASE - 1) #define KMEM_ZMAX 65536 #define KMEM_ZSIZE (KMEM_ZMAX >> KMEM_ZSHIFT) static uint8_t kmemsize[KMEM_ZSIZE + 1]; #ifndef MALLOC_DEBUG_MAXZONES #define MALLOC_DEBUG_MAXZONES 1 #endif static int numzones = MALLOC_DEBUG_MAXZONES; /* * Small malloc(9) memory allocations are allocated from a set of UMA buckets * of various sizes. * * XXX: The comment here used to read "These won't be powers of two for * long." It's possible that a significant amount of wasted memory could be * recovered by tuning the sizes of these buckets. */ struct { int kz_size; char *kz_name; uma_zone_t kz_zone[MALLOC_DEBUG_MAXZONES]; } kmemzones[] = { {16, "16", }, {32, "32", }, {64, "64", }, {128, "128", }, {256, "256", }, {512, "512", }, {1024, "1024", }, {2048, "2048", }, {4096, "4096", }, {8192, "8192", }, {16384, "16384", }, {32768, "32768", }, {65536, "65536", }, {0, NULL}, }; /* * Zone to allocate malloc type descriptions from. For ABI reasons, memory * types are described by a data structure passed by the declaring code, but * the malloc(9) implementation has its own data structure describing the * type and statistics. This permits the malloc(9)-internal data structures * to be modified without breaking binary-compiled kernel modules that * declare malloc types. */ static uma_zone_t mt_zone; static uma_zone_t mt_stats_zone; u_long vm_kmem_size; SYSCTL_ULONG(_vm, OID_AUTO, kmem_size, CTLFLAG_RDTUN, &vm_kmem_size, 0, "Size of kernel memory"); static u_long kmem_zmax = KMEM_ZMAX; SYSCTL_ULONG(_vm, OID_AUTO, kmem_zmax, CTLFLAG_RDTUN, &kmem_zmax, 0, "Maximum allocation size that malloc(9) would use UMA as backend"); static u_long vm_kmem_size_min; SYSCTL_ULONG(_vm, OID_AUTO, kmem_size_min, CTLFLAG_RDTUN, &vm_kmem_size_min, 0, "Minimum size of kernel memory"); static u_long vm_kmem_size_max; SYSCTL_ULONG(_vm, OID_AUTO, kmem_size_max, CTLFLAG_RDTUN, &vm_kmem_size_max, 0, "Maximum size of kernel memory"); static u_int vm_kmem_size_scale; SYSCTL_UINT(_vm, OID_AUTO, kmem_size_scale, CTLFLAG_RDTUN, &vm_kmem_size_scale, 0, "Scale factor for kernel memory size"); static int sysctl_kmem_map_size(SYSCTL_HANDLER_ARGS); SYSCTL_PROC(_vm, OID_AUTO, kmem_map_size, CTLFLAG_RD | CTLTYPE_ULONG | CTLFLAG_MPSAFE, NULL, 0, sysctl_kmem_map_size, "LU", "Current kmem allocation size"); static int sysctl_kmem_map_free(SYSCTL_HANDLER_ARGS); SYSCTL_PROC(_vm, OID_AUTO, kmem_map_free, CTLFLAG_RD | CTLTYPE_ULONG | CTLFLAG_MPSAFE, NULL, 0, sysctl_kmem_map_free, "LU", "Free space in kmem"); /* * The malloc_mtx protects the kmemstatistics linked list. */ struct mtx malloc_mtx; #ifdef MALLOC_PROFILE uint64_t krequests[KMEM_ZSIZE + 1]; static int sysctl_kern_mprof(SYSCTL_HANDLER_ARGS); #endif static int sysctl_kern_malloc_stats(SYSCTL_HANDLER_ARGS); /* * time_uptime of the last malloc(9) failure (induced or real). */ static time_t t_malloc_fail; #if defined(MALLOC_MAKE_FAILURES) || (MALLOC_DEBUG_MAXZONES > 1) static SYSCTL_NODE(_debug, OID_AUTO, malloc, CTLFLAG_RD, 0, "Kernel malloc debugging options"); #endif /* * malloc(9) fault injection -- cause malloc failures every (n) mallocs when * the caller specifies M_NOWAIT. If set to 0, no failures are caused. */ #ifdef MALLOC_MAKE_FAILURES static int malloc_failure_rate; static int malloc_nowait_count; static int malloc_failure_count; SYSCTL_INT(_debug_malloc, OID_AUTO, failure_rate, CTLFLAG_RWTUN, &malloc_failure_rate, 0, "Every (n) mallocs with M_NOWAIT will fail"); SYSCTL_INT(_debug_malloc, OID_AUTO, failure_count, CTLFLAG_RD, &malloc_failure_count, 0, "Number of imposed M_NOWAIT malloc failures"); #endif static int sysctl_kmem_map_size(SYSCTL_HANDLER_ARGS) { u_long size; size = uma_size(); return (sysctl_handle_long(oidp, &size, 0, req)); } static int sysctl_kmem_map_free(SYSCTL_HANDLER_ARGS) { u_long size, limit; /* The sysctl is unsigned, implement as a saturation value. */ size = uma_size(); limit = uma_limit(); if (size > limit) size = 0; else size = limit - size; return (sysctl_handle_long(oidp, &size, 0, req)); } /* * malloc(9) uma zone separation -- sub-page buffer overruns in one * malloc type will affect only a subset of other malloc types. */ #if MALLOC_DEBUG_MAXZONES > 1 static void tunable_set_numzones(void) { TUNABLE_INT_FETCH("debug.malloc.numzones", &numzones); /* Sanity check the number of malloc uma zones. */ if (numzones <= 0) numzones = 1; if (numzones > MALLOC_DEBUG_MAXZONES) numzones = MALLOC_DEBUG_MAXZONES; } SYSINIT(numzones, SI_SUB_TUNABLES, SI_ORDER_ANY, tunable_set_numzones, NULL); SYSCTL_INT(_debug_malloc, OID_AUTO, numzones, CTLFLAG_RDTUN | CTLFLAG_NOFETCH, &numzones, 0, "Number of malloc uma subzones"); /* * Any number that changes regularly is an okay choice for the * offset. Build numbers are pretty good of you have them. */ static u_int zone_offset = __FreeBSD_version; TUNABLE_INT("debug.malloc.zone_offset", &zone_offset); SYSCTL_UINT(_debug_malloc, OID_AUTO, zone_offset, CTLFLAG_RDTUN, &zone_offset, 0, "Separate malloc types by examining the " "Nth character in the malloc type short description."); static void mtp_set_subzone(struct malloc_type *mtp) { struct malloc_type_internal *mtip; const char *desc; size_t len; u_int val; mtip = mtp->ks_handle; desc = mtp->ks_shortdesc; if (desc == NULL || (len = strlen(desc)) == 0) val = 0; else val = desc[zone_offset % len]; mtip->mti_zone = (val % numzones); } static inline u_int mtp_get_subzone(struct malloc_type *mtp) { struct malloc_type_internal *mtip; mtip = mtp->ks_handle; KASSERT(mtip->mti_zone < numzones, ("mti_zone %u out of range %d", mtip->mti_zone, numzones)); return (mtip->mti_zone); } #elif MALLOC_DEBUG_MAXZONES == 0 #error "MALLOC_DEBUG_MAXZONES must be positive." #else static void mtp_set_subzone(struct malloc_type *mtp) { struct malloc_type_internal *mtip; mtip = mtp->ks_handle; mtip->mti_zone = 0; } static inline u_int mtp_get_subzone(struct malloc_type *mtp) { return (0); } #endif /* MALLOC_DEBUG_MAXZONES > 1 */ int malloc_last_fail(void) { return (time_uptime - t_malloc_fail); } /* * An allocation has succeeded -- update malloc type statistics for the * amount of bucket size. Occurs within a critical section so that the * thread isn't preempted and doesn't migrate while updating per-PCU * statistics. */ static void malloc_type_zone_allocated(struct malloc_type *mtp, unsigned long size, int zindx) { struct malloc_type_internal *mtip; struct malloc_type_stats *mtsp; critical_enter(); mtip = mtp->ks_handle; mtsp = zpcpu_get(mtip->mti_stats); if (size > 0) { mtsp->mts_memalloced += size; mtsp->mts_numallocs++; } if (zindx != -1) mtsp->mts_size |= 1 << zindx; #ifdef KDTRACE_HOOKS if (__predict_false(dtrace_malloc_enabled)) { uint32_t probe_id = mtip->mti_probes[DTMALLOC_PROBE_MALLOC]; if (probe_id != 0) (dtrace_malloc_probe)(probe_id, (uintptr_t) mtp, (uintptr_t) mtip, (uintptr_t) mtsp, size, zindx); } #endif critical_exit(); } void malloc_type_allocated(struct malloc_type *mtp, unsigned long size) { if (size > 0) malloc_type_zone_allocated(mtp, size, -1); } /* * A free operation has occurred -- update malloc type statistics for the * amount of the bucket size. Occurs within a critical section so that the * thread isn't preempted and doesn't migrate while updating per-CPU * statistics. */ void malloc_type_freed(struct malloc_type *mtp, unsigned long size) { struct malloc_type_internal *mtip; struct malloc_type_stats *mtsp; critical_enter(); mtip = mtp->ks_handle; mtsp = zpcpu_get(mtip->mti_stats); mtsp->mts_memfreed += size; mtsp->mts_numfrees++; #ifdef KDTRACE_HOOKS if (__predict_false(dtrace_malloc_enabled)) { uint32_t probe_id = mtip->mti_probes[DTMALLOC_PROBE_FREE]; if (probe_id != 0) (dtrace_malloc_probe)(probe_id, (uintptr_t) mtp, (uintptr_t) mtip, (uintptr_t) mtsp, size, 0); } #endif critical_exit(); } /* * contigmalloc: * * Allocate a block of physically contiguous memory. * * If M_NOWAIT is set, this routine will not block and return NULL if * the allocation fails. */ void * contigmalloc(unsigned long size, struct malloc_type *type, int flags, vm_paddr_t low, vm_paddr_t high, unsigned long alignment, vm_paddr_t boundary) { void *ret; ret = (void *)kmem_alloc_contig(size, flags, low, high, alignment, boundary, VM_MEMATTR_DEFAULT); if (ret != NULL) malloc_type_allocated(type, round_page(size)); return (ret); } void * contigmalloc_domainset(unsigned long size, struct malloc_type *type, struct domainset *ds, int flags, vm_paddr_t low, vm_paddr_t high, unsigned long alignment, vm_paddr_t boundary) { void *ret; ret = (void *)kmem_alloc_contig_domainset(ds, size, flags, low, high, alignment, boundary, VM_MEMATTR_DEFAULT); if (ret != NULL) malloc_type_allocated(type, round_page(size)); return (ret); } /* * contigfree: * * Free a block of memory allocated by contigmalloc. * * This routine may not block. */ void contigfree(void *addr, unsigned long size, struct malloc_type *type) { kmem_free((vm_offset_t)addr, size); malloc_type_freed(type, round_page(size)); } #ifdef MALLOC_DEBUG static int malloc_dbg(caddr_t *vap, size_t *sizep, struct malloc_type *mtp, int flags) { #ifdef INVARIANTS int indx; KASSERT(mtp->ks_magic == M_MAGIC, ("malloc: bad malloc type magic")); /* * Check that exactly one of M_WAITOK or M_NOWAIT is specified. */ indx = flags & (M_WAITOK | M_NOWAIT); if (indx != M_NOWAIT && indx != M_WAITOK) { static struct timeval lasterr; static int curerr, once; if (once == 0 && ppsratecheck(&lasterr, &curerr, 1)) { printf("Bad malloc flags: %x\n", indx); kdb_backtrace(); flags |= M_WAITOK; once++; } } #endif #ifdef MALLOC_MAKE_FAILURES if ((flags & M_NOWAIT) && (malloc_failure_rate != 0)) { atomic_add_int(&malloc_nowait_count, 1); if ((malloc_nowait_count % malloc_failure_rate) == 0) { atomic_add_int(&malloc_failure_count, 1); t_malloc_fail = time_uptime; *vap = NULL; return (EJUSTRETURN); } } #endif if (flags & M_WAITOK) { KASSERT(curthread->td_intr_nesting_level == 0, ("malloc(M_WAITOK) in interrupt context")); KASSERT(curthread->td_epochnest == 0, ("malloc(M_WAITOK) in epoch context")); } KASSERT(curthread->td_critnest == 0 || SCHEDULER_STOPPED(), ("malloc: called with spinlock or critical section held")); #ifdef DEBUG_MEMGUARD if (memguard_cmp_mtp(mtp, *sizep)) { *vap = memguard_alloc(*sizep, flags); if (*vap != NULL) return (EJUSTRETURN); /* This is unfortunate but should not be fatal. */ } #endif #ifdef DEBUG_REDZONE *sizep = redzone_size_ntor(*sizep); #endif return (0); } #endif /* * malloc: * * Allocate a block of memory. * * If M_NOWAIT is set, this routine will not block and return NULL if * the allocation fails. */ void * (malloc)(size_t size, struct malloc_type *mtp, int flags) { int indx; caddr_t va; uma_zone_t zone; #if defined(DEBUG_REDZONE) unsigned long osize = size; #endif #ifdef MALLOC_DEBUG va = NULL; if (malloc_dbg(&va, &size, mtp, flags) != 0) return (va); #endif if (size <= kmem_zmax && (flags & M_EXEC) == 0) { if (size & KMEM_ZMASK) size = (size & ~KMEM_ZMASK) + KMEM_ZBASE; indx = kmemsize[size >> KMEM_ZSHIFT]; zone = kmemzones[indx].kz_zone[mtp_get_subzone(mtp)]; #ifdef MALLOC_PROFILE krequests[size >> KMEM_ZSHIFT]++; #endif va = uma_zalloc(zone, flags); if (va != NULL) size = zone->uz_size; malloc_type_zone_allocated(mtp, va == NULL ? 0 : size, indx); } else { size = roundup(size, PAGE_SIZE); zone = NULL; va = uma_large_malloc(size, flags); malloc_type_allocated(mtp, va == NULL ? 0 : size); } if (flags & M_WAITOK) KASSERT(va != NULL, ("malloc(M_WAITOK) returned NULL")); else if (va == NULL) t_malloc_fail = time_uptime; #ifdef DEBUG_REDZONE if (va != NULL) va = redzone_setup(va, osize); #endif return ((void *) va); } static void * malloc_domain(size_t size, struct malloc_type *mtp, int domain, int flags) { int indx; caddr_t va; uma_zone_t zone; #if defined(DEBUG_REDZONE) unsigned long osize = size; #endif #ifdef MALLOC_DEBUG va = NULL; if (malloc_dbg(&va, &size, mtp, flags) != 0) return (va); #endif if (size <= kmem_zmax && (flags & M_EXEC) == 0) { if (size & KMEM_ZMASK) size = (size & ~KMEM_ZMASK) + KMEM_ZBASE; indx = kmemsize[size >> KMEM_ZSHIFT]; zone = kmemzones[indx].kz_zone[mtp_get_subzone(mtp)]; #ifdef MALLOC_PROFILE krequests[size >> KMEM_ZSHIFT]++; #endif va = uma_zalloc_domain(zone, NULL, domain, flags); if (va != NULL) size = zone->uz_size; malloc_type_zone_allocated(mtp, va == NULL ? 0 : size, indx); } else { size = roundup(size, PAGE_SIZE); zone = NULL; va = uma_large_malloc_domain(size, domain, flags); malloc_type_allocated(mtp, va == NULL ? 0 : size); } if (flags & M_WAITOK) KASSERT(va != NULL, ("malloc(M_WAITOK) returned NULL")); else if (va == NULL) t_malloc_fail = time_uptime; #ifdef DEBUG_REDZONE if (va != NULL) va = redzone_setup(va, osize); #endif return ((void *) va); } void * malloc_domainset(size_t size, struct malloc_type *mtp, struct domainset *ds, int flags) { struct vm_domainset_iter di; void *ret; int domain; vm_domainset_iter_policy_init(&di, ds, &domain, &flags); do { ret = malloc_domain(size, mtp, domain, flags); if (ret != NULL) break; } while (vm_domainset_iter_policy(&di, &domain) == 0); return (ret); } void * mallocarray(size_t nmemb, size_t size, struct malloc_type *type, int flags) { if (WOULD_OVERFLOW(nmemb, size)) panic("mallocarray: %zu * %zu overflowed", nmemb, size); return (malloc(size * nmemb, type, flags)); } #ifdef INVARIANTS static void free_save_type(void *addr, struct malloc_type *mtp, u_long size) { struct malloc_type **mtpp = addr; /* * Cache a pointer to the malloc_type that most recently freed * this memory here. This way we know who is most likely to * have stepped on it later. * * This code assumes that size is a multiple of 8 bytes for * 64 bit machines */ mtpp = (struct malloc_type **) ((unsigned long)mtpp & ~UMA_ALIGN_PTR); mtpp += (size - sizeof(struct malloc_type *)) / sizeof(struct malloc_type *); *mtpp = mtp; } #endif #ifdef MALLOC_DEBUG static int free_dbg(void **addrp, struct malloc_type *mtp) { void *addr; addr = *addrp; KASSERT(mtp->ks_magic == M_MAGIC, ("free: bad malloc type magic")); KASSERT(curthread->td_critnest == 0 || SCHEDULER_STOPPED(), ("free: called with spinlock or critical section held")); /* free(NULL, ...) does nothing */ if (addr == NULL) return (EJUSTRETURN); #ifdef DEBUG_MEMGUARD if (is_memguard_addr(addr)) { memguard_free(addr); return (EJUSTRETURN); } #endif #ifdef DEBUG_REDZONE redzone_check(addr); *addrp = redzone_addr_ntor(addr); #endif return (0); } #endif /* * free: * * Free a block of memory allocated by malloc. * * This routine may not block. */ void free(void *addr, struct malloc_type *mtp) { uma_slab_t slab; u_long size; #ifdef MALLOC_DEBUG if (free_dbg(&addr, mtp) != 0) return; #endif /* free(NULL, ...) does nothing */ if (addr == NULL) return; slab = vtoslab((vm_offset_t)addr & (~UMA_SLAB_MASK)); if (slab == NULL) panic("free: address %p(%p) has not been allocated.\n", addr, (void *)((u_long)addr & (~UMA_SLAB_MASK))); if (!(slab->us_flags & UMA_SLAB_MALLOC)) { size = slab->us_keg->uk_size; #ifdef INVARIANTS free_save_type(addr, mtp, size); #endif uma_zfree_arg(LIST_FIRST(&slab->us_keg->uk_zones), addr, slab); } else { size = slab->us_size; uma_large_free(slab); } malloc_type_freed(mtp, size); } void free_domain(void *addr, struct malloc_type *mtp) { uma_slab_t slab; u_long size; #ifdef MALLOC_DEBUG if (free_dbg(&addr, mtp) != 0) return; #endif /* free(NULL, ...) does nothing */ if (addr == NULL) return; slab = vtoslab((vm_offset_t)addr & (~UMA_SLAB_MASK)); if (slab == NULL) panic("free_domain: address %p(%p) has not been allocated.\n", addr, (void *)((u_long)addr & (~UMA_SLAB_MASK))); if (!(slab->us_flags & UMA_SLAB_MALLOC)) { size = slab->us_keg->uk_size; #ifdef INVARIANTS free_save_type(addr, mtp, size); #endif uma_zfree_domain(LIST_FIRST(&slab->us_keg->uk_zones), addr, slab); } else { size = slab->us_size; uma_large_free(slab); } malloc_type_freed(mtp, size); } /* * realloc: change the size of a memory block */ void * realloc(void *addr, size_t size, struct malloc_type *mtp, int flags) { uma_slab_t slab; unsigned long alloc; void *newaddr; KASSERT(mtp->ks_magic == M_MAGIC, ("realloc: bad malloc type magic")); KASSERT(curthread->td_critnest == 0 || SCHEDULER_STOPPED(), ("realloc: called with spinlock or critical section held")); /* realloc(NULL, ...) is equivalent to malloc(...) */ if (addr == NULL) return (malloc(size, mtp, flags)); /* * XXX: Should report free of old memory and alloc of new memory to * per-CPU stats. */ #ifdef DEBUG_MEMGUARD if (is_memguard_addr(addr)) return (memguard_realloc(addr, size, mtp, flags)); #endif #ifdef DEBUG_REDZONE slab = NULL; alloc = redzone_get_size(addr); #else slab = vtoslab((vm_offset_t)addr & ~(UMA_SLAB_MASK)); /* Sanity check */ KASSERT(slab != NULL, ("realloc: address %p out of range", (void *)addr)); /* Get the size of the original block */ if (!(slab->us_flags & UMA_SLAB_MALLOC)) alloc = slab->us_keg->uk_size; else alloc = slab->us_size; /* Reuse the original block if appropriate */ if (size <= alloc && (size > (alloc >> REALLOC_FRACTION) || alloc == MINALLOCSIZE)) return (addr); #endif /* !DEBUG_REDZONE */ /* Allocate a new, bigger (or smaller) block */ if ((newaddr = malloc(size, mtp, flags)) == NULL) return (NULL); /* Copy over original contents */ bcopy(addr, newaddr, min(size, alloc)); free(addr, mtp); return (newaddr); } /* * reallocf: same as realloc() but free memory on failure. */ void * reallocf(void *addr, size_t size, struct malloc_type *mtp, int flags) { void *mem; if ((mem = realloc(addr, size, mtp, flags)) == NULL) free(addr, mtp); return (mem); } #ifndef __sparc64__ CTASSERT(VM_KMEM_SIZE_SCALE >= 1); #endif /* * Initialize the kernel memory (kmem) arena. */ void kmeminit(void) { u_long mem_size; u_long tmp; #ifdef VM_KMEM_SIZE if (vm_kmem_size == 0) vm_kmem_size = VM_KMEM_SIZE; #endif #ifdef VM_KMEM_SIZE_MIN if (vm_kmem_size_min == 0) vm_kmem_size_min = VM_KMEM_SIZE_MIN; #endif #ifdef VM_KMEM_SIZE_MAX if (vm_kmem_size_max == 0) vm_kmem_size_max = VM_KMEM_SIZE_MAX; #endif /* * Calculate the amount of kernel virtual address (KVA) space that is * preallocated to the kmem arena. In order to support a wide range * of machines, it is a function of the physical memory size, * specifically, * * min(max(physical memory size / VM_KMEM_SIZE_SCALE, * VM_KMEM_SIZE_MIN), VM_KMEM_SIZE_MAX) * * Every architecture must define an integral value for * VM_KMEM_SIZE_SCALE. However, the definitions of VM_KMEM_SIZE_MIN * and VM_KMEM_SIZE_MAX, which represent respectively the floor and * ceiling on this preallocation, are optional. Typically, * VM_KMEM_SIZE_MAX is itself a function of the available KVA space on * a given architecture. */ mem_size = vm_cnt.v_page_count; if (mem_size <= 32768) /* delphij XXX 128MB */ kmem_zmax = PAGE_SIZE; if (vm_kmem_size_scale < 1) vm_kmem_size_scale = VM_KMEM_SIZE_SCALE; /* * Check if we should use defaults for the "vm_kmem_size" * variable: */ if (vm_kmem_size == 0) { vm_kmem_size = mem_size / vm_kmem_size_scale; vm_kmem_size = vm_kmem_size * PAGE_SIZE < vm_kmem_size ? vm_kmem_size_max : vm_kmem_size * PAGE_SIZE; if (vm_kmem_size_min > 0 && vm_kmem_size < vm_kmem_size_min) vm_kmem_size = vm_kmem_size_min; if (vm_kmem_size_max > 0 && vm_kmem_size >= vm_kmem_size_max) vm_kmem_size = vm_kmem_size_max; } if (vm_kmem_size == 0) panic("Tune VM_KMEM_SIZE_* for the platform"); /* * The amount of KVA space that is preallocated to the * kmem arena can be set statically at compile-time or manually * through the kernel environment. However, it is still limited to * twice the physical memory size, which has been sufficient to handle * the most severe cases of external fragmentation in the kmem arena. */ if (vm_kmem_size / 2 / PAGE_SIZE > mem_size) vm_kmem_size = 2 * mem_size * PAGE_SIZE; vm_kmem_size = round_page(vm_kmem_size); #ifdef DEBUG_MEMGUARD tmp = memguard_fudge(vm_kmem_size, kernel_map); #else tmp = vm_kmem_size; #endif uma_set_limit(tmp); #ifdef DEBUG_MEMGUARD /* * Initialize MemGuard if support compiled in. MemGuard is a * replacement allocator used for detecting tamper-after-free * scenarios as they occur. It is only used for debugging. */ memguard_init(kernel_arena); #endif } /* * Initialize the kernel memory allocator */ /* ARGSUSED*/ static void mallocinit(void *dummy) { int i; uint8_t indx; mtx_init(&malloc_mtx, "malloc", NULL, MTX_DEF); kmeminit(); if (kmem_zmax < PAGE_SIZE || kmem_zmax > KMEM_ZMAX) kmem_zmax = KMEM_ZMAX; mt_stats_zone = uma_zcreate("mt_stats_zone", sizeof(struct malloc_type_stats), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_PCPU); mt_zone = uma_zcreate("mt_zone", sizeof(struct malloc_type_internal), #ifdef INVARIANTS mtrash_ctor, mtrash_dtor, mtrash_init, mtrash_fini, #else NULL, NULL, NULL, NULL, #endif UMA_ALIGN_PTR, UMA_ZONE_MALLOC); for (i = 0, indx = 0; kmemzones[indx].kz_size != 0; indx++) { int size = kmemzones[indx].kz_size; char *name = kmemzones[indx].kz_name; int subzone; for (subzone = 0; subzone < numzones; subzone++) { kmemzones[indx].kz_zone[subzone] = uma_zcreate(name, size, #ifdef INVARIANTS mtrash_ctor, mtrash_dtor, mtrash_init, mtrash_fini, #else NULL, NULL, NULL, NULL, #endif UMA_ALIGN_PTR, UMA_ZONE_MALLOC); } for (;i <= size; i+= KMEM_ZBASE) kmemsize[i >> KMEM_ZSHIFT] = indx; } } SYSINIT(kmem, SI_SUB_KMEM, SI_ORDER_SECOND, mallocinit, NULL); void malloc_init(void *data) { struct malloc_type_internal *mtip; struct malloc_type *mtp; KASSERT(vm_cnt.v_page_count != 0, ("malloc_register before vm_init")); mtp = data; if (mtp->ks_magic != M_MAGIC) panic("malloc_init: bad malloc type magic"); mtip = uma_zalloc(mt_zone, M_WAITOK | M_ZERO); mtip->mti_stats = uma_zalloc_pcpu(mt_stats_zone, M_WAITOK | M_ZERO); mtp->ks_handle = mtip; mtp_set_subzone(mtp); mtx_lock(&malloc_mtx); mtp->ks_next = kmemstatistics; kmemstatistics = mtp; kmemcount++; mtx_unlock(&malloc_mtx); } void malloc_uninit(void *data) { struct malloc_type_internal *mtip; struct malloc_type_stats *mtsp; struct malloc_type *mtp, *temp; uma_slab_t slab; long temp_allocs, temp_bytes; int i; mtp = data; KASSERT(mtp->ks_magic == M_MAGIC, ("malloc_uninit: bad malloc type magic")); KASSERT(mtp->ks_handle != NULL, ("malloc_deregister: cookie NULL")); mtx_lock(&malloc_mtx); mtip = mtp->ks_handle; mtp->ks_handle = NULL; if (mtp != kmemstatistics) { for (temp = kmemstatistics; temp != NULL; temp = temp->ks_next) { if (temp->ks_next == mtp) { temp->ks_next = mtp->ks_next; break; } } KASSERT(temp, ("malloc_uninit: type '%s' not found", mtp->ks_shortdesc)); } else kmemstatistics = mtp->ks_next; kmemcount--; mtx_unlock(&malloc_mtx); /* * Look for memory leaks. */ temp_allocs = temp_bytes = 0; for (i = 0; i <= mp_maxid; i++) { mtsp = zpcpu_get_cpu(mtip->mti_stats, i); temp_allocs += mtsp->mts_numallocs; temp_allocs -= mtsp->mts_numfrees; temp_bytes += mtsp->mts_memalloced; temp_bytes -= mtsp->mts_memfreed; } if (temp_allocs > 0 || temp_bytes > 0) { printf("Warning: memory type %s leaked memory on destroy " "(%ld allocations, %ld bytes leaked).\n", mtp->ks_shortdesc, temp_allocs, temp_bytes); } slab = vtoslab((vm_offset_t) mtip & (~UMA_SLAB_MASK)); uma_zfree_pcpu(mt_stats_zone, mtip->mti_stats); uma_zfree_arg(mt_zone, mtip, slab); } struct malloc_type * malloc_desc2type(const char *desc) { struct malloc_type *mtp; mtx_assert(&malloc_mtx, MA_OWNED); for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) { if (strcmp(mtp->ks_shortdesc, desc) == 0) return (mtp); } return (NULL); } static int sysctl_kern_malloc_stats(SYSCTL_HANDLER_ARGS) { struct malloc_type_stream_header mtsh; struct malloc_type_internal *mtip; struct malloc_type_stats *mtsp, zeromts; struct malloc_type_header mth; struct malloc_type *mtp; int error, i; struct sbuf sbuf; error = sysctl_wire_old_buffer(req, 0); if (error != 0) return (error); sbuf_new_for_sysctl(&sbuf, NULL, 128, req); sbuf_clear_flags(&sbuf, SBUF_INCLUDENUL); mtx_lock(&malloc_mtx); bzero(&zeromts, sizeof(zeromts)); /* * Insert stream header. */ bzero(&mtsh, sizeof(mtsh)); mtsh.mtsh_version = MALLOC_TYPE_STREAM_VERSION; mtsh.mtsh_maxcpus = MAXCPU; mtsh.mtsh_count = kmemcount; (void)sbuf_bcat(&sbuf, &mtsh, sizeof(mtsh)); /* * Insert alternating sequence of type headers and type statistics. */ for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) { mtip = (struct malloc_type_internal *)mtp->ks_handle; /* * Insert type header. */ bzero(&mth, sizeof(mth)); strlcpy(mth.mth_name, mtp->ks_shortdesc, MALLOC_MAX_NAME); (void)sbuf_bcat(&sbuf, &mth, sizeof(mth)); /* * Insert type statistics for each CPU. */ for (i = 0; i <= mp_maxid; i++) { mtsp = zpcpu_get_cpu(mtip->mti_stats, i); (void)sbuf_bcat(&sbuf, mtsp, sizeof(*mtsp)); } /* * Fill in the missing CPUs. */ for (; i < MAXCPU; i++) { (void)sbuf_bcat(&sbuf, &zeromts, sizeof(zeromts)); } } mtx_unlock(&malloc_mtx); error = sbuf_finish(&sbuf); sbuf_delete(&sbuf); return (error); } SYSCTL_PROC(_kern, OID_AUTO, malloc_stats, CTLFLAG_RD|CTLTYPE_STRUCT, 0, 0, sysctl_kern_malloc_stats, "s,malloc_type_ustats", "Return malloc types"); SYSCTL_INT(_kern, OID_AUTO, malloc_count, CTLFLAG_RD, &kmemcount, 0, "Count of kernel malloc types"); void malloc_type_list(malloc_type_list_func_t *func, void *arg) { struct malloc_type *mtp, **bufmtp; int count, i; size_t buflen; mtx_lock(&malloc_mtx); restart: mtx_assert(&malloc_mtx, MA_OWNED); count = kmemcount; mtx_unlock(&malloc_mtx); buflen = sizeof(struct malloc_type *) * count; bufmtp = malloc(buflen, M_TEMP, M_WAITOK); mtx_lock(&malloc_mtx); if (count < kmemcount) { free(bufmtp, M_TEMP); goto restart; } for (mtp = kmemstatistics, i = 0; mtp != NULL; mtp = mtp->ks_next, i++) bufmtp[i] = mtp; mtx_unlock(&malloc_mtx); for (i = 0; i < count; i++) (func)(bufmtp[i], arg); free(bufmtp, M_TEMP); } #ifdef DDB +static int64_t +get_malloc_stats(const struct malloc_type_internal *mtip, uint64_t *allocs, + uint64_t *inuse) +{ + const struct malloc_type_stats *mtsp; + uint64_t frees, alloced, freed; + int i; + + *allocs = 0; + frees = 0; + alloced = 0; + freed = 0; + for (i = 0; i <= mp_maxid; i++) { + mtsp = zpcpu_get_cpu(mtip->mti_stats, i); + + *allocs += mtsp->mts_numallocs; + frees += mtsp->mts_numfrees; + alloced += mtsp->mts_memalloced; + freed += mtsp->mts_memfreed; + } + *inuse = *allocs - frees; + return (alloced - freed); +} + DB_SHOW_COMMAND(malloc, db_show_malloc) { - struct malloc_type_internal *mtip; - struct malloc_type_stats *mtsp; + const char *fmt_hdr, *fmt_entry; struct malloc_type *mtp; - uint64_t allocs, frees; - uint64_t alloced, freed; - int i; + uint64_t allocs, inuse; + int64_t size; + /* variables for sorting */ + struct malloc_type *last_mtype, *cur_mtype; + int64_t cur_size, last_size; + int ties; - db_printf("%18s %12s %12s %12s\n", "Type", "InUse", "MemUse", - "Requests"); - for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) { - mtip = (struct malloc_type_internal *)mtp->ks_handle; - allocs = 0; - frees = 0; - alloced = 0; - freed = 0; - for (i = 0; i <= mp_maxid; i++) { - mtsp = zpcpu_get_cpu(mtip->mti_stats, i); - allocs += mtsp->mts_numallocs; - frees += mtsp->mts_numfrees; - alloced += mtsp->mts_memalloced; - freed += mtsp->mts_memfreed; + if (modif[0] == 'i') { + fmt_hdr = "%s,%s,%s,%s\n"; + fmt_entry = "\"%s\",%ju,%jdK,%ju\n"; + } else { + fmt_hdr = "%18s %12s %12s %12s\n"; + fmt_entry = "%18s %12ju %12jdK %12ju\n"; + } + + db_printf(fmt_hdr, "Type", "InUse", "MemUse", "Requests"); + + /* Select sort, largest size first. */ + last_mtype = NULL; + last_size = INT64_MAX; + for (;;) { + cur_mtype = NULL; + cur_size = -1; + ties = 0; + + for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) { + /* + * In the case of size ties, print out mtypes + * in the order they are encountered. That is, + * when we encounter the most recently output + * mtype, we have already printed all preceding + * ties, and we must print all following ties. + */ + if (mtp == last_mtype) { + ties = 1; + continue; + } + size = get_malloc_stats(mtp->ks_handle, &allocs, + &inuse); + if (size > cur_size && size < last_size + ties) { + cur_size = size; + cur_mtype = mtp; + } } - db_printf("%18s %12ju %12juK %12ju\n", - mtp->ks_shortdesc, allocs - frees, - (alloced - freed + 1023) / 1024, allocs); + if (cur_mtype == NULL) + break; + + size = get_malloc_stats(cur_mtype->ks_handle, &allocs, &inuse); + db_printf(fmt_entry, cur_mtype->ks_shortdesc, inuse, + howmany(size, 1024), allocs); + if (db_pager_quit) break; + + last_mtype = cur_mtype; + last_size = cur_size; } } #if MALLOC_DEBUG_MAXZONES > 1 DB_SHOW_COMMAND(multizone_matches, db_show_multizone_matches) { struct malloc_type_internal *mtip; struct malloc_type *mtp; u_int subzone; if (!have_addr) { db_printf("Usage: show multizone_matches \n"); return; } mtp = (void *)addr; if (mtp->ks_magic != M_MAGIC) { db_printf("Magic %lx does not match expected %x\n", mtp->ks_magic, M_MAGIC); return; } mtip = mtp->ks_handle; subzone = mtip->mti_zone; for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) { mtip = mtp->ks_handle; if (mtip->mti_zone != subzone) continue; db_printf("%s\n", mtp->ks_shortdesc); if (db_pager_quit) break; } } #endif /* MALLOC_DEBUG_MAXZONES > 1 */ #endif /* DDB */ #ifdef MALLOC_PROFILE static int sysctl_kern_mprof(SYSCTL_HANDLER_ARGS) { struct sbuf sbuf; uint64_t count; uint64_t waste; uint64_t mem; int error; int rsize; int size; int i; waste = 0; mem = 0; error = sysctl_wire_old_buffer(req, 0); if (error != 0) return (error); sbuf_new_for_sysctl(&sbuf, NULL, 128, req); sbuf_printf(&sbuf, "\n Size Requests Real Size\n"); for (i = 0; i < KMEM_ZSIZE; i++) { size = i << KMEM_ZSHIFT; rsize = kmemzones[kmemsize[i]].kz_size; count = (long long unsigned)krequests[i]; sbuf_printf(&sbuf, "%6d%28llu%11d\n", size, (unsigned long long)count, rsize); if ((rsize * count) > (size * count)) waste += (rsize * count) - (size * count); mem += (rsize * count); } sbuf_printf(&sbuf, "\nTotal memory used:\t%30llu\nTotal Memory wasted:\t%30llu\n", (unsigned long long)mem, (unsigned long long)waste); error = sbuf_finish(&sbuf); sbuf_delete(&sbuf); return (error); } SYSCTL_OID(_kern, OID_AUTO, mprof, CTLTYPE_STRING|CTLFLAG_RD, NULL, 0, sysctl_kern_mprof, "A", "Malloc Profiling"); #endif /* MALLOC_PROFILE */ Index: head/sys/vm/uma_core.c =================================================================== --- head/sys/vm/uma_core.c (revision 353428) +++ head/sys/vm/uma_core.c (revision 353429) @@ -1,4401 +1,4462 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2002-2005, 2009, 2013 Jeffrey Roberson * Copyright (c) 2004, 2005 Bosko Milekic * Copyright (c) 2004-2006 Robert N. M. Watson * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /* * uma_core.c Implementation of the Universal Memory allocator * * This allocator is intended to replace the multitude of similar object caches * in the standard FreeBSD kernel. The intent is to be flexible as well as * efficient. A primary design goal is to return unused memory to the rest of * the system. This will make the system as a whole more flexible due to the * ability to move memory to subsystems which most need it instead of leaving * pools of reserved memory unused. * * The basic ideas stem from similar slab/zone based allocators whose algorithms * are well known. * */ /* * TODO: * - Improve memory usage for large allocations * - Investigate cache size adjustments */ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include "opt_param.h" #include "opt_vm.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef DEBUG_MEMGUARD #include #endif /* * This is the zone and keg from which all zones are spawned. */ static uma_zone_t kegs; static uma_zone_t zones; /* This is the zone from which all offpage uma_slab_ts are allocated. */ static uma_zone_t slabzone; /* * The initial hash tables come out of this zone so they can be allocated * prior to malloc coming up. */ static uma_zone_t hashzone; /* The boot-time adjusted value for cache line alignment. */ int uma_align_cache = 64 - 1; static MALLOC_DEFINE(M_UMAHASH, "UMAHash", "UMA Hash Buckets"); /* * Are we allowed to allocate buckets? */ static int bucketdisable = 1; /* Linked list of all kegs in the system */ static LIST_HEAD(,uma_keg) uma_kegs = LIST_HEAD_INITIALIZER(uma_kegs); /* Linked list of all cache-only zones in the system */ static LIST_HEAD(,uma_zone) uma_cachezones = LIST_HEAD_INITIALIZER(uma_cachezones); /* This RW lock protects the keg list */ static struct rwlock_padalign __exclusive_cache_line uma_rwlock; /* * Pointer and counter to pool of pages, that is preallocated at * startup to bootstrap UMA. */ static char *bootmem; static int boot_pages; static struct sx uma_reclaim_lock; /* * kmem soft limit, initialized by uma_set_limit(). Ensure that early * allocations don't trigger a wakeup of the reclaim thread. */ static unsigned long uma_kmem_limit = LONG_MAX; SYSCTL_ULONG(_vm, OID_AUTO, uma_kmem_limit, CTLFLAG_RD, &uma_kmem_limit, 0, "UMA kernel memory soft limit"); static unsigned long uma_kmem_total; SYSCTL_ULONG(_vm, OID_AUTO, uma_kmem_total, CTLFLAG_RD, &uma_kmem_total, 0, "UMA kernel memory usage"); /* Is the VM done starting up? */ static enum { BOOT_COLD = 0, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS, BOOT_RUNNING } booted = BOOT_COLD; /* * This is the handle used to schedule events that need to happen * outside of the allocation fast path. */ static struct callout uma_callout; #define UMA_TIMEOUT 20 /* Seconds for callout interval. */ /* * This structure is passed as the zone ctor arg so that I don't have to create * a special allocation function just for zones. */ struct uma_zctor_args { const char *name; size_t size; uma_ctor ctor; uma_dtor dtor; uma_init uminit; uma_fini fini; uma_import import; uma_release release; void *arg; uma_keg_t keg; int align; uint32_t flags; }; struct uma_kctor_args { uma_zone_t zone; size_t size; uma_init uminit; uma_fini fini; int align; uint32_t flags; }; struct uma_bucket_zone { uma_zone_t ubz_zone; char *ubz_name; int ubz_entries; /* Number of items it can hold. */ int ubz_maxsize; /* Maximum allocation size per-item. */ }; /* * Compute the actual number of bucket entries to pack them in power * of two sizes for more efficient space utilization. */ #define BUCKET_SIZE(n) \ (((sizeof(void *) * (n)) - sizeof(struct uma_bucket)) / sizeof(void *)) #define BUCKET_MAX BUCKET_SIZE(256) #define BUCKET_MIN BUCKET_SIZE(4) struct uma_bucket_zone bucket_zones[] = { { NULL, "4 Bucket", BUCKET_SIZE(4), 4096 }, { NULL, "6 Bucket", BUCKET_SIZE(6), 3072 }, { NULL, "8 Bucket", BUCKET_SIZE(8), 2048 }, { NULL, "12 Bucket", BUCKET_SIZE(12), 1536 }, { NULL, "16 Bucket", BUCKET_SIZE(16), 1024 }, { NULL, "32 Bucket", BUCKET_SIZE(32), 512 }, { NULL, "64 Bucket", BUCKET_SIZE(64), 256 }, { NULL, "128 Bucket", BUCKET_SIZE(128), 128 }, { NULL, "256 Bucket", BUCKET_SIZE(256), 64 }, { NULL, NULL, 0} }; /* * Flags and enumerations to be passed to internal functions. */ enum zfreeskip { SKIP_NONE = 0, SKIP_CNT = 0x00000001, SKIP_DTOR = 0x00010000, SKIP_FINI = 0x00020000, }; /* Prototypes.. */ int uma_startup_count(int); void uma_startup(void *, int); void uma_startup1(void); void uma_startup2(void); static void *noobj_alloc(uma_zone_t, vm_size_t, int, uint8_t *, int); static void *page_alloc(uma_zone_t, vm_size_t, int, uint8_t *, int); static void *pcpu_page_alloc(uma_zone_t, vm_size_t, int, uint8_t *, int); static void *startup_alloc(uma_zone_t, vm_size_t, int, uint8_t *, int); static void page_free(void *, vm_size_t, uint8_t); static void pcpu_page_free(void *, vm_size_t, uint8_t); static uma_slab_t keg_alloc_slab(uma_keg_t, uma_zone_t, int, int, int); static void cache_drain(uma_zone_t); static void bucket_drain(uma_zone_t, uma_bucket_t); static void bucket_cache_reclaim(uma_zone_t zone, bool); static int keg_ctor(void *, int, void *, int); static void keg_dtor(void *, int, void *); static int zone_ctor(void *, int, void *, int); static void zone_dtor(void *, int, void *); static int zero_init(void *, int, int); static void keg_small_init(uma_keg_t keg); static void keg_large_init(uma_keg_t keg); static void zone_foreach(void (*zfunc)(uma_zone_t)); static void zone_timeout(uma_zone_t zone); static int hash_alloc(struct uma_hash *, u_int); static int hash_expand(struct uma_hash *, struct uma_hash *); static void hash_free(struct uma_hash *hash); static void uma_timeout(void *); static void uma_startup3(void); static void *zone_alloc_item(uma_zone_t, void *, int, int); static void *zone_alloc_item_locked(uma_zone_t, void *, int, int); static void zone_free_item(uma_zone_t, void *, void *, enum zfreeskip); static void bucket_enable(void); static void bucket_init(void); static uma_bucket_t bucket_alloc(uma_zone_t zone, void *, int); static void bucket_free(uma_zone_t zone, uma_bucket_t, void *); static void bucket_zone_drain(void); static uma_bucket_t zone_alloc_bucket(uma_zone_t, void *, int, int, int); static uma_slab_t zone_fetch_slab(uma_zone_t, uma_keg_t, int, int); static void *slab_alloc_item(uma_keg_t keg, uma_slab_t slab); static void slab_free_item(uma_zone_t zone, uma_slab_t slab, void *item); static uma_keg_t uma_kcreate(uma_zone_t zone, size_t size, uma_init uminit, uma_fini fini, int align, uint32_t flags); static int zone_import(uma_zone_t, void **, int, int, int); static void zone_release(uma_zone_t, void **, int); static void uma_zero_item(void *, uma_zone_t); void uma_print_zone(uma_zone_t); void uma_print_stats(void); static int sysctl_vm_zone_count(SYSCTL_HANDLER_ARGS); static int sysctl_vm_zone_stats(SYSCTL_HANDLER_ARGS); #ifdef INVARIANTS static bool uma_dbg_kskip(uma_keg_t keg, void *mem); static bool uma_dbg_zskip(uma_zone_t zone, void *mem); static void uma_dbg_free(uma_zone_t zone, uma_slab_t slab, void *item); static void uma_dbg_alloc(uma_zone_t zone, uma_slab_t slab, void *item); static SYSCTL_NODE(_vm, OID_AUTO, debug, CTLFLAG_RD, 0, "Memory allocation debugging"); static u_int dbg_divisor = 1; SYSCTL_UINT(_vm_debug, OID_AUTO, divisor, CTLFLAG_RDTUN | CTLFLAG_NOFETCH, &dbg_divisor, 0, "Debug & thrash every this item in memory allocator"); static counter_u64_t uma_dbg_cnt = EARLY_COUNTER; static counter_u64_t uma_skip_cnt = EARLY_COUNTER; SYSCTL_COUNTER_U64(_vm_debug, OID_AUTO, trashed, CTLFLAG_RD, &uma_dbg_cnt, "memory items debugged"); SYSCTL_COUNTER_U64(_vm_debug, OID_AUTO, skipped, CTLFLAG_RD, &uma_skip_cnt, "memory items skipped, not debugged"); #endif SYSINIT(uma_startup3, SI_SUB_VM_CONF, SI_ORDER_SECOND, uma_startup3, NULL); SYSCTL_PROC(_vm, OID_AUTO, zone_count, CTLFLAG_RD|CTLTYPE_INT, 0, 0, sysctl_vm_zone_count, "I", "Number of UMA zones"); SYSCTL_PROC(_vm, OID_AUTO, zone_stats, CTLFLAG_RD|CTLTYPE_STRUCT, 0, 0, sysctl_vm_zone_stats, "s,struct uma_type_header", "Zone Stats"); static int zone_warnings = 1; SYSCTL_INT(_vm, OID_AUTO, zone_warnings, CTLFLAG_RWTUN, &zone_warnings, 0, "Warn when UMA zones becomes full"); /* Adjust bytes under management by UMA. */ static inline void uma_total_dec(unsigned long size) { atomic_subtract_long(&uma_kmem_total, size); } static inline void uma_total_inc(unsigned long size) { if (atomic_fetchadd_long(&uma_kmem_total, size) > uma_kmem_limit) uma_reclaim_wakeup(); } /* * This routine checks to see whether or not it's safe to enable buckets. */ static void bucket_enable(void) { bucketdisable = vm_page_count_min(); } /* * Initialize bucket_zones, the array of zones of buckets of various sizes. * * For each zone, calculate the memory required for each bucket, consisting * of the header and an array of pointers. */ static void bucket_init(void) { struct uma_bucket_zone *ubz; int size; for (ubz = &bucket_zones[0]; ubz->ubz_entries != 0; ubz++) { size = roundup(sizeof(struct uma_bucket), sizeof(void *)); size += sizeof(void *) * ubz->ubz_entries; ubz->ubz_zone = uma_zcreate(ubz->ubz_name, size, NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_MTXCLASS | UMA_ZFLAG_BUCKET | UMA_ZONE_NUMA); } } /* * Given a desired number of entries for a bucket, return the zone from which * to allocate the bucket. */ static struct uma_bucket_zone * bucket_zone_lookup(int entries) { struct uma_bucket_zone *ubz; for (ubz = &bucket_zones[0]; ubz->ubz_entries != 0; ubz++) if (ubz->ubz_entries >= entries) return (ubz); ubz--; return (ubz); } static int bucket_select(int size) { struct uma_bucket_zone *ubz; ubz = &bucket_zones[0]; if (size > ubz->ubz_maxsize) return MAX((ubz->ubz_maxsize * ubz->ubz_entries) / size, 1); for (; ubz->ubz_entries != 0; ubz++) if (ubz->ubz_maxsize < size) break; ubz--; return (ubz->ubz_entries); } static uma_bucket_t bucket_alloc(uma_zone_t zone, void *udata, int flags) { struct uma_bucket_zone *ubz; uma_bucket_t bucket; /* * This is to stop us from allocating per cpu buckets while we're * running out of vm.boot_pages. Otherwise, we would exhaust the * boot pages. This also prevents us from allocating buckets in * low memory situations. */ if (bucketdisable) return (NULL); /* * To limit bucket recursion we store the original zone flags * in a cookie passed via zalloc_arg/zfree_arg. This allows the * NOVM flag to persist even through deep recursions. We also * store ZFLAG_BUCKET once we have recursed attempting to allocate * a bucket for a bucket zone so we do not allow infinite bucket * recursion. This cookie will even persist to frees of unused * buckets via the allocation path or bucket allocations in the * free path. */ if ((zone->uz_flags & UMA_ZFLAG_BUCKET) == 0) udata = (void *)(uintptr_t)zone->uz_flags; else { if ((uintptr_t)udata & UMA_ZFLAG_BUCKET) return (NULL); udata = (void *)((uintptr_t)udata | UMA_ZFLAG_BUCKET); } if ((uintptr_t)udata & UMA_ZFLAG_CACHEONLY) flags |= M_NOVM; ubz = bucket_zone_lookup(zone->uz_count); if (ubz->ubz_zone == zone && (ubz + 1)->ubz_entries != 0) ubz++; bucket = uma_zalloc_arg(ubz->ubz_zone, udata, flags); if (bucket) { #ifdef INVARIANTS bzero(bucket->ub_bucket, sizeof(void *) * ubz->ubz_entries); #endif bucket->ub_cnt = 0; bucket->ub_entries = ubz->ubz_entries; } return (bucket); } static void bucket_free(uma_zone_t zone, uma_bucket_t bucket, void *udata) { struct uma_bucket_zone *ubz; KASSERT(bucket->ub_cnt == 0, ("bucket_free: Freeing a non free bucket.")); if ((zone->uz_flags & UMA_ZFLAG_BUCKET) == 0) udata = (void *)(uintptr_t)zone->uz_flags; ubz = bucket_zone_lookup(bucket->ub_entries); uma_zfree_arg(ubz->ubz_zone, bucket, udata); } static void bucket_zone_drain(void) { struct uma_bucket_zone *ubz; for (ubz = &bucket_zones[0]; ubz->ubz_entries != 0; ubz++) uma_zone_reclaim(ubz->ubz_zone, UMA_RECLAIM_DRAIN); } /* * Attempt to satisfy an allocation by retrieving a full bucket from one of the * zone's caches. */ static uma_bucket_t zone_fetch_bucket(uma_zone_t zone, uma_zone_domain_t zdom) { uma_bucket_t bucket; ZONE_LOCK_ASSERT(zone); if ((bucket = TAILQ_FIRST(&zdom->uzd_buckets)) != NULL) { MPASS(zdom->uzd_nitems >= bucket->ub_cnt); TAILQ_REMOVE(&zdom->uzd_buckets, bucket, ub_link); zdom->uzd_nitems -= bucket->ub_cnt; if (zdom->uzd_imin > zdom->uzd_nitems) zdom->uzd_imin = zdom->uzd_nitems; zone->uz_bkt_count -= bucket->ub_cnt; } return (bucket); } /* * Insert a full bucket into the specified cache. The "ws" parameter indicates * whether the bucket's contents should be counted as part of the zone's working * set. */ static void zone_put_bucket(uma_zone_t zone, uma_zone_domain_t zdom, uma_bucket_t bucket, const bool ws) { ZONE_LOCK_ASSERT(zone); KASSERT(zone->uz_bkt_count < zone->uz_bkt_max, ("%s: zone %p overflow", __func__, zone)); if (ws) TAILQ_INSERT_HEAD(&zdom->uzd_buckets, bucket, ub_link); else TAILQ_INSERT_TAIL(&zdom->uzd_buckets, bucket, ub_link); zdom->uzd_nitems += bucket->ub_cnt; if (ws && zdom->uzd_imax < zdom->uzd_nitems) zdom->uzd_imax = zdom->uzd_nitems; zone->uz_bkt_count += bucket->ub_cnt; } static void zone_log_warning(uma_zone_t zone) { static const struct timeval warninterval = { 300, 0 }; if (!zone_warnings || zone->uz_warning == NULL) return; if (ratecheck(&zone->uz_ratecheck, &warninterval)) printf("[zone: %s] %s\n", zone->uz_name, zone->uz_warning); } static inline void zone_maxaction(uma_zone_t zone) { if (zone->uz_maxaction.ta_func != NULL) taskqueue_enqueue(taskqueue_thread, &zone->uz_maxaction); } /* * Routine called by timeout which is used to fire off some time interval * based calculations. (stats, hash size, etc.) * * Arguments: * arg Unused * * Returns: * Nothing */ static void uma_timeout(void *unused) { bucket_enable(); zone_foreach(zone_timeout); /* Reschedule this event */ callout_reset(&uma_callout, UMA_TIMEOUT * hz, uma_timeout, NULL); } /* * Update the working set size estimate for the zone's bucket cache. * The constants chosen here are somewhat arbitrary. With an update period of * 20s (UMA_TIMEOUT), this estimate is dominated by zone activity over the * last 100s. */ static void zone_domain_update_wss(uma_zone_domain_t zdom) { long wss; MPASS(zdom->uzd_imax >= zdom->uzd_imin); wss = zdom->uzd_imax - zdom->uzd_imin; zdom->uzd_imax = zdom->uzd_imin = zdom->uzd_nitems; zdom->uzd_wss = (4 * wss + zdom->uzd_wss) / 5; } /* * Routine to perform timeout driven calculations. This expands the * hashes and does per cpu statistics aggregation. * * Returns nothing. */ static void zone_timeout(uma_zone_t zone) { uma_keg_t keg = zone->uz_keg; u_int slabs; KEG_LOCK(keg); /* * Expand the keg hash table. * * This is done if the number of slabs is larger than the hash size. * What I'm trying to do here is completely reduce collisions. This * may be a little aggressive. Should I allow for two collisions max? */ if (keg->uk_flags & UMA_ZONE_HASH && (slabs = keg->uk_pages / keg->uk_ppera) > keg->uk_hash.uh_hashsize) { struct uma_hash newhash; struct uma_hash oldhash; int ret; /* * This is so involved because allocating and freeing * while the keg lock is held will lead to deadlock. * I have to do everything in stages and check for * races. */ KEG_UNLOCK(keg); ret = hash_alloc(&newhash, 1 << fls(slabs)); KEG_LOCK(keg); if (ret) { if (hash_expand(&keg->uk_hash, &newhash)) { oldhash = keg->uk_hash; keg->uk_hash = newhash; } else oldhash = newhash; KEG_UNLOCK(keg); hash_free(&oldhash); return; } } KEG_UNLOCK(keg); ZONE_LOCK(zone); for (int i = 0; i < vm_ndomains; i++) zone_domain_update_wss(&zone->uz_domain[i]); ZONE_UNLOCK(zone); } /* * Allocate and zero fill the next sized hash table from the appropriate * backing store. * * Arguments: * hash A new hash structure with the old hash size in uh_hashsize * * Returns: * 1 on success and 0 on failure. */ static int hash_alloc(struct uma_hash *hash, u_int size) { size_t alloc; KASSERT(powerof2(size), ("hash size must be power of 2")); if (size > UMA_HASH_SIZE_INIT) { hash->uh_hashsize = size; alloc = sizeof(hash->uh_slab_hash[0]) * hash->uh_hashsize; hash->uh_slab_hash = (struct slabhead *)malloc(alloc, M_UMAHASH, M_NOWAIT); } else { alloc = sizeof(hash->uh_slab_hash[0]) * UMA_HASH_SIZE_INIT; hash->uh_slab_hash = zone_alloc_item(hashzone, NULL, UMA_ANYDOMAIN, M_WAITOK); hash->uh_hashsize = UMA_HASH_SIZE_INIT; } if (hash->uh_slab_hash) { bzero(hash->uh_slab_hash, alloc); hash->uh_hashmask = hash->uh_hashsize - 1; return (1); } return (0); } /* * Expands the hash table for HASH zones. This is done from zone_timeout * to reduce collisions. This must not be done in the regular allocation * path, otherwise, we can recurse on the vm while allocating pages. * * Arguments: * oldhash The hash you want to expand * newhash The hash structure for the new table * * Returns: * Nothing * * Discussion: */ static int hash_expand(struct uma_hash *oldhash, struct uma_hash *newhash) { uma_slab_t slab; u_int hval; u_int idx; if (!newhash->uh_slab_hash) return (0); if (oldhash->uh_hashsize >= newhash->uh_hashsize) return (0); /* * I need to investigate hash algorithms for resizing without a * full rehash. */ for (idx = 0; idx < oldhash->uh_hashsize; idx++) while (!SLIST_EMPTY(&oldhash->uh_slab_hash[idx])) { slab = SLIST_FIRST(&oldhash->uh_slab_hash[idx]); SLIST_REMOVE_HEAD(&oldhash->uh_slab_hash[idx], us_hlink); hval = UMA_HASH(newhash, slab->us_data); SLIST_INSERT_HEAD(&newhash->uh_slab_hash[hval], slab, us_hlink); } return (1); } /* * Free the hash bucket to the appropriate backing store. * * Arguments: * slab_hash The hash bucket we're freeing * hashsize The number of entries in that hash bucket * * Returns: * Nothing */ static void hash_free(struct uma_hash *hash) { if (hash->uh_slab_hash == NULL) return; if (hash->uh_hashsize == UMA_HASH_SIZE_INIT) zone_free_item(hashzone, hash->uh_slab_hash, NULL, SKIP_NONE); else free(hash->uh_slab_hash, M_UMAHASH); } /* * Frees all outstanding items in a bucket * * Arguments: * zone The zone to free to, must be unlocked. * bucket The free/alloc bucket with items, cpu queue must be locked. * * Returns: * Nothing */ static void bucket_drain(uma_zone_t zone, uma_bucket_t bucket) { int i; if (bucket == NULL) return; if (zone->uz_fini) for (i = 0; i < bucket->ub_cnt; i++) zone->uz_fini(bucket->ub_bucket[i], zone->uz_size); zone->uz_release(zone->uz_arg, bucket->ub_bucket, bucket->ub_cnt); if (zone->uz_max_items > 0) { ZONE_LOCK(zone); zone->uz_items -= bucket->ub_cnt; if (zone->uz_sleepers && zone->uz_items < zone->uz_max_items) wakeup_one(zone); ZONE_UNLOCK(zone); } bucket->ub_cnt = 0; } /* * Drains the per cpu caches for a zone. * * NOTE: This may only be called while the zone is being turn down, and not * during normal operation. This is necessary in order that we do not have * to migrate CPUs to drain the per-CPU caches. * * Arguments: * zone The zone to drain, must be unlocked. * * Returns: * Nothing */ static void cache_drain(uma_zone_t zone) { uma_cache_t cache; int cpu; /* * XXX: It is safe to not lock the per-CPU caches, because we're * tearing down the zone anyway. I.e., there will be no further use * of the caches at this point. * * XXX: It would good to be able to assert that the zone is being * torn down to prevent improper use of cache_drain(). * * XXX: We lock the zone before passing into bucket_cache_reclaim() as * it is used elsewhere. Should the tear-down path be made special * there in some form? */ CPU_FOREACH(cpu) { cache = &zone->uz_cpu[cpu]; bucket_drain(zone, cache->uc_allocbucket); if (cache->uc_allocbucket != NULL) bucket_free(zone, cache->uc_allocbucket, NULL); cache->uc_allocbucket = NULL; bucket_drain(zone, cache->uc_freebucket); if (cache->uc_freebucket != NULL) bucket_free(zone, cache->uc_freebucket, NULL); cache->uc_freebucket = NULL; bucket_drain(zone, cache->uc_crossbucket); if (cache->uc_crossbucket != NULL) bucket_free(zone, cache->uc_crossbucket, NULL); cache->uc_crossbucket = NULL; } ZONE_LOCK(zone); bucket_cache_reclaim(zone, true); ZONE_UNLOCK(zone); } static void cache_shrink(uma_zone_t zone) { if (zone->uz_flags & UMA_ZFLAG_INTERNAL) return; ZONE_LOCK(zone); zone->uz_count = (zone->uz_count_min + zone->uz_count) / 2; ZONE_UNLOCK(zone); } static void cache_drain_safe_cpu(uma_zone_t zone) { uma_cache_t cache; uma_bucket_t b1, b2, b3; int domain; if (zone->uz_flags & UMA_ZFLAG_INTERNAL) return; b1 = b2 = b3 = NULL; ZONE_LOCK(zone); critical_enter(); if (zone->uz_flags & UMA_ZONE_NUMA) domain = PCPU_GET(domain); else domain = 0; cache = &zone->uz_cpu[curcpu]; if (cache->uc_allocbucket) { if (cache->uc_allocbucket->ub_cnt != 0) zone_put_bucket(zone, &zone->uz_domain[domain], cache->uc_allocbucket, false); else b1 = cache->uc_allocbucket; cache->uc_allocbucket = NULL; } if (cache->uc_freebucket) { if (cache->uc_freebucket->ub_cnt != 0) zone_put_bucket(zone, &zone->uz_domain[domain], cache->uc_freebucket, false); else b2 = cache->uc_freebucket; cache->uc_freebucket = NULL; } b3 = cache->uc_crossbucket; cache->uc_crossbucket = NULL; critical_exit(); ZONE_UNLOCK(zone); if (b1) bucket_free(zone, b1, NULL); if (b2) bucket_free(zone, b2, NULL); if (b3) { bucket_drain(zone, b3); bucket_free(zone, b3, NULL); } } /* * Safely drain per-CPU caches of a zone(s) to alloc bucket. * This is an expensive call because it needs to bind to all CPUs * one by one and enter a critical section on each of them in order * to safely access their cache buckets. * Zone lock must not be held on call this function. */ static void pcpu_cache_drain_safe(uma_zone_t zone) { int cpu; /* * Polite bucket sizes shrinking was not enouth, shrink aggressively. */ if (zone) cache_shrink(zone); else zone_foreach(cache_shrink); CPU_FOREACH(cpu) { thread_lock(curthread); sched_bind(curthread, cpu); thread_unlock(curthread); if (zone) cache_drain_safe_cpu(zone); else zone_foreach(cache_drain_safe_cpu); } thread_lock(curthread); sched_unbind(curthread); thread_unlock(curthread); } /* * Reclaim cached buckets from a zone. All buckets are reclaimed if the caller * requested a drain, otherwise the per-domain caches are trimmed to either * estimated working set size. */ static void bucket_cache_reclaim(uma_zone_t zone, bool drain) { uma_zone_domain_t zdom; uma_bucket_t bucket; long target, tofree; int i; for (i = 0; i < vm_ndomains; i++) { zdom = &zone->uz_domain[i]; /* * If we were asked to drain the zone, we are done only once * this bucket cache is empty. Otherwise, we reclaim items in * excess of the zone's estimated working set size. If the * difference nitems - imin is larger than the WSS estimate, * then the estimate will grow at the end of this interval and * we ignore the historical average. */ target = drain ? 0 : lmax(zdom->uzd_wss, zdom->uzd_nitems - zdom->uzd_imin); while (zdom->uzd_nitems > target) { bucket = TAILQ_LAST(&zdom->uzd_buckets, uma_bucketlist); if (bucket == NULL) break; tofree = bucket->ub_cnt; TAILQ_REMOVE(&zdom->uzd_buckets, bucket, ub_link); zdom->uzd_nitems -= tofree; /* * Shift the bounds of the current WSS interval to avoid * perturbing the estimate. */ zdom->uzd_imax -= lmin(zdom->uzd_imax, tofree); zdom->uzd_imin -= lmin(zdom->uzd_imin, tofree); ZONE_UNLOCK(zone); bucket_drain(zone, bucket); bucket_free(zone, bucket, NULL); ZONE_LOCK(zone); } } /* * Shrink the zone bucket size to ensure that the per-CPU caches * don't grow too large. */ if (zone->uz_count > zone->uz_count_min) zone->uz_count--; } static void keg_free_slab(uma_keg_t keg, uma_slab_t slab, int start) { uint8_t *mem; int i; uint8_t flags; CTR4(KTR_UMA, "keg_free_slab keg %s(%p) slab %p, returning %d bytes", keg->uk_name, keg, slab, PAGE_SIZE * keg->uk_ppera); mem = slab->us_data; flags = slab->us_flags; i = start; if (keg->uk_fini != NULL) { for (i--; i > -1; i--) #ifdef INVARIANTS /* * trash_fini implies that dtor was trash_dtor. trash_fini * would check that memory hasn't been modified since free, * which executed trash_dtor. * That's why we need to run uma_dbg_kskip() check here, * albeit we don't make skip check for other init/fini * invocations. */ if (!uma_dbg_kskip(keg, slab->us_data + (keg->uk_rsize * i)) || keg->uk_fini != trash_fini) #endif keg->uk_fini(slab->us_data + (keg->uk_rsize * i), keg->uk_size); } if (keg->uk_flags & UMA_ZONE_OFFPAGE) zone_free_item(keg->uk_slabzone, slab, NULL, SKIP_NONE); keg->uk_freef(mem, PAGE_SIZE * keg->uk_ppera, flags); uma_total_dec(PAGE_SIZE * keg->uk_ppera); } /* * Frees pages from a keg back to the system. This is done on demand from * the pageout daemon. * * Returns nothing. */ static void keg_drain(uma_keg_t keg) { struct slabhead freeslabs = { 0 }; uma_domain_t dom; uma_slab_t slab, tmp; int i; /* * We don't want to take pages from statically allocated kegs at this * time */ if (keg->uk_flags & UMA_ZONE_NOFREE || keg->uk_freef == NULL) return; CTR3(KTR_UMA, "keg_drain %s(%p) free items: %u", keg->uk_name, keg, keg->uk_free); KEG_LOCK(keg); if (keg->uk_free == 0) goto finished; for (i = 0; i < vm_ndomains; i++) { dom = &keg->uk_domain[i]; LIST_FOREACH_SAFE(slab, &dom->ud_free_slab, us_link, tmp) { /* We have nowhere to free these to. */ if (slab->us_flags & UMA_SLAB_BOOT) continue; LIST_REMOVE(slab, us_link); keg->uk_pages -= keg->uk_ppera; keg->uk_free -= keg->uk_ipers; if (keg->uk_flags & UMA_ZONE_HASH) UMA_HASH_REMOVE(&keg->uk_hash, slab, slab->us_data); SLIST_INSERT_HEAD(&freeslabs, slab, us_hlink); } } finished: KEG_UNLOCK(keg); while ((slab = SLIST_FIRST(&freeslabs)) != NULL) { SLIST_REMOVE(&freeslabs, slab, uma_slab, us_hlink); keg_free_slab(keg, slab, keg->uk_ipers); } } static void zone_reclaim(uma_zone_t zone, int waitok, bool drain) { /* * Set draining to interlock with zone_dtor() so we can release our * locks as we go. Only dtor() should do a WAITOK call since it * is the only call that knows the structure will still be available * when it wakes up. */ ZONE_LOCK(zone); while (zone->uz_flags & UMA_ZFLAG_RECLAIMING) { if (waitok == M_NOWAIT) goto out; msleep(zone, zone->uz_lockptr, PVM, "zonedrain", 1); } zone->uz_flags |= UMA_ZFLAG_RECLAIMING; bucket_cache_reclaim(zone, drain); ZONE_UNLOCK(zone); /* * The DRAINING flag protects us from being freed while * we're running. Normally the uma_rwlock would protect us but we * must be able to release and acquire the right lock for each keg. */ keg_drain(zone->uz_keg); ZONE_LOCK(zone); zone->uz_flags &= ~UMA_ZFLAG_RECLAIMING; wakeup(zone); out: ZONE_UNLOCK(zone); } static void zone_drain(uma_zone_t zone) { zone_reclaim(zone, M_NOWAIT, true); } static void zone_trim(uma_zone_t zone) { zone_reclaim(zone, M_NOWAIT, false); } /* * Allocate a new slab for a keg. This does not insert the slab onto a list. * If the allocation was successful, the keg lock will be held upon return, * otherwise the keg will be left unlocked. * * Arguments: * flags Wait flags for the item initialization routine * aflags Wait flags for the slab allocation * * Returns: * The slab that was allocated or NULL if there is no memory and the * caller specified M_NOWAIT. */ static uma_slab_t keg_alloc_slab(uma_keg_t keg, uma_zone_t zone, int domain, int flags, int aflags) { uma_alloc allocf; uma_slab_t slab; unsigned long size; uint8_t *mem; uint8_t sflags; int i; KASSERT(domain >= 0 && domain < vm_ndomains, ("keg_alloc_slab: domain %d out of range", domain)); KEG_LOCK_ASSERT(keg); MPASS(zone->uz_lockptr == &keg->uk_lock); allocf = keg->uk_allocf; KEG_UNLOCK(keg); slab = NULL; mem = NULL; if (keg->uk_flags & UMA_ZONE_OFFPAGE) { slab = zone_alloc_item(keg->uk_slabzone, NULL, domain, aflags); if (slab == NULL) goto out; } /* * This reproduces the old vm_zone behavior of zero filling pages the * first time they are added to a zone. * * Malloced items are zeroed in uma_zalloc. */ if ((keg->uk_flags & UMA_ZONE_MALLOC) == 0) aflags |= M_ZERO; else aflags &= ~M_ZERO; if (keg->uk_flags & UMA_ZONE_NODUMP) aflags |= M_NODUMP; /* zone is passed for legacy reasons. */ size = keg->uk_ppera * PAGE_SIZE; mem = allocf(zone, size, domain, &sflags, aflags); if (mem == NULL) { if (keg->uk_flags & UMA_ZONE_OFFPAGE) zone_free_item(keg->uk_slabzone, slab, NULL, SKIP_NONE); slab = NULL; goto out; } uma_total_inc(size); /* Point the slab into the allocated memory */ if (!(keg->uk_flags & UMA_ZONE_OFFPAGE)) slab = (uma_slab_t )(mem + keg->uk_pgoff); if (keg->uk_flags & UMA_ZONE_VTOSLAB) for (i = 0; i < keg->uk_ppera; i++) vsetslab((vm_offset_t)mem + (i * PAGE_SIZE), slab); slab->us_keg = keg; slab->us_data = mem; slab->us_freecount = keg->uk_ipers; slab->us_flags = sflags; slab->us_domain = domain; BIT_FILL(SLAB_SETSIZE, &slab->us_free); #ifdef INVARIANTS BIT_ZERO(SLAB_SETSIZE, &slab->us_debugfree); #endif if (keg->uk_init != NULL) { for (i = 0; i < keg->uk_ipers; i++) if (keg->uk_init(slab->us_data + (keg->uk_rsize * i), keg->uk_size, flags) != 0) break; if (i != keg->uk_ipers) { keg_free_slab(keg, slab, i); slab = NULL; goto out; } } KEG_LOCK(keg); CTR3(KTR_UMA, "keg_alloc_slab: allocated slab %p for %s(%p)", slab, keg->uk_name, keg); if (keg->uk_flags & UMA_ZONE_HASH) UMA_HASH_INSERT(&keg->uk_hash, slab, mem); keg->uk_pages += keg->uk_ppera; keg->uk_free += keg->uk_ipers; out: return (slab); } /* * This function is intended to be used early on in place of page_alloc() so * that we may use the boot time page cache to satisfy allocations before * the VM is ready. */ static void * startup_alloc(uma_zone_t zone, vm_size_t bytes, int domain, uint8_t *pflag, int wait) { uma_keg_t keg; void *mem; int pages; keg = zone->uz_keg; /* * If we are in BOOT_BUCKETS or higher, than switch to real * allocator. Zones with page sized slabs switch at BOOT_PAGEALLOC. */ switch (booted) { case BOOT_COLD: case BOOT_STRAPPED: break; case BOOT_PAGEALLOC: if (keg->uk_ppera > 1) break; case BOOT_BUCKETS: case BOOT_RUNNING: #ifdef UMA_MD_SMALL_ALLOC keg->uk_allocf = (keg->uk_ppera > 1) ? page_alloc : uma_small_alloc; #else keg->uk_allocf = page_alloc; #endif return keg->uk_allocf(zone, bytes, domain, pflag, wait); } /* * Check our small startup cache to see if it has pages remaining. */ pages = howmany(bytes, PAGE_SIZE); KASSERT(pages > 0, ("%s can't reserve 0 pages", __func__)); if (pages > boot_pages) panic("UMA zone \"%s\": Increase vm.boot_pages", zone->uz_name); #ifdef DIAGNOSTIC printf("%s from \"%s\", %d boot pages left\n", __func__, zone->uz_name, boot_pages); #endif mem = bootmem; boot_pages -= pages; bootmem += pages * PAGE_SIZE; *pflag = UMA_SLAB_BOOT; return (mem); } /* * Allocates a number of pages from the system * * Arguments: * bytes The number of bytes requested * wait Shall we wait? * * Returns: * A pointer to the alloced memory or possibly * NULL if M_NOWAIT is set. */ static void * page_alloc(uma_zone_t zone, vm_size_t bytes, int domain, uint8_t *pflag, int wait) { void *p; /* Returned page */ *pflag = UMA_SLAB_KERNEL; p = (void *)kmem_malloc_domainset(DOMAINSET_FIXED(domain), bytes, wait); return (p); } static void * pcpu_page_alloc(uma_zone_t zone, vm_size_t bytes, int domain, uint8_t *pflag, int wait) { struct pglist alloctail; vm_offset_t addr, zkva; int cpu, flags; vm_page_t p, p_next; #ifdef NUMA struct pcpu *pc; #endif MPASS(bytes == (mp_maxid + 1) * PAGE_SIZE); TAILQ_INIT(&alloctail); flags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED | VM_ALLOC_NOOBJ | malloc2vm_flags(wait); *pflag = UMA_SLAB_KERNEL; for (cpu = 0; cpu <= mp_maxid; cpu++) { if (CPU_ABSENT(cpu)) { p = vm_page_alloc(NULL, 0, flags); } else { #ifndef NUMA p = vm_page_alloc(NULL, 0, flags); #else pc = pcpu_find(cpu); p = vm_page_alloc_domain(NULL, 0, pc->pc_domain, flags); if (__predict_false(p == NULL)) p = vm_page_alloc(NULL, 0, flags); #endif } if (__predict_false(p == NULL)) goto fail; TAILQ_INSERT_TAIL(&alloctail, p, listq); } if ((addr = kva_alloc(bytes)) == 0) goto fail; zkva = addr; TAILQ_FOREACH(p, &alloctail, listq) { pmap_qenter(zkva, &p, 1); zkva += PAGE_SIZE; } return ((void*)addr); fail: TAILQ_FOREACH_SAFE(p, &alloctail, listq, p_next) { vm_page_unwire_noq(p); vm_page_free(p); } return (NULL); } /* * Allocates a number of pages from within an object * * Arguments: * bytes The number of bytes requested * wait Shall we wait? * * Returns: * A pointer to the alloced memory or possibly * NULL if M_NOWAIT is set. */ static void * noobj_alloc(uma_zone_t zone, vm_size_t bytes, int domain, uint8_t *flags, int wait) { TAILQ_HEAD(, vm_page) alloctail; u_long npages; vm_offset_t retkva, zkva; vm_page_t p, p_next; uma_keg_t keg; TAILQ_INIT(&alloctail); keg = zone->uz_keg; npages = howmany(bytes, PAGE_SIZE); while (npages > 0) { p = vm_page_alloc_domain(NULL, 0, domain, VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED | VM_ALLOC_NOOBJ | ((wait & M_WAITOK) != 0 ? VM_ALLOC_WAITOK : VM_ALLOC_NOWAIT)); if (p != NULL) { /* * Since the page does not belong to an object, its * listq is unused. */ TAILQ_INSERT_TAIL(&alloctail, p, listq); npages--; continue; } /* * Page allocation failed, free intermediate pages and * exit. */ TAILQ_FOREACH_SAFE(p, &alloctail, listq, p_next) { vm_page_unwire_noq(p); vm_page_free(p); } return (NULL); } *flags = UMA_SLAB_PRIV; zkva = keg->uk_kva + atomic_fetchadd_long(&keg->uk_offset, round_page(bytes)); retkva = zkva; TAILQ_FOREACH(p, &alloctail, listq) { pmap_qenter(zkva, &p, 1); zkva += PAGE_SIZE; } return ((void *)retkva); } /* * Frees a number of pages to the system * * Arguments: * mem A pointer to the memory to be freed * size The size of the memory being freed * flags The original p->us_flags field * * Returns: * Nothing */ static void page_free(void *mem, vm_size_t size, uint8_t flags) { if ((flags & UMA_SLAB_KERNEL) == 0) panic("UMA: page_free used with invalid flags %x", flags); kmem_free((vm_offset_t)mem, size); } /* * Frees pcpu zone allocations * * Arguments: * mem A pointer to the memory to be freed * size The size of the memory being freed * flags The original p->us_flags field * * Returns: * Nothing */ static void pcpu_page_free(void *mem, vm_size_t size, uint8_t flags) { vm_offset_t sva, curva; vm_paddr_t paddr; vm_page_t m; MPASS(size == (mp_maxid+1)*PAGE_SIZE); sva = (vm_offset_t)mem; for (curva = sva; curva < sva + size; curva += PAGE_SIZE) { paddr = pmap_kextract(curva); m = PHYS_TO_VM_PAGE(paddr); vm_page_unwire_noq(m); vm_page_free(m); } pmap_qremove(sva, size >> PAGE_SHIFT); kva_free(sva, size); } /* * Zero fill initializer * * Arguments/Returns follow uma_init specifications */ static int zero_init(void *mem, int size, int flags) { bzero(mem, size); return (0); } /* * Finish creating a small uma keg. This calculates ipers, and the keg size. * * Arguments * keg The zone we should initialize * * Returns * Nothing */ static void keg_small_init(uma_keg_t keg) { u_int rsize; u_int memused; u_int wastedspace; u_int shsize; u_int slabsize; if (keg->uk_flags & UMA_ZONE_PCPU) { u_int ncpus = (mp_maxid + 1) ? (mp_maxid + 1) : MAXCPU; slabsize = UMA_PCPU_ALLOC_SIZE; keg->uk_ppera = ncpus; } else { slabsize = UMA_SLAB_SIZE; keg->uk_ppera = 1; } /* * Calculate the size of each allocation (rsize) according to * alignment. If the requested size is smaller than we have * allocation bits for we round it up. */ rsize = keg->uk_size; if (rsize < slabsize / SLAB_SETSIZE) rsize = slabsize / SLAB_SETSIZE; if (rsize & keg->uk_align) rsize = (rsize & ~keg->uk_align) + (keg->uk_align + 1); keg->uk_rsize = rsize; KASSERT((keg->uk_flags & UMA_ZONE_PCPU) == 0 || keg->uk_rsize < UMA_PCPU_ALLOC_SIZE, ("%s: size %u too large", __func__, keg->uk_rsize)); if (keg->uk_flags & UMA_ZONE_OFFPAGE) shsize = 0; else shsize = SIZEOF_UMA_SLAB; if (rsize <= slabsize - shsize) keg->uk_ipers = (slabsize - shsize) / rsize; else { /* Handle special case when we have 1 item per slab, so * alignment requirement can be relaxed. */ KASSERT(keg->uk_size <= slabsize - shsize, ("%s: size %u greater than slab", __func__, keg->uk_size)); keg->uk_ipers = 1; } KASSERT(keg->uk_ipers > 0 && keg->uk_ipers <= SLAB_SETSIZE, ("%s: keg->uk_ipers %u", __func__, keg->uk_ipers)); memused = keg->uk_ipers * rsize + shsize; wastedspace = slabsize - memused; /* * We can't do OFFPAGE if we're internal or if we've been * asked to not go to the VM for buckets. If we do this we * may end up going to the VM for slabs which we do not * want to do if we're UMA_ZFLAG_CACHEONLY as a result * of UMA_ZONE_VM, which clearly forbids it. */ if ((keg->uk_flags & UMA_ZFLAG_INTERNAL) || (keg->uk_flags & UMA_ZFLAG_CACHEONLY)) return; /* * See if using an OFFPAGE slab will limit our waste. Only do * this if it permits more items per-slab. * * XXX We could try growing slabsize to limit max waste as well. * Historically this was not done because the VM could not * efficiently handle contiguous allocations. */ if ((wastedspace >= slabsize / UMA_MAX_WASTE) && (keg->uk_ipers < (slabsize / keg->uk_rsize))) { keg->uk_ipers = slabsize / keg->uk_rsize; KASSERT(keg->uk_ipers > 0 && keg->uk_ipers <= SLAB_SETSIZE, ("%s: keg->uk_ipers %u", __func__, keg->uk_ipers)); CTR6(KTR_UMA, "UMA decided we need offpage slab headers for " "keg: %s(%p), calculated wastedspace = %d, " "maximum wasted space allowed = %d, " "calculated ipers = %d, " "new wasted space = %d\n", keg->uk_name, keg, wastedspace, slabsize / UMA_MAX_WASTE, keg->uk_ipers, slabsize - keg->uk_ipers * keg->uk_rsize); keg->uk_flags |= UMA_ZONE_OFFPAGE; } if ((keg->uk_flags & UMA_ZONE_OFFPAGE) && (keg->uk_flags & UMA_ZONE_VTOSLAB) == 0) keg->uk_flags |= UMA_ZONE_HASH; } /* * Finish creating a large (> UMA_SLAB_SIZE) uma kegs. Just give in and do * OFFPAGE for now. When I can allow for more dynamic slab sizes this will be * more complicated. * * Arguments * keg The keg we should initialize * * Returns * Nothing */ static void keg_large_init(uma_keg_t keg) { KASSERT(keg != NULL, ("Keg is null in keg_large_init")); KASSERT((keg->uk_flags & UMA_ZONE_PCPU) == 0, ("%s: Cannot large-init a UMA_ZONE_PCPU keg", __func__)); keg->uk_ppera = howmany(keg->uk_size, PAGE_SIZE); keg->uk_ipers = 1; keg->uk_rsize = keg->uk_size; /* Check whether we have enough space to not do OFFPAGE. */ if ((keg->uk_flags & UMA_ZONE_OFFPAGE) == 0 && PAGE_SIZE * keg->uk_ppera - keg->uk_rsize < SIZEOF_UMA_SLAB) { /* * We can't do OFFPAGE if we're internal, in which case * we need an extra page per allocation to contain the * slab header. */ if ((keg->uk_flags & UMA_ZFLAG_INTERNAL) == 0) keg->uk_flags |= UMA_ZONE_OFFPAGE; else keg->uk_ppera++; } if ((keg->uk_flags & UMA_ZONE_OFFPAGE) && (keg->uk_flags & UMA_ZONE_VTOSLAB) == 0) keg->uk_flags |= UMA_ZONE_HASH; } static void keg_cachespread_init(uma_keg_t keg) { int alignsize; int trailer; int pages; int rsize; KASSERT((keg->uk_flags & UMA_ZONE_PCPU) == 0, ("%s: Cannot cachespread-init a UMA_ZONE_PCPU keg", __func__)); alignsize = keg->uk_align + 1; rsize = keg->uk_size; /* * We want one item to start on every align boundary in a page. To * do this we will span pages. We will also extend the item by the * size of align if it is an even multiple of align. Otherwise, it * would fall on the same boundary every time. */ if (rsize & keg->uk_align) rsize = (rsize & ~keg->uk_align) + alignsize; if ((rsize & alignsize) == 0) rsize += alignsize; trailer = rsize - keg->uk_size; pages = (rsize * (PAGE_SIZE / alignsize)) / PAGE_SIZE; pages = MIN(pages, (128 * 1024) / PAGE_SIZE); keg->uk_rsize = rsize; keg->uk_ppera = pages; keg->uk_ipers = ((pages * PAGE_SIZE) + trailer) / rsize; keg->uk_flags |= UMA_ZONE_OFFPAGE | UMA_ZONE_VTOSLAB; KASSERT(keg->uk_ipers <= SLAB_SETSIZE, ("%s: keg->uk_ipers too high(%d) increase max_ipers", __func__, keg->uk_ipers)); } /* * Keg header ctor. This initializes all fields, locks, etc. And inserts * the keg onto the global keg list. * * Arguments/Returns follow uma_ctor specifications * udata Actually uma_kctor_args */ static int keg_ctor(void *mem, int size, void *udata, int flags) { struct uma_kctor_args *arg = udata; uma_keg_t keg = mem; uma_zone_t zone; bzero(keg, size); keg->uk_size = arg->size; keg->uk_init = arg->uminit; keg->uk_fini = arg->fini; keg->uk_align = arg->align; keg->uk_free = 0; keg->uk_reserve = 0; keg->uk_pages = 0; keg->uk_flags = arg->flags; keg->uk_slabzone = NULL; /* * We use a global round-robin policy by default. Zones with * UMA_ZONE_NUMA set will use first-touch instead, in which case the * iterator is never run. */ keg->uk_dr.dr_policy = DOMAINSET_RR(); keg->uk_dr.dr_iter = 0; /* * The master zone is passed to us at keg-creation time. */ zone = arg->zone; keg->uk_name = zone->uz_name; if (arg->flags & UMA_ZONE_VM) keg->uk_flags |= UMA_ZFLAG_CACHEONLY; if (arg->flags & UMA_ZONE_ZINIT) keg->uk_init = zero_init; if (arg->flags & UMA_ZONE_MALLOC) keg->uk_flags |= UMA_ZONE_VTOSLAB; if (arg->flags & UMA_ZONE_PCPU) #ifdef SMP keg->uk_flags |= UMA_ZONE_OFFPAGE; #else keg->uk_flags &= ~UMA_ZONE_PCPU; #endif if (keg->uk_flags & UMA_ZONE_CACHESPREAD) { keg_cachespread_init(keg); } else { if (keg->uk_size > UMA_SLAB_SPACE) keg_large_init(keg); else keg_small_init(keg); } if (keg->uk_flags & UMA_ZONE_OFFPAGE) keg->uk_slabzone = slabzone; /* * If we haven't booted yet we need allocations to go through the * startup cache until the vm is ready. */ if (booted < BOOT_PAGEALLOC) keg->uk_allocf = startup_alloc; #ifdef UMA_MD_SMALL_ALLOC else if (keg->uk_ppera == 1) keg->uk_allocf = uma_small_alloc; #endif else if (keg->uk_flags & UMA_ZONE_PCPU) keg->uk_allocf = pcpu_page_alloc; else keg->uk_allocf = page_alloc; #ifdef UMA_MD_SMALL_ALLOC if (keg->uk_ppera == 1) keg->uk_freef = uma_small_free; else #endif if (keg->uk_flags & UMA_ZONE_PCPU) keg->uk_freef = pcpu_page_free; else keg->uk_freef = page_free; /* * Initialize keg's lock */ KEG_LOCK_INIT(keg, (arg->flags & UMA_ZONE_MTXCLASS)); /* * If we're putting the slab header in the actual page we need to * figure out where in each page it goes. See SIZEOF_UMA_SLAB * macro definition. */ if (!(keg->uk_flags & UMA_ZONE_OFFPAGE)) { keg->uk_pgoff = (PAGE_SIZE * keg->uk_ppera) - SIZEOF_UMA_SLAB; /* * The only way the following is possible is if with our * UMA_ALIGN_PTR adjustments we are now bigger than * UMA_SLAB_SIZE. I haven't checked whether this is * mathematically possible for all cases, so we make * sure here anyway. */ KASSERT(keg->uk_pgoff + sizeof(struct uma_slab) <= PAGE_SIZE * keg->uk_ppera, ("zone %s ipers %d rsize %d size %d slab won't fit", zone->uz_name, keg->uk_ipers, keg->uk_rsize, keg->uk_size)); } if (keg->uk_flags & UMA_ZONE_HASH) hash_alloc(&keg->uk_hash, 0); CTR5(KTR_UMA, "keg_ctor %p zone %s(%p) out %d free %d\n", keg, zone->uz_name, zone, (keg->uk_pages / keg->uk_ppera) * keg->uk_ipers - keg->uk_free, keg->uk_free); LIST_INSERT_HEAD(&keg->uk_zones, zone, uz_link); rw_wlock(&uma_rwlock); LIST_INSERT_HEAD(&uma_kegs, keg, uk_link); rw_wunlock(&uma_rwlock); return (0); } static void zone_alloc_counters(uma_zone_t zone) { zone->uz_allocs = counter_u64_alloc(M_WAITOK); zone->uz_frees = counter_u64_alloc(M_WAITOK); zone->uz_fails = counter_u64_alloc(M_WAITOK); } /* * Zone header ctor. This initializes all fields, locks, etc. * * Arguments/Returns follow uma_ctor specifications * udata Actually uma_zctor_args */ static int zone_ctor(void *mem, int size, void *udata, int flags) { struct uma_zctor_args *arg = udata; uma_zone_t zone = mem; uma_zone_t z; uma_keg_t keg; int i; bzero(zone, size); zone->uz_name = arg->name; zone->uz_ctor = arg->ctor; zone->uz_dtor = arg->dtor; zone->uz_init = NULL; zone->uz_fini = NULL; zone->uz_sleeps = 0; zone->uz_xdomain = 0; zone->uz_count = 0; zone->uz_count_min = 0; zone->uz_count_max = BUCKET_MAX; zone->uz_flags = 0; zone->uz_warning = NULL; /* The domain structures follow the cpu structures. */ zone->uz_domain = (struct uma_zone_domain *)&zone->uz_cpu[mp_ncpus]; zone->uz_bkt_max = ULONG_MAX; timevalclear(&zone->uz_ratecheck); if (__predict_true(booted == BOOT_RUNNING)) zone_alloc_counters(zone); else { zone->uz_allocs = EARLY_COUNTER; zone->uz_frees = EARLY_COUNTER; zone->uz_fails = EARLY_COUNTER; } for (i = 0; i < vm_ndomains; i++) TAILQ_INIT(&zone->uz_domain[i].uzd_buckets); /* * This is a pure cache zone, no kegs. */ if (arg->import) { if (arg->flags & UMA_ZONE_VM) arg->flags |= UMA_ZFLAG_CACHEONLY; zone->uz_flags = arg->flags; zone->uz_size = arg->size; zone->uz_import = arg->import; zone->uz_release = arg->release; zone->uz_arg = arg->arg; zone->uz_lockptr = &zone->uz_lock; ZONE_LOCK_INIT(zone, (arg->flags & UMA_ZONE_MTXCLASS)); rw_wlock(&uma_rwlock); LIST_INSERT_HEAD(&uma_cachezones, zone, uz_link); rw_wunlock(&uma_rwlock); goto out; } /* * Use the regular zone/keg/slab allocator. */ zone->uz_import = (uma_import)zone_import; zone->uz_release = (uma_release)zone_release; zone->uz_arg = zone; keg = arg->keg; if (arg->flags & UMA_ZONE_SECONDARY) { KASSERT(arg->keg != NULL, ("Secondary zone on zero'd keg")); zone->uz_init = arg->uminit; zone->uz_fini = arg->fini; zone->uz_lockptr = &keg->uk_lock; zone->uz_flags |= UMA_ZONE_SECONDARY; rw_wlock(&uma_rwlock); ZONE_LOCK(zone); LIST_FOREACH(z, &keg->uk_zones, uz_link) { if (LIST_NEXT(z, uz_link) == NULL) { LIST_INSERT_AFTER(z, zone, uz_link); break; } } ZONE_UNLOCK(zone); rw_wunlock(&uma_rwlock); } else if (keg == NULL) { if ((keg = uma_kcreate(zone, arg->size, arg->uminit, arg->fini, arg->align, arg->flags)) == NULL) return (ENOMEM); } else { struct uma_kctor_args karg; int error; /* We should only be here from uma_startup() */ karg.size = arg->size; karg.uminit = arg->uminit; karg.fini = arg->fini; karg.align = arg->align; karg.flags = arg->flags; karg.zone = zone; error = keg_ctor(arg->keg, sizeof(struct uma_keg), &karg, flags); if (error) return (error); } zone->uz_keg = keg; zone->uz_size = keg->uk_size; zone->uz_flags |= (keg->uk_flags & (UMA_ZONE_INHERIT | UMA_ZFLAG_INHERIT)); /* * Some internal zones don't have room allocated for the per cpu * caches. If we're internal, bail out here. */ if (keg->uk_flags & UMA_ZFLAG_INTERNAL) { KASSERT((zone->uz_flags & UMA_ZONE_SECONDARY) == 0, ("Secondary zone requested UMA_ZFLAG_INTERNAL")); return (0); } out: KASSERT((arg->flags & (UMA_ZONE_MAXBUCKET | UMA_ZONE_NOBUCKET)) != (UMA_ZONE_MAXBUCKET | UMA_ZONE_NOBUCKET), ("Invalid zone flag combination")); if ((arg->flags & UMA_ZONE_MAXBUCKET) != 0) { zone->uz_count = BUCKET_MAX; } else if ((arg->flags & UMA_ZONE_MINBUCKET) != 0) { zone->uz_count = BUCKET_MIN; zone->uz_count_max = BUCKET_MIN; } else if ((arg->flags & UMA_ZONE_NOBUCKET) != 0) zone->uz_count = 0; else zone->uz_count = bucket_select(zone->uz_size); zone->uz_count_min = zone->uz_count; return (0); } /* * Keg header dtor. This frees all data, destroys locks, frees the hash * table and removes the keg from the global list. * * Arguments/Returns follow uma_dtor specifications * udata unused */ static void keg_dtor(void *arg, int size, void *udata) { uma_keg_t keg; keg = (uma_keg_t)arg; KEG_LOCK(keg); if (keg->uk_free != 0) { printf("Freed UMA keg (%s) was not empty (%d items). " " Lost %d pages of memory.\n", keg->uk_name ? keg->uk_name : "", keg->uk_free, keg->uk_pages); } KEG_UNLOCK(keg); hash_free(&keg->uk_hash); KEG_LOCK_FINI(keg); } /* * Zone header dtor. * * Arguments/Returns follow uma_dtor specifications * udata unused */ static void zone_dtor(void *arg, int size, void *udata) { uma_zone_t zone; uma_keg_t keg; zone = (uma_zone_t)arg; if (!(zone->uz_flags & UMA_ZFLAG_INTERNAL)) cache_drain(zone); rw_wlock(&uma_rwlock); LIST_REMOVE(zone, uz_link); rw_wunlock(&uma_rwlock); /* * XXX there are some races here where * the zone can be drained but zone lock * released and then refilled before we * remove it... we dont care for now */ zone_reclaim(zone, M_WAITOK, true); /* * We only destroy kegs from non secondary/non cache zones. */ if ((zone->uz_flags & (UMA_ZONE_SECONDARY | UMA_ZFLAG_CACHE)) == 0) { keg = zone->uz_keg; rw_wlock(&uma_rwlock); LIST_REMOVE(keg, uk_link); rw_wunlock(&uma_rwlock); zone_free_item(kegs, keg, NULL, SKIP_NONE); } counter_u64_free(zone->uz_allocs); counter_u64_free(zone->uz_frees); counter_u64_free(zone->uz_fails); if (zone->uz_lockptr == &zone->uz_lock) ZONE_LOCK_FINI(zone); } /* * Traverses every zone in the system and calls a callback * * Arguments: * zfunc A pointer to a function which accepts a zone * as an argument. * * Returns: * Nothing */ static void zone_foreach(void (*zfunc)(uma_zone_t)) { uma_keg_t keg; uma_zone_t zone; /* * Before BOOT_RUNNING we are guaranteed to be single * threaded, so locking isn't needed. Startup functions * are allowed to use M_WAITOK. */ if (__predict_true(booted == BOOT_RUNNING)) rw_rlock(&uma_rwlock); LIST_FOREACH(keg, &uma_kegs, uk_link) { LIST_FOREACH(zone, &keg->uk_zones, uz_link) zfunc(zone); } if (__predict_true(booted == BOOT_RUNNING)) rw_runlock(&uma_rwlock); } /* * Count how many pages do we need to bootstrap. VM supplies * its need in early zones in the argument, we add up our zones, * which consist of: UMA Slabs, UMA Hash and 9 Bucket zones. The * zone of zones and zone of kegs are accounted separately. */ #define UMA_BOOT_ZONES 11 /* Zone of zones and zone of kegs have arbitrary alignment. */ #define UMA_BOOT_ALIGN 32 static int zsize, ksize; int uma_startup_count(int vm_zones) { int zones, pages; ksize = sizeof(struct uma_keg) + (sizeof(struct uma_domain) * vm_ndomains); zsize = sizeof(struct uma_zone) + (sizeof(struct uma_cache) * (mp_maxid + 1)) + (sizeof(struct uma_zone_domain) * vm_ndomains); /* * Memory for the zone of kegs and its keg, * and for zone of zones. */ pages = howmany(roundup(zsize, CACHE_LINE_SIZE) * 2 + roundup(ksize, CACHE_LINE_SIZE), PAGE_SIZE); #ifdef UMA_MD_SMALL_ALLOC zones = UMA_BOOT_ZONES; #else zones = UMA_BOOT_ZONES + vm_zones; vm_zones = 0; #endif /* Memory for the rest of startup zones, UMA and VM, ... */ if (zsize > UMA_SLAB_SPACE) { /* See keg_large_init(). */ u_int ppera; ppera = howmany(roundup2(zsize, UMA_BOOT_ALIGN), PAGE_SIZE); if (PAGE_SIZE * ppera - roundup2(zsize, UMA_BOOT_ALIGN) < SIZEOF_UMA_SLAB) ppera++; pages += (zones + vm_zones) * ppera; } else if (roundup2(zsize, UMA_BOOT_ALIGN) > UMA_SLAB_SPACE) /* See keg_small_init() special case for uk_ppera = 1. */ pages += zones; else pages += howmany(zones, UMA_SLAB_SPACE / roundup2(zsize, UMA_BOOT_ALIGN)); /* ... and their kegs. Note that zone of zones allocates a keg! */ pages += howmany(zones + 1, UMA_SLAB_SPACE / roundup2(ksize, UMA_BOOT_ALIGN)); /* * Most of startup zones are not going to be offpages, that's * why we use UMA_SLAB_SPACE instead of UMA_SLAB_SIZE in all * calculations. Some large bucket zones will be offpage, and * thus will allocate hashes. We take conservative approach * and assume that all zones may allocate hash. This may give * us some positive inaccuracy, usually an extra single page. */ pages += howmany(zones, UMA_SLAB_SPACE / (sizeof(struct slabhead *) * UMA_HASH_SIZE_INIT)); return (pages); } void uma_startup(void *mem, int npages) { struct uma_zctor_args args; uma_keg_t masterkeg; uintptr_t m; #ifdef DIAGNOSTIC printf("Entering %s with %d boot pages configured\n", __func__, npages); #endif rw_init(&uma_rwlock, "UMA lock"); /* Use bootpages memory for the zone of zones and zone of kegs. */ m = (uintptr_t)mem; zones = (uma_zone_t)m; m += roundup(zsize, CACHE_LINE_SIZE); kegs = (uma_zone_t)m; m += roundup(zsize, CACHE_LINE_SIZE); masterkeg = (uma_keg_t)m; m += roundup(ksize, CACHE_LINE_SIZE); m = roundup(m, PAGE_SIZE); npages -= (m - (uintptr_t)mem) / PAGE_SIZE; mem = (void *)m; /* "manually" create the initial zone */ memset(&args, 0, sizeof(args)); args.name = "UMA Kegs"; args.size = ksize; args.ctor = keg_ctor; args.dtor = keg_dtor; args.uminit = zero_init; args.fini = NULL; args.keg = masterkeg; args.align = UMA_BOOT_ALIGN - 1; args.flags = UMA_ZFLAG_INTERNAL; zone_ctor(kegs, zsize, &args, M_WAITOK); bootmem = mem; boot_pages = npages; args.name = "UMA Zones"; args.size = zsize; args.ctor = zone_ctor; args.dtor = zone_dtor; args.uminit = zero_init; args.fini = NULL; args.keg = NULL; args.align = UMA_BOOT_ALIGN - 1; args.flags = UMA_ZFLAG_INTERNAL; zone_ctor(zones, zsize, &args, M_WAITOK); /* Now make a zone for slab headers */ slabzone = uma_zcreate("UMA Slabs", sizeof(struct uma_slab), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZFLAG_INTERNAL); hashzone = uma_zcreate("UMA Hash", sizeof(struct slabhead *) * UMA_HASH_SIZE_INIT, NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZFLAG_INTERNAL); bucket_init(); booted = BOOT_STRAPPED; } void uma_startup1(void) { #ifdef DIAGNOSTIC printf("Entering %s with %d boot pages left\n", __func__, boot_pages); #endif booted = BOOT_PAGEALLOC; } void uma_startup2(void) { #ifdef DIAGNOSTIC printf("Entering %s with %d boot pages left\n", __func__, boot_pages); #endif booted = BOOT_BUCKETS; sx_init(&uma_reclaim_lock, "umareclaim"); bucket_enable(); } /* * Initialize our callout handle * */ static void uma_startup3(void) { #ifdef INVARIANTS TUNABLE_INT_FETCH("vm.debug.divisor", &dbg_divisor); uma_dbg_cnt = counter_u64_alloc(M_WAITOK); uma_skip_cnt = counter_u64_alloc(M_WAITOK); #endif zone_foreach(zone_alloc_counters); callout_init(&uma_callout, 1); callout_reset(&uma_callout, UMA_TIMEOUT * hz, uma_timeout, NULL); booted = BOOT_RUNNING; } static uma_keg_t uma_kcreate(uma_zone_t zone, size_t size, uma_init uminit, uma_fini fini, int align, uint32_t flags) { struct uma_kctor_args args; args.size = size; args.uminit = uminit; args.fini = fini; args.align = (align == UMA_ALIGN_CACHE) ? uma_align_cache : align; args.flags = flags; args.zone = zone; return (zone_alloc_item(kegs, &args, UMA_ANYDOMAIN, M_WAITOK)); } /* Public functions */ /* See uma.h */ void uma_set_align(int align) { if (align != UMA_ALIGN_CACHE) uma_align_cache = align; } /* See uma.h */ uma_zone_t uma_zcreate(const char *name, size_t size, uma_ctor ctor, uma_dtor dtor, uma_init uminit, uma_fini fini, int align, uint32_t flags) { struct uma_zctor_args args; uma_zone_t res; bool locked; KASSERT(powerof2(align + 1), ("invalid zone alignment %d for \"%s\"", align, name)); /* Sets all zones to a first-touch domain policy. */ #ifdef UMA_FIRSTTOUCH flags |= UMA_ZONE_NUMA; #endif /* This stuff is essential for the zone ctor */ memset(&args, 0, sizeof(args)); args.name = name; args.size = size; args.ctor = ctor; args.dtor = dtor; args.uminit = uminit; args.fini = fini; #ifdef INVARIANTS /* * If a zone is being created with an empty constructor and * destructor, pass UMA constructor/destructor which checks for * memory use after free. */ if ((!(flags & (UMA_ZONE_ZINIT | UMA_ZONE_NOFREE))) && ctor == NULL && dtor == NULL && uminit == NULL && fini == NULL) { args.ctor = trash_ctor; args.dtor = trash_dtor; args.uminit = trash_init; args.fini = trash_fini; } #endif args.align = align; args.flags = flags; args.keg = NULL; if (booted < BOOT_BUCKETS) { locked = false; } else { sx_slock(&uma_reclaim_lock); locked = true; } res = zone_alloc_item(zones, &args, UMA_ANYDOMAIN, M_WAITOK); if (locked) sx_sunlock(&uma_reclaim_lock); return (res); } /* See uma.h */ uma_zone_t uma_zsecond_create(char *name, uma_ctor ctor, uma_dtor dtor, uma_init zinit, uma_fini zfini, uma_zone_t master) { struct uma_zctor_args args; uma_keg_t keg; uma_zone_t res; bool locked; keg = master->uz_keg; memset(&args, 0, sizeof(args)); args.name = name; args.size = keg->uk_size; args.ctor = ctor; args.dtor = dtor; args.uminit = zinit; args.fini = zfini; args.align = keg->uk_align; args.flags = keg->uk_flags | UMA_ZONE_SECONDARY; args.keg = keg; if (booted < BOOT_BUCKETS) { locked = false; } else { sx_slock(&uma_reclaim_lock); locked = true; } /* XXX Attaches only one keg of potentially many. */ res = zone_alloc_item(zones, &args, UMA_ANYDOMAIN, M_WAITOK); if (locked) sx_sunlock(&uma_reclaim_lock); return (res); } /* See uma.h */ uma_zone_t uma_zcache_create(char *name, int size, uma_ctor ctor, uma_dtor dtor, uma_init zinit, uma_fini zfini, uma_import zimport, uma_release zrelease, void *arg, int flags) { struct uma_zctor_args args; memset(&args, 0, sizeof(args)); args.name = name; args.size = size; args.ctor = ctor; args.dtor = dtor; args.uminit = zinit; args.fini = zfini; args.import = zimport; args.release = zrelease; args.arg = arg; args.align = 0; args.flags = flags | UMA_ZFLAG_CACHE; return (zone_alloc_item(zones, &args, UMA_ANYDOMAIN, M_WAITOK)); } /* See uma.h */ void uma_zdestroy(uma_zone_t zone) { sx_slock(&uma_reclaim_lock); zone_free_item(zones, zone, NULL, SKIP_NONE); sx_sunlock(&uma_reclaim_lock); } void uma_zwait(uma_zone_t zone) { void *item; item = uma_zalloc_arg(zone, NULL, M_WAITOK); uma_zfree(zone, item); } void * uma_zalloc_pcpu_arg(uma_zone_t zone, void *udata, int flags) { void *item; #ifdef SMP int i; MPASS(zone->uz_flags & UMA_ZONE_PCPU); #endif item = uma_zalloc_arg(zone, udata, flags & ~M_ZERO); if (item != NULL && (flags & M_ZERO)) { #ifdef SMP for (i = 0; i <= mp_maxid; i++) bzero(zpcpu_get_cpu(item, i), zone->uz_size); #else bzero(item, zone->uz_size); #endif } return (item); } /* * A stub while both regular and pcpu cases are identical. */ void uma_zfree_pcpu_arg(uma_zone_t zone, void *item, void *udata) { #ifdef SMP MPASS(zone->uz_flags & UMA_ZONE_PCPU); #endif uma_zfree_arg(zone, item, udata); } /* See uma.h */ void * uma_zalloc_arg(uma_zone_t zone, void *udata, int flags) { uma_zone_domain_t zdom; uma_bucket_t bucket; uma_cache_t cache; void *item; int cpu, domain, lockfail, maxbucket; #ifdef INVARIANTS bool skipdbg; #endif /* Enable entropy collection for RANDOM_ENABLE_UMA kernel option */ random_harvest_fast_uma(&zone, sizeof(zone), RANDOM_UMA); /* This is the fast path allocation */ CTR4(KTR_UMA, "uma_zalloc_arg thread %x zone %s(%p) flags %d", curthread, zone->uz_name, zone, flags); if (flags & M_WAITOK) { WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, "uma_zalloc_arg: zone \"%s\"", zone->uz_name); } KASSERT((flags & M_EXEC) == 0, ("uma_zalloc_arg: called with M_EXEC")); KASSERT(curthread->td_critnest == 0 || SCHEDULER_STOPPED(), ("uma_zalloc_arg: called with spinlock or critical section held")); if (zone->uz_flags & UMA_ZONE_PCPU) KASSERT((flags & M_ZERO) == 0, ("allocating from a pcpu zone " "with M_ZERO passed")); #ifdef DEBUG_MEMGUARD if (memguard_cmp_zone(zone)) { item = memguard_alloc(zone->uz_size, flags); if (item != NULL) { if (zone->uz_init != NULL && zone->uz_init(item, zone->uz_size, flags) != 0) return (NULL); if (zone->uz_ctor != NULL && zone->uz_ctor(item, zone->uz_size, udata, flags) != 0) { zone->uz_fini(item, zone->uz_size); return (NULL); } return (item); } /* This is unfortunate but should not be fatal. */ } #endif /* * If possible, allocate from the per-CPU cache. There are two * requirements for safe access to the per-CPU cache: (1) the thread * accessing the cache must not be preempted or yield during access, * and (2) the thread must not migrate CPUs without switching which * cache it accesses. We rely on a critical section to prevent * preemption and migration. We release the critical section in * order to acquire the zone mutex if we are unable to allocate from * the current cache; when we re-acquire the critical section, we * must detect and handle migration if it has occurred. */ zalloc_restart: critical_enter(); cpu = curcpu; cache = &zone->uz_cpu[cpu]; zalloc_start: bucket = cache->uc_allocbucket; if (bucket != NULL && bucket->ub_cnt > 0) { bucket->ub_cnt--; item = bucket->ub_bucket[bucket->ub_cnt]; #ifdef INVARIANTS bucket->ub_bucket[bucket->ub_cnt] = NULL; #endif KASSERT(item != NULL, ("uma_zalloc: Bucket pointer mangled.")); cache->uc_allocs++; critical_exit(); #ifdef INVARIANTS skipdbg = uma_dbg_zskip(zone, item); #endif if (zone->uz_ctor != NULL && #ifdef INVARIANTS (!skipdbg || zone->uz_ctor != trash_ctor || zone->uz_dtor != trash_dtor) && #endif zone->uz_ctor(item, zone->uz_size, udata, flags) != 0) { counter_u64_add(zone->uz_fails, 1); zone_free_item(zone, item, udata, SKIP_DTOR | SKIP_CNT); return (NULL); } #ifdef INVARIANTS if (!skipdbg) uma_dbg_alloc(zone, NULL, item); #endif if (flags & M_ZERO) uma_zero_item(item, zone); return (item); } /* * We have run out of items in our alloc bucket. * See if we can switch with our free bucket. */ bucket = cache->uc_freebucket; if (bucket != NULL && bucket->ub_cnt > 0) { CTR2(KTR_UMA, "uma_zalloc: zone %s(%p) swapping empty with alloc", zone->uz_name, zone); cache->uc_freebucket = cache->uc_allocbucket; cache->uc_allocbucket = bucket; goto zalloc_start; } /* * Discard any empty allocation bucket while we hold no locks. */ bucket = cache->uc_allocbucket; cache->uc_allocbucket = NULL; critical_exit(); if (bucket != NULL) bucket_free(zone, bucket, udata); /* Short-circuit for zones without buckets and low memory. */ if (zone->uz_count == 0 || bucketdisable) { ZONE_LOCK(zone); if (zone->uz_flags & UMA_ZONE_NUMA) domain = PCPU_GET(domain); else domain = UMA_ANYDOMAIN; goto zalloc_item; } /* * Attempt to retrieve the item from the per-CPU cache has failed, so * we must go back to the zone. This requires the zone lock, so we * must drop the critical section, then re-acquire it when we go back * to the cache. Since the critical section is released, we may be * preempted or migrate. As such, make sure not to maintain any * thread-local state specific to the cache from prior to releasing * the critical section. */ lockfail = 0; if (ZONE_TRYLOCK(zone) == 0) { /* Record contention to size the buckets. */ ZONE_LOCK(zone); lockfail = 1; } critical_enter(); cpu = curcpu; cache = &zone->uz_cpu[cpu]; /* See if we lost the race to fill the cache. */ if (cache->uc_allocbucket != NULL) { ZONE_UNLOCK(zone); goto zalloc_start; } /* * Check the zone's cache of buckets. */ if (zone->uz_flags & UMA_ZONE_NUMA) { domain = PCPU_GET(domain); zdom = &zone->uz_domain[domain]; } else { domain = UMA_ANYDOMAIN; zdom = &zone->uz_domain[0]; } if ((bucket = zone_fetch_bucket(zone, zdom)) != NULL) { KASSERT(bucket->ub_cnt != 0, ("uma_zalloc_arg: Returning an empty bucket.")); cache->uc_allocbucket = bucket; ZONE_UNLOCK(zone); goto zalloc_start; } /* We are no longer associated with this CPU. */ critical_exit(); /* * We bump the uz count when the cache size is insufficient to * handle the working set. */ if (lockfail && zone->uz_count < zone->uz_count_max) zone->uz_count++; if (zone->uz_max_items > 0) { if (zone->uz_items >= zone->uz_max_items) goto zalloc_item; maxbucket = MIN(zone->uz_count, zone->uz_max_items - zone->uz_items); zone->uz_items += maxbucket; } else maxbucket = zone->uz_count; ZONE_UNLOCK(zone); /* * Now lets just fill a bucket and put it on the free list. If that * works we'll restart the allocation from the beginning and it * will use the just filled bucket. */ bucket = zone_alloc_bucket(zone, udata, domain, flags, maxbucket); CTR3(KTR_UMA, "uma_zalloc: zone %s(%p) bucket zone returned %p", zone->uz_name, zone, bucket); ZONE_LOCK(zone); if (bucket != NULL) { if (zone->uz_max_items > 0 && bucket->ub_cnt < maxbucket) { MPASS(zone->uz_items >= maxbucket - bucket->ub_cnt); zone->uz_items -= maxbucket - bucket->ub_cnt; if (zone->uz_sleepers > 0 && zone->uz_items < zone->uz_max_items) wakeup_one(zone); } critical_enter(); cpu = curcpu; cache = &zone->uz_cpu[cpu]; /* * See if we lost the race or were migrated. Cache the * initialized bucket to make this less likely or claim * the memory directly. */ if (cache->uc_allocbucket == NULL && ((zone->uz_flags & UMA_ZONE_NUMA) == 0 || domain == PCPU_GET(domain))) { cache->uc_allocbucket = bucket; zdom->uzd_imax += bucket->ub_cnt; } else if (zone->uz_bkt_count >= zone->uz_bkt_max) { critical_exit(); ZONE_UNLOCK(zone); bucket_drain(zone, bucket); bucket_free(zone, bucket, udata); goto zalloc_restart; } else zone_put_bucket(zone, zdom, bucket, false); ZONE_UNLOCK(zone); goto zalloc_start; } else if (zone->uz_max_items > 0) { zone->uz_items -= maxbucket; if (zone->uz_sleepers > 0 && zone->uz_items + 1 < zone->uz_max_items) wakeup_one(zone); } /* * We may not be able to get a bucket so return an actual item. */ zalloc_item: item = zone_alloc_item_locked(zone, udata, domain, flags); return (item); } void * uma_zalloc_domain(uma_zone_t zone, void *udata, int domain, int flags) { /* Enable entropy collection for RANDOM_ENABLE_UMA kernel option */ random_harvest_fast_uma(&zone, sizeof(zone), RANDOM_UMA); /* This is the fast path allocation */ CTR5(KTR_UMA, "uma_zalloc_domain thread %x zone %s(%p) domain %d flags %d", curthread, zone->uz_name, zone, domain, flags); if (flags & M_WAITOK) { WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, "uma_zalloc_domain: zone \"%s\"", zone->uz_name); } KASSERT(curthread->td_critnest == 0 || SCHEDULER_STOPPED(), ("uma_zalloc_domain: called with spinlock or critical section held")); return (zone_alloc_item(zone, udata, domain, flags)); } /* * Find a slab with some space. Prefer slabs that are partially used over those * that are totally full. This helps to reduce fragmentation. * * If 'rr' is 1, search all domains starting from 'domain'. Otherwise check * only 'domain'. */ static uma_slab_t keg_first_slab(uma_keg_t keg, int domain, bool rr) { uma_domain_t dom; uma_slab_t slab; int start; KASSERT(domain >= 0 && domain < vm_ndomains, ("keg_first_slab: domain %d out of range", domain)); KEG_LOCK_ASSERT(keg); slab = NULL; start = domain; do { dom = &keg->uk_domain[domain]; if (!LIST_EMPTY(&dom->ud_part_slab)) return (LIST_FIRST(&dom->ud_part_slab)); if (!LIST_EMPTY(&dom->ud_free_slab)) { slab = LIST_FIRST(&dom->ud_free_slab); LIST_REMOVE(slab, us_link); LIST_INSERT_HEAD(&dom->ud_part_slab, slab, us_link); return (slab); } if (rr) domain = (domain + 1) % vm_ndomains; } while (domain != start); return (NULL); } static uma_slab_t keg_fetch_free_slab(uma_keg_t keg, int domain, bool rr, int flags) { uint32_t reserve; KEG_LOCK_ASSERT(keg); reserve = (flags & M_USE_RESERVE) != 0 ? 0 : keg->uk_reserve; if (keg->uk_free <= reserve) return (NULL); return (keg_first_slab(keg, domain, rr)); } static uma_slab_t keg_fetch_slab(uma_keg_t keg, uma_zone_t zone, int rdomain, const int flags) { struct vm_domainset_iter di; uma_domain_t dom; uma_slab_t slab; int aflags, domain; bool rr; restart: KEG_LOCK_ASSERT(keg); /* * Use the keg's policy if upper layers haven't already specified a * domain (as happens with first-touch zones). * * To avoid races we run the iterator with the keg lock held, but that * means that we cannot allow the vm_domainset layer to sleep. Thus, * clear M_WAITOK and handle low memory conditions locally. */ rr = rdomain == UMA_ANYDOMAIN; if (rr) { aflags = (flags & ~M_WAITOK) | M_NOWAIT; vm_domainset_iter_policy_ref_init(&di, &keg->uk_dr, &domain, &aflags); } else { aflags = flags; domain = rdomain; } for (;;) { slab = keg_fetch_free_slab(keg, domain, rr, flags); if (slab != NULL) { MPASS(slab->us_keg == keg); return (slab); } /* * M_NOVM means don't ask at all! */ if (flags & M_NOVM) break; KASSERT(zone->uz_max_items == 0 || zone->uz_items <= zone->uz_max_items, ("%s: zone %p overflow", __func__, zone)); slab = keg_alloc_slab(keg, zone, domain, flags, aflags); /* * If we got a slab here it's safe to mark it partially used * and return. We assume that the caller is going to remove * at least one item. */ if (slab) { MPASS(slab->us_keg == keg); dom = &keg->uk_domain[slab->us_domain]; LIST_INSERT_HEAD(&dom->ud_part_slab, slab, us_link); return (slab); } KEG_LOCK(keg); if (rr && vm_domainset_iter_policy(&di, &domain) != 0) { if ((flags & M_WAITOK) != 0) { KEG_UNLOCK(keg); vm_wait_doms(&keg->uk_dr.dr_policy->ds_mask); KEG_LOCK(keg); goto restart; } break; } } /* * We might not have been able to get a slab but another cpu * could have while we were unlocked. Check again before we * fail. */ if ((slab = keg_fetch_free_slab(keg, domain, rr, flags)) != NULL) { MPASS(slab->us_keg == keg); return (slab); } return (NULL); } static uma_slab_t zone_fetch_slab(uma_zone_t zone, uma_keg_t keg, int domain, int flags) { uma_slab_t slab; if (keg == NULL) { keg = zone->uz_keg; KEG_LOCK(keg); } for (;;) { slab = keg_fetch_slab(keg, zone, domain, flags); if (slab) return (slab); if (flags & (M_NOWAIT | M_NOVM)) break; } KEG_UNLOCK(keg); return (NULL); } static void * slab_alloc_item(uma_keg_t keg, uma_slab_t slab) { uma_domain_t dom; void *item; uint8_t freei; MPASS(keg == slab->us_keg); KEG_LOCK_ASSERT(keg); freei = BIT_FFS(SLAB_SETSIZE, &slab->us_free) - 1; BIT_CLR(SLAB_SETSIZE, freei, &slab->us_free); item = slab->us_data + (keg->uk_rsize * freei); slab->us_freecount--; keg->uk_free--; /* Move this slab to the full list */ if (slab->us_freecount == 0) { LIST_REMOVE(slab, us_link); dom = &keg->uk_domain[slab->us_domain]; LIST_INSERT_HEAD(&dom->ud_full_slab, slab, us_link); } return (item); } static int zone_import(uma_zone_t zone, void **bucket, int max, int domain, int flags) { uma_slab_t slab; uma_keg_t keg; #ifdef NUMA int stripe; #endif int i; slab = NULL; keg = NULL; /* Try to keep the buckets totally full */ for (i = 0; i < max; ) { if ((slab = zone_fetch_slab(zone, keg, domain, flags)) == NULL) break; keg = slab->us_keg; #ifdef NUMA stripe = howmany(max, vm_ndomains); #endif while (slab->us_freecount && i < max) { bucket[i++] = slab_alloc_item(keg, slab); if (keg->uk_free <= keg->uk_reserve) break; #ifdef NUMA /* * If the zone is striped we pick a new slab for every * N allocations. Eliminating this conditional will * instead pick a new domain for each bucket rather * than stripe within each bucket. The current option * produces more fragmentation and requires more cpu * time but yields better distribution. */ if ((zone->uz_flags & UMA_ZONE_NUMA) == 0 && vm_ndomains > 1 && --stripe == 0) break; #endif } /* Don't block if we allocated any successfully. */ flags &= ~M_WAITOK; flags |= M_NOWAIT; } if (slab != NULL) KEG_UNLOCK(keg); return i; } static uma_bucket_t zone_alloc_bucket(uma_zone_t zone, void *udata, int domain, int flags, int max) { uma_bucket_t bucket; CTR1(KTR_UMA, "zone_alloc:_bucket domain %d)", domain); /* Avoid allocs targeting empty domains. */ if (domain != UMA_ANYDOMAIN && VM_DOMAIN_EMPTY(domain)) domain = UMA_ANYDOMAIN; /* Don't wait for buckets, preserve caller's NOVM setting. */ bucket = bucket_alloc(zone, udata, M_NOWAIT | (flags & M_NOVM)); if (bucket == NULL) return (NULL); bucket->ub_cnt = zone->uz_import(zone->uz_arg, bucket->ub_bucket, MIN(max, bucket->ub_entries), domain, flags); /* * Initialize the memory if necessary. */ if (bucket->ub_cnt != 0 && zone->uz_init != NULL) { int i; for (i = 0; i < bucket->ub_cnt; i++) if (zone->uz_init(bucket->ub_bucket[i], zone->uz_size, flags) != 0) break; /* * If we couldn't initialize the whole bucket, put the * rest back onto the freelist. */ if (i != bucket->ub_cnt) { zone->uz_release(zone->uz_arg, &bucket->ub_bucket[i], bucket->ub_cnt - i); #ifdef INVARIANTS bzero(&bucket->ub_bucket[i], sizeof(void *) * (bucket->ub_cnt - i)); #endif bucket->ub_cnt = i; } } if (bucket->ub_cnt == 0) { bucket_free(zone, bucket, udata); counter_u64_add(zone->uz_fails, 1); return (NULL); } return (bucket); } /* * Allocates a single item from a zone. * * Arguments * zone The zone to alloc for. * udata The data to be passed to the constructor. * domain The domain to allocate from or UMA_ANYDOMAIN. * flags M_WAITOK, M_NOWAIT, M_ZERO. * * Returns * NULL if there is no memory and M_NOWAIT is set * An item if successful */ static void * zone_alloc_item(uma_zone_t zone, void *udata, int domain, int flags) { ZONE_LOCK(zone); return (zone_alloc_item_locked(zone, udata, domain, flags)); } /* * Returns with zone unlocked. */ static void * zone_alloc_item_locked(uma_zone_t zone, void *udata, int domain, int flags) { void *item; #ifdef INVARIANTS bool skipdbg; #endif ZONE_LOCK_ASSERT(zone); if (zone->uz_max_items > 0) { if (zone->uz_items >= zone->uz_max_items) { zone_log_warning(zone); zone_maxaction(zone); if (flags & M_NOWAIT) { ZONE_UNLOCK(zone); return (NULL); } zone->uz_sleeps++; zone->uz_sleepers++; while (zone->uz_items >= zone->uz_max_items) mtx_sleep(zone, zone->uz_lockptr, PVM, "zonelimit", 0); zone->uz_sleepers--; if (zone->uz_sleepers > 0 && zone->uz_items + 1 < zone->uz_max_items) wakeup_one(zone); } zone->uz_items++; } ZONE_UNLOCK(zone); /* Avoid allocs targeting empty domains. */ if (domain != UMA_ANYDOMAIN && VM_DOMAIN_EMPTY(domain)) domain = UMA_ANYDOMAIN; if (zone->uz_import(zone->uz_arg, &item, 1, domain, flags) != 1) goto fail; #ifdef INVARIANTS skipdbg = uma_dbg_zskip(zone, item); #endif /* * We have to call both the zone's init (not the keg's init) * and the zone's ctor. This is because the item is going from * a keg slab directly to the user, and the user is expecting it * to be both zone-init'd as well as zone-ctor'd. */ if (zone->uz_init != NULL) { if (zone->uz_init(item, zone->uz_size, flags) != 0) { zone_free_item(zone, item, udata, SKIP_FINI | SKIP_CNT); goto fail; } } if (zone->uz_ctor != NULL && #ifdef INVARIANTS (!skipdbg || zone->uz_ctor != trash_ctor || zone->uz_dtor != trash_dtor) && #endif zone->uz_ctor(item, zone->uz_size, udata, flags) != 0) { zone_free_item(zone, item, udata, SKIP_DTOR | SKIP_CNT); goto fail; } #ifdef INVARIANTS if (!skipdbg) uma_dbg_alloc(zone, NULL, item); #endif if (flags & M_ZERO) uma_zero_item(item, zone); counter_u64_add(zone->uz_allocs, 1); CTR3(KTR_UMA, "zone_alloc_item item %p from %s(%p)", item, zone->uz_name, zone); return (item); fail: if (zone->uz_max_items > 0) { ZONE_LOCK(zone); zone->uz_items--; ZONE_UNLOCK(zone); } counter_u64_add(zone->uz_fails, 1); CTR2(KTR_UMA, "zone_alloc_item failed from %s(%p)", zone->uz_name, zone); return (NULL); } /* See uma.h */ void uma_zfree_arg(uma_zone_t zone, void *item, void *udata) { uma_cache_t cache; uma_bucket_t bucket; uma_zone_domain_t zdom; int cpu, domain; #ifdef UMA_XDOMAIN int itemdomain; #endif bool lockfail; #ifdef INVARIANTS bool skipdbg; #endif /* Enable entropy collection for RANDOM_ENABLE_UMA kernel option */ random_harvest_fast_uma(&zone, sizeof(zone), RANDOM_UMA); CTR2(KTR_UMA, "uma_zfree_arg thread %x zone %s", curthread, zone->uz_name); KASSERT(curthread->td_critnest == 0 || SCHEDULER_STOPPED(), ("uma_zfree_arg: called with spinlock or critical section held")); /* uma_zfree(..., NULL) does nothing, to match free(9). */ if (item == NULL) return; #ifdef DEBUG_MEMGUARD if (is_memguard_addr(item)) { if (zone->uz_dtor != NULL) zone->uz_dtor(item, zone->uz_size, udata); if (zone->uz_fini != NULL) zone->uz_fini(item, zone->uz_size); memguard_free(item); return; } #endif #ifdef INVARIANTS skipdbg = uma_dbg_zskip(zone, item); if (skipdbg == false) { if (zone->uz_flags & UMA_ZONE_MALLOC) uma_dbg_free(zone, udata, item); else uma_dbg_free(zone, NULL, item); } if (zone->uz_dtor != NULL && (!skipdbg || zone->uz_dtor != trash_dtor || zone->uz_ctor != trash_ctor)) #else if (zone->uz_dtor != NULL) #endif zone->uz_dtor(item, zone->uz_size, udata); /* * The race here is acceptable. If we miss it we'll just have to wait * a little longer for the limits to be reset. */ if (zone->uz_sleepers > 0) goto zfree_item; #ifdef UMA_XDOMAIN if ((zone->uz_flags & UMA_ZONE_NUMA) != 0) itemdomain = _vm_phys_domain(pmap_kextract((vm_offset_t)item)); #endif /* * If possible, free to the per-CPU cache. There are two * requirements for safe access to the per-CPU cache: (1) the thread * accessing the cache must not be preempted or yield during access, * and (2) the thread must not migrate CPUs without switching which * cache it accesses. We rely on a critical section to prevent * preemption and migration. We release the critical section in * order to acquire the zone mutex if we are unable to free to the * current cache; when we re-acquire the critical section, we must * detect and handle migration if it has occurred. */ zfree_restart: critical_enter(); cpu = curcpu; cache = &zone->uz_cpu[cpu]; zfree_start: domain = PCPU_GET(domain); #ifdef UMA_XDOMAIN if ((zone->uz_flags & UMA_ZONE_NUMA) == 0) itemdomain = domain; #endif /* * Try to free into the allocbucket first to give LIFO ordering * for cache-hot datastructures. Spill over into the freebucket * if necessary. Alloc will swap them if one runs dry. */ #ifdef UMA_XDOMAIN if (domain != itemdomain) { bucket = cache->uc_crossbucket; } else #endif { bucket = cache->uc_allocbucket; if (bucket == NULL || bucket->ub_cnt >= bucket->ub_entries) bucket = cache->uc_freebucket; } if (bucket != NULL && bucket->ub_cnt < bucket->ub_entries) { KASSERT(bucket->ub_bucket[bucket->ub_cnt] == NULL, ("uma_zfree: Freeing to non free bucket index.")); bucket->ub_bucket[bucket->ub_cnt] = item; bucket->ub_cnt++; cache->uc_frees++; critical_exit(); return; } /* * We must go back the zone, which requires acquiring the zone lock, * which in turn means we must release and re-acquire the critical * section. Since the critical section is released, we may be * preempted or migrate. As such, make sure not to maintain any * thread-local state specific to the cache from prior to releasing * the critical section. */ critical_exit(); if (zone->uz_count == 0 || bucketdisable) goto zfree_item; lockfail = false; if (ZONE_TRYLOCK(zone) == 0) { /* Record contention to size the buckets. */ ZONE_LOCK(zone); lockfail = true; } critical_enter(); cpu = curcpu; domain = PCPU_GET(domain); cache = &zone->uz_cpu[cpu]; #ifdef UMA_XDOMAIN if (domain != itemdomain) bucket = cache->uc_crossbucket; else #endif bucket = cache->uc_freebucket; if (bucket != NULL && bucket->ub_cnt < bucket->ub_entries) { ZONE_UNLOCK(zone); goto zfree_start; } #ifdef UMA_XDOMAIN if (domain != itemdomain) cache->uc_crossbucket = NULL; else #endif cache->uc_freebucket = NULL; /* We are no longer associated with this CPU. */ critical_exit(); #ifdef UMA_XDOMAIN if (domain != itemdomain) { if (bucket != NULL) { zone->uz_xdomain += bucket->ub_cnt; if (vm_ndomains > 2 || zone->uz_bkt_count >= zone->uz_bkt_max) { ZONE_UNLOCK(zone); bucket_drain(zone, bucket); bucket_free(zone, bucket, udata); } else { zdom = &zone->uz_domain[itemdomain]; zone_put_bucket(zone, zdom, bucket, true); ZONE_UNLOCK(zone); } } else ZONE_UNLOCK(zone); bucket = bucket_alloc(zone, udata, M_NOWAIT); if (bucket == NULL) goto zfree_item; critical_enter(); cpu = curcpu; cache = &zone->uz_cpu[cpu]; if (cache->uc_crossbucket == NULL) { cache->uc_crossbucket = bucket; goto zfree_start; } critical_exit(); bucket_free(zone, bucket, udata); goto zfree_restart; } #endif if ((zone->uz_flags & UMA_ZONE_NUMA) != 0) { zdom = &zone->uz_domain[domain]; } else { domain = 0; zdom = &zone->uz_domain[0]; } /* Can we throw this on the zone full list? */ if (bucket != NULL) { CTR3(KTR_UMA, "uma_zfree: zone %s(%p) putting bucket %p on free list", zone->uz_name, zone, bucket); /* ub_cnt is pointing to the last free item */ KASSERT(bucket->ub_cnt == bucket->ub_entries, ("uma_zfree: Attempting to insert not full bucket onto the full list.\n")); if (zone->uz_bkt_count >= zone->uz_bkt_max) { ZONE_UNLOCK(zone); bucket_drain(zone, bucket); bucket_free(zone, bucket, udata); goto zfree_restart; } else zone_put_bucket(zone, zdom, bucket, true); } /* * We bump the uz count when the cache size is insufficient to * handle the working set. */ if (lockfail && zone->uz_count < zone->uz_count_max) zone->uz_count++; ZONE_UNLOCK(zone); bucket = bucket_alloc(zone, udata, M_NOWAIT); CTR3(KTR_UMA, "uma_zfree: zone %s(%p) allocated bucket %p", zone->uz_name, zone, bucket); if (bucket) { critical_enter(); cpu = curcpu; cache = &zone->uz_cpu[cpu]; if (cache->uc_freebucket == NULL && ((zone->uz_flags & UMA_ZONE_NUMA) == 0 || domain == PCPU_GET(domain))) { cache->uc_freebucket = bucket; goto zfree_start; } /* * We lost the race, start over. We have to drop our * critical section to free the bucket. */ critical_exit(); bucket_free(zone, bucket, udata); goto zfree_restart; } /* * If nothing else caught this, we'll just do an internal free. */ zfree_item: zone_free_item(zone, item, udata, SKIP_DTOR); } void uma_zfree_domain(uma_zone_t zone, void *item, void *udata) { /* Enable entropy collection for RANDOM_ENABLE_UMA kernel option */ random_harvest_fast_uma(&zone, sizeof(zone), RANDOM_UMA); CTR2(KTR_UMA, "uma_zfree_domain thread %x zone %s", curthread, zone->uz_name); KASSERT(curthread->td_critnest == 0 || SCHEDULER_STOPPED(), ("uma_zfree_domain: called with spinlock or critical section held")); /* uma_zfree(..., NULL) does nothing, to match free(9). */ if (item == NULL) return; zone_free_item(zone, item, udata, SKIP_NONE); } static void slab_free_item(uma_zone_t zone, uma_slab_t slab, void *item) { uma_keg_t keg; uma_domain_t dom; uint8_t freei; keg = zone->uz_keg; MPASS(zone->uz_lockptr == &keg->uk_lock); KEG_LOCK_ASSERT(keg); MPASS(keg == slab->us_keg); dom = &keg->uk_domain[slab->us_domain]; /* Do we need to remove from any lists? */ if (slab->us_freecount+1 == keg->uk_ipers) { LIST_REMOVE(slab, us_link); LIST_INSERT_HEAD(&dom->ud_free_slab, slab, us_link); } else if (slab->us_freecount == 0) { LIST_REMOVE(slab, us_link); LIST_INSERT_HEAD(&dom->ud_part_slab, slab, us_link); } /* Slab management. */ freei = ((uintptr_t)item - (uintptr_t)slab->us_data) / keg->uk_rsize; BIT_SET(SLAB_SETSIZE, freei, &slab->us_free); slab->us_freecount++; /* Keg statistics. */ keg->uk_free++; } static void zone_release(uma_zone_t zone, void **bucket, int cnt) { void *item; uma_slab_t slab; uma_keg_t keg; uint8_t *mem; int i; keg = zone->uz_keg; KEG_LOCK(keg); for (i = 0; i < cnt; i++) { item = bucket[i]; if (!(zone->uz_flags & UMA_ZONE_VTOSLAB)) { mem = (uint8_t *)((uintptr_t)item & (~UMA_SLAB_MASK)); if (zone->uz_flags & UMA_ZONE_HASH) { slab = hash_sfind(&keg->uk_hash, mem); } else { mem += keg->uk_pgoff; slab = (uma_slab_t)mem; } } else { slab = vtoslab((vm_offset_t)item); MPASS(slab->us_keg == keg); } slab_free_item(zone, slab, item); } KEG_UNLOCK(keg); } /* * Frees a single item to any zone. * * Arguments: * zone The zone to free to * item The item we're freeing * udata User supplied data for the dtor * skip Skip dtors and finis */ static void zone_free_item(uma_zone_t zone, void *item, void *udata, enum zfreeskip skip) { #ifdef INVARIANTS bool skipdbg; skipdbg = uma_dbg_zskip(zone, item); if (skip == SKIP_NONE && !skipdbg) { if (zone->uz_flags & UMA_ZONE_MALLOC) uma_dbg_free(zone, udata, item); else uma_dbg_free(zone, NULL, item); } if (skip < SKIP_DTOR && zone->uz_dtor != NULL && (!skipdbg || zone->uz_dtor != trash_dtor || zone->uz_ctor != trash_ctor)) #else if (skip < SKIP_DTOR && zone->uz_dtor != NULL) #endif zone->uz_dtor(item, zone->uz_size, udata); if (skip < SKIP_FINI && zone->uz_fini) zone->uz_fini(item, zone->uz_size); zone->uz_release(zone->uz_arg, &item, 1); if (skip & SKIP_CNT) return; counter_u64_add(zone->uz_frees, 1); if (zone->uz_max_items > 0) { ZONE_LOCK(zone); zone->uz_items--; if (zone->uz_sleepers > 0 && zone->uz_items < zone->uz_max_items) wakeup_one(zone); ZONE_UNLOCK(zone); } } /* See uma.h */ int uma_zone_set_max(uma_zone_t zone, int nitems) { struct uma_bucket_zone *ubz; /* * If limit is very low we may need to limit how * much items are allowed in CPU caches. */ ubz = &bucket_zones[0]; for (; ubz->ubz_entries != 0; ubz++) if (ubz->ubz_entries * 2 * mp_ncpus > nitems) break; if (ubz == &bucket_zones[0]) nitems = ubz->ubz_entries * 2 * mp_ncpus; else ubz--; ZONE_LOCK(zone); zone->uz_count_max = zone->uz_count = ubz->ubz_entries; if (zone->uz_count_min > zone->uz_count_max) zone->uz_count_min = zone->uz_count_max; zone->uz_max_items = nitems; ZONE_UNLOCK(zone); return (nitems); } /* See uma.h */ int uma_zone_set_maxcache(uma_zone_t zone, int nitems) { ZONE_LOCK(zone); zone->uz_bkt_max = nitems; ZONE_UNLOCK(zone); return (nitems); } /* See uma.h */ int uma_zone_get_max(uma_zone_t zone) { int nitems; ZONE_LOCK(zone); nitems = zone->uz_max_items; ZONE_UNLOCK(zone); return (nitems); } /* See uma.h */ void uma_zone_set_warning(uma_zone_t zone, const char *warning) { ZONE_LOCK(zone); zone->uz_warning = warning; ZONE_UNLOCK(zone); } /* See uma.h */ void uma_zone_set_maxaction(uma_zone_t zone, uma_maxaction_t maxaction) { ZONE_LOCK(zone); TASK_INIT(&zone->uz_maxaction, 0, (task_fn_t *)maxaction, zone); ZONE_UNLOCK(zone); } /* See uma.h */ int uma_zone_get_cur(uma_zone_t zone) { int64_t nitems; u_int i; ZONE_LOCK(zone); nitems = counter_u64_fetch(zone->uz_allocs) - counter_u64_fetch(zone->uz_frees); CPU_FOREACH(i) { /* * See the comment in uma_vm_zone_stats() regarding the * safety of accessing the per-cpu caches. With the zone lock * held, it is safe, but can potentially result in stale data. */ nitems += zone->uz_cpu[i].uc_allocs - zone->uz_cpu[i].uc_frees; } ZONE_UNLOCK(zone); return (nitems < 0 ? 0 : nitems); } /* See uma.h */ void uma_zone_set_init(uma_zone_t zone, uma_init uminit) { uma_keg_t keg; KEG_GET(zone, keg); KEG_LOCK(keg); KASSERT(keg->uk_pages == 0, ("uma_zone_set_init on non-empty keg")); keg->uk_init = uminit; KEG_UNLOCK(keg); } /* See uma.h */ void uma_zone_set_fini(uma_zone_t zone, uma_fini fini) { uma_keg_t keg; KEG_GET(zone, keg); KEG_LOCK(keg); KASSERT(keg->uk_pages == 0, ("uma_zone_set_fini on non-empty keg")); keg->uk_fini = fini; KEG_UNLOCK(keg); } /* See uma.h */ void uma_zone_set_zinit(uma_zone_t zone, uma_init zinit) { ZONE_LOCK(zone); KASSERT(zone->uz_keg->uk_pages == 0, ("uma_zone_set_zinit on non-empty keg")); zone->uz_init = zinit; ZONE_UNLOCK(zone); } /* See uma.h */ void uma_zone_set_zfini(uma_zone_t zone, uma_fini zfini) { ZONE_LOCK(zone); KASSERT(zone->uz_keg->uk_pages == 0, ("uma_zone_set_zfini on non-empty keg")); zone->uz_fini = zfini; ZONE_UNLOCK(zone); } /* See uma.h */ /* XXX uk_freef is not actually used with the zone locked */ void uma_zone_set_freef(uma_zone_t zone, uma_free freef) { uma_keg_t keg; KEG_GET(zone, keg); KASSERT(keg != NULL, ("uma_zone_set_freef: Invalid zone type")); KEG_LOCK(keg); keg->uk_freef = freef; KEG_UNLOCK(keg); } /* See uma.h */ /* XXX uk_allocf is not actually used with the zone locked */ void uma_zone_set_allocf(uma_zone_t zone, uma_alloc allocf) { uma_keg_t keg; KEG_GET(zone, keg); KEG_LOCK(keg); keg->uk_allocf = allocf; KEG_UNLOCK(keg); } /* See uma.h */ void uma_zone_reserve(uma_zone_t zone, int items) { uma_keg_t keg; KEG_GET(zone, keg); KEG_LOCK(keg); keg->uk_reserve = items; KEG_UNLOCK(keg); } /* See uma.h */ int uma_zone_reserve_kva(uma_zone_t zone, int count) { uma_keg_t keg; vm_offset_t kva; u_int pages; KEG_GET(zone, keg); pages = count / keg->uk_ipers; if (pages * keg->uk_ipers < count) pages++; pages *= keg->uk_ppera; #ifdef UMA_MD_SMALL_ALLOC if (keg->uk_ppera > 1) { #else if (1) { #endif kva = kva_alloc((vm_size_t)pages * PAGE_SIZE); if (kva == 0) return (0); } else kva = 0; ZONE_LOCK(zone); MPASS(keg->uk_kva == 0); keg->uk_kva = kva; keg->uk_offset = 0; zone->uz_max_items = pages * keg->uk_ipers; #ifdef UMA_MD_SMALL_ALLOC keg->uk_allocf = (keg->uk_ppera > 1) ? noobj_alloc : uma_small_alloc; #else keg->uk_allocf = noobj_alloc; #endif keg->uk_flags |= UMA_ZONE_NOFREE; ZONE_UNLOCK(zone); return (1); } /* See uma.h */ void uma_prealloc(uma_zone_t zone, int items) { struct vm_domainset_iter di; uma_domain_t dom; uma_slab_t slab; uma_keg_t keg; int aflags, domain, slabs; KEG_GET(zone, keg); KEG_LOCK(keg); slabs = items / keg->uk_ipers; if (slabs * keg->uk_ipers < items) slabs++; while (slabs-- > 0) { aflags = M_NOWAIT; vm_domainset_iter_policy_ref_init(&di, &keg->uk_dr, &domain, &aflags); for (;;) { slab = keg_alloc_slab(keg, zone, domain, M_WAITOK, aflags); if (slab != NULL) { MPASS(slab->us_keg == keg); dom = &keg->uk_domain[slab->us_domain]; LIST_INSERT_HEAD(&dom->ud_free_slab, slab, us_link); break; } KEG_LOCK(keg); if (vm_domainset_iter_policy(&di, &domain) != 0) { KEG_UNLOCK(keg); vm_wait_doms(&keg->uk_dr.dr_policy->ds_mask); KEG_LOCK(keg); } } } KEG_UNLOCK(keg); } /* See uma.h */ void uma_reclaim(int req) { CTR0(KTR_UMA, "UMA: vm asked us to release pages!"); sx_xlock(&uma_reclaim_lock); bucket_enable(); switch (req) { case UMA_RECLAIM_TRIM: zone_foreach(zone_trim); break; case UMA_RECLAIM_DRAIN: case UMA_RECLAIM_DRAIN_CPU: zone_foreach(zone_drain); if (req == UMA_RECLAIM_DRAIN_CPU) { pcpu_cache_drain_safe(NULL); zone_foreach(zone_drain); } break; default: panic("unhandled reclamation request %d", req); } /* * Some slabs may have been freed but this zone will be visited early * we visit again so that we can free pages that are empty once other * zones are drained. We have to do the same for buckets. */ zone_drain(slabzone); bucket_zone_drain(); sx_xunlock(&uma_reclaim_lock); } static volatile int uma_reclaim_needed; void uma_reclaim_wakeup(void) { if (atomic_fetchadd_int(&uma_reclaim_needed, 1) == 0) wakeup(uma_reclaim); } void uma_reclaim_worker(void *arg __unused) { for (;;) { sx_xlock(&uma_reclaim_lock); while (atomic_load_int(&uma_reclaim_needed) == 0) sx_sleep(uma_reclaim, &uma_reclaim_lock, PVM, "umarcl", hz); sx_xunlock(&uma_reclaim_lock); EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_KMEM); uma_reclaim(UMA_RECLAIM_DRAIN_CPU); atomic_store_int(&uma_reclaim_needed, 0); /* Don't fire more than once per-second. */ pause("umarclslp", hz); } } /* See uma.h */ void uma_zone_reclaim(uma_zone_t zone, int req) { switch (req) { case UMA_RECLAIM_TRIM: zone_trim(zone); break; case UMA_RECLAIM_DRAIN: zone_drain(zone); break; case UMA_RECLAIM_DRAIN_CPU: pcpu_cache_drain_safe(zone); zone_drain(zone); break; default: panic("unhandled reclamation request %d", req); } } /* See uma.h */ int uma_zone_exhausted(uma_zone_t zone) { int full; ZONE_LOCK(zone); full = zone->uz_sleepers > 0; ZONE_UNLOCK(zone); return (full); } int uma_zone_exhausted_nolock(uma_zone_t zone) { return (zone->uz_sleepers > 0); } void * uma_large_malloc_domain(vm_size_t size, int domain, int wait) { struct domainset *policy; vm_offset_t addr; uma_slab_t slab; if (domain != UMA_ANYDOMAIN) { /* avoid allocs targeting empty domains */ if (VM_DOMAIN_EMPTY(domain)) domain = UMA_ANYDOMAIN; } slab = zone_alloc_item(slabzone, NULL, domain, wait); if (slab == NULL) return (NULL); policy = (domain == UMA_ANYDOMAIN) ? DOMAINSET_RR() : DOMAINSET_FIXED(domain); addr = kmem_malloc_domainset(policy, size, wait); if (addr != 0) { vsetslab(addr, slab); slab->us_data = (void *)addr; slab->us_flags = UMA_SLAB_KERNEL | UMA_SLAB_MALLOC; slab->us_size = size; slab->us_domain = vm_phys_domain(PHYS_TO_VM_PAGE( pmap_kextract(addr))); uma_total_inc(size); } else { zone_free_item(slabzone, slab, NULL, SKIP_NONE); } return ((void *)addr); } void * uma_large_malloc(vm_size_t size, int wait) { return uma_large_malloc_domain(size, UMA_ANYDOMAIN, wait); } void uma_large_free(uma_slab_t slab) { KASSERT((slab->us_flags & UMA_SLAB_KERNEL) != 0, ("uma_large_free: Memory not allocated with uma_large_malloc.")); kmem_free((vm_offset_t)slab->us_data, slab->us_size); uma_total_dec(slab->us_size); zone_free_item(slabzone, slab, NULL, SKIP_NONE); } static void uma_zero_item(void *item, uma_zone_t zone) { bzero(item, zone->uz_size); } unsigned long uma_limit(void) { return (uma_kmem_limit); } void uma_set_limit(unsigned long limit) { uma_kmem_limit = limit; } unsigned long uma_size(void) { return (atomic_load_long(&uma_kmem_total)); } long uma_avail(void) { return (uma_kmem_limit - uma_size()); } void uma_print_stats(void) { zone_foreach(uma_print_zone); } static void slab_print(uma_slab_t slab) { printf("slab: keg %p, data %p, freecount %d\n", slab->us_keg, slab->us_data, slab->us_freecount); } static void cache_print(uma_cache_t cache) { printf("alloc: %p(%d), free: %p(%d), cross: %p(%d)j\n", cache->uc_allocbucket, cache->uc_allocbucket?cache->uc_allocbucket->ub_cnt:0, cache->uc_freebucket, cache->uc_freebucket?cache->uc_freebucket->ub_cnt:0, cache->uc_crossbucket, cache->uc_crossbucket?cache->uc_crossbucket->ub_cnt:0); } static void uma_print_keg(uma_keg_t keg) { uma_domain_t dom; uma_slab_t slab; int i; printf("keg: %s(%p) size %d(%d) flags %#x ipers %d ppera %d " "out %d free %d\n", keg->uk_name, keg, keg->uk_size, keg->uk_rsize, keg->uk_flags, keg->uk_ipers, keg->uk_ppera, (keg->uk_pages / keg->uk_ppera) * keg->uk_ipers - keg->uk_free, keg->uk_free); for (i = 0; i < vm_ndomains; i++) { dom = &keg->uk_domain[i]; printf("Part slabs:\n"); LIST_FOREACH(slab, &dom->ud_part_slab, us_link) slab_print(slab); printf("Free slabs:\n"); LIST_FOREACH(slab, &dom->ud_free_slab, us_link) slab_print(slab); printf("Full slabs:\n"); LIST_FOREACH(slab, &dom->ud_full_slab, us_link) slab_print(slab); } } void uma_print_zone(uma_zone_t zone) { uma_cache_t cache; int i; printf("zone: %s(%p) size %d maxitems %ju flags %#x\n", zone->uz_name, zone, zone->uz_size, (uintmax_t)zone->uz_max_items, zone->uz_flags); if (zone->uz_lockptr != &zone->uz_lock) uma_print_keg(zone->uz_keg); CPU_FOREACH(i) { cache = &zone->uz_cpu[i]; printf("CPU %d Cache:\n", i); cache_print(cache); } } #ifdef DDB /* * Generate statistics across both the zone and its per-cpu cache's. Return * desired statistics if the pointer is non-NULL for that statistic. * * Note: does not update the zone statistics, as it can't safely clear the * per-CPU cache statistic. * * XXXRW: Following the uc_allocbucket and uc_freebucket pointers here isn't * safe from off-CPU; we should modify the caches to track this information * directly so that we don't have to. */ static void uma_zone_sumstat(uma_zone_t z, long *cachefreep, uint64_t *allocsp, uint64_t *freesp, uint64_t *sleepsp, uint64_t *xdomainp) { uma_cache_t cache; uint64_t allocs, frees, sleeps, xdomain; int cachefree, cpu; allocs = frees = sleeps = xdomain = 0; cachefree = 0; CPU_FOREACH(cpu) { cache = &z->uz_cpu[cpu]; if (cache->uc_allocbucket != NULL) cachefree += cache->uc_allocbucket->ub_cnt; if (cache->uc_freebucket != NULL) cachefree += cache->uc_freebucket->ub_cnt; if (cache->uc_crossbucket != NULL) { xdomain += cache->uc_crossbucket->ub_cnt; cachefree += cache->uc_crossbucket->ub_cnt; } allocs += cache->uc_allocs; frees += cache->uc_frees; } allocs += counter_u64_fetch(z->uz_allocs); frees += counter_u64_fetch(z->uz_frees); sleeps += z->uz_sleeps; xdomain += z->uz_xdomain; if (cachefreep != NULL) *cachefreep = cachefree; if (allocsp != NULL) *allocsp = allocs; if (freesp != NULL) *freesp = frees; if (sleepsp != NULL) *sleepsp = sleeps; if (xdomainp != NULL) *xdomainp = xdomain; } #endif /* DDB */ static int sysctl_vm_zone_count(SYSCTL_HANDLER_ARGS) { uma_keg_t kz; uma_zone_t z; int count; count = 0; rw_rlock(&uma_rwlock); LIST_FOREACH(kz, &uma_kegs, uk_link) { LIST_FOREACH(z, &kz->uk_zones, uz_link) count++; } LIST_FOREACH(z, &uma_cachezones, uz_link) count++; rw_runlock(&uma_rwlock); return (sysctl_handle_int(oidp, &count, 0, req)); } static void uma_vm_zone_stats(struct uma_type_header *uth, uma_zone_t z, struct sbuf *sbuf, struct uma_percpu_stat *ups, bool internal) { uma_zone_domain_t zdom; uma_cache_t cache; int i; for (i = 0; i < vm_ndomains; i++) { zdom = &z->uz_domain[i]; uth->uth_zone_free += zdom->uzd_nitems; } uth->uth_allocs = counter_u64_fetch(z->uz_allocs); uth->uth_frees = counter_u64_fetch(z->uz_frees); uth->uth_fails = counter_u64_fetch(z->uz_fails); uth->uth_sleeps = z->uz_sleeps; uth->uth_xdomain = z->uz_xdomain; /* * While it is not normally safe to access the cache * bucket pointers while not on the CPU that owns the * cache, we only allow the pointers to be exchanged * without the zone lock held, not invalidated, so * accept the possible race associated with bucket * exchange during monitoring. */ for (i = 0; i < mp_maxid + 1; i++) { bzero(&ups[i], sizeof(*ups)); if (internal || CPU_ABSENT(i)) continue; cache = &z->uz_cpu[i]; if (cache->uc_allocbucket != NULL) ups[i].ups_cache_free += cache->uc_allocbucket->ub_cnt; if (cache->uc_freebucket != NULL) ups[i].ups_cache_free += cache->uc_freebucket->ub_cnt; if (cache->uc_crossbucket != NULL) ups[i].ups_cache_free += cache->uc_crossbucket->ub_cnt; ups[i].ups_allocs = cache->uc_allocs; ups[i].ups_frees = cache->uc_frees; } } static int sysctl_vm_zone_stats(SYSCTL_HANDLER_ARGS) { struct uma_stream_header ush; struct uma_type_header uth; struct uma_percpu_stat *ups; struct sbuf sbuf; uma_keg_t kz; uma_zone_t z; int count, error, i; error = sysctl_wire_old_buffer(req, 0); if (error != 0) return (error); sbuf_new_for_sysctl(&sbuf, NULL, 128, req); sbuf_clear_flags(&sbuf, SBUF_INCLUDENUL); ups = malloc((mp_maxid + 1) * sizeof(*ups), M_TEMP, M_WAITOK); count = 0; rw_rlock(&uma_rwlock); LIST_FOREACH(kz, &uma_kegs, uk_link) { LIST_FOREACH(z, &kz->uk_zones, uz_link) count++; } LIST_FOREACH(z, &uma_cachezones, uz_link) count++; /* * Insert stream header. */ bzero(&ush, sizeof(ush)); ush.ush_version = UMA_STREAM_VERSION; ush.ush_maxcpus = (mp_maxid + 1); ush.ush_count = count; (void)sbuf_bcat(&sbuf, &ush, sizeof(ush)); LIST_FOREACH(kz, &uma_kegs, uk_link) { LIST_FOREACH(z, &kz->uk_zones, uz_link) { bzero(&uth, sizeof(uth)); ZONE_LOCK(z); strlcpy(uth.uth_name, z->uz_name, UTH_MAX_NAME); uth.uth_align = kz->uk_align; uth.uth_size = kz->uk_size; uth.uth_rsize = kz->uk_rsize; if (z->uz_max_items > 0) uth.uth_pages = (z->uz_items / kz->uk_ipers) * kz->uk_ppera; else uth.uth_pages = kz->uk_pages; uth.uth_maxpages = (z->uz_max_items / kz->uk_ipers) * kz->uk_ppera; uth.uth_limit = z->uz_max_items; uth.uth_keg_free = z->uz_keg->uk_free; /* * A zone is secondary is it is not the first entry * on the keg's zone list. */ if ((z->uz_flags & UMA_ZONE_SECONDARY) && (LIST_FIRST(&kz->uk_zones) != z)) uth.uth_zone_flags = UTH_ZONE_SECONDARY; uma_vm_zone_stats(&uth, z, &sbuf, ups, kz->uk_flags & UMA_ZFLAG_INTERNAL); ZONE_UNLOCK(z); (void)sbuf_bcat(&sbuf, &uth, sizeof(uth)); for (i = 0; i < mp_maxid + 1; i++) (void)sbuf_bcat(&sbuf, &ups[i], sizeof(ups[i])); } } LIST_FOREACH(z, &uma_cachezones, uz_link) { bzero(&uth, sizeof(uth)); ZONE_LOCK(z); strlcpy(uth.uth_name, z->uz_name, UTH_MAX_NAME); uth.uth_size = z->uz_size; uma_vm_zone_stats(&uth, z, &sbuf, ups, false); ZONE_UNLOCK(z); (void)sbuf_bcat(&sbuf, &uth, sizeof(uth)); for (i = 0; i < mp_maxid + 1; i++) (void)sbuf_bcat(&sbuf, &ups[i], sizeof(ups[i])); } rw_runlock(&uma_rwlock); error = sbuf_finish(&sbuf); sbuf_delete(&sbuf); free(ups, M_TEMP); return (error); } int sysctl_handle_uma_zone_max(SYSCTL_HANDLER_ARGS) { uma_zone_t zone = *(uma_zone_t *)arg1; int error, max; max = uma_zone_get_max(zone); error = sysctl_handle_int(oidp, &max, 0, req); if (error || !req->newptr) return (error); uma_zone_set_max(zone, max); return (0); } int sysctl_handle_uma_zone_cur(SYSCTL_HANDLER_ARGS) { uma_zone_t zone = *(uma_zone_t *)arg1; int cur; cur = uma_zone_get_cur(zone); return (sysctl_handle_int(oidp, &cur, 0, req)); } #ifdef INVARIANTS static uma_slab_t uma_dbg_getslab(uma_zone_t zone, void *item) { uma_slab_t slab; uma_keg_t keg; uint8_t *mem; mem = (uint8_t *)((uintptr_t)item & (~UMA_SLAB_MASK)); if (zone->uz_flags & UMA_ZONE_VTOSLAB) { slab = vtoslab((vm_offset_t)mem); } else { /* * It is safe to return the slab here even though the * zone is unlocked because the item's allocation state * essentially holds a reference. */ if (zone->uz_lockptr == &zone->uz_lock) return (NULL); ZONE_LOCK(zone); keg = zone->uz_keg; if (keg->uk_flags & UMA_ZONE_HASH) slab = hash_sfind(&keg->uk_hash, mem); else slab = (uma_slab_t)(mem + keg->uk_pgoff); ZONE_UNLOCK(zone); } return (slab); } static bool uma_dbg_zskip(uma_zone_t zone, void *mem) { if (zone->uz_lockptr == &zone->uz_lock) return (true); return (uma_dbg_kskip(zone->uz_keg, mem)); } static bool uma_dbg_kskip(uma_keg_t keg, void *mem) { uintptr_t idx; if (dbg_divisor == 0) return (true); if (dbg_divisor == 1) return (false); idx = (uintptr_t)mem >> PAGE_SHIFT; if (keg->uk_ipers > 1) { idx *= keg->uk_ipers; idx += ((uintptr_t)mem & PAGE_MASK) / keg->uk_rsize; } if ((idx / dbg_divisor) * dbg_divisor != idx) { counter_u64_add(uma_skip_cnt, 1); return (true); } counter_u64_add(uma_dbg_cnt, 1); return (false); } /* * Set up the slab's freei data such that uma_dbg_free can function. * */ static void uma_dbg_alloc(uma_zone_t zone, uma_slab_t slab, void *item) { uma_keg_t keg; int freei; if (slab == NULL) { slab = uma_dbg_getslab(zone, item); if (slab == NULL) panic("uma: item %p did not belong to zone %s\n", item, zone->uz_name); } keg = slab->us_keg; freei = ((uintptr_t)item - (uintptr_t)slab->us_data) / keg->uk_rsize; if (BIT_ISSET(SLAB_SETSIZE, freei, &slab->us_debugfree)) panic("Duplicate alloc of %p from zone %p(%s) slab %p(%d)\n", item, zone, zone->uz_name, slab, freei); BIT_SET_ATOMIC(SLAB_SETSIZE, freei, &slab->us_debugfree); return; } /* * Verifies freed addresses. Checks for alignment, valid slab membership * and duplicate frees. * */ static void uma_dbg_free(uma_zone_t zone, uma_slab_t slab, void *item) { uma_keg_t keg; int freei; if (slab == NULL) { slab = uma_dbg_getslab(zone, item); if (slab == NULL) panic("uma: Freed item %p did not belong to zone %s\n", item, zone->uz_name); } keg = slab->us_keg; freei = ((uintptr_t)item - (uintptr_t)slab->us_data) / keg->uk_rsize; if (freei >= keg->uk_ipers) panic("Invalid free of %p from zone %p(%s) slab %p(%d)\n", item, zone, zone->uz_name, slab, freei); if (((freei * keg->uk_rsize) + slab->us_data) != item) panic("Unaligned free of %p from zone %p(%s) slab %p(%d)\n", item, zone, zone->uz_name, slab, freei); if (!BIT_ISSET(SLAB_SETSIZE, freei, &slab->us_debugfree)) panic("Duplicate free of %p from zone %p(%s) slab %p(%d)\n", item, zone, zone->uz_name, slab, freei); BIT_CLR_ATOMIC(SLAB_SETSIZE, freei, &slab->us_debugfree); } #endif /* INVARIANTS */ #ifdef DDB +static int64_t +get_uma_stats(uma_keg_t kz, uma_zone_t z, uint64_t *allocs, uint64_t *used, + uint64_t *sleeps, uint64_t *xdomain, long *cachefree) +{ + uint64_t frees; + int i; + + if (kz->uk_flags & UMA_ZFLAG_INTERNAL) { + *allocs = counter_u64_fetch(z->uz_allocs); + frees = counter_u64_fetch(z->uz_frees); + *sleeps = z->uz_sleeps; + *cachefree = 0; + *xdomain = 0; + } else + uma_zone_sumstat(z, cachefree, allocs, &frees, sleeps, + xdomain); + if (!((z->uz_flags & UMA_ZONE_SECONDARY) && + (LIST_FIRST(&kz->uk_zones) != z))) + *cachefree += kz->uk_free; + for (i = 0; i < vm_ndomains; i++) + *cachefree += z->uz_domain[i].uzd_nitems; + *used = *allocs - frees; + return (((int64_t)*used + *cachefree) * kz->uk_size); +} + DB_SHOW_COMMAND(uma, db_show_uma) { + const char *fmt_hdr, *fmt_entry; uma_keg_t kz; uma_zone_t z; - uint64_t allocs, frees, sleeps, xdomain; + uint64_t allocs, used, sleeps, xdomain; long cachefree; - int i; + /* variables for sorting */ + uma_keg_t cur_keg; + uma_zone_t cur_zone, last_zone; + int64_t cur_size, last_size, size; + int ties; - db_printf("%18s %8s %8s %8s %12s %8s %8s %8s\n", "Zone", "Size", "Used", - "Free", "Requests", "Sleeps", "Bucket", "XFree"); - LIST_FOREACH(kz, &uma_kegs, uk_link) { - LIST_FOREACH(z, &kz->uk_zones, uz_link) { - if (kz->uk_flags & UMA_ZFLAG_INTERNAL) { - allocs = counter_u64_fetch(z->uz_allocs); - frees = counter_u64_fetch(z->uz_frees); - sleeps = z->uz_sleeps; - cachefree = 0; - } else - uma_zone_sumstat(z, &cachefree, &allocs, - &frees, &sleeps, &xdomain); - if (!((z->uz_flags & UMA_ZONE_SECONDARY) && - (LIST_FIRST(&kz->uk_zones) != z))) - cachefree += kz->uk_free; - for (i = 0; i < vm_ndomains; i++) - cachefree += z->uz_domain[i].uzd_nitems; + /* /i option produces machine-parseable CSV output */ + if (modif[0] == 'i') { + fmt_hdr = "%s,%s,%s,%s,%s,%s,%s,%s,%s\n"; + fmt_entry = "\"%s\",%ju,%jd,%ld,%ju,%ju,%u,%jd,%ju\n"; + } else { + fmt_hdr = "%18s %6s %7s %7s %11s %7s %7s %10s %8s\n"; + fmt_entry = "%18s %6ju %7jd %7ld %11ju %7ju %7u %10jd %8ju\n"; + } - db_printf("%18s %8ju %8jd %8ld %12ju %8ju %8u %8ju\n", - z->uz_name, (uintmax_t)kz->uk_size, - (intmax_t)(allocs - frees), cachefree, - (uintmax_t)allocs, sleeps, z->uz_count, xdomain); - if (db_pager_quit) - return; + db_printf(fmt_hdr, "Zone", "Size", "Used", "Free", "Requests", + "Sleeps", "Bucket", "Total Mem", "XFree"); + + /* Sort the zones with largest size first. */ + last_zone = NULL; + last_size = INT64_MAX; + for (;;) { + cur_zone = NULL; + cur_size = -1; + ties = 0; + LIST_FOREACH(kz, &uma_kegs, uk_link) { + LIST_FOREACH(z, &kz->uk_zones, uz_link) { + /* + * In the case of size ties, print out zones + * in the order they are encountered. That is, + * when we encounter the most recently output + * zone, we have already printed all preceding + * ties, and we must print all following ties. + */ + if (z == last_zone) { + ties = 1; + continue; + } + size = get_uma_stats(kz, z, &allocs, &used, + &sleeps, &cachefree, &xdomain); + if (size > cur_size && size < last_size + ties) + { + cur_size = size; + cur_zone = z; + cur_keg = kz; + } + } } + if (cur_zone == NULL) + break; + + size = get_uma_stats(cur_keg, cur_zone, &allocs, &used, + &sleeps, &cachefree, &xdomain); + db_printf(fmt_entry, cur_zone->uz_name, + (uintmax_t)cur_keg->uk_size, (intmax_t)used, cachefree, + (uintmax_t)allocs, (uintmax_t)sleeps, + (unsigned)cur_zone->uz_count, (intmax_t)size, xdomain); + + if (db_pager_quit) + return; + last_zone = cur_zone; + last_size = cur_size; } } DB_SHOW_COMMAND(umacache, db_show_umacache) { uma_zone_t z; uint64_t allocs, frees; long cachefree; int i; db_printf("%18s %8s %8s %8s %12s %8s\n", "Zone", "Size", "Used", "Free", "Requests", "Bucket"); LIST_FOREACH(z, &uma_cachezones, uz_link) { uma_zone_sumstat(z, &cachefree, &allocs, &frees, NULL, NULL); for (i = 0; i < vm_ndomains; i++) cachefree += z->uz_domain[i].uzd_nitems; db_printf("%18s %8ju %8jd %8ld %12ju %8u\n", z->uz_name, (uintmax_t)z->uz_size, (intmax_t)(allocs - frees), cachefree, (uintmax_t)allocs, z->uz_count); if (db_pager_quit) return; } } #endif /* DDB */