Index: projects/release-arm-redux/share/man/man4/acpi.4 =================================================================== --- projects/release-arm-redux/share/man/man4/acpi.4 (revision 282691) +++ projects/release-arm-redux/share/man/man4/acpi.4 (revision 282692) @@ -1,630 +1,641 @@ .\" .\" Copyright (c) 2001 Michael Smith .\" All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" $FreeBSD$ .\" -.Dd June 23, 2014 +.Dd May 9, 2015 .Dt ACPI 4 .Os .Sh NAME .Nm acpi .Nd Advanced Configuration and Power Management support .Sh SYNOPSIS .Cd "device acpi" .Pp .Cd "options ACPI_DEBUG" .Cd "options DDB" .Sh DESCRIPTION The .Nm driver provides support for the Intel/Microsoft/Compaq/Toshiba ACPI standard. This support includes platform hardware discovery (superseding the PnP and PCI BIOS), as well as power management (superseding APM) and other features. ACPI core support is provided by the ACPI CA reference implementation from Intel. .Pp Note that the .Nm driver is automatically loaded by the .Xr loader 8 , and should only be compiled into the kernel on platforms where ACPI is mandatory. .Sh SYSCTL VARIABLES The .Nm driver is intended to provide power management without user intervention. If the default settings are not optimal, the following sysctls can be used to modify or monitor .Nm behavior. Note that some variables will be available only if the given hardware supports them (such as .Va hw.acpi.acline ) . .Bl -tag -width indent .It Va debug.acpi.enable_debug_objects Enable dumping Debug objects without .Cd "options ACPI_DEBUG" . Default is 0, ignore Debug objects. -.It Va hw.acpi.acline -AC line state (1 means online, 0 means on battery power). -.It Va hw.acpi.cpu.cx_usage +.It Va dev.cpu.N.cx_usage Debugging information listing the percent of total usage for each sleep state. The values are reset when -.Va hw.acpi.cpu.cx_lowest +.Va dev.cpu.N.cx_lowest is modified. -.It Va hw.acpi.cpu.cx_lowest +.It Va dev.cpu.N.cx_lowest Lowest Cx state to use for idling the CPU. A scheduling algorithm will select states between .Li C1 and this setting as system load dictates. To enable ACPI CPU idling control, .Va machdep.idle should be set to .Li acpi if it is listed in .Va machdep.idle_available . .It Va hw.acpi.cpu.cx_supported List of supported CPU idle states and their transition latency in microseconds. Each state has a type (e.g., .Li C2 ) . .Li C1 is equivalent to the ia32 .Li HLT instruction, .Li C2 provides a deeper sleep with the same semantics, and .Li C3 provides the deepest sleep but additionally requires bus mastering to be disabled. States greater than .Li C3 provide even more power savings with the same semantics as the .Li C3 state. Deeper sleeps provide more power savings but increased transition latency when an interrupt occurs. +.It Va dev.cpu.N.cx_method +List of supported CPU idle states and their transition methods, as +directed by the firmware. +.It Va hw.acpi.acline +AC line state (1 means online, 0 means on battery power). .It Va hw.acpi.disable_on_reboot Disable ACPI during the reboot process. Most systems reboot fine with ACPI still enabled, but some require exiting to legacy mode first. Default is 0, leave ACPI enabled. .It Va hw.acpi.handle_reboot Use the ACPI Reset Register capability to reboot the system. Some newer systems require use of this register, while some only work with legacy rebooting support. .It Va hw.acpi.lid_switch_state Suspend state .Pq Li S1 Ns \[en] Ns Li S5 to enter when the lid switch (i.e., a notebook screen) is closed. Default is .Dq Li NONE (do nothing). .It Va hw.acpi.power_button_state Suspend state .Pq Li S1 Ns \[en] Ns Li S5 to enter when the power button is pressed. Default is .Li S5 (power-off nicely). .It Va hw.acpi.reset_video Reset the video adapter from real mode during the resume path. Some systems need this help, others have display problems if it is enabled. Default is 0 (disabled). .It Va hw.acpi.s4bios Indicate whether the system supports .Li S4BIOS . This means that the BIOS can handle all the functions of suspending the system to disk. Otherwise, the OS is responsible for suspending to disk .Pq Li S4OS . Most current systems do not support .Li S4BIOS . .It Va hw.acpi.sleep_button_state Suspend state .Pq Li S1 Ns \[en] Ns Li S5 to enter when the sleep button is pressed. This is usually a special function button on the keyboard. Default is .Li S3 (suspend-to-RAM). .It Va hw.acpi.sleep_delay Wait this number of seconds between preparing the system to suspend and actually entering the suspend state. Default is 1 second. .It Va hw.acpi.supported_sleep_state Suspend states .Pq Li S1 Ns \[en] Ns Li S5 supported by the BIOS. .Bl -tag -width indent .It Li S1 Quick suspend to RAM. The CPU enters a lower power state, but most peripherals are left running. .It Li S2 Lower power state than .Li S1 , but with the same basic characteristics. Not supported by many systems. .It Li S3 Suspend to RAM. Most devices are powered off, and the system stops running except for memory refresh. .It Li S4 Suspend to disk. All devices are powered off, and the system stops running. When resuming, the system starts as if from a cold power on. Not yet supported by .Fx unless .Li S4BIOS is available. .It Li S5 System shuts down cleanly and powers off. .El .It Va hw.acpi.verbose Enable verbose printing from the various ACPI subsystems. .El .Sh LOADER TUNABLES Tunables can be set at the .Xr loader 8 prompt before booting the kernel or stored in .Pa /boot/loader.conf . Many of these tunables also have a matching .Xr sysctl 8 entry for access after boot. .Bl -tag -width indent .It Va acpi_dsdt_load Enables loading of a custom ACPI DSDT. .It Va acpi_dsdt_name Name of the DSDT table to load, if loading is enabled. .It Va debug.acpi.cpu_unordered Do not use the MADT to match ACPI Processor objects to CPUs. This is needed on a few systems with a buggy BIOS that does not use consistent processor IDs. Default is 0 (disabled). .It Va debug.acpi.disabled Selectively disables portions of ACPI for debugging purposes. .It Va debug.acpi.interpreter_slack Enable less strict ACPI implementations. Default is 1, ignore common BIOS mistakes. .It Va debug.acpi.max_threads Specify the number of task threads that are started on boot. Limiting this to 1 may help work around various BIOSes that cannot handle parallel requests. The default value is 3. .It Va debug.acpi.quirks Override any automatic quirks completely. .It Va debug.acpi.resume_beep Beep the PC speaker on resume. This can help diagnose suspend/resume problems. Default is 0 (disabled). .It Va hint.acpi.0.disabled Set this to 1 to disable all of ACPI. If ACPI has been disabled on your system due to a blacklist entry for your BIOS, you can set this to 0 to re-enable ACPI for testing. .It Va hw.acpi.ec.poll_timeout Delay in milliseconds to wait for the EC to respond. Try increasing this number if you get the error .Qq Li AE_NO_HARDWARE_RESPONSE . .It Va hw.acpi.host_mem_start Override the assumed memory starting address for PCI host bridges. .It Va hw.acpi.install_interface , hw.acpi.remove_interface Install or remove OS interface(s) to control return value of .Ql _OSI query method. When an OS interface is specified in .Va hw.acpi.install_interface , .Li _OSI query for the interface returns it is .Em supported . Conversely, when an OS interface is specified in .Va hw.acpi.remove_interface , .Li _OSI query returns it is .Em not supported . Multiple interfaces can be specified in a comma-separated list and any leading white spaces will be ignored. For example, .Qq Li FreeBSD, Linux is a valid list of two interfaces .Qq Li FreeBSD and .Qq Li Linux . .It Va hw.acpi.reset_video Enables calling the VESA reset BIOS vector on the resume path. This can fix some graphics cards that have problems such as LCD white-out after resume. Default is 0 (disabled). .It Va hw.acpi.serialize_methods Allow override of whether methods execute in parallel or not. Enable this for serial behavior, which fixes .Qq Li AE_ALREADY_EXISTS errors for AML that really cannot handle parallel method execution. It is off by default since this breaks recursive methods and some IBMs use such code. .It Va hw.acpi.verbose Turn on verbose debugging information about what ACPI is doing. .It Va hw.pci.link.%s.%d.irq Override the interrupt to use for this link and index. This capability should be used carefully, and only if a device is not working with .Nm enabled. .Qq %s is the name of the link (e.g., LNKA). .Qq %d is the resource index when the link supports multiple IRQs. Most PCI links only have one IRQ resource, so the below form should be used. .It Va hw.pci.link.%s.irq Override the interrupt to use. This capability should be used carefully, and only if a device is not working with .Nm enabled. .Qq %s is the name of the link (e.g., LNKA). .El .Sh DISABLING ACPI Since ACPI support on different platforms varies greatly, there are many debugging and tuning options available. .Pp For machines known not to work with .Nm enabled, there is a BIOS blacklist. Currently, the blacklist only controls whether .Nm should be disabled or not. In the future, it will have more granularity to control features (the infrastructure for that is already there). .Pp To enable .Nm (for debugging purposes, etc.) on machines that are on the blacklist, set the kernel environment variable .Va hint.acpi.0.disabled to 0. Before trying this, consider updating your BIOS to a more recent version that may be compatible with ACPI. .Pp To disable the .Nm driver completely, set the kernel environment variable .Va hint.acpi.0.disabled to 1. .Pp Some i386 machines totally fail to operate with some or all of ACPI disabled. Other i386 machines fail with ACPI enabled. Disabling all or part of ACPI on non-i386 platforms (i.e., platforms where ACPI support is mandatory) may result in a non-functional system. .Pp The .Nm driver comprises a set of drivers, which may be selectively disabled in case of problems. To disable a sub-driver, list it in the kernel environment variable .Va debug.acpi.disabled . Multiple entries can be listed, separated by a space. .Pp ACPI sub-devices and features that can be disabled: .Bl -tag -width ".Li sysresource" .It Li all Disable all ACPI features and devices. .It Li acad .Pq Vt device Supports AC adapter. .It Li bus .Pq Vt feature Probes and attaches subdevices. Disabling will avoid scanning the ACPI namespace entirely. .It Li children .Pq Vt feature Attaches standard ACPI sub-drivers and devices enumerated in the ACPI namespace. Disabling this has a similar effect to disabling .Dq Li bus , except that the ACPI namespace will still be scanned. .It Li button .Pq Vt device Supports ACPI button devices (typically power and sleep buttons). .It Li cmbat .Pq Vt device Control-method batteries device. .It Li cpu .Pq Vt device Supports CPU power-saving and speed-setting functions. .It Li ec .Pq Vt device Supports the ACPI Embedded Controller interface, used to communicate with embedded platform controllers. .It Li isa .Pq Vt device Supports an ISA bus bridge defined in the ACPI namespace, typically as a child of a PCI bus. .It Li lid .Pq Vt device Supports an ACPI laptop lid switch, which typically puts a system to sleep. +.It Li mwait +.Pq Vt feature +Do not ask firmware for available x86-vendor specific methods to enter +.Li Cx +sleep states. +Only query and use the generic I/O-based entrance method. +The knob is provided to work around inconsistencies in the tables +filled by firmware. .It Li quirks .Pq Vt feature Do not honor quirks. Quirks automatically disable ACPI functionality based on the XSDT table's OEM vendor name and revision date. .It Li pci .Pq Vt device Supports Host to PCI bridges. .It Li pci_link .Pq Vt feature Performs PCI interrupt routing. .It Li sysresource .Pq Vt device Pseudo-devices containing resources which ACPI claims. .It Li thermal .Pq Vt device Supports system cooling and heat management. .It Li timer .Pq Vt device Implements a timecounter using the ACPI fixed-frequency timer. .It Li video .Pq Vt device Supports .Xr acpi_video 4 which may conflict with .Xr agp 4 device. .El .Pp It is also possible to avoid portions of the ACPI namespace which may be causing problems, by listing the full path of the root of the region to be avoided in the kernel environment variable .Va debug.acpi.avoid . The object and all of its children will be ignored during the bus/children scan of the namespace. The ACPI CA code will still know about the avoided region. .Sh DEBUGGING OUTPUT To enable debugging output, .Nm must be compiled with .Cd "options ACPI_DEBUG" . Debugging output is separated between layers and levels, where a layer is a component of the ACPI subsystem, and a level is a particular kind of debugging output. .Pp Both layers and levels are specified as a whitespace-separated list of tokens, with layers listed in .Va debug.acpi.layer and levels in .Va debug.acpi.level . .Pp The first set of layers is for ACPI-CA components, and the second is for .Fx drivers. The ACPI-CA layer descriptions include the prefix for the files they refer to. The supported layers are: .Pp .Bl -tag -compact -width ".Li ACPI_CA_DISASSEMBLER" .It Li ACPI_UTILITIES Utility ("ut") functions .It Li ACPI_HARDWARE Hardware access ("hw") .It Li ACPI_EVENTS Event and GPE ("ev") .It Li ACPI_TABLES Table access ("tb") .It Li ACPI_NAMESPACE Namespace evaluation ("ns") .It Li ACPI_PARSER AML parser ("ps") .It Li ACPI_DISPATCHER Internal representation of interpreter state ("ds") .It Li ACPI_EXECUTER Execute AML methods ("ex") .It Li ACPI_RESOURCES Resource parsing ("rs") .It Li ACPI_CA_DEBUGGER Debugger implementation ("db", "dm") .It Li ACPI_OS_SERVICES Usermode support routines ("os") .It Li ACPI_CA_DISASSEMBLER Disassembler implementation (unused) .It Li ACPI_ALL_COMPONENTS All the above ACPI-CA components .It Li ACPI_AC_ADAPTER AC adapter driver .It Li ACPI_BATTERY Control-method battery driver .It Li ACPI_BUS ACPI, ISA, and PCI bus drivers .It Li ACPI_BUTTON Power and sleep button driver .It Li ACPI_EC Embedded controller driver .It Li ACPI_FAN Fan driver .It Li ACPI_OEM Platform-specific driver for hotkeys, LED, etc. .It Li ACPI_POWER Power resource driver .It Li ACPI_PROCESSOR CPU driver .It Li ACPI_THERMAL Thermal zone driver .It Li ACPI_TIMER Timer driver .It Li ACPI_ALL_DRIVERS All the above .Fx ACPI drivers .El .Pp The supported levels are: .Pp .Bl -tag -compact -width ".Li ACPI_LV_AML_DISASSEMBLE" .It Li ACPI_LV_INIT Initialization progress .It Li ACPI_LV_DEBUG_OBJECT Stores to objects .It Li ACPI_LV_INFO General information and progress .It Li ACPI_LV_REPAIR Repair a common problem with predefined methods .It Li ACPI_LV_ALL_EXCEPTIONS All the previous levels .It Li ACPI_LV_PARSE .It Li ACPI_LV_DISPATCH .It Li ACPI_LV_EXEC .It Li ACPI_LV_NAMES .It Li ACPI_LV_OPREGION .It Li ACPI_LV_BFIELD .It Li ACPI_LV_TABLES .It Li ACPI_LV_VALUES .It Li ACPI_LV_OBJECTS .It Li ACPI_LV_RESOURCES .It Li ACPI_LV_USER_REQUESTS .It Li ACPI_LV_PACKAGE .It Li ACPI_LV_VERBOSITY1 All the previous levels .It Li ACPI_LV_ALLOCATIONS .It Li ACPI_LV_FUNCTIONS .It Li ACPI_LV_OPTIMIZATIONS .It Li ACPI_LV_VERBOSITY2 All the previous levels .It Li ACPI_LV_ALL Synonym for .Qq Li ACPI_LV_VERBOSITY2 .It Li ACPI_LV_MUTEX .It Li ACPI_LV_THREADS .It Li ACPI_LV_IO .It Li ACPI_LV_INTERRUPTS .It Li ACPI_LV_VERBOSITY3 All the previous levels .It Li ACPI_LV_AML_DISASSEMBLE .It Li ACPI_LV_VERBOSE_INFO .It Li ACPI_LV_FULL_TABLES .It Li ACPI_LV_EVENTS .It Li ACPI_LV_VERBOSE All levels after .Qq Li ACPI_LV_VERBOSITY3 .It Li ACPI_LV_INIT_NAMES .It Li ACPI_LV_LOAD .El .Pp Selection of the appropriate layer and level values is important to avoid massive amounts of debugging output. For example, the following configuration is a good way to gather initial information. It enables debug output for both ACPI-CA and the .Nm driver, printing basic information about errors, warnings, and progress. .Bd -literal -offset indent debug.acpi.layer="ACPI_ALL_COMPONENTS ACPI_ALL_DRIVERS" debug.acpi.level="ACPI_LV_ALL_EXCEPTIONS" .Ed .Pp Debugging output by the ACPI CA subsystem is prefixed with the module name in lowercase, followed by a source line number. Output from the .Fx Ns -local code follows the same format, but the module name is uppercased. .Sh OVERRIDING YOUR BIOS BYTECODE ACPI interprets bytecode named AML (ACPI Machine Language) provided by the BIOS vendor as a memory image at boot time. Sometimes, the AML code contains a bug that does not appear when parsed by the Microsoft implementation. .Fx provides a way to override it with your own AML code to work around or debug such problems. Note that all AML in your DSDT and any SSDT tables is overridden. .Pp In order to load your AML code, you must edit .Pa /boot/loader.conf and include the following lines. .Bd -literal -offset indent acpi_dsdt_load="YES" acpi_dsdt_name="/boot/acpi_dsdt.aml" # You may change this name. .Ed .Pp In order to prepare your AML code, you will need the .Xr acpidump 8 and .Xr iasl 8 utilities and some ACPI knowledge. .Sh COMPATIBILITY ACPI is only found and supported on i386/ia32 and amd64. .Sh SEE ALSO .Xr kenv 1 , .Xr acpi_thermal 4 , .Xr device.hints 5 , .Xr loader.conf 5 , .Xr acpiconf 8 , .Xr acpidump 8 , .Xr config 8 , .Xr iasl 8 .Rs .%A "Compaq Computer Corporation" .%A "Intel Corporation" .%A "Microsoft Corporation" .%A "Phoenix Technologies Ltd." .%A "Toshiba Corporation" .%D August 25, 2003 .%T "Advanced Configuration and Power Interface Specification" .%U http://acpi.info/spec.htm .Re .Sh AUTHORS .An -nosplit The ACPI CA subsystem is developed and maintained by Intel Architecture Labs. .Pp The following people made notable contributions to the ACPI subsystem in .Fx : .An Michael Smith , .An Takanori Watanabe Aq Mt takawata@jp.FreeBSD.org , .An Mitsuru IWASAKI Aq Mt iwasaki@jp.FreeBSD.org , .An Munehiro Matsuda , .An Nate Lawson , the ACPI-jp mailing list at .Aq Mt acpi-jp@jp.FreeBSD.org , and many other contributors. .Pp This manual page was written by .An Michael Smith Aq Mt msmith@FreeBSD.org . .Sh BUGS Many BIOS versions have serious bugs that may cause system instability, break suspend/resume, or prevent devices from operating properly due to IRQ routing problems. Upgrade your BIOS to the latest version available from the vendor before deciding it is a problem with .Nm . Index: projects/release-arm-redux/share/man/man4 =================================================================== --- projects/release-arm-redux/share/man/man4 (revision 282691) +++ projects/release-arm-redux/share/man/man4 (revision 282692) Property changes on: projects/release-arm-redux/share/man/man4 ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head/share/man/man4:r282673-282691 Index: projects/release-arm-redux/share =================================================================== --- projects/release-arm-redux/share (revision 282691) +++ projects/release-arm-redux/share (revision 282692) Property changes on: projects/release-arm-redux/share ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head/share:r282673-282691 Index: projects/release-arm-redux/sys/amd64/acpica/acpi_machdep.c =================================================================== --- projects/release-arm-redux/sys/amd64/acpica/acpi_machdep.c (revision 282691) +++ projects/release-arm-redux/sys/amd64/acpica/acpi_machdep.c (revision 282692) @@ -1,384 +1,377 @@ /*- * Copyright (c) 2001 Mitsuru IWASAKI * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include int acpi_resume_beep; SYSCTL_INT(_debug_acpi, OID_AUTO, resume_beep, CTLFLAG_RWTUN, &acpi_resume_beep, 0, "Beep the PC speaker when resuming"); int acpi_reset_video; TUNABLE_INT("hw.acpi.reset_video", &acpi_reset_video); static int intr_model = ACPI_INTR_PIC; int acpi_machdep_init(device_t dev) { struct acpi_softc *sc; sc = device_get_softc(dev); acpi_apm_init(sc); if (intr_model != ACPI_INTR_PIC) acpi_SetIntrModel(intr_model); SYSCTL_ADD_INT(&sc->acpi_sysctl_ctx, SYSCTL_CHILDREN(sc->acpi_sysctl_tree), OID_AUTO, "reset_video", CTLFLAG_RW, &acpi_reset_video, 0, "Call the VESA reset BIOS vector on the resume path"); return (0); } void acpi_SetDefaultIntrModel(int model) { intr_model = model; } int acpi_machdep_quirks(int *quirks) { return (0); } -void -acpi_cpu_c1() -{ - - __asm __volatile("sti; hlt"); -} - /* * Support for mapping ACPI tables during early boot. Currently this * uses the crashdump map to map each table. However, the crashdump * map is created in pmap_bootstrap() right after the direct map, so * we should be able to just use pmap_mapbios() here instead. * * This makes the following assumptions about how we use this KVA: * pages 0 and 1 are used to map in the header of each table found via * the RSDT or XSDT and pages 2 to n are used to map in the RSDT or * XSDT. This has to use 2 pages for the table headers in case a * header spans a page boundary. * * XXX: We don't ensure the table fits in the available address space * in the crashdump map. */ /* * Map some memory using the crashdump map. 'offset' is an offset in * pages into the crashdump map to use for the start of the mapping. */ static void * table_map(vm_paddr_t pa, int offset, vm_offset_t length) { vm_offset_t va, off; void *data; off = pa & PAGE_MASK; length = round_page(length + off); pa = pa & PG_FRAME; va = (vm_offset_t)pmap_kenter_temporary(pa, offset) + (offset * PAGE_SIZE); data = (void *)(va + off); length -= PAGE_SIZE; while (length > 0) { va += PAGE_SIZE; pa += PAGE_SIZE; length -= PAGE_SIZE; pmap_kenter(va, pa); invlpg(va); } return (data); } /* Unmap memory previously mapped with table_map(). */ static void table_unmap(void *data, vm_offset_t length) { vm_offset_t va, off; va = (vm_offset_t)data; off = va & PAGE_MASK; length = round_page(length + off); va &= ~PAGE_MASK; while (length > 0) { pmap_kremove(va); invlpg(va); va += PAGE_SIZE; length -= PAGE_SIZE; } } /* * Map a table at a given offset into the crashdump map. It first * maps the header to determine the table length and then maps the * entire table. */ static void * map_table(vm_paddr_t pa, int offset, const char *sig) { ACPI_TABLE_HEADER *header; vm_offset_t length; void *table; header = table_map(pa, offset, sizeof(ACPI_TABLE_HEADER)); if (strncmp(header->Signature, sig, ACPI_NAME_SIZE) != 0) { table_unmap(header, sizeof(ACPI_TABLE_HEADER)); return (NULL); } length = header->Length; table_unmap(header, sizeof(ACPI_TABLE_HEADER)); table = table_map(pa, offset, length); if (ACPI_FAILURE(AcpiTbChecksum(table, length))) { if (bootverbose) printf("ACPI: Failed checksum for table %s\n", sig); #if (ACPI_CHECKSUM_ABORT) table_unmap(table, length); return (NULL); #endif } return (table); } /* * See if a given ACPI table is the requested table. Returns the * length of the able if it matches or zero on failure. */ static int probe_table(vm_paddr_t address, const char *sig) { ACPI_TABLE_HEADER *table; table = table_map(address, 0, sizeof(ACPI_TABLE_HEADER)); if (table == NULL) { if (bootverbose) printf("ACPI: Failed to map table at 0x%jx\n", (uintmax_t)address); return (0); } if (bootverbose) printf("Table '%.4s' at 0x%jx\n", table->Signature, (uintmax_t)address); if (strncmp(table->Signature, sig, ACPI_NAME_SIZE) != 0) { table_unmap(table, sizeof(ACPI_TABLE_HEADER)); return (0); } table_unmap(table, sizeof(ACPI_TABLE_HEADER)); return (1); } /* * Try to map a table at a given physical address previously returned * by acpi_find_table(). */ void * acpi_map_table(vm_paddr_t pa, const char *sig) { return (map_table(pa, 0, sig)); } /* Unmap a table previously mapped via acpi_map_table(). */ void acpi_unmap_table(void *table) { ACPI_TABLE_HEADER *header; header = (ACPI_TABLE_HEADER *)table; table_unmap(table, header->Length); } /* * Return the physical address of the requested table or zero if one * is not found. */ vm_paddr_t acpi_find_table(const char *sig) { ACPI_PHYSICAL_ADDRESS rsdp_ptr; ACPI_TABLE_RSDP *rsdp; ACPI_TABLE_RSDT *rsdt; ACPI_TABLE_XSDT *xsdt; ACPI_TABLE_HEADER *table; vm_paddr_t addr; int i, count; if (resource_disabled("acpi", 0)) return (0); /* * Map in the RSDP. Since ACPI uses AcpiOsMapMemory() which in turn * calls pmap_mapbios() to find the RSDP, we assume that we can use * pmap_mapbios() to map the RSDP. */ if ((rsdp_ptr = AcpiOsGetRootPointer()) == 0) return (0); rsdp = pmap_mapbios(rsdp_ptr, sizeof(ACPI_TABLE_RSDP)); if (rsdp == NULL) { if (bootverbose) printf("ACPI: Failed to map RSDP\n"); return (0); } /* * For ACPI >= 2.0, use the XSDT if it is available. * Otherwise, use the RSDT. We map the XSDT or RSDT at page 2 * in the crashdump area. Pages 0 and 1 are used to map in the * headers of candidate ACPI tables. */ addr = 0; if (rsdp->Revision >= 2 && rsdp->XsdtPhysicalAddress != 0) { /* * AcpiOsGetRootPointer only verifies the checksum for * the version 1.0 portion of the RSDP. Version 2.0 has * an additional checksum that we verify first. */ if (AcpiTbChecksum((UINT8 *)rsdp, ACPI_RSDP_XCHECKSUM_LENGTH)) { if (bootverbose) printf("ACPI: RSDP failed extended checksum\n"); return (0); } xsdt = map_table(rsdp->XsdtPhysicalAddress, 2, ACPI_SIG_XSDT); if (xsdt == NULL) { if (bootverbose) printf("ACPI: Failed to map XSDT\n"); return (0); } count = (xsdt->Header.Length - sizeof(ACPI_TABLE_HEADER)) / sizeof(UINT64); for (i = 0; i < count; i++) if (probe_table(xsdt->TableOffsetEntry[i], sig)) { addr = xsdt->TableOffsetEntry[i]; break; } acpi_unmap_table(xsdt); } else { rsdt = map_table(rsdp->RsdtPhysicalAddress, 2, ACPI_SIG_RSDT); if (rsdt == NULL) { if (bootverbose) printf("ACPI: Failed to map RSDT\n"); return (0); } count = (rsdt->Header.Length - sizeof(ACPI_TABLE_HEADER)) / sizeof(UINT32); for (i = 0; i < count; i++) if (probe_table(rsdt->TableOffsetEntry[i], sig)) { addr = rsdt->TableOffsetEntry[i]; break; } acpi_unmap_table(rsdt); } pmap_unmapbios((vm_offset_t)rsdp, sizeof(ACPI_TABLE_RSDP)); if (addr == 0) { if (bootverbose) printf("ACPI: No %s table found\n", sig); return (0); } if (bootverbose) printf("%s: Found table at 0x%jx\n", sig, (uintmax_t)addr); /* * Verify that we can map the full table and that its checksum is * correct, etc. */ table = map_table(addr, 0, sig); if (table == NULL) return (0); acpi_unmap_table(table); return (addr); } /* * ACPI nexus(4) driver. */ static int nexus_acpi_probe(device_t dev) { int error; error = acpi_identify(); if (error) return (error); return (BUS_PROBE_DEFAULT); } static int nexus_acpi_attach(device_t dev) { device_t acpi_dev; int error; nexus_init_resources(); bus_generic_probe(dev); acpi_dev = BUS_ADD_CHILD(dev, 10, "acpi", 0); if (acpi_dev == NULL) panic("failed to add acpi0 device"); error = bus_generic_attach(dev); if (error == 0) acpi_install_wakeup_handler(device_get_softc(acpi_dev)); return (error); } static device_method_t nexus_acpi_methods[] = { /* Device interface */ DEVMETHOD(device_probe, nexus_acpi_probe), DEVMETHOD(device_attach, nexus_acpi_attach), { 0, 0 } }; DEFINE_CLASS_1(nexus, nexus_acpi_driver, nexus_acpi_methods, 1, nexus_driver); static devclass_t nexus_devclass; DRIVER_MODULE(nexus_acpi, root, nexus_acpi_driver, nexus_devclass, 0, 0); Index: projects/release-arm-redux/sys/amd64/amd64/apic_vector.S =================================================================== --- projects/release-arm-redux/sys/amd64/amd64/apic_vector.S (revision 282691) +++ projects/release-arm-redux/sys/amd64/amd64/apic_vector.S (revision 282692) @@ -1,351 +1,347 @@ /*- * Copyright (c) 1989, 1990 William F. Jolitz. * Copyright (c) 1990 The Regents of the University of California. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * from: vector.s, 386BSD 0.1 unknown origin * $FreeBSD$ */ /* * Interrupt entry points for external interrupts triggered by I/O APICs * as well as IPI handlers. */ #include "opt_smp.h" #include #include #include #include "assym.s" #ifdef SMP #define LK lock ; #else #define LK #endif .text SUPERALIGN_TEXT /* End Of Interrupt to APIC */ as_lapic_eoi: cmpl $0,x2apic_mode jne 1f movq lapic_map,%rax movl $0,LA_EOI(%rax) ret 1: movl $MSR_APIC_EOI,%ecx xorl %eax,%eax xorl %edx,%edx wrmsr ret /* * I/O Interrupt Entry Point. Rather than having one entry point for * each interrupt source, we use one entry point for each 32-bit word * in the ISR. The handler determines the highest bit set in the ISR, * translates that into a vector, and passes the vector to the * lapic_handle_intr() function. */ #define ISR_VEC(index, vec_name) \ .text ; \ SUPERALIGN_TEXT ; \ IDTVEC(vec_name) ; \ PUSH_FRAME ; \ FAKE_MCOUNT(TF_RIP(%rsp)) ; \ cmpl $0,x2apic_mode ; \ je 1f ; \ movl $(MSR_APIC_ISR0 + index),%ecx ; \ rdmsr ; \ jmp 2f ; \ 1: ; \ movq lapic_map, %rdx ; /* pointer to local APIC */ \ movl LA_ISR + 16 * (index)(%rdx), %eax ; /* load ISR */ \ 2: ; \ bsrl %eax, %eax ; /* index of highest set bit in ISR */ \ jz 3f ; \ addl $(32 * index),%eax ; \ movq %rsp, %rsi ; \ movl %eax, %edi ; /* pass the IRQ */ \ call lapic_handle_intr ; \ 3: ; \ MEXITCOUNT ; \ jmp doreti /* * Handle "spurious INTerrupts". * Notes: * This is different than the "spurious INTerrupt" generated by an * 8259 PIC for missing INTs. See the APIC documentation for details. * This routine should NOT do an 'EOI' cycle. */ .text SUPERALIGN_TEXT IDTVEC(spuriousint) /* No EOI cycle used here */ jmp doreti_iret ISR_VEC(1, apic_isr1) ISR_VEC(2, apic_isr2) ISR_VEC(3, apic_isr3) ISR_VEC(4, apic_isr4) ISR_VEC(5, apic_isr5) ISR_VEC(6, apic_isr6) ISR_VEC(7, apic_isr7) /* * Local APIC periodic timer handler. */ .text SUPERALIGN_TEXT IDTVEC(timerint) PUSH_FRAME FAKE_MCOUNT(TF_RIP(%rsp)) movq %rsp, %rdi call lapic_handle_timer MEXITCOUNT jmp doreti /* * Local APIC CMCI handler. */ .text SUPERALIGN_TEXT IDTVEC(cmcint) PUSH_FRAME FAKE_MCOUNT(TF_RIP(%rsp)) call lapic_handle_cmc MEXITCOUNT jmp doreti /* * Local APIC error interrupt handler. */ .text SUPERALIGN_TEXT IDTVEC(errorint) PUSH_FRAME FAKE_MCOUNT(TF_RIP(%rsp)) call lapic_handle_error MEXITCOUNT jmp doreti #ifdef XENHVM /* * Xen event channel upcall interrupt handler. * Only used when the hypervisor supports direct vector callbacks. */ .text SUPERALIGN_TEXT IDTVEC(xen_intr_upcall) PUSH_FRAME FAKE_MCOUNT(TF_RIP(%rsp)) movq %rsp, %rdi call xen_intr_handle_upcall MEXITCOUNT jmp doreti #endif #ifdef HYPERV /* * This is the Hyper-V vmbus channel direct callback interrupt. * Only used when it is running on Hyper-V. */ .text SUPERALIGN_TEXT IDTVEC(hv_vmbus_callback) PUSH_FRAME FAKE_MCOUNT(TF_RIP(%rsp)) movq %rsp, %rdi call hv_vector_handler MEXITCOUNT jmp doreti #endif #ifdef SMP /* * Global address space TLB shootdown. */ .text -#define NAKE_INTR_CS 24 - SUPERALIGN_TEXT invltlb_ret: call as_lapic_eoi POP_FRAME jmp doreti_iret SUPERALIGN_TEXT +IDTVEC(invltlb) + PUSH_FRAME + + call invltlb_handler + jmp invltlb_ret + IDTVEC(invltlb_pcid) PUSH_FRAME call invltlb_pcid_handler jmp invltlb_ret - - SUPERALIGN_TEXT -IDTVEC(invltlb) +IDTVEC(invltlb_invpcid) PUSH_FRAME - call invltlb_handler + call invltlb_invpcid_handler jmp invltlb_ret /* * Single page TLB shootdown */ .text - SUPERALIGN_TEXT -IDTVEC(invlpg_pcid) - PUSH_FRAME - - call invlpg_pcid_handler - jmp invltlb_ret SUPERALIGN_TEXT IDTVEC(invlpg) PUSH_FRAME call invlpg_handler jmp invltlb_ret /* * Page range TLB shootdown. */ .text SUPERALIGN_TEXT IDTVEC(invlrng) PUSH_FRAME call invlrng_handler jmp invltlb_ret /* * Invalidate cache. */ .text SUPERALIGN_TEXT IDTVEC(invlcache) PUSH_FRAME call invlcache_handler jmp invltlb_ret /* * Handler for IPIs sent via the per-cpu IPI bitmap. */ .text SUPERALIGN_TEXT IDTVEC(ipi_intr_bitmap_handler) PUSH_FRAME call as_lapic_eoi FAKE_MCOUNT(TF_RIP(%rsp)) call ipi_bitmap_handler MEXITCOUNT jmp doreti /* * Executed by a CPU when it receives an IPI_STOP from another CPU. */ .text SUPERALIGN_TEXT IDTVEC(cpustop) PUSH_FRAME call as_lapic_eoi call cpustop_handler jmp doreti /* * Executed by a CPU when it receives an IPI_SUSPEND from another CPU. */ .text SUPERALIGN_TEXT IDTVEC(cpususpend) PUSH_FRAME call cpususpend_handler call as_lapic_eoi jmp doreti /* * Executed by a CPU when it receives a RENDEZVOUS IPI from another CPU. * * - Calls the generic rendezvous action function. */ .text SUPERALIGN_TEXT IDTVEC(rendezvous) PUSH_FRAME #ifdef COUNT_IPIS movl PCPU(CPUID), %eax movq ipi_rendezvous_counts(,%rax,8), %rax incq (%rax) #endif call smp_rendezvous_action call as_lapic_eoi jmp doreti /* * IPI handler whose purpose is to interrupt the CPU with minimum overhead. * This is used by bhyve to force a host cpu executing in guest context to * trap into the hypervisor. * * This handler is different from other IPI handlers in the following aspects: * * 1. It doesn't push a trapframe on the stack. * * This implies that a DDB backtrace involving 'justreturn' will skip the * function that was interrupted by this handler. * * 2. It doesn't 'swapgs' when userspace is interrupted. * * The 'justreturn' handler does not access any pcpu data so it is not an * issue. Moreover the 'justreturn' handler can only be interrupted by an NMI * whose handler already doesn't trust GS.base when kernel code is interrupted. */ .text SUPERALIGN_TEXT IDTVEC(justreturn) pushq %rax pushq %rcx pushq %rdx call as_lapic_eoi popq %rdx popq %rcx popq %rax jmp doreti_iret #endif /* SMP */ Index: projects/release-arm-redux/sys/amd64/amd64/cpu_switch.S =================================================================== --- projects/release-arm-redux/sys/amd64/amd64/cpu_switch.S (revision 282691) +++ projects/release-arm-redux/sys/amd64/amd64/cpu_switch.S (revision 282692) @@ -1,521 +1,476 @@ /*- * Copyright (c) 2003 Peter Wemm. * Copyright (c) 1990 The Regents of the University of California. * All rights reserved. * * This code is derived from software contributed to Berkeley by * William Jolitz. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #include #include #include "assym.s" #include "opt_sched.h" /*****************************************************************************/ /* Scheduling */ /*****************************************************************************/ .text #ifdef SMP #define LK lock ; #else #define LK #endif #if defined(SCHED_ULE) && defined(SMP) #define SETLK xchgq #else #define SETLK movq #endif /* * cpu_throw() * * This is the second half of cpu_switch(). It is used when the current * thread is either a dummy or slated to die, and we no longer care * about its state. This is only a slight optimization and is probably * not worth it anymore. Note that we need to clear the pm_active bits so * we do need the old proc if it still exists. * %rdi = oldtd * %rsi = newtd */ ENTRY(cpu_throw) - movl PCPU(CPUID),%eax - testq %rdi,%rdi - jz 1f - /* release bit from old pm_active */ - movq PCPU(CURPMAP),%rdx - LK btrl %eax,PM_ACTIVE(%rdx) /* clear old */ -1: - movq TD_PCB(%rsi),%r8 /* newtd->td_pcb */ - movq PCB_CR3(%r8),%rcx /* new address space */ - jmp swact + movq %rsi,%r12 + movq %rsi,%rdi + call pmap_activate_sw + jmp sw1 END(cpu_throw) /* * cpu_switch(old, new, mtx) * * Save the current thread state, then select the next thread to run * and load its state. * %rdi = oldtd * %rsi = newtd * %rdx = mtx */ ENTRY(cpu_switch) /* Switch to new thread. First, save context. */ movq TD_PCB(%rdi),%r8 orl $PCB_FULL_IRET,PCB_FLAGS(%r8) movq (%rsp),%rax /* Hardware registers */ movq %r15,PCB_R15(%r8) movq %r14,PCB_R14(%r8) movq %r13,PCB_R13(%r8) movq %r12,PCB_R12(%r8) movq %rbp,PCB_RBP(%r8) movq %rsp,PCB_RSP(%r8) movq %rbx,PCB_RBX(%r8) movq %rax,PCB_RIP(%r8) testl $PCB_DBREGS,PCB_FLAGS(%r8) jnz store_dr /* static predict not taken */ done_store_dr: /* have we used fp, and need a save? */ cmpq %rdi,PCPU(FPCURTHREAD) jne 3f movq PCB_SAVEFPU(%r8),%r8 clts cmpl $0,use_xsave jne 1f fxsave (%r8) jmp 2f 1: movq %rdx,%rcx movl xsave_mask,%eax movl xsave_mask+4,%edx .globl ctx_switch_xsave ctx_switch_xsave: /* This is patched to xsaveopt if supported, see fpuinit_bsp1() */ xsave (%r8) movq %rcx,%rdx 2: smsw %ax orb $CR0_TS,%al lmsw %ax xorl %eax,%eax movq %rax,PCPU(FPCURTHREAD) 3: - /* Save is done. Now fire up new thread. Leave old vmspace. */ - movq TD_PCB(%rsi),%r8 - - /* switch address space */ - movq PCB_CR3(%r8),%rcx - movq %cr3,%rax - cmpq %rcx,%rax /* Same address space? */ - jne swinact - SETLK %rdx, TD_LOCK(%rdi) /* Release the old thread */ - jmp sw1 -swinact: - movl PCPU(CPUID),%eax - /* Release bit from old pmap->pm_active */ - movq PCPU(CURPMAP),%r12 - LK btrl %eax,PM_ACTIVE(%r12) /* clear old */ - SETLK %rdx,TD_LOCK(%rdi) /* Release the old thread */ -swact: - /* Set bit in new pmap->pm_active */ - movq TD_PROC(%rsi),%rdx /* newproc */ - movq P_VMSPACE(%rdx), %rdx - addq $VM_PMAP,%rdx - cmpl $-1,PM_PCID(%rdx) - je 1f - LK btsl %eax,PM_SAVE(%rdx) - jnc 1f - btsq $63,%rcx /* CR3_PCID_SAVE */ - incq PCPU(PM_SAVE_CNT) -1: - movq %rcx,%cr3 /* new address space */ - LK btsl %eax,PM_ACTIVE(%rdx) /* set new */ - movq %rdx,PCPU(CURPMAP) - - /* - * We might lose the race and other CPU might have changed - * the pmap after we set our bit in pmap->pm_save. Recheck. - * Reload %cr3 with CR3_PCID_SAVE bit cleared if pmap was - * modified, causing TLB flush for this pcid. - */ - btrq $63,%rcx - jnc 1f - LK btsl %eax,PM_SAVE(%rdx) - jc 1f - decq PCPU(PM_SAVE_CNT) - movq %rcx,%cr3 -1: - + movq %rsi,%r12 + movq %rdi,%r13 + movq %rdx,%r15 + movq %rsi,%rdi + callq pmap_activate_sw + SETLK %r15,TD_LOCK(%r13) /* Release the old thread */ sw1: + movq TD_PCB(%r12),%r8 #if defined(SCHED_ULE) && defined(SMP) /* Wait for the new thread to become unblocked */ movq $blocked_lock, %rdx 1: - movq TD_LOCK(%rsi),%rcx + movq TD_LOCK(%r12),%rcx cmpq %rcx, %rdx pause je 1b #endif /* * At this point, we've switched address spaces and are ready * to load up the rest of the next context. */ /* Skip loading user fsbase/gsbase for kthreads */ - testl $TDP_KTHREAD,TD_PFLAGS(%rsi) + testl $TDP_KTHREAD,TD_PFLAGS(%r12) jnz do_kthread /* * Load ldt register */ - movq TD_PROC(%rsi),%rcx + movq TD_PROC(%r12),%rcx cmpq $0, P_MD+MD_LDT(%rcx) jne do_ldt xorl %eax,%eax ld_ldt: lldt %ax /* Restore fs base in GDT */ movl PCB_FSBASE(%r8),%eax movq PCPU(FS32P),%rdx movw %ax,2(%rdx) shrl $16,%eax movb %al,4(%rdx) shrl $8,%eax movb %al,7(%rdx) /* Restore gs base in GDT */ movl PCB_GSBASE(%r8),%eax movq PCPU(GS32P),%rdx movw %ax,2(%rdx) shrl $16,%eax movb %al,4(%rdx) shrl $8,%eax movb %al,7(%rdx) do_kthread: /* Do we need to reload tss ? */ movq PCPU(TSSP),%rax movq PCB_TSSP(%r8),%rdx testq %rdx,%rdx cmovzq PCPU(COMMONTSSP),%rdx cmpq %rax,%rdx jne do_tss done_tss: movq %r8,PCPU(RSP0) movq %r8,PCPU(CURPCB) /* Update the TSS_RSP0 pointer for the next interrupt */ movq %r8,COMMON_TSS_RSP0(%rdx) - movq %rsi,PCPU(CURTHREAD) /* into next thread */ + movq %r12,PCPU(CURTHREAD) /* into next thread */ /* Test if debug registers should be restored. */ testl $PCB_DBREGS,PCB_FLAGS(%r8) jnz load_dr /* static predict not taken */ done_load_dr: /* Restore context. */ movq PCB_R15(%r8),%r15 movq PCB_R14(%r8),%r14 movq PCB_R13(%r8),%r13 movq PCB_R12(%r8),%r12 movq PCB_RBP(%r8),%rbp movq PCB_RSP(%r8),%rsp movq PCB_RBX(%r8),%rbx movq PCB_RIP(%r8),%rax movq %rax,(%rsp) ret /* * We order these strangely for several reasons. * 1: I wanted to use static branch prediction hints * 2: Most athlon64/opteron cpus don't have them. They define * a forward branch as 'predict not taken'. Intel cores have * the 'rep' prefix to invert this. * So, to make it work on both forms of cpu we do the detour. * We use jumps rather than call in order to avoid the stack. */ store_dr: movq %dr7,%rax /* yes, do the save */ movq %dr0,%r15 movq %dr1,%r14 movq %dr2,%r13 movq %dr3,%r12 movq %dr6,%r11 movq %r15,PCB_DR0(%r8) movq %r14,PCB_DR1(%r8) movq %r13,PCB_DR2(%r8) movq %r12,PCB_DR3(%r8) movq %r11,PCB_DR6(%r8) movq %rax,PCB_DR7(%r8) andq $0x0000fc00, %rax /* disable all watchpoints */ movq %rax,%dr7 jmp done_store_dr load_dr: movq %dr7,%rax movq PCB_DR0(%r8),%r15 movq PCB_DR1(%r8),%r14 movq PCB_DR2(%r8),%r13 movq PCB_DR3(%r8),%r12 movq PCB_DR6(%r8),%r11 movq PCB_DR7(%r8),%rcx movq %r15,%dr0 movq %r14,%dr1 /* Preserve reserved bits in %dr7 */ andq $0x0000fc00,%rax andq $~0x0000fc00,%rcx movq %r13,%dr2 movq %r12,%dr3 orq %rcx,%rax movq %r11,%dr6 movq %rax,%dr7 jmp done_load_dr do_tss: movq %rdx,PCPU(TSSP) movq %rdx,%rcx movq PCPU(TSS),%rax movw %cx,2(%rax) shrq $16,%rcx movb %cl,4(%rax) shrq $8,%rcx movb %cl,7(%rax) shrq $8,%rcx movl %ecx,8(%rax) movb $0x89,5(%rax) /* unset busy */ movl $TSSSEL,%eax ltr %ax jmp done_tss do_ldt: movq PCPU(LDT),%rax movq P_MD+MD_LDT_SD(%rcx),%rdx movq %rdx,(%rax) movq P_MD+MD_LDT_SD+8(%rcx),%rdx movq %rdx,8(%rax) movl $LDTSEL,%eax jmp ld_ldt END(cpu_switch) /* * savectx(pcb) * Update pcb, saving current processor state. */ ENTRY(savectx) /* Save caller's return address. */ movq (%rsp),%rax movq %rax,PCB_RIP(%rdi) movq %rbx,PCB_RBX(%rdi) movq %rsp,PCB_RSP(%rdi) movq %rbp,PCB_RBP(%rdi) movq %r12,PCB_R12(%rdi) movq %r13,PCB_R13(%rdi) movq %r14,PCB_R14(%rdi) movq %r15,PCB_R15(%rdi) movq %cr0,%rax movq %rax,PCB_CR0(%rdi) movq %cr2,%rax movq %rax,PCB_CR2(%rdi) movq %cr3,%rax movq %rax,PCB_CR3(%rdi) movq %cr4,%rax movq %rax,PCB_CR4(%rdi) movq %dr0,%rax movq %rax,PCB_DR0(%rdi) movq %dr1,%rax movq %rax,PCB_DR1(%rdi) movq %dr2,%rax movq %rax,PCB_DR2(%rdi) movq %dr3,%rax movq %rax,PCB_DR3(%rdi) movq %dr6,%rax movq %rax,PCB_DR6(%rdi) movq %dr7,%rax movq %rax,PCB_DR7(%rdi) movl $MSR_FSBASE,%ecx rdmsr movl %eax,PCB_FSBASE(%rdi) movl %edx,PCB_FSBASE+4(%rdi) movl $MSR_GSBASE,%ecx rdmsr movl %eax,PCB_GSBASE(%rdi) movl %edx,PCB_GSBASE+4(%rdi) movl $MSR_KGSBASE,%ecx rdmsr movl %eax,PCB_KGSBASE(%rdi) movl %edx,PCB_KGSBASE+4(%rdi) movl $MSR_EFER,%ecx rdmsr movl %eax,PCB_EFER(%rdi) movl %edx,PCB_EFER+4(%rdi) movl $MSR_STAR,%ecx rdmsr movl %eax,PCB_STAR(%rdi) movl %edx,PCB_STAR+4(%rdi) movl $MSR_LSTAR,%ecx rdmsr movl %eax,PCB_LSTAR(%rdi) movl %edx,PCB_LSTAR+4(%rdi) movl $MSR_CSTAR,%ecx rdmsr movl %eax,PCB_CSTAR(%rdi) movl %edx,PCB_CSTAR+4(%rdi) movl $MSR_SF_MASK,%ecx rdmsr movl %eax,PCB_SFMASK(%rdi) movl %edx,PCB_SFMASK+4(%rdi) sgdt PCB_GDT(%rdi) sidt PCB_IDT(%rdi) sldt PCB_LDT(%rdi) str PCB_TR(%rdi) movl $1,%eax ret END(savectx) /* * resumectx(pcb) * Resuming processor state from pcb. */ ENTRY(resumectx) /* Switch to KPML4phys. */ movq KPML4phys,%rax movq %rax,%cr3 /* Force kernel segment registers. */ movl $KDSEL,%eax movw %ax,%ds movw %ax,%es movw %ax,%ss movl $KUF32SEL,%eax movw %ax,%fs movl $KUG32SEL,%eax movw %ax,%gs movl $MSR_FSBASE,%ecx movl PCB_FSBASE(%rdi),%eax movl 4 + PCB_FSBASE(%rdi),%edx wrmsr movl $MSR_GSBASE,%ecx movl PCB_GSBASE(%rdi),%eax movl 4 + PCB_GSBASE(%rdi),%edx wrmsr movl $MSR_KGSBASE,%ecx movl PCB_KGSBASE(%rdi),%eax movl 4 + PCB_KGSBASE(%rdi),%edx wrmsr /* Restore EFER. */ movl $MSR_EFER,%ecx movl PCB_EFER(%rdi),%eax wrmsr /* Restore fast syscall stuff. */ movl $MSR_STAR,%ecx movl PCB_STAR(%rdi),%eax movl 4 + PCB_STAR(%rdi),%edx wrmsr movl $MSR_LSTAR,%ecx movl PCB_LSTAR(%rdi),%eax movl 4 + PCB_LSTAR(%rdi),%edx wrmsr movl $MSR_CSTAR,%ecx movl PCB_CSTAR(%rdi),%eax movl 4 + PCB_CSTAR(%rdi),%edx wrmsr movl $MSR_SF_MASK,%ecx movl PCB_SFMASK(%rdi),%eax wrmsr /* Restore CR0, CR2, CR4 and CR3. */ movq PCB_CR0(%rdi),%rax movq %rax,%cr0 movq PCB_CR2(%rdi),%rax movq %rax,%cr2 movq PCB_CR4(%rdi),%rax movq %rax,%cr4 movq PCB_CR3(%rdi),%rax movq %rax,%cr3 /* Restore descriptor tables. */ lidt PCB_IDT(%rdi) lldt PCB_LDT(%rdi) #define SDT_SYSTSS 9 #define SDT_SYSBSY 11 /* Clear "task busy" bit and reload TR. */ movq PCPU(TSS),%rax andb $(~SDT_SYSBSY | SDT_SYSTSS),5(%rax) movw PCB_TR(%rdi),%ax ltr %ax #undef SDT_SYSTSS #undef SDT_SYSBSY /* Restore debug registers. */ movq PCB_DR0(%rdi),%rax movq %rax,%dr0 movq PCB_DR1(%rdi),%rax movq %rax,%dr1 movq PCB_DR2(%rdi),%rax movq %rax,%dr2 movq PCB_DR3(%rdi),%rax movq %rax,%dr3 movq PCB_DR6(%rdi),%rax movq %rax,%dr6 movq PCB_DR7(%rdi),%rax movq %rax,%dr7 /* Restore other callee saved registers. */ movq PCB_R15(%rdi),%r15 movq PCB_R14(%rdi),%r14 movq PCB_R13(%rdi),%r13 movq PCB_R12(%rdi),%r12 movq PCB_RBP(%rdi),%rbp movq PCB_RSP(%rdi),%rsp movq PCB_RBX(%rdi),%rbx /* Restore return address. */ movq PCB_RIP(%rdi),%rax movq %rax,(%rsp) xorl %eax,%eax ret END(resumectx) Index: projects/release-arm-redux/sys/amd64/amd64/genassym.c =================================================================== --- projects/release-arm-redux/sys/amd64/amd64/genassym.c (revision 282691) +++ projects/release-arm-redux/sys/amd64/amd64/genassym.c (revision 282692) @@ -1,241 +1,239 @@ /*- * Copyright (c) 1982, 1990 The Regents of the University of California. * All rights reserved. * * This code is derived from software contributed to Berkeley by * William Jolitz. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * from: @(#)genassym.c 5.11 (Berkeley) 5/10/91 */ #include __FBSDID("$FreeBSD$"); #include "opt_compat.h" #include "opt_hwpmc_hooks.h" #include "opt_kstack_pages.h" #include #include #include #include #include #include #ifdef HWPMC_HOOKS #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include ASSYM(P_VMSPACE, offsetof(struct proc, p_vmspace)); ASSYM(VM_PMAP, offsetof(struct vmspace, vm_pmap)); ASSYM(PM_ACTIVE, offsetof(struct pmap, pm_active)); -ASSYM(PM_SAVE, offsetof(struct pmap, pm_save)); -ASSYM(PM_PCID, offsetof(struct pmap, pm_pcid)); ASSYM(P_MD, offsetof(struct proc, p_md)); ASSYM(MD_LDT, offsetof(struct mdproc, md_ldt)); ASSYM(MD_LDT_SD, offsetof(struct mdproc, md_ldt_sd)); ASSYM(TD_LOCK, offsetof(struct thread, td_lock)); ASSYM(TD_FLAGS, offsetof(struct thread, td_flags)); ASSYM(TD_PCB, offsetof(struct thread, td_pcb)); ASSYM(TD_PFLAGS, offsetof(struct thread, td_pflags)); ASSYM(TD_PROC, offsetof(struct thread, td_proc)); ASSYM(TD_TID, offsetof(struct thread, td_tid)); ASSYM(TD_FRAME, offsetof(struct thread, td_frame)); ASSYM(TDF_ASTPENDING, TDF_ASTPENDING); ASSYM(TDF_NEEDRESCHED, TDF_NEEDRESCHED); ASSYM(TDP_CALLCHAIN, TDP_CALLCHAIN); ASSYM(TDP_KTHREAD, TDP_KTHREAD); ASSYM(V_TRAP, offsetof(struct vmmeter, v_trap)); ASSYM(V_SYSCALL, offsetof(struct vmmeter, v_syscall)); ASSYM(V_INTR, offsetof(struct vmmeter, v_intr)); ASSYM(KSTACK_PAGES, KSTACK_PAGES); ASSYM(PAGE_SIZE, PAGE_SIZE); ASSYM(NPTEPG, NPTEPG); ASSYM(NPDEPG, NPDEPG); ASSYM(addr_PTmap, addr_PTmap); ASSYM(addr_PDmap, addr_PDmap); ASSYM(addr_PDPmap, addr_PDPmap); ASSYM(addr_PML4map, addr_PML4map); ASSYM(addr_PML4pml4e, addr_PML4pml4e); ASSYM(PDESIZE, sizeof(pd_entry_t)); ASSYM(PTESIZE, sizeof(pt_entry_t)); ASSYM(PAGE_SHIFT, PAGE_SHIFT); ASSYM(PAGE_MASK, PAGE_MASK); ASSYM(PDRSHIFT, PDRSHIFT); ASSYM(PDPSHIFT, PDPSHIFT); ASSYM(PML4SHIFT, PML4SHIFT); ASSYM(val_KPDPI, KPDPI); ASSYM(val_KPML4I, KPML4I); ASSYM(val_PML4PML4I, PML4PML4I); ASSYM(USRSTACK, USRSTACK); ASSYM(VM_MAXUSER_ADDRESS, VM_MAXUSER_ADDRESS); ASSYM(KERNBASE, KERNBASE); ASSYM(DMAP_MIN_ADDRESS, DMAP_MIN_ADDRESS); ASSYM(DMAP_MAX_ADDRESS, DMAP_MAX_ADDRESS); ASSYM(MCLBYTES, MCLBYTES); ASSYM(PCB_R15, offsetof(struct pcb, pcb_r15)); ASSYM(PCB_R14, offsetof(struct pcb, pcb_r14)); ASSYM(PCB_R13, offsetof(struct pcb, pcb_r13)); ASSYM(PCB_R12, offsetof(struct pcb, pcb_r12)); ASSYM(PCB_RBP, offsetof(struct pcb, pcb_rbp)); ASSYM(PCB_RSP, offsetof(struct pcb, pcb_rsp)); ASSYM(PCB_RBX, offsetof(struct pcb, pcb_rbx)); ASSYM(PCB_RIP, offsetof(struct pcb, pcb_rip)); ASSYM(PCB_FSBASE, offsetof(struct pcb, pcb_fsbase)); ASSYM(PCB_GSBASE, offsetof(struct pcb, pcb_gsbase)); ASSYM(PCB_KGSBASE, offsetof(struct pcb, pcb_kgsbase)); ASSYM(PCB_CR0, offsetof(struct pcb, pcb_cr0)); ASSYM(PCB_CR2, offsetof(struct pcb, pcb_cr2)); ASSYM(PCB_CR3, offsetof(struct pcb, pcb_cr3)); ASSYM(PCB_CR4, offsetof(struct pcb, pcb_cr4)); ASSYM(PCB_DR0, offsetof(struct pcb, pcb_dr0)); ASSYM(PCB_DR1, offsetof(struct pcb, pcb_dr1)); ASSYM(PCB_DR2, offsetof(struct pcb, pcb_dr2)); ASSYM(PCB_DR3, offsetof(struct pcb, pcb_dr3)); ASSYM(PCB_DR6, offsetof(struct pcb, pcb_dr6)); ASSYM(PCB_DR7, offsetof(struct pcb, pcb_dr7)); ASSYM(PCB_GDT, offsetof(struct pcb, pcb_gdt)); ASSYM(PCB_IDT, offsetof(struct pcb, pcb_idt)); ASSYM(PCB_LDT, offsetof(struct pcb, pcb_ldt)); ASSYM(PCB_TR, offsetof(struct pcb, pcb_tr)); ASSYM(PCB_FLAGS, offsetof(struct pcb, pcb_flags)); ASSYM(PCB_ONFAULT, offsetof(struct pcb, pcb_onfault)); ASSYM(PCB_GS32SD, offsetof(struct pcb, pcb_gs32sd)); ASSYM(PCB_TSSP, offsetof(struct pcb, pcb_tssp)); ASSYM(PCB_SAVEFPU, offsetof(struct pcb, pcb_save)); ASSYM(PCB_EFER, offsetof(struct pcb, pcb_efer)); ASSYM(PCB_STAR, offsetof(struct pcb, pcb_star)); ASSYM(PCB_LSTAR, offsetof(struct pcb, pcb_lstar)); ASSYM(PCB_CSTAR, offsetof(struct pcb, pcb_cstar)); ASSYM(PCB_SFMASK, offsetof(struct pcb, pcb_sfmask)); ASSYM(PCB_SIZE, sizeof(struct pcb)); ASSYM(PCB_FULL_IRET, PCB_FULL_IRET); ASSYM(PCB_DBREGS, PCB_DBREGS); ASSYM(PCB_32BIT, PCB_32BIT); ASSYM(COMMON_TSS_RSP0, offsetof(struct amd64tss, tss_rsp0)); ASSYM(TF_R15, offsetof(struct trapframe, tf_r15)); ASSYM(TF_R14, offsetof(struct trapframe, tf_r14)); ASSYM(TF_R13, offsetof(struct trapframe, tf_r13)); ASSYM(TF_R12, offsetof(struct trapframe, tf_r12)); ASSYM(TF_R11, offsetof(struct trapframe, tf_r11)); ASSYM(TF_R10, offsetof(struct trapframe, tf_r10)); ASSYM(TF_R9, offsetof(struct trapframe, tf_r9)); ASSYM(TF_R8, offsetof(struct trapframe, tf_r8)); ASSYM(TF_RDI, offsetof(struct trapframe, tf_rdi)); ASSYM(TF_RSI, offsetof(struct trapframe, tf_rsi)); ASSYM(TF_RBP, offsetof(struct trapframe, tf_rbp)); ASSYM(TF_RBX, offsetof(struct trapframe, tf_rbx)); ASSYM(TF_RDX, offsetof(struct trapframe, tf_rdx)); ASSYM(TF_RCX, offsetof(struct trapframe, tf_rcx)); ASSYM(TF_RAX, offsetof(struct trapframe, tf_rax)); ASSYM(TF_TRAPNO, offsetof(struct trapframe, tf_trapno)); ASSYM(TF_ADDR, offsetof(struct trapframe, tf_addr)); ASSYM(TF_ERR, offsetof(struct trapframe, tf_err)); ASSYM(TF_RIP, offsetof(struct trapframe, tf_rip)); ASSYM(TF_CS, offsetof(struct trapframe, tf_cs)); ASSYM(TF_RFLAGS, offsetof(struct trapframe, tf_rflags)); ASSYM(TF_RSP, offsetof(struct trapframe, tf_rsp)); ASSYM(TF_SS, offsetof(struct trapframe, tf_ss)); ASSYM(TF_DS, offsetof(struct trapframe, tf_ds)); ASSYM(TF_ES, offsetof(struct trapframe, tf_es)); ASSYM(TF_FS, offsetof(struct trapframe, tf_fs)); ASSYM(TF_GS, offsetof(struct trapframe, tf_gs)); ASSYM(TF_FLAGS, offsetof(struct trapframe, tf_flags)); ASSYM(TF_SIZE, sizeof(struct trapframe)); ASSYM(TF_HASSEGS, TF_HASSEGS); ASSYM(SIGF_HANDLER, offsetof(struct sigframe, sf_ahu.sf_handler)); ASSYM(SIGF_UC, offsetof(struct sigframe, sf_uc)); ASSYM(UC_EFLAGS, offsetof(ucontext_t, uc_mcontext.mc_rflags)); ASSYM(ENOENT, ENOENT); ASSYM(EFAULT, EFAULT); ASSYM(ENAMETOOLONG, ENAMETOOLONG); ASSYM(MAXCOMLEN, MAXCOMLEN); ASSYM(MAXPATHLEN, MAXPATHLEN); ASSYM(PC_SIZEOF, sizeof(struct pcpu)); ASSYM(PC_PRVSPACE, offsetof(struct pcpu, pc_prvspace)); ASSYM(PC_CURTHREAD, offsetof(struct pcpu, pc_curthread)); ASSYM(PC_FPCURTHREAD, offsetof(struct pcpu, pc_fpcurthread)); ASSYM(PC_IDLETHREAD, offsetof(struct pcpu, pc_idlethread)); ASSYM(PC_CURPCB, offsetof(struct pcpu, pc_curpcb)); ASSYM(PC_CPUID, offsetof(struct pcpu, pc_cpuid)); ASSYM(PC_SCRATCH_RSP, offsetof(struct pcpu, pc_scratch_rsp)); ASSYM(PC_CURPMAP, offsetof(struct pcpu, pc_curpmap)); ASSYM(PC_TSSP, offsetof(struct pcpu, pc_tssp)); ASSYM(PC_RSP0, offsetof(struct pcpu, pc_rsp0)); ASSYM(PC_FS32P, offsetof(struct pcpu, pc_fs32p)); ASSYM(PC_GS32P, offsetof(struct pcpu, pc_gs32p)); ASSYM(PC_LDT, offsetof(struct pcpu, pc_ldt)); ASSYM(PC_COMMONTSSP, offsetof(struct pcpu, pc_commontssp)); ASSYM(PC_TSS, offsetof(struct pcpu, pc_tss)); ASSYM(PC_PM_SAVE_CNT, offsetof(struct pcpu, pc_pm_save_cnt)); ASSYM(LA_EOI, LAPIC_EOI * LAPIC_MEM_MUL); ASSYM(LA_ISR, LAPIC_ISR0 * LAPIC_MEM_MUL); ASSYM(KCSEL, GSEL(GCODE_SEL, SEL_KPL)); ASSYM(KDSEL, GSEL(GDATA_SEL, SEL_KPL)); ASSYM(KUCSEL, GSEL(GUCODE_SEL, SEL_UPL)); ASSYM(KUDSEL, GSEL(GUDATA_SEL, SEL_UPL)); ASSYM(KUC32SEL, GSEL(GUCODE32_SEL, SEL_UPL)); ASSYM(KUF32SEL, GSEL(GUFS32_SEL, SEL_UPL)); ASSYM(KUG32SEL, GSEL(GUGS32_SEL, SEL_UPL)); ASSYM(TSSSEL, GSEL(GPROC0_SEL, SEL_KPL)); ASSYM(LDTSEL, GSEL(GUSERLDT_SEL, SEL_KPL)); ASSYM(SEL_RPL_MASK, SEL_RPL_MASK); ASSYM(__FreeBSD_version, __FreeBSD_version); #ifdef HWPMC_HOOKS ASSYM(PMC_FN_USER_CALLCHAIN, PMC_FN_USER_CALLCHAIN); #endif Index: projects/release-arm-redux/sys/amd64/amd64/machdep.c =================================================================== --- projects/release-arm-redux/sys/amd64/amd64/machdep.c (revision 282691) +++ projects/release-arm-redux/sys/amd64/amd64/machdep.c (revision 282692) @@ -1,2459 +1,2458 @@ /*- * Copyright (c) 2003 Peter Wemm. * Copyright (c) 1992 Terrence R. Lambert. * Copyright (c) 1982, 1987, 1990 The Regents of the University of California. * All rights reserved. * * This code is derived from software contributed to Berkeley by * William Jolitz. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by the University of * California, Berkeley and its contributors. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * from: @(#)machdep.c 7.4 (Berkeley) 6/3/91 */ #include __FBSDID("$FreeBSD$"); #include "opt_atpic.h" #include "opt_compat.h" #include "opt_cpu.h" #include "opt_ddb.h" #include "opt_inet.h" #include "opt_isa.h" #include "opt_kstack_pages.h" #include "opt_maxmem.h" #include "opt_mp_watchdog.h" #include "opt_perfmon.h" #include "opt_platform.h" #include "opt_sched.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef SMP #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef DDB #ifndef KDB #error KDB must be enabled in order for DDB to work! #endif #include #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef PERFMON #include #endif #include #ifdef SMP #include #endif #ifdef FDT #include #endif #ifdef DEV_ATPIC #include #else #include #endif #include #include #include /* Sanity check for __curthread() */ CTASSERT(offsetof(struct pcpu, pc_curthread) == 0); extern u_int64_t hammer_time(u_int64_t, u_int64_t); #define CS_SECURE(cs) (ISPL(cs) == SEL_UPL) #define EFL_SECURE(ef, oef) ((((ef) ^ (oef)) & ~PSL_USERCHANGE) == 0) static void cpu_startup(void *); static void get_fpcontext(struct thread *td, mcontext_t *mcp, char *xfpusave, size_t xfpusave_len); static int set_fpcontext(struct thread *td, mcontext_t *mcp, char *xfpustate, size_t xfpustate_len); SYSINIT(cpu, SI_SUB_CPU, SI_ORDER_FIRST, cpu_startup, NULL); /* Preload data parse function */ static caddr_t native_parse_preload_data(u_int64_t); /* Native function to fetch and parse the e820 map */ static void native_parse_memmap(caddr_t, vm_paddr_t *, int *); /* Default init_ops implementation. */ struct init_ops init_ops = { .parse_preload_data = native_parse_preload_data, .early_clock_source_init = i8254_init, .early_delay = i8254_delay, .parse_memmap = native_parse_memmap, #ifdef SMP .mp_bootaddress = mp_bootaddress, .start_all_aps = native_start_all_aps, #endif .msi_init = msi_init, }; /* * The file "conf/ldscript.amd64" defines the symbol "kernphys". Its value is * the physical address at which the kernel is loaded. */ extern char kernphys[]; struct msgbuf *msgbufp; /* Intel ICH registers */ #define ICH_PMBASE 0x400 #define ICH_SMI_EN ICH_PMBASE + 0x30 int _udatasel, _ucodesel, _ucode32sel, _ufssel, _ugssel; int cold = 1; long Maxmem = 0; long realmem = 0; /* * The number of PHYSMAP entries must be one less than the number of * PHYSSEG entries because the PHYSMAP entry that spans the largest * physical address that is accessible by ISA DMA is split into two * PHYSSEG entries. */ #define PHYSMAP_SIZE (2 * (VM_PHYSSEG_MAX - 1)) vm_paddr_t phys_avail[PHYSMAP_SIZE + 2]; vm_paddr_t dump_avail[PHYSMAP_SIZE + 2]; /* must be 2 less so 0 0 can signal end of chunks */ #define PHYS_AVAIL_ARRAY_END ((sizeof(phys_avail) / sizeof(phys_avail[0])) - 2) #define DUMP_AVAIL_ARRAY_END ((sizeof(dump_avail) / sizeof(dump_avail[0])) - 2) struct kva_md_info kmi; static struct trapframe proc0_tf; struct region_descriptor r_gdt, r_idt; struct pcpu __pcpu[MAXCPU]; struct mtx icu_lock; struct mem_range_softc mem_range_softc; struct mtx dt_lock; /* lock for GDT and LDT */ void (*vmm_resume_p)(void); static void cpu_startup(dummy) void *dummy; { uintmax_t memsize; char *sysenv; /* * On MacBooks, we need to disallow the legacy USB circuit to * generate an SMI# because this can cause several problems, * namely: incorrect CPU frequency detection and failure to * start the APs. * We do this by disabling a bit in the SMI_EN (SMI Control and * Enable register) of the Intel ICH LPC Interface Bridge. */ sysenv = kern_getenv("smbios.system.product"); if (sysenv != NULL) { if (strncmp(sysenv, "MacBook1,1", 10) == 0 || strncmp(sysenv, "MacBook3,1", 10) == 0 || strncmp(sysenv, "MacBook4,1", 10) == 0 || strncmp(sysenv, "MacBookPro1,1", 13) == 0 || strncmp(sysenv, "MacBookPro1,2", 13) == 0 || strncmp(sysenv, "MacBookPro3,1", 13) == 0 || strncmp(sysenv, "MacBookPro4,1", 13) == 0 || strncmp(sysenv, "Macmini1,1", 10) == 0) { if (bootverbose) printf("Disabling LEGACY_USB_EN bit on " "Intel ICH.\n"); outl(ICH_SMI_EN, inl(ICH_SMI_EN) & ~0x8); } freeenv(sysenv); } /* * Good {morning,afternoon,evening,night}. */ startrtclock(); printcpuinfo(); panicifcpuunsupported(); #ifdef PERFMON perfmon_init(); #endif /* * Display physical memory if SMBIOS reports reasonable amount. */ memsize = 0; sysenv = kern_getenv("smbios.memory.enabled"); if (sysenv != NULL) { memsize = (uintmax_t)strtoul(sysenv, (char **)NULL, 10) << 10; freeenv(sysenv); } if (memsize < ptoa((uintmax_t)vm_cnt.v_free_count)) memsize = ptoa((uintmax_t)Maxmem); printf("real memory = %ju (%ju MB)\n", memsize, memsize >> 20); realmem = atop(memsize); /* * Display any holes after the first chunk of extended memory. */ if (bootverbose) { int indx; printf("Physical memory chunk(s):\n"); for (indx = 0; phys_avail[indx + 1] != 0; indx += 2) { vm_paddr_t size; size = phys_avail[indx + 1] - phys_avail[indx]; printf( "0x%016jx - 0x%016jx, %ju bytes (%ju pages)\n", (uintmax_t)phys_avail[indx], (uintmax_t)phys_avail[indx + 1] - 1, (uintmax_t)size, (uintmax_t)size / PAGE_SIZE); } } vm_ksubmap_init(&kmi); printf("avail memory = %ju (%ju MB)\n", ptoa((uintmax_t)vm_cnt.v_free_count), ptoa((uintmax_t)vm_cnt.v_free_count) / 1048576); /* * Set up buffers, so they can be used to read disk labels. */ bufinit(); vm_pager_bufferinit(); cpu_setregs(); } /* * Send an interrupt to process. * * Stack is set up to allow sigcode stored * at top to call routine, followed by call * to sigreturn routine below. After sigreturn * resets the signal mask, the stack, and the * frame pointer, it returns to the user * specified pc, psl. */ void sendsig(sig_t catcher, ksiginfo_t *ksi, sigset_t *mask) { struct sigframe sf, *sfp; struct pcb *pcb; struct proc *p; struct thread *td; struct sigacts *psp; char *sp; struct trapframe *regs; char *xfpusave; size_t xfpusave_len; int sig; int oonstack; td = curthread; pcb = td->td_pcb; p = td->td_proc; PROC_LOCK_ASSERT(p, MA_OWNED); sig = ksi->ksi_signo; psp = p->p_sigacts; mtx_assert(&psp->ps_mtx, MA_OWNED); regs = td->td_frame; oonstack = sigonstack(regs->tf_rsp); if (cpu_max_ext_state_size > sizeof(struct savefpu) && use_xsave) { xfpusave_len = cpu_max_ext_state_size - sizeof(struct savefpu); xfpusave = __builtin_alloca(xfpusave_len); } else { xfpusave_len = 0; xfpusave = NULL; } /* Save user context. */ bzero(&sf, sizeof(sf)); sf.sf_uc.uc_sigmask = *mask; sf.sf_uc.uc_stack = td->td_sigstk; sf.sf_uc.uc_stack.ss_flags = (td->td_pflags & TDP_ALTSTACK) ? ((oonstack) ? SS_ONSTACK : 0) : SS_DISABLE; sf.sf_uc.uc_mcontext.mc_onstack = (oonstack) ? 1 : 0; bcopy(regs, &sf.sf_uc.uc_mcontext.mc_rdi, sizeof(*regs)); sf.sf_uc.uc_mcontext.mc_len = sizeof(sf.sf_uc.uc_mcontext); /* magic */ get_fpcontext(td, &sf.sf_uc.uc_mcontext, xfpusave, xfpusave_len); fpstate_drop(td); sf.sf_uc.uc_mcontext.mc_fsbase = pcb->pcb_fsbase; sf.sf_uc.uc_mcontext.mc_gsbase = pcb->pcb_gsbase; bzero(sf.sf_uc.uc_mcontext.mc_spare, sizeof(sf.sf_uc.uc_mcontext.mc_spare)); bzero(sf.sf_uc.__spare__, sizeof(sf.sf_uc.__spare__)); /* Allocate space for the signal handler context. */ if ((td->td_pflags & TDP_ALTSTACK) != 0 && !oonstack && SIGISMEMBER(psp->ps_sigonstack, sig)) { sp = td->td_sigstk.ss_sp + td->td_sigstk.ss_size; #if defined(COMPAT_43) td->td_sigstk.ss_flags |= SS_ONSTACK; #endif } else sp = (char *)regs->tf_rsp - 128; if (xfpusave != NULL) { sp -= xfpusave_len; sp = (char *)((unsigned long)sp & ~0x3Ful); sf.sf_uc.uc_mcontext.mc_xfpustate = (register_t)sp; } sp -= sizeof(struct sigframe); /* Align to 16 bytes. */ sfp = (struct sigframe *)((unsigned long)sp & ~0xFul); /* Translate the signal if appropriate. */ if (p->p_sysent->sv_sigtbl && sig <= p->p_sysent->sv_sigsize) sig = p->p_sysent->sv_sigtbl[_SIG_IDX(sig)]; /* Build the argument list for the signal handler. */ regs->tf_rdi = sig; /* arg 1 in %rdi */ regs->tf_rdx = (register_t)&sfp->sf_uc; /* arg 3 in %rdx */ bzero(&sf.sf_si, sizeof(sf.sf_si)); if (SIGISMEMBER(psp->ps_siginfo, sig)) { /* Signal handler installed with SA_SIGINFO. */ regs->tf_rsi = (register_t)&sfp->sf_si; /* arg 2 in %rsi */ sf.sf_ahu.sf_action = (__siginfohandler_t *)catcher; /* Fill in POSIX parts */ sf.sf_si = ksi->ksi_info; sf.sf_si.si_signo = sig; /* maybe a translated signal */ regs->tf_rcx = (register_t)ksi->ksi_addr; /* arg 4 in %rcx */ } else { /* Old FreeBSD-style arguments. */ regs->tf_rsi = ksi->ksi_code; /* arg 2 in %rsi */ regs->tf_rcx = (register_t)ksi->ksi_addr; /* arg 4 in %rcx */ sf.sf_ahu.sf_handler = catcher; } mtx_unlock(&psp->ps_mtx); PROC_UNLOCK(p); /* * Copy the sigframe out to the user's stack. */ if (copyout(&sf, sfp, sizeof(*sfp)) != 0 || (xfpusave != NULL && copyout(xfpusave, (void *)sf.sf_uc.uc_mcontext.mc_xfpustate, xfpusave_len) != 0)) { #ifdef DEBUG printf("process %ld has trashed its stack\n", (long)p->p_pid); #endif PROC_LOCK(p); sigexit(td, SIGILL); } regs->tf_rsp = (long)sfp; regs->tf_rip = p->p_sysent->sv_sigcode_base; regs->tf_rflags &= ~(PSL_T | PSL_D); regs->tf_cs = _ucodesel; regs->tf_ds = _udatasel; regs->tf_ss = _udatasel; regs->tf_es = _udatasel; regs->tf_fs = _ufssel; regs->tf_gs = _ugssel; regs->tf_flags = TF_HASSEGS; set_pcb_flags(pcb, PCB_FULL_IRET); PROC_LOCK(p); mtx_lock(&psp->ps_mtx); } /* * System call to cleanup state after a signal * has been taken. Reset signal mask and * stack state from context left by sendsig (above). * Return to previous pc and psl as specified by * context left by sendsig. Check carefully to * make sure that the user has not modified the * state to gain improper privileges. * * MPSAFE */ int sys_sigreturn(td, uap) struct thread *td; struct sigreturn_args /* { const struct __ucontext *sigcntxp; } */ *uap; { ucontext_t uc; struct pcb *pcb; struct proc *p; struct trapframe *regs; ucontext_t *ucp; char *xfpustate; size_t xfpustate_len; long rflags; int cs, error, ret; ksiginfo_t ksi; pcb = td->td_pcb; p = td->td_proc; error = copyin(uap->sigcntxp, &uc, sizeof(uc)); if (error != 0) { uprintf("pid %d (%s): sigreturn copyin failed\n", p->p_pid, td->td_name); return (error); } ucp = &uc; if ((ucp->uc_mcontext.mc_flags & ~_MC_FLAG_MASK) != 0) { uprintf("pid %d (%s): sigreturn mc_flags %x\n", p->p_pid, td->td_name, ucp->uc_mcontext.mc_flags); return (EINVAL); } regs = td->td_frame; rflags = ucp->uc_mcontext.mc_rflags; /* * Don't allow users to change privileged or reserved flags. */ if (!EFL_SECURE(rflags, regs->tf_rflags)) { uprintf("pid %d (%s): sigreturn rflags = 0x%lx\n", p->p_pid, td->td_name, rflags); return (EINVAL); } /* * Don't allow users to load a valid privileged %cs. Let the * hardware check for invalid selectors, excess privilege in * other selectors, invalid %eip's and invalid %esp's. */ cs = ucp->uc_mcontext.mc_cs; if (!CS_SECURE(cs)) { uprintf("pid %d (%s): sigreturn cs = 0x%x\n", p->p_pid, td->td_name, cs); ksiginfo_init_trap(&ksi); ksi.ksi_signo = SIGBUS; ksi.ksi_code = BUS_OBJERR; ksi.ksi_trapno = T_PROTFLT; ksi.ksi_addr = (void *)regs->tf_rip; trapsignal(td, &ksi); return (EINVAL); } if ((uc.uc_mcontext.mc_flags & _MC_HASFPXSTATE) != 0) { xfpustate_len = uc.uc_mcontext.mc_xfpustate_len; if (xfpustate_len > cpu_max_ext_state_size - sizeof(struct savefpu)) { uprintf("pid %d (%s): sigreturn xfpusave_len = 0x%zx\n", p->p_pid, td->td_name, xfpustate_len); return (EINVAL); } xfpustate = __builtin_alloca(xfpustate_len); error = copyin((const void *)uc.uc_mcontext.mc_xfpustate, xfpustate, xfpustate_len); if (error != 0) { uprintf( "pid %d (%s): sigreturn copying xfpustate failed\n", p->p_pid, td->td_name); return (error); } } else { xfpustate = NULL; xfpustate_len = 0; } ret = set_fpcontext(td, &ucp->uc_mcontext, xfpustate, xfpustate_len); if (ret != 0) { uprintf("pid %d (%s): sigreturn set_fpcontext err %d\n", p->p_pid, td->td_name, ret); return (ret); } bcopy(&ucp->uc_mcontext.mc_rdi, regs, sizeof(*regs)); pcb->pcb_fsbase = ucp->uc_mcontext.mc_fsbase; pcb->pcb_gsbase = ucp->uc_mcontext.mc_gsbase; #if defined(COMPAT_43) if (ucp->uc_mcontext.mc_onstack & 1) td->td_sigstk.ss_flags |= SS_ONSTACK; else td->td_sigstk.ss_flags &= ~SS_ONSTACK; #endif kern_sigprocmask(td, SIG_SETMASK, &ucp->uc_sigmask, NULL, 0); set_pcb_flags(pcb, PCB_FULL_IRET); return (EJUSTRETURN); } #ifdef COMPAT_FREEBSD4 int freebsd4_sigreturn(struct thread *td, struct freebsd4_sigreturn_args *uap) { return sys_sigreturn(td, (struct sigreturn_args *)uap); } #endif /* * Reset registers to default values on exec. */ void exec_setregs(struct thread *td, struct image_params *imgp, u_long stack) { struct trapframe *regs = td->td_frame; struct pcb *pcb = td->td_pcb; mtx_lock(&dt_lock); if (td->td_proc->p_md.md_ldt != NULL) user_ldt_free(td); else mtx_unlock(&dt_lock); pcb->pcb_fsbase = 0; pcb->pcb_gsbase = 0; clear_pcb_flags(pcb, PCB_32BIT); pcb->pcb_initial_fpucw = __INITIAL_FPUCW__; set_pcb_flags(pcb, PCB_FULL_IRET); bzero((char *)regs, sizeof(struct trapframe)); regs->tf_rip = imgp->entry_addr; regs->tf_rsp = ((stack - 8) & ~0xFul) + 8; regs->tf_rdi = stack; /* argv */ regs->tf_rflags = PSL_USER | (regs->tf_rflags & PSL_T); regs->tf_ss = _udatasel; regs->tf_cs = _ucodesel; regs->tf_ds = _udatasel; regs->tf_es = _udatasel; regs->tf_fs = _ufssel; regs->tf_gs = _ugssel; regs->tf_flags = TF_HASSEGS; td->td_retval[1] = 0; /* * Reset the hardware debug registers if they were in use. * They won't have any meaning for the newly exec'd process. */ if (pcb->pcb_flags & PCB_DBREGS) { pcb->pcb_dr0 = 0; pcb->pcb_dr1 = 0; pcb->pcb_dr2 = 0; pcb->pcb_dr3 = 0; pcb->pcb_dr6 = 0; pcb->pcb_dr7 = 0; if (pcb == curpcb) { /* * Clear the debug registers on the running * CPU, otherwise they will end up affecting * the next process we switch to. */ reset_dbregs(); } clear_pcb_flags(pcb, PCB_DBREGS); } /* * Drop the FP state if we hold it, so that the process gets a * clean FP state if it uses the FPU again. */ fpstate_drop(td); } void cpu_setregs(void) { register_t cr0; cr0 = rcr0(); /* * CR0_MP, CR0_NE and CR0_TS are also set by npx_probe() for the * BSP. See the comments there about why we set them. */ cr0 |= CR0_MP | CR0_NE | CR0_TS | CR0_WP | CR0_AM; load_cr0(cr0); } /* * Initialize amd64 and configure to run kernel */ /* * Initialize segments & interrupt table */ struct user_segment_descriptor gdt[NGDT * MAXCPU];/* global descriptor tables */ static struct gate_descriptor idt0[NIDT]; struct gate_descriptor *idt = &idt0[0]; /* interrupt descriptor table */ static char dblfault_stack[PAGE_SIZE] __aligned(16); static char nmi0_stack[PAGE_SIZE] __aligned(16); CTASSERT(sizeof(struct nmi_pcpu) == 16); struct amd64tss common_tss[MAXCPU]; /* * Software prototypes -- in more palatable form. * * Keep GUFS32, GUGS32, GUCODE32 and GUDATA at the same * slots as corresponding segments for i386 kernel. */ struct soft_segment_descriptor gdt_segs[] = { /* GNULL_SEL 0 Null Descriptor */ { .ssd_base = 0x0, .ssd_limit = 0x0, .ssd_type = 0, .ssd_dpl = 0, .ssd_p = 0, .ssd_long = 0, .ssd_def32 = 0, .ssd_gran = 0 }, /* GNULL2_SEL 1 Null Descriptor */ { .ssd_base = 0x0, .ssd_limit = 0x0, .ssd_type = 0, .ssd_dpl = 0, .ssd_p = 0, .ssd_long = 0, .ssd_def32 = 0, .ssd_gran = 0 }, /* GUFS32_SEL 2 32 bit %gs Descriptor for user */ { .ssd_base = 0x0, .ssd_limit = 0xfffff, .ssd_type = SDT_MEMRWA, .ssd_dpl = SEL_UPL, .ssd_p = 1, .ssd_long = 0, .ssd_def32 = 1, .ssd_gran = 1 }, /* GUGS32_SEL 3 32 bit %fs Descriptor for user */ { .ssd_base = 0x0, .ssd_limit = 0xfffff, .ssd_type = SDT_MEMRWA, .ssd_dpl = SEL_UPL, .ssd_p = 1, .ssd_long = 0, .ssd_def32 = 1, .ssd_gran = 1 }, /* GCODE_SEL 4 Code Descriptor for kernel */ { .ssd_base = 0x0, .ssd_limit = 0xfffff, .ssd_type = SDT_MEMERA, .ssd_dpl = SEL_KPL, .ssd_p = 1, .ssd_long = 1, .ssd_def32 = 0, .ssd_gran = 1 }, /* GDATA_SEL 5 Data Descriptor for kernel */ { .ssd_base = 0x0, .ssd_limit = 0xfffff, .ssd_type = SDT_MEMRWA, .ssd_dpl = SEL_KPL, .ssd_p = 1, .ssd_long = 1, .ssd_def32 = 0, .ssd_gran = 1 }, /* GUCODE32_SEL 6 32 bit Code Descriptor for user */ { .ssd_base = 0x0, .ssd_limit = 0xfffff, .ssd_type = SDT_MEMERA, .ssd_dpl = SEL_UPL, .ssd_p = 1, .ssd_long = 0, .ssd_def32 = 1, .ssd_gran = 1 }, /* GUDATA_SEL 7 32/64 bit Data Descriptor for user */ { .ssd_base = 0x0, .ssd_limit = 0xfffff, .ssd_type = SDT_MEMRWA, .ssd_dpl = SEL_UPL, .ssd_p = 1, .ssd_long = 0, .ssd_def32 = 1, .ssd_gran = 1 }, /* GUCODE_SEL 8 64 bit Code Descriptor for user */ { .ssd_base = 0x0, .ssd_limit = 0xfffff, .ssd_type = SDT_MEMERA, .ssd_dpl = SEL_UPL, .ssd_p = 1, .ssd_long = 1, .ssd_def32 = 0, .ssd_gran = 1 }, /* GPROC0_SEL 9 Proc 0 Tss Descriptor */ { .ssd_base = 0x0, .ssd_limit = sizeof(struct amd64tss) + IOPERM_BITMAP_SIZE - 1, .ssd_type = SDT_SYSTSS, .ssd_dpl = SEL_KPL, .ssd_p = 1, .ssd_long = 0, .ssd_def32 = 0, .ssd_gran = 0 }, /* Actually, the TSS is a system descriptor which is double size */ { .ssd_base = 0x0, .ssd_limit = 0x0, .ssd_type = 0, .ssd_dpl = 0, .ssd_p = 0, .ssd_long = 0, .ssd_def32 = 0, .ssd_gran = 0 }, /* GUSERLDT_SEL 11 LDT Descriptor */ { .ssd_base = 0x0, .ssd_limit = 0x0, .ssd_type = 0, .ssd_dpl = 0, .ssd_p = 0, .ssd_long = 0, .ssd_def32 = 0, .ssd_gran = 0 }, /* GUSERLDT_SEL 12 LDT Descriptor, double size */ { .ssd_base = 0x0, .ssd_limit = 0x0, .ssd_type = 0, .ssd_dpl = 0, .ssd_p = 0, .ssd_long = 0, .ssd_def32 = 0, .ssd_gran = 0 }, }; void setidt(idx, func, typ, dpl, ist) int idx; inthand_t *func; int typ; int dpl; int ist; { struct gate_descriptor *ip; ip = idt + idx; ip->gd_looffset = (uintptr_t)func; ip->gd_selector = GSEL(GCODE_SEL, SEL_KPL); ip->gd_ist = ist; ip->gd_xx = 0; ip->gd_type = typ; ip->gd_dpl = dpl; ip->gd_p = 1; ip->gd_hioffset = ((uintptr_t)func)>>16 ; } extern inthand_t IDTVEC(div), IDTVEC(dbg), IDTVEC(nmi), IDTVEC(bpt), IDTVEC(ofl), IDTVEC(bnd), IDTVEC(ill), IDTVEC(dna), IDTVEC(fpusegm), IDTVEC(tss), IDTVEC(missing), IDTVEC(stk), IDTVEC(prot), IDTVEC(page), IDTVEC(mchk), IDTVEC(rsvd), IDTVEC(fpu), IDTVEC(align), IDTVEC(xmm), IDTVEC(dblfault), #ifdef KDTRACE_HOOKS IDTVEC(dtrace_ret), #endif #ifdef XENHVM IDTVEC(xen_intr_upcall), #endif IDTVEC(fast_syscall), IDTVEC(fast_syscall32); #ifdef DDB /* * Display the index and function name of any IDT entries that don't use * the default 'rsvd' entry point. */ DB_SHOW_COMMAND(idt, db_show_idt) { struct gate_descriptor *ip; int idx; uintptr_t func; ip = idt; for (idx = 0; idx < NIDT && !db_pager_quit; idx++) { func = ((long)ip->gd_hioffset << 16 | ip->gd_looffset); if (func != (uintptr_t)&IDTVEC(rsvd)) { db_printf("%3d\t", idx); db_printsym(func, DB_STGY_PROC); db_printf("\n"); } ip++; } } /* Show privileged registers. */ DB_SHOW_COMMAND(sysregs, db_show_sysregs) { struct { uint16_t limit; uint64_t base; } __packed idtr, gdtr; uint16_t ldt, tr; __asm __volatile("sidt %0" : "=m" (idtr)); db_printf("idtr\t0x%016lx/%04x\n", (u_long)idtr.base, (u_int)idtr.limit); __asm __volatile("sgdt %0" : "=m" (gdtr)); db_printf("gdtr\t0x%016lx/%04x\n", (u_long)gdtr.base, (u_int)gdtr.limit); __asm __volatile("sldt %0" : "=r" (ldt)); db_printf("ldtr\t0x%04x\n", ldt); __asm __volatile("str %0" : "=r" (tr)); db_printf("tr\t0x%04x\n", tr); db_printf("cr0\t0x%016lx\n", rcr0()); db_printf("cr2\t0x%016lx\n", rcr2()); db_printf("cr3\t0x%016lx\n", rcr3()); db_printf("cr4\t0x%016lx\n", rcr4()); db_printf("EFER\t%016lx\n", rdmsr(MSR_EFER)); db_printf("FEATURES_CTL\t%016lx\n", rdmsr(MSR_IA32_FEATURE_CONTROL)); db_printf("DEBUG_CTL\t%016lx\n", rdmsr(MSR_DEBUGCTLMSR)); db_printf("PAT\t%016lx\n", rdmsr(MSR_PAT)); db_printf("GSBASE\t%016lx\n", rdmsr(MSR_GSBASE)); } #endif void sdtossd(sd, ssd) struct user_segment_descriptor *sd; struct soft_segment_descriptor *ssd; { ssd->ssd_base = (sd->sd_hibase << 24) | sd->sd_lobase; ssd->ssd_limit = (sd->sd_hilimit << 16) | sd->sd_lolimit; ssd->ssd_type = sd->sd_type; ssd->ssd_dpl = sd->sd_dpl; ssd->ssd_p = sd->sd_p; ssd->ssd_long = sd->sd_long; ssd->ssd_def32 = sd->sd_def32; ssd->ssd_gran = sd->sd_gran; } void ssdtosd(ssd, sd) struct soft_segment_descriptor *ssd; struct user_segment_descriptor *sd; { sd->sd_lobase = (ssd->ssd_base) & 0xffffff; sd->sd_hibase = (ssd->ssd_base >> 24) & 0xff; sd->sd_lolimit = (ssd->ssd_limit) & 0xffff; sd->sd_hilimit = (ssd->ssd_limit >> 16) & 0xf; sd->sd_type = ssd->ssd_type; sd->sd_dpl = ssd->ssd_dpl; sd->sd_p = ssd->ssd_p; sd->sd_long = ssd->ssd_long; sd->sd_def32 = ssd->ssd_def32; sd->sd_gran = ssd->ssd_gran; } void ssdtosyssd(ssd, sd) struct soft_segment_descriptor *ssd; struct system_segment_descriptor *sd; { sd->sd_lobase = (ssd->ssd_base) & 0xffffff; sd->sd_hibase = (ssd->ssd_base >> 24) & 0xfffffffffful; sd->sd_lolimit = (ssd->ssd_limit) & 0xffff; sd->sd_hilimit = (ssd->ssd_limit >> 16) & 0xf; sd->sd_type = ssd->ssd_type; sd->sd_dpl = ssd->ssd_dpl; sd->sd_p = ssd->ssd_p; sd->sd_gran = ssd->ssd_gran; } #if !defined(DEV_ATPIC) && defined(DEV_ISA) #include #include /* * Return a bitmap of the current interrupt requests. This is 8259-specific * and is only suitable for use at probe time. * This is only here to pacify sio. It is NOT FATAL if this doesn't work. * It shouldn't be here. There should probably be an APIC centric * implementation in the apic driver code, if at all. */ intrmask_t isa_irq_pending(void) { u_char irr1; u_char irr2; irr1 = inb(IO_ICU1); irr2 = inb(IO_ICU2); return ((irr2 << 8) | irr1); } #endif u_int basemem; static int add_physmap_entry(uint64_t base, uint64_t length, vm_paddr_t *physmap, int *physmap_idxp) { int i, insert_idx, physmap_idx; physmap_idx = *physmap_idxp; if (length == 0) return (1); /* * Find insertion point while checking for overlap. Start off by * assuming the new entry will be added to the end. * * NB: physmap_idx points to the next free slot. */ insert_idx = physmap_idx; for (i = 0; i <= physmap_idx; i += 2) { if (base < physmap[i + 1]) { if (base + length <= physmap[i]) { insert_idx = i; break; } if (boothowto & RB_VERBOSE) printf( "Overlapping memory regions, ignoring second region\n"); return (1); } } /* See if we can prepend to the next entry. */ if (insert_idx <= physmap_idx && base + length == physmap[insert_idx]) { physmap[insert_idx] = base; return (1); } /* See if we can append to the previous entry. */ if (insert_idx > 0 && base == physmap[insert_idx - 1]) { physmap[insert_idx - 1] += length; return (1); } physmap_idx += 2; *physmap_idxp = physmap_idx; if (physmap_idx == PHYSMAP_SIZE) { printf( "Too many segments in the physical address map, giving up\n"); return (0); } /* * Move the last 'N' entries down to make room for the new * entry if needed. */ for (i = (physmap_idx - 2); i > insert_idx; i -= 2) { physmap[i] = physmap[i - 2]; physmap[i + 1] = physmap[i - 1]; } /* Insert the new entry. */ physmap[insert_idx] = base; physmap[insert_idx + 1] = base + length; return (1); } void bios_add_smap_entries(struct bios_smap *smapbase, u_int32_t smapsize, vm_paddr_t *physmap, int *physmap_idx) { struct bios_smap *smap, *smapend; smapend = (struct bios_smap *)((uintptr_t)smapbase + smapsize); for (smap = smapbase; smap < smapend; smap++) { if (boothowto & RB_VERBOSE) printf("SMAP type=%02x base=%016lx len=%016lx\n", smap->type, smap->base, smap->length); if (smap->type != SMAP_TYPE_MEMORY) continue; if (!add_physmap_entry(smap->base, smap->length, physmap, physmap_idx)) break; } } #define efi_next_descriptor(ptr, size) \ ((struct efi_md *)(((uint8_t *) ptr) + size)) static void add_efi_map_entries(struct efi_map_header *efihdr, vm_paddr_t *physmap, int *physmap_idx) { struct efi_md *map, *p; const char *type; size_t efisz; int ndesc, i; static const char *types[] = { "Reserved", "LoaderCode", "LoaderData", "BootServicesCode", "BootServicesData", "RuntimeServicesCode", "RuntimeServicesData", "ConventionalMemory", "UnusableMemory", "ACPIReclaimMemory", "ACPIMemoryNVS", "MemoryMappedIO", "MemoryMappedIOPortSpace", "PalCode" }; /* * Memory map data provided by UEFI via the GetMemoryMap * Boot Services API. */ efisz = (sizeof(struct efi_map_header) + 0xf) & ~0xf; map = (struct efi_md *)((uint8_t *)efihdr + efisz); if (efihdr->descriptor_size == 0) return; ndesc = efihdr->memory_size / efihdr->descriptor_size; if (boothowto & RB_VERBOSE) printf("%23s %12s %12s %8s %4s\n", "Type", "Physical", "Virtual", "#Pages", "Attr"); for (i = 0, p = map; i < ndesc; i++, p = efi_next_descriptor(p, efihdr->descriptor_size)) { if (boothowto & RB_VERBOSE) { if (p->md_type <= EFI_MD_TYPE_PALCODE) type = types[p->md_type]; else type = ""; printf("%23s %012lx %12p %08lx ", type, p->md_phys, p->md_virt, p->md_pages); if (p->md_attr & EFI_MD_ATTR_UC) printf("UC "); if (p->md_attr & EFI_MD_ATTR_WC) printf("WC "); if (p->md_attr & EFI_MD_ATTR_WT) printf("WT "); if (p->md_attr & EFI_MD_ATTR_WB) printf("WB "); if (p->md_attr & EFI_MD_ATTR_UCE) printf("UCE "); if (p->md_attr & EFI_MD_ATTR_WP) printf("WP "); if (p->md_attr & EFI_MD_ATTR_RP) printf("RP "); if (p->md_attr & EFI_MD_ATTR_XP) printf("XP "); if (p->md_attr & EFI_MD_ATTR_RT) printf("RUNTIME"); printf("\n"); } switch (p->md_type) { case EFI_MD_TYPE_CODE: case EFI_MD_TYPE_DATA: case EFI_MD_TYPE_BS_CODE: case EFI_MD_TYPE_BS_DATA: case EFI_MD_TYPE_FREE: /* * We're allowed to use any entry with these types. */ break; default: continue; } if (!add_physmap_entry(p->md_phys, (p->md_pages * PAGE_SIZE), physmap, physmap_idx)) break; } } static char bootmethod[16] = ""; SYSCTL_STRING(_machdep, OID_AUTO, bootmethod, CTLFLAG_RD, bootmethod, 0, "System firmware boot method"); static void native_parse_memmap(caddr_t kmdp, vm_paddr_t *physmap, int *physmap_idx) { struct bios_smap *smap; struct efi_map_header *efihdr; u_int32_t size; /* * Memory map from INT 15:E820. * * subr_module.c says: * "Consumer may safely assume that size value precedes data." * ie: an int32_t immediately precedes smap. */ efihdr = (struct efi_map_header *)preload_search_info(kmdp, MODINFO_METADATA | MODINFOMD_EFI_MAP); smap = (struct bios_smap *)preload_search_info(kmdp, MODINFO_METADATA | MODINFOMD_SMAP); if (efihdr == NULL && smap == NULL) panic("No BIOS smap or EFI map info from loader!"); if (efihdr != NULL) { add_efi_map_entries(efihdr, physmap, physmap_idx); strlcpy(bootmethod, "UEFI", sizeof(bootmethod)); } else { size = *((u_int32_t *)smap - 1); bios_add_smap_entries(smap, size, physmap, physmap_idx); strlcpy(bootmethod, "BIOS", sizeof(bootmethod)); } } #define PAGES_PER_GB (1024 * 1024 * 1024 / PAGE_SIZE) /* * Populate the (physmap) array with base/bound pairs describing the * available physical memory in the system, then test this memory and * build the phys_avail array describing the actually-available memory. * * Total memory size may be set by the kernel environment variable * hw.physmem or the compile-time define MAXMEM. * * XXX first should be vm_paddr_t. */ static void getmemsize(caddr_t kmdp, u_int64_t first) { int i, physmap_idx, pa_indx, da_indx; vm_paddr_t pa, physmap[PHYSMAP_SIZE]; u_long physmem_start, physmem_tunable, memtest; pt_entry_t *pte; quad_t dcons_addr, dcons_size; int page_counter; bzero(physmap, sizeof(physmap)); physmap_idx = 0; init_ops.parse_memmap(kmdp, physmap, &physmap_idx); physmap_idx -= 2; /* * Find the 'base memory' segment for SMP */ basemem = 0; for (i = 0; i <= physmap_idx; i += 2) { if (physmap[i] <= 0xA0000) { basemem = physmap[i + 1] / 1024; break; } } if (basemem == 0 || basemem > 640) { if (bootverbose) printf( "Memory map doesn't contain a basemem segment, faking it"); basemem = 640; } /* * Make hole for "AP -> long mode" bootstrap code. The * mp_bootaddress vector is only available when the kernel * is configured to support APs and APs for the system start * in 32bit mode (e.g. SMP bare metal). */ if (init_ops.mp_bootaddress) { if (physmap[1] >= 0x100000000) panic( "Basemem segment is not suitable for AP bootstrap code!"); physmap[1] = init_ops.mp_bootaddress(physmap[1] / 1024); } /* * Maxmem isn't the "maximum memory", it's one larger than the * highest page of the physical address space. It should be * called something like "Maxphyspage". We may adjust this * based on ``hw.physmem'' and the results of the memory test. */ Maxmem = atop(physmap[physmap_idx + 1]); #ifdef MAXMEM Maxmem = MAXMEM / 4; #endif if (TUNABLE_ULONG_FETCH("hw.physmem", &physmem_tunable)) Maxmem = atop(physmem_tunable); /* * The boot memory test is disabled by default, as it takes a * significant amount of time on large-memory systems, and is * unfriendly to virtual machines as it unnecessarily touches all * pages. * * A general name is used as the code may be extended to support * additional tests beyond the current "page present" test. */ memtest = 0; TUNABLE_ULONG_FETCH("hw.memtest.tests", &memtest); /* * Don't allow MAXMEM or hw.physmem to extend the amount of memory * in the system. */ if (Maxmem > atop(physmap[physmap_idx + 1])) Maxmem = atop(physmap[physmap_idx + 1]); if (atop(physmap[physmap_idx + 1]) != Maxmem && (boothowto & RB_VERBOSE)) printf("Physical memory use set to %ldK\n", Maxmem * 4); /* call pmap initialization to make new kernel address space */ pmap_bootstrap(&first); /* * Size up each available chunk of physical memory. * * XXX Some BIOSes corrupt low 64KB between suspend and resume. * By default, mask off the first 16 pages unless we appear to be * running in a VM. */ physmem_start = (vm_guest > VM_GUEST_NO ? 1 : 16) << PAGE_SHIFT; TUNABLE_ULONG_FETCH("hw.physmem.start", &physmem_start); if (physmap[0] < physmem_start) { if (physmem_start < PAGE_SIZE) physmap[0] = PAGE_SIZE; else if (physmem_start >= physmap[1]) physmap[0] = round_page(physmap[1] - PAGE_SIZE); else physmap[0] = round_page(physmem_start); } pa_indx = 0; da_indx = 1; phys_avail[pa_indx++] = physmap[0]; phys_avail[pa_indx] = physmap[0]; dump_avail[da_indx] = physmap[0]; pte = CMAP1; /* * Get dcons buffer address */ if (getenv_quad("dcons.addr", &dcons_addr) == 0 || getenv_quad("dcons.size", &dcons_size) == 0) dcons_addr = 0; /* * physmap is in bytes, so when converting to page boundaries, * round up the start address and round down the end address. */ page_counter = 0; if (memtest != 0) printf("Testing system memory"); for (i = 0; i <= physmap_idx; i += 2) { vm_paddr_t end; end = ptoa((vm_paddr_t)Maxmem); if (physmap[i + 1] < end) end = trunc_page(physmap[i + 1]); for (pa = round_page(physmap[i]); pa < end; pa += PAGE_SIZE) { int tmp, page_bad, full; int *ptr = (int *)CADDR1; full = FALSE; /* * block out kernel memory as not available. */ if (pa >= (vm_paddr_t)kernphys && pa < first) goto do_dump_avail; /* * block out dcons buffer */ if (dcons_addr > 0 && pa >= trunc_page(dcons_addr) && pa < dcons_addr + dcons_size) goto do_dump_avail; page_bad = FALSE; if (memtest == 0) goto skip_memtest; /* * Print a "." every GB to show we're making * progress. */ page_counter++; if ((page_counter % PAGES_PER_GB) == 0) printf("."); /* * map page into kernel: valid, read/write,non-cacheable */ *pte = pa | PG_V | PG_RW | PG_NC_PWT | PG_NC_PCD; invltlb(); tmp = *(int *)ptr; /* * Test for alternating 1's and 0's */ *(volatile int *)ptr = 0xaaaaaaaa; if (*(volatile int *)ptr != 0xaaaaaaaa) page_bad = TRUE; /* * Test for alternating 0's and 1's */ *(volatile int *)ptr = 0x55555555; if (*(volatile int *)ptr != 0x55555555) page_bad = TRUE; /* * Test for all 1's */ *(volatile int *)ptr = 0xffffffff; if (*(volatile int *)ptr != 0xffffffff) page_bad = TRUE; /* * Test for all 0's */ *(volatile int *)ptr = 0x0; if (*(volatile int *)ptr != 0x0) page_bad = TRUE; /* * Restore original value. */ *(int *)ptr = tmp; skip_memtest: /* * Adjust array of valid/good pages. */ if (page_bad == TRUE) continue; /* * If this good page is a continuation of the * previous set of good pages, then just increase * the end pointer. Otherwise start a new chunk. * Note that "end" points one higher than end, * making the range >= start and < end. * If we're also doing a speculative memory * test and we at or past the end, bump up Maxmem * so that we keep going. The first bad page * will terminate the loop. */ if (phys_avail[pa_indx] == pa) { phys_avail[pa_indx] += PAGE_SIZE; } else { pa_indx++; if (pa_indx == PHYS_AVAIL_ARRAY_END) { printf( "Too many holes in the physical address space, giving up\n"); pa_indx--; full = TRUE; goto do_dump_avail; } phys_avail[pa_indx++] = pa; /* start */ phys_avail[pa_indx] = pa + PAGE_SIZE; /* end */ } physmem++; do_dump_avail: if (dump_avail[da_indx] == pa) { dump_avail[da_indx] += PAGE_SIZE; } else { da_indx++; if (da_indx == DUMP_AVAIL_ARRAY_END) { da_indx--; goto do_next; } dump_avail[da_indx++] = pa; /* start */ dump_avail[da_indx] = pa + PAGE_SIZE; /* end */ } do_next: if (full) break; } } *pte = 0; invltlb(); if (memtest != 0) printf("\n"); /* * XXX * The last chunk must contain at least one page plus the message * buffer to avoid complicating other code (message buffer address * calculation, etc.). */ while (phys_avail[pa_indx - 1] + PAGE_SIZE + round_page(msgbufsize) >= phys_avail[pa_indx]) { physmem -= atop(phys_avail[pa_indx] - phys_avail[pa_indx - 1]); phys_avail[pa_indx--] = 0; phys_avail[pa_indx--] = 0; } Maxmem = atop(phys_avail[pa_indx]); /* Trim off space for the message buffer. */ phys_avail[pa_indx] -= round_page(msgbufsize); /* Map the message buffer. */ msgbufp = (struct msgbuf *)PHYS_TO_DMAP(phys_avail[pa_indx]); } static caddr_t native_parse_preload_data(u_int64_t modulep) { caddr_t kmdp; #ifdef DDB vm_offset_t ksym_start; vm_offset_t ksym_end; #endif preload_metadata = (caddr_t)(uintptr_t)(modulep + KERNBASE); preload_bootstrap_relocate(KERNBASE); kmdp = preload_search_by_type("elf kernel"); if (kmdp == NULL) kmdp = preload_search_by_type("elf64 kernel"); boothowto = MD_FETCH(kmdp, MODINFOMD_HOWTO, int); kern_envp = MD_FETCH(kmdp, MODINFOMD_ENVP, char *) + KERNBASE; #ifdef DDB ksym_start = MD_FETCH(kmdp, MODINFOMD_SSYM, uintptr_t); ksym_end = MD_FETCH(kmdp, MODINFOMD_ESYM, uintptr_t); db_fetch_ksymtab(ksym_start, ksym_end); #endif return (kmdp); } u_int64_t hammer_time(u_int64_t modulep, u_int64_t physfree) { caddr_t kmdp; int gsel_tss, x; struct pcpu *pc; struct nmi_pcpu *np; struct xstate_hdr *xhdr; u_int64_t msr; char *env; size_t kstack0_sz; thread0.td_kstack = physfree + KERNBASE; thread0.td_kstack_pages = KSTACK_PAGES; kstack0_sz = thread0.td_kstack_pages * PAGE_SIZE; bzero((void *)thread0.td_kstack, kstack0_sz); physfree += kstack0_sz; /* * This may be done better later if it gets more high level * components in it. If so just link td->td_proc here. */ proc_linkup0(&proc0, &thread0); kmdp = init_ops.parse_preload_data(modulep); /* Init basic tunables, hz etc */ init_param1(); /* * make gdt memory segments */ for (x = 0; x < NGDT; x++) { if (x != GPROC0_SEL && x != (GPROC0_SEL + 1) && x != GUSERLDT_SEL && x != (GUSERLDT_SEL) + 1) ssdtosd(&gdt_segs[x], &gdt[x]); } gdt_segs[GPROC0_SEL].ssd_base = (uintptr_t)&common_tss[0]; ssdtosyssd(&gdt_segs[GPROC0_SEL], (struct system_segment_descriptor *)&gdt[GPROC0_SEL]); r_gdt.rd_limit = NGDT * sizeof(gdt[0]) - 1; r_gdt.rd_base = (long) gdt; lgdt(&r_gdt); pc = &__pcpu[0]; wrmsr(MSR_FSBASE, 0); /* User value */ wrmsr(MSR_GSBASE, (u_int64_t)pc); wrmsr(MSR_KGSBASE, 0); /* User value while in the kernel */ pcpu_init(pc, 0, sizeof(struct pcpu)); dpcpu_init((void *)(physfree + KERNBASE), 0); physfree += DPCPU_SIZE; PCPU_SET(prvspace, pc); PCPU_SET(curthread, &thread0); PCPU_SET(tssp, &common_tss[0]); PCPU_SET(commontssp, &common_tss[0]); PCPU_SET(tss, (struct system_segment_descriptor *)&gdt[GPROC0_SEL]); PCPU_SET(ldt, (struct system_segment_descriptor *)&gdt[GUSERLDT_SEL]); PCPU_SET(fs32p, &gdt[GUFS32_SEL]); PCPU_SET(gs32p, &gdt[GUGS32_SEL]); /* * Initialize mutexes. * * icu_lock: in order to allow an interrupt to occur in a critical * section, to set pcpu->ipending (etc...) properly, we * must be able to get the icu lock, so it can't be * under witness. */ mutex_init(); mtx_init(&icu_lock, "icu", NULL, MTX_SPIN | MTX_NOWITNESS); mtx_init(&dt_lock, "descriptor tables", NULL, MTX_DEF); /* exceptions */ for (x = 0; x < NIDT; x++) setidt(x, &IDTVEC(rsvd), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_DE, &IDTVEC(div), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_DB, &IDTVEC(dbg), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_NMI, &IDTVEC(nmi), SDT_SYSIGT, SEL_KPL, 2); setidt(IDT_BP, &IDTVEC(bpt), SDT_SYSIGT, SEL_UPL, 0); setidt(IDT_OF, &IDTVEC(ofl), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_BR, &IDTVEC(bnd), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_UD, &IDTVEC(ill), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_NM, &IDTVEC(dna), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_DF, &IDTVEC(dblfault), SDT_SYSIGT, SEL_KPL, 1); setidt(IDT_FPUGP, &IDTVEC(fpusegm), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_TS, &IDTVEC(tss), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_NP, &IDTVEC(missing), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_SS, &IDTVEC(stk), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_GP, &IDTVEC(prot), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_PF, &IDTVEC(page), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_MF, &IDTVEC(fpu), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_AC, &IDTVEC(align), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_MC, &IDTVEC(mchk), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_XF, &IDTVEC(xmm), SDT_SYSIGT, SEL_KPL, 0); #ifdef KDTRACE_HOOKS setidt(IDT_DTRACE_RET, &IDTVEC(dtrace_ret), SDT_SYSIGT, SEL_UPL, 0); #endif #ifdef XENHVM setidt(IDT_EVTCHN, &IDTVEC(xen_intr_upcall), SDT_SYSIGT, SEL_UPL, 0); #endif r_idt.rd_limit = sizeof(idt0) - 1; r_idt.rd_base = (long) idt; lidt(&r_idt); /* * Initialize the clock before the console so that console * initialization can use DELAY(). */ clock_init(); /* * Use vt(4) by default for UEFI boot (during the sc(4)/vt(4) * transition). */ if (kmdp != NULL && preload_search_info(kmdp, MODINFO_METADATA | MODINFOMD_EFI_MAP) != NULL) vty_set_preferred(VTY_VT); /* * Initialize the console before we print anything out. */ cninit(); #ifdef DEV_ISA #ifdef DEV_ATPIC elcr_probe(); atpic_startup(); #else /* Reset and mask the atpics and leave them shut down. */ atpic_reset(); /* * Point the ICU spurious interrupt vectors at the APIC spurious * interrupt handler. */ setidt(IDT_IO_INTS + 7, IDTVEC(spuriousint), SDT_SYSIGT, SEL_KPL, 0); setidt(IDT_IO_INTS + 15, IDTVEC(spuriousint), SDT_SYSIGT, SEL_KPL, 0); #endif #else #error "have you forgotten the isa device?"; #endif kdb_init(); #ifdef KDB if (boothowto & RB_KDB) kdb_enter(KDB_WHY_BOOTFLAGS, "Boot flags requested debugger"); #endif identify_cpu(); /* Final stage of CPU initialization */ initializecpu(); /* Initialize CPU registers */ initializecpucache(); /* doublefault stack space, runs on ist1 */ common_tss[0].tss_ist1 = (long)&dblfault_stack[sizeof(dblfault_stack)]; /* * NMI stack, runs on ist2. The pcpu pointer is stored just * above the start of the ist2 stack. */ np = ((struct nmi_pcpu *) &nmi0_stack[sizeof(nmi0_stack)]) - 1; np->np_pcpu = (register_t) pc; common_tss[0].tss_ist2 = (long) np; /* Set the IO permission bitmap (empty due to tss seg limit) */ common_tss[0].tss_iobase = sizeof(struct amd64tss) + IOPERM_BITMAP_SIZE; gsel_tss = GSEL(GPROC0_SEL, SEL_KPL); ltr(gsel_tss); /* Set up the fast syscall stuff */ msr = rdmsr(MSR_EFER) | EFER_SCE; wrmsr(MSR_EFER, msr); wrmsr(MSR_LSTAR, (u_int64_t)IDTVEC(fast_syscall)); wrmsr(MSR_CSTAR, (u_int64_t)IDTVEC(fast_syscall32)); msr = ((u_int64_t)GSEL(GCODE_SEL, SEL_KPL) << 32) | ((u_int64_t)GSEL(GUCODE32_SEL, SEL_UPL) << 48); wrmsr(MSR_STAR, msr); wrmsr(MSR_SF_MASK, PSL_NT|PSL_T|PSL_I|PSL_C|PSL_D); getmemsize(kmdp, physfree); init_param2(physmem); /* now running on new page tables, configured,and u/iom is accessible */ msgbufinit(msgbufp, msgbufsize); fpuinit(); /* * Set up thread0 pcb after fpuinit calculated pcb + fpu save * area size. Zero out the extended state header in fpu save * area. */ thread0.td_pcb = get_pcb_td(&thread0); bzero(get_pcb_user_save_td(&thread0), cpu_max_ext_state_size); if (use_xsave) { xhdr = (struct xstate_hdr *)(get_pcb_user_save_td(&thread0) + 1); xhdr->xstate_bv = xsave_mask; } /* make an initial tss so cpu can get interrupt stack on syscall! */ common_tss[0].tss_rsp0 = (vm_offset_t)thread0.td_pcb; /* Ensure the stack is aligned to 16 bytes */ common_tss[0].tss_rsp0 &= ~0xFul; PCPU_SET(rsp0, common_tss[0].tss_rsp0); PCPU_SET(curpcb, thread0.td_pcb); /* transfer to user mode */ _ucodesel = GSEL(GUCODE_SEL, SEL_UPL); _udatasel = GSEL(GUDATA_SEL, SEL_UPL); _ucode32sel = GSEL(GUCODE32_SEL, SEL_UPL); _ufssel = GSEL(GUFS32_SEL, SEL_UPL); _ugssel = GSEL(GUGS32_SEL, SEL_UPL); load_ds(_udatasel); load_es(_udatasel); load_fs(_ufssel); /* setup proc 0's pcb */ thread0.td_pcb->pcb_flags = 0; - thread0.td_pcb->pcb_cr3 = KPML4phys; /* PCID 0 is reserved for kernel */ thread0.td_frame = &proc0_tf; env = kern_getenv("kernelname"); if (env != NULL) strlcpy(kernelname, env, sizeof(kernelname)); cpu_probe_amdc1e(); #ifdef FDT x86_init_fdt(); #endif /* Location of kernel stack for locore */ return ((u_int64_t)thread0.td_pcb); } void cpu_pcpu_init(struct pcpu *pcpu, int cpuid, size_t size) { pcpu->pc_acpi_id = 0xffffffff; } static int smap_sysctl_handler(SYSCTL_HANDLER_ARGS) { struct bios_smap *smapbase; struct bios_smap_xattr smap; caddr_t kmdp; uint32_t *smapattr; int count, error, i; /* Retrieve the system memory map from the loader. */ kmdp = preload_search_by_type("elf kernel"); if (kmdp == NULL) kmdp = preload_search_by_type("elf64 kernel"); smapbase = (struct bios_smap *)preload_search_info(kmdp, MODINFO_METADATA | MODINFOMD_SMAP); if (smapbase == NULL) return (0); smapattr = (uint32_t *)preload_search_info(kmdp, MODINFO_METADATA | MODINFOMD_SMAP_XATTR); count = *((uint32_t *)smapbase - 1) / sizeof(*smapbase); error = 0; for (i = 0; i < count; i++) { smap.base = smapbase[i].base; smap.length = smapbase[i].length; smap.type = smapbase[i].type; if (smapattr != NULL) smap.xattr = smapattr[i]; else smap.xattr = 0; error = SYSCTL_OUT(req, &smap, sizeof(smap)); } return (error); } SYSCTL_PROC(_machdep, OID_AUTO, smap, CTLTYPE_OPAQUE|CTLFLAG_RD, NULL, 0, smap_sysctl_handler, "S,bios_smap_xattr", "Raw BIOS SMAP data"); static int efi_map_sysctl_handler(SYSCTL_HANDLER_ARGS) { struct efi_map_header *efihdr; caddr_t kmdp; uint32_t efisize; kmdp = preload_search_by_type("elf kernel"); if (kmdp == NULL) kmdp = preload_search_by_type("elf64 kernel"); efihdr = (struct efi_map_header *)preload_search_info(kmdp, MODINFO_METADATA | MODINFOMD_EFI_MAP); if (efihdr == NULL) return (0); efisize = *((uint32_t *)efihdr - 1); return (SYSCTL_OUT(req, efihdr, efisize)); } SYSCTL_PROC(_machdep, OID_AUTO, efi_map, CTLTYPE_OPAQUE|CTLFLAG_RD, NULL, 0, efi_map_sysctl_handler, "S,efi_map_header", "Raw EFI Memory Map"); void spinlock_enter(void) { struct thread *td; register_t flags; td = curthread; if (td->td_md.md_spinlock_count == 0) { flags = intr_disable(); td->td_md.md_spinlock_count = 1; td->td_md.md_saved_flags = flags; } else td->td_md.md_spinlock_count++; critical_enter(); } void spinlock_exit(void) { struct thread *td; register_t flags; td = curthread; critical_exit(); flags = td->td_md.md_saved_flags; td->td_md.md_spinlock_count--; if (td->td_md.md_spinlock_count == 0) intr_restore(flags); } /* * Construct a PCB from a trapframe. This is called from kdb_trap() where * we want to start a backtrace from the function that caused us to enter * the debugger. We have the context in the trapframe, but base the trace * on the PCB. The PCB doesn't have to be perfect, as long as it contains * enough for a backtrace. */ void makectx(struct trapframe *tf, struct pcb *pcb) { pcb->pcb_r12 = tf->tf_r12; pcb->pcb_r13 = tf->tf_r13; pcb->pcb_r14 = tf->tf_r14; pcb->pcb_r15 = tf->tf_r15; pcb->pcb_rbp = tf->tf_rbp; pcb->pcb_rbx = tf->tf_rbx; pcb->pcb_rip = tf->tf_rip; pcb->pcb_rsp = tf->tf_rsp; } int ptrace_set_pc(struct thread *td, unsigned long addr) { td->td_frame->tf_rip = addr; set_pcb_flags(td->td_pcb, PCB_FULL_IRET); return (0); } int ptrace_single_step(struct thread *td) { td->td_frame->tf_rflags |= PSL_T; return (0); } int ptrace_clear_single_step(struct thread *td) { td->td_frame->tf_rflags &= ~PSL_T; return (0); } int fill_regs(struct thread *td, struct reg *regs) { struct trapframe *tp; tp = td->td_frame; return (fill_frame_regs(tp, regs)); } int fill_frame_regs(struct trapframe *tp, struct reg *regs) { regs->r_r15 = tp->tf_r15; regs->r_r14 = tp->tf_r14; regs->r_r13 = tp->tf_r13; regs->r_r12 = tp->tf_r12; regs->r_r11 = tp->tf_r11; regs->r_r10 = tp->tf_r10; regs->r_r9 = tp->tf_r9; regs->r_r8 = tp->tf_r8; regs->r_rdi = tp->tf_rdi; regs->r_rsi = tp->tf_rsi; regs->r_rbp = tp->tf_rbp; regs->r_rbx = tp->tf_rbx; regs->r_rdx = tp->tf_rdx; regs->r_rcx = tp->tf_rcx; regs->r_rax = tp->tf_rax; regs->r_rip = tp->tf_rip; regs->r_cs = tp->tf_cs; regs->r_rflags = tp->tf_rflags; regs->r_rsp = tp->tf_rsp; regs->r_ss = tp->tf_ss; if (tp->tf_flags & TF_HASSEGS) { regs->r_ds = tp->tf_ds; regs->r_es = tp->tf_es; regs->r_fs = tp->tf_fs; regs->r_gs = tp->tf_gs; } else { regs->r_ds = 0; regs->r_es = 0; regs->r_fs = 0; regs->r_gs = 0; } return (0); } int set_regs(struct thread *td, struct reg *regs) { struct trapframe *tp; register_t rflags; tp = td->td_frame; rflags = regs->r_rflags & 0xffffffff; if (!EFL_SECURE(rflags, tp->tf_rflags) || !CS_SECURE(regs->r_cs)) return (EINVAL); tp->tf_r15 = regs->r_r15; tp->tf_r14 = regs->r_r14; tp->tf_r13 = regs->r_r13; tp->tf_r12 = regs->r_r12; tp->tf_r11 = regs->r_r11; tp->tf_r10 = regs->r_r10; tp->tf_r9 = regs->r_r9; tp->tf_r8 = regs->r_r8; tp->tf_rdi = regs->r_rdi; tp->tf_rsi = regs->r_rsi; tp->tf_rbp = regs->r_rbp; tp->tf_rbx = regs->r_rbx; tp->tf_rdx = regs->r_rdx; tp->tf_rcx = regs->r_rcx; tp->tf_rax = regs->r_rax; tp->tf_rip = regs->r_rip; tp->tf_cs = regs->r_cs; tp->tf_rflags = rflags; tp->tf_rsp = regs->r_rsp; tp->tf_ss = regs->r_ss; if (0) { /* XXXKIB */ tp->tf_ds = regs->r_ds; tp->tf_es = regs->r_es; tp->tf_fs = regs->r_fs; tp->tf_gs = regs->r_gs; tp->tf_flags = TF_HASSEGS; } set_pcb_flags(td->td_pcb, PCB_FULL_IRET); return (0); } /* XXX check all this stuff! */ /* externalize from sv_xmm */ static void fill_fpregs_xmm(struct savefpu *sv_xmm, struct fpreg *fpregs) { struct envxmm *penv_fpreg = (struct envxmm *)&fpregs->fpr_env; struct envxmm *penv_xmm = &sv_xmm->sv_env; int i; /* pcb -> fpregs */ bzero(fpregs, sizeof(*fpregs)); /* FPU control/status */ penv_fpreg->en_cw = penv_xmm->en_cw; penv_fpreg->en_sw = penv_xmm->en_sw; penv_fpreg->en_tw = penv_xmm->en_tw; penv_fpreg->en_opcode = penv_xmm->en_opcode; penv_fpreg->en_rip = penv_xmm->en_rip; penv_fpreg->en_rdp = penv_xmm->en_rdp; penv_fpreg->en_mxcsr = penv_xmm->en_mxcsr; penv_fpreg->en_mxcsr_mask = penv_xmm->en_mxcsr_mask; /* FPU registers */ for (i = 0; i < 8; ++i) bcopy(sv_xmm->sv_fp[i].fp_acc.fp_bytes, fpregs->fpr_acc[i], 10); /* SSE registers */ for (i = 0; i < 16; ++i) bcopy(sv_xmm->sv_xmm[i].xmm_bytes, fpregs->fpr_xacc[i], 16); } /* internalize from fpregs into sv_xmm */ static void set_fpregs_xmm(struct fpreg *fpregs, struct savefpu *sv_xmm) { struct envxmm *penv_xmm = &sv_xmm->sv_env; struct envxmm *penv_fpreg = (struct envxmm *)&fpregs->fpr_env; int i; /* fpregs -> pcb */ /* FPU control/status */ penv_xmm->en_cw = penv_fpreg->en_cw; penv_xmm->en_sw = penv_fpreg->en_sw; penv_xmm->en_tw = penv_fpreg->en_tw; penv_xmm->en_opcode = penv_fpreg->en_opcode; penv_xmm->en_rip = penv_fpreg->en_rip; penv_xmm->en_rdp = penv_fpreg->en_rdp; penv_xmm->en_mxcsr = penv_fpreg->en_mxcsr; penv_xmm->en_mxcsr_mask = penv_fpreg->en_mxcsr_mask & cpu_mxcsr_mask; /* FPU registers */ for (i = 0; i < 8; ++i) bcopy(fpregs->fpr_acc[i], sv_xmm->sv_fp[i].fp_acc.fp_bytes, 10); /* SSE registers */ for (i = 0; i < 16; ++i) bcopy(fpregs->fpr_xacc[i], sv_xmm->sv_xmm[i].xmm_bytes, 16); } /* externalize from td->pcb */ int fill_fpregs(struct thread *td, struct fpreg *fpregs) { KASSERT(td == curthread || TD_IS_SUSPENDED(td) || P_SHOULDSTOP(td->td_proc), ("not suspended thread %p", td)); fpugetregs(td); fill_fpregs_xmm(get_pcb_user_save_td(td), fpregs); return (0); } /* internalize to td->pcb */ int set_fpregs(struct thread *td, struct fpreg *fpregs) { set_fpregs_xmm(fpregs, get_pcb_user_save_td(td)); fpuuserinited(td); return (0); } /* * Get machine context. */ int get_mcontext(struct thread *td, mcontext_t *mcp, int flags) { struct pcb *pcb; struct trapframe *tp; pcb = td->td_pcb; tp = td->td_frame; PROC_LOCK(curthread->td_proc); mcp->mc_onstack = sigonstack(tp->tf_rsp); PROC_UNLOCK(curthread->td_proc); mcp->mc_r15 = tp->tf_r15; mcp->mc_r14 = tp->tf_r14; mcp->mc_r13 = tp->tf_r13; mcp->mc_r12 = tp->tf_r12; mcp->mc_r11 = tp->tf_r11; mcp->mc_r10 = tp->tf_r10; mcp->mc_r9 = tp->tf_r9; mcp->mc_r8 = tp->tf_r8; mcp->mc_rdi = tp->tf_rdi; mcp->mc_rsi = tp->tf_rsi; mcp->mc_rbp = tp->tf_rbp; mcp->mc_rbx = tp->tf_rbx; mcp->mc_rcx = tp->tf_rcx; mcp->mc_rflags = tp->tf_rflags; if (flags & GET_MC_CLEAR_RET) { mcp->mc_rax = 0; mcp->mc_rdx = 0; mcp->mc_rflags &= ~PSL_C; } else { mcp->mc_rax = tp->tf_rax; mcp->mc_rdx = tp->tf_rdx; } mcp->mc_rip = tp->tf_rip; mcp->mc_cs = tp->tf_cs; mcp->mc_rsp = tp->tf_rsp; mcp->mc_ss = tp->tf_ss; mcp->mc_ds = tp->tf_ds; mcp->mc_es = tp->tf_es; mcp->mc_fs = tp->tf_fs; mcp->mc_gs = tp->tf_gs; mcp->mc_flags = tp->tf_flags; mcp->mc_len = sizeof(*mcp); get_fpcontext(td, mcp, NULL, 0); mcp->mc_fsbase = pcb->pcb_fsbase; mcp->mc_gsbase = pcb->pcb_gsbase; mcp->mc_xfpustate = 0; mcp->mc_xfpustate_len = 0; bzero(mcp->mc_spare, sizeof(mcp->mc_spare)); return (0); } /* * Set machine context. * * However, we don't set any but the user modifiable flags, and we won't * touch the cs selector. */ int set_mcontext(struct thread *td, mcontext_t *mcp) { struct pcb *pcb; struct trapframe *tp; char *xfpustate; long rflags; int ret; pcb = td->td_pcb; tp = td->td_frame; if (mcp->mc_len != sizeof(*mcp) || (mcp->mc_flags & ~_MC_FLAG_MASK) != 0) return (EINVAL); rflags = (mcp->mc_rflags & PSL_USERCHANGE) | (tp->tf_rflags & ~PSL_USERCHANGE); if (mcp->mc_flags & _MC_HASFPXSTATE) { if (mcp->mc_xfpustate_len > cpu_max_ext_state_size - sizeof(struct savefpu)) return (EINVAL); xfpustate = __builtin_alloca(mcp->mc_xfpustate_len); ret = copyin((void *)mcp->mc_xfpustate, xfpustate, mcp->mc_xfpustate_len); if (ret != 0) return (ret); } else xfpustate = NULL; ret = set_fpcontext(td, mcp, xfpustate, mcp->mc_xfpustate_len); if (ret != 0) return (ret); tp->tf_r15 = mcp->mc_r15; tp->tf_r14 = mcp->mc_r14; tp->tf_r13 = mcp->mc_r13; tp->tf_r12 = mcp->mc_r12; tp->tf_r11 = mcp->mc_r11; tp->tf_r10 = mcp->mc_r10; tp->tf_r9 = mcp->mc_r9; tp->tf_r8 = mcp->mc_r8; tp->tf_rdi = mcp->mc_rdi; tp->tf_rsi = mcp->mc_rsi; tp->tf_rbp = mcp->mc_rbp; tp->tf_rbx = mcp->mc_rbx; tp->tf_rdx = mcp->mc_rdx; tp->tf_rcx = mcp->mc_rcx; tp->tf_rax = mcp->mc_rax; tp->tf_rip = mcp->mc_rip; tp->tf_rflags = rflags; tp->tf_rsp = mcp->mc_rsp; tp->tf_ss = mcp->mc_ss; tp->tf_flags = mcp->mc_flags; if (tp->tf_flags & TF_HASSEGS) { tp->tf_ds = mcp->mc_ds; tp->tf_es = mcp->mc_es; tp->tf_fs = mcp->mc_fs; tp->tf_gs = mcp->mc_gs; } if (mcp->mc_flags & _MC_HASBASES) { pcb->pcb_fsbase = mcp->mc_fsbase; pcb->pcb_gsbase = mcp->mc_gsbase; } set_pcb_flags(pcb, PCB_FULL_IRET); return (0); } static void get_fpcontext(struct thread *td, mcontext_t *mcp, char *xfpusave, size_t xfpusave_len) { size_t max_len, len; mcp->mc_ownedfp = fpugetregs(td); bcopy(get_pcb_user_save_td(td), &mcp->mc_fpstate[0], sizeof(mcp->mc_fpstate)); mcp->mc_fpformat = fpuformat(); if (!use_xsave || xfpusave_len == 0) return; max_len = cpu_max_ext_state_size - sizeof(struct savefpu); len = xfpusave_len; if (len > max_len) { len = max_len; bzero(xfpusave + max_len, len - max_len); } mcp->mc_flags |= _MC_HASFPXSTATE; mcp->mc_xfpustate_len = len; bcopy(get_pcb_user_save_td(td) + 1, xfpusave, len); } static int set_fpcontext(struct thread *td, mcontext_t *mcp, char *xfpustate, size_t xfpustate_len) { struct savefpu *fpstate; int error; if (mcp->mc_fpformat == _MC_FPFMT_NODEV) return (0); else if (mcp->mc_fpformat != _MC_FPFMT_XMM) return (EINVAL); else if (mcp->mc_ownedfp == _MC_FPOWNED_NONE) { /* We don't care what state is left in the FPU or PCB. */ fpstate_drop(td); error = 0; } else if (mcp->mc_ownedfp == _MC_FPOWNED_FPU || mcp->mc_ownedfp == _MC_FPOWNED_PCB) { fpstate = (struct savefpu *)&mcp->mc_fpstate; fpstate->sv_env.en_mxcsr &= cpu_mxcsr_mask; error = fpusetregs(td, fpstate, xfpustate, xfpustate_len); } else return (EINVAL); return (error); } void fpstate_drop(struct thread *td) { KASSERT(PCB_USER_FPU(td->td_pcb), ("fpstate_drop: kernel-owned fpu")); critical_enter(); if (PCPU_GET(fpcurthread) == td) fpudrop(); /* * XXX force a full drop of the fpu. The above only drops it if we * owned it. * * XXX I don't much like fpugetuserregs()'s semantics of doing a full * drop. Dropping only to the pcb matches fnsave's behaviour. * We only need to drop to !PCB_INITDONE in sendsig(). But * sendsig() is the only caller of fpugetuserregs()... perhaps we just * have too many layers. */ clear_pcb_flags(curthread->td_pcb, PCB_FPUINITDONE | PCB_USERFPUINITDONE); critical_exit(); } int fill_dbregs(struct thread *td, struct dbreg *dbregs) { struct pcb *pcb; if (td == NULL) { dbregs->dr[0] = rdr0(); dbregs->dr[1] = rdr1(); dbregs->dr[2] = rdr2(); dbregs->dr[3] = rdr3(); dbregs->dr[6] = rdr6(); dbregs->dr[7] = rdr7(); } else { pcb = td->td_pcb; dbregs->dr[0] = pcb->pcb_dr0; dbregs->dr[1] = pcb->pcb_dr1; dbregs->dr[2] = pcb->pcb_dr2; dbregs->dr[3] = pcb->pcb_dr3; dbregs->dr[6] = pcb->pcb_dr6; dbregs->dr[7] = pcb->pcb_dr7; } dbregs->dr[4] = 0; dbregs->dr[5] = 0; dbregs->dr[8] = 0; dbregs->dr[9] = 0; dbregs->dr[10] = 0; dbregs->dr[11] = 0; dbregs->dr[12] = 0; dbregs->dr[13] = 0; dbregs->dr[14] = 0; dbregs->dr[15] = 0; return (0); } int set_dbregs(struct thread *td, struct dbreg *dbregs) { struct pcb *pcb; int i; if (td == NULL) { load_dr0(dbregs->dr[0]); load_dr1(dbregs->dr[1]); load_dr2(dbregs->dr[2]); load_dr3(dbregs->dr[3]); load_dr6(dbregs->dr[6]); load_dr7(dbregs->dr[7]); } else { /* * Don't let an illegal value for dr7 get set. Specifically, * check for undefined settings. Setting these bit patterns * result in undefined behaviour and can lead to an unexpected * TRCTRAP or a general protection fault right here. * Upper bits of dr6 and dr7 must not be set */ for (i = 0; i < 4; i++) { if (DBREG_DR7_ACCESS(dbregs->dr[7], i) == 0x02) return (EINVAL); if (td->td_frame->tf_cs == _ucode32sel && DBREG_DR7_LEN(dbregs->dr[7], i) == DBREG_DR7_LEN_8) return (EINVAL); } if ((dbregs->dr[6] & 0xffffffff00000000ul) != 0 || (dbregs->dr[7] & 0xffffffff00000000ul) != 0) return (EINVAL); pcb = td->td_pcb; /* * Don't let a process set a breakpoint that is not within the * process's address space. If a process could do this, it * could halt the system by setting a breakpoint in the kernel * (if ddb was enabled). Thus, we need to check to make sure * that no breakpoints are being enabled for addresses outside * process's address space. * * XXX - what about when the watched area of the user's * address space is written into from within the kernel * ... wouldn't that still cause a breakpoint to be generated * from within kernel mode? */ if (DBREG_DR7_ENABLED(dbregs->dr[7], 0)) { /* dr0 is enabled */ if (dbregs->dr[0] >= VM_MAXUSER_ADDRESS) return (EINVAL); } if (DBREG_DR7_ENABLED(dbregs->dr[7], 1)) { /* dr1 is enabled */ if (dbregs->dr[1] >= VM_MAXUSER_ADDRESS) return (EINVAL); } if (DBREG_DR7_ENABLED(dbregs->dr[7], 2)) { /* dr2 is enabled */ if (dbregs->dr[2] >= VM_MAXUSER_ADDRESS) return (EINVAL); } if (DBREG_DR7_ENABLED(dbregs->dr[7], 3)) { /* dr3 is enabled */ if (dbregs->dr[3] >= VM_MAXUSER_ADDRESS) return (EINVAL); } pcb->pcb_dr0 = dbregs->dr[0]; pcb->pcb_dr1 = dbregs->dr[1]; pcb->pcb_dr2 = dbregs->dr[2]; pcb->pcb_dr3 = dbregs->dr[3]; pcb->pcb_dr6 = dbregs->dr[6]; pcb->pcb_dr7 = dbregs->dr[7]; set_pcb_flags(pcb, PCB_DBREGS); } return (0); } void reset_dbregs(void) { load_dr7(0); /* Turn off the control bits first */ load_dr0(0); load_dr1(0); load_dr2(0); load_dr3(0); load_dr6(0); } /* * Return > 0 if a hardware breakpoint has been hit, and the * breakpoint was in user space. Return 0, otherwise. */ int user_dbreg_trap(void) { u_int64_t dr7, dr6; /* debug registers dr6 and dr7 */ u_int64_t bp; /* breakpoint bits extracted from dr6 */ int nbp; /* number of breakpoints that triggered */ caddr_t addr[4]; /* breakpoint addresses */ int i; dr7 = rdr7(); if ((dr7 & 0x000000ff) == 0) { /* * all GE and LE bits in the dr7 register are zero, * thus the trap couldn't have been caused by the * hardware debug registers */ return 0; } nbp = 0; dr6 = rdr6(); bp = dr6 & 0x0000000f; if (!bp) { /* * None of the breakpoint bits are set meaning this * trap was not caused by any of the debug registers */ return 0; } /* * at least one of the breakpoints were hit, check to see * which ones and if any of them are user space addresses */ if (bp & 0x01) { addr[nbp++] = (caddr_t)rdr0(); } if (bp & 0x02) { addr[nbp++] = (caddr_t)rdr1(); } if (bp & 0x04) { addr[nbp++] = (caddr_t)rdr2(); } if (bp & 0x08) { addr[nbp++] = (caddr_t)rdr3(); } for (i = 0; i < nbp; i++) { if (addr[i] < (caddr_t)VM_MAXUSER_ADDRESS) { /* * addr[i] is in user space */ return nbp; } } /* * None of the breakpoints are in user space. */ return 0; } #ifdef KDB /* * Provide inb() and outb() as functions. They are normally only available as * inline functions, thus cannot be called from the debugger. */ /* silence compiler warnings */ u_char inb_(u_short); void outb_(u_short, u_char); u_char inb_(u_short port) { return inb(port); } void outb_(u_short port, u_char data) { outb(port, data); } #endif /* KDB */ Index: projects/release-arm-redux/sys/amd64/amd64/mp_machdep.c =================================================================== --- projects/release-arm-redux/sys/amd64/amd64/mp_machdep.c (revision 282691) +++ projects/release-arm-redux/sys/amd64/amd64/mp_machdep.c (revision 282692) @@ -1,730 +1,608 @@ /*- * Copyright (c) 1996, by Steve Passe * Copyright (c) 2003, by Peter Wemm * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. The name of the developer may NOT be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_cpu.h" #include "opt_ddb.h" #include "opt_kstack_pages.h" #include "opt_sched.h" #include "opt_smp.h" #include #include #include #include #ifdef GPROF #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define WARMBOOT_TARGET 0 #define WARMBOOT_OFF (KERNBASE + 0x0467) #define WARMBOOT_SEG (KERNBASE + 0x0469) #define CMOS_REG (0x70) #define CMOS_DATA (0x71) #define BIOS_RESET (0x0f) #define BIOS_WARM (0x0a) extern struct pcpu __pcpu[]; /* Temporary variables for init_secondary() */ char *doublefault_stack; char *nmi_stack; /* Variables needed for SMP tlb shootdown. */ -vm_offset_t smp_tlb_addr2; -struct invpcid_descr smp_tlb_invpcid; +static vm_offset_t smp_tlb_addr1, smp_tlb_addr2; +static pmap_t smp_tlb_pmap; volatile int smp_tlb_wait; -uint64_t pcid_cr3; -pmap_t smp_tlb_pmap; -extern int invpcid_works; extern inthand_t IDTVEC(fast_syscall), IDTVEC(fast_syscall32); /* * Local data and functions. */ static int start_ap(int apic_id); static u_int bootMP_size; static u_int boot_address; /* * Calculate usable address in base memory for AP trampoline code. */ u_int mp_bootaddress(u_int basemem) { bootMP_size = mptramp_end - mptramp_start; boot_address = trunc_page(basemem * 1024); /* round down to 4k boundary */ if (((basemem * 1024) - boot_address) < bootMP_size) boot_address -= PAGE_SIZE; /* not enough, lower by 4k */ /* 3 levels of page table pages */ mptramp_pagetables = boot_address - (PAGE_SIZE * 3); return mptramp_pagetables; } /* * Initialize the IPI handlers and start up the AP's. */ void cpu_mp_start(void) { int i; /* Initialize the logical ID to APIC ID table. */ for (i = 0; i < MAXCPU; i++) { cpu_apic_ids[i] = -1; cpu_ipi_pending[i] = 0; } /* Install an inter-CPU IPI for TLB invalidation */ if (pmap_pcid_enabled) { - setidt(IPI_INVLTLB, IDTVEC(invltlb_pcid), SDT_SYSIGT, - SEL_KPL, 0); - setidt(IPI_INVLPG, IDTVEC(invlpg_pcid), SDT_SYSIGT, - SEL_KPL, 0); + if (invpcid_works) { + setidt(IPI_INVLTLB, IDTVEC(invltlb_invpcid), + SDT_SYSIGT, SEL_KPL, 0); + } else { + setidt(IPI_INVLTLB, IDTVEC(invltlb_pcid), SDT_SYSIGT, + SEL_KPL, 0); + } } else { setidt(IPI_INVLTLB, IDTVEC(invltlb), SDT_SYSIGT, SEL_KPL, 0); - setidt(IPI_INVLPG, IDTVEC(invlpg), SDT_SYSIGT, SEL_KPL, 0); } + setidt(IPI_INVLPG, IDTVEC(invlpg), SDT_SYSIGT, SEL_KPL, 0); setidt(IPI_INVLRNG, IDTVEC(invlrng), SDT_SYSIGT, SEL_KPL, 0); /* Install an inter-CPU IPI for cache invalidation. */ setidt(IPI_INVLCACHE, IDTVEC(invlcache), SDT_SYSIGT, SEL_KPL, 0); /* Install an inter-CPU IPI for all-CPU rendezvous */ setidt(IPI_RENDEZVOUS, IDTVEC(rendezvous), SDT_SYSIGT, SEL_KPL, 0); /* Install generic inter-CPU IPI handler */ setidt(IPI_BITMAP_VECTOR, IDTVEC(ipi_intr_bitmap_handler), SDT_SYSIGT, SEL_KPL, 0); /* Install an inter-CPU IPI for CPU stop/restart */ setidt(IPI_STOP, IDTVEC(cpustop), SDT_SYSIGT, SEL_KPL, 0); /* Install an inter-CPU IPI for CPU suspend/resume */ setidt(IPI_SUSPEND, IDTVEC(cpususpend), SDT_SYSIGT, SEL_KPL, 0); /* Set boot_cpu_id if needed. */ if (boot_cpu_id == -1) { boot_cpu_id = PCPU_GET(apic_id); cpu_info[boot_cpu_id].cpu_bsp = 1; } else KASSERT(boot_cpu_id == PCPU_GET(apic_id), ("BSP's APIC ID doesn't match boot_cpu_id")); /* Probe logical/physical core configuration. */ topo_probe(); assign_cpu_ids(); /* Start each Application Processor */ init_ops.start_all_aps(); set_interrupt_apic_ids(); } /* * AP CPU's call this to initialize themselves. */ void init_secondary(void) { struct pcpu *pc; struct nmi_pcpu *np; u_int64_t msr, cr0; int cpu, gsel_tss, x; struct region_descriptor ap_gdt; /* Set by the startup code for us to use */ cpu = bootAP; /* Init tss */ common_tss[cpu] = common_tss[0]; common_tss[cpu].tss_rsp0 = 0; /* not used until after switch */ common_tss[cpu].tss_iobase = sizeof(struct amd64tss) + IOPERM_BITMAP_SIZE; common_tss[cpu].tss_ist1 = (long)&doublefault_stack[PAGE_SIZE]; /* The NMI stack runs on IST2. */ np = ((struct nmi_pcpu *) &nmi_stack[PAGE_SIZE]) - 1; common_tss[cpu].tss_ist2 = (long) np; /* Prepare private GDT */ gdt_segs[GPROC0_SEL].ssd_base = (long) &common_tss[cpu]; for (x = 0; x < NGDT; x++) { if (x != GPROC0_SEL && x != (GPROC0_SEL + 1) && x != GUSERLDT_SEL && x != (GUSERLDT_SEL + 1)) ssdtosd(&gdt_segs[x], &gdt[NGDT * cpu + x]); } ssdtosyssd(&gdt_segs[GPROC0_SEL], (struct system_segment_descriptor *)&gdt[NGDT * cpu + GPROC0_SEL]); ap_gdt.rd_limit = NGDT * sizeof(gdt[0]) - 1; ap_gdt.rd_base = (long) &gdt[NGDT * cpu]; lgdt(&ap_gdt); /* does magic intra-segment return */ /* Get per-cpu data */ pc = &__pcpu[cpu]; /* prime data page for it to use */ pcpu_init(pc, cpu, sizeof(struct pcpu)); dpcpu_init(dpcpu, cpu); pc->pc_apic_id = cpu_apic_ids[cpu]; pc->pc_prvspace = pc; pc->pc_curthread = 0; pc->pc_tssp = &common_tss[cpu]; pc->pc_commontssp = &common_tss[cpu]; pc->pc_rsp0 = 0; pc->pc_tss = (struct system_segment_descriptor *)&gdt[NGDT * cpu + GPROC0_SEL]; pc->pc_fs32p = &gdt[NGDT * cpu + GUFS32_SEL]; pc->pc_gs32p = &gdt[NGDT * cpu + GUGS32_SEL]; pc->pc_ldt = (struct system_segment_descriptor *)&gdt[NGDT * cpu + GUSERLDT_SEL]; + pc->pc_curpmap = kernel_pmap; + pc->pc_pcid_gen = 1; + pc->pc_pcid_next = PMAP_PCID_KERN + 1; /* Save the per-cpu pointer for use by the NMI handler. */ np->np_pcpu = (register_t) pc; wrmsr(MSR_FSBASE, 0); /* User value */ wrmsr(MSR_GSBASE, (u_int64_t)pc); wrmsr(MSR_KGSBASE, (u_int64_t)pc); /* XXX User value while we're in the kernel */ lidt(&r_idt); gsel_tss = GSEL(GPROC0_SEL, SEL_KPL); ltr(gsel_tss); /* * Set to a known state: * Set by mpboot.s: CR0_PG, CR0_PE * Set by cpu_setregs: CR0_NE, CR0_MP, CR0_TS, CR0_WP, CR0_AM */ cr0 = rcr0(); cr0 &= ~(CR0_CD | CR0_NW | CR0_EM); load_cr0(cr0); /* Set up the fast syscall stuff */ msr = rdmsr(MSR_EFER) | EFER_SCE; wrmsr(MSR_EFER, msr); wrmsr(MSR_LSTAR, (u_int64_t)IDTVEC(fast_syscall)); wrmsr(MSR_CSTAR, (u_int64_t)IDTVEC(fast_syscall32)); msr = ((u_int64_t)GSEL(GCODE_SEL, SEL_KPL) << 32) | ((u_int64_t)GSEL(GUCODE32_SEL, SEL_UPL) << 48); wrmsr(MSR_STAR, msr); wrmsr(MSR_SF_MASK, PSL_NT|PSL_T|PSL_I|PSL_C|PSL_D); /* signal our startup to the BSP. */ mp_naps++; /* Spin until the BSP releases the AP's. */ while (!aps_ready) ia32_pause(); init_secondary_tail(); } /******************************************************************* * local functions and data */ /* * start each AP in our list */ int native_start_all_aps(void) { vm_offset_t va = boot_address + KERNBASE; u_int64_t *pt4, *pt3, *pt2; u_int32_t mpbioswarmvec; int apic_id, cpu, i; u_char mpbiosreason; mtx_init(&ap_boot_mtx, "ap boot", NULL, MTX_SPIN); /* install the AP 1st level boot code */ pmap_kenter(va, boot_address); pmap_invalidate_page(kernel_pmap, va); bcopy(mptramp_start, (void *)va, bootMP_size); /* Locate the page tables, they'll be below the trampoline */ pt4 = (u_int64_t *)(uintptr_t)(mptramp_pagetables + KERNBASE); pt3 = pt4 + (PAGE_SIZE) / sizeof(u_int64_t); pt2 = pt3 + (PAGE_SIZE) / sizeof(u_int64_t); /* Create the initial 1GB replicated page tables */ for (i = 0; i < 512; i++) { /* Each slot of the level 4 pages points to the same level 3 page */ pt4[i] = (u_int64_t)(uintptr_t)(mptramp_pagetables + PAGE_SIZE); pt4[i] |= PG_V | PG_RW | PG_U; /* Each slot of the level 3 pages points to the same level 2 page */ pt3[i] = (u_int64_t)(uintptr_t)(mptramp_pagetables + (2 * PAGE_SIZE)); pt3[i] |= PG_V | PG_RW | PG_U; /* The level 2 page slots are mapped with 2MB pages for 1GB. */ pt2[i] = i * (2 * 1024 * 1024); pt2[i] |= PG_V | PG_RW | PG_PS | PG_U; } /* save the current value of the warm-start vector */ mpbioswarmvec = *((u_int32_t *) WARMBOOT_OFF); outb(CMOS_REG, BIOS_RESET); mpbiosreason = inb(CMOS_DATA); /* setup a vector to our boot code */ *((volatile u_short *) WARMBOOT_OFF) = WARMBOOT_TARGET; *((volatile u_short *) WARMBOOT_SEG) = (boot_address >> 4); outb(CMOS_REG, BIOS_RESET); outb(CMOS_DATA, BIOS_WARM); /* 'warm-start' */ /* start each AP */ for (cpu = 1; cpu < mp_ncpus; cpu++) { apic_id = cpu_apic_ids[cpu]; /* allocate and set up an idle stack data page */ bootstacks[cpu] = (void *)kmem_malloc(kernel_arena, KSTACK_PAGES * PAGE_SIZE, M_WAITOK | M_ZERO); doublefault_stack = (char *)kmem_malloc(kernel_arena, PAGE_SIZE, M_WAITOK | M_ZERO); nmi_stack = (char *)kmem_malloc(kernel_arena, PAGE_SIZE, M_WAITOK | M_ZERO); dpcpu = (void *)kmem_malloc(kernel_arena, DPCPU_SIZE, M_WAITOK | M_ZERO); bootSTK = (char *)bootstacks[cpu] + KSTACK_PAGES * PAGE_SIZE - 8; bootAP = cpu; /* attempt to start the Application Processor */ if (!start_ap(apic_id)) { /* restore the warmstart vector */ *(u_int32_t *) WARMBOOT_OFF = mpbioswarmvec; panic("AP #%d (PHY# %d) failed!", cpu, apic_id); } CPU_SET(cpu, &all_cpus); /* record AP in CPU map */ } /* restore the warmstart vector */ *(u_int32_t *) WARMBOOT_OFF = mpbioswarmvec; outb(CMOS_REG, BIOS_RESET); outb(CMOS_DATA, mpbiosreason); /* number of APs actually started */ return mp_naps; } /* * This function starts the AP (application processor) identified * by the APIC ID 'physicalCpu'. It does quite a "song and dance" * to accomplish this. This is necessary because of the nuances * of the different hardware we might encounter. It isn't pretty, * but it seems to work. */ static int start_ap(int apic_id) { int vector, ms; int cpus; /* calculate the vector */ vector = (boot_address >> 12) & 0xff; /* used as a watchpoint to signal AP startup */ cpus = mp_naps; ipi_startup(apic_id, vector); /* Wait up to 5 seconds for it to start. */ for (ms = 0; ms < 5000; ms++) { if (mp_naps > cpus) return 1; /* return SUCCESS */ DELAY(1000); } return 0; /* return FAILURE */ } /* - * Flush the TLB on all other CPU's + * Flush the TLB on other CPU's */ -static void -smp_tlb_shootdown(u_int vector, pmap_t pmap, vm_offset_t addr1, - vm_offset_t addr2) -{ - u_int ncpu; - ncpu = mp_ncpus - 1; /* does not shootdown self */ - if (ncpu < 1) - return; /* no other cpus */ - if (!(read_rflags() & PSL_I)) - panic("%s: interrupts disabled", __func__); - mtx_lock_spin(&smp_ipi_mtx); - smp_tlb_invpcid.addr = addr1; - if (pmap == NULL) { - smp_tlb_invpcid.pcid = 0; - } else { - smp_tlb_invpcid.pcid = pmap->pm_pcid; - pcid_cr3 = pmap->pm_cr3; - } - smp_tlb_addr2 = addr2; - smp_tlb_pmap = pmap; - atomic_store_rel_int(&smp_tlb_wait, 0); - ipi_all_but_self(vector); - while (smp_tlb_wait < ncpu) - ia32_pause(); - mtx_unlock_spin(&smp_ipi_mtx); -} - static void smp_targeted_tlb_shootdown(cpuset_t mask, u_int vector, pmap_t pmap, vm_offset_t addr1, vm_offset_t addr2) { int cpu, ncpu, othercpus; - othercpus = mp_ncpus - 1; + othercpus = mp_ncpus - 1; /* does not shootdown self */ + + /* + * Check for other cpus. Return if none. + */ if (CPU_ISFULLSET(&mask)) { if (othercpus < 1) return; } else { CPU_CLR(PCPU_GET(cpuid), &mask); if (CPU_EMPTY(&mask)) return; } + if (!(read_rflags() & PSL_I)) panic("%s: interrupts disabled", __func__); mtx_lock_spin(&smp_ipi_mtx); - smp_tlb_invpcid.addr = addr1; - if (pmap == NULL) { - smp_tlb_invpcid.pcid = 0; - } else { - smp_tlb_invpcid.pcid = pmap->pm_pcid; - pcid_cr3 = pmap->pm_cr3; - } + smp_tlb_addr1 = addr1; smp_tlb_addr2 = addr2; smp_tlb_pmap = pmap; atomic_store_rel_int(&smp_tlb_wait, 0); if (CPU_ISFULLSET(&mask)) { ncpu = othercpus; ipi_all_but_self(vector); } else { ncpu = 0; while ((cpu = CPU_FFS(&mask)) != 0) { cpu--; CPU_CLR(cpu, &mask); CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__, cpu, vector); ipi_send_cpu(cpu, vector); ncpu++; } } while (smp_tlb_wait < ncpu) ia32_pause(); mtx_unlock_spin(&smp_ipi_mtx); } void -smp_invlpg(pmap_t pmap, vm_offset_t addr) -{ - - if (smp_started) { - smp_tlb_shootdown(IPI_INVLPG, pmap, addr, 0); -#ifdef COUNT_XINVLTLB_HITS - ipi_page++; -#endif - } -} - -void -smp_invlpg_range(pmap_t pmap, vm_offset_t addr1, vm_offset_t addr2) -{ - - if (smp_started) { - smp_tlb_shootdown(IPI_INVLRNG, pmap, addr1, addr2); -#ifdef COUNT_XINVLTLB_HITS - ipi_range++; - ipi_range_size += (addr2 - addr1) / PAGE_SIZE; -#endif - } -} - -void smp_masked_invltlb(cpuset_t mask, pmap_t pmap) { if (smp_started) { smp_targeted_tlb_shootdown(mask, IPI_INVLTLB, pmap, 0, 0); #ifdef COUNT_XINVLTLB_HITS - ipi_masked_global++; + ipi_global++; #endif } } void -smp_masked_invlpg(cpuset_t mask, pmap_t pmap, vm_offset_t addr) +smp_masked_invlpg(cpuset_t mask, vm_offset_t addr) { if (smp_started) { - smp_targeted_tlb_shootdown(mask, IPI_INVLPG, pmap, addr, 0); + smp_targeted_tlb_shootdown(mask, IPI_INVLPG, NULL, addr, 0); #ifdef COUNT_XINVLTLB_HITS - ipi_masked_page++; + ipi_page++; #endif } } void -smp_masked_invlpg_range(cpuset_t mask, pmap_t pmap, vm_offset_t addr1, - vm_offset_t addr2) +smp_masked_invlpg_range(cpuset_t mask, vm_offset_t addr1, vm_offset_t addr2) { if (smp_started) { - smp_targeted_tlb_shootdown(mask, IPI_INVLRNG, pmap, addr1, - addr2); + smp_targeted_tlb_shootdown(mask, IPI_INVLRNG, NULL, + addr1, addr2); #ifdef COUNT_XINVLTLB_HITS - ipi_masked_range++; - ipi_masked_range_size += (addr2 - addr1) / PAGE_SIZE; + ipi_range++; + ipi_range_size += (addr2 - addr1) / PAGE_SIZE; #endif } } void smp_cache_flush(void) { - if (smp_started) - smp_tlb_shootdown(IPI_INVLCACHE, NULL, 0, 0); -} - -void -smp_invltlb(pmap_t pmap) -{ - if (smp_started) { - smp_tlb_shootdown(IPI_INVLTLB, pmap, 0, 0); -#ifdef COUNT_XINVLTLB_HITS - ipi_global++; -#endif + smp_targeted_tlb_shootdown(all_cpus, IPI_INVLCACHE, NULL, + 0, 0); } } /* * Handlers for TLB related IPIs */ void invltlb_handler(void) { #ifdef COUNT_XINVLTLB_HITS xhits_gbl[PCPU_GET(cpuid)]++; #endif /* COUNT_XINVLTLB_HITS */ #ifdef COUNT_IPIS (*ipi_invltlb_counts[PCPU_GET(cpuid)])++; #endif /* COUNT_IPIS */ invltlb(); atomic_add_int(&smp_tlb_wait, 1); } void -invltlb_pcid_handler(void) +invltlb_invpcid_handler(void) { - uint64_t cr3; - u_int cpuid; + struct invpcid_descr d; + #ifdef COUNT_XINVLTLB_HITS xhits_gbl[PCPU_GET(cpuid)]++; #endif /* COUNT_XINVLTLB_HITS */ #ifdef COUNT_IPIS (*ipi_invltlb_counts[PCPU_GET(cpuid)])++; #endif /* COUNT_IPIS */ - if (smp_tlb_invpcid.pcid != (uint64_t)-1 && - smp_tlb_invpcid.pcid != 0) { - if (invpcid_works) { - invpcid(&smp_tlb_invpcid, INVPCID_CTX); - } else { - /* Otherwise reload %cr3 twice. */ - cr3 = rcr3(); - if (cr3 != pcid_cr3) { - load_cr3(pcid_cr3); - cr3 |= CR3_PCID_SAVE; - } - load_cr3(cr3); - } - } else { - invltlb_globpcid(); - } - if (smp_tlb_pmap != NULL) { - cpuid = PCPU_GET(cpuid); - if (!CPU_ISSET(cpuid, &smp_tlb_pmap->pm_active)) - CPU_CLR_ATOMIC(cpuid, &smp_tlb_pmap->pm_save); - } - + d.pcid = smp_tlb_pmap->pm_pcids[PCPU_GET(cpuid)].pm_pcid; + d.pad = 0; + d.addr = 0; + invpcid(&d, smp_tlb_pmap == kernel_pmap ? INVPCID_CTXGLOB : + INVPCID_CTX); atomic_add_int(&smp_tlb_wait, 1); } void -invlpg_handler(void) +invltlb_pcid_handler(void) { #ifdef COUNT_XINVLTLB_HITS - xhits_pg[PCPU_GET(cpuid)]++; + xhits_gbl[PCPU_GET(cpuid)]++; #endif /* COUNT_XINVLTLB_HITS */ #ifdef COUNT_IPIS - (*ipi_invlpg_counts[PCPU_GET(cpuid)])++; + (*ipi_invltlb_counts[PCPU_GET(cpuid)])++; #endif /* COUNT_IPIS */ - invlpg(smp_tlb_invpcid.addr); + if (smp_tlb_pmap == kernel_pmap) { + invltlb_globpcid(); + } else { + /* + * The current pmap might not be equal to + * smp_tlb_pmap. The clearing of the pm_gen in + * pmap_invalidate_all() takes care of TLB + * invalidation when switching to the pmap on this + * CPU. + */ + if (PCPU_GET(curpmap) == smp_tlb_pmap) { + load_cr3(smp_tlb_pmap->pm_cr3 | + smp_tlb_pmap->pm_pcids[PCPU_GET(cpuid)].pm_pcid); + } + } atomic_add_int(&smp_tlb_wait, 1); } void -invlpg_pcid_handler(void) +invlpg_handler(void) { - uint64_t cr3; #ifdef COUNT_XINVLTLB_HITS xhits_pg[PCPU_GET(cpuid)]++; #endif /* COUNT_XINVLTLB_HITS */ #ifdef COUNT_IPIS (*ipi_invlpg_counts[PCPU_GET(cpuid)])++; #endif /* COUNT_IPIS */ - if (smp_tlb_invpcid.pcid == (uint64_t)-1) { - invltlb_globpcid(); - } else if (smp_tlb_invpcid.pcid == 0) { - invlpg(smp_tlb_invpcid.addr); - } else if (invpcid_works) { - invpcid(&smp_tlb_invpcid, INVPCID_ADDR); - } else { - /* - * PCID supported, but INVPCID is not. - * Temporarily switch to the target address - * space and do INVLPG. - */ - cr3 = rcr3(); - if (cr3 != pcid_cr3) - load_cr3(pcid_cr3 | CR3_PCID_SAVE); - invlpg(smp_tlb_invpcid.addr); - load_cr3(cr3 | CR3_PCID_SAVE); - } - + invlpg(smp_tlb_addr1); atomic_add_int(&smp_tlb_wait, 1); } -static inline void -invlpg_range(vm_offset_t start, vm_offset_t end) -{ - - do { - invlpg(start); - start += PAGE_SIZE; - } while (start < end); -} - void invlrng_handler(void) { - struct invpcid_descr d; vm_offset_t addr; - uint64_t cr3; - u_int cpuid; + #ifdef COUNT_XINVLTLB_HITS xhits_rng[PCPU_GET(cpuid)]++; #endif /* COUNT_XINVLTLB_HITS */ #ifdef COUNT_IPIS (*ipi_invlrng_counts[PCPU_GET(cpuid)])++; #endif /* COUNT_IPIS */ - addr = smp_tlb_invpcid.addr; - if (pmap_pcid_enabled) { - if (smp_tlb_invpcid.pcid == 0) { - /* - * kernel pmap - use invlpg to invalidate - * global mapping. - */ - invlpg_range(addr, smp_tlb_addr2); - } else if (smp_tlb_invpcid.pcid == (uint64_t)-1) { - invltlb_globpcid(); - if (smp_tlb_pmap != NULL) { - cpuid = PCPU_GET(cpuid); - if (!CPU_ISSET(cpuid, &smp_tlb_pmap->pm_active)) - CPU_CLR_ATOMIC(cpuid, - &smp_tlb_pmap->pm_save); - } - } else if (invpcid_works) { - d = smp_tlb_invpcid; - do { - invpcid(&d, INVPCID_ADDR); - d.addr += PAGE_SIZE; - } while (d.addr <= smp_tlb_addr2); - } else { - cr3 = rcr3(); - if (cr3 != pcid_cr3) - load_cr3(pcid_cr3 | CR3_PCID_SAVE); - invlpg_range(addr, smp_tlb_addr2); - load_cr3(cr3 | CR3_PCID_SAVE); - } - } else { - invlpg_range(addr, smp_tlb_addr2); - } + addr = smp_tlb_addr1; + do { + invlpg(addr); + addr += PAGE_SIZE; + } while (addr < smp_tlb_addr2); atomic_add_int(&smp_tlb_wait, 1); } Index: projects/release-arm-redux/sys/amd64/amd64/pmap.c =================================================================== --- projects/release-arm-redux/sys/amd64/amd64/pmap.c (revision 282691) +++ projects/release-arm-redux/sys/amd64/amd64/pmap.c (revision 282692) @@ -1,7027 +1,6976 @@ /*- * Copyright (c) 1991 Regents of the University of California. * All rights reserved. * Copyright (c) 1994 John S. Dyson * All rights reserved. * Copyright (c) 1994 David Greenman * All rights reserved. * Copyright (c) 2003 Peter Wemm * All rights reserved. * Copyright (c) 2005-2010 Alan L. Cox * All rights reserved. * * This code is derived from software contributed to Berkeley by * the Systems Programming Group of the University of Utah Computer * Science Department and William Jolitz of UUNET Technologies Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by the University of * California, Berkeley and its contributors. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * from: @(#)pmap.c 7.7 (Berkeley) 5/12/91 */ /*- * Copyright (c) 2003 Networks Associates Technology, Inc. * All rights reserved. * * This software was developed for the FreeBSD Project by Jake Burkholder, * Safeport Network Services, and Network Associates Laboratories, the * Security Research Division of Network Associates, Inc. under * DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA * CHATS research program. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #define AMD64_NPT_AWARE #include __FBSDID("$FreeBSD$"); /* * Manages physical address maps. * * Since the information managed by this module is * also stored by the logical address mapping module, * this module may throw away valid virtual-to-physical * mappings at almost any time. However, invalidations * of virtual-to-physical mappings must be done as * requested. * * In order to cope with hardware architectures which * make virtual-to-physical map invalidates expensive, * this module may delay invalidate or reduced protection * operations until such time as they are actually * necessary. This module is given full information as * to which processors are currently using which maps, * and to when physical maps must be made correct. */ #include "opt_pmap.h" #include "opt_vm.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef SMP #include #endif static __inline boolean_t pmap_type_guest(pmap_t pmap) { return ((pmap->pm_type == PT_EPT) || (pmap->pm_type == PT_RVI)); } static __inline boolean_t pmap_emulate_ad_bits(pmap_t pmap) { return ((pmap->pm_flags & PMAP_EMULATE_AD_BITS) != 0); } static __inline pt_entry_t pmap_valid_bit(pmap_t pmap) { pt_entry_t mask; switch (pmap->pm_type) { case PT_X86: case PT_RVI: mask = X86_PG_V; break; case PT_EPT: if (pmap_emulate_ad_bits(pmap)) mask = EPT_PG_EMUL_V; else mask = EPT_PG_READ; break; default: panic("pmap_valid_bit: invalid pm_type %d", pmap->pm_type); } return (mask); } static __inline pt_entry_t pmap_rw_bit(pmap_t pmap) { pt_entry_t mask; switch (pmap->pm_type) { case PT_X86: case PT_RVI: mask = X86_PG_RW; break; case PT_EPT: if (pmap_emulate_ad_bits(pmap)) mask = EPT_PG_EMUL_RW; else mask = EPT_PG_WRITE; break; default: panic("pmap_rw_bit: invalid pm_type %d", pmap->pm_type); } return (mask); } static __inline pt_entry_t pmap_global_bit(pmap_t pmap) { pt_entry_t mask; switch (pmap->pm_type) { case PT_X86: mask = X86_PG_G; break; case PT_RVI: case PT_EPT: mask = 0; break; default: panic("pmap_global_bit: invalid pm_type %d", pmap->pm_type); } return (mask); } static __inline pt_entry_t pmap_accessed_bit(pmap_t pmap) { pt_entry_t mask; switch (pmap->pm_type) { case PT_X86: case PT_RVI: mask = X86_PG_A; break; case PT_EPT: if (pmap_emulate_ad_bits(pmap)) mask = EPT_PG_READ; else mask = EPT_PG_A; break; default: panic("pmap_accessed_bit: invalid pm_type %d", pmap->pm_type); } return (mask); } static __inline pt_entry_t pmap_modified_bit(pmap_t pmap) { pt_entry_t mask; switch (pmap->pm_type) { case PT_X86: case PT_RVI: mask = X86_PG_M; break; case PT_EPT: if (pmap_emulate_ad_bits(pmap)) mask = EPT_PG_WRITE; else mask = EPT_PG_M; break; default: panic("pmap_modified_bit: invalid pm_type %d", pmap->pm_type); } return (mask); } +extern struct pcpu __pcpu[]; + #if !defined(DIAGNOSTIC) #ifdef __GNUC_GNU_INLINE__ #define PMAP_INLINE __attribute__((__gnu_inline__)) inline #else #define PMAP_INLINE extern inline #endif #else #define PMAP_INLINE #endif #ifdef PV_STATS #define PV_STAT(x) do { x ; } while (0) #else #define PV_STAT(x) do { } while (0) #endif #define pa_index(pa) ((pa) >> PDRSHIFT) #define pa_to_pvh(pa) (&pv_table[pa_index(pa)]) #define NPV_LIST_LOCKS MAXCPU #define PHYS_TO_PV_LIST_LOCK(pa) \ (&pv_list_locks[pa_index(pa) % NPV_LIST_LOCKS]) #define CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, pa) do { \ struct rwlock **_lockp = (lockp); \ struct rwlock *_new_lock; \ \ _new_lock = PHYS_TO_PV_LIST_LOCK(pa); \ if (_new_lock != *_lockp) { \ if (*_lockp != NULL) \ rw_wunlock(*_lockp); \ *_lockp = _new_lock; \ rw_wlock(*_lockp); \ } \ } while (0) #define CHANGE_PV_LIST_LOCK_TO_VM_PAGE(lockp, m) \ CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, VM_PAGE_TO_PHYS(m)) #define RELEASE_PV_LIST_LOCK(lockp) do { \ struct rwlock **_lockp = (lockp); \ \ if (*_lockp != NULL) { \ rw_wunlock(*_lockp); \ *_lockp = NULL; \ } \ } while (0) #define VM_PAGE_TO_PV_LIST_LOCK(m) \ PHYS_TO_PV_LIST_LOCK(VM_PAGE_TO_PHYS(m)) struct pmap kernel_pmap_store; vm_offset_t virtual_avail; /* VA of first avail page (after kernel bss) */ vm_offset_t virtual_end; /* VA of last avail page (end of kernel AS) */ int nkpt; SYSCTL_INT(_machdep, OID_AUTO, nkpt, CTLFLAG_RD, &nkpt, 0, "Number of kernel page table pages allocated on bootup"); static int ndmpdp; vm_paddr_t dmaplimit; vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS; pt_entry_t pg_nx; static SYSCTL_NODE(_vm, OID_AUTO, pmap, CTLFLAG_RD, 0, "VM/pmap parameters"); static int pat_works = 1; SYSCTL_INT(_vm_pmap, OID_AUTO, pat_works, CTLFLAG_RD, &pat_works, 1, "Is page attribute table fully functional?"); static int pg_ps_enabled = 1; SYSCTL_INT(_vm_pmap, OID_AUTO, pg_ps_enabled, CTLFLAG_RDTUN | CTLFLAG_NOFETCH, &pg_ps_enabled, 0, "Are large page mappings enabled?"); #define PAT_INDEX_SIZE 8 static int pat_index[PAT_INDEX_SIZE]; /* cache mode to PAT index conversion */ static u_int64_t KPTphys; /* phys addr of kernel level 1 */ static u_int64_t KPDphys; /* phys addr of kernel level 2 */ u_int64_t KPDPphys; /* phys addr of kernel level 3 */ u_int64_t KPML4phys; /* phys addr of kernel level 4 */ static u_int64_t DMPDphys; /* phys addr of direct mapped level 2 */ static u_int64_t DMPDPphys; /* phys addr of direct mapped level 3 */ static int ndmpdpphys; /* number of DMPDPphys pages */ static struct rwlock_padalign pvh_global_lock; /* * Data for the pv entry allocation mechanism */ static TAILQ_HEAD(pch, pv_chunk) pv_chunks = TAILQ_HEAD_INITIALIZER(pv_chunks); static struct mtx pv_chunks_mutex; static struct rwlock pv_list_locks[NPV_LIST_LOCKS]; static struct md_page *pv_table; /* * All those kernel PT submaps that BSD is so fond of */ pt_entry_t *CMAP1 = 0; caddr_t CADDR1 = 0; static int pmap_flags = PMAP_PDE_SUPERPAGE; /* flags for x86 pmaps */ -static struct unrhdr pcid_unr; -static struct mtx pcid_mtx; int pmap_pcid_enabled = 0; SYSCTL_INT(_vm_pmap, OID_AUTO, pcid_enabled, CTLFLAG_RDTUN | CTLFLAG_NOFETCH, &pmap_pcid_enabled, 0, "Is TLB Context ID enabled ?"); int invpcid_works = 0; SYSCTL_INT(_vm_pmap, OID_AUTO, invpcid_works, CTLFLAG_RD, &invpcid_works, 0, "Is the invpcid instruction available ?"); static int pmap_pcid_save_cnt_proc(SYSCTL_HANDLER_ARGS) { int i; uint64_t res; res = 0; CPU_FOREACH(i) { res += cpuid_to_pcpu[i]->pc_pm_save_cnt; } return (sysctl_handle_64(oidp, &res, 0, req)); } SYSCTL_PROC(_vm_pmap, OID_AUTO, pcid_save_cnt, CTLTYPE_U64 | CTLFLAG_RW | CTLFLAG_MPSAFE, NULL, 0, pmap_pcid_save_cnt_proc, "QU", "Count of saved TLB context on switch"); /* * Crashdump maps. */ static caddr_t crashdumpmap; static void free_pv_chunk(struct pv_chunk *pc); static void free_pv_entry(pmap_t pmap, pv_entry_t pv); static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp); static int popcnt_pc_map_elem_pq(uint64_t elem); static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp); static void reserve_pv_entries(pmap_t pmap, int needed, struct rwlock **lockp); static void pmap_pv_demote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa, struct rwlock **lockp); static boolean_t pmap_pv_insert_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa, struct rwlock **lockp); static void pmap_pv_promote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa, struct rwlock **lockp); static void pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va); static pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap, vm_offset_t va); static int pmap_change_attr_locked(vm_offset_t va, vm_size_t size, int mode); static boolean_t pmap_demote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va); static boolean_t pmap_demote_pde_locked(pmap_t pmap, pd_entry_t *pde, vm_offset_t va, struct rwlock **lockp); static boolean_t pmap_demote_pdpe(pmap_t pmap, pdp_entry_t *pdpe, vm_offset_t va); static boolean_t pmap_enter_pde(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, struct rwlock **lockp); static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, vm_page_t mpte, struct rwlock **lockp); static void pmap_fill_ptp(pt_entry_t *firstpte, pt_entry_t newpte); static int pmap_insert_pt_page(pmap_t pmap, vm_page_t mpte); static void pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int mode); static vm_page_t pmap_lookup_pt_page(pmap_t pmap, vm_offset_t va); static void pmap_pde_attr(pd_entry_t *pde, int cache_bits, int mask); static void pmap_promote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va, struct rwlock **lockp); static boolean_t pmap_protect_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t sva, vm_prot_t prot); static void pmap_pte_attr(pt_entry_t *pte, int cache_bits, int mask); static int pmap_remove_pde(pmap_t pmap, pd_entry_t *pdq, vm_offset_t sva, struct spglist *free, struct rwlock **lockp); static int pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t sva, pd_entry_t ptepde, struct spglist *free, struct rwlock **lockp); static void pmap_remove_pt_page(pmap_t pmap, vm_page_t mpte); static void pmap_remove_page(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, struct spglist *free); static boolean_t pmap_try_insert_pv_entry(pmap_t pmap, vm_offset_t va, vm_page_t m, struct rwlock **lockp); static void pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, pd_entry_t newpde); static void pmap_update_pde_invalidate(pmap_t, vm_offset_t va, pd_entry_t pde); static vm_page_t _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex, struct rwlock **lockp); static vm_page_t pmap_allocpde(pmap_t pmap, vm_offset_t va, struct rwlock **lockp); static vm_page_t pmap_allocpte(pmap_t pmap, vm_offset_t va, struct rwlock **lockp); static void _pmap_unwire_ptp(pmap_t pmap, vm_offset_t va, vm_page_t m, struct spglist *free); static int pmap_unuse_pt(pmap_t, vm_offset_t, pd_entry_t, struct spglist *); static vm_offset_t pmap_kmem_choose(vm_offset_t addr); /* * Move the kernel virtual free pointer to the next * 2MB. This is used to help improve performance * by using a large (2MB) page for much of the kernel * (.text, .data, .bss) */ static vm_offset_t pmap_kmem_choose(vm_offset_t addr) { vm_offset_t newaddr = addr; newaddr = (addr + (NBPDR - 1)) & ~(NBPDR - 1); return (newaddr); } /********************/ /* Inline functions */ /********************/ /* Return a non-clipped PD index for a given VA */ static __inline vm_pindex_t pmap_pde_pindex(vm_offset_t va) { return (va >> PDRSHIFT); } /* Return various clipped indexes for a given VA */ static __inline vm_pindex_t pmap_pte_index(vm_offset_t va) { return ((va >> PAGE_SHIFT) & ((1ul << NPTEPGSHIFT) - 1)); } static __inline vm_pindex_t pmap_pde_index(vm_offset_t va) { return ((va >> PDRSHIFT) & ((1ul << NPDEPGSHIFT) - 1)); } static __inline vm_pindex_t pmap_pdpe_index(vm_offset_t va) { return ((va >> PDPSHIFT) & ((1ul << NPDPEPGSHIFT) - 1)); } static __inline vm_pindex_t pmap_pml4e_index(vm_offset_t va) { return ((va >> PML4SHIFT) & ((1ul << NPML4EPGSHIFT) - 1)); } /* Return a pointer to the PML4 slot that corresponds to a VA */ static __inline pml4_entry_t * pmap_pml4e(pmap_t pmap, vm_offset_t va) { return (&pmap->pm_pml4[pmap_pml4e_index(va)]); } /* Return a pointer to the PDP slot that corresponds to a VA */ static __inline pdp_entry_t * pmap_pml4e_to_pdpe(pml4_entry_t *pml4e, vm_offset_t va) { pdp_entry_t *pdpe; pdpe = (pdp_entry_t *)PHYS_TO_DMAP(*pml4e & PG_FRAME); return (&pdpe[pmap_pdpe_index(va)]); } /* Return a pointer to the PDP slot that corresponds to a VA */ static __inline pdp_entry_t * pmap_pdpe(pmap_t pmap, vm_offset_t va) { pml4_entry_t *pml4e; pt_entry_t PG_V; PG_V = pmap_valid_bit(pmap); pml4e = pmap_pml4e(pmap, va); if ((*pml4e & PG_V) == 0) return (NULL); return (pmap_pml4e_to_pdpe(pml4e, va)); } /* Return a pointer to the PD slot that corresponds to a VA */ static __inline pd_entry_t * pmap_pdpe_to_pde(pdp_entry_t *pdpe, vm_offset_t va) { pd_entry_t *pde; pde = (pd_entry_t *)PHYS_TO_DMAP(*pdpe & PG_FRAME); return (&pde[pmap_pde_index(va)]); } /* Return a pointer to the PD slot that corresponds to a VA */ static __inline pd_entry_t * pmap_pde(pmap_t pmap, vm_offset_t va) { pdp_entry_t *pdpe; pt_entry_t PG_V; PG_V = pmap_valid_bit(pmap); pdpe = pmap_pdpe(pmap, va); if (pdpe == NULL || (*pdpe & PG_V) == 0) return (NULL); return (pmap_pdpe_to_pde(pdpe, va)); } /* Return a pointer to the PT slot that corresponds to a VA */ static __inline pt_entry_t * pmap_pde_to_pte(pd_entry_t *pde, vm_offset_t va) { pt_entry_t *pte; pte = (pt_entry_t *)PHYS_TO_DMAP(*pde & PG_FRAME); return (&pte[pmap_pte_index(va)]); } /* Return a pointer to the PT slot that corresponds to a VA */ static __inline pt_entry_t * pmap_pte(pmap_t pmap, vm_offset_t va) { pd_entry_t *pde; pt_entry_t PG_V; PG_V = pmap_valid_bit(pmap); pde = pmap_pde(pmap, va); if (pde == NULL || (*pde & PG_V) == 0) return (NULL); if ((*pde & PG_PS) != 0) /* compat with i386 pmap_pte() */ return ((pt_entry_t *)pde); return (pmap_pde_to_pte(pde, va)); } static __inline void pmap_resident_count_inc(pmap_t pmap, int count) { PMAP_LOCK_ASSERT(pmap, MA_OWNED); pmap->pm_stats.resident_count += count; } static __inline void pmap_resident_count_dec(pmap_t pmap, int count) { PMAP_LOCK_ASSERT(pmap, MA_OWNED); KASSERT(pmap->pm_stats.resident_count >= count, ("pmap %p resident count underflow %ld %d", pmap, pmap->pm_stats.resident_count, count)); pmap->pm_stats.resident_count -= count; } PMAP_INLINE pt_entry_t * vtopte(vm_offset_t va) { u_int64_t mask = ((1ul << (NPTEPGSHIFT + NPDEPGSHIFT + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1); KASSERT(va >= VM_MAXUSER_ADDRESS, ("vtopte on a uva/gpa 0x%0lx", va)); return (PTmap + ((va >> PAGE_SHIFT) & mask)); } static __inline pd_entry_t * vtopde(vm_offset_t va) { u_int64_t mask = ((1ul << (NPDEPGSHIFT + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1); KASSERT(va >= VM_MAXUSER_ADDRESS, ("vtopde on a uva/gpa 0x%0lx", va)); return (PDmap + ((va >> PDRSHIFT) & mask)); } static u_int64_t allocpages(vm_paddr_t *firstaddr, int n) { u_int64_t ret; ret = *firstaddr; bzero((void *)ret, n * PAGE_SIZE); *firstaddr += n * PAGE_SIZE; return (ret); } CTASSERT(powerof2(NDMPML4E)); /* number of kernel PDP slots */ #define NKPDPE(ptpgs) howmany((ptpgs), NPDEPG) static void nkpt_init(vm_paddr_t addr) { int pt_pages; #ifdef NKPT pt_pages = NKPT; #else pt_pages = howmany(addr, 1 << PDRSHIFT); pt_pages += NKPDPE(pt_pages); /* * Add some slop beyond the bare minimum required for bootstrapping * the kernel. * * This is quite important when allocating KVA for kernel modules. * The modules are required to be linked in the negative 2GB of * the address space. If we run out of KVA in this region then * pmap_growkernel() will need to allocate page table pages to map * the entire 512GB of KVA space which is an unnecessary tax on * physical memory. */ pt_pages += 8; /* 16MB additional slop for kernel modules */ #endif nkpt = pt_pages; } static void create_pagetables(vm_paddr_t *firstaddr) { int i, j, ndm1g, nkpdpe; pt_entry_t *pt_p; pd_entry_t *pd_p; pdp_entry_t *pdp_p; pml4_entry_t *p4_p; /* Allocate page table pages for the direct map */ ndmpdp = (ptoa(Maxmem) + NBPDP - 1) >> PDPSHIFT; if (ndmpdp < 4) /* Minimum 4GB of dirmap */ ndmpdp = 4; ndmpdpphys = howmany(ndmpdp, NPDPEPG); if (ndmpdpphys > NDMPML4E) { /* * Each NDMPML4E allows 512 GB, so limit to that, * and then readjust ndmpdp and ndmpdpphys. */ printf("NDMPML4E limits system to %d GB\n", NDMPML4E * 512); Maxmem = atop(NDMPML4E * NBPML4); ndmpdpphys = NDMPML4E; ndmpdp = NDMPML4E * NPDEPG; } DMPDPphys = allocpages(firstaddr, ndmpdpphys); ndm1g = 0; if ((amd_feature & AMDID_PAGE1GB) != 0) ndm1g = ptoa(Maxmem) >> PDPSHIFT; if (ndm1g < ndmpdp) DMPDphys = allocpages(firstaddr, ndmpdp - ndm1g); dmaplimit = (vm_paddr_t)ndmpdp << PDPSHIFT; /* Allocate pages */ KPML4phys = allocpages(firstaddr, 1); KPDPphys = allocpages(firstaddr, NKPML4E); /* * Allocate the initial number of kernel page table pages required to * bootstrap. We defer this until after all memory-size dependent * allocations are done (e.g. direct map), so that we don't have to * build in too much slop in our estimate. * * Note that when NKPML4E > 1, we have an empty page underneath * all but the KPML4I'th one, so we need NKPML4E-1 extra (zeroed) * pages. (pmap_enter requires a PD page to exist for each KPML4E.) */ nkpt_init(*firstaddr); nkpdpe = NKPDPE(nkpt); KPTphys = allocpages(firstaddr, nkpt); KPDphys = allocpages(firstaddr, nkpdpe); /* Fill in the underlying page table pages */ /* Nominally read-only (but really R/W) from zero to physfree */ /* XXX not fully used, underneath 2M pages */ pt_p = (pt_entry_t *)KPTphys; for (i = 0; ptoa(i) < *firstaddr; i++) pt_p[i] = ptoa(i) | X86_PG_RW | X86_PG_V | X86_PG_G; /* Now map the page tables at their location within PTmap */ pd_p = (pd_entry_t *)KPDphys; for (i = 0; i < nkpt; i++) pd_p[i] = (KPTphys + ptoa(i)) | X86_PG_RW | X86_PG_V; /* Map from zero to end of allocations under 2M pages */ /* This replaces some of the KPTphys entries above */ for (i = 0; (i << PDRSHIFT) < *firstaddr; i++) pd_p[i] = (i << PDRSHIFT) | X86_PG_RW | X86_PG_V | PG_PS | X86_PG_G; /* And connect up the PD to the PDP (leaving room for L4 pages) */ pdp_p = (pdp_entry_t *)(KPDPphys + ptoa(KPML4I - KPML4BASE)); for (i = 0; i < nkpdpe; i++) pdp_p[i + KPDPI] = (KPDphys + ptoa(i)) | X86_PG_RW | X86_PG_V | PG_U; /* * Now, set up the direct map region using 2MB and/or 1GB pages. If * the end of physical memory is not aligned to a 1GB page boundary, * then the residual physical memory is mapped with 2MB pages. Later, * if pmap_mapdev{_attr}() uses the direct map for non-write-back * memory, pmap_change_attr() will demote any 2MB or 1GB page mappings * that are partially used. */ pd_p = (pd_entry_t *)DMPDphys; for (i = NPDEPG * ndm1g, j = 0; i < NPDEPG * ndmpdp; i++, j++) { pd_p[j] = (vm_paddr_t)i << PDRSHIFT; /* Preset PG_M and PG_A because demotion expects it. */ pd_p[j] |= X86_PG_RW | X86_PG_V | PG_PS | X86_PG_G | X86_PG_M | X86_PG_A; } pdp_p = (pdp_entry_t *)DMPDPphys; for (i = 0; i < ndm1g; i++) { pdp_p[i] = (vm_paddr_t)i << PDPSHIFT; /* Preset PG_M and PG_A because demotion expects it. */ pdp_p[i] |= X86_PG_RW | X86_PG_V | PG_PS | X86_PG_G | X86_PG_M | X86_PG_A; } for (j = 0; i < ndmpdp; i++, j++) { pdp_p[i] = DMPDphys + ptoa(j); pdp_p[i] |= X86_PG_RW | X86_PG_V | PG_U; } /* And recursively map PML4 to itself in order to get PTmap */ p4_p = (pml4_entry_t *)KPML4phys; p4_p[PML4PML4I] = KPML4phys; p4_p[PML4PML4I] |= X86_PG_RW | X86_PG_V | PG_U; /* Connect the Direct Map slot(s) up to the PML4. */ for (i = 0; i < ndmpdpphys; i++) { p4_p[DMPML4I + i] = DMPDPphys + ptoa(i); p4_p[DMPML4I + i] |= X86_PG_RW | X86_PG_V | PG_U; } /* Connect the KVA slots up to the PML4 */ for (i = 0; i < NKPML4E; i++) { p4_p[KPML4BASE + i] = KPDPphys + ptoa(i); p4_p[KPML4BASE + i] |= X86_PG_RW | X86_PG_V | PG_U; } } /* * Bootstrap the system enough to run with virtual memory. * * On amd64 this is called after mapping has already been enabled * and just syncs the pmap module with what has already been done. * [We can't call it easily with mapping off since the kernel is not * mapped with PA == VA, hence we would have to relocate every address * from the linked base (virtual) address "KERNBASE" to the actual * (physical) address starting relative to 0] */ void pmap_bootstrap(vm_paddr_t *firstaddr) { vm_offset_t va; pt_entry_t *pte; + int i; /* * Create an initial set of page tables to run the kernel in. */ create_pagetables(firstaddr); /* * Add a physical memory segment (vm_phys_seg) corresponding to the * preallocated kernel page table pages so that vm_page structures * representing these pages will be created. The vm_page structures * are required for promotion of the corresponding kernel virtual * addresses to superpage mappings. */ vm_phys_add_seg(KPTphys, KPTphys + ptoa(nkpt)); virtual_avail = (vm_offset_t) KERNBASE + *firstaddr; virtual_avail = pmap_kmem_choose(virtual_avail); virtual_end = VM_MAX_KERNEL_ADDRESS; /* XXX do %cr0 as well */ load_cr4(rcr4() | CR4_PGE); load_cr3(KPML4phys); if (cpu_stdext_feature & CPUID_STDEXT_SMEP) load_cr4(rcr4() | CR4_SMEP); /* * Initialize the kernel pmap (which is statically allocated). */ PMAP_LOCK_INIT(kernel_pmap); kernel_pmap->pm_pml4 = (pdp_entry_t *)PHYS_TO_DMAP(KPML4phys); kernel_pmap->pm_cr3 = KPML4phys; CPU_FILL(&kernel_pmap->pm_active); /* don't allow deactivation */ - CPU_FILL(&kernel_pmap->pm_save); /* always superset of pm_active */ TAILQ_INIT(&kernel_pmap->pm_pvchunk); kernel_pmap->pm_flags = pmap_flags; /* * Initialize the global pv list lock. */ rw_init(&pvh_global_lock, "pmap pv global"); /* * Reserve some special page table entries/VA space for temporary * mapping of pages. */ #define SYSMAP(c, p, v, n) \ v = (c)va; va += ((n)*PAGE_SIZE); p = pte; pte += (n); va = virtual_avail; pte = vtopte(va); /* * Crashdump maps. The first page is reused as CMAP1 for the * memory test. */ SYSMAP(caddr_t, CMAP1, crashdumpmap, MAXDUMPPGS) CADDR1 = crashdumpmap; virtual_avail = va; /* Initialize the PAT MSR. */ pmap_init_pat(); /* Initialize TLB Context Id. */ TUNABLE_INT_FETCH("vm.pmap.pcid_enabled", &pmap_pcid_enabled); if ((cpu_feature2 & CPUID2_PCID) != 0 && pmap_pcid_enabled) { - load_cr4(rcr4() | CR4_PCIDE); - mtx_init(&pcid_mtx, "pcid", NULL, MTX_DEF); - init_unrhdr(&pcid_unr, 1, (1 << 12) - 1, &pcid_mtx); /* Check for INVPCID support */ invpcid_works = (cpu_stdext_feature & CPUID_STDEXT_INVPCID) != 0; - kernel_pmap->pm_pcid = 0; -#ifndef SMP + for (i = 0; i < MAXCPU; i++) { + kernel_pmap->pm_pcids[i].pm_pcid = PMAP_PCID_KERN; + kernel_pmap->pm_pcids[i].pm_gen = 1; + } + __pcpu[0].pc_pcid_next = PMAP_PCID_KERN + 1; + __pcpu[0].pc_pcid_gen = 1; + /* + * pcpu area for APs is zeroed during AP startup. + * pc_pcid_next and pc_pcid_gen are initialized by AP + * during pcpu setup. + */ +#ifdef SMP + load_cr4(rcr4() | CR4_PCIDE); +#else pmap_pcid_enabled = 0; #endif - } else + } else { pmap_pcid_enabled = 0; + } } /* * Setup the PAT MSR. */ void pmap_init_pat(void) { int pat_table[PAT_INDEX_SIZE]; uint64_t pat_msr; u_long cr0, cr4; int i; /* Bail if this CPU doesn't implement PAT. */ if ((cpu_feature & CPUID_PAT) == 0) panic("no PAT??"); /* Set default PAT index table. */ for (i = 0; i < PAT_INDEX_SIZE; i++) pat_table[i] = -1; pat_table[PAT_WRITE_BACK] = 0; pat_table[PAT_WRITE_THROUGH] = 1; pat_table[PAT_UNCACHEABLE] = 3; pat_table[PAT_WRITE_COMBINING] = 3; pat_table[PAT_WRITE_PROTECTED] = 3; pat_table[PAT_UNCACHED] = 3; /* Initialize default PAT entries. */ pat_msr = PAT_VALUE(0, PAT_WRITE_BACK) | PAT_VALUE(1, PAT_WRITE_THROUGH) | PAT_VALUE(2, PAT_UNCACHED) | PAT_VALUE(3, PAT_UNCACHEABLE) | PAT_VALUE(4, PAT_WRITE_BACK) | PAT_VALUE(5, PAT_WRITE_THROUGH) | PAT_VALUE(6, PAT_UNCACHED) | PAT_VALUE(7, PAT_UNCACHEABLE); if (pat_works) { /* * Leave the indices 0-3 at the default of WB, WT, UC-, and UC. * Program 5 and 6 as WP and WC. * Leave 4 and 7 as WB and UC. */ pat_msr &= ~(PAT_MASK(5) | PAT_MASK(6)); pat_msr |= PAT_VALUE(5, PAT_WRITE_PROTECTED) | PAT_VALUE(6, PAT_WRITE_COMBINING); pat_table[PAT_UNCACHED] = 2; pat_table[PAT_WRITE_PROTECTED] = 5; pat_table[PAT_WRITE_COMBINING] = 6; } else { /* * Just replace PAT Index 2 with WC instead of UC-. */ pat_msr &= ~PAT_MASK(2); pat_msr |= PAT_VALUE(2, PAT_WRITE_COMBINING); pat_table[PAT_WRITE_COMBINING] = 2; } /* Disable PGE. */ cr4 = rcr4(); load_cr4(cr4 & ~CR4_PGE); /* Disable caches (CD = 1, NW = 0). */ cr0 = rcr0(); load_cr0((cr0 & ~CR0_NW) | CR0_CD); /* Flushes caches and TLBs. */ wbinvd(); invltlb(); /* Update PAT and index table. */ wrmsr(MSR_PAT, pat_msr); for (i = 0; i < PAT_INDEX_SIZE; i++) pat_index[i] = pat_table[i]; /* Flush caches and TLBs again. */ wbinvd(); invltlb(); /* Restore caches and PGE. */ load_cr0(cr0); load_cr4(cr4); } /* * Initialize a vm_page's machine-dependent fields. */ void pmap_page_init(vm_page_t m) { TAILQ_INIT(&m->md.pv_list); m->md.pat_mode = PAT_WRITE_BACK; } /* * Initialize the pmap module. * Called by vm_init, to initialize any structures that the pmap * system needs to map virtual memory. */ void pmap_init(void) { vm_page_t mpte; vm_size_t s; int i, pv_npg; /* * Initialize the vm page array entries for the kernel pmap's * page table pages. */ for (i = 0; i < nkpt; i++) { mpte = PHYS_TO_VM_PAGE(KPTphys + (i << PAGE_SHIFT)); KASSERT(mpte >= vm_page_array && mpte < &vm_page_array[vm_page_array_size], ("pmap_init: page table page is out of range")); mpte->pindex = pmap_pde_pindex(KERNBASE) + i; mpte->phys_addr = KPTphys + (i << PAGE_SHIFT); } /* * If the kernel is running on a virtual machine, then it must assume * that MCA is enabled by the hypervisor. Moreover, the kernel must * be prepared for the hypervisor changing the vendor and family that * are reported by CPUID. Consequently, the workaround for AMD Family * 10h Erratum 383 is enabled if the processor's feature set does not * include at least one feature that is only supported by older Intel * or newer AMD processors. */ if (vm_guest == VM_GUEST_VM && (cpu_feature & CPUID_SS) == 0 && (cpu_feature2 & (CPUID2_SSSE3 | CPUID2_SSE41 | CPUID2_AESNI | CPUID2_AVX | CPUID2_XSAVE)) == 0 && (amd_feature2 & (AMDID2_XOP | AMDID2_FMA4)) == 0) workaround_erratum383 = 1; /* * Are large page mappings enabled? */ TUNABLE_INT_FETCH("vm.pmap.pg_ps_enabled", &pg_ps_enabled); if (pg_ps_enabled) { KASSERT(MAXPAGESIZES > 1 && pagesizes[1] == 0, ("pmap_init: can't assign to pagesizes[1]")); pagesizes[1] = NBPDR; } /* * Initialize the pv chunk list mutex. */ mtx_init(&pv_chunks_mutex, "pmap pv chunk list", NULL, MTX_DEF); /* * Initialize the pool of pv list locks. */ for (i = 0; i < NPV_LIST_LOCKS; i++) rw_init(&pv_list_locks[i], "pmap pv list"); /* * Calculate the size of the pv head table for superpages. */ pv_npg = howmany(vm_phys_segs[vm_phys_nsegs - 1].end, NBPDR); /* * Allocate memory for the pv head table for superpages. */ s = (vm_size_t)(pv_npg * sizeof(struct md_page)); s = round_page(s); pv_table = (struct md_page *)kmem_malloc(kernel_arena, s, M_WAITOK | M_ZERO); for (i = 0; i < pv_npg; i++) TAILQ_INIT(&pv_table[i].pv_list); } static SYSCTL_NODE(_vm_pmap, OID_AUTO, pde, CTLFLAG_RD, 0, "2MB page mapping counters"); static u_long pmap_pde_demotions; SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, demotions, CTLFLAG_RD, &pmap_pde_demotions, 0, "2MB page demotions"); static u_long pmap_pde_mappings; SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, mappings, CTLFLAG_RD, &pmap_pde_mappings, 0, "2MB page mappings"); static u_long pmap_pde_p_failures; SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, p_failures, CTLFLAG_RD, &pmap_pde_p_failures, 0, "2MB page promotion failures"); static u_long pmap_pde_promotions; SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, promotions, CTLFLAG_RD, &pmap_pde_promotions, 0, "2MB page promotions"); static SYSCTL_NODE(_vm_pmap, OID_AUTO, pdpe, CTLFLAG_RD, 0, "1GB page mapping counters"); static u_long pmap_pdpe_demotions; SYSCTL_ULONG(_vm_pmap_pdpe, OID_AUTO, demotions, CTLFLAG_RD, &pmap_pdpe_demotions, 0, "1GB page demotions"); /*************************************************** * Low level helper routines..... ***************************************************/ static pt_entry_t pmap_swap_pat(pmap_t pmap, pt_entry_t entry) { int x86_pat_bits = X86_PG_PTE_PAT | X86_PG_PDE_PAT; switch (pmap->pm_type) { case PT_X86: case PT_RVI: /* Verify that both PAT bits are not set at the same time */ KASSERT((entry & x86_pat_bits) != x86_pat_bits, ("Invalid PAT bits in entry %#lx", entry)); /* Swap the PAT bits if one of them is set */ if ((entry & x86_pat_bits) != 0) entry ^= x86_pat_bits; break; case PT_EPT: /* * Nothing to do - the memory attributes are represented * the same way for regular pages and superpages. */ break; default: panic("pmap_switch_pat_bits: bad pm_type %d", pmap->pm_type); } return (entry); } /* * Determine the appropriate bits to set in a PTE or PDE for a specified * caching mode. */ static int pmap_cache_bits(pmap_t pmap, int mode, boolean_t is_pde) { int cache_bits, pat_flag, pat_idx; if (mode < 0 || mode >= PAT_INDEX_SIZE || pat_index[mode] < 0) panic("Unknown caching mode %d\n", mode); switch (pmap->pm_type) { case PT_X86: case PT_RVI: /* The PAT bit is different for PTE's and PDE's. */ pat_flag = is_pde ? X86_PG_PDE_PAT : X86_PG_PTE_PAT; /* Map the caching mode to a PAT index. */ pat_idx = pat_index[mode]; /* Map the 3-bit index value into the PAT, PCD, and PWT bits. */ cache_bits = 0; if (pat_idx & 0x4) cache_bits |= pat_flag; if (pat_idx & 0x2) cache_bits |= PG_NC_PCD; if (pat_idx & 0x1) cache_bits |= PG_NC_PWT; break; case PT_EPT: cache_bits = EPT_PG_IGNORE_PAT | EPT_PG_MEMORY_TYPE(mode); break; default: panic("unsupported pmap type %d", pmap->pm_type); } return (cache_bits); } static int pmap_cache_mask(pmap_t pmap, boolean_t is_pde) { int mask; switch (pmap->pm_type) { case PT_X86: case PT_RVI: mask = is_pde ? X86_PG_PDE_CACHE : X86_PG_PTE_CACHE; break; case PT_EPT: mask = EPT_PG_IGNORE_PAT | EPT_PG_MEMORY_TYPE(0x7); break; default: panic("pmap_cache_mask: invalid pm_type %d", pmap->pm_type); } return (mask); } static __inline boolean_t pmap_ps_enabled(pmap_t pmap) { return (pg_ps_enabled && (pmap->pm_flags & PMAP_PDE_SUPERPAGE) != 0); } static void pmap_update_pde_store(pmap_t pmap, pd_entry_t *pde, pd_entry_t newpde) { switch (pmap->pm_type) { case PT_X86: break; case PT_RVI: case PT_EPT: /* * XXX * This is a little bogus since the generation number is * supposed to be bumped up when a region of the address * space is invalidated in the page tables. * * In this case the old PDE entry is valid but yet we want * to make sure that any mappings using the old entry are * invalidated in the TLB. * * The reason this works as expected is because we rendezvous * "all" host cpus and force any vcpu context to exit as a * side-effect. */ atomic_add_acq_long(&pmap->pm_eptgen, 1); break; default: panic("pmap_update_pde_store: bad pm_type %d", pmap->pm_type); } pde_store(pde, newpde); } /* * After changing the page size for the specified virtual address in the page * table, flush the corresponding entries from the processor's TLB. Only the * calling processor's TLB is affected. * * The calling thread must be pinned to a processor. */ static void pmap_update_pde_invalidate(pmap_t pmap, vm_offset_t va, pd_entry_t newpde) { pt_entry_t PG_G; if (pmap_type_guest(pmap)) return; KASSERT(pmap->pm_type == PT_X86, ("pmap_update_pde_invalidate: invalid type %d", pmap->pm_type)); PG_G = pmap_global_bit(pmap); if ((newpde & PG_PS) == 0) /* Demotion: flush a specific 2MB page mapping. */ invlpg(va); else if ((newpde & PG_G) == 0) /* * Promotion: flush every 4KB page mapping from the TLB * because there are too many to flush individually. */ invltlb(); else { /* * Promotion: flush every 4KB page mapping from the TLB, * including any global (PG_G) mappings. */ invltlb_globpcid(); } } #ifdef SMP -static void -pmap_invalidate_page_pcid(pmap_t pmap, vm_offset_t va) -{ - struct invpcid_descr d; - uint64_t cr3; - - if (invpcid_works) { - d.pcid = pmap->pm_pcid; - d.pad = 0; - d.addr = va; - invpcid(&d, INVPCID_ADDR); - return; - } - - cr3 = rcr3(); - critical_enter(); - load_cr3(pmap->pm_cr3 | CR3_PCID_SAVE); - invlpg(va); - load_cr3(cr3 | CR3_PCID_SAVE); - critical_exit(); -} - /* * For SMP, these functions have to use the IPI mechanism for coherence. * * N.B.: Before calling any of the following TLB invalidation functions, * the calling processor must ensure that all stores updating a non- * kernel page table are globally performed. Otherwise, another * processor could cache an old, pre-update entry without being * invalidated. This can happen one of two ways: (1) The pmap becomes * active on another processor after its pm_active field is checked by * one of the following functions but before a store updating the page * table is globally performed. (2) The pmap becomes active on another * processor before its pm_active field is checked but due to * speculative loads one of the following functions stills reads the * pmap as inactive on the other processor. * * The kernel page table is exempt because its pm_active field is * immutable. The kernel page table is always active on every * processor. */ /* * Interrupt the cpus that are executing in the guest context. * This will force the vcpu to exit and the cached EPT mappings * will be invalidated by the host before the next vmresume. */ static __inline void pmap_invalidate_ept(pmap_t pmap) { int ipinum; sched_pin(); KASSERT(!CPU_ISSET(curcpu, &pmap->pm_active), ("pmap_invalidate_ept: absurd pm_active")); /* * The TLB mappings associated with a vcpu context are not * flushed each time a different vcpu is chosen to execute. * * This is in contrast with a process's vtop mappings that * are flushed from the TLB on each context switch. * * Therefore we need to do more than just a TLB shootdown on * the active cpus in 'pmap->pm_active'. To do this we keep * track of the number of invalidations performed on this pmap. * * Each vcpu keeps a cache of this counter and compares it * just before a vmresume. If the counter is out-of-date an * invept will be done to flush stale mappings from the TLB. */ atomic_add_acq_long(&pmap->pm_eptgen, 1); /* * Force the vcpu to exit and trap back into the hypervisor. */ ipinum = pmap->pm_flags & PMAP_NESTED_IPIMASK; ipi_selected(pmap->pm_active, ipinum); sched_unpin(); } void pmap_invalidate_page(pmap_t pmap, vm_offset_t va) { - cpuset_t other_cpus; - u_int cpuid; + cpuset_t *mask; + u_int cpuid, i; if (pmap_type_guest(pmap)) { pmap_invalidate_ept(pmap); return; } KASSERT(pmap->pm_type == PT_X86, ("pmap_invalidate_page: invalid type %d", pmap->pm_type)); sched_pin(); - if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) { - if (!pmap_pcid_enabled) { - invlpg(va); - } else { - if (pmap->pm_pcid != -1 && pmap->pm_pcid != 0) { - if (pmap == PCPU_GET(curpmap)) - invlpg(va); - else - pmap_invalidate_page_pcid(pmap, va); - } else { - invltlb_globpcid(); - } - } - smp_invlpg(pmap, va); + if (pmap == kernel_pmap) { + invlpg(va); + mask = &all_cpus; } else { cpuid = PCPU_GET(cpuid); - other_cpus = all_cpus; - CPU_CLR(cpuid, &other_cpus); - if (CPU_ISSET(cpuid, &pmap->pm_active)) + if (pmap == PCPU_GET(curpmap)) invlpg(va); - else if (pmap_pcid_enabled) { - if (pmap->pm_pcid != -1 && pmap->pm_pcid != 0) - pmap_invalidate_page_pcid(pmap, va); - else - invltlb_globpcid(); + else if (pmap_pcid_enabled) + pmap->pm_pcids[cpuid].pm_gen = 0; + if (pmap_pcid_enabled) { + CPU_FOREACH(i) { + if (cpuid != i) + pmap->pm_pcids[i].pm_gen = 0; + } } - if (pmap_pcid_enabled) - CPU_AND(&other_cpus, &pmap->pm_save); - else - CPU_AND(&other_cpus, &pmap->pm_active); - if (!CPU_EMPTY(&other_cpus)) - smp_masked_invlpg(other_cpus, pmap, va); + mask = &pmap->pm_active; } + smp_masked_invlpg(*mask, va); sched_unpin(); } -static void -pmap_invalidate_range_pcid(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) -{ - struct invpcid_descr d; - uint64_t cr3; - vm_offset_t addr; - - if (invpcid_works) { - d.pcid = pmap->pm_pcid; - d.pad = 0; - for (addr = sva; addr < eva; addr += PAGE_SIZE) { - d.addr = addr; - invpcid(&d, INVPCID_ADDR); - } - return; - } - - cr3 = rcr3(); - critical_enter(); - load_cr3(pmap->pm_cr3 | CR3_PCID_SAVE); - for (addr = sva; addr < eva; addr += PAGE_SIZE) - invlpg(addr); - load_cr3(cr3 | CR3_PCID_SAVE); - critical_exit(); -} - void pmap_invalidate_range(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) { - cpuset_t other_cpus; + cpuset_t *mask; vm_offset_t addr; - u_int cpuid; + u_int cpuid, i; if (pmap_type_guest(pmap)) { pmap_invalidate_ept(pmap); return; } KASSERT(pmap->pm_type == PT_X86, ("pmap_invalidate_range: invalid type %d", pmap->pm_type)); sched_pin(); - if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) { - if (!pmap_pcid_enabled) { - for (addr = sva; addr < eva; addr += PAGE_SIZE) - invlpg(addr); - } else { - if (pmap->pm_pcid != -1 && pmap->pm_pcid != 0) { - if (pmap == PCPU_GET(curpmap)) { - for (addr = sva; addr < eva; - addr += PAGE_SIZE) - invlpg(addr); - } else { - pmap_invalidate_range_pcid(pmap, - sva, eva); - } - } else { - invltlb_globpcid(); - } - } - smp_invlpg_range(pmap, sva, eva); + cpuid = PCPU_GET(cpuid); + if (pmap == kernel_pmap) { + for (addr = sva; addr < eva; addr += PAGE_SIZE) + invlpg(addr); + mask = &all_cpus; } else { - cpuid = PCPU_GET(cpuid); - other_cpus = all_cpus; - CPU_CLR(cpuid, &other_cpus); - if (CPU_ISSET(cpuid, &pmap->pm_active)) { + if (pmap == PCPU_GET(curpmap)) { for (addr = sva; addr < eva; addr += PAGE_SIZE) invlpg(addr); } else if (pmap_pcid_enabled) { - if (pmap->pm_pcid != -1 && pmap->pm_pcid != 0) - pmap_invalidate_range_pcid(pmap, sva, eva); - else - invltlb_globpcid(); + pmap->pm_pcids[cpuid].pm_gen = 0; } - if (pmap_pcid_enabled) - CPU_AND(&other_cpus, &pmap->pm_save); - else - CPU_AND(&other_cpus, &pmap->pm_active); - if (!CPU_EMPTY(&other_cpus)) - smp_masked_invlpg_range(other_cpus, pmap, sva, eva); + if (pmap_pcid_enabled) { + CPU_FOREACH(i) { + if (cpuid != i) + pmap->pm_pcids[i].pm_gen = 0; + } + } + mask = &pmap->pm_active; } + smp_masked_invlpg_range(*mask, sva, eva); sched_unpin(); } void pmap_invalidate_all(pmap_t pmap) { - cpuset_t other_cpus; + cpuset_t *mask; struct invpcid_descr d; - uint64_t cr3; - u_int cpuid; + u_int cpuid, i; if (pmap_type_guest(pmap)) { pmap_invalidate_ept(pmap); return; } KASSERT(pmap->pm_type == PT_X86, ("pmap_invalidate_all: invalid type %d", pmap->pm_type)); sched_pin(); - cpuid = PCPU_GET(cpuid); - if (pmap == kernel_pmap || - (pmap_pcid_enabled && !CPU_CMP(&pmap->pm_save, &all_cpus)) || - !CPU_CMP(&pmap->pm_active, &all_cpus)) { - if (invpcid_works) { + if (pmap == kernel_pmap) { + if (pmap_pcid_enabled && invpcid_works) { bzero(&d, sizeof(d)); invpcid(&d, INVPCID_CTXGLOB); } else { invltlb_globpcid(); } - if (!CPU_ISSET(cpuid, &pmap->pm_active)) - CPU_CLR_ATOMIC(cpuid, &pmap->pm_save); - smp_invltlb(pmap); + mask = &all_cpus; } else { - other_cpus = all_cpus; - CPU_CLR(cpuid, &other_cpus); - - /* - * This logic is duplicated in the Xinvltlb shootdown - * IPI handler. - */ - if (pmap_pcid_enabled) { - if (pmap->pm_pcid != -1 && pmap->pm_pcid != 0) { + cpuid = PCPU_GET(cpuid); + if (pmap == PCPU_GET(curpmap)) { + if (pmap_pcid_enabled) { if (invpcid_works) { - d.pcid = pmap->pm_pcid; + d.pcid = pmap->pm_pcids[cpuid].pm_pcid; d.pad = 0; d.addr = 0; invpcid(&d, INVPCID_CTX); } else { - cr3 = rcr3(); - critical_enter(); - - /* - * Bit 63 is clear, pcid TLB - * entries are invalidated. - */ - load_cr3(pmap->pm_cr3); - load_cr3(cr3 | CR3_PCID_SAVE); - critical_exit(); + load_cr3(pmap->pm_cr3 | pmap->pm_pcids + [PCPU_GET(cpuid)].pm_pcid); } } else { - invltlb_globpcid(); + invltlb(); } - } else if (CPU_ISSET(cpuid, &pmap->pm_active)) - invltlb(); - if (!CPU_ISSET(cpuid, &pmap->pm_active)) - CPU_CLR_ATOMIC(cpuid, &pmap->pm_save); - if (pmap_pcid_enabled) - CPU_AND(&other_cpus, &pmap->pm_save); - else - CPU_AND(&other_cpus, &pmap->pm_active); - if (!CPU_EMPTY(&other_cpus)) - smp_masked_invltlb(other_cpus, pmap); + } else if (pmap_pcid_enabled) { + pmap->pm_pcids[cpuid].pm_gen = 0; + } + if (pmap_pcid_enabled) { + CPU_FOREACH(i) { + if (cpuid != i) + pmap->pm_pcids[i].pm_gen = 0; + } + } + mask = &pmap->pm_active; } + smp_masked_invltlb(*mask, pmap); sched_unpin(); } void pmap_invalidate_cache(void) { sched_pin(); wbinvd(); smp_cache_flush(); sched_unpin(); } struct pde_action { cpuset_t invalidate; /* processors that invalidate their TLB */ pmap_t pmap; vm_offset_t va; pd_entry_t *pde; pd_entry_t newpde; u_int store; /* processor that updates the PDE */ }; static void pmap_update_pde_action(void *arg) { struct pde_action *act = arg; if (act->store == PCPU_GET(cpuid)) pmap_update_pde_store(act->pmap, act->pde, act->newpde); } static void pmap_update_pde_teardown(void *arg) { struct pde_action *act = arg; if (CPU_ISSET(PCPU_GET(cpuid), &act->invalidate)) pmap_update_pde_invalidate(act->pmap, act->va, act->newpde); } /* * Change the page size for the specified virtual address in a way that * prevents any possibility of the TLB ever having two entries that map the * same virtual address using different page sizes. This is the recommended * workaround for Erratum 383 on AMD Family 10h processors. It prevents a * machine check exception for a TLB state that is improperly diagnosed as a * hardware error. */ static void pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, pd_entry_t newpde) { struct pde_action act; cpuset_t active, other_cpus; u_int cpuid; sched_pin(); cpuid = PCPU_GET(cpuid); other_cpus = all_cpus; CPU_CLR(cpuid, &other_cpus); if (pmap == kernel_pmap || pmap_type_guest(pmap)) active = all_cpus; else { active = pmap->pm_active; - CPU_AND_ATOMIC(&pmap->pm_save, &active); } if (CPU_OVERLAP(&active, &other_cpus)) { act.store = cpuid; act.invalidate = active; act.va = va; act.pmap = pmap; act.pde = pde; act.newpde = newpde; CPU_SET(cpuid, &active); smp_rendezvous_cpus(active, smp_no_rendevous_barrier, pmap_update_pde_action, pmap_update_pde_teardown, &act); } else { pmap_update_pde_store(pmap, pde, newpde); if (CPU_ISSET(cpuid, &active)) pmap_update_pde_invalidate(pmap, va, newpde); } sched_unpin(); } #else /* !SMP */ /* * Normal, non-SMP, invalidation functions. * We inline these within pmap.c for speed. */ PMAP_INLINE void pmap_invalidate_page(pmap_t pmap, vm_offset_t va) { switch (pmap->pm_type) { case PT_X86: if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active)) invlpg(va); break; case PT_RVI: case PT_EPT: pmap->pm_eptgen++; break; default: panic("pmap_invalidate_page: unknown type: %d", pmap->pm_type); } } PMAP_INLINE void pmap_invalidate_range(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) { vm_offset_t addr; switch (pmap->pm_type) { case PT_X86: if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active)) for (addr = sva; addr < eva; addr += PAGE_SIZE) invlpg(addr); break; case PT_RVI: case PT_EPT: pmap->pm_eptgen++; break; default: panic("pmap_invalidate_range: unknown type: %d", pmap->pm_type); } } PMAP_INLINE void pmap_invalidate_all(pmap_t pmap) { switch (pmap->pm_type) { case PT_X86: if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active)) invltlb(); break; case PT_RVI: case PT_EPT: pmap->pm_eptgen++; break; default: panic("pmap_invalidate_all: unknown type %d", pmap->pm_type); } } PMAP_INLINE void pmap_invalidate_cache(void) { wbinvd(); } static void pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, pd_entry_t newpde) { pmap_update_pde_store(pmap, pde, newpde); if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active)) pmap_update_pde_invalidate(pmap, va, newpde); else CPU_ZERO(&pmap->pm_save); } #endif /* !SMP */ #define PMAP_CLFLUSH_THRESHOLD (2 * 1024 * 1024) void pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva, boolean_t force) { if (force) { sva &= ~(vm_offset_t)cpu_clflush_line_size; } else { KASSERT((sva & PAGE_MASK) == 0, ("pmap_invalidate_cache_range: sva not page-aligned")); KASSERT((eva & PAGE_MASK) == 0, ("pmap_invalidate_cache_range: eva not page-aligned")); } if ((cpu_feature & CPUID_SS) != 0 && !force) ; /* If "Self Snoop" is supported and allowed, do nothing. */ else if ((cpu_feature & CPUID_CLFSH) != 0 && eva - sva < PMAP_CLFLUSH_THRESHOLD) { /* * XXX: Some CPUs fault, hang, or trash the local APIC * registers if we use CLFLUSH on the local APIC * range. The local APIC is always uncached, so we * don't need to flush for that range anyway. */ if (pmap_kextract(sva) == lapic_paddr) return; /* * Otherwise, do per-cache line flush. Use the mfence * instruction to insure that previous stores are * included in the write-back. The processor * propagates flush to other processors in the cache * coherence domain. */ mfence(); for (; sva < eva; sva += cpu_clflush_line_size) clflush(sva); mfence(); } else { /* * No targeted cache flush methods are supported by CPU, * or the supplied range is bigger than 2MB. * Globally invalidate cache. */ pmap_invalidate_cache(); } } /* * Remove the specified set of pages from the data and instruction caches. * * In contrast to pmap_invalidate_cache_range(), this function does not * rely on the CPU's self-snoop feature, because it is intended for use * when moving pages into a different cache domain. */ void pmap_invalidate_cache_pages(vm_page_t *pages, int count) { vm_offset_t daddr, eva; int i; if (count >= PMAP_CLFLUSH_THRESHOLD / PAGE_SIZE || (cpu_feature & CPUID_CLFSH) == 0) pmap_invalidate_cache(); else { mfence(); for (i = 0; i < count; i++) { daddr = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pages[i])); eva = daddr + PAGE_SIZE; for (; daddr < eva; daddr += cpu_clflush_line_size) clflush(daddr); } mfence(); } } /* * Routine: pmap_extract * Function: * Extract the physical page address associated * with the given map/virtual_address pair. */ vm_paddr_t pmap_extract(pmap_t pmap, vm_offset_t va) { pdp_entry_t *pdpe; pd_entry_t *pde; pt_entry_t *pte, PG_V; vm_paddr_t pa; pa = 0; PG_V = pmap_valid_bit(pmap); PMAP_LOCK(pmap); pdpe = pmap_pdpe(pmap, va); if (pdpe != NULL && (*pdpe & PG_V) != 0) { if ((*pdpe & PG_PS) != 0) pa = (*pdpe & PG_PS_FRAME) | (va & PDPMASK); else { pde = pmap_pdpe_to_pde(pdpe, va); if ((*pde & PG_V) != 0) { if ((*pde & PG_PS) != 0) { pa = (*pde & PG_PS_FRAME) | (va & PDRMASK); } else { pte = pmap_pde_to_pte(pde, va); pa = (*pte & PG_FRAME) | (va & PAGE_MASK); } } } } PMAP_UNLOCK(pmap); return (pa); } /* * Routine: pmap_extract_and_hold * Function: * Atomically extract and hold the physical page * with the given pmap and virtual address pair * if that mapping permits the given protection. */ vm_page_t pmap_extract_and_hold(pmap_t pmap, vm_offset_t va, vm_prot_t prot) { pd_entry_t pde, *pdep; pt_entry_t pte, PG_RW, PG_V; vm_paddr_t pa; vm_page_t m; pa = 0; m = NULL; PG_RW = pmap_rw_bit(pmap); PG_V = pmap_valid_bit(pmap); PMAP_LOCK(pmap); retry: pdep = pmap_pde(pmap, va); if (pdep != NULL && (pde = *pdep)) { if (pde & PG_PS) { if ((pde & PG_RW) || (prot & VM_PROT_WRITE) == 0) { if (vm_page_pa_tryrelock(pmap, (pde & PG_PS_FRAME) | (va & PDRMASK), &pa)) goto retry; m = PHYS_TO_VM_PAGE((pde & PG_PS_FRAME) | (va & PDRMASK)); vm_page_hold(m); } } else { pte = *pmap_pde_to_pte(pdep, va); if ((pte & PG_V) && ((pte & PG_RW) || (prot & VM_PROT_WRITE) == 0)) { if (vm_page_pa_tryrelock(pmap, pte & PG_FRAME, &pa)) goto retry; m = PHYS_TO_VM_PAGE(pte & PG_FRAME); vm_page_hold(m); } } } PA_UNLOCK_COND(pa); PMAP_UNLOCK(pmap); return (m); } vm_paddr_t pmap_kextract(vm_offset_t va) { pd_entry_t pde; vm_paddr_t pa; if (va >= DMAP_MIN_ADDRESS && va < DMAP_MAX_ADDRESS) { pa = DMAP_TO_PHYS(va); } else { pde = *vtopde(va); if (pde & PG_PS) { pa = (pde & PG_PS_FRAME) | (va & PDRMASK); } else { /* * Beware of a concurrent promotion that changes the * PDE at this point! For example, vtopte() must not * be used to access the PTE because it would use the * new PDE. It is, however, safe to use the old PDE * because the page table page is preserved by the * promotion. */ pa = *pmap_pde_to_pte(&pde, va); pa = (pa & PG_FRAME) | (va & PAGE_MASK); } } return (pa); } /*************************************************** * Low level mapping routines..... ***************************************************/ /* * Add a wired page to the kva. * Note: not SMP coherent. */ PMAP_INLINE void pmap_kenter(vm_offset_t va, vm_paddr_t pa) { pt_entry_t *pte; pte = vtopte(va); pte_store(pte, pa | X86_PG_RW | X86_PG_V | X86_PG_G); } static __inline void pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int mode) { pt_entry_t *pte; int cache_bits; pte = vtopte(va); cache_bits = pmap_cache_bits(kernel_pmap, mode, 0); pte_store(pte, pa | X86_PG_RW | X86_PG_V | X86_PG_G | cache_bits); } /* * Remove a page from the kernel pagetables. * Note: not SMP coherent. */ PMAP_INLINE void pmap_kremove(vm_offset_t va) { pt_entry_t *pte; pte = vtopte(va); pte_clear(pte); } /* * Used to map a range of physical addresses into kernel * virtual address space. * * The value passed in '*virt' is a suggested virtual address for * the mapping. Architectures which can support a direct-mapped * physical to virtual region can return the appropriate address * within that region, leaving '*virt' unchanged. Other * architectures should map the pages starting at '*virt' and * update '*virt' with the first usable address after the mapped * region. */ vm_offset_t pmap_map(vm_offset_t *virt, vm_paddr_t start, vm_paddr_t end, int prot) { return PHYS_TO_DMAP(start); } /* * Add a list of wired pages to the kva * this routine is only used for temporary * kernel mappings that do not need to have * page modification or references recorded. * Note that old mappings are simply written * over. The page *must* be wired. * Note: SMP coherent. Uses a ranged shootdown IPI. */ void pmap_qenter(vm_offset_t sva, vm_page_t *ma, int count) { pt_entry_t *endpte, oldpte, pa, *pte; vm_page_t m; int cache_bits; oldpte = 0; pte = vtopte(sva); endpte = pte + count; while (pte < endpte) { m = *ma++; cache_bits = pmap_cache_bits(kernel_pmap, m->md.pat_mode, 0); pa = VM_PAGE_TO_PHYS(m) | cache_bits; if ((*pte & (PG_FRAME | X86_PG_PTE_CACHE)) != pa) { oldpte |= *pte; pte_store(pte, pa | X86_PG_G | X86_PG_RW | X86_PG_V); } pte++; } if (__predict_false((oldpte & X86_PG_V) != 0)) pmap_invalidate_range(kernel_pmap, sva, sva + count * PAGE_SIZE); } /* * This routine tears out page mappings from the * kernel -- it is meant only for temporary mappings. * Note: SMP coherent. Uses a ranged shootdown IPI. */ void pmap_qremove(vm_offset_t sva, int count) { vm_offset_t va; va = sva; while (count-- > 0) { KASSERT(va >= VM_MIN_KERNEL_ADDRESS, ("usermode va %lx", va)); pmap_kremove(va); va += PAGE_SIZE; } pmap_invalidate_range(kernel_pmap, sva, va); } /*************************************************** * Page table page management routines..... ***************************************************/ static __inline void pmap_free_zero_pages(struct spglist *free) { vm_page_t m; while ((m = SLIST_FIRST(free)) != NULL) { SLIST_REMOVE_HEAD(free, plinks.s.ss); /* Preserve the page's PG_ZERO setting. */ vm_page_free_toq(m); } } /* * Schedule the specified unused page table page to be freed. Specifically, * add the page to the specified list of pages that will be released to the * physical memory manager after the TLB has been updated. */ static __inline void pmap_add_delayed_free_list(vm_page_t m, struct spglist *free, boolean_t set_PG_ZERO) { if (set_PG_ZERO) m->flags |= PG_ZERO; else m->flags &= ~PG_ZERO; SLIST_INSERT_HEAD(free, m, plinks.s.ss); } /* * Inserts the specified page table page into the specified pmap's collection * of idle page table pages. Each of a pmap's page table pages is responsible * for mapping a distinct range of virtual addresses. The pmap's collection is * ordered by this virtual address range. */ static __inline int pmap_insert_pt_page(pmap_t pmap, vm_page_t mpte) { PMAP_LOCK_ASSERT(pmap, MA_OWNED); return (vm_radix_insert(&pmap->pm_root, mpte)); } /* * Looks for a page table page mapping the specified virtual address in the * specified pmap's collection of idle page table pages. Returns NULL if there * is no page table page corresponding to the specified virtual address. */ static __inline vm_page_t pmap_lookup_pt_page(pmap_t pmap, vm_offset_t va) { PMAP_LOCK_ASSERT(pmap, MA_OWNED); return (vm_radix_lookup(&pmap->pm_root, pmap_pde_pindex(va))); } /* * Removes the specified page table page from the specified pmap's collection * of idle page table pages. The specified page table page must be a member of * the pmap's collection. */ static __inline void pmap_remove_pt_page(pmap_t pmap, vm_page_t mpte) { PMAP_LOCK_ASSERT(pmap, MA_OWNED); vm_radix_remove(&pmap->pm_root, mpte->pindex); } /* * Decrements a page table page's wire count, which is used to record the * number of valid page table entries within the page. If the wire count * drops to zero, then the page table page is unmapped. Returns TRUE if the * page table page was unmapped and FALSE otherwise. */ static inline boolean_t pmap_unwire_ptp(pmap_t pmap, vm_offset_t va, vm_page_t m, struct spglist *free) { --m->wire_count; if (m->wire_count == 0) { _pmap_unwire_ptp(pmap, va, m, free); return (TRUE); } else return (FALSE); } static void _pmap_unwire_ptp(pmap_t pmap, vm_offset_t va, vm_page_t m, struct spglist *free) { PMAP_LOCK_ASSERT(pmap, MA_OWNED); /* * unmap the page table page */ if (m->pindex >= (NUPDE + NUPDPE)) { /* PDP page */ pml4_entry_t *pml4; pml4 = pmap_pml4e(pmap, va); *pml4 = 0; } else if (m->pindex >= NUPDE) { /* PD page */ pdp_entry_t *pdp; pdp = pmap_pdpe(pmap, va); *pdp = 0; } else { /* PTE page */ pd_entry_t *pd; pd = pmap_pde(pmap, va); *pd = 0; } pmap_resident_count_dec(pmap, 1); if (m->pindex < NUPDE) { /* We just released a PT, unhold the matching PD */ vm_page_t pdpg; pdpg = PHYS_TO_VM_PAGE(*pmap_pdpe(pmap, va) & PG_FRAME); pmap_unwire_ptp(pmap, va, pdpg, free); } if (m->pindex >= NUPDE && m->pindex < (NUPDE + NUPDPE)) { /* We just released a PD, unhold the matching PDP */ vm_page_t pdppg; pdppg = PHYS_TO_VM_PAGE(*pmap_pml4e(pmap, va) & PG_FRAME); pmap_unwire_ptp(pmap, va, pdppg, free); } /* * This is a release store so that the ordinary store unmapping * the page table page is globally performed before TLB shoot- * down is begun. */ atomic_subtract_rel_int(&vm_cnt.v_wire_count, 1); /* * Put page on a list so that it is released after * *ALL* TLB shootdown is done */ pmap_add_delayed_free_list(m, free, TRUE); } /* * After removing a page table entry, this routine is used to * conditionally free the page, and manage the hold/wire counts. */ static int pmap_unuse_pt(pmap_t pmap, vm_offset_t va, pd_entry_t ptepde, struct spglist *free) { vm_page_t mpte; if (va >= VM_MAXUSER_ADDRESS) return (0); KASSERT(ptepde != 0, ("pmap_unuse_pt: ptepde != 0")); mpte = PHYS_TO_VM_PAGE(ptepde & PG_FRAME); return (pmap_unwire_ptp(pmap, va, mpte, free)); } void pmap_pinit0(pmap_t pmap) { PMAP_LOCK_INIT(pmap); pmap->pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(KPML4phys); pmap->pm_cr3 = KPML4phys; pmap->pm_root.rt_root = 0; CPU_ZERO(&pmap->pm_active); - CPU_ZERO(&pmap->pm_save); PCPU_SET(curpmap, pmap); TAILQ_INIT(&pmap->pm_pvchunk); bzero(&pmap->pm_stats, sizeof pmap->pm_stats); - pmap->pm_pcid = pmap_pcid_enabled ? 0 : -1; pmap->pm_flags = pmap_flags; } /* * Initialize a preallocated and zeroed pmap structure, * such as one in a vmspace structure. */ int pmap_pinit_type(pmap_t pmap, enum pmap_type pm_type, int flags) { vm_page_t pml4pg; vm_paddr_t pml4phys; int i; /* * allocate the page directory page */ while ((pml4pg = vm_page_alloc(NULL, 0, VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED | VM_ALLOC_ZERO)) == NULL) VM_WAIT; pml4phys = VM_PAGE_TO_PHYS(pml4pg); pmap->pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(pml4phys); - pmap->pm_pcid = -1; + CPU_FOREACH(i) { + pmap->pm_pcids[i].pm_pcid = PMAP_PCID_NONE; + pmap->pm_pcids[i].pm_gen = 0; + } pmap->pm_cr3 = ~0; /* initialize to an invalid value */ if ((pml4pg->flags & PG_ZERO) == 0) pagezero(pmap->pm_pml4); /* * Do not install the host kernel mappings in the nested page * tables. These mappings are meaningless in the guest physical * address space. */ if ((pmap->pm_type = pm_type) == PT_X86) { pmap->pm_cr3 = pml4phys; /* Wire in kernel global address entries. */ for (i = 0; i < NKPML4E; i++) { pmap->pm_pml4[KPML4BASE + i] = (KPDPphys + ptoa(i)) | X86_PG_RW | X86_PG_V | PG_U; } for (i = 0; i < ndmpdpphys; i++) { pmap->pm_pml4[DMPML4I + i] = (DMPDPphys + ptoa(i)) | X86_PG_RW | X86_PG_V | PG_U; } /* install self-referential address mapping entry(s) */ pmap->pm_pml4[PML4PML4I] = VM_PAGE_TO_PHYS(pml4pg) | X86_PG_V | X86_PG_RW | X86_PG_A | X86_PG_M; - - if (pmap_pcid_enabled) { - pmap->pm_pcid = alloc_unr(&pcid_unr); - if (pmap->pm_pcid != -1) - pmap->pm_cr3 |= pmap->pm_pcid; - } } pmap->pm_root.rt_root = 0; CPU_ZERO(&pmap->pm_active); TAILQ_INIT(&pmap->pm_pvchunk); bzero(&pmap->pm_stats, sizeof pmap->pm_stats); pmap->pm_flags = flags; pmap->pm_eptgen = 0; - CPU_ZERO(&pmap->pm_save); return (1); } int pmap_pinit(pmap_t pmap) { return (pmap_pinit_type(pmap, PT_X86, pmap_flags)); } /* * This routine is called if the desired page table page does not exist. * * If page table page allocation fails, this routine may sleep before * returning NULL. It sleeps only if a lock pointer was given. * * Note: If a page allocation fails at page table level two or three, * one or two pages may be held during the wait, only to be released * afterwards. This conservative approach is easily argued to avoid * race conditions. */ static vm_page_t _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex, struct rwlock **lockp) { vm_page_t m, pdppg, pdpg; pt_entry_t PG_A, PG_M, PG_RW, PG_V; PMAP_LOCK_ASSERT(pmap, MA_OWNED); PG_A = pmap_accessed_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_V = pmap_valid_bit(pmap); PG_RW = pmap_rw_bit(pmap); /* * Allocate a page table page. */ if ((m = vm_page_alloc(NULL, ptepindex, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED | VM_ALLOC_ZERO)) == NULL) { if (lockp != NULL) { RELEASE_PV_LIST_LOCK(lockp); PMAP_UNLOCK(pmap); rw_runlock(&pvh_global_lock); VM_WAIT; rw_rlock(&pvh_global_lock); PMAP_LOCK(pmap); } /* * Indicate the need to retry. While waiting, the page table * page may have been allocated. */ return (NULL); } if ((m->flags & PG_ZERO) == 0) pmap_zero_page(m); /* * Map the pagetable page into the process address space, if * it isn't already there. */ if (ptepindex >= (NUPDE + NUPDPE)) { pml4_entry_t *pml4; vm_pindex_t pml4index; /* Wire up a new PDPE page */ pml4index = ptepindex - (NUPDE + NUPDPE); pml4 = &pmap->pm_pml4[pml4index]; *pml4 = VM_PAGE_TO_PHYS(m) | PG_U | PG_RW | PG_V | PG_A | PG_M; } else if (ptepindex >= NUPDE) { vm_pindex_t pml4index; vm_pindex_t pdpindex; pml4_entry_t *pml4; pdp_entry_t *pdp; /* Wire up a new PDE page */ pdpindex = ptepindex - NUPDE; pml4index = pdpindex >> NPML4EPGSHIFT; pml4 = &pmap->pm_pml4[pml4index]; if ((*pml4 & PG_V) == 0) { /* Have to allocate a new pdp, recurse */ if (_pmap_allocpte(pmap, NUPDE + NUPDPE + pml4index, lockp) == NULL) { --m->wire_count; atomic_subtract_int(&vm_cnt.v_wire_count, 1); vm_page_free_zero(m); return (NULL); } } else { /* Add reference to pdp page */ pdppg = PHYS_TO_VM_PAGE(*pml4 & PG_FRAME); pdppg->wire_count++; } pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME); /* Now find the pdp page */ pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)]; *pdp = VM_PAGE_TO_PHYS(m) | PG_U | PG_RW | PG_V | PG_A | PG_M; } else { vm_pindex_t pml4index; vm_pindex_t pdpindex; pml4_entry_t *pml4; pdp_entry_t *pdp; pd_entry_t *pd; /* Wire up a new PTE page */ pdpindex = ptepindex >> NPDPEPGSHIFT; pml4index = pdpindex >> NPML4EPGSHIFT; /* First, find the pdp and check that its valid. */ pml4 = &pmap->pm_pml4[pml4index]; if ((*pml4 & PG_V) == 0) { /* Have to allocate a new pd, recurse */ if (_pmap_allocpte(pmap, NUPDE + pdpindex, lockp) == NULL) { --m->wire_count; atomic_subtract_int(&vm_cnt.v_wire_count, 1); vm_page_free_zero(m); return (NULL); } pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME); pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)]; } else { pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME); pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)]; if ((*pdp & PG_V) == 0) { /* Have to allocate a new pd, recurse */ if (_pmap_allocpte(pmap, NUPDE + pdpindex, lockp) == NULL) { --m->wire_count; atomic_subtract_int(&vm_cnt.v_wire_count, 1); vm_page_free_zero(m); return (NULL); } } else { /* Add reference to the pd page */ pdpg = PHYS_TO_VM_PAGE(*pdp & PG_FRAME); pdpg->wire_count++; } } pd = (pd_entry_t *)PHYS_TO_DMAP(*pdp & PG_FRAME); /* Now we know where the page directory page is */ pd = &pd[ptepindex & ((1ul << NPDEPGSHIFT) - 1)]; *pd = VM_PAGE_TO_PHYS(m) | PG_U | PG_RW | PG_V | PG_A | PG_M; } pmap_resident_count_inc(pmap, 1); return (m); } static vm_page_t pmap_allocpde(pmap_t pmap, vm_offset_t va, struct rwlock **lockp) { vm_pindex_t pdpindex, ptepindex; pdp_entry_t *pdpe, PG_V; vm_page_t pdpg; PG_V = pmap_valid_bit(pmap); retry: pdpe = pmap_pdpe(pmap, va); if (pdpe != NULL && (*pdpe & PG_V) != 0) { /* Add a reference to the pd page. */ pdpg = PHYS_TO_VM_PAGE(*pdpe & PG_FRAME); pdpg->wire_count++; } else { /* Allocate a pd page. */ ptepindex = pmap_pde_pindex(va); pdpindex = ptepindex >> NPDPEPGSHIFT; pdpg = _pmap_allocpte(pmap, NUPDE + pdpindex, lockp); if (pdpg == NULL && lockp != NULL) goto retry; } return (pdpg); } static vm_page_t pmap_allocpte(pmap_t pmap, vm_offset_t va, struct rwlock **lockp) { vm_pindex_t ptepindex; pd_entry_t *pd, PG_V; vm_page_t m; PG_V = pmap_valid_bit(pmap); /* * Calculate pagetable page index */ ptepindex = pmap_pde_pindex(va); retry: /* * Get the page directory entry */ pd = pmap_pde(pmap, va); /* * This supports switching from a 2MB page to a * normal 4K page. */ if (pd != NULL && (*pd & (PG_PS | PG_V)) == (PG_PS | PG_V)) { if (!pmap_demote_pde_locked(pmap, pd, va, lockp)) { /* * Invalidation of the 2MB page mapping may have caused * the deallocation of the underlying PD page. */ pd = NULL; } } /* * If the page table page is mapped, we just increment the * hold count, and activate it. */ if (pd != NULL && (*pd & PG_V) != 0) { m = PHYS_TO_VM_PAGE(*pd & PG_FRAME); m->wire_count++; } else { /* * Here if the pte page isn't mapped, or if it has been * deallocated. */ m = _pmap_allocpte(pmap, ptepindex, lockp); if (m == NULL && lockp != NULL) goto retry; } return (m); } /*************************************************** * Pmap allocation/deallocation routines. ***************************************************/ /* * Release any resources held by the given physical map. * Called when a pmap initialized by pmap_pinit is being released. * Should only be called if the map contains no valid mappings. */ void pmap_release(pmap_t pmap) { vm_page_t m; int i; KASSERT(pmap->pm_stats.resident_count == 0, ("pmap_release: pmap resident count %ld != 0", pmap->pm_stats.resident_count)); KASSERT(vm_radix_is_empty(&pmap->pm_root), ("pmap_release: pmap has reserved page table page(s)")); KASSERT(CPU_EMPTY(&pmap->pm_active), ("releasing active pmap %p", pmap)); - if (pmap_pcid_enabled) { - /* - * Invalidate any left TLB entries, to allow the reuse - * of the pcid. - */ - pmap_invalidate_all(pmap); - } - m = PHYS_TO_VM_PAGE(DMAP_TO_PHYS((vm_offset_t)pmap->pm_pml4)); for (i = 0; i < NKPML4E; i++) /* KVA */ pmap->pm_pml4[KPML4BASE + i] = 0; for (i = 0; i < ndmpdpphys; i++)/* Direct Map */ pmap->pm_pml4[DMPML4I + i] = 0; pmap->pm_pml4[PML4PML4I] = 0; /* Recursive Mapping */ m->wire_count--; atomic_subtract_int(&vm_cnt.v_wire_count, 1); vm_page_free_zero(m); - if (pmap->pm_pcid != -1) - free_unr(&pcid_unr, pmap->pm_pcid); } static int kvm_size(SYSCTL_HANDLER_ARGS) { unsigned long ksize = VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS; return sysctl_handle_long(oidp, &ksize, 0, req); } SYSCTL_PROC(_vm, OID_AUTO, kvm_size, CTLTYPE_LONG|CTLFLAG_RD, 0, 0, kvm_size, "LU", "Size of KVM"); static int kvm_free(SYSCTL_HANDLER_ARGS) { unsigned long kfree = VM_MAX_KERNEL_ADDRESS - kernel_vm_end; return sysctl_handle_long(oidp, &kfree, 0, req); } SYSCTL_PROC(_vm, OID_AUTO, kvm_free, CTLTYPE_LONG|CTLFLAG_RD, 0, 0, kvm_free, "LU", "Amount of KVM free"); /* * grow the number of kernel page table entries, if needed */ void pmap_growkernel(vm_offset_t addr) { vm_paddr_t paddr; vm_page_t nkpg; pd_entry_t *pde, newpdir; pdp_entry_t *pdpe; mtx_assert(&kernel_map->system_mtx, MA_OWNED); /* * Return if "addr" is within the range of kernel page table pages * that were preallocated during pmap bootstrap. Moreover, leave * "kernel_vm_end" and the kernel page table as they were. * * The correctness of this action is based on the following * argument: vm_map_insert() allocates contiguous ranges of the * kernel virtual address space. It calls this function if a range * ends after "kernel_vm_end". If the kernel is mapped between * "kernel_vm_end" and "addr", then the range cannot begin at * "kernel_vm_end". In fact, its beginning address cannot be less * than the kernel. Thus, there is no immediate need to allocate * any new kernel page table pages between "kernel_vm_end" and * "KERNBASE". */ if (KERNBASE < addr && addr <= KERNBASE + nkpt * NBPDR) return; addr = roundup2(addr, NBPDR); if (addr - 1 >= kernel_map->max_offset) addr = kernel_map->max_offset; while (kernel_vm_end < addr) { pdpe = pmap_pdpe(kernel_pmap, kernel_vm_end); if ((*pdpe & X86_PG_V) == 0) { /* We need a new PDP entry */ nkpg = vm_page_alloc(NULL, kernel_vm_end >> PDPSHIFT, VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED | VM_ALLOC_ZERO); if (nkpg == NULL) panic("pmap_growkernel: no memory to grow kernel"); if ((nkpg->flags & PG_ZERO) == 0) pmap_zero_page(nkpg); paddr = VM_PAGE_TO_PHYS(nkpg); *pdpe = (pdp_entry_t)(paddr | X86_PG_V | X86_PG_RW | X86_PG_A | X86_PG_M); continue; /* try again */ } pde = pmap_pdpe_to_pde(pdpe, kernel_vm_end); if ((*pde & X86_PG_V) != 0) { kernel_vm_end = (kernel_vm_end + NBPDR) & ~PDRMASK; if (kernel_vm_end - 1 >= kernel_map->max_offset) { kernel_vm_end = kernel_map->max_offset; break; } continue; } nkpg = vm_page_alloc(NULL, pmap_pde_pindex(kernel_vm_end), VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED | VM_ALLOC_ZERO); if (nkpg == NULL) panic("pmap_growkernel: no memory to grow kernel"); if ((nkpg->flags & PG_ZERO) == 0) pmap_zero_page(nkpg); paddr = VM_PAGE_TO_PHYS(nkpg); newpdir = paddr | X86_PG_V | X86_PG_RW | X86_PG_A | X86_PG_M; pde_store(pde, newpdir); kernel_vm_end = (kernel_vm_end + NBPDR) & ~PDRMASK; if (kernel_vm_end - 1 >= kernel_map->max_offset) { kernel_vm_end = kernel_map->max_offset; break; } } } /*************************************************** * page management routines. ***************************************************/ CTASSERT(sizeof(struct pv_chunk) == PAGE_SIZE); CTASSERT(_NPCM == 3); CTASSERT(_NPCPV == 168); static __inline struct pv_chunk * pv_to_chunk(pv_entry_t pv) { return ((struct pv_chunk *)((uintptr_t)pv & ~(uintptr_t)PAGE_MASK)); } #define PV_PMAP(pv) (pv_to_chunk(pv)->pc_pmap) #define PC_FREE0 0xfffffffffffffffful #define PC_FREE1 0xfffffffffffffffful #define PC_FREE2 0x000000fffffffffful static const uint64_t pc_freemask[_NPCM] = { PC_FREE0, PC_FREE1, PC_FREE2 }; #ifdef PV_STATS static int pc_chunk_count, pc_chunk_allocs, pc_chunk_frees, pc_chunk_tryfail; SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_count, CTLFLAG_RD, &pc_chunk_count, 0, "Current number of pv entry chunks"); SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_allocs, CTLFLAG_RD, &pc_chunk_allocs, 0, "Current number of pv entry chunks allocated"); SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_frees, CTLFLAG_RD, &pc_chunk_frees, 0, "Current number of pv entry chunks frees"); SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_tryfail, CTLFLAG_RD, &pc_chunk_tryfail, 0, "Number of times tried to get a chunk page but failed."); static long pv_entry_frees, pv_entry_allocs, pv_entry_count; static int pv_entry_spare; SYSCTL_LONG(_vm_pmap, OID_AUTO, pv_entry_frees, CTLFLAG_RD, &pv_entry_frees, 0, "Current number of pv entry frees"); SYSCTL_LONG(_vm_pmap, OID_AUTO, pv_entry_allocs, CTLFLAG_RD, &pv_entry_allocs, 0, "Current number of pv entry allocs"); SYSCTL_LONG(_vm_pmap, OID_AUTO, pv_entry_count, CTLFLAG_RD, &pv_entry_count, 0, "Current number of pv entries"); SYSCTL_INT(_vm_pmap, OID_AUTO, pv_entry_spare, CTLFLAG_RD, &pv_entry_spare, 0, "Current number of spare pv entries"); #endif /* * We are in a serious low memory condition. Resort to * drastic measures to free some pages so we can allocate * another pv entry chunk. * * Returns NULL if PV entries were reclaimed from the specified pmap. * * We do not, however, unmap 2mpages because subsequent accesses will * allocate per-page pv entries until repromotion occurs, thereby * exacerbating the shortage of free pv entries. */ static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp) { struct pch new_tail; struct pv_chunk *pc; struct md_page *pvh; pd_entry_t *pde; pmap_t pmap; pt_entry_t *pte, tpte; pt_entry_t PG_G, PG_A, PG_M, PG_RW; pv_entry_t pv; vm_offset_t va; vm_page_t m, m_pc; struct spglist free; uint64_t inuse; int bit, field, freed; rw_assert(&pvh_global_lock, RA_LOCKED); PMAP_LOCK_ASSERT(locked_pmap, MA_OWNED); KASSERT(lockp != NULL, ("reclaim_pv_chunk: lockp is NULL")); pmap = NULL; m_pc = NULL; PG_G = PG_A = PG_M = PG_RW = 0; SLIST_INIT(&free); TAILQ_INIT(&new_tail); mtx_lock(&pv_chunks_mutex); while ((pc = TAILQ_FIRST(&pv_chunks)) != NULL && SLIST_EMPTY(&free)) { TAILQ_REMOVE(&pv_chunks, pc, pc_lru); mtx_unlock(&pv_chunks_mutex); if (pmap != pc->pc_pmap) { if (pmap != NULL) { pmap_invalidate_all(pmap); if (pmap != locked_pmap) PMAP_UNLOCK(pmap); } pmap = pc->pc_pmap; /* Avoid deadlock and lock recursion. */ if (pmap > locked_pmap) { RELEASE_PV_LIST_LOCK(lockp); PMAP_LOCK(pmap); } else if (pmap != locked_pmap && !PMAP_TRYLOCK(pmap)) { pmap = NULL; TAILQ_INSERT_TAIL(&new_tail, pc, pc_lru); mtx_lock(&pv_chunks_mutex); continue; } PG_G = pmap_global_bit(pmap); PG_A = pmap_accessed_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_RW = pmap_rw_bit(pmap); } /* * Destroy every non-wired, 4 KB page mapping in the chunk. */ freed = 0; for (field = 0; field < _NPCM; field++) { for (inuse = ~pc->pc_map[field] & pc_freemask[field]; inuse != 0; inuse &= ~(1UL << bit)) { bit = bsfq(inuse); pv = &pc->pc_pventry[field * 64 + bit]; va = pv->pv_va; pde = pmap_pde(pmap, va); if ((*pde & PG_PS) != 0) continue; pte = pmap_pde_to_pte(pde, va); if ((*pte & PG_W) != 0) continue; tpte = pte_load_clear(pte); if ((tpte & PG_G) != 0) pmap_invalidate_page(pmap, va); m = PHYS_TO_VM_PAGE(tpte & PG_FRAME); if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) vm_page_dirty(m); if ((tpte & PG_A) != 0) vm_page_aflag_set(m, PGA_REFERENCED); CHANGE_PV_LIST_LOCK_TO_VM_PAGE(lockp, m); TAILQ_REMOVE(&m->md.pv_list, pv, pv_next); m->md.pv_gen++; if (TAILQ_EMPTY(&m->md.pv_list) && (m->flags & PG_FICTITIOUS) == 0) { pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m)); if (TAILQ_EMPTY(&pvh->pv_list)) { vm_page_aflag_clear(m, PGA_WRITEABLE); } } pc->pc_map[field] |= 1UL << bit; pmap_unuse_pt(pmap, va, *pde, &free); freed++; } } if (freed == 0) { TAILQ_INSERT_TAIL(&new_tail, pc, pc_lru); mtx_lock(&pv_chunks_mutex); continue; } /* Every freed mapping is for a 4 KB page. */ pmap_resident_count_dec(pmap, freed); PV_STAT(atomic_add_long(&pv_entry_frees, freed)); PV_STAT(atomic_add_int(&pv_entry_spare, freed)); PV_STAT(atomic_subtract_long(&pv_entry_count, freed)); TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list); if (pc->pc_map[0] == PC_FREE0 && pc->pc_map[1] == PC_FREE1 && pc->pc_map[2] == PC_FREE2) { PV_STAT(atomic_subtract_int(&pv_entry_spare, _NPCPV)); PV_STAT(atomic_subtract_int(&pc_chunk_count, 1)); PV_STAT(atomic_add_int(&pc_chunk_frees, 1)); /* Entire chunk is free; return it. */ m_pc = PHYS_TO_VM_PAGE(DMAP_TO_PHYS((vm_offset_t)pc)); dump_drop_page(m_pc->phys_addr); mtx_lock(&pv_chunks_mutex); break; } TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list); TAILQ_INSERT_TAIL(&new_tail, pc, pc_lru); mtx_lock(&pv_chunks_mutex); /* One freed pv entry in locked_pmap is sufficient. */ if (pmap == locked_pmap) break; } TAILQ_CONCAT(&pv_chunks, &new_tail, pc_lru); mtx_unlock(&pv_chunks_mutex); if (pmap != NULL) { pmap_invalidate_all(pmap); if (pmap != locked_pmap) PMAP_UNLOCK(pmap); } if (m_pc == NULL && !SLIST_EMPTY(&free)) { m_pc = SLIST_FIRST(&free); SLIST_REMOVE_HEAD(&free, plinks.s.ss); /* Recycle a freed page table page. */ m_pc->wire_count = 1; atomic_add_int(&vm_cnt.v_wire_count, 1); } pmap_free_zero_pages(&free); return (m_pc); } /* * free the pv_entry back to the free list */ static void free_pv_entry(pmap_t pmap, pv_entry_t pv) { struct pv_chunk *pc; int idx, field, bit; rw_assert(&pvh_global_lock, RA_LOCKED); PMAP_LOCK_ASSERT(pmap, MA_OWNED); PV_STAT(atomic_add_long(&pv_entry_frees, 1)); PV_STAT(atomic_add_int(&pv_entry_spare, 1)); PV_STAT(atomic_subtract_long(&pv_entry_count, 1)); pc = pv_to_chunk(pv); idx = pv - &pc->pc_pventry[0]; field = idx / 64; bit = idx % 64; pc->pc_map[field] |= 1ul << bit; if (pc->pc_map[0] != PC_FREE0 || pc->pc_map[1] != PC_FREE1 || pc->pc_map[2] != PC_FREE2) { /* 98% of the time, pc is already at the head of the list. */ if (__predict_false(pc != TAILQ_FIRST(&pmap->pm_pvchunk))) { TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list); TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list); } return; } TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list); free_pv_chunk(pc); } static void free_pv_chunk(struct pv_chunk *pc) { vm_page_t m; mtx_lock(&pv_chunks_mutex); TAILQ_REMOVE(&pv_chunks, pc, pc_lru); mtx_unlock(&pv_chunks_mutex); PV_STAT(atomic_subtract_int(&pv_entry_spare, _NPCPV)); PV_STAT(atomic_subtract_int(&pc_chunk_count, 1)); PV_STAT(atomic_add_int(&pc_chunk_frees, 1)); /* entire chunk is free, return it */ m = PHYS_TO_VM_PAGE(DMAP_TO_PHYS((vm_offset_t)pc)); dump_drop_page(m->phys_addr); vm_page_unwire(m, PQ_INACTIVE); vm_page_free(m); } /* * Returns a new PV entry, allocating a new PV chunk from the system when * needed. If this PV chunk allocation fails and a PV list lock pointer was * given, a PV chunk is reclaimed from an arbitrary pmap. Otherwise, NULL is * returned. * * The given PV list lock may be released. */ static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp) { int bit, field; pv_entry_t pv; struct pv_chunk *pc; vm_page_t m; rw_assert(&pvh_global_lock, RA_LOCKED); PMAP_LOCK_ASSERT(pmap, MA_OWNED); PV_STAT(atomic_add_long(&pv_entry_allocs, 1)); retry: pc = TAILQ_FIRST(&pmap->pm_pvchunk); if (pc != NULL) { for (field = 0; field < _NPCM; field++) { if (pc->pc_map[field]) { bit = bsfq(pc->pc_map[field]); break; } } if (field < _NPCM) { pv = &pc->pc_pventry[field * 64 + bit]; pc->pc_map[field] &= ~(1ul << bit); /* If this was the last item, move it to tail */ if (pc->pc_map[0] == 0 && pc->pc_map[1] == 0 && pc->pc_map[2] == 0) { TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list); TAILQ_INSERT_TAIL(&pmap->pm_pvchunk, pc, pc_list); } PV_STAT(atomic_add_long(&pv_entry_count, 1)); PV_STAT(atomic_subtract_int(&pv_entry_spare, 1)); return (pv); } } /* No free items, allocate another chunk */ m = vm_page_alloc(NULL, 0, VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED); if (m == NULL) { if (lockp == NULL) { PV_STAT(pc_chunk_tryfail++); return (NULL); } m = reclaim_pv_chunk(pmap, lockp); if (m == NULL) goto retry; } PV_STAT(atomic_add_int(&pc_chunk_count, 1)); PV_STAT(atomic_add_int(&pc_chunk_allocs, 1)); dump_add_page(m->phys_addr); pc = (void *)PHYS_TO_DMAP(m->phys_addr); pc->pc_pmap = pmap; pc->pc_map[0] = PC_FREE0 & ~1ul; /* preallocated bit 0 */ pc->pc_map[1] = PC_FREE1; pc->pc_map[2] = PC_FREE2; mtx_lock(&pv_chunks_mutex); TAILQ_INSERT_TAIL(&pv_chunks, pc, pc_lru); mtx_unlock(&pv_chunks_mutex); pv = &pc->pc_pventry[0]; TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list); PV_STAT(atomic_add_long(&pv_entry_count, 1)); PV_STAT(atomic_add_int(&pv_entry_spare, _NPCPV - 1)); return (pv); } /* * Returns the number of one bits within the given PV chunk map element. * * The erratas for Intel processors state that "POPCNT Instruction May * Take Longer to Execute Than Expected". It is believed that the * issue is the spurious dependency on the destination register. * Provide a hint to the register rename logic that the destination * value is overwritten, by clearing it, as suggested in the * optimization manual. It should be cheap for unaffected processors * as well. * * Reference numbers for erratas are * 4th Gen Core: HSD146 * 5th Gen Core: BDM85 */ static int popcnt_pc_map_elem_pq(uint64_t elem) { u_long result; __asm __volatile("xorl %k0,%k0;popcntq %1,%0" : "=&r" (result) : "rm" (elem)); return (result); } /* * Ensure that the number of spare PV entries in the specified pmap meets or * exceeds the given count, "needed". * * The given PV list lock may be released. */ static void reserve_pv_entries(pmap_t pmap, int needed, struct rwlock **lockp) { struct pch new_tail; struct pv_chunk *pc; int avail, free; vm_page_t m; rw_assert(&pvh_global_lock, RA_LOCKED); PMAP_LOCK_ASSERT(pmap, MA_OWNED); KASSERT(lockp != NULL, ("reserve_pv_entries: lockp is NULL")); /* * Newly allocated PV chunks must be stored in a private list until * the required number of PV chunks have been allocated. Otherwise, * reclaim_pv_chunk() could recycle one of these chunks. In * contrast, these chunks must be added to the pmap upon allocation. */ TAILQ_INIT(&new_tail); retry: avail = 0; TAILQ_FOREACH(pc, &pmap->pm_pvchunk, pc_list) { if ((cpu_feature2 & CPUID2_POPCNT) == 0) { free = bitcount64(pc->pc_map[0]); free += bitcount64(pc->pc_map[1]); free += bitcount64(pc->pc_map[2]); } else { free = popcnt_pc_map_elem_pq(pc->pc_map[0]); free += popcnt_pc_map_elem_pq(pc->pc_map[1]); free += popcnt_pc_map_elem_pq(pc->pc_map[2]); } if (free == 0) break; avail += free; if (avail >= needed) break; } for (; avail < needed; avail += _NPCPV) { m = vm_page_alloc(NULL, 0, VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED); if (m == NULL) { m = reclaim_pv_chunk(pmap, lockp); if (m == NULL) goto retry; } PV_STAT(atomic_add_int(&pc_chunk_count, 1)); PV_STAT(atomic_add_int(&pc_chunk_allocs, 1)); dump_add_page(m->phys_addr); pc = (void *)PHYS_TO_DMAP(m->phys_addr); pc->pc_pmap = pmap; pc->pc_map[0] = PC_FREE0; pc->pc_map[1] = PC_FREE1; pc->pc_map[2] = PC_FREE2; TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list); TAILQ_INSERT_TAIL(&new_tail, pc, pc_lru); PV_STAT(atomic_add_int(&pv_entry_spare, _NPCPV)); } if (!TAILQ_EMPTY(&new_tail)) { mtx_lock(&pv_chunks_mutex); TAILQ_CONCAT(&pv_chunks, &new_tail, pc_lru); mtx_unlock(&pv_chunks_mutex); } } /* * First find and then remove the pv entry for the specified pmap and virtual * address from the specified pv list. Returns the pv entry if found and NULL * otherwise. This operation can be performed on pv lists for either 4KB or * 2MB page mappings. */ static __inline pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap, vm_offset_t va) { pv_entry_t pv; rw_assert(&pvh_global_lock, RA_LOCKED); TAILQ_FOREACH(pv, &pvh->pv_list, pv_next) { if (pmap == PV_PMAP(pv) && va == pv->pv_va) { TAILQ_REMOVE(&pvh->pv_list, pv, pv_next); pvh->pv_gen++; break; } } return (pv); } /* * After demotion from a 2MB page mapping to 512 4KB page mappings, * destroy the pv entry for the 2MB page mapping and reinstantiate the pv * entries for each of the 4KB page mappings. */ static void pmap_pv_demote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa, struct rwlock **lockp) { struct md_page *pvh; struct pv_chunk *pc; pv_entry_t pv; vm_offset_t va_last; vm_page_t m; int bit, field; rw_assert(&pvh_global_lock, RA_LOCKED); PMAP_LOCK_ASSERT(pmap, MA_OWNED); KASSERT((pa & PDRMASK) == 0, ("pmap_pv_demote_pde: pa is not 2mpage aligned")); CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, pa); /* * Transfer the 2mpage's pv entry for this mapping to the first * page's pv list. Once this transfer begins, the pv list lock * must not be released until the last pv entry is reinstantiated. */ pvh = pa_to_pvh(pa); va = trunc_2mpage(va); pv = pmap_pvh_remove(pvh, pmap, va); KASSERT(pv != NULL, ("pmap_pv_demote_pde: pv not found")); m = PHYS_TO_VM_PAGE(pa); TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_next); m->md.pv_gen++; /* Instantiate the remaining NPTEPG - 1 pv entries. */ PV_STAT(atomic_add_long(&pv_entry_allocs, NPTEPG - 1)); va_last = va + NBPDR - PAGE_SIZE; for (;;) { pc = TAILQ_FIRST(&pmap->pm_pvchunk); KASSERT(pc->pc_map[0] != 0 || pc->pc_map[1] != 0 || pc->pc_map[2] != 0, ("pmap_pv_demote_pde: missing spare")); for (field = 0; field < _NPCM; field++) { while (pc->pc_map[field]) { bit = bsfq(pc->pc_map[field]); pc->pc_map[field] &= ~(1ul << bit); pv = &pc->pc_pventry[field * 64 + bit]; va += PAGE_SIZE; pv->pv_va = va; m++; KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("pmap_pv_demote_pde: page %p is not managed", m)); TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_next); m->md.pv_gen++; if (va == va_last) goto out; } } TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list); TAILQ_INSERT_TAIL(&pmap->pm_pvchunk, pc, pc_list); } out: if (pc->pc_map[0] == 0 && pc->pc_map[1] == 0 && pc->pc_map[2] == 0) { TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list); TAILQ_INSERT_TAIL(&pmap->pm_pvchunk, pc, pc_list); } PV_STAT(atomic_add_long(&pv_entry_count, NPTEPG - 1)); PV_STAT(atomic_subtract_int(&pv_entry_spare, NPTEPG - 1)); } /* * After promotion from 512 4KB page mappings to a single 2MB page mapping, * replace the many pv entries for the 4KB page mappings by a single pv entry * for the 2MB page mapping. */ static void pmap_pv_promote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa, struct rwlock **lockp) { struct md_page *pvh; pv_entry_t pv; vm_offset_t va_last; vm_page_t m; rw_assert(&pvh_global_lock, RA_LOCKED); KASSERT((pa & PDRMASK) == 0, ("pmap_pv_promote_pde: pa is not 2mpage aligned")); CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, pa); /* * Transfer the first page's pv entry for this mapping to the 2mpage's * pv list. Aside from avoiding the cost of a call to get_pv_entry(), * a transfer avoids the possibility that get_pv_entry() calls * reclaim_pv_chunk() and that reclaim_pv_chunk() removes one of the * mappings that is being promoted. */ m = PHYS_TO_VM_PAGE(pa); va = trunc_2mpage(va); pv = pmap_pvh_remove(&m->md, pmap, va); KASSERT(pv != NULL, ("pmap_pv_promote_pde: pv not found")); pvh = pa_to_pvh(pa); TAILQ_INSERT_TAIL(&pvh->pv_list, pv, pv_next); pvh->pv_gen++; /* Free the remaining NPTEPG - 1 pv entries. */ va_last = va + NBPDR - PAGE_SIZE; do { m++; va += PAGE_SIZE; pmap_pvh_free(&m->md, pmap, va); } while (va < va_last); } /* * First find and then destroy the pv entry for the specified pmap and virtual * address. This operation can be performed on pv lists for either 4KB or 2MB * page mappings. */ static void pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va) { pv_entry_t pv; pv = pmap_pvh_remove(pvh, pmap, va); KASSERT(pv != NULL, ("pmap_pvh_free: pv not found")); free_pv_entry(pmap, pv); } /* * Conditionally create the PV entry for a 4KB page mapping if the required * memory can be allocated without resorting to reclamation. */ static boolean_t pmap_try_insert_pv_entry(pmap_t pmap, vm_offset_t va, vm_page_t m, struct rwlock **lockp) { pv_entry_t pv; rw_assert(&pvh_global_lock, RA_LOCKED); PMAP_LOCK_ASSERT(pmap, MA_OWNED); /* Pass NULL instead of the lock pointer to disable reclamation. */ if ((pv = get_pv_entry(pmap, NULL)) != NULL) { pv->pv_va = va; CHANGE_PV_LIST_LOCK_TO_VM_PAGE(lockp, m); TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_next); m->md.pv_gen++; return (TRUE); } else return (FALSE); } /* * Conditionally create the PV entry for a 2MB page mapping if the required * memory can be allocated without resorting to reclamation. */ static boolean_t pmap_pv_insert_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa, struct rwlock **lockp) { struct md_page *pvh; pv_entry_t pv; rw_assert(&pvh_global_lock, RA_LOCKED); PMAP_LOCK_ASSERT(pmap, MA_OWNED); /* Pass NULL instead of the lock pointer to disable reclamation. */ if ((pv = get_pv_entry(pmap, NULL)) != NULL) { pv->pv_va = va; CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, pa); pvh = pa_to_pvh(pa); TAILQ_INSERT_TAIL(&pvh->pv_list, pv, pv_next); pvh->pv_gen++; return (TRUE); } else return (FALSE); } /* * Fills a page table page with mappings to consecutive physical pages. */ static void pmap_fill_ptp(pt_entry_t *firstpte, pt_entry_t newpte) { pt_entry_t *pte; for (pte = firstpte; pte < firstpte + NPTEPG; pte++) { *pte = newpte; newpte += PAGE_SIZE; } } /* * Tries to demote a 2MB page mapping. If demotion fails, the 2MB page * mapping is invalidated. */ static boolean_t pmap_demote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va) { struct rwlock *lock; boolean_t rv; lock = NULL; rv = pmap_demote_pde_locked(pmap, pde, va, &lock); if (lock != NULL) rw_wunlock(lock); return (rv); } static boolean_t pmap_demote_pde_locked(pmap_t pmap, pd_entry_t *pde, vm_offset_t va, struct rwlock **lockp) { pd_entry_t newpde, oldpde; pt_entry_t *firstpte, newpte; pt_entry_t PG_A, PG_G, PG_M, PG_RW, PG_V; vm_paddr_t mptepa; vm_page_t mpte; struct spglist free; int PG_PTE_CACHE; PG_G = pmap_global_bit(pmap); PG_A = pmap_accessed_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_RW = pmap_rw_bit(pmap); PG_V = pmap_valid_bit(pmap); PG_PTE_CACHE = pmap_cache_mask(pmap, 0); PMAP_LOCK_ASSERT(pmap, MA_OWNED); oldpde = *pde; KASSERT((oldpde & (PG_PS | PG_V)) == (PG_PS | PG_V), ("pmap_demote_pde: oldpde is missing PG_PS and/or PG_V")); if ((oldpde & PG_A) != 0 && (mpte = pmap_lookup_pt_page(pmap, va)) != NULL) pmap_remove_pt_page(pmap, mpte); else { KASSERT((oldpde & PG_W) == 0, ("pmap_demote_pde: page table page for a wired mapping" " is missing")); /* * Invalidate the 2MB page mapping and return "failure" if the * mapping was never accessed or the allocation of the new * page table page fails. If the 2MB page mapping belongs to * the direct map region of the kernel's address space, then * the page allocation request specifies the highest possible * priority (VM_ALLOC_INTERRUPT). Otherwise, the priority is * normal. Page table pages are preallocated for every other * part of the kernel address space, so the direct map region * is the only part of the kernel address space that must be * handled here. */ if ((oldpde & PG_A) == 0 || (mpte = vm_page_alloc(NULL, pmap_pde_pindex(va), (va >= DMAP_MIN_ADDRESS && va < DMAP_MAX_ADDRESS ? VM_ALLOC_INTERRUPT : VM_ALLOC_NORMAL) | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED)) == NULL) { SLIST_INIT(&free); pmap_remove_pde(pmap, pde, trunc_2mpage(va), &free, lockp); pmap_invalidate_page(pmap, trunc_2mpage(va)); pmap_free_zero_pages(&free); CTR2(KTR_PMAP, "pmap_demote_pde: failure for va %#lx" " in pmap %p", va, pmap); return (FALSE); } if (va < VM_MAXUSER_ADDRESS) pmap_resident_count_inc(pmap, 1); } mptepa = VM_PAGE_TO_PHYS(mpte); firstpte = (pt_entry_t *)PHYS_TO_DMAP(mptepa); newpde = mptepa | PG_M | PG_A | (oldpde & PG_U) | PG_RW | PG_V; KASSERT((oldpde & PG_A) != 0, ("pmap_demote_pde: oldpde is missing PG_A")); KASSERT((oldpde & (PG_M | PG_RW)) != PG_RW, ("pmap_demote_pde: oldpde is missing PG_M")); newpte = oldpde & ~PG_PS; newpte = pmap_swap_pat(pmap, newpte); /* * If the page table page is new, initialize it. */ if (mpte->wire_count == 1) { mpte->wire_count = NPTEPG; pmap_fill_ptp(firstpte, newpte); } KASSERT((*firstpte & PG_FRAME) == (newpte & PG_FRAME), ("pmap_demote_pde: firstpte and newpte map different physical" " addresses")); /* * If the mapping has changed attributes, update the page table * entries. */ if ((*firstpte & PG_PTE_PROMOTE) != (newpte & PG_PTE_PROMOTE)) pmap_fill_ptp(firstpte, newpte); /* * The spare PV entries must be reserved prior to demoting the * mapping, that is, prior to changing the PDE. Otherwise, the state * of the PDE and the PV lists will be inconsistent, which can result * in reclaim_pv_chunk() attempting to remove a PV entry from the * wrong PV list and pmap_pv_demote_pde() failing to find the expected * PV entry for the 2MB page mapping that is being demoted. */ if ((oldpde & PG_MANAGED) != 0) reserve_pv_entries(pmap, NPTEPG - 1, lockp); /* * Demote the mapping. This pmap is locked. The old PDE has * PG_A set. If the old PDE has PG_RW set, it also has PG_M * set. Thus, there is no danger of a race with another * processor changing the setting of PG_A and/or PG_M between * the read above and the store below. */ if (workaround_erratum383) pmap_update_pde(pmap, va, pde, newpde); else pde_store(pde, newpde); /* * Invalidate a stale recursive mapping of the page table page. */ if (va >= VM_MAXUSER_ADDRESS) pmap_invalidate_page(pmap, (vm_offset_t)vtopte(va)); /* * Demote the PV entry. */ if ((oldpde & PG_MANAGED) != 0) pmap_pv_demote_pde(pmap, va, oldpde & PG_PS_FRAME, lockp); atomic_add_long(&pmap_pde_demotions, 1); CTR2(KTR_PMAP, "pmap_demote_pde: success for va %#lx" " in pmap %p", va, pmap); return (TRUE); } /* * pmap_remove_kernel_pde: Remove a kernel superpage mapping. */ static void pmap_remove_kernel_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va) { pd_entry_t newpde; vm_paddr_t mptepa; vm_page_t mpte; KASSERT(pmap == kernel_pmap, ("pmap %p is not kernel_pmap", pmap)); PMAP_LOCK_ASSERT(pmap, MA_OWNED); mpte = pmap_lookup_pt_page(pmap, va); if (mpte == NULL) panic("pmap_remove_kernel_pde: Missing pt page."); pmap_remove_pt_page(pmap, mpte); mptepa = VM_PAGE_TO_PHYS(mpte); newpde = mptepa | X86_PG_M | X86_PG_A | X86_PG_RW | X86_PG_V; /* * Initialize the page table page. */ pagezero((void *)PHYS_TO_DMAP(mptepa)); /* * Demote the mapping. */ if (workaround_erratum383) pmap_update_pde(pmap, va, pde, newpde); else pde_store(pde, newpde); /* * Invalidate a stale recursive mapping of the page table page. */ pmap_invalidate_page(pmap, (vm_offset_t)vtopte(va)); } /* * pmap_remove_pde: do the things to unmap a superpage in a process */ static int pmap_remove_pde(pmap_t pmap, pd_entry_t *pdq, vm_offset_t sva, struct spglist *free, struct rwlock **lockp) { struct md_page *pvh; pd_entry_t oldpde; vm_offset_t eva, va; vm_page_t m, mpte; pt_entry_t PG_G, PG_A, PG_M, PG_RW; PG_G = pmap_global_bit(pmap); PG_A = pmap_accessed_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_RW = pmap_rw_bit(pmap); PMAP_LOCK_ASSERT(pmap, MA_OWNED); KASSERT((sva & PDRMASK) == 0, ("pmap_remove_pde: sva is not 2mpage aligned")); oldpde = pte_load_clear(pdq); if (oldpde & PG_W) pmap->pm_stats.wired_count -= NBPDR / PAGE_SIZE; /* * Machines that don't support invlpg, also don't support * PG_G. */ if (oldpde & PG_G) pmap_invalidate_page(kernel_pmap, sva); pmap_resident_count_dec(pmap, NBPDR / PAGE_SIZE); if (oldpde & PG_MANAGED) { CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, oldpde & PG_PS_FRAME); pvh = pa_to_pvh(oldpde & PG_PS_FRAME); pmap_pvh_free(pvh, pmap, sva); eva = sva + NBPDR; for (va = sva, m = PHYS_TO_VM_PAGE(oldpde & PG_PS_FRAME); va < eva; va += PAGE_SIZE, m++) { if ((oldpde & (PG_M | PG_RW)) == (PG_M | PG_RW)) vm_page_dirty(m); if (oldpde & PG_A) vm_page_aflag_set(m, PGA_REFERENCED); if (TAILQ_EMPTY(&m->md.pv_list) && TAILQ_EMPTY(&pvh->pv_list)) vm_page_aflag_clear(m, PGA_WRITEABLE); } } if (pmap == kernel_pmap) { pmap_remove_kernel_pde(pmap, pdq, sva); } else { mpte = pmap_lookup_pt_page(pmap, sva); if (mpte != NULL) { pmap_remove_pt_page(pmap, mpte); pmap_resident_count_dec(pmap, 1); KASSERT(mpte->wire_count == NPTEPG, ("pmap_remove_pde: pte page wire count error")); mpte->wire_count = 0; pmap_add_delayed_free_list(mpte, free, FALSE); atomic_subtract_int(&vm_cnt.v_wire_count, 1); } } return (pmap_unuse_pt(pmap, sva, *pmap_pdpe(pmap, sva), free)); } /* * pmap_remove_pte: do the things to unmap a page in a process */ static int pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t va, pd_entry_t ptepde, struct spglist *free, struct rwlock **lockp) { struct md_page *pvh; pt_entry_t oldpte, PG_A, PG_M, PG_RW; vm_page_t m; PG_A = pmap_accessed_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_RW = pmap_rw_bit(pmap); PMAP_LOCK_ASSERT(pmap, MA_OWNED); oldpte = pte_load_clear(ptq); if (oldpte & PG_W) pmap->pm_stats.wired_count -= 1; pmap_resident_count_dec(pmap, 1); if (oldpte & PG_MANAGED) { m = PHYS_TO_VM_PAGE(oldpte & PG_FRAME); if ((oldpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) vm_page_dirty(m); if (oldpte & PG_A) vm_page_aflag_set(m, PGA_REFERENCED); CHANGE_PV_LIST_LOCK_TO_VM_PAGE(lockp, m); pmap_pvh_free(&m->md, pmap, va); if (TAILQ_EMPTY(&m->md.pv_list) && (m->flags & PG_FICTITIOUS) == 0) { pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m)); if (TAILQ_EMPTY(&pvh->pv_list)) vm_page_aflag_clear(m, PGA_WRITEABLE); } } return (pmap_unuse_pt(pmap, va, ptepde, free)); } /* * Remove a single page from a process address space */ static void pmap_remove_page(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, struct spglist *free) { struct rwlock *lock; pt_entry_t *pte, PG_V; PG_V = pmap_valid_bit(pmap); PMAP_LOCK_ASSERT(pmap, MA_OWNED); if ((*pde & PG_V) == 0) return; pte = pmap_pde_to_pte(pde, va); if ((*pte & PG_V) == 0) return; lock = NULL; pmap_remove_pte(pmap, pte, va, *pde, free, &lock); if (lock != NULL) rw_wunlock(lock); pmap_invalidate_page(pmap, va); } /* * Remove the given range of addresses from the specified map. * * It is assumed that the start and end are properly * rounded to the page size. */ void pmap_remove(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) { struct rwlock *lock; vm_offset_t va, va_next; pml4_entry_t *pml4e; pdp_entry_t *pdpe; pd_entry_t ptpaddr, *pde; pt_entry_t *pte, PG_G, PG_V; struct spglist free; int anyvalid; PG_G = pmap_global_bit(pmap); PG_V = pmap_valid_bit(pmap); /* * Perform an unsynchronized read. This is, however, safe. */ if (pmap->pm_stats.resident_count == 0) return; anyvalid = 0; SLIST_INIT(&free); rw_rlock(&pvh_global_lock); PMAP_LOCK(pmap); /* * special handling of removing one page. a very * common operation and easy to short circuit some * code. */ if (sva + PAGE_SIZE == eva) { pde = pmap_pde(pmap, sva); if (pde && (*pde & PG_PS) == 0) { pmap_remove_page(pmap, sva, pde, &free); goto out; } } lock = NULL; for (; sva < eva; sva = va_next) { if (pmap->pm_stats.resident_count == 0) break; pml4e = pmap_pml4e(pmap, sva); if ((*pml4e & PG_V) == 0) { va_next = (sva + NBPML4) & ~PML4MASK; if (va_next < sva) va_next = eva; continue; } pdpe = pmap_pml4e_to_pdpe(pml4e, sva); if ((*pdpe & PG_V) == 0) { va_next = (sva + NBPDP) & ~PDPMASK; if (va_next < sva) va_next = eva; continue; } /* * Calculate index for next page table. */ va_next = (sva + NBPDR) & ~PDRMASK; if (va_next < sva) va_next = eva; pde = pmap_pdpe_to_pde(pdpe, sva); ptpaddr = *pde; /* * Weed out invalid mappings. */ if (ptpaddr == 0) continue; /* * Check for large page. */ if ((ptpaddr & PG_PS) != 0) { /* * Are we removing the entire large page? If not, * demote the mapping and fall through. */ if (sva + NBPDR == va_next && eva >= va_next) { /* * The TLB entry for a PG_G mapping is * invalidated by pmap_remove_pde(). */ if ((ptpaddr & PG_G) == 0) anyvalid = 1; pmap_remove_pde(pmap, pde, sva, &free, &lock); continue; } else if (!pmap_demote_pde_locked(pmap, pde, sva, &lock)) { /* The large page mapping was destroyed. */ continue; } else ptpaddr = *pde; } /* * Limit our scan to either the end of the va represented * by the current page table page, or to the end of the * range being removed. */ if (va_next > eva) va_next = eva; va = va_next; for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; pte++, sva += PAGE_SIZE) { if (*pte == 0) { if (va != va_next) { pmap_invalidate_range(pmap, va, sva); va = va_next; } continue; } if ((*pte & PG_G) == 0) anyvalid = 1; else if (va == va_next) va = sva; if (pmap_remove_pte(pmap, pte, sva, ptpaddr, &free, &lock)) { sva += PAGE_SIZE; break; } } if (va != va_next) pmap_invalidate_range(pmap, va, sva); } if (lock != NULL) rw_wunlock(lock); out: if (anyvalid) pmap_invalidate_all(pmap); rw_runlock(&pvh_global_lock); PMAP_UNLOCK(pmap); pmap_free_zero_pages(&free); } /* * Routine: pmap_remove_all * Function: * Removes this physical page from * all physical maps in which it resides. * Reflects back modify bits to the pager. * * Notes: * Original versions of this routine were very * inefficient because they iteratively called * pmap_remove (slow...) */ void pmap_remove_all(vm_page_t m) { struct md_page *pvh; pv_entry_t pv; pmap_t pmap; pt_entry_t *pte, tpte, PG_A, PG_M, PG_RW; pd_entry_t *pde; vm_offset_t va; struct spglist free; KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("pmap_remove_all: page %p is not managed", m)); SLIST_INIT(&free); rw_wlock(&pvh_global_lock); if ((m->flags & PG_FICTITIOUS) != 0) goto small_mappings; pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m)); while ((pv = TAILQ_FIRST(&pvh->pv_list)) != NULL) { pmap = PV_PMAP(pv); PMAP_LOCK(pmap); va = pv->pv_va; pde = pmap_pde(pmap, va); (void)pmap_demote_pde(pmap, pde, va); PMAP_UNLOCK(pmap); } small_mappings: while ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) { pmap = PV_PMAP(pv); PMAP_LOCK(pmap); PG_A = pmap_accessed_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_RW = pmap_rw_bit(pmap); pmap_resident_count_dec(pmap, 1); pde = pmap_pde(pmap, pv->pv_va); KASSERT((*pde & PG_PS) == 0, ("pmap_remove_all: found" " a 2mpage in page %p's pv list", m)); pte = pmap_pde_to_pte(pde, pv->pv_va); tpte = pte_load_clear(pte); if (tpte & PG_W) pmap->pm_stats.wired_count--; if (tpte & PG_A) vm_page_aflag_set(m, PGA_REFERENCED); /* * Update the vm_page_t clean and reference bits. */ if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) vm_page_dirty(m); pmap_unuse_pt(pmap, pv->pv_va, *pde, &free); pmap_invalidate_page(pmap, pv->pv_va); TAILQ_REMOVE(&m->md.pv_list, pv, pv_next); m->md.pv_gen++; free_pv_entry(pmap, pv); PMAP_UNLOCK(pmap); } vm_page_aflag_clear(m, PGA_WRITEABLE); rw_wunlock(&pvh_global_lock); pmap_free_zero_pages(&free); } /* * pmap_protect_pde: do the things to protect a 2mpage in a process */ static boolean_t pmap_protect_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t sva, vm_prot_t prot) { pd_entry_t newpde, oldpde; vm_offset_t eva, va; vm_page_t m; boolean_t anychanged; pt_entry_t PG_G, PG_M, PG_RW; PG_G = pmap_global_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_RW = pmap_rw_bit(pmap); PMAP_LOCK_ASSERT(pmap, MA_OWNED); KASSERT((sva & PDRMASK) == 0, ("pmap_protect_pde: sva is not 2mpage aligned")); anychanged = FALSE; retry: oldpde = newpde = *pde; if (oldpde & PG_MANAGED) { eva = sva + NBPDR; for (va = sva, m = PHYS_TO_VM_PAGE(oldpde & PG_PS_FRAME); va < eva; va += PAGE_SIZE, m++) if ((oldpde & (PG_M | PG_RW)) == (PG_M | PG_RW)) vm_page_dirty(m); } if ((prot & VM_PROT_WRITE) == 0) newpde &= ~(PG_RW | PG_M); if ((prot & VM_PROT_EXECUTE) == 0) newpde |= pg_nx; if (newpde != oldpde) { if (!atomic_cmpset_long(pde, oldpde, newpde)) goto retry; if (oldpde & PG_G) pmap_invalidate_page(pmap, sva); else anychanged = TRUE; } return (anychanged); } /* * Set the physical protection on the * specified range of this map as requested. */ void pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot) { vm_offset_t va_next; pml4_entry_t *pml4e; pdp_entry_t *pdpe; pd_entry_t ptpaddr, *pde; pt_entry_t *pte, PG_G, PG_M, PG_RW, PG_V; boolean_t anychanged, pv_lists_locked; KASSERT((prot & ~VM_PROT_ALL) == 0, ("invalid prot %x", prot)); if (prot == VM_PROT_NONE) { pmap_remove(pmap, sva, eva); return; } if ((prot & (VM_PROT_WRITE|VM_PROT_EXECUTE)) == (VM_PROT_WRITE|VM_PROT_EXECUTE)) return; PG_G = pmap_global_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_V = pmap_valid_bit(pmap); PG_RW = pmap_rw_bit(pmap); pv_lists_locked = FALSE; resume: anychanged = FALSE; PMAP_LOCK(pmap); for (; sva < eva; sva = va_next) { pml4e = pmap_pml4e(pmap, sva); if ((*pml4e & PG_V) == 0) { va_next = (sva + NBPML4) & ~PML4MASK; if (va_next < sva) va_next = eva; continue; } pdpe = pmap_pml4e_to_pdpe(pml4e, sva); if ((*pdpe & PG_V) == 0) { va_next = (sva + NBPDP) & ~PDPMASK; if (va_next < sva) va_next = eva; continue; } va_next = (sva + NBPDR) & ~PDRMASK; if (va_next < sva) va_next = eva; pde = pmap_pdpe_to_pde(pdpe, sva); ptpaddr = *pde; /* * Weed out invalid mappings. */ if (ptpaddr == 0) continue; /* * Check for large page. */ if ((ptpaddr & PG_PS) != 0) { /* * Are we protecting the entire large page? If not, * demote the mapping and fall through. */ if (sva + NBPDR == va_next && eva >= va_next) { /* * The TLB entry for a PG_G mapping is * invalidated by pmap_protect_pde(). */ if (pmap_protect_pde(pmap, pde, sva, prot)) anychanged = TRUE; continue; } else { if (!pv_lists_locked) { pv_lists_locked = TRUE; if (!rw_try_rlock(&pvh_global_lock)) { if (anychanged) pmap_invalidate_all( pmap); PMAP_UNLOCK(pmap); rw_rlock(&pvh_global_lock); goto resume; } } if (!pmap_demote_pde(pmap, pde, sva)) { /* * The large page mapping was * destroyed. */ continue; } } } if (va_next > eva) va_next = eva; for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; pte++, sva += PAGE_SIZE) { pt_entry_t obits, pbits; vm_page_t m; retry: obits = pbits = *pte; if ((pbits & PG_V) == 0) continue; if ((prot & VM_PROT_WRITE) == 0) { if ((pbits & (PG_MANAGED | PG_M | PG_RW)) == (PG_MANAGED | PG_M | PG_RW)) { m = PHYS_TO_VM_PAGE(pbits & PG_FRAME); vm_page_dirty(m); } pbits &= ~(PG_RW | PG_M); } if ((prot & VM_PROT_EXECUTE) == 0) pbits |= pg_nx; if (pbits != obits) { if (!atomic_cmpset_long(pte, obits, pbits)) goto retry; if (obits & PG_G) pmap_invalidate_page(pmap, sva); else anychanged = TRUE; } } } if (anychanged) pmap_invalidate_all(pmap); if (pv_lists_locked) rw_runlock(&pvh_global_lock); PMAP_UNLOCK(pmap); } /* * Tries to promote the 512, contiguous 4KB page mappings that are within a * single page table page (PTP) to a single 2MB page mapping. For promotion * to occur, two conditions must be met: (1) the 4KB page mappings must map * aligned, contiguous physical memory and (2) the 4KB page mappings must have * identical characteristics. */ static void pmap_promote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va, struct rwlock **lockp) { pd_entry_t newpde; pt_entry_t *firstpte, oldpte, pa, *pte; pt_entry_t PG_G, PG_A, PG_M, PG_RW, PG_V; vm_offset_t oldpteva; vm_page_t mpte; int PG_PTE_CACHE; PG_A = pmap_accessed_bit(pmap); PG_G = pmap_global_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_V = pmap_valid_bit(pmap); PG_RW = pmap_rw_bit(pmap); PG_PTE_CACHE = pmap_cache_mask(pmap, 0); PMAP_LOCK_ASSERT(pmap, MA_OWNED); /* * Examine the first PTE in the specified PTP. Abort if this PTE is * either invalid, unused, or does not map the first 4KB physical page * within a 2MB page. */ firstpte = (pt_entry_t *)PHYS_TO_DMAP(*pde & PG_FRAME); setpde: newpde = *firstpte; if ((newpde & ((PG_FRAME & PDRMASK) | PG_A | PG_V)) != (PG_A | PG_V)) { atomic_add_long(&pmap_pde_p_failures, 1); CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#lx" " in pmap %p", va, pmap); return; } if ((newpde & (PG_M | PG_RW)) == PG_RW) { /* * When PG_M is already clear, PG_RW can be cleared without * a TLB invalidation. */ if (!atomic_cmpset_long(firstpte, newpde, newpde & ~PG_RW)) goto setpde; newpde &= ~PG_RW; } /* * Examine each of the other PTEs in the specified PTP. Abort if this * PTE maps an unexpected 4KB physical page or does not have identical * characteristics to the first PTE. */ pa = (newpde & (PG_PS_FRAME | PG_A | PG_V)) + NBPDR - PAGE_SIZE; for (pte = firstpte + NPTEPG - 1; pte > firstpte; pte--) { setpte: oldpte = *pte; if ((oldpte & (PG_FRAME | PG_A | PG_V)) != pa) { atomic_add_long(&pmap_pde_p_failures, 1); CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#lx" " in pmap %p", va, pmap); return; } if ((oldpte & (PG_M | PG_RW)) == PG_RW) { /* * When PG_M is already clear, PG_RW can be cleared * without a TLB invalidation. */ if (!atomic_cmpset_long(pte, oldpte, oldpte & ~PG_RW)) goto setpte; oldpte &= ~PG_RW; oldpteva = (oldpte & PG_FRAME & PDRMASK) | (va & ~PDRMASK); CTR2(KTR_PMAP, "pmap_promote_pde: protect for va %#lx" " in pmap %p", oldpteva, pmap); } if ((oldpte & PG_PTE_PROMOTE) != (newpde & PG_PTE_PROMOTE)) { atomic_add_long(&pmap_pde_p_failures, 1); CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#lx" " in pmap %p", va, pmap); return; } pa -= PAGE_SIZE; } /* * Save the page table page in its current state until the PDE * mapping the superpage is demoted by pmap_demote_pde() or * destroyed by pmap_remove_pde(). */ mpte = PHYS_TO_VM_PAGE(*pde & PG_FRAME); KASSERT(mpte >= vm_page_array && mpte < &vm_page_array[vm_page_array_size], ("pmap_promote_pde: page table page is out of range")); KASSERT(mpte->pindex == pmap_pde_pindex(va), ("pmap_promote_pde: page table page's pindex is wrong")); if (pmap_insert_pt_page(pmap, mpte)) { atomic_add_long(&pmap_pde_p_failures, 1); CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#lx in pmap %p", va, pmap); return; } /* * Promote the pv entries. */ if ((newpde & PG_MANAGED) != 0) pmap_pv_promote_pde(pmap, va, newpde & PG_PS_FRAME, lockp); /* * Propagate the PAT index to its proper position. */ newpde = pmap_swap_pat(pmap, newpde); /* * Map the superpage. */ if (workaround_erratum383) pmap_update_pde(pmap, va, pde, PG_PS | newpde); else pde_store(pde, PG_PS | newpde); atomic_add_long(&pmap_pde_promotions, 1); CTR2(KTR_PMAP, "pmap_promote_pde: success for va %#lx" " in pmap %p", va, pmap); } /* * Insert the given physical page (p) at * the specified virtual address (v) in the * target physical map with the protection requested. * * If specified, the page will be wired down, meaning * that the related pte can not be reclaimed. * * NB: This is the only routine which MAY NOT lazy-evaluate * or lose information. That is, this routine must actually * insert this page into the given map NOW. */ int pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, u_int flags, int8_t psind __unused) { struct rwlock *lock; pd_entry_t *pde; pt_entry_t *pte, PG_G, PG_A, PG_M, PG_RW, PG_V; pt_entry_t newpte, origpte; pv_entry_t pv; vm_paddr_t opa, pa; vm_page_t mpte, om; boolean_t nosleep; PG_A = pmap_accessed_bit(pmap); PG_G = pmap_global_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_V = pmap_valid_bit(pmap); PG_RW = pmap_rw_bit(pmap); va = trunc_page(va); KASSERT(va <= VM_MAX_KERNEL_ADDRESS, ("pmap_enter: toobig")); KASSERT(va < UPT_MIN_ADDRESS || va >= UPT_MAX_ADDRESS, ("pmap_enter: invalid to pmap_enter page table pages (va: 0x%lx)", va)); KASSERT((m->oflags & VPO_UNMANAGED) != 0 || va < kmi.clean_sva || va >= kmi.clean_eva, ("pmap_enter: managed mapping within the clean submap")); if ((m->oflags & VPO_UNMANAGED) == 0 && !vm_page_xbusied(m)) VM_OBJECT_ASSERT_LOCKED(m->object); pa = VM_PAGE_TO_PHYS(m); newpte = (pt_entry_t)(pa | PG_A | PG_V); if ((flags & VM_PROT_WRITE) != 0) newpte |= PG_M; if ((prot & VM_PROT_WRITE) != 0) newpte |= PG_RW; KASSERT((newpte & (PG_M | PG_RW)) != PG_M, ("pmap_enter: flags includes VM_PROT_WRITE but prot doesn't")); if ((prot & VM_PROT_EXECUTE) == 0) newpte |= pg_nx; if ((flags & PMAP_ENTER_WIRED) != 0) newpte |= PG_W; if (va < VM_MAXUSER_ADDRESS) newpte |= PG_U; if (pmap == kernel_pmap) newpte |= PG_G; newpte |= pmap_cache_bits(pmap, m->md.pat_mode, 0); /* * Set modified bit gratuitously for writeable mappings if * the page is unmanaged. We do not want to take a fault * to do the dirty bit accounting for these mappings. */ if ((m->oflags & VPO_UNMANAGED) != 0) { if ((newpte & PG_RW) != 0) newpte |= PG_M; } mpte = NULL; lock = NULL; rw_rlock(&pvh_global_lock); PMAP_LOCK(pmap); /* * In the case that a page table page is not * resident, we are creating it here. */ retry: pde = pmap_pde(pmap, va); if (pde != NULL && (*pde & PG_V) != 0 && ((*pde & PG_PS) == 0 || pmap_demote_pde_locked(pmap, pde, va, &lock))) { pte = pmap_pde_to_pte(pde, va); if (va < VM_MAXUSER_ADDRESS && mpte == NULL) { mpte = PHYS_TO_VM_PAGE(*pde & PG_FRAME); mpte->wire_count++; } } else if (va < VM_MAXUSER_ADDRESS) { /* * Here if the pte page isn't mapped, or if it has been * deallocated. */ nosleep = (flags & PMAP_ENTER_NOSLEEP) != 0; mpte = _pmap_allocpte(pmap, pmap_pde_pindex(va), nosleep ? NULL : &lock); if (mpte == NULL && nosleep) { if (lock != NULL) rw_wunlock(lock); rw_runlock(&pvh_global_lock); PMAP_UNLOCK(pmap); return (KERN_RESOURCE_SHORTAGE); } goto retry; } else panic("pmap_enter: invalid page directory va=%#lx", va); origpte = *pte; /* * Is the specified virtual address already mapped? */ if ((origpte & PG_V) != 0) { /* * Wiring change, just update stats. We don't worry about * wiring PT pages as they remain resident as long as there * are valid mappings in them. Hence, if a user page is wired, * the PT page will be also. */ if ((newpte & PG_W) != 0 && (origpte & PG_W) == 0) pmap->pm_stats.wired_count++; else if ((newpte & PG_W) == 0 && (origpte & PG_W) != 0) pmap->pm_stats.wired_count--; /* * Remove the extra PT page reference. */ if (mpte != NULL) { mpte->wire_count--; KASSERT(mpte->wire_count > 0, ("pmap_enter: missing reference to page table page," " va: 0x%lx", va)); } /* * Has the physical page changed? */ opa = origpte & PG_FRAME; if (opa == pa) { /* * No, might be a protection or wiring change. */ if ((origpte & PG_MANAGED) != 0) { newpte |= PG_MANAGED; if ((newpte & PG_RW) != 0) vm_page_aflag_set(m, PGA_WRITEABLE); } if (((origpte ^ newpte) & ~(PG_M | PG_A)) == 0) goto unchanged; goto validate; } } else { /* * Increment the counters. */ if ((newpte & PG_W) != 0) pmap->pm_stats.wired_count++; pmap_resident_count_inc(pmap, 1); } /* * Enter on the PV list if part of our managed memory. */ if ((m->oflags & VPO_UNMANAGED) == 0) { newpte |= PG_MANAGED; pv = get_pv_entry(pmap, &lock); pv->pv_va = va; CHANGE_PV_LIST_LOCK_TO_PHYS(&lock, pa); TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_next); m->md.pv_gen++; if ((newpte & PG_RW) != 0) vm_page_aflag_set(m, PGA_WRITEABLE); } /* * Update the PTE. */ if ((origpte & PG_V) != 0) { validate: origpte = pte_load_store(pte, newpte); opa = origpte & PG_FRAME; if (opa != pa) { if ((origpte & PG_MANAGED) != 0) { om = PHYS_TO_VM_PAGE(opa); if ((origpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) vm_page_dirty(om); if ((origpte & PG_A) != 0) vm_page_aflag_set(om, PGA_REFERENCED); CHANGE_PV_LIST_LOCK_TO_PHYS(&lock, opa); pmap_pvh_free(&om->md, pmap, va); if ((om->aflags & PGA_WRITEABLE) != 0 && TAILQ_EMPTY(&om->md.pv_list) && ((om->flags & PG_FICTITIOUS) != 0 || TAILQ_EMPTY(&pa_to_pvh(opa)->pv_list))) vm_page_aflag_clear(om, PGA_WRITEABLE); } } else if ((newpte & PG_M) == 0 && (origpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) { if ((origpte & PG_MANAGED) != 0) vm_page_dirty(m); /* * Although the PTE may still have PG_RW set, TLB * invalidation may nonetheless be required because * the PTE no longer has PG_M set. */ } else if ((origpte & PG_NX) != 0 || (newpte & PG_NX) == 0) { /* * This PTE change does not require TLB invalidation. */ goto unchanged; } if ((origpte & PG_A) != 0) pmap_invalidate_page(pmap, va); } else pte_store(pte, newpte); unchanged: /* * If both the page table page and the reservation are fully * populated, then attempt promotion. */ if ((mpte == NULL || mpte->wire_count == NPTEPG) && pmap_ps_enabled(pmap) && (m->flags & PG_FICTITIOUS) == 0 && vm_reserv_level_iffullpop(m) == 0) pmap_promote_pde(pmap, pde, va, &lock); if (lock != NULL) rw_wunlock(lock); rw_runlock(&pvh_global_lock); PMAP_UNLOCK(pmap); return (KERN_SUCCESS); } /* * Tries to create a 2MB page mapping. Returns TRUE if successful and FALSE * otherwise. Fails if (1) a page table page cannot be allocated without * blocking, (2) a mapping already exists at the specified virtual address, or * (3) a pv entry cannot be allocated without reclaiming another pv entry. */ static boolean_t pmap_enter_pde(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, struct rwlock **lockp) { pd_entry_t *pde, newpde; pt_entry_t PG_V; vm_page_t mpde; struct spglist free; PG_V = pmap_valid_bit(pmap); rw_assert(&pvh_global_lock, RA_LOCKED); PMAP_LOCK_ASSERT(pmap, MA_OWNED); if ((mpde = pmap_allocpde(pmap, va, NULL)) == NULL) { CTR2(KTR_PMAP, "pmap_enter_pde: failure for va %#lx" " in pmap %p", va, pmap); return (FALSE); } pde = (pd_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mpde)); pde = &pde[pmap_pde_index(va)]; if ((*pde & PG_V) != 0) { KASSERT(mpde->wire_count > 1, ("pmap_enter_pde: mpde's wire count is too low")); mpde->wire_count--; CTR2(KTR_PMAP, "pmap_enter_pde: failure for va %#lx" " in pmap %p", va, pmap); return (FALSE); } newpde = VM_PAGE_TO_PHYS(m) | pmap_cache_bits(pmap, m->md.pat_mode, 1) | PG_PS | PG_V; if ((m->oflags & VPO_UNMANAGED) == 0) { newpde |= PG_MANAGED; /* * Abort this mapping if its PV entry could not be created. */ if (!pmap_pv_insert_pde(pmap, va, VM_PAGE_TO_PHYS(m), lockp)) { SLIST_INIT(&free); if (pmap_unwire_ptp(pmap, va, mpde, &free)) { pmap_invalidate_page(pmap, va); pmap_free_zero_pages(&free); } CTR2(KTR_PMAP, "pmap_enter_pde: failure for va %#lx" " in pmap %p", va, pmap); return (FALSE); } } if ((prot & VM_PROT_EXECUTE) == 0) newpde |= pg_nx; if (va < VM_MAXUSER_ADDRESS) newpde |= PG_U; /* * Increment counters. */ pmap_resident_count_inc(pmap, NBPDR / PAGE_SIZE); /* * Map the superpage. */ pde_store(pde, newpde); atomic_add_long(&pmap_pde_mappings, 1); CTR2(KTR_PMAP, "pmap_enter_pde: success for va %#lx" " in pmap %p", va, pmap); return (TRUE); } /* * Maps a sequence of resident pages belonging to the same object. * The sequence begins with the given page m_start. This page is * mapped at the given virtual address start. Each subsequent page is * mapped at a virtual address that is offset from start by the same * amount as the page is offset from m_start within the object. The * last page in the sequence is the page with the largest offset from * m_start that can be mapped at a virtual address less than the given * virtual address end. Not every virtual page between start and end * is mapped; only those for which a resident page exists with the * corresponding offset from m_start are mapped. */ void pmap_enter_object(pmap_t pmap, vm_offset_t start, vm_offset_t end, vm_page_t m_start, vm_prot_t prot) { struct rwlock *lock; vm_offset_t va; vm_page_t m, mpte; vm_pindex_t diff, psize; VM_OBJECT_ASSERT_LOCKED(m_start->object); psize = atop(end - start); mpte = NULL; m = m_start; lock = NULL; rw_rlock(&pvh_global_lock); PMAP_LOCK(pmap); while (m != NULL && (diff = m->pindex - m_start->pindex) < psize) { va = start + ptoa(diff); if ((va & PDRMASK) == 0 && va + NBPDR <= end && m->psind == 1 && pmap_ps_enabled(pmap) && pmap_enter_pde(pmap, va, m, prot, &lock)) m = &m[NBPDR / PAGE_SIZE - 1]; else mpte = pmap_enter_quick_locked(pmap, va, m, prot, mpte, &lock); m = TAILQ_NEXT(m, listq); } if (lock != NULL) rw_wunlock(lock); rw_runlock(&pvh_global_lock); PMAP_UNLOCK(pmap); } /* * this code makes some *MAJOR* assumptions: * 1. Current pmap & pmap exists. * 2. Not wired. * 3. Read access. * 4. No page table pages. * but is *MUCH* faster than pmap_enter... */ void pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot) { struct rwlock *lock; lock = NULL; rw_rlock(&pvh_global_lock); PMAP_LOCK(pmap); (void)pmap_enter_quick_locked(pmap, va, m, prot, NULL, &lock); if (lock != NULL) rw_wunlock(lock); rw_runlock(&pvh_global_lock); PMAP_UNLOCK(pmap); } static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, vm_page_t mpte, struct rwlock **lockp) { struct spglist free; pt_entry_t *pte, PG_V; vm_paddr_t pa; KASSERT(va < kmi.clean_sva || va >= kmi.clean_eva || (m->oflags & VPO_UNMANAGED) != 0, ("pmap_enter_quick_locked: managed mapping within the clean submap")); PG_V = pmap_valid_bit(pmap); rw_assert(&pvh_global_lock, RA_LOCKED); PMAP_LOCK_ASSERT(pmap, MA_OWNED); /* * In the case that a page table page is not * resident, we are creating it here. */ if (va < VM_MAXUSER_ADDRESS) { vm_pindex_t ptepindex; pd_entry_t *ptepa; /* * Calculate pagetable page index */ ptepindex = pmap_pde_pindex(va); if (mpte && (mpte->pindex == ptepindex)) { mpte->wire_count++; } else { /* * Get the page directory entry */ ptepa = pmap_pde(pmap, va); /* * If the page table page is mapped, we just increment * the hold count, and activate it. Otherwise, we * attempt to allocate a page table page. If this * attempt fails, we don't retry. Instead, we give up. */ if (ptepa && (*ptepa & PG_V) != 0) { if (*ptepa & PG_PS) return (NULL); mpte = PHYS_TO_VM_PAGE(*ptepa & PG_FRAME); mpte->wire_count++; } else { /* * Pass NULL instead of the PV list lock * pointer, because we don't intend to sleep. */ mpte = _pmap_allocpte(pmap, ptepindex, NULL); if (mpte == NULL) return (mpte); } } pte = (pt_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mpte)); pte = &pte[pmap_pte_index(va)]; } else { mpte = NULL; pte = vtopte(va); } if (*pte) { if (mpte != NULL) { mpte->wire_count--; mpte = NULL; } return (mpte); } /* * Enter on the PV list if part of our managed memory. */ if ((m->oflags & VPO_UNMANAGED) == 0 && !pmap_try_insert_pv_entry(pmap, va, m, lockp)) { if (mpte != NULL) { SLIST_INIT(&free); if (pmap_unwire_ptp(pmap, va, mpte, &free)) { pmap_invalidate_page(pmap, va); pmap_free_zero_pages(&free); } mpte = NULL; } return (mpte); } /* * Increment counters */ pmap_resident_count_inc(pmap, 1); pa = VM_PAGE_TO_PHYS(m) | pmap_cache_bits(pmap, m->md.pat_mode, 0); if ((prot & VM_PROT_EXECUTE) == 0) pa |= pg_nx; /* * Now validate mapping with RO protection */ if ((m->oflags & VPO_UNMANAGED) != 0) pte_store(pte, pa | PG_V | PG_U); else pte_store(pte, pa | PG_V | PG_U | PG_MANAGED); return (mpte); } /* * Make a temporary mapping for a physical address. This is only intended * to be used for panic dumps. */ void * pmap_kenter_temporary(vm_paddr_t pa, int i) { vm_offset_t va; va = (vm_offset_t)crashdumpmap + (i * PAGE_SIZE); pmap_kenter(va, pa); invlpg(va); return ((void *)crashdumpmap); } /* * This code maps large physical mmap regions into the * processor address space. Note that some shortcuts * are taken, but the code works. */ void pmap_object_init_pt(pmap_t pmap, vm_offset_t addr, vm_object_t object, vm_pindex_t pindex, vm_size_t size) { pd_entry_t *pde; pt_entry_t PG_A, PG_M, PG_RW, PG_V; vm_paddr_t pa, ptepa; vm_page_t p, pdpg; int pat_mode; PG_A = pmap_accessed_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_V = pmap_valid_bit(pmap); PG_RW = pmap_rw_bit(pmap); VM_OBJECT_ASSERT_WLOCKED(object); KASSERT(object->type == OBJT_DEVICE || object->type == OBJT_SG, ("pmap_object_init_pt: non-device object")); if ((addr & (NBPDR - 1)) == 0 && (size & (NBPDR - 1)) == 0) { if (!pmap_ps_enabled(pmap)) return; if (!vm_object_populate(object, pindex, pindex + atop(size))) return; p = vm_page_lookup(object, pindex); KASSERT(p->valid == VM_PAGE_BITS_ALL, ("pmap_object_init_pt: invalid page %p", p)); pat_mode = p->md.pat_mode; /* * Abort the mapping if the first page is not physically * aligned to a 2MB page boundary. */ ptepa = VM_PAGE_TO_PHYS(p); if (ptepa & (NBPDR - 1)) return; /* * Skip the first page. Abort the mapping if the rest of * the pages are not physically contiguous or have differing * memory attributes. */ p = TAILQ_NEXT(p, listq); for (pa = ptepa + PAGE_SIZE; pa < ptepa + size; pa += PAGE_SIZE) { KASSERT(p->valid == VM_PAGE_BITS_ALL, ("pmap_object_init_pt: invalid page %p", p)); if (pa != VM_PAGE_TO_PHYS(p) || pat_mode != p->md.pat_mode) return; p = TAILQ_NEXT(p, listq); } /* * Map using 2MB pages. Since "ptepa" is 2M aligned and * "size" is a multiple of 2M, adding the PAT setting to "pa" * will not affect the termination of this loop. */ PMAP_LOCK(pmap); for (pa = ptepa | pmap_cache_bits(pmap, pat_mode, 1); pa < ptepa + size; pa += NBPDR) { pdpg = pmap_allocpde(pmap, addr, NULL); if (pdpg == NULL) { /* * The creation of mappings below is only an * optimization. If a page directory page * cannot be allocated without blocking, * continue on to the next mapping rather than * blocking. */ addr += NBPDR; continue; } pde = (pd_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pdpg)); pde = &pde[pmap_pde_index(addr)]; if ((*pde & PG_V) == 0) { pde_store(pde, pa | PG_PS | PG_M | PG_A | PG_U | PG_RW | PG_V); pmap_resident_count_inc(pmap, NBPDR / PAGE_SIZE); atomic_add_long(&pmap_pde_mappings, 1); } else { /* Continue on if the PDE is already valid. */ pdpg->wire_count--; KASSERT(pdpg->wire_count > 0, ("pmap_object_init_pt: missing reference " "to page directory page, va: 0x%lx", addr)); } addr += NBPDR; } PMAP_UNLOCK(pmap); } } /* * Clear the wired attribute from the mappings for the specified range of * addresses in the given pmap. Every valid mapping within that range * must have the wired attribute set. In contrast, invalid mappings * cannot have the wired attribute set, so they are ignored. * * The wired attribute of the page table entry is not a hardware feature, * so there is no need to invalidate any TLB entries. */ void pmap_unwire(pmap_t pmap, vm_offset_t sva, vm_offset_t eva) { vm_offset_t va_next; pml4_entry_t *pml4e; pdp_entry_t *pdpe; pd_entry_t *pde; pt_entry_t *pte, PG_V; boolean_t pv_lists_locked; PG_V = pmap_valid_bit(pmap); pv_lists_locked = FALSE; resume: PMAP_LOCK(pmap); for (; sva < eva; sva = va_next) { pml4e = pmap_pml4e(pmap, sva); if ((*pml4e & PG_V) == 0) { va_next = (sva + NBPML4) & ~PML4MASK; if (va_next < sva) va_next = eva; continue; } pdpe = pmap_pml4e_to_pdpe(pml4e, sva); if ((*pdpe & PG_V) == 0) { va_next = (sva + NBPDP) & ~PDPMASK; if (va_next < sva) va_next = eva; continue; } va_next = (sva + NBPDR) & ~PDRMASK; if (va_next < sva) va_next = eva; pde = pmap_pdpe_to_pde(pdpe, sva); if ((*pde & PG_V) == 0) continue; if ((*pde & PG_PS) != 0) { if ((*pde & PG_W) == 0) panic("pmap_unwire: pde %#jx is missing PG_W", (uintmax_t)*pde); /* * Are we unwiring the entire large page? If not, * demote the mapping and fall through. */ if (sva + NBPDR == va_next && eva >= va_next) { atomic_clear_long(pde, PG_W); pmap->pm_stats.wired_count -= NBPDR / PAGE_SIZE; continue; } else { if (!pv_lists_locked) { pv_lists_locked = TRUE; if (!rw_try_rlock(&pvh_global_lock)) { PMAP_UNLOCK(pmap); rw_rlock(&pvh_global_lock); /* Repeat sva. */ goto resume; } } if (!pmap_demote_pde(pmap, pde, sva)) panic("pmap_unwire: demotion failed"); } } if (va_next > eva) va_next = eva; for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; pte++, sva += PAGE_SIZE) { if ((*pte & PG_V) == 0) continue; if ((*pte & PG_W) == 0) panic("pmap_unwire: pte %#jx is missing PG_W", (uintmax_t)*pte); /* * PG_W must be cleared atomically. Although the pmap * lock synchronizes access to PG_W, another processor * could be setting PG_M and/or PG_A concurrently. */ atomic_clear_long(pte, PG_W); pmap->pm_stats.wired_count--; } } if (pv_lists_locked) rw_runlock(&pvh_global_lock); PMAP_UNLOCK(pmap); } /* * Copy the range specified by src_addr/len * from the source map to the range dst_addr/len * in the destination map. * * This routine is only advisory and need not do anything. */ void pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr, vm_size_t len, vm_offset_t src_addr) { struct rwlock *lock; struct spglist free; vm_offset_t addr; vm_offset_t end_addr = src_addr + len; vm_offset_t va_next; pt_entry_t PG_A, PG_M, PG_V; if (dst_addr != src_addr) return; if (dst_pmap->pm_type != src_pmap->pm_type) return; /* * EPT page table entries that require emulation of A/D bits are * sensitive to clearing the PG_A bit (aka EPT_PG_READ). Although * we clear PG_M (aka EPT_PG_WRITE) concomitantly, the PG_U bit * (aka EPT_PG_EXECUTE) could still be set. Since some EPT * implementations flag an EPT misconfiguration for exec-only * mappings we skip this function entirely for emulated pmaps. */ if (pmap_emulate_ad_bits(dst_pmap)) return; lock = NULL; rw_rlock(&pvh_global_lock); if (dst_pmap < src_pmap) { PMAP_LOCK(dst_pmap); PMAP_LOCK(src_pmap); } else { PMAP_LOCK(src_pmap); PMAP_LOCK(dst_pmap); } PG_A = pmap_accessed_bit(dst_pmap); PG_M = pmap_modified_bit(dst_pmap); PG_V = pmap_valid_bit(dst_pmap); for (addr = src_addr; addr < end_addr; addr = va_next) { pt_entry_t *src_pte, *dst_pte; vm_page_t dstmpde, dstmpte, srcmpte; pml4_entry_t *pml4e; pdp_entry_t *pdpe; pd_entry_t srcptepaddr, *pde; KASSERT(addr < UPT_MIN_ADDRESS, ("pmap_copy: invalid to pmap_copy page tables")); pml4e = pmap_pml4e(src_pmap, addr); if ((*pml4e & PG_V) == 0) { va_next = (addr + NBPML4) & ~PML4MASK; if (va_next < addr) va_next = end_addr; continue; } pdpe = pmap_pml4e_to_pdpe(pml4e, addr); if ((*pdpe & PG_V) == 0) { va_next = (addr + NBPDP) & ~PDPMASK; if (va_next < addr) va_next = end_addr; continue; } va_next = (addr + NBPDR) & ~PDRMASK; if (va_next < addr) va_next = end_addr; pde = pmap_pdpe_to_pde(pdpe, addr); srcptepaddr = *pde; if (srcptepaddr == 0) continue; if (srcptepaddr & PG_PS) { if ((addr & PDRMASK) != 0 || addr + NBPDR > end_addr) continue; dstmpde = pmap_allocpde(dst_pmap, addr, NULL); if (dstmpde == NULL) break; pde = (pd_entry_t *) PHYS_TO_DMAP(VM_PAGE_TO_PHYS(dstmpde)); pde = &pde[pmap_pde_index(addr)]; if (*pde == 0 && ((srcptepaddr & PG_MANAGED) == 0 || pmap_pv_insert_pde(dst_pmap, addr, srcptepaddr & PG_PS_FRAME, &lock))) { *pde = srcptepaddr & ~PG_W; pmap_resident_count_inc(dst_pmap, NBPDR / PAGE_SIZE); } else dstmpde->wire_count--; continue; } srcptepaddr &= PG_FRAME; srcmpte = PHYS_TO_VM_PAGE(srcptepaddr); KASSERT(srcmpte->wire_count > 0, ("pmap_copy: source page table page is unused")); if (va_next > end_addr) va_next = end_addr; src_pte = (pt_entry_t *)PHYS_TO_DMAP(srcptepaddr); src_pte = &src_pte[pmap_pte_index(addr)]; dstmpte = NULL; while (addr < va_next) { pt_entry_t ptetemp; ptetemp = *src_pte; /* * we only virtual copy managed pages */ if ((ptetemp & PG_MANAGED) != 0) { if (dstmpte != NULL && dstmpte->pindex == pmap_pde_pindex(addr)) dstmpte->wire_count++; else if ((dstmpte = pmap_allocpte(dst_pmap, addr, NULL)) == NULL) goto out; dst_pte = (pt_entry_t *) PHYS_TO_DMAP(VM_PAGE_TO_PHYS(dstmpte)); dst_pte = &dst_pte[pmap_pte_index(addr)]; if (*dst_pte == 0 && pmap_try_insert_pv_entry(dst_pmap, addr, PHYS_TO_VM_PAGE(ptetemp & PG_FRAME), &lock)) { /* * Clear the wired, modified, and * accessed (referenced) bits * during the copy. */ *dst_pte = ptetemp & ~(PG_W | PG_M | PG_A); pmap_resident_count_inc(dst_pmap, 1); } else { SLIST_INIT(&free); if (pmap_unwire_ptp(dst_pmap, addr, dstmpte, &free)) { pmap_invalidate_page(dst_pmap, addr); pmap_free_zero_pages(&free); } goto out; } if (dstmpte->wire_count >= srcmpte->wire_count) break; } addr += PAGE_SIZE; src_pte++; } } out: if (lock != NULL) rw_wunlock(lock); rw_runlock(&pvh_global_lock); PMAP_UNLOCK(src_pmap); PMAP_UNLOCK(dst_pmap); } /* * pmap_zero_page zeros the specified hardware page by mapping * the page into KVM and using bzero to clear its contents. */ void pmap_zero_page(vm_page_t m) { vm_offset_t va = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)); pagezero((void *)va); } /* * pmap_zero_page_area zeros the specified hardware page by mapping * the page into KVM and using bzero to clear its contents. * * off and size may not cover an area beyond a single hardware page. */ void pmap_zero_page_area(vm_page_t m, int off, int size) { vm_offset_t va = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)); if (off == 0 && size == PAGE_SIZE) pagezero((void *)va); else bzero((char *)va + off, size); } /* * pmap_zero_page_idle zeros the specified hardware page by mapping * the page into KVM and using bzero to clear its contents. This * is intended to be called from the vm_pagezero process only and * outside of Giant. */ void pmap_zero_page_idle(vm_page_t m) { vm_offset_t va = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)); pagezero((void *)va); } /* * pmap_copy_page copies the specified (machine independent) * page by mapping the page into virtual memory and using * bcopy to copy the page, one machine dependent page at a * time. */ void pmap_copy_page(vm_page_t msrc, vm_page_t mdst) { vm_offset_t src = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(msrc)); vm_offset_t dst = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mdst)); pagecopy((void *)src, (void *)dst); } int unmapped_buf_allowed = 1; void pmap_copy_pages(vm_page_t ma[], vm_offset_t a_offset, vm_page_t mb[], vm_offset_t b_offset, int xfersize) { void *a_cp, *b_cp; vm_page_t pages[2]; vm_offset_t vaddr[2], a_pg_offset, b_pg_offset; int cnt; boolean_t mapped; while (xfersize > 0) { a_pg_offset = a_offset & PAGE_MASK; pages[0] = ma[a_offset >> PAGE_SHIFT]; b_pg_offset = b_offset & PAGE_MASK; pages[1] = mb[b_offset >> PAGE_SHIFT]; cnt = min(xfersize, PAGE_SIZE - a_pg_offset); cnt = min(cnt, PAGE_SIZE - b_pg_offset); mapped = pmap_map_io_transient(pages, vaddr, 2, FALSE); a_cp = (char *)vaddr[0] + a_pg_offset; b_cp = (char *)vaddr[1] + b_pg_offset; bcopy(a_cp, b_cp, cnt); if (__predict_false(mapped)) pmap_unmap_io_transient(pages, vaddr, 2, FALSE); a_offset += cnt; b_offset += cnt; xfersize -= cnt; } } /* * Returns true if the pmap's pv is one of the first * 16 pvs linked to from this page. This count may * be changed upwards or downwards in the future; it * is only necessary that true be returned for a small * subset of pmaps for proper page aging. */ boolean_t pmap_page_exists_quick(pmap_t pmap, vm_page_t m) { struct md_page *pvh; struct rwlock *lock; pv_entry_t pv; int loops = 0; boolean_t rv; KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("pmap_page_exists_quick: page %p is not managed", m)); rv = FALSE; rw_rlock(&pvh_global_lock); lock = VM_PAGE_TO_PV_LIST_LOCK(m); rw_rlock(lock); TAILQ_FOREACH(pv, &m->md.pv_list, pv_next) { if (PV_PMAP(pv) == pmap) { rv = TRUE; break; } loops++; if (loops >= 16) break; } if (!rv && loops < 16 && (m->flags & PG_FICTITIOUS) == 0) { pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m)); TAILQ_FOREACH(pv, &pvh->pv_list, pv_next) { if (PV_PMAP(pv) == pmap) { rv = TRUE; break; } loops++; if (loops >= 16) break; } } rw_runlock(lock); rw_runlock(&pvh_global_lock); return (rv); } /* * pmap_page_wired_mappings: * * Return the number of managed mappings to the given physical page * that are wired. */ int pmap_page_wired_mappings(vm_page_t m) { struct rwlock *lock; struct md_page *pvh; pmap_t pmap; pt_entry_t *pte; pv_entry_t pv; int count, md_gen, pvh_gen; if ((m->oflags & VPO_UNMANAGED) != 0) return (0); rw_rlock(&pvh_global_lock); lock = VM_PAGE_TO_PV_LIST_LOCK(m); rw_rlock(lock); restart: count = 0; TAILQ_FOREACH(pv, &m->md.pv_list, pv_next) { pmap = PV_PMAP(pv); if (!PMAP_TRYLOCK(pmap)) { md_gen = m->md.pv_gen; rw_runlock(lock); PMAP_LOCK(pmap); rw_rlock(lock); if (md_gen != m->md.pv_gen) { PMAP_UNLOCK(pmap); goto restart; } } pte = pmap_pte(pmap, pv->pv_va); if ((*pte & PG_W) != 0) count++; PMAP_UNLOCK(pmap); } if ((m->flags & PG_FICTITIOUS) == 0) { pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m)); TAILQ_FOREACH(pv, &pvh->pv_list, pv_next) { pmap = PV_PMAP(pv); if (!PMAP_TRYLOCK(pmap)) { md_gen = m->md.pv_gen; pvh_gen = pvh->pv_gen; rw_runlock(lock); PMAP_LOCK(pmap); rw_rlock(lock); if (md_gen != m->md.pv_gen || pvh_gen != pvh->pv_gen) { PMAP_UNLOCK(pmap); goto restart; } } pte = pmap_pde(pmap, pv->pv_va); if ((*pte & PG_W) != 0) count++; PMAP_UNLOCK(pmap); } } rw_runlock(lock); rw_runlock(&pvh_global_lock); return (count); } /* * Returns TRUE if the given page is mapped individually or as part of * a 2mpage. Otherwise, returns FALSE. */ boolean_t pmap_page_is_mapped(vm_page_t m) { struct rwlock *lock; boolean_t rv; if ((m->oflags & VPO_UNMANAGED) != 0) return (FALSE); rw_rlock(&pvh_global_lock); lock = VM_PAGE_TO_PV_LIST_LOCK(m); rw_rlock(lock); rv = !TAILQ_EMPTY(&m->md.pv_list) || ((m->flags & PG_FICTITIOUS) == 0 && !TAILQ_EMPTY(&pa_to_pvh(VM_PAGE_TO_PHYS(m))->pv_list)); rw_runlock(lock); rw_runlock(&pvh_global_lock); return (rv); } /* * Destroy all managed, non-wired mappings in the given user-space * pmap. This pmap cannot be active on any processor besides the * caller. * * This function cannot be applied to the kernel pmap. Moreover, it * is not intended for general use. It is only to be used during * process termination. Consequently, it can be implemented in ways * that make it faster than pmap_remove(). First, it can more quickly * destroy mappings by iterating over the pmap's collection of PV * entries, rather than searching the page table. Second, it doesn't * have to test and clear the page table entries atomically, because * no processor is currently accessing the user address space. In * particular, a page table entry's dirty bit won't change state once * this function starts. */ void pmap_remove_pages(pmap_t pmap) { pd_entry_t ptepde; pt_entry_t *pte, tpte; pt_entry_t PG_M, PG_RW, PG_V; struct spglist free; vm_page_t m, mpte, mt; pv_entry_t pv; struct md_page *pvh; struct pv_chunk *pc, *npc; struct rwlock *lock; int64_t bit; uint64_t inuse, bitmask; int allfree, field, freed, idx; boolean_t superpage; vm_paddr_t pa; /* * Assert that the given pmap is only active on the current * CPU. Unfortunately, we cannot block another CPU from * activating the pmap while this function is executing. */ KASSERT(pmap == PCPU_GET(curpmap), ("non-current pmap %p", pmap)); #ifdef INVARIANTS { cpuset_t other_cpus; other_cpus = all_cpus; critical_enter(); CPU_CLR(PCPU_GET(cpuid), &other_cpus); CPU_AND(&other_cpus, &pmap->pm_active); critical_exit(); KASSERT(CPU_EMPTY(&other_cpus), ("pmap active %p", pmap)); } #endif lock = NULL; PG_M = pmap_modified_bit(pmap); PG_V = pmap_valid_bit(pmap); PG_RW = pmap_rw_bit(pmap); SLIST_INIT(&free); rw_rlock(&pvh_global_lock); PMAP_LOCK(pmap); TAILQ_FOREACH_SAFE(pc, &pmap->pm_pvchunk, pc_list, npc) { allfree = 1; freed = 0; for (field = 0; field < _NPCM; field++) { inuse = ~pc->pc_map[field] & pc_freemask[field]; while (inuse != 0) { bit = bsfq(inuse); bitmask = 1UL << bit; idx = field * 64 + bit; pv = &pc->pc_pventry[idx]; inuse &= ~bitmask; pte = pmap_pdpe(pmap, pv->pv_va); ptepde = *pte; pte = pmap_pdpe_to_pde(pte, pv->pv_va); tpte = *pte; if ((tpte & (PG_PS | PG_V)) == PG_V) { superpage = FALSE; ptepde = tpte; pte = (pt_entry_t *)PHYS_TO_DMAP(tpte & PG_FRAME); pte = &pte[pmap_pte_index(pv->pv_va)]; tpte = *pte; } else { /* * Keep track whether 'tpte' is a * superpage explicitly instead of * relying on PG_PS being set. * * This is because PG_PS is numerically * identical to PG_PTE_PAT and thus a * regular page could be mistaken for * a superpage. */ superpage = TRUE; } if ((tpte & PG_V) == 0) { panic("bad pte va %lx pte %lx", pv->pv_va, tpte); } /* * We cannot remove wired pages from a process' mapping at this time */ if (tpte & PG_W) { allfree = 0; continue; } if (superpage) pa = tpte & PG_PS_FRAME; else pa = tpte & PG_FRAME; m = PHYS_TO_VM_PAGE(pa); KASSERT(m->phys_addr == pa, ("vm_page_t %p phys_addr mismatch %016jx %016jx", m, (uintmax_t)m->phys_addr, (uintmax_t)tpte)); KASSERT((m->flags & PG_FICTITIOUS) != 0 || m < &vm_page_array[vm_page_array_size], ("pmap_remove_pages: bad tpte %#jx", (uintmax_t)tpte)); pte_clear(pte); /* * Update the vm_page_t clean/reference bits. */ if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) { if (superpage) { for (mt = m; mt < &m[NBPDR / PAGE_SIZE]; mt++) vm_page_dirty(mt); } else vm_page_dirty(m); } CHANGE_PV_LIST_LOCK_TO_VM_PAGE(&lock, m); /* Mark free */ pc->pc_map[field] |= bitmask; if (superpage) { pmap_resident_count_dec(pmap, NBPDR / PAGE_SIZE); pvh = pa_to_pvh(tpte & PG_PS_FRAME); TAILQ_REMOVE(&pvh->pv_list, pv, pv_next); pvh->pv_gen++; if (TAILQ_EMPTY(&pvh->pv_list)) { for (mt = m; mt < &m[NBPDR / PAGE_SIZE]; mt++) if ((mt->aflags & PGA_WRITEABLE) != 0 && TAILQ_EMPTY(&mt->md.pv_list)) vm_page_aflag_clear(mt, PGA_WRITEABLE); } mpte = pmap_lookup_pt_page(pmap, pv->pv_va); if (mpte != NULL) { pmap_remove_pt_page(pmap, mpte); pmap_resident_count_dec(pmap, 1); KASSERT(mpte->wire_count == NPTEPG, ("pmap_remove_pages: pte page wire count error")); mpte->wire_count = 0; pmap_add_delayed_free_list(mpte, &free, FALSE); atomic_subtract_int(&vm_cnt.v_wire_count, 1); } } else { pmap_resident_count_dec(pmap, 1); TAILQ_REMOVE(&m->md.pv_list, pv, pv_next); m->md.pv_gen++; if ((m->aflags & PGA_WRITEABLE) != 0 && TAILQ_EMPTY(&m->md.pv_list) && (m->flags & PG_FICTITIOUS) == 0) { pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m)); if (TAILQ_EMPTY(&pvh->pv_list)) vm_page_aflag_clear(m, PGA_WRITEABLE); } } pmap_unuse_pt(pmap, pv->pv_va, ptepde, &free); freed++; } } PV_STAT(atomic_add_long(&pv_entry_frees, freed)); PV_STAT(atomic_add_int(&pv_entry_spare, freed)); PV_STAT(atomic_subtract_long(&pv_entry_count, freed)); if (allfree) { TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list); free_pv_chunk(pc); } } if (lock != NULL) rw_wunlock(lock); pmap_invalidate_all(pmap); rw_runlock(&pvh_global_lock); PMAP_UNLOCK(pmap); pmap_free_zero_pages(&free); } static boolean_t pmap_page_test_mappings(vm_page_t m, boolean_t accessed, boolean_t modified) { struct rwlock *lock; pv_entry_t pv; struct md_page *pvh; pt_entry_t *pte, mask; pt_entry_t PG_A, PG_M, PG_RW, PG_V; pmap_t pmap; int md_gen, pvh_gen; boolean_t rv; rv = FALSE; rw_rlock(&pvh_global_lock); lock = VM_PAGE_TO_PV_LIST_LOCK(m); rw_rlock(lock); restart: TAILQ_FOREACH(pv, &m->md.pv_list, pv_next) { pmap = PV_PMAP(pv); if (!PMAP_TRYLOCK(pmap)) { md_gen = m->md.pv_gen; rw_runlock(lock); PMAP_LOCK(pmap); rw_rlock(lock); if (md_gen != m->md.pv_gen) { PMAP_UNLOCK(pmap); goto restart; } } pte = pmap_pte(pmap, pv->pv_va); mask = 0; if (modified) { PG_M = pmap_modified_bit(pmap); PG_RW = pmap_rw_bit(pmap); mask |= PG_RW | PG_M; } if (accessed) { PG_A = pmap_accessed_bit(pmap); PG_V = pmap_valid_bit(pmap); mask |= PG_V | PG_A; } rv = (*pte & mask) == mask; PMAP_UNLOCK(pmap); if (rv) goto out; } if ((m->flags & PG_FICTITIOUS) == 0) { pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m)); TAILQ_FOREACH(pv, &pvh->pv_list, pv_next) { pmap = PV_PMAP(pv); if (!PMAP_TRYLOCK(pmap)) { md_gen = m->md.pv_gen; pvh_gen = pvh->pv_gen; rw_runlock(lock); PMAP_LOCK(pmap); rw_rlock(lock); if (md_gen != m->md.pv_gen || pvh_gen != pvh->pv_gen) { PMAP_UNLOCK(pmap); goto restart; } } pte = pmap_pde(pmap, pv->pv_va); mask = 0; if (modified) { PG_M = pmap_modified_bit(pmap); PG_RW = pmap_rw_bit(pmap); mask |= PG_RW | PG_M; } if (accessed) { PG_A = pmap_accessed_bit(pmap); PG_V = pmap_valid_bit(pmap); mask |= PG_V | PG_A; } rv = (*pte & mask) == mask; PMAP_UNLOCK(pmap); if (rv) goto out; } } out: rw_runlock(lock); rw_runlock(&pvh_global_lock); return (rv); } /* * pmap_is_modified: * * Return whether or not the specified physical page was modified * in any physical maps. */ boolean_t pmap_is_modified(vm_page_t m) { KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("pmap_is_modified: page %p is not managed", m)); /* * If the page is not exclusive busied, then PGA_WRITEABLE cannot be * concurrently set while the object is locked. Thus, if PGA_WRITEABLE * is clear, no PTEs can have PG_M set. */ VM_OBJECT_ASSERT_WLOCKED(m->object); if (!vm_page_xbusied(m) && (m->aflags & PGA_WRITEABLE) == 0) return (FALSE); return (pmap_page_test_mappings(m, FALSE, TRUE)); } /* * pmap_is_prefaultable: * * Return whether or not the specified virtual address is eligible * for prefault. */ boolean_t pmap_is_prefaultable(pmap_t pmap, vm_offset_t addr) { pd_entry_t *pde; pt_entry_t *pte, PG_V; boolean_t rv; PG_V = pmap_valid_bit(pmap); rv = FALSE; PMAP_LOCK(pmap); pde = pmap_pde(pmap, addr); if (pde != NULL && (*pde & (PG_PS | PG_V)) == PG_V) { pte = pmap_pde_to_pte(pde, addr); rv = (*pte & PG_V) == 0; } PMAP_UNLOCK(pmap); return (rv); } /* * pmap_is_referenced: * * Return whether or not the specified physical page was referenced * in any physical maps. */ boolean_t pmap_is_referenced(vm_page_t m) { KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("pmap_is_referenced: page %p is not managed", m)); return (pmap_page_test_mappings(m, TRUE, FALSE)); } /* * Clear the write and modified bits in each of the given page's mappings. */ void pmap_remove_write(vm_page_t m) { struct md_page *pvh; pmap_t pmap; struct rwlock *lock; pv_entry_t next_pv, pv; pd_entry_t *pde; pt_entry_t oldpte, *pte, PG_M, PG_RW; vm_offset_t va; int pvh_gen, md_gen; KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("pmap_remove_write: page %p is not managed", m)); /* * If the page is not exclusive busied, then PGA_WRITEABLE cannot be * set by another thread while the object is locked. Thus, * if PGA_WRITEABLE is clear, no page table entries need updating. */ VM_OBJECT_ASSERT_WLOCKED(m->object); if (!vm_page_xbusied(m) && (m->aflags & PGA_WRITEABLE) == 0) return; rw_rlock(&pvh_global_lock); lock = VM_PAGE_TO_PV_LIST_LOCK(m); pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m)); retry_pv_loop: rw_wlock(lock); if ((m->flags & PG_FICTITIOUS) != 0) goto small_mappings; TAILQ_FOREACH_SAFE(pv, &pvh->pv_list, pv_next, next_pv) { pmap = PV_PMAP(pv); if (!PMAP_TRYLOCK(pmap)) { pvh_gen = pvh->pv_gen; rw_wunlock(lock); PMAP_LOCK(pmap); rw_wlock(lock); if (pvh_gen != pvh->pv_gen) { PMAP_UNLOCK(pmap); rw_wunlock(lock); goto retry_pv_loop; } } PG_RW = pmap_rw_bit(pmap); va = pv->pv_va; pde = pmap_pde(pmap, va); if ((*pde & PG_RW) != 0) (void)pmap_demote_pde_locked(pmap, pde, va, &lock); KASSERT(lock == VM_PAGE_TO_PV_LIST_LOCK(m), ("inconsistent pv lock %p %p for page %p", lock, VM_PAGE_TO_PV_LIST_LOCK(m), m)); PMAP_UNLOCK(pmap); } small_mappings: TAILQ_FOREACH(pv, &m->md.pv_list, pv_next) { pmap = PV_PMAP(pv); if (!PMAP_TRYLOCK(pmap)) { pvh_gen = pvh->pv_gen; md_gen = m->md.pv_gen; rw_wunlock(lock); PMAP_LOCK(pmap); rw_wlock(lock); if (pvh_gen != pvh->pv_gen || md_gen != m->md.pv_gen) { PMAP_UNLOCK(pmap); rw_wunlock(lock); goto retry_pv_loop; } } PG_M = pmap_modified_bit(pmap); PG_RW = pmap_rw_bit(pmap); pde = pmap_pde(pmap, pv->pv_va); KASSERT((*pde & PG_PS) == 0, ("pmap_remove_write: found a 2mpage in page %p's pv list", m)); pte = pmap_pde_to_pte(pde, pv->pv_va); retry: oldpte = *pte; if (oldpte & PG_RW) { if (!atomic_cmpset_long(pte, oldpte, oldpte & ~(PG_RW | PG_M))) goto retry; if ((oldpte & PG_M) != 0) vm_page_dirty(m); pmap_invalidate_page(pmap, pv->pv_va); } PMAP_UNLOCK(pmap); } rw_wunlock(lock); vm_page_aflag_clear(m, PGA_WRITEABLE); rw_runlock(&pvh_global_lock); } static __inline boolean_t safe_to_clear_referenced(pmap_t pmap, pt_entry_t pte) { if (!pmap_emulate_ad_bits(pmap)) return (TRUE); KASSERT(pmap->pm_type == PT_EPT, ("invalid pm_type %d", pmap->pm_type)); /* * XWR = 010 or 110 will cause an unconditional EPT misconfiguration * so we don't let the referenced (aka EPT_PG_READ) bit to be cleared * if the EPT_PG_WRITE bit is set. */ if ((pte & EPT_PG_WRITE) != 0) return (FALSE); /* * XWR = 100 is allowed only if the PMAP_SUPPORTS_EXEC_ONLY is set. */ if ((pte & EPT_PG_EXECUTE) == 0 || ((pmap->pm_flags & PMAP_SUPPORTS_EXEC_ONLY) != 0)) return (TRUE); else return (FALSE); } #define PMAP_TS_REFERENCED_MAX 5 /* * pmap_ts_referenced: * * Return a count of reference bits for a page, clearing those bits. * It is not necessary for every reference bit to be cleared, but it * is necessary that 0 only be returned when there are truly no * reference bits set. * * XXX: The exact number of bits to check and clear is a matter that * should be tested and standardized at some point in the future for * optimal aging of shared pages. */ int pmap_ts_referenced(vm_page_t m) { struct md_page *pvh; pv_entry_t pv, pvf; pmap_t pmap; struct rwlock *lock; pd_entry_t oldpde, *pde; pt_entry_t *pte, PG_A; vm_offset_t va; vm_paddr_t pa; int cleared, md_gen, not_cleared, pvh_gen; struct spglist free; boolean_t demoted; KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("pmap_ts_referenced: page %p is not managed", m)); SLIST_INIT(&free); cleared = 0; pa = VM_PAGE_TO_PHYS(m); lock = PHYS_TO_PV_LIST_LOCK(pa); pvh = pa_to_pvh(pa); rw_rlock(&pvh_global_lock); rw_wlock(lock); retry: not_cleared = 0; if ((m->flags & PG_FICTITIOUS) != 0 || (pvf = TAILQ_FIRST(&pvh->pv_list)) == NULL) goto small_mappings; pv = pvf; do { if (pvf == NULL) pvf = pv; pmap = PV_PMAP(pv); if (!PMAP_TRYLOCK(pmap)) { pvh_gen = pvh->pv_gen; rw_wunlock(lock); PMAP_LOCK(pmap); rw_wlock(lock); if (pvh_gen != pvh->pv_gen) { PMAP_UNLOCK(pmap); goto retry; } } PG_A = pmap_accessed_bit(pmap); va = pv->pv_va; pde = pmap_pde(pmap, pv->pv_va); oldpde = *pde; if ((*pde & PG_A) != 0) { /* * Since this reference bit is shared by 512 4KB * pages, it should not be cleared every time it is * tested. Apply a simple "hash" function on the * physical page number, the virtual superpage number, * and the pmap address to select one 4KB page out of * the 512 on which testing the reference bit will * result in clearing that reference bit. This * function is designed to avoid the selection of the * same 4KB page for every 2MB page mapping. * * On demotion, a mapping that hasn't been referenced * is simply destroyed. To avoid the possibility of a * subsequent page fault on a demoted wired mapping, * always leave its reference bit set. Moreover, * since the superpage is wired, the current state of * its reference bit won't affect page replacement. */ if ((((pa >> PAGE_SHIFT) ^ (pv->pv_va >> PDRSHIFT) ^ (uintptr_t)pmap) & (NPTEPG - 1)) == 0 && (*pde & PG_W) == 0) { if (safe_to_clear_referenced(pmap, oldpde)) { atomic_clear_long(pde, PG_A); pmap_invalidate_page(pmap, pv->pv_va); demoted = FALSE; } else if (pmap_demote_pde_locked(pmap, pde, pv->pv_va, &lock)) { /* * Remove the mapping to a single page * so that a subsequent access may * repromote. Since the underlying * page table page is fully populated, * this removal never frees a page * table page. */ demoted = TRUE; va += VM_PAGE_TO_PHYS(m) - (oldpde & PG_PS_FRAME); pte = pmap_pde_to_pte(pde, va); pmap_remove_pte(pmap, pte, va, *pde, NULL, &lock); pmap_invalidate_page(pmap, va); } else demoted = TRUE; if (demoted) { /* * The superpage mapping was removed * entirely and therefore 'pv' is no * longer valid. */ if (pvf == pv) pvf = NULL; pv = NULL; } cleared++; KASSERT(lock == VM_PAGE_TO_PV_LIST_LOCK(m), ("inconsistent pv lock %p %p for page %p", lock, VM_PAGE_TO_PV_LIST_LOCK(m), m)); } else not_cleared++; } PMAP_UNLOCK(pmap); /* Rotate the PV list if it has more than one entry. */ if (pv != NULL && TAILQ_NEXT(pv, pv_next) != NULL) { TAILQ_REMOVE(&pvh->pv_list, pv, pv_next); TAILQ_INSERT_TAIL(&pvh->pv_list, pv, pv_next); pvh->pv_gen++; } if (cleared + not_cleared >= PMAP_TS_REFERENCED_MAX) goto out; } while ((pv = TAILQ_FIRST(&pvh->pv_list)) != pvf); small_mappings: if ((pvf = TAILQ_FIRST(&m->md.pv_list)) == NULL) goto out; pv = pvf; do { if (pvf == NULL) pvf = pv; pmap = PV_PMAP(pv); if (!PMAP_TRYLOCK(pmap)) { pvh_gen = pvh->pv_gen; md_gen = m->md.pv_gen; rw_wunlock(lock); PMAP_LOCK(pmap); rw_wlock(lock); if (pvh_gen != pvh->pv_gen || md_gen != m->md.pv_gen) { PMAP_UNLOCK(pmap); goto retry; } } PG_A = pmap_accessed_bit(pmap); pde = pmap_pde(pmap, pv->pv_va); KASSERT((*pde & PG_PS) == 0, ("pmap_ts_referenced: found a 2mpage in page %p's pv list", m)); pte = pmap_pde_to_pte(pde, pv->pv_va); if ((*pte & PG_A) != 0) { if (safe_to_clear_referenced(pmap, *pte)) { atomic_clear_long(pte, PG_A); pmap_invalidate_page(pmap, pv->pv_va); cleared++; } else if ((*pte & PG_W) == 0) { /* * Wired pages cannot be paged out so * doing accessed bit emulation for * them is wasted effort. We do the * hard work for unwired pages only. */ pmap_remove_pte(pmap, pte, pv->pv_va, *pde, &free, &lock); pmap_invalidate_page(pmap, pv->pv_va); cleared++; if (pvf == pv) pvf = NULL; pv = NULL; KASSERT(lock == VM_PAGE_TO_PV_LIST_LOCK(m), ("inconsistent pv lock %p %p for page %p", lock, VM_PAGE_TO_PV_LIST_LOCK(m), m)); } else not_cleared++; } PMAP_UNLOCK(pmap); /* Rotate the PV list if it has more than one entry. */ if (pv != NULL && TAILQ_NEXT(pv, pv_next) != NULL) { TAILQ_REMOVE(&m->md.pv_list, pv, pv_next); TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_next); m->md.pv_gen++; } } while ((pv = TAILQ_FIRST(&m->md.pv_list)) != pvf && cleared + not_cleared < PMAP_TS_REFERENCED_MAX); out: rw_wunlock(lock); rw_runlock(&pvh_global_lock); pmap_free_zero_pages(&free); return (cleared + not_cleared); } /* * Apply the given advice to the specified range of addresses within the * given pmap. Depending on the advice, clear the referenced and/or * modified flags in each mapping and set the mapped page's dirty field. */ void pmap_advise(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, int advice) { struct rwlock *lock; pml4_entry_t *pml4e; pdp_entry_t *pdpe; pd_entry_t oldpde, *pde; pt_entry_t *pte, PG_A, PG_G, PG_M, PG_RW, PG_V; vm_offset_t va_next; vm_page_t m; boolean_t anychanged, pv_lists_locked; if (advice != MADV_DONTNEED && advice != MADV_FREE) return; /* * A/D bit emulation requires an alternate code path when clearing * the modified and accessed bits below. Since this function is * advisory in nature we skip it entirely for pmaps that require * A/D bit emulation. */ if (pmap_emulate_ad_bits(pmap)) return; PG_A = pmap_accessed_bit(pmap); PG_G = pmap_global_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_V = pmap_valid_bit(pmap); PG_RW = pmap_rw_bit(pmap); pv_lists_locked = FALSE; resume: anychanged = FALSE; PMAP_LOCK(pmap); for (; sva < eva; sva = va_next) { pml4e = pmap_pml4e(pmap, sva); if ((*pml4e & PG_V) == 0) { va_next = (sva + NBPML4) & ~PML4MASK; if (va_next < sva) va_next = eva; continue; } pdpe = pmap_pml4e_to_pdpe(pml4e, sva); if ((*pdpe & PG_V) == 0) { va_next = (sva + NBPDP) & ~PDPMASK; if (va_next < sva) va_next = eva; continue; } va_next = (sva + NBPDR) & ~PDRMASK; if (va_next < sva) va_next = eva; pde = pmap_pdpe_to_pde(pdpe, sva); oldpde = *pde; if ((oldpde & PG_V) == 0) continue; else if ((oldpde & PG_PS) != 0) { if ((oldpde & PG_MANAGED) == 0) continue; if (!pv_lists_locked) { pv_lists_locked = TRUE; if (!rw_try_rlock(&pvh_global_lock)) { if (anychanged) pmap_invalidate_all(pmap); PMAP_UNLOCK(pmap); rw_rlock(&pvh_global_lock); goto resume; } } lock = NULL; if (!pmap_demote_pde_locked(pmap, pde, sva, &lock)) { if (lock != NULL) rw_wunlock(lock); /* * The large page mapping was destroyed. */ continue; } /* * Unless the page mappings are wired, remove the * mapping to a single page so that a subsequent * access may repromote. Since the underlying page * table page is fully populated, this removal never * frees a page table page. */ if ((oldpde & PG_W) == 0) { pte = pmap_pde_to_pte(pde, sva); KASSERT((*pte & PG_V) != 0, ("pmap_advise: invalid PTE")); pmap_remove_pte(pmap, pte, sva, *pde, NULL, &lock); anychanged = TRUE; } if (lock != NULL) rw_wunlock(lock); } if (va_next > eva) va_next = eva; for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; pte++, sva += PAGE_SIZE) { if ((*pte & (PG_MANAGED | PG_V)) != (PG_MANAGED | PG_V)) continue; else if ((*pte & (PG_M | PG_RW)) == (PG_M | PG_RW)) { if (advice == MADV_DONTNEED) { /* * Future calls to pmap_is_modified() * can be avoided by making the page * dirty now. */ m = PHYS_TO_VM_PAGE(*pte & PG_FRAME); vm_page_dirty(m); } atomic_clear_long(pte, PG_M | PG_A); } else if ((*pte & PG_A) != 0) atomic_clear_long(pte, PG_A); else continue; if ((*pte & PG_G) != 0) pmap_invalidate_page(pmap, sva); else anychanged = TRUE; } } if (anychanged) pmap_invalidate_all(pmap); if (pv_lists_locked) rw_runlock(&pvh_global_lock); PMAP_UNLOCK(pmap); } /* * Clear the modify bits on the specified physical page. */ void pmap_clear_modify(vm_page_t m) { struct md_page *pvh; pmap_t pmap; pv_entry_t next_pv, pv; pd_entry_t oldpde, *pde; pt_entry_t oldpte, *pte, PG_M, PG_RW, PG_V; struct rwlock *lock; vm_offset_t va; int md_gen, pvh_gen; KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("pmap_clear_modify: page %p is not managed", m)); VM_OBJECT_ASSERT_WLOCKED(m->object); KASSERT(!vm_page_xbusied(m), ("pmap_clear_modify: page %p is exclusive busied", m)); /* * If the page is not PGA_WRITEABLE, then no PTEs can have PG_M set. * If the object containing the page is locked and the page is not * exclusive busied, then PGA_WRITEABLE cannot be concurrently set. */ if ((m->aflags & PGA_WRITEABLE) == 0) return; pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m)); rw_rlock(&pvh_global_lock); lock = VM_PAGE_TO_PV_LIST_LOCK(m); rw_wlock(lock); restart: if ((m->flags & PG_FICTITIOUS) != 0) goto small_mappings; TAILQ_FOREACH_SAFE(pv, &pvh->pv_list, pv_next, next_pv) { pmap = PV_PMAP(pv); if (!PMAP_TRYLOCK(pmap)) { pvh_gen = pvh->pv_gen; rw_wunlock(lock); PMAP_LOCK(pmap); rw_wlock(lock); if (pvh_gen != pvh->pv_gen) { PMAP_UNLOCK(pmap); goto restart; } } PG_M = pmap_modified_bit(pmap); PG_V = pmap_valid_bit(pmap); PG_RW = pmap_rw_bit(pmap); va = pv->pv_va; pde = pmap_pde(pmap, va); oldpde = *pde; if ((oldpde & PG_RW) != 0) { if (pmap_demote_pde_locked(pmap, pde, va, &lock)) { if ((oldpde & PG_W) == 0) { /* * Write protect the mapping to a * single page so that a subsequent * write access may repromote. */ va += VM_PAGE_TO_PHYS(m) - (oldpde & PG_PS_FRAME); pte = pmap_pde_to_pte(pde, va); oldpte = *pte; if ((oldpte & PG_V) != 0) { while (!atomic_cmpset_long(pte, oldpte, oldpte & ~(PG_M | PG_RW))) oldpte = *pte; vm_page_dirty(m); pmap_invalidate_page(pmap, va); } } } } PMAP_UNLOCK(pmap); } small_mappings: TAILQ_FOREACH(pv, &m->md.pv_list, pv_next) { pmap = PV_PMAP(pv); if (!PMAP_TRYLOCK(pmap)) { md_gen = m->md.pv_gen; pvh_gen = pvh->pv_gen; rw_wunlock(lock); PMAP_LOCK(pmap); rw_wlock(lock); if (pvh_gen != pvh->pv_gen || md_gen != m->md.pv_gen) { PMAP_UNLOCK(pmap); goto restart; } } PG_M = pmap_modified_bit(pmap); PG_RW = pmap_rw_bit(pmap); pde = pmap_pde(pmap, pv->pv_va); KASSERT((*pde & PG_PS) == 0, ("pmap_clear_modify: found" " a 2mpage in page %p's pv list", m)); pte = pmap_pde_to_pte(pde, pv->pv_va); if ((*pte & (PG_M | PG_RW)) == (PG_M | PG_RW)) { atomic_clear_long(pte, PG_M); pmap_invalidate_page(pmap, pv->pv_va); } PMAP_UNLOCK(pmap); } rw_wunlock(lock); rw_runlock(&pvh_global_lock); } /* * Miscellaneous support routines follow */ /* Adjust the cache mode for a 4KB page mapped via a PTE. */ static __inline void pmap_pte_attr(pt_entry_t *pte, int cache_bits, int mask) { u_int opte, npte; /* * The cache mode bits are all in the low 32-bits of the * PTE, so we can just spin on updating the low 32-bits. */ do { opte = *(u_int *)pte; npte = opte & ~mask; npte |= cache_bits; } while (npte != opte && !atomic_cmpset_int((u_int *)pte, opte, npte)); } /* Adjust the cache mode for a 2MB page mapped via a PDE. */ static __inline void pmap_pde_attr(pd_entry_t *pde, int cache_bits, int mask) { u_int opde, npde; /* * The cache mode bits are all in the low 32-bits of the * PDE, so we can just spin on updating the low 32-bits. */ do { opde = *(u_int *)pde; npde = opde & ~mask; npde |= cache_bits; } while (npde != opde && !atomic_cmpset_int((u_int *)pde, opde, npde)); } /* * Map a set of physical memory pages into the kernel virtual * address space. Return a pointer to where it is mapped. This * routine is intended to be used for mapping device memory, * NOT real memory. */ void * pmap_mapdev_attr(vm_paddr_t pa, vm_size_t size, int mode) { vm_offset_t va, offset; vm_size_t tmpsize; /* * If the specified range of physical addresses fits within the direct * map window, use the direct map. */ if (pa < dmaplimit && pa + size < dmaplimit) { va = PHYS_TO_DMAP(pa); if (!pmap_change_attr(va, size, mode)) return ((void *)va); } offset = pa & PAGE_MASK; size = round_page(offset + size); va = kva_alloc(size); if (!va) panic("pmap_mapdev: Couldn't alloc kernel virtual memory"); pa = trunc_page(pa); for (tmpsize = 0; tmpsize < size; tmpsize += PAGE_SIZE) pmap_kenter_attr(va + tmpsize, pa + tmpsize, mode); pmap_invalidate_range(kernel_pmap, va, va + tmpsize); pmap_invalidate_cache_range(va, va + tmpsize, FALSE); return ((void *)(va + offset)); } void * pmap_mapdev(vm_paddr_t pa, vm_size_t size) { return (pmap_mapdev_attr(pa, size, PAT_UNCACHEABLE)); } void * pmap_mapbios(vm_paddr_t pa, vm_size_t size) { return (pmap_mapdev_attr(pa, size, PAT_WRITE_BACK)); } void pmap_unmapdev(vm_offset_t va, vm_size_t size) { vm_offset_t base, offset; /* If we gave a direct map region in pmap_mapdev, do nothing */ if (va >= DMAP_MIN_ADDRESS && va < DMAP_MAX_ADDRESS) return; base = trunc_page(va); offset = va & PAGE_MASK; size = round_page(offset + size); kva_free(base, size); } /* * Tries to demote a 1GB page mapping. */ static boolean_t pmap_demote_pdpe(pmap_t pmap, pdp_entry_t *pdpe, vm_offset_t va) { pdp_entry_t newpdpe, oldpdpe; pd_entry_t *firstpde, newpde, *pde; pt_entry_t PG_A, PG_M, PG_RW, PG_V; vm_paddr_t mpdepa; vm_page_t mpde; PG_A = pmap_accessed_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_V = pmap_valid_bit(pmap); PG_RW = pmap_rw_bit(pmap); PMAP_LOCK_ASSERT(pmap, MA_OWNED); oldpdpe = *pdpe; KASSERT((oldpdpe & (PG_PS | PG_V)) == (PG_PS | PG_V), ("pmap_demote_pdpe: oldpdpe is missing PG_PS and/or PG_V")); if ((mpde = vm_page_alloc(NULL, va >> PDPSHIFT, VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED)) == NULL) { CTR2(KTR_PMAP, "pmap_demote_pdpe: failure for va %#lx" " in pmap %p", va, pmap); return (FALSE); } mpdepa = VM_PAGE_TO_PHYS(mpde); firstpde = (pd_entry_t *)PHYS_TO_DMAP(mpdepa); newpdpe = mpdepa | PG_M | PG_A | (oldpdpe & PG_U) | PG_RW | PG_V; KASSERT((oldpdpe & PG_A) != 0, ("pmap_demote_pdpe: oldpdpe is missing PG_A")); KASSERT((oldpdpe & (PG_M | PG_RW)) != PG_RW, ("pmap_demote_pdpe: oldpdpe is missing PG_M")); newpde = oldpdpe; /* * Initialize the page directory page. */ for (pde = firstpde; pde < firstpde + NPDEPG; pde++) { *pde = newpde; newpde += NBPDR; } /* * Demote the mapping. */ *pdpe = newpdpe; /* * Invalidate a stale recursive mapping of the page directory page. */ pmap_invalidate_page(pmap, (vm_offset_t)vtopde(va)); pmap_pdpe_demotions++; CTR2(KTR_PMAP, "pmap_demote_pdpe: success for va %#lx" " in pmap %p", va, pmap); return (TRUE); } /* * Sets the memory attribute for the specified page. */ void pmap_page_set_memattr(vm_page_t m, vm_memattr_t ma) { m->md.pat_mode = ma; /* * If "m" is a normal page, update its direct mapping. This update * can be relied upon to perform any cache operations that are * required for data coherence. */ if ((m->flags & PG_FICTITIOUS) == 0 && pmap_change_attr(PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)), PAGE_SIZE, m->md.pat_mode)) panic("memory attribute change on the direct map failed"); } /* * Changes the specified virtual address range's memory type to that given by * the parameter "mode". The specified virtual address range must be * completely contained within either the direct map or the kernel map. If * the virtual address range is contained within the kernel map, then the * memory type for each of the corresponding ranges of the direct map is also * changed. (The corresponding ranges of the direct map are those ranges that * map the same physical pages as the specified virtual address range.) These * changes to the direct map are necessary because Intel describes the * behavior of their processors as "undefined" if two or more mappings to the * same physical page have different memory types. * * Returns zero if the change completed successfully, and either EINVAL or * ENOMEM if the change failed. Specifically, EINVAL is returned if some part * of the virtual address range was not mapped, and ENOMEM is returned if * there was insufficient memory available to complete the change. In the * latter case, the memory type may have been changed on some part of the * virtual address range or the direct map. */ int pmap_change_attr(vm_offset_t va, vm_size_t size, int mode) { int error; PMAP_LOCK(kernel_pmap); error = pmap_change_attr_locked(va, size, mode); PMAP_UNLOCK(kernel_pmap); return (error); } static int pmap_change_attr_locked(vm_offset_t va, vm_size_t size, int mode) { vm_offset_t base, offset, tmpva; vm_paddr_t pa_start, pa_end; pdp_entry_t *pdpe; pd_entry_t *pde; pt_entry_t *pte; int cache_bits_pte, cache_bits_pde, error; boolean_t changed; PMAP_LOCK_ASSERT(kernel_pmap, MA_OWNED); base = trunc_page(va); offset = va & PAGE_MASK; size = round_page(offset + size); /* * Only supported on kernel virtual addresses, including the direct * map but excluding the recursive map. */ if (base < DMAP_MIN_ADDRESS) return (EINVAL); cache_bits_pde = pmap_cache_bits(kernel_pmap, mode, 1); cache_bits_pte = pmap_cache_bits(kernel_pmap, mode, 0); changed = FALSE; /* * Pages that aren't mapped aren't supported. Also break down 2MB pages * into 4KB pages if required. */ for (tmpva = base; tmpva < base + size; ) { pdpe = pmap_pdpe(kernel_pmap, tmpva); if (*pdpe == 0) return (EINVAL); if (*pdpe & PG_PS) { /* * If the current 1GB page already has the required * memory type, then we need not demote this page. Just * increment tmpva to the next 1GB page frame. */ if ((*pdpe & X86_PG_PDE_CACHE) == cache_bits_pde) { tmpva = trunc_1gpage(tmpva) + NBPDP; continue; } /* * If the current offset aligns with a 1GB page frame * and there is at least 1GB left within the range, then * we need not break down this page into 2MB pages. */ if ((tmpva & PDPMASK) == 0 && tmpva + PDPMASK < base + size) { tmpva += NBPDP; continue; } if (!pmap_demote_pdpe(kernel_pmap, pdpe, tmpva)) return (ENOMEM); } pde = pmap_pdpe_to_pde(pdpe, tmpva); if (*pde == 0) return (EINVAL); if (*pde & PG_PS) { /* * If the current 2MB page already has the required * memory type, then we need not demote this page. Just * increment tmpva to the next 2MB page frame. */ if ((*pde & X86_PG_PDE_CACHE) == cache_bits_pde) { tmpva = trunc_2mpage(tmpva) + NBPDR; continue; } /* * If the current offset aligns with a 2MB page frame * and there is at least 2MB left within the range, then * we need not break down this page into 4KB pages. */ if ((tmpva & PDRMASK) == 0 && tmpva + PDRMASK < base + size) { tmpva += NBPDR; continue; } if (!pmap_demote_pde(kernel_pmap, pde, tmpva)) return (ENOMEM); } pte = pmap_pde_to_pte(pde, tmpva); if (*pte == 0) return (EINVAL); tmpva += PAGE_SIZE; } error = 0; /* * Ok, all the pages exist, so run through them updating their * cache mode if required. */ pa_start = pa_end = 0; for (tmpva = base; tmpva < base + size; ) { pdpe = pmap_pdpe(kernel_pmap, tmpva); if (*pdpe & PG_PS) { if ((*pdpe & X86_PG_PDE_CACHE) != cache_bits_pde) { pmap_pde_attr(pdpe, cache_bits_pde, X86_PG_PDE_CACHE); changed = TRUE; } if (tmpva >= VM_MIN_KERNEL_ADDRESS) { if (pa_start == pa_end) { /* Start physical address run. */ pa_start = *pdpe & PG_PS_FRAME; pa_end = pa_start + NBPDP; } else if (pa_end == (*pdpe & PG_PS_FRAME)) pa_end += NBPDP; else { /* Run ended, update direct map. */ error = pmap_change_attr_locked( PHYS_TO_DMAP(pa_start), pa_end - pa_start, mode); if (error != 0) break; /* Start physical address run. */ pa_start = *pdpe & PG_PS_FRAME; pa_end = pa_start + NBPDP; } } tmpva = trunc_1gpage(tmpva) + NBPDP; continue; } pde = pmap_pdpe_to_pde(pdpe, tmpva); if (*pde & PG_PS) { if ((*pde & X86_PG_PDE_CACHE) != cache_bits_pde) { pmap_pde_attr(pde, cache_bits_pde, X86_PG_PDE_CACHE); changed = TRUE; } if (tmpva >= VM_MIN_KERNEL_ADDRESS) { if (pa_start == pa_end) { /* Start physical address run. */ pa_start = *pde & PG_PS_FRAME; pa_end = pa_start + NBPDR; } else if (pa_end == (*pde & PG_PS_FRAME)) pa_end += NBPDR; else { /* Run ended, update direct map. */ error = pmap_change_attr_locked( PHYS_TO_DMAP(pa_start), pa_end - pa_start, mode); if (error != 0) break; /* Start physical address run. */ pa_start = *pde & PG_PS_FRAME; pa_end = pa_start + NBPDR; } } tmpva = trunc_2mpage(tmpva) + NBPDR; } else { pte = pmap_pde_to_pte(pde, tmpva); if ((*pte & X86_PG_PTE_CACHE) != cache_bits_pte) { pmap_pte_attr(pte, cache_bits_pte, X86_PG_PTE_CACHE); changed = TRUE; } if (tmpva >= VM_MIN_KERNEL_ADDRESS) { if (pa_start == pa_end) { /* Start physical address run. */ pa_start = *pte & PG_FRAME; pa_end = pa_start + PAGE_SIZE; } else if (pa_end == (*pte & PG_FRAME)) pa_end += PAGE_SIZE; else { /* Run ended, update direct map. */ error = pmap_change_attr_locked( PHYS_TO_DMAP(pa_start), pa_end - pa_start, mode); if (error != 0) break; /* Start physical address run. */ pa_start = *pte & PG_FRAME; pa_end = pa_start + PAGE_SIZE; } } tmpva += PAGE_SIZE; } } if (error == 0 && pa_start != pa_end) error = pmap_change_attr_locked(PHYS_TO_DMAP(pa_start), pa_end - pa_start, mode); /* * Flush CPU caches if required to make sure any data isn't cached that * shouldn't be, etc. */ if (changed) { pmap_invalidate_range(kernel_pmap, base, tmpva); pmap_invalidate_cache_range(base, tmpva, FALSE); } return (error); } /* * Demotes any mapping within the direct map region that covers more than the * specified range of physical addresses. This range's size must be a power * of two and its starting address must be a multiple of its size. Since the * demotion does not change any attributes of the mapping, a TLB invalidation * is not mandatory. The caller may, however, request a TLB invalidation. */ void pmap_demote_DMAP(vm_paddr_t base, vm_size_t len, boolean_t invalidate) { pdp_entry_t *pdpe; pd_entry_t *pde; vm_offset_t va; boolean_t changed; if (len == 0) return; KASSERT(powerof2(len), ("pmap_demote_DMAP: len is not a power of 2")); KASSERT((base & (len - 1)) == 0, ("pmap_demote_DMAP: base is not a multiple of len")); if (len < NBPDP && base < dmaplimit) { va = PHYS_TO_DMAP(base); changed = FALSE; PMAP_LOCK(kernel_pmap); pdpe = pmap_pdpe(kernel_pmap, va); if ((*pdpe & X86_PG_V) == 0) panic("pmap_demote_DMAP: invalid PDPE"); if ((*pdpe & PG_PS) != 0) { if (!pmap_demote_pdpe(kernel_pmap, pdpe, va)) panic("pmap_demote_DMAP: PDPE failed"); changed = TRUE; } if (len < NBPDR) { pde = pmap_pdpe_to_pde(pdpe, va); if ((*pde & X86_PG_V) == 0) panic("pmap_demote_DMAP: invalid PDE"); if ((*pde & PG_PS) != 0) { if (!pmap_demote_pde(kernel_pmap, pde, va)) panic("pmap_demote_DMAP: PDE failed"); changed = TRUE; } } if (changed && invalidate) pmap_invalidate_page(kernel_pmap, va); PMAP_UNLOCK(kernel_pmap); } } /* * perform the pmap work for mincore */ int pmap_mincore(pmap_t pmap, vm_offset_t addr, vm_paddr_t *locked_pa) { pd_entry_t *pdep; pt_entry_t pte, PG_A, PG_M, PG_RW, PG_V; vm_paddr_t pa; int val; PG_A = pmap_accessed_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_V = pmap_valid_bit(pmap); PG_RW = pmap_rw_bit(pmap); PMAP_LOCK(pmap); retry: pdep = pmap_pde(pmap, addr); if (pdep != NULL && (*pdep & PG_V)) { if (*pdep & PG_PS) { pte = *pdep; /* Compute the physical address of the 4KB page. */ pa = ((*pdep & PG_PS_FRAME) | (addr & PDRMASK)) & PG_FRAME; val = MINCORE_SUPER; } else { pte = *pmap_pde_to_pte(pdep, addr); pa = pte & PG_FRAME; val = 0; } } else { pte = 0; pa = 0; val = 0; } if ((pte & PG_V) != 0) { val |= MINCORE_INCORE; if ((pte & (PG_M | PG_RW)) == (PG_M | PG_RW)) val |= MINCORE_MODIFIED | MINCORE_MODIFIED_OTHER; if ((pte & PG_A) != 0) val |= MINCORE_REFERENCED | MINCORE_REFERENCED_OTHER; } if ((val & (MINCORE_MODIFIED_OTHER | MINCORE_REFERENCED_OTHER)) != (MINCORE_MODIFIED_OTHER | MINCORE_REFERENCED_OTHER) && (pte & (PG_MANAGED | PG_V)) == (PG_MANAGED | PG_V)) { /* Ensure that "PHYS_TO_VM_PAGE(pa)->object" doesn't change. */ if (vm_page_pa_tryrelock(pmap, pa, locked_pa)) goto retry; } else PA_UNLOCK_COND(*locked_pa); PMAP_UNLOCK(pmap); return (val); } +static uint64_t +pmap_pcid_alloc(pmap_t pmap, u_int cpuid) +{ + uint32_t gen, new_gen, pcid_next; + + CRITICAL_ASSERT(curthread); + gen = PCPU_GET(pcid_gen); + if (pmap->pm_pcids[cpuid].pm_pcid == PMAP_PCID_KERN || + pmap->pm_pcids[cpuid].pm_gen == gen) + return (CR3_PCID_SAVE); + pcid_next = PCPU_GET(pcid_next); + KASSERT(pcid_next <= PMAP_PCID_OVERMAX, ("cpu %d pcid_next %#x", + cpuid, pcid_next)); + if (pcid_next == PMAP_PCID_OVERMAX) { + new_gen = gen + 1; + if (new_gen == 0) + new_gen = 1; + PCPU_SET(pcid_gen, new_gen); + pcid_next = PMAP_PCID_KERN + 1; + } else { + new_gen = gen; + } + pmap->pm_pcids[cpuid].pm_pcid = pcid_next; + pmap->pm_pcids[cpuid].pm_gen = new_gen; + PCPU_SET(pcid_next, pcid_next + 1); + return (0); +} + void -pmap_activate(struct thread *td) +pmap_activate_sw(struct thread *td) { - pmap_t pmap, oldpmap; - u_int cpuid; + pmap_t oldpmap, pmap; + uint64_t cached, cr3; + u_int cpuid; - critical_enter(); - pmap = vmspace_pmap(td->td_proc->p_vmspace); oldpmap = PCPU_GET(curpmap); + pmap = vmspace_pmap(td->td_proc->p_vmspace); + if (oldpmap == pmap) + return; cpuid = PCPU_GET(cpuid); #ifdef SMP - CPU_CLR_ATOMIC(cpuid, &oldpmap->pm_active); CPU_SET_ATOMIC(cpuid, &pmap->pm_active); - CPU_SET_ATOMIC(cpuid, &pmap->pm_save); #else - CPU_CLR(cpuid, &oldpmap->pm_active); CPU_SET(cpuid, &pmap->pm_active); - CPU_SET(cpuid, &pmap->pm_save); #endif - td->td_pcb->pcb_cr3 = pmap->pm_cr3; - load_cr3(pmap->pm_cr3); + cr3 = rcr3(); + if (pmap_pcid_enabled) { + cached = pmap_pcid_alloc(pmap, cpuid); + KASSERT(pmap->pm_pcids[cpuid].pm_pcid >= 0 && + pmap->pm_pcids[cpuid].pm_pcid < PMAP_PCID_OVERMAX, + ("pmap %p cpu %d pcid %#x", pmap, cpuid, + pmap->pm_pcids[cpuid].pm_pcid)); + KASSERT(pmap != PMAP_PCID_KERN || pmap == kernel_pmap, + ("non-kernel pmap %p cpu %d pcid %#x", pmap, cpuid, + pmap->pm_pcids[cpuid].pm_pcid)); + if (!cached || (cr3 & ~CR3_PCID_MASK) != pmap->pm_cr3) { + load_cr3(pmap->pm_cr3 | pmap->pm_pcids[cpuid].pm_pcid | + cached); + if (cached) + PCPU_INC(pm_save_cnt); + } + } else if (cr3 != pmap->pm_cr3) { + load_cr3(pmap->pm_cr3); + } PCPU_SET(curpmap, pmap); +#ifdef SMP + CPU_CLR_ATOMIC(cpuid, &oldpmap->pm_active); +#else + CPU_CLR(cpuid, &oldpmap->pm_active); +#endif +} + +void +pmap_activate(struct thread *td) +{ + + critical_enter(); + pmap_activate_sw(td); critical_exit(); } void pmap_sync_icache(pmap_t pm, vm_offset_t va, vm_size_t sz) { } /* * Increase the starting virtual address of the given mapping if a * different alignment might result in more superpage mappings. */ void pmap_align_superpage(vm_object_t object, vm_ooffset_t offset, vm_offset_t *addr, vm_size_t size) { vm_offset_t superpage_offset; if (size < NBPDR) return; if (object != NULL && (object->flags & OBJ_COLORED) != 0) offset += ptoa(object->pg_color); superpage_offset = offset & PDRMASK; if (size - ((NBPDR - superpage_offset) & PDRMASK) < NBPDR || (*addr & PDRMASK) == superpage_offset) return; if ((*addr & PDRMASK) < superpage_offset) *addr = (*addr & ~PDRMASK) + superpage_offset; else *addr = ((*addr + PDRMASK) & ~PDRMASK) + superpage_offset; } #ifdef INVARIANTS static unsigned long num_dirty_emulations; SYSCTL_ULONG(_vm_pmap, OID_AUTO, num_dirty_emulations, CTLFLAG_RW, &num_dirty_emulations, 0, NULL); static unsigned long num_accessed_emulations; SYSCTL_ULONG(_vm_pmap, OID_AUTO, num_accessed_emulations, CTLFLAG_RW, &num_accessed_emulations, 0, NULL); static unsigned long num_superpage_accessed_emulations; SYSCTL_ULONG(_vm_pmap, OID_AUTO, num_superpage_accessed_emulations, CTLFLAG_RW, &num_superpage_accessed_emulations, 0, NULL); static unsigned long ad_emulation_superpage_promotions; SYSCTL_ULONG(_vm_pmap, OID_AUTO, ad_emulation_superpage_promotions, CTLFLAG_RW, &ad_emulation_superpage_promotions, 0, NULL); #endif /* INVARIANTS */ int pmap_emulate_accessed_dirty(pmap_t pmap, vm_offset_t va, int ftype) { int rv; struct rwlock *lock; vm_page_t m, mpte; pd_entry_t *pde; pt_entry_t *pte, PG_A, PG_M, PG_RW, PG_V; boolean_t pv_lists_locked; KASSERT(ftype == VM_PROT_READ || ftype == VM_PROT_WRITE, ("pmap_emulate_accessed_dirty: invalid fault type %d", ftype)); if (!pmap_emulate_ad_bits(pmap)) return (-1); PG_A = pmap_accessed_bit(pmap); PG_M = pmap_modified_bit(pmap); PG_V = pmap_valid_bit(pmap); PG_RW = pmap_rw_bit(pmap); rv = -1; lock = NULL; pv_lists_locked = FALSE; retry: PMAP_LOCK(pmap); pde = pmap_pde(pmap, va); if (pde == NULL || (*pde & PG_V) == 0) goto done; if ((*pde & PG_PS) != 0) { if (ftype == VM_PROT_READ) { #ifdef INVARIANTS atomic_add_long(&num_superpage_accessed_emulations, 1); #endif *pde |= PG_A; rv = 0; } goto done; } pte = pmap_pde_to_pte(pde, va); if ((*pte & PG_V) == 0) goto done; if (ftype == VM_PROT_WRITE) { if ((*pte & PG_RW) == 0) goto done; /* * Set the modified and accessed bits simultaneously. * * Intel EPT PTEs that do software emulation of A/D bits map * PG_A and PG_M to EPT_PG_READ and EPT_PG_WRITE respectively. * An EPT misconfiguration is triggered if the PTE is writable * but not readable (WR=10). This is avoided by setting PG_A * and PG_M simultaneously. */ *pte |= PG_M | PG_A; } else { *pte |= PG_A; } /* try to promote the mapping */ if (va < VM_MAXUSER_ADDRESS) mpte = PHYS_TO_VM_PAGE(*pde & PG_FRAME); else mpte = NULL; m = PHYS_TO_VM_PAGE(*pte & PG_FRAME); if ((mpte == NULL || mpte->wire_count == NPTEPG) && pmap_ps_enabled(pmap) && (m->flags & PG_FICTITIOUS) == 0 && vm_reserv_level_iffullpop(m) == 0) { if (!pv_lists_locked) { pv_lists_locked = TRUE; if (!rw_try_rlock(&pvh_global_lock)) { PMAP_UNLOCK(pmap); rw_rlock(&pvh_global_lock); goto retry; } } pmap_promote_pde(pmap, pde, va, &lock); #ifdef INVARIANTS atomic_add_long(&ad_emulation_superpage_promotions, 1); #endif } #ifdef INVARIANTS if (ftype == VM_PROT_WRITE) atomic_add_long(&num_dirty_emulations, 1); else atomic_add_long(&num_accessed_emulations, 1); #endif rv = 0; /* success */ done: if (lock != NULL) rw_wunlock(lock); if (pv_lists_locked) rw_runlock(&pvh_global_lock); PMAP_UNLOCK(pmap); return (rv); } void pmap_get_mapping(pmap_t pmap, vm_offset_t va, uint64_t *ptr, int *num) { pml4_entry_t *pml4; pdp_entry_t *pdp; pd_entry_t *pde; pt_entry_t *pte, PG_V; int idx; idx = 0; PG_V = pmap_valid_bit(pmap); PMAP_LOCK(pmap); pml4 = pmap_pml4e(pmap, va); ptr[idx++] = *pml4; if ((*pml4 & PG_V) == 0) goto done; pdp = pmap_pml4e_to_pdpe(pml4, va); ptr[idx++] = *pdp; if ((*pdp & PG_V) == 0 || (*pdp & PG_PS) != 0) goto done; pde = pmap_pdpe_to_pde(pdp, va); ptr[idx++] = *pde; if ((*pde & PG_V) == 0 || (*pde & PG_PS) != 0) goto done; pte = pmap_pde_to_pte(pde, va); ptr[idx++] = *pte; done: PMAP_UNLOCK(pmap); *num = idx; } /** * Get the kernel virtual address of a set of physical pages. If there are * physical addresses not covered by the DMAP perform a transient mapping * that will be removed when calling pmap_unmap_io_transient. * * \param page The pages the caller wishes to obtain the virtual * address on the kernel memory map. * \param vaddr On return contains the kernel virtual memory address * of the pages passed in the page parameter. * \param count Number of pages passed in. * \param can_fault TRUE if the thread using the mapped pages can take * page faults, FALSE otherwise. * * \returns TRUE if the caller must call pmap_unmap_io_transient when * finished or FALSE otherwise. * */ boolean_t pmap_map_io_transient(vm_page_t page[], vm_offset_t vaddr[], int count, boolean_t can_fault) { vm_paddr_t paddr; boolean_t needs_mapping; pt_entry_t *pte; int cache_bits, error, i; /* * Allocate any KVA space that we need, this is done in a separate * loop to prevent calling vmem_alloc while pinned. */ needs_mapping = FALSE; for (i = 0; i < count; i++) { paddr = VM_PAGE_TO_PHYS(page[i]); if (__predict_false(paddr >= dmaplimit)) { error = vmem_alloc(kernel_arena, PAGE_SIZE, M_BESTFIT | M_WAITOK, &vaddr[i]); KASSERT(error == 0, ("vmem_alloc failed: %d", error)); needs_mapping = TRUE; } else { vaddr[i] = PHYS_TO_DMAP(paddr); } } /* Exit early if everything is covered by the DMAP */ if (!needs_mapping) return (FALSE); /* * NB: The sequence of updating a page table followed by accesses * to the corresponding pages used in the !DMAP case is subject to * the situation described in the "AMD64 Architecture Programmer's * Manual Volume 2: System Programming" rev. 3.23, "7.3.1 Special * Coherency Considerations". Therefore, issuing the INVLPG right * after modifying the PTE bits is crucial. */ if (!can_fault) sched_pin(); for (i = 0; i < count; i++) { paddr = VM_PAGE_TO_PHYS(page[i]); if (paddr >= dmaplimit) { if (can_fault) { /* * Slow path, since we can get page faults * while mappings are active don't pin the * thread to the CPU and instead add a global * mapping visible to all CPUs. */ pmap_qenter(vaddr[i], &page[i], 1); } else { pte = vtopte(vaddr[i]); cache_bits = pmap_cache_bits(kernel_pmap, page[i]->md.pat_mode, 0); pte_store(pte, paddr | X86_PG_RW | X86_PG_V | cache_bits); invlpg(vaddr[i]); } } } return (needs_mapping); } void pmap_unmap_io_transient(vm_page_t page[], vm_offset_t vaddr[], int count, boolean_t can_fault) { vm_paddr_t paddr; int i; if (!can_fault) sched_unpin(); for (i = 0; i < count; i++) { paddr = VM_PAGE_TO_PHYS(page[i]); if (paddr >= dmaplimit) { if (can_fault) pmap_qremove(vaddr[i], 1); vmem_free(kernel_arena, vaddr[i], PAGE_SIZE); } } } #include "opt_ddb.h" #ifdef DDB #include DB_SHOW_COMMAND(pte, pmap_print_pte) { pmap_t pmap; pml4_entry_t *pml4; pdp_entry_t *pdp; pd_entry_t *pde; pt_entry_t *pte, PG_V; vm_offset_t va; if (have_addr) { va = (vm_offset_t)addr; pmap = PCPU_GET(curpmap); /* XXX */ } else { db_printf("show pte addr\n"); return; } PG_V = pmap_valid_bit(pmap); pml4 = pmap_pml4e(pmap, va); db_printf("VA %#016lx pml4e %#016lx", va, *pml4); if ((*pml4 & PG_V) == 0) { db_printf("\n"); return; } pdp = pmap_pml4e_to_pdpe(pml4, va); db_printf(" pdpe %#016lx", *pdp); if ((*pdp & PG_V) == 0 || (*pdp & PG_PS) != 0) { db_printf("\n"); return; } pde = pmap_pdpe_to_pde(pdp, va); db_printf(" pde %#016lx", *pde); if ((*pde & PG_V) == 0 || (*pde & PG_PS) != 0) { db_printf("\n"); return; } pte = pmap_pde_to_pte(pde, va); db_printf(" pte %#016lx\n", *pte); } DB_SHOW_COMMAND(phys2dmap, pmap_phys2dmap) { vm_paddr_t a; if (have_addr) { a = (vm_paddr_t)addr; db_printf("0x%jx\n", (uintmax_t)PHYS_TO_DMAP(a)); } else { db_printf("show phys2dmap addr\n"); } } #endif Index: projects/release-arm-redux/sys/amd64/amd64/vm_machdep.c =================================================================== --- projects/release-arm-redux/sys/amd64/amd64/vm_machdep.c (revision 282691) +++ projects/release-arm-redux/sys/amd64/amd64/vm_machdep.c (revision 282692) @@ -1,734 +1,732 @@ /*- * Copyright (c) 1982, 1986 The Regents of the University of California. * Copyright (c) 1989, 1990 William Jolitz * Copyright (c) 1994 John Dyson * All rights reserved. * * This code is derived from software contributed to Berkeley by * the Systems Programming Group of the University of Utah Computer * Science Department, and William Jolitz. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by the University of * California, Berkeley and its contributors. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * from: @(#)vm_machdep.c 7.3 (Berkeley) 5/13/91 * Utah $Hdr: vm_machdep.c 1.16.1.1 89/06/23$ */ #include __FBSDID("$FreeBSD$"); #include "opt_isa.h" #include "opt_cpu.h" #include "opt_compat.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static void cpu_reset_real(void); #ifdef SMP static void cpu_reset_proxy(void); static u_int cpu_reset_proxyid; static volatile u_int cpu_reset_proxy_active; #endif _Static_assert(OFFSETOF_CURTHREAD == offsetof(struct pcpu, pc_curthread), "OFFSETOF_CURTHREAD does not correspond with offset of pc_curthread."); _Static_assert(OFFSETOF_CURPCB == offsetof(struct pcpu, pc_curpcb), "OFFSETOF_CURPCB does not correspond with offset of pc_curpcb."); struct savefpu * get_pcb_user_save_td(struct thread *td) { vm_offset_t p; p = td->td_kstack + td->td_kstack_pages * PAGE_SIZE - cpu_max_ext_state_size; KASSERT((p % 64) == 0, ("Unaligned pcb_user_save area")); return ((struct savefpu *)p); } struct savefpu * get_pcb_user_save_pcb(struct pcb *pcb) { vm_offset_t p; p = (vm_offset_t)(pcb + 1); return ((struct savefpu *)p); } struct pcb * get_pcb_td(struct thread *td) { vm_offset_t p; p = td->td_kstack + td->td_kstack_pages * PAGE_SIZE - cpu_max_ext_state_size - sizeof(struct pcb); return ((struct pcb *)p); } void * alloc_fpusave(int flags) { void *res; struct savefpu_ymm *sf; res = malloc(cpu_max_ext_state_size, M_DEVBUF, flags); if (use_xsave) { sf = (struct savefpu_ymm *)res; bzero(&sf->sv_xstate.sx_hd, sizeof(sf->sv_xstate.sx_hd)); sf->sv_xstate.sx_hd.xstate_bv = xsave_mask; } return (res); } /* * Finish a fork operation, with process p2 nearly set up. * Copy and update the pcb, set up the stack so that the child * ready to run and return to user mode. */ void cpu_fork(td1, p2, td2, flags) register struct thread *td1; register struct proc *p2; struct thread *td2; int flags; { register struct proc *p1; struct pcb *pcb2; struct mdproc *mdp1, *mdp2; struct proc_ldt *pldt; pmap_t pmap2; p1 = td1->td_proc; if ((flags & RFPROC) == 0) { if ((flags & RFMEM) == 0) { /* unshare user LDT */ mdp1 = &p1->p_md; mtx_lock(&dt_lock); if ((pldt = mdp1->md_ldt) != NULL && pldt->ldt_refcnt > 1 && user_ldt_alloc(p1, 1) == NULL) panic("could not copy LDT"); mtx_unlock(&dt_lock); } return; } /* Ensure that td1's pcb is up to date. */ fpuexit(td1); /* Point the pcb to the top of the stack */ pcb2 = get_pcb_td(td2); td2->td_pcb = pcb2; /* Copy td1's pcb */ bcopy(td1->td_pcb, pcb2, sizeof(*pcb2)); /* Properly initialize pcb_save */ pcb2->pcb_save = get_pcb_user_save_pcb(pcb2); bcopy(get_pcb_user_save_td(td1), get_pcb_user_save_pcb(pcb2), cpu_max_ext_state_size); /* Point mdproc and then copy over td1's contents */ mdp2 = &p2->p_md; bcopy(&p1->p_md, mdp2, sizeof(*mdp2)); /* * Create a new fresh stack for the new process. * Copy the trap frame for the return to user mode as if from a * syscall. This copies most of the user mode register values. */ td2->td_frame = (struct trapframe *)td2->td_pcb - 1; bcopy(td1->td_frame, td2->td_frame, sizeof(struct trapframe)); td2->td_frame->tf_rax = 0; /* Child returns zero */ td2->td_frame->tf_rflags &= ~PSL_C; /* success */ td2->td_frame->tf_rdx = 1; /* * If the parent process has the trap bit set (i.e. a debugger had * single stepped the process to the system call), we need to clear * the trap flag from the new frame unless the debugger had set PF_FORK * on the parent. Otherwise, the child will receive a (likely * unexpected) SIGTRAP when it executes the first instruction after * returning to userland. */ if ((p1->p_pfsflags & PF_FORK) == 0) td2->td_frame->tf_rflags &= ~PSL_T; /* * Set registers for trampoline to user mode. Leave space for the * return address on stack. These are the kernel mode register values. */ pmap2 = vmspace_pmap(p2->p_vmspace); - pcb2->pcb_cr3 = pmap2->pm_cr3; pcb2->pcb_r12 = (register_t)fork_return; /* fork_trampoline argument */ pcb2->pcb_rbp = 0; pcb2->pcb_rsp = (register_t)td2->td_frame - sizeof(void *); pcb2->pcb_rbx = (register_t)td2; /* fork_trampoline argument */ pcb2->pcb_rip = (register_t)fork_trampoline; /*- * pcb2->pcb_dr*: cloned above. * pcb2->pcb_savefpu: cloned above. * pcb2->pcb_flags: cloned above. * pcb2->pcb_onfault: cloned above (always NULL here?). * pcb2->pcb_[fg]sbase: cloned above */ /* Setup to release spin count in fork_exit(). */ td2->td_md.md_spinlock_count = 1; td2->td_md.md_saved_flags = PSL_KERNEL | PSL_I; /* As an i386, do not copy io permission bitmap. */ pcb2->pcb_tssp = NULL; /* New segment registers. */ set_pcb_flags(pcb2, PCB_FULL_IRET); /* Copy the LDT, if necessary. */ mdp1 = &td1->td_proc->p_md; mdp2 = &p2->p_md; mtx_lock(&dt_lock); if (mdp1->md_ldt != NULL) { if (flags & RFMEM) { mdp1->md_ldt->ldt_refcnt++; mdp2->md_ldt = mdp1->md_ldt; bcopy(&mdp1->md_ldt_sd, &mdp2->md_ldt_sd, sizeof(struct system_segment_descriptor)); } else { mdp2->md_ldt = NULL; mdp2->md_ldt = user_ldt_alloc(p2, 0); if (mdp2->md_ldt == NULL) panic("could not copy LDT"); amd64_set_ldt_data(td2, 0, max_ldt_segment, (struct user_segment_descriptor *) mdp1->md_ldt->ldt_base); } } else mdp2->md_ldt = NULL; mtx_unlock(&dt_lock); /* * Now, cpu_switch() can schedule the new process. * pcb_rsp is loaded pointing to the cpu_switch() stack frame * containing the return address when exiting cpu_switch. * This will normally be to fork_trampoline(), which will have * %ebx loaded with the new proc's pointer. fork_trampoline() * will set up a stack to call fork_return(p, frame); to complete * the return to user-mode. */ } /* * Intercept the return address from a freshly forked process that has NOT * been scheduled yet. * * This is needed to make kernel threads stay in kernel mode. */ void cpu_set_fork_handler(td, func, arg) struct thread *td; void (*func)(void *); void *arg; { /* * Note that the trap frame follows the args, so the function * is really called like this: func(arg, frame); */ td->td_pcb->pcb_r12 = (long) func; /* function */ td->td_pcb->pcb_rbx = (long) arg; /* first arg */ } void cpu_exit(struct thread *td) { /* * If this process has a custom LDT, release it. */ mtx_lock(&dt_lock); if (td->td_proc->p_md.md_ldt != 0) user_ldt_free(td); else mtx_unlock(&dt_lock); } void cpu_thread_exit(struct thread *td) { struct pcb *pcb; critical_enter(); if (td == PCPU_GET(fpcurthread)) fpudrop(); critical_exit(); pcb = td->td_pcb; /* Disable any hardware breakpoints. */ if (pcb->pcb_flags & PCB_DBREGS) { reset_dbregs(); clear_pcb_flags(pcb, PCB_DBREGS); } } void cpu_thread_clean(struct thread *td) { struct pcb *pcb; pcb = td->td_pcb; /* * Clean TSS/iomap */ if (pcb->pcb_tssp != NULL) { kmem_free(kernel_arena, (vm_offset_t)pcb->pcb_tssp, ctob(IOPAGES + 1)); pcb->pcb_tssp = NULL; } } void cpu_thread_swapin(struct thread *td) { } void cpu_thread_swapout(struct thread *td) { } void cpu_thread_alloc(struct thread *td) { struct pcb *pcb; struct xstate_hdr *xhdr; td->td_pcb = pcb = get_pcb_td(td); td->td_frame = (struct trapframe *)pcb - 1; pcb->pcb_save = get_pcb_user_save_pcb(pcb); if (use_xsave) { xhdr = (struct xstate_hdr *)(pcb->pcb_save + 1); bzero(xhdr, sizeof(*xhdr)); xhdr->xstate_bv = xsave_mask; } } void cpu_thread_free(struct thread *td) { cpu_thread_clean(td); } void cpu_set_syscall_retval(struct thread *td, int error) { switch (error) { case 0: td->td_frame->tf_rax = td->td_retval[0]; td->td_frame->tf_rdx = td->td_retval[1]; td->td_frame->tf_rflags &= ~PSL_C; break; case ERESTART: /* * Reconstruct pc, we know that 'syscall' is 2 bytes, * lcall $X,y is 7 bytes, int 0x80 is 2 bytes. * We saved this in tf_err. * %r10 (which was holding the value of %rcx) is restored * for the next iteration. * %r10 restore is only required for freebsd/amd64 processes, * but shall be innocent for any ia32 ABI. * * Require full context restore to get the arguments * in the registers reloaded at return to usermode. */ td->td_frame->tf_rip -= td->td_frame->tf_err; td->td_frame->tf_r10 = td->td_frame->tf_rcx; set_pcb_flags(td->td_pcb, PCB_FULL_IRET); break; case EJUSTRETURN: break; default: if (td->td_proc->p_sysent->sv_errsize) { if (error >= td->td_proc->p_sysent->sv_errsize) error = -1; /* XXX */ else error = td->td_proc->p_sysent->sv_errtbl[error]; } td->td_frame->tf_rax = error; td->td_frame->tf_rflags |= PSL_C; break; } } /* * Initialize machine state (pcb and trap frame) for a new thread about to * upcall. Put enough state in the new thread's PCB to get it to go back * userret(), where we can intercept it again to set the return (upcall) * Address and stack, along with those from upcals that are from other sources * such as those generated in thread_userret() itself. */ void cpu_set_upcall(struct thread *td, struct thread *td0) { struct pcb *pcb2; /* Point the pcb to the top of the stack. */ pcb2 = td->td_pcb; /* * Copy the upcall pcb. This loads kernel regs. * Those not loaded individually below get their default * values here. */ bcopy(td0->td_pcb, pcb2, sizeof(*pcb2)); clear_pcb_flags(pcb2, PCB_FPUINITDONE | PCB_USERFPUINITDONE | PCB_KERNFPU); pcb2->pcb_save = get_pcb_user_save_pcb(pcb2); bcopy(get_pcb_user_save_td(td0), pcb2->pcb_save, cpu_max_ext_state_size); set_pcb_flags(pcb2, PCB_FULL_IRET); /* * Create a new fresh stack for the new thread. */ bcopy(td0->td_frame, td->td_frame, sizeof(struct trapframe)); /* If the current thread has the trap bit set (i.e. a debugger had * single stepped the process to the system call), we need to clear * the trap flag from the new frame. Otherwise, the new thread will * receive a (likely unexpected) SIGTRAP when it executes the first * instruction after returning to userland. */ td->td_frame->tf_rflags &= ~PSL_T; /* * Set registers for trampoline to user mode. Leave space for the * return address on stack. These are the kernel mode register values. */ pcb2->pcb_r12 = (register_t)fork_return; /* trampoline arg */ pcb2->pcb_rbp = 0; pcb2->pcb_rsp = (register_t)td->td_frame - sizeof(void *); /* trampoline arg */ pcb2->pcb_rbx = (register_t)td; /* trampoline arg */ pcb2->pcb_rip = (register_t)fork_trampoline; /* * If we didn't copy the pcb, we'd need to do the following registers: - * pcb2->pcb_cr3: cloned above. * pcb2->pcb_dr*: cloned above. * pcb2->pcb_savefpu: cloned above. * pcb2->pcb_onfault: cloned above (always NULL here?). * pcb2->pcb_[fg]sbase: cloned above */ /* Setup to release spin count in fork_exit(). */ td->td_md.md_spinlock_count = 1; td->td_md.md_saved_flags = PSL_KERNEL | PSL_I; } /* * Set that machine state for performing an upcall that has to * be done in thread_userret() so that those upcalls generated * in thread_userret() itself can be done as well. */ void cpu_set_upcall_kse(struct thread *td, void (*entry)(void *), void *arg, stack_t *stack) { /* * Do any extra cleaning that needs to be done. * The thread may have optional components * that are not present in a fresh thread. * This may be a recycled thread so make it look * as though it's newly allocated. */ cpu_thread_clean(td); #ifdef COMPAT_FREEBSD32 if (SV_PROC_FLAG(td->td_proc, SV_ILP32)) { /* * Set the trap frame to point at the beginning of the uts * function. */ td->td_frame->tf_rbp = 0; td->td_frame->tf_rsp = (((uintptr_t)stack->ss_sp + stack->ss_size - 4) & ~0x0f) - 4; td->td_frame->tf_rip = (uintptr_t)entry; /* * Pass the address of the mailbox for this kse to the uts * function as a parameter on the stack. */ suword32((void *)(td->td_frame->tf_rsp + sizeof(int32_t)), (uint32_t)(uintptr_t)arg); return; } #endif /* * Set the trap frame to point at the beginning of the uts * function. */ td->td_frame->tf_rbp = 0; td->td_frame->tf_rsp = ((register_t)stack->ss_sp + stack->ss_size) & ~0x0f; td->td_frame->tf_rsp -= 8; td->td_frame->tf_rip = (register_t)entry; td->td_frame->tf_ds = _udatasel; td->td_frame->tf_es = _udatasel; td->td_frame->tf_fs = _ufssel; td->td_frame->tf_gs = _ugssel; td->td_frame->tf_flags = TF_HASSEGS; /* * Pass the address of the mailbox for this kse to the uts * function as a parameter on the stack. */ td->td_frame->tf_rdi = (register_t)arg; } int cpu_set_user_tls(struct thread *td, void *tls_base) { struct pcb *pcb; if ((u_int64_t)tls_base >= VM_MAXUSER_ADDRESS) return (EINVAL); pcb = td->td_pcb; set_pcb_flags(pcb, PCB_FULL_IRET); #ifdef COMPAT_FREEBSD32 if (SV_PROC_FLAG(td->td_proc, SV_ILP32)) { pcb->pcb_gsbase = (register_t)tls_base; return (0); } #endif pcb->pcb_fsbase = (register_t)tls_base; return (0); } #ifdef SMP static void cpu_reset_proxy() { cpuset_t tcrp; cpu_reset_proxy_active = 1; while (cpu_reset_proxy_active == 1) ia32_pause(); /* Wait for other cpu to see that we've started */ CPU_SETOF(cpu_reset_proxyid, &tcrp); stop_cpus(tcrp); printf("cpu_reset_proxy: Stopped CPU %d\n", cpu_reset_proxyid); DELAY(1000000); cpu_reset_real(); } #endif void cpu_reset() { #ifdef SMP cpuset_t map; u_int cnt; if (smp_started) { map = all_cpus; CPU_CLR(PCPU_GET(cpuid), &map); CPU_NAND(&map, &stopped_cpus); if (!CPU_EMPTY(&map)) { printf("cpu_reset: Stopping other CPUs\n"); stop_cpus(map); } if (PCPU_GET(cpuid) != 0) { cpu_reset_proxyid = PCPU_GET(cpuid); cpustop_restartfunc = cpu_reset_proxy; cpu_reset_proxy_active = 0; printf("cpu_reset: Restarting BSP\n"); /* Restart CPU #0. */ CPU_SETOF(0, &started_cpus); wmb(); cnt = 0; while (cpu_reset_proxy_active == 0 && cnt < 10000000) { ia32_pause(); cnt++; /* Wait for BSP to announce restart */ } if (cpu_reset_proxy_active == 0) printf("cpu_reset: Failed to restart BSP\n"); enable_intr(); cpu_reset_proxy_active = 2; while (1) ia32_pause(); /* NOTREACHED */ } DELAY(1000000); } #endif cpu_reset_real(); /* NOTREACHED */ } static void cpu_reset_real() { struct region_descriptor null_idt; int b; disable_intr(); /* * Attempt to do a CPU reset via the keyboard controller, * do not turn off GateA20, as any machine that fails * to do the reset here would then end up in no man's land. */ outb(IO_KBD + 4, 0xFE); DELAY(500000); /* wait 0.5 sec to see if that did it */ /* * Attempt to force a reset via the Reset Control register at * I/O port 0xcf9. Bit 2 forces a system reset when it * transitions from 0 to 1. Bit 1 selects the type of reset * to attempt: 0 selects a "soft" reset, and 1 selects a * "hard" reset. We try a "hard" reset. The first write sets * bit 1 to select a "hard" reset and clears bit 2. The * second write forces a 0 -> 1 transition in bit 2 to trigger * a reset. */ outb(0xcf9, 0x2); outb(0xcf9, 0x6); DELAY(500000); /* wait 0.5 sec to see if that did it */ /* * Attempt to force a reset via the Fast A20 and Init register * at I/O port 0x92. Bit 1 serves as an alternate A20 gate. * Bit 0 asserts INIT# when set to 1. We are careful to only * preserve bit 1 while setting bit 0. We also must clear bit * 0 before setting it if it isn't already clear. */ b = inb(0x92); if (b != 0xff) { if ((b & 0x1) != 0) outb(0x92, b & 0xfe); outb(0x92, b | 0x1); DELAY(500000); /* wait 0.5 sec to see if that did it */ } printf("No known reset method worked, attempting CPU shutdown\n"); DELAY(1000000); /* wait 1 sec for printf to complete */ /* Wipe the IDT. */ null_idt.rd_limit = 0; null_idt.rd_base = 0; lidt(&null_idt); /* "good night, sweet prince .... " */ breakpoint(); /* NOTREACHED */ while(1); } /* * Software interrupt handler for queued VM system processing. */ void swi_vm(void *dummy) { if (busdma_swi_pending != 0) busdma_swi(); } /* * Tell whether this address is in some physical memory region. * Currently used by the kernel coredump code in order to avoid * dumping the ``ISA memory hole'' which could cause indefinite hangs, * or other unpredictable behaviour. */ int is_physical_memory(vm_paddr_t addr) { #ifdef DEV_ISA /* The ISA ``memory hole''. */ if (addr >= 0xa0000 && addr < 0x100000) return 0; #endif /* * stuff other tests for known memory-mapped devices (PCI?) * here */ return 1; } Index: projects/release-arm-redux/sys/amd64/include/cpufunc.h =================================================================== --- projects/release-arm-redux/sys/amd64/include/cpufunc.h (revision 282691) +++ projects/release-arm-redux/sys/amd64/include/cpufunc.h (revision 282692) @@ -1,865 +1,864 @@ /*- * Copyright (c) 2003 Peter Wemm. * Copyright (c) 1993 The Regents of the University of California. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ /* * Functions to provide access to special i386 instructions. * This in included in sys/systm.h, and that file should be * used in preference to this. */ #ifndef _MACHINE_CPUFUNC_H_ #define _MACHINE_CPUFUNC_H_ #ifndef _SYS_CDEFS_H_ #error this file needs sys/cdefs.h as a prerequisite #endif struct region_descriptor; #define readb(va) (*(volatile uint8_t *) (va)) #define readw(va) (*(volatile uint16_t *) (va)) #define readl(va) (*(volatile uint32_t *) (va)) #define readq(va) (*(volatile uint64_t *) (va)) #define writeb(va, d) (*(volatile uint8_t *) (va) = (d)) #define writew(va, d) (*(volatile uint16_t *) (va) = (d)) #define writel(va, d) (*(volatile uint32_t *) (va) = (d)) #define writeq(va, d) (*(volatile uint64_t *) (va) = (d)) #if defined(__GNUCLIKE_ASM) && defined(__CC_SUPPORTS___INLINE) static __inline void breakpoint(void) { __asm __volatile("int $3"); } static __inline u_int bsfl(u_int mask) { u_int result; __asm __volatile("bsfl %1,%0" : "=r" (result) : "rm" (mask)); return (result); } static __inline u_long bsfq(u_long mask) { u_long result; __asm __volatile("bsfq %1,%0" : "=r" (result) : "rm" (mask)); return (result); } static __inline u_int bsrl(u_int mask) { u_int result; __asm __volatile("bsrl %1,%0" : "=r" (result) : "rm" (mask)); return (result); } static __inline u_long bsrq(u_long mask) { u_long result; __asm __volatile("bsrq %1,%0" : "=r" (result) : "rm" (mask)); return (result); } static __inline void clflush(u_long addr) { __asm __volatile("clflush %0" : : "m" (*(char *)addr)); } static __inline void clts(void) { __asm __volatile("clts"); } static __inline void disable_intr(void) { __asm __volatile("cli" : : : "memory"); } static __inline void do_cpuid(u_int ax, u_int *p) { __asm __volatile("cpuid" : "=a" (p[0]), "=b" (p[1]), "=c" (p[2]), "=d" (p[3]) : "0" (ax)); } static __inline void cpuid_count(u_int ax, u_int cx, u_int *p) { __asm __volatile("cpuid" : "=a" (p[0]), "=b" (p[1]), "=c" (p[2]), "=d" (p[3]) : "0" (ax), "c" (cx)); } static __inline void enable_intr(void) { __asm __volatile("sti"); } #ifdef _KERNEL #define HAVE_INLINE_FFS #define ffs(x) __builtin_ffs(x) #define HAVE_INLINE_FFSL static __inline int ffsl(long mask) { return (mask == 0 ? mask : (int)bsfq((u_long)mask) + 1); } #define HAVE_INLINE_FFSLL static __inline int ffsll(long long mask) { return (ffsl((long)mask)); } #define HAVE_INLINE_FLS static __inline int fls(int mask) { return (mask == 0 ? mask : (int)bsrl((u_int)mask) + 1); } #define HAVE_INLINE_FLSL static __inline int flsl(long mask) { return (mask == 0 ? mask : (int)bsrq((u_long)mask) + 1); } #define HAVE_INLINE_FLSLL static __inline int flsll(long long mask) { return (flsl((long)mask)); } #endif /* _KERNEL */ static __inline void halt(void) { __asm __volatile("hlt"); } static __inline u_char inb(u_int port) { u_char data; __asm __volatile("inb %w1, %0" : "=a" (data) : "Nd" (port)); return (data); } static __inline u_int inl(u_int port) { u_int data; __asm __volatile("inl %w1, %0" : "=a" (data) : "Nd" (port)); return (data); } static __inline void insb(u_int port, void *addr, size_t count) { __asm __volatile("cld; rep; insb" : "+D" (addr), "+c" (count) : "d" (port) : "memory"); } static __inline void insw(u_int port, void *addr, size_t count) { __asm __volatile("cld; rep; insw" : "+D" (addr), "+c" (count) : "d" (port) : "memory"); } static __inline void insl(u_int port, void *addr, size_t count) { __asm __volatile("cld; rep; insl" : "+D" (addr), "+c" (count) : "d" (port) : "memory"); } static __inline void invd(void) { __asm __volatile("invd"); } static __inline u_short inw(u_int port) { u_short data; __asm __volatile("inw %w1, %0" : "=a" (data) : "Nd" (port)); return (data); } static __inline void outb(u_int port, u_char data) { __asm __volatile("outb %0, %w1" : : "a" (data), "Nd" (port)); } static __inline void outl(u_int port, u_int data) { __asm __volatile("outl %0, %w1" : : "a" (data), "Nd" (port)); } static __inline void outsb(u_int port, const void *addr, size_t count) { __asm __volatile("cld; rep; outsb" : "+S" (addr), "+c" (count) : "d" (port)); } static __inline void outsw(u_int port, const void *addr, size_t count) { __asm __volatile("cld; rep; outsw" : "+S" (addr), "+c" (count) : "d" (port)); } static __inline void outsl(u_int port, const void *addr, size_t count) { __asm __volatile("cld; rep; outsl" : "+S" (addr), "+c" (count) : "d" (port)); } static __inline void outw(u_int port, u_short data) { __asm __volatile("outw %0, %w1" : : "a" (data), "Nd" (port)); } static __inline u_long popcntq(u_long mask) { u_long result; __asm __volatile("popcntq %1,%0" : "=r" (result) : "rm" (mask)); return (result); } static __inline void lfence(void) { __asm __volatile("lfence" : : : "memory"); } static __inline void mfence(void) { __asm __volatile("mfence" : : : "memory"); } static __inline void ia32_pause(void) { __asm __volatile("pause"); } static __inline u_long read_rflags(void) { u_long rf; __asm __volatile("pushfq; popq %0" : "=r" (rf)); return (rf); } static __inline uint64_t rdmsr(u_int msr) { uint32_t low, high; __asm __volatile("rdmsr" : "=a" (low), "=d" (high) : "c" (msr)); return (low | ((uint64_t)high << 32)); } static __inline uint32_t rdmsr32(u_int msr) { uint32_t low; __asm __volatile("rdmsr" : "=a" (low) : "c" (msr) : "rdx"); return (low); } static __inline uint64_t rdpmc(u_int pmc) { uint32_t low, high; __asm __volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (pmc)); return (low | ((uint64_t)high << 32)); } static __inline uint64_t rdtsc(void) { uint32_t low, high; __asm __volatile("rdtsc" : "=a" (low), "=d" (high)); return (low | ((uint64_t)high << 32)); } static __inline uint32_t rdtsc32(void) { uint32_t rv; __asm __volatile("rdtsc" : "=a" (rv) : : "edx"); return (rv); } static __inline void wbinvd(void) { __asm __volatile("wbinvd"); } static __inline void write_rflags(u_long rf) { __asm __volatile("pushq %0; popfq" : : "r" (rf)); } static __inline void wrmsr(u_int msr, uint64_t newval) { uint32_t low, high; low = newval; high = newval >> 32; __asm __volatile("wrmsr" : : "a" (low), "d" (high), "c" (msr)); } static __inline void load_cr0(u_long data) { __asm __volatile("movq %0,%%cr0" : : "r" (data)); } static __inline u_long rcr0(void) { u_long data; __asm __volatile("movq %%cr0,%0" : "=r" (data)); return (data); } static __inline u_long rcr2(void) { u_long data; __asm __volatile("movq %%cr2,%0" : "=r" (data)); return (data); } static __inline void load_cr3(u_long data) { __asm __volatile("movq %0,%%cr3" : : "r" (data) : "memory"); } static __inline u_long rcr3(void) { u_long data; __asm __volatile("movq %%cr3,%0" : "=r" (data)); return (data); } static __inline void load_cr4(u_long data) { __asm __volatile("movq %0,%%cr4" : : "r" (data)); } static __inline u_long rcr4(void) { u_long data; __asm __volatile("movq %%cr4,%0" : "=r" (data)); return (data); } static __inline u_long rxcr(u_int reg) { u_int low, high; __asm __volatile("xgetbv" : "=a" (low), "=d" (high) : "c" (reg)); return (low | ((uint64_t)high << 32)); } static __inline void load_xcr(u_int reg, u_long val) { u_int low, high; low = val; high = val >> 32; __asm __volatile("xsetbv" : : "c" (reg), "a" (low), "d" (high)); } /* * Global TLB flush (except for thise for pages marked PG_G) */ static __inline void invltlb(void) { load_cr3(rcr3()); } #ifndef CR4_PGE #define CR4_PGE 0x00000080 /* Page global enable */ #endif /* * Perform the guaranteed invalidation of all TLB entries. This * includes the global entries, and entries in all PCIDs, not only the * current context. The function works both on non-PCID CPUs and CPUs * with the PCID turned off or on. See IA-32 SDM Vol. 3a 4.10.4.1 * Operations that Invalidate TLBs and Paging-Structure Caches. */ static __inline void invltlb_globpcid(void) { uint64_t cr4; cr4 = rcr4(); load_cr4(cr4 & ~CR4_PGE); /* * Although preemption at this point could be detrimental to * performance, it would not lead to an error. PG_G is simply * ignored if CR4.PGE is clear. Moreover, in case this block * is re-entered, the load_cr4() either above or below will * modify CR4.PGE flushing the TLB. */ load_cr4(cr4 | CR4_PGE); } /* * TLB flush for an individual page (even if it has PG_G). * Only works on 486+ CPUs (i386 does not have PG_G). */ static __inline void invlpg(u_long addr) { __asm __volatile("invlpg %0" : : "m" (*(char *)addr) : "memory"); } #define INVPCID_ADDR 0 #define INVPCID_CTX 1 #define INVPCID_CTXGLOB 2 #define INVPCID_ALLCTX 3 struct invpcid_descr { uint64_t pcid:12 __packed; uint64_t pad:52 __packed; uint64_t addr; } __packed; static __inline void invpcid(struct invpcid_descr *d, int type) { - /* invpcid (%rdx),%rax */ - __asm __volatile(".byte 0x66,0x0f,0x38,0x82,0x02" - : : "d" (d), "a" ((u_long)type) : "memory"); + __asm __volatile("invpcid (%0),%1" + : : "r" (d), "r" ((u_long)type) : "memory"); } static __inline u_short rfs(void) { u_short sel; __asm __volatile("movw %%fs,%0" : "=rm" (sel)); return (sel); } static __inline u_short rgs(void) { u_short sel; __asm __volatile("movw %%gs,%0" : "=rm" (sel)); return (sel); } static __inline u_short rss(void) { u_short sel; __asm __volatile("movw %%ss,%0" : "=rm" (sel)); return (sel); } static __inline void load_ds(u_short sel) { __asm __volatile("movw %0,%%ds" : : "rm" (sel)); } static __inline void load_es(u_short sel) { __asm __volatile("movw %0,%%es" : : "rm" (sel)); } static __inline void cpu_monitor(const void *addr, u_long extensions, u_int hints) { __asm __volatile("monitor" : : "a" (addr), "c" (extensions), "d" (hints)); } static __inline void cpu_mwait(u_long extensions, u_int hints) { __asm __volatile("mwait" : : "a" (hints), "c" (extensions)); } #ifdef _KERNEL /* This is defined in but is too painful to get to */ #ifndef MSR_FSBASE #define MSR_FSBASE 0xc0000100 #endif static __inline void load_fs(u_short sel) { /* Preserve the fsbase value across the selector load */ __asm __volatile("rdmsr; movw %0,%%fs; wrmsr" : : "rm" (sel), "c" (MSR_FSBASE) : "eax", "edx"); } #ifndef MSR_GSBASE #define MSR_GSBASE 0xc0000101 #endif static __inline void load_gs(u_short sel) { /* * Preserve the gsbase value across the selector load. * Note that we have to disable interrupts because the gsbase * being trashed happens to be the kernel gsbase at the time. */ __asm __volatile("pushfq; cli; rdmsr; movw %0,%%gs; wrmsr; popfq" : : "rm" (sel), "c" (MSR_GSBASE) : "eax", "edx"); } #else /* Usable by userland */ static __inline void load_fs(u_short sel) { __asm __volatile("movw %0,%%fs" : : "rm" (sel)); } static __inline void load_gs(u_short sel) { __asm __volatile("movw %0,%%gs" : : "rm" (sel)); } #endif static __inline void lidt(struct region_descriptor *addr) { __asm __volatile("lidt (%0)" : : "r" (addr)); } static __inline void lldt(u_short sel) { __asm __volatile("lldt %0" : : "r" (sel)); } static __inline void ltr(u_short sel) { __asm __volatile("ltr %0" : : "r" (sel)); } static __inline uint64_t rdr0(void) { uint64_t data; __asm __volatile("movq %%dr0,%0" : "=r" (data)); return (data); } static __inline void load_dr0(uint64_t dr0) { __asm __volatile("movq %0,%%dr0" : : "r" (dr0)); } static __inline uint64_t rdr1(void) { uint64_t data; __asm __volatile("movq %%dr1,%0" : "=r" (data)); return (data); } static __inline void load_dr1(uint64_t dr1) { __asm __volatile("movq %0,%%dr1" : : "r" (dr1)); } static __inline uint64_t rdr2(void) { uint64_t data; __asm __volatile("movq %%dr2,%0" : "=r" (data)); return (data); } static __inline void load_dr2(uint64_t dr2) { __asm __volatile("movq %0,%%dr2" : : "r" (dr2)); } static __inline uint64_t rdr3(void) { uint64_t data; __asm __volatile("movq %%dr3,%0" : "=r" (data)); return (data); } static __inline void load_dr3(uint64_t dr3) { __asm __volatile("movq %0,%%dr3" : : "r" (dr3)); } static __inline uint64_t rdr4(void) { uint64_t data; __asm __volatile("movq %%dr4,%0" : "=r" (data)); return (data); } static __inline void load_dr4(uint64_t dr4) { __asm __volatile("movq %0,%%dr4" : : "r" (dr4)); } static __inline uint64_t rdr5(void) { uint64_t data; __asm __volatile("movq %%dr5,%0" : "=r" (data)); return (data); } static __inline void load_dr5(uint64_t dr5) { __asm __volatile("movq %0,%%dr5" : : "r" (dr5)); } static __inline uint64_t rdr6(void) { uint64_t data; __asm __volatile("movq %%dr6,%0" : "=r" (data)); return (data); } static __inline void load_dr6(uint64_t dr6) { __asm __volatile("movq %0,%%dr6" : : "r" (dr6)); } static __inline uint64_t rdr7(void) { uint64_t data; __asm __volatile("movq %%dr7,%0" : "=r" (data)); return (data); } static __inline void load_dr7(uint64_t dr7) { __asm __volatile("movq %0,%%dr7" : : "r" (dr7)); } static __inline register_t intr_disable(void) { register_t rflags; rflags = read_rflags(); disable_intr(); return (rflags); } static __inline void intr_restore(register_t rflags) { write_rflags(rflags); } #else /* !(__GNUCLIKE_ASM && __CC_SUPPORTS___INLINE) */ int breakpoint(void); u_int bsfl(u_int mask); u_int bsrl(u_int mask); void clflush(u_long addr); void clts(void); void cpuid_count(u_int ax, u_int cx, u_int *p); void disable_intr(void); void do_cpuid(u_int ax, u_int *p); void enable_intr(void); void halt(void); void ia32_pause(void); u_char inb(u_int port); u_int inl(u_int port); void insb(u_int port, void *addr, size_t count); void insl(u_int port, void *addr, size_t count); void insw(u_int port, void *addr, size_t count); register_t intr_disable(void); void intr_restore(register_t rf); void invd(void); void invlpg(u_int addr); void invltlb(void); u_short inw(u_int port); void lidt(struct region_descriptor *addr); void lldt(u_short sel); void load_cr0(u_long cr0); void load_cr3(u_long cr3); void load_cr4(u_long cr4); void load_dr0(uint64_t dr0); void load_dr1(uint64_t dr1); void load_dr2(uint64_t dr2); void load_dr3(uint64_t dr3); void load_dr4(uint64_t dr4); void load_dr5(uint64_t dr5); void load_dr6(uint64_t dr6); void load_dr7(uint64_t dr7); void load_fs(u_short sel); void load_gs(u_short sel); void ltr(u_short sel); void outb(u_int port, u_char data); void outl(u_int port, u_int data); void outsb(u_int port, const void *addr, size_t count); void outsl(u_int port, const void *addr, size_t count); void outsw(u_int port, const void *addr, size_t count); void outw(u_int port, u_short data); u_long rcr0(void); u_long rcr2(void); u_long rcr3(void); u_long rcr4(void); uint64_t rdmsr(u_int msr); uint32_t rdmsr32(u_int msr); uint64_t rdpmc(u_int pmc); uint64_t rdr0(void); uint64_t rdr1(void); uint64_t rdr2(void); uint64_t rdr3(void); uint64_t rdr4(void); uint64_t rdr5(void); uint64_t rdr6(void); uint64_t rdr7(void); uint64_t rdtsc(void); u_long read_rflags(void); u_int rfs(void); u_int rgs(void); void wbinvd(void); void write_rflags(u_int rf); void wrmsr(u_int msr, uint64_t newval); #endif /* __GNUCLIKE_ASM && __CC_SUPPORTS___INLINE */ void reset_dbregs(void); #ifdef _KERNEL int rdmsr_safe(u_int msr, uint64_t *val); int wrmsr_safe(u_int msr, uint64_t newval); #endif #endif /* !_MACHINE_CPUFUNC_H_ */ Index: projects/release-arm-redux/sys/amd64/include/md_var.h =================================================================== --- projects/release-arm-redux/sys/amd64/include/md_var.h (revision 282691) +++ projects/release-arm-redux/sys/amd64/include/md_var.h (revision 282692) @@ -1,131 +1,132 @@ /*- * Copyright (c) 1995 Bruce D. Evans. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the author nor the names of contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _MACHINE_MD_VAR_H_ #define _MACHINE_MD_VAR_H_ /* * Miscellaneous machine-dependent declarations. */ extern long Maxmem; extern u_int basemem; extern int busdma_swi_pending; extern u_int cpu_exthigh; extern u_int cpu_feature; extern u_int cpu_feature2; extern u_int amd_feature; extern u_int amd_feature2; extern u_int amd_pminfo; extern u_int via_feature_rng; extern u_int via_feature_xcrypt; extern u_int cpu_clflush_line_size; extern u_int cpu_stdext_feature; extern u_int cpu_fxsr; extern u_int cpu_high; extern u_int cpu_id; extern u_int cpu_max_ext_state_size; extern u_int cpu_mxcsr_mask; extern u_int cpu_procinfo; extern u_int cpu_procinfo2; extern char cpu_vendor[]; extern u_int cpu_vendor_id; extern u_int cpu_mon_mwait_flags; extern u_int cpu_mon_min_size; extern u_int cpu_mon_max_size; extern u_int cpu_maxphyaddr; extern char ctx_switch_xsave[]; extern u_int hv_high; extern char hv_vendor[]; extern char kstack[]; extern char sigcode[]; extern int szsigcode; extern uint64_t *vm_page_dump; extern int vm_page_dump_size; extern int workaround_erratum383; extern int _udatasel; extern int _ucodesel; extern int _ucode32sel; extern int _ufssel; extern int _ugssel; extern int use_xsave; extern uint64_t xsave_mask; typedef void alias_for_inthand_t(u_int cs, u_int ef, u_int esp, u_int ss); struct pcb; struct savefpu; struct thread; struct reg; struct fpreg; struct dbreg; struct dumperinfo; void *alloc_fpusave(int flags); void amd64_syscall(struct thread *td, int traced); void busdma_swi(void); +bool cpu_mwait_usable(void); void cpu_probe_amdc1e(void); void cpu_setregs(void); void doreti_iret(void) __asm(__STRING(doreti_iret)); void doreti_iret_fault(void) __asm(__STRING(doreti_iret_fault)); void ld_ds(void) __asm(__STRING(ld_ds)); void ld_es(void) __asm(__STRING(ld_es)); void ld_fs(void) __asm(__STRING(ld_fs)); void ld_gs(void) __asm(__STRING(ld_gs)); void ld_fsbase(void) __asm(__STRING(ld_fsbase)); void ld_gsbase(void) __asm(__STRING(ld_gsbase)); void ds_load_fault(void) __asm(__STRING(ds_load_fault)); void es_load_fault(void) __asm(__STRING(es_load_fault)); void fs_load_fault(void) __asm(__STRING(fs_load_fault)); void gs_load_fault(void) __asm(__STRING(gs_load_fault)); void fsbase_load_fault(void) __asm(__STRING(fsbase_load_fault)); void gsbase_load_fault(void) __asm(__STRING(gsbase_load_fault)); void dump_add_page(vm_paddr_t); void dump_drop_page(vm_paddr_t); void identify_cpu(void); void initializecpu(void); void initializecpucache(void); void fillw(int /*u_short*/ pat, void *base, size_t cnt); void fpstate_drop(struct thread *td); int is_physical_memory(vm_paddr_t addr); int isa_nmi(int cd); void panicifcpuunsupported(void); void pagecopy(void *from, void *to); void pagezero(void *addr); void printcpuinfo(void); void setidt(int idx, alias_for_inthand_t *func, int typ, int dpl, int ist); int user_dbreg_trap(void); int minidumpsys(struct dumperinfo *); struct savefpu *get_pcb_user_save_td(struct thread *td); struct savefpu *get_pcb_user_save_pcb(struct pcb *pcb); struct pcb *get_pcb_td(struct thread *td); void amd64_db_resume_dbreg(void); #endif /* !_MACHINE_MD_VAR_H_ */ Index: projects/release-arm-redux/sys/amd64/include/pcpu.h =================================================================== --- projects/release-arm-redux/sys/amd64/include/pcpu.h (revision 282691) +++ projects/release-arm-redux/sys/amd64/include/pcpu.h (revision 282692) @@ -1,249 +1,251 @@ /*- * Copyright (c) Peter Wemm * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _MACHINE_PCPU_H_ #define _MACHINE_PCPU_H_ #ifndef _SYS_CDEFS_H_ #error "sys/cdefs.h is a prerequisite for this file" #endif /* * The SMP parts are setup in pmap.c and locore.s for the BSP, and * mp_machdep.c sets up the data for the AP's to "see" when they awake. * The reason for doing it via a struct is so that an array of pointers * to each CPU's data can be set up for things like "check curproc on all * other processors" */ #define PCPU_MD_FIELDS \ char pc_monitorbuf[128] __aligned(128); /* cache line */ \ struct pcpu *pc_prvspace; /* Self-reference */ \ struct pmap *pc_curpmap; \ struct amd64tss *pc_tssp; /* TSS segment active on CPU */ \ struct amd64tss *pc_commontssp;/* Common TSS for the CPU */ \ register_t pc_rsp0; \ register_t pc_scratch_rsp; /* User %rsp in syscall */ \ u_int pc_apic_id; \ u_int pc_acpi_id; /* ACPI CPU id */ \ /* Pointer to the CPU %fs descriptor */ \ struct user_segment_descriptor *pc_fs32p; \ /* Pointer to the CPU %gs descriptor */ \ struct user_segment_descriptor *pc_gs32p; \ /* Pointer to the CPU LDT descriptor */ \ struct system_segment_descriptor *pc_ldt; \ /* Pointer to the CPU TSS descriptor */ \ struct system_segment_descriptor *pc_tss; \ uint64_t pc_pm_save_cnt; \ u_int pc_cmci_mask; /* MCx banks for CMCI */ \ uint64_t pc_dbreg[16]; /* ddb debugging regs */ \ int pc_dbreg_cmd; /* ddb debugging reg cmd */ \ u_int pc_vcpu_id; /* Xen vCPU ID */ \ - char __pad[157] /* be divisor of PAGE_SIZE \ + uint32_t pc_pcid_next; \ + uint32_t pc_pcid_gen; \ + char __pad[149] /* be divisor of PAGE_SIZE \ after cache alignment */ #define PC_DBREG_CMD_NONE 0 #define PC_DBREG_CMD_LOAD 1 #ifdef _KERNEL #ifdef lint extern struct pcpu *pcpup; #define PCPU_GET(member) (pcpup->pc_ ## member) #define PCPU_ADD(member, val) (pcpup->pc_ ## member += (val)) #define PCPU_INC(member) PCPU_ADD(member, 1) #define PCPU_PTR(member) (&pcpup->pc_ ## member) #define PCPU_SET(member, val) (pcpup->pc_ ## member = (val)) #elif defined(__GNUCLIKE_ASM) && defined(__GNUCLIKE___TYPEOF) /* * Evaluates to the byte offset of the per-cpu variable name. */ #define __pcpu_offset(name) \ __offsetof(struct pcpu, name) /* * Evaluates to the type of the per-cpu variable name. */ #define __pcpu_type(name) \ __typeof(((struct pcpu *)0)->name) /* * Evaluates to the address of the per-cpu variable name. */ #define __PCPU_PTR(name) __extension__ ({ \ __pcpu_type(name) *__p; \ \ __asm __volatile("movq %%gs:%1,%0; addq %2,%0" \ : "=r" (__p) \ : "m" (*(struct pcpu *)(__pcpu_offset(pc_prvspace))), \ "i" (__pcpu_offset(name))); \ \ __p; \ }) /* * Evaluates to the value of the per-cpu variable name. */ #define __PCPU_GET(name) __extension__ ({ \ __pcpu_type(name) __res; \ struct __s { \ u_char __b[MIN(sizeof(__pcpu_type(name)), 8)]; \ } __s; \ \ if (sizeof(__res) == 1 || sizeof(__res) == 2 || \ sizeof(__res) == 4 || sizeof(__res) == 8) { \ __asm __volatile("mov %%gs:%1,%0" \ : "=r" (__s) \ : "m" (*(struct __s *)(__pcpu_offset(name)))); \ *(struct __s *)(void *)&__res = __s; \ } else { \ __res = *__PCPU_PTR(name); \ } \ __res; \ }) /* * Adds the value to the per-cpu counter name. The implementation * must be atomic with respect to interrupts. */ #define __PCPU_ADD(name, val) do { \ __pcpu_type(name) __val; \ struct __s { \ u_char __b[MIN(sizeof(__pcpu_type(name)), 8)]; \ } __s; \ \ __val = (val); \ if (sizeof(__val) == 1 || sizeof(__val) == 2 || \ sizeof(__val) == 4 || sizeof(__val) == 8) { \ __s = *(struct __s *)(void *)&__val; \ __asm __volatile("add %1,%%gs:%0" \ : "=m" (*(struct __s *)(__pcpu_offset(name))) \ : "r" (__s)); \ } else \ *__PCPU_PTR(name) += __val; \ } while (0) /* * Increments the value of the per-cpu counter name. The implementation * must be atomic with respect to interrupts. */ #define __PCPU_INC(name) do { \ CTASSERT(sizeof(__pcpu_type(name)) == 1 || \ sizeof(__pcpu_type(name)) == 2 || \ sizeof(__pcpu_type(name)) == 4 || \ sizeof(__pcpu_type(name)) == 8); \ if (sizeof(__pcpu_type(name)) == 1) { \ __asm __volatile("incb %%gs:%0" \ : "=m" (*(__pcpu_type(name) *)(__pcpu_offset(name)))\ : "m" (*(__pcpu_type(name) *)(__pcpu_offset(name))));\ } else if (sizeof(__pcpu_type(name)) == 2) { \ __asm __volatile("incw %%gs:%0" \ : "=m" (*(__pcpu_type(name) *)(__pcpu_offset(name)))\ : "m" (*(__pcpu_type(name) *)(__pcpu_offset(name))));\ } else if (sizeof(__pcpu_type(name)) == 4) { \ __asm __volatile("incl %%gs:%0" \ : "=m" (*(__pcpu_type(name) *)(__pcpu_offset(name)))\ : "m" (*(__pcpu_type(name) *)(__pcpu_offset(name))));\ } else if (sizeof(__pcpu_type(name)) == 8) { \ __asm __volatile("incq %%gs:%0" \ : "=m" (*(__pcpu_type(name) *)(__pcpu_offset(name)))\ : "m" (*(__pcpu_type(name) *)(__pcpu_offset(name))));\ } \ } while (0) /* * Sets the value of the per-cpu variable name to value val. */ #define __PCPU_SET(name, val) { \ __pcpu_type(name) __val; \ struct __s { \ u_char __b[MIN(sizeof(__pcpu_type(name)), 8)]; \ } __s; \ \ __val = (val); \ if (sizeof(__val) == 1 || sizeof(__val) == 2 || \ sizeof(__val) == 4 || sizeof(__val) == 8) { \ __s = *(struct __s *)(void *)&__val; \ __asm __volatile("mov %1,%%gs:%0" \ : "=m" (*(struct __s *)(__pcpu_offset(name))) \ : "r" (__s)); \ } else { \ *__PCPU_PTR(name) = __val; \ } \ } #define PCPU_GET(member) __PCPU_GET(pc_ ## member) #define PCPU_ADD(member, val) __PCPU_ADD(pc_ ## member, val) #define PCPU_INC(member) __PCPU_INC(pc_ ## member) #define PCPU_PTR(member) __PCPU_PTR(pc_ ## member) #define PCPU_SET(member, val) __PCPU_SET(pc_ ## member, val) #define OFFSETOF_CURTHREAD 0 #ifdef __clang__ #pragma clang diagnostic push #pragma clang diagnostic ignored "-Wnull-dereference" #endif static __inline __pure2 struct thread * __curthread(void) { struct thread *td; __asm("movq %%gs:%1,%0" : "=r" (td) : "m" (*(char *)OFFSETOF_CURTHREAD)); return (td); } #ifdef __clang__ #pragma clang diagnostic pop #endif #define curthread (__curthread()) #define OFFSETOF_CURPCB 32 static __inline __pure2 struct pcb * __curpcb(void) { struct pcb *pcb; __asm("movq %%gs:%1,%0" : "=r" (pcb) : "m" (*(char *)OFFSETOF_CURPCB)); return (pcb); } #define curpcb (__curpcb()) #define IS_BSP() (PCPU_GET(cpuid) == 0) #else /* !lint || defined(__GNUCLIKE_ASM) && defined(__GNUCLIKE___TYPEOF) */ #error "this file needs to be ported to your compiler" #endif /* lint, etc. */ #endif /* _KERNEL */ #endif /* !_MACHINE_PCPU_H_ */ Index: projects/release-arm-redux/sys/amd64/include/pmap.h =================================================================== --- projects/release-arm-redux/sys/amd64/include/pmap.h (revision 282691) +++ projects/release-arm-redux/sys/amd64/include/pmap.h (revision 282692) @@ -1,406 +1,417 @@ /*- * Copyright (c) 2003 Peter Wemm. * Copyright (c) 1991 Regents of the University of California. * All rights reserved. * * This code is derived from software contributed to Berkeley by * the Systems Programming Group of the University of Utah Computer * Science Department and William Jolitz of UUNET Technologies Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * Derived from hp300 version by Mike Hibler, this version by William * Jolitz uses a recursive map [a pde points to the page directory] to * map the page tables using the pagetables themselves. This is done to * reduce the impact on kernel virtual memory for lots of sparse address * space, and to reduce the cost of memory to each process. * * from: hp300: @(#)pmap.h 7.2 (Berkeley) 12/16/90 * from: @(#)pmap.h 7.4 (Berkeley) 5/12/91 * $FreeBSD$ */ #ifndef _MACHINE_PMAP_H_ #define _MACHINE_PMAP_H_ /* * Page-directory and page-table entries follow this format, with a few * of the fields not present here and there, depending on a lot of things. */ /* ---- Intel Nomenclature ---- */ #define X86_PG_V 0x001 /* P Valid */ #define X86_PG_RW 0x002 /* R/W Read/Write */ #define X86_PG_U 0x004 /* U/S User/Supervisor */ #define X86_PG_NC_PWT 0x008 /* PWT Write through */ #define X86_PG_NC_PCD 0x010 /* PCD Cache disable */ #define X86_PG_A 0x020 /* A Accessed */ #define X86_PG_M 0x040 /* D Dirty */ #define X86_PG_PS 0x080 /* PS Page size (0=4k,1=2M) */ #define X86_PG_PTE_PAT 0x080 /* PAT PAT index */ #define X86_PG_G 0x100 /* G Global */ #define X86_PG_AVAIL1 0x200 /* / Available for system */ #define X86_PG_AVAIL2 0x400 /* < programmers use */ #define X86_PG_AVAIL3 0x800 /* \ */ #define X86_PG_PDE_PAT 0x1000 /* PAT PAT index */ #define X86_PG_NX (1ul<<63) /* No-execute */ #define X86_PG_AVAIL(x) (1ul << (x)) /* Page level cache control fields used to determine the PAT type */ #define X86_PG_PDE_CACHE (X86_PG_PDE_PAT | X86_PG_NC_PWT | X86_PG_NC_PCD) #define X86_PG_PTE_CACHE (X86_PG_PTE_PAT | X86_PG_NC_PWT | X86_PG_NC_PCD) /* * Intel extended page table (EPT) bit definitions. */ #define EPT_PG_READ 0x001 /* R Read */ #define EPT_PG_WRITE 0x002 /* W Write */ #define EPT_PG_EXECUTE 0x004 /* X Execute */ #define EPT_PG_IGNORE_PAT 0x040 /* IPAT Ignore PAT */ #define EPT_PG_PS 0x080 /* PS Page size */ #define EPT_PG_A 0x100 /* A Accessed */ #define EPT_PG_M 0x200 /* D Dirty */ #define EPT_PG_MEMORY_TYPE(x) ((x) << 3) /* MT Memory Type */ /* * Define the PG_xx macros in terms of the bits on x86 PTEs. */ #define PG_V X86_PG_V #define PG_RW X86_PG_RW #define PG_U X86_PG_U #define PG_NC_PWT X86_PG_NC_PWT #define PG_NC_PCD X86_PG_NC_PCD #define PG_A X86_PG_A #define PG_M X86_PG_M #define PG_PS X86_PG_PS #define PG_PTE_PAT X86_PG_PTE_PAT #define PG_G X86_PG_G #define PG_AVAIL1 X86_PG_AVAIL1 #define PG_AVAIL2 X86_PG_AVAIL2 #define PG_AVAIL3 X86_PG_AVAIL3 #define PG_PDE_PAT X86_PG_PDE_PAT #define PG_NX X86_PG_NX #define PG_PDE_CACHE X86_PG_PDE_CACHE #define PG_PTE_CACHE X86_PG_PTE_CACHE /* Our various interpretations of the above */ #define PG_W X86_PG_AVAIL3 /* "Wired" pseudoflag */ #define PG_MANAGED X86_PG_AVAIL2 #define EPT_PG_EMUL_V X86_PG_AVAIL(52) #define EPT_PG_EMUL_RW X86_PG_AVAIL(53) #define PG_FRAME (0x000ffffffffff000ul) #define PG_PS_FRAME (0x000fffffffe00000ul) /* * Promotion to a 2MB (PDE) page mapping requires that the corresponding 4KB * (PTE) page mappings have identical settings for the following fields: */ #define PG_PTE_PROMOTE (PG_NX | PG_MANAGED | PG_W | PG_G | PG_PTE_CACHE | \ PG_M | PG_A | PG_U | PG_RW | PG_V) /* * Page Protection Exception bits */ #define PGEX_P 0x01 /* Protection violation vs. not present */ #define PGEX_W 0x02 /* during a Write cycle */ #define PGEX_U 0x04 /* access from User mode (UPL) */ #define PGEX_RSV 0x08 /* reserved PTE field is non-zero */ #define PGEX_I 0x10 /* during an instruction fetch */ /* * undef the PG_xx macros that define bits in the regular x86 PTEs that * have a different position in nested PTEs. This is done when compiling * code that needs to be aware of the differences between regular x86 and * nested PTEs. * * The appropriate bitmask will be calculated at runtime based on the pmap * type. */ #ifdef AMD64_NPT_AWARE #undef PG_AVAIL1 /* X86_PG_AVAIL1 aliases with EPT_PG_M */ #undef PG_G #undef PG_A #undef PG_M #undef PG_PDE_PAT #undef PG_PDE_CACHE #undef PG_PTE_PAT #undef PG_PTE_CACHE #undef PG_RW #undef PG_V #endif /* * Pte related macros. This is complicated by having to deal with * the sign extension of the 48th bit. */ #define KVADDR(l4, l3, l2, l1) ( \ ((unsigned long)-1 << 47) | \ ((unsigned long)(l4) << PML4SHIFT) | \ ((unsigned long)(l3) << PDPSHIFT) | \ ((unsigned long)(l2) << PDRSHIFT) | \ ((unsigned long)(l1) << PAGE_SHIFT)) #define UVADDR(l4, l3, l2, l1) ( \ ((unsigned long)(l4) << PML4SHIFT) | \ ((unsigned long)(l3) << PDPSHIFT) | \ ((unsigned long)(l2) << PDRSHIFT) | \ ((unsigned long)(l1) << PAGE_SHIFT)) /* * Number of kernel PML4 slots. Can be anywhere from 1 to 64 or so, * but setting it larger than NDMPML4E makes no sense. * * Each slot provides .5 TB of kernel virtual space. */ #define NKPML4E 4 #define NUPML4E (NPML4EPG/2) /* number of userland PML4 pages */ #define NUPDPE (NUPML4E*NPDPEPG)/* number of userland PDP pages */ #define NUPDE (NUPDPE*NPDEPG) /* number of userland PD entries */ /* * NDMPML4E is the maximum number of PML4 entries that will be * used to implement the direct map. It must be a power of two, * and should generally exceed NKPML4E. The maximum possible * value is 64; using 128 will make the direct map intrude into * the recursive page table map. */ #define NDMPML4E 8 /* * These values control the layout of virtual memory. The starting address * of the direct map, which is controlled by DMPML4I, must be a multiple of * its size. (See the PHYS_TO_DMAP() and DMAP_TO_PHYS() macros.) * * Note: KPML4I is the index of the (single) level 4 page that maps * the KVA that holds KERNBASE, while KPML4BASE is the index of the * first level 4 page that maps VM_MIN_KERNEL_ADDRESS. If NKPML4E * is 1, these are the same, otherwise KPML4BASE < KPML4I and extra * level 4 PDEs are needed to map from VM_MIN_KERNEL_ADDRESS up to * KERNBASE. * * (KPML4I combines with KPDPI to choose where KERNBASE starts. * Or, in other words, KPML4I provides bits 39..47 of KERNBASE, * and KPDPI provides bits 30..38.) */ #define PML4PML4I (NPML4EPG/2) /* Index of recursive pml4 mapping */ #define KPML4BASE (NPML4EPG-NKPML4E) /* KVM at highest addresses */ #define DMPML4I rounddown(KPML4BASE-NDMPML4E, NDMPML4E) /* Below KVM */ #define KPML4I (NPML4EPG-1) #define KPDPI (NPDPEPG-2) /* kernbase at -2GB */ /* * XXX doesn't really belong here I guess... */ #define ISA_HOLE_START 0xa0000 #define ISA_HOLE_LENGTH (0x100000-ISA_HOLE_START) +#define PMAP_PCID_NONE 0xffffffff +#define PMAP_PCID_KERN 0 +#define PMAP_PCID_OVERMAX 0x1000 + #ifndef LOCORE #include #include #include #include #include typedef u_int64_t pd_entry_t; typedef u_int64_t pt_entry_t; typedef u_int64_t pdp_entry_t; typedef u_int64_t pml4_entry_t; /* * Address of current address space page table maps and directories. */ #ifdef _KERNEL #define addr_PTmap (KVADDR(PML4PML4I, 0, 0, 0)) #define addr_PDmap (KVADDR(PML4PML4I, PML4PML4I, 0, 0)) #define addr_PDPmap (KVADDR(PML4PML4I, PML4PML4I, PML4PML4I, 0)) #define addr_PML4map (KVADDR(PML4PML4I, PML4PML4I, PML4PML4I, PML4PML4I)) #define addr_PML4pml4e (addr_PML4map + (PML4PML4I * sizeof(pml4_entry_t))) #define PTmap ((pt_entry_t *)(addr_PTmap)) #define PDmap ((pd_entry_t *)(addr_PDmap)) #define PDPmap ((pd_entry_t *)(addr_PDPmap)) #define PML4map ((pd_entry_t *)(addr_PML4map)) #define PML4pml4e ((pd_entry_t *)(addr_PML4pml4e)) extern int nkpt; /* Initial number of kernel page tables */ extern u_int64_t KPDPphys; /* physical address of kernel level 3 */ extern u_int64_t KPML4phys; /* physical address of kernel level 4 */ /* * virtual address to page table entry and * to physical address. * Note: these work recursively, thus vtopte of a pte will give * the corresponding pde that in turn maps it. */ pt_entry_t *vtopte(vm_offset_t); #define vtophys(va) pmap_kextract(((vm_offset_t) (va))) #define pte_load_store(ptep, pte) atomic_swap_long(ptep, pte) #define pte_load_clear(ptep) atomic_swap_long(ptep, 0) #define pte_store(ptep, pte) do { \ *(u_long *)(ptep) = (u_long)(pte); \ } while (0) #define pte_clear(ptep) pte_store(ptep, 0) #define pde_store(pdep, pde) pte_store(pdep, pde) extern pt_entry_t pg_nx; #endif /* _KERNEL */ /* * Pmap stuff */ struct pv_entry; struct pv_chunk; struct md_page { TAILQ_HEAD(,pv_entry) pv_list; int pv_gen; int pat_mode; }; enum pmap_type { PT_X86, /* regular x86 page tables */ PT_EPT, /* Intel's nested page tables */ PT_RVI, /* AMD's nested page tables */ }; +struct pmap_pcids { + uint32_t pm_pcid; + uint32_t pm_gen; +}; + /* * The kernel virtual address (KVA) of the level 4 page table page is always * within the direct map (DMAP) region. */ struct pmap { struct mtx pm_mtx; pml4_entry_t *pm_pml4; /* KVA of level 4 page table */ uint64_t pm_cr3; TAILQ_HEAD(,pv_chunk) pm_pvchunk; /* list of mappings in pmap */ cpuset_t pm_active; /* active on cpus */ - cpuset_t pm_save; /* Context valid on cpus mask */ - int pm_pcid; /* context id */ enum pmap_type pm_type; /* regular or nested tables */ struct pmap_statistics pm_stats; /* pmap statistics */ struct vm_radix pm_root; /* spare page table pages */ long pm_eptgen; /* EPT pmap generation id */ int pm_flags; + struct pmap_pcids pm_pcids[MAXCPU]; }; /* flags */ #define PMAP_NESTED_IPIMASK 0xff #define PMAP_PDE_SUPERPAGE (1 << 8) /* supports 2MB superpages */ #define PMAP_EMULATE_AD_BITS (1 << 9) /* needs A/D bits emulation */ #define PMAP_SUPPORTS_EXEC_ONLY (1 << 10) /* execute only mappings ok */ typedef struct pmap *pmap_t; #ifdef _KERNEL extern struct pmap kernel_pmap_store; #define kernel_pmap (&kernel_pmap_store) #define PMAP_LOCK(pmap) mtx_lock(&(pmap)->pm_mtx) #define PMAP_LOCK_ASSERT(pmap, type) \ mtx_assert(&(pmap)->pm_mtx, (type)) #define PMAP_LOCK_DESTROY(pmap) mtx_destroy(&(pmap)->pm_mtx) #define PMAP_LOCK_INIT(pmap) mtx_init(&(pmap)->pm_mtx, "pmap", \ NULL, MTX_DEF | MTX_DUPOK) #define PMAP_LOCKED(pmap) mtx_owned(&(pmap)->pm_mtx) #define PMAP_MTX(pmap) (&(pmap)->pm_mtx) #define PMAP_TRYLOCK(pmap) mtx_trylock(&(pmap)->pm_mtx) #define PMAP_UNLOCK(pmap) mtx_unlock(&(pmap)->pm_mtx) int pmap_pinit_type(pmap_t pmap, enum pmap_type pm_type, int flags); int pmap_emulate_accessed_dirty(pmap_t pmap, vm_offset_t va, int ftype); #endif /* * For each vm_page_t, there is a list of all currently valid virtual * mappings of that page. An entry is a pv_entry_t, the list is pv_list. */ typedef struct pv_entry { vm_offset_t pv_va; /* virtual address for mapping */ TAILQ_ENTRY(pv_entry) pv_next; } *pv_entry_t; /* * pv_entries are allocated in chunks per-process. This avoids the * need to track per-pmap assignments. */ #define _NPCM 3 #define _NPCPV 168 struct pv_chunk { pmap_t pc_pmap; TAILQ_ENTRY(pv_chunk) pc_list; uint64_t pc_map[_NPCM]; /* bitmap; 1 = free */ TAILQ_ENTRY(pv_chunk) pc_lru; struct pv_entry pc_pventry[_NPCPV]; }; #ifdef _KERNEL extern caddr_t CADDR1; extern pt_entry_t *CMAP1; extern vm_paddr_t phys_avail[]; extern vm_paddr_t dump_avail[]; extern vm_offset_t virtual_avail; extern vm_offset_t virtual_end; extern vm_paddr_t dmaplimit; #define pmap_page_get_memattr(m) ((vm_memattr_t)(m)->md.pat_mode) #define pmap_page_is_write_mapped(m) (((m)->aflags & PGA_WRITEABLE) != 0) #define pmap_unmapbios(va, sz) pmap_unmapdev((va), (sz)) +struct thread; + +void pmap_activate_sw(struct thread *); void pmap_bootstrap(vm_paddr_t *); int pmap_change_attr(vm_offset_t, vm_size_t, int); void pmap_demote_DMAP(vm_paddr_t base, vm_size_t len, boolean_t invalidate); void pmap_init_pat(void); void pmap_kenter(vm_offset_t va, vm_paddr_t pa); void *pmap_kenter_temporary(vm_paddr_t pa, int i); vm_paddr_t pmap_kextract(vm_offset_t); void pmap_kremove(vm_offset_t); void *pmap_mapbios(vm_paddr_t, vm_size_t); void *pmap_mapdev(vm_paddr_t, vm_size_t); void *pmap_mapdev_attr(vm_paddr_t, vm_size_t, int); boolean_t pmap_page_is_mapped(vm_page_t m); void pmap_page_set_memattr(vm_page_t m, vm_memattr_t ma); void pmap_unmapdev(vm_offset_t, vm_size_t); void pmap_invalidate_page(pmap_t, vm_offset_t); void pmap_invalidate_range(pmap_t, vm_offset_t, vm_offset_t); void pmap_invalidate_all(pmap_t); void pmap_invalidate_cache(void); void pmap_invalidate_cache_pages(vm_page_t *pages, int count); void pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva, boolean_t force); void pmap_get_mapping(pmap_t pmap, vm_offset_t va, uint64_t *ptr, int *num); boolean_t pmap_map_io_transient(vm_page_t *, vm_offset_t *, int, boolean_t); void pmap_unmap_io_transient(vm_page_t *, vm_offset_t *, int, boolean_t); #endif /* _KERNEL */ #endif /* !LOCORE */ #endif /* !_MACHINE_PMAP_H_ */ Index: projects/release-arm-redux/sys/amd64/include/smp.h =================================================================== --- projects/release-arm-redux/sys/amd64/include/smp.h (revision 282691) +++ projects/release-arm-redux/sys/amd64/include/smp.h (revision 282692) @@ -1,134 +1,127 @@ /*- * ---------------------------------------------------------------------------- * "THE BEER-WARE LICENSE" (Revision 42): * wrote this file. As long as you retain this notice you * can do whatever you want with this stuff. If we meet some day, and you think * this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp * ---------------------------------------------------------------------------- * * $FreeBSD$ * */ #ifndef _MACHINE_SMP_H_ #define _MACHINE_SMP_H_ #ifdef _KERNEL #ifdef SMP #ifndef LOCORE #include #include #include #include #include /* global symbols in mpboot.S */ extern char mptramp_start[]; extern char mptramp_end[]; extern u_int32_t mptramp_pagetables; /* global data in mp_machdep.c */ extern int mp_naps; extern int boot_cpu_id; extern struct pcb stoppcbs[]; extern int cpu_apic_ids[]; extern int bootAP; extern void *dpcpu; extern char *bootSTK; extern int bootAP; extern void *bootstacks[]; extern volatile u_int cpu_ipi_pending[]; extern volatile int aps_ready; extern struct mtx ap_boot_mtx; extern int cpu_logical; extern int cpu_cores; extern int pmap_pcid_enabled; +extern int invpcid_works; extern u_int xhits_gbl[]; extern u_int xhits_pg[]; extern u_int xhits_rng[]; extern u_int ipi_global; extern u_int ipi_page; extern u_int ipi_range; extern u_int ipi_range_size; -extern u_int ipi_masked_global; -extern u_int ipi_masked_page; -extern u_int ipi_masked_range; -extern u_int ipi_masked_range_size; extern volatile int smp_tlb_wait; struct cpu_info { int cpu_present:1; int cpu_bsp:1; int cpu_disabled:1; int cpu_hyperthread:1; }; extern struct cpu_info cpu_info[]; #ifdef COUNT_IPIS extern u_long *ipi_invltlb_counts[MAXCPU]; extern u_long *ipi_invlrng_counts[MAXCPU]; extern u_long *ipi_invlpg_counts[MAXCPU]; extern u_long *ipi_invlcache_counts[MAXCPU]; extern u_long *ipi_rendezvous_counts[MAXCPU]; #endif /* IPI handlers */ inthand_t - IDTVEC(invltlb_pcid), /* TLB shootdowns - global, pcid enabled */ IDTVEC(invltlb), /* TLB shootdowns - global */ - IDTVEC(invlpg_pcid), /* TLB shootdowns - 1 page, pcid enabled */ + IDTVEC(invltlb_pcid), /* TLB shootdowns - global, pcid */ + IDTVEC(invltlb_invpcid),/* TLB shootdowns - global, invpcid */ IDTVEC(invlpg), /* TLB shootdowns - 1 page */ IDTVEC(invlrng), /* TLB shootdowns - page range */ IDTVEC(invlcache), /* Write back and invalidate cache */ IDTVEC(ipi_intr_bitmap_handler), /* Bitmap based IPIs */ IDTVEC(cpustop), /* CPU stops & waits to be restarted */ IDTVEC(cpususpend), /* CPU suspends & waits to be resumed */ IDTVEC(justreturn), /* interrupt CPU with minimum overhead */ IDTVEC(rendezvous); /* handle CPU rendezvous */ struct pmap; /* functions in mp_machdep.c */ void assign_cpu_ids(void); void cpu_add(u_int apic_id, char boot_cpu); void cpustop_handler(void); void cpususpend_handler(void); void init_secondary_tail(void); void invltlb_handler(void); void invltlb_pcid_handler(void); +void invltlb_invpcid_handler(void); void invlpg_handler(void); -void invlpg_pcid_handler(void); void invlrng_handler(void); void invlcache_handler(void); void init_secondary(void); void ipi_startup(int apic_id, int vector); void ipi_all_but_self(u_int ipi); void ipi_bitmap_handler(struct trapframe frame); void ipi_cpu(int cpu, u_int ipi); int ipi_nmi_handler(void); void ipi_selected(cpuset_t cpus, u_int ipi); u_int mp_bootaddress(u_int); void set_interrupt_apic_ids(void); void smp_cache_flush(void); -void smp_invlpg(struct pmap *pmap, vm_offset_t addr); -void smp_masked_invlpg(cpuset_t mask, struct pmap *pmap, vm_offset_t addr); -void smp_invlpg_range(struct pmap *pmap, vm_offset_t startva, +void smp_masked_invlpg(cpuset_t mask, vm_offset_t addr); +void smp_masked_invlpg_range(cpuset_t mask, vm_offset_t startva, vm_offset_t endva); -void smp_masked_invlpg_range(cpuset_t mask, struct pmap *pmap, - vm_offset_t startva, vm_offset_t endva); -void smp_invltlb(struct pmap *pmap); void smp_masked_invltlb(cpuset_t mask, struct pmap *pmap); int native_start_all_aps(void); void mem_range_AP_init(void); void topo_probe(void); void ipi_send_cpu(int cpu, u_int ipi); #endif /* !LOCORE */ #endif /* SMP */ #endif /* _KERNEL */ #endif /* _MACHINE_SMP_H_ */ Index: projects/release-arm-redux/sys/arm/ti/ti_i2c.c =================================================================== --- projects/release-arm-redux/sys/arm/ti/ti_i2c.c (revision 282691) +++ projects/release-arm-redux/sys/arm/ti/ti_i2c.c (revision 282692) @@ -1,979 +1,990 @@ /*- * Copyright (c) 2011 Ben Gray . * Copyright (c) 2014 Luiz Otavio O Souza . * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /** * Driver for the I2C module on the TI SoC. * * This driver is heavily based on the TWI driver for the AT91 (at91_twi.c). * * CAUTION: The I2Ci registers are limited to 16 bit and 8 bit data accesses, * 32 bit data access is not allowed and can corrupt register content. * * This driver currently doesn't use DMA for the transfer, although I hope to * incorporate that sometime in the future. The idea being that for transaction * larger than a certain size the DMA engine is used, for anything less the * normal interrupt/fifo driven option is used. * * * WARNING: This driver uses mtx_sleep and interrupts to perform transactions, * which means you can't do a transaction during startup before the interrupts * have been enabled. Hint - the freebsd function config_intrhook_establish(). */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "iicbus_if.h" /** * I2C device driver context, a pointer to this is stored in the device * driver structure. */ struct ti_i2c_softc { device_t sc_dev; uint32_t device_id; struct resource* sc_irq_res; struct resource* sc_mem_res; device_t sc_iicbus; void* sc_irq_h; struct mtx sc_mtx; struct iic_msg* sc_buffer; int sc_bus_inuse; int sc_buffer_pos; int sc_error; int sc_fifo_trsh; int sc_timeout; uint16_t sc_con_reg; uint16_t sc_rev; }; struct ti_i2c_clock_config { u_int frequency; /* Bus frequency in Hz */ uint8_t psc; /* Fast/Standard mode prescale divider */ uint8_t scll; /* Fast/Standard mode SCL low time */ uint8_t sclh; /* Fast/Standard mode SCL high time */ uint8_t hsscll; /* High Speed mode SCL low time */ uint8_t hssclh; /* High Speed mode SCL high time */ }; #if defined(SOC_OMAP4) /* * OMAP4 i2c bus clock is 96MHz / ((psc + 1) * (scll + 7 + sclh + 5)). * The prescaler values for 100KHz and 400KHz modes come from the table in the * OMAP4 TRM. The table doesn't list 1MHz; these values should give that speed. */ static struct ti_i2c_clock_config ti_omap4_i2c_clock_configs[] = { { 100000, 23, 13, 15, 0, 0}, { 400000, 9, 5, 7, 0, 0}, { 1000000, 3, 5, 7, 0, 0}, /* { 3200000, 1, 113, 115, 7, 10}, - HS mode */ { 0 /* Table terminator */ } }; #endif #if defined(SOC_TI_AM335X) /* * AM335x i2c bus clock is 48MHZ / ((psc + 1) * (scll + 7 + sclh + 5)) * In all cases we prescale the clock to 24MHz as recommended in the manual. */ static struct ti_i2c_clock_config ti_am335x_i2c_clock_configs[] = { { 100000, 1, 111, 117, 0, 0}, { 400000, 1, 23, 25, 0, 0}, { 1000000, 1, 5, 7, 0, 0}, { 0 /* Table terminator */ } }; #endif /** * Locking macros used throughout the driver */ #define TI_I2C_LOCK(_sc) mtx_lock(&(_sc)->sc_mtx) #define TI_I2C_UNLOCK(_sc) mtx_unlock(&(_sc)->sc_mtx) #define TI_I2C_LOCK_INIT(_sc) \ mtx_init(&_sc->sc_mtx, device_get_nameunit(_sc->sc_dev), \ "ti_i2c", MTX_DEF) #define TI_I2C_LOCK_DESTROY(_sc) mtx_destroy(&_sc->sc_mtx) #define TI_I2C_ASSERT_LOCKED(_sc) mtx_assert(&_sc->sc_mtx, MA_OWNED) #define TI_I2C_ASSERT_UNLOCKED(_sc) mtx_assert(&_sc->sc_mtx, MA_NOTOWNED) #ifdef DEBUG #define ti_i2c_dbg(_sc, fmt, args...) \ device_printf((_sc)->sc_dev, fmt, ##args) #else #define ti_i2c_dbg(_sc, fmt, args...) #endif /** * ti_i2c_read_2 - reads a 16-bit value from one of the I2C registers * @sc: I2C device context * @off: the byte offset within the register bank to read from. * * * LOCKING: * No locking required * * RETURNS: * 16-bit value read from the register. */ static inline uint16_t ti_i2c_read_2(struct ti_i2c_softc *sc, bus_size_t off) { return (bus_read_2(sc->sc_mem_res, off)); } /** * ti_i2c_write_2 - writes a 16-bit value to one of the I2C registers * @sc: I2C device context * @off: the byte offset within the register bank to read from. * @val: the value to write into the register * * LOCKING: * No locking required * * RETURNS: * 16-bit value read from the register. */ static inline void ti_i2c_write_2(struct ti_i2c_softc *sc, bus_size_t off, uint16_t val) { bus_write_2(sc->sc_mem_res, off, val); } static int ti_i2c_transfer_intr(struct ti_i2c_softc* sc, uint16_t status) { int amount, done, i; done = 0; amount = 0; /* Check for the error conditions. */ if (status & I2C_STAT_NACK) { /* No ACK from slave. */ ti_i2c_dbg(sc, "NACK\n"); ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_NACK); sc->sc_error = ENXIO; } else if (status & I2C_STAT_AL) { /* Arbitration lost. */ ti_i2c_dbg(sc, "Arbitration lost\n"); ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_AL); sc->sc_error = ENXIO; } /* Check if we have finished. */ if (status & I2C_STAT_ARDY) { /* Register access ready - transaction complete basically. */ ti_i2c_dbg(sc, "ARDY transaction complete\n"); if (sc->sc_error != 0 && sc->sc_buffer->flags & IIC_M_NOSTOP) { ti_i2c_write_2(sc, I2C_REG_CON, sc->sc_con_reg | I2C_CON_STP); } ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_ARDY | I2C_STAT_RDR | I2C_STAT_RRDY | I2C_STAT_XDR | I2C_STAT_XRDY); return (1); } if (sc->sc_buffer->flags & IIC_M_RD) { /* Read some data. */ if (status & I2C_STAT_RDR) { /* * Receive draining interrupt - last data received. * The set FIFO threshold wont be reached to trigger * RRDY. */ ti_i2c_dbg(sc, "Receive draining interrupt\n"); /* * Drain the FIFO. Read the pending data in the FIFO. */ amount = sc->sc_buffer->len - sc->sc_buffer_pos; } else if (status & I2C_STAT_RRDY) { /* * Receive data ready interrupt - FIFO has reached the * set threshold. */ ti_i2c_dbg(sc, "Receive data ready interrupt\n"); amount = min(sc->sc_fifo_trsh, sc->sc_buffer->len - sc->sc_buffer_pos); } /* Read the bytes from the fifo. */ for (i = 0; i < amount; i++) sc->sc_buffer->buf[sc->sc_buffer_pos++] = (uint8_t)(ti_i2c_read_2(sc, I2C_REG_DATA) & 0xff); if (status & I2C_STAT_RDR) ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_RDR); if (status & I2C_STAT_RRDY) ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_RRDY); } else { /* Write some data. */ if (status & I2C_STAT_XDR) { /* * Transmit draining interrupt - FIFO level is below * the set threshold and the amount of data still to * be transferred wont reach the set FIFO threshold. */ ti_i2c_dbg(sc, "Transmit draining interrupt\n"); /* * Drain the TX data. Write the pending data in the * FIFO. */ amount = sc->sc_buffer->len - sc->sc_buffer_pos; } else if (status & I2C_STAT_XRDY) { /* * Transmit data ready interrupt - the FIFO level * is below the set threshold. */ ti_i2c_dbg(sc, "Transmit data ready interrupt\n"); amount = min(sc->sc_fifo_trsh, sc->sc_buffer->len - sc->sc_buffer_pos); } /* Write the bytes from the fifo. */ for (i = 0; i < amount; i++) ti_i2c_write_2(sc, I2C_REG_DATA, sc->sc_buffer->buf[sc->sc_buffer_pos++]); if (status & I2C_STAT_XDR) ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_XDR); if (status & I2C_STAT_XRDY) ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_XRDY); } return (done); } /** * ti_i2c_intr - interrupt handler for the I2C module * @dev: i2c device handle * * * * LOCKING: * Called from timer context * * RETURNS: * EH_HANDLED or EH_NOT_HANDLED */ static void ti_i2c_intr(void *arg) { int done; struct ti_i2c_softc *sc; uint16_t events, status; sc = (struct ti_i2c_softc *)arg; TI_I2C_LOCK(sc); status = ti_i2c_read_2(sc, I2C_REG_STATUS); if (status == 0) { TI_I2C_UNLOCK(sc); return; } /* Save enabled interrupts. */ events = ti_i2c_read_2(sc, I2C_REG_IRQENABLE_SET); /* We only care about enabled interrupts. */ status &= events; done = 0; if (sc->sc_buffer != NULL) done = ti_i2c_transfer_intr(sc, status); else { ti_i2c_dbg(sc, "Transfer interrupt without buffer\n"); sc->sc_error = EINVAL; done = 1; } if (done) /* Wakeup the process that started the transaction. */ wakeup(sc); TI_I2C_UNLOCK(sc); } /** * ti_i2c_transfer - called to perform the transfer * @dev: i2c device handle * @msgs: the messages to send/receive * @nmsgs: the number of messages in the msgs array * * * LOCKING: * Internally locked * * RETURNS: * 0 on function succeeded * EINVAL if invalid message is passed as an arg */ static int ti_i2c_transfer(device_t dev, struct iic_msg *msgs, uint32_t nmsgs) { int err, i, repstart, timeout; struct ti_i2c_softc *sc; uint16_t reg; sc = device_get_softc(dev); TI_I2C_LOCK(sc); /* If the controller is busy wait until it is available. */ while (sc->sc_bus_inuse == 1) mtx_sleep(sc, &sc->sc_mtx, 0, "i2cbuswait", 0); /* Now we have control over the I2C controller. */ sc->sc_bus_inuse = 1; err = 0; repstart = 0; for (i = 0; i < nmsgs; i++) { sc->sc_buffer = &msgs[i]; sc->sc_buffer_pos = 0; sc->sc_error = 0; /* Zero byte transfers aren't allowed. */ if (sc->sc_buffer == NULL || sc->sc_buffer->buf == NULL || sc->sc_buffer->len == 0) { err = EINVAL; break; } /* Check if the i2c bus is free. */ if (repstart == 0) { /* * On repeated start we send the START condition while * the bus _is_ busy. */ timeout = 0; while (ti_i2c_read_2(sc, I2C_REG_STATUS_RAW) & I2C_STAT_BB) { if (timeout++ > 100) { err = EBUSY; goto out; } DELAY(1000); } timeout = 0; } else repstart = 0; if (sc->sc_buffer->flags & IIC_M_NOSTOP) repstart = 1; /* Set the slave address. */ ti_i2c_write_2(sc, I2C_REG_SA, msgs[i].slave >> 1); /* Write the data length. */ ti_i2c_write_2(sc, I2C_REG_CNT, sc->sc_buffer->len); /* Clear the RX and the TX FIFO. */ reg = ti_i2c_read_2(sc, I2C_REG_BUF); reg |= I2C_BUF_RXFIFO_CLR | I2C_BUF_TXFIFO_CLR; ti_i2c_write_2(sc, I2C_REG_BUF, reg); reg = sc->sc_con_reg | I2C_CON_STT; if (repstart == 0) reg |= I2C_CON_STP; if ((sc->sc_buffer->flags & IIC_M_RD) == 0) reg |= I2C_CON_TRX; ti_i2c_write_2(sc, I2C_REG_CON, reg); /* Wait for an event. */ err = mtx_sleep(sc, &sc->sc_mtx, 0, "i2ciowait", sc->sc_timeout); if (err == 0) err = sc->sc_error; if (err) break; } out: if (timeout == 0) { while (ti_i2c_read_2(sc, I2C_REG_STATUS_RAW) & I2C_STAT_BB) { if (timeout++ > 100) break; DELAY(1000); } } /* Put the controller in master mode again. */ if ((ti_i2c_read_2(sc, I2C_REG_CON) & I2C_CON_MST) == 0) ti_i2c_write_2(sc, I2C_REG_CON, sc->sc_con_reg); sc->sc_buffer = NULL; sc->sc_bus_inuse = 0; /* Wake up the processes that are waiting for the bus. */ wakeup(sc); TI_I2C_UNLOCK(sc); return (err); } static int ti_i2c_reset(struct ti_i2c_softc *sc, u_char speed) { int timeout; struct ti_i2c_clock_config *clkcfg; u_int busfreq; uint16_t fifo_trsh, reg, scll, sclh; switch (ti_chip()) { #ifdef SOC_OMAP4 case CHIP_OMAP_4: clkcfg = ti_omap4_i2c_clock_configs; break; #endif #ifdef SOC_TI_AM335X case CHIP_AM335X: clkcfg = ti_am335x_i2c_clock_configs; break; #endif default: panic("Unknown Ti SoC, unable to reset the i2c"); } /* * If we haven't attached the bus yet, just init at the default slow * speed. This lets us get the hardware initialized enough to attach * the bus which is where the real speed configuration is handled. After * the bus is attached, get the configured speed from it. Search the * configuration table for the best speed we can do that doesn't exceed * the requested speed. */ if (sc->sc_iicbus == NULL) busfreq = 100000; else busfreq = IICBUS_GET_FREQUENCY(sc->sc_iicbus, speed); for (;;) { if (clkcfg[1].frequency == 0 || clkcfg[1].frequency > busfreq) break; clkcfg++; } /* * 23.1.4.3 - HS I2C Software Reset * From OMAP4 TRM at page 4068. * * 1. Ensure that the module is disabled. */ sc->sc_con_reg = 0; ti_i2c_write_2(sc, I2C_REG_CON, sc->sc_con_reg); /* 2. Issue a softreset to the controller. */ bus_write_2(sc->sc_mem_res, I2C_REG_SYSC, I2C_REG_SYSC_SRST); /* * 3. Enable the module. * The I2Ci.I2C_SYSS[0] RDONE bit is asserted only after the module * is enabled by setting the I2Ci.I2C_CON[15] I2C_EN bit to 1. */ ti_i2c_write_2(sc, I2C_REG_CON, I2C_CON_I2C_EN); /* 4. Wait for the software reset to complete. */ timeout = 0; while ((ti_i2c_read_2(sc, I2C_REG_SYSS) & I2C_SYSS_RDONE) == 0) { if (timeout++ > 100) return (EBUSY); DELAY(100); } /* * Disable the I2C controller once again, now that the reset has * finished. */ ti_i2c_write_2(sc, I2C_REG_CON, sc->sc_con_reg); /* * The following sequence is taken from the OMAP4 TRM at page 4077. * * 1. Enable the functional and interface clocks (see Section * 23.1.5.1.1.1.1). Done at ti_i2c_activate(). * * 2. Program the prescaler to obtain an approximately 12MHz internal * sampling clock (I2Ci_INTERNAL_CLK) by programming the * corresponding value in the I2Ci.I2C_PSC[3:0] PSC field. * This value depends on the frequency of the functional clock * (I2Ci_FCLK). Because this frequency is 96MHz, the * I2Ci.I2C_PSC[7:0] PSC field value is 0x7. */ ti_i2c_write_2(sc, I2C_REG_PSC, clkcfg->psc); /* * 3. Program the I2Ci.I2C_SCLL[7:0] SCLL and I2Ci.I2C_SCLH[7:0] SCLH * bit fields to obtain a bit rate of 100 Kbps, 400 Kbps or 1Mbps. * These values depend on the internal sampling clock frequency * (see Table 23-8). */ scll = clkcfg->scll & I2C_SCLL_MASK; sclh = clkcfg->sclh & I2C_SCLH_MASK; /* * 4. (Optional) Program the I2Ci.I2C_SCLL[15:8] HSSCLL and * I2Ci.I2C_SCLH[15:8] HSSCLH fields to obtain a bit rate of * 400K bps or 3.4M bps (for the second phase of HS mode). These * values depend on the internal sampling clock frequency (see * Table 23-8). * * 5. (Optional) If a bit rate of 3.4M bps is used and the bus line * capacitance exceeds 45 pF, (see Section 18.4.8, PAD Functional * Multiplexing and Configuration). */ switch (ti_chip()) { #ifdef SOC_OMAP4 case CHIP_OMAP_4: if ((clkcfg->hsscll + clkcfg->hssclh) > 0) { scll |= clkcfg->hsscll << I2C_HSSCLL_SHIFT; sclh |= clkcfg->hssclh << I2C_HSSCLH_SHIFT; sc->sc_con_reg |= I2C_CON_OPMODE_HS; } break; #endif } /* Write the selected bit rate. */ ti_i2c_write_2(sc, I2C_REG_SCLL, scll); ti_i2c_write_2(sc, I2C_REG_SCLH, sclh); /* * 6. Configure the Own Address of the I2C controller by storing it in * the I2Ci.I2C_OA0 register. Up to four Own Addresses can be * programmed in the I2Ci.I2C_OAi registers (where i = 0, 1, 2, 3) * for each I2C controller. * * Note: For a 10-bit address, set the corresponding expand Own Address * bit in the I2Ci.I2C_CON register. * * Driver currently always in single master mode so ignore this step. */ /* * 7. Set the TX threshold (in transmitter mode) and the RX threshold * (in receiver mode) by setting the I2Ci.I2C_BUF[5:0]XTRSH field to * (TX threshold - 1) and the I2Ci.I2C_BUF[13:8]RTRSH field to (RX * threshold - 1), where the TX and RX thresholds are greater than * or equal to 1. * * The threshold is set to 5 for now. */ fifo_trsh = (sc->sc_fifo_trsh - 1) & I2C_BUF_TRSH_MASK; reg = fifo_trsh | (fifo_trsh << I2C_BUF_RXTRSH_SHIFT); ti_i2c_write_2(sc, I2C_REG_BUF, reg); /* * 8. Take the I2C controller out of reset by setting the * I2Ci.I2C_CON[15] I2C_EN bit to 1. * * 23.1.5.1.1.1.2 - Initialize the I2C Controller * * To initialize the I2C controller, perform the following steps: * * 1. Configure the I2Ci.I2C_CON register: * . For master or slave mode, set the I2Ci.I2C_CON[10] MST bit * (0: slave, 1: master). * . For transmitter or receiver mode, set the I2Ci.I2C_CON[9] TRX * bit (0: receiver, 1: transmitter). */ /* Enable the I2C controller in master mode. */ sc->sc_con_reg |= I2C_CON_I2C_EN | I2C_CON_MST; ti_i2c_write_2(sc, I2C_REG_CON, sc->sc_con_reg); /* * 2. If using an interrupt to transmit/receive data, set the * corresponding bit in the I2Ci.I2C_IE register (the I2Ci.I2C_IE[4] * XRDY_IE bit for the transmit interrupt, the I2Ci.I2C_IE[3] RRDY * bit for the receive interrupt). */ /* Set the interrupts we want to be notified. */ reg = I2C_IE_XDR | /* Transmit draining interrupt. */ I2C_IE_XRDY | /* Transmit Data Ready interrupt. */ I2C_IE_RDR | /* Receive draining interrupt. */ I2C_IE_RRDY | /* Receive Data Ready interrupt. */ I2C_IE_ARDY | /* Register Access Ready interrupt. */ I2C_IE_NACK | /* No Acknowledgment interrupt. */ I2C_IE_AL; /* Arbitration lost interrupt. */ /* Enable the interrupts. */ ti_i2c_write_2(sc, I2C_REG_IRQENABLE_SET, reg); /* * 3. If using DMA to receive/transmit data, set to 1 the corresponding * bit in the I2Ci.I2C_BUF register (the I2Ci.I2C_BUF[15] RDMA_EN * bit for the receive DMA channel, the I2Ci.I2C_BUF[7] XDMA_EN bit * for the transmit DMA channel). * * Not using DMA for now, so ignore this. */ return (0); } static int ti_i2c_iicbus_reset(device_t dev, u_char speed, u_char addr, u_char *oldaddr) { struct ti_i2c_softc *sc; int err; sc = device_get_softc(dev); TI_I2C_LOCK(sc); err = ti_i2c_reset(sc, speed); TI_I2C_UNLOCK(sc); if (err) return (err); return (IIC_ENOADDR); } static int ti_i2c_activate(device_t dev) { clk_ident_t clk; int err; struct ti_i2c_softc *sc; sc = (struct ti_i2c_softc*)device_get_softc(dev); /* * 1. Enable the functional and interface clocks (see Section * 23.1.5.1.1.1.1). */ clk = I2C0_CLK + sc->device_id; err = ti_prcm_clk_enable(clk); if (err) return (err); return (ti_i2c_reset(sc, IIC_UNKNOWN)); } /** * ti_i2c_deactivate - deactivates the controller and releases resources * @dev: i2c device handle * * * * LOCKING: * Assumed called in an atomic context. * * RETURNS: * nothing */ static void ti_i2c_deactivate(device_t dev) { struct ti_i2c_softc *sc = device_get_softc(dev); clk_ident_t clk; /* Disable the controller - cancel all transactions. */ ti_i2c_write_2(sc, I2C_REG_IRQENABLE_CLR, 0xffff); ti_i2c_write_2(sc, I2C_REG_STATUS, 0xffff); ti_i2c_write_2(sc, I2C_REG_CON, 0); /* Release the interrupt handler. */ if (sc->sc_irq_h != NULL) { bus_teardown_intr(dev, sc->sc_irq_res, sc->sc_irq_h); sc->sc_irq_h = NULL; } bus_generic_detach(sc->sc_dev); /* Unmap the I2C controller registers. */ if (sc->sc_mem_res != NULL) { bus_release_resource(dev, SYS_RES_MEMORY, 0, sc->sc_mem_res); sc->sc_mem_res = NULL; } /* Release the IRQ resource. */ if (sc->sc_irq_res != NULL) { bus_release_resource(dev, SYS_RES_IRQ, 0, sc->sc_irq_res); sc->sc_irq_res = NULL; } /* Finally disable the functional and interface clocks. */ clk = I2C0_CLK + sc->device_id; ti_prcm_clk_disable(clk); } static int ti_i2c_sysctl_clk(SYSCTL_HANDLER_ARGS) { int clk, psc, sclh, scll; struct ti_i2c_softc *sc; sc = arg1; TI_I2C_LOCK(sc); /* Get the system prescaler value. */ psc = (int)ti_i2c_read_2(sc, I2C_REG_PSC) + 1; /* Get the bitrate. */ scll = (int)ti_i2c_read_2(sc, I2C_REG_SCLL) & I2C_SCLL_MASK; sclh = (int)ti_i2c_read_2(sc, I2C_REG_SCLH) & I2C_SCLH_MASK; clk = I2C_CLK / psc / (scll + 7 + sclh + 5); TI_I2C_UNLOCK(sc); return (sysctl_handle_int(oidp, &clk, 0, req)); } static int ti_i2c_sysctl_timeout(SYSCTL_HANDLER_ARGS) { struct ti_i2c_softc *sc; unsigned int val; int err; sc = arg1; /* * MTX_DEF lock can't be held while doing uimove in * sysctl_handle_int */ TI_I2C_LOCK(sc); val = sc->sc_timeout; TI_I2C_UNLOCK(sc); err = sysctl_handle_int(oidp, &val, 0, req); /* Write request? */ if ((err == 0) && (req->newptr != NULL)) { TI_I2C_LOCK(sc); sc->sc_timeout = val; TI_I2C_UNLOCK(sc); } return (err); } static int ti_i2c_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (!ofw_bus_is_compatible(dev, "ti,i2c")) return (ENXIO); device_set_desc(dev, "TI I2C Controller"); return (0); } static int ti_i2c_attach(device_t dev) { int err, rid; phandle_t node; struct ti_i2c_softc *sc; struct sysctl_ctx_list *ctx; struct sysctl_oid_list *tree; uint16_t fifosz; sc = device_get_softc(dev); sc->sc_dev = dev; /* Get the i2c device id from FDT. */ node = ofw_bus_get_node(dev); if ((OF_getencprop(node, "i2c-device-id", &sc->device_id, sizeof(sc->device_id))) <= 0) { device_printf(dev, "missing i2c-device-id attribute in FDT\n"); return (ENXIO); } /* Get the memory resource for the register mapping. */ rid = 0; sc->sc_mem_res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid, RF_ACTIVE); if (sc->sc_mem_res == NULL) { device_printf(dev, "Cannot map registers.\n"); return (ENXIO); } /* Allocate our IRQ resource. */ rid = 0; sc->sc_irq_res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &rid, RF_ACTIVE | RF_SHAREABLE); if (sc->sc_irq_res == NULL) { bus_release_resource(dev, SYS_RES_MEMORY, 0, sc->sc_mem_res); device_printf(dev, "Cannot allocate interrupt.\n"); return (ENXIO); } TI_I2C_LOCK_INIT(sc); /* First of all, we _must_ activate the H/W. */ err = ti_i2c_activate(dev); if (err) { device_printf(dev, "ti_i2c_activate failed\n"); goto out; } /* Read the version number of the I2C module */ sc->sc_rev = ti_i2c_read_2(sc, I2C_REG_REVNB_HI) & 0xff; /* Get the fifo size. */ fifosz = ti_i2c_read_2(sc, I2C_REG_BUFSTAT); fifosz >>= I2C_BUFSTAT_FIFODEPTH_SHIFT; fifosz &= I2C_BUFSTAT_FIFODEPTH_MASK; device_printf(dev, "I2C revision %d.%d FIFO size: %d bytes\n", sc->sc_rev >> 4, sc->sc_rev & 0xf, 8 << fifosz); /* Set the FIFO threshold to 5 for now. */ sc->sc_fifo_trsh = 5; /* Set I2C bus timeout */ sc->sc_timeout = 5*hz; ctx = device_get_sysctl_ctx(dev); tree = SYSCTL_CHILDREN(device_get_sysctl_tree(dev)); SYSCTL_ADD_PROC(ctx, tree, OID_AUTO, "i2c_clock", CTLFLAG_RD | CTLTYPE_UINT | CTLFLAG_MPSAFE, sc, 0, ti_i2c_sysctl_clk, "IU", "I2C bus clock"); SYSCTL_ADD_PROC(ctx, tree, OID_AUTO, "i2c_timeout", CTLFLAG_RW | CTLTYPE_UINT | CTLFLAG_MPSAFE, sc, 0, ti_i2c_sysctl_timeout, "IU", "I2C bus timeout (in ticks)"); /* Activate the interrupt. */ err = bus_setup_intr(dev, sc->sc_irq_res, INTR_TYPE_MISC | INTR_MPSAFE, NULL, ti_i2c_intr, sc, &sc->sc_irq_h); if (err) goto out; /* Attach the iicbus. */ if ((sc->sc_iicbus = device_add_child(dev, "iicbus", -1)) == NULL) { device_printf(dev, "could not allocate iicbus instance\n"); err = ENXIO; goto out; } /* Probe and attach the iicbus */ bus_generic_attach(dev); out: if (err) { ti_i2c_deactivate(dev); TI_I2C_LOCK_DESTROY(sc); } return (err); } static int ti_i2c_detach(device_t dev) { struct ti_i2c_softc *sc; int rv; sc = device_get_softc(dev); ti_i2c_deactivate(dev); TI_I2C_LOCK_DESTROY(sc); if (sc->sc_iicbus && (rv = device_delete_child(dev, sc->sc_iicbus)) != 0) return (rv); return (0); } static phandle_t ti_i2c_get_node(device_t bus, device_t dev) { /* Share controller node with iibus device. */ return (ofw_bus_get_node(bus)); } static device_method_t ti_i2c_methods[] = { /* Device interface */ DEVMETHOD(device_probe, ti_i2c_probe), DEVMETHOD(device_attach, ti_i2c_attach), DEVMETHOD(device_detach, ti_i2c_detach), + /* Bus interface */ + DEVMETHOD(bus_setup_intr, bus_generic_setup_intr), + DEVMETHOD(bus_teardown_intr, bus_generic_teardown_intr), + DEVMETHOD(bus_alloc_resource, bus_generic_alloc_resource), + DEVMETHOD(bus_release_resource, bus_generic_release_resource), + DEVMETHOD(bus_activate_resource, bus_generic_activate_resource), + DEVMETHOD(bus_deactivate_resource, bus_generic_deactivate_resource), + DEVMETHOD(bus_adjust_resource, bus_generic_adjust_resource), + DEVMETHOD(bus_set_resource, bus_generic_rl_set_resource), + DEVMETHOD(bus_get_resource, bus_generic_rl_get_resource), + /* OFW methods */ DEVMETHOD(ofw_bus_get_node, ti_i2c_get_node), /* iicbus interface */ DEVMETHOD(iicbus_callback, iicbus_null_callback), DEVMETHOD(iicbus_reset, ti_i2c_iicbus_reset), DEVMETHOD(iicbus_transfer, ti_i2c_transfer), DEVMETHOD_END }; static driver_t ti_i2c_driver = { "iichb", ti_i2c_methods, sizeof(struct ti_i2c_softc), }; static devclass_t ti_i2c_devclass; DRIVER_MODULE(ti_iic, simplebus, ti_i2c_driver, ti_i2c_devclass, 0, 0); DRIVER_MODULE(iicbus, ti_iic, iicbus_driver, iicbus_devclass, 0, 0); MODULE_DEPEND(ti_iic, ti_prcm, 1, 1, 1); MODULE_DEPEND(ti_iic, iicbus, 1, 1, 1); Index: projects/release-arm-redux/sys/dev/acpica/acpi_cpu.c =================================================================== --- projects/release-arm-redux/sys/dev/acpica/acpi_cpu.c (revision 282691) +++ projects/release-arm-redux/sys/dev/acpica/acpi_cpu.c (revision 282692) @@ -1,1373 +1,1511 @@ /*- * Copyright (c) 2003-2005 Nate Lawson (SDG) * Copyright (c) 2001 Michael Smith * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_acpi.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #if defined(__amd64__) || defined(__i386__) #include +#include +#include #endif #include #include #include #include /* * Support for ACPI Processor devices, including C[1-3] sleep states. */ /* Hooks for the ACPI CA debugging infrastructure */ #define _COMPONENT ACPI_PROCESSOR ACPI_MODULE_NAME("PROCESSOR") struct acpi_cx { struct resource *p_lvlx; /* Register to read to enter state. */ uint32_t type; /* C1-3 (C4 and up treated as C3). */ uint32_t trans_lat; /* Transition latency (usec). */ uint32_t power; /* Power consumed (mW). */ int res_type; /* Resource type for p_lvlx. */ int res_rid; /* Resource ID for p_lvlx. */ + bool do_mwait; + uint32_t mwait_hint; + bool mwait_hw_coord; + bool mwait_bm_avoidance; }; #define MAX_CX_STATES 8 struct acpi_cpu_softc { device_t cpu_dev; ACPI_HANDLE cpu_handle; struct pcpu *cpu_pcpu; uint32_t cpu_acpi_id; /* ACPI processor id */ uint32_t cpu_p_blk; /* ACPI P_BLK location */ uint32_t cpu_p_blk_len; /* P_BLK length (must be 6). */ struct acpi_cx cpu_cx_states[MAX_CX_STATES]; int cpu_cx_count; /* Number of valid Cx states. */ int cpu_prev_sleep;/* Last idle sleep duration. */ int cpu_features; /* Child driver supported features. */ /* Runtime state. */ int cpu_non_c2; /* Index of lowest non-C2 state. */ int cpu_non_c3; /* Index of lowest non-C3 state. */ u_int cpu_cx_stats[MAX_CX_STATES];/* Cx usage history. */ /* Values for sysctl. */ struct sysctl_ctx_list cpu_sysctl_ctx; struct sysctl_oid *cpu_sysctl_tree; int cpu_cx_lowest; int cpu_cx_lowest_lim; int cpu_disable_idle; /* Disable entry to idle function */ char cpu_cx_supported[64]; }; struct acpi_cpu_device { struct resource_list ad_rl; }; #define CPU_GET_REG(reg, width) \ (bus_space_read_ ## width(rman_get_bustag((reg)), \ rman_get_bushandle((reg)), 0)) #define CPU_SET_REG(reg, width, val) \ (bus_space_write_ ## width(rman_get_bustag((reg)), \ rman_get_bushandle((reg)), 0, (val))) #define PM_USEC(x) ((x) >> 2) /* ~4 clocks per usec (3.57955 Mhz) */ #define ACPI_NOTIFY_CX_STATES 0x81 /* _CST changed. */ #define CPU_QUIRK_NO_C3 (1<<0) /* C3-type states are not usable. */ #define CPU_QUIRK_NO_BM_CTRL (1<<2) /* No bus mastering control. */ #define PCI_VENDOR_INTEL 0x8086 #define PCI_DEVICE_82371AB_3 0x7113 /* PIIX4 chipset for quirks. */ #define PCI_REVISION_A_STEP 0 #define PCI_REVISION_B_STEP 1 #define PCI_REVISION_4E 2 #define PCI_REVISION_4M 3 #define PIIX4_DEVACTB_REG 0x58 #define PIIX4_BRLD_EN_IRQ0 (1<<0) #define PIIX4_BRLD_EN_IRQ (1<<1) #define PIIX4_BRLD_EN_IRQ8 (1<<5) #define PIIX4_STOP_BREAK_MASK (PIIX4_BRLD_EN_IRQ0 | PIIX4_BRLD_EN_IRQ | PIIX4_BRLD_EN_IRQ8) #define PIIX4_PCNTRL_BST_EN (1<<10) +#define CST_FFH_VENDOR_INTEL 1 +#define CST_FFH_INTEL_CL_C1IO 1 +#define CST_FFH_INTEL_CL_MWAIT 2 +#define CST_FFH_MWAIT_HW_COORD 0x0001 +#define CST_FFH_MWAIT_BM_AVOID 0x0002 + /* Allow users to ignore processor orders in MADT. */ static int cpu_unordered; SYSCTL_INT(_debug_acpi, OID_AUTO, cpu_unordered, CTLFLAG_RDTUN, &cpu_unordered, 0, "Do not use the MADT to match ACPI Processor objects to CPUs."); /* Knob to disable acpi_cpu devices */ bool acpi_cpu_disabled = false; /* Platform hardware resource information. */ static uint32_t cpu_smi_cmd; /* Value to write to SMI_CMD. */ static uint8_t cpu_cst_cnt; /* Indicate we are _CST aware. */ static int cpu_quirks; /* Indicate any hardware bugs. */ /* Values for sysctl. */ static struct sysctl_ctx_list cpu_sysctl_ctx; static struct sysctl_oid *cpu_sysctl_tree; static int cpu_cx_generic; static int cpu_cx_lowest_lim; static device_t *cpu_devices; static int cpu_ndevices; static struct acpi_cpu_softc **cpu_softc; ACPI_SERIAL_DECL(cpu, "ACPI CPU"); static int acpi_cpu_probe(device_t dev); static int acpi_cpu_attach(device_t dev); static int acpi_cpu_suspend(device_t dev); static int acpi_cpu_resume(device_t dev); static int acpi_pcpu_get_id(device_t dev, uint32_t *acpi_id, uint32_t *cpu_id); static struct resource_list *acpi_cpu_get_rlist(device_t dev, device_t child); static device_t acpi_cpu_add_child(device_t dev, u_int order, const char *name, int unit); static int acpi_cpu_read_ivar(device_t dev, device_t child, int index, uintptr_t *result); static int acpi_cpu_shutdown(device_t dev); static void acpi_cpu_cx_probe(struct acpi_cpu_softc *sc); static void acpi_cpu_generic_cx_probe(struct acpi_cpu_softc *sc); static int acpi_cpu_cx_cst(struct acpi_cpu_softc *sc); static void acpi_cpu_startup(void *arg); static void acpi_cpu_startup_cx(struct acpi_cpu_softc *sc); static void acpi_cpu_cx_list(struct acpi_cpu_softc *sc); static void acpi_cpu_idle(sbintime_t sbt); static void acpi_cpu_notify(ACPI_HANDLE h, UINT32 notify, void *context); static int acpi_cpu_quirks(void); static int acpi_cpu_usage_sysctl(SYSCTL_HANDLER_ARGS); static int acpi_cpu_usage_counters_sysctl(SYSCTL_HANDLER_ARGS); static int acpi_cpu_set_cx_lowest(struct acpi_cpu_softc *sc); static int acpi_cpu_cx_lowest_sysctl(SYSCTL_HANDLER_ARGS); static int acpi_cpu_global_cx_lowest_sysctl(SYSCTL_HANDLER_ARGS); +#if defined(__i386__) || defined(__amd64__) +static int acpi_cpu_method_sysctl(SYSCTL_HANDLER_ARGS); +#endif static device_method_t acpi_cpu_methods[] = { /* Device interface */ DEVMETHOD(device_probe, acpi_cpu_probe), DEVMETHOD(device_attach, acpi_cpu_attach), DEVMETHOD(device_detach, bus_generic_detach), DEVMETHOD(device_shutdown, acpi_cpu_shutdown), DEVMETHOD(device_suspend, acpi_cpu_suspend), DEVMETHOD(device_resume, acpi_cpu_resume), /* Bus interface */ DEVMETHOD(bus_add_child, acpi_cpu_add_child), DEVMETHOD(bus_read_ivar, acpi_cpu_read_ivar), DEVMETHOD(bus_get_resource_list, acpi_cpu_get_rlist), DEVMETHOD(bus_get_resource, bus_generic_rl_get_resource), DEVMETHOD(bus_set_resource, bus_generic_rl_set_resource), DEVMETHOD(bus_alloc_resource, bus_generic_rl_alloc_resource), DEVMETHOD(bus_release_resource, bus_generic_rl_release_resource), DEVMETHOD(bus_activate_resource, bus_generic_activate_resource), DEVMETHOD(bus_deactivate_resource, bus_generic_deactivate_resource), DEVMETHOD(bus_setup_intr, bus_generic_setup_intr), DEVMETHOD(bus_teardown_intr, bus_generic_teardown_intr), DEVMETHOD_END }; static driver_t acpi_cpu_driver = { "cpu", acpi_cpu_methods, sizeof(struct acpi_cpu_softc), }; static devclass_t acpi_cpu_devclass; DRIVER_MODULE(cpu, acpi, acpi_cpu_driver, acpi_cpu_devclass, 0, 0); MODULE_DEPEND(cpu, acpi, 1, 1, 1); static int acpi_cpu_probe(device_t dev) { int acpi_id, cpu_id; ACPI_BUFFER buf; ACPI_HANDLE handle; ACPI_OBJECT *obj; ACPI_STATUS status; if (acpi_disabled("cpu") || acpi_get_type(dev) != ACPI_TYPE_PROCESSOR || acpi_cpu_disabled) return (ENXIO); handle = acpi_get_handle(dev); if (cpu_softc == NULL) cpu_softc = malloc(sizeof(struct acpi_cpu_softc *) * (mp_maxid + 1), M_TEMP /* XXX */, M_WAITOK | M_ZERO); /* Get our Processor object. */ buf.Pointer = NULL; buf.Length = ACPI_ALLOCATE_BUFFER; status = AcpiEvaluateObject(handle, NULL, NULL, &buf); if (ACPI_FAILURE(status)) { device_printf(dev, "probe failed to get Processor obj - %s\n", AcpiFormatException(status)); return (ENXIO); } obj = (ACPI_OBJECT *)buf.Pointer; if (obj->Type != ACPI_TYPE_PROCESSOR) { device_printf(dev, "Processor object has bad type %d\n", obj->Type); AcpiOsFree(obj); return (ENXIO); } /* * Find the processor associated with our unit. We could use the * ProcId as a key, however, some boxes do not have the same values * in their Processor object as the ProcId values in the MADT. */ acpi_id = obj->Processor.ProcId; AcpiOsFree(obj); if (acpi_pcpu_get_id(dev, &acpi_id, &cpu_id) != 0) return (ENXIO); /* * Check if we already probed this processor. We scan the bus twice * so it's possible we've already seen this one. */ if (cpu_softc[cpu_id] != NULL) return (ENXIO); /* Mark this processor as in-use and save our derived id for attach. */ cpu_softc[cpu_id] = (void *)1; acpi_set_private(dev, (void*)(intptr_t)cpu_id); device_set_desc(dev, "ACPI CPU"); return (0); } static int acpi_cpu_attach(device_t dev) { ACPI_BUFFER buf; ACPI_OBJECT arg[4], *obj; ACPI_OBJECT_LIST arglist; struct pcpu *pcpu_data; struct acpi_cpu_softc *sc; struct acpi_softc *acpi_sc; ACPI_STATUS status; u_int features; int cpu_id, drv_count, i; driver_t **drivers; uint32_t cap_set[3]; /* UUID needed by _OSC evaluation */ static uint8_t cpu_oscuuid[16] = { 0x16, 0xA6, 0x77, 0x40, 0x0C, 0x29, 0xBE, 0x47, 0x9E, 0xBD, 0xD8, 0x70, 0x58, 0x71, 0x39, 0x53 }; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); sc = device_get_softc(dev); sc->cpu_dev = dev; sc->cpu_handle = acpi_get_handle(dev); cpu_id = (int)(intptr_t)acpi_get_private(dev); cpu_softc[cpu_id] = sc; pcpu_data = pcpu_find(cpu_id); pcpu_data->pc_device = dev; sc->cpu_pcpu = pcpu_data; cpu_smi_cmd = AcpiGbl_FADT.SmiCommand; cpu_cst_cnt = AcpiGbl_FADT.CstControl; buf.Pointer = NULL; buf.Length = ACPI_ALLOCATE_BUFFER; status = AcpiEvaluateObject(sc->cpu_handle, NULL, NULL, &buf); if (ACPI_FAILURE(status)) { device_printf(dev, "attach failed to get Processor obj - %s\n", AcpiFormatException(status)); return (ENXIO); } obj = (ACPI_OBJECT *)buf.Pointer; sc->cpu_p_blk = obj->Processor.PblkAddress; sc->cpu_p_blk_len = obj->Processor.PblkLength; sc->cpu_acpi_id = obj->Processor.ProcId; AcpiOsFree(obj); ACPI_DEBUG_PRINT((ACPI_DB_INFO, "acpi_cpu%d: P_BLK at %#x/%d\n", device_get_unit(dev), sc->cpu_p_blk, sc->cpu_p_blk_len)); /* * If this is the first cpu we attach, create and initialize the generic * resources that will be used by all acpi cpu devices. */ if (device_get_unit(dev) == 0) { /* Assume we won't be using generic Cx mode by default */ cpu_cx_generic = FALSE; /* Install hw.acpi.cpu sysctl tree */ acpi_sc = acpi_device_get_parent_softc(dev); sysctl_ctx_init(&cpu_sysctl_ctx); cpu_sysctl_tree = SYSCTL_ADD_NODE(&cpu_sysctl_ctx, SYSCTL_CHILDREN(acpi_sc->acpi_sysctl_tree), OID_AUTO, "cpu", CTLFLAG_RD, 0, "node for CPU children"); /* Queue post cpu-probing task handler */ AcpiOsExecute(OSL_NOTIFY_HANDLER, acpi_cpu_startup, NULL); } /* * Before calling any CPU methods, collect child driver feature hints * and notify ACPI of them. We support unified SMP power control * so advertise this ourselves. Note this is not the same as independent * SMP control where each CPU can have different settings. */ - sc->cpu_features = ACPI_CAP_SMP_SAME | ACPI_CAP_SMP_SAME_C3; + sc->cpu_features = ACPI_CAP_SMP_SAME | ACPI_CAP_SMP_SAME_C3 | + ACPI_CAP_C1_IO_HALT; + +#if defined(__i386__) || defined(__amd64__) + /* + * Ask for MWAIT modes if not disabled and interrupts work + * reasonable with MWAIT. + */ + if (!acpi_disabled("mwait") && cpu_mwait_usable()) + sc->cpu_features |= ACPI_CAP_SMP_C1_NATIVE | ACPI_CAP_SMP_C3_NATIVE; +#endif + if (devclass_get_drivers(acpi_cpu_devclass, &drivers, &drv_count) == 0) { for (i = 0; i < drv_count; i++) { if (ACPI_GET_FEATURES(drivers[i], &features) == 0) sc->cpu_features |= features; } free(drivers, M_TEMP); } /* * CPU capabilities are specified in * Intel Processor Vendor-Specific ACPI Interface Specification. */ if (sc->cpu_features) { arglist.Pointer = arg; arglist.Count = 4; arg[0].Type = ACPI_TYPE_BUFFER; arg[0].Buffer.Length = sizeof(cpu_oscuuid); arg[0].Buffer.Pointer = cpu_oscuuid; /* UUID */ arg[1].Type = ACPI_TYPE_INTEGER; arg[1].Integer.Value = 1; /* revision */ arg[2].Type = ACPI_TYPE_INTEGER; arg[2].Integer.Value = 1; /* count */ arg[3].Type = ACPI_TYPE_BUFFER; arg[3].Buffer.Length = sizeof(cap_set); /* Capabilities buffer */ arg[3].Buffer.Pointer = (uint8_t *)cap_set; cap_set[0] = 0; /* status */ cap_set[1] = sc->cpu_features; status = AcpiEvaluateObject(sc->cpu_handle, "_OSC", &arglist, NULL); if (ACPI_SUCCESS(status)) { if (cap_set[0] != 0) device_printf(dev, "_OSC returned status %#x\n", cap_set[0]); } else { arglist.Pointer = arg; arglist.Count = 1; arg[0].Type = ACPI_TYPE_BUFFER; arg[0].Buffer.Length = sizeof(cap_set); arg[0].Buffer.Pointer = (uint8_t *)cap_set; cap_set[0] = 1; /* revision */ cap_set[1] = 1; /* number of capabilities integers */ cap_set[2] = sc->cpu_features; AcpiEvaluateObject(sc->cpu_handle, "_PDC", &arglist, NULL); } } /* Probe for Cx state support. */ acpi_cpu_cx_probe(sc); return (0); } static void acpi_cpu_postattach(void *unused __unused) { device_t *devices; int err; int i, n; err = devclass_get_devices(acpi_cpu_devclass, &devices, &n); if (err != 0) { printf("devclass_get_devices(acpi_cpu_devclass) failed\n"); return; } for (i = 0; i < n; i++) bus_generic_probe(devices[i]); for (i = 0; i < n; i++) bus_generic_attach(devices[i]); free(devices, M_TEMP); } SYSINIT(acpi_cpu, SI_SUB_CONFIGURE, SI_ORDER_MIDDLE, acpi_cpu_postattach, NULL); static void disable_idle(struct acpi_cpu_softc *sc) { cpuset_t cpuset; CPU_SETOF(sc->cpu_pcpu->pc_cpuid, &cpuset); sc->cpu_disable_idle = TRUE; /* * Ensure that the CPU is not in idle state or in acpi_cpu_idle(). * Note that this code depends on the fact that the rendezvous IPI * can not penetrate context where interrupts are disabled and acpi_cpu_idle * is called and executed in such a context with interrupts being re-enabled * right before return. */ smp_rendezvous_cpus(cpuset, smp_no_rendevous_barrier, NULL, smp_no_rendevous_barrier, NULL); } static void enable_idle(struct acpi_cpu_softc *sc) { sc->cpu_disable_idle = FALSE; } static int is_idle_disabled(struct acpi_cpu_softc *sc) { return (sc->cpu_disable_idle); } /* * Disable any entry to the idle function during suspend and re-enable it * during resume. */ static int acpi_cpu_suspend(device_t dev) { int error; error = bus_generic_suspend(dev); if (error) return (error); disable_idle(device_get_softc(dev)); return (0); } static int acpi_cpu_resume(device_t dev) { enable_idle(device_get_softc(dev)); return (bus_generic_resume(dev)); } /* * Find the processor associated with a given ACPI ID. By default, * use the MADT to map ACPI IDs to APIC IDs and use that to locate a * processor. Some systems have inconsistent ASL and MADT however. * For these systems the cpu_unordered tunable can be set in which * case we assume that Processor objects are listed in the same order * in both the MADT and ASL. */ static int acpi_pcpu_get_id(device_t dev, uint32_t *acpi_id, uint32_t *cpu_id) { struct pcpu *pc; uint32_t i, idx; KASSERT(acpi_id != NULL, ("Null acpi_id")); KASSERT(cpu_id != NULL, ("Null cpu_id")); idx = device_get_unit(dev); /* * If pc_acpi_id for CPU 0 is not initialized (e.g. a non-APIC * UP box) use the ACPI ID from the first processor we find. */ if (idx == 0 && mp_ncpus == 1) { pc = pcpu_find(0); if (pc->pc_acpi_id == 0xffffffff) pc->pc_acpi_id = *acpi_id; *cpu_id = 0; return (0); } CPU_FOREACH(i) { pc = pcpu_find(i); KASSERT(pc != NULL, ("no pcpu data for %d", i)); if (cpu_unordered) { if (idx-- == 0) { /* * If pc_acpi_id doesn't match the ACPI ID from the * ASL, prefer the MADT-derived value. */ if (pc->pc_acpi_id != *acpi_id) *acpi_id = pc->pc_acpi_id; *cpu_id = pc->pc_cpuid; return (0); } } else { if (pc->pc_acpi_id == *acpi_id) { if (bootverbose) device_printf(dev, "Processor %s (ACPI ID %u) -> APIC ID %d\n", acpi_name(acpi_get_handle(dev)), *acpi_id, pc->pc_cpuid); *cpu_id = pc->pc_cpuid; return (0); } } } if (bootverbose) printf("ACPI: Processor %s (ACPI ID %u) ignored\n", acpi_name(acpi_get_handle(dev)), *acpi_id); return (ESRCH); } static struct resource_list * acpi_cpu_get_rlist(device_t dev, device_t child) { struct acpi_cpu_device *ad; ad = device_get_ivars(child); if (ad == NULL) return (NULL); return (&ad->ad_rl); } static device_t acpi_cpu_add_child(device_t dev, u_int order, const char *name, int unit) { struct acpi_cpu_device *ad; device_t child; if ((ad = malloc(sizeof(*ad), M_TEMP, M_NOWAIT | M_ZERO)) == NULL) return (NULL); resource_list_init(&ad->ad_rl); child = device_add_child_ordered(dev, order, name, unit); if (child != NULL) device_set_ivars(child, ad); else free(ad, M_TEMP); return (child); } static int acpi_cpu_read_ivar(device_t dev, device_t child, int index, uintptr_t *result) { struct acpi_cpu_softc *sc; sc = device_get_softc(dev); switch (index) { case ACPI_IVAR_HANDLE: *result = (uintptr_t)sc->cpu_handle; break; case CPU_IVAR_PCPU: *result = (uintptr_t)sc->cpu_pcpu; break; #if defined(__amd64__) || defined(__i386__) case CPU_IVAR_NOMINAL_MHZ: if (tsc_is_invariant) { *result = (uintptr_t)(atomic_load_acq_64(&tsc_freq) / 1000000); break; } /* FALLTHROUGH */ #endif default: return (ENOENT); } return (0); } static int acpi_cpu_shutdown(device_t dev) { ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); /* Allow children to shutdown first. */ bus_generic_shutdown(dev); /* * Disable any entry to the idle function. */ disable_idle(device_get_softc(dev)); /* * CPU devices are not truely detached and remain referenced, * so their resources are not freed. */ return_VALUE (0); } static void acpi_cpu_cx_probe(struct acpi_cpu_softc *sc) { ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); /* Use initial sleep value of 1 sec. to start with lowest idle state. */ sc->cpu_prev_sleep = 1000000; sc->cpu_cx_lowest = 0; sc->cpu_cx_lowest_lim = 0; /* * Check for the ACPI 2.0 _CST sleep states object. If we can't find * any, we'll revert to generic FADT/P_BLK Cx control method which will * be handled by acpi_cpu_startup. We need to defer to after having * probed all the cpus in the system before probing for generic Cx * states as we may already have found cpus with valid _CST packages */ if (!cpu_cx_generic && acpi_cpu_cx_cst(sc) != 0) { /* * We were unable to find a _CST package for this cpu or there * was an error parsing it. Switch back to generic mode. */ cpu_cx_generic = TRUE; if (bootverbose) device_printf(sc->cpu_dev, "switching to generic Cx mode\n"); } /* * TODO: _CSD Package should be checked here. */ } static void acpi_cpu_generic_cx_probe(struct acpi_cpu_softc *sc) { ACPI_GENERIC_ADDRESS gas; struct acpi_cx *cx_ptr; sc->cpu_cx_count = 0; cx_ptr = sc->cpu_cx_states; /* Use initial sleep value of 1 sec. to start with lowest idle state. */ sc->cpu_prev_sleep = 1000000; /* C1 has been required since just after ACPI 1.0 */ cx_ptr->type = ACPI_STATE_C1; cx_ptr->trans_lat = 0; cx_ptr++; sc->cpu_non_c2 = sc->cpu_cx_count; sc->cpu_non_c3 = sc->cpu_cx_count; sc->cpu_cx_count++; cpu_deepest_sleep = 1; /* * The spec says P_BLK must be 6 bytes long. However, some systems * use it to indicate a fractional set of features present so we * take 5 as C2. Some may also have a value of 7 to indicate * another C3 but most use _CST for this (as required) and having * "only" C1-C3 is not a hardship. */ if (sc->cpu_p_blk_len < 5) return; /* Validate and allocate resources for C2 (P_LVL2). */ gas.SpaceId = ACPI_ADR_SPACE_SYSTEM_IO; gas.BitWidth = 8; if (AcpiGbl_FADT.C2Latency <= 100) { gas.Address = sc->cpu_p_blk + 4; cx_ptr->res_rid = 0; acpi_bus_alloc_gas(sc->cpu_dev, &cx_ptr->res_type, &cx_ptr->res_rid, &gas, &cx_ptr->p_lvlx, RF_SHAREABLE); if (cx_ptr->p_lvlx != NULL) { cx_ptr->type = ACPI_STATE_C2; cx_ptr->trans_lat = AcpiGbl_FADT.C2Latency; cx_ptr++; sc->cpu_non_c3 = sc->cpu_cx_count; sc->cpu_cx_count++; cpu_deepest_sleep = 2; } } if (sc->cpu_p_blk_len < 6) return; /* Validate and allocate resources for C3 (P_LVL3). */ if (AcpiGbl_FADT.C3Latency <= 1000 && !(cpu_quirks & CPU_QUIRK_NO_C3)) { gas.Address = sc->cpu_p_blk + 5; cx_ptr->res_rid = 1; acpi_bus_alloc_gas(sc->cpu_dev, &cx_ptr->res_type, &cx_ptr->res_rid, &gas, &cx_ptr->p_lvlx, RF_SHAREABLE); if (cx_ptr->p_lvlx != NULL) { cx_ptr->type = ACPI_STATE_C3; cx_ptr->trans_lat = AcpiGbl_FADT.C3Latency; cx_ptr++; sc->cpu_cx_count++; cpu_deepest_sleep = 3; } } } +static void +acpi_cpu_cx_cst_mwait(struct acpi_cx *cx_ptr, uint64_t address, int accsize) +{ + + cx_ptr->do_mwait = true; + cx_ptr->mwait_hint = address & 0xffffffff; + cx_ptr->mwait_hw_coord = (accsize & CST_FFH_MWAIT_HW_COORD) != 0; + cx_ptr->mwait_bm_avoidance = (accsize & CST_FFH_MWAIT_BM_AVOID) != 0; +} + +static void +acpi_cpu_cx_cst_free_plvlx(device_t cpu_dev, struct acpi_cx *cx_ptr) +{ + + if (cx_ptr->p_lvlx == NULL) + return; + bus_release_resource(cpu_dev, cx_ptr->res_type, cx_ptr->res_rid, + cx_ptr->p_lvlx); + cx_ptr->p_lvlx = NULL; +} + /* * Parse a _CST package and set up its Cx states. Since the _CST object * can change dynamically, our notify handler may call this function * to clean up and probe the new _CST package. */ static int acpi_cpu_cx_cst(struct acpi_cpu_softc *sc) { struct acpi_cx *cx_ptr; ACPI_STATUS status; ACPI_BUFFER buf; ACPI_OBJECT *top; ACPI_OBJECT *pkg; uint32_t count; - int i; + uint64_t address; + int i, vendor, class, accsize; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); buf.Pointer = NULL; buf.Length = ACPI_ALLOCATE_BUFFER; status = AcpiEvaluateObject(sc->cpu_handle, "_CST", NULL, &buf); if (ACPI_FAILURE(status)) return (ENXIO); /* _CST is a package with a count and at least one Cx package. */ top = (ACPI_OBJECT *)buf.Pointer; if (!ACPI_PKG_VALID(top, 2) || acpi_PkgInt32(top, 0, &count) != 0) { device_printf(sc->cpu_dev, "invalid _CST package\n"); AcpiOsFree(buf.Pointer); return (ENXIO); } if (count != top->Package.Count - 1) { device_printf(sc->cpu_dev, "invalid _CST state count (%d != %d)\n", count, top->Package.Count - 1); count = top->Package.Count - 1; } if (count > MAX_CX_STATES) { device_printf(sc->cpu_dev, "_CST has too many states (%d)\n", count); count = MAX_CX_STATES; } sc->cpu_non_c2 = 0; sc->cpu_non_c3 = 0; sc->cpu_cx_count = 0; cx_ptr = sc->cpu_cx_states; /* * C1 has been required since just after ACPI 1.0. * Reserve the first slot for it. */ cx_ptr->type = ACPI_STATE_C0; cx_ptr++; sc->cpu_cx_count++; cpu_deepest_sleep = 1; /* Set up all valid states. */ for (i = 0; i < count; i++) { pkg = &top->Package.Elements[i + 1]; if (!ACPI_PKG_VALID(pkg, 4) || acpi_PkgInt32(pkg, 1, &cx_ptr->type) != 0 || acpi_PkgInt32(pkg, 2, &cx_ptr->trans_lat) != 0 || acpi_PkgInt32(pkg, 3, &cx_ptr->power) != 0) { device_printf(sc->cpu_dev, "skipping invalid Cx state package\n"); continue; } /* Validate the state to see if we should use it. */ switch (cx_ptr->type) { case ACPI_STATE_C1: + acpi_cpu_cx_cst_free_plvlx(sc->cpu_dev, cx_ptr); +#if defined(__i386__) || defined(__amd64__) + if (acpi_PkgFFH_IntelCpu(pkg, 0, &vendor, &class, &address, + &accsize) == 0 && vendor == CST_FFH_VENDOR_INTEL) { + if (class == CST_FFH_INTEL_CL_C1IO) { + /* C1 I/O then Halt */ + cx_ptr->res_rid = sc->cpu_cx_count; + bus_set_resource(sc->cpu_dev, SYS_RES_IOPORT, + cx_ptr->res_rid, address, 1); + cx_ptr->p_lvlx = bus_alloc_resource_any(sc->cpu_dev, + SYS_RES_IOPORT, &cx_ptr->res_rid, RF_ACTIVE | + RF_SHAREABLE); + if (cx_ptr->p_lvlx == NULL) { + bus_delete_resource(sc->cpu_dev, SYS_RES_IOPORT, + cx_ptr->res_rid); + device_printf(sc->cpu_dev, + "C1 I/O failed to allocate port %d, " + "degrading to C1 Halt", (int)address); + } + } else if (class == CST_FFH_INTEL_CL_MWAIT) { + acpi_cpu_cx_cst_mwait(cx_ptr, address, accsize); + } + } +#endif if (sc->cpu_cx_states[0].type == ACPI_STATE_C0) { /* This is the first C1 state. Use the reserved slot. */ sc->cpu_cx_states[0] = *cx_ptr; } else { sc->cpu_non_c2 = sc->cpu_cx_count; sc->cpu_non_c3 = sc->cpu_cx_count; cx_ptr++; sc->cpu_cx_count++; } continue; case ACPI_STATE_C2: sc->cpu_non_c3 = sc->cpu_cx_count; if (cpu_deepest_sleep < 2) cpu_deepest_sleep = 2; break; case ACPI_STATE_C3: default: if ((cpu_quirks & CPU_QUIRK_NO_C3) != 0) { ACPI_DEBUG_PRINT((ACPI_DB_INFO, "acpi_cpu%d: C3[%d] not available.\n", device_get_unit(sc->cpu_dev), i)); continue; } else cpu_deepest_sleep = 3; break; } /* Free up any previous register. */ - if (cx_ptr->p_lvlx != NULL) { - bus_release_resource(sc->cpu_dev, cx_ptr->res_type, cx_ptr->res_rid, - cx_ptr->p_lvlx); - cx_ptr->p_lvlx = NULL; - } + acpi_cpu_cx_cst_free_plvlx(sc->cpu_dev, cx_ptr); /* Allocate the control register for C2 or C3. */ - cx_ptr->res_rid = sc->cpu_cx_count; - acpi_PkgGas(sc->cpu_dev, pkg, 0, &cx_ptr->res_type, &cx_ptr->res_rid, - &cx_ptr->p_lvlx, RF_SHAREABLE); - if (cx_ptr->p_lvlx) { +#if defined(__i386__) || defined(__amd64__) + if (acpi_PkgFFH_IntelCpu(pkg, 0, &vendor, &class, &address, + &accsize) == 0 && vendor == CST_FFH_VENDOR_INTEL && + class == CST_FFH_INTEL_CL_MWAIT) { + /* Native C State Instruction use (mwait) */ + acpi_cpu_cx_cst_mwait(cx_ptr, address, accsize); ACPI_DEBUG_PRINT((ACPI_DB_INFO, - "acpi_cpu%d: Got C%d - %d latency\n", - device_get_unit(sc->cpu_dev), cx_ptr->type, - cx_ptr->trans_lat)); + "acpi_cpu%d: Got C%d/mwait - %d latency\n", + device_get_unit(sc->cpu_dev), cx_ptr->type, cx_ptr->trans_lat)); cx_ptr++; sc->cpu_cx_count++; + } else +#endif + { + cx_ptr->res_rid = sc->cpu_cx_count; + acpi_PkgGas(sc->cpu_dev, pkg, 0, &cx_ptr->res_type, + &cx_ptr->res_rid, &cx_ptr->p_lvlx, RF_SHAREABLE); + if (cx_ptr->p_lvlx) { + ACPI_DEBUG_PRINT((ACPI_DB_INFO, + "acpi_cpu%d: Got C%d - %d latency\n", + device_get_unit(sc->cpu_dev), cx_ptr->type, + cx_ptr->trans_lat)); + cx_ptr++; + sc->cpu_cx_count++; + } } } AcpiOsFree(buf.Pointer); /* If C1 state was not found, we need one now. */ cx_ptr = sc->cpu_cx_states; if (cx_ptr->type == ACPI_STATE_C0) { cx_ptr->type = ACPI_STATE_C1; cx_ptr->trans_lat = 0; } return (0); } /* * Call this *after* all CPUs have been attached. */ static void acpi_cpu_startup(void *arg) { struct acpi_cpu_softc *sc; int i; /* Get set of CPU devices */ devclass_get_devices(acpi_cpu_devclass, &cpu_devices, &cpu_ndevices); /* * Setup any quirks that might necessary now that we have probed * all the CPUs */ acpi_cpu_quirks(); if (cpu_cx_generic) { /* * We are using generic Cx mode, probe for available Cx states * for all processors. */ for (i = 0; i < cpu_ndevices; i++) { sc = device_get_softc(cpu_devices[i]); acpi_cpu_generic_cx_probe(sc); } } else { /* * We are using _CST mode, remove C3 state if necessary. * As we now know for sure that we will be using _CST mode * install our notify handler. */ for (i = 0; i < cpu_ndevices; i++) { sc = device_get_softc(cpu_devices[i]); if (cpu_quirks & CPU_QUIRK_NO_C3) { sc->cpu_cx_count = min(sc->cpu_cx_count, sc->cpu_non_c3 + 1); } AcpiInstallNotifyHandler(sc->cpu_handle, ACPI_DEVICE_NOTIFY, acpi_cpu_notify, sc); } } /* Perform Cx final initialization. */ for (i = 0; i < cpu_ndevices; i++) { sc = device_get_softc(cpu_devices[i]); acpi_cpu_startup_cx(sc); } /* Add a sysctl handler to handle global Cx lowest setting */ SYSCTL_ADD_PROC(&cpu_sysctl_ctx, SYSCTL_CHILDREN(cpu_sysctl_tree), OID_AUTO, "cx_lowest", CTLTYPE_STRING | CTLFLAG_RW, NULL, 0, acpi_cpu_global_cx_lowest_sysctl, "A", "Global lowest Cx sleep state to use"); /* Take over idling from cpu_idle_default(). */ cpu_cx_lowest_lim = 0; for (i = 0; i < cpu_ndevices; i++) { sc = device_get_softc(cpu_devices[i]); enable_idle(sc); } cpu_idle_hook = acpi_cpu_idle; } static void acpi_cpu_cx_list(struct acpi_cpu_softc *sc) { struct sbuf sb; int i; /* * Set up the list of Cx states */ sbuf_new(&sb, sc->cpu_cx_supported, sizeof(sc->cpu_cx_supported), SBUF_FIXEDLEN); for (i = 0; i < sc->cpu_cx_count; i++) sbuf_printf(&sb, "C%d/%d/%d ", i + 1, sc->cpu_cx_states[i].type, sc->cpu_cx_states[i].trans_lat); sbuf_trim(&sb); sbuf_finish(&sb); } static void acpi_cpu_startup_cx(struct acpi_cpu_softc *sc) { acpi_cpu_cx_list(sc); SYSCTL_ADD_STRING(&sc->cpu_sysctl_ctx, SYSCTL_CHILDREN(device_get_sysctl_tree(sc->cpu_dev)), OID_AUTO, "cx_supported", CTLFLAG_RD, sc->cpu_cx_supported, 0, "Cx/microsecond values for supported Cx states"); SYSCTL_ADD_PROC(&sc->cpu_sysctl_ctx, SYSCTL_CHILDREN(device_get_sysctl_tree(sc->cpu_dev)), OID_AUTO, "cx_lowest", CTLTYPE_STRING | CTLFLAG_RW, (void *)sc, 0, acpi_cpu_cx_lowest_sysctl, "A", "lowest Cx sleep state to use"); SYSCTL_ADD_PROC(&sc->cpu_sysctl_ctx, SYSCTL_CHILDREN(device_get_sysctl_tree(sc->cpu_dev)), OID_AUTO, "cx_usage", CTLTYPE_STRING | CTLFLAG_RD, (void *)sc, 0, acpi_cpu_usage_sysctl, "A", "percent usage for each Cx state"); SYSCTL_ADD_PROC(&sc->cpu_sysctl_ctx, SYSCTL_CHILDREN(device_get_sysctl_tree(sc->cpu_dev)), OID_AUTO, "cx_usage_counters", CTLTYPE_STRING | CTLFLAG_RD, (void *)sc, 0, acpi_cpu_usage_counters_sysctl, "A", "Cx sleep state counters"); +#if defined(__i386__) || defined(__amd64__) + SYSCTL_ADD_PROC(&sc->cpu_sysctl_ctx, + SYSCTL_CHILDREN(device_get_sysctl_tree(sc->cpu_dev)), + OID_AUTO, "cx_method", CTLTYPE_STRING | CTLFLAG_RD, + (void *)sc, 0, acpi_cpu_method_sysctl, "A", + "Cx entrance methods"); +#endif /* Signal platform that we can handle _CST notification. */ if (!cpu_cx_generic && cpu_cst_cnt != 0) { ACPI_LOCK(acpi); AcpiOsWritePort(cpu_smi_cmd, cpu_cst_cnt, 8); ACPI_UNLOCK(acpi); } } /* * Idle the CPU in the lowest state possible. This function is called with * interrupts disabled. Note that once it re-enables interrupts, a task * switch can occur so do not access shared data (i.e. the softc) after * interrupts are re-enabled. */ static void acpi_cpu_idle(sbintime_t sbt) { struct acpi_cpu_softc *sc; struct acpi_cx *cx_next; uint64_t cputicks; uint32_t start_time, end_time; int bm_active, cx_next_idx, i, us; /* * Look up our CPU id to get our softc. If it's NULL, we'll use C1 * since there is no ACPI processor object for this CPU. This occurs * for logical CPUs in the HTT case. */ sc = cpu_softc[PCPU_GET(cpuid)]; if (sc == NULL) { acpi_cpu_c1(); return; } /* If disabled, take the safe path. */ if (is_idle_disabled(sc)) { acpi_cpu_c1(); return; } /* Find the lowest state that has small enough latency. */ us = sc->cpu_prev_sleep; if (sbt >= 0 && us > (sbt >> 12)) us = (sbt >> 12); cx_next_idx = 0; if (cpu_disable_c2_sleep) i = min(sc->cpu_cx_lowest, sc->cpu_non_c2); else if (cpu_disable_c3_sleep) i = min(sc->cpu_cx_lowest, sc->cpu_non_c3); else i = sc->cpu_cx_lowest; for (; i >= 0; i--) { if (sc->cpu_cx_states[i].trans_lat * 3 <= us) { cx_next_idx = i; break; } } /* * Check for bus master activity. If there was activity, clear * the bit and use the lowest non-C3 state. Note that the USB * driver polling for new devices keeps this bit set all the * time if USB is loaded. */ if ((cpu_quirks & CPU_QUIRK_NO_BM_CTRL) == 0 && cx_next_idx > sc->cpu_non_c3) { AcpiReadBitRegister(ACPI_BITREG_BUS_MASTER_STATUS, &bm_active); if (bm_active != 0) { AcpiWriteBitRegister(ACPI_BITREG_BUS_MASTER_STATUS, 1); cx_next_idx = sc->cpu_non_c3; } } /* Select the next state and update statistics. */ cx_next = &sc->cpu_cx_states[cx_next_idx]; sc->cpu_cx_stats[cx_next_idx]++; KASSERT(cx_next->type != ACPI_STATE_C0, ("acpi_cpu_idle: C0 sleep")); /* * Execute HLT (or equivalent) and wait for an interrupt. We can't * precisely calculate the time spent in C1 since the place we wake up * is an ISR. Assume we slept no more then half of quantum, unless * we are called inside critical section, delaying context switch. */ if (cx_next->type == ACPI_STATE_C1) { cputicks = cpu_ticks(); - acpi_cpu_c1(); + if (cx_next->p_lvlx != NULL) { + /* C1 I/O then Halt */ + CPU_GET_REG(cx_next->p_lvlx, 1); + } + if (cx_next->do_mwait) + acpi_cpu_idle_mwait(cx_next->mwait_hint); + else + acpi_cpu_c1(); end_time = ((cpu_ticks() - cputicks) << 20) / cpu_tickrate(); if (curthread->td_critnest == 0) end_time = min(end_time, 500000 / hz); sc->cpu_prev_sleep = (sc->cpu_prev_sleep * 3 + end_time) / 4; return; } /* * For C3, disable bus master arbitration and enable bus master wake * if BM control is available, otherwise flush the CPU cache. */ - if (cx_next->type == ACPI_STATE_C3) { + if (cx_next->type == ACPI_STATE_C3 || cx_next->mwait_bm_avoidance) { if ((cpu_quirks & CPU_QUIRK_NO_BM_CTRL) == 0) { AcpiWriteBitRegister(ACPI_BITREG_ARB_DISABLE, 1); AcpiWriteBitRegister(ACPI_BITREG_BUS_MASTER_RLD, 1); } else ACPI_FLUSH_CPU_CACHE(); } /* * Read from P_LVLx to enter C2(+), checking time spent asleep. * Use the ACPI timer for measuring sleep time. Since we need to * get the time very close to the CPU start/stop clock logic, this * is the only reliable time source. */ if (cx_next->type == ACPI_STATE_C3) { AcpiHwRead(&start_time, &AcpiGbl_FADT.XPmTimerBlock); cputicks = 0; } else { start_time = 0; cputicks = cpu_ticks(); } - CPU_GET_REG(cx_next->p_lvlx, 1); + if (cx_next->do_mwait) + acpi_cpu_idle_mwait(cx_next->mwait_hint); + else + CPU_GET_REG(cx_next->p_lvlx, 1); /* * Read the end time twice. Since it may take an arbitrary time * to enter the idle state, the first read may be executed before * the processor has stopped. Doing it again provides enough * margin that we are certain to have a correct value. */ AcpiHwRead(&end_time, &AcpiGbl_FADT.XPmTimerBlock); if (cx_next->type == ACPI_STATE_C3) { AcpiHwRead(&end_time, &AcpiGbl_FADT.XPmTimerBlock); end_time = acpi_TimerDelta(end_time, start_time); } else end_time = ((cpu_ticks() - cputicks) << 20) / cpu_tickrate(); /* Enable bus master arbitration and disable bus master wakeup. */ - if (cx_next->type == ACPI_STATE_C3 && - (cpu_quirks & CPU_QUIRK_NO_BM_CTRL) == 0) { + if ((cx_next->type == ACPI_STATE_C3 || cx_next->mwait_bm_avoidance) && + (cpu_quirks & CPU_QUIRK_NO_BM_CTRL) == 0) { AcpiWriteBitRegister(ACPI_BITREG_ARB_DISABLE, 0); AcpiWriteBitRegister(ACPI_BITREG_BUS_MASTER_RLD, 0); } ACPI_ENABLE_IRQS(); sc->cpu_prev_sleep = (sc->cpu_prev_sleep * 3 + PM_USEC(end_time)) / 4; } /* * Re-evaluate the _CST object when we are notified that it changed. */ static void acpi_cpu_notify(ACPI_HANDLE h, UINT32 notify, void *context) { struct acpi_cpu_softc *sc = (struct acpi_cpu_softc *)context; if (notify != ACPI_NOTIFY_CX_STATES) return; /* * C-state data for target CPU is going to be in flux while we execute * acpi_cpu_cx_cst, so disable entering acpi_cpu_idle. * Also, it may happen that multiple ACPI taskqueues may concurrently * execute notifications for the same CPU. ACPI_SERIAL is used to * protect against that. */ ACPI_SERIAL_BEGIN(cpu); disable_idle(sc); /* Update the list of Cx states. */ acpi_cpu_cx_cst(sc); acpi_cpu_cx_list(sc); acpi_cpu_set_cx_lowest(sc); enable_idle(sc); ACPI_SERIAL_END(cpu); acpi_UserNotify("PROCESSOR", sc->cpu_handle, notify); } static int acpi_cpu_quirks(void) { device_t acpi_dev; uint32_t val; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); /* * Bus mastering arbitration control is needed to keep caches coherent * while sleeping in C3. If it's not present but a working flush cache * instruction is present, flush the caches before entering C3 instead. * Otherwise, just disable C3 completely. */ if (AcpiGbl_FADT.Pm2ControlBlock == 0 || AcpiGbl_FADT.Pm2ControlLength == 0) { if ((AcpiGbl_FADT.Flags & ACPI_FADT_WBINVD) && (AcpiGbl_FADT.Flags & ACPI_FADT_WBINVD_FLUSH) == 0) { cpu_quirks |= CPU_QUIRK_NO_BM_CTRL; ACPI_DEBUG_PRINT((ACPI_DB_INFO, "acpi_cpu: no BM control, using flush cache method\n")); } else { cpu_quirks |= CPU_QUIRK_NO_C3; ACPI_DEBUG_PRINT((ACPI_DB_INFO, "acpi_cpu: no BM control, C3 not available\n")); } } /* * If we are using generic Cx mode, C3 on multiple CPUs requires using * the expensive flush cache instruction. */ if (cpu_cx_generic && mp_ncpus > 1) { cpu_quirks |= CPU_QUIRK_NO_BM_CTRL; ACPI_DEBUG_PRINT((ACPI_DB_INFO, "acpi_cpu: SMP, using flush cache mode for C3\n")); } /* Look for various quirks of the PIIX4 part. */ acpi_dev = pci_find_device(PCI_VENDOR_INTEL, PCI_DEVICE_82371AB_3); if (acpi_dev != NULL) { switch (pci_get_revid(acpi_dev)) { /* * Disable C3 support for all PIIX4 chipsets. Some of these parts * do not report the BMIDE status to the BM status register and * others have a livelock bug if Type-F DMA is enabled. Linux * works around the BMIDE bug by reading the BM status directly * but we take the simpler approach of disabling C3 for these * parts. * * See erratum #18 ("C3 Power State/BMIDE and Type-F DMA * Livelock") from the January 2002 PIIX4 specification update. * Applies to all PIIX4 models. * * Also, make sure that all interrupts cause a "Stop Break" * event to exit from C2 state. * Also, BRLD_EN_BM (ACPI_BITREG_BUS_MASTER_RLD in ACPI-speak) * should be set to zero, otherwise it causes C2 to short-sleep. * PIIX4 doesn't properly support C3 and bus master activity * need not break out of C2. */ case PCI_REVISION_A_STEP: case PCI_REVISION_B_STEP: case PCI_REVISION_4E: case PCI_REVISION_4M: cpu_quirks |= CPU_QUIRK_NO_C3; ACPI_DEBUG_PRINT((ACPI_DB_INFO, "acpi_cpu: working around PIIX4 bug, disabling C3\n")); val = pci_read_config(acpi_dev, PIIX4_DEVACTB_REG, 4); if ((val & PIIX4_STOP_BREAK_MASK) != PIIX4_STOP_BREAK_MASK) { ACPI_DEBUG_PRINT((ACPI_DB_INFO, "acpi_cpu: PIIX4: enabling IRQs to generate Stop Break\n")); val |= PIIX4_STOP_BREAK_MASK; pci_write_config(acpi_dev, PIIX4_DEVACTB_REG, val, 4); } AcpiReadBitRegister(ACPI_BITREG_BUS_MASTER_RLD, &val); if (val) { ACPI_DEBUG_PRINT((ACPI_DB_INFO, "acpi_cpu: PIIX4: reset BRLD_EN_BM\n")); AcpiWriteBitRegister(ACPI_BITREG_BUS_MASTER_RLD, 0); } break; default: break; } } return (0); } static int acpi_cpu_usage_sysctl(SYSCTL_HANDLER_ARGS) { struct acpi_cpu_softc *sc; struct sbuf sb; char buf[128]; int i; uintmax_t fract, sum, whole; sc = (struct acpi_cpu_softc *) arg1; sum = 0; for (i = 0; i < sc->cpu_cx_count; i++) sum += sc->cpu_cx_stats[i]; sbuf_new(&sb, buf, sizeof(buf), SBUF_FIXEDLEN); for (i = 0; i < sc->cpu_cx_count; i++) { if (sum > 0) { whole = (uintmax_t)sc->cpu_cx_stats[i] * 100; fract = (whole % sum) * 100; sbuf_printf(&sb, "%u.%02u%% ", (u_int)(whole / sum), (u_int)(fract / sum)); } else sbuf_printf(&sb, "0.00%% "); } sbuf_printf(&sb, "last %dus", sc->cpu_prev_sleep); sbuf_trim(&sb); sbuf_finish(&sb); sysctl_handle_string(oidp, sbuf_data(&sb), sbuf_len(&sb), req); sbuf_delete(&sb); return (0); } /* * XXX TODO: actually add support to count each entry/exit * from the Cx states. */ static int acpi_cpu_usage_counters_sysctl(SYSCTL_HANDLER_ARGS) { struct acpi_cpu_softc *sc; struct sbuf sb; char buf[128]; int i; sc = (struct acpi_cpu_softc *) arg1; /* Print out the raw counters */ sbuf_new(&sb, buf, sizeof(buf), SBUF_FIXEDLEN); for (i = 0; i < sc->cpu_cx_count; i++) { sbuf_printf(&sb, "%u ", sc->cpu_cx_stats[i]); } sbuf_trim(&sb); sbuf_finish(&sb); sysctl_handle_string(oidp, sbuf_data(&sb), sbuf_len(&sb), req); sbuf_delete(&sb); return (0); } + +#if defined(__i386__) || defined(__amd64__) +static int +acpi_cpu_method_sysctl(SYSCTL_HANDLER_ARGS) +{ + struct acpi_cpu_softc *sc; + struct acpi_cx *cx; + struct sbuf sb; + char buf[128]; + int i; + + sc = (struct acpi_cpu_softc *)arg1; + sbuf_new(&sb, buf, sizeof(buf), SBUF_FIXEDLEN); + for (i = 0; i < sc->cpu_cx_count; i++) { + cx = &sc->cpu_cx_states[i]; + sbuf_printf(&sb, "C%d/", i + 1); + if (cx->do_mwait) { + sbuf_cat(&sb, "mwait"); + if (cx->mwait_hw_coord) + sbuf_cat(&sb, "/hwc"); + if (cx->mwait_bm_avoidance) + sbuf_cat(&sb, "/bma"); + } else if (cx->type == ACPI_STATE_C1) { + sbuf_cat(&sb, "hlt"); + } else { + sbuf_cat(&sb, "io"); + } + if (cx->type == ACPI_STATE_C1 && cx->p_lvlx != NULL) + sbuf_cat(&sb, "/iohlt"); + sbuf_putc(&sb, ' '); + } + sbuf_trim(&sb); + sbuf_finish(&sb); + sysctl_handle_string(oidp, sbuf_data(&sb), sbuf_len(&sb), req); + sbuf_delete(&sb); + return (0); +} +#endif static int acpi_cpu_set_cx_lowest(struct acpi_cpu_softc *sc) { int i; ACPI_SERIAL_ASSERT(cpu); sc->cpu_cx_lowest = min(sc->cpu_cx_lowest_lim, sc->cpu_cx_count - 1); /* If not disabling, cache the new lowest non-C3 state. */ sc->cpu_non_c3 = 0; for (i = sc->cpu_cx_lowest; i >= 0; i--) { if (sc->cpu_cx_states[i].type < ACPI_STATE_C3) { sc->cpu_non_c3 = i; break; } } /* Reset the statistics counters. */ bzero(sc->cpu_cx_stats, sizeof(sc->cpu_cx_stats)); return (0); } static int acpi_cpu_cx_lowest_sysctl(SYSCTL_HANDLER_ARGS) { struct acpi_cpu_softc *sc; char state[8]; int val, error; sc = (struct acpi_cpu_softc *) arg1; snprintf(state, sizeof(state), "C%d", sc->cpu_cx_lowest_lim + 1); error = sysctl_handle_string(oidp, state, sizeof(state), req); if (error != 0 || req->newptr == NULL) return (error); if (strlen(state) < 2 || toupper(state[0]) != 'C') return (EINVAL); if (strcasecmp(state, "Cmax") == 0) val = MAX_CX_STATES; else { val = (int) strtol(state + 1, NULL, 10); if (val < 1 || val > MAX_CX_STATES) return (EINVAL); } ACPI_SERIAL_BEGIN(cpu); sc->cpu_cx_lowest_lim = val - 1; acpi_cpu_set_cx_lowest(sc); ACPI_SERIAL_END(cpu); return (0); } static int acpi_cpu_global_cx_lowest_sysctl(SYSCTL_HANDLER_ARGS) { struct acpi_cpu_softc *sc; char state[8]; int val, error, i; snprintf(state, sizeof(state), "C%d", cpu_cx_lowest_lim + 1); error = sysctl_handle_string(oidp, state, sizeof(state), req); if (error != 0 || req->newptr == NULL) return (error); if (strlen(state) < 2 || toupper(state[0]) != 'C') return (EINVAL); if (strcasecmp(state, "Cmax") == 0) val = MAX_CX_STATES; else { val = (int) strtol(state + 1, NULL, 10); if (val < 1 || val > MAX_CX_STATES) return (EINVAL); } /* Update the new lowest useable Cx state for all CPUs. */ ACPI_SERIAL_BEGIN(cpu); cpu_cx_lowest_lim = val - 1; for (i = 0; i < cpu_ndevices; i++) { sc = device_get_softc(cpu_devices[i]); sc->cpu_cx_lowest_lim = cpu_cx_lowest_lim; acpi_cpu_set_cx_lowest(sc); } ACPI_SERIAL_END(cpu); return (0); } Index: projects/release-arm-redux/sys/dev/acpica/acpi_package.c =================================================================== --- projects/release-arm-redux/sys/dev/acpica/acpi_package.c (revision 282691) +++ projects/release-arm-redux/sys/dev/acpica/acpi_package.c (revision 282692) @@ -1,152 +1,174 @@ /*- * Copyright (c) 2003 Nate Lawson * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include /* * Package manipulation convenience functions */ int acpi_PkgInt(ACPI_OBJECT *res, int idx, UINT64 *dst) { ACPI_OBJECT *obj; obj = &res->Package.Elements[idx]; if (obj == NULL || obj->Type != ACPI_TYPE_INTEGER) return (EINVAL); *dst = obj->Integer.Value; return (0); } int acpi_PkgInt32(ACPI_OBJECT *res, int idx, uint32_t *dst) { UINT64 tmp; int error; error = acpi_PkgInt(res, idx, &tmp); if (error == 0) *dst = (uint32_t)tmp; return (error); } int acpi_PkgStr(ACPI_OBJECT *res, int idx, void *dst, size_t size) { ACPI_OBJECT *obj; void *ptr; size_t length; obj = &res->Package.Elements[idx]; if (obj == NULL) return (EINVAL); bzero(dst, sizeof(dst)); switch (obj->Type) { case ACPI_TYPE_STRING: ptr = obj->String.Pointer; length = obj->String.Length; break; case ACPI_TYPE_BUFFER: ptr = obj->Buffer.Pointer; length = obj->Buffer.Length; break; default: return (EINVAL); } /* Make sure string will fit, including terminating NUL */ if (++length > size) return (E2BIG); strlcpy(dst, ptr, length); return (0); } int acpi_PkgGas(device_t dev, ACPI_OBJECT *res, int idx, int *type, int *rid, struct resource **dst, u_int flags) { ACPI_GENERIC_ADDRESS gas; ACPI_OBJECT *obj; obj = &res->Package.Elements[idx]; if (obj == NULL || obj->Type != ACPI_TYPE_BUFFER || obj->Buffer.Length < sizeof(ACPI_GENERIC_ADDRESS) + 3) return (EINVAL); memcpy(&gas, obj->Buffer.Pointer + 3, sizeof(gas)); return (acpi_bus_alloc_gas(dev, type, rid, &gas, dst, flags)); } +int +acpi_PkgFFH_IntelCpu(ACPI_OBJECT *res, int idx, int *vendor, int *class, + uint64_t *address, int *accsize) +{ + ACPI_GENERIC_ADDRESS gas; + ACPI_OBJECT *obj; + + obj = &res->Package.Elements[idx]; + if (obj == NULL || obj->Type != ACPI_TYPE_BUFFER || + obj->Buffer.Length < sizeof(ACPI_GENERIC_ADDRESS) + 3) + return (EINVAL); + + memcpy(&gas, obj->Buffer.Pointer + 3, sizeof(gas)); + if (gas.SpaceId != ACPI_ADR_SPACE_FIXED_HARDWARE) + return (ERESTART); + *vendor = gas.BitWidth; + *class = gas.BitOffset; + *address = gas.Address; + *accsize = gas.AccessWidth; + return (0); +} + ACPI_HANDLE acpi_GetReference(ACPI_HANDLE scope, ACPI_OBJECT *obj) { ACPI_HANDLE h; if (obj == NULL) return (NULL); switch (obj->Type) { case ACPI_TYPE_LOCAL_REFERENCE: case ACPI_TYPE_ANY: h = obj->Reference.Handle; break; case ACPI_TYPE_STRING: /* * The String object usually contains a fully-qualified path, so * scope can be NULL. * * XXX This may not always be the case. */ if (ACPI_FAILURE(AcpiGetHandle(scope, obj->String.Pointer, &h))) h = NULL; break; default: h = NULL; break; } return (h); } Index: projects/release-arm-redux/sys/dev/acpica/acpivar.h =================================================================== --- projects/release-arm-redux/sys/dev/acpica/acpivar.h (revision 282691) +++ projects/release-arm-redux/sys/dev/acpica/acpivar.h (revision 282692) @@ -1,507 +1,509 @@ /*- * Copyright (c) 2000 Mitsuru IWASAKI * Copyright (c) 2000 Michael Smith * Copyright (c) 2000 BSDi * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _ACPIVAR_H_ #define _ACPIVAR_H_ #ifdef _KERNEL #include "acpi_if.h" #include "bus_if.h" #include #include #include #include #include #include #include #include #include struct apm_clone_data; struct acpi_softc { device_t acpi_dev; struct cdev *acpi_dev_t; int acpi_enabled; int acpi_sstate; int acpi_sleep_disabled; int acpi_resources_reserved; struct sysctl_ctx_list acpi_sysctl_ctx; struct sysctl_oid *acpi_sysctl_tree; int acpi_power_button_sx; int acpi_sleep_button_sx; int acpi_lid_switch_sx; int acpi_standby_sx; int acpi_suspend_sx; int acpi_sleep_delay; int acpi_s4bios; int acpi_do_disable; int acpi_verbose; int acpi_handle_reboot; vm_offset_t acpi_wakeaddr; vm_paddr_t acpi_wakephys; int acpi_next_sstate; /* Next suspend Sx state. */ struct apm_clone_data *acpi_clone; /* Pseudo-dev for devd(8). */ STAILQ_HEAD(,apm_clone_data) apm_cdevs; /* All apm/apmctl/acpi cdevs. */ struct callout susp_force_to; /* Force suspend if no acks. */ }; struct acpi_device { /* ACPI ivars */ ACPI_HANDLE ad_handle; void *ad_private; int ad_flags; /* Resources */ struct resource_list ad_rl; }; /* Track device (/dev/{apm,apmctl} and /dev/acpi) notification status. */ struct apm_clone_data { STAILQ_ENTRY(apm_clone_data) entries; struct cdev *cdev; int flags; #define ACPI_EVF_NONE 0 /* /dev/apm semantics */ #define ACPI_EVF_DEVD 1 /* /dev/acpi is handled via devd(8) */ #define ACPI_EVF_WRITE 2 /* Device instance is opened writable. */ int notify_status; #define APM_EV_NONE 0 /* Device not yet aware of pending sleep. */ #define APM_EV_NOTIFIED 1 /* Device saw next sleep state. */ #define APM_EV_ACKED 2 /* Device agreed sleep can occur. */ struct acpi_softc *acpi_sc; struct selinfo sel_read; }; #define ACPI_PRW_MAX_POWERRES 8 struct acpi_prw_data { ACPI_HANDLE gpe_handle; int gpe_bit; int lowest_wake; ACPI_OBJECT power_res[ACPI_PRW_MAX_POWERRES]; int power_res_count; }; /* Flags for each device defined in the AML namespace. */ #define ACPI_FLAG_WAKE_ENABLED 0x1 /* Macros for extracting parts of a PCI address from an _ADR value. */ #define ACPI_ADR_PCI_SLOT(adr) (((adr) & 0xffff0000) >> 16) #define ACPI_ADR_PCI_FUNC(adr) ((adr) & 0xffff) /* * Entry points to ACPI from above are global functions defined in this * file, sysctls, and I/O on the control device. Entry points from below * are interrupts (the SCI), notifies, task queue threads, and the thermal * zone polling thread. * * ACPI tables and global shared data are protected by a global lock * (acpi_mutex). * * Each ACPI device can have its own driver-specific mutex for protecting * shared access to local data. The ACPI_LOCK macros handle mutexes. * * Drivers that need to serialize access to functions (e.g., to route * interrupts, get/set control paths, etc.) should use the sx lock macros * (ACPI_SERIAL). * * ACPI-CA handles its own locking and should not be called with locks held. * * The most complicated path is: * GPE -> EC runs _Qxx -> _Qxx reads EC space -> GPE */ extern struct mtx acpi_mutex; #define ACPI_LOCK(sys) mtx_lock(&sys##_mutex) #define ACPI_UNLOCK(sys) mtx_unlock(&sys##_mutex) #define ACPI_LOCK_ASSERT(sys) mtx_assert(&sys##_mutex, MA_OWNED); #define ACPI_LOCK_DECL(sys, name) \ static struct mtx sys##_mutex; \ MTX_SYSINIT(sys##_mutex, &sys##_mutex, name, MTX_DEF) #define ACPI_SERIAL_BEGIN(sys) sx_xlock(&sys##_sxlock) #define ACPI_SERIAL_END(sys) sx_xunlock(&sys##_sxlock) #define ACPI_SERIAL_ASSERT(sys) sx_assert(&sys##_sxlock, SX_XLOCKED); #define ACPI_SERIAL_DECL(sys, name) \ static struct sx sys##_sxlock; \ SX_SYSINIT(sys##_sxlock, &sys##_sxlock, name) /* * ACPI CA does not define layers for non-ACPI CA drivers. * We define some here within the range provided. */ #define ACPI_AC_ADAPTER 0x00010000 #define ACPI_BATTERY 0x00020000 #define ACPI_BUS 0x00040000 #define ACPI_BUTTON 0x00080000 #define ACPI_EC 0x00100000 #define ACPI_FAN 0x00200000 #define ACPI_POWERRES 0x00400000 #define ACPI_PROCESSOR 0x00800000 #define ACPI_THERMAL 0x01000000 #define ACPI_TIMER 0x02000000 #define ACPI_OEM 0x04000000 /* * Constants for different interrupt models used with acpi_SetIntrModel(). */ #define ACPI_INTR_PIC 0 #define ACPI_INTR_APIC 1 #define ACPI_INTR_SAPIC 2 /* * Various features and capabilities for the acpi_get_features() method. * In particular, these are used for the ACPI 3.0 _PDC and _OSC methods. * See the Intel document titled "Intel Processor Vendor-Specific ACPI", * number 302223-007. */ #define ACPI_CAP_PERF_MSRS (1 << 0) /* Intel SpeedStep PERF_CTL MSRs */ #define ACPI_CAP_C1_IO_HALT (1 << 1) /* Intel C1 "IO then halt" sequence */ #define ACPI_CAP_THR_MSRS (1 << 2) /* Intel OnDemand throttling MSRs */ #define ACPI_CAP_SMP_SAME (1 << 3) /* MP C1, Px, and Tx (all the same) */ #define ACPI_CAP_SMP_SAME_C3 (1 << 4) /* MP C2 and C3 (all the same) */ #define ACPI_CAP_SMP_DIFF_PX (1 << 5) /* MP Px (different, using _PSD) */ #define ACPI_CAP_SMP_DIFF_CX (1 << 6) /* MP Cx (different, using _CSD) */ #define ACPI_CAP_SMP_DIFF_TX (1 << 7) /* MP Tx (different, using _TSD) */ #define ACPI_CAP_SMP_C1_NATIVE (1 << 8) /* MP C1 support other than halt */ #define ACPI_CAP_SMP_C3_NATIVE (1 << 9) /* MP C2 and C3 support */ #define ACPI_CAP_PX_HW_COORD (1 << 11) /* Intel P-state HW coordination */ #define ACPI_CAP_INTR_CPPC (1 << 12) /* Native Interrupt Handling for Collaborative Processor Performance Control notifications */ #define ACPI_CAP_HW_DUTY_C (1 << 13) /* Hardware Duty Cycling */ /* * Quirk flags. * * ACPI_Q_BROKEN: Disables all ACPI support. * ACPI_Q_TIMER: Disables support for the ACPI timer. * ACPI_Q_MADT_IRQ0: Specifies that ISA IRQ 0 is wired up to pin 0 of the * first APIC and that the MADT should force that by ignoring the PC-AT * compatible flag and ignoring overrides that redirect IRQ 0 to pin 2. */ extern int acpi_quirks; #define ACPI_Q_OK 0 #define ACPI_Q_BROKEN (1 << 0) #define ACPI_Q_TIMER (1 << 1) #define ACPI_Q_MADT_IRQ0 (1 << 2) /* * Note that the low ivar values are reserved to provide * interface compatibility with ISA drivers which can also * attach to ACPI. */ #define ACPI_IVAR_HANDLE 0x100 #define ACPI_IVAR_UNUSED 0x101 /* Unused/reserved. */ #define ACPI_IVAR_PRIVATE 0x102 #define ACPI_IVAR_FLAGS 0x103 /* * Accessor functions for our ivars. Default value for BUS_READ_IVAR is * (type) 0. The accessor functions don't check return values. */ #define __ACPI_BUS_ACCESSOR(varp, var, ivarp, ivar, type) \ \ static __inline type varp ## _get_ ## var(device_t dev) \ { \ uintptr_t v = 0; \ BUS_READ_IVAR(device_get_parent(dev), dev, \ ivarp ## _IVAR_ ## ivar, &v); \ return ((type) v); \ } \ \ static __inline void varp ## _set_ ## var(device_t dev, type t) \ { \ uintptr_t v = (uintptr_t) t; \ BUS_WRITE_IVAR(device_get_parent(dev), dev, \ ivarp ## _IVAR_ ## ivar, v); \ } __ACPI_BUS_ACCESSOR(acpi, handle, ACPI, HANDLE, ACPI_HANDLE) __ACPI_BUS_ACCESSOR(acpi, private, ACPI, PRIVATE, void *) __ACPI_BUS_ACCESSOR(acpi, flags, ACPI, FLAGS, int) void acpi_fake_objhandler(ACPI_HANDLE h, void *data); static __inline device_t acpi_get_device(ACPI_HANDLE handle) { void *dev = NULL; AcpiGetData(handle, acpi_fake_objhandler, &dev); return ((device_t)dev); } static __inline ACPI_OBJECT_TYPE acpi_get_type(device_t dev) { ACPI_HANDLE h; ACPI_OBJECT_TYPE t; if ((h = acpi_get_handle(dev)) == NULL) return (ACPI_TYPE_NOT_FOUND); if (ACPI_FAILURE(AcpiGetType(h, &t))) return (ACPI_TYPE_NOT_FOUND); return (t); } /* Find the difference between two PM tick counts. */ static __inline uint32_t acpi_TimerDelta(uint32_t end, uint32_t start) { if (end < start && (AcpiGbl_FADT.Flags & ACPI_FADT_32BIT_TIMER) == 0) end |= 0x01000000; return (end - start); } #ifdef ACPI_DEBUGGER void acpi_EnterDebugger(void); #endif #ifdef ACPI_DEBUG #include #define STEP(x) do {printf x, printf("\n"); cngetc();} while (0) #else #define STEP(x) #endif #define ACPI_VPRINT(dev, acpi_sc, x...) do { \ if (acpi_get_verbose(acpi_sc)) \ device_printf(dev, x); \ } while (0) /* Values for the device _STA (status) method. */ #define ACPI_STA_PRESENT (1 << 0) #define ACPI_STA_ENABLED (1 << 1) #define ACPI_STA_SHOW_IN_UI (1 << 2) #define ACPI_STA_FUNCTIONAL (1 << 3) #define ACPI_STA_BATT_PRESENT (1 << 4) #define ACPI_DEVINFO_PRESENT(x, flags) \ (((x) & (flags)) == (flags)) #define ACPI_DEVICE_PRESENT(x) \ ACPI_DEVINFO_PRESENT(x, ACPI_STA_PRESENT | ACPI_STA_FUNCTIONAL) #define ACPI_BATTERY_PRESENT(x) \ ACPI_DEVINFO_PRESENT(x, ACPI_STA_PRESENT | ACPI_STA_FUNCTIONAL | \ ACPI_STA_BATT_PRESENT) /* Callback function type for walking subtables within a table. */ typedef void acpi_subtable_handler(ACPI_SUBTABLE_HEADER *, void *); BOOLEAN acpi_DeviceIsPresent(device_t dev); BOOLEAN acpi_BatteryIsPresent(device_t dev); ACPI_STATUS acpi_GetHandleInScope(ACPI_HANDLE parent, char *path, ACPI_HANDLE *result); ACPI_BUFFER *acpi_AllocBuffer(int size); ACPI_STATUS acpi_ConvertBufferToInteger(ACPI_BUFFER *bufp, UINT32 *number); ACPI_STATUS acpi_GetInteger(ACPI_HANDLE handle, char *path, UINT32 *number); ACPI_STATUS acpi_SetInteger(ACPI_HANDLE handle, char *path, UINT32 number); ACPI_STATUS acpi_ForeachPackageObject(ACPI_OBJECT *obj, void (*func)(ACPI_OBJECT *comp, void *arg), void *arg); ACPI_STATUS acpi_FindIndexedResource(ACPI_BUFFER *buf, int index, ACPI_RESOURCE **resp); ACPI_STATUS acpi_AppendBufferResource(ACPI_BUFFER *buf, ACPI_RESOURCE *res); ACPI_STATUS acpi_OverrideInterruptLevel(UINT32 InterruptNumber); ACPI_STATUS acpi_SetIntrModel(int model); int acpi_ReqSleepState(struct acpi_softc *sc, int state); int acpi_AckSleepState(struct apm_clone_data *clone, int error); ACPI_STATUS acpi_SetSleepState(struct acpi_softc *sc, int state); int acpi_wake_set_enable(device_t dev, int enable); int acpi_parse_prw(ACPI_HANDLE h, struct acpi_prw_data *prw); ACPI_STATUS acpi_Startup(void); void acpi_UserNotify(const char *subsystem, ACPI_HANDLE h, uint8_t notify); int acpi_bus_alloc_gas(device_t dev, int *type, int *rid, ACPI_GENERIC_ADDRESS *gas, struct resource **res, u_int flags); void acpi_walk_subtables(void *first, void *end, acpi_subtable_handler *handler, void *arg); BOOLEAN acpi_MatchHid(ACPI_HANDLE h, const char *hid); struct acpi_parse_resource_set { void (*set_init)(device_t dev, void *arg, void **context); void (*set_done)(device_t dev, void *context); void (*set_ioport)(device_t dev, void *context, uint64_t base, uint64_t length); void (*set_iorange)(device_t dev, void *context, uint64_t low, uint64_t high, uint64_t length, uint64_t align); void (*set_memory)(device_t dev, void *context, uint64_t base, uint64_t length); void (*set_memoryrange)(device_t dev, void *context, uint64_t low, uint64_t high, uint64_t length, uint64_t align); void (*set_irq)(device_t dev, void *context, uint8_t *irq, int count, int trig, int pol); void (*set_ext_irq)(device_t dev, void *context, uint32_t *irq, int count, int trig, int pol); void (*set_drq)(device_t dev, void *context, uint8_t *drq, int count); void (*set_start_dependent)(device_t dev, void *context, int preference); void (*set_end_dependent)(device_t dev, void *context); }; extern struct acpi_parse_resource_set acpi_res_parse_set; int acpi_identify(void); void acpi_config_intr(device_t dev, ACPI_RESOURCE *res); ACPI_STATUS acpi_lookup_irq_resource(device_t dev, int rid, struct resource *res, ACPI_RESOURCE *acpi_res); ACPI_STATUS acpi_parse_resources(device_t dev, ACPI_HANDLE handle, struct acpi_parse_resource_set *set, void *arg); struct resource *acpi_alloc_sysres(device_t child, int type, int *rid, u_long start, u_long end, u_long count, u_int flags); /* ACPI event handling */ UINT32 acpi_event_power_button_sleep(void *context); UINT32 acpi_event_power_button_wake(void *context); UINT32 acpi_event_sleep_button_sleep(void *context); UINT32 acpi_event_sleep_button_wake(void *context); #define ACPI_EVENT_PRI_FIRST 0 #define ACPI_EVENT_PRI_DEFAULT 10000 #define ACPI_EVENT_PRI_LAST 20000 typedef void (*acpi_event_handler_t)(void *, int); EVENTHANDLER_DECLARE(acpi_sleep_event, acpi_event_handler_t); EVENTHANDLER_DECLARE(acpi_wakeup_event, acpi_event_handler_t); /* Device power control. */ ACPI_STATUS acpi_pwr_wake_enable(ACPI_HANDLE consumer, int enable); ACPI_STATUS acpi_pwr_switch_consumer(ACPI_HANDLE consumer, int state); int acpi_device_pwr_for_sleep(device_t bus, device_t dev, int *dstate); /* APM emulation */ void acpi_apm_init(struct acpi_softc *); /* Misc. */ static __inline struct acpi_softc * acpi_device_get_parent_softc(device_t child) { device_t parent; parent = device_get_parent(child); if (parent == NULL) return (NULL); return (device_get_softc(parent)); } static __inline int acpi_get_verbose(struct acpi_softc *sc) { if (sc) return (sc->acpi_verbose); return (0); } char *acpi_name(ACPI_HANDLE handle); int acpi_avoid(ACPI_HANDLE handle); int acpi_disabled(char *subsys); int acpi_machdep_init(device_t dev); void acpi_install_wakeup_handler(struct acpi_softc *sc); int acpi_sleep_machdep(struct acpi_softc *sc, int state); int acpi_wakeup_machdep(struct acpi_softc *sc, int state, int sleep_result, int intr_enabled); int acpi_table_quirks(int *quirks); int acpi_machdep_quirks(int *quirks); /* Battery Abstraction. */ struct acpi_battinfo; int acpi_battery_register(device_t dev); int acpi_battery_remove(device_t dev); int acpi_battery_get_units(void); int acpi_battery_get_info_expire(void); int acpi_battery_bst_valid(struct acpi_bst *bst); int acpi_battery_bif_valid(struct acpi_bif *bif); int acpi_battery_get_battinfo(device_t dev, struct acpi_battinfo *info); /* Embedded controller. */ void acpi_ec_ecdt_probe(device_t); /* AC adapter interface. */ int acpi_acad_get_acline(int *); /* Package manipulation convenience functions. */ #define ACPI_PKG_VALID(pkg, size) \ ((pkg) != NULL && (pkg)->Type == ACPI_TYPE_PACKAGE && \ (pkg)->Package.Count >= (size)) int acpi_PkgInt(ACPI_OBJECT *res, int idx, UINT64 *dst); int acpi_PkgInt32(ACPI_OBJECT *res, int idx, uint32_t *dst); int acpi_PkgStr(ACPI_OBJECT *res, int idx, void *dst, size_t size); int acpi_PkgGas(device_t dev, ACPI_OBJECT *res, int idx, int *type, int *rid, struct resource **dst, u_int flags); +int acpi_PkgFFH_IntelCpu(ACPI_OBJECT *res, int idx, int *vendor, + int *class, uint64_t *address, int *accsize); ACPI_HANDLE acpi_GetReference(ACPI_HANDLE scope, ACPI_OBJECT *obj); /* * Base level for BUS_ADD_CHILD. Special devices are added at orders less * than this, and normal devices at or above this level. This keeps the * probe order sorted so that things like sysresource are available before * their children need them. */ #define ACPI_DEV_BASE_ORDER 100 /* Default maximum number of tasks to enqueue. */ #ifndef ACPI_MAX_TASKS #define ACPI_MAX_TASKS MAX(32, MAXCPU * 4) #endif /* Default number of task queue threads to start. */ #ifndef ACPI_MAX_THREADS #define ACPI_MAX_THREADS 3 #endif /* Use the device logging level for ktr(4). */ #define KTR_ACPI KTR_DEV SYSCTL_DECL(_debug_acpi); /* * Map a PXM to a VM domain. * * Returns the VM domain ID if found, or -1 if not found / invalid. */ #if MAXMEMDOM > 1 extern int acpi_map_pxm_to_vm_domainid(int pxm); #endif extern int acpi_get_domain(device_t dev, device_t child, int *domain); extern int acpi_parse_pxm(device_t dev, int *domain); #endif /* _KERNEL */ #endif /* !_ACPIVAR_H_ */ Index: projects/release-arm-redux/sys/dev/hwpmc/hwpmc_armv7.c =================================================================== --- projects/release-arm-redux/sys/dev/hwpmc/hwpmc_armv7.c (revision 282691) +++ projects/release-arm-redux/sys/dev/hwpmc/hwpmc_armv7.c (revision 282692) @@ -1,564 +1,564 @@ /*- * Copyright (c) 2015 Ruslan Bukin * All rights reserved. * * This software was developed by SRI International and the University of * Cambridge Computer Laboratory under DARPA/AFRL contract (FA8750-10-C-0237) * ("CTSRD"), as part of the DARPA CRASH research programme. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #define CPU_ID_CORTEX_VER_MASK 0xff #define CPU_ID_CORTEX_VER_SHIFT 4 static int armv7_npmcs; struct armv7_event_code_map { enum pmc_event pe_ev; uint8_t pe_code; }; const struct armv7_event_code_map armv7_event_codes[] = { { PMC_EV_ARMV7_PMNC_SW_INCR, 0x00 }, { PMC_EV_ARMV7_L1_ICACHE_REFILL, 0x01 }, { PMC_EV_ARMV7_ITLB_REFILL, 0x02 }, { PMC_EV_ARMV7_L1_DCACHE_REFILL, 0x03 }, { PMC_EV_ARMV7_L1_DCACHE_ACCESS, 0x04 }, { PMC_EV_ARMV7_DTLB_REFILL, 0x05 }, { PMC_EV_ARMV7_MEM_READ, 0x06 }, { PMC_EV_ARMV7_MEM_WRITE, 0x07 }, { PMC_EV_ARMV7_INSTR_EXECUTED, 0x08 }, { PMC_EV_ARMV7_EXC_TAKEN, 0x09 }, { PMC_EV_ARMV7_EXC_EXECUTED, 0x0A }, { PMC_EV_ARMV7_CID_WRITE, 0x0B }, { PMC_EV_ARMV7_PC_WRITE, 0x0C }, { PMC_EV_ARMV7_PC_IMM_BRANCH, 0x0D }, { PMC_EV_ARMV7_PC_PROC_RETURN, 0x0E }, { PMC_EV_ARMV7_MEM_UNALIGNED_ACCESS, 0x0F }, { PMC_EV_ARMV7_PC_BRANCH_MIS_PRED, 0x10 }, { PMC_EV_ARMV7_CLOCK_CYCLES, 0x11 }, { PMC_EV_ARMV7_PC_BRANCH_PRED, 0x12 }, { PMC_EV_ARMV7_MEM_ACCESS, 0x13 }, { PMC_EV_ARMV7_L1_ICACHE_ACCESS, 0x14 }, { PMC_EV_ARMV7_L1_DCACHE_WB, 0x15 }, { PMC_EV_ARMV7_L2_CACHE_ACCESS, 0x16 }, { PMC_EV_ARMV7_L2_CACHE_REFILL, 0x17 }, { PMC_EV_ARMV7_L2_CACHE_WB, 0x18 }, { PMC_EV_ARMV7_BUS_ACCESS, 0x19 }, { PMC_EV_ARMV7_MEM_ERROR, 0x1A }, { PMC_EV_ARMV7_INSTR_SPEC, 0x1B }, { PMC_EV_ARMV7_TTBR_WRITE, 0x1C }, { PMC_EV_ARMV7_BUS_CYCLES, 0x1D }, { PMC_EV_ARMV7_CPU_CYCLES, 0xFF }, }; const int armv7_event_codes_size = sizeof(armv7_event_codes) / sizeof(armv7_event_codes[0]); /* * Per-processor information. */ struct armv7_cpu { struct pmc_hw *pc_armv7pmcs; int cortex_ver; }; static struct armv7_cpu **armv7_pcpu; /* * Interrupt Enable Set Register */ static __inline void armv7_interrupt_enable(uint32_t pmc) { uint32_t reg; reg = (1 << pmc); cp15_pminten_set(reg); } /* * Interrupt Clear Set Register */ static __inline void armv7_interrupt_disable(uint32_t pmc) { uint32_t reg; reg = (1 << pmc); cp15_pminten_clr(reg); } /* * Counter Set Enable Register */ static __inline void armv7_counter_enable(unsigned int pmc) { uint32_t reg; reg = (1 << pmc); cp15_pmcnten_set(reg); } /* * Counter Clear Enable Register */ static __inline void armv7_counter_disable(unsigned int pmc) { uint32_t reg; reg = (1 << pmc); cp15_pmcnten_clr(reg); } /* * Performance Count Register N */ static uint32_t armv7_pmcn_read(unsigned int pmc) { KASSERT(pmc < armv7_npmcs, ("%s: illegal PMC number %d", __func__, pmc)); cp15_pmselr_set(pmc); return (cp15_pmxevcntr_get()); } static uint32_t armv7_pmcn_write(unsigned int pmc, uint32_t reg) { KASSERT(pmc < armv7_npmcs, ("%s: illegal PMC number %d", __func__, pmc)); cp15_pmselr_set(pmc); cp15_pmxevcntr_set(reg); return (reg); } static int armv7_allocate_pmc(int cpu, int ri, struct pmc *pm, const struct pmc_op_pmcallocate *a) { uint32_t caps, config; struct armv7_cpu *pac; enum pmc_event pe; int i; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[armv7,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < armv7_npmcs, ("[armv7,%d] illegal row index %d", __LINE__, ri)); pac = armv7_pcpu[cpu]; caps = a->pm_caps; if (a->pm_class != PMC_CLASS_ARMV7) return (EINVAL); pe = a->pm_ev; for (i = 0; i < armv7_event_codes_size; i++) { if (armv7_event_codes[i].pe_ev == pe) { config = armv7_event_codes[i].pe_code; break; } } if (i == armv7_event_codes_size) return EINVAL; pm->pm_md.pm_armv7.pm_armv7_evsel = config; - PMCDBG(MDP,ALL,2,"armv7-allocate ri=%d -> config=0x%x", ri, config); + PMCDBG2(MDP,ALL,2,"armv7-allocate ri=%d -> config=0x%x", ri, config); return 0; } static int armv7_read_pmc(int cpu, int ri, pmc_value_t *v) { pmc_value_t tmp; struct pmc *pm; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[armv7,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < armv7_npmcs, ("[armv7,%d] illegal row index %d", __LINE__, ri)); pm = armv7_pcpu[cpu]->pc_armv7pmcs[ri].phw_pmc; if (pm->pm_md.pm_armv7.pm_armv7_evsel == 0xFF) tmp = cp15_pmccntr_get(); else tmp = armv7_pmcn_read(ri); - PMCDBG(MDP,REA,2,"armv7-read id=%d -> %jd", ri, tmp); + PMCDBG2(MDP,REA,2,"armv7-read id=%d -> %jd", ri, tmp); if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) *v = ARMV7_PERFCTR_VALUE_TO_RELOAD_COUNT(tmp); else *v = tmp; return 0; } static int armv7_write_pmc(int cpu, int ri, pmc_value_t v) { struct pmc *pm; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[armv7,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < armv7_npmcs, ("[armv7,%d] illegal row-index %d", __LINE__, ri)); pm = armv7_pcpu[cpu]->pc_armv7pmcs[ri].phw_pmc; if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) v = ARMV7_RELOAD_COUNT_TO_PERFCTR_VALUE(v); - PMCDBG(MDP,WRI,1,"armv7-write cpu=%d ri=%d v=%jx", cpu, ri, v); + PMCDBG3(MDP,WRI,1,"armv7-write cpu=%d ri=%d v=%jx", cpu, ri, v); if (pm->pm_md.pm_armv7.pm_armv7_evsel == 0xFF) cp15_pmccntr_set(v); else armv7_pmcn_write(ri, v); return 0; } static int armv7_config_pmc(int cpu, int ri, struct pmc *pm) { struct pmc_hw *phw; - PMCDBG(MDP,CFG,1, "cpu=%d ri=%d pm=%p", cpu, ri, pm); + PMCDBG3(MDP,CFG,1, "cpu=%d ri=%d pm=%p", cpu, ri, pm); KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[armv7,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < armv7_npmcs, ("[armv7,%d] illegal row-index %d", __LINE__, ri)); phw = &armv7_pcpu[cpu]->pc_armv7pmcs[ri]; KASSERT(pm == NULL || phw->phw_pmc == NULL, ("[armv7,%d] pm=%p phw->pm=%p hwpmc not unconfigured", __LINE__, pm, phw->phw_pmc)); phw->phw_pmc = pm; return 0; } static int armv7_start_pmc(int cpu, int ri) { struct pmc_hw *phw; uint32_t config; struct pmc *pm; phw = &armv7_pcpu[cpu]->pc_armv7pmcs[ri]; pm = phw->phw_pmc; config = pm->pm_md.pm_armv7.pm_armv7_evsel; /* * Configure the event selection. */ cp15_pmselr_set(ri); cp15_pmxevtyper_set(config); /* * Enable the PMC. */ armv7_interrupt_enable(ri); armv7_counter_enable(ri); return 0; } static int armv7_stop_pmc(int cpu, int ri) { struct pmc_hw *phw; struct pmc *pm; phw = &armv7_pcpu[cpu]->pc_armv7pmcs[ri]; pm = phw->phw_pmc; /* * Disable the PMCs. */ armv7_counter_disable(ri); armv7_interrupt_disable(ri); return 0; } static int armv7_release_pmc(int cpu, int ri, struct pmc *pmc) { struct pmc_hw *phw; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[armv7,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < armv7_npmcs, ("[armv7,%d] illegal row-index %d", __LINE__, ri)); phw = &armv7_pcpu[cpu]->pc_armv7pmcs[ri]; KASSERT(phw->phw_pmc == NULL, ("[armv7,%d] PHW pmc %p non-NULL", __LINE__, phw->phw_pmc)); return 0; } static int armv7_intr(int cpu, struct trapframe *tf) { struct armv7_cpu *pc; int retval, ri; struct pmc *pm; int error; int reg; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[armv7,%d] CPU %d out of range", __LINE__, cpu)); retval = 0; pc = armv7_pcpu[cpu]; for (ri = 0; ri < armv7_npmcs; ri++) { pm = armv7_pcpu[cpu]->pc_armv7pmcs[ri].phw_pmc; if (pm == NULL) continue; if (!PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) continue; /* Check if counter has overflowed */ if (pm->pm_md.pm_armv7.pm_armv7_evsel == 0xFF) reg = (1 << 31); else reg = (1 << ri); if ((cp15_pmovsr_get() & reg) == 0) { continue; } /* Clear Overflow Flag */ cp15_pmovsr_set(reg); retval = 1; /* Found an interrupting PMC. */ if (pm->pm_state != PMC_STATE_RUNNING) continue; error = pmc_process_interrupt(cpu, PMC_HR, pm, tf, TRAPF_USERMODE(tf)); if (error) armv7_stop_pmc(cpu, ri); /* Reload sampling count */ armv7_write_pmc(cpu, ri, pm->pm_sc.pm_reloadcount); } return (retval); } static int armv7_describe(int cpu, int ri, struct pmc_info *pi, struct pmc **ppmc) { char armv7_name[PMC_NAME_MAX]; struct pmc_hw *phw; int error; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[armv7,%d], illegal CPU %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < armv7_npmcs, ("[armv7,%d] row-index %d out of range", __LINE__, ri)); phw = &armv7_pcpu[cpu]->pc_armv7pmcs[ri]; snprintf(armv7_name, sizeof(armv7_name), "ARMV7-%d", ri); if ((error = copystr(armv7_name, pi->pm_name, PMC_NAME_MAX, NULL)) != 0) return error; pi->pm_class = PMC_CLASS_ARMV7; if (phw->phw_state & PMC_PHW_FLAG_IS_ENABLED) { pi->pm_enabled = TRUE; *ppmc = phw->phw_pmc; } else { pi->pm_enabled = FALSE; *ppmc = NULL; } return (0); } static int armv7_get_config(int cpu, int ri, struct pmc **ppm) { *ppm = armv7_pcpu[cpu]->pc_armv7pmcs[ri].phw_pmc; return 0; } /* * XXX don't know what we should do here. */ static int armv7_switch_in(struct pmc_cpu *pc, struct pmc_process *pp) { return 0; } static int armv7_switch_out(struct pmc_cpu *pc, struct pmc_process *pp) { return 0; } static int armv7_pcpu_init(struct pmc_mdep *md, int cpu) { struct armv7_cpu *pac; struct pmc_hw *phw; struct pmc_cpu *pc; uint32_t pmnc; int first_ri; int cpuid; int i; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[armv7,%d] wrong cpu number %d", __LINE__, cpu)); - PMCDBG(MDP,INI,1,"armv7-init cpu=%d", cpu); + PMCDBG1(MDP,INI,1,"armv7-init cpu=%d", cpu); armv7_pcpu[cpu] = pac = malloc(sizeof(struct armv7_cpu), M_PMC, M_WAITOK|M_ZERO); cpuid = cpu_ident(); pac->cortex_ver = (cpuid >> CPU_ID_CORTEX_VER_SHIFT) & \ CPU_ID_CORTEX_VER_MASK; pac->pc_armv7pmcs = malloc(sizeof(struct pmc_hw) * armv7_npmcs, M_PMC, M_WAITOK|M_ZERO); pc = pmc_pcpu[cpu]; first_ri = md->pmd_classdep[PMC_MDEP_CLASS_INDEX_ARMV7].pcd_ri; KASSERT(pc != NULL, ("[armv7,%d] NULL per-cpu pointer", __LINE__)); for (i = 0, phw = pac->pc_armv7pmcs; i < armv7_npmcs; i++, phw++) { phw->phw_state = PMC_PHW_FLAG_IS_ENABLED | PMC_PHW_CPU_TO_STATE(cpu) | PMC_PHW_INDEX_TO_STATE(i); phw->phw_pmc = NULL; pc->pc_hwpmcs[i + first_ri] = phw; } /* Enable unit */ pmnc = cp15_pmcr_get(); pmnc |= ARMV7_PMNC_ENABLE; cp15_pmcr_set(pmnc); return 0; } static int armv7_pcpu_fini(struct pmc_mdep *md, int cpu) { uint32_t pmnc; pmnc = cp15_pmcr_get(); pmnc &= ~ARMV7_PMNC_ENABLE; cp15_pmcr_set(pmnc); return 0; } struct pmc_mdep * pmc_armv7_initialize() { struct pmc_mdep *pmc_mdep; struct pmc_classdep *pcd; int reg; reg = cp15_pmcr_get(); armv7_npmcs = (reg >> ARMV7_PMNC_N_SHIFT) & \ ARMV7_PMNC_N_MASK; - PMCDBG(MDP,INI,1,"armv7-init npmcs=%d", armv7_npmcs); + PMCDBG1(MDP,INI,1,"armv7-init npmcs=%d", armv7_npmcs); /* * Allocate space for pointers to PMC HW descriptors and for * the MDEP structure used by MI code. */ armv7_pcpu = malloc(sizeof(struct armv7_cpu *) * pmc_cpu_max(), M_PMC, M_WAITOK | M_ZERO); /* Just one class */ pmc_mdep = pmc_mdep_alloc(1); pmc_mdep->pmd_cputype = PMC_CPU_ARMV7; pcd = &pmc_mdep->pmd_classdep[PMC_MDEP_CLASS_INDEX_ARMV7]; pcd->pcd_caps = ARMV7_PMC_CAPS; pcd->pcd_class = PMC_CLASS_ARMV7; pcd->pcd_num = armv7_npmcs; pcd->pcd_ri = pmc_mdep->pmd_npmc; pcd->pcd_width = 32; pcd->pcd_allocate_pmc = armv7_allocate_pmc; pcd->pcd_config_pmc = armv7_config_pmc; pcd->pcd_pcpu_fini = armv7_pcpu_fini; pcd->pcd_pcpu_init = armv7_pcpu_init; pcd->pcd_describe = armv7_describe; pcd->pcd_get_config = armv7_get_config; pcd->pcd_read_pmc = armv7_read_pmc; pcd->pcd_release_pmc = armv7_release_pmc; pcd->pcd_start_pmc = armv7_start_pmc; pcd->pcd_stop_pmc = armv7_stop_pmc; pcd->pcd_write_pmc = armv7_write_pmc; pmc_mdep->pmd_intr = armv7_intr; pmc_mdep->pmd_switch_in = armv7_switch_in; pmc_mdep->pmd_switch_out = armv7_switch_out; pmc_mdep->pmd_npmc += armv7_npmcs; return (pmc_mdep); } void pmc_armv7_finalize(struct pmc_mdep *md) { } Index: projects/release-arm-redux/sys/dev/hwpmc/hwpmc_e500.c =================================================================== --- projects/release-arm-redux/sys/dev/hwpmc/hwpmc_e500.c (revision 282691) +++ projects/release-arm-redux/sys/dev/hwpmc/hwpmc_e500.c (revision 282692) @@ -1,660 +1,660 @@ /*- * Copyright (c) 2015 Justin Hibbits * Copyright (c) 2005, Joseph Koshy * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include "hwpmc_powerpc.h" #define POWERPC_PMC_CAPS (PMC_CAP_INTERRUPT | PMC_CAP_USER | \ PMC_CAP_SYSTEM | PMC_CAP_EDGE | \ PMC_CAP_THRESHOLD | PMC_CAP_READ | \ PMC_CAP_WRITE | PMC_CAP_INVERT | \ PMC_CAP_QUALIFIER) #define E500_PMC_HAS_OVERFLOWED(x) (e500_pmcn_read(x) & (0x1 << 31)) struct e500_event_code_map { enum pmc_event pe_ev; /* enum value */ uint8_t pe_counter_mask; /* Which counter this can be counted in. */ uint8_t pe_code; /* numeric code */ uint8_t pe_cpu; /* e500 core (v1,v2,mc), mask */ }; #define E500_MAX_PMCS 4 #define PMC_PPC_MASK0 0 #define PMC_PPC_MASK1 1 #define PMC_PPC_MASK2 2 #define PMC_PPC_MASK3 3 #define PMC_PPC_MASK_ALL 0x0f #define PMC_PPC_E500V1 1 #define PMC_PPC_E500V2 2 #define PMC_PPC_E500MC 4 #define PMC_PPC_E500_ANY 7 #define PMC_E500_EVENT(id, mask, number, core) \ [PMC_EV_E500_##id - PMC_EV_E500_FIRST] = \ { .pe_ev = PMC_EV_E500_##id, .pe_counter_mask = mask, \ .pe_code = number, .pe_cpu = core } #define PMC_E500MC_ONLY(id, number) \ PMC_E500_EVENT(id, PMC_PPC_MASK_ALL, number, PMC_PPC_E500MC) #define PMC_E500_COMMON(id, number) \ PMC_E500_EVENT(id, PMC_PPC_MASK_ALL, number, PMC_PPC_E500_ANY) static struct e500_event_code_map e500_event_codes[] = { PMC_E500_COMMON(CYCLES, 1), PMC_E500_COMMON(INSTR_COMPLETED, 2), PMC_E500_COMMON(UOPS_COMPLETED, 3), PMC_E500_COMMON(INSTR_FETCHED, 4), PMC_E500_COMMON(UOPS_DECODED, 5), PMC_E500_COMMON(PM_EVENT_TRANSITIONS, 6), PMC_E500_COMMON(PM_EVENT_CYCLES, 7), PMC_E500_COMMON(BRANCH_INSTRS_COMPLETED, 8), PMC_E500_COMMON(LOAD_UOPS_COMPLETED, 9), PMC_E500_COMMON(STORE_UOPS_COMPLETED, 10), PMC_E500_COMMON(CQ_REDIRECTS, 11), PMC_E500_COMMON(BRANCHES_FINISHED, 12), PMC_E500_COMMON(TAKEN_BRANCHES_FINISHED, 13), PMC_E500_COMMON(FINISHED_UNCOND_BRANCHES_MISS_BTB, 14), PMC_E500_COMMON(BRANCH_MISPRED, 15), PMC_E500_COMMON(BTB_BRANCH_MISPRED_FROM_DIRECTION, 16), PMC_E500_COMMON(BTB_HITS_PSEUDO_HITS, 17), PMC_E500_COMMON(CYCLES_DECODE_STALLED, 18), PMC_E500_COMMON(CYCLES_ISSUE_STALLED, 19), PMC_E500_COMMON(CYCLES_BRANCH_ISSUE_STALLED, 20), PMC_E500_COMMON(CYCLES_SU1_SCHED_STALLED, 21), PMC_E500_COMMON(CYCLES_SU2_SCHED_STALLED, 22), PMC_E500_COMMON(CYCLES_MU_SCHED_STALLED, 23), PMC_E500_COMMON(CYCLES_LRU_SCHED_STALLED, 24), PMC_E500_COMMON(CYCLES_BU_SCHED_STALLED, 25), PMC_E500_COMMON(TOTAL_TRANSLATED, 26), PMC_E500_COMMON(LOADS_TRANSLATED, 27), PMC_E500_COMMON(STORES_TRANSLATED, 28), PMC_E500_COMMON(TOUCHES_TRANSLATED, 29), PMC_E500_COMMON(CACHEOPS_TRANSLATED, 30), PMC_E500_COMMON(CACHE_INHIBITED_ACCESS_TRANSLATED, 31), PMC_E500_COMMON(GUARDED_LOADS_TRANSLATED, 32), PMC_E500_COMMON(WRITE_THROUGH_STORES_TRANSLATED, 33), PMC_E500_COMMON(MISALIGNED_LOAD_STORE_ACCESS_TRANSLATED, 34), PMC_E500_COMMON(TOTAL_ALLOCATED_TO_DLFB, 35), PMC_E500_COMMON(LOADS_TRANSLATED_ALLOCATED_TO_DLFB, 36), PMC_E500_COMMON(STORES_COMPLETED_ALLOCATED_TO_DLFB, 37), PMC_E500_COMMON(TOUCHES_TRANSLATED_ALLOCATED_TO_DLFB, 38), PMC_E500_COMMON(STORES_COMPLETED, 39), PMC_E500_COMMON(DATA_L1_CACHE_LOCKS, 40), PMC_E500_COMMON(DATA_L1_CACHE_RELOADS, 41), PMC_E500_COMMON(DATA_L1_CACHE_CASTOUTS, 42), PMC_E500_COMMON(LOAD_MISS_DLFB_FULL, 43), PMC_E500_COMMON(LOAD_MISS_LDQ_FULL, 44), PMC_E500_COMMON(LOAD_GUARDED_MISS, 45), PMC_E500_COMMON(STORE_TRANSLATE_WHEN_QUEUE_FULL, 46), PMC_E500_COMMON(ADDRESS_COLLISION, 47), PMC_E500_COMMON(DATA_MMU_MISS, 48), PMC_E500_COMMON(DATA_MMU_BUSY, 49), PMC_E500_COMMON(PART2_MISALIGNED_CACHE_ACCESS, 50), PMC_E500_COMMON(LOAD_MISS_DLFB_FULL_CYCLES, 51), PMC_E500_COMMON(LOAD_MISS_LDQ_FULL_CYCLES, 52), PMC_E500_COMMON(LOAD_GUARDED_MISS_CYCLES, 53), PMC_E500_COMMON(STORE_TRANSLATE_WHEN_QUEUE_FULL_CYCLES, 54), PMC_E500_COMMON(ADDRESS_COLLISION_CYCLES, 55), PMC_E500_COMMON(DATA_MMU_MISS_CYCLES, 56), PMC_E500_COMMON(DATA_MMU_BUSY_CYCLES, 57), PMC_E500_COMMON(PART2_MISALIGNED_CACHE_ACCESS_CYCLES, 58), PMC_E500_COMMON(INSTR_L1_CACHE_LOCKS, 59), PMC_E500_COMMON(INSTR_L1_CACHE_RELOADS, 60), PMC_E500_COMMON(INSTR_L1_CACHE_FETCHES, 61), PMC_E500_COMMON(INSTR_MMU_TLB4K_RELOADS, 62), PMC_E500_COMMON(INSTR_MMU_VSP_RELOADS, 63), PMC_E500_COMMON(DATA_MMU_TLB4K_RELOADS, 64), PMC_E500_COMMON(DATA_MMU_VSP_RELOADS, 65), PMC_E500_COMMON(L2MMU_MISSES, 66), PMC_E500_COMMON(BIU_MASTER_REQUESTS, 67), PMC_E500_COMMON(BIU_MASTER_INSTR_SIDE_REQUESTS, 68), PMC_E500_COMMON(BIU_MASTER_DATA_SIDE_REQUESTS, 69), PMC_E500_COMMON(BIU_MASTER_DATA_SIDE_CASTOUT_REQUESTS, 70), PMC_E500_COMMON(BIU_MASTER_RETRIES, 71), PMC_E500_COMMON(SNOOP_REQUESTS, 72), PMC_E500_COMMON(SNOOP_HITS, 73), PMC_E500_COMMON(SNOOP_PUSHES, 74), PMC_E500_COMMON(SNOOP_RETRIES, 75), PMC_E500_EVENT(DLFB_LOAD_MISS_CYCLES, PMC_PPC_MASK0|PMC_PPC_MASK1, 76, PMC_PPC_E500_ANY), PMC_E500_EVENT(ILFB_FETCH_MISS_CYCLES, PMC_PPC_MASK0|PMC_PPC_MASK1, 77, PMC_PPC_E500_ANY), PMC_E500_EVENT(EXT_INPU_INTR_LATENCY_CYCLES, PMC_PPC_MASK0|PMC_PPC_MASK1, 78, PMC_PPC_E500_ANY), PMC_E500_EVENT(CRIT_INPUT_INTR_LATENCY_CYCLES, PMC_PPC_MASK0|PMC_PPC_MASK1, 79, PMC_PPC_E500_ANY), PMC_E500_EVENT(EXT_INPUT_INTR_PENDING_LATENCY_CYCLES, PMC_PPC_MASK0|PMC_PPC_MASK1, 80, PMC_PPC_E500_ANY), PMC_E500_EVENT(CRIT_INPUT_INTR_PENDING_LATENCY_CYCLES, PMC_PPC_MASK0|PMC_PPC_MASK1, 81, PMC_PPC_E500_ANY), PMC_E500_COMMON(PMC0_OVERFLOW, 82), PMC_E500_COMMON(PMC1_OVERFLOW, 83), PMC_E500_COMMON(PMC2_OVERFLOW, 84), PMC_E500_COMMON(PMC3_OVERFLOW, 85), PMC_E500_COMMON(INTERRUPTS_TAKEN, 86), PMC_E500_COMMON(EXT_INPUT_INTR_TAKEN, 87), PMC_E500_COMMON(CRIT_INPUT_INTR_TAKEN, 88), PMC_E500_COMMON(SYSCALL_TRAP_INTR, 89), PMC_E500_EVENT(TLB_BIT_TRANSITIONS, PMC_PPC_MASK_ALL, 90, PMC_PPC_E500V2 | PMC_PPC_E500MC), PMC_E500MC_ONLY(L2_LINEFILL_BUFFER, 91), PMC_E500MC_ONLY(LV2_VS, 92), PMC_E500MC_ONLY(CASTOUTS_RELEASED, 93), PMC_E500MC_ONLY(INTV_ALLOCATIONS, 94), PMC_E500MC_ONLY(DLFB_RETRIES_TO_MBAR, 95), PMC_E500MC_ONLY(STORE_RETRIES, 96), PMC_E500MC_ONLY(STASH_L1_HITS, 97), PMC_E500MC_ONLY(STASH_L2_HITS, 98), PMC_E500MC_ONLY(STASH_BUSY_1, 99), PMC_E500MC_ONLY(STASH_BUSY_2, 100), PMC_E500MC_ONLY(STASH_BUSY_3, 101), PMC_E500MC_ONLY(STASH_HITS, 102), PMC_E500MC_ONLY(STASH_HIT_DLFB, 103), PMC_E500MC_ONLY(STASH_REQUESTS, 106), PMC_E500MC_ONLY(STASH_REQUESTS_L1, 107), PMC_E500MC_ONLY(STASH_REQUESTS_L2, 108), PMC_E500MC_ONLY(STALLS_NO_CAQ_OR_COB, 109), PMC_E500MC_ONLY(L2_CACHE_ACCESSES, 110), PMC_E500MC_ONLY(L2_HIT_CACHE_ACCESSES, 111), PMC_E500MC_ONLY(L2_CACHE_DATA_ACCESSES, 112), PMC_E500MC_ONLY(L2_CACHE_DATA_HITS, 113), PMC_E500MC_ONLY(L2_CACHE_INSTR_ACCESSES, 114), PMC_E500MC_ONLY(L2_CACHE_INSTR_HITS, 115), PMC_E500MC_ONLY(L2_CACHE_ALLOCATIONS, 116), PMC_E500MC_ONLY(L2_CACHE_DATA_ALLOCATIONS, 117), PMC_E500MC_ONLY(L2_CACHE_DIRTY_DATA_ALLOCATIONS, 118), PMC_E500MC_ONLY(L2_CACHE_INSTR_ALLOCATIONS, 119), PMC_E500MC_ONLY(L2_CACHE_UPDATES, 120), PMC_E500MC_ONLY(L2_CACHE_CLEAN_UPDATES, 121), PMC_E500MC_ONLY(L2_CACHE_DIRTY_UPDATES, 122), PMC_E500MC_ONLY(L2_CACHE_CLEAN_REDUNDANT_UPDATES, 123), PMC_E500MC_ONLY(L2_CACHE_DIRTY_REDUNDANT_UPDATES, 124), PMC_E500MC_ONLY(L2_CACHE_LOCKS, 125), PMC_E500MC_ONLY(L2_CACHE_CASTOUTS, 126), PMC_E500MC_ONLY(L2_CACHE_DATA_DIRTY_HITS, 127), PMC_E500MC_ONLY(INSTR_LFB_WENT_HIGH_PRIORITY, 128), PMC_E500MC_ONLY(SNOOP_THROTTLING_TURNED_ON, 129), PMC_E500MC_ONLY(L2_CLEAN_LINE_INVALIDATIONS, 130), PMC_E500MC_ONLY(L2_INCOHERENT_LINE_INVALIDATIONS, 131), PMC_E500MC_ONLY(L2_COHERENT_LINE_INVALIDATIONS, 132), PMC_E500MC_ONLY(COHERENT_LOOKUP_MISS_DUE_TO_VALID_BUT_INCOHERENT_MATCHES, 133), PMC_E500MC_ONLY(IAC1S_DETECTED, 140), PMC_E500MC_ONLY(IAC2S_DETECTED, 141), PMC_E500MC_ONLY(DAC1S_DTECTED, 144), PMC_E500MC_ONLY(DAC2S_DTECTED, 145), PMC_E500MC_ONLY(DVT0_DETECTED, 148), PMC_E500MC_ONLY(DVT1_DETECTED, 149), PMC_E500MC_ONLY(DVT2_DETECTED, 150), PMC_E500MC_ONLY(DVT3_DETECTED, 151), PMC_E500MC_ONLY(DVT4_DETECTED, 152), PMC_E500MC_ONLY(DVT5_DETECTED, 153), PMC_E500MC_ONLY(DVT6_DETECTED, 154), PMC_E500MC_ONLY(DVT7_DETECTED, 155), PMC_E500MC_ONLY(CYCLES_COMPLETION_STALLED_NEXUS_FIFO_FULL, 156), PMC_E500MC_ONLY(FPU_DOUBLE_PUMP, 160), PMC_E500MC_ONLY(FPU_FINISH, 161), PMC_E500MC_ONLY(FPU_DIVIDE_CYCLES, 162), PMC_E500MC_ONLY(FPU_DENORM_INPUT_CYCLES, 163), PMC_E500MC_ONLY(FPU_RESULT_STALL_CYCLES, 164), PMC_E500MC_ONLY(FPU_FPSCR_FULL_STALL, 165), PMC_E500MC_ONLY(FPU_PIPE_SYNC_STALLS, 166), PMC_E500MC_ONLY(FPU_INPUT_DATA_STALLS, 167), PMC_E500MC_ONLY(DECORATED_LOADS, 176), PMC_E500MC_ONLY(DECORATED_STORES, 177), PMC_E500MC_ONLY(LOAD_RETRIES, 178), PMC_E500MC_ONLY(STWCX_SUCCESSES, 179), PMC_E500MC_ONLY(STWCX_FAILURES, 180), }; const size_t e500_event_codes_size = sizeof(e500_event_codes) / sizeof(e500_event_codes[0]); static pmc_value_t e500_pmcn_read(unsigned int pmc) { switch (pmc) { case 0: return mfpmr(PMR_PMC0); break; case 1: return mfpmr(PMR_PMC1); break; case 2: return mfpmr(PMR_PMC2); break; case 3: return mfpmr(PMR_PMC3); break; default: panic("Invalid PMC number: %d\n", pmc); } } static void e500_pmcn_write(unsigned int pmc, uint32_t val) { switch (pmc) { case 0: mtpmr(PMR_PMC0, val); break; case 1: mtpmr(PMR_PMC1, val); break; case 2: mtpmr(PMR_PMC2, val); break; case 3: mtpmr(PMR_PMC3, val); break; default: panic("Invalid PMC number: %d\n", pmc); } } static int e500_read_pmc(int cpu, int ri, pmc_value_t *v) { struct pmc *pm; pmc_value_t tmp; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[powerpc,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < E500_MAX_PMCS, ("[powerpc,%d] illegal row index %d", __LINE__, ri)); pm = powerpc_pcpu[cpu]->pc_ppcpmcs[ri].phw_pmc; KASSERT(pm, ("[core,%d] cpu %d ri %d pmc not configured", __LINE__, cpu, ri)); tmp = e500_pmcn_read(ri); - PMCDBG(MDP,REA,2,"ppc-read id=%d -> %jd", ri, tmp); + PMCDBG2(MDP,REA,2,"ppc-read id=%d -> %jd", ri, tmp); if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) *v = POWERPC_PERFCTR_VALUE_TO_RELOAD_COUNT(tmp); else *v = tmp; return 0; } static int e500_write_pmc(int cpu, int ri, pmc_value_t v) { struct pmc *pm; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[powerpc,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < E500_MAX_PMCS, ("[powerpc,%d] illegal row-index %d", __LINE__, ri)); pm = powerpc_pcpu[cpu]->pc_ppcpmcs[ri].phw_pmc; if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) v = POWERPC_RELOAD_COUNT_TO_PERFCTR_VALUE(v); - PMCDBG(MDP,WRI,1,"powerpc-write cpu=%d ri=%d v=%jx", cpu, ri, v); + PMCDBG3(MDP,WRI,1,"powerpc-write cpu=%d ri=%d v=%jx", cpu, ri, v); e500_pmcn_write(ri, v); return 0; } static int e500_config_pmc(int cpu, int ri, struct pmc *pm) { struct pmc_hw *phw; - PMCDBG(MDP,CFG,1, "cpu=%d ri=%d pm=%p", cpu, ri, pm); + PMCDBG3(MDP,CFG,1, "cpu=%d ri=%d pm=%p", cpu, ri, pm); KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[powerpc,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < E500_MAX_PMCS, ("[powerpc,%d] illegal row-index %d", __LINE__, ri)); phw = &powerpc_pcpu[cpu]->pc_ppcpmcs[ri]; KASSERT(pm == NULL || phw->phw_pmc == NULL, ("[powerpc,%d] pm=%p phw->pm=%p hwpmc not unconfigured", __LINE__, pm, phw->phw_pmc)); phw->phw_pmc = pm; return 0; } static int e500_start_pmc(int cpu, int ri) { uint32_t config; struct pmc *pm; struct pmc_hw *phw; phw = &powerpc_pcpu[cpu]->pc_ppcpmcs[ri]; pm = phw->phw_pmc; config = pm->pm_md.pm_powerpc.pm_powerpc_evsel; if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) config |= PMLCax_CE; /* Enable the PMC. */ switch (ri) { case 0: mtpmr(PMR_PMLCa0, config); break; case 1: mtpmr(PMR_PMLCa1, config); break; case 2: mtpmr(PMR_PMLCa2, config); break; case 3: mtpmr(PMR_PMLCa3, config); break; default: break; } return 0; } static int e500_stop_pmc(int cpu, int ri) { struct pmc *pm; struct pmc_hw *phw; register_t pmc_pmlc; phw = &powerpc_pcpu[cpu]->pc_ppcpmcs[ri]; pm = phw->phw_pmc; /* * Disable the PMCs. */ switch (ri) { case 0: pmc_pmlc = mfpmr(PMR_PMLCa0); pmc_pmlc |= PMLCax_FC; mtpmr(PMR_PMLCa0, pmc_pmlc); break; case 1: pmc_pmlc = mfpmr(PMR_PMLCa1); pmc_pmlc |= PMLCax_FC; mtpmr(PMR_PMLCa1, pmc_pmlc); break; case 2: pmc_pmlc = mfpmr(PMR_PMLCa2); pmc_pmlc |= PMLCax_FC; mtpmr(PMR_PMLCa2, pmc_pmlc); break; case 3: pmc_pmlc = mfpmr(PMR_PMLCa3); pmc_pmlc |= PMLCax_FC; mtpmr(PMR_PMLCa3, pmc_pmlc); break; default: break; } return 0; } static int e500_pcpu_init(struct pmc_mdep *md, int cpu) { int first_ri, i; struct pmc_cpu *pc; struct powerpc_cpu *pac; struct pmc_hw *phw; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[powerpc,%d] wrong cpu number %d", __LINE__, cpu)); - PMCDBG(MDP,INI,1,"powerpc-init cpu=%d", cpu); + PMCDBG1(MDP,INI,1,"powerpc-init cpu=%d", cpu); /* Freeze all counters. */ mtpmr(PMR_PMGC0, PMGC_FAC | PMGC_PMIE | PMGC_FCECE); powerpc_pcpu[cpu] = pac = malloc(sizeof(struct powerpc_cpu), M_PMC, M_WAITOK|M_ZERO); pac->pc_ppcpmcs = malloc(sizeof(struct pmc_hw) * E500_MAX_PMCS, M_PMC, M_WAITOK|M_ZERO); pac->pc_class = PMC_CLASS_E500; pc = pmc_pcpu[cpu]; first_ri = md->pmd_classdep[PMC_MDEP_CLASS_INDEX_POWERPC].pcd_ri; KASSERT(pc != NULL, ("[powerpc,%d] NULL per-cpu pointer", __LINE__)); for (i = 0, phw = pac->pc_ppcpmcs; i < E500_MAX_PMCS; i++, phw++) { phw->phw_state = PMC_PHW_FLAG_IS_ENABLED | PMC_PHW_CPU_TO_STATE(cpu) | PMC_PHW_INDEX_TO_STATE(i); phw->phw_pmc = NULL; pc->pc_hwpmcs[i + first_ri] = phw; /* Initialize the PMC to stopped */ e500_stop_pmc(cpu, i); } /* Unfreeze global register. */ mtpmr(PMR_PMGC0, PMGC_PMIE | PMGC_FCECE); return 0; } static int e500_pcpu_fini(struct pmc_mdep *md, int cpu) { uint32_t pmgc0 = mfpmr(PMR_PMGC0); pmgc0 |= PMGC_FAC; mtpmr(PMR_PMGC0, pmgc0); mtmsr(mfmsr() & ~PSL_PMM); free(powerpc_pcpu[cpu]->pc_ppcpmcs, M_PMC); free(powerpc_pcpu[cpu], M_PMC); return 0; } static int e500_allocate_pmc(int cpu, int ri, struct pmc *pm, const struct pmc_op_pmcallocate *a) { enum pmc_event pe; uint32_t caps, config, counter; struct e500_event_code_map *ev; uint16_t vers; uint8_t pe_cpu_mask; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[powerpc,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < E500_MAX_PMCS, ("[powerpc,%d] illegal row index %d", __LINE__, ri)); caps = a->pm_caps; pe = a->pm_ev; config = PMLCax_FCS | PMLCax_FCU | PMLCax_FCM1 | PMLCax_FCM1; if (pe < PMC_EV_E500_FIRST || pe > PMC_EV_E500_LAST) return (EINVAL); ev = &e500_event_codes[pe-PMC_EV_E500_FIRST]; if (ev->pe_code == 0) return (EINVAL); vers = mfpvr() >> 16; switch (vers) { case FSL_E500v1: pe_cpu_mask = ev->pe_code & PMC_PPC_E500V1; break; case FSL_E500v2: pe_cpu_mask = ev->pe_code & PMC_PPC_E500V2; break; case FSL_E500mc: pe_cpu_mask = ev->pe_code & PMC_PPC_E500MC; break; } if (pe_cpu_mask == 0) return (EINVAL); config |= PMLCax_EVENT(ev->pe_code); counter = ev->pe_counter_mask; if ((counter & (1 << ri)) == 0) return (EINVAL); if (caps & PMC_CAP_SYSTEM) config &= ~PMLCax_FCS; if (caps & PMC_CAP_USER) config &= ~PMLCax_FCU; if ((caps & (PMC_CAP_USER | PMC_CAP_SYSTEM)) == 0) config &= ~(PMLCax_FCS|PMLCax_FCU); pm->pm_md.pm_powerpc.pm_powerpc_evsel = config; - PMCDBG(MDP,ALL,2,"powerpc-allocate ri=%d -> config=0x%x", ri, config); + PMCDBG2(MDP,ALL,2,"powerpc-allocate ri=%d -> config=0x%x", ri, config); return 0; } static int e500_release_pmc(int cpu, int ri, struct pmc *pmc) { struct pmc_hw *phw; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[powerpc,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < E500_MAX_PMCS, ("[powerpc,%d] illegal row-index %d", __LINE__, ri)); phw = &powerpc_pcpu[cpu]->pc_ppcpmcs[ri]; KASSERT(phw->phw_pmc == NULL, ("[powerpc,%d] PHW pmc %p non-NULL", __LINE__, phw->phw_pmc)); return 0; } static int e500_intr(int cpu, struct trapframe *tf) { int i, error, retval; uint32_t config; struct pmc *pm; struct powerpc_cpu *pac; KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[powerpc,%d] out of range CPU %d", __LINE__, cpu)); - PMCDBG(MDP,INT,1, "cpu=%d tf=%p um=%d", cpu, (void *) tf, + PMCDBG3(MDP,INT,1, "cpu=%d tf=%p um=%d", cpu, (void *) tf, TRAPF_USERMODE(tf)); retval = 0; pac = powerpc_pcpu[cpu]; config = mfpmr(PMR_PMGC0) & ~PMGC_FAC; /* * look for all PMCs that have interrupted: * - look for a running, sampling PMC which has overflowed * and which has a valid 'struct pmc' association * * If found, we call a helper to process the interrupt. */ for (i = 0; i < E500_MAX_PMCS; i++) { if ((pm = pac->pc_ppcpmcs[i].phw_pmc) == NULL || !PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) { continue; } if (!E500_PMC_HAS_OVERFLOWED(i)) continue; retval = 1; /* Found an interrupting PMC. */ if (pm->pm_state != PMC_STATE_RUNNING) continue; /* Stop the counter if logging fails. */ error = pmc_process_interrupt(cpu, PMC_HR, pm, tf, TRAPF_USERMODE(tf)); if (error != 0) e500_stop_pmc(cpu, i); /* reload count. */ e500_write_pmc(cpu, i, pm->pm_sc.pm_reloadcount); } atomic_add_int(retval ? &pmc_stats.pm_intr_processed : &pmc_stats.pm_intr_ignored, 1); /* Re-enable PERF exceptions. */ if (retval) mtpmr(PMR_PMGC0, config | PMGC_PMIE); return (retval); } int pmc_e500_initialize(struct pmc_mdep *pmc_mdep) { struct pmc_classdep *pcd; pmc_mdep->pmd_cputype = PMC_CPU_PPC_E500; pcd = &pmc_mdep->pmd_classdep[PMC_MDEP_CLASS_INDEX_POWERPC]; pcd->pcd_caps = POWERPC_PMC_CAPS; pcd->pcd_class = PMC_CLASS_E500; pcd->pcd_num = E500_MAX_PMCS; pcd->pcd_ri = pmc_mdep->pmd_npmc; pcd->pcd_width = 32; pcd->pcd_allocate_pmc = e500_allocate_pmc; pcd->pcd_config_pmc = e500_config_pmc; pcd->pcd_pcpu_fini = e500_pcpu_fini; pcd->pcd_pcpu_init = e500_pcpu_init; pcd->pcd_describe = powerpc_describe; pcd->pcd_get_config = powerpc_get_config; pcd->pcd_read_pmc = e500_read_pmc; pcd->pcd_release_pmc = e500_release_pmc; pcd->pcd_start_pmc = e500_start_pmc; pcd->pcd_stop_pmc = e500_stop_pmc; pcd->pcd_write_pmc = e500_write_pmc; pmc_mdep->pmd_npmc += E500_MAX_PMCS; pmc_mdep->pmd_intr = e500_intr; return (0); } Index: projects/release-arm-redux/sys/dev/hwpmc/hwpmc_mips74k.c =================================================================== --- projects/release-arm-redux/sys/dev/hwpmc/hwpmc_mips74k.c (revision 282691) +++ projects/release-arm-redux/sys/dev/hwpmc/hwpmc_mips74k.c (revision 282692) @@ -1,261 +1,261 @@ /*- * Copyright (c) 2010 George V. Neville-Neil * Copyright (c) 2015 Adrian Chadd * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #define MIPS74K_PMC_CAPS (PMC_CAP_INTERRUPT | PMC_CAP_USER | \ PMC_CAP_SYSTEM | PMC_CAP_EDGE | \ PMC_CAP_THRESHOLD | PMC_CAP_READ | \ PMC_CAP_WRITE | PMC_CAP_INVERT | \ PMC_CAP_QUALIFIER) /* 0x1 - Exception_enable */ #define MIPS74K_PMC_INTERRUPT_ENABLE 0x10 /* Enable interrupts */ #define MIPS74K_PMC_USER_ENABLE 0x08 /* Count in USER mode */ #define MIPS74K_PMC_SUPER_ENABLE 0x04 /* Count in SUPERVISOR mode */ #define MIPS74K_PMC_KERNEL_ENABLE 0x02 /* Count in KERNEL mode */ #define MIPS74K_PMC_ENABLE (MIPS74K_PMC_USER_ENABLE | \ MIPS74K_PMC_SUPER_ENABLE | \ MIPS74K_PMC_KERNEL_ENABLE) #define MIPS74K_PMC_SELECT 5 /* Which bit position the event starts at. */ const struct mips_event_code_map mips_event_codes[] = { { PMC_EV_MIPS74K_CYCLES, MIPS_CTR_ALL, 0 }, { PMC_EV_MIPS74K_INSTR_EXECUTED, MIPS_CTR_ALL, 1 }, { PMC_EV_MIPS74K_PREDICTED_JR_31, MIPS_CTR_0, 2 }, { PMC_EV_MIPS74K_JR_31_MISPREDICTIONS, MIPS_CTR_1, 2 }, { PMC_EV_MIPS74K_REDIRECT_STALLS, MIPS_CTR_0, 3 }, { PMC_EV_MIPS74K_JR_31_NO_PREDICTIONS, MIPS_CTR_1, 3 }, { PMC_EV_MIPS74K_ITLB_ACCESSES, MIPS_CTR_0, 4 }, { PMC_EV_MIPS74K_ITLB_MISSES, MIPS_CTR_1, 4 }, { PMC_EV_MIPS74K_JTLB_INSN_MISSES, MIPS_CTR_1, 5 }, { PMC_EV_MIPS74K_ICACHE_ACCESSES, MIPS_CTR_0, 6 }, { PMC_EV_MIPS74K_ICACHE_MISSES, MIPS_CTR_1, 6 }, { PMC_EV_MIPS74K_ICACHE_MISS_STALLS, MIPS_CTR_0, 7 }, { PMC_EV_MIPS74K_UNCACHED_IFETCH_STALLS, MIPS_CTR_0, 8 }, { PMC_EV_MIPS74K_PDTRACE_BACK_STALLS, MIPS_CTR_1, 8 }, { PMC_EV_MIPS74K_IFU_REPLAYS, MIPS_CTR_0, 9 }, { PMC_EV_MIPS74K_KILLED_FETCH_SLOTS, MIPS_CTR_1, 9 }, { PMC_EV_MIPS74K_IFU_IDU_MISS_PRED_UPSTREAM_CYCLES, MIPS_CTR_0, 11 }, { PMC_EV_MIPS74K_IFU_IDU_NO_FETCH_CYCLES, MIPS_CTR_1, 11 }, { PMC_EV_MIPS74K_IFU_IDU_CLOGED_DOWNSTREAM_CYCLES, MIPS_CTR_0, 12 }, { PMC_EV_MIPS74K_DDQ0_FULL_DR_STALLS, MIPS_CTR_0, 13 }, { PMC_EV_MIPS74K_DDQ1_FULL_DR_STALLS, MIPS_CTR_1, 13 }, { PMC_EV_MIPS74K_ALCB_FULL_DR_STALLS, MIPS_CTR_0, 14 }, { PMC_EV_MIPS74K_AGCB_FULL_DR_STALLS, MIPS_CTR_1, 14 }, { PMC_EV_MIPS74K_CLDQ_FULL_DR_STALLS, MIPS_CTR_0, 15 }, { PMC_EV_MIPS74K_IODQ_FULL_DR_STALLS, MIPS_CTR_1, 15 }, { PMC_EV_MIPS74K_ALU_EMPTY_CYCLES, MIPS_CTR_0, 16 }, { PMC_EV_MIPS74K_AGEN_EMPTY_CYCLES, MIPS_CTR_1, 16 }, { PMC_EV_MIPS74K_ALU_OPERANDS_NOT_READY_CYCLES, MIPS_CTR_0, 17 }, { PMC_EV_MIPS74K_AGEN_OPERANDS_NOT_READY_CYCLES, MIPS_CTR_1, 17 }, { PMC_EV_MIPS74K_ALU_NO_ISSUES_CYCLES, MIPS_CTR_0, 18 }, { PMC_EV_MIPS74K_AGEN_NO_ISSUES_CYCLES, MIPS_CTR_1, 18 }, { PMC_EV_MIPS74K_ALU_BUBBLE_CYCLES, MIPS_CTR_0, 19 }, { PMC_EV_MIPS74K_AGEN_BUBBLE_CYCLES, MIPS_CTR_1, 19 }, { PMC_EV_MIPS74K_SINGLE_ISSUE_CYCLES, MIPS_CTR_0, 20 }, { PMC_EV_MIPS74K_DUAL_ISSUE_CYCLES, MIPS_CTR_1, 20 }, { PMC_EV_MIPS74K_OOO_ALU_ISSUE_CYCLES, MIPS_CTR_0, 21 }, { PMC_EV_MIPS74K_OOO_AGEN_ISSUE_CYCLES, MIPS_CTR_1, 21 }, { PMC_EV_MIPS74K_JALR_JALR_HB_INSNS, MIPS_CTR_0, 22 }, { PMC_EV_MIPS74K_DCACHE_LINE_REFILL_REQUESTS, MIPS_CTR_1, 22 }, { PMC_EV_MIPS74K_DCACHE_LOAD_ACCESSES, MIPS_CTR_0, 23 }, { PMC_EV_MIPS74K_DCACHE_ACCESSES, MIPS_CTR_1, 23 }, { PMC_EV_MIPS74K_DCACHE_WRITEBACKS, MIPS_CTR_0, 24 }, { PMC_EV_MIPS74K_DCACHE_MISSES, MIPS_CTR_1, 24 }, { PMC_EV_MIPS74K_JTLB_DATA_ACCESSES, MIPS_CTR_0, 25 }, { PMC_EV_MIPS74K_JTLB_DATA_MISSES, MIPS_CTR_1, 25 }, { PMC_EV_MIPS74K_LOAD_STORE_REPLAYS, MIPS_CTR_0, 26 }, { PMC_EV_MIPS74K_VA_TRANSALTION_CORNER_CASES, MIPS_CTR_1, 26 }, { PMC_EV_MIPS74K_LOAD_STORE_BLOCKED_CYCLES, MIPS_CTR_0, 27 }, { PMC_EV_MIPS74K_LOAD_STORE_NO_FILL_REQUESTS, MIPS_CTR_1, 27 }, { PMC_EV_MIPS74K_L2_CACHE_WRITEBACKS, MIPS_CTR_0, 28 }, { PMC_EV_MIPS74K_L2_CACHE_ACCESSES, MIPS_CTR_1, 28 }, { PMC_EV_MIPS74K_L2_CACHE_MISSES, MIPS_CTR_0, 29 }, { PMC_EV_MIPS74K_L2_CACHE_MISS_CYCLES, MIPS_CTR_1, 29 }, { PMC_EV_MIPS74K_FSB_FULL_STALLS, MIPS_CTR_0, 30 }, { PMC_EV_MIPS74K_FSB_OVER_50_FULL, MIPS_CTR_1, 30 }, { PMC_EV_MIPS74K_LDQ_FULL_STALLS, MIPS_CTR_0, 31 }, { PMC_EV_MIPS74K_LDQ_OVER_50_FULL, MIPS_CTR_1, 31 }, { PMC_EV_MIPS74K_WBB_FULL_STALLS, MIPS_CTR_0, 32 }, { PMC_EV_MIPS74K_WBB_OVER_50_FULL, MIPS_CTR_1, 32 }, { PMC_EV_MIPS74K_LOAD_MISS_CONSUMER_REPLAYS, MIPS_CTR_0, 35 }, { PMC_EV_MIPS74K_CP1_CP2_LOAD_INSNS, MIPS_CTR_1, 35 }, { PMC_EV_MIPS74K_JR_NON_31_INSNS, MIPS_CTR_0, 36 }, { PMC_EV_MIPS74K_MISPREDICTED_JR_31_INSNS, MIPS_CTR_1, 36 }, { PMC_EV_MIPS74K_BRANCH_INSNS, MIPS_CTR_0, 37 }, { PMC_EV_MIPS74K_CP1_CP2_COND_BRANCH_INSNS, MIPS_CTR_1, 37 }, { PMC_EV_MIPS74K_BRANCH_LIKELY_INSNS, MIPS_CTR_0, 38 }, { PMC_EV_MIPS74K_MISPREDICTED_BRANCH_LIKELY_INSNS, MIPS_CTR_1, 38 }, { PMC_EV_MIPS74K_COND_BRANCH_INSNS, MIPS_CTR_0, 39 }, { PMC_EV_MIPS74K_MISPREDICTED_BRANCH_INSNS, MIPS_CTR_1, 39 }, { PMC_EV_MIPS74K_INTEGER_INSNS, MIPS_CTR_0, 40 }, { PMC_EV_MIPS74K_FPU_INSNS, MIPS_CTR_1, 40 }, { PMC_EV_MIPS74K_LOAD_INSNS, MIPS_CTR_0, 41 }, { PMC_EV_MIPS74K_STORE_INSNS, MIPS_CTR_1, 41 }, { PMC_EV_MIPS74K_J_JAL_INSNS, MIPS_CTR_0, 42 }, { PMC_EV_MIPS74K_MIPS16_INSNS, MIPS_CTR_1, 42 }, { PMC_EV_MIPS74K_NOP_INSNS, MIPS_CTR_0, 43 }, { PMC_EV_MIPS74K_NT_MUL_DIV_INSNS, MIPS_CTR_1, 43 }, { PMC_EV_MIPS74K_DSP_INSNS, MIPS_CTR_0, 44 }, { PMC_EV_MIPS74K_ALU_DSP_SATURATION_INSNS, MIPS_CTR_1, 44 }, { PMC_EV_MIPS74K_DSP_BRANCH_INSNS, MIPS_CTR_0, 45 }, { PMC_EV_MIPS74K_MDU_DSP_SATURATION_INSNS, MIPS_CTR_1, 45 }, { PMC_EV_MIPS74K_UNCACHED_LOAD_INSNS, MIPS_CTR_0, 46 }, { PMC_EV_MIPS74K_UNCACHED_STORE_INSNS, MIPS_CTR_1, 46 }, { PMC_EV_MIPS74K_EJTAG_INSN_TRIGGERS, MIPS_CTR_0, 49 }, { PMC_EV_MIPS74K_CP1_BRANCH_MISPREDICTIONS, MIPS_CTR_0, 50 }, { PMC_EV_MIPS74K_SC_INSNS, MIPS_CTR_0, 51 }, { PMC_EV_MIPS74K_FAILED_SC_INSNS, MIPS_CTR_1, 51 }, { PMC_EV_MIPS74K_PREFETCH_INSNS, MIPS_CTR_0, 52 }, { PMC_EV_MIPS74K_CACHE_HIT_PREFETCH_INSNS, MIPS_CTR_1, 52 }, { PMC_EV_MIPS74K_NO_INSN_CYCLES, MIPS_CTR_0, 53 }, { PMC_EV_MIPS74K_LOAD_MISS_INSNS, MIPS_CTR_1, 53 }, { PMC_EV_MIPS74K_ONE_INSN_CYCLES, MIPS_CTR_0, 54 }, { PMC_EV_MIPS74K_TWO_INSNS_CYCLES, MIPS_CTR_1, 54 }, { PMC_EV_MIPS74K_GFIFO_BLOCKED_CYCLES, MIPS_CTR_0, 55 }, { PMC_EV_MIPS74K_CP1_CP2_STORE_INSNS, MIPS_CTR_1, 55 }, { PMC_EV_MIPS74K_MISPREDICTION_STALLS, MIPS_CTR_0, 56 }, { PMC_EV_MIPS74K_MISPREDICTED_BRANCH_INSNS_CYCLES, MIPS_CTR_0, 57 }, { PMC_EV_MIPS74K_EXCEPTIONS_TAKEN, MIPS_CTR_0, 58 }, { PMC_EV_MIPS74K_GRADUATION_REPLAYS, MIPS_CTR_1, 58 }, { PMC_EV_MIPS74K_COREEXTEND_EVENTS, MIPS_CTR_0, 59 }, { PMC_EV_MIPS74K_ISPRAM_EVENTS, MIPS_CTR_0, 62 }, { PMC_EV_MIPS74K_DSPRAM_EVENTS, MIPS_CTR_1, 62 }, { PMC_EV_MIPS74K_L2_CACHE_SINGLE_BIT_ERRORS, MIPS_CTR_0, 63 }, { PMC_EV_MIPS74K_SYSTEM_EVENT_0, MIPS_CTR_0, 64 }, { PMC_EV_MIPS74K_SYSTEM_EVENT_1, MIPS_CTR_1, 64 }, { PMC_EV_MIPS74K_SYSTEM_EVENT_2, MIPS_CTR_0, 65 }, { PMC_EV_MIPS74K_SYSTEM_EVENT_3, MIPS_CTR_1, 65 }, { PMC_EV_MIPS74K_SYSTEM_EVENT_4, MIPS_CTR_0, 66 }, { PMC_EV_MIPS74K_SYSTEM_EVENT_5, MIPS_CTR_1, 66 }, { PMC_EV_MIPS74K_SYSTEM_EVENT_6, MIPS_CTR_0, 67 }, { PMC_EV_MIPS74K_SYSTEM_EVENT_7, MIPS_CTR_1, 67 }, { PMC_EV_MIPS74K_OCP_ALL_REQUESTS, MIPS_CTR_0, 68 }, { PMC_EV_MIPS74K_OCP_ALL_CACHEABLE_REQUESTS, MIPS_CTR_1, 68 }, { PMC_EV_MIPS74K_OCP_READ_REQUESTS, MIPS_CTR_0, 69 }, { PMC_EV_MIPS74K_OCP_READ_CACHEABLE_REQUESTS, MIPS_CTR_1, 69 }, { PMC_EV_MIPS74K_OCP_WRITE_REQUESTS, MIPS_CTR_0, 70 }, { PMC_EV_MIPS74K_OCP_WRITE_CACHEABLE_REQUESTS, MIPS_CTR_1, 70 }, { PMC_EV_MIPS74K_FSB_LESS_25_FULL, MIPS_CTR_0, 74 }, { PMC_EV_MIPS74K_FSB_25_50_FULL, MIPS_CTR_1, 74 }, { PMC_EV_MIPS74K_LDQ_LESS_25_FULL, MIPS_CTR_0, 75 }, { PMC_EV_MIPS74K_LDQ_25_50_FULL, MIPS_CTR_1, 75 }, { PMC_EV_MIPS74K_WBB_LESS_25_FULL, MIPS_CTR_0, 76 }, { PMC_EV_MIPS74K_WBB_25_50_FULL, MIPS_CTR_1, 76 }, }; const int mips_event_codes_size = sizeof(mips_event_codes) / sizeof(mips_event_codes[0]); struct mips_pmc_spec mips_pmc_spec = { .ps_cpuclass = PMC_CLASS_MIPS74K, .ps_cputype = PMC_CPU_MIPS_74K, .ps_capabilities = MIPS74K_PMC_CAPS, .ps_counter_width = 32 }; /* * Performance Count Register N */ uint64_t mips_pmcn_read(unsigned int pmc) { uint32_t reg = 0; KASSERT(pmc < mips_npmcs, ("[mips74k,%d] illegal PMC number %d", __LINE__, pmc)); /* The counter value is the next value after the control register. */ switch (pmc) { case 0: reg = mips_rd_perfcnt1(); break; case 1: reg = mips_rd_perfcnt3(); break; default: return 0; } return (reg); } uint64_t mips_pmcn_write(unsigned int pmc, uint64_t reg) { KASSERT(pmc < mips_npmcs, ("[mips74k,%d] illegal PMC number %d", __LINE__, pmc)); switch (pmc) { case 0: mips_wr_perfcnt1(reg); break; case 1: mips_wr_perfcnt3(reg); break; default: return 0; } return (reg); } uint32_t mips_get_perfctl(int cpu, int ri, uint32_t event, uint32_t caps) { uint32_t config; config = event; config <<= MIPS74K_PMC_SELECT; if (caps & PMC_CAP_SYSTEM) config |= (MIPS74K_PMC_SUPER_ENABLE | MIPS74K_PMC_KERNEL_ENABLE); if (caps & PMC_CAP_USER) config |= MIPS74K_PMC_USER_ENABLE; if ((caps & (PMC_CAP_USER | PMC_CAP_SYSTEM)) == 0) config |= MIPS74K_PMC_ENABLE; if (caps & PMC_CAP_INTERRUPT) config |= MIPS74K_PMC_INTERRUPT_ENABLE; - PMCDBG(MDP,ALL,2,"mips74k-get_perfctl ri=%d -> config=0x%x", ri, config); + PMCDBG2(MDP,ALL,2,"mips74k-get_perfctl ri=%d -> config=0x%x", ri, config); return (config); } Index: projects/release-arm-redux/sys/dev/iicbus/iicbus.c =================================================================== --- projects/release-arm-redux/sys/dev/iicbus/iicbus.c (revision 282691) +++ projects/release-arm-redux/sys/dev/iicbus/iicbus.c (revision 282692) @@ -1,324 +1,403 @@ /*- * Copyright (c) 1998, 2001 Nicolas Souchu * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); /* * Autoconfiguration and support routines for the Philips serial I2C bus */ #include #include #include #include #include #include #include +#include #include #include #include #include #include "iicbus_if.h" /* See comments below for why auto-scanning is a bad idea. */ #define SCAN_IICBUS 0 static int iicbus_probe(device_t dev) { device_set_desc(dev, "Philips I2C bus"); /* Allow other subclasses to override this driver. */ return (BUS_PROBE_GENERIC); } #if SCAN_IICBUS static int iic_probe_device(device_t dev, u_char addr) { int count; char byte; if ((addr & 1) == 0) { /* is device writable? */ if (!iicbus_start(dev, (u_char)addr, 0)) { iicbus_stop(dev); return (1); } } else { /* is device readable? */ if (!iicbus_block_read(dev, (u_char)addr, &byte, 1, &count)) return (1); } return (0); } #endif /* * We add all the devices which we know about. * The generic attach routine will attach them if they are alive. */ static int iicbus_attach(device_t dev) { #if SCAN_IICBUS unsigned char addr; #endif struct iicbus_softc *sc = IICBUS_SOFTC(dev); int strict; sc->dev = dev; mtx_init(&sc->lock, "iicbus", NULL, MTX_DEF); iicbus_init_frequency(dev, 0); iicbus_reset(dev, IIC_FASTEST, 0, NULL); if (resource_int_value(device_get_name(dev), device_get_unit(dev), "strict", &strict) == 0) sc->strict = strict; else sc->strict = 1; /* device probing is meaningless since the bus is supposed to be * hot-plug. Moreover, some I2C chips do not appreciate random * accesses like stop after start to fast, reads for less than * x bytes... */ #if SCAN_IICBUS printf("Probing for devices on iicbus%d:", device_get_unit(dev)); /* probe any devices */ for (addr = 16; addr < 240; addr++) { if (iic_probe_device(dev, (u_char)addr)) { printf(" <%x>", addr); } } printf("\n"); #endif bus_generic_probe(dev); bus_enumerate_hinted_children(dev); bus_generic_attach(dev); return (0); } static int iicbus_detach(device_t dev) { struct iicbus_softc *sc = IICBUS_SOFTC(dev); iicbus_reset(dev, IIC_FASTEST, 0, NULL); bus_generic_detach(dev); mtx_destroy(&sc->lock); return (0); } static int iicbus_print_child(device_t dev, device_t child) { struct iicbus_ivar *devi = IICBUS_IVAR(child); int retval = 0; retval += bus_print_child_header(dev, child); if (devi->addr != 0) retval += printf(" at addr %#x", devi->addr); + resource_list_print_type(&devi->rl, "irq", SYS_RES_IRQ, "%ld"); retval += bus_print_child_footer(dev, child); return (retval); } static void iicbus_probe_nomatch(device_t bus, device_t child) { struct iicbus_ivar *devi = IICBUS_IVAR(child); - device_printf(bus, ""); - printf(" at addr %#x\n", devi->addr); - return; + device_printf(bus, " at addr %#x", devi->addr); + resource_list_print_type(&devi->rl, "irq", SYS_RES_IRQ, "%ld"); + printf("\n"); } static int iicbus_child_location_str(device_t bus, device_t child, char *buf, size_t buflen) { struct iicbus_ivar *devi = IICBUS_IVAR(child); snprintf(buf, buflen, "addr=%#x", devi->addr); return (0); } static int iicbus_child_pnpinfo_str(device_t bus, device_t child, char *buf, size_t buflen) { *buf = '\0'; return (0); } static int iicbus_read_ivar(device_t bus, device_t child, int which, uintptr_t *result) { struct iicbus_ivar *devi = IICBUS_IVAR(child); switch (which) { default: return (EINVAL); case IICBUS_IVAR_ADDR: *result = devi->addr; break; } return (0); } static device_t iicbus_add_child(device_t dev, u_int order, const char *name, int unit) { device_t child; struct iicbus_ivar *devi; child = device_add_child_ordered(dev, order, name, unit); if (child == NULL) return (child); devi = malloc(sizeof(struct iicbus_ivar), M_DEVBUF, M_NOWAIT | M_ZERO); if (devi == NULL) { device_delete_child(dev, child); return (0); } + resource_list_init(&devi->rl); device_set_ivars(child, devi); return (child); } static void iicbus_hinted_child(device_t bus, const char *dname, int dunit) { device_t child; + int irq; struct iicbus_ivar *devi; child = BUS_ADD_CHILD(bus, 0, dname, dunit); devi = IICBUS_IVAR(child); resource_int_value(dname, dunit, "addr", &devi->addr); + if (resource_int_value(dname, dunit, "irq", &irq) == 0) { + if (bus_set_resource(child, SYS_RES_IRQ, 0, irq, 1) != 0) + device_printf(bus, + "warning: bus_set_resource() failed\n"); + } } +static int +iicbus_set_resource(device_t dev, device_t child, int type, int rid, + u_long start, u_long count) +{ + struct iicbus_ivar *devi; + struct resource_list_entry *rle; + + devi = IICBUS_IVAR(child); + rle = resource_list_add(&devi->rl, type, rid, start, + start + count - 1, count); + if (rle == NULL) + return (ENXIO); + + return (0); +} + +static struct resource * +iicbus_alloc_resource(device_t bus, device_t child, int type, int *rid, + u_long start, u_long end, u_long count, u_int flags) +{ + struct resource_list *rl; + struct resource_list_entry *rle; + + /* Only IRQ resources are supported. */ + if (type != SYS_RES_IRQ) + return (NULL); + + /* + * Request for the default allocation with a given rid: use resource + * list stored in the local device info. + */ + if ((start == 0UL) && (end == ~0UL)) { + rl = BUS_GET_RESOURCE_LIST(bus, child); + if (rl == NULL) + return (NULL); + rle = resource_list_find(rl, type, *rid); + if (rle == NULL) { + if (bootverbose) + device_printf(bus, "no default resources for " + "rid = %d, type = %d\n", *rid, type); + return (NULL); + } + start = rle->start; + end = rle->end; + count = rle->count; + } + + return (bus_generic_alloc_resource(bus, child, type, rid, start, end, + count, flags)); +} + +static struct resource_list * +iicbus_get_resource_list(device_t bus __unused, device_t child) +{ + struct iicbus_ivar *devi; + + devi = IICBUS_IVAR(child); + return (&devi->rl); +} + int iicbus_generic_intr(device_t dev, int event, char *buf) { return (0); } int iicbus_null_callback(device_t dev, int index, caddr_t data) { return (0); } int iicbus_null_repeated_start(device_t dev, u_char addr) { return (IIC_ENOTSUPP); } void iicbus_init_frequency(device_t dev, u_int bus_freq) { struct iicbus_softc *sc = IICBUS_SOFTC(dev); /* * If a bus frequency value was passed in, use it. Otherwise initialize * it first to the standard i2c 100KHz frequency, then override that * from a hint if one exists. */ if (bus_freq > 0) sc->bus_freq = bus_freq; else { sc->bus_freq = 100000; resource_int_value(device_get_name(dev), device_get_unit(dev), "frequency", (int *)&sc->bus_freq); } /* * Set up the sysctl that allows the bus frequency to be changed. * It is flagged as a tunable so that the user can set the value in * loader(8), and that will override any other setting from any source. * The sysctl tunable/value is the one most directly controlled by the * user and thus the one that always takes precedence. */ SYSCTL_ADD_UINT(device_get_sysctl_ctx(dev), SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), OID_AUTO, "frequency", CTLFLAG_RW | CTLFLAG_TUN, &sc->bus_freq, sc->bus_freq, "Bus frequency in Hz"); } static u_int iicbus_get_frequency(device_t dev, u_char speed) { struct iicbus_softc *sc = IICBUS_SOFTC(dev); /* * If the frequency has not been configured for the bus, or the request * is specifically for SLOW speed, use the standard 100KHz rate, else * use the configured bus speed. */ if (sc->bus_freq == 0 || speed == IIC_SLOW) return (100000); return (sc->bus_freq); } static device_method_t iicbus_methods[] = { /* device interface */ DEVMETHOD(device_probe, iicbus_probe), DEVMETHOD(device_attach, iicbus_attach), DEVMETHOD(device_detach, iicbus_detach), /* bus interface */ + DEVMETHOD(bus_setup_intr, bus_generic_setup_intr), + DEVMETHOD(bus_teardown_intr, bus_generic_teardown_intr), + DEVMETHOD(bus_release_resource, bus_generic_release_resource), + DEVMETHOD(bus_activate_resource, bus_generic_activate_resource), + DEVMETHOD(bus_deactivate_resource, bus_generic_deactivate_resource), + DEVMETHOD(bus_adjust_resource, bus_generic_adjust_resource), + DEVMETHOD(bus_get_resource, bus_generic_rl_get_resource), + DEVMETHOD(bus_alloc_resource, iicbus_alloc_resource), + DEVMETHOD(bus_get_resource_list, iicbus_get_resource_list), + DEVMETHOD(bus_set_resource, iicbus_set_resource), DEVMETHOD(bus_add_child, iicbus_add_child), DEVMETHOD(bus_print_child, iicbus_print_child), DEVMETHOD(bus_probe_nomatch, iicbus_probe_nomatch), DEVMETHOD(bus_read_ivar, iicbus_read_ivar), DEVMETHOD(bus_child_pnpinfo_str, iicbus_child_pnpinfo_str), DEVMETHOD(bus_child_location_str, iicbus_child_location_str), DEVMETHOD(bus_hinted_child, iicbus_hinted_child), /* iicbus interface */ DEVMETHOD(iicbus_transfer, iicbus_transfer), DEVMETHOD(iicbus_get_frequency, iicbus_get_frequency), DEVMETHOD_END }; driver_t iicbus_driver = { "iicbus", iicbus_methods, sizeof(struct iicbus_softc), }; devclass_t iicbus_devclass; MODULE_VERSION(iicbus, IICBUS_MODVER); DRIVER_MODULE(iicbus, iichb, iicbus_driver, iicbus_devclass, 0, 0); Index: projects/release-arm-redux/sys/dev/iicbus/iicbus.h =================================================================== --- projects/release-arm-redux/sys/dev/iicbus/iicbus.h (revision 282691) +++ projects/release-arm-redux/sys/dev/iicbus/iicbus.h (revision 282692) @@ -1,77 +1,78 @@ /*- * Copyright (c) 1998 Nicolas Souchu * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ * */ #ifndef __IICBUS_H #define __IICBUS_H #include #include #define IICBUS_IVAR(d) (struct iicbus_ivar *) device_get_ivars(d) #define IICBUS_SOFTC(d) (struct iicbus_softc *) device_get_softc(d) struct iicbus_softc { device_t dev; /* Myself */ device_t owner; /* iicbus owner device structure */ u_char started; /* address of the 'started' slave * 0 if no start condition succeeded */ u_char strict; /* deny operations that violate the * I2C protocol */ struct mtx lock; u_int bus_freq; /* Configured bus Hz. */ }; struct iicbus_ivar { uint32_t addr; + struct resource_list rl; bool nostop; }; enum { IICBUS_IVAR_ADDR, /* Address or base address */ IICBUS_IVAR_NOSTOP, /* nostop defaults */ }; #define IICBUS_ACCESSOR(A, B, T) \ __BUS_ACCESSOR(iicbus, A, IICBUS, B, T) IICBUS_ACCESSOR(addr, ADDR, uint32_t) IICBUS_ACCESSOR(nostop, NOSTOP, bool) #define IICBUS_LOCK(sc) mtx_lock(&(sc)->lock) #define IICBUS_UNLOCK(sc) mtx_unlock(&(sc)->lock) #define IICBUS_ASSERT_LOCKED(sc) mtx_assert(&(sc)->lock, MA_OWNED) int iicbus_generic_intr(device_t dev, int event, char *buf); void iicbus_init_frequency(device_t dev, u_int bus_freq); extern driver_t iicbus_driver; extern devclass_t iicbus_devclass; #endif Index: projects/release-arm-redux/sys/dev/ofw/ofw_iicbus.c =================================================================== --- projects/release-arm-redux/sys/dev/ofw/ofw_iicbus.c (revision 282691) +++ projects/release-arm-redux/sys/dev/ofw/ofw_iicbus.c (revision 282692) @@ -1,201 +1,216 @@ /*- * Copyright (c) 2009, Nathan Whitehorn * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include "iicbus_if.h" /* Methods */ static device_probe_t ofw_iicbus_probe; static device_attach_t ofw_iicbus_attach; static device_t ofw_iicbus_add_child(device_t dev, u_int order, const char *name, int unit); static const struct ofw_bus_devinfo *ofw_iicbus_get_devinfo(device_t bus, device_t dev); +static struct resource_list *ofw_iicbus_get_resource_list(device_t bus, + device_t child); static device_method_t ofw_iicbus_methods[] = { /* Device interface */ DEVMETHOD(device_probe, ofw_iicbus_probe), DEVMETHOD(device_attach, ofw_iicbus_attach), /* Bus interface */ + DEVMETHOD(bus_get_resource_list, ofw_iicbus_get_resource_list), DEVMETHOD(bus_child_pnpinfo_str, ofw_bus_gen_child_pnpinfo_str), DEVMETHOD(bus_add_child, ofw_iicbus_add_child), /* ofw_bus interface */ DEVMETHOD(ofw_bus_get_devinfo, ofw_iicbus_get_devinfo), DEVMETHOD(ofw_bus_get_compat, ofw_bus_gen_get_compat), DEVMETHOD(ofw_bus_get_model, ofw_bus_gen_get_model), DEVMETHOD(ofw_bus_get_name, ofw_bus_gen_get_name), DEVMETHOD(ofw_bus_get_node, ofw_bus_gen_get_node), DEVMETHOD(ofw_bus_get_type, ofw_bus_gen_get_type), DEVMETHOD_END }; struct ofw_iicbus_devinfo { - struct iicbus_ivar opd_dinfo; + struct iicbus_ivar opd_dinfo; /* Must be the first. */ struct ofw_bus_devinfo opd_obdinfo; }; static devclass_t ofwiicbus_devclass; DEFINE_CLASS_1(iicbus, ofw_iicbus_driver, ofw_iicbus_methods, sizeof(struct iicbus_softc), iicbus_driver); DRIVER_MODULE(ofw_iicbus, iicbb, ofw_iicbus_driver, ofwiicbus_devclass, 0, 0); DRIVER_MODULE(ofw_iicbus, iichb, ofw_iicbus_driver, ofwiicbus_devclass, 0, 0); MODULE_VERSION(ofw_iicbus, 1); MODULE_DEPEND(ofw_iicbus, iicbus, 1, 1, 1); static int ofw_iicbus_probe(device_t dev) { if (ofw_bus_get_node(dev) == -1) return (ENXIO); device_set_desc(dev, "OFW I2C bus"); return (0); } static int ofw_iicbus_attach(device_t dev) { struct iicbus_softc *sc = IICBUS_SOFTC(dev); struct ofw_iicbus_devinfo *dinfo; phandle_t child, node; pcell_t freq, paddr; device_t childdev; sc->dev = dev; mtx_init(&sc->lock, "iicbus", NULL, MTX_DEF); /* * If there is a clock-frequency property for the device node, use it as * the starting value for the bus frequency. Then call the common * routine that handles the tunable/sysctl which allows the FDT value to * be overridden by the user. */ node = ofw_bus_get_node(dev); freq = 0; OF_getencprop(node, "clock-frequency", &freq, sizeof(freq)); iicbus_init_frequency(dev, freq); iicbus_reset(dev, IIC_FASTEST, 0, NULL); bus_generic_probe(dev); bus_enumerate_hinted_children(dev); /* * Attach those children represented in the device tree. */ for (child = OF_child(node); child != 0; child = OF_peer(child)) { /* * Try to get the I2C address first from the i2c-address * property, then try the reg property. It moves around * on different systems. */ if (OF_getencprop(child, "i2c-address", &paddr, sizeof(paddr)) == -1) if (OF_getencprop(child, "reg", &paddr, sizeof(paddr)) == -1) continue; /* * Now set up the I2C and OFW bus layer devinfo and add it * to the bus. */ dinfo = malloc(sizeof(struct ofw_iicbus_devinfo), M_DEVBUF, M_NOWAIT | M_ZERO); if (dinfo == NULL) continue; dinfo->opd_dinfo.addr = paddr; if (ofw_bus_gen_setup_devinfo(&dinfo->opd_obdinfo, child) != 0) { free(dinfo, M_DEVBUF); continue; } + childdev = device_add_child(dev, NULL, -1); + resource_list_init(&dinfo->opd_dinfo.rl); + ofw_bus_intr_to_rl(childdev, child, &dinfo->opd_dinfo.rl); device_set_ivars(childdev, dinfo); } return (bus_generic_attach(dev)); } static device_t ofw_iicbus_add_child(device_t dev, u_int order, const char *name, int unit) { device_t child; struct ofw_iicbus_devinfo *devi; child = device_add_child_ordered(dev, order, name, unit); if (child == NULL) return (child); devi = malloc(sizeof(struct ofw_iicbus_devinfo), M_DEVBUF, M_NOWAIT | M_ZERO); if (devi == NULL) { device_delete_child(dev, child); return (0); } /* * NULL all the OFW-related parts of the ivars for non-OFW * children. */ devi->opd_obdinfo.obd_node = -1; devi->opd_obdinfo.obd_name = NULL; devi->opd_obdinfo.obd_compat = NULL; devi->opd_obdinfo.obd_type = NULL; devi->opd_obdinfo.obd_model = NULL; device_set_ivars(child, devi); return (child); } static const struct ofw_bus_devinfo * ofw_iicbus_get_devinfo(device_t bus, device_t dev) { struct ofw_iicbus_devinfo *dinfo; dinfo = device_get_ivars(dev); return (&dinfo->opd_obdinfo); +} + +static struct resource_list * +ofw_iicbus_get_resource_list(device_t bus __unused, device_t child) +{ + struct ofw_iicbus_devinfo *devi; + + devi = device_get_ivars(child); + return (&devi->opd_dinfo.rl); } Index: projects/release-arm-redux/sys/i386/acpica/acpi_machdep.c =================================================================== --- projects/release-arm-redux/sys/i386/acpica/acpi_machdep.c (revision 282691) +++ projects/release-arm-redux/sys/i386/acpica/acpi_machdep.c (revision 282692) @@ -1,394 +1,387 @@ /*- * Copyright (c) 2001 Mitsuru IWASAKI * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include uint32_t acpi_resume_beep; SYSCTL_UINT(_debug_acpi, OID_AUTO, resume_beep, CTLFLAG_RWTUN, &acpi_resume_beep, 0, "Beep the PC speaker when resuming"); uint32_t acpi_reset_video; TUNABLE_INT("hw.acpi.reset_video", &acpi_reset_video); static int intr_model = ACPI_INTR_PIC; int acpi_machdep_init(device_t dev) { struct acpi_softc *sc; sc = device_get_softc(dev); acpi_apm_init(sc); acpi_install_wakeup_handler(sc); if (intr_model == ACPI_INTR_PIC) BUS_CONFIG_INTR(dev, AcpiGbl_FADT.SciInterrupt, INTR_TRIGGER_LEVEL, INTR_POLARITY_LOW); else acpi_SetIntrModel(intr_model); SYSCTL_ADD_UINT(&sc->acpi_sysctl_ctx, SYSCTL_CHILDREN(sc->acpi_sysctl_tree), OID_AUTO, "reset_video", CTLFLAG_RW, &acpi_reset_video, 0, "Call the VESA reset BIOS vector on the resume path"); return (0); } void acpi_SetDefaultIntrModel(int model) { intr_model = model; } /* Check BIOS date. If 1998 or older, disable ACPI. */ int acpi_machdep_quirks(int *quirks) { char *va; int year; /* BIOS address 0xffff5 contains the date in the format mm/dd/yy. */ va = pmap_mapbios(0xffff0, 16); sscanf(va + 11, "%2d", &year); pmap_unmapbios((vm_offset_t)va, 16); /* * Date must be >= 1/1/1999 or we don't trust ACPI. Note that this * check must be changed by my 114th birthday. */ if (year > 90 && year < 99) *quirks = ACPI_Q_BROKEN; return (0); } -void -acpi_cpu_c1() -{ - - __asm __volatile("sti; hlt"); -} - /* * Support for mapping ACPI tables during early boot. This abuses the * crashdump map because the kernel cannot allocate KVA in * pmap_mapbios() when this is used. This makes the following * assumptions about how we use this KVA: pages 0 and 1 are used to * map in the header of each table found via the RSDT or XSDT and * pages 2 to n are used to map in the RSDT or XSDT. This has to use * 2 pages for the table headers in case a header spans a page * boundary. * * XXX: We don't ensure the table fits in the available address space * in the crashdump map. */ /* * Map some memory using the crashdump map. 'offset' is an offset in * pages into the crashdump map to use for the start of the mapping. */ static void * table_map(vm_paddr_t pa, int offset, vm_offset_t length) { vm_offset_t va, off; void *data; off = pa & PAGE_MASK; length = round_page(length + off); pa = pa & PG_FRAME; va = (vm_offset_t)pmap_kenter_temporary(pa, offset) + (offset * PAGE_SIZE); data = (void *)(va + off); length -= PAGE_SIZE; while (length > 0) { va += PAGE_SIZE; pa += PAGE_SIZE; length -= PAGE_SIZE; pmap_kenter(va, pa); invlpg(va); } return (data); } /* Unmap memory previously mapped with table_map(). */ static void table_unmap(void *data, vm_offset_t length) { vm_offset_t va, off; va = (vm_offset_t)data; off = va & PAGE_MASK; length = round_page(length + off); va &= ~PAGE_MASK; while (length > 0) { pmap_kremove(va); invlpg(va); va += PAGE_SIZE; length -= PAGE_SIZE; } } /* * Map a table at a given offset into the crashdump map. It first * maps the header to determine the table length and then maps the * entire table. */ static void * map_table(vm_paddr_t pa, int offset, const char *sig) { ACPI_TABLE_HEADER *header; vm_offset_t length; void *table; header = table_map(pa, offset, sizeof(ACPI_TABLE_HEADER)); if (strncmp(header->Signature, sig, ACPI_NAME_SIZE) != 0) { table_unmap(header, sizeof(ACPI_TABLE_HEADER)); return (NULL); } length = header->Length; table_unmap(header, sizeof(ACPI_TABLE_HEADER)); table = table_map(pa, offset, length); if (ACPI_FAILURE(AcpiTbChecksum(table, length))) { if (bootverbose) printf("ACPI: Failed checksum for table %s\n", sig); #if (ACPI_CHECKSUM_ABORT) table_unmap(table, length); return (NULL); #endif } return (table); } /* * See if a given ACPI table is the requested table. Returns the * length of the able if it matches or zero on failure. */ static int probe_table(vm_paddr_t address, const char *sig) { ACPI_TABLE_HEADER *table; table = table_map(address, 0, sizeof(ACPI_TABLE_HEADER)); if (table == NULL) { if (bootverbose) printf("ACPI: Failed to map table at 0x%jx\n", (uintmax_t)address); return (0); } if (bootverbose) printf("Table '%.4s' at 0x%jx\n", table->Signature, (uintmax_t)address); if (strncmp(table->Signature, sig, ACPI_NAME_SIZE) != 0) { table_unmap(table, sizeof(ACPI_TABLE_HEADER)); return (0); } table_unmap(table, sizeof(ACPI_TABLE_HEADER)); return (1); } /* * Try to map a table at a given physical address previously returned * by acpi_find_table(). */ void * acpi_map_table(vm_paddr_t pa, const char *sig) { return (map_table(pa, 0, sig)); } /* Unmap a table previously mapped via acpi_map_table(). */ void acpi_unmap_table(void *table) { ACPI_TABLE_HEADER *header; header = (ACPI_TABLE_HEADER *)table; table_unmap(table, header->Length); } /* * Return the physical address of the requested table or zero if one * is not found. */ vm_paddr_t acpi_find_table(const char *sig) { ACPI_PHYSICAL_ADDRESS rsdp_ptr; ACPI_TABLE_RSDP *rsdp; ACPI_TABLE_RSDT *rsdt; ACPI_TABLE_XSDT *xsdt; ACPI_TABLE_HEADER *table; vm_paddr_t addr; int i, count; if (resource_disabled("acpi", 0)) return (0); /* * Map in the RSDP. Since ACPI uses AcpiOsMapMemory() which in turn * calls pmap_mapbios() to find the RSDP, we assume that we can use * pmap_mapbios() to map the RSDP. */ if ((rsdp_ptr = AcpiOsGetRootPointer()) == 0) return (0); rsdp = pmap_mapbios(rsdp_ptr, sizeof(ACPI_TABLE_RSDP)); if (rsdp == NULL) { if (bootverbose) printf("ACPI: Failed to map RSDP\n"); return (0); } /* * For ACPI >= 2.0, use the XSDT if it is available. * Otherwise, use the RSDT. We map the XSDT or RSDT at page 2 * in the crashdump area. Pages 0 and 1 are used to map in the * headers of candidate ACPI tables. */ addr = 0; if (rsdp->Revision >= 2 && rsdp->XsdtPhysicalAddress != 0) { /* * AcpiOsGetRootPointer only verifies the checksum for * the version 1.0 portion of the RSDP. Version 2.0 has * an additional checksum that we verify first. */ if (AcpiTbChecksum((UINT8 *)rsdp, ACPI_RSDP_XCHECKSUM_LENGTH)) { if (bootverbose) printf("ACPI: RSDP failed extended checksum\n"); return (0); } xsdt = map_table(rsdp->XsdtPhysicalAddress, 2, ACPI_SIG_XSDT); if (xsdt == NULL) { if (bootverbose) printf("ACPI: Failed to map XSDT\n"); return (0); } count = (xsdt->Header.Length - sizeof(ACPI_TABLE_HEADER)) / sizeof(UINT64); for (i = 0; i < count; i++) if (probe_table(xsdt->TableOffsetEntry[i], sig)) { addr = xsdt->TableOffsetEntry[i]; break; } acpi_unmap_table(xsdt); } else { rsdt = map_table(rsdp->RsdtPhysicalAddress, 2, ACPI_SIG_RSDT); if (rsdt == NULL) { if (bootverbose) printf("ACPI: Failed to map RSDT\n"); return (0); } count = (rsdt->Header.Length - sizeof(ACPI_TABLE_HEADER)) / sizeof(UINT32); for (i = 0; i < count; i++) if (probe_table(rsdt->TableOffsetEntry[i], sig)) { addr = rsdt->TableOffsetEntry[i]; break; } acpi_unmap_table(rsdt); } pmap_unmapbios((vm_offset_t)rsdp, sizeof(ACPI_TABLE_RSDP)); if (addr == 0) { if (bootverbose) printf("ACPI: No %s table found\n", sig); return (0); } if (bootverbose) printf("%s: Found table at 0x%jx\n", sig, (uintmax_t)addr); /* * Verify that we can map the full table and that its checksum is * correct, etc. */ table = map_table(addr, 0, sig); if (table == NULL) return (0); acpi_unmap_table(table); return (addr); } /* * ACPI nexus(4) driver. */ static int nexus_acpi_probe(device_t dev) { int error; error = acpi_identify(); if (error) return (error); return (BUS_PROBE_DEFAULT); } static int nexus_acpi_attach(device_t dev) { nexus_init_resources(); bus_generic_probe(dev); if (BUS_ADD_CHILD(dev, 10, "acpi", 0) == NULL) panic("failed to add acpi0 device"); return (bus_generic_attach(dev)); } static device_method_t nexus_acpi_methods[] = { /* Device interface */ DEVMETHOD(device_probe, nexus_acpi_probe), DEVMETHOD(device_attach, nexus_acpi_attach), { 0, 0 } }; DEFINE_CLASS_1(nexus, nexus_acpi_driver, nexus_acpi_methods, 1, nexus_driver); static devclass_t nexus_devclass; DRIVER_MODULE(nexus_acpi, root, nexus_acpi_driver, nexus_devclass, 0, 0); Index: projects/release-arm-redux/sys/i386/include/md_var.h =================================================================== --- projects/release-arm-redux/sys/i386/include/md_var.h (revision 282691) +++ projects/release-arm-redux/sys/i386/include/md_var.h (revision 282692) @@ -1,133 +1,134 @@ /*- * Copyright (c) 1995 Bruce D. Evans. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the author nor the names of contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _MACHINE_MD_VAR_H_ #define _MACHINE_MD_VAR_H_ /* * Miscellaneous machine-dependent declarations. */ extern long Maxmem; extern u_int basemem; /* PA of original top of base memory */ extern int busdma_swi_pending; extern u_int cpu_exthigh; extern u_int cpu_feature; extern u_int cpu_feature2; extern u_int amd_feature; extern u_int amd_feature2; extern u_int amd_pminfo; extern u_int via_feature_rng; extern u_int via_feature_xcrypt; extern u_int cpu_clflush_line_size; extern u_int cpu_stdext_feature; extern u_int cpu_fxsr; extern u_int cpu_high; extern u_int cpu_id; extern u_int cpu_max_ext_state_size; extern u_int cpu_mxcsr_mask; extern u_int cpu_procinfo; extern u_int cpu_procinfo2; extern char cpu_vendor[]; extern u_int cpu_vendor_id; extern u_int cpu_mon_mwait_flags; extern u_int cpu_mon_min_size; extern u_int cpu_mon_max_size; extern u_int cpu_maxphyaddr; extern u_int cyrix_did; #if defined(I586_CPU) && !defined(NO_F00F_HACK) extern int has_f00f_bug; #endif extern u_int hv_high; extern char hv_vendor[]; extern char kstack[]; extern char sigcode[]; extern int szsigcode; #ifdef COMPAT_FREEBSD4 extern int szfreebsd4_sigcode; #endif #ifdef COMPAT_43 extern int szosigcode; #endif extern uint32_t *vm_page_dump; extern int vm_page_dump_size; extern int workaround_erratum383; extern int _udatasel; extern int _ucodesel; extern int use_xsave; extern uint64_t xsave_mask; typedef void alias_for_inthand_t(u_int cs, u_int ef, u_int esp, u_int ss); struct pcb; union savefpu; struct thread; struct reg; struct fpreg; struct dbreg; struct dumperinfo; void *alloc_fpusave(int flags); void bcopyb(const void *from, void *to, size_t len); void busdma_swi(void); +bool cpu_mwait_usable(void); void cpu_probe_amdc1e(void); void cpu_setregs(void); void cpu_switch_load_gs(void) __asm(__STRING(cpu_switch_load_gs)); void doreti_iret(void) __asm(__STRING(doreti_iret)); void doreti_iret_fault(void) __asm(__STRING(doreti_iret_fault)); void doreti_popl_ds(void) __asm(__STRING(doreti_popl_ds)); void doreti_popl_ds_fault(void) __asm(__STRING(doreti_popl_ds_fault)); void doreti_popl_es(void) __asm(__STRING(doreti_popl_es)); void doreti_popl_es_fault(void) __asm(__STRING(doreti_popl_es_fault)); void doreti_popl_fs(void) __asm(__STRING(doreti_popl_fs)); void doreti_popl_fs_fault(void) __asm(__STRING(doreti_popl_fs_fault)); void dump_add_page(vm_paddr_t); void dump_drop_page(vm_paddr_t); void finishidentcpu(void); void fillw(int /*u_short*/ pat, void *base, size_t cnt); void initializecpu(void); void initializecpucache(void); void i686_pagezero(void *addr); void sse2_pagezero(void *addr); void init_AMD_Elan_sc520(void); int is_physical_memory(vm_paddr_t addr); int isa_nmi(int cd); vm_paddr_t kvtop(void *addr); void panicifcpuunsupported(void); void ppro_reenable_apic(void); void printcpuinfo(void); void setidt(int idx, alias_for_inthand_t *func, int typ, int dpl, int selec); int user_dbreg_trap(void); int minidumpsys(struct dumperinfo *); union savefpu *get_pcb_user_save_td(struct thread *td); union savefpu *get_pcb_user_save_pcb(struct pcb *pcb); struct pcb *get_pcb_td(struct thread *td); #endif /* !_MACHINE_MD_VAR_H_ */ Index: projects/release-arm-redux/sys/kern/kern_malloc.c =================================================================== --- projects/release-arm-redux/sys/kern/kern_malloc.c (revision 282691) +++ projects/release-arm-redux/sys/kern/kern_malloc.c (revision 282692) @@ -1,1110 +1,1112 @@ /*- * Copyright (c) 1987, 1991, 1993 * The Regents of the University of California. * Copyright (c) 2005-2009 Robert N. M. Watson * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)kern_malloc.c 8.3 (Berkeley) 1/4/94 */ /* * Kernel malloc(9) implementation -- general purpose kernel memory allocator * based on memory types. Back end is implemented using the UMA(9) zone * allocator. A set of fixed-size buckets are used for smaller allocations, * and a special UMA allocation interface is used for larger allocations. * Callers declare memory types, and statistics are maintained independently * for each memory type. Statistics are maintained per-CPU for performance * reasons. See malloc(9) and comments in malloc.h for a detailed * description. */ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include "opt_vm.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef DEBUG_MEMGUARD #include #endif #ifdef DEBUG_REDZONE #include #endif #if defined(INVARIANTS) && defined(__i386__) #include #endif #include #ifdef KDTRACE_HOOKS #include dtrace_malloc_probe_func_t dtrace_malloc_probe; #endif /* * When realloc() is called, if the new size is sufficiently smaller than * the old size, realloc() will allocate a new, smaller block to avoid * wasting memory. 'Sufficiently smaller' is defined as: newsize <= * oldsize / 2^n, where REALLOC_FRACTION defines the value of 'n'. */ #ifndef REALLOC_FRACTION #define REALLOC_FRACTION 1 /* new block if <= half the size */ #endif /* * Centrally define some common malloc types. */ MALLOC_DEFINE(M_CACHE, "cache", "Various Dynamically allocated caches"); MALLOC_DEFINE(M_DEVBUF, "devbuf", "device driver memory"); MALLOC_DEFINE(M_TEMP, "temp", "misc temporary data buffers"); MALLOC_DEFINE(M_IP6OPT, "ip6opt", "IPv6 options"); MALLOC_DEFINE(M_IP6NDP, "ip6ndp", "IPv6 Neighbor Discovery"); static struct malloc_type *kmemstatistics; static int kmemcount; #define KMEM_ZSHIFT 4 #define KMEM_ZBASE 16 #define KMEM_ZMASK (KMEM_ZBASE - 1) #define KMEM_ZMAX 65536 #define KMEM_ZSIZE (KMEM_ZMAX >> KMEM_ZSHIFT) static uint8_t kmemsize[KMEM_ZSIZE + 1]; #ifndef MALLOC_DEBUG_MAXZONES #define MALLOC_DEBUG_MAXZONES 1 #endif static int numzones = MALLOC_DEBUG_MAXZONES; /* * Small malloc(9) memory allocations are allocated from a set of UMA buckets * of various sizes. * * XXX: The comment here used to read "These won't be powers of two for * long." It's possible that a significant amount of wasted memory could be * recovered by tuning the sizes of these buckets. */ struct { int kz_size; char *kz_name; uma_zone_t kz_zone[MALLOC_DEBUG_MAXZONES]; } kmemzones[] = { {16, "16", }, {32, "32", }, {64, "64", }, {128, "128", }, {256, "256", }, {512, "512", }, {1024, "1024", }, {2048, "2048", }, {4096, "4096", }, {8192, "8192", }, {16384, "16384", }, {32768, "32768", }, {65536, "65536", }, {0, NULL}, }; /* * Zone to allocate malloc type descriptions from. For ABI reasons, memory * types are described by a data structure passed by the declaring code, but * the malloc(9) implementation has its own data structure describing the * type and statistics. This permits the malloc(9)-internal data structures * to be modified without breaking binary-compiled kernel modules that * declare malloc types. */ static uma_zone_t mt_zone; u_long vm_kmem_size; SYSCTL_ULONG(_vm, OID_AUTO, kmem_size, CTLFLAG_RDTUN, &vm_kmem_size, 0, "Size of kernel memory"); static u_long kmem_zmax = KMEM_ZMAX; SYSCTL_ULONG(_vm, OID_AUTO, kmem_zmax, CTLFLAG_RDTUN, &kmem_zmax, 0, "Maximum allocation size that malloc(9) would use UMA as backend"); static u_long vm_kmem_size_min; SYSCTL_ULONG(_vm, OID_AUTO, kmem_size_min, CTLFLAG_RDTUN, &vm_kmem_size_min, 0, "Minimum size of kernel memory"); static u_long vm_kmem_size_max; SYSCTL_ULONG(_vm, OID_AUTO, kmem_size_max, CTLFLAG_RDTUN, &vm_kmem_size_max, 0, "Maximum size of kernel memory"); static u_int vm_kmem_size_scale; SYSCTL_UINT(_vm, OID_AUTO, kmem_size_scale, CTLFLAG_RDTUN, &vm_kmem_size_scale, 0, "Scale factor for kernel memory size"); static int sysctl_kmem_map_size(SYSCTL_HANDLER_ARGS); SYSCTL_PROC(_vm, OID_AUTO, kmem_map_size, CTLFLAG_RD | CTLTYPE_ULONG | CTLFLAG_MPSAFE, NULL, 0, sysctl_kmem_map_size, "LU", "Current kmem allocation size"); static int sysctl_kmem_map_free(SYSCTL_HANDLER_ARGS); SYSCTL_PROC(_vm, OID_AUTO, kmem_map_free, CTLFLAG_RD | CTLTYPE_ULONG | CTLFLAG_MPSAFE, NULL, 0, sysctl_kmem_map_free, "LU", "Free space in kmem"); /* * The malloc_mtx protects the kmemstatistics linked list. */ struct mtx malloc_mtx; #ifdef MALLOC_PROFILE uint64_t krequests[KMEM_ZSIZE + 1]; static int sysctl_kern_mprof(SYSCTL_HANDLER_ARGS); #endif static int sysctl_kern_malloc_stats(SYSCTL_HANDLER_ARGS); /* * time_uptime of the last malloc(9) failure (induced or real). */ static time_t t_malloc_fail; #if defined(MALLOC_MAKE_FAILURES) || (MALLOC_DEBUG_MAXZONES > 1) static SYSCTL_NODE(_debug, OID_AUTO, malloc, CTLFLAG_RD, 0, "Kernel malloc debugging options"); #endif /* * malloc(9) fault injection -- cause malloc failures every (n) mallocs when * the caller specifies M_NOWAIT. If set to 0, no failures are caused. */ #ifdef MALLOC_MAKE_FAILURES static int malloc_failure_rate; static int malloc_nowait_count; static int malloc_failure_count; SYSCTL_INT(_debug_malloc, OID_AUTO, failure_rate, CTLFLAG_RWTUN, &malloc_failure_rate, 0, "Every (n) mallocs with M_NOWAIT will fail"); SYSCTL_INT(_debug_malloc, OID_AUTO, failure_count, CTLFLAG_RD, &malloc_failure_count, 0, "Number of imposed M_NOWAIT malloc failures"); #endif static int sysctl_kmem_map_size(SYSCTL_HANDLER_ARGS) { u_long size; size = vmem_size(kmem_arena, VMEM_ALLOC); return (sysctl_handle_long(oidp, &size, 0, req)); } static int sysctl_kmem_map_free(SYSCTL_HANDLER_ARGS) { u_long size; size = vmem_size(kmem_arena, VMEM_FREE); return (sysctl_handle_long(oidp, &size, 0, req)); } /* * malloc(9) uma zone separation -- sub-page buffer overruns in one * malloc type will affect only a subset of other malloc types. */ #if MALLOC_DEBUG_MAXZONES > 1 static void tunable_set_numzones(void) { TUNABLE_INT_FETCH("debug.malloc.numzones", &numzones); /* Sanity check the number of malloc uma zones. */ if (numzones <= 0) numzones = 1; if (numzones > MALLOC_DEBUG_MAXZONES) numzones = MALLOC_DEBUG_MAXZONES; } SYSINIT(numzones, SI_SUB_TUNABLES, SI_ORDER_ANY, tunable_set_numzones, NULL); SYSCTL_INT(_debug_malloc, OID_AUTO, numzones, CTLFLAG_RDTUN | CTLFLAG_NOFETCH, &numzones, 0, "Number of malloc uma subzones"); /* * Any number that changes regularly is an okay choice for the * offset. Build numbers are pretty good of you have them. */ static u_int zone_offset = __FreeBSD_version; TUNABLE_INT("debug.malloc.zone_offset", &zone_offset); SYSCTL_UINT(_debug_malloc, OID_AUTO, zone_offset, CTLFLAG_RDTUN, &zone_offset, 0, "Separate malloc types by examining the " "Nth character in the malloc type short description."); static u_int mtp_get_subzone(const char *desc) { size_t len; u_int val; if (desc == NULL || (len = strlen(desc)) == 0) return (0); val = desc[zone_offset % len]; return (val % numzones); } #elif MALLOC_DEBUG_MAXZONES == 0 #error "MALLOC_DEBUG_MAXZONES must be positive." #else static inline u_int mtp_get_subzone(const char *desc) { return (0); } #endif /* MALLOC_DEBUG_MAXZONES > 1 */ int malloc_last_fail(void) { return (time_uptime - t_malloc_fail); } /* * An allocation has succeeded -- update malloc type statistics for the * amount of bucket size. Occurs within a critical section so that the * thread isn't preempted and doesn't migrate while updating per-PCU * statistics. */ static void malloc_type_zone_allocated(struct malloc_type *mtp, unsigned long size, int zindx) { struct malloc_type_internal *mtip; struct malloc_type_stats *mtsp; critical_enter(); mtip = mtp->ks_handle; mtsp = &mtip->mti_stats[curcpu]; if (size > 0) { mtsp->mts_memalloced += size; mtsp->mts_numallocs++; } if (zindx != -1) mtsp->mts_size |= 1 << zindx; #ifdef KDTRACE_HOOKS if (dtrace_malloc_probe != NULL) { uint32_t probe_id = mtip->mti_probes[DTMALLOC_PROBE_MALLOC]; if (probe_id != 0) (dtrace_malloc_probe)(probe_id, (uintptr_t) mtp, (uintptr_t) mtip, (uintptr_t) mtsp, size, zindx); } #endif critical_exit(); } void malloc_type_allocated(struct malloc_type *mtp, unsigned long size) { if (size > 0) malloc_type_zone_allocated(mtp, size, -1); } /* * A free operation has occurred -- update malloc type statistics for the * amount of the bucket size. Occurs within a critical section so that the * thread isn't preempted and doesn't migrate while updating per-CPU * statistics. */ void malloc_type_freed(struct malloc_type *mtp, unsigned long size) { struct malloc_type_internal *mtip; struct malloc_type_stats *mtsp; critical_enter(); mtip = mtp->ks_handle; mtsp = &mtip->mti_stats[curcpu]; mtsp->mts_memfreed += size; mtsp->mts_numfrees++; #ifdef KDTRACE_HOOKS if (dtrace_malloc_probe != NULL) { uint32_t probe_id = mtip->mti_probes[DTMALLOC_PROBE_FREE]; if (probe_id != 0) (dtrace_malloc_probe)(probe_id, (uintptr_t) mtp, (uintptr_t) mtip, (uintptr_t) mtsp, size, 0); } #endif critical_exit(); } /* * contigmalloc: * * Allocate a block of physically contiguous memory. * * If M_NOWAIT is set, this routine will not block and return NULL if * the allocation fails. */ void * contigmalloc(unsigned long size, struct malloc_type *type, int flags, vm_paddr_t low, vm_paddr_t high, unsigned long alignment, vm_paddr_t boundary) { void *ret; ret = (void *)kmem_alloc_contig(kernel_arena, size, flags, low, high, alignment, boundary, VM_MEMATTR_DEFAULT); if (ret != NULL) malloc_type_allocated(type, round_page(size)); return (ret); } /* * contigfree: * * Free a block of memory allocated by contigmalloc. * * This routine may not block. */ void contigfree(void *addr, unsigned long size, struct malloc_type *type) { kmem_free(kernel_arena, (vm_offset_t)addr, size); malloc_type_freed(type, round_page(size)); } /* * malloc: * * Allocate a block of memory. * * If M_NOWAIT is set, this routine will not block and return NULL if * the allocation fails. */ void * malloc(unsigned long size, struct malloc_type *mtp, int flags) { int indx; struct malloc_type_internal *mtip; caddr_t va; uma_zone_t zone; #if defined(DIAGNOSTIC) || defined(DEBUG_REDZONE) unsigned long osize = size; #endif #ifdef INVARIANTS KASSERT(mtp->ks_magic == M_MAGIC, ("malloc: bad malloc type magic")); /* * Check that exactly one of M_WAITOK or M_NOWAIT is specified. */ indx = flags & (M_WAITOK | M_NOWAIT); if (indx != M_NOWAIT && indx != M_WAITOK) { static struct timeval lasterr; static int curerr, once; if (once == 0 && ppsratecheck(&lasterr, &curerr, 1)) { printf("Bad malloc flags: %x\n", indx); kdb_backtrace(); flags |= M_WAITOK; once++; } } #endif #ifdef MALLOC_MAKE_FAILURES if ((flags & M_NOWAIT) && (malloc_failure_rate != 0)) { atomic_add_int(&malloc_nowait_count, 1); if ((malloc_nowait_count % malloc_failure_rate) == 0) { atomic_add_int(&malloc_failure_count, 1); t_malloc_fail = time_uptime; return (NULL); } } #endif if (flags & M_WAITOK) KASSERT(curthread->td_intr_nesting_level == 0, ("malloc(M_WAITOK) in interrupt context")); #ifdef DEBUG_MEMGUARD if (memguard_cmp_mtp(mtp, size)) { va = memguard_alloc(size, flags); if (va != NULL) return (va); /* This is unfortunate but should not be fatal. */ } #endif #ifdef DEBUG_REDZONE size = redzone_size_ntor(size); #endif if (size <= kmem_zmax) { mtip = mtp->ks_handle; if (size & KMEM_ZMASK) size = (size & ~KMEM_ZMASK) + KMEM_ZBASE; indx = kmemsize[size >> KMEM_ZSHIFT]; KASSERT(mtip->mti_zone < numzones, ("mti_zone %u out of range %d", mtip->mti_zone, numzones)); zone = kmemzones[indx].kz_zone[mtip->mti_zone]; #ifdef MALLOC_PROFILE krequests[size >> KMEM_ZSHIFT]++; #endif va = uma_zalloc(zone, flags); if (va != NULL) size = zone->uz_size; malloc_type_zone_allocated(mtp, va == NULL ? 0 : size, indx); } else { size = roundup(size, PAGE_SIZE); zone = NULL; va = uma_large_malloc(size, flags); malloc_type_allocated(mtp, va == NULL ? 0 : size); } if (flags & M_WAITOK) KASSERT(va != NULL, ("malloc(M_WAITOK) returned NULL")); else if (va == NULL) t_malloc_fail = time_uptime; #ifdef DIAGNOSTIC if (va != NULL && !(flags & M_ZERO)) { memset(va, 0x70, osize); } #endif #ifdef DEBUG_REDZONE if (va != NULL) va = redzone_setup(va, osize); #endif return ((void *) va); } /* * free: * * Free a block of memory allocated by malloc. * * This routine may not block. */ void free(void *addr, struct malloc_type *mtp) { uma_slab_t slab; u_long size; KASSERT(mtp->ks_magic == M_MAGIC, ("free: bad malloc type magic")); /* free(NULL, ...) does nothing */ if (addr == NULL) return; #ifdef DEBUG_MEMGUARD if (is_memguard_addr(addr)) { memguard_free(addr); return; } #endif #ifdef DEBUG_REDZONE redzone_check(addr); addr = redzone_addr_ntor(addr); #endif slab = vtoslab((vm_offset_t)addr & (~UMA_SLAB_MASK)); if (slab == NULL) panic("free: address %p(%p) has not been allocated.\n", addr, (void *)((u_long)addr & (~UMA_SLAB_MASK))); if (!(slab->us_flags & UMA_SLAB_MALLOC)) { #ifdef INVARIANTS struct malloc_type **mtpp = addr; #endif size = slab->us_keg->uk_size; #ifdef INVARIANTS /* * Cache a pointer to the malloc_type that most recently freed * this memory here. This way we know who is most likely to * have stepped on it later. * * This code assumes that size is a multiple of 8 bytes for * 64 bit machines */ mtpp = (struct malloc_type **) ((unsigned long)mtpp & ~UMA_ALIGN_PTR); mtpp += (size - sizeof(struct malloc_type *)) / sizeof(struct malloc_type *); *mtpp = mtp; #endif uma_zfree_arg(LIST_FIRST(&slab->us_keg->uk_zones), addr, slab); } else { size = slab->us_size; uma_large_free(slab); } malloc_type_freed(mtp, size); } /* * realloc: change the size of a memory block */ void * realloc(void *addr, unsigned long size, struct malloc_type *mtp, int flags) { uma_slab_t slab; unsigned long alloc; void *newaddr; KASSERT(mtp->ks_magic == M_MAGIC, ("realloc: bad malloc type magic")); /* realloc(NULL, ...) is equivalent to malloc(...) */ if (addr == NULL) return (malloc(size, mtp, flags)); /* * XXX: Should report free of old memory and alloc of new memory to * per-CPU stats. */ #ifdef DEBUG_MEMGUARD if (is_memguard_addr(addr)) return (memguard_realloc(addr, size, mtp, flags)); #endif #ifdef DEBUG_REDZONE slab = NULL; alloc = redzone_get_size(addr); #else slab = vtoslab((vm_offset_t)addr & ~(UMA_SLAB_MASK)); /* Sanity check */ KASSERT(slab != NULL, ("realloc: address %p out of range", (void *)addr)); /* Get the size of the original block */ if (!(slab->us_flags & UMA_SLAB_MALLOC)) alloc = slab->us_keg->uk_size; else alloc = slab->us_size; /* Reuse the original block if appropriate */ if (size <= alloc && (size > (alloc >> REALLOC_FRACTION) || alloc == MINALLOCSIZE)) return (addr); #endif /* !DEBUG_REDZONE */ /* Allocate a new, bigger (or smaller) block */ if ((newaddr = malloc(size, mtp, flags)) == NULL) return (NULL); /* Copy over original contents */ bcopy(addr, newaddr, min(size, alloc)); free(addr, mtp); return (newaddr); } /* * reallocf: same as realloc() but free memory on failure. */ void * reallocf(void *addr, unsigned long size, struct malloc_type *mtp, int flags) { void *mem; if ((mem = realloc(addr, size, mtp, flags)) == NULL) free(addr, mtp); return (mem); } /* - * Wake the page daemon when we exhaust KVA. It will call the lowmem handler - * and uma_reclaim() callbacks in a context that is safe. + * Wake the uma reclamation pagedaemon thread when we exhaust KVA. It + * will call the lowmem handler and uma_reclaim() callbacks in a + * context that is safe. */ static void kmem_reclaim(vmem_t *vm, int flags) { + uma_reclaim_wakeup(); pagedaemon_wakeup(); } #ifndef __sparc64__ CTASSERT(VM_KMEM_SIZE_SCALE >= 1); #endif /* * Initialize the kernel memory (kmem) arena. */ void kmeminit(void) { u_long mem_size; u_long tmp; #ifdef VM_KMEM_SIZE if (vm_kmem_size == 0) vm_kmem_size = VM_KMEM_SIZE; #endif #ifdef VM_KMEM_SIZE_MIN if (vm_kmem_size_min == 0) vm_kmem_size_min = VM_KMEM_SIZE_MIN; #endif #ifdef VM_KMEM_SIZE_MAX if (vm_kmem_size_max == 0) vm_kmem_size_max = VM_KMEM_SIZE_MAX; #endif /* * Calculate the amount of kernel virtual address (KVA) space that is * preallocated to the kmem arena. In order to support a wide range * of machines, it is a function of the physical memory size, * specifically, * * min(max(physical memory size / VM_KMEM_SIZE_SCALE, * VM_KMEM_SIZE_MIN), VM_KMEM_SIZE_MAX) * * Every architecture must define an integral value for * VM_KMEM_SIZE_SCALE. However, the definitions of VM_KMEM_SIZE_MIN * and VM_KMEM_SIZE_MAX, which represent respectively the floor and * ceiling on this preallocation, are optional. Typically, * VM_KMEM_SIZE_MAX is itself a function of the available KVA space on * a given architecture. */ mem_size = vm_cnt.v_page_count; if (mem_size <= 32768) /* delphij XXX 128MB */ kmem_zmax = PAGE_SIZE; if (vm_kmem_size_scale < 1) vm_kmem_size_scale = VM_KMEM_SIZE_SCALE; /* * Check if we should use defaults for the "vm_kmem_size" * variable: */ if (vm_kmem_size == 0) { vm_kmem_size = (mem_size / vm_kmem_size_scale) * PAGE_SIZE; if (vm_kmem_size_min > 0 && vm_kmem_size < vm_kmem_size_min) vm_kmem_size = vm_kmem_size_min; if (vm_kmem_size_max > 0 && vm_kmem_size >= vm_kmem_size_max) vm_kmem_size = vm_kmem_size_max; } /* * The amount of KVA space that is preallocated to the * kmem arena can be set statically at compile-time or manually * through the kernel environment. However, it is still limited to * twice the physical memory size, which has been sufficient to handle * the most severe cases of external fragmentation in the kmem arena. */ if (vm_kmem_size / 2 / PAGE_SIZE > mem_size) vm_kmem_size = 2 * mem_size * PAGE_SIZE; vm_kmem_size = round_page(vm_kmem_size); #ifdef DEBUG_MEMGUARD tmp = memguard_fudge(vm_kmem_size, kernel_map); #else tmp = vm_kmem_size; #endif vmem_init(kmem_arena, "kmem arena", kva_alloc(tmp), tmp, PAGE_SIZE, 0, 0); vmem_set_reclaim(kmem_arena, kmem_reclaim); #ifdef DEBUG_MEMGUARD /* * Initialize MemGuard if support compiled in. MemGuard is a * replacement allocator used for detecting tamper-after-free * scenarios as they occur. It is only used for debugging. */ memguard_init(kmem_arena); #endif } /* * Initialize the kernel memory allocator */ /* ARGSUSED*/ static void mallocinit(void *dummy) { int i; uint8_t indx; mtx_init(&malloc_mtx, "malloc", NULL, MTX_DEF); kmeminit(); uma_startup2(); if (kmem_zmax < PAGE_SIZE || kmem_zmax > KMEM_ZMAX) kmem_zmax = KMEM_ZMAX; mt_zone = uma_zcreate("mt_zone", sizeof(struct malloc_type_internal), #ifdef INVARIANTS mtrash_ctor, mtrash_dtor, mtrash_init, mtrash_fini, #else NULL, NULL, NULL, NULL, #endif UMA_ALIGN_PTR, UMA_ZONE_MALLOC); for (i = 0, indx = 0; kmemzones[indx].kz_size != 0; indx++) { int size = kmemzones[indx].kz_size; char *name = kmemzones[indx].kz_name; int subzone; for (subzone = 0; subzone < numzones; subzone++) { kmemzones[indx].kz_zone[subzone] = uma_zcreate(name, size, #ifdef INVARIANTS mtrash_ctor, mtrash_dtor, mtrash_init, mtrash_fini, #else NULL, NULL, NULL, NULL, #endif UMA_ALIGN_PTR, UMA_ZONE_MALLOC); } for (;i <= size; i+= KMEM_ZBASE) kmemsize[i >> KMEM_ZSHIFT] = indx; } } SYSINIT(kmem, SI_SUB_KMEM, SI_ORDER_SECOND, mallocinit, NULL); void malloc_init(void *data) { struct malloc_type_internal *mtip; struct malloc_type *mtp; KASSERT(vm_cnt.v_page_count != 0, ("malloc_register before vm_init")); mtp = data; if (mtp->ks_magic != M_MAGIC) panic("malloc_init: bad malloc type magic"); mtip = uma_zalloc(mt_zone, M_WAITOK | M_ZERO); mtp->ks_handle = mtip; mtip->mti_zone = mtp_get_subzone(mtp->ks_shortdesc); mtx_lock(&malloc_mtx); mtp->ks_next = kmemstatistics; kmemstatistics = mtp; kmemcount++; mtx_unlock(&malloc_mtx); } void malloc_uninit(void *data) { struct malloc_type_internal *mtip; struct malloc_type_stats *mtsp; struct malloc_type *mtp, *temp; uma_slab_t slab; long temp_allocs, temp_bytes; int i; mtp = data; KASSERT(mtp->ks_magic == M_MAGIC, ("malloc_uninit: bad malloc type magic")); KASSERT(mtp->ks_handle != NULL, ("malloc_deregister: cookie NULL")); mtx_lock(&malloc_mtx); mtip = mtp->ks_handle; mtp->ks_handle = NULL; if (mtp != kmemstatistics) { for (temp = kmemstatistics; temp != NULL; temp = temp->ks_next) { if (temp->ks_next == mtp) { temp->ks_next = mtp->ks_next; break; } } KASSERT(temp, ("malloc_uninit: type '%s' not found", mtp->ks_shortdesc)); } else kmemstatistics = mtp->ks_next; kmemcount--; mtx_unlock(&malloc_mtx); /* * Look for memory leaks. */ temp_allocs = temp_bytes = 0; for (i = 0; i < MAXCPU; i++) { mtsp = &mtip->mti_stats[i]; temp_allocs += mtsp->mts_numallocs; temp_allocs -= mtsp->mts_numfrees; temp_bytes += mtsp->mts_memalloced; temp_bytes -= mtsp->mts_memfreed; } if (temp_allocs > 0 || temp_bytes > 0) { printf("Warning: memory type %s leaked memory on destroy " "(%ld allocations, %ld bytes leaked).\n", mtp->ks_shortdesc, temp_allocs, temp_bytes); } slab = vtoslab((vm_offset_t) mtip & (~UMA_SLAB_MASK)); uma_zfree_arg(mt_zone, mtip, slab); } struct malloc_type * malloc_desc2type(const char *desc) { struct malloc_type *mtp; mtx_assert(&malloc_mtx, MA_OWNED); for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) { if (strcmp(mtp->ks_shortdesc, desc) == 0) return (mtp); } return (NULL); } static int sysctl_kern_malloc_stats(SYSCTL_HANDLER_ARGS) { struct malloc_type_stream_header mtsh; struct malloc_type_internal *mtip; struct malloc_type_header mth; struct malloc_type *mtp; int error, i; struct sbuf sbuf; error = sysctl_wire_old_buffer(req, 0); if (error != 0) return (error); sbuf_new_for_sysctl(&sbuf, NULL, 128, req); sbuf_clear_flags(&sbuf, SBUF_INCLUDENUL); mtx_lock(&malloc_mtx); /* * Insert stream header. */ bzero(&mtsh, sizeof(mtsh)); mtsh.mtsh_version = MALLOC_TYPE_STREAM_VERSION; mtsh.mtsh_maxcpus = MAXCPU; mtsh.mtsh_count = kmemcount; (void)sbuf_bcat(&sbuf, &mtsh, sizeof(mtsh)); /* * Insert alternating sequence of type headers and type statistics. */ for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) { mtip = (struct malloc_type_internal *)mtp->ks_handle; /* * Insert type header. */ bzero(&mth, sizeof(mth)); strlcpy(mth.mth_name, mtp->ks_shortdesc, MALLOC_MAX_NAME); (void)sbuf_bcat(&sbuf, &mth, sizeof(mth)); /* * Insert type statistics for each CPU. */ for (i = 0; i < MAXCPU; i++) { (void)sbuf_bcat(&sbuf, &mtip->mti_stats[i], sizeof(mtip->mti_stats[i])); } } mtx_unlock(&malloc_mtx); error = sbuf_finish(&sbuf); sbuf_delete(&sbuf); return (error); } SYSCTL_PROC(_kern, OID_AUTO, malloc_stats, CTLFLAG_RD|CTLTYPE_STRUCT, 0, 0, sysctl_kern_malloc_stats, "s,malloc_type_ustats", "Return malloc types"); SYSCTL_INT(_kern, OID_AUTO, malloc_count, CTLFLAG_RD, &kmemcount, 0, "Count of kernel malloc types"); void malloc_type_list(malloc_type_list_func_t *func, void *arg) { struct malloc_type *mtp, **bufmtp; int count, i; size_t buflen; mtx_lock(&malloc_mtx); restart: mtx_assert(&malloc_mtx, MA_OWNED); count = kmemcount; mtx_unlock(&malloc_mtx); buflen = sizeof(struct malloc_type *) * count; bufmtp = malloc(buflen, M_TEMP, M_WAITOK); mtx_lock(&malloc_mtx); if (count < kmemcount) { free(bufmtp, M_TEMP); goto restart; } for (mtp = kmemstatistics, i = 0; mtp != NULL; mtp = mtp->ks_next, i++) bufmtp[i] = mtp; mtx_unlock(&malloc_mtx); for (i = 0; i < count; i++) (func)(bufmtp[i], arg); free(bufmtp, M_TEMP); } #ifdef DDB DB_SHOW_COMMAND(malloc, db_show_malloc) { struct malloc_type_internal *mtip; struct malloc_type *mtp; uint64_t allocs, frees; uint64_t alloced, freed; int i; db_printf("%18s %12s %12s %12s\n", "Type", "InUse", "MemUse", "Requests"); for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) { mtip = (struct malloc_type_internal *)mtp->ks_handle; allocs = 0; frees = 0; alloced = 0; freed = 0; for (i = 0; i < MAXCPU; i++) { allocs += mtip->mti_stats[i].mts_numallocs; frees += mtip->mti_stats[i].mts_numfrees; alloced += mtip->mti_stats[i].mts_memalloced; freed += mtip->mti_stats[i].mts_memfreed; } db_printf("%18s %12ju %12juK %12ju\n", mtp->ks_shortdesc, allocs - frees, (alloced - freed + 1023) / 1024, allocs); if (db_pager_quit) break; } } #if MALLOC_DEBUG_MAXZONES > 1 DB_SHOW_COMMAND(multizone_matches, db_show_multizone_matches) { struct malloc_type_internal *mtip; struct malloc_type *mtp; u_int subzone; if (!have_addr) { db_printf("Usage: show multizone_matches \n"); return; } mtp = (void *)addr; if (mtp->ks_magic != M_MAGIC) { db_printf("Magic %lx does not match expected %x\n", mtp->ks_magic, M_MAGIC); return; } mtip = mtp->ks_handle; subzone = mtip->mti_zone; for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) { mtip = mtp->ks_handle; if (mtip->mti_zone != subzone) continue; db_printf("%s\n", mtp->ks_shortdesc); if (db_pager_quit) break; } } #endif /* MALLOC_DEBUG_MAXZONES > 1 */ #endif /* DDB */ #ifdef MALLOC_PROFILE static int sysctl_kern_mprof(SYSCTL_HANDLER_ARGS) { struct sbuf sbuf; uint64_t count; uint64_t waste; uint64_t mem; int error; int rsize; int size; int i; waste = 0; mem = 0; error = sysctl_wire_old_buffer(req, 0); if (error != 0) return (error); sbuf_new_for_sysctl(&sbuf, NULL, 128, req); sbuf_printf(&sbuf, "\n Size Requests Real Size\n"); for (i = 0; i < KMEM_ZSIZE; i++) { size = i << KMEM_ZSHIFT; rsize = kmemzones[kmemsize[i]].kz_size; count = (long long unsigned)krequests[i]; sbuf_printf(&sbuf, "%6d%28llu%11d\n", size, (unsigned long long)count, rsize); if ((rsize * count) > (size * count)) waste += (rsize * count) - (size * count); mem += (rsize * count); } sbuf_printf(&sbuf, "\nTotal memory used:\t%30llu\nTotal Memory wasted:\t%30llu\n", (unsigned long long)mem, (unsigned long long)waste); error = sbuf_finish(&sbuf); sbuf_delete(&sbuf); return (error); } SYSCTL_OID(_kern, OID_AUTO, mprof, CTLTYPE_STRING|CTLFLAG_RD, NULL, 0, sysctl_kern_mprof, "A", "Malloc Profiling"); #endif /* MALLOC_PROFILE */ Index: projects/release-arm-redux/sys/kern/kern_thread.c =================================================================== --- projects/release-arm-redux/sys/kern/kern_thread.c (revision 282691) +++ projects/release-arm-redux/sys/kern/kern_thread.c (revision 282692) @@ -1,1114 +1,1137 @@ /*- * Copyright (C) 2001 Julian Elischer . * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice(s), this list of conditions and the following disclaimer as * the first lines of this file unmodified other than the possible * addition of one or more copyright notices. * 2. Redistributions in binary form must reproduce the above copyright * notice(s), this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER(S) ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER(S) BE LIABLE FOR ANY * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH * DAMAGE. */ #include "opt_witness.h" #include "opt_hwpmc_hooks.h" #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef HWPMC_HOOKS #include #endif #include #include #include #include #include SDT_PROVIDER_DECLARE(proc); SDT_PROBE_DEFINE(proc, , , lwp__exit); /* * thread related storage. */ static uma_zone_t thread_zone; TAILQ_HEAD(, thread) zombie_threads = TAILQ_HEAD_INITIALIZER(zombie_threads); static struct mtx zombie_lock; MTX_SYSINIT(zombie_lock, &zombie_lock, "zombie lock", MTX_SPIN); static void thread_zombie(struct thread *); #define TID_BUFFER_SIZE 1024 struct mtx tid_lock; static struct unrhdr *tid_unrhdr; static lwpid_t tid_buffer[TID_BUFFER_SIZE]; static int tid_head, tid_tail; static MALLOC_DEFINE(M_TIDHASH, "tidhash", "thread hash"); struct tidhashhead *tidhashtbl; u_long tidhash; struct rwlock tidhash_lock; static lwpid_t tid_alloc(void) { lwpid_t tid; tid = alloc_unr(tid_unrhdr); if (tid != -1) return (tid); mtx_lock(&tid_lock); if (tid_head == tid_tail) { mtx_unlock(&tid_lock); return (-1); } tid = tid_buffer[tid_head]; tid_head = (tid_head + 1) % TID_BUFFER_SIZE; mtx_unlock(&tid_lock); return (tid); } static void tid_free(lwpid_t tid) { lwpid_t tmp_tid = -1; mtx_lock(&tid_lock); if ((tid_tail + 1) % TID_BUFFER_SIZE == tid_head) { tmp_tid = tid_buffer[tid_head]; tid_head = (tid_head + 1) % TID_BUFFER_SIZE; } tid_buffer[tid_tail] = tid; tid_tail = (tid_tail + 1) % TID_BUFFER_SIZE; mtx_unlock(&tid_lock); if (tmp_tid != -1) free_unr(tid_unrhdr, tmp_tid); } /* * Prepare a thread for use. */ static int thread_ctor(void *mem, int size, void *arg, int flags) { struct thread *td; td = (struct thread *)mem; td->td_state = TDS_INACTIVE; td->td_oncpu = NOCPU; td->td_tid = tid_alloc(); /* * Note that td_critnest begins life as 1 because the thread is not * running and is thereby implicitly waiting to be on the receiving * end of a context switch. */ td->td_critnest = 1; td->td_lend_user_pri = PRI_MAX; EVENTHANDLER_INVOKE(thread_ctor, td); #ifdef AUDIT audit_thread_alloc(td); #endif umtx_thread_alloc(td); return (0); } /* * Reclaim a thread after use. */ static void thread_dtor(void *mem, int size, void *arg) { struct thread *td; td = (struct thread *)mem; #ifdef INVARIANTS /* Verify that this thread is in a safe state to free. */ switch (td->td_state) { case TDS_INHIBITED: case TDS_RUNNING: case TDS_CAN_RUN: case TDS_RUNQ: /* * We must never unlink a thread that is in one of * these states, because it is currently active. */ panic("bad state for thread unlinking"); /* NOTREACHED */ case TDS_INACTIVE: break; default: panic("bad thread state"); /* NOTREACHED */ } #endif #ifdef AUDIT audit_thread_free(td); #endif /* Free all OSD associated to this thread. */ osd_thread_exit(td); EVENTHANDLER_INVOKE(thread_dtor, td); tid_free(td->td_tid); } /* * Initialize type-stable parts of a thread (when newly created). */ static int thread_init(void *mem, int size, int flags) { struct thread *td; td = (struct thread *)mem; td->td_sleepqueue = sleepq_alloc(); td->td_turnstile = turnstile_alloc(); td->td_rlqe = NULL; EVENTHANDLER_INVOKE(thread_init, td); td->td_sched = (struct td_sched *)&td[1]; umtx_thread_init(td); td->td_kstack = 0; td->td_sel = NULL; return (0); } /* * Tear down type-stable parts of a thread (just before being discarded). */ static void thread_fini(void *mem, int size) { struct thread *td; td = (struct thread *)mem; EVENTHANDLER_INVOKE(thread_fini, td); rlqentry_free(td->td_rlqe); turnstile_free(td->td_turnstile); sleepq_free(td->td_sleepqueue); umtx_thread_fini(td); seltdfini(td); } /* * For a newly created process, * link up all the structures and its initial threads etc. * called from: * {arch}/{arch}/machdep.c {arch}_init(), init386() etc. * proc_dtor() (should go away) * proc_init() */ void proc_linkup0(struct proc *p, struct thread *td) { TAILQ_INIT(&p->p_threads); /* all threads in proc */ proc_linkup(p, td); } void proc_linkup(struct proc *p, struct thread *td) { sigqueue_init(&p->p_sigqueue, p); p->p_ksi = ksiginfo_alloc(1); if (p->p_ksi != NULL) { /* XXX p_ksi may be null if ksiginfo zone is not ready */ p->p_ksi->ksi_flags = KSI_EXT | KSI_INS; } LIST_INIT(&p->p_mqnotifier); p->p_numthreads = 0; thread_link(td, p); } /* * Initialize global thread allocation resources. */ void threadinit(void) { mtx_init(&tid_lock, "TID lock", NULL, MTX_DEF); /* * pid_max cannot be greater than PID_MAX. * leave one number for thread0. */ tid_unrhdr = new_unrhdr(PID_MAX + 2, INT_MAX, &tid_lock); thread_zone = uma_zcreate("THREAD", sched_sizeof_thread(), thread_ctor, thread_dtor, thread_init, thread_fini, 16 - 1, 0); tidhashtbl = hashinit(maxproc / 2, M_TIDHASH, &tidhash); rw_init(&tidhash_lock, "tidhash"); } /* * Place an unused thread on the zombie list. * Use the slpq as that must be unused by now. */ void thread_zombie(struct thread *td) { mtx_lock_spin(&zombie_lock); TAILQ_INSERT_HEAD(&zombie_threads, td, td_slpq); mtx_unlock_spin(&zombie_lock); } /* * Release a thread that has exited after cpu_throw(). */ void thread_stash(struct thread *td) { atomic_subtract_rel_int(&td->td_proc->p_exitthreads, 1); thread_zombie(td); } /* * Reap zombie resources. */ void thread_reap(void) { struct thread *td_first, *td_next; /* * Don't even bother to lock if none at this instant, * we really don't care about the next instant.. */ if (!TAILQ_EMPTY(&zombie_threads)) { mtx_lock_spin(&zombie_lock); td_first = TAILQ_FIRST(&zombie_threads); if (td_first) TAILQ_INIT(&zombie_threads); mtx_unlock_spin(&zombie_lock); while (td_first) { td_next = TAILQ_NEXT(td_first, td_slpq); if (td_first->td_ucred) crfree(td_first->td_ucred); thread_free(td_first); td_first = td_next; } } } /* * Allocate a thread. */ struct thread * thread_alloc(int pages) { struct thread *td; thread_reap(); /* check if any zombies to get */ td = (struct thread *)uma_zalloc(thread_zone, M_WAITOK); KASSERT(td->td_kstack == 0, ("thread_alloc got thread with kstack")); if (!vm_thread_new(td, pages)) { uma_zfree(thread_zone, td); return (NULL); } cpu_thread_alloc(td); return (td); } int thread_alloc_stack(struct thread *td, int pages) { KASSERT(td->td_kstack == 0, ("thread_alloc_stack called on a thread with kstack")); if (!vm_thread_new(td, pages)) return (0); cpu_thread_alloc(td); return (1); } /* * Deallocate a thread. */ void thread_free(struct thread *td) { lock_profile_thread_exit(td); if (td->td_cpuset) cpuset_rel(td->td_cpuset); td->td_cpuset = NULL; cpu_thread_free(td); if (td->td_kstack != 0) vm_thread_dispose(td); uma_zfree(thread_zone, td); } /* * Discard the current thread and exit from its context. * Always called with scheduler locked. * * Because we can't free a thread while we're operating under its context, * push the current thread into our CPU's deadthread holder. This means * we needn't worry about someone else grabbing our context before we * do a cpu_throw(). */ void thread_exit(void) { uint64_t runtime, new_switchtime; struct thread *td; struct thread *td2; struct proc *p; int wakeup_swapper; td = curthread; p = td->td_proc; PROC_SLOCK_ASSERT(p, MA_OWNED); mtx_assert(&Giant, MA_NOTOWNED); PROC_LOCK_ASSERT(p, MA_OWNED); KASSERT(p != NULL, ("thread exiting without a process")); CTR3(KTR_PROC, "thread_exit: thread %p (pid %ld, %s)", td, (long)p->p_pid, td->td_name); KASSERT(TAILQ_EMPTY(&td->td_sigqueue.sq_list), ("signal pending")); #ifdef AUDIT AUDIT_SYSCALL_EXIT(0, td); #endif /* * drop FPU & debug register state storage, or any other * architecture specific resources that * would not be on a new untouched process. */ cpu_thread_exit(td); /* XXXSMP */ /* * The last thread is left attached to the process * So that the whole bundle gets recycled. Skip * all this stuff if we never had threads. * EXIT clears all sign of other threads when * it goes to single threading, so the last thread always * takes the short path. */ if (p->p_flag & P_HADTHREADS) { if (p->p_numthreads > 1) { atomic_add_int(&td->td_proc->p_exitthreads, 1); thread_unlink(td); td2 = FIRST_THREAD_IN_PROC(p); sched_exit_thread(td2, td); /* * The test below is NOT true if we are the * sole exiting thread. P_STOPPED_SINGLE is unset * in exit1() after it is the only survivor. */ if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE) { if (p->p_numthreads == p->p_suspcount) { thread_lock(p->p_singlethread); wakeup_swapper = thread_unsuspend_one( p->p_singlethread, p); thread_unlock(p->p_singlethread); if (wakeup_swapper) kick_proc0(); } } PCPU_SET(deadthread, td); } else { /* * The last thread is exiting.. but not through exit() */ panic ("thread_exit: Last thread exiting on its own"); } } #ifdef HWPMC_HOOKS /* * If this thread is part of a process that is being tracked by hwpmc(4), * inform the module of the thread's impending exit. */ if (PMC_PROC_IS_USING_PMCS(td->td_proc)) PMC_SWITCH_CONTEXT(td, PMC_FN_CSW_OUT); #endif PROC_UNLOCK(p); PROC_STATLOCK(p); thread_lock(td); PROC_SUNLOCK(p); /* Do the same timestamp bookkeeping that mi_switch() would do. */ new_switchtime = cpu_ticks(); runtime = new_switchtime - PCPU_GET(switchtime); td->td_runtime += runtime; td->td_incruntime += runtime; PCPU_SET(switchtime, new_switchtime); PCPU_SET(switchticks, ticks); PCPU_INC(cnt.v_swtch); /* Save our resource usage in our process. */ td->td_ru.ru_nvcsw++; ruxagg(p, td); rucollect(&p->p_ru, &td->td_ru); PROC_STATUNLOCK(p); td->td_state = TDS_INACTIVE; #ifdef WITNESS witness_thread_exit(td); #endif CTR1(KTR_PROC, "thread_exit: cpu_throw() thread %p", td); sched_throw(td); panic("I'm a teapot!"); /* NOTREACHED */ } /* * Do any thread specific cleanups that may be needed in wait() * called with Giant, proc and schedlock not held. */ void thread_wait(struct proc *p) { struct thread *td; mtx_assert(&Giant, MA_NOTOWNED); KASSERT(p->p_numthreads == 1, ("multiple threads in thread_wait()")); KASSERT(p->p_exitthreads == 0, ("p_exitthreads leaking")); td = FIRST_THREAD_IN_PROC(p); /* Lock the last thread so we spin until it exits cpu_throw(). */ thread_lock(td); thread_unlock(td); lock_profile_thread_exit(td); cpuset_rel(td->td_cpuset); td->td_cpuset = NULL; cpu_thread_clean(td); crfree(td->td_ucred); thread_reap(); /* check for zombie threads etc. */ } /* * Link a thread to a process. * set up anything that needs to be initialized for it to * be used by the process. */ void thread_link(struct thread *td, struct proc *p) { /* * XXX This can't be enabled because it's called for proc0 before * its lock has been created. * PROC_LOCK_ASSERT(p, MA_OWNED); */ td->td_state = TDS_INACTIVE; td->td_proc = p; td->td_flags = TDF_INMEM; LIST_INIT(&td->td_contested); LIST_INIT(&td->td_lprof[0]); LIST_INIT(&td->td_lprof[1]); sigqueue_init(&td->td_sigqueue, p); callout_init(&td->td_slpcallout, CALLOUT_MPSAFE); TAILQ_INSERT_TAIL(&p->p_threads, td, td_plist); p->p_numthreads++; } /* * Called from: * thread_exit() */ void thread_unlink(struct thread *td) { struct proc *p = td->td_proc; PROC_LOCK_ASSERT(p, MA_OWNED); TAILQ_REMOVE(&p->p_threads, td, td_plist); p->p_numthreads--; /* could clear a few other things here */ /* Must NOT clear links to proc! */ } static int calc_remaining(struct proc *p, int mode) { int remaining; PROC_LOCK_ASSERT(p, MA_OWNED); PROC_SLOCK_ASSERT(p, MA_OWNED); if (mode == SINGLE_EXIT) remaining = p->p_numthreads; else if (mode == SINGLE_BOUNDARY) remaining = p->p_numthreads - p->p_boundary_count; else if (mode == SINGLE_NO_EXIT || mode == SINGLE_ALLPROC) remaining = p->p_numthreads - p->p_suspcount; else panic("calc_remaining: wrong mode %d", mode); return (remaining); } static int remain_for_mode(int mode) { return (mode == SINGLE_ALLPROC ? 0 : 1); } static int weed_inhib(int mode, struct thread *td2, struct proc *p) { int wakeup_swapper; PROC_LOCK_ASSERT(p, MA_OWNED); PROC_SLOCK_ASSERT(p, MA_OWNED); THREAD_LOCK_ASSERT(td2, MA_OWNED); wakeup_swapper = 0; switch (mode) { case SINGLE_EXIT: if (TD_IS_SUSPENDED(td2)) wakeup_swapper |= thread_unsuspend_one(td2, p); if (TD_ON_SLEEPQ(td2) && (td2->td_flags & TDF_SINTR) != 0) wakeup_swapper |= sleepq_abort(td2, EINTR); break; case SINGLE_BOUNDARY: if (TD_IS_SUSPENDED(td2) && (td2->td_flags & TDF_BOUNDARY) == 0) wakeup_swapper |= thread_unsuspend_one(td2, p); if (TD_ON_SLEEPQ(td2) && (td2->td_flags & TDF_SINTR) != 0) wakeup_swapper |= sleepq_abort(td2, ERESTART); break; case SINGLE_NO_EXIT: if (TD_IS_SUSPENDED(td2) && (td2->td_flags & TDF_BOUNDARY) == 0) wakeup_swapper |= thread_unsuspend_one(td2, p); if (TD_ON_SLEEPQ(td2) && (td2->td_flags & TDF_SINTR) != 0) wakeup_swapper |= sleepq_abort(td2, ERESTART); break; case SINGLE_ALLPROC: /* * ALLPROC suspend tries to avoid spurious EINTR for * threads sleeping interruptable, by suspending the * thread directly, similarly to sig_suspend_threads(). * Since such sleep is not performed at the user * boundary, TDF_BOUNDARY flag is not set, and TDF_ALLPROCSUSP * is used to avoid immediate un-suspend. */ if (TD_IS_SUSPENDED(td2) && (td2->td_flags & (TDF_BOUNDARY | TDF_ALLPROCSUSP)) == 0) wakeup_swapper |= thread_unsuspend_one(td2, p); if (TD_ON_SLEEPQ(td2) && (td2->td_flags & TDF_SINTR) != 0) { if ((td2->td_flags & TDF_SBDRY) == 0) { thread_suspend_one(td2); td2->td_flags |= TDF_ALLPROCSUSP; } else { wakeup_swapper |= sleepq_abort(td2, ERESTART); } } break; } return (wakeup_swapper); } /* * Enforce single-threading. * * Returns 1 if the caller must abort (another thread is waiting to * exit the process or similar). Process is locked! * Returns 0 when you are successfully the only thread running. * A process has successfully single threaded in the suspend mode when * There are no threads in user mode. Threads in the kernel must be * allowed to continue until they get to the user boundary. They may even * copy out their return values and data before suspending. They may however be * accelerated in reaching the user boundary as we will wake up * any sleeping threads that are interruptable. (PCATCH). */ int thread_single(struct proc *p, int mode) { struct thread *td; struct thread *td2; int remaining, wakeup_swapper; td = curthread; KASSERT(mode == SINGLE_EXIT || mode == SINGLE_BOUNDARY || mode == SINGLE_ALLPROC || mode == SINGLE_NO_EXIT, ("invalid mode %d", mode)); /* * If allowing non-ALLPROC singlethreading for non-curproc * callers, calc_remaining() and remain_for_mode() should be * adjusted to also account for td->td_proc != p. For now * this is not implemented because it is not used. */ KASSERT((mode == SINGLE_ALLPROC && td->td_proc != p) || (mode != SINGLE_ALLPROC && td->td_proc == p), ("mode %d proc %p curproc %p", mode, p, td->td_proc)); mtx_assert(&Giant, MA_NOTOWNED); PROC_LOCK_ASSERT(p, MA_OWNED); if ((p->p_flag & P_HADTHREADS) == 0 && mode != SINGLE_ALLPROC) return (0); /* Is someone already single threading? */ if (p->p_singlethread != NULL && p->p_singlethread != td) return (1); if (mode == SINGLE_EXIT) { p->p_flag |= P_SINGLE_EXIT; p->p_flag &= ~P_SINGLE_BOUNDARY; } else { p->p_flag &= ~P_SINGLE_EXIT; if (mode == SINGLE_BOUNDARY) p->p_flag |= P_SINGLE_BOUNDARY; else p->p_flag &= ~P_SINGLE_BOUNDARY; } if (mode == SINGLE_ALLPROC) p->p_flag |= P_TOTAL_STOP; p->p_flag |= P_STOPPED_SINGLE; PROC_SLOCK(p); p->p_singlethread = td; remaining = calc_remaining(p, mode); while (remaining != remain_for_mode(mode)) { if (P_SHOULDSTOP(p) != P_STOPPED_SINGLE) goto stopme; wakeup_swapper = 0; FOREACH_THREAD_IN_PROC(p, td2) { if (td2 == td) continue; thread_lock(td2); td2->td_flags |= TDF_ASTPENDING | TDF_NEEDSUSPCHK; if (TD_IS_INHIBITED(td2)) { wakeup_swapper |= weed_inhib(mode, td2, p); #ifdef SMP } else if (TD_IS_RUNNING(td2) && td != td2) { forward_signal(td2); #endif } thread_unlock(td2); } if (wakeup_swapper) kick_proc0(); remaining = calc_remaining(p, mode); /* * Maybe we suspended some threads.. was it enough? */ if (remaining == remain_for_mode(mode)) break; stopme: /* * Wake us up when everyone else has suspended. * In the mean time we suspend as well. */ thread_suspend_switch(td, p); remaining = calc_remaining(p, mode); } if (mode == SINGLE_EXIT) { /* * Convert the process to an unthreaded process. The * SINGLE_EXIT is called by exit1() or execve(), in * both cases other threads must be retired. */ KASSERT(p->p_numthreads == 1, ("Unthreading with >1 threads")); p->p_singlethread = NULL; p->p_flag &= ~(P_STOPPED_SINGLE | P_SINGLE_EXIT | P_HADTHREADS); /* * Wait for any remaining threads to exit cpu_throw(). */ while (p->p_exitthreads != 0) { PROC_SUNLOCK(p); PROC_UNLOCK(p); sched_relinquish(td); PROC_LOCK(p); PROC_SLOCK(p); } + } else if (mode == SINGLE_BOUNDARY) { + /* + * Wait until all suspended threads are removed from + * the processors. The thread_suspend_check() + * increments p_boundary_count while it is still + * running, which makes it possible for the execve() + * to destroy vmspace while our other threads are + * still using the address space. + * + * We lock the thread, which is only allowed to + * succeed after context switch code finished using + * the address space. + */ + FOREACH_THREAD_IN_PROC(p, td2) { + if (td2 == td) + continue; + thread_lock(td2); + KASSERT((td2->td_flags & TDF_BOUNDARY) != 0, + ("td %p not on boundary", td2)); + KASSERT(TD_IS_SUSPENDED(td2), + ("td %p is not suspended", td2)); + thread_unlock(td2); + } } PROC_SUNLOCK(p); return (0); } bool thread_suspend_check_needed(void) { struct proc *p; struct thread *td; td = curthread; p = td->td_proc; PROC_LOCK_ASSERT(p, MA_OWNED); return (P_SHOULDSTOP(p) || ((p->p_flag & P_TRACED) != 0 && (td->td_dbgflags & TDB_SUSPEND) != 0)); } /* * Called in from locations that can safely check to see * whether we have to suspend or at least throttle for a * single-thread event (e.g. fork). * * Such locations include userret(). * If the "return_instead" argument is non zero, the thread must be able to * accept 0 (caller may continue), or 1 (caller must abort) as a result. * * The 'return_instead' argument tells the function if it may do a * thread_exit() or suspend, or whether the caller must abort and back * out instead. * * If the thread that set the single_threading request has set the * P_SINGLE_EXIT bit in the process flags then this call will never return * if 'return_instead' is false, but will exit. * * P_SINGLE_EXIT | return_instead == 0| return_instead != 0 *---------------+--------------------+--------------------- * 0 | returns 0 | returns 0 or 1 * | when ST ends | immediately *---------------+--------------------+--------------------- * 1 | thread exits | returns 1 * | | immediately * 0 = thread_exit() or suspension ok, * other = return error instead of stopping the thread. * * While a full suspension is under effect, even a single threading * thread would be suspended if it made this call (but it shouldn't). * This call should only be made from places where * thread_exit() would be safe as that may be the outcome unless * return_instead is set. */ int thread_suspend_check(int return_instead) { struct thread *td; struct proc *p; int wakeup_swapper; td = curthread; p = td->td_proc; mtx_assert(&Giant, MA_NOTOWNED); PROC_LOCK_ASSERT(p, MA_OWNED); while (thread_suspend_check_needed()) { if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE) { KASSERT(p->p_singlethread != NULL, ("singlethread not set")); /* * The only suspension in action is a * single-threading. Single threader need not stop. * XXX Should be safe to access unlocked * as it can only be set to be true by us. */ if (p->p_singlethread == td) return (0); /* Exempt from stopping. */ } if ((p->p_flag & P_SINGLE_EXIT) && return_instead) return (EINTR); /* Should we goto user boundary if we didn't come from there? */ if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE && (p->p_flag & P_SINGLE_BOUNDARY) && return_instead) return (ERESTART); /* * Ignore suspend requests for stop signals if they * are deferred. */ if ((P_SHOULDSTOP(p) == P_STOPPED_SIG || (p->p_flag & P_TOTAL_STOP) != 0) && (td->td_flags & TDF_SBDRY) != 0) { KASSERT(return_instead, ("TDF_SBDRY set for unsafe thread_suspend_check")); return (0); } /* * If the process is waiting for us to exit, * this thread should just suicide. * Assumes that P_SINGLE_EXIT implies P_STOPPED_SINGLE. */ if ((p->p_flag & P_SINGLE_EXIT) && (p->p_singlethread != td)) { PROC_UNLOCK(p); tidhash_remove(td); PROC_LOCK(p); tdsigcleanup(td); umtx_thread_exit(td); PROC_SLOCK(p); thread_stopped(p); thread_exit(); } PROC_SLOCK(p); thread_stopped(p); if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE) { if (p->p_numthreads == p->p_suspcount + 1) { thread_lock(p->p_singlethread); wakeup_swapper = thread_unsuspend_one(p->p_singlethread, p); thread_unlock(p->p_singlethread); if (wakeup_swapper) kick_proc0(); } } PROC_UNLOCK(p); thread_lock(td); /* * When a thread suspends, it just * gets taken off all queues. */ thread_suspend_one(td); if (return_instead == 0) { p->p_boundary_count++; td->td_flags |= TDF_BOUNDARY; } PROC_SUNLOCK(p); mi_switch(SW_INVOL | SWT_SUSPEND, NULL); if (return_instead == 0) td->td_flags &= ~TDF_BOUNDARY; thread_unlock(td); PROC_LOCK(p); if (return_instead == 0) { PROC_SLOCK(p); p->p_boundary_count--; PROC_SUNLOCK(p); } } return (0); } void thread_suspend_switch(struct thread *td, struct proc *p) { KASSERT(!TD_IS_SUSPENDED(td), ("already suspended")); PROC_LOCK_ASSERT(p, MA_OWNED); PROC_SLOCK_ASSERT(p, MA_OWNED); /* * We implement thread_suspend_one in stages here to avoid * dropping the proc lock while the thread lock is owned. */ if (p == td->td_proc) { thread_stopped(p); p->p_suspcount++; } PROC_UNLOCK(p); thread_lock(td); td->td_flags &= ~TDF_NEEDSUSPCHK; TD_SET_SUSPENDED(td); sched_sleep(td, 0); PROC_SUNLOCK(p); DROP_GIANT(); mi_switch(SW_VOL | SWT_SUSPEND, NULL); thread_unlock(td); PICKUP_GIANT(); PROC_LOCK(p); PROC_SLOCK(p); } void thread_suspend_one(struct thread *td) { struct proc *p; p = td->td_proc; PROC_SLOCK_ASSERT(p, MA_OWNED); THREAD_LOCK_ASSERT(td, MA_OWNED); KASSERT(!TD_IS_SUSPENDED(td), ("already suspended")); p->p_suspcount++; td->td_flags &= ~TDF_NEEDSUSPCHK; TD_SET_SUSPENDED(td); sched_sleep(td, 0); } int thread_unsuspend_one(struct thread *td, struct proc *p) { THREAD_LOCK_ASSERT(td, MA_OWNED); KASSERT(TD_IS_SUSPENDED(td), ("Thread not suspended")); TD_CLR_SUSPENDED(td); td->td_flags &= ~TDF_ALLPROCSUSP; if (td->td_proc == p) { PROC_SLOCK_ASSERT(p, MA_OWNED); p->p_suspcount--; } return (setrunnable(td)); } /* * Allow all threads blocked by single threading to continue running. */ void thread_unsuspend(struct proc *p) { struct thread *td; int wakeup_swapper; PROC_LOCK_ASSERT(p, MA_OWNED); PROC_SLOCK_ASSERT(p, MA_OWNED); wakeup_swapper = 0; if (!P_SHOULDSTOP(p)) { FOREACH_THREAD_IN_PROC(p, td) { thread_lock(td); if (TD_IS_SUSPENDED(td)) { wakeup_swapper |= thread_unsuspend_one(td, p); } thread_unlock(td); } } else if ((P_SHOULDSTOP(p) == P_STOPPED_SINGLE) && (p->p_numthreads == p->p_suspcount)) { /* * Stopping everything also did the job for the single * threading request. Now we've downgraded to single-threaded, * let it continue. */ if (p->p_singlethread->td_proc == p) { thread_lock(p->p_singlethread); wakeup_swapper = thread_unsuspend_one( p->p_singlethread, p); thread_unlock(p->p_singlethread); } } if (wakeup_swapper) kick_proc0(); } /* * End the single threading mode.. */ void thread_single_end(struct proc *p, int mode) { struct thread *td; int wakeup_swapper; KASSERT(mode == SINGLE_EXIT || mode == SINGLE_BOUNDARY || mode == SINGLE_ALLPROC || mode == SINGLE_NO_EXIT, ("invalid mode %d", mode)); PROC_LOCK_ASSERT(p, MA_OWNED); KASSERT((mode == SINGLE_ALLPROC && (p->p_flag & P_TOTAL_STOP) != 0) || (mode != SINGLE_ALLPROC && (p->p_flag & P_TOTAL_STOP) == 0), ("mode %d does not match P_TOTAL_STOP", mode)); p->p_flag &= ~(P_STOPPED_SINGLE | P_SINGLE_EXIT | P_SINGLE_BOUNDARY | P_TOTAL_STOP); PROC_SLOCK(p); p->p_singlethread = NULL; wakeup_swapper = 0; /* * If there are other threads they may now run, * unless of course there is a blanket 'stop order' * on the process. The single threader must be allowed * to continue however as this is a bad place to stop. */ if (p->p_numthreads != remain_for_mode(mode) && !P_SHOULDSTOP(p)) { FOREACH_THREAD_IN_PROC(p, td) { thread_lock(td); if (TD_IS_SUSPENDED(td)) { wakeup_swapper |= thread_unsuspend_one(td, p); } thread_unlock(td); } } PROC_SUNLOCK(p); if (wakeup_swapper) kick_proc0(); } struct thread * thread_find(struct proc *p, lwpid_t tid) { struct thread *td; PROC_LOCK_ASSERT(p, MA_OWNED); FOREACH_THREAD_IN_PROC(p, td) { if (td->td_tid == tid) break; } return (td); } /* Locate a thread by number; return with proc lock held. */ struct thread * tdfind(lwpid_t tid, pid_t pid) { #define RUN_THRESH 16 struct thread *td; int run = 0; rw_rlock(&tidhash_lock); LIST_FOREACH(td, TIDHASH(tid), td_hash) { if (td->td_tid == tid) { if (pid != -1 && td->td_proc->p_pid != pid) { td = NULL; break; } PROC_LOCK(td->td_proc); if (td->td_proc->p_state == PRS_NEW) { PROC_UNLOCK(td->td_proc); td = NULL; break; } if (run > RUN_THRESH) { if (rw_try_upgrade(&tidhash_lock)) { LIST_REMOVE(td, td_hash); LIST_INSERT_HEAD(TIDHASH(td->td_tid), td, td_hash); rw_wunlock(&tidhash_lock); return (td); } } break; } run++; } rw_runlock(&tidhash_lock); return (td); } void tidhash_add(struct thread *td) { rw_wlock(&tidhash_lock); LIST_INSERT_HEAD(TIDHASH(td->td_tid), td, td_hash); rw_wunlock(&tidhash_lock); } void tidhash_remove(struct thread *td) { rw_wlock(&tidhash_lock); LIST_REMOVE(td, td_hash); rw_wunlock(&tidhash_lock); } Index: projects/release-arm-redux/sys/vm/uma.h =================================================================== --- projects/release-arm-redux/sys/vm/uma.h (revision 282691) +++ projects/release-arm-redux/sys/vm/uma.h (revision 282692) @@ -1,693 +1,696 @@ /*- * Copyright (c) 2002, 2003, 2004, 2005 Jeffrey Roberson * Copyright (c) 2004, 2005 Bosko Milekic * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * $FreeBSD$ * */ /* * uma.h - External definitions for the Universal Memory Allocator * */ #ifndef _VM_UMA_H_ #define _VM_UMA_H_ #include /* For NULL */ #include /* For M_* */ /* User visible parameters */ #define UMA_SMALLEST_UNIT (PAGE_SIZE / 256) /* Smallest item allocated */ /* Types and type defs */ struct uma_zone; /* Opaque type used as a handle to the zone */ typedef struct uma_zone * uma_zone_t; void zone_drain(uma_zone_t); /* * Item constructor * * Arguments: * item A pointer to the memory which has been allocated. * arg The arg field passed to uma_zalloc_arg * size The size of the allocated item * flags See zalloc flags * * Returns: * 0 on success * errno on failure * * Discussion: * The constructor is called just before the memory is returned * to the user. It may block if necessary. */ typedef int (*uma_ctor)(void *mem, int size, void *arg, int flags); /* * Item destructor * * Arguments: * item A pointer to the memory which has been allocated. * size The size of the item being destructed. * arg Argument passed through uma_zfree_arg * * Returns: * Nothing * * Discussion: * The destructor may perform operations that differ from those performed * by the initializer, but it must leave the object in the same state. * This IS type stable storage. This is called after EVERY zfree call. */ typedef void (*uma_dtor)(void *mem, int size, void *arg); /* * Item initializer * * Arguments: * item A pointer to the memory which has been allocated. * size The size of the item being initialized. * flags See zalloc flags * * Returns: * 0 on success * errno on failure * * Discussion: * The initializer is called when the memory is cached in the uma zone. * The initializer and the destructor should leave the object in the same * state. */ typedef int (*uma_init)(void *mem, int size, int flags); /* * Item discard function * * Arguments: * item A pointer to memory which has been 'freed' but has not left the * zone's cache. * size The size of the item being discarded. * * Returns: * Nothing * * Discussion: * This routine is called when memory leaves a zone and is returned to the * system for other uses. It is the counter-part to the init function. */ typedef void (*uma_fini)(void *mem, int size); /* * Import new memory into a cache zone. */ typedef int (*uma_import)(void *arg, void **store, int count, int flags); /* * Free memory from a cache zone. */ typedef void (*uma_release)(void *arg, void **store, int count); /* * What's the difference between initializing and constructing? * * The item is initialized when it is cached, and this is the state that the * object should be in when returned to the allocator. The purpose of this is * to remove some code which would otherwise be called on each allocation by * utilizing a known, stable state. This differs from the constructor which * will be called on EVERY allocation. * * For example, in the initializer you may want to initialize embedded locks, * NULL list pointers, set up initial states, magic numbers, etc. This way if * the object is held in the allocator and re-used it won't be necessary to * re-initialize it. * * The constructor may be used to lock a data structure, link it on to lists, * bump reference counts or total counts of outstanding structures, etc. * */ /* Function proto types */ /* * Create a new uma zone * * Arguments: * name The text name of the zone for debugging and stats. This memory * should not be freed until the zone has been deallocated. * size The size of the object that is being created. * ctor The constructor that is called when the object is allocated. * dtor The destructor that is called when the object is freed. * init An initializer that sets up the initial state of the memory. * fini A discard function that undoes initialization done by init. * ctor/dtor/init/fini may all be null, see notes above. * align A bitmask that corresponds to the requested alignment * eg 4 would be 0x3 * flags A set of parameters that control the behavior of the zone. * * Returns: * A pointer to a structure which is intended to be opaque to users of * the interface. The value may be null if the wait flag is not set. */ uma_zone_t uma_zcreate(const char *name, size_t size, uma_ctor ctor, uma_dtor dtor, uma_init uminit, uma_fini fini, int align, uint32_t flags); /* * Create a secondary uma zone * * Arguments: * name The text name of the zone for debugging and stats. This memory * should not be freed until the zone has been deallocated. * ctor The constructor that is called when the object is allocated. * dtor The destructor that is called when the object is freed. * zinit An initializer that sets up the initial state of the memory * as the object passes from the Keg's slab to the Zone's cache. * zfini A discard function that undoes initialization done by init * as the object passes from the Zone's cache to the Keg's slab. * * ctor/dtor/zinit/zfini may all be null, see notes above. * Note that the zinit and zfini specified here are NOT * exactly the same as the init/fini specified to uma_zcreate() * when creating a master zone. These zinit/zfini are called * on the TRANSITION from keg to zone (and vice-versa). Once * these are set, the primary zone may alter its init/fini * (which are called when the object passes from VM to keg) * using uma_zone_set_init/fini()) as well as its own * zinit/zfini (unset by default for master zone) with * uma_zone_set_zinit/zfini() (note subtle 'z' prefix). * * master A reference to this zone's Master Zone (Primary Zone), * which contains the backing Keg for the Secondary Zone * being added. * * Returns: * A pointer to a structure which is intended to be opaque to users of * the interface. The value may be null if the wait flag is not set. */ uma_zone_t uma_zsecond_create(char *name, uma_ctor ctor, uma_dtor dtor, uma_init zinit, uma_fini zfini, uma_zone_t master); /* * Add a second master to a secondary zone. This provides multiple data * backends for objects with the same size. Both masters must have * compatible allocation flags. Presently, UMA_ZONE_MALLOC type zones are * the only supported. * * Returns: * Error on failure, 0 on success. */ int uma_zsecond_add(uma_zone_t zone, uma_zone_t master); /* * Create cache-only zones. * * This allows uma's per-cpu cache facilities to handle arbitrary * pointers. Consumers must specify the import and release functions to * fill and destroy caches. UMA does not allocate any memory for these * zones. The 'arg' parameter is passed to import/release and is caller * specific. */ uma_zone_t uma_zcache_create(char *name, int size, uma_ctor ctor, uma_dtor dtor, uma_init zinit, uma_fini zfini, uma_import zimport, uma_release zrelease, void *arg, int flags); /* * Definitions for uma_zcreate flags * * These flags share space with UMA_ZFLAGs in uma_int.h. Be careful not to * overlap when adding new features. 0xf0000000 is in use by uma_int.h. */ #define UMA_ZONE_PAGEABLE 0x0001 /* Return items not fully backed by physical memory XXX Not yet */ #define UMA_ZONE_ZINIT 0x0002 /* Initialize with zeros */ #define UMA_ZONE_STATIC 0x0004 /* Statically sized zone */ #define UMA_ZONE_OFFPAGE 0x0008 /* Force the slab structure allocation off of the real memory */ #define UMA_ZONE_MALLOC 0x0010 /* For use by malloc(9) only! */ #define UMA_ZONE_NOFREE 0x0020 /* Do not free slabs of this type! */ #define UMA_ZONE_MTXCLASS 0x0040 /* Create a new lock class */ #define UMA_ZONE_VM 0x0080 /* * Used for internal vm datastructures * only. */ #define UMA_ZONE_HASH 0x0100 /* * Use a hash table instead of caching * information in the vm_page. */ #define UMA_ZONE_SECONDARY 0x0200 /* Zone is a Secondary Zone */ #define UMA_ZONE_REFCNT 0x0400 /* Allocate refcnts in slabs */ #define UMA_ZONE_MAXBUCKET 0x0800 /* Use largest buckets */ #define UMA_ZONE_CACHESPREAD 0x1000 /* * Spread memory start locations across * all possible cache lines. May * require many virtually contiguous * backend pages and can fail early. */ #define UMA_ZONE_VTOSLAB 0x2000 /* Zone uses vtoslab for lookup. */ #define UMA_ZONE_NODUMP 0x4000 /* * Zone's pages will not be included in * mini-dumps. */ #define UMA_ZONE_PCPU 0x8000 /* * Allocates mp_ncpus slabs sized to * sizeof(struct pcpu). */ /* * These flags are shared between the keg and zone. In zones wishing to add * new kegs these flags must be compatible. Some are determined based on * physical parameters of the request and may not be provided by the consumer. */ #define UMA_ZONE_INHERIT \ (UMA_ZONE_OFFPAGE | UMA_ZONE_MALLOC | UMA_ZONE_NOFREE | \ UMA_ZONE_HASH | UMA_ZONE_REFCNT | UMA_ZONE_VTOSLAB | UMA_ZONE_PCPU) /* Definitions for align */ #define UMA_ALIGN_PTR (sizeof(void *) - 1) /* Alignment fit for ptr */ #define UMA_ALIGN_LONG (sizeof(long) - 1) /* "" long */ #define UMA_ALIGN_INT (sizeof(int) - 1) /* "" int */ #define UMA_ALIGN_SHORT (sizeof(short) - 1) /* "" short */ #define UMA_ALIGN_CHAR (sizeof(char) - 1) /* "" char */ #define UMA_ALIGN_CACHE (0 - 1) /* Cache line size align */ /* * Destroys an empty uma zone. If the zone is not empty uma complains loudly. * * Arguments: * zone The zone we want to destroy. * */ void uma_zdestroy(uma_zone_t zone); /* * Allocates an item out of a zone * * Arguments: * zone The zone we are allocating from * arg This data is passed to the ctor function * flags See sys/malloc.h for available flags. * * Returns: * A non-null pointer to an initialized element from the zone is * guaranteed if the wait flag is M_WAITOK. Otherwise a null pointer * may be returned if the zone is empty or the ctor failed. */ void *uma_zalloc_arg(uma_zone_t zone, void *arg, int flags); /* * Allocates an item out of a zone without supplying an argument * * This is just a wrapper for uma_zalloc_arg for convenience. * */ static __inline void *uma_zalloc(uma_zone_t zone, int flags); static __inline void * uma_zalloc(uma_zone_t zone, int flags) { return uma_zalloc_arg(zone, NULL, flags); } /* * Frees an item back into the specified zone. * * Arguments: * zone The zone the item was originally allocated out of. * item The memory to be freed. * arg Argument passed to the destructor * * Returns: * Nothing. */ void uma_zfree_arg(uma_zone_t zone, void *item, void *arg); /* * Frees an item back to a zone without supplying an argument * * This is just a wrapper for uma_zfree_arg for convenience. * */ static __inline void uma_zfree(uma_zone_t zone, void *item); static __inline void uma_zfree(uma_zone_t zone, void *item) { uma_zfree_arg(zone, item, NULL); } /* * XXX The rest of the prototypes in this header are h0h0 magic for the VM. * If you think you need to use it for a normal zone you're probably incorrect. */ /* * Backend page supplier routines * * Arguments: * zone The zone that is requesting pages. * size The number of bytes being requested. * pflag Flags for these memory pages, see below. * wait Indicates our willingness to block. * * Returns: * A pointer to the allocated memory or NULL on failure. */ typedef void *(*uma_alloc)(uma_zone_t zone, vm_size_t size, uint8_t *pflag, int wait); /* * Backend page free routines * * Arguments: * item A pointer to the previously allocated pages. * size The original size of the allocation. * pflag The flags for the slab. See UMA_SLAB_* below. * * Returns: * None */ typedef void (*uma_free)(void *item, vm_size_t size, uint8_t pflag); /* * Sets up the uma allocator. (Called by vm_mem_init) * * Arguments: * bootmem A pointer to memory used to bootstrap the system. * * Returns: * Nothing * * Discussion: * This memory is used for zones which allocate things before the * backend page supplier can give us pages. It should be * UMA_SLAB_SIZE * boot_pages bytes. (see uma_int.h) * */ void uma_startup(void *bootmem, int boot_pages); /* * Finishes starting up the allocator. This should * be called when kva is ready for normal allocs. * * Arguments: * None * * Returns: * Nothing * * Discussion: * uma_startup2 is called by kmeminit() to enable us of uma for malloc. */ void uma_startup2(void); /* * Reclaims unused memory for all zones * * Arguments: * None * Returns: * None * * This should only be called by the page out daemon. */ void uma_reclaim(void); /* * Sets the alignment mask to be used for all zones requesting cache * alignment. Should be called by MD boot code prior to starting VM/UMA. * * Arguments: * align The alignment mask * * Returns: * Nothing */ void uma_set_align(int align); /* * Set a reserved number of items to hold for M_USE_RESERVE allocations. All * other requests must allocate new backing pages. */ void uma_zone_reserve(uma_zone_t zone, int nitems); /* * Reserves the maximum KVA space required by the zone and configures the zone * to use a VM_ALLOC_NOOBJ-based backend allocator. * * Arguments: * zone The zone to update. * nitems The upper limit on the number of items that can be allocated. * * Returns: * 0 if KVA space can not be allocated * 1 if successful * * Discussion: * When the machine supports a direct map and the zone's items are smaller * than a page, the zone will use the direct map instead of allocating KVA * space. */ int uma_zone_reserve_kva(uma_zone_t zone, int nitems); /* * Sets a high limit on the number of items allowed in a zone * * Arguments: * zone The zone to limit * nitems The requested upper limit on the number of items allowed * * Returns: * int The effective value of nitems after rounding up based on page size */ int uma_zone_set_max(uma_zone_t zone, int nitems); /* * Obtains the effective limit on the number of items in a zone * * Arguments: * zone The zone to obtain the effective limit from * * Return: * 0 No limit * int The effective limit of the zone */ int uma_zone_get_max(uma_zone_t zone); /* * Sets a warning to be printed when limit is reached * * Arguments: * zone The zone we will warn about * warning Warning content * * Returns: * Nothing */ void uma_zone_set_warning(uma_zone_t zone, const char *warning); /* * Obtains the approximate current number of items allocated from a zone * * Arguments: * zone The zone to obtain the current allocation count from * * Return: * int The approximate current number of items allocated from the zone */ int uma_zone_get_cur(uma_zone_t zone); /* * The following two routines (uma_zone_set_init/fini) * are used to set the backend init/fini pair which acts on an * object as it becomes allocated and is placed in a slab within * the specified zone's backing keg. These should probably not * be changed once allocations have already begun, but only be set * immediately upon zone creation. */ void uma_zone_set_init(uma_zone_t zone, uma_init uminit); void uma_zone_set_fini(uma_zone_t zone, uma_fini fini); /* * The following two routines (uma_zone_set_zinit/zfini) are * used to set the zinit/zfini pair which acts on an object as * it passes from the backing Keg's slab cache to the * specified Zone's bucket cache. These should probably not * be changed once allocations have already begun, but only be set * immediately upon zone creation. */ void uma_zone_set_zinit(uma_zone_t zone, uma_init zinit); void uma_zone_set_zfini(uma_zone_t zone, uma_fini zfini); /* * Replaces the standard backend allocator for this zone. * * Arguments: * zone The zone whose backend allocator is being changed. * allocf A pointer to the allocation function * * Returns: * Nothing * * Discussion: * This could be used to implement pageable allocation, or perhaps * even DMA allocators if used in conjunction with the OFFPAGE * zone flag. */ void uma_zone_set_allocf(uma_zone_t zone, uma_alloc allocf); /* * Used for freeing memory provided by the allocf above * * Arguments: * zone The zone that intends to use this free routine. * freef The page freeing routine. * * Returns: * Nothing */ void uma_zone_set_freef(uma_zone_t zone, uma_free freef); /* * These flags are setable in the allocf and visible in the freef. */ #define UMA_SLAB_BOOT 0x01 /* Slab alloced from boot pages */ #define UMA_SLAB_KMEM 0x02 /* Slab alloced from kmem_map */ #define UMA_SLAB_KERNEL 0x04 /* Slab alloced from kernel_map */ #define UMA_SLAB_PRIV 0x08 /* Slab alloced from priv allocator */ #define UMA_SLAB_OFFP 0x10 /* Slab is managed separately */ #define UMA_SLAB_MALLOC 0x20 /* Slab is a large malloc slab */ /* 0x40 and 0x80 are available */ /* * Used to pre-fill a zone with some number of items * * Arguments: * zone The zone to fill * itemcnt The number of items to reserve * * Returns: * Nothing * * NOTE: This is blocking and should only be done at startup */ void uma_prealloc(uma_zone_t zone, int itemcnt); /* * Used to lookup the reference counter allocated for an item * from a UMA_ZONE_REFCNT zone. For UMA_ZONE_REFCNT zones, * reference counters are allocated for items and stored in * the underlying slab header. * * Arguments: * zone The UMA_ZONE_REFCNT zone to which the item belongs. * item The address of the item for which we want a refcnt. * * Returns: * A pointer to a uint32_t reference counter. */ uint32_t *uma_find_refcnt(uma_zone_t zone, void *item); /* * Used to determine if a fixed-size zone is exhausted. * * Arguments: * zone The zone to check * * Returns: * Non-zero if zone is exhausted. */ int uma_zone_exhausted(uma_zone_t zone); int uma_zone_exhausted_nolock(uma_zone_t zone); /* * Common UMA_ZONE_PCPU zones. */ extern uma_zone_t pcpu_zone_64; extern uma_zone_t pcpu_zone_ptr; /* * Exported statistics structures to be used by user space monitoring tools. * Statistics stream consists of a uma_stream_header, followed by a series of * alternative uma_type_header and uma_type_stat structures. */ #define UMA_STREAM_VERSION 0x00000001 struct uma_stream_header { uint32_t ush_version; /* Stream format version. */ uint32_t ush_maxcpus; /* Value of MAXCPU for stream. */ uint32_t ush_count; /* Number of records. */ uint32_t _ush_pad; /* Pad/reserved field. */ }; #define UTH_MAX_NAME 32 #define UTH_ZONE_SECONDARY 0x00000001 struct uma_type_header { /* * Static per-zone data, some extracted from the supporting keg. */ char uth_name[UTH_MAX_NAME]; uint32_t uth_align; /* Keg: alignment. */ uint32_t uth_size; /* Keg: requested size of item. */ uint32_t uth_rsize; /* Keg: real size of item. */ uint32_t uth_maxpages; /* Keg: maximum number of pages. */ uint32_t uth_limit; /* Keg: max items to allocate. */ /* * Current dynamic zone/keg-derived statistics. */ uint32_t uth_pages; /* Keg: pages allocated. */ uint32_t uth_keg_free; /* Keg: items free. */ uint32_t uth_zone_free; /* Zone: items free. */ uint32_t uth_bucketsize; /* Zone: desired bucket size. */ uint32_t uth_zone_flags; /* Zone: flags. */ uint64_t uth_allocs; /* Zone: number of allocations. */ uint64_t uth_frees; /* Zone: number of frees. */ uint64_t uth_fails; /* Zone: number of alloc failures. */ uint64_t uth_sleeps; /* Zone: number of alloc sleeps. */ uint64_t _uth_reserved1[2]; /* Reserved. */ }; struct uma_percpu_stat { uint64_t ups_allocs; /* Cache: number of allocations. */ uint64_t ups_frees; /* Cache: number of frees. */ uint64_t ups_cache_free; /* Cache: free items in cache. */ uint64_t _ups_reserved[5]; /* Reserved. */ }; +void uma_reclaim_wakeup(void); +void uma_reclaim_worker(void *); + #endif /* _VM_UMA_H_ */ Index: projects/release-arm-redux/sys/vm/uma_core.c =================================================================== --- projects/release-arm-redux/sys/vm/uma_core.c (revision 282691) +++ projects/release-arm-redux/sys/vm/uma_core.c (revision 282692) @@ -1,3621 +1,3655 @@ /*- * Copyright (c) 2002-2005, 2009, 2013 Jeffrey Roberson * Copyright (c) 2004, 2005 Bosko Milekic * Copyright (c) 2004-2006 Robert N. M. Watson * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /* * uma_core.c Implementation of the Universal Memory allocator * * This allocator is intended to replace the multitude of similar object caches * in the standard FreeBSD kernel. The intent is to be flexible as well as * effecient. A primary design goal is to return unused memory to the rest of * the system. This will make the system as a whole more flexible due to the * ability to move memory to subsystems which most need it instead of leaving * pools of reserved memory unused. * * The basic ideas stem from similar slab/zone based allocators whose algorithms * are well known. * */ /* * TODO: * - Improve memory usage for large allocations * - Investigate cache size adjustments */ #include __FBSDID("$FreeBSD$"); /* I should really use ktr.. */ /* #define UMA_DEBUG 1 #define UMA_DEBUG_ALLOC 1 #define UMA_DEBUG_ALLOC_1 1 */ #include "opt_ddb.h" #include "opt_param.h" #include "opt_vm.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef DEBUG_MEMGUARD #include #endif /* * This is the zone and keg from which all zones are spawned. The idea is that * even the zone & keg heads are allocated from the allocator, so we use the * bss section to bootstrap us. */ static struct uma_keg masterkeg; static struct uma_zone masterzone_k; static struct uma_zone masterzone_z; static uma_zone_t kegs = &masterzone_k; static uma_zone_t zones = &masterzone_z; /* This is the zone from which all of uma_slab_t's are allocated. */ static uma_zone_t slabzone; static uma_zone_t slabrefzone; /* With refcounters (for UMA_ZONE_REFCNT) */ /* * The initial hash tables come out of this zone so they can be allocated * prior to malloc coming up. */ static uma_zone_t hashzone; /* The boot-time adjusted value for cache line alignment. */ int uma_align_cache = 64 - 1; static MALLOC_DEFINE(M_UMAHASH, "UMAHash", "UMA Hash Buckets"); /* * Are we allowed to allocate buckets? */ static int bucketdisable = 1; /* Linked list of all kegs in the system */ static LIST_HEAD(,uma_keg) uma_kegs = LIST_HEAD_INITIALIZER(uma_kegs); /* Linked list of all cache-only zones in the system */ static LIST_HEAD(,uma_zone) uma_cachezones = LIST_HEAD_INITIALIZER(uma_cachezones); /* This RW lock protects the keg list */ static struct rwlock_padalign uma_rwlock; /* Linked list of boot time pages */ static LIST_HEAD(,uma_slab) uma_boot_pages = LIST_HEAD_INITIALIZER(uma_boot_pages); /* This mutex protects the boot time pages list */ static struct mtx_padalign uma_boot_pages_mtx; static struct sx uma_drain_lock; /* Is the VM done starting up? */ static int booted = 0; #define UMA_STARTUP 1 #define UMA_STARTUP2 2 /* * Only mbuf clusters use ref zones. Just provide enough references * to support the one user. New code should not use the ref facility. */ static const u_int uma_max_ipers_ref = PAGE_SIZE / MCLBYTES; /* * This is the handle used to schedule events that need to happen * outside of the allocation fast path. */ static struct callout uma_callout; #define UMA_TIMEOUT 20 /* Seconds for callout interval. */ /* * This structure is passed as the zone ctor arg so that I don't have to create * a special allocation function just for zones. */ struct uma_zctor_args { const char *name; size_t size; uma_ctor ctor; uma_dtor dtor; uma_init uminit; uma_fini fini; uma_import import; uma_release release; void *arg; uma_keg_t keg; int align; uint32_t flags; }; struct uma_kctor_args { uma_zone_t zone; size_t size; uma_init uminit; uma_fini fini; int align; uint32_t flags; }; struct uma_bucket_zone { uma_zone_t ubz_zone; char *ubz_name; int ubz_entries; /* Number of items it can hold. */ int ubz_maxsize; /* Maximum allocation size per-item. */ }; /* * Compute the actual number of bucket entries to pack them in power * of two sizes for more efficient space utilization. */ #define BUCKET_SIZE(n) \ (((sizeof(void *) * (n)) - sizeof(struct uma_bucket)) / sizeof(void *)) #define BUCKET_MAX BUCKET_SIZE(256) struct uma_bucket_zone bucket_zones[] = { { NULL, "4 Bucket", BUCKET_SIZE(4), 4096 }, { NULL, "6 Bucket", BUCKET_SIZE(6), 3072 }, { NULL, "8 Bucket", BUCKET_SIZE(8), 2048 }, { NULL, "12 Bucket", BUCKET_SIZE(12), 1536 }, { NULL, "16 Bucket", BUCKET_SIZE(16), 1024 }, { NULL, "32 Bucket", BUCKET_SIZE(32), 512 }, { NULL, "64 Bucket", BUCKET_SIZE(64), 256 }, { NULL, "128 Bucket", BUCKET_SIZE(128), 128 }, { NULL, "256 Bucket", BUCKET_SIZE(256), 64 }, { NULL, NULL, 0} }; /* * Flags and enumerations to be passed to internal functions. */ enum zfreeskip { SKIP_NONE = 0, SKIP_DTOR, SKIP_FINI }; /* Prototypes.. */ static void *noobj_alloc(uma_zone_t, vm_size_t, uint8_t *, int); static void *page_alloc(uma_zone_t, vm_size_t, uint8_t *, int); static void *startup_alloc(uma_zone_t, vm_size_t, uint8_t *, int); static void page_free(void *, vm_size_t, uint8_t); static uma_slab_t keg_alloc_slab(uma_keg_t, uma_zone_t, int); static void cache_drain(uma_zone_t); static void bucket_drain(uma_zone_t, uma_bucket_t); static void bucket_cache_drain(uma_zone_t zone); static int keg_ctor(void *, int, void *, int); static void keg_dtor(void *, int, void *); static int zone_ctor(void *, int, void *, int); static void zone_dtor(void *, int, void *); static int zero_init(void *, int, int); static void keg_small_init(uma_keg_t keg); static void keg_large_init(uma_keg_t keg); static void zone_foreach(void (*zfunc)(uma_zone_t)); static void zone_timeout(uma_zone_t zone); static int hash_alloc(struct uma_hash *); static int hash_expand(struct uma_hash *, struct uma_hash *); static void hash_free(struct uma_hash *hash); static void uma_timeout(void *); static void uma_startup3(void); static void *zone_alloc_item(uma_zone_t, void *, int); static void zone_free_item(uma_zone_t, void *, void *, enum zfreeskip); static void bucket_enable(void); static void bucket_init(void); static uma_bucket_t bucket_alloc(uma_zone_t zone, void *, int); static void bucket_free(uma_zone_t zone, uma_bucket_t, void *); static void bucket_zone_drain(void); static uma_bucket_t zone_alloc_bucket(uma_zone_t zone, void *, int flags); static uma_slab_t zone_fetch_slab(uma_zone_t zone, uma_keg_t last, int flags); static uma_slab_t zone_fetch_slab_multi(uma_zone_t zone, uma_keg_t last, int flags); static void *slab_alloc_item(uma_keg_t keg, uma_slab_t slab); static void slab_free_item(uma_keg_t keg, uma_slab_t slab, void *item); static uma_keg_t uma_kcreate(uma_zone_t zone, size_t size, uma_init uminit, uma_fini fini, int align, uint32_t flags); static int zone_import(uma_zone_t zone, void **bucket, int max, int flags); static void zone_release(uma_zone_t zone, void **bucket, int cnt); static void uma_zero_item(void *item, uma_zone_t zone); void uma_print_zone(uma_zone_t); void uma_print_stats(void); static int sysctl_vm_zone_count(SYSCTL_HANDLER_ARGS); static int sysctl_vm_zone_stats(SYSCTL_HANDLER_ARGS); SYSINIT(uma_startup3, SI_SUB_VM_CONF, SI_ORDER_SECOND, uma_startup3, NULL); SYSCTL_PROC(_vm, OID_AUTO, zone_count, CTLFLAG_RD|CTLTYPE_INT, 0, 0, sysctl_vm_zone_count, "I", "Number of UMA zones"); SYSCTL_PROC(_vm, OID_AUTO, zone_stats, CTLFLAG_RD|CTLTYPE_STRUCT, 0, 0, sysctl_vm_zone_stats, "s,struct uma_type_header", "Zone Stats"); static int zone_warnings = 1; SYSCTL_INT(_vm, OID_AUTO, zone_warnings, CTLFLAG_RWTUN, &zone_warnings, 0, "Warn when UMA zones becomes full"); /* * This routine checks to see whether or not it's safe to enable buckets. */ static void bucket_enable(void) { bucketdisable = vm_page_count_min(); } /* * Initialize bucket_zones, the array of zones of buckets of various sizes. * * For each zone, calculate the memory required for each bucket, consisting * of the header and an array of pointers. */ static void bucket_init(void) { struct uma_bucket_zone *ubz; int size; for (ubz = &bucket_zones[0]; ubz->ubz_entries != 0; ubz++) { size = roundup(sizeof(struct uma_bucket), sizeof(void *)); size += sizeof(void *) * ubz->ubz_entries; ubz->ubz_zone = uma_zcreate(ubz->ubz_name, size, NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_MTXCLASS | UMA_ZFLAG_BUCKET); } } /* * Given a desired number of entries for a bucket, return the zone from which * to allocate the bucket. */ static struct uma_bucket_zone * bucket_zone_lookup(int entries) { struct uma_bucket_zone *ubz; for (ubz = &bucket_zones[0]; ubz->ubz_entries != 0; ubz++) if (ubz->ubz_entries >= entries) return (ubz); ubz--; return (ubz); } static int bucket_select(int size) { struct uma_bucket_zone *ubz; ubz = &bucket_zones[0]; if (size > ubz->ubz_maxsize) return MAX((ubz->ubz_maxsize * ubz->ubz_entries) / size, 1); for (; ubz->ubz_entries != 0; ubz++) if (ubz->ubz_maxsize < size) break; ubz--; return (ubz->ubz_entries); } static uma_bucket_t bucket_alloc(uma_zone_t zone, void *udata, int flags) { struct uma_bucket_zone *ubz; uma_bucket_t bucket; /* * This is to stop us from allocating per cpu buckets while we're * running out of vm.boot_pages. Otherwise, we would exhaust the * boot pages. This also prevents us from allocating buckets in * low memory situations. */ if (bucketdisable) return (NULL); /* * To limit bucket recursion we store the original zone flags * in a cookie passed via zalloc_arg/zfree_arg. This allows the * NOVM flag to persist even through deep recursions. We also * store ZFLAG_BUCKET once we have recursed attempting to allocate * a bucket for a bucket zone so we do not allow infinite bucket * recursion. This cookie will even persist to frees of unused * buckets via the allocation path or bucket allocations in the * free path. */ if ((zone->uz_flags & UMA_ZFLAG_BUCKET) == 0) udata = (void *)(uintptr_t)zone->uz_flags; else { if ((uintptr_t)udata & UMA_ZFLAG_BUCKET) return (NULL); udata = (void *)((uintptr_t)udata | UMA_ZFLAG_BUCKET); } if ((uintptr_t)udata & UMA_ZFLAG_CACHEONLY) flags |= M_NOVM; ubz = bucket_zone_lookup(zone->uz_count); if (ubz->ubz_zone == zone && (ubz + 1)->ubz_entries != 0) ubz++; bucket = uma_zalloc_arg(ubz->ubz_zone, udata, flags); if (bucket) { #ifdef INVARIANTS bzero(bucket->ub_bucket, sizeof(void *) * ubz->ubz_entries); #endif bucket->ub_cnt = 0; bucket->ub_entries = ubz->ubz_entries; } return (bucket); } static void bucket_free(uma_zone_t zone, uma_bucket_t bucket, void *udata) { struct uma_bucket_zone *ubz; KASSERT(bucket->ub_cnt == 0, ("bucket_free: Freeing a non free bucket.")); if ((zone->uz_flags & UMA_ZFLAG_BUCKET) == 0) udata = (void *)(uintptr_t)zone->uz_flags; ubz = bucket_zone_lookup(bucket->ub_entries); uma_zfree_arg(ubz->ubz_zone, bucket, udata); } static void bucket_zone_drain(void) { struct uma_bucket_zone *ubz; for (ubz = &bucket_zones[0]; ubz->ubz_entries != 0; ubz++) zone_drain(ubz->ubz_zone); } static void zone_log_warning(uma_zone_t zone) { static const struct timeval warninterval = { 300, 0 }; if (!zone_warnings || zone->uz_warning == NULL) return; if (ratecheck(&zone->uz_ratecheck, &warninterval)) printf("[zone: %s] %s\n", zone->uz_name, zone->uz_warning); } static void zone_foreach_keg(uma_zone_t zone, void (*kegfn)(uma_keg_t)) { uma_klink_t klink; LIST_FOREACH(klink, &zone->uz_kegs, kl_link) kegfn(klink->kl_keg); } /* * Routine called by timeout which is used to fire off some time interval * based calculations. (stats, hash size, etc.) * * Arguments: * arg Unused * * Returns: * Nothing */ static void uma_timeout(void *unused) { bucket_enable(); zone_foreach(zone_timeout); /* Reschedule this event */ callout_reset(&uma_callout, UMA_TIMEOUT * hz, uma_timeout, NULL); } /* * Routine to perform timeout driven calculations. This expands the * hashes and does per cpu statistics aggregation. * * Returns nothing. */ static void keg_timeout(uma_keg_t keg) { KEG_LOCK(keg); /* * Expand the keg hash table. * * This is done if the number of slabs is larger than the hash size. * What I'm trying to do here is completely reduce collisions. This * may be a little aggressive. Should I allow for two collisions max? */ if (keg->uk_flags & UMA_ZONE_HASH && keg->uk_pages / keg->uk_ppera >= keg->uk_hash.uh_hashsize) { struct uma_hash newhash; struct uma_hash oldhash; int ret; /* * This is so involved because allocating and freeing * while the keg lock is held will lead to deadlock. * I have to do everything in stages and check for * races. */ newhash = keg->uk_hash; KEG_UNLOCK(keg); ret = hash_alloc(&newhash); KEG_LOCK(keg); if (ret) { if (hash_expand(&keg->uk_hash, &newhash)) { oldhash = keg->uk_hash; keg->uk_hash = newhash; } else oldhash = newhash; KEG_UNLOCK(keg); hash_free(&oldhash); return; } } KEG_UNLOCK(keg); } static void zone_timeout(uma_zone_t zone) { zone_foreach_keg(zone, &keg_timeout); } /* * Allocate and zero fill the next sized hash table from the appropriate * backing store. * * Arguments: * hash A new hash structure with the old hash size in uh_hashsize * * Returns: * 1 on sucess and 0 on failure. */ static int hash_alloc(struct uma_hash *hash) { int oldsize; int alloc; oldsize = hash->uh_hashsize; /* We're just going to go to a power of two greater */ if (oldsize) { hash->uh_hashsize = oldsize * 2; alloc = sizeof(hash->uh_slab_hash[0]) * hash->uh_hashsize; hash->uh_slab_hash = (struct slabhead *)malloc(alloc, M_UMAHASH, M_NOWAIT); } else { alloc = sizeof(hash->uh_slab_hash[0]) * UMA_HASH_SIZE_INIT; hash->uh_slab_hash = zone_alloc_item(hashzone, NULL, M_WAITOK); hash->uh_hashsize = UMA_HASH_SIZE_INIT; } if (hash->uh_slab_hash) { bzero(hash->uh_slab_hash, alloc); hash->uh_hashmask = hash->uh_hashsize - 1; return (1); } return (0); } /* * Expands the hash table for HASH zones. This is done from zone_timeout * to reduce collisions. This must not be done in the regular allocation * path, otherwise, we can recurse on the vm while allocating pages. * * Arguments: * oldhash The hash you want to expand * newhash The hash structure for the new table * * Returns: * Nothing * * Discussion: */ static int hash_expand(struct uma_hash *oldhash, struct uma_hash *newhash) { uma_slab_t slab; int hval; int i; if (!newhash->uh_slab_hash) return (0); if (oldhash->uh_hashsize >= newhash->uh_hashsize) return (0); /* * I need to investigate hash algorithms for resizing without a * full rehash. */ for (i = 0; i < oldhash->uh_hashsize; i++) while (!SLIST_EMPTY(&oldhash->uh_slab_hash[i])) { slab = SLIST_FIRST(&oldhash->uh_slab_hash[i]); SLIST_REMOVE_HEAD(&oldhash->uh_slab_hash[i], us_hlink); hval = UMA_HASH(newhash, slab->us_data); SLIST_INSERT_HEAD(&newhash->uh_slab_hash[hval], slab, us_hlink); } return (1); } /* * Free the hash bucket to the appropriate backing store. * * Arguments: * slab_hash The hash bucket we're freeing * hashsize The number of entries in that hash bucket * * Returns: * Nothing */ static void hash_free(struct uma_hash *hash) { if (hash->uh_slab_hash == NULL) return; if (hash->uh_hashsize == UMA_HASH_SIZE_INIT) zone_free_item(hashzone, hash->uh_slab_hash, NULL, SKIP_NONE); else free(hash->uh_slab_hash, M_UMAHASH); } /* * Frees all outstanding items in a bucket * * Arguments: * zone The zone to free to, must be unlocked. * bucket The free/alloc bucket with items, cpu queue must be locked. * * Returns: * Nothing */ static void bucket_drain(uma_zone_t zone, uma_bucket_t bucket) { int i; if (bucket == NULL) return; if (zone->uz_fini) for (i = 0; i < bucket->ub_cnt; i++) zone->uz_fini(bucket->ub_bucket[i], zone->uz_size); zone->uz_release(zone->uz_arg, bucket->ub_bucket, bucket->ub_cnt); bucket->ub_cnt = 0; } /* * Drains the per cpu caches for a zone. * * NOTE: This may only be called while the zone is being turn down, and not * during normal operation. This is necessary in order that we do not have * to migrate CPUs to drain the per-CPU caches. * * Arguments: * zone The zone to drain, must be unlocked. * * Returns: * Nothing */ static void cache_drain(uma_zone_t zone) { uma_cache_t cache; int cpu; /* * XXX: It is safe to not lock the per-CPU caches, because we're * tearing down the zone anyway. I.e., there will be no further use * of the caches at this point. * * XXX: It would good to be able to assert that the zone is being * torn down to prevent improper use of cache_drain(). * * XXX: We lock the zone before passing into bucket_cache_drain() as * it is used elsewhere. Should the tear-down path be made special * there in some form? */ CPU_FOREACH(cpu) { cache = &zone->uz_cpu[cpu]; bucket_drain(zone, cache->uc_allocbucket); bucket_drain(zone, cache->uc_freebucket); if (cache->uc_allocbucket != NULL) bucket_free(zone, cache->uc_allocbucket, NULL); if (cache->uc_freebucket != NULL) bucket_free(zone, cache->uc_freebucket, NULL); cache->uc_allocbucket = cache->uc_freebucket = NULL; } ZONE_LOCK(zone); bucket_cache_drain(zone); ZONE_UNLOCK(zone); } static void cache_shrink(uma_zone_t zone) { if (zone->uz_flags & UMA_ZFLAG_INTERNAL) return; ZONE_LOCK(zone); zone->uz_count = (zone->uz_count_min + zone->uz_count) / 2; ZONE_UNLOCK(zone); } static void cache_drain_safe_cpu(uma_zone_t zone) { uma_cache_t cache; uma_bucket_t b1, b2; if (zone->uz_flags & UMA_ZFLAG_INTERNAL) return; b1 = b2 = NULL; ZONE_LOCK(zone); critical_enter(); cache = &zone->uz_cpu[curcpu]; if (cache->uc_allocbucket) { if (cache->uc_allocbucket->ub_cnt != 0) LIST_INSERT_HEAD(&zone->uz_buckets, cache->uc_allocbucket, ub_link); else b1 = cache->uc_allocbucket; cache->uc_allocbucket = NULL; } if (cache->uc_freebucket) { if (cache->uc_freebucket->ub_cnt != 0) LIST_INSERT_HEAD(&zone->uz_buckets, cache->uc_freebucket, ub_link); else b2 = cache->uc_freebucket; cache->uc_freebucket = NULL; } critical_exit(); ZONE_UNLOCK(zone); if (b1) bucket_free(zone, b1, NULL); if (b2) bucket_free(zone, b2, NULL); } /* * Safely drain per-CPU caches of a zone(s) to alloc bucket. * This is an expensive call because it needs to bind to all CPUs * one by one and enter a critical section on each of them in order * to safely access their cache buckets. * Zone lock must not be held on call this function. */ static void cache_drain_safe(uma_zone_t zone) { int cpu; /* * Polite bucket sizes shrinking was not enouth, shrink aggressively. */ if (zone) cache_shrink(zone); else zone_foreach(cache_shrink); CPU_FOREACH(cpu) { thread_lock(curthread); sched_bind(curthread, cpu); thread_unlock(curthread); if (zone) cache_drain_safe_cpu(zone); else zone_foreach(cache_drain_safe_cpu); } thread_lock(curthread); sched_unbind(curthread); thread_unlock(curthread); } /* * Drain the cached buckets from a zone. Expects a locked zone on entry. */ static void bucket_cache_drain(uma_zone_t zone) { uma_bucket_t bucket; /* * Drain the bucket queues and free the buckets, we just keep two per * cpu (alloc/free). */ while ((bucket = LIST_FIRST(&zone->uz_buckets)) != NULL) { LIST_REMOVE(bucket, ub_link); ZONE_UNLOCK(zone); bucket_drain(zone, bucket); bucket_free(zone, bucket, NULL); ZONE_LOCK(zone); } /* * Shrink further bucket sizes. Price of single zone lock collision * is probably lower then price of global cache drain. */ if (zone->uz_count > zone->uz_count_min) zone->uz_count--; } static void keg_free_slab(uma_keg_t keg, uma_slab_t slab, int start) { uint8_t *mem; int i; uint8_t flags; mem = slab->us_data; flags = slab->us_flags; i = start; if (keg->uk_fini != NULL) { for (i--; i > -1; i--) keg->uk_fini(slab->us_data + (keg->uk_rsize * i), keg->uk_size); } if (keg->uk_flags & UMA_ZONE_OFFPAGE) zone_free_item(keg->uk_slabzone, slab, NULL, SKIP_NONE); #ifdef UMA_DEBUG printf("%s: Returning %d bytes.\n", keg->uk_name, PAGE_SIZE * keg->uk_ppera); #endif keg->uk_freef(mem, PAGE_SIZE * keg->uk_ppera, flags); } /* * Frees pages from a keg back to the system. This is done on demand from * the pageout daemon. * * Returns nothing. */ static void keg_drain(uma_keg_t keg) { struct slabhead freeslabs = { 0 }; uma_slab_t slab; uma_slab_t n; /* * We don't want to take pages from statically allocated kegs at this * time */ if (keg->uk_flags & UMA_ZONE_NOFREE || keg->uk_freef == NULL) return; #ifdef UMA_DEBUG printf("%s free items: %u\n", keg->uk_name, keg->uk_free); #endif KEG_LOCK(keg); if (keg->uk_free == 0) goto finished; slab = LIST_FIRST(&keg->uk_free_slab); while (slab) { n = LIST_NEXT(slab, us_link); /* We have no where to free these to */ if (slab->us_flags & UMA_SLAB_BOOT) { slab = n; continue; } LIST_REMOVE(slab, us_link); keg->uk_pages -= keg->uk_ppera; keg->uk_free -= keg->uk_ipers; if (keg->uk_flags & UMA_ZONE_HASH) UMA_HASH_REMOVE(&keg->uk_hash, slab, slab->us_data); SLIST_INSERT_HEAD(&freeslabs, slab, us_hlink); slab = n; } finished: KEG_UNLOCK(keg); while ((slab = SLIST_FIRST(&freeslabs)) != NULL) { SLIST_REMOVE(&freeslabs, slab, uma_slab, us_hlink); keg_free_slab(keg, slab, keg->uk_ipers); } } static void zone_drain_wait(uma_zone_t zone, int waitok) { /* * Set draining to interlock with zone_dtor() so we can release our * locks as we go. Only dtor() should do a WAITOK call since it * is the only call that knows the structure will still be available * when it wakes up. */ ZONE_LOCK(zone); while (zone->uz_flags & UMA_ZFLAG_DRAINING) { if (waitok == M_NOWAIT) goto out; msleep(zone, zone->uz_lockptr, PVM, "zonedrain", 1); } zone->uz_flags |= UMA_ZFLAG_DRAINING; bucket_cache_drain(zone); ZONE_UNLOCK(zone); /* * The DRAINING flag protects us from being freed while * we're running. Normally the uma_rwlock would protect us but we * must be able to release and acquire the right lock for each keg. */ zone_foreach_keg(zone, &keg_drain); ZONE_LOCK(zone); zone->uz_flags &= ~UMA_ZFLAG_DRAINING; wakeup(zone); out: ZONE_UNLOCK(zone); } void zone_drain(uma_zone_t zone) { zone_drain_wait(zone, M_NOWAIT); } /* * Allocate a new slab for a keg. This does not insert the slab onto a list. * * Arguments: * wait Shall we wait? * * Returns: * The slab that was allocated or NULL if there is no memory and the * caller specified M_NOWAIT. */ static uma_slab_t keg_alloc_slab(uma_keg_t keg, uma_zone_t zone, int wait) { uma_slabrefcnt_t slabref; uma_alloc allocf; uma_slab_t slab; uint8_t *mem; uint8_t flags; int i; mtx_assert(&keg->uk_lock, MA_OWNED); slab = NULL; mem = NULL; #ifdef UMA_DEBUG printf("alloc_slab: Allocating a new slab for %s\n", keg->uk_name); #endif allocf = keg->uk_allocf; KEG_UNLOCK(keg); if (keg->uk_flags & UMA_ZONE_OFFPAGE) { slab = zone_alloc_item(keg->uk_slabzone, NULL, wait); if (slab == NULL) goto out; } /* * This reproduces the old vm_zone behavior of zero filling pages the * first time they are added to a zone. * * Malloced items are zeroed in uma_zalloc. */ if ((keg->uk_flags & UMA_ZONE_MALLOC) == 0) wait |= M_ZERO; else wait &= ~M_ZERO; if (keg->uk_flags & UMA_ZONE_NODUMP) wait |= M_NODUMP; /* zone is passed for legacy reasons. */ mem = allocf(zone, keg->uk_ppera * PAGE_SIZE, &flags, wait); if (mem == NULL) { if (keg->uk_flags & UMA_ZONE_OFFPAGE) zone_free_item(keg->uk_slabzone, slab, NULL, SKIP_NONE); slab = NULL; goto out; } /* Point the slab into the allocated memory */ if (!(keg->uk_flags & UMA_ZONE_OFFPAGE)) slab = (uma_slab_t )(mem + keg->uk_pgoff); if (keg->uk_flags & UMA_ZONE_VTOSLAB) for (i = 0; i < keg->uk_ppera; i++) vsetslab((vm_offset_t)mem + (i * PAGE_SIZE), slab); slab->us_keg = keg; slab->us_data = mem; slab->us_freecount = keg->uk_ipers; slab->us_flags = flags; BIT_FILL(SLAB_SETSIZE, &slab->us_free); #ifdef INVARIANTS BIT_ZERO(SLAB_SETSIZE, &slab->us_debugfree); #endif if (keg->uk_flags & UMA_ZONE_REFCNT) { slabref = (uma_slabrefcnt_t)slab; for (i = 0; i < keg->uk_ipers; i++) slabref->us_refcnt[i] = 0; } if (keg->uk_init != NULL) { for (i = 0; i < keg->uk_ipers; i++) if (keg->uk_init(slab->us_data + (keg->uk_rsize * i), keg->uk_size, wait) != 0) break; if (i != keg->uk_ipers) { keg_free_slab(keg, slab, i); slab = NULL; goto out; } } out: KEG_LOCK(keg); if (slab != NULL) { if (keg->uk_flags & UMA_ZONE_HASH) UMA_HASH_INSERT(&keg->uk_hash, slab, mem); keg->uk_pages += keg->uk_ppera; keg->uk_free += keg->uk_ipers; } return (slab); } /* * This function is intended to be used early on in place of page_alloc() so * that we may use the boot time page cache to satisfy allocations before * the VM is ready. */ static void * startup_alloc(uma_zone_t zone, vm_size_t bytes, uint8_t *pflag, int wait) { uma_keg_t keg; uma_slab_t tmps; int pages, check_pages; keg = zone_first_keg(zone); pages = howmany(bytes, PAGE_SIZE); check_pages = pages - 1; KASSERT(pages > 0, ("startup_alloc can't reserve 0 pages\n")); /* * Check our small startup cache to see if it has pages remaining. */ mtx_lock(&uma_boot_pages_mtx); /* First check if we have enough room. */ tmps = LIST_FIRST(&uma_boot_pages); while (tmps != NULL && check_pages-- > 0) tmps = LIST_NEXT(tmps, us_link); if (tmps != NULL) { /* * It's ok to lose tmps references. The last one will * have tmps->us_data pointing to the start address of * "pages" contiguous pages of memory. */ while (pages-- > 0) { tmps = LIST_FIRST(&uma_boot_pages); LIST_REMOVE(tmps, us_link); } mtx_unlock(&uma_boot_pages_mtx); *pflag = tmps->us_flags; return (tmps->us_data); } mtx_unlock(&uma_boot_pages_mtx); if (booted < UMA_STARTUP2) panic("UMA: Increase vm.boot_pages"); /* * Now that we've booted reset these users to their real allocator. */ #ifdef UMA_MD_SMALL_ALLOC keg->uk_allocf = (keg->uk_ppera > 1) ? page_alloc : uma_small_alloc; #else keg->uk_allocf = page_alloc; #endif return keg->uk_allocf(zone, bytes, pflag, wait); } /* * Allocates a number of pages from the system * * Arguments: * bytes The number of bytes requested * wait Shall we wait? * * Returns: * A pointer to the alloced memory or possibly * NULL if M_NOWAIT is set. */ static void * page_alloc(uma_zone_t zone, vm_size_t bytes, uint8_t *pflag, int wait) { void *p; /* Returned page */ *pflag = UMA_SLAB_KMEM; p = (void *) kmem_malloc(kmem_arena, bytes, wait); return (p); } /* * Allocates a number of pages from within an object * * Arguments: * bytes The number of bytes requested * wait Shall we wait? * * Returns: * A pointer to the alloced memory or possibly * NULL if M_NOWAIT is set. */ static void * noobj_alloc(uma_zone_t zone, vm_size_t bytes, uint8_t *flags, int wait) { TAILQ_HEAD(, vm_page) alloctail; u_long npages; vm_offset_t retkva, zkva; vm_page_t p, p_next; uma_keg_t keg; TAILQ_INIT(&alloctail); keg = zone_first_keg(zone); npages = howmany(bytes, PAGE_SIZE); while (npages > 0) { p = vm_page_alloc(NULL, 0, VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED | VM_ALLOC_NOOBJ); if (p != NULL) { /* * Since the page does not belong to an object, its * listq is unused. */ TAILQ_INSERT_TAIL(&alloctail, p, listq); npages--; continue; } if (wait & M_WAITOK) { VM_WAIT; continue; } /* * Page allocation failed, free intermediate pages and * exit. */ TAILQ_FOREACH_SAFE(p, &alloctail, listq, p_next) { vm_page_unwire(p, PQ_INACTIVE); vm_page_free(p); } return (NULL); } *flags = UMA_SLAB_PRIV; zkva = keg->uk_kva + atomic_fetchadd_long(&keg->uk_offset, round_page(bytes)); retkva = zkva; TAILQ_FOREACH(p, &alloctail, listq) { pmap_qenter(zkva, &p, 1); zkva += PAGE_SIZE; } return ((void *)retkva); } /* * Frees a number of pages to the system * * Arguments: * mem A pointer to the memory to be freed * size The size of the memory being freed * flags The original p->us_flags field * * Returns: * Nothing */ static void page_free(void *mem, vm_size_t size, uint8_t flags) { struct vmem *vmem; if (flags & UMA_SLAB_KMEM) vmem = kmem_arena; else if (flags & UMA_SLAB_KERNEL) vmem = kernel_arena; else panic("UMA: page_free used with invalid flags %d", flags); kmem_free(vmem, (vm_offset_t)mem, size); } /* * Zero fill initializer * * Arguments/Returns follow uma_init specifications */ static int zero_init(void *mem, int size, int flags) { bzero(mem, size); return (0); } /* * Finish creating a small uma keg. This calculates ipers, and the keg size. * * Arguments * keg The zone we should initialize * * Returns * Nothing */ static void keg_small_init(uma_keg_t keg) { u_int rsize; u_int memused; u_int wastedspace; u_int shsize; if (keg->uk_flags & UMA_ZONE_PCPU) { u_int ncpus = mp_ncpus ? mp_ncpus : MAXCPU; keg->uk_slabsize = sizeof(struct pcpu); keg->uk_ppera = howmany(ncpus * sizeof(struct pcpu), PAGE_SIZE); } else { keg->uk_slabsize = UMA_SLAB_SIZE; keg->uk_ppera = 1; } /* * Calculate the size of each allocation (rsize) according to * alignment. If the requested size is smaller than we have * allocation bits for we round it up. */ rsize = keg->uk_size; if (rsize < keg->uk_slabsize / SLAB_SETSIZE) rsize = keg->uk_slabsize / SLAB_SETSIZE; if (rsize & keg->uk_align) rsize = (rsize & ~keg->uk_align) + (keg->uk_align + 1); keg->uk_rsize = rsize; KASSERT((keg->uk_flags & UMA_ZONE_PCPU) == 0 || keg->uk_rsize < sizeof(struct pcpu), ("%s: size %u too large", __func__, keg->uk_rsize)); if (keg->uk_flags & UMA_ZONE_REFCNT) rsize += sizeof(uint32_t); if (keg->uk_flags & UMA_ZONE_OFFPAGE) shsize = 0; else shsize = sizeof(struct uma_slab); keg->uk_ipers = (keg->uk_slabsize - shsize) / rsize; KASSERT(keg->uk_ipers > 0 && keg->uk_ipers <= SLAB_SETSIZE, ("%s: keg->uk_ipers %u", __func__, keg->uk_ipers)); memused = keg->uk_ipers * rsize + shsize; wastedspace = keg->uk_slabsize - memused; /* * We can't do OFFPAGE if we're internal or if we've been * asked to not go to the VM for buckets. If we do this we * may end up going to the VM for slabs which we do not * want to do if we're UMA_ZFLAG_CACHEONLY as a result * of UMA_ZONE_VM, which clearly forbids it. */ if ((keg->uk_flags & UMA_ZFLAG_INTERNAL) || (keg->uk_flags & UMA_ZFLAG_CACHEONLY)) return; /* * See if using an OFFPAGE slab will limit our waste. Only do * this if it permits more items per-slab. * * XXX We could try growing slabsize to limit max waste as well. * Historically this was not done because the VM could not * efficiently handle contiguous allocations. */ if ((wastedspace >= keg->uk_slabsize / UMA_MAX_WASTE) && (keg->uk_ipers < (keg->uk_slabsize / keg->uk_rsize))) { keg->uk_ipers = keg->uk_slabsize / keg->uk_rsize; KASSERT(keg->uk_ipers > 0 && keg->uk_ipers <= SLAB_SETSIZE, ("%s: keg->uk_ipers %u", __func__, keg->uk_ipers)); #ifdef UMA_DEBUG printf("UMA decided we need offpage slab headers for " "keg: %s, calculated wastedspace = %d, " "maximum wasted space allowed = %d, " "calculated ipers = %d, " "new wasted space = %d\n", keg->uk_name, wastedspace, keg->uk_slabsize / UMA_MAX_WASTE, keg->uk_ipers, keg->uk_slabsize - keg->uk_ipers * keg->uk_rsize); #endif keg->uk_flags |= UMA_ZONE_OFFPAGE; } if ((keg->uk_flags & UMA_ZONE_OFFPAGE) && (keg->uk_flags & UMA_ZONE_VTOSLAB) == 0) keg->uk_flags |= UMA_ZONE_HASH; } /* * Finish creating a large (> UMA_SLAB_SIZE) uma kegs. Just give in and do * OFFPAGE for now. When I can allow for more dynamic slab sizes this will be * more complicated. * * Arguments * keg The keg we should initialize * * Returns * Nothing */ static void keg_large_init(uma_keg_t keg) { u_int shsize; KASSERT(keg != NULL, ("Keg is null in keg_large_init")); KASSERT((keg->uk_flags & UMA_ZFLAG_CACHEONLY) == 0, ("keg_large_init: Cannot large-init a UMA_ZFLAG_CACHEONLY keg")); KASSERT((keg->uk_flags & UMA_ZONE_PCPU) == 0, ("%s: Cannot large-init a UMA_ZONE_PCPU keg", __func__)); keg->uk_ppera = howmany(keg->uk_size, PAGE_SIZE); keg->uk_slabsize = keg->uk_ppera * PAGE_SIZE; keg->uk_ipers = 1; keg->uk_rsize = keg->uk_size; /* We can't do OFFPAGE if we're internal, bail out here. */ if (keg->uk_flags & UMA_ZFLAG_INTERNAL) return; /* Check whether we have enough space to not do OFFPAGE. */ if ((keg->uk_flags & UMA_ZONE_OFFPAGE) == 0) { shsize = sizeof(struct uma_slab); if (keg->uk_flags & UMA_ZONE_REFCNT) shsize += keg->uk_ipers * sizeof(uint32_t); if (shsize & UMA_ALIGN_PTR) shsize = (shsize & ~UMA_ALIGN_PTR) + (UMA_ALIGN_PTR + 1); if ((PAGE_SIZE * keg->uk_ppera) - keg->uk_rsize < shsize) keg->uk_flags |= UMA_ZONE_OFFPAGE; } if ((keg->uk_flags & UMA_ZONE_OFFPAGE) && (keg->uk_flags & UMA_ZONE_VTOSLAB) == 0) keg->uk_flags |= UMA_ZONE_HASH; } static void keg_cachespread_init(uma_keg_t keg) { int alignsize; int trailer; int pages; int rsize; KASSERT((keg->uk_flags & UMA_ZONE_PCPU) == 0, ("%s: Cannot cachespread-init a UMA_ZONE_PCPU keg", __func__)); alignsize = keg->uk_align + 1; rsize = keg->uk_size; /* * We want one item to start on every align boundary in a page. To * do this we will span pages. We will also extend the item by the * size of align if it is an even multiple of align. Otherwise, it * would fall on the same boundary every time. */ if (rsize & keg->uk_align) rsize = (rsize & ~keg->uk_align) + alignsize; if ((rsize & alignsize) == 0) rsize += alignsize; trailer = rsize - keg->uk_size; pages = (rsize * (PAGE_SIZE / alignsize)) / PAGE_SIZE; pages = MIN(pages, (128 * 1024) / PAGE_SIZE); keg->uk_rsize = rsize; keg->uk_ppera = pages; keg->uk_slabsize = UMA_SLAB_SIZE; keg->uk_ipers = ((pages * PAGE_SIZE) + trailer) / rsize; keg->uk_flags |= UMA_ZONE_OFFPAGE | UMA_ZONE_VTOSLAB; KASSERT(keg->uk_ipers <= SLAB_SETSIZE, ("%s: keg->uk_ipers too high(%d) increase max_ipers", __func__, keg->uk_ipers)); } /* * Keg header ctor. This initializes all fields, locks, etc. And inserts * the keg onto the global keg list. * * Arguments/Returns follow uma_ctor specifications * udata Actually uma_kctor_args */ static int keg_ctor(void *mem, int size, void *udata, int flags) { struct uma_kctor_args *arg = udata; uma_keg_t keg = mem; uma_zone_t zone; bzero(keg, size); keg->uk_size = arg->size; keg->uk_init = arg->uminit; keg->uk_fini = arg->fini; keg->uk_align = arg->align; keg->uk_free = 0; keg->uk_reserve = 0; keg->uk_pages = 0; keg->uk_flags = arg->flags; keg->uk_allocf = page_alloc; keg->uk_freef = page_free; keg->uk_slabzone = NULL; /* * The master zone is passed to us at keg-creation time. */ zone = arg->zone; keg->uk_name = zone->uz_name; if (arg->flags & UMA_ZONE_VM) keg->uk_flags |= UMA_ZFLAG_CACHEONLY; if (arg->flags & UMA_ZONE_ZINIT) keg->uk_init = zero_init; if (arg->flags & UMA_ZONE_REFCNT || arg->flags & UMA_ZONE_MALLOC) keg->uk_flags |= UMA_ZONE_VTOSLAB; if (arg->flags & UMA_ZONE_PCPU) #ifdef SMP keg->uk_flags |= UMA_ZONE_OFFPAGE; #else keg->uk_flags &= ~UMA_ZONE_PCPU; #endif if (keg->uk_flags & UMA_ZONE_CACHESPREAD) { keg_cachespread_init(keg); } else if (keg->uk_flags & UMA_ZONE_REFCNT) { if (keg->uk_size > (UMA_SLAB_SIZE - sizeof(struct uma_slab_refcnt) - sizeof(uint32_t))) keg_large_init(keg); else keg_small_init(keg); } else { if (keg->uk_size > (UMA_SLAB_SIZE - sizeof(struct uma_slab))) keg_large_init(keg); else keg_small_init(keg); } if (keg->uk_flags & UMA_ZONE_OFFPAGE) { if (keg->uk_flags & UMA_ZONE_REFCNT) { if (keg->uk_ipers > uma_max_ipers_ref) panic("Too many ref items per zone: %d > %d\n", keg->uk_ipers, uma_max_ipers_ref); keg->uk_slabzone = slabrefzone; } else keg->uk_slabzone = slabzone; } /* * If we haven't booted yet we need allocations to go through the * startup cache until the vm is ready. */ if (keg->uk_ppera == 1) { #ifdef UMA_MD_SMALL_ALLOC keg->uk_allocf = uma_small_alloc; keg->uk_freef = uma_small_free; if (booted < UMA_STARTUP) keg->uk_allocf = startup_alloc; #else if (booted < UMA_STARTUP2) keg->uk_allocf = startup_alloc; #endif } else if (booted < UMA_STARTUP2 && (keg->uk_flags & UMA_ZFLAG_INTERNAL)) keg->uk_allocf = startup_alloc; /* * Initialize keg's lock */ KEG_LOCK_INIT(keg, (arg->flags & UMA_ZONE_MTXCLASS)); /* * If we're putting the slab header in the actual page we need to * figure out where in each page it goes. This calculates a right * justified offset into the memory on an ALIGN_PTR boundary. */ if (!(keg->uk_flags & UMA_ZONE_OFFPAGE)) { u_int totsize; /* Size of the slab struct and free list */ totsize = sizeof(struct uma_slab); /* Size of the reference counts. */ if (keg->uk_flags & UMA_ZONE_REFCNT) totsize += keg->uk_ipers * sizeof(uint32_t); if (totsize & UMA_ALIGN_PTR) totsize = (totsize & ~UMA_ALIGN_PTR) + (UMA_ALIGN_PTR + 1); keg->uk_pgoff = (PAGE_SIZE * keg->uk_ppera) - totsize; /* * The only way the following is possible is if with our * UMA_ALIGN_PTR adjustments we are now bigger than * UMA_SLAB_SIZE. I haven't checked whether this is * mathematically possible for all cases, so we make * sure here anyway. */ totsize = keg->uk_pgoff + sizeof(struct uma_slab); if (keg->uk_flags & UMA_ZONE_REFCNT) totsize += keg->uk_ipers * sizeof(uint32_t); if (totsize > PAGE_SIZE * keg->uk_ppera) { printf("zone %s ipers %d rsize %d size %d\n", zone->uz_name, keg->uk_ipers, keg->uk_rsize, keg->uk_size); panic("UMA slab won't fit."); } } if (keg->uk_flags & UMA_ZONE_HASH) hash_alloc(&keg->uk_hash); #ifdef UMA_DEBUG printf("UMA: %s(%p) size %d(%d) flags %#x ipers %d ppera %d out %d free %d\n", zone->uz_name, zone, keg->uk_size, keg->uk_rsize, keg->uk_flags, keg->uk_ipers, keg->uk_ppera, (keg->uk_ipers * keg->uk_pages) - keg->uk_free, keg->uk_free); #endif LIST_INSERT_HEAD(&keg->uk_zones, zone, uz_link); rw_wlock(&uma_rwlock); LIST_INSERT_HEAD(&uma_kegs, keg, uk_link); rw_wunlock(&uma_rwlock); return (0); } /* * Zone header ctor. This initializes all fields, locks, etc. * * Arguments/Returns follow uma_ctor specifications * udata Actually uma_zctor_args */ static int zone_ctor(void *mem, int size, void *udata, int flags) { struct uma_zctor_args *arg = udata; uma_zone_t zone = mem; uma_zone_t z; uma_keg_t keg; bzero(zone, size); zone->uz_name = arg->name; zone->uz_ctor = arg->ctor; zone->uz_dtor = arg->dtor; zone->uz_slab = zone_fetch_slab; zone->uz_init = NULL; zone->uz_fini = NULL; zone->uz_allocs = 0; zone->uz_frees = 0; zone->uz_fails = 0; zone->uz_sleeps = 0; zone->uz_count = 0; zone->uz_count_min = 0; zone->uz_flags = 0; zone->uz_warning = NULL; timevalclear(&zone->uz_ratecheck); keg = arg->keg; ZONE_LOCK_INIT(zone, (arg->flags & UMA_ZONE_MTXCLASS)); /* * This is a pure cache zone, no kegs. */ if (arg->import) { if (arg->flags & UMA_ZONE_VM) arg->flags |= UMA_ZFLAG_CACHEONLY; zone->uz_flags = arg->flags; zone->uz_size = arg->size; zone->uz_import = arg->import; zone->uz_release = arg->release; zone->uz_arg = arg->arg; zone->uz_lockptr = &zone->uz_lock; rw_wlock(&uma_rwlock); LIST_INSERT_HEAD(&uma_cachezones, zone, uz_link); rw_wunlock(&uma_rwlock); goto out; } /* * Use the regular zone/keg/slab allocator. */ zone->uz_import = (uma_import)zone_import; zone->uz_release = (uma_release)zone_release; zone->uz_arg = zone; if (arg->flags & UMA_ZONE_SECONDARY) { KASSERT(arg->keg != NULL, ("Secondary zone on zero'd keg")); zone->uz_init = arg->uminit; zone->uz_fini = arg->fini; zone->uz_lockptr = &keg->uk_lock; zone->uz_flags |= UMA_ZONE_SECONDARY; rw_wlock(&uma_rwlock); ZONE_LOCK(zone); LIST_FOREACH(z, &keg->uk_zones, uz_link) { if (LIST_NEXT(z, uz_link) == NULL) { LIST_INSERT_AFTER(z, zone, uz_link); break; } } ZONE_UNLOCK(zone); rw_wunlock(&uma_rwlock); } else if (keg == NULL) { if ((keg = uma_kcreate(zone, arg->size, arg->uminit, arg->fini, arg->align, arg->flags)) == NULL) return (ENOMEM); } else { struct uma_kctor_args karg; int error; /* We should only be here from uma_startup() */ karg.size = arg->size; karg.uminit = arg->uminit; karg.fini = arg->fini; karg.align = arg->align; karg.flags = arg->flags; karg.zone = zone; error = keg_ctor(arg->keg, sizeof(struct uma_keg), &karg, flags); if (error) return (error); } /* * Link in the first keg. */ zone->uz_klink.kl_keg = keg; LIST_INSERT_HEAD(&zone->uz_kegs, &zone->uz_klink, kl_link); zone->uz_lockptr = &keg->uk_lock; zone->uz_size = keg->uk_size; zone->uz_flags |= (keg->uk_flags & (UMA_ZONE_INHERIT | UMA_ZFLAG_INHERIT)); /* * Some internal zones don't have room allocated for the per cpu * caches. If we're internal, bail out here. */ if (keg->uk_flags & UMA_ZFLAG_INTERNAL) { KASSERT((zone->uz_flags & UMA_ZONE_SECONDARY) == 0, ("Secondary zone requested UMA_ZFLAG_INTERNAL")); return (0); } out: if ((arg->flags & UMA_ZONE_MAXBUCKET) == 0) zone->uz_count = bucket_select(zone->uz_size); else zone->uz_count = BUCKET_MAX; zone->uz_count_min = zone->uz_count; return (0); } /* * Keg header dtor. This frees all data, destroys locks, frees the hash * table and removes the keg from the global list. * * Arguments/Returns follow uma_dtor specifications * udata unused */ static void keg_dtor(void *arg, int size, void *udata) { uma_keg_t keg; keg = (uma_keg_t)arg; KEG_LOCK(keg); if (keg->uk_free != 0) { printf("Freed UMA keg (%s) was not empty (%d items). " " Lost %d pages of memory.\n", keg->uk_name ? keg->uk_name : "", keg->uk_free, keg->uk_pages); } KEG_UNLOCK(keg); hash_free(&keg->uk_hash); KEG_LOCK_FINI(keg); } /* * Zone header dtor. * * Arguments/Returns follow uma_dtor specifications * udata unused */ static void zone_dtor(void *arg, int size, void *udata) { uma_klink_t klink; uma_zone_t zone; uma_keg_t keg; zone = (uma_zone_t)arg; keg = zone_first_keg(zone); if (!(zone->uz_flags & UMA_ZFLAG_INTERNAL)) cache_drain(zone); rw_wlock(&uma_rwlock); LIST_REMOVE(zone, uz_link); rw_wunlock(&uma_rwlock); /* * XXX there are some races here where * the zone can be drained but zone lock * released and then refilled before we * remove it... we dont care for now */ zone_drain_wait(zone, M_WAITOK); /* * Unlink all of our kegs. */ while ((klink = LIST_FIRST(&zone->uz_kegs)) != NULL) { klink->kl_keg = NULL; LIST_REMOVE(klink, kl_link); if (klink == &zone->uz_klink) continue; free(klink, M_TEMP); } /* * We only destroy kegs from non secondary zones. */ if (keg != NULL && (zone->uz_flags & UMA_ZONE_SECONDARY) == 0) { rw_wlock(&uma_rwlock); LIST_REMOVE(keg, uk_link); rw_wunlock(&uma_rwlock); zone_free_item(kegs, keg, NULL, SKIP_NONE); } ZONE_LOCK_FINI(zone); } /* * Traverses every zone in the system and calls a callback * * Arguments: * zfunc A pointer to a function which accepts a zone * as an argument. * * Returns: * Nothing */ static void zone_foreach(void (*zfunc)(uma_zone_t)) { uma_keg_t keg; uma_zone_t zone; rw_rlock(&uma_rwlock); LIST_FOREACH(keg, &uma_kegs, uk_link) { LIST_FOREACH(zone, &keg->uk_zones, uz_link) zfunc(zone); } rw_runlock(&uma_rwlock); } /* Public functions */ /* See uma.h */ void uma_startup(void *bootmem, int boot_pages) { struct uma_zctor_args args; uma_slab_t slab; u_int slabsize; int i; #ifdef UMA_DEBUG printf("Creating uma keg headers zone and keg.\n"); #endif rw_init(&uma_rwlock, "UMA lock"); /* "manually" create the initial zone */ memset(&args, 0, sizeof(args)); args.name = "UMA Kegs"; args.size = sizeof(struct uma_keg); args.ctor = keg_ctor; args.dtor = keg_dtor; args.uminit = zero_init; args.fini = NULL; args.keg = &masterkeg; args.align = 32 - 1; args.flags = UMA_ZFLAG_INTERNAL; /* The initial zone has no Per cpu queues so it's smaller */ zone_ctor(kegs, sizeof(struct uma_zone), &args, M_WAITOK); #ifdef UMA_DEBUG printf("Filling boot free list.\n"); #endif for (i = 0; i < boot_pages; i++) { slab = (uma_slab_t)((uint8_t *)bootmem + (i * UMA_SLAB_SIZE)); slab->us_data = (uint8_t *)slab; slab->us_flags = UMA_SLAB_BOOT; LIST_INSERT_HEAD(&uma_boot_pages, slab, us_link); } mtx_init(&uma_boot_pages_mtx, "UMA boot pages", NULL, MTX_DEF); #ifdef UMA_DEBUG printf("Creating uma zone headers zone and keg.\n"); #endif args.name = "UMA Zones"; args.size = sizeof(struct uma_zone) + (sizeof(struct uma_cache) * (mp_maxid + 1)); args.ctor = zone_ctor; args.dtor = zone_dtor; args.uminit = zero_init; args.fini = NULL; args.keg = NULL; args.align = 32 - 1; args.flags = UMA_ZFLAG_INTERNAL; /* The initial zone has no Per cpu queues so it's smaller */ zone_ctor(zones, sizeof(struct uma_zone), &args, M_WAITOK); #ifdef UMA_DEBUG printf("Creating slab and hash zones.\n"); #endif /* Now make a zone for slab headers */ slabzone = uma_zcreate("UMA Slabs", sizeof(struct uma_slab), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZFLAG_INTERNAL); /* * We also create a zone for the bigger slabs with reference * counts in them, to accomodate UMA_ZONE_REFCNT zones. */ slabsize = sizeof(struct uma_slab_refcnt); slabsize += uma_max_ipers_ref * sizeof(uint32_t); slabrefzone = uma_zcreate("UMA RCntSlabs", slabsize, NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZFLAG_INTERNAL); hashzone = uma_zcreate("UMA Hash", sizeof(struct slabhead *) * UMA_HASH_SIZE_INIT, NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZFLAG_INTERNAL); bucket_init(); booted = UMA_STARTUP; #ifdef UMA_DEBUG printf("UMA startup complete.\n"); #endif } /* see uma.h */ void uma_startup2(void) { booted = UMA_STARTUP2; bucket_enable(); sx_init(&uma_drain_lock, "umadrain"); #ifdef UMA_DEBUG printf("UMA startup2 complete.\n"); #endif } /* * Initialize our callout handle * */ static void uma_startup3(void) { #ifdef UMA_DEBUG printf("Starting callout.\n"); #endif callout_init(&uma_callout, CALLOUT_MPSAFE); callout_reset(&uma_callout, UMA_TIMEOUT * hz, uma_timeout, NULL); #ifdef UMA_DEBUG printf("UMA startup3 complete.\n"); #endif } static uma_keg_t uma_kcreate(uma_zone_t zone, size_t size, uma_init uminit, uma_fini fini, int align, uint32_t flags) { struct uma_kctor_args args; args.size = size; args.uminit = uminit; args.fini = fini; args.align = (align == UMA_ALIGN_CACHE) ? uma_align_cache : align; args.flags = flags; args.zone = zone; return (zone_alloc_item(kegs, &args, M_WAITOK)); } /* See uma.h */ void uma_set_align(int align) { if (align != UMA_ALIGN_CACHE) uma_align_cache = align; } /* See uma.h */ uma_zone_t uma_zcreate(const char *name, size_t size, uma_ctor ctor, uma_dtor dtor, uma_init uminit, uma_fini fini, int align, uint32_t flags) { struct uma_zctor_args args; uma_zone_t res; bool locked; /* This stuff is essential for the zone ctor */ memset(&args, 0, sizeof(args)); args.name = name; args.size = size; args.ctor = ctor; args.dtor = dtor; args.uminit = uminit; args.fini = fini; args.align = align; args.flags = flags; args.keg = NULL; if (booted < UMA_STARTUP2) { locked = false; } else { sx_slock(&uma_drain_lock); locked = true; } res = zone_alloc_item(zones, &args, M_WAITOK); if (locked) sx_sunlock(&uma_drain_lock); return (res); } /* See uma.h */ uma_zone_t uma_zsecond_create(char *name, uma_ctor ctor, uma_dtor dtor, uma_init zinit, uma_fini zfini, uma_zone_t master) { struct uma_zctor_args args; uma_keg_t keg; uma_zone_t res; bool locked; keg = zone_first_keg(master); memset(&args, 0, sizeof(args)); args.name = name; args.size = keg->uk_size; args.ctor = ctor; args.dtor = dtor; args.uminit = zinit; args.fini = zfini; args.align = keg->uk_align; args.flags = keg->uk_flags | UMA_ZONE_SECONDARY; args.keg = keg; if (booted < UMA_STARTUP2) { locked = false; } else { sx_slock(&uma_drain_lock); locked = true; } /* XXX Attaches only one keg of potentially many. */ res = zone_alloc_item(zones, &args, M_WAITOK); if (locked) sx_sunlock(&uma_drain_lock); return (res); } /* See uma.h */ uma_zone_t uma_zcache_create(char *name, int size, uma_ctor ctor, uma_dtor dtor, uma_init zinit, uma_fini zfini, uma_import zimport, uma_release zrelease, void *arg, int flags) { struct uma_zctor_args args; memset(&args, 0, sizeof(args)); args.name = name; args.size = size; args.ctor = ctor; args.dtor = dtor; args.uminit = zinit; args.fini = zfini; args.import = zimport; args.release = zrelease; args.arg = arg; args.align = 0; args.flags = flags; return (zone_alloc_item(zones, &args, M_WAITOK)); } static void zone_lock_pair(uma_zone_t a, uma_zone_t b) { if (a < b) { ZONE_LOCK(a); mtx_lock_flags(b->uz_lockptr, MTX_DUPOK); } else { ZONE_LOCK(b); mtx_lock_flags(a->uz_lockptr, MTX_DUPOK); } } static void zone_unlock_pair(uma_zone_t a, uma_zone_t b) { ZONE_UNLOCK(a); ZONE_UNLOCK(b); } int uma_zsecond_add(uma_zone_t zone, uma_zone_t master) { uma_klink_t klink; uma_klink_t kl; int error; error = 0; klink = malloc(sizeof(*klink), M_TEMP, M_WAITOK | M_ZERO); zone_lock_pair(zone, master); /* * zone must use vtoslab() to resolve objects and must already be * a secondary. */ if ((zone->uz_flags & (UMA_ZONE_VTOSLAB | UMA_ZONE_SECONDARY)) != (UMA_ZONE_VTOSLAB | UMA_ZONE_SECONDARY)) { error = EINVAL; goto out; } /* * The new master must also use vtoslab(). */ if ((zone->uz_flags & UMA_ZONE_VTOSLAB) != UMA_ZONE_VTOSLAB) { error = EINVAL; goto out; } /* * Both must either be refcnt, or not be refcnt. */ if ((zone->uz_flags & UMA_ZONE_REFCNT) != (master->uz_flags & UMA_ZONE_REFCNT)) { error = EINVAL; goto out; } /* * The underlying object must be the same size. rsize * may be different. */ if (master->uz_size != zone->uz_size) { error = E2BIG; goto out; } /* * Put it at the end of the list. */ klink->kl_keg = zone_first_keg(master); LIST_FOREACH(kl, &zone->uz_kegs, kl_link) { if (LIST_NEXT(kl, kl_link) == NULL) { LIST_INSERT_AFTER(kl, klink, kl_link); break; } } klink = NULL; zone->uz_flags |= UMA_ZFLAG_MULTI; zone->uz_slab = zone_fetch_slab_multi; out: zone_unlock_pair(zone, master); if (klink != NULL) free(klink, M_TEMP); return (error); } /* See uma.h */ void uma_zdestroy(uma_zone_t zone) { sx_slock(&uma_drain_lock); zone_free_item(zones, zone, NULL, SKIP_NONE); sx_sunlock(&uma_drain_lock); } /* See uma.h */ void * uma_zalloc_arg(uma_zone_t zone, void *udata, int flags) { void *item; uma_cache_t cache; uma_bucket_t bucket; int lockfail; int cpu; #if 0 /* XXX: FIX!! Do not enable this in CURRENT!! MarkM */ /* The entropy here is desirable, but the harvesting is expensive */ random_harvest(&(zone->uz_name), sizeof(void *), 1, RANDOM_UMA_ALLOC); #endif /* This is the fast path allocation */ #ifdef UMA_DEBUG_ALLOC_1 printf("Allocating one item from %s(%p)\n", zone->uz_name, zone); #endif CTR3(KTR_UMA, "uma_zalloc_arg thread %x zone %s flags %d", curthread, zone->uz_name, flags); if (flags & M_WAITOK) { WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, "uma_zalloc_arg: zone \"%s\"", zone->uz_name); } #ifdef DEBUG_MEMGUARD if (memguard_cmp_zone(zone)) { item = memguard_alloc(zone->uz_size, flags); if (item != NULL) { /* * Avoid conflict with the use-after-free * protecting infrastructure from INVARIANTS. */ if (zone->uz_init != NULL && zone->uz_init != mtrash_init && zone->uz_init(item, zone->uz_size, flags) != 0) return (NULL); if (zone->uz_ctor != NULL && zone->uz_ctor != mtrash_ctor && zone->uz_ctor(item, zone->uz_size, udata, flags) != 0) { zone->uz_fini(item, zone->uz_size); return (NULL); } #if 0 /* XXX: FIX!! Do not enable this in CURRENT!! MarkM */ /* The entropy here is desirable, but the harvesting is expensive */ random_harvest(&item, sizeof(void *), 1, RANDOM_UMA_ALLOC); #endif return (item); } /* This is unfortunate but should not be fatal. */ } #endif /* * If possible, allocate from the per-CPU cache. There are two * requirements for safe access to the per-CPU cache: (1) the thread * accessing the cache must not be preempted or yield during access, * and (2) the thread must not migrate CPUs without switching which * cache it accesses. We rely on a critical section to prevent * preemption and migration. We release the critical section in * order to acquire the zone mutex if we are unable to allocate from * the current cache; when we re-acquire the critical section, we * must detect and handle migration if it has occurred. */ critical_enter(); cpu = curcpu; cache = &zone->uz_cpu[cpu]; zalloc_start: bucket = cache->uc_allocbucket; if (bucket != NULL && bucket->ub_cnt > 0) { bucket->ub_cnt--; item = bucket->ub_bucket[bucket->ub_cnt]; #ifdef INVARIANTS bucket->ub_bucket[bucket->ub_cnt] = NULL; #endif KASSERT(item != NULL, ("uma_zalloc: Bucket pointer mangled.")); cache->uc_allocs++; critical_exit(); if (zone->uz_ctor != NULL && zone->uz_ctor(item, zone->uz_size, udata, flags) != 0) { atomic_add_long(&zone->uz_fails, 1); zone_free_item(zone, item, udata, SKIP_DTOR); return (NULL); } #ifdef INVARIANTS uma_dbg_alloc(zone, NULL, item); #endif if (flags & M_ZERO) uma_zero_item(item, zone); #if 0 /* XXX: FIX!! Do not enable this in CURRENT!! MarkM */ /* The entropy here is desirable, but the harvesting is expensive */ random_harvest(&item, sizeof(void *), 1, RANDOM_UMA_ALLOC); #endif return (item); } /* * We have run out of items in our alloc bucket. * See if we can switch with our free bucket. */ bucket = cache->uc_freebucket; if (bucket != NULL && bucket->ub_cnt > 0) { #ifdef UMA_DEBUG_ALLOC printf("uma_zalloc: Swapping empty with alloc.\n"); #endif cache->uc_freebucket = cache->uc_allocbucket; cache->uc_allocbucket = bucket; goto zalloc_start; } /* * Discard any empty allocation bucket while we hold no locks. */ bucket = cache->uc_allocbucket; cache->uc_allocbucket = NULL; critical_exit(); if (bucket != NULL) bucket_free(zone, bucket, udata); /* Short-circuit for zones without buckets and low memory. */ if (zone->uz_count == 0 || bucketdisable) goto zalloc_item; /* * Attempt to retrieve the item from the per-CPU cache has failed, so * we must go back to the zone. This requires the zone lock, so we * must drop the critical section, then re-acquire it when we go back * to the cache. Since the critical section is released, we may be * preempted or migrate. As such, make sure not to maintain any * thread-local state specific to the cache from prior to releasing * the critical section. */ lockfail = 0; if (ZONE_TRYLOCK(zone) == 0) { /* Record contention to size the buckets. */ ZONE_LOCK(zone); lockfail = 1; } critical_enter(); cpu = curcpu; cache = &zone->uz_cpu[cpu]; /* * Since we have locked the zone we may as well send back our stats. */ atomic_add_long(&zone->uz_allocs, cache->uc_allocs); atomic_add_long(&zone->uz_frees, cache->uc_frees); cache->uc_allocs = 0; cache->uc_frees = 0; /* See if we lost the race to fill the cache. */ if (cache->uc_allocbucket != NULL) { ZONE_UNLOCK(zone); goto zalloc_start; } /* * Check the zone's cache of buckets. */ if ((bucket = LIST_FIRST(&zone->uz_buckets)) != NULL) { KASSERT(bucket->ub_cnt != 0, ("uma_zalloc_arg: Returning an empty bucket.")); LIST_REMOVE(bucket, ub_link); cache->uc_allocbucket = bucket; ZONE_UNLOCK(zone); goto zalloc_start; } /* We are no longer associated with this CPU. */ critical_exit(); /* * We bump the uz count when the cache size is insufficient to * handle the working set. */ if (lockfail && zone->uz_count < BUCKET_MAX) zone->uz_count++; ZONE_UNLOCK(zone); /* * Now lets just fill a bucket and put it on the free list. If that * works we'll restart the allocation from the begining and it * will use the just filled bucket. */ bucket = zone_alloc_bucket(zone, udata, flags); if (bucket != NULL) { ZONE_LOCK(zone); critical_enter(); cpu = curcpu; cache = &zone->uz_cpu[cpu]; /* * See if we lost the race or were migrated. Cache the * initialized bucket to make this less likely or claim * the memory directly. */ if (cache->uc_allocbucket == NULL) cache->uc_allocbucket = bucket; else LIST_INSERT_HEAD(&zone->uz_buckets, bucket, ub_link); ZONE_UNLOCK(zone); goto zalloc_start; } /* * We may not be able to get a bucket so return an actual item. */ #ifdef UMA_DEBUG printf("uma_zalloc_arg: Bucketzone returned NULL\n"); #endif zalloc_item: item = zone_alloc_item(zone, udata, flags); #if 0 /* XXX: FIX!! Do not enable this in CURRENT!! MarkM */ /* The entropy here is desirable, but the harvesting is expensive */ random_harvest(&item, sizeof(void *), 1, RANDOM_UMA_ALLOC); #endif return (item); } static uma_slab_t keg_fetch_slab(uma_keg_t keg, uma_zone_t zone, int flags) { uma_slab_t slab; int reserve; mtx_assert(&keg->uk_lock, MA_OWNED); slab = NULL; reserve = 0; if ((flags & M_USE_RESERVE) == 0) reserve = keg->uk_reserve; for (;;) { /* * Find a slab with some space. Prefer slabs that are partially * used over those that are totally full. This helps to reduce * fragmentation. */ if (keg->uk_free > reserve) { if (!LIST_EMPTY(&keg->uk_part_slab)) { slab = LIST_FIRST(&keg->uk_part_slab); } else { slab = LIST_FIRST(&keg->uk_free_slab); LIST_REMOVE(slab, us_link); LIST_INSERT_HEAD(&keg->uk_part_slab, slab, us_link); } MPASS(slab->us_keg == keg); return (slab); } /* * M_NOVM means don't ask at all! */ if (flags & M_NOVM) break; if (keg->uk_maxpages && keg->uk_pages >= keg->uk_maxpages) { keg->uk_flags |= UMA_ZFLAG_FULL; /* * If this is not a multi-zone, set the FULL bit. * Otherwise slab_multi() takes care of it. */ if ((zone->uz_flags & UMA_ZFLAG_MULTI) == 0) { zone->uz_flags |= UMA_ZFLAG_FULL; zone_log_warning(zone); } if (flags & M_NOWAIT) break; zone->uz_sleeps++; msleep(keg, &keg->uk_lock, PVM, "keglimit", 0); continue; } slab = keg_alloc_slab(keg, zone, flags); /* * If we got a slab here it's safe to mark it partially used * and return. We assume that the caller is going to remove * at least one item. */ if (slab) { MPASS(slab->us_keg == keg); LIST_INSERT_HEAD(&keg->uk_part_slab, slab, us_link); return (slab); } /* * We might not have been able to get a slab but another cpu * could have while we were unlocked. Check again before we * fail. */ flags |= M_NOVM; } return (slab); } static uma_slab_t zone_fetch_slab(uma_zone_t zone, uma_keg_t keg, int flags) { uma_slab_t slab; if (keg == NULL) { keg = zone_first_keg(zone); KEG_LOCK(keg); } for (;;) { slab = keg_fetch_slab(keg, zone, flags); if (slab) return (slab); if (flags & (M_NOWAIT | M_NOVM)) break; } KEG_UNLOCK(keg); return (NULL); } /* * uma_zone_fetch_slab_multi: Fetches a slab from one available keg. Returns * with the keg locked. On NULL no lock is held. * * The last pointer is used to seed the search. It is not required. */ static uma_slab_t zone_fetch_slab_multi(uma_zone_t zone, uma_keg_t last, int rflags) { uma_klink_t klink; uma_slab_t slab; uma_keg_t keg; int flags; int empty; int full; /* * Don't wait on the first pass. This will skip limit tests * as well. We don't want to block if we can find a provider * without blocking. */ flags = (rflags & ~M_WAITOK) | M_NOWAIT; /* * Use the last slab allocated as a hint for where to start * the search. */ if (last != NULL) { slab = keg_fetch_slab(last, zone, flags); if (slab) return (slab); KEG_UNLOCK(last); } /* * Loop until we have a slab incase of transient failures * while M_WAITOK is specified. I'm not sure this is 100% * required but we've done it for so long now. */ for (;;) { empty = 0; full = 0; /* * Search the available kegs for slabs. Be careful to hold the * correct lock while calling into the keg layer. */ LIST_FOREACH(klink, &zone->uz_kegs, kl_link) { keg = klink->kl_keg; KEG_LOCK(keg); if ((keg->uk_flags & UMA_ZFLAG_FULL) == 0) { slab = keg_fetch_slab(keg, zone, flags); if (slab) return (slab); } if (keg->uk_flags & UMA_ZFLAG_FULL) full++; else empty++; KEG_UNLOCK(keg); } if (rflags & (M_NOWAIT | M_NOVM)) break; flags = rflags; /* * All kegs are full. XXX We can't atomically check all kegs * and sleep so just sleep for a short period and retry. */ if (full && !empty) { ZONE_LOCK(zone); zone->uz_flags |= UMA_ZFLAG_FULL; zone->uz_sleeps++; zone_log_warning(zone); msleep(zone, zone->uz_lockptr, PVM, "zonelimit", hz/100); zone->uz_flags &= ~UMA_ZFLAG_FULL; ZONE_UNLOCK(zone); continue; } } return (NULL); } static void * slab_alloc_item(uma_keg_t keg, uma_slab_t slab) { void *item; uint8_t freei; MPASS(keg == slab->us_keg); mtx_assert(&keg->uk_lock, MA_OWNED); freei = BIT_FFS(SLAB_SETSIZE, &slab->us_free) - 1; BIT_CLR(SLAB_SETSIZE, freei, &slab->us_free); item = slab->us_data + (keg->uk_rsize * freei); slab->us_freecount--; keg->uk_free--; /* Move this slab to the full list */ if (slab->us_freecount == 0) { LIST_REMOVE(slab, us_link); LIST_INSERT_HEAD(&keg->uk_full_slab, slab, us_link); } return (item); } static int zone_import(uma_zone_t zone, void **bucket, int max, int flags) { uma_slab_t slab; uma_keg_t keg; int i; slab = NULL; keg = NULL; /* Try to keep the buckets totally full */ for (i = 0; i < max; ) { if ((slab = zone->uz_slab(zone, keg, flags)) == NULL) break; keg = slab->us_keg; while (slab->us_freecount && i < max) { bucket[i++] = slab_alloc_item(keg, slab); if (keg->uk_free <= keg->uk_reserve) break; } /* Don't grab more than one slab at a time. */ flags &= ~M_WAITOK; flags |= M_NOWAIT; } if (slab != NULL) KEG_UNLOCK(keg); return i; } static uma_bucket_t zone_alloc_bucket(uma_zone_t zone, void *udata, int flags) { uma_bucket_t bucket; int max; /* Don't wait for buckets, preserve caller's NOVM setting. */ bucket = bucket_alloc(zone, udata, M_NOWAIT | (flags & M_NOVM)); if (bucket == NULL) return (NULL); max = MIN(bucket->ub_entries, zone->uz_count); bucket->ub_cnt = zone->uz_import(zone->uz_arg, bucket->ub_bucket, max, flags); /* * Initialize the memory if necessary. */ if (bucket->ub_cnt != 0 && zone->uz_init != NULL) { int i; for (i = 0; i < bucket->ub_cnt; i++) if (zone->uz_init(bucket->ub_bucket[i], zone->uz_size, flags) != 0) break; /* * If we couldn't initialize the whole bucket, put the * rest back onto the freelist. */ if (i != bucket->ub_cnt) { zone->uz_release(zone->uz_arg, &bucket->ub_bucket[i], bucket->ub_cnt - i); #ifdef INVARIANTS bzero(&bucket->ub_bucket[i], sizeof(void *) * (bucket->ub_cnt - i)); #endif bucket->ub_cnt = i; } } if (bucket->ub_cnt == 0) { bucket_free(zone, bucket, udata); atomic_add_long(&zone->uz_fails, 1); return (NULL); } return (bucket); } /* * Allocates a single item from a zone. * * Arguments * zone The zone to alloc for. * udata The data to be passed to the constructor. * flags M_WAITOK, M_NOWAIT, M_ZERO. * * Returns * NULL if there is no memory and M_NOWAIT is set * An item if successful */ static void * zone_alloc_item(uma_zone_t zone, void *udata, int flags) { void *item; item = NULL; #ifdef UMA_DEBUG_ALLOC printf("INTERNAL: Allocating one item from %s(%p)\n", zone->uz_name, zone); #endif if (zone->uz_import(zone->uz_arg, &item, 1, flags) != 1) goto fail; atomic_add_long(&zone->uz_allocs, 1); /* * We have to call both the zone's init (not the keg's init) * and the zone's ctor. This is because the item is going from * a keg slab directly to the user, and the user is expecting it * to be both zone-init'd as well as zone-ctor'd. */ if (zone->uz_init != NULL) { if (zone->uz_init(item, zone->uz_size, flags) != 0) { zone_free_item(zone, item, udata, SKIP_FINI); goto fail; } } if (zone->uz_ctor != NULL) { if (zone->uz_ctor(item, zone->uz_size, udata, flags) != 0) { zone_free_item(zone, item, udata, SKIP_DTOR); goto fail; } } #ifdef INVARIANTS uma_dbg_alloc(zone, NULL, item); #endif if (flags & M_ZERO) uma_zero_item(item, zone); return (item); fail: atomic_add_long(&zone->uz_fails, 1); return (NULL); } /* See uma.h */ void uma_zfree_arg(uma_zone_t zone, void *item, void *udata) { uma_cache_t cache; uma_bucket_t bucket; int lockfail; int cpu; #if 0 /* XXX: FIX!! Do not enable this in CURRENT!! MarkM */ /* The entropy here is desirable, but the harvesting is expensive */ struct entropy { const void *uz_name; const void *item; } entropy; entropy.uz_name = zone->uz_name; entropy.item = item; random_harvest(&entropy, sizeof(struct entropy), 2, RANDOM_UMA_ALLOC); #endif #ifdef UMA_DEBUG_ALLOC_1 printf("Freeing item %p to %s(%p)\n", item, zone->uz_name, zone); #endif CTR2(KTR_UMA, "uma_zfree_arg thread %x zone %s", curthread, zone->uz_name); /* uma_zfree(..., NULL) does nothing, to match free(9). */ if (item == NULL) return; #ifdef DEBUG_MEMGUARD if (is_memguard_addr(item)) { if (zone->uz_dtor != NULL && zone->uz_dtor != mtrash_dtor) zone->uz_dtor(item, zone->uz_size, udata); if (zone->uz_fini != NULL && zone->uz_fini != mtrash_fini) zone->uz_fini(item, zone->uz_size); memguard_free(item); return; } #endif #ifdef INVARIANTS if (zone->uz_flags & UMA_ZONE_MALLOC) uma_dbg_free(zone, udata, item); else uma_dbg_free(zone, NULL, item); #endif if (zone->uz_dtor != NULL) zone->uz_dtor(item, zone->uz_size, udata); /* * The race here is acceptable. If we miss it we'll just have to wait * a little longer for the limits to be reset. */ if (zone->uz_flags & UMA_ZFLAG_FULL) goto zfree_item; /* * If possible, free to the per-CPU cache. There are two * requirements for safe access to the per-CPU cache: (1) the thread * accessing the cache must not be preempted or yield during access, * and (2) the thread must not migrate CPUs without switching which * cache it accesses. We rely on a critical section to prevent * preemption and migration. We release the critical section in * order to acquire the zone mutex if we are unable to free to the * current cache; when we re-acquire the critical section, we must * detect and handle migration if it has occurred. */ zfree_restart: critical_enter(); cpu = curcpu; cache = &zone->uz_cpu[cpu]; zfree_start: /* * Try to free into the allocbucket first to give LIFO ordering * for cache-hot datastructures. Spill over into the freebucket * if necessary. Alloc will swap them if one runs dry. */ bucket = cache->uc_allocbucket; if (bucket == NULL || bucket->ub_cnt >= bucket->ub_entries) bucket = cache->uc_freebucket; if (bucket != NULL && bucket->ub_cnt < bucket->ub_entries) { KASSERT(bucket->ub_bucket[bucket->ub_cnt] == NULL, ("uma_zfree: Freeing to non free bucket index.")); bucket->ub_bucket[bucket->ub_cnt] = item; bucket->ub_cnt++; cache->uc_frees++; critical_exit(); return; } /* * We must go back the zone, which requires acquiring the zone lock, * which in turn means we must release and re-acquire the critical * section. Since the critical section is released, we may be * preempted or migrate. As such, make sure not to maintain any * thread-local state specific to the cache from prior to releasing * the critical section. */ critical_exit(); if (zone->uz_count == 0 || bucketdisable) goto zfree_item; lockfail = 0; if (ZONE_TRYLOCK(zone) == 0) { /* Record contention to size the buckets. */ ZONE_LOCK(zone); lockfail = 1; } critical_enter(); cpu = curcpu; cache = &zone->uz_cpu[cpu]; /* * Since we have locked the zone we may as well send back our stats. */ atomic_add_long(&zone->uz_allocs, cache->uc_allocs); atomic_add_long(&zone->uz_frees, cache->uc_frees); cache->uc_allocs = 0; cache->uc_frees = 0; bucket = cache->uc_freebucket; if (bucket != NULL && bucket->ub_cnt < bucket->ub_entries) { ZONE_UNLOCK(zone); goto zfree_start; } cache->uc_freebucket = NULL; /* Can we throw this on the zone full list? */ if (bucket != NULL) { #ifdef UMA_DEBUG_ALLOC printf("uma_zfree: Putting old bucket on the free list.\n"); #endif /* ub_cnt is pointing to the last free item */ KASSERT(bucket->ub_cnt != 0, ("uma_zfree: Attempting to insert an empty bucket onto the full list.\n")); LIST_INSERT_HEAD(&zone->uz_buckets, bucket, ub_link); } /* We are no longer associated with this CPU. */ critical_exit(); /* * We bump the uz count when the cache size is insufficient to * handle the working set. */ if (lockfail && zone->uz_count < BUCKET_MAX) zone->uz_count++; ZONE_UNLOCK(zone); #ifdef UMA_DEBUG_ALLOC printf("uma_zfree: Allocating new free bucket.\n"); #endif bucket = bucket_alloc(zone, udata, M_NOWAIT); if (bucket) { critical_enter(); cpu = curcpu; cache = &zone->uz_cpu[cpu]; if (cache->uc_freebucket == NULL) { cache->uc_freebucket = bucket; goto zfree_start; } /* * We lost the race, start over. We have to drop our * critical section to free the bucket. */ critical_exit(); bucket_free(zone, bucket, udata); goto zfree_restart; } /* * If nothing else caught this, we'll just do an internal free. */ zfree_item: zone_free_item(zone, item, udata, SKIP_DTOR); return; } static void slab_free_item(uma_keg_t keg, uma_slab_t slab, void *item) { uint8_t freei; mtx_assert(&keg->uk_lock, MA_OWNED); MPASS(keg == slab->us_keg); /* Do we need to remove from any lists? */ if (slab->us_freecount+1 == keg->uk_ipers) { LIST_REMOVE(slab, us_link); LIST_INSERT_HEAD(&keg->uk_free_slab, slab, us_link); } else if (slab->us_freecount == 0) { LIST_REMOVE(slab, us_link); LIST_INSERT_HEAD(&keg->uk_part_slab, slab, us_link); } /* Slab management. */ freei = ((uintptr_t)item - (uintptr_t)slab->us_data) / keg->uk_rsize; BIT_SET(SLAB_SETSIZE, freei, &slab->us_free); slab->us_freecount++; /* Keg statistics. */ keg->uk_free++; } static void zone_release(uma_zone_t zone, void **bucket, int cnt) { void *item; uma_slab_t slab; uma_keg_t keg; uint8_t *mem; int clearfull; int i; clearfull = 0; keg = zone_first_keg(zone); KEG_LOCK(keg); for (i = 0; i < cnt; i++) { item = bucket[i]; if (!(zone->uz_flags & UMA_ZONE_VTOSLAB)) { mem = (uint8_t *)((uintptr_t)item & (~UMA_SLAB_MASK)); if (zone->uz_flags & UMA_ZONE_HASH) { slab = hash_sfind(&keg->uk_hash, mem); } else { mem += keg->uk_pgoff; slab = (uma_slab_t)mem; } } else { slab = vtoslab((vm_offset_t)item); if (slab->us_keg != keg) { KEG_UNLOCK(keg); keg = slab->us_keg; KEG_LOCK(keg); } } slab_free_item(keg, slab, item); if (keg->uk_flags & UMA_ZFLAG_FULL) { if (keg->uk_pages < keg->uk_maxpages) { keg->uk_flags &= ~UMA_ZFLAG_FULL; clearfull = 1; } /* * We can handle one more allocation. Since we're * clearing ZFLAG_FULL, wake up all procs blocked * on pages. This should be uncommon, so keeping this * simple for now (rather than adding count of blocked * threads etc). */ wakeup(keg); } } KEG_UNLOCK(keg); if (clearfull) { ZONE_LOCK(zone); zone->uz_flags &= ~UMA_ZFLAG_FULL; wakeup(zone); ZONE_UNLOCK(zone); } } /* * Frees a single item to any zone. * * Arguments: * zone The zone to free to * item The item we're freeing * udata User supplied data for the dtor * skip Skip dtors and finis */ static void zone_free_item(uma_zone_t zone, void *item, void *udata, enum zfreeskip skip) { #ifdef INVARIANTS if (skip == SKIP_NONE) { if (zone->uz_flags & UMA_ZONE_MALLOC) uma_dbg_free(zone, udata, item); else uma_dbg_free(zone, NULL, item); } #endif if (skip < SKIP_DTOR && zone->uz_dtor) zone->uz_dtor(item, zone->uz_size, udata); if (skip < SKIP_FINI && zone->uz_fini) zone->uz_fini(item, zone->uz_size); atomic_add_long(&zone->uz_frees, 1); zone->uz_release(zone->uz_arg, &item, 1); } /* See uma.h */ int uma_zone_set_max(uma_zone_t zone, int nitems) { uma_keg_t keg; keg = zone_first_keg(zone); if (keg == NULL) return (0); KEG_LOCK(keg); keg->uk_maxpages = (nitems / keg->uk_ipers) * keg->uk_ppera; if (keg->uk_maxpages * keg->uk_ipers < nitems) keg->uk_maxpages += keg->uk_ppera; nitems = keg->uk_maxpages * keg->uk_ipers; KEG_UNLOCK(keg); return (nitems); } /* See uma.h */ int uma_zone_get_max(uma_zone_t zone) { int nitems; uma_keg_t keg; keg = zone_first_keg(zone); if (keg == NULL) return (0); KEG_LOCK(keg); nitems = keg->uk_maxpages * keg->uk_ipers; KEG_UNLOCK(keg); return (nitems); } /* See uma.h */ void uma_zone_set_warning(uma_zone_t zone, const char *warning) { ZONE_LOCK(zone); zone->uz_warning = warning; ZONE_UNLOCK(zone); } /* See uma.h */ int uma_zone_get_cur(uma_zone_t zone) { int64_t nitems; u_int i; ZONE_LOCK(zone); nitems = zone->uz_allocs - zone->uz_frees; CPU_FOREACH(i) { /* * See the comment in sysctl_vm_zone_stats() regarding the * safety of accessing the per-cpu caches. With the zone lock * held, it is safe, but can potentially result in stale data. */ nitems += zone->uz_cpu[i].uc_allocs - zone->uz_cpu[i].uc_frees; } ZONE_UNLOCK(zone); return (nitems < 0 ? 0 : nitems); } /* See uma.h */ void uma_zone_set_init(uma_zone_t zone, uma_init uminit) { uma_keg_t keg; keg = zone_first_keg(zone); KASSERT(keg != NULL, ("uma_zone_set_init: Invalid zone type")); KEG_LOCK(keg); KASSERT(keg->uk_pages == 0, ("uma_zone_set_init on non-empty keg")); keg->uk_init = uminit; KEG_UNLOCK(keg); } /* See uma.h */ void uma_zone_set_fini(uma_zone_t zone, uma_fini fini) { uma_keg_t keg; keg = zone_first_keg(zone); KASSERT(keg != NULL, ("uma_zone_set_fini: Invalid zone type")); KEG_LOCK(keg); KASSERT(keg->uk_pages == 0, ("uma_zone_set_fini on non-empty keg")); keg->uk_fini = fini; KEG_UNLOCK(keg); } /* See uma.h */ void uma_zone_set_zinit(uma_zone_t zone, uma_init zinit) { ZONE_LOCK(zone); KASSERT(zone_first_keg(zone)->uk_pages == 0, ("uma_zone_set_zinit on non-empty keg")); zone->uz_init = zinit; ZONE_UNLOCK(zone); } /* See uma.h */ void uma_zone_set_zfini(uma_zone_t zone, uma_fini zfini) { ZONE_LOCK(zone); KASSERT(zone_first_keg(zone)->uk_pages == 0, ("uma_zone_set_zfini on non-empty keg")); zone->uz_fini = zfini; ZONE_UNLOCK(zone); } /* See uma.h */ /* XXX uk_freef is not actually used with the zone locked */ void uma_zone_set_freef(uma_zone_t zone, uma_free freef) { uma_keg_t keg; keg = zone_first_keg(zone); KASSERT(keg != NULL, ("uma_zone_set_freef: Invalid zone type")); KEG_LOCK(keg); keg->uk_freef = freef; KEG_UNLOCK(keg); } /* See uma.h */ /* XXX uk_allocf is not actually used with the zone locked */ void uma_zone_set_allocf(uma_zone_t zone, uma_alloc allocf) { uma_keg_t keg; keg = zone_first_keg(zone); KEG_LOCK(keg); keg->uk_allocf = allocf; KEG_UNLOCK(keg); } /* See uma.h */ void uma_zone_reserve(uma_zone_t zone, int items) { uma_keg_t keg; keg = zone_first_keg(zone); if (keg == NULL) return; KEG_LOCK(keg); keg->uk_reserve = items; KEG_UNLOCK(keg); return; } /* See uma.h */ int uma_zone_reserve_kva(uma_zone_t zone, int count) { uma_keg_t keg; vm_offset_t kva; int pages; keg = zone_first_keg(zone); if (keg == NULL) return (0); pages = count / keg->uk_ipers; if (pages * keg->uk_ipers < count) pages++; #ifdef UMA_MD_SMALL_ALLOC if (keg->uk_ppera > 1) { #else if (1) { #endif kva = kva_alloc(pages * UMA_SLAB_SIZE); if (kva == 0) return (0); } else kva = 0; KEG_LOCK(keg); keg->uk_kva = kva; keg->uk_offset = 0; keg->uk_maxpages = pages; #ifdef UMA_MD_SMALL_ALLOC keg->uk_allocf = (keg->uk_ppera > 1) ? noobj_alloc : uma_small_alloc; #else keg->uk_allocf = noobj_alloc; #endif keg->uk_flags |= UMA_ZONE_NOFREE; KEG_UNLOCK(keg); return (1); } /* See uma.h */ void uma_prealloc(uma_zone_t zone, int items) { int slabs; uma_slab_t slab; uma_keg_t keg; keg = zone_first_keg(zone); if (keg == NULL) return; KEG_LOCK(keg); slabs = items / keg->uk_ipers; if (slabs * keg->uk_ipers < items) slabs++; while (slabs > 0) { slab = keg_alloc_slab(keg, zone, M_WAITOK); if (slab == NULL) break; MPASS(slab->us_keg == keg); LIST_INSERT_HEAD(&keg->uk_free_slab, slab, us_link); slabs--; } KEG_UNLOCK(keg); } /* See uma.h */ uint32_t * uma_find_refcnt(uma_zone_t zone, void *item) { uma_slabrefcnt_t slabref; uma_slab_t slab; uma_keg_t keg; uint32_t *refcnt; int idx; slab = vtoslab((vm_offset_t)item & (~UMA_SLAB_MASK)); slabref = (uma_slabrefcnt_t)slab; keg = slab->us_keg; KASSERT(keg->uk_flags & UMA_ZONE_REFCNT, ("uma_find_refcnt(): zone possibly not UMA_ZONE_REFCNT")); idx = ((uintptr_t)item - (uintptr_t)slab->us_data) / keg->uk_rsize; refcnt = &slabref->us_refcnt[idx]; return refcnt; } /* See uma.h */ -void -uma_reclaim(void) +static void +uma_reclaim_locked(bool kmem_danger) { + #ifdef UMA_DEBUG printf("UMA: vm asked us to release pages!\n"); #endif - sx_xlock(&uma_drain_lock); + sx_assert(&uma_drain_lock, SA_XLOCKED); bucket_enable(); zone_foreach(zone_drain); - if (vm_page_count_min()) { + if (vm_page_count_min() || kmem_danger) { cache_drain_safe(NULL); zone_foreach(zone_drain); } /* * Some slabs may have been freed but this zone will be visited early * we visit again so that we can free pages that are empty once other * zones are drained. We have to do the same for buckets. */ zone_drain(slabzone); zone_drain(slabrefzone); bucket_zone_drain(); +} + +void +uma_reclaim(void) +{ + + sx_xlock(&uma_drain_lock); + uma_reclaim_locked(false); sx_xunlock(&uma_drain_lock); +} + +static int uma_reclaim_needed; + +void +uma_reclaim_wakeup(void) +{ + + uma_reclaim_needed = 1; + wakeup(&uma_reclaim_needed); +} + +void +uma_reclaim_worker(void *arg __unused) +{ + + sx_xlock(&uma_drain_lock); + for (;;) { + sx_sleep(&uma_reclaim_needed, &uma_drain_lock, PVM, + "umarcl", 0); + if (uma_reclaim_needed) { + uma_reclaim_needed = 0; + uma_reclaim_locked(true); + } + } } /* See uma.h */ int uma_zone_exhausted(uma_zone_t zone) { int full; ZONE_LOCK(zone); full = (zone->uz_flags & UMA_ZFLAG_FULL); ZONE_UNLOCK(zone); return (full); } int uma_zone_exhausted_nolock(uma_zone_t zone) { return (zone->uz_flags & UMA_ZFLAG_FULL); } void * uma_large_malloc(vm_size_t size, int wait) { void *mem; uma_slab_t slab; uint8_t flags; slab = zone_alloc_item(slabzone, NULL, wait); if (slab == NULL) return (NULL); mem = page_alloc(NULL, size, &flags, wait); if (mem) { vsetslab((vm_offset_t)mem, slab); slab->us_data = mem; slab->us_flags = flags | UMA_SLAB_MALLOC; slab->us_size = size; } else { zone_free_item(slabzone, slab, NULL, SKIP_NONE); } return (mem); } void uma_large_free(uma_slab_t slab) { page_free(slab->us_data, slab->us_size, slab->us_flags); zone_free_item(slabzone, slab, NULL, SKIP_NONE); } static void uma_zero_item(void *item, uma_zone_t zone) { if (zone->uz_flags & UMA_ZONE_PCPU) { for (int i = 0; i < mp_ncpus; i++) bzero(zpcpu_get_cpu(item, i), zone->uz_size); } else bzero(item, zone->uz_size); } void uma_print_stats(void) { zone_foreach(uma_print_zone); } static void slab_print(uma_slab_t slab) { printf("slab: keg %p, data %p, freecount %d\n", slab->us_keg, slab->us_data, slab->us_freecount); } static void cache_print(uma_cache_t cache) { printf("alloc: %p(%d), free: %p(%d)\n", cache->uc_allocbucket, cache->uc_allocbucket?cache->uc_allocbucket->ub_cnt:0, cache->uc_freebucket, cache->uc_freebucket?cache->uc_freebucket->ub_cnt:0); } static void uma_print_keg(uma_keg_t keg) { uma_slab_t slab; printf("keg: %s(%p) size %d(%d) flags %#x ipers %d ppera %d " "out %d free %d limit %d\n", keg->uk_name, keg, keg->uk_size, keg->uk_rsize, keg->uk_flags, keg->uk_ipers, keg->uk_ppera, (keg->uk_ipers * keg->uk_pages) - keg->uk_free, keg->uk_free, (keg->uk_maxpages / keg->uk_ppera) * keg->uk_ipers); printf("Part slabs:\n"); LIST_FOREACH(slab, &keg->uk_part_slab, us_link) slab_print(slab); printf("Free slabs:\n"); LIST_FOREACH(slab, &keg->uk_free_slab, us_link) slab_print(slab); printf("Full slabs:\n"); LIST_FOREACH(slab, &keg->uk_full_slab, us_link) slab_print(slab); } void uma_print_zone(uma_zone_t zone) { uma_cache_t cache; uma_klink_t kl; int i; printf("zone: %s(%p) size %d flags %#x\n", zone->uz_name, zone, zone->uz_size, zone->uz_flags); LIST_FOREACH(kl, &zone->uz_kegs, kl_link) uma_print_keg(kl->kl_keg); CPU_FOREACH(i) { cache = &zone->uz_cpu[i]; printf("CPU %d Cache:\n", i); cache_print(cache); } } #ifdef DDB /* * Generate statistics across both the zone and its per-cpu cache's. Return * desired statistics if the pointer is non-NULL for that statistic. * * Note: does not update the zone statistics, as it can't safely clear the * per-CPU cache statistic. * * XXXRW: Following the uc_allocbucket and uc_freebucket pointers here isn't * safe from off-CPU; we should modify the caches to track this information * directly so that we don't have to. */ static void uma_zone_sumstat(uma_zone_t z, int *cachefreep, uint64_t *allocsp, uint64_t *freesp, uint64_t *sleepsp) { uma_cache_t cache; uint64_t allocs, frees, sleeps; int cachefree, cpu; allocs = frees = sleeps = 0; cachefree = 0; CPU_FOREACH(cpu) { cache = &z->uz_cpu[cpu]; if (cache->uc_allocbucket != NULL) cachefree += cache->uc_allocbucket->ub_cnt; if (cache->uc_freebucket != NULL) cachefree += cache->uc_freebucket->ub_cnt; allocs += cache->uc_allocs; frees += cache->uc_frees; } allocs += z->uz_allocs; frees += z->uz_frees; sleeps += z->uz_sleeps; if (cachefreep != NULL) *cachefreep = cachefree; if (allocsp != NULL) *allocsp = allocs; if (freesp != NULL) *freesp = frees; if (sleepsp != NULL) *sleepsp = sleeps; } #endif /* DDB */ static int sysctl_vm_zone_count(SYSCTL_HANDLER_ARGS) { uma_keg_t kz; uma_zone_t z; int count; count = 0; rw_rlock(&uma_rwlock); LIST_FOREACH(kz, &uma_kegs, uk_link) { LIST_FOREACH(z, &kz->uk_zones, uz_link) count++; } rw_runlock(&uma_rwlock); return (sysctl_handle_int(oidp, &count, 0, req)); } static int sysctl_vm_zone_stats(SYSCTL_HANDLER_ARGS) { struct uma_stream_header ush; struct uma_type_header uth; struct uma_percpu_stat ups; uma_bucket_t bucket; struct sbuf sbuf; uma_cache_t cache; uma_klink_t kl; uma_keg_t kz; uma_zone_t z; uma_keg_t k; int count, error, i; error = sysctl_wire_old_buffer(req, 0); if (error != 0) return (error); sbuf_new_for_sysctl(&sbuf, NULL, 128, req); sbuf_clear_flags(&sbuf, SBUF_INCLUDENUL); count = 0; rw_rlock(&uma_rwlock); LIST_FOREACH(kz, &uma_kegs, uk_link) { LIST_FOREACH(z, &kz->uk_zones, uz_link) count++; } /* * Insert stream header. */ bzero(&ush, sizeof(ush)); ush.ush_version = UMA_STREAM_VERSION; ush.ush_maxcpus = (mp_maxid + 1); ush.ush_count = count; (void)sbuf_bcat(&sbuf, &ush, sizeof(ush)); LIST_FOREACH(kz, &uma_kegs, uk_link) { LIST_FOREACH(z, &kz->uk_zones, uz_link) { bzero(&uth, sizeof(uth)); ZONE_LOCK(z); strlcpy(uth.uth_name, z->uz_name, UTH_MAX_NAME); uth.uth_align = kz->uk_align; uth.uth_size = kz->uk_size; uth.uth_rsize = kz->uk_rsize; LIST_FOREACH(kl, &z->uz_kegs, kl_link) { k = kl->kl_keg; uth.uth_maxpages += k->uk_maxpages; uth.uth_pages += k->uk_pages; uth.uth_keg_free += k->uk_free; uth.uth_limit = (k->uk_maxpages / k->uk_ppera) * k->uk_ipers; } /* * A zone is secondary is it is not the first entry * on the keg's zone list. */ if ((z->uz_flags & UMA_ZONE_SECONDARY) && (LIST_FIRST(&kz->uk_zones) != z)) uth.uth_zone_flags = UTH_ZONE_SECONDARY; LIST_FOREACH(bucket, &z->uz_buckets, ub_link) uth.uth_zone_free += bucket->ub_cnt; uth.uth_allocs = z->uz_allocs; uth.uth_frees = z->uz_frees; uth.uth_fails = z->uz_fails; uth.uth_sleeps = z->uz_sleeps; (void)sbuf_bcat(&sbuf, &uth, sizeof(uth)); /* * While it is not normally safe to access the cache * bucket pointers while not on the CPU that owns the * cache, we only allow the pointers to be exchanged * without the zone lock held, not invalidated, so * accept the possible race associated with bucket * exchange during monitoring. */ for (i = 0; i < (mp_maxid + 1); i++) { bzero(&ups, sizeof(ups)); if (kz->uk_flags & UMA_ZFLAG_INTERNAL) goto skip; if (CPU_ABSENT(i)) goto skip; cache = &z->uz_cpu[i]; if (cache->uc_allocbucket != NULL) ups.ups_cache_free += cache->uc_allocbucket->ub_cnt; if (cache->uc_freebucket != NULL) ups.ups_cache_free += cache->uc_freebucket->ub_cnt; ups.ups_allocs = cache->uc_allocs; ups.ups_frees = cache->uc_frees; skip: (void)sbuf_bcat(&sbuf, &ups, sizeof(ups)); } ZONE_UNLOCK(z); } } rw_runlock(&uma_rwlock); error = sbuf_finish(&sbuf); sbuf_delete(&sbuf); return (error); } int sysctl_handle_uma_zone_max(SYSCTL_HANDLER_ARGS) { uma_zone_t zone = *(uma_zone_t *)arg1; int error, max; max = uma_zone_get_max(zone); error = sysctl_handle_int(oidp, &max, 0, req); if (error || !req->newptr) return (error); uma_zone_set_max(zone, max); return (0); } int sysctl_handle_uma_zone_cur(SYSCTL_HANDLER_ARGS) { uma_zone_t zone = *(uma_zone_t *)arg1; int cur; cur = uma_zone_get_cur(zone); return (sysctl_handle_int(oidp, &cur, 0, req)); } #ifdef DDB DB_SHOW_COMMAND(uma, db_show_uma) { uint64_t allocs, frees, sleeps; uma_bucket_t bucket; uma_keg_t kz; uma_zone_t z; int cachefree; db_printf("%18s %8s %8s %8s %12s %8s %8s\n", "Zone", "Size", "Used", "Free", "Requests", "Sleeps", "Bucket"); LIST_FOREACH(kz, &uma_kegs, uk_link) { LIST_FOREACH(z, &kz->uk_zones, uz_link) { if (kz->uk_flags & UMA_ZFLAG_INTERNAL) { allocs = z->uz_allocs; frees = z->uz_frees; sleeps = z->uz_sleeps; cachefree = 0; } else uma_zone_sumstat(z, &cachefree, &allocs, &frees, &sleeps); if (!((z->uz_flags & UMA_ZONE_SECONDARY) && (LIST_FIRST(&kz->uk_zones) != z))) cachefree += kz->uk_free; LIST_FOREACH(bucket, &z->uz_buckets, ub_link) cachefree += bucket->ub_cnt; db_printf("%18s %8ju %8jd %8d %12ju %8ju %8u\n", z->uz_name, (uintmax_t)kz->uk_size, (intmax_t)(allocs - frees), cachefree, (uintmax_t)allocs, sleeps, z->uz_count); if (db_pager_quit) return; } } } DB_SHOW_COMMAND(umacache, db_show_umacache) { uint64_t allocs, frees; uma_bucket_t bucket; uma_zone_t z; int cachefree; db_printf("%18s %8s %8s %8s %12s %8s\n", "Zone", "Size", "Used", "Free", "Requests", "Bucket"); LIST_FOREACH(z, &uma_cachezones, uz_link) { uma_zone_sumstat(z, &cachefree, &allocs, &frees, NULL); LIST_FOREACH(bucket, &z->uz_buckets, ub_link) cachefree += bucket->ub_cnt; db_printf("%18s %8ju %8jd %8d %12ju %8u\n", z->uz_name, (uintmax_t)z->uz_size, (intmax_t)(allocs - frees), cachefree, (uintmax_t)allocs, z->uz_count); if (db_pager_quit) return; } } #endif Index: projects/release-arm-redux/sys/vm/vm_pageout.c =================================================================== --- projects/release-arm-redux/sys/vm/vm_pageout.c (revision 282691) +++ projects/release-arm-redux/sys/vm/vm_pageout.c (revision 282692) @@ -1,1914 +1,1919 @@ /*- * Copyright (c) 1991 Regents of the University of California. * All rights reserved. * Copyright (c) 1994 John S. Dyson * All rights reserved. * Copyright (c) 1994 David Greenman * All rights reserved. * Copyright (c) 2005 Yahoo! Technologies Norway AS * All rights reserved. * * This code is derived from software contributed to Berkeley by * The Mach Operating System project at Carnegie-Mellon University. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by the University of * California, Berkeley and its contributors. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * from: @(#)vm_pageout.c 7.4 (Berkeley) 5/7/91 * * * Copyright (c) 1987, 1990 Carnegie-Mellon University. * All rights reserved. * * Authors: Avadis Tevanian, Jr., Michael Wayne Young * * Permission to use, copy, modify and distribute this software and * its documentation is hereby granted, provided that both the copyright * notice and this permission notice appear in all copies of the * software, derivative works or modified versions, and any portions * thereof, and that both notices appear in supporting documentation. * * CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS" * CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND * FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE. * * Carnegie Mellon requests users of this software to return to * * Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU * School of Computer Science * Carnegie Mellon University * Pittsburgh PA 15213-3890 * * any improvements or extensions that they make and grant Carnegie the * rights to redistribute these changes. */ /* * The proverbial page-out daemon. */ #include __FBSDID("$FreeBSD$"); #include "opt_vm.h" #include "opt_kdtrace.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* * System initialization */ /* the kernel process "vm_pageout"*/ static void vm_pageout(void); static void vm_pageout_init(void); static int vm_pageout_clean(vm_page_t m); static int vm_pageout_cluster(vm_page_t m); static void vm_pageout_scan(struct vm_domain *vmd, int pass); static void vm_pageout_mightbe_oom(struct vm_domain *vmd, int pass); SYSINIT(pagedaemon_init, SI_SUB_KTHREAD_PAGE, SI_ORDER_FIRST, vm_pageout_init, NULL); struct proc *pageproc; static struct kproc_desc page_kp = { "pagedaemon", vm_pageout, &pageproc }; SYSINIT(pagedaemon, SI_SUB_KTHREAD_PAGE, SI_ORDER_SECOND, kproc_start, &page_kp); SDT_PROVIDER_DEFINE(vm); SDT_PROBE_DEFINE(vm, , , vm__lowmem_cache); SDT_PROBE_DEFINE(vm, , , vm__lowmem_scan); #if !defined(NO_SWAPPING) /* the kernel process "vm_daemon"*/ static void vm_daemon(void); static struct proc *vmproc; static struct kproc_desc vm_kp = { "vmdaemon", vm_daemon, &vmproc }; SYSINIT(vmdaemon, SI_SUB_KTHREAD_VM, SI_ORDER_FIRST, kproc_start, &vm_kp); #endif int vm_pages_needed; /* Event on which pageout daemon sleeps */ int vm_pageout_deficit; /* Estimated number of pages deficit */ int vm_pageout_pages_needed; /* flag saying that the pageout daemon needs pages */ int vm_pageout_wakeup_thresh; #if !defined(NO_SWAPPING) static int vm_pageout_req_swapout; /* XXX */ static int vm_daemon_needed; static struct mtx vm_daemon_mtx; /* Allow for use by vm_pageout before vm_daemon is initialized. */ MTX_SYSINIT(vm_daemon, &vm_daemon_mtx, "vm daemon", MTX_DEF); #endif static int vm_max_launder = 32; static int vm_pageout_update_period; static int defer_swap_pageouts; static int disable_swap_pageouts; static int lowmem_period = 10; static int lowmem_ticks; #if defined(NO_SWAPPING) static int vm_swap_enabled = 0; static int vm_swap_idle_enabled = 0; #else static int vm_swap_enabled = 1; static int vm_swap_idle_enabled = 0; #endif static int vm_panic_on_oom = 0; SYSCTL_INT(_vm, OID_AUTO, panic_on_oom, CTLFLAG_RWTUN, &vm_panic_on_oom, 0, "panic on out of memory instead of killing the largest process"); SYSCTL_INT(_vm, OID_AUTO, pageout_wakeup_thresh, CTLFLAG_RW, &vm_pageout_wakeup_thresh, 0, "free page threshold for waking up the pageout daemon"); SYSCTL_INT(_vm, OID_AUTO, max_launder, CTLFLAG_RW, &vm_max_launder, 0, "Limit dirty flushes in pageout"); SYSCTL_INT(_vm, OID_AUTO, pageout_update_period, CTLFLAG_RW, &vm_pageout_update_period, 0, "Maximum active LRU update period"); SYSCTL_INT(_vm, OID_AUTO, lowmem_period, CTLFLAG_RW, &lowmem_period, 0, "Low memory callback period"); #if defined(NO_SWAPPING) SYSCTL_INT(_vm, VM_SWAPPING_ENABLED, swap_enabled, CTLFLAG_RD, &vm_swap_enabled, 0, "Enable entire process swapout"); SYSCTL_INT(_vm, OID_AUTO, swap_idle_enabled, CTLFLAG_RD, &vm_swap_idle_enabled, 0, "Allow swapout on idle criteria"); #else SYSCTL_INT(_vm, VM_SWAPPING_ENABLED, swap_enabled, CTLFLAG_RW, &vm_swap_enabled, 0, "Enable entire process swapout"); SYSCTL_INT(_vm, OID_AUTO, swap_idle_enabled, CTLFLAG_RW, &vm_swap_idle_enabled, 0, "Allow swapout on idle criteria"); #endif SYSCTL_INT(_vm, OID_AUTO, defer_swapspace_pageouts, CTLFLAG_RW, &defer_swap_pageouts, 0, "Give preference to dirty pages in mem"); SYSCTL_INT(_vm, OID_AUTO, disable_swapspace_pageouts, CTLFLAG_RW, &disable_swap_pageouts, 0, "Disallow swapout of dirty pages"); static int pageout_lock_miss; SYSCTL_INT(_vm, OID_AUTO, pageout_lock_miss, CTLFLAG_RD, &pageout_lock_miss, 0, "vget() lock misses during pageout"); #define VM_PAGEOUT_PAGE_COUNT 16 int vm_pageout_page_count = VM_PAGEOUT_PAGE_COUNT; int vm_page_max_wired; /* XXX max # of wired pages system-wide */ SYSCTL_INT(_vm, OID_AUTO, max_wired, CTLFLAG_RW, &vm_page_max_wired, 0, "System-wide limit to wired page count"); static boolean_t vm_pageout_fallback_object_lock(vm_page_t, vm_page_t *); static boolean_t vm_pageout_launder(struct vm_pagequeue *pq, int, vm_paddr_t, vm_paddr_t); #if !defined(NO_SWAPPING) static void vm_pageout_map_deactivate_pages(vm_map_t, long); static void vm_pageout_object_deactivate_pages(pmap_t, vm_object_t, long); static void vm_req_vmdaemon(int req); #endif static boolean_t vm_pageout_page_lock(vm_page_t, vm_page_t *); /* * Initialize a dummy page for marking the caller's place in the specified * paging queue. In principle, this function only needs to set the flag * PG_MARKER. Nonetheless, it wirte busies and initializes the hold count * to one as safety precautions. */ static void vm_pageout_init_marker(vm_page_t marker, u_short queue) { bzero(marker, sizeof(*marker)); marker->flags = PG_MARKER; marker->busy_lock = VPB_SINGLE_EXCLUSIVER; marker->queue = queue; marker->hold_count = 1; } /* * vm_pageout_fallback_object_lock: * * Lock vm object currently associated with `m'. VM_OBJECT_TRYWLOCK is * known to have failed and page queue must be either PQ_ACTIVE or * PQ_INACTIVE. To avoid lock order violation, unlock the page queues * while locking the vm object. Use marker page to detect page queue * changes and maintain notion of next page on page queue. Return * TRUE if no changes were detected, FALSE otherwise. vm object is * locked on return. * * This function depends on both the lock portion of struct vm_object * and normal struct vm_page being type stable. */ static boolean_t vm_pageout_fallback_object_lock(vm_page_t m, vm_page_t *next) { struct vm_page marker; struct vm_pagequeue *pq; boolean_t unchanged; u_short queue; vm_object_t object; queue = m->queue; vm_pageout_init_marker(&marker, queue); pq = vm_page_pagequeue(m); object = m->object; TAILQ_INSERT_AFTER(&pq->pq_pl, m, &marker, plinks.q); vm_pagequeue_unlock(pq); vm_page_unlock(m); VM_OBJECT_WLOCK(object); vm_page_lock(m); vm_pagequeue_lock(pq); /* Page queue might have changed. */ *next = TAILQ_NEXT(&marker, plinks.q); unchanged = (m->queue == queue && m->object == object && &marker == TAILQ_NEXT(m, plinks.q)); TAILQ_REMOVE(&pq->pq_pl, &marker, plinks.q); return (unchanged); } /* * Lock the page while holding the page queue lock. Use marker page * to detect page queue changes and maintain notion of next page on * page queue. Return TRUE if no changes were detected, FALSE * otherwise. The page is locked on return. The page queue lock might * be dropped and reacquired. * * This function depends on normal struct vm_page being type stable. */ static boolean_t vm_pageout_page_lock(vm_page_t m, vm_page_t *next) { struct vm_page marker; struct vm_pagequeue *pq; boolean_t unchanged; u_short queue; vm_page_lock_assert(m, MA_NOTOWNED); if (vm_page_trylock(m)) return (TRUE); queue = m->queue; vm_pageout_init_marker(&marker, queue); pq = vm_page_pagequeue(m); TAILQ_INSERT_AFTER(&pq->pq_pl, m, &marker, plinks.q); vm_pagequeue_unlock(pq); vm_page_lock(m); vm_pagequeue_lock(pq); /* Page queue might have changed. */ *next = TAILQ_NEXT(&marker, plinks.q); unchanged = (m->queue == queue && &marker == TAILQ_NEXT(m, plinks.q)); TAILQ_REMOVE(&pq->pq_pl, &marker, plinks.q); return (unchanged); } /* * vm_pageout_clean: * * Clean the page and remove it from the laundry. * * We set the busy bit to cause potential page faults on this page to * block. Note the careful timing, however, the busy bit isn't set till * late and we cannot do anything that will mess with the page. */ static int vm_pageout_cluster(vm_page_t m) { vm_object_t object; vm_page_t mc[2*vm_pageout_page_count], pb, ps; int pageout_count; int ib, is, page_base; vm_pindex_t pindex = m->pindex; vm_page_lock_assert(m, MA_OWNED); object = m->object; VM_OBJECT_ASSERT_WLOCKED(object); /* * It doesn't cost us anything to pageout OBJT_DEFAULT or OBJT_SWAP * with the new swapper, but we could have serious problems paging * out other object types if there is insufficient memory. * * Unfortunately, checking free memory here is far too late, so the * check has been moved up a procedural level. */ /* * Can't clean the page if it's busy or held. */ vm_page_assert_unbusied(m); KASSERT(m->hold_count == 0, ("vm_pageout_clean: page %p is held", m)); vm_page_unlock(m); mc[vm_pageout_page_count] = pb = ps = m; pageout_count = 1; page_base = vm_pageout_page_count; ib = 1; is = 1; /* * Scan object for clusterable pages. * * We can cluster ONLY if: ->> the page is NOT * clean, wired, busy, held, or mapped into a * buffer, and one of the following: * 1) The page is inactive, or a seldom used * active page. * -or- * 2) we force the issue. * * During heavy mmap/modification loads the pageout * daemon can really fragment the underlying file * due to flushing pages out of order and not trying * align the clusters (which leave sporatic out-of-order * holes). To solve this problem we do the reverse scan * first and attempt to align our cluster, then do a * forward scan if room remains. */ more: while (ib && pageout_count < vm_pageout_page_count) { vm_page_t p; if (ib > pindex) { ib = 0; break; } if ((p = vm_page_prev(pb)) == NULL || vm_page_busied(p)) { ib = 0; break; } vm_page_lock(p); vm_page_test_dirty(p); if (p->dirty == 0 || p->queue != PQ_INACTIVE || p->hold_count != 0) { /* may be undergoing I/O */ vm_page_unlock(p); ib = 0; break; } vm_page_unlock(p); mc[--page_base] = pb = p; ++pageout_count; ++ib; /* * alignment boundry, stop here and switch directions. Do * not clear ib. */ if ((pindex - (ib - 1)) % vm_pageout_page_count == 0) break; } while (pageout_count < vm_pageout_page_count && pindex + is < object->size) { vm_page_t p; if ((p = vm_page_next(ps)) == NULL || vm_page_busied(p)) break; vm_page_lock(p); vm_page_test_dirty(p); if (p->dirty == 0 || p->queue != PQ_INACTIVE || p->hold_count != 0) { /* may be undergoing I/O */ vm_page_unlock(p); break; } vm_page_unlock(p); mc[page_base + pageout_count] = ps = p; ++pageout_count; ++is; } /* * If we exhausted our forward scan, continue with the reverse scan * when possible, even past a page boundry. This catches boundry * conditions. */ if (ib && pageout_count < vm_pageout_page_count) goto more; /* * we allow reads during pageouts... */ return (vm_pageout_flush(&mc[page_base], pageout_count, 0, 0, NULL, NULL)); } /* * vm_pageout_flush() - launder the given pages * * The given pages are laundered. Note that we setup for the start of * I/O ( i.e. busy the page ), mark it read-only, and bump the object * reference count all in here rather then in the parent. If we want * the parent to do more sophisticated things we may have to change * the ordering. * * Returned runlen is the count of pages between mreq and first * page after mreq with status VM_PAGER_AGAIN. * *eio is set to TRUE if pager returned VM_PAGER_ERROR or VM_PAGER_FAIL * for any page in runlen set. */ int vm_pageout_flush(vm_page_t *mc, int count, int flags, int mreq, int *prunlen, boolean_t *eio) { vm_object_t object = mc[0]->object; int pageout_status[count]; int numpagedout = 0; int i, runlen; VM_OBJECT_ASSERT_WLOCKED(object); /* * Initiate I/O. Bump the vm_page_t->busy counter and * mark the pages read-only. * * We do not have to fixup the clean/dirty bits here... we can * allow the pager to do it after the I/O completes. * * NOTE! mc[i]->dirty may be partial or fragmented due to an * edge case with file fragments. */ for (i = 0; i < count; i++) { KASSERT(mc[i]->valid == VM_PAGE_BITS_ALL, ("vm_pageout_flush: partially invalid page %p index %d/%d", mc[i], i, count)); vm_page_sbusy(mc[i]); pmap_remove_write(mc[i]); } vm_object_pip_add(object, count); vm_pager_put_pages(object, mc, count, flags, pageout_status); runlen = count - mreq; if (eio != NULL) *eio = FALSE; for (i = 0; i < count; i++) { vm_page_t mt = mc[i]; KASSERT(pageout_status[i] == VM_PAGER_PEND || !pmap_page_is_write_mapped(mt), ("vm_pageout_flush: page %p is not write protected", mt)); switch (pageout_status[i]) { case VM_PAGER_OK: case VM_PAGER_PEND: numpagedout++; break; case VM_PAGER_BAD: /* * Page outside of range of object. Right now we * essentially lose the changes by pretending it * worked. */ vm_page_undirty(mt); break; case VM_PAGER_ERROR: case VM_PAGER_FAIL: /* * If page couldn't be paged out, then reactivate the * page so it doesn't clog the inactive list. (We * will try paging out it again later). */ vm_page_lock(mt); vm_page_activate(mt); vm_page_unlock(mt); if (eio != NULL && i >= mreq && i - mreq < runlen) *eio = TRUE; break; case VM_PAGER_AGAIN: if (i >= mreq && i - mreq < runlen) runlen = i - mreq; break; } /* * If the operation is still going, leave the page busy to * block all other accesses. Also, leave the paging in * progress indicator set so that we don't attempt an object * collapse. */ if (pageout_status[i] != VM_PAGER_PEND) { vm_object_pip_wakeup(object); vm_page_sunbusy(mt); if (vm_page_count_severe()) { vm_page_lock(mt); vm_page_try_to_cache(mt); vm_page_unlock(mt); } } } if (prunlen != NULL) *prunlen = runlen; return (numpagedout); } static boolean_t vm_pageout_launder(struct vm_pagequeue *pq, int tries, vm_paddr_t low, vm_paddr_t high) { struct mount *mp; struct vnode *vp; vm_object_t object; vm_paddr_t pa; vm_page_t m, m_tmp, next; int lockmode; vm_pagequeue_lock(pq); TAILQ_FOREACH_SAFE(m, &pq->pq_pl, plinks.q, next) { if ((m->flags & PG_MARKER) != 0) continue; pa = VM_PAGE_TO_PHYS(m); if (pa < low || pa + PAGE_SIZE > high) continue; if (!vm_pageout_page_lock(m, &next) || m->hold_count != 0) { vm_page_unlock(m); continue; } object = m->object; if ((!VM_OBJECT_TRYWLOCK(object) && (!vm_pageout_fallback_object_lock(m, &next) || m->hold_count != 0)) || vm_page_busied(m)) { vm_page_unlock(m); VM_OBJECT_WUNLOCK(object); continue; } vm_page_test_dirty(m); if (m->dirty == 0 && object->ref_count != 0) pmap_remove_all(m); if (m->dirty != 0) { vm_page_unlock(m); if (tries == 0 || (object->flags & OBJ_DEAD) != 0) { VM_OBJECT_WUNLOCK(object); continue; } if (object->type == OBJT_VNODE) { vm_pagequeue_unlock(pq); vp = object->handle; vm_object_reference_locked(object); VM_OBJECT_WUNLOCK(object); (void)vn_start_write(vp, &mp, V_WAIT); lockmode = MNT_SHARED_WRITES(vp->v_mount) ? LK_SHARED : LK_EXCLUSIVE; vn_lock(vp, lockmode | LK_RETRY); VM_OBJECT_WLOCK(object); vm_object_page_clean(object, 0, 0, OBJPC_SYNC); VM_OBJECT_WUNLOCK(object); VOP_UNLOCK(vp, 0); vm_object_deallocate(object); vn_finished_write(mp); return (TRUE); } else if (object->type == OBJT_SWAP || object->type == OBJT_DEFAULT) { vm_pagequeue_unlock(pq); m_tmp = m; vm_pageout_flush(&m_tmp, 1, VM_PAGER_PUT_SYNC, 0, NULL, NULL); VM_OBJECT_WUNLOCK(object); return (TRUE); } } else { /* * Dequeue here to prevent lock recursion in * vm_page_cache(). */ vm_page_dequeue_locked(m); vm_page_cache(m); vm_page_unlock(m); } VM_OBJECT_WUNLOCK(object); } vm_pagequeue_unlock(pq); return (FALSE); } /* * Increase the number of cached pages. The specified value, "tries", * determines which categories of pages are cached: * * 0: All clean, inactive pages within the specified physical address range * are cached. Will not sleep. * 1: The vm_lowmem handlers are called. All inactive pages within * the specified physical address range are cached. May sleep. * 2: The vm_lowmem handlers are called. All inactive and active pages * within the specified physical address range are cached. May sleep. */ void vm_pageout_grow_cache(int tries, vm_paddr_t low, vm_paddr_t high) { int actl, actmax, inactl, inactmax, dom, initial_dom; static int start_dom = 0; if (tries > 0) { /* * Decrease registered cache sizes. The vm_lowmem handlers * may acquire locks and/or sleep, so they can only be invoked * when "tries" is greater than zero. */ SDT_PROBE0(vm, , , vm__lowmem_cache); EVENTHANDLER_INVOKE(vm_lowmem, 0); /* * We do this explicitly after the caches have been drained * above. */ uma_reclaim(); } /* * Make the next scan start on the next domain. */ initial_dom = atomic_fetchadd_int(&start_dom, 1) % vm_ndomains; inactl = 0; inactmax = vm_cnt.v_inactive_count; actl = 0; actmax = tries < 2 ? 0 : vm_cnt.v_active_count; dom = initial_dom; /* * Scan domains in round-robin order, first inactive queues, * then active. Since domain usually owns large physically * contiguous chunk of memory, it makes sense to completely * exhaust one domain before switching to next, while growing * the pool of contiguous physical pages. * * Do not even start launder a domain which cannot contain * the specified address range, as indicated by segments * constituting the domain. */ again: if (inactl < inactmax) { if (vm_phys_domain_intersects(vm_dom[dom].vmd_segs, low, high) && vm_pageout_launder(&vm_dom[dom].vmd_pagequeues[PQ_INACTIVE], tries, low, high)) { inactl++; goto again; } if (++dom == vm_ndomains) dom = 0; if (dom != initial_dom) goto again; } if (actl < actmax) { if (vm_phys_domain_intersects(vm_dom[dom].vmd_segs, low, high) && vm_pageout_launder(&vm_dom[dom].vmd_pagequeues[PQ_ACTIVE], tries, low, high)) { actl++; goto again; } if (++dom == vm_ndomains) dom = 0; if (dom != initial_dom) goto again; } } #if !defined(NO_SWAPPING) /* * vm_pageout_object_deactivate_pages * * Deactivate enough pages to satisfy the inactive target * requirements. * * The object and map must be locked. */ static void vm_pageout_object_deactivate_pages(pmap_t pmap, vm_object_t first_object, long desired) { vm_object_t backing_object, object; vm_page_t p; int act_delta, remove_mode; VM_OBJECT_ASSERT_LOCKED(first_object); if ((first_object->flags & OBJ_FICTITIOUS) != 0) return; for (object = first_object;; object = backing_object) { if (pmap_resident_count(pmap) <= desired) goto unlock_return; VM_OBJECT_ASSERT_LOCKED(object); if ((object->flags & OBJ_UNMANAGED) != 0 || object->paging_in_progress != 0) goto unlock_return; remove_mode = 0; if (object->shadow_count > 1) remove_mode = 1; /* * Scan the object's entire memory queue. */ TAILQ_FOREACH(p, &object->memq, listq) { if (pmap_resident_count(pmap) <= desired) goto unlock_return; if (vm_page_busied(p)) continue; PCPU_INC(cnt.v_pdpages); vm_page_lock(p); if (p->wire_count != 0 || p->hold_count != 0 || !pmap_page_exists_quick(pmap, p)) { vm_page_unlock(p); continue; } act_delta = pmap_ts_referenced(p); if ((p->aflags & PGA_REFERENCED) != 0) { if (act_delta == 0) act_delta = 1; vm_page_aflag_clear(p, PGA_REFERENCED); } if (p->queue != PQ_ACTIVE && act_delta != 0) { vm_page_activate(p); p->act_count += act_delta; } else if (p->queue == PQ_ACTIVE) { if (act_delta == 0) { p->act_count -= min(p->act_count, ACT_DECLINE); if (!remove_mode && p->act_count == 0) { pmap_remove_all(p); vm_page_deactivate(p); } else vm_page_requeue(p); } else { vm_page_activate(p); if (p->act_count < ACT_MAX - ACT_ADVANCE) p->act_count += ACT_ADVANCE; vm_page_requeue(p); } } else if (p->queue == PQ_INACTIVE) pmap_remove_all(p); vm_page_unlock(p); } if ((backing_object = object->backing_object) == NULL) goto unlock_return; VM_OBJECT_RLOCK(backing_object); if (object != first_object) VM_OBJECT_RUNLOCK(object); } unlock_return: if (object != first_object) VM_OBJECT_RUNLOCK(object); } /* * deactivate some number of pages in a map, try to do it fairly, but * that is really hard to do. */ static void vm_pageout_map_deactivate_pages(map, desired) vm_map_t map; long desired; { vm_map_entry_t tmpe; vm_object_t obj, bigobj; int nothingwired; if (!vm_map_trylock(map)) return; bigobj = NULL; nothingwired = TRUE; /* * first, search out the biggest object, and try to free pages from * that. */ tmpe = map->header.next; while (tmpe != &map->header) { if ((tmpe->eflags & MAP_ENTRY_IS_SUB_MAP) == 0) { obj = tmpe->object.vm_object; if (obj != NULL && VM_OBJECT_TRYRLOCK(obj)) { if (obj->shadow_count <= 1 && (bigobj == NULL || bigobj->resident_page_count < obj->resident_page_count)) { if (bigobj != NULL) VM_OBJECT_RUNLOCK(bigobj); bigobj = obj; } else VM_OBJECT_RUNLOCK(obj); } } if (tmpe->wired_count > 0) nothingwired = FALSE; tmpe = tmpe->next; } if (bigobj != NULL) { vm_pageout_object_deactivate_pages(map->pmap, bigobj, desired); VM_OBJECT_RUNLOCK(bigobj); } /* * Next, hunt around for other pages to deactivate. We actually * do this search sort of wrong -- .text first is not the best idea. */ tmpe = map->header.next; while (tmpe != &map->header) { if (pmap_resident_count(vm_map_pmap(map)) <= desired) break; if ((tmpe->eflags & MAP_ENTRY_IS_SUB_MAP) == 0) { obj = tmpe->object.vm_object; if (obj != NULL) { VM_OBJECT_RLOCK(obj); vm_pageout_object_deactivate_pages(map->pmap, obj, desired); VM_OBJECT_RUNLOCK(obj); } } tmpe = tmpe->next; } /* * Remove all mappings if a process is swapped out, this will free page * table pages. */ if (desired == 0 && nothingwired) { pmap_remove(vm_map_pmap(map), vm_map_min(map), vm_map_max(map)); } vm_map_unlock(map); } #endif /* !defined(NO_SWAPPING) */ /* * Attempt to acquire all of the necessary locks to launder a page and * then call through the clustering layer to PUTPAGES. Wait a short * time for a vnode lock. * * Requires the page and object lock on entry, releases both before return. * Returns 0 on success and an errno otherwise. */ static int vm_pageout_clean(vm_page_t m) { struct vnode *vp; struct mount *mp; vm_object_t object; vm_pindex_t pindex; int error, lockmode; vm_page_assert_locked(m); object = m->object; VM_OBJECT_ASSERT_WLOCKED(object); error = 0; vp = NULL; mp = NULL; /* * The object is already known NOT to be dead. It * is possible for the vget() to block the whole * pageout daemon, but the new low-memory handling * code should prevent it. * * We can't wait forever for the vnode lock, we might * deadlock due to a vn_read() getting stuck in * vm_wait while holding this vnode. We skip the * vnode if we can't get it in a reasonable amount * of time. */ if (object->type == OBJT_VNODE) { vm_page_unlock(m); vp = object->handle; if (vp->v_type == VREG && vn_start_write(vp, &mp, V_NOWAIT) != 0) { mp = NULL; error = EDEADLK; goto unlock_all; } KASSERT(mp != NULL, ("vp %p with NULL v_mount", vp)); vm_object_reference_locked(object); pindex = m->pindex; VM_OBJECT_WUNLOCK(object); lockmode = MNT_SHARED_WRITES(vp->v_mount) ? LK_SHARED : LK_EXCLUSIVE; if (vget(vp, lockmode | LK_TIMELOCK, curthread)) { vp = NULL; error = EDEADLK; goto unlock_mp; } VM_OBJECT_WLOCK(object); vm_page_lock(m); /* * While the object and page were unlocked, the page * may have been: * (1) moved to a different queue, * (2) reallocated to a different object, * (3) reallocated to a different offset, or * (4) cleaned. */ if (m->queue != PQ_INACTIVE || m->object != object || m->pindex != pindex || m->dirty == 0) { vm_page_unlock(m); error = ENXIO; goto unlock_all; } /* * The page may have been busied or held while the object * and page locks were released. */ if (vm_page_busied(m) || m->hold_count != 0) { vm_page_unlock(m); error = EBUSY; goto unlock_all; } } /* * If a page is dirty, then it is either being washed * (but not yet cleaned) or it is still in the * laundry. If it is still in the laundry, then we * start the cleaning operation. */ if (vm_pageout_cluster(m) == 0) error = EIO; unlock_all: VM_OBJECT_WUNLOCK(object); unlock_mp: vm_page_lock_assert(m, MA_NOTOWNED); if (mp != NULL) { if (vp != NULL) vput(vp); vm_object_deallocate(object); vn_finished_write(mp); } return (error); } /* * vm_pageout_scan does the dirty work for the pageout daemon. * * pass 0 - Update active LRU/deactivate pages * pass 1 - Move inactive to cache or free * pass 2 - Launder dirty pages */ static void vm_pageout_scan(struct vm_domain *vmd, int pass) { vm_page_t m, next; struct vm_pagequeue *pq; vm_object_t object; int act_delta, addl_page_shortage, deficit, maxscan, page_shortage; int vnodes_skipped = 0; int maxlaunder; boolean_t queues_locked; /* * If we need to reclaim memory ask kernel caches to return * some. We rate limit to avoid thrashing. */ if (vmd == &vm_dom[0] && pass > 0 && (ticks - lowmem_ticks) / hz >= lowmem_period) { /* * Decrease registered cache sizes. */ SDT_PROBE0(vm, , , vm__lowmem_scan); EVENTHANDLER_INVOKE(vm_lowmem, 0); /* * We do this explicitly after the caches have been * drained above. */ uma_reclaim(); lowmem_ticks = ticks; } /* * The addl_page_shortage is the number of temporarily * stuck pages in the inactive queue. In other words, the * number of pages from the inactive count that should be * discounted in setting the target for the active queue scan. */ addl_page_shortage = 0; /* * Calculate the number of pages we want to either free or move * to the cache. */ if (pass > 0) { deficit = atomic_readandclear_int(&vm_pageout_deficit); page_shortage = vm_paging_target() + deficit; } else page_shortage = deficit = 0; /* * maxlaunder limits the number of dirty pages we flush per scan. * For most systems a smaller value (16 or 32) is more robust under * extreme memory and disk pressure because any unnecessary writes * to disk can result in extreme performance degredation. However, * systems with excessive dirty pages (especially when MAP_NOSYNC is * used) will die horribly with limited laundering. If the pageout * daemon cannot clean enough pages in the first pass, we let it go * all out in succeeding passes. */ if ((maxlaunder = vm_max_launder) <= 1) maxlaunder = 1; if (pass > 1) maxlaunder = 10000; /* * Start scanning the inactive queue for pages we can move to the * cache or free. The scan will stop when the target is reached or * we have scanned the entire inactive queue. Note that m->act_count * is not used to form decisions for the inactive queue, only for the * active queue. */ pq = &vmd->vmd_pagequeues[PQ_INACTIVE]; maxscan = pq->pq_cnt; vm_pagequeue_lock(pq); queues_locked = TRUE; for (m = TAILQ_FIRST(&pq->pq_pl); m != NULL && maxscan-- > 0 && page_shortage > 0; m = next) { vm_pagequeue_assert_locked(pq); KASSERT(queues_locked, ("unlocked queues")); KASSERT(m->queue == PQ_INACTIVE, ("Inactive queue %p", m)); PCPU_INC(cnt.v_pdpages); next = TAILQ_NEXT(m, plinks.q); /* * skip marker pages */ if (m->flags & PG_MARKER) continue; KASSERT((m->flags & PG_FICTITIOUS) == 0, ("Fictitious page %p cannot be in inactive queue", m)); KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("Unmanaged page %p cannot be in inactive queue", m)); /* * The page or object lock acquisitions fail if the * page was removed from the queue or moved to a * different position within the queue. In either * case, addl_page_shortage should not be incremented. */ if (!vm_pageout_page_lock(m, &next)) { vm_page_unlock(m); continue; } object = m->object; if (!VM_OBJECT_TRYWLOCK(object) && !vm_pageout_fallback_object_lock(m, &next)) { vm_page_unlock(m); VM_OBJECT_WUNLOCK(object); continue; } /* * Don't mess with busy pages, keep them at at the * front of the queue, most likely they are being * paged out. Increment addl_page_shortage for busy * pages, because they may leave the inactive queue * shortly after page scan is finished. */ if (vm_page_busied(m)) { vm_page_unlock(m); VM_OBJECT_WUNLOCK(object); addl_page_shortage++; continue; } /* * We unlock the inactive page queue, invalidating the * 'next' pointer. Use our marker to remember our * place. */ TAILQ_INSERT_AFTER(&pq->pq_pl, m, &vmd->vmd_marker, plinks.q); vm_pagequeue_unlock(pq); queues_locked = FALSE; /* * We bump the activation count if the page has been * referenced while in the inactive queue. This makes * it less likely that the page will be added back to the * inactive queue prematurely again. Here we check the * page tables (or emulated bits, if any), given the upper * level VM system not knowing anything about existing * references. */ if ((m->aflags & PGA_REFERENCED) != 0) { vm_page_aflag_clear(m, PGA_REFERENCED); act_delta = 1; } else act_delta = 0; if (object->ref_count != 0) { act_delta += pmap_ts_referenced(m); } else { KASSERT(!pmap_page_is_mapped(m), ("vm_pageout_scan: page %p is mapped", m)); } /* * If the upper level VM system knows about any page * references, we reactivate the page or requeue it. */ if (act_delta != 0) { if (object->ref_count != 0) { vm_page_activate(m); m->act_count += act_delta + ACT_ADVANCE; } else { vm_pagequeue_lock(pq); queues_locked = TRUE; vm_page_requeue_locked(m); } VM_OBJECT_WUNLOCK(object); vm_page_unlock(m); goto relock_queues; } if (m->hold_count != 0) { vm_page_unlock(m); VM_OBJECT_WUNLOCK(object); /* * Held pages are essentially stuck in the * queue. So, they ought to be discounted * from the inactive count. See the * calculation of the page_shortage for the * loop over the active queue below. */ addl_page_shortage++; goto relock_queues; } /* * If the page appears to be clean at the machine-independent * layer, then remove all of its mappings from the pmap in * anticipation of placing it onto the cache queue. If, * however, any of the page's mappings allow write access, * then the page may still be modified until the last of those * mappings are removed. */ vm_page_test_dirty(m); if (m->dirty == 0 && object->ref_count != 0) pmap_remove_all(m); if (m->valid == 0) { /* * Invalid pages can be easily freed */ vm_page_free(m); PCPU_INC(cnt.v_dfree); --page_shortage; } else if (m->dirty == 0) { /* * Clean pages can be placed onto the cache queue. * This effectively frees them. */ vm_page_cache(m); --page_shortage; } else if ((m->flags & PG_WINATCFLS) == 0 && pass < 2) { /* * Dirty pages need to be paged out, but flushing * a page is extremely expensive versus freeing * a clean page. Rather then artificially limiting * the number of pages we can flush, we instead give * dirty pages extra priority on the inactive queue * by forcing them to be cycled through the queue * twice before being flushed, after which the * (now clean) page will cycle through once more * before being freed. This significantly extends * the thrash point for a heavily loaded machine. */ m->flags |= PG_WINATCFLS; vm_pagequeue_lock(pq); queues_locked = TRUE; vm_page_requeue_locked(m); } else if (maxlaunder > 0) { /* * We always want to try to flush some dirty pages if * we encounter them, to keep the system stable. * Normally this number is small, but under extreme * pressure where there are insufficient clean pages * on the inactive queue, we may have to go all out. */ int swap_pageouts_ok; int error; if ((object->type != OBJT_SWAP) && (object->type != OBJT_DEFAULT)) { swap_pageouts_ok = 1; } else { swap_pageouts_ok = !(defer_swap_pageouts || disable_swap_pageouts); swap_pageouts_ok |= (!disable_swap_pageouts && defer_swap_pageouts && vm_page_count_min()); } /* * We don't bother paging objects that are "dead". * Those objects are in a "rundown" state. */ if (!swap_pageouts_ok || (object->flags & OBJ_DEAD)) { vm_pagequeue_lock(pq); vm_page_unlock(m); VM_OBJECT_WUNLOCK(object); queues_locked = TRUE; vm_page_requeue_locked(m); goto relock_queues; } error = vm_pageout_clean(m); /* * Decrement page_shortage on success to account for * the (future) cleaned page. Otherwise we could wind * up laundering or cleaning too many pages. */ if (error == 0) { page_shortage--; maxlaunder--; } else if (error == EDEADLK) { pageout_lock_miss++; vnodes_skipped++; } else if (error == EBUSY) { addl_page_shortage++; } vm_page_lock_assert(m, MA_NOTOWNED); goto relock_queues; } vm_page_unlock(m); VM_OBJECT_WUNLOCK(object); relock_queues: if (!queues_locked) { vm_pagequeue_lock(pq); queues_locked = TRUE; } next = TAILQ_NEXT(&vmd->vmd_marker, plinks.q); TAILQ_REMOVE(&pq->pq_pl, &vmd->vmd_marker, plinks.q); } vm_pagequeue_unlock(pq); #if !defined(NO_SWAPPING) /* * Wakeup the swapout daemon if we didn't cache or free the targeted * number of pages. */ if (vm_swap_enabled && page_shortage > 0) vm_req_vmdaemon(VM_SWAP_NORMAL); #endif /* * Wakeup the sync daemon if we skipped a vnode in a writeable object * and we didn't cache or free enough pages. */ if (vnodes_skipped > 0 && page_shortage > vm_cnt.v_free_target - vm_cnt.v_free_min) (void)speedup_syncer(); /* * Compute the number of pages we want to try to move from the * active queue to the inactive queue. */ page_shortage = vm_cnt.v_inactive_target - vm_cnt.v_inactive_count + vm_paging_target() + deficit + addl_page_shortage; pq = &vmd->vmd_pagequeues[PQ_ACTIVE]; vm_pagequeue_lock(pq); maxscan = pq->pq_cnt; /* * If we're just idle polling attempt to visit every * active page within 'update_period' seconds. */ if (pass == 0 && vm_pageout_update_period != 0) { maxscan /= vm_pageout_update_period; page_shortage = maxscan; } /* * Scan the active queue for things we can deactivate. We nominally * track the per-page activity counter and use it to locate * deactivation candidates. */ m = TAILQ_FIRST(&pq->pq_pl); while (m != NULL && maxscan-- > 0 && page_shortage > 0) { KASSERT(m->queue == PQ_ACTIVE, ("vm_pageout_scan: page %p isn't active", m)); next = TAILQ_NEXT(m, plinks.q); if ((m->flags & PG_MARKER) != 0) { m = next; continue; } KASSERT((m->flags & PG_FICTITIOUS) == 0, ("Fictitious page %p cannot be in active queue", m)); KASSERT((m->oflags & VPO_UNMANAGED) == 0, ("Unmanaged page %p cannot be in active queue", m)); if (!vm_pageout_page_lock(m, &next)) { vm_page_unlock(m); m = next; continue; } /* * The count for pagedaemon pages is done after checking the * page for eligibility... */ PCPU_INC(cnt.v_pdpages); /* * Check to see "how much" the page has been used. */ if ((m->aflags & PGA_REFERENCED) != 0) { vm_page_aflag_clear(m, PGA_REFERENCED); act_delta = 1; } else act_delta = 0; /* * Unlocked object ref count check. Two races are possible. * 1) The ref was transitioning to zero and we saw non-zero, * the pmap bits will be checked unnecessarily. * 2) The ref was transitioning to one and we saw zero. * The page lock prevents a new reference to this page so * we need not check the reference bits. */ if (m->object->ref_count != 0) act_delta += pmap_ts_referenced(m); /* * Advance or decay the act_count based on recent usage. */ if (act_delta != 0) { m->act_count += ACT_ADVANCE + act_delta; if (m->act_count > ACT_MAX) m->act_count = ACT_MAX; } else m->act_count -= min(m->act_count, ACT_DECLINE); /* * Move this page to the tail of the active or inactive * queue depending on usage. */ if (m->act_count == 0) { /* Dequeue to avoid later lock recursion. */ vm_page_dequeue_locked(m); vm_page_deactivate(m); page_shortage--; } else vm_page_requeue_locked(m); vm_page_unlock(m); m = next; } vm_pagequeue_unlock(pq); #if !defined(NO_SWAPPING) /* * Idle process swapout -- run once per second. */ if (vm_swap_idle_enabled) { static long lsec; if (time_second != lsec) { vm_req_vmdaemon(VM_SWAP_IDLE); lsec = time_second; } } #endif /* * If we are critically low on one of RAM or swap and low on * the other, kill the largest process. However, we avoid * doing this on the first pass in order to give ourselves a * chance to flush out dirty vnode-backed pages and to allow * active pages to be moved to the inactive queue and reclaimed. */ vm_pageout_mightbe_oom(vmd, pass); } static int vm_pageout_oom_vote; /* * The pagedaemon threads randlomly select one to perform the * OOM. Trying to kill processes before all pagedaemons * failed to reach free target is premature. */ static void vm_pageout_mightbe_oom(struct vm_domain *vmd, int pass) { int old_vote; if (pass <= 1 || !((swap_pager_avail < 64 && vm_page_count_min()) || (swap_pager_full && vm_paging_target() > 0))) { if (vmd->vmd_oom) { vmd->vmd_oom = FALSE; atomic_subtract_int(&vm_pageout_oom_vote, 1); } return; } if (vmd->vmd_oom) return; vmd->vmd_oom = TRUE; old_vote = atomic_fetchadd_int(&vm_pageout_oom_vote, 1); if (old_vote != vm_ndomains - 1) return; /* * The current pagedaemon thread is the last in the quorum to * start OOM. Initiate the selection and signaling of the * victim. */ vm_pageout_oom(VM_OOM_MEM); /* * After one round of OOM terror, recall our vote. On the * next pass, current pagedaemon would vote again if the low * memory condition is still there, due to vmd_oom being * false. */ vmd->vmd_oom = FALSE; atomic_subtract_int(&vm_pageout_oom_vote, 1); } void vm_pageout_oom(int shortage) { struct proc *p, *bigproc; vm_offset_t size, bigsize; struct thread *td; struct vmspace *vm; /* * We keep the process bigproc locked once we find it to keep anyone * from messing with it; however, there is a possibility of * deadlock if process B is bigproc and one of it's child processes * attempts to propagate a signal to B while we are waiting for A's * lock while walking this list. To avoid this, we don't block on * the process lock but just skip a process if it is already locked. */ bigproc = NULL; bigsize = 0; sx_slock(&allproc_lock); FOREACH_PROC_IN_SYSTEM(p) { int breakout; PROC_LOCK(p); /* * If this is a system, protected or killed process, skip it. */ if (p->p_state != PRS_NORMAL || (p->p_flag & (P_INEXEC | P_PROTECTED | P_SYSTEM | P_WEXIT)) != 0 || p->p_pid == 1 || P_KILLED(p) || (p->p_pid < 48 && swap_pager_avail != 0)) { PROC_UNLOCK(p); continue; } /* * If the process is in a non-running type state, * don't touch it. Check all the threads individually. */ breakout = 0; FOREACH_THREAD_IN_PROC(p, td) { thread_lock(td); if (!TD_ON_RUNQ(td) && !TD_IS_RUNNING(td) && !TD_IS_SLEEPING(td) && !TD_IS_SUSPENDED(td)) { thread_unlock(td); breakout = 1; break; } thread_unlock(td); } if (breakout) { PROC_UNLOCK(p); continue; } /* * get the process size */ vm = vmspace_acquire_ref(p); if (vm == NULL) { PROC_UNLOCK(p); continue; } _PHOLD(p); if (!vm_map_trylock_read(&vm->vm_map)) { _PRELE(p); PROC_UNLOCK(p); vmspace_free(vm); continue; } PROC_UNLOCK(p); size = vmspace_swap_count(vm); vm_map_unlock_read(&vm->vm_map); if (shortage == VM_OOM_MEM) size += vmspace_resident_count(vm); vmspace_free(vm); /* * if the this process is bigger than the biggest one * remember it. */ if (size > bigsize) { if (bigproc != NULL) PRELE(bigproc); bigproc = p; bigsize = size; } else { PRELE(p); } } sx_sunlock(&allproc_lock); if (bigproc != NULL) { if (vm_panic_on_oom != 0) panic("out of swap space"); PROC_LOCK(bigproc); killproc(bigproc, "out of swap space"); sched_nice(bigproc, PRIO_MIN); _PRELE(bigproc); PROC_UNLOCK(bigproc); wakeup(&vm_cnt.v_free_count); } } static void vm_pageout_worker(void *arg) { struct vm_domain *domain; int domidx; domidx = (uintptr_t)arg; domain = &vm_dom[domidx]; /* * XXXKIB It could be useful to bind pageout daemon threads to * the cores belonging to the domain, from which vm_page_array * is allocated. */ KASSERT(domain->vmd_segs != 0, ("domain without segments")); vm_pageout_init_marker(&domain->vmd_marker, PQ_INACTIVE); /* * The pageout daemon worker is never done, so loop forever. */ while (TRUE) { /* * If we have enough free memory, wakeup waiters. Do * not clear vm_pages_needed until we reach our target, * otherwise we may be woken up over and over again and * waste a lot of cpu. */ mtx_lock(&vm_page_queue_free_mtx); if (vm_pages_needed && !vm_page_count_min()) { if (!vm_paging_needed()) vm_pages_needed = 0; wakeup(&vm_cnt.v_free_count); } if (vm_pages_needed) { /* * Still not done, take a second pass without waiting * (unlimited dirty cleaning), otherwise sleep a bit * and try again. */ if (domain->vmd_pass > 1) msleep(&vm_pages_needed, &vm_page_queue_free_mtx, PVM, "psleep", hz / 2); } else { /* * Good enough, sleep until required to refresh * stats. */ domain->vmd_pass = 0; msleep(&vm_pages_needed, &vm_page_queue_free_mtx, PVM, "psleep", hz); } if (vm_pages_needed) { vm_cnt.v_pdwakeups++; domain->vmd_pass++; } mtx_unlock(&vm_page_queue_free_mtx); vm_pageout_scan(domain, domain->vmd_pass); } } /* * vm_pageout_init initialises basic pageout daemon settings. */ static void vm_pageout_init(void) { /* * Initialize some paging parameters. */ vm_cnt.v_interrupt_free_min = 2; if (vm_cnt.v_page_count < 2000) vm_pageout_page_count = 8; /* * v_free_reserved needs to include enough for the largest * swap pager structures plus enough for any pv_entry structs * when paging. */ if (vm_cnt.v_page_count > 1024) vm_cnt.v_free_min = 4 + (vm_cnt.v_page_count - 1024) / 200; else vm_cnt.v_free_min = 4; vm_cnt.v_pageout_free_min = (2*MAXBSIZE)/PAGE_SIZE + vm_cnt.v_interrupt_free_min; vm_cnt.v_free_reserved = vm_pageout_page_count + vm_cnt.v_pageout_free_min + (vm_cnt.v_page_count / 768); vm_cnt.v_free_severe = vm_cnt.v_free_min / 2; vm_cnt.v_free_target = 4 * vm_cnt.v_free_min + vm_cnt.v_free_reserved; vm_cnt.v_free_min += vm_cnt.v_free_reserved; vm_cnt.v_free_severe += vm_cnt.v_free_reserved; vm_cnt.v_inactive_target = (3 * vm_cnt.v_free_target) / 2; if (vm_cnt.v_inactive_target > vm_cnt.v_free_count / 3) vm_cnt.v_inactive_target = vm_cnt.v_free_count / 3; /* * Set the default wakeup threshold to be 10% above the minimum * page limit. This keeps the steady state out of shortfall. */ vm_pageout_wakeup_thresh = (vm_cnt.v_free_min / 10) * 11; /* * Set interval in seconds for active scan. We want to visit each * page at least once every ten minutes. This is to prevent worst * case paging behaviors with stale active LRU. */ if (vm_pageout_update_period == 0) vm_pageout_update_period = 600; /* XXX does not really belong here */ if (vm_page_max_wired == 0) vm_page_max_wired = vm_cnt.v_free_count / 3; } /* * vm_pageout is the high level pageout daemon. */ static void vm_pageout(void) { + int error; #if MAXMEMDOM > 1 - int error, i; + int i; #endif swap_pager_swap_init(); #if MAXMEMDOM > 1 for (i = 1; i < vm_ndomains; i++) { error = kthread_add(vm_pageout_worker, (void *)(uintptr_t)i, curproc, NULL, 0, 0, "dom%d", i); if (error != 0) { panic("starting pageout for domain %d, error %d\n", i, error); } } #endif + error = kthread_add(uma_reclaim_worker, NULL, curproc, NULL, + 0, 0, "uma"); + if (error != 0) + panic("starting uma_reclaim helper, error %d\n", error); vm_pageout_worker((void *)(uintptr_t)0); } /* * Unless the free page queue lock is held by the caller, this function * should be regarded as advisory. Specifically, the caller should * not msleep() on &vm_cnt.v_free_count following this function unless * the free page queue lock is held until the msleep() is performed. */ void pagedaemon_wakeup(void) { if (!vm_pages_needed && curthread->td_proc != pageproc) { vm_pages_needed = 1; wakeup(&vm_pages_needed); } } #if !defined(NO_SWAPPING) static void vm_req_vmdaemon(int req) { static int lastrun = 0; mtx_lock(&vm_daemon_mtx); vm_pageout_req_swapout |= req; if ((ticks > (lastrun + hz)) || (ticks < lastrun)) { wakeup(&vm_daemon_needed); lastrun = ticks; } mtx_unlock(&vm_daemon_mtx); } static void vm_daemon(void) { struct rlimit rsslim; struct proc *p; struct thread *td; struct vmspace *vm; int breakout, swapout_flags, tryagain, attempts; #ifdef RACCT uint64_t rsize, ravailable; #endif while (TRUE) { mtx_lock(&vm_daemon_mtx); msleep(&vm_daemon_needed, &vm_daemon_mtx, PPAUSE, "psleep", #ifdef RACCT racct_enable ? hz : 0 #else 0 #endif ); swapout_flags = vm_pageout_req_swapout; vm_pageout_req_swapout = 0; mtx_unlock(&vm_daemon_mtx); if (swapout_flags) swapout_procs(swapout_flags); /* * scan the processes for exceeding their rlimits or if * process is swapped out -- deactivate pages */ tryagain = 0; attempts = 0; again: attempts++; sx_slock(&allproc_lock); FOREACH_PROC_IN_SYSTEM(p) { vm_pindex_t limit, size; /* * if this is a system process or if we have already * looked at this process, skip it. */ PROC_LOCK(p); if (p->p_state != PRS_NORMAL || p->p_flag & (P_INEXEC | P_SYSTEM | P_WEXIT)) { PROC_UNLOCK(p); continue; } /* * if the process is in a non-running type state, * don't touch it. */ breakout = 0; FOREACH_THREAD_IN_PROC(p, td) { thread_lock(td); if (!TD_ON_RUNQ(td) && !TD_IS_RUNNING(td) && !TD_IS_SLEEPING(td) && !TD_IS_SUSPENDED(td)) { thread_unlock(td); breakout = 1; break; } thread_unlock(td); } if (breakout) { PROC_UNLOCK(p); continue; } /* * get a limit */ lim_rlimit(p, RLIMIT_RSS, &rsslim); limit = OFF_TO_IDX( qmin(rsslim.rlim_cur, rsslim.rlim_max)); /* * let processes that are swapped out really be * swapped out set the limit to nothing (will force a * swap-out.) */ if ((p->p_flag & P_INMEM) == 0) limit = 0; /* XXX */ vm = vmspace_acquire_ref(p); PROC_UNLOCK(p); if (vm == NULL) continue; size = vmspace_resident_count(vm); if (size >= limit) { vm_pageout_map_deactivate_pages( &vm->vm_map, limit); } #ifdef RACCT if (racct_enable) { rsize = IDX_TO_OFF(size); PROC_LOCK(p); racct_set(p, RACCT_RSS, rsize); ravailable = racct_get_available(p, RACCT_RSS); PROC_UNLOCK(p); if (rsize > ravailable) { /* * Don't be overly aggressive; this * might be an innocent process, * and the limit could've been exceeded * by some memory hog. Don't try * to deactivate more than 1/4th * of process' resident set size. */ if (attempts <= 8) { if (ravailable < rsize - (rsize / 4)) { ravailable = rsize - (rsize / 4); } } vm_pageout_map_deactivate_pages( &vm->vm_map, OFF_TO_IDX(ravailable)); /* Update RSS usage after paging out. */ size = vmspace_resident_count(vm); rsize = IDX_TO_OFF(size); PROC_LOCK(p); racct_set(p, RACCT_RSS, rsize); PROC_UNLOCK(p); if (rsize > ravailable) tryagain = 1; } } #endif vmspace_free(vm); } sx_sunlock(&allproc_lock); if (tryagain != 0 && attempts <= 10) goto again; } } #endif /* !defined(NO_SWAPPING) */ Index: projects/release-arm-redux/sys/x86/include/acpica_machdep.h =================================================================== --- projects/release-arm-redux/sys/x86/include/acpica_machdep.h (revision 282691) +++ projects/release-arm-redux/sys/x86/include/acpica_machdep.h (revision 282692) @@ -1,87 +1,88 @@ /*- * Copyright (c) 2002 Mitsuru IWASAKI * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ /****************************************************************************** * * Name: acpica_machdep.h - arch-specific defines, etc. * $Revision$ * *****************************************************************************/ #ifndef __ACPICA_MACHDEP_H__ #define __ACPICA_MACHDEP_H__ #ifdef _KERNEL /* * Calling conventions: * * ACPI_SYSTEM_XFACE - Interfaces to host OS (handlers, threads) * ACPI_EXTERNAL_XFACE - External ACPI interfaces * ACPI_INTERNAL_XFACE - Internal ACPI interfaces * ACPI_INTERNAL_VAR_XFACE - Internal variable-parameter list interfaces */ #define ACPI_SYSTEM_XFACE #define ACPI_EXTERNAL_XFACE #define ACPI_INTERNAL_XFACE #define ACPI_INTERNAL_VAR_XFACE /* Asm macros */ #define ACPI_ASM_MACROS #define BREAKPOINT3 #define ACPI_DISABLE_IRQS() disable_intr() #define ACPI_ENABLE_IRQS() enable_intr() #define ACPI_FLUSH_CPU_CACHE() wbinvd() /* Section 5.2.10.1: global lock acquire/release functions */ int acpi_acquire_global_lock(volatile uint32_t *); int acpi_release_global_lock(volatile uint32_t *); #define ACPI_ACQUIRE_GLOBAL_LOCK(GLptr, Acq) do { \ (Acq) = acpi_acquire_global_lock(&((GLptr)->GlobalLock)); \ } while (0) #define ACPI_RELEASE_GLOBAL_LOCK(GLptr, Acq) do { \ (Acq) = acpi_release_global_lock(&((GLptr)->GlobalLock)); \ } while (0) enum intr_trigger; enum intr_polarity; void acpi_SetDefaultIntrModel(int model); void acpi_cpu_c1(void); +void acpi_cpu_idle_mwait(uint32_t mwait_hint); void *acpi_map_table(vm_paddr_t pa, const char *sig); void acpi_unmap_table(void *table); vm_paddr_t acpi_find_table(const char *sig); void madt_parse_interrupt_values(void *entry, enum intr_trigger *trig, enum intr_polarity *pol); extern int madt_found_sci_override; #endif /* _KERNEL */ #endif /* __ACPICA_MACHDEP_H__ */ Property changes on: projects/release-arm-redux/sys/x86/include/acpica_machdep.h ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head/sys/x86/include/acpica_machdep.h:r282673-282691 Index: projects/release-arm-redux/sys/x86/include/specialreg.h =================================================================== --- projects/release-arm-redux/sys/x86/include/specialreg.h (revision 282691) +++ projects/release-arm-redux/sys/x86/include/specialreg.h (revision 282692) @@ -1,844 +1,845 @@ /*- * Copyright (c) 1991 The Regents of the University of California. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * from: @(#)specialreg.h 7.1 (Berkeley) 5/9/91 * $FreeBSD$ */ #ifndef _MACHINE_SPECIALREG_H_ #define _MACHINE_SPECIALREG_H_ /* * Bits in 386 special registers: */ #define CR0_PE 0x00000001 /* Protected mode Enable */ #define CR0_MP 0x00000002 /* "Math" (fpu) Present */ #define CR0_EM 0x00000004 /* EMulate FPU instructions. (trap ESC only) */ #define CR0_TS 0x00000008 /* Task Switched (if MP, trap ESC and WAIT) */ #define CR0_PG 0x80000000 /* PaGing enable */ /* * Bits in 486 special registers: */ #define CR0_NE 0x00000020 /* Numeric Error enable (EX16 vs IRQ13) */ #define CR0_WP 0x00010000 /* Write Protect (honor page protect in all modes) */ #define CR0_AM 0x00040000 /* Alignment Mask (set to enable AC flag) */ #define CR0_NW 0x20000000 /* Not Write-through */ #define CR0_CD 0x40000000 /* Cache Disable */ #define CR3_PCID_SAVE 0x8000000000000000 +#define CR3_PCID_MASK 0xfff /* * Bits in PPro special registers */ #define CR4_VME 0x00000001 /* Virtual 8086 mode extensions */ #define CR4_PVI 0x00000002 /* Protected-mode virtual interrupts */ #define CR4_TSD 0x00000004 /* Time stamp disable */ #define CR4_DE 0x00000008 /* Debugging extensions */ #define CR4_PSE 0x00000010 /* Page size extensions */ #define CR4_PAE 0x00000020 /* Physical address extension */ #define CR4_MCE 0x00000040 /* Machine check enable */ #define CR4_PGE 0x00000080 /* Page global enable */ #define CR4_PCE 0x00000100 /* Performance monitoring counter enable */ #define CR4_FXSR 0x00000200 /* Fast FPU save/restore used by OS */ #define CR4_XMM 0x00000400 /* enable SIMD/MMX2 to use except 16 */ #define CR4_VMXE 0x00002000 /* enable VMX operation (Intel-specific) */ #define CR4_FSGSBASE 0x00010000 /* Enable FS/GS BASE accessing instructions */ #define CR4_PCIDE 0x00020000 /* Enable Context ID */ #define CR4_XSAVE 0x00040000 /* XSETBV/XGETBV */ #define CR4_SMEP 0x00100000 /* Supervisor-Mode Execution Prevention */ /* * Bits in AMD64 special registers. EFER is 64 bits wide. */ #define EFER_SCE 0x000000001 /* System Call Extensions (R/W) */ #define EFER_LME 0x000000100 /* Long mode enable (R/W) */ #define EFER_LMA 0x000000400 /* Long mode active (R) */ #define EFER_NXE 0x000000800 /* PTE No-Execute bit enable (R/W) */ #define EFER_SVM 0x000001000 /* SVM enable bit for AMD, reserved for Intel */ #define EFER_LMSLE 0x000002000 /* Long Mode Segment Limit Enable */ #define EFER_FFXSR 0x000004000 /* Fast FXSAVE/FSRSTOR */ #define EFER_TCE 0x000008000 /* Translation Cache Extension */ /* * Intel Extended Features registers */ #define XCR0 0 /* XFEATURE_ENABLED_MASK register */ #define XFEATURE_ENABLED_X87 0x00000001 #define XFEATURE_ENABLED_SSE 0x00000002 #define XFEATURE_ENABLED_YMM_HI128 0x00000004 #define XFEATURE_ENABLED_AVX XFEATURE_ENABLED_YMM_HI128 #define XFEATURE_ENABLED_BNDREGS 0x00000008 #define XFEATURE_ENABLED_BNDCSR 0x00000010 #define XFEATURE_ENABLED_OPMASK 0x00000020 #define XFEATURE_ENABLED_ZMM_HI256 0x00000040 #define XFEATURE_ENABLED_HI16_ZMM 0x00000080 #define XFEATURE_AVX \ (XFEATURE_ENABLED_X87 | XFEATURE_ENABLED_SSE | XFEATURE_ENABLED_AVX) #define XFEATURE_AVX512 \ (XFEATURE_ENABLED_OPMASK | XFEATURE_ENABLED_ZMM_HI256 | \ XFEATURE_ENABLED_HI16_ZMM) #define XFEATURE_MPX \ (XFEATURE_ENABLED_BNDREGS | XFEATURE_ENABLED_BNDCSR) /* * CPUID instruction features register */ #define CPUID_FPU 0x00000001 #define CPUID_VME 0x00000002 #define CPUID_DE 0x00000004 #define CPUID_PSE 0x00000008 #define CPUID_TSC 0x00000010 #define CPUID_MSR 0x00000020 #define CPUID_PAE 0x00000040 #define CPUID_MCE 0x00000080 #define CPUID_CX8 0x00000100 #define CPUID_APIC 0x00000200 #define CPUID_B10 0x00000400 #define CPUID_SEP 0x00000800 #define CPUID_MTRR 0x00001000 #define CPUID_PGE 0x00002000 #define CPUID_MCA 0x00004000 #define CPUID_CMOV 0x00008000 #define CPUID_PAT 0x00010000 #define CPUID_PSE36 0x00020000 #define CPUID_PSN 0x00040000 #define CPUID_CLFSH 0x00080000 #define CPUID_B20 0x00100000 #define CPUID_DS 0x00200000 #define CPUID_ACPI 0x00400000 #define CPUID_MMX 0x00800000 #define CPUID_FXSR 0x01000000 #define CPUID_SSE 0x02000000 #define CPUID_XMM 0x02000000 #define CPUID_SSE2 0x04000000 #define CPUID_SS 0x08000000 #define CPUID_HTT 0x10000000 #define CPUID_TM 0x20000000 #define CPUID_IA64 0x40000000 #define CPUID_PBE 0x80000000 #define CPUID2_SSE3 0x00000001 #define CPUID2_PCLMULQDQ 0x00000002 #define CPUID2_DTES64 0x00000004 #define CPUID2_MON 0x00000008 #define CPUID2_DS_CPL 0x00000010 #define CPUID2_VMX 0x00000020 #define CPUID2_SMX 0x00000040 #define CPUID2_EST 0x00000080 #define CPUID2_TM2 0x00000100 #define CPUID2_SSSE3 0x00000200 #define CPUID2_CNXTID 0x00000400 #define CPUID2_SDBG 0x00000800 #define CPUID2_FMA 0x00001000 #define CPUID2_CX16 0x00002000 #define CPUID2_XTPR 0x00004000 #define CPUID2_PDCM 0x00008000 #define CPUID2_PCID 0x00020000 #define CPUID2_DCA 0x00040000 #define CPUID2_SSE41 0x00080000 #define CPUID2_SSE42 0x00100000 #define CPUID2_X2APIC 0x00200000 #define CPUID2_MOVBE 0x00400000 #define CPUID2_POPCNT 0x00800000 #define CPUID2_TSCDLT 0x01000000 #define CPUID2_AESNI 0x02000000 #define CPUID2_XSAVE 0x04000000 #define CPUID2_OSXSAVE 0x08000000 #define CPUID2_AVX 0x10000000 #define CPUID2_F16C 0x20000000 #define CPUID2_RDRAND 0x40000000 #define CPUID2_HV 0x80000000 /* * Important bits in the Thermal and Power Management flags * CPUID.6 EAX and ECX. */ #define CPUTPM1_SENSOR 0x00000001 #define CPUTPM1_TURBO 0x00000002 #define CPUTPM1_ARAT 0x00000004 #define CPUTPM2_EFFREQ 0x00000001 /* * Important bits in the AMD extended cpuid flags */ #define AMDID_SYSCALL 0x00000800 #define AMDID_MP 0x00080000 #define AMDID_NX 0x00100000 #define AMDID_EXT_MMX 0x00400000 #define AMDID_FFXSR 0x02000000 #define AMDID_PAGE1GB 0x04000000 #define AMDID_RDTSCP 0x08000000 #define AMDID_LM 0x20000000 #define AMDID_EXT_3DNOW 0x40000000 #define AMDID_3DNOW 0x80000000 #define AMDID2_LAHF 0x00000001 #define AMDID2_CMP 0x00000002 #define AMDID2_SVM 0x00000004 #define AMDID2_EXT_APIC 0x00000008 #define AMDID2_CR8 0x00000010 #define AMDID2_ABM 0x00000020 #define AMDID2_SSE4A 0x00000040 #define AMDID2_MAS 0x00000080 #define AMDID2_PREFETCH 0x00000100 #define AMDID2_OSVW 0x00000200 #define AMDID2_IBS 0x00000400 #define AMDID2_XOP 0x00000800 #define AMDID2_SKINIT 0x00001000 #define AMDID2_WDT 0x00002000 #define AMDID2_LWP 0x00008000 #define AMDID2_FMA4 0x00010000 #define AMDID2_TCE 0x00020000 #define AMDID2_NODE_ID 0x00080000 #define AMDID2_TBM 0x00200000 #define AMDID2_TOPOLOGY 0x00400000 #define AMDID2_PCXC 0x00800000 #define AMDID2_PNXC 0x01000000 #define AMDID2_DBE 0x04000000 #define AMDID2_PTSC 0x08000000 #define AMDID2_PTSCEL2I 0x10000000 /* * CPUID instruction 1 eax info */ #define CPUID_STEPPING 0x0000000f #define CPUID_MODEL 0x000000f0 #define CPUID_FAMILY 0x00000f00 #define CPUID_EXT_MODEL 0x000f0000 #define CPUID_EXT_FAMILY 0x0ff00000 #ifdef __i386__ #define CPUID_TO_MODEL(id) \ ((((id) & CPUID_MODEL) >> 4) | \ ((((id) & CPUID_FAMILY) >= 0x600) ? \ (((id) & CPUID_EXT_MODEL) >> 12) : 0)) #define CPUID_TO_FAMILY(id) \ ((((id) & CPUID_FAMILY) >> 8) + \ ((((id) & CPUID_FAMILY) == 0xf00) ? \ (((id) & CPUID_EXT_FAMILY) >> 20) : 0)) #else #define CPUID_TO_MODEL(id) \ ((((id) & CPUID_MODEL) >> 4) | \ (((id) & CPUID_EXT_MODEL) >> 12)) #define CPUID_TO_FAMILY(id) \ ((((id) & CPUID_FAMILY) >> 8) + \ (((id) & CPUID_EXT_FAMILY) >> 20)) #endif /* * CPUID instruction 1 ebx info */ #define CPUID_BRAND_INDEX 0x000000ff #define CPUID_CLFUSH_SIZE 0x0000ff00 #define CPUID_HTT_CORES 0x00ff0000 #define CPUID_LOCAL_APIC_ID 0xff000000 /* * CPUID instruction 5 info */ #define CPUID5_MON_MIN_SIZE 0x0000ffff /* eax */ #define CPUID5_MON_MAX_SIZE 0x0000ffff /* ebx */ #define CPUID5_MON_MWAIT_EXT 0x00000001 /* ecx */ #define CPUID5_MWAIT_INTRBREAK 0x00000002 /* ecx */ /* * MWAIT cpu power states. Lower 4 bits are sub-states. */ #define MWAIT_C0 0xf0 #define MWAIT_C1 0x00 #define MWAIT_C2 0x10 #define MWAIT_C3 0x20 #define MWAIT_C4 0x30 /* * MWAIT extensions. */ /* Interrupt breaks MWAIT even when masked. */ #define MWAIT_INTRBREAK 0x00000001 /* * CPUID instruction 6 ecx info */ #define CPUID_PERF_STAT 0x00000001 #define CPUID_PERF_BIAS 0x00000008 /* * CPUID instruction 0xb ebx info. */ #define CPUID_TYPE_INVAL 0 #define CPUID_TYPE_SMT 1 #define CPUID_TYPE_CORE 2 /* * CPUID instruction 0xd Processor Extended State Enumeration Sub-leaf 1 */ #define CPUID_EXTSTATE_XSAVEOPT 0x00000001 #define CPUID_EXTSTATE_XSAVEC 0x00000002 #define CPUID_EXTSTATE_XINUSE 0x00000004 #define CPUID_EXTSTATE_XSAVES 0x00000008 /* * AMD extended function 8000_0007h edx info */ #define AMDPM_TS 0x00000001 #define AMDPM_FID 0x00000002 #define AMDPM_VID 0x00000004 #define AMDPM_TTP 0x00000008 #define AMDPM_TM 0x00000010 #define AMDPM_STC 0x00000020 #define AMDPM_100MHZ_STEPS 0x00000040 #define AMDPM_HW_PSTATE 0x00000080 #define AMDPM_TSC_INVARIANT 0x00000100 #define AMDPM_CPB 0x00000200 /* * AMD extended function 8000_0008h ecx info */ #define AMDID_CMP_CORES 0x000000ff #define AMDID_COREID_SIZE 0x0000f000 #define AMDID_COREID_SIZE_SHIFT 12 /* * CPUID instruction 7 Structured Extended Features, leaf 0 ebx info */ #define CPUID_STDEXT_FSGSBASE 0x00000001 #define CPUID_STDEXT_TSC_ADJUST 0x00000002 #define CPUID_STDEXT_BMI1 0x00000008 #define CPUID_STDEXT_HLE 0x00000010 #define CPUID_STDEXT_AVX2 0x00000020 #define CPUID_STDEXT_SMEP 0x00000080 #define CPUID_STDEXT_BMI2 0x00000100 #define CPUID_STDEXT_ERMS 0x00000200 #define CPUID_STDEXT_INVPCID 0x00000400 #define CPUID_STDEXT_RTM 0x00000800 #define CPUID_STDEXT_MPX 0x00004000 #define CPUID_STDEXT_AVX512F 0x00010000 #define CPUID_STDEXT_RDSEED 0x00040000 #define CPUID_STDEXT_ADX 0x00080000 #define CPUID_STDEXT_SMAP 0x00100000 #define CPUID_STDEXT_CLFLUSHOPT 0x00800000 #define CPUID_STDEXT_PROCTRACE 0x02000000 #define CPUID_STDEXT_AVX512PF 0x04000000 #define CPUID_STDEXT_AVX512ER 0x08000000 #define CPUID_STDEXT_AVX512CD 0x10000000 #define CPUID_STDEXT_SHA 0x20000000 /* * CPUID manufacturers identifiers */ #define AMD_VENDOR_ID "AuthenticAMD" #define CENTAUR_VENDOR_ID "CentaurHauls" #define CYRIX_VENDOR_ID "CyrixInstead" #define INTEL_VENDOR_ID "GenuineIntel" #define NEXGEN_VENDOR_ID "NexGenDriven" #define NSC_VENDOR_ID "Geode by NSC" #define RISE_VENDOR_ID "RiseRiseRise" #define SIS_VENDOR_ID "SiS SiS SiS " #define TRANSMETA_VENDOR_ID "GenuineTMx86" #define UMC_VENDOR_ID "UMC UMC UMC " /* * Model-specific registers for the i386 family */ #define MSR_P5_MC_ADDR 0x000 #define MSR_P5_MC_TYPE 0x001 #define MSR_TSC 0x010 #define MSR_P5_CESR 0x011 #define MSR_P5_CTR0 0x012 #define MSR_P5_CTR1 0x013 #define MSR_IA32_PLATFORM_ID 0x017 #define MSR_APICBASE 0x01b #define MSR_EBL_CR_POWERON 0x02a #define MSR_TEST_CTL 0x033 #define MSR_IA32_FEATURE_CONTROL 0x03a #define MSR_BIOS_UPDT_TRIG 0x079 #define MSR_BBL_CR_D0 0x088 #define MSR_BBL_CR_D1 0x089 #define MSR_BBL_CR_D2 0x08a #define MSR_BIOS_SIGN 0x08b #define MSR_PERFCTR0 0x0c1 #define MSR_PERFCTR1 0x0c2 #define MSR_PLATFORM_INFO 0x0ce #define MSR_MPERF 0x0e7 #define MSR_APERF 0x0e8 #define MSR_IA32_EXT_CONFIG 0x0ee /* Undocumented. Core Solo/Duo only */ #define MSR_MTRRcap 0x0fe #define MSR_BBL_CR_ADDR 0x116 #define MSR_BBL_CR_DECC 0x118 #define MSR_BBL_CR_CTL 0x119 #define MSR_BBL_CR_TRIG 0x11a #define MSR_BBL_CR_BUSY 0x11b #define MSR_BBL_CR_CTL3 0x11e #define MSR_SYSENTER_CS_MSR 0x174 #define MSR_SYSENTER_ESP_MSR 0x175 #define MSR_SYSENTER_EIP_MSR 0x176 #define MSR_MCG_CAP 0x179 #define MSR_MCG_STATUS 0x17a #define MSR_MCG_CTL 0x17b #define MSR_EVNTSEL0 0x186 #define MSR_EVNTSEL1 0x187 #define MSR_THERM_CONTROL 0x19a #define MSR_THERM_INTERRUPT 0x19b #define MSR_THERM_STATUS 0x19c #define MSR_IA32_MISC_ENABLE 0x1a0 #define MSR_IA32_TEMPERATURE_TARGET 0x1a2 #define MSR_TURBO_RATIO_LIMIT 0x1ad #define MSR_TURBO_RATIO_LIMIT1 0x1ae #define MSR_DEBUGCTLMSR 0x1d9 #define MSR_LASTBRANCHFROMIP 0x1db #define MSR_LASTBRANCHTOIP 0x1dc #define MSR_LASTINTFROMIP 0x1dd #define MSR_LASTINTTOIP 0x1de #define MSR_ROB_CR_BKUPTMPDR6 0x1e0 #define MSR_MTRRVarBase 0x200 #define MSR_MTRR64kBase 0x250 #define MSR_MTRR16kBase 0x258 #define MSR_MTRR4kBase 0x268 #define MSR_PAT 0x277 #define MSR_MC0_CTL2 0x280 #define MSR_MTRRdefType 0x2ff #define MSR_MC0_CTL 0x400 #define MSR_MC0_STATUS 0x401 #define MSR_MC0_ADDR 0x402 #define MSR_MC0_MISC 0x403 #define MSR_MC1_CTL 0x404 #define MSR_MC1_STATUS 0x405 #define MSR_MC1_ADDR 0x406 #define MSR_MC1_MISC 0x407 #define MSR_MC2_CTL 0x408 #define MSR_MC2_STATUS 0x409 #define MSR_MC2_ADDR 0x40a #define MSR_MC2_MISC 0x40b #define MSR_MC3_CTL 0x40c #define MSR_MC3_STATUS 0x40d #define MSR_MC3_ADDR 0x40e #define MSR_MC3_MISC 0x40f #define MSR_MC4_CTL 0x410 #define MSR_MC4_STATUS 0x411 #define MSR_MC4_ADDR 0x412 #define MSR_MC4_MISC 0x413 #define MSR_RAPL_POWER_UNIT 0x606 #define MSR_PKG_ENERGY_STATUS 0x611 #define MSR_DRAM_ENERGY_STATUS 0x619 #define MSR_PP0_ENERGY_STATUS 0x639 #define MSR_PP1_ENERGY_STATUS 0x641 /* * VMX MSRs */ #define MSR_VMX_BASIC 0x480 #define MSR_VMX_PINBASED_CTLS 0x481 #define MSR_VMX_PROCBASED_CTLS 0x482 #define MSR_VMX_EXIT_CTLS 0x483 #define MSR_VMX_ENTRY_CTLS 0x484 #define MSR_VMX_CR0_FIXED0 0x486 #define MSR_VMX_CR0_FIXED1 0x487 #define MSR_VMX_CR4_FIXED0 0x488 #define MSR_VMX_CR4_FIXED1 0x489 #define MSR_VMX_PROCBASED_CTLS2 0x48b #define MSR_VMX_EPT_VPID_CAP 0x48c #define MSR_VMX_TRUE_PINBASED_CTLS 0x48d #define MSR_VMX_TRUE_PROCBASED_CTLS 0x48e #define MSR_VMX_TRUE_EXIT_CTLS 0x48f #define MSR_VMX_TRUE_ENTRY_CTLS 0x490 /* * X2APIC MSRs */ #define MSR_APIC_000 0x800 #define MSR_APIC_ID 0x802 #define MSR_APIC_VERSION 0x803 #define MSR_APIC_TPR 0x808 #define MSR_APIC_EOI 0x80b #define MSR_APIC_LDR 0x80d #define MSR_APIC_SVR 0x80f #define MSR_APIC_ISR0 0x810 #define MSR_APIC_ISR1 0x811 #define MSR_APIC_ISR2 0x812 #define MSR_APIC_ISR3 0x813 #define MSR_APIC_ISR4 0x814 #define MSR_APIC_ISR5 0x815 #define MSR_APIC_ISR6 0x816 #define MSR_APIC_ISR7 0x817 #define MSR_APIC_TMR0 0x818 #define MSR_APIC_IRR0 0x820 #define MSR_APIC_ESR 0x828 #define MSR_APIC_LVT_CMCI 0x82F #define MSR_APIC_ICR 0x830 #define MSR_APIC_LVT_TIMER 0x832 #define MSR_APIC_LVT_THERMAL 0x833 #define MSR_APIC_LVT_PCINT 0x834 #define MSR_APIC_LVT_LINT0 0x835 #define MSR_APIC_LVT_LINT1 0x836 #define MSR_APIC_LVT_ERROR 0x837 #define MSR_APIC_ICR_TIMER 0x838 #define MSR_APIC_CCR_TIMER 0x839 #define MSR_APIC_DCR_TIMER 0x83e #define MSR_APIC_SELF_IPI 0x83f #define MSR_IA32_XSS 0xda0 /* * Constants related to MSR's. */ #define APICBASE_RESERVED 0x000002ff #define APICBASE_BSP 0x00000100 #define APICBASE_X2APIC 0x00000400 #define APICBASE_ENABLED 0x00000800 #define APICBASE_ADDRESS 0xfffff000 /* MSR_IA32_FEATURE_CONTROL related */ #define IA32_FEATURE_CONTROL_LOCK 0x01 /* lock bit */ #define IA32_FEATURE_CONTROL_SMX_EN 0x02 /* enable VMX inside SMX */ #define IA32_FEATURE_CONTROL_VMX_EN 0x04 /* enable VMX outside SMX */ /* * PAT modes. */ #define PAT_UNCACHEABLE 0x00 #define PAT_WRITE_COMBINING 0x01 #define PAT_WRITE_THROUGH 0x04 #define PAT_WRITE_PROTECTED 0x05 #define PAT_WRITE_BACK 0x06 #define PAT_UNCACHED 0x07 #define PAT_VALUE(i, m) ((long long)(m) << (8 * (i))) #define PAT_MASK(i) PAT_VALUE(i, 0xff) /* * Constants related to MTRRs */ #define MTRR_UNCACHEABLE 0x00 #define MTRR_WRITE_COMBINING 0x01 #define MTRR_WRITE_THROUGH 0x04 #define MTRR_WRITE_PROTECTED 0x05 #define MTRR_WRITE_BACK 0x06 #define MTRR_N64K 8 /* numbers of fixed-size entries */ #define MTRR_N16K 16 #define MTRR_N4K 64 #define MTRR_CAP_WC 0x0000000000000400 #define MTRR_CAP_FIXED 0x0000000000000100 #define MTRR_CAP_VCNT 0x00000000000000ff #define MTRR_DEF_ENABLE 0x0000000000000800 #define MTRR_DEF_FIXED_ENABLE 0x0000000000000400 #define MTRR_DEF_TYPE 0x00000000000000ff #define MTRR_PHYSBASE_PHYSBASE 0x000ffffffffff000 #define MTRR_PHYSBASE_TYPE 0x00000000000000ff #define MTRR_PHYSMASK_PHYSMASK 0x000ffffffffff000 #define MTRR_PHYSMASK_VALID 0x0000000000000800 /* * Cyrix configuration registers, accessible as IO ports. */ #define CCR0 0xc0 /* Configuration control register 0 */ #define CCR0_NC0 0x01 /* First 64K of each 1M memory region is non-cacheable */ #define CCR0_NC1 0x02 /* 640K-1M region is non-cacheable */ #define CCR0_A20M 0x04 /* Enables A20M# input pin */ #define CCR0_KEN 0x08 /* Enables KEN# input pin */ #define CCR0_FLUSH 0x10 /* Enables FLUSH# input pin */ #define CCR0_BARB 0x20 /* Flushes internal cache when entering hold state */ #define CCR0_CO 0x40 /* Cache org: 1=direct mapped, 0=2x set assoc */ #define CCR0_SUSPEND 0x80 /* Enables SUSP# and SUSPA# pins */ #define CCR1 0xc1 /* Configuration control register 1 */ #define CCR1_RPL 0x01 /* Enables RPLSET and RPLVAL# pins */ #define CCR1_SMI 0x02 /* Enables SMM pins */ #define CCR1_SMAC 0x04 /* System management memory access */ #define CCR1_MMAC 0x08 /* Main memory access */ #define CCR1_NO_LOCK 0x10 /* Negate LOCK# */ #define CCR1_SM3 0x80 /* SMM address space address region 3 */ #define CCR2 0xc2 #define CCR2_WB 0x02 /* Enables WB cache interface pins */ #define CCR2_SADS 0x02 /* Slow ADS */ #define CCR2_LOCK_NW 0x04 /* LOCK NW Bit */ #define CCR2_SUSP_HLT 0x08 /* Suspend on HALT */ #define CCR2_WT1 0x10 /* WT region 1 */ #define CCR2_WPR1 0x10 /* Write-protect region 1 */ #define CCR2_BARB 0x20 /* Flushes write-back cache when entering hold state. */ #define CCR2_BWRT 0x40 /* Enables burst write cycles */ #define CCR2_USE_SUSP 0x80 /* Enables suspend pins */ #define CCR3 0xc3 #define CCR3_SMILOCK 0x01 /* SMM register lock */ #define CCR3_NMI 0x02 /* Enables NMI during SMM */ #define CCR3_LINBRST 0x04 /* Linear address burst cycles */ #define CCR3_SMMMODE 0x08 /* SMM Mode */ #define CCR3_MAPEN0 0x10 /* Enables Map0 */ #define CCR3_MAPEN1 0x20 /* Enables Map1 */ #define CCR3_MAPEN2 0x40 /* Enables Map2 */ #define CCR3_MAPEN3 0x80 /* Enables Map3 */ #define CCR4 0xe8 #define CCR4_IOMASK 0x07 #define CCR4_MEM 0x08 /* Enables momory bypassing */ #define CCR4_DTE 0x10 /* Enables directory table entry cache */ #define CCR4_FASTFPE 0x20 /* Fast FPU exception */ #define CCR4_CPUID 0x80 /* Enables CPUID instruction */ #define CCR5 0xe9 #define CCR5_WT_ALLOC 0x01 /* Write-through allocate */ #define CCR5_SLOP 0x02 /* LOOP instruction slowed down */ #define CCR5_LBR1 0x10 /* Local bus region 1 */ #define CCR5_ARREN 0x20 /* Enables ARR region */ #define CCR6 0xea #define CCR7 0xeb /* Performance Control Register (5x86 only). */ #define PCR0 0x20 #define PCR0_RSTK 0x01 /* Enables return stack */ #define PCR0_BTB 0x02 /* Enables branch target buffer */ #define PCR0_LOOP 0x04 /* Enables loop */ #define PCR0_AIS 0x08 /* Enables all instrcutions stalled to serialize pipe. */ #define PCR0_MLR 0x10 /* Enables reordering of misaligned loads */ #define PCR0_BTBRT 0x40 /* Enables BTB test register. */ #define PCR0_LSSER 0x80 /* Disable reorder */ /* Device Identification Registers */ #define DIR0 0xfe #define DIR1 0xff /* * Machine Check register constants. */ #define MCG_CAP_COUNT 0x000000ff #define MCG_CAP_CTL_P 0x00000100 #define MCG_CAP_EXT_P 0x00000200 #define MCG_CAP_CMCI_P 0x00000400 #define MCG_CAP_TES_P 0x00000800 #define MCG_CAP_EXT_CNT 0x00ff0000 #define MCG_CAP_SER_P 0x01000000 #define MCG_STATUS_RIPV 0x00000001 #define MCG_STATUS_EIPV 0x00000002 #define MCG_STATUS_MCIP 0x00000004 #define MCG_CTL_ENABLE 0xffffffffffffffff #define MCG_CTL_DISABLE 0x0000000000000000 #define MSR_MC_CTL(x) (MSR_MC0_CTL + (x) * 4) #define MSR_MC_STATUS(x) (MSR_MC0_STATUS + (x) * 4) #define MSR_MC_ADDR(x) (MSR_MC0_ADDR + (x) * 4) #define MSR_MC_MISC(x) (MSR_MC0_MISC + (x) * 4) #define MSR_MC_CTL2(x) (MSR_MC0_CTL2 + (x)) /* If MCG_CAP_CMCI_P */ #define MC_STATUS_MCA_ERROR 0x000000000000ffff #define MC_STATUS_MODEL_ERROR 0x00000000ffff0000 #define MC_STATUS_OTHER_INFO 0x01ffffff00000000 #define MC_STATUS_COR_COUNT 0x001fffc000000000 /* If MCG_CAP_CMCI_P */ #define MC_STATUS_TES_STATUS 0x0060000000000000 /* If MCG_CAP_TES_P */ #define MC_STATUS_AR 0x0080000000000000 /* If MCG_CAP_TES_P */ #define MC_STATUS_S 0x0100000000000000 /* If MCG_CAP_TES_P */ #define MC_STATUS_PCC 0x0200000000000000 #define MC_STATUS_ADDRV 0x0400000000000000 #define MC_STATUS_MISCV 0x0800000000000000 #define MC_STATUS_EN 0x1000000000000000 #define MC_STATUS_UC 0x2000000000000000 #define MC_STATUS_OVER 0x4000000000000000 #define MC_STATUS_VAL 0x8000000000000000 #define MC_MISC_RA_LSB 0x000000000000003f /* If MCG_CAP_SER_P */ #define MC_MISC_ADDRESS_MODE 0x00000000000001c0 /* If MCG_CAP_SER_P */ #define MC_CTL2_THRESHOLD 0x0000000000007fff #define MC_CTL2_CMCI_EN 0x0000000040000000 /* * The following four 3-byte registers control the non-cacheable regions. * These registers must be written as three separate bytes. * * NCRx+0: A31-A24 of starting address * NCRx+1: A23-A16 of starting address * NCRx+2: A15-A12 of starting address | NCR_SIZE_xx. * * The non-cacheable region's starting address must be aligned to the * size indicated by the NCR_SIZE_xx field. */ #define NCR1 0xc4 #define NCR2 0xc7 #define NCR3 0xca #define NCR4 0xcd #define NCR_SIZE_0K 0 #define NCR_SIZE_4K 1 #define NCR_SIZE_8K 2 #define NCR_SIZE_16K 3 #define NCR_SIZE_32K 4 #define NCR_SIZE_64K 5 #define NCR_SIZE_128K 6 #define NCR_SIZE_256K 7 #define NCR_SIZE_512K 8 #define NCR_SIZE_1M 9 #define NCR_SIZE_2M 10 #define NCR_SIZE_4M 11 #define NCR_SIZE_8M 12 #define NCR_SIZE_16M 13 #define NCR_SIZE_32M 14 #define NCR_SIZE_4G 15 /* * The address region registers are used to specify the location and * size for the eight address regions. * * ARRx + 0: A31-A24 of start address * ARRx + 1: A23-A16 of start address * ARRx + 2: A15-A12 of start address | ARR_SIZE_xx */ #define ARR0 0xc4 #define ARR1 0xc7 #define ARR2 0xca #define ARR3 0xcd #define ARR4 0xd0 #define ARR5 0xd3 #define ARR6 0xd6 #define ARR7 0xd9 #define ARR_SIZE_0K 0 #define ARR_SIZE_4K 1 #define ARR_SIZE_8K 2 #define ARR_SIZE_16K 3 #define ARR_SIZE_32K 4 #define ARR_SIZE_64K 5 #define ARR_SIZE_128K 6 #define ARR_SIZE_256K 7 #define ARR_SIZE_512K 8 #define ARR_SIZE_1M 9 #define ARR_SIZE_2M 10 #define ARR_SIZE_4M 11 #define ARR_SIZE_8M 12 #define ARR_SIZE_16M 13 #define ARR_SIZE_32M 14 #define ARR_SIZE_4G 15 /* * The region control registers specify the attributes associated with * the ARRx addres regions. */ #define RCR0 0xdc #define RCR1 0xdd #define RCR2 0xde #define RCR3 0xdf #define RCR4 0xe0 #define RCR5 0xe1 #define RCR6 0xe2 #define RCR7 0xe3 #define RCR_RCD 0x01 /* Disables caching for ARRx (x = 0-6). */ #define RCR_RCE 0x01 /* Enables caching for ARR7. */ #define RCR_WWO 0x02 /* Weak write ordering. */ #define RCR_WL 0x04 /* Weak locking. */ #define RCR_WG 0x08 /* Write gathering. */ #define RCR_WT 0x10 /* Write-through. */ #define RCR_NLB 0x20 /* LBA# pin is not asserted. */ /* AMD Write Allocate Top-Of-Memory and Control Register */ #define AMD_WT_ALLOC_TME 0x40000 /* top-of-memory enable */ #define AMD_WT_ALLOC_PRE 0x20000 /* programmable range enable */ #define AMD_WT_ALLOC_FRE 0x10000 /* fixed (A0000-FFFFF) range enable */ /* AMD64 MSR's */ #define MSR_EFER 0xc0000080 /* extended features */ #define MSR_STAR 0xc0000081 /* legacy mode SYSCALL target/cs/ss */ #define MSR_LSTAR 0xc0000082 /* long mode SYSCALL target rip */ #define MSR_CSTAR 0xc0000083 /* compat mode SYSCALL target rip */ #define MSR_SF_MASK 0xc0000084 /* syscall flags mask */ #define MSR_FSBASE 0xc0000100 /* base address of the %fs "segment" */ #define MSR_GSBASE 0xc0000101 /* base address of the %gs "segment" */ #define MSR_KGSBASE 0xc0000102 /* base address of the kernel %gs */ #define MSR_PERFEVSEL0 0xc0010000 #define MSR_PERFEVSEL1 0xc0010001 #define MSR_PERFEVSEL2 0xc0010002 #define MSR_PERFEVSEL3 0xc0010003 #define MSR_K7_PERFCTR0 0xc0010004 #define MSR_K7_PERFCTR1 0xc0010005 #define MSR_K7_PERFCTR2 0xc0010006 #define MSR_K7_PERFCTR3 0xc0010007 #define MSR_SYSCFG 0xc0010010 #define MSR_HWCR 0xc0010015 #define MSR_IORRBASE0 0xc0010016 #define MSR_IORRMASK0 0xc0010017 #define MSR_IORRBASE1 0xc0010018 #define MSR_IORRMASK1 0xc0010019 #define MSR_TOP_MEM 0xc001001a /* boundary for ram below 4G */ #define MSR_TOP_MEM2 0xc001001d /* boundary for ram above 4G */ #define MSR_NB_CFG1 0xc001001f /* NB configuration 1 */ #define MSR_P_STATE_LIMIT 0xc0010061 /* P-state Current Limit Register */ #define MSR_P_STATE_CONTROL 0xc0010062 /* P-state Control Register */ #define MSR_P_STATE_STATUS 0xc0010063 /* P-state Status Register */ #define MSR_P_STATE_CONFIG(n) (0xc0010064 + (n)) /* P-state Config */ #define MSR_SMM_ADDR 0xc0010112 /* SMM TSEG base address */ #define MSR_SMM_MASK 0xc0010113 /* SMM TSEG address mask */ #define MSR_IC_CFG 0xc0011021 /* Instruction Cache Configuration */ #define MSR_K8_UCODE_UPDATE 0xc0010020 /* update microcode */ #define MSR_MC0_CTL_MASK 0xc0010044 #define MSR_VM_CR 0xc0010114 /* SVM: feature control */ #define MSR_VM_HSAVE_PA 0xc0010117 /* SVM: host save area address */ /* MSR_VM_CR related */ #define VM_CR_SVMDIS 0x10 /* SVM: disabled by BIOS */ /* VIA ACE crypto featureset: for via_feature_rng */ #define VIA_HAS_RNG 1 /* cpu has RNG */ /* VIA ACE crypto featureset: for via_feature_xcrypt */ #define VIA_HAS_AES 1 /* cpu has AES */ #define VIA_HAS_SHA 2 /* cpu has SHA1 & SHA256 */ #define VIA_HAS_MM 4 /* cpu has RSA instructions */ #define VIA_HAS_AESCTR 8 /* cpu has AES-CTR instructions */ /* Centaur Extended Feature flags */ #define VIA_CPUID_HAS_RNG 0x000004 #define VIA_CPUID_DO_RNG 0x000008 #define VIA_CPUID_HAS_ACE 0x000040 #define VIA_CPUID_DO_ACE 0x000080 #define VIA_CPUID_HAS_ACE2 0x000100 #define VIA_CPUID_DO_ACE2 0x000200 #define VIA_CPUID_HAS_PHE 0x000400 #define VIA_CPUID_DO_PHE 0x000800 #define VIA_CPUID_HAS_PMM 0x001000 #define VIA_CPUID_DO_PMM 0x002000 /* VIA ACE xcrypt-* instruction context control options */ #define VIA_CRYPT_CWLO_ROUND_M 0x0000000f #define VIA_CRYPT_CWLO_ALG_M 0x00000070 #define VIA_CRYPT_CWLO_ALG_AES 0x00000000 #define VIA_CRYPT_CWLO_KEYGEN_M 0x00000080 #define VIA_CRYPT_CWLO_KEYGEN_HW 0x00000000 #define VIA_CRYPT_CWLO_KEYGEN_SW 0x00000080 #define VIA_CRYPT_CWLO_NORMAL 0x00000000 #define VIA_CRYPT_CWLO_INTERMEDIATE 0x00000100 #define VIA_CRYPT_CWLO_ENCRYPT 0x00000000 #define VIA_CRYPT_CWLO_DECRYPT 0x00000200 #define VIA_CRYPT_CWLO_KEY128 0x0000000a /* 128bit, 10 rds */ #define VIA_CRYPT_CWLO_KEY192 0x0000040c /* 192bit, 12 rds */ #define VIA_CRYPT_CWLO_KEY256 0x0000080e /* 256bit, 15 rds */ #endif /* !_MACHINE_SPECIALREG_H_ */ Index: projects/release-arm-redux/sys/x86/x86/cpu_machdep.c =================================================================== --- projects/release-arm-redux/sys/x86/x86/cpu_machdep.c (revision 282691) +++ projects/release-arm-redux/sys/x86/x86/cpu_machdep.c (revision 282692) @@ -1,486 +1,517 @@ /*- * Copyright (c) 2003 Peter Wemm. * Copyright (c) 1992 Terrence R. Lambert. * Copyright (c) 1982, 1987, 1990 The Regents of the University of California. * All rights reserved. * * This code is derived from software contributed to Berkeley by * William Jolitz. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by the University of * California, Berkeley and its contributors. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * from: @(#)machdep.c 7.4 (Berkeley) 6/3/91 */ #include __FBSDID("$FreeBSD$"); #include "opt_atpic.h" #include "opt_compat.h" #include "opt_cpu.h" #include "opt_ddb.h" #include "opt_inet.h" #include "opt_isa.h" #include "opt_kstack_pages.h" #include "opt_maxmem.h" #include "opt_mp_watchdog.h" #include "opt_perfmon.h" #include "opt_platform.h" #ifdef __i386__ #include "opt_npx.h" #include "opt_apic.h" #include "opt_xbox.h" #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef SMP #include #endif #include #include #include #include #include #include #include #ifdef PERFMON #include #endif #include #ifdef SMP #include #endif +#include #include #include #include #include #include #include #include #include /* * Machine dependent boot() routine * * I haven't seen anything to put here yet * Possibly some stuff might be grafted back here from boot() */ void cpu_boot(int howto) { } /* * Flush the D-cache for non-DMA I/O so that the I-cache can * be made coherent later. */ void cpu_flush_dcache(void *ptr, size_t len) { /* Not applicable */ } +void +acpi_cpu_c1(void) +{ + + __asm __volatile("sti; hlt"); +} + +void +acpi_cpu_idle_mwait(uint32_t mwait_hint) +{ + int *state; + + state = (int *)PCPU_PTR(monitorbuf); + /* + * XXXKIB. Software coordination mode should be supported, + * but all Intel CPUs provide hardware coordination. + */ + cpu_monitor(state, 0, 0); + cpu_mwait(MWAIT_INTRBREAK, mwait_hint); +} + /* Get current clock frequency for the given cpu id. */ int cpu_est_clockrate(int cpu_id, uint64_t *rate) { uint64_t tsc1, tsc2; uint64_t acnt, mcnt, perf; register_t reg; if (pcpu_find(cpu_id) == NULL || rate == NULL) return (EINVAL); #ifdef __i386__ if ((cpu_feature & CPUID_TSC) == 0) return (EOPNOTSUPP); #endif /* * If TSC is P-state invariant and APERF/MPERF MSRs do not exist, * DELAY(9) based logic fails. */ if (tsc_is_invariant && !tsc_perf_stat) return (EOPNOTSUPP); #ifdef SMP if (smp_cpus > 1) { /* Schedule ourselves on the indicated cpu. */ thread_lock(curthread); sched_bind(curthread, cpu_id); thread_unlock(curthread); } #endif /* Calibrate by measuring a short delay. */ reg = intr_disable(); if (tsc_is_invariant) { wrmsr(MSR_MPERF, 0); wrmsr(MSR_APERF, 0); tsc1 = rdtsc(); DELAY(1000); mcnt = rdmsr(MSR_MPERF); acnt = rdmsr(MSR_APERF); tsc2 = rdtsc(); intr_restore(reg); perf = 1000 * acnt / mcnt; *rate = (tsc2 - tsc1) * perf; } else { tsc1 = rdtsc(); DELAY(1000); tsc2 = rdtsc(); intr_restore(reg); *rate = (tsc2 - tsc1) * 1000; } #ifdef SMP if (smp_cpus > 1) { thread_lock(curthread); sched_unbind(curthread); thread_unlock(curthread); } #endif return (0); } /* * Shutdown the CPU as much as possible */ void cpu_halt(void) { for (;;) halt(); } +bool +cpu_mwait_usable(void) +{ + + return ((cpu_feature2 & CPUID2_MON) != 0 && ((cpu_mon_mwait_flags & + (CPUID5_MON_MWAIT_EXT | CPUID5_MWAIT_INTRBREAK)) == + (CPUID5_MON_MWAIT_EXT | CPUID5_MWAIT_INTRBREAK))); +} + void (*cpu_idle_hook)(sbintime_t) = NULL; /* ACPI idle hook. */ static int cpu_ident_amdc1e = 0; /* AMD C1E supported. */ static int idle_mwait = 1; /* Use MONITOR/MWAIT for short idle. */ SYSCTL_INT(_machdep, OID_AUTO, idle_mwait, CTLFLAG_RWTUN, &idle_mwait, 0, "Use MONITOR/MWAIT for short idle"); #define STATE_RUNNING 0x0 #define STATE_MWAIT 0x1 #define STATE_SLEEPING 0x2 #ifndef PC98 static void cpu_idle_acpi(sbintime_t sbt) { int *state; state = (int *)PCPU_PTR(monitorbuf); *state = STATE_SLEEPING; /* See comments in cpu_idle_hlt(). */ disable_intr(); if (sched_runnable()) enable_intr(); else if (cpu_idle_hook) cpu_idle_hook(sbt); else - __asm __volatile("sti; hlt"); + acpi_cpu_c1(); *state = STATE_RUNNING; } #endif /* !PC98 */ static void cpu_idle_hlt(sbintime_t sbt) { int *state; state = (int *)PCPU_PTR(monitorbuf); *state = STATE_SLEEPING; /* * Since we may be in a critical section from cpu_idle(), if * an interrupt fires during that critical section we may have * a pending preemption. If the CPU halts, then that thread * may not execute until a later interrupt awakens the CPU. * To handle this race, check for a runnable thread after * disabling interrupts and immediately return if one is * found. Also, we must absolutely guarentee that hlt is * the next instruction after sti. This ensures that any * interrupt that fires after the call to disable_intr() will * immediately awaken the CPU from hlt. Finally, please note * that on x86 this works fine because of interrupts enabled only * after the instruction following sti takes place, while IF is set * to 1 immediately, allowing hlt instruction to acknowledge the * interrupt. */ disable_intr(); if (sched_runnable()) enable_intr(); else - __asm __volatile("sti; hlt"); + acpi_cpu_c1(); *state = STATE_RUNNING; } static void cpu_idle_mwait(sbintime_t sbt) { int *state; state = (int *)PCPU_PTR(monitorbuf); *state = STATE_MWAIT; /* See comments in cpu_idle_hlt(). */ disable_intr(); if (sched_runnable()) { enable_intr(); *state = STATE_RUNNING; return; } cpu_monitor(state, 0, 0); if (*state == STATE_MWAIT) __asm __volatile("sti; mwait" : : "a" (MWAIT_C1), "c" (0)); else enable_intr(); *state = STATE_RUNNING; } static void cpu_idle_spin(sbintime_t sbt) { int *state; int i; state = (int *)PCPU_PTR(monitorbuf); *state = STATE_RUNNING; /* * The sched_runnable() call is racy but as long as there is * a loop missing it one time will have just a little impact if any * (and it is much better than missing the check at all). */ for (i = 0; i < 1000; i++) { if (sched_runnable()) return; cpu_spinwait(); } } /* * C1E renders the local APIC timer dead, so we disable it by * reading the Interrupt Pending Message register and clearing * both C1eOnCmpHalt (bit 28) and SmiOnCmpHalt (bit 27). * * Reference: * "BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh Processors" * #32559 revision 3.00+ */ #define MSR_AMDK8_IPM 0xc0010055 #define AMDK8_SMIONCMPHALT (1ULL << 27) #define AMDK8_C1EONCMPHALT (1ULL << 28) #define AMDK8_CMPHALT (AMDK8_SMIONCMPHALT | AMDK8_C1EONCMPHALT) void cpu_probe_amdc1e(void) { /* * Detect the presence of C1E capability mostly on latest * dual-cores (or future) k8 family. */ if (cpu_vendor_id == CPU_VENDOR_AMD && (cpu_id & 0x00000f00) == 0x00000f00 && (cpu_id & 0x0fff0000) >= 0x00040000) { cpu_ident_amdc1e = 1; } } #if defined(__i386__) && defined(PC98) void (*cpu_idle_fn)(sbintime_t) = cpu_idle_hlt; #else void (*cpu_idle_fn)(sbintime_t) = cpu_idle_acpi; #endif void cpu_idle(int busy) { uint64_t msr; sbintime_t sbt = -1; CTR2(KTR_SPARE2, "cpu_idle(%d) at %d", busy, curcpu); #ifdef MP_WATCHDOG ap_watchdog(PCPU_GET(cpuid)); #endif /* If we are busy - try to use fast methods. */ if (busy) { if ((cpu_feature2 & CPUID2_MON) && idle_mwait) { cpu_idle_mwait(busy); goto out; } } /* If we have time - switch timers into idle mode. */ if (!busy) { critical_enter(); sbt = cpu_idleclock(); } /* Apply AMD APIC timer C1E workaround. */ if (cpu_ident_amdc1e && cpu_disable_c3_sleep) { msr = rdmsr(MSR_AMDK8_IPM); if (msr & AMDK8_CMPHALT) wrmsr(MSR_AMDK8_IPM, msr & ~AMDK8_CMPHALT); } /* Call main idle method. */ cpu_idle_fn(sbt); /* Switch timers back into active mode. */ if (!busy) { cpu_activeclock(); critical_exit(); } out: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d done", busy, curcpu); } int cpu_idle_wakeup(int cpu) { struct pcpu *pcpu; int *state; pcpu = pcpu_find(cpu); state = (int *)pcpu->pc_monitorbuf; /* * This doesn't need to be atomic since missing the race will * simply result in unnecessary IPIs. */ if (*state == STATE_SLEEPING) return (0); if (*state == STATE_MWAIT) *state = STATE_RUNNING; return (1); } /* * Ordered by speed/power consumption. */ struct { void *id_fn; char *id_name; } idle_tbl[] = { { cpu_idle_spin, "spin" }, { cpu_idle_mwait, "mwait" }, { cpu_idle_hlt, "hlt" }, #if !defined(__i386__) || !defined(PC98) { cpu_idle_acpi, "acpi" }, #endif { NULL, NULL } }; static int idle_sysctl_available(SYSCTL_HANDLER_ARGS) { char *avail, *p; int error; int i; avail = malloc(256, M_TEMP, M_WAITOK); p = avail; for (i = 0; idle_tbl[i].id_name != NULL; i++) { if (strstr(idle_tbl[i].id_name, "mwait") && (cpu_feature2 & CPUID2_MON) == 0) continue; #if !defined(__i386__) || !defined(PC98) if (strcmp(idle_tbl[i].id_name, "acpi") == 0 && cpu_idle_hook == NULL) continue; #endif p += sprintf(p, "%s%s", p != avail ? ", " : "", idle_tbl[i].id_name); } error = sysctl_handle_string(oidp, avail, 0, req); free(avail, M_TEMP); return (error); } SYSCTL_PROC(_machdep, OID_AUTO, idle_available, CTLTYPE_STRING | CTLFLAG_RD, 0, 0, idle_sysctl_available, "A", "list of available idle functions"); static int idle_sysctl(SYSCTL_HANDLER_ARGS) { char buf[16]; int error; char *p; int i; p = "unknown"; for (i = 0; idle_tbl[i].id_name != NULL; i++) { if (idle_tbl[i].id_fn == cpu_idle_fn) { p = idle_tbl[i].id_name; break; } } strncpy(buf, p, sizeof(buf)); error = sysctl_handle_string(oidp, buf, sizeof(buf), req); if (error != 0 || req->newptr == NULL) return (error); for (i = 0; idle_tbl[i].id_name != NULL; i++) { if (strstr(idle_tbl[i].id_name, "mwait") && (cpu_feature2 & CPUID2_MON) == 0) continue; #if !defined(__i386__) || !defined(PC98) if (strcmp(idle_tbl[i].id_name, "acpi") == 0 && cpu_idle_hook == NULL) continue; #endif if (strcmp(idle_tbl[i].id_name, buf)) continue; cpu_idle_fn = idle_tbl[i].id_fn; return (0); } return (EINVAL); } SYSCTL_PROC(_machdep, OID_AUTO, idle, CTLTYPE_STRING | CTLFLAG_RW, 0, 0, idle_sysctl, "A", "currently selected idle function"); Index: projects/release-arm-redux/sys/x86/xen/xen_apic.c =================================================================== --- projects/release-arm-redux/sys/x86/xen/xen_apic.c (revision 282691) +++ projects/release-arm-redux/sys/x86/xen/xen_apic.c (revision 282692) @@ -1,549 +1,547 @@ /* * Copyright (c) 2014 Roger Pau Monné * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /*--------------------------------- Macros -----------------------------------*/ #define XEN_APIC_UNSUPPORTED \ panic("%s: not available in Xen PV port.", __func__) /*--------------------------- Forward Declarations ---------------------------*/ #ifdef SMP static driver_filter_t xen_smp_rendezvous_action; static driver_filter_t xen_invltlb; static driver_filter_t xen_invlpg; static driver_filter_t xen_invlrng; static driver_filter_t xen_invlcache; static driver_filter_t xen_ipi_bitmap_handler; static driver_filter_t xen_cpustop_handler; static driver_filter_t xen_cpususpend_handler; static driver_filter_t xen_cpustophard_handler; #endif /*---------------------------- Extern Declarations ---------------------------*/ /* Variables used by mp_machdep to perform the MMU related IPIs */ #ifdef __amd64__ extern int pmap_pcid_enabled; #endif /*---------------------------------- Macros ----------------------------------*/ #define IPI_TO_IDX(ipi) ((ipi) - APIC_IPI_INTS) /*--------------------------------- Xen IPIs ---------------------------------*/ #ifdef SMP struct xen_ipi_handler { driver_filter_t *filter; const char *description; }; static struct xen_ipi_handler xen_ipis[] = { [IPI_TO_IDX(IPI_RENDEZVOUS)] = { xen_smp_rendezvous_action, "r" }, [IPI_TO_IDX(IPI_INVLTLB)] = { xen_invltlb, "itlb"}, [IPI_TO_IDX(IPI_INVLPG)] = { xen_invlpg, "ipg" }, [IPI_TO_IDX(IPI_INVLRNG)] = { xen_invlrng, "irg" }, [IPI_TO_IDX(IPI_INVLCACHE)] = { xen_invlcache, "ic" }, [IPI_TO_IDX(IPI_BITMAP_VECTOR)] = { xen_ipi_bitmap_handler, "b" }, [IPI_TO_IDX(IPI_STOP)] = { xen_cpustop_handler, "st" }, [IPI_TO_IDX(IPI_SUSPEND)] = { xen_cpususpend_handler, "sp" }, [IPI_TO_IDX(IPI_STOP_HARD)] = { xen_cpustophard_handler, "sth" }, }; #endif /*------------------------------- Per-CPU Data -------------------------------*/ #ifdef SMP DPCPU_DEFINE(xen_intr_handle_t, ipi_handle[nitems(xen_ipis)]); #endif /*------------------------------- Xen PV APIC --------------------------------*/ static void xen_pv_lapic_create(u_int apic_id, int boot_cpu) { #ifdef SMP cpu_add(apic_id, boot_cpu); #endif } static void xen_pv_lapic_init(vm_paddr_t addr) { } static void xen_pv_lapic_setup(int boot) { } static void xen_pv_lapic_dump(const char *str) { printf("cpu%d %s XEN PV LAPIC\n", PCPU_GET(cpuid), str); } static void xen_pv_lapic_disable(void) { } static void xen_pv_lapic_eoi(void) { XEN_APIC_UNSUPPORTED; } static int xen_pv_lapic_id(void) { return (PCPU_GET(apic_id)); } static int xen_pv_lapic_intr_pending(u_int vector) { XEN_APIC_UNSUPPORTED; return (0); } static u_int xen_pv_apic_cpuid(u_int apic_id) { #ifdef SMP return (apic_cpuids[apic_id]); #else return (0); #endif } static u_int xen_pv_apic_alloc_vector(u_int apic_id, u_int irq) { XEN_APIC_UNSUPPORTED; return (0); } static u_int xen_pv_apic_alloc_vectors(u_int apic_id, u_int *irqs, u_int count, u_int align) { XEN_APIC_UNSUPPORTED; return (0); } static void xen_pv_apic_disable_vector(u_int apic_id, u_int vector) { XEN_APIC_UNSUPPORTED; } static void xen_pv_apic_enable_vector(u_int apic_id, u_int vector) { XEN_APIC_UNSUPPORTED; } static void xen_pv_apic_free_vector(u_int apic_id, u_int vector, u_int irq) { XEN_APIC_UNSUPPORTED; } static void xen_pv_lapic_set_logical_id(u_int apic_id, u_int cluster, u_int cluster_id) { XEN_APIC_UNSUPPORTED; } static int xen_pv_lapic_enable_pmc(void) { XEN_APIC_UNSUPPORTED; return (0); } static void xen_pv_lapic_disable_pmc(void) { XEN_APIC_UNSUPPORTED; } static void xen_pv_lapic_reenable_pmc(void) { XEN_APIC_UNSUPPORTED; } static void xen_pv_lapic_enable_cmc(void) { } #ifdef SMP static void xen_pv_lapic_ipi_raw(register_t icrlo, u_int dest) { XEN_APIC_UNSUPPORTED; } static void xen_pv_lapic_ipi_vectored(u_int vector, int dest) { xen_intr_handle_t *ipi_handle; int ipi_idx, to_cpu, self; ipi_idx = IPI_TO_IDX(vector); if (ipi_idx >= nitems(xen_ipis)) panic("IPI out of range"); switch(dest) { case APIC_IPI_DEST_SELF: ipi_handle = DPCPU_GET(ipi_handle); xen_intr_signal(ipi_handle[ipi_idx]); break; case APIC_IPI_DEST_ALL: CPU_FOREACH(to_cpu) { ipi_handle = DPCPU_ID_GET(to_cpu, ipi_handle); xen_intr_signal(ipi_handle[ipi_idx]); } break; case APIC_IPI_DEST_OTHERS: self = PCPU_GET(cpuid); CPU_FOREACH(to_cpu) { if (to_cpu != self) { ipi_handle = DPCPU_ID_GET(to_cpu, ipi_handle); xen_intr_signal(ipi_handle[ipi_idx]); } } break; default: to_cpu = apic_cpuid(dest); ipi_handle = DPCPU_ID_GET(to_cpu, ipi_handle); xen_intr_signal(ipi_handle[ipi_idx]); break; } } static int xen_pv_lapic_ipi_wait(int delay) { XEN_APIC_UNSUPPORTED; return (0); } static int xen_pv_lapic_ipi_alloc(inthand_t *ipifunc) { XEN_APIC_UNSUPPORTED; return (-1); } static void xen_pv_lapic_ipi_free(int vector) { XEN_APIC_UNSUPPORTED; } #endif /* SMP */ static int xen_pv_lapic_set_lvt_mask(u_int apic_id, u_int lvt, u_char masked) { XEN_APIC_UNSUPPORTED; return (0); } static int xen_pv_lapic_set_lvt_mode(u_int apic_id, u_int lvt, uint32_t mode) { XEN_APIC_UNSUPPORTED; return (0); } static int xen_pv_lapic_set_lvt_polarity(u_int apic_id, u_int lvt, enum intr_polarity pol) { XEN_APIC_UNSUPPORTED; return (0); } static int xen_pv_lapic_set_lvt_triggermode(u_int apic_id, u_int lvt, enum intr_trigger trigger) { XEN_APIC_UNSUPPORTED; return (0); } /* Xen apic_ops implementation */ struct apic_ops xen_apic_ops = { .create = xen_pv_lapic_create, .init = xen_pv_lapic_init, .xapic_mode = xen_pv_lapic_disable, .setup = xen_pv_lapic_setup, .dump = xen_pv_lapic_dump, .disable = xen_pv_lapic_disable, .eoi = xen_pv_lapic_eoi, .id = xen_pv_lapic_id, .intr_pending = xen_pv_lapic_intr_pending, .set_logical_id = xen_pv_lapic_set_logical_id, .cpuid = xen_pv_apic_cpuid, .alloc_vector = xen_pv_apic_alloc_vector, .alloc_vectors = xen_pv_apic_alloc_vectors, .enable_vector = xen_pv_apic_enable_vector, .disable_vector = xen_pv_apic_disable_vector, .free_vector = xen_pv_apic_free_vector, .enable_pmc = xen_pv_lapic_enable_pmc, .disable_pmc = xen_pv_lapic_disable_pmc, .reenable_pmc = xen_pv_lapic_reenable_pmc, .enable_cmc = xen_pv_lapic_enable_cmc, #ifdef SMP .ipi_raw = xen_pv_lapic_ipi_raw, .ipi_vectored = xen_pv_lapic_ipi_vectored, .ipi_wait = xen_pv_lapic_ipi_wait, .ipi_alloc = xen_pv_lapic_ipi_alloc, .ipi_free = xen_pv_lapic_ipi_free, #endif .set_lvt_mask = xen_pv_lapic_set_lvt_mask, .set_lvt_mode = xen_pv_lapic_set_lvt_mode, .set_lvt_polarity = xen_pv_lapic_set_lvt_polarity, .set_lvt_triggermode = xen_pv_lapic_set_lvt_triggermode, }; #ifdef SMP /*---------------------------- XEN PV IPI Handlers ---------------------------*/ /* * These are C clones of the ASM functions found in apic_vector. */ static int xen_ipi_bitmap_handler(void *arg) { struct trapframe *frame; frame = arg; ipi_bitmap_handler(*frame); return (FILTER_HANDLED); } static int xen_smp_rendezvous_action(void *arg) { #ifdef COUNT_IPIS (*ipi_rendezvous_counts[PCPU_GET(cpuid)])++; #endif /* COUNT_IPIS */ smp_rendezvous_action(); return (FILTER_HANDLED); } static int xen_invltlb(void *arg) { invltlb_handler(); return (FILTER_HANDLED); } #ifdef __amd64__ static int -xen_invltlb_pcid(void *arg) +xen_invltlb_invpcid(void *arg) { - invltlb_pcid_handler(); + invltlb_invpcid_handler(); return (FILTER_HANDLED); } -#endif static int -xen_invlpg(void *arg) +xen_invltlb_pcid(void *arg) { - invlpg_handler(); + invltlb_pcid_handler(); return (FILTER_HANDLED); } +#endif -#ifdef __amd64__ static int -xen_invlpg_pcid(void *arg) +xen_invlpg(void *arg) { - invlpg_pcid_handler(); + invlpg_handler(); return (FILTER_HANDLED); } -#endif static int xen_invlrng(void *arg) { invlrng_handler(); return (FILTER_HANDLED); } static int xen_invlcache(void *arg) { invlcache_handler(); return (FILTER_HANDLED); } static int xen_cpustop_handler(void *arg) { cpustop_handler(); return (FILTER_HANDLED); } static int xen_cpususpend_handler(void *arg) { cpususpend_handler(); return (FILTER_HANDLED); } static int xen_cpustophard_handler(void *arg) { ipi_nmi_handler(); return (FILTER_HANDLED); } /*----------------------------- XEN PV IPI setup -----------------------------*/ /* * Those functions are provided outside of the Xen PV APIC implementation * so PVHVM guests can also use PV IPIs without having an actual Xen PV APIC, * because on PVHVM there's an emulated LAPIC provided by Xen. */ static void xen_cpu_ipi_init(int cpu) { xen_intr_handle_t *ipi_handle; const struct xen_ipi_handler *ipi; device_t dev; int idx, rc; ipi_handle = DPCPU_ID_GET(cpu, ipi_handle); dev = pcpu_find(cpu)->pc_device; KASSERT((dev != NULL), ("NULL pcpu device_t")); for (ipi = xen_ipis, idx = 0; idx < nitems(xen_ipis); ipi++, idx++) { if (ipi->filter == NULL) { ipi_handle[idx] = NULL; continue; } rc = xen_intr_alloc_and_bind_ipi(dev, cpu, ipi->filter, INTR_TYPE_TTY, &ipi_handle[idx]); if (rc != 0) panic("Unable to allocate a XEN IPI port"); xen_intr_describe(ipi_handle[idx], "%s", ipi->description); } } static void xen_setup_cpus(void) { int i; if (!xen_vector_callback_enabled) return; #ifdef __amd64__ if (pmap_pcid_enabled) { - xen_ipis[IPI_TO_IDX(IPI_INVLTLB)].filter = xen_invltlb_pcid; - xen_ipis[IPI_TO_IDX(IPI_INVLPG)].filter = xen_invlpg_pcid; + xen_ipis[IPI_TO_IDX(IPI_INVLTLB)].filter = invpcid_works ? + xen_invltlb_invpcid : xen_invltlb_pcid; } #endif CPU_FOREACH(i) xen_cpu_ipi_init(i); /* Set the xen pv ipi ops to replace the native ones */ if (xen_hvm_domain()) apic_ops.ipi_vectored = xen_pv_lapic_ipi_vectored; } /* We need to setup IPIs before APs are started */ SYSINIT(xen_setup_cpus, SI_SUB_SMP-1, SI_ORDER_FIRST, xen_setup_cpus, NULL); #endif /* SMP */ Index: projects/release-arm-redux/sys =================================================================== --- projects/release-arm-redux/sys (revision 282691) +++ projects/release-arm-redux/sys (revision 282692) Property changes on: projects/release-arm-redux/sys ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head/sys:r282673-282691 Index: projects/release-arm-redux/usr.sbin/pmcstat/pmcstat_log.c =================================================================== --- projects/release-arm-redux/usr.sbin/pmcstat/pmcstat_log.c (revision 282691) +++ projects/release-arm-redux/usr.sbin/pmcstat/pmcstat_log.c (revision 282692) @@ -1,2224 +1,2226 @@ /*- * Copyright (c) 2005-2007, Joseph Koshy * Copyright (c) 2007 The FreeBSD Foundation * All rights reserved. * * Portions of this software were developed by A. Joseph Koshy under * sponsorship from the FreeBSD Foundation and Google, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * Transform a hwpmc(4) log into human readable form, and into * gprof(1) compatible profiles. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "pmcstat.h" #include "pmcstat_log.h" #include "pmcstat_top.h" #define PMCSTAT_ALLOCATE 1 /* * PUBLIC INTERFACES * * pmcstat_initialize_logging() initialize this module, called first * pmcstat_shutdown_logging() orderly shutdown, called last * pmcstat_open_log() open an eventlog for processing * pmcstat_process_log() print/convert an event log * pmcstat_display_log() top mode display for the log * pmcstat_close_log() finish processing an event log * * IMPLEMENTATION NOTES * * We correlate each 'callchain' or 'sample' entry seen in the event * log back to an executable object in the system. Executable objects * include: * - program executables, * - shared libraries loaded by the runtime loader, * - dlopen()'ed objects loaded by the program, * - the runtime loader itself, * - the kernel and kernel modules. * * Each process that we know about is treated as a set of regions that * map to executable objects. Processes are described by * 'pmcstat_process' structures. Executable objects are tracked by * 'pmcstat_image' structures. The kernel and kernel modules are * common to all processes (they reside at the same virtual addresses * for all processes). Individual processes can have their text * segments and shared libraries loaded at process-specific locations. * * A given executable object can be in use by multiple processes * (e.g., libc.so) and loaded at a different address in each. * pmcstat_pcmap structures track per-image mappings. * * The sample log could have samples from multiple PMCs; we * generate one 'gmon.out' profile per PMC. * * IMPLEMENTATION OF GMON OUTPUT * * Each executable object gets one 'gmon.out' profile, per PMC in * use. Creation of 'gmon.out' profiles is done lazily. The * 'gmon.out' profiles generated for a given sampling PMC are * aggregates of all the samples for that particular executable * object. * * IMPLEMENTATION OF SYSTEM-WIDE CALLGRAPH OUTPUT * * Each active pmcid has its own callgraph structure, described by a * 'struct pmcstat_callgraph'. Given a process id and a list of pc * values, we map each pc value to a tuple (image, symbol), where * 'image' denotes an executable object and 'symbol' is the closest * symbol that precedes the pc value. Each pc value in the list is * also given a 'rank' that reflects its depth in the call stack. */ struct pmcstat_pmcs pmcstat_pmcs = LIST_HEAD_INITIALIZER(pmcstat_pmcs); /* * All image descriptors are kept in a hash table. */ struct pmcstat_image_hash_list pmcstat_image_hash[PMCSTAT_NHASH]; /* * All process descriptors are kept in a hash table. */ struct pmcstat_process_hash_list pmcstat_process_hash[PMCSTAT_NHASH]; struct pmcstat_stats pmcstat_stats; /* statistics */ static int ps_samples_period; /* samples count between top refresh. */ struct pmcstat_process *pmcstat_kernproc; /* kernel 'process' */ #include "pmcpl_gprof.h" #include "pmcpl_callgraph.h" #include "pmcpl_annotate.h" #include "pmcpl_annotate_cg.h" #include "pmcpl_calltree.h" static struct pmc_plugins { const char *pl_name; /* name */ /* configure */ int (*pl_configure)(char *opt); /* init and shutdown */ int (*pl_init)(void); void (*pl_shutdown)(FILE *mf); /* sample processing */ void (*pl_process)(struct pmcstat_process *pp, struct pmcstat_pmcrecord *pmcr, uint32_t nsamples, uintfptr_t *cc, int usermode, uint32_t cpu); /* image */ void (*pl_initimage)(struct pmcstat_image *pi); void (*pl_shutdownimage)(struct pmcstat_image *pi); /* pmc */ void (*pl_newpmc)(pmcstat_interned_string ps, struct pmcstat_pmcrecord *pr); /* top display */ void (*pl_topdisplay)(void); /* top keypress */ int (*pl_topkeypress)(int c, WINDOW *w); } plugins[] = { { .pl_name = "none", }, { .pl_name = "callgraph", .pl_init = pmcpl_cg_init, .pl_shutdown = pmcpl_cg_shutdown, .pl_process = pmcpl_cg_process, .pl_topkeypress = pmcpl_cg_topkeypress, .pl_topdisplay = pmcpl_cg_topdisplay }, { .pl_name = "gprof", .pl_shutdown = pmcpl_gmon_shutdown, .pl_process = pmcpl_gmon_process, .pl_initimage = pmcpl_gmon_initimage, .pl_shutdownimage = pmcpl_gmon_shutdownimage, .pl_newpmc = pmcpl_gmon_newpmc }, { .pl_name = "annotate", .pl_process = pmcpl_annotate_process }, { .pl_name = "calltree", .pl_configure = pmcpl_ct_configure, .pl_init = pmcpl_ct_init, .pl_shutdown = pmcpl_ct_shutdown, .pl_process = pmcpl_ct_process, .pl_topkeypress = pmcpl_ct_topkeypress, .pl_topdisplay = pmcpl_ct_topdisplay }, { .pl_name = "annotate_cg", .pl_process = pmcpl_annotate_cg_process }, { .pl_name = NULL } }; static int pmcstat_mergepmc; int pmcstat_pmcinfilter = 0; /* PMC filter for top mode. */ float pmcstat_threshold = 0.5; /* Cost filter for top mode. */ /* * Prototypes */ static struct pmcstat_image *pmcstat_image_from_path(pmcstat_interned_string _path, int _iskernelmodule); static void pmcstat_image_get_aout_params(struct pmcstat_image *_image); static void pmcstat_image_get_elf_params(struct pmcstat_image *_image); static void pmcstat_image_link(struct pmcstat_process *_pp, struct pmcstat_image *_i, uintfptr_t _lpc); static void pmcstat_pmcid_add(pmc_id_t _pmcid, pmcstat_interned_string _name); static void pmcstat_process_aout_exec(struct pmcstat_process *_pp, struct pmcstat_image *_image, uintfptr_t _entryaddr); static void pmcstat_process_elf_exec(struct pmcstat_process *_pp, struct pmcstat_image *_image, uintfptr_t _entryaddr); static void pmcstat_process_exec(struct pmcstat_process *_pp, pmcstat_interned_string _path, uintfptr_t _entryaddr); static struct pmcstat_process *pmcstat_process_lookup(pid_t _pid, int _allocate); static int pmcstat_string_compute_hash(const char *_string); static void pmcstat_string_initialize(void); static int pmcstat_string_lookup_hash(pmcstat_interned_string _is); static void pmcstat_string_shutdown(void); static void pmcstat_stats_reset(int _reset_global); /* * A simple implementation of interned strings. Each interned string * is assigned a unique address, so that subsequent string compares * can be done by a simple pointer comparison instead of using * strcmp(). This speeds up hash table lookups and saves memory if * duplicate strings are the norm. */ struct pmcstat_string { LIST_ENTRY(pmcstat_string) ps_next; /* hash link */ int ps_len; int ps_hash; char *ps_string; }; static LIST_HEAD(,pmcstat_string) pmcstat_string_hash[PMCSTAT_NHASH]; /* * PMC count. */ int pmcstat_npmcs; /* * PMC Top mode pause state. */ static int pmcstat_pause; static void pmcstat_stats_reset(int reset_global) { struct pmcstat_pmcrecord *pr; /* Flush PMCs stats. */ LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) { pr->pr_samples = 0; pr->pr_dubious_frames = 0; } ps_samples_period = 0; /* Flush global stats. */ if (reset_global) bzero(&pmcstat_stats, sizeof(struct pmcstat_stats)); } /* * Compute a 'hash' value for a string. */ static int pmcstat_string_compute_hash(const char *s) { unsigned hash; for (hash = 2166136261; *s; s++) hash = (hash ^ *s) * 16777619; return (hash & PMCSTAT_HASH_MASK); } /* * Intern a copy of string 's', and return a pointer to the * interned structure. */ pmcstat_interned_string pmcstat_string_intern(const char *s) { struct pmcstat_string *ps; const struct pmcstat_string *cps; int hash, len; if ((cps = pmcstat_string_lookup(s)) != NULL) return (cps); hash = pmcstat_string_compute_hash(s); len = strlen(s); if ((ps = malloc(sizeof(*ps))) == NULL) err(EX_OSERR, "ERROR: Could not intern string"); ps->ps_len = len; ps->ps_hash = hash; ps->ps_string = strdup(s); LIST_INSERT_HEAD(&pmcstat_string_hash[hash], ps, ps_next); return ((pmcstat_interned_string) ps); } const char * pmcstat_string_unintern(pmcstat_interned_string str) { const char *s; s = ((const struct pmcstat_string *) str)->ps_string; return (s); } pmcstat_interned_string pmcstat_string_lookup(const char *s) { struct pmcstat_string *ps; int hash, len; hash = pmcstat_string_compute_hash(s); len = strlen(s); LIST_FOREACH(ps, &pmcstat_string_hash[hash], ps_next) if (ps->ps_len == len && ps->ps_hash == hash && strcmp(ps->ps_string, s) == 0) return (ps); return (NULL); } static int pmcstat_string_lookup_hash(pmcstat_interned_string s) { const struct pmcstat_string *ps; ps = (const struct pmcstat_string *) s; return (ps->ps_hash); } /* * Initialize the string interning facility. */ static void pmcstat_string_initialize(void) { int i; for (i = 0; i < PMCSTAT_NHASH; i++) LIST_INIT(&pmcstat_string_hash[i]); } /* * Destroy the string table, free'ing up space. */ static void pmcstat_string_shutdown(void) { int i; struct pmcstat_string *ps, *pstmp; for (i = 0; i < PMCSTAT_NHASH; i++) LIST_FOREACH_SAFE(ps, &pmcstat_string_hash[i], ps_next, pstmp) { LIST_REMOVE(ps, ps_next); free(ps->ps_string); free(ps); } } /* * Determine whether a given executable image is an A.OUT object, and * if so, fill in its parameters from the text file. * Sets image->pi_type. */ static void pmcstat_image_get_aout_params(struct pmcstat_image *image) { int fd; ssize_t nbytes; struct exec ex; const char *path; char buffer[PATH_MAX]; path = pmcstat_string_unintern(image->pi_execpath); assert(path != NULL); if (image->pi_iskernelmodule) errx(EX_SOFTWARE, "ERROR: a.out kernel modules are unsupported \"%s\"", path); (void) snprintf(buffer, sizeof(buffer), "%s%s", args.pa_fsroot, path); if ((fd = open(buffer, O_RDONLY, 0)) < 0 || (nbytes = read(fd, &ex, sizeof(ex))) < 0) { if (args.pa_verbosity >= 2) warn("WARNING: Cannot determine type of \"%s\"", path); image->pi_type = PMCSTAT_IMAGE_INDETERMINABLE; if (fd != -1) (void) close(fd); return; } (void) close(fd); if ((unsigned) nbytes != sizeof(ex) || N_BADMAG(ex)) return; image->pi_type = PMCSTAT_IMAGE_AOUT; /* TODO: the rest of a.out processing */ return; } /* * Helper function. */ static int pmcstat_symbol_compare(const void *a, const void *b) { const struct pmcstat_symbol *sym1, *sym2; sym1 = (const struct pmcstat_symbol *) a; sym2 = (const struct pmcstat_symbol *) b; if (sym1->ps_end <= sym2->ps_start) return (-1); if (sym1->ps_start >= sym2->ps_end) return (1); return (0); } /* * Map an address to a symbol in an image. */ struct pmcstat_symbol * pmcstat_symbol_search(struct pmcstat_image *image, uintfptr_t addr) { struct pmcstat_symbol sym; if (image->pi_symbols == NULL) return (NULL); sym.ps_name = NULL; sym.ps_start = addr; sym.ps_end = addr + 1; return (bsearch((void *) &sym, image->pi_symbols, image->pi_symcount, sizeof(struct pmcstat_symbol), pmcstat_symbol_compare)); } /* * Add the list of symbols in the given section to the list associated * with the object. */ static void pmcstat_image_add_symbols(struct pmcstat_image *image, Elf *e, Elf_Scn *scn, GElf_Shdr *sh) { int firsttime; size_t n, newsyms, nshsyms, nfuncsyms; struct pmcstat_symbol *symptr; char *fnname; GElf_Sym sym; Elf_Data *data; if ((data = elf_getdata(scn, NULL)) == NULL) return; /* * Determine the number of functions named in this * section. */ nshsyms = sh->sh_size / sh->sh_entsize; for (n = nfuncsyms = 0; n < nshsyms; n++) { if (gelf_getsym(data, (int) n, &sym) != &sym) return; if (GELF_ST_TYPE(sym.st_info) == STT_FUNC) nfuncsyms++; } if (nfuncsyms == 0) return; /* * Allocate space for the new entries. */ firsttime = image->pi_symbols == NULL; symptr = realloc(image->pi_symbols, sizeof(*symptr) * (image->pi_symcount + nfuncsyms)); if (symptr == image->pi_symbols) /* realloc() failed. */ return; image->pi_symbols = symptr; /* * Append new symbols to the end of the current table. */ symptr += image->pi_symcount; for (n = newsyms = 0; n < nshsyms; n++) { if (gelf_getsym(data, (int) n, &sym) != &sym) return; if (GELF_ST_TYPE(sym.st_info) != STT_FUNC) continue; if (sym.st_shndx == STN_UNDEF) continue; if (!firsttime && pmcstat_symbol_search(image, sym.st_value)) continue; /* We've seen this symbol already. */ if ((fnname = elf_strptr(e, sh->sh_link, sym.st_name)) == NULL) continue; #ifdef __arm__ /* Remove spurious ARM function name. */ if (fnname[0] == '$' && (fnname[1] == 'a' || fnname[1] == 't' || fnname[1] == 'd') && fnname[2] == '\0') continue; #endif symptr->ps_name = pmcstat_string_intern(fnname); symptr->ps_start = sym.st_value - image->pi_vaddr; symptr->ps_end = symptr->ps_start + sym.st_size; symptr++; newsyms++; } image->pi_symcount += newsyms; if (image->pi_symcount == 0) return; assert(newsyms <= nfuncsyms); /* * Return space to the system if there were duplicates. */ if (newsyms < nfuncsyms) image->pi_symbols = realloc(image->pi_symbols, sizeof(*symptr) * image->pi_symcount); /* * Keep the list of symbols sorted. */ qsort(image->pi_symbols, image->pi_symcount, sizeof(*symptr), pmcstat_symbol_compare); /* * Deal with function symbols that have a size of 'zero' by * making them extend to the next higher address. These * symbols are usually defined in assembly code. */ for (symptr = image->pi_symbols; symptr < image->pi_symbols + (image->pi_symcount - 1); symptr++) if (symptr->ps_start == symptr->ps_end) symptr->ps_end = (symptr+1)->ps_start; } /* * Examine an ELF file to determine the size of its text segment. * Sets image->pi_type if anything conclusive can be determined about * this image. */ static void pmcstat_image_get_elf_params(struct pmcstat_image *image) { int fd; size_t i, nph, nsh; const char *path, *elfbase; char *p, *endp; uintfptr_t minva, maxva; Elf *e; Elf_Scn *scn; GElf_Ehdr eh; GElf_Phdr ph; GElf_Shdr sh; enum pmcstat_image_type image_type; char buffer[PATH_MAX]; assert(image->pi_type == PMCSTAT_IMAGE_UNKNOWN); image->pi_start = minva = ~(uintfptr_t) 0; image->pi_end = maxva = (uintfptr_t) 0; image->pi_type = image_type = PMCSTAT_IMAGE_INDETERMINABLE; image->pi_isdynamic = 0; image->pi_dynlinkerpath = NULL; image->pi_vaddr = 0; path = pmcstat_string_unintern(image->pi_execpath); assert(path != NULL); /* * Look for kernel modules under FSROOT/KERNELPATH/NAME, * and user mode executable objects under FSROOT/PATHNAME. */ if (image->pi_iskernelmodule) (void) snprintf(buffer, sizeof(buffer), "%s%s/%s", args.pa_fsroot, args.pa_kernel, path); else (void) snprintf(buffer, sizeof(buffer), "%s%s", args.pa_fsroot, path); e = NULL; if ((fd = open(buffer, O_RDONLY, 0)) < 0 || (e = elf_begin(fd, ELF_C_READ, NULL)) == NULL || (elf_kind(e) != ELF_K_ELF)) { if (args.pa_verbosity >= 2) warnx("WARNING: Cannot determine the type of \"%s\".", buffer); goto done; } if (gelf_getehdr(e, &eh) != &eh) { warnx( "WARNING: Cannot retrieve the ELF Header for \"%s\": %s.", buffer, elf_errmsg(-1)); goto done; } if (eh.e_type != ET_EXEC && eh.e_type != ET_DYN && !(image->pi_iskernelmodule && eh.e_type == ET_REL)) { warnx("WARNING: \"%s\" is of an unsupported ELF type.", buffer); goto done; } image_type = eh.e_ident[EI_CLASS] == ELFCLASS32 ? PMCSTAT_IMAGE_ELF32 : PMCSTAT_IMAGE_ELF64; /* * Determine the virtual address where an executable would be * loaded. Additionally, for dynamically linked executables, * save the pathname to the runtime linker. */ if (eh.e_type == ET_EXEC) { if (elf_getphnum(e, &nph) == 0) { warnx( "WARNING: Could not determine the number of program headers in \"%s\": %s.", buffer, elf_errmsg(-1)); goto done; } for (i = 0; i < eh.e_phnum; i++) { if (gelf_getphdr(e, i, &ph) != &ph) { warnx( "WARNING: Retrieval of PHDR entry #%ju in \"%s\" failed: %s.", (uintmax_t) i, buffer, elf_errmsg(-1)); goto done; } switch (ph.p_type) { case PT_DYNAMIC: image->pi_isdynamic = 1; break; case PT_INTERP: if ((elfbase = elf_rawfile(e, NULL)) == NULL) { warnx( "WARNING: Cannot retrieve the interpreter for \"%s\": %s.", buffer, elf_errmsg(-1)); goto done; } image->pi_dynlinkerpath = pmcstat_string_intern(elfbase + ph.p_offset); break; case PT_LOAD: if ((ph.p_offset & (-ph.p_align)) == 0) image->pi_vaddr = ph.p_vaddr & (-ph.p_align); break; } } } /* * Get the min and max VA associated with this ELF object. */ if (elf_getshnum(e, &nsh) == 0) { warnx( "WARNING: Could not determine the number of sections for \"%s\": %s.", buffer, elf_errmsg(-1)); goto done; } for (i = 0; i < nsh; i++) { if ((scn = elf_getscn(e, i)) == NULL || gelf_getshdr(scn, &sh) != &sh) { warnx( "WARNING: Could not retrieve section header #%ju in \"%s\": %s.", (uintmax_t) i, buffer, elf_errmsg(-1)); goto done; } if (sh.sh_flags & SHF_EXECINSTR) { minva = min(minva, sh.sh_addr); maxva = max(maxva, sh.sh_addr + sh.sh_size); } if (sh.sh_type == SHT_SYMTAB || sh.sh_type == SHT_DYNSYM) pmcstat_image_add_symbols(image, e, scn, &sh); } image->pi_start = minva; image->pi_end = maxva; image->pi_type = image_type; image->pi_fullpath = pmcstat_string_intern(buffer); /* Build display name */ endp = buffer; for (p = buffer; *p; p++) if (*p == '/') endp = p+1; image->pi_name = pmcstat_string_intern(endp); done: (void) elf_end(e); if (fd >= 0) (void) close(fd); return; } /* * Given an image descriptor, determine whether it is an ELF, or AOUT. * If no handler claims the image, set its type to 'INDETERMINABLE'. */ void pmcstat_image_determine_type(struct pmcstat_image *image) { assert(image->pi_type == PMCSTAT_IMAGE_UNKNOWN); /* Try each kind of handler in turn */ if (image->pi_type == PMCSTAT_IMAGE_UNKNOWN) pmcstat_image_get_elf_params(image); if (image->pi_type == PMCSTAT_IMAGE_UNKNOWN) pmcstat_image_get_aout_params(image); /* * Otherwise, remember that we tried to determine * the object's type and had failed. */ if (image->pi_type == PMCSTAT_IMAGE_UNKNOWN) image->pi_type = PMCSTAT_IMAGE_INDETERMINABLE; } /* * Locate an image descriptor given an interned path, adding a fresh * descriptor to the cache if necessary. This function also finds a * suitable name for this image's sample file. * * We defer filling in the file format specific parts of the image * structure till the time we actually see a sample that would fall * into this image. */ static struct pmcstat_image * pmcstat_image_from_path(pmcstat_interned_string internedpath, int iskernelmodule) { int hash; struct pmcstat_image *pi; hash = pmcstat_string_lookup_hash(internedpath); /* First, look for an existing entry. */ LIST_FOREACH(pi, &pmcstat_image_hash[hash], pi_next) if (pi->pi_execpath == internedpath && pi->pi_iskernelmodule == iskernelmodule) return (pi); /* * Allocate a new entry and place it at the head of the hash * and LRU lists. */ pi = malloc(sizeof(*pi)); if (pi == NULL) return (NULL); pi->pi_type = PMCSTAT_IMAGE_UNKNOWN; pi->pi_execpath = internedpath; pi->pi_start = ~0; pi->pi_end = 0; pi->pi_entry = 0; pi->pi_vaddr = 0; pi->pi_isdynamic = 0; pi->pi_iskernelmodule = iskernelmodule; pi->pi_dynlinkerpath = NULL; pi->pi_symbols = NULL; pi->pi_symcount = 0; pi->pi_addr2line = NULL; if (plugins[args.pa_pplugin].pl_initimage != NULL) plugins[args.pa_pplugin].pl_initimage(pi); if (plugins[args.pa_plugin].pl_initimage != NULL) plugins[args.pa_plugin].pl_initimage(pi); LIST_INSERT_HEAD(&pmcstat_image_hash[hash], pi, pi_next); return (pi); } /* * Record the fact that PC values from 'start' to 'end' come from * image 'image'. */ static void pmcstat_image_link(struct pmcstat_process *pp, struct pmcstat_image *image, uintfptr_t start) { struct pmcstat_pcmap *pcm, *pcmnew; uintfptr_t offset; assert(image->pi_type != PMCSTAT_IMAGE_UNKNOWN && image->pi_type != PMCSTAT_IMAGE_INDETERMINABLE); if ((pcmnew = malloc(sizeof(*pcmnew))) == NULL) err(EX_OSERR, "ERROR: Cannot create a map entry"); /* * Adjust the map entry to only cover the text portion * of the object. */ offset = start - image->pi_vaddr; pcmnew->ppm_lowpc = image->pi_start + offset; pcmnew->ppm_highpc = image->pi_end + offset; pcmnew->ppm_image = image; assert(pcmnew->ppm_lowpc < pcmnew->ppm_highpc); /* Overlapped mmap()'s are assumed to never occur. */ TAILQ_FOREACH(pcm, &pp->pp_map, ppm_next) if (pcm->ppm_lowpc >= pcmnew->ppm_highpc) break; if (pcm == NULL) TAILQ_INSERT_TAIL(&pp->pp_map, pcmnew, ppm_next); else TAILQ_INSERT_BEFORE(pcm, pcmnew, ppm_next); } /* * Unmap images in the range [start..end) associated with process * 'pp'. */ static void pmcstat_image_unmap(struct pmcstat_process *pp, uintfptr_t start, uintfptr_t end) { struct pmcstat_pcmap *pcm, *pcmtmp, *pcmnew; assert(pp != NULL); assert(start < end); /* * Cases: * - we could have the range completely in the middle of an * existing pcmap; in this case we have to split the pcmap * structure into two (i.e., generate a 'hole'). * - we could have the range covering multiple pcmaps; these * will have to be removed. * - we could have either 'start' or 'end' falling in the * middle of a pcmap; in this case shorten the entry. */ TAILQ_FOREACH_SAFE(pcm, &pp->pp_map, ppm_next, pcmtmp) { assert(pcm->ppm_lowpc < pcm->ppm_highpc); if (pcm->ppm_highpc <= start) continue; if (pcm->ppm_lowpc >= end) return; if (pcm->ppm_lowpc >= start && pcm->ppm_highpc <= end) { /* * The current pcmap is completely inside the * unmapped range: remove it entirely. */ TAILQ_REMOVE(&pp->pp_map, pcm, ppm_next); free(pcm); } else if (pcm->ppm_lowpc < start && pcm->ppm_highpc > end) { /* * Split this pcmap into two; curtail the * current map to end at [start-1], and start * the new one at [end]. */ if ((pcmnew = malloc(sizeof(*pcmnew))) == NULL) err(EX_OSERR, "ERROR: Cannot split a map entry"); pcmnew->ppm_image = pcm->ppm_image; pcmnew->ppm_lowpc = end; pcmnew->ppm_highpc = pcm->ppm_highpc; pcm->ppm_highpc = start; TAILQ_INSERT_AFTER(&pp->pp_map, pcm, pcmnew, ppm_next); return; } else if (pcm->ppm_lowpc < start && pcm->ppm_highpc <= end) pcm->ppm_highpc = start; else if (pcm->ppm_lowpc >= start && pcm->ppm_highpc > end) pcm->ppm_lowpc = end; else assert(0); } } /* * Resolve file name and line number for the given address. */ int pmcstat_image_addr2line(struct pmcstat_image *image, uintfptr_t addr, char *sourcefile, size_t sourcefile_len, unsigned *sourceline, char *funcname, size_t funcname_len) { static int addr2line_warn = 0; unsigned l; char *sep, cmdline[PATH_MAX], imagepath[PATH_MAX]; int fd; if (image->pi_addr2line == NULL) { snprintf(imagepath, sizeof(imagepath), "%s%s.symbols", args.pa_fsroot, pmcstat_string_unintern(image->pi_fullpath)); fd = open(imagepath, O_RDONLY); if (fd < 0) { snprintf(imagepath, sizeof(imagepath), "%s%s", args.pa_fsroot, pmcstat_string_unintern(image->pi_fullpath)); } else close(fd); /* * New addr2line support recursive inline function with -i * but the format does not add a marker when no more entries * are available. */ snprintf(cmdline, sizeof(cmdline), "addr2line -Cfe \"%s\"", imagepath); image->pi_addr2line = popen(cmdline, "r+"); if (image->pi_addr2line == NULL) { if (!addr2line_warn) { addr2line_warn = 1; warnx( "WARNING: addr2line is needed for source code information." ); } return (0); } } if (feof(image->pi_addr2line) || ferror(image->pi_addr2line)) { warnx("WARNING: addr2line pipe error"); pclose(image->pi_addr2line); image->pi_addr2line = NULL; return (0); } fprintf(image->pi_addr2line, "%p\n", (void *)addr); if (fgets(funcname, funcname_len, image->pi_addr2line) == NULL) { warnx("WARNING: addr2line function name read error"); return (0); } sep = strchr(funcname, '\n'); if (sep != NULL) *sep = '\0'; if (fgets(sourcefile, sourcefile_len, image->pi_addr2line) == NULL) { warnx("WARNING: addr2line source file read error"); return (0); } sep = strchr(sourcefile, ':'); if (sep == NULL) { warnx("WARNING: addr2line source line separator missing"); return (0); } *sep = '\0'; l = atoi(sep+1); if (l == 0) return (0); *sourceline = l; return (1); } /* * Add a {pmcid,name} mapping. */ static void pmcstat_pmcid_add(pmc_id_t pmcid, pmcstat_interned_string ps) { struct pmcstat_pmcrecord *pr, *prm; /* Replace an existing name for the PMC. */ prm = NULL; LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) if (pr->pr_pmcid == pmcid) { pr->pr_pmcname = ps; return; } else if (pr->pr_pmcname == ps) prm = pr; /* * Otherwise, allocate a new descriptor and call the * plugins hook. */ if ((pr = malloc(sizeof(*pr))) == NULL) err(EX_OSERR, "ERROR: Cannot allocate pmc record"); pr->pr_pmcid = pmcid; pr->pr_pmcname = ps; pr->pr_pmcin = pmcstat_npmcs++; pr->pr_samples = 0; pr->pr_dubious_frames = 0; pr->pr_merge = prm == NULL ? pr : prm; LIST_INSERT_HEAD(&pmcstat_pmcs, pr, pr_next); if (plugins[args.pa_pplugin].pl_newpmc != NULL) plugins[args.pa_pplugin].pl_newpmc(ps, pr); if (plugins[args.pa_plugin].pl_newpmc != NULL) plugins[args.pa_plugin].pl_newpmc(ps, pr); } /* * Given a pmcid in use, find its human-readable name. */ const char * pmcstat_pmcid_to_name(pmc_id_t pmcid) { struct pmcstat_pmcrecord *pr; LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) if (pr->pr_pmcid == pmcid) return (pmcstat_string_unintern(pr->pr_pmcname)); return NULL; } /* * Convert PMC index to name. */ const char * pmcstat_pmcindex_to_name(int pmcin) { struct pmcstat_pmcrecord *pr; LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) if (pr->pr_pmcin == pmcin) return pmcstat_string_unintern(pr->pr_pmcname); return NULL; } /* * Return PMC record with given index. */ struct pmcstat_pmcrecord * pmcstat_pmcindex_to_pmcr(int pmcin) { struct pmcstat_pmcrecord *pr; LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) if (pr->pr_pmcin == pmcin) return pr; return NULL; } /* * Get PMC record by id, apply merge policy. */ static struct pmcstat_pmcrecord * pmcstat_lookup_pmcid(pmc_id_t pmcid) { struct pmcstat_pmcrecord *pr; LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) { if (pr->pr_pmcid == pmcid) { if (pmcstat_mergepmc) return pr->pr_merge; return pr; } } return NULL; } /* * Associate an AOUT image with a process. */ static void pmcstat_process_aout_exec(struct pmcstat_process *pp, struct pmcstat_image *image, uintfptr_t entryaddr) { (void) pp; (void) image; (void) entryaddr; /* TODO Implement a.out handling */ } /* * Associate an ELF image with a process. */ static void pmcstat_process_elf_exec(struct pmcstat_process *pp, struct pmcstat_image *image, uintfptr_t entryaddr) { uintmax_t libstart; struct pmcstat_image *rtldimage; assert(image->pi_type == PMCSTAT_IMAGE_ELF32 || image->pi_type == PMCSTAT_IMAGE_ELF64); /* Create a map entry for the base executable. */ pmcstat_image_link(pp, image, image->pi_vaddr); /* * For dynamically linked executables we need to determine * where the dynamic linker was mapped to for this process, * Subsequent executable objects that are mapped in by the * dynamic linker will be tracked by log events of type * PMCLOG_TYPE_MAP_IN. */ if (image->pi_isdynamic) { /* * The runtime loader gets loaded just after the maximum * possible heap address. Like so: * * [ TEXT DATA BSS HEAP -->*RTLD SHLIBS <--STACK] * ^ ^ * 0 VM_MAXUSER_ADDRESS * * The exact address where the loader gets mapped in * will vary according to the size of the executable * and the limits on the size of the process'es data * segment at the time of exec(). The entry address * recorded at process exec time corresponds to the * 'start' address inside the dynamic linker. From * this we can figure out the address where the * runtime loader's file object had been mapped to. */ rtldimage = pmcstat_image_from_path(image->pi_dynlinkerpath, 0); if (rtldimage == NULL) { warnx("WARNING: Cannot find image for \"%s\".", pmcstat_string_unintern(image->pi_dynlinkerpath)); pmcstat_stats.ps_exec_errors++; return; } if (rtldimage->pi_type == PMCSTAT_IMAGE_UNKNOWN) pmcstat_image_get_elf_params(rtldimage); if (rtldimage->pi_type != PMCSTAT_IMAGE_ELF32 && rtldimage->pi_type != PMCSTAT_IMAGE_ELF64) { warnx("WARNING: rtld not an ELF object \"%s\".", pmcstat_string_unintern(image->pi_dynlinkerpath)); return; } libstart = entryaddr - rtldimage->pi_entry; pmcstat_image_link(pp, rtldimage, libstart); } } /* * Find the process descriptor corresponding to a PID. If 'allocate' * is zero, we return a NULL if a pid descriptor could not be found or * a process descriptor process. If 'allocate' is non-zero, then we * will attempt to allocate a fresh process descriptor. Zombie * process descriptors are only removed if a fresh allocation for the * same PID is requested. */ static struct pmcstat_process * pmcstat_process_lookup(pid_t pid, int allocate) { uint32_t hash; struct pmcstat_pcmap *ppm, *ppmtmp; struct pmcstat_process *pp, *pptmp; hash = (uint32_t) pid & PMCSTAT_HASH_MASK; /* simplicity wins */ LIST_FOREACH_SAFE(pp, &pmcstat_process_hash[hash], pp_next, pptmp) if (pp->pp_pid == pid) { /* Found a descriptor, check and process zombies */ if (allocate && pp->pp_isactive == 0) { /* remove maps */ TAILQ_FOREACH_SAFE(ppm, &pp->pp_map, ppm_next, ppmtmp) { TAILQ_REMOVE(&pp->pp_map, ppm, ppm_next); free(ppm); } /* remove process entry */ LIST_REMOVE(pp, pp_next); free(pp); break; } return (pp); } if (!allocate) return (NULL); if ((pp = malloc(sizeof(*pp))) == NULL) err(EX_OSERR, "ERROR: Cannot allocate pid descriptor"); pp->pp_pid = pid; pp->pp_isactive = 1; TAILQ_INIT(&pp->pp_map); LIST_INSERT_HEAD(&pmcstat_process_hash[hash], pp, pp_next); return (pp); } /* * Associate an image and a process. */ static void pmcstat_process_exec(struct pmcstat_process *pp, pmcstat_interned_string path, uintfptr_t entryaddr) { struct pmcstat_image *image; if ((image = pmcstat_image_from_path(path, 0)) == NULL) { pmcstat_stats.ps_exec_errors++; return; } if (image->pi_type == PMCSTAT_IMAGE_UNKNOWN) pmcstat_image_determine_type(image); assert(image->pi_type != PMCSTAT_IMAGE_UNKNOWN); switch (image->pi_type) { case PMCSTAT_IMAGE_ELF32: case PMCSTAT_IMAGE_ELF64: pmcstat_stats.ps_exec_elf++; pmcstat_process_elf_exec(pp, image, entryaddr); break; case PMCSTAT_IMAGE_AOUT: pmcstat_stats.ps_exec_aout++; pmcstat_process_aout_exec(pp, image, entryaddr); break; case PMCSTAT_IMAGE_INDETERMINABLE: pmcstat_stats.ps_exec_indeterminable++; break; default: err(EX_SOFTWARE, "ERROR: Unsupported executable type for \"%s\"", pmcstat_string_unintern(path)); } } /* * Find the map entry associated with process 'p' at PC value 'pc'. */ struct pmcstat_pcmap * pmcstat_process_find_map(struct pmcstat_process *p, uintfptr_t pc) { struct pmcstat_pcmap *ppm; TAILQ_FOREACH(ppm, &p->pp_map, ppm_next) { if (pc >= ppm->ppm_lowpc && pc < ppm->ppm_highpc) return (ppm); if (pc < ppm->ppm_lowpc) return (NULL); } return (NULL); } /* * Convert a hwpmc(4) log to profile information. A system-wide * callgraph is generated if FLAG_DO_CALLGRAPHS is set. gmon.out * files usable by gprof(1) are created if FLAG_DO_GPROF is set. */ static int pmcstat_analyze_log(void) { uint32_t cpu, cpuflags; uintfptr_t pc; pid_t pid; struct pmcstat_image *image; struct pmcstat_process *pp, *ppnew; struct pmcstat_pcmap *ppm, *ppmtmp; struct pmclog_ev ev; struct pmcstat_pmcrecord *pmcr; pmcstat_interned_string image_path; assert(args.pa_flags & FLAG_DO_ANALYSIS); if (elf_version(EV_CURRENT) == EV_NONE) err(EX_UNAVAILABLE, "Elf library intialization failed"); while (pmclog_read(args.pa_logparser, &ev) == 0) { assert(ev.pl_state == PMCLOG_OK); switch (ev.pl_type) { case PMCLOG_TYPE_INITIALIZE: if ((ev.pl_u.pl_i.pl_version & 0xFF000000) != PMC_VERSION_MAJOR << 24 && args.pa_verbosity > 0) warnx( "WARNING: Log version 0x%x does not match compiled version 0x%x.", ev.pl_u.pl_i.pl_version, PMC_VERSION_MAJOR); break; case PMCLOG_TYPE_MAP_IN: /* * Introduce an address range mapping for a * userland process or the kernel (pid == -1). * * We always allocate a process descriptor so * that subsequent samples seen for this * address range are mapped to the current * object being mapped in. */ pid = ev.pl_u.pl_mi.pl_pid; if (pid == -1) pp = pmcstat_kernproc; else pp = pmcstat_process_lookup(pid, PMCSTAT_ALLOCATE); assert(pp != NULL); image_path = pmcstat_string_intern(ev.pl_u.pl_mi. pl_pathname); image = pmcstat_image_from_path(image_path, pid == -1); if (image->pi_type == PMCSTAT_IMAGE_UNKNOWN) pmcstat_image_determine_type(image); if (image->pi_type != PMCSTAT_IMAGE_INDETERMINABLE) pmcstat_image_link(pp, image, ev.pl_u.pl_mi.pl_start); break; case PMCLOG_TYPE_MAP_OUT: /* * Remove an address map. */ pid = ev.pl_u.pl_mo.pl_pid; if (pid == -1) pp = pmcstat_kernproc; else pp = pmcstat_process_lookup(pid, 0); if (pp == NULL) /* unknown process */ break; pmcstat_image_unmap(pp, ev.pl_u.pl_mo.pl_start, ev.pl_u.pl_mo.pl_end); break; case PMCLOG_TYPE_PCSAMPLE: /* * Note: the `PCSAMPLE' log entry is not * generated by hpwmc(4) after version 2. */ /* * We bring in the gmon file for the image * currently associated with the PMC & pid * pair and increment the appropriate entry * bin inside this. */ pmcstat_stats.ps_samples_total++; ps_samples_period++; pc = ev.pl_u.pl_s.pl_pc; pp = pmcstat_process_lookup(ev.pl_u.pl_s.pl_pid, PMCSTAT_ALLOCATE); /* Get PMC record. */ pmcr = pmcstat_lookup_pmcid(ev.pl_u.pl_s.pl_pmcid); assert(pmcr != NULL); pmcr->pr_samples++; /* * Call the plugins processing * TODO: move pmcstat_process_find_map inside plugins */ if (plugins[args.pa_pplugin].pl_process != NULL) plugins[args.pa_pplugin].pl_process( pp, pmcr, 1, &pc, pmcstat_process_find_map(pp, pc) != NULL, 0); plugins[args.pa_plugin].pl_process( pp, pmcr, 1, &pc, pmcstat_process_find_map(pp, pc) != NULL, 0); break; case PMCLOG_TYPE_CALLCHAIN: pmcstat_stats.ps_samples_total++; ps_samples_period++; cpuflags = ev.pl_u.pl_cc.pl_cpuflags; cpu = PMC_CALLCHAIN_CPUFLAGS_TO_CPU(cpuflags); /* Filter on the CPU id. */ if (!CPU_ISSET(cpu, &(args.pa_cpumask))) { pmcstat_stats.ps_samples_skipped++; break; } pp = pmcstat_process_lookup(ev.pl_u.pl_cc.pl_pid, PMCSTAT_ALLOCATE); /* Get PMC record. */ pmcr = pmcstat_lookup_pmcid(ev.pl_u.pl_cc.pl_pmcid); assert(pmcr != NULL); pmcr->pr_samples++; /* * Call the plugins processing */ if (plugins[args.pa_pplugin].pl_process != NULL) plugins[args.pa_pplugin].pl_process( pp, pmcr, ev.pl_u.pl_cc.pl_npc, ev.pl_u.pl_cc.pl_pc, PMC_CALLCHAIN_CPUFLAGS_TO_USERMODE(cpuflags), cpu); plugins[args.pa_plugin].pl_process( pp, pmcr, ev.pl_u.pl_cc.pl_npc, ev.pl_u.pl_cc.pl_pc, PMC_CALLCHAIN_CPUFLAGS_TO_USERMODE(cpuflags), cpu); break; case PMCLOG_TYPE_PMCALLOCATE: /* * Record the association pmc id between this * PMC and its name. */ pmcstat_pmcid_add(ev.pl_u.pl_a.pl_pmcid, pmcstat_string_intern(ev.pl_u.pl_a.pl_evname)); break; case PMCLOG_TYPE_PMCALLOCATEDYN: /* * Record the association pmc id between this * PMC and its name. */ pmcstat_pmcid_add(ev.pl_u.pl_ad.pl_pmcid, pmcstat_string_intern(ev.pl_u.pl_ad.pl_evname)); break; case PMCLOG_TYPE_PROCEXEC: /* * Change the executable image associated with * a process. */ pp = pmcstat_process_lookup(ev.pl_u.pl_x.pl_pid, PMCSTAT_ALLOCATE); /* delete the current process map */ TAILQ_FOREACH_SAFE(ppm, &pp->pp_map, ppm_next, ppmtmp) { TAILQ_REMOVE(&pp->pp_map, ppm, ppm_next); free(ppm); } - /* associate this process image */ + /* + * Associate this process image. + */ image_path = pmcstat_string_intern( ev.pl_u.pl_x.pl_pathname); assert(image_path != NULL); pmcstat_process_exec(pp, image_path, ev.pl_u.pl_x.pl_entryaddr); break; case PMCLOG_TYPE_PROCEXIT: /* * Due to the way the log is generated, the * last few samples corresponding to a process * may appear in the log after the process * exit event is recorded. Thus we keep the * process' descriptor and associated data * structures around, but mark the process as * having exited. */ pp = pmcstat_process_lookup(ev.pl_u.pl_e.pl_pid, 0); if (pp == NULL) break; pp->pp_isactive = 0; /* mark as a zombie */ break; case PMCLOG_TYPE_SYSEXIT: pp = pmcstat_process_lookup(ev.pl_u.pl_se.pl_pid, 0); if (pp == NULL) break; pp->pp_isactive = 0; /* make a zombie */ break; case PMCLOG_TYPE_PROCFORK: /* * Allocate a process descriptor for the new * (child) process. */ ppnew = pmcstat_process_lookup(ev.pl_u.pl_f.pl_newpid, PMCSTAT_ALLOCATE); /* * If we had been tracking the parent, clone * its address maps. */ pp = pmcstat_process_lookup(ev.pl_u.pl_f.pl_oldpid, 0); if (pp == NULL) break; TAILQ_FOREACH(ppm, &pp->pp_map, ppm_next) pmcstat_image_link(ppnew, ppm->ppm_image, ppm->ppm_lowpc); break; default: /* other types of entries are not relevant */ break; } } if (ev.pl_state == PMCLOG_EOF) return (PMCSTAT_FINISHED); else if (ev.pl_state == PMCLOG_REQUIRE_DATA) return (PMCSTAT_RUNNING); err(EX_DATAERR, "ERROR: event parsing failed (record %jd, offset 0x%jx)", (uintmax_t) ev.pl_count + 1, ev.pl_offset); } /* * Print log entries as text. */ static int pmcstat_print_log(void) { struct pmclog_ev ev; uint32_t npc; while (pmclog_read(args.pa_logparser, &ev) == 0) { assert(ev.pl_state == PMCLOG_OK); switch (ev.pl_type) { case PMCLOG_TYPE_CALLCHAIN: PMCSTAT_PRINT_ENTRY("callchain", "%d 0x%x %d %d %c", ev.pl_u.pl_cc.pl_pid, ev.pl_u.pl_cc.pl_pmcid, PMC_CALLCHAIN_CPUFLAGS_TO_CPU(ev.pl_u.pl_cc. \ pl_cpuflags), ev.pl_u.pl_cc.pl_npc, PMC_CALLCHAIN_CPUFLAGS_TO_USERMODE(ev.pl_u.pl_cc.\ pl_cpuflags) ? 'u' : 's'); for (npc = 0; npc < ev.pl_u.pl_cc.pl_npc; npc++) PMCSTAT_PRINT_ENTRY("...", "%p", (void *) ev.pl_u.pl_cc.pl_pc[npc]); break; case PMCLOG_TYPE_CLOSELOG: PMCSTAT_PRINT_ENTRY("closelog",); break; case PMCLOG_TYPE_DROPNOTIFY: PMCSTAT_PRINT_ENTRY("drop",); break; case PMCLOG_TYPE_INITIALIZE: PMCSTAT_PRINT_ENTRY("initlog","0x%x \"%s\"", ev.pl_u.pl_i.pl_version, pmc_name_of_cputype(ev.pl_u.pl_i.pl_arch)); if ((ev.pl_u.pl_i.pl_version & 0xFF000000) != PMC_VERSION_MAJOR << 24 && args.pa_verbosity > 0) warnx( "WARNING: Log version 0x%x != expected version 0x%x.", ev.pl_u.pl_i.pl_version, PMC_VERSION); break; case PMCLOG_TYPE_MAP_IN: PMCSTAT_PRINT_ENTRY("map-in","%d %p \"%s\"", ev.pl_u.pl_mi.pl_pid, (void *) ev.pl_u.pl_mi.pl_start, ev.pl_u.pl_mi.pl_pathname); break; case PMCLOG_TYPE_MAP_OUT: PMCSTAT_PRINT_ENTRY("map-out","%d %p %p", ev.pl_u.pl_mo.pl_pid, (void *) ev.pl_u.pl_mo.pl_start, (void *) ev.pl_u.pl_mo.pl_end); break; case PMCLOG_TYPE_PCSAMPLE: PMCSTAT_PRINT_ENTRY("sample","0x%x %d %p %c", ev.pl_u.pl_s.pl_pmcid, ev.pl_u.pl_s.pl_pid, (void *) ev.pl_u.pl_s.pl_pc, ev.pl_u.pl_s.pl_usermode ? 'u' : 's'); break; case PMCLOG_TYPE_PMCALLOCATE: PMCSTAT_PRINT_ENTRY("allocate","0x%x \"%s\" 0x%x", ev.pl_u.pl_a.pl_pmcid, ev.pl_u.pl_a.pl_evname, ev.pl_u.pl_a.pl_flags); break; case PMCLOG_TYPE_PMCALLOCATEDYN: PMCSTAT_PRINT_ENTRY("allocatedyn","0x%x \"%s\" 0x%x", ev.pl_u.pl_ad.pl_pmcid, ev.pl_u.pl_ad.pl_evname, ev.pl_u.pl_ad.pl_flags); break; case PMCLOG_TYPE_PMCATTACH: PMCSTAT_PRINT_ENTRY("attach","0x%x %d \"%s\"", ev.pl_u.pl_t.pl_pmcid, ev.pl_u.pl_t.pl_pid, ev.pl_u.pl_t.pl_pathname); break; case PMCLOG_TYPE_PMCDETACH: PMCSTAT_PRINT_ENTRY("detach","0x%x %d", ev.pl_u.pl_d.pl_pmcid, ev.pl_u.pl_d.pl_pid); break; case PMCLOG_TYPE_PROCCSW: PMCSTAT_PRINT_ENTRY("cswval","0x%x %d %jd", ev.pl_u.pl_c.pl_pmcid, ev.pl_u.pl_c.pl_pid, ev.pl_u.pl_c.pl_value); break; case PMCLOG_TYPE_PROCEXEC: PMCSTAT_PRINT_ENTRY("exec","0x%x %d %p \"%s\"", ev.pl_u.pl_x.pl_pmcid, ev.pl_u.pl_x.pl_pid, (void *) ev.pl_u.pl_x.pl_entryaddr, ev.pl_u.pl_x.pl_pathname); break; case PMCLOG_TYPE_PROCEXIT: PMCSTAT_PRINT_ENTRY("exitval","0x%x %d %jd", ev.pl_u.pl_e.pl_pmcid, ev.pl_u.pl_e.pl_pid, ev.pl_u.pl_e.pl_value); break; case PMCLOG_TYPE_PROCFORK: PMCSTAT_PRINT_ENTRY("fork","%d %d", ev.pl_u.pl_f.pl_oldpid, ev.pl_u.pl_f.pl_newpid); break; case PMCLOG_TYPE_USERDATA: PMCSTAT_PRINT_ENTRY("userdata","0x%x", ev.pl_u.pl_u.pl_userdata); break; case PMCLOG_TYPE_SYSEXIT: PMCSTAT_PRINT_ENTRY("exit","%d", ev.pl_u.pl_se.pl_pid); break; default: fprintf(args.pa_printfile, "unknown event (type %d).\n", ev.pl_type); } } if (ev.pl_state == PMCLOG_EOF) return (PMCSTAT_FINISHED); else if (ev.pl_state == PMCLOG_REQUIRE_DATA) return (PMCSTAT_RUNNING); errx(EX_DATAERR, "ERROR: event parsing failed (record %jd, offset 0x%jx).", (uintmax_t) ev.pl_count + 1, ev.pl_offset); /*NOTREACHED*/ } /* * Public Interfaces. */ /* * Close a logfile, after first flushing all in-module queued data. */ int pmcstat_close_log(void) { /* If a local logfile is configured ask the kernel to stop * and flush data. Kernel will close the file when data is flushed * so keep the status to EXITING. */ if (args.pa_logfd != -1) { if (pmc_close_logfile() < 0) err(EX_OSERR, "ERROR: logging failed"); } return (args.pa_flags & FLAG_HAS_PIPE ? PMCSTAT_EXITING : PMCSTAT_FINISHED); } /* * Open a log file, for reading or writing. * * The function returns the fd of a successfully opened log or -1 in * case of failure. */ int pmcstat_open_log(const char *path, int mode) { int error, fd, cfd; size_t hlen; const char *p, *errstr; struct addrinfo hints, *res, *res0; char hostname[MAXHOSTNAMELEN]; errstr = NULL; fd = -1; /* * If 'path' is "-" then open one of stdin or stdout depending * on the value of 'mode'. * * If 'path' contains a ':' and does not start with a '/' or '.', * and is being opened for writing, treat it as a "host:port" * specification and open a network socket. * * Otherwise, treat 'path' as a file name and open that. */ if (path[0] == '-' && path[1] == '\0') fd = (mode == PMCSTAT_OPEN_FOR_READ) ? 0 : 1; else if (path[0] != '/' && path[0] != '.' && strchr(path, ':') != NULL) { p = strrchr(path, ':'); hlen = p - path; if (p == path || hlen >= sizeof(hostname)) { errstr = strerror(EINVAL); goto done; } assert(hlen < sizeof(hostname)); (void) strncpy(hostname, path, hlen); hostname[hlen] = '\0'; (void) memset(&hints, 0, sizeof(hints)); hints.ai_family = AF_UNSPEC; hints.ai_socktype = SOCK_STREAM; if ((error = getaddrinfo(hostname, p+1, &hints, &res0)) != 0) { errstr = gai_strerror(error); goto done; } fd = -1; for (res = res0; res; res = res->ai_next) { if ((fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol)) < 0) { errstr = strerror(errno); continue; } if (mode == PMCSTAT_OPEN_FOR_READ) { if (bind(fd, res->ai_addr, res->ai_addrlen) < 0) { errstr = strerror(errno); (void) close(fd); fd = -1; continue; } listen(fd, 1); cfd = accept(fd, NULL, NULL); (void) close(fd); if (cfd < 0) { errstr = strerror(errno); fd = -1; break; } fd = cfd; } else { if (connect(fd, res->ai_addr, res->ai_addrlen) < 0) { errstr = strerror(errno); (void) close(fd); fd = -1; continue; } } errstr = NULL; break; } freeaddrinfo(res0); } else if ((fd = open(path, mode == PMCSTAT_OPEN_FOR_READ ? O_RDONLY : (O_WRONLY|O_CREAT|O_TRUNC), S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH)) < 0) errstr = strerror(errno); done: if (errstr) errx(EX_OSERR, "ERROR: Cannot open \"%s\" for %s: %s.", path, (mode == PMCSTAT_OPEN_FOR_READ ? "reading" : "writing"), errstr); return (fd); } /* * Process a log file in offline analysis mode. */ int pmcstat_process_log(void) { /* * If analysis has not been asked for, just print the log to * the current output file. */ if (args.pa_flags & FLAG_DO_PRINT) return (pmcstat_print_log()); else return (pmcstat_analyze_log()); } /* * Refresh top display. */ static void pmcstat_refresh_top(void) { int v_attrs; float v; char pmcname[40]; struct pmcstat_pmcrecord *pmcpr; /* If in pause mode do not refresh display. */ if (pmcstat_pause) return; /* Wait until PMC pop in the log. */ pmcpr = pmcstat_pmcindex_to_pmcr(pmcstat_pmcinfilter); if (pmcpr == NULL) return; /* Format PMC name. */ if (pmcstat_mergepmc) snprintf(pmcname, sizeof(pmcname), "[%s]", pmcstat_string_unintern(pmcpr->pr_pmcname)); else snprintf(pmcname, sizeof(pmcname), "%s.%d", pmcstat_string_unintern(pmcpr->pr_pmcname), pmcstat_pmcinfilter); /* Format samples count. */ if (ps_samples_period > 0) v = (pmcpr->pr_samples * 100.0) / ps_samples_period; else v = 0.; v_attrs = PMCSTAT_ATTRPERCENT(v); PMCSTAT_PRINTBEGIN(); PMCSTAT_PRINTW("PMC: %s Samples: %u ", pmcname, pmcpr->pr_samples); PMCSTAT_ATTRON(v_attrs); PMCSTAT_PRINTW("(%.1f%%) ", v); PMCSTAT_ATTROFF(v_attrs); PMCSTAT_PRINTW(", %u unresolved\n\n", pmcpr->pr_dubious_frames); if (plugins[args.pa_plugin].pl_topdisplay != NULL) plugins[args.pa_plugin].pl_topdisplay(); PMCSTAT_PRINTEND(); } /* * Find the next pmc index to display. */ static void pmcstat_changefilter(void) { int pmcin; struct pmcstat_pmcrecord *pmcr; /* * Find the next merge target. */ if (pmcstat_mergepmc) { pmcin = pmcstat_pmcinfilter; do { pmcr = pmcstat_pmcindex_to_pmcr(pmcstat_pmcinfilter); if (pmcr == NULL || pmcr == pmcr->pr_merge) break; pmcstat_pmcinfilter++; if (pmcstat_pmcinfilter >= pmcstat_npmcs) pmcstat_pmcinfilter = 0; } while (pmcstat_pmcinfilter != pmcin); } } /* * Top mode keypress. */ int pmcstat_keypress_log(void) { int c, ret = 0; WINDOW *w; w = newwin(1, 0, 1, 0); c = wgetch(w); wprintw(w, "Key: %c => ", c); switch (c) { case 'c': wprintw(w, "enter mode 'd' or 'a' => "); c = wgetch(w); if (c == 'd') { args.pa_topmode = PMCSTAT_TOP_DELTA; wprintw(w, "switching to delta mode"); } else { args.pa_topmode = PMCSTAT_TOP_ACCUM; wprintw(w, "switching to accumulation mode"); } break; case 'm': pmcstat_mergepmc = !pmcstat_mergepmc; /* * Changing merge state require data reset. */ if (plugins[args.pa_plugin].pl_shutdown != NULL) plugins[args.pa_plugin].pl_shutdown(NULL); pmcstat_stats_reset(0); if (plugins[args.pa_plugin].pl_init != NULL) plugins[args.pa_plugin].pl_init(); /* Update filter to be on a merge target. */ pmcstat_changefilter(); wprintw(w, "merge PMC %s", pmcstat_mergepmc ? "on" : "off"); break; case 'n': /* Close current plugin. */ if (plugins[args.pa_plugin].pl_shutdown != NULL) plugins[args.pa_plugin].pl_shutdown(NULL); /* Find next top display available. */ do { args.pa_plugin++; if (plugins[args.pa_plugin].pl_name == NULL) args.pa_plugin = 0; } while (plugins[args.pa_plugin].pl_topdisplay == NULL); /* Open new plugin. */ pmcstat_stats_reset(0); if (plugins[args.pa_plugin].pl_init != NULL) plugins[args.pa_plugin].pl_init(); wprintw(w, "switching to plugin %s", plugins[args.pa_plugin].pl_name); break; case 'p': pmcstat_pmcinfilter++; if (pmcstat_pmcinfilter >= pmcstat_npmcs) pmcstat_pmcinfilter = 0; pmcstat_changefilter(); wprintw(w, "switching to PMC %s.%d", pmcstat_pmcindex_to_name(pmcstat_pmcinfilter), pmcstat_pmcinfilter); break; case ' ': pmcstat_pause = !pmcstat_pause; if (pmcstat_pause) wprintw(w, "pause => press space again to continue"); break; case 'q': wprintw(w, "exiting..."); ret = 1; break; default: if (plugins[args.pa_plugin].pl_topkeypress != NULL) if (plugins[args.pa_plugin].pl_topkeypress(c, w)) ret = 1; } wrefresh(w); delwin(w); return ret; } /* * Top mode display. */ void pmcstat_display_log(void) { pmcstat_refresh_top(); /* Reset everythings if delta mode. */ if (args.pa_topmode == PMCSTAT_TOP_DELTA) { if (plugins[args.pa_plugin].pl_shutdown != NULL) plugins[args.pa_plugin].pl_shutdown(NULL); pmcstat_stats_reset(0); if (plugins[args.pa_plugin].pl_init != NULL) plugins[args.pa_plugin].pl_init(); } } /* * Configure a plugins. */ void pmcstat_pluginconfigure_log(char *opt) { if (strncmp(opt, "threshold=", 10) == 0) { pmcstat_threshold = atof(opt+10); } else { if (plugins[args.pa_plugin].pl_configure != NULL) { if (!plugins[args.pa_plugin].pl_configure(opt)) err(EX_USAGE, "ERROR: unknown option <%s>.", opt); } } } /* * Initialize module. */ void pmcstat_initialize_logging(void) { int i; /* use a convenient format for 'ldd' output */ if (setenv("LD_TRACE_LOADED_OBJECTS_FMT1","%o \"%p\" %x\n",1) != 0) err(EX_OSERR, "ERROR: Cannot setenv"); /* Initialize hash tables */ pmcstat_string_initialize(); for (i = 0; i < PMCSTAT_NHASH; i++) { LIST_INIT(&pmcstat_image_hash[i]); LIST_INIT(&pmcstat_process_hash[i]); } /* * Create a fake 'process' entry for the kernel with pid -1. * hwpmc(4) will subsequently inform us about where the kernel * and any loaded kernel modules are mapped. */ if ((pmcstat_kernproc = pmcstat_process_lookup((pid_t) -1, PMCSTAT_ALLOCATE)) == NULL) err(EX_OSERR, "ERROR: Cannot initialize logging"); /* PMC count. */ pmcstat_npmcs = 0; /* Merge PMC with same name. */ pmcstat_mergepmc = args.pa_mergepmc; /* * Initialize plugins */ if (plugins[args.pa_pplugin].pl_init != NULL) plugins[args.pa_pplugin].pl_init(); if (plugins[args.pa_plugin].pl_init != NULL) plugins[args.pa_plugin].pl_init(); } /* * Shutdown module. */ void pmcstat_shutdown_logging(void) { int i; FILE *mf; struct pmcstat_image *pi, *pitmp; struct pmcstat_process *pp, *pptmp; struct pmcstat_pcmap *ppm, *ppmtmp; /* determine where to send the map file */ mf = NULL; if (args.pa_mapfilename != NULL) mf = (strcmp(args.pa_mapfilename, "-") == 0) ? args.pa_printfile : fopen(args.pa_mapfilename, "w"); if (mf == NULL && args.pa_flags & FLAG_DO_GPROF && args.pa_verbosity >= 2) mf = args.pa_printfile; if (mf) (void) fprintf(mf, "MAP:\n"); /* * Shutdown the plugins */ if (plugins[args.pa_plugin].pl_shutdown != NULL) plugins[args.pa_plugin].pl_shutdown(mf); if (plugins[args.pa_pplugin].pl_shutdown != NULL) plugins[args.pa_pplugin].pl_shutdown(mf); for (i = 0; i < PMCSTAT_NHASH; i++) { LIST_FOREACH_SAFE(pi, &pmcstat_image_hash[i], pi_next, pitmp) { if (plugins[args.pa_plugin].pl_shutdownimage != NULL) plugins[args.pa_plugin].pl_shutdownimage(pi); if (plugins[args.pa_pplugin].pl_shutdownimage != NULL) plugins[args.pa_pplugin].pl_shutdownimage(pi); free(pi->pi_symbols); if (pi->pi_addr2line != NULL) pclose(pi->pi_addr2line); LIST_REMOVE(pi, pi_next); free(pi); } LIST_FOREACH_SAFE(pp, &pmcstat_process_hash[i], pp_next, pptmp) { TAILQ_FOREACH_SAFE(ppm, &pp->pp_map, ppm_next, ppmtmp) { TAILQ_REMOVE(&pp->pp_map, ppm, ppm_next); free(ppm); } LIST_REMOVE(pp, pp_next); free(pp); } } pmcstat_string_shutdown(); /* * Print errors unless -q was specified. Print all statistics * if verbosity > 1. */ #define PRINT(N,V) do { \ if (pmcstat_stats.ps_##V || args.pa_verbosity >= 2) \ (void) fprintf(args.pa_printfile, " %-40s %d\n",\ N, pmcstat_stats.ps_##V); \ } while (0) if (args.pa_verbosity >= 1 && (args.pa_flags & FLAG_DO_ANALYSIS)) { (void) fprintf(args.pa_printfile, "CONVERSION STATISTICS:\n"); PRINT("#exec/a.out", exec_aout); PRINT("#exec/elf", exec_elf); PRINT("#exec/unknown", exec_indeterminable); PRINT("#exec handling errors", exec_errors); PRINT("#samples/total", samples_total); PRINT("#samples/unclaimed", samples_unknown_offset); PRINT("#samples/unknown-object", samples_indeterminable); PRINT("#samples/unknown-function", samples_unknown_function); PRINT("#callchain/dubious-frames", callchain_dubious_frames); } if (mf) (void) fclose(mf); } Index: projects/release-arm-redux/usr.sbin/pw/Makefile =================================================================== --- projects/release-arm-redux/usr.sbin/pw/Makefile (revision 282691) +++ projects/release-arm-redux/usr.sbin/pw/Makefile (revision 282692) @@ -1,19 +1,19 @@ # $FreeBSD$ PROG= pw MAN= pw.conf.5 pw.8 SRCS= pw.c pw_conf.c pw_user.c pw_group.c pw_log.c pw_nis.c pw_vpw.c \ grupd.c pwupd.c fileupd.c psdate.c \ bitmap.c cpdir.c rm_r.c WARNS?= 2 -LIBADD= crypt util +LIBADD= crypt util sbuf .include .if ${MK_TESTS} != "no" SUBDIR+= tests .endif .include Index: projects/release-arm-redux/usr.sbin/pw/fileupd.c =================================================================== --- projects/release-arm-redux/usr.sbin/pw/fileupd.c (revision 282691) +++ projects/release-arm-redux/usr.sbin/pw/fileupd.c (revision 282692) @@ -1,68 +1,47 @@ /*- * Copyright (C) 1996 * David L. Nugent. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY DAVID L. NUGENT AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL DAVID L. NUGENT OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #ifndef lint static const char rcsid[] = "$FreeBSD$"; #endif /* not lint */ -#include -#include #include -#include -#include -#include -#include -#include -#include #include "pwupd.h" - -int -extendline(char **buf, int * buflen, int needed) -{ - if (needed > *buflen) { - char *tmp = realloc(*buf, needed); - if (tmp == NULL) - return -1; - *buf = tmp; - *buflen = needed; - } - return *buflen; -} int extendarray(char ***buf, int * buflen, int needed) { if (needed > *buflen) { char **tmp = realloc(*buf, needed * sizeof(char *)); if (tmp == NULL) return -1; *buf = tmp; *buflen = needed; } return *buflen; } Index: projects/release-arm-redux/usr.sbin/pw/grupd.c =================================================================== --- projects/release-arm-redux/usr.sbin/pw/grupd.c (revision 282691) +++ projects/release-arm-redux/usr.sbin/pw/grupd.c (revision 282692) @@ -1,128 +1,124 @@ /*- * Copyright (C) 1996 * David L. Nugent. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY DAVID L. NUGENT AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL DAVID L. NUGENT OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #ifndef lint static const char rcsid[] = "$FreeBSD$"; #endif /* not lint */ #include #include #include #include #include #include -#include -#include -#include -#include #include #include "pwupd.h" static char * grpath = _PATH_PWD; int setgrdir(const char * dir) { if (dir == NULL) return -1; else grpath = strdup(dir); if (grpath == NULL) return -1; return 0; } char * getgrpath(const char * file) { static char pathbuf[MAXPATHLEN]; snprintf(pathbuf, sizeof pathbuf, "%s/%s", grpath, file); return pathbuf; } static int gr_update(struct group * grp, char const * group) { int pfd, tfd; struct group *gr = NULL; struct group *old_gr = NULL; if (grp != NULL) gr = gr_dup(grp); if (group != NULL) old_gr = GETGRNAM(group); if (gr_init(grpath, NULL)) err(1, "gr_init()"); if ((pfd = gr_lock()) == -1) { gr_fini(); err(1, "gr_lock()"); } if ((tfd = gr_tmp(-1)) == -1) { gr_fini(); err(1, "gr_tmp()"); } if (gr_copy(pfd, tfd, gr, old_gr) == -1) { gr_fini(); err(1, "gr_copy()"); } if (gr_mkdb() == -1) { gr_fini(); err(1, "gr_mkdb()"); } free(gr); gr_fini(); return 0; } int addgrent(struct group * grp) { return gr_update(grp, NULL); } int chggrent(char const * login, struct group * grp) { return gr_update(grp, login); } int delgrent(struct group * grp) { char group[MAXLOGNAME]; strlcpy(group, grp->gr_name, MAXLOGNAME); return gr_update(NULL, group); } Index: projects/release-arm-redux/usr.sbin/pw/pw_conf.c =================================================================== --- projects/release-arm-redux/usr.sbin/pw/pw_conf.c (revision 282691) +++ projects/release-arm-redux/usr.sbin/pw/pw_conf.c (revision 282692) @@ -1,503 +1,501 @@ /*- * Copyright (C) 1996 * David L. Nugent. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY DAVID L. NUGENT AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL DAVID L. NUGENT OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #ifndef lint static const char rcsid[] = "$FreeBSD$"; #endif /* not lint */ +#include +#include #include #include #include #include "pw.h" #define debugging 0 enum { _UC_NONE, _UC_DEFAULTPWD, _UC_REUSEUID, _UC_REUSEGID, _UC_NISPASSWD, _UC_DOTDIR, _UC_NEWMAIL, _UC_LOGFILE, _UC_HOMEROOT, _UC_HOMEMODE, _UC_SHELLPATH, _UC_SHELLS, _UC_DEFAULTSHELL, _UC_DEFAULTGROUP, _UC_EXTRAGROUPS, _UC_DEFAULTCLASS, _UC_MINUID, _UC_MAXUID, _UC_MINGID, _UC_MAXGID, _UC_EXPIRE, _UC_PASSWORD, _UC_FIELDS }; static char bourne_shell[] = "sh"; static char *system_shells[_UC_MAXSHELLS] = { bourne_shell, "csh", "tcsh" }; static char const *booltrue[] = { "yes", "true", "1", "on", NULL }; static char const *boolfalse[] = { "no", "false", "0", "off", NULL }; static struct userconf config = { 0, /* Default password for new users? (nologin) */ 0, /* Reuse uids? */ 0, /* Reuse gids? */ NULL, /* NIS version of the passwd file */ "/usr/share/skel", /* Where to obtain skeleton files */ NULL, /* Mail to send to new accounts */ "/var/log/userlog", /* Where to log changes */ "/home", /* Where to create home directory */ _DEF_DIRMODE, /* Home directory perms, modified by umask */ "/bin", /* Where shells are located */ system_shells, /* List of shells (first is default) */ bourne_shell, /* Default shell */ NULL, /* Default group name */ NULL, /* Default (additional) groups */ NULL, /* Default login class */ 1000, 32000, /* Allowed range of uids */ 1000, 32000, /* Allowed range of gids */ 0, /* Days until account expires */ 0, /* Days until password expires */ 0 /* size of default_group array */ }; static char const *comments[_UC_FIELDS] = { "#\n# pw.conf - user/group configuration defaults\n#\n", "\n# Password for new users? no=nologin yes=loginid none=blank random=random\n", "\n# Reuse gaps in uid sequence? (yes or no)\n", "\n# Reuse gaps in gid sequence? (yes or no)\n", "\n# Path to the NIS passwd file (blank or 'no' for none)\n", "\n# Obtain default dotfiles from this directory\n", "\n# Mail this file to new user (/etc/newuser.msg or no)\n", "\n# Log add/change/remove information in this file\n", "\n# Root directory in which $HOME directory is created\n", "\n# Mode for the new $HOME directory, will be modified by umask\n", "\n# Colon separated list of directories containing valid shells\n", "\n# Comma separated list of available shells (without paths)\n", "\n# Default shell (without path)\n", "\n# Default group (leave blank for new group per user)\n", "\n# Extra groups for new users\n", "\n# Default login class for new users\n", "\n# Range of valid default user ids\n", NULL, "\n# Range of valid default group ids\n", NULL, "\n# Days after which account expires (0=disabled)\n", "\n# Days after which password expires (0=disabled)\n" }; static char const *kwds[] = { "", "defaultpasswd", "reuseuids", "reusegids", "nispasswd", "skeleton", "newmail", "logfile", "home", "homemode", "shellpath", "shells", "defaultshell", "defaultgroup", "extragroups", "defaultclass", "minuid", "maxuid", "mingid", "maxgid", "expire_days", "password_days", NULL }; static char * unquote(char const * str) { if (str && (*str == '"' || *str == '\'')) { char *p = strchr(str + 1, *str); if (p != NULL) *p = '\0'; return (char *) (*++str ? str : NULL); } return (char *) str; } int boolean_val(char const * str, int dflt) { if ((str = unquote(str)) != NULL) { int i; for (i = 0; booltrue[i]; i++) if (strcmp(str, booltrue[i]) == 0) return 1; for (i = 0; boolfalse[i]; i++) if (strcmp(str, boolfalse[i]) == 0) return 0; /* * Special cases for defaultpassword */ if (strcmp(str, "random") == 0) return -1; if (strcmp(str, "none") == 0) return -2; } return dflt; } char const * boolean_str(int val) { if (val == -1) return "random"; else if (val == -2) return "none"; else return val ? booltrue[0] : boolfalse[0]; } char * newstr(char const * p) { char *q = NULL; if ((p = unquote(p)) != NULL) { int l = strlen(p) + 1; if ((q = malloc(l)) != NULL) memcpy(q, p, l); } return q; } #define LNBUFSZ 1024 struct userconf * read_userconfig(char const * file) { FILE *fp; char *buf, *p; size_t linecap; ssize_t linelen; buf = NULL; linecap = 0; extendarray(&config.groups, &config.numgroups, 200); memset(config.groups, 0, config.numgroups * sizeof(char *)); if (file == NULL) file = _PATH_PW_CONF; if ((fp = fopen(file, "r")) != NULL) { while ((linelen = getline(&buf, &linecap, fp)) > 0) { if (*buf && (p = strtok(buf, " \t\r\n=")) != NULL && *p != '#') { static char const toks[] = " \t\r\n,="; char *q = strtok(NULL, toks); int i = 0; mode_t *modeset; while (i < _UC_FIELDS && strcmp(p, kwds[i]) != 0) ++i; #if debugging if (i == _UC_FIELDS) printf("Got unknown kwd `%s' val=`%s'\n", p, q ? q : ""); else printf("Got kwd[%s]=%s\n", p, q); #endif switch (i) { case _UC_DEFAULTPWD: config.default_password = boolean_val(q, 1); break; case _UC_REUSEUID: config.reuse_uids = boolean_val(q, 0); break; case _UC_REUSEGID: config.reuse_gids = boolean_val(q, 0); break; case _UC_NISPASSWD: config.nispasswd = (q == NULL || !boolean_val(q, 1)) ? NULL : newstr(q); break; case _UC_DOTDIR: config.dotdir = (q == NULL || !boolean_val(q, 1)) ? NULL : newstr(q); break; case _UC_NEWMAIL: config.newmail = (q == NULL || !boolean_val(q, 1)) ? NULL : newstr(q); break; case _UC_LOGFILE: config.logfile = (q == NULL || !boolean_val(q, 1)) ? NULL : newstr(q); break; case _UC_HOMEROOT: config.home = (q == NULL || !boolean_val(q, 1)) ? "/home" : newstr(q); break; case _UC_HOMEMODE: modeset = setmode(q); config.homemode = (q == NULL || !boolean_val(q, 1)) ? _DEF_DIRMODE : getmode(modeset, _DEF_DIRMODE); free(modeset); break; case _UC_SHELLPATH: config.shelldir = (q == NULL || !boolean_val(q, 1)) ? "/bin" : newstr(q); break; case _UC_SHELLS: for (i = 0; i < _UC_MAXSHELLS && q != NULL; i++, q = strtok(NULL, toks)) system_shells[i] = newstr(q); if (i > 0) while (i < _UC_MAXSHELLS) system_shells[i++] = NULL; break; case _UC_DEFAULTSHELL: config.shell_default = (q == NULL || !boolean_val(q, 1)) ? (char *) bourne_shell : newstr(q); break; case _UC_DEFAULTGROUP: q = unquote(q); config.default_group = (q == NULL || !boolean_val(q, 1) || GETGRNAM(q) == NULL) ? NULL : newstr(q); break; case _UC_EXTRAGROUPS: for (i = 0; q != NULL; q = strtok(NULL, toks)) { if (extendarray(&config.groups, &config.numgroups, i + 2) != -1) config.groups[i++] = newstr(q); } if (i > 0) while (i < config.numgroups) config.groups[i++] = NULL; break; case _UC_DEFAULTCLASS: config.default_class = (q == NULL || !boolean_val(q, 1)) ? NULL : newstr(q); break; case _UC_MINUID: if ((q = unquote(q)) != NULL && isdigit(*q)) config.min_uid = (uid_t) atol(q); break; case _UC_MAXUID: if ((q = unquote(q)) != NULL && isdigit(*q)) config.max_uid = (uid_t) atol(q); break; case _UC_MINGID: if ((q = unquote(q)) != NULL && isdigit(*q)) config.min_gid = (gid_t) atol(q); break; case _UC_MAXGID: if ((q = unquote(q)) != NULL && isdigit(*q)) config.max_gid = (gid_t) atol(q); break; case _UC_EXPIRE: if ((q = unquote(q)) != NULL && isdigit(*q)) config.expire_days = atoi(q); break; case _UC_PASSWORD: if ((q = unquote(q)) != NULL && isdigit(*q)) config.password_days = atoi(q); break; case _UC_FIELDS: case _UC_NONE: break; } } } if (linecap > 0) free(buf); fclose(fp); } return &config; } int write_userconfig(char const * file) { int fd; + struct sbuf *buf; if (file == NULL) file = _PATH_PW_CONF; if ((fd = open(file, O_CREAT | O_RDWR | O_TRUNC | O_EXLOCK, 0644)) != -1) { FILE *fp; if ((fp = fdopen(fd, "w")) == NULL) close(fd); else { - int i, j, k; - int len = LNBUFSZ; - char *buf = malloc(len); - + int i, j; + + buf = sbuf_new_auto(); for (i = _UC_NONE; i < _UC_FIELDS; i++) { int quote = 1; - char const *val = buf; - *buf = '\0'; + sbuf_clear(buf); switch (i) { case _UC_DEFAULTPWD: - val = boolean_str(config.default_password); + sbuf_cat(buf, boolean_str(config.default_password)); break; case _UC_REUSEUID: - val = boolean_str(config.reuse_uids); + sbuf_cat(buf, boolean_str(config.reuse_uids)); break; case _UC_REUSEGID: - val = boolean_str(config.reuse_gids); + sbuf_cat(buf, boolean_str(config.reuse_gids)); break; case _UC_NISPASSWD: - val = config.nispasswd ? config.nispasswd : ""; + sbuf_cat(buf, config.nispasswd ? + config.nispasswd : ""); quote = 0; break; case _UC_DOTDIR: - val = config.dotdir ? config.dotdir : boolean_str(0); + sbuf_cat(buf, config.dotdir ? + config.dotdir : boolean_str(0)); break; case _UC_NEWMAIL: - val = config.newmail ? config.newmail : boolean_str(0); + sbuf_cat(buf, config.newmail ? + config.newmail : boolean_str(0)); break; case _UC_LOGFILE: - val = config.logfile ? config.logfile : boolean_str(0); + sbuf_cat(buf, config.logfile ? + config.logfile : boolean_str(0)); break; case _UC_HOMEROOT: - val = config.home; + sbuf_cat(buf, config.home); break; case _UC_HOMEMODE: - sprintf(buf, "%04o", config.homemode); + sbuf_printf(buf, "%04o", config.homemode); quote = 0; break; case _UC_SHELLPATH: - val = config.shelldir; + sbuf_cat(buf, config.shelldir); break; case _UC_SHELLS: - for (j = k = 0; j < _UC_MAXSHELLS && system_shells[j] != NULL; j++) { - char lbuf[64]; - int l = snprintf(lbuf, sizeof lbuf, "%s\"%s\"", k ? "," : "", system_shells[j]); - if (l < 0) - l = 0; - if (l + k + 1 < len || extendline(&buf, &len, len + LNBUFSZ) != -1) { - strcpy(buf + k, lbuf); - k += l; - } + for (j = 0; j < _UC_MAXSHELLS && + system_shells[j] != NULL; j++) { + sbuf_printf(buf, "%s\"%s\"", j ? + "," : "", system_shells[j]); } quote = 0; break; case _UC_DEFAULTSHELL: - val = config.shell_default ? config.shell_default : bourne_shell; + sbuf_cat(buf, config.shell_default ? + config.shell_default : bourne_shell); break; case _UC_DEFAULTGROUP: - val = config.default_group ? config.default_group : ""; + sbuf_cat(buf, config.default_group ? + config.default_group : ""); break; case _UC_EXTRAGROUPS: extendarray(&config.groups, &config.numgroups, 200); - for (j = k = 0; j < config.numgroups && config.groups[j] != NULL; j++) { - char lbuf[64]; - int l = snprintf(lbuf, sizeof lbuf, "%s\"%s\"", k ? "," : "", config.groups[j]); - if (l < 0) - l = 0; - if (l + k + 1 < len || extendline(&buf, &len, len + 1024) != -1) { - strcpy(buf + k, lbuf); - k += l; - } - } + for (j = 0; j < config.numgroups && + config.groups[j] != NULL; j++) + sbuf_printf(buf, "%s\"%s\"", j ? + "," : "", config.groups[j]); quote = 0; break; case _UC_DEFAULTCLASS: - val = config.default_class ? config.default_class : ""; + sbuf_cat(buf, config.default_class ? + config.default_class : ""); break; case _UC_MINUID: - sprintf(buf, "%lu", (unsigned long) config.min_uid); + sbuf_printf(buf, "%lu", (unsigned long) config.min_uid); quote = 0; break; case _UC_MAXUID: - sprintf(buf, "%lu", (unsigned long) config.max_uid); + sbuf_printf(buf, "%lu", (unsigned long) config.max_uid); quote = 0; break; case _UC_MINGID: - sprintf(buf, "%lu", (unsigned long) config.min_gid); + sbuf_printf(buf, "%lu", (unsigned long) config.min_gid); quote = 0; break; case _UC_MAXGID: - sprintf(buf, "%lu", (unsigned long) config.max_gid); + sbuf_printf(buf, "%lu", (unsigned long) config.max_gid); quote = 0; break; case _UC_EXPIRE: - sprintf(buf, "%d", config.expire_days); + sbuf_printf(buf, "%d", config.expire_days); quote = 0; break; case _UC_PASSWORD: - sprintf(buf, "%d", config.password_days); + sbuf_printf(buf, "%d", config.password_days); quote = 0; break; case _UC_NONE: break; } + sbuf_finish(buf); if (comments[i]) fputs(comments[i], fp); if (*kwds[i]) { if (quote) - fprintf(fp, "%s = \"%s\"\n", kwds[i], val); + fprintf(fp, "%s = \"%s\"\n", kwds[i], sbuf_data(buf)); else - fprintf(fp, "%s = %s\n", kwds[i], val); + fprintf(fp, "%s = %s\n", kwds[i], sbuf_data(buf)); #if debugging - printf("WROTE: %s = %s\n", kwds[i], val); + printf("WROTE: %s = %s\n", kwds[i], sbuf_data(buf)); #endif } } - free(buf); + sbuf_delete(buf); return fclose(fp) != EOF; } } return 0; } Index: projects/release-arm-redux/usr.sbin/pw/pw_nis.c =================================================================== --- projects/release-arm-redux/usr.sbin/pw/pw_nis.c (revision 282691) +++ projects/release-arm-redux/usr.sbin/pw/pw_nis.c (revision 282692) @@ -1,96 +1,93 @@ /*- * Copyright (C) 1996 * David L. Nugent. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY DAVID L. NUGENT AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL DAVID L. NUGENT OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #ifndef lint static const char rcsid[] = "$FreeBSD$"; #endif /* not lint */ -#include -#include -#include #include #include #include #include #include "pw.h" static int pw_nisupdate(const char * path, struct passwd * pwd, char const * user) { int pfd, tfd; struct passwd *pw = NULL; struct passwd *old_pw = NULL; if (pwd != NULL) pw = pw_dup(pwd); if (user != NULL) old_pw = GETPWNAM(user); if (pw_init(NULL, path)) err(1,"pw_init()"); if ((pfd = pw_lock()) == -1) { pw_fini(); err(1, "pw_lock()"); } if ((tfd = pw_tmp(-1)) == -1) { pw_fini(); err(1, "pw_tmp()"); } if (pw_copy(pfd, tfd, pw, old_pw) == -1) { pw_fini(); err(1, "pw_copy()"); } if (chmod(pw_tempname(), 0644) == -1) err(1, "chmod()"); if (rename(pw_tempname(), path) == -1) err(1, "rename()"); free(pw); pw_fini(); return (0); } int addnispwent(const char *path, struct passwd * pwd) { return pw_nisupdate(path, pwd, NULL); } int chgnispwent(const char *path, char const * login, struct passwd * pwd) { return pw_nisupdate(path, pwd, login); } int delnispwent(const char *path, const char *login) { return pw_nisupdate(path, NULL, login); } Index: projects/release-arm-redux/usr.sbin/pw/pw_user.c =================================================================== --- projects/release-arm-redux/usr.sbin/pw/pw_user.c (revision 282691) +++ projects/release-arm-redux/usr.sbin/pw/pw_user.c (revision 282692) @@ -1,1345 +1,1341 @@ /*- * Copyright (C) 1996 * David L. Nugent. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY DAVID L. NUGENT AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL DAVID L. NUGENT OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #ifndef lint static const char rcsid[] = "$FreeBSD$"; #endif /* not lint */ #include #include #include #include #include #include #include #include #include #include -#include #include #include #include #include #include "pw.h" #include "bitmap.h" #define LOGNAMESIZE (MAXLOGNAME-1) static char locked_str[] = "*LOCKED*"; static int print_user(struct passwd * pwd, int pretty, int v7); static uid_t pw_uidpolicy(struct userconf * cnf, struct cargs * args); static uid_t pw_gidpolicy(struct userconf * cnf, struct cargs * args, char *nam, gid_t prefer); static time_t pw_pwdpolicy(struct userconf * cnf, struct cargs * args); static time_t pw_exppolicy(struct userconf * cnf, struct cargs * args); static char *pw_homepolicy(struct userconf * cnf, struct cargs * args, char const * user); static char *pw_shellpolicy(struct userconf * cnf, struct cargs * args, char *newshell); static char *pw_password(struct userconf * cnf, struct cargs * args, char const * user); static char *shell_path(char const * path, char *shells[], char *sh); static void rmat(uid_t uid); static void rmopie(char const * name); /*- * -C config configuration file * -q quiet operation * -n name login name * -u uid user id * -c comment user name/comment * -d directory home directory * -e date account expiry date * -p date password expiry date * -g grp primary group * -G grp1,grp2 additional groups * -m [ -k dir ] create and set up home * -s shell name of login shell * -o duplicate uid ok * -L class user class * -l name new login name * -h fd password filehandle * -H fd encrypted password filehandle * -F force print or add * Setting defaults: * -D set user defaults * -b dir default home root dir * -e period default expiry period * -p period default password change period * -g group default group * -G grp1,grp2.. default additional groups * -L class default login class * -k dir default home skeleton * -s shell default shell * -w method default password method */ int pw_user(struct userconf * cnf, int mode, struct cargs * args) { int rc, edited = 0; char *p = NULL; char *passtmp; struct carg *a_name; struct carg *a_uid; struct carg *arg; struct passwd *pwd = NULL; struct group *grp; struct stat st; char line[_PASSWORD_LEN+1]; FILE *fp; char *dmode_c; void *set = NULL; static struct passwd fakeuser = { NULL, "*", -1, -1, 0, "", "User &", "/nonexistent", "/bin/sh", 0 #if defined(__FreeBSD__) ,0 #endif }; /* * With M_NEXT, we only need to return the * next uid to stdout */ if (mode == M_NEXT) { uid_t next = pw_uidpolicy(cnf, args); if (getarg(args, 'q')) return next; printf("%ld:", (long)next); pw_group(cnf, mode, args); return EXIT_SUCCESS; } /* * We can do all of the common legwork here */ if ((arg = getarg(args, 'b')) != NULL) { cnf->home = arg->val; } if ((arg = getarg(args, 'M')) != NULL) { dmode_c = arg->val; if ((set = setmode(dmode_c)) == NULL) errx(EX_DATAERR, "invalid directory creation mode '%s'", dmode_c); cnf->homemode = getmode(set, _DEF_DIRMODE); free(set); } /* * If we'll need to use it or we're updating it, * then create the base home directory if necessary */ if (arg != NULL || getarg(args, 'm') != NULL) { int l = strlen(cnf->home); if (l > 1 && cnf->home[l-1] == '/') /* Shave off any trailing path delimiter */ cnf->home[--l] = '\0'; if (l < 2 || *cnf->home != '/') /* Check for absolute path name */ errx(EX_DATAERR, "invalid base directory for home '%s'", cnf->home); if (stat(cnf->home, &st) == -1) { char dbuf[MAXPATHLEN]; /* * This is a kludge especially for Joerg :) * If the home directory would be created in the root partition, then * we really create it under /usr which is likely to have more space. * But we create a symlink from cnf->home -> "/usr" -> cnf->home */ if (strchr(cnf->home+1, '/') == NULL) { - strcpy(dbuf, "/usr"); - strncat(dbuf, cnf->home, MAXPATHLEN-5); + snprintf(dbuf, MAXPATHLEN, "/usr%s", cnf->home); if (mkdir(dbuf, _DEF_DIRMODE) != -1 || errno == EEXIST) { chown(dbuf, 0, 0); /* * Skip first "/" and create symlink: * /home -> usr/home */ symlink(dbuf+1, cnf->home); } /* If this falls, fall back to old method */ } strlcpy(dbuf, cnf->home, sizeof(dbuf)); p = dbuf; if (stat(dbuf, &st) == -1) { while ((p = strchr(p + 1, '/')) != NULL) { *p = '\0'; if (stat(dbuf, &st) == -1) { if (mkdir(dbuf, _DEF_DIRMODE) == -1) goto direrr; chown(dbuf, 0, 0); } else if (!S_ISDIR(st.st_mode)) errx(EX_OSFILE, "'%s' (root home parent) is not a directory", dbuf); *p = '/'; } } if (stat(dbuf, &st) == -1) { if (mkdir(dbuf, _DEF_DIRMODE) == -1) { direrr: err(EX_OSFILE, "mkdir '%s'", dbuf); } chown(dbuf, 0, 0); } } else if (!S_ISDIR(st.st_mode)) errx(EX_OSFILE, "root home `%s' is not a directory", cnf->home); } if ((arg = getarg(args, 'e')) != NULL) cnf->expire_days = atoi(arg->val); if ((arg = getarg(args, 'y')) != NULL) cnf->nispasswd = arg->val; if ((arg = getarg(args, 'p')) != NULL && arg->val) cnf->password_days = atoi(arg->val); if ((arg = getarg(args, 'g')) != NULL) { if (!*(p = arg->val)) /* Handle empty group list specially */ cnf->default_group = ""; else { if ((grp = GETGRNAM(p)) == NULL) { if (!isdigit((unsigned char)*p) || (grp = GETGRGID((gid_t) atoi(p))) == NULL) errx(EX_NOUSER, "group `%s' does not exist", p); } cnf->default_group = newstr(grp->gr_name); } } if ((arg = getarg(args, 'L')) != NULL) cnf->default_class = pw_checkname((u_char *)arg->val, 0); if ((arg = getarg(args, 'G')) != NULL && arg->val) { int i = 0; for (p = strtok(arg->val, ", \t"); p != NULL; p = strtok(NULL, ", \t")) { if ((grp = GETGRNAM(p)) == NULL) { if (!isdigit((unsigned char)*p) || (grp = GETGRGID((gid_t) atoi(p))) == NULL) errx(EX_NOUSER, "group `%s' does not exist", p); } if (extendarray(&cnf->groups, &cnf->numgroups, i + 2) != -1) cnf->groups[i++] = newstr(grp->gr_name); } while (i < cnf->numgroups) cnf->groups[i++] = NULL; } if ((arg = getarg(args, 'k')) != NULL) { if (stat(cnf->dotdir = arg->val, &st) == -1 || !S_ISDIR(st.st_mode)) errx(EX_OSFILE, "skeleton `%s' is not a directory or does not exist", cnf->dotdir); } if ((arg = getarg(args, 's')) != NULL) cnf->shell_default = arg->val; if ((arg = getarg(args, 'w')) != NULL) cnf->default_password = boolean_val(arg->val, cnf->default_password); if (mode == M_ADD && getarg(args, 'D')) { if (getarg(args, 'n') != NULL) errx(EX_DATAERR, "can't combine `-D' with `-n name'"); if ((arg = getarg(args, 'u')) != NULL && (p = strtok(arg->val, ", \t")) != NULL) { if ((cnf->min_uid = (uid_t) atoi(p)) == 0) cnf->min_uid = 1000; if ((p = strtok(NULL, " ,\t")) == NULL || (cnf->max_uid = (uid_t) atoi(p)) < cnf->min_uid) cnf->max_uid = 32000; } if ((arg = getarg(args, 'i')) != NULL && (p = strtok(arg->val, ", \t")) != NULL) { if ((cnf->min_gid = (gid_t) atoi(p)) == 0) cnf->min_gid = 1000; if ((p = strtok(NULL, " ,\t")) == NULL || (cnf->max_gid = (gid_t) atoi(p)) < cnf->min_gid) cnf->max_gid = 32000; } arg = getarg(args, 'C'); if (write_userconfig(arg ? arg->val : NULL)) return EXIT_SUCCESS; warn("config update"); return EX_IOERR; } if (mode == M_PRINT && getarg(args, 'a')) { int pretty = getarg(args, 'P') != NULL; int v7 = getarg(args, '7') != NULL; SETPWENT(); while ((pwd = GETPWENT()) != NULL) print_user(pwd, pretty, v7); ENDPWENT(); return EXIT_SUCCESS; } if ((a_name = getarg(args, 'n')) != NULL) pwd = GETPWNAM(pw_checkname((u_char *)a_name->val, 0)); a_uid = getarg(args, 'u'); if (a_uid == NULL) { if (a_name == NULL) errx(EX_DATAERR, "user name or id required"); /* * Determine whether 'n' switch is name or uid - we don't * really don't really care which we have, but we need to * know. */ if (mode != M_ADD && pwd == NULL && strspn(a_name->val, "0123456789") == strlen(a_name->val) && *a_name->val) { (a_uid = a_name)->ch = 'u'; a_name = NULL; } } else { if (strspn(a_uid->val, "0123456789") != strlen(a_uid->val)) errx(EX_USAGE, "-u expects a number"); } /* * Update, delete & print require that the user exists */ if (mode == M_UPDATE || mode == M_DELETE || mode == M_PRINT || mode == M_LOCK || mode == M_UNLOCK) { if (a_name == NULL && pwd == NULL) /* Try harder */ pwd = GETPWUID(atoi(a_uid->val)); if (pwd == NULL) { if (mode == M_PRINT && getarg(args, 'F')) { fakeuser.pw_name = a_name ? a_name->val : "nouser"; fakeuser.pw_uid = a_uid ? (uid_t) atol(a_uid->val) : -1; return print_user(&fakeuser, getarg(args, 'P') != NULL, getarg(args, '7') != NULL); } if (a_name == NULL) errx(EX_NOUSER, "no such uid `%s'", a_uid->val); errx(EX_NOUSER, "no such user `%s'", a_name->val); } if (a_name == NULL) /* May be needed later */ a_name = addarg(args, 'n', newstr(pwd->pw_name)); /* * The M_LOCK and M_UNLOCK functions simply add or remove * a "*LOCKED*" prefix from in front of the password to * prevent it decoding correctly, and therefore prevents * access. Of course, this only prevents access via * password authentication (not ssh, kerberos or any * other method that does not use the UNIX password) but * that is a known limitation. */ if (mode == M_LOCK) { if (strncmp(pwd->pw_passwd, locked_str, sizeof(locked_str)-1) == 0) errx(EX_DATAERR, "user '%s' is already locked", pwd->pw_name); - passtmp = malloc(strlen(pwd->pw_passwd) + sizeof(locked_str)); + asprintf(&passtmp, "%s%s", locked_str, pwd->pw_passwd); if (passtmp == NULL) /* disaster */ errx(EX_UNAVAILABLE, "out of memory"); - strcpy(passtmp, locked_str); - strcat(passtmp, pwd->pw_passwd); pwd->pw_passwd = passtmp; edited = 1; } else if (mode == M_UNLOCK) { if (strncmp(pwd->pw_passwd, locked_str, sizeof(locked_str)-1) != 0) errx(EX_DATAERR, "user '%s' is not locked", pwd->pw_name); pwd->pw_passwd += sizeof(locked_str)-1; edited = 1; } else if (mode == M_DELETE) { /* * Handle deletions now */ char file[MAXPATHLEN]; char home[MAXPATHLEN]; uid_t uid = pwd->pw_uid; struct group *gr; char grname[LOGNAMESIZE]; if (strcmp(pwd->pw_name, "root") == 0) errx(EX_DATAERR, "cannot remove user 'root'"); if (!PWALTDIR()) { /* * Remove opie record from /etc/opiekeys */ rmopie(pwd->pw_name); /* * Remove crontabs */ snprintf(file, sizeof(file), "/var/cron/tabs/%s", pwd->pw_name); if (access(file, F_OK) == 0) { sprintf(file, "crontab -u %s -r", pwd->pw_name); system(file); } } /* * Save these for later, since contents of pwd may be * invalidated by deletion */ sprintf(file, "%s/%s", _PATH_MAILDIR, pwd->pw_name); strlcpy(home, pwd->pw_dir, sizeof(home)); gr = GETGRGID(pwd->pw_gid); if (gr != NULL) strlcpy(grname, gr->gr_name, LOGNAMESIZE); else grname[0] = '\0'; rc = delpwent(pwd); if (rc == -1) err(EX_IOERR, "user '%s' does not exist", pwd->pw_name); else if (rc != 0) { warn("passwd update"); return EX_IOERR; } if (cnf->nispasswd && *cnf->nispasswd=='/') { rc = delnispwent(cnf->nispasswd, a_name->val); if (rc == -1) warnx("WARNING: user '%s' does not exist in NIS passwd", pwd->pw_name); else if (rc != 0) warn("WARNING: NIS passwd update"); /* non-fatal */ } grp = GETGRNAM(a_name->val); if (grp != NULL && (grp->gr_mem == NULL || *grp->gr_mem == NULL) && strcmp(a_name->val, grname) == 0) delgrent(GETGRNAM(a_name->val)); SETGRENT(); while ((grp = GETGRENT()) != NULL) { int i, j; char group[MAXLOGNAME]; if (grp->gr_mem != NULL) { for (i = 0; grp->gr_mem[i] != NULL; i++) { if (!strcmp(grp->gr_mem[i], a_name->val)) { for (j = i; grp->gr_mem[j] != NULL; j++) grp->gr_mem[j] = grp->gr_mem[j+1]; strlcpy(group, grp->gr_name, MAXLOGNAME); chggrent(group, grp); } } } } ENDGRENT(); pw_log(cnf, mode, W_USER, "%s(%ld) account removed", a_name->val, (long) uid); if (!PWALTDIR()) { /* * Remove mail file */ remove(file); /* * Remove at jobs */ if (getpwuid(uid) == NULL) rmat(uid); /* * Remove home directory and contents */ if (getarg(args, 'r') != NULL && *home == '/' && getpwuid(uid) == NULL) { if (stat(home, &st) != -1) { rm_r(home, uid); pw_log(cnf, mode, W_USER, "%s(%ld) home '%s' %sremoved", a_name->val, (long) uid, home, stat(home, &st) == -1 ? "" : "not completely "); } } } return EXIT_SUCCESS; } else if (mode == M_PRINT) return print_user(pwd, getarg(args, 'P') != NULL, getarg(args, '7') != NULL); /* * The rest is edit code */ if ((arg = getarg(args, 'l')) != NULL) { if (strcmp(pwd->pw_name, "root") == 0) errx(EX_DATAERR, "can't rename `root' account"); pwd->pw_name = pw_checkname((u_char *)arg->val, 0); edited = 1; } if ((arg = getarg(args, 'u')) != NULL && isdigit((unsigned char)*arg->val)) { pwd->pw_uid = (uid_t) atol(arg->val); edited = 1; if (pwd->pw_uid != 0 && strcmp(pwd->pw_name, "root") == 0) errx(EX_DATAERR, "can't change uid of `root' account"); if (pwd->pw_uid == 0 && strcmp(pwd->pw_name, "root") != 0) warnx("WARNING: account `%s' will have a uid of 0 (superuser access!)", pwd->pw_name); } if ((arg = getarg(args, 'g')) != NULL && pwd->pw_uid != 0) { /* Already checked this */ gid_t newgid = (gid_t) GETGRNAM(cnf->default_group)->gr_gid; if (newgid != pwd->pw_gid) { edited = 1; pwd->pw_gid = newgid; } } if ((arg = getarg(args, 'p')) != NULL) { if (*arg->val == '\0' || strcmp(arg->val, "0") == 0) { if (pwd->pw_change != 0) { pwd->pw_change = 0; edited = 1; } } else { time_t now = time(NULL); time_t expire = parse_date(now, arg->val); if (pwd->pw_change != expire) { pwd->pw_change = expire; edited = 1; } } } if ((arg = getarg(args, 'e')) != NULL) { if (*arg->val == '\0' || strcmp(arg->val, "0") == 0) { if (pwd->pw_expire != 0) { pwd->pw_expire = 0; edited = 1; } } else { time_t now = time(NULL); time_t expire = parse_date(now, arg->val); if (pwd->pw_expire != expire) { pwd->pw_expire = expire; edited = 1; } } } if ((arg = getarg(args, 's')) != NULL) { char *shell = shell_path(cnf->shelldir, cnf->shells, arg->val); if (shell == NULL) shell = ""; if (strcmp(shell, pwd->pw_shell) != 0) { pwd->pw_shell = shell; edited = 1; } } if (getarg(args, 'L')) { if (cnf->default_class == NULL) cnf->default_class = ""; if (strcmp(pwd->pw_class, cnf->default_class) != 0) { pwd->pw_class = cnf->default_class; edited = 1; } } if ((arg = getarg(args, 'd')) != NULL) { if (strcmp(pwd->pw_dir, arg->val)) edited = 1; if (stat(pwd->pw_dir = arg->val, &st) == -1) { if (getarg(args, 'm') == NULL && strcmp(pwd->pw_dir, "/nonexistent") != 0) warnx("WARNING: home `%s' does not exist", pwd->pw_dir); } else if (!S_ISDIR(st.st_mode)) warnx("WARNING: home `%s' is not a directory", pwd->pw_dir); } if ((arg = getarg(args, 'w')) != NULL && getarg(args, 'h') == NULL && getarg(args, 'H') == NULL) { login_cap_t *lc; lc = login_getpwclass(pwd); if (lc == NULL || login_setcryptfmt(lc, "sha512", NULL) == NULL) warn("setting crypt(3) format"); login_close(lc); pwd->pw_passwd = pw_password(cnf, args, pwd->pw_name); edited = 1; } } else { login_cap_t *lc; /* * Add code */ if (a_name == NULL) /* Required */ errx(EX_DATAERR, "login name required"); else if ((pwd = GETPWNAM(a_name->val)) != NULL) /* Exists */ errx(EX_DATAERR, "login name `%s' already exists", a_name->val); /* * Now, set up defaults for a new user */ pwd = &fakeuser; pwd->pw_name = a_name->val; pwd->pw_class = cnf->default_class ? cnf->default_class : ""; pwd->pw_uid = pw_uidpolicy(cnf, args); pwd->pw_gid = pw_gidpolicy(cnf, args, pwd->pw_name, (gid_t) pwd->pw_uid); pwd->pw_change = pw_pwdpolicy(cnf, args); pwd->pw_expire = pw_exppolicy(cnf, args); pwd->pw_dir = pw_homepolicy(cnf, args, pwd->pw_name); pwd->pw_shell = pw_shellpolicy(cnf, args, NULL); lc = login_getpwclass(pwd); if (lc == NULL || login_setcryptfmt(lc, "sha512", NULL) == NULL) warn("setting crypt(3) format"); login_close(lc); pwd->pw_passwd = pw_password(cnf, args, pwd->pw_name); edited = 1; if (pwd->pw_uid == 0 && strcmp(pwd->pw_name, "root") != 0) warnx("WARNING: new account `%s' has a uid of 0 (superuser access!)", pwd->pw_name); } /* * Shared add/edit code */ if ((arg = getarg(args, 'c')) != NULL) { char *gecos = pw_checkname((u_char *)arg->val, 1); if (strcmp(pwd->pw_gecos, gecos) != 0) { pwd->pw_gecos = gecos; edited = 1; } } if ((arg = getarg(args, 'h')) != NULL || (arg = getarg(args, 'H')) != NULL) { if (strcmp(arg->val, "-") == 0) { if (!pwd->pw_passwd || *pwd->pw_passwd != '*') { pwd->pw_passwd = "*"; /* No access */ edited = 1; } } else { int fd = atoi(arg->val); int precrypt = (arg->ch == 'H'); int b; int istty = isatty(fd); struct termios t; login_cap_t *lc; if (istty) { if (tcgetattr(fd, &t) == -1) istty = 0; else { struct termios n = t; /* Disable echo */ n.c_lflag &= ~(ECHO); tcsetattr(fd, TCSANOW, &n); printf("%s%spassword for user %s:", (mode == M_UPDATE) ? "new " : "", precrypt ? "encrypted " : "", pwd->pw_name); fflush(stdout); } } b = read(fd, line, sizeof(line) - 1); if (istty) { /* Restore state */ tcsetattr(fd, TCSANOW, &t); fputc('\n', stdout); fflush(stdout); } if (b < 0) { warn("-%c file descriptor", precrypt ? 'H' : 'h'); return EX_IOERR; } line[b] = '\0'; if ((p = strpbrk(line, "\r\n")) != NULL) *p = '\0'; if (!*line) errx(EX_DATAERR, "empty password read on file descriptor %d", fd); if (precrypt) { if (strchr(line, ':') != NULL) return EX_DATAERR; pwd->pw_passwd = line; } else { lc = login_getpwclass(pwd); if (lc == NULL || login_setcryptfmt(lc, "sha512", NULL) == NULL) warn("setting crypt(3) format"); login_close(lc); pwd->pw_passwd = pw_pwcrypt(line); } edited = 1; } } /* * Special case: -N only displays & exits */ if (getarg(args, 'N') != NULL) return print_user(pwd, getarg(args, 'P') != NULL, getarg(args, '7') != NULL); if (mode == M_ADD) { edited = 1; /* Always */ rc = addpwent(pwd); if (rc == -1) { warnx("user '%s' already exists", pwd->pw_name); return EX_IOERR; } else if (rc != 0) { warn("passwd file update"); return EX_IOERR; } if (cnf->nispasswd && *cnf->nispasswd=='/') { rc = addnispwent(cnf->nispasswd, pwd); if (rc == -1) warnx("User '%s' already exists in NIS passwd", pwd->pw_name); else warn("NIS passwd update"); /* NOTE: we treat NIS-only update errors as non-fatal */ } } else if (mode == M_UPDATE || mode == M_LOCK || mode == M_UNLOCK) { if (edited) { /* Only updated this if required */ rc = chgpwent(a_name->val, pwd); if (rc == -1) { warnx("user '%s' does not exist (NIS?)", pwd->pw_name); return EX_IOERR; } else if (rc != 0) { warn("passwd file update"); return EX_IOERR; } if ( cnf->nispasswd && *cnf->nispasswd=='/') { rc = chgnispwent(cnf->nispasswd, a_name->val, pwd); if (rc == -1) warn("User '%s' not found in NIS passwd", pwd->pw_name); else warn("NIS passwd update"); /* NOTE: NIS-only update errors are not fatal */ } } } /* * Ok, user is created or changed - now edit group file */ if (mode == M_ADD || getarg(args, 'G') != NULL) { int i, j; /* First remove the user from all group */ SETGRENT(); while ((grp = GETGRENT()) != NULL) { char group[MAXLOGNAME]; if (grp->gr_mem == NULL) continue; for (i = 0; grp->gr_mem[i] != NULL; i++) { if (strcmp(grp->gr_mem[i] , pwd->pw_name) != 0) continue; for (j = i; grp->gr_mem[j] != NULL ; j++) grp->gr_mem[j] = grp->gr_mem[j+1]; strlcpy(group, grp->gr_name, MAXLOGNAME); chggrent(group, grp); } } ENDGRENT(); /* now add to group where needed */ for (i = 0; cnf->groups[i] != NULL; i++) { grp = GETGRNAM(cnf->groups[i]); grp = gr_add(grp, pwd->pw_name); /* * grp can only be NULL in 2 cases: * - the new member is already a member * - a problem with memory occurs * in both cases we want to skip now. */ if (grp == NULL) continue; chggrent(cnf->groups[i], grp); free(grp); } } /* go get a current version of pwd */ pwd = GETPWNAM(a_name->val); if (pwd == NULL) { /* This will fail when we rename, so special case that */ if (mode == M_UPDATE && (arg = getarg(args, 'l')) != NULL) { a_name->val = arg->val; /* update new name */ pwd = GETPWNAM(a_name->val); /* refetch renamed rec */ } } if (pwd == NULL) /* can't go on without this */ errx(EX_NOUSER, "user '%s' disappeared during update", a_name->val); grp = GETGRGID(pwd->pw_gid); pw_log(cnf, mode, W_USER, "%s(%ld):%s(%ld):%s:%s:%s", pwd->pw_name, (long) pwd->pw_uid, grp ? grp->gr_name : "unknown", (long) (grp ? grp->gr_gid : -1), pwd->pw_gecos, pwd->pw_dir, pwd->pw_shell); /* * If adding, let's touch and chown the user's mail file. This is not * strictly necessary under BSD with a 0755 maildir but it also * doesn't hurt anything to create the empty mailfile */ if (mode == M_ADD) { if (!PWALTDIR()) { sprintf(line, "%s/%s", _PATH_MAILDIR, pwd->pw_name); close(open(line, O_RDWR | O_CREAT, 0600)); /* Preserve contents & * mtime */ chown(line, pwd->pw_uid, pwd->pw_gid); } } /* * Let's create and populate the user's home directory. Note * that this also `works' for editing users if -m is used, but * existing files will *not* be overwritten. */ if (!PWALTDIR() && getarg(args, 'm') != NULL && pwd->pw_dir && *pwd->pw_dir == '/' && pwd->pw_dir[1]) { copymkdir(pwd->pw_dir, cnf->dotdir, cnf->homemode, pwd->pw_uid, pwd->pw_gid); pw_log(cnf, mode, W_USER, "%s(%ld) home %s made", pwd->pw_name, (long) pwd->pw_uid, pwd->pw_dir); } /* * Finally, send mail to the new user as well, if we are asked to */ if (mode == M_ADD && !PWALTDIR() && cnf->newmail && *cnf->newmail && (fp = fopen(cnf->newmail, "r")) != NULL) { FILE *pfp = popen(_PATH_SENDMAIL " -t", "w"); if (pfp == NULL) warn("sendmail"); else { fprintf(pfp, "From: root\n" "To: %s\n" "Subject: Welcome!\n\n", pwd->pw_name); while (fgets(line, sizeof(line), fp) != NULL) { /* Do substitutions? */ fputs(line, pfp); } pclose(pfp); pw_log(cnf, mode, W_USER, "%s(%ld) new user mail sent", pwd->pw_name, (long) pwd->pw_uid); } fclose(fp); } return EXIT_SUCCESS; } static uid_t pw_uidpolicy(struct userconf * cnf, struct cargs * args) { struct passwd *pwd; uid_t uid = (uid_t) - 1; struct carg *a_uid = getarg(args, 'u'); /* * Check the given uid, if any */ if (a_uid != NULL) { uid = (uid_t) atol(a_uid->val); if ((pwd = GETPWUID(uid)) != NULL && getarg(args, 'o') == NULL) errx(EX_DATAERR, "uid `%ld' has already been allocated", (long) pwd->pw_uid); } else { struct bitmap bm; /* * We need to allocate the next available uid under one of * two policies a) Grab the first unused uid b) Grab the * highest possible unused uid */ if (cnf->min_uid >= cnf->max_uid) { /* Sanity * claus^H^H^H^Hheck */ cnf->min_uid = 1000; cnf->max_uid = 32000; } bm = bm_alloc(cnf->max_uid - cnf->min_uid + 1); /* * Now, let's fill the bitmap from the password file */ SETPWENT(); while ((pwd = GETPWENT()) != NULL) if (pwd->pw_uid >= (uid_t) cnf->min_uid && pwd->pw_uid <= (uid_t) cnf->max_uid) bm_setbit(&bm, pwd->pw_uid - cnf->min_uid); ENDPWENT(); /* * Then apply the policy, with fallback to reuse if necessary */ if (cnf->reuse_uids || (uid = (uid_t) (bm_lastset(&bm) + cnf->min_uid + 1)) > cnf->max_uid) uid = (uid_t) (bm_firstunset(&bm) + cnf->min_uid); /* * Another sanity check */ if (uid < cnf->min_uid || uid > cnf->max_uid) errx(EX_SOFTWARE, "unable to allocate a new uid - range fully used"); bm_dealloc(&bm); } return uid; } static uid_t pw_gidpolicy(struct userconf * cnf, struct cargs * args, char *nam, gid_t prefer) { struct group *grp; gid_t gid = (uid_t) - 1; struct carg *a_gid = getarg(args, 'g'); /* * If no arg given, see if default can help out */ if (a_gid == NULL && cnf->default_group && *cnf->default_group) a_gid = addarg(args, 'g', cnf->default_group); /* * Check the given gid, if any */ SETGRENT(); if (a_gid != NULL) { if ((grp = GETGRNAM(a_gid->val)) == NULL) { gid = (gid_t) atol(a_gid->val); if ((gid == 0 && !isdigit((unsigned char)*a_gid->val)) || (grp = GETGRGID(gid)) == NULL) errx(EX_NOUSER, "group `%s' is not defined", a_gid->val); } gid = grp->gr_gid; } else if ((grp = GETGRNAM(nam)) != NULL && (grp->gr_mem == NULL || grp->gr_mem[0] == NULL)) { gid = grp->gr_gid; /* Already created? Use it anyway... */ } else { struct cargs grpargs; char tmp[32]; LIST_INIT(&grpargs); addarg(&grpargs, 'n', nam); /* * We need to auto-create a group with the user's name. We * can send all the appropriate output to our sister routine * bit first see if we can create a group with gid==uid so we * can keep the user and group ids in sync. We purposely do * NOT check the gid range if we can force the sync. If the * user's name dups an existing group, then the group add * function will happily handle that case for us and exit. */ if (GETGRGID(prefer) == NULL) { sprintf(tmp, "%lu", (unsigned long) prefer); addarg(&grpargs, 'g', tmp); } if (getarg(args, 'N')) { addarg(&grpargs, 'N', NULL); addarg(&grpargs, 'q', NULL); gid = pw_group(cnf, M_NEXT, &grpargs); } else { pw_group(cnf, M_ADD, &grpargs); if ((grp = GETGRNAM(nam)) != NULL) gid = grp->gr_gid; } a_gid = LIST_FIRST(&grpargs); while (a_gid != NULL) { struct carg *t = LIST_NEXT(a_gid, list); LIST_REMOVE(a_gid, list); a_gid = t; } } ENDGRENT(); return gid; } static time_t pw_pwdpolicy(struct userconf * cnf, struct cargs * args) { time_t result = 0; time_t now = time(NULL); struct carg *arg = getarg(args, 'p'); if (arg != NULL) { if ((result = parse_date(now, arg->val)) == now) errx(EX_DATAERR, "invalid date/time `%s'", arg->val); } else if (cnf->password_days > 0) result = now + ((long) cnf->password_days * 86400L); return result; } static time_t pw_exppolicy(struct userconf * cnf, struct cargs * args) { time_t result = 0; time_t now = time(NULL); struct carg *arg = getarg(args, 'e'); if (arg != NULL) { if ((result = parse_date(now, arg->val)) == now) errx(EX_DATAERR, "invalid date/time `%s'", arg->val); } else if (cnf->expire_days > 0) result = now + ((long) cnf->expire_days * 86400L); return result; } static char * pw_homepolicy(struct userconf * cnf, struct cargs * args, char const * user) { struct carg *arg = getarg(args, 'd'); if (arg) return arg->val; else { static char home[128]; if (cnf->home == NULL || *cnf->home == '\0') errx(EX_CONFIG, "no base home directory set"); sprintf(home, "%s/%s", cnf->home, user); return home; } } static char * shell_path(char const * path, char *shells[], char *sh) { if (sh != NULL && (*sh == '/' || *sh == '\0')) return sh; /* specified full path or forced none */ else { char *p; char paths[_UC_MAXLINE]; /* * We need to search paths */ strlcpy(paths, path, sizeof(paths)); for (p = strtok(paths, ": \t\r\n"); p != NULL; p = strtok(NULL, ": \t\r\n")) { int i; static char shellpath[256]; if (sh != NULL) { sprintf(shellpath, "%s/%s", p, sh); if (access(shellpath, X_OK) == 0) return shellpath; } else for (i = 0; i < _UC_MAXSHELLS && shells[i] != NULL; i++) { sprintf(shellpath, "%s/%s", p, shells[i]); if (access(shellpath, X_OK) == 0) return shellpath; } } if (sh == NULL) errx(EX_OSFILE, "can't find shell `%s' in shell paths", sh); errx(EX_CONFIG, "no default shell available or defined"); return NULL; } } static char * pw_shellpolicy(struct userconf * cnf, struct cargs * args, char *newshell) { char *sh = newshell; struct carg *arg = getarg(args, 's'); if (newshell == NULL && arg != NULL) sh = arg->val; return shell_path(cnf->shelldir, cnf->shells, sh ? sh : cnf->shell_default); } #define SALTSIZE 32 static char const chars[] = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ./"; char * pw_pwcrypt(char *password) { int i; char salt[SALTSIZE + 1]; char *cryptpw; static char buf[256]; /* * Calculate a salt value */ for (i = 0; i < SALTSIZE; i++) salt[i] = chars[arc4random_uniform(sizeof(chars) - 1)]; salt[SALTSIZE] = '\0'; cryptpw = crypt(password, salt); if (cryptpw == NULL) errx(EX_CONFIG, "crypt(3) failure"); return strcpy(buf, cryptpw); } static char * pw_password(struct userconf * cnf, struct cargs * args, char const * user) { int i, l; char pwbuf[32]; switch (cnf->default_password) { case -1: /* Random password */ l = (arc4random() % 8 + 8); /* 8 - 16 chars */ for (i = 0; i < l; i++) pwbuf[i] = chars[arc4random_uniform(sizeof(chars)-1)]; pwbuf[i] = '\0'; /* * We give this information back to the user */ if (getarg(args, 'h') == NULL && getarg(args, 'H') == NULL && getarg(args, 'N') == NULL) { if (isatty(STDOUT_FILENO)) printf("Password for '%s' is: ", user); printf("%s\n", pwbuf); fflush(stdout); } break; case -2: /* No password at all! */ return ""; case 0: /* No login - default */ default: return "*"; case 1: /* user's name */ strlcpy(pwbuf, user, sizeof(pwbuf)); break; } return pw_pwcrypt(pwbuf); } static int print_user(struct passwd * pwd, int pretty, int v7) { if (!pretty) { char *buf; if (!v7) pwd->pw_passwd = (pwd->pw_passwd == NULL) ? "" : "*"; buf = v7 ? pw_make_v7(pwd) : pw_make(pwd); printf("%s\n", buf); free(buf); } else { int j; char *p; struct group *grp = GETGRGID(pwd->pw_gid); char uname[60] = "User &", office[60] = "[None]", wphone[60] = "[None]", hphone[60] = "[None]"; char acexpire[32] = "[None]", pwexpire[32] = "[None]"; struct tm * tptr; if ((p = strtok(pwd->pw_gecos, ",")) != NULL) { strlcpy(uname, p, sizeof(uname)); if ((p = strtok(NULL, ",")) != NULL) { strlcpy(office, p, sizeof(office)); if ((p = strtok(NULL, ",")) != NULL) { strlcpy(wphone, p, sizeof(wphone)); if ((p = strtok(NULL, "")) != NULL) { strlcpy(hphone, p, sizeof(hphone)); } } } } /* * Handle '&' in gecos field */ if ((p = strchr(uname, '&')) != NULL) { int l = strlen(pwd->pw_name); int m = strlen(p); memmove(p + l, p + 1, m); memmove(p, pwd->pw_name, l); *p = (char) toupper((unsigned char)*p); } if (pwd->pw_expire > (time_t)0 && (tptr = localtime(&pwd->pw_expire)) != NULL) strftime(acexpire, sizeof acexpire, "%c", tptr); if (pwd->pw_change > (time_t)0 && (tptr = localtime(&pwd->pw_change)) != NULL) strftime(pwexpire, sizeof pwexpire, "%c", tptr); printf("Login Name: %-15s #%-12ld Group: %-15s #%ld\n" " Full Name: %s\n" " Home: %-26.26s Class: %s\n" " Shell: %-26.26s Office: %s\n" "Work Phone: %-26.26s Home Phone: %s\n" "Acc Expire: %-26.26s Pwd Expire: %s\n", pwd->pw_name, (long) pwd->pw_uid, grp ? grp->gr_name : "(invalid)", (long) pwd->pw_gid, uname, pwd->pw_dir, pwd->pw_class, pwd->pw_shell, office, wphone, hphone, acexpire, pwexpire); SETGRENT(); j = 0; while ((grp=GETGRENT()) != NULL) { int i = 0; if (grp->gr_mem != NULL) { while (grp->gr_mem[i] != NULL) { if (strcmp(grp->gr_mem[i], pwd->pw_name)==0) { printf(j++ == 0 ? " Groups: %s" : ",%s", grp->gr_name); break; } ++i; } } } ENDGRENT(); printf("%s", j ? "\n" : ""); } return EXIT_SUCCESS; } char * pw_checkname(u_char *name, int gecos) { char showch[8]; u_char const *badchars, *ch, *showtype; int reject; ch = name; reject = 0; if (gecos) { /* See if the name is valid as a gecos (comment) field. */ badchars = ":!@"; showtype = "gecos field"; } else { /* See if the name is valid as a userid or group. */ badchars = " ,\t:+&#%$^()!@~*?<>=|\\/\""; showtype = "userid/group name"; /* Userids and groups can not have a leading '-'. */ if (*ch == '-') reject = 1; } if (!reject) { while (*ch) { if (strchr(badchars, *ch) != NULL || *ch < ' ' || *ch == 127) { reject = 1; break; } /* 8-bit characters are only allowed in GECOS fields */ if (!gecos && (*ch & 0x80)) { reject = 1; break; } ch++; } } /* * A `$' is allowed as the final character for userids and groups, * mainly for the benefit of samba. */ if (reject && !gecos) { if (*ch == '$' && *(ch + 1) == '\0') { reject = 0; ch++; } } if (reject) { snprintf(showch, sizeof(showch), (*ch >= ' ' && *ch < 127) ? "`%c'" : "0x%02x", *ch); errx(EX_DATAERR, "invalid character %s at position %td in %s", showch, (ch - name), showtype); } if (!gecos && (ch - name) > LOGNAMESIZE) errx(EX_DATAERR, "name too long `%s' (max is %d)", name, LOGNAMESIZE); return (char *)name; } static void rmat(uid_t uid) { DIR *d = opendir("/var/at/jobs"); if (d != NULL) { struct dirent *e; while ((e = readdir(d)) != NULL) { struct stat st; if (strncmp(e->d_name, ".lock", 5) != 0 && stat(e->d_name, &st) == 0 && !S_ISDIR(st.st_mode) && st.st_uid == uid) { char tmp[MAXPATHLEN]; sprintf(tmp, "/usr/bin/atrm %s", e->d_name); system(tmp); } } closedir(d); } } static void rmopie(char const * name) { static const char etcopie[] = "/etc/opiekeys"; FILE *fp = fopen(etcopie, "r+"); if (fp != NULL) { char tmp[1024]; off_t atofs = 0; int length = strlen(name); while (fgets(tmp, sizeof tmp, fp) != NULL) { if (strncmp(name, tmp, length) == 0 && tmp[length]==' ') { if (fseek(fp, atofs, SEEK_SET) == 0) { fwrite("#", 1, 1, fp); /* Comment username out */ } break; } atofs = ftell(fp); } /* * If we got an error of any sort, don't update! */ fclose(fp); } } Index: projects/release-arm-redux =================================================================== --- projects/release-arm-redux (revision 282691) +++ projects/release-arm-redux (revision 282692) Property changes on: projects/release-arm-redux ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head:r282673-282691