Index: projects/release-arm-redux/share/man/man4/acpi.4
===================================================================
--- projects/release-arm-redux/share/man/man4/acpi.4	(revision 282691)
+++ projects/release-arm-redux/share/man/man4/acpi.4	(revision 282692)
@@ -1,630 +1,641 @@
 .\"
 .\" Copyright (c) 2001 Michael Smith
 .\" All rights reserved.
 .\"
 .\" Redistribution and use in source and binary forms, with or without
 .\" modification, are permitted provided that the following conditions
 .\" are met:
 .\" 1. Redistributions of source code must retain the above copyright
 .\"    notice, this list of conditions and the following disclaimer.
 .\" 2. Redistributions in binary form must reproduce the above copyright
 .\"    notice, this list of conditions and the following disclaimer in the
 .\"    documentation and/or other materials provided with the distribution.
 .\"
 .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 .\" SUCH DAMAGE.
 .\"
 .\" $FreeBSD$
 .\"
-.Dd June 23, 2014
+.Dd May 9, 2015
 .Dt ACPI 4
 .Os
 .Sh NAME
 .Nm acpi
 .Nd Advanced Configuration and Power Interface support
 .Sh SYNOPSIS
 .Cd "device acpi"
 .Pp
 .Cd "options ACPI_DEBUG"
 .Cd "options DDB"
 .Sh DESCRIPTION
 The
 .Nm
 driver provides support for the Intel/Microsoft/Compaq/Toshiba ACPI
 standard.
 This support includes platform hardware discovery (superseding the
 PnP and PCI BIOS), as well as power management (superseding APM) and
 other features.
 ACPI core support is provided by the ACPI CA reference implementation
 from Intel.
 .Pp
 Note that the
 .Nm
 driver is automatically loaded by the
 .Xr loader 8 ,
 and should only be
 compiled into the kernel on platforms where ACPI is mandatory.
 .Sh SYSCTL VARIABLES
 The
 .Nm
 driver is intended to provide power management without user intervention.
 If the default settings are not optimal, the following sysctls can be
 used to modify or monitor
 .Nm
 behavior.
 Note that some variables will be available only if the given hardware supports
 them (such as
 .Va hw.acpi.acline ) .
 .Bl -tag -width indent
 .It Va debug.acpi.enable_debug_objects
 Enable dumping Debug objects without
 .Cd "options ACPI_DEBUG" .
 Default is 0, ignore Debug objects.
-.It Va hw.acpi.acline
-AC line state (1 means online, 0 means on battery power).
-.It Va hw.acpi.cpu.cx_usage
+.It Va dev.cpu.N.cx_usage
 Debugging information listing the percent of total usage for each sleep state.
 The values are reset when
-.Va hw.acpi.cpu.cx_lowest
+.Va dev.cpu.N.cx_lowest
 is modified.
-.It Va hw.acpi.cpu.cx_lowest
+.It Va dev.cpu.N.cx_lowest
 Lowest Cx state to use for idling the CPU.
 A scheduling algorithm will select states between
 .Li C1
 and this setting
 as system load dictates.
 To enable ACPI CPU idling control,
 .Va machdep.idle
 should be set to
 .Li acpi
 if it is listed in
 .Va machdep.idle_available .
 .It Va hw.acpi.cpu.cx_supported
 List of supported CPU idle states and their transition latency
 in microseconds.
 Each state has a type (e.g.,
 .Li C2 ) .
 .Li C1
 is equivalent to the ia32
 .Li HLT
 instruction,
 .Li C2
 provides a deeper
 sleep with the same semantics, and
 .Li C3
 provides the deepest sleep
 but additionally requires bus mastering to be disabled.
 States greater than
 .Li C3
 provide even more power savings with the same
 semantics as the
 .Li C3
 state.
 Deeper sleeps provide more power savings but increased transition
 latency when an interrupt occurs.
+.It Va dev.cpu.N.cx_method
+List of supported CPU idle states and their transition methods, as
+directed by the firmware.
+.It Va hw.acpi.acline
+AC line state (1 means online, 0 means on battery power).
 .It Va hw.acpi.disable_on_reboot
 Disable ACPI during the reboot process.
 Most systems reboot fine with ACPI still enabled, but some require
 exiting to legacy mode first.
 Default is 0, leave ACPI enabled.
 .It Va hw.acpi.handle_reboot
 Use the ACPI Reset Register capability to reboot the system.
 Some newer systems require use of this register, while some only work
 with legacy rebooting support.
 .It Va hw.acpi.lid_switch_state
 Suspend state
 .Pq Li S1 Ns \[en] Ns Li S5
 to enter when the lid switch (i.e., a notebook screen) is closed.
 Default is
 .Dq Li NONE
 (do nothing).
 .It Va hw.acpi.power_button_state
 Suspend state
 .Pq Li S1 Ns \[en] Ns Li S5
 to enter when the power button is pressed.
 Default is
 .Li S5
 (power-off nicely).
 .It Va hw.acpi.reset_video
 Reset the video adapter from real mode during the resume path.
 Some systems need this help, others have display problems if it is enabled.
 Default is 0 (disabled).
 .It Va hw.acpi.s4bios
 Indicate whether the system supports
 .Li S4BIOS .
 This means that the BIOS can handle all the functions of suspending the
 system to disk.
 Otherwise, the OS is responsible for suspending to disk
 .Pq Li S4OS .
 Most current systems do not support
 .Li S4BIOS .
 .It Va hw.acpi.sleep_button_state
 Suspend state
 .Pq Li S1 Ns \[en] Ns Li S5
 to enter when the sleep button is pressed.
 This is usually a special function button on the keyboard.
 Default is
 .Li S3
 (suspend-to-RAM).
 .It Va hw.acpi.sleep_delay
 Wait this number of seconds between preparing the system to suspend and
 actually entering the suspend state.
 Default is 1 second.
 .It Va hw.acpi.supported_sleep_state
 Suspend states
 .Pq Li S1 Ns \[en] Ns Li S5
 supported by the BIOS.
 .Bl -tag -width indent
 .It Li S1
 Quick suspend to RAM.
 The CPU enters a lower power state, but most peripherals are left running.
 .It Li S2
 Lower power state than
 .Li S1 ,
 but with the same basic characteristics.
 Not supported by many systems.
 .It Li S3
 Suspend to RAM.
 Most devices are powered off, and the system stops running except for
 memory refresh.
 .It Li S4
 Suspend to disk.
 All devices are powered off, and the system stops running.
 When resuming, the system starts as if from a cold power on.
 Not yet supported by
 .Fx
 unless
 .Li S4BIOS
 is available.
 .It Li S5
 System shuts down cleanly and powers off.
 .El
 .It Va hw.acpi.verbose
 Enable verbose printing from the various ACPI subsystems.
 .El
 .Sh LOADER TUNABLES
 Tunables can be set at the
 .Xr loader 8
 prompt before booting the kernel or stored in
 .Pa /boot/loader.conf .
 Many of these tunables also have a matching
 .Xr sysctl 8
 entry for access after boot.
 .Bl -tag -width indent
 .It Va acpi_dsdt_load
 Enables loading of a custom ACPI DSDT.
 .It Va acpi_dsdt_name
 Name of the DSDT table to load, if loading is enabled.
 .It Va debug.acpi.cpu_unordered
 Do not use the MADT to match ACPI Processor objects to CPUs.
 This is needed on a few systems with a buggy BIOS that does not use
 consistent processor IDs.
 Default is 0 (disabled).
 .It Va debug.acpi.disabled
 Selectively disables portions of ACPI for debugging purposes.
 .It Va debug.acpi.interpreter_slack
 Enable less strict ACPI implementations.
 Default is 1, ignore common BIOS mistakes.
 .It Va debug.acpi.max_threads
 Specify the number of task threads that are started on boot.
 Limiting this to 1 may help work around various BIOSes that cannot
 handle parallel requests.
 The default value is 3.
 .It Va debug.acpi.quirks
 Override any automatic quirks completely.
 .It Va debug.acpi.resume_beep
 Beep the PC speaker on resume.
 This can help diagnose suspend/resume problems.
 Default is 0 (disabled).
 .It Va hint.acpi.0.disabled
 Set this to 1 to disable all of ACPI.
 If ACPI has been disabled on your system due to a blacklist entry for your
 BIOS, you can set this to 0 to re-enable ACPI for testing.
 .It Va hw.acpi.ec.poll_timeout
 Delay in milliseconds to wait for the EC to respond.
 Try increasing this number if you get the error
 .Qq Li AE_NO_HARDWARE_RESPONSE .
 .It Va hw.acpi.host_mem_start
 Override the assumed memory starting address for PCI host bridges.
 .It Va hw.acpi.install_interface , hw.acpi.remove_interface
 Install or remove OS interface(s) to control return value of
 .Ql _OSI
 query method.
 When an OS interface is specified in
 .Va hw.acpi.install_interface ,
 .Li _OSI
 query for the interface returns it is
 .Em supported .
 Conversely, when an OS interface is specified in
 .Va hw.acpi.remove_interface ,
 .Li _OSI
 query returns it is
 .Em not supported .
 Multiple interfaces can be specified in a comma-separated list and
 any leading white spaces will be ignored.
 For example,
 .Qq Li FreeBSD, Linux
 is a valid list of two interfaces
 .Qq Li FreeBSD
 and
 .Qq Li Linux .
 .It Va hw.acpi.reset_video
 Enables calling the VESA reset BIOS vector on the resume path.
 This can fix some graphics cards that have problems such as LCD white-out
 after resume.
 Default is 0 (disabled).
 .It Va hw.acpi.serialize_methods
 Allow override of whether methods execute in parallel or not.
 Enable this for serial behavior, which fixes
 .Qq Li AE_ALREADY_EXISTS
 errors for
 AML that really cannot handle parallel method execution.
 It is off by default since this breaks recursive methods and some IBMs use
 such code.
 .It Va hw.acpi.verbose
 Turn on verbose debugging information about what ACPI is doing.
 .It Va hw.pci.link.%s.%d.irq
 Override the interrupt to use for this link and index.
 This capability should be used carefully, and only if a device is not
 working with
 .Nm
 enabled.
 .Qq %s
 is the name of the link (e.g., LNKA).
 .Qq %d
 is the resource index when the link supports multiple IRQs.
 Most PCI links only have one IRQ resource, so the below form should be used.
 .It Va hw.pci.link.%s.irq
 Override the interrupt to use.
 This capability should be used carefully, and only if a device is not
 working with
 .Nm
 enabled.
 .Qq %s
 is the name of the link (e.g., LNKA).
 .El
 .Sh DISABLING ACPI
 Since ACPI support on different platforms varies greatly, there are many
 debugging and tuning options available.
 .Pp
 For machines known not to work with
 .Nm
 enabled, there is a BIOS blacklist.
 Currently, the blacklist only controls whether
 .Nm
 should be disabled or not.
 In the future, it will have more granularity to control features (the
 infrastructure for that is already there).
 .Pp
 To enable
 .Nm
 (for debugging purposes, etc.) on machines that are on the blacklist, set the
 kernel environment variable
 .Va hint.acpi.0.disabled
 to 0.
 Before trying this, consider updating your BIOS to a more recent version that
 may be compatible with ACPI.
 .Pp
 To disable the
 .Nm
 driver completely, set the kernel environment variable
 .Va hint.acpi.0.disabled
 to 1.
 .Pp
 Some i386 machines totally fail to operate with some or all of ACPI disabled.
 Other i386 machines fail with ACPI enabled.
 Disabling all or part of ACPI on non-i386 platforms (i.e., platforms where
 ACPI support is mandatory) may result in a non-functional system.
 .Pp
 The
 .Nm
 driver comprises a set of drivers, which may be selectively disabled
 in case of problems.
 To disable a sub-driver, list it in the kernel
 environment variable
 .Va debug.acpi.disabled .
 Multiple entries can be listed, separated by a space.
 .Pp
 ACPI sub-devices and features that can be disabled:
 .Bl -tag -width ".Li sysresource"
 .It Li all
 Disable all ACPI features and devices.
 .It Li acad
 .Pq Vt device
 Supports AC adapter.
 .It Li bus
 .Pq Vt feature
 Probes and attaches subdevices.
 Disabling will avoid scanning the ACPI namespace entirely.
 .It Li children
 .Pq Vt feature
 Attaches standard ACPI sub-drivers and devices enumerated in the
 ACPI namespace.
 Disabling this has a similar effect to disabling
 .Dq Li bus ,
 except that the
 ACPI namespace will still be scanned.
 .It Li button
 .Pq Vt device
 Supports ACPI button devices (typically power and sleep buttons).
 .It Li cmbat
 .Pq Vt device
 Control-method batteries device.
 .It Li cpu
 .Pq Vt device
 Supports CPU power-saving and speed-setting functions.
 .It Li ec
 .Pq Vt device
 Supports the ACPI Embedded Controller interface, used to communicate
 with embedded platform controllers.
 .It Li isa
 .Pq Vt device
 Supports an ISA bus bridge defined in the ACPI namespace,
 typically as a child of a PCI bus.
 .It Li lid
 .Pq Vt device
 Supports an ACPI laptop lid switch, which typically puts a
 system to sleep.
+.It Li mwait
+.Pq Vt feature
+Do not ask firmware for available x86 vendor-specific methods to enter
+.Li Cx
+sleep states.
+Only query and use the generic I/O-based entrance method.
+The knob is provided to work around inconsistencies in the tables
+filled by firmware.
 .It Li quirks
 .Pq Vt feature
 Do not honor quirks.
 Quirks automatically disable ACPI functionality based on the XSDT table's
 OEM vendor name and revision date.
 .It Li pci
 .Pq Vt device
 Supports Host to PCI bridges.
 .It Li pci_link
 .Pq Vt feature
 Performs PCI interrupt routing.
 .It Li sysresource
 .Pq Vt device
 Pseudo-devices containing resources which ACPI claims.
 .It Li thermal
 .Pq Vt device
 Supports system cooling and heat management.
 .It Li timer
 .Pq Vt device
 Implements a timecounter using the ACPI fixed-frequency timer.
 .It Li video
 .Pq Vt device
 Supports
 .Xr acpi_video 4
 which may conflict with
 .Xr agp 4
 device.
 .El
 .Pp
 It is also possible to avoid portions of the ACPI namespace which
 may be causing problems, by listing the full path of the root of
 the region to be avoided in the kernel environment variable
 .Va debug.acpi.avoid .
 The object and all of its children will be ignored during the
 bus/children scan of the namespace.
 The ACPI CA code will still know about the avoided region.
 .Sh DEBUGGING OUTPUT
 To enable debugging output,
 .Nm
 must be compiled with
 .Cd "options ACPI_DEBUG" .
 Debugging output is separated between layers and levels, where a layer is
 a component of the ACPI subsystem, and a level is a particular kind
 of debugging output.
 .Pp
 Both layers and levels are specified as a whitespace-separated list of
 tokens, with layers listed in
 .Va debug.acpi.layer
 and levels in
 .Va debug.acpi.level .
 .Pp
 The first set of layers is for ACPI-CA components, and the second is for
 .Fx
 drivers.
 The ACPI-CA layer descriptions include the prefix for the files they
 refer to.
 The supported layers are:
 .Pp
 .Bl -tag -compact -width ".Li ACPI_CA_DISASSEMBLER"
 .It Li ACPI_UTILITIES
 Utility ("ut") functions
 .It Li ACPI_HARDWARE
 Hardware access ("hw")
 .It Li ACPI_EVENTS
 Event and GPE ("ev")
 .It Li ACPI_TABLES
 Table access ("tb")
 .It Li ACPI_NAMESPACE
 Namespace evaluation ("ns")
 .It Li ACPI_PARSER
 AML parser ("ps")
 .It Li ACPI_DISPATCHER
 Internal representation of interpreter state ("ds")
 .It Li ACPI_EXECUTER
 Execute AML methods ("ex")
 .It Li ACPI_RESOURCES
 Resource parsing ("rs")
 .It Li ACPI_CA_DEBUGGER
 Debugger implementation ("db", "dm")
 .It Li ACPI_OS_SERVICES
 Usermode support routines ("os")
 .It Li ACPI_CA_DISASSEMBLER
 Disassembler implementation (unused)
 .It Li ACPI_ALL_COMPONENTS
 All the above ACPI-CA components
 .It Li ACPI_AC_ADAPTER
 AC adapter driver
 .It Li ACPI_BATTERY
 Control-method battery driver
 .It Li ACPI_BUS
 ACPI, ISA, and PCI bus drivers
 .It Li ACPI_BUTTON
 Power and sleep button driver
 .It Li ACPI_EC
 Embedded controller driver
 .It Li ACPI_FAN
 Fan driver
 .It Li ACPI_OEM
 Platform-specific driver for hotkeys, LED, etc.
 .It Li ACPI_POWER
 Power resource driver
 .It Li ACPI_PROCESSOR
 CPU driver
 .It Li ACPI_THERMAL
 Thermal zone driver
 .It Li ACPI_TIMER
 Timer driver
 .It Li ACPI_ALL_DRIVERS
 All the above
 .Fx
 ACPI drivers
 .El
 .Pp
 The supported levels are:
 .Pp
 .Bl -tag -compact -width ".Li ACPI_LV_AML_DISASSEMBLE"
 .It Li ACPI_LV_INIT
 Initialization progress
 .It Li ACPI_LV_DEBUG_OBJECT
 Stores to objects
 .It Li ACPI_LV_INFO
 General information and progress
 .It Li ACPI_LV_REPAIR
 Repair a common problem with predefined methods
 .It Li ACPI_LV_ALL_EXCEPTIONS
 All the previous levels
 .It Li ACPI_LV_PARSE
 .It Li ACPI_LV_DISPATCH
 .It Li ACPI_LV_EXEC
 .It Li ACPI_LV_NAMES
 .It Li ACPI_LV_OPREGION
 .It Li ACPI_LV_BFIELD
 .It Li ACPI_LV_TABLES
 .It Li ACPI_LV_VALUES
 .It Li ACPI_LV_OBJECTS
 .It Li ACPI_LV_RESOURCES
 .It Li ACPI_LV_USER_REQUESTS
 .It Li ACPI_LV_PACKAGE
 .It Li ACPI_LV_VERBOSITY1
 All the previous levels
 .It Li ACPI_LV_ALLOCATIONS
 .It Li ACPI_LV_FUNCTIONS
 .It Li ACPI_LV_OPTIMIZATIONS
 .It Li ACPI_LV_VERBOSITY2
 All the previous levels
 .It Li ACPI_LV_ALL
 Synonym for
 .Qq Li ACPI_LV_VERBOSITY2
 .It Li ACPI_LV_MUTEX
 .It Li ACPI_LV_THREADS
 .It Li ACPI_LV_IO
 .It Li ACPI_LV_INTERRUPTS
 .It Li ACPI_LV_VERBOSITY3
 All the previous levels
 .It Li ACPI_LV_AML_DISASSEMBLE
 .It Li ACPI_LV_VERBOSE_INFO
 .It Li ACPI_LV_FULL_TABLES
 .It Li ACPI_LV_EVENTS
 .It Li ACPI_LV_VERBOSE
 All levels after
 .Qq Li ACPI_LV_VERBOSITY3
 .It Li ACPI_LV_INIT_NAMES
 .It Li ACPI_LV_LOAD
 .El
 .Pp
 Selection of the appropriate layer and level values is important
 to avoid massive amounts of debugging output.
 For example, the following configuration is a good way to gather initial
 information.
 It enables debug output for both ACPI-CA and the
 .Nm
 driver, printing basic information about errors, warnings, and progress.
 .Bd -literal -offset indent
 debug.acpi.layer="ACPI_ALL_COMPONENTS ACPI_ALL_DRIVERS"
 debug.acpi.level="ACPI_LV_ALL_EXCEPTIONS"
 .Ed
 .Pp
 Debugging output by the ACPI CA subsystem is prefixed with the
 module name in lowercase, followed by a source line number.
 Output from the
 .Fx Ns -local
 code follows the same format, but
 the module name is uppercased.
 .Sh OVERRIDING YOUR BIOS BYTECODE
 ACPI interprets bytecode named AML
 (ACPI Machine Language)
 provided by the BIOS vendor as a memory image at boot time.
 Sometimes, the AML code contains a bug that does not appear when parsed
 by the Microsoft implementation.
 .Fx
 provides a way to override it with your own AML code to work around
 or debug such problems.
 Note that all AML in your DSDT and any SSDT tables is overridden.
 .Pp
 In order to load your AML code, you must edit
 .Pa /boot/loader.conf
 and include the following lines.
 .Bd -literal -offset indent
 acpi_dsdt_load="YES"
 acpi_dsdt_name="/boot/acpi_dsdt.aml" # You may change this name.
 .Ed
 .Pp
 In order to prepare your AML code, you will need the
 .Xr acpidump 8
 and
 .Xr iasl 8
 utilities and some ACPI knowledge.
 .Sh COMPATIBILITY
 ACPI is only found and supported on i386/ia32 and amd64.
 .Sh SEE ALSO
 .Xr kenv 1 ,
 .Xr acpi_thermal 4 ,
 .Xr device.hints 5 ,
 .Xr loader.conf 5 ,
 .Xr acpiconf 8 ,
 .Xr acpidump 8 ,
 .Xr config 8 ,
 .Xr iasl 8
 .Rs
 .%A "Compaq Computer Corporation"
 .%A "Intel Corporation"
 .%A "Microsoft Corporation"
 .%A "Phoenix Technologies Ltd."
 .%A "Toshiba Corporation"
 .%D August 25, 2003
 .%T "Advanced Configuration and Power Interface Specification"
 .%U http://acpi.info/spec.htm
 .Re
 .Sh AUTHORS
 .An -nosplit
 The ACPI CA subsystem is developed and maintained by
 Intel Architecture Labs.
 .Pp
 The following people made notable contributions to the ACPI subsystem
 in
 .Fx :
 .An Michael Smith ,
 .An Takanori Watanabe Aq Mt takawata@jp.FreeBSD.org ,
 .An Mitsuru IWASAKI Aq Mt iwasaki@jp.FreeBSD.org ,
 .An Munehiro Matsuda ,
 .An Nate Lawson ,
 the ACPI-jp mailing list at
 .Aq Mt acpi-jp@jp.FreeBSD.org ,
 and many other contributors.
 .Pp
 This manual page was written by
 .An Michael Smith Aq Mt msmith@FreeBSD.org .
 .Sh BUGS
 Many BIOS versions have serious bugs that may cause system instability,
 break suspend/resume, or prevent devices from operating properly due to
 IRQ routing problems.
 Upgrade your BIOS to the latest version available from the vendor before
 deciding it is a problem with
 .Nm .
Index: projects/release-arm-redux/share/man/man4
===================================================================
--- projects/release-arm-redux/share/man/man4	(revision 282691)
+++ projects/release-arm-redux/share/man/man4	(revision 282692)

Property changes on: projects/release-arm-redux/share/man/man4
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /head/share/man/man4:r282673-282691
Index: projects/release-arm-redux/share
===================================================================
--- projects/release-arm-redux/share	(revision 282691)
+++ projects/release-arm-redux/share	(revision 282692)

Property changes on: projects/release-arm-redux/share
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /head/share:r282673-282691
Index: projects/release-arm-redux/sys/amd64/acpica/acpi_machdep.c
===================================================================
--- projects/release-arm-redux/sys/amd64/acpica/acpi_machdep.c	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/acpica/acpi_machdep.c	(revision 282692)
@@ -1,384 +1,377 @@
 /*-
  * Copyright (c) 2001 Mitsuru IWASAKI
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/bus.h>
 #include <sys/kernel.h>
 #include <sys/module.h>
 #include <sys/sysctl.h>
 
 #include <vm/vm.h>
 #include <vm/pmap.h>
 
 #include <contrib/dev/acpica/include/acpi.h>
 #include <contrib/dev/acpica/include/accommon.h>
 #include <contrib/dev/acpica/include/actables.h>
 
 #include <dev/acpica/acpivar.h>
 
 #include <machine/nexusvar.h>
 
 int acpi_resume_beep;
 SYSCTL_INT(_debug_acpi, OID_AUTO, resume_beep, CTLFLAG_RWTUN,
     &acpi_resume_beep, 0, "Beep the PC speaker when resuming");
 
 int acpi_reset_video;
 TUNABLE_INT("hw.acpi.reset_video", &acpi_reset_video);
 
 static int intr_model = ACPI_INTR_PIC;
 
 int
 acpi_machdep_init(device_t dev)
 {
 	struct acpi_softc *sc;
 
 	sc = device_get_softc(dev);
 
 	acpi_apm_init(sc);
 
 	if (intr_model != ACPI_INTR_PIC)
 		acpi_SetIntrModel(intr_model);
 
 	SYSCTL_ADD_INT(&sc->acpi_sysctl_ctx,
 	    SYSCTL_CHILDREN(sc->acpi_sysctl_tree), OID_AUTO,
 	    "reset_video", CTLFLAG_RW, &acpi_reset_video, 0,
 	    "Call the VESA reset BIOS vector on the resume path");
 
 	return (0);
 }
 
 void
 acpi_SetDefaultIntrModel(int model)
 {
 
 	intr_model = model;
 }
 
 int
 acpi_machdep_quirks(int *quirks)
 {
 
 	return (0);
 }
 
-void
-acpi_cpu_c1()
-{
-
-	__asm __volatile("sti; hlt");
-}
-
 /*
  * Support for mapping ACPI tables during early boot.  Currently this
  * uses the crashdump map to map each table.  However, the crashdump
  * map is created in pmap_bootstrap() right after the direct map, so
  * we should be able to just use pmap_mapbios() here instead.
  *
  * This makes the following assumptions about how we use this KVA:
  * pages 0 and 1 are used to map in the header of each table found via
  * the RSDT or XSDT and pages 2 to n are used to map in the RSDT or
  * XSDT.  This has to use 2 pages for the table headers in case a
  * header spans a page boundary.
  *
  * XXX: We don't ensure the table fits in the available address space
  * in the crashdump map.
  */
 
 /*
  * Map some memory using the crashdump map.  'offset' is an offset in
  * pages into the crashdump map to use for the start of the mapping.
  */
 static void *
 table_map(vm_paddr_t pa, int offset, vm_offset_t length)
 {
 	vm_offset_t va, off;
 	void *data;
 
 	off = pa & PAGE_MASK;
 	length = round_page(length + off);
 	pa = pa & PG_FRAME;
 	va = (vm_offset_t)pmap_kenter_temporary(pa, offset) +
 	    (offset * PAGE_SIZE);
 	data = (void *)(va + off);
 	length -= PAGE_SIZE;
 	while (length > 0) {
 		va += PAGE_SIZE;
 		pa += PAGE_SIZE;
 		length -= PAGE_SIZE;
 		pmap_kenter(va, pa);
 		invlpg(va);
 	}
 	return (data);
 }
 
 /* Unmap memory previously mapped with table_map(). */
 static void
 table_unmap(void *data, vm_offset_t length)
 {
 	vm_offset_t va, off;
 
 	va = (vm_offset_t)data;
 	off = va & PAGE_MASK;
 	length = round_page(length + off);
 	va &= ~PAGE_MASK;
 	while (length > 0) {
 		pmap_kremove(va);
 		invlpg(va);
 		va += PAGE_SIZE;
 		length -= PAGE_SIZE;
 	}
 }
 
 /*
  * Map a table at a given offset into the crashdump map.  It first
  * maps the header to determine the table length and then maps the
  * entire table.
  */
 static void *
 map_table(vm_paddr_t pa, int offset, const char *sig)
 {
 	ACPI_TABLE_HEADER *header;
 	vm_offset_t length;
 	void *table;
 
 	header = table_map(pa, offset, sizeof(ACPI_TABLE_HEADER));
 	if (strncmp(header->Signature, sig, ACPI_NAME_SIZE) != 0) {
 		table_unmap(header, sizeof(ACPI_TABLE_HEADER));
 		return (NULL);
 	}
 	length = header->Length;
 	table_unmap(header, sizeof(ACPI_TABLE_HEADER));
 	table = table_map(pa, offset, length);
 	if (ACPI_FAILURE(AcpiTbChecksum(table, length))) {
 		if (bootverbose)
 			printf("ACPI: Failed checksum for table %s\n", sig);
 #if (ACPI_CHECKSUM_ABORT)
 		table_unmap(table, length);
 		return (NULL);
 #endif
 	}
 	return (table);
 }
 
 /*
  * See if a given ACPI table is the requested table.  Returns 1 if
  * the signature matches or zero on failure.
  */
 static int
 probe_table(vm_paddr_t address, const char *sig)
 {
 	ACPI_TABLE_HEADER *table;
 
 	table = table_map(address, 0, sizeof(ACPI_TABLE_HEADER));
 	if (table == NULL) {
 		if (bootverbose)
 			printf("ACPI: Failed to map table at 0x%jx\n",
 			    (uintmax_t)address);
 		return (0);
 	}
 	if (bootverbose)
 		printf("Table '%.4s' at 0x%jx\n", table->Signature,
 		    (uintmax_t)address);
 
 	if (strncmp(table->Signature, sig, ACPI_NAME_SIZE) != 0) {
 		table_unmap(table, sizeof(ACPI_TABLE_HEADER));
 		return (0);
 	}
 	table_unmap(table, sizeof(ACPI_TABLE_HEADER));
 	return (1);
 }
 
 /*
  * Try to map a table at a given physical address previously returned
  * by acpi_find_table().
  */
 void *
 acpi_map_table(vm_paddr_t pa, const char *sig)
 {
 
 	return (map_table(pa, 0, sig));
 }
 
 /* Unmap a table previously mapped via acpi_map_table(). */
 void
 acpi_unmap_table(void *table)
 {
 	ACPI_TABLE_HEADER *header;
 
 	header = (ACPI_TABLE_HEADER *)table;
 	table_unmap(table, header->Length);
 }
 
 /*
  * Return the physical address of the requested table or zero if one
  * is not found.
  */
 vm_paddr_t
 acpi_find_table(const char *sig)
 {
 	ACPI_PHYSICAL_ADDRESS rsdp_ptr;
 	ACPI_TABLE_RSDP *rsdp;
 	ACPI_TABLE_RSDT *rsdt;
 	ACPI_TABLE_XSDT *xsdt;
 	ACPI_TABLE_HEADER *table;
 	vm_paddr_t addr;
 	int i, count;
 
 	if (resource_disabled("acpi", 0))
 		return (0);
 
 	/*
 	 * Map in the RSDP.  Since ACPI uses AcpiOsMapMemory() which in turn
 	 * calls pmap_mapbios() to find the RSDP, we assume that we can use
 	 * pmap_mapbios() to map the RSDP.
 	 */
 	if ((rsdp_ptr = AcpiOsGetRootPointer()) == 0)
 		return (0);
 	rsdp = pmap_mapbios(rsdp_ptr, sizeof(ACPI_TABLE_RSDP));
 	if (rsdp == NULL) {
 		if (bootverbose)
 			printf("ACPI: Failed to map RSDP\n");
 		return (0);
 	}
 
 	/*
 	 * For ACPI >= 2.0, use the XSDT if it is available.
 	 * Otherwise, use the RSDT.  We map the XSDT or RSDT at page 2
 	 * in the crashdump area.  Pages 0 and 1 are used to map in the
 	 * headers of candidate ACPI tables.
 	 */
 	addr = 0;
 	if (rsdp->Revision >= 2 && rsdp->XsdtPhysicalAddress != 0) {
 		/*
 		 * AcpiOsGetRootPointer only verifies the checksum for
 		 * the version 1.0 portion of the RSDP.  Version 2.0 has
 		 * an additional checksum that we verify first.
 		 */
 		if (AcpiTbChecksum((UINT8 *)rsdp, ACPI_RSDP_XCHECKSUM_LENGTH)) {
 			if (bootverbose)
 				printf("ACPI: RSDP failed extended checksum\n");
 			return (0);
 		}
 		xsdt = map_table(rsdp->XsdtPhysicalAddress, 2, ACPI_SIG_XSDT);
 		if (xsdt == NULL) {
 			if (bootverbose)
 				printf("ACPI: Failed to map XSDT\n");
 			return (0);
 		}
 		count = (xsdt->Header.Length - sizeof(ACPI_TABLE_HEADER)) /
 		    sizeof(UINT64);
 		for (i = 0; i < count; i++)
 			if (probe_table(xsdt->TableOffsetEntry[i], sig)) {
 				addr = xsdt->TableOffsetEntry[i];
 				break;
 			}
 		acpi_unmap_table(xsdt);
 	} else {
 		rsdt = map_table(rsdp->RsdtPhysicalAddress, 2, ACPI_SIG_RSDT);
 		if (rsdt == NULL) {
 			if (bootverbose)
 				printf("ACPI: Failed to map RSDT\n");
 			return (0);
 		}
 		count = (rsdt->Header.Length - sizeof(ACPI_TABLE_HEADER)) /
 		    sizeof(UINT32);
 		for (i = 0; i < count; i++)
 			if (probe_table(rsdt->TableOffsetEntry[i], sig)) {
 				addr = rsdt->TableOffsetEntry[i];
 				break;
 			}
 		acpi_unmap_table(rsdt);
 	}
 	pmap_unmapbios((vm_offset_t)rsdp, sizeof(ACPI_TABLE_RSDP));
 	if (addr == 0) {
 		if (bootverbose)
 			printf("ACPI: No %s table found\n", sig);
 		return (0);
 	}
 	if (bootverbose)
 		printf("%s: Found table at 0x%jx\n", sig, (uintmax_t)addr);
 
 	/*
 	 * Verify that we can map the full table and that its checksum is
 	 * correct, etc.
 	 */
 	table = map_table(addr, 0, sig);
 	if (table == NULL)
 		return (0);
 	acpi_unmap_table(table);
 
 	return (addr);
 }
 
 /*
  * ACPI nexus(4) driver.
  */
 static int
 nexus_acpi_probe(device_t dev)
 {
 	int error;
 
 	error = acpi_identify();
 	if (error)
 		return (error);
 
 	return (BUS_PROBE_DEFAULT);
 }
 
 static int
 nexus_acpi_attach(device_t dev)
 {
 	device_t acpi_dev;
 	int error;
 
 	nexus_init_resources();
 	bus_generic_probe(dev);
 	acpi_dev = BUS_ADD_CHILD(dev, 10, "acpi", 0);
 	if (acpi_dev == NULL)
 		panic("failed to add acpi0 device");
 
 	error = bus_generic_attach(dev);
 	if (error == 0)
 		acpi_install_wakeup_handler(device_get_softc(acpi_dev));
 
 	return (error);
 }
 
 static device_method_t nexus_acpi_methods[] = {
 	/* Device interface */
 	DEVMETHOD(device_probe,		nexus_acpi_probe),
 	DEVMETHOD(device_attach,	nexus_acpi_attach),
 
 	{ 0, 0 }
 };
 
 DEFINE_CLASS_1(nexus, nexus_acpi_driver, nexus_acpi_methods, 1, nexus_driver);
 static devclass_t nexus_devclass;
 
 DRIVER_MODULE(nexus_acpi, root, nexus_acpi_driver, nexus_devclass, 0, 0);
Index: projects/release-arm-redux/sys/amd64/amd64/apic_vector.S
===================================================================
--- projects/release-arm-redux/sys/amd64/amd64/apic_vector.S	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/amd64/apic_vector.S	(revision 282692)
@@ -1,351 +1,347 @@
 /*-
  * Copyright (c) 1989, 1990 William F. Jolitz.
  * Copyright (c) 1990 The Regents of the University of California.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: vector.s, 386BSD 0.1 unknown origin
  * $FreeBSD$
  */
 
 /*
  * Interrupt entry points for external interrupts triggered by I/O APICs
  * as well as IPI handlers.
  */
 
 #include "opt_smp.h"
 
 #include <machine/asmacros.h>
 #include <machine/specialreg.h>
 #include <x86/apicreg.h>
 
 #include "assym.s"
 
 #ifdef SMP
 #define LK	lock ;
 #else
 #define LK
 #endif
 
 	.text
 	SUPERALIGN_TEXT
 	/* End Of Interrupt to APIC */
 as_lapic_eoi:
 	cmpl	$0,x2apic_mode
 	jne	1f
 	movq	lapic_map,%rax
 	movl	$0,LA_EOI(%rax)
 	ret
 1:
 	movl	$MSR_APIC_EOI,%ecx
 	xorl	%eax,%eax
 	xorl	%edx,%edx
 	wrmsr
 	ret
 
 /*
  * I/O Interrupt Entry Point.  Rather than having one entry point for
  * each interrupt source, we use one entry point for each 32-bit word
  * in the ISR.  The handler determines the highest bit set in the ISR,
  * translates that into a vector, and passes the vector to the
  * lapic_handle_intr() function.
  */
 #define	ISR_VEC(index, vec_name)					\
 	.text ;								\
 	SUPERALIGN_TEXT ;						\
 IDTVEC(vec_name) ;							\
 	PUSH_FRAME ;							\
 	FAKE_MCOUNT(TF_RIP(%rsp)) ;					\
 	cmpl	$0,x2apic_mode ;					\
 	je	1f ;							\
 	movl	$(MSR_APIC_ISR0 + index),%ecx ;				\
 	rdmsr ;								\
 	jmp	2f ;							\
 1: ;									\
 	movq	lapic_map, %rdx ;	/* pointer to local APIC */	\
 	movl	LA_ISR + 16 * (index)(%rdx), %eax ;	/* load ISR */	\
 2: ;									\
 	bsrl	%eax, %eax ;	/* index of highest set bit in ISR */	\
 	jz	3f ;							\
 	addl	$(32 * index),%eax ;					\
 	movq	%rsp, %rsi	;                                       \
 	movl	%eax, %edi ;	/* pass the IRQ */			\
 	call	lapic_handle_intr ;					\
 3: ;									\
 	MEXITCOUNT ;							\
 	jmp	doreti
 
 /*
  * Handle "spurious INTerrupts".
  * Notes:
  *  This is different than the "spurious INTerrupt" generated by an
  *   8259 PIC for missing INTs.  See the APIC documentation for details.
  *  This routine should NOT do an 'EOI' cycle.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(spuriousint)
 
 	/* No EOI cycle used here */
 
 	jmp	doreti_iret
 
 	ISR_VEC(1, apic_isr1)
 	ISR_VEC(2, apic_isr2)
 	ISR_VEC(3, apic_isr3)
 	ISR_VEC(4, apic_isr4)
 	ISR_VEC(5, apic_isr5)
 	ISR_VEC(6, apic_isr6)
 	ISR_VEC(7, apic_isr7)
 
 /*
  * Local APIC periodic timer handler.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(timerint)
 	PUSH_FRAME
 	FAKE_MCOUNT(TF_RIP(%rsp))
 	movq	%rsp, %rdi
 	call	lapic_handle_timer
 	MEXITCOUNT
 	jmp	doreti
 
 /*
  * Local APIC CMCI handler.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(cmcint)
 	PUSH_FRAME
 	FAKE_MCOUNT(TF_RIP(%rsp))
 	call	lapic_handle_cmc
 	MEXITCOUNT
 	jmp	doreti
 
 /*
  * Local APIC error interrupt handler.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(errorint)
 	PUSH_FRAME
 	FAKE_MCOUNT(TF_RIP(%rsp))
 	call	lapic_handle_error
 	MEXITCOUNT
 	jmp	doreti
 
 #ifdef XENHVM
 /*
  * Xen event channel upcall interrupt handler.
  * Only used when the hypervisor supports direct vector callbacks.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(xen_intr_upcall)
 	PUSH_FRAME
 	FAKE_MCOUNT(TF_RIP(%rsp))
 	movq	%rsp, %rdi
 	call	xen_intr_handle_upcall
 	MEXITCOUNT
 	jmp	doreti
 #endif
 
 #ifdef HYPERV
 /*
  * This is the Hyper-V vmbus channel direct callback interrupt.
  * Only used when it is running on Hyper-V.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(hv_vmbus_callback)
 	PUSH_FRAME
 	FAKE_MCOUNT(TF_RIP(%rsp))
 	movq	%rsp, %rdi
 	call	hv_vector_handler
 	MEXITCOUNT
 	jmp	doreti
 #endif
 
 #ifdef SMP
 /*
  * Global address space TLB shootdown.
  */
 	.text
 
-#define	NAKE_INTR_CS	24
-
 	SUPERALIGN_TEXT
 invltlb_ret:
 	call	as_lapic_eoi
 	POP_FRAME
 	jmp	doreti_iret
 
 	SUPERALIGN_TEXT
+IDTVEC(invltlb)
+	PUSH_FRAME
+
+	call	invltlb_handler
+	jmp	invltlb_ret
+
 IDTVEC(invltlb_pcid)
 	PUSH_FRAME
 
 	call	invltlb_pcid_handler
 	jmp	invltlb_ret
 
-
-	SUPERALIGN_TEXT
-IDTVEC(invltlb)
+IDTVEC(invltlb_invpcid)
 	PUSH_FRAME
 
-	call	invltlb_handler
+	call	invltlb_invpcid_handler
 	jmp	invltlb_ret
 
 /*
  * Single page TLB shootdown
  */
 	.text
-	SUPERALIGN_TEXT
-IDTVEC(invlpg_pcid)
-	PUSH_FRAME
-
-	call	invlpg_pcid_handler
-	jmp	invltlb_ret
 
 	SUPERALIGN_TEXT
 IDTVEC(invlpg)
 	PUSH_FRAME
 
 	call	invlpg_handler
 	jmp	invltlb_ret
 
 /*
  * Page range TLB shootdown.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(invlrng)
 	PUSH_FRAME
 
 	call	invlrng_handler
 	jmp	invltlb_ret
 
 /*
  * Invalidate cache.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(invlcache)
 	PUSH_FRAME
 
 	call	invlcache_handler
 	jmp	invltlb_ret
 
 /*
  * Handler for IPIs sent via the per-cpu IPI bitmap.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(ipi_intr_bitmap_handler)		
 	PUSH_FRAME
 
 	call	as_lapic_eoi
 	
 	FAKE_MCOUNT(TF_RIP(%rsp))
 
 	call	ipi_bitmap_handler
 	MEXITCOUNT
 	jmp	doreti
 
 /*
  * Executed by a CPU when it receives an IPI_STOP from another CPU.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(cpustop)
 	PUSH_FRAME
 
 	call	as_lapic_eoi
 
 	call	cpustop_handler
 	jmp	doreti
 
 /*
  * Executed by a CPU when it receives an IPI_SUSPEND from another CPU.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(cpususpend)
 	PUSH_FRAME
 
 	call	cpususpend_handler
 	call	as_lapic_eoi
 	jmp	doreti
 
 /*
  * Executed by a CPU when it receives a RENDEZVOUS IPI from another CPU.
  *
  * - Calls the generic rendezvous action function.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(rendezvous)
 	PUSH_FRAME
 #ifdef COUNT_IPIS
 	movl	PCPU(CPUID), %eax
 	movq	ipi_rendezvous_counts(,%rax,8), %rax
 	incq	(%rax)
 #endif
 	call	smp_rendezvous_action
 	call	as_lapic_eoi
 	jmp	doreti
 
 /*
  * IPI handler whose purpose is to interrupt the CPU with minimum overhead.
  * This is used by bhyve to force a host cpu executing in guest context to
  * trap into the hypervisor.
  *
  * This handler is different from other IPI handlers in the following aspects:
  *
  * 1. It doesn't push a trapframe on the stack.
  *
  * This implies that a DDB backtrace involving 'justreturn' will skip the
  * function that was interrupted by this handler.
  *
  * 2. It doesn't 'swapgs' when userspace is interrupted.
  *
  * The 'justreturn' handler does not access any pcpu data so it is not an
  * issue. Moreover the 'justreturn' handler can only be interrupted by an NMI
  * whose handler already doesn't trust GS.base when kernel code is interrupted.
  */
 	.text
 	SUPERALIGN_TEXT
 IDTVEC(justreturn)
 	pushq	%rax
 	pushq	%rcx
 	pushq	%rdx
 	call	as_lapic_eoi
 	popq	%rdx
 	popq	%rcx
 	popq	%rax
 	jmp	doreti_iret
 
 #endif /* SMP */
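
Note on the ISR_VEC macro in apic_vector.S above: its comment says each entry
point reads one 32-bit word of the local APIC ISR, finds the highest set bit,
and turns that into a vector. A minimal C sketch of that arithmetic follows.
The helper name isr_word_to_vector() is hypothetical and does not exist in the
tree, and the -1 return value is an assumption standing in for the "jz 3f"
path that skips the call to lapic_handle_intr().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.  Given the contents of ISR
 * word 'index' (0..7), return the vector of the highest in-service
 * interrupt, or -1 if no bit is set in that word.
 */
static int
isr_word_to_vector(uint32_t isr, unsigned int index)
{
	int bit;

	assert(index < 8);	/* 256 vectors / 32 bits per word */
	if (isr == 0)
		return (-1);	/* corresponds to bsrl setting ZF */

	/* Equivalent of bsrl: scan down from bit 31 to the first set bit. */
	for (bit = 31; (isr & (UINT32_C(1) << bit)) == 0; bit--)
		;

	/* Equivalent of addl $(32 * index),%eax. */
	return ((int)(32 * index) + bit);
}
```

For example, with index 2 and bits 8 and 16 set in the word, the highest set
bit is 16, so the sketch yields 2 * 32 + 16 = 80.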
Index: projects/release-arm-redux/sys/amd64/amd64/cpu_switch.S
===================================================================
--- projects/release-arm-redux/sys/amd64/amd64/cpu_switch.S	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/amd64/cpu_switch.S	(revision 282692)
@@ -1,521 +1,476 @@
 /*-
  * Copyright (c) 2003 Peter Wemm.
  * Copyright (c) 1990 The Regents of the University of California.
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * William Jolitz.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #include <machine/asmacros.h>
 #include <machine/specialreg.h>
 
 #include "assym.s"
 #include "opt_sched.h"
 
 /*****************************************************************************/
 /* Scheduling                                                                */
 /*****************************************************************************/
 
 	.text
 
 #ifdef SMP
 #define LK	lock ;
 #else
 #define LK
 #endif
 
 #if defined(SCHED_ULE) && defined(SMP)
 #define	SETLK	xchgq
 #else
 #define	SETLK	movq
 #endif
 
 /*
  * cpu_throw()
  *
  * This is the second half of cpu_switch(). It is used when the current
  * thread is either a dummy or slated to die, and we no longer care
  * about its state.  This is only a slight optimization and is probably
  * not worth it anymore.  Note that we need to clear the pm_active bits so
  * we do need the old proc if it still exists.
  * %rdi = oldtd
  * %rsi = newtd
  */
 ENTRY(cpu_throw)
-	movl	PCPU(CPUID),%eax
-	testq	%rdi,%rdi
-	jz	1f
-	/* release bit from old pm_active */
-	movq	PCPU(CURPMAP),%rdx
-	LK btrl	%eax,PM_ACTIVE(%rdx)		/* clear old */
-1:
-	movq	TD_PCB(%rsi),%r8		/* newtd->td_pcb */
-	movq	PCB_CR3(%r8),%rcx		/* new address space */
-	jmp	swact
+	movq	%rsi,%r12
+	movq	%rsi,%rdi
+	call	pmap_activate_sw
+	jmp	sw1
 END(cpu_throw)
 
 /*
  * cpu_switch(old, new, mtx)
  *
  * Save the current thread state, then select the next thread to run
  * and load its state.
  * %rdi = oldtd
  * %rsi = newtd
  * %rdx = mtx
  */
 ENTRY(cpu_switch)
 	/* Switch to new thread.  First, save context. */
 	movq	TD_PCB(%rdi),%r8
 	orl	$PCB_FULL_IRET,PCB_FLAGS(%r8)
 
 	movq	(%rsp),%rax			/* Hardware registers */
 	movq	%r15,PCB_R15(%r8)
 	movq	%r14,PCB_R14(%r8)
 	movq	%r13,PCB_R13(%r8)
 	movq	%r12,PCB_R12(%r8)
 	movq	%rbp,PCB_RBP(%r8)
 	movq	%rsp,PCB_RSP(%r8)
 	movq	%rbx,PCB_RBX(%r8)
 	movq	%rax,PCB_RIP(%r8)
 
 	testl	$PCB_DBREGS,PCB_FLAGS(%r8)
 	jnz	store_dr			/* static predict not taken */
 done_store_dr:
 
 	/* have we used fp, and need a save? */
 	cmpq	%rdi,PCPU(FPCURTHREAD)
 	jne	3f
 	movq	PCB_SAVEFPU(%r8),%r8
 	clts
 	cmpl	$0,use_xsave
 	jne	1f
 	fxsave	(%r8)
 	jmp	2f
 1:	movq	%rdx,%rcx
 	movl	xsave_mask,%eax
 	movl	xsave_mask+4,%edx
 	.globl	ctx_switch_xsave
 ctx_switch_xsave:
 	/* This is patched to xsaveopt if supported, see fpuinit_bsp1() */
 	xsave	(%r8)
 	movq	%rcx,%rdx
 2:	smsw	%ax
 	orb	$CR0_TS,%al
 	lmsw	%ax
 	xorl	%eax,%eax
 	movq	%rax,PCPU(FPCURTHREAD)
 3:
-
 	/* Save is done.  Now fire up new thread. Leave old vmspace. */
-	movq	TD_PCB(%rsi),%r8
-
-	/* switch address space */
-	movq	PCB_CR3(%r8),%rcx
-	movq	%cr3,%rax
-	cmpq	%rcx,%rax			/* Same address space? */
-	jne	swinact
-	SETLK	%rdx, TD_LOCK(%rdi)		/* Release the old thread */
-	jmp	sw1
-swinact:
-	movl	PCPU(CPUID),%eax
-	/* Release bit from old pmap->pm_active */
-	movq	PCPU(CURPMAP),%r12
-	LK btrl	%eax,PM_ACTIVE(%r12)		/* clear old */
-	SETLK	%rdx,TD_LOCK(%rdi)		/* Release the old thread */
-swact:
-	/* Set bit in new pmap->pm_active */
-	movq	TD_PROC(%rsi),%rdx		/* newproc */
-	movq	P_VMSPACE(%rdx), %rdx
-	addq	$VM_PMAP,%rdx
-	cmpl	$-1,PM_PCID(%rdx)
-	je	1f
-	LK btsl	%eax,PM_SAVE(%rdx)
-	jnc	1f
-	btsq	$63,%rcx			/* CR3_PCID_SAVE */
-	incq	PCPU(PM_SAVE_CNT)
-1:
-	movq	%rcx,%cr3			/* new address space */
-	LK btsl	%eax,PM_ACTIVE(%rdx)		/* set new */
-	movq	%rdx,PCPU(CURPMAP)
-
-	/*
-	 * We might lose the race and other CPU might have changed
-	 * the pmap after we set our bit in pmap->pm_save.  Recheck.
-	 * Reload %cr3 with CR3_PCID_SAVE bit cleared if pmap was
-	 * modified, causing TLB flush for this pcid.
-	 */
-	btrq	$63,%rcx
-	jnc	1f
-	LK btsl	%eax,PM_SAVE(%rdx)
-	jc	1f
-	decq	PCPU(PM_SAVE_CNT)
-	movq	%rcx,%cr3
-1:
-
+	movq	%rsi,%r12
+	movq	%rdi,%r13
+	movq	%rdx,%r15
+	movq	%rsi,%rdi
+	callq	pmap_activate_sw
+	SETLK	%r15,TD_LOCK(%r13)		/* Release the old thread */
 sw1:
+	movq	TD_PCB(%r12),%r8
 #if defined(SCHED_ULE) && defined(SMP)
 	/* Wait for the new thread to become unblocked */
 	movq	$blocked_lock, %rdx
 1:
-	movq	TD_LOCK(%rsi),%rcx
+	movq	TD_LOCK(%r12),%rcx
 	cmpq	%rcx, %rdx
 	pause
 	je	1b
 #endif
 	/*
 	 * At this point, we've switched address spaces and are ready
 	 * to load up the rest of the next context.
 	 */
 
 	/* Skip loading user fsbase/gsbase for kthreads */
-	testl	$TDP_KTHREAD,TD_PFLAGS(%rsi)
+	testl	$TDP_KTHREAD,TD_PFLAGS(%r12)
 	jnz	do_kthread
 
 	/*
 	 * Load ldt register
 	 */
-	movq	TD_PROC(%rsi),%rcx
+	movq	TD_PROC(%r12),%rcx
 	cmpq	$0, P_MD+MD_LDT(%rcx)
 	jne	do_ldt
 	xorl	%eax,%eax
 ld_ldt:	lldt	%ax
 
 	/* Restore fs base in GDT */
 	movl	PCB_FSBASE(%r8),%eax
 	movq	PCPU(FS32P),%rdx
 	movw	%ax,2(%rdx)
 	shrl	$16,%eax
 	movb	%al,4(%rdx)
 	shrl	$8,%eax
 	movb	%al,7(%rdx)
 
 	/* Restore gs base in GDT */
 	movl	PCB_GSBASE(%r8),%eax
 	movq	PCPU(GS32P),%rdx
 	movw	%ax,2(%rdx)
 	shrl	$16,%eax
 	movb	%al,4(%rdx)
 	shrl	$8,%eax
 	movb	%al,7(%rdx)
 
 do_kthread:
 	/* Do we need to reload tss ? */
 	movq	PCPU(TSSP),%rax
 	movq	PCB_TSSP(%r8),%rdx
 	testq	%rdx,%rdx
 	cmovzq	PCPU(COMMONTSSP),%rdx
 	cmpq	%rax,%rdx
 	jne	do_tss
 done_tss:
 	movq	%r8,PCPU(RSP0)
 	movq	%r8,PCPU(CURPCB)
 	/* Update the TSS_RSP0 pointer for the next interrupt */
 	movq	%r8,COMMON_TSS_RSP0(%rdx)
-	movq	%rsi,PCPU(CURTHREAD)		/* into next thread */
+	movq	%r12,PCPU(CURTHREAD)		/* into next thread */
 
 	/* Test if debug registers should be restored. */
 	testl	$PCB_DBREGS,PCB_FLAGS(%r8)
 	jnz	load_dr				/* static predict not taken */
 done_load_dr:
 
 	/* Restore context. */
 	movq	PCB_R15(%r8),%r15
 	movq	PCB_R14(%r8),%r14
 	movq	PCB_R13(%r8),%r13
 	movq	PCB_R12(%r8),%r12
 	movq	PCB_RBP(%r8),%rbp
 	movq	PCB_RSP(%r8),%rsp
 	movq	PCB_RBX(%r8),%rbx
 	movq	PCB_RIP(%r8),%rax
 	movq	%rax,(%rsp)
 	ret
 
 	/*
 	 * We order these strangely for several reasons.
 	 * 1: I wanted to use static branch prediction hints
 	 * 2: Most athlon64/opteron cpus don't have them.  They define
 	 *    a forward branch as 'predict not taken'.  Intel cores have
 	 *    the 'rep' prefix to invert this.
 	 * So, to make it work on both forms of cpu we do the detour.
 	 * We use jumps rather than call in order to avoid the stack.
 	 */
 
 store_dr:
 	movq	%dr7,%rax			/* yes, do the save */
 	movq	%dr0,%r15
 	movq	%dr1,%r14
 	movq	%dr2,%r13
 	movq	%dr3,%r12
 	movq	%dr6,%r11
 	movq	%r15,PCB_DR0(%r8)
 	movq	%r14,PCB_DR1(%r8)
 	movq	%r13,PCB_DR2(%r8)
 	movq	%r12,PCB_DR3(%r8)
 	movq	%r11,PCB_DR6(%r8)
 	movq	%rax,PCB_DR7(%r8)
 	andq	$0x0000fc00, %rax		/* disable all watchpoints */
 	movq	%rax,%dr7
 	jmp	done_store_dr
 
 load_dr:
 	movq	%dr7,%rax
 	movq	PCB_DR0(%r8),%r15
 	movq	PCB_DR1(%r8),%r14
 	movq	PCB_DR2(%r8),%r13
 	movq	PCB_DR3(%r8),%r12
 	movq	PCB_DR6(%r8),%r11
 	movq	PCB_DR7(%r8),%rcx
 	movq	%r15,%dr0
 	movq	%r14,%dr1
 	/* Preserve reserved bits in %dr7 */
 	andq	$0x0000fc00,%rax
 	andq	$~0x0000fc00,%rcx
 	movq	%r13,%dr2
 	movq	%r12,%dr3
 	orq	%rcx,%rax
 	movq	%r11,%dr6
 	movq	%rax,%dr7
 	jmp	done_load_dr
 
 do_tss:	movq	%rdx,PCPU(TSSP)
 	movq	%rdx,%rcx
 	movq	PCPU(TSS),%rax
 	movw	%cx,2(%rax)
 	shrq	$16,%rcx
 	movb	%cl,4(%rax)
 	shrq	$8,%rcx
 	movb	%cl,7(%rax)
 	shrq	$8,%rcx
 	movl	%ecx,8(%rax)
 	movb	$0x89,5(%rax)	/* unset busy */
 	movl	$TSSSEL,%eax
 	ltr	%ax
 	jmp	done_tss
 
 do_ldt:	movq	PCPU(LDT),%rax
 	movq	P_MD+MD_LDT_SD(%rcx),%rdx
 	movq	%rdx,(%rax)
 	movq	P_MD+MD_LDT_SD+8(%rcx),%rdx
 	movq	%rdx,8(%rax)
 	movl	$LDTSEL,%eax
 	jmp	ld_ldt
 END(cpu_switch)
 
 /*
  * savectx(pcb)
  * Update pcb, saving current processor state.
  */
 ENTRY(savectx)
 	/* Save caller's return address. */
 	movq	(%rsp),%rax
 	movq	%rax,PCB_RIP(%rdi)
 
 	movq	%rbx,PCB_RBX(%rdi)
 	movq	%rsp,PCB_RSP(%rdi)
 	movq	%rbp,PCB_RBP(%rdi)
 	movq	%r12,PCB_R12(%rdi)
 	movq	%r13,PCB_R13(%rdi)
 	movq	%r14,PCB_R14(%rdi)
 	movq	%r15,PCB_R15(%rdi)
 
 	movq	%cr0,%rax
 	movq	%rax,PCB_CR0(%rdi)
 	movq	%cr2,%rax
 	movq	%rax,PCB_CR2(%rdi)
 	movq	%cr3,%rax
 	movq	%rax,PCB_CR3(%rdi)
 	movq	%cr4,%rax
 	movq	%rax,PCB_CR4(%rdi)
 
 	movq	%dr0,%rax
 	movq	%rax,PCB_DR0(%rdi)
 	movq	%dr1,%rax
 	movq	%rax,PCB_DR1(%rdi)
 	movq	%dr2,%rax
 	movq	%rax,PCB_DR2(%rdi)
 	movq	%dr3,%rax
 	movq	%rax,PCB_DR3(%rdi)
 	movq	%dr6,%rax
 	movq	%rax,PCB_DR6(%rdi)
 	movq	%dr7,%rax
 	movq	%rax,PCB_DR7(%rdi)
 
 	movl	$MSR_FSBASE,%ecx
 	rdmsr
 	movl	%eax,PCB_FSBASE(%rdi)
 	movl	%edx,PCB_FSBASE+4(%rdi)
 	movl	$MSR_GSBASE,%ecx
 	rdmsr
 	movl	%eax,PCB_GSBASE(%rdi)
 	movl	%edx,PCB_GSBASE+4(%rdi)
 	movl	$MSR_KGSBASE,%ecx
 	rdmsr
 	movl	%eax,PCB_KGSBASE(%rdi)
 	movl	%edx,PCB_KGSBASE+4(%rdi)
 	movl	$MSR_EFER,%ecx
 	rdmsr
 	movl	%eax,PCB_EFER(%rdi)
 	movl	%edx,PCB_EFER+4(%rdi)
 	movl	$MSR_STAR,%ecx
 	rdmsr
 	movl	%eax,PCB_STAR(%rdi)
 	movl	%edx,PCB_STAR+4(%rdi)
 	movl	$MSR_LSTAR,%ecx
 	rdmsr
 	movl	%eax,PCB_LSTAR(%rdi)
 	movl	%edx,PCB_LSTAR+4(%rdi)
 	movl	$MSR_CSTAR,%ecx
 	rdmsr
 	movl	%eax,PCB_CSTAR(%rdi)
 	movl	%edx,PCB_CSTAR+4(%rdi)
 	movl	$MSR_SF_MASK,%ecx
 	rdmsr
 	movl	%eax,PCB_SFMASK(%rdi)
 	movl	%edx,PCB_SFMASK+4(%rdi)
 
 	sgdt	PCB_GDT(%rdi)
 	sidt	PCB_IDT(%rdi)
 	sldt	PCB_LDT(%rdi)
 	str	PCB_TR(%rdi)
 
 	movl	$1,%eax
 	ret
 END(savectx)
 
 /*
  * resumectx(pcb)
  * Resuming processor state from pcb.
  */     
 ENTRY(resumectx)
 	/* Switch to KPML4phys. */
 	movq	KPML4phys,%rax
 	movq	%rax,%cr3
 
 	/* Force kernel segment registers. */
 	movl	$KDSEL,%eax
 	movw	%ax,%ds
 	movw	%ax,%es
 	movw	%ax,%ss
 	movl	$KUF32SEL,%eax
 	movw	%ax,%fs
 	movl	$KUG32SEL,%eax
 	movw	%ax,%gs
 
 	movl	$MSR_FSBASE,%ecx
 	movl	PCB_FSBASE(%rdi),%eax
 	movl	4 + PCB_FSBASE(%rdi),%edx
 	wrmsr
 	movl	$MSR_GSBASE,%ecx
 	movl	PCB_GSBASE(%rdi),%eax
 	movl	4 + PCB_GSBASE(%rdi),%edx
 	wrmsr
 	movl	$MSR_KGSBASE,%ecx
 	movl	PCB_KGSBASE(%rdi),%eax
 	movl	4 + PCB_KGSBASE(%rdi),%edx
 	wrmsr
 
 	/* Restore EFER. */
 	movl	$MSR_EFER,%ecx
 	movl	PCB_EFER(%rdi),%eax
 	wrmsr
 
 	/* Restore fast syscall stuff. */
 	movl	$MSR_STAR,%ecx
 	movl	PCB_STAR(%rdi),%eax
 	movl	4 + PCB_STAR(%rdi),%edx
 	wrmsr
 	movl	$MSR_LSTAR,%ecx
 	movl	PCB_LSTAR(%rdi),%eax
 	movl	4 + PCB_LSTAR(%rdi),%edx
 	wrmsr
 	movl	$MSR_CSTAR,%ecx
 	movl	PCB_CSTAR(%rdi),%eax
 	movl	4 + PCB_CSTAR(%rdi),%edx
 	wrmsr
 	movl	$MSR_SF_MASK,%ecx
 	movl	PCB_SFMASK(%rdi),%eax
 	wrmsr
 
 	/* Restore CR0, CR2, CR4 and CR3. */
 	movq	PCB_CR0(%rdi),%rax
 	movq	%rax,%cr0
 	movq	PCB_CR2(%rdi),%rax
 	movq	%rax,%cr2
 	movq	PCB_CR4(%rdi),%rax
 	movq	%rax,%cr4
 	movq	PCB_CR3(%rdi),%rax
 	movq	%rax,%cr3
 
 	/* Restore descriptor tables. */
 	lidt	PCB_IDT(%rdi)
 	lldt	PCB_LDT(%rdi)
 
 #define	SDT_SYSTSS	9
 #define	SDT_SYSBSY	11
 
 	/* Clear "task busy" bit and reload TR. */
 	movq	PCPU(TSS),%rax
 	andb	$(~SDT_SYSBSY | SDT_SYSTSS),5(%rax)
 	movw	PCB_TR(%rdi),%ax
 	ltr	%ax
 
 #undef	SDT_SYSTSS
 #undef	SDT_SYSBSY
 
 	/* Restore debug registers. */
 	movq	PCB_DR0(%rdi),%rax
 	movq	%rax,%dr0
 	movq	PCB_DR1(%rdi),%rax
 	movq	%rax,%dr1
 	movq	PCB_DR2(%rdi),%rax
 	movq	%rax,%dr2
 	movq	PCB_DR3(%rdi),%rax
 	movq	%rax,%dr3
 	movq	PCB_DR6(%rdi),%rax
 	movq	%rax,%dr6
 	movq	PCB_DR7(%rdi),%rax
 	movq	%rax,%dr7
 
 	/* Restore other callee saved registers. */
 	movq	PCB_R15(%rdi),%r15
 	movq	PCB_R14(%rdi),%r14
 	movq	PCB_R13(%rdi),%r13
 	movq	PCB_R12(%rdi),%r12
 	movq	PCB_RBP(%rdi),%rbp
 	movq	PCB_RSP(%rdi),%rsp
 	movq	PCB_RBX(%rdi),%rbx
 
 	/* Restore return address. */
 	movq	PCB_RIP(%rdi),%rax
 	movq	%rax,(%rsp)
 
 	xorl	%eax,%eax
 	ret
 END(resumectx)
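
Note on the "Restore fs base in GDT" and "Restore gs base in GDT" blocks in
cpu_switch.S above: each one patches a 32-bit base into an 8-byte segment
descriptor using three stores at byte offsets 2, 4 and 7. A rough C
equivalent is sketched below. It assumes a plain byte array stands in for the
descriptor, and the helper name sd_set_base32() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.  Scatter a 32-bit base
 * address into the base fields of an 8-byte segment descriptor, leaving
 * every other byte untouched.
 */
static void
sd_set_base32(uint8_t sd[8], uint32_t base)
{
	assert(sd != NULL);

	/* movw %ax,2(%rdx): base bits 0..15, stored little-endian. */
	sd[2] = (uint8_t)(base & 0xff);
	sd[3] = (uint8_t)((base >> 8) & 0xff);

	/* shrl $16,%eax ; movb %al,4(%rdx): base bits 16..23. */
	sd[4] = (uint8_t)((base >> 16) & 0xff);

	/* shrl $8,%eax ; movb %al,7(%rdx): base bits 24..31. */
	sd[7] = (uint8_t)((base >> 24) & 0xff);
}
```

With a base of 0x12345678, bytes 2, 3, 4 and 7 of the descriptor become 0x78,
0x56, 0x34 and 0x12, while the limit and access bytes keep their old values.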
Index: projects/release-arm-redux/sys/amd64/amd64/genassym.c
===================================================================
--- projects/release-arm-redux/sys/amd64/amd64/genassym.c	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/amd64/genassym.c	(revision 282692)
@@ -1,241 +1,239 @@
 /*-
  * Copyright (c) 1982, 1990 The Regents of the University of California.
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * William Jolitz.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: @(#)genassym.c	5.11 (Berkeley) 5/10/91
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_compat.h"
 #include "opt_hwpmc_hooks.h"
 #include "opt_kstack_pages.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/assym.h>
 #include <sys/bio.h>
 #include <sys/buf.h>
 #include <sys/proc.h>
 #ifdef	HWPMC_HOOKS
 #include <sys/pmckern.h>
 #endif
 #include <sys/errno.h>
 #include <sys/mount.h>
 #include <sys/mutex.h>
 #include <sys/socket.h>
 #include <sys/resourcevar.h>
 #include <sys/ucontext.h>
 #include <machine/tss.h>
 #include <sys/vmmeter.h>
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/pmap.h>
 #include <vm/vm_map.h>
 #include <sys/proc.h>
 #include <x86/apicreg.h>
 #include <machine/cpu.h>
 #include <machine/pcb.h>
 #include <machine/sigframe.h>
 #include <machine/proc.h>
 #include <machine/segments.h>
 
 ASSYM(P_VMSPACE, offsetof(struct proc, p_vmspace));
 ASSYM(VM_PMAP, offsetof(struct vmspace, vm_pmap));
 ASSYM(PM_ACTIVE, offsetof(struct pmap, pm_active));
-ASSYM(PM_SAVE, offsetof(struct pmap, pm_save));
-ASSYM(PM_PCID, offsetof(struct pmap, pm_pcid));
 
 ASSYM(P_MD, offsetof(struct proc, p_md));
 ASSYM(MD_LDT, offsetof(struct mdproc, md_ldt));
 ASSYM(MD_LDT_SD, offsetof(struct mdproc, md_ldt_sd));
 
 ASSYM(TD_LOCK, offsetof(struct thread, td_lock));
 ASSYM(TD_FLAGS, offsetof(struct thread, td_flags));
 ASSYM(TD_PCB, offsetof(struct thread, td_pcb));
 ASSYM(TD_PFLAGS, offsetof(struct thread, td_pflags));
 ASSYM(TD_PROC, offsetof(struct thread, td_proc));
 ASSYM(TD_TID, offsetof(struct thread, td_tid));
 ASSYM(TD_FRAME, offsetof(struct thread, td_frame));
 
 ASSYM(TDF_ASTPENDING, TDF_ASTPENDING);
 ASSYM(TDF_NEEDRESCHED, TDF_NEEDRESCHED);
 
 ASSYM(TDP_CALLCHAIN, TDP_CALLCHAIN);
 ASSYM(TDP_KTHREAD, TDP_KTHREAD);
 
 ASSYM(V_TRAP, offsetof(struct vmmeter, v_trap));
 ASSYM(V_SYSCALL, offsetof(struct vmmeter, v_syscall));
 ASSYM(V_INTR, offsetof(struct vmmeter, v_intr));
 ASSYM(KSTACK_PAGES, KSTACK_PAGES);
 ASSYM(PAGE_SIZE, PAGE_SIZE);
 ASSYM(NPTEPG, NPTEPG);
 ASSYM(NPDEPG, NPDEPG);
 ASSYM(addr_PTmap, addr_PTmap);
 ASSYM(addr_PDmap, addr_PDmap);
 ASSYM(addr_PDPmap, addr_PDPmap);
 ASSYM(addr_PML4map, addr_PML4map);
 ASSYM(addr_PML4pml4e, addr_PML4pml4e);
 ASSYM(PDESIZE, sizeof(pd_entry_t));
 ASSYM(PTESIZE, sizeof(pt_entry_t));
 ASSYM(PAGE_SHIFT, PAGE_SHIFT);
 ASSYM(PAGE_MASK, PAGE_MASK);
 ASSYM(PDRSHIFT, PDRSHIFT);
 ASSYM(PDPSHIFT, PDPSHIFT);
 ASSYM(PML4SHIFT, PML4SHIFT);
 ASSYM(val_KPDPI, KPDPI);
 ASSYM(val_KPML4I, KPML4I);
 ASSYM(val_PML4PML4I, PML4PML4I);
 ASSYM(USRSTACK, USRSTACK);
 ASSYM(VM_MAXUSER_ADDRESS, VM_MAXUSER_ADDRESS);
 ASSYM(KERNBASE, KERNBASE);
 ASSYM(DMAP_MIN_ADDRESS, DMAP_MIN_ADDRESS);
 ASSYM(DMAP_MAX_ADDRESS, DMAP_MAX_ADDRESS);
 ASSYM(MCLBYTES, MCLBYTES);
 
 ASSYM(PCB_R15, offsetof(struct pcb, pcb_r15));
 ASSYM(PCB_R14, offsetof(struct pcb, pcb_r14));
 ASSYM(PCB_R13, offsetof(struct pcb, pcb_r13));
 ASSYM(PCB_R12, offsetof(struct pcb, pcb_r12));
 ASSYM(PCB_RBP, offsetof(struct pcb, pcb_rbp));
 ASSYM(PCB_RSP, offsetof(struct pcb, pcb_rsp));
 ASSYM(PCB_RBX, offsetof(struct pcb, pcb_rbx));
 ASSYM(PCB_RIP, offsetof(struct pcb, pcb_rip));
 ASSYM(PCB_FSBASE, offsetof(struct pcb, pcb_fsbase));
 ASSYM(PCB_GSBASE, offsetof(struct pcb, pcb_gsbase));
 ASSYM(PCB_KGSBASE, offsetof(struct pcb, pcb_kgsbase));
 ASSYM(PCB_CR0, offsetof(struct pcb, pcb_cr0));
 ASSYM(PCB_CR2, offsetof(struct pcb, pcb_cr2));
 ASSYM(PCB_CR3, offsetof(struct pcb, pcb_cr3));
 ASSYM(PCB_CR4, offsetof(struct pcb, pcb_cr4));
 ASSYM(PCB_DR0, offsetof(struct pcb, pcb_dr0));
 ASSYM(PCB_DR1, offsetof(struct pcb, pcb_dr1));
 ASSYM(PCB_DR2, offsetof(struct pcb, pcb_dr2));
 ASSYM(PCB_DR3, offsetof(struct pcb, pcb_dr3));
 ASSYM(PCB_DR6, offsetof(struct pcb, pcb_dr6));
 ASSYM(PCB_DR7, offsetof(struct pcb, pcb_dr7));
 ASSYM(PCB_GDT, offsetof(struct pcb, pcb_gdt));
 ASSYM(PCB_IDT, offsetof(struct pcb, pcb_idt));
 ASSYM(PCB_LDT, offsetof(struct pcb, pcb_ldt));
 ASSYM(PCB_TR, offsetof(struct pcb, pcb_tr));
 ASSYM(PCB_FLAGS, offsetof(struct pcb, pcb_flags));
 ASSYM(PCB_ONFAULT, offsetof(struct pcb, pcb_onfault));
 ASSYM(PCB_GS32SD, offsetof(struct pcb, pcb_gs32sd));
 ASSYM(PCB_TSSP, offsetof(struct pcb, pcb_tssp));
 ASSYM(PCB_SAVEFPU, offsetof(struct pcb, pcb_save));
 ASSYM(PCB_EFER, offsetof(struct pcb, pcb_efer));
 ASSYM(PCB_STAR, offsetof(struct pcb, pcb_star));
 ASSYM(PCB_LSTAR, offsetof(struct pcb, pcb_lstar));
 ASSYM(PCB_CSTAR, offsetof(struct pcb, pcb_cstar));
 ASSYM(PCB_SFMASK, offsetof(struct pcb, pcb_sfmask));
 ASSYM(PCB_SIZE, sizeof(struct pcb));
 ASSYM(PCB_FULL_IRET, PCB_FULL_IRET);
 ASSYM(PCB_DBREGS, PCB_DBREGS);
 ASSYM(PCB_32BIT, PCB_32BIT);
 
 ASSYM(COMMON_TSS_RSP0, offsetof(struct amd64tss, tss_rsp0));
 
 ASSYM(TF_R15, offsetof(struct trapframe, tf_r15));
 ASSYM(TF_R14, offsetof(struct trapframe, tf_r14));
 ASSYM(TF_R13, offsetof(struct trapframe, tf_r13));
 ASSYM(TF_R12, offsetof(struct trapframe, tf_r12));
 ASSYM(TF_R11, offsetof(struct trapframe, tf_r11));
 ASSYM(TF_R10, offsetof(struct trapframe, tf_r10));
 ASSYM(TF_R9, offsetof(struct trapframe, tf_r9));
 ASSYM(TF_R8, offsetof(struct trapframe, tf_r8));
 ASSYM(TF_RDI, offsetof(struct trapframe, tf_rdi));
 ASSYM(TF_RSI, offsetof(struct trapframe, tf_rsi));
 ASSYM(TF_RBP, offsetof(struct trapframe, tf_rbp));
 ASSYM(TF_RBX, offsetof(struct trapframe, tf_rbx));
 ASSYM(TF_RDX, offsetof(struct trapframe, tf_rdx));
 ASSYM(TF_RCX, offsetof(struct trapframe, tf_rcx));
 ASSYM(TF_RAX, offsetof(struct trapframe, tf_rax));
 ASSYM(TF_TRAPNO, offsetof(struct trapframe, tf_trapno));
 ASSYM(TF_ADDR, offsetof(struct trapframe, tf_addr));
 ASSYM(TF_ERR, offsetof(struct trapframe, tf_err));
 ASSYM(TF_RIP, offsetof(struct trapframe, tf_rip));
 ASSYM(TF_CS, offsetof(struct trapframe, tf_cs));
 ASSYM(TF_RFLAGS, offsetof(struct trapframe, tf_rflags));
 ASSYM(TF_RSP, offsetof(struct trapframe, tf_rsp));
 ASSYM(TF_SS, offsetof(struct trapframe, tf_ss));
 ASSYM(TF_DS, offsetof(struct trapframe, tf_ds));
 ASSYM(TF_ES, offsetof(struct trapframe, tf_es));
 ASSYM(TF_FS, offsetof(struct trapframe, tf_fs));
 ASSYM(TF_GS, offsetof(struct trapframe, tf_gs));
 ASSYM(TF_FLAGS, offsetof(struct trapframe, tf_flags));
 ASSYM(TF_SIZE, sizeof(struct trapframe));
 ASSYM(TF_HASSEGS, TF_HASSEGS);
 
 ASSYM(SIGF_HANDLER, offsetof(struct sigframe, sf_ahu.sf_handler));
 ASSYM(SIGF_UC, offsetof(struct sigframe, sf_uc));
 ASSYM(UC_EFLAGS, offsetof(ucontext_t, uc_mcontext.mc_rflags));
 ASSYM(ENOENT, ENOENT);
 ASSYM(EFAULT, EFAULT);
 ASSYM(ENAMETOOLONG, ENAMETOOLONG);
 ASSYM(MAXCOMLEN, MAXCOMLEN);
 ASSYM(MAXPATHLEN, MAXPATHLEN);
 ASSYM(PC_SIZEOF, sizeof(struct pcpu));
 ASSYM(PC_PRVSPACE, offsetof(struct pcpu, pc_prvspace));
 ASSYM(PC_CURTHREAD, offsetof(struct pcpu, pc_curthread));
 ASSYM(PC_FPCURTHREAD, offsetof(struct pcpu, pc_fpcurthread));
 ASSYM(PC_IDLETHREAD, offsetof(struct pcpu, pc_idlethread));
 ASSYM(PC_CURPCB, offsetof(struct pcpu, pc_curpcb));
 ASSYM(PC_CPUID, offsetof(struct pcpu, pc_cpuid));
 ASSYM(PC_SCRATCH_RSP, offsetof(struct pcpu, pc_scratch_rsp));
 ASSYM(PC_CURPMAP, offsetof(struct pcpu, pc_curpmap));
 ASSYM(PC_TSSP, offsetof(struct pcpu, pc_tssp));
 ASSYM(PC_RSP0, offsetof(struct pcpu, pc_rsp0));
 ASSYM(PC_FS32P, offsetof(struct pcpu, pc_fs32p));
 ASSYM(PC_GS32P, offsetof(struct pcpu, pc_gs32p));
 ASSYM(PC_LDT, offsetof(struct pcpu, pc_ldt));
 ASSYM(PC_COMMONTSSP, offsetof(struct pcpu, pc_commontssp));
 ASSYM(PC_TSS, offsetof(struct pcpu, pc_tss));
 ASSYM(PC_PM_SAVE_CNT, offsetof(struct pcpu, pc_pm_save_cnt));
  
 ASSYM(LA_EOI, LAPIC_EOI * LAPIC_MEM_MUL);
 ASSYM(LA_ISR, LAPIC_ISR0 * LAPIC_MEM_MUL);
 
 ASSYM(KCSEL, GSEL(GCODE_SEL, SEL_KPL));
 ASSYM(KDSEL, GSEL(GDATA_SEL, SEL_KPL));
 ASSYM(KUCSEL, GSEL(GUCODE_SEL, SEL_UPL));
 ASSYM(KUDSEL, GSEL(GUDATA_SEL, SEL_UPL));
 ASSYM(KUC32SEL, GSEL(GUCODE32_SEL, SEL_UPL));
 ASSYM(KUF32SEL, GSEL(GUFS32_SEL, SEL_UPL));
 ASSYM(KUG32SEL, GSEL(GUGS32_SEL, SEL_UPL));
 ASSYM(TSSSEL, GSEL(GPROC0_SEL, SEL_KPL));
 ASSYM(LDTSEL, GSEL(GUSERLDT_SEL, SEL_KPL));
 ASSYM(SEL_RPL_MASK, SEL_RPL_MASK);
 
 ASSYM(__FreeBSD_version, __FreeBSD_version);
 
 #ifdef	HWPMC_HOOKS
 ASSYM(PMC_FN_USER_CALLCHAIN, PMC_FN_USER_CALLCHAIN);
 #endif
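
Note on the ASSYM() entries in genassym.c above: they give the assembly
sources named constants for structure offsets, which is why cpu_switch.S can
write TD_PCB(%r12) instead of a literal number. The sketch below shows only
the underlying idea in plain C. The structure, the DEMO_TD_PCB macro and the
helper load_by_offset() are all hypothetical, and the sketch does not
reproduce how the kernel build actually generates assym.s.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a kernel structure. */
struct demo_thread {
	uint64_t	td_lock;
	uint64_t	td_pcb;
};

/* Named offset, playing the role of an ASSYM()-exported constant. */
#define	DEMO_TD_PCB	offsetof(struct demo_thread, td_pcb)

/*
 * Read a 64-bit field at 'off' bytes past 'base', which is what an
 * instruction such as "movq OFFSET(%reg),%r8" does.
 */
static uint64_t
load_by_offset(const void *base, size_t off)
{
	uint64_t v;

	assert(base != NULL);
	memcpy(&v, (const char *)base + off, sizeof(v));
	return (v);
}
```

Calling load_by_offset(&td, DEMO_TD_PCB) on a struct demo_thread returns the
same value as td.td_pcb, so removing a field's ASSYM() entry (as this change
does for PM_SAVE and PM_PCID) only affects assembly that still names it.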
Index: projects/release-arm-redux/sys/amd64/amd64/machdep.c
===================================================================
--- projects/release-arm-redux/sys/amd64/amd64/machdep.c	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/amd64/machdep.c	(revision 282692)
@@ -1,2459 +1,2458 @@
 /*-
  * Copyright (c) 2003 Peter Wemm.
  * Copyright (c) 1992 Terrence R. Lambert.
  * Copyright (c) 1982, 1987, 1990 The Regents of the University of California.
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * William Jolitz.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: @(#)machdep.c	7.4 (Berkeley) 6/3/91
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_atpic.h"
 #include "opt_compat.h"
 #include "opt_cpu.h"
 #include "opt_ddb.h"
 #include "opt_inet.h"
 #include "opt_isa.h"
 #include "opt_kstack_pages.h"
 #include "opt_maxmem.h"
 #include "opt_mp_watchdog.h"
 #include "opt_perfmon.h"
 #include "opt_platform.h"
 #include "opt_sched.h"
 
 #include <sys/param.h>
 #include <sys/proc.h>
 #include <sys/systm.h>
 #include <sys/bio.h>
 #include <sys/buf.h>
 #include <sys/bus.h>
 #include <sys/callout.h>
 #include <sys/cons.h>
 #include <sys/cpu.h>
 #include <sys/efi.h>
 #include <sys/eventhandler.h>
 #include <sys/exec.h>
 #include <sys/imgact.h>
 #include <sys/kdb.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/linker.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/memrange.h>
 #include <sys/msgbuf.h>
 #include <sys/mutex.h>
 #include <sys/pcpu.h>
 #include <sys/ptrace.h>
 #include <sys/reboot.h>
 #include <sys/rwlock.h>
 #include <sys/sched.h>
 #include <sys/signalvar.h>
 #ifdef SMP
 #include <sys/smp.h>
 #endif
 #include <sys/syscallsubr.h>
 #include <sys/sysctl.h>
 #include <sys/sysent.h>
 #include <sys/sysproto.h>
 #include <sys/ucontext.h>
 #include <sys/vmmeter.h>
 
 #include <vm/vm.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_pager.h>
 #include <vm/vm_param.h>
 
 #ifdef DDB
 #ifndef KDB
 #error KDB must be enabled in order for DDB to work!
 #endif
 #include <ddb/ddb.h>
 #include <ddb/db_sym.h>
 #endif
 
 #include <net/netisr.h>
 
 #include <machine/clock.h>
 #include <machine/cpu.h>
 #include <machine/cputypes.h>
 #include <machine/intr_machdep.h>
 #include <x86/mca.h>
 #include <machine/md_var.h>
 #include <machine/metadata.h>
 #include <machine/mp_watchdog.h>
 #include <machine/pc/bios.h>
 #include <machine/pcb.h>
 #include <machine/proc.h>
 #include <machine/reg.h>
 #include <machine/sigframe.h>
 #include <machine/specialreg.h>
 #ifdef PERFMON
 #include <machine/perfmon.h>
 #endif
 #include <machine/tss.h>
 #ifdef SMP
 #include <machine/smp.h>
 #endif
 #ifdef FDT
 #include <x86/fdt.h>
 #endif
 
 #ifdef DEV_ATPIC
 #include <x86/isa/icu.h>
 #else
 #include <x86/apicvar.h>
 #endif
 
 #include <isa/isareg.h>
 #include <isa/rtc.h>
 #include <x86/init.h>
 
 /* Sanity check for __curthread() */
 CTASSERT(offsetof(struct pcpu, pc_curthread) == 0);
 
 extern u_int64_t hammer_time(u_int64_t, u_int64_t);
 
 #define	CS_SECURE(cs)		(ISPL(cs) == SEL_UPL)
 #define	EFL_SECURE(ef, oef)	((((ef) ^ (oef)) & ~PSL_USERCHANGE) == 0)
 
 static void cpu_startup(void *);
 static void get_fpcontext(struct thread *td, mcontext_t *mcp,
     char *xfpusave, size_t xfpusave_len);
 static int  set_fpcontext(struct thread *td, mcontext_t *mcp,
     char *xfpustate, size_t xfpustate_len);
 SYSINIT(cpu, SI_SUB_CPU, SI_ORDER_FIRST, cpu_startup, NULL);
 
 /* Preload data parse function */
 static caddr_t native_parse_preload_data(u_int64_t);
 
 /* Native function to fetch and parse the e820 map */
 static void native_parse_memmap(caddr_t, vm_paddr_t *, int *);
 
 /* Default init_ops implementation. */
 struct init_ops init_ops = {
 	.parse_preload_data =	native_parse_preload_data,
 	.early_clock_source_init =	i8254_init,
 	.early_delay =			i8254_delay,
 	.parse_memmap =			native_parse_memmap,
 #ifdef SMP
 	.mp_bootaddress =		mp_bootaddress,
 	.start_all_aps =		native_start_all_aps,
 #endif
 	.msi_init =			msi_init,
 };
 
 /*
  * The file "conf/ldscript.amd64" defines the symbol "kernphys".  Its value is
  * the physical address at which the kernel is loaded.
  */
 extern char kernphys[];
 
 struct msgbuf *msgbufp;
 
 /* Intel ICH registers */
 #define ICH_PMBASE	0x400
 #define ICH_SMI_EN	ICH_PMBASE + 0x30
 
 int	_udatasel, _ucodesel, _ucode32sel, _ufssel, _ugssel;
 
 int cold = 1;
 
 long Maxmem = 0;
 long realmem = 0;
 
 /*
  * The number of PHYSMAP entries must be one less than the number of
  * PHYSSEG entries because the PHYSMAP entry that spans the largest
  * physical address that is accessible by ISA DMA is split into two
  * PHYSSEG entries.
  */
 #define	PHYSMAP_SIZE	(2 * (VM_PHYSSEG_MAX - 1))
 
 vm_paddr_t phys_avail[PHYSMAP_SIZE + 2];
 vm_paddr_t dump_avail[PHYSMAP_SIZE + 2];
 
 /* must be 2 less so 0 0 can signal end of chunks */
 #define PHYS_AVAIL_ARRAY_END ((sizeof(phys_avail) / sizeof(phys_avail[0])) - 2)
 #define DUMP_AVAIL_ARRAY_END ((sizeof(dump_avail) / sizeof(dump_avail[0])) - 2)
 
 struct kva_md_info kmi;
 
 static struct trapframe proc0_tf;
 struct region_descriptor r_gdt, r_idt;
 
 struct pcpu __pcpu[MAXCPU];
 
 struct mtx icu_lock;
 
 struct mem_range_softc mem_range_softc;
 
 struct mtx dt_lock;	/* lock for GDT and LDT */
 
 void (*vmm_resume_p)(void);
 
 static void
 cpu_startup(dummy)
 	void *dummy;
 {
 	uintmax_t memsize;
 	char *sysenv;
 
 	/*
 	 * On MacBooks, we need to prevent the legacy USB circuit from
 	 * generating an SMI#, because this can cause several problems,
 	 * namely: incorrect CPU frequency detection and failure to
 	 * start the APs.
 	 * We do this by clearing a bit in the SMI_EN (SMI Control and
 	 * Enable) register of the Intel ICH LPC Interface Bridge.
 	 */
 	sysenv = kern_getenv("smbios.system.product");
 	if (sysenv != NULL) {
 		if (strncmp(sysenv, "MacBook1,1", 10) == 0 ||
 		    strncmp(sysenv, "MacBook3,1", 10) == 0 ||
 		    strncmp(sysenv, "MacBook4,1", 10) == 0 ||
 		    strncmp(sysenv, "MacBookPro1,1", 13) == 0 ||
 		    strncmp(sysenv, "MacBookPro1,2", 13) == 0 ||
 		    strncmp(sysenv, "MacBookPro3,1", 13) == 0 ||
 		    strncmp(sysenv, "MacBookPro4,1", 13) == 0 ||
 		    strncmp(sysenv, "Macmini1,1", 10) == 0) {
 			if (bootverbose)
 				printf("Disabling LEGACY_USB_EN bit on "
 				    "Intel ICH.\n");
 			outl(ICH_SMI_EN, inl(ICH_SMI_EN) & ~0x8);
 		}
 		freeenv(sysenv);
 	}
 
 	/*
 	 * Good {morning,afternoon,evening,night}.
 	 */
 	startrtclock();
 	printcpuinfo();
 	panicifcpuunsupported();
 #ifdef PERFMON
 	perfmon_init();
 #endif
 
 	/*
 	 * Display physical memory if SMBIOS reports a reasonable amount.
 	 */
 	memsize = 0;
 	sysenv = kern_getenv("smbios.memory.enabled");
 	if (sysenv != NULL) {
 		memsize = (uintmax_t)strtoul(sysenv, (char **)NULL, 10) << 10;
 		freeenv(sysenv);
 	}
 	if (memsize < ptoa((uintmax_t)vm_cnt.v_free_count))
 		memsize = ptoa((uintmax_t)Maxmem);
 	printf("real memory  = %ju (%ju MB)\n", memsize, memsize >> 20);
 	realmem = atop(memsize);
 
 	/*
 	 * Display any holes after the first chunk of extended memory.
 	 */
 	if (bootverbose) {
 		int indx;
 
 		printf("Physical memory chunk(s):\n");
 		for (indx = 0; phys_avail[indx + 1] != 0; indx += 2) {
 			vm_paddr_t size;
 
 			size = phys_avail[indx + 1] - phys_avail[indx];
 			printf(
 			    "0x%016jx - 0x%016jx, %ju bytes (%ju pages)\n",
 			    (uintmax_t)phys_avail[indx],
 			    (uintmax_t)phys_avail[indx + 1] - 1,
 			    (uintmax_t)size, (uintmax_t)size / PAGE_SIZE);
 		}
 	}
 
 	vm_ksubmap_init(&kmi);
 
 	printf("avail memory = %ju (%ju MB)\n",
 	    ptoa((uintmax_t)vm_cnt.v_free_count),
 	    ptoa((uintmax_t)vm_cnt.v_free_count) / 1048576);
 
 	/*
 	 * Set up buffers, so they can be used to read disk labels.
 	 */
 	bufinit();
 	vm_pager_bufferinit();
 
 	cpu_setregs();
 }
 
 /*
  * Send an interrupt to process.
  *
  * Stack is set up to allow sigcode stored
  * at top to call routine, followed by call
  * to sigreturn routine below.  After sigreturn
  * resets the signal mask, the stack, and the
  * frame pointer, it returns to the user
  * specified pc, psl.
  */
 void
 sendsig(sig_t catcher, ksiginfo_t *ksi, sigset_t *mask)
 {
 	struct sigframe sf, *sfp;
 	struct pcb *pcb;
 	struct proc *p;
 	struct thread *td;
 	struct sigacts *psp;
 	char *sp;
 	struct trapframe *regs;
 	char *xfpusave;
 	size_t xfpusave_len;
 	int sig;
 	int oonstack;
 
 	td = curthread;
 	pcb = td->td_pcb;
 	p = td->td_proc;
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	sig = ksi->ksi_signo;
 	psp = p->p_sigacts;
 	mtx_assert(&psp->ps_mtx, MA_OWNED);
 	regs = td->td_frame;
 	oonstack = sigonstack(regs->tf_rsp);
 
 	if (cpu_max_ext_state_size > sizeof(struct savefpu) && use_xsave) {
 		xfpusave_len = cpu_max_ext_state_size - sizeof(struct savefpu);
 		xfpusave = __builtin_alloca(xfpusave_len);
 	} else {
 		xfpusave_len = 0;
 		xfpusave = NULL;
 	}
 
 	/* Save user context. */
 	bzero(&sf, sizeof(sf));
 	sf.sf_uc.uc_sigmask = *mask;
 	sf.sf_uc.uc_stack = td->td_sigstk;
 	sf.sf_uc.uc_stack.ss_flags = (td->td_pflags & TDP_ALTSTACK)
 	    ? ((oonstack) ? SS_ONSTACK : 0) : SS_DISABLE;
 	sf.sf_uc.uc_mcontext.mc_onstack = (oonstack) ? 1 : 0;
 	bcopy(regs, &sf.sf_uc.uc_mcontext.mc_rdi, sizeof(*regs));
 	sf.sf_uc.uc_mcontext.mc_len = sizeof(sf.sf_uc.uc_mcontext); /* magic */
 	get_fpcontext(td, &sf.sf_uc.uc_mcontext, xfpusave, xfpusave_len);
 	fpstate_drop(td);
 	sf.sf_uc.uc_mcontext.mc_fsbase = pcb->pcb_fsbase;
 	sf.sf_uc.uc_mcontext.mc_gsbase = pcb->pcb_gsbase;
 	bzero(sf.sf_uc.uc_mcontext.mc_spare,
 	    sizeof(sf.sf_uc.uc_mcontext.mc_spare));
 	bzero(sf.sf_uc.__spare__, sizeof(sf.sf_uc.__spare__));
 
 	/* Allocate space for the signal handler context. */
 	if ((td->td_pflags & TDP_ALTSTACK) != 0 && !oonstack &&
 	    SIGISMEMBER(psp->ps_sigonstack, sig)) {
 		sp = td->td_sigstk.ss_sp + td->td_sigstk.ss_size;
 #if defined(COMPAT_43)
 		td->td_sigstk.ss_flags |= SS_ONSTACK;
 #endif
 	} else
 		sp = (char *)regs->tf_rsp - 128;
 	if (xfpusave != NULL) {
 		sp -= xfpusave_len;
 		sp = (char *)((unsigned long)sp & ~0x3Ful);
 		sf.sf_uc.uc_mcontext.mc_xfpustate = (register_t)sp;
 	}
 	sp -= sizeof(struct sigframe);
 	/* Align to 16 bytes. */
 	sfp = (struct sigframe *)((unsigned long)sp & ~0xFul);
 
 	/* Translate the signal if appropriate. */
 	if (p->p_sysent->sv_sigtbl && sig <= p->p_sysent->sv_sigsize)
 		sig = p->p_sysent->sv_sigtbl[_SIG_IDX(sig)];
 
 	/* Build the argument list for the signal handler. */
 	regs->tf_rdi = sig;			/* arg 1 in %rdi */
 	regs->tf_rdx = (register_t)&sfp->sf_uc;	/* arg 3 in %rdx */
 	bzero(&sf.sf_si, sizeof(sf.sf_si));
 	if (SIGISMEMBER(psp->ps_siginfo, sig)) {
 		/* Signal handler installed with SA_SIGINFO. */
 		regs->tf_rsi = (register_t)&sfp->sf_si;	/* arg 2 in %rsi */
 		sf.sf_ahu.sf_action = (__siginfohandler_t *)catcher;
 
 		/* Fill in POSIX parts */
 		sf.sf_si = ksi->ksi_info;
 		sf.sf_si.si_signo = sig; /* maybe a translated signal */
 		regs->tf_rcx = (register_t)ksi->ksi_addr; /* arg 4 in %rcx */
 	} else {
 		/* Old FreeBSD-style arguments. */
 		regs->tf_rsi = ksi->ksi_code;	/* arg 2 in %rsi */
 		regs->tf_rcx = (register_t)ksi->ksi_addr; /* arg 4 in %rcx */
 		sf.sf_ahu.sf_handler = catcher;
 	}
 	mtx_unlock(&psp->ps_mtx);
 	PROC_UNLOCK(p);
 
 	/*
 	 * Copy the sigframe out to the user's stack.
 	 */
 	if (copyout(&sf, sfp, sizeof(*sfp)) != 0 ||
 	    (xfpusave != NULL && copyout(xfpusave,
 	    (void *)sf.sf_uc.uc_mcontext.mc_xfpustate, xfpusave_len)
 	    != 0)) {
 #ifdef DEBUG
 		printf("process %ld has trashed its stack\n", (long)p->p_pid);
 #endif
 		PROC_LOCK(p);
 		sigexit(td, SIGILL);
 	}
 
 	regs->tf_rsp = (long)sfp;
 	regs->tf_rip = p->p_sysent->sv_sigcode_base;
 	regs->tf_rflags &= ~(PSL_T | PSL_D);
 	regs->tf_cs = _ucodesel;
 	regs->tf_ds = _udatasel;
 	regs->tf_ss = _udatasel;
 	regs->tf_es = _udatasel;
 	regs->tf_fs = _ufssel;
 	regs->tf_gs = _ugssel;
 	regs->tf_flags = TF_HASSEGS;
 	set_pcb_flags(pcb, PCB_FULL_IRET);
 	PROC_LOCK(p);
 	mtx_lock(&psp->ps_mtx);
 }
 
 /*
  * System call to clean up state after a signal
  * has been taken.  Reset signal mask and
  * stack state from context left by sendsig (above).
  * Return to previous pc and psl as specified by
  * context left by sendsig. Check carefully to
  * make sure that the user has not modified the
  * state to gain improper privileges.
  *
  * MPSAFE
  */
 int
 sys_sigreturn(td, uap)
 	struct thread *td;
 	struct sigreturn_args /* {
 		const struct __ucontext *sigcntxp;
 	} */ *uap;
 {
 	ucontext_t uc;
 	struct pcb *pcb;
 	struct proc *p;
 	struct trapframe *regs;
 	ucontext_t *ucp;
 	char *xfpustate;
 	size_t xfpustate_len;
 	long rflags;
 	int cs, error, ret;
 	ksiginfo_t ksi;
 
 	pcb = td->td_pcb;
 	p = td->td_proc;
 
 	error = copyin(uap->sigcntxp, &uc, sizeof(uc));
 	if (error != 0) {
 		uprintf("pid %d (%s): sigreturn copyin failed\n",
 		    p->p_pid, td->td_name);
 		return (error);
 	}
 	ucp = &uc;
 	if ((ucp->uc_mcontext.mc_flags & ~_MC_FLAG_MASK) != 0) {
 		uprintf("pid %d (%s): sigreturn mc_flags %x\n", p->p_pid,
 		    td->td_name, ucp->uc_mcontext.mc_flags);
 		return (EINVAL);
 	}
 	regs = td->td_frame;
 	rflags = ucp->uc_mcontext.mc_rflags;
 	/*
 	 * Don't allow users to change privileged or reserved flags.
 	 */
 	if (!EFL_SECURE(rflags, regs->tf_rflags)) {
 		uprintf("pid %d (%s): sigreturn rflags = 0x%lx\n", p->p_pid,
 		    td->td_name, rflags);
 		return (EINVAL);
 	}
 
 	/*
 	 * Don't allow users to load a valid privileged %cs.  Let the
 	 * hardware check for invalid selectors, excess privilege in
 	 * other selectors, invalid %rip's and invalid %rsp's.
 	 */
 	cs = ucp->uc_mcontext.mc_cs;
 	if (!CS_SECURE(cs)) {
 		uprintf("pid %d (%s): sigreturn cs = 0x%x\n", p->p_pid,
 		    td->td_name, cs);
 		ksiginfo_init_trap(&ksi);
 		ksi.ksi_signo = SIGBUS;
 		ksi.ksi_code = BUS_OBJERR;
 		ksi.ksi_trapno = T_PROTFLT;
 		ksi.ksi_addr = (void *)regs->tf_rip;
 		trapsignal(td, &ksi);
 		return (EINVAL);
 	}
 
 	if ((uc.uc_mcontext.mc_flags & _MC_HASFPXSTATE) != 0) {
 		xfpustate_len = uc.uc_mcontext.mc_xfpustate_len;
 		if (xfpustate_len > cpu_max_ext_state_size -
 		    sizeof(struct savefpu)) {
 			uprintf("pid %d (%s): sigreturn xfpusave_len = 0x%zx\n",
 			    p->p_pid, td->td_name, xfpustate_len);
 			return (EINVAL);
 		}
 		xfpustate = __builtin_alloca(xfpustate_len);
 		error = copyin((const void *)uc.uc_mcontext.mc_xfpustate,
 		    xfpustate, xfpustate_len);
 		if (error != 0) {
 			uprintf(
 	"pid %d (%s): sigreturn copying xfpustate failed\n",
 			    p->p_pid, td->td_name);
 			return (error);
 		}
 	} else {
 		xfpustate = NULL;
 		xfpustate_len = 0;
 	}
 	ret = set_fpcontext(td, &ucp->uc_mcontext, xfpustate, xfpustate_len);
 	if (ret != 0) {
 		uprintf("pid %d (%s): sigreturn set_fpcontext err %d\n",
 		    p->p_pid, td->td_name, ret);
 		return (ret);
 	}
 	bcopy(&ucp->uc_mcontext.mc_rdi, regs, sizeof(*regs));
 	pcb->pcb_fsbase = ucp->uc_mcontext.mc_fsbase;
 	pcb->pcb_gsbase = ucp->uc_mcontext.mc_gsbase;
 
 #if defined(COMPAT_43)
 	if (ucp->uc_mcontext.mc_onstack & 1)
 		td->td_sigstk.ss_flags |= SS_ONSTACK;
 	else
 		td->td_sigstk.ss_flags &= ~SS_ONSTACK;
 #endif
 
 	kern_sigprocmask(td, SIG_SETMASK, &ucp->uc_sigmask, NULL, 0);
 	set_pcb_flags(pcb, PCB_FULL_IRET);
 	return (EJUSTRETURN);
 }
 
 #ifdef COMPAT_FREEBSD4
 int
 freebsd4_sigreturn(struct thread *td, struct freebsd4_sigreturn_args *uap)
 {
  
 	return sys_sigreturn(td, (struct sigreturn_args *)uap);
 }
 #endif
 
 /*
  * Reset registers to default values on exec.
  */
 void
 exec_setregs(struct thread *td, struct image_params *imgp, u_long stack)
 {
 	struct trapframe *regs = td->td_frame;
 	struct pcb *pcb = td->td_pcb;
 
 	mtx_lock(&dt_lock);
 	if (td->td_proc->p_md.md_ldt != NULL)
 		user_ldt_free(td);
 	else
 		mtx_unlock(&dt_lock);
 	
 	pcb->pcb_fsbase = 0;
 	pcb->pcb_gsbase = 0;
 	clear_pcb_flags(pcb, PCB_32BIT);
 	pcb->pcb_initial_fpucw = __INITIAL_FPUCW__;
 	set_pcb_flags(pcb, PCB_FULL_IRET);
 
 	bzero((char *)regs, sizeof(struct trapframe));
 	regs->tf_rip = imgp->entry_addr;
 	regs->tf_rsp = ((stack - 8) & ~0xFul) + 8;
 	regs->tf_rdi = stack;		/* argv */
 	regs->tf_rflags = PSL_USER | (regs->tf_rflags & PSL_T);
 	regs->tf_ss = _udatasel;
 	regs->tf_cs = _ucodesel;
 	regs->tf_ds = _udatasel;
 	regs->tf_es = _udatasel;
 	regs->tf_fs = _ufssel;
 	regs->tf_gs = _ugssel;
 	regs->tf_flags = TF_HASSEGS;
 	td->td_retval[1] = 0;
 
 	/*
 	 * Reset the hardware debug registers if they were in use.
 	 * They won't have any meaning for the newly exec'd process.
 	 */
 	if (pcb->pcb_flags & PCB_DBREGS) {
 		pcb->pcb_dr0 = 0;
 		pcb->pcb_dr1 = 0;
 		pcb->pcb_dr2 = 0;
 		pcb->pcb_dr3 = 0;
 		pcb->pcb_dr6 = 0;
 		pcb->pcb_dr7 = 0;
 		if (pcb == curpcb) {
 			/*
 			 * Clear the debug registers on the running
 			 * CPU, otherwise they will end up affecting
 			 * the next process we switch to.
 			 */
 			reset_dbregs();
 		}
 		clear_pcb_flags(pcb, PCB_DBREGS);
 	}
 
 	/*
 	 * Drop the FP state if we hold it, so that the process gets a
 	 * clean FP state if it uses the FPU again.
 	 */
 	fpstate_drop(td);
 }
 
 void
 cpu_setregs(void)
 {
 	register_t cr0;
 
 	cr0 = rcr0();
 	/*
 	 * CR0_MP, CR0_NE and CR0_TS are also set by npx_probe() for the
 	 * BSP.  See the comments there about why we set them.
 	 */
 	cr0 |= CR0_MP | CR0_NE | CR0_TS | CR0_WP | CR0_AM;
 	load_cr0(cr0);
 }
 
 /*
  * Initialize amd64 and configure to run kernel
  */
 
 /*
  * Initialize segments & interrupt table
  */
 
 struct user_segment_descriptor gdt[NGDT * MAXCPU];/* global descriptor tables */
 static struct gate_descriptor idt0[NIDT];
 struct gate_descriptor *idt = &idt0[0];	/* interrupt descriptor table */
 
 static char dblfault_stack[PAGE_SIZE] __aligned(16);
 
 static char nmi0_stack[PAGE_SIZE] __aligned(16);
 CTASSERT(sizeof(struct nmi_pcpu) == 16);
 
 struct amd64tss common_tss[MAXCPU];
 
 /*
  * Software prototypes -- in more palatable form.
  *
  * Keep GUFS32, GUGS32, GUCODE32 and GUDATA in the same
  * slots as the corresponding segments in the i386 kernel.
  */
 struct soft_segment_descriptor gdt_segs[] = {
 /* GNULL_SEL	0 Null Descriptor */
 {	.ssd_base = 0x0,
 	.ssd_limit = 0x0,
 	.ssd_type = 0,
 	.ssd_dpl = 0,
 	.ssd_p = 0,
 	.ssd_long = 0,
 	.ssd_def32 = 0,
 	.ssd_gran = 0		},
 /* GNULL2_SEL	1 Null Descriptor */
 {	.ssd_base = 0x0,
 	.ssd_limit = 0x0,
 	.ssd_type = 0,
 	.ssd_dpl = 0,
 	.ssd_p = 0,
 	.ssd_long = 0,
 	.ssd_def32 = 0,
 	.ssd_gran = 0		},
 /* GUFS32_SEL	2 32 bit %fs Descriptor for user */
 {	.ssd_base = 0x0,
 	.ssd_limit = 0xfffff,
 	.ssd_type = SDT_MEMRWA,
 	.ssd_dpl = SEL_UPL,
 	.ssd_p = 1,
 	.ssd_long = 0,
 	.ssd_def32 = 1,
 	.ssd_gran = 1		},
 /* GUGS32_SEL	3 32 bit %gs Descriptor for user */
 {	.ssd_base = 0x0,
 	.ssd_limit = 0xfffff,
 	.ssd_type = SDT_MEMRWA,
 	.ssd_dpl = SEL_UPL,
 	.ssd_p = 1,
 	.ssd_long = 0,
 	.ssd_def32 = 1,
 	.ssd_gran = 1		},
 /* GCODE_SEL	4 Code Descriptor for kernel */
 {	.ssd_base = 0x0,
 	.ssd_limit = 0xfffff,
 	.ssd_type = SDT_MEMERA,
 	.ssd_dpl = SEL_KPL,
 	.ssd_p = 1,
 	.ssd_long = 1,
 	.ssd_def32 = 0,
 	.ssd_gran = 1		},
 /* GDATA_SEL	5 Data Descriptor for kernel */
 {	.ssd_base = 0x0,
 	.ssd_limit = 0xfffff,
 	.ssd_type = SDT_MEMRWA,
 	.ssd_dpl = SEL_KPL,
 	.ssd_p = 1,
 	.ssd_long = 1,
 	.ssd_def32 = 0,
 	.ssd_gran = 1		},
 /* GUCODE32_SEL	6 32 bit Code Descriptor for user */
 {	.ssd_base = 0x0,
 	.ssd_limit = 0xfffff,
 	.ssd_type = SDT_MEMERA,
 	.ssd_dpl = SEL_UPL,
 	.ssd_p = 1,
 	.ssd_long = 0,
 	.ssd_def32 = 1,
 	.ssd_gran = 1		},
 /* GUDATA_SEL	7 32/64 bit Data Descriptor for user */
 {	.ssd_base = 0x0,
 	.ssd_limit = 0xfffff,
 	.ssd_type = SDT_MEMRWA,
 	.ssd_dpl = SEL_UPL,
 	.ssd_p = 1,
 	.ssd_long = 0,
 	.ssd_def32 = 1,
 	.ssd_gran = 1		},
 /* GUCODE_SEL	8 64 bit Code Descriptor for user */
 {	.ssd_base = 0x0,
 	.ssd_limit = 0xfffff,
 	.ssd_type = SDT_MEMERA,
 	.ssd_dpl = SEL_UPL,
 	.ssd_p = 1,
 	.ssd_long = 1,
 	.ssd_def32 = 0,
 	.ssd_gran = 1		},
 /* GPROC0_SEL	9 Proc 0 Tss Descriptor */
 {	.ssd_base = 0x0,
 	.ssd_limit = sizeof(struct amd64tss) + IOPERM_BITMAP_SIZE - 1,
 	.ssd_type = SDT_SYSTSS,
 	.ssd_dpl = SEL_KPL,
 	.ssd_p = 1,
 	.ssd_long = 0,
 	.ssd_def32 = 0,
 	.ssd_gran = 0		},
 /* Actually, the TSS is a system descriptor which is double size */
 {	.ssd_base = 0x0,
 	.ssd_limit = 0x0,
 	.ssd_type = 0,
 	.ssd_dpl = 0,
 	.ssd_p = 0,
 	.ssd_long = 0,
 	.ssd_def32 = 0,
 	.ssd_gran = 0		},
 /* GUSERLDT_SEL	11 LDT Descriptor */
 {	.ssd_base = 0x0,
 	.ssd_limit = 0x0,
 	.ssd_type = 0,
 	.ssd_dpl = 0,
 	.ssd_p = 0,
 	.ssd_long = 0,
 	.ssd_def32 = 0,
 	.ssd_gran = 0		},
 /* GUSERLDT_SEL	12 LDT Descriptor, double size */
 {	.ssd_base = 0x0,
 	.ssd_limit = 0x0,
 	.ssd_type = 0,
 	.ssd_dpl = 0,
 	.ssd_p = 0,
 	.ssd_long = 0,
 	.ssd_def32 = 0,
 	.ssd_gran = 0		},
 };
 
 void
 setidt(idx, func, typ, dpl, ist)
 	int idx;
 	inthand_t *func;
 	int typ;
 	int dpl;
 	int ist;
 {
 	struct gate_descriptor *ip;
 
 	ip = idt + idx;
 	ip->gd_looffset = (uintptr_t)func;
 	ip->gd_selector = GSEL(GCODE_SEL, SEL_KPL);
 	ip->gd_ist = ist;
 	ip->gd_xx = 0;
 	ip->gd_type = typ;
 	ip->gd_dpl = dpl;
 	ip->gd_p = 1;
 	ip->gd_hioffset = ((uintptr_t)func)>>16 ;
 }
 
 extern inthand_t
 	IDTVEC(div), IDTVEC(dbg), IDTVEC(nmi), IDTVEC(bpt), IDTVEC(ofl),
 	IDTVEC(bnd), IDTVEC(ill), IDTVEC(dna), IDTVEC(fpusegm),
 	IDTVEC(tss), IDTVEC(missing), IDTVEC(stk), IDTVEC(prot),
 	IDTVEC(page), IDTVEC(mchk), IDTVEC(rsvd), IDTVEC(fpu), IDTVEC(align),
 	IDTVEC(xmm), IDTVEC(dblfault),
 #ifdef KDTRACE_HOOKS
 	IDTVEC(dtrace_ret),
 #endif
 #ifdef XENHVM
 	IDTVEC(xen_intr_upcall),
 #endif
 	IDTVEC(fast_syscall), IDTVEC(fast_syscall32);
 
 #ifdef DDB
 /*
  * Display the index and function name of any IDT entries that don't use
  * the default 'rsvd' entry point.
  */
 DB_SHOW_COMMAND(idt, db_show_idt)
 {
 	struct gate_descriptor *ip;
 	int idx;
 	uintptr_t func;
 
 	ip = idt;
 	for (idx = 0; idx < NIDT && !db_pager_quit; idx++) {
 		func = ((long)ip->gd_hioffset << 16 | ip->gd_looffset);
 		if (func != (uintptr_t)&IDTVEC(rsvd)) {
 			db_printf("%3d\t", idx);
 			db_printsym(func, DB_STGY_PROC);
 			db_printf("\n");
 		}
 		ip++;
 	}
 }
 
 /* Show privileged registers. */
 DB_SHOW_COMMAND(sysregs, db_show_sysregs)
 {
 	struct {
 		uint16_t limit;
 		uint64_t base;
 	} __packed idtr, gdtr;
 	uint16_t ldt, tr;
 
 	__asm __volatile("sidt %0" : "=m" (idtr));
 	db_printf("idtr\t0x%016lx/%04x\n",
 	    (u_long)idtr.base, (u_int)idtr.limit);
 	__asm __volatile("sgdt %0" : "=m" (gdtr));
 	db_printf("gdtr\t0x%016lx/%04x\n",
 	    (u_long)gdtr.base, (u_int)gdtr.limit);
 	__asm __volatile("sldt %0" : "=r" (ldt));
 	db_printf("ldtr\t0x%04x\n", ldt);
 	__asm __volatile("str %0" : "=r" (tr));
 	db_printf("tr\t0x%04x\n", tr);
 	db_printf("cr0\t0x%016lx\n", rcr0());
 	db_printf("cr2\t0x%016lx\n", rcr2());
 	db_printf("cr3\t0x%016lx\n", rcr3());
 	db_printf("cr4\t0x%016lx\n", rcr4());
 	db_printf("EFER\t%016lx\n", rdmsr(MSR_EFER));
 	db_printf("FEATURES_CTL\t%016lx\n", rdmsr(MSR_IA32_FEATURE_CONTROL));
 	db_printf("DEBUG_CTL\t%016lx\n", rdmsr(MSR_DEBUGCTLMSR));
 	db_printf("PAT\t%016lx\n", rdmsr(MSR_PAT));
 	db_printf("GSBASE\t%016lx\n", rdmsr(MSR_GSBASE));
 }
 #endif
 
 void
 sdtossd(sd, ssd)
 	struct user_segment_descriptor *sd;
 	struct soft_segment_descriptor *ssd;
 {
 
 	ssd->ssd_base  = (sd->sd_hibase << 24) | sd->sd_lobase;
 	ssd->ssd_limit = (sd->sd_hilimit << 16) | sd->sd_lolimit;
 	ssd->ssd_type  = sd->sd_type;
 	ssd->ssd_dpl   = sd->sd_dpl;
 	ssd->ssd_p     = sd->sd_p;
 	ssd->ssd_long  = sd->sd_long;
 	ssd->ssd_def32 = sd->sd_def32;
 	ssd->ssd_gran  = sd->sd_gran;
 }
 
 void
 ssdtosd(ssd, sd)
 	struct soft_segment_descriptor *ssd;
 	struct user_segment_descriptor *sd;
 {
 
 	sd->sd_lobase = (ssd->ssd_base) & 0xffffff;
 	sd->sd_hibase = (ssd->ssd_base >> 24) & 0xff;
 	sd->sd_lolimit = (ssd->ssd_limit) & 0xffff;
 	sd->sd_hilimit = (ssd->ssd_limit >> 16) & 0xf;
 	sd->sd_type  = ssd->ssd_type;
 	sd->sd_dpl   = ssd->ssd_dpl;
 	sd->sd_p     = ssd->ssd_p;
 	sd->sd_long  = ssd->ssd_long;
 	sd->sd_def32 = ssd->ssd_def32;
 	sd->sd_gran  = ssd->ssd_gran;
 }
 
 void
 ssdtosyssd(ssd, sd)
 	struct soft_segment_descriptor *ssd;
 	struct system_segment_descriptor *sd;
 {
 
 	sd->sd_lobase = (ssd->ssd_base) & 0xffffff;
 	sd->sd_hibase = (ssd->ssd_base >> 24) & 0xfffffffffful;
 	sd->sd_lolimit = (ssd->ssd_limit) & 0xffff;
 	sd->sd_hilimit = (ssd->ssd_limit >> 16) & 0xf;
 	sd->sd_type  = ssd->ssd_type;
 	sd->sd_dpl   = ssd->ssd_dpl;
 	sd->sd_p     = ssd->ssd_p;
 	sd->sd_gran  = ssd->ssd_gran;
 }
 
 #if !defined(DEV_ATPIC) && defined(DEV_ISA)
 #include <isa/isavar.h>
 #include <isa/isareg.h>
 /*
  * Return a bitmap of the current interrupt requests.  This is 8259-specific
  * and is only suitable for use at probe time.
  * This is only here to pacify sio.  It is NOT FATAL if this doesn't work.
  * It shouldn't be here.  There should probably be an APIC centric
  * implementation in the apic driver code, if at all.
  */
 intrmask_t
 isa_irq_pending(void)
 {
 	u_char irr1;
 	u_char irr2;
 
 	irr1 = inb(IO_ICU1);
 	irr2 = inb(IO_ICU2);
 	return ((irr2 << 8) | irr1);
 }
 #endif
 
 u_int basemem;
 
 static int
 add_physmap_entry(uint64_t base, uint64_t length, vm_paddr_t *physmap,
     int *physmap_idxp)
 {
 	int i, insert_idx, physmap_idx;
 
 	physmap_idx = *physmap_idxp;
 
 	if (length == 0)
 		return (1);
 
 	/*
 	 * Find insertion point while checking for overlap.  Start off by
 	 * assuming the new entry will be added to the end.
 	 *
 	 * NB: physmap_idx points to the next free slot.
 	 */
 	insert_idx = physmap_idx;
 	for (i = 0; i <= physmap_idx; i += 2) {
 		if (base < physmap[i + 1]) {
 			if (base + length <= physmap[i]) {
 				insert_idx = i;
 				break;
 			}
 			if (boothowto & RB_VERBOSE)
 				printf(
 		    "Overlapping memory regions, ignoring second region\n");
 			return (1);
 		}
 	}
 
 	/* See if we can prepend to the next entry. */
 	if (insert_idx <= physmap_idx && base + length == physmap[insert_idx]) {
 		physmap[insert_idx] = base;
 		return (1);
 	}
 
 	/* See if we can append to the previous entry. */
 	if (insert_idx > 0 && base == physmap[insert_idx - 1]) {
 		physmap[insert_idx - 1] += length;
 		return (1);
 	}
 
 	physmap_idx += 2;
 	*physmap_idxp = physmap_idx;
 	if (physmap_idx == PHYSMAP_SIZE) {
 		printf(
 		"Too many segments in the physical address map, giving up\n");
 		return (0);
 	}
 
 	/*
 	 * Move the last 'N' entries down to make room for the new
 	 * entry if needed.
 	 */
 	for (i = (physmap_idx - 2); i > insert_idx; i -= 2) {
 		physmap[i] = physmap[i - 2];
 		physmap[i + 1] = physmap[i - 1];
 	}
 
 	/* Insert the new entry. */
 	physmap[insert_idx] = base;
 	physmap[insert_idx + 1] = base + length;
 	return (1);
 }
 
 void
 bios_add_smap_entries(struct bios_smap *smapbase, u_int32_t smapsize,
                       vm_paddr_t *physmap, int *physmap_idx)
 {
 	struct bios_smap *smap, *smapend;
 
 	smapend = (struct bios_smap *)((uintptr_t)smapbase + smapsize);
 
 	for (smap = smapbase; smap < smapend; smap++) {
 		if (boothowto & RB_VERBOSE)
 			printf("SMAP type=%02x base=%016lx len=%016lx\n",
 			    smap->type, smap->base, smap->length);
 
 		if (smap->type != SMAP_TYPE_MEMORY)
 			continue;
 
 		if (!add_physmap_entry(smap->base, smap->length, physmap,
 		    physmap_idx))
 			break;
 	}
 }
 
 #define efi_next_descriptor(ptr, size) \
 	((struct efi_md *)(((uint8_t *) ptr) + size))
 
 static void
 add_efi_map_entries(struct efi_map_header *efihdr, vm_paddr_t *physmap,
     int *physmap_idx)
 {
 	struct efi_md *map, *p;
 	const char *type;
 	size_t efisz;
 	int ndesc, i;
 
 	static const char *types[] = {
 		"Reserved",
 		"LoaderCode",
 		"LoaderData",
 		"BootServicesCode",
 		"BootServicesData",
 		"RuntimeServicesCode",
 		"RuntimeServicesData",
 		"ConventionalMemory",
 		"UnusableMemory",
 		"ACPIReclaimMemory",
 		"ACPIMemoryNVS",
 		"MemoryMappedIO",
 		"MemoryMappedIOPortSpace",
 		"PalCode"
 	};
 
 	/*
 	 * Memory map data provided by UEFI via the GetMemoryMap
 	 * Boot Services API.
 	 */
 	efisz = (sizeof(struct efi_map_header) + 0xf) & ~0xf;
 	map = (struct efi_md *)((uint8_t *)efihdr + efisz); 
 
 	if (efihdr->descriptor_size == 0)
 		return;
 	ndesc = efihdr->memory_size / efihdr->descriptor_size;
 
 	if (boothowto & RB_VERBOSE)
 		printf("%23s %12s %12s %8s %4s\n",
 		    "Type", "Physical", "Virtual", "#Pages", "Attr");
 
 	for (i = 0, p = map; i < ndesc; i++,
 	    p = efi_next_descriptor(p, efihdr->descriptor_size)) {
 		if (boothowto & RB_VERBOSE) {
 			if (p->md_type <= EFI_MD_TYPE_PALCODE)
 				type = types[p->md_type];
 			else
 				type = "<INVALID>";
 			printf("%23s %012lx %12p %08lx ", type, p->md_phys,
 			    p->md_virt, p->md_pages);
 			if (p->md_attr & EFI_MD_ATTR_UC)
 				printf("UC ");
 			if (p->md_attr & EFI_MD_ATTR_WC)
 				printf("WC ");
 			if (p->md_attr & EFI_MD_ATTR_WT)
 				printf("WT ");
 			if (p->md_attr & EFI_MD_ATTR_WB)
 				printf("WB ");
 			if (p->md_attr & EFI_MD_ATTR_UCE)
 				printf("UCE ");
 			if (p->md_attr & EFI_MD_ATTR_WP)
 				printf("WP ");
 			if (p->md_attr & EFI_MD_ATTR_RP)
 				printf("RP ");
 			if (p->md_attr & EFI_MD_ATTR_XP)
 				printf("XP ");
 			if (p->md_attr & EFI_MD_ATTR_RT)
 				printf("RUNTIME");
 			printf("\n");
 		}
 
 		switch (p->md_type) {
 		case EFI_MD_TYPE_CODE:
 		case EFI_MD_TYPE_DATA:
 		case EFI_MD_TYPE_BS_CODE:
 		case EFI_MD_TYPE_BS_DATA:
 		case EFI_MD_TYPE_FREE:
 			/*
 			 * We're allowed to use any entry with these types.
 			 */
 			break;
 		default:
 			continue;
 		}
 
 		if (!add_physmap_entry(p->md_phys, (p->md_pages * PAGE_SIZE),
 		    physmap, physmap_idx))
 			break;
 	}
 }
 
 static char bootmethod[16] = "";
 SYSCTL_STRING(_machdep, OID_AUTO, bootmethod, CTLFLAG_RD, bootmethod, 0,
     "System firmware boot method");
 
 static void
 native_parse_memmap(caddr_t kmdp, vm_paddr_t *physmap, int *physmap_idx)
 {
 	struct bios_smap *smap;
 	struct efi_map_header *efihdr;
 	u_int32_t size;
 
 	/*
 	 * Memory map from INT 15:E820.
 	 *
 	 * subr_module.c says:
 	 * "Consumer may safely assume that size value precedes data."
 	 * ie: an int32_t immediately precedes smap.
 	 */
 
 	efihdr = (struct efi_map_header *)preload_search_info(kmdp,
 	    MODINFO_METADATA | MODINFOMD_EFI_MAP);
 	smap = (struct bios_smap *)preload_search_info(kmdp,
 	    MODINFO_METADATA | MODINFOMD_SMAP);
 	if (efihdr == NULL && smap == NULL)
 		panic("No BIOS smap or EFI map info from loader!");
 
 	if (efihdr != NULL) {
 		add_efi_map_entries(efihdr, physmap, physmap_idx);
 		strlcpy(bootmethod, "UEFI", sizeof(bootmethod));
 	} else {
 		size = *((u_int32_t *)smap - 1);
 		bios_add_smap_entries(smap, size, physmap, physmap_idx);
 		strlcpy(bootmethod, "BIOS", sizeof(bootmethod));
 	}
 }
 
 #define	PAGES_PER_GB	(1024 * 1024 * 1024 / PAGE_SIZE)
 
 /*
  * Populate the (physmap) array with base/bound pairs describing the
  * available physical memory in the system, then test this memory and
  * build the phys_avail array describing the actually-available memory.
  *
  * Total memory size may be set by the kernel environment variable
  * hw.physmem or the compile-time define MAXMEM.
  *
  * XXX first should be vm_paddr_t.
  */
 static void
 getmemsize(caddr_t kmdp, u_int64_t first)
 {
 	int i, physmap_idx, pa_indx, da_indx;
 	vm_paddr_t pa, physmap[PHYSMAP_SIZE];
 	u_long physmem_start, physmem_tunable, memtest;
 	pt_entry_t *pte;
 	quad_t dcons_addr, dcons_size;
 	int page_counter;
 
 	bzero(physmap, sizeof(physmap));
 	physmap_idx = 0;
 
 	init_ops.parse_memmap(kmdp, physmap, &physmap_idx);
 	physmap_idx -= 2;
 
 	/*
 	 * Find the 'base memory' segment for SMP
 	 */
 	basemem = 0;
 	for (i = 0; i <= physmap_idx; i += 2) {
 		if (physmap[i] <= 0xA0000) {
 			basemem = physmap[i + 1] / 1024;
 			break;
 		}
 	}
 	if (basemem == 0 || basemem > 640) {
 		if (bootverbose)
 			printf(
 		"Memory map doesn't contain a basemem segment, faking it\n");
 		basemem = 640;
 	}
 
 	/*
 	 * Make hole for "AP -> long mode" bootstrap code.  The
 	 * mp_bootaddress vector is only available when the kernel
 	 * is configured to support APs and APs for the system start
 	 * in 32bit mode (e.g. SMP bare metal).
 	 */
 	if (init_ops.mp_bootaddress) {
 		if (physmap[1] >= 0x100000000)
 			panic(
 	"Basemem segment is not suitable for AP bootstrap code!");
 		physmap[1] = init_ops.mp_bootaddress(physmap[1] / 1024);
 	}
 
 	/*
 	 * Maxmem isn't the "maximum memory", it's one larger than the
 	 * highest page of the physical address space.  It should be
 	 * called something like "Maxphyspage".  We may adjust this
 	 * based on ``hw.physmem'' and the results of the memory test.
 	 */
 	Maxmem = atop(physmap[physmap_idx + 1]);
 
 #ifdef MAXMEM
 	Maxmem = MAXMEM / 4;
 #endif
 
 	if (TUNABLE_ULONG_FETCH("hw.physmem", &physmem_tunable))
 		Maxmem = atop(physmem_tunable);
 
 	/*
 	 * The boot memory test is disabled by default, as it takes a
 	 * significant amount of time on large-memory systems, and is
 	 * unfriendly to virtual machines as it unnecessarily touches all
 	 * pages.
 	 *
 	 * A general name is used as the code may be extended to support
 	 * additional tests beyond the current "page present" test.
 	 */
 	memtest = 0;
 	TUNABLE_ULONG_FETCH("hw.memtest.tests", &memtest);
 
 	/*
 	 * Don't allow MAXMEM or hw.physmem to extend the amount of memory
 	 * in the system.
 	 */
 	if (Maxmem > atop(physmap[physmap_idx + 1]))
 		Maxmem = atop(physmap[physmap_idx + 1]);
 
 	if (atop(physmap[physmap_idx + 1]) != Maxmem &&
 	    (boothowto & RB_VERBOSE))
 		printf("Physical memory use set to %ldK\n", Maxmem * 4);
 
 	/* call pmap initialization to make new kernel address space */
 	pmap_bootstrap(&first);
 
 	/*
 	 * Size up each available chunk of physical memory.
 	 *
 	 * XXX Some BIOSes corrupt low 64KB between suspend and resume.
 	 * By default, mask off the first 16 pages unless we appear to be
 	 * running in a VM.
 	 */
 	physmem_start = (vm_guest > VM_GUEST_NO ? 1 : 16) << PAGE_SHIFT;
 	TUNABLE_ULONG_FETCH("hw.physmem.start", &physmem_start);
 	if (physmap[0] < physmem_start) {
 		if (physmem_start < PAGE_SIZE)
 			physmap[0] = PAGE_SIZE;
 		else if (physmem_start >= physmap[1])
 			physmap[0] = round_page(physmap[1] - PAGE_SIZE);
 		else
 			physmap[0] = round_page(physmem_start);
 	}
 	pa_indx = 0;
 	da_indx = 1;
 	phys_avail[pa_indx++] = physmap[0];
 	phys_avail[pa_indx] = physmap[0];
 	dump_avail[da_indx] = physmap[0];
 	pte = CMAP1;
 
 	/*
 	 * Get dcons buffer address
 	 */
 	if (getenv_quad("dcons.addr", &dcons_addr) == 0 ||
 	    getenv_quad("dcons.size", &dcons_size) == 0)
 		dcons_addr = 0;
 
 	/*
 	 * physmap is in bytes, so when converting to page boundaries,
 	 * round up the start address and round down the end address.
 	 */
 	page_counter = 0;
 	if (memtest != 0)
 		printf("Testing system memory");
 	for (i = 0; i <= physmap_idx; i += 2) {
 		vm_paddr_t end;
 
 		end = ptoa((vm_paddr_t)Maxmem);
 		if (physmap[i + 1] < end)
 			end = trunc_page(physmap[i + 1]);
 		for (pa = round_page(physmap[i]); pa < end; pa += PAGE_SIZE) {
 			int tmp, page_bad, full;
 			int *ptr = (int *)CADDR1;
 
 			full = FALSE;
 			/*
 			 * block out kernel memory as not available.
 			 */
 			if (pa >= (vm_paddr_t)kernphys && pa < first)
 				goto do_dump_avail;
 
 			/*
 			 * block out dcons buffer
 			 */
 			if (dcons_addr > 0
 			    && pa >= trunc_page(dcons_addr)
 			    && pa < dcons_addr + dcons_size)
 				goto do_dump_avail;
 
 			page_bad = FALSE;
 			if (memtest == 0)
 				goto skip_memtest;
 
 			/*
 			 * Print a "." every GB to show we're making
 			 * progress.
 			 */
 			page_counter++;
 			if ((page_counter % PAGES_PER_GB) == 0)
 				printf(".");
 
 			/*
 			 * map page into kernel: valid, read/write,non-cacheable
 			 */
 			*pte = pa | PG_V | PG_RW | PG_NC_PWT | PG_NC_PCD;
 			invltlb();
 
 			tmp = *(int *)ptr;
 			/*
 			 * Test for alternating 1's and 0's
 			 */
 			*(volatile int *)ptr = 0xaaaaaaaa;
 			if (*(volatile int *)ptr != 0xaaaaaaaa)
 				page_bad = TRUE;
 			/*
 			 * Test for alternating 0's and 1's
 			 */
 			*(volatile int *)ptr = 0x55555555;
 			if (*(volatile int *)ptr != 0x55555555)
 				page_bad = TRUE;
 			/*
 			 * Test for all 1's
 			 */
 			*(volatile int *)ptr = 0xffffffff;
 			if (*(volatile int *)ptr != 0xffffffff)
 				page_bad = TRUE;
 			/*
 			 * Test for all 0's
 			 */
 			*(volatile int *)ptr = 0x0;
 			if (*(volatile int *)ptr != 0x0)
 				page_bad = TRUE;
 			/*
 			 * Restore original value.
 			 */
 			*(int *)ptr = tmp;
 
 skip_memtest:
 			/*
 			 * Adjust array of valid/good pages.
 			 */
 			if (page_bad == TRUE)
 				continue;
 			/*
 			 * If this good page is a continuation of the
 			 * previous set of good pages, then just increase
 			 * the end pointer. Otherwise start a new chunk.
 			 * Note that "end" points one higher than end,
 			 * making the range >= start and < end.
 			 * If we're also doing a speculative memory
 			 * test and we are at or past the end, bump up Maxmem
 			 * so that we keep going. The first bad page
 			 * will terminate the loop.
 			 */
 			if (phys_avail[pa_indx] == pa) {
 				phys_avail[pa_indx] += PAGE_SIZE;
 			} else {
 				pa_indx++;
 				if (pa_indx == PHYS_AVAIL_ARRAY_END) {
 					printf(
 		"Too many holes in the physical address space, giving up\n");
 					pa_indx--;
 					full = TRUE;
 					goto do_dump_avail;
 				}
 				phys_avail[pa_indx++] = pa;	/* start */
 				phys_avail[pa_indx] = pa + PAGE_SIZE; /* end */
 			}
 			physmem++;
 do_dump_avail:
 			if (dump_avail[da_indx] == pa) {
 				dump_avail[da_indx] += PAGE_SIZE;
 			} else {
 				da_indx++;
 				if (da_indx == DUMP_AVAIL_ARRAY_END) {
 					da_indx--;
 					goto do_next;
 				}
 				dump_avail[da_indx++] = pa; /* start */
 				dump_avail[da_indx] = pa + PAGE_SIZE; /* end */
 			}
 do_next:
 			if (full)
 				break;
 		}
 	}
 	*pte = 0;
 	invltlb();
 	if (memtest != 0)
 		printf("\n");
 
 	/*
 	 * XXX
 	 * The last chunk must contain at least one page plus the message
 	 * buffer to avoid complicating other code (message buffer address
 	 * calculation, etc.).
 	 */
 	while (phys_avail[pa_indx - 1] + PAGE_SIZE +
 	    round_page(msgbufsize) >= phys_avail[pa_indx]) {
 		physmem -= atop(phys_avail[pa_indx] - phys_avail[pa_indx - 1]);
 		phys_avail[pa_indx--] = 0;
 		phys_avail[pa_indx--] = 0;
 	}
 
 	Maxmem = atop(phys_avail[pa_indx]);
 
 	/* Trim off space for the message buffer. */
 	phys_avail[pa_indx] -= round_page(msgbufsize);
 
 	/* Map the message buffer. */
 	msgbufp = (struct msgbuf *)PHYS_TO_DMAP(phys_avail[pa_indx]);
 }
 
 static caddr_t
 native_parse_preload_data(u_int64_t modulep)
 {
 	caddr_t kmdp;
 #ifdef DDB
 	vm_offset_t ksym_start;
 	vm_offset_t ksym_end;
 #endif
 
 	preload_metadata = (caddr_t)(uintptr_t)(modulep + KERNBASE);
 	preload_bootstrap_relocate(KERNBASE);
 	kmdp = preload_search_by_type("elf kernel");
 	if (kmdp == NULL)
 		kmdp = preload_search_by_type("elf64 kernel");
 	boothowto = MD_FETCH(kmdp, MODINFOMD_HOWTO, int);
 	kern_envp = MD_FETCH(kmdp, MODINFOMD_ENVP, char *) + KERNBASE;
 #ifdef DDB
 	ksym_start = MD_FETCH(kmdp, MODINFOMD_SSYM, uintptr_t);
 	ksym_end = MD_FETCH(kmdp, MODINFOMD_ESYM, uintptr_t);
 	db_fetch_ksymtab(ksym_start, ksym_end);
 #endif
 
 	return (kmdp);
 }
 
 u_int64_t
 hammer_time(u_int64_t modulep, u_int64_t physfree)
 {
 	caddr_t kmdp;
 	int gsel_tss, x;
 	struct pcpu *pc;
 	struct nmi_pcpu *np;
 	struct xstate_hdr *xhdr;
 	u_int64_t msr;
 	char *env;
 	size_t kstack0_sz;
 
 	thread0.td_kstack = physfree + KERNBASE;
 	thread0.td_kstack_pages = KSTACK_PAGES;
 	kstack0_sz = thread0.td_kstack_pages * PAGE_SIZE;
 	bzero((void *)thread0.td_kstack, kstack0_sz);
 	physfree += kstack0_sz;
 
 	/*
 	 * This may be done better later if it gets more high level
 	 * components in it. If so just link td->td_proc here.
 	 */
 	proc_linkup0(&proc0, &thread0);
 
 	kmdp = init_ops.parse_preload_data(modulep);
 
 	/* Init basic tunables, hz etc */
 	init_param1();
 
 	/*
 	 * make gdt memory segments
 	 */
 	for (x = 0; x < NGDT; x++) {
 		if (x != GPROC0_SEL && x != (GPROC0_SEL + 1) &&
 		    x != GUSERLDT_SEL && x != (GUSERLDT_SEL) + 1)
 			ssdtosd(&gdt_segs[x], &gdt[x]);
 	}
 	gdt_segs[GPROC0_SEL].ssd_base = (uintptr_t)&common_tss[0];
 	ssdtosyssd(&gdt_segs[GPROC0_SEL],
 	    (struct system_segment_descriptor *)&gdt[GPROC0_SEL]);
 
 	r_gdt.rd_limit = NGDT * sizeof(gdt[0]) - 1;
 	r_gdt.rd_base =  (long) gdt;
 	lgdt(&r_gdt);
 	pc = &__pcpu[0];
 
 	wrmsr(MSR_FSBASE, 0);		/* User value */
 	wrmsr(MSR_GSBASE, (u_int64_t)pc);
 	wrmsr(MSR_KGSBASE, 0);		/* User value while in the kernel */
 
 	pcpu_init(pc, 0, sizeof(struct pcpu));
 	dpcpu_init((void *)(physfree + KERNBASE), 0);
 	physfree += DPCPU_SIZE;
 	PCPU_SET(prvspace, pc);
 	PCPU_SET(curthread, &thread0);
 	PCPU_SET(tssp, &common_tss[0]);
 	PCPU_SET(commontssp, &common_tss[0]);
 	PCPU_SET(tss, (struct system_segment_descriptor *)&gdt[GPROC0_SEL]);
 	PCPU_SET(ldt, (struct system_segment_descriptor *)&gdt[GUSERLDT_SEL]);
 	PCPU_SET(fs32p, &gdt[GUFS32_SEL]);
 	PCPU_SET(gs32p, &gdt[GUGS32_SEL]);
 
 	/*
 	 * Initialize mutexes.
 	 *
 	 * icu_lock: in order to allow an interrupt to occur in a critical
 	 * 	     section, to set pcpu->ipending (etc...) properly, we
 	 *	     must be able to get the icu lock, so it can't be
 	 *	     under witness.
 	 */
 	mutex_init();
 	mtx_init(&icu_lock, "icu", NULL, MTX_SPIN | MTX_NOWITNESS);
 	mtx_init(&dt_lock, "descriptor tables", NULL, MTX_DEF);
 
 	/* exceptions */
 	for (x = 0; x < NIDT; x++)
 		setidt(x, &IDTVEC(rsvd), SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_DE, &IDTVEC(div),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_DB, &IDTVEC(dbg),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_NMI, &IDTVEC(nmi),  SDT_SYSIGT, SEL_KPL, 2);
  	setidt(IDT_BP, &IDTVEC(bpt),  SDT_SYSIGT, SEL_UPL, 0);
 	setidt(IDT_OF, &IDTVEC(ofl),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_BR, &IDTVEC(bnd),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_UD, &IDTVEC(ill),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_NM, &IDTVEC(dna),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_DF, &IDTVEC(dblfault), SDT_SYSIGT, SEL_KPL, 1);
 	setidt(IDT_FPUGP, &IDTVEC(fpusegm),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_TS, &IDTVEC(tss),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_NP, &IDTVEC(missing),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_SS, &IDTVEC(stk),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_GP, &IDTVEC(prot),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_PF, &IDTVEC(page),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_MF, &IDTVEC(fpu),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_AC, &IDTVEC(align), SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_MC, &IDTVEC(mchk),  SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_XF, &IDTVEC(xmm), SDT_SYSIGT, SEL_KPL, 0);
 #ifdef KDTRACE_HOOKS
 	setidt(IDT_DTRACE_RET, &IDTVEC(dtrace_ret), SDT_SYSIGT, SEL_UPL, 0);
 #endif
 #ifdef XENHVM
 	setidt(IDT_EVTCHN, &IDTVEC(xen_intr_upcall), SDT_SYSIGT, SEL_UPL, 0);
 #endif
 
 	r_idt.rd_limit = sizeof(idt0) - 1;
 	r_idt.rd_base = (long) idt;
 	lidt(&r_idt);
 
 	/*
 	 * Initialize the clock before the console so that console
 	 * initialization can use DELAY().
 	 */
 	clock_init();
 
 	/*
 	 * Use vt(4) by default for UEFI boot (during the sc(4)/vt(4)
 	 * transition).
 	 */
 	if (kmdp != NULL && preload_search_info(kmdp,
 	    MODINFO_METADATA | MODINFOMD_EFI_MAP) != NULL)
 		vty_set_preferred(VTY_VT);
 
 	/*
 	 * Initialize the console before we print anything out.
 	 */
 	cninit();
 
 #ifdef DEV_ISA
 #ifdef DEV_ATPIC
 	elcr_probe();
 	atpic_startup();
 #else
 	/* Reset and mask the atpics and leave them shut down. */
 	atpic_reset();
 
 	/*
 	 * Point the ICU spurious interrupt vectors at the APIC spurious
 	 * interrupt handler.
 	 */
 	setidt(IDT_IO_INTS + 7, IDTVEC(spuriousint), SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IDT_IO_INTS + 15, IDTVEC(spuriousint), SDT_SYSIGT, SEL_KPL, 0);
 #endif
 #else
 #error "have you forgotten the isa device?";
 #endif
 
 	kdb_init();
 
 #ifdef KDB
 	if (boothowto & RB_KDB)
 		kdb_enter(KDB_WHY_BOOTFLAGS,
 		    "Boot flags requested debugger");
 #endif
 
 	identify_cpu();		/* Final stage of CPU initialization */
 	initializecpu();	/* Initialize CPU registers */
 	initializecpucache();
 
 	/* doublefault stack space, runs on ist1 */
 	common_tss[0].tss_ist1 = (long)&dblfault_stack[sizeof(dblfault_stack)];
 
 	/*
 	 * NMI stack, runs on ist2.  The pcpu pointer is stored just
 	 * above the start of the ist2 stack.
 	 */
 	np = ((struct nmi_pcpu *) &nmi0_stack[sizeof(nmi0_stack)]) - 1;
 	np->np_pcpu = (register_t) pc;
 	common_tss[0].tss_ist2 = (long) np;
 
 	/* Set the IO permission bitmap (empty due to tss seg limit) */
 	common_tss[0].tss_iobase = sizeof(struct amd64tss) + IOPERM_BITMAP_SIZE;
 
 	gsel_tss = GSEL(GPROC0_SEL, SEL_KPL);
 	ltr(gsel_tss);
 
 	/* Set up the fast syscall stuff */
 	msr = rdmsr(MSR_EFER) | EFER_SCE;
 	wrmsr(MSR_EFER, msr);
 	wrmsr(MSR_LSTAR, (u_int64_t)IDTVEC(fast_syscall));
 	wrmsr(MSR_CSTAR, (u_int64_t)IDTVEC(fast_syscall32));
 	msr = ((u_int64_t)GSEL(GCODE_SEL, SEL_KPL) << 32) |
 	      ((u_int64_t)GSEL(GUCODE32_SEL, SEL_UPL) << 48);
 	wrmsr(MSR_STAR, msr);
 	wrmsr(MSR_SF_MASK, PSL_NT|PSL_T|PSL_I|PSL_C|PSL_D);
 
 	getmemsize(kmdp, physfree);
 	init_param2(physmem);
 
 	/* now running on new page tables, configured,and u/iom is accessible */
 
 	msgbufinit(msgbufp, msgbufsize);
 	fpuinit();
 
 	/*
 	 * Set up thread0 pcb after fpuinit calculated pcb + fpu save
 	 * area size.  Zero out the extended state header in fpu save
 	 * area.
 	 */
 	thread0.td_pcb = get_pcb_td(&thread0);
 	bzero(get_pcb_user_save_td(&thread0), cpu_max_ext_state_size);
 	if (use_xsave) {
 		xhdr = (struct xstate_hdr *)(get_pcb_user_save_td(&thread0) +
 		    1);
 		xhdr->xstate_bv = xsave_mask;
 	}
 	/* make an initial tss so cpu can get interrupt stack on syscall! */
 	common_tss[0].tss_rsp0 = (vm_offset_t)thread0.td_pcb;
 	/* Ensure the stack is aligned to 16 bytes */
 	common_tss[0].tss_rsp0 &= ~0xFul;
 	PCPU_SET(rsp0, common_tss[0].tss_rsp0);
 	PCPU_SET(curpcb, thread0.td_pcb);
 
 	/* transfer to user mode */
 
 	_ucodesel = GSEL(GUCODE_SEL, SEL_UPL);
 	_udatasel = GSEL(GUDATA_SEL, SEL_UPL);
 	_ucode32sel = GSEL(GUCODE32_SEL, SEL_UPL);
 	_ufssel = GSEL(GUFS32_SEL, SEL_UPL);
 	_ugssel = GSEL(GUGS32_SEL, SEL_UPL);
 
 	load_ds(_udatasel);
 	load_es(_udatasel);
 	load_fs(_ufssel);
 
 	/* setup proc 0's pcb */
 	thread0.td_pcb->pcb_flags = 0;
-	thread0.td_pcb->pcb_cr3 = KPML4phys; /* PCID 0 is reserved for kernel */
 	thread0.td_frame = &proc0_tf;
 
 	env = kern_getenv("kernelname");
 	if (env != NULL)
 		strlcpy(kernelname, env, sizeof(kernelname));
 
 	cpu_probe_amdc1e();
 
 #ifdef FDT
 	x86_init_fdt();
 #endif
 
 	/* Location of kernel stack for locore */
 	return ((u_int64_t)thread0.td_pcb);
 }
 
 void
 cpu_pcpu_init(struct pcpu *pcpu, int cpuid, size_t size)
 {
 
 	pcpu->pc_acpi_id = 0xffffffff;
 }
 
 static int
 smap_sysctl_handler(SYSCTL_HANDLER_ARGS)
 {
 	struct bios_smap *smapbase;
 	struct bios_smap_xattr smap;
 	caddr_t kmdp;
 	uint32_t *smapattr;
 	int count, error, i;
 
 	/* Retrieve the system memory map from the loader. */
 	kmdp = preload_search_by_type("elf kernel");
 	if (kmdp == NULL)
 		kmdp = preload_search_by_type("elf64 kernel");
 	smapbase = (struct bios_smap *)preload_search_info(kmdp,
 	    MODINFO_METADATA | MODINFOMD_SMAP);
 	if (smapbase == NULL)
 		return (0);
 	smapattr = (uint32_t *)preload_search_info(kmdp,
 	    MODINFO_METADATA | MODINFOMD_SMAP_XATTR);
 	count = *((uint32_t *)smapbase - 1) / sizeof(*smapbase);
 	error = 0;
 	for (i = 0; i < count; i++) {
 		smap.base = smapbase[i].base;
 		smap.length = smapbase[i].length;
 		smap.type = smapbase[i].type;
 		if (smapattr != NULL)
 			smap.xattr = smapattr[i];
 		else
 			smap.xattr = 0;
 		error = SYSCTL_OUT(req, &smap, sizeof(smap));
 	}
 	return (error);
 }
 SYSCTL_PROC(_machdep, OID_AUTO, smap, CTLTYPE_OPAQUE|CTLFLAG_RD, NULL, 0,
     smap_sysctl_handler, "S,bios_smap_xattr", "Raw BIOS SMAP data");
 
 static int
 efi_map_sysctl_handler(SYSCTL_HANDLER_ARGS)
 {
 	struct efi_map_header *efihdr;
 	caddr_t kmdp;
 	uint32_t efisize;
 
 	kmdp = preload_search_by_type("elf kernel");
 	if (kmdp == NULL)
 		kmdp = preload_search_by_type("elf64 kernel");
 	efihdr = (struct efi_map_header *)preload_search_info(kmdp,
 	    MODINFO_METADATA | MODINFOMD_EFI_MAP);
 	if (efihdr == NULL)
 		return (0);
 	efisize = *((uint32_t *)efihdr - 1);
 	return (SYSCTL_OUT(req, efihdr, efisize));
 }
 SYSCTL_PROC(_machdep, OID_AUTO, efi_map, CTLTYPE_OPAQUE|CTLFLAG_RD, NULL, 0,
     efi_map_sysctl_handler, "S,efi_map_header", "Raw EFI Memory Map");
 
 void
 spinlock_enter(void)
 {
 	struct thread *td;
 	register_t flags;
 
 	td = curthread;
 	if (td->td_md.md_spinlock_count == 0) {
 		flags = intr_disable();
 		td->td_md.md_spinlock_count = 1;
 		td->td_md.md_saved_flags = flags;
 	} else
 		td->td_md.md_spinlock_count++;
 	critical_enter();
 }
 
 void
 spinlock_exit(void)
 {
 	struct thread *td;
 	register_t flags;
 
 	td = curthread;
 	critical_exit();
 	flags = td->td_md.md_saved_flags;
 	td->td_md.md_spinlock_count--;
 	if (td->td_md.md_spinlock_count == 0)
 		intr_restore(flags);
 }
 
 /*
  * Construct a PCB from a trapframe. This is called from kdb_trap() where
  * we want to start a backtrace from the function that caused us to enter
  * the debugger. We have the context in the trapframe, but base the trace
  * on the PCB. The PCB doesn't have to be perfect, as long as it contains
  * enough for a backtrace.
  */
 void
 makectx(struct trapframe *tf, struct pcb *pcb)
 {
 
 	pcb->pcb_r12 = tf->tf_r12;
 	pcb->pcb_r13 = tf->tf_r13;
 	pcb->pcb_r14 = tf->tf_r14;
 	pcb->pcb_r15 = tf->tf_r15;
 	pcb->pcb_rbp = tf->tf_rbp;
 	pcb->pcb_rbx = tf->tf_rbx;
 	pcb->pcb_rip = tf->tf_rip;
 	pcb->pcb_rsp = tf->tf_rsp;
 }
 
 int
 ptrace_set_pc(struct thread *td, unsigned long addr)
 {
 
 	td->td_frame->tf_rip = addr;
 	set_pcb_flags(td->td_pcb, PCB_FULL_IRET);
 	return (0);
 }
 
 int
 ptrace_single_step(struct thread *td)
 {
 	td->td_frame->tf_rflags |= PSL_T;
 	return (0);
 }
 
 int
 ptrace_clear_single_step(struct thread *td)
 {
 	td->td_frame->tf_rflags &= ~PSL_T;
 	return (0);
 }
 
 int
 fill_regs(struct thread *td, struct reg *regs)
 {
 	struct trapframe *tp;
 
 	tp = td->td_frame;
 	return (fill_frame_regs(tp, regs));
 }
 
 int
 fill_frame_regs(struct trapframe *tp, struct reg *regs)
 {
 	regs->r_r15 = tp->tf_r15;
 	regs->r_r14 = tp->tf_r14;
 	regs->r_r13 = tp->tf_r13;
 	regs->r_r12 = tp->tf_r12;
 	regs->r_r11 = tp->tf_r11;
 	regs->r_r10 = tp->tf_r10;
 	regs->r_r9  = tp->tf_r9;
 	regs->r_r8  = tp->tf_r8;
 	regs->r_rdi = tp->tf_rdi;
 	regs->r_rsi = tp->tf_rsi;
 	regs->r_rbp = tp->tf_rbp;
 	regs->r_rbx = tp->tf_rbx;
 	regs->r_rdx = tp->tf_rdx;
 	regs->r_rcx = tp->tf_rcx;
 	regs->r_rax = tp->tf_rax;
 	regs->r_rip = tp->tf_rip;
 	regs->r_cs = tp->tf_cs;
 	regs->r_rflags = tp->tf_rflags;
 	regs->r_rsp = tp->tf_rsp;
 	regs->r_ss = tp->tf_ss;
 	if (tp->tf_flags & TF_HASSEGS) {
 		regs->r_ds = tp->tf_ds;
 		regs->r_es = tp->tf_es;
 		regs->r_fs = tp->tf_fs;
 		regs->r_gs = tp->tf_gs;
 	} else {
 		regs->r_ds = 0;
 		regs->r_es = 0;
 		regs->r_fs = 0;
 		regs->r_gs = 0;
 	}
 	return (0);
 }
 
 int
 set_regs(struct thread *td, struct reg *regs)
 {
 	struct trapframe *tp;
 	register_t rflags;
 
 	tp = td->td_frame;
 	rflags = regs->r_rflags & 0xffffffff;
 	if (!EFL_SECURE(rflags, tp->tf_rflags) || !CS_SECURE(regs->r_cs))
 		return (EINVAL);
 	tp->tf_r15 = regs->r_r15;
 	tp->tf_r14 = regs->r_r14;
 	tp->tf_r13 = regs->r_r13;
 	tp->tf_r12 = regs->r_r12;
 	tp->tf_r11 = regs->r_r11;
 	tp->tf_r10 = regs->r_r10;
 	tp->tf_r9  = regs->r_r9;
 	tp->tf_r8  = regs->r_r8;
 	tp->tf_rdi = regs->r_rdi;
 	tp->tf_rsi = regs->r_rsi;
 	tp->tf_rbp = regs->r_rbp;
 	tp->tf_rbx = regs->r_rbx;
 	tp->tf_rdx = regs->r_rdx;
 	tp->tf_rcx = regs->r_rcx;
 	tp->tf_rax = regs->r_rax;
 	tp->tf_rip = regs->r_rip;
 	tp->tf_cs = regs->r_cs;
 	tp->tf_rflags = rflags;
 	tp->tf_rsp = regs->r_rsp;
 	tp->tf_ss = regs->r_ss;
 	if (0) {	/* XXXKIB */
 		tp->tf_ds = regs->r_ds;
 		tp->tf_es = regs->r_es;
 		tp->tf_fs = regs->r_fs;
 		tp->tf_gs = regs->r_gs;
 		tp->tf_flags = TF_HASSEGS;
 	}
 	set_pcb_flags(td->td_pcb, PCB_FULL_IRET);
 	return (0);
 }
 
 /* XXX check all this stuff! */
 /* externalize from sv_xmm */
 static void
 fill_fpregs_xmm(struct savefpu *sv_xmm, struct fpreg *fpregs)
 {
 	struct envxmm *penv_fpreg = (struct envxmm *)&fpregs->fpr_env;
 	struct envxmm *penv_xmm = &sv_xmm->sv_env;
 	int i;
 
 	/* pcb -> fpregs */
 	bzero(fpregs, sizeof(*fpregs));
 
 	/* FPU control/status */
 	penv_fpreg->en_cw = penv_xmm->en_cw;
 	penv_fpreg->en_sw = penv_xmm->en_sw;
 	penv_fpreg->en_tw = penv_xmm->en_tw;
 	penv_fpreg->en_opcode = penv_xmm->en_opcode;
 	penv_fpreg->en_rip = penv_xmm->en_rip;
 	penv_fpreg->en_rdp = penv_xmm->en_rdp;
 	penv_fpreg->en_mxcsr = penv_xmm->en_mxcsr;
 	penv_fpreg->en_mxcsr_mask = penv_xmm->en_mxcsr_mask;
 
 	/* FPU registers */
 	for (i = 0; i < 8; ++i)
 		bcopy(sv_xmm->sv_fp[i].fp_acc.fp_bytes, fpregs->fpr_acc[i], 10);
 
 	/* SSE registers */
 	for (i = 0; i < 16; ++i)
 		bcopy(sv_xmm->sv_xmm[i].xmm_bytes, fpregs->fpr_xacc[i], 16);
 }
 
 /* internalize from fpregs into sv_xmm */
 static void
 set_fpregs_xmm(struct fpreg *fpregs, struct savefpu *sv_xmm)
 {
 	struct envxmm *penv_xmm = &sv_xmm->sv_env;
 	struct envxmm *penv_fpreg = (struct envxmm *)&fpregs->fpr_env;
 	int i;
 
 	/* fpregs -> pcb */
 	/* FPU control/status */
 	penv_xmm->en_cw = penv_fpreg->en_cw;
 	penv_xmm->en_sw = penv_fpreg->en_sw;
 	penv_xmm->en_tw = penv_fpreg->en_tw;
 	penv_xmm->en_opcode = penv_fpreg->en_opcode;
 	penv_xmm->en_rip = penv_fpreg->en_rip;
 	penv_xmm->en_rdp = penv_fpreg->en_rdp;
 	penv_xmm->en_mxcsr = penv_fpreg->en_mxcsr;
 	penv_xmm->en_mxcsr_mask = penv_fpreg->en_mxcsr_mask & cpu_mxcsr_mask;
 
 	/* FPU registers */
 	for (i = 0; i < 8; ++i)
 		bcopy(fpregs->fpr_acc[i], sv_xmm->sv_fp[i].fp_acc.fp_bytes, 10);
 
 	/* SSE registers */
 	for (i = 0; i < 16; ++i)
 		bcopy(fpregs->fpr_xacc[i], sv_xmm->sv_xmm[i].xmm_bytes, 16);
 }
 
 /* externalize from td->pcb */
 int
 fill_fpregs(struct thread *td, struct fpreg *fpregs)
 {
 
 	KASSERT(td == curthread || TD_IS_SUSPENDED(td) ||
 	    P_SHOULDSTOP(td->td_proc),
 	    ("not suspended thread %p", td));
 	fpugetregs(td);
 	fill_fpregs_xmm(get_pcb_user_save_td(td), fpregs);
 	return (0);
 }
 
 /* internalize to td->pcb */
 int
 set_fpregs(struct thread *td, struct fpreg *fpregs)
 {
 
 	set_fpregs_xmm(fpregs, get_pcb_user_save_td(td));
 	fpuuserinited(td);
 	return (0);
 }
 
 /*
  * Get machine context.
  */
 int
 get_mcontext(struct thread *td, mcontext_t *mcp, int flags)
 {
 	struct pcb *pcb;
 	struct trapframe *tp;
 
 	pcb = td->td_pcb;
 	tp = td->td_frame;
 	PROC_LOCK(curthread->td_proc);
 	mcp->mc_onstack = sigonstack(tp->tf_rsp);
 	PROC_UNLOCK(curthread->td_proc);
 	mcp->mc_r15 = tp->tf_r15;
 	mcp->mc_r14 = tp->tf_r14;
 	mcp->mc_r13 = tp->tf_r13;
 	mcp->mc_r12 = tp->tf_r12;
 	mcp->mc_r11 = tp->tf_r11;
 	mcp->mc_r10 = tp->tf_r10;
 	mcp->mc_r9  = tp->tf_r9;
 	mcp->mc_r8  = tp->tf_r8;
 	mcp->mc_rdi = tp->tf_rdi;
 	mcp->mc_rsi = tp->tf_rsi;
 	mcp->mc_rbp = tp->tf_rbp;
 	mcp->mc_rbx = tp->tf_rbx;
 	mcp->mc_rcx = tp->tf_rcx;
 	mcp->mc_rflags = tp->tf_rflags;
 	if (flags & GET_MC_CLEAR_RET) {
 		mcp->mc_rax = 0;
 		mcp->mc_rdx = 0;
 		mcp->mc_rflags &= ~PSL_C;
 	} else {
 		mcp->mc_rax = tp->tf_rax;
 		mcp->mc_rdx = tp->tf_rdx;
 	}
 	mcp->mc_rip = tp->tf_rip;
 	mcp->mc_cs = tp->tf_cs;
 	mcp->mc_rsp = tp->tf_rsp;
 	mcp->mc_ss = tp->tf_ss;
 	mcp->mc_ds = tp->tf_ds;
 	mcp->mc_es = tp->tf_es;
 	mcp->mc_fs = tp->tf_fs;
 	mcp->mc_gs = tp->tf_gs;
 	mcp->mc_flags = tp->tf_flags;
 	mcp->mc_len = sizeof(*mcp);
 	get_fpcontext(td, mcp, NULL, 0);
 	mcp->mc_fsbase = pcb->pcb_fsbase;
 	mcp->mc_gsbase = pcb->pcb_gsbase;
 	mcp->mc_xfpustate = 0;
 	mcp->mc_xfpustate_len = 0;
 	bzero(mcp->mc_spare, sizeof(mcp->mc_spare));
 	return (0);
 }
 
 /*
  * Set machine context.
  *
  * However, we don't set any but the user modifiable flags, and we won't
  * touch the cs selector.
  */
 int
 set_mcontext(struct thread *td, mcontext_t *mcp)
 {
 	struct pcb *pcb;
 	struct trapframe *tp;
 	char *xfpustate;
 	long rflags;
 	int ret;
 
 	pcb = td->td_pcb;
 	tp = td->td_frame;
 	if (mcp->mc_len != sizeof(*mcp) ||
 	    (mcp->mc_flags & ~_MC_FLAG_MASK) != 0)
 		return (EINVAL);
 	rflags = (mcp->mc_rflags & PSL_USERCHANGE) |
 	    (tp->tf_rflags & ~PSL_USERCHANGE);
 	if (mcp->mc_flags & _MC_HASFPXSTATE) {
 		if (mcp->mc_xfpustate_len > cpu_max_ext_state_size -
 		    sizeof(struct savefpu))
 			return (EINVAL);
 		xfpustate = __builtin_alloca(mcp->mc_xfpustate_len);
 		ret = copyin((void *)mcp->mc_xfpustate, xfpustate,
 		    mcp->mc_xfpustate_len);
 		if (ret != 0)
 			return (ret);
 	} else
 		xfpustate = NULL;
 	ret = set_fpcontext(td, mcp, xfpustate, mcp->mc_xfpustate_len);
 	if (ret != 0)
 		return (ret);
 	tp->tf_r15 = mcp->mc_r15;
 	tp->tf_r14 = mcp->mc_r14;
 	tp->tf_r13 = mcp->mc_r13;
 	tp->tf_r12 = mcp->mc_r12;
 	tp->tf_r11 = mcp->mc_r11;
 	tp->tf_r10 = mcp->mc_r10;
 	tp->tf_r9  = mcp->mc_r9;
 	tp->tf_r8  = mcp->mc_r8;
 	tp->tf_rdi = mcp->mc_rdi;
 	tp->tf_rsi = mcp->mc_rsi;
 	tp->tf_rbp = mcp->mc_rbp;
 	tp->tf_rbx = mcp->mc_rbx;
 	tp->tf_rdx = mcp->mc_rdx;
 	tp->tf_rcx = mcp->mc_rcx;
 	tp->tf_rax = mcp->mc_rax;
 	tp->tf_rip = mcp->mc_rip;
 	tp->tf_rflags = rflags;
 	tp->tf_rsp = mcp->mc_rsp;
 	tp->tf_ss = mcp->mc_ss;
 	tp->tf_flags = mcp->mc_flags;
 	if (tp->tf_flags & TF_HASSEGS) {
 		tp->tf_ds = mcp->mc_ds;
 		tp->tf_es = mcp->mc_es;
 		tp->tf_fs = mcp->mc_fs;
 		tp->tf_gs = mcp->mc_gs;
 	}
 	if (mcp->mc_flags & _MC_HASBASES) {
 		pcb->pcb_fsbase = mcp->mc_fsbase;
 		pcb->pcb_gsbase = mcp->mc_gsbase;
 	}
 	set_pcb_flags(pcb, PCB_FULL_IRET);
 	return (0);
 }
 
 static void
 get_fpcontext(struct thread *td, mcontext_t *mcp, char *xfpusave,
     size_t xfpusave_len)
 {
 	size_t max_len, len;
 
 	mcp->mc_ownedfp = fpugetregs(td);
 	bcopy(get_pcb_user_save_td(td), &mcp->mc_fpstate[0],
 	    sizeof(mcp->mc_fpstate));
 	mcp->mc_fpformat = fpuformat();
 	if (!use_xsave || xfpusave_len == 0)
 		return;
 	max_len = cpu_max_ext_state_size - sizeof(struct savefpu);
 	len = xfpusave_len;
 	if (len > max_len) {
 		len = max_len;
 		bzero(xfpusave + max_len, xfpusave_len - max_len);
 	}
 	mcp->mc_flags |= _MC_HASFPXSTATE;
 	mcp->mc_xfpustate_len = len;
 	bcopy(get_pcb_user_save_td(td) + 1, xfpusave, len);
 }
 
 static int
 set_fpcontext(struct thread *td, mcontext_t *mcp, char *xfpustate,
     size_t xfpustate_len)
 {
 	struct savefpu *fpstate;
 	int error;
 
 	if (mcp->mc_fpformat == _MC_FPFMT_NODEV)
 		return (0);
 	else if (mcp->mc_fpformat != _MC_FPFMT_XMM)
 		return (EINVAL);
 	else if (mcp->mc_ownedfp == _MC_FPOWNED_NONE) {
 		/* We don't care what state is left in the FPU or PCB. */
 		fpstate_drop(td);
 		error = 0;
 	} else if (mcp->mc_ownedfp == _MC_FPOWNED_FPU ||
 	    mcp->mc_ownedfp == _MC_FPOWNED_PCB) {
 		fpstate = (struct savefpu *)&mcp->mc_fpstate;
 		fpstate->sv_env.en_mxcsr &= cpu_mxcsr_mask;
 		error = fpusetregs(td, fpstate, xfpustate, xfpustate_len);
 	} else
 		return (EINVAL);
 	return (error);
 }
 
 void
 fpstate_drop(struct thread *td)
 {
 
 	KASSERT(PCB_USER_FPU(td->td_pcb), ("fpstate_drop: kernel-owned fpu"));
 	critical_enter();
 	if (PCPU_GET(fpcurthread) == td)
 		fpudrop();
 	/*
 	 * XXX force a full drop of the fpu.  The above only drops it if we
 	 * owned it.
 	 *
 	 * XXX I don't much like fpugetuserregs()'s semantics of doing a full
 	 * drop.  Dropping only to the pcb matches fnsave's behaviour.
 	 * We only need to drop to !PCB_INITDONE in sendsig().  But
 	 * sendsig() is the only caller of fpugetuserregs()... perhaps we just
 	 * have too many layers.
 	 */
 	clear_pcb_flags(curthread->td_pcb,
 	    PCB_FPUINITDONE | PCB_USERFPUINITDONE);
 	critical_exit();
 }
 
 int
 fill_dbregs(struct thread *td, struct dbreg *dbregs)
 {
 	struct pcb *pcb;
 
 	if (td == NULL) {
 		dbregs->dr[0] = rdr0();
 		dbregs->dr[1] = rdr1();
 		dbregs->dr[2] = rdr2();
 		dbregs->dr[3] = rdr3();
 		dbregs->dr[6] = rdr6();
 		dbregs->dr[7] = rdr7();
 	} else {
 		pcb = td->td_pcb;
 		dbregs->dr[0] = pcb->pcb_dr0;
 		dbregs->dr[1] = pcb->pcb_dr1;
 		dbregs->dr[2] = pcb->pcb_dr2;
 		dbregs->dr[3] = pcb->pcb_dr3;
 		dbregs->dr[6] = pcb->pcb_dr6;
 		dbregs->dr[7] = pcb->pcb_dr7;
 	}
 	dbregs->dr[4] = 0;
 	dbregs->dr[5] = 0;
 	dbregs->dr[8] = 0;
 	dbregs->dr[9] = 0;
 	dbregs->dr[10] = 0;
 	dbregs->dr[11] = 0;
 	dbregs->dr[12] = 0;
 	dbregs->dr[13] = 0;
 	dbregs->dr[14] = 0;
 	dbregs->dr[15] = 0;
 	return (0);
 }
 
 int
 set_dbregs(struct thread *td, struct dbreg *dbregs)
 {
 	struct pcb *pcb;
 	int i;
 
 	if (td == NULL) {
 		load_dr0(dbregs->dr[0]);
 		load_dr1(dbregs->dr[1]);
 		load_dr2(dbregs->dr[2]);
 		load_dr3(dbregs->dr[3]);
 		load_dr6(dbregs->dr[6]);
 		load_dr7(dbregs->dr[7]);
 	} else {
 		/*
 		 * Don't let an illegal value for dr7 get set.  Specifically,
 		 * check for undefined settings.  Setting these bit patterns
 		 * results in undefined behaviour and can lead to an unexpected
 		 * TRCTRAP or a general protection fault right here.
 		 * Upper bits of dr6 and dr7 must not be set.
 		 */
 		for (i = 0; i < 4; i++) {
 			if (DBREG_DR7_ACCESS(dbregs->dr[7], i) == 0x02)
 				return (EINVAL);
 			if (td->td_frame->tf_cs == _ucode32sel &&
 			    DBREG_DR7_LEN(dbregs->dr[7], i) == DBREG_DR7_LEN_8)
 				return (EINVAL);
 		}
 		if ((dbregs->dr[6] & 0xffffffff00000000ul) != 0 ||
 		    (dbregs->dr[7] & 0xffffffff00000000ul) != 0)
 			return (EINVAL);
 
 		pcb = td->td_pcb;
 
 		/*
 		 * Don't let a process set a breakpoint that is not within the
 		 * process's address space.  If a process could do this, it
 		 * could halt the system by setting a breakpoint in the kernel
 		 * (if ddb was enabled).  Thus, we need to check to make sure
 		 * that no breakpoints are being enabled for addresses outside
 		 * the process's address space.
 		 *
 		 * XXX - what about when the watched area of the user's
 		 * address space is written into from within the kernel
 		 * ... wouldn't that still cause a breakpoint to be generated
 		 * from within kernel mode?
 		 */
 
 		if (DBREG_DR7_ENABLED(dbregs->dr[7], 0)) {
 			/* dr0 is enabled */
 			if (dbregs->dr[0] >= VM_MAXUSER_ADDRESS)
 				return (EINVAL);
 		}
 		if (DBREG_DR7_ENABLED(dbregs->dr[7], 1)) {
 			/* dr1 is enabled */
 			if (dbregs->dr[1] >= VM_MAXUSER_ADDRESS)
 				return (EINVAL);
 		}
 		if (DBREG_DR7_ENABLED(dbregs->dr[7], 2)) {
 			/* dr2 is enabled */
 			if (dbregs->dr[2] >= VM_MAXUSER_ADDRESS)
 				return (EINVAL);
 		}
 		if (DBREG_DR7_ENABLED(dbregs->dr[7], 3)) {
 			/* dr3 is enabled */
 			if (dbregs->dr[3] >= VM_MAXUSER_ADDRESS)
 				return (EINVAL);
 		}
 
 		pcb->pcb_dr0 = dbregs->dr[0];
 		pcb->pcb_dr1 = dbregs->dr[1];
 		pcb->pcb_dr2 = dbregs->dr[2];
 		pcb->pcb_dr3 = dbregs->dr[3];
 		pcb->pcb_dr6 = dbregs->dr[6];
 		pcb->pcb_dr7 = dbregs->dr[7];
 
 		set_pcb_flags(pcb, PCB_DBREGS);
 	}
 
 	return (0);
 }
 
 void
 reset_dbregs(void)
 {
 
 	load_dr7(0);	/* Turn off the control bits first */
 	load_dr0(0);
 	load_dr1(0);
 	load_dr2(0);
 	load_dr3(0);
 	load_dr6(0);
 }
 
 /*
  * Return > 0 if a hardware breakpoint has been hit, and the
  * breakpoint was in user space.  Return 0, otherwise.
  */
 int
 user_dbreg_trap(void)
 {
         u_int64_t dr7, dr6; /* debug registers dr6 and dr7 */
         u_int64_t bp;       /* breakpoint bits extracted from dr6 */
         int nbp;            /* number of breakpoints that triggered */
         caddr_t addr[4];    /* breakpoint addresses */
         int i;
         
         dr7 = rdr7();
         if ((dr7 & 0x000000ff) == 0) {
                 /*
                  * all GE and LE bits in the dr7 register are zero,
                  * thus the trap couldn't have been caused by the
                  * hardware debug registers
                  */
                 return 0;
         }
 
         nbp = 0;
         dr6 = rdr6();
         bp = dr6 & 0x0000000f;
 
         if (!bp) {
                 /*
                  * None of the breakpoint bits are set meaning this
                  * trap was not caused by any of the debug registers
                  */
                 return 0;
         }
 
         /*
          * at least one of the breakpoints were hit, check to see
          * which ones and if any of them are user space addresses
          */
 
         if (bp & 0x01) {
                 addr[nbp++] = (caddr_t)rdr0();
         }
         if (bp & 0x02) {
                 addr[nbp++] = (caddr_t)rdr1();
         }
         if (bp & 0x04) {
                 addr[nbp++] = (caddr_t)rdr2();
         }
         if (bp & 0x08) {
                 addr[nbp++] = (caddr_t)rdr3();
         }
 
         for (i = 0; i < nbp; i++) {
                 if (addr[i] < (caddr_t)VM_MAXUSER_ADDRESS) {
                         /*
                          * addr[i] is in user space
                          */
                         return nbp;
                 }
         }
 
         /*
          * None of the breakpoints are in user space.
          */
         return 0;
 }
 
 #ifdef KDB
 
 /*
  * Provide inb() and outb() as functions.  They are normally only available as
  * inline functions, thus cannot be called from the debugger.
  */
 
 /* silence compiler warnings */
 u_char inb_(u_short);
 void outb_(u_short, u_char);
 
 u_char
 inb_(u_short port)
 {
 	return inb(port);
 }
 
 void
 outb_(u_short port, u_char data)
 {
 	outb(port, data);
 }
 
 #endif /* KDB */
Index: projects/release-arm-redux/sys/amd64/amd64/mp_machdep.c
===================================================================
--- projects/release-arm-redux/sys/amd64/amd64/mp_machdep.c	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/amd64/mp_machdep.c	(revision 282692)
@@ -1,730 +1,608 @@
 /*-
  * Copyright (c) 1996, by Steve Passe
  * Copyright (c) 2003, by Peter Wemm
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. The name of the developer may NOT be used to endorse or promote products
  *    derived from this software without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_cpu.h"
 #include "opt_ddb.h"
 #include "opt_kstack_pages.h"
 #include "opt_sched.h"
 #include "opt_smp.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bus.h>
 #include <sys/cpuset.h>
 #ifdef GPROF 
 #include <sys/gmon.h>
 #endif
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/memrange.h>
 #include <sys/mutex.h>
 #include <sys/pcpu.h>
 #include <sys/proc.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/pmap.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_extern.h>
 
 #include <x86/apicreg.h>
 #include <machine/clock.h>
 #include <machine/cputypes.h>
 #include <machine/cpufunc.h>
 #include <x86/mca.h>
 #include <machine/md_var.h>
 #include <machine/pcb.h>
 #include <machine/psl.h>
 #include <machine/smp.h>
 #include <machine/specialreg.h>
 #include <machine/tss.h>
 #include <machine/cpu.h>
 #include <x86/init.h>
 
 #define WARMBOOT_TARGET		0
 #define WARMBOOT_OFF		(KERNBASE + 0x0467)
 #define WARMBOOT_SEG		(KERNBASE + 0x0469)
 
 #define CMOS_REG		(0x70)
 #define CMOS_DATA		(0x71)
 #define BIOS_RESET		(0x0f)
 #define BIOS_WARM		(0x0a)
 
 extern	struct pcpu __pcpu[];
 
 /* Temporary variables for init_secondary()  */
 char *doublefault_stack;
 char *nmi_stack;
 
 /* Variables needed for SMP tlb shootdown. */
-vm_offset_t smp_tlb_addr2;
-struct invpcid_descr smp_tlb_invpcid;
+static vm_offset_t smp_tlb_addr1, smp_tlb_addr2;
+static pmap_t smp_tlb_pmap;
 volatile int smp_tlb_wait;
-uint64_t pcid_cr3;
-pmap_t smp_tlb_pmap;
-extern int invpcid_works;
 
 extern inthand_t IDTVEC(fast_syscall), IDTVEC(fast_syscall32);
 
 /*
  * Local data and functions.
  */
 
 static int	start_ap(int apic_id);
 
 static u_int	bootMP_size;
 static u_int	boot_address;
 
 /*
  * Calculate usable address in base memory for AP trampoline code.
  */
 u_int
 mp_bootaddress(u_int basemem)
 {
 
 	bootMP_size = mptramp_end - mptramp_start;
 	boot_address = trunc_page(basemem * 1024); /* round down to 4k boundary */
 	if (((basemem * 1024) - boot_address) < bootMP_size)
 		boot_address -= PAGE_SIZE;	/* not enough, lower by 4k */
 	/* 3 levels of page table pages */
 	mptramp_pagetables = boot_address - (PAGE_SIZE * 3);
 
 	return mptramp_pagetables;
 }
 
 /*
  * Initialize the IPI handlers and start up the AP's.
  */
 void
 cpu_mp_start(void)
 {
 	int i;
 
 	/* Initialize the logical ID to APIC ID table. */
 	for (i = 0; i < MAXCPU; i++) {
 		cpu_apic_ids[i] = -1;
 		cpu_ipi_pending[i] = 0;
 	}
 
 	/* Install an inter-CPU IPI for TLB invalidation */
 	if (pmap_pcid_enabled) {
-		setidt(IPI_INVLTLB, IDTVEC(invltlb_pcid), SDT_SYSIGT,
-		    SEL_KPL, 0);
-		setidt(IPI_INVLPG, IDTVEC(invlpg_pcid), SDT_SYSIGT,
-		    SEL_KPL, 0);
+		if (invpcid_works) {
+			setidt(IPI_INVLTLB, IDTVEC(invltlb_invpcid),
+			    SDT_SYSIGT, SEL_KPL, 0);
+		} else {
+			setidt(IPI_INVLTLB, IDTVEC(invltlb_pcid), SDT_SYSIGT,
+			    SEL_KPL, 0);
+		}
 	} else {
 		setidt(IPI_INVLTLB, IDTVEC(invltlb), SDT_SYSIGT, SEL_KPL, 0);
-		setidt(IPI_INVLPG, IDTVEC(invlpg), SDT_SYSIGT, SEL_KPL, 0);
 	}
+	setidt(IPI_INVLPG, IDTVEC(invlpg), SDT_SYSIGT, SEL_KPL, 0);
 	setidt(IPI_INVLRNG, IDTVEC(invlrng), SDT_SYSIGT, SEL_KPL, 0);
 
 	/* Install an inter-CPU IPI for cache invalidation. */
 	setidt(IPI_INVLCACHE, IDTVEC(invlcache), SDT_SYSIGT, SEL_KPL, 0);
 
 	/* Install an inter-CPU IPI for all-CPU rendezvous */
 	setidt(IPI_RENDEZVOUS, IDTVEC(rendezvous), SDT_SYSIGT, SEL_KPL, 0);
 
 	/* Install generic inter-CPU IPI handler */
 	setidt(IPI_BITMAP_VECTOR, IDTVEC(ipi_intr_bitmap_handler),
 	       SDT_SYSIGT, SEL_KPL, 0);
 
 	/* Install an inter-CPU IPI for CPU stop/restart */
 	setidt(IPI_STOP, IDTVEC(cpustop), SDT_SYSIGT, SEL_KPL, 0);
 
 	/* Install an inter-CPU IPI for CPU suspend/resume */
 	setidt(IPI_SUSPEND, IDTVEC(cpususpend), SDT_SYSIGT, SEL_KPL, 0);
 
 	/* Set boot_cpu_id if needed. */
 	if (boot_cpu_id == -1) {
 		boot_cpu_id = PCPU_GET(apic_id);
 		cpu_info[boot_cpu_id].cpu_bsp = 1;
 	} else
 		KASSERT(boot_cpu_id == PCPU_GET(apic_id),
 		    ("BSP's APIC ID doesn't match boot_cpu_id"));
 
 	/* Probe logical/physical core configuration. */
 	topo_probe();
 
 	assign_cpu_ids();
 
 	/* Start each Application Processor */
 	init_ops.start_all_aps();
 
 	set_interrupt_apic_ids();
 }
 
 
 /*
  * AP CPU's call this to initialize themselves.
  */
 void
 init_secondary(void)
 {
 	struct pcpu *pc;
 	struct nmi_pcpu *np;
 	u_int64_t msr, cr0;
 	int cpu, gsel_tss, x;
 	struct region_descriptor ap_gdt;
 
 	/* Set by the startup code for us to use */
 	cpu = bootAP;
 
 	/* Init tss */
 	common_tss[cpu] = common_tss[0];
 	common_tss[cpu].tss_rsp0 = 0;   /* not used until after switch */
 	common_tss[cpu].tss_iobase = sizeof(struct amd64tss) +
 	    IOPERM_BITMAP_SIZE;
 	common_tss[cpu].tss_ist1 = (long)&doublefault_stack[PAGE_SIZE];
 
 	/* The NMI stack runs on IST2. */
 	np = ((struct nmi_pcpu *) &nmi_stack[PAGE_SIZE]) - 1;
 	common_tss[cpu].tss_ist2 = (long) np;
 
 	/* Prepare private GDT */
 	gdt_segs[GPROC0_SEL].ssd_base = (long) &common_tss[cpu];
 	for (x = 0; x < NGDT; x++) {
 		if (x != GPROC0_SEL && x != (GPROC0_SEL + 1) &&
 		    x != GUSERLDT_SEL && x != (GUSERLDT_SEL + 1))
 			ssdtosd(&gdt_segs[x], &gdt[NGDT * cpu + x]);
 	}
 	ssdtosyssd(&gdt_segs[GPROC0_SEL],
 	    (struct system_segment_descriptor *)&gdt[NGDT * cpu + GPROC0_SEL]);
 	ap_gdt.rd_limit = NGDT * sizeof(gdt[0]) - 1;
 	ap_gdt.rd_base =  (long) &gdt[NGDT * cpu];
 	lgdt(&ap_gdt);			/* does magic intra-segment return */
 
 	/* Get per-cpu data */
 	pc = &__pcpu[cpu];
 
 	/* prime data page for it to use */
 	pcpu_init(pc, cpu, sizeof(struct pcpu));
 	dpcpu_init(dpcpu, cpu);
 	pc->pc_apic_id = cpu_apic_ids[cpu];
 	pc->pc_prvspace = pc;
 	pc->pc_curthread = 0;
 	pc->pc_tssp = &common_tss[cpu];
 	pc->pc_commontssp = &common_tss[cpu];
 	pc->pc_rsp0 = 0;
 	pc->pc_tss = (struct system_segment_descriptor *)&gdt[NGDT * cpu +
 	    GPROC0_SEL];
 	pc->pc_fs32p = &gdt[NGDT * cpu + GUFS32_SEL];
 	pc->pc_gs32p = &gdt[NGDT * cpu + GUGS32_SEL];
 	pc->pc_ldt = (struct system_segment_descriptor *)&gdt[NGDT * cpu +
 	    GUSERLDT_SEL];
+	pc->pc_curpmap = kernel_pmap;
+	pc->pc_pcid_gen = 1;
+	pc->pc_pcid_next = PMAP_PCID_KERN + 1;
 
 	/* Save the per-cpu pointer for use by the NMI handler. */
 	np->np_pcpu = (register_t) pc;
 
 	wrmsr(MSR_FSBASE, 0);		/* User value */
 	wrmsr(MSR_GSBASE, (u_int64_t)pc);
 	wrmsr(MSR_KGSBASE, (u_int64_t)pc);	/* XXX User value while we're in the kernel */
 
 	lidt(&r_idt);
 
 	gsel_tss = GSEL(GPROC0_SEL, SEL_KPL);
 	ltr(gsel_tss);
 
 	/*
 	 * Set to a known state:
 	 * Set by mpboot.s: CR0_PG, CR0_PE
 	 * Set by cpu_setregs: CR0_NE, CR0_MP, CR0_TS, CR0_WP, CR0_AM
 	 */
 	cr0 = rcr0();
 	cr0 &= ~(CR0_CD | CR0_NW | CR0_EM);
 	load_cr0(cr0);
 
 	/* Set up the fast syscall stuff */
 	msr = rdmsr(MSR_EFER) | EFER_SCE;
 	wrmsr(MSR_EFER, msr);
 	wrmsr(MSR_LSTAR, (u_int64_t)IDTVEC(fast_syscall));
 	wrmsr(MSR_CSTAR, (u_int64_t)IDTVEC(fast_syscall32));
 	msr = ((u_int64_t)GSEL(GCODE_SEL, SEL_KPL) << 32) |
 	      ((u_int64_t)GSEL(GUCODE32_SEL, SEL_UPL) << 48);
 	wrmsr(MSR_STAR, msr);
 	wrmsr(MSR_SF_MASK, PSL_NT|PSL_T|PSL_I|PSL_C|PSL_D);
 
 	/* signal our startup to the BSP. */
 	mp_naps++;
 
 	/* Spin until the BSP releases the AP's. */
 	while (!aps_ready)
 		ia32_pause();
 
 	init_secondary_tail();
 }
 
 /*******************************************************************
  * local functions and data
  */
 
 /*
  * start each AP in our list
  */
 int
 native_start_all_aps(void)
 {
 	vm_offset_t va = boot_address + KERNBASE;
 	u_int64_t *pt4, *pt3, *pt2;
 	u_int32_t mpbioswarmvec;
 	int apic_id, cpu, i;
 	u_char mpbiosreason;
 
 	mtx_init(&ap_boot_mtx, "ap boot", NULL, MTX_SPIN);
 
 	/* install the AP 1st level boot code */
 	pmap_kenter(va, boot_address);
 	pmap_invalidate_page(kernel_pmap, va);
 	bcopy(mptramp_start, (void *)va, bootMP_size);
 
 	/* Locate the page tables, they'll be below the trampoline */
 	pt4 = (u_int64_t *)(uintptr_t)(mptramp_pagetables + KERNBASE);
 	pt3 = pt4 + (PAGE_SIZE) / sizeof(u_int64_t);
 	pt2 = pt3 + (PAGE_SIZE) / sizeof(u_int64_t);
 
 	/* Create the initial 1GB replicated page tables */
 	for (i = 0; i < 512; i++) {
 		/* Each slot of the level 4 pages points to the same level 3 page */
 		pt4[i] = (u_int64_t)(uintptr_t)(mptramp_pagetables + PAGE_SIZE);
 		pt4[i] |= PG_V | PG_RW | PG_U;
 
 		/* Each slot of the level 3 pages points to the same level 2 page */
 		pt3[i] = (u_int64_t)(uintptr_t)(mptramp_pagetables + (2 * PAGE_SIZE));
 		pt3[i] |= PG_V | PG_RW | PG_U;
 
 		/* The level 2 page slots are mapped with 2MB pages for 1GB. */
 		pt2[i] = i * (2 * 1024 * 1024);
 		pt2[i] |= PG_V | PG_RW | PG_PS | PG_U;
 	}
 
 	/* save the current value of the warm-start vector */
 	mpbioswarmvec = *((u_int32_t *) WARMBOOT_OFF);
 	outb(CMOS_REG, BIOS_RESET);
 	mpbiosreason = inb(CMOS_DATA);
 
 	/* setup a vector to our boot code */
 	*((volatile u_short *) WARMBOOT_OFF) = WARMBOOT_TARGET;
 	*((volatile u_short *) WARMBOOT_SEG) = (boot_address >> 4);
 	outb(CMOS_REG, BIOS_RESET);
 	outb(CMOS_DATA, BIOS_WARM);	/* 'warm-start' */
 
 	/* start each AP */
 	for (cpu = 1; cpu < mp_ncpus; cpu++) {
 		apic_id = cpu_apic_ids[cpu];
 
 		/* allocate and set up an idle stack data page */
 		bootstacks[cpu] = (void *)kmem_malloc(kernel_arena,
 		    KSTACK_PAGES * PAGE_SIZE, M_WAITOK | M_ZERO);
 		doublefault_stack = (char *)kmem_malloc(kernel_arena,
 		    PAGE_SIZE, M_WAITOK | M_ZERO);
 		nmi_stack = (char *)kmem_malloc(kernel_arena, PAGE_SIZE,
 		    M_WAITOK | M_ZERO);
 		dpcpu = (void *)kmem_malloc(kernel_arena, DPCPU_SIZE,
 		    M_WAITOK | M_ZERO);
 
 		bootSTK = (char *)bootstacks[cpu] + KSTACK_PAGES * PAGE_SIZE - 8;
 		bootAP = cpu;
 
 		/* attempt to start the Application Processor */
 		if (!start_ap(apic_id)) {
 			/* restore the warmstart vector */
 			*(u_int32_t *) WARMBOOT_OFF = mpbioswarmvec;
 			panic("AP #%d (PHY# %d) failed!", cpu, apic_id);
 		}
 
 		CPU_SET(cpu, &all_cpus);	/* record AP in CPU map */
 	}
 
 	/* restore the warmstart vector */
 	*(u_int32_t *) WARMBOOT_OFF = mpbioswarmvec;
 
 	outb(CMOS_REG, BIOS_RESET);
 	outb(CMOS_DATA, mpbiosreason);
 
 	/* number of APs actually started */
 	return mp_naps;
 }
 
 
 /*
  * This function starts the AP (application processor) identified
  * by the APIC ID 'physicalCpu'.  It does quite a "song and dance"
  * to accomplish this.  This is necessary because of the nuances
  * of the different hardware we might encounter.  It isn't pretty,
  * but it seems to work.
  */
 static int
 start_ap(int apic_id)
 {
 	int vector, ms;
 	int cpus;
 
 	/* calculate the vector */
 	vector = (boot_address >> 12) & 0xff;
 
 	/* used as a watchpoint to signal AP startup */
 	cpus = mp_naps;
 
 	ipi_startup(apic_id, vector);
 
 	/* Wait up to 5 seconds for it to start. */
 	for (ms = 0; ms < 5000; ms++) {
 		if (mp_naps > cpus)
 			return 1;	/* return SUCCESS */
 		DELAY(1000);
 	}
 	return 0;		/* return FAILURE */
 }
 
 /*
- * Flush the TLB on all other CPU's
+ * Flush the TLB on other CPU's
  */
-static void
-smp_tlb_shootdown(u_int vector, pmap_t pmap, vm_offset_t addr1,
-    vm_offset_t addr2)
-{
-	u_int ncpu;
 
-	ncpu = mp_ncpus - 1;	/* does not shootdown self */
-	if (ncpu < 1)
-		return;		/* no other cpus */
-	if (!(read_rflags() & PSL_I))
-		panic("%s: interrupts disabled", __func__);
-	mtx_lock_spin(&smp_ipi_mtx);
-	smp_tlb_invpcid.addr = addr1;
-	if (pmap == NULL) {
-		smp_tlb_invpcid.pcid = 0;
-	} else {
-		smp_tlb_invpcid.pcid = pmap->pm_pcid;
-		pcid_cr3 = pmap->pm_cr3;
-	}
-	smp_tlb_addr2 = addr2;
-	smp_tlb_pmap = pmap;
-	atomic_store_rel_int(&smp_tlb_wait, 0);
-	ipi_all_but_self(vector);
-	while (smp_tlb_wait < ncpu)
-		ia32_pause();
-	mtx_unlock_spin(&smp_ipi_mtx);
-}
-
 static void
 smp_targeted_tlb_shootdown(cpuset_t mask, u_int vector, pmap_t pmap,
     vm_offset_t addr1, vm_offset_t addr2)
 {
 	int cpu, ncpu, othercpus;
 
-	othercpus = mp_ncpus - 1;
+	othercpus = mp_ncpus - 1;	/* does not shoot down self */
+
+	/*
+	 * Check for other cpus.  Return if none.
+	 */
 	if (CPU_ISFULLSET(&mask)) {
 		if (othercpus < 1)
 			return;
 	} else {
 		CPU_CLR(PCPU_GET(cpuid), &mask);
 		if (CPU_EMPTY(&mask))
 			return;
 	}
+
 	if (!(read_rflags() & PSL_I))
 		panic("%s: interrupts disabled", __func__);
 	mtx_lock_spin(&smp_ipi_mtx);
-	smp_tlb_invpcid.addr = addr1;
-	if (pmap == NULL) {
-		smp_tlb_invpcid.pcid = 0;
-	} else {
-		smp_tlb_invpcid.pcid = pmap->pm_pcid;
-		pcid_cr3 = pmap->pm_cr3;
-	}
+	smp_tlb_addr1 = addr1;
 	smp_tlb_addr2 = addr2;
 	smp_tlb_pmap = pmap;
 	atomic_store_rel_int(&smp_tlb_wait, 0);
 	if (CPU_ISFULLSET(&mask)) {
 		ncpu = othercpus;
 		ipi_all_but_self(vector);
 	} else {
 		ncpu = 0;
 		while ((cpu = CPU_FFS(&mask)) != 0) {
 			cpu--;
 			CPU_CLR(cpu, &mask);
 			CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__,
 			    cpu, vector);
 			ipi_send_cpu(cpu, vector);
 			ncpu++;
 		}
 	}
 	while (smp_tlb_wait < ncpu)
 		ia32_pause();
 	mtx_unlock_spin(&smp_ipi_mtx);
 }
 
 void
-smp_invlpg(pmap_t pmap, vm_offset_t addr)
-{
-
-	if (smp_started) {
-		smp_tlb_shootdown(IPI_INVLPG, pmap, addr, 0);
-#ifdef COUNT_XINVLTLB_HITS
-		ipi_page++;
-#endif
-	}
-}
-
-void
-smp_invlpg_range(pmap_t pmap, vm_offset_t addr1, vm_offset_t addr2)
-{
-
-	if (smp_started) {
-		smp_tlb_shootdown(IPI_INVLRNG, pmap, addr1, addr2);
-#ifdef COUNT_XINVLTLB_HITS
-		ipi_range++;
-		ipi_range_size += (addr2 - addr1) / PAGE_SIZE;
-#endif
-	}
-}
-
-void
 smp_masked_invltlb(cpuset_t mask, pmap_t pmap)
 {
 
 	if (smp_started) {
 		smp_targeted_tlb_shootdown(mask, IPI_INVLTLB, pmap, 0, 0);
 #ifdef COUNT_XINVLTLB_HITS
-		ipi_masked_global++;
+		ipi_global++;
 #endif
 	}
 }
 
 void
-smp_masked_invlpg(cpuset_t mask, pmap_t pmap, vm_offset_t addr)
+smp_masked_invlpg(cpuset_t mask, vm_offset_t addr)
 {
 
 	if (smp_started) {
-		smp_targeted_tlb_shootdown(mask, IPI_INVLPG, pmap, addr, 0);
+		smp_targeted_tlb_shootdown(mask, IPI_INVLPG, NULL, addr, 0);
 #ifdef COUNT_XINVLTLB_HITS
-		ipi_masked_page++;
+		ipi_page++;
 #endif
 	}
 }
 
 void
-smp_masked_invlpg_range(cpuset_t mask, pmap_t pmap, vm_offset_t addr1,
-    vm_offset_t addr2)
+smp_masked_invlpg_range(cpuset_t mask, vm_offset_t addr1, vm_offset_t addr2)
 {
 
 	if (smp_started) {
-		smp_targeted_tlb_shootdown(mask, IPI_INVLRNG, pmap, addr1,
-		    addr2);
+		smp_targeted_tlb_shootdown(mask, IPI_INVLRNG, NULL,
+		    addr1, addr2);
 #ifdef COUNT_XINVLTLB_HITS
-		ipi_masked_range++;
-		ipi_masked_range_size += (addr2 - addr1) / PAGE_SIZE;
+		ipi_range++;
+		ipi_range_size += (addr2 - addr1) / PAGE_SIZE;
 #endif
 	}
 }
 
 void
 smp_cache_flush(void)
 {
 
-	if (smp_started)
-		smp_tlb_shootdown(IPI_INVLCACHE, NULL, 0, 0);
-}
-
-void
-smp_invltlb(pmap_t pmap)
-{ 
-
 	if (smp_started) {
-		smp_tlb_shootdown(IPI_INVLTLB, pmap, 0, 0);
-#ifdef COUNT_XINVLTLB_HITS
-		ipi_global++;
-#endif
+		smp_targeted_tlb_shootdown(all_cpus, IPI_INVLCACHE, NULL,
+		    0, 0);
 	}
 }
 
 /*
  * Handlers for TLB related IPIs
  */
 void
 invltlb_handler(void)
 {
 #ifdef COUNT_XINVLTLB_HITS
 	xhits_gbl[PCPU_GET(cpuid)]++;
 #endif /* COUNT_XINVLTLB_HITS */
 #ifdef COUNT_IPIS
 	(*ipi_invltlb_counts[PCPU_GET(cpuid)])++;
 #endif /* COUNT_IPIS */
 
 	invltlb();
 	atomic_add_int(&smp_tlb_wait, 1);
 }
 
 void
-invltlb_pcid_handler(void)
+invltlb_invpcid_handler(void)
 {
-	uint64_t cr3;
-	u_int cpuid;
+	struct invpcid_descr d;
+
 #ifdef COUNT_XINVLTLB_HITS
 	xhits_gbl[PCPU_GET(cpuid)]++;
 #endif /* COUNT_XINVLTLB_HITS */
 #ifdef COUNT_IPIS
 	(*ipi_invltlb_counts[PCPU_GET(cpuid)])++;
 #endif /* COUNT_IPIS */
 
-	if (smp_tlb_invpcid.pcid != (uint64_t)-1 &&
-	    smp_tlb_invpcid.pcid != 0) {
-		if (invpcid_works) {
-			invpcid(&smp_tlb_invpcid, INVPCID_CTX);
-		} else {
-			/* Otherwise reload %cr3 twice. */
-			cr3 = rcr3();
-			if (cr3 != pcid_cr3) {
-				load_cr3(pcid_cr3);
-				cr3 |= CR3_PCID_SAVE;
-			}
-			load_cr3(cr3);
-		}
-	} else {
-		invltlb_globpcid();
-	}
-	if (smp_tlb_pmap != NULL) {
-		cpuid = PCPU_GET(cpuid);
-		if (!CPU_ISSET(cpuid, &smp_tlb_pmap->pm_active))
-			CPU_CLR_ATOMIC(cpuid, &smp_tlb_pmap->pm_save);
-	}
-
+	d.pcid = smp_tlb_pmap->pm_pcids[PCPU_GET(cpuid)].pm_pcid;
+	d.pad = 0;
+	d.addr = 0;
+	invpcid(&d, smp_tlb_pmap == kernel_pmap ? INVPCID_CTXGLOB :
+	    INVPCID_CTX);
 	atomic_add_int(&smp_tlb_wait, 1);
 }
 
 void
-invlpg_handler(void)
+invltlb_pcid_handler(void)
 {
 #ifdef COUNT_XINVLTLB_HITS
-	xhits_pg[PCPU_GET(cpuid)]++;
+	xhits_gbl[PCPU_GET(cpuid)]++;
 #endif /* COUNT_XINVLTLB_HITS */
 #ifdef COUNT_IPIS
-	(*ipi_invlpg_counts[PCPU_GET(cpuid)])++;
+	(*ipi_invltlb_counts[PCPU_GET(cpuid)])++;
 #endif /* COUNT_IPIS */
 
-	invlpg(smp_tlb_invpcid.addr);
+	if (smp_tlb_pmap == kernel_pmap) {
+		invltlb_globpcid();
+	} else {
+		/*
+		 * The current pmap might not be equal to
+		 * smp_tlb_pmap.  The clearing of the pm_gen in
+		 * pmap_invalidate_all() takes care of TLB
+		 * invalidation when switching to the pmap on this
+		 * CPU.
+		 */
+		if (PCPU_GET(curpmap) == smp_tlb_pmap) {
+			load_cr3(smp_tlb_pmap->pm_cr3 |
+			    smp_tlb_pmap->pm_pcids[PCPU_GET(cpuid)].pm_pcid);
+		}
+	}
 	atomic_add_int(&smp_tlb_wait, 1);
 }
 
 void
-invlpg_pcid_handler(void)
+invlpg_handler(void)
 {
-	uint64_t cr3;
 #ifdef COUNT_XINVLTLB_HITS
 	xhits_pg[PCPU_GET(cpuid)]++;
 #endif /* COUNT_XINVLTLB_HITS */
 #ifdef COUNT_IPIS
 	(*ipi_invlpg_counts[PCPU_GET(cpuid)])++;
 #endif /* COUNT_IPIS */
 
-	if (smp_tlb_invpcid.pcid == (uint64_t)-1) {
-		invltlb_globpcid();
-	} else if (smp_tlb_invpcid.pcid == 0) {
-		invlpg(smp_tlb_invpcid.addr);
-	} else if (invpcid_works) {
-		invpcid(&smp_tlb_invpcid, INVPCID_ADDR);
-	} else {
-		/*
-		 * PCID supported, but INVPCID is not.
-		 * Temporarily switch to the target address
-		 * space and do INVLPG.
-		 */
-		cr3 = rcr3();
-		if (cr3 != pcid_cr3)
-			load_cr3(pcid_cr3 | CR3_PCID_SAVE);
-		invlpg(smp_tlb_invpcid.addr);
-		load_cr3(cr3 | CR3_PCID_SAVE);
-	}
-
+	invlpg(smp_tlb_addr1);
 	atomic_add_int(&smp_tlb_wait, 1);
 }
 
-static inline void
-invlpg_range(vm_offset_t start, vm_offset_t end)
-{
-
-	do {
-		invlpg(start);
-		start += PAGE_SIZE;
-	} while (start < end);
-}
-
 void
 invlrng_handler(void)
 {
-	struct invpcid_descr d;
 	vm_offset_t addr;
-	uint64_t cr3;
-	u_int cpuid;
+
 #ifdef COUNT_XINVLTLB_HITS
 	xhits_rng[PCPU_GET(cpuid)]++;
 #endif /* COUNT_XINVLTLB_HITS */
 #ifdef COUNT_IPIS
 	(*ipi_invlrng_counts[PCPU_GET(cpuid)])++;
 #endif /* COUNT_IPIS */
 
-	addr = smp_tlb_invpcid.addr;
-	if (pmap_pcid_enabled) {
-		if (smp_tlb_invpcid.pcid == 0) {
-			/*
-			 * kernel pmap - use invlpg to invalidate
-			 * global mapping.
-			 */
-			invlpg_range(addr, smp_tlb_addr2);
-		} else if (smp_tlb_invpcid.pcid == (uint64_t)-1) {
-			invltlb_globpcid();
-			if (smp_tlb_pmap != NULL) {
-				cpuid = PCPU_GET(cpuid);
-				if (!CPU_ISSET(cpuid, &smp_tlb_pmap->pm_active))
-					CPU_CLR_ATOMIC(cpuid,
-					    &smp_tlb_pmap->pm_save);
-			}
-		} else if (invpcid_works) {
-			d = smp_tlb_invpcid;
-			do {
-				invpcid(&d, INVPCID_ADDR);
-				d.addr += PAGE_SIZE;
-			} while (d.addr <= smp_tlb_addr2);
-		} else {
-			cr3 = rcr3();
-			if (cr3 != pcid_cr3)
-				load_cr3(pcid_cr3 | CR3_PCID_SAVE);
-			invlpg_range(addr, smp_tlb_addr2);
-			load_cr3(cr3 | CR3_PCID_SAVE);
-		}
-	} else {
-		invlpg_range(addr, smp_tlb_addr2);
-	}
+	addr = smp_tlb_addr1;
+	do {
+		invlpg(addr);
+		addr += PAGE_SIZE;
+	} while (addr < smp_tlb_addr2);
 
 	atomic_add_int(&smp_tlb_wait, 1);
 }
Index: projects/release-arm-redux/sys/amd64/amd64/pmap.c
===================================================================
--- projects/release-arm-redux/sys/amd64/amd64/pmap.c	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/amd64/pmap.c	(revision 282692)
@@ -1,7027 +1,6976 @@
 /*-
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  * Copyright (c) 1994 John S. Dyson
  * All rights reserved.
  * Copyright (c) 1994 David Greenman
  * All rights reserved.
  * Copyright (c) 2003 Peter Wemm
  * All rights reserved.
  * Copyright (c) 2005-2010 Alan L. Cox <alc@cs.rice.edu>
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department and William Jolitz of UUNET Technologies Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from:	@(#)pmap.c	7.7 (Berkeley)	5/12/91
  */
 /*-
  * Copyright (c) 2003 Networks Associates Technology, Inc.
  * All rights reserved.
  *
  * This software was developed for the FreeBSD Project by Jake Burkholder,
  * Safeport Network Services, and Network Associates Laboratories, the
  * Security Research Division of Network Associates, Inc. under
  * DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA
  * CHATS research program.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #define	AMD64_NPT_AWARE
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 /*
  *	Manages physical address maps.
  *
  *	Since the information managed by this module is
  *	also stored by the logical address mapping module,
  *	this module may throw away valid virtual-to-physical
  *	mappings at almost any time.  However, invalidations
  *	of virtual-to-physical mappings must be done as
  *	requested.
  *
  *	In order to cope with hardware architectures which
  *	make virtual-to-physical map invalidates expensive,
  *	this module may delay invalidate or reduced protection
  *	operations until such time as they are actually
  *	necessary.  This module is given full information as
  *	to which processors are currently using which maps,
  *	and to when physical maps must be made correct.
  */
 
 #include "opt_pmap.h"
 #include "opt_vm.h"
 
 #include <sys/param.h>
 #include <sys/bus.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mman.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/rwlock.h>
 #include <sys/sx.h>
 #include <sys/vmem.h>
 #include <sys/vmmeter.h>
 #include <sys/sched.h>
 #include <sys/sysctl.h>
 #include <sys/_unrhdr.h>
 #include <sys/smp.h>
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_pager.h>
 #include <vm/vm_phys.h>
 #include <vm/vm_radix.h>
 #include <vm/vm_reserv.h>
 #include <vm/uma.h>
 
 #include <machine/intr_machdep.h>
 #include <x86/apicvar.h>
 #include <machine/cpu.h>
 #include <machine/cputypes.h>
 #include <machine/md_var.h>
 #include <machine/pcb.h>
 #include <machine/specialreg.h>
 #ifdef SMP
 #include <machine/smp.h>
 #endif
 
 static __inline boolean_t
 pmap_type_guest(pmap_t pmap)
 {
 
 	return ((pmap->pm_type == PT_EPT) || (pmap->pm_type == PT_RVI));
 }
 
 static __inline boolean_t
 pmap_emulate_ad_bits(pmap_t pmap)
 {
 
 	return ((pmap->pm_flags & PMAP_EMULATE_AD_BITS) != 0);
 }
 
 static __inline pt_entry_t
 pmap_valid_bit(pmap_t pmap)
 {
 	pt_entry_t mask;
 
 	switch (pmap->pm_type) {
 	case PT_X86:
 	case PT_RVI:
 		mask = X86_PG_V;
 		break;
 	case PT_EPT:
 		if (pmap_emulate_ad_bits(pmap))
 			mask = EPT_PG_EMUL_V;
 		else
 			mask = EPT_PG_READ;
 		break;
 	default:
 		panic("pmap_valid_bit: invalid pm_type %d", pmap->pm_type);
 	}
 
 	return (mask);
 }
 
 static __inline pt_entry_t
 pmap_rw_bit(pmap_t pmap)
 {
 	pt_entry_t mask;
 
 	switch (pmap->pm_type) {
 	case PT_X86:
 	case PT_RVI:
 		mask = X86_PG_RW;
 		break;
 	case PT_EPT:
 		if (pmap_emulate_ad_bits(pmap))
 			mask = EPT_PG_EMUL_RW;
 		else
 			mask = EPT_PG_WRITE;
 		break;
 	default:
 		panic("pmap_rw_bit: invalid pm_type %d", pmap->pm_type);
 	}
 
 	return (mask);
 }
 
 static __inline pt_entry_t
 pmap_global_bit(pmap_t pmap)
 {
 	pt_entry_t mask;
 
 	switch (pmap->pm_type) {
 	case PT_X86:
 		mask = X86_PG_G;
 		break;
 	case PT_RVI:
 	case PT_EPT:
 		mask = 0;
 		break;
 	default:
 		panic("pmap_global_bit: invalid pm_type %d", pmap->pm_type);
 	}
 
 	return (mask);
 }
 
 static __inline pt_entry_t
 pmap_accessed_bit(pmap_t pmap)
 {
 	pt_entry_t mask;
 
 	switch (pmap->pm_type) {
 	case PT_X86:
 	case PT_RVI:
 		mask = X86_PG_A;
 		break;
 	case PT_EPT:
 		if (pmap_emulate_ad_bits(pmap))
 			mask = EPT_PG_READ;
 		else
 			mask = EPT_PG_A;
 		break;
 	default:
 		panic("pmap_accessed_bit: invalid pm_type %d", pmap->pm_type);
 	}
 
 	return (mask);
 }
 
 static __inline pt_entry_t
 pmap_modified_bit(pmap_t pmap)
 {
 	pt_entry_t mask;
 
 	switch (pmap->pm_type) {
 	case PT_X86:
 	case PT_RVI:
 		mask = X86_PG_M;
 		break;
 	case PT_EPT:
 		if (pmap_emulate_ad_bits(pmap))
 			mask = EPT_PG_WRITE;
 		else
 			mask = EPT_PG_M;
 		break;
 	default:
 		panic("pmap_modified_bit: invalid pm_type %d", pmap->pm_type);
 	}
 
 	return (mask);
 }
 
+extern	struct pcpu __pcpu[];
+
 #if !defined(DIAGNOSTIC)
 #ifdef __GNUC_GNU_INLINE__
 #define PMAP_INLINE	__attribute__((__gnu_inline__)) inline
 #else
 #define PMAP_INLINE	extern inline
 #endif
 #else
 #define PMAP_INLINE
 #endif
 
 #ifdef PV_STATS
 #define PV_STAT(x)	do { x ; } while (0)
 #else
 #define PV_STAT(x)	do { } while (0)
 #endif
 
 #define	pa_index(pa)	((pa) >> PDRSHIFT)
 #define	pa_to_pvh(pa)	(&pv_table[pa_index(pa)])
 
 #define	NPV_LIST_LOCKS	MAXCPU
 
 #define	PHYS_TO_PV_LIST_LOCK(pa)	\
 			(&pv_list_locks[pa_index(pa) % NPV_LIST_LOCKS])
 
 #define	CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, pa)	do {	\
 	struct rwlock **_lockp = (lockp);		\
 	struct rwlock *_new_lock;			\
 							\
 	_new_lock = PHYS_TO_PV_LIST_LOCK(pa);		\
 	if (_new_lock != *_lockp) {			\
 		if (*_lockp != NULL)			\
 			rw_wunlock(*_lockp);		\
 		*_lockp = _new_lock;			\
 		rw_wlock(*_lockp);			\
 	}						\
 } while (0)
 
 #define	CHANGE_PV_LIST_LOCK_TO_VM_PAGE(lockp, m)	\
 			CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, VM_PAGE_TO_PHYS(m))
 
 #define	RELEASE_PV_LIST_LOCK(lockp)		do {	\
 	struct rwlock **_lockp = (lockp);		\
 							\
 	if (*_lockp != NULL) {				\
 		rw_wunlock(*_lockp);			\
 		*_lockp = NULL;				\
 	}						\
 } while (0)
 
 #define	VM_PAGE_TO_PV_LIST_LOCK(m)	\
 			PHYS_TO_PV_LIST_LOCK(VM_PAGE_TO_PHYS(m))
 
 struct pmap kernel_pmap_store;
 
 vm_offset_t virtual_avail;	/* VA of first avail page (after kernel bss) */
 vm_offset_t virtual_end;	/* VA of last avail page (end of kernel AS) */
 
 int nkpt;
 SYSCTL_INT(_machdep, OID_AUTO, nkpt, CTLFLAG_RD, &nkpt, 0,
     "Number of kernel page table pages allocated on bootup");
 
 static int ndmpdp;
 vm_paddr_t dmaplimit;
 vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS;
 pt_entry_t pg_nx;
 
 static SYSCTL_NODE(_vm, OID_AUTO, pmap, CTLFLAG_RD, 0, "VM/pmap parameters");
 
 static int pat_works = 1;
 SYSCTL_INT(_vm_pmap, OID_AUTO, pat_works, CTLFLAG_RD, &pat_works, 1,
     "Is page attribute table fully functional?");
 
 static int pg_ps_enabled = 1;
 SYSCTL_INT(_vm_pmap, OID_AUTO, pg_ps_enabled, CTLFLAG_RDTUN | CTLFLAG_NOFETCH,
     &pg_ps_enabled, 0, "Are large page mappings enabled?");
 
 #define	PAT_INDEX_SIZE	8
 static int pat_index[PAT_INDEX_SIZE];	/* cache mode to PAT index conversion */
 
 static u_int64_t	KPTphys;	/* phys addr of kernel level 1 */
 static u_int64_t	KPDphys;	/* phys addr of kernel level 2 */
 u_int64_t		KPDPphys;	/* phys addr of kernel level 3 */
 u_int64_t		KPML4phys;	/* phys addr of kernel level 4 */
 
 static u_int64_t	DMPDphys;	/* phys addr of direct mapped level 2 */
 static u_int64_t	DMPDPphys;	/* phys addr of direct mapped level 3 */
 static int		ndmpdpphys;	/* number of DMPDPphys pages */
 
 static struct rwlock_padalign pvh_global_lock;
 
 /*
  * Data for the pv entry allocation mechanism
  */
 static TAILQ_HEAD(pch, pv_chunk) pv_chunks = TAILQ_HEAD_INITIALIZER(pv_chunks);
 static struct mtx pv_chunks_mutex;
 static struct rwlock pv_list_locks[NPV_LIST_LOCKS];
 static struct md_page *pv_table;
 
 /*
  * All those kernel PT submaps that BSD is so fond of
  */
 pt_entry_t *CMAP1 = 0;
 caddr_t CADDR1 = 0;
 
 static int pmap_flags = PMAP_PDE_SUPERPAGE;	/* flags for x86 pmaps */
 
-static struct unrhdr pcid_unr;
-static struct mtx pcid_mtx;
 int pmap_pcid_enabled = 0;
 SYSCTL_INT(_vm_pmap, OID_AUTO, pcid_enabled, CTLFLAG_RDTUN | CTLFLAG_NOFETCH,
     &pmap_pcid_enabled, 0, "Is TLB Context ID enabled ?");
 int invpcid_works = 0;
 SYSCTL_INT(_vm_pmap, OID_AUTO, invpcid_works, CTLFLAG_RD, &invpcid_works, 0,
     "Is the invpcid instruction available ?");
 
 static int
 pmap_pcid_save_cnt_proc(SYSCTL_HANDLER_ARGS)
 {
 	int i;
 	uint64_t res;
 
 	res = 0;
 	CPU_FOREACH(i) {
 		res += cpuid_to_pcpu[i]->pc_pm_save_cnt;
 	}
 	return (sysctl_handle_64(oidp, &res, 0, req));
 }
 SYSCTL_PROC(_vm_pmap, OID_AUTO, pcid_save_cnt, CTLTYPE_U64 | CTLFLAG_RW |
     CTLFLAG_MPSAFE, NULL, 0, pmap_pcid_save_cnt_proc, "QU",
     "Count of saved TLB context on switch");
 
 /*
  * Crashdump maps.
  */
 static caddr_t crashdumpmap;
 
 static void	free_pv_chunk(struct pv_chunk *pc);
 static void	free_pv_entry(pmap_t pmap, pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
 static int	popcnt_pc_map_elem_pq(uint64_t elem);
 static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp);
 static void	reserve_pv_entries(pmap_t pmap, int needed,
 		    struct rwlock **lockp);
 static void	pmap_pv_demote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa,
 		    struct rwlock **lockp);
 static boolean_t pmap_pv_insert_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa,
 		    struct rwlock **lockp);
 static void	pmap_pv_promote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa,
 		    struct rwlock **lockp);
 static void	pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va);
 static pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap,
 		    vm_offset_t va);
 
 static int pmap_change_attr_locked(vm_offset_t va, vm_size_t size, int mode);
 static boolean_t pmap_demote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va);
 static boolean_t pmap_demote_pde_locked(pmap_t pmap, pd_entry_t *pde,
     vm_offset_t va, struct rwlock **lockp);
 static boolean_t pmap_demote_pdpe(pmap_t pmap, pdp_entry_t *pdpe,
     vm_offset_t va);
 static boolean_t pmap_enter_pde(pmap_t pmap, vm_offset_t va, vm_page_t m,
     vm_prot_t prot, struct rwlock **lockp);
 static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va,
     vm_page_t m, vm_prot_t prot, vm_page_t mpte, struct rwlock **lockp);
 static void pmap_fill_ptp(pt_entry_t *firstpte, pt_entry_t newpte);
 static int pmap_insert_pt_page(pmap_t pmap, vm_page_t mpte);
 static void pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int mode);
 static vm_page_t pmap_lookup_pt_page(pmap_t pmap, vm_offset_t va);
 static void pmap_pde_attr(pd_entry_t *pde, int cache_bits, int mask);
 static void pmap_promote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va,
     struct rwlock **lockp);
 static boolean_t pmap_protect_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t sva,
     vm_prot_t prot);
 static void pmap_pte_attr(pt_entry_t *pte, int cache_bits, int mask);
 static int pmap_remove_pde(pmap_t pmap, pd_entry_t *pdq, vm_offset_t sva,
     struct spglist *free, struct rwlock **lockp);
 static int pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t sva,
     pd_entry_t ptepde, struct spglist *free, struct rwlock **lockp);
 static void pmap_remove_pt_page(pmap_t pmap, vm_page_t mpte);
 static void pmap_remove_page(pmap_t pmap, vm_offset_t va, pd_entry_t *pde,
     struct spglist *free);
 static boolean_t pmap_try_insert_pv_entry(pmap_t pmap, vm_offset_t va,
     vm_page_t m, struct rwlock **lockp);
 static void pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde,
     pd_entry_t newpde);
 static void pmap_update_pde_invalidate(pmap_t, vm_offset_t va, pd_entry_t pde);
 
 static vm_page_t _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex,
 		struct rwlock **lockp);
 static vm_page_t pmap_allocpde(pmap_t pmap, vm_offset_t va,
 		struct rwlock **lockp);
 static vm_page_t pmap_allocpte(pmap_t pmap, vm_offset_t va,
 		struct rwlock **lockp);
 
 static void _pmap_unwire_ptp(pmap_t pmap, vm_offset_t va, vm_page_t m,
     struct spglist *free);
 static int pmap_unuse_pt(pmap_t, vm_offset_t, pd_entry_t, struct spglist *);
 static vm_offset_t pmap_kmem_choose(vm_offset_t addr);
 
 /*
  * Move the kernel virtual free pointer to the next
  * 2MB.  This is used to help improve performance
  * by using a large (2MB) page for much of the kernel
  * (.text, .data, .bss)
  */
 static vm_offset_t
 pmap_kmem_choose(vm_offset_t addr)
 {
 	vm_offset_t newaddr = addr;
 
 	newaddr = (addr + (NBPDR - 1)) & ~(NBPDR - 1);
 	return (newaddr);
 }
 
 /********************/
 /* Inline functions */
 /********************/
 
 /* Return a non-clipped PD index for a given VA */
 static __inline vm_pindex_t
 pmap_pde_pindex(vm_offset_t va)
 {
 	return (va >> PDRSHIFT);
 }
 
 
 /* Return various clipped indexes for a given VA */
 static __inline vm_pindex_t
 pmap_pte_index(vm_offset_t va)
 {
 
 	return ((va >> PAGE_SHIFT) & ((1ul << NPTEPGSHIFT) - 1));
 }
 
 static __inline vm_pindex_t
 pmap_pde_index(vm_offset_t va)
 {
 
 	return ((va >> PDRSHIFT) & ((1ul << NPDEPGSHIFT) - 1));
 }
 
 static __inline vm_pindex_t
 pmap_pdpe_index(vm_offset_t va)
 {
 
 	return ((va >> PDPSHIFT) & ((1ul << NPDPEPGSHIFT) - 1));
 }
 
 static __inline vm_pindex_t
 pmap_pml4e_index(vm_offset_t va)
 {
 
 	return ((va >> PML4SHIFT) & ((1ul << NPML4EPGSHIFT) - 1));
 }
 
 /* Return a pointer to the PML4 slot that corresponds to a VA */
 static __inline pml4_entry_t *
 pmap_pml4e(pmap_t pmap, vm_offset_t va)
 {
 
 	return (&pmap->pm_pml4[pmap_pml4e_index(va)]);
 }
 
 /* Return a pointer to the PDP slot that corresponds to a VA */
 static __inline pdp_entry_t *
 pmap_pml4e_to_pdpe(pml4_entry_t *pml4e, vm_offset_t va)
 {
 	pdp_entry_t *pdpe;
 
 	pdpe = (pdp_entry_t *)PHYS_TO_DMAP(*pml4e & PG_FRAME);
 	return (&pdpe[pmap_pdpe_index(va)]);
 }
 
 /* Return a pointer to the PDP slot that corresponds to a VA */
 static __inline pdp_entry_t *
 pmap_pdpe(pmap_t pmap, vm_offset_t va)
 {
 	pml4_entry_t *pml4e;
 	pt_entry_t PG_V;
 
 	PG_V = pmap_valid_bit(pmap);
 	pml4e = pmap_pml4e(pmap, va);
 	if ((*pml4e & PG_V) == 0)
 		return (NULL);
 	return (pmap_pml4e_to_pdpe(pml4e, va));
 }
 
 /* Return a pointer to the PD slot that corresponds to a VA */
 static __inline pd_entry_t *
 pmap_pdpe_to_pde(pdp_entry_t *pdpe, vm_offset_t va)
 {
 	pd_entry_t *pde;
 
 	pde = (pd_entry_t *)PHYS_TO_DMAP(*pdpe & PG_FRAME);
 	return (&pde[pmap_pde_index(va)]);
 }
 
 /* Return a pointer to the PD slot that corresponds to a VA */
 static __inline pd_entry_t *
 pmap_pde(pmap_t pmap, vm_offset_t va)
 {
 	pdp_entry_t *pdpe;
 	pt_entry_t PG_V;
 
 	PG_V = pmap_valid_bit(pmap);
 	pdpe = pmap_pdpe(pmap, va);
 	if (pdpe == NULL || (*pdpe & PG_V) == 0)
 		return (NULL);
 	return (pmap_pdpe_to_pde(pdpe, va));
 }
 
 /* Return a pointer to the PT slot that corresponds to a VA */
 static __inline pt_entry_t *
 pmap_pde_to_pte(pd_entry_t *pde, vm_offset_t va)
 {
 	pt_entry_t *pte;
 
 	pte = (pt_entry_t *)PHYS_TO_DMAP(*pde & PG_FRAME);
 	return (&pte[pmap_pte_index(va)]);
 }
 
 /* Return a pointer to the PT slot that corresponds to a VA */
 static __inline pt_entry_t *
 pmap_pte(pmap_t pmap, vm_offset_t va)
 {
 	pd_entry_t *pde;
 	pt_entry_t PG_V;
 
 	PG_V = pmap_valid_bit(pmap);
 	pde = pmap_pde(pmap, va);
 	if (pde == NULL || (*pde & PG_V) == 0)
 		return (NULL);
 	if ((*pde & PG_PS) != 0)	/* compat with i386 pmap_pte() */
 		return ((pt_entry_t *)pde);
 	return (pmap_pde_to_pte(pde, va));
 }
 
 static __inline void
 pmap_resident_count_inc(pmap_t pmap, int count)
 {
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	pmap->pm_stats.resident_count += count;
 }
 
 static __inline void
 pmap_resident_count_dec(pmap_t pmap, int count)
 {
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	KASSERT(pmap->pm_stats.resident_count >= count,
 	    ("pmap %p resident count underflow %ld %d", pmap,
 	    pmap->pm_stats.resident_count, count));
 	pmap->pm_stats.resident_count -= count;
 }
 
 PMAP_INLINE pt_entry_t *
 vtopte(vm_offset_t va)
 {
 	u_int64_t mask = ((1ul << (NPTEPGSHIFT + NPDEPGSHIFT + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1);
 
 	KASSERT(va >= VM_MAXUSER_ADDRESS, ("vtopte on a uva/gpa 0x%0lx", va));
 
 	return (PTmap + ((va >> PAGE_SHIFT) & mask));
 }
 
 static __inline pd_entry_t *
 vtopde(vm_offset_t va)
 {
 	u_int64_t mask = ((1ul << (NPDEPGSHIFT + NPDPEPGSHIFT + NPML4EPGSHIFT)) - 1);
 
 	KASSERT(va >= VM_MAXUSER_ADDRESS, ("vtopde on a uva/gpa 0x%0lx", va));
 
 	return (PDmap + ((va >> PDRSHIFT) & mask));
 }
 
 static u_int64_t
 allocpages(vm_paddr_t *firstaddr, int n)
 {
 	u_int64_t ret;
 
 	ret = *firstaddr;
 	bzero((void *)ret, n * PAGE_SIZE);
 	*firstaddr += n * PAGE_SIZE;
 	return (ret);
 }
 
 CTASSERT(powerof2(NDMPML4E));
 
 /* number of kernel PDP slots */
 #define	NKPDPE(ptpgs)		howmany((ptpgs), NPDEPG)
 
 static void
 nkpt_init(vm_paddr_t addr)
 {
 	int pt_pages;
 	
 #ifdef NKPT
 	pt_pages = NKPT;
 #else
 	pt_pages = howmany(addr, 1 << PDRSHIFT);
 	pt_pages += NKPDPE(pt_pages);
 
 	/*
 	 * Add some slop beyond the bare minimum required for bootstrapping
 	 * the kernel.
 	 *
 	 * This is quite important when allocating KVA for kernel modules.
 	 * The modules are required to be linked in the negative 2GB of
 	 * the address space.  If we run out of KVA in this region then
 	 * pmap_growkernel() will need to allocate page table pages to map
 	 * the entire 512GB of KVA space which is an unnecessary tax on
 	 * physical memory.
 	 */
 	pt_pages += 8;		/* 16MB additional slop for kernel modules */
 #endif
 	nkpt = pt_pages;
 }
 
 static void
 create_pagetables(vm_paddr_t *firstaddr)
 {
 	int i, j, ndm1g, nkpdpe;
 	pt_entry_t *pt_p;
 	pd_entry_t *pd_p;
 	pdp_entry_t *pdp_p;
 	pml4_entry_t *p4_p;
 
 	/* Allocate page table pages for the direct map */
 	ndmpdp = (ptoa(Maxmem) + NBPDP - 1) >> PDPSHIFT;
 	if (ndmpdp < 4)		/* Minimum 4GB of dirmap */
 		ndmpdp = 4;
 	ndmpdpphys = howmany(ndmpdp, NPDPEPG);
 	if (ndmpdpphys > NDMPML4E) {
 		/*
 		 * Each NDMPML4E allows 512 GB, so limit to that,
 		 * and then readjust ndmpdp and ndmpdpphys.
 		 */
 		printf("NDMPML4E limits system to %d GB\n", NDMPML4E * 512);
 		Maxmem = atop(NDMPML4E * NBPML4);
 		ndmpdpphys = NDMPML4E;
 		ndmpdp = NDMPML4E * NPDEPG;
 	}
 	DMPDPphys = allocpages(firstaddr, ndmpdpphys);
 	ndm1g = 0;
 	if ((amd_feature & AMDID_PAGE1GB) != 0)
 		ndm1g = ptoa(Maxmem) >> PDPSHIFT;
 	if (ndm1g < ndmpdp)
 		DMPDphys = allocpages(firstaddr, ndmpdp - ndm1g);
 	dmaplimit = (vm_paddr_t)ndmpdp << PDPSHIFT;
 
 	/* Allocate pages */
 	KPML4phys = allocpages(firstaddr, 1);
 	KPDPphys = allocpages(firstaddr, NKPML4E);
 
 	/*
 	 * Allocate the initial number of kernel page table pages required to
 	 * bootstrap.  We defer this until after all memory-size dependent
 	 * allocations are done (e.g. direct map), so that we don't have to
 	 * build in too much slop in our estimate.
 	 *
 	 * Note that when NKPML4E > 1, we have an empty page underneath
 	 * all but the KPML4I'th one, so we need NKPML4E-1 extra (zeroed)
 	 * pages.  (pmap_enter requires a PD page to exist for each KPML4E.)
 	 */
 	nkpt_init(*firstaddr);
 	nkpdpe = NKPDPE(nkpt);
 
 	KPTphys = allocpages(firstaddr, nkpt);
 	KPDphys = allocpages(firstaddr, nkpdpe);
 
 	/* Fill in the underlying page table pages */
 	/* Nominally read-only (but really R/W) from zero to physfree */
 	/* XXX not fully used, underneath 2M pages */
 	pt_p = (pt_entry_t *)KPTphys;
 	for (i = 0; ptoa(i) < *firstaddr; i++)
 		pt_p[i] = ptoa(i) | X86_PG_RW | X86_PG_V | X86_PG_G;
 
 	/* Now map the page tables at their location within PTmap */
 	pd_p = (pd_entry_t *)KPDphys;
 	for (i = 0; i < nkpt; i++)
 		pd_p[i] = (KPTphys + ptoa(i)) | X86_PG_RW | X86_PG_V;
 
 	/* Map from zero to end of allocations under 2M pages */
 	/* This replaces some of the KPTphys entries above */
 	for (i = 0; (i << PDRSHIFT) < *firstaddr; i++)
 		pd_p[i] = (i << PDRSHIFT) | X86_PG_RW | X86_PG_V | PG_PS |
 		    X86_PG_G;
 
 	/* And connect up the PD to the PDP (leaving room for L4 pages) */
 	pdp_p = (pdp_entry_t *)(KPDPphys + ptoa(KPML4I - KPML4BASE));
 	for (i = 0; i < nkpdpe; i++)
 		pdp_p[i + KPDPI] = (KPDphys + ptoa(i)) | X86_PG_RW | X86_PG_V |
 		    PG_U;
 
 	/*
 	 * Now, set up the direct map region using 2MB and/or 1GB pages.  If
 	 * the end of physical memory is not aligned to a 1GB page boundary,
 	 * then the residual physical memory is mapped with 2MB pages.  Later,
 	 * if pmap_mapdev{_attr}() uses the direct map for non-write-back
 	 * memory, pmap_change_attr() will demote any 2MB or 1GB page mappings
 	 * that are partially used. 
 	 */
 	pd_p = (pd_entry_t *)DMPDphys;
 	for (i = NPDEPG * ndm1g, j = 0; i < NPDEPG * ndmpdp; i++, j++) {
 		pd_p[j] = (vm_paddr_t)i << PDRSHIFT;
 		/* Preset PG_M and PG_A because demotion expects it. */
 		pd_p[j] |= X86_PG_RW | X86_PG_V | PG_PS | X86_PG_G |
 		    X86_PG_M | X86_PG_A;
 	}
 	pdp_p = (pdp_entry_t *)DMPDPphys;
 	for (i = 0; i < ndm1g; i++) {
 		pdp_p[i] = (vm_paddr_t)i << PDPSHIFT;
 		/* Preset PG_M and PG_A because demotion expects it. */
 		pdp_p[i] |= X86_PG_RW | X86_PG_V | PG_PS | X86_PG_G |
 		    X86_PG_M | X86_PG_A;
 	}
 	for (j = 0; i < ndmpdp; i++, j++) {
 		pdp_p[i] = DMPDphys + ptoa(j);
 		pdp_p[i] |= X86_PG_RW | X86_PG_V | PG_U;
 	}
 
 	/* And recursively map PML4 to itself in order to get PTmap */
 	p4_p = (pml4_entry_t *)KPML4phys;
 	p4_p[PML4PML4I] = KPML4phys;
 	p4_p[PML4PML4I] |= X86_PG_RW | X86_PG_V | PG_U;
 
 	/* Connect the Direct Map slot(s) up to the PML4. */
 	for (i = 0; i < ndmpdpphys; i++) {
 		p4_p[DMPML4I + i] = DMPDPphys + ptoa(i);
 		p4_p[DMPML4I + i] |= X86_PG_RW | X86_PG_V | PG_U;
 	}
 
 	/* Connect the KVA slots up to the PML4 */
 	for (i = 0; i < NKPML4E; i++) {
 		p4_p[KPML4BASE + i] = KPDPphys + ptoa(i);
 		p4_p[KPML4BASE + i] |= X86_PG_RW | X86_PG_V | PG_U;
 	}
 }
 
 /*
  *	Bootstrap the system enough to run with virtual memory.
  *
  *	On amd64 this is called after mapping has already been enabled
  *	and just syncs the pmap module with what has already been done.
  *	[We can't call it easily with mapping off since the kernel is not
  *	mapped with PA == VA, hence we would have to relocate every address
  *	from the linked base (virtual) address "KERNBASE" to the actual
  *	(physical) address starting relative to 0]
  */
 void
 pmap_bootstrap(vm_paddr_t *firstaddr)
 {
 	vm_offset_t va;
 	pt_entry_t *pte;
+	int i;
 
 	/*
 	 * Create an initial set of page tables to run the kernel in.
 	 */
 	create_pagetables(firstaddr);
 
 	/*
 	 * Add a physical memory segment (vm_phys_seg) corresponding to the
 	 * preallocated kernel page table pages so that vm_page structures
 	 * representing these pages will be created.  The vm_page structures
 	 * are required for promotion of the corresponding kernel virtual
 	 * addresses to superpage mappings.
 	 */
 	vm_phys_add_seg(KPTphys, KPTphys + ptoa(nkpt));
 
 	virtual_avail = (vm_offset_t) KERNBASE + *firstaddr;
 	virtual_avail = pmap_kmem_choose(virtual_avail);
 
 	virtual_end = VM_MAX_KERNEL_ADDRESS;
 
 
 	/* XXX do %cr0 as well */
 	load_cr4(rcr4() | CR4_PGE);
 	load_cr3(KPML4phys);
 	if (cpu_stdext_feature & CPUID_STDEXT_SMEP)
 		load_cr4(rcr4() | CR4_SMEP);
 
 	/*
 	 * Initialize the kernel pmap (which is statically allocated).
 	 */
 	PMAP_LOCK_INIT(kernel_pmap);
 	kernel_pmap->pm_pml4 = (pdp_entry_t *)PHYS_TO_DMAP(KPML4phys);
 	kernel_pmap->pm_cr3 = KPML4phys;
 	CPU_FILL(&kernel_pmap->pm_active);	/* don't allow deactivation */
-	CPU_FILL(&kernel_pmap->pm_save);	/* always superset of pm_active */
 	TAILQ_INIT(&kernel_pmap->pm_pvchunk);
 	kernel_pmap->pm_flags = pmap_flags;
 
  	/*
 	 * Initialize the global pv list lock.
 	 */
 	rw_init(&pvh_global_lock, "pmap pv global");
 
 	/*
 	 * Reserve some special page table entries/VA space for temporary
 	 * mapping of pages.
 	 */
 #define	SYSMAP(c, p, v, n)	\
 	v = (c)va; va += ((n)*PAGE_SIZE); p = pte; pte += (n);
 
 	va = virtual_avail;
 	pte = vtopte(va);
 
 	/*
 	 * Crashdump maps.  The first page is reused as CMAP1 for the
 	 * memory test.
 	 */
 	SYSMAP(caddr_t, CMAP1, crashdumpmap, MAXDUMPPGS)
 	CADDR1 = crashdumpmap;
 
 	virtual_avail = va;
 
 	/* Initialize the PAT MSR. */
 	pmap_init_pat();
 
 	/* Initialize TLB Context Id. */
 	TUNABLE_INT_FETCH("vm.pmap.pcid_enabled", &pmap_pcid_enabled);
 	if ((cpu_feature2 & CPUID2_PCID) != 0 && pmap_pcid_enabled) {
-		load_cr4(rcr4() | CR4_PCIDE);
-		mtx_init(&pcid_mtx, "pcid", NULL, MTX_DEF);
-		init_unrhdr(&pcid_unr, 1, (1 << 12) - 1, &pcid_mtx);
 		/* Check for INVPCID support */
 		invpcid_works = (cpu_stdext_feature & CPUID_STDEXT_INVPCID)
 		    != 0;
-		kernel_pmap->pm_pcid = 0;
-#ifndef SMP
+		for (i = 0; i < MAXCPU; i++) {
+			kernel_pmap->pm_pcids[i].pm_pcid = PMAP_PCID_KERN;
+			kernel_pmap->pm_pcids[i].pm_gen = 1;
+		}
+		__pcpu[0].pc_pcid_next = PMAP_PCID_KERN + 1;
+		__pcpu[0].pc_pcid_gen = 1;
+		/*
+		 * The pcpu areas of the APs are zeroed during AP startup.
+		 * Each AP initializes its own pc_pcid_next and pc_pcid_gen
+		 * during pcpu setup.
+		 */
+#ifdef SMP
+		load_cr4(rcr4() | CR4_PCIDE);
+#else
 		pmap_pcid_enabled = 0;
 #endif
-	} else
+	} else {
 		pmap_pcid_enabled = 0;
+	}
 }
 
 /*
  * Setup the PAT MSR.
  */
 void
 pmap_init_pat(void)
 {
 	int pat_table[PAT_INDEX_SIZE];
 	uint64_t pat_msr;
 	u_long cr0, cr4;
 	int i;
 
 	/* Bail if this CPU doesn't implement PAT. */
 	if ((cpu_feature & CPUID_PAT) == 0)
 		panic("no PAT??");
 
 	/* Set default PAT index table. */
 	for (i = 0; i < PAT_INDEX_SIZE; i++)
 		pat_table[i] = -1;
 	pat_table[PAT_WRITE_BACK] = 0;
 	pat_table[PAT_WRITE_THROUGH] = 1;
 	pat_table[PAT_UNCACHEABLE] = 3;
 	pat_table[PAT_WRITE_COMBINING] = 3;
 	pat_table[PAT_WRITE_PROTECTED] = 3;
 	pat_table[PAT_UNCACHED] = 3;
 
 	/* Initialize default PAT entries. */
 	pat_msr = PAT_VALUE(0, PAT_WRITE_BACK) |
 	    PAT_VALUE(1, PAT_WRITE_THROUGH) |
 	    PAT_VALUE(2, PAT_UNCACHED) |
 	    PAT_VALUE(3, PAT_UNCACHEABLE) |
 	    PAT_VALUE(4, PAT_WRITE_BACK) |
 	    PAT_VALUE(5, PAT_WRITE_THROUGH) |
 	    PAT_VALUE(6, PAT_UNCACHED) |
 	    PAT_VALUE(7, PAT_UNCACHEABLE);
 
 	if (pat_works) {
 		/*
 		 * Leave the indices 0-3 at the default of WB, WT, UC-, and UC.
 		 * Program 5 and 6 as WP and WC.
 		 * Leave 4 and 7 as WB and UC.
 		 */
 		pat_msr &= ~(PAT_MASK(5) | PAT_MASK(6));
 		pat_msr |= PAT_VALUE(5, PAT_WRITE_PROTECTED) |
 		    PAT_VALUE(6, PAT_WRITE_COMBINING);
 		pat_table[PAT_UNCACHED] = 2;
 		pat_table[PAT_WRITE_PROTECTED] = 5;
 		pat_table[PAT_WRITE_COMBINING] = 6;
 	} else {
 		/*
 		 * Just replace PAT Index 2 with WC instead of UC-.
 		 */
 		pat_msr &= ~PAT_MASK(2);
 		pat_msr |= PAT_VALUE(2, PAT_WRITE_COMBINING);
 		pat_table[PAT_WRITE_COMBINING] = 2;
 	}
 
 	/* Disable PGE. */
 	cr4 = rcr4();
 	load_cr4(cr4 & ~CR4_PGE);
 
 	/* Disable caches (CD = 1, NW = 0). */
 	cr0 = rcr0();
 	load_cr0((cr0 & ~CR0_NW) | CR0_CD);
 
 	/* Flush caches and TLBs. */
 	wbinvd();
 	invltlb();
 
 	/* Update PAT and index table. */
 	wrmsr(MSR_PAT, pat_msr);
 	for (i = 0; i < PAT_INDEX_SIZE; i++)
 		pat_index[i] = pat_table[i];
 
 	/* Flush caches and TLBs again. */
 	wbinvd();
 	invltlb();
 
 	/* Restore caches and PGE. */
 	load_cr0(cr0);
 	load_cr4(cr4);
 }
 
 /*
  *	Initialize a vm_page's machine-dependent fields.
  */
 void
 pmap_page_init(vm_page_t m)
 {
 
 	TAILQ_INIT(&m->md.pv_list);
 	m->md.pat_mode = PAT_WRITE_BACK;
 }
 
 /*
  *	Initialize the pmap module.
  *	Called by vm_init, to initialize any structures that the pmap
  *	system needs to map virtual memory.
  */
 void
 pmap_init(void)
 {
 	vm_page_t mpte;
 	vm_size_t s;
 	int i, pv_npg;
 
 	/*
 	 * Initialize the vm page array entries for the kernel pmap's
 	 * page table pages.
 	 */ 
 	for (i = 0; i < nkpt; i++) {
 		mpte = PHYS_TO_VM_PAGE(KPTphys + (i << PAGE_SHIFT));
 		KASSERT(mpte >= vm_page_array &&
 		    mpte < &vm_page_array[vm_page_array_size],
 		    ("pmap_init: page table page is out of range"));
 		mpte->pindex = pmap_pde_pindex(KERNBASE) + i;
 		mpte->phys_addr = KPTphys + (i << PAGE_SHIFT);
 	}
 
 	/*
 	 * If the kernel is running on a virtual machine, then it must assume
 	 * that MCA is enabled by the hypervisor.  Moreover, the kernel must
 	 * be prepared for the hypervisor changing the vendor and family that
 	 * are reported by CPUID.  Consequently, the workaround for AMD Family
 	 * 10h Erratum 383 is enabled if the processor's feature set does not
 	 * include at least one feature that is only supported by older Intel
 	 * or newer AMD processors.
 	 */
 	if (vm_guest == VM_GUEST_VM && (cpu_feature & CPUID_SS) == 0 &&
 	    (cpu_feature2 & (CPUID2_SSSE3 | CPUID2_SSE41 | CPUID2_AESNI |
 	    CPUID2_AVX | CPUID2_XSAVE)) == 0 && (amd_feature2 & (AMDID2_XOP |
 	    AMDID2_FMA4)) == 0)
 		workaround_erratum383 = 1;
 
 	/*
 	 * Are large page mappings enabled?
 	 */
 	TUNABLE_INT_FETCH("vm.pmap.pg_ps_enabled", &pg_ps_enabled);
 	if (pg_ps_enabled) {
 		KASSERT(MAXPAGESIZES > 1 && pagesizes[1] == 0,
 		    ("pmap_init: can't assign to pagesizes[1]"));
 		pagesizes[1] = NBPDR;
 	}
 
 	/*
 	 * Initialize the pv chunk list mutex.
 	 */
 	mtx_init(&pv_chunks_mutex, "pmap pv chunk list", NULL, MTX_DEF);
 
 	/*
 	 * Initialize the pool of pv list locks.
 	 */
 	for (i = 0; i < NPV_LIST_LOCKS; i++)
 		rw_init(&pv_list_locks[i], "pmap pv list");
 
 	/*
 	 * Calculate the size of the pv head table for superpages.
 	 */
 	pv_npg = howmany(vm_phys_segs[vm_phys_nsegs - 1].end, NBPDR);
 
 	/*
 	 * Allocate memory for the pv head table for superpages.
 	 */
 	s = (vm_size_t)(pv_npg * sizeof(struct md_page));
 	s = round_page(s);
 	pv_table = (struct md_page *)kmem_malloc(kernel_arena, s,
 	    M_WAITOK | M_ZERO);
 	for (i = 0; i < pv_npg; i++)
 		TAILQ_INIT(&pv_table[i].pv_list);
 }
 
 static SYSCTL_NODE(_vm_pmap, OID_AUTO, pde, CTLFLAG_RD, 0,
     "2MB page mapping counters");
 
 static u_long pmap_pde_demotions;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, demotions, CTLFLAG_RD,
     &pmap_pde_demotions, 0, "2MB page demotions");
 
 static u_long pmap_pde_mappings;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, mappings, CTLFLAG_RD,
     &pmap_pde_mappings, 0, "2MB page mappings");
 
 static u_long pmap_pde_p_failures;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, p_failures, CTLFLAG_RD,
     &pmap_pde_p_failures, 0, "2MB page promotion failures");
 
 static u_long pmap_pde_promotions;
 SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, promotions, CTLFLAG_RD,
     &pmap_pde_promotions, 0, "2MB page promotions");
 
 static SYSCTL_NODE(_vm_pmap, OID_AUTO, pdpe, CTLFLAG_RD, 0,
     "1GB page mapping counters");
 
 static u_long pmap_pdpe_demotions;
 SYSCTL_ULONG(_vm_pmap_pdpe, OID_AUTO, demotions, CTLFLAG_RD,
     &pmap_pdpe_demotions, 0, "1GB page demotions");
 
 /***************************************************
  * Low level helper routines.....
  ***************************************************/
 
 static pt_entry_t
 pmap_swap_pat(pmap_t pmap, pt_entry_t entry)
 {
 	int x86_pat_bits = X86_PG_PTE_PAT | X86_PG_PDE_PAT;
 
 	switch (pmap->pm_type) {
 	case PT_X86:
 	case PT_RVI:
 		/* Verify that both PAT bits are not set at the same time */
 		KASSERT((entry & x86_pat_bits) != x86_pat_bits,
 		    ("Invalid PAT bits in entry %#lx", entry));
 
 		/* Swap the PAT bits if one of them is set */
 		if ((entry & x86_pat_bits) != 0)
 			entry ^= x86_pat_bits;
 		break;
 	case PT_EPT:
 		/*
 		 * Nothing to do - the memory attributes are represented
 		 * the same way for regular pages and superpages.
 		 */
 		break;
 	default:
 		panic("pmap_swap_pat: bad pm_type %d", pmap->pm_type);
 	}
 
 	return (entry);
 }
 
 /*
  * Determine the appropriate bits to set in a PTE or PDE for a specified
  * caching mode.
  */
 static int
 pmap_cache_bits(pmap_t pmap, int mode, boolean_t is_pde)
 {
 	int cache_bits, pat_flag, pat_idx;
 
 	if (mode < 0 || mode >= PAT_INDEX_SIZE || pat_index[mode] < 0)
 		panic("Unknown caching mode %d\n", mode);
 
 	switch (pmap->pm_type) {
 	case PT_X86:
 	case PT_RVI:
 		/* The PAT bit is different for PTE's and PDE's. */
 		pat_flag = is_pde ? X86_PG_PDE_PAT : X86_PG_PTE_PAT;
 
 		/* Map the caching mode to a PAT index. */
 		pat_idx = pat_index[mode];
 
 		/* Map the 3-bit index value into the PAT, PCD, and PWT bits. */
 		cache_bits = 0;
 		if (pat_idx & 0x4)
 			cache_bits |= pat_flag;
 		if (pat_idx & 0x2)
 			cache_bits |= PG_NC_PCD;
 		if (pat_idx & 0x1)
 			cache_bits |= PG_NC_PWT;
 		break;
 
 	case PT_EPT:
 		cache_bits = EPT_PG_IGNORE_PAT | EPT_PG_MEMORY_TYPE(mode);
 		break;
 
 	default:
 		panic("unsupported pmap type %d", pmap->pm_type);
 	}
 
 	return (cache_bits);
 }
 
 static int
 pmap_cache_mask(pmap_t pmap, boolean_t is_pde)
 {
 	int mask;
 
 	switch (pmap->pm_type) {
 	case PT_X86:
 	case PT_RVI:
 		mask = is_pde ? X86_PG_PDE_CACHE : X86_PG_PTE_CACHE;
 		break;
 	case PT_EPT:
 		mask = EPT_PG_IGNORE_PAT | EPT_PG_MEMORY_TYPE(0x7);
 		break;
 	default:
 		panic("pmap_cache_mask: invalid pm_type %d", pmap->pm_type);
 	}
 
 	return (mask);
 }
 
 static __inline boolean_t
 pmap_ps_enabled(pmap_t pmap)
 {
 
 	return (pg_ps_enabled && (pmap->pm_flags & PMAP_PDE_SUPERPAGE) != 0);
 }
 
 static void
 pmap_update_pde_store(pmap_t pmap, pd_entry_t *pde, pd_entry_t newpde)
 {
 
 	switch (pmap->pm_type) {
 	case PT_X86:
 		break;
 	case PT_RVI:
 	case PT_EPT:
 		/*
 		 * XXX
 		 * This is a little bogus since the generation number is
 		 * supposed to be bumped up when a region of the address
 		 * space is invalidated in the page tables.
 		 *
 		 * In this case the old PDE entry is valid but yet we want
 		 * to make sure that any mappings using the old entry are
 		 * invalidated in the TLB.
 		 *
 		 * The reason this works as expected is because we rendezvous
 		 * "all" host cpus and force any vcpu context to exit as a
 		 * side-effect.
 		 */
 		atomic_add_acq_long(&pmap->pm_eptgen, 1);
 		break;
 	default:
 		panic("pmap_update_pde_store: bad pm_type %d", pmap->pm_type);
 	}
 	pde_store(pde, newpde);
 }
 
 /*
  * After changing the page size for the specified virtual address in the page
  * table, flush the corresponding entries from the processor's TLB.  Only the
  * calling processor's TLB is affected.
  *
  * The calling thread must be pinned to a processor.
  */
 static void
 pmap_update_pde_invalidate(pmap_t pmap, vm_offset_t va, pd_entry_t newpde)
 {
 	pt_entry_t PG_G;
 
 	if (pmap_type_guest(pmap))
 		return;
 
 	KASSERT(pmap->pm_type == PT_X86,
 	    ("pmap_update_pde_invalidate: invalid type %d", pmap->pm_type));
 
 	PG_G = pmap_global_bit(pmap);
 
 	if ((newpde & PG_PS) == 0)
 		/* Demotion: flush a specific 2MB page mapping. */
 		invlpg(va);
 	else if ((newpde & PG_G) == 0)
 		/*
 		 * Promotion: flush every 4KB page mapping from the TLB
 		 * because there are too many to flush individually.
 		 */
 		invltlb();
 	else {
 		/*
 		 * Promotion: flush every 4KB page mapping from the TLB,
 		 * including any global (PG_G) mappings.
 		 */
 		invltlb_globpcid();
 	}
 }
 #ifdef SMP
 
-static void
-pmap_invalidate_page_pcid(pmap_t pmap, vm_offset_t va)
-{
-	struct invpcid_descr d;
-	uint64_t cr3;
-
-	if (invpcid_works) {
-		d.pcid = pmap->pm_pcid;
-		d.pad = 0;
-		d.addr = va;
-		invpcid(&d, INVPCID_ADDR);
-		return;
-	}
-
-	cr3 = rcr3();
-	critical_enter();
-	load_cr3(pmap->pm_cr3 | CR3_PCID_SAVE);
-	invlpg(va);
-	load_cr3(cr3 | CR3_PCID_SAVE);
-	critical_exit();
-}
-
 /*
  * For SMP, these functions have to use the IPI mechanism for coherence.
  *
  * N.B.: Before calling any of the following TLB invalidation functions,
  * the calling processor must ensure that all stores updating a non-
  * kernel page table are globally performed.  Otherwise, another
  * processor could cache an old, pre-update entry without being
  * invalidated.  This can happen one of two ways: (1) The pmap becomes
  * active on another processor after its pm_active field is checked by
  * one of the following functions but before a store updating the page
  * table is globally performed. (2) The pmap becomes active on another
  * processor before its pm_active field is checked but due to
  * speculative loads one of the following functions still reads the
  * pmap as inactive on the other processor.
  * 
  * The kernel page table is exempt because its pm_active field is
  * immutable.  The kernel page table is always active on every
  * processor.
  */
 
 /*
  * Interrupt the cpus that are executing in the guest context.
  * This will force the vcpu to exit and the cached EPT mappings
  * will be invalidated by the host before the next vmresume.
  */
 static __inline void
 pmap_invalidate_ept(pmap_t pmap)
 {
 	int ipinum;
 
 	sched_pin();
 	KASSERT(!CPU_ISSET(curcpu, &pmap->pm_active),
 	    ("pmap_invalidate_ept: absurd pm_active"));
 
 	/*
 	 * The TLB mappings associated with a vcpu context are not
 	 * flushed each time a different vcpu is chosen to execute.
 	 *
 	 * This is in contrast with a process's vtop mappings that
 	 * are flushed from the TLB on each context switch.
 	 *
 	 * Therefore we need to do more than just a TLB shootdown on
 	 * the active cpus in 'pmap->pm_active'. To do this we keep
 	 * track of the number of invalidations performed on this pmap.
 	 *
 	 * Each vcpu keeps a cache of this counter and compares it
 	 * just before a vmresume. If the counter is out-of-date an
 	 * invept will be done to flush stale mappings from the TLB.
 	 */
 	atomic_add_acq_long(&pmap->pm_eptgen, 1);
 
 	/*
 	 * Force the vcpu to exit and trap back into the hypervisor.
 	 */
 	ipinum = pmap->pm_flags & PMAP_NESTED_IPIMASK;
 	ipi_selected(pmap->pm_active, ipinum);
 	sched_unpin();
 }
 
 void
 pmap_invalidate_page(pmap_t pmap, vm_offset_t va)
 {
-	cpuset_t other_cpus;
-	u_int cpuid;
+	cpuset_t *mask;
+	u_int cpuid, i;
 
 	if (pmap_type_guest(pmap)) {
 		pmap_invalidate_ept(pmap);
 		return;
 	}
 
 	KASSERT(pmap->pm_type == PT_X86,
 	    ("pmap_invalidate_page: invalid type %d", pmap->pm_type));
 
 	sched_pin();
-	if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) {
-		if (!pmap_pcid_enabled) {
-			invlpg(va);
-		} else {
-			if (pmap->pm_pcid != -1 && pmap->pm_pcid != 0) {
-				if (pmap == PCPU_GET(curpmap))
-					invlpg(va);
-				else
-					pmap_invalidate_page_pcid(pmap, va);
-			} else {
-				invltlb_globpcid();
-			}
-		}
-		smp_invlpg(pmap, va);
+	if (pmap == kernel_pmap) {
+		invlpg(va);
+		mask = &all_cpus;
 	} else {
 		cpuid = PCPU_GET(cpuid);
-		other_cpus = all_cpus;
-		CPU_CLR(cpuid, &other_cpus);
-		if (CPU_ISSET(cpuid, &pmap->pm_active))
+		if (pmap == PCPU_GET(curpmap))
 			invlpg(va);
-		else if (pmap_pcid_enabled) {
-			if (pmap->pm_pcid != -1 && pmap->pm_pcid != 0)
-				pmap_invalidate_page_pcid(pmap, va);
-			else
-				invltlb_globpcid();
+		else if (pmap_pcid_enabled)
+			pmap->pm_pcids[cpuid].pm_gen = 0;
+		if (pmap_pcid_enabled) {
+			CPU_FOREACH(i) {
+				if (cpuid != i)
+					pmap->pm_pcids[i].pm_gen = 0;
+			}
 		}
-		if (pmap_pcid_enabled)
-			CPU_AND(&other_cpus, &pmap->pm_save);
-		else
-			CPU_AND(&other_cpus, &pmap->pm_active);
-		if (!CPU_EMPTY(&other_cpus))
-			smp_masked_invlpg(other_cpus, pmap, va);
+		mask = &pmap->pm_active;
 	}
+	smp_masked_invlpg(*mask, va);
 	sched_unpin();
 }
 
-static void
-pmap_invalidate_range_pcid(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
-{
-	struct invpcid_descr d;
-	uint64_t cr3;
-	vm_offset_t addr;
-
-	if (invpcid_works) {
-		d.pcid = pmap->pm_pcid;
-		d.pad = 0;
-		for (addr = sva; addr < eva; addr += PAGE_SIZE) {
-			d.addr = addr;
-			invpcid(&d, INVPCID_ADDR);
-		}
-		return;
-	}
-
-	cr3 = rcr3();
-	critical_enter();
-	load_cr3(pmap->pm_cr3 | CR3_PCID_SAVE);
-	for (addr = sva; addr < eva; addr += PAGE_SIZE)
-		invlpg(addr);
-	load_cr3(cr3 | CR3_PCID_SAVE);
-	critical_exit();
-}
-
 void
 pmap_invalidate_range(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
-	cpuset_t other_cpus;
+	cpuset_t *mask;
 	vm_offset_t addr;
-	u_int cpuid;
+	u_int cpuid, i;
 
 	if (pmap_type_guest(pmap)) {
 		pmap_invalidate_ept(pmap);
 		return;
 	}
 
 	KASSERT(pmap->pm_type == PT_X86,
 	    ("pmap_invalidate_range: invalid type %d", pmap->pm_type));
 
 	sched_pin();
-	if (pmap == kernel_pmap || !CPU_CMP(&pmap->pm_active, &all_cpus)) {
-		if (!pmap_pcid_enabled) {
-			for (addr = sva; addr < eva; addr += PAGE_SIZE)
-				invlpg(addr);
-		} else {
-			if (pmap->pm_pcid != -1 && pmap->pm_pcid != 0) {
-				if (pmap == PCPU_GET(curpmap)) {
-					for (addr = sva; addr < eva;
-					    addr += PAGE_SIZE)
-						invlpg(addr);
-				} else {
-					pmap_invalidate_range_pcid(pmap,
-					    sva, eva);
-				}
-			} else {
-				invltlb_globpcid();
-			}
-		}
-		smp_invlpg_range(pmap, sva, eva);
+	cpuid = PCPU_GET(cpuid);
+	if (pmap == kernel_pmap) {
+		for (addr = sva; addr < eva; addr += PAGE_SIZE)
+			invlpg(addr);
+		mask = &all_cpus;
 	} else {
-		cpuid = PCPU_GET(cpuid);
-		other_cpus = all_cpus;
-		CPU_CLR(cpuid, &other_cpus);
-		if (CPU_ISSET(cpuid, &pmap->pm_active)) {
+		if (pmap == PCPU_GET(curpmap)) {
 			for (addr = sva; addr < eva; addr += PAGE_SIZE)
 				invlpg(addr);
 		} else if (pmap_pcid_enabled) {
-			if (pmap->pm_pcid != -1 && pmap->pm_pcid != 0)
-				pmap_invalidate_range_pcid(pmap, sva, eva);
-			else
-				invltlb_globpcid();
+			pmap->pm_pcids[cpuid].pm_gen = 0;
 		}
-		if (pmap_pcid_enabled)
-			CPU_AND(&other_cpus, &pmap->pm_save);
-		else
-			CPU_AND(&other_cpus, &pmap->pm_active);
-		if (!CPU_EMPTY(&other_cpus))
-			smp_masked_invlpg_range(other_cpus, pmap, sva, eva);
+		if (pmap_pcid_enabled) {
+			CPU_FOREACH(i) {
+				if (cpuid != i)
+					pmap->pm_pcids[i].pm_gen = 0;
+			}
+		}
+		mask = &pmap->pm_active;
 	}
+	smp_masked_invlpg_range(*mask, sva, eva);
 	sched_unpin();
 }
 
 void
 pmap_invalidate_all(pmap_t pmap)
 {
-	cpuset_t other_cpus;
+	cpuset_t *mask;
 	struct invpcid_descr d;
-	uint64_t cr3;
-	u_int cpuid;
+	u_int cpuid, i;
 
 	if (pmap_type_guest(pmap)) {
 		pmap_invalidate_ept(pmap);
 		return;
 	}
 
 	KASSERT(pmap->pm_type == PT_X86,
 	    ("pmap_invalidate_all: invalid type %d", pmap->pm_type));
 
 	sched_pin();
-	cpuid = PCPU_GET(cpuid);
-	if (pmap == kernel_pmap ||
-	    (pmap_pcid_enabled && !CPU_CMP(&pmap->pm_save, &all_cpus)) ||
-	    !CPU_CMP(&pmap->pm_active, &all_cpus)) {
-		if (invpcid_works) {
+	if (pmap == kernel_pmap) {
+		if (pmap_pcid_enabled && invpcid_works) {
 			bzero(&d, sizeof(d));
 			invpcid(&d, INVPCID_CTXGLOB);
 		} else {
 			invltlb_globpcid();
 		}
-		if (!CPU_ISSET(cpuid, &pmap->pm_active))
-			CPU_CLR_ATOMIC(cpuid, &pmap->pm_save);
-		smp_invltlb(pmap);
+		mask = &all_cpus;
 	} else {
-		other_cpus = all_cpus;
-		CPU_CLR(cpuid, &other_cpus);
-
-		/*
-		 * This logic is duplicated in the Xinvltlb shootdown
-		 * IPI handler.
-		 */
-		if (pmap_pcid_enabled) {
-			if (pmap->pm_pcid != -1 && pmap->pm_pcid != 0) {
+		cpuid = PCPU_GET(cpuid);
+		if (pmap == PCPU_GET(curpmap)) {
+			if (pmap_pcid_enabled) {
 				if (invpcid_works) {
-					d.pcid = pmap->pm_pcid;
+					d.pcid = pmap->pm_pcids[cpuid].pm_pcid;
 					d.pad = 0;
 					d.addr = 0;
 					invpcid(&d, INVPCID_CTX);
 				} else {
-					cr3 = rcr3();
-					critical_enter();
-
-					/*
-					 * Bit 63 is clear, pcid TLB
-					 * entries are invalidated.
-					 */
-					load_cr3(pmap->pm_cr3);
-					load_cr3(cr3 | CR3_PCID_SAVE);
-					critical_exit();
+					load_cr3(pmap->pm_cr3 |
+					    pmap->pm_pcids[cpuid].pm_pcid);
 				}
 			} else {
-				invltlb_globpcid();
+				invltlb();
 			}
-		} else if (CPU_ISSET(cpuid, &pmap->pm_active))
-			invltlb();
-		if (!CPU_ISSET(cpuid, &pmap->pm_active))
-			CPU_CLR_ATOMIC(cpuid, &pmap->pm_save);
-		if (pmap_pcid_enabled)
-			CPU_AND(&other_cpus, &pmap->pm_save);
-		else
-			CPU_AND(&other_cpus, &pmap->pm_active);
-		if (!CPU_EMPTY(&other_cpus))
-			smp_masked_invltlb(other_cpus, pmap);
+		} else if (pmap_pcid_enabled) {
+			pmap->pm_pcids[cpuid].pm_gen = 0;
+		}
+		if (pmap_pcid_enabled) {
+			CPU_FOREACH(i) {
+				if (cpuid != i)
+					pmap->pm_pcids[i].pm_gen = 0;
+			}
+		}
+		mask = &pmap->pm_active;
 	}
+	smp_masked_invltlb(*mask, pmap);
 	sched_unpin();
 }
 
 void
 pmap_invalidate_cache(void)
 {
 
 	sched_pin();
 	wbinvd();
 	smp_cache_flush();
 	sched_unpin();
 }
 
 struct pde_action {
 	cpuset_t invalidate;	/* processors that invalidate their TLB */
 	pmap_t pmap;
 	vm_offset_t va;
 	pd_entry_t *pde;
 	pd_entry_t newpde;
 	u_int store;		/* processor that updates the PDE */
 };
 
 static void
 pmap_update_pde_action(void *arg)
 {
 	struct pde_action *act = arg;
 
 	if (act->store == PCPU_GET(cpuid))
 		pmap_update_pde_store(act->pmap, act->pde, act->newpde);
 }
 
 static void
 pmap_update_pde_teardown(void *arg)
 {
 	struct pde_action *act = arg;
 
 	if (CPU_ISSET(PCPU_GET(cpuid), &act->invalidate))
 		pmap_update_pde_invalidate(act->pmap, act->va, act->newpde);
 }
 
 /*
  * Change the page size for the specified virtual address in a way that
  * prevents any possibility of the TLB ever having two entries that map the
  * same virtual address using different page sizes.  This is the recommended
  * workaround for Erratum 383 on AMD Family 10h processors.  It prevents a
  * machine check exception for a TLB state that is improperly diagnosed as a
  * hardware error.
  */
 static void
 pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, pd_entry_t newpde)
 {
 	struct pde_action act;
 	cpuset_t active, other_cpus;
 	u_int cpuid;
 
 	sched_pin();
 	cpuid = PCPU_GET(cpuid);
 	other_cpus = all_cpus;
 	CPU_CLR(cpuid, &other_cpus);
 	if (pmap == kernel_pmap || pmap_type_guest(pmap)) 
 		active = all_cpus;
 	else {
 		active = pmap->pm_active;
-		CPU_AND_ATOMIC(&pmap->pm_save, &active);
 	}
 	if (CPU_OVERLAP(&active, &other_cpus)) { 
 		act.store = cpuid;
 		act.invalidate = active;
 		act.va = va;
 		act.pmap = pmap;
 		act.pde = pde;
 		act.newpde = newpde;
 		CPU_SET(cpuid, &active);
 		smp_rendezvous_cpus(active,
 		    smp_no_rendevous_barrier, pmap_update_pde_action,
 		    pmap_update_pde_teardown, &act);
 	} else {
 		pmap_update_pde_store(pmap, pde, newpde);
 		if (CPU_ISSET(cpuid, &active))
 			pmap_update_pde_invalidate(pmap, va, newpde);
 	}
 	sched_unpin();
 }
 #else /* !SMP */
 /*
  * Normal, non-SMP, invalidation functions.
  * We inline these within pmap.c for speed.
  */
 PMAP_INLINE void
 pmap_invalidate_page(pmap_t pmap, vm_offset_t va)
 {
 
 	switch (pmap->pm_type) {
 	case PT_X86:
 		if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 			invlpg(va);
 		break;
 	case PT_RVI:
 	case PT_EPT:
 		pmap->pm_eptgen++;
 		break;
 	default:
 		panic("pmap_invalidate_page: unknown type: %d", pmap->pm_type);
 	}
 }
 
 PMAP_INLINE void
 pmap_invalidate_range(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
 	vm_offset_t addr;
 
 	switch (pmap->pm_type) {
 	case PT_X86:
 		if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 			for (addr = sva; addr < eva; addr += PAGE_SIZE)
 				invlpg(addr);
 		break;
 	case PT_RVI:
 	case PT_EPT:
 		pmap->pm_eptgen++;
 		break;
 	default:
 		panic("pmap_invalidate_range: unknown type: %d", pmap->pm_type);
 	}
 }
 
 PMAP_INLINE void
 pmap_invalidate_all(pmap_t pmap)
 {
 
 	switch (pmap->pm_type) {
 	case PT_X86:
 		if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 			invltlb();
 		break;
 	case PT_RVI:
 	case PT_EPT:
 		pmap->pm_eptgen++;
 		break;
 	default:
 		panic("pmap_invalidate_all: unknown type %d", pmap->pm_type);
 	}
 }
 
 PMAP_INLINE void
 pmap_invalidate_cache(void)
 {
 
 	wbinvd();
 }
 
 static void
 pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, pd_entry_t newpde)
 {
 
 	pmap_update_pde_store(pmap, pde, newpde);
 	if (pmap == kernel_pmap || !CPU_EMPTY(&pmap->pm_active))
 		pmap_update_pde_invalidate(pmap, va, newpde);
 	else
-		CPU_ZERO(&pmap->pm_save);
+		pmap->pm_pcids[0].pm_gen = 0;
 }
 #endif /* !SMP */
 
 #define PMAP_CLFLUSH_THRESHOLD   (2 * 1024 * 1024)
 
 void
 pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva, boolean_t force)
 {
 
 	if (force) {
-		sva &= ~(vm_offset_t)cpu_clflush_line_size;
+		sva &= ~(vm_offset_t)(cpu_clflush_line_size - 1);
 	} else {
 		KASSERT((sva & PAGE_MASK) == 0,
 		    ("pmap_invalidate_cache_range: sva not page-aligned"));
 		KASSERT((eva & PAGE_MASK) == 0,
 		    ("pmap_invalidate_cache_range: eva not page-aligned"));
 	}
 
 	if ((cpu_feature & CPUID_SS) != 0 && !force)
 		; /* If "Self Snoop" is supported and allowed, do nothing. */
 	else if ((cpu_feature & CPUID_CLFSH) != 0 &&
 	    eva - sva < PMAP_CLFLUSH_THRESHOLD) {
 
 		/*
 		 * XXX: Some CPUs fault, hang, or trash the local APIC
 		 * registers if we use CLFLUSH on the local APIC
 		 * range.  The local APIC is always uncached, so we
 		 * don't need to flush for that range anyway.
 		 */
 		if (pmap_kextract(sva) == lapic_paddr)
 			return;
 
 		/*
 		 * Otherwise, do per-cache line flush.  Use the mfence
 		 * instruction to insure that previous stores are
 		 * included in the write-back.  The processor
 		 * propagates flush to other processors in the cache
 		 * coherence domain.
 		 */
 		mfence();
 		for (; sva < eva; sva += cpu_clflush_line_size)
 			clflush(sva);
 		mfence();
 	} else {
 
 		/*
 		 * No targeted cache flush methods are supported by CPU,
 		 * or the supplied range is bigger than 2MB.
 		 * Globally invalidate cache.
 		 */
 		pmap_invalidate_cache();
 	}
 }
 
 /*
  * Remove the specified set of pages from the data and instruction caches.
  *
  * In contrast to pmap_invalidate_cache_range(), this function does not
  * rely on the CPU's self-snoop feature, because it is intended for use
  * when moving pages into a different cache domain.
  */
 void
 pmap_invalidate_cache_pages(vm_page_t *pages, int count)
 {
 	vm_offset_t daddr, eva;
 	int i;
 
 	if (count >= PMAP_CLFLUSH_THRESHOLD / PAGE_SIZE ||
 	    (cpu_feature & CPUID_CLFSH) == 0)
 		pmap_invalidate_cache();
 	else {
 		mfence();
 		for (i = 0; i < count; i++) {
 			daddr = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pages[i]));
 			eva = daddr + PAGE_SIZE;
 			for (; daddr < eva; daddr += cpu_clflush_line_size)
 				clflush(daddr);
 		}
 		mfence();
 	}
 }
 
 /*
  *	Routine:	pmap_extract
  *	Function:
  *		Extract the physical page address associated
  *		with the given map/virtual_address pair.
  */
 vm_paddr_t 
 pmap_extract(pmap_t pmap, vm_offset_t va)
 {
 	pdp_entry_t *pdpe;
 	pd_entry_t *pde;
 	pt_entry_t *pte, PG_V;
 	vm_paddr_t pa;
 
 	pa = 0;
 	PG_V = pmap_valid_bit(pmap);
 	PMAP_LOCK(pmap);
 	pdpe = pmap_pdpe(pmap, va);
 	if (pdpe != NULL && (*pdpe & PG_V) != 0) {
 		if ((*pdpe & PG_PS) != 0)
 			pa = (*pdpe & PG_PS_FRAME) | (va & PDPMASK);
 		else {
 			pde = pmap_pdpe_to_pde(pdpe, va);
 			if ((*pde & PG_V) != 0) {
 				if ((*pde & PG_PS) != 0) {
 					pa = (*pde & PG_PS_FRAME) |
 					    (va & PDRMASK);
 				} else {
 					pte = pmap_pde_to_pte(pde, va);
 					pa = (*pte & PG_FRAME) |
 					    (va & PAGE_MASK);
 				}
 			}
 		}
 	}
 	PMAP_UNLOCK(pmap);
 	return (pa);
 }
 
 /*
  *	Routine:	pmap_extract_and_hold
  *	Function:
  *		Atomically extract and hold the physical page
  *		with the given pmap and virtual address pair
  *		if that mapping permits the given protection.
  */
 vm_page_t
 pmap_extract_and_hold(pmap_t pmap, vm_offset_t va, vm_prot_t prot)
 {
 	pd_entry_t pde, *pdep;
 	pt_entry_t pte, PG_RW, PG_V;
 	vm_paddr_t pa;
 	vm_page_t m;
 
 	pa = 0;
 	m = NULL;
 	PG_RW = pmap_rw_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 	PMAP_LOCK(pmap);
 retry:
 	pdep = pmap_pde(pmap, va);
 	if (pdep != NULL && (pde = *pdep)) {
 		if (pde & PG_PS) {
 			if ((pde & PG_RW) || (prot & VM_PROT_WRITE) == 0) {
 				if (vm_page_pa_tryrelock(pmap, (pde &
 				    PG_PS_FRAME) | (va & PDRMASK), &pa))
 					goto retry;
 				m = PHYS_TO_VM_PAGE((pde & PG_PS_FRAME) |
 				    (va & PDRMASK));
 				vm_page_hold(m);
 			}
 		} else {
 			pte = *pmap_pde_to_pte(pdep, va);
 			if ((pte & PG_V) &&
 			    ((pte & PG_RW) || (prot & VM_PROT_WRITE) == 0)) {
 				if (vm_page_pa_tryrelock(pmap, pte & PG_FRAME,
 				    &pa))
 					goto retry;
 				m = PHYS_TO_VM_PAGE(pte & PG_FRAME);
 				vm_page_hold(m);
 			}
 		}
 	}
 	PA_UNLOCK_COND(pa);
 	PMAP_UNLOCK(pmap);
 	return (m);
 }
 
 vm_paddr_t
 pmap_kextract(vm_offset_t va)
 {
 	pd_entry_t pde;
 	vm_paddr_t pa;
 
 	if (va >= DMAP_MIN_ADDRESS && va < DMAP_MAX_ADDRESS) {
 		pa = DMAP_TO_PHYS(va);
 	} else {
 		pde = *vtopde(va);
 		if (pde & PG_PS) {
 			pa = (pde & PG_PS_FRAME) | (va & PDRMASK);
 		} else {
 			/*
 			 * Beware of a concurrent promotion that changes the
 			 * PDE at this point!  For example, vtopte() must not
 			 * be used to access the PTE because it would use the
 			 * new PDE.  It is, however, safe to use the old PDE
 			 * because the page table page is preserved by the
 			 * promotion.
 			 */
 			pa = *pmap_pde_to_pte(&pde, va);
 			pa = (pa & PG_FRAME) | (va & PAGE_MASK);
 		}
 	}
 	return (pa);
 }
 
 /***************************************************
  * Low level mapping routines.....
  ***************************************************/
 
 /*
  * Add a wired page to the kva.
  * Note: not SMP coherent.
  */
 PMAP_INLINE void 
 pmap_kenter(vm_offset_t va, vm_paddr_t pa)
 {
 	pt_entry_t *pte;
 
 	pte = vtopte(va);
 	pte_store(pte, pa | X86_PG_RW | X86_PG_V | X86_PG_G);
 }
 
 static __inline void
 pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int mode)
 {
 	pt_entry_t *pte;
 	int cache_bits;
 
 	pte = vtopte(va);
 	cache_bits = pmap_cache_bits(kernel_pmap, mode, 0);
 	pte_store(pte, pa | X86_PG_RW | X86_PG_V | X86_PG_G | cache_bits);
 }
 
 /*
  * Remove a page from the kernel pagetables.
  * Note: not SMP coherent.
  */
 PMAP_INLINE void
 pmap_kremove(vm_offset_t va)
 {
 	pt_entry_t *pte;
 
 	pte = vtopte(va);
 	pte_clear(pte);
 }
 
 /*
  *	Used to map a range of physical addresses into kernel
  *	virtual address space.
  *
  *	The value passed in '*virt' is a suggested virtual address for
  *	the mapping. Architectures which can support a direct-mapped
  *	physical to virtual region can return the appropriate address
  *	within that region, leaving '*virt' unchanged. Other
  *	architectures should map the pages starting at '*virt' and
  *	update '*virt' with the first usable address after the mapped
  *	region.
  */
 vm_offset_t
 pmap_map(vm_offset_t *virt, vm_paddr_t start, vm_paddr_t end, int prot)
 {
 	return PHYS_TO_DMAP(start);
 }
 
 
 /*
  * Add a list of wired pages to the kva
  * this routine is only used for temporary
  * kernel mappings that do not need to have
  * page modification or references recorded.
  * Note that old mappings are simply written
  * over.  The page *must* be wired.
  * Note: SMP coherent.  Uses a ranged shootdown IPI.
  */
 void
 pmap_qenter(vm_offset_t sva, vm_page_t *ma, int count)
 {
 	pt_entry_t *endpte, oldpte, pa, *pte;
 	vm_page_t m;
 	int cache_bits;
 
 	oldpte = 0;
 	pte = vtopte(sva);
 	endpte = pte + count;
 	while (pte < endpte) {
 		m = *ma++;
 		cache_bits = pmap_cache_bits(kernel_pmap, m->md.pat_mode, 0);
 		pa = VM_PAGE_TO_PHYS(m) | cache_bits;
 		if ((*pte & (PG_FRAME | X86_PG_PTE_CACHE)) != pa) {
 			oldpte |= *pte;
 			pte_store(pte, pa | X86_PG_G | X86_PG_RW | X86_PG_V);
 		}
 		pte++;
 	}
 	if (__predict_false((oldpte & X86_PG_V) != 0))
 		pmap_invalidate_range(kernel_pmap, sva, sva + count *
 		    PAGE_SIZE);
 }
 
 /*
  * This routine tears out page mappings from the
  * kernel -- it is meant only for temporary mappings.
  * Note: SMP coherent.  Uses a ranged shootdown IPI.
  */
 void
 pmap_qremove(vm_offset_t sva, int count)
 {
 	vm_offset_t va;
 
 	va = sva;
 	while (count-- > 0) {
 		KASSERT(va >= VM_MIN_KERNEL_ADDRESS, ("usermode va %lx", va));
 		pmap_kremove(va);
 		va += PAGE_SIZE;
 	}
 	pmap_invalidate_range(kernel_pmap, sva, va);
 }
 
 /***************************************************
  * Page table page management routines.....
  ***************************************************/
 static __inline void
 pmap_free_zero_pages(struct spglist *free)
 {
 	vm_page_t m;
 
 	while ((m = SLIST_FIRST(free)) != NULL) {
 		SLIST_REMOVE_HEAD(free, plinks.s.ss);
 		/* Preserve the page's PG_ZERO setting. */
 		vm_page_free_toq(m);
 	}
 }
 
 /*
  * Schedule the specified unused page table page to be freed.  Specifically,
  * add the page to the specified list of pages that will be released to the
  * physical memory manager after the TLB has been updated.
  */
 static __inline void
 pmap_add_delayed_free_list(vm_page_t m, struct spglist *free,
     boolean_t set_PG_ZERO)
 {
 
 	if (set_PG_ZERO)
 		m->flags |= PG_ZERO;
 	else
 		m->flags &= ~PG_ZERO;
 	SLIST_INSERT_HEAD(free, m, plinks.s.ss);
 }
 	
 /*
  * Inserts the specified page table page into the specified pmap's collection
  * of idle page table pages.  Each of a pmap's page table pages is responsible
  * for mapping a distinct range of virtual addresses.  The pmap's collection is
  * ordered by this virtual address range.
  */
 static __inline int
 pmap_insert_pt_page(pmap_t pmap, vm_page_t mpte)
 {
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	return (vm_radix_insert(&pmap->pm_root, mpte));
 }
 
 /*
  * Looks for a page table page mapping the specified virtual address in the
  * specified pmap's collection of idle page table pages.  Returns NULL if there
  * is no page table page corresponding to the specified virtual address.
  */
 static __inline vm_page_t
 pmap_lookup_pt_page(pmap_t pmap, vm_offset_t va)
 {
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	return (vm_radix_lookup(&pmap->pm_root, pmap_pde_pindex(va)));
 }
 
 /*
  * Removes the specified page table page from the specified pmap's collection
  * of idle page table pages.  The specified page table page must be a member of
  * the pmap's collection.
  */
 static __inline void
 pmap_remove_pt_page(pmap_t pmap, vm_page_t mpte)
 {
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	vm_radix_remove(&pmap->pm_root, mpte->pindex);
 }
 
 /*
  * Decrements a page table page's wire count, which is used to record the
  * number of valid page table entries within the page.  If the wire count
  * drops to zero, then the page table page is unmapped.  Returns TRUE if the
  * page table page was unmapped and FALSE otherwise.
  */
 static inline boolean_t
 pmap_unwire_ptp(pmap_t pmap, vm_offset_t va, vm_page_t m, struct spglist *free)
 {
 
 	--m->wire_count;
 	if (m->wire_count == 0) {
 		_pmap_unwire_ptp(pmap, va, m, free);
 		return (TRUE);
 	} else
 		return (FALSE);
 }
 
 static void
 _pmap_unwire_ptp(pmap_t pmap, vm_offset_t va, vm_page_t m, struct spglist *free)
 {
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	/*
 	 * unmap the page table page
 	 */
 	if (m->pindex >= (NUPDE + NUPDPE)) {
 		/* PDP page */
 		pml4_entry_t *pml4;
 		pml4 = pmap_pml4e(pmap, va);
 		*pml4 = 0;
 	} else if (m->pindex >= NUPDE) {
 		/* PD page */
 		pdp_entry_t *pdp;
 		pdp = pmap_pdpe(pmap, va);
 		*pdp = 0;
 	} else {
 		/* PTE page */
 		pd_entry_t *pd;
 		pd = pmap_pde(pmap, va);
 		*pd = 0;
 	}
 	pmap_resident_count_dec(pmap, 1);
 	if (m->pindex < NUPDE) {
 		/* We just released a PT, unhold the matching PD */
 		vm_page_t pdpg;
 
 		pdpg = PHYS_TO_VM_PAGE(*pmap_pdpe(pmap, va) & PG_FRAME);
 		pmap_unwire_ptp(pmap, va, pdpg, free);
 	}
 	if (m->pindex >= NUPDE && m->pindex < (NUPDE + NUPDPE)) {
 		/* We just released a PD, unhold the matching PDP */
 		vm_page_t pdppg;
 
 		pdppg = PHYS_TO_VM_PAGE(*pmap_pml4e(pmap, va) & PG_FRAME);
 		pmap_unwire_ptp(pmap, va, pdppg, free);
 	}
 
 	/*
 	 * This is a release store so that the ordinary store unmapping
 	 * the page table page is globally performed before TLB shoot-
 	 * down is begun.
 	 */
 	atomic_subtract_rel_int(&vm_cnt.v_wire_count, 1);
 
 	/* 
 	 * Put page on a list so that it is released after
 	 * *ALL* TLB shootdown is done
 	 */
 	pmap_add_delayed_free_list(m, free, TRUE);
 }
 
 /*
  * After removing a page table entry, this routine is used to
  * conditionally free the page, and manage the hold/wire counts.
  */
 static int
 pmap_unuse_pt(pmap_t pmap, vm_offset_t va, pd_entry_t ptepde,
     struct spglist *free)
 {
 	vm_page_t mpte;
 
 	if (va >= VM_MAXUSER_ADDRESS)
 		return (0);
 	KASSERT(ptepde != 0, ("pmap_unuse_pt: ptepde != 0"));
 	mpte = PHYS_TO_VM_PAGE(ptepde & PG_FRAME);
 	return (pmap_unwire_ptp(pmap, va, mpte, free));
 }
 
 void
 pmap_pinit0(pmap_t pmap)
 {
 
 	PMAP_LOCK_INIT(pmap);
 	pmap->pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(KPML4phys);
 	pmap->pm_cr3 = KPML4phys;
 	pmap->pm_root.rt_root = 0;
 	CPU_ZERO(&pmap->pm_active);
-	CPU_ZERO(&pmap->pm_save);
 	PCPU_SET(curpmap, pmap);
 	TAILQ_INIT(&pmap->pm_pvchunk);
 	bzero(&pmap->pm_stats, sizeof pmap->pm_stats);
-	pmap->pm_pcid = pmap_pcid_enabled ? 0 : -1;
 	pmap->pm_flags = pmap_flags;
 }
 
 /*
  * Initialize a preallocated and zeroed pmap structure,
  * such as one in a vmspace structure.
  */
 int
 pmap_pinit_type(pmap_t pmap, enum pmap_type pm_type, int flags)
 {
 	vm_page_t pml4pg;
 	vm_paddr_t pml4phys;
 	int i;
 
 	/*
 	 * allocate the page directory page
 	 */
 	while ((pml4pg = vm_page_alloc(NULL, 0, VM_ALLOC_NORMAL |
 	    VM_ALLOC_NOOBJ | VM_ALLOC_WIRED | VM_ALLOC_ZERO)) == NULL)
 		VM_WAIT;
 
 	pml4phys = VM_PAGE_TO_PHYS(pml4pg);
 	pmap->pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(pml4phys);
-	pmap->pm_pcid = -1;
+	CPU_FOREACH(i) {
+		pmap->pm_pcids[i].pm_pcid = PMAP_PCID_NONE;
+		pmap->pm_pcids[i].pm_gen = 0;
+	}
 	pmap->pm_cr3 = ~0;	/* initialize to an invalid value */
 
 	if ((pml4pg->flags & PG_ZERO) == 0)
 		pagezero(pmap->pm_pml4);
 
 	/*
 	 * Do not install the host kernel mappings in the nested page
 	 * tables. These mappings are meaningless in the guest physical
 	 * address space.
 	 */
 	if ((pmap->pm_type = pm_type) == PT_X86) {
 		pmap->pm_cr3 = pml4phys;
 
 		/* Wire in kernel global address entries. */
 		for (i = 0; i < NKPML4E; i++) {
 			pmap->pm_pml4[KPML4BASE + i] = (KPDPphys + ptoa(i)) |
 			    X86_PG_RW | X86_PG_V | PG_U;
 		}
 		for (i = 0; i < ndmpdpphys; i++) {
 			pmap->pm_pml4[DMPML4I + i] = (DMPDPphys + ptoa(i)) |
 			    X86_PG_RW | X86_PG_V | PG_U;
 		}
 
 		/* install self-referential address mapping entry(s) */
 		pmap->pm_pml4[PML4PML4I] = VM_PAGE_TO_PHYS(pml4pg) |
 		    X86_PG_V | X86_PG_RW | X86_PG_A | X86_PG_M;
-
-		if (pmap_pcid_enabled) {
-			pmap->pm_pcid = alloc_unr(&pcid_unr);
-			if (pmap->pm_pcid != -1)
-				pmap->pm_cr3 |= pmap->pm_pcid;
-		}
 	}
 
 	pmap->pm_root.rt_root = 0;
 	CPU_ZERO(&pmap->pm_active);
 	TAILQ_INIT(&pmap->pm_pvchunk);
 	bzero(&pmap->pm_stats, sizeof pmap->pm_stats);
 	pmap->pm_flags = flags;
 	pmap->pm_eptgen = 0;
-	CPU_ZERO(&pmap->pm_save);
 
 	return (1);
 }
 
 int
 pmap_pinit(pmap_t pmap)
 {
 
 	return (pmap_pinit_type(pmap, PT_X86, pmap_flags));
 }
 
 /*
  * This routine is called if the desired page table page does not exist.
  *
  * If page table page allocation fails, this routine may sleep before
  * returning NULL.  It sleeps only if a lock pointer was given.
  *
  * Note: If a page allocation fails at page table level two or three,
  * one or two pages may be held during the wait, only to be released
  * afterwards.  This conservative approach is easily argued to avoid
  * race conditions.
  */
 static vm_page_t
 _pmap_allocpte(pmap_t pmap, vm_pindex_t ptepindex, struct rwlock **lockp)
 {
 	vm_page_t m, pdppg, pdpg;
 	pt_entry_t PG_A, PG_M, PG_RW, PG_V;
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	PG_A = pmap_accessed_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 
 	/*
 	 * Allocate a page table page.
 	 */
 	if ((m = vm_page_alloc(NULL, ptepindex, VM_ALLOC_NOOBJ |
 	    VM_ALLOC_WIRED | VM_ALLOC_ZERO)) == NULL) {
 		if (lockp != NULL) {
 			RELEASE_PV_LIST_LOCK(lockp);
 			PMAP_UNLOCK(pmap);
 			rw_runlock(&pvh_global_lock);
 			VM_WAIT;
 			rw_rlock(&pvh_global_lock);
 			PMAP_LOCK(pmap);
 		}
 
 		/*
 		 * Indicate the need to retry.  While waiting, the page table
 		 * page may have been allocated.
 		 */
 		return (NULL);
 	}
 	if ((m->flags & PG_ZERO) == 0)
 		pmap_zero_page(m);
 
 	/*
 	 * Map the pagetable page into the process address space, if
 	 * it isn't already there.
 	 */
 
 	if (ptepindex >= (NUPDE + NUPDPE)) {
 		pml4_entry_t *pml4;
 		vm_pindex_t pml4index;
 
 		/* Wire up a new PDPE page */
 		pml4index = ptepindex - (NUPDE + NUPDPE);
 		pml4 = &pmap->pm_pml4[pml4index];
 		*pml4 = VM_PAGE_TO_PHYS(m) | PG_U | PG_RW | PG_V | PG_A | PG_M;
 
 	} else if (ptepindex >= NUPDE) {
 		vm_pindex_t pml4index;
 		vm_pindex_t pdpindex;
 		pml4_entry_t *pml4;
 		pdp_entry_t *pdp;
 
 		/* Wire up a new PDE page */
 		pdpindex = ptepindex - NUPDE;
 		pml4index = pdpindex >> NPML4EPGSHIFT;
 
 		pml4 = &pmap->pm_pml4[pml4index];
 		if ((*pml4 & PG_V) == 0) {
 			/* Have to allocate a new pdp, recurse */
 			if (_pmap_allocpte(pmap, NUPDE + NUPDPE + pml4index,
 			    lockp) == NULL) {
 				--m->wire_count;
 				atomic_subtract_int(&vm_cnt.v_wire_count, 1);
 				vm_page_free_zero(m);
 				return (NULL);
 			}
 		} else {
 			/* Add reference to pdp page */
 			pdppg = PHYS_TO_VM_PAGE(*pml4 & PG_FRAME);
 			pdppg->wire_count++;
 		}
 		pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME);
 
 		/* Now find the pdp page */
 		pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)];
 		*pdp = VM_PAGE_TO_PHYS(m) | PG_U | PG_RW | PG_V | PG_A | PG_M;
 
 	} else {
 		vm_pindex_t pml4index;
 		vm_pindex_t pdpindex;
 		pml4_entry_t *pml4;
 		pdp_entry_t *pdp;
 		pd_entry_t *pd;
 
 		/* Wire up a new PTE page */
 		pdpindex = ptepindex >> NPDPEPGSHIFT;
 		pml4index = pdpindex >> NPML4EPGSHIFT;
 
-		/* First, find the pdp and check that its valid. */
+		/* First, find the pdp and check that it's valid. */
 		pml4 = &pmap->pm_pml4[pml4index];
 		if ((*pml4 & PG_V) == 0) {
 			/* Have to allocate a new pd, recurse */
 			if (_pmap_allocpte(pmap, NUPDE + pdpindex,
 			    lockp) == NULL) {
 				--m->wire_count;
 				atomic_subtract_int(&vm_cnt.v_wire_count, 1);
 				vm_page_free_zero(m);
 				return (NULL);
 			}
 			pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME);
 			pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)];
 		} else {
 			pdp = (pdp_entry_t *)PHYS_TO_DMAP(*pml4 & PG_FRAME);
 			pdp = &pdp[pdpindex & ((1ul << NPDPEPGSHIFT) - 1)];
 			if ((*pdp & PG_V) == 0) {
 				/* Have to allocate a new pd, recurse */
 				if (_pmap_allocpte(pmap, NUPDE + pdpindex,
 				    lockp) == NULL) {
 					--m->wire_count;
 					atomic_subtract_int(&vm_cnt.v_wire_count,
 					    1);
 					vm_page_free_zero(m);
 					return (NULL);
 				}
 			} else {
 				/* Add reference to the pd page */
 				pdpg = PHYS_TO_VM_PAGE(*pdp & PG_FRAME);
 				pdpg->wire_count++;
 			}
 		}
 		pd = (pd_entry_t *)PHYS_TO_DMAP(*pdp & PG_FRAME);
 
 		/* Now we know where the page directory page is */
 		pd = &pd[ptepindex & ((1ul << NPDEPGSHIFT) - 1)];
 		*pd = VM_PAGE_TO_PHYS(m) | PG_U | PG_RW | PG_V | PG_A | PG_M;
 	}
 
 	pmap_resident_count_inc(pmap, 1);
 
 	return (m);
 }
 
 static vm_page_t
 pmap_allocpde(pmap_t pmap, vm_offset_t va, struct rwlock **lockp)
 {
 	vm_pindex_t pdpindex, ptepindex;
 	pdp_entry_t *pdpe, PG_V;
 	vm_page_t pdpg;
 
 	PG_V = pmap_valid_bit(pmap);
 
 retry:
 	pdpe = pmap_pdpe(pmap, va);
 	if (pdpe != NULL && (*pdpe & PG_V) != 0) {
 		/* Add a reference to the pd page. */
 		pdpg = PHYS_TO_VM_PAGE(*pdpe & PG_FRAME);
 		pdpg->wire_count++;
 	} else {
 		/* Allocate a pd page. */
 		ptepindex = pmap_pde_pindex(va);
 		pdpindex = ptepindex >> NPDPEPGSHIFT;
 		pdpg = _pmap_allocpte(pmap, NUPDE + pdpindex, lockp);
 		if (pdpg == NULL && lockp != NULL)
 			goto retry;
 	}
 	return (pdpg);
 }
 
 static vm_page_t
 pmap_allocpte(pmap_t pmap, vm_offset_t va, struct rwlock **lockp)
 {
 	vm_pindex_t ptepindex;
 	pd_entry_t *pd, PG_V;
 	vm_page_t m;
 
 	PG_V = pmap_valid_bit(pmap);
 
 	/*
 	 * Calculate pagetable page index
 	 */
 	ptepindex = pmap_pde_pindex(va);
 retry:
 	/*
 	 * Get the page directory entry
 	 */
 	pd = pmap_pde(pmap, va);
 
 	/*
 	 * This supports switching from a 2MB page to a
 	 * normal 4K page.
 	 */
 	if (pd != NULL && (*pd & (PG_PS | PG_V)) == (PG_PS | PG_V)) {
 		if (!pmap_demote_pde_locked(pmap, pd, va, lockp)) {
 			/*
 			 * Invalidation of the 2MB page mapping may have caused
 			 * the deallocation of the underlying PD page.
 			 */
 			pd = NULL;
 		}
 	}
 
 	/*
 	 * If the page table page is mapped, we just increment the
 	 * hold count, and activate it.
 	 */
 	if (pd != NULL && (*pd & PG_V) != 0) {
 		m = PHYS_TO_VM_PAGE(*pd & PG_FRAME);
 		m->wire_count++;
 	} else {
 		/*
 		 * Here if the pte page isn't mapped, or if it has been
 		 * deallocated.
 		 */
 		m = _pmap_allocpte(pmap, ptepindex, lockp);
 		if (m == NULL && lockp != NULL)
 			goto retry;
 	}
 	return (m);
 }
 
 
 /***************************************************
  * Pmap allocation/deallocation routines.
  ***************************************************/
 
 /*
  * Release any resources held by the given physical map.
  * Called when a pmap initialized by pmap_pinit is being released.
  * Should only be called if the map contains no valid mappings.
  */
 void
 pmap_release(pmap_t pmap)
 {
 	vm_page_t m;
 	int i;
 
 	KASSERT(pmap->pm_stats.resident_count == 0,
 	    ("pmap_release: pmap resident count %ld != 0",
 	    pmap->pm_stats.resident_count));
 	KASSERT(vm_radix_is_empty(&pmap->pm_root),
 	    ("pmap_release: pmap has reserved page table page(s)"));
 	KASSERT(CPU_EMPTY(&pmap->pm_active),
 	    ("releasing active pmap %p", pmap));
 
-	if (pmap_pcid_enabled) {
-		/*
-		 * Invalidate any left TLB entries, to allow the reuse
-		 * of the pcid.
-		 */
-		pmap_invalidate_all(pmap);
-	}
-
 	m = PHYS_TO_VM_PAGE(DMAP_TO_PHYS((vm_offset_t)pmap->pm_pml4));
 
 	for (i = 0; i < NKPML4E; i++)	/* KVA */
 		pmap->pm_pml4[KPML4BASE + i] = 0;
 	for (i = 0; i < ndmpdpphys; i++)/* Direct Map */
 		pmap->pm_pml4[DMPML4I + i] = 0;
 	pmap->pm_pml4[PML4PML4I] = 0;	/* Recursive Mapping */
 
 	m->wire_count--;
 	atomic_subtract_int(&vm_cnt.v_wire_count, 1);
 	vm_page_free_zero(m);
-	if (pmap->pm_pcid != -1)
-		free_unr(&pcid_unr, pmap->pm_pcid);
 }
 
 static int
 kvm_size(SYSCTL_HANDLER_ARGS)
 {
 	unsigned long ksize = VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS;
 
 	return sysctl_handle_long(oidp, &ksize, 0, req);
 }
 SYSCTL_PROC(_vm, OID_AUTO, kvm_size, CTLTYPE_LONG|CTLFLAG_RD, 
     0, 0, kvm_size, "LU", "Size of KVM");
 
 static int
 kvm_free(SYSCTL_HANDLER_ARGS)
 {
 	unsigned long kfree = VM_MAX_KERNEL_ADDRESS - kernel_vm_end;
 
 	return sysctl_handle_long(oidp, &kfree, 0, req);
 }
 SYSCTL_PROC(_vm, OID_AUTO, kvm_free, CTLTYPE_LONG|CTLFLAG_RD, 
     0, 0, kvm_free, "LU", "Amount of KVM free");
 
 /*
  * grow the number of kernel page table entries, if needed
  */
 void
 pmap_growkernel(vm_offset_t addr)
 {
 	vm_paddr_t paddr;
 	vm_page_t nkpg;
 	pd_entry_t *pde, newpdir;
 	pdp_entry_t *pdpe;
 
 	mtx_assert(&kernel_map->system_mtx, MA_OWNED);
 
 	/*
 	 * Return if "addr" is within the range of kernel page table pages
 	 * that were preallocated during pmap bootstrap.  Moreover, leave
 	 * "kernel_vm_end" and the kernel page table as they were.
 	 *
 	 * The correctness of this action is based on the following
 	 * argument: vm_map_insert() allocates contiguous ranges of the
 	 * kernel virtual address space.  It calls this function if a range
 	 * ends after "kernel_vm_end".  If the kernel is mapped between
 	 * "kernel_vm_end" and "addr", then the range cannot begin at
 	 * "kernel_vm_end".  In fact, its beginning address cannot be less
 	 * than the kernel.  Thus, there is no immediate need to allocate
 	 * any new kernel page table pages between "kernel_vm_end" and
 	 * "KERNBASE".
 	 */
 	if (KERNBASE < addr && addr <= KERNBASE + nkpt * NBPDR)
 		return;
 
 	addr = roundup2(addr, NBPDR);
 	if (addr - 1 >= kernel_map->max_offset)
 		addr = kernel_map->max_offset;
 	while (kernel_vm_end < addr) {
 		pdpe = pmap_pdpe(kernel_pmap, kernel_vm_end);
 		if ((*pdpe & X86_PG_V) == 0) {
 			/* We need a new PDP entry */
 			nkpg = vm_page_alloc(NULL, kernel_vm_end >> PDPSHIFT,
 			    VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ |
 			    VM_ALLOC_WIRED | VM_ALLOC_ZERO);
 			if (nkpg == NULL)
 				panic("pmap_growkernel: no memory to grow kernel");
 			if ((nkpg->flags & PG_ZERO) == 0)
 				pmap_zero_page(nkpg);
 			paddr = VM_PAGE_TO_PHYS(nkpg);
 			*pdpe = (pdp_entry_t)(paddr | X86_PG_V | X86_PG_RW |
 			    X86_PG_A | X86_PG_M);
 			continue; /* try again */
 		}
 		pde = pmap_pdpe_to_pde(pdpe, kernel_vm_end);
 		if ((*pde & X86_PG_V) != 0) {
 			kernel_vm_end = (kernel_vm_end + NBPDR) & ~PDRMASK;
 			if (kernel_vm_end - 1 >= kernel_map->max_offset) {
 				kernel_vm_end = kernel_map->max_offset;
 				break;                       
 			}
 			continue;
 		}
 
 		nkpg = vm_page_alloc(NULL, pmap_pde_pindex(kernel_vm_end),
 		    VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED |
 		    VM_ALLOC_ZERO);
 		if (nkpg == NULL)
 			panic("pmap_growkernel: no memory to grow kernel");
 		if ((nkpg->flags & PG_ZERO) == 0)
 			pmap_zero_page(nkpg);
 		paddr = VM_PAGE_TO_PHYS(nkpg);
 		newpdir = paddr | X86_PG_V | X86_PG_RW | X86_PG_A | X86_PG_M;
 		pde_store(pde, newpdir);
 
 		kernel_vm_end = (kernel_vm_end + NBPDR) & ~PDRMASK;
 		if (kernel_vm_end - 1 >= kernel_map->max_offset) {
 			kernel_vm_end = kernel_map->max_offset;
 			break;                       
 		}
 	}
 }
 
 
 /***************************************************
  * page management routines.
  ***************************************************/
 
 CTASSERT(sizeof(struct pv_chunk) == PAGE_SIZE);
 CTASSERT(_NPCM == 3);
 CTASSERT(_NPCPV == 168);
 
 static __inline struct pv_chunk *
 pv_to_chunk(pv_entry_t pv)
 {
 
 	return ((struct pv_chunk *)((uintptr_t)pv & ~(uintptr_t)PAGE_MASK));
 }
 
 #define PV_PMAP(pv) (pv_to_chunk(pv)->pc_pmap)
 
 #define	PC_FREE0	0xfffffffffffffffful
 #define	PC_FREE1	0xfffffffffffffffful
 #define	PC_FREE2	0x000000fffffffffful
 
 static const uint64_t pc_freemask[_NPCM] = { PC_FREE0, PC_FREE1, PC_FREE2 };
 
 #ifdef PV_STATS
 static int pc_chunk_count, pc_chunk_allocs, pc_chunk_frees, pc_chunk_tryfail;
 
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_count, CTLFLAG_RD, &pc_chunk_count, 0,
 	"Current number of pv entry chunks");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_allocs, CTLFLAG_RD, &pc_chunk_allocs, 0,
 	"Current number of pv entry chunks allocated");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_frees, CTLFLAG_RD, &pc_chunk_frees, 0,
 	"Current number of pv entry chunks frees");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pc_chunk_tryfail, CTLFLAG_RD, &pc_chunk_tryfail, 0,
 	"Number of times tried to get a chunk page but failed.");
 
 static long pv_entry_frees, pv_entry_allocs, pv_entry_count;
 static int pv_entry_spare;
 
 SYSCTL_LONG(_vm_pmap, OID_AUTO, pv_entry_frees, CTLFLAG_RD, &pv_entry_frees, 0,
 	"Current number of pv entry frees");
 SYSCTL_LONG(_vm_pmap, OID_AUTO, pv_entry_allocs, CTLFLAG_RD, &pv_entry_allocs, 0,
 	"Current number of pv entry allocs");
 SYSCTL_LONG(_vm_pmap, OID_AUTO, pv_entry_count, CTLFLAG_RD, &pv_entry_count, 0,
 	"Current number of pv entries");
 SYSCTL_INT(_vm_pmap, OID_AUTO, pv_entry_spare, CTLFLAG_RD, &pv_entry_spare, 0,
 	"Current number of spare pv entries");
 #endif
 
 /*
  * We are in a serious low memory condition.  Resort to
  * drastic measures to free some pages so we can allocate
  * another pv entry chunk.
  *
  * Returns NULL if PV entries were reclaimed from the specified pmap.
  *
  * We do not, however, unmap 2mpages because subsequent accesses will
  * allocate per-page pv entries until repromotion occurs, thereby
  * exacerbating the shortage of free pv entries.
  */
 static vm_page_t
 reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp)
 {
 	struct pch new_tail;
 	struct pv_chunk *pc;
 	struct md_page *pvh;
 	pd_entry_t *pde;
 	pmap_t pmap;
 	pt_entry_t *pte, tpte;
 	pt_entry_t PG_G, PG_A, PG_M, PG_RW;
 	pv_entry_t pv;
 	vm_offset_t va;
 	vm_page_t m, m_pc;
 	struct spglist free;
 	uint64_t inuse;
 	int bit, field, freed;
 
 	rw_assert(&pvh_global_lock, RA_LOCKED);
 	PMAP_LOCK_ASSERT(locked_pmap, MA_OWNED);
 	KASSERT(lockp != NULL, ("reclaim_pv_chunk: lockp is NULL"));
 	pmap = NULL;
 	m_pc = NULL;
 	PG_G = PG_A = PG_M = PG_RW = 0;
 	SLIST_INIT(&free);
 	TAILQ_INIT(&new_tail);
 	mtx_lock(&pv_chunks_mutex);
 	while ((pc = TAILQ_FIRST(&pv_chunks)) != NULL && SLIST_EMPTY(&free)) {
 		TAILQ_REMOVE(&pv_chunks, pc, pc_lru);
 		mtx_unlock(&pv_chunks_mutex);
 		if (pmap != pc->pc_pmap) {
 			if (pmap != NULL) {
 				pmap_invalidate_all(pmap);
 				if (pmap != locked_pmap)
 					PMAP_UNLOCK(pmap);
 			}
 			pmap = pc->pc_pmap;
 			/* Avoid deadlock and lock recursion. */
 			if (pmap > locked_pmap) {
 				RELEASE_PV_LIST_LOCK(lockp);
 				PMAP_LOCK(pmap);
 			} else if (pmap != locked_pmap &&
 			    !PMAP_TRYLOCK(pmap)) {
 				pmap = NULL;
 				TAILQ_INSERT_TAIL(&new_tail, pc, pc_lru);
 				mtx_lock(&pv_chunks_mutex);
 				continue;
 			}
 			PG_G = pmap_global_bit(pmap);
 			PG_A = pmap_accessed_bit(pmap);
 			PG_M = pmap_modified_bit(pmap);
 			PG_RW = pmap_rw_bit(pmap);
 		}
 
 		/*
 		 * Destroy every non-wired, 4 KB page mapping in the chunk.
 		 */
 		freed = 0;
 		for (field = 0; field < _NPCM; field++) {
 			for (inuse = ~pc->pc_map[field] & pc_freemask[field];
 			    inuse != 0; inuse &= ~(1UL << bit)) {
 				bit = bsfq(inuse);
 				pv = &pc->pc_pventry[field * 64 + bit];
 				va = pv->pv_va;
 				pde = pmap_pde(pmap, va);
 				if ((*pde & PG_PS) != 0)
 					continue;
 				pte = pmap_pde_to_pte(pde, va);
 				if ((*pte & PG_W) != 0)
 					continue;
 				tpte = pte_load_clear(pte);
 				if ((tpte & PG_G) != 0)
 					pmap_invalidate_page(pmap, va);
 				m = PHYS_TO_VM_PAGE(tpte & PG_FRAME);
 				if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 					vm_page_dirty(m);
 				if ((tpte & PG_A) != 0)
 					vm_page_aflag_set(m, PGA_REFERENCED);
 				CHANGE_PV_LIST_LOCK_TO_VM_PAGE(lockp, m);
 				TAILQ_REMOVE(&m->md.pv_list, pv, pv_next);
 				m->md.pv_gen++;
 				if (TAILQ_EMPTY(&m->md.pv_list) &&
 				    (m->flags & PG_FICTITIOUS) == 0) {
 					pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 					if (TAILQ_EMPTY(&pvh->pv_list)) {
 						vm_page_aflag_clear(m,
 						    PGA_WRITEABLE);
 					}
 				}
 				pc->pc_map[field] |= 1UL << bit;
 				pmap_unuse_pt(pmap, va, *pde, &free);
 				freed++;
 			}
 		}
 		if (freed == 0) {
 			TAILQ_INSERT_TAIL(&new_tail, pc, pc_lru);
 			mtx_lock(&pv_chunks_mutex);
 			continue;
 		}
 		/* Every freed mapping is for a 4 KB page. */
 		pmap_resident_count_dec(pmap, freed);
 		PV_STAT(atomic_add_long(&pv_entry_frees, freed));
 		PV_STAT(atomic_add_int(&pv_entry_spare, freed));
 		PV_STAT(atomic_subtract_long(&pv_entry_count, freed));
 		TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 		if (pc->pc_map[0] == PC_FREE0 && pc->pc_map[1] == PC_FREE1 &&
 		    pc->pc_map[2] == PC_FREE2) {
 			PV_STAT(atomic_subtract_int(&pv_entry_spare, _NPCPV));
 			PV_STAT(atomic_subtract_int(&pc_chunk_count, 1));
 			PV_STAT(atomic_add_int(&pc_chunk_frees, 1));
 			/* Entire chunk is free; return it. */
 			m_pc = PHYS_TO_VM_PAGE(DMAP_TO_PHYS((vm_offset_t)pc));
 			dump_drop_page(m_pc->phys_addr);
 			mtx_lock(&pv_chunks_mutex);
 			break;
 		}
 		TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list);
 		TAILQ_INSERT_TAIL(&new_tail, pc, pc_lru);
 		mtx_lock(&pv_chunks_mutex);
 		/* One freed pv entry in locked_pmap is sufficient. */
 		if (pmap == locked_pmap)
 			break;
 	}
 	TAILQ_CONCAT(&pv_chunks, &new_tail, pc_lru);
 	mtx_unlock(&pv_chunks_mutex);
 	if (pmap != NULL) {
 		pmap_invalidate_all(pmap);
 		if (pmap != locked_pmap)
 			PMAP_UNLOCK(pmap);
 	}
 	if (m_pc == NULL && !SLIST_EMPTY(&free)) {
 		m_pc = SLIST_FIRST(&free);
 		SLIST_REMOVE_HEAD(&free, plinks.s.ss);
 		/* Recycle a freed page table page. */
 		m_pc->wire_count = 1;
 		atomic_add_int(&vm_cnt.v_wire_count, 1);
 	}
 	pmap_free_zero_pages(&free);
 	return (m_pc);
 }
 
 /*
  * free the pv_entry back to the free list
  */
 static void
 free_pv_entry(pmap_t pmap, pv_entry_t pv)
 {
 	struct pv_chunk *pc;
 	int idx, field, bit;
 
 	rw_assert(&pvh_global_lock, RA_LOCKED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	PV_STAT(atomic_add_long(&pv_entry_frees, 1));
 	PV_STAT(atomic_add_int(&pv_entry_spare, 1));
 	PV_STAT(atomic_subtract_long(&pv_entry_count, 1));
 	pc = pv_to_chunk(pv);
 	idx = pv - &pc->pc_pventry[0];
 	field = idx / 64;
 	bit = idx % 64;
 	pc->pc_map[field] |= 1ul << bit;
 	if (pc->pc_map[0] != PC_FREE0 || pc->pc_map[1] != PC_FREE1 ||
 	    pc->pc_map[2] != PC_FREE2) {
 		/* 98% of the time, pc is already at the head of the list. */
 		if (__predict_false(pc != TAILQ_FIRST(&pmap->pm_pvchunk))) {
 			TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 			TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list);
 		}
 		return;
 	}
 	TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 	free_pv_chunk(pc);
 }
 
 static void
 free_pv_chunk(struct pv_chunk *pc)
 {
 	vm_page_t m;
 
 	mtx_lock(&pv_chunks_mutex);
  	TAILQ_REMOVE(&pv_chunks, pc, pc_lru);
 	mtx_unlock(&pv_chunks_mutex);
 	PV_STAT(atomic_subtract_int(&pv_entry_spare, _NPCPV));
 	PV_STAT(atomic_subtract_int(&pc_chunk_count, 1));
 	PV_STAT(atomic_add_int(&pc_chunk_frees, 1));
 	/* entire chunk is free, return it */
 	m = PHYS_TO_VM_PAGE(DMAP_TO_PHYS((vm_offset_t)pc));
 	dump_drop_page(m->phys_addr);
 	vm_page_unwire(m, PQ_INACTIVE);
 	vm_page_free(m);
 }
 
 /*
  * Returns a new PV entry, allocating a new PV chunk from the system when
  * needed.  If this PV chunk allocation fails and a PV list lock pointer was
  * given, a PV chunk is reclaimed from an arbitrary pmap.  Otherwise, NULL is
  * returned.
  *
  * The given PV list lock may be released.
  */
 static pv_entry_t
 get_pv_entry(pmap_t pmap, struct rwlock **lockp)
 {
 	int bit, field;
 	pv_entry_t pv;
 	struct pv_chunk *pc;
 	vm_page_t m;
 
 	rw_assert(&pvh_global_lock, RA_LOCKED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	PV_STAT(atomic_add_long(&pv_entry_allocs, 1));
 retry:
 	pc = TAILQ_FIRST(&pmap->pm_pvchunk);
 	if (pc != NULL) {
 		for (field = 0; field < _NPCM; field++) {
 			if (pc->pc_map[field]) {
 				bit = bsfq(pc->pc_map[field]);
 				break;
 			}
 		}
 		if (field < _NPCM) {
 			pv = &pc->pc_pventry[field * 64 + bit];
 			pc->pc_map[field] &= ~(1ul << bit);
 			/* If this was the last item, move it to tail */
 			if (pc->pc_map[0] == 0 && pc->pc_map[1] == 0 &&
 			    pc->pc_map[2] == 0) {
 				TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 				TAILQ_INSERT_TAIL(&pmap->pm_pvchunk, pc,
 				    pc_list);
 			}
 			PV_STAT(atomic_add_long(&pv_entry_count, 1));
 			PV_STAT(atomic_subtract_int(&pv_entry_spare, 1));
 			return (pv);
 		}
 	}
 	/* No free items, allocate another chunk */
 	m = vm_page_alloc(NULL, 0, VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ |
 	    VM_ALLOC_WIRED);
 	if (m == NULL) {
 		if (lockp == NULL) {
 			PV_STAT(pc_chunk_tryfail++);
 			return (NULL);
 		}
 		m = reclaim_pv_chunk(pmap, lockp);
 		if (m == NULL)
 			goto retry;
 	}
 	PV_STAT(atomic_add_int(&pc_chunk_count, 1));
 	PV_STAT(atomic_add_int(&pc_chunk_allocs, 1));
 	dump_add_page(m->phys_addr);
 	pc = (void *)PHYS_TO_DMAP(m->phys_addr);
 	pc->pc_pmap = pmap;
 	pc->pc_map[0] = PC_FREE0 & ~1ul;	/* preallocated bit 0 */
 	pc->pc_map[1] = PC_FREE1;
 	pc->pc_map[2] = PC_FREE2;
 	mtx_lock(&pv_chunks_mutex);
 	TAILQ_INSERT_TAIL(&pv_chunks, pc, pc_lru);
 	mtx_unlock(&pv_chunks_mutex);
 	pv = &pc->pc_pventry[0];
 	TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list);
 	PV_STAT(atomic_add_long(&pv_entry_count, 1));
 	PV_STAT(atomic_add_int(&pv_entry_spare, _NPCPV - 1));
 	return (pv);
 }
 
 /*
  * Returns the number of one bits within the given PV chunk map element.
  *
  * The errata for Intel processors state that "POPCNT Instruction May
  * Take Longer to Execute Than Expected".  It is believed that the
  * issue is the spurious dependency on the destination register.
  * Provide a hint to the register rename logic that the destination
  * value is overwritten, by clearing it, as suggested in the
  * optimization manual.  It should be cheap for unaffected processors
  * as well.
  *
  * Reference numbers for the errata are:
  * 4th Gen Core: HSD146
  * 5th Gen Core: BDM85
  */
 static int
 popcnt_pc_map_elem_pq(uint64_t elem)
 {
 	u_long result;
 
 	__asm __volatile("xorl %k0,%k0;popcntq %1,%0"
 	    : "=&r" (result) : "rm" (elem));
 	return (result);
 }
 
 /*
  * Ensure that the number of spare PV entries in the specified pmap meets or
  * exceeds the given count, "needed".
  *
  * The given PV list lock may be released.
  */
 static void
 reserve_pv_entries(pmap_t pmap, int needed, struct rwlock **lockp)
 {
 	struct pch new_tail;
 	struct pv_chunk *pc;
 	int avail, free;
 	vm_page_t m;
 
 	rw_assert(&pvh_global_lock, RA_LOCKED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	KASSERT(lockp != NULL, ("reserve_pv_entries: lockp is NULL"));
 
 	/*
 	 * Newly allocated PV chunks must be stored in a private list until
 	 * the required number of PV chunks have been allocated.  Otherwise,
 	 * reclaim_pv_chunk() could recycle one of these chunks.  In
 	 * contrast, these chunks must be added to the pmap upon allocation.
 	 */
 	TAILQ_INIT(&new_tail);
 retry:
 	avail = 0;
 	TAILQ_FOREACH(pc, &pmap->pm_pvchunk, pc_list) {
 		if ((cpu_feature2 & CPUID2_POPCNT) == 0) {
 			free = bitcount64(pc->pc_map[0]);
 			free += bitcount64(pc->pc_map[1]);
 			free += bitcount64(pc->pc_map[2]);
 		} else {
 			free = popcnt_pc_map_elem_pq(pc->pc_map[0]);
 			free += popcnt_pc_map_elem_pq(pc->pc_map[1]);
 			free += popcnt_pc_map_elem_pq(pc->pc_map[2]);
 		}
 		if (free == 0)
 			break;
 		avail += free;
 		if (avail >= needed)
 			break;
 	}
 	for (; avail < needed; avail += _NPCPV) {
 		m = vm_page_alloc(NULL, 0, VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ |
 		    VM_ALLOC_WIRED);
 		if (m == NULL) {
 			m = reclaim_pv_chunk(pmap, lockp);
 			if (m == NULL)
 				goto retry;
 		}
 		PV_STAT(atomic_add_int(&pc_chunk_count, 1));
 		PV_STAT(atomic_add_int(&pc_chunk_allocs, 1));
 		dump_add_page(m->phys_addr);
 		pc = (void *)PHYS_TO_DMAP(m->phys_addr);
 		pc->pc_pmap = pmap;
 		pc->pc_map[0] = PC_FREE0;
 		pc->pc_map[1] = PC_FREE1;
 		pc->pc_map[2] = PC_FREE2;
 		TAILQ_INSERT_HEAD(&pmap->pm_pvchunk, pc, pc_list);
 		TAILQ_INSERT_TAIL(&new_tail, pc, pc_lru);
 		PV_STAT(atomic_add_int(&pv_entry_spare, _NPCPV));
 	}
 	if (!TAILQ_EMPTY(&new_tail)) {
 		mtx_lock(&pv_chunks_mutex);
 		TAILQ_CONCAT(&pv_chunks, &new_tail, pc_lru);
 		mtx_unlock(&pv_chunks_mutex);
 	}
 }
 
 /*
  * First find and then remove the pv entry for the specified pmap and virtual
  * address from the specified pv list.  Returns the pv entry if found and NULL
  * otherwise.  This operation can be performed on pv lists for either 4KB or
  * 2MB page mappings.
  */
 static __inline pv_entry_t
 pmap_pvh_remove(struct md_page *pvh, pmap_t pmap, vm_offset_t va)
 {
 	pv_entry_t pv;
 
 	rw_assert(&pvh_global_lock, RA_LOCKED);
 	TAILQ_FOREACH(pv, &pvh->pv_list, pv_next) {
 		if (pmap == PV_PMAP(pv) && va == pv->pv_va) {
 			TAILQ_REMOVE(&pvh->pv_list, pv, pv_next);
 			pvh->pv_gen++;
 			break;
 		}
 	}
 	return (pv);
 }
 
 /*
  * After demotion from a 2MB page mapping to 512 4KB page mappings,
  * destroy the pv entry for the 2MB page mapping and reinstantiate the pv
  * entries for each of the 4KB page mappings.
  */
 static void
 pmap_pv_demote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa,
     struct rwlock **lockp)
 {
 	struct md_page *pvh;
 	struct pv_chunk *pc;
 	pv_entry_t pv;
 	vm_offset_t va_last;
 	vm_page_t m;
 	int bit, field;
 
 	rw_assert(&pvh_global_lock, RA_LOCKED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	KASSERT((pa & PDRMASK) == 0,
 	    ("pmap_pv_demote_pde: pa is not 2mpage aligned"));
 	CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, pa);
 
 	/*
 	 * Transfer the 2mpage's pv entry for this mapping to the first
 	 * page's pv list.  Once this transfer begins, the pv list lock
 	 * must not be released until the last pv entry is reinstantiated.
 	 */
 	pvh = pa_to_pvh(pa);
 	va = trunc_2mpage(va);
 	pv = pmap_pvh_remove(pvh, pmap, va);
 	KASSERT(pv != NULL, ("pmap_pv_demote_pde: pv not found"));
 	m = PHYS_TO_VM_PAGE(pa);
 	TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_next);
 	m->md.pv_gen++;
 	/* Instantiate the remaining NPTEPG - 1 pv entries. */
 	PV_STAT(atomic_add_long(&pv_entry_allocs, NPTEPG - 1));
 	va_last = va + NBPDR - PAGE_SIZE;
 	for (;;) {
 		pc = TAILQ_FIRST(&pmap->pm_pvchunk);
 		KASSERT(pc->pc_map[0] != 0 || pc->pc_map[1] != 0 ||
 		    pc->pc_map[2] != 0, ("pmap_pv_demote_pde: missing spare"));
 		for (field = 0; field < _NPCM; field++) {
 			while (pc->pc_map[field]) {
 				bit = bsfq(pc->pc_map[field]);
 				pc->pc_map[field] &= ~(1ul << bit);
 				pv = &pc->pc_pventry[field * 64 + bit];
 				va += PAGE_SIZE;
 				pv->pv_va = va;
 				m++;
 				KASSERT((m->oflags & VPO_UNMANAGED) == 0,
 			    ("pmap_pv_demote_pde: page %p is not managed", m));
 				TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_next);
 				m->md.pv_gen++;
 				if (va == va_last)
 					goto out;
 			}
 		}
 		TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 		TAILQ_INSERT_TAIL(&pmap->pm_pvchunk, pc, pc_list);
 	}
 out:
 	if (pc->pc_map[0] == 0 && pc->pc_map[1] == 0 && pc->pc_map[2] == 0) {
 		TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 		TAILQ_INSERT_TAIL(&pmap->pm_pvchunk, pc, pc_list);
 	}
 	PV_STAT(atomic_add_long(&pv_entry_count, NPTEPG - 1));
 	PV_STAT(atomic_subtract_int(&pv_entry_spare, NPTEPG - 1));
 }
 
 /*
  * After promotion from 512 4KB page mappings to a single 2MB page mapping,
  * replace the many pv entries for the 4KB page mappings by a single pv entry
  * for the 2MB page mapping.
  */
 static void
 pmap_pv_promote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa,
     struct rwlock **lockp)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 	vm_offset_t va_last;
 	vm_page_t m;
 
 	rw_assert(&pvh_global_lock, RA_LOCKED);
 	KASSERT((pa & PDRMASK) == 0,
 	    ("pmap_pv_promote_pde: pa is not 2mpage aligned"));
 	CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, pa);
 
 	/*
 	 * Transfer the first page's pv entry for this mapping to the 2mpage's
 	 * pv list.  Aside from avoiding the cost of a call to get_pv_entry(),
 	 * a transfer avoids the possibility that get_pv_entry() calls
 	 * reclaim_pv_chunk() and that reclaim_pv_chunk() removes one of the
 	 * mappings that is being promoted.
 	 */
 	m = PHYS_TO_VM_PAGE(pa);
 	va = trunc_2mpage(va);
 	pv = pmap_pvh_remove(&m->md, pmap, va);
 	KASSERT(pv != NULL, ("pmap_pv_promote_pde: pv not found"));
 	pvh = pa_to_pvh(pa);
 	TAILQ_INSERT_TAIL(&pvh->pv_list, pv, pv_next);
 	pvh->pv_gen++;
 	/* Free the remaining NPTEPG - 1 pv entries. */
 	va_last = va + NBPDR - PAGE_SIZE;
 	do {
 		m++;
 		va += PAGE_SIZE;
 		pmap_pvh_free(&m->md, pmap, va);
 	} while (va < va_last);
 }
 
 /*
  * First find and then destroy the pv entry for the specified pmap and virtual
  * address.  This operation can be performed on pv lists for either 4KB or 2MB
  * page mappings.
  */
 static void
 pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va)
 {
 	pv_entry_t pv;
 
 	pv = pmap_pvh_remove(pvh, pmap, va);
 	KASSERT(pv != NULL, ("pmap_pvh_free: pv not found"));
 	free_pv_entry(pmap, pv);
 }
 
 /*
  * Conditionally create the PV entry for a 4KB page mapping if the required
  * memory can be allocated without resorting to reclamation.
  */
 static boolean_t
 pmap_try_insert_pv_entry(pmap_t pmap, vm_offset_t va, vm_page_t m,
     struct rwlock **lockp)
 {
 	pv_entry_t pv;
 
 	rw_assert(&pvh_global_lock, RA_LOCKED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	/* Pass NULL instead of the lock pointer to disable reclamation. */
 	if ((pv = get_pv_entry(pmap, NULL)) != NULL) {
 		pv->pv_va = va;
 		CHANGE_PV_LIST_LOCK_TO_VM_PAGE(lockp, m);
 		TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_next);
 		m->md.pv_gen++;
 		return (TRUE);
 	} else
 		return (FALSE);
 }
 
 /*
  * Conditionally create the PV entry for a 2MB page mapping if the required
  * memory can be allocated without resorting to reclamation.
  */
 static boolean_t
 pmap_pv_insert_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa,
     struct rwlock **lockp)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 
 	rw_assert(&pvh_global_lock, RA_LOCKED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	/* Pass NULL instead of the lock pointer to disable reclamation. */
 	if ((pv = get_pv_entry(pmap, NULL)) != NULL) {
 		pv->pv_va = va;
 		CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, pa);
 		pvh = pa_to_pvh(pa);
 		TAILQ_INSERT_TAIL(&pvh->pv_list, pv, pv_next);
 		pvh->pv_gen++;
 		return (TRUE);
 	} else
 		return (FALSE);
 }
 
 /*
  * Fills a page table page with mappings to consecutive physical pages.
  */
 static void
 pmap_fill_ptp(pt_entry_t *firstpte, pt_entry_t newpte)
 {
 	pt_entry_t *pte;
 
 	for (pte = firstpte; pte < firstpte + NPTEPG; pte++) {
 		*pte = newpte;
 		newpte += PAGE_SIZE;
 	}
 }
 
 /*
  * Tries to demote a 2MB page mapping.  If demotion fails, the 2MB page
  * mapping is invalidated.
  */
 static boolean_t
 pmap_demote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va)
 {
 	struct rwlock *lock;
 	boolean_t rv;
 
 	lock = NULL;
 	rv = pmap_demote_pde_locked(pmap, pde, va, &lock);
 	if (lock != NULL)
 		rw_wunlock(lock);
 	return (rv);
 }
 
 static boolean_t
 pmap_demote_pde_locked(pmap_t pmap, pd_entry_t *pde, vm_offset_t va,
     struct rwlock **lockp)
 {
 	pd_entry_t newpde, oldpde;
 	pt_entry_t *firstpte, newpte;
 	pt_entry_t PG_A, PG_G, PG_M, PG_RW, PG_V;
 	vm_paddr_t mptepa;
 	vm_page_t mpte;
 	struct spglist free;
 	int PG_PTE_CACHE;
 
 	PG_G = pmap_global_bit(pmap);
 	PG_A = pmap_accessed_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 	PG_PTE_CACHE = pmap_cache_mask(pmap, 0);
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	oldpde = *pde;
 	KASSERT((oldpde & (PG_PS | PG_V)) == (PG_PS | PG_V),
 	    ("pmap_demote_pde: oldpde is missing PG_PS and/or PG_V"));
 	if ((oldpde & PG_A) != 0 && (mpte = pmap_lookup_pt_page(pmap, va)) !=
 	    NULL)
 		pmap_remove_pt_page(pmap, mpte);
 	else {
 		KASSERT((oldpde & PG_W) == 0,
 		    ("pmap_demote_pde: page table page for a wired mapping"
 		    " is missing"));
 
 		/*
 		 * Invalidate the 2MB page mapping and return "failure" if the
 		 * mapping was never accessed or the allocation of the new
 		 * page table page fails.  If the 2MB page mapping belongs to
 		 * the direct map region of the kernel's address space, then
 		 * the page allocation request specifies the highest possible
 		 * priority (VM_ALLOC_INTERRUPT).  Otherwise, the priority is
 		 * normal.  Page table pages are preallocated for every other
 		 * part of the kernel address space, so the direct map region
 		 * is the only part of the kernel address space that must be
 		 * handled here.
 		 */
 		if ((oldpde & PG_A) == 0 || (mpte = vm_page_alloc(NULL,
 		    pmap_pde_pindex(va), (va >= DMAP_MIN_ADDRESS && va <
 		    DMAP_MAX_ADDRESS ? VM_ALLOC_INTERRUPT : VM_ALLOC_NORMAL) |
 		    VM_ALLOC_NOOBJ | VM_ALLOC_WIRED)) == NULL) {
 			SLIST_INIT(&free);
 			pmap_remove_pde(pmap, pde, trunc_2mpage(va), &free,
 			    lockp);
 			pmap_invalidate_page(pmap, trunc_2mpage(va));
 			pmap_free_zero_pages(&free);
 			CTR2(KTR_PMAP, "pmap_demote_pde: failure for va %#lx"
 			    " in pmap %p", va, pmap);
 			return (FALSE);
 		}
 		if (va < VM_MAXUSER_ADDRESS)
 			pmap_resident_count_inc(pmap, 1);
 	}
 	mptepa = VM_PAGE_TO_PHYS(mpte);
 	firstpte = (pt_entry_t *)PHYS_TO_DMAP(mptepa);
 	newpde = mptepa | PG_M | PG_A | (oldpde & PG_U) | PG_RW | PG_V;
 	KASSERT((oldpde & PG_A) != 0,
 	    ("pmap_demote_pde: oldpde is missing PG_A"));
 	KASSERT((oldpde & (PG_M | PG_RW)) != PG_RW,
 	    ("pmap_demote_pde: oldpde is missing PG_M"));
 	newpte = oldpde & ~PG_PS;
 	newpte = pmap_swap_pat(pmap, newpte);
 
 	/*
 	 * If the page table page is new, initialize it.
 	 */
 	if (mpte->wire_count == 1) {
 		mpte->wire_count = NPTEPG;
 		pmap_fill_ptp(firstpte, newpte);
 	}
 	KASSERT((*firstpte & PG_FRAME) == (newpte & PG_FRAME),
 	    ("pmap_demote_pde: firstpte and newpte map different physical"
 	    " addresses"));
 
 	/*
 	 * If the mapping has changed attributes, update the page table
 	 * entries.
 	 */
 	if ((*firstpte & PG_PTE_PROMOTE) != (newpte & PG_PTE_PROMOTE))
 		pmap_fill_ptp(firstpte, newpte);
 
 	/*
 	 * The spare PV entries must be reserved prior to demoting the
 	 * mapping, that is, prior to changing the PDE.  Otherwise, the state
 	 * of the PDE and the PV lists will be inconsistent, which can result
 	 * in reclaim_pv_chunk() attempting to remove a PV entry from the
 	 * wrong PV list and pmap_pv_demote_pde() failing to find the expected
 	 * PV entry for the 2MB page mapping that is being demoted.
 	 */
 	if ((oldpde & PG_MANAGED) != 0)
 		reserve_pv_entries(pmap, NPTEPG - 1, lockp);
 
 	/*
 	 * Demote the mapping.  This pmap is locked.  The old PDE has
 	 * PG_A set.  If the old PDE has PG_RW set, it also has PG_M
 	 * set.  Thus, there is no danger of a race with another
 	 * processor changing the setting of PG_A and/or PG_M between
 	 * the read above and the store below. 
 	 */
 	if (workaround_erratum383)
 		pmap_update_pde(pmap, va, pde, newpde);
 	else
 		pde_store(pde, newpde);
 
 	/*
 	 * Invalidate a stale recursive mapping of the page table page.
 	 */
 	if (va >= VM_MAXUSER_ADDRESS)
 		pmap_invalidate_page(pmap, (vm_offset_t)vtopte(va));
 
 	/*
 	 * Demote the PV entry.
 	 */
 	if ((oldpde & PG_MANAGED) != 0)
 		pmap_pv_demote_pde(pmap, va, oldpde & PG_PS_FRAME, lockp);
 
 	atomic_add_long(&pmap_pde_demotions, 1);
 	CTR2(KTR_PMAP, "pmap_demote_pde: success for va %#lx"
 	    " in pmap %p", va, pmap);
 	return (TRUE);
 }
 
 /*
  * pmap_remove_kernel_pde: Remove a kernel superpage mapping.
  */
 static void
 pmap_remove_kernel_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va)
 {
 	pd_entry_t newpde;
 	vm_paddr_t mptepa;
 	vm_page_t mpte;
 
 	KASSERT(pmap == kernel_pmap, ("pmap %p is not kernel_pmap", pmap));
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	mpte = pmap_lookup_pt_page(pmap, va);
 	if (mpte == NULL)
 		panic("pmap_remove_kernel_pde: Missing pt page.");
 
 	pmap_remove_pt_page(pmap, mpte);
 	mptepa = VM_PAGE_TO_PHYS(mpte);
 	newpde = mptepa | X86_PG_M | X86_PG_A | X86_PG_RW | X86_PG_V;
 
 	/*
 	 * Initialize the page table page.
 	 */
 	pagezero((void *)PHYS_TO_DMAP(mptepa));
 
 	/*
 	 * Demote the mapping.
 	 */
 	if (workaround_erratum383)
 		pmap_update_pde(pmap, va, pde, newpde);
 	else
 		pde_store(pde, newpde);
 
 	/*
 	 * Invalidate a stale recursive mapping of the page table page.
 	 */
 	pmap_invalidate_page(pmap, (vm_offset_t)vtopte(va));
 }
 
 /*
  * pmap_remove_pde: do the things to unmap a superpage in a process
  */
 static int
 pmap_remove_pde(pmap_t pmap, pd_entry_t *pdq, vm_offset_t sva,
     struct spglist *free, struct rwlock **lockp)
 {
 	struct md_page *pvh;
 	pd_entry_t oldpde;
 	vm_offset_t eva, va;
 	vm_page_t m, mpte;
 	pt_entry_t PG_G, PG_A, PG_M, PG_RW;
 
 	PG_G = pmap_global_bit(pmap);
 	PG_A = pmap_accessed_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	KASSERT((sva & PDRMASK) == 0,
 	    ("pmap_remove_pde: sva is not 2mpage aligned"));
 	oldpde = pte_load_clear(pdq);
 	if (oldpde & PG_W)
 		pmap->pm_stats.wired_count -= NBPDR / PAGE_SIZE;
 
 	/*
 	 * Machines that don't support invlpg also don't support
 	 * PG_G.
 	 */
 	if (oldpde & PG_G)
 		pmap_invalidate_page(kernel_pmap, sva);
 	pmap_resident_count_dec(pmap, NBPDR / PAGE_SIZE);
 	if (oldpde & PG_MANAGED) {
 		CHANGE_PV_LIST_LOCK_TO_PHYS(lockp, oldpde & PG_PS_FRAME);
 		pvh = pa_to_pvh(oldpde & PG_PS_FRAME);
 		pmap_pvh_free(pvh, pmap, sva);
 		eva = sva + NBPDR;
 		for (va = sva, m = PHYS_TO_VM_PAGE(oldpde & PG_PS_FRAME);
 		    va < eva; va += PAGE_SIZE, m++) {
 			if ((oldpde & (PG_M | PG_RW)) == (PG_M | PG_RW))
 				vm_page_dirty(m);
 			if (oldpde & PG_A)
 				vm_page_aflag_set(m, PGA_REFERENCED);
 			if (TAILQ_EMPTY(&m->md.pv_list) &&
 			    TAILQ_EMPTY(&pvh->pv_list))
 				vm_page_aflag_clear(m, PGA_WRITEABLE);
 		}
 	}
 	if (pmap == kernel_pmap) {
 		pmap_remove_kernel_pde(pmap, pdq, sva);
 	} else {
 		mpte = pmap_lookup_pt_page(pmap, sva);
 		if (mpte != NULL) {
 			pmap_remove_pt_page(pmap, mpte);
 			pmap_resident_count_dec(pmap, 1);
 			KASSERT(mpte->wire_count == NPTEPG,
 			    ("pmap_remove_pde: pte page wire count error"));
 			mpte->wire_count = 0;
 			pmap_add_delayed_free_list(mpte, free, FALSE);
 			atomic_subtract_int(&vm_cnt.v_wire_count, 1);
 		}
 	}
 	return (pmap_unuse_pt(pmap, sva, *pmap_pdpe(pmap, sva), free));
 }
 
 /*
  * pmap_remove_pte: do the things to unmap a page in a process
  */
 static int
 pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t va, 
     pd_entry_t ptepde, struct spglist *free, struct rwlock **lockp)
 {
 	struct md_page *pvh;
 	pt_entry_t oldpte, PG_A, PG_M, PG_RW;
 	vm_page_t m;
 
 	PG_A = pmap_accessed_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	oldpte = pte_load_clear(ptq);
 	if (oldpte & PG_W)
 		pmap->pm_stats.wired_count -= 1;
 	pmap_resident_count_dec(pmap, 1);
 	if (oldpte & PG_MANAGED) {
 		m = PHYS_TO_VM_PAGE(oldpte & PG_FRAME);
 		if ((oldpte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 			vm_page_dirty(m);
 		if (oldpte & PG_A)
 			vm_page_aflag_set(m, PGA_REFERENCED);
 		CHANGE_PV_LIST_LOCK_TO_VM_PAGE(lockp, m);
 		pmap_pvh_free(&m->md, pmap, va);
 		if (TAILQ_EMPTY(&m->md.pv_list) &&
 		    (m->flags & PG_FICTITIOUS) == 0) {
 			pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 			if (TAILQ_EMPTY(&pvh->pv_list))
 				vm_page_aflag_clear(m, PGA_WRITEABLE);
 		}
 	}
 	return (pmap_unuse_pt(pmap, va, ptepde, free));
 }
 
 /*
  * Remove a single page from a process address space
  */
 static void
 pmap_remove_page(pmap_t pmap, vm_offset_t va, pd_entry_t *pde,
     struct spglist *free)
 {
 	struct rwlock *lock;
 	pt_entry_t *pte, PG_V;
 
 	PG_V = pmap_valid_bit(pmap);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	if ((*pde & PG_V) == 0)
 		return;
 	pte = pmap_pde_to_pte(pde, va);
 	if ((*pte & PG_V) == 0)
 		return;
 	lock = NULL;
 	pmap_remove_pte(pmap, pte, va, *pde, free, &lock);
 	if (lock != NULL)
 		rw_wunlock(lock);
 	pmap_invalidate_page(pmap, va);
 }
 
 /*
  *	Remove the given range of addresses from the specified map.
  *
  *	It is assumed that the start and end are properly
  *	rounded to the page size.
  */
 void
 pmap_remove(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
 	struct rwlock *lock;
 	vm_offset_t va, va_next;
 	pml4_entry_t *pml4e;
 	pdp_entry_t *pdpe;
 	pd_entry_t ptpaddr, *pde;
 	pt_entry_t *pte, PG_G, PG_V;
 	struct spglist free;
 	int anyvalid;
 
 	PG_G = pmap_global_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 
 	/*
 	 * Perform an unsynchronized read.  This is, however, safe.
 	 */
 	if (pmap->pm_stats.resident_count == 0)
 		return;
 
 	anyvalid = 0;
 	SLIST_INIT(&free);
 
 	rw_rlock(&pvh_global_lock);
 	PMAP_LOCK(pmap);
 
 	/*
 	 * Special-case the removal of a single page.  This is a very
 	 * common operation, and it lets us short-circuit most of the
 	 * code below.
 	 */
 	if (sva + PAGE_SIZE == eva) {
 		pde = pmap_pde(pmap, sva);
 		if (pde && (*pde & PG_PS) == 0) {
 			pmap_remove_page(pmap, sva, pde, &free);
 			goto out;
 		}
 	}
 
 	lock = NULL;
 	for (; sva < eva; sva = va_next) {
 
 		if (pmap->pm_stats.resident_count == 0)
 			break;
 
 		pml4e = pmap_pml4e(pmap, sva);
 		if ((*pml4e & PG_V) == 0) {
 			va_next = (sva + NBPML4) & ~PML4MASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 
 		pdpe = pmap_pml4e_to_pdpe(pml4e, sva);
 		if ((*pdpe & PG_V) == 0) {
 			va_next = (sva + NBPDP) & ~PDPMASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 
 		/*
 		 * Calculate index for next page table.
 		 */
 		va_next = (sva + NBPDR) & ~PDRMASK;
 		if (va_next < sva)
 			va_next = eva;
 
 		pde = pmap_pdpe_to_pde(pdpe, sva);
 		ptpaddr = *pde;
 
 		/*
 		 * Weed out invalid mappings.
 		 */
 		if (ptpaddr == 0)
 			continue;
 
 		/*
 		 * Check for large page.
 		 */
 		if ((ptpaddr & PG_PS) != 0) {
 			/*
 			 * Are we removing the entire large page?  If not,
 			 * demote the mapping and fall through.
 			 */
 			if (sva + NBPDR == va_next && eva >= va_next) {
 				/*
 				 * The TLB entry for a PG_G mapping is
 				 * invalidated by pmap_remove_pde().
 				 */
 				if ((ptpaddr & PG_G) == 0)
 					anyvalid = 1;
 				pmap_remove_pde(pmap, pde, sva, &free, &lock);
 				continue;
 			} else if (!pmap_demote_pde_locked(pmap, pde, sva,
 			    &lock)) {
 				/* The large page mapping was destroyed. */
 				continue;
 			} else
 				ptpaddr = *pde;
 		}
 
 		/*
 		 * Limit our scan to either the end of the va represented
 		 * by the current page table page, or to the end of the
 		 * range being removed.
 		 */
 		if (va_next > eva)
 			va_next = eva;
 
 		va = va_next;
 		for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; pte++,
 		    sva += PAGE_SIZE) {
 			if (*pte == 0) {
 				if (va != va_next) {
 					pmap_invalidate_range(pmap, va, sva);
 					va = va_next;
 				}
 				continue;
 			}
 			if ((*pte & PG_G) == 0)
 				anyvalid = 1;
 			else if (va == va_next)
 				va = sva;
 			if (pmap_remove_pte(pmap, pte, sva, ptpaddr, &free,
 			    &lock)) {
 				sva += PAGE_SIZE;
 				break;
 			}
 		}
 		if (va != va_next)
 			pmap_invalidate_range(pmap, va, sva);
 	}
 	if (lock != NULL)
 		rw_wunlock(lock);
 out:
 	if (anyvalid)
 		pmap_invalidate_all(pmap);
 	rw_runlock(&pvh_global_lock);
 	PMAP_UNLOCK(pmap);
 	pmap_free_zero_pages(&free);
 }
 
 /*
  *	Routine:	pmap_remove_all
  *	Function:
  *		Removes this physical page from
  *		all physical maps in which it resides.
  *		Reflects back modify bits to the pager.
  *
  *	Notes:
  *		Original versions of this routine were very
  *		inefficient because they iteratively called
  *		pmap_remove(), which is slow.
  */
 
 void
 pmap_remove_all(vm_page_t m)
 {
 	struct md_page *pvh;
 	pv_entry_t pv;
 	pmap_t pmap;
 	pt_entry_t *pte, tpte, PG_A, PG_M, PG_RW;
 	pd_entry_t *pde;
 	vm_offset_t va;
 	struct spglist free;
 
 	KASSERT((m->oflags & VPO_UNMANAGED) == 0,
 	    ("pmap_remove_all: page %p is not managed", m));
 	SLIST_INIT(&free);
 	rw_wlock(&pvh_global_lock);
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		goto small_mappings;
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 	while ((pv = TAILQ_FIRST(&pvh->pv_list)) != NULL) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		va = pv->pv_va;
 		pde = pmap_pde(pmap, va);
 		(void)pmap_demote_pde(pmap, pde, va);
 		PMAP_UNLOCK(pmap);
 	}
 small_mappings:
 	while ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) {
 		pmap = PV_PMAP(pv);
 		PMAP_LOCK(pmap);
 		PG_A = pmap_accessed_bit(pmap);
 		PG_M = pmap_modified_bit(pmap);
 		PG_RW = pmap_rw_bit(pmap);
 		pmap_resident_count_dec(pmap, 1);
 		pde = pmap_pde(pmap, pv->pv_va);
 		KASSERT((*pde & PG_PS) == 0, ("pmap_remove_all: found"
 		    " a 2mpage in page %p's pv list", m));
 		pte = pmap_pde_to_pte(pde, pv->pv_va);
 		tpte = pte_load_clear(pte);
 		if (tpte & PG_W)
 			pmap->pm_stats.wired_count--;
 		if (tpte & PG_A)
 			vm_page_aflag_set(m, PGA_REFERENCED);
 
 		/*
 		 * Propagate the PTE's modified bit to the vm_page_t.
 		 */
 		if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 			vm_page_dirty(m);
 		pmap_unuse_pt(pmap, pv->pv_va, *pde, &free);
 		pmap_invalidate_page(pmap, pv->pv_va);
 		TAILQ_REMOVE(&m->md.pv_list, pv, pv_next);
 		m->md.pv_gen++;
 		free_pv_entry(pmap, pv);
 		PMAP_UNLOCK(pmap);
 	}
 	vm_page_aflag_clear(m, PGA_WRITEABLE);
 	rw_wunlock(&pvh_global_lock);
 	pmap_free_zero_pages(&free);
 }
 
 /*
  * pmap_protect_pde: apply a protection change to a 2MB page mapping
  */
 static boolean_t
 pmap_protect_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t sva, vm_prot_t prot)
 {
 	pd_entry_t newpde, oldpde;
 	vm_offset_t eva, va;
 	vm_page_t m;
 	boolean_t anychanged;
 	pt_entry_t PG_G, PG_M, PG_RW;
 
 	PG_G = pmap_global_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	KASSERT((sva & PDRMASK) == 0,
 	    ("pmap_protect_pde: sva is not 2mpage aligned"));
 	anychanged = FALSE;
 retry:
 	oldpde = newpde = *pde;
 	if (oldpde & PG_MANAGED) {
 		eva = sva + NBPDR;
 		for (va = sva, m = PHYS_TO_VM_PAGE(oldpde & PG_PS_FRAME);
 		    va < eva; va += PAGE_SIZE, m++)
 			if ((oldpde & (PG_M | PG_RW)) == (PG_M | PG_RW))
 				vm_page_dirty(m);
 	}
 	if ((prot & VM_PROT_WRITE) == 0)
 		newpde &= ~(PG_RW | PG_M);
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		newpde |= pg_nx;
 	if (newpde != oldpde) {
 		if (!atomic_cmpset_long(pde, oldpde, newpde))
 			goto retry;
 		if (oldpde & PG_G)
 			pmap_invalidate_page(pmap, sva);
 		else
 			anychanged = TRUE;
 	}
 	return (anychanged);
 }
 
 /*
  *	Set the physical protection on the
  *	specified range of this map as requested.
  */
 void
 pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot)
 {
 	vm_offset_t va_next;
 	pml4_entry_t *pml4e;
 	pdp_entry_t *pdpe;
 	pd_entry_t ptpaddr, *pde;
 	pt_entry_t *pte, PG_G, PG_M, PG_RW, PG_V;
 	boolean_t anychanged, pv_lists_locked;
 
 	KASSERT((prot & ~VM_PROT_ALL) == 0, ("invalid prot %x", prot));
 	if (prot == VM_PROT_NONE) {
 		pmap_remove(pmap, sva, eva);
 		return;
 	}
 
 	if ((prot & (VM_PROT_WRITE|VM_PROT_EXECUTE)) ==
 	    (VM_PROT_WRITE|VM_PROT_EXECUTE))
 		return;
 
 	PG_G = pmap_global_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 	pv_lists_locked = FALSE;
 resume:
 	anychanged = FALSE;
 
 	PMAP_LOCK(pmap);
 	for (; sva < eva; sva = va_next) {
 
 		pml4e = pmap_pml4e(pmap, sva);
 		if ((*pml4e & PG_V) == 0) {
 			va_next = (sva + NBPML4) & ~PML4MASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 
 		pdpe = pmap_pml4e_to_pdpe(pml4e, sva);
 		if ((*pdpe & PG_V) == 0) {
 			va_next = (sva + NBPDP) & ~PDPMASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 
 		va_next = (sva + NBPDR) & ~PDRMASK;
 		if (va_next < sva)
 			va_next = eva;
 
 		pde = pmap_pdpe_to_pde(pdpe, sva);
 		ptpaddr = *pde;
 
 		/*
 		 * Weed out invalid mappings.
 		 */
 		if (ptpaddr == 0)
 			continue;
 
 		/*
 		 * Check for large page.
 		 */
 		if ((ptpaddr & PG_PS) != 0) {
 			/*
 			 * Are we protecting the entire large page?  If not,
 			 * demote the mapping and fall through.
 			 */
 			if (sva + NBPDR == va_next && eva >= va_next) {
 				/*
 				 * The TLB entry for a PG_G mapping is
 				 * invalidated by pmap_protect_pde().
 				 */
 				if (pmap_protect_pde(pmap, pde, sva, prot))
 					anychanged = TRUE;
 				continue;
 			} else {
 				if (!pv_lists_locked) {
 					pv_lists_locked = TRUE;
 					if (!rw_try_rlock(&pvh_global_lock)) {
 						if (anychanged)
 							pmap_invalidate_all(
 							    pmap);
 						PMAP_UNLOCK(pmap);
 						rw_rlock(&pvh_global_lock);
 						goto resume;
 					}
 				}
 				if (!pmap_demote_pde(pmap, pde, sva)) {
 					/*
 					 * The large page mapping was
 					 * destroyed.
 					 */
 					continue;
 				}
 			}
 		}
 
 		if (va_next > eva)
 			va_next = eva;
 
 		for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; pte++,
 		    sva += PAGE_SIZE) {
 			pt_entry_t obits, pbits;
 			vm_page_t m;
 
 retry:
 			obits = pbits = *pte;
 			if ((pbits & PG_V) == 0)
 				continue;
 
 			if ((prot & VM_PROT_WRITE) == 0) {
 				if ((pbits & (PG_MANAGED | PG_M | PG_RW)) ==
 				    (PG_MANAGED | PG_M | PG_RW)) {
 					m = PHYS_TO_VM_PAGE(pbits & PG_FRAME);
 					vm_page_dirty(m);
 				}
 				pbits &= ~(PG_RW | PG_M);
 			}
 			if ((prot & VM_PROT_EXECUTE) == 0)
 				pbits |= pg_nx;
 
 			if (pbits != obits) {
 				if (!atomic_cmpset_long(pte, obits, pbits))
 					goto retry;
 				if (obits & PG_G)
 					pmap_invalidate_page(pmap, sva);
 				else
 					anychanged = TRUE;
 			}
 		}
 	}
 	if (anychanged)
 		pmap_invalidate_all(pmap);
 	if (pv_lists_locked)
 		rw_runlock(&pvh_global_lock);
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * Tries to promote the 512 contiguous 4KB page mappings that are within a
  * single page table page (PTP) to a single 2MB page mapping.  For promotion
  * to occur, two conditions must be met: (1) the 4KB page mappings must map
  * aligned, contiguous physical memory and (2) the 4KB page mappings must have
  * identical characteristics.
  */
 static void
 pmap_promote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va,
     struct rwlock **lockp)
 {
 	pd_entry_t newpde;
 	pt_entry_t *firstpte, oldpte, pa, *pte;
 	pt_entry_t PG_G, PG_A, PG_M, PG_RW, PG_V;
 	vm_offset_t oldpteva;
 	vm_page_t mpte;
 	int PG_PTE_CACHE;
 
 	PG_A = pmap_accessed_bit(pmap);
 	PG_G = pmap_global_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 	PG_PTE_CACHE = pmap_cache_mask(pmap, 0);
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	/*
 	 * Examine the first PTE in the specified PTP.  Abort if this PTE is
 	 * invalid, unused, or does not map the first 4KB physical page
 	 * within a 2MB page.
 	 */
 	firstpte = (pt_entry_t *)PHYS_TO_DMAP(*pde & PG_FRAME);
 setpde:
 	newpde = *firstpte;
 	if ((newpde & ((PG_FRAME & PDRMASK) | PG_A | PG_V)) != (PG_A | PG_V)) {
 		atomic_add_long(&pmap_pde_p_failures, 1);
 		CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#lx"
 		    " in pmap %p", va, pmap);
 		return;
 	}
 	if ((newpde & (PG_M | PG_RW)) == PG_RW) {
 		/*
 		 * When PG_M is already clear, PG_RW can be cleared without
 		 * a TLB invalidation.
 		 */
 		if (!atomic_cmpset_long(firstpte, newpde, newpde & ~PG_RW))
 			goto setpde;
 		newpde &= ~PG_RW;
 	}
 
 	/*
 	 * Examine each of the other PTEs in the specified PTP.  Abort if this
 	 * PTE maps an unexpected 4KB physical page or does not have identical
 	 * characteristics to the first PTE.
 	 */
 	pa = (newpde & (PG_PS_FRAME | PG_A | PG_V)) + NBPDR - PAGE_SIZE;
 	for (pte = firstpte + NPTEPG - 1; pte > firstpte; pte--) {
 setpte:
 		oldpte = *pte;
 		if ((oldpte & (PG_FRAME | PG_A | PG_V)) != pa) {
 			atomic_add_long(&pmap_pde_p_failures, 1);
 			CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#lx"
 			    " in pmap %p", va, pmap);
 			return;
 		}
 		if ((oldpte & (PG_M | PG_RW)) == PG_RW) {
 			/*
 			 * When PG_M is already clear, PG_RW can be cleared
 			 * without a TLB invalidation.
 			 */
 			if (!atomic_cmpset_long(pte, oldpte, oldpte & ~PG_RW))
 				goto setpte;
 			oldpte &= ~PG_RW;
 			oldpteva = (oldpte & PG_FRAME & PDRMASK) |
 			    (va & ~PDRMASK);
 			CTR2(KTR_PMAP, "pmap_promote_pde: protect for va %#lx"
 			    " in pmap %p", oldpteva, pmap);
 		}
 		if ((oldpte & PG_PTE_PROMOTE) != (newpde & PG_PTE_PROMOTE)) {
 			atomic_add_long(&pmap_pde_p_failures, 1);
 			CTR2(KTR_PMAP, "pmap_promote_pde: failure for va %#lx"
 			    " in pmap %p", va, pmap);
 			return;
 		}
 		pa -= PAGE_SIZE;
 	}
 
 	/*
 	 * Save the page table page in its current state until the PDE
 	 * mapping the superpage is demoted by pmap_demote_pde() or
 	 * destroyed by pmap_remove_pde().
 	 */
 	mpte = PHYS_TO_VM_PAGE(*pde & PG_FRAME);
 	KASSERT(mpte >= vm_page_array &&
 	    mpte < &vm_page_array[vm_page_array_size],
 	    ("pmap_promote_pde: page table page is out of range"));
 	KASSERT(mpte->pindex == pmap_pde_pindex(va),
 	    ("pmap_promote_pde: page table page's pindex is wrong"));
 	if (pmap_insert_pt_page(pmap, mpte)) {
 		atomic_add_long(&pmap_pde_p_failures, 1);
 		CTR2(KTR_PMAP,
 		    "pmap_promote_pde: failure for va %#lx in pmap %p", va,
 		    pmap);
 		return;
 	}
 
 	/*
 	 * Promote the pv entries.
 	 */
 	if ((newpde & PG_MANAGED) != 0)
 		pmap_pv_promote_pde(pmap, va, newpde & PG_PS_FRAME, lockp);
 
 	/*
 	 * Propagate the PAT index to its proper position.
 	 */
 	newpde = pmap_swap_pat(pmap, newpde);
 
 	/*
 	 * Map the superpage.
 	 */
 	if (workaround_erratum383)
 		pmap_update_pde(pmap, va, pde, PG_PS | newpde);
 	else
 		pde_store(pde, PG_PS | newpde);
 
 	atomic_add_long(&pmap_pde_promotions, 1);
 	CTR2(KTR_PMAP, "pmap_promote_pde: success for va %#lx"
 	    " in pmap %p", va, pmap);
 }
 
 /*
  *	Insert the given physical page (p) at
  *	the specified virtual address (v) in the
  *	target physical map with the protection requested.
  *
  *	If specified, the page will be wired down, meaning
  *	that the related pte cannot be reclaimed.
  *
  *	NB:  This is the only routine which MAY NOT lazy-evaluate
  *	or lose information.  That is, this routine must actually
  *	insert this page into the given map NOW.
  */
 int
 pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
     u_int flags, int8_t psind __unused)
 {
 	struct rwlock *lock;
 	pd_entry_t *pde;
 	pt_entry_t *pte, PG_G, PG_A, PG_M, PG_RW, PG_V;
 	pt_entry_t newpte, origpte;
 	pv_entry_t pv;
 	vm_paddr_t opa, pa;
 	vm_page_t mpte, om;
 	boolean_t nosleep;
 
 	PG_A = pmap_accessed_bit(pmap);
 	PG_G = pmap_global_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 
 	va = trunc_page(va);
 	KASSERT(va <= VM_MAX_KERNEL_ADDRESS, ("pmap_enter: toobig"));
 	KASSERT(va < UPT_MIN_ADDRESS || va >= UPT_MAX_ADDRESS,
 	    ("pmap_enter: invalid to pmap_enter page table pages (va: 0x%lx)",
 	    va));
 	KASSERT((m->oflags & VPO_UNMANAGED) != 0 || va < kmi.clean_sva ||
 	    va >= kmi.clean_eva,
 	    ("pmap_enter: managed mapping within the clean submap"));
 	if ((m->oflags & VPO_UNMANAGED) == 0 && !vm_page_xbusied(m))
 		VM_OBJECT_ASSERT_LOCKED(m->object);
 	pa = VM_PAGE_TO_PHYS(m);
 	newpte = (pt_entry_t)(pa | PG_A | PG_V);
 	if ((flags & VM_PROT_WRITE) != 0)
 		newpte |= PG_M;
 	if ((prot & VM_PROT_WRITE) != 0)
 		newpte |= PG_RW;
 	KASSERT((newpte & (PG_M | PG_RW)) != PG_M,
 	    ("pmap_enter: flags includes VM_PROT_WRITE but prot doesn't"));
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		newpte |= pg_nx;
 	if ((flags & PMAP_ENTER_WIRED) != 0)
 		newpte |= PG_W;
 	if (va < VM_MAXUSER_ADDRESS)
 		newpte |= PG_U;
 	if (pmap == kernel_pmap)
 		newpte |= PG_G;
 	newpte |= pmap_cache_bits(pmap, m->md.pat_mode, 0);
 
 	/*
 	 * Set modified bit gratuitously for writeable mappings if
 	 * the page is unmanaged. We do not want to take a fault
 	 * to do the dirty bit accounting for these mappings.
 	 */
 	if ((m->oflags & VPO_UNMANAGED) != 0) {
 		if ((newpte & PG_RW) != 0)
 			newpte |= PG_M;
 	}
 
 	mpte = NULL;
 
 	lock = NULL;
 	rw_rlock(&pvh_global_lock);
 	PMAP_LOCK(pmap);
 
 	/*
 	 * In the case that a page table page is not
 	 * resident, we are creating it here.
 	 */
 retry:
 	pde = pmap_pde(pmap, va);
 	if (pde != NULL && (*pde & PG_V) != 0 && ((*pde & PG_PS) == 0 ||
 	    pmap_demote_pde_locked(pmap, pde, va, &lock))) {
 		pte = pmap_pde_to_pte(pde, va);
 		if (va < VM_MAXUSER_ADDRESS && mpte == NULL) {
 			mpte = PHYS_TO_VM_PAGE(*pde & PG_FRAME);
 			mpte->wire_count++;
 		}
 	} else if (va < VM_MAXUSER_ADDRESS) {
 		/*
 		 * Here if the pte page isn't mapped, or if it has been
 		 * deallocated.
 		 */
 		nosleep = (flags & PMAP_ENTER_NOSLEEP) != 0;
 		mpte = _pmap_allocpte(pmap, pmap_pde_pindex(va),
 		    nosleep ? NULL : &lock);
 		if (mpte == NULL && nosleep) {
 			if (lock != NULL)
 				rw_wunlock(lock);
 			rw_runlock(&pvh_global_lock);
 			PMAP_UNLOCK(pmap);
 			return (KERN_RESOURCE_SHORTAGE);
 		}
 		goto retry;
 	} else
 		panic("pmap_enter: invalid page directory va=%#lx", va);
 
 	origpte = *pte;
 
 	/*
 	 * Is the specified virtual address already mapped?
 	 */
 	if ((origpte & PG_V) != 0) {
 		/*
 		 * Wiring change, just update stats. We don't worry about
 		 * wiring PT pages as they remain resident as long as there
 		 * are valid mappings in them. Hence, if a user page is wired,
 		 * the PT page will be also.
 		 */
 		if ((newpte & PG_W) != 0 && (origpte & PG_W) == 0)
 			pmap->pm_stats.wired_count++;
 		else if ((newpte & PG_W) == 0 && (origpte & PG_W) != 0)
 			pmap->pm_stats.wired_count--;
 
 		/*
 		 * Remove the extra PT page reference.
 		 */
 		if (mpte != NULL) {
 			mpte->wire_count--;
 			KASSERT(mpte->wire_count > 0,
 			    ("pmap_enter: missing reference to page table page,"
 			     " va: 0x%lx", va));
 		}
 
 		/*
 		 * Has the physical page changed?
 		 */
 		opa = origpte & PG_FRAME;
 		if (opa == pa) {
 			/*
 			 * No, might be a protection or wiring change.
 			 */
 			if ((origpte & PG_MANAGED) != 0) {
 				newpte |= PG_MANAGED;
 				if ((newpte & PG_RW) != 0)
 					vm_page_aflag_set(m, PGA_WRITEABLE);
 			}
 			if (((origpte ^ newpte) & ~(PG_M | PG_A)) == 0)
 				goto unchanged;
 			goto validate;
 		}
 	} else {
 		/*
 		 * Increment the counters.
 		 */
 		if ((newpte & PG_W) != 0)
 			pmap->pm_stats.wired_count++;
 		pmap_resident_count_inc(pmap, 1);
 	}
 
 	/*
 	 * Enter on the PV list if part of our managed memory.
 	 */
 	if ((m->oflags & VPO_UNMANAGED) == 0) {
 		newpte |= PG_MANAGED;
 		pv = get_pv_entry(pmap, &lock);
 		pv->pv_va = va;
 		CHANGE_PV_LIST_LOCK_TO_PHYS(&lock, pa);
 		TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_next);
 		m->md.pv_gen++;
 		if ((newpte & PG_RW) != 0)
 			vm_page_aflag_set(m, PGA_WRITEABLE);
 	}
 
 	/*
 	 * Update the PTE.
 	 */
 	if ((origpte & PG_V) != 0) {
 validate:
 		origpte = pte_load_store(pte, newpte);
 		opa = origpte & PG_FRAME;
 		if (opa != pa) {
 			if ((origpte & PG_MANAGED) != 0) {
 				om = PHYS_TO_VM_PAGE(opa);
 				if ((origpte & (PG_M | PG_RW)) == (PG_M |
 				    PG_RW))
 					vm_page_dirty(om);
 				if ((origpte & PG_A) != 0)
 					vm_page_aflag_set(om, PGA_REFERENCED);
 				CHANGE_PV_LIST_LOCK_TO_PHYS(&lock, opa);
 				pmap_pvh_free(&om->md, pmap, va);
 				if ((om->aflags & PGA_WRITEABLE) != 0 &&
 				    TAILQ_EMPTY(&om->md.pv_list) &&
 				    ((om->flags & PG_FICTITIOUS) != 0 ||
 				    TAILQ_EMPTY(&pa_to_pvh(opa)->pv_list)))
 					vm_page_aflag_clear(om, PGA_WRITEABLE);
 			}
 		} else if ((newpte & PG_M) == 0 && (origpte & (PG_M |
 		    PG_RW)) == (PG_M | PG_RW)) {
 			if ((origpte & PG_MANAGED) != 0)
 				vm_page_dirty(m);
 
 			/*
 			 * Although the PTE may still have PG_RW set, TLB
 			 * invalidation may nonetheless be required because
 			 * the PTE no longer has PG_M set.
 			 */
 		} else if ((origpte & PG_NX) != 0 || (newpte & PG_NX) == 0) {
 			/*
 			 * This PTE change does not require TLB invalidation.
 			 */
 			goto unchanged;
 		}
 		if ((origpte & PG_A) != 0)
 			pmap_invalidate_page(pmap, va);
 	} else
 		pte_store(pte, newpte);
 
 unchanged:
 
 	/*
 	 * If both the page table page and the reservation are fully
 	 * populated, then attempt promotion.
 	 */
 	if ((mpte == NULL || mpte->wire_count == NPTEPG) &&
 	    pmap_ps_enabled(pmap) &&
 	    (m->flags & PG_FICTITIOUS) == 0 &&
 	    vm_reserv_level_iffullpop(m) == 0)
 		pmap_promote_pde(pmap, pde, va, &lock);
 
 	if (lock != NULL)
 		rw_wunlock(lock);
 	rw_runlock(&pvh_global_lock);
 	PMAP_UNLOCK(pmap);
 	return (KERN_SUCCESS);
 }
 
 /*
  * Tries to create a 2MB page mapping.  Returns TRUE if successful and FALSE
  * otherwise.  Fails if (1) a page table page cannot be allocated without
  * blocking, (2) a mapping already exists at the specified virtual address, or
  * (3) a pv entry cannot be allocated without reclaiming another pv entry.
  */
 static boolean_t
 pmap_enter_pde(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
     struct rwlock **lockp)
 {
 	pd_entry_t *pde, newpde;
 	pt_entry_t PG_V;
 	vm_page_t mpde;
 	struct spglist free;
 
 	PG_V = pmap_valid_bit(pmap);
 	rw_assert(&pvh_global_lock, RA_LOCKED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	if ((mpde = pmap_allocpde(pmap, va, NULL)) == NULL) {
 		CTR2(KTR_PMAP, "pmap_enter_pde: failure for va %#lx"
 		    " in pmap %p", va, pmap);
 		return (FALSE);
 	}
 	pde = (pd_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mpde));
 	pde = &pde[pmap_pde_index(va)];
 	if ((*pde & PG_V) != 0) {
 		KASSERT(mpde->wire_count > 1,
 		    ("pmap_enter_pde: mpde's wire count is too low"));
 		mpde->wire_count--;
 		CTR2(KTR_PMAP, "pmap_enter_pde: failure for va %#lx"
 		    " in pmap %p", va, pmap);
 		return (FALSE);
 	}
 	newpde = VM_PAGE_TO_PHYS(m) | pmap_cache_bits(pmap, m->md.pat_mode, 1) |
 	    PG_PS | PG_V;
 	if ((m->oflags & VPO_UNMANAGED) == 0) {
 		newpde |= PG_MANAGED;
 
 		/*
 		 * Abort this mapping if its PV entry could not be created.
 		 */
 		if (!pmap_pv_insert_pde(pmap, va, VM_PAGE_TO_PHYS(m),
 		    lockp)) {
 			SLIST_INIT(&free);
 			if (pmap_unwire_ptp(pmap, va, mpde, &free)) {
 				pmap_invalidate_page(pmap, va);
 				pmap_free_zero_pages(&free);
 			}
 			CTR2(KTR_PMAP, "pmap_enter_pde: failure for va %#lx"
 			    " in pmap %p", va, pmap);
 			return (FALSE);
 		}
 	}
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		newpde |= pg_nx;
 	if (va < VM_MAXUSER_ADDRESS)
 		newpde |= PG_U;
 
 	/*
 	 * Increment counters.
 	 */
 	pmap_resident_count_inc(pmap, NBPDR / PAGE_SIZE);
 
 	/*
 	 * Map the superpage.
 	 */
 	pde_store(pde, newpde);
 
 	atomic_add_long(&pmap_pde_mappings, 1);
 	CTR2(KTR_PMAP, "pmap_enter_pde: success for va %#lx"
 	    " in pmap %p", va, pmap);
 	return (TRUE);
 }
 
 /*
  * Maps a sequence of resident pages belonging to the same object.
  * The sequence begins with the given page m_start.  This page is
  * mapped at the given virtual address start.  Each subsequent page is
  * mapped at a virtual address that is offset from start by the same
  * amount as the page is offset from m_start within the object.  The
  * last page in the sequence is the page with the largest offset from
  * m_start that can be mapped at a virtual address less than the given
  * virtual address end.  Not every virtual page between start and end
  * is mapped; only those for which a resident page exists with the
  * corresponding offset from m_start are mapped.
  */
 void
 pmap_enter_object(pmap_t pmap, vm_offset_t start, vm_offset_t end,
     vm_page_t m_start, vm_prot_t prot)
 {
 	struct rwlock *lock;
 	vm_offset_t va;
 	vm_page_t m, mpte;
 	vm_pindex_t diff, psize;
 
 	VM_OBJECT_ASSERT_LOCKED(m_start->object);
 
 	psize = atop(end - start);
 	mpte = NULL;
 	m = m_start;
 	lock = NULL;
 	rw_rlock(&pvh_global_lock);
 	PMAP_LOCK(pmap);
 	while (m != NULL && (diff = m->pindex - m_start->pindex) < psize) {
 		va = start + ptoa(diff);
 		if ((va & PDRMASK) == 0 && va + NBPDR <= end &&
 		    m->psind == 1 && pmap_ps_enabled(pmap) &&
 		    pmap_enter_pde(pmap, va, m, prot, &lock))
 			m = &m[NBPDR / PAGE_SIZE - 1];
 		else
 			mpte = pmap_enter_quick_locked(pmap, va, m, prot,
 			    mpte, &lock);
 		m = TAILQ_NEXT(m, listq);
 	}
 	if (lock != NULL)
 		rw_wunlock(lock);
 	rw_runlock(&pvh_global_lock);
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  * This code makes some *MAJOR* assumptions:
  * 1. Current pmap & pmap exists.
  * 2. Not wired.
  * 3. Read access.
  * 4. No page table pages.
  * In exchange, it is *MUCH* faster than pmap_enter().
  */
 
 void
 pmap_enter_quick(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot)
 {
 	struct rwlock *lock;
 
 	lock = NULL;
 	rw_rlock(&pvh_global_lock);
 	PMAP_LOCK(pmap);
 	(void)pmap_enter_quick_locked(pmap, va, m, prot, NULL, &lock);
 	if (lock != NULL)
 		rw_wunlock(lock);
 	rw_runlock(&pvh_global_lock);
 	PMAP_UNLOCK(pmap);
 }
 
 static vm_page_t
 pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va, vm_page_t m,
     vm_prot_t prot, vm_page_t mpte, struct rwlock **lockp)
 {
 	struct spglist free;
 	pt_entry_t *pte, PG_V;
 	vm_paddr_t pa;
 
 	KASSERT(va < kmi.clean_sva || va >= kmi.clean_eva ||
 	    (m->oflags & VPO_UNMANAGED) != 0,
 	    ("pmap_enter_quick_locked: managed mapping within the clean submap"));
 	PG_V = pmap_valid_bit(pmap);
 	rw_assert(&pvh_global_lock, RA_LOCKED);
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 
 	/*
 	 * In the case that a page table page is not
 	 * resident, we are creating it here.
 	 */
 	if (va < VM_MAXUSER_ADDRESS) {
 		vm_pindex_t ptepindex;
 		pd_entry_t *ptepa;
 
 		/*
 		 * Calculate pagetable page index
 		 */
 		ptepindex = pmap_pde_pindex(va);
 		if (mpte && (mpte->pindex == ptepindex)) {
 			mpte->wire_count++;
 		} else {
 			/*
 			 * Get the page directory entry
 			 */
 			ptepa = pmap_pde(pmap, va);
 
 			/*
 			 * If the page table page is mapped, we just increment
 			 * its wire count.  Otherwise, we attempt to allocate
 			 * a page table page.  If this attempt fails, we don't
 			 * retry.  Instead, we give up.
 			 */
 			if (ptepa && (*ptepa & PG_V) != 0) {
 				if (*ptepa & PG_PS)
 					return (NULL);
 				mpte = PHYS_TO_VM_PAGE(*ptepa & PG_FRAME);
 				mpte->wire_count++;
 			} else {
 				/*
 				 * Pass NULL instead of the PV list lock
 				 * pointer, because we don't intend to sleep.
 				 */
 				mpte = _pmap_allocpte(pmap, ptepindex, NULL);
 				if (mpte == NULL)
 					return (mpte);
 			}
 		}
 		pte = (pt_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mpte));
 		pte = &pte[pmap_pte_index(va)];
 	} else {
 		mpte = NULL;
 		pte = vtopte(va);
 	}
 	if (*pte) {
 		if (mpte != NULL) {
 			mpte->wire_count--;
 			mpte = NULL;
 		}
 		return (mpte);
 	}
 
 	/*
 	 * Enter on the PV list if part of our managed memory.
 	 */
 	if ((m->oflags & VPO_UNMANAGED) == 0 &&
 	    !pmap_try_insert_pv_entry(pmap, va, m, lockp)) {
 		if (mpte != NULL) {
 			SLIST_INIT(&free);
 			if (pmap_unwire_ptp(pmap, va, mpte, &free)) {
 				pmap_invalidate_page(pmap, va);
 				pmap_free_zero_pages(&free);
 			}
 			mpte = NULL;
 		}
 		return (mpte);
 	}
 
 	/*
 	 * Increment counters
 	 */
 	pmap_resident_count_inc(pmap, 1);
 
 	pa = VM_PAGE_TO_PHYS(m) | pmap_cache_bits(pmap, m->md.pat_mode, 0);
 	if ((prot & VM_PROT_EXECUTE) == 0)
 		pa |= pg_nx;
 
 	/*
 	 * Now validate the mapping with read-only protection.
 	 */
 	if ((m->oflags & VPO_UNMANAGED) != 0)
 		pte_store(pte, pa | PG_V | PG_U);
 	else
 		pte_store(pte, pa | PG_V | PG_U | PG_MANAGED);
 	return (mpte);
 }
 
 /*
  * Make a temporary mapping for a physical address.  This is only intended
  * to be used for panic dumps.
  */
 void *
 pmap_kenter_temporary(vm_paddr_t pa, int i)
 {
 	vm_offset_t va;
 
 	va = (vm_offset_t)crashdumpmap + (i * PAGE_SIZE);
 	pmap_kenter(va, pa);
 	invlpg(va);
 	return ((void *)crashdumpmap);
 }
 
 /*
  * This code maps large physical mmap regions (device or SG objects)
  * into the processor address space using 2MB pages.  It is only an
  * optimization: if a precondition fails, nothing is mapped.
  */
 void
 pmap_object_init_pt(pmap_t pmap, vm_offset_t addr, vm_object_t object,
     vm_pindex_t pindex, vm_size_t size)
 {
 	pd_entry_t *pde;
 	pt_entry_t PG_A, PG_M, PG_RW, PG_V;
 	vm_paddr_t pa, ptepa;
 	vm_page_t p, pdpg;
 	int pat_mode;
 
 	PG_A = pmap_accessed_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 
 	VM_OBJECT_ASSERT_WLOCKED(object);
 	KASSERT(object->type == OBJT_DEVICE || object->type == OBJT_SG,
 	    ("pmap_object_init_pt: non-device object"));
 	if ((addr & (NBPDR - 1)) == 0 && (size & (NBPDR - 1)) == 0) {
 		if (!pmap_ps_enabled(pmap))
 			return;
 		if (!vm_object_populate(object, pindex, pindex + atop(size)))
 			return;
 		p = vm_page_lookup(object, pindex);
 		KASSERT(p->valid == VM_PAGE_BITS_ALL,
 		    ("pmap_object_init_pt: invalid page %p", p));
 		pat_mode = p->md.pat_mode;
 
 		/*
 		 * Abort the mapping if the first page is not physically
 		 * aligned to a 2MB page boundary.
 		 */
 		ptepa = VM_PAGE_TO_PHYS(p);
 		if (ptepa & (NBPDR - 1))
 			return;
 
 		/*
 		 * Skip the first page.  Abort the mapping if the rest of
 		 * the pages are not physically contiguous or have differing
 		 * memory attributes.
 		 */
 		p = TAILQ_NEXT(p, listq);
 		for (pa = ptepa + PAGE_SIZE; pa < ptepa + size;
 		    pa += PAGE_SIZE) {
 			KASSERT(p->valid == VM_PAGE_BITS_ALL,
 			    ("pmap_object_init_pt: invalid page %p", p));
 			if (pa != VM_PAGE_TO_PHYS(p) ||
 			    pat_mode != p->md.pat_mode)
 				return;
 			p = TAILQ_NEXT(p, listq);
 		}
 
 		/*
 		 * Map using 2MB pages.  Since "ptepa" is 2MB aligned and
 		 * "size" is a multiple of 2MB, adding the PAT setting to "pa"
 		 * will not affect the termination of this loop.
 		 */
 		PMAP_LOCK(pmap);
 		for (pa = ptepa | pmap_cache_bits(pmap, pat_mode, 1);
 		    pa < ptepa + size; pa += NBPDR) {
 			pdpg = pmap_allocpde(pmap, addr, NULL);
 			if (pdpg == NULL) {
 				/*
 				 * The creation of mappings below is only an
 				 * optimization.  If a page directory page
 				 * cannot be allocated without blocking,
 				 * continue on to the next mapping rather than
 				 * blocking.
 				 */
 				addr += NBPDR;
 				continue;
 			}
 			pde = (pd_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pdpg));
 			pde = &pde[pmap_pde_index(addr)];
 			if ((*pde & PG_V) == 0) {
 				pde_store(pde, pa | PG_PS | PG_M | PG_A |
 				    PG_U | PG_RW | PG_V);
 				pmap_resident_count_inc(pmap, NBPDR / PAGE_SIZE);
 				atomic_add_long(&pmap_pde_mappings, 1);
 			} else {
 				/* Continue on if the PDE is already valid. */
 				pdpg->wire_count--;
 				KASSERT(pdpg->wire_count > 0,
 				    ("pmap_object_init_pt: missing reference "
 				    "to page directory page, va: 0x%lx", addr));
 			}
 			addr += NBPDR;
 		}
 		PMAP_UNLOCK(pmap);
 	}
 }
 
 /*
  *	Clear the wired attribute from the mappings for the specified range of
  *	addresses in the given pmap.  Every valid mapping within that range
  *	must have the wired attribute set.  In contrast, invalid mappings
  *	cannot have the wired attribute set, so they are ignored.
  *
  *	The wired attribute of the page table entry is not a hardware feature,
  *	so there is no need to invalidate any TLB entries.
  */
 void
 pmap_unwire(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
 {
 	vm_offset_t va_next;
 	pml4_entry_t *pml4e;
 	pdp_entry_t *pdpe;
 	pd_entry_t *pde;
 	pt_entry_t *pte, PG_V;
 	boolean_t pv_lists_locked;
 
 	PG_V = pmap_valid_bit(pmap);
 	pv_lists_locked = FALSE;
 resume:
 	PMAP_LOCK(pmap);
 	for (; sva < eva; sva = va_next) {
 		pml4e = pmap_pml4e(pmap, sva);
 		if ((*pml4e & PG_V) == 0) {
 			va_next = (sva + NBPML4) & ~PML4MASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 		pdpe = pmap_pml4e_to_pdpe(pml4e, sva);
 		if ((*pdpe & PG_V) == 0) {
 			va_next = (sva + NBPDP) & ~PDPMASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 		va_next = (sva + NBPDR) & ~PDRMASK;
 		if (va_next < sva)
 			va_next = eva;
 		pde = pmap_pdpe_to_pde(pdpe, sva);
 		if ((*pde & PG_V) == 0)
 			continue;
 		if ((*pde & PG_PS) != 0) {
 			if ((*pde & PG_W) == 0)
 				panic("pmap_unwire: pde %#jx is missing PG_W",
 				    (uintmax_t)*pde);
 
 			/*
 			 * Are we unwiring the entire large page?  If not,
 			 * demote the mapping and fall through.
 			 */
 			if (sva + NBPDR == va_next && eva >= va_next) {
 				atomic_clear_long(pde, PG_W);
 				pmap->pm_stats.wired_count -= NBPDR /
 				    PAGE_SIZE;
 				continue;
 			} else {
 				if (!pv_lists_locked) {
 					pv_lists_locked = TRUE;
 					if (!rw_try_rlock(&pvh_global_lock)) {
 						PMAP_UNLOCK(pmap);
 						rw_rlock(&pvh_global_lock);
 						/* Repeat sva. */
 						goto resume;
 					}
 				}
 				if (!pmap_demote_pde(pmap, pde, sva))
 					panic("pmap_unwire: demotion failed");
 			}
 		}
 		if (va_next > eva)
 			va_next = eva;
 		for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; pte++,
 		    sva += PAGE_SIZE) {
 			if ((*pte & PG_V) == 0)
 				continue;
 			if ((*pte & PG_W) == 0)
 				panic("pmap_unwire: pte %#jx is missing PG_W",
 				    (uintmax_t)*pte);
 
 			/*
 			 * PG_W must be cleared atomically.  Although the pmap
 			 * lock synchronizes access to PG_W, another processor
 			 * could be setting PG_M and/or PG_A concurrently.
 			 */
 			atomic_clear_long(pte, PG_W);
 			pmap->pm_stats.wired_count--;
 		}
 	}
 	if (pv_lists_locked)
 		rw_runlock(&pvh_global_lock);
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  *	Copy the range specified by src_addr/len
  *	from the source map to the range dst_addr/len
  *	in the destination map.
  *
  *	This routine is only advisory and need not do anything.
  */
 
 void
 pmap_copy(pmap_t dst_pmap, pmap_t src_pmap, vm_offset_t dst_addr, vm_size_t len,
     vm_offset_t src_addr)
 {
 	struct rwlock *lock;
 	struct spglist free;
 	vm_offset_t addr;
 	vm_offset_t end_addr = src_addr + len;
 	vm_offset_t va_next;
 	pt_entry_t PG_A, PG_M, PG_V;
 
 	if (dst_addr != src_addr)
 		return;
 
 	if (dst_pmap->pm_type != src_pmap->pm_type)
 		return;
 
 	/*
 	 * EPT page table entries that require emulation of A/D bits are
 	 * sensitive to clearing the PG_A bit (aka EPT_PG_READ). Although
 	 * we clear PG_M (aka EPT_PG_WRITE) concomitantly, the PG_U bit
 	 * (aka EPT_PG_EXECUTE) could still be set. Since some EPT
 	 * implementations flag an EPT misconfiguration for exec-only
 	 * mappings, we skip this function entirely for emulated pmaps.
 	 */
 	if (pmap_emulate_ad_bits(dst_pmap))
 		return;
 
 	lock = NULL;
 	rw_rlock(&pvh_global_lock);
 	if (dst_pmap < src_pmap) {
 		PMAP_LOCK(dst_pmap);
 		PMAP_LOCK(src_pmap);
 	} else {
 		PMAP_LOCK(src_pmap);
 		PMAP_LOCK(dst_pmap);
 	}
 
 	PG_A = pmap_accessed_bit(dst_pmap);
 	PG_M = pmap_modified_bit(dst_pmap);
 	PG_V = pmap_valid_bit(dst_pmap);
 
 	for (addr = src_addr; addr < end_addr; addr = va_next) {
 		pt_entry_t *src_pte, *dst_pte;
 		vm_page_t dstmpde, dstmpte, srcmpte;
 		pml4_entry_t *pml4e;
 		pdp_entry_t *pdpe;
 		pd_entry_t srcptepaddr, *pde;
 
 		KASSERT(addr < UPT_MIN_ADDRESS,
 		    ("pmap_copy: invalid to pmap_copy page tables"));
 
 		pml4e = pmap_pml4e(src_pmap, addr);
 		if ((*pml4e & PG_V) == 0) {
 			va_next = (addr + NBPML4) & ~PML4MASK;
 			if (va_next < addr)
 				va_next = end_addr;
 			continue;
 		}
 
 		pdpe = pmap_pml4e_to_pdpe(pml4e, addr);
 		if ((*pdpe & PG_V) == 0) {
 			va_next = (addr + NBPDP) & ~PDPMASK;
 			if (va_next < addr)
 				va_next = end_addr;
 			continue;
 		}
 
 		va_next = (addr + NBPDR) & ~PDRMASK;
 		if (va_next < addr)
 			va_next = end_addr;
 
 		pde = pmap_pdpe_to_pde(pdpe, addr);
 		srcptepaddr = *pde;
 		if (srcptepaddr == 0)
 			continue;
 			
 		if (srcptepaddr & PG_PS) {
 			if ((addr & PDRMASK) != 0 || addr + NBPDR > end_addr)
 				continue;
 			dstmpde = pmap_allocpde(dst_pmap, addr, NULL);
 			if (dstmpde == NULL)
 				break;
 			pde = (pd_entry_t *)
 			    PHYS_TO_DMAP(VM_PAGE_TO_PHYS(dstmpde));
 			pde = &pde[pmap_pde_index(addr)];
 			if (*pde == 0 && ((srcptepaddr & PG_MANAGED) == 0 ||
 			    pmap_pv_insert_pde(dst_pmap, addr, srcptepaddr &
 			    PG_PS_FRAME, &lock))) {
 				*pde = srcptepaddr & ~PG_W;
 				pmap_resident_count_inc(dst_pmap, NBPDR / PAGE_SIZE);
 			} else
 				dstmpde->wire_count--;
 			continue;
 		}
 
 		srcptepaddr &= PG_FRAME;
 		srcmpte = PHYS_TO_VM_PAGE(srcptepaddr);
 		KASSERT(srcmpte->wire_count > 0,
 		    ("pmap_copy: source page table page is unused"));
 
 		if (va_next > end_addr)
 			va_next = end_addr;
 
 		src_pte = (pt_entry_t *)PHYS_TO_DMAP(srcptepaddr);
 		src_pte = &src_pte[pmap_pte_index(addr)];
 		dstmpte = NULL;
 		while (addr < va_next) {
 			pt_entry_t ptetemp;
 			ptetemp = *src_pte;
 			/*
 			 * Only managed pages are virtually copied.
 			 */
 			if ((ptetemp & PG_MANAGED) != 0) {
 				if (dstmpte != NULL &&
 				    dstmpte->pindex == pmap_pde_pindex(addr))
 					dstmpte->wire_count++;
 				else if ((dstmpte = pmap_allocpte(dst_pmap,
 				    addr, NULL)) == NULL)
 					goto out;
 				dst_pte = (pt_entry_t *)
 				    PHYS_TO_DMAP(VM_PAGE_TO_PHYS(dstmpte));
 				dst_pte = &dst_pte[pmap_pte_index(addr)];
 				if (*dst_pte == 0 &&
 				    pmap_try_insert_pv_entry(dst_pmap, addr,
 				    PHYS_TO_VM_PAGE(ptetemp & PG_FRAME),
 				    &lock)) {
 					/*
 					 * Clear the wired, modified, and
 					 * accessed (referenced) bits
 					 * during the copy.
 					 */
 					*dst_pte = ptetemp & ~(PG_W | PG_M |
 					    PG_A);
 					pmap_resident_count_inc(dst_pmap, 1);
 				} else {
 					SLIST_INIT(&free);
 					if (pmap_unwire_ptp(dst_pmap, addr,
 					    dstmpte, &free)) {
 						pmap_invalidate_page(dst_pmap,
 						    addr);
 						pmap_free_zero_pages(&free);
 					}
 					goto out;
 				}
 				if (dstmpte->wire_count >= srcmpte->wire_count)
 					break;
 			}
 			addr += PAGE_SIZE;
 			src_pte++;
 		}
 	}
 out:
 	if (lock != NULL)
 		rw_wunlock(lock);
 	rw_runlock(&pvh_global_lock);
 	PMAP_UNLOCK(src_pmap);
 	PMAP_UNLOCK(dst_pmap);
 }
 
 /*
  *	pmap_zero_page zeros the specified hardware page through the
  *	direct map, using pagezero to clear its contents.
  */
 void
 pmap_zero_page(vm_page_t m)
 {
 	vm_offset_t va = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m));
 
 	pagezero((void *)va);
 }
 
 /*
  *	pmap_zero_page_area zeros the specified area of a hardware page
  *	through the direct map, using pagezero when the area covers the
  *	whole page and bzero otherwise.
  *
  *	off and size must not describe an area beyond a single hardware page.
  */
 void
 pmap_zero_page_area(vm_page_t m, int off, int size)
 {
 	vm_offset_t va = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m));
 
 	if (off == 0 && size == PAGE_SIZE)
 		pagezero((void *)va);
 	else
 		bzero((char *)va + off, size);
 }
 
 /*
  *	pmap_zero_page_idle zeros the specified hardware page through the
  *	direct map, using pagezero to clear its contents.  This is intended
  *	to be called from the vm_pagezero process only and outside of Giant.
  */
 void
 pmap_zero_page_idle(vm_page_t m)
 {
 	vm_offset_t va = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m));
 
 	pagezero((void *)va);
 }
 
 /*
  *	pmap_copy_page copies the specified (machine independent)
  *	page through the direct map, using pagecopy to copy the
  *	page, one machine dependent page at a time.
  */
 void
 pmap_copy_page(vm_page_t msrc, vm_page_t mdst)
 {
 	vm_offset_t src = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(msrc));
 	vm_offset_t dst = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mdst));
 
 	pagecopy((void *)src, (void *)dst);
 }
 
 int unmapped_buf_allowed = 1;
 
 void
 pmap_copy_pages(vm_page_t ma[], vm_offset_t a_offset, vm_page_t mb[],
     vm_offset_t b_offset, int xfersize)
 {
 	void *a_cp, *b_cp;
 	vm_page_t pages[2];
 	vm_offset_t vaddr[2], a_pg_offset, b_pg_offset;
 	int cnt;
 	boolean_t mapped;
 
 	while (xfersize > 0) {
 		a_pg_offset = a_offset & PAGE_MASK;
 		pages[0] = ma[a_offset >> PAGE_SHIFT];
 		b_pg_offset = b_offset & PAGE_MASK;
 		pages[1] = mb[b_offset >> PAGE_SHIFT];
 		cnt = min(xfersize, PAGE_SIZE - a_pg_offset);
 		cnt = min(cnt, PAGE_SIZE - b_pg_offset);
 		mapped = pmap_map_io_transient(pages, vaddr, 2, FALSE);
 		a_cp = (char *)vaddr[0] + a_pg_offset;
 		b_cp = (char *)vaddr[1] + b_pg_offset;
 		bcopy(a_cp, b_cp, cnt);
 		if (__predict_false(mapped))
 			pmap_unmap_io_transient(pages, vaddr, 2, FALSE);
 		a_offset += cnt;
 		b_offset += cnt;
 		xfersize -= cnt;
 	}
 }
 
 /*
  * Returns true if the pmap's pv is one of the first
  * 16 pvs linked to from this page.  This count may
  * be changed upwards or downwards in the future; it
  * is only necessary that true be returned for a small
  * subset of pmaps for proper page aging.
  */
 boolean_t
 pmap_page_exists_quick(pmap_t pmap, vm_page_t m)
 {
 	struct md_page *pvh;
 	struct rwlock *lock;
 	pv_entry_t pv;
 	int loops = 0;
 	boolean_t rv;
 
 	KASSERT((m->oflags & VPO_UNMANAGED) == 0,
 	    ("pmap_page_exists_quick: page %p is not managed", m));
 	rv = FALSE;
 	rw_rlock(&pvh_global_lock);
 	lock = VM_PAGE_TO_PV_LIST_LOCK(m);
 	rw_rlock(lock);
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_next) {
 		if (PV_PMAP(pv) == pmap) {
 			rv = TRUE;
 			break;
 		}
 		loops++;
 		if (loops >= 16)
 			break;
 	}
 	if (!rv && loops < 16 && (m->flags & PG_FICTITIOUS) == 0) {
 		pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 		TAILQ_FOREACH(pv, &pvh->pv_list, pv_next) {
 			if (PV_PMAP(pv) == pmap) {
 				rv = TRUE;
 				break;
 			}
 			loops++;
 			if (loops >= 16)
 				break;
 		}
 	}
 	rw_runlock(lock);
 	rw_runlock(&pvh_global_lock);
 	return (rv);
 }
 
 /*
  *	pmap_page_wired_mappings:
  *
  *	Return the number of managed mappings to the given physical page
  *	that are wired.
  */
 int
 pmap_page_wired_mappings(vm_page_t m)
 {
 	struct rwlock *lock;
 	struct md_page *pvh;
 	pmap_t pmap;
 	pt_entry_t *pte;
 	pv_entry_t pv;
 	int count, md_gen, pvh_gen;
 
 	if ((m->oflags & VPO_UNMANAGED) != 0)
 		return (0);
 	rw_rlock(&pvh_global_lock);
 	lock = VM_PAGE_TO_PV_LIST_LOCK(m);
 	rw_rlock(lock);
 restart:
 	count = 0;
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_next) {
 		pmap = PV_PMAP(pv);
 		if (!PMAP_TRYLOCK(pmap)) {
 			md_gen = m->md.pv_gen;
 			rw_runlock(lock);
 			PMAP_LOCK(pmap);
 			rw_rlock(lock);
 			if (md_gen != m->md.pv_gen) {
 				PMAP_UNLOCK(pmap);
 				goto restart;
 			}
 		}
 		pte = pmap_pte(pmap, pv->pv_va);
 		if ((*pte & PG_W) != 0)
 			count++;
 		PMAP_UNLOCK(pmap);
 	}
 	if ((m->flags & PG_FICTITIOUS) == 0) {
 		pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 		TAILQ_FOREACH(pv, &pvh->pv_list, pv_next) {
 			pmap = PV_PMAP(pv);
 			if (!PMAP_TRYLOCK(pmap)) {
 				md_gen = m->md.pv_gen;
 				pvh_gen = pvh->pv_gen;
 				rw_runlock(lock);
 				PMAP_LOCK(pmap);
 				rw_rlock(lock);
 				if (md_gen != m->md.pv_gen ||
 				    pvh_gen != pvh->pv_gen) {
 					PMAP_UNLOCK(pmap);
 					goto restart;
 				}
 			}
 			pte = pmap_pde(pmap, pv->pv_va);
 			if ((*pte & PG_W) != 0)
 				count++;
 			PMAP_UNLOCK(pmap);
 		}
 	}
 	rw_runlock(lock);
 	rw_runlock(&pvh_global_lock);
 	return (count);
 }
 
 /*
  * Returns TRUE if the given page is mapped individually or as part of
  * a 2mpage.  Otherwise, returns FALSE.
  */
 boolean_t
 pmap_page_is_mapped(vm_page_t m)
 {
 	struct rwlock *lock;
 	boolean_t rv;
 
 	if ((m->oflags & VPO_UNMANAGED) != 0)
 		return (FALSE);
 	rw_rlock(&pvh_global_lock);
 	lock = VM_PAGE_TO_PV_LIST_LOCK(m);
 	rw_rlock(lock);
 	rv = !TAILQ_EMPTY(&m->md.pv_list) ||
 	    ((m->flags & PG_FICTITIOUS) == 0 &&
 	    !TAILQ_EMPTY(&pa_to_pvh(VM_PAGE_TO_PHYS(m))->pv_list));
 	rw_runlock(lock);
 	rw_runlock(&pvh_global_lock);
 	return (rv);
 }
 
 /*
  * Destroy all managed, non-wired mappings in the given user-space
  * pmap.  This pmap cannot be active on any processor besides the
  * caller.
  *
  * This function cannot be applied to the kernel pmap.  Moreover, it
  * is not intended for general use.  It is only to be used during
  * process termination.  Consequently, it can be implemented in ways
  * that make it faster than pmap_remove().  First, it can more quickly
  * destroy mappings by iterating over the pmap's collection of PV
  * entries, rather than searching the page table.  Second, it doesn't
  * have to test and clear the page table entries atomically, because
  * no processor is currently accessing the user address space.  In
  * particular, a page table entry's dirty bit won't change state once
  * this function starts.
  */
 void
 pmap_remove_pages(pmap_t pmap)
 {
 	pd_entry_t ptepde;
 	pt_entry_t *pte, tpte;
 	pt_entry_t PG_M, PG_RW, PG_V;
 	struct spglist free;
 	vm_page_t m, mpte, mt;
 	pv_entry_t pv;
 	struct md_page *pvh;
 	struct pv_chunk *pc, *npc;
 	struct rwlock *lock;
 	int64_t bit;
 	uint64_t inuse, bitmask;
 	int allfree, field, freed, idx;
 	boolean_t superpage;
 	vm_paddr_t pa;
 
 	/*
 	 * Assert that the given pmap is only active on the current
 	 * CPU.  Unfortunately, we cannot block another CPU from
 	 * activating the pmap while this function is executing.
 	 */
 	KASSERT(pmap == PCPU_GET(curpmap), ("non-current pmap %p", pmap));
 #ifdef INVARIANTS
 	{
 		cpuset_t other_cpus;
 
 		other_cpus = all_cpus;
 		critical_enter();
 		CPU_CLR(PCPU_GET(cpuid), &other_cpus);
 		CPU_AND(&other_cpus, &pmap->pm_active);
 		critical_exit();
 		KASSERT(CPU_EMPTY(&other_cpus), ("pmap active %p", pmap));
 	}
 #endif
 
 	lock = NULL;
 	PG_M = pmap_modified_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 
 	SLIST_INIT(&free);
 	rw_rlock(&pvh_global_lock);
 	PMAP_LOCK(pmap);
 	TAILQ_FOREACH_SAFE(pc, &pmap->pm_pvchunk, pc_list, npc) {
 		allfree = 1;
 		freed = 0;
 		for (field = 0; field < _NPCM; field++) {
 			inuse = ~pc->pc_map[field] & pc_freemask[field];
 			while (inuse != 0) {
 				bit = bsfq(inuse);
 				bitmask = 1UL << bit;
 				idx = field * 64 + bit;
 				pv = &pc->pc_pventry[idx];
 				inuse &= ~bitmask;
 
 				pte = pmap_pdpe(pmap, pv->pv_va);
 				ptepde = *pte;
 				pte = pmap_pdpe_to_pde(pte, pv->pv_va);
 				tpte = *pte;
 				if ((tpte & (PG_PS | PG_V)) == PG_V) {
 					superpage = FALSE;
 					ptepde = tpte;
 					pte = (pt_entry_t *)PHYS_TO_DMAP(tpte &
 					    PG_FRAME);
 					pte = &pte[pmap_pte_index(pv->pv_va)];
 					tpte = *pte;
 				} else {
 					/*
 					 * Keep track of whether 'tpte' is a
 					 * superpage explicitly instead of
 					 * relying on PG_PS being set.
 					 *
 					 * This is because PG_PS is numerically
 					 * identical to PG_PTE_PAT and thus a
 					 * regular page could be mistaken for
 					 * a superpage.
 					 */
 					superpage = TRUE;
 				}
 
 				if ((tpte & PG_V) == 0) {
 					panic("bad pte va %lx pte %lx",
 					    pv->pv_va, tpte);
 				}
 
 				/*
 				 * We cannot remove wired pages from a
 				 * process' mapping at this time.
 				 */
 				if (tpte & PG_W) {
 					allfree = 0;
 					continue;
 				}
 
 				if (superpage)
 					pa = tpte & PG_PS_FRAME;
 				else
 					pa = tpte & PG_FRAME;
 
 				m = PHYS_TO_VM_PAGE(pa);
 				KASSERT(m->phys_addr == pa,
 				    ("vm_page_t %p phys_addr mismatch %016jx %016jx",
 				    m, (uintmax_t)m->phys_addr,
 				    (uintmax_t)tpte));
 
 				KASSERT((m->flags & PG_FICTITIOUS) != 0 ||
 				    m < &vm_page_array[vm_page_array_size],
 				    ("pmap_remove_pages: bad tpte %#jx",
 				    (uintmax_t)tpte));
 
 				pte_clear(pte);
 
 				/*
 				 * Update the vm_page_t clean/reference bits.
 				 */
 				if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) {
 					if (superpage) {
 						for (mt = m; mt < &m[NBPDR / PAGE_SIZE]; mt++)
 							vm_page_dirty(mt);
 					} else
 						vm_page_dirty(m);
 				}
 
 				CHANGE_PV_LIST_LOCK_TO_VM_PAGE(&lock, m);
 
 				/* Mark free */
 				pc->pc_map[field] |= bitmask;
 				if (superpage) {
 					pmap_resident_count_dec(pmap, NBPDR / PAGE_SIZE);
 					pvh = pa_to_pvh(tpte & PG_PS_FRAME);
 					TAILQ_REMOVE(&pvh->pv_list, pv, pv_next);
 					pvh->pv_gen++;
 					if (TAILQ_EMPTY(&pvh->pv_list)) {
 						for (mt = m; mt < &m[NBPDR / PAGE_SIZE]; mt++)
 							if ((mt->aflags & PGA_WRITEABLE) != 0 &&
 							    TAILQ_EMPTY(&mt->md.pv_list))
 								vm_page_aflag_clear(mt, PGA_WRITEABLE);
 					}
 					mpte = pmap_lookup_pt_page(pmap, pv->pv_va);
 					if (mpte != NULL) {
 						pmap_remove_pt_page(pmap, mpte);
 						pmap_resident_count_dec(pmap, 1);
 						KASSERT(mpte->wire_count == NPTEPG,
 						    ("pmap_remove_pages: pte page wire count error"));
 						mpte->wire_count = 0;
 						pmap_add_delayed_free_list(mpte, &free, FALSE);
 						atomic_subtract_int(&vm_cnt.v_wire_count, 1);
 					}
 				} else {
 					pmap_resident_count_dec(pmap, 1);
 					TAILQ_REMOVE(&m->md.pv_list, pv, pv_next);
 					m->md.pv_gen++;
 					if ((m->aflags & PGA_WRITEABLE) != 0 &&
 					    TAILQ_EMPTY(&m->md.pv_list) &&
 					    (m->flags & PG_FICTITIOUS) == 0) {
 						pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 						if (TAILQ_EMPTY(&pvh->pv_list))
 							vm_page_aflag_clear(m, PGA_WRITEABLE);
 					}
 				}
 				pmap_unuse_pt(pmap, pv->pv_va, ptepde, &free);
 				freed++;
 			}
 		}
 		PV_STAT(atomic_add_long(&pv_entry_frees, freed));
 		PV_STAT(atomic_add_int(&pv_entry_spare, freed));
 		PV_STAT(atomic_subtract_long(&pv_entry_count, freed));
 		if (allfree) {
 			TAILQ_REMOVE(&pmap->pm_pvchunk, pc, pc_list);
 			free_pv_chunk(pc);
 		}
 	}
 	if (lock != NULL)
 		rw_wunlock(lock);
 	pmap_invalidate_all(pmap);
 	rw_runlock(&pvh_global_lock);
 	PMAP_UNLOCK(pmap);
 	pmap_free_zero_pages(&free);
 }
 
 static boolean_t
 pmap_page_test_mappings(vm_page_t m, boolean_t accessed, boolean_t modified)
 {
 	struct rwlock *lock;
 	pv_entry_t pv;
 	struct md_page *pvh;
 	pt_entry_t *pte, mask;
 	pt_entry_t PG_A, PG_M, PG_RW, PG_V;
 	pmap_t pmap;
 	int md_gen, pvh_gen;
 	boolean_t rv;
 
 	rv = FALSE;
 	rw_rlock(&pvh_global_lock);
 	lock = VM_PAGE_TO_PV_LIST_LOCK(m);
 	rw_rlock(lock);
 restart:
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_next) {
 		pmap = PV_PMAP(pv);
 		if (!PMAP_TRYLOCK(pmap)) {
 			md_gen = m->md.pv_gen;
 			rw_runlock(lock);
 			PMAP_LOCK(pmap);
 			rw_rlock(lock);
 			if (md_gen != m->md.pv_gen) {
 				PMAP_UNLOCK(pmap);
 				goto restart;
 			}
 		}
 		pte = pmap_pte(pmap, pv->pv_va);
 		mask = 0;
 		if (modified) {
 			PG_M = pmap_modified_bit(pmap);
 			PG_RW = pmap_rw_bit(pmap);
 			mask |= PG_RW | PG_M;
 		}
 		if (accessed) {
 			PG_A = pmap_accessed_bit(pmap);
 			PG_V = pmap_valid_bit(pmap);
 			mask |= PG_V | PG_A;
 		}
 		rv = (*pte & mask) == mask;
 		PMAP_UNLOCK(pmap);
 		if (rv)
 			goto out;
 	}
 	if ((m->flags & PG_FICTITIOUS) == 0) {
 		pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 		TAILQ_FOREACH(pv, &pvh->pv_list, pv_next) {
 			pmap = PV_PMAP(pv);
 			if (!PMAP_TRYLOCK(pmap)) {
 				md_gen = m->md.pv_gen;
 				pvh_gen = pvh->pv_gen;
 				rw_runlock(lock);
 				PMAP_LOCK(pmap);
 				rw_rlock(lock);
 				if (md_gen != m->md.pv_gen ||
 				    pvh_gen != pvh->pv_gen) {
 					PMAP_UNLOCK(pmap);
 					goto restart;
 				}
 			}
 			pte = pmap_pde(pmap, pv->pv_va);
 			mask = 0;
 			if (modified) {
 				PG_M = pmap_modified_bit(pmap);
 				PG_RW = pmap_rw_bit(pmap);
 				mask |= PG_RW | PG_M;
 			}
 			if (accessed) {
 				PG_A = pmap_accessed_bit(pmap);
 				PG_V = pmap_valid_bit(pmap);
 				mask |= PG_V | PG_A;
 			}
 			rv = (*pte & mask) == mask;
 			PMAP_UNLOCK(pmap);
 			if (rv)
 				goto out;
 		}
 	}
 out:
 	rw_runlock(lock);
 	rw_runlock(&pvh_global_lock);
 	return (rv);
 }
 
 /*
  *	pmap_is_modified:
  *
  *	Return whether or not the specified physical page was modified
  *	in any physical maps.
  */
 boolean_t
 pmap_is_modified(vm_page_t m)
 {
 
 	KASSERT((m->oflags & VPO_UNMANAGED) == 0,
 	    ("pmap_is_modified: page %p is not managed", m));
 
 	/*
 	 * If the page is not exclusive busied, then PGA_WRITEABLE cannot be
 	 * concurrently set while the object is locked.  Thus, if PGA_WRITEABLE
 	 * is clear, no PTEs can have PG_M set.
 	 */
 	VM_OBJECT_ASSERT_WLOCKED(m->object);
 	if (!vm_page_xbusied(m) && (m->aflags & PGA_WRITEABLE) == 0)
 		return (FALSE);
 	return (pmap_page_test_mappings(m, FALSE, TRUE));
 }
 
 /*
  *	pmap_is_prefaultable:
  *
  *	Return whether or not the specified virtual address is eligible
  *	for prefault.
  */
 boolean_t
 pmap_is_prefaultable(pmap_t pmap, vm_offset_t addr)
 {
 	pd_entry_t *pde;
 	pt_entry_t *pte, PG_V;
 	boolean_t rv;
 
 	PG_V = pmap_valid_bit(pmap);
 	rv = FALSE;
 	PMAP_LOCK(pmap);
 	pde = pmap_pde(pmap, addr);
 	if (pde != NULL && (*pde & (PG_PS | PG_V)) == PG_V) {
 		pte = pmap_pde_to_pte(pde, addr);
 		rv = (*pte & PG_V) == 0;
 	}
 	PMAP_UNLOCK(pmap);
 	return (rv);
 }
 
 /*
  *	pmap_is_referenced:
  *
  *	Return whether or not the specified physical page was referenced
  *	in any physical maps.
  */
 boolean_t
 pmap_is_referenced(vm_page_t m)
 {
 
 	KASSERT((m->oflags & VPO_UNMANAGED) == 0,
 	    ("pmap_is_referenced: page %p is not managed", m));
 	return (pmap_page_test_mappings(m, TRUE, FALSE));
 }
 
 /*
  * Clear the write and modified bits in each of the given page's mappings.
  */
 void
 pmap_remove_write(vm_page_t m)
 {
 	struct md_page *pvh;
 	pmap_t pmap;
 	struct rwlock *lock;
 	pv_entry_t next_pv, pv;
 	pd_entry_t *pde;
 	pt_entry_t oldpte, *pte, PG_M, PG_RW;
 	vm_offset_t va;
 	int pvh_gen, md_gen;
 
 	KASSERT((m->oflags & VPO_UNMANAGED) == 0,
 	    ("pmap_remove_write: page %p is not managed", m));
 
 	/*
 	 * If the page is not exclusive busied, then PGA_WRITEABLE cannot be
 	 * set by another thread while the object is locked.  Thus,
 	 * if PGA_WRITEABLE is clear, no page table entries need updating.
 	 */
 	VM_OBJECT_ASSERT_WLOCKED(m->object);
 	if (!vm_page_xbusied(m) && (m->aflags & PGA_WRITEABLE) == 0)
 		return;
 	rw_rlock(&pvh_global_lock);
 	lock = VM_PAGE_TO_PV_LIST_LOCK(m);
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 retry_pv_loop:
 	rw_wlock(lock);
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		goto small_mappings;
 	TAILQ_FOREACH_SAFE(pv, &pvh->pv_list, pv_next, next_pv) {
 		pmap = PV_PMAP(pv);
 		if (!PMAP_TRYLOCK(pmap)) {
 			pvh_gen = pvh->pv_gen;
 			rw_wunlock(lock);
 			PMAP_LOCK(pmap);
 			rw_wlock(lock);
 			if (pvh_gen != pvh->pv_gen) {
 				PMAP_UNLOCK(pmap);
 				rw_wunlock(lock);
 				goto retry_pv_loop;
 			}
 		}
 		PG_RW = pmap_rw_bit(pmap);
 		va = pv->pv_va;
 		pde = pmap_pde(pmap, va);
 		if ((*pde & PG_RW) != 0)
 			(void)pmap_demote_pde_locked(pmap, pde, va, &lock);
 		KASSERT(lock == VM_PAGE_TO_PV_LIST_LOCK(m),
 		    ("inconsistent pv lock %p %p for page %p",
 		    lock, VM_PAGE_TO_PV_LIST_LOCK(m), m));
 		PMAP_UNLOCK(pmap);
 	}
 small_mappings:
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_next) {
 		pmap = PV_PMAP(pv);
 		if (!PMAP_TRYLOCK(pmap)) {
 			pvh_gen = pvh->pv_gen;
 			md_gen = m->md.pv_gen;
 			rw_wunlock(lock);
 			PMAP_LOCK(pmap);
 			rw_wlock(lock);
 			if (pvh_gen != pvh->pv_gen ||
 			    md_gen != m->md.pv_gen) {
 				PMAP_UNLOCK(pmap);
 				rw_wunlock(lock);
 				goto retry_pv_loop;
 			}
 		}
 		PG_M = pmap_modified_bit(pmap);
 		PG_RW = pmap_rw_bit(pmap);
 		pde = pmap_pde(pmap, pv->pv_va);
 		KASSERT((*pde & PG_PS) == 0,
 		    ("pmap_remove_write: found a 2mpage in page %p's pv list",
 		    m));
 		pte = pmap_pde_to_pte(pde, pv->pv_va);
 retry:
 		oldpte = *pte;
 		if (oldpte & PG_RW) {
 			if (!atomic_cmpset_long(pte, oldpte, oldpte &
 			    ~(PG_RW | PG_M)))
 				goto retry;
 			if ((oldpte & PG_M) != 0)
 				vm_page_dirty(m);
 			pmap_invalidate_page(pmap, pv->pv_va);
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	rw_wunlock(lock);
 	vm_page_aflag_clear(m, PGA_WRITEABLE);
 	rw_runlock(&pvh_global_lock);
 }
 
 static __inline boolean_t
 safe_to_clear_referenced(pmap_t pmap, pt_entry_t pte)
 {
 
 	if (!pmap_emulate_ad_bits(pmap))
 		return (TRUE);
 
 	KASSERT(pmap->pm_type == PT_EPT, ("invalid pm_type %d", pmap->pm_type));
 
 	/*
 	 * XWR = 010 or 110 will cause an unconditional EPT misconfiguration
 	 * so we don't let the referenced (aka EPT_PG_READ) bit be cleared
 	 * if the EPT_PG_WRITE bit is set.
 	 */
 	if ((pte & EPT_PG_WRITE) != 0)
 		return (FALSE);
 
 	/*
 	 * XWR = 100 is allowed only if PMAP_SUPPORTS_EXEC_ONLY is set.
 	 */
 	if ((pte & EPT_PG_EXECUTE) == 0 ||
 	    ((pmap->pm_flags & PMAP_SUPPORTS_EXEC_ONLY) != 0))
 		return (TRUE);
 	else
 		return (FALSE);
 }
 
 #define	PMAP_TS_REFERENCED_MAX	5
 
 /*
  *	pmap_ts_referenced:
  *
  *	Return a count of reference bits for a page, clearing those bits.
  *	It is not necessary for every reference bit to be cleared, but it
  *	is necessary that 0 only be returned when there are truly no
  *	reference bits set.
  *
  *	XXX: The exact number of bits to check and clear is a matter that
  *	should be tested and standardized at some point in the future for
  *	optimal aging of shared pages.
  */
 int
 pmap_ts_referenced(vm_page_t m)
 {
 	struct md_page *pvh;
 	pv_entry_t pv, pvf;
 	pmap_t pmap;
 	struct rwlock *lock;
 	pd_entry_t oldpde, *pde;
 	pt_entry_t *pte, PG_A;
 	vm_offset_t va;
 	vm_paddr_t pa;
 	int cleared, md_gen, not_cleared, pvh_gen;
 	struct spglist free;
 	boolean_t demoted;
 
 	KASSERT((m->oflags & VPO_UNMANAGED) == 0,
 	    ("pmap_ts_referenced: page %p is not managed", m));
 	SLIST_INIT(&free);
 	cleared = 0;
 	pa = VM_PAGE_TO_PHYS(m);
 	lock = PHYS_TO_PV_LIST_LOCK(pa);
 	pvh = pa_to_pvh(pa);
 	rw_rlock(&pvh_global_lock);
 	rw_wlock(lock);
 retry:
 	not_cleared = 0;
 	if ((m->flags & PG_FICTITIOUS) != 0 ||
 	    (pvf = TAILQ_FIRST(&pvh->pv_list)) == NULL)
 		goto small_mappings;
 	pv = pvf;
 	do {
 		if (pvf == NULL)
 			pvf = pv;
 		pmap = PV_PMAP(pv);
 		if (!PMAP_TRYLOCK(pmap)) {
 			pvh_gen = pvh->pv_gen;
 			rw_wunlock(lock);
 			PMAP_LOCK(pmap);
 			rw_wlock(lock);
 			if (pvh_gen != pvh->pv_gen) {
 				PMAP_UNLOCK(pmap);
 				goto retry;
 			}
 		}
 		PG_A = pmap_accessed_bit(pmap);
 		va = pv->pv_va;
 		pde = pmap_pde(pmap, pv->pv_va);
 		oldpde = *pde;
 		if ((*pde & PG_A) != 0) {
 			/*
 			 * Since this reference bit is shared by 512 4KB
 			 * pages, it should not be cleared every time it is
 			 * tested.  Apply a simple "hash" function on the
 			 * physical page number, the virtual superpage number,
 			 * and the pmap address to select one 4KB page out of
 			 * the 512 on which testing the reference bit will
 			 * result in clearing that reference bit.  This
 			 * function is designed to avoid the selection of the
 			 * same 4KB page for every 2MB page mapping.
 			 *
 			 * On demotion, a mapping that hasn't been referenced
 			 * is simply destroyed.  To avoid the possibility of a
 			 * subsequent page fault on a demoted wired mapping,
 			 * always leave its reference bit set.  Moreover,
 			 * since the superpage is wired, the current state of
 			 * its reference bit won't affect page replacement.
 			 */
 			if ((((pa >> PAGE_SHIFT) ^ (pv->pv_va >> PDRSHIFT) ^
 			    (uintptr_t)pmap) & (NPTEPG - 1)) == 0 &&
 			    (*pde & PG_W) == 0) {
 				if (safe_to_clear_referenced(pmap, oldpde)) {
 					atomic_clear_long(pde, PG_A);
 					pmap_invalidate_page(pmap, pv->pv_va);
 					demoted = FALSE;
 				} else if (pmap_demote_pde_locked(pmap, pde,
 				    pv->pv_va, &lock)) {
 					/*
 					 * Remove the mapping to a single page
 					 * so that a subsequent access may
 					 * repromote.  Since the underlying
 					 * page table page is fully populated,
 					 * this removal never frees a page
 					 * table page.
 					 */
 					demoted = TRUE;
 					va += VM_PAGE_TO_PHYS(m) - (oldpde &
 					    PG_PS_FRAME);
 					pte = pmap_pde_to_pte(pde, va);
 					pmap_remove_pte(pmap, pte, va, *pde,
 					    NULL, &lock);
 					pmap_invalidate_page(pmap, va);
 				} else
 					demoted = TRUE;
 
 				if (demoted) {
 					/*
 					 * The superpage mapping was removed
 					 * entirely and therefore 'pv' is no
 					 * longer valid.
 					 */
 					if (pvf == pv)
 						pvf = NULL;
 					pv = NULL;
 				}
 				cleared++;
 				KASSERT(lock == VM_PAGE_TO_PV_LIST_LOCK(m),
 				    ("inconsistent pv lock %p %p for page %p",
 				    lock, VM_PAGE_TO_PV_LIST_LOCK(m), m));
 			} else
 				not_cleared++;
 		}
 		PMAP_UNLOCK(pmap);
 		/* Rotate the PV list if it has more than one entry. */
 		if (pv != NULL && TAILQ_NEXT(pv, pv_next) != NULL) {
 			TAILQ_REMOVE(&pvh->pv_list, pv, pv_next);
 			TAILQ_INSERT_TAIL(&pvh->pv_list, pv, pv_next);
 			pvh->pv_gen++;
 		}
 		if (cleared + not_cleared >= PMAP_TS_REFERENCED_MAX)
 			goto out;
 	} while ((pv = TAILQ_FIRST(&pvh->pv_list)) != pvf);
 small_mappings:
 	if ((pvf = TAILQ_FIRST(&m->md.pv_list)) == NULL)
 		goto out;
 	pv = pvf;
 	do {
 		if (pvf == NULL)
 			pvf = pv;
 		pmap = PV_PMAP(pv);
 		if (!PMAP_TRYLOCK(pmap)) {
 			pvh_gen = pvh->pv_gen;
 			md_gen = m->md.pv_gen;
 			rw_wunlock(lock);
 			PMAP_LOCK(pmap);
 			rw_wlock(lock);
 			if (pvh_gen != pvh->pv_gen || md_gen != m->md.pv_gen) {
 				PMAP_UNLOCK(pmap);
 				goto retry;
 			}
 		}
 		PG_A = pmap_accessed_bit(pmap);
 		pde = pmap_pde(pmap, pv->pv_va);
 		KASSERT((*pde & PG_PS) == 0,
 		    ("pmap_ts_referenced: found a 2mpage in page %p's pv list",
 		    m));
 		pte = pmap_pde_to_pte(pde, pv->pv_va);
 		if ((*pte & PG_A) != 0) {
 			if (safe_to_clear_referenced(pmap, *pte)) {
 				atomic_clear_long(pte, PG_A);
 				pmap_invalidate_page(pmap, pv->pv_va);
 				cleared++;
 			} else if ((*pte & PG_W) == 0) {
 				/*
 				 * Wired pages cannot be paged out so
 				 * doing accessed bit emulation for
 				 * them is wasted effort. We do the
 				 * hard work for unwired pages only.
 				 */
 				pmap_remove_pte(pmap, pte, pv->pv_va,
 				    *pde, &free, &lock);
 				pmap_invalidate_page(pmap, pv->pv_va);
 				cleared++;
 				if (pvf == pv)
 					pvf = NULL;
 				pv = NULL;
 				KASSERT(lock == VM_PAGE_TO_PV_LIST_LOCK(m),
 				    ("inconsistent pv lock %p %p for page %p",
 				    lock, VM_PAGE_TO_PV_LIST_LOCK(m), m));
 			} else
 				not_cleared++;
 		}
 		PMAP_UNLOCK(pmap);
 		/* Rotate the PV list if it has more than one entry. */
 		if (pv != NULL && TAILQ_NEXT(pv, pv_next) != NULL) {
 			TAILQ_REMOVE(&m->md.pv_list, pv, pv_next);
 			TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_next);
 			m->md.pv_gen++;
 		}
 	} while ((pv = TAILQ_FIRST(&m->md.pv_list)) != pvf && cleared +
 	    not_cleared < PMAP_TS_REFERENCED_MAX);
 out:
 	rw_wunlock(lock);
 	rw_runlock(&pvh_global_lock);
 	pmap_free_zero_pages(&free);
 	return (cleared + not_cleared);
 }
 
 /*
  *	Apply the given advice to the specified range of addresses within the
  *	given pmap.  Depending on the advice, clear the referenced and/or
  *	modified flags in each mapping and set the mapped page's dirty field.
  */
 void
 pmap_advise(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, int advice)
 {
 	struct rwlock *lock;
 	pml4_entry_t *pml4e;
 	pdp_entry_t *pdpe;
 	pd_entry_t oldpde, *pde;
 	pt_entry_t *pte, PG_A, PG_G, PG_M, PG_RW, PG_V;
 	vm_offset_t va_next;
 	vm_page_t m;
 	boolean_t anychanged, pv_lists_locked;
 
 	if (advice != MADV_DONTNEED && advice != MADV_FREE)
 		return;
 
 	/*
 	 * A/D bit emulation requires an alternate code path when clearing
 	 * the modified and accessed bits below. Since this function is
 	 * advisory in nature, we skip it entirely for pmaps that require
 	 * A/D bit emulation.
 	 */
 	if (pmap_emulate_ad_bits(pmap))
 		return;
 
 	PG_A = pmap_accessed_bit(pmap);
 	PG_G = pmap_global_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 
 	pv_lists_locked = FALSE;
 resume:
 	anychanged = FALSE;
 	PMAP_LOCK(pmap);
 	for (; sva < eva; sva = va_next) {
 		pml4e = pmap_pml4e(pmap, sva);
 		if ((*pml4e & PG_V) == 0) {
 			va_next = (sva + NBPML4) & ~PML4MASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 		pdpe = pmap_pml4e_to_pdpe(pml4e, sva);
 		if ((*pdpe & PG_V) == 0) {
 			va_next = (sva + NBPDP) & ~PDPMASK;
 			if (va_next < sva)
 				va_next = eva;
 			continue;
 		}
 		va_next = (sva + NBPDR) & ~PDRMASK;
 		if (va_next < sva)
 			va_next = eva;
 		pde = pmap_pdpe_to_pde(pdpe, sva);
 		oldpde = *pde;
 		if ((oldpde & PG_V) == 0)
 			continue;
 		else if ((oldpde & PG_PS) != 0) {
 			if ((oldpde & PG_MANAGED) == 0)
 				continue;
 			if (!pv_lists_locked) {
 				pv_lists_locked = TRUE;
 				if (!rw_try_rlock(&pvh_global_lock)) {
 					if (anychanged)
 						pmap_invalidate_all(pmap);
 					PMAP_UNLOCK(pmap);
 					rw_rlock(&pvh_global_lock);
 					goto resume;
 				}
 			}
 			lock = NULL;
 			if (!pmap_demote_pde_locked(pmap, pde, sva, &lock)) {
 				if (lock != NULL)
 					rw_wunlock(lock);
 
 				/*
 				 * The large page mapping was destroyed.
 				 */
 				continue;
 			}
 
 			/*
 			 * Unless the page mappings are wired, remove the
 			 * mapping to a single page so that a subsequent
 			 * access may repromote.  Since the underlying page
 			 * table page is fully populated, this removal never
 			 * frees a page table page.
 			 */
 			if ((oldpde & PG_W) == 0) {
 				pte = pmap_pde_to_pte(pde, sva);
 				KASSERT((*pte & PG_V) != 0,
 				    ("pmap_advise: invalid PTE"));
 				pmap_remove_pte(pmap, pte, sva, *pde, NULL,
 				    &lock);
 				anychanged = TRUE;
 			}
 			if (lock != NULL)
 				rw_wunlock(lock);
 		}
 		if (va_next > eva)
 			va_next = eva;
 		for (pte = pmap_pde_to_pte(pde, sva); sva != va_next; pte++,
 		    sva += PAGE_SIZE) {
 			if ((*pte & (PG_MANAGED | PG_V)) != (PG_MANAGED |
 			    PG_V))
 				continue;
 			else if ((*pte & (PG_M | PG_RW)) == (PG_M | PG_RW)) {
 				if (advice == MADV_DONTNEED) {
 					/*
 					 * Future calls to pmap_is_modified()
 					 * can be avoided by making the page
 					 * dirty now.
 					 */
 					m = PHYS_TO_VM_PAGE(*pte & PG_FRAME);
 					vm_page_dirty(m);
 				}
 				atomic_clear_long(pte, PG_M | PG_A);
 			} else if ((*pte & PG_A) != 0)
 				atomic_clear_long(pte, PG_A);
 			else
 				continue;
 			if ((*pte & PG_G) != 0)
 				pmap_invalidate_page(pmap, sva);
 			else
 				anychanged = TRUE;
 		}
 	}
 	if (anychanged)
 		pmap_invalidate_all(pmap);
 	if (pv_lists_locked)
 		rw_runlock(&pvh_global_lock);
 	PMAP_UNLOCK(pmap);
 }
 
 /*
  *	Clear the modify bits on the specified physical page.
  */
 void
 pmap_clear_modify(vm_page_t m)
 {
 	struct md_page *pvh;
 	pmap_t pmap;
 	pv_entry_t next_pv, pv;
 	pd_entry_t oldpde, *pde;
 	pt_entry_t oldpte, *pte, PG_M, PG_RW, PG_V;
 	struct rwlock *lock;
 	vm_offset_t va;
 	int md_gen, pvh_gen;
 
 	KASSERT((m->oflags & VPO_UNMANAGED) == 0,
 	    ("pmap_clear_modify: page %p is not managed", m));
 	VM_OBJECT_ASSERT_WLOCKED(m->object);
 	KASSERT(!vm_page_xbusied(m),
 	    ("pmap_clear_modify: page %p is exclusive busied", m));
 
 	/*
 	 * If the page is not PGA_WRITEABLE, then no PTEs can have PG_M set.
 	 * If the object containing the page is locked and the page is not
 	 * exclusive busied, then PGA_WRITEABLE cannot be concurrently set.
 	 */
 	if ((m->aflags & PGA_WRITEABLE) == 0)
 		return;
 	pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m));
 	rw_rlock(&pvh_global_lock);
 	lock = VM_PAGE_TO_PV_LIST_LOCK(m);
 	rw_wlock(lock);
 restart:
 	if ((m->flags & PG_FICTITIOUS) != 0)
 		goto small_mappings;
 	TAILQ_FOREACH_SAFE(pv, &pvh->pv_list, pv_next, next_pv) {
 		pmap = PV_PMAP(pv);
 		if (!PMAP_TRYLOCK(pmap)) {
 			pvh_gen = pvh->pv_gen;
 			rw_wunlock(lock);
 			PMAP_LOCK(pmap);
 			rw_wlock(lock);
 			if (pvh_gen != pvh->pv_gen) {
 				PMAP_UNLOCK(pmap);
 				goto restart;
 			}
 		}
 		PG_M = pmap_modified_bit(pmap);
 		PG_V = pmap_valid_bit(pmap);
 		PG_RW = pmap_rw_bit(pmap);
 		va = pv->pv_va;
 		pde = pmap_pde(pmap, va);
 		oldpde = *pde;
 		if ((oldpde & PG_RW) != 0) {
 			if (pmap_demote_pde_locked(pmap, pde, va, &lock)) {
 				if ((oldpde & PG_W) == 0) {
 					/*
 					 * Write protect the mapping to a
 					 * single page so that a subsequent
 					 * write access may repromote.
 					 */
 					va += VM_PAGE_TO_PHYS(m) - (oldpde &
 					    PG_PS_FRAME);
 					pte = pmap_pde_to_pte(pde, va);
 					oldpte = *pte;
 					if ((oldpte & PG_V) != 0) {
 						while (!atomic_cmpset_long(pte,
 						    oldpte,
 						    oldpte & ~(PG_M | PG_RW)))
 							oldpte = *pte;
 						vm_page_dirty(m);
 						pmap_invalidate_page(pmap, va);
 					}
 				}
 			}
 		}
 		PMAP_UNLOCK(pmap);
 	}
 small_mappings:
 	TAILQ_FOREACH(pv, &m->md.pv_list, pv_next) {
 		pmap = PV_PMAP(pv);
 		if (!PMAP_TRYLOCK(pmap)) {
 			md_gen = m->md.pv_gen;
 			pvh_gen = pvh->pv_gen;
 			rw_wunlock(lock);
 			PMAP_LOCK(pmap);
 			rw_wlock(lock);
 			if (pvh_gen != pvh->pv_gen || md_gen != m->md.pv_gen) {
 				PMAP_UNLOCK(pmap);
 				goto restart;
 			}
 		}
 		PG_M = pmap_modified_bit(pmap);
 		PG_RW = pmap_rw_bit(pmap);
 		pde = pmap_pde(pmap, pv->pv_va);
 		KASSERT((*pde & PG_PS) == 0, ("pmap_clear_modify: found"
 		    " a 2mpage in page %p's pv list", m));
 		pte = pmap_pde_to_pte(pde, pv->pv_va);
 		if ((*pte & (PG_M | PG_RW)) == (PG_M | PG_RW)) {
 			atomic_clear_long(pte, PG_M);
 			pmap_invalidate_page(pmap, pv->pv_va);
 		}
 		PMAP_UNLOCK(pmap);
 	}
 	rw_wunlock(lock);
 	rw_runlock(&pvh_global_lock);
 }
 
 /*
  * Miscellaneous support routines follow
  */
 
 /* Adjust the cache mode for a 4KB page mapped via a PTE. */
 static __inline void
 pmap_pte_attr(pt_entry_t *pte, int cache_bits, int mask)
 {
 	u_int opte, npte;
 
 	/*
 	 * The cache mode bits are all in the low 32-bits of the
 	 * PTE, so we can just spin on updating the low 32-bits.
 	 */
 	do {
 		opte = *(u_int *)pte;
 		npte = opte & ~mask;
 		npte |= cache_bits;
 	} while (npte != opte && !atomic_cmpset_int((u_int *)pte, opte, npte));
 }
 
 /* Adjust the cache mode for a 2MB page mapped via a PDE. */
 static __inline void
 pmap_pde_attr(pd_entry_t *pde, int cache_bits, int mask)
 {
 	u_int opde, npde;
 
 	/*
 	 * The cache mode bits are all in the low 32-bits of the
 	 * PDE, so we can just spin on updating the low 32-bits.
 	 */
 	do {
 		opde = *(u_int *)pde;
 		npde = opde & ~mask;
 		npde |= cache_bits;
 	} while (npde != opde && !atomic_cmpset_int((u_int *)pde, opde, npde));
 }
 
 /*
  * Map a set of physical memory pages into the kernel virtual
  * address space. Return a pointer to where it is mapped. This
  * routine is intended to be used for mapping device memory,
  * NOT real memory.
  */
 void *
 pmap_mapdev_attr(vm_paddr_t pa, vm_size_t size, int mode)
 {
 	vm_offset_t va, offset;
 	vm_size_t tmpsize;
 
 	/*
 	 * If the specified range of physical addresses fits within the direct
 	 * map window, use the direct map. 
 	 */
 	if (pa < dmaplimit && pa + size < dmaplimit) {
 		va = PHYS_TO_DMAP(pa);
 		if (!pmap_change_attr(va, size, mode))
 			return ((void *)va);
 	}
 	offset = pa & PAGE_MASK;
 	size = round_page(offset + size);
 	va = kva_alloc(size);
 	if (!va)
 		panic("pmap_mapdev: Couldn't alloc kernel virtual memory");
 	pa = trunc_page(pa);
 	for (tmpsize = 0; tmpsize < size; tmpsize += PAGE_SIZE)
 		pmap_kenter_attr(va + tmpsize, pa + tmpsize, mode);
 	pmap_invalidate_range(kernel_pmap, va, va + tmpsize);
 	pmap_invalidate_cache_range(va, va + tmpsize, FALSE);
 	return ((void *)(va + offset));
 }
 
 void *
 pmap_mapdev(vm_paddr_t pa, vm_size_t size)
 {
 
 	return (pmap_mapdev_attr(pa, size, PAT_UNCACHEABLE));
 }
 
 void *
 pmap_mapbios(vm_paddr_t pa, vm_size_t size)
 {
 
 	return (pmap_mapdev_attr(pa, size, PAT_WRITE_BACK));
 }
 
 void
 pmap_unmapdev(vm_offset_t va, vm_size_t size)
 {
 	vm_offset_t base, offset;
 
 	/* If we gave a direct map region in pmap_mapdev, do nothing */
 	if (va >= DMAP_MIN_ADDRESS && va < DMAP_MAX_ADDRESS)
 		return;
 	base = trunc_page(va);
 	offset = va & PAGE_MASK;
 	size = round_page(offset + size);
 	kva_free(base, size);
 }
 
 /*
  * Tries to demote a 1GB page mapping.
  */
 static boolean_t
 pmap_demote_pdpe(pmap_t pmap, pdp_entry_t *pdpe, vm_offset_t va)
 {
 	pdp_entry_t newpdpe, oldpdpe;
 	pd_entry_t *firstpde, newpde, *pde;
 	pt_entry_t PG_A, PG_M, PG_RW, PG_V;
 	vm_paddr_t mpdepa;
 	vm_page_t mpde;
 
 	PG_A = pmap_accessed_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 
 	PMAP_LOCK_ASSERT(pmap, MA_OWNED);
 	oldpdpe = *pdpe;
 	KASSERT((oldpdpe & (PG_PS | PG_V)) == (PG_PS | PG_V),
 	    ("pmap_demote_pdpe: oldpdpe is missing PG_PS and/or PG_V"));
 	if ((mpde = vm_page_alloc(NULL, va >> PDPSHIFT, VM_ALLOC_INTERRUPT |
 	    VM_ALLOC_NOOBJ | VM_ALLOC_WIRED)) == NULL) {
 		CTR2(KTR_PMAP, "pmap_demote_pdpe: failure for va %#lx"
 		    " in pmap %p", va, pmap);
 		return (FALSE);
 	}
 	mpdepa = VM_PAGE_TO_PHYS(mpde);
 	firstpde = (pd_entry_t *)PHYS_TO_DMAP(mpdepa);
 	newpdpe = mpdepa | PG_M | PG_A | (oldpdpe & PG_U) | PG_RW | PG_V;
 	KASSERT((oldpdpe & PG_A) != 0,
 	    ("pmap_demote_pdpe: oldpdpe is missing PG_A"));
 	KASSERT((oldpdpe & (PG_M | PG_RW)) != PG_RW,
 	    ("pmap_demote_pdpe: oldpdpe is missing PG_M"));
 	newpde = oldpdpe;
 
 	/*
 	 * Initialize the page directory page.
 	 */
 	for (pde = firstpde; pde < firstpde + NPDEPG; pde++) {
 		*pde = newpde;
 		newpde += NBPDR;
 	}
 
 	/*
 	 * Demote the mapping.
 	 */
 	*pdpe = newpdpe;
 
 	/*
 	 * Invalidate a stale recursive mapping of the page directory page.
 	 */
 	pmap_invalidate_page(pmap, (vm_offset_t)vtopde(va));
 
 	pmap_pdpe_demotions++;
 	CTR2(KTR_PMAP, "pmap_demote_pdpe: success for va %#lx"
 	    " in pmap %p", va, pmap);
 	return (TRUE);
 }
 
 /*
  * Sets the memory attribute for the specified page.
  */
 void
 pmap_page_set_memattr(vm_page_t m, vm_memattr_t ma)
 {
 
 	m->md.pat_mode = ma;
 
 	/*
 	 * If "m" is a normal page, update its direct mapping.  This update
 	 * can be relied upon to perform any cache operations that are
 	 * required for data coherence.
 	 */
 	if ((m->flags & PG_FICTITIOUS) == 0 &&
 	    pmap_change_attr(PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)), PAGE_SIZE,
 	    m->md.pat_mode))
 		panic("memory attribute change on the direct map failed");
 }
 
 /*
  * Changes the specified virtual address range's memory type to that given by
  * the parameter "mode".  The specified virtual address range must be
  * completely contained within either the direct map or the kernel map.  If
  * the virtual address range is contained within the kernel map, then the
  * memory type for each of the corresponding ranges of the direct map is also
  * changed.  (The corresponding ranges of the direct map are those ranges that
  * map the same physical pages as the specified virtual address range.)  These
  * changes to the direct map are necessary because Intel describes the
  * behavior of their processors as "undefined" if two or more mappings to the
  * same physical page have different memory types.
  *
  * Returns zero if the change completed successfully, and either EINVAL or
  * ENOMEM if the change failed.  Specifically, EINVAL is returned if some part
  * of the virtual address range was not mapped, and ENOMEM is returned if
  * there was insufficient memory available to complete the change.  In the
  * latter case, the memory type may have been changed on some part of the
  * virtual address range or the direct map.
  */
 int
 pmap_change_attr(vm_offset_t va, vm_size_t size, int mode)
 {
 	int error;
 
 	PMAP_LOCK(kernel_pmap);
 	error = pmap_change_attr_locked(va, size, mode);
 	PMAP_UNLOCK(kernel_pmap);
 	return (error);
 }
 
 static int
 pmap_change_attr_locked(vm_offset_t va, vm_size_t size, int mode)
 {
 	vm_offset_t base, offset, tmpva;
 	vm_paddr_t pa_start, pa_end;
 	pdp_entry_t *pdpe;
 	pd_entry_t *pde;
 	pt_entry_t *pte;
 	int cache_bits_pte, cache_bits_pde, error;
 	boolean_t changed;
 
 	PMAP_LOCK_ASSERT(kernel_pmap, MA_OWNED);
 	base = trunc_page(va);
 	offset = va & PAGE_MASK;
 	size = round_page(offset + size);
 
 	/*
 	 * Only supported on kernel virtual addresses, including the direct
 	 * map but excluding the recursive map.
 	 */
 	if (base < DMAP_MIN_ADDRESS)
 		return (EINVAL);
 
 	cache_bits_pde = pmap_cache_bits(kernel_pmap, mode, 1);
 	cache_bits_pte = pmap_cache_bits(kernel_pmap, mode, 0);
 	changed = FALSE;
 
 	/*
 	 * Pages that aren't mapped aren't supported.  Also break down 2MB pages
 	 * into 4KB pages if required.
 	 */
 	for (tmpva = base; tmpva < base + size; ) {
 		pdpe = pmap_pdpe(kernel_pmap, tmpva);
 		if (*pdpe == 0)
 			return (EINVAL);
 		if (*pdpe & PG_PS) {
 			/*
 			 * If the current 1GB page already has the required
 			 * memory type, then we need not demote this page. Just
 			 * increment tmpva to the next 1GB page frame.
 			 */
 			if ((*pdpe & X86_PG_PDE_CACHE) == cache_bits_pde) {
 				tmpva = trunc_1gpage(tmpva) + NBPDP;
 				continue;
 			}
 
 			/*
 			 * If the current offset aligns with a 1GB page frame
 			 * and there is at least 1GB left within the range, then
 			 * we need not break down this page into 2MB pages.
 			 */
 			if ((tmpva & PDPMASK) == 0 &&
 			    tmpva + PDPMASK < base + size) {
 				tmpva += NBPDP;
 				continue;
 			}
 			if (!pmap_demote_pdpe(kernel_pmap, pdpe, tmpva))
 				return (ENOMEM);
 		}
 		pde = pmap_pdpe_to_pde(pdpe, tmpva);
 		if (*pde == 0)
 			return (EINVAL);
 		if (*pde & PG_PS) {
 			/*
 			 * If the current 2MB page already has the required
 			 * memory type, then we need not demote this page. Just
 			 * increment tmpva to the next 2MB page frame.
 			 */
 			if ((*pde & X86_PG_PDE_CACHE) == cache_bits_pde) {
 				tmpva = trunc_2mpage(tmpva) + NBPDR;
 				continue;
 			}
 
 			/*
 			 * If the current offset aligns with a 2MB page frame
 			 * and there is at least 2MB left within the range, then
 			 * we need not break down this page into 4KB pages.
 			 */
 			if ((tmpva & PDRMASK) == 0 &&
 			    tmpva + PDRMASK < base + size) {
 				tmpva += NBPDR;
 				continue;
 			}
 			if (!pmap_demote_pde(kernel_pmap, pde, tmpva))
 				return (ENOMEM);
 		}
 		pte = pmap_pde_to_pte(pde, tmpva);
 		if (*pte == 0)
 			return (EINVAL);
 		tmpva += PAGE_SIZE;
 	}
 	error = 0;
 
 	/*
 	 * Ok, all the pages exist, so run through them updating their
 	 * cache mode if required.
 	 */
 	pa_start = pa_end = 0;
 	for (tmpva = base; tmpva < base + size; ) {
 		pdpe = pmap_pdpe(kernel_pmap, tmpva);
 		if (*pdpe & PG_PS) {
 			if ((*pdpe & X86_PG_PDE_CACHE) != cache_bits_pde) {
 				pmap_pde_attr(pdpe, cache_bits_pde,
 				    X86_PG_PDE_CACHE);
 				changed = TRUE;
 			}
 			if (tmpva >= VM_MIN_KERNEL_ADDRESS) {
 				if (pa_start == pa_end) {
 					/* Start physical address run. */
 					pa_start = *pdpe & PG_PS_FRAME;
 					pa_end = pa_start + NBPDP;
 				} else if (pa_end == (*pdpe & PG_PS_FRAME))
 					pa_end += NBPDP;
 				else {
 					/* Run ended, update direct map. */
 					error = pmap_change_attr_locked(
 					    PHYS_TO_DMAP(pa_start),
 					    pa_end - pa_start, mode);
 					if (error != 0)
 						break;
 					/* Start physical address run. */
 					pa_start = *pdpe & PG_PS_FRAME;
 					pa_end = pa_start + NBPDP;
 				}
 			}
 			tmpva = trunc_1gpage(tmpva) + NBPDP;
 			continue;
 		}
 		pde = pmap_pdpe_to_pde(pdpe, tmpva);
 		if (*pde & PG_PS) {
 			if ((*pde & X86_PG_PDE_CACHE) != cache_bits_pde) {
 				pmap_pde_attr(pde, cache_bits_pde,
 				    X86_PG_PDE_CACHE);
 				changed = TRUE;
 			}
 			if (tmpva >= VM_MIN_KERNEL_ADDRESS) {
 				if (pa_start == pa_end) {
 					/* Start physical address run. */
 					pa_start = *pde & PG_PS_FRAME;
 					pa_end = pa_start + NBPDR;
 				} else if (pa_end == (*pde & PG_PS_FRAME))
 					pa_end += NBPDR;
 				else {
 					/* Run ended, update direct map. */
 					error = pmap_change_attr_locked(
 					    PHYS_TO_DMAP(pa_start),
 					    pa_end - pa_start, mode);
 					if (error != 0)
 						break;
 					/* Start physical address run. */
 					pa_start = *pde & PG_PS_FRAME;
 					pa_end = pa_start + NBPDR;
 				}
 			}
 			tmpva = trunc_2mpage(tmpva) + NBPDR;
 		} else {
 			pte = pmap_pde_to_pte(pde, tmpva);
 			if ((*pte & X86_PG_PTE_CACHE) != cache_bits_pte) {
 				pmap_pte_attr(pte, cache_bits_pte,
 				    X86_PG_PTE_CACHE);
 				changed = TRUE;
 			}
 			if (tmpva >= VM_MIN_KERNEL_ADDRESS) {
 				if (pa_start == pa_end) {
 					/* Start physical address run. */
 					pa_start = *pte & PG_FRAME;
 					pa_end = pa_start + PAGE_SIZE;
 				} else if (pa_end == (*pte & PG_FRAME))
 					pa_end += PAGE_SIZE;
 				else {
 					/* Run ended, update direct map. */
 					error = pmap_change_attr_locked(
 					    PHYS_TO_DMAP(pa_start),
 					    pa_end - pa_start, mode);
 					if (error != 0)
 						break;
 					/* Start physical address run. */
 					pa_start = *pte & PG_FRAME;
 					pa_end = pa_start + PAGE_SIZE;
 				}
 			}
 			tmpva += PAGE_SIZE;
 		}
 	}
 	if (error == 0 && pa_start != pa_end)
 		error = pmap_change_attr_locked(PHYS_TO_DMAP(pa_start),
 		    pa_end - pa_start, mode);
 
 	/*
 	 * Flush CPU caches if required to make sure any data isn't cached that
 	 * shouldn't be, etc.
 	 */
 	if (changed) {
 		pmap_invalidate_range(kernel_pmap, base, tmpva);
 		pmap_invalidate_cache_range(base, tmpva, FALSE);
 	}
 	return (error);
 }
 
 /*
  * Demotes any mapping within the direct map region that covers more than the
  * specified range of physical addresses.  This range's size must be a power
  * of two and its starting address must be a multiple of its size.  Since the
  * demotion does not change any attributes of the mapping, a TLB invalidation
  * is not mandatory.  The caller may, however, request a TLB invalidation.
  */
 void
 pmap_demote_DMAP(vm_paddr_t base, vm_size_t len, boolean_t invalidate)
 {
 	pdp_entry_t *pdpe;
 	pd_entry_t *pde;
 	vm_offset_t va;
 	boolean_t changed;
 
 	if (len == 0)
 		return;
 	KASSERT(powerof2(len), ("pmap_demote_DMAP: len is not a power of 2"));
 	KASSERT((base & (len - 1)) == 0,
 	    ("pmap_demote_DMAP: base is not a multiple of len"));
 	if (len < NBPDP && base < dmaplimit) {
 		va = PHYS_TO_DMAP(base);
 		changed = FALSE;
 		PMAP_LOCK(kernel_pmap);
 		pdpe = pmap_pdpe(kernel_pmap, va);
 		if ((*pdpe & X86_PG_V) == 0)
 			panic("pmap_demote_DMAP: invalid PDPE");
 		if ((*pdpe & PG_PS) != 0) {
 			if (!pmap_demote_pdpe(kernel_pmap, pdpe, va))
 				panic("pmap_demote_DMAP: PDPE failed");
 			changed = TRUE;
 		}
 		if (len < NBPDR) {
 			pde = pmap_pdpe_to_pde(pdpe, va);
 			if ((*pde & X86_PG_V) == 0)
 				panic("pmap_demote_DMAP: invalid PDE");
 			if ((*pde & PG_PS) != 0) {
 				if (!pmap_demote_pde(kernel_pmap, pde, va))
 					panic("pmap_demote_DMAP: PDE failed");
 				changed = TRUE;
 			}
 		}
 		if (changed && invalidate)
 			pmap_invalidate_page(kernel_pmap, va);
 		PMAP_UNLOCK(kernel_pmap);
 	}
 }
 
 /*
  * perform the pmap work for mincore
  */
 int
 pmap_mincore(pmap_t pmap, vm_offset_t addr, vm_paddr_t *locked_pa)
 {
 	pd_entry_t *pdep;
 	pt_entry_t pte, PG_A, PG_M, PG_RW, PG_V;
 	vm_paddr_t pa;
 	int val;
 
 	PG_A = pmap_accessed_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 
 	PMAP_LOCK(pmap);
 retry:
 	pdep = pmap_pde(pmap, addr);
 	if (pdep != NULL && (*pdep & PG_V)) {
 		if (*pdep & PG_PS) {
 			pte = *pdep;
 			/* Compute the physical address of the 4KB page. */
 			pa = ((*pdep & PG_PS_FRAME) | (addr & PDRMASK)) &
 			    PG_FRAME;
 			val = MINCORE_SUPER;
 		} else {
 			pte = *pmap_pde_to_pte(pdep, addr);
 			pa = pte & PG_FRAME;
 			val = 0;
 		}
 	} else {
 		pte = 0;
 		pa = 0;
 		val = 0;
 	}
 	if ((pte & PG_V) != 0) {
 		val |= MINCORE_INCORE;
 		if ((pte & (PG_M | PG_RW)) == (PG_M | PG_RW))
 			val |= MINCORE_MODIFIED | MINCORE_MODIFIED_OTHER;
 		if ((pte & PG_A) != 0)
 			val |= MINCORE_REFERENCED | MINCORE_REFERENCED_OTHER;
 	}
 	if ((val & (MINCORE_MODIFIED_OTHER | MINCORE_REFERENCED_OTHER)) !=
 	    (MINCORE_MODIFIED_OTHER | MINCORE_REFERENCED_OTHER) &&
 	    (pte & (PG_MANAGED | PG_V)) == (PG_MANAGED | PG_V)) {
 		/* Ensure that "PHYS_TO_VM_PAGE(pa)->object" doesn't change. */
 		if (vm_page_pa_tryrelock(pmap, pa, locked_pa))
 			goto retry;
 	} else
 		PA_UNLOCK_COND(*locked_pa);
 	PMAP_UNLOCK(pmap);
 	return (val);
 }
 
+static uint64_t
+pmap_pcid_alloc(pmap_t pmap, u_int cpuid)
+{
+	uint32_t gen, new_gen, pcid_next;
+
+	CRITICAL_ASSERT(curthread);
+	gen = PCPU_GET(pcid_gen);
+	if (pmap->pm_pcids[cpuid].pm_pcid == PMAP_PCID_KERN ||
+	    pmap->pm_pcids[cpuid].pm_gen == gen)
+		return (CR3_PCID_SAVE);
+	pcid_next = PCPU_GET(pcid_next);
+	KASSERT(pcid_next <= PMAP_PCID_OVERMAX, ("cpu %d pcid_next %#x",
+	    cpuid, pcid_next));
+	if (pcid_next == PMAP_PCID_OVERMAX) {
+		new_gen = gen + 1;
+		if (new_gen == 0)
+			new_gen = 1;
+		PCPU_SET(pcid_gen, new_gen);
+		pcid_next = PMAP_PCID_KERN + 1;
+	} else {
+		new_gen = gen;
+	}
+	pmap->pm_pcids[cpuid].pm_pcid = pcid_next;
+	pmap->pm_pcids[cpuid].pm_gen = new_gen;
+	PCPU_SET(pcid_next, pcid_next + 1);
+	return (0);
+}
+
 void
-pmap_activate(struct thread *td)
+pmap_activate_sw(struct thread *td)
 {
-	pmap_t	pmap, oldpmap;
-	u_int	cpuid;
+	pmap_t oldpmap, pmap;
+	uint64_t cached, cr3;
+	u_int cpuid;
 
-	critical_enter();
-	pmap = vmspace_pmap(td->td_proc->p_vmspace);
 	oldpmap = PCPU_GET(curpmap);
+	pmap = vmspace_pmap(td->td_proc->p_vmspace);
+	if (oldpmap == pmap)
+		return;
 	cpuid = PCPU_GET(cpuid);
 #ifdef SMP
-	CPU_CLR_ATOMIC(cpuid, &oldpmap->pm_active);
 	CPU_SET_ATOMIC(cpuid, &pmap->pm_active);
-	CPU_SET_ATOMIC(cpuid, &pmap->pm_save);
 #else
-	CPU_CLR(cpuid, &oldpmap->pm_active);
 	CPU_SET(cpuid, &pmap->pm_active);
-	CPU_SET(cpuid, &pmap->pm_save);
 #endif
-	td->td_pcb->pcb_cr3 = pmap->pm_cr3;
-	load_cr3(pmap->pm_cr3);
+	cr3 = rcr3();
+	if (pmap_pcid_enabled) {
+		cached = pmap_pcid_alloc(pmap, cpuid);
+		KASSERT(pmap->pm_pcids[cpuid].pm_pcid >= 0 &&
+		    pmap->pm_pcids[cpuid].pm_pcid < PMAP_PCID_OVERMAX,
+		    ("pmap %p cpu %d pcid %#x", pmap, cpuid,
+		    pmap->pm_pcids[cpuid].pm_pcid));
+		KASSERT(pmap->pm_pcids[cpuid].pm_pcid != PMAP_PCID_KERN ||
+		    pmap == kernel_pmap, ("non-kernel pmap %p cpu %d pcid %#x",
+		    pmap, cpuid, pmap->pm_pcids[cpuid].pm_pcid));
+		if (!cached || (cr3 & ~CR3_PCID_MASK) != pmap->pm_cr3) {
+			load_cr3(pmap->pm_cr3 | pmap->pm_pcids[cpuid].pm_pcid |
+			    cached);
+			if (cached)
+				PCPU_INC(pm_save_cnt);
+		}
+	} else if (cr3 != pmap->pm_cr3) {
+		load_cr3(pmap->pm_cr3);
+	}
 	PCPU_SET(curpmap, pmap);
+#ifdef SMP
+	CPU_CLR_ATOMIC(cpuid, &oldpmap->pm_active);
+#else
+	CPU_CLR(cpuid, &oldpmap->pm_active);
+#endif
+}
+
+void
+pmap_activate(struct thread *td)
+{
+
+	critical_enter();
+	pmap_activate_sw(td);
 	critical_exit();
 }
 
 void
 pmap_sync_icache(pmap_t pm, vm_offset_t va, vm_size_t sz)
 {
 }
 
 /*
  *	Increase the starting virtual address of the given mapping if a
  *	different alignment might result in more superpage mappings.
  */
 void
 pmap_align_superpage(vm_object_t object, vm_ooffset_t offset,
     vm_offset_t *addr, vm_size_t size)
 {
 	vm_offset_t superpage_offset;
 
 	if (size < NBPDR)
 		return;
 	if (object != NULL && (object->flags & OBJ_COLORED) != 0)
 		offset += ptoa(object->pg_color);
 	superpage_offset = offset & PDRMASK;
 	if (size - ((NBPDR - superpage_offset) & PDRMASK) < NBPDR ||
 	    (*addr & PDRMASK) == superpage_offset)
 		return;
 	if ((*addr & PDRMASK) < superpage_offset)
 		*addr = (*addr & ~PDRMASK) + superpage_offset;
 	else
 		*addr = ((*addr + PDRMASK) & ~PDRMASK) + superpage_offset;
 }
 
 #ifdef INVARIANTS
 static unsigned long num_dirty_emulations;
 SYSCTL_ULONG(_vm_pmap, OID_AUTO, num_dirty_emulations, CTLFLAG_RW,
 	     &num_dirty_emulations, 0, NULL);
 
 static unsigned long num_accessed_emulations;
 SYSCTL_ULONG(_vm_pmap, OID_AUTO, num_accessed_emulations, CTLFLAG_RW,
 	     &num_accessed_emulations, 0, NULL);
 
 static unsigned long num_superpage_accessed_emulations;
 SYSCTL_ULONG(_vm_pmap, OID_AUTO, num_superpage_accessed_emulations, CTLFLAG_RW,
 	     &num_superpage_accessed_emulations, 0, NULL);
 
 static unsigned long ad_emulation_superpage_promotions;
 SYSCTL_ULONG(_vm_pmap, OID_AUTO, ad_emulation_superpage_promotions, CTLFLAG_RW,
 	     &ad_emulation_superpage_promotions, 0, NULL);
 #endif	/* INVARIANTS */
 
 int
 pmap_emulate_accessed_dirty(pmap_t pmap, vm_offset_t va, int ftype)
 {
 	int rv;
 	struct rwlock *lock;
 	vm_page_t m, mpte;
 	pd_entry_t *pde;
 	pt_entry_t *pte, PG_A, PG_M, PG_RW, PG_V;
 	boolean_t pv_lists_locked;
 
 	KASSERT(ftype == VM_PROT_READ || ftype == VM_PROT_WRITE,
 	    ("pmap_emulate_accessed_dirty: invalid fault type %d", ftype));
 
 	if (!pmap_emulate_ad_bits(pmap))
 		return (-1);
 
 	PG_A = pmap_accessed_bit(pmap);
 	PG_M = pmap_modified_bit(pmap);
 	PG_V = pmap_valid_bit(pmap);
 	PG_RW = pmap_rw_bit(pmap);
 
 	rv = -1;
 	lock = NULL;
 	pv_lists_locked = FALSE;
 retry:
 	PMAP_LOCK(pmap);
 
 	pde = pmap_pde(pmap, va);
 	if (pde == NULL || (*pde & PG_V) == 0)
 		goto done;
 
 	if ((*pde & PG_PS) != 0) {
 		if (ftype == VM_PROT_READ) {
 #ifdef INVARIANTS
 			atomic_add_long(&num_superpage_accessed_emulations, 1);
 #endif
 			*pde |= PG_A;
 			rv = 0;
 		}
 		goto done;
 	}
 
 	pte = pmap_pde_to_pte(pde, va);
 	if ((*pte & PG_V) == 0)
 		goto done;
 
 	if (ftype == VM_PROT_WRITE) {
 		if ((*pte & PG_RW) == 0)
 			goto done;
 		/*
 		 * Set the modified and accessed bits simultaneously.
 		 *
 		 * Intel EPT PTEs that do software emulation of A/D bits map
 		 * PG_A and PG_M to EPT_PG_READ and EPT_PG_WRITE respectively.
 		 * An EPT misconfiguration is triggered if the PTE is writable
 		 * but not readable (WR=10). This is avoided by setting PG_A
 		 * and PG_M simultaneously.
 		 */
 		*pte |= PG_M | PG_A;
 	} else {
 		*pte |= PG_A;
 	}
 
 	/* try to promote the mapping */
 	if (va < VM_MAXUSER_ADDRESS)
 		mpte = PHYS_TO_VM_PAGE(*pde & PG_FRAME);
 	else
 		mpte = NULL;
 
 	m = PHYS_TO_VM_PAGE(*pte & PG_FRAME);
 
 	if ((mpte == NULL || mpte->wire_count == NPTEPG) &&
 	    pmap_ps_enabled(pmap) &&
 	    (m->flags & PG_FICTITIOUS) == 0 &&
 	    vm_reserv_level_iffullpop(m) == 0) {
 		if (!pv_lists_locked) {
 			pv_lists_locked = TRUE;
 			if (!rw_try_rlock(&pvh_global_lock)) {
 				PMAP_UNLOCK(pmap);
 				rw_rlock(&pvh_global_lock);
 				goto retry;
 			}
 		}
 		pmap_promote_pde(pmap, pde, va, &lock);
 #ifdef INVARIANTS
 		atomic_add_long(&ad_emulation_superpage_promotions, 1);
 #endif
 	}
 #ifdef INVARIANTS
 	if (ftype == VM_PROT_WRITE)
 		atomic_add_long(&num_dirty_emulations, 1);
 	else
 		atomic_add_long(&num_accessed_emulations, 1);
 #endif
 	rv = 0;		/* success */
 done:
 	if (lock != NULL)
 		rw_wunlock(lock);
 	if (pv_lists_locked)
 		rw_runlock(&pvh_global_lock);
 	PMAP_UNLOCK(pmap);
 	return (rv);
 }
 
 void
 pmap_get_mapping(pmap_t pmap, vm_offset_t va, uint64_t *ptr, int *num)
 {
 	pml4_entry_t *pml4;
 	pdp_entry_t *pdp;
 	pd_entry_t *pde;
 	pt_entry_t *pte, PG_V;
 	int idx;
 
 	idx = 0;
 	PG_V = pmap_valid_bit(pmap);
 	PMAP_LOCK(pmap);
 
 	pml4 = pmap_pml4e(pmap, va);
 	ptr[idx++] = *pml4;
 	if ((*pml4 & PG_V) == 0)
 		goto done;
 
 	pdp = pmap_pml4e_to_pdpe(pml4, va);
 	ptr[idx++] = *pdp;
 	if ((*pdp & PG_V) == 0 || (*pdp & PG_PS) != 0)
 		goto done;
 
 	pde = pmap_pdpe_to_pde(pdp, va);
 	ptr[idx++] = *pde;
 	if ((*pde & PG_V) == 0 || (*pde & PG_PS) != 0)
 		goto done;
 
 	pte = pmap_pde_to_pte(pde, va);
 	ptr[idx++] = *pte;
 
 done:
 	PMAP_UNLOCK(pmap);
 	*num = idx;
 }
 
 /**
  * Get the kernel virtual address of a set of physical pages. If there are
  * physical addresses not covered by the DMAP perform a transient mapping
  * that will be removed when calling pmap_unmap_io_transient.
  *
  * \param page        The pages the caller wishes to obtain the virtual
  *                    address on the kernel memory map.
  * \param vaddr       On return contains the kernel virtual memory address
  *                    of the pages passed in the page parameter.
  * \param count       Number of pages passed in.
  * \param can_fault   TRUE if the thread using the mapped pages can take
  *                    page faults, FALSE otherwise.
  *
  * \returns TRUE if the caller must call pmap_unmap_io_transient when
  *          finished or FALSE otherwise.
  *
  */
 boolean_t
 pmap_map_io_transient(vm_page_t page[], vm_offset_t vaddr[], int count,
     boolean_t can_fault)
 {
 	vm_paddr_t paddr;
 	boolean_t needs_mapping;
 	pt_entry_t *pte;
 	int cache_bits, error, i;
 
 	/*
 	 * Allocate any KVA space that we need, this is done in a separate
 	 * loop to prevent calling vmem_alloc while pinned.
 	 */
 	needs_mapping = FALSE;
 	for (i = 0; i < count; i++) {
 		paddr = VM_PAGE_TO_PHYS(page[i]);
 		if (__predict_false(paddr >= dmaplimit)) {
 			error = vmem_alloc(kernel_arena, PAGE_SIZE,
 			    M_BESTFIT | M_WAITOK, &vaddr[i]);
 			KASSERT(error == 0, ("vmem_alloc failed: %d", error));
 			needs_mapping = TRUE;
 		} else {
 			vaddr[i] = PHYS_TO_DMAP(paddr);
 		}
 	}
 
 	/* Exit early if everything is covered by the DMAP */
 	if (!needs_mapping)
 		return (FALSE);
 
 	/*
 	 * NB:  The sequence of updating a page table followed by accesses
 	 * to the corresponding pages used in the !DMAP case is subject to
 	 * the situation described in the "AMD64 Architecture Programmer's
 	 * Manual Volume 2: System Programming" rev. 3.23, "7.3.1 Special
 	 * Coherency Considerations".  Therefore, issuing the INVLPG right
 	 * after modifying the PTE bits is crucial.
 	 */
 	if (!can_fault)
 		sched_pin();
 	for (i = 0; i < count; i++) {
 		paddr = VM_PAGE_TO_PHYS(page[i]);
 		if (paddr >= dmaplimit) {
 			if (can_fault) {
 				/*
 				 * Slow path, since we can get page faults
 				 * while mappings are active don't pin the
 				 * thread to the CPU and instead add a global
 				 * mapping visible to all CPUs.
 				 */
 				pmap_qenter(vaddr[i], &page[i], 1);
 			} else {
 				pte = vtopte(vaddr[i]);
 				cache_bits = pmap_cache_bits(kernel_pmap,
 				    page[i]->md.pat_mode, 0);
 				pte_store(pte, paddr | X86_PG_RW | X86_PG_V |
 				    cache_bits);
 				invlpg(vaddr[i]);
 			}
 		}
 	}
 
 	return (needs_mapping);
 }
 
 void
 pmap_unmap_io_transient(vm_page_t page[], vm_offset_t vaddr[], int count,
     boolean_t can_fault)
 {
 	vm_paddr_t paddr;
 	int i;
 
 	if (!can_fault)
 		sched_unpin();
 	for (i = 0; i < count; i++) {
 		paddr = VM_PAGE_TO_PHYS(page[i]);
 		if (paddr >= dmaplimit) {
 			if (can_fault)
 				pmap_qremove(vaddr[i], 1);
 			vmem_free(kernel_arena, vaddr[i], PAGE_SIZE);
 		}
 	}
 }
 
 #include "opt_ddb.h"
 #ifdef DDB
 #include <ddb/ddb.h>
 
 DB_SHOW_COMMAND(pte, pmap_print_pte)
 {
 	pmap_t pmap;
 	pml4_entry_t *pml4;
 	pdp_entry_t *pdp;
 	pd_entry_t *pde;
 	pt_entry_t *pte, PG_V;
 	vm_offset_t va;
 
 	if (have_addr) {
 		va = (vm_offset_t)addr;
 		pmap = PCPU_GET(curpmap); /* XXX */
 	} else {
 		db_printf("show pte addr\n");
 		return;
 	}
 	PG_V = pmap_valid_bit(pmap);
 	pml4 = pmap_pml4e(pmap, va);
 	db_printf("VA %#016lx pml4e %#016lx", va, *pml4);
 	if ((*pml4 & PG_V) == 0) {
 		db_printf("\n");
 		return;
 	}
 	pdp = pmap_pml4e_to_pdpe(pml4, va);
 	db_printf(" pdpe %#016lx", *pdp);
 	if ((*pdp & PG_V) == 0 || (*pdp & PG_PS) != 0) {
 		db_printf("\n");
 		return;
 	}
 	pde = pmap_pdpe_to_pde(pdp, va);
 	db_printf(" pde %#016lx", *pde);
 	if ((*pde & PG_V) == 0 || (*pde & PG_PS) != 0) {
 		db_printf("\n");
 		return;
 	}
 	pte = pmap_pde_to_pte(pde, va);
 	db_printf(" pte %#016lx\n", *pte);
 }
 
 DB_SHOW_COMMAND(phys2dmap, pmap_phys2dmap)
 {
 	vm_paddr_t a;
 
 	if (have_addr) {
 		a = (vm_paddr_t)addr;
 		db_printf("0x%jx\n", (uintmax_t)PHYS_TO_DMAP(a));
 	} else {
 		db_printf("show phys2dmap addr\n");
 	}
 }
 #endif
Index: projects/release-arm-redux/sys/amd64/amd64/vm_machdep.c
===================================================================
--- projects/release-arm-redux/sys/amd64/amd64/vm_machdep.c	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/amd64/vm_machdep.c	(revision 282692)
@@ -1,734 +1,732 @@
 /*-
  * Copyright (c) 1982, 1986 The Regents of the University of California.
  * Copyright (c) 1989, 1990 William Jolitz
  * Copyright (c) 1994 John Dyson
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department, and William Jolitz.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: @(#)vm_machdep.c	7.3 (Berkeley) 5/13/91
  *	Utah $Hdr: vm_machdep.c 1.16.1.1 89/06/23$
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_isa.h"
 #include "opt_cpu.h"
 #include "opt_compat.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bio.h>
 #include <sys/buf.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mbuf.h>
 #include <sys/mutex.h>
 #include <sys/pioctl.h>
 #include <sys/proc.h>
 #include <sys/smp.h>
 #include <sys/sysctl.h>
 #include <sys/sysent.h>
 #include <sys/unistd.h>
 #include <sys/vnode.h>
 #include <sys/vmmeter.h>
 
 #include <machine/cpu.h>
 #include <machine/md_var.h>
 #include <machine/pcb.h>
 #include <machine/smp.h>
 #include <machine/specialreg.h>
 #include <machine/tss.h>
 
 #include <vm/vm.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_param.h>
 
 #include <isa/isareg.h>
 
 static void	cpu_reset_real(void);
 #ifdef SMP
 static void	cpu_reset_proxy(void);
 static u_int	cpu_reset_proxyid;
 static volatile u_int	cpu_reset_proxy_active;
 #endif
 
 _Static_assert(OFFSETOF_CURTHREAD == offsetof(struct pcpu, pc_curthread),
     "OFFSETOF_CURTHREAD does not correspond with offset of pc_curthread.");
 _Static_assert(OFFSETOF_CURPCB == offsetof(struct pcpu, pc_curpcb),
     "OFFSETOF_CURPCB does not correspond with offset of pc_curpcb.");
 
 struct savefpu *
 get_pcb_user_save_td(struct thread *td)
 {
 	vm_offset_t p;
 
 	p = td->td_kstack + td->td_kstack_pages * PAGE_SIZE -
 	    cpu_max_ext_state_size;
 	KASSERT((p % 64) == 0, ("Unaligned pcb_user_save area"));
 	return ((struct savefpu *)p);
 }
 
 struct savefpu *
 get_pcb_user_save_pcb(struct pcb *pcb)
 {
 	vm_offset_t p;
 
 	p = (vm_offset_t)(pcb + 1);
 	return ((struct savefpu *)p);
 }
 
 struct pcb *
 get_pcb_td(struct thread *td)
 {
 	vm_offset_t p;
 
 	p = td->td_kstack + td->td_kstack_pages * PAGE_SIZE -
 	    cpu_max_ext_state_size - sizeof(struct pcb);
 	return ((struct pcb *)p);
 }
 
 void *
 alloc_fpusave(int flags)
 {
 	void *res;
 	struct savefpu_ymm *sf;
 
 	res = malloc(cpu_max_ext_state_size, M_DEVBUF, flags);
 	if (use_xsave) {
 		sf = (struct savefpu_ymm *)res;
 		bzero(&sf->sv_xstate.sx_hd, sizeof(sf->sv_xstate.sx_hd));
 		sf->sv_xstate.sx_hd.xstate_bv = xsave_mask;
 	}
 	return (res);
 }
 
 /*
  * Finish a fork operation, with process p2 nearly set up.
  * Copy and update the pcb, set up the stack so that the child is
  * ready to run and return to user mode.
  */
 void
 cpu_fork(td1, p2, td2, flags)
 	register struct thread *td1;
 	register struct proc *p2;
 	struct thread *td2;
 	int flags;
 {
 	register struct proc *p1;
 	struct pcb *pcb2;
 	struct mdproc *mdp1, *mdp2;
 	struct proc_ldt *pldt;
 	pmap_t pmap2;
 
 	p1 = td1->td_proc;
 	if ((flags & RFPROC) == 0) {
 		if ((flags & RFMEM) == 0) {
 			/* unshare user LDT */
 			mdp1 = &p1->p_md;
 			mtx_lock(&dt_lock);
 			if ((pldt = mdp1->md_ldt) != NULL &&
 			    pldt->ldt_refcnt > 1 &&
 			    user_ldt_alloc(p1, 1) == NULL)
 				panic("could not copy LDT");
 			mtx_unlock(&dt_lock);
 		}
 		return;
 	}
 
 	/* Ensure that td1's pcb is up to date. */
 	fpuexit(td1);
 
 	/* Point the pcb to the top of the stack */
 	pcb2 = get_pcb_td(td2);
 	td2->td_pcb = pcb2;
 
 	/* Copy td1's pcb */
 	bcopy(td1->td_pcb, pcb2, sizeof(*pcb2));
 
 	/* Properly initialize pcb_save */
 	pcb2->pcb_save = get_pcb_user_save_pcb(pcb2);
 	bcopy(get_pcb_user_save_td(td1), get_pcb_user_save_pcb(pcb2),
 	    cpu_max_ext_state_size);
 
 	/* Point mdproc and then copy over td1's contents */
 	mdp2 = &p2->p_md;
 	bcopy(&p1->p_md, mdp2, sizeof(*mdp2));
 
 	/*
 	 * Create a new fresh stack for the new process.
 	 * Copy the trap frame for the return to user mode as if from a
 	 * syscall.  This copies most of the user mode register values.
 	 */
 	td2->td_frame = (struct trapframe *)td2->td_pcb - 1;
 	bcopy(td1->td_frame, td2->td_frame, sizeof(struct trapframe));
 
 	td2->td_frame->tf_rax = 0;		/* Child returns zero */
 	td2->td_frame->tf_rflags &= ~PSL_C;	/* success */
 	td2->td_frame->tf_rdx = 1;
 
 	/*
 	 * If the parent process has the trap bit set (i.e. a debugger had
 	 * single stepped the process to the system call), we need to clear
 	 * the trap flag from the new frame unless the debugger had set PF_FORK
 	 * on the parent.  Otherwise, the child will receive a (likely
 	 * unexpected) SIGTRAP when it executes the first instruction after
 	 * returning to userland.
 	 */
 	if ((p1->p_pfsflags & PF_FORK) == 0)
 		td2->td_frame->tf_rflags &= ~PSL_T;
 
 	/*
 	 * Set registers for trampoline to user mode.  Leave space for the
 	 * return address on stack.  These are the kernel mode register values.
 	 */
 	pmap2 = vmspace_pmap(p2->p_vmspace);
-	pcb2->pcb_cr3 = pmap2->pm_cr3;
 	pcb2->pcb_r12 = (register_t)fork_return;	/* fork_trampoline argument */
 	pcb2->pcb_rbp = 0;
 	pcb2->pcb_rsp = (register_t)td2->td_frame - sizeof(void *);
 	pcb2->pcb_rbx = (register_t)td2;		/* fork_trampoline argument */
 	pcb2->pcb_rip = (register_t)fork_trampoline;
 	/*-
 	 * pcb2->pcb_dr*:	cloned above.
 	 * pcb2->pcb_savefpu:	cloned above.
 	 * pcb2->pcb_flags:	cloned above.
 	 * pcb2->pcb_onfault:	cloned above (always NULL here?).
 	 * pcb2->pcb_[fg]sbase:	cloned above
 	 */
 
 	/* Setup to release spin count in fork_exit(). */
 	td2->td_md.md_spinlock_count = 1;
 	td2->td_md.md_saved_flags = PSL_KERNEL | PSL_I;
 
 	/* As on i386, do not copy the I/O permission bitmap. */
 	pcb2->pcb_tssp = NULL;
 
 	/* New segment registers. */
 	set_pcb_flags(pcb2, PCB_FULL_IRET);
 
 	/* Copy the LDT, if necessary. */
 	mdp1 = &td1->td_proc->p_md;
 	mdp2 = &p2->p_md;
 	mtx_lock(&dt_lock);
 	if (mdp1->md_ldt != NULL) {
 		if (flags & RFMEM) {
 			mdp1->md_ldt->ldt_refcnt++;
 			mdp2->md_ldt = mdp1->md_ldt;
 			bcopy(&mdp1->md_ldt_sd, &mdp2->md_ldt_sd, sizeof(struct
 			    system_segment_descriptor));
 		} else {
 			mdp2->md_ldt = NULL;
 			mdp2->md_ldt = user_ldt_alloc(p2, 0);
 			if (mdp2->md_ldt == NULL)
 				panic("could not copy LDT");
 			amd64_set_ldt_data(td2, 0, max_ldt_segment,
 			    (struct user_segment_descriptor *)
 			    mdp1->md_ldt->ldt_base);
 		}
 	} else
 		mdp2->md_ldt = NULL;
 	mtx_unlock(&dt_lock);
 
 	/*
 	 * Now, cpu_switch() can schedule the new process.
 	 * pcb_rsp is loaded pointing to the cpu_switch() stack frame
 	 * containing the return address when exiting cpu_switch.
 	 * This will normally be to fork_trampoline(), which will have
 	 * %rbx loaded with the new thread's pointer.  fork_trampoline()
 	 * will set up a stack to call fork_return(td, frame); to complete
 	 * the return to user-mode.
 	 */
 }
 
 /*
  * Intercept the return address from a freshly forked process that has NOT
  * been scheduled yet.
  *
  * This is needed to make kernel threads stay in kernel mode.
  */
 void
 cpu_set_fork_handler(td, func, arg)
 	struct thread *td;
 	void (*func)(void *);
 	void *arg;
 {
 	/*
 	 * Note that the trap frame follows the args, so the function
 	 * is really called like this:  func(arg, frame);
 	 */
 	td->td_pcb->pcb_r12 = (long) func;	/* function */
 	td->td_pcb->pcb_rbx = (long) arg;	/* first arg */
 }
 
 void
 cpu_exit(struct thread *td)
 {
 
 	/*
 	 * If this process has a custom LDT, release it.
 	 */
 	mtx_lock(&dt_lock);
 	if (td->td_proc->p_md.md_ldt != 0)
 		user_ldt_free(td);
 	else
 		mtx_unlock(&dt_lock);
 }
 
 void
 cpu_thread_exit(struct thread *td)
 {
 	struct pcb *pcb;
 
 	critical_enter();
 	if (td == PCPU_GET(fpcurthread))
 		fpudrop();
 	critical_exit();
 
 	pcb = td->td_pcb;
 
 	/* Disable any hardware breakpoints. */
 	if (pcb->pcb_flags & PCB_DBREGS) {
 		reset_dbregs();
 		clear_pcb_flags(pcb, PCB_DBREGS);
 	}
 }
 
 void
 cpu_thread_clean(struct thread *td)
 {
 	struct pcb *pcb;
 
 	pcb = td->td_pcb;
 
 	/*
 	 * Clean TSS/iomap
 	 */
 	if (pcb->pcb_tssp != NULL) {
 		kmem_free(kernel_arena, (vm_offset_t)pcb->pcb_tssp,
 		    ctob(IOPAGES + 1));
 		pcb->pcb_tssp = NULL;
 	}
 }
 
 void
 cpu_thread_swapin(struct thread *td)
 {
 }
 
 void
 cpu_thread_swapout(struct thread *td)
 {
 }
 
 void
 cpu_thread_alloc(struct thread *td)
 {
 	struct pcb *pcb;
 	struct xstate_hdr *xhdr;
 
 	td->td_pcb = pcb = get_pcb_td(td);
 	td->td_frame = (struct trapframe *)pcb - 1;
 	pcb->pcb_save = get_pcb_user_save_pcb(pcb);
 	if (use_xsave) {
 		xhdr = (struct xstate_hdr *)(pcb->pcb_save + 1);
 		bzero(xhdr, sizeof(*xhdr));
 		xhdr->xstate_bv = xsave_mask;
 	}
 }
 
 void
 cpu_thread_free(struct thread *td)
 {
 
 	cpu_thread_clean(td);
 }
 
 void
 cpu_set_syscall_retval(struct thread *td, int error)
 {
 
 	switch (error) {
 	case 0:
 		td->td_frame->tf_rax = td->td_retval[0];
 		td->td_frame->tf_rdx = td->td_retval[1];
 		td->td_frame->tf_rflags &= ~PSL_C;
 		break;
 
 	case ERESTART:
 		/*
 		 * Reconstruct pc, we know that 'syscall' is 2 bytes,
 		 * lcall $X,y is 7 bytes, int 0x80 is 2 bytes.
 		 * We saved this in tf_err.
 		 * %r10 (which was holding the value of %rcx) is restored
 		 * for the next iteration.
 		 * %r10 restore is only required for freebsd/amd64 processes,
 		 * but is harmless for any ia32 ABI.
 		 *
 		 * Require full context restore to get the arguments
 		 * in the registers reloaded at return to usermode.
 		 */
 		td->td_frame->tf_rip -= td->td_frame->tf_err;
 		td->td_frame->tf_r10 = td->td_frame->tf_rcx;
 		set_pcb_flags(td->td_pcb, PCB_FULL_IRET);
 		break;
 
 	case EJUSTRETURN:
 		break;
 
 	default:
 		if (td->td_proc->p_sysent->sv_errsize) {
 			if (error >= td->td_proc->p_sysent->sv_errsize)
 				error = -1;	/* XXX */
 			else
 				error = td->td_proc->p_sysent->sv_errtbl[error];
 		}
 		td->td_frame->tf_rax = error;
 		td->td_frame->tf_rflags |= PSL_C;
 		break;
 	}
 }
 
 /*
  * Initialize machine state (pcb and trap frame) for a new thread about to
  * upcall. Put enough state in the new thread's PCB to get it to go back to
  * userret(), where we can intercept it again to set the return (upcall)
  * address and stack, along with those from upcalls that are from other
  * sources such as those generated in thread_userret() itself.
  */
 void
 cpu_set_upcall(struct thread *td, struct thread *td0)
 {
 	struct pcb *pcb2;
 
 	/* Point the pcb to the top of the stack. */
 	pcb2 = td->td_pcb;
 
 	/*
 	 * Copy the upcall pcb.  This loads kernel regs.
 	 * Those not loaded individually below get their default
 	 * values here.
 	 */
 	bcopy(td0->td_pcb, pcb2, sizeof(*pcb2));
 	clear_pcb_flags(pcb2, PCB_FPUINITDONE | PCB_USERFPUINITDONE |
 	    PCB_KERNFPU);
 	pcb2->pcb_save = get_pcb_user_save_pcb(pcb2);
 	bcopy(get_pcb_user_save_td(td0), pcb2->pcb_save,
 	    cpu_max_ext_state_size);
 	set_pcb_flags(pcb2, PCB_FULL_IRET);
 
 	/*
 	 * Create a new fresh stack for the new thread.
 	 */
 	bcopy(td0->td_frame, td->td_frame, sizeof(struct trapframe));
 
 	/* If the current thread has the trap bit set (i.e. a debugger had
 	 * single stepped the process to the system call), we need to clear
 	 * the trap flag from the new frame. Otherwise, the new thread will
 	 * receive a (likely unexpected) SIGTRAP when it executes the first
 	 * instruction after returning to userland.
 	 */
 	td->td_frame->tf_rflags &= ~PSL_T;
 
 	/*
 	 * Set registers for trampoline to user mode.  Leave space for the
 	 * return address on stack.  These are the kernel mode register values.
 	 */
 	pcb2->pcb_r12 = (register_t)fork_return;	    /* trampoline arg */
 	pcb2->pcb_rbp = 0;
 	pcb2->pcb_rsp = (register_t)td->td_frame - sizeof(void *);	/* trampoline arg */
 	pcb2->pcb_rbx = (register_t)td;			    /* trampoline arg */
 	pcb2->pcb_rip = (register_t)fork_trampoline;
 	/*
 	 * If we didn't copy the pcb, we'd need to do the following registers:
-	 * pcb2->pcb_cr3:	cloned above.
 	 * pcb2->pcb_dr*:	cloned above.
 	 * pcb2->pcb_savefpu:	cloned above.
 	 * pcb2->pcb_onfault:	cloned above (always NULL here?).
 	 * pcb2->pcb_[fg]sbase: cloned above
 	 */
 
 	/* Setup to release spin count in fork_exit(). */
 	td->td_md.md_spinlock_count = 1;
 	td->td_md.md_saved_flags = PSL_KERNEL | PSL_I;
 }
 
 /*
  * Set the machine state for performing an upcall that has to
  * be done in thread_userret() so that those upcalls generated
  * in thread_userret() itself can be done as well.
  */
 void
 cpu_set_upcall_kse(struct thread *td, void (*entry)(void *), void *arg,
 	stack_t *stack)
 {
 
 	/*
 	 * Do any extra cleaning that needs to be done.
 	 * The thread may have optional components
 	 * that are not present in a fresh thread.
 	 * This may be a recycled thread so make it look
 	 * as though it's newly allocated.
 	 */
 	cpu_thread_clean(td);
 
 #ifdef COMPAT_FREEBSD32
 	if (SV_PROC_FLAG(td->td_proc, SV_ILP32)) {
 		/*
 		 * Set the trap frame to point at the beginning of the uts
 		 * function.
 		 */
 		td->td_frame->tf_rbp = 0;
 		td->td_frame->tf_rsp =
 		   (((uintptr_t)stack->ss_sp + stack->ss_size - 4) & ~0x0f) - 4;
 		td->td_frame->tf_rip = (uintptr_t)entry;
 
 		/*
 		 * Pass the address of the mailbox for this kse to the uts
 		 * function as a parameter on the stack.
 		 */
 		suword32((void *)(td->td_frame->tf_rsp + sizeof(int32_t)),
 		    (uint32_t)(uintptr_t)arg);
 
 		return;
 	}
 #endif
 
 	/*
 	 * Set the trap frame to point at the beginning of the uts
 	 * function.
 	 */
 	td->td_frame->tf_rbp = 0;
 	td->td_frame->tf_rsp =
 	    ((register_t)stack->ss_sp + stack->ss_size) & ~0x0f;
 	td->td_frame->tf_rsp -= 8;
 	td->td_frame->tf_rip = (register_t)entry;
 	td->td_frame->tf_ds = _udatasel;
 	td->td_frame->tf_es = _udatasel;
 	td->td_frame->tf_fs = _ufssel;
 	td->td_frame->tf_gs = _ugssel;
 	td->td_frame->tf_flags = TF_HASSEGS;
 
 	/*
 	 * Pass the address of the mailbox for this kse to the uts
 	 * function as a parameter on the stack.
 	 */
 	td->td_frame->tf_rdi = (register_t)arg;
 }
 
 int
 cpu_set_user_tls(struct thread *td, void *tls_base)
 {
 	struct pcb *pcb;
 
 	if ((u_int64_t)tls_base >= VM_MAXUSER_ADDRESS)
 		return (EINVAL);
 
 	pcb = td->td_pcb;
 	set_pcb_flags(pcb, PCB_FULL_IRET);
 #ifdef COMPAT_FREEBSD32
 	if (SV_PROC_FLAG(td->td_proc, SV_ILP32)) {
 		pcb->pcb_gsbase = (register_t)tls_base;
 		return (0);
 	}
 #endif
 	pcb->pcb_fsbase = (register_t)tls_base;
 	return (0);
 }
 
 #ifdef SMP
 static void
 cpu_reset_proxy()
 {
 	cpuset_t tcrp;
 
 	cpu_reset_proxy_active = 1;
 	while (cpu_reset_proxy_active == 1)
 		ia32_pause(); /* Wait for other cpu to see that we've started */
 
 	CPU_SETOF(cpu_reset_proxyid, &tcrp);
 	stop_cpus(tcrp);
 	printf("cpu_reset_proxy: Stopped CPU %d\n", cpu_reset_proxyid);
 	DELAY(1000000);
 	cpu_reset_real();
 }
 #endif
 
 void
 cpu_reset()
 {
 #ifdef SMP
 	cpuset_t map;
 	u_int cnt;
 
 	if (smp_started) {
 		map = all_cpus;
 		CPU_CLR(PCPU_GET(cpuid), &map);
 		CPU_NAND(&map, &stopped_cpus);
 		if (!CPU_EMPTY(&map)) {
 			printf("cpu_reset: Stopping other CPUs\n");
 			stop_cpus(map);
 		}
 
 		if (PCPU_GET(cpuid) != 0) {
 			cpu_reset_proxyid = PCPU_GET(cpuid);
 			cpustop_restartfunc = cpu_reset_proxy;
 			cpu_reset_proxy_active = 0;
 			printf("cpu_reset: Restarting BSP\n");
 
 			/* Restart CPU #0. */
 			CPU_SETOF(0, &started_cpus);
 			wmb();
 
 			cnt = 0;
 			while (cpu_reset_proxy_active == 0 && cnt < 10000000) {
 				ia32_pause();
 				cnt++;	/* Wait for BSP to announce restart */
 			}
 			if (cpu_reset_proxy_active == 0)
 				printf("cpu_reset: Failed to restart BSP\n");
 			enable_intr();
 			cpu_reset_proxy_active = 2;
 
 			while (1)
 				ia32_pause();
 			/* NOTREACHED */
 		}
 
 		DELAY(1000000);
 	}
 #endif
 	cpu_reset_real();
 	/* NOTREACHED */
 }
 
 static void
 cpu_reset_real()
 {
 	struct region_descriptor null_idt;
 	int b;
 
 	disable_intr();
 
 	/*
 	 * Attempt to do a CPU reset via the keyboard controller.
 	 * Do not turn off GateA20, as any machine that fails
 	 * to do the reset here would then end up in no man's land.
 	 */
 	outb(IO_KBD + 4, 0xFE);
 	DELAY(500000);	/* wait 0.5 sec to see if that did it */
 
 	/*
 	 * Attempt to force a reset via the Reset Control register at
 	 * I/O port 0xcf9.  Bit 2 forces a system reset when it
 	 * transitions from 0 to 1.  Bit 1 selects the type of reset
 	 * to attempt: 0 selects a "soft" reset, and 1 selects a
 	 * "hard" reset.  We try a "hard" reset.  The first write sets
 	 * bit 1 to select a "hard" reset and clears bit 2.  The
 	 * second write forces a 0 -> 1 transition in bit 2 to trigger
 	 * a reset.
 	 */
 	outb(0xcf9, 0x2);
 	outb(0xcf9, 0x6);
 	DELAY(500000);  /* wait 0.5 sec to see if that did it */
 
 	/*
 	 * Attempt to force a reset via the Fast A20 and Init register
 	 * at I/O port 0x92.  Bit 1 serves as an alternate A20 gate.
 	 * Bit 0 asserts INIT# when set to 1.  We are careful to only
 	 * preserve bit 1 while setting bit 0.  We also must clear bit
 	 * 0 before setting it if it isn't already clear.
 	 */
 	b = inb(0x92);
 	if (b != 0xff) {
 		if ((b & 0x1) != 0)
 			outb(0x92, b & 0xfe);
 		outb(0x92, b | 0x1);
 		DELAY(500000);  /* wait 0.5 sec to see if that did it */
 	}
 
 	printf("No known reset method worked, attempting CPU shutdown\n");
 	DELAY(1000000);	/* wait 1 sec for printf to complete */
 
 	/* Wipe the IDT. */
 	null_idt.rd_limit = 0;
 	null_idt.rd_base = 0;
 	lidt(&null_idt);
 
 	/* "good night, sweet prince .... <THUNK!>" */
 	breakpoint();
 
 	/* NOTREACHED */
 	while(1);
 }
 
 /*
  * Software interrupt handler for queued VM system processing.
  */
 void
 swi_vm(void *dummy)
 {
 	if (busdma_swi_pending != 0)
 		busdma_swi();
 }
 
 /*
  * Tell whether this address is in some physical memory region.
  * Currently used by the kernel coredump code in order to avoid
  * dumping the ``ISA memory hole'' which could cause indefinite hangs,
  * or other unpredictable behaviour.
  */
 
 int
 is_physical_memory(vm_paddr_t addr)
 {
 
 #ifdef DEV_ISA
 	/* The ISA ``memory hole''. */
 	if (addr >= 0xa0000 && addr < 0x100000)
 		return 0;
 #endif
 
 	/*
 	 * stuff other tests for known memory-mapped devices (PCI?)
 	 * here
 	 */
 
 	return 1;
 }
Index: projects/release-arm-redux/sys/amd64/include/cpufunc.h
===================================================================
--- projects/release-arm-redux/sys/amd64/include/cpufunc.h	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/include/cpufunc.h	(revision 282692)
@@ -1,865 +1,864 @@
 /*-
  * Copyright (c) 2003 Peter Wemm.
  * Copyright (c) 1993 The Regents of the University of California.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 /*
  * Functions to provide access to special i386 instructions.
  * This is included in sys/systm.h, and that file should be
  * used in preference to this.
  */
 
 #ifndef _MACHINE_CPUFUNC_H_
 #define	_MACHINE_CPUFUNC_H_
 
 #ifndef _SYS_CDEFS_H_
 #error this file needs sys/cdefs.h as a prerequisite
 #endif
 
 struct region_descriptor;
 
 #define readb(va)	(*(volatile uint8_t *) (va))
 #define readw(va)	(*(volatile uint16_t *) (va))
 #define readl(va)	(*(volatile uint32_t *) (va))
 #define readq(va)	(*(volatile uint64_t *) (va))
 
 #define writeb(va, d)	(*(volatile uint8_t *) (va) = (d))
 #define writew(va, d)	(*(volatile uint16_t *) (va) = (d))
 #define writel(va, d)	(*(volatile uint32_t *) (va) = (d))
 #define writeq(va, d)	(*(volatile uint64_t *) (va) = (d))
 
 #if defined(__GNUCLIKE_ASM) && defined(__CC_SUPPORTS___INLINE)
 
 static __inline void
 breakpoint(void)
 {
 	__asm __volatile("int $3");
 }
 
 static __inline u_int
 bsfl(u_int mask)
 {
 	u_int	result;
 
 	__asm __volatile("bsfl %1,%0" : "=r" (result) : "rm" (mask));
 	return (result);
 }
 
 static __inline u_long
 bsfq(u_long mask)
 {
 	u_long	result;
 
 	__asm __volatile("bsfq %1,%0" : "=r" (result) : "rm" (mask));
 	return (result);
 }
 
 static __inline u_int
 bsrl(u_int mask)
 {
 	u_int	result;
 
 	__asm __volatile("bsrl %1,%0" : "=r" (result) : "rm" (mask));
 	return (result);
 }
 
 static __inline u_long
 bsrq(u_long mask)
 {
 	u_long	result;
 
 	__asm __volatile("bsrq %1,%0" : "=r" (result) : "rm" (mask));
 	return (result);
 }
 
 static __inline void
 clflush(u_long addr)
 {
 
 	__asm __volatile("clflush %0" : : "m" (*(char *)addr));
 }
 
 static __inline void
 clts(void)
 {
 
 	__asm __volatile("clts");
 }
 
 static __inline void
 disable_intr(void)
 {
 	__asm __volatile("cli" : : : "memory");
 }
 
 static __inline void
 do_cpuid(u_int ax, u_int *p)
 {
 	__asm __volatile("cpuid"
 			 : "=a" (p[0]), "=b" (p[1]), "=c" (p[2]), "=d" (p[3])
 			 :  "0" (ax));
 }
 
 static __inline void
 cpuid_count(u_int ax, u_int cx, u_int *p)
 {
 	__asm __volatile("cpuid"
 			 : "=a" (p[0]), "=b" (p[1]), "=c" (p[2]), "=d" (p[3])
 			 :  "0" (ax), "c" (cx));
 }
 
 static __inline void
 enable_intr(void)
 {
 	__asm __volatile("sti");
 }
 
 #ifdef _KERNEL
 
 #define	HAVE_INLINE_FFS
 #define        ffs(x)  __builtin_ffs(x)
 
 #define	HAVE_INLINE_FFSL
 
 static __inline int
 ffsl(long mask)
 {
 	return (mask == 0 ? mask : (int)bsfq((u_long)mask) + 1);
 }
 
 #define	HAVE_INLINE_FFSLL
 
 static __inline int
 ffsll(long long mask)
 {
 	return (ffsl((long)mask));
 }
 
 #define	HAVE_INLINE_FLS
 
 static __inline int
 fls(int mask)
 {
 	return (mask == 0 ? mask : (int)bsrl((u_int)mask) + 1);
 }
 
 #define	HAVE_INLINE_FLSL
 
 static __inline int
 flsl(long mask)
 {
 	return (mask == 0 ? mask : (int)bsrq((u_long)mask) + 1);
 }
 
 #define	HAVE_INLINE_FLSLL
 
 static __inline int
 flsll(long long mask)
 {
 	return (flsl((long)mask));
 }
 
 #endif /* _KERNEL */
 
 static __inline void
 halt(void)
 {
 	__asm __volatile("hlt");
 }
 
 static __inline u_char
 inb(u_int port)
 {
 	u_char	data;
 
 	__asm __volatile("inb %w1, %0" : "=a" (data) : "Nd" (port));
 	return (data);
 }
 
 static __inline u_int
 inl(u_int port)
 {
 	u_int	data;
 
 	__asm __volatile("inl %w1, %0" : "=a" (data) : "Nd" (port));
 	return (data);
 }
 
 static __inline void
 insb(u_int port, void *addr, size_t count)
 {
 	__asm __volatile("cld; rep; insb"
 			 : "+D" (addr), "+c" (count)
 			 : "d" (port)
 			 : "memory");
 }
 
 static __inline void
 insw(u_int port, void *addr, size_t count)
 {
 	__asm __volatile("cld; rep; insw"
 			 : "+D" (addr), "+c" (count)
 			 : "d" (port)
 			 : "memory");
 }
 
 static __inline void
 insl(u_int port, void *addr, size_t count)
 {
 	__asm __volatile("cld; rep; insl"
 			 : "+D" (addr), "+c" (count)
 			 : "d" (port)
 			 : "memory");
 }
 
 static __inline void
 invd(void)
 {
 	__asm __volatile("invd");
 }
 
 static __inline u_short
 inw(u_int port)
 {
 	u_short	data;
 
 	__asm __volatile("inw %w1, %0" : "=a" (data) : "Nd" (port));
 	return (data);
 }
 
 static __inline void
 outb(u_int port, u_char data)
 {
 	__asm __volatile("outb %0, %w1" : : "a" (data), "Nd" (port));
 }
 
 static __inline void
 outl(u_int port, u_int data)
 {
 	__asm __volatile("outl %0, %w1" : : "a" (data), "Nd" (port));
 }
 
 static __inline void
 outsb(u_int port, const void *addr, size_t count)
 {
 	__asm __volatile("cld; rep; outsb"
 			 : "+S" (addr), "+c" (count)
 			 : "d" (port));
 }
 
 static __inline void
 outsw(u_int port, const void *addr, size_t count)
 {
 	__asm __volatile("cld; rep; outsw"
 			 : "+S" (addr), "+c" (count)
 			 : "d" (port));
 }
 
 static __inline void
 outsl(u_int port, const void *addr, size_t count)
 {
 	__asm __volatile("cld; rep; outsl"
 			 : "+S" (addr), "+c" (count)
 			 : "d" (port));
 }
 
 static __inline void
 outw(u_int port, u_short data)
 {
 	__asm __volatile("outw %0, %w1" : : "a" (data), "Nd" (port));
 }
 
 static __inline u_long
 popcntq(u_long mask)
 {
 	u_long result;
 
 	__asm __volatile("popcntq %1,%0" : "=r" (result) : "rm" (mask));
 	return (result);
 }
 
 static __inline void
 lfence(void)
 {
 
 	__asm __volatile("lfence" : : : "memory");
 }
 
 static __inline void
 mfence(void)
 {
 
 	__asm __volatile("mfence" : : : "memory");
 }
 
 static __inline void
 ia32_pause(void)
 {
 	__asm __volatile("pause");
 }
 
 static __inline u_long
 read_rflags(void)
 {
 	u_long	rf;
 
 	__asm __volatile("pushfq; popq %0" : "=r" (rf));
 	return (rf);
 }
 
 static __inline uint64_t
 rdmsr(u_int msr)
 {
 	uint32_t low, high;
 
 	__asm __volatile("rdmsr" : "=a" (low), "=d" (high) : "c" (msr));
 	return (low | ((uint64_t)high << 32));
 }
 
 static __inline uint32_t
 rdmsr32(u_int msr)
 {
 	uint32_t low;
 
 	__asm __volatile("rdmsr" : "=a" (low) : "c" (msr) : "rdx");
 	return (low);
 }
 
 static __inline uint64_t
 rdpmc(u_int pmc)
 {
 	uint32_t low, high;
 
 	__asm __volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (pmc));
 	return (low | ((uint64_t)high << 32));
 }
 
 static __inline uint64_t
 rdtsc(void)
 {
 	uint32_t low, high;
 
 	__asm __volatile("rdtsc" : "=a" (low), "=d" (high));
 	return (low | ((uint64_t)high << 32));
 }
 
 static __inline uint32_t
 rdtsc32(void)
 {
 	uint32_t rv;
 
 	__asm __volatile("rdtsc" : "=a" (rv) : : "edx");
 	return (rv);
 }
 
 static __inline void
 wbinvd(void)
 {
 	__asm __volatile("wbinvd");
 }
 
 static __inline void
 write_rflags(u_long rf)
 {
 	__asm __volatile("pushq %0;  popfq" : : "r" (rf));
 }
 
 static __inline void
 wrmsr(u_int msr, uint64_t newval)
 {
 	uint32_t low, high;
 
 	low = newval;
 	high = newval >> 32;
 	__asm __volatile("wrmsr" : : "a" (low), "d" (high), "c" (msr));
 }
 
 static __inline void
 load_cr0(u_long data)
 {
 
 	__asm __volatile("movq %0,%%cr0" : : "r" (data));
 }
 
 static __inline u_long
 rcr0(void)
 {
 	u_long	data;
 
 	__asm __volatile("movq %%cr0,%0" : "=r" (data));
 	return (data);
 }
 
 static __inline u_long
 rcr2(void)
 {
 	u_long	data;
 
 	__asm __volatile("movq %%cr2,%0" : "=r" (data));
 	return (data);
 }
 
 static __inline void
 load_cr3(u_long data)
 {
 
 	__asm __volatile("movq %0,%%cr3" : : "r" (data) : "memory");
 }
 
 static __inline u_long
 rcr3(void)
 {
 	u_long	data;
 
 	__asm __volatile("movq %%cr3,%0" : "=r" (data));
 	return (data);
 }
 
 static __inline void
 load_cr4(u_long data)
 {
 	__asm __volatile("movq %0,%%cr4" : : "r" (data));
 }
 
 static __inline u_long
 rcr4(void)
 {
 	u_long	data;
 
 	__asm __volatile("movq %%cr4,%0" : "=r" (data));
 	return (data);
 }
 
 static __inline u_long
 rxcr(u_int reg)
 {
 	u_int low, high;
 
 	__asm __volatile("xgetbv" : "=a" (low), "=d" (high) : "c" (reg));
 	return (low | ((uint64_t)high << 32));
 }
 
 static __inline void
 load_xcr(u_int reg, u_long val)
 {
 	u_int low, high;
 
 	low = val;
 	high = val >> 32;
 	__asm __volatile("xsetbv" : : "c" (reg), "a" (low), "d" (high));
 }
 
 /*
  * Global TLB flush (except for the entries of pages marked PG_G)
  */
 static __inline void
 invltlb(void)
 {
 
 	load_cr3(rcr3());
 }
 
 #ifndef CR4_PGE
 #define	CR4_PGE	0x00000080	/* Page global enable */
 #endif
 
 /*
  * Perform the guaranteed invalidation of all TLB entries.  This
  * includes the global entries, and entries in all PCIDs, not only the
  * current context.  The function works both on non-PCID CPUs and CPUs
  * with the PCID turned off or on.  See IA-32 SDM Vol. 3a 4.10.4.1
  * Operations that Invalidate TLBs and Paging-Structure Caches.
  */
 static __inline void
 invltlb_globpcid(void)
 {
 	uint64_t cr4;
 
 	cr4 = rcr4();
 	load_cr4(cr4 & ~CR4_PGE);
 	/*
 	 * Although preemption at this point could be detrimental to
 	 * performance, it would not lead to an error.  PG_G is simply
 	 * ignored if CR4.PGE is clear.  Moreover, in case this block
 	 * is re-entered, the load_cr4() either above or below will
 	 * modify CR4.PGE flushing the TLB.
 	 */
 	load_cr4(cr4 | CR4_PGE);
 }
 
 /*
  * TLB flush for an individual page (even if it has PG_G).
  * Only works on 486+ CPUs (i386 does not have PG_G).
  */
 static __inline void
 invlpg(u_long addr)
 {
 
 	__asm __volatile("invlpg %0" : : "m" (*(char *)addr) : "memory");
 }
 
 #define	INVPCID_ADDR	0
 #define	INVPCID_CTX	1
 #define	INVPCID_CTXGLOB	2
 #define	INVPCID_ALLCTX	3
 
 struct invpcid_descr {
 	uint64_t	pcid:12 __packed;
 	uint64_t	pad:52 __packed;
 	uint64_t	addr;
 } __packed;
 
 static __inline void
 invpcid(struct invpcid_descr *d, int type)
 {
 
-	/* invpcid (%rdx),%rax */
-	__asm __volatile(".byte 0x66,0x0f,0x38,0x82,0x02"
-	    : : "d" (d), "a" ((u_long)type) : "memory");
+	__asm __volatile("invpcid (%0),%1"
+	    : : "r" (d), "r" ((u_long)type) : "memory");
 }
 
 static __inline u_short
 rfs(void)
 {
 	u_short sel;
 	__asm __volatile("movw %%fs,%0" : "=rm" (sel));
 	return (sel);
 }
 
 static __inline u_short
 rgs(void)
 {
 	u_short sel;
 	__asm __volatile("movw %%gs,%0" : "=rm" (sel));
 	return (sel);
 }
 
 static __inline u_short
 rss(void)
 {
 	u_short sel;
 	__asm __volatile("movw %%ss,%0" : "=rm" (sel));
 	return (sel);
 }
 
 static __inline void
 load_ds(u_short sel)
 {
 	__asm __volatile("movw %0,%%ds" : : "rm" (sel));
 }
 
 static __inline void
 load_es(u_short sel)
 {
 	__asm __volatile("movw %0,%%es" : : "rm" (sel));
 }
 
 static __inline void
 cpu_monitor(const void *addr, u_long extensions, u_int hints)
 {
 
 	__asm __volatile("monitor"
 	    : : "a" (addr), "c" (extensions), "d" (hints));
 }
 
 static __inline void
 cpu_mwait(u_long extensions, u_int hints)
 {
 
 	__asm __volatile("mwait" : : "a" (hints), "c" (extensions));
 }
 
 #ifdef _KERNEL
 /* This is defined in <machine/specialreg.h> but is too painful to get to */
 #ifndef	MSR_FSBASE
 #define	MSR_FSBASE	0xc0000100
 #endif
 static __inline void
 load_fs(u_short sel)
 {
 	/* Preserve the fsbase value across the selector load */
 	__asm __volatile("rdmsr; movw %0,%%fs; wrmsr"
 	    : : "rm" (sel), "c" (MSR_FSBASE) : "eax", "edx");
 }
 
 #ifndef	MSR_GSBASE
 #define	MSR_GSBASE	0xc0000101
 #endif
 static __inline void
 load_gs(u_short sel)
 {
 	/*
 	 * Preserve the gsbase value across the selector load.
 	 * Note that we have to disable interrupts because the gsbase
 	 * being trashed happens to be the kernel gsbase at the time.
 	 */
 	__asm __volatile("pushfq; cli; rdmsr; movw %0,%%gs; wrmsr; popfq"
 	    : : "rm" (sel), "c" (MSR_GSBASE) : "eax", "edx");
 }
 #else
 /* Usable by userland */
 static __inline void
 load_fs(u_short sel)
 {
 	__asm __volatile("movw %0,%%fs" : : "rm" (sel));
 }
 
 static __inline void
 load_gs(u_short sel)
 {
 	__asm __volatile("movw %0,%%gs" : : "rm" (sel));
 }
 #endif
 
 static __inline void
 lidt(struct region_descriptor *addr)
 {
 	__asm __volatile("lidt (%0)" : : "r" (addr));
 }
 
 static __inline void
 lldt(u_short sel)
 {
 	__asm __volatile("lldt %0" : : "r" (sel));
 }
 
 static __inline void
 ltr(u_short sel)
 {
 	__asm __volatile("ltr %0" : : "r" (sel));
 }
 
 static __inline uint64_t
 rdr0(void)
 {
 	uint64_t data;
 	__asm __volatile("movq %%dr0,%0" : "=r" (data));
 	return (data);
 }
 
 static __inline void
 load_dr0(uint64_t dr0)
 {
 	__asm __volatile("movq %0,%%dr0" : : "r" (dr0));
 }
 
 static __inline uint64_t
 rdr1(void)
 {
 	uint64_t data;
 	__asm __volatile("movq %%dr1,%0" : "=r" (data));
 	return (data);
 }
 
 static __inline void
 load_dr1(uint64_t dr1)
 {
 	__asm __volatile("movq %0,%%dr1" : : "r" (dr1));
 }
 
 static __inline uint64_t
 rdr2(void)
 {
 	uint64_t data;
 	__asm __volatile("movq %%dr2,%0" : "=r" (data));
 	return (data);
 }
 
 static __inline void
 load_dr2(uint64_t dr2)
 {
 	__asm __volatile("movq %0,%%dr2" : : "r" (dr2));
 }
 
 static __inline uint64_t
 rdr3(void)
 {
 	uint64_t data;
 	__asm __volatile("movq %%dr3,%0" : "=r" (data));
 	return (data);
 }
 
 static __inline void
 load_dr3(uint64_t dr3)
 {
 	__asm __volatile("movq %0,%%dr3" : : "r" (dr3));
 }
 
 static __inline uint64_t
 rdr4(void)
 {
 	uint64_t data;
 	__asm __volatile("movq %%dr4,%0" : "=r" (data));
 	return (data);
 }
 
 static __inline void
 load_dr4(uint64_t dr4)
 {
 	__asm __volatile("movq %0,%%dr4" : : "r" (dr4));
 }
 
 static __inline uint64_t
 rdr5(void)
 {
 	uint64_t data;
 	__asm __volatile("movq %%dr5,%0" : "=r" (data));
 	return (data);
 }
 
 static __inline void
 load_dr5(uint64_t dr5)
 {
 	__asm __volatile("movq %0,%%dr5" : : "r" (dr5));
 }
 
 static __inline uint64_t
 rdr6(void)
 {
 	uint64_t data;
 	__asm __volatile("movq %%dr6,%0" : "=r" (data));
 	return (data);
 }
 
 static __inline void
 load_dr6(uint64_t dr6)
 {
 	__asm __volatile("movq %0,%%dr6" : : "r" (dr6));
 }
 
 static __inline uint64_t
 rdr7(void)
 {
 	uint64_t data;
 	__asm __volatile("movq %%dr7,%0" : "=r" (data));
 	return (data);
 }
 
 static __inline void
 load_dr7(uint64_t dr7)
 {
 	__asm __volatile("movq %0,%%dr7" : : "r" (dr7));
 }
 
 static __inline register_t
 intr_disable(void)
 {
 	register_t rflags;
 
 	rflags = read_rflags();
 	disable_intr();
 	return (rflags);
 }
 
 static __inline void
 intr_restore(register_t rflags)
 {
 	write_rflags(rflags);
 }
 
 #else /* !(__GNUCLIKE_ASM && __CC_SUPPORTS___INLINE) */
 
 int	breakpoint(void);
 u_int	bsfl(u_int mask);
 u_int	bsrl(u_int mask);
 void	clflush(u_long addr);
 void	clts(void);
 void	cpuid_count(u_int ax, u_int cx, u_int *p);
 void	disable_intr(void);
 void	do_cpuid(u_int ax, u_int *p);
 void	enable_intr(void);
 void	halt(void);
 void	ia32_pause(void);
 u_char	inb(u_int port);
 u_int	inl(u_int port);
 void	insb(u_int port, void *addr, size_t count);
 void	insl(u_int port, void *addr, size_t count);
 void	insw(u_int port, void *addr, size_t count);
 register_t	intr_disable(void);
 void	intr_restore(register_t rf);
 void	invd(void);
 void	invlpg(u_int addr);
 void	invltlb(void);
 u_short	inw(u_int port);
 void	lidt(struct region_descriptor *addr);
 void	lldt(u_short sel);
 void	load_cr0(u_long cr0);
 void	load_cr3(u_long cr3);
 void	load_cr4(u_long cr4);
 void	load_dr0(uint64_t dr0);
 void	load_dr1(uint64_t dr1);
 void	load_dr2(uint64_t dr2);
 void	load_dr3(uint64_t dr3);
 void	load_dr4(uint64_t dr4);
 void	load_dr5(uint64_t dr5);
 void	load_dr6(uint64_t dr6);
 void	load_dr7(uint64_t dr7);
 void	load_fs(u_short sel);
 void	load_gs(u_short sel);
 void	ltr(u_short sel);
 void	outb(u_int port, u_char data);
 void	outl(u_int port, u_int data);
 void	outsb(u_int port, const void *addr, size_t count);
 void	outsl(u_int port, const void *addr, size_t count);
 void	outsw(u_int port, const void *addr, size_t count);
 void	outw(u_int port, u_short data);
 u_long	rcr0(void);
 u_long	rcr2(void);
 u_long	rcr3(void);
 u_long	rcr4(void);
 uint64_t rdmsr(u_int msr);
 uint32_t rdmsr32(u_int msr);
 uint64_t rdpmc(u_int pmc);
 uint64_t rdr0(void);
 uint64_t rdr1(void);
 uint64_t rdr2(void);
 uint64_t rdr3(void);
 uint64_t rdr4(void);
 uint64_t rdr5(void);
 uint64_t rdr6(void);
 uint64_t rdr7(void);
 uint64_t rdtsc(void);
 u_long	read_rflags(void);
 u_int	rfs(void);
 u_int	rgs(void);
 void	wbinvd(void);
 void	write_rflags(u_int rf);
 void	wrmsr(u_int msr, uint64_t newval);
 
 #endif	/* __GNUCLIKE_ASM && __CC_SUPPORTS___INLINE */
 
 void	reset_dbregs(void);
 
 #ifdef _KERNEL
 int	rdmsr_safe(u_int msr, uint64_t *val);
 int	wrmsr_safe(u_int msr, uint64_t newval);
 #endif
 
 #endif /* !_MACHINE_CPUFUNC_H_ */
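
Note on the invpcid() change in cpufunc.h above: the hand-assembled opcode
bytes, which pinned the operands to %rdx and %rax, are replaced by the
invpcid mnemonic with "r" constraints. The memory operand is still a 16-byte
descriptor: the PCID sits in bits 0..11 of the first quadword, bits 12..63
are reserved, and the second quadword holds the linear address. The userland
sketch below only illustrates that layout. It does not execute the privileged
instruction, and struct invpcid_descr_raw and invpcid_descr_make() are
hypothetical names that do not appear in the header.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical plain-integer view of the 16-byte INVPCID descriptor.
 * The real header uses bit-fields (pcid:12, pad:52) for the first quadword.
 */
struct invpcid_descr_raw {
	uint64_t lo;	/* bits 0..11: PCID; bits 12..63: reserved, zero */
	uint64_t addr;	/* linear address; only used by INVPCID_ADDR */
};

/* Hypothetical helper: build a descriptor for a PCID and an address. */
static inline struct invpcid_descr_raw
invpcid_descr_make(uint32_t pcid, uint64_t addr)
{
	struct invpcid_descr_raw d;

	assert(pcid < 0x1000);		/* a PCID is 12 bits wide */
	d.lo = (uint64_t)pcid;		/* reserved upper bits stay zero */
	d.addr = addr;
	return (d);
}
```

A kernel caller would pass the address of such a descriptor together with
one of the INVPCID_* type values. The 0x1000 bound used here is the same
value the pmap.h part of this change defines as PMAP_PCID_OVERMAX.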
Index: projects/release-arm-redux/sys/amd64/include/md_var.h
===================================================================
--- projects/release-arm-redux/sys/amd64/include/md_var.h	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/include/md_var.h	(revision 282692)
@@ -1,131 +1,132 @@
 /*-
  * Copyright (c) 1995 Bruce D. Evans.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Neither the name of the author nor the names of contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef _MACHINE_MD_VAR_H_
 #define	_MACHINE_MD_VAR_H_
 
 /*
  * Miscellaneous machine-dependent declarations.
  */
 
 extern	long	Maxmem;
 extern	u_int	basemem;
 extern	int	busdma_swi_pending;
 extern	u_int	cpu_exthigh;
 extern	u_int	cpu_feature;
 extern	u_int	cpu_feature2;
 extern	u_int	amd_feature;
 extern	u_int	amd_feature2;
 extern	u_int	amd_pminfo;
 extern	u_int	via_feature_rng;
 extern	u_int	via_feature_xcrypt;
 extern	u_int	cpu_clflush_line_size;
 extern	u_int	cpu_stdext_feature;
 extern	u_int	cpu_fxsr;
 extern	u_int	cpu_high;
 extern	u_int	cpu_id;
 extern	u_int	cpu_max_ext_state_size;
 extern	u_int	cpu_mxcsr_mask;
 extern	u_int	cpu_procinfo;
 extern	u_int	cpu_procinfo2;
 extern	char	cpu_vendor[];
 extern	u_int	cpu_vendor_id;
 extern	u_int	cpu_mon_mwait_flags;
 extern	u_int	cpu_mon_min_size;
 extern	u_int	cpu_mon_max_size;
 extern	u_int	cpu_maxphyaddr;
 extern	char	ctx_switch_xsave[];
 extern	u_int	hv_high;
 extern	char	hv_vendor[];
 extern	char	kstack[];
 extern	char	sigcode[];
 extern	int	szsigcode;
 extern	uint64_t *vm_page_dump;
 extern	int	vm_page_dump_size;
 extern	int	workaround_erratum383;
 extern	int	_udatasel;
 extern	int	_ucodesel;
 extern	int	_ucode32sel;
 extern	int	_ufssel;
 extern	int	_ugssel;
 extern	int	use_xsave;
 extern	uint64_t xsave_mask;
 
 typedef void alias_for_inthand_t(u_int cs, u_int ef, u_int esp, u_int ss);
 struct	pcb;
 struct	savefpu;
 struct	thread;
 struct	reg;
 struct	fpreg;
 struct  dbreg;
 struct	dumperinfo;
 
 void	*alloc_fpusave(int flags);
 void	amd64_syscall(struct thread *td, int traced);
 void	busdma_swi(void);
+bool	cpu_mwait_usable(void);
 void	cpu_probe_amdc1e(void);
 void	cpu_setregs(void);
 void	doreti_iret(void) __asm(__STRING(doreti_iret));
 void	doreti_iret_fault(void) __asm(__STRING(doreti_iret_fault));
 void	ld_ds(void) __asm(__STRING(ld_ds));
 void	ld_es(void) __asm(__STRING(ld_es));
 void	ld_fs(void) __asm(__STRING(ld_fs));
 void	ld_gs(void) __asm(__STRING(ld_gs));
 void	ld_fsbase(void) __asm(__STRING(ld_fsbase));
 void	ld_gsbase(void) __asm(__STRING(ld_gsbase));
 void	ds_load_fault(void) __asm(__STRING(ds_load_fault));
 void	es_load_fault(void) __asm(__STRING(es_load_fault));
 void	fs_load_fault(void) __asm(__STRING(fs_load_fault));
 void	gs_load_fault(void) __asm(__STRING(gs_load_fault));
 void	fsbase_load_fault(void) __asm(__STRING(fsbase_load_fault));
 void	gsbase_load_fault(void) __asm(__STRING(gsbase_load_fault));
 void	dump_add_page(vm_paddr_t);
 void	dump_drop_page(vm_paddr_t);
 void	identify_cpu(void);
 void	initializecpu(void);
 void	initializecpucache(void);
 void	fillw(int /*u_short*/ pat, void *base, size_t cnt);
 void	fpstate_drop(struct thread *td);
 int	is_physical_memory(vm_paddr_t addr);
 int	isa_nmi(int cd);
 void	panicifcpuunsupported(void);
 void	pagecopy(void *from, void *to);
 void	pagezero(void *addr);
 void	printcpuinfo(void);
 void	setidt(int idx, alias_for_inthand_t *func, int typ, int dpl, int ist);
 int	user_dbreg_trap(void);
 int	minidumpsys(struct dumperinfo *);
 struct savefpu *get_pcb_user_save_td(struct thread *td);
 struct savefpu *get_pcb_user_save_pcb(struct pcb *pcb);
 struct pcb *get_pcb_td(struct thread *td);
 void	amd64_db_resume_dbreg(void);
 
 #endif /* !_MACHINE_MD_VAR_H_ */
Index: projects/release-arm-redux/sys/amd64/include/pcpu.h
===================================================================
--- projects/release-arm-redux/sys/amd64/include/pcpu.h	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/include/pcpu.h	(revision 282692)
@@ -1,249 +1,251 @@
 /*-
  * Copyright (c) Peter Wemm <peter@netplex.com.au>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef _MACHINE_PCPU_H_
 #define	_MACHINE_PCPU_H_
 
 #ifndef _SYS_CDEFS_H_
 #error "sys/cdefs.h is a prerequisite for this file"
 #endif
 
 /*
  * The SMP parts are set up in pmap.c and locore.s for the BSP, and
  * mp_machdep.c sets up the data for the AP's to "see" when they awake.
  * The reason for doing it via a struct is so that an array of pointers
  * to each CPU's data can be set up for things like "check curproc on all
  * other processors"
  */
 #define	PCPU_MD_FIELDS							\
 	char	pc_monitorbuf[128] __aligned(128); /* cache line */	\
 	struct	pcpu *pc_prvspace;	/* Self-reference */		\
 	struct	pmap *pc_curpmap;					\
 	struct	amd64tss *pc_tssp;	/* TSS segment active on CPU */	\
 	struct	amd64tss *pc_commontssp;/* Common TSS for the CPU */	\
 	register_t pc_rsp0;						\
 	register_t pc_scratch_rsp;	/* User %rsp in syscall */	\
 	u_int	pc_apic_id;						\
 	u_int   pc_acpi_id;		/* ACPI CPU id */		\
 	/* Pointer to the CPU %fs descriptor */				\
 	struct user_segment_descriptor	*pc_fs32p;			\
 	/* Pointer to the CPU %gs descriptor */				\
 	struct user_segment_descriptor	*pc_gs32p;			\
 	/* Pointer to the CPU LDT descriptor */				\
 	struct system_segment_descriptor *pc_ldt;			\
 	/* Pointer to the CPU TSS descriptor */				\
 	struct system_segment_descriptor *pc_tss;			\
 	uint64_t	pc_pm_save_cnt;					\
 	u_int	pc_cmci_mask;		/* MCx banks for CMCI */	\
 	uint64_t pc_dbreg[16];		/* ddb debugging regs */	\
 	int pc_dbreg_cmd;		/* ddb debugging reg cmd */	\
 	u_int	pc_vcpu_id;		/* Xen vCPU ID */		\
-	char	__pad[157]		/* be divisor of PAGE_SIZE	\
+	uint32_t pc_pcid_next;						\
+	uint32_t pc_pcid_gen;						\
+	char	__pad[149]		/* be divisor of PAGE_SIZE	\
 					   after cache alignment */
 
 #define	PC_DBREG_CMD_NONE	0
 #define	PC_DBREG_CMD_LOAD	1
 
 #ifdef _KERNEL
 
 #ifdef lint
 
 extern struct pcpu *pcpup;
 
 #define	PCPU_GET(member)	(pcpup->pc_ ## member)
 #define	PCPU_ADD(member, val)	(pcpup->pc_ ## member += (val))
 #define	PCPU_INC(member)	PCPU_ADD(member, 1)
 #define	PCPU_PTR(member)	(&pcpup->pc_ ## member)
 #define	PCPU_SET(member, val)	(pcpup->pc_ ## member = (val))
 
 #elif defined(__GNUCLIKE_ASM) && defined(__GNUCLIKE___TYPEOF)
 
 /*
  * Evaluates to the byte offset of the per-cpu variable name.
  */
 #define	__pcpu_offset(name)						\
 	__offsetof(struct pcpu, name)
 
 /*
  * Evaluates to the type of the per-cpu variable name.
  */
 #define	__pcpu_type(name)						\
 	__typeof(((struct pcpu *)0)->name)
 
 /*
  * Evaluates to the address of the per-cpu variable name.
  */
 #define	__PCPU_PTR(name) __extension__ ({				\
 	__pcpu_type(name) *__p;						\
 									\
 	__asm __volatile("movq %%gs:%1,%0; addq %2,%0"			\
 	    : "=r" (__p)						\
 	    : "m" (*(struct pcpu *)(__pcpu_offset(pc_prvspace))),	\
 	      "i" (__pcpu_offset(name)));				\
 									\
 	__p;								\
 })
 
 /*
  * Evaluates to the value of the per-cpu variable name.
  */
 #define	__PCPU_GET(name) __extension__ ({				\
 	__pcpu_type(name) __res;					\
 	struct __s {							\
 		u_char	__b[MIN(sizeof(__pcpu_type(name)), 8)];		\
 	} __s;								\
 									\
 	if (sizeof(__res) == 1 || sizeof(__res) == 2 ||			\
 	    sizeof(__res) == 4 || sizeof(__res) == 8) {			\
 		__asm __volatile("mov %%gs:%1,%0"			\
 		    : "=r" (__s)					\
 		    : "m" (*(struct __s *)(__pcpu_offset(name))));	\
 		*(struct __s *)(void *)&__res = __s;			\
 	} else {							\
 		__res = *__PCPU_PTR(name);				\
 	}								\
 	__res;								\
 })
 
 /*
  * Adds the value to the per-cpu counter name.  The implementation
  * must be atomic with respect to interrupts.
  */
 #define	__PCPU_ADD(name, val) do {					\
 	__pcpu_type(name) __val;					\
 	struct __s {							\
 		u_char	__b[MIN(sizeof(__pcpu_type(name)), 8)];		\
 	} __s;								\
 									\
 	__val = (val);							\
 	if (sizeof(__val) == 1 || sizeof(__val) == 2 ||			\
 	    sizeof(__val) == 4 || sizeof(__val) == 8) {			\
 		__s = *(struct __s *)(void *)&__val;			\
 		__asm __volatile("add %1,%%gs:%0"			\
 		    : "=m" (*(struct __s *)(__pcpu_offset(name)))	\
 		    : "r" (__s));					\
 	} else								\
 		*__PCPU_PTR(name) += __val;				\
 } while (0)
 
 /*
  * Increments the value of the per-cpu counter name.  The implementation
  * must be atomic with respect to interrupts.
  */
 #define	__PCPU_INC(name) do {						\
 	CTASSERT(sizeof(__pcpu_type(name)) == 1 ||			\
 	    sizeof(__pcpu_type(name)) == 2 ||				\
 	    sizeof(__pcpu_type(name)) == 4 ||				\
 	    sizeof(__pcpu_type(name)) == 8);				\
 	if (sizeof(__pcpu_type(name)) == 1) {				\
 		__asm __volatile("incb %%gs:%0"				\
 		    : "=m" (*(__pcpu_type(name) *)(__pcpu_offset(name)))\
 		    : "m" (*(__pcpu_type(name) *)(__pcpu_offset(name))));\
 	} else if (sizeof(__pcpu_type(name)) == 2) {			\
 		__asm __volatile("incw %%gs:%0"				\
 		    : "=m" (*(__pcpu_type(name) *)(__pcpu_offset(name)))\
 		    : "m" (*(__pcpu_type(name) *)(__pcpu_offset(name))));\
 	} else if (sizeof(__pcpu_type(name)) == 4) {			\
 		__asm __volatile("incl %%gs:%0"				\
 		    : "=m" (*(__pcpu_type(name) *)(__pcpu_offset(name)))\
 		    : "m" (*(__pcpu_type(name) *)(__pcpu_offset(name))));\
 	} else if (sizeof(__pcpu_type(name)) == 8) {			\
 		__asm __volatile("incq %%gs:%0"				\
 		    : "=m" (*(__pcpu_type(name) *)(__pcpu_offset(name)))\
 		    : "m" (*(__pcpu_type(name) *)(__pcpu_offset(name))));\
 	}								\
 } while (0)
 
 /*
  * Sets the value of the per-cpu variable name to value val.
  */
 #define	__PCPU_SET(name, val) {						\
 	__pcpu_type(name) __val;					\
 	struct __s {							\
 		u_char	__b[MIN(sizeof(__pcpu_type(name)), 8)];		\
 	} __s;								\
 									\
 	__val = (val);							\
 	if (sizeof(__val) == 1 || sizeof(__val) == 2 ||			\
 	    sizeof(__val) == 4 || sizeof(__val) == 8) {			\
 		__s = *(struct __s *)(void *)&__val;			\
 		__asm __volatile("mov %1,%%gs:%0"			\
 		    : "=m" (*(struct __s *)(__pcpu_offset(name)))	\
 		    : "r" (__s));					\
 	} else {							\
 		*__PCPU_PTR(name) = __val;				\
 	}								\
 }
 
 #define	PCPU_GET(member)	__PCPU_GET(pc_ ## member)
 #define	PCPU_ADD(member, val)	__PCPU_ADD(pc_ ## member, val)
 #define	PCPU_INC(member)	__PCPU_INC(pc_ ## member)
 #define	PCPU_PTR(member)	__PCPU_PTR(pc_ ## member)
 #define	PCPU_SET(member, val)	__PCPU_SET(pc_ ## member, val)
 
 #define	OFFSETOF_CURTHREAD	0
 #ifdef __clang__
 #pragma clang diagnostic push
 #pragma clang diagnostic ignored "-Wnull-dereference"
 #endif
 static __inline __pure2 struct thread *
 __curthread(void)
 {
 	struct thread *td;
 
 	__asm("movq %%gs:%1,%0" : "=r" (td)
 	    : "m" (*(char *)OFFSETOF_CURTHREAD));
 	return (td);
 }
 #ifdef __clang__
 #pragma clang diagnostic pop
 #endif
 #define	curthread		(__curthread())
 
 #define	OFFSETOF_CURPCB		32
 static __inline __pure2 struct pcb *
 __curpcb(void)
 {
 	struct pcb *pcb;
 
 	__asm("movq %%gs:%1,%0" : "=r" (pcb) : "m" (*(char *)OFFSETOF_CURPCB));
 	return (pcb);
 }
 #define	curpcb		(__curpcb())
 
 #define	IS_BSP()	(PCPU_GET(cpuid) == 0)
 
 #else /* !lint || defined(__GNUCLIKE_ASM) && defined(__GNUCLIKE___TYPEOF) */
 
 #error "this file needs to be ported to your compiler"
 
 #endif /* lint, etc. */
 
 #endif /* _KERNEL */
 
 #endif /* !_MACHINE_PCPU_H_ */
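
Note on the __pad change in pcpu.h above: pc_pcid_next and pc_pcid_gen are
two uint32_t fields, 8 bytes in total, so __pad shrinks by the same amount
(157 to 149) and the machine-dependent part of struct pcpu keeps its size.
The sketch below checks that arithmetic on a cut-down model of the tail of
PCPU_MD_FIELDS. The struct names and the TAIL_END() macro are hypothetical,
and the model assumes the compiler inserts no padding between the fields,
which is plausible here because the preceding field is a 4-byte u_int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the tail of PCPU_MD_FIELDS before the change. */
struct pcpu_tail_old {
	uint32_t pc_vcpu_id;
	char	 pad[157];
};

/* Hypothetical model of the same tail after the change. */
struct pcpu_tail_new {
	uint32_t pc_vcpu_id;
	uint32_t pc_pcid_next;
	uint32_t pc_pcid_gen;
	char	 pad[149];
};

/* Byte offset just past the pad array, i.e. where the fields end. */
#define	TAIL_END(type)							\
	(offsetof(struct type, pad) + sizeof(((struct type *)0)->pad))

/* Both layouts must end at the same offset for the size to be unchanged. */
static_assert(TAIL_END(pcpu_tail_old) == TAIL_END(pcpu_tail_new),
    "pad must shrink by exactly the size of the new fields");
```

The compile-time check fails if a later field is added without adjusting
the pad, which is the mistake this kind of manual padding invites.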
Index: projects/release-arm-redux/sys/amd64/include/pmap.h
===================================================================
--- projects/release-arm-redux/sys/amd64/include/pmap.h	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/include/pmap.h	(revision 282692)
@@ -1,406 +1,417 @@
 /*-
  * Copyright (c) 2003 Peter Wemm.
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * the Systems Programming Group of the University of Utah Computer
  * Science Department and William Jolitz of UUNET Technologies Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * Derived from hp300 version by Mike Hibler, this version by William
  * Jolitz uses a recursive map [a pde points to the page directory] to
  * map the page tables using the pagetables themselves. This is done to
  * reduce the impact on kernel virtual memory for lots of sparse address
  * space, and to reduce the cost of memory to each process.
  *
  *	from: hp300: @(#)pmap.h	7.2 (Berkeley) 12/16/90
  *	from: @(#)pmap.h	7.4 (Berkeley) 5/12/91
  * $FreeBSD$
  */
 
 #ifndef _MACHINE_PMAP_H_
 #define	_MACHINE_PMAP_H_
 
 /*
  * Page-directory and page-table entries follow this format, with a few
  * of the fields not present here and there, depending on a lot of things.
  */
 				/* ---- Intel Nomenclature ---- */
 #define	X86_PG_V	0x001	/* P	Valid			*/
 #define	X86_PG_RW	0x002	/* R/W	Read/Write		*/
 #define	X86_PG_U	0x004	/* U/S  User/Supervisor		*/
 #define	X86_PG_NC_PWT	0x008	/* PWT	Write through		*/
 #define	X86_PG_NC_PCD	0x010	/* PCD	Cache disable		*/
 #define	X86_PG_A	0x020	/* A	Accessed		*/
 #define	X86_PG_M	0x040	/* D	Dirty			*/
 #define	X86_PG_PS	0x080	/* PS	Page size (0=4k,1=2M)	*/
 #define	X86_PG_PTE_PAT	0x080	/* PAT	PAT index		*/
 #define	X86_PG_G	0x100	/* G	Global			*/
 #define	X86_PG_AVAIL1	0x200	/*    /	Available for system	*/
 #define	X86_PG_AVAIL2	0x400	/*   <	programmers use		*/
 #define	X86_PG_AVAIL3	0x800	/*    \				*/
 #define	X86_PG_PDE_PAT	0x1000	/* PAT	PAT index		*/
 #define	X86_PG_NX	(1ul<<63) /* No-execute */
 #define	X86_PG_AVAIL(x)	(1ul << (x))
 
 /* Page level cache control fields used to determine the PAT type */
 #define	X86_PG_PDE_CACHE (X86_PG_PDE_PAT | X86_PG_NC_PWT | X86_PG_NC_PCD)
 #define	X86_PG_PTE_CACHE (X86_PG_PTE_PAT | X86_PG_NC_PWT | X86_PG_NC_PCD)
 
 /*
  * Intel extended page table (EPT) bit definitions.
  */
 #define	EPT_PG_READ		0x001	/* R	Read		*/
 #define	EPT_PG_WRITE		0x002	/* W	Write		*/
 #define	EPT_PG_EXECUTE		0x004	/* X	Execute		*/
 #define	EPT_PG_IGNORE_PAT	0x040	/* IPAT	Ignore PAT	*/
 #define	EPT_PG_PS		0x080	/* PS	Page size	*/
 #define	EPT_PG_A		0x100	/* A	Accessed	*/
 #define	EPT_PG_M		0x200	/* D	Dirty		*/
 #define	EPT_PG_MEMORY_TYPE(x)	((x) << 3) /* MT Memory Type	*/
 
 /*
  * Define the PG_xx macros in terms of the bits on x86 PTEs.
  */
 #define	PG_V		X86_PG_V
 #define	PG_RW		X86_PG_RW
 #define	PG_U		X86_PG_U
 #define	PG_NC_PWT	X86_PG_NC_PWT
 #define	PG_NC_PCD	X86_PG_NC_PCD
 #define	PG_A		X86_PG_A
 #define	PG_M		X86_PG_M
 #define	PG_PS		X86_PG_PS
 #define	PG_PTE_PAT	X86_PG_PTE_PAT
 #define	PG_G		X86_PG_G
 #define	PG_AVAIL1	X86_PG_AVAIL1
 #define	PG_AVAIL2	X86_PG_AVAIL2
 #define	PG_AVAIL3	X86_PG_AVAIL3
 #define	PG_PDE_PAT	X86_PG_PDE_PAT
 #define	PG_NX		X86_PG_NX
 #define	PG_PDE_CACHE	X86_PG_PDE_CACHE
 #define	PG_PTE_CACHE	X86_PG_PTE_CACHE
 
 /* Our various interpretations of the above */
 #define	PG_W		X86_PG_AVAIL3	/* "Wired" pseudoflag */
 #define	PG_MANAGED	X86_PG_AVAIL2
 #define	EPT_PG_EMUL_V	X86_PG_AVAIL(52)
 #define	EPT_PG_EMUL_RW	X86_PG_AVAIL(53)
 #define	PG_FRAME	(0x000ffffffffff000ul)
 #define	PG_PS_FRAME	(0x000fffffffe00000ul)
 
 /*
  * Promotion to a 2MB (PDE) page mapping requires that the corresponding 4KB
  * (PTE) page mappings have identical settings for the following fields:
  */
 #define	PG_PTE_PROMOTE	(PG_NX | PG_MANAGED | PG_W | PG_G | PG_PTE_CACHE | \
 	    PG_M | PG_A | PG_U | PG_RW | PG_V)
 
 /*
  * Page Protection Exception bits
  */
 
 #define PGEX_P		0x01	/* Protection violation vs. not present */
 #define PGEX_W		0x02	/* during a Write cycle */
 #define PGEX_U		0x04	/* access from User mode (UPL) */
 #define PGEX_RSV	0x08	/* reserved PTE field is non-zero */
 #define PGEX_I		0x10	/* during an instruction fetch */
 
 /* 
  * undef the PG_xx macros that define bits in the regular x86 PTEs that
  * have a different position in nested PTEs. This is done when compiling
  * code that needs to be aware of the differences between regular x86 and
  * nested PTEs.
  *
  * The appropriate bitmask will be calculated at runtime based on the pmap
  * type.
  */
 #ifdef AMD64_NPT_AWARE
 #undef PG_AVAIL1		/* X86_PG_AVAIL1 aliases with EPT_PG_M */
 #undef PG_G
 #undef PG_A
 #undef PG_M
 #undef PG_PDE_PAT
 #undef PG_PDE_CACHE
 #undef PG_PTE_PAT
 #undef PG_PTE_CACHE
 #undef PG_RW
 #undef PG_V
 #endif
 
 /*
  * Pte related macros.  This is complicated by having to deal with
  * the sign extension of the 48th bit.
  */
 #define KVADDR(l4, l3, l2, l1) ( \
 	((unsigned long)-1 << 47) | \
 	((unsigned long)(l4) << PML4SHIFT) | \
 	((unsigned long)(l3) << PDPSHIFT) | \
 	((unsigned long)(l2) << PDRSHIFT) | \
 	((unsigned long)(l1) << PAGE_SHIFT))
 
 #define UVADDR(l4, l3, l2, l1) ( \
 	((unsigned long)(l4) << PML4SHIFT) | \
 	((unsigned long)(l3) << PDPSHIFT) | \
 	((unsigned long)(l2) << PDRSHIFT) | \
 	((unsigned long)(l1) << PAGE_SHIFT))
 
 /*
  * Number of kernel PML4 slots.  Can be anywhere from 1 to 64 or so,
  * but setting it larger than NDMPML4E makes no sense.
  *
  * Each slot provides .5 TB of kernel virtual space.
  */
 #define NKPML4E		4
 
 #define	NUPML4E		(NPML4EPG/2)	/* number of userland PML4 pages */
 #define	NUPDPE		(NUPML4E*NPDPEPG)/* number of userland PDP pages */
 #define	NUPDE		(NUPDPE*NPDEPG)	/* number of userland PD entries */
 
 /*
  * NDMPML4E is the maximum number of PML4 entries that will be
  * used to implement the direct map.  It must be a power of two,
  * and should generally exceed NKPML4E.  The maximum possible
  * value is 64; using 128 will make the direct map intrude into
  * the recursive page table map.
  */
 #define	NDMPML4E	8
 
 /*
  * These values control the layout of virtual memory.  The starting address
  * of the direct map, which is controlled by DMPML4I, must be a multiple of
  * its size.  (See the PHYS_TO_DMAP() and DMAP_TO_PHYS() macros.)
  *
  * Note: KPML4I is the index of the (single) level 4 page that maps
  * the KVA that holds KERNBASE, while KPML4BASE is the index of the
  * first level 4 page that maps VM_MIN_KERNEL_ADDRESS.  If NKPML4E
  * is 1, these are the same, otherwise KPML4BASE < KPML4I and extra
  * level 4 PDEs are needed to map from VM_MIN_KERNEL_ADDRESS up to
  * KERNBASE.
  *
  * (KPML4I combines with KPDPI to choose where KERNBASE starts.
  * Or, in other words, KPML4I provides bits 39..47 of KERNBASE,
  * and KPDPI provides bits 30..38.)
  */
 #define	PML4PML4I	(NPML4EPG/2)	/* Index of recursive pml4 mapping */
 
 #define	KPML4BASE	(NPML4EPG-NKPML4E) /* KVM at highest addresses */
 #define	DMPML4I		rounddown(KPML4BASE-NDMPML4E, NDMPML4E) /* Below KVM */
 
 #define	KPML4I		(NPML4EPG-1)
 #define	KPDPI		(NPDPEPG-2)	/* kernbase at -2GB */
 
 /*
  * XXX doesn't really belong here I guess...
  */
 #define ISA_HOLE_START    0xa0000
 #define ISA_HOLE_LENGTH (0x100000-ISA_HOLE_START)
 
+#define	PMAP_PCID_NONE		0xffffffff
+#define	PMAP_PCID_KERN		0
+#define	PMAP_PCID_OVERMAX	0x1000
+
 #ifndef LOCORE
 
 #include <sys/queue.h>
 #include <sys/_cpuset.h>
 #include <sys/_lock.h>
 #include <sys/_mutex.h>
 
 #include <vm/_vm_radix.h>
 
 typedef u_int64_t pd_entry_t;
 typedef u_int64_t pt_entry_t;
 typedef u_int64_t pdp_entry_t;
 typedef u_int64_t pml4_entry_t;
 
 /*
  * Address of current address space page table maps and directories.
  */
 #ifdef _KERNEL
 #define	addr_PTmap	(KVADDR(PML4PML4I, 0, 0, 0))
 #define	addr_PDmap	(KVADDR(PML4PML4I, PML4PML4I, 0, 0))
 #define	addr_PDPmap	(KVADDR(PML4PML4I, PML4PML4I, PML4PML4I, 0))
 #define	addr_PML4map	(KVADDR(PML4PML4I, PML4PML4I, PML4PML4I, PML4PML4I))
 #define	addr_PML4pml4e	(addr_PML4map + (PML4PML4I * sizeof(pml4_entry_t)))
 #define	PTmap		((pt_entry_t *)(addr_PTmap))
 #define	PDmap		((pd_entry_t *)(addr_PDmap))
 #define	PDPmap		((pd_entry_t *)(addr_PDPmap))
 #define	PML4map		((pd_entry_t *)(addr_PML4map))
 #define	PML4pml4e	((pd_entry_t *)(addr_PML4pml4e))
 
 extern int nkpt;		/* Initial number of kernel page tables */
 extern u_int64_t KPDPphys;	/* physical address of kernel level 3 */
 extern u_int64_t KPML4phys;	/* physical address of kernel level 4 */
 
 /*
  * virtual address to page table entry and
  * to physical address.
  * Note: these work recursively, thus vtopte of a pte will give
  * the corresponding pde that in turn maps it.
  */
 pt_entry_t *vtopte(vm_offset_t);
 #define	vtophys(va)	pmap_kextract(((vm_offset_t) (va)))
 
 #define	pte_load_store(ptep, pte)	atomic_swap_long(ptep, pte)
 #define	pte_load_clear(ptep)		atomic_swap_long(ptep, 0)
 #define	pte_store(ptep, pte) do { \
 	*(u_long *)(ptep) = (u_long)(pte); \
 } while (0)
 #define	pte_clear(ptep)			pte_store(ptep, 0)
 
 #define	pde_store(pdep, pde)		pte_store(pdep, pde)
 
 extern pt_entry_t pg_nx;
 
 #endif /* _KERNEL */
 
 /*
  * Pmap stuff
  */
 struct	pv_entry;
 struct	pv_chunk;
 
 struct md_page {
 	TAILQ_HEAD(,pv_entry)	pv_list;
 	int			pv_gen;
 	int			pat_mode;
 };
 
 enum pmap_type {
 	PT_X86,			/* regular x86 page tables */
 	PT_EPT,			/* Intel's nested page tables */
 	PT_RVI,			/* AMD's nested page tables */
 };
 
+struct pmap_pcids {
+	uint32_t	pm_pcid;
+	uint32_t	pm_gen;
+};
+
 /*
  * The kernel virtual address (KVA) of the level 4 page table page is always
  * within the direct map (DMAP) region.
  */
 struct pmap {
 	struct mtx		pm_mtx;
 	pml4_entry_t		*pm_pml4;	/* KVA of level 4 page table */
 	uint64_t		pm_cr3;
 	TAILQ_HEAD(,pv_chunk)	pm_pvchunk;	/* list of mappings in pmap */
 	cpuset_t		pm_active;	/* active on cpus */
-	cpuset_t		pm_save;	/* Context valid on cpus mask */
-	int			pm_pcid;	/* context id */
 	enum pmap_type		pm_type;	/* regular or nested tables */
 	struct pmap_statistics	pm_stats;	/* pmap statistics */
 	struct vm_radix		pm_root;	/* spare page table pages */
 	long			pm_eptgen;	/* EPT pmap generation id */
 	int			pm_flags;
+	struct pmap_pcids	pm_pcids[MAXCPU];
 };
 
 /* flags */
 #define	PMAP_NESTED_IPIMASK	0xff
 #define	PMAP_PDE_SUPERPAGE	(1 << 8)	/* supports 2MB superpages */
 #define	PMAP_EMULATE_AD_BITS	(1 << 9)	/* needs A/D bits emulation */
 #define	PMAP_SUPPORTS_EXEC_ONLY	(1 << 10)	/* execute only mappings ok */
 
 typedef struct pmap	*pmap_t;
 
 #ifdef _KERNEL
 extern struct pmap	kernel_pmap_store;
 #define kernel_pmap	(&kernel_pmap_store)
 
 #define	PMAP_LOCK(pmap)		mtx_lock(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_ASSERT(pmap, type) \
 				mtx_assert(&(pmap)->pm_mtx, (type))
 #define	PMAP_LOCK_DESTROY(pmap)	mtx_destroy(&(pmap)->pm_mtx)
 #define	PMAP_LOCK_INIT(pmap)	mtx_init(&(pmap)->pm_mtx, "pmap", \
 				    NULL, MTX_DEF | MTX_DUPOK)
 #define	PMAP_LOCKED(pmap)	mtx_owned(&(pmap)->pm_mtx)
 #define	PMAP_MTX(pmap)		(&(pmap)->pm_mtx)
 #define	PMAP_TRYLOCK(pmap)	mtx_trylock(&(pmap)->pm_mtx)
 #define	PMAP_UNLOCK(pmap)	mtx_unlock(&(pmap)->pm_mtx)
 
 int	pmap_pinit_type(pmap_t pmap, enum pmap_type pm_type, int flags);
 int	pmap_emulate_accessed_dirty(pmap_t pmap, vm_offset_t va, int ftype);
 #endif
 
 /*
  * For each vm_page_t, there is a list of all currently valid virtual
  * mappings of that page.  An entry is a pv_entry_t, the list is pv_list.
  */
 typedef struct pv_entry {
 	vm_offset_t	pv_va;		/* virtual address for mapping */
 	TAILQ_ENTRY(pv_entry)	pv_next;
 } *pv_entry_t;
 
 /*
  * pv_entries are allocated in chunks per-process.  This avoids the
  * need to track per-pmap assignments.
  */
 #define	_NPCM	3
 #define	_NPCPV	168
 struct pv_chunk {
 	pmap_t			pc_pmap;
 	TAILQ_ENTRY(pv_chunk)	pc_list;
 	uint64_t		pc_map[_NPCM];	/* bitmap; 1 = free */
 	TAILQ_ENTRY(pv_chunk)	pc_lru;
 	struct pv_entry		pc_pventry[_NPCPV];
 };
 
 #ifdef	_KERNEL
 
 extern caddr_t	CADDR1;
 extern pt_entry_t *CMAP1;
 extern vm_paddr_t phys_avail[];
 extern vm_paddr_t dump_avail[];
 extern vm_offset_t virtual_avail;
 extern vm_offset_t virtual_end;
 extern vm_paddr_t dmaplimit;
 
 #define	pmap_page_get_memattr(m)	((vm_memattr_t)(m)->md.pat_mode)
 #define	pmap_page_is_write_mapped(m)	(((m)->aflags & PGA_WRITEABLE) != 0)
 #define	pmap_unmapbios(va, sz)	pmap_unmapdev((va), (sz))
 
+struct thread;
+
+void	pmap_activate_sw(struct thread *);
 void	pmap_bootstrap(vm_paddr_t *);
 int	pmap_change_attr(vm_offset_t, vm_size_t, int);
 void	pmap_demote_DMAP(vm_paddr_t base, vm_size_t len, boolean_t invalidate);
 void	pmap_init_pat(void);
 void	pmap_kenter(vm_offset_t va, vm_paddr_t pa);
 void	*pmap_kenter_temporary(vm_paddr_t pa, int i);
 vm_paddr_t pmap_kextract(vm_offset_t);
 void	pmap_kremove(vm_offset_t);
 void	*pmap_mapbios(vm_paddr_t, vm_size_t);
 void	*pmap_mapdev(vm_paddr_t, vm_size_t);
 void	*pmap_mapdev_attr(vm_paddr_t, vm_size_t, int);
 boolean_t pmap_page_is_mapped(vm_page_t m);
 void	pmap_page_set_memattr(vm_page_t m, vm_memattr_t ma);
 void	pmap_unmapdev(vm_offset_t, vm_size_t);
 void	pmap_invalidate_page(pmap_t, vm_offset_t);
 void	pmap_invalidate_range(pmap_t, vm_offset_t, vm_offset_t);
 void	pmap_invalidate_all(pmap_t);
 void	pmap_invalidate_cache(void);
 void	pmap_invalidate_cache_pages(vm_page_t *pages, int count);
 void	pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva,
 	    boolean_t force);
 void	pmap_get_mapping(pmap_t pmap, vm_offset_t va, uint64_t *ptr, int *num);
 boolean_t pmap_map_io_transient(vm_page_t *, vm_offset_t *, int, boolean_t);
 void	pmap_unmap_io_transient(vm_page_t *, vm_offset_t *, int, boolean_t);
 #endif /* _KERNEL */
 
 #endif /* !LOCORE */
 
 #endif /* !_MACHINE_PMAP_H_ */
Index: projects/release-arm-redux/sys/amd64/include/smp.h
===================================================================
--- projects/release-arm-redux/sys/amd64/include/smp.h	(revision 282691)
+++ projects/release-arm-redux/sys/amd64/include/smp.h	(revision 282692)
@@ -1,134 +1,127 @@
 /*-
  * ----------------------------------------------------------------------------
  * "THE BEER-WARE LICENSE" (Revision 42):
  * <phk@FreeBSD.org> wrote this file.  As long as you retain this notice you
  * can do whatever you want with this stuff. If we meet some day, and you think
  * this stuff is worth it, you can buy me a beer in return.   Poul-Henning Kamp
  * ----------------------------------------------------------------------------
  *
  * $FreeBSD$
  *
  */
 
 #ifndef _MACHINE_SMP_H_
 #define _MACHINE_SMP_H_
 
 #ifdef _KERNEL
 
 #ifdef SMP
 
 #ifndef LOCORE
 
 #include <sys/bus.h>
 #include <machine/frame.h>
 #include <machine/intr_machdep.h>
 #include <x86/apicvar.h>
 #include <machine/pcb.h>
 
 /* global symbols in mpboot.S */
 extern char			mptramp_start[];
 extern char			mptramp_end[];
 extern u_int32_t		mptramp_pagetables;
 
 /* global data in mp_machdep.c */
 extern int			mp_naps;
 extern int			boot_cpu_id;
 extern struct pcb		stoppcbs[];
 extern int			cpu_apic_ids[];
 extern int bootAP;
 extern void *dpcpu;
 extern char *bootSTK;
 extern int bootAP;
 extern void *bootstacks[];
 extern volatile u_int cpu_ipi_pending[];
 extern volatile int aps_ready;
 extern struct mtx ap_boot_mtx;
 extern int cpu_logical;
 extern int cpu_cores;
 extern int pmap_pcid_enabled;
+extern int invpcid_works;
 extern u_int xhits_gbl[];
 extern u_int xhits_pg[];
 extern u_int xhits_rng[];
 extern u_int ipi_global;
 extern u_int ipi_page;
 extern u_int ipi_range;
 extern u_int ipi_range_size;
-extern u_int ipi_masked_global;
-extern u_int ipi_masked_page;
-extern u_int ipi_masked_range;
-extern u_int ipi_masked_range_size;
 
 extern volatile int smp_tlb_wait;
 
 struct cpu_info {
 	int	cpu_present:1;
 	int	cpu_bsp:1;
 	int	cpu_disabled:1;
 	int	cpu_hyperthread:1;
 };
 extern struct cpu_info cpu_info[];
 
 #ifdef COUNT_IPIS
 extern u_long *ipi_invltlb_counts[MAXCPU];
 extern u_long *ipi_invlrng_counts[MAXCPU];
 extern u_long *ipi_invlpg_counts[MAXCPU];
 extern u_long *ipi_invlcache_counts[MAXCPU];
 extern u_long *ipi_rendezvous_counts[MAXCPU];
 #endif
 
 /* IPI handlers */
 inthand_t
-	IDTVEC(invltlb_pcid),	/* TLB shootdowns - global, pcid enabled */
 	IDTVEC(invltlb),	/* TLB shootdowns - global */
-	IDTVEC(invlpg_pcid),	/* TLB shootdowns - 1 page, pcid enabled */
+	IDTVEC(invltlb_pcid),	/* TLB shootdowns - global, pcid */
+	IDTVEC(invltlb_invpcid),/* TLB shootdowns - global, invpcid */
 	IDTVEC(invlpg),		/* TLB shootdowns - 1 page */
 	IDTVEC(invlrng),	/* TLB shootdowns - page range */
 	IDTVEC(invlcache),	/* Write back and invalidate cache */
 	IDTVEC(ipi_intr_bitmap_handler), /* Bitmap based IPIs */ 
 	IDTVEC(cpustop),	/* CPU stops & waits to be restarted */
 	IDTVEC(cpususpend),	/* CPU suspends & waits to be resumed */
 	IDTVEC(justreturn),	/* interrupt CPU with minimum overhead */
 	IDTVEC(rendezvous);	/* handle CPU rendezvous */
 
 struct pmap;
 
 /* functions in mp_machdep.c */
 void	assign_cpu_ids(void);
 void	cpu_add(u_int apic_id, char boot_cpu);
 void	cpustop_handler(void);
 void	cpususpend_handler(void);
 void	init_secondary_tail(void);
 void	invltlb_handler(void);
 void	invltlb_pcid_handler(void);
+void	invltlb_invpcid_handler(void);
 void	invlpg_handler(void);
-void	invlpg_pcid_handler(void);
 void	invlrng_handler(void);
 void	invlcache_handler(void);
 void	init_secondary(void);
 void	ipi_startup(int apic_id, int vector);
 void	ipi_all_but_self(u_int ipi);
 void 	ipi_bitmap_handler(struct trapframe frame);
 void	ipi_cpu(int cpu, u_int ipi);
 int	ipi_nmi_handler(void);
 void	ipi_selected(cpuset_t cpus, u_int ipi);
 u_int	mp_bootaddress(u_int);
 void	set_interrupt_apic_ids(void);
 void	smp_cache_flush(void);
-void	smp_invlpg(struct pmap *pmap, vm_offset_t addr);
-void	smp_masked_invlpg(cpuset_t mask, struct pmap *pmap, vm_offset_t addr);
-void	smp_invlpg_range(struct pmap *pmap, vm_offset_t startva,
+void	smp_masked_invlpg(cpuset_t mask, vm_offset_t addr);
+void	smp_masked_invlpg_range(cpuset_t mask, vm_offset_t startva,
 	    vm_offset_t endva);
-void	smp_masked_invlpg_range(cpuset_t mask, struct pmap *pmap,
-	    vm_offset_t startva, vm_offset_t endva);
-void	smp_invltlb(struct pmap *pmap);
 void	smp_masked_invltlb(cpuset_t mask, struct pmap *pmap);
 int	native_start_all_aps(void);
 void	mem_range_AP_init(void);
 void	topo_probe(void);
 void	ipi_send_cpu(int cpu, u_int ipi);
 
 #endif /* !LOCORE */
 #endif /* SMP */
 
 #endif /* _KERNEL */
 #endif /* _MACHINE_SMP_H_ */
Index: projects/release-arm-redux/sys/arm/ti/ti_i2c.c
===================================================================
--- projects/release-arm-redux/sys/arm/ti/ti_i2c.c	(revision 282691)
+++ projects/release-arm-redux/sys/arm/ti/ti_i2c.c	(revision 282692)
@@ -1,979 +1,990 @@
 /*-
  * Copyright (c) 2011 Ben Gray <ben.r.gray@gmail.com>.
  * Copyright (c) 2014 Luiz Otavio O Souza <loos@freebsd.org>.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 /**
  * Driver for the I2C module on the TI SoC.
  *
  * This driver is heavily based on the TWI driver for the AT91 (at91_twi.c).
  *
  * CAUTION: The I2Ci registers are limited to 16-bit and 8-bit data accesses;
  * 32-bit data access is not allowed and can corrupt register content.
  *
  * This driver currently doesn't use DMA for the transfer, although I hope to
  * incorporate that sometime in the future.  The idea being that for transactions
  * larger than a certain size the DMA engine is used, for anything less the
  * normal interrupt/fifo driven option is used.
  *
  *
  * WARNING: This driver uses mtx_sleep and interrupts to perform transactions,
  * which means you can't do a transaction during startup before the interrupts
  * have been enabled.  Hint - the FreeBSD function config_intrhook_establish().
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bus.h>
 #include <sys/conf.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/mbuf.h>
 #include <sys/malloc.h>
 #include <sys/module.h>
 #include <sys/mutex.h>
 #include <sys/rman.h>
 #include <sys/sysctl.h>
 #include <machine/bus.h>
 
 #include <dev/ofw/openfirm.h>
 #include <dev/ofw/ofw_bus.h>
 #include <dev/ofw/ofw_bus_subr.h>
 
 #include <arm/ti/ti_cpuid.h>
 #include <arm/ti/ti_prcm.h>
 #include <arm/ti/ti_i2c.h>
 
 #include <dev/iicbus/iiconf.h>
 #include <dev/iicbus/iicbus.h>
 
 #include "iicbus_if.h"
 
 /**
  *	I2C device driver context, a pointer to this is stored in the device
  *	driver structure.
  */
 struct ti_i2c_softc
 {
 	device_t		sc_dev;
 	uint32_t		device_id;
 	struct resource*	sc_irq_res;
 	struct resource*	sc_mem_res;
 	device_t		sc_iicbus;
 
 	void*			sc_irq_h;
 
 	struct mtx		sc_mtx;
 
 	struct iic_msg*		sc_buffer;
 	int			sc_bus_inuse;
 	int			sc_buffer_pos;
 	int			sc_error;
 	int			sc_fifo_trsh;
 	int			sc_timeout;
 
 	uint16_t		sc_con_reg;
 	uint16_t		sc_rev;
 };
 
 struct ti_i2c_clock_config
 {
 	u_int   frequency;	/* Bus frequency in Hz */
 	uint8_t psc;		/* Fast/Standard mode prescale divider */
 	uint8_t scll;		/* Fast/Standard mode SCL low time */
 	uint8_t sclh;		/* Fast/Standard mode SCL high time */
 	uint8_t hsscll;		/* High Speed mode SCL low time */
 	uint8_t hssclh;		/* High Speed mode SCL high time */
 };
 
 #if defined(SOC_OMAP4)
 /*
  * OMAP4 i2c bus clock is 96MHz / ((psc + 1) * (scll + 7 + sclh + 5)).
  * The prescaler values for 100KHz and 400KHz modes come from the table in the
  * OMAP4 TRM.  The table doesn't list 1MHz; these values should give that speed.
  */
 static struct ti_i2c_clock_config ti_omap4_i2c_clock_configs[] = {
 	{  100000, 23,  13,  15,  0,  0},
 	{  400000,  9,   5,   7,  0,  0},
 	{ 1000000,  3,   5,   7,  0,  0},
 /*	{ 3200000,  1, 113, 115,  7, 10}, - HS mode */
 	{       0 /* Table terminator */ }
 };
 #endif
 
 #if defined(SOC_TI_AM335X)
 /*
  * AM335x i2c bus clock is 48MHz / ((psc + 1) * (scll + 7 + sclh + 5))
  * In all cases we prescale the clock to 24MHz as recommended in the manual.
  */
 static struct ti_i2c_clock_config ti_am335x_i2c_clock_configs[] = {
 	{  100000, 1, 111, 117, 0, 0},
 	{  400000, 1,  23,  25, 0, 0},
 	{ 1000000, 1,   5,   7, 0, 0},
 	{       0 /* Table terminator */ }
 };
 #endif
 
 /**
  *	Locking macros used throughout the driver
  */
 #define	TI_I2C_LOCK(_sc)		mtx_lock(&(_sc)->sc_mtx)
 #define	TI_I2C_UNLOCK(_sc)		mtx_unlock(&(_sc)->sc_mtx)
 #define	TI_I2C_LOCK_INIT(_sc)						\
 	mtx_init(&_sc->sc_mtx, device_get_nameunit(_sc->sc_dev),	\
 	    "ti_i2c", MTX_DEF)
 #define	TI_I2C_LOCK_DESTROY(_sc)	mtx_destroy(&_sc->sc_mtx)
 #define	TI_I2C_ASSERT_LOCKED(_sc)	mtx_assert(&_sc->sc_mtx, MA_OWNED)
 #define	TI_I2C_ASSERT_UNLOCKED(_sc)	mtx_assert(&_sc->sc_mtx, MA_NOTOWNED)
 
 #ifdef DEBUG
 #define	ti_i2c_dbg(_sc, fmt, args...)					\
 	device_printf((_sc)->sc_dev, fmt, ##args)
 #else
 #define	ti_i2c_dbg(_sc, fmt, args...)
 #endif
 
 /**
  *	ti_i2c_read_2 - reads a 16-bit value from one of the I2C registers
  *	@sc: I2C device context
  *	@off: the byte offset within the register bank to read from.
  *
  *
  *	LOCKING:
  *	No locking required
  *
  *	RETURNS:
  *	16-bit value read from the register.
  */
 static inline uint16_t
 ti_i2c_read_2(struct ti_i2c_softc *sc, bus_size_t off)
 {
 
 	return (bus_read_2(sc->sc_mem_res, off));
 }
 
 /**
  *	ti_i2c_write_2 - writes a 16-bit value to one of the I2C registers
  *	@sc: I2C device context
  *	@off: the byte offset within the register bank to write to.
  *	@val: the value to write into the register
  *
  *	LOCKING:
  *	No locking required
  *
  *	RETURNS:
  *	nothing
  */
 static inline void
 ti_i2c_write_2(struct ti_i2c_softc *sc, bus_size_t off, uint16_t val)
 {
 
 	bus_write_2(sc->sc_mem_res, off, val);
 }
 
 static int
 ti_i2c_transfer_intr(struct ti_i2c_softc* sc, uint16_t status)
 {
 	int amount, done, i;
 
 	done = 0;
 	amount = 0;
 	/* Check for the error conditions. */
 	if (status & I2C_STAT_NACK) {
 		/* No ACK from slave. */
 		ti_i2c_dbg(sc, "NACK\n");
 		ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_NACK);
 		sc->sc_error = ENXIO;
 	} else if (status & I2C_STAT_AL) {
 		/* Arbitration lost. */
 		ti_i2c_dbg(sc, "Arbitration lost\n");
 		ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_AL);
 		sc->sc_error = ENXIO;
 	}
 
 	/* Check if we have finished. */
 	if (status & I2C_STAT_ARDY) {
 		/* Register access ready - transaction complete basically. */
 		ti_i2c_dbg(sc, "ARDY transaction complete\n");
 		if (sc->sc_error != 0 && sc->sc_buffer->flags & IIC_M_NOSTOP) {
 			ti_i2c_write_2(sc, I2C_REG_CON,
 			    sc->sc_con_reg | I2C_CON_STP);
 		}
 		ti_i2c_write_2(sc, I2C_REG_STATUS,
 		    I2C_STAT_ARDY | I2C_STAT_RDR | I2C_STAT_RRDY |
 		    I2C_STAT_XDR | I2C_STAT_XRDY);
 		return (1);
 	}
 
 	if (sc->sc_buffer->flags & IIC_M_RD) {
 		/* Read some data. */
 		if (status & I2C_STAT_RDR) {
 			/*
 			 * Receive draining interrupt - last data received.
 			 * The set FIFO threshold won't be reached to trigger
 			 * RRDY.
 			 */
 			ti_i2c_dbg(sc, "Receive draining interrupt\n");
 
 			/*
 			 * Drain the FIFO.  Read the pending data in the FIFO.
 			 */
 			amount = sc->sc_buffer->len - sc->sc_buffer_pos;
 		} else if (status & I2C_STAT_RRDY) {
 			/*
 			 * Receive data ready interrupt - FIFO has reached the
 			 * set threshold.
 			 */
 			ti_i2c_dbg(sc, "Receive data ready interrupt\n");
 
 			amount = min(sc->sc_fifo_trsh,
 			    sc->sc_buffer->len - sc->sc_buffer_pos);
 		}
 
 		/* Read the bytes from the fifo. */
 		for (i = 0; i < amount; i++)
 			sc->sc_buffer->buf[sc->sc_buffer_pos++] = 
 			    (uint8_t)(ti_i2c_read_2(sc, I2C_REG_DATA) & 0xff);
 
 		if (status & I2C_STAT_RDR)
 			ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_RDR);
 		if (status & I2C_STAT_RRDY)
 			ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_RRDY);
 
 	} else {
 		/* Write some data. */
 		if (status & I2C_STAT_XDR) {
 			/*
 			 * Transmit draining interrupt - FIFO level is below
 			 * the set threshold and the amount of data still to
 			 * be transferred won't reach the set FIFO threshold.
 			 */
 			ti_i2c_dbg(sc, "Transmit draining interrupt\n");
 
 			/*
 			 * Drain the TX data.  Write the pending data in the
 			 * FIFO.
 			 */
 			amount = sc->sc_buffer->len - sc->sc_buffer_pos;
 		} else if (status & I2C_STAT_XRDY) {
 			/*
 			 * Transmit data ready interrupt - the FIFO level
 			 * is below the set threshold.
 			 */
 			ti_i2c_dbg(sc, "Transmit data ready interrupt\n");
 
 			amount = min(sc->sc_fifo_trsh,
 			    sc->sc_buffer->len - sc->sc_buffer_pos);
 		}
 
 		/* Write the bytes to the fifo. */
 		for (i = 0; i < amount; i++)
 			ti_i2c_write_2(sc, I2C_REG_DATA,
 			    sc->sc_buffer->buf[sc->sc_buffer_pos++]);
 
 		if (status & I2C_STAT_XDR)
 			ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_XDR);
 		if (status & I2C_STAT_XRDY)
 			ti_i2c_write_2(sc, I2C_REG_STATUS, I2C_STAT_XRDY);
 	}
 
 	return (done);
 }
 
 /**
  *	ti_i2c_intr - interrupt handler for the I2C module
  *	@arg: I2C device context (struct ti_i2c_softc *)
  *
  *
  *
  *	LOCKING:
  *	Called from interrupt context; takes the softc mutex internally
  *
  *	RETURNS:
  *	nothing
  */
 static void
 ti_i2c_intr(void *arg)
 {
 	int done;
 	struct ti_i2c_softc *sc;
 	uint16_t events, status;
 
  	sc = (struct ti_i2c_softc *)arg;
 
 	TI_I2C_LOCK(sc);
 
 	status = ti_i2c_read_2(sc, I2C_REG_STATUS);
 	if (status == 0) {
 		TI_I2C_UNLOCK(sc);
 		return;
 	}
 
 	/* Save enabled interrupts. */
 	events = ti_i2c_read_2(sc, I2C_REG_IRQENABLE_SET);
 
 	/* We only care about enabled interrupts. */
 	status &= events;
 
 	done = 0;
 
 	if (sc->sc_buffer != NULL)
 		done = ti_i2c_transfer_intr(sc, status);
 	else {
 		ti_i2c_dbg(sc, "Transfer interrupt without buffer\n");
 		sc->sc_error = EINVAL;
 		done = 1;
 	}
 
 	if (done)
 		/* Wakeup the process that started the transaction. */
 		wakeup(sc);
 
 	TI_I2C_UNLOCK(sc);
 }
 
 /**
  *	ti_i2c_transfer - called to perform the transfer
  *	@dev: i2c device handle
  *	@msgs: the messages to send/receive
  *	@nmsgs: the number of messages in the msgs array
  *
  *
  *	LOCKING:
  *	Internally locked
  *
  *	RETURNS:
  *	0 if the transfer succeeded
  *	EINVAL if invalid message is passed as an arg
  */
 static int
 ti_i2c_transfer(device_t dev, struct iic_msg *msgs, uint32_t nmsgs)
 {
 	int err, i, repstart, timeout;
 	struct ti_i2c_softc *sc;
 	uint16_t reg;
 
  	sc = device_get_softc(dev);
 	TI_I2C_LOCK(sc);
 
 	/* If the controller is busy wait until it is available. */
 	while (sc->sc_bus_inuse == 1)
 		mtx_sleep(sc, &sc->sc_mtx, 0, "i2cbuswait", 0);
 
 	/* Now we have control over the I2C controller. */
 	sc->sc_bus_inuse = 1;
 
 	err = 0;
 	repstart = 0;
 	for (i = 0; i < nmsgs; i++) {
 
 		sc->sc_buffer = &msgs[i];
 		sc->sc_buffer_pos = 0;
 		sc->sc_error = 0;
 
 		/* Zero byte transfers aren't allowed. */
 		if (sc->sc_buffer == NULL || sc->sc_buffer->buf == NULL ||
 		    sc->sc_buffer->len == 0) {
 			err = EINVAL;
 			break;
 		}
 
 		/* Check if the i2c bus is free. */
 		if (repstart == 0) {
 			/*
 			 * On repeated start we send the START condition while
 			 * the bus _is_ busy.
 			 */
 			timeout = 0;
 			while (ti_i2c_read_2(sc, I2C_REG_STATUS_RAW) & I2C_STAT_BB) {
 				if (timeout++ > 100) {
 					err = EBUSY;
 					goto out;
 				}
 				DELAY(1000);
 			}
 			timeout = 0;
 		} else
 			repstart = 0;
 
 		if (sc->sc_buffer->flags & IIC_M_NOSTOP)
 			repstart = 1;
 
 		/* Set the slave address. */
 		ti_i2c_write_2(sc, I2C_REG_SA, msgs[i].slave >> 1);
 
 		/* Write the data length. */
 		ti_i2c_write_2(sc, I2C_REG_CNT, sc->sc_buffer->len);
 
 		/* Clear the RX and the TX FIFO. */
 		reg = ti_i2c_read_2(sc, I2C_REG_BUF);
 		reg |= I2C_BUF_RXFIFO_CLR | I2C_BUF_TXFIFO_CLR;
 		ti_i2c_write_2(sc, I2C_REG_BUF, reg);
 
 		reg = sc->sc_con_reg | I2C_CON_STT;
 		if (repstart == 0)
 			reg |= I2C_CON_STP;
 		if ((sc->sc_buffer->flags & IIC_M_RD) == 0)
 			reg |= I2C_CON_TRX;
 		ti_i2c_write_2(sc, I2C_REG_CON, reg);
 
 		/* Wait for an event. */
 		err = mtx_sleep(sc, &sc->sc_mtx, 0, "i2ciowait", sc->sc_timeout);
 		if (err == 0)
 			err = sc->sc_error;
 
 		if (err)
 			break;
 	}
 
 out:
 	if (timeout == 0) {
 		while (ti_i2c_read_2(sc, I2C_REG_STATUS_RAW) & I2C_STAT_BB) {
 			if (timeout++ > 100)
 				break;
 			DELAY(1000);
 		}
 	}
 	/* Put the controller in master mode again. */
 	if ((ti_i2c_read_2(sc, I2C_REG_CON) & I2C_CON_MST) == 0)
 		ti_i2c_write_2(sc, I2C_REG_CON, sc->sc_con_reg);
 
 	sc->sc_buffer = NULL;
 	sc->sc_bus_inuse = 0;
 
 	/* Wake up the processes that are waiting for the bus. */
 	wakeup(sc);
 
 	TI_I2C_UNLOCK(sc);
 
 	return (err);
 }
 
 static int
 ti_i2c_reset(struct ti_i2c_softc *sc, u_char speed)
 {
 	int timeout;
 	struct ti_i2c_clock_config *clkcfg;
 	u_int busfreq;
 	uint16_t fifo_trsh, reg, scll, sclh;
 
 	switch (ti_chip()) {
 #ifdef SOC_OMAP4
 	case CHIP_OMAP_4:
 		clkcfg = ti_omap4_i2c_clock_configs;
 		break;
 #endif
 #ifdef SOC_TI_AM335X
 	case CHIP_AM335X:
 		clkcfg = ti_am335x_i2c_clock_configs;
 		break;
 #endif
 	default:
 		panic("Unknown Ti SoC, unable to reset the i2c");
 	}
 
 	/*
 	 * If we haven't attached the bus yet, just init at the default slow
 	 * speed.  This lets us get the hardware initialized enough to attach
 	 * the bus which is where the real speed configuration is handled. After
 	 * the bus is attached, get the configured speed from it.  Search the
 	 * configuration table for the best speed we can do that doesn't exceed
 	 * the requested speed.
 	 */
 	if (sc->sc_iicbus == NULL)
 		busfreq = 100000;
 	else
 		busfreq = IICBUS_GET_FREQUENCY(sc->sc_iicbus, speed);
 	for (;;) {
 		if (clkcfg[1].frequency == 0 || clkcfg[1].frequency > busfreq)
 			break;
 		clkcfg++;
 	}
 
 	/*
 	 * 23.1.4.3 - HS I2C Software Reset
 	 *    From OMAP4 TRM at page 4068.
 	 *
 	 * 1. Ensure that the module is disabled.
 	 */
 	sc->sc_con_reg = 0;
 	ti_i2c_write_2(sc, I2C_REG_CON, sc->sc_con_reg);
 
 	/* 2. Issue a softreset to the controller. */
 	bus_write_2(sc->sc_mem_res, I2C_REG_SYSC, I2C_REG_SYSC_SRST);
 
 	/*
 	 * 3. Enable the module.
 	 *    The I2Ci.I2C_SYSS[0] RDONE bit is asserted only after the module
 	 *    is enabled by setting the I2Ci.I2C_CON[15] I2C_EN bit to 1.
 	 */
 	ti_i2c_write_2(sc, I2C_REG_CON, I2C_CON_I2C_EN);
 
  	/* 4. Wait for the software reset to complete. */
 	timeout = 0;
 	while ((ti_i2c_read_2(sc, I2C_REG_SYSS) & I2C_SYSS_RDONE) == 0) {
 		if (timeout++ > 100)
 			return (EBUSY);
 		DELAY(100);
 	}
 
 	/*
 	 * Disable the I2C controller once again, now that the reset has
 	 * finished.
 	 */
 	ti_i2c_write_2(sc, I2C_REG_CON, sc->sc_con_reg);
 
 	/*
 	 * The following sequence is taken from the OMAP4 TRM at page 4077.
 	 *
 	 * 1. Enable the functional and interface clocks (see Section
 	 *    23.1.5.1.1.1.1).  Done at ti_i2c_activate().
 	 *
 	 * 2. Program the prescaler to obtain an approximately 12MHz internal
 	 *    sampling clock (I2Ci_INTERNAL_CLK) by programming the
 	 *    corresponding value in the I2Ci.I2C_PSC[3:0] PSC field.
 	 *    This value depends on the frequency of the functional clock
 	 *    (I2Ci_FCLK).  Because this frequency is 96MHz, the
 	 *    I2Ci.I2C_PSC[7:0] PSC field value is 0x7.
 	 */
 	ti_i2c_write_2(sc, I2C_REG_PSC, clkcfg->psc);
 
 	/*
 	 * 3. Program the I2Ci.I2C_SCLL[7:0] SCLL and I2Ci.I2C_SCLH[7:0] SCLH
 	 *    bit fields to obtain a bit rate of 100 Kbps, 400 Kbps or 1Mbps.
 	 *    These values depend on the internal sampling clock frequency
 	 *    (see Table 23-8).
 	 */
 	scll = clkcfg->scll & I2C_SCLL_MASK;
 	sclh = clkcfg->sclh & I2C_SCLH_MASK;
 
 	/*
 	 * 4. (Optional) Program the I2Ci.I2C_SCLL[15:8] HSSCLL and
 	 *    I2Ci.I2C_SCLH[15:8] HSSCLH fields to obtain a bit rate of
 	 *    400K bps or 3.4M bps (for the second phase of HS mode).  These
 	 *    values depend on the internal sampling clock frequency (see
 	 *    Table 23-8).
 	 *
 	 * 5. (Optional) If a bit rate of 3.4M bps is used and the bus line
 	 *    capacitance exceeds 45 pF, (see Section 18.4.8, PAD Functional
 	 *    Multiplexing and Configuration).
 	 */
 	switch (ti_chip()) {
 #ifdef SOC_OMAP4
 	case CHIP_OMAP_4:
 		if ((clkcfg->hsscll + clkcfg->hssclh) > 0) {
 			scll |= clkcfg->hsscll << I2C_HSSCLL_SHIFT;
 			sclh |= clkcfg->hssclh << I2C_HSSCLH_SHIFT;
 			sc->sc_con_reg |= I2C_CON_OPMODE_HS;
 		}
 		break;
 #endif
 	}
 
 	/* Write the selected bit rate. */
 	ti_i2c_write_2(sc, I2C_REG_SCLL, scll);
 	ti_i2c_write_2(sc, I2C_REG_SCLH, sclh);
 
 	/*
 	 * 6. Configure the Own Address of the I2C controller by storing it in
 	 *    the I2Ci.I2C_OA0 register.  Up to four Own Addresses can be
 	 *    programmed in the I2Ci.I2C_OAi registers (where i = 0, 1, 2, 3)
 	 *    for each I2C controller.
 	 *
 	 * Note: For a 10-bit address, set the corresponding expand Own Address
 	 * bit in the I2Ci.I2C_CON register.
 	 *
 	 * Driver currently always in single master mode so ignore this step.
 	 */
 
 	/*
 	 * 7. Set the TX threshold (in transmitter mode) and the RX threshold
 	 *    (in receiver mode) by setting the I2Ci.I2C_BUF[5:0]XTRSH field to
 	 *    (TX threshold - 1) and the I2Ci.I2C_BUF[13:8]RTRSH field to (RX
 	 *    threshold - 1), where the TX and RX thresholds are greater than
 	 *    or equal to 1.
 	 *
 	 * The threshold is set to 5 for now.
 	 */
 	fifo_trsh = (sc->sc_fifo_trsh - 1) & I2C_BUF_TRSH_MASK;
 	reg = fifo_trsh | (fifo_trsh << I2C_BUF_RXTRSH_SHIFT);
 	ti_i2c_write_2(sc, I2C_REG_BUF, reg);
 
 	/*
 	 * 8. Take the I2C controller out of reset by setting the
 	 *    I2Ci.I2C_CON[15] I2C_EN bit to 1.
 	 *
 	 * 23.1.5.1.1.1.2 - Initialize the I2C Controller
 	 *
 	 * To initialize the I2C controller, perform the following steps:
 	 *
 	 * 1. Configure the I2Ci.I2C_CON register:
 	 *     . For master or slave mode, set the I2Ci.I2C_CON[10] MST bit
 	 *       (0: slave, 1: master).
 	 *     . For transmitter or receiver mode, set the I2Ci.I2C_CON[9] TRX
 	 *       bit (0: receiver, 1: transmitter).
 	 */
 
 	/* Enable the I2C controller in master mode. */
 	sc->sc_con_reg |= I2C_CON_I2C_EN | I2C_CON_MST;
 	ti_i2c_write_2(sc, I2C_REG_CON, sc->sc_con_reg);
 
 	/*
 	 * 2. If using an interrupt to transmit/receive data, set the
 	 *    corresponding bit in the I2Ci.I2C_IE register (the I2Ci.I2C_IE[4]
 	 *    XRDY_IE bit for the transmit interrupt, the I2Ci.I2C_IE[3] RRDY
 	 *    bit for the receive interrupt).
 	 */
 
 	/* Select the interrupts we want to be notified about. */
 	reg = I2C_IE_XDR |	/* Transmit draining interrupt. */
 	    I2C_IE_XRDY |	/* Transmit Data Ready interrupt. */
 	    I2C_IE_RDR |	/* Receive draining interrupt. */
 	    I2C_IE_RRDY |	/* Receive Data Ready interrupt. */
 	    I2C_IE_ARDY |	/* Register Access Ready interrupt. */
 	    I2C_IE_NACK |	/* No Acknowledgment interrupt. */
 	    I2C_IE_AL;		/* Arbitration lost interrupt. */
 
 	/* Enable the interrupts. */
 	ti_i2c_write_2(sc, I2C_REG_IRQENABLE_SET, reg);
 
 	/*
 	 * 3. If using DMA to receive/transmit data, set to 1 the corresponding
 	 *    bit in the I2Ci.I2C_BUF register (the I2Ci.I2C_BUF[15] RDMA_EN
 	 *    bit for the receive DMA channel, the I2Ci.I2C_BUF[7] XDMA_EN bit
 	 *    for the transmit DMA channel).
 	 *
 	 * Not using DMA for now, so ignore this.
 	 */
 
 	return (0);
 }
 
 static int
 ti_i2c_iicbus_reset(device_t dev, u_char speed, u_char addr, u_char *oldaddr)
 {
 	struct ti_i2c_softc *sc;
 	int err;
 
 	sc = device_get_softc(dev);
 	TI_I2C_LOCK(sc);
 	err = ti_i2c_reset(sc, speed);
 	TI_I2C_UNLOCK(sc);
 	if (err)
 		return (err);
 
 	return (IIC_ENOADDR);
 }
 
 static int
 ti_i2c_activate(device_t dev)
 {
 	clk_ident_t clk;
 	int err;
 	struct ti_i2c_softc *sc;
 
 	sc = (struct ti_i2c_softc*)device_get_softc(dev);
 
 	/*
 	 * 1. Enable the functional and interface clocks (see Section
 	 * 23.1.5.1.1.1.1).
 	 */
 	clk = I2C0_CLK + sc->device_id;
 	err = ti_prcm_clk_enable(clk);
 	if (err)
 		return (err);
 
 	return (ti_i2c_reset(sc, IIC_UNKNOWN));
 }
 
 /**
  *	ti_i2c_deactivate - deactivates the controller and releases resources
  *	@dev: i2c device handle
  *
  *
  *
  *	LOCKING:
  *	Assumed called in an atomic context.
  *
  *	RETURNS:
  *	nothing
  */
 static void
 ti_i2c_deactivate(device_t dev)
 {
 	struct ti_i2c_softc *sc = device_get_softc(dev);
 	clk_ident_t clk;
 
 	/* Disable the controller - cancel all transactions. */
 	ti_i2c_write_2(sc, I2C_REG_IRQENABLE_CLR, 0xffff);
 	ti_i2c_write_2(sc, I2C_REG_STATUS, 0xffff);
 	ti_i2c_write_2(sc, I2C_REG_CON, 0);
 
 	/* Release the interrupt handler. */
 	if (sc->sc_irq_h != NULL) {
 		bus_teardown_intr(dev, sc->sc_irq_res, sc->sc_irq_h);
 		sc->sc_irq_h = NULL;
 	}
 
 	bus_generic_detach(sc->sc_dev);
 
 	/* Unmap the I2C controller registers. */
 	if (sc->sc_mem_res != NULL) {
 		bus_release_resource(dev, SYS_RES_MEMORY, 0, sc->sc_mem_res);
 		sc->sc_mem_res = NULL;
 	}
 
 	/* Release the IRQ resource. */
 	if (sc->sc_irq_res != NULL) {
 		bus_release_resource(dev, SYS_RES_IRQ, 0, sc->sc_irq_res);
 		sc->sc_irq_res = NULL;
 	}
 
 	/* Finally disable the functional and interface clocks. */
 	clk = I2C0_CLK + sc->device_id;
 	ti_prcm_clk_disable(clk);
 }
 
 static int
 ti_i2c_sysctl_clk(SYSCTL_HANDLER_ARGS)
 {
 	int clk, psc, sclh, scll;
 	struct ti_i2c_softc *sc;
 
 	sc = arg1;
 
 	TI_I2C_LOCK(sc);
 	/* Get the system prescaler value. */
 	psc = (int)ti_i2c_read_2(sc, I2C_REG_PSC) + 1;
 
 	/* Get the bitrate. */
 	scll = (int)ti_i2c_read_2(sc, I2C_REG_SCLL) & I2C_SCLL_MASK;
 	sclh = (int)ti_i2c_read_2(sc, I2C_REG_SCLH) & I2C_SCLH_MASK;
 
 	clk = I2C_CLK / psc / (scll + 7 + sclh + 5);
 	TI_I2C_UNLOCK(sc);
 
 	return (sysctl_handle_int(oidp, &clk, 0, req));
 }
 
 static int
 ti_i2c_sysctl_timeout(SYSCTL_HANDLER_ARGS)
 {
 	struct ti_i2c_softc *sc;
 	unsigned int val;
 	int err;
 
 	sc = arg1;
 
 	/* 
 	 * A MTX_DEF lock can't be held while sysctl_handle_int copies
 	 * data to or from userspace.
 	 */
 	TI_I2C_LOCK(sc);
 	val = sc->sc_timeout;
 	TI_I2C_UNLOCK(sc);
 
 	err = sysctl_handle_int(oidp, &val, 0, req);
 	/* Write request? */
 	if ((err == 0) && (req->newptr != NULL)) {
 		TI_I2C_LOCK(sc);
 		sc->sc_timeout = val;
 		TI_I2C_UNLOCK(sc);
 	}
 
 	return (err);
 }
 
 static int
 ti_i2c_probe(device_t dev)
 {
 
 	if (!ofw_bus_status_okay(dev))
 		return (ENXIO);
 	if (!ofw_bus_is_compatible(dev, "ti,i2c"))
 		return (ENXIO);
 	device_set_desc(dev, "TI I2C Controller");
 
 	return (0);
 }
 
 static int
 ti_i2c_attach(device_t dev)
 {
 	int err, rid;
 	phandle_t node;
 	struct ti_i2c_softc *sc;
 	struct sysctl_ctx_list *ctx;
 	struct sysctl_oid_list *tree;
 	uint16_t fifosz;
 
  	sc = device_get_softc(dev);
 	sc->sc_dev = dev;
 
 	/* Get the i2c device id from FDT. */
 	node = ofw_bus_get_node(dev);
 	if ((OF_getencprop(node, "i2c-device-id", &sc->device_id,
 	    sizeof(sc->device_id))) <= 0) {
 		device_printf(dev, "missing i2c-device-id attribute in FDT\n");
 		return (ENXIO);
 	}
 
 	/* Get the memory resource for the register mapping. */
 	rid = 0;
 	sc->sc_mem_res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid,
 	    RF_ACTIVE);
 	if (sc->sc_mem_res == NULL) {
 		device_printf(dev, "Cannot map registers.\n");
 		return (ENXIO);
 	}
 
 	/* Allocate our IRQ resource. */
 	rid = 0;
 	sc->sc_irq_res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &rid,
 	    RF_ACTIVE | RF_SHAREABLE);
 	if (sc->sc_irq_res == NULL) {
 		bus_release_resource(dev, SYS_RES_MEMORY, 0, sc->sc_mem_res);
 		device_printf(dev, "Cannot allocate interrupt.\n");
 		return (ENXIO);
 	}
 
 	TI_I2C_LOCK_INIT(sc);
 
 	/* First of all, we _must_ activate the H/W. */
 	err = ti_i2c_activate(dev);
 	if (err) {
 		device_printf(dev, "ti_i2c_activate failed\n");
 		goto out;
 	}
 
 	/* Read the version number of the I2C module */
 	sc->sc_rev = ti_i2c_read_2(sc, I2C_REG_REVNB_HI) & 0xff;
 
 	/* Get the fifo size. */
 	fifosz = ti_i2c_read_2(sc, I2C_REG_BUFSTAT);
 	fifosz >>= I2C_BUFSTAT_FIFODEPTH_SHIFT;
 	fifosz &= I2C_BUFSTAT_FIFODEPTH_MASK;
 
 	device_printf(dev, "I2C revision %d.%d FIFO size: %d bytes\n",
 	    sc->sc_rev >> 4, sc->sc_rev & 0xf, 8 << fifosz);
 
 	/* Set the FIFO threshold to 5 for now. */
 	sc->sc_fifo_trsh = 5;
 
 	/* Set I2C bus timeout */
 	sc->sc_timeout = 5*hz;
 
 	ctx = device_get_sysctl_ctx(dev);
 	tree = SYSCTL_CHILDREN(device_get_sysctl_tree(dev));
 	SYSCTL_ADD_PROC(ctx, tree, OID_AUTO, "i2c_clock",
 	    CTLFLAG_RD | CTLTYPE_UINT | CTLFLAG_MPSAFE, sc, 0,
 	    ti_i2c_sysctl_clk, "IU", "I2C bus clock");
 
 	SYSCTL_ADD_PROC(ctx, tree, OID_AUTO, "i2c_timeout",
 	    CTLFLAG_RW | CTLTYPE_UINT | CTLFLAG_MPSAFE, sc, 0,
 	    ti_i2c_sysctl_timeout, "IU", "I2C bus timeout (in ticks)");
 
 	/* Activate the interrupt. */
 	err = bus_setup_intr(dev, sc->sc_irq_res, INTR_TYPE_MISC | INTR_MPSAFE,
 	    NULL, ti_i2c_intr, sc, &sc->sc_irq_h);
 	if (err)
 		goto out;
 
 	/* Attach the iicbus. */
 	if ((sc->sc_iicbus = device_add_child(dev, "iicbus", -1)) == NULL) {
 		device_printf(dev, "could not allocate iicbus instance\n");
 		err = ENXIO;
 		goto out;
 	}
 
 	/* Probe and attach the iicbus */
 	bus_generic_attach(dev);
 
 out:
 	if (err) {
 		ti_i2c_deactivate(dev);
 		TI_I2C_LOCK_DESTROY(sc);
 	}
 
 	return (err);
 }
 
 static int
 ti_i2c_detach(device_t dev)
 {
 	struct ti_i2c_softc *sc;
 	int rv;
 
  	sc = device_get_softc(dev);
 	ti_i2c_deactivate(dev);
 	TI_I2C_LOCK_DESTROY(sc);
 	if (sc->sc_iicbus &&
 	    (rv = device_delete_child(dev, sc->sc_iicbus)) != 0)
 		return (rv);
 
 	return (0);
 }
 
 static phandle_t
 ti_i2c_get_node(device_t bus, device_t dev)
 {
 
 	/* Share controller node with iibus device. */
 	return (ofw_bus_get_node(bus));
 }
 
 static device_method_t ti_i2c_methods[] = {
 	/* Device interface */
 	DEVMETHOD(device_probe,		ti_i2c_probe),
 	DEVMETHOD(device_attach,	ti_i2c_attach),
 	DEVMETHOD(device_detach,	ti_i2c_detach),
 
+	/* Bus interface */
+	DEVMETHOD(bus_setup_intr,	bus_generic_setup_intr),
+	DEVMETHOD(bus_teardown_intr,	bus_generic_teardown_intr),
+	DEVMETHOD(bus_alloc_resource,	bus_generic_alloc_resource),
+	DEVMETHOD(bus_release_resource,	bus_generic_release_resource),
+	DEVMETHOD(bus_activate_resource, bus_generic_activate_resource),
+	DEVMETHOD(bus_deactivate_resource, bus_generic_deactivate_resource),
+	DEVMETHOD(bus_adjust_resource,	bus_generic_adjust_resource),
+	DEVMETHOD(bus_set_resource,	bus_generic_rl_set_resource),
+	DEVMETHOD(bus_get_resource,	bus_generic_rl_get_resource),
+
 	/* OFW methods */
 	DEVMETHOD(ofw_bus_get_node,	ti_i2c_get_node),
 
 	/* iicbus interface */
 	DEVMETHOD(iicbus_callback,	iicbus_null_callback),
 	DEVMETHOD(iicbus_reset,		ti_i2c_iicbus_reset),
 	DEVMETHOD(iicbus_transfer,	ti_i2c_transfer),
 
 	DEVMETHOD_END
 };
 
 static driver_t ti_i2c_driver = {
 	"iichb",
 	ti_i2c_methods,
 	sizeof(struct ti_i2c_softc),
 };
 
 static devclass_t ti_i2c_devclass;
 
 DRIVER_MODULE(ti_iic, simplebus, ti_i2c_driver, ti_i2c_devclass, 0, 0);
 DRIVER_MODULE(iicbus, ti_iic, iicbus_driver, iicbus_devclass, 0, 0);
 
 MODULE_DEPEND(ti_iic, ti_prcm, 1, 1, 1);
 MODULE_DEPEND(ti_iic, iicbus, 1, 1, 1);
Index: projects/release-arm-redux/sys/dev/acpica/acpi_cpu.c
===================================================================
--- projects/release-arm-redux/sys/dev/acpica/acpi_cpu.c	(revision 282691)
+++ projects/release-arm-redux/sys/dev/acpica/acpi_cpu.c	(revision 282692)
@@ -1,1373 +1,1511 @@
 /*-
  * Copyright (c) 2003-2005 Nate Lawson (SDG)
  * Copyright (c) 2001 Michael Smith
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_acpi.h"
 #include <sys/param.h>
 #include <sys/bus.h>
 #include <sys/cpu.h>
 #include <sys/kernel.h>
 #include <sys/malloc.h>
 #include <sys/module.h>
 #include <sys/pcpu.h>
 #include <sys/power.h>
 #include <sys/proc.h>
 #include <sys/sched.h>
 #include <sys/sbuf.h>
 #include <sys/smp.h>
 
 #include <dev/pci/pcivar.h>
 #include <machine/atomic.h>
 #include <machine/bus.h>
 #if defined(__amd64__) || defined(__i386__)
 #include <machine/clock.h>
+#include <machine/specialreg.h>
+#include <machine/md_var.h>
 #endif
 #include <sys/rman.h>
 
 #include <contrib/dev/acpica/include/acpi.h>
 #include <contrib/dev/acpica/include/accommon.h>
 
 #include <dev/acpica/acpivar.h>
 
 /*
  * Support for ACPI Processor devices, including C[1-3] sleep states.
  */
 
 /* Hooks for the ACPI CA debugging infrastructure */
 #define _COMPONENT	ACPI_PROCESSOR
 ACPI_MODULE_NAME("PROCESSOR")
 
 struct acpi_cx {
     struct resource	*p_lvlx;	/* Register to read to enter state. */
     uint32_t		 type;		/* C1-3 (C4 and up treated as C3). */
     uint32_t		 trans_lat;	/* Transition latency (usec). */
     uint32_t		 power;		/* Power consumed (mW). */
     int			 res_type;	/* Resource type for p_lvlx. */
     int			 res_rid;	/* Resource ID for p_lvlx. */
+    bool		 do_mwait;
+    uint32_t		 mwait_hint;
+    bool		 mwait_hw_coord;
+    bool		 mwait_bm_avoidance;
 };
 #define MAX_CX_STATES	 8
 
 struct acpi_cpu_softc {
     device_t		 cpu_dev;
     ACPI_HANDLE		 cpu_handle;
     struct pcpu		*cpu_pcpu;
     uint32_t		 cpu_acpi_id;	/* ACPI processor id */
     uint32_t		 cpu_p_blk;	/* ACPI P_BLK location */
     uint32_t		 cpu_p_blk_len;	/* P_BLK length (must be 6). */
     struct acpi_cx	 cpu_cx_states[MAX_CX_STATES];
     int			 cpu_cx_count;	/* Number of valid Cx states. */
     int			 cpu_prev_sleep;/* Last idle sleep duration. */
     int			 cpu_features;	/* Child driver supported features. */
     /* Runtime state. */
     int			 cpu_non_c2;	/* Index of lowest non-C2 state. */
     int			 cpu_non_c3;	/* Index of lowest non-C3 state. */
     u_int		 cpu_cx_stats[MAX_CX_STATES];/* Cx usage history. */
     /* Values for sysctl. */
     struct sysctl_ctx_list cpu_sysctl_ctx;
     struct sysctl_oid	*cpu_sysctl_tree;
     int			 cpu_cx_lowest;
     int			 cpu_cx_lowest_lim;
     int			 cpu_disable_idle; /* Disable entry to idle function */
     char 		 cpu_cx_supported[64];
 };
 
 struct acpi_cpu_device {
     struct resource_list	ad_rl;
 };
 
 #define CPU_GET_REG(reg, width) 					\
     (bus_space_read_ ## width(rman_get_bustag((reg)), 			\
 		      rman_get_bushandle((reg)), 0))
 #define CPU_SET_REG(reg, width, val)					\
     (bus_space_write_ ## width(rman_get_bustag((reg)), 			\
 		       rman_get_bushandle((reg)), 0, (val)))
 
 #define PM_USEC(x)	 ((x) >> 2)	/* ~4 clocks per usec (3.57955 Mhz) */
 
 #define ACPI_NOTIFY_CX_STATES	0x81	/* _CST changed. */
 
 #define CPU_QUIRK_NO_C3		(1<<0)	/* C3-type states are not usable. */
 #define CPU_QUIRK_NO_BM_CTRL	(1<<2)	/* No bus mastering control. */
 
 #define PCI_VENDOR_INTEL	0x8086
 #define PCI_DEVICE_82371AB_3	0x7113	/* PIIX4 chipset for quirks. */
 #define PCI_REVISION_A_STEP	0
 #define PCI_REVISION_B_STEP	1
 #define PCI_REVISION_4E		2
 #define PCI_REVISION_4M		3
 #define PIIX4_DEVACTB_REG	0x58
 #define PIIX4_BRLD_EN_IRQ0	(1<<0)
 #define PIIX4_BRLD_EN_IRQ	(1<<1)
 #define PIIX4_BRLD_EN_IRQ8	(1<<5)
 #define PIIX4_STOP_BREAK_MASK	(PIIX4_BRLD_EN_IRQ0 | PIIX4_BRLD_EN_IRQ | PIIX4_BRLD_EN_IRQ8)
 #define PIIX4_PCNTRL_BST_EN	(1<<10)
 
+#define	CST_FFH_VENDOR_INTEL	1
+#define	CST_FFH_INTEL_CL_C1IO	1
+#define	CST_FFH_INTEL_CL_MWAIT	2
+#define	CST_FFH_MWAIT_HW_COORD	0x0001
+#define	CST_FFH_MWAIT_BM_AVOID	0x0002
+
 /* Allow users to ignore processor orders in MADT. */
 static int cpu_unordered;
 SYSCTL_INT(_debug_acpi, OID_AUTO, cpu_unordered, CTLFLAG_RDTUN,
     &cpu_unordered, 0,
     "Do not use the MADT to match ACPI Processor objects to CPUs.");
 
 /* Knob to disable acpi_cpu devices */
 bool acpi_cpu_disabled = false;
 
 /* Platform hardware resource information. */
 static uint32_t		 cpu_smi_cmd;	/* Value to write to SMI_CMD. */
 static uint8_t		 cpu_cst_cnt;	/* Indicate we are _CST aware. */
 static int		 cpu_quirks;	/* Indicate any hardware bugs. */
 
 /* Values for sysctl. */
 static struct sysctl_ctx_list cpu_sysctl_ctx;
 static struct sysctl_oid *cpu_sysctl_tree;
 static int		 cpu_cx_generic;
 static int		 cpu_cx_lowest_lim;
 
 static device_t		*cpu_devices;
 static int		 cpu_ndevices;
 static struct acpi_cpu_softc **cpu_softc;
 ACPI_SERIAL_DECL(cpu, "ACPI CPU");
 
 static int	acpi_cpu_probe(device_t dev);
 static int	acpi_cpu_attach(device_t dev);
 static int	acpi_cpu_suspend(device_t dev);
 static int	acpi_cpu_resume(device_t dev);
 static int	acpi_pcpu_get_id(device_t dev, uint32_t *acpi_id,
 		    uint32_t *cpu_id);
 static struct resource_list *acpi_cpu_get_rlist(device_t dev, device_t child);
 static device_t	acpi_cpu_add_child(device_t dev, u_int order, const char *name,
 		    int unit);
 static int	acpi_cpu_read_ivar(device_t dev, device_t child, int index,
 		    uintptr_t *result);
 static int	acpi_cpu_shutdown(device_t dev);
 static void	acpi_cpu_cx_probe(struct acpi_cpu_softc *sc);
 static void	acpi_cpu_generic_cx_probe(struct acpi_cpu_softc *sc);
 static int	acpi_cpu_cx_cst(struct acpi_cpu_softc *sc);
 static void	acpi_cpu_startup(void *arg);
 static void	acpi_cpu_startup_cx(struct acpi_cpu_softc *sc);
 static void	acpi_cpu_cx_list(struct acpi_cpu_softc *sc);
 static void	acpi_cpu_idle(sbintime_t sbt);
 static void	acpi_cpu_notify(ACPI_HANDLE h, UINT32 notify, void *context);
 static int	acpi_cpu_quirks(void);
 static int	acpi_cpu_usage_sysctl(SYSCTL_HANDLER_ARGS);
 static int	acpi_cpu_usage_counters_sysctl(SYSCTL_HANDLER_ARGS);
 static int	acpi_cpu_set_cx_lowest(struct acpi_cpu_softc *sc);
 static int	acpi_cpu_cx_lowest_sysctl(SYSCTL_HANDLER_ARGS);
 static int	acpi_cpu_global_cx_lowest_sysctl(SYSCTL_HANDLER_ARGS);
+#if defined(__i386__) || defined(__amd64__)
+static int	acpi_cpu_method_sysctl(SYSCTL_HANDLER_ARGS);
+#endif
 
 static device_method_t acpi_cpu_methods[] = {
     /* Device interface */
     DEVMETHOD(device_probe,	acpi_cpu_probe),
     DEVMETHOD(device_attach,	acpi_cpu_attach),
     DEVMETHOD(device_detach,	bus_generic_detach),
     DEVMETHOD(device_shutdown,	acpi_cpu_shutdown),
     DEVMETHOD(device_suspend,	acpi_cpu_suspend),
     DEVMETHOD(device_resume,	acpi_cpu_resume),
 
     /* Bus interface */
     DEVMETHOD(bus_add_child,	acpi_cpu_add_child),
     DEVMETHOD(bus_read_ivar,	acpi_cpu_read_ivar),
     DEVMETHOD(bus_get_resource_list, acpi_cpu_get_rlist),
     DEVMETHOD(bus_get_resource,	bus_generic_rl_get_resource),
     DEVMETHOD(bus_set_resource,	bus_generic_rl_set_resource),
     DEVMETHOD(bus_alloc_resource, bus_generic_rl_alloc_resource),
     DEVMETHOD(bus_release_resource, bus_generic_rl_release_resource),
     DEVMETHOD(bus_activate_resource, bus_generic_activate_resource),
     DEVMETHOD(bus_deactivate_resource, bus_generic_deactivate_resource),
     DEVMETHOD(bus_setup_intr,	bus_generic_setup_intr),
     DEVMETHOD(bus_teardown_intr, bus_generic_teardown_intr),
 
     DEVMETHOD_END
 };
 
 static driver_t acpi_cpu_driver = {
     "cpu",
     acpi_cpu_methods,
     sizeof(struct acpi_cpu_softc),
 };
 
 static devclass_t acpi_cpu_devclass;
 DRIVER_MODULE(cpu, acpi, acpi_cpu_driver, acpi_cpu_devclass, 0, 0);
 MODULE_DEPEND(cpu, acpi, 1, 1, 1);
 
 static int
 acpi_cpu_probe(device_t dev)
 {
     int			   acpi_id, cpu_id;
     ACPI_BUFFER		   buf;
     ACPI_HANDLE		   handle;
     ACPI_OBJECT		   *obj;
     ACPI_STATUS		   status;
 
     if (acpi_disabled("cpu") || acpi_get_type(dev) != ACPI_TYPE_PROCESSOR ||
 	    acpi_cpu_disabled)
 	return (ENXIO);
 
     handle = acpi_get_handle(dev);
     if (cpu_softc == NULL)
 	cpu_softc = malloc(sizeof(struct acpi_cpu_softc *) *
 	    (mp_maxid + 1), M_TEMP /* XXX */, M_WAITOK | M_ZERO);
 
     /* Get our Processor object. */
     buf.Pointer = NULL;
     buf.Length = ACPI_ALLOCATE_BUFFER;
     status = AcpiEvaluateObject(handle, NULL, NULL, &buf);
     if (ACPI_FAILURE(status)) {
 	device_printf(dev, "probe failed to get Processor obj - %s\n",
 		      AcpiFormatException(status));
 	return (ENXIO);
     }
     obj = (ACPI_OBJECT *)buf.Pointer;
     if (obj->Type != ACPI_TYPE_PROCESSOR) {
 	device_printf(dev, "Processor object has bad type %d\n", obj->Type);
 	AcpiOsFree(obj);
 	return (ENXIO);
     }
 
     /*
      * Find the processor associated with our unit.  We could use the
      * ProcId as a key, however, some boxes do not have the same values
      * in their Processor object as the ProcId values in the MADT.
      */
     acpi_id = obj->Processor.ProcId;
     AcpiOsFree(obj);
     if (acpi_pcpu_get_id(dev, &acpi_id, &cpu_id) != 0)
 	return (ENXIO);
 
     /*
      * Check if we already probed this processor.  We scan the bus twice
      * so it's possible we've already seen this one.
      */
     if (cpu_softc[cpu_id] != NULL)
 	return (ENXIO);
 
     /* Mark this processor as in-use and save our derived id for attach. */
     cpu_softc[cpu_id] = (void *)1;
     acpi_set_private(dev, (void*)(intptr_t)cpu_id);
     device_set_desc(dev, "ACPI CPU");
 
     return (0);
 }
 
 static int
 acpi_cpu_attach(device_t dev)
 {
     ACPI_BUFFER		   buf;
     ACPI_OBJECT		   arg[4], *obj;
     ACPI_OBJECT_LIST	   arglist;
     struct pcpu		   *pcpu_data;
     struct acpi_cpu_softc *sc;
     struct acpi_softc	  *acpi_sc;
     ACPI_STATUS		   status;
     u_int		   features;
     int			   cpu_id, drv_count, i;
     driver_t 		  **drivers;
     uint32_t		   cap_set[3];
 
     /* UUID needed by _OSC evaluation */
     static uint8_t cpu_oscuuid[16] = { 0x16, 0xA6, 0x77, 0x40, 0x0C, 0x29,
 				       0xBE, 0x47, 0x9E, 0xBD, 0xD8, 0x70,
 				       0x58, 0x71, 0x39, 0x53 };
 
     ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__);
 
     sc = device_get_softc(dev);
     sc->cpu_dev = dev;
     sc->cpu_handle = acpi_get_handle(dev);
     cpu_id = (int)(intptr_t)acpi_get_private(dev);
     cpu_softc[cpu_id] = sc;
     pcpu_data = pcpu_find(cpu_id);
     pcpu_data->pc_device = dev;
     sc->cpu_pcpu = pcpu_data;
     cpu_smi_cmd = AcpiGbl_FADT.SmiCommand;
     cpu_cst_cnt = AcpiGbl_FADT.CstControl;
 
     buf.Pointer = NULL;
     buf.Length = ACPI_ALLOCATE_BUFFER;
     status = AcpiEvaluateObject(sc->cpu_handle, NULL, NULL, &buf);
     if (ACPI_FAILURE(status)) {
 	device_printf(dev, "attach failed to get Processor obj - %s\n",
 		      AcpiFormatException(status));
 	return (ENXIO);
     }
     obj = (ACPI_OBJECT *)buf.Pointer;
     sc->cpu_p_blk = obj->Processor.PblkAddress;
     sc->cpu_p_blk_len = obj->Processor.PblkLength;
     sc->cpu_acpi_id = obj->Processor.ProcId;
     AcpiOsFree(obj);
     ACPI_DEBUG_PRINT((ACPI_DB_INFO, "acpi_cpu%d: P_BLK at %#x/%d\n",
 		     device_get_unit(dev), sc->cpu_p_blk, sc->cpu_p_blk_len));
 
     /*
      * If this is the first cpu we attach, create and initialize the generic
      * resources that will be used by all acpi cpu devices.
      */
     if (device_get_unit(dev) == 0) {
 	/* Assume we won't be using generic Cx mode by default */
 	cpu_cx_generic = FALSE;
 
 	/* Install hw.acpi.cpu sysctl tree */
 	acpi_sc = acpi_device_get_parent_softc(dev);
 	sysctl_ctx_init(&cpu_sysctl_ctx);
 	cpu_sysctl_tree = SYSCTL_ADD_NODE(&cpu_sysctl_ctx,
 	    SYSCTL_CHILDREN(acpi_sc->acpi_sysctl_tree), OID_AUTO, "cpu",
 	    CTLFLAG_RD, 0, "node for CPU children");
 
 	/* Queue post cpu-probing task handler */
 	AcpiOsExecute(OSL_NOTIFY_HANDLER, acpi_cpu_startup, NULL);
     }
 
     /*
      * Before calling any CPU methods, collect child driver feature hints
      * and notify ACPI of them.  We support unified SMP power control
      * so advertise this ourselves.  Note this is not the same as independent
      * SMP control where each CPU can have different settings.
      */
-    sc->cpu_features = ACPI_CAP_SMP_SAME | ACPI_CAP_SMP_SAME_C3;
+    sc->cpu_features = ACPI_CAP_SMP_SAME | ACPI_CAP_SMP_SAME_C3 |
+      ACPI_CAP_C1_IO_HALT;
+
+#if defined(__i386__) || defined(__amd64__)
+    /*
+     * Ask for MWAIT modes if not disabled and interrupts work
+     * reasonably with MWAIT.
+     */
+    if (!acpi_disabled("mwait") && cpu_mwait_usable())
+	sc->cpu_features |= ACPI_CAP_SMP_C1_NATIVE | ACPI_CAP_SMP_C3_NATIVE;
+#endif
+
     if (devclass_get_drivers(acpi_cpu_devclass, &drivers, &drv_count) == 0) {
 	for (i = 0; i < drv_count; i++) {
 	    if (ACPI_GET_FEATURES(drivers[i], &features) == 0)
 		sc->cpu_features |= features;
 	}
 	free(drivers, M_TEMP);
     }
 
     /*
      * CPU capabilities are specified in
      * Intel Processor Vendor-Specific ACPI Interface Specification.
      */
     if (sc->cpu_features) {
 	arglist.Pointer = arg;
 	arglist.Count = 4;
 	arg[0].Type = ACPI_TYPE_BUFFER;
 	arg[0].Buffer.Length = sizeof(cpu_oscuuid);
 	arg[0].Buffer.Pointer = cpu_oscuuid;	/* UUID */
 	arg[1].Type = ACPI_TYPE_INTEGER;
 	arg[1].Integer.Value = 1;		/* revision */
 	arg[2].Type = ACPI_TYPE_INTEGER;
 	arg[2].Integer.Value = 1;		/* count */
 	arg[3].Type = ACPI_TYPE_BUFFER;
 	arg[3].Buffer.Length = sizeof(cap_set);	/* Capabilities buffer */
 	arg[3].Buffer.Pointer = (uint8_t *)cap_set;
 	cap_set[0] = 0;				/* status */
 	cap_set[1] = sc->cpu_features;
 	status = AcpiEvaluateObject(sc->cpu_handle, "_OSC", &arglist, NULL);
 	if (ACPI_SUCCESS(status)) {
 	    if (cap_set[0] != 0)
 		device_printf(dev, "_OSC returned status %#x\n", cap_set[0]);
 	}
 	else {
 	    arglist.Pointer = arg;
 	    arglist.Count = 1;
 	    arg[0].Type = ACPI_TYPE_BUFFER;
 	    arg[0].Buffer.Length = sizeof(cap_set);
 	    arg[0].Buffer.Pointer = (uint8_t *)cap_set;
 	    cap_set[0] = 1; /* revision */
 	    cap_set[1] = 1; /* number of capabilities integers */
 	    cap_set[2] = sc->cpu_features;
 	    AcpiEvaluateObject(sc->cpu_handle, "_PDC", &arglist, NULL);
 	}
     }
 
     /* Probe for Cx state support. */
     acpi_cpu_cx_probe(sc);
 
     return (0);
 }
 
 static void
 acpi_cpu_postattach(void *unused __unused)
 {
     device_t *devices;
     int err;
     int i, n;
 
     err = devclass_get_devices(acpi_cpu_devclass, &devices, &n);
     if (err != 0) {
 	printf("devclass_get_devices(acpi_cpu_devclass) failed\n");
 	return;
     }
     for (i = 0; i < n; i++)
 	bus_generic_probe(devices[i]);
     for (i = 0; i < n; i++)
 	bus_generic_attach(devices[i]);
     free(devices, M_TEMP);
 }
 
 SYSINIT(acpi_cpu, SI_SUB_CONFIGURE, SI_ORDER_MIDDLE,
     acpi_cpu_postattach, NULL);
 
 static void
 disable_idle(struct acpi_cpu_softc *sc)
 {
     cpuset_t cpuset;
 
     CPU_SETOF(sc->cpu_pcpu->pc_cpuid, &cpuset);
     sc->cpu_disable_idle = TRUE;
 
     /*
      * Ensure that the CPU is not in idle state or in acpi_cpu_idle().
      * Note that this code depends on the fact that the rendezvous IPI
      * can not penetrate context where interrupts are disabled and acpi_cpu_idle
      * is called and executed in such a context with interrupts being re-enabled
      * right before return.
      */
     smp_rendezvous_cpus(cpuset, smp_no_rendevous_barrier, NULL,
 	smp_no_rendevous_barrier, NULL);
 }
 
 static void
 enable_idle(struct acpi_cpu_softc *sc)
 {
 
     sc->cpu_disable_idle = FALSE;
 }
 
 static int
 is_idle_disabled(struct acpi_cpu_softc *sc)
 {
 
     return (sc->cpu_disable_idle);
 }
 
 /*
  * Disable any entry to the idle function during suspend and re-enable it
  * during resume.
  */
 static int
 acpi_cpu_suspend(device_t dev)
 {
     int error;
 
     error = bus_generic_suspend(dev);
     if (error)
 	return (error);
     disable_idle(device_get_softc(dev));
     return (0);
 }
 
 static int
 acpi_cpu_resume(device_t dev)
 {
 
     enable_idle(device_get_softc(dev));
     return (bus_generic_resume(dev));
 }
 
 /*
  * Find the processor associated with a given ACPI ID.  By default,
  * use the MADT to map ACPI IDs to APIC IDs and use that to locate a
  * processor.  Some systems have inconsistent ASL and MADT however.
  * For these systems the cpu_unordered tunable can be set in which
  * case we assume that Processor objects are listed in the same order
  * in both the MADT and ASL.
  */
 static int
 acpi_pcpu_get_id(device_t dev, uint32_t *acpi_id, uint32_t *cpu_id)
 {
     struct pcpu	*pc;
     uint32_t	 i, idx;
 
     KASSERT(acpi_id != NULL, ("Null acpi_id"));
     KASSERT(cpu_id != NULL, ("Null cpu_id"));
     idx = device_get_unit(dev);
 
     /*
      * If pc_acpi_id for CPU 0 is not initialized (e.g. a non-APIC
      * UP box) use the ACPI ID from the first processor we find.
      */
     if (idx == 0 && mp_ncpus == 1) {
 	pc = pcpu_find(0);
 	if (pc->pc_acpi_id == 0xffffffff)
 	    pc->pc_acpi_id = *acpi_id;
 	*cpu_id = 0;
 	return (0);
     }
 
     CPU_FOREACH(i) {
 	pc = pcpu_find(i);
 	KASSERT(pc != NULL, ("no pcpu data for %d", i));
 	if (cpu_unordered) {
 	    if (idx-- == 0) {
 		/*
 		 * If pc_acpi_id doesn't match the ACPI ID from the
 		 * ASL, prefer the MADT-derived value.
 		 */
 		if (pc->pc_acpi_id != *acpi_id)
 		    *acpi_id = pc->pc_acpi_id;
 		*cpu_id = pc->pc_cpuid;
 		return (0);
 	    }
 	} else {
 	    if (pc->pc_acpi_id == *acpi_id) {
 		if (bootverbose)
 		    device_printf(dev,
 			"Processor %s (ACPI ID %u) -> APIC ID %d\n",
 			acpi_name(acpi_get_handle(dev)), *acpi_id,
 			pc->pc_cpuid);
 		*cpu_id = pc->pc_cpuid;
 		return (0);
 	    }
 	}
     }
 
     if (bootverbose)
 	printf("ACPI: Processor %s (ACPI ID %u) ignored\n",
 	    acpi_name(acpi_get_handle(dev)), *acpi_id);
 
     return (ESRCH);
 }
 
 static struct resource_list *
 acpi_cpu_get_rlist(device_t dev, device_t child)
 {
     struct acpi_cpu_device *ad;
 
     ad = device_get_ivars(child);
     if (ad == NULL)
 	return (NULL);
     return (&ad->ad_rl);
 }
 
 static device_t
 acpi_cpu_add_child(device_t dev, u_int order, const char *name, int unit)
 {
     struct acpi_cpu_device *ad;
     device_t child;
 
     if ((ad = malloc(sizeof(*ad), M_TEMP, M_NOWAIT | M_ZERO)) == NULL)
 	return (NULL);
 
     resource_list_init(&ad->ad_rl);
     
     child = device_add_child_ordered(dev, order, name, unit);
     if (child != NULL)
 	device_set_ivars(child, ad);
     else
 	free(ad, M_TEMP);
     return (child);
 }
 
 static int
 acpi_cpu_read_ivar(device_t dev, device_t child, int index, uintptr_t *result)
 {
     struct acpi_cpu_softc *sc;
 
     sc = device_get_softc(dev);
     switch (index) {
     case ACPI_IVAR_HANDLE:
 	*result = (uintptr_t)sc->cpu_handle;
 	break;
     case CPU_IVAR_PCPU:
 	*result = (uintptr_t)sc->cpu_pcpu;
 	break;
 #if defined(__amd64__) || defined(__i386__)
     case CPU_IVAR_NOMINAL_MHZ:
 	if (tsc_is_invariant) {
 	    *result = (uintptr_t)(atomic_load_acq_64(&tsc_freq) / 1000000);
 	    break;
 	}
 	/* FALLTHROUGH */
 #endif
     default:
 	return (ENOENT);
     }
     return (0);
 }
 
 static int
 acpi_cpu_shutdown(device_t dev)
 {
     ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__);
 
     /* Allow children to shutdown first. */
     bus_generic_shutdown(dev);
 
     /*
      * Disable any entry to the idle function.
      */
     disable_idle(device_get_softc(dev));
 
     /*
      * CPU devices are not truely detached and remain referenced,
      * so their resources are not freed.
      */
 
     return_VALUE (0);
 }
 
 static void
 acpi_cpu_cx_probe(struct acpi_cpu_softc *sc)
 {
     ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__);
 
     /* Use initial sleep value of 1 sec. to start with lowest idle state. */
     sc->cpu_prev_sleep = 1000000;
     sc->cpu_cx_lowest = 0;
     sc->cpu_cx_lowest_lim = 0;
 
     /*
      * Check for the ACPI 2.0 _CST sleep states object. If we can't find
      * any, we'll revert to generic FADT/P_BLK Cx control method which will
      * be handled by acpi_cpu_startup. We need to defer to after having
      * probed all the cpus in the system before probing for generic Cx
      * states as we may already have found cpus with valid _CST packages
      */
     if (!cpu_cx_generic && acpi_cpu_cx_cst(sc) != 0) {
 	/*
 	 * We were unable to find a _CST package for this cpu or there
 	 * was an error parsing it. Switch back to generic mode.
 	 */
 	cpu_cx_generic = TRUE;
 	if (bootverbose)
 	    device_printf(sc->cpu_dev, "switching to generic Cx mode\n");
     }
 
     /*
      * TODO: _CSD Package should be checked here.
      */
 }
 
 static void
 acpi_cpu_generic_cx_probe(struct acpi_cpu_softc *sc)
 {
     ACPI_GENERIC_ADDRESS	 gas;
     struct acpi_cx		*cx_ptr;
 
     sc->cpu_cx_count = 0;
     cx_ptr = sc->cpu_cx_states;
 
     /* Use initial sleep value of 1 sec. to start with lowest idle state. */
     sc->cpu_prev_sleep = 1000000;
 
     /* C1 has been required since just after ACPI 1.0 */
     cx_ptr->type = ACPI_STATE_C1;
     cx_ptr->trans_lat = 0;
     cx_ptr++;
     sc->cpu_non_c2 = sc->cpu_cx_count;
     sc->cpu_non_c3 = sc->cpu_cx_count;
     sc->cpu_cx_count++;
     cpu_deepest_sleep = 1;
 
     /* 
      * The spec says P_BLK must be 6 bytes long.  However, some systems
      * use it to indicate a fractional set of features present so we
      * take 5 as C2.  Some may also have a value of 7 to indicate
      * another C3 but most use _CST for this (as required) and having
      * "only" C1-C3 is not a hardship.
      */
     if (sc->cpu_p_blk_len < 5)
 	return; 
 
     /* Validate and allocate resources for C2 (P_LVL2). */
     gas.SpaceId = ACPI_ADR_SPACE_SYSTEM_IO;
     gas.BitWidth = 8;
     if (AcpiGbl_FADT.C2Latency <= 100) {
 	gas.Address = sc->cpu_p_blk + 4;
 	cx_ptr->res_rid = 0;
 	acpi_bus_alloc_gas(sc->cpu_dev, &cx_ptr->res_type, &cx_ptr->res_rid,
 	    &gas, &cx_ptr->p_lvlx, RF_SHAREABLE);
 	if (cx_ptr->p_lvlx != NULL) {
 	    cx_ptr->type = ACPI_STATE_C2;
 	    cx_ptr->trans_lat = AcpiGbl_FADT.C2Latency;
 	    cx_ptr++;
 	    sc->cpu_non_c3 = sc->cpu_cx_count;
 	    sc->cpu_cx_count++;
 	    cpu_deepest_sleep = 2;
 	}
     }
     if (sc->cpu_p_blk_len < 6)
 	return;
 
     /* Validate and allocate resources for C3 (P_LVL3). */
     if (AcpiGbl_FADT.C3Latency <= 1000 && !(cpu_quirks & CPU_QUIRK_NO_C3)) {
 	gas.Address = sc->cpu_p_blk + 5;
 	cx_ptr->res_rid = 1;
 	acpi_bus_alloc_gas(sc->cpu_dev, &cx_ptr->res_type, &cx_ptr->res_rid,
 	    &gas, &cx_ptr->p_lvlx, RF_SHAREABLE);
 	if (cx_ptr->p_lvlx != NULL) {
 	    cx_ptr->type = ACPI_STATE_C3;
 	    cx_ptr->trans_lat = AcpiGbl_FADT.C3Latency;
 	    cx_ptr++;
 	    sc->cpu_cx_count++;
 	    cpu_deepest_sleep = 3;
 	}
     }
 }
 
+static void
+acpi_cpu_cx_cst_mwait(struct acpi_cx *cx_ptr, uint64_t address, int accsize)
+{
+
+	cx_ptr->do_mwait = true;
+	cx_ptr->mwait_hint = address & 0xffffffff;
+	cx_ptr->mwait_hw_coord = (accsize & CST_FFH_MWAIT_HW_COORD) != 0;
+	cx_ptr->mwait_bm_avoidance = (accsize & CST_FFH_MWAIT_BM_AVOID) != 0;
+}
+
+static void
+acpi_cpu_cx_cst_free_plvlx(device_t cpu_dev, struct acpi_cx *cx_ptr)
+{
+
+	if (cx_ptr->p_lvlx == NULL)
+		return;
+	bus_release_resource(cpu_dev, cx_ptr->res_type, cx_ptr->res_rid,
+	    cx_ptr->p_lvlx);
+	cx_ptr->p_lvlx = NULL;
+}
+
 /*
  * Parse a _CST package and set up its Cx states.  Since the _CST object
  * can change dynamically, our notify handler may call this function
  * to clean up and probe the new _CST package.
  */
 static int
 acpi_cpu_cx_cst(struct acpi_cpu_softc *sc)
 {
     struct	 acpi_cx *cx_ptr;
     ACPI_STATUS	 status;
     ACPI_BUFFER	 buf;
     ACPI_OBJECT	*top;
     ACPI_OBJECT	*pkg;
     uint32_t	 count;
-    int		 i;
+    uint64_t	 address;
+    int		 i, vendor, class, accsize;
 
     ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__);
 
     buf.Pointer = NULL;
     buf.Length = ACPI_ALLOCATE_BUFFER;
     status = AcpiEvaluateObject(sc->cpu_handle, "_CST", NULL, &buf);
     if (ACPI_FAILURE(status))
 	return (ENXIO);
 
     /* _CST is a package with a count and at least one Cx package. */
     top = (ACPI_OBJECT *)buf.Pointer;
     if (!ACPI_PKG_VALID(top, 2) || acpi_PkgInt32(top, 0, &count) != 0) {
 	device_printf(sc->cpu_dev, "invalid _CST package\n");
 	AcpiOsFree(buf.Pointer);
 	return (ENXIO);
     }
     if (count != top->Package.Count - 1) {
 	device_printf(sc->cpu_dev, "invalid _CST state count (%d != %d)\n",
 	       count, top->Package.Count - 1);
 	count = top->Package.Count - 1;
     }
     if (count > MAX_CX_STATES) {
 	device_printf(sc->cpu_dev, "_CST has too many states (%d)\n", count);
 	count = MAX_CX_STATES;
     }
 
     sc->cpu_non_c2 = 0;
     sc->cpu_non_c3 = 0;
     sc->cpu_cx_count = 0;
     cx_ptr = sc->cpu_cx_states;
 
     /*
      * C1 has been required since just after ACPI 1.0.
      * Reserve the first slot for it.
      */
     cx_ptr->type = ACPI_STATE_C0;
     cx_ptr++;
     sc->cpu_cx_count++;
     cpu_deepest_sleep = 1;
 
     /* Set up all valid states. */
     for (i = 0; i < count; i++) {
 	pkg = &top->Package.Elements[i + 1];
 	if (!ACPI_PKG_VALID(pkg, 4) ||
 	    acpi_PkgInt32(pkg, 1, &cx_ptr->type) != 0 ||
 	    acpi_PkgInt32(pkg, 2, &cx_ptr->trans_lat) != 0 ||
 	    acpi_PkgInt32(pkg, 3, &cx_ptr->power) != 0) {
 
 	    device_printf(sc->cpu_dev, "skipping invalid Cx state package\n");
 	    continue;
 	}
 
 	/* Validate the state to see if we should use it. */
 	switch (cx_ptr->type) {
 	case ACPI_STATE_C1:
+	    acpi_cpu_cx_cst_free_plvlx(sc->cpu_dev, cx_ptr);
+#if defined(__i386__) || defined(__amd64__)
+	    if (acpi_PkgFFH_IntelCpu(pkg, 0, &vendor, &class, &address,
+	      &accsize) == 0 && vendor == CST_FFH_VENDOR_INTEL) {
+		if (class == CST_FFH_INTEL_CL_C1IO) {
+		    /* C1 I/O then Halt */
+		    cx_ptr->res_rid = sc->cpu_cx_count;
+		    bus_set_resource(sc->cpu_dev, SYS_RES_IOPORT,
+		      cx_ptr->res_rid, address, 1);
+		    cx_ptr->p_lvlx = bus_alloc_resource_any(sc->cpu_dev,
+		      SYS_RES_IOPORT, &cx_ptr->res_rid, RF_ACTIVE |
+		      RF_SHAREABLE);
+		    if (cx_ptr->p_lvlx == NULL) {
+			bus_delete_resource(sc->cpu_dev, SYS_RES_IOPORT,
+			  cx_ptr->res_rid);
+			device_printf(sc->cpu_dev,
+			  "C1 I/O failed to allocate port %d, "
+			  "degrading to C1 Halt\n", (int)address);
+		    }
+		} else if (class == CST_FFH_INTEL_CL_MWAIT) {
+		    acpi_cpu_cx_cst_mwait(cx_ptr, address, accsize);
+		}
+	    }
+#endif
 	    if (sc->cpu_cx_states[0].type == ACPI_STATE_C0) {
 		/* This is the first C1 state.  Use the reserved slot. */
 		sc->cpu_cx_states[0] = *cx_ptr;
 	    } else {
 		sc->cpu_non_c2 = sc->cpu_cx_count;
 		sc->cpu_non_c3 = sc->cpu_cx_count;
 		cx_ptr++;
 		sc->cpu_cx_count++;
 	    }
 	    continue;
 	case ACPI_STATE_C2:
 	    sc->cpu_non_c3 = sc->cpu_cx_count;
 	    if (cpu_deepest_sleep < 2)
 		    cpu_deepest_sleep = 2;
 	    break;
 	case ACPI_STATE_C3:
 	default:
 	    if ((cpu_quirks & CPU_QUIRK_NO_C3) != 0) {
 		ACPI_DEBUG_PRINT((ACPI_DB_INFO,
 				 "acpi_cpu%d: C3[%d] not available.\n",
 				 device_get_unit(sc->cpu_dev), i));
 		continue;
 	    } else
 		cpu_deepest_sleep = 3;
 	    break;
 	}
 
 	/* Free up any previous register. */
-	if (cx_ptr->p_lvlx != NULL) {
-	    bus_release_resource(sc->cpu_dev, cx_ptr->res_type, cx_ptr->res_rid,
-	        cx_ptr->p_lvlx);
-	    cx_ptr->p_lvlx = NULL;
-	}
+	acpi_cpu_cx_cst_free_plvlx(sc->cpu_dev, cx_ptr);
 
 	/* Allocate the control register for C2 or C3. */
-	cx_ptr->res_rid = sc->cpu_cx_count;
-	acpi_PkgGas(sc->cpu_dev, pkg, 0, &cx_ptr->res_type, &cx_ptr->res_rid,
-	    &cx_ptr->p_lvlx, RF_SHAREABLE);
-	if (cx_ptr->p_lvlx) {
+#if defined(__i386__) || defined(__amd64__)
+	if (acpi_PkgFFH_IntelCpu(pkg, 0, &vendor, &class, &address,
+	  &accsize) == 0 && vendor == CST_FFH_VENDOR_INTEL &&
+	  class == CST_FFH_INTEL_CL_MWAIT) {
+	    /* Native C State Instruction use (mwait) */
+	    acpi_cpu_cx_cst_mwait(cx_ptr, address, accsize);
 	    ACPI_DEBUG_PRINT((ACPI_DB_INFO,
-			     "acpi_cpu%d: Got C%d - %d latency\n",
-			     device_get_unit(sc->cpu_dev), cx_ptr->type,
-			     cx_ptr->trans_lat));
+	      "acpi_cpu%d: Got C%d/mwait - %d latency\n",
+	      device_get_unit(sc->cpu_dev), cx_ptr->type, cx_ptr->trans_lat));
 	    cx_ptr++;
 	    sc->cpu_cx_count++;
+	} else
+#endif
+	{
+	    cx_ptr->res_rid = sc->cpu_cx_count;
+	    acpi_PkgGas(sc->cpu_dev, pkg, 0, &cx_ptr->res_type,
+		&cx_ptr->res_rid, &cx_ptr->p_lvlx, RF_SHAREABLE);
+	    if (cx_ptr->p_lvlx) {
+		ACPI_DEBUG_PRINT((ACPI_DB_INFO,
+		     "acpi_cpu%d: Got C%d - %d latency\n",
+		     device_get_unit(sc->cpu_dev), cx_ptr->type,
+		     cx_ptr->trans_lat));
+		cx_ptr++;
+		sc->cpu_cx_count++;
+	    }
 	}
     }
     AcpiOsFree(buf.Pointer);
 
     /* If C1 state was not found, we need one now. */
     cx_ptr = sc->cpu_cx_states;
     if (cx_ptr->type == ACPI_STATE_C0) {
 	cx_ptr->type = ACPI_STATE_C1;
 	cx_ptr->trans_lat = 0;
     }
 
     return (0);
 }
 
 /*
  * Call this *after* all CPUs have been attached.
  */
 static void
 acpi_cpu_startup(void *arg)
 {
     struct acpi_cpu_softc *sc;
     int i;
 
     /* Get set of CPU devices */
     devclass_get_devices(acpi_cpu_devclass, &cpu_devices, &cpu_ndevices);
 
     /*
      * Setup any quirks that might necessary now that we have probed
      * all the CPUs
      */
     acpi_cpu_quirks();
 
     if (cpu_cx_generic) {
 	/*
 	 * We are using generic Cx mode, probe for available Cx states
 	 * for all processors.
 	 */
 	for (i = 0; i < cpu_ndevices; i++) {
 	    sc = device_get_softc(cpu_devices[i]);
 	    acpi_cpu_generic_cx_probe(sc);
 	}
     } else {
 	/*
 	 * We are using _CST mode, remove C3 state if necessary.
 	 * As we now know for sure that we will be using _CST mode
 	 * install our notify handler.
 	 */
 	for (i = 0; i < cpu_ndevices; i++) {
 	    sc = device_get_softc(cpu_devices[i]);
 	    if (cpu_quirks & CPU_QUIRK_NO_C3) {
 		sc->cpu_cx_count = min(sc->cpu_cx_count, sc->cpu_non_c3 + 1);
 	    }
 	    AcpiInstallNotifyHandler(sc->cpu_handle, ACPI_DEVICE_NOTIFY,
 		acpi_cpu_notify, sc);
 	}
     }
 
     /* Perform Cx final initialization. */
     for (i = 0; i < cpu_ndevices; i++) {
 	sc = device_get_softc(cpu_devices[i]);
 	acpi_cpu_startup_cx(sc);
     }
 
     /* Add a sysctl handler to handle global Cx lowest setting */
     SYSCTL_ADD_PROC(&cpu_sysctl_ctx, SYSCTL_CHILDREN(cpu_sysctl_tree),
 	OID_AUTO, "cx_lowest", CTLTYPE_STRING | CTLFLAG_RW,
 	NULL, 0, acpi_cpu_global_cx_lowest_sysctl, "A",
 	"Global lowest Cx sleep state to use");
 
     /* Take over idling from cpu_idle_default(). */
     cpu_cx_lowest_lim = 0;
     for (i = 0; i < cpu_ndevices; i++) {
 	sc = device_get_softc(cpu_devices[i]);
 	enable_idle(sc);
     }
     cpu_idle_hook = acpi_cpu_idle;
 }
 
 static void
 acpi_cpu_cx_list(struct acpi_cpu_softc *sc)
 {
     struct sbuf sb;
     int i;
 
     /*
      * Set up the list of Cx states
      */
     sbuf_new(&sb, sc->cpu_cx_supported, sizeof(sc->cpu_cx_supported),
 	SBUF_FIXEDLEN);
     for (i = 0; i < sc->cpu_cx_count; i++)
 	sbuf_printf(&sb, "C%d/%d/%d ", i + 1, sc->cpu_cx_states[i].type,
 	    sc->cpu_cx_states[i].trans_lat);
     sbuf_trim(&sb);
     sbuf_finish(&sb);
 }	
 
 static void
 acpi_cpu_startup_cx(struct acpi_cpu_softc *sc)
 {
     acpi_cpu_cx_list(sc);
     
     SYSCTL_ADD_STRING(&sc->cpu_sysctl_ctx,
 		      SYSCTL_CHILDREN(device_get_sysctl_tree(sc->cpu_dev)),
 		      OID_AUTO, "cx_supported", CTLFLAG_RD,
 		      sc->cpu_cx_supported, 0,
 		      "Cx/microsecond values for supported Cx states");
     SYSCTL_ADD_PROC(&sc->cpu_sysctl_ctx,
 		    SYSCTL_CHILDREN(device_get_sysctl_tree(sc->cpu_dev)),
 		    OID_AUTO, "cx_lowest", CTLTYPE_STRING | CTLFLAG_RW,
 		    (void *)sc, 0, acpi_cpu_cx_lowest_sysctl, "A",
 		    "lowest Cx sleep state to use");
     SYSCTL_ADD_PROC(&sc->cpu_sysctl_ctx,
 		    SYSCTL_CHILDREN(device_get_sysctl_tree(sc->cpu_dev)),
 		    OID_AUTO, "cx_usage", CTLTYPE_STRING | CTLFLAG_RD,
 		    (void *)sc, 0, acpi_cpu_usage_sysctl, "A",
 		    "percent usage for each Cx state");
     SYSCTL_ADD_PROC(&sc->cpu_sysctl_ctx,
 		    SYSCTL_CHILDREN(device_get_sysctl_tree(sc->cpu_dev)),
 		    OID_AUTO, "cx_usage_counters", CTLTYPE_STRING | CTLFLAG_RD,
 		    (void *)sc, 0, acpi_cpu_usage_counters_sysctl, "A",
 		    "Cx sleep state counters");
+#if defined(__i386__) || defined(__amd64__)
+    SYSCTL_ADD_PROC(&sc->cpu_sysctl_ctx,
+		    SYSCTL_CHILDREN(device_get_sysctl_tree(sc->cpu_dev)),
+		    OID_AUTO, "cx_method", CTLTYPE_STRING | CTLFLAG_RD,
+		    (void *)sc, 0, acpi_cpu_method_sysctl, "A",
+		    "Cx entrance methods");
+#endif
 
     /* Signal platform that we can handle _CST notification. */
     if (!cpu_cx_generic && cpu_cst_cnt != 0) {
 	ACPI_LOCK(acpi);
 	AcpiOsWritePort(cpu_smi_cmd, cpu_cst_cnt, 8);
 	ACPI_UNLOCK(acpi);
     }
 }
 
 /*
  * Idle the CPU in the lowest state possible.  This function is called with
  * interrupts disabled.  Note that once it re-enables interrupts, a task
  * switch can occur so do not access shared data (i.e. the softc) after
  * interrupts are re-enabled.
  */
 static void
 acpi_cpu_idle(sbintime_t sbt)
 {
     struct	acpi_cpu_softc *sc;
     struct	acpi_cx *cx_next;
     uint64_t	cputicks;
     uint32_t	start_time, end_time;
     int		bm_active, cx_next_idx, i, us;
 
     /*
      * Look up our CPU id to get our softc.  If it's NULL, we'll use C1
      * since there is no ACPI processor object for this CPU.  This occurs
      * for logical CPUs in the HTT case.
      */
     sc = cpu_softc[PCPU_GET(cpuid)];
     if (sc == NULL) {
 	acpi_cpu_c1();
 	return;
     }
 
     /* If disabled, take the safe path. */
     if (is_idle_disabled(sc)) {
 	acpi_cpu_c1();
 	return;
     }
 
     /* Find the lowest state that has small enough latency. */
     us = sc->cpu_prev_sleep;
     if (sbt >= 0 && us > (sbt >> 12))
 	us = (sbt >> 12);
     cx_next_idx = 0;
     if (cpu_disable_c2_sleep)
 	i = min(sc->cpu_cx_lowest, sc->cpu_non_c2);
     else if (cpu_disable_c3_sleep)
 	i = min(sc->cpu_cx_lowest, sc->cpu_non_c3);
     else
 	i = sc->cpu_cx_lowest;
     for (; i >= 0; i--) {
 	if (sc->cpu_cx_states[i].trans_lat * 3 <= us) {
 	    cx_next_idx = i;
 	    break;
 	}
     }
 
     /*
      * Check for bus master activity.  If there was activity, clear
      * the bit and use the lowest non-C3 state.  Note that the USB
      * driver polling for new devices keeps this bit set all the
      * time if USB is loaded.
      */
     if ((cpu_quirks & CPU_QUIRK_NO_BM_CTRL) == 0 &&
 	cx_next_idx > sc->cpu_non_c3) {
 	AcpiReadBitRegister(ACPI_BITREG_BUS_MASTER_STATUS, &bm_active);
 	if (bm_active != 0) {
 	    AcpiWriteBitRegister(ACPI_BITREG_BUS_MASTER_STATUS, 1);
 	    cx_next_idx = sc->cpu_non_c3;
 	}
     }
 
     /* Select the next state and update statistics. */
     cx_next = &sc->cpu_cx_states[cx_next_idx];
     sc->cpu_cx_stats[cx_next_idx]++;
     KASSERT(cx_next->type != ACPI_STATE_C0, ("acpi_cpu_idle: C0 sleep"));
 
     /*
      * Execute HLT (or equivalent) and wait for an interrupt.  We can't
      * precisely calculate the time spent in C1 since the place we wake up
      * is an ISR.  Assume we slept no more then half of quantum, unless
      * we are called inside critical section, delaying context switch.
      */
     if (cx_next->type == ACPI_STATE_C1) {
 	cputicks = cpu_ticks();
-	acpi_cpu_c1();
+	if (cx_next->p_lvlx != NULL) {
+	    /* C1 I/O then Halt */
+	    CPU_GET_REG(cx_next->p_lvlx, 1);
+	}
+	if (cx_next->do_mwait)
+	    acpi_cpu_idle_mwait(cx_next->mwait_hint);
+	else
+	    acpi_cpu_c1();
 	end_time = ((cpu_ticks() - cputicks) << 20) / cpu_tickrate();
 	if (curthread->td_critnest == 0)
 		end_time = min(end_time, 500000 / hz);
 	sc->cpu_prev_sleep = (sc->cpu_prev_sleep * 3 + end_time) / 4;
 	return;
     }
 
     /*
      * For C3, disable bus master arbitration and enable bus master wake
      * if BM control is available, otherwise flush the CPU cache.
      */
-    if (cx_next->type == ACPI_STATE_C3) {
+    if (cx_next->type == ACPI_STATE_C3 || cx_next->mwait_bm_avoidance) {
 	if ((cpu_quirks & CPU_QUIRK_NO_BM_CTRL) == 0) {
 	    AcpiWriteBitRegister(ACPI_BITREG_ARB_DISABLE, 1);
 	    AcpiWriteBitRegister(ACPI_BITREG_BUS_MASTER_RLD, 1);
 	} else
 	    ACPI_FLUSH_CPU_CACHE();
     }
 
     /*
      * Read from P_LVLx to enter C2(+), checking time spent asleep.
      * Use the ACPI timer for measuring sleep time.  Since we need to
      * get the time very close to the CPU start/stop clock logic, this
      * is the only reliable time source.
      */
     if (cx_next->type == ACPI_STATE_C3) {
 	AcpiHwRead(&start_time, &AcpiGbl_FADT.XPmTimerBlock);
 	cputicks = 0;
     } else {
 	start_time = 0;
 	cputicks = cpu_ticks();
     }
-    CPU_GET_REG(cx_next->p_lvlx, 1);
+    if (cx_next->do_mwait)
+	acpi_cpu_idle_mwait(cx_next->mwait_hint);
+    else
+	CPU_GET_REG(cx_next->p_lvlx, 1);
 
     /*
      * Read the end time twice.  Since it may take an arbitrary time
      * to enter the idle state, the first read may be executed before
      * the processor has stopped.  Doing it again provides enough
      * margin that we are certain to have a correct value.
      */
     AcpiHwRead(&end_time, &AcpiGbl_FADT.XPmTimerBlock);
     if (cx_next->type == ACPI_STATE_C3) {
 	AcpiHwRead(&end_time, &AcpiGbl_FADT.XPmTimerBlock);
 	end_time = acpi_TimerDelta(end_time, start_time);
     } else
 	end_time = ((cpu_ticks() - cputicks) << 20) / cpu_tickrate();
 
     /* Enable bus master arbitration and disable bus master wakeup. */
-    if (cx_next->type == ACPI_STATE_C3 &&
-	(cpu_quirks & CPU_QUIRK_NO_BM_CTRL) == 0) {
+    if ((cx_next->type == ACPI_STATE_C3 || cx_next->mwait_bm_avoidance) &&
+      (cpu_quirks & CPU_QUIRK_NO_BM_CTRL) == 0) {
 	AcpiWriteBitRegister(ACPI_BITREG_ARB_DISABLE, 0);
 	AcpiWriteBitRegister(ACPI_BITREG_BUS_MASTER_RLD, 0);
     }
     ACPI_ENABLE_IRQS();
 
     sc->cpu_prev_sleep = (sc->cpu_prev_sleep * 3 + PM_USEC(end_time)) / 4;
 }
 
 /*
  * Re-evaluate the _CST object when we are notified that it changed.
  */
 static void
 acpi_cpu_notify(ACPI_HANDLE h, UINT32 notify, void *context)
 {
     struct acpi_cpu_softc *sc = (struct acpi_cpu_softc *)context;
 
     if (notify != ACPI_NOTIFY_CX_STATES)
 	return;
 
     /*
      * C-state data for target CPU is going to be in flux while we execute
      * acpi_cpu_cx_cst, so disable entering acpi_cpu_idle.
      * Also, it may happen that multiple ACPI taskqueues may concurrently
      * execute notifications for the same CPU.  ACPI_SERIAL is used to
      * protect against that.
      */
     ACPI_SERIAL_BEGIN(cpu);
     disable_idle(sc);
 
     /* Update the list of Cx states. */
     acpi_cpu_cx_cst(sc);
     acpi_cpu_cx_list(sc);
     acpi_cpu_set_cx_lowest(sc);
 
     enable_idle(sc);
     ACPI_SERIAL_END(cpu);
 
     acpi_UserNotify("PROCESSOR", sc->cpu_handle, notify);
 }
 
 static int
 acpi_cpu_quirks(void)
 {
     device_t acpi_dev;
     uint32_t val;
 
     ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__);
 
     /*
      * Bus mastering arbitration control is needed to keep caches coherent
      * while sleeping in C3.  If it's not present but a working flush cache
      * instruction is present, flush the caches before entering C3 instead.
      * Otherwise, just disable C3 completely.
      */
     if (AcpiGbl_FADT.Pm2ControlBlock == 0 ||
 	AcpiGbl_FADT.Pm2ControlLength == 0) {
 	if ((AcpiGbl_FADT.Flags & ACPI_FADT_WBINVD) &&
 	    (AcpiGbl_FADT.Flags & ACPI_FADT_WBINVD_FLUSH) == 0) {
 	    cpu_quirks |= CPU_QUIRK_NO_BM_CTRL;
 	    ACPI_DEBUG_PRINT((ACPI_DB_INFO,
 		"acpi_cpu: no BM control, using flush cache method\n"));
 	} else {
 	    cpu_quirks |= CPU_QUIRK_NO_C3;
 	    ACPI_DEBUG_PRINT((ACPI_DB_INFO,
 		"acpi_cpu: no BM control, C3 not available\n"));
 	}
     }
 
     /*
      * If we are using generic Cx mode, C3 on multiple CPUs requires using
      * the expensive flush cache instruction.
      */
     if (cpu_cx_generic && mp_ncpus > 1) {
 	cpu_quirks |= CPU_QUIRK_NO_BM_CTRL;
 	ACPI_DEBUG_PRINT((ACPI_DB_INFO,
 	    "acpi_cpu: SMP, using flush cache mode for C3\n"));
     }
 
     /* Look for various quirks of the PIIX4 part. */
     acpi_dev = pci_find_device(PCI_VENDOR_INTEL, PCI_DEVICE_82371AB_3);
     if (acpi_dev != NULL) {
 	switch (pci_get_revid(acpi_dev)) {
 	/*
 	 * Disable C3 support for all PIIX4 chipsets.  Some of these parts
 	 * do not report the BMIDE status to the BM status register and
 	 * others have a livelock bug if Type-F DMA is enabled.  Linux
 	 * works around the BMIDE bug by reading the BM status directly
 	 * but we take the simpler approach of disabling C3 for these
 	 * parts.
 	 *
 	 * See erratum #18 ("C3 Power State/BMIDE and Type-F DMA
 	 * Livelock") from the January 2002 PIIX4 specification update.
 	 * Applies to all PIIX4 models.
 	 *
 	 * Also, make sure that all interrupts cause a "Stop Break"
 	 * event to exit from C2 state.
 	 * Also, BRLD_EN_BM (ACPI_BITREG_BUS_MASTER_RLD in ACPI-speak)
 	 * should be set to zero, otherwise it causes C2 to short-sleep.
 	 * PIIX4 doesn't properly support C3 and bus master activity
 	 * need not break out of C2.
 	 */
 	case PCI_REVISION_A_STEP:
 	case PCI_REVISION_B_STEP:
 	case PCI_REVISION_4E:
 	case PCI_REVISION_4M:
 	    cpu_quirks |= CPU_QUIRK_NO_C3;
 	    ACPI_DEBUG_PRINT((ACPI_DB_INFO,
 		"acpi_cpu: working around PIIX4 bug, disabling C3\n"));
 
 	    val = pci_read_config(acpi_dev, PIIX4_DEVACTB_REG, 4);
 	    if ((val & PIIX4_STOP_BREAK_MASK) != PIIX4_STOP_BREAK_MASK) {
 		ACPI_DEBUG_PRINT((ACPI_DB_INFO,
 		    "acpi_cpu: PIIX4: enabling IRQs to generate Stop Break\n"));
 	    	val |= PIIX4_STOP_BREAK_MASK;
 		pci_write_config(acpi_dev, PIIX4_DEVACTB_REG, val, 4);
 	    }
 	    AcpiReadBitRegister(ACPI_BITREG_BUS_MASTER_RLD, &val);
 	    if (val) {
 		ACPI_DEBUG_PRINT((ACPI_DB_INFO,
 		    "acpi_cpu: PIIX4: reset BRLD_EN_BM\n"));
 		AcpiWriteBitRegister(ACPI_BITREG_BUS_MASTER_RLD, 0);
 	    }
 	    break;
 	default:
 	    break;
 	}
     }
 
     return (0);
 }
 
 static int
 acpi_cpu_usage_sysctl(SYSCTL_HANDLER_ARGS)
 {
     struct acpi_cpu_softc *sc;
     struct sbuf	 sb;
     char	 buf[128];
     int		 i;
     uintmax_t	 fract, sum, whole;
 
     sc = (struct acpi_cpu_softc *) arg1;
     sum = 0;
     for (i = 0; i < sc->cpu_cx_count; i++)
 	sum += sc->cpu_cx_stats[i];
     sbuf_new(&sb, buf, sizeof(buf), SBUF_FIXEDLEN);
     for (i = 0; i < sc->cpu_cx_count; i++) {
 	if (sum > 0) {
 	    whole = (uintmax_t)sc->cpu_cx_stats[i] * 100;
 	    fract = (whole % sum) * 100;
 	    sbuf_printf(&sb, "%u.%02u%% ", (u_int)(whole / sum),
 		(u_int)(fract / sum));
 	} else
 	    sbuf_printf(&sb, "0.00%% ");
     }
     sbuf_printf(&sb, "last %dus", sc->cpu_prev_sleep);
     sbuf_trim(&sb);
     sbuf_finish(&sb);
     sysctl_handle_string(oidp, sbuf_data(&sb), sbuf_len(&sb), req);
     sbuf_delete(&sb);
 
     return (0);
 }
 
 /*
  * XXX TODO: actually add support to count each entry/exit
  * from the Cx states.
  */
 static int
 acpi_cpu_usage_counters_sysctl(SYSCTL_HANDLER_ARGS)
 {
     struct acpi_cpu_softc *sc;
     struct sbuf	 sb;
     char	 buf[128];
     int		 i;
 
     sc = (struct acpi_cpu_softc *) arg1;
 
     /* Print out the raw counters */
     sbuf_new(&sb, buf, sizeof(buf), SBUF_FIXEDLEN);
 
     for (i = 0; i < sc->cpu_cx_count; i++) {
         sbuf_printf(&sb, "%u ", sc->cpu_cx_stats[i]);
     }
 
     sbuf_trim(&sb);
     sbuf_finish(&sb);
     sysctl_handle_string(oidp, sbuf_data(&sb), sbuf_len(&sb), req);
     sbuf_delete(&sb);
 
     return (0);
 }
+
+#if defined(__i386__) || defined(__amd64__)
+static int
+acpi_cpu_method_sysctl(SYSCTL_HANDLER_ARGS)
+{
+	struct acpi_cpu_softc *sc;
+	struct acpi_cx *cx;
+	struct sbuf sb;
+	char buf[128];
+	int i;
+
+	sc = (struct acpi_cpu_softc *)arg1;
+	sbuf_new(&sb, buf, sizeof(buf), SBUF_FIXEDLEN);
+	for (i = 0; i < sc->cpu_cx_count; i++) {
+		cx = &sc->cpu_cx_states[i];
+		sbuf_printf(&sb, "C%d/", i + 1);
+		if (cx->do_mwait) {
+			sbuf_cat(&sb, "mwait");
+			if (cx->mwait_hw_coord)
+				sbuf_cat(&sb, "/hwc");
+			if (cx->mwait_bm_avoidance)
+				sbuf_cat(&sb, "/bma");
+		} else if (cx->type == ACPI_STATE_C1) {
+			sbuf_cat(&sb, "hlt");
+		} else {
+			sbuf_cat(&sb, "io");
+		}
+		if (cx->type == ACPI_STATE_C1 && cx->p_lvlx != NULL)
+			sbuf_cat(&sb, "/iohlt");
+		sbuf_putc(&sb, ' ');
+	}
+	sbuf_trim(&sb);
+	sbuf_finish(&sb);
+	sysctl_handle_string(oidp, sbuf_data(&sb), sbuf_len(&sb), req);
+	sbuf_delete(&sb);
+	return (0);
+}
+#endif
 
 static int
 acpi_cpu_set_cx_lowest(struct acpi_cpu_softc *sc)
 {
     int i;
 
     ACPI_SERIAL_ASSERT(cpu);
     sc->cpu_cx_lowest = min(sc->cpu_cx_lowest_lim, sc->cpu_cx_count - 1);
 
     /* If not disabling, cache the new lowest non-C3 state. */
     sc->cpu_non_c3 = 0;
     for (i = sc->cpu_cx_lowest; i >= 0; i--) {
 	if (sc->cpu_cx_states[i].type < ACPI_STATE_C3) {
 	    sc->cpu_non_c3 = i;
 	    break;
 	}
     }
 
     /* Reset the statistics counters. */
     bzero(sc->cpu_cx_stats, sizeof(sc->cpu_cx_stats));
     return (0);
 }
 
 static int
 acpi_cpu_cx_lowest_sysctl(SYSCTL_HANDLER_ARGS)
 {
     struct	 acpi_cpu_softc *sc;
     char	 state[8];
     int		 val, error;
 
     sc = (struct acpi_cpu_softc *) arg1;
     snprintf(state, sizeof(state), "C%d", sc->cpu_cx_lowest_lim + 1);
     error = sysctl_handle_string(oidp, state, sizeof(state), req);
     if (error != 0 || req->newptr == NULL)
 	return (error);
     if (strlen(state) < 2 || toupper(state[0]) != 'C')
 	return (EINVAL);
     if (strcasecmp(state, "Cmax") == 0)
 	val = MAX_CX_STATES;
     else {
 	val = (int) strtol(state + 1, NULL, 10);
 	if (val < 1 || val > MAX_CX_STATES)
 	    return (EINVAL);
     }
 
     ACPI_SERIAL_BEGIN(cpu);
     sc->cpu_cx_lowest_lim = val - 1;
     acpi_cpu_set_cx_lowest(sc);
     ACPI_SERIAL_END(cpu);
 
     return (0);
 }
 
 static int
 acpi_cpu_global_cx_lowest_sysctl(SYSCTL_HANDLER_ARGS)
 {
     struct	acpi_cpu_softc *sc;
     char	state[8];
     int		val, error, i;
 
     snprintf(state, sizeof(state), "C%d", cpu_cx_lowest_lim + 1);
     error = sysctl_handle_string(oidp, state, sizeof(state), req);
     if (error != 0 || req->newptr == NULL)
 	return (error);
     if (strlen(state) < 2 || toupper(state[0]) != 'C')
 	return (EINVAL);
     if (strcasecmp(state, "Cmax") == 0)
 	val = MAX_CX_STATES;
     else {
 	val = (int) strtol(state + 1, NULL, 10);
 	if (val < 1 || val > MAX_CX_STATES)
 	    return (EINVAL);
     }
 
     /* Update the new lowest useable Cx state for all CPUs. */
     ACPI_SERIAL_BEGIN(cpu);
     cpu_cx_lowest_lim = val - 1;
     for (i = 0; i < cpu_ndevices; i++) {
 	sc = device_get_softc(cpu_devices[i]);
 	sc->cpu_cx_lowest_lim = cpu_cx_lowest_lim;
 	acpi_cpu_set_cx_lowest(sc);
     }
     ACPI_SERIAL_END(cpu);
 
     return (0);
 }
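
Note on the idle path in the acpi_cpu.c hunk above: acpi_cpu_idle() walks
from the deepest permitted Cx state toward C1 and takes the first state whose
transition latency, multiplied by three, fits in the expected sleep time. It
then folds the measured sleep into cpu_prev_sleep as a 3:1 weighted average.
The following is a minimal user-space sketch of those two steps only. The
names cx_state, pick_cx and update_prev_sleep are hypothetical stand-ins and
do not exist in the driver; bus-master checks, mwait and the quirk handling
are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct acpi_cx; only the latency matters here. */
struct cx_state {
    uint32_t trans_lat;		/* transition latency, in microseconds */
};

/*
 * Walk from the deepest permitted state (index "lowest") toward C1 and
 * return the index of the first state whose latency * 3 fits within the
 * expected sleep time.  If nothing fits, fall back to index 0 (C1), which
 * mirrors the driver initializing cx_next_idx to 0 before its loop.
 */
static int
pick_cx(const struct cx_state *states, int lowest, uint32_t expected_us)
{
    int i;

    for (i = lowest; i >= 0; i--) {
	if (states[i].trans_lat * 3 <= expected_us)
	    return (i);
    }
    return (0);
}

/* Running average: three parts history, one part newest sample. */
static uint32_t
update_prev_sleep(uint32_t prev_us, uint32_t slept_us)
{
    return ((prev_us * 3 + slept_us) / 4);
}
```

With states of latency 0, 50 and 200 us, an expected sleep of 300 us selects
index 1, because 200 * 3 exceeds 300 while 50 * 3 does not. The factor of
three is a safety margin: a state is only worth entering if the CPU is
expected to stay idle well beyond the cost of getting in and out of it.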
Index: projects/release-arm-redux/sys/dev/acpica/acpi_package.c
===================================================================
--- projects/release-arm-redux/sys/dev/acpica/acpi_package.c	(revision 282691)
+++ projects/release-arm-redux/sys/dev/acpica/acpi_package.c	(revision 282692)
@@ -1,152 +1,174 @@
 /*-
  * Copyright (c) 2003 Nate Lawson
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/kernel.h>
 #include <sys/bus.h>
 #include <sys/sbuf.h>
 
 #include <machine/bus.h>
 #include <machine/resource.h>
 #include <sys/rman.h>
 
 #include <contrib/dev/acpica/include/acpi.h>
 
 #include <dev/acpica/acpivar.h>
 
 /*
  * Package manipulation convenience functions
  */
 
 int
 acpi_PkgInt(ACPI_OBJECT *res, int idx, UINT64 *dst)
 {
     ACPI_OBJECT		*obj;
 
     obj = &res->Package.Elements[idx];
     if (obj == NULL || obj->Type != ACPI_TYPE_INTEGER)
 	return (EINVAL);
     *dst = obj->Integer.Value;
 
     return (0);
 }
 
 int
 acpi_PkgInt32(ACPI_OBJECT *res, int idx, uint32_t *dst)
 {
     UINT64		tmp;
     int			error;
 
     error = acpi_PkgInt(res, idx, &tmp);
     if (error == 0)
 	*dst = (uint32_t)tmp;
 
     return (error);
 }
 
 int
 acpi_PkgStr(ACPI_OBJECT *res, int idx, void *dst, size_t size)
 {
     ACPI_OBJECT		*obj;
     void		*ptr;
     size_t		 length;
 
     obj = &res->Package.Elements[idx];
     if (obj == NULL)
 	return (EINVAL);
     bzero(dst, sizeof(dst));
 
     switch (obj->Type) {
     case ACPI_TYPE_STRING:
 	ptr = obj->String.Pointer;
 	length = obj->String.Length;
 	break;
     case ACPI_TYPE_BUFFER:
 	ptr = obj->Buffer.Pointer;
 	length = obj->Buffer.Length;
 	break;
     default:
 	return (EINVAL);
     }
 
     /* Make sure string will fit, including terminating NUL */
     if (++length > size)
 	return (E2BIG);
 
     strlcpy(dst, ptr, length);
     return (0);
 }
 
 int
 acpi_PkgGas(device_t dev, ACPI_OBJECT *res, int idx, int *type, int *rid,
     struct resource **dst, u_int flags)
 {
     ACPI_GENERIC_ADDRESS gas;
     ACPI_OBJECT *obj;
 
     obj = &res->Package.Elements[idx];
     if (obj == NULL || obj->Type != ACPI_TYPE_BUFFER ||
 	obj->Buffer.Length < sizeof(ACPI_GENERIC_ADDRESS) + 3)
 	return (EINVAL);
 
     memcpy(&gas, obj->Buffer.Pointer + 3, sizeof(gas));
 
     return (acpi_bus_alloc_gas(dev, type, rid, &gas, dst, flags));
 }
 
+int
+acpi_PkgFFH_IntelCpu(ACPI_OBJECT *res, int idx, int *vendor, int *class,
+    uint64_t *address, int *accsize)
+{
+    ACPI_GENERIC_ADDRESS gas;
+    ACPI_OBJECT *obj;
+
+    obj = &res->Package.Elements[idx];
+    if (obj == NULL || obj->Type != ACPI_TYPE_BUFFER ||
+	obj->Buffer.Length < sizeof(ACPI_GENERIC_ADDRESS) + 3)
+	return (EINVAL);
+
+    memcpy(&gas, obj->Buffer.Pointer + 3, sizeof(gas));
+    if (gas.SpaceId != ACPI_ADR_SPACE_FIXED_HARDWARE)
+	return (ERESTART);
+    *vendor = gas.BitWidth;
+    *class = gas.BitOffset;
+    *address = gas.Address;
+    *accsize = gas.AccessWidth;
+    return (0);
+}
+
 ACPI_HANDLE
 acpi_GetReference(ACPI_HANDLE scope, ACPI_OBJECT *obj)
 {
     ACPI_HANDLE h;
 
     if (obj == NULL)
 	return (NULL);
 
     switch (obj->Type) {
     case ACPI_TYPE_LOCAL_REFERENCE:
     case ACPI_TYPE_ANY:
 	h = obj->Reference.Handle;
 	break;
     case ACPI_TYPE_STRING:
 	/*
 	 * The String object usually contains a fully-qualified path, so
 	 * scope can be NULL.
 	 *
 	 * XXX This may not always be the case.
 	 */
 	if (ACPI_FAILURE(AcpiGetHandle(scope, obj->String.Pointer, &h)))
 	    h = NULL;
 	break;
     default:
 	h = NULL;
 	break;
     }
 
     return (h);
 }
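
Note on acpi_PkgFFH_IntelCpu() in the acpi_package.c hunk above: the function
copies a generic address structure out of the package buffer at offset 3 and,
when the address space is Functional Fixed Hardware, reinterprets the width
and offset fields as vendor and class codes. The following is a minimal
user-space sketch of that decoding. The struct gas layout, the 0x7f space ID
and the meaning of the three skipped bytes (descriptor tag plus 16-bit length)
are stated here as assumptions about the ACPI encoding rather than taken from
the diff, and decode_ffh and struct ffh are hypothetical names. The driver
returns ERESTART for a non-FFH register; the sketch uses ENODEV because
ERESTART is not portable in user space.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GAS_SPACE_FFH	0x7f	/* assumed value of ACPI_ADR_SPACE_FIXED_HARDWARE */
#define GAS_HDR_SKIP	3	/* assumed: descriptor tag (1) + length (2) */

/* Stand-in for ACPI_GENERIC_ADDRESS: 12 bytes, packed, no padding. */
#pragma pack(push, 1)
struct gas {
    uint8_t	space_id;
    uint8_t	bit_width;
    uint8_t	bit_offset;
    uint8_t	access_width;
    uint64_t	address;
};
#pragma pack(pop)

/* Decoded FFH fields, named after the driver's output parameters. */
struct ffh {
    int		vendor;		/* taken from bit_width */
    int		class_code;	/* taken from bit_offset */
    uint64_t	address;
    int		accsize;	/* taken from access_width */
};

/*
 * Decode an FFH register from a raw buffer.  Returns 0 on success, EINVAL
 * if the buffer is missing or too short, and ENODEV if the register does
 * not live in the FFH address space.
 */
static int
decode_ffh(const uint8_t *buf, size_t len, struct ffh *out)
{
    struct gas g;

    if (buf == NULL || len < sizeof(g) + GAS_HDR_SKIP)
	return (EINVAL);
    memcpy(&g, buf + GAS_HDR_SKIP, sizeof(g));
    if (g.space_id != GAS_SPACE_FFH)
	return (ENODEV);
    out->vendor = g.bit_width;
    out->class_code = g.bit_offset;
    out->address = g.address;
    out->accsize = g.access_width;
    return (0);
}
```

Copying into a local structure with memcpy, as the driver does, avoids
unaligned reads from the AML buffer. The caller is expected to compare the
vendor and class fields against its own constants before acting on the
address, as the _CST parsing code in acpi_cpu.c does.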
Index: projects/release-arm-redux/sys/dev/acpica/acpivar.h
===================================================================
--- projects/release-arm-redux/sys/dev/acpica/acpivar.h	(revision 282691)
+++ projects/release-arm-redux/sys/dev/acpica/acpivar.h	(revision 282692)
@@ -1,507 +1,509 @@
 /*-
  * Copyright (c) 2000 Mitsuru IWASAKI <iwasaki@jp.freebsd.org>
  * Copyright (c) 2000 Michael Smith <msmith@freebsd.org>
  * Copyright (c) 2000 BSDi
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef _ACPIVAR_H_
 #define _ACPIVAR_H_
 
 #ifdef _KERNEL
 
 #include "acpi_if.h"
 #include "bus_if.h"
 #include <sys/eventhandler.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/selinfo.h>
 #include <sys/sx.h>
 #include <sys/sysctl.h>
 
 #include <machine/bus.h>
 #include <machine/resource.h>
 
 struct apm_clone_data;
 struct acpi_softc {
     device_t		acpi_dev;
     struct cdev		*acpi_dev_t;
 
     int			acpi_enabled;
     int			acpi_sstate;
     int			acpi_sleep_disabled;
     int			acpi_resources_reserved;
 
     struct sysctl_ctx_list acpi_sysctl_ctx;
     struct sysctl_oid	*acpi_sysctl_tree;
     int			acpi_power_button_sx;
     int			acpi_sleep_button_sx;
     int			acpi_lid_switch_sx;
 
     int			acpi_standby_sx;
     int			acpi_suspend_sx;
 
     int			acpi_sleep_delay;
     int			acpi_s4bios;
     int			acpi_do_disable;
     int			acpi_verbose;
     int			acpi_handle_reboot;
 
     vm_offset_t		acpi_wakeaddr;
     vm_paddr_t		acpi_wakephys;
 
     int			acpi_next_sstate;	/* Next suspend Sx state. */
     struct apm_clone_data *acpi_clone;		/* Pseudo-dev for devd(8). */
     STAILQ_HEAD(,apm_clone_data) apm_cdevs;	/* All apm/apmctl/acpi cdevs. */
     struct callout	susp_force_to;		/* Force suspend if no acks. */
 };
 
 struct acpi_device {
     /* ACPI ivars */
     ACPI_HANDLE			ad_handle;
     void			*ad_private;
     int				ad_flags;
 
     /* Resources */
     struct resource_list	ad_rl;
 };
 
 /* Track device (/dev/{apm,apmctl} and /dev/acpi) notification status. */
 struct apm_clone_data {
     STAILQ_ENTRY(apm_clone_data) entries;
     struct cdev 	*cdev;
     int			flags;
 #define	ACPI_EVF_NONE	0	/* /dev/apm semantics */
 #define	ACPI_EVF_DEVD	1	/* /dev/acpi is handled via devd(8) */
 #define	ACPI_EVF_WRITE	2	/* Device instance is opened writable. */
     int			notify_status;
 #define	APM_EV_NONE	0	/* Device not yet aware of pending sleep. */
 #define	APM_EV_NOTIFIED	1	/* Device saw next sleep state. */
 #define	APM_EV_ACKED	2	/* Device agreed sleep can occur. */
     struct acpi_softc	*acpi_sc;
     struct selinfo	sel_read;
 };
 
 #define ACPI_PRW_MAX_POWERRES	8
 
 struct acpi_prw_data {
     ACPI_HANDLE		gpe_handle;
     int			gpe_bit;
     int			lowest_wake;
     ACPI_OBJECT		power_res[ACPI_PRW_MAX_POWERRES];
     int			power_res_count;
 };
 
 /* Flags for each device defined in the AML namespace. */
 #define ACPI_FLAG_WAKE_ENABLED	0x1
 
 /* Macros for extracting parts of a PCI address from an _ADR value. */
 #define	ACPI_ADR_PCI_SLOT(adr)	(((adr) & 0xffff0000) >> 16)
 #define	ACPI_ADR_PCI_FUNC(adr)	((adr) & 0xffff)
 
 /*
  * Entry points to ACPI from above are global functions defined in this
  * file, sysctls, and I/O on the control device.  Entry points from below
  * are interrupts (the SCI), notifies, task queue threads, and the thermal
  * zone polling thread.
  *
  * ACPI tables and global shared data are protected by a global lock
  * (acpi_mutex).  
  *
  * Each ACPI device can have its own driver-specific mutex for protecting
  * shared access to local data.  The ACPI_LOCK macros handle mutexes.
  *
  * Drivers that need to serialize access to functions (e.g., to route
  * interrupts, get/set control paths, etc.) should use the sx lock macros
  * (ACPI_SERIAL).
  *
  * ACPI-CA handles its own locking and should not be called with locks held.
  *
  * The most complicated path is:
  *     GPE -> EC runs _Qxx -> _Qxx reads EC space -> GPE
  */
 extern struct mtx			acpi_mutex;
 #define ACPI_LOCK(sys)			mtx_lock(&sys##_mutex)
 #define ACPI_UNLOCK(sys)		mtx_unlock(&sys##_mutex)
 #define ACPI_LOCK_ASSERT(sys)		mtx_assert(&sys##_mutex, MA_OWNED);
 #define ACPI_LOCK_DECL(sys, name)				\
 	static struct mtx sys##_mutex;				\
 	MTX_SYSINIT(sys##_mutex, &sys##_mutex, name, MTX_DEF)
 #define ACPI_SERIAL_BEGIN(sys)		sx_xlock(&sys##_sxlock)
 #define ACPI_SERIAL_END(sys)		sx_xunlock(&sys##_sxlock)
 #define ACPI_SERIAL_ASSERT(sys)		sx_assert(&sys##_sxlock, SX_XLOCKED);
 #define ACPI_SERIAL_DECL(sys, name)				\
 	static struct sx sys##_sxlock;				\
 	SX_SYSINIT(sys##_sxlock, &sys##_sxlock, name)
 
 /*
  * ACPI CA does not define layers for non-ACPI CA drivers.
  * We define some here within the range provided.
  */
 #define	ACPI_AC_ADAPTER		0x00010000
 #define	ACPI_BATTERY		0x00020000
 #define	ACPI_BUS		0x00040000
 #define	ACPI_BUTTON		0x00080000
 #define	ACPI_EC			0x00100000
 #define	ACPI_FAN		0x00200000
 #define	ACPI_POWERRES		0x00400000
 #define	ACPI_PROCESSOR		0x00800000
 #define	ACPI_THERMAL		0x01000000
 #define	ACPI_TIMER		0x02000000
 #define	ACPI_OEM		0x04000000
 
 /*
  * Constants for different interrupt models used with acpi_SetIntrModel().
  */
 #define	ACPI_INTR_PIC		0
 #define	ACPI_INTR_APIC		1
 #define	ACPI_INTR_SAPIC		2
 
 /*
  * Various features and capabilities for the acpi_get_features() method.
  * In particular, these are used for the ACPI 3.0 _PDC and _OSC methods.
  * See the Intel document titled "Intel Processor Vendor-Specific ACPI",
  * number 302223-007.
  */
 #define	ACPI_CAP_PERF_MSRS	(1 << 0)  /* Intel SpeedStep PERF_CTL MSRs */
 #define	ACPI_CAP_C1_IO_HALT	(1 << 1)  /* Intel C1 "IO then halt" sequence */
 #define	ACPI_CAP_THR_MSRS	(1 << 2)  /* Intel OnDemand throttling MSRs */
 #define	ACPI_CAP_SMP_SAME	(1 << 3)  /* MP C1, Px, and Tx (all the same) */
 #define	ACPI_CAP_SMP_SAME_C3	(1 << 4)  /* MP C2 and C3 (all the same) */
 #define	ACPI_CAP_SMP_DIFF_PX	(1 << 5)  /* MP Px (different, using _PSD) */
 #define	ACPI_CAP_SMP_DIFF_CX	(1 << 6)  /* MP Cx (different, using _CSD) */
 #define	ACPI_CAP_SMP_DIFF_TX	(1 << 7)  /* MP Tx (different, using _TSD) */
 #define	ACPI_CAP_SMP_C1_NATIVE	(1 << 8)  /* MP C1 support other than halt */
 #define	ACPI_CAP_SMP_C3_NATIVE	(1 << 9)  /* MP C2 and C3 support */
 #define	ACPI_CAP_PX_HW_COORD	(1 << 11) /* Intel P-state HW coordination */
 #define	ACPI_CAP_INTR_CPPC	(1 << 12) /* Native Interrupt Handling for
 	     Collaborative Processor Performance Control notifications */
 #define	ACPI_CAP_HW_DUTY_C	(1 << 13) /* Hardware Duty Cycling */
 
 /*
  * Quirk flags.
  *
  * ACPI_Q_BROKEN: Disables all ACPI support.
  * ACPI_Q_TIMER: Disables support for the ACPI timer.
  * ACPI_Q_MADT_IRQ0: Specifies that ISA IRQ 0 is wired up to pin 0 of the
  *	first APIC and that the MADT should force that by ignoring the PC-AT
  *	compatible flag and ignoring overrides that redirect IRQ 0 to pin 2.
  */
 extern int	acpi_quirks;
 #define ACPI_Q_OK		0
 #define ACPI_Q_BROKEN		(1 << 0)
 #define ACPI_Q_TIMER		(1 << 1)
 #define ACPI_Q_MADT_IRQ0	(1 << 2)
 
 /*
  * Note that the low ivar values are reserved to provide
  * interface compatibility with ISA drivers which can also
  * attach to ACPI.
  */
 #define ACPI_IVAR_HANDLE	0x100
 #define ACPI_IVAR_UNUSED	0x101	/* Unused/reserved. */
 #define ACPI_IVAR_PRIVATE	0x102
 #define ACPI_IVAR_FLAGS		0x103
 
 /*
  * Accessor functions for our ivars.  Default value for BUS_READ_IVAR is
  * (type) 0.  The <sys/bus.h> accessor functions don't check return values.
  */
 #define __ACPI_BUS_ACCESSOR(varp, var, ivarp, ivar, type)	\
 								\
 static __inline type varp ## _get_ ## var(device_t dev)		\
 {								\
     uintptr_t v = 0;						\
     BUS_READ_IVAR(device_get_parent(dev), dev,			\
 	ivarp ## _IVAR_ ## ivar, &v);				\
     return ((type) v);						\
 }								\
 								\
 static __inline void varp ## _set_ ## var(device_t dev, type t)	\
 {								\
     uintptr_t v = (uintptr_t) t;				\
     BUS_WRITE_IVAR(device_get_parent(dev), dev,			\
 	ivarp ## _IVAR_ ## ivar, v);				\
 }
 
 __ACPI_BUS_ACCESSOR(acpi, handle, ACPI, HANDLE, ACPI_HANDLE)
 __ACPI_BUS_ACCESSOR(acpi, private, ACPI, PRIVATE, void *)
 __ACPI_BUS_ACCESSOR(acpi, flags, ACPI, FLAGS, int)
 
 void acpi_fake_objhandler(ACPI_HANDLE h, void *data);
 static __inline device_t
 acpi_get_device(ACPI_HANDLE handle)
 {
     void *dev = NULL;
     AcpiGetData(handle, acpi_fake_objhandler, &dev);
     return ((device_t)dev);
 }
 
 static __inline ACPI_OBJECT_TYPE
 acpi_get_type(device_t dev)
 {
     ACPI_HANDLE		h;
     ACPI_OBJECT_TYPE	t;
 
     if ((h = acpi_get_handle(dev)) == NULL)
 	return (ACPI_TYPE_NOT_FOUND);
     if (ACPI_FAILURE(AcpiGetType(h, &t)))
 	return (ACPI_TYPE_NOT_FOUND);
     return (t);
 }
 
 /* Find the difference between two PM tick counts. */
 static __inline uint32_t
 acpi_TimerDelta(uint32_t end, uint32_t start)
 {
 
 	if (end < start && (AcpiGbl_FADT.Flags & ACPI_FADT_32BIT_TIMER) == 0)
 		end |= 0x01000000;
 	return (end - start);
 }
 
 #ifdef ACPI_DEBUGGER
 void		acpi_EnterDebugger(void);
 #endif
 
 #ifdef ACPI_DEBUG
 #include <sys/cons.h>
 #define STEP(x)		do {printf x, printf("\n"); cngetc();} while (0)
 #else
 #define STEP(x)
 #endif
 
 #define ACPI_VPRINT(dev, acpi_sc, x...) do {			\
     if (acpi_get_verbose(acpi_sc))				\
 	device_printf(dev, x);					\
 } while (0)
 
 /* Values for the device _STA (status) method. */
 #define ACPI_STA_PRESENT	(1 << 0)
 #define ACPI_STA_ENABLED	(1 << 1)
 #define ACPI_STA_SHOW_IN_UI	(1 << 2)
 #define ACPI_STA_FUNCTIONAL	(1 << 3)
 #define ACPI_STA_BATT_PRESENT	(1 << 4)
 
 #define ACPI_DEVINFO_PRESENT(x, flags)					\
 	(((x) & (flags)) == (flags))
 #define ACPI_DEVICE_PRESENT(x)						\
 	ACPI_DEVINFO_PRESENT(x, ACPI_STA_PRESENT | ACPI_STA_FUNCTIONAL)
 #define ACPI_BATTERY_PRESENT(x)						\
 	ACPI_DEVINFO_PRESENT(x, ACPI_STA_PRESENT | ACPI_STA_FUNCTIONAL | \
 	    ACPI_STA_BATT_PRESENT)
 
 /* Callback function type for walking subtables within a table. */
 typedef void acpi_subtable_handler(ACPI_SUBTABLE_HEADER *, void *);
 
 BOOLEAN		acpi_DeviceIsPresent(device_t dev);
 BOOLEAN		acpi_BatteryIsPresent(device_t dev);
 ACPI_STATUS	acpi_GetHandleInScope(ACPI_HANDLE parent, char *path,
 		    ACPI_HANDLE *result);
 ACPI_BUFFER	*acpi_AllocBuffer(int size);
 ACPI_STATUS	acpi_ConvertBufferToInteger(ACPI_BUFFER *bufp,
 		    UINT32 *number);
 ACPI_STATUS	acpi_GetInteger(ACPI_HANDLE handle, char *path,
 		    UINT32 *number);
 ACPI_STATUS	acpi_SetInteger(ACPI_HANDLE handle, char *path,
 		    UINT32 number);
 ACPI_STATUS	acpi_ForeachPackageObject(ACPI_OBJECT *obj, 
 		    void (*func)(ACPI_OBJECT *comp, void *arg), void *arg);
 ACPI_STATUS	acpi_FindIndexedResource(ACPI_BUFFER *buf, int index,
 		    ACPI_RESOURCE **resp);
 ACPI_STATUS	acpi_AppendBufferResource(ACPI_BUFFER *buf,
 		    ACPI_RESOURCE *res);
 ACPI_STATUS	acpi_OverrideInterruptLevel(UINT32 InterruptNumber);
 ACPI_STATUS	acpi_SetIntrModel(int model);
 int		acpi_ReqSleepState(struct acpi_softc *sc, int state);
 int		acpi_AckSleepState(struct apm_clone_data *clone, int error);
 ACPI_STATUS	acpi_SetSleepState(struct acpi_softc *sc, int state);
 int		acpi_wake_set_enable(device_t dev, int enable);
 int		acpi_parse_prw(ACPI_HANDLE h, struct acpi_prw_data *prw);
 ACPI_STATUS	acpi_Startup(void);
 void		acpi_UserNotify(const char *subsystem, ACPI_HANDLE h,
 		    uint8_t notify);
 int		acpi_bus_alloc_gas(device_t dev, int *type, int *rid,
 		    ACPI_GENERIC_ADDRESS *gas, struct resource **res,
 		    u_int flags);
 void		acpi_walk_subtables(void *first, void *end,
 		    acpi_subtable_handler *handler, void *arg);
 BOOLEAN		acpi_MatchHid(ACPI_HANDLE h, const char *hid);
 
 struct acpi_parse_resource_set {
     void	(*set_init)(device_t dev, void *arg, void **context);
     void	(*set_done)(device_t dev, void *context);
     void	(*set_ioport)(device_t dev, void *context, uint64_t base,
 		    uint64_t length);
     void	(*set_iorange)(device_t dev, void *context, uint64_t low,
 		    uint64_t high, uint64_t length, uint64_t align);
     void	(*set_memory)(device_t dev, void *context, uint64_t base,
 		    uint64_t length);
     void	(*set_memoryrange)(device_t dev, void *context, uint64_t low,
 		    uint64_t high, uint64_t length, uint64_t align);
     void	(*set_irq)(device_t dev, void *context, uint8_t *irq,
 		    int count, int trig, int pol);
     void	(*set_ext_irq)(device_t dev, void *context, uint32_t *irq,
 		    int count, int trig, int pol);
     void	(*set_drq)(device_t dev, void *context, uint8_t *drq,
 		    int count);
     void	(*set_start_dependent)(device_t dev, void *context,
 		    int preference);
     void	(*set_end_dependent)(device_t dev, void *context);
 };
 
 extern struct	acpi_parse_resource_set acpi_res_parse_set;
 
 int		acpi_identify(void);
 void		acpi_config_intr(device_t dev, ACPI_RESOURCE *res);
 ACPI_STATUS	acpi_lookup_irq_resource(device_t dev, int rid,
 		    struct resource *res, ACPI_RESOURCE *acpi_res);
 ACPI_STATUS	acpi_parse_resources(device_t dev, ACPI_HANDLE handle,
 		    struct acpi_parse_resource_set *set, void *arg);
 struct resource *acpi_alloc_sysres(device_t child, int type, int *rid,
 		    u_long start, u_long end, u_long count, u_int flags);
 
 /* ACPI event handling */
 UINT32		acpi_event_power_button_sleep(void *context);
 UINT32		acpi_event_power_button_wake(void *context);
 UINT32		acpi_event_sleep_button_sleep(void *context);
 UINT32		acpi_event_sleep_button_wake(void *context);
 
 #define ACPI_EVENT_PRI_FIRST      0
 #define ACPI_EVENT_PRI_DEFAULT    10000
 #define ACPI_EVENT_PRI_LAST       20000
 
 typedef void (*acpi_event_handler_t)(void *, int);
 
 EVENTHANDLER_DECLARE(acpi_sleep_event, acpi_event_handler_t);
 EVENTHANDLER_DECLARE(acpi_wakeup_event, acpi_event_handler_t);
 
 /* Device power control. */
 ACPI_STATUS	acpi_pwr_wake_enable(ACPI_HANDLE consumer, int enable);
 ACPI_STATUS	acpi_pwr_switch_consumer(ACPI_HANDLE consumer, int state);
 int		acpi_device_pwr_for_sleep(device_t bus, device_t dev,
 		    int *dstate);
 
 /* APM emulation */
 void		acpi_apm_init(struct acpi_softc *);
 
 /* Misc. */
 static __inline struct acpi_softc *
 acpi_device_get_parent_softc(device_t child)
 {
     device_t	parent;
 
     parent = device_get_parent(child);
     if (parent == NULL)
 	return (NULL);
     return (device_get_softc(parent));
 }
 
 static __inline int
 acpi_get_verbose(struct acpi_softc *sc)
 {
     if (sc)
 	return (sc->acpi_verbose);
     return (0);
 }
 
 char		*acpi_name(ACPI_HANDLE handle);
 int		acpi_avoid(ACPI_HANDLE handle);
 int		acpi_disabled(char *subsys);
 int		acpi_machdep_init(device_t dev);
 void		acpi_install_wakeup_handler(struct acpi_softc *sc);
 int		acpi_sleep_machdep(struct acpi_softc *sc, int state);
 int		acpi_wakeup_machdep(struct acpi_softc *sc, int state,
 		    int sleep_result, int intr_enabled);
 int		acpi_table_quirks(int *quirks);
 int		acpi_machdep_quirks(int *quirks);
 
 /* Battery Abstraction. */
 struct acpi_battinfo;
 
 int		acpi_battery_register(device_t dev);
 int		acpi_battery_remove(device_t dev);
 int		acpi_battery_get_units(void);
 int		acpi_battery_get_info_expire(void);
 int		acpi_battery_bst_valid(struct acpi_bst *bst);
 int		acpi_battery_bif_valid(struct acpi_bif *bif);
 int		acpi_battery_get_battinfo(device_t dev,
 		    struct acpi_battinfo *info);
 
 /* Embedded controller. */
 void		acpi_ec_ecdt_probe(device_t);
 
 /* AC adapter interface. */
 int		acpi_acad_get_acline(int *);
 
 /* Package manipulation convenience functions. */
 #define ACPI_PKG_VALID(pkg, size)				\
     ((pkg) != NULL && (pkg)->Type == ACPI_TYPE_PACKAGE &&	\
      (pkg)->Package.Count >= (size))
 int		acpi_PkgInt(ACPI_OBJECT *res, int idx, UINT64 *dst);
 int		acpi_PkgInt32(ACPI_OBJECT *res, int idx, uint32_t *dst);
 int		acpi_PkgStr(ACPI_OBJECT *res, int idx, void *dst, size_t size);
 int		acpi_PkgGas(device_t dev, ACPI_OBJECT *res, int idx, int *type,
 		    int *rid, struct resource **dst, u_int flags);
+int		acpi_PkgFFH_IntelCpu(ACPI_OBJECT *res, int idx, int *vendor,
+		    int *class, uint64_t *address, int *accsize);
 ACPI_HANDLE	acpi_GetReference(ACPI_HANDLE scope, ACPI_OBJECT *obj);
 
 /*
  * Base level for BUS_ADD_CHILD.  Special devices are added at orders less
  * than this, and normal devices at or above this level.  This keeps the
  * probe order sorted so that things like sysresource are available before
  * their children need them.
  */
 #define	ACPI_DEV_BASE_ORDER	100
 
 /* Default maximum number of tasks to enqueue. */
 #ifndef ACPI_MAX_TASKS
 #define	ACPI_MAX_TASKS		MAX(32, MAXCPU * 4)
 #endif
 
 /* Default number of task queue threads to start. */
 #ifndef ACPI_MAX_THREADS
 #define ACPI_MAX_THREADS	3
 #endif
 
 /* Use the device logging level for ktr(4). */
 #define	KTR_ACPI		KTR_DEV
 
 SYSCTL_DECL(_debug_acpi);
 
 /*
  * Map a PXM to a VM domain.
  *
  * Returns the VM domain ID if found, or -1 if not found / invalid.
  */
 #if MAXMEMDOM > 1
 extern	int acpi_map_pxm_to_vm_domainid(int pxm);
 #endif
 extern	int acpi_get_domain(device_t dev, device_t child, int *domain);
 extern	int acpi_parse_pxm(device_t dev, int *domain);
 
 #endif /* _KERNEL */
 #endif /* !_ACPIVAR_H_ */
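Note on acpi_TimerDelta() in the header above: the ACPI PM timer may be only 24 bits wide, so a wrapped reading has to be corrected before the two tick counts are subtracted. The following is a minimal userland sketch of that arithmetic, not the kernel code itself. The name timer_delta and the timer_is_32bit parameter are assumptions made for illustration; the parameter stands in for the AcpiGbl_FADT.Flags & ACPI_FADT_32BIT_TIMER test.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for acpi_TimerDelta().  timer_is_32bit replaces
 * the FADT flag test used in the header; it is not a real kernel symbol.
 */
static uint32_t
timer_delta(uint32_t end, uint32_t start, int timer_is_32bit)
{

	/* A 24-bit counter that wrapped: put back the lost carry bit. */
	if (end < start && !timer_is_32bit)
		end |= 0x01000000;
	/* For a 32-bit counter, unsigned subtraction absorbs the wrap. */
	return (end - start);
}
```

With a 24-bit timer, start = 0x00FFFFF0 and end = 0x00000010 give a delta of 0x20 ticks; a 32-bit timer that wraps from 0xFFFFFFF0 to 0x00000010 gives the same result without the correction.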
Index: projects/release-arm-redux/sys/dev/hwpmc/hwpmc_armv7.c
===================================================================
--- projects/release-arm-redux/sys/dev/hwpmc/hwpmc_armv7.c	(revision 282691)
+++ projects/release-arm-redux/sys/dev/hwpmc/hwpmc_armv7.c	(revision 282692)
@@ -1,564 +1,564 @@
 /*-
  * Copyright (c) 2015 Ruslan Bukin <br@bsdpad.com>
  * All rights reserved.
  *
  * This software was developed by SRI International and the University of
  * Cambridge Computer Laboratory under DARPA/AFRL contract (FA8750-10-C-0237)
  * ("CTSRD"), as part of the DARPA CRASH research programme.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/pmc.h>
 #include <sys/pmckern.h>
 
 #include <machine/pmc_mdep.h>
 #include <machine/cpu.h>
 
 #define	CPU_ID_CORTEX_VER_MASK	0xff
 #define	CPU_ID_CORTEX_VER_SHIFT	4
 
 static int armv7_npmcs;
 
 struct armv7_event_code_map {
 	enum pmc_event	pe_ev;
 	uint8_t		pe_code;
 };
 
 const struct armv7_event_code_map armv7_event_codes[] = {
 	{ PMC_EV_ARMV7_PMNC_SW_INCR,		0x00 },
 	{ PMC_EV_ARMV7_L1_ICACHE_REFILL,	0x01 },
 	{ PMC_EV_ARMV7_ITLB_REFILL,		0x02 },
 	{ PMC_EV_ARMV7_L1_DCACHE_REFILL,	0x03 },
 	{ PMC_EV_ARMV7_L1_DCACHE_ACCESS,	0x04 },
 	{ PMC_EV_ARMV7_DTLB_REFILL,		0x05 },
 	{ PMC_EV_ARMV7_MEM_READ,		0x06 },
 	{ PMC_EV_ARMV7_MEM_WRITE,		0x07 },
 	{ PMC_EV_ARMV7_INSTR_EXECUTED,		0x08 },
 	{ PMC_EV_ARMV7_EXC_TAKEN,		0x09 },
 	{ PMC_EV_ARMV7_EXC_EXECUTED,		0x0A },
 	{ PMC_EV_ARMV7_CID_WRITE,		0x0B },
 	{ PMC_EV_ARMV7_PC_WRITE,		0x0C },
 	{ PMC_EV_ARMV7_PC_IMM_BRANCH,		0x0D },
 	{ PMC_EV_ARMV7_PC_PROC_RETURN,		0x0E },
 	{ PMC_EV_ARMV7_MEM_UNALIGNED_ACCESS,	0x0F },
 	{ PMC_EV_ARMV7_PC_BRANCH_MIS_PRED,	0x10 },
 	{ PMC_EV_ARMV7_CLOCK_CYCLES,		0x11 },
 	{ PMC_EV_ARMV7_PC_BRANCH_PRED,		0x12 },
 	{ PMC_EV_ARMV7_MEM_ACCESS,		0x13 },
 	{ PMC_EV_ARMV7_L1_ICACHE_ACCESS,	0x14 },
 	{ PMC_EV_ARMV7_L1_DCACHE_WB,		0x15 },
 	{ PMC_EV_ARMV7_L2_CACHE_ACCESS,		0x16 },
 	{ PMC_EV_ARMV7_L2_CACHE_REFILL,		0x17 },
 	{ PMC_EV_ARMV7_L2_CACHE_WB,		0x18 },
 	{ PMC_EV_ARMV7_BUS_ACCESS,		0x19 },
 	{ PMC_EV_ARMV7_MEM_ERROR,		0x1A },
 	{ PMC_EV_ARMV7_INSTR_SPEC,		0x1B },
 	{ PMC_EV_ARMV7_TTBR_WRITE,		0x1C },
 	{ PMC_EV_ARMV7_BUS_CYCLES,		0x1D },
 	{ PMC_EV_ARMV7_CPU_CYCLES,		0xFF },
 };
 
 const int armv7_event_codes_size =
 	sizeof(armv7_event_codes) / sizeof(armv7_event_codes[0]);
 
 /*
  * Per-processor information.
  */
 struct armv7_cpu {
 	struct pmc_hw   *pc_armv7pmcs;
 	int cortex_ver;
 };
 
 static struct armv7_cpu **armv7_pcpu;
 
 /*
  * Interrupt Enable Set Register
  */
 static __inline void
 armv7_interrupt_enable(uint32_t pmc)
 {
 	uint32_t reg;
 
 	reg = (1 << pmc);
 	cp15_pminten_set(reg);
 }
 
 /*
  * Interrupt Enable Clear Register
  */
 static __inline void
 armv7_interrupt_disable(uint32_t pmc)
 {
 	uint32_t reg;
 
 	reg = (1 << pmc);
 	cp15_pminten_clr(reg);
 }
 
 /*
  * Count Enable Set Register
  */
 static __inline void
 armv7_counter_enable(unsigned int pmc)
 {
 	uint32_t reg;
 
 	reg = (1 << pmc);
 	cp15_pmcnten_set(reg);
 }
 
 /*
  * Count Enable Clear Register
  */
 static __inline void
 armv7_counter_disable(unsigned int pmc)
 {
 	uint32_t reg;
 
 	reg = (1 << pmc);
 	cp15_pmcnten_clr(reg);
 }
 
 /*
  * Performance Count Register N
  */
 static uint32_t
 armv7_pmcn_read(unsigned int pmc)
 {
 
 	KASSERT(pmc < armv7_npmcs, ("%s: illegal PMC number %d", __func__, pmc));
 
 	cp15_pmselr_set(pmc);
 	return (cp15_pmxevcntr_get());
 }
 
 static uint32_t
 armv7_pmcn_write(unsigned int pmc, uint32_t reg)
 {
 
 	KASSERT(pmc < armv7_npmcs, ("%s: illegal PMC number %d", __func__, pmc));
 
 	cp15_pmselr_set(pmc);
 	cp15_pmxevcntr_set(reg);
 
 	return (reg);
 }
 
 static int
 armv7_allocate_pmc(int cpu, int ri, struct pmc *pm,
   const struct pmc_op_pmcallocate *a)
 {
 	uint32_t caps, config;
 	struct armv7_cpu *pac;
 	enum pmc_event pe;
 	int i;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[armv7,%d] illegal CPU value %d", __LINE__, cpu));
 	KASSERT(ri >= 0 && ri < armv7_npmcs,
 	    ("[armv7,%d] illegal row index %d", __LINE__, ri));
 
 	pac = armv7_pcpu[cpu];
 
 	caps = a->pm_caps;
 	if (a->pm_class != PMC_CLASS_ARMV7)
 		return (EINVAL);
 	pe = a->pm_ev;
 
 	for (i = 0; i < armv7_event_codes_size; i++) {
 		if (armv7_event_codes[i].pe_ev == pe) {
 			config = armv7_event_codes[i].pe_code;
 			break;
 		}
 	}
 	if (i == armv7_event_codes_size)
 		return EINVAL;
 
 	pm->pm_md.pm_armv7.pm_armv7_evsel = config;
 
-	PMCDBG(MDP,ALL,2,"armv7-allocate ri=%d -> config=0x%x", ri, config);
+	PMCDBG2(MDP,ALL,2,"armv7-allocate ri=%d -> config=0x%x", ri, config);
 
 	return 0;
 }
 
 
 static int
 armv7_read_pmc(int cpu, int ri, pmc_value_t *v)
 {
 	pmc_value_t tmp;
 	struct pmc *pm;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[armv7,%d] illegal CPU value %d", __LINE__, cpu));
 	KASSERT(ri >= 0 && ri < armv7_npmcs,
 	    ("[armv7,%d] illegal row index %d", __LINE__, ri));
 
 	pm  = armv7_pcpu[cpu]->pc_armv7pmcs[ri].phw_pmc;
 
 	if (pm->pm_md.pm_armv7.pm_armv7_evsel == 0xFF)
 		tmp = cp15_pmccntr_get();
 	else
 		tmp = armv7_pmcn_read(ri);
 
-	PMCDBG(MDP,REA,2,"armv7-read id=%d -> %jd", ri, tmp);
+	PMCDBG2(MDP,REA,2,"armv7-read id=%d -> %jd", ri, tmp);
 	if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)))
 		*v = ARMV7_PERFCTR_VALUE_TO_RELOAD_COUNT(tmp);
 	else
 		*v = tmp;
 
 	return 0;
 }
 
 static int
 armv7_write_pmc(int cpu, int ri, pmc_value_t v)
 {
 	struct pmc *pm;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[armv7,%d] illegal CPU value %d", __LINE__, cpu));
 	KASSERT(ri >= 0 && ri < armv7_npmcs,
 	    ("[armv7,%d] illegal row-index %d", __LINE__, ri));
 
 	pm  = armv7_pcpu[cpu]->pc_armv7pmcs[ri].phw_pmc;
 
 	if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)))
 		v = ARMV7_RELOAD_COUNT_TO_PERFCTR_VALUE(v);
 	
-	PMCDBG(MDP,WRI,1,"armv7-write cpu=%d ri=%d v=%jx", cpu, ri, v);
+	PMCDBG3(MDP,WRI,1,"armv7-write cpu=%d ri=%d v=%jx", cpu, ri, v);
 
 	if (pm->pm_md.pm_armv7.pm_armv7_evsel == 0xFF)
 		cp15_pmccntr_set(v);
 	else
 		armv7_pmcn_write(ri, v);
 
 	return 0;
 }
 
 static int
 armv7_config_pmc(int cpu, int ri, struct pmc *pm)
 {
 	struct pmc_hw *phw;
 
-	PMCDBG(MDP,CFG,1, "cpu=%d ri=%d pm=%p", cpu, ri, pm);
+	PMCDBG3(MDP,CFG,1, "cpu=%d ri=%d pm=%p", cpu, ri, pm);
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[armv7,%d] illegal CPU value %d", __LINE__, cpu));
 	KASSERT(ri >= 0 && ri < armv7_npmcs,
 	    ("[armv7,%d] illegal row-index %d", __LINE__, ri));
 
 	phw = &armv7_pcpu[cpu]->pc_armv7pmcs[ri];
 
 	KASSERT(pm == NULL || phw->phw_pmc == NULL,
 	    ("[armv7,%d] pm=%p phw->pm=%p hwpmc not unconfigured",
 	    __LINE__, pm, phw->phw_pmc));
 
 	phw->phw_pmc = pm;
 
 	return 0;
 }
 
 static int
 armv7_start_pmc(int cpu, int ri)
 {
 	struct pmc_hw *phw;
 	uint32_t config;
 	struct pmc *pm;
 
 	phw    = &armv7_pcpu[cpu]->pc_armv7pmcs[ri];
 	pm     = phw->phw_pmc;
 	config = pm->pm_md.pm_armv7.pm_armv7_evsel;
 
 	/*
 	 * Configure the event selection.
 	 */
 	cp15_pmselr_set(ri);
 	cp15_pmxevtyper_set(config);
 
 	/*
 	 * Enable the PMC.
 	 */
 	armv7_interrupt_enable(ri);
 	armv7_counter_enable(ri);
 
 	return 0;
 }
 
 static int
 armv7_stop_pmc(int cpu, int ri)
 {
 	struct pmc_hw *phw;
 	struct pmc *pm;
 
 	phw    = &armv7_pcpu[cpu]->pc_armv7pmcs[ri];
 	pm     = phw->phw_pmc;
 
 	/*
 	 * Disable the PMCs.
 	 */
 	armv7_counter_disable(ri);
 	armv7_interrupt_disable(ri);
 
 	return 0;
 }
 
 static int
 armv7_release_pmc(int cpu, int ri, struct pmc *pmc)
 {
 	struct pmc_hw *phw;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[armv7,%d] illegal CPU value %d", __LINE__, cpu));
 	KASSERT(ri >= 0 && ri < armv7_npmcs,
 	    ("[armv7,%d] illegal row-index %d", __LINE__, ri));
 
 	phw = &armv7_pcpu[cpu]->pc_armv7pmcs[ri];
 	KASSERT(phw->phw_pmc == NULL,
 	    ("[armv7,%d] PHW pmc %p non-NULL", __LINE__, phw->phw_pmc));
 
 	return 0;
 }
 
 static int
 armv7_intr(int cpu, struct trapframe *tf)
 {
 	struct armv7_cpu *pc;
 	int retval, ri;
 	struct pmc *pm;
 	int error;
 	int reg;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[armv7,%d] CPU %d out of range", __LINE__, cpu));
 
 	retval = 0;
 	pc = armv7_pcpu[cpu];
 
 	for (ri = 0; ri < armv7_npmcs; ri++) {
 		pm = armv7_pcpu[cpu]->pc_armv7pmcs[ri].phw_pmc;
 		if (pm == NULL)
 			continue;
 		if (!PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)))
 			continue;
 
 		/* Check if counter has overflowed */
 		if (pm->pm_md.pm_armv7.pm_armv7_evsel == 0xFF)
 			reg = (1 << 31);
 		else
 			reg = (1 << ri);
 
 		if ((cp15_pmovsr_get() & reg) == 0) {
 			continue;
 		}
 
 		/* Clear Overflow Flag */
 		cp15_pmovsr_set(reg);
 
 		retval = 1; /* Found an interrupting PMC. */
 		if (pm->pm_state != PMC_STATE_RUNNING)
 			continue;
 
 		error = pmc_process_interrupt(cpu, PMC_HR, pm, tf,
 		    TRAPF_USERMODE(tf));
 		if (error)
 			armv7_stop_pmc(cpu, ri);
 
 		/* Reload sampling count */
 		armv7_write_pmc(cpu, ri, pm->pm_sc.pm_reloadcount);
 	}
 
 	return (retval);
 }
 
 static int
 armv7_describe(int cpu, int ri, struct pmc_info *pi, struct pmc **ppmc)
 {
 	char armv7_name[PMC_NAME_MAX];
 	struct pmc_hw *phw;
 	int error;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[armv7,%d], illegal CPU %d", __LINE__, cpu));
 	KASSERT(ri >= 0 && ri < armv7_npmcs,
 	    ("[armv7,%d] row-index %d out of range", __LINE__, ri));
 
 	phw = &armv7_pcpu[cpu]->pc_armv7pmcs[ri];
 	snprintf(armv7_name, sizeof(armv7_name), "ARMV7-%d", ri);
 	if ((error = copystr(armv7_name, pi->pm_name, PMC_NAME_MAX,
 	    NULL)) != 0)
 		return error;
 	pi->pm_class = PMC_CLASS_ARMV7;
 	if (phw->phw_state & PMC_PHW_FLAG_IS_ENABLED) {
 		pi->pm_enabled = TRUE;
 		*ppmc = phw->phw_pmc;
 	} else {
 		pi->pm_enabled = FALSE;
 		*ppmc = NULL;
 	}
 
 	return (0);
 }
 
 static int
 armv7_get_config(int cpu, int ri, struct pmc **ppm)
 {
 
 	*ppm = armv7_pcpu[cpu]->pc_armv7pmcs[ri].phw_pmc;
 
 	return 0;
 }
 
 /*
  * XXX don't know what we should do here.
  */
 static int
 armv7_switch_in(struct pmc_cpu *pc, struct pmc_process *pp)
 {
 
 	return 0;
 }
 
 static int
 armv7_switch_out(struct pmc_cpu *pc, struct pmc_process *pp)
 {
 
 	return 0;
 }
 
 static int
 armv7_pcpu_init(struct pmc_mdep *md, int cpu)
 {
 	struct armv7_cpu *pac;
 	struct pmc_hw  *phw;
 	struct pmc_cpu *pc;
 	uint32_t pmnc;
 	int first_ri;
 	int cpuid;
 	int i;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[armv7,%d] wrong cpu number %d", __LINE__, cpu));
-	PMCDBG(MDP,INI,1,"armv7-init cpu=%d", cpu);
+	PMCDBG1(MDP,INI,1,"armv7-init cpu=%d", cpu);
 
 	armv7_pcpu[cpu] = pac = malloc(sizeof(struct armv7_cpu), M_PMC,
 	    M_WAITOK|M_ZERO);
 
 	cpuid = cpu_ident();
 	pac->cortex_ver = (cpuid >> CPU_ID_CORTEX_VER_SHIFT) & \
 				CPU_ID_CORTEX_VER_MASK;
 
 	pac->pc_armv7pmcs = malloc(sizeof(struct pmc_hw) * armv7_npmcs,
 	    M_PMC, M_WAITOK|M_ZERO);
 	pc = pmc_pcpu[cpu];
 	first_ri = md->pmd_classdep[PMC_MDEP_CLASS_INDEX_ARMV7].pcd_ri;
 	KASSERT(pc != NULL, ("[armv7,%d] NULL per-cpu pointer", __LINE__));
 
 	for (i = 0, phw = pac->pc_armv7pmcs; i < armv7_npmcs; i++, phw++) {
 		phw->phw_state    = PMC_PHW_FLAG_IS_ENABLED |
 		    PMC_PHW_CPU_TO_STATE(cpu) | PMC_PHW_INDEX_TO_STATE(i);
 		phw->phw_pmc      = NULL;
 		pc->pc_hwpmcs[i + first_ri] = phw;
 	}
 
 	/* Enable unit */
 	pmnc = cp15_pmcr_get();
 	pmnc |= ARMV7_PMNC_ENABLE;
 	cp15_pmcr_set(pmnc);
 
 	return 0;
 }
 
 static int
 armv7_pcpu_fini(struct pmc_mdep *md, int cpu)
 {
 	uint32_t pmnc;
 
 	pmnc = cp15_pmcr_get();
 	pmnc &= ~ARMV7_PMNC_ENABLE;
 	cp15_pmcr_set(pmnc);
 
 	return 0;
 }
 
 struct pmc_mdep *
 pmc_armv7_initialize()
 {
 	struct pmc_mdep *pmc_mdep;
 	struct pmc_classdep *pcd;
 	int reg;
 
 	reg = cp15_pmcr_get();
 
 	armv7_npmcs = (reg >> ARMV7_PMNC_N_SHIFT) & \
 				ARMV7_PMNC_N_MASK;
 
-	PMCDBG(MDP,INI,1,"armv7-init npmcs=%d", armv7_npmcs);
+	PMCDBG1(MDP,INI,1,"armv7-init npmcs=%d", armv7_npmcs);
 	
 	/*
 	 * Allocate space for pointers to PMC HW descriptors and for
 	 * the MDEP structure used by MI code.
 	 */
 	armv7_pcpu = malloc(sizeof(struct armv7_cpu *) * pmc_cpu_max(),
 		M_PMC, M_WAITOK | M_ZERO);
 
 	/* Just one class */
 	pmc_mdep = pmc_mdep_alloc(1);
 	pmc_mdep->pmd_cputype = PMC_CPU_ARMV7;
 
 	pcd = &pmc_mdep->pmd_classdep[PMC_MDEP_CLASS_INDEX_ARMV7];
 	pcd->pcd_caps  = ARMV7_PMC_CAPS;
 	pcd->pcd_class = PMC_CLASS_ARMV7;
 	pcd->pcd_num   = armv7_npmcs;
 	pcd->pcd_ri    = pmc_mdep->pmd_npmc;
 	pcd->pcd_width = 32;
 
 	pcd->pcd_allocate_pmc   = armv7_allocate_pmc;
 	pcd->pcd_config_pmc     = armv7_config_pmc;
 	pcd->pcd_pcpu_fini      = armv7_pcpu_fini;
 	pcd->pcd_pcpu_init      = armv7_pcpu_init;
 	pcd->pcd_describe       = armv7_describe;
 	pcd->pcd_get_config	= armv7_get_config;
 	pcd->pcd_read_pmc       = armv7_read_pmc;
 	pcd->pcd_release_pmc    = armv7_release_pmc;
 	pcd->pcd_start_pmc      = armv7_start_pmc;
 	pcd->pcd_stop_pmc       = armv7_stop_pmc;
 	pcd->pcd_write_pmc      = armv7_write_pmc;
 
 	pmc_mdep->pmd_intr       = armv7_intr;
 	pmc_mdep->pmd_switch_in  = armv7_switch_in;
 	pmc_mdep->pmd_switch_out = armv7_switch_out;
 	
 	pmc_mdep->pmd_npmc   += armv7_npmcs;
 
 	return (pmc_mdep);
 }
 
 void
 pmc_armv7_finalize(struct pmc_mdep *md)
 {
 
 }
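Note on armv7_allocate_pmc() in the file above: it maps a requested event to a hardware event code with a linear scan of armv7_event_codes[] and rejects the request when the scan reaches the end of the table. The sketch below shows the same pattern in standalone C under stated assumptions. Every demo_* name is hypothetical, the table holds only three sample entries, and the return convention (0 or -1) is a simplification of the driver's EINVAL path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event identifiers; the driver uses enum pmc_event. */
enum demo_event {
	DEMO_EV_SW_INCR,
	DEMO_EV_ICACHE_REFILL,
	DEMO_EV_CPU_CYCLES,
	DEMO_EV_UNKNOWN		/* deliberately absent from the table */
};

struct demo_event_code_map {
	enum demo_event	pe_ev;
	uint8_t		pe_code;
};

/* Three sample rows copied in spirit from armv7_event_codes[]. */
static const struct demo_event_code_map demo_event_codes[] = {
	{ DEMO_EV_SW_INCR,		0x00 },
	{ DEMO_EV_ICACHE_REFILL,	0x01 },
	{ DEMO_EV_CPU_CYCLES,		0xFF },
};

static const size_t demo_event_codes_size =
	sizeof(demo_event_codes) / sizeof(demo_event_codes[0]);

/* Store the event code in *config and return 0, or return -1 if unknown. */
static int
demo_lookup_event(enum demo_event ev, uint32_t *config)
{
	size_t i;

	for (i = 0; i < demo_event_codes_size; i++) {
		if (demo_event_codes[i].pe_ev == ev) {
			*config = demo_event_codes[i].pe_code;
			return (0);
		}
	}
	return (-1);
}
```

Code 0xFF is the value the driver's read, write, and interrupt paths compare against to select the dedicated cycle counter rather than a numbered event counter, which is why the sample table keeps that row.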
Index: projects/release-arm-redux/sys/dev/hwpmc/hwpmc_e500.c
===================================================================
--- projects/release-arm-redux/sys/dev/hwpmc/hwpmc_e500.c	(revision 282691)
+++ projects/release-arm-redux/sys/dev/hwpmc/hwpmc_e500.c	(revision 282692)
@@ -1,660 +1,660 @@
 /*-
  * Copyright (c) 2015 Justin Hibbits
  * Copyright (c) 2005, Joseph Koshy
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/pmc.h>
 #include <sys/pmckern.h>
 #include <sys/systm.h>
 
 #include <machine/pmc_mdep.h>
 #include <machine/cpu.h>
 
 #include <ddb/ddb.h>
 
 #include "hwpmc_powerpc.h"
 
 #define	POWERPC_PMC_CAPS	(PMC_CAP_INTERRUPT | PMC_CAP_USER |     \
 				 PMC_CAP_SYSTEM | PMC_CAP_EDGE |	\
 				 PMC_CAP_THRESHOLD | PMC_CAP_READ |	\
 				 PMC_CAP_WRITE | PMC_CAP_INVERT |	\
 				 PMC_CAP_QUALIFIER)
 
 #define E500_PMC_HAS_OVERFLOWED(x) (e500_pmcn_read(x) & (0x1 << 31))
 
 struct e500_event_code_map {
 	enum pmc_event	pe_ev;       /* enum value */
 	uint8_t         pe_counter_mask;  /* Which counter this can be counted in. */
 	uint8_t		pe_code;     /* numeric code */
 	uint8_t		pe_cpu;	     /* e500 core (v1,v2,mc), mask */
 };
 
 #define E500_MAX_PMCS	4
 #define PMC_PPC_MASK0	0
 #define PMC_PPC_MASK1	1
 #define PMC_PPC_MASK2	2
 #define PMC_PPC_MASK3	3
 #define PMC_PPC_MASK_ALL	0x0f
 #define PMC_PPC_E500V1		1
 #define PMC_PPC_E500V2		2
 #define PMC_PPC_E500MC		4
 #define PMC_PPC_E500_ANY	7
 #define PMC_E500_EVENT(id, mask, number, core) \
 	[PMC_EV_E500_##id - PMC_EV_E500_FIRST] = \
 	    { .pe_ev = PMC_EV_E500_##id, .pe_counter_mask = mask, \
 	      .pe_code = number, .pe_cpu = core }
 #define PMC_E500MC_ONLY(id, number) \
 	PMC_E500_EVENT(id, PMC_PPC_MASK_ALL, number, PMC_PPC_E500MC)
 #define PMC_E500_COMMON(id, number) \
 	PMC_E500_EVENT(id, PMC_PPC_MASK_ALL, number, PMC_PPC_E500_ANY)
 
 static struct e500_event_code_map e500_event_codes[] = {
 	PMC_E500_COMMON(CYCLES, 1),
 	PMC_E500_COMMON(INSTR_COMPLETED, 2),
 	PMC_E500_COMMON(UOPS_COMPLETED, 3),
 	PMC_E500_COMMON(INSTR_FETCHED, 4),
 	PMC_E500_COMMON(UOPS_DECODED, 5),
 	PMC_E500_COMMON(PM_EVENT_TRANSITIONS, 6),
 	PMC_E500_COMMON(PM_EVENT_CYCLES, 7),
 	PMC_E500_COMMON(BRANCH_INSTRS_COMPLETED, 8),
 	PMC_E500_COMMON(LOAD_UOPS_COMPLETED, 9),
 	PMC_E500_COMMON(STORE_UOPS_COMPLETED, 10),
 	PMC_E500_COMMON(CQ_REDIRECTS, 11),
 	PMC_E500_COMMON(BRANCHES_FINISHED, 12),
 	PMC_E500_COMMON(TAKEN_BRANCHES_FINISHED, 13),
 	PMC_E500_COMMON(FINISHED_UNCOND_BRANCHES_MISS_BTB, 14),
 	PMC_E500_COMMON(BRANCH_MISPRED, 15),
 	PMC_E500_COMMON(BTB_BRANCH_MISPRED_FROM_DIRECTION, 16),
 	PMC_E500_COMMON(BTB_HITS_PSEUDO_HITS, 17),
 	PMC_E500_COMMON(CYCLES_DECODE_STALLED, 18),
 	PMC_E500_COMMON(CYCLES_ISSUE_STALLED, 19),
 	PMC_E500_COMMON(CYCLES_BRANCH_ISSUE_STALLED, 20),
 	PMC_E500_COMMON(CYCLES_SU1_SCHED_STALLED, 21),
 	PMC_E500_COMMON(CYCLES_SU2_SCHED_STALLED, 22),
 	PMC_E500_COMMON(CYCLES_MU_SCHED_STALLED, 23),
 	PMC_E500_COMMON(CYCLES_LRU_SCHED_STALLED, 24),
 	PMC_E500_COMMON(CYCLES_BU_SCHED_STALLED, 25),
 	PMC_E500_COMMON(TOTAL_TRANSLATED, 26),
 	PMC_E500_COMMON(LOADS_TRANSLATED, 27),
 	PMC_E500_COMMON(STORES_TRANSLATED, 28),
 	PMC_E500_COMMON(TOUCHES_TRANSLATED, 29),
 	PMC_E500_COMMON(CACHEOPS_TRANSLATED, 30),
 	PMC_E500_COMMON(CACHE_INHIBITED_ACCESS_TRANSLATED, 31),
 	PMC_E500_COMMON(GUARDED_LOADS_TRANSLATED, 32),
 	PMC_E500_COMMON(WRITE_THROUGH_STORES_TRANSLATED, 33),
 	PMC_E500_COMMON(MISALIGNED_LOAD_STORE_ACCESS_TRANSLATED, 34),
 	PMC_E500_COMMON(TOTAL_ALLOCATED_TO_DLFB, 35),
 	PMC_E500_COMMON(LOADS_TRANSLATED_ALLOCATED_TO_DLFB, 36),
 	PMC_E500_COMMON(STORES_COMPLETED_ALLOCATED_TO_DLFB, 37),
 	PMC_E500_COMMON(TOUCHES_TRANSLATED_ALLOCATED_TO_DLFB, 38),
 	PMC_E500_COMMON(STORES_COMPLETED, 39),
 	PMC_E500_COMMON(DATA_L1_CACHE_LOCKS, 40),
 	PMC_E500_COMMON(DATA_L1_CACHE_RELOADS, 41),
 	PMC_E500_COMMON(DATA_L1_CACHE_CASTOUTS, 42),
 	PMC_E500_COMMON(LOAD_MISS_DLFB_FULL, 43),
 	PMC_E500_COMMON(LOAD_MISS_LDQ_FULL, 44),
 	PMC_E500_COMMON(LOAD_GUARDED_MISS, 45),
 	PMC_E500_COMMON(STORE_TRANSLATE_WHEN_QUEUE_FULL, 46),
 	PMC_E500_COMMON(ADDRESS_COLLISION, 47),
 	PMC_E500_COMMON(DATA_MMU_MISS, 48),
 	PMC_E500_COMMON(DATA_MMU_BUSY, 49),
 	PMC_E500_COMMON(PART2_MISALIGNED_CACHE_ACCESS, 50),
 	PMC_E500_COMMON(LOAD_MISS_DLFB_FULL_CYCLES, 51),
 	PMC_E500_COMMON(LOAD_MISS_LDQ_FULL_CYCLES, 52),
 	PMC_E500_COMMON(LOAD_GUARDED_MISS_CYCLES, 53),
 	PMC_E500_COMMON(STORE_TRANSLATE_WHEN_QUEUE_FULL_CYCLES, 54),
 	PMC_E500_COMMON(ADDRESS_COLLISION_CYCLES, 55),
 	PMC_E500_COMMON(DATA_MMU_MISS_CYCLES, 56),
 	PMC_E500_COMMON(DATA_MMU_BUSY_CYCLES, 57),
 	PMC_E500_COMMON(PART2_MISALIGNED_CACHE_ACCESS_CYCLES, 58),
 	PMC_E500_COMMON(INSTR_L1_CACHE_LOCKS, 59),
 	PMC_E500_COMMON(INSTR_L1_CACHE_RELOADS, 60),
 	PMC_E500_COMMON(INSTR_L1_CACHE_FETCHES, 61),
 	PMC_E500_COMMON(INSTR_MMU_TLB4K_RELOADS, 62),
 	PMC_E500_COMMON(INSTR_MMU_VSP_RELOADS, 63),
 	PMC_E500_COMMON(DATA_MMU_TLB4K_RELOADS, 64),
 	PMC_E500_COMMON(DATA_MMU_VSP_RELOADS, 65),
 	PMC_E500_COMMON(L2MMU_MISSES, 66),
 	PMC_E500_COMMON(BIU_MASTER_REQUESTS, 67),
 	PMC_E500_COMMON(BIU_MASTER_INSTR_SIDE_REQUESTS, 68),
 	PMC_E500_COMMON(BIU_MASTER_DATA_SIDE_REQUESTS, 69),
 	PMC_E500_COMMON(BIU_MASTER_DATA_SIDE_CASTOUT_REQUESTS, 70),
 	PMC_E500_COMMON(BIU_MASTER_RETRIES, 71),
 	PMC_E500_COMMON(SNOOP_REQUESTS, 72),
 	PMC_E500_COMMON(SNOOP_HITS, 73),
 	PMC_E500_COMMON(SNOOP_PUSHES, 74),
 	PMC_E500_COMMON(SNOOP_RETRIES, 75),
 	PMC_E500_EVENT(DLFB_LOAD_MISS_CYCLES, PMC_PPC_MASK0|PMC_PPC_MASK1,
 	    76, PMC_PPC_E500_ANY),
 	PMC_E500_EVENT(ILFB_FETCH_MISS_CYCLES, PMC_PPC_MASK0|PMC_PPC_MASK1,
 	    77, PMC_PPC_E500_ANY),
 	PMC_E500_EVENT(EXT_INPU_INTR_LATENCY_CYCLES, PMC_PPC_MASK0|PMC_PPC_MASK1,
 	    78, PMC_PPC_E500_ANY),
 	PMC_E500_EVENT(CRIT_INPUT_INTR_LATENCY_CYCLES, PMC_PPC_MASK0|PMC_PPC_MASK1,
 	    79, PMC_PPC_E500_ANY),
 	PMC_E500_EVENT(EXT_INPUT_INTR_PENDING_LATENCY_CYCLES,
 	    PMC_PPC_MASK0|PMC_PPC_MASK1, 80, PMC_PPC_E500_ANY),
 	PMC_E500_EVENT(CRIT_INPUT_INTR_PENDING_LATENCY_CYCLES,
 	    PMC_PPC_MASK0|PMC_PPC_MASK1, 81, PMC_PPC_E500_ANY),
 	PMC_E500_COMMON(PMC0_OVERFLOW, 82),
 	PMC_E500_COMMON(PMC1_OVERFLOW, 83),
 	PMC_E500_COMMON(PMC2_OVERFLOW, 84),
 	PMC_E500_COMMON(PMC3_OVERFLOW, 85),
 	PMC_E500_COMMON(INTERRUPTS_TAKEN, 86),
 	PMC_E500_COMMON(EXT_INPUT_INTR_TAKEN, 87),
 	PMC_E500_COMMON(CRIT_INPUT_INTR_TAKEN, 88),
 	PMC_E500_COMMON(SYSCALL_TRAP_INTR, 89),
 	PMC_E500_EVENT(TLB_BIT_TRANSITIONS, PMC_PPC_MASK_ALL, 90,
 	    PMC_PPC_E500V2 | PMC_PPC_E500MC),
 	PMC_E500MC_ONLY(L2_LINEFILL_BUFFER, 91),
 	PMC_E500MC_ONLY(LV2_VS, 92),
 	PMC_E500MC_ONLY(CASTOUTS_RELEASED, 93),
 	PMC_E500MC_ONLY(INTV_ALLOCATIONS, 94),
 	PMC_E500MC_ONLY(DLFB_RETRIES_TO_MBAR, 95),
 	PMC_E500MC_ONLY(STORE_RETRIES, 96),
 	PMC_E500MC_ONLY(STASH_L1_HITS, 97),
 	PMC_E500MC_ONLY(STASH_L2_HITS, 98),
 	PMC_E500MC_ONLY(STASH_BUSY_1, 99),
 	PMC_E500MC_ONLY(STASH_BUSY_2, 100),
 	PMC_E500MC_ONLY(STASH_BUSY_3, 101),
 	PMC_E500MC_ONLY(STASH_HITS, 102),
 	PMC_E500MC_ONLY(STASH_HIT_DLFB, 103),
 	PMC_E500MC_ONLY(STASH_REQUESTS, 106),
 	PMC_E500MC_ONLY(STASH_REQUESTS_L1, 107),
 	PMC_E500MC_ONLY(STASH_REQUESTS_L2, 108),
 	PMC_E500MC_ONLY(STALLS_NO_CAQ_OR_COB, 109),
 	PMC_E500MC_ONLY(L2_CACHE_ACCESSES, 110),
 	PMC_E500MC_ONLY(L2_HIT_CACHE_ACCESSES, 111),
 	PMC_E500MC_ONLY(L2_CACHE_DATA_ACCESSES, 112),
 	PMC_E500MC_ONLY(L2_CACHE_DATA_HITS, 113),
 	PMC_E500MC_ONLY(L2_CACHE_INSTR_ACCESSES, 114),
 	PMC_E500MC_ONLY(L2_CACHE_INSTR_HITS, 115),
 	PMC_E500MC_ONLY(L2_CACHE_ALLOCATIONS, 116),
 	PMC_E500MC_ONLY(L2_CACHE_DATA_ALLOCATIONS, 117),
 	PMC_E500MC_ONLY(L2_CACHE_DIRTY_DATA_ALLOCATIONS, 118),
 	PMC_E500MC_ONLY(L2_CACHE_INSTR_ALLOCATIONS, 119),
 	PMC_E500MC_ONLY(L2_CACHE_UPDATES, 120),
 	PMC_E500MC_ONLY(L2_CACHE_CLEAN_UPDATES, 121),
 	PMC_E500MC_ONLY(L2_CACHE_DIRTY_UPDATES, 122),
 	PMC_E500MC_ONLY(L2_CACHE_CLEAN_REDUNDANT_UPDATES, 123),
 	PMC_E500MC_ONLY(L2_CACHE_DIRTY_REDUNDANT_UPDATES, 124),
 	PMC_E500MC_ONLY(L2_CACHE_LOCKS, 125),
 	PMC_E500MC_ONLY(L2_CACHE_CASTOUTS, 126),
 	PMC_E500MC_ONLY(L2_CACHE_DATA_DIRTY_HITS, 127),
 	PMC_E500MC_ONLY(INSTR_LFB_WENT_HIGH_PRIORITY, 128),
 	PMC_E500MC_ONLY(SNOOP_THROTTLING_TURNED_ON, 129),
 	PMC_E500MC_ONLY(L2_CLEAN_LINE_INVALIDATIONS, 130),
 	PMC_E500MC_ONLY(L2_INCOHERENT_LINE_INVALIDATIONS, 131),
 	PMC_E500MC_ONLY(L2_COHERENT_LINE_INVALIDATIONS, 132),
 	PMC_E500MC_ONLY(COHERENT_LOOKUP_MISS_DUE_TO_VALID_BUT_INCOHERENT_MATCHES, 133),
 	PMC_E500MC_ONLY(IAC1S_DETECTED, 140),
 	PMC_E500MC_ONLY(IAC2S_DETECTED, 141),
 	PMC_E500MC_ONLY(DAC1S_DTECTED, 144),
 	PMC_E500MC_ONLY(DAC2S_DTECTED, 145),
 	PMC_E500MC_ONLY(DVT0_DETECTED, 148),
 	PMC_E500MC_ONLY(DVT1_DETECTED, 149),
 	PMC_E500MC_ONLY(DVT2_DETECTED, 150),
 	PMC_E500MC_ONLY(DVT3_DETECTED, 151),
 	PMC_E500MC_ONLY(DVT4_DETECTED, 152),
 	PMC_E500MC_ONLY(DVT5_DETECTED, 153),
 	PMC_E500MC_ONLY(DVT6_DETECTED, 154),
 	PMC_E500MC_ONLY(DVT7_DETECTED, 155),
 	PMC_E500MC_ONLY(CYCLES_COMPLETION_STALLED_NEXUS_FIFO_FULL, 156),
 	PMC_E500MC_ONLY(FPU_DOUBLE_PUMP, 160),
 	PMC_E500MC_ONLY(FPU_FINISH, 161),
 	PMC_E500MC_ONLY(FPU_DIVIDE_CYCLES, 162),
 	PMC_E500MC_ONLY(FPU_DENORM_INPUT_CYCLES, 163),
 	PMC_E500MC_ONLY(FPU_RESULT_STALL_CYCLES, 164),
 	PMC_E500MC_ONLY(FPU_FPSCR_FULL_STALL, 165),
 	PMC_E500MC_ONLY(FPU_PIPE_SYNC_STALLS, 166),
 	PMC_E500MC_ONLY(FPU_INPUT_DATA_STALLS, 167),
 	PMC_E500MC_ONLY(DECORATED_LOADS, 176),
 	PMC_E500MC_ONLY(DECORATED_STORES, 177),
 	PMC_E500MC_ONLY(LOAD_RETRIES, 178),
 	PMC_E500MC_ONLY(STWCX_SUCCESSES, 179),
 	PMC_E500MC_ONLY(STWCX_FAILURES, 180),
 };
 
 const size_t e500_event_codes_size = 
 	sizeof(e500_event_codes) / sizeof(e500_event_codes[0]);
 
 static pmc_value_t
 e500_pmcn_read(unsigned int pmc)
 {
 	switch (pmc) {
 		case 0:
 			return mfpmr(PMR_PMC0);
 			break;
 		case 1:
 			return mfpmr(PMR_PMC1);
 			break;
 		case 2:
 			return mfpmr(PMR_PMC2);
 			break;
 		case 3:
 			return mfpmr(PMR_PMC3);
 			break;
 		default:
 			panic("Invalid PMC number: %d\n", pmc);
 	}
 }
 
 static void
 e500_pmcn_write(unsigned int pmc, uint32_t val)
 {
 	switch (pmc) {
 		case 0:
 			mtpmr(PMR_PMC0, val);
 			break;
 		case 1:
 			mtpmr(PMR_PMC1, val);
 			break;
 		case 2:
 			mtpmr(PMR_PMC2, val);
 			break;
 		case 3:
 			mtpmr(PMR_PMC3, val);
 			break;
 		default:
 			panic("Invalid PMC number: %d\n", pmc);
 	}
 }
 
 static int
 e500_read_pmc(int cpu, int ri, pmc_value_t *v)
 {
 	struct pmc *pm;
 	pmc_value_t tmp;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[powerpc,%d] illegal CPU value %d", __LINE__, cpu));
 	KASSERT(ri >= 0 && ri < E500_MAX_PMCS,
 	    ("[powerpc,%d] illegal row index %d", __LINE__, ri));
 
 	pm  = powerpc_pcpu[cpu]->pc_ppcpmcs[ri].phw_pmc;
 	KASSERT(pm,
 	    ("[core,%d] cpu %d ri %d pmc not configured", __LINE__, cpu,
 		ri));
 
 	tmp = e500_pmcn_read(ri);
-	PMCDBG(MDP,REA,2,"ppc-read id=%d -> %jd", ri, tmp);
+	PMCDBG2(MDP,REA,2,"ppc-read id=%d -> %jd", ri, tmp);
 	if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)))
 		*v = POWERPC_PERFCTR_VALUE_TO_RELOAD_COUNT(tmp);
 	else
 		*v = tmp;
 
 	return 0;
 }
 
 static int
 e500_write_pmc(int cpu, int ri, pmc_value_t v)
 {
 	struct pmc *pm;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[powerpc,%d] illegal CPU value %d", __LINE__, cpu));
 	KASSERT(ri >= 0 && ri < E500_MAX_PMCS,
 	    ("[powerpc,%d] illegal row-index %d", __LINE__, ri));
 
 	pm  = powerpc_pcpu[cpu]->pc_ppcpmcs[ri].phw_pmc;
 
 	if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)))
 		v = POWERPC_RELOAD_COUNT_TO_PERFCTR_VALUE(v);
 	
-	PMCDBG(MDP,WRI,1,"powerpc-write cpu=%d ri=%d v=%jx", cpu, ri, v);
+	PMCDBG3(MDP,WRI,1,"powerpc-write cpu=%d ri=%d v=%jx", cpu, ri, v);
 
 	e500_pmcn_write(ri, v);
 
 	return 0;
 }
 
 static int
 e500_config_pmc(int cpu, int ri, struct pmc *pm)
 {
 	struct pmc_hw *phw;
 
-	PMCDBG(MDP,CFG,1, "cpu=%d ri=%d pm=%p", cpu, ri, pm);
+	PMCDBG3(MDP,CFG,1, "cpu=%d ri=%d pm=%p", cpu, ri, pm);
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[powerpc,%d] illegal CPU value %d", __LINE__, cpu));
 	KASSERT(ri >= 0 && ri < E500_MAX_PMCS,
 	    ("[powerpc,%d] illegal row-index %d", __LINE__, ri));
 
 	phw = &powerpc_pcpu[cpu]->pc_ppcpmcs[ri];
 
 	KASSERT(pm == NULL || phw->phw_pmc == NULL,
 	    ("[powerpc,%d] pm=%p phw->pm=%p hwpmc not unconfigured",
 	    __LINE__, pm, phw->phw_pmc));
 
 	phw->phw_pmc = pm;
 
 	return 0;
 }
 
 static int
 e500_start_pmc(int cpu, int ri)
 {
 	uint32_t config;
         struct pmc *pm;
         struct pmc_hw *phw;
 
 	phw    = &powerpc_pcpu[cpu]->pc_ppcpmcs[ri];
 	pm     = phw->phw_pmc;
 	config = pm->pm_md.pm_powerpc.pm_powerpc_evsel;
 
 	if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)))
 		config |= PMLCax_CE;
 
 	/* Enable the PMC. */
 	switch (ri) {
 	case 0:
 		mtpmr(PMR_PMLCa0, config);
 		break;
 	case 1:
 		mtpmr(PMR_PMLCa1, config);
 		break;
 	case 2:
 		mtpmr(PMR_PMLCa2, config);
 		break;
 	case 3:
 		mtpmr(PMR_PMLCa3, config);
 		break;
 	default:
 		break;
 	}
 	
 	return 0;
 }
 
 static int
 e500_stop_pmc(int cpu, int ri)
 {
         struct pmc *pm;
         struct pmc_hw *phw;
         register_t pmc_pmlc;
 
 	phw    = &powerpc_pcpu[cpu]->pc_ppcpmcs[ri];
 	pm     = phw->phw_pmc;
 
 	/*
 	 * Disable the PMCs.
 	 */
 	switch (ri) {
 	case 0:
 		pmc_pmlc = mfpmr(PMR_PMLCa0);
 		pmc_pmlc |= PMLCax_FC;
 		mtpmr(PMR_PMLCa0, pmc_pmlc);
 		break;
 	case 1:
 		pmc_pmlc = mfpmr(PMR_PMLCa1);
 		pmc_pmlc |= PMLCax_FC;
 		mtpmr(PMR_PMLCa1, pmc_pmlc);
 		break;
 	case 2:
 		pmc_pmlc = mfpmr(PMR_PMLCa2);
 		pmc_pmlc |= PMLCax_FC;
 		mtpmr(PMR_PMLCa2, pmc_pmlc);
 		break;
 	case 3:
 		pmc_pmlc = mfpmr(PMR_PMLCa3);
 		pmc_pmlc |= PMLCax_FC;
 		mtpmr(PMR_PMLCa3, pmc_pmlc);
 		break;
 	default:
 		break;
 	}
 	return 0;
 }
 
 static int
 e500_pcpu_init(struct pmc_mdep *md, int cpu)
 {
 	int first_ri, i;
 	struct pmc_cpu *pc;
 	struct powerpc_cpu *pac;
 	struct pmc_hw  *phw;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[powerpc,%d] wrong cpu number %d", __LINE__, cpu));
-	PMCDBG(MDP,INI,1,"powerpc-init cpu=%d", cpu);
+	PMCDBG1(MDP,INI,1,"powerpc-init cpu=%d", cpu);
 
 	/* Freeze all counters. */
 	mtpmr(PMR_PMGC0, PMGC_FAC | PMGC_PMIE | PMGC_FCECE);
 
 	powerpc_pcpu[cpu] = pac = malloc(sizeof(struct powerpc_cpu), M_PMC,
 	    M_WAITOK|M_ZERO);
 	pac->pc_ppcpmcs = malloc(sizeof(struct pmc_hw) * E500_MAX_PMCS,
 	    M_PMC, M_WAITOK|M_ZERO);
 	pac->pc_class = PMC_CLASS_E500;
 	pc = pmc_pcpu[cpu];
 	first_ri = md->pmd_classdep[PMC_MDEP_CLASS_INDEX_POWERPC].pcd_ri;
 	KASSERT(pc != NULL, ("[powerpc,%d] NULL per-cpu pointer", __LINE__));
 
 	for (i = 0, phw = pac->pc_ppcpmcs; i < E500_MAX_PMCS; i++, phw++) {
 		phw->phw_state    = PMC_PHW_FLAG_IS_ENABLED |
 		    PMC_PHW_CPU_TO_STATE(cpu) | PMC_PHW_INDEX_TO_STATE(i);
 		phw->phw_pmc      = NULL;
 		pc->pc_hwpmcs[i + first_ri] = phw;
 
 		/* Initialize the PMC to stopped */
 		e500_stop_pmc(cpu, i);
 	}
 	/* Unfreeze global register. */
 	mtpmr(PMR_PMGC0, PMGC_PMIE | PMGC_FCECE);
 
 	return 0;
 }
 
 static int
 e500_pcpu_fini(struct pmc_mdep *md, int cpu)
 {
 	uint32_t pmgc0 = mfpmr(PMR_PMGC0);
 
 	pmgc0 |= PMGC_FAC;
 	mtpmr(PMR_PMGC0, pmgc0);
 	mtmsr(mfmsr() & ~PSL_PMM);
 
 	free(powerpc_pcpu[cpu]->pc_ppcpmcs, M_PMC);
 	free(powerpc_pcpu[cpu], M_PMC);
 
 	return 0;
 }
 
 static int
 e500_allocate_pmc(int cpu, int ri, struct pmc *pm,
   const struct pmc_op_pmcallocate *a)
 {
 	enum pmc_event pe;
 	uint32_t caps, config, counter;
 	struct e500_event_code_map *ev;
 	uint16_t vers;
 	uint8_t pe_cpu_mask;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[powerpc,%d] illegal CPU value %d", __LINE__, cpu));
 	KASSERT(ri >= 0 && ri < E500_MAX_PMCS,
 	    ("[powerpc,%d] illegal row index %d", __LINE__, ri));
 
 	caps = a->pm_caps;
 
 	pe = a->pm_ev;
 	config = PMLCax_FCS | PMLCax_FCU |
 	    PMLCax_FCM1 | PMLCax_FCM1;
 	if (pe < PMC_EV_E500_FIRST || pe > PMC_EV_E500_LAST)
 		return (EINVAL);
 
 	ev = &e500_event_codes[pe-PMC_EV_E500_FIRST];
 	if (ev->pe_code == 0)
 		return (EINVAL);
 
 	vers = mfpvr() >> 16;
 	switch (vers) {
 	case FSL_E500v1:
 		pe_cpu_mask = ev->pe_code & PMC_PPC_E500V1;
 		break;
 	case FSL_E500v2:
 		pe_cpu_mask = ev->pe_code & PMC_PPC_E500V2;
 		break;
 	case FSL_E500mc:
 		pe_cpu_mask = ev->pe_code & PMC_PPC_E500MC;
 		break;
 	}
 	if (pe_cpu_mask == 0)
 		return (EINVAL);
 
 	config |= PMLCax_EVENT(ev->pe_code);
 	counter =  ev->pe_counter_mask;
 	if ((counter & (1 << ri)) == 0)
 		return (EINVAL);
 
 	if (caps & PMC_CAP_SYSTEM)
 		config &= ~PMLCax_FCS;
 	if (caps & PMC_CAP_USER)
 		config &= ~PMLCax_FCU;
 	if ((caps & (PMC_CAP_USER | PMC_CAP_SYSTEM)) == 0)
 		config &= ~(PMLCax_FCS|PMLCax_FCU);
 
 	pm->pm_md.pm_powerpc.pm_powerpc_evsel = config;
 
-	PMCDBG(MDP,ALL,2,"powerpc-allocate ri=%d -> config=0x%x", ri, config);
+	PMCDBG2(MDP,ALL,2,"powerpc-allocate ri=%d -> config=0x%x", ri, config);
 
 	return 0;
 }
 
 static int
 e500_release_pmc(int cpu, int ri, struct pmc *pmc)
 {
 	struct pmc_hw *phw;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[powerpc,%d] illegal CPU value %d", __LINE__, cpu));
 	KASSERT(ri >= 0 && ri < E500_MAX_PMCS,
 	    ("[powerpc,%d] illegal row-index %d", __LINE__, ri));
 
 	phw = &powerpc_pcpu[cpu]->pc_ppcpmcs[ri];
 	KASSERT(phw->phw_pmc == NULL,
 	    ("[powerpc,%d] PHW pmc %p non-NULL", __LINE__, phw->phw_pmc));
 
 	return 0;
 }
 
 static int
 e500_intr(int cpu, struct trapframe *tf)
 {
 	int i, error, retval;
 	uint32_t config;
 	struct pmc *pm;
 	struct powerpc_cpu *pac;
 
 	KASSERT(cpu >= 0 && cpu < pmc_cpu_max(),
 	    ("[powerpc,%d] out of range CPU %d", __LINE__, cpu));
 
-	PMCDBG(MDP,INT,1, "cpu=%d tf=%p um=%d", cpu, (void *) tf,
+	PMCDBG3(MDP,INT,1, "cpu=%d tf=%p um=%d", cpu, (void *) tf,
 	    TRAPF_USERMODE(tf));
 
 	retval = 0;
 
 	pac = powerpc_pcpu[cpu];
 
 	config  = mfpmr(PMR_PMGC0) & ~PMGC_FAC;
 
 	/*
 	 * look for all PMCs that have interrupted:
 	 * - look for a running, sampling PMC which has overflowed
 	 *   and which has a valid 'struct pmc' association
 	 *
 	 * If found, we call a helper to process the interrupt.
 	 */
 
 	for (i = 0; i < E500_MAX_PMCS; i++) {
 		if ((pm = pac->pc_ppcpmcs[i].phw_pmc) == NULL ||
 		    !PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) {
 			continue;
 		}
 
 		if (!E500_PMC_HAS_OVERFLOWED(i))
 			continue;
 
 		retval = 1;	/* Found an interrupting PMC. */
 
 		if (pm->pm_state != PMC_STATE_RUNNING)
 			continue;
 
 		/* Stop the counter if logging fails. */
 		error = pmc_process_interrupt(cpu, PMC_HR, pm, tf,
 		    TRAPF_USERMODE(tf));
 		if (error != 0)
 			e500_stop_pmc(cpu, i);
 
 		/* reload count. */
 		e500_write_pmc(cpu, i, pm->pm_sc.pm_reloadcount);
 	}
 
 	atomic_add_int(retval ? &pmc_stats.pm_intr_processed :
 	    &pmc_stats.pm_intr_ignored, 1);
 
 	/* Re-enable PERF exceptions. */
 	if (retval)
 		mtpmr(PMR_PMGC0, config | PMGC_PMIE);
 
 	return (retval);
 }
 
 int
 pmc_e500_initialize(struct pmc_mdep *pmc_mdep)
 {
 	struct pmc_classdep *pcd;
 
 	pmc_mdep->pmd_cputype = PMC_CPU_PPC_E500;
 
 	pcd = &pmc_mdep->pmd_classdep[PMC_MDEP_CLASS_INDEX_POWERPC];
 	pcd->pcd_caps  = POWERPC_PMC_CAPS;
 	pcd->pcd_class = PMC_CLASS_E500;
 	pcd->pcd_num   = E500_MAX_PMCS;
 	pcd->pcd_ri    = pmc_mdep->pmd_npmc;
 	pcd->pcd_width = 32;
 
 	pcd->pcd_allocate_pmc   = e500_allocate_pmc;
 	pcd->pcd_config_pmc     = e500_config_pmc;
 	pcd->pcd_pcpu_fini      = e500_pcpu_fini;
 	pcd->pcd_pcpu_init      = e500_pcpu_init;
 	pcd->pcd_describe       = powerpc_describe;
 	pcd->pcd_get_config     = powerpc_get_config;
 	pcd->pcd_read_pmc       = e500_read_pmc;
 	pcd->pcd_release_pmc    = e500_release_pmc;
 	pcd->pcd_start_pmc      = e500_start_pmc;
 	pcd->pcd_stop_pmc       = e500_stop_pmc;
  	pcd->pcd_write_pmc      = e500_write_pmc;
 
 	pmc_mdep->pmd_npmc   += E500_MAX_PMCS;
 	pmc_mdep->pmd_intr   =  e500_intr;
 
 	return (0);
 }
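Review note on e500_allocate_pmc() above: the function starts with the user and supervisor freeze bits set, then clears each one according to the requested capabilities. A request with neither PMC_CAP_USER nor PMC_CAP_SYSTEM counts in both modes. The following is a minimal standalone sketch of that logic only. The bit values and the names FREEZE_SUPERVISOR, FREEZE_USER, CAP_USER, CAP_SYSTEM and freeze_bits_sketch() are hypothetical placeholders, not the real PMLCax_* or PMC_CAP_* definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for PMLCax_FCS / PMLCax_FCU. */
#define FREEZE_SUPERVISOR	0x1u
#define FREEZE_USER		0x2u

/* Hypothetical stand-ins for PMC_CAP_USER / PMC_CAP_SYSTEM. */
#define CAP_USER		0x10u
#define CAP_SYSTEM		0x20u

/*
 * Start with both modes frozen, then unfreeze the modes the caller
 * asked for.  Asking for neither mode unfreezes both.
 */
uint32_t
freeze_bits_sketch(uint32_t caps)
{
	uint32_t config;

	config = FREEZE_SUPERVISOR | FREEZE_USER;
	if (caps & CAP_SYSTEM)
		config &= ~FREEZE_SUPERVISOR;
	if (caps & CAP_USER)
		config &= ~FREEZE_USER;
	if ((caps & (CAP_USER | CAP_SYSTEM)) == 0)
		config &= ~(FREEZE_SUPERVISOR | FREEZE_USER);
	return (config);
}

/* Small self-check that exercises each branch of the sketch. */
void
freeze_bits_selfcheck(void)
{
	assert(freeze_bits_sketch(CAP_USER) == FREEZE_SUPERVISOR);
	assert(freeze_bits_sketch(CAP_SYSTEM) == FREEZE_USER);
	assert(freeze_bits_sketch(CAP_USER | CAP_SYSTEM) == 0);
	assert(freeze_bits_sketch(0) == 0);
}
```

Calling freeze_bits_selfcheck() from a test harness checks all four capability combinations. The real function additionally ORs in the event selector and validates the counter mask, which this sketch leaves out.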
Index: projects/release-arm-redux/sys/dev/hwpmc/hwpmc_mips74k.c
===================================================================
--- projects/release-arm-redux/sys/dev/hwpmc/hwpmc_mips74k.c	(revision 282691)
+++ projects/release-arm-redux/sys/dev/hwpmc/hwpmc_mips74k.c	(revision 282692)
@@ -1,261 +1,261 @@
 /*-
  * Copyright (c) 2010 George V. Neville-Neil <gnn@freebsd.org>
  * Copyright (c) 2015 Adrian Chadd <adrian@freebsd.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/pmc.h>
 #include <sys/pmckern.h>
 
 #include <machine/cpu.h>
 #include <machine/cpufunc.h>
 #include <machine/pmc_mdep.h>
 
 #define	MIPS74K_PMC_CAPS	(PMC_CAP_INTERRUPT | PMC_CAP_USER |     \
 				 PMC_CAP_SYSTEM | PMC_CAP_EDGE |	\
 				 PMC_CAP_THRESHOLD | PMC_CAP_READ |	\
 				 PMC_CAP_WRITE | PMC_CAP_INVERT |	\
 				 PMC_CAP_QUALIFIER)
 
 /* 0x1 - Exception_enable */
 #define MIPS74K_PMC_INTERRUPT_ENABLE      0x10 /* Enable interrupts */
 #define MIPS74K_PMC_USER_ENABLE           0x08 /* Count in USER mode */
 #define MIPS74K_PMC_SUPER_ENABLE          0x04 /* Count in SUPERVISOR mode */
 #define MIPS74K_PMC_KERNEL_ENABLE         0x02 /* Count in KERNEL mode */
 #define MIPS74K_PMC_ENABLE (MIPS74K_PMC_USER_ENABLE |	   \
 			    MIPS74K_PMC_SUPER_ENABLE |	   \
 			    MIPS74K_PMC_KERNEL_ENABLE)
 
 #define MIPS74K_PMC_SELECT 5 /* Which bit position the event starts at. */
 
 const struct mips_event_code_map mips_event_codes[] = {
 	{ PMC_EV_MIPS74K_CYCLES, MIPS_CTR_ALL, 0 },
 	{ PMC_EV_MIPS74K_INSTR_EXECUTED, MIPS_CTR_ALL, 1 },
 	{ PMC_EV_MIPS74K_PREDICTED_JR_31, MIPS_CTR_0, 2 },
 	{ PMC_EV_MIPS74K_JR_31_MISPREDICTIONS, MIPS_CTR_1, 2 },
 	{ PMC_EV_MIPS74K_REDIRECT_STALLS, MIPS_CTR_0, 3 },
 	{ PMC_EV_MIPS74K_JR_31_NO_PREDICTIONS, MIPS_CTR_1, 3 },
 	{ PMC_EV_MIPS74K_ITLB_ACCESSES, MIPS_CTR_0, 4 },
 	{ PMC_EV_MIPS74K_ITLB_MISSES, MIPS_CTR_1, 4 },
 	{ PMC_EV_MIPS74K_JTLB_INSN_MISSES, MIPS_CTR_1, 5 },
 	{ PMC_EV_MIPS74K_ICACHE_ACCESSES, MIPS_CTR_0, 6 },
 	{ PMC_EV_MIPS74K_ICACHE_MISSES, MIPS_CTR_1, 6 },
 	{ PMC_EV_MIPS74K_ICACHE_MISS_STALLS, MIPS_CTR_0, 7 },
 	{ PMC_EV_MIPS74K_UNCACHED_IFETCH_STALLS, MIPS_CTR_0, 8 },
 	{ PMC_EV_MIPS74K_PDTRACE_BACK_STALLS, MIPS_CTR_1, 8 },
 	{ PMC_EV_MIPS74K_IFU_REPLAYS, MIPS_CTR_0, 9 },
 	{ PMC_EV_MIPS74K_KILLED_FETCH_SLOTS, MIPS_CTR_1, 9 },
 	{ PMC_EV_MIPS74K_IFU_IDU_MISS_PRED_UPSTREAM_CYCLES, MIPS_CTR_0, 11 },
 	{ PMC_EV_MIPS74K_IFU_IDU_NO_FETCH_CYCLES, MIPS_CTR_1, 11 },
 	{ PMC_EV_MIPS74K_IFU_IDU_CLOGED_DOWNSTREAM_CYCLES, MIPS_CTR_0, 12 },
 	{ PMC_EV_MIPS74K_DDQ0_FULL_DR_STALLS, MIPS_CTR_0, 13 },
 	{ PMC_EV_MIPS74K_DDQ1_FULL_DR_STALLS, MIPS_CTR_1, 13 },
 	{ PMC_EV_MIPS74K_ALCB_FULL_DR_STALLS, MIPS_CTR_0, 14 },
 	{ PMC_EV_MIPS74K_AGCB_FULL_DR_STALLS, MIPS_CTR_1, 14 },
 	{ PMC_EV_MIPS74K_CLDQ_FULL_DR_STALLS, MIPS_CTR_0, 15 },
 	{ PMC_EV_MIPS74K_IODQ_FULL_DR_STALLS, MIPS_CTR_1, 15 },
 	{ PMC_EV_MIPS74K_ALU_EMPTY_CYCLES, MIPS_CTR_0, 16 },
 	{ PMC_EV_MIPS74K_AGEN_EMPTY_CYCLES, MIPS_CTR_1, 16 },
 	{ PMC_EV_MIPS74K_ALU_OPERANDS_NOT_READY_CYCLES, MIPS_CTR_0, 17 },
 	{ PMC_EV_MIPS74K_AGEN_OPERANDS_NOT_READY_CYCLES, MIPS_CTR_1, 17 },
 	{ PMC_EV_MIPS74K_ALU_NO_ISSUES_CYCLES, MIPS_CTR_0, 18 },
 	{ PMC_EV_MIPS74K_AGEN_NO_ISSUES_CYCLES, MIPS_CTR_1, 18 },
 	{ PMC_EV_MIPS74K_ALU_BUBBLE_CYCLES, MIPS_CTR_0, 19 },
 	{ PMC_EV_MIPS74K_AGEN_BUBBLE_CYCLES, MIPS_CTR_1, 19 },
 	{ PMC_EV_MIPS74K_SINGLE_ISSUE_CYCLES, MIPS_CTR_0, 20 },
 	{ PMC_EV_MIPS74K_DUAL_ISSUE_CYCLES, MIPS_CTR_1, 20 },
 	{ PMC_EV_MIPS74K_OOO_ALU_ISSUE_CYCLES, MIPS_CTR_0, 21 },
 	{ PMC_EV_MIPS74K_OOO_AGEN_ISSUE_CYCLES, MIPS_CTR_1, 21 },
 	{ PMC_EV_MIPS74K_JALR_JALR_HB_INSNS, MIPS_CTR_0, 22 },
 	{ PMC_EV_MIPS74K_DCACHE_LINE_REFILL_REQUESTS, MIPS_CTR_1, 22 },
 	{ PMC_EV_MIPS74K_DCACHE_LOAD_ACCESSES, MIPS_CTR_0, 23 },
 	{ PMC_EV_MIPS74K_DCACHE_ACCESSES, MIPS_CTR_1, 23 },
 	{ PMC_EV_MIPS74K_DCACHE_WRITEBACKS, MIPS_CTR_0, 24 },
 	{ PMC_EV_MIPS74K_DCACHE_MISSES, MIPS_CTR_1, 24 },
 	{ PMC_EV_MIPS74K_JTLB_DATA_ACCESSES, MIPS_CTR_0, 25 },
 	{ PMC_EV_MIPS74K_JTLB_DATA_MISSES, MIPS_CTR_1, 25 },
 	{ PMC_EV_MIPS74K_LOAD_STORE_REPLAYS, MIPS_CTR_0, 26 },
 	{ PMC_EV_MIPS74K_VA_TRANSALTION_CORNER_CASES, MIPS_CTR_1, 26 },
 	{ PMC_EV_MIPS74K_LOAD_STORE_BLOCKED_CYCLES, MIPS_CTR_0, 27 },
 	{ PMC_EV_MIPS74K_LOAD_STORE_NO_FILL_REQUESTS, MIPS_CTR_1, 27 },
 	{ PMC_EV_MIPS74K_L2_CACHE_WRITEBACKS, MIPS_CTR_0, 28 },
 	{ PMC_EV_MIPS74K_L2_CACHE_ACCESSES, MIPS_CTR_1, 28 },
 	{ PMC_EV_MIPS74K_L2_CACHE_MISSES, MIPS_CTR_0, 29 },
 	{ PMC_EV_MIPS74K_L2_CACHE_MISS_CYCLES, MIPS_CTR_1, 29 },
 	{ PMC_EV_MIPS74K_FSB_FULL_STALLS, MIPS_CTR_0, 30 },
 	{ PMC_EV_MIPS74K_FSB_OVER_50_FULL, MIPS_CTR_1, 30 },
 	{ PMC_EV_MIPS74K_LDQ_FULL_STALLS, MIPS_CTR_0, 31 },
 	{ PMC_EV_MIPS74K_LDQ_OVER_50_FULL, MIPS_CTR_1, 31 },
 	{ PMC_EV_MIPS74K_WBB_FULL_STALLS, MIPS_CTR_0, 32 },
 	{ PMC_EV_MIPS74K_WBB_OVER_50_FULL, MIPS_CTR_1, 32 },
 	{ PMC_EV_MIPS74K_LOAD_MISS_CONSUMER_REPLAYS, MIPS_CTR_0, 35 },
 	{ PMC_EV_MIPS74K_CP1_CP2_LOAD_INSNS, MIPS_CTR_1, 35 },
 	{ PMC_EV_MIPS74K_JR_NON_31_INSNS, MIPS_CTR_0, 36 },
 	{ PMC_EV_MIPS74K_MISPREDICTED_JR_31_INSNS, MIPS_CTR_1, 36 },
 	{ PMC_EV_MIPS74K_BRANCH_INSNS, MIPS_CTR_0, 37 },
 	{ PMC_EV_MIPS74K_CP1_CP2_COND_BRANCH_INSNS, MIPS_CTR_1, 37 },
 	{ PMC_EV_MIPS74K_BRANCH_LIKELY_INSNS, MIPS_CTR_0, 38 },
 	{ PMC_EV_MIPS74K_MISPREDICTED_BRANCH_LIKELY_INSNS, MIPS_CTR_1, 38 },
 	{ PMC_EV_MIPS74K_COND_BRANCH_INSNS, MIPS_CTR_0, 39 },
 	{ PMC_EV_MIPS74K_MISPREDICTED_BRANCH_INSNS, MIPS_CTR_1, 39 },
 	{ PMC_EV_MIPS74K_INTEGER_INSNS, MIPS_CTR_0, 40 },
 	{ PMC_EV_MIPS74K_FPU_INSNS, MIPS_CTR_1, 40 },
 	{ PMC_EV_MIPS74K_LOAD_INSNS, MIPS_CTR_0, 41 },
 	{ PMC_EV_MIPS74K_STORE_INSNS, MIPS_CTR_1, 41 },
 	{ PMC_EV_MIPS74K_J_JAL_INSNS, MIPS_CTR_0, 42 },
 	{ PMC_EV_MIPS74K_MIPS16_INSNS, MIPS_CTR_1, 42 },
 	{ PMC_EV_MIPS74K_NOP_INSNS, MIPS_CTR_0, 43 },
 	{ PMC_EV_MIPS74K_NT_MUL_DIV_INSNS, MIPS_CTR_1, 43 },
 	{ PMC_EV_MIPS74K_DSP_INSNS, MIPS_CTR_0, 44 },
 	{ PMC_EV_MIPS74K_ALU_DSP_SATURATION_INSNS, MIPS_CTR_1, 44 },
 	{ PMC_EV_MIPS74K_DSP_BRANCH_INSNS, MIPS_CTR_0, 45 },
 	{ PMC_EV_MIPS74K_MDU_DSP_SATURATION_INSNS, MIPS_CTR_1, 45 },
 	{ PMC_EV_MIPS74K_UNCACHED_LOAD_INSNS, MIPS_CTR_0, 46 },
 	{ PMC_EV_MIPS74K_UNCACHED_STORE_INSNS, MIPS_CTR_1, 46 },
 	{ PMC_EV_MIPS74K_EJTAG_INSN_TRIGGERS, MIPS_CTR_0, 49 },
 	{ PMC_EV_MIPS74K_CP1_BRANCH_MISPREDICTIONS, MIPS_CTR_0, 50 },
 	{ PMC_EV_MIPS74K_SC_INSNS, MIPS_CTR_0, 51 },
 	{ PMC_EV_MIPS74K_FAILED_SC_INSNS, MIPS_CTR_1, 51 },
 	{ PMC_EV_MIPS74K_PREFETCH_INSNS, MIPS_CTR_0, 52 },
 	{ PMC_EV_MIPS74K_CACHE_HIT_PREFETCH_INSNS, MIPS_CTR_1, 52 },
 	{ PMC_EV_MIPS74K_NO_INSN_CYCLES, MIPS_CTR_0, 53 },
 	{ PMC_EV_MIPS74K_LOAD_MISS_INSNS, MIPS_CTR_1, 53 },
 	{ PMC_EV_MIPS74K_ONE_INSN_CYCLES, MIPS_CTR_0, 54 },
 	{ PMC_EV_MIPS74K_TWO_INSNS_CYCLES, MIPS_CTR_1, 54 },
 	{ PMC_EV_MIPS74K_GFIFO_BLOCKED_CYCLES, MIPS_CTR_0, 55 },
 	{ PMC_EV_MIPS74K_CP1_CP2_STORE_INSNS, MIPS_CTR_1, 55 },
 	{ PMC_EV_MIPS74K_MISPREDICTION_STALLS, MIPS_CTR_0, 56 },
 	{ PMC_EV_MIPS74K_MISPREDICTED_BRANCH_INSNS_CYCLES, MIPS_CTR_0, 57 },
 	{ PMC_EV_MIPS74K_EXCEPTIONS_TAKEN, MIPS_CTR_0, 58 },
 	{ PMC_EV_MIPS74K_GRADUATION_REPLAYS, MIPS_CTR_1, 58 },
 	{ PMC_EV_MIPS74K_COREEXTEND_EVENTS, MIPS_CTR_0, 59 },
 	{ PMC_EV_MIPS74K_ISPRAM_EVENTS, MIPS_CTR_0, 62 },
 	{ PMC_EV_MIPS74K_DSPRAM_EVENTS, MIPS_CTR_1, 62 },
 	{ PMC_EV_MIPS74K_L2_CACHE_SINGLE_BIT_ERRORS, MIPS_CTR_0, 63 },
 	{ PMC_EV_MIPS74K_SYSTEM_EVENT_0, MIPS_CTR_0, 64 },
 	{ PMC_EV_MIPS74K_SYSTEM_EVENT_1, MIPS_CTR_1, 64 },
 	{ PMC_EV_MIPS74K_SYSTEM_EVENT_2, MIPS_CTR_0, 65 },
 	{ PMC_EV_MIPS74K_SYSTEM_EVENT_3, MIPS_CTR_1, 65 },
 	{ PMC_EV_MIPS74K_SYSTEM_EVENT_4, MIPS_CTR_0, 66 },
 	{ PMC_EV_MIPS74K_SYSTEM_EVENT_5, MIPS_CTR_1, 66 },
 	{ PMC_EV_MIPS74K_SYSTEM_EVENT_6, MIPS_CTR_0, 67 },
 	{ PMC_EV_MIPS74K_SYSTEM_EVENT_7, MIPS_CTR_1, 67 },
 	{ PMC_EV_MIPS74K_OCP_ALL_REQUESTS, MIPS_CTR_0, 68 },
 	{ PMC_EV_MIPS74K_OCP_ALL_CACHEABLE_REQUESTS, MIPS_CTR_1, 68 },
 	{ PMC_EV_MIPS74K_OCP_READ_REQUESTS, MIPS_CTR_0, 69 },
 	{ PMC_EV_MIPS74K_OCP_READ_CACHEABLE_REQUESTS, MIPS_CTR_1, 69 },
 	{ PMC_EV_MIPS74K_OCP_WRITE_REQUESTS, MIPS_CTR_0, 70 },
 	{ PMC_EV_MIPS74K_OCP_WRITE_CACHEABLE_REQUESTS, MIPS_CTR_1, 70 },
 	{ PMC_EV_MIPS74K_FSB_LESS_25_FULL, MIPS_CTR_0, 74 },
 	{ PMC_EV_MIPS74K_FSB_25_50_FULL, MIPS_CTR_1, 74 },
 	{ PMC_EV_MIPS74K_LDQ_LESS_25_FULL, MIPS_CTR_0, 75 },
 	{ PMC_EV_MIPS74K_LDQ_25_50_FULL, MIPS_CTR_1, 75 },
 	{ PMC_EV_MIPS74K_WBB_LESS_25_FULL, MIPS_CTR_0, 76 },
 	{ PMC_EV_MIPS74K_WBB_25_50_FULL, MIPS_CTR_1, 76 },
 };
 
 const int mips_event_codes_size =
 	sizeof(mips_event_codes) / sizeof(mips_event_codes[0]);
 
 struct mips_pmc_spec mips_pmc_spec = {
 	.ps_cpuclass = PMC_CLASS_MIPS74K,
 	.ps_cputype = PMC_CPU_MIPS_74K,
 	.ps_capabilities = MIPS74K_PMC_CAPS,
 	.ps_counter_width = 32
 };
 
 /*
  * Performance Count Register N
  */
 uint64_t
 mips_pmcn_read(unsigned int pmc)
 {
 	uint32_t reg = 0;
 
 	KASSERT(pmc < mips_npmcs, ("[mips74k,%d] illegal PMC number %d",
 				   __LINE__, pmc));
 
 	/* The counter value is the next value after the control register. */
 	switch (pmc) {
 	case 0:
 		reg = mips_rd_perfcnt1();
 		break;
 	case 1:
 		reg = mips_rd_perfcnt3();
 		break;
 	default:
 		return 0;
 	}
 	return (reg);
 }
 
 uint64_t
 mips_pmcn_write(unsigned int pmc, uint64_t reg)
 {
 
 	KASSERT(pmc < mips_npmcs, ("[mips74k,%d] illegal PMC number %d",
 				   __LINE__, pmc));
 
 	switch (pmc) {
 	case 0:
 		mips_wr_perfcnt1(reg);
 		break;
 	case 1:
 		mips_wr_perfcnt3(reg);
 		break;
 	default:
 		return 0;
 	}
 	return (reg);
 }
 
 uint32_t
 mips_get_perfctl(int cpu, int ri, uint32_t event, uint32_t caps)
 {
 	uint32_t config;
 
 	config = event;
 
 	config <<= MIPS74K_PMC_SELECT;
 
 	if (caps & PMC_CAP_SYSTEM)
 		config |= (MIPS74K_PMC_SUPER_ENABLE |
 			   MIPS74K_PMC_KERNEL_ENABLE);
 	if (caps & PMC_CAP_USER)
 		config |= MIPS74K_PMC_USER_ENABLE;
 	if ((caps & (PMC_CAP_USER | PMC_CAP_SYSTEM)) == 0)
 		config |= MIPS74K_PMC_ENABLE;
 	if (caps & PMC_CAP_INTERRUPT)
 		config |= MIPS74K_PMC_INTERRUPT_ENABLE;
 
-	PMCDBG(MDP,ALL,2,"mips74k-get_perfctl ri=%d -> config=0x%x", ri, config);
+	PMCDBG2(MDP,ALL,2,"mips74k-get_perfctl ri=%d -> config=0x%x", ri, config);
 
 	return (config);
 }
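Review note on mips_get_perfctl() above: the control word is the event number shifted left by MIPS74K_PMC_SELECT, ORed with the mode-enable and interrupt-enable bits. The sketch below reproduces that composition as a standalone function. The MIPS74K_* values are copied from the defines in this file. The CAP_* values and the name perfctl_sketch() are hypothetical placeholders and do not match the PMC_CAP_* definitions in sys/pmc.h.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the MIPS74K_PMC_* defines in hwpmc_mips74k.c. */
#define INTERRUPT_ENABLE	0x10u
#define USER_ENABLE		0x08u
#define SUPER_ENABLE		0x04u
#define KERNEL_ENABLE		0x02u
#define ALL_MODES_ENABLE	(USER_ENABLE | SUPER_ENABLE | KERNEL_ENABLE)
#define EVENT_SHIFT		5

/* Hypothetical capability flags, for illustration only. */
#define CAP_INTERRUPT		0x1u
#define CAP_USER		0x2u
#define CAP_SYSTEM		0x4u

/* Build the control word the same way mips_get_perfctl() does. */
uint32_t
perfctl_sketch(uint32_t event, uint32_t caps)
{
	uint32_t config;

	config = event << EVENT_SHIFT;
	if (caps & CAP_SYSTEM)
		config |= SUPER_ENABLE | KERNEL_ENABLE;
	if (caps & CAP_USER)
		config |= USER_ENABLE;
	/* No mode requested: count in every mode. */
	if ((caps & (CAP_USER | CAP_SYSTEM)) == 0)
		config |= ALL_MODES_ENABLE;
	if (caps & CAP_INTERRUPT)
		config |= INTERRUPT_ENABLE;
	return (config);
}

/* Small self-check covering each capability branch. */
void
perfctl_selfcheck(void)
{
	assert(perfctl_sketch(2, CAP_USER) == 0x48);
	assert(perfctl_sketch(1, 0) == 0x2e);
	assert(perfctl_sketch(0, CAP_SYSTEM | CAP_INTERRUPT) == 0x16);
}
```

For example, event 2 with only user-mode counting gives (2 << 5) | 0x08 = 0x48, which is the kind of value the PMCDBG2 trace in the real function would print as config.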
Index: projects/release-arm-redux/sys/dev/iicbus/iicbus.c
===================================================================
--- projects/release-arm-redux/sys/dev/iicbus/iicbus.c	(revision 282691)
+++ projects/release-arm-redux/sys/dev/iicbus/iicbus.c	(revision 282692)
@@ -1,324 +1,403 @@
 /*-
  * Copyright (c) 1998, 2001 Nicolas Souchu
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 /*
  * Autoconfiguration and support routines for the Philips serial I2C bus
  */
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/module.h>
 #include <sys/mutex.h>
+#include <sys/rman.h>
 #include <sys/sysctl.h>
 #include <sys/bus.h> 
 
 #include <dev/iicbus/iiconf.h>
 #include <dev/iicbus/iicbus.h>
 
 #include "iicbus_if.h"
 
 /* See comments below for why auto-scanning is a bad idea. */
 #define SCAN_IICBUS 0
 
 static int
 iicbus_probe(device_t dev)
 {
 
 	device_set_desc(dev, "Philips I2C bus");
 
 	/* Allow other subclasses to override this driver. */
 	return (BUS_PROBE_GENERIC);
 }
 
 #if SCAN_IICBUS
 static int 
 iic_probe_device(device_t dev, u_char addr)
 {
 	int count;
 	char byte;
 
 	if ((addr & 1) == 0) {
 		/* is device writable? */
 		if (!iicbus_start(dev, (u_char)addr, 0)) {
 			iicbus_stop(dev);
 			return (1);
 		}
 	} else {
 		/* is device readable? */
 		if (!iicbus_block_read(dev, (u_char)addr, &byte, 1, &count))
 			return (1);
 	}
 
 	return (0);
 }
 #endif
 
 /*
  * We add all the devices which we know about.
  * The generic attach routine will attach them if they are alive.
  */
 static int
 iicbus_attach(device_t dev)
 {
 #if SCAN_IICBUS
 	unsigned char addr;
 #endif
 	struct iicbus_softc *sc = IICBUS_SOFTC(dev);
 	int strict;
 
 	sc->dev = dev;
 	mtx_init(&sc->lock, "iicbus", NULL, MTX_DEF);
 	iicbus_init_frequency(dev, 0);
 	iicbus_reset(dev, IIC_FASTEST, 0, NULL);
 	if (resource_int_value(device_get_name(dev),
 		device_get_unit(dev), "strict", &strict) == 0)
 		sc->strict = strict;
 	else
 		sc->strict = 1;
 
 	/* device probing is meaningless since the bus is supposed to be
 	 * hot-plug. Moreover, some I2C chips do not appreciate random
 	 * accesses like stop after start too fast, reads for less than
 	 * x bytes...
 	 */
 #if SCAN_IICBUS
 	printf("Probing for devices on iicbus%d:", device_get_unit(dev));
 
 	/* probe any devices */
 	for (addr = 16; addr < 240; addr++) {
 		if (iic_probe_device(dev, (u_char)addr)) {
 			printf(" <%x>", addr);
 		}
 	}
 	printf("\n");
 #endif
 	bus_generic_probe(dev);
 	bus_enumerate_hinted_children(dev);
 	bus_generic_attach(dev);
         return (0);
 }
   
 static int
 iicbus_detach(device_t dev)
 {
 	struct iicbus_softc *sc = IICBUS_SOFTC(dev);
 
 	iicbus_reset(dev, IIC_FASTEST, 0, NULL);
 	bus_generic_detach(dev);
 	mtx_destroy(&sc->lock);
 	return (0);
 }
   
 static int
 iicbus_print_child(device_t dev, device_t child)
 {
 	struct iicbus_ivar *devi = IICBUS_IVAR(child);
 	int retval = 0;
 
 	retval += bus_print_child_header(dev, child);
 	if (devi->addr != 0)
 		retval += printf(" at addr %#x", devi->addr);
+	resource_list_print_type(&devi->rl, "irq", SYS_RES_IRQ, "%ld");
 	retval += bus_print_child_footer(dev, child);
 
 	return (retval);
 }
 
 static void
 iicbus_probe_nomatch(device_t bus, device_t child)
 {
 	struct iicbus_ivar *devi = IICBUS_IVAR(child);
 
-	device_printf(bus, "<unknown card>");
-	printf(" at addr %#x\n", devi->addr);
-	return;
+	device_printf(bus, "<unknown card> at addr %#x", devi->addr);
+	resource_list_print_type(&devi->rl, "irq", SYS_RES_IRQ, "%ld");
+	printf("\n");
 }
 
 static int
 iicbus_child_location_str(device_t bus, device_t child, char *buf,
     size_t buflen)
 {
 	struct iicbus_ivar *devi = IICBUS_IVAR(child);
 
 	snprintf(buf, buflen, "addr=%#x", devi->addr);
 	return (0);
 }
 
 static int
 iicbus_child_pnpinfo_str(device_t bus, device_t child, char *buf,
     size_t buflen)
 {
 	*buf = '\0';
 	return (0);
 }
 
 static int
 iicbus_read_ivar(device_t bus, device_t child, int which, uintptr_t *result)
 {
 	struct iicbus_ivar *devi = IICBUS_IVAR(child);
 
 	switch (which) {
 	default:
 		return (EINVAL);
 	case IICBUS_IVAR_ADDR:
 		*result = devi->addr;
 		break;
 	}
 	return (0);
 }
 
 static device_t
 iicbus_add_child(device_t dev, u_int order, const char *name, int unit)
 {
 	device_t child;
 	struct iicbus_ivar *devi;
 
 	child = device_add_child_ordered(dev, order, name, unit);
 	if (child == NULL)
 		return (child);
 	devi = malloc(sizeof(struct iicbus_ivar), M_DEVBUF, M_NOWAIT | M_ZERO);
 	if (devi == NULL) {
 		device_delete_child(dev, child);
 		return (0);
 	}
+	resource_list_init(&devi->rl);
 	device_set_ivars(child, devi);
 	return (child);
 }
 
 static void
 iicbus_hinted_child(device_t bus, const char *dname, int dunit)
 {
 	device_t child;
+	int irq;
 	struct iicbus_ivar *devi;
 
 	child = BUS_ADD_CHILD(bus, 0, dname, dunit);
 	devi = IICBUS_IVAR(child);
 	resource_int_value(dname, dunit, "addr", &devi->addr);
+	if (resource_int_value(dname, dunit, "irq", &irq) == 0) {
+		if (bus_set_resource(child, SYS_RES_IRQ, 0, irq, 1) != 0)
+			device_printf(bus,
+			    "warning: bus_set_resource() failed\n");
+	}
 }
 
+static int
+iicbus_set_resource(device_t dev, device_t child, int type, int rid,
+    u_long start, u_long count)
+{
+	struct iicbus_ivar *devi;
+	struct resource_list_entry *rle;
+
+	devi = IICBUS_IVAR(child);
+	rle = resource_list_add(&devi->rl, type, rid, start,
+	    start + count - 1, count);
+	if (rle == NULL)
+		return (ENXIO);
+
+	return (0);
+}
+
+static struct resource *
+iicbus_alloc_resource(device_t bus, device_t child, int type, int *rid,
+    u_long start, u_long end, u_long count, u_int flags)
+{
+	struct resource_list *rl;
+	struct resource_list_entry *rle;
+
+	/* Only IRQ resources are supported. */
+	if (type != SYS_RES_IRQ)
+		return (NULL);
+
+	/*
+	 * Request for the default allocation with a given rid: use resource
+	 * list stored in the local device info.
+	 */
+	if ((start == 0UL) && (end == ~0UL)) {
+		rl = BUS_GET_RESOURCE_LIST(bus, child);
+		if (rl == NULL)
+			return (NULL);
+		rle = resource_list_find(rl, type, *rid);
+		if (rle == NULL) {
+			if (bootverbose)
+				device_printf(bus, "no default resources for "
+				    "rid = %d, type = %d\n", *rid, type);
+			return (NULL);
+		}
+		start = rle->start;
+		end = rle->end;
+		count = rle->count;
+	}
+
+	return (bus_generic_alloc_resource(bus, child, type, rid, start, end,
+	    count, flags));
+}
+
+static struct resource_list *
+iicbus_get_resource_list(device_t bus __unused, device_t child)
+{
+	struct iicbus_ivar *devi;
+
+	devi = IICBUS_IVAR(child);
+	return (&devi->rl);
+}
+
 int
 iicbus_generic_intr(device_t dev, int event, char *buf)
 {
 
 	return (0);
 }
 
 int
 iicbus_null_callback(device_t dev, int index, caddr_t data)
 {
 
 	return (0);
 }
 
 int
 iicbus_null_repeated_start(device_t dev, u_char addr)
 {
 
 	return (IIC_ENOTSUPP);
 }
 
 void
 iicbus_init_frequency(device_t dev, u_int bus_freq)
 {
 	struct iicbus_softc *sc = IICBUS_SOFTC(dev);
 
 	/*
 	 * If a bus frequency value was passed in, use it.  Otherwise initialize
 	 * it first to the standard i2c 100KHz frequency, then override that
 	 * from a hint if one exists.
 	 */
 	if (bus_freq > 0)
 		sc->bus_freq = bus_freq;
 	else {
 		sc->bus_freq = 100000;
 		resource_int_value(device_get_name(dev), device_get_unit(dev),
 		    "frequency", (int *)&sc->bus_freq);
 	}
 	/*
 	 * Set up the sysctl that allows the bus frequency to be changed.
 	 * It is flagged as a tunable so that the user can set the value in
 	 * loader(8), and that will override any other setting from any source.
 	 * The sysctl tunable/value is the one most directly controlled by the
 	 * user and thus the one that always takes precedence.
 	 */
 	SYSCTL_ADD_UINT(device_get_sysctl_ctx(dev),
 	    SYSCTL_CHILDREN(device_get_sysctl_tree(dev)),
 	    OID_AUTO, "frequency", CTLFLAG_RW | CTLFLAG_TUN, &sc->bus_freq,
 	    sc->bus_freq, "Bus frequency in Hz");
 }
 
 static u_int
 iicbus_get_frequency(device_t dev, u_char speed)
 {
 	struct iicbus_softc *sc = IICBUS_SOFTC(dev);
 
 	/*
 	 * If the frequency has not been configured for the bus, or the request
 	 * is specifically for SLOW speed, use the standard 100KHz rate, else
 	 * use the configured bus speed.
 	 */
 	if (sc->bus_freq == 0 || speed == IIC_SLOW)
 		return (100000);
 	return (sc->bus_freq);
 }
 
 static device_method_t iicbus_methods[] = {
 	/* device interface */
 	DEVMETHOD(device_probe,		iicbus_probe),
 	DEVMETHOD(device_attach,	iicbus_attach),
 	DEVMETHOD(device_detach,	iicbus_detach),
 
 	/* bus interface */
+	DEVMETHOD(bus_setup_intr,	bus_generic_setup_intr),
+	DEVMETHOD(bus_teardown_intr,	bus_generic_teardown_intr),
+	DEVMETHOD(bus_release_resource, bus_generic_release_resource),
+	DEVMETHOD(bus_activate_resource, bus_generic_activate_resource),
+	DEVMETHOD(bus_deactivate_resource, bus_generic_deactivate_resource),
+	DEVMETHOD(bus_adjust_resource,	bus_generic_adjust_resource),
+	DEVMETHOD(bus_get_resource,	bus_generic_rl_get_resource),
+	DEVMETHOD(bus_alloc_resource,	iicbus_alloc_resource),
+	DEVMETHOD(bus_get_resource_list, iicbus_get_resource_list),
+	DEVMETHOD(bus_set_resource,	iicbus_set_resource),
 	DEVMETHOD(bus_add_child,	iicbus_add_child),
 	DEVMETHOD(bus_print_child,	iicbus_print_child),
 	DEVMETHOD(bus_probe_nomatch,	iicbus_probe_nomatch),
 	DEVMETHOD(bus_read_ivar,	iicbus_read_ivar),
 	DEVMETHOD(bus_child_pnpinfo_str, iicbus_child_pnpinfo_str),
 	DEVMETHOD(bus_child_location_str, iicbus_child_location_str),
 	DEVMETHOD(bus_hinted_child,	iicbus_hinted_child),
 
 	/* iicbus interface */
 	DEVMETHOD(iicbus_transfer,	iicbus_transfer),
 	DEVMETHOD(iicbus_get_frequency,	iicbus_get_frequency),
 
 	DEVMETHOD_END
 };
 
 driver_t iicbus_driver = {
         "iicbus",
         iicbus_methods,
         sizeof(struct iicbus_softc),
 };
 
 devclass_t iicbus_devclass;
 
 MODULE_VERSION(iicbus, IICBUS_MODVER);
 DRIVER_MODULE(iicbus, iichb, iicbus_driver, iicbus_devclass, 0, 0);
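Reviewer note on iicbus_alloc_resource(): the default-range path is easier to follow in isolation. The sketch below is a user-space model, not kernel code. The names rl_entry, rl_find, resolve_range and RES_IRQ are hypothetical stand-ins for struct resource_list_entry, resource_list_find(), the body of iicbus_alloc_resource() and SYS_RES_IRQ. It assumes, as the change above does, that a request with start == 0 and end == ~0UL means "use whatever was recorded for this rid".

```c
#include <stddef.h>

/* Hypothetical stand-in for struct resource_list_entry; not the kernel type. */
struct rl_entry {
	int type;
	int rid;
	unsigned long start;
	unsigned long end;
	unsigned long count;
};

#define	RES_IRQ	1	/* placeholder for SYS_RES_IRQ; the value is arbitrary */

/* Linear search standing in for resource_list_find(). */
static const struct rl_entry *
rl_find(const struct rl_entry *rl, size_t n, int type, int rid)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (rl[i].type == type && rl[i].rid == rid)
			return (&rl[i]);
	return (NULL);
}

/*
 * Resolve the range a child asked for.  Returns 0 on success and -1 when
 * the request cannot be satisfied (wrong type, or a default request with
 * no recorded entry for the rid).
 */
static int
resolve_range(const struct rl_entry *rl, size_t n, int type, int rid,
    unsigned long *start, unsigned long *end, unsigned long *count)
{
	const struct rl_entry *rle;

	/* Only IRQ resources are modelled, as in the patch. */
	if (type != RES_IRQ)
		return (-1);

	/* Default request: substitute the values stored for this rid. */
	if (*start == 0UL && *end == ~0UL) {
		rle = rl_find(rl, n, type, rid);
		if (rle == NULL)
			return (-1);
		*start = rle->start;
		*end = rle->end;
		*count = rle->count;
	}
	/* An explicit range is passed through unchanged. */
	return (0);
}
```

With a single entry recorded the way iicbus_set_resource() would store a hinted IRQ 42 (start 42, end 42, count 1), a default request for rid 0 resolves to 42..42, a default request for rid 1 fails, and an explicit request keeps its own range.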
Index: projects/release-arm-redux/sys/dev/iicbus/iicbus.h
===================================================================
--- projects/release-arm-redux/sys/dev/iicbus/iicbus.h	(revision 282691)
+++ projects/release-arm-redux/sys/dev/iicbus/iicbus.h	(revision 282692)
@@ -1,77 +1,78 @@
 /*-
  * Copyright (c) 1998 Nicolas Souchu
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  *
  */
 #ifndef __IICBUS_H
 #define __IICBUS_H
 
 #include <sys/_lock.h>
 #include <sys/_mutex.h>
 
 #define IICBUS_IVAR(d) (struct iicbus_ivar *) device_get_ivars(d)
 #define IICBUS_SOFTC(d) (struct iicbus_softc *) device_get_softc(d)
 
 struct iicbus_softc
 {
 	device_t dev;		/* Myself */
 	device_t owner;		/* iicbus owner device structure */
 	u_char started;		/* address of the 'started' slave
 				 * 0 if no start condition succeeded */
 	u_char strict;		/* deny operations that violate the
 				 * I2C protocol */
 	struct mtx lock;
 	u_int bus_freq;		/* Configured bus Hz. */
 };
 
 struct iicbus_ivar
 {
 	uint32_t	addr;
+	struct resource_list	rl;
 	bool		nostop;
 };
 
 enum {
 	IICBUS_IVAR_ADDR,		/* Address or base address */
 	IICBUS_IVAR_NOSTOP,		/* nostop defaults */
 };
 
 #define IICBUS_ACCESSOR(A, B, T)					\
 	__BUS_ACCESSOR(iicbus, A, IICBUS, B, T)
 	
 IICBUS_ACCESSOR(addr,		ADDR,		uint32_t)
 IICBUS_ACCESSOR(nostop,		NOSTOP,		bool)
 
 #define	IICBUS_LOCK(sc)			mtx_lock(&(sc)->lock)
 #define	IICBUS_UNLOCK(sc)      		mtx_unlock(&(sc)->lock)
 #define	IICBUS_ASSERT_LOCKED(sc)       	mtx_assert(&(sc)->lock, MA_OWNED)
 
 int  iicbus_generic_intr(device_t dev, int event, char *buf);
 void iicbus_init_frequency(device_t dev, u_int bus_freq);
 
 extern driver_t iicbus_driver;
 extern devclass_t iicbus_devclass;
 
 #endif
Index: projects/release-arm-redux/sys/dev/ofw/ofw_iicbus.c
===================================================================
--- projects/release-arm-redux/sys/dev/ofw/ofw_iicbus.c	(revision 282691)
+++ projects/release-arm-redux/sys/dev/ofw/ofw_iicbus.c	(revision 282692)
@@ -1,201 +1,216 @@
 /*-
  * Copyright (c) 2009, Nathan Whitehorn <nwhitehorn@FreeBSD.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice unmodified, this list of conditions, and the following
  *    disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/bus.h>
 #include <sys/kernel.h>
 #include <sys/libkern.h>
 #include <sys/lock.h>
 #include <sys/module.h>
 #include <sys/mutex.h>
 
 #include <dev/iicbus/iicbus.h>
 #include <dev/iicbus/iiconf.h>
 #include <dev/ofw/ofw_bus.h>
 #include <dev/ofw/ofw_bus_subr.h>
 #include <dev/ofw/openfirm.h>
 
 #include "iicbus_if.h"
 
 /* Methods */
 static device_probe_t ofw_iicbus_probe;
 static device_attach_t ofw_iicbus_attach;
 static device_t ofw_iicbus_add_child(device_t dev, u_int order,
     const char *name, int unit);
 static const struct ofw_bus_devinfo *ofw_iicbus_get_devinfo(device_t bus,
     device_t dev);
+static struct resource_list *ofw_iicbus_get_resource_list(device_t bus,
+    device_t child);
 
 static device_method_t ofw_iicbus_methods[] = {
 	/* Device interface */
 	DEVMETHOD(device_probe,		ofw_iicbus_probe),
 	DEVMETHOD(device_attach,	ofw_iicbus_attach),
 
 	/* Bus interface */
+	DEVMETHOD(bus_get_resource_list, ofw_iicbus_get_resource_list),
 	DEVMETHOD(bus_child_pnpinfo_str, ofw_bus_gen_child_pnpinfo_str),
 	DEVMETHOD(bus_add_child,	ofw_iicbus_add_child),
 
 	/* ofw_bus interface */
 	DEVMETHOD(ofw_bus_get_devinfo,	ofw_iicbus_get_devinfo),
 	DEVMETHOD(ofw_bus_get_compat,	ofw_bus_gen_get_compat),
 	DEVMETHOD(ofw_bus_get_model,	ofw_bus_gen_get_model),
 	DEVMETHOD(ofw_bus_get_name,	ofw_bus_gen_get_name),
 	DEVMETHOD(ofw_bus_get_node,	ofw_bus_gen_get_node),
 	DEVMETHOD(ofw_bus_get_type,	ofw_bus_gen_get_type),
 
 	DEVMETHOD_END
 };
 
 struct ofw_iicbus_devinfo {
-	struct iicbus_ivar	opd_dinfo;
+	struct iicbus_ivar	opd_dinfo;	/* Must be the first. */
 	struct ofw_bus_devinfo	opd_obdinfo;
 };
 
 static devclass_t ofwiicbus_devclass;
 
 DEFINE_CLASS_1(iicbus, ofw_iicbus_driver, ofw_iicbus_methods,
     sizeof(struct iicbus_softc), iicbus_driver);
 DRIVER_MODULE(ofw_iicbus, iicbb, ofw_iicbus_driver, ofwiicbus_devclass, 0, 0);
 DRIVER_MODULE(ofw_iicbus, iichb, ofw_iicbus_driver, ofwiicbus_devclass, 0, 0);
 MODULE_VERSION(ofw_iicbus, 1);
 MODULE_DEPEND(ofw_iicbus, iicbus, 1, 1, 1);
 
 static int
 ofw_iicbus_probe(device_t dev)
 {
 
 	if (ofw_bus_get_node(dev) == -1)
 		return (ENXIO);
 	device_set_desc(dev, "OFW I2C bus");
 
 	return (0);
 }
 
 static int
 ofw_iicbus_attach(device_t dev)
 {
 	struct iicbus_softc *sc = IICBUS_SOFTC(dev);
 	struct ofw_iicbus_devinfo *dinfo;
 	phandle_t child, node;
 	pcell_t freq, paddr;
 	device_t childdev;
 
 	sc->dev = dev;
 	mtx_init(&sc->lock, "iicbus", NULL, MTX_DEF);
 
 	/*
 	 * If there is a clock-frequency property for the device node, use it as
 	 * the starting value for the bus frequency.  Then call the common
 	 * routine that handles the tunable/sysctl which allows the FDT value to
 	 * be overridden by the user.
 	 */
 	node = ofw_bus_get_node(dev);
 	freq = 0;
 	OF_getencprop(node, "clock-frequency", &freq, sizeof(freq));
 	iicbus_init_frequency(dev, freq);
 	
 	iicbus_reset(dev, IIC_FASTEST, 0, NULL);
 
 	bus_generic_probe(dev);
 	bus_enumerate_hinted_children(dev);
 
 	/*
 	 * Attach those children represented in the device tree.
 	 */
 	for (child = OF_child(node); child != 0; child = OF_peer(child)) {
 		/*
 		 * Try to get the I2C address first from the i2c-address
 		 * property, then try the reg property.  It moves around
 		 * on different systems.
 		 */
 		if (OF_getencprop(child, "i2c-address", &paddr,
 		    sizeof(paddr)) == -1)
 			if (OF_getencprop(child, "reg", &paddr,
 			    sizeof(paddr)) == -1)
 				continue;
 
 		/*
 		 * Now set up the I2C and OFW bus layer devinfo and add it
 		 * to the bus.
 		 */
 		dinfo = malloc(sizeof(struct ofw_iicbus_devinfo), M_DEVBUF,
 		    M_NOWAIT | M_ZERO);
 		if (dinfo == NULL)
 			continue;
 		dinfo->opd_dinfo.addr = paddr;
 		if (ofw_bus_gen_setup_devinfo(&dinfo->opd_obdinfo, child) !=
 		    0) {
 			free(dinfo, M_DEVBUF);
 			continue;
 		}
+
 		childdev = device_add_child(dev, NULL, -1);
+		resource_list_init(&dinfo->opd_dinfo.rl);
+		ofw_bus_intr_to_rl(childdev, child, &dinfo->opd_dinfo.rl);
 		device_set_ivars(childdev, dinfo);
 	}
 
 	return (bus_generic_attach(dev));
 }
 
 static device_t
 ofw_iicbus_add_child(device_t dev, u_int order, const char *name, int unit)
 {
 	device_t child;
 	struct ofw_iicbus_devinfo *devi;
 
 	child = device_add_child_ordered(dev, order, name, unit);
 	if (child == NULL)
 		return (child);
 	devi = malloc(sizeof(struct ofw_iicbus_devinfo), M_DEVBUF,
 	    M_NOWAIT | M_ZERO);
 	if (devi == NULL) {
 		device_delete_child(dev, child);
 		return (0);
 	}
 
 	/*
 	 * NULL all the OFW-related parts of the ivars for non-OFW
 	 * children.
 	 */
 	devi->opd_obdinfo.obd_node = -1;
 	devi->opd_obdinfo.obd_name = NULL;
 	devi->opd_obdinfo.obd_compat = NULL;
 	devi->opd_obdinfo.obd_type = NULL;
 	devi->opd_obdinfo.obd_model = NULL;
 
 	device_set_ivars(child, devi);
 
 	return (child);
 }
 
 static const struct ofw_bus_devinfo *
 ofw_iicbus_get_devinfo(device_t bus, device_t dev)
 {
 	struct ofw_iicbus_devinfo *dinfo;
 
 	dinfo = device_get_ivars(dev);
 	return (&dinfo->opd_obdinfo);
+}
+
+static struct resource_list *
+ofw_iicbus_get_resource_list(device_t bus __unused, device_t child)
+{
+	struct ofw_iicbus_devinfo *devi;
+
+	devi = device_get_ivars(child);
+	return (&devi->opd_dinfo.rl);
 }
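Reviewer note on the "Must be the first" comment: IICBUS_IVAR() in iicbus.h is a plain cast of the ivars pointer to struct iicbus_ivar *, so the base iicbus code only sees the right fields if the embedded struct iicbus_ivar sits at offset 0 of struct ofw_iicbus_devinfo. The following user-space sketch illustrates that layout rule. The types base_ivar, ofw_info and derived_ivar and the helpers as_base and addr_via_base are hypothetical stand-ins, not the kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

struct base_ivar {		/* stands in for struct iicbus_ivar */
	uint32_t addr;
};

struct ofw_info {		/* stands in for struct ofw_bus_devinfo */
	int node;
};

struct derived_ivar {		/* stands in for struct ofw_iicbus_devinfo */
	struct base_ivar base;	/* must stay the first member */
	struct ofw_info ofw;
};

/* Compile-time guard: fails the build if someone reorders the members. */
_Static_assert(offsetof(struct derived_ivar, base) == 0,
    "base ivar must be the first member");

/* Mirrors what IICBUS_IVAR() does: a plain cast of the ivars pointer. */
static struct base_ivar *
as_base(void *ivars)
{
	return ((struct base_ivar *)ivars);
}

/* Reads the address the way the parent bus code would. */
static uint32_t
addr_via_base(struct derived_ivar *d)
{
	return (as_base(d)->addr);
}
```

The static assertion is an addition for illustration only; the patch itself relies on the comment. If the members were swapped, as_base() would return a pointer into the OFW data and the address read would be wrong.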
Index: projects/release-arm-redux/sys/i386/acpica/acpi_machdep.c
===================================================================
--- projects/release-arm-redux/sys/i386/acpica/acpi_machdep.c	(revision 282691)
+++ projects/release-arm-redux/sys/i386/acpica/acpi_machdep.c	(revision 282692)
@@ -1,394 +1,387 @@
 /*-
  * Copyright (c) 2001 Mitsuru IWASAKI
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/bus.h>
 #include <sys/kernel.h>
 #include <sys/module.h>
 #include <sys/sysctl.h>
 
 #include <vm/vm.h>
 #include <vm/pmap.h>
 
 #include <contrib/dev/acpica/include/acpi.h>
 #include <contrib/dev/acpica/include/accommon.h>
 #include <contrib/dev/acpica/include/actables.h>
 
 #include <dev/acpica/acpivar.h>
 
 #include <machine/nexusvar.h>
 
 uint32_t acpi_resume_beep;
 SYSCTL_UINT(_debug_acpi, OID_AUTO, resume_beep, CTLFLAG_RWTUN, &acpi_resume_beep,
     0, "Beep the PC speaker when resuming");
 
 uint32_t acpi_reset_video;
 TUNABLE_INT("hw.acpi.reset_video", &acpi_reset_video);
 
 static int intr_model = ACPI_INTR_PIC;
 
 int
 acpi_machdep_init(device_t dev)
 {
 	struct acpi_softc *sc;
 
 	sc = device_get_softc(dev);
 
 	acpi_apm_init(sc);
 	acpi_install_wakeup_handler(sc);
 
 	if (intr_model == ACPI_INTR_PIC)
 		BUS_CONFIG_INTR(dev, AcpiGbl_FADT.SciInterrupt,
 		    INTR_TRIGGER_LEVEL, INTR_POLARITY_LOW);
 	else
 		acpi_SetIntrModel(intr_model);
 
 	SYSCTL_ADD_UINT(&sc->acpi_sysctl_ctx,
 	    SYSCTL_CHILDREN(sc->acpi_sysctl_tree), OID_AUTO,
 	    "reset_video", CTLFLAG_RW, &acpi_reset_video, 0,
 	    "Call the VESA reset BIOS vector on the resume path");
 
 	return (0);
 }
 
 void
 acpi_SetDefaultIntrModel(int model)
 {
 
 	intr_model = model;
 }
 
 /* Check BIOS date.  If 1998 or older, disable ACPI. */
 int
 acpi_machdep_quirks(int *quirks)
 {
 	char *va;
 	int year;
 
 	/* BIOS address 0xffff5 contains the date in the format mm/dd/yy. */
 	va = pmap_mapbios(0xffff0, 16);
 	sscanf(va + 11, "%2d", &year);
 	pmap_unmapbios((vm_offset_t)va, 16);
 
 	/* 
 	 * Date must be >= 1/1/1999 or we don't trust ACPI.  Note that this
 	 * check must be changed by my 114th birthday.
 	 */
 	if (year > 90 && year < 99)
 		*quirks = ACPI_Q_BROKEN;
 
 	return (0);
 }
 
-void
-acpi_cpu_c1()
-{
-
-	__asm __volatile("sti; hlt");
-}
-
 /*
  * Support for mapping ACPI tables during early boot.  This abuses the
  * crashdump map because the kernel cannot allocate KVA in
  * pmap_mapbios() when this is used.  This makes the following
  * assumptions about how we use this KVA: pages 0 and 1 are used to
  * map in the header of each table found via the RSDT or XSDT and
  * pages 2 to n are used to map in the RSDT or XSDT.  This has to use
  * 2 pages for the table headers in case a header spans a page
  * boundary.
  *
  * XXX: We don't ensure the table fits in the available address space
  * in the crashdump map.
  */
 
 /*
  * Map some memory using the crashdump map.  'offset' is an offset in
  * pages into the crashdump map to use for the start of the mapping.
  */
 static void *
 table_map(vm_paddr_t pa, int offset, vm_offset_t length)
 {
 	vm_offset_t va, off;
 	void *data;
 
 	off = pa & PAGE_MASK;
 	length = round_page(length + off);
 	pa = pa & PG_FRAME;
 	va = (vm_offset_t)pmap_kenter_temporary(pa, offset) +
 	    (offset * PAGE_SIZE);
 	data = (void *)(va + off);
 	length -= PAGE_SIZE;
 	while (length > 0) {
 		va += PAGE_SIZE;
 		pa += PAGE_SIZE;
 		length -= PAGE_SIZE;
 		pmap_kenter(va, pa);
 		invlpg(va);
 	}
 	return (data);
 }
 
 /* Unmap memory previously mapped with table_map(). */
 static void
 table_unmap(void *data, vm_offset_t length)
 {
 	vm_offset_t va, off;
 
 	va = (vm_offset_t)data;
 	off = va & PAGE_MASK;
 	length = round_page(length + off);
 	va &= ~PAGE_MASK;
 	while (length > 0) {
 		pmap_kremove(va);
 		invlpg(va);
 		va += PAGE_SIZE;
 		length -= PAGE_SIZE;
 	}
 }
 
 /*
  * Map a table at a given offset into the crashdump map.  It first
  * maps the header to determine the table length and then maps the
  * entire table.
  */
 static void *
 map_table(vm_paddr_t pa, int offset, const char *sig)
 {
 	ACPI_TABLE_HEADER *header;
 	vm_offset_t length;
 	void *table;
 
 	header = table_map(pa, offset, sizeof(ACPI_TABLE_HEADER));
 	if (strncmp(header->Signature, sig, ACPI_NAME_SIZE) != 0) {
 		table_unmap(header, sizeof(ACPI_TABLE_HEADER));
 		return (NULL);
 	}
 	length = header->Length;
 	table_unmap(header, sizeof(ACPI_TABLE_HEADER));
 	table = table_map(pa, offset, length);
 	if (ACPI_FAILURE(AcpiTbChecksum(table, length))) {
 		if (bootverbose)
 			printf("ACPI: Failed checksum for table %s\n", sig);
 #if (ACPI_CHECKSUM_ABORT)
 		table_unmap(table, length);
 		return (NULL);
 #endif
 	}
 	return (table);
 }
 
 /*
  * See if a given ACPI table is the requested table.  Returns the
  * length of the table if it matches or zero on failure.
  */
 static int
 probe_table(vm_paddr_t address, const char *sig)
 {
 	ACPI_TABLE_HEADER *table;
 
 	table = table_map(address, 0, sizeof(ACPI_TABLE_HEADER));
 	if (table == NULL) {
 		if (bootverbose)
 			printf("ACPI: Failed to map table at 0x%jx\n",
 			    (uintmax_t)address);
 		return (0);
 	}
 	if (bootverbose)
 		printf("Table '%.4s' at 0x%jx\n", table->Signature,
 		    (uintmax_t)address);
 
 	if (strncmp(table->Signature, sig, ACPI_NAME_SIZE) != 0) {
 		table_unmap(table, sizeof(ACPI_TABLE_HEADER));
 		return (0);
 	}
 	table_unmap(table, sizeof(ACPI_TABLE_HEADER));
 	return (1);
 }
 
 /*
  * Try to map a table at a given physical address previously returned
  * by acpi_find_table().
  */
 void *
 acpi_map_table(vm_paddr_t pa, const char *sig)
 {
 
 	return (map_table(pa, 0, sig));
 }
 
 /* Unmap a table previously mapped via acpi_map_table(). */
 void
 acpi_unmap_table(void *table)
 {
 	ACPI_TABLE_HEADER *header;
 
 	header = (ACPI_TABLE_HEADER *)table;
 	table_unmap(table, header->Length);
 }
 
 /*
  * Return the physical address of the requested table or zero if one
  * is not found.
  */
 vm_paddr_t
 acpi_find_table(const char *sig)
 {
 	ACPI_PHYSICAL_ADDRESS rsdp_ptr;
 	ACPI_TABLE_RSDP *rsdp;
 	ACPI_TABLE_RSDT *rsdt;
 	ACPI_TABLE_XSDT *xsdt;
 	ACPI_TABLE_HEADER *table;
 	vm_paddr_t addr;
 	int i, count;
 
 	if (resource_disabled("acpi", 0))
 		return (0);
 
 	/*
 	 * Map in the RSDP.  Since ACPI uses AcpiOsMapMemory() which in turn
 	 * calls pmap_mapbios() to find the RSDP, we assume that we can use
 	 * pmap_mapbios() to map the RSDP.
 	 */
 	if ((rsdp_ptr = AcpiOsGetRootPointer()) == 0)
 		return (0);
 	rsdp = pmap_mapbios(rsdp_ptr, sizeof(ACPI_TABLE_RSDP));
 	if (rsdp == NULL) {
 		if (bootverbose)
 			printf("ACPI: Failed to map RSDP\n");
 		return (0);
 	}
 
 	/*
 	 * For ACPI >= 2.0, use the XSDT if it is available.
 	 * Otherwise, use the RSDT.  We map the XSDT or RSDT at page 2
 	 * in the crashdump area.  Pages 0 and 1 are used to map in the
 	 * headers of candidate ACPI tables.
 	 */
 	addr = 0;
 	if (rsdp->Revision >= 2 && rsdp->XsdtPhysicalAddress != 0) {
 		/*
 		 * AcpiOsGetRootPointer only verifies the checksum for
 		 * the version 1.0 portion of the RSDP.  Version 2.0 has
 		 * an additional checksum that we verify first.
 		 */
 		if (AcpiTbChecksum((UINT8 *)rsdp, ACPI_RSDP_XCHECKSUM_LENGTH)) {
 			if (bootverbose)
 				printf("ACPI: RSDP failed extended checksum\n");
 			return (0);
 		}
 		xsdt = map_table(rsdp->XsdtPhysicalAddress, 2, ACPI_SIG_XSDT);
 		if (xsdt == NULL) {
 			if (bootverbose)
 				printf("ACPI: Failed to map XSDT\n");
 			return (0);
 		}
 		count = (xsdt->Header.Length - sizeof(ACPI_TABLE_HEADER)) /
 		    sizeof(UINT64);
 		for (i = 0; i < count; i++)
 			if (probe_table(xsdt->TableOffsetEntry[i], sig)) {
 				addr = xsdt->TableOffsetEntry[i];
 				break;
 			}
 		acpi_unmap_table(xsdt);
 	} else {
 		rsdt = map_table(rsdp->RsdtPhysicalAddress, 2, ACPI_SIG_RSDT);
 		if (rsdt == NULL) {
 			if (bootverbose)
 				printf("ACPI: Failed to map RSDT\n");
 			return (0);
 		}
 		count = (rsdt->Header.Length - sizeof(ACPI_TABLE_HEADER)) /
 		    sizeof(UINT32);
 		for (i = 0; i < count; i++)
 			if (probe_table(rsdt->TableOffsetEntry[i], sig)) {
 				addr = rsdt->TableOffsetEntry[i];
 				break;
 			}
 		acpi_unmap_table(rsdt);
 	}
 	pmap_unmapbios((vm_offset_t)rsdp, sizeof(ACPI_TABLE_RSDP));
 	if (addr == 0) {
 		if (bootverbose)
 			printf("ACPI: No %s table found\n", sig);
 		return (0);
 	}
 	if (bootverbose)
 		printf("%s: Found table at 0x%jx\n", sig, (uintmax_t)addr);
 
 	/*
 	 * Verify that we can map the full table and that its checksum is
 	 * correct, etc.
 	 */
 	table = map_table(addr, 0, sig);
 	if (table == NULL)
 		return (0);
 	acpi_unmap_table(table);
 
 	return (addr);
 }
 
 /*
  * ACPI nexus(4) driver.
  */
 static int
 nexus_acpi_probe(device_t dev)
 {
 	int error;
 
 	error = acpi_identify();
 	if (error)
 		return (error);
 
 	return (BUS_PROBE_DEFAULT);
 }
 
 static int
 nexus_acpi_attach(device_t dev)
 {
 
 	nexus_init_resources();
 	bus_generic_probe(dev);
 	if (BUS_ADD_CHILD(dev, 10, "acpi", 0) == NULL)
 		panic("failed to add acpi0 device");
 
 	return (bus_generic_attach(dev));
 }
 
 static device_method_t nexus_acpi_methods[] = {
 	/* Device interface */
 	DEVMETHOD(device_probe,		nexus_acpi_probe),
 	DEVMETHOD(device_attach,	nexus_acpi_attach),
 
 	{ 0, 0 }
 };
 
 DEFINE_CLASS_1(nexus, nexus_acpi_driver, nexus_acpi_methods, 1, nexus_driver);
 static devclass_t nexus_devclass;
 
 DRIVER_MODULE(nexus_acpi, root, nexus_acpi_driver, nexus_devclass, 0, 0);
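Reviewer note on table_map(): the comment in this file says two crashdump-map pages are reserved for table headers because a header may span a page boundary. The arithmetic behind that can be checked with a small user-space sketch. It assumes a 4096-byte page and uses the hypothetical names PG_SIZE, PG_MASK, round_pg and pages_needed in place of the kernel's PAGE_SIZE, PAGE_MASK and round_page().

```c
#include <stdint.h>

#define	PG_SIZE	4096u			/* assumed i386 page size */
#define	PG_MASK	(PG_SIZE - 1)

/* Round x up to the next multiple of the page size. */
static uint64_t
round_pg(uint64_t x)
{
	return ((x + PG_MASK) & ~(uint64_t)PG_MASK);
}

/*
 * Number of pages a mapping of 'length' bytes starting at physical
 * address 'pa' has to cover, following the same steps as table_map():
 * take the offset within the page, add it to the length, round up.
 */
static unsigned
pages_needed(uint64_t pa, uint64_t length)
{
	uint64_t off = pa & PG_MASK;

	return ((unsigned)(round_pg(length + off) / PG_SIZE));
}
```

For a 36-byte ACPI table header, a page-aligned address needs one page, while a header that starts 16 bytes before the end of a page needs two, which matches the two-page reservation described above.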
Index: projects/release-arm-redux/sys/i386/include/md_var.h
===================================================================
--- projects/release-arm-redux/sys/i386/include/md_var.h	(revision 282691)
+++ projects/release-arm-redux/sys/i386/include/md_var.h	(revision 282692)
@@ -1,133 +1,134 @@
 /*-
  * Copyright (c) 1995 Bruce D. Evans.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. Neither the name of the author nor the names of contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 #ifndef _MACHINE_MD_VAR_H_
 #define	_MACHINE_MD_VAR_H_
 
 /*
  * Miscellaneous machine-dependent declarations.
  */
 
 extern	long	Maxmem;
 extern	u_int	basemem;	/* PA of original top of base memory */
 extern	int	busdma_swi_pending;
 extern	u_int	cpu_exthigh;
 extern	u_int	cpu_feature;
 extern	u_int	cpu_feature2;
 extern	u_int	amd_feature;
 extern	u_int	amd_feature2;
 extern	u_int	amd_pminfo;
 extern	u_int	via_feature_rng;
 extern	u_int	via_feature_xcrypt;
 extern	u_int	cpu_clflush_line_size;
 extern	u_int	cpu_stdext_feature;
 extern	u_int	cpu_fxsr;
 extern	u_int	cpu_high;
 extern	u_int	cpu_id;
 extern	u_int	cpu_max_ext_state_size;
 extern	u_int	cpu_mxcsr_mask;
 extern	u_int	cpu_procinfo;
 extern	u_int	cpu_procinfo2;
 extern	char	cpu_vendor[];
 extern	u_int	cpu_vendor_id;
 extern	u_int	cpu_mon_mwait_flags;
 extern	u_int	cpu_mon_min_size;
 extern	u_int	cpu_mon_max_size;
 extern	u_int	cpu_maxphyaddr;
 extern	u_int	cyrix_did;
 #if defined(I586_CPU) && !defined(NO_F00F_HACK)
 extern	int	has_f00f_bug;
 #endif
 extern	u_int	hv_high;
 extern	char	hv_vendor[];
 extern	char	kstack[];
 extern	char	sigcode[];
 extern	int	szsigcode;
 #ifdef COMPAT_FREEBSD4
 extern	int	szfreebsd4_sigcode;
 #endif
 #ifdef COMPAT_43
 extern	int	szosigcode;
 #endif
 extern	uint32_t *vm_page_dump;
 extern	int	vm_page_dump_size;
 extern	int	workaround_erratum383;
 extern	int	_udatasel;
 extern	int	_ucodesel;
 extern	int	use_xsave;
 extern	uint64_t xsave_mask;
 
 typedef void alias_for_inthand_t(u_int cs, u_int ef, u_int esp, u_int ss);
 struct	pcb;
 union	savefpu;
 struct	thread;
 struct	reg;
 struct	fpreg;
 struct  dbreg;
 struct	dumperinfo;
 
 void	*alloc_fpusave(int flags);
 void	bcopyb(const void *from, void *to, size_t len);
 void	busdma_swi(void);
+bool	cpu_mwait_usable(void);
 void	cpu_probe_amdc1e(void);
 void	cpu_setregs(void);
 void	cpu_switch_load_gs(void) __asm(__STRING(cpu_switch_load_gs));
 void	doreti_iret(void) __asm(__STRING(doreti_iret));
 void	doreti_iret_fault(void) __asm(__STRING(doreti_iret_fault));
 void	doreti_popl_ds(void) __asm(__STRING(doreti_popl_ds));
 void	doreti_popl_ds_fault(void) __asm(__STRING(doreti_popl_ds_fault));
 void	doreti_popl_es(void) __asm(__STRING(doreti_popl_es));
 void	doreti_popl_es_fault(void) __asm(__STRING(doreti_popl_es_fault));
 void	doreti_popl_fs(void) __asm(__STRING(doreti_popl_fs));
 void	doreti_popl_fs_fault(void) __asm(__STRING(doreti_popl_fs_fault));
 void	dump_add_page(vm_paddr_t);
 void	dump_drop_page(vm_paddr_t);
 void	finishidentcpu(void);
 void	fillw(int /*u_short*/ pat, void *base, size_t cnt);
 void	initializecpu(void);
 void	initializecpucache(void);
 void	i686_pagezero(void *addr);
 void	sse2_pagezero(void *addr);
 void	init_AMD_Elan_sc520(void);
 int	is_physical_memory(vm_paddr_t addr);
 int	isa_nmi(int cd);
 vm_paddr_t kvtop(void *addr);
 void	panicifcpuunsupported(void);
 void	ppro_reenable_apic(void);
 void	printcpuinfo(void);
 void	setidt(int idx, alias_for_inthand_t *func, int typ, int dpl, int selec);
 int     user_dbreg_trap(void);
 int	minidumpsys(struct dumperinfo *);
 union savefpu *get_pcb_user_save_td(struct thread *td);
 union savefpu *get_pcb_user_save_pcb(struct pcb *pcb);
 struct pcb *get_pcb_td(struct thread *td);
 
 #endif /* !_MACHINE_MD_VAR_H_ */
Index: projects/release-arm-redux/sys/kern/kern_malloc.c
===================================================================
--- projects/release-arm-redux/sys/kern/kern_malloc.c	(revision 282691)
+++ projects/release-arm-redux/sys/kern/kern_malloc.c	(revision 282692)
@@ -1,1110 +1,1112 @@
 /*-
  * Copyright (c) 1987, 1991, 1993
  *	The Regents of the University of California.
  * Copyright (c) 2005-2009 Robert N. M. Watson
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	@(#)kern_malloc.c	8.3 (Berkeley) 1/4/94
  */
 
 /*
  * Kernel malloc(9) implementation -- general purpose kernel memory allocator
  * based on memory types.  Back end is implemented using the UMA(9) zone
  * allocator.  A set of fixed-size buckets are used for smaller allocations,
  * and a special UMA allocation interface is used for larger allocations.
  * Callers declare memory types, and statistics are maintained independently
  * for each memory type.  Statistics are maintained per-CPU for performance
  * reasons.  See malloc(9) and comments in malloc.h for a detailed
  * description.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_ddb.h"
 #include "opt_vm.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kdb.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mutex.h>
 #include <sys/vmmeter.h>
 #include <sys/proc.h>
 #include <sys/sbuf.h>
 #include <sys/sysctl.h>
 #include <sys/time.h>
 #include <sys/vmem.h>
 
 #include <vm/vm.h>
 #include <vm/pmap.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_param.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_map.h>
 #include <vm/vm_page.h>
 #include <vm/uma.h>
 #include <vm/uma_int.h>
 #include <vm/uma_dbg.h>
 
 #ifdef DEBUG_MEMGUARD
 #include <vm/memguard.h>
 #endif
 #ifdef DEBUG_REDZONE
 #include <vm/redzone.h>
 #endif
 
 #if defined(INVARIANTS) && defined(__i386__)
 #include <machine/cpu.h>
 #endif
 
 #include <ddb/ddb.h>
 
 #ifdef KDTRACE_HOOKS
 #include <sys/dtrace_bsd.h>
 
 dtrace_malloc_probe_func_t	dtrace_malloc_probe;
 #endif
 
 /*
  * When realloc() is called, if the new size is sufficiently smaller than
  * the old size, realloc() will allocate a new, smaller block to avoid
  * wasting memory. 'Sufficiently smaller' is defined as: newsize <=
  * oldsize / 2^n, where REALLOC_FRACTION defines the value of 'n'.
  */
 #ifndef REALLOC_FRACTION
 #define	REALLOC_FRACTION	1	/* new block if <= half the size */
 #endif
 
 /*
  * Centrally define some common malloc types.
  */
 MALLOC_DEFINE(M_CACHE, "cache", "Various Dynamically allocated caches");
 MALLOC_DEFINE(M_DEVBUF, "devbuf", "device driver memory");
 MALLOC_DEFINE(M_TEMP, "temp", "misc temporary data buffers");
 
 MALLOC_DEFINE(M_IP6OPT, "ip6opt", "IPv6 options");
 MALLOC_DEFINE(M_IP6NDP, "ip6ndp", "IPv6 Neighbor Discovery");
 
 static struct malloc_type *kmemstatistics;
 static int kmemcount;
 
 #define KMEM_ZSHIFT	4
 #define KMEM_ZBASE	16
 #define KMEM_ZMASK	(KMEM_ZBASE - 1)
 
 #define KMEM_ZMAX	65536
 #define KMEM_ZSIZE	(KMEM_ZMAX >> KMEM_ZSHIFT)
 static uint8_t kmemsize[KMEM_ZSIZE + 1];
 
 #ifndef MALLOC_DEBUG_MAXZONES
 #define	MALLOC_DEBUG_MAXZONES	1
 #endif
 static int numzones = MALLOC_DEBUG_MAXZONES;
 
 /*
  * Small malloc(9) memory allocations are allocated from a set of UMA buckets
  * of various sizes.
  *
  * XXX: The comment here used to read "These won't be powers of two for
  * long."  It's possible that a significant amount of wasted memory could be
  * recovered by tuning the sizes of these buckets.
  */
 struct {
 	int kz_size;
 	char *kz_name;
 	uma_zone_t kz_zone[MALLOC_DEBUG_MAXZONES];
 } kmemzones[] = {
 	{16, "16", },
 	{32, "32", },
 	{64, "64", },
 	{128, "128", },
 	{256, "256", },
 	{512, "512", },
 	{1024, "1024", },
 	{2048, "2048", },
 	{4096, "4096", },
 	{8192, "8192", },
 	{16384, "16384", },
 	{32768, "32768", },
 	{65536, "65536", },
 	{0, NULL},
 };
 
 /*
  * Zone to allocate malloc type descriptions from.  For ABI reasons, memory
  * types are described by a data structure passed by the declaring code, but
  * the malloc(9) implementation has its own data structure describing the
  * type and statistics.  This permits the malloc(9)-internal data structures
  * to be modified without breaking binary-compiled kernel modules that
  * declare malloc types.
  */
 static uma_zone_t mt_zone;
 
 u_long vm_kmem_size;
 SYSCTL_ULONG(_vm, OID_AUTO, kmem_size, CTLFLAG_RDTUN, &vm_kmem_size, 0,
     "Size of kernel memory");
 
 static u_long kmem_zmax = KMEM_ZMAX;
 SYSCTL_ULONG(_vm, OID_AUTO, kmem_zmax, CTLFLAG_RDTUN, &kmem_zmax, 0,
     "Maximum allocation size that malloc(9) would use UMA as backend");
 
 static u_long vm_kmem_size_min;
 SYSCTL_ULONG(_vm, OID_AUTO, kmem_size_min, CTLFLAG_RDTUN, &vm_kmem_size_min, 0,
     "Minimum size of kernel memory");
 
 static u_long vm_kmem_size_max;
 SYSCTL_ULONG(_vm, OID_AUTO, kmem_size_max, CTLFLAG_RDTUN, &vm_kmem_size_max, 0,
     "Maximum size of kernel memory");
 
 static u_int vm_kmem_size_scale;
 SYSCTL_UINT(_vm, OID_AUTO, kmem_size_scale, CTLFLAG_RDTUN, &vm_kmem_size_scale, 0,
     "Scale factor for kernel memory size");
 
 static int sysctl_kmem_map_size(SYSCTL_HANDLER_ARGS);
 SYSCTL_PROC(_vm, OID_AUTO, kmem_map_size,
     CTLFLAG_RD | CTLTYPE_ULONG | CTLFLAG_MPSAFE, NULL, 0,
     sysctl_kmem_map_size, "LU", "Current kmem allocation size");
 
 static int sysctl_kmem_map_free(SYSCTL_HANDLER_ARGS);
 SYSCTL_PROC(_vm, OID_AUTO, kmem_map_free,
     CTLFLAG_RD | CTLTYPE_ULONG | CTLFLAG_MPSAFE, NULL, 0,
     sysctl_kmem_map_free, "LU", "Free space in kmem");
 
 /*
  * The malloc_mtx protects the kmemstatistics linked list.
  */
 struct mtx malloc_mtx;
 
 #ifdef MALLOC_PROFILE
 uint64_t krequests[KMEM_ZSIZE + 1];
 
 static int sysctl_kern_mprof(SYSCTL_HANDLER_ARGS);
 #endif
 
 static int sysctl_kern_malloc_stats(SYSCTL_HANDLER_ARGS);
 
 /*
  * time_uptime of the last malloc(9) failure (induced or real).
  */
 static time_t t_malloc_fail;
 
 #if defined(MALLOC_MAKE_FAILURES) || (MALLOC_DEBUG_MAXZONES > 1)
 static SYSCTL_NODE(_debug, OID_AUTO, malloc, CTLFLAG_RD, 0,
     "Kernel malloc debugging options");
 #endif
 
 /*
  * malloc(9) fault injection -- cause malloc failures every (n) mallocs when
  * the caller specifies M_NOWAIT.  If set to 0, no failures are caused.
  */
 #ifdef MALLOC_MAKE_FAILURES
 static int malloc_failure_rate;
 static int malloc_nowait_count;
 static int malloc_failure_count;
 SYSCTL_INT(_debug_malloc, OID_AUTO, failure_rate, CTLFLAG_RWTUN,
     &malloc_failure_rate, 0, "Every (n) mallocs with M_NOWAIT will fail");
 SYSCTL_INT(_debug_malloc, OID_AUTO, failure_count, CTLFLAG_RD,
     &malloc_failure_count, 0, "Number of imposed M_NOWAIT malloc failures");
 #endif
 
 static int
 sysctl_kmem_map_size(SYSCTL_HANDLER_ARGS)
 {
 	u_long size;
 
 	size = vmem_size(kmem_arena, VMEM_ALLOC);
 	return (sysctl_handle_long(oidp, &size, 0, req));
 }
 
 static int
 sysctl_kmem_map_free(SYSCTL_HANDLER_ARGS)
 {
 	u_long size;
 
 	size = vmem_size(kmem_arena, VMEM_FREE);
 	return (sysctl_handle_long(oidp, &size, 0, req));
 }
 
 /*
  * malloc(9) uma zone separation -- sub-page buffer overruns in one
  * malloc type will affect only a subset of other malloc types.
  */
 #if MALLOC_DEBUG_MAXZONES > 1
 static void
 tunable_set_numzones(void)
 {
 
 	TUNABLE_INT_FETCH("debug.malloc.numzones",
 	    &numzones);
 
 	/* Sanity check the number of malloc uma zones. */
 	if (numzones <= 0)
 		numzones = 1;
 	if (numzones > MALLOC_DEBUG_MAXZONES)
 		numzones = MALLOC_DEBUG_MAXZONES;
 }
 SYSINIT(numzones, SI_SUB_TUNABLES, SI_ORDER_ANY, tunable_set_numzones, NULL);
 SYSCTL_INT(_debug_malloc, OID_AUTO, numzones, CTLFLAG_RDTUN | CTLFLAG_NOFETCH,
     &numzones, 0, "Number of malloc uma subzones");
 
 /*
  * Any number that changes regularly is an okay choice for the
  * offset.  Build numbers are pretty good if you have them.
  */
 static u_int zone_offset = __FreeBSD_version;
 TUNABLE_INT("debug.malloc.zone_offset", &zone_offset);
 SYSCTL_UINT(_debug_malloc, OID_AUTO, zone_offset, CTLFLAG_RDTUN,
     &zone_offset, 0, "Separate malloc types by examining the "
     "Nth character in the malloc type short description.");
 
 static u_int
 mtp_get_subzone(const char *desc)
 {
 	size_t len;
 	u_int val;
 
 	if (desc == NULL || (len = strlen(desc)) == 0)
 		return (0);
 	val = desc[zone_offset % len];
 	return (val % numzones);
 }
 #elif MALLOC_DEBUG_MAXZONES == 0
 #error "MALLOC_DEBUG_MAXZONES must be positive."
 #else
 static inline u_int
 mtp_get_subzone(const char *desc)
 {
 
 	return (0);
 }
 #endif /* MALLOC_DEBUG_MAXZONES > 1 */
 
 int
 malloc_last_fail(void)
 {
 
 	return (time_uptime - t_malloc_fail);
 }
 
 /*
  * An allocation has succeeded -- update malloc type statistics for the
  * amount of the bucket size.  Occurs within a critical section so that the
  * thread isn't preempted and doesn't migrate while updating per-CPU
  * statistics.
  */
 static void
 malloc_type_zone_allocated(struct malloc_type *mtp, unsigned long size,
     int zindx)
 {
 	struct malloc_type_internal *mtip;
 	struct malloc_type_stats *mtsp;
 
 	critical_enter();
 	mtip = mtp->ks_handle;
 	mtsp = &mtip->mti_stats[curcpu];
 	if (size > 0) {
 		mtsp->mts_memalloced += size;
 		mtsp->mts_numallocs++;
 	}
 	if (zindx != -1)
 		mtsp->mts_size |= 1 << zindx;
 
 #ifdef KDTRACE_HOOKS
 	if (dtrace_malloc_probe != NULL) {
 		uint32_t probe_id = mtip->mti_probes[DTMALLOC_PROBE_MALLOC];
 		if (probe_id != 0)
 			(dtrace_malloc_probe)(probe_id,
 			    (uintptr_t) mtp, (uintptr_t) mtip,
 			    (uintptr_t) mtsp, size, zindx);
 	}
 #endif
 
 	critical_exit();
 }
 
 void
 malloc_type_allocated(struct malloc_type *mtp, unsigned long size)
 {
 
 	if (size > 0)
 		malloc_type_zone_allocated(mtp, size, -1);
 }
 
 /*
  * A free operation has occurred -- update malloc type statistics for the
  * amount of the bucket size.  Occurs within a critical section so that the
  * thread isn't preempted and doesn't migrate while updating per-CPU
  * statistics.
  */
 void
 malloc_type_freed(struct malloc_type *mtp, unsigned long size)
 {
 	struct malloc_type_internal *mtip;
 	struct malloc_type_stats *mtsp;
 
 	critical_enter();
 	mtip = mtp->ks_handle;
 	mtsp = &mtip->mti_stats[curcpu];
 	mtsp->mts_memfreed += size;
 	mtsp->mts_numfrees++;
 
 #ifdef KDTRACE_HOOKS
 	if (dtrace_malloc_probe != NULL) {
 		uint32_t probe_id = mtip->mti_probes[DTMALLOC_PROBE_FREE];
 		if (probe_id != 0)
 			(dtrace_malloc_probe)(probe_id,
 			    (uintptr_t) mtp, (uintptr_t) mtip,
 			    (uintptr_t) mtsp, size, 0);
 	}
 #endif
 
 	critical_exit();
 }
 
 /*
  *	contigmalloc:
  *
  *	Allocate a block of physically contiguous memory.
  *
  *	If M_NOWAIT is set, this routine will not block and will return NULL
  *	if the allocation fails.
  */
 void *
 contigmalloc(unsigned long size, struct malloc_type *type, int flags,
     vm_paddr_t low, vm_paddr_t high, unsigned long alignment,
     vm_paddr_t boundary)
 {
 	void *ret;
 
 	ret = (void *)kmem_alloc_contig(kernel_arena, size, flags, low, high,
 	    alignment, boundary, VM_MEMATTR_DEFAULT);
 	if (ret != NULL)
 		malloc_type_allocated(type, round_page(size));
 	return (ret);
 }
 
 /*
  *	contigfree:
  *
  *	Free a block of memory allocated by contigmalloc.
  *
  *	This routine may not block.
  */
 void
 contigfree(void *addr, unsigned long size, struct malloc_type *type)
 {
 
 	kmem_free(kernel_arena, (vm_offset_t)addr, size);
 	malloc_type_freed(type, round_page(size));
 }
 
 /*
  *	malloc:
  *
  *	Allocate a block of memory.
  *
  *	If M_NOWAIT is set, this routine will not block and will return NULL
  *	if the allocation fails.
  */
 void *
 malloc(unsigned long size, struct malloc_type *mtp, int flags)
 {
 	int indx;
 	struct malloc_type_internal *mtip;
 	caddr_t va;
 	uma_zone_t zone;
 #if defined(DIAGNOSTIC) || defined(DEBUG_REDZONE)
 	unsigned long osize = size;
 #endif
 
 #ifdef INVARIANTS
 	KASSERT(mtp->ks_magic == M_MAGIC, ("malloc: bad malloc type magic"));
 	/*
 	 * Check that exactly one of M_WAITOK or M_NOWAIT is specified.
 	 */
 	indx = flags & (M_WAITOK | M_NOWAIT);
 	if (indx != M_NOWAIT && indx != M_WAITOK) {
 		static	struct timeval lasterr;
 		static	int curerr, once;
 		if (once == 0 && ppsratecheck(&lasterr, &curerr, 1)) {
 			printf("Bad malloc flags: %x\n", indx);
 			kdb_backtrace();
 			flags |= M_WAITOK;
 			once++;
 		}
 	}
 #endif
 #ifdef MALLOC_MAKE_FAILURES
 	if ((flags & M_NOWAIT) && (malloc_failure_rate != 0)) {
 		atomic_add_int(&malloc_nowait_count, 1);
 		if ((malloc_nowait_count % malloc_failure_rate) == 0) {
 			atomic_add_int(&malloc_failure_count, 1);
 			t_malloc_fail = time_uptime;
 			return (NULL);
 		}
 	}
 #endif
 	if (flags & M_WAITOK)
 		KASSERT(curthread->td_intr_nesting_level == 0,
 		   ("malloc(M_WAITOK) in interrupt context"));
 
 #ifdef DEBUG_MEMGUARD
 	if (memguard_cmp_mtp(mtp, size)) {
 		va = memguard_alloc(size, flags);
 		if (va != NULL)
 			return (va);
 		/* This is unfortunate but should not be fatal. */
 	}
 #endif
 
 #ifdef DEBUG_REDZONE
 	size = redzone_size_ntor(size);
 #endif
 
 	if (size <= kmem_zmax) {
 		mtip = mtp->ks_handle;
 		if (size & KMEM_ZMASK)
 			size = (size & ~KMEM_ZMASK) + KMEM_ZBASE;
 		indx = kmemsize[size >> KMEM_ZSHIFT];
 		KASSERT(mtip->mti_zone < numzones,
 		    ("mti_zone %u out of range %d",
 		    mtip->mti_zone, numzones));
 		zone = kmemzones[indx].kz_zone[mtip->mti_zone];
 #ifdef MALLOC_PROFILE
 		krequests[size >> KMEM_ZSHIFT]++;
 #endif
 		va = uma_zalloc(zone, flags);
 		if (va != NULL)
 			size = zone->uz_size;
 		malloc_type_zone_allocated(mtp, va == NULL ? 0 : size, indx);
 	} else {
 		size = roundup(size, PAGE_SIZE);
 		zone = NULL;
 		va = uma_large_malloc(size, flags);
 		malloc_type_allocated(mtp, va == NULL ? 0 : size);
 	}
 	if (flags & M_WAITOK)
 		KASSERT(va != NULL, ("malloc(M_WAITOK) returned NULL"));
 	else if (va == NULL)
 		t_malloc_fail = time_uptime;
 #ifdef DIAGNOSTIC
 	if (va != NULL && !(flags & M_ZERO)) {
 		memset(va, 0x70, osize);
 	}
 #endif
 #ifdef DEBUG_REDZONE
 	if (va != NULL)
 		va = redzone_setup(va, osize);
 #endif
 	return ((void *) va);
 }
 
 /*
  *	free:
  *
  *	Free a block of memory allocated by malloc.
  *
  *	This routine may not block.
  */
 void
 free(void *addr, struct malloc_type *mtp)
 {
 	uma_slab_t slab;
 	u_long size;
 
 	KASSERT(mtp->ks_magic == M_MAGIC, ("free: bad malloc type magic"));
 
 	/* free(NULL, ...) does nothing */
 	if (addr == NULL)
 		return;
 
 #ifdef DEBUG_MEMGUARD
 	if (is_memguard_addr(addr)) {
 		memguard_free(addr);
 		return;
 	}
 #endif
 
 #ifdef DEBUG_REDZONE
 	redzone_check(addr);
 	addr = redzone_addr_ntor(addr);
 #endif
 
 	slab = vtoslab((vm_offset_t)addr & (~UMA_SLAB_MASK));
 
 	if (slab == NULL)
 		panic("free: address %p(%p) has not been allocated.\n",
 		    addr, (void *)((u_long)addr & (~UMA_SLAB_MASK)));
 
 	if (!(slab->us_flags & UMA_SLAB_MALLOC)) {
 #ifdef INVARIANTS
 		struct malloc_type **mtpp = addr;
 #endif
 		size = slab->us_keg->uk_size;
 #ifdef INVARIANTS
 		/*
 		 * Cache a pointer to the malloc_type that most recently freed
 		 * this memory here.  This way we know who is most likely to
 		 * have stepped on it later.
 		 *
 		 * This code assumes that size is a multiple of 8 bytes for
 		 * 64-bit machines.
 		 */
 		mtpp = (struct malloc_type **)
 		    ((unsigned long)mtpp & ~UMA_ALIGN_PTR);
 		mtpp += (size - sizeof(struct malloc_type *)) /
 		    sizeof(struct malloc_type *);
 		*mtpp = mtp;
 #endif
 		uma_zfree_arg(LIST_FIRST(&slab->us_keg->uk_zones), addr, slab);
 	} else {
 		size = slab->us_size;
 		uma_large_free(slab);
 	}
 	malloc_type_freed(mtp, size);
 }
 
 /*
  *	realloc: change the size of a memory block
  */
 void *
 realloc(void *addr, unsigned long size, struct malloc_type *mtp, int flags)
 {
 	uma_slab_t slab;
 	unsigned long alloc;
 	void *newaddr;
 
 	KASSERT(mtp->ks_magic == M_MAGIC,
 	    ("realloc: bad malloc type magic"));
 
 	/* realloc(NULL, ...) is equivalent to malloc(...) */
 	if (addr == NULL)
 		return (malloc(size, mtp, flags));
 
 	/*
 	 * XXX: Should report free of old memory and alloc of new memory to
 	 * per-CPU stats.
 	 */
 
 #ifdef DEBUG_MEMGUARD
 	if (is_memguard_addr(addr))
 		return (memguard_realloc(addr, size, mtp, flags));
 #endif
 
 #ifdef DEBUG_REDZONE
 	slab = NULL;
 	alloc = redzone_get_size(addr);
 #else
 	slab = vtoslab((vm_offset_t)addr & ~(UMA_SLAB_MASK));
 
 	/* Sanity check */
 	KASSERT(slab != NULL,
 	    ("realloc: address %p out of range", (void *)addr));
 
 	/* Get the size of the original block */
 	if (!(slab->us_flags & UMA_SLAB_MALLOC))
 		alloc = slab->us_keg->uk_size;
 	else
 		alloc = slab->us_size;
 
 	/* Reuse the original block if appropriate */
 	if (size <= alloc
 	    && (size > (alloc >> REALLOC_FRACTION) || alloc == MINALLOCSIZE))
 		return (addr);
 #endif /* !DEBUG_REDZONE */
 
 	/* Allocate a new, bigger (or smaller) block */
 	if ((newaddr = malloc(size, mtp, flags)) == NULL)
 		return (NULL);
 
 	/* Copy over original contents */
 	bcopy(addr, newaddr, min(size, alloc));
 	free(addr, mtp);
 	return (newaddr);
 }
 
 /*
  *	reallocf: same as realloc() but free memory on failure.
  */
 void *
 reallocf(void *addr, unsigned long size, struct malloc_type *mtp, int flags)
 {
 	void *mem;
 
 	if ((mem = realloc(addr, size, mtp, flags)) == NULL)
 		free(addr, mtp);
 	return (mem);
 }
 
 /*
- * Wake the page daemon when we exhaust KVA.  It will call the lowmem handler
- * and uma_reclaim() callbacks in a context that is safe.
+ * Wake the uma reclamation pagedaemon thread when we exhaust KVA.  It
+ * will call the lowmem handler and uma_reclaim() callbacks in a
+ * context that is safe.
  */
 static void
 kmem_reclaim(vmem_t *vm, int flags)
 {
 
+	uma_reclaim_wakeup();
 	pagedaemon_wakeup();
 }
 
 #ifndef __sparc64__
 CTASSERT(VM_KMEM_SIZE_SCALE >= 1);
 #endif
 
 /*
  * Initialize the kernel memory (kmem) arena.
  */
 void
 kmeminit(void)
 {
 	u_long mem_size;
 	u_long tmp;
 
 #ifdef VM_KMEM_SIZE
 	if (vm_kmem_size == 0)
 		vm_kmem_size = VM_KMEM_SIZE;
 #endif
 #ifdef VM_KMEM_SIZE_MIN
 	if (vm_kmem_size_min == 0)
 		vm_kmem_size_min = VM_KMEM_SIZE_MIN;
 #endif
 #ifdef VM_KMEM_SIZE_MAX
 	if (vm_kmem_size_max == 0)
 		vm_kmem_size_max = VM_KMEM_SIZE_MAX;
 #endif
 	/*
 	 * Calculate the amount of kernel virtual address (KVA) space that is
 	 * preallocated to the kmem arena.  In order to support a wide range
 	 * of machines, it is a function of the physical memory size,
 	 * specifically,
 	 *
 	 *	min(max(physical memory size / VM_KMEM_SIZE_SCALE,
 	 *	    VM_KMEM_SIZE_MIN), VM_KMEM_SIZE_MAX)
 	 *
 	 * Every architecture must define an integral value for
 	 * VM_KMEM_SIZE_SCALE.  However, the definitions of VM_KMEM_SIZE_MIN
 	 * and VM_KMEM_SIZE_MAX, which represent respectively the floor and
 	 * ceiling on this preallocation, are optional.  Typically,
 	 * VM_KMEM_SIZE_MAX is itself a function of the available KVA space on
 	 * a given architecture.
 	 */
 	mem_size = vm_cnt.v_page_count;
 	if (mem_size <= 32768) /* delphij XXX 128MB */
 		kmem_zmax = PAGE_SIZE;
 
 	if (vm_kmem_size_scale < 1)
 		vm_kmem_size_scale = VM_KMEM_SIZE_SCALE;
 
 	/*
 	 * Check if we should use defaults for the "vm_kmem_size"
 	 * variable:
 	 */
 	if (vm_kmem_size == 0) {
 		vm_kmem_size = (mem_size / vm_kmem_size_scale) * PAGE_SIZE;
 
 		if (vm_kmem_size_min > 0 && vm_kmem_size < vm_kmem_size_min)
 			vm_kmem_size = vm_kmem_size_min;
 		if (vm_kmem_size_max > 0 && vm_kmem_size >= vm_kmem_size_max)
 			vm_kmem_size = vm_kmem_size_max;
 	}
 
 	/*
 	 * The amount of KVA space that is preallocated to the
 	 * kmem arena can be set statically at compile-time or manually
 	 * through the kernel environment.  However, it is still limited to
 	 * twice the physical memory size, which has been sufficient to handle
 	 * the most severe cases of external fragmentation in the kmem arena. 
 	 */
 	if (vm_kmem_size / 2 / PAGE_SIZE > mem_size)
 		vm_kmem_size = 2 * mem_size * PAGE_SIZE;
 
 	vm_kmem_size = round_page(vm_kmem_size);
 #ifdef DEBUG_MEMGUARD
 	tmp = memguard_fudge(vm_kmem_size, kernel_map);
 #else
 	tmp = vm_kmem_size;
 #endif
 	vmem_init(kmem_arena, "kmem arena", kva_alloc(tmp), tmp, PAGE_SIZE,
 	    0, 0);
 	vmem_set_reclaim(kmem_arena, kmem_reclaim);
 
 #ifdef DEBUG_MEMGUARD
 	/*
 	 * Initialize MemGuard if support compiled in.  MemGuard is a
 	 * replacement allocator used for detecting tamper-after-free
 	 * scenarios as they occur.  It is only used for debugging.
 	 */
 	memguard_init(kmem_arena);
 #endif
 }
 
 /*
  * Initialize the kernel memory allocator
  */
 /* ARGSUSED*/
 static void
 mallocinit(void *dummy)
 {
 	int i;
 	uint8_t indx;
 
 	mtx_init(&malloc_mtx, "malloc", NULL, MTX_DEF);
 
 	kmeminit();
 
 	uma_startup2();
 
 	if (kmem_zmax < PAGE_SIZE || kmem_zmax > KMEM_ZMAX)
 		kmem_zmax = KMEM_ZMAX;
 
 	mt_zone = uma_zcreate("mt_zone", sizeof(struct malloc_type_internal),
 #ifdef INVARIANTS
 	    mtrash_ctor, mtrash_dtor, mtrash_init, mtrash_fini,
 #else
 	    NULL, NULL, NULL, NULL,
 #endif
 	    UMA_ALIGN_PTR, UMA_ZONE_MALLOC);
 	for (i = 0, indx = 0; kmemzones[indx].kz_size != 0; indx++) {
 		int size = kmemzones[indx].kz_size;
 		char *name = kmemzones[indx].kz_name;
 		int subzone;
 
 		for (subzone = 0; subzone < numzones; subzone++) {
 			kmemzones[indx].kz_zone[subzone] =
 			    uma_zcreate(name, size,
 #ifdef INVARIANTS
 			    mtrash_ctor, mtrash_dtor, mtrash_init, mtrash_fini,
 #else
 			    NULL, NULL, NULL, NULL,
 #endif
 			    UMA_ALIGN_PTR, UMA_ZONE_MALLOC);
 		}		    
 		for (;i <= size; i+= KMEM_ZBASE)
 			kmemsize[i >> KMEM_ZSHIFT] = indx;
 
 	}
 }
 SYSINIT(kmem, SI_SUB_KMEM, SI_ORDER_SECOND, mallocinit, NULL);
 
 void
 malloc_init(void *data)
 {
 	struct malloc_type_internal *mtip;
 	struct malloc_type *mtp;
 
 	KASSERT(vm_cnt.v_page_count != 0, ("malloc_register before vm_init"));
 
 	mtp = data;
 	if (mtp->ks_magic != M_MAGIC)
 		panic("malloc_init: bad malloc type magic");
 
 	mtip = uma_zalloc(mt_zone, M_WAITOK | M_ZERO);
 	mtp->ks_handle = mtip;
 	mtip->mti_zone = mtp_get_subzone(mtp->ks_shortdesc);
 
 	mtx_lock(&malloc_mtx);
 	mtp->ks_next = kmemstatistics;
 	kmemstatistics = mtp;
 	kmemcount++;
 	mtx_unlock(&malloc_mtx);
 }
 
 void
 malloc_uninit(void *data)
 {
 	struct malloc_type_internal *mtip;
 	struct malloc_type_stats *mtsp;
 	struct malloc_type *mtp, *temp;
 	uma_slab_t slab;
 	long temp_allocs, temp_bytes;
 	int i;
 
 	mtp = data;
 	KASSERT(mtp->ks_magic == M_MAGIC,
 	    ("malloc_uninit: bad malloc type magic"));
 	KASSERT(mtp->ks_handle != NULL, ("malloc_deregister: cookie NULL"));
 
 	mtx_lock(&malloc_mtx);
 	mtip = mtp->ks_handle;
 	mtp->ks_handle = NULL;
 	if (mtp != kmemstatistics) {
 		for (temp = kmemstatistics; temp != NULL;
 		    temp = temp->ks_next) {
 			if (temp->ks_next == mtp) {
 				temp->ks_next = mtp->ks_next;
 				break;
 			}
 		}
 		KASSERT(temp,
 		    ("malloc_uninit: type '%s' not found", mtp->ks_shortdesc));
 	} else
 		kmemstatistics = mtp->ks_next;
 	kmemcount--;
 	mtx_unlock(&malloc_mtx);
 
 	/*
 	 * Look for memory leaks.
 	 */
 	temp_allocs = temp_bytes = 0;
 	for (i = 0; i < MAXCPU; i++) {
 		mtsp = &mtip->mti_stats[i];
 		temp_allocs += mtsp->mts_numallocs;
 		temp_allocs -= mtsp->mts_numfrees;
 		temp_bytes += mtsp->mts_memalloced;
 		temp_bytes -= mtsp->mts_memfreed;
 	}
 	if (temp_allocs > 0 || temp_bytes > 0) {
 		printf("Warning: memory type %s leaked memory on destroy "
 		    "(%ld allocations, %ld bytes leaked).\n", mtp->ks_shortdesc,
 		    temp_allocs, temp_bytes);
 	}
 
 	slab = vtoslab((vm_offset_t) mtip & (~UMA_SLAB_MASK));
 	uma_zfree_arg(mt_zone, mtip, slab);
 }
 
 struct malloc_type *
 malloc_desc2type(const char *desc)
 {
 	struct malloc_type *mtp;
 
 	mtx_assert(&malloc_mtx, MA_OWNED);
 	for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) {
 		if (strcmp(mtp->ks_shortdesc, desc) == 0)
 			return (mtp);
 	}
 	return (NULL);
 }
 
 static int
 sysctl_kern_malloc_stats(SYSCTL_HANDLER_ARGS)
 {
 	struct malloc_type_stream_header mtsh;
 	struct malloc_type_internal *mtip;
 	struct malloc_type_header mth;
 	struct malloc_type *mtp;
 	int error, i;
 	struct sbuf sbuf;
 
 	error = sysctl_wire_old_buffer(req, 0);
 	if (error != 0)
 		return (error);
 	sbuf_new_for_sysctl(&sbuf, NULL, 128, req);
 	sbuf_clear_flags(&sbuf, SBUF_INCLUDENUL);
 	mtx_lock(&malloc_mtx);
 
 	/*
 	 * Insert stream header.
 	 */
 	bzero(&mtsh, sizeof(mtsh));
 	mtsh.mtsh_version = MALLOC_TYPE_STREAM_VERSION;
 	mtsh.mtsh_maxcpus = MAXCPU;
 	mtsh.mtsh_count = kmemcount;
 	(void)sbuf_bcat(&sbuf, &mtsh, sizeof(mtsh));
 
 	/*
 	 * Insert alternating sequence of type headers and type statistics.
 	 */
 	for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) {
 		mtip = (struct malloc_type_internal *)mtp->ks_handle;
 
 		/*
 		 * Insert type header.
 		 */
 		bzero(&mth, sizeof(mth));
 		strlcpy(mth.mth_name, mtp->ks_shortdesc, MALLOC_MAX_NAME);
 		(void)sbuf_bcat(&sbuf, &mth, sizeof(mth));
 
 		/*
 		 * Insert type statistics for each CPU.
 		 */
 		for (i = 0; i < MAXCPU; i++) {
 			(void)sbuf_bcat(&sbuf, &mtip->mti_stats[i],
 			    sizeof(mtip->mti_stats[i]));
 		}
 	}
 	mtx_unlock(&malloc_mtx);
 	error = sbuf_finish(&sbuf);
 	sbuf_delete(&sbuf);
 	return (error);
 }
 
 SYSCTL_PROC(_kern, OID_AUTO, malloc_stats, CTLFLAG_RD|CTLTYPE_STRUCT,
     0, 0, sysctl_kern_malloc_stats, "s,malloc_type_ustats",
     "Return malloc types");
 
 SYSCTL_INT(_kern, OID_AUTO, malloc_count, CTLFLAG_RD, &kmemcount, 0,
     "Count of kernel malloc types");
 
 void
 malloc_type_list(malloc_type_list_func_t *func, void *arg)
 {
 	struct malloc_type *mtp, **bufmtp;
 	int count, i;
 	size_t buflen;
 
 	mtx_lock(&malloc_mtx);
 restart:
 	mtx_assert(&malloc_mtx, MA_OWNED);
 	count = kmemcount;
 	mtx_unlock(&malloc_mtx);
 
 	buflen = sizeof(struct malloc_type *) * count;
 	bufmtp = malloc(buflen, M_TEMP, M_WAITOK);
 
 	mtx_lock(&malloc_mtx);
 
 	if (count < kmemcount) {
 		free(bufmtp, M_TEMP);
 		goto restart;
 	}
 
 	for (mtp = kmemstatistics, i = 0; mtp != NULL; mtp = mtp->ks_next, i++)
 		bufmtp[i] = mtp;
 
 	mtx_unlock(&malloc_mtx);
 
 	for (i = 0; i < count; i++)
 		(func)(bufmtp[i], arg);
 
 	free(bufmtp, M_TEMP);
 }
 
 #ifdef DDB
 DB_SHOW_COMMAND(malloc, db_show_malloc)
 {
 	struct malloc_type_internal *mtip;
 	struct malloc_type *mtp;
 	uint64_t allocs, frees;
 	uint64_t alloced, freed;
 	int i;
 
 	db_printf("%18s %12s  %12s %12s\n", "Type", "InUse", "MemUse",
 	    "Requests");
 	for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) {
 		mtip = (struct malloc_type_internal *)mtp->ks_handle;
 		allocs = 0;
 		frees = 0;
 		alloced = 0;
 		freed = 0;
 		for (i = 0; i < MAXCPU; i++) {
 			allocs += mtip->mti_stats[i].mts_numallocs;
 			frees += mtip->mti_stats[i].mts_numfrees;
 			alloced += mtip->mti_stats[i].mts_memalloced;
 			freed += mtip->mti_stats[i].mts_memfreed;
 		}
 		db_printf("%18s %12ju %12juK %12ju\n",
 		    mtp->ks_shortdesc, allocs - frees,
 		    (alloced - freed + 1023) / 1024, allocs);
 		if (db_pager_quit)
 			break;
 	}
 }
 
 #if MALLOC_DEBUG_MAXZONES > 1
 DB_SHOW_COMMAND(multizone_matches, db_show_multizone_matches)
 {
 	struct malloc_type_internal *mtip;
 	struct malloc_type *mtp;
 	u_int subzone;
 
 	if (!have_addr) {
 		db_printf("Usage: show multizone_matches <malloc type/addr>\n");
 		return;
 	}
 	mtp = (void *)addr;
 	if (mtp->ks_magic != M_MAGIC) {
 		db_printf("Magic %lx does not match expected %x\n",
 		    mtp->ks_magic, M_MAGIC);
 		return;
 	}
 
 	mtip = mtp->ks_handle;
 	subzone = mtip->mti_zone;
 
 	for (mtp = kmemstatistics; mtp != NULL; mtp = mtp->ks_next) {
 		mtip = mtp->ks_handle;
 		if (mtip->mti_zone != subzone)
 			continue;
 		db_printf("%s\n", mtp->ks_shortdesc);
 		if (db_pager_quit)
 			break;
 	}
 }
 #endif /* MALLOC_DEBUG_MAXZONES > 1 */
 #endif /* DDB */
 
 #ifdef MALLOC_PROFILE
 
 static int
 sysctl_kern_mprof(SYSCTL_HANDLER_ARGS)
 {
 	struct sbuf sbuf;
 	uint64_t count;
 	uint64_t waste;
 	uint64_t mem;
 	int error;
 	int rsize;
 	int size;
 	int i;
 
 	waste = 0;
 	mem = 0;
 
 	error = sysctl_wire_old_buffer(req, 0);
 	if (error != 0)
 		return (error);
 	sbuf_new_for_sysctl(&sbuf, NULL, 128, req);
 	sbuf_printf(&sbuf, 
 	    "\n  Size                    Requests  Real Size\n");
 	for (i = 0; i < KMEM_ZSIZE; i++) {
 		size = i << KMEM_ZSHIFT;
 		rsize = kmemzones[kmemsize[i]].kz_size;
 		count = (long long unsigned)krequests[i];
 
 		sbuf_printf(&sbuf, "%6d%28llu%11d\n", size,
 		    (unsigned long long)count, rsize);
 
 		if ((rsize * count) > (size * count))
 			waste += (rsize * count) - (size * count);
 		mem += (rsize * count);
 	}
 	sbuf_printf(&sbuf,
 	    "\nTotal memory used:\t%30llu\nTotal Memory wasted:\t%30llu\n",
 	    (unsigned long long)mem, (unsigned long long)waste);
 	error = sbuf_finish(&sbuf);
 	sbuf_delete(&sbuf);
 	return (error);
 }
 
 SYSCTL_OID(_kern, OID_AUTO, mprof, CTLTYPE_STRING|CTLFLAG_RD,
     NULL, 0, sysctl_kern_mprof, "A", "Malloc Profiling");
 #endif /* MALLOC_PROFILE */
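Note on the bucket lookup in malloc() above: the following is a minimal
userland sketch, assuming the thirteen power-of-two bucket sizes listed in
kmemzones[] and the KMEM_Z* constants shown in this file.  The names
bucket_sizes, build_table, and bucket_for are hypothetical and exist only for
illustration; the kernel does the same rounding inline in malloc() and fills
kmemsize[] in mallocinit().  It is not part of the change itself.

```c
#include <stdint.h>

#define KMEM_ZSHIFT	4
#define KMEM_ZBASE	16
#define KMEM_ZMASK	(KMEM_ZBASE - 1)
#define KMEM_ZMAX	65536
#define KMEM_ZSIZE	(KMEM_ZMAX >> KMEM_ZSHIFT)

/* Hypothetical stand-in for the kz_size column of kmemzones[]. */
static const int bucket_sizes[] = {
	16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768,
	65536, 0
};

/* Maps (rounded size >> KMEM_ZSHIFT) to an index into bucket_sizes[]. */
static uint8_t kmemsize[KMEM_ZSIZE + 1];

/* Fill the lookup table the same way the loop in mallocinit() does. */
static void
build_table(void)
{
	int i = 0;
	uint8_t indx;

	for (indx = 0; bucket_sizes[indx] != 0; indx++)
		for (; i <= bucket_sizes[indx]; i += KMEM_ZBASE)
			kmemsize[i >> KMEM_ZSHIFT] = indx;
}

/*
 * Return the bucket size that would serve a request of 'size' bytes, or
 * -1 if the request is larger than KMEM_ZMAX (the large-allocation path).
 * build_table() must have been called first.
 */
static int
bucket_for(unsigned long size)
{

	if (size > KMEM_ZMAX)
		return (-1);
	/* Round up to the next multiple of KMEM_ZBASE. */
	if (size & KMEM_ZMASK)
		size = (size & ~(unsigned long)KMEM_ZMASK) + KMEM_ZBASE;
	return (bucket_sizes[kmemsize[size >> KMEM_ZSHIFT]]);
}
```

Under these assumptions a 33-byte request is first rounded to 48 and then
served from the 64-byte bucket, which is the kind of internal waste that the
MALLOC_PROFILE sysctl above reports.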
Index: projects/release-arm-redux/sys/kern/kern_thread.c
===================================================================
--- projects/release-arm-redux/sys/kern/kern_thread.c	(revision 282691)
+++ projects/release-arm-redux/sys/kern/kern_thread.c	(revision 282692)
@@ -1,1114 +1,1137 @@
 /*-
  * Copyright (C) 2001 Julian Elischer <julian@freebsd.org>.
  *  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice(s), this list of conditions and the following disclaimer as
  *    the first lines of this file unmodified other than the possible
  *    addition of one or more copyright notices.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice(s), this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER(S) ``AS IS'' AND ANY
  * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
  * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  * DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT HOLDER(S) BE LIABLE FOR ANY
  * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
  * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
  * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
  * DAMAGE.
  */
 
 #include "opt_witness.h"
 #include "opt_hwpmc_hooks.h"
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/rangelock.h>
 #include <sys/resourcevar.h>
 #include <sys/sdt.h>
 #include <sys/smp.h>
 #include <sys/sched.h>
 #include <sys/sleepqueue.h>
 #include <sys/selinfo.h>
 #include <sys/turnstile.h>
 #include <sys/ktr.h>
 #include <sys/rwlock.h>
 #include <sys/umtx.h>
 #include <sys/cpuset.h>
 #ifdef	HWPMC_HOOKS
 #include <sys/pmckern.h>
 #endif
 
 #include <security/audit/audit.h>
 
 #include <vm/vm.h>
 #include <vm/vm_extern.h>
 #include <vm/uma.h>
 #include <sys/eventhandler.h>
 
 SDT_PROVIDER_DECLARE(proc);
 SDT_PROBE_DEFINE(proc, , , lwp__exit);
 
 /*
  * thread related storage.
  */
 static uma_zone_t thread_zone;
 
 TAILQ_HEAD(, thread) zombie_threads = TAILQ_HEAD_INITIALIZER(zombie_threads);
 static struct mtx zombie_lock;
 MTX_SYSINIT(zombie_lock, &zombie_lock, "zombie lock", MTX_SPIN);
 
 static void thread_zombie(struct thread *);
 
 #define TID_BUFFER_SIZE	1024
 
 struct mtx tid_lock;
 static struct unrhdr *tid_unrhdr;
 static lwpid_t tid_buffer[TID_BUFFER_SIZE];
 static int tid_head, tid_tail;
 static MALLOC_DEFINE(M_TIDHASH, "tidhash", "thread hash");
 
 struct	tidhashhead *tidhashtbl;
 u_long	tidhash;
 struct	rwlock tidhash_lock;
 
 static lwpid_t
 tid_alloc(void)
 {
 	lwpid_t	tid;
 
 	tid = alloc_unr(tid_unrhdr);
 	if (tid != -1)
 		return (tid);
 	mtx_lock(&tid_lock);
 	if (tid_head == tid_tail) {
 		mtx_unlock(&tid_lock);
 		return (-1);
 	}
 	tid = tid_buffer[tid_head];
 	tid_head = (tid_head + 1) % TID_BUFFER_SIZE;
 	mtx_unlock(&tid_lock);
 	return (tid);
 }
 
 static void
 tid_free(lwpid_t tid)
 {
 	lwpid_t tmp_tid = -1;
 
 	mtx_lock(&tid_lock);
 	if ((tid_tail + 1) % TID_BUFFER_SIZE == tid_head) {
 		tmp_tid = tid_buffer[tid_head];
 		tid_head = (tid_head + 1) % TID_BUFFER_SIZE;
 	}
 	tid_buffer[tid_tail] = tid;
 	tid_tail = (tid_tail + 1) % TID_BUFFER_SIZE;
 	mtx_unlock(&tid_lock);
 	if (tmp_tid != -1)
 		free_unr(tid_unrhdr, tmp_tid);
 }
 
 /*
  * Prepare a thread for use.
  */
 static int
 thread_ctor(void *mem, int size, void *arg, int flags)
 {
 	struct thread	*td;
 
 	td = (struct thread *)mem;
 	td->td_state = TDS_INACTIVE;
 	td->td_oncpu = NOCPU;
 
 	td->td_tid = tid_alloc();
 
 	/*
 	 * Note that td_critnest begins life as 1 because the thread is not
 	 * running and is thereby implicitly waiting to be on the receiving
 	 * end of a context switch.
 	 */
 	td->td_critnest = 1;
 	td->td_lend_user_pri = PRI_MAX;
 	EVENTHANDLER_INVOKE(thread_ctor, td);
 #ifdef AUDIT
 	audit_thread_alloc(td);
 #endif
 	umtx_thread_alloc(td);
 	return (0);
 }
 
 /*
  * Reclaim a thread after use.
  */
 static void
 thread_dtor(void *mem, int size, void *arg)
 {
 	struct thread *td;
 
 	td = (struct thread *)mem;
 
 #ifdef INVARIANTS
 	/* Verify that this thread is in a safe state to free. */
 	switch (td->td_state) {
 	case TDS_INHIBITED:
 	case TDS_RUNNING:
 	case TDS_CAN_RUN:
 	case TDS_RUNQ:
 		/*
 		 * We must never unlink a thread that is in one of
 		 * these states, because it is currently active.
 		 */
 		panic("bad state for thread unlinking");
 		/* NOTREACHED */
 	case TDS_INACTIVE:
 		break;
 	default:
 		panic("bad thread state");
 		/* NOTREACHED */
 	}
 #endif
 #ifdef AUDIT
 	audit_thread_free(td);
 #endif
 	/* Free all OSD associated with this thread. */
 	osd_thread_exit(td);
 
 	EVENTHANDLER_INVOKE(thread_dtor, td);
 	tid_free(td->td_tid);
 }
 
 /*
  * Initialize type-stable parts of a thread (when newly created).
  */
 static int
 thread_init(void *mem, int size, int flags)
 {
 	struct thread *td;
 
 	td = (struct thread *)mem;
 
 	td->td_sleepqueue = sleepq_alloc();
 	td->td_turnstile = turnstile_alloc();
 	td->td_rlqe = NULL;
 	EVENTHANDLER_INVOKE(thread_init, td);
 	td->td_sched = (struct td_sched *)&td[1];
 	umtx_thread_init(td);
 	td->td_kstack = 0;
 	td->td_sel = NULL;
 	return (0);
 }
 
 /*
  * Tear down type-stable parts of a thread (just before being discarded).
  */
 static void
 thread_fini(void *mem, int size)
 {
 	struct thread *td;
 
 	td = (struct thread *)mem;
 	EVENTHANDLER_INVOKE(thread_fini, td);
 	rlqentry_free(td->td_rlqe);
 	turnstile_free(td->td_turnstile);
 	sleepq_free(td->td_sleepqueue);
 	umtx_thread_fini(td);
 	seltdfini(td);
 }
 
 /*
  * For a newly created process,
  * link up all the structures and its initial threads etc.
  * called from:
  * {arch}/{arch}/machdep.c   {arch}_init(), init386() etc.
  * proc_dtor() (should go away)
  * proc_init()
  */
 void
 proc_linkup0(struct proc *p, struct thread *td)
 {
 	TAILQ_INIT(&p->p_threads);	     /* all threads in proc */
 	proc_linkup(p, td);
 }
 
 void
 proc_linkup(struct proc *p, struct thread *td)
 {
 
 	sigqueue_init(&p->p_sigqueue, p);
 	p->p_ksi = ksiginfo_alloc(1);
 	if (p->p_ksi != NULL) {
 		/* XXX p_ksi may be null if ksiginfo zone is not ready */
 		p->p_ksi->ksi_flags = KSI_EXT | KSI_INS;
 	}
 	LIST_INIT(&p->p_mqnotifier);
 	p->p_numthreads = 0;
 	thread_link(td, p);
 }
 
 /*
  * Initialize global thread allocation resources.
  */
 void
 threadinit(void)
 {
 
 	mtx_init(&tid_lock, "TID lock", NULL, MTX_DEF);
 
 	/*
 	 * pid_max cannot be greater than PID_MAX.
 	 * leave one number for thread0.
 	 */
 	tid_unrhdr = new_unrhdr(PID_MAX + 2, INT_MAX, &tid_lock);
 
 	thread_zone = uma_zcreate("THREAD", sched_sizeof_thread(),
 	    thread_ctor, thread_dtor, thread_init, thread_fini,
 	    16 - 1, 0);
 	tidhashtbl = hashinit(maxproc / 2, M_TIDHASH, &tidhash);
 	rw_init(&tidhash_lock, "tidhash");
 }
 
 /*
  * Place an unused thread on the zombie list.
  * Use the slpq as that must be unused by now.
  */
 void
 thread_zombie(struct thread *td)
 {
 	mtx_lock_spin(&zombie_lock);
 	TAILQ_INSERT_HEAD(&zombie_threads, td, td_slpq);
 	mtx_unlock_spin(&zombie_lock);
 }
 
 /*
  * Release a thread that has exited after cpu_throw().
  */
 void
 thread_stash(struct thread *td)
 {
 	atomic_subtract_rel_int(&td->td_proc->p_exitthreads, 1);
 	thread_zombie(td);
 }
 
 /*
  * Reap zombie resources.
  */
 void
 thread_reap(void)
 {
 	struct thread *td_first, *td_next;
 
 	/*
 	 * Don't bother taking the lock if the list is empty right now;
 	 * we don't care about what happens in the next instant.
 	 */
 	if (!TAILQ_EMPTY(&zombie_threads)) {
 		mtx_lock_spin(&zombie_lock);
 		td_first = TAILQ_FIRST(&zombie_threads);
 		if (td_first)
 			TAILQ_INIT(&zombie_threads);
 		mtx_unlock_spin(&zombie_lock);
 		while (td_first) {
 			td_next = TAILQ_NEXT(td_first, td_slpq);
 			if (td_first->td_ucred)
 				crfree(td_first->td_ucred);
 			thread_free(td_first);
 			td_first = td_next;
 		}
 	}
 }
 
 /*
  * Allocate a thread.
  */
 struct thread *
 thread_alloc(int pages)
 {
 	struct thread *td;
 
 	thread_reap(); /* check if any zombies to get */
 
 	td = (struct thread *)uma_zalloc(thread_zone, M_WAITOK);
 	KASSERT(td->td_kstack == 0, ("thread_alloc got thread with kstack"));
 	if (!vm_thread_new(td, pages)) {
 		uma_zfree(thread_zone, td);
 		return (NULL);
 	}
 	cpu_thread_alloc(td);
 	return (td);
 }
 
 int
 thread_alloc_stack(struct thread *td, int pages)
 {
 
 	KASSERT(td->td_kstack == 0,
 	    ("thread_alloc_stack called on a thread with kstack"));
 	if (!vm_thread_new(td, pages))
 		return (0);
 	cpu_thread_alloc(td);
 	return (1);
 }
 
 /*
  * Deallocate a thread.
  */
 void
 thread_free(struct thread *td)
 {
 
 	lock_profile_thread_exit(td);
 	if (td->td_cpuset)
 		cpuset_rel(td->td_cpuset);
 	td->td_cpuset = NULL;
 	cpu_thread_free(td);
 	if (td->td_kstack != 0)
 		vm_thread_dispose(td);
 	uma_zfree(thread_zone, td);
 }
 
 /*
  * Discard the current thread and exit from its context.
  * Always called with scheduler locked.
  *
  * Because we can't free a thread while we're operating under its context,
  * push the current thread into our CPU's deadthread holder. This means
  * we needn't worry about someone else grabbing our context before we
  * do a cpu_throw().
  */
 void
 thread_exit(void)
 {
 	uint64_t runtime, new_switchtime;
 	struct thread *td;
 	struct thread *td2;
 	struct proc *p;
 	int wakeup_swapper;
 
 	td = curthread;
 	p = td->td_proc;
 
 	PROC_SLOCK_ASSERT(p, MA_OWNED);
 	mtx_assert(&Giant, MA_NOTOWNED);
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	KASSERT(p != NULL, ("thread exiting without a process"));
 	CTR3(KTR_PROC, "thread_exit: thread %p (pid %ld, %s)", td,
 	    (long)p->p_pid, td->td_name);
 	KASSERT(TAILQ_EMPTY(&td->td_sigqueue.sq_list), ("signal pending"));
 
 #ifdef AUDIT
 	AUDIT_SYSCALL_EXIT(0, td);
 #endif
 	/*
 	 * drop FPU & debug register state storage, or any other
 	 * architecture specific resources that
 	 * would not be on a new untouched process.
 	 */
 	cpu_thread_exit(td);	/* XXXSMP */
 
 	/*
 	 * The last thread is left attached to the process
 	 * so that the whole bundle gets recycled.  Skip
 	 * all of this if the process never had threads.
 	 * EXIT clears all signs of other threads when
 	 * it goes to single threading, so the last thread always
 	 * takes the short path.
 	 */
 	if (p->p_flag & P_HADTHREADS) {
 		if (p->p_numthreads > 1) {
 			atomic_add_int(&td->td_proc->p_exitthreads, 1);
 			thread_unlink(td);
 			td2 = FIRST_THREAD_IN_PROC(p);
 			sched_exit_thread(td2, td);
 
 			/*
 			 * The test below is NOT true if we are the
 			 * sole exiting thread. P_STOPPED_SINGLE is unset
 			 * in exit1() after it is the only survivor.
 			 */
 			if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE) {
 				if (p->p_numthreads == p->p_suspcount) {
 					thread_lock(p->p_singlethread);
 					wakeup_swapper = thread_unsuspend_one(
 						p->p_singlethread, p);
 					thread_unlock(p->p_singlethread);
 					if (wakeup_swapper)
 						kick_proc0();
 				}
 			}
 
 			PCPU_SET(deadthread, td);
 		} else {
 			/*
 			 * The last thread is exiting, but not through exit().
 			 */
 			panic ("thread_exit: Last thread exiting on its own");
 		}
 	} 
 #ifdef	HWPMC_HOOKS
 	/*
 	 * If this thread is part of a process that is being tracked by hwpmc(4),
 	 * inform the module of the thread's impending exit.
 	 */
 	if (PMC_PROC_IS_USING_PMCS(td->td_proc))
 		PMC_SWITCH_CONTEXT(td, PMC_FN_CSW_OUT);
 #endif
 	PROC_UNLOCK(p);
 	PROC_STATLOCK(p);
 	thread_lock(td);
 	PROC_SUNLOCK(p);
 
 	/* Do the same timestamp bookkeeping that mi_switch() would do. */
 	new_switchtime = cpu_ticks();
 	runtime = new_switchtime - PCPU_GET(switchtime);
 	td->td_runtime += runtime;
 	td->td_incruntime += runtime;
 	PCPU_SET(switchtime, new_switchtime);
 	PCPU_SET(switchticks, ticks);
 	PCPU_INC(cnt.v_swtch);
 
 	/* Save our resource usage in our process. */
 	td->td_ru.ru_nvcsw++;
 	ruxagg(p, td);
 	rucollect(&p->p_ru, &td->td_ru);
 	PROC_STATUNLOCK(p);
 
 	td->td_state = TDS_INACTIVE;
 #ifdef WITNESS
 	witness_thread_exit(td);
 #endif
 	CTR1(KTR_PROC, "thread_exit: cpu_throw() thread %p", td);
 	sched_throw(td);
 	panic("I'm a teapot!");
 	/* NOTREACHED */
 }
 
 /*
  * Do any thread-specific cleanup that may be needed in wait().
  * Called with Giant, the proc lock and the sched lock not held.
  */
 void
 thread_wait(struct proc *p)
 {
 	struct thread *td;
 
 	mtx_assert(&Giant, MA_NOTOWNED);
 	KASSERT(p->p_numthreads == 1, ("multiple threads in thread_wait()"));
 	KASSERT(p->p_exitthreads == 0, ("p_exitthreads leaking"));
 	td = FIRST_THREAD_IN_PROC(p);
 	/* Lock the last thread so we spin until it exits cpu_throw(). */
 	thread_lock(td);
 	thread_unlock(td);
 	lock_profile_thread_exit(td);
 	cpuset_rel(td->td_cpuset);
 	td->td_cpuset = NULL;
 	cpu_thread_clean(td);
 	crfree(td->td_ucred);
 	thread_reap();	/* check for zombie threads etc. */
 }
 
 /*
  * Link a thread to a process.
  * set up anything that needs to be initialized for it to
  * be used by the process.
  */
 void
 thread_link(struct thread *td, struct proc *p)
 {
 
 	/*
 	 * XXX This can't be enabled because it's called for proc0 before
 	 * its lock has been created.
 	 * PROC_LOCK_ASSERT(p, MA_OWNED);
 	 */
 	td->td_state    = TDS_INACTIVE;
 	td->td_proc     = p;
 	td->td_flags    = TDF_INMEM;
 
 	LIST_INIT(&td->td_contested);
 	LIST_INIT(&td->td_lprof[0]);
 	LIST_INIT(&td->td_lprof[1]);
 	sigqueue_init(&td->td_sigqueue, p);
 	callout_init(&td->td_slpcallout, CALLOUT_MPSAFE);
 	TAILQ_INSERT_TAIL(&p->p_threads, td, td_plist);
 	p->p_numthreads++;
 }
 
 /*
  * Called from:
  *  thread_exit()
  */
 void
 thread_unlink(struct thread *td)
 {
 	struct proc *p = td->td_proc;
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	TAILQ_REMOVE(&p->p_threads, td, td_plist);
 	p->p_numthreads--;
 	/* could clear a few other things here */
 	/* Must  NOT clear links to proc! */
 }
 
 static int
 calc_remaining(struct proc *p, int mode)
 {
 	int remaining;
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	PROC_SLOCK_ASSERT(p, MA_OWNED);
 	if (mode == SINGLE_EXIT)
 		remaining = p->p_numthreads;
 	else if (mode == SINGLE_BOUNDARY)
 		remaining = p->p_numthreads - p->p_boundary_count;
 	else if (mode == SINGLE_NO_EXIT || mode == SINGLE_ALLPROC)
 		remaining = p->p_numthreads - p->p_suspcount;
 	else
 		panic("calc_remaining: wrong mode %d", mode);
 	return (remaining);
 }
 
 static int
 remain_for_mode(int mode)
 {
 
 	return (mode == SINGLE_ALLPROC ? 0 : 1);
 }
 
 static int
 weed_inhib(int mode, struct thread *td2, struct proc *p)
 {
 	int wakeup_swapper;
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	PROC_SLOCK_ASSERT(p, MA_OWNED);
 	THREAD_LOCK_ASSERT(td2, MA_OWNED);
 
 	wakeup_swapper = 0;
 	switch (mode) {
 	case SINGLE_EXIT:
 		if (TD_IS_SUSPENDED(td2))
 			wakeup_swapper |= thread_unsuspend_one(td2, p);
 		if (TD_ON_SLEEPQ(td2) && (td2->td_flags & TDF_SINTR) != 0)
 			wakeup_swapper |= sleepq_abort(td2, EINTR);
 		break;
 	case SINGLE_BOUNDARY:
 		if (TD_IS_SUSPENDED(td2) && (td2->td_flags & TDF_BOUNDARY) == 0)
 			wakeup_swapper |= thread_unsuspend_one(td2, p);
 		if (TD_ON_SLEEPQ(td2) && (td2->td_flags & TDF_SINTR) != 0)
 			wakeup_swapper |= sleepq_abort(td2, ERESTART);
 		break;
 	case SINGLE_NO_EXIT:
 		if (TD_IS_SUSPENDED(td2) && (td2->td_flags & TDF_BOUNDARY) == 0)
 			wakeup_swapper |= thread_unsuspend_one(td2, p);
 		if (TD_ON_SLEEPQ(td2) && (td2->td_flags & TDF_SINTR) != 0)
 			wakeup_swapper |= sleepq_abort(td2, ERESTART);
 		break;
 	case SINGLE_ALLPROC:
 		/*
 		 * ALLPROC suspend tries to avoid spurious EINTR for
 		 * threads in interruptible sleep, by suspending the
 		 * thread directly, similarly to sig_suspend_threads().
 		 * Since such sleep is not performed at the user
 		 * boundary, TDF_BOUNDARY flag is not set, and TDF_ALLPROCSUSP
 		 * is used to avoid immediate un-suspend.
 		 */
 		if (TD_IS_SUSPENDED(td2) && (td2->td_flags & (TDF_BOUNDARY |
 		    TDF_ALLPROCSUSP)) == 0)
 			wakeup_swapper |= thread_unsuspend_one(td2, p);
 		if (TD_ON_SLEEPQ(td2) && (td2->td_flags & TDF_SINTR) != 0) {
 			if ((td2->td_flags & TDF_SBDRY) == 0) {
 				thread_suspend_one(td2);
 				td2->td_flags |= TDF_ALLPROCSUSP;
 			} else {
 				wakeup_swapper |= sleepq_abort(td2, ERESTART);
 			}
 		}
 		break;
 	}
 	return (wakeup_swapper);
 }
 
 /*
  * Enforce single-threading.
  *
  * Returns 1 if the caller must abort (another thread is waiting to
  * exit the process or similar).  The process is locked!
  * Returns 0 when the caller is successfully the only thread running.
  * A process has successfully single-threaded in the suspend mode when
  * there are no threads in user mode.  Threads in the kernel must be
  * allowed to continue until they reach the user boundary; they may even
  * copy out their return values and data before suspending.  They may,
  * however, be hurried toward the user boundary, because any threads in
  * an interruptible (PCATCH) sleep are woken up.
  */
 int
 thread_single(struct proc *p, int mode)
 {
 	struct thread *td;
 	struct thread *td2;
 	int remaining, wakeup_swapper;
 
 	td = curthread;
 	KASSERT(mode == SINGLE_EXIT || mode == SINGLE_BOUNDARY ||
 	    mode == SINGLE_ALLPROC || mode == SINGLE_NO_EXIT,
 	    ("invalid mode %d", mode));
 	/*
 	 * If allowing non-ALLPROC singlethreading for non-curproc
 	 * callers, calc_remaining() and remain_for_mode() should be
 	 * adjusted to also account for td->td_proc != p.  For now
 	 * this is not implemented because it is not used.
 	 */
 	KASSERT((mode == SINGLE_ALLPROC && td->td_proc != p) ||
 	    (mode != SINGLE_ALLPROC && td->td_proc == p),
 	    ("mode %d proc %p curproc %p", mode, p, td->td_proc));
 	mtx_assert(&Giant, MA_NOTOWNED);
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 
 	if ((p->p_flag & P_HADTHREADS) == 0 && mode != SINGLE_ALLPROC)
 		return (0);
 
 	/* Is someone already single threading? */
 	if (p->p_singlethread != NULL && p->p_singlethread != td)
 		return (1);
 
 	if (mode == SINGLE_EXIT) {
 		p->p_flag |= P_SINGLE_EXIT;
 		p->p_flag &= ~P_SINGLE_BOUNDARY;
 	} else {
 		p->p_flag &= ~P_SINGLE_EXIT;
 		if (mode == SINGLE_BOUNDARY)
 			p->p_flag |= P_SINGLE_BOUNDARY;
 		else
 			p->p_flag &= ~P_SINGLE_BOUNDARY;
 	}
 	if (mode == SINGLE_ALLPROC)
 		p->p_flag |= P_TOTAL_STOP;
 	p->p_flag |= P_STOPPED_SINGLE;
 	PROC_SLOCK(p);
 	p->p_singlethread = td;
 	remaining = calc_remaining(p, mode);
 	while (remaining != remain_for_mode(mode)) {
 		if (P_SHOULDSTOP(p) != P_STOPPED_SINGLE)
 			goto stopme;
 		wakeup_swapper = 0;
 		FOREACH_THREAD_IN_PROC(p, td2) {
 			if (td2 == td)
 				continue;
 			thread_lock(td2);
 			td2->td_flags |= TDF_ASTPENDING | TDF_NEEDSUSPCHK;
 			if (TD_IS_INHIBITED(td2)) {
 				wakeup_swapper |= weed_inhib(mode, td2, p);
 #ifdef SMP
 			} else if (TD_IS_RUNNING(td2) && td != td2) {
 				forward_signal(td2);
 #endif
 			}
 			thread_unlock(td2);
 		}
 		if (wakeup_swapper)
 			kick_proc0();
 		remaining = calc_remaining(p, mode);
 
 		/*
 		 * Maybe we suspended some threads.. was it enough?
 		 */
 		if (remaining == remain_for_mode(mode))
 			break;
 
 stopme:
 		/*
 		 * Wake us up when everyone else has suspended.
 		 * In the meantime we suspend as well.
 		 */
 		thread_suspend_switch(td, p);
 		remaining = calc_remaining(p, mode);
 	}
 	if (mode == SINGLE_EXIT) {
 		/*
 		 * Convert the process to an unthreaded process.  The
 		 * SINGLE_EXIT is called by exit1() or execve(), in
 		 * both cases other threads must be retired.
 		 */
 		KASSERT(p->p_numthreads == 1, ("Unthreading with >1 threads"));
 		p->p_singlethread = NULL;
 		p->p_flag &= ~(P_STOPPED_SINGLE | P_SINGLE_EXIT | P_HADTHREADS);
 
 		/*
 		 * Wait for any remaining threads to exit cpu_throw().
 		 */
 		while (p->p_exitthreads != 0) {
 			PROC_SUNLOCK(p);
 			PROC_UNLOCK(p);
 			sched_relinquish(td);
 			PROC_LOCK(p);
 			PROC_SLOCK(p);
 		}
+	} else if (mode == SINGLE_BOUNDARY) {
+		/*
+		 * Wait until all suspended threads are removed from
+		 * the processors.  The thread_suspend_check()
+		 * increments p_boundary_count while it is still
+		 * running, which makes it possible for the execve()
+		 * to destroy vmspace while our other threads are
+		 * still using the address space.
+		 *
+		 * We lock the thread, which is only allowed to
+		 * succeed after context switch code finished using
+		 * the address space.
+		 */
+		FOREACH_THREAD_IN_PROC(p, td2) {
+			if (td2 == td)
+				continue;
+			thread_lock(td2);
+			KASSERT((td2->td_flags & TDF_BOUNDARY) != 0,
+			    ("td %p not on boundary", td2));
+			KASSERT(TD_IS_SUSPENDED(td2),
+			    ("td %p is not suspended", td2));
+			thread_unlock(td2);
+		}
 	}
 	PROC_SUNLOCK(p);
 	return (0);
 }
 
 bool
 thread_suspend_check_needed(void)
 {
 	struct proc *p;
 	struct thread *td;
 
 	td = curthread;
 	p = td->td_proc;
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	return (P_SHOULDSTOP(p) || ((p->p_flag & P_TRACED) != 0 &&
 	    (td->td_dbgflags & TDB_SUSPEND) != 0));
 }
 
 /*
  * Called from locations that can safely check to see
  * whether we have to suspend or at least throttle for a
  * single-thread event (e.g. fork).
  *
  * Such locations include userret().
  * If the "return_instead" argument is non zero, the thread must be able to
  * accept 0 (caller may continue), or 1 (caller must abort) as a result.
  *
  * The 'return_instead' argument tells the function if it may do a
  * thread_exit() or suspend, or whether the caller must abort and back
  * out instead.
  *
  * If the thread that set the single_threading request has set the
  * P_SINGLE_EXIT bit in the process flags then this call will never return
  * if 'return_instead' is false, but will exit.
  *
  * P_SINGLE_EXIT | return_instead == 0| return_instead != 0
  *---------------+--------------------+---------------------
  *       0       | returns 0          |   returns 0 or 1
  *               | when ST ends       |   immediately
  *---------------+--------------------+---------------------
  *       1       | thread exits       |   returns 1
  *               |                    |  immediately
  * 0 = thread_exit() or suspension ok,
  * other = return error instead of stopping the thread.
  *
  * While a full suspension is in effect, even a single threading
  * thread would be suspended if it made this call (but it shouldn't).
  * This call should only be made from places where
  * thread_exit() would be safe as that may be the outcome unless
  * return_instead is set.
  */
 int
 thread_suspend_check(int return_instead)
 {
 	struct thread *td;
 	struct proc *p;
 	int wakeup_swapper;
 
 	td = curthread;
 	p = td->td_proc;
 	mtx_assert(&Giant, MA_NOTOWNED);
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	while (thread_suspend_check_needed()) {
 		if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE) {
 			KASSERT(p->p_singlethread != NULL,
 			    ("singlethread not set"));
 			/*
 			 * The only suspension in action is a
 			 * single-threading. Single threader need not stop.
 			 * XXX Should be safe to access unlocked
 			 * as it can only be set to be true by us.
 			 */
 			if (p->p_singlethread == td)
 				return (0);	/* Exempt from stopping. */
 		}
 		if ((p->p_flag & P_SINGLE_EXIT) && return_instead)
 			return (EINTR);
 
 		/* Should we goto user boundary if we didn't come from there? */
 		if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE &&
 		    (p->p_flag & P_SINGLE_BOUNDARY) && return_instead)
 			return (ERESTART);
 
 		/*
 		 * Ignore suspend requests for stop signals if they
 		 * are deferred.
 		 */
 		if ((P_SHOULDSTOP(p) == P_STOPPED_SIG ||
 		    (p->p_flag & P_TOTAL_STOP) != 0) &&
 		    (td->td_flags & TDF_SBDRY) != 0) {
 			KASSERT(return_instead,
 			    ("TDF_SBDRY set for unsafe thread_suspend_check"));
 			return (0);
 		}
 
 		/*
 		 * If the process is waiting for us to exit,
 		 * this thread should just suicide.
 		 * Assumes that P_SINGLE_EXIT implies P_STOPPED_SINGLE.
 		 */
 		if ((p->p_flag & P_SINGLE_EXIT) && (p->p_singlethread != td)) {
 			PROC_UNLOCK(p);
 			tidhash_remove(td);
 			PROC_LOCK(p);
 			tdsigcleanup(td);
 			umtx_thread_exit(td);
 			PROC_SLOCK(p);
 			thread_stopped(p);
 			thread_exit();
 		}
 
 		PROC_SLOCK(p);
 		thread_stopped(p);
 		if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE) {
 			if (p->p_numthreads == p->p_suspcount + 1) {
 				thread_lock(p->p_singlethread);
 				wakeup_swapper =
 				    thread_unsuspend_one(p->p_singlethread, p);
 				thread_unlock(p->p_singlethread);
 				if (wakeup_swapper)
 					kick_proc0();
 			}
 		}
 		PROC_UNLOCK(p);
 		thread_lock(td);
 		/*
 		 * When a thread suspends, it just
 		 * gets taken off all queues.
 		 */
 		thread_suspend_one(td);
 		if (return_instead == 0) {
 			p->p_boundary_count++;
 			td->td_flags |= TDF_BOUNDARY;
 		}
 		PROC_SUNLOCK(p);
 		mi_switch(SW_INVOL | SWT_SUSPEND, NULL);
 		if (return_instead == 0)
 			td->td_flags &= ~TDF_BOUNDARY;
 		thread_unlock(td);
 		PROC_LOCK(p);
 		if (return_instead == 0) {
 			PROC_SLOCK(p);
 			p->p_boundary_count--;
 			PROC_SUNLOCK(p);
 		}
 	}
 	return (0);
 }
 
 void
 thread_suspend_switch(struct thread *td, struct proc *p)
 {
 
 	KASSERT(!TD_IS_SUSPENDED(td), ("already suspended"));
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	PROC_SLOCK_ASSERT(p, MA_OWNED);
 	/*
 	 * We implement thread_suspend_one in stages here to avoid
 	 * dropping the proc lock while the thread lock is owned.
 	 */
 	if (p == td->td_proc) {
 		thread_stopped(p);
 		p->p_suspcount++;
 	}
 	PROC_UNLOCK(p);
 	thread_lock(td);
 	td->td_flags &= ~TDF_NEEDSUSPCHK;
 	TD_SET_SUSPENDED(td);
 	sched_sleep(td, 0);
 	PROC_SUNLOCK(p);
 	DROP_GIANT();
 	mi_switch(SW_VOL | SWT_SUSPEND, NULL);
 	thread_unlock(td);
 	PICKUP_GIANT();
 	PROC_LOCK(p);
 	PROC_SLOCK(p);
 }
 
 void
 thread_suspend_one(struct thread *td)
 {
 	struct proc *p;
 
 	p = td->td_proc;
 	PROC_SLOCK_ASSERT(p, MA_OWNED);
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	KASSERT(!TD_IS_SUSPENDED(td), ("already suspended"));
 	p->p_suspcount++;
 	td->td_flags &= ~TDF_NEEDSUSPCHK;
 	TD_SET_SUSPENDED(td);
 	sched_sleep(td, 0);
 }
 
 int
 thread_unsuspend_one(struct thread *td, struct proc *p)
 {
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
 	KASSERT(TD_IS_SUSPENDED(td), ("Thread not suspended"));
 	TD_CLR_SUSPENDED(td);
 	td->td_flags &= ~TDF_ALLPROCSUSP;
 	if (td->td_proc == p) {
 		PROC_SLOCK_ASSERT(p, MA_OWNED);
 		p->p_suspcount--;
 	}
 	return (setrunnable(td));
 }
 
 /*
  * Allow all threads blocked by single threading to continue running.
  */
 void
 thread_unsuspend(struct proc *p)
 {
 	struct thread *td;
 	int wakeup_swapper;
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	PROC_SLOCK_ASSERT(p, MA_OWNED);
 	wakeup_swapper = 0;
 	if (!P_SHOULDSTOP(p)) {
 		FOREACH_THREAD_IN_PROC(p, td) {
 			thread_lock(td);
 			if (TD_IS_SUSPENDED(td)) {
 				wakeup_swapper |= thread_unsuspend_one(td, p);
 			}
 			thread_unlock(td);
 		}
 	} else if ((P_SHOULDSTOP(p) == P_STOPPED_SINGLE) &&
 	    (p->p_numthreads == p->p_suspcount)) {
 		/*
 		 * Stopping everything also did the job for the single
 		 * threading request. Now we've downgraded to single-threaded,
 		 * let it continue.
 		 */
 		if (p->p_singlethread->td_proc == p) {
 			thread_lock(p->p_singlethread);
 			wakeup_swapper = thread_unsuspend_one(
 			    p->p_singlethread, p);
 			thread_unlock(p->p_singlethread);
 		}
 	}
 	if (wakeup_swapper)
 		kick_proc0();
 }
 
 /*
  * End the single threading mode.
  */
 void
 thread_single_end(struct proc *p, int mode)
 {
 	struct thread *td;
 	int wakeup_swapper;
 
 	KASSERT(mode == SINGLE_EXIT || mode == SINGLE_BOUNDARY ||
 	    mode == SINGLE_ALLPROC || mode == SINGLE_NO_EXIT,
 	    ("invalid mode %d", mode));
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	KASSERT((mode == SINGLE_ALLPROC && (p->p_flag & P_TOTAL_STOP) != 0) ||
 	    (mode != SINGLE_ALLPROC && (p->p_flag & P_TOTAL_STOP) == 0),
 	    ("mode %d does not match P_TOTAL_STOP", mode));
 	p->p_flag &= ~(P_STOPPED_SINGLE | P_SINGLE_EXIT | P_SINGLE_BOUNDARY |
 	    P_TOTAL_STOP);
 	PROC_SLOCK(p);
 	p->p_singlethread = NULL;
 	wakeup_swapper = 0;
 	/*
 	 * If there are other threads they may now run,
 	 * unless of course there is a blanket 'stop order'
 	 * on the process. The single threader must be allowed
 	 * to continue however as this is a bad place to stop.
 	 */
 	if (p->p_numthreads != remain_for_mode(mode) && !P_SHOULDSTOP(p)) {
 		FOREACH_THREAD_IN_PROC(p, td) {
 			thread_lock(td);
 			if (TD_IS_SUSPENDED(td)) {
 				wakeup_swapper |= thread_unsuspend_one(td, p);
 			}
 			thread_unlock(td);
 		}
 	}
 	PROC_SUNLOCK(p);
 	if (wakeup_swapper)
 		kick_proc0();
 }
 
 struct thread *
 thread_find(struct proc *p, lwpid_t tid)
 {
 	struct thread *td;
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	FOREACH_THREAD_IN_PROC(p, td) {
 		if (td->td_tid == tid)
 			break;
 	}
 	return (td);
 }
 
 /* Locate a thread by number; return with proc lock held. */
 struct thread *
 tdfind(lwpid_t tid, pid_t pid)
 {
 #define RUN_THRESH	16
 	struct thread *td;
 	int run = 0;
 
 	rw_rlock(&tidhash_lock);
 	LIST_FOREACH(td, TIDHASH(tid), td_hash) {
 		if (td->td_tid == tid) {
 			if (pid != -1 && td->td_proc->p_pid != pid) {
 				td = NULL;
 				break;
 			}
 			PROC_LOCK(td->td_proc);
 			if (td->td_proc->p_state == PRS_NEW) {
 				PROC_UNLOCK(td->td_proc);
 				td = NULL;
 				break;
 			}
 			if (run > RUN_THRESH) {
 				if (rw_try_upgrade(&tidhash_lock)) {
 					LIST_REMOVE(td, td_hash);
 					LIST_INSERT_HEAD(TIDHASH(td->td_tid),
 						td, td_hash);
 					rw_wunlock(&tidhash_lock);
 					return (td);
 				}
 			}
 			break;
 		}
 		run++;
 	}
 	rw_runlock(&tidhash_lock);
 	return (td);
 }
 
 void
 tidhash_add(struct thread *td)
 {
 	rw_wlock(&tidhash_lock);
 	LIST_INSERT_HEAD(TIDHASH(td->td_tid), td, td_hash);
 	rw_wunlock(&tidhash_lock);
 }
 
 void
 tidhash_remove(struct thread *td)
 {
 	rw_wlock(&tidhash_lock);
 	LIST_REMOVE(td, td_hash);
 	rw_wunlock(&tidhash_lock);
 }
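Note on tdfind() in the kern_thread.c hunk above: when a lookup has to walk
past more than RUN_THRESH entries of a hash chain before it finds its thread,
the function tries to move that thread to the head of the chain so that later
lookups of the same TID are short. The sketch below shows only that
move-to-front heuristic, in userland C and without any locking. struct node
and chain_find() are hypothetical names, and a plain singly linked chain
stands in for the kernel's LIST macros and the tidhash read/write lock.

```c
#include <assert.h>
#include <stddef.h>

#define RUN_THRESH	16	/* threshold value copied from tdfind() */

/* Hypothetical stand-in for struct thread; only what the sketch needs. */
struct node {
	int		 id;
	struct node	*next;
};

/*
 * Find id in the chain rooted at *headp.  If more than RUN_THRESH entries
 * were skipped before the match, unlink the match and reinsert it at the
 * head of the chain.  Returns NULL if id is not present.
 */
static struct node *
chain_find(struct node **headp, int id)
{
	struct node **pp, *n;
	int run = 0;

	assert(headp != NULL);
	for (pp = headp; (n = *pp) != NULL; pp = &n->next) {
		if (n->id == id) {
			if (run > RUN_THRESH) {
				*pp = n->next;		/* unlink */
				n->next = *headp;	/* relink at head */
				*headp = n;
			}
			return (n);
		}
		run++;
	}
	return (NULL);
}
```

In the kernel version the reordering is opportunistic: it happens only if the
read lock can be upgraded without blocking, and the lookup returns the thread
either way. The sketch has no lock, so it always reorders once the threshold
is exceeded.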
Index: projects/release-arm-redux/sys/vm/uma.h
===================================================================
--- projects/release-arm-redux/sys/vm/uma.h	(revision 282691)
+++ projects/release-arm-redux/sys/vm/uma.h	(revision 282692)
@@ -1,693 +1,696 @@
 /*-
  * Copyright (c) 2002, 2003, 2004, 2005 Jeffrey Roberson <jeff@FreeBSD.org>
  * Copyright (c) 2004, 2005 Bosko Milekic <bmilekic@FreeBSD.org>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice unmodified, this list of conditions, and the following
  *    disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  * $FreeBSD$
  *
  */
 
 /*
  * uma.h - External definitions for the Universal Memory Allocator
  *
 */
 
 #ifndef _VM_UMA_H_
 #define _VM_UMA_H_
 
 #include <sys/param.h>		/* For NULL */
 #include <sys/malloc.h>		/* For M_* */
 
 /* User visible parameters */
 #define UMA_SMALLEST_UNIT       (PAGE_SIZE / 256) /* Smallest item allocated */
 
 /* Types and type defs */
 
 struct uma_zone;
 /* Opaque type used as a handle to the zone */
 typedef struct uma_zone * uma_zone_t;
 
 void zone_drain(uma_zone_t);
 
 /*
  * Item constructor
  *
  * Arguments:
  *	item  A pointer to the memory which has been allocated.
  *	arg   The arg field passed to uma_zalloc_arg
  *	size  The size of the allocated item
  *	flags See zalloc flags
  *
  * Returns:
  *	0      on success
  *      errno  on failure
  *
  * Discussion:
  *	The constructor is called just before the memory is returned
  *	to the user. It may block if necessary.
  */
 typedef int (*uma_ctor)(void *mem, int size, void *arg, int flags);
 
 /*
  * Item destructor
  *
  * Arguments:
  *	item  A pointer to the memory which has been allocated.
  *	size  The size of the item being destructed.
  *	arg   Argument passed through uma_zfree_arg
  *
  * Returns:
  *	Nothing
  *
  * Discussion:
  *	The destructor may perform operations that differ from those performed
  *	by the initializer, but it must leave the object in the same state.
  *	This IS type stable storage.  This is called after EVERY zfree call.
  */
 typedef void (*uma_dtor)(void *mem, int size, void *arg);
 
 /*
  * Item initializer
  *
  * Arguments:
  *	item  A pointer to the memory which has been allocated.
  *	size  The size of the item being initialized.
  *	flags See zalloc flags
  *
  * Returns:
  *	0      on success
  *      errno  on failure
  *
  * Discussion:
  *	The initializer is called when the memory is cached in the uma zone.
  *	The initializer and the destructor should leave the object in the same
  *	state.
  */
 typedef int (*uma_init)(void *mem, int size, int flags);
 
 /*
  * Item discard function
  *
  * Arguments:
  *	item  A pointer to memory which has been 'freed' but has not left the
  *	      zone's cache.
  *	size  The size of the item being discarded.
  *
  * Returns:
  *	Nothing
  *
  * Discussion:
  *	This routine is called when memory leaves a zone and is returned to the
  *	system for other uses.  It is the counter-part to the init function.
  */
 typedef void (*uma_fini)(void *mem, int size);
 
 /*
  * Import new memory into a cache zone.
  */
 typedef int (*uma_import)(void *arg, void **store, int count, int flags);
 
 /*
  * Free memory from a cache zone.
  */
 typedef void (*uma_release)(void *arg, void **store, int count);
 
 /*
  * What's the difference between initializing and constructing?
  *
  * The item is initialized when it is cached, and this is the state that the
  * object should be in when returned to the allocator. The purpose of this is
  * to remove some code which would otherwise be called on each allocation by
  * utilizing a known, stable state.  This differs from the constructor which
  * will be called on EVERY allocation.
  *
  * For example, in the initializer you may want to initialize embedded locks,
  * NULL list pointers, set up initial states, magic numbers, etc.  This way if
  * the object is held in the allocator and re-used it won't be necessary to
  * re-initialize it.
  *
  * The constructor may be used to lock a data structure, link it on to lists,
  * bump reference counts or total counts of outstanding structures, etc.
  *
  */
 
 
 /* Function prototypes */
 
 /*
  * Create a new uma zone
  *
  * Arguments:
  *	name  The text name of the zone for debugging and stats. This memory
  *		should not be freed until the zone has been deallocated.
  *	size  The size of the object that is being created.
  *	ctor  The constructor that is called when the object is allocated.
  *	dtor  The destructor that is called when the object is freed.
  *	init  An initializer that sets up the initial state of the memory.
  *	fini  A discard function that undoes initialization done by init.
  *		ctor/dtor/init/fini may all be null, see notes above.
  *	align A bitmask that corresponds to the requested alignment
  *		e.g. an alignment of 4 bytes would be 0x3
  *	flags A set of parameters that control the behavior of the zone.
  *
  * Returns:
  *	A pointer to a structure which is intended to be opaque to users of
  *	the interface.  The value may be null if the wait flag is not set.
  */
 uma_zone_t uma_zcreate(const char *name, size_t size, uma_ctor ctor,
 		    uma_dtor dtor, uma_init uminit, uma_fini fini,
 		    int align, uint32_t flags);
 
 /*
  * Create a secondary uma zone
  *
  * Arguments:
  *	name  The text name of the zone for debugging and stats. This memory
  *		should not be freed until the zone has been deallocated.
  *	ctor  The constructor that is called when the object is allocated.
  *	dtor  The destructor that is called when the object is freed.
  *	zinit  An initializer that sets up the initial state of the memory
  *		as the object passes from the Keg's slab to the Zone's cache.
  *	zfini  A discard function that undoes initialization done by init
  *		as the object passes from the Zone's cache to the Keg's slab.
  *
  *		ctor/dtor/zinit/zfini may all be null, see notes above.
  *		Note that the zinit and zfini specified here are NOT
  *		exactly the same as the init/fini specified to uma_zcreate()
  *		when creating a master zone.  These zinit/zfini are called
  *		on the TRANSITION from keg to zone (and vice-versa). Once
  *		these are set, the primary zone may alter its init/fini
  *		(which are called when the object passes from VM to keg)
  *		using uma_zone_set_init/fini() as well as its own
  *		zinit/zfini (unset by default for master zone) with
  *		uma_zone_set_zinit/zfini() (note subtle 'z' prefix).
  *
  *	master  A reference to this zone's Master Zone (Primary Zone),
  *		which contains the backing Keg for the Secondary Zone
  *		being added.
  *
  * Returns:
  *	A pointer to a structure which is intended to be opaque to users of
  *	the interface.  The value may be null if the wait flag is not set.
  */
 uma_zone_t uma_zsecond_create(char *name, uma_ctor ctor, uma_dtor dtor,
 		    uma_init zinit, uma_fini zfini, uma_zone_t master);
 
 /*
  * Add a second master to a secondary zone.  This provides multiple data
  * backends for objects with the same size.  Both masters must have
  * compatible allocation flags.  Presently, UMA_ZONE_MALLOC type zones are
  * the only ones supported.
  *
  * Returns:
  *	Error on failure, 0 on success.
  */
 int uma_zsecond_add(uma_zone_t zone, uma_zone_t master);
 
 /*
  * Create cache-only zones.
  *
  * This allows uma's per-cpu cache facilities to handle arbitrary
  * pointers.  Consumers must specify the import and release functions to
  * fill and destroy caches.  UMA does not allocate any memory for these
  * zones.  The 'arg' parameter is passed to import/release and is caller
  * specific.
  */
 uma_zone_t uma_zcache_create(char *name, int size, uma_ctor ctor, uma_dtor dtor,
 		    uma_init zinit, uma_fini zfini, uma_import zimport,
 		    uma_release zrelease, void *arg, int flags);
 
 /*
  * Definitions for uma_zcreate flags
  *
  * These flags share space with UMA_ZFLAGs in uma_int.h.  Be careful not to
  * overlap when adding new features.  0xf0000000 is in use by uma_int.h.
  */
 #define UMA_ZONE_PAGEABLE	0x0001	/* Return items not fully backed by
 					   physical memory XXX Not yet */
 #define UMA_ZONE_ZINIT		0x0002	/* Initialize with zeros */
 #define UMA_ZONE_STATIC		0x0004	/* Statically sized zone */
 #define UMA_ZONE_OFFPAGE	0x0008	/* Force the slab structure allocation
 					   off of the real memory */
 #define UMA_ZONE_MALLOC		0x0010	/* For use by malloc(9) only! */
 #define UMA_ZONE_NOFREE		0x0020	/* Do not free slabs of this type! */
 #define UMA_ZONE_MTXCLASS	0x0040	/* Create a new lock class */
 #define	UMA_ZONE_VM		0x0080	/*
 					 * Used for internal vm datastructures
 					 * only.
 					 */
 #define	UMA_ZONE_HASH		0x0100	/*
 					 * Use a hash table instead of caching
 					 * information in the vm_page.
 					 */
 #define	UMA_ZONE_SECONDARY	0x0200	/* Zone is a Secondary Zone */
 #define	UMA_ZONE_REFCNT		0x0400	/* Allocate refcnts in slabs */
 #define	UMA_ZONE_MAXBUCKET	0x0800	/* Use largest buckets */
 #define	UMA_ZONE_CACHESPREAD	0x1000	/*
 					 * Spread memory start locations across
 					 * all possible cache lines.  May
 					 * require many virtually contiguous
 					 * backend pages and can fail early.
 					 */
 #define	UMA_ZONE_VTOSLAB	0x2000	/* Zone uses vtoslab for lookup. */
 #define	UMA_ZONE_NODUMP		0x4000	/*
 					 * Zone's pages will not be included in
 					 * mini-dumps.
 					 */
 #define	UMA_ZONE_PCPU		0x8000	/*
 					 * Allocates mp_ncpus slabs sized to
 					 * sizeof(struct pcpu).
 					 */
 
 /*
  * These flags are shared between the keg and zone.  In zones wishing to add
  * new kegs these flags must be compatible.  Some are determined based on
  * physical parameters of the request and may not be provided by the consumer.
  */
 #define	UMA_ZONE_INHERIT						\
     (UMA_ZONE_OFFPAGE | UMA_ZONE_MALLOC | UMA_ZONE_NOFREE |		\
     UMA_ZONE_HASH | UMA_ZONE_REFCNT | UMA_ZONE_VTOSLAB | UMA_ZONE_PCPU)
 
 /* Definitions for align */
 #define UMA_ALIGN_PTR	(sizeof(void *) - 1)	/* Alignment fit for ptr */
 #define UMA_ALIGN_LONG	(sizeof(long) - 1)	/* "" long */
 #define UMA_ALIGN_INT	(sizeof(int) - 1)	/* "" int */
 #define UMA_ALIGN_SHORT	(sizeof(short) - 1)	/* "" short */
 #define UMA_ALIGN_CHAR	(sizeof(char) - 1)	/* "" char */
 #define UMA_ALIGN_CACHE	(0 - 1)			/* Cache line size align */
 
 /*
  * Destroys an empty uma zone.  If the zone is not empty uma complains loudly.
  *
  * Arguments:
  *	zone  The zone we want to destroy.
  *
  */
 void uma_zdestroy(uma_zone_t zone);
 
 /*
  * Allocates an item out of a zone
  *
  * Arguments:
  *	zone  The zone we are allocating from
  *	arg   This data is passed to the ctor function
  *	flags See sys/malloc.h for available flags.
  *
  * Returns:
  *	A non-null pointer to an initialized element from the zone is
  *	guaranteed if the wait flag is M_WAITOK.  Otherwise a null pointer
  *	may be returned if the zone is empty or the ctor failed.
  */
 
 void *uma_zalloc_arg(uma_zone_t zone, void *arg, int flags);
 
 /*
  * Allocates an item out of a zone without supplying an argument
  *
  * This is just a wrapper for uma_zalloc_arg for convenience.
  *
  */
 static __inline void *uma_zalloc(uma_zone_t zone, int flags);
 
 static __inline void *
 uma_zalloc(uma_zone_t zone, int flags)
 {
 	return uma_zalloc_arg(zone, NULL, flags);
 }
 
 /*
  * Frees an item back into the specified zone.
  *
  * Arguments:
  *	zone  The zone the item was originally allocated out of.
  *	item  The memory to be freed.
  *	arg   Argument passed to the destructor
  *
  * Returns:
  *	Nothing.
  */
 
 void uma_zfree_arg(uma_zone_t zone, void *item, void *arg);
 
 /*
  * Frees an item back to a zone without supplying an argument
  *
  * This is just a wrapper for uma_zfree_arg for convenience.
  *
  */
 static __inline void uma_zfree(uma_zone_t zone, void *item);
 
 static __inline void
 uma_zfree(uma_zone_t zone, void *item)
 {
 	uma_zfree_arg(zone, item, NULL);
 }
 
 /*
  * XXX The rest of the prototypes in this header are h0h0 magic for the VM.
  * If you think you need to use it for a normal zone you're probably incorrect.
  */
 
 /*
  * Backend page supplier routines
  *
  * Arguments:
  *	zone  The zone that is requesting pages.
  *	size  The number of bytes being requested.
  *	pflag Flags for these memory pages, see below.
  *	wait  Indicates our willingness to block.
  *
  * Returns:
  *	A pointer to the allocated memory or NULL on failure.
  */
 
 typedef void *(*uma_alloc)(uma_zone_t zone, vm_size_t size, uint8_t *pflag,
     int wait);
 
 /*
  * Backend page free routines
  *
  * Arguments:
  *	item  A pointer to the previously allocated pages.
  *	size  The original size of the allocation.
  *	pflag The flags for the slab.  See UMA_SLAB_* below.
  *
  * Returns:
  *	None
  */
 typedef void (*uma_free)(void *item, vm_size_t size, uint8_t pflag);
 
 
 
 /*
  * Sets up the uma allocator. (Called by vm_mem_init)
  *
  * Arguments:
  *	bootmem  A pointer to memory used to bootstrap the system.
  *
  * Returns:
  *	Nothing
  *
  * Discussion:
  *	This memory is used for zones which allocate things before the
  *	backend page supplier can give us pages.  It should be
  *	UMA_SLAB_SIZE * boot_pages bytes. (see uma_int.h)
  *
  */
 
 void uma_startup(void *bootmem, int boot_pages);
 
 /*
  * Finishes starting up the allocator.  This should
  * be called when kva is ready for normal allocs.
  *
  * Arguments:
  *	None
  *
  * Returns:
  *	Nothing
  *
  * Discussion:
  *	uma_startup2 is called by kmeminit() to enable use of uma for malloc.
  */
 
 void uma_startup2(void);
 
 /*
  * Reclaims unused memory for all zones
  *
  * Arguments:
  *	None
  * Returns:
  *	None
  *
  * This should only be called by the page out daemon.
  */
 
 void uma_reclaim(void);
 
 /*
  * Sets the alignment mask to be used for all zones requesting cache
  * alignment.  Should be called by MD boot code prior to starting VM/UMA.
  *
  * Arguments:
  *	align The alignment mask
  *
  * Returns:
  *	Nothing
  */
 void uma_set_align(int align);
 
 /*
  * Set a reserved number of items to hold for M_USE_RESERVE allocations.  All
  * other requests must allocate new backing pages.
  */
 void uma_zone_reserve(uma_zone_t zone, int nitems);
 
 /*
  * Reserves the maximum KVA space required by the zone and configures the zone
  * to use a VM_ALLOC_NOOBJ-based backend allocator.
  *
  * Arguments:
  *	zone  The zone to update.
  *	nitems  The upper limit on the number of items that can be allocated.
  *
  * Returns:
  *	0  if KVA space cannot be allocated
  *	1  if successful
  *
  * Discussion:
  *	When the machine supports a direct map and the zone's items are smaller
  *	than a page, the zone will use the direct map instead of allocating KVA
  *	space.
  */
 int uma_zone_reserve_kva(uma_zone_t zone, int nitems);
 
 /*
  * Sets a high limit on the number of items allowed in a zone
  *
  * Arguments:
  *	zone  The zone to limit
  *	nitems  The requested upper limit on the number of items allowed
  *
  * Returns:
  *	int  The effective value of nitems after rounding up based on page size
  */
 int uma_zone_set_max(uma_zone_t zone, int nitems);
 
 /*
  * Obtains the effective limit on the number of items in a zone
  *
  * Arguments:
  *	zone  The zone to obtain the effective limit from
  *
  * Return:
  *	0  No limit
  *	int  The effective limit of the zone
  */
 int uma_zone_get_max(uma_zone_t zone);
 
 /*
  * Sets a warning to be printed when limit is reached
  *
  * Arguments:
  *	zone  The zone we will warn about
  *	warning  Warning content
  *
  * Returns:
  *	Nothing
  */
 void uma_zone_set_warning(uma_zone_t zone, const char *warning);
 
 /*
  * Obtains the approximate current number of items allocated from a zone
  *
  * Arguments:
  *	zone  The zone to obtain the current allocation count from
  *
  * Return:
  *	int  The approximate current number of items allocated from the zone
  */
 int uma_zone_get_cur(uma_zone_t zone);
 
 /*
  * The following two routines (uma_zone_set_init/fini)
  * are used to set the backend init/fini pair which acts on an
  * object as it becomes allocated and is placed in a slab within
  * the specified zone's backing keg.  These should probably not
  * be changed once allocations have already begun, but only be set
  * immediately upon zone creation.
  */
 void uma_zone_set_init(uma_zone_t zone, uma_init uminit);
 void uma_zone_set_fini(uma_zone_t zone, uma_fini fini);
 
 /*
  * The following two routines (uma_zone_set_zinit/zfini) are
  * used to set the zinit/zfini pair which acts on an object as
  * it passes from the backing Keg's slab cache to the
  * specified Zone's bucket cache.  These should probably not
  * be changed once allocations have already begun, but only be set
  * immediately upon zone creation.
  */
 void uma_zone_set_zinit(uma_zone_t zone, uma_init zinit);
 void uma_zone_set_zfini(uma_zone_t zone, uma_fini zfini);
 
 /*
  * Replaces the standard backend allocator for this zone.
  *
  * Arguments:
  *	zone   The zone whose backend allocator is being changed.
  *	allocf A pointer to the allocation function
  *
  * Returns:
  *	Nothing
  *
  * Discussion:
  *	This could be used to implement pageable allocation, or perhaps
  *	even DMA allocators if used in conjunction with the OFFPAGE
  *	zone flag.
  */
 
 void uma_zone_set_allocf(uma_zone_t zone, uma_alloc allocf);
 
 /*
  * Used for freeing memory provided by the allocf above
  *
  * Arguments:
  *	zone  The zone that intends to use this free routine.
  *	freef The page freeing routine.
  *
  * Returns:
  *	Nothing
  */
 
 void uma_zone_set_freef(uma_zone_t zone, uma_free freef);
 
 /*
  * These flags are settable in the allocf and visible in the freef.
  */
 #define UMA_SLAB_BOOT	0x01		/* Slab alloced from boot pages */
 #define UMA_SLAB_KMEM	0x02		/* Slab alloced from kmem_map */
 #define UMA_SLAB_KERNEL	0x04		/* Slab alloced from kernel_map */
 #define UMA_SLAB_PRIV	0x08		/* Slab alloced from priv allocator */
 #define UMA_SLAB_OFFP	0x10		/* Slab is managed separately  */
 #define UMA_SLAB_MALLOC	0x20		/* Slab is a large malloc slab */
 /* 0x40 and 0x80 are available */
 
 /*
  * Used to pre-fill a zone with some number of items
  *
  * Arguments:
  *	zone    The zone to fill
  *	itemcnt The number of items to reserve
  *
  * Returns:
  *	Nothing
  *
  * NOTE: This is blocking and should only be done at startup
  */
 void uma_prealloc(uma_zone_t zone, int itemcnt);
 
 /*
  * Used to lookup the reference counter allocated for an item
  * from a UMA_ZONE_REFCNT zone.  For UMA_ZONE_REFCNT zones,
  * reference counters are allocated for items and stored in
  * the underlying slab header.
  *
  * Arguments:
  *	zone  The UMA_ZONE_REFCNT zone to which the item belongs.
  *	item  The address of the item for which we want a refcnt.
  *
  * Returns:
  *	A pointer to a uint32_t reference counter.
  */
 uint32_t *uma_find_refcnt(uma_zone_t zone, void *item);
 
 /*
  * Used to determine if a fixed-size zone is exhausted.
  *
  * Arguments:
  *	zone    The zone to check
  *
  * Returns:
  *	Non-zero if zone is exhausted.
  */
 int uma_zone_exhausted(uma_zone_t zone);
 int uma_zone_exhausted_nolock(uma_zone_t zone);
 
 /*
  * Common UMA_ZONE_PCPU zones.
  */
 extern uma_zone_t pcpu_zone_64;
 extern uma_zone_t pcpu_zone_ptr;
 
 /*
  * Exported statistics structures to be used by user space monitoring tools.
  * The statistics stream consists of a uma_stream_header, followed by a series
  * of alternating uma_type_header and uma_percpu_stat structures.
  */
 #define	UMA_STREAM_VERSION	0x00000001
 struct uma_stream_header {
 	uint32_t	ush_version;	/* Stream format version. */
 	uint32_t	ush_maxcpus;	/* Value of MAXCPU for stream. */
 	uint32_t	ush_count;	/* Number of records. */
 	uint32_t	_ush_pad;	/* Pad/reserved field. */
 };
 
 #define	UTH_MAX_NAME	32
 #define	UTH_ZONE_SECONDARY	0x00000001
 struct uma_type_header {
 	/*
 	 * Static per-zone data, some extracted from the supporting keg.
 	 */
 	char		uth_name[UTH_MAX_NAME];
 	uint32_t	uth_align;	/* Keg: alignment. */
 	uint32_t	uth_size;	/* Keg: requested size of item. */
 	uint32_t	uth_rsize;	/* Keg: real size of item. */
 	uint32_t	uth_maxpages;	/* Keg: maximum number of pages. */
 	uint32_t	uth_limit;	/* Keg: max items to allocate. */
 
 	/*
 	 * Current dynamic zone/keg-derived statistics.
 	 */
 	uint32_t	uth_pages;	/* Keg: pages allocated. */
 	uint32_t	uth_keg_free;	/* Keg: items free. */
 	uint32_t	uth_zone_free;	/* Zone: items free. */
 	uint32_t	uth_bucketsize;	/* Zone: desired bucket size. */
 	uint32_t	uth_zone_flags;	/* Zone: flags. */
 	uint64_t	uth_allocs;	/* Zone: number of allocations. */
 	uint64_t	uth_frees;	/* Zone: number of frees. */
 	uint64_t	uth_fails;	/* Zone: number of alloc failures. */
 	uint64_t	uth_sleeps;	/* Zone: number of alloc sleeps. */
 	uint64_t	_uth_reserved1[2];	/* Reserved. */
 };
 
 struct uma_percpu_stat {
 	uint64_t	ups_allocs;	/* Cache: number of allocations. */
 	uint64_t	ups_frees;	/* Cache: number of frees. */
 	uint64_t	ups_cache_free;	/* Cache: free items in cache. */
 	uint64_t	_ups_reserved[5];	/* Reserved. */
 };
 
+void uma_reclaim_wakeup(void);
+void uma_reclaim_worker(void *);
+
 #endif	/* _VM_UMA_H_ */
Index: projects/release-arm-redux/sys/vm/uma_core.c
===================================================================
--- projects/release-arm-redux/sys/vm/uma_core.c	(revision 282691)
+++ projects/release-arm-redux/sys/vm/uma_core.c	(revision 282692)
@@ -1,3621 +1,3655 @@
 /*-
  * Copyright (c) 2002-2005, 2009, 2013 Jeffrey Roberson <jeff@FreeBSD.org>
  * Copyright (c) 2004, 2005 Bosko Milekic <bmilekic@FreeBSD.org>
  * Copyright (c) 2004-2006 Robert N. M. Watson
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice unmodified, this list of conditions, and the following
  *    disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
  * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
  * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
  * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
 /*
  * uma_core.c  Implementation of the Universal Memory allocator
  *
  * This allocator is intended to replace the multitude of similar object caches
  * in the standard FreeBSD kernel.  The intent is to be flexible as well as
  * efficient.  A primary design goal is to return unused memory to the rest of
  * the system.  This will make the system as a whole more flexible due to the
  * ability to move memory to subsystems which most need it instead of leaving
  * pools of reserved memory unused.
  *
  * The basic ideas stem from similar slab/zone based allocators whose algorithms
  * are well known.
  *
  */
 
 /*
  * TODO:
  *	- Improve memory usage for large allocations
  *	- Investigate cache size adjustments
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 /* I should really use ktr.. */
 /*
 #define UMA_DEBUG 1
 #define UMA_DEBUG_ALLOC 1
 #define UMA_DEBUG_ALLOC_1 1
 */
 
 #include "opt_ddb.h"
 #include "opt_param.h"
 #include "opt_vm.h"
 
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bitset.h>
 #include <sys/kernel.h>
 #include <sys/types.h>
 #include <sys/queue.h>
 #include <sys/malloc.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/sysctl.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/random.h>
 #include <sys/rwlock.h>
 #include <sys/sbuf.h>
 #include <sys/sched.h>
 #include <sys/smp.h>
 #include <sys/vmmeter.h>
 
 #include <vm/vm.h>
 #include <vm/vm_object.h>
 #include <vm/vm_page.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_param.h>
 #include <vm/vm_map.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_extern.h>
 #include <vm/uma.h>
 #include <vm/uma_int.h>
 #include <vm/uma_dbg.h>
 
 #include <ddb/ddb.h>
 
 #ifdef DEBUG_MEMGUARD
 #include <vm/memguard.h>
 #endif
 
 /*
  * This is the zone and keg from which all zones are spawned.  The idea is that
  * even the zone & keg heads are allocated from the allocator, so we use the
  * bss section to bootstrap us.
  */
 static struct uma_keg masterkeg;
 static struct uma_zone masterzone_k;
 static struct uma_zone masterzone_z;
 static uma_zone_t kegs = &masterzone_k;
 static uma_zone_t zones = &masterzone_z;
 
 /* This is the zone from which all of uma_slab_t's are allocated. */
 static uma_zone_t slabzone;
 static uma_zone_t slabrefzone;	/* With refcounters (for UMA_ZONE_REFCNT) */
 
 /*
  * The initial hash tables come out of this zone so they can be allocated
  * prior to malloc coming up.
  */
 static uma_zone_t hashzone;
 
 /* The boot-time adjusted value for cache line alignment. */
 int uma_align_cache = 64 - 1;
 
 static MALLOC_DEFINE(M_UMAHASH, "UMAHash", "UMA Hash Buckets");
 
 /*
  * Are we allowed to allocate buckets?
  */
 static int bucketdisable = 1;
 
 /* Linked list of all kegs in the system */
 static LIST_HEAD(,uma_keg) uma_kegs = LIST_HEAD_INITIALIZER(uma_kegs);
 
 /* Linked list of all cache-only zones in the system */
 static LIST_HEAD(,uma_zone) uma_cachezones =
     LIST_HEAD_INITIALIZER(uma_cachezones);
 
 /* This RW lock protects the keg list */
 static struct rwlock_padalign uma_rwlock;
 
 /* Linked list of boot time pages */
 static LIST_HEAD(,uma_slab) uma_boot_pages =
     LIST_HEAD_INITIALIZER(uma_boot_pages);
 
 /* This mutex protects the boot time pages list */
 static struct mtx_padalign uma_boot_pages_mtx;
 
 static struct sx uma_drain_lock;
 
 /* Is the VM done starting up? */
 static int booted = 0;
 #define	UMA_STARTUP	1
 #define	UMA_STARTUP2	2
 
 /*
  * Only mbuf clusters use ref zones.  Just provide enough references
  * to support the one user.  New code should not use the ref facility.
  */
 static const u_int uma_max_ipers_ref = PAGE_SIZE / MCLBYTES;
 
 /*
  * This is the handle used to schedule events that need to happen
  * outside of the allocation fast path.
  */
 static struct callout uma_callout;
 #define	UMA_TIMEOUT	20		/* Seconds for callout interval. */
 
 /*
  * This structure is passed as the zone ctor arg so that I don't have to create
  * a special allocation function just for zones.
  */
 struct uma_zctor_args {
 	const char *name;
 	size_t size;
 	uma_ctor ctor;
 	uma_dtor dtor;
 	uma_init uminit;
 	uma_fini fini;
 	uma_import import;
 	uma_release release;
 	void *arg;
 	uma_keg_t keg;
 	int align;
 	uint32_t flags;
 };
 
 struct uma_kctor_args {
 	uma_zone_t zone;
 	size_t size;
 	uma_init uminit;
 	uma_fini fini;
 	int align;
 	uint32_t flags;
 };
 
 struct uma_bucket_zone {
 	uma_zone_t	ubz_zone;
 	char		*ubz_name;
 	int		ubz_entries;	/* Number of items it can hold. */
 	int		ubz_maxsize;	/* Maximum allocation size per-item. */
 };
 
 /*
  * Compute the actual number of bucket entries to pack them in power
  * of two sizes for more efficient space utilization.
  */
 #define	BUCKET_SIZE(n)						\
     (((sizeof(void *) * (n)) - sizeof(struct uma_bucket)) / sizeof(void *))
 
 #define	BUCKET_MAX	BUCKET_SIZE(256)
 
 struct uma_bucket_zone bucket_zones[] = {
 	{ NULL, "4 Bucket", BUCKET_SIZE(4), 4096 },
 	{ NULL, "6 Bucket", BUCKET_SIZE(6), 3072 },
 	{ NULL, "8 Bucket", BUCKET_SIZE(8), 2048 },
 	{ NULL, "12 Bucket", BUCKET_SIZE(12), 1536 },
 	{ NULL, "16 Bucket", BUCKET_SIZE(16), 1024 },
 	{ NULL, "32 Bucket", BUCKET_SIZE(32), 512 },
 	{ NULL, "64 Bucket", BUCKET_SIZE(64), 256 },
 	{ NULL, "128 Bucket", BUCKET_SIZE(128), 128 },
 	{ NULL, "256 Bucket", BUCKET_SIZE(256), 64 },
 	{ NULL, NULL, 0}
 };
 
 /*
  * Flags and enumerations to be passed to internal functions.
  */
 enum zfreeskip { SKIP_NONE = 0, SKIP_DTOR, SKIP_FINI };
 
 /* Prototypes.. */
 
 static void *noobj_alloc(uma_zone_t, vm_size_t, uint8_t *, int);
 static void *page_alloc(uma_zone_t, vm_size_t, uint8_t *, int);
 static void *startup_alloc(uma_zone_t, vm_size_t, uint8_t *, int);
 static void page_free(void *, vm_size_t, uint8_t);
 static uma_slab_t keg_alloc_slab(uma_keg_t, uma_zone_t, int);
 static void cache_drain(uma_zone_t);
 static void bucket_drain(uma_zone_t, uma_bucket_t);
 static void bucket_cache_drain(uma_zone_t zone);
 static int keg_ctor(void *, int, void *, int);
 static void keg_dtor(void *, int, void *);
 static int zone_ctor(void *, int, void *, int);
 static void zone_dtor(void *, int, void *);
 static int zero_init(void *, int, int);
 static void keg_small_init(uma_keg_t keg);
 static void keg_large_init(uma_keg_t keg);
 static void zone_foreach(void (*zfunc)(uma_zone_t));
 static void zone_timeout(uma_zone_t zone);
 static int hash_alloc(struct uma_hash *);
 static int hash_expand(struct uma_hash *, struct uma_hash *);
 static void hash_free(struct uma_hash *hash);
 static void uma_timeout(void *);
 static void uma_startup3(void);
 static void *zone_alloc_item(uma_zone_t, void *, int);
 static void zone_free_item(uma_zone_t, void *, void *, enum zfreeskip);
 static void bucket_enable(void);
 static void bucket_init(void);
 static uma_bucket_t bucket_alloc(uma_zone_t zone, void *, int);
 static void bucket_free(uma_zone_t zone, uma_bucket_t, void *);
 static void bucket_zone_drain(void);
 static uma_bucket_t zone_alloc_bucket(uma_zone_t zone, void *, int flags);
 static uma_slab_t zone_fetch_slab(uma_zone_t zone, uma_keg_t last, int flags);
 static uma_slab_t zone_fetch_slab_multi(uma_zone_t zone, uma_keg_t last, int flags);
 static void *slab_alloc_item(uma_keg_t keg, uma_slab_t slab);
 static void slab_free_item(uma_keg_t keg, uma_slab_t slab, void *item);
 static uma_keg_t uma_kcreate(uma_zone_t zone, size_t size, uma_init uminit,
     uma_fini fini, int align, uint32_t flags);
 static int zone_import(uma_zone_t zone, void **bucket, int max, int flags);
 static void zone_release(uma_zone_t zone, void **bucket, int cnt);
 static void uma_zero_item(void *item, uma_zone_t zone);
 
 void uma_print_zone(uma_zone_t);
 void uma_print_stats(void);
 static int sysctl_vm_zone_count(SYSCTL_HANDLER_ARGS);
 static int sysctl_vm_zone_stats(SYSCTL_HANDLER_ARGS);
 
 SYSINIT(uma_startup3, SI_SUB_VM_CONF, SI_ORDER_SECOND, uma_startup3, NULL);
 
 SYSCTL_PROC(_vm, OID_AUTO, zone_count, CTLFLAG_RD|CTLTYPE_INT,
     0, 0, sysctl_vm_zone_count, "I", "Number of UMA zones");
 
 SYSCTL_PROC(_vm, OID_AUTO, zone_stats, CTLFLAG_RD|CTLTYPE_STRUCT,
     0, 0, sysctl_vm_zone_stats, "s,struct uma_type_header", "Zone Stats");
 
 static int zone_warnings = 1;
 SYSCTL_INT(_vm, OID_AUTO, zone_warnings, CTLFLAG_RWTUN, &zone_warnings, 0,
     "Warn when UMA zones become full");
 
 /*
  * This routine checks to see whether or not it's safe to enable buckets.
  */
 static void
 bucket_enable(void)
 {
 	bucketdisable = vm_page_count_min();
 }
 
 /*
  * Initialize bucket_zones, the array of zones of buckets of various sizes.
  *
  * For each zone, calculate the memory required for each bucket, consisting
  * of the header and an array of pointers.
  */
 static void
 bucket_init(void)
 {
 	struct uma_bucket_zone *ubz;
 	int size;
 
 	for (ubz = &bucket_zones[0]; ubz->ubz_entries != 0; ubz++) {
 		size = roundup(sizeof(struct uma_bucket), sizeof(void *));
 		size += sizeof(void *) * ubz->ubz_entries;
 		ubz->ubz_zone = uma_zcreate(ubz->ubz_name, size,
 		    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR,
 		    UMA_ZONE_MTXCLASS | UMA_ZFLAG_BUCKET);
 	}
 }
 
 /*
  * Given a desired number of entries for a bucket, return the zone from which
  * to allocate the bucket.
  */
 static struct uma_bucket_zone *
 bucket_zone_lookup(int entries)
 {
 	struct uma_bucket_zone *ubz;
 
 	for (ubz = &bucket_zones[0]; ubz->ubz_entries != 0; ubz++)
 		if (ubz->ubz_entries >= entries)
 			return (ubz);
 	ubz--;
 	return (ubz);
 }
 
 static int
 bucket_select(int size)
 {
 	struct uma_bucket_zone *ubz;
 
 	ubz = &bucket_zones[0];
 	if (size > ubz->ubz_maxsize)
 		return MAX((ubz->ubz_maxsize * ubz->ubz_entries) / size, 1);
 
 	for (; ubz->ubz_entries != 0; ubz++)
 		if (ubz->ubz_maxsize < size)
 			break;
 	ubz--;
 	return (ubz->ubz_entries);
 }
 
 static uma_bucket_t
 bucket_alloc(uma_zone_t zone, void *udata, int flags)
 {
 	struct uma_bucket_zone *ubz;
 	uma_bucket_t bucket;
 
 	/*
 	 * This is to stop us from allocating per cpu buckets while we're
 	 * running out of vm.boot_pages.  Otherwise, we would exhaust the
 	 * boot pages.  This also prevents us from allocating buckets in
 	 * low memory situations.
 	 */
 	if (bucketdisable)
 		return (NULL);
 	/*
 	 * To limit bucket recursion we store the original zone flags
 	 * in a cookie passed via zalloc_arg/zfree_arg.  This allows the
 	 * NOVM flag to persist even through deep recursions.  We also
 	 * store ZFLAG_BUCKET once we have recursed attempting to allocate
 	 * a bucket for a bucket zone so we do not allow infinite bucket
 	 * recursion.  This cookie will even persist to frees of unused
 	 * buckets via the allocation path or bucket allocations in the
 	 * free path.
 	 */
 	if ((zone->uz_flags & UMA_ZFLAG_BUCKET) == 0)
 		udata = (void *)(uintptr_t)zone->uz_flags;
 	else {
 		if ((uintptr_t)udata & UMA_ZFLAG_BUCKET)
 			return (NULL);
 		udata = (void *)((uintptr_t)udata | UMA_ZFLAG_BUCKET);
 	}
 	if ((uintptr_t)udata & UMA_ZFLAG_CACHEONLY)
 		flags |= M_NOVM;
 	ubz = bucket_zone_lookup(zone->uz_count);
 	if (ubz->ubz_zone == zone && (ubz + 1)->ubz_entries != 0)
 		ubz++;
 	bucket = uma_zalloc_arg(ubz->ubz_zone, udata, flags);
 	if (bucket) {
 #ifdef INVARIANTS
 		bzero(bucket->ub_bucket, sizeof(void *) * ubz->ubz_entries);
 #endif
 		bucket->ub_cnt = 0;
 		bucket->ub_entries = ubz->ubz_entries;
 	}
 
 	return (bucket);
 }
 
 static void
 bucket_free(uma_zone_t zone, uma_bucket_t bucket, void *udata)
 {
 	struct uma_bucket_zone *ubz;
 
 	KASSERT(bucket->ub_cnt == 0,
 	    ("bucket_free: Freeing a non free bucket."));
 	if ((zone->uz_flags & UMA_ZFLAG_BUCKET) == 0)
 		udata = (void *)(uintptr_t)zone->uz_flags;
 	ubz = bucket_zone_lookup(bucket->ub_entries);
 	uma_zfree_arg(ubz->ubz_zone, bucket, udata);
 }
 
 static void
 bucket_zone_drain(void)
 {
 	struct uma_bucket_zone *ubz;
 
 	for (ubz = &bucket_zones[0]; ubz->ubz_entries != 0; ubz++)
 		zone_drain(ubz->ubz_zone);
 }
 
 static void
 zone_log_warning(uma_zone_t zone)
 {
 	static const struct timeval warninterval = { 300, 0 };
 
 	if (!zone_warnings || zone->uz_warning == NULL)
 		return;
 
 	if (ratecheck(&zone->uz_ratecheck, &warninterval))
 		printf("[zone: %s] %s\n", zone->uz_name, zone->uz_warning);
 }
 
 static void
 zone_foreach_keg(uma_zone_t zone, void (*kegfn)(uma_keg_t))
 {
 	uma_klink_t klink;
 
 	LIST_FOREACH(klink, &zone->uz_kegs, kl_link)
 		kegfn(klink->kl_keg);
 }
 
 /*
  * Routine called by timeout which is used to fire off some time interval
  * based calculations.  (stats, hash size, etc.)
  *
  * Arguments:
  *	unused  Unused
  *
  * Returns:
  *	Nothing
  */
 static void
 uma_timeout(void *unused)
 {
 	bucket_enable();
 	zone_foreach(zone_timeout);
 
 	/* Reschedule this event */
 	callout_reset(&uma_callout, UMA_TIMEOUT * hz, uma_timeout, NULL);
 }
 
 /*
  * Routine to perform timeout driven calculations.  This expands the
  * hashes and does per cpu statistics aggregation.
  *
  *  Returns nothing.
  */
 static void
 keg_timeout(uma_keg_t keg)
 {
 
 	KEG_LOCK(keg);
 	/*
 	 * Expand the keg hash table.
 	 *
 	 * This is done if the number of slabs is larger than the hash size.
 	 * What I'm trying to do here is completely reduce collisions.  This
 	 * may be a little aggressive.  Should I allow for two collisions max?
 	 */
 	if (keg->uk_flags & UMA_ZONE_HASH &&
 	    keg->uk_pages / keg->uk_ppera >= keg->uk_hash.uh_hashsize) {
 		struct uma_hash newhash;
 		struct uma_hash oldhash;
 		int ret;
 
 		/*
 		 * This is so involved because allocating and freeing
 		 * while the keg lock is held will lead to deadlock.
 		 * I have to do everything in stages and check for
 		 * races.
 		 */
 		newhash = keg->uk_hash;
 		KEG_UNLOCK(keg);
 		ret = hash_alloc(&newhash);
 		KEG_LOCK(keg);
 		if (ret) {
 			if (hash_expand(&keg->uk_hash, &newhash)) {
 				oldhash = keg->uk_hash;
 				keg->uk_hash = newhash;
 			} else
 				oldhash = newhash;
 
 			KEG_UNLOCK(keg);
 			hash_free(&oldhash);
 			return;
 		}
 	}
 	KEG_UNLOCK(keg);
 }
 
 static void
 zone_timeout(uma_zone_t zone)
 {
 
 	zone_foreach_keg(zone, &keg_timeout);
 }
 
 /*
  * Allocate and zero fill the next sized hash table from the appropriate
  * backing store.
  *
  * Arguments:
  *	hash  A new hash structure with the old hash size in uh_hashsize
  *
  * Returns:
  *	1 on success and 0 on failure.
  */
 static int
 hash_alloc(struct uma_hash *hash)
 {
 	int oldsize;
 	int alloc;
 
 	oldsize = hash->uh_hashsize;
 
 	/* We're just going to go to a power of two greater */
 	if (oldsize)  {
 		hash->uh_hashsize = oldsize * 2;
 		alloc = sizeof(hash->uh_slab_hash[0]) * hash->uh_hashsize;
 		hash->uh_slab_hash = (struct slabhead *)malloc(alloc,
 		    M_UMAHASH, M_NOWAIT);
 	} else {
 		alloc = sizeof(hash->uh_slab_hash[0]) * UMA_HASH_SIZE_INIT;
 		hash->uh_slab_hash = zone_alloc_item(hashzone, NULL,
 		    M_WAITOK);
 		hash->uh_hashsize = UMA_HASH_SIZE_INIT;
 	}
 	if (hash->uh_slab_hash) {
 		bzero(hash->uh_slab_hash, alloc);
 		hash->uh_hashmask = hash->uh_hashsize - 1;
 		return (1);
 	}
 
 	return (0);
 }
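 
 /*
  * Worked example for hash_alloc() (illustrative values, not taken from
  * this file): if uh_hashsize is 32 on entry, the new table has 64 list
  * heads, alloc is 64 * sizeof(hash->uh_slab_hash[0]) bytes, and
  * uh_hashmask becomes 63 (0x3f).  The mask is only a valid substitute for
  * a modulo if the size is a power of two, which assumes that
  * UMA_HASH_SIZE_INIT (defined elsewhere) is itself a power of two.
  */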
 
 /*
  * Expands the hash table for HASH zones.  This is done from zone_timeout
  * to reduce collisions.  This must not be done in the regular allocation
  * path, otherwise, we can recurse on the vm while allocating pages.
  *
  * Arguments:
  *	oldhash  The hash you want to expand
  *	newhash  The hash structure for the new table
  *
  * Returns:
  *	1 if the slabs were moved into newhash, 0 if newhash has no table
  *	or is not larger than oldhash.
  */
 static int
 hash_expand(struct uma_hash *oldhash, struct uma_hash *newhash)
 {
 	uma_slab_t slab;
 	int hval;
 	int i;
 
 	if (!newhash->uh_slab_hash)
 		return (0);
 
 	if (oldhash->uh_hashsize >= newhash->uh_hashsize)
 		return (0);
 
 	/*
 	 * I need to investigate hash algorithms for resizing without a
 	 * full rehash.
 	 */
 
 	for (i = 0; i < oldhash->uh_hashsize; i++)
 		while (!SLIST_EMPTY(&oldhash->uh_slab_hash[i])) {
 			slab = SLIST_FIRST(&oldhash->uh_slab_hash[i]);
 			SLIST_REMOVE_HEAD(&oldhash->uh_slab_hash[i], us_hlink);
 			hval = UMA_HASH(newhash, slab->us_data);
 			SLIST_INSERT_HEAD(&newhash->uh_slab_hash[hval],
 			    slab, us_hlink);
 		}
 
 	return (1);
 }
 
 /*
  * Free the hash bucket to the appropriate backing store.
  *
  * Arguments:
  *	hash  The hash whose table we're freeing; uh_hashsize selects the
  *	      backing store (hashzone or malloc)
  *
  * Returns:
  *	Nothing
  */
 static void
 hash_free(struct uma_hash *hash)
 {
 	if (hash->uh_slab_hash == NULL)
 		return;
 	if (hash->uh_hashsize == UMA_HASH_SIZE_INIT)
 		zone_free_item(hashzone, hash->uh_slab_hash, NULL, SKIP_NONE);
 	else
 		free(hash->uh_slab_hash, M_UMAHASH);
 }
 
 /*
  * Frees all outstanding items in a bucket
  *
  * Arguments:
  *	zone   The zone to free to, must be unlocked.
  *	bucket The free/alloc bucket with items, cpu queue must be locked.
  *
  * Returns:
  *	Nothing
  */
 
 static void
 bucket_drain(uma_zone_t zone, uma_bucket_t bucket)
 {
 	int i;
 
 	if (bucket == NULL)
 		return;
 
 	if (zone->uz_fini)
 		for (i = 0; i < bucket->ub_cnt; i++)
 			zone->uz_fini(bucket->ub_bucket[i], zone->uz_size);
 	zone->uz_release(zone->uz_arg, bucket->ub_bucket, bucket->ub_cnt);
 	bucket->ub_cnt = 0;
 }
 
 /*
  * Drains the per cpu caches for a zone.
  *
  * NOTE: This may only be called while the zone is being torn down, and not
  * during normal operation.  This is necessary in order that we do not have
  * to migrate CPUs to drain the per-CPU caches.
  *
  * Arguments:
  *	zone     The zone to drain, must be unlocked.
  *
  * Returns:
  *	Nothing
  */
 static void
 cache_drain(uma_zone_t zone)
 {
 	uma_cache_t cache;
 	int cpu;
 
 	/*
 	 * XXX: It is safe to not lock the per-CPU caches, because we're
 	 * tearing down the zone anyway.  I.e., there will be no further use
 	 * of the caches at this point.
 	 *
 	 * XXX: It would be good to be able to assert that the zone is being
 	 * torn down to prevent improper use of cache_drain().
 	 *
 	 * XXX: We lock the zone before passing into bucket_cache_drain() as
 	 * it is used elsewhere.  Should the tear-down path be made special
 	 * there in some form?
 	 */
 	CPU_FOREACH(cpu) {
 		cache = &zone->uz_cpu[cpu];
 		bucket_drain(zone, cache->uc_allocbucket);
 		bucket_drain(zone, cache->uc_freebucket);
 		if (cache->uc_allocbucket != NULL)
 			bucket_free(zone, cache->uc_allocbucket, NULL);
 		if (cache->uc_freebucket != NULL)
 			bucket_free(zone, cache->uc_freebucket, NULL);
 		cache->uc_allocbucket = cache->uc_freebucket = NULL;
 	}
 	ZONE_LOCK(zone);
 	bucket_cache_drain(zone);
 	ZONE_UNLOCK(zone);
 }
 
 static void
 cache_shrink(uma_zone_t zone)
 {
 
 	if (zone->uz_flags & UMA_ZFLAG_INTERNAL)
 		return;
 
 	ZONE_LOCK(zone);
 	zone->uz_count = (zone->uz_count_min + zone->uz_count) / 2;
 	ZONE_UNLOCK(zone);
 }
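 
 /*
  * Worked example for cache_shrink() (hypothetical values): with
  * uz_count_min = 8 and uz_count = 64, successive calls yield
  * 36, 22, 15, 11, 9, 8 and then stay at 8, because integer division
  * rounds down and (8 + 8) / 2 == 8.  Each call thus roughly halves the
  * distance to uz_count_min without going below it.
  */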
 
 static void
 cache_drain_safe_cpu(uma_zone_t zone)
 {
 	uma_cache_t cache;
 	uma_bucket_t b1, b2;
 
 	if (zone->uz_flags & UMA_ZFLAG_INTERNAL)
 		return;
 
 	b1 = b2 = NULL;
 	ZONE_LOCK(zone);
 	critical_enter();
 	cache = &zone->uz_cpu[curcpu];
 	if (cache->uc_allocbucket) {
 		if (cache->uc_allocbucket->ub_cnt != 0)
 			LIST_INSERT_HEAD(&zone->uz_buckets,
 			    cache->uc_allocbucket, ub_link);
 		else
 			b1 = cache->uc_allocbucket;
 		cache->uc_allocbucket = NULL;
 	}
 	if (cache->uc_freebucket) {
 		if (cache->uc_freebucket->ub_cnt != 0)
 			LIST_INSERT_HEAD(&zone->uz_buckets,
 			    cache->uc_freebucket, ub_link);
 		else
 			b2 = cache->uc_freebucket;
 		cache->uc_freebucket = NULL;
 	}
 	critical_exit();
 	ZONE_UNLOCK(zone);
 	if (b1)
 		bucket_free(zone, b1, NULL);
 	if (b2)
 		bucket_free(zone, b2, NULL);
 }
 
 /*
  * Safely drain the per-CPU caches of one zone, or of all zones if zone is
  * NULL, onto the zone bucket list.
  * This is an expensive call because it needs to bind to all CPUs
  * one by one and enter a critical section on each of them in order
  * to safely access their cache buckets.
  * The zone lock must not be held when calling this function.
  */
 static void
 cache_drain_safe(uma_zone_t zone)
 {
 	int cpu;
 
 	/*
 	 * Polite bucket size shrinking was not enough, shrink aggressively.
 	 */
 	if (zone)
 		cache_shrink(zone);
 	else
 		zone_foreach(cache_shrink);
 
 	CPU_FOREACH(cpu) {
 		thread_lock(curthread);
 		sched_bind(curthread, cpu);
 		thread_unlock(curthread);
 
 		if (zone)
 			cache_drain_safe_cpu(zone);
 		else
 			zone_foreach(cache_drain_safe_cpu);
 	}
 	thread_lock(curthread);
 	sched_unbind(curthread);
 	thread_unlock(curthread);
 }
 
 /*
  * Drain the cached buckets from a zone.  Expects a locked zone on entry.
  */
 static void
 bucket_cache_drain(uma_zone_t zone)
 {
 	uma_bucket_t bucket;
 
 	/*
 	 * Drain the bucket queues and free the buckets, we just keep two per
 	 * cpu (alloc/free).
 	 */
 	while ((bucket = LIST_FIRST(&zone->uz_buckets)) != NULL) {
 		LIST_REMOVE(bucket, ub_link);
 		ZONE_UNLOCK(zone);
 		bucket_drain(zone, bucket);
 		bucket_free(zone, bucket, NULL);
 		ZONE_LOCK(zone);
 	}
 
 	/*
 	 * Shrink bucket sizes further.  The price of a single zone lock
 	 * collision is probably lower than the price of a global cache drain.
 	 */
 	if (zone->uz_count > zone->uz_count_min)
 		zone->uz_count--;
 }
 
 static void
 keg_free_slab(uma_keg_t keg, uma_slab_t slab, int start)
 {
 	uint8_t *mem;
 	int i;
 	uint8_t flags;
 
 	mem = slab->us_data;
 	flags = slab->us_flags;
 	i = start;
 	if (keg->uk_fini != NULL) {
 		for (i--; i > -1; i--)
 			keg->uk_fini(slab->us_data + (keg->uk_rsize * i),
 			    keg->uk_size);
 	}
 	if (keg->uk_flags & UMA_ZONE_OFFPAGE)
 		zone_free_item(keg->uk_slabzone, slab, NULL, SKIP_NONE);
 #ifdef UMA_DEBUG
 	printf("%s: Returning %d bytes.\n", keg->uk_name,
 	    PAGE_SIZE * keg->uk_ppera);
 #endif
 	keg->uk_freef(mem, PAGE_SIZE * keg->uk_ppera, flags);
 }
 
 /*
  * Frees pages from a keg back to the system.  This is done on demand from
  * the pageout daemon.
  *
  * Returns nothing.
  */
 static void
 keg_drain(uma_keg_t keg)
 {
 	struct slabhead freeslabs = { 0 };
 	uma_slab_t slab;
 	uma_slab_t n;
 
 	/*
 	 * We don't want to take pages from statically allocated kegs at this
 	 * time
 	 */
 	if (keg->uk_flags & UMA_ZONE_NOFREE || keg->uk_freef == NULL)
 		return;
 
 #ifdef UMA_DEBUG
 	printf("%s free items: %u\n", keg->uk_name, keg->uk_free);
 #endif
 	KEG_LOCK(keg);
 	if (keg->uk_free == 0)
 		goto finished;
 
 	slab = LIST_FIRST(&keg->uk_free_slab);
 	while (slab) {
 		n = LIST_NEXT(slab, us_link);
 
 		/* We have nowhere to free these to. */
 		if (slab->us_flags & UMA_SLAB_BOOT) {
 			slab = n;
 			continue;
 		}
 
 		LIST_REMOVE(slab, us_link);
 		keg->uk_pages -= keg->uk_ppera;
 		keg->uk_free -= keg->uk_ipers;
 
 		if (keg->uk_flags & UMA_ZONE_HASH)
 			UMA_HASH_REMOVE(&keg->uk_hash, slab, slab->us_data);
 
 		SLIST_INSERT_HEAD(&freeslabs, slab, us_hlink);
 
 		slab = n;
 	}
 finished:
 	KEG_UNLOCK(keg);
 
 	while ((slab = SLIST_FIRST(&freeslabs)) != NULL) {
 		SLIST_REMOVE(&freeslabs, slab, uma_slab, us_hlink);
 		keg_free_slab(keg, slab, keg->uk_ipers);
 	}
 }
 
 static void
 zone_drain_wait(uma_zone_t zone, int waitok)
 {
 
 	/*
 	 * Set draining to interlock with zone_dtor() so we can release our
 	 * locks as we go.  Only dtor() should do a WAITOK call since it
 	 * is the only call that knows the structure will still be available
 	 * when it wakes up.
 	 */
 	ZONE_LOCK(zone);
 	while (zone->uz_flags & UMA_ZFLAG_DRAINING) {
 		if (waitok == M_NOWAIT)
 			goto out;
 		msleep(zone, zone->uz_lockptr, PVM, "zonedrain", 1);
 	}
 	zone->uz_flags |= UMA_ZFLAG_DRAINING;
 	bucket_cache_drain(zone);
 	ZONE_UNLOCK(zone);
 	/*
 	 * The DRAINING flag protects us from being freed while
 	 * we're running.  Normally the uma_rwlock would protect us but we
 	 * must be able to release and acquire the right lock for each keg.
 	 */
 	zone_foreach_keg(zone, &keg_drain);
 	ZONE_LOCK(zone);
 	zone->uz_flags &= ~UMA_ZFLAG_DRAINING;
 	wakeup(zone);
 out:
 	ZONE_UNLOCK(zone);
 }
 
 void
 zone_drain(uma_zone_t zone)
 {
 
 	zone_drain_wait(zone, M_NOWAIT);
 }
 
 /*
  * Allocate a new slab for a keg.  This does not insert the slab onto a list.
  *
  * Arguments:
  *	wait  Shall we wait?
  *
  * Returns:
  *	The slab that was allocated or NULL if there is no memory and the
  *	caller specified M_NOWAIT.
  */
 static uma_slab_t
 keg_alloc_slab(uma_keg_t keg, uma_zone_t zone, int wait)
 {
 	uma_slabrefcnt_t slabref;
 	uma_alloc allocf;
 	uma_slab_t slab;
 	uint8_t *mem;
 	uint8_t flags;
 	int i;
 
 	mtx_assert(&keg->uk_lock, MA_OWNED);
 	slab = NULL;
 	mem = NULL;
 
 #ifdef UMA_DEBUG
 	printf("alloc_slab:  Allocating a new slab for %s\n", keg->uk_name);
 #endif
 	allocf = keg->uk_allocf;
 	KEG_UNLOCK(keg);
 
 	if (keg->uk_flags & UMA_ZONE_OFFPAGE) {
 		slab = zone_alloc_item(keg->uk_slabzone, NULL, wait);
 		if (slab == NULL)
 			goto out;
 	}
 
 	/*
 	 * This reproduces the old vm_zone behavior of zero filling pages the
 	 * first time they are added to a zone.
 	 *
 	 * Malloced items are zeroed in uma_zalloc.
 	 */
 
 	if ((keg->uk_flags & UMA_ZONE_MALLOC) == 0)
 		wait |= M_ZERO;
 	else
 		wait &= ~M_ZERO;
 
 	if (keg->uk_flags & UMA_ZONE_NODUMP)
 		wait |= M_NODUMP;
 
 	/* zone is passed for legacy reasons. */
 	mem = allocf(zone, keg->uk_ppera * PAGE_SIZE, &flags, wait);
 	if (mem == NULL) {
 		if (keg->uk_flags & UMA_ZONE_OFFPAGE)
 			zone_free_item(keg->uk_slabzone, slab, NULL, SKIP_NONE);
 		slab = NULL;
 		goto out;
 	}
 
 	/* Point the slab into the allocated memory */
 	if (!(keg->uk_flags & UMA_ZONE_OFFPAGE))
 		slab = (uma_slab_t)(mem + keg->uk_pgoff);
 
 	if (keg->uk_flags & UMA_ZONE_VTOSLAB)
 		for (i = 0; i < keg->uk_ppera; i++)
 			vsetslab((vm_offset_t)mem + (i * PAGE_SIZE), slab);
 
 	slab->us_keg = keg;
 	slab->us_data = mem;
 	slab->us_freecount = keg->uk_ipers;
 	slab->us_flags = flags;
 	BIT_FILL(SLAB_SETSIZE, &slab->us_free);
 #ifdef INVARIANTS
 	BIT_ZERO(SLAB_SETSIZE, &slab->us_debugfree);
 #endif
 	if (keg->uk_flags & UMA_ZONE_REFCNT) {
 		slabref = (uma_slabrefcnt_t)slab;
 		for (i = 0; i < keg->uk_ipers; i++)
 			slabref->us_refcnt[i] = 0;
 	}
 
 	if (keg->uk_init != NULL) {
 		for (i = 0; i < keg->uk_ipers; i++)
 			if (keg->uk_init(slab->us_data + (keg->uk_rsize * i),
 			    keg->uk_size, wait) != 0)
 				break;
 		if (i != keg->uk_ipers) {
 			keg_free_slab(keg, slab, i);
 			slab = NULL;
 			goto out;
 		}
 	}
 out:
 	KEG_LOCK(keg);
 
 	if (slab != NULL) {
 		if (keg->uk_flags & UMA_ZONE_HASH)
 			UMA_HASH_INSERT(&keg->uk_hash, slab, mem);
 
 		keg->uk_pages += keg->uk_ppera;
 		keg->uk_free += keg->uk_ipers;
 	}
 
 	return (slab);
 }
 
 /*
  * This function is intended to be used early on in place of page_alloc() so
  * that we may use the boot time page cache to satisfy allocations before
  * the VM is ready.
  */
 static void *
 startup_alloc(uma_zone_t zone, vm_size_t bytes, uint8_t *pflag, int wait)
 {
 	uma_keg_t keg;
 	uma_slab_t tmps;
 	int pages, check_pages;
 
 	keg = zone_first_keg(zone);
 	pages = howmany(bytes, PAGE_SIZE);
 	check_pages = pages - 1;
 	KASSERT(pages > 0, ("startup_alloc can't reserve 0 pages\n"));
 
 	/*
 	 * Check our small startup cache to see if it has pages remaining.
 	 */
 	mtx_lock(&uma_boot_pages_mtx);
 
 	/* First check if we have enough room. */
 	tmps = LIST_FIRST(&uma_boot_pages);
 	while (tmps != NULL && check_pages-- > 0)
 		tmps = LIST_NEXT(tmps, us_link);
 	if (tmps != NULL) {
 		/*
 		 * It's ok to lose tmps references.  The last one will
 		 * have tmps->us_data pointing to the start address of
 		 * "pages" contiguous pages of memory.
 		 */
 		while (pages-- > 0) {
 			tmps = LIST_FIRST(&uma_boot_pages);
 			LIST_REMOVE(tmps, us_link);
 		}
 		mtx_unlock(&uma_boot_pages_mtx);
 		*pflag = tmps->us_flags;
 		return (tmps->us_data);
 	}
 	mtx_unlock(&uma_boot_pages_mtx);
 	if (booted < UMA_STARTUP2)
 		panic("UMA: Increase vm.boot_pages");
 	/*
 	 * Now that we've booted reset these users to their real allocator.
 	 */
 #ifdef UMA_MD_SMALL_ALLOC
 	keg->uk_allocf = (keg->uk_ppera > 1) ? page_alloc : uma_small_alloc;
 #else
 	keg->uk_allocf = page_alloc;
 #endif
 	return (keg->uk_allocf(zone, bytes, pflag, wait));
 }
 
 /*
  * Allocates a number of pages from the system
  *
  * Arguments:
  *	bytes  The number of bytes requested
  *	wait  Shall we wait?
  *
  * Returns:
  *	A pointer to the alloced memory or possibly
  *	NULL if M_NOWAIT is set.
  */
 static void *
 page_alloc(uma_zone_t zone, vm_size_t bytes, uint8_t *pflag, int wait)
 {
 	void *p;	/* Returned page */
 
 	*pflag = UMA_SLAB_KMEM;
 	p = (void *)kmem_malloc(kmem_arena, bytes, wait);
 
 	return (p);
 }
 
 /*
  * Allocates a number of pages from within an object
  *
  * Arguments:
  *	bytes  The number of bytes requested
  *	wait   Shall we wait?
  *
  * Returns:
  *	A pointer to the alloced memory or possibly
  *	NULL if M_NOWAIT is set.
  */
 static void *
 noobj_alloc(uma_zone_t zone, vm_size_t bytes, uint8_t *flags, int wait)
 {
 	TAILQ_HEAD(, vm_page) alloctail;
 	u_long npages;
 	vm_offset_t retkva, zkva;
 	vm_page_t p, p_next;
 	uma_keg_t keg;
 
 	TAILQ_INIT(&alloctail);
 	keg = zone_first_keg(zone);
 
 	npages = howmany(bytes, PAGE_SIZE);
 	while (npages > 0) {
 		p = vm_page_alloc(NULL, 0, VM_ALLOC_INTERRUPT |
 		    VM_ALLOC_WIRED | VM_ALLOC_NOOBJ);
 		if (p != NULL) {
 			/*
 			 * Since the page does not belong to an object, its
 			 * listq is unused.
 			 */
 			TAILQ_INSERT_TAIL(&alloctail, p, listq);
 			npages--;
 			continue;
 		}
 		if (wait & M_WAITOK) {
 			VM_WAIT;
 			continue;
 		}
 
 		/*
 		 * Page allocation failed, free intermediate pages and
 		 * exit.
 		 */
 		TAILQ_FOREACH_SAFE(p, &alloctail, listq, p_next) {
 			vm_page_unwire(p, PQ_INACTIVE);
 			vm_page_free(p);
 		}
 		return (NULL);
 	}
 	*flags = UMA_SLAB_PRIV;
 	zkva = keg->uk_kva +
 	    atomic_fetchadd_long(&keg->uk_offset, round_page(bytes));
 	retkva = zkva;
 	TAILQ_FOREACH(p, &alloctail, listq) {
 		pmap_qenter(zkva, &p, 1);
 		zkva += PAGE_SIZE;
 	}
 
 	return ((void *)retkva);
 }
 
 /*
  * Frees a number of pages to the system
  *
  * Arguments:
  *	mem   A pointer to the memory to be freed
  *	size  The size of the memory being freed
  *	flags The original p->us_flags field
  *
  * Returns:
  *	Nothing
  */
 static void
 page_free(void *mem, vm_size_t size, uint8_t flags)
 {
 	struct vmem *vmem;
 
 	if (flags & UMA_SLAB_KMEM)
 		vmem = kmem_arena;
 	else if (flags & UMA_SLAB_KERNEL)
 		vmem = kernel_arena;
 	else
 		panic("UMA: page_free used with invalid flags %d", flags);
 
 	kmem_free(vmem, (vm_offset_t)mem, size);
 }
 
 /*
  * Zero fill initializer
  *
  * Arguments/Returns follow uma_init specifications
  */
 static int
 zero_init(void *mem, int size, int flags)
 {
 	bzero(mem, size);
 	return (0);
 }
 
 /*
  * Finish creating a small uma keg.  This calculates ipers, and the keg size.
  *
  * Arguments
  *	keg  The keg we should initialize
  *
  * Returns
  *	Nothing
  */
 static void
 keg_small_init(uma_keg_t keg)
 {
 	u_int rsize;
 	u_int memused;
 	u_int wastedspace;
 	u_int shsize;
 
 	if (keg->uk_flags & UMA_ZONE_PCPU) {
 		u_int ncpus = mp_ncpus ? mp_ncpus : MAXCPU;
 
 		keg->uk_slabsize = sizeof(struct pcpu);
 		keg->uk_ppera = howmany(ncpus * sizeof(struct pcpu),
 		    PAGE_SIZE);
 	} else {
 		keg->uk_slabsize = UMA_SLAB_SIZE;
 		keg->uk_ppera = 1;
 	}
 
 	/*
 	 * Calculate the size of each allocation (rsize) according to
 	 * alignment.  If the requested size is smaller than we have
 	 * allocation bits for we round it up.
 	 */
 	rsize = keg->uk_size;
 	if (rsize < keg->uk_slabsize / SLAB_SETSIZE)
 		rsize = keg->uk_slabsize / SLAB_SETSIZE;
 	if (rsize & keg->uk_align)
 		rsize = (rsize & ~keg->uk_align) + (keg->uk_align + 1);
 	keg->uk_rsize = rsize;
 
 	KASSERT((keg->uk_flags & UMA_ZONE_PCPU) == 0 ||
 	    keg->uk_rsize < sizeof(struct pcpu),
 	    ("%s: size %u too large", __func__, keg->uk_rsize));
 
 	if (keg->uk_flags & UMA_ZONE_REFCNT)
 		rsize += sizeof(uint32_t);
 
 	if (keg->uk_flags & UMA_ZONE_OFFPAGE)
 		shsize = 0;
 	else
 		shsize = sizeof(struct uma_slab);
 
 	keg->uk_ipers = (keg->uk_slabsize - shsize) / rsize;
 	KASSERT(keg->uk_ipers > 0 && keg->uk_ipers <= SLAB_SETSIZE,
 	    ("%s: keg->uk_ipers %u", __func__, keg->uk_ipers));
 
 	memused = keg->uk_ipers * rsize + shsize;
 	wastedspace = keg->uk_slabsize - memused;
 
 	/*
 	 * We can't do OFFPAGE if we're internal or if we've been
 	 * asked to not go to the VM for buckets.  If we do this we
 	 * may end up going to the VM for slabs which we do not
 	 * want to do if we're UMA_ZFLAG_CACHEONLY as a result
 	 * of UMA_ZONE_VM, which clearly forbids it.
 	 */
 	if ((keg->uk_flags & UMA_ZFLAG_INTERNAL) ||
 	    (keg->uk_flags & UMA_ZFLAG_CACHEONLY))
 		return;
 
 	/*
 	 * See if using an OFFPAGE slab will limit our waste.  Only do
 	 * this if it permits more items per-slab.
 	 *
 	 * XXX We could try growing slabsize to limit max waste as well.
 	 * Historically this was not done because the VM could not
 	 * efficiently handle contiguous allocations.
 	 */
 	if ((wastedspace >= keg->uk_slabsize / UMA_MAX_WASTE) &&
 	    (keg->uk_ipers < (keg->uk_slabsize / keg->uk_rsize))) {
 		keg->uk_ipers = keg->uk_slabsize / keg->uk_rsize;
 		KASSERT(keg->uk_ipers > 0 && keg->uk_ipers <= SLAB_SETSIZE,
 		    ("%s: keg->uk_ipers %u", __func__, keg->uk_ipers));
 #ifdef UMA_DEBUG
 		printf("UMA decided we need offpage slab headers for "
 		    "keg: %s, calculated wastedspace = %d, "
 		    "maximum wasted space allowed = %d, "
 		    "calculated ipers = %d, "
 		    "new wasted space = %d\n", keg->uk_name, wastedspace,
 		    keg->uk_slabsize / UMA_MAX_WASTE, keg->uk_ipers,
 		    keg->uk_slabsize - keg->uk_ipers * keg->uk_rsize);
 #endif
 		keg->uk_flags |= UMA_ZONE_OFFPAGE;
 	}
 
 	if ((keg->uk_flags & UMA_ZONE_OFFPAGE) &&
 	    (keg->uk_flags & UMA_ZONE_VTOSLAB) == 0)
 		keg->uk_flags |= UMA_ZONE_HASH;
 }
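 
 /*
  * Worked example for the rounding in keg_small_init() (hypothetical
  * values).  uk_align is a mask, i.e. the alignment minus one, so 8-byte
  * alignment is uk_align = 7.  For uk_size = 20:
  *	20 & 7 = 4, so rsize = (20 & ~7) + (7 + 1) = 16 + 8 = 24.
  * A size that is already aligned, such as 24, is left unchanged.
  * Assuming uk_slabsize = 4096, an in-page header of shsize = 96 bytes
  * (an assumed figure, not sizeof(struct uma_slab) on any particular
  * platform), and no minimum-size clamp or REFCNT adjustment:
  *	uk_ipers    = (4096 - 96) / 24 = 166
  *	memused     = 166 * 24 + 96  = 4080
  *	wastedspace = 4096 - 4080    = 16
  */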
 
 /*
  * Finish creating a large (> UMA_SLAB_SIZE) uma keg.  Just give in and do
  * OFFPAGE for now.  When I can allow for more dynamic slab sizes this will be
  * more complicated.
  *
  * Arguments
  *	keg  The keg we should initialize
  *
  * Returns
  *	Nothing
  */
 static void
 keg_large_init(uma_keg_t keg)
 {
 	u_int shsize;
 
 	KASSERT(keg != NULL, ("Keg is null in keg_large_init"));
 	KASSERT((keg->uk_flags & UMA_ZFLAG_CACHEONLY) == 0,
 	    ("keg_large_init: Cannot large-init a UMA_ZFLAG_CACHEONLY keg"));
 	KASSERT((keg->uk_flags & UMA_ZONE_PCPU) == 0,
 	    ("%s: Cannot large-init a UMA_ZONE_PCPU keg", __func__));
 
 	keg->uk_ppera = howmany(keg->uk_size, PAGE_SIZE);
 	keg->uk_slabsize = keg->uk_ppera * PAGE_SIZE;
 	keg->uk_ipers = 1;
 	keg->uk_rsize = keg->uk_size;
 
 	/* We can't do OFFPAGE if we're internal, bail out here. */
 	if (keg->uk_flags & UMA_ZFLAG_INTERNAL)
 		return;
 
 	/* Check whether we have enough space to not do OFFPAGE. */
 	if ((keg->uk_flags & UMA_ZONE_OFFPAGE) == 0) {
 		shsize = sizeof(struct uma_slab);
 		if (keg->uk_flags & UMA_ZONE_REFCNT)
 			shsize += keg->uk_ipers * sizeof(uint32_t);
 		if (shsize & UMA_ALIGN_PTR)
 			shsize = (shsize & ~UMA_ALIGN_PTR) +
 			    (UMA_ALIGN_PTR + 1);
 
 		if ((PAGE_SIZE * keg->uk_ppera) - keg->uk_rsize < shsize)
 			keg->uk_flags |= UMA_ZONE_OFFPAGE;
 	}
 
 	if ((keg->uk_flags & UMA_ZONE_OFFPAGE) &&
 	    (keg->uk_flags & UMA_ZONE_VTOSLAB) == 0)
 		keg->uk_flags |= UMA_ZONE_HASH;
 }
 
 static void
 keg_cachespread_init(uma_keg_t keg)
 {
 	int alignsize;
 	int trailer;
 	int pages;
 	int rsize;
 
 	KASSERT((keg->uk_flags & UMA_ZONE_PCPU) == 0,
 	    ("%s: Cannot cachespread-init a UMA_ZONE_PCPU keg", __func__));
 
 	alignsize = keg->uk_align + 1;
 	rsize = keg->uk_size;
 	/*
 	 * We want one item to start on every align boundary in a page.  To
 	 * do this we will span pages.  We will also extend the item by the
 	 * size of align if it is an even multiple of align.  Otherwise, it
 	 * would fall on the same boundary every time.
 	 */
 	if (rsize & keg->uk_align)
 		rsize = (rsize & ~keg->uk_align) + alignsize;
 	if ((rsize & alignsize) == 0)
 		rsize += alignsize;
 	trailer = rsize - keg->uk_size;
 	pages = (rsize * (PAGE_SIZE / alignsize)) / PAGE_SIZE;
 	pages = MIN(pages, (128 * 1024) / PAGE_SIZE);
 	keg->uk_rsize = rsize;
 	keg->uk_ppera = pages;
 	keg->uk_slabsize = UMA_SLAB_SIZE;
 	keg->uk_ipers = ((pages * PAGE_SIZE) + trailer) / rsize;
 	keg->uk_flags |= UMA_ZONE_OFFPAGE | UMA_ZONE_VTOSLAB;
 	KASSERT(keg->uk_ipers <= SLAB_SETSIZE,
 	    ("%s: keg->uk_ipers too high(%d) increase max_ipers", __func__,
 	    keg->uk_ipers));
 }
 
 /*
  * Keg header ctor.  This initializes all fields, locks, etc., and inserts
  * the keg onto the global keg list.
  *
  * Arguments/Returns follow uma_ctor specifications
  *	udata  Actually uma_kctor_args
  */
 static int
 keg_ctor(void *mem, int size, void *udata, int flags)
 {
 	struct uma_kctor_args *arg = udata;
 	uma_keg_t keg = mem;
 	uma_zone_t zone;
 
 	bzero(keg, size);
 	keg->uk_size = arg->size;
 	keg->uk_init = arg->uminit;
 	keg->uk_fini = arg->fini;
 	keg->uk_align = arg->align;
 	keg->uk_free = 0;
 	keg->uk_reserve = 0;
 	keg->uk_pages = 0;
 	keg->uk_flags = arg->flags;
 	keg->uk_allocf = page_alloc;
 	keg->uk_freef = page_free;
 	keg->uk_slabzone = NULL;
 
 	/*
 	 * The master zone is passed to us at keg-creation time.
 	 */
 	zone = arg->zone;
 	keg->uk_name = zone->uz_name;
 
 	if (arg->flags & UMA_ZONE_VM)
 		keg->uk_flags |= UMA_ZFLAG_CACHEONLY;
 
 	if (arg->flags & UMA_ZONE_ZINIT)
 		keg->uk_init = zero_init;
 
 	if (arg->flags & UMA_ZONE_REFCNT || arg->flags & UMA_ZONE_MALLOC)
 		keg->uk_flags |= UMA_ZONE_VTOSLAB;
 
 	if (arg->flags & UMA_ZONE_PCPU)
 #ifdef SMP
 		keg->uk_flags |= UMA_ZONE_OFFPAGE;
 #else
 		keg->uk_flags &= ~UMA_ZONE_PCPU;
 #endif
 
 	if (keg->uk_flags & UMA_ZONE_CACHESPREAD) {
 		keg_cachespread_init(keg);
 	} else if (keg->uk_flags & UMA_ZONE_REFCNT) {
 		if (keg->uk_size >
 		    (UMA_SLAB_SIZE - sizeof(struct uma_slab_refcnt) -
 		    sizeof(uint32_t)))
 			keg_large_init(keg);
 		else
 			keg_small_init(keg);
 	} else {
 		if (keg->uk_size > (UMA_SLAB_SIZE - sizeof(struct uma_slab)))
 			keg_large_init(keg);
 		else
 			keg_small_init(keg);
 	}
 
 	if (keg->uk_flags & UMA_ZONE_OFFPAGE) {
 		if (keg->uk_flags & UMA_ZONE_REFCNT) {
 			if (keg->uk_ipers > uma_max_ipers_ref)
 				panic("Too many ref items per zone: %d > %d\n",
 				    keg->uk_ipers, uma_max_ipers_ref);
 			keg->uk_slabzone = slabrefzone;
 		} else
 			keg->uk_slabzone = slabzone;
 	}
 
 	/*
 	 * If we haven't booted yet we need allocations to go through the
 	 * startup cache until the vm is ready.
 	 */
 	if (keg->uk_ppera == 1) {
 #ifdef UMA_MD_SMALL_ALLOC
 		keg->uk_allocf = uma_small_alloc;
 		keg->uk_freef = uma_small_free;
 
 		if (booted < UMA_STARTUP)
 			keg->uk_allocf = startup_alloc;
 #else
 		if (booted < UMA_STARTUP2)
 			keg->uk_allocf = startup_alloc;
 #endif
 	} else if (booted < UMA_STARTUP2 &&
 	    (keg->uk_flags & UMA_ZFLAG_INTERNAL))
 		keg->uk_allocf = startup_alloc;
 
 	/*
 	 * Initialize keg's lock
 	 */
 	KEG_LOCK_INIT(keg, (arg->flags & UMA_ZONE_MTXCLASS));
 
 	/*
 	 * If we're putting the slab header in the actual page we need to
 	 * figure out where in each page it goes.  This calculates a right
 	 * justified offset into the memory on an ALIGN_PTR boundary.
 	 */
 	if (!(keg->uk_flags & UMA_ZONE_OFFPAGE)) {
 		u_int totsize;
 
 		/* Size of the slab struct and free list */
 		totsize = sizeof(struct uma_slab);
 
 		/* Size of the reference counts. */
 		if (keg->uk_flags & UMA_ZONE_REFCNT)
 			totsize += keg->uk_ipers * sizeof(uint32_t);
 
 		if (totsize & UMA_ALIGN_PTR)
 			totsize = (totsize & ~UMA_ALIGN_PTR) +
 			    (UMA_ALIGN_PTR + 1);
 		keg->uk_pgoff = (PAGE_SIZE * keg->uk_ppera) - totsize;
 
 		/*
 		 * The only way the following is possible is if with our
 		 * UMA_ALIGN_PTR adjustments we are now bigger than
 		 * UMA_SLAB_SIZE.  I haven't checked whether this is
 		 * mathematically possible for all cases, so we make
 		 * sure here anyway.
 		 */
 		totsize = keg->uk_pgoff + sizeof(struct uma_slab);
 		if (keg->uk_flags & UMA_ZONE_REFCNT)
 			totsize += keg->uk_ipers * sizeof(uint32_t);
 		if (totsize > PAGE_SIZE * keg->uk_ppera) {
 			printf("zone %s ipers %d rsize %d size %d\n",
 			    zone->uz_name, keg->uk_ipers, keg->uk_rsize,
 			    keg->uk_size);
 			panic("UMA slab won't fit.");
 		}
 	}
 
 	if (keg->uk_flags & UMA_ZONE_HASH)
 		hash_alloc(&keg->uk_hash);
 
 #ifdef UMA_DEBUG
 	printf("UMA: %s(%p) size %d(%d) flags %#x ipers %d ppera %d out %d free %d\n",
 	    zone->uz_name, zone, keg->uk_size, keg->uk_rsize, keg->uk_flags,
 	    keg->uk_ipers, keg->uk_ppera,
 	    (keg->uk_ipers * keg->uk_pages) - keg->uk_free, keg->uk_free);
 #endif
 
 	LIST_INSERT_HEAD(&keg->uk_zones, zone, uz_link);
 
 	rw_wlock(&uma_rwlock);
 	LIST_INSERT_HEAD(&uma_kegs, keg, uk_link);
 	rw_wunlock(&uma_rwlock);
 	return (0);
 }
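 
 /*
  * Worked example for the uk_pgoff calculation in keg_ctor() (hypothetical
  * values).  Assume PAGE_SIZE = 4096, uk_ppera = 1, no UMA_ZONE_REFCNT,
  * sizeof(struct uma_slab) = 92 and UMA_ALIGN_PTR = 7 (a pointer-alignment
  * mask on a 64-bit platform; both figures are assumptions):
  *	92 & 7 = 4, so totsize = (92 & ~7) + (7 + 1) = 88 + 8 = 96
  *	uk_pgoff = 4096 * 1 - 96 = 4000
  * The sanity check then computes 4000 + 92 = 4092, which does not exceed
  * 4096, so the in-page slab header fits.
  */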
 
 /*
  * Zone header ctor.  This initializes all fields, locks, etc.
  *
  * Arguments/Returns follow uma_ctor specifications
  *	udata  Actually uma_zctor_args
  */
 static int
 zone_ctor(void *mem, int size, void *udata, int flags)
 {
 	struct uma_zctor_args *arg = udata;
 	uma_zone_t zone = mem;
 	uma_zone_t z;
 	uma_keg_t keg;
 
 	bzero(zone, size);
 	zone->uz_name = arg->name;
 	zone->uz_ctor = arg->ctor;
 	zone->uz_dtor = arg->dtor;
 	zone->uz_slab = zone_fetch_slab;
 	zone->uz_init = NULL;
 	zone->uz_fini = NULL;
 	zone->uz_allocs = 0;
 	zone->uz_frees = 0;
 	zone->uz_fails = 0;
 	zone->uz_sleeps = 0;
 	zone->uz_count = 0;
 	zone->uz_count_min = 0;
 	zone->uz_flags = 0;
 	zone->uz_warning = NULL;
 	timevalclear(&zone->uz_ratecheck);
 	keg = arg->keg;
 
 	ZONE_LOCK_INIT(zone, (arg->flags & UMA_ZONE_MTXCLASS));
 
 	/*
 	 * This is a pure cache zone, no kegs.
 	 */
 	if (arg->import) {
 		if (arg->flags & UMA_ZONE_VM)
 			arg->flags |= UMA_ZFLAG_CACHEONLY;
 		zone->uz_flags = arg->flags;
 		zone->uz_size = arg->size;
 		zone->uz_import = arg->import;
 		zone->uz_release = arg->release;
 		zone->uz_arg = arg->arg;
 		zone->uz_lockptr = &zone->uz_lock;
 		rw_wlock(&uma_rwlock);
 		LIST_INSERT_HEAD(&uma_cachezones, zone, uz_link);
 		rw_wunlock(&uma_rwlock);
 		goto out;
 	}
 
 	/*
 	 * Use the regular zone/keg/slab allocator.
 	 */
 	zone->uz_import = (uma_import)zone_import;
 	zone->uz_release = (uma_release)zone_release;
 	zone->uz_arg = zone;
 
 	if (arg->flags & UMA_ZONE_SECONDARY) {
 		KASSERT(arg->keg != NULL, ("Secondary zone on zero'd keg"));
 		zone->uz_init = arg->uminit;
 		zone->uz_fini = arg->fini;
 		zone->uz_lockptr = &keg->uk_lock;
 		zone->uz_flags |= UMA_ZONE_SECONDARY;
 		rw_wlock(&uma_rwlock);
 		ZONE_LOCK(zone);
 		LIST_FOREACH(z, &keg->uk_zones, uz_link) {
 			if (LIST_NEXT(z, uz_link) == NULL) {
 				LIST_INSERT_AFTER(z, zone, uz_link);
 				break;
 			}
 		}
 		ZONE_UNLOCK(zone);
 		rw_wunlock(&uma_rwlock);
 	} else if (keg == NULL) {
 		if ((keg = uma_kcreate(zone, arg->size, arg->uminit, arg->fini,
 		    arg->align, arg->flags)) == NULL)
 			return (ENOMEM);
 	} else {
 		struct uma_kctor_args karg;
 		int error;
 
 		/* We should only be here from uma_startup() */
 		karg.size = arg->size;
 		karg.uminit = arg->uminit;
 		karg.fini = arg->fini;
 		karg.align = arg->align;
 		karg.flags = arg->flags;
 		karg.zone = zone;
 		error = keg_ctor(arg->keg, sizeof(struct uma_keg), &karg,
 		    flags);
 		if (error)
 			return (error);
 	}
 
 	/*
 	 * Link in the first keg.
 	 */
 	zone->uz_klink.kl_keg = keg;
 	LIST_INSERT_HEAD(&zone->uz_kegs, &zone->uz_klink, kl_link);
 	zone->uz_lockptr = &keg->uk_lock;
 	zone->uz_size = keg->uk_size;
 	zone->uz_flags |= (keg->uk_flags &
 	    (UMA_ZONE_INHERIT | UMA_ZFLAG_INHERIT));
 
 	/*
 	 * Some internal zones don't have room allocated for the per cpu
 	 * caches.  If we're internal, bail out here.
 	 */
 	if (keg->uk_flags & UMA_ZFLAG_INTERNAL) {
 		KASSERT((zone->uz_flags & UMA_ZONE_SECONDARY) == 0,
 		    ("Secondary zone requested UMA_ZFLAG_INTERNAL"));
 		return (0);
 	}
 
 out:
 	if ((arg->flags & UMA_ZONE_MAXBUCKET) == 0)
 		zone->uz_count = bucket_select(zone->uz_size);
 	else
 		zone->uz_count = BUCKET_MAX;
 	zone->uz_count_min = zone->uz_count;
 
 	return (0);
 }
 
 /*
  * Keg header dtor.  This frees all data, destroys locks, frees the hash
  * table and removes the keg from the global list.
  *
  * Arguments/Returns follow uma_dtor specifications
  *	udata  unused
  */
 static void
 keg_dtor(void *arg, int size, void *udata)
 {
 	uma_keg_t keg;
 
 	keg = (uma_keg_t)arg;
 	KEG_LOCK(keg);
 	if (keg->uk_free != 0) {
 		printf("Freed UMA keg (%s) was not empty (%d items). "
 		    " Lost %d pages of memory.\n",
 		    keg->uk_name ? keg->uk_name : "",
 		    keg->uk_free, keg->uk_pages);
 	}
 	KEG_UNLOCK(keg);
 
 	hash_free(&keg->uk_hash);
 
 	KEG_LOCK_FINI(keg);
 }
 
 /*
  * Zone header dtor.
  *
  * Arguments/Returns follow uma_dtor specifications
  *	udata  unused
  */
 static void
 zone_dtor(void *arg, int size, void *udata)
 {
 	uma_klink_t klink;
 	uma_zone_t zone;
 	uma_keg_t keg;
 
 	zone = (uma_zone_t)arg;
 	keg = zone_first_keg(zone);
 
 	if (!(zone->uz_flags & UMA_ZFLAG_INTERNAL))
 		cache_drain(zone);
 
 	rw_wlock(&uma_rwlock);
 	LIST_REMOVE(zone, uz_link);
 	rw_wunlock(&uma_rwlock);
 	/*
 	 * XXX there are some races here where
 	 * the zone can be drained but zone lock
 	 * released and then refilled before we
 	 * remove it... we don't care for now
 	 */
 	zone_drain_wait(zone, M_WAITOK);
 	/*
 	 * Unlink all of our kegs.
 	 */
 	while ((klink = LIST_FIRST(&zone->uz_kegs)) != NULL) {
 		klink->kl_keg = NULL;
 		LIST_REMOVE(klink, kl_link);
 		if (klink == &zone->uz_klink)
 			continue;
 		free(klink, M_TEMP);
 	}
 	/*
 	 * We only destroy kegs from non secondary zones.
 	 */
 	if (keg != NULL && (zone->uz_flags & UMA_ZONE_SECONDARY) == 0)  {
 		rw_wlock(&uma_rwlock);
 		LIST_REMOVE(keg, uk_link);
 		rw_wunlock(&uma_rwlock);
 		zone_free_item(kegs, keg, NULL, SKIP_NONE);
 	}
 	ZONE_LOCK_FINI(zone);
 }
 
 /*
  * Traverses every zone in the system and calls a callback
  *
  * Arguments:
  *	zfunc  A pointer to a function which accepts a zone
  *		as an argument.
  *
  * Returns:
  *	Nothing
  */
 static void
 zone_foreach(void (*zfunc)(uma_zone_t))
 {
 	uma_keg_t keg;
 	uma_zone_t zone;
 
 	rw_rlock(&uma_rwlock);
 	LIST_FOREACH(keg, &uma_kegs, uk_link) {
 		LIST_FOREACH(zone, &keg->uk_zones, uz_link)
 			zfunc(zone);
 	}
 	rw_runlock(&uma_rwlock);
 }
 
 /* Public functions */
 /* See uma.h */
 void
 uma_startup(void *bootmem, int boot_pages)
 {
 	struct uma_zctor_args args;
 	uma_slab_t slab;
 	u_int slabsize;
 	int i;
 
 #ifdef UMA_DEBUG
 	printf("Creating uma keg headers zone and keg.\n");
 #endif
 	rw_init(&uma_rwlock, "UMA lock");
 
 	/* "manually" create the initial zone */
 	memset(&args, 0, sizeof(args));
 	args.name = "UMA Kegs";
 	args.size = sizeof(struct uma_keg);
 	args.ctor = keg_ctor;
 	args.dtor = keg_dtor;
 	args.uminit = zero_init;
 	args.fini = NULL;
 	args.keg = &masterkeg;
 	args.align = 32 - 1;
 	args.flags = UMA_ZFLAG_INTERNAL;
 	/* The initial zone has no per-CPU queues, so it's smaller. */
 	zone_ctor(kegs, sizeof(struct uma_zone), &args, M_WAITOK);
 
 #ifdef UMA_DEBUG
 	printf("Filling boot free list.\n");
 #endif
 	for (i = 0; i < boot_pages; i++) {
 		slab = (uma_slab_t)((uint8_t *)bootmem + (i * UMA_SLAB_SIZE));
 		slab->us_data = (uint8_t *)slab;
 		slab->us_flags = UMA_SLAB_BOOT;
 		LIST_INSERT_HEAD(&uma_boot_pages, slab, us_link);
 	}
 	mtx_init(&uma_boot_pages_mtx, "UMA boot pages", NULL, MTX_DEF);
 
 #ifdef UMA_DEBUG
 	printf("Creating uma zone headers zone and keg.\n");
 #endif
 	args.name = "UMA Zones";
 	args.size = sizeof(struct uma_zone) +
 	    (sizeof(struct uma_cache) * (mp_maxid + 1));
 	args.ctor = zone_ctor;
 	args.dtor = zone_dtor;
 	args.uminit = zero_init;
 	args.fini = NULL;
 	args.keg = NULL;
 	args.align = 32 - 1;
 	args.flags = UMA_ZFLAG_INTERNAL;
 	/* The initial zone has no per-CPU queues, so it's smaller. */
 	zone_ctor(zones, sizeof(struct uma_zone), &args, M_WAITOK);
 
 #ifdef UMA_DEBUG
 	printf("Creating slab and hash zones.\n");
 #endif
 
 	/* Now make a zone for slab headers */
 	slabzone = uma_zcreate("UMA Slabs",
 				sizeof(struct uma_slab),
 				NULL, NULL, NULL, NULL,
 				UMA_ALIGN_PTR, UMA_ZFLAG_INTERNAL);
 
 	/*
 	 * We also create a zone for the bigger slabs with reference
 	 * counts in them, to accommodate UMA_ZONE_REFCNT zones.
 	 */
 	slabsize = sizeof(struct uma_slab_refcnt);
 	slabsize += uma_max_ipers_ref * sizeof(uint32_t);
 	slabrefzone = uma_zcreate("UMA RCntSlabs",
 				  slabsize,
 				  NULL, NULL, NULL, NULL,
 				  UMA_ALIGN_PTR,
 				  UMA_ZFLAG_INTERNAL);
 
 	hashzone = uma_zcreate("UMA Hash",
 	    sizeof(struct slabhead *) * UMA_HASH_SIZE_INIT,
 	    NULL, NULL, NULL, NULL,
 	    UMA_ALIGN_PTR, UMA_ZFLAG_INTERNAL);
 
 	bucket_init();
 
 	booted = UMA_STARTUP;
 
 #ifdef UMA_DEBUG
 	printf("UMA startup complete.\n");
 #endif
 }
 
 /* see uma.h */
 void
 uma_startup2(void)
 {
 	booted = UMA_STARTUP2;
 	bucket_enable();
 	sx_init(&uma_drain_lock, "umadrain");
 #ifdef UMA_DEBUG
 	printf("UMA startup2 complete.\n");
 #endif
 }
 
 /*
  * Initialize our callout handle
  *
  */
 
 static void
 uma_startup3(void)
 {
 #ifdef UMA_DEBUG
 	printf("Starting callout.\n");
 #endif
 	callout_init(&uma_callout, CALLOUT_MPSAFE);
 	callout_reset(&uma_callout, UMA_TIMEOUT * hz, uma_timeout, NULL);
 #ifdef UMA_DEBUG
 	printf("UMA startup3 complete.\n");
 #endif
 }
 
 static uma_keg_t
 uma_kcreate(uma_zone_t zone, size_t size, uma_init uminit, uma_fini fini,
 		int align, uint32_t flags)
 {
 	struct uma_kctor_args args;
 
 	args.size = size;
 	args.uminit = uminit;
 	args.fini = fini;
 	args.align = (align == UMA_ALIGN_CACHE) ? uma_align_cache : align;
 	args.flags = flags;
 	args.zone = zone;
 	return (zone_alloc_item(kegs, &args, M_WAITOK));
 }
 
 /* See uma.h */
 void
 uma_set_align(int align)
 {
 
 	if (align != UMA_ALIGN_CACHE)
 		uma_align_cache = align;
 }
 
 /* See uma.h */
 uma_zone_t
 uma_zcreate(const char *name, size_t size, uma_ctor ctor, uma_dtor dtor,
 		uma_init uminit, uma_fini fini, int align, uint32_t flags)
 
 {
 	struct uma_zctor_args args;
 	uma_zone_t res;
 	bool locked;
 
 	/* This stuff is essential for the zone ctor */
 	memset(&args, 0, sizeof(args));
 	args.name = name;
 	args.size = size;
 	args.ctor = ctor;
 	args.dtor = dtor;
 	args.uminit = uminit;
 	args.fini = fini;
 	args.align = align;
 	args.flags = flags;
 	args.keg = NULL;
 
 	if (booted < UMA_STARTUP2) {
 		locked = false;
 	} else {
 		sx_slock(&uma_drain_lock);
 		locked = true;
 	}
 	res = zone_alloc_item(zones, &args, M_WAITOK);
 	if (locked)
 		sx_sunlock(&uma_drain_lock);
 	return (res);
 }
 
 /* See uma.h */
 uma_zone_t
 uma_zsecond_create(char *name, uma_ctor ctor, uma_dtor dtor,
 		    uma_init zinit, uma_fini zfini, uma_zone_t master)
 {
 	struct uma_zctor_args args;
 	uma_keg_t keg;
 	uma_zone_t res;
 	bool locked;
 
 	keg = zone_first_keg(master);
 	memset(&args, 0, sizeof(args));
 	args.name = name;
 	args.size = keg->uk_size;
 	args.ctor = ctor;
 	args.dtor = dtor;
 	args.uminit = zinit;
 	args.fini = zfini;
 	args.align = keg->uk_align;
 	args.flags = keg->uk_flags | UMA_ZONE_SECONDARY;
 	args.keg = keg;
 
 	if (booted < UMA_STARTUP2) {
 		locked = false;
 	} else {
 		sx_slock(&uma_drain_lock);
 		locked = true;
 	}
 	/* XXX Attaches only one keg of potentially many. */
 	res = zone_alloc_item(zones, &args, M_WAITOK);
 	if (locked)
 		sx_sunlock(&uma_drain_lock);
 	return (res);
 }
 
 /* See uma.h */
 uma_zone_t
 uma_zcache_create(char *name, int size, uma_ctor ctor, uma_dtor dtor,
 		    uma_init zinit, uma_fini zfini, uma_import zimport,
 		    uma_release zrelease, void *arg, int flags)
 {
 	struct uma_zctor_args args;
 
 	memset(&args, 0, sizeof(args));
 	args.name = name;
 	args.size = size;
 	args.ctor = ctor;
 	args.dtor = dtor;
 	args.uminit = zinit;
 	args.fini = zfini;
 	args.import = zimport;
 	args.release = zrelease;
 	args.arg = arg;
 	args.align = 0;
 	args.flags = flags;
 
 	return (zone_alloc_item(zones, &args, M_WAITOK));
 }
 
 static void
 zone_lock_pair(uma_zone_t a, uma_zone_t b)
 {
 	if (a < b) {
 		ZONE_LOCK(a);
 		mtx_lock_flags(b->uz_lockptr, MTX_DUPOK);
 	} else {
 		ZONE_LOCK(b);
 		mtx_lock_flags(a->uz_lockptr, MTX_DUPOK);
 	}
 }
 
 static void
 zone_unlock_pair(uma_zone_t a, uma_zone_t b)
 {
 
 	ZONE_UNLOCK(a);
 	ZONE_UNLOCK(b);
 }
 
 int
 uma_zsecond_add(uma_zone_t zone, uma_zone_t master)
 {
 	uma_klink_t klink;
 	uma_klink_t kl;
 	int error;
 
 	error = 0;
 	klink = malloc(sizeof(*klink), M_TEMP, M_WAITOK | M_ZERO);
 
 	zone_lock_pair(zone, master);
 	/*
 	 * zone must use vtoslab() to resolve objects and must already be
 	 * a secondary.
 	 */
 	if ((zone->uz_flags & (UMA_ZONE_VTOSLAB | UMA_ZONE_SECONDARY))
 	    != (UMA_ZONE_VTOSLAB | UMA_ZONE_SECONDARY)) {
 		error = EINVAL;
 		goto out;
 	}
 	/*
 	 * The new master must also use vtoslab().
 	 */
 	if ((master->uz_flags & UMA_ZONE_VTOSLAB) != UMA_ZONE_VTOSLAB) {
 		error = EINVAL;
 		goto out;
 	}
 	/*
 	 * Both must either be refcnt, or not be refcnt.
 	 */
 	if ((zone->uz_flags & UMA_ZONE_REFCNT) !=
 	    (master->uz_flags & UMA_ZONE_REFCNT)) {
 		error = EINVAL;
 		goto out;
 	}
 	/*
 	 * The underlying object must be the same size.  rsize
 	 * may be different.
 	 */
 	if (master->uz_size != zone->uz_size) {
 		error = E2BIG;
 		goto out;
 	}
 	/*
 	 * Put it at the end of the list.
 	 */
 	klink->kl_keg = zone_first_keg(master);
 	LIST_FOREACH(kl, &zone->uz_kegs, kl_link) {
 		if (LIST_NEXT(kl, kl_link) == NULL) {
 			LIST_INSERT_AFTER(kl, klink, kl_link);
 			break;
 		}
 	}
 	klink = NULL;
 	zone->uz_flags |= UMA_ZFLAG_MULTI;
 	zone->uz_slab = zone_fetch_slab_multi;
 
 out:
 	zone_unlock_pair(zone, master);
 	if (klink != NULL)
 		free(klink, M_TEMP);
 
 	return (error);
 }
 
 
 /* See uma.h */
 void
 uma_zdestroy(uma_zone_t zone)
 {
 
 	sx_slock(&uma_drain_lock);
 	zone_free_item(zones, zone, NULL, SKIP_NONE);
 	sx_sunlock(&uma_drain_lock);
 }
 
 /* See uma.h */
 void *
 uma_zalloc_arg(uma_zone_t zone, void *udata, int flags)
 {
 	void *item;
 	uma_cache_t cache;
 	uma_bucket_t bucket;
 	int lockfail;
 	int cpu;
 
 #if 0
 	/* XXX: FIX!! Do not enable this in CURRENT!! MarkM */
 	/* The entropy here is desirable, but the harvesting is expensive */
 	random_harvest(&(zone->uz_name), sizeof(void *), 1, RANDOM_UMA_ALLOC);
 #endif
 
 	/* This is the fast path allocation */
 #ifdef UMA_DEBUG_ALLOC_1
 	printf("Allocating one item from %s(%p)\n", zone->uz_name, zone);
 #endif
 	CTR3(KTR_UMA, "uma_zalloc_arg thread %x zone %s flags %d", curthread,
 	    zone->uz_name, flags);
 
 	if (flags & M_WAITOK) {
 		WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
 		    "uma_zalloc_arg: zone \"%s\"", zone->uz_name);
 	}
 #ifdef DEBUG_MEMGUARD
 	if (memguard_cmp_zone(zone)) {
 		item = memguard_alloc(zone->uz_size, flags);
 		if (item != NULL) {
 			/*
 			 * Avoid conflict with the use-after-free
 			 * protecting infrastructure from INVARIANTS.
 			 */
 			if (zone->uz_init != NULL &&
 			    zone->uz_init != mtrash_init &&
 			    zone->uz_init(item, zone->uz_size, flags) != 0)
 				return (NULL);
 			if (zone->uz_ctor != NULL &&
 			    zone->uz_ctor != mtrash_ctor &&
 			    zone->uz_ctor(item, zone->uz_size, udata,
 			    flags) != 0) {
 				zone->uz_fini(item, zone->uz_size);
 				return (NULL);
 			}
 #if 0
 			/* XXX: FIX!! Do not enable this in CURRENT!! MarkM */
 			/* The entropy here is desirable, but the harvesting is expensive */
 			random_harvest(&item, sizeof(void *), 1, RANDOM_UMA_ALLOC);
 #endif
 			return (item);
 		}
 		/* This is unfortunate but should not be fatal. */
 	}
 #endif
 	/*
 	 * If possible, allocate from the per-CPU cache.  There are two
 	 * requirements for safe access to the per-CPU cache: (1) the thread
 	 * accessing the cache must not be preempted or yield during access,
 	 * and (2) the thread must not migrate CPUs without switching which
 	 * cache it accesses.  We rely on a critical section to prevent
 	 * preemption and migration.  We release the critical section in
 	 * order to acquire the zone mutex if we are unable to allocate from
 	 * the current cache; when we re-acquire the critical section, we
 	 * must detect and handle migration if it has occurred.
 	 */
 	critical_enter();
 	cpu = curcpu;
 	cache = &zone->uz_cpu[cpu];
 
 zalloc_start:
 	bucket = cache->uc_allocbucket;
 	if (bucket != NULL && bucket->ub_cnt > 0) {
 		bucket->ub_cnt--;
 		item = bucket->ub_bucket[bucket->ub_cnt];
 #ifdef INVARIANTS
 		bucket->ub_bucket[bucket->ub_cnt] = NULL;
 #endif
 		KASSERT(item != NULL, ("uma_zalloc: Bucket pointer mangled."));
 		cache->uc_allocs++;
 		critical_exit();
 		if (zone->uz_ctor != NULL &&
 		    zone->uz_ctor(item, zone->uz_size, udata, flags) != 0) {
 			atomic_add_long(&zone->uz_fails, 1);
 			zone_free_item(zone, item, udata, SKIP_DTOR);
 			return (NULL);
 		}
 #ifdef INVARIANTS
 		uma_dbg_alloc(zone, NULL, item);
 #endif
 		if (flags & M_ZERO)
 			uma_zero_item(item, zone);
 #if 0
 		/* XXX: FIX!! Do not enable this in CURRENT!! MarkM */
 		/* The entropy here is desirable, but the harvesting is expensive */
 		random_harvest(&item, sizeof(void *), 1, RANDOM_UMA_ALLOC);
 #endif
 		return (item);
 	}
 
 	/*
 	 * We have run out of items in our alloc bucket.
 	 * See if we can switch with our free bucket.
 	 */
 	bucket = cache->uc_freebucket;
 	if (bucket != NULL && bucket->ub_cnt > 0) {
 #ifdef UMA_DEBUG_ALLOC
 		printf("uma_zalloc: Swapping empty with alloc.\n");
 #endif
 		cache->uc_freebucket = cache->uc_allocbucket;
 		cache->uc_allocbucket = bucket;
 		goto zalloc_start;
 	}
 
 	/*
 	 * Discard any empty allocation bucket while we hold no locks.
 	 */
 	bucket = cache->uc_allocbucket;
 	cache->uc_allocbucket = NULL;
 	critical_exit();
 	if (bucket != NULL)
 		bucket_free(zone, bucket, udata);
 
 	/* Short-circuit for zones without buckets and low memory. */
 	if (zone->uz_count == 0 || bucketdisable)
 		goto zalloc_item;
 
 	/*
 	 * Attempt to retrieve the item from the per-CPU cache has failed, so
 	 * we must go back to the zone.  This requires the zone lock, so we
 	 * must drop the critical section, then re-acquire it when we go back
 	 * to the cache.  Since the critical section is released, we may be
 	 * preempted or migrate.  As such, make sure not to maintain any
 	 * thread-local state specific to the cache from prior to releasing
 	 * the critical section.
 	 */
 	lockfail = 0;
 	if (ZONE_TRYLOCK(zone) == 0) {
 		/* Record contention to size the buckets. */
 		ZONE_LOCK(zone);
 		lockfail = 1;
 	}
 	critical_enter();
 	cpu = curcpu;
 	cache = &zone->uz_cpu[cpu];
 
 	/*
 	 * Since we have locked the zone we may as well send back our stats.
 	 */
 	atomic_add_long(&zone->uz_allocs, cache->uc_allocs);
 	atomic_add_long(&zone->uz_frees, cache->uc_frees);
 	cache->uc_allocs = 0;
 	cache->uc_frees = 0;
 
 	/* See if we lost the race to fill the cache. */
 	if (cache->uc_allocbucket != NULL) {
 		ZONE_UNLOCK(zone);
 		goto zalloc_start;
 	}
 
 	/*
 	 * Check the zone's cache of buckets.
 	 */
 	if ((bucket = LIST_FIRST(&zone->uz_buckets)) != NULL) {
 		KASSERT(bucket->ub_cnt != 0,
 		    ("uma_zalloc_arg: Returning an empty bucket."));
 
 		LIST_REMOVE(bucket, ub_link);
 		cache->uc_allocbucket = bucket;
 		ZONE_UNLOCK(zone);
 		goto zalloc_start;
 	}
 	/* We are no longer associated with this CPU. */
 	critical_exit();
 
 	/*
 	 * We bump the uz count when the cache size is insufficient to
 	 * handle the working set.
 	 */
 	if (lockfail && zone->uz_count < BUCKET_MAX)
 		zone->uz_count++;
 	ZONE_UNLOCK(zone);
 
 	/*
 	 * Now let's just fill a bucket and put it on the free list.  If that
 	 * works we'll restart the allocation from the beginning and it
 	 * will use the just filled bucket.
 	 */
 	bucket = zone_alloc_bucket(zone, udata, flags);
 	if (bucket != NULL) {
 		ZONE_LOCK(zone);
 		critical_enter();
 		cpu = curcpu;
 		cache = &zone->uz_cpu[cpu];
 		/*
 		 * See if we lost the race or were migrated.  Cache the
 		 * initialized bucket to make this less likely or claim
 		 * the memory directly.
 		 */
 		if (cache->uc_allocbucket == NULL)
 			cache->uc_allocbucket = bucket;
 		else
 			LIST_INSERT_HEAD(&zone->uz_buckets, bucket, ub_link);
 		ZONE_UNLOCK(zone);
 		goto zalloc_start;
 	}
 
 	/*
 	 * We may not be able to get a bucket so return an actual item.
 	 */
 #ifdef UMA_DEBUG
 	printf("uma_zalloc_arg: Bucketzone returned NULL\n");
 #endif
 
 zalloc_item:
 	item = zone_alloc_item(zone, udata, flags);
 
 #if 0
 	/* XXX: FIX!! Do not enable this in CURRENT!! MarkM */
 	/* The entropy here is desirable, but the harvesting is expensive */
 	random_harvest(&item, sizeof(void *), 1, RANDOM_UMA_ALLOC);
 #endif
 	return (item);
 }
 
 static uma_slab_t
 keg_fetch_slab(uma_keg_t keg, uma_zone_t zone, int flags)
 {
 	uma_slab_t slab;
 	int reserve;
 
 	mtx_assert(&keg->uk_lock, MA_OWNED);
 	slab = NULL;
 	reserve = 0;
 	if ((flags & M_USE_RESERVE) == 0)
 		reserve = keg->uk_reserve;
 
 	for (;;) {
 		/*
 		 * Find a slab with some space.  Prefer slabs that are partially
 		 * used over those that are completely free.  This helps to
 		 * reduce fragmentation.
 		 */
 		if (keg->uk_free > reserve) {
 			if (!LIST_EMPTY(&keg->uk_part_slab)) {
 				slab = LIST_FIRST(&keg->uk_part_slab);
 			} else {
 				slab = LIST_FIRST(&keg->uk_free_slab);
 				LIST_REMOVE(slab, us_link);
 				LIST_INSERT_HEAD(&keg->uk_part_slab, slab,
 				    us_link);
 			}
 			MPASS(slab->us_keg == keg);
 			return (slab);
 		}
 
 		/*
 		 * M_NOVM means don't ask at all!
 		 */
 		if (flags & M_NOVM)
 			break;
 
 		if (keg->uk_maxpages && keg->uk_pages >= keg->uk_maxpages) {
 			keg->uk_flags |= UMA_ZFLAG_FULL;
 			/*
 			 * If this is not a multi-zone, set the FULL bit.
 			 * Otherwise slab_multi() takes care of it.
 			 */
 			if ((zone->uz_flags & UMA_ZFLAG_MULTI) == 0) {
 				zone->uz_flags |= UMA_ZFLAG_FULL;
 				zone_log_warning(zone);
 			}
 			if (flags & M_NOWAIT)
 				break;
 			zone->uz_sleeps++;
 			msleep(keg, &keg->uk_lock, PVM, "keglimit", 0);
 			continue;
 		}
 		slab = keg_alloc_slab(keg, zone, flags);
 		/*
 		 * If we got a slab here it's safe to mark it partially used
 		 * and return.  We assume that the caller is going to remove
 		 * at least one item.
 		 */
 		if (slab) {
 			MPASS(slab->us_keg == keg);
 			LIST_INSERT_HEAD(&keg->uk_part_slab, slab, us_link);
 			return (slab);
 		}
 		/*
 		 * We might not have been able to get a slab but another cpu
 		 * could have while we were unlocked.  Check again before we
 		 * fail.
 		 */
 		flags |= M_NOVM;
 	}
 	return (slab);
 }
 
 static uma_slab_t
 zone_fetch_slab(uma_zone_t zone, uma_keg_t keg, int flags)
 {
 	uma_slab_t slab;
 
 	if (keg == NULL) {
 		keg = zone_first_keg(zone);
 		KEG_LOCK(keg);
 	}
 
 	for (;;) {
 		slab = keg_fetch_slab(keg, zone, flags);
 		if (slab)
 			return (slab);
 		if (flags & (M_NOWAIT | M_NOVM))
 			break;
 	}
 	KEG_UNLOCK(keg);
 	return (NULL);
 }
 
 /*
  * zone_fetch_slab_multi:  Fetches a slab from one available keg.  Returns
  * with the keg locked.  On NULL no lock is held.
  *
  * The last pointer is used to seed the search.  It is not required.
  */
 static uma_slab_t
 zone_fetch_slab_multi(uma_zone_t zone, uma_keg_t last, int rflags)
 {
 	uma_klink_t klink;
 	uma_slab_t slab;
 	uma_keg_t keg;
 	int flags;
 	int empty;
 	int full;
 
 	/*
 	 * Don't wait on the first pass.  This will skip limit tests
 	 * as well.  We don't want to block if we can find a provider
 	 * without blocking.
 	 */
 	flags = (rflags & ~M_WAITOK) | M_NOWAIT;
 	/*
 	 * Use the last slab allocated as a hint for where to start
 	 * the search.
 	 */
 	if (last != NULL) {
 		slab = keg_fetch_slab(last, zone, flags);
 		if (slab)
 			return (slab);
 		KEG_UNLOCK(last);
 	}
 	/*
 	 * Loop until we have a slab in case of transient failures
 	 * while M_WAITOK is specified.  I'm not sure this is 100%
 	 * required but we've done it for so long now.
 	 */
 	for (;;) {
 		empty = 0;
 		full = 0;
 		/*
 		 * Search the available kegs for slabs.  Be careful to hold the
 		 * correct lock while calling into the keg layer.
 		 */
 		LIST_FOREACH(klink, &zone->uz_kegs, kl_link) {
 			keg = klink->kl_keg;
 			KEG_LOCK(keg);
 			if ((keg->uk_flags & UMA_ZFLAG_FULL) == 0) {
 				slab = keg_fetch_slab(keg, zone, flags);
 				if (slab)
 					return (slab);
 			}
 			if (keg->uk_flags & UMA_ZFLAG_FULL)
 				full++;
 			else
 				empty++;
 			KEG_UNLOCK(keg);
 		}
 		if (rflags & (M_NOWAIT | M_NOVM))
 			break;
 		flags = rflags;
 		/*
 		 * All kegs are full.  XXX We can't atomically check all kegs
 		 * and sleep so just sleep for a short period and retry.
 		 */
 		if (full && !empty) {
 			ZONE_LOCK(zone);
 			zone->uz_flags |= UMA_ZFLAG_FULL;
 			zone->uz_sleeps++;
 			zone_log_warning(zone);
 			msleep(zone, zone->uz_lockptr, PVM,
 			    "zonelimit", hz/100);
 			zone->uz_flags &= ~UMA_ZFLAG_FULL;
 			ZONE_UNLOCK(zone);
 			continue;
 		}
 	}
 	return (NULL);
 }
 
 static void *
 slab_alloc_item(uma_keg_t keg, uma_slab_t slab)
 {
 	void *item;
 	uint8_t freei;
 
 	MPASS(keg == slab->us_keg);
 	mtx_assert(&keg->uk_lock, MA_OWNED);
 
 	freei = BIT_FFS(SLAB_SETSIZE, &slab->us_free) - 1;
 	BIT_CLR(SLAB_SETSIZE, freei, &slab->us_free);
 	item = slab->us_data + (keg->uk_rsize * freei);
 	slab->us_freecount--;
 	keg->uk_free--;
 
 	/* Move this slab to the full list */
 	if (slab->us_freecount == 0) {
 		LIST_REMOVE(slab, us_link);
 		LIST_INSERT_HEAD(&keg->uk_full_slab, slab, us_link);
 	}
 
 	return (item);
 }
 
 static int
 zone_import(uma_zone_t zone, void **bucket, int max, int flags)
 {
 	uma_slab_t slab;
 	uma_keg_t keg;
 	int i;
 
 	slab = NULL;
 	keg = NULL;
 	/* Try to keep the buckets totally full */
 	for (i = 0; i < max; ) {
 		if ((slab = zone->uz_slab(zone, keg, flags)) == NULL)
 			break;
 		keg = slab->us_keg;
 		while (slab->us_freecount && i < max) { 
 			bucket[i++] = slab_alloc_item(keg, slab);
 			if (keg->uk_free <= keg->uk_reserve)
 				break;
 		}
 		/* Don't grab more than one slab at a time. */
 		flags &= ~M_WAITOK;
 		flags |= M_NOWAIT;
 	}
 	if (slab != NULL)
 		KEG_UNLOCK(keg);
 
 	return (i);
 }
 
 static uma_bucket_t
 zone_alloc_bucket(uma_zone_t zone, void *udata, int flags)
 {
 	uma_bucket_t bucket;
 	int max;
 
 	/* Don't wait for buckets, preserve caller's NOVM setting. */
 	bucket = bucket_alloc(zone, udata, M_NOWAIT | (flags & M_NOVM));
 	if (bucket == NULL)
 		return (NULL);
 
 	max = MIN(bucket->ub_entries, zone->uz_count);
 	bucket->ub_cnt = zone->uz_import(zone->uz_arg, bucket->ub_bucket,
 	    max, flags);
 
 	/*
 	 * Initialize the memory if necessary.
 	 */
 	if (bucket->ub_cnt != 0 && zone->uz_init != NULL) {
 		int i;
 
 		for (i = 0; i < bucket->ub_cnt; i++)
 			if (zone->uz_init(bucket->ub_bucket[i], zone->uz_size,
 			    flags) != 0)
 				break;
 		/*
 		 * If we couldn't initialize the whole bucket, put the
 		 * rest back onto the freelist.
 		 */
 		if (i != bucket->ub_cnt) {
 			zone->uz_release(zone->uz_arg, &bucket->ub_bucket[i],
 			    bucket->ub_cnt - i);
 #ifdef INVARIANTS
 			bzero(&bucket->ub_bucket[i],
 			    sizeof(void *) * (bucket->ub_cnt - i));
 #endif
 			bucket->ub_cnt = i;
 		}
 	}
 
 	if (bucket->ub_cnt == 0) {
 		bucket_free(zone, bucket, udata);
 		atomic_add_long(&zone->uz_fails, 1);
 		return (NULL);
 	}
 
 	return (bucket);
 }
 
 /*
  * Allocates a single item from a zone.
  *
  * Arguments
  *	zone   The zone to alloc for.
  *	udata  The data to be passed to the constructor.
  *	flags  M_WAITOK, M_NOWAIT, M_ZERO.
  *
  * Returns
  *	NULL if there is no memory and M_NOWAIT is set
  *	An item if successful
  */
 
 static void *
 zone_alloc_item(uma_zone_t zone, void *udata, int flags)
 {
 	void *item;
 
 	item = NULL;
 
 #ifdef UMA_DEBUG_ALLOC
 	printf("INTERNAL: Allocating one item from %s(%p)\n", zone->uz_name, zone);
 #endif
 	if (zone->uz_import(zone->uz_arg, &item, 1, flags) != 1)
 		goto fail;
 	atomic_add_long(&zone->uz_allocs, 1);
 
 	/*
 	 * We have to call both the zone's init (not the keg's init)
 	 * and the zone's ctor.  This is because the item is going from
 	 * a keg slab directly to the user, and the user is expecting it
 	 * to be both zone-init'd as well as zone-ctor'd.
 	 */
 	if (zone->uz_init != NULL) {
 		if (zone->uz_init(item, zone->uz_size, flags) != 0) {
 			zone_free_item(zone, item, udata, SKIP_FINI);
 			goto fail;
 		}
 	}
 	if (zone->uz_ctor != NULL) {
 		if (zone->uz_ctor(item, zone->uz_size, udata, flags) != 0) {
 			zone_free_item(zone, item, udata, SKIP_DTOR);
 			goto fail;
 		}
 	}
 #ifdef INVARIANTS
 	uma_dbg_alloc(zone, NULL, item);
 #endif
 	if (flags & M_ZERO)
 		uma_zero_item(item, zone);
 
 	return (item);
 
 fail:
 	atomic_add_long(&zone->uz_fails, 1);
 	return (NULL);
 }
 
 /* See uma.h */
 void
 uma_zfree_arg(uma_zone_t zone, void *item, void *udata)
 {
 	uma_cache_t cache;
 	uma_bucket_t bucket;
 	int lockfail;
 	int cpu;
 
 #if 0
 	/* XXX: FIX!! Do not enable this in CURRENT!! MarkM */
 	/* The entropy here is desirable, but the harvesting is expensive */
 	struct entropy {
 		const void *uz_name;
 		const void *item;
 	} entropy;
 
 	entropy.uz_name = zone->uz_name;
 	entropy.item = item;
 	random_harvest(&entropy, sizeof(struct entropy), 2, RANDOM_UMA_ALLOC);
 #endif
 
 #ifdef UMA_DEBUG_ALLOC_1
 	printf("Freeing item %p to %s(%p)\n", item, zone->uz_name, zone);
 #endif
 	CTR2(KTR_UMA, "uma_zfree_arg thread %x zone %s", curthread,
 	    zone->uz_name);
 
 	/* uma_zfree(..., NULL) does nothing, to match free(9). */
 	if (item == NULL)
 		return;
 #ifdef DEBUG_MEMGUARD
 	if (is_memguard_addr(item)) {
 		if (zone->uz_dtor != NULL && zone->uz_dtor != mtrash_dtor)
 			zone->uz_dtor(item, zone->uz_size, udata);
 		if (zone->uz_fini != NULL && zone->uz_fini != mtrash_fini)
 			zone->uz_fini(item, zone->uz_size);
 		memguard_free(item);
 		return;
 	}
 #endif
 #ifdef INVARIANTS
 	if (zone->uz_flags & UMA_ZONE_MALLOC)
 		uma_dbg_free(zone, udata, item);
 	else
 		uma_dbg_free(zone, NULL, item);
 #endif
 	if (zone->uz_dtor != NULL)
 		zone->uz_dtor(item, zone->uz_size, udata);
 
 	/*
 	 * The race here is acceptable.  If we miss it we'll just have to wait
 	 * a little longer for the limits to be reset.
 	 */
 	if (zone->uz_flags & UMA_ZFLAG_FULL)
 		goto zfree_item;
 
 	/*
 	 * If possible, free to the per-CPU cache.  There are two
 	 * requirements for safe access to the per-CPU cache: (1) the thread
 	 * accessing the cache must not be preempted or yield during access,
 	 * and (2) the thread must not migrate CPUs without switching which
 	 * cache it accesses.  We rely on a critical section to prevent
 	 * preemption and migration.  We release the critical section in
 	 * order to acquire the zone mutex if we are unable to free to the
 	 * current cache; when we re-acquire the critical section, we must
 	 * detect and handle migration if it has occurred.
 	 */
 zfree_restart:
 	critical_enter();
 	cpu = curcpu;
 	cache = &zone->uz_cpu[cpu];
 
 zfree_start:
 	/*
 	 * Try to free into the allocbucket first to give LIFO ordering
 	 * for cache-hot datastructures.  Spill over into the freebucket
 	 * if necessary.  Alloc will swap them if one runs dry.
 	 */
 	bucket = cache->uc_allocbucket;
 	if (bucket == NULL || bucket->ub_cnt >= bucket->ub_entries)
 		bucket = cache->uc_freebucket;
 	if (bucket != NULL && bucket->ub_cnt < bucket->ub_entries) {
 		KASSERT(bucket->ub_bucket[bucket->ub_cnt] == NULL,
 		    ("uma_zfree: Freeing to non free bucket index."));
 		bucket->ub_bucket[bucket->ub_cnt] = item;
 		bucket->ub_cnt++;
 		cache->uc_frees++;
 		critical_exit();
 		return;
 	}
 
 	/*
 	 * We must go back to the zone, which requires acquiring the zone lock,
 	 * which in turn means we must release and re-acquire the critical
 	 * section.  Since the critical section is released, we may be
 	 * preempted or migrate.  As such, make sure not to maintain any
 	 * thread-local state specific to the cache from prior to releasing
 	 * the critical section.
 	 */
 	critical_exit();
 	if (zone->uz_count == 0 || bucketdisable)
 		goto zfree_item;
 
 	lockfail = 0;
 	if (ZONE_TRYLOCK(zone) == 0) {
 		/* Record contention to size the buckets. */
 		ZONE_LOCK(zone);
 		lockfail = 1;
 	}
 	critical_enter();
 	cpu = curcpu;
 	cache = &zone->uz_cpu[cpu];
 
 	/*
 	 * Since we have locked the zone we may as well send back our stats.
 	 */
 	atomic_add_long(&zone->uz_allocs, cache->uc_allocs);
 	atomic_add_long(&zone->uz_frees, cache->uc_frees);
 	cache->uc_allocs = 0;
 	cache->uc_frees = 0;
 
 	bucket = cache->uc_freebucket;
 	if (bucket != NULL && bucket->ub_cnt < bucket->ub_entries) {
 		ZONE_UNLOCK(zone);
 		goto zfree_start;
 	}
 	cache->uc_freebucket = NULL;
 
 	/* Can we throw this on the zone full list? */
 	if (bucket != NULL) {
 #ifdef UMA_DEBUG_ALLOC
 		printf("uma_zfree: Putting old bucket on the free list.\n");
 #endif
 		/* ub_cnt is pointing to the last free item */
 		KASSERT(bucket->ub_cnt != 0,
 		    ("uma_zfree: Attempting to insert an empty bucket onto the full list.\n"));
 		LIST_INSERT_HEAD(&zone->uz_buckets, bucket, ub_link);
 	}
 
 	/* We are no longer associated with this CPU. */
 	critical_exit();
 
 	/*
 	 * We bump the uz count when the cache size is insufficient to
 	 * handle the working set.
 	 */
 	if (lockfail && zone->uz_count < BUCKET_MAX)
 		zone->uz_count++;
 	ZONE_UNLOCK(zone);
 
 #ifdef UMA_DEBUG_ALLOC
 	printf("uma_zfree: Allocating new free bucket.\n");
 #endif
 	bucket = bucket_alloc(zone, udata, M_NOWAIT);
 	if (bucket) {
 		critical_enter();
 		cpu = curcpu;
 		cache = &zone->uz_cpu[cpu];
 		if (cache->uc_freebucket == NULL) {
 			cache->uc_freebucket = bucket;
 			goto zfree_start;
 		}
 		/*
 		 * We lost the race, start over.  We have to drop our
 		 * critical section to free the bucket.
 		 */
 		critical_exit();
 		bucket_free(zone, bucket, udata);
 		goto zfree_restart;
 	}
 
 	/*
 	 * If nothing else caught this, we'll just do an internal free.
 	 */
 zfree_item:
 	zone_free_item(zone, item, udata, SKIP_DTOR);
 
 	return;
 }
 
 static void
 slab_free_item(uma_keg_t keg, uma_slab_t slab, void *item)
 {
 	uint8_t freei;
 
 	mtx_assert(&keg->uk_lock, MA_OWNED);
 	MPASS(keg == slab->us_keg);
 
 	/* Do we need to remove from any lists? */
 	if (slab->us_freecount+1 == keg->uk_ipers) {
 		LIST_REMOVE(slab, us_link);
 		LIST_INSERT_HEAD(&keg->uk_free_slab, slab, us_link);
 	} else if (slab->us_freecount == 0) {
 		LIST_REMOVE(slab, us_link);
 		LIST_INSERT_HEAD(&keg->uk_part_slab, slab, us_link);
 	}
 
 	/* Slab management. */
 	freei = ((uintptr_t)item - (uintptr_t)slab->us_data) / keg->uk_rsize;
 	BIT_SET(SLAB_SETSIZE, freei, &slab->us_free);
 	slab->us_freecount++;
 
 	/* Keg statistics. */
 	keg->uk_free++;
 }
 
 static void
 zone_release(uma_zone_t zone, void **bucket, int cnt)
 {
 	void *item;
 	uma_slab_t slab;
 	uma_keg_t keg;
 	uint8_t *mem;
 	int clearfull;
 	int i;
 
 	clearfull = 0;
 	keg = zone_first_keg(zone);
 	KEG_LOCK(keg);
 	for (i = 0; i < cnt; i++) {
 		item = bucket[i];
 		if (!(zone->uz_flags & UMA_ZONE_VTOSLAB)) {
 			mem = (uint8_t *)((uintptr_t)item & (~UMA_SLAB_MASK));
 			if (zone->uz_flags & UMA_ZONE_HASH) {
 				slab = hash_sfind(&keg->uk_hash, mem);
 			} else {
 				mem += keg->uk_pgoff;
 				slab = (uma_slab_t)mem;
 			}
 		} else {
 			slab = vtoslab((vm_offset_t)item);
 			if (slab->us_keg != keg) {
 				KEG_UNLOCK(keg);
 				keg = slab->us_keg;
 				KEG_LOCK(keg);
 			}
 		}
 		slab_free_item(keg, slab, item);
 		if (keg->uk_flags & UMA_ZFLAG_FULL) {
 			if (keg->uk_pages < keg->uk_maxpages) {
 				keg->uk_flags &= ~UMA_ZFLAG_FULL;
 				clearfull = 1;
 			}
 
 			/* 
 			 * We can handle one more allocation. Since we're
 			 * clearing ZFLAG_FULL, wake up all procs blocked
 			 * on pages. This should be uncommon, so keeping this
 			 * simple for now (rather than adding count of blocked 
 			 * threads etc).
 			 */
 			wakeup(keg);
 		}
 	}
 	KEG_UNLOCK(keg);
 	if (clearfull) {
 		ZONE_LOCK(zone);
 		zone->uz_flags &= ~UMA_ZFLAG_FULL;
 		wakeup(zone);
 		ZONE_UNLOCK(zone);
 	}
 
 }
 
 /*
  * Frees a single item to any zone.
  *
  * Arguments:
  *	zone   The zone to free to
  *	item   The item we're freeing
  *	udata  User supplied data for the dtor
  *	skip   Skip dtors and finis
  */
 static void
 zone_free_item(uma_zone_t zone, void *item, void *udata, enum zfreeskip skip)
 {
 
 #ifdef INVARIANTS
 	if (skip == SKIP_NONE) {
 		if (zone->uz_flags & UMA_ZONE_MALLOC)
 			uma_dbg_free(zone, udata, item);
 		else
 			uma_dbg_free(zone, NULL, item);
 	}
 #endif
 	if (skip < SKIP_DTOR && zone->uz_dtor)
 		zone->uz_dtor(item, zone->uz_size, udata);
 
 	if (skip < SKIP_FINI && zone->uz_fini)
 		zone->uz_fini(item, zone->uz_size);
 
 	atomic_add_long(&zone->uz_frees, 1);
 	zone->uz_release(zone->uz_arg, &item, 1);
 }
 
 /* See uma.h */
 int
 uma_zone_set_max(uma_zone_t zone, int nitems)
 {
 	uma_keg_t keg;
 
 	keg = zone_first_keg(zone);
 	if (keg == NULL)
 		return (0);
 	KEG_LOCK(keg);
 	keg->uk_maxpages = (nitems / keg->uk_ipers) * keg->uk_ppera;
 	if (keg->uk_maxpages * keg->uk_ipers < nitems)
 		keg->uk_maxpages += keg->uk_ppera;
 	nitems = keg->uk_maxpages * keg->uk_ipers;
 	KEG_UNLOCK(keg);
 
 	return (nitems);
 }
 
 /* See uma.h */
 int
 uma_zone_get_max(uma_zone_t zone)
 {
 	int nitems;
 	uma_keg_t keg;
 
 	keg = zone_first_keg(zone);
 	if (keg == NULL)
 		return (0);
 	KEG_LOCK(keg);
 	nitems = keg->uk_maxpages * keg->uk_ipers;
 	KEG_UNLOCK(keg);
 
 	return (nitems);
 }
 
 /* See uma.h */
 void
 uma_zone_set_warning(uma_zone_t zone, const char *warning)
 {
 
 	ZONE_LOCK(zone);
 	zone->uz_warning = warning;
 	ZONE_UNLOCK(zone);
 }
 
 /* See uma.h */
 int
 uma_zone_get_cur(uma_zone_t zone)
 {
 	int64_t nitems;
 	u_int i;
 
 	ZONE_LOCK(zone);
 	nitems = zone->uz_allocs - zone->uz_frees;
 	CPU_FOREACH(i) {
 		/*
 		 * See the comment in sysctl_vm_zone_stats() regarding the
 		 * safety of accessing the per-cpu caches. With the zone lock
 		 * held, it is safe, but can potentially result in stale data.
 		 */
 		nitems += zone->uz_cpu[i].uc_allocs -
 		    zone->uz_cpu[i].uc_frees;
 	}
 	ZONE_UNLOCK(zone);
 
 	return (nitems < 0 ? 0 : nitems);
 }
 
 /* See uma.h */
 void
 uma_zone_set_init(uma_zone_t zone, uma_init uminit)
 {
 	uma_keg_t keg;
 
 	keg = zone_first_keg(zone);
 	KASSERT(keg != NULL, ("uma_zone_set_init: Invalid zone type"));
 	KEG_LOCK(keg);
 	KASSERT(keg->uk_pages == 0,
 	    ("uma_zone_set_init on non-empty keg"));
 	keg->uk_init = uminit;
 	KEG_UNLOCK(keg);
 }
 
 /* See uma.h */
 void
 uma_zone_set_fini(uma_zone_t zone, uma_fini fini)
 {
 	uma_keg_t keg;
 
 	keg = zone_first_keg(zone);
 	KASSERT(keg != NULL, ("uma_zone_set_fini: Invalid zone type"));
 	KEG_LOCK(keg);
 	KASSERT(keg->uk_pages == 0,
 	    ("uma_zone_set_fini on non-empty keg"));
 	keg->uk_fini = fini;
 	KEG_UNLOCK(keg);
 }
 
 /* See uma.h */
 void
 uma_zone_set_zinit(uma_zone_t zone, uma_init zinit)
 {
 
 	ZONE_LOCK(zone);
 	KASSERT(zone_first_keg(zone)->uk_pages == 0,
 	    ("uma_zone_set_zinit on non-empty keg"));
 	zone->uz_init = zinit;
 	ZONE_UNLOCK(zone);
 }
 
 /* See uma.h */
 void
 uma_zone_set_zfini(uma_zone_t zone, uma_fini zfini)
 {
 
 	ZONE_LOCK(zone);
 	KASSERT(zone_first_keg(zone)->uk_pages == 0,
 	    ("uma_zone_set_zfini on non-empty keg"));
 	zone->uz_fini = zfini;
 	ZONE_UNLOCK(zone);
 }
 
 /* See uma.h */
 /* XXX uk_freef is not actually used with the zone locked */
 void
 uma_zone_set_freef(uma_zone_t zone, uma_free freef)
 {
 	uma_keg_t keg;
 
 	keg = zone_first_keg(zone);
 	KASSERT(keg != NULL, ("uma_zone_set_freef: Invalid zone type"));
 	KEG_LOCK(keg);
 	keg->uk_freef = freef;
 	KEG_UNLOCK(keg);
 }
 
 /* See uma.h */
 /* XXX uk_allocf is not actually used with the zone locked */
 void
 uma_zone_set_allocf(uma_zone_t zone, uma_alloc allocf)
 {
 	uma_keg_t keg;
 
 	keg = zone_first_keg(zone);
 	KEG_LOCK(keg);
 	keg->uk_allocf = allocf;
 	KEG_UNLOCK(keg);
 }
 
 /* See uma.h */
 void
 uma_zone_reserve(uma_zone_t zone, int items)
 {
 	uma_keg_t keg;
 
 	keg = zone_first_keg(zone);
 	if (keg == NULL)
 		return;
 	KEG_LOCK(keg);
 	keg->uk_reserve = items;
 	KEG_UNLOCK(keg);
 
 	return;
 }
 
 /* See uma.h */
 int
 uma_zone_reserve_kva(uma_zone_t zone, int count)
 {
 	uma_keg_t keg;
 	vm_offset_t kva;
 	int pages;
 
 	keg = zone_first_keg(zone);
 	if (keg == NULL)
 		return (0);
 	pages = count / keg->uk_ipers;
 
 	if (pages * keg->uk_ipers < count)
 		pages++;
 
 #ifdef UMA_MD_SMALL_ALLOC
 	if (keg->uk_ppera > 1) {
 #else
 	if (1) {
 #endif
 		kva = kva_alloc(pages * UMA_SLAB_SIZE);
 		if (kva == 0)
 			return (0);
 	} else
 		kva = 0;
 	KEG_LOCK(keg);
 	keg->uk_kva = kva;
 	keg->uk_offset = 0;
 	keg->uk_maxpages = pages;
 #ifdef UMA_MD_SMALL_ALLOC
 	keg->uk_allocf = (keg->uk_ppera > 1) ? noobj_alloc : uma_small_alloc;
 #else
 	keg->uk_allocf = noobj_alloc;
 #endif
 	keg->uk_flags |= UMA_ZONE_NOFREE;
 	KEG_UNLOCK(keg);
 
 	return (1);
 }
 
 /* See uma.h */
 void
 uma_prealloc(uma_zone_t zone, int items)
 {
 	int slabs;
 	uma_slab_t slab;
 	uma_keg_t keg;
 
 	keg = zone_first_keg(zone);
 	if (keg == NULL)
 		return;
 	KEG_LOCK(keg);
 	slabs = items / keg->uk_ipers;
 	if (slabs * keg->uk_ipers < items)
 		slabs++;
 	while (slabs > 0) {
 		slab = keg_alloc_slab(keg, zone, M_WAITOK);
 		if (slab == NULL)
 			break;
 		MPASS(slab->us_keg == keg);
 		LIST_INSERT_HEAD(&keg->uk_free_slab, slab, us_link);
 		slabs--;
 	}
 	KEG_UNLOCK(keg);
 }
 
 /* See uma.h */
 uint32_t *
 uma_find_refcnt(uma_zone_t zone, void *item)
 {
 	uma_slabrefcnt_t slabref;
 	uma_slab_t slab;
 	uma_keg_t keg;
 	uint32_t *refcnt;
 	int idx;
 
 	slab = vtoslab((vm_offset_t)item & (~UMA_SLAB_MASK));
 	slabref = (uma_slabrefcnt_t)slab;
 	keg = slab->us_keg;
 	KASSERT(keg->uk_flags & UMA_ZONE_REFCNT,
 	    ("uma_find_refcnt(): zone possibly not UMA_ZONE_REFCNT"));
 	idx = ((uintptr_t)item - (uintptr_t)slab->us_data) / keg->uk_rsize;
 	refcnt = &slabref->us_refcnt[idx];
 	return refcnt;
 }
 
 /* See uma.h */
-void
-uma_reclaim(void)
+static void
+uma_reclaim_locked(bool kmem_danger)
 {
+
 #ifdef UMA_DEBUG
 	printf("UMA: vm asked us to release pages!\n");
 #endif
-	sx_xlock(&uma_drain_lock);
+	sx_assert(&uma_drain_lock, SA_XLOCKED);
 	bucket_enable();
 	zone_foreach(zone_drain);
-	if (vm_page_count_min()) {
+	if (vm_page_count_min() || kmem_danger) {
 		cache_drain_safe(NULL);
 		zone_foreach(zone_drain);
 	}
 	/*
 	 * Some slabs may have been freed, but these zones are visited early
 	 * in the walk; drain them again so that pages which became empty once
 	 * other zones were drained can be freed.  Do the same for buckets.
 	 */
 	zone_drain(slabzone);
 	zone_drain(slabrefzone);
 	bucket_zone_drain();
+}
+
+void
+uma_reclaim(void)
+{
+
+	sx_xlock(&uma_drain_lock);
+	uma_reclaim_locked(false);
 	sx_xunlock(&uma_drain_lock);
+}
+
+static int uma_reclaim_needed;
+
+void
+uma_reclaim_wakeup(void)
+{
+
+	uma_reclaim_needed = 1;
+	wakeup(&uma_reclaim_needed);
+}
+
+void
+uma_reclaim_worker(void *arg __unused)
+{
+
+	sx_xlock(&uma_drain_lock);
+	for (;;) {
+		sx_sleep(&uma_reclaim_needed, &uma_drain_lock, PVM,
+		    "umarcl", 0);
+		if (uma_reclaim_needed) {
+			uma_reclaim_needed = 0;
+			uma_reclaim_locked(true);
+		}
+	}
 }
 
 /* See uma.h */
 int
 uma_zone_exhausted(uma_zone_t zone)
 {
 	int full;
 
 	ZONE_LOCK(zone);
 	full = (zone->uz_flags & UMA_ZFLAG_FULL);
 	ZONE_UNLOCK(zone);
 	return (full);	
 }
 
 int
 uma_zone_exhausted_nolock(uma_zone_t zone)
 {
 	return (zone->uz_flags & UMA_ZFLAG_FULL);
 }
 
 void *
 uma_large_malloc(vm_size_t size, int wait)
 {
 	void *mem;
 	uma_slab_t slab;
 	uint8_t flags;
 
 	slab = zone_alloc_item(slabzone, NULL, wait);
 	if (slab == NULL)
 		return (NULL);
 	mem = page_alloc(NULL, size, &flags, wait);
 	if (mem) {
 		vsetslab((vm_offset_t)mem, slab);
 		slab->us_data = mem;
 		slab->us_flags = flags | UMA_SLAB_MALLOC;
 		slab->us_size = size;
 	} else {
 		zone_free_item(slabzone, slab, NULL, SKIP_NONE);
 	}
 
 	return (mem);
 }
 
 void
 uma_large_free(uma_slab_t slab)
 {
 
 	page_free(slab->us_data, slab->us_size, slab->us_flags);
 	zone_free_item(slabzone, slab, NULL, SKIP_NONE);
 }
 
 static void
 uma_zero_item(void *item, uma_zone_t zone)
 {
 
 	if (zone->uz_flags & UMA_ZONE_PCPU) {
 		for (int i = 0; i < mp_ncpus; i++)
 			bzero(zpcpu_get_cpu(item, i), zone->uz_size);
 	} else
 		bzero(item, zone->uz_size);
 }
 
 void
 uma_print_stats(void)
 {
 	zone_foreach(uma_print_zone);
 }
 
 static void
 slab_print(uma_slab_t slab)
 {
 	printf("slab: keg %p, data %p, freecount %d\n",
 		slab->us_keg, slab->us_data, slab->us_freecount);
 }
 
 static void
 cache_print(uma_cache_t cache)
 {
 	printf("alloc: %p(%d), free: %p(%d)\n",
 		cache->uc_allocbucket,
 		cache->uc_allocbucket?cache->uc_allocbucket->ub_cnt:0,
 		cache->uc_freebucket,
 		cache->uc_freebucket?cache->uc_freebucket->ub_cnt:0);
 }
 
 static void
 uma_print_keg(uma_keg_t keg)
 {
 	uma_slab_t slab;
 
 	printf("keg: %s(%p) size %d(%d) flags %#x ipers %d ppera %d "
 	    "out %d free %d limit %d\n",
 	    keg->uk_name, keg, keg->uk_size, keg->uk_rsize, keg->uk_flags,
 	    keg->uk_ipers, keg->uk_ppera,
 	    (keg->uk_ipers * keg->uk_pages) - keg->uk_free, keg->uk_free,
 	    (keg->uk_maxpages / keg->uk_ppera) * keg->uk_ipers);
 	printf("Part slabs:\n");
 	LIST_FOREACH(slab, &keg->uk_part_slab, us_link)
 		slab_print(slab);
 	printf("Free slabs:\n");
 	LIST_FOREACH(slab, &keg->uk_free_slab, us_link)
 		slab_print(slab);
 	printf("Full slabs:\n");
 	LIST_FOREACH(slab, &keg->uk_full_slab, us_link)
 		slab_print(slab);
 }
 
 void
 uma_print_zone(uma_zone_t zone)
 {
 	uma_cache_t cache;
 	uma_klink_t kl;
 	int i;
 
 	printf("zone: %s(%p) size %d flags %#x\n",
 	    zone->uz_name, zone, zone->uz_size, zone->uz_flags);
 	LIST_FOREACH(kl, &zone->uz_kegs, kl_link)
 		uma_print_keg(kl->kl_keg);
 	CPU_FOREACH(i) {
 		cache = &zone->uz_cpu[i];
 		printf("CPU %d Cache:\n", i);
 		cache_print(cache);
 	}
 }
 
 #ifdef DDB
 /*
  * Generate statistics across both the zone and its per-CPU caches.  Return
  * desired statistics if the pointer is non-NULL for that statistic.
  *
  * Note: does not update the zone statistics, as it can't safely clear the
  * per-CPU cache statistic.
  *
  * XXXRW: Following the uc_allocbucket and uc_freebucket pointers here isn't
  * safe from off-CPU; we should modify the caches to track this information
  * directly so that we don't have to.
  */
 static void
 uma_zone_sumstat(uma_zone_t z, int *cachefreep, uint64_t *allocsp,
     uint64_t *freesp, uint64_t *sleepsp)
 {
 	uma_cache_t cache;
 	uint64_t allocs, frees, sleeps;
 	int cachefree, cpu;
 
 	allocs = frees = sleeps = 0;
 	cachefree = 0;
 	CPU_FOREACH(cpu) {
 		cache = &z->uz_cpu[cpu];
 		if (cache->uc_allocbucket != NULL)
 			cachefree += cache->uc_allocbucket->ub_cnt;
 		if (cache->uc_freebucket != NULL)
 			cachefree += cache->uc_freebucket->ub_cnt;
 		allocs += cache->uc_allocs;
 		frees += cache->uc_frees;
 	}
 	allocs += z->uz_allocs;
 	frees += z->uz_frees;
 	sleeps += z->uz_sleeps;
 	if (cachefreep != NULL)
 		*cachefreep = cachefree;
 	if (allocsp != NULL)
 		*allocsp = allocs;
 	if (freesp != NULL)
 		*freesp = frees;
 	if (sleepsp != NULL)
 		*sleepsp = sleeps;
 }
 #endif /* DDB */
 
 static int
 sysctl_vm_zone_count(SYSCTL_HANDLER_ARGS)
 {
 	uma_keg_t kz;
 	uma_zone_t z;
 	int count;
 
 	count = 0;
 	rw_rlock(&uma_rwlock);
 	LIST_FOREACH(kz, &uma_kegs, uk_link) {
 		LIST_FOREACH(z, &kz->uk_zones, uz_link)
 			count++;
 	}
 	rw_runlock(&uma_rwlock);
 	return (sysctl_handle_int(oidp, &count, 0, req));
 }
 
 static int
 sysctl_vm_zone_stats(SYSCTL_HANDLER_ARGS)
 {
 	struct uma_stream_header ush;
 	struct uma_type_header uth;
 	struct uma_percpu_stat ups;
 	uma_bucket_t bucket;
 	struct sbuf sbuf;
 	uma_cache_t cache;
 	uma_klink_t kl;
 	uma_keg_t kz;
 	uma_zone_t z;
 	uma_keg_t k;
 	int count, error, i;
 
 	error = sysctl_wire_old_buffer(req, 0);
 	if (error != 0)
 		return (error);
 	sbuf_new_for_sysctl(&sbuf, NULL, 128, req);
 	sbuf_clear_flags(&sbuf, SBUF_INCLUDENUL);
 
 	count = 0;
 	rw_rlock(&uma_rwlock);
 	LIST_FOREACH(kz, &uma_kegs, uk_link) {
 		LIST_FOREACH(z, &kz->uk_zones, uz_link)
 			count++;
 	}
 
 	/*
 	 * Insert stream header.
 	 */
 	bzero(&ush, sizeof(ush));
 	ush.ush_version = UMA_STREAM_VERSION;
 	ush.ush_maxcpus = (mp_maxid + 1);
 	ush.ush_count = count;
 	(void)sbuf_bcat(&sbuf, &ush, sizeof(ush));
 
 	LIST_FOREACH(kz, &uma_kegs, uk_link) {
 		LIST_FOREACH(z, &kz->uk_zones, uz_link) {
 			bzero(&uth, sizeof(uth));
 			ZONE_LOCK(z);
 			strlcpy(uth.uth_name, z->uz_name, UTH_MAX_NAME);
 			uth.uth_align = kz->uk_align;
 			uth.uth_size = kz->uk_size;
 			uth.uth_rsize = kz->uk_rsize;
 			LIST_FOREACH(kl, &z->uz_kegs, kl_link) {
 				k = kl->kl_keg;
 				uth.uth_maxpages += k->uk_maxpages;
 				uth.uth_pages += k->uk_pages;
 				uth.uth_keg_free += k->uk_free;
 				uth.uth_limit = (k->uk_maxpages / k->uk_ppera)
 				    * k->uk_ipers;
 			}
 
 			/*
 			 * A zone is secondary if it is not the first entry
 			 * on the keg's zone list.
 			 */
 			if ((z->uz_flags & UMA_ZONE_SECONDARY) &&
 			    (LIST_FIRST(&kz->uk_zones) != z))
 				uth.uth_zone_flags = UTH_ZONE_SECONDARY;
 
 			LIST_FOREACH(bucket, &z->uz_buckets, ub_link)
 				uth.uth_zone_free += bucket->ub_cnt;
 			uth.uth_allocs = z->uz_allocs;
 			uth.uth_frees = z->uz_frees;
 			uth.uth_fails = z->uz_fails;
 			uth.uth_sleeps = z->uz_sleeps;
 			(void)sbuf_bcat(&sbuf, &uth, sizeof(uth));
 			/*
 			 * While it is not normally safe to access the cache
 			 * bucket pointers while not on the CPU that owns the
 			 * cache, we only allow the pointers to be exchanged
 			 * without the zone lock held, not invalidated, so
 			 * accept the possible race associated with bucket
 			 * exchange during monitoring.
 			 */
 			for (i = 0; i < (mp_maxid + 1); i++) {
 				bzero(&ups, sizeof(ups));
 				if (kz->uk_flags & UMA_ZFLAG_INTERNAL)
 					goto skip;
 				if (CPU_ABSENT(i))
 					goto skip;
 				cache = &z->uz_cpu[i];
 				if (cache->uc_allocbucket != NULL)
 					ups.ups_cache_free +=
 					    cache->uc_allocbucket->ub_cnt;
 				if (cache->uc_freebucket != NULL)
 					ups.ups_cache_free +=
 					    cache->uc_freebucket->ub_cnt;
 				ups.ups_allocs = cache->uc_allocs;
 				ups.ups_frees = cache->uc_frees;
 skip:
 				(void)sbuf_bcat(&sbuf, &ups, sizeof(ups));
 			}
 			ZONE_UNLOCK(z);
 		}
 	}
 	rw_runlock(&uma_rwlock);
 	error = sbuf_finish(&sbuf);
 	sbuf_delete(&sbuf);
 	return (error);
 }
 
 int
 sysctl_handle_uma_zone_max(SYSCTL_HANDLER_ARGS)
 {
 	uma_zone_t zone = *(uma_zone_t *)arg1;
 	int error, max;
 
 	max = uma_zone_get_max(zone);
 	error = sysctl_handle_int(oidp, &max, 0, req);
 	if (error || !req->newptr)
 		return (error);
 
 	uma_zone_set_max(zone, max);
 
 	return (0);
 }
 
 int
 sysctl_handle_uma_zone_cur(SYSCTL_HANDLER_ARGS)
 {
 	uma_zone_t zone = *(uma_zone_t *)arg1;
 	int cur;
 
 	cur = uma_zone_get_cur(zone);
 	return (sysctl_handle_int(oidp, &cur, 0, req));
 }
 
 #ifdef DDB
 DB_SHOW_COMMAND(uma, db_show_uma)
 {
 	uint64_t allocs, frees, sleeps;
 	uma_bucket_t bucket;
 	uma_keg_t kz;
 	uma_zone_t z;
 	int cachefree;
 
 	db_printf("%18s %8s %8s %8s %12s %8s %8s\n", "Zone", "Size", "Used",
 	    "Free", "Requests", "Sleeps", "Bucket");
 	LIST_FOREACH(kz, &uma_kegs, uk_link) {
 		LIST_FOREACH(z, &kz->uk_zones, uz_link) {
 			if (kz->uk_flags & UMA_ZFLAG_INTERNAL) {
 				allocs = z->uz_allocs;
 				frees = z->uz_frees;
 				sleeps = z->uz_sleeps;
 				cachefree = 0;
 			} else
 				uma_zone_sumstat(z, &cachefree, &allocs,
 				    &frees, &sleeps);
 			if (!((z->uz_flags & UMA_ZONE_SECONDARY) &&
 			    (LIST_FIRST(&kz->uk_zones) != z)))
 				cachefree += kz->uk_free;
 			LIST_FOREACH(bucket, &z->uz_buckets, ub_link)
 				cachefree += bucket->ub_cnt;
 			db_printf("%18s %8ju %8jd %8d %12ju %8ju %8u\n",
 			    z->uz_name, (uintmax_t)kz->uk_size,
 			    (intmax_t)(allocs - frees), cachefree,
 			    (uintmax_t)allocs, sleeps, z->uz_count);
 			if (db_pager_quit)
 				return;
 		}
 	}
 }
 
 DB_SHOW_COMMAND(umacache, db_show_umacache)
 {
 	uint64_t allocs, frees;
 	uma_bucket_t bucket;
 	uma_zone_t z;
 	int cachefree;
 
 	db_printf("%18s %8s %8s %8s %12s %8s\n", "Zone", "Size", "Used", "Free",
 	    "Requests", "Bucket");
 	LIST_FOREACH(z, &uma_cachezones, uz_link) {
 		uma_zone_sumstat(z, &cachefree, &allocs, &frees, NULL);
 		LIST_FOREACH(bucket, &z->uz_buckets, ub_link)
 			cachefree += bucket->ub_cnt;
 		db_printf("%18s %8ju %8jd %8d %12ju %8u\n",
 		    z->uz_name, (uintmax_t)z->uz_size,
 		    (intmax_t)(allocs - frees), cachefree,
 		    (uintmax_t)allocs, z->uz_count);
 		if (db_pager_quit)
 			return;
 	}
 }
 #endif
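
The limit rounding performed by uma_zone_set_max() in the uma_core.c diff above can be tried outside the kernel. What follows is a minimal sketch, not the kernel code: `struct keg_geom` and `keg_set_max()` are hypothetical names introduced here for illustration, only the three fields the arithmetic touches are modelled, and the keg locking is omitted. It assumes `ipers` and `ppera` are positive.

```c
/*
 * Hypothetical stand-in for the keg fields that uma_zone_set_max() reads
 * and writes.  The field names echo uk_ipers, uk_ppera and uk_maxpages,
 * but this struct is not the kernel's struct uma_keg.
 */
struct keg_geom {
	int ipers;	/* items per slab */
	int ppera;	/* pages per slab allocation */
	int maxpages;	/* computed page limit */
};

/*
 * Same arithmetic as the function in the diff: convert the requested item
 * count to a page limit, add one more allocation if the truncating
 * division left the limit short, and return the resulting item limit.
 */
int
keg_set_max(struct keg_geom *keg, int nitems)
{

	keg->maxpages = (nitems / keg->ipers) * keg->ppera;
	if (keg->maxpages * keg->ipers < nitems)
		keg->maxpages += keg->ppera;
	return (keg->maxpages * keg->ipers);
}
```

With `ipers = 4` and `ppera = 1`, a request for 10 items is rounded up to 3 pages and an effective limit of 12 items, while a request for 8 items stays at 2 pages and 8 items. The sketch deliberately sticks to single-page slabs in its examples, since that is the case where pages and slabs coincide.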
Index: projects/release-arm-redux/sys/vm/vm_pageout.c
===================================================================
--- projects/release-arm-redux/sys/vm/vm_pageout.c	(revision 282691)
+++ projects/release-arm-redux/sys/vm/vm_pageout.c	(revision 282692)
@@ -1,1914 +1,1919 @@
 /*-
  * Copyright (c) 1991 Regents of the University of California.
  * All rights reserved.
  * Copyright (c) 1994 John S. Dyson
  * All rights reserved.
  * Copyright (c) 1994 David Greenman
  * All rights reserved.
  * Copyright (c) 2005 Yahoo! Technologies Norway AS
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * The Mach Operating System project at Carnegie-Mellon University.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: @(#)vm_pageout.c	7.4 (Berkeley) 5/7/91
  *
  *
  * Copyright (c) 1987, 1990 Carnegie-Mellon University.
  * All rights reserved.
  *
  * Authors: Avadis Tevanian, Jr., Michael Wayne Young
  *
  * Permission to use, copy, modify and distribute this software and
  * its documentation is hereby granted, provided that both the copyright
  * notice and this permission notice appear in all copies of the
  * software, derivative works or modified versions, and any portions
  * thereof, and that both notices appear in supporting documentation.
  *
  * CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
  * CONDITION.  CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
  * FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
  *
  * Carnegie Mellon requests users of this software to return to
  *
  *  Software Distribution Coordinator  or  Software.Distribution@CS.CMU.EDU
  *  School of Computer Science
  *  Carnegie Mellon University
  *  Pittsburgh PA 15213-3890
  *
  * any improvements or extensions that they make and grant Carnegie the
  * rights to redistribute these changes.
  */
 
 /*
  *	The proverbial page-out daemon.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_vm.h"
 #include "opt_kdtrace.h"
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/eventhandler.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
 #include <sys/kthread.h>
 #include <sys/ktr.h>
 #include <sys/mount.h>
 #include <sys/racct.h>
 #include <sys/resourcevar.h>
 #include <sys/sched.h>
 #include <sys/sdt.h>
 #include <sys/signalvar.h>
 #include <sys/smp.h>
 #include <sys/vnode.h>
 #include <sys/vmmeter.h>
 #include <sys/rwlock.h>
 #include <sys/sx.h>
 #include <sys/sysctl.h>
 
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/vm_object.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_pageout.h>
 #include <vm/vm_pager.h>
 #include <vm/vm_phys.h>
 #include <vm/swap_pager.h>
 #include <vm/vm_extern.h>
 #include <vm/uma.h>
 
 /*
  * System initialization
  */
 
 /* the kernel process "vm_pageout"*/
 static void vm_pageout(void);
 static void vm_pageout_init(void);
 static int vm_pageout_clean(vm_page_t m);
 static int vm_pageout_cluster(vm_page_t m);
 static void vm_pageout_scan(struct vm_domain *vmd, int pass);
 static void vm_pageout_mightbe_oom(struct vm_domain *vmd, int pass);
 
 SYSINIT(pagedaemon_init, SI_SUB_KTHREAD_PAGE, SI_ORDER_FIRST, vm_pageout_init,
     NULL);
 
 struct proc *pageproc;
 
 static struct kproc_desc page_kp = {
 	"pagedaemon",
 	vm_pageout,
 	&pageproc
 };
 SYSINIT(pagedaemon, SI_SUB_KTHREAD_PAGE, SI_ORDER_SECOND, kproc_start,
     &page_kp);
 
 SDT_PROVIDER_DEFINE(vm);
 SDT_PROBE_DEFINE(vm, , , vm__lowmem_cache);
 SDT_PROBE_DEFINE(vm, , , vm__lowmem_scan);
 
 #if !defined(NO_SWAPPING)
 /* the kernel process "vm_daemon"*/
 static void vm_daemon(void);
 static struct	proc *vmproc;
 
 static struct kproc_desc vm_kp = {
 	"vmdaemon",
 	vm_daemon,
 	&vmproc
 };
 SYSINIT(vmdaemon, SI_SUB_KTHREAD_VM, SI_ORDER_FIRST, kproc_start, &vm_kp);
 #endif
 
 
 int vm_pages_needed;		/* Event on which pageout daemon sleeps */
 int vm_pageout_deficit;		/* Estimated number of pages deficit */
 int vm_pageout_pages_needed;	/* flag saying that the pageout daemon needs pages */
 int vm_pageout_wakeup_thresh;
 
 #if !defined(NO_SWAPPING)
 static int vm_pageout_req_swapout;	/* XXX */
 static int vm_daemon_needed;
 static struct mtx vm_daemon_mtx;
 /* Allow for use by vm_pageout before vm_daemon is initialized. */
 MTX_SYSINIT(vm_daemon, &vm_daemon_mtx, "vm daemon", MTX_DEF);
 #endif
 static int vm_max_launder = 32;
 static int vm_pageout_update_period;
 static int defer_swap_pageouts;
 static int disable_swap_pageouts;
 static int lowmem_period = 10;
 static int lowmem_ticks;
 
 #if defined(NO_SWAPPING)
 static int vm_swap_enabled = 0;
 static int vm_swap_idle_enabled = 0;
 #else
 static int vm_swap_enabled = 1;
 static int vm_swap_idle_enabled = 0;
 #endif
 
 static int vm_panic_on_oom = 0;
 
 SYSCTL_INT(_vm, OID_AUTO, panic_on_oom,
 	CTLFLAG_RWTUN, &vm_panic_on_oom, 0,
 	"panic on out of memory instead of killing the largest process");
 
 SYSCTL_INT(_vm, OID_AUTO, pageout_wakeup_thresh,
 	CTLFLAG_RW, &vm_pageout_wakeup_thresh, 0,
 	"free page threshold for waking up the pageout daemon");
 
 SYSCTL_INT(_vm, OID_AUTO, max_launder,
 	CTLFLAG_RW, &vm_max_launder, 0, "Limit dirty flushes in pageout");
 
 SYSCTL_INT(_vm, OID_AUTO, pageout_update_period,
 	CTLFLAG_RW, &vm_pageout_update_period, 0,
 	"Maximum active LRU update period");
   
 SYSCTL_INT(_vm, OID_AUTO, lowmem_period, CTLFLAG_RW, &lowmem_period, 0,
 	"Low memory callback period");
 
 #if defined(NO_SWAPPING)
 SYSCTL_INT(_vm, VM_SWAPPING_ENABLED, swap_enabled,
 	CTLFLAG_RD, &vm_swap_enabled, 0, "Enable entire process swapout");
 SYSCTL_INT(_vm, OID_AUTO, swap_idle_enabled,
 	CTLFLAG_RD, &vm_swap_idle_enabled, 0, "Allow swapout on idle criteria");
 #else
 SYSCTL_INT(_vm, VM_SWAPPING_ENABLED, swap_enabled,
 	CTLFLAG_RW, &vm_swap_enabled, 0, "Enable entire process swapout");
 SYSCTL_INT(_vm, OID_AUTO, swap_idle_enabled,
 	CTLFLAG_RW, &vm_swap_idle_enabled, 0, "Allow swapout on idle criteria");
 #endif
 
 SYSCTL_INT(_vm, OID_AUTO, defer_swapspace_pageouts,
 	CTLFLAG_RW, &defer_swap_pageouts, 0, "Give preference to dirty pages in mem");
 
 SYSCTL_INT(_vm, OID_AUTO, disable_swapspace_pageouts,
 	CTLFLAG_RW, &disable_swap_pageouts, 0, "Disallow swapout of dirty pages");
 
 static int pageout_lock_miss;
 SYSCTL_INT(_vm, OID_AUTO, pageout_lock_miss,
 	CTLFLAG_RD, &pageout_lock_miss, 0, "vget() lock misses during pageout");
 
 #define VM_PAGEOUT_PAGE_COUNT 16
 int vm_pageout_page_count = VM_PAGEOUT_PAGE_COUNT;
 
 int vm_page_max_wired;		/* XXX max # of wired pages system-wide */
 SYSCTL_INT(_vm, OID_AUTO, max_wired,
 	CTLFLAG_RW, &vm_page_max_wired, 0, "System-wide limit to wired page count");
 
 static boolean_t vm_pageout_fallback_object_lock(vm_page_t, vm_page_t *);
 static boolean_t vm_pageout_launder(struct vm_pagequeue *pq, int, vm_paddr_t,
     vm_paddr_t);
 #if !defined(NO_SWAPPING)
 static void vm_pageout_map_deactivate_pages(vm_map_t, long);
 static void vm_pageout_object_deactivate_pages(pmap_t, vm_object_t, long);
 static void vm_req_vmdaemon(int req);
 #endif
 static boolean_t vm_pageout_page_lock(vm_page_t, vm_page_t *);
 
 /*
  * Initialize a dummy page for marking the caller's place in the specified
  * paging queue.  In principle, this function only needs to set the flag
  * PG_MARKER.  Nonetheless, it write busies and initializes the hold count
  * to one as safety precautions.
  */ 
 static void
 vm_pageout_init_marker(vm_page_t marker, u_short queue)
 {
 
 	bzero(marker, sizeof(*marker));
 	marker->flags = PG_MARKER;
 	marker->busy_lock = VPB_SINGLE_EXCLUSIVER;
 	marker->queue = queue;
 	marker->hold_count = 1;
 }
 
 /*
  * vm_pageout_fallback_object_lock:
  * 
  * Lock vm object currently associated with `m'. VM_OBJECT_TRYWLOCK is
  * known to have failed and page queue must be either PQ_ACTIVE or
  * PQ_INACTIVE.  To avoid lock order violation, unlock the page queues
  * while locking the vm object.  Use marker page to detect page queue
  * changes and maintain notion of next page on page queue.  Return
  * TRUE if no changes were detected, FALSE otherwise.  vm object is
  * locked on return.
  * 
  * This function depends on both the lock portion of struct vm_object
  * and normal struct vm_page being type stable.
  */
 static boolean_t
 vm_pageout_fallback_object_lock(vm_page_t m, vm_page_t *next)
 {
 	struct vm_page marker;
 	struct vm_pagequeue *pq;
 	boolean_t unchanged;
 	u_short queue;
 	vm_object_t object;
 
 	queue = m->queue;
 	vm_pageout_init_marker(&marker, queue);
 	pq = vm_page_pagequeue(m);
 	object = m->object;
 	
 	TAILQ_INSERT_AFTER(&pq->pq_pl, m, &marker, plinks.q);
 	vm_pagequeue_unlock(pq);
 	vm_page_unlock(m);
 	VM_OBJECT_WLOCK(object);
 	vm_page_lock(m);
 	vm_pagequeue_lock(pq);
 
 	/* Page queue might have changed. */
 	*next = TAILQ_NEXT(&marker, plinks.q);
 	unchanged = (m->queue == queue &&
 		     m->object == object &&
 		     &marker == TAILQ_NEXT(m, plinks.q));
 	TAILQ_REMOVE(&pq->pq_pl, &marker, plinks.q);
 	return (unchanged);
 }
 
 /*
  * Lock the page while holding the page queue lock.  Use marker page
  * to detect page queue changes and maintain notion of next page on
  * page queue.  Return TRUE if no changes were detected, FALSE
  * otherwise.  The page is locked on return. The page queue lock might
  * be dropped and reacquired.
  *
  * This function depends on normal struct vm_page being type stable.
  */
 static boolean_t
 vm_pageout_page_lock(vm_page_t m, vm_page_t *next)
 {
 	struct vm_page marker;
 	struct vm_pagequeue *pq;
 	boolean_t unchanged;
 	u_short queue;
 
 	vm_page_lock_assert(m, MA_NOTOWNED);
 	if (vm_page_trylock(m))
 		return (TRUE);
 
 	queue = m->queue;
 	vm_pageout_init_marker(&marker, queue);
 	pq = vm_page_pagequeue(m);
 
 	TAILQ_INSERT_AFTER(&pq->pq_pl, m, &marker, plinks.q);
 	vm_pagequeue_unlock(pq);
 	vm_page_lock(m);
 	vm_pagequeue_lock(pq);
 
 	/* Page queue might have changed. */
 	*next = TAILQ_NEXT(&marker, plinks.q);
 	unchanged = (m->queue == queue && &marker == TAILQ_NEXT(m, plinks.q));
 	TAILQ_REMOVE(&pq->pq_pl, &marker, plinks.q);
 	return (unchanged);
 }
 
 /*
  * vm_pageout_clean:
  *
  * Clean the page and remove it from the laundry.
  * 
  * We set the busy bit to cause potential page faults on this page to
  * block.  Note the careful timing, however: the busy bit isn't set until
  * late, and we cannot do anything that will mess with the page.
  */
 static int
 vm_pageout_cluster(vm_page_t m)
 {
 	vm_object_t object;
 	vm_page_t mc[2*vm_pageout_page_count], pb, ps;
 	int pageout_count;
 	int ib, is, page_base;
 	vm_pindex_t pindex = m->pindex;
 
 	vm_page_lock_assert(m, MA_OWNED);
 	object = m->object;
 	VM_OBJECT_ASSERT_WLOCKED(object);
 
 	/*
 	 * It doesn't cost us anything to pageout OBJT_DEFAULT or OBJT_SWAP
 	 * with the new swapper, but we could have serious problems paging
 	 * out other object types if there is insufficient memory.  
 	 *
 	 * Unfortunately, checking free memory here is far too late, so the
 	 * check has been moved up a procedural level.
 	 */
 
 	/*
 	 * Can't clean the page if it's busy or held.
 	 */
 	vm_page_assert_unbusied(m);
 	KASSERT(m->hold_count == 0, ("vm_pageout_clean: page %p is held", m));
 	vm_page_unlock(m);
 
 	mc[vm_pageout_page_count] = pb = ps = m;
 	pageout_count = 1;
 	page_base = vm_pageout_page_count;
 	ib = 1;
 	is = 1;
 
 	/*
 	 * Scan object for clusterable pages.
 	 *
 	 * We can cluster ONLY if: ->> the page is NOT
 	 * clean, wired, busy, held, or mapped into a
 	 * buffer, and one of the following:
 	 * 1) The page is inactive, or a seldom used
 	 *    active page.
 	 * -or-
 	 * 2) we force the issue.
 	 *
 	 * During heavy mmap/modification loads the pageout
 	 * daemon can really fragment the underlying file
 	 * due to flushing pages out of order and not trying to
 	 * align the clusters (which leaves sporadic out-of-order
 	 * holes).  To solve this problem we do the reverse scan
 	 * first and attempt to align our cluster, then do a 
 	 * forward scan if room remains.
 	 */
 more:
 	while (ib && pageout_count < vm_pageout_page_count) {
 		vm_page_t p;
 
 		if (ib > pindex) {
 			ib = 0;
 			break;
 		}
 
 		if ((p = vm_page_prev(pb)) == NULL || vm_page_busied(p)) {
 			ib = 0;
 			break;
 		}
 		vm_page_lock(p);
 		vm_page_test_dirty(p);
 		if (p->dirty == 0 ||
 		    p->queue != PQ_INACTIVE ||
 		    p->hold_count != 0) {	/* may be undergoing I/O */
 			vm_page_unlock(p);
 			ib = 0;
 			break;
 		}
 		vm_page_unlock(p);
 		mc[--page_base] = pb = p;
 		++pageout_count;
 		++ib;
 		/*
 		 * alignment boundary, stop here and switch directions.  Do
 		 * not clear ib.
 		 */
 		if ((pindex - (ib - 1)) % vm_pageout_page_count == 0)
 			break;
 	}
 
 	while (pageout_count < vm_pageout_page_count && 
 	    pindex + is < object->size) {
 		vm_page_t p;
 
 		if ((p = vm_page_next(ps)) == NULL || vm_page_busied(p))
 			break;
 		vm_page_lock(p);
 		vm_page_test_dirty(p);
 		if (p->dirty == 0 ||
 		    p->queue != PQ_INACTIVE ||
 		    p->hold_count != 0) {	/* may be undergoing I/O */
 			vm_page_unlock(p);
 			break;
 		}
 		vm_page_unlock(p);
 		mc[page_base + pageout_count] = ps = p;
 		++pageout_count;
 		++is;
 	}
 
 	/*
 	 * If we exhausted our forward scan, continue with the reverse scan
 	 * when possible, even past a page boundary.  This catches boundary
 	 * conditions.
 	 */
 	if (ib && pageout_count < vm_pageout_page_count)
 		goto more;
 
 	/*
 	 * we allow reads during pageouts...
 	 */
 	return (vm_pageout_flush(&mc[page_base], pageout_count, 0, 0, NULL,
 	    NULL));
 }
 
 /*
  * vm_pageout_flush() - launder the given pages
  *
  *	The given pages are laundered.  Note that we set up for the start of
  *	I/O ( i.e. busy the page ), mark it read-only, and bump the object
  *	reference count all in here rather than in the parent.  If we want
  *	the parent to do more sophisticated things we may have to change
  *	the ordering.
  *
  *	Returned runlen is the count of pages between mreq and first
  *	page after mreq with status VM_PAGER_AGAIN.
  *	*eio is set to TRUE if pager returned VM_PAGER_ERROR or VM_PAGER_FAIL
  *	for any page in runlen set.
  */
 int
 vm_pageout_flush(vm_page_t *mc, int count, int flags, int mreq, int *prunlen,
     boolean_t *eio)
 {
 	vm_object_t object = mc[0]->object;
 	int pageout_status[count];
 	int numpagedout = 0;
 	int i, runlen;
 
 	VM_OBJECT_ASSERT_WLOCKED(object);
 
 	/*
 	 * Initiate I/O.  Bump the vm_page_t->busy counter and
 	 * mark the pages read-only.
 	 *
 	 * We do not have to fixup the clean/dirty bits here... we can
 	 * allow the pager to do it after the I/O completes.
 	 *
 	 * NOTE! mc[i]->dirty may be partial or fragmented due to an
 	 * edge case with file fragments.
 	 */
 	for (i = 0; i < count; i++) {
 		KASSERT(mc[i]->valid == VM_PAGE_BITS_ALL,
 		    ("vm_pageout_flush: partially invalid page %p index %d/%d",
 			mc[i], i, count));
 		vm_page_sbusy(mc[i]);
 		pmap_remove_write(mc[i]);
 	}
 	vm_object_pip_add(object, count);
 
 	vm_pager_put_pages(object, mc, count, flags, pageout_status);
 
 	runlen = count - mreq;
 	if (eio != NULL)
 		*eio = FALSE;
 	for (i = 0; i < count; i++) {
 		vm_page_t mt = mc[i];
 
 		KASSERT(pageout_status[i] == VM_PAGER_PEND ||
 		    !pmap_page_is_write_mapped(mt),
 		    ("vm_pageout_flush: page %p is not write protected", mt));
 		switch (pageout_status[i]) {
 		case VM_PAGER_OK:
 		case VM_PAGER_PEND:
 			numpagedout++;
 			break;
 		case VM_PAGER_BAD:
 			/*
 			 * Page outside of range of object. Right now we
 			 * essentially lose the changes by pretending it
 			 * worked.
 			 */
 			vm_page_undirty(mt);
 			break;
 		case VM_PAGER_ERROR:
 		case VM_PAGER_FAIL:
 			/*
 			 * If page couldn't be paged out, then reactivate the
 			 * page so it doesn't clog the inactive list.  (We
 			 * will try paging it out again later).
 			 */
 			vm_page_lock(mt);
 			vm_page_activate(mt);
 			vm_page_unlock(mt);
 			if (eio != NULL && i >= mreq && i - mreq < runlen)
 				*eio = TRUE;
 			break;
 		case VM_PAGER_AGAIN:
 			if (i >= mreq && i - mreq < runlen)
 				runlen = i - mreq;
 			break;
 		}
 
 		/*
 		 * If the operation is still going, leave the page busy to
 		 * block all other accesses. Also, leave the paging in
 		 * progress indicator set so that we don't attempt an object
 		 * collapse.
 		 */
 		if (pageout_status[i] != VM_PAGER_PEND) {
 			vm_object_pip_wakeup(object);
 			vm_page_sunbusy(mt);
 			if (vm_page_count_severe()) {
 				vm_page_lock(mt);
 				vm_page_try_to_cache(mt);
 				vm_page_unlock(mt);
 			}
 		}
 	}
 	if (prunlen != NULL)
 		*prunlen = runlen;
 	return (numpagedout);
 }
 
 static boolean_t
 vm_pageout_launder(struct vm_pagequeue *pq, int tries, vm_paddr_t low,
     vm_paddr_t high)
 {
 	struct mount *mp;
 	struct vnode *vp;
 	vm_object_t object;
 	vm_paddr_t pa;
 	vm_page_t m, m_tmp, next;
 	int lockmode;
 
 	vm_pagequeue_lock(pq);
 	TAILQ_FOREACH_SAFE(m, &pq->pq_pl, plinks.q, next) {
 		if ((m->flags & PG_MARKER) != 0)
 			continue;
 		pa = VM_PAGE_TO_PHYS(m);
 		if (pa < low || pa + PAGE_SIZE > high)
 			continue;
 		if (!vm_pageout_page_lock(m, &next) || m->hold_count != 0) {
 			vm_page_unlock(m);
 			continue;
 		}
 		object = m->object;
 		if ((!VM_OBJECT_TRYWLOCK(object) &&
 		    (!vm_pageout_fallback_object_lock(m, &next) ||
 		    m->hold_count != 0)) || vm_page_busied(m)) {
 			vm_page_unlock(m);
 			VM_OBJECT_WUNLOCK(object);
 			continue;
 		}
 		vm_page_test_dirty(m);
 		if (m->dirty == 0 && object->ref_count != 0)
 			pmap_remove_all(m);
 		if (m->dirty != 0) {
 			vm_page_unlock(m);
 			if (tries == 0 || (object->flags & OBJ_DEAD) != 0) {
 				VM_OBJECT_WUNLOCK(object);
 				continue;
 			}
 			if (object->type == OBJT_VNODE) {
 				vm_pagequeue_unlock(pq);
 				vp = object->handle;
 				vm_object_reference_locked(object);
 				VM_OBJECT_WUNLOCK(object);
 				(void)vn_start_write(vp, &mp, V_WAIT);
 				lockmode = MNT_SHARED_WRITES(vp->v_mount) ?
 				    LK_SHARED : LK_EXCLUSIVE;
 				vn_lock(vp, lockmode | LK_RETRY);
 				VM_OBJECT_WLOCK(object);
 				vm_object_page_clean(object, 0, 0, OBJPC_SYNC);
 				VM_OBJECT_WUNLOCK(object);
 				VOP_UNLOCK(vp, 0);
 				vm_object_deallocate(object);
 				vn_finished_write(mp);
 				return (TRUE);
 			} else if (object->type == OBJT_SWAP ||
 			    object->type == OBJT_DEFAULT) {
 				vm_pagequeue_unlock(pq);
 				m_tmp = m;
 				vm_pageout_flush(&m_tmp, 1, VM_PAGER_PUT_SYNC,
 				    0, NULL, NULL);
 				VM_OBJECT_WUNLOCK(object);
 				return (TRUE);
 			}
 		} else {
 			/*
 			 * Dequeue here to prevent lock recursion in
 			 * vm_page_cache().
 			 */
 			vm_page_dequeue_locked(m);
 			vm_page_cache(m);
 			vm_page_unlock(m);
 		}
 		VM_OBJECT_WUNLOCK(object);
 	}
 	vm_pagequeue_unlock(pq);
 	return (FALSE);
 }
 
 /*
  * Increase the number of cached pages.  The specified value, "tries",
  * determines which categories of pages are cached:
  *
  *  0: All clean, inactive pages within the specified physical address range
  *     are cached.  Will not sleep.
  *  1: The vm_lowmem handlers are called.  All inactive pages within
  *     the specified physical address range are cached.  May sleep.
  *  2: The vm_lowmem handlers are called.  All inactive and active pages
  *     within the specified physical address range are cached.  May sleep.
  */
 void
 vm_pageout_grow_cache(int tries, vm_paddr_t low, vm_paddr_t high)
 {
 	int actl, actmax, inactl, inactmax, dom, initial_dom;
 	static int start_dom = 0;
 
 	if (tries > 0) {
 		/*
 		 * Decrease registered cache sizes.  The vm_lowmem handlers
 		 * may acquire locks and/or sleep, so they can only be invoked
 		 * when "tries" is greater than zero.
 		 */
 		SDT_PROBE0(vm, , , vm__lowmem_cache);
 		EVENTHANDLER_INVOKE(vm_lowmem, 0);
 
 		/*
 		 * We do this explicitly after the caches have been drained
 		 * above.
 		 */
 		uma_reclaim();
 	}
 
 	/*
 	 * Make the next scan start on the next domain.
 	 */
 	initial_dom = atomic_fetchadd_int(&start_dom, 1) % vm_ndomains;
 
 	inactl = 0;
 	inactmax = vm_cnt.v_inactive_count;
 	actl = 0;
 	actmax = tries < 2 ? 0 : vm_cnt.v_active_count;
 	dom = initial_dom;
 
 	/*
 	 * Scan domains in round-robin order, first inactive queues,
 	 * then active.  Since a domain usually owns a large physically
 	 * contiguous chunk of memory, it makes sense to completely
 	 * exhaust one domain before switching to the next, while growing
 	 * the pool of contiguous physical pages.
 	 *
 	 * Do not even start laundering a domain which cannot contain
 	 * the specified address range, as indicated by segments
 	 * constituting the domain.
 	 */
 again:
 	if (inactl < inactmax) {
 		if (vm_phys_domain_intersects(vm_dom[dom].vmd_segs,
 		    low, high) &&
 		    vm_pageout_launder(&vm_dom[dom].vmd_pagequeues[PQ_INACTIVE],
 		    tries, low, high)) {
 			inactl++;
 			goto again;
 		}
 		if (++dom == vm_ndomains)
 			dom = 0;
 		if (dom != initial_dom)
 			goto again;
 	}
 	if (actl < actmax) {
 		if (vm_phys_domain_intersects(vm_dom[dom].vmd_segs,
 		    low, high) &&
 		    vm_pageout_launder(&vm_dom[dom].vmd_pagequeues[PQ_ACTIVE],
 		      tries, low, high)) {
 			actl++;
 			goto again;
 		}
 		if (++dom == vm_ndomains)
 			dom = 0;
 		if (dom != initial_dom)
 			goto again;
 	}
 }
 
 #if !defined(NO_SWAPPING)
 /*
  *	vm_pageout_object_deactivate_pages
  *
  *	Deactivate enough pages to satisfy the inactive target
  *	requirements.
  *
  *	The object and map must be locked.
  */
 static void
 vm_pageout_object_deactivate_pages(pmap_t pmap, vm_object_t first_object,
     long desired)
 {
 	vm_object_t backing_object, object;
 	vm_page_t p;
 	int act_delta, remove_mode;
 
 	VM_OBJECT_ASSERT_LOCKED(first_object);
 	if ((first_object->flags & OBJ_FICTITIOUS) != 0)
 		return;
 	for (object = first_object;; object = backing_object) {
 		if (pmap_resident_count(pmap) <= desired)
 			goto unlock_return;
 		VM_OBJECT_ASSERT_LOCKED(object);
 		if ((object->flags & OBJ_UNMANAGED) != 0 ||
 		    object->paging_in_progress != 0)
 			goto unlock_return;
 
 		remove_mode = 0;
 		if (object->shadow_count > 1)
 			remove_mode = 1;
 		/*
 		 * Scan the object's entire memory queue.
 		 */
 		TAILQ_FOREACH(p, &object->memq, listq) {
 			if (pmap_resident_count(pmap) <= desired)
 				goto unlock_return;
 			if (vm_page_busied(p))
 				continue;
 			PCPU_INC(cnt.v_pdpages);
 			vm_page_lock(p);
 			if (p->wire_count != 0 || p->hold_count != 0 ||
 			    !pmap_page_exists_quick(pmap, p)) {
 				vm_page_unlock(p);
 				continue;
 			}
 			act_delta = pmap_ts_referenced(p);
 			if ((p->aflags & PGA_REFERENCED) != 0) {
 				if (act_delta == 0)
 					act_delta = 1;
 				vm_page_aflag_clear(p, PGA_REFERENCED);
 			}
 			if (p->queue != PQ_ACTIVE && act_delta != 0) {
 				vm_page_activate(p);
 				p->act_count += act_delta;
 			} else if (p->queue == PQ_ACTIVE) {
 				if (act_delta == 0) {
 					p->act_count -= min(p->act_count,
 					    ACT_DECLINE);
 					if (!remove_mode && p->act_count == 0) {
 						pmap_remove_all(p);
 						vm_page_deactivate(p);
 					} else
 						vm_page_requeue(p);
 				} else {
 					vm_page_activate(p);
 					if (p->act_count < ACT_MAX -
 					    ACT_ADVANCE)
 						p->act_count += ACT_ADVANCE;
 					vm_page_requeue(p);
 				}
 			} else if (p->queue == PQ_INACTIVE)
 				pmap_remove_all(p);
 			vm_page_unlock(p);
 		}
 		if ((backing_object = object->backing_object) == NULL)
 			goto unlock_return;
 		VM_OBJECT_RLOCK(backing_object);
 		if (object != first_object)
 			VM_OBJECT_RUNLOCK(object);
 	}
 unlock_return:
 	if (object != first_object)
 		VM_OBJECT_RUNLOCK(object);
 }
 
 /*
  * deactivate some number of pages in a map, try to do it fairly, but
  * that is really hard to do.
  */
 static void
 vm_pageout_map_deactivate_pages(map, desired)
 	vm_map_t map;
 	long desired;
 {
 	vm_map_entry_t tmpe;
 	vm_object_t obj, bigobj;
 	int nothingwired;
 
 	if (!vm_map_trylock(map))
 		return;
 
 	bigobj = NULL;
 	nothingwired = TRUE;
 
 	/*
 	 * first, search out the biggest object, and try to free pages from
 	 * that.
 	 */
 	tmpe = map->header.next;
 	while (tmpe != &map->header) {
 		if ((tmpe->eflags & MAP_ENTRY_IS_SUB_MAP) == 0) {
 			obj = tmpe->object.vm_object;
 			if (obj != NULL && VM_OBJECT_TRYRLOCK(obj)) {
 				if (obj->shadow_count <= 1 &&
 				    (bigobj == NULL ||
 				     bigobj->resident_page_count < obj->resident_page_count)) {
 					if (bigobj != NULL)
 						VM_OBJECT_RUNLOCK(bigobj);
 					bigobj = obj;
 				} else
 					VM_OBJECT_RUNLOCK(obj);
 			}
 		}
 		if (tmpe->wired_count > 0)
 			nothingwired = FALSE;
 		tmpe = tmpe->next;
 	}
 
 	if (bigobj != NULL) {
 		vm_pageout_object_deactivate_pages(map->pmap, bigobj, desired);
 		VM_OBJECT_RUNLOCK(bigobj);
 	}
 	/*
 	 * Next, hunt around for other pages to deactivate.  We actually
 	 * do this search sort of wrong -- .text first is not the best idea.
 	 */
 	tmpe = map->header.next;
 	while (tmpe != &map->header) {
 		if (pmap_resident_count(vm_map_pmap(map)) <= desired)
 			break;
 		if ((tmpe->eflags & MAP_ENTRY_IS_SUB_MAP) == 0) {
 			obj = tmpe->object.vm_object;
 			if (obj != NULL) {
 				VM_OBJECT_RLOCK(obj);
 				vm_pageout_object_deactivate_pages(map->pmap, obj, desired);
 				VM_OBJECT_RUNLOCK(obj);
 			}
 		}
 		tmpe = tmpe->next;
 	}
 
 	/*
 	 * Remove all mappings if a process is swapped out; this will free page
 	 * table pages.
 	 */
 	if (desired == 0 && nothingwired) {
 		pmap_remove(vm_map_pmap(map), vm_map_min(map),
 		    vm_map_max(map));
 	}
 
 	vm_map_unlock(map);
 }
 #endif		/* !defined(NO_SWAPPING) */
 
 /*
  * Attempt to acquire all of the necessary locks to launder a page and
  * then call through the clustering layer to PUTPAGES.  Wait a short
  * time for a vnode lock.
  *
  * Requires the page and object lock on entry, releases both before return.
  * Returns 0 on success and an errno otherwise.
  */
 static int
 vm_pageout_clean(vm_page_t m)
 {
 	struct vnode *vp;
 	struct mount *mp;
 	vm_object_t object;
 	vm_pindex_t pindex;
 	int error, lockmode;
 
 	vm_page_assert_locked(m);
 	object = m->object;
 	VM_OBJECT_ASSERT_WLOCKED(object);
 	error = 0;
 	vp = NULL;
 	mp = NULL;
 
 	/*
 	 * The object is already known NOT to be dead.   It
 	 * is possible for the vget() to block the whole
 	 * pageout daemon, but the new low-memory handling
 	 * code should prevent it.
 	 *
 	 * We can't wait forever for the vnode lock; we might
 	 * deadlock due to a vn_read() getting stuck in
 	 * vm_wait while holding this vnode.  We skip the
 	 * vnode if we can't get it in a reasonable amount
 	 * of time.
 	 */
 	if (object->type == OBJT_VNODE) {
 		vm_page_unlock(m);
 		vp = object->handle;
 		if (vp->v_type == VREG &&
 		    vn_start_write(vp, &mp, V_NOWAIT) != 0) {
 			mp = NULL;
 			error = EDEADLK;
 			goto unlock_all;
 		}
 		KASSERT(mp != NULL,
 		    ("vp %p with NULL v_mount", vp));
 		vm_object_reference_locked(object);
 		pindex = m->pindex;
 		VM_OBJECT_WUNLOCK(object);
 		lockmode = MNT_SHARED_WRITES(vp->v_mount) ?
 		    LK_SHARED : LK_EXCLUSIVE;
 		if (vget(vp, lockmode | LK_TIMELOCK, curthread)) {
 			vp = NULL;
 			error = EDEADLK;
 			goto unlock_mp;
 		}
 		VM_OBJECT_WLOCK(object);
 		vm_page_lock(m);
 		/*
 		 * While the object and page were unlocked, the page
 		 * may have been:
 		 * (1) moved to a different queue,
 		 * (2) reallocated to a different object,
 		 * (3) reallocated to a different offset, or
 		 * (4) cleaned.
 		 */
 		if (m->queue != PQ_INACTIVE || m->object != object ||
 		    m->pindex != pindex || m->dirty == 0) {
 			vm_page_unlock(m);
 			error = ENXIO;
 			goto unlock_all;
 		}
 
 		/*
 		 * The page may have been busied or held while the object
 		 * and page locks were released.
 		 */
 		if (vm_page_busied(m) || m->hold_count != 0) {
 			vm_page_unlock(m);
 			error = EBUSY;
 			goto unlock_all;
 		}
 	}
 
 	/*
 	 * If a page is dirty, then it is either being washed
 	 * (but not yet cleaned) or it is still in the
 	 * laundry.  If it is still in the laundry, then we
 	 * start the cleaning operation. 
 	 */
 	if (vm_pageout_cluster(m) == 0)
 		error = EIO;
 
 unlock_all:
 	VM_OBJECT_WUNLOCK(object);
 
 unlock_mp:
 	vm_page_lock_assert(m, MA_NOTOWNED);
 	if (mp != NULL) {
 		if (vp != NULL)
 			vput(vp);
 		vm_object_deallocate(object);
 		vn_finished_write(mp);
 	}
 
 	return (error);
 }
 
 /*
  *	vm_pageout_scan does the dirty work for the pageout daemon.
  *
  *	pass 0 - Update active LRU/deactivate pages
  *	pass 1 - Move inactive to cache or free
  *	pass 2 - Launder dirty pages
  */
 static void
 vm_pageout_scan(struct vm_domain *vmd, int pass)
 {
 	vm_page_t m, next;
 	struct vm_pagequeue *pq;
 	vm_object_t object;
 	int act_delta, addl_page_shortage, deficit, maxscan, page_shortage;
 	int vnodes_skipped = 0;
 	int maxlaunder;
 	boolean_t queues_locked;
 
 	/*
 	 * If we need to reclaim memory ask kernel caches to return
 	 * some.  We rate limit to avoid thrashing.
 	 */
 	if (vmd == &vm_dom[0] && pass > 0 &&
 	    (ticks - lowmem_ticks) / hz >= lowmem_period) {
 		/*
 		 * Decrease registered cache sizes.
 		 */
 		SDT_PROBE0(vm, , , vm__lowmem_scan);
 		EVENTHANDLER_INVOKE(vm_lowmem, 0);
 		/*
 		 * We do this explicitly after the caches have been
 		 * drained above.
 		 */
 		uma_reclaim();
 		lowmem_ticks = ticks;
 	}
 
 	/*
 	 * The addl_page_shortage is the number of temporarily
 	 * stuck pages in the inactive queue.  In other words, the
 	 * number of pages from the inactive count that should be
 	 * discounted in setting the target for the active queue scan.
 	 */
 	addl_page_shortage = 0;
 
 	/*
 	 * Calculate the number of pages we want to either free or move
 	 * to the cache.
 	 */
 	if (pass > 0) {
 		deficit = atomic_readandclear_int(&vm_pageout_deficit);
 		page_shortage = vm_paging_target() + deficit;
 	} else
 		page_shortage = deficit = 0;
 
 	/*
 	 * maxlaunder limits the number of dirty pages we flush per scan.
 	 * For most systems a smaller value (16 or 32) is more robust under
 	 * extreme memory and disk pressure because any unnecessary writes
 	 * to disk can result in extreme performance degradation.  However,
 	 * systems with excessive dirty pages (especially when MAP_NOSYNC is
 	 * used) will die horribly with limited laundering.  If the pageout
 	 * daemon cannot clean enough pages in the first pass, we let it go
 	 * all out in succeeding passes.
 	 */
 	if ((maxlaunder = vm_max_launder) <= 1)
 		maxlaunder = 1;
 	if (pass > 1)
 		maxlaunder = 10000;
 
 	/*
 	 * Start scanning the inactive queue for pages we can move to the
 	 * cache or free.  The scan will stop when the target is reached or
 	 * we have scanned the entire inactive queue.  Note that m->act_count
 	 * is not used to form decisions for the inactive queue, only for the
 	 * active queue.
 	 */
 	pq = &vmd->vmd_pagequeues[PQ_INACTIVE];
 	maxscan = pq->pq_cnt;
 	vm_pagequeue_lock(pq);
 	queues_locked = TRUE;
 	for (m = TAILQ_FIRST(&pq->pq_pl);
 	     m != NULL && maxscan-- > 0 && page_shortage > 0;
 	     m = next) {
 		vm_pagequeue_assert_locked(pq);
 		KASSERT(queues_locked, ("unlocked queues"));
 		KASSERT(m->queue == PQ_INACTIVE, ("Inactive queue %p", m));
 
 		PCPU_INC(cnt.v_pdpages);
 		next = TAILQ_NEXT(m, plinks.q);
 
 		/*
 		 * skip marker pages
 		 */
 		if (m->flags & PG_MARKER)
 			continue;
 
 		KASSERT((m->flags & PG_FICTITIOUS) == 0,
 		    ("Fictitious page %p cannot be in inactive queue", m));
 		KASSERT((m->oflags & VPO_UNMANAGED) == 0,
 		    ("Unmanaged page %p cannot be in inactive queue", m));
 
 		/*
 		 * The page or object lock acquisitions fail if the
 		 * page was removed from the queue or moved to a
 		 * different position within the queue.  In either
 		 * case, addl_page_shortage should not be incremented.
 		 */
 		if (!vm_pageout_page_lock(m, &next)) {
 			vm_page_unlock(m);
 			continue;
 		}
 		object = m->object;
 		if (!VM_OBJECT_TRYWLOCK(object) &&
 		    !vm_pageout_fallback_object_lock(m, &next)) {
 			vm_page_unlock(m);
 			VM_OBJECT_WUNLOCK(object);
 			continue;
 		}
 
 		/*
 		 * Don't mess with busy pages, keep them at the
 		 * front of the queue, most likely they are being
 		 * paged out.  Increment addl_page_shortage for busy
 		 * pages, because they may leave the inactive queue
 		 * shortly after page scan is finished.
 		 */
 		if (vm_page_busied(m)) {
 			vm_page_unlock(m);
 			VM_OBJECT_WUNLOCK(object);
 			addl_page_shortage++;
 			continue;
 		}
 
 		/*
 		 * We unlock the inactive page queue, invalidating the
 		 * 'next' pointer.  Use our marker to remember our
 		 * place.
 		 */
 		TAILQ_INSERT_AFTER(&pq->pq_pl, m, &vmd->vmd_marker, plinks.q);
 		vm_pagequeue_unlock(pq);
 		queues_locked = FALSE;
 
 		/*
 		 * We bump the activation count if the page has been
 		 * referenced while in the inactive queue.  This makes
 		 * it less likely that the page will be added back to the
 		 * inactive queue prematurely again.  Here we check the 
 		 * page tables (or emulated bits, if any), given the upper 
 		 * level VM system not knowing anything about existing 
 		 * references.
 		 */
 		if ((m->aflags & PGA_REFERENCED) != 0) {
 			vm_page_aflag_clear(m, PGA_REFERENCED);
 			act_delta = 1;
 		} else
 			act_delta = 0;
 		if (object->ref_count != 0) {
 			act_delta += pmap_ts_referenced(m);
 		} else {
 			KASSERT(!pmap_page_is_mapped(m),
 			    ("vm_pageout_scan: page %p is mapped", m));
 		}
 
 		/*
 		 * If the upper level VM system knows about any page 
 		 * references, we reactivate the page or requeue it.
 		 */
 		if (act_delta != 0) {
 			if (object->ref_count != 0) {
 				vm_page_activate(m);
 				m->act_count += act_delta + ACT_ADVANCE;
 			} else {
 				vm_pagequeue_lock(pq);
 				queues_locked = TRUE;
 				vm_page_requeue_locked(m);
 			}
 			VM_OBJECT_WUNLOCK(object);
 			vm_page_unlock(m);
 			goto relock_queues;
 		}
 
 		if (m->hold_count != 0) {
 			vm_page_unlock(m);
 			VM_OBJECT_WUNLOCK(object);
 
 			/*
 			 * Held pages are essentially stuck in the
 			 * queue.  So, they ought to be discounted
 			 * from the inactive count.  See the
 			 * calculation of the page_shortage for the
 			 * loop over the active queue below.
 			 */
 			addl_page_shortage++;
 			goto relock_queues;
 		}
 
 		/*
 		 * If the page appears to be clean at the machine-independent
 		 * layer, then remove all of its mappings from the pmap in
 		 * anticipation of placing it onto the cache queue.  If,
 		 * however, any of the page's mappings allow write access,
 		 * then the page may still be modified until the last of those
 		 * mappings are removed.
 		 */
 		vm_page_test_dirty(m);
 		if (m->dirty == 0 && object->ref_count != 0)
 			pmap_remove_all(m);
 
 		if (m->valid == 0) {
 			/*
 			 * Invalid pages can be easily freed
 			 */
 			vm_page_free(m);
 			PCPU_INC(cnt.v_dfree);
 			--page_shortage;
 		} else if (m->dirty == 0) {
 			/*
 			 * Clean pages can be placed onto the cache queue.
 			 * This effectively frees them.
 			 */
 			vm_page_cache(m);
 			--page_shortage;
 		} else if ((m->flags & PG_WINATCFLS) == 0 && pass < 2) {
 			/*
 			 * Dirty pages need to be paged out, but flushing
 			 * a page is extremely expensive versus freeing
 			 * a clean page.  Rather than artificially limiting
 			 * the number of pages we can flush, we instead give
 			 * dirty pages extra priority on the inactive queue
 			 * by forcing them to be cycled through the queue
 			 * twice before being flushed, after which the
 			 * (now clean) page will cycle through once more
 			 * before being freed.  This significantly extends
 			 * the thrash point for a heavily loaded machine.
 			 */
 			m->flags |= PG_WINATCFLS;
 			vm_pagequeue_lock(pq);
 			queues_locked = TRUE;
 			vm_page_requeue_locked(m);
 		} else if (maxlaunder > 0) {
 			/*
 			 * We always want to try to flush some dirty pages if
 			 * we encounter them, to keep the system stable.
 			 * Normally this number is small, but under extreme
 			 * pressure where there are insufficient clean pages
 			 * on the inactive queue, we may have to go all out.
 			 */
 			int swap_pageouts_ok;
 			int error;
 
 			if ((object->type != OBJT_SWAP) && (object->type != OBJT_DEFAULT)) {
 				swap_pageouts_ok = 1;
 			} else {
 				swap_pageouts_ok = !(defer_swap_pageouts || disable_swap_pageouts);
 				swap_pageouts_ok |= (!disable_swap_pageouts && defer_swap_pageouts &&
 				vm_page_count_min());
 										
 			}
 
 			/*
 			 * We don't bother paging objects that are "dead".  
 			 * Those objects are in a "rundown" state.
 			 */
 			if (!swap_pageouts_ok || (object->flags & OBJ_DEAD)) {
 				vm_pagequeue_lock(pq);
 				vm_page_unlock(m);
 				VM_OBJECT_WUNLOCK(object);
 				queues_locked = TRUE;
 				vm_page_requeue_locked(m);
 				goto relock_queues;
 			}
 			error = vm_pageout_clean(m);
 			/*
 			 * Decrement page_shortage on success to account for
 			 * the (future) cleaned page.  Otherwise we could wind
 			 * up laundering or cleaning too many pages.
 			 */
 			if (error == 0) {
 				page_shortage--;
 				maxlaunder--;
 			} else if (error == EDEADLK) {
 				pageout_lock_miss++;
 				vnodes_skipped++;
 			} else if (error == EBUSY) {
 				addl_page_shortage++;
 			}
 			vm_page_lock_assert(m, MA_NOTOWNED);
 			goto relock_queues;
 		}
 		vm_page_unlock(m);
 		VM_OBJECT_WUNLOCK(object);
 relock_queues:
 		if (!queues_locked) {
 			vm_pagequeue_lock(pq);
 			queues_locked = TRUE;
 		}
 		next = TAILQ_NEXT(&vmd->vmd_marker, plinks.q);
 		TAILQ_REMOVE(&pq->pq_pl, &vmd->vmd_marker, plinks.q);
 	}
 	vm_pagequeue_unlock(pq);
 
 #if !defined(NO_SWAPPING)
 	/*
 	 * Wakeup the swapout daemon if we didn't cache or free the targeted
 	 * number of pages. 
 	 */
 	if (vm_swap_enabled && page_shortage > 0)
 		vm_req_vmdaemon(VM_SWAP_NORMAL);
 #endif
 
 	/*
 	 * Wakeup the sync daemon if we skipped a vnode in a writeable object
 	 * and we didn't cache or free enough pages.
 	 */
 	if (vnodes_skipped > 0 && page_shortage > vm_cnt.v_free_target -
 	    vm_cnt.v_free_min)
 		(void)speedup_syncer();
 
 	/*
 	 * Compute the number of pages we want to try to move from the
 	 * active queue to the inactive queue.
 	 */
 	page_shortage = vm_cnt.v_inactive_target - vm_cnt.v_inactive_count +
 	    vm_paging_target() + deficit + addl_page_shortage;
 
 	pq = &vmd->vmd_pagequeues[PQ_ACTIVE];
 	vm_pagequeue_lock(pq);
 	maxscan = pq->pq_cnt;
 
 	/*
 	 * If we're just idle polling attempt to visit every
 	 * active page within 'update_period' seconds.
 	 */
 	if (pass == 0 && vm_pageout_update_period != 0) {
 		maxscan /= vm_pageout_update_period;
 		page_shortage = maxscan;
 	}
 
 	/*
 	 * Scan the active queue for things we can deactivate. We nominally
 	 * track the per-page activity counter and use it to locate
 	 * deactivation candidates.
 	 */
 	m = TAILQ_FIRST(&pq->pq_pl);
 	while (m != NULL && maxscan-- > 0 && page_shortage > 0) {
 
 		KASSERT(m->queue == PQ_ACTIVE,
 		    ("vm_pageout_scan: page %p isn't active", m));
 
 		next = TAILQ_NEXT(m, plinks.q);
 		if ((m->flags & PG_MARKER) != 0) {
 			m = next;
 			continue;
 		}
 		KASSERT((m->flags & PG_FICTITIOUS) == 0,
 		    ("Fictitious page %p cannot be in active queue", m));
 		KASSERT((m->oflags & VPO_UNMANAGED) == 0,
 		    ("Unmanaged page %p cannot be in active queue", m));
 		if (!vm_pageout_page_lock(m, &next)) {
 			vm_page_unlock(m);
 			m = next;
 			continue;
 		}
 
 		/*
 		 * The count for pagedaemon pages is done after checking the
 		 * page for eligibility...
 		 */
 		PCPU_INC(cnt.v_pdpages);
 
 		/*
 		 * Check to see "how much" the page has been used.
 		 */
 		if ((m->aflags & PGA_REFERENCED) != 0) {
 			vm_page_aflag_clear(m, PGA_REFERENCED);
 			act_delta = 1;
 		} else
 			act_delta = 0;
 
 		/*
 		 * Unlocked object ref count check.  Two races are possible.
 		 * 1) The ref was transitioning to zero and we saw non-zero,
 		 *    the pmap bits will be checked unnecessarily.
 		 * 2) The ref was transitioning to one and we saw zero. 
 		 *    The page lock prevents a new reference to this page so
 		 *    we need not check the reference bits.
 		 */
 		if (m->object->ref_count != 0)
 			act_delta += pmap_ts_referenced(m);
 
 		/*
 		 * Advance or decay the act_count based on recent usage.
 		 */
 		if (act_delta != 0) {
 			m->act_count += ACT_ADVANCE + act_delta;
 			if (m->act_count > ACT_MAX)
 				m->act_count = ACT_MAX;
 		} else
 			m->act_count -= min(m->act_count, ACT_DECLINE);
 
 		/*
 		 * Move this page to the tail of the active or inactive
 		 * queue depending on usage.
 		 */
 		if (m->act_count == 0) {
 			/* Dequeue to avoid later lock recursion. */
 			vm_page_dequeue_locked(m);
 			vm_page_deactivate(m);
 			page_shortage--;
 		} else
 			vm_page_requeue_locked(m);
 		vm_page_unlock(m);
 		m = next;
 	}
 	vm_pagequeue_unlock(pq);
 #if !defined(NO_SWAPPING)
 	/*
 	 * Idle process swapout -- run once per second.
 	 */
 	if (vm_swap_idle_enabled) {
 		static long lsec;
 		if (time_second != lsec) {
 			vm_req_vmdaemon(VM_SWAP_IDLE);
 			lsec = time_second;
 		}
 	}
 #endif
 
 	/*
 	 * If we are critically low on one of RAM or swap and low on
 	 * the other, kill the largest process.  However, we avoid
 	 * doing this on the first pass in order to give ourselves a
 	 * chance to flush out dirty vnode-backed pages and to allow
 	 * active pages to be moved to the inactive queue and reclaimed.
 	 */
 	vm_pageout_mightbe_oom(vmd, pass);
 }
 
 static int vm_pageout_oom_vote;
 
 /*
  * The pagedaemon threads randomly select one to perform the
  * OOM.  Trying to kill processes before all pagedaemons
  * have failed to reach the free target is premature.
  */
 static void
 vm_pageout_mightbe_oom(struct vm_domain *vmd, int pass)
 {
 	int old_vote;
 
 	if (pass <= 1 || !((swap_pager_avail < 64 && vm_page_count_min()) ||
 	    (swap_pager_full && vm_paging_target() > 0))) {
 		if (vmd->vmd_oom) {
 			vmd->vmd_oom = FALSE;
 			atomic_subtract_int(&vm_pageout_oom_vote, 1);
 		}
 		return;
 	}
 
 	if (vmd->vmd_oom)
 		return;
 
 	vmd->vmd_oom = TRUE;
 	old_vote = atomic_fetchadd_int(&vm_pageout_oom_vote, 1);
 	if (old_vote != vm_ndomains - 1)
 		return;
 
 	/*
 	 * The current pagedaemon thread is the last in the quorum to
 	 * start OOM.  Initiate the selection and signaling of the
 	 * victim.
 	 */
 	vm_pageout_oom(VM_OOM_MEM);
 
 	/*
 	 * After one round of OOM terror, recall our vote.  On the
 	 * next pass, the current pagedaemon would vote again if the low
 	 * memory condition is still there, due to vmd_oom being
 	 * false.
 	 */
 	vmd->vmd_oom = FALSE;
 	atomic_subtract_int(&vm_pageout_oom_vote, 1);
 }
 
 void
 vm_pageout_oom(int shortage)
 {
 	struct proc *p, *bigproc;
 	vm_offset_t size, bigsize;
 	struct thread *td;
 	struct vmspace *vm;
 
 	/*
 	 * We keep the process bigproc locked once we find it to keep anyone
 	 * from messing with it; however, there is a possibility of
 	 * deadlock if process B is bigproc and one of its child processes
 	 * attempts to propagate a signal to B while we are waiting for A's
 	 * lock while walking this list.  To avoid this, we don't block on
 	 * the process lock but just skip a process if it is already locked.
 	 */
 	bigproc = NULL;
 	bigsize = 0;
 	sx_slock(&allproc_lock);
 	FOREACH_PROC_IN_SYSTEM(p) {
 		int breakout;
 
 		PROC_LOCK(p);
 
 		/*
 		 * If this is a system, protected or killed process, skip it.
 		 */
 		if (p->p_state != PRS_NORMAL || (p->p_flag & (P_INEXEC |
 		    P_PROTECTED | P_SYSTEM | P_WEXIT)) != 0 ||
 		    p->p_pid == 1 || P_KILLED(p) ||
 		    (p->p_pid < 48 && swap_pager_avail != 0)) {
 			PROC_UNLOCK(p);
 			continue;
 		}
 		/*
 		 * If the process is in a non-running type state,
 		 * don't touch it.  Check all the threads individually.
 		 */
 		breakout = 0;
 		FOREACH_THREAD_IN_PROC(p, td) {
 			thread_lock(td);
 			if (!TD_ON_RUNQ(td) &&
 			    !TD_IS_RUNNING(td) &&
 			    !TD_IS_SLEEPING(td) &&
 			    !TD_IS_SUSPENDED(td)) {
 				thread_unlock(td);
 				breakout = 1;
 				break;
 			}
 			thread_unlock(td);
 		}
 		if (breakout) {
 			PROC_UNLOCK(p);
 			continue;
 		}
 		/*
 		 * get the process size
 		 */
 		vm = vmspace_acquire_ref(p);
 		if (vm == NULL) {
 			PROC_UNLOCK(p);
 			continue;
 		}
 		_PHOLD(p);
 		if (!vm_map_trylock_read(&vm->vm_map)) {
 			_PRELE(p);
 			PROC_UNLOCK(p);
 			vmspace_free(vm);
 			continue;
 		}
 		PROC_UNLOCK(p);
 		size = vmspace_swap_count(vm);
 		vm_map_unlock_read(&vm->vm_map);
 		if (shortage == VM_OOM_MEM)
 			size += vmspace_resident_count(vm);
 		vmspace_free(vm);
 		/*
 		 * if this process is bigger than the biggest one
 		 * remember it.
 		 */
 		if (size > bigsize) {
 			if (bigproc != NULL)
 				PRELE(bigproc);
 			bigproc = p;
 			bigsize = size;
 		} else {
 			PRELE(p);
 		}
 	}
 	sx_sunlock(&allproc_lock);
 	if (bigproc != NULL) {
 		if (vm_panic_on_oom != 0)
 			panic("out of swap space");
 		PROC_LOCK(bigproc);
 		killproc(bigproc, "out of swap space");
 		sched_nice(bigproc, PRIO_MIN);
 		_PRELE(bigproc);
 		PROC_UNLOCK(bigproc);
 		wakeup(&vm_cnt.v_free_count);
 	}
 }
 
 static void
 vm_pageout_worker(void *arg)
 {
 	struct vm_domain *domain;
 	int domidx;
 
 	domidx = (uintptr_t)arg;
 	domain = &vm_dom[domidx];
 
 	/*
 	 * XXXKIB It could be useful to bind pageout daemon threads to
 	 * the cores belonging to the domain, from which vm_page_array
 	 * is allocated.
 	 */
 
 	KASSERT(domain->vmd_segs != 0, ("domain without segments"));
 	vm_pageout_init_marker(&domain->vmd_marker, PQ_INACTIVE);
 
 	/*
 	 * The pageout daemon worker is never done, so loop forever.
 	 */
 	while (TRUE) {
 		/*
 		 * If we have enough free memory, wakeup waiters.  Do
 		 * not clear vm_pages_needed until we reach our target,
 		 * otherwise we may be woken up over and over again and
 		 * waste a lot of cpu.
 		 */
 		mtx_lock(&vm_page_queue_free_mtx);
 		if (vm_pages_needed && !vm_page_count_min()) {
 			if (!vm_paging_needed())
 				vm_pages_needed = 0;
 			wakeup(&vm_cnt.v_free_count);
 		}
 		if (vm_pages_needed) {
 			/*
 			 * Still not done, take a second pass without waiting
 			 * (unlimited dirty cleaning), otherwise sleep a bit
 			 * and try again.
 			 */
 			if (domain->vmd_pass > 1)
 				msleep(&vm_pages_needed,
 				    &vm_page_queue_free_mtx, PVM, "psleep",
 				    hz / 2);
 		} else {
 			/*
 			 * Good enough, sleep until required to refresh
 			 * stats.
 			 */
 			domain->vmd_pass = 0;
 			msleep(&vm_pages_needed, &vm_page_queue_free_mtx,
 			    PVM, "psleep", hz);
 
 		}
 		if (vm_pages_needed) {
 			vm_cnt.v_pdwakeups++;
 			domain->vmd_pass++;
 		}
 		mtx_unlock(&vm_page_queue_free_mtx);
 		vm_pageout_scan(domain, domain->vmd_pass);
 	}
 }
 
 /*
  *	vm_pageout_init initialises basic pageout daemon settings.
  */
 static void
 vm_pageout_init(void)
 {
 	/*
 	 * Initialize some paging parameters.
 	 */
 	vm_cnt.v_interrupt_free_min = 2;
 	if (vm_cnt.v_page_count < 2000)
 		vm_pageout_page_count = 8;
 
 	/*
 	 * v_free_reserved needs to include enough for the largest
 	 * swap pager structures plus enough for any pv_entry structs
 	 * when paging. 
 	 */
 	if (vm_cnt.v_page_count > 1024)
 		vm_cnt.v_free_min = 4 + (vm_cnt.v_page_count - 1024) / 200;
 	else
 		vm_cnt.v_free_min = 4;
 	vm_cnt.v_pageout_free_min = (2*MAXBSIZE)/PAGE_SIZE +
 	    vm_cnt.v_interrupt_free_min;
 	vm_cnt.v_free_reserved = vm_pageout_page_count +
 	    vm_cnt.v_pageout_free_min + (vm_cnt.v_page_count / 768);
 	vm_cnt.v_free_severe = vm_cnt.v_free_min / 2;
 	vm_cnt.v_free_target = 4 * vm_cnt.v_free_min + vm_cnt.v_free_reserved;
 	vm_cnt.v_free_min += vm_cnt.v_free_reserved;
 	vm_cnt.v_free_severe += vm_cnt.v_free_reserved;
 	vm_cnt.v_inactive_target = (3 * vm_cnt.v_free_target) / 2;
 	if (vm_cnt.v_inactive_target > vm_cnt.v_free_count / 3)
 		vm_cnt.v_inactive_target = vm_cnt.v_free_count / 3;
 
 	/*
 	 * Set the default wakeup threshold to be 10% above the minimum
 	 * page limit.  This keeps the steady state out of shortfall.
 	 */
 	vm_pageout_wakeup_thresh = (vm_cnt.v_free_min / 10) * 11;
 
 	/*
 	 * Set interval in seconds for active scan.  We want to visit each
 	 * page at least once every ten minutes.  This is to prevent worst
 	 * case paging behaviors with stale active LRU.
 	 */
 	if (vm_pageout_update_period == 0)
 		vm_pageout_update_period = 600;
 
 	/* XXX does not really belong here */
 	if (vm_page_max_wired == 0)
 		vm_page_max_wired = vm_cnt.v_free_count / 3;
 }
 
 /*
  *     vm_pageout is the high level pageout daemon.
  */
 static void
 vm_pageout(void)
 {
+	int error;
 #if MAXMEMDOM > 1
-	int error, i;
+	int i;
 #endif
 
 	swap_pager_swap_init();
 #if MAXMEMDOM > 1
 	for (i = 1; i < vm_ndomains; i++) {
 		error = kthread_add(vm_pageout_worker, (void *)(uintptr_t)i,
 		    curproc, NULL, 0, 0, "dom%d", i);
 		if (error != 0) {
 			panic("starting pageout for domain %d, error %d\n",
 			    i, error);
 		}
 	}
 #endif
+	error = kthread_add(uma_reclaim_worker, NULL, curproc, NULL,
+	    0, 0, "uma");
+	if (error != 0)
+		panic("starting uma_reclaim helper, error %d\n", error);
 	vm_pageout_worker((void *)(uintptr_t)0);
 }
 
 /*
  * Unless the free page queue lock is held by the caller, this function
  * should be regarded as advisory.  Specifically, the caller should
  * not msleep() on &vm_cnt.v_free_count following this function unless
  * the free page queue lock is held until the msleep() is performed.
  */
 void
 pagedaemon_wakeup(void)
 {
 
 	if (!vm_pages_needed && curthread->td_proc != pageproc) {
 		vm_pages_needed = 1;
 		wakeup(&vm_pages_needed);
 	}
 }
 
 #if !defined(NO_SWAPPING)
 static void
 vm_req_vmdaemon(int req)
 {
 	static int lastrun = 0;
 
 	mtx_lock(&vm_daemon_mtx);
 	vm_pageout_req_swapout |= req;
 	if ((ticks > (lastrun + hz)) || (ticks < lastrun)) {
 		wakeup(&vm_daemon_needed);
 		lastrun = ticks;
 	}
 	mtx_unlock(&vm_daemon_mtx);
 }
 
 static void
 vm_daemon(void)
 {
 	struct rlimit rsslim;
 	struct proc *p;
 	struct thread *td;
 	struct vmspace *vm;
 	int breakout, swapout_flags, tryagain, attempts;
 #ifdef RACCT
 	uint64_t rsize, ravailable;
 #endif
 
 	while (TRUE) {
 		mtx_lock(&vm_daemon_mtx);
 		msleep(&vm_daemon_needed, &vm_daemon_mtx, PPAUSE, "psleep",
 #ifdef RACCT
 		    racct_enable ? hz : 0
 #else
 		    0
 #endif
 		);
 		swapout_flags = vm_pageout_req_swapout;
 		vm_pageout_req_swapout = 0;
 		mtx_unlock(&vm_daemon_mtx);
 		if (swapout_flags)
 			swapout_procs(swapout_flags);
 
 		/*
 		 * scan the processes for exceeding their rlimits or if
 		 * process is swapped out -- deactivate pages
 		 */
 		tryagain = 0;
 		attempts = 0;
 again:
 		attempts++;
 		sx_slock(&allproc_lock);
 		FOREACH_PROC_IN_SYSTEM(p) {
 			vm_pindex_t limit, size;
 
 			/*
 			 * if this is a system process or if we have already
 			 * looked at this process, skip it.
 			 */
 			PROC_LOCK(p);
 			if (p->p_state != PRS_NORMAL ||
 			    p->p_flag & (P_INEXEC | P_SYSTEM | P_WEXIT)) {
 				PROC_UNLOCK(p);
 				continue;
 			}
 			/*
 			 * if the process is in a non-running type state,
 			 * don't touch it.
 			 */
 			breakout = 0;
 			FOREACH_THREAD_IN_PROC(p, td) {
 				thread_lock(td);
 				if (!TD_ON_RUNQ(td) &&
 				    !TD_IS_RUNNING(td) &&
 				    !TD_IS_SLEEPING(td) &&
 				    !TD_IS_SUSPENDED(td)) {
 					thread_unlock(td);
 					breakout = 1;
 					break;
 				}
 				thread_unlock(td);
 			}
 			if (breakout) {
 				PROC_UNLOCK(p);
 				continue;
 			}
 			/*
 			 * get a limit
 			 */
 			lim_rlimit(p, RLIMIT_RSS, &rsslim);
 			limit = OFF_TO_IDX(
 			    qmin(rsslim.rlim_cur, rsslim.rlim_max));
 
 			/*
 			 * let processes that are swapped out really be
 			 * swapped out: set the limit to nothing (this will
 			 * force a swap-out).
 			 */
 			if ((p->p_flag & P_INMEM) == 0)
 				limit = 0;	/* XXX */
 			vm = vmspace_acquire_ref(p);
 			PROC_UNLOCK(p);
 			if (vm == NULL)
 				continue;
 
 			size = vmspace_resident_count(vm);
 			if (size >= limit) {
 				vm_pageout_map_deactivate_pages(
 				    &vm->vm_map, limit);
 			}
 #ifdef RACCT
 			if (racct_enable) {
 				rsize = IDX_TO_OFF(size);
 				PROC_LOCK(p);
 				racct_set(p, RACCT_RSS, rsize);
 				ravailable = racct_get_available(p, RACCT_RSS);
 				PROC_UNLOCK(p);
 				if (rsize > ravailable) {
 					/*
 					 * Don't be overly aggressive; this
 					 * might be an innocent process,
 					 * and the limit could've been exceeded
 					 * by some memory hog.  Don't try
 					 * to deactivate more than 1/4th
 					 * of process' resident set size.
 					 */
 					if (attempts <= 8) {
 						if (ravailable < rsize -
 						    (rsize / 4)) {
 							ravailable = rsize -
 							    (rsize / 4);
 						}
 					}
 					vm_pageout_map_deactivate_pages(
 					    &vm->vm_map,
 					    OFF_TO_IDX(ravailable));
 					/* Update RSS usage after paging out. */
 					size = vmspace_resident_count(vm);
 					rsize = IDX_TO_OFF(size);
 					PROC_LOCK(p);
 					racct_set(p, RACCT_RSS, rsize);
 					PROC_UNLOCK(p);
 					if (rsize > ravailable)
 						tryagain = 1;
 				}
 			}
 #endif
 			vmspace_free(vm);
 		}
 		sx_sunlock(&allproc_lock);
 		if (tryagain != 0 && attempts <= 10)
 			goto again;
 	}
 }
 #endif			/* !defined(NO_SWAPPING) */
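The threshold arithmetic in vm_pageout_init() above is easier to follow with concrete numbers. The following is a minimal userland sketch, not kernel code: the struct pageout_thresholds type and the sketch_pageout_thresholds() function are hypothetical names, and the starting vm_pageout_page_count, MAXBSIZE and PAGE_SIZE values are passed in as parameters because their defaults are not shown in this diff.

```c
#include <assert.h>

/* Hypothetical container for the values vm_pageout_init() stores in vm_cnt. */
struct pageout_thresholds {
	unsigned int interrupt_free_min;
	unsigned int pageout_free_min;
	unsigned int free_reserved;
	unsigned int free_min;
	unsigned int free_severe;
	unsigned int free_target;
	unsigned int inactive_target;
	unsigned int wakeup_thresh;
};

/*
 * Mirror the order of assignments in vm_pageout_init().  All inputs are
 * assumptions supplied by the caller; nothing here reads kernel state.
 */
struct pageout_thresholds
sketch_pageout_thresholds(unsigned int page_count, unsigned int free_count,
    unsigned int pageout_page_count, unsigned int maxbsize,
    unsigned int page_size)
{
	struct pageout_thresholds t;

	assert(page_size != 0);

	t.interrupt_free_min = 2;
	if (page_count < 2000)
		pageout_page_count = 8;

	if (page_count > 1024)
		t.free_min = 4 + (page_count - 1024) / 200;
	else
		t.free_min = 4;
	t.pageout_free_min = (2 * maxbsize) / page_size +
	    t.interrupt_free_min;
	t.free_reserved = pageout_page_count + t.pageout_free_min +
	    (page_count / 768);
	t.free_severe = t.free_min / 2;
	t.free_target = 4 * t.free_min + t.free_reserved;

	/* The reserve is folded into free_min and free_severe last. */
	t.free_min += t.free_reserved;
	t.free_severe += t.free_reserved;

	t.inactive_target = (3 * t.free_target) / 2;
	if (t.inactive_target > free_count / 3)
		t.inactive_target = free_count / 3;

	/* Integer division first, so this is roughly 110% of free_min. */
	t.wakeup_thresh = (t.free_min / 10) * 11;
	return (t);
}
```

Calling sketch_pageout_thresholds(1048576, 1000000, 32, 65536, 4096), that is, an assumed 4 GB machine with 4 KB pages, yields free_min 6672, free_target 22395 and wakeup_thresh 7337. Note that free_target is computed from free_min before the reserve is added to it, so the order of the statements matters.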
Index: projects/release-arm-redux/sys/x86/include/acpica_machdep.h
===================================================================
--- projects/release-arm-redux/sys/x86/include/acpica_machdep.h	(revision 282691)
+++ projects/release-arm-redux/sys/x86/include/acpica_machdep.h	(revision 282692)
@@ -1,87 +1,88 @@
 /*-
  * Copyright (c) 2002 Mitsuru IWASAKI
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */
 
 /******************************************************************************
  *
  * Name: acpica_machdep.h - arch-specific defines, etc.
  *       $Revision$
  *
  *****************************************************************************/
 
 #ifndef __ACPICA_MACHDEP_H__
 #define	__ACPICA_MACHDEP_H__
 
 #ifdef _KERNEL
 /*
  * Calling conventions:
  *
  * ACPI_SYSTEM_XFACE        - Interfaces to host OS (handlers, threads)
  * ACPI_EXTERNAL_XFACE      - External ACPI interfaces 
  * ACPI_INTERNAL_XFACE      - Internal ACPI interfaces
  * ACPI_INTERNAL_VAR_XFACE  - Internal variable-parameter list interfaces
  */
 #define	ACPI_SYSTEM_XFACE
 #define	ACPI_EXTERNAL_XFACE
 #define	ACPI_INTERNAL_XFACE
 #define	ACPI_INTERNAL_VAR_XFACE
 
 /* Asm macros */
 
 #define	ACPI_ASM_MACROS
 #define	BREAKPOINT3
 #define	ACPI_DISABLE_IRQS() disable_intr()
 #define	ACPI_ENABLE_IRQS()  enable_intr()
 
 #define	ACPI_FLUSH_CPU_CACHE()	wbinvd()
 
 /* Section 5.2.10.1: global lock acquire/release functions */
 int	acpi_acquire_global_lock(volatile uint32_t *);
 int	acpi_release_global_lock(volatile uint32_t *);
 #define	ACPI_ACQUIRE_GLOBAL_LOCK(GLptr, Acq)	do {			\
 	(Acq) = acpi_acquire_global_lock(&((GLptr)->GlobalLock));	\
 } while (0)
 #define	ACPI_RELEASE_GLOBAL_LOCK(GLptr, Acq)	do {			\
 	(Acq) = acpi_release_global_lock(&((GLptr)->GlobalLock));	\
 } while (0)
  
 enum intr_trigger;
 enum intr_polarity;
 
 void	acpi_SetDefaultIntrModel(int model);
 void	acpi_cpu_c1(void);
+void	acpi_cpu_idle_mwait(uint32_t mwait_hint);
 void	*acpi_map_table(vm_paddr_t pa, const char *sig);
 void	acpi_unmap_table(void *table);
 vm_paddr_t acpi_find_table(const char *sig);
 void	madt_parse_interrupt_values(void *entry,
 	    enum intr_trigger *trig, enum intr_polarity *pol);
 
 extern int madt_found_sci_override;
 
 #endif /* _KERNEL */
 
 #endif /* __ACPICA_MACHDEP_H__ */

Property changes on: projects/release-arm-redux/sys/x86/include/acpica_machdep.h
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /head/sys/x86/include/acpica_machdep.h:r282673-282691
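The acpica_machdep.h diff above only declares acpi_acquire_global_lock() and acpi_release_global_lock(); their bodies are not part of this change. As a rough illustration of the protocol the ACPI specification describes for the FACS global lock (bit 0 is Pending, bit 1 is Owned), here is a userland sketch using C11 atomics in place of the kernel's own atomic primitives. The sketch_* names and SKETCH_GLOCK_* constants are hypothetical and are not taken from the tree.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define	SKETCH_GLOCK_PENDING	0x1u	/* bit 0: another party is waiting */
#define	SKETCH_GLOCK_OWNED	0x2u	/* bit 1: the lock is held */

/* Returns nonzero if the lock was acquired, zero if we were marked pending. */
int
sketch_acquire_global_lock(_Atomic uint32_t *lock)
{
	uint32_t old, next;

	old = atomic_load(lock);
	do {
		/* Always set Owned; set Pending only if it was already owned. */
		next = (old & ~SKETCH_GLOCK_PENDING) | SKETCH_GLOCK_OWNED;
		if ((old & SKETCH_GLOCK_OWNED) != 0)
			next |= SKETCH_GLOCK_PENDING;
		/* On failure, old is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(lock, &old, next));

	return ((next & SKETCH_GLOCK_PENDING) == 0);
}

/* Returns nonzero if a waiter was pending and must be signalled. */
int
sketch_release_global_lock(_Atomic uint32_t *lock)
{
	uint32_t old, next;

	old = atomic_load(lock);
	do {
		next = old & ~(SKETCH_GLOCK_PENDING | SKETCH_GLOCK_OWNED);
	} while (!atomic_compare_exchange_weak(lock, &old, next));

	return ((old & SKETCH_GLOCK_PENDING) != 0);
}

/* Single-threaded walk through acquire, contended acquire, release. */
void
sketch_glock_selfcheck(void)
{
	_Atomic uint32_t lock = 0;

	assert(sketch_acquire_global_lock(&lock) == 1);	/* free: acquired */
	assert(sketch_acquire_global_lock(&lock) == 0);	/* held: pending */
	assert(sketch_release_global_lock(&lock) == 1);	/* waiter to signal */
	assert(atomic_load(&lock) == 0);
}
```

The return values line up with how the ACPI_ACQUIRE_GLOBAL_LOCK and ACPI_RELEASE_GLOBAL_LOCK macros store the result in their Acq argument, but the exact in-tree implementation may differ from this sketch.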
Index: projects/release-arm-redux/sys/x86/include/specialreg.h
===================================================================
--- projects/release-arm-redux/sys/x86/include/specialreg.h	(revision 282691)
+++ projects/release-arm-redux/sys/x86/include/specialreg.h	(revision 282692)
@@ -1,844 +1,845 @@
 /*-
  * Copyright (c) 1991 The Regents of the University of California.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: @(#)specialreg.h	7.1 (Berkeley) 5/9/91
  * $FreeBSD$
  */
 
 #ifndef _MACHINE_SPECIALREG_H_
 #define	_MACHINE_SPECIALREG_H_
 
 /*
  * Bits in 386 special registers:
  */
 #define	CR0_PE	0x00000001	/* Protected mode Enable */
 #define	CR0_MP	0x00000002	/* "Math" (fpu) Present */
 #define	CR0_EM	0x00000004	/* EMulate FPU instructions. (trap ESC only) */
 #define	CR0_TS	0x00000008	/* Task Switched (if MP, trap ESC and WAIT) */
 #define	CR0_PG	0x80000000	/* PaGing enable */
 
 /*
  * Bits in 486 special registers:
  */
 #define	CR0_NE	0x00000020	/* Numeric Error enable (EX16 vs IRQ13) */
 #define	CR0_WP	0x00010000	/* Write Protect (honor page protect in
 							   all modes) */
 #define	CR0_AM	0x00040000	/* Alignment Mask (set to enable AC flag) */
 #define	CR0_NW  0x20000000	/* Not Write-through */
 #define	CR0_CD  0x40000000	/* Cache Disable */
 
 #define	CR3_PCID_SAVE 0x8000000000000000
+#define	CR3_PCID_MASK 0xfff
 
 /*
  * Bits in PPro special registers
  */
 #define	CR4_VME	0x00000001	/* Virtual 8086 mode extensions */
 #define	CR4_PVI	0x00000002	/* Protected-mode virtual interrupts */
 #define	CR4_TSD	0x00000004	/* Time stamp disable */
 #define	CR4_DE	0x00000008	/* Debugging extensions */
 #define	CR4_PSE	0x00000010	/* Page size extensions */
 #define	CR4_PAE	0x00000020	/* Physical address extension */
 #define	CR4_MCE	0x00000040	/* Machine check enable */
 #define	CR4_PGE	0x00000080	/* Page global enable */
 #define	CR4_PCE	0x00000100	/* Performance monitoring counter enable */
 #define	CR4_FXSR 0x00000200	/* Fast FPU save/restore used by OS */
 #define	CR4_XMM	0x00000400	/* enable SIMD/MMX2 to use except 16 */
 #define	CR4_VMXE 0x00002000	/* enable VMX operation (Intel-specific) */
 #define	CR4_FSGSBASE 0x00010000	/* Enable FS/GS BASE accessing instructions */
 #define	CR4_PCIDE 0x00020000	/* Enable Context ID */
 #define	CR4_XSAVE 0x00040000	/* XSETBV/XGETBV */
 #define	CR4_SMEP 0x00100000	/* Supervisor-Mode Execution Prevention */
 
 /*
  * Bits in AMD64 special registers.  EFER is 64 bits wide.
  */
 #define	EFER_SCE 0x000000001	/* System Call Extensions (R/W) */
 #define	EFER_LME 0x000000100	/* Long mode enable (R/W) */
 #define	EFER_LMA 0x000000400	/* Long mode active (R) */
 #define	EFER_NXE 0x000000800	/* PTE No-Execute bit enable (R/W) */
 #define	EFER_SVM 0x000001000	/* SVM enable bit for AMD, reserved for Intel */
 #define	EFER_LMSLE 0x000002000	/* Long Mode Segment Limit Enable */
 #define	EFER_FFXSR 0x000004000	/* Fast FXSAVE/FXRSTOR */
 #define	EFER_TCE   0x000008000	/* Translation Cache Extension */
 
 /*
  * Intel Extended Features registers
  */
 #define	XCR0	0		/* XFEATURE_ENABLED_MASK register */
 
 #define	XFEATURE_ENABLED_X87		0x00000001
 #define	XFEATURE_ENABLED_SSE		0x00000002
 #define	XFEATURE_ENABLED_YMM_HI128	0x00000004
 #define	XFEATURE_ENABLED_AVX		XFEATURE_ENABLED_YMM_HI128
 #define	XFEATURE_ENABLED_BNDREGS	0x00000008
 #define	XFEATURE_ENABLED_BNDCSR		0x00000010
 #define	XFEATURE_ENABLED_OPMASK		0x00000020
 #define	XFEATURE_ENABLED_ZMM_HI256	0x00000040
 #define	XFEATURE_ENABLED_HI16_ZMM	0x00000080
 
 #define	XFEATURE_AVX					\
     (XFEATURE_ENABLED_X87 | XFEATURE_ENABLED_SSE | XFEATURE_ENABLED_AVX)
 #define	XFEATURE_AVX512						\
     (XFEATURE_ENABLED_OPMASK | XFEATURE_ENABLED_ZMM_HI256 |	\
     XFEATURE_ENABLED_HI16_ZMM)
 #define	XFEATURE_MPX					\
     (XFEATURE_ENABLED_BNDREGS | XFEATURE_ENABLED_BNDCSR)
 
 /*
  * CPUID instruction features register
  */
 #define	CPUID_FPU	0x00000001
 #define	CPUID_VME	0x00000002
 #define	CPUID_DE	0x00000004
 #define	CPUID_PSE	0x00000008
 #define	CPUID_TSC	0x00000010
 #define	CPUID_MSR	0x00000020
 #define	CPUID_PAE	0x00000040
 #define	CPUID_MCE	0x00000080
 #define	CPUID_CX8	0x00000100
 #define	CPUID_APIC	0x00000200
 #define	CPUID_B10	0x00000400
 #define	CPUID_SEP	0x00000800
 #define	CPUID_MTRR	0x00001000
 #define	CPUID_PGE	0x00002000
 #define	CPUID_MCA	0x00004000
 #define	CPUID_CMOV	0x00008000
 #define	CPUID_PAT	0x00010000
 #define	CPUID_PSE36	0x00020000
 #define	CPUID_PSN	0x00040000
 #define	CPUID_CLFSH	0x00080000
 #define	CPUID_B20	0x00100000
 #define	CPUID_DS	0x00200000
 #define	CPUID_ACPI	0x00400000
 #define	CPUID_MMX	0x00800000
 #define	CPUID_FXSR	0x01000000
 #define	CPUID_SSE	0x02000000
 #define	CPUID_XMM	0x02000000
 #define	CPUID_SSE2	0x04000000
 #define	CPUID_SS	0x08000000
 #define	CPUID_HTT	0x10000000
 #define	CPUID_TM	0x20000000
 #define	CPUID_IA64	0x40000000
 #define	CPUID_PBE	0x80000000
 
 #define	CPUID2_SSE3	0x00000001
 #define	CPUID2_PCLMULQDQ 0x00000002
 #define	CPUID2_DTES64	0x00000004
 #define	CPUID2_MON	0x00000008
 #define	CPUID2_DS_CPL	0x00000010
 #define	CPUID2_VMX	0x00000020
 #define	CPUID2_SMX	0x00000040
 #define	CPUID2_EST	0x00000080
 #define	CPUID2_TM2	0x00000100
 #define	CPUID2_SSSE3	0x00000200
 #define	CPUID2_CNXTID	0x00000400
 #define	CPUID2_SDBG	0x00000800
 #define	CPUID2_FMA	0x00001000
 #define	CPUID2_CX16	0x00002000
 #define	CPUID2_XTPR	0x00004000
 #define	CPUID2_PDCM	0x00008000
 #define	CPUID2_PCID	0x00020000
 #define	CPUID2_DCA	0x00040000
 #define	CPUID2_SSE41	0x00080000
 #define	CPUID2_SSE42	0x00100000
 #define	CPUID2_X2APIC	0x00200000
 #define	CPUID2_MOVBE	0x00400000
 #define	CPUID2_POPCNT	0x00800000
 #define	CPUID2_TSCDLT	0x01000000
 #define	CPUID2_AESNI	0x02000000
 #define	CPUID2_XSAVE	0x04000000
 #define	CPUID2_OSXSAVE	0x08000000
 #define	CPUID2_AVX	0x10000000
 #define	CPUID2_F16C	0x20000000
 #define	CPUID2_RDRAND	0x40000000
 #define	CPUID2_HV	0x80000000
 
 /*
  * Important bits in the Thermal and Power Management flags
  * CPUID.6 EAX and ECX.
  */
 #define	CPUTPM1_SENSOR	0x00000001
 #define	CPUTPM1_TURBO	0x00000002
 #define	CPUTPM1_ARAT	0x00000004
 #define	CPUTPM2_EFFREQ	0x00000001
 
 /*
  * Important bits in the AMD extended cpuid flags
  */
 #define	AMDID_SYSCALL	0x00000800
 #define	AMDID_MP	0x00080000
 #define	AMDID_NX	0x00100000
 #define	AMDID_EXT_MMX	0x00400000
 #define	AMDID_FFXSR	0x02000000
 #define	AMDID_PAGE1GB	0x04000000
 #define	AMDID_RDTSCP	0x08000000
 #define	AMDID_LM	0x20000000
 #define	AMDID_EXT_3DNOW	0x40000000
 #define	AMDID_3DNOW	0x80000000
 
 #define	AMDID2_LAHF	0x00000001
 #define	AMDID2_CMP	0x00000002
 #define	AMDID2_SVM	0x00000004
 #define	AMDID2_EXT_APIC	0x00000008
 #define	AMDID2_CR8	0x00000010
 #define	AMDID2_ABM	0x00000020
 #define	AMDID2_SSE4A	0x00000040
 #define	AMDID2_MAS	0x00000080
 #define	AMDID2_PREFETCH	0x00000100
 #define	AMDID2_OSVW	0x00000200
 #define	AMDID2_IBS	0x00000400
 #define	AMDID2_XOP	0x00000800
 #define	AMDID2_SKINIT	0x00001000
 #define	AMDID2_WDT	0x00002000
 #define	AMDID2_LWP	0x00008000
 #define	AMDID2_FMA4	0x00010000
 #define	AMDID2_TCE	0x00020000
 #define	AMDID2_NODE_ID	0x00080000
 #define	AMDID2_TBM	0x00200000
 #define	AMDID2_TOPOLOGY	0x00400000
 #define	AMDID2_PCXC	0x00800000
 #define	AMDID2_PNXC	0x01000000
 #define	AMDID2_DBE	0x04000000
 #define	AMDID2_PTSC	0x08000000
 #define	AMDID2_PTSCEL2I	0x10000000
 
 /*
  * CPUID instruction 1 eax info
  */
 #define	CPUID_STEPPING		0x0000000f
 #define	CPUID_MODEL		0x000000f0
 #define	CPUID_FAMILY		0x00000f00
 #define	CPUID_EXT_MODEL		0x000f0000
 #define	CPUID_EXT_FAMILY	0x0ff00000
 #ifdef __i386__
 #define	CPUID_TO_MODEL(id) \
     ((((id) & CPUID_MODEL) >> 4) | \
     ((((id) & CPUID_FAMILY) >= 0x600) ? \
     (((id) & CPUID_EXT_MODEL) >> 12) : 0))
 #define	CPUID_TO_FAMILY(id) \
     ((((id) & CPUID_FAMILY) >> 8) + \
     ((((id) & CPUID_FAMILY) == 0xf00) ? \
     (((id) & CPUID_EXT_FAMILY) >> 20) : 0))
 #else
 #define	CPUID_TO_MODEL(id) \
     ((((id) & CPUID_MODEL) >> 4) | \
     (((id) & CPUID_EXT_MODEL) >> 12))
 #define	CPUID_TO_FAMILY(id) \
     ((((id) & CPUID_FAMILY) >> 8) + \
     (((id) & CPUID_EXT_FAMILY) >> 20))
 #endif
 
 /*
  * CPUID instruction 1 ebx info
  */
 #define	CPUID_BRAND_INDEX	0x000000ff
 #define	CPUID_CLFUSH_SIZE	0x0000ff00
 #define	CPUID_HTT_CORES		0x00ff0000
 #define	CPUID_LOCAL_APIC_ID	0xff000000
 
 /*
  * CPUID instruction 5 info
  */
 #define	CPUID5_MON_MIN_SIZE	0x0000ffff	/* eax */
 #define	CPUID5_MON_MAX_SIZE	0x0000ffff	/* ebx */
 #define	CPUID5_MON_MWAIT_EXT	0x00000001	/* ecx */
 #define	CPUID5_MWAIT_INTRBREAK	0x00000002	/* ecx */
 
 /*
  * MWAIT cpu power states.  Lower 4 bits are sub-states.
  */
 #define	MWAIT_C0	0xf0
 #define	MWAIT_C1	0x00
 #define	MWAIT_C2	0x10
 #define	MWAIT_C3	0x20
 #define	MWAIT_C4	0x30
 
 /*
  * MWAIT extensions.
  */
 /* Interrupt breaks MWAIT even when masked. */
 #define	MWAIT_INTRBREAK		0x00000001
 
 /*
  * CPUID instruction 6 ecx info
  */
 #define	CPUID_PERF_STAT		0x00000001
 #define	CPUID_PERF_BIAS		0x00000008
 
 /* 
  * CPUID instruction 0xb ebx info.
  */
 #define	CPUID_TYPE_INVAL	0
 #define	CPUID_TYPE_SMT		1
 #define	CPUID_TYPE_CORE		2
 
 /*
  * CPUID instruction 0xd Processor Extended State Enumeration Sub-leaf 1
  */
 #define	CPUID_EXTSTATE_XSAVEOPT	0x00000001
 #define	CPUID_EXTSTATE_XSAVEC	0x00000002
 #define	CPUID_EXTSTATE_XINUSE	0x00000004
 #define	CPUID_EXTSTATE_XSAVES	0x00000008
 
 /*
  * AMD extended function 8000_0007h edx info
  */
 #define	AMDPM_TS		0x00000001
 #define	AMDPM_FID		0x00000002
 #define	AMDPM_VID		0x00000004
 #define	AMDPM_TTP		0x00000008
 #define	AMDPM_TM		0x00000010
 #define	AMDPM_STC		0x00000020
 #define	AMDPM_100MHZ_STEPS	0x00000040
 #define	AMDPM_HW_PSTATE		0x00000080
 #define	AMDPM_TSC_INVARIANT	0x00000100
 #define	AMDPM_CPB		0x00000200
 
 /*
  * AMD extended function 8000_0008h ecx info
  */
 #define	AMDID_CMP_CORES		0x000000ff
 #define	AMDID_COREID_SIZE	0x0000f000
 #define	AMDID_COREID_SIZE_SHIFT	12
 
 /*
  * CPUID instruction 7 Structured Extended Features, leaf 0 ebx info
  */
 #define	CPUID_STDEXT_FSGSBASE	0x00000001
 #define	CPUID_STDEXT_TSC_ADJUST	0x00000002
 #define	CPUID_STDEXT_BMI1	0x00000008
 #define	CPUID_STDEXT_HLE	0x00000010
 #define	CPUID_STDEXT_AVX2	0x00000020
 #define	CPUID_STDEXT_SMEP	0x00000080
 #define	CPUID_STDEXT_BMI2	0x00000100
 #define	CPUID_STDEXT_ERMS	0x00000200
 #define	CPUID_STDEXT_INVPCID	0x00000400
 #define	CPUID_STDEXT_RTM	0x00000800
 #define	CPUID_STDEXT_MPX	0x00004000
 #define	CPUID_STDEXT_AVX512F	0x00010000
 #define	CPUID_STDEXT_RDSEED	0x00040000
 #define	CPUID_STDEXT_ADX	0x00080000
 #define	CPUID_STDEXT_SMAP	0x00100000
 #define	CPUID_STDEXT_CLFLUSHOPT	0x00800000
 #define	CPUID_STDEXT_PROCTRACE	0x02000000
 #define	CPUID_STDEXT_AVX512PF	0x04000000
 #define	CPUID_STDEXT_AVX512ER	0x08000000
 #define	CPUID_STDEXT_AVX512CD	0x10000000
 #define	CPUID_STDEXT_SHA	0x20000000
 
 /*
  * CPUID manufacturers identifiers
  */
 #define	AMD_VENDOR_ID		"AuthenticAMD"
 #define	CENTAUR_VENDOR_ID	"CentaurHauls"
 #define	CYRIX_VENDOR_ID		"CyrixInstead"
 #define	INTEL_VENDOR_ID		"GenuineIntel"
 #define	NEXGEN_VENDOR_ID	"NexGenDriven"
 #define	NSC_VENDOR_ID		"Geode by NSC"
 #define	RISE_VENDOR_ID		"RiseRiseRise"
 #define	SIS_VENDOR_ID		"SiS SiS SiS "
 #define	TRANSMETA_VENDOR_ID	"GenuineTMx86"
 #define	UMC_VENDOR_ID		"UMC UMC UMC "
 
 /*
  * Model-specific registers for the i386 family
  */
 #define	MSR_P5_MC_ADDR		0x000
 #define	MSR_P5_MC_TYPE		0x001
 #define	MSR_TSC			0x010
 #define	MSR_P5_CESR		0x011
 #define	MSR_P5_CTR0		0x012
 #define	MSR_P5_CTR1		0x013
 #define	MSR_IA32_PLATFORM_ID	0x017
 #define	MSR_APICBASE		0x01b
 #define	MSR_EBL_CR_POWERON	0x02a
 #define	MSR_TEST_CTL		0x033
 #define	MSR_IA32_FEATURE_CONTROL 0x03a
 #define	MSR_BIOS_UPDT_TRIG	0x079
 #define	MSR_BBL_CR_D0		0x088
 #define	MSR_BBL_CR_D1		0x089
 #define	MSR_BBL_CR_D2		0x08a
 #define	MSR_BIOS_SIGN		0x08b
 #define	MSR_PERFCTR0		0x0c1
 #define	MSR_PERFCTR1		0x0c2
 #define	MSR_PLATFORM_INFO	0x0ce
 #define	MSR_MPERF		0x0e7
 #define	MSR_APERF		0x0e8
 #define	MSR_IA32_EXT_CONFIG	0x0ee	/* Undocumented. Core Solo/Duo only */
 #define	MSR_MTRRcap		0x0fe
 #define	MSR_BBL_CR_ADDR		0x116
 #define	MSR_BBL_CR_DECC		0x118
 #define	MSR_BBL_CR_CTL		0x119
 #define	MSR_BBL_CR_TRIG		0x11a
 #define	MSR_BBL_CR_BUSY		0x11b
 #define	MSR_BBL_CR_CTL3		0x11e
 #define	MSR_SYSENTER_CS_MSR	0x174
 #define	MSR_SYSENTER_ESP_MSR	0x175
 #define	MSR_SYSENTER_EIP_MSR	0x176
 #define	MSR_MCG_CAP		0x179
 #define	MSR_MCG_STATUS		0x17a
 #define	MSR_MCG_CTL		0x17b
 #define	MSR_EVNTSEL0		0x186
 #define	MSR_EVNTSEL1		0x187
 #define	MSR_THERM_CONTROL	0x19a
 #define	MSR_THERM_INTERRUPT	0x19b
 #define	MSR_THERM_STATUS	0x19c
 #define	MSR_IA32_MISC_ENABLE	0x1a0
 #define	MSR_IA32_TEMPERATURE_TARGET	0x1a2
 #define	MSR_TURBO_RATIO_LIMIT	0x1ad
 #define	MSR_TURBO_RATIO_LIMIT1	0x1ae
 #define	MSR_DEBUGCTLMSR		0x1d9
 #define	MSR_LASTBRANCHFROMIP	0x1db
 #define	MSR_LASTBRANCHTOIP	0x1dc
 #define	MSR_LASTINTFROMIP	0x1dd
 #define	MSR_LASTINTTOIP		0x1de
 #define	MSR_ROB_CR_BKUPTMPDR6	0x1e0
 #define	MSR_MTRRVarBase		0x200
 #define	MSR_MTRR64kBase		0x250
 #define	MSR_MTRR16kBase		0x258
 #define	MSR_MTRR4kBase		0x268
 #define	MSR_PAT			0x277
 #define	MSR_MC0_CTL2		0x280
 #define	MSR_MTRRdefType		0x2ff
 #define	MSR_MC0_CTL		0x400
 #define	MSR_MC0_STATUS		0x401
 #define	MSR_MC0_ADDR		0x402
 #define	MSR_MC0_MISC		0x403
 #define	MSR_MC1_CTL		0x404
 #define	MSR_MC1_STATUS		0x405
 #define	MSR_MC1_ADDR		0x406
 #define	MSR_MC1_MISC		0x407
 #define	MSR_MC2_CTL		0x408
 #define	MSR_MC2_STATUS		0x409
 #define	MSR_MC2_ADDR		0x40a
 #define	MSR_MC2_MISC		0x40b
 #define	MSR_MC3_CTL		0x40c
 #define	MSR_MC3_STATUS		0x40d
 #define	MSR_MC3_ADDR		0x40e
 #define	MSR_MC3_MISC		0x40f
 #define	MSR_MC4_CTL		0x410
 #define	MSR_MC4_STATUS		0x411
 #define	MSR_MC4_ADDR		0x412
 #define	MSR_MC4_MISC		0x413
 #define	MSR_RAPL_POWER_UNIT	0x606
 #define	MSR_PKG_ENERGY_STATUS	0x611
 #define	MSR_DRAM_ENERGY_STATUS	0x619
 #define	MSR_PP0_ENERGY_STATUS	0x639
 #define	MSR_PP1_ENERGY_STATUS	0x641
 
 /*
  * VMX MSRs
  */
 #define	MSR_VMX_BASIC		0x480
 #define	MSR_VMX_PINBASED_CTLS	0x481
 #define	MSR_VMX_PROCBASED_CTLS	0x482
 #define	MSR_VMX_EXIT_CTLS	0x483
 #define	MSR_VMX_ENTRY_CTLS	0x484
 #define	MSR_VMX_CR0_FIXED0	0x486
 #define	MSR_VMX_CR0_FIXED1	0x487
 #define	MSR_VMX_CR4_FIXED0	0x488
 #define	MSR_VMX_CR4_FIXED1	0x489
 #define	MSR_VMX_PROCBASED_CTLS2	0x48b
 #define	MSR_VMX_EPT_VPID_CAP	0x48c
 #define	MSR_VMX_TRUE_PINBASED_CTLS	0x48d
 #define	MSR_VMX_TRUE_PROCBASED_CTLS	0x48e
 #define	MSR_VMX_TRUE_EXIT_CTLS	0x48f
 #define	MSR_VMX_TRUE_ENTRY_CTLS	0x490
 
 /*
  * X2APIC MSRs
  */
 #define	MSR_APIC_000		0x800
 #define	MSR_APIC_ID		0x802
 #define	MSR_APIC_VERSION	0x803
 #define	MSR_APIC_TPR		0x808
 #define	MSR_APIC_EOI		0x80b
 #define	MSR_APIC_LDR		0x80d
 #define	MSR_APIC_SVR		0x80f
 #define	MSR_APIC_ISR0		0x810
 #define	MSR_APIC_ISR1		0x811
 #define	MSR_APIC_ISR2		0x812
 #define	MSR_APIC_ISR3		0x813
 #define	MSR_APIC_ISR4		0x814
 #define	MSR_APIC_ISR5		0x815
 #define	MSR_APIC_ISR6		0x816
 #define	MSR_APIC_ISR7		0x817
 #define	MSR_APIC_TMR0		0x818
 #define	MSR_APIC_IRR0		0x820
 #define	MSR_APIC_ESR		0x828
 #define	MSR_APIC_LVT_CMCI	0x82F
 #define	MSR_APIC_ICR		0x830
 #define	MSR_APIC_LVT_TIMER	0x832
 #define	MSR_APIC_LVT_THERMAL	0x833
 #define	MSR_APIC_LVT_PCINT	0x834
 #define	MSR_APIC_LVT_LINT0	0x835
 #define	MSR_APIC_LVT_LINT1	0x836
 #define	MSR_APIC_LVT_ERROR	0x837
 #define	MSR_APIC_ICR_TIMER	0x838
 #define	MSR_APIC_CCR_TIMER	0x839
 #define	MSR_APIC_DCR_TIMER	0x83e
 #define	MSR_APIC_SELF_IPI	0x83f
 
 #define	MSR_IA32_XSS		0xda0
 
 /*
  * Constants related to MSR's.
  */
 #define	APICBASE_RESERVED	0x000002ff
 #define	APICBASE_BSP		0x00000100
 #define	APICBASE_X2APIC		0x00000400
 #define	APICBASE_ENABLED	0x00000800
 #define	APICBASE_ADDRESS	0xfffff000
 
 /* MSR_IA32_FEATURE_CONTROL related */
 #define	IA32_FEATURE_CONTROL_LOCK	0x01	/* lock bit */
 #define	IA32_FEATURE_CONTROL_SMX_EN	0x02	/* enable VMX inside SMX */
 #define	IA32_FEATURE_CONTROL_VMX_EN	0x04	/* enable VMX outside SMX */
 
 /*
  * PAT modes.
  */
 #define	PAT_UNCACHEABLE		0x00
 #define	PAT_WRITE_COMBINING	0x01
 #define	PAT_WRITE_THROUGH	0x04
 #define	PAT_WRITE_PROTECTED	0x05
 #define	PAT_WRITE_BACK		0x06
 #define	PAT_UNCACHED		0x07
 #define	PAT_VALUE(i, m)		((long long)(m) << (8 * (i)))
 #define	PAT_MASK(i)		PAT_VALUE(i, 0xff)
 
 /*
  * Constants related to MTRRs
  */
 #define	MTRR_UNCACHEABLE	0x00
 #define	MTRR_WRITE_COMBINING	0x01
 #define	MTRR_WRITE_THROUGH	0x04
 #define	MTRR_WRITE_PROTECTED	0x05
 #define	MTRR_WRITE_BACK		0x06
 #define	MTRR_N64K		8	/* numbers of fixed-size entries */
 #define	MTRR_N16K		16
 #define	MTRR_N4K		64
 #define	MTRR_CAP_WC		0x0000000000000400
 #define	MTRR_CAP_FIXED		0x0000000000000100
 #define	MTRR_CAP_VCNT		0x00000000000000ff
 #define	MTRR_DEF_ENABLE		0x0000000000000800
 #define	MTRR_DEF_FIXED_ENABLE	0x0000000000000400
 #define	MTRR_DEF_TYPE		0x00000000000000ff
 #define	MTRR_PHYSBASE_PHYSBASE	0x000ffffffffff000
 #define	MTRR_PHYSBASE_TYPE	0x00000000000000ff
 #define	MTRR_PHYSMASK_PHYSMASK	0x000ffffffffff000
 #define	MTRR_PHYSMASK_VALID	0x0000000000000800
 
 /*
  * Cyrix configuration registers, accessible as IO ports.
  */
 #define	CCR0			0xc0	/* Configuration control register 0 */
 #define	CCR0_NC0		0x01	/* First 64K of each 1M memory region is
 								   non-cacheable */
 #define	CCR0_NC1		0x02	/* 640K-1M region is non-cacheable */
 #define	CCR0_A20M		0x04	/* Enables A20M# input pin */
 #define	CCR0_KEN		0x08	/* Enables KEN# input pin */
 #define	CCR0_FLUSH		0x10	/* Enables FLUSH# input pin */
 #define	CCR0_BARB		0x20	/* Flushes internal cache when entering hold
 								   state */
 #define	CCR0_CO			0x40	/* Cache org: 1=direct mapped, 0=2x set
 								   assoc */
 #define	CCR0_SUSPEND	0x80	/* Enables SUSP# and SUSPA# pins */
 
 #define	CCR1			0xc1	/* Configuration control register 1 */
 #define	CCR1_RPL		0x01	/* Enables RPLSET and RPLVAL# pins */
 #define	CCR1_SMI		0x02	/* Enables SMM pins */
 #define	CCR1_SMAC		0x04	/* System management memory access */
 #define	CCR1_MMAC		0x08	/* Main memory access */
 #define	CCR1_NO_LOCK	0x10	/* Negate LOCK# */
 #define	CCR1_SM3		0x80	/* SMM address space address region 3 */
 
 #define	CCR2			0xc2
 #define	CCR2_WB			0x02	/* Enables WB cache interface pins */
 #define	CCR2_SADS		0x02	/* Slow ADS */
 #define	CCR2_LOCK_NW	0x04	/* LOCK NW Bit */
 #define	CCR2_SUSP_HLT	0x08	/* Suspend on HALT */
 #define	CCR2_WT1		0x10	/* WT region 1 */
 #define	CCR2_WPR1		0x10	/* Write-protect region 1 */
 #define	CCR2_BARB		0x20	/* Flushes write-back cache when entering
 								   hold state. */
 #define	CCR2_BWRT		0x40	/* Enables burst write cycles */
 #define	CCR2_USE_SUSP	0x80	/* Enables suspend pins */
 
 #define	CCR3			0xc3
 #define	CCR3_SMILOCK	0x01	/* SMM register lock */
 #define	CCR3_NMI		0x02	/* Enables NMI during SMM */
 #define	CCR3_LINBRST	0x04	/* Linear address burst cycles */
 #define	CCR3_SMMMODE	0x08	/* SMM Mode */
 #define	CCR3_MAPEN0		0x10	/* Enables Map0 */
 #define	CCR3_MAPEN1		0x20	/* Enables Map1 */
 #define	CCR3_MAPEN2		0x40	/* Enables Map2 */
 #define	CCR3_MAPEN3		0x80	/* Enables Map3 */
 
 #define	CCR4			0xe8
 #define	CCR4_IOMASK		0x07
 #define	CCR4_MEM		0x08	/* Enables memory bypassing */
 #define	CCR4_DTE		0x10	/* Enables directory table entry cache */
 #define	CCR4_FASTFPE	0x20	/* Fast FPU exception */
 #define	CCR4_CPUID		0x80	/* Enables CPUID instruction */
 
 #define	CCR5			0xe9
 #define	CCR5_WT_ALLOC	0x01	/* Write-through allocate */
 #define	CCR5_SLOP		0x02	/* LOOP instruction slowed down */
 #define	CCR5_LBR1		0x10	/* Local bus region 1 */
 #define	CCR5_ARREN		0x20	/* Enables ARR region */
 
 #define	CCR6			0xea
 
 #define	CCR7			0xeb
 
 /* Performance Control Register (5x86 only). */
 #define	PCR0			0x20
 #define	PCR0_RSTK		0x01	/* Enables return stack */
 #define	PCR0_BTB		0x02	/* Enables branch target buffer */
 #define	PCR0_LOOP		0x04	/* Enables loop */
 #define	PCR0_AIS		0x08	/* Enables all instructions stalled to
 								   serialize pipe. */
 #define	PCR0_MLR		0x10	/* Enables reordering of misaligned loads */
 #define	PCR0_BTBRT		0x40	/* Enables BTB test register. */
 #define	PCR0_LSSER		0x80	/* Disable reorder */
 
 /* Device Identification Registers */
 #define	DIR0			0xfe
 #define	DIR1			0xff
 
 /*
  * Machine Check register constants.
  */
 #define	MCG_CAP_COUNT		0x000000ff
 #define	MCG_CAP_CTL_P		0x00000100
 #define	MCG_CAP_EXT_P		0x00000200
 #define	MCG_CAP_CMCI_P		0x00000400
 #define	MCG_CAP_TES_P		0x00000800
 #define	MCG_CAP_EXT_CNT		0x00ff0000
 #define	MCG_CAP_SER_P		0x01000000
 #define	MCG_STATUS_RIPV		0x00000001
 #define	MCG_STATUS_EIPV		0x00000002
 #define	MCG_STATUS_MCIP		0x00000004
 #define	MCG_CTL_ENABLE		0xffffffffffffffff
 #define	MCG_CTL_DISABLE		0x0000000000000000
 #define	MSR_MC_CTL(x)		(MSR_MC0_CTL + (x) * 4)
 #define	MSR_MC_STATUS(x)	(MSR_MC0_STATUS + (x) * 4)
 #define	MSR_MC_ADDR(x)		(MSR_MC0_ADDR + (x) * 4)
 #define	MSR_MC_MISC(x)		(MSR_MC0_MISC + (x) * 4)
 #define	MSR_MC_CTL2(x)		(MSR_MC0_CTL2 + (x))	/* If MCG_CAP_CMCI_P */
 #define	MC_STATUS_MCA_ERROR	0x000000000000ffff
 #define	MC_STATUS_MODEL_ERROR	0x00000000ffff0000
 #define	MC_STATUS_OTHER_INFO	0x01ffffff00000000
 #define	MC_STATUS_COR_COUNT	0x001fffc000000000	/* If MCG_CAP_CMCI_P */
 #define	MC_STATUS_TES_STATUS	0x0060000000000000	/* If MCG_CAP_TES_P */
 #define	MC_STATUS_AR		0x0080000000000000	/* If MCG_CAP_TES_P */
 #define	MC_STATUS_S		0x0100000000000000	/* If MCG_CAP_TES_P */
 #define	MC_STATUS_PCC		0x0200000000000000
 #define	MC_STATUS_ADDRV		0x0400000000000000
 #define	MC_STATUS_MISCV		0x0800000000000000
 #define	MC_STATUS_EN		0x1000000000000000
 #define	MC_STATUS_UC		0x2000000000000000
 #define	MC_STATUS_OVER		0x4000000000000000
 #define	MC_STATUS_VAL		0x8000000000000000
 #define	MC_MISC_RA_LSB		0x000000000000003f	/* If MCG_CAP_SER_P */
 #define	MC_MISC_ADDRESS_MODE	0x00000000000001c0	/* If MCG_CAP_SER_P */
 #define	MC_CTL2_THRESHOLD	0x0000000000007fff
 #define	MC_CTL2_CMCI_EN		0x0000000040000000
 
 /*
  * The following four 3-byte registers control the non-cacheable regions.
  * These registers must be written as three separate bytes.
  *
  * NCRx+0: A31-A24 of starting address
  * NCRx+1: A23-A16 of starting address
  * NCRx+2: A15-A12 of starting address | NCR_SIZE_xx.
  *
  * The non-cacheable region's starting address must be aligned to the
  * size indicated by the NCR_SIZE_xx field.
  */
 #define	NCR1	0xc4
 #define	NCR2	0xc7
 #define	NCR3	0xca
 #define	NCR4	0xcd
 
 #define	NCR_SIZE_0K	0
 #define	NCR_SIZE_4K	1
 #define	NCR_SIZE_8K	2
 #define	NCR_SIZE_16K	3
 #define	NCR_SIZE_32K	4
 #define	NCR_SIZE_64K	5
 #define	NCR_SIZE_128K	6
 #define	NCR_SIZE_256K	7
 #define	NCR_SIZE_512K	8
 #define	NCR_SIZE_1M	9
 #define	NCR_SIZE_2M	10
 #define	NCR_SIZE_4M	11
 #define	NCR_SIZE_8M	12
 #define	NCR_SIZE_16M	13
 #define	NCR_SIZE_32M	14
 #define	NCR_SIZE_4G	15
 
 /*
  * The address region registers are used to specify the location and
  * size for the eight address regions.
  *
  * ARRx + 0: A31-A24 of start address
  * ARRx + 1: A23-A16 of start address
  * ARRx + 2: A15-A12 of start address | ARR_SIZE_xx
  */
 #define	ARR0	0xc4
 #define	ARR1	0xc7
 #define	ARR2	0xca
 #define	ARR3	0xcd
 #define	ARR4	0xd0
 #define	ARR5	0xd3
 #define	ARR6	0xd6
 #define	ARR7	0xd9
 
 #define	ARR_SIZE_0K		0
 #define	ARR_SIZE_4K		1
 #define	ARR_SIZE_8K		2
 #define	ARR_SIZE_16K	3
 #define	ARR_SIZE_32K	4
 #define	ARR_SIZE_64K	5
 #define	ARR_SIZE_128K	6
 #define	ARR_SIZE_256K	7
 #define	ARR_SIZE_512K	8
 #define	ARR_SIZE_1M		9
 #define	ARR_SIZE_2M		10
 #define	ARR_SIZE_4M		11
 #define	ARR_SIZE_8M		12
 #define	ARR_SIZE_16M	13
 #define	ARR_SIZE_32M	14
 #define	ARR_SIZE_4G		15
 
 /*
  * The region control registers specify the attributes associated with
  * the ARRx address regions.
  */
 #define	RCR0	0xdc
 #define	RCR1	0xdd
 #define	RCR2	0xde
 #define	RCR3	0xdf
 #define	RCR4	0xe0
 #define	RCR5	0xe1
 #define	RCR6	0xe2
 #define	RCR7	0xe3
 
 #define	RCR_RCD	0x01	/* Disables caching for ARRx (x = 0-6). */
 #define	RCR_RCE	0x01	/* Enables caching for ARR7. */
 #define	RCR_WWO	0x02	/* Weak write ordering. */
 #define	RCR_WL	0x04	/* Weak locking. */
 #define	RCR_WG	0x08	/* Write gathering. */
 #define	RCR_WT	0x10	/* Write-through. */
 #define	RCR_NLB	0x20	/* LBA# pin is not asserted. */
 
 /* AMD Write Allocate Top-Of-Memory and Control Register */
 #define	AMD_WT_ALLOC_TME	0x40000	/* top-of-memory enable */
 #define	AMD_WT_ALLOC_PRE	0x20000	/* programmable range enable */
 #define	AMD_WT_ALLOC_FRE	0x10000	/* fixed (A0000-FFFFF) range enable */
 
 /* AMD64 MSR's */
 #define	MSR_EFER	0xc0000080	/* extended features */
 #define	MSR_STAR	0xc0000081	/* legacy mode SYSCALL target/cs/ss */
 #define	MSR_LSTAR	0xc0000082	/* long mode SYSCALL target rip */
 #define	MSR_CSTAR	0xc0000083	/* compat mode SYSCALL target rip */
 #define	MSR_SF_MASK	0xc0000084	/* syscall flags mask */
 #define	MSR_FSBASE	0xc0000100	/* base address of the %fs "segment" */
 #define	MSR_GSBASE	0xc0000101	/* base address of the %gs "segment" */
 #define	MSR_KGSBASE	0xc0000102	/* base address of the kernel %gs */
 #define	MSR_PERFEVSEL0	0xc0010000
 #define	MSR_PERFEVSEL1	0xc0010001
 #define	MSR_PERFEVSEL2	0xc0010002
 #define	MSR_PERFEVSEL3	0xc0010003
 #define	MSR_K7_PERFCTR0	0xc0010004
 #define	MSR_K7_PERFCTR1	0xc0010005
 #define	MSR_K7_PERFCTR2	0xc0010006
 #define	MSR_K7_PERFCTR3	0xc0010007
 #define	MSR_SYSCFG	0xc0010010
 #define	MSR_HWCR	0xc0010015
 #define	MSR_IORRBASE0	0xc0010016
 #define	MSR_IORRMASK0	0xc0010017
 #define	MSR_IORRBASE1	0xc0010018
 #define	MSR_IORRMASK1	0xc0010019
 #define	MSR_TOP_MEM	0xc001001a	/* boundary for ram below 4G */
 #define	MSR_TOP_MEM2	0xc001001d	/* boundary for ram above 4G */
 #define	MSR_NB_CFG1	0xc001001f	/* NB configuration 1 */
 #define	MSR_P_STATE_LIMIT 0xc0010061	/* P-state Current Limit Register */
 #define	MSR_P_STATE_CONTROL 0xc0010062	/* P-state Control Register */
 #define	MSR_P_STATE_STATUS 0xc0010063	/* P-state Status Register */
 #define	MSR_P_STATE_CONFIG(n) (0xc0010064 + (n)) /* P-state Config */
 #define	MSR_SMM_ADDR	0xc0010112	/* SMM TSEG base address */
 #define	MSR_SMM_MASK	0xc0010113	/* SMM TSEG address mask */
 #define	MSR_IC_CFG	0xc0011021	/* Instruction Cache Configuration */
 #define	MSR_K8_UCODE_UPDATE	0xc0010020	/* update microcode */
 #define	MSR_MC0_CTL_MASK	0xc0010044
 #define	MSR_VM_CR		0xc0010114 /* SVM: feature control */
 #define	MSR_VM_HSAVE_PA		0xc0010117 /* SVM: host save area address */
 
 /* MSR_VM_CR related */
 #define	VM_CR_SVMDIS		0x10	/* SVM: disabled by BIOS */
 
 /* VIA ACE crypto featureset: for via_feature_rng */
 #define	VIA_HAS_RNG		1	/* cpu has RNG */
 
 /* VIA ACE crypto featureset: for via_feature_xcrypt */
 #define	VIA_HAS_AES		1	/* cpu has AES */
 #define	VIA_HAS_SHA		2	/* cpu has SHA1 & SHA256 */
 #define	VIA_HAS_MM		4	/* cpu has RSA instructions */
 #define	VIA_HAS_AESCTR		8	/* cpu has AES-CTR instructions */
 
 /* Centaur Extended Feature flags */
 #define	VIA_CPUID_HAS_RNG	0x000004
 #define	VIA_CPUID_DO_RNG	0x000008
 #define	VIA_CPUID_HAS_ACE	0x000040
 #define	VIA_CPUID_DO_ACE	0x000080
 #define	VIA_CPUID_HAS_ACE2	0x000100
 #define	VIA_CPUID_DO_ACE2	0x000200
 #define	VIA_CPUID_HAS_PHE	0x000400
 #define	VIA_CPUID_DO_PHE	0x000800
 #define	VIA_CPUID_HAS_PMM	0x001000
 #define	VIA_CPUID_DO_PMM	0x002000
 
 /* VIA ACE xcrypt-* instruction context control options */
 #define	VIA_CRYPT_CWLO_ROUND_M		0x0000000f
 #define	VIA_CRYPT_CWLO_ALG_M		0x00000070
 #define	VIA_CRYPT_CWLO_ALG_AES		0x00000000
 #define	VIA_CRYPT_CWLO_KEYGEN_M		0x00000080
 #define	VIA_CRYPT_CWLO_KEYGEN_HW	0x00000000
 #define	VIA_CRYPT_CWLO_KEYGEN_SW	0x00000080
 #define	VIA_CRYPT_CWLO_NORMAL		0x00000000
 #define	VIA_CRYPT_CWLO_INTERMEDIATE	0x00000100
 #define	VIA_CRYPT_CWLO_ENCRYPT		0x00000000
 #define	VIA_CRYPT_CWLO_DECRYPT		0x00000200
 #define	VIA_CRYPT_CWLO_KEY128		0x0000000a	/* 128bit, 10 rds */
 #define	VIA_CRYPT_CWLO_KEY192		0x0000040c	/* 192bit, 12 rds */
 #define	VIA_CRYPT_CWLO_KEY256		0x0000080e	/* 256bit, 14 rds */
 
 #endif /* !_MACHINE_SPECIALREG_H_ */
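Note on the PAT macros in specialreg.h above: PAT_VALUE(i, m) places memory type m in byte i of the 64-bit PAT MSR image, and PAT_MASK(i) selects that byte. A minimal sketch of how a single entry might be replaced is given here. The SKETCH_ macros and sketch_pat_set_entry() are hypothetical names that do not appear in the header, and the sketch casts through uint64_t instead of long long, an assumption made so that index 7 does not shift into the sign bit.

```c
#include <stdint.h>

/* Memory type encodings, copied from the header. */
#define	PAT_UNCACHEABLE		0x00
#define	PAT_WRITE_COMBINING	0x01
#define	PAT_WRITE_THROUGH	0x04
#define	PAT_WRITE_PROTECTED	0x05
#define	PAT_WRITE_BACK		0x06
#define	PAT_UNCACHED		0x07

/* Unsigned variants of PAT_VALUE/PAT_MASK (assumption of this sketch). */
#define	SKETCH_PAT_VALUE(i, m)	((uint64_t)(m) << (8 * (i)))
#define	SKETCH_PAT_MASK(i)	SKETCH_PAT_VALUE(i, 0xff)

/*
 * Hypothetical helper: return a copy of the PAT image 'pat' in which
 * entry 'idx' (0-7) holds 'mode'.  All other entries are preserved.
 */
static uint64_t
sketch_pat_set_entry(uint64_t pat, int idx, unsigned int mode)
{

	pat &= ~SKETCH_PAT_MASK(idx);		/* clear the old entry */
	pat |= SKETCH_PAT_VALUE(idx, mode);	/* install the new type */
	return (pat);
}
```

For example, sketch_pat_set_entry(pat, 2, PAT_WRITE_COMBINING) returns pat with byte 2 replaced by 0x01. Reading and writing the MSR itself is outside the scope of this sketch.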
Index: projects/release-arm-redux/sys/x86/x86/cpu_machdep.c
===================================================================
--- projects/release-arm-redux/sys/x86/x86/cpu_machdep.c	(revision 282691)
+++ projects/release-arm-redux/sys/x86/x86/cpu_machdep.c	(revision 282692)
@@ -1,486 +1,517 @@
 /*-
  * Copyright (c) 2003 Peter Wemm.
  * Copyright (c) 1992 Terrence R. Lambert.
  * Copyright (c) 1982, 1987, 1990 The Regents of the University of California.
  * All rights reserved.
  *
  * This code is derived from software contributed to Berkeley by
  * William Jolitz.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  * 3. All advertising materials mentioning features or use of this software
  *    must display the following acknowledgement:
  *	This product includes software developed by the University of
  *	California, Berkeley and its contributors.
  * 4. Neither the name of the University nor the names of its contributors
  *    may be used to endorse or promote products derived from this software
  *    without specific prior written permission.
  *
  * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  *	from: @(#)machdep.c	7.4 (Berkeley) 6/3/91
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include "opt_atpic.h"
 #include "opt_compat.h"
 #include "opt_cpu.h"
 #include "opt_ddb.h"
 #include "opt_inet.h"
 #include "opt_isa.h"
 #include "opt_kstack_pages.h"
 #include "opt_maxmem.h"
 #include "opt_mp_watchdog.h"
 #include "opt_perfmon.h"
 #include "opt_platform.h"
 #ifdef __i386__
 #include "opt_npx.h"
 #include "opt_apic.h"
 #include "opt_xbox.h"
 #endif
 
 #include <sys/param.h>
 #include <sys/proc.h>
 #include <sys/systm.h>
 #include <sys/bus.h>
 #include <sys/cpu.h>
 #include <sys/kdb.h>
 #include <sys/kernel.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/malloc.h>
 #include <sys/mutex.h>
 #include <sys/pcpu.h>
 #include <sys/rwlock.h>
 #include <sys/sched.h>
 #ifdef SMP
 #include <sys/smp.h>
 #endif
 #include <sys/sysctl.h>
 
 #include <machine/clock.h>
 #include <machine/cpu.h>
 #include <machine/cputypes.h>
 #include <machine/specialreg.h>
 #include <machine/md_var.h>
 #include <machine/mp_watchdog.h>
 #ifdef PERFMON
 #include <machine/perfmon.h>
 #endif
 #include <machine/tss.h>
 #ifdef SMP
 #include <machine/smp.h>
 #endif
+#include <x86/acpica_machdep.h>
 
 #include <vm/vm.h>
 #include <vm/vm_extern.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_pager.h>
 #include <vm/vm_param.h>
 
 /*
  * Machine dependent boot() routine
  *
  * I haven't seen anything to put here yet
  * Possibly some stuff might be grafted back here from boot()
  */
 void
 cpu_boot(int howto)
 {
 }
 
 /*
  * Flush the D-cache for non-DMA I/O so that the I-cache can
  * be made coherent later.
  */
 void
 cpu_flush_dcache(void *ptr, size_t len)
 {
 	/* Not applicable */
 }
 
+void
+acpi_cpu_c1(void)
+{
+
+	__asm __volatile("sti; hlt");
+}
+
+void
+acpi_cpu_idle_mwait(uint32_t mwait_hint)
+{
+	int *state;
+
+	state = (int *)PCPU_PTR(monitorbuf);
+	/*
+	 * XXXKIB.  Software coordination mode should be supported,
+	 * but all Intel CPUs provide hardware coordination.
+	 */
+	cpu_monitor(state, 0, 0);
+	cpu_mwait(MWAIT_INTRBREAK, mwait_hint);
+}
+
 /* Get current clock frequency for the given cpu id. */
 int
 cpu_est_clockrate(int cpu_id, uint64_t *rate)
 {
 	uint64_t tsc1, tsc2;
 	uint64_t acnt, mcnt, perf;
 	register_t reg;
 
 	if (pcpu_find(cpu_id) == NULL || rate == NULL)
 		return (EINVAL);
 #ifdef __i386__
 	if ((cpu_feature & CPUID_TSC) == 0)
 		return (EOPNOTSUPP);
 #endif
 
 	/*
 	 * If TSC is P-state invariant and APERF/MPERF MSRs do not exist,
 	 * DELAY(9) based logic fails.
 	 */
 	if (tsc_is_invariant && !tsc_perf_stat)
 		return (EOPNOTSUPP);
 
 #ifdef SMP
 	if (smp_cpus > 1) {
 		/* Schedule ourselves on the indicated cpu. */
 		thread_lock(curthread);
 		sched_bind(curthread, cpu_id);
 		thread_unlock(curthread);
 	}
 #endif
 
 	/* Calibrate by measuring a short delay. */
 	reg = intr_disable();
 	if (tsc_is_invariant) {
 		wrmsr(MSR_MPERF, 0);
 		wrmsr(MSR_APERF, 0);
 		tsc1 = rdtsc();
 		DELAY(1000);
 		mcnt = rdmsr(MSR_MPERF);
 		acnt = rdmsr(MSR_APERF);
 		tsc2 = rdtsc();
 		intr_restore(reg);
 		perf = 1000 * acnt / mcnt;
 		*rate = (tsc2 - tsc1) * perf;
 	} else {
 		tsc1 = rdtsc();
 		DELAY(1000);
 		tsc2 = rdtsc();
 		intr_restore(reg);
 		*rate = (tsc2 - tsc1) * 1000;
 	}
 
 #ifdef SMP
 	if (smp_cpus > 1) {
 		thread_lock(curthread);
 		sched_unbind(curthread);
 		thread_unlock(curthread);
 	}
 #endif
 
 	return (0);
 }
 
 /*
  * Shutdown the CPU as much as possible
  */
 void
 cpu_halt(void)
 {
 	for (;;)
 		halt();
 }
 
+bool
+cpu_mwait_usable(void)
+{
+
+	return ((cpu_feature2 & CPUID2_MON) != 0 && ((cpu_mon_mwait_flags &
+	    (CPUID5_MON_MWAIT_EXT | CPUID5_MWAIT_INTRBREAK)) ==
+	    (CPUID5_MON_MWAIT_EXT | CPUID5_MWAIT_INTRBREAK)));
+}
+
 void (*cpu_idle_hook)(sbintime_t) = NULL;	/* ACPI idle hook. */
 static int	cpu_ident_amdc1e = 0;	/* AMD C1E supported. */
 static int	idle_mwait = 1;		/* Use MONITOR/MWAIT for short idle. */
 SYSCTL_INT(_machdep, OID_AUTO, idle_mwait, CTLFLAG_RWTUN, &idle_mwait,
     0, "Use MONITOR/MWAIT for short idle");
 
 #define	STATE_RUNNING	0x0
 #define	STATE_MWAIT	0x1
 #define	STATE_SLEEPING	0x2
 
 #ifndef PC98
 static void
 cpu_idle_acpi(sbintime_t sbt)
 {
 	int *state;
 
 	state = (int *)PCPU_PTR(monitorbuf);
 	*state = STATE_SLEEPING;
 
 	/* See comments in cpu_idle_hlt(). */
 	disable_intr();
 	if (sched_runnable())
 		enable_intr();
 	else if (cpu_idle_hook)
 		cpu_idle_hook(sbt);
 	else
-		__asm __volatile("sti; hlt");
+		acpi_cpu_c1();
 	*state = STATE_RUNNING;
 }
 #endif /* !PC98 */
 
 static void
 cpu_idle_hlt(sbintime_t sbt)
 {
 	int *state;
 
 	state = (int *)PCPU_PTR(monitorbuf);
 	*state = STATE_SLEEPING;
 
 	/*
 	 * Since we may be in a critical section from cpu_idle(), if
 	 * an interrupt fires during that critical section we may have
 	 * a pending preemption.  If the CPU halts, then that thread
 	 * may not execute until a later interrupt awakens the CPU.
 	 * To handle this race, check for a runnable thread after
 	 * disabling interrupts and immediately return if one is
 	 * found.  Also, we must absolutely guarantee that hlt is
 	 * the next instruction after sti.  This ensures that any
 	 * interrupt that fires after the call to disable_intr() will
 	 * immediately awaken the CPU from hlt.  Finally, note that
 	 * on x86 this works because sti sets IF to 1 immediately,
 	 * but interrupts are recognized only after the instruction
 	 * following sti has executed, so hlt is entered before a
 	 * pending interrupt is taken and that interrupt wakes the CPU.
 	 */
 	disable_intr();
 	if (sched_runnable())
 		enable_intr();
 	else
-		__asm __volatile("sti; hlt");
+		acpi_cpu_c1();
 	*state = STATE_RUNNING;
 }
 
 static void
 cpu_idle_mwait(sbintime_t sbt)
 {
 	int *state;
 
 	state = (int *)PCPU_PTR(monitorbuf);
 	*state = STATE_MWAIT;
 
 	/* See comments in cpu_idle_hlt(). */
 	disable_intr();
 	if (sched_runnable()) {
 		enable_intr();
 		*state = STATE_RUNNING;
 		return;
 	}
 	cpu_monitor(state, 0, 0);
 	if (*state == STATE_MWAIT)
 		__asm __volatile("sti; mwait" : : "a" (MWAIT_C1), "c" (0));
 	else
 		enable_intr();
 	*state = STATE_RUNNING;
 }
 
 static void
 cpu_idle_spin(sbintime_t sbt)
 {
 	int *state;
 	int i;
 
 	state = (int *)PCPU_PTR(monitorbuf);
 	*state = STATE_RUNNING;
 
 	/*
 	 * The sched_runnable() call is racy, but because it runs in a
 	 * loop, missing a wakeup once has little impact, if any
 	 * (and it is much better than omitting the check entirely).
 	 */
 	for (i = 0; i < 1000; i++) {
 		if (sched_runnable())
 			return;
 		cpu_spinwait();
 	}
 }
 
 /*
  * C1E renders the local APIC timer dead, so we disable it by
  * reading the Interrupt Pending Message register and clearing
  * both C1eOnCmpHalt (bit 28) and SmiOnCmpHalt (bit 27).
  * 
  * Reference:
  *   "BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh Processors"
  *   #32559 revision 3.00+
  */
 #define	MSR_AMDK8_IPM		0xc0010055
 #define	AMDK8_SMIONCMPHALT	(1ULL << 27)
 #define	AMDK8_C1EONCMPHALT	(1ULL << 28)
 #define	AMDK8_CMPHALT		(AMDK8_SMIONCMPHALT | AMDK8_C1EONCMPHALT)
 
 void
 cpu_probe_amdc1e(void)
 {
 
 	/*
 	 * Detect C1E capability, found mostly on the latest dual-core
 	 * (and future) K8-family processors.
 	 */
 	if (cpu_vendor_id == CPU_VENDOR_AMD &&
 	    (cpu_id & 0x00000f00) == 0x00000f00 &&
 	    (cpu_id & 0x0fff0000) >=  0x00040000) {
 		cpu_ident_amdc1e = 1;
 	}
 }
 
 #if defined(__i386__) && defined(PC98)
 void (*cpu_idle_fn)(sbintime_t) = cpu_idle_hlt;
 #else
 void (*cpu_idle_fn)(sbintime_t) = cpu_idle_acpi;
 #endif
 
 void
 cpu_idle(int busy)
 {
 	uint64_t msr;
 	sbintime_t sbt = -1;
 
 	CTR2(KTR_SPARE2, "cpu_idle(%d) at %d",
 	    busy, curcpu);
 #ifdef MP_WATCHDOG
 	ap_watchdog(PCPU_GET(cpuid));
 #endif
 
 	/* If we are busy - try to use fast methods. */
 	if (busy) {
 		if ((cpu_feature2 & CPUID2_MON) && idle_mwait) {
 			cpu_idle_mwait(busy);
 			goto out;
 		}
 	}
 
 	/* If we have time - switch timers into idle mode. */
 	if (!busy) {
 		critical_enter();
 		sbt = cpu_idleclock();
 	}
 
 	/* Apply AMD APIC timer C1E workaround. */
 	if (cpu_ident_amdc1e && cpu_disable_c3_sleep) {
 		msr = rdmsr(MSR_AMDK8_IPM);
 		if (msr & AMDK8_CMPHALT)
 			wrmsr(MSR_AMDK8_IPM, msr & ~AMDK8_CMPHALT);
 	}
 
 	/* Call main idle method. */
 	cpu_idle_fn(sbt);
 
 	/* Switch timers back into active mode. */
 	if (!busy) {
 		cpu_activeclock();
 		critical_exit();
 	}
 out:
 	CTR2(KTR_SPARE2, "cpu_idle(%d) at %d done",
 	    busy, curcpu);
 }
 
 int
 cpu_idle_wakeup(int cpu)
 {
 	struct pcpu *pcpu;
 	int *state;
 
 	pcpu = pcpu_find(cpu);
 	state = (int *)pcpu->pc_monitorbuf;
 	/*
 	 * This doesn't need to be atomic since missing the race will
 	 * simply result in unnecessary IPIs.
 	 */
 	if (*state == STATE_SLEEPING)
 		return (0);
 	if (*state == STATE_MWAIT)
 		*state = STATE_RUNNING;
 	return (1);
 }
 
 /*
  * Ordered by speed/power consumption.
  */
 struct {
 	void	*id_fn;
 	char	*id_name;
 } idle_tbl[] = {
 	{ cpu_idle_spin, "spin" },
 	{ cpu_idle_mwait, "mwait" },
 	{ cpu_idle_hlt, "hlt" },
 #if !defined(__i386__) || !defined(PC98)
 	{ cpu_idle_acpi, "acpi" },
 #endif
 	{ NULL, NULL }
 };
 
 static int
 idle_sysctl_available(SYSCTL_HANDLER_ARGS)
 {
 	char *avail, *p;
 	int error;
 	int i;
 
 	avail = malloc(256, M_TEMP, M_WAITOK);
 	p = avail;
 	for (i = 0; idle_tbl[i].id_name != NULL; i++) {
 		if (strstr(idle_tbl[i].id_name, "mwait") &&
 		    (cpu_feature2 & CPUID2_MON) == 0)
 			continue;
 #if !defined(__i386__) || !defined(PC98)
 		if (strcmp(idle_tbl[i].id_name, "acpi") == 0 &&
 		    cpu_idle_hook == NULL)
 			continue;
 #endif
 		p += sprintf(p, "%s%s", p != avail ? ", " : "",
 		    idle_tbl[i].id_name);
 	}
 	error = sysctl_handle_string(oidp, avail, 0, req);
 	free(avail, M_TEMP);
 	return (error);
 }
 
 SYSCTL_PROC(_machdep, OID_AUTO, idle_available, CTLTYPE_STRING | CTLFLAG_RD,
     0, 0, idle_sysctl_available, "A", "list of available idle functions");
 
 static int
 idle_sysctl(SYSCTL_HANDLER_ARGS)
 {
 	char buf[16];
 	int error;
 	char *p;
 	int i;
 
 	p = "unknown";
 	for (i = 0; idle_tbl[i].id_name != NULL; i++) {
 		if (idle_tbl[i].id_fn == cpu_idle_fn) {
 			p = idle_tbl[i].id_name;
 			break;
 		}
 	}
 	strncpy(buf, p, sizeof(buf));
 	error = sysctl_handle_string(oidp, buf, sizeof(buf), req);
 	if (error != 0 || req->newptr == NULL)
 		return (error);
 	for (i = 0; idle_tbl[i].id_name != NULL; i++) {
 		if (strstr(idle_tbl[i].id_name, "mwait") &&
 		    (cpu_feature2 & CPUID2_MON) == 0)
 			continue;
 #if !defined(__i386__) || !defined(PC98)
 		if (strcmp(idle_tbl[i].id_name, "acpi") == 0 &&
 		    cpu_idle_hook == NULL)
 			continue;
 #endif
 		if (strcmp(idle_tbl[i].id_name, buf))
 			continue;
 		cpu_idle_fn = idle_tbl[i].id_fn;
 		return (0);
 	}
 	return (EINVAL);
 }
 
 SYSCTL_PROC(_machdep, OID_AUTO, idle, CTLTYPE_STRING | CTLFLAG_RW, 0, 0,
     idle_sysctl, "A", "currently selected idle function");
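Note on cpu_est_clockrate() in cpu_machdep.c above: the function counts TSC ticks across DELAY(1000) and, when the TSC is P-state invariant, scales the result by the APERF/MPERF ratio. The arithmetic can be sketched in isolation as below. sketch_est_rate() and its parameters are hypothetical names, and the mcnt == 0 guard is an addition of this sketch that the original code does not contain.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the arithmetic in cpu_est_clockrate().
 *
 * tsc_delta: TSC ticks counted across a 1000 us delay.
 * acnt/mcnt: APERF and MPERF counts over the same interval.
 * invariant: non-zero if the TSC is P-state invariant.
 *
 * Returns the estimated rate in Hz, or 0 if mcnt is 0 (guard added
 * by this sketch only).
 */
static uint64_t
sketch_est_rate(uint64_t tsc_delta, uint64_t acnt, uint64_t mcnt,
    int invariant)
{
	uint64_t perf;

	if (!invariant)
		return (tsc_delta * 1000);	/* ticks/ms -> ticks/s */
	if (mcnt == 0)
		return (0);
	perf = 1000 * acnt / mcnt;		/* ratio scaled by 1000 */
	return (tsc_delta * perf);
}
```

With 3000000 ticks counted in one millisecond and APERF at half of MPERF, the sketch yields 1500000000, that is, half of the nominal 3 GHz.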
Index: projects/release-arm-redux/sys/x86/xen/xen_apic.c
===================================================================
--- projects/release-arm-redux/sys/x86/xen/xen_apic.c	(revision 282691)
+++ projects/release-arm-redux/sys/x86/xen/xen_apic.c	(revision 282692)
@@ -1,549 +1,547 @@
 /*
  * Copyright (c) 2014 Roger Pau Monné <roger.pau@citrix.com>
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/bus.h>
 #include <sys/kernel.h>
 #include <sys/malloc.h>
 #include <sys/proc.h>
 #include <sys/smp.h>
 #include <sys/systm.h>
 
 #include <vm/vm.h>
 #include <vm/pmap.h>
 
 #include <machine/cpufunc.h>
 #include <machine/cpu.h>
 #include <machine/intr_machdep.h>
 #include <machine/smp.h>
 
 #include <x86/apicreg.h>
 #include <x86/apicvar.h>
 
 #include <xen/xen-os.h>
 #include <xen/features.h>
 #include <xen/gnttab.h>
 #include <xen/hypervisor.h>
 #include <xen/hvm.h>
 #include <xen/xen_intr.h>
 
 #include <xen/interface/vcpu.h>
 
 /*--------------------------------- Macros -----------------------------------*/
 
 #define XEN_APIC_UNSUPPORTED \
 	panic("%s: not available in Xen PV port.", __func__)
 
 
 /*--------------------------- Forward Declarations ---------------------------*/
 #ifdef SMP
 static driver_filter_t xen_smp_rendezvous_action;
 static driver_filter_t xen_invltlb;
 static driver_filter_t xen_invlpg;
 static driver_filter_t xen_invlrng;
 static driver_filter_t xen_invlcache;
 static driver_filter_t xen_ipi_bitmap_handler;
 static driver_filter_t xen_cpustop_handler;
 static driver_filter_t xen_cpususpend_handler;
 static driver_filter_t xen_cpustophard_handler;
 #endif
 
 /*---------------------------- Extern Declarations ---------------------------*/
 /* Variables used by mp_machdep to perform the MMU related IPIs */
 #ifdef __amd64__
 extern int pmap_pcid_enabled;
 #endif
 
 /*---------------------------------- Macros ----------------------------------*/
 #define	IPI_TO_IDX(ipi) ((ipi) - APIC_IPI_INTS)
 
 /*--------------------------------- Xen IPIs ---------------------------------*/
 #ifdef SMP
 struct xen_ipi_handler
 {
 	driver_filter_t	*filter;
 	const char	*description;
 };
 
 static struct xen_ipi_handler xen_ipis[] = 
 {
 	[IPI_TO_IDX(IPI_RENDEZVOUS)]	= { xen_smp_rendezvous_action,	"r"   },
 	[IPI_TO_IDX(IPI_INVLTLB)]	= { xen_invltlb,		"itlb"},
 	[IPI_TO_IDX(IPI_INVLPG)]	= { xen_invlpg,			"ipg" },
 	[IPI_TO_IDX(IPI_INVLRNG)]	= { xen_invlrng,		"irg" },
 	[IPI_TO_IDX(IPI_INVLCACHE)]	= { xen_invlcache,		"ic"  },
 	[IPI_TO_IDX(IPI_BITMAP_VECTOR)] = { xen_ipi_bitmap_handler,	"b"   },
 	[IPI_TO_IDX(IPI_STOP)]		= { xen_cpustop_handler,	"st"  },
 	[IPI_TO_IDX(IPI_SUSPEND)]	= { xen_cpususpend_handler,	"sp"  },
 	[IPI_TO_IDX(IPI_STOP_HARD)]	= { xen_cpustophard_handler,	"sth" },
 };
 #endif
 
 /*------------------------------- Per-CPU Data -------------------------------*/
 #ifdef SMP
 DPCPU_DEFINE(xen_intr_handle_t, ipi_handle[nitems(xen_ipis)]);
 #endif
 
 /*------------------------------- Xen PV APIC --------------------------------*/
 
 static void
 xen_pv_lapic_create(u_int apic_id, int boot_cpu)
 {
 #ifdef SMP
 	cpu_add(apic_id, boot_cpu);
 #endif
 }
 
 static void
 xen_pv_lapic_init(vm_paddr_t addr)
 {
 
 }
 
 static void
 xen_pv_lapic_setup(int boot)
 {
 
 }
 
 static void
 xen_pv_lapic_dump(const char *str)
 {
 
 	printf("cpu%d %s XEN PV LAPIC\n", PCPU_GET(cpuid), str);
 }
 
 static void
 xen_pv_lapic_disable(void)
 {
 
 }
 
 static void
 xen_pv_lapic_eoi(void)
 {
 
 	XEN_APIC_UNSUPPORTED;
 }
 
 static int
 xen_pv_lapic_id(void)
 {
 
 	return (PCPU_GET(apic_id));
 }
 
 static int
 xen_pv_lapic_intr_pending(u_int vector)
 {
 
 	XEN_APIC_UNSUPPORTED;
 	return (0);
 }
 
 static u_int
 xen_pv_apic_cpuid(u_int apic_id)
 {
 #ifdef SMP
 	return (apic_cpuids[apic_id]);
 #else
 	return (0);
 #endif
 }
 
 static u_int
 xen_pv_apic_alloc_vector(u_int apic_id, u_int irq)
 {
 
 	XEN_APIC_UNSUPPORTED;
 	return (0);
 }
 
 static u_int
 xen_pv_apic_alloc_vectors(u_int apic_id, u_int *irqs, u_int count, u_int align)
 {
 
 	XEN_APIC_UNSUPPORTED;
 	return (0);
 }
 
 static void
 xen_pv_apic_disable_vector(u_int apic_id, u_int vector)
 {
 
 	XEN_APIC_UNSUPPORTED;
 }
 
 static void
 xen_pv_apic_enable_vector(u_int apic_id, u_int vector)
 {
 
 	XEN_APIC_UNSUPPORTED;
 }
 
 static void
 xen_pv_apic_free_vector(u_int apic_id, u_int vector, u_int irq)
 {
 
 	XEN_APIC_UNSUPPORTED;
 }
 
 static void
 xen_pv_lapic_set_logical_id(u_int apic_id, u_int cluster, u_int cluster_id)
 {
 
 	XEN_APIC_UNSUPPORTED;
 }
 
 static int
 xen_pv_lapic_enable_pmc(void)
 {
 
 	XEN_APIC_UNSUPPORTED;
 	return (0);
 }
 
 static void
 xen_pv_lapic_disable_pmc(void)
 {
 
 	XEN_APIC_UNSUPPORTED;
 }
 
 static void
 xen_pv_lapic_reenable_pmc(void)
 {
 
 	XEN_APIC_UNSUPPORTED;
 }
 
 static void
 xen_pv_lapic_enable_cmc(void)
 {
 
 }
 
 #ifdef SMP
 static void
 xen_pv_lapic_ipi_raw(register_t icrlo, u_int dest)
 {
 
 	XEN_APIC_UNSUPPORTED;
 }
 
 static void
 xen_pv_lapic_ipi_vectored(u_int vector, int dest)
 {
 	xen_intr_handle_t *ipi_handle;
 	int ipi_idx, to_cpu, self;
 
 	ipi_idx = IPI_TO_IDX(vector);
 	if (ipi_idx >= nitems(xen_ipis))
 		panic("IPI out of range");
 
 	switch(dest) {
 	case APIC_IPI_DEST_SELF:
 		ipi_handle = DPCPU_GET(ipi_handle);
 		xen_intr_signal(ipi_handle[ipi_idx]);
 		break;
 	case APIC_IPI_DEST_ALL:
 		CPU_FOREACH(to_cpu) {
 			ipi_handle = DPCPU_ID_GET(to_cpu, ipi_handle);
 			xen_intr_signal(ipi_handle[ipi_idx]);
 		}
 		break;
 	case APIC_IPI_DEST_OTHERS:
 		self = PCPU_GET(cpuid);
 		CPU_FOREACH(to_cpu) {
 			if (to_cpu != self) {
 				ipi_handle = DPCPU_ID_GET(to_cpu, ipi_handle);
 				xen_intr_signal(ipi_handle[ipi_idx]);
 			}
 		}
 		break;
 	default:
 		to_cpu = apic_cpuid(dest);
 		ipi_handle = DPCPU_ID_GET(to_cpu, ipi_handle);
 		xen_intr_signal(ipi_handle[ipi_idx]);
 		break;
 	}
 }
 
 static int
 xen_pv_lapic_ipi_wait(int delay)
 {
 
 	XEN_APIC_UNSUPPORTED;
 	return (0);
 }
 
 static int
 xen_pv_lapic_ipi_alloc(inthand_t *ipifunc)
 {
 
 	XEN_APIC_UNSUPPORTED;
 	return (-1);
 }
 
 static void
 xen_pv_lapic_ipi_free(int vector)
 {
 
 	XEN_APIC_UNSUPPORTED;
 }
 #endif	/* SMP */
 
 static int
 xen_pv_lapic_set_lvt_mask(u_int apic_id, u_int lvt, u_char masked)
 {
 
 	XEN_APIC_UNSUPPORTED;
 	return (0);
 }
 
 static int
 xen_pv_lapic_set_lvt_mode(u_int apic_id, u_int lvt, uint32_t mode)
 {
 
 	XEN_APIC_UNSUPPORTED;
 	return (0);
 }
 
 static int
 xen_pv_lapic_set_lvt_polarity(u_int apic_id, u_int lvt, enum intr_polarity pol)
 {
 
 	XEN_APIC_UNSUPPORTED;
 	return (0);
 }
 
 static int
 xen_pv_lapic_set_lvt_triggermode(u_int apic_id, u_int lvt,
     enum intr_trigger trigger)
 {
 
 	XEN_APIC_UNSUPPORTED;
 	return (0);
 }
 
 /* Xen apic_ops implementation */
 struct apic_ops xen_apic_ops = {
 	.create			= xen_pv_lapic_create,
 	.init			= xen_pv_lapic_init,
 	.xapic_mode		= xen_pv_lapic_disable,
 	.setup			= xen_pv_lapic_setup,
 	.dump			= xen_pv_lapic_dump,
 	.disable		= xen_pv_lapic_disable,
 	.eoi			= xen_pv_lapic_eoi,
 	.id			= xen_pv_lapic_id,
 	.intr_pending		= xen_pv_lapic_intr_pending,
 	.set_logical_id		= xen_pv_lapic_set_logical_id,
 	.cpuid			= xen_pv_apic_cpuid,
 	.alloc_vector		= xen_pv_apic_alloc_vector,
 	.alloc_vectors		= xen_pv_apic_alloc_vectors,
 	.enable_vector		= xen_pv_apic_enable_vector,
 	.disable_vector		= xen_pv_apic_disable_vector,
 	.free_vector		= xen_pv_apic_free_vector,
 	.enable_pmc		= xen_pv_lapic_enable_pmc,
 	.disable_pmc		= xen_pv_lapic_disable_pmc,
 	.reenable_pmc		= xen_pv_lapic_reenable_pmc,
 	.enable_cmc		= xen_pv_lapic_enable_cmc,
 #ifdef SMP
 	.ipi_raw		= xen_pv_lapic_ipi_raw,
 	.ipi_vectored		= xen_pv_lapic_ipi_vectored,
 	.ipi_wait		= xen_pv_lapic_ipi_wait,
 	.ipi_alloc		= xen_pv_lapic_ipi_alloc,
 	.ipi_free		= xen_pv_lapic_ipi_free,
 #endif
 	.set_lvt_mask		= xen_pv_lapic_set_lvt_mask,
 	.set_lvt_mode		= xen_pv_lapic_set_lvt_mode,
 	.set_lvt_polarity	= xen_pv_lapic_set_lvt_polarity,
 	.set_lvt_triggermode	= xen_pv_lapic_set_lvt_triggermode,
 };
 
 #ifdef SMP
 /*---------------------------- XEN PV IPI Handlers ---------------------------*/
 /*
  * These are C clones of the ASM functions found in apic_vector.
  */
 static int
 xen_ipi_bitmap_handler(void *arg)
 {
 	struct trapframe *frame;
 
 	frame = arg;
 	ipi_bitmap_handler(*frame);
 	return (FILTER_HANDLED);
 }
 
 static int
 xen_smp_rendezvous_action(void *arg)
 {
 #ifdef COUNT_IPIS
 	(*ipi_rendezvous_counts[PCPU_GET(cpuid)])++;
 #endif /* COUNT_IPIS */
 
 	smp_rendezvous_action();
 	return (FILTER_HANDLED);
 }
 
 static int
 xen_invltlb(void *arg)
 {
 
 	invltlb_handler();
 	return (FILTER_HANDLED);
 }
 
 #ifdef __amd64__
 static int
-xen_invltlb_pcid(void *arg)
+xen_invltlb_invpcid(void *arg)
 {
 
-	invltlb_pcid_handler();
+	invltlb_invpcid_handler();
 	return (FILTER_HANDLED);
 }
-#endif
 
 static int
-xen_invlpg(void *arg)
+xen_invltlb_pcid(void *arg)
 {
 
-	invlpg_handler();
+	invltlb_pcid_handler();
 	return (FILTER_HANDLED);
 }
+#endif
 
-#ifdef __amd64__
 static int
-xen_invlpg_pcid(void *arg)
+xen_invlpg(void *arg)
 {
 
-	invlpg_pcid_handler();
+	invlpg_handler();
 	return (FILTER_HANDLED);
 }
-#endif
 
 static int
 xen_invlrng(void *arg)
 {
 
 	invlrng_handler();
 	return (FILTER_HANDLED);
 }
 
 static int
 xen_invlcache(void *arg)
 {
 
 	invlcache_handler();
 	return (FILTER_HANDLED);
 }
 
 static int
 xen_cpustop_handler(void *arg)
 {
 
 	cpustop_handler();
 	return (FILTER_HANDLED);
 }
 
 static int
 xen_cpususpend_handler(void *arg)
 {
 
 	cpususpend_handler();
 	return (FILTER_HANDLED);
 }
 
 static int
 xen_cpustophard_handler(void *arg)
 {
 
 	ipi_nmi_handler();
 	return (FILTER_HANDLED);
 }
 
 /*----------------------------- XEN PV IPI setup -----------------------------*/
 /*
  * Those functions are provided outside of the Xen PV APIC implementation
  * so PVHVM guests can also use PV IPIs without having an actual Xen PV APIC,
  * because on PVHVM there's an emulated LAPIC provided by Xen.
  */
 static void
 xen_cpu_ipi_init(int cpu)
 {
 	xen_intr_handle_t *ipi_handle;
 	const struct xen_ipi_handler *ipi;
 	device_t dev;
 	int idx, rc;
 
 	ipi_handle = DPCPU_ID_GET(cpu, ipi_handle);
 	dev = pcpu_find(cpu)->pc_device;
 	KASSERT((dev != NULL), ("NULL pcpu device_t"));
 
 	for (ipi = xen_ipis, idx = 0; idx < nitems(xen_ipis); ipi++, idx++) {
 
 		if (ipi->filter == NULL) {
 			ipi_handle[idx] = NULL;
 			continue;
 		}
 
 		rc = xen_intr_alloc_and_bind_ipi(dev, cpu, ipi->filter,
 		    INTR_TYPE_TTY, &ipi_handle[idx]);
 		if (rc != 0)
 			panic("Unable to allocate a XEN IPI port");
 		xen_intr_describe(ipi_handle[idx], "%s", ipi->description);
 	}
 }
 
 static void
 xen_setup_cpus(void)
 {
 	int i;
 
 	if (!xen_vector_callback_enabled)
 		return;
 
 #ifdef __amd64__
 	if (pmap_pcid_enabled) {
-		xen_ipis[IPI_TO_IDX(IPI_INVLTLB)].filter = xen_invltlb_pcid;
-		xen_ipis[IPI_TO_IDX(IPI_INVLPG)].filter = xen_invlpg_pcid;
+		xen_ipis[IPI_TO_IDX(IPI_INVLTLB)].filter = invpcid_works ?
+		    xen_invltlb_invpcid : xen_invltlb_pcid;
 	}
 #endif
 	CPU_FOREACH(i)
 		xen_cpu_ipi_init(i);
 
 	/* Set the xen pv ipi ops to replace the native ones */
 	if (xen_hvm_domain())
 		apic_ops.ipi_vectored = xen_pv_lapic_ipi_vectored;
 }
 
 /* We need to setup IPIs before APs are started */
 SYSINIT(xen_setup_cpus, SI_SUB_SMP-1, SI_ORDER_FIRST, xen_setup_cpus, NULL);
 #endif /* SMP */
Index: projects/release-arm-redux/sys
===================================================================
--- projects/release-arm-redux/sys	(revision 282691)
+++ projects/release-arm-redux/sys	(revision 282692)

Property changes on: projects/release-arm-redux/sys
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /head/sys:r282673-282691
Index: projects/release-arm-redux/usr.sbin/pmcstat/pmcstat_log.c
===================================================================
--- projects/release-arm-redux/usr.sbin/pmcstat/pmcstat_log.c	(revision 282691)
+++ projects/release-arm-redux/usr.sbin/pmcstat/pmcstat_log.c	(revision 282692)
@@ -1,2224 +1,2226 @@
 /*-
  * Copyright (c) 2005-2007, Joseph Koshy
  * Copyright (c) 2007 The FreeBSD Foundation
  * All rights reserved.
  *
  * Portions of this software were developed by A. Joseph Koshy under
  * sponsorship from the FreeBSD Foundation and Google, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 /*
  * Transform a hwpmc(4) log into human readable form, and into
  * gprof(1) compatible profiles.
  */
 
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD$");
 
 #include <sys/param.h>
 #include <sys/endian.h>
 #include <sys/cpuset.h>
 #include <sys/gmon.h>
 #include <sys/imgact_aout.h>
 #include <sys/imgact_elf.h>
 #include <sys/mman.h>
 #include <sys/pmc.h>
 #include <sys/queue.h>
 #include <sys/socket.h>
 #include <sys/stat.h>
 #include <sys/wait.h>
 
 #include <netinet/in.h>
 
 #include <assert.h>
 #include <curses.h>
 #include <err.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <gelf.h>
 #include <libgen.h>
 #include <limits.h>
 #include <netdb.h>
 #include <pmc.h>
 #include <pmclog.h>
 #include <sysexits.h>
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 
 #include "pmcstat.h"
 #include "pmcstat_log.h"
 #include "pmcstat_top.h"
 
 #define	PMCSTAT_ALLOCATE		1
 
 /*
  * PUBLIC INTERFACES
  *
  * pmcstat_initialize_logging()	initialize this module, called first
  * pmcstat_shutdown_logging()		orderly shutdown, called last
  * pmcstat_open_log()			open an eventlog for processing
  * pmcstat_process_log()		print/convert an event log
  * pmcstat_display_log()		top mode display for the log
  * pmcstat_close_log()			finish processing an event log
  *
  * IMPLEMENTATION NOTES
  *
  * We correlate each 'callchain' or 'sample' entry seen in the event
  * log back to an executable object in the system. Executable objects
  * include:
  * 	- program executables,
  *	- shared libraries loaded by the runtime loader,
  *	- dlopen()'ed objects loaded by the program,
  *	- the runtime loader itself,
  *	- the kernel and kernel modules.
  *
  * Each process that we know about is treated as a set of regions that
  * map to executable objects.  Processes are described by
  * 'pmcstat_process' structures.  Executable objects are tracked by
  * 'pmcstat_image' structures.  The kernel and kernel modules are
  * common to all processes (they reside at the same virtual addresses
  * for all processes).  Individual processes can have their text
  * segments and shared libraries loaded at process-specific locations.
  *
  * A given executable object can be in use by multiple processes
  * (e.g., libc.so) and loaded at a different address in each.
  * pmcstat_pcmap structures track per-image mappings.
  *
  * The sample log could have samples from multiple PMCs; we
  * generate one 'gmon.out' profile per PMC.
  *
  * IMPLEMENTATION OF GMON OUTPUT
  *
  * Each executable object gets one 'gmon.out' profile, per PMC in
  * use.  Creation of 'gmon.out' profiles is done lazily.  The
  * 'gmon.out' profiles generated for a given sampling PMC are
  * aggregates of all the samples for that particular executable
  * object.
  *
  * IMPLEMENTATION OF SYSTEM-WIDE CALLGRAPH OUTPUT
  *
  * Each active pmcid has its own callgraph structure, described by a
  * 'struct pmcstat_callgraph'.  Given a process id and a list of pc
  * values, we map each pc value to a tuple (image, symbol), where
  * 'image' denotes an executable object and 'symbol' is the closest
  * symbol that precedes the pc value.  Each pc value in the list is
  * also given a 'rank' that reflects its depth in the call stack.
  */
 
 struct pmcstat_pmcs pmcstat_pmcs = LIST_HEAD_INITIALIZER(pmcstat_pmcs);
 
 /*
  * All image descriptors are kept in a hash table.
  */
 struct pmcstat_image_hash_list pmcstat_image_hash[PMCSTAT_NHASH];
 
 /*
  * All process descriptors are kept in a hash table.
  */
 struct pmcstat_process_hash_list pmcstat_process_hash[PMCSTAT_NHASH];
 
 struct pmcstat_stats pmcstat_stats; /* statistics */
 static int ps_samples_period; /* samples count between top refresh. */
 
 struct pmcstat_process *pmcstat_kernproc; /* kernel 'process' */
 
 #include "pmcpl_gprof.h"
 #include "pmcpl_callgraph.h"
 #include "pmcpl_annotate.h"
 #include "pmcpl_annotate_cg.h"
 #include "pmcpl_calltree.h"
 
 static struct pmc_plugins  {
 	const char 	*pl_name;	/* name */
 
 	/* configure */
 	int (*pl_configure)(char *opt);
 
 	/* init and shutdown */
 	int (*pl_init)(void);
 	void (*pl_shutdown)(FILE *mf);
 
 	/* sample processing */
 	void (*pl_process)(struct pmcstat_process *pp,
 	    struct pmcstat_pmcrecord *pmcr, uint32_t nsamples,
 	    uintfptr_t *cc, int usermode, uint32_t cpu);
 
 	/* image */
 	void (*pl_initimage)(struct pmcstat_image *pi);
 	void (*pl_shutdownimage)(struct pmcstat_image *pi);
 
 	/* pmc */
 	void (*pl_newpmc)(pmcstat_interned_string ps,
 		struct pmcstat_pmcrecord *pr);
 	
 	/* top display */
 	void (*pl_topdisplay)(void);
 
 	/* top keypress */
 	int (*pl_topkeypress)(int c, WINDOW *w);
 
 } plugins[] = {
 	{
 		.pl_name		= "none",
 	},
 	{
 		.pl_name		= "callgraph",
 		.pl_init		= pmcpl_cg_init,
 		.pl_shutdown		= pmcpl_cg_shutdown,
 		.pl_process		= pmcpl_cg_process,
 		.pl_topkeypress		= pmcpl_cg_topkeypress,
 		.pl_topdisplay		= pmcpl_cg_topdisplay
 	},
 	{
 		.pl_name		= "gprof",
 		.pl_shutdown		= pmcpl_gmon_shutdown,
 		.pl_process		= pmcpl_gmon_process,
 		.pl_initimage		= pmcpl_gmon_initimage,
 		.pl_shutdownimage	= pmcpl_gmon_shutdownimage,
 		.pl_newpmc		= pmcpl_gmon_newpmc
 	},
 	{
 		.pl_name		= "annotate",
 		.pl_process		= pmcpl_annotate_process
 	},
 	{
 		.pl_name		= "calltree",
 		.pl_configure		= pmcpl_ct_configure,
 		.pl_init		= pmcpl_ct_init,
 		.pl_shutdown		= pmcpl_ct_shutdown,
 		.pl_process		= pmcpl_ct_process,
 		.pl_topkeypress		= pmcpl_ct_topkeypress,
 		.pl_topdisplay		= pmcpl_ct_topdisplay
 	},
 	{
 		.pl_name		= "annotate_cg",
 		.pl_process		= pmcpl_annotate_cg_process
 	},
 
 	{
 		.pl_name		= NULL
 	}
 };
 
 static int pmcstat_mergepmc;
 
 int pmcstat_pmcinfilter = 0; /* PMC filter for top mode. */
 float pmcstat_threshold = 0.5; /* Cost filter for top mode. */
 
 /*
  * Prototypes
  */
 
 static struct pmcstat_image *pmcstat_image_from_path(pmcstat_interned_string
     _path, int _iskernelmodule);
 static void pmcstat_image_get_aout_params(struct pmcstat_image *_image);
 static void pmcstat_image_get_elf_params(struct pmcstat_image *_image);
 static void	pmcstat_image_link(struct pmcstat_process *_pp,
     struct pmcstat_image *_i, uintfptr_t _lpc);
 
 static void	pmcstat_pmcid_add(pmc_id_t _pmcid,
     pmcstat_interned_string _name);
 
 static void	pmcstat_process_aout_exec(struct pmcstat_process *_pp,
     struct pmcstat_image *_image, uintfptr_t _entryaddr);
 static void	pmcstat_process_elf_exec(struct pmcstat_process *_pp,
     struct pmcstat_image *_image, uintfptr_t _entryaddr);
 static void	pmcstat_process_exec(struct pmcstat_process *_pp,
     pmcstat_interned_string _path, uintfptr_t _entryaddr);
 static struct pmcstat_process *pmcstat_process_lookup(pid_t _pid,
     int _allocate);
 static int	pmcstat_string_compute_hash(const char *_string);
 static void pmcstat_string_initialize(void);
 static int	pmcstat_string_lookup_hash(pmcstat_interned_string _is);
 static void pmcstat_string_shutdown(void);
 static void pmcstat_stats_reset(int _reset_global);
 
 /*
  * A simple implementation of interned strings.  Each interned string
  * is assigned a unique address, so that subsequent string compares
  * can be done by a simple pointer comparison instead of using
  * strcmp().  This speeds up hash table lookups and saves memory if
  * duplicate strings are the norm.
  */
 struct pmcstat_string {
 	LIST_ENTRY(pmcstat_string)	ps_next;	/* hash link */
 	int		ps_len;
 	int		ps_hash;
 	char		*ps_string;
 };
 
 static LIST_HEAD(,pmcstat_string)	pmcstat_string_hash[PMCSTAT_NHASH];
 
 /*
  * PMC count.
  */
 int pmcstat_npmcs;
 
 /*
  * PMC Top mode pause state.
  */
 static int pmcstat_pause;
 
 static void
 pmcstat_stats_reset(int reset_global)
 {
 	struct pmcstat_pmcrecord *pr;
 
 	/* Flush PMCs stats. */
 	LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) {
 		pr->pr_samples = 0;
 		pr->pr_dubious_frames = 0;
 	}
 	ps_samples_period = 0;
 
 	/* Flush global stats. */
 	if (reset_global)
 		bzero(&pmcstat_stats, sizeof(struct pmcstat_stats));
 }
 
 /*
  * Compute a 'hash' value for a string.
  */
 
 static int
 pmcstat_string_compute_hash(const char *s)
 {
 	unsigned hash;
 
 	for (hash = 2166136261; *s; s++)
 		hash = (hash ^ *s) * 16777619;
 
 	return (hash & PMCSTAT_HASH_MASK);
 }
 
 /*
  * Intern a copy of string 's', and return a pointer to the
  * interned structure.
  */
 
 pmcstat_interned_string
 pmcstat_string_intern(const char *s)
 {
 	struct pmcstat_string *ps;
 	const struct pmcstat_string *cps;
 	int hash, len;
 
 	if ((cps = pmcstat_string_lookup(s)) != NULL)
 		return (cps);
 
 	hash = pmcstat_string_compute_hash(s);
 	len  = strlen(s);
 
 	if ((ps = malloc(sizeof(*ps))) == NULL)
 		err(EX_OSERR, "ERROR: Could not intern string");
 	ps->ps_len = len;
 	ps->ps_hash = hash;
 	ps->ps_string = strdup(s);
 	LIST_INSERT_HEAD(&pmcstat_string_hash[hash], ps, ps_next);
 	return ((pmcstat_interned_string) ps);
 }
 
 const char *
 pmcstat_string_unintern(pmcstat_interned_string str)
 {
 	const char *s;
 
 	s = ((const struct pmcstat_string *) str)->ps_string;
 	return (s);
 }
 
 pmcstat_interned_string
 pmcstat_string_lookup(const char *s)
 {
 	struct pmcstat_string *ps;
 	int hash, len;
 
 	hash = pmcstat_string_compute_hash(s);
 	len = strlen(s);
 
 	LIST_FOREACH(ps, &pmcstat_string_hash[hash], ps_next)
 	    if (ps->ps_len == len && ps->ps_hash == hash &&
 		strcmp(ps->ps_string, s) == 0)
 		    return (ps);
 	return (NULL);
 }
 
 static int
 pmcstat_string_lookup_hash(pmcstat_interned_string s)
 {
 	const struct pmcstat_string *ps;
 
 	ps = (const struct pmcstat_string *) s;
 	return (ps->ps_hash);
 }
 
 /*
  * Initialize the string interning facility.
  */
 
 static void
 pmcstat_string_initialize(void)
 {
 	int i;
 
 	for (i = 0; i < PMCSTAT_NHASH; i++)
 		LIST_INIT(&pmcstat_string_hash[i]);
 }
 
 /*
  * Destroy the string table, free'ing up space.
  */
 
 static void
 pmcstat_string_shutdown(void)
 {
 	int i;
 	struct pmcstat_string *ps, *pstmp;
 
 	for (i = 0; i < PMCSTAT_NHASH; i++)
 		LIST_FOREACH_SAFE(ps, &pmcstat_string_hash[i], ps_next,
 		    pstmp) {
 			LIST_REMOVE(ps, ps_next);
 			free(ps->ps_string);
 			free(ps);
 		}
 }
 
 /*
  * Determine whether a given executable image is an A.OUT object, and
  * if so, fill in its parameters from the text file.
  * Sets image->pi_type.
  */
 
 static void
 pmcstat_image_get_aout_params(struct pmcstat_image *image)
 {
 	int fd;
 	ssize_t nbytes;
 	struct exec ex;
 	const char *path;
 	char buffer[PATH_MAX];
 
 	path = pmcstat_string_unintern(image->pi_execpath);
 	assert(path != NULL);
 
 	if (image->pi_iskernelmodule)
 		errx(EX_SOFTWARE,
 		    "ERROR: a.out kernel modules are unsupported \"%s\"", path);
 
 	(void) snprintf(buffer, sizeof(buffer), "%s%s",
 	    args.pa_fsroot, path);
 
 	if ((fd = open(buffer, O_RDONLY, 0)) < 0 ||
 	    (nbytes = read(fd, &ex, sizeof(ex))) < 0) {
 		if (args.pa_verbosity >= 2)
 			warn("WARNING: Cannot determine type of \"%s\"",
 			    path);
 		image->pi_type = PMCSTAT_IMAGE_INDETERMINABLE;
 		if (fd != -1)
 			(void) close(fd);
 		return;
 	}
 
 	(void) close(fd);
 
 	if ((unsigned) nbytes != sizeof(ex) ||
 	    N_BADMAG(ex))
 		return;
 
 	image->pi_type = PMCSTAT_IMAGE_AOUT;
 
 	/* TODO: the rest of a.out processing */
 
 	return;
 }
 
 /*
  * Helper function.
  */
 
 static int
 pmcstat_symbol_compare(const void *a, const void *b)
 {
 	const struct pmcstat_symbol *sym1, *sym2;
 
 	sym1 = (const struct pmcstat_symbol *) a;
 	sym2 = (const struct pmcstat_symbol *) b;
 
 	if (sym1->ps_end <= sym2->ps_start)
 		return (-1);
 	if (sym1->ps_start >= sym2->ps_end)
 		return (1);
 	return (0);
 }
 
 /*
  * Map an address to a symbol in an image.
  */
 
 struct pmcstat_symbol *
 pmcstat_symbol_search(struct pmcstat_image *image, uintfptr_t addr)
 {
 	struct pmcstat_symbol sym;
 
 	if (image->pi_symbols == NULL)
 		return (NULL);
 
 	sym.ps_name  = NULL;
 	sym.ps_start = addr;
 	sym.ps_end   = addr + 1;
 
 	return (bsearch((void *) &sym, image->pi_symbols,
 		    image->pi_symcount, sizeof(struct pmcstat_symbol),
 		    pmcstat_symbol_compare));
 }
 
 /*
  * Add the list of symbols in the given section to the list associated
  * with the object.
  */
 static void
 pmcstat_image_add_symbols(struct pmcstat_image *image, Elf *e,
     Elf_Scn *scn, GElf_Shdr *sh)
 {
 	int firsttime;
 	size_t n, newsyms, nshsyms, nfuncsyms;
 	struct pmcstat_symbol *symptr;
 	char *fnname;
 	GElf_Sym sym;
 	Elf_Data *data;
 
 	if ((data = elf_getdata(scn, NULL)) == NULL)
 		return;
 
 	/*
 	 * Determine the number of functions named in this
 	 * section.
 	 */
 
 	nshsyms = sh->sh_size / sh->sh_entsize;
 	for (n = nfuncsyms = 0; n < nshsyms; n++) {
 		if (gelf_getsym(data, (int) n, &sym) != &sym)
 			return;
 		if (GELF_ST_TYPE(sym.st_info) == STT_FUNC)
 			nfuncsyms++;
 	}
 
 	if (nfuncsyms == 0)
 		return;
 
 	/*
 	 * Allocate space for the new entries.
 	 */
 	firsttime = image->pi_symbols == NULL;
 	symptr = realloc(image->pi_symbols,
 	    sizeof(*symptr) * (image->pi_symcount + nfuncsyms));
 	if (symptr == image->pi_symbols) /* realloc() failed. */
 		return;
 	image->pi_symbols = symptr;
 
 	/*
 	 * Append new symbols to the end of the current table.
 	 */
 	symptr += image->pi_symcount;
 
 	for (n = newsyms = 0; n < nshsyms; n++) {
 		if (gelf_getsym(data, (int) n, &sym) != &sym)
 			return;
 		if (GELF_ST_TYPE(sym.st_info) != STT_FUNC)
 			continue;
 		if (sym.st_shndx == STN_UNDEF)
 			continue;
 
 		if (!firsttime && pmcstat_symbol_search(image, sym.st_value))
 			continue; /* We've seen this symbol already. */
 
 		if ((fnname = elf_strptr(e, sh->sh_link, sym.st_name))
 		    == NULL)
 			continue;
 #ifdef __arm__
 		/* Remove spurious ARM function name. */
 		if (fnname[0] == '$' &&
 		    (fnname[1] == 'a' || fnname[1] == 't' ||
 		    fnname[1] == 'd') &&
 		    fnname[2] == '\0')
 			continue;
 #endif
 
 		symptr->ps_name  = pmcstat_string_intern(fnname);
 		symptr->ps_start = sym.st_value - image->pi_vaddr;
 		symptr->ps_end   = symptr->ps_start + sym.st_size;
 		symptr++;
 
 		newsyms++;
 	}
 
 	image->pi_symcount += newsyms;
 	if (image->pi_symcount == 0)
 		return;
 
 	assert(newsyms <= nfuncsyms);
 
 	/*
 	 * Return space to the system if there were duplicates.
 	 */
 	if (newsyms < nfuncsyms)
 		image->pi_symbols = realloc(image->pi_symbols,
 		    sizeof(*symptr) * image->pi_symcount);
 
 	/*
 	 * Keep the list of symbols sorted.
 	 */
 	qsort(image->pi_symbols, image->pi_symcount, sizeof(*symptr),
 	    pmcstat_symbol_compare);
 
 	/*
 	 * Deal with function symbols that have a size of 'zero' by
 	 * making them extend to the next higher address.  These
 	 * symbols are usually defined in assembly code.
 	 */
 	for (symptr = image->pi_symbols;
 	     symptr < image->pi_symbols + (image->pi_symcount - 1);
 	     symptr++)
 		if (symptr->ps_start == symptr->ps_end)
 			symptr->ps_end = (symptr+1)->ps_start;
 }
 
 /*
  * Examine an ELF file to determine the size of its text segment.
  * Sets image->pi_type if anything conclusive can be determined about
  * this image.
  */
 
 static void
 pmcstat_image_get_elf_params(struct pmcstat_image *image)
 {
 	int fd;
 	size_t i, nph, nsh;
 	const char *path, *elfbase;
 	char *p, *endp;
 	uintfptr_t minva, maxva;
 	Elf *e;
 	Elf_Scn *scn;
 	GElf_Ehdr eh;
 	GElf_Phdr ph;
 	GElf_Shdr sh;
 	enum pmcstat_image_type image_type;
 	char buffer[PATH_MAX];
 
 	assert(image->pi_type == PMCSTAT_IMAGE_UNKNOWN);
 
 	image->pi_start = minva = ~(uintfptr_t) 0;
 	image->pi_end = maxva = (uintfptr_t) 0;
 	image->pi_type = image_type = PMCSTAT_IMAGE_INDETERMINABLE;
 	image->pi_isdynamic = 0;
 	image->pi_dynlinkerpath = NULL;
 	image->pi_vaddr = 0;
 
 	path = pmcstat_string_unintern(image->pi_execpath);
 	assert(path != NULL);
 
 	/*
 	 * Look for kernel modules under FSROOT/KERNELPATH/NAME,
 	 * and user mode executable objects under FSROOT/PATHNAME.
 	 */
 	if (image->pi_iskernelmodule)
 		(void) snprintf(buffer, sizeof(buffer), "%s%s/%s",
 		    args.pa_fsroot, args.pa_kernel, path);
 	else
 		(void) snprintf(buffer, sizeof(buffer), "%s%s",
 		    args.pa_fsroot, path);
 
 	e = NULL;
 	if ((fd = open(buffer, O_RDONLY, 0)) < 0 ||
 	    (e = elf_begin(fd, ELF_C_READ, NULL)) == NULL ||
 	    (elf_kind(e) != ELF_K_ELF)) {
 		if (args.pa_verbosity >= 2)
 			warnx("WARNING: Cannot determine the type of \"%s\".",
 			    buffer);
 		goto done;
 	}
 
 	if (gelf_getehdr(e, &eh) != &eh) {
 		warnx(
 		    "WARNING: Cannot retrieve the ELF Header for \"%s\": %s.",
 		    buffer, elf_errmsg(-1));
 		goto done;
 	}
 
 	if (eh.e_type != ET_EXEC && eh.e_type != ET_DYN &&
 	    !(image->pi_iskernelmodule && eh.e_type == ET_REL)) {
 		warnx("WARNING: \"%s\" is of an unsupported ELF type.",
 		    buffer);
 		goto done;
 	}
 
 	image_type = eh.e_ident[EI_CLASS] == ELFCLASS32 ?
 	    PMCSTAT_IMAGE_ELF32 : PMCSTAT_IMAGE_ELF64;
 
 	/*
 	 * Determine the virtual address where an executable would be
 	 * loaded.  Additionally, for dynamically linked executables,
 	 * save the pathname to the runtime linker.
 	 */
 	if (eh.e_type == ET_EXEC) {
 		if (elf_getphnum(e, &nph) == 0) {
 			warnx(
 "WARNING: Could not determine the number of program headers in \"%s\": %s.",
 			    buffer,
 			    elf_errmsg(-1));
 			goto done;
 		}
 		for (i = 0; i < eh.e_phnum; i++) {
 			if (gelf_getphdr(e, i, &ph) != &ph) {
 				warnx(
 "WARNING: Retrieval of PHDR entry #%ju in \"%s\" failed: %s.",
 				    (uintmax_t) i, buffer, elf_errmsg(-1));
 				goto done;
 			}
 			switch (ph.p_type) {
 			case PT_DYNAMIC:
 				image->pi_isdynamic = 1;
 				break;
 			case PT_INTERP:
 				if ((elfbase = elf_rawfile(e, NULL)) == NULL) {
 					warnx(
 "WARNING: Cannot retrieve the interpreter for \"%s\": %s.",
 					    buffer, elf_errmsg(-1));
 					goto done;
 				}
 				image->pi_dynlinkerpath =
 				    pmcstat_string_intern(elfbase +
 				        ph.p_offset);
 				break;
 			case PT_LOAD:
 				if ((ph.p_offset & (-ph.p_align)) == 0)
 					image->pi_vaddr = ph.p_vaddr & (-ph.p_align);
 				break;
 			}
 		}
 	}
 
 	/*
 	 * Get the min and max VA associated with this ELF object.
 	 */
 	if (elf_getshnum(e, &nsh) == 0) {
 		warnx(
 "WARNING: Could not determine the number of sections for \"%s\": %s.",
 		    buffer, elf_errmsg(-1));
 		goto done;
 	}
 
 	for (i = 0; i < nsh; i++) {
 		if ((scn = elf_getscn(e, i)) == NULL ||
 		    gelf_getshdr(scn, &sh) != &sh) {
 			warnx(
 "WARNING: Could not retrieve section header #%ju in \"%s\": %s.",
 			    (uintmax_t) i, buffer, elf_errmsg(-1));
 			goto done;
 		}
 		if (sh.sh_flags & SHF_EXECINSTR) {
 			minva = min(minva, sh.sh_addr);
 			maxva = max(maxva, sh.sh_addr + sh.sh_size);
 		}
 		if (sh.sh_type == SHT_SYMTAB || sh.sh_type == SHT_DYNSYM)
 			pmcstat_image_add_symbols(image, e, scn, &sh);
 	}
 
 	image->pi_start = minva;
 	image->pi_end   = maxva;
 	image->pi_type  = image_type;
 	image->pi_fullpath = pmcstat_string_intern(buffer);
 
 	/* Build display name
 	 */
 	endp = buffer;
 	for (p = buffer; *p; p++)
 		if (*p == '/')
 			endp = p+1;
 	image->pi_name = pmcstat_string_intern(endp);
 
  done:
 	(void) elf_end(e);
 	if (fd >= 0)
 		(void) close(fd);
 	return;
 }
 
 /*
  * Given an image descriptor, determine whether it is an ELF, or AOUT.
  * If no handler claims the image, set its type to 'INDETERMINABLE'.
  */
 
 void
 pmcstat_image_determine_type(struct pmcstat_image *image)
 {
 	assert(image->pi_type == PMCSTAT_IMAGE_UNKNOWN);
 
 	/* Try each kind of handler in turn */
 	if (image->pi_type == PMCSTAT_IMAGE_UNKNOWN)
 		pmcstat_image_get_elf_params(image);
 	if (image->pi_type == PMCSTAT_IMAGE_UNKNOWN)
 		pmcstat_image_get_aout_params(image);
 
 	/*
 	 * Otherwise, remember that we tried to determine
 	 * the object's type and failed.
 	 */
 	if (image->pi_type == PMCSTAT_IMAGE_UNKNOWN)
 		image->pi_type = PMCSTAT_IMAGE_INDETERMINABLE;
 }
 
 /*
  * Locate an image descriptor given an interned path, adding a fresh
  * descriptor to the cache if necessary.  This function also finds a
  * suitable name for this image's sample file.
  *
  * We defer filling in the file format specific parts of the image
  * structure till the time we actually see a sample that would fall
  * into this image.
  */
 
 static struct pmcstat_image *
 pmcstat_image_from_path(pmcstat_interned_string internedpath,
     int iskernelmodule)
 {
 	int hash;
 	struct pmcstat_image *pi;
 
 	hash = pmcstat_string_lookup_hash(internedpath);
 
 	/* First, look for an existing entry. */
 	LIST_FOREACH(pi, &pmcstat_image_hash[hash], pi_next)
 	    if (pi->pi_execpath == internedpath &&
 		  pi->pi_iskernelmodule == iskernelmodule)
 		    return (pi);
 
 	/*
 	 * Allocate a new entry and place it at the head of the hash
 	 * and LRU lists.
 	 */
 	pi = malloc(sizeof(*pi));
 	if (pi == NULL)
 		return (NULL);
 
 	pi->pi_type = PMCSTAT_IMAGE_UNKNOWN;
 	pi->pi_execpath = internedpath;
 	pi->pi_start = ~0;
 	pi->pi_end = 0;
 	pi->pi_entry = 0;
 	pi->pi_vaddr = 0;
 	pi->pi_isdynamic = 0;
 	pi->pi_iskernelmodule = iskernelmodule;
 	pi->pi_dynlinkerpath = NULL;
 	pi->pi_symbols = NULL;
 	pi->pi_symcount = 0;
 	pi->pi_addr2line = NULL;
 
 	if (plugins[args.pa_pplugin].pl_initimage != NULL)
 		plugins[args.pa_pplugin].pl_initimage(pi);
 	if (plugins[args.pa_plugin].pl_initimage != NULL)
 		plugins[args.pa_plugin].pl_initimage(pi);
 
 	LIST_INSERT_HEAD(&pmcstat_image_hash[hash], pi, pi_next);
 
 	return (pi);
 }
 
 /*
  * Record the fact that PC values from 'start' to 'end' come from
  * image 'image'.
  */
 
 static void
 pmcstat_image_link(struct pmcstat_process *pp, struct pmcstat_image *image,
     uintfptr_t start)
 {
 	struct pmcstat_pcmap *pcm, *pcmnew;
 	uintfptr_t offset;
 
 	assert(image->pi_type != PMCSTAT_IMAGE_UNKNOWN &&
 	    image->pi_type != PMCSTAT_IMAGE_INDETERMINABLE);
 
 	if ((pcmnew = malloc(sizeof(*pcmnew))) == NULL)
 		err(EX_OSERR, "ERROR: Cannot create a map entry");
 
 	/*
 	 * Adjust the map entry to only cover the text portion
 	 * of the object.
 	 */
 
 	offset = start - image->pi_vaddr;
 	pcmnew->ppm_lowpc  = image->pi_start + offset;
 	pcmnew->ppm_highpc = image->pi_end + offset;
 	pcmnew->ppm_image  = image;
 
 	assert(pcmnew->ppm_lowpc < pcmnew->ppm_highpc);
 
 	/* Overlapped mmap()'s are assumed to never occur. */
 	TAILQ_FOREACH(pcm, &pp->pp_map, ppm_next)
 	    if (pcm->ppm_lowpc >= pcmnew->ppm_highpc)
 		    break;
 
 	if (pcm == NULL)
 		TAILQ_INSERT_TAIL(&pp->pp_map, pcmnew, ppm_next);
 	else
 		TAILQ_INSERT_BEFORE(pcm, pcmnew, ppm_next);
 }
 
 /*
  * Unmap images in the range [start..end) associated with process
  * 'pp'.
  */
 
 static void
 pmcstat_image_unmap(struct pmcstat_process *pp, uintfptr_t start,
     uintfptr_t end)
 {
 	struct pmcstat_pcmap *pcm, *pcmtmp, *pcmnew;
 
 	assert(pp != NULL);
 	assert(start < end);
 
 	/*
 	 * Cases:
 	 * - we could have the range completely in the middle of an
 	 *   existing pcmap; in this case we have to split the pcmap
 	 *   structure into two (i.e., generate a 'hole').
 	 * - we could have the range covering multiple pcmaps; these
 	 *   will have to be removed.
 	 * - we could have either 'start' or 'end' falling in the
 	 *   middle of a pcmap; in this case shorten the entry.
 	 */
 	TAILQ_FOREACH_SAFE(pcm, &pp->pp_map, ppm_next, pcmtmp) {
 		assert(pcm->ppm_lowpc < pcm->ppm_highpc);
 		if (pcm->ppm_highpc <= start)
 			continue;
 		if (pcm->ppm_lowpc >= end)
 			return;
 		if (pcm->ppm_lowpc >= start && pcm->ppm_highpc <= end) {
 			/*
 			 * The current pcmap is completely inside the
 			 * unmapped range: remove it entirely.
 			 */
 			TAILQ_REMOVE(&pp->pp_map, pcm, ppm_next);
 			free(pcm);
 		} else if (pcm->ppm_lowpc < start && pcm->ppm_highpc > end) {
 			/*
 			 * Split this pcmap into two; curtail the
 			 * current map to end at [start-1], and start
 			 * the new one at [end].
 			 */
 			if ((pcmnew = malloc(sizeof(*pcmnew))) == NULL)
 				err(EX_OSERR,
 				    "ERROR: Cannot split a map entry");
 
 			pcmnew->ppm_image = pcm->ppm_image;
 
 			pcmnew->ppm_lowpc = end;
 			pcmnew->ppm_highpc = pcm->ppm_highpc;
 
 			pcm->ppm_highpc = start;
 
 			TAILQ_INSERT_AFTER(&pp->pp_map, pcm, pcmnew, ppm_next);
 
 			return;
 		} else if (pcm->ppm_lowpc < start && pcm->ppm_highpc <= end)
 			pcm->ppm_highpc = start;
 		else if (pcm->ppm_lowpc >= start && pcm->ppm_highpc > end)
 			pcm->ppm_lowpc = end;
 		else
 			assert(0);
 	}
 }
 
 /*
  * Resolve file name and line number for the given address.
  */
 int
 pmcstat_image_addr2line(struct pmcstat_image *image, uintfptr_t addr,
     char *sourcefile, size_t sourcefile_len, unsigned *sourceline,
     char *funcname, size_t funcname_len)
 {
 	static int addr2line_warn = 0;
 	unsigned l;
 
 	char *sep, cmdline[PATH_MAX], imagepath[PATH_MAX];
 	int fd;
 
 	if (image->pi_addr2line == NULL) {
 		snprintf(imagepath, sizeof(imagepath), "%s%s.symbols",
 		    args.pa_fsroot,
 		    pmcstat_string_unintern(image->pi_fullpath));
 		fd = open(imagepath, O_RDONLY);
 		if (fd < 0) {
 			snprintf(imagepath, sizeof(imagepath), "%s%s",
 			    args.pa_fsroot,
 			    pmcstat_string_unintern(image->pi_fullpath));
 		} else
 			close(fd);
 		/*
 		 * Newer versions of addr2line support recursive inlined
 		 * functions with -i, but the output format adds no marker
 		 * when no more entries are available.
 		 */
 		snprintf(cmdline, sizeof(cmdline), "addr2line -Cfe \"%s\"",
 		    imagepath);
 		image->pi_addr2line = popen(cmdline, "r+");
 		if (image->pi_addr2line == NULL) {
 			if (!addr2line_warn) {
 				addr2line_warn = 1;
 				warnx(
 "WARNING: addr2line is needed for source code information."
 				    );
 			}
 			return (0);
 		}
 	}
 
 	if (feof(image->pi_addr2line) || ferror(image->pi_addr2line)) {
 		warnx("WARNING: addr2line pipe error");
 		pclose(image->pi_addr2line);
 		image->pi_addr2line = NULL;
 		return (0);
 	}
 
 	fprintf(image->pi_addr2line, "%p\n", (void *)addr);
 
 	if (fgets(funcname, funcname_len, image->pi_addr2line) == NULL) {
 		warnx("WARNING: addr2line function name read error");
 		return (0);
 	}
 	sep = strchr(funcname, '\n');
 	if (sep != NULL)
 		*sep = '\0';
 
 	if (fgets(sourcefile, sourcefile_len, image->pi_addr2line) == NULL) {
 		warnx("WARNING: addr2line source file read error");
 		return (0);
 	}
 	sep = strchr(sourcefile, ':');
 	if (sep == NULL) {
 		warnx("WARNING: addr2line source line separator missing");
 		return (0);
 	}
 	*sep = '\0';
 	l = atoi(sep+1);
 	if (l == 0)
 		return (0);
 	*sourceline = l;
 	return (1);
 }
 
 /*
  * Add a {pmcid,name} mapping.
  */
 
 static void
 pmcstat_pmcid_add(pmc_id_t pmcid, pmcstat_interned_string ps)
 {
 	struct pmcstat_pmcrecord *pr, *prm;
 
 	/* Replace an existing name for the PMC. */
 	prm = NULL;
 	LIST_FOREACH(pr, &pmcstat_pmcs, pr_next)
 		if (pr->pr_pmcid == pmcid) {
 			pr->pr_pmcname = ps;
 			return;
 		} else if (pr->pr_pmcname == ps)
 			prm = pr;
 
 	/*
 	 * Otherwise, allocate a new descriptor and call the
 	 * plugins hook.
 	 */
 	if ((pr = malloc(sizeof(*pr))) == NULL)
 		err(EX_OSERR, "ERROR: Cannot allocate pmc record");
 
 	pr->pr_pmcid = pmcid;
 	pr->pr_pmcname = ps;
 	pr->pr_pmcin = pmcstat_npmcs++;
 	pr->pr_samples = 0;
 	pr->pr_dubious_frames = 0;
 	pr->pr_merge = prm == NULL ? pr : prm;
 
 	LIST_INSERT_HEAD(&pmcstat_pmcs, pr, pr_next);
 
 	if (plugins[args.pa_pplugin].pl_newpmc != NULL)
 		plugins[args.pa_pplugin].pl_newpmc(ps, pr);
 	if (plugins[args.pa_plugin].pl_newpmc != NULL)
 		plugins[args.pa_plugin].pl_newpmc(ps, pr);
 }
 
 /*
  * Given a pmcid in use, find its human-readable name.
  */
 
 const char *
 pmcstat_pmcid_to_name(pmc_id_t pmcid)
 {
 	struct pmcstat_pmcrecord *pr;
 
 	LIST_FOREACH(pr, &pmcstat_pmcs, pr_next)
 	    if (pr->pr_pmcid == pmcid)
 		    return (pmcstat_string_unintern(pr->pr_pmcname));
 
 	return NULL;
 }
 
 /*
  * Convert PMC index to name.
  */
 
 const char *
 pmcstat_pmcindex_to_name(int pmcin)
 {
 	struct pmcstat_pmcrecord *pr;
 
 	LIST_FOREACH(pr, &pmcstat_pmcs, pr_next)
 		if (pr->pr_pmcin == pmcin)
 			return pmcstat_string_unintern(pr->pr_pmcname);
 
 	return NULL;
 }
 
 /*
  * Return PMC record with given index.
  */
 
 struct pmcstat_pmcrecord *
 pmcstat_pmcindex_to_pmcr(int pmcin)
 {
 	struct pmcstat_pmcrecord *pr;
 
 	LIST_FOREACH(pr, &pmcstat_pmcs, pr_next)
 		if (pr->pr_pmcin == pmcin)
 			return pr;
 
 	return NULL;
 }
 
 /*
  * Get PMC record by id, apply merge policy.
  */
 
 static struct pmcstat_pmcrecord *
 pmcstat_lookup_pmcid(pmc_id_t pmcid)
 {
 	struct pmcstat_pmcrecord *pr;
 
 	LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) {
 		if (pr->pr_pmcid == pmcid) {
 			if (pmcstat_mergepmc)
 				return pr->pr_merge;
 			return pr;
 		}
 	}
 
 	return NULL;
 }
 
 /*
  * Associate an AOUT image with a process.
  */
 
 static void
 pmcstat_process_aout_exec(struct pmcstat_process *pp,
     struct pmcstat_image *image, uintfptr_t entryaddr)
 {
 	(void) pp;
 	(void) image;
 	(void) entryaddr;
 	/* TODO Implement a.out handling */
 }
 
 /*
  * Associate an ELF image with a process.
  */
 
 static void
 pmcstat_process_elf_exec(struct pmcstat_process *pp,
     struct pmcstat_image *image, uintfptr_t entryaddr)
 {
 	uintmax_t libstart;
 	struct pmcstat_image *rtldimage;
 
 	assert(image->pi_type == PMCSTAT_IMAGE_ELF32 ||
 	    image->pi_type == PMCSTAT_IMAGE_ELF64);
 
 	/* Create a map entry for the base executable. */
 	pmcstat_image_link(pp, image, image->pi_vaddr);
 
 	/*
 	 * For dynamically linked executables we need to determine
 	 * where the dynamic linker was mapped to for this process.
 	 * Subsequent executable objects that are mapped in by the
 	 * dynamic linker will be tracked by log events of type
 	 * PMCLOG_TYPE_MAP_IN.
 	 */
 
 	if (image->pi_isdynamic) {
 
 		/*
 		 * The runtime loader gets loaded just after the maximum
 		 * possible heap address.  Like so:
 		 *
 		 * [  TEXT DATA BSS HEAP -->*RTLD  SHLIBS   <--STACK]
 		 * ^					            ^
 		 * 0				   VM_MAXUSER_ADDRESS
 
 		 *
 		 * The exact address where the loader gets mapped in
 		 * will vary according to the size of the executable
 		 * and the limits on the size of the process's data
 		 * segment at the time of exec().  The entry address
 		 * recorded at process exec time corresponds to the
 		 * 'start' address inside the dynamic linker.  From
 		 * this we can figure out the address where the
 		 * runtime loader's file object had been mapped to.
 		 */
 		rtldimage = pmcstat_image_from_path(image->pi_dynlinkerpath, 0);
 		if (rtldimage == NULL) {
 			warnx("WARNING: Cannot find image for \"%s\".",
 			    pmcstat_string_unintern(image->pi_dynlinkerpath));
 			pmcstat_stats.ps_exec_errors++;
 			return;
 		}
 
 		if (rtldimage->pi_type == PMCSTAT_IMAGE_UNKNOWN)
 			pmcstat_image_get_elf_params(rtldimage);
 
 		if (rtldimage->pi_type != PMCSTAT_IMAGE_ELF32 &&
 		    rtldimage->pi_type != PMCSTAT_IMAGE_ELF64) {
 			warnx("WARNING: rtld not an ELF object \"%s\".",
 			    pmcstat_string_unintern(image->pi_dynlinkerpath));
 			return;
 		}
 
 		libstart = entryaddr - rtldimage->pi_entry;
 		pmcstat_image_link(pp, rtldimage, libstart);
 	}
 }
 
 /*
  * Find the process descriptor corresponding to a PID.  If 'allocate'
  * is zero, we return NULL if no process descriptor could be found
  * for the PID.  If 'allocate' is non-zero, then we
  * will attempt to allocate a fresh process descriptor.  Zombie
  * process descriptors are only removed if a fresh allocation for the
  * same PID is requested.
  */
 
 static struct pmcstat_process *
 pmcstat_process_lookup(pid_t pid, int allocate)
 {
 	uint32_t hash;
 	struct pmcstat_pcmap *ppm, *ppmtmp;
 	struct pmcstat_process *pp, *pptmp;
 
 	hash = (uint32_t) pid & PMCSTAT_HASH_MASK;	/* simplicity wins */
 
 	LIST_FOREACH_SAFE(pp, &pmcstat_process_hash[hash], pp_next, pptmp)
 		if (pp->pp_pid == pid) {
 			/* Found a descriptor, check and process zombies */
 			if (allocate && pp->pp_isactive == 0) {
 				/* remove maps */
 				TAILQ_FOREACH_SAFE(ppm, &pp->pp_map, ppm_next,
 				    ppmtmp) {
 					TAILQ_REMOVE(&pp->pp_map, ppm,
 					    ppm_next);
 					free(ppm);
 				}
 				/* remove process entry */
 				LIST_REMOVE(pp, pp_next);
 				free(pp);
 				break;
 			}
 			return (pp);
 		}
 
 	if (!allocate)
 		return (NULL);
 
 	if ((pp = malloc(sizeof(*pp))) == NULL)
 		err(EX_OSERR, "ERROR: Cannot allocate pid descriptor");
 
 	pp->pp_pid = pid;
 	pp->pp_isactive = 1;
 
 	TAILQ_INIT(&pp->pp_map);
 
 	LIST_INSERT_HEAD(&pmcstat_process_hash[hash], pp, pp_next);
 	return (pp);
 }
 
 /*
  * Associate an image and a process.
  */
 
 static void
 pmcstat_process_exec(struct pmcstat_process *pp,
     pmcstat_interned_string path, uintfptr_t entryaddr)
 {
 	struct pmcstat_image *image;
 
 	if ((image = pmcstat_image_from_path(path, 0)) == NULL) {
 		pmcstat_stats.ps_exec_errors++;
 		return;
 	}
 
 	if (image->pi_type == PMCSTAT_IMAGE_UNKNOWN)
 		pmcstat_image_determine_type(image);
 
 	assert(image->pi_type != PMCSTAT_IMAGE_UNKNOWN);
 
 	switch (image->pi_type) {
 	case PMCSTAT_IMAGE_ELF32:
 	case PMCSTAT_IMAGE_ELF64:
 		pmcstat_stats.ps_exec_elf++;
 		pmcstat_process_elf_exec(pp, image, entryaddr);
 		break;
 
 	case PMCSTAT_IMAGE_AOUT:
 		pmcstat_stats.ps_exec_aout++;
 		pmcstat_process_aout_exec(pp, image, entryaddr);
 		break;
 
 	case PMCSTAT_IMAGE_INDETERMINABLE:
 		pmcstat_stats.ps_exec_indeterminable++;
 		break;
 
 	default:
 		err(EX_SOFTWARE,
 		    "ERROR: Unsupported executable type for \"%s\"",
 		    pmcstat_string_unintern(path));
 	}
 }
 
 
 /*
  * Find the map entry associated with process 'p' at PC value 'pc'.
  */
 
 struct pmcstat_pcmap *
 pmcstat_process_find_map(struct pmcstat_process *p, uintfptr_t pc)
 {
 	struct pmcstat_pcmap *ppm;
 
 	TAILQ_FOREACH(ppm, &p->pp_map, ppm_next) {
 		if (pc >= ppm->ppm_lowpc && pc < ppm->ppm_highpc)
 			return (ppm);
 		if (pc < ppm->ppm_lowpc)
 			return (NULL);
 	}
 
 	return (NULL);
 }
 
 /*
  * Convert a hwpmc(4) log to profile information.  A system-wide
  * callgraph is generated if FLAG_DO_CALLGRAPHS is set.  gmon.out
  * files usable by gprof(1) are created if FLAG_DO_GPROF is set.
  */
 static int
 pmcstat_analyze_log(void)
 {
 	uint32_t cpu, cpuflags;
 	uintfptr_t pc;
 	pid_t pid;
 	struct pmcstat_image *image;
 	struct pmcstat_process *pp, *ppnew;
 	struct pmcstat_pcmap *ppm, *ppmtmp;
 	struct pmclog_ev ev;
 	struct pmcstat_pmcrecord *pmcr;
 	pmcstat_interned_string image_path;
 
 	assert(args.pa_flags & FLAG_DO_ANALYSIS);
 
 	if (elf_version(EV_CURRENT) == EV_NONE)
 		err(EX_UNAVAILABLE, "Elf library initialization failed");
 
 	while (pmclog_read(args.pa_logparser, &ev) == 0) {
 		assert(ev.pl_state == PMCLOG_OK);
 
 		switch (ev.pl_type) {
 		case PMCLOG_TYPE_INITIALIZE:
 			if ((ev.pl_u.pl_i.pl_version & 0xFF000000) !=
 			    PMC_VERSION_MAJOR << 24 && args.pa_verbosity > 0)
 				warnx(
 "WARNING: Log version 0x%x does not match compiled version 0x%x.",
 				    ev.pl_u.pl_i.pl_version, PMC_VERSION_MAJOR);
 			break;
 
 		case PMCLOG_TYPE_MAP_IN:
 			/*
 			 * Introduce an address range mapping for a
 			 * userland process or the kernel (pid == -1).
 			 *
 			 * We always allocate a process descriptor so
 			 * that subsequent samples seen for this
 			 * address range are mapped to the current
 			 * object being mapped in.
 			 */
 			pid = ev.pl_u.pl_mi.pl_pid;
 			if (pid == -1)
 				pp = pmcstat_kernproc;
 			else
 				pp = pmcstat_process_lookup(pid,
 				    PMCSTAT_ALLOCATE);
 
 			assert(pp != NULL);
 
 			image_path = pmcstat_string_intern(ev.pl_u.pl_mi.
 			    pl_pathname);
 			image = pmcstat_image_from_path(image_path, pid == -1);
 			if (image->pi_type == PMCSTAT_IMAGE_UNKNOWN)
 				pmcstat_image_determine_type(image);
 			if (image->pi_type != PMCSTAT_IMAGE_INDETERMINABLE)
 				pmcstat_image_link(pp, image,
 				    ev.pl_u.pl_mi.pl_start);
 			break;
 
 		case PMCLOG_TYPE_MAP_OUT:
 			/*
 			 * Remove an address map.
 			 */
 			pid = ev.pl_u.pl_mo.pl_pid;
 			if (pid == -1)
 				pp = pmcstat_kernproc;
 			else
 				pp = pmcstat_process_lookup(pid, 0);
 
 			if (pp == NULL)	/* unknown process */
 				break;
 
 			pmcstat_image_unmap(pp, ev.pl_u.pl_mo.pl_start,
 			    ev.pl_u.pl_mo.pl_end);
 			break;
 
 		case PMCLOG_TYPE_PCSAMPLE:
 			/*
 			 * Note: the `PCSAMPLE' log entry is not
 			 * generated by hwpmc(4) after version 2.
 			 */
 
 			/*
 			 * We bring in the gmon file for the image
 			 * currently associated with the PMC & pid
 			 * pair and increment the appropriate entry
 			 * bin inside this.
 			 */
 			pmcstat_stats.ps_samples_total++;
 			ps_samples_period++;
 
 			pc = ev.pl_u.pl_s.pl_pc;
 			pp = pmcstat_process_lookup(ev.pl_u.pl_s.pl_pid,
 			    PMCSTAT_ALLOCATE);
 
 			/* Get PMC record. */
 			pmcr = pmcstat_lookup_pmcid(ev.pl_u.pl_s.pl_pmcid);
 			assert(pmcr != NULL);
 			pmcr->pr_samples++;
 
 			/*
 			 * Call the plugins' processing hooks.
 			 * TODO: move pmcstat_process_find_map inside plugins
 			 */
 
 			if (plugins[args.pa_pplugin].pl_process != NULL)
 				plugins[args.pa_pplugin].pl_process(
 				    pp, pmcr, 1, &pc,
 				    pmcstat_process_find_map(pp, pc) != NULL, 0);
 			plugins[args.pa_plugin].pl_process(
 			    pp, pmcr, 1, &pc,
 			    pmcstat_process_find_map(pp, pc) != NULL, 0);
 			break;
 
 		case PMCLOG_TYPE_CALLCHAIN:
 			pmcstat_stats.ps_samples_total++;
 			ps_samples_period++;
 
 			cpuflags = ev.pl_u.pl_cc.pl_cpuflags;
 			cpu = PMC_CALLCHAIN_CPUFLAGS_TO_CPU(cpuflags);
 
 			/* Filter on the CPU id. */
 			if (!CPU_ISSET(cpu, &(args.pa_cpumask))) {
 				pmcstat_stats.ps_samples_skipped++;
 				break;
 			}
 
 			pp = pmcstat_process_lookup(ev.pl_u.pl_cc.pl_pid,
 			    PMCSTAT_ALLOCATE);
 
 			/* Get PMC record. */
 			pmcr = pmcstat_lookup_pmcid(ev.pl_u.pl_cc.pl_pmcid);
 			assert(pmcr != NULL);
 			pmcr->pr_samples++;
 
 			/*
 			 * Call the plugins' processing hooks.
 			 */
 
 			if (plugins[args.pa_pplugin].pl_process != NULL)
 				plugins[args.pa_pplugin].pl_process(
 				    pp, pmcr,
 				    ev.pl_u.pl_cc.pl_npc,
 				    ev.pl_u.pl_cc.pl_pc,
 				    PMC_CALLCHAIN_CPUFLAGS_TO_USERMODE(cpuflags),
 				    cpu);
 			plugins[args.pa_plugin].pl_process(
 			    pp, pmcr,
 			    ev.pl_u.pl_cc.pl_npc,
 			    ev.pl_u.pl_cc.pl_pc,
 			    PMC_CALLCHAIN_CPUFLAGS_TO_USERMODE(cpuflags),
 			    cpu);
 			break;
 
 		case PMCLOG_TYPE_PMCALLOCATE:
 			/*
 			 * Record the association between this PMC's id
 			 * and its name.
 			 */
 			pmcstat_pmcid_add(ev.pl_u.pl_a.pl_pmcid,
 			    pmcstat_string_intern(ev.pl_u.pl_a.pl_evname));
 			break;
 
 		case PMCLOG_TYPE_PMCALLOCATEDYN:
 			/*
 			 * Record the association between this PMC's id
 			 * and its name.
 			 */
 			pmcstat_pmcid_add(ev.pl_u.pl_ad.pl_pmcid,
 			    pmcstat_string_intern(ev.pl_u.pl_ad.pl_evname));
 			break;
 
 		case PMCLOG_TYPE_PROCEXEC:
 
 			/*
 			 * Change the executable image associated with
 			 * a process.
 			 */
 			pp = pmcstat_process_lookup(ev.pl_u.pl_x.pl_pid,
 			    PMCSTAT_ALLOCATE);
 
 			/* delete the current process map */
 			TAILQ_FOREACH_SAFE(ppm, &pp->pp_map, ppm_next, ppmtmp) {
 				TAILQ_REMOVE(&pp->pp_map, ppm, ppm_next);
 				free(ppm);
 			}
 
-			/* associate this process  image */
+			/*
+			 * Associate this process image.
+			 */
 			image_path = pmcstat_string_intern(
 				ev.pl_u.pl_x.pl_pathname);
 			assert(image_path != NULL);
 			pmcstat_process_exec(pp, image_path,
 			    ev.pl_u.pl_x.pl_entryaddr);
 			break;
 
 		case PMCLOG_TYPE_PROCEXIT:
 
 			/*
 			 * Due to the way the log is generated, the
 			 * last few samples corresponding to a process
 			 * may appear in the log after the process
 			 * exit event is recorded.  Thus we keep the
 			 * process' descriptor and associated data
 			 * structures around, but mark the process as
 			 * having exited.
 			 */
 			pp = pmcstat_process_lookup(ev.pl_u.pl_e.pl_pid, 0);
 			if (pp == NULL)
 				break;
 			pp->pp_isactive = 0;	/* mark as a zombie */
 			break;
 
 		case PMCLOG_TYPE_SYSEXIT:
 			pp = pmcstat_process_lookup(ev.pl_u.pl_se.pl_pid, 0);
 			if (pp == NULL)
 				break;
 			pp->pp_isactive = 0;	/* make a zombie */
 			break;
 
 		case PMCLOG_TYPE_PROCFORK:
 
 			/*
 			 * Allocate a process descriptor for the new
 			 * (child) process.
 			 */
 			ppnew =
 			    pmcstat_process_lookup(ev.pl_u.pl_f.pl_newpid,
 				PMCSTAT_ALLOCATE);
 
 			/*
 			 * If we had been tracking the parent, clone
 			 * its address maps.
 			 */
 			pp = pmcstat_process_lookup(ev.pl_u.pl_f.pl_oldpid, 0);
 			if (pp == NULL)
 				break;
 			TAILQ_FOREACH(ppm, &pp->pp_map, ppm_next)
 			    pmcstat_image_link(ppnew, ppm->ppm_image,
 				ppm->ppm_lowpc);
 			break;
 
 		default:	/* other types of entries are not relevant */
 			break;
 		}
 	}
 
 	if (ev.pl_state == PMCLOG_EOF)
 		return (PMCSTAT_FINISHED);
 	else if (ev.pl_state == PMCLOG_REQUIRE_DATA)
 		return (PMCSTAT_RUNNING);
 
 	err(EX_DATAERR,
 	    "ERROR: event parsing failed (record %jd, offset 0x%jx)",
 	    (uintmax_t) ev.pl_count + 1, ev.pl_offset);
 }
 
 /*
  * Print log entries as text.
  */
 
 static int
 pmcstat_print_log(void)
 {
 	struct pmclog_ev ev;
 	uint32_t npc;
 
 	while (pmclog_read(args.pa_logparser, &ev) == 0) {
 		assert(ev.pl_state == PMCLOG_OK);
 		switch (ev.pl_type) {
 		case PMCLOG_TYPE_CALLCHAIN:
 			PMCSTAT_PRINT_ENTRY("callchain",
 			    "%d 0x%x %d %d %c", ev.pl_u.pl_cc.pl_pid,
 			    ev.pl_u.pl_cc.pl_pmcid,
 			    PMC_CALLCHAIN_CPUFLAGS_TO_CPU(ev.pl_u.pl_cc. \
 				pl_cpuflags), ev.pl_u.pl_cc.pl_npc,
 			    PMC_CALLCHAIN_CPUFLAGS_TO_USERMODE(ev.pl_u.pl_cc.\
 			        pl_cpuflags) ? 'u' : 's');
 			for (npc = 0; npc < ev.pl_u.pl_cc.pl_npc; npc++)
 				PMCSTAT_PRINT_ENTRY("...", "%p",
 				    (void *) ev.pl_u.pl_cc.pl_pc[npc]);
 			break;
 		case PMCLOG_TYPE_CLOSELOG:
 			PMCSTAT_PRINT_ENTRY("closelog",);
 			break;
 		case PMCLOG_TYPE_DROPNOTIFY:
 			PMCSTAT_PRINT_ENTRY("drop",);
 			break;
 		case PMCLOG_TYPE_INITIALIZE:
 			PMCSTAT_PRINT_ENTRY("initlog","0x%x \"%s\"",
 			    ev.pl_u.pl_i.pl_version,
 			    pmc_name_of_cputype(ev.pl_u.pl_i.pl_arch));
 			if ((ev.pl_u.pl_i.pl_version & 0xFF000000) !=
 			    PMC_VERSION_MAJOR << 24 && args.pa_verbosity > 0)
 				warnx(
 "WARNING: Log version 0x%x != expected version 0x%x.",
 				    ev.pl_u.pl_i.pl_version, PMC_VERSION);
 			break;
 		case PMCLOG_TYPE_MAP_IN:
 			PMCSTAT_PRINT_ENTRY("map-in","%d %p \"%s\"",
 			    ev.pl_u.pl_mi.pl_pid,
 			    (void *) ev.pl_u.pl_mi.pl_start,
 			    ev.pl_u.pl_mi.pl_pathname);
 			break;
 		case PMCLOG_TYPE_MAP_OUT:
 			PMCSTAT_PRINT_ENTRY("map-out","%d %p %p",
 			    ev.pl_u.pl_mo.pl_pid,
 			    (void *) ev.pl_u.pl_mo.pl_start,
 			    (void *) ev.pl_u.pl_mo.pl_end);
 			break;
 		case PMCLOG_TYPE_PCSAMPLE:
 			PMCSTAT_PRINT_ENTRY("sample","0x%x %d %p %c",
 			    ev.pl_u.pl_s.pl_pmcid,
 			    ev.pl_u.pl_s.pl_pid,
 			    (void *) ev.pl_u.pl_s.pl_pc,
 			    ev.pl_u.pl_s.pl_usermode ? 'u' : 's');
 			break;
 		case PMCLOG_TYPE_PMCALLOCATE:
 			PMCSTAT_PRINT_ENTRY("allocate","0x%x \"%s\" 0x%x",
 			    ev.pl_u.pl_a.pl_pmcid,
 			    ev.pl_u.pl_a.pl_evname,
 			    ev.pl_u.pl_a.pl_flags);
 			break;
 		case PMCLOG_TYPE_PMCALLOCATEDYN:
 			PMCSTAT_PRINT_ENTRY("allocatedyn","0x%x \"%s\" 0x%x",
 			    ev.pl_u.pl_ad.pl_pmcid,
 			    ev.pl_u.pl_ad.pl_evname,
 			    ev.pl_u.pl_ad.pl_flags);
 			break;
 		case PMCLOG_TYPE_PMCATTACH:
 			PMCSTAT_PRINT_ENTRY("attach","0x%x %d \"%s\"",
 			    ev.pl_u.pl_t.pl_pmcid,
 			    ev.pl_u.pl_t.pl_pid,
 			    ev.pl_u.pl_t.pl_pathname);
 			break;
 		case PMCLOG_TYPE_PMCDETACH:
 			PMCSTAT_PRINT_ENTRY("detach","0x%x %d",
 			    ev.pl_u.pl_d.pl_pmcid,
 			    ev.pl_u.pl_d.pl_pid);
 			break;
 		case PMCLOG_TYPE_PROCCSW:
 			PMCSTAT_PRINT_ENTRY("cswval","0x%x %d %jd",
 			    ev.pl_u.pl_c.pl_pmcid,
 			    ev.pl_u.pl_c.pl_pid,
 			    ev.pl_u.pl_c.pl_value);
 			break;
 		case PMCLOG_TYPE_PROCEXEC:
 			PMCSTAT_PRINT_ENTRY("exec","0x%x %d %p \"%s\"",
 			    ev.pl_u.pl_x.pl_pmcid,
 			    ev.pl_u.pl_x.pl_pid,
 			    (void *) ev.pl_u.pl_x.pl_entryaddr,
 			    ev.pl_u.pl_x.pl_pathname);
 			break;
 		case PMCLOG_TYPE_PROCEXIT:
 			PMCSTAT_PRINT_ENTRY("exitval","0x%x %d %jd",
 			    ev.pl_u.pl_e.pl_pmcid,
 			    ev.pl_u.pl_e.pl_pid,
 			    ev.pl_u.pl_e.pl_value);
 			break;
 		case PMCLOG_TYPE_PROCFORK:
 			PMCSTAT_PRINT_ENTRY("fork","%d %d",
 			    ev.pl_u.pl_f.pl_oldpid,
 			    ev.pl_u.pl_f.pl_newpid);
 			break;
 		case PMCLOG_TYPE_USERDATA:
 			PMCSTAT_PRINT_ENTRY("userdata","0x%x",
 			    ev.pl_u.pl_u.pl_userdata);
 			break;
 		case PMCLOG_TYPE_SYSEXIT:
 			PMCSTAT_PRINT_ENTRY("exit","%d",
 			    ev.pl_u.pl_se.pl_pid);
 			break;
 		default:
 			fprintf(args.pa_printfile, "unknown event (type %d).\n",
 			    ev.pl_type);
 		}
 	}
 
 	if (ev.pl_state == PMCLOG_EOF)
 		return (PMCSTAT_FINISHED);
 	else if (ev.pl_state ==  PMCLOG_REQUIRE_DATA)
 		return (PMCSTAT_RUNNING);
 
 	errx(EX_DATAERR,
 	    "ERROR: event parsing failed (record %jd, offset 0x%jx).",
 	    (uintmax_t) ev.pl_count + 1, ev.pl_offset);
 	/*NOTREACHED*/
 }
 
 /*
  * Public Interfaces.
  */
 
 /*
  * Close a logfile, after first flushing all in-module queued data.
  */
 
 int
 pmcstat_close_log(void)
 {
 	/* If a local logfile is configured ask the kernel to stop
 	 * and flush data. Kernel will close the file when data is flushed
 	 * so keep the status to EXITING.
 	 */
 	if (args.pa_logfd != -1) {
 		if (pmc_close_logfile() < 0)
 			err(EX_OSERR, "ERROR: logging failed");
 	}
 
 	return (args.pa_flags & FLAG_HAS_PIPE ? PMCSTAT_EXITING :
 	    PMCSTAT_FINISHED);
 }
 
 
 
 /*
  * Open a log file, for reading or writing.
  *
  * The function returns the fd of a successfully opened log or -1 in
  * case of failure.
  */
 
 int
 pmcstat_open_log(const char *path, int mode)
 {
 	int error, fd, cfd;
 	size_t hlen;
 	const char *p, *errstr;
 	struct addrinfo hints, *res, *res0;
 	char hostname[MAXHOSTNAMELEN];
 
 	errstr = NULL;
 	fd = -1;
 
 	/*
 	 * If 'path' is "-" then open one of stdin or stdout depending
 	 * on the value of 'mode'.
 	 *
 	 * If 'path' contains a ':' and does not start with a '/' or '.',
 	 * treat it as a "host:port" specification and open a network
 	 * socket: listen on it when reading, connect to it when writing.
 	 *
 	 * Otherwise, treat 'path' as a file name and open that.
 	 */
 	if (path[0] == '-' && path[1] == '\0')
 		fd = (mode == PMCSTAT_OPEN_FOR_READ) ? 0 : 1;
 	else if (path[0] != '/' &&
 	    path[0] != '.' && strchr(path, ':') != NULL) {
 
 		p = strrchr(path, ':');
 		hlen = p - path;
 		if (p == path || hlen >= sizeof(hostname)) {
 			errstr = strerror(EINVAL);
 			goto done;
 		}
 
 		assert(hlen < sizeof(hostname));
 		(void) strncpy(hostname, path, hlen);
 		hostname[hlen] = '\0';
 
 		(void) memset(&hints, 0, sizeof(hints));
 		hints.ai_family = AF_UNSPEC;
 		hints.ai_socktype = SOCK_STREAM;
 		if ((error = getaddrinfo(hostname, p+1, &hints, &res0)) != 0) {
 			errstr = gai_strerror(error);
 			goto done;
 		}
 
 		fd = -1;
 		for (res = res0; res; res = res->ai_next) {
 			if ((fd = socket(res->ai_family, res->ai_socktype,
 			    res->ai_protocol)) < 0) {
 				errstr = strerror(errno);
 				continue;
 			}
 			if (mode == PMCSTAT_OPEN_FOR_READ) {
 				if (bind(fd, res->ai_addr, res->ai_addrlen) < 0) {
 					errstr = strerror(errno);
 					(void) close(fd);
 					fd = -1;
 					continue;
 				}
 				listen(fd, 1);
 				cfd = accept(fd, NULL, NULL);
 				(void) close(fd);
 				if (cfd < 0) {
 					errstr = strerror(errno);
 					fd = -1;
 					break;
 				}
 				fd = cfd;
 			} else {
 				if (connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
 					errstr = strerror(errno);
 					(void) close(fd);
 					fd = -1;
 					continue;
 				}
 			}
 			errstr = NULL;
 			break;
 		}
 		freeaddrinfo(res0);
 
 	} else if ((fd = open(path, mode == PMCSTAT_OPEN_FOR_READ ?
 		    O_RDONLY : (O_WRONLY|O_CREAT|O_TRUNC),
 		    S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH)) < 0)
 			errstr = strerror(errno);
 
   done:
 	if (errstr)
 		errx(EX_OSERR, "ERROR: Cannot open \"%s\" for %s: %s.", path,
 		    (mode == PMCSTAT_OPEN_FOR_READ ? "reading" : "writing"),
 		    errstr);
 
 	return (fd);
 }
 
 /*
  * Process a log file in offline analysis mode.
  */
 
 int
 pmcstat_process_log(void)
 {
 
 	/*
 	 * If analysis has not been asked for, just print the log to
 	 * the current output file.
 	 */
 	if (args.pa_flags & FLAG_DO_PRINT)
 		return (pmcstat_print_log());
 	else
 		return (pmcstat_analyze_log());
 }
 
 /*
  * Refresh top display.
  */
 
 static void
 pmcstat_refresh_top(void)
 {
 	int v_attrs;
 	float v;
 	char pmcname[40];
 	struct pmcstat_pmcrecord *pmcpr;
 
 	/* If in pause mode do not refresh display. */
 	if (pmcstat_pause)
 		return;
 
 	/* Wait until the PMC appears in the log. */
 	pmcpr = pmcstat_pmcindex_to_pmcr(pmcstat_pmcinfilter);
 	if (pmcpr == NULL)
 		return;
 
 	/* Format PMC name. */
 	if (pmcstat_mergepmc)
 		snprintf(pmcname, sizeof(pmcname), "[%s]",
 		    pmcstat_string_unintern(pmcpr->pr_pmcname));
 	else
 		snprintf(pmcname, sizeof(pmcname), "%s.%d",
 		    pmcstat_string_unintern(pmcpr->pr_pmcname),
 		    pmcstat_pmcinfilter);
 
 	/* Format samples count. */
 	if (ps_samples_period > 0)
 		v = (pmcpr->pr_samples * 100.0) / ps_samples_period;
 	else
 		v = 0.;
 	v_attrs = PMCSTAT_ATTRPERCENT(v);
 
 	PMCSTAT_PRINTBEGIN();
 	PMCSTAT_PRINTW("PMC: %s Samples: %u ",
 	    pmcname,
 	    pmcpr->pr_samples);
 	PMCSTAT_ATTRON(v_attrs);
 	PMCSTAT_PRINTW("(%.1f%%) ", v);
 	PMCSTAT_ATTROFF(v_attrs);
 	PMCSTAT_PRINTW(", %u unresolved\n\n",
 	    pmcpr->pr_dubious_frames);
 	if (plugins[args.pa_plugin].pl_topdisplay != NULL)
 		plugins[args.pa_plugin].pl_topdisplay();
 	PMCSTAT_PRINTEND();
 }
 
 /*
  * Find the next pmc index to display.
  */
 
 static void
 pmcstat_changefilter(void)
 {
 	int pmcin;
 	struct pmcstat_pmcrecord *pmcr;
 
 	/*
 	 * Find the next merge target.
 	 */
 	if (pmcstat_mergepmc) {
 		pmcin = pmcstat_pmcinfilter;
 
 		do {
 			pmcr = pmcstat_pmcindex_to_pmcr(pmcstat_pmcinfilter);
 			if (pmcr == NULL || pmcr == pmcr->pr_merge)
 				break;
 
 			pmcstat_pmcinfilter++;
 			if (pmcstat_pmcinfilter >= pmcstat_npmcs)
 				pmcstat_pmcinfilter = 0;
 
 		} while (pmcstat_pmcinfilter != pmcin);
 	}
 }
 
 /*
  * Top mode keypress.
  */
 
 int
 pmcstat_keypress_log(void)
 {
 	int c, ret = 0;
 	WINDOW *w;
 
 	w = newwin(1, 0, 1, 0);
 	c = wgetch(w);
 	wprintw(w, "Key: %c => ", c);
 	switch (c) {
 	case 'c':
 		wprintw(w, "enter mode 'd' or 'a' => ");
 		c = wgetch(w);
 		if (c == 'd') {
 			args.pa_topmode = PMCSTAT_TOP_DELTA;
 			wprintw(w, "switching to delta mode");
 		} else {
 			args.pa_topmode = PMCSTAT_TOP_ACCUM;
 			wprintw(w, "switching to accumulation mode");
 		}
 		break;
 	case 'm':
 		pmcstat_mergepmc = !pmcstat_mergepmc;
 		/*
 		 * Changing the merge state requires a data reset.
 		 */
 		if (plugins[args.pa_plugin].pl_shutdown != NULL)
 			plugins[args.pa_plugin].pl_shutdown(NULL);
 		pmcstat_stats_reset(0);
 		if (plugins[args.pa_plugin].pl_init != NULL)
 			plugins[args.pa_plugin].pl_init();
 
 		/* Update filter to be on a merge target. */
 		pmcstat_changefilter();
 		wprintw(w, "merge PMC %s", pmcstat_mergepmc ? "on" : "off");
 		break;
 	case 'n':
 		/* Close current plugin. */
 		if (plugins[args.pa_plugin].pl_shutdown != NULL)
 			plugins[args.pa_plugin].pl_shutdown(NULL);
 
 		/* Find next top display available. */
 		do {
 			args.pa_plugin++;
 			if (plugins[args.pa_plugin].pl_name == NULL)
 				args.pa_plugin = 0;
 		} while (plugins[args.pa_plugin].pl_topdisplay == NULL);
 
 		/* Open new plugin. */
 		pmcstat_stats_reset(0);
 		if (plugins[args.pa_plugin].pl_init != NULL)
 			plugins[args.pa_plugin].pl_init();
 		wprintw(w, "switching to plugin %s",
 		    plugins[args.pa_plugin].pl_name);
 		break;
 	case 'p':
 		pmcstat_pmcinfilter++;
 		if (pmcstat_pmcinfilter >= pmcstat_npmcs)
 			pmcstat_pmcinfilter = 0;
 		pmcstat_changefilter();
 		wprintw(w, "switching to PMC %s.%d",
 		    pmcstat_pmcindex_to_name(pmcstat_pmcinfilter),
 		    pmcstat_pmcinfilter);
 		break;
 	case ' ':
 		pmcstat_pause = !pmcstat_pause;
 		if (pmcstat_pause)
 			wprintw(w, "pause => press space again to continue");
 		break;
 	case 'q':
 		wprintw(w, "exiting...");
 		ret = 1;
 		break;
 	default:
 		if (plugins[args.pa_plugin].pl_topkeypress != NULL)
 			if (plugins[args.pa_plugin].pl_topkeypress(c, w))
 				ret = 1;
 	}
 
 	wrefresh(w);
 	delwin(w);
 	return ret;
 }
 
 
 /*
  * Top mode display.
  */
 
 void
 pmcstat_display_log(void)
 {
 
 	pmcstat_refresh_top();
 
 	/* Reset everything in delta mode. */
 	if (args.pa_topmode == PMCSTAT_TOP_DELTA) {
 		if (plugins[args.pa_plugin].pl_shutdown != NULL)
 			plugins[args.pa_plugin].pl_shutdown(NULL);
 		pmcstat_stats_reset(0);
 		if (plugins[args.pa_plugin].pl_init != NULL)
 			plugins[args.pa_plugin].pl_init();
 	}
 
 }
 
 /*
  * Configure a plugin.
  */
 
 void
 pmcstat_pluginconfigure_log(char *opt)
 {
 
 	if (strncmp(opt, "threshold=", 10) == 0) {
 		pmcstat_threshold = atof(opt+10);
 	} else {
 		if (plugins[args.pa_plugin].pl_configure != NULL) {
 			if (!plugins[args.pa_plugin].pl_configure(opt))
 				err(EX_USAGE,
 				    "ERROR: unknown option <%s>.", opt);
 		}
 	}
 }
 
 /*
  * Initialize module.
  */
 
 void
 pmcstat_initialize_logging(void)
 {
 	int i;
 
 	/* use a convenient format for 'ldd' output */
 	if (setenv("LD_TRACE_LOADED_OBJECTS_FMT1","%o \"%p\" %x\n",1) != 0)
 		err(EX_OSERR, "ERROR: Cannot setenv");
 
 	/* Initialize hash tables */
 	pmcstat_string_initialize();
 	for (i = 0; i < PMCSTAT_NHASH; i++) {
 		LIST_INIT(&pmcstat_image_hash[i]);
 		LIST_INIT(&pmcstat_process_hash[i]);
 	}
 
 	/*
 	 * Create a fake 'process' entry for the kernel with pid -1.
 	 * hwpmc(4) will subsequently inform us about where the kernel
 	 * and any loaded kernel modules are mapped.
 	 */
 	if ((pmcstat_kernproc = pmcstat_process_lookup((pid_t) -1,
 		 PMCSTAT_ALLOCATE)) == NULL)
 		err(EX_OSERR, "ERROR: Cannot initialize logging");
 
 	/* PMC count. */
 	pmcstat_npmcs = 0;
 
 	/* Merge PMC with same name. */
 	pmcstat_mergepmc = args.pa_mergepmc;
 
 	/*
 	 * Initialize plugins
 	 */
 
 	if (plugins[args.pa_pplugin].pl_init != NULL)
 		plugins[args.pa_pplugin].pl_init();
 	if (plugins[args.pa_plugin].pl_init != NULL)
 		plugins[args.pa_plugin].pl_init();
 }
 
 /*
  * Shutdown module.
  */
 
 void
 pmcstat_shutdown_logging(void)
 {
 	int i;
 	FILE *mf;
 	struct pmcstat_image *pi, *pitmp;
 	struct pmcstat_process *pp, *pptmp;
 	struct pmcstat_pcmap *ppm, *ppmtmp;
 
 	/* determine where to send the map file */
 	mf = NULL;
 	if (args.pa_mapfilename != NULL)
 		mf = (strcmp(args.pa_mapfilename, "-") == 0) ?
 		    args.pa_printfile : fopen(args.pa_mapfilename, "w");
 
 	if (mf == NULL && args.pa_flags & FLAG_DO_GPROF &&
 	    args.pa_verbosity >= 2)
 		mf = args.pa_printfile;
 
 	if (mf)
 		(void) fprintf(mf, "MAP:\n");
 
 	/*
 	 * Shutdown the plugins
 	 */
 
 	if (plugins[args.pa_plugin].pl_shutdown != NULL)
 		plugins[args.pa_plugin].pl_shutdown(mf);
 	if (plugins[args.pa_pplugin].pl_shutdown != NULL)
 		plugins[args.pa_pplugin].pl_shutdown(mf);
 
 	for (i = 0; i < PMCSTAT_NHASH; i++) {
 		LIST_FOREACH_SAFE(pi, &pmcstat_image_hash[i], pi_next,
 		    pitmp) {
 			if (plugins[args.pa_plugin].pl_shutdownimage != NULL)
 				plugins[args.pa_plugin].pl_shutdownimage(pi);
 			if (plugins[args.pa_pplugin].pl_shutdownimage != NULL)
 				plugins[args.pa_pplugin].pl_shutdownimage(pi);
 
 			free(pi->pi_symbols);
 			if (pi->pi_addr2line != NULL)
 				pclose(pi->pi_addr2line);
 			LIST_REMOVE(pi, pi_next);
 			free(pi);
 		}
 
 		LIST_FOREACH_SAFE(pp, &pmcstat_process_hash[i], pp_next,
 		    pptmp) {
 			TAILQ_FOREACH_SAFE(ppm, &pp->pp_map, ppm_next, ppmtmp) {
 				TAILQ_REMOVE(&pp->pp_map, ppm, ppm_next);
 				free(ppm);
 			}
 			LIST_REMOVE(pp, pp_next);
 			free(pp);
 		}
 	}
 
 	pmcstat_string_shutdown();
 
 	/*
 	 * Print errors unless -q was specified.  Print all statistics
 	 * if verbosity > 1.
 	 */
 #define	PRINT(N,V) do {							\
 		if (pmcstat_stats.ps_##V || args.pa_verbosity >= 2)	\
 			(void) fprintf(args.pa_printfile, " %-40s %d\n",\
 			    N, pmcstat_stats.ps_##V);			\
 	} while (0)
 
 	if (args.pa_verbosity >= 1 && (args.pa_flags & FLAG_DO_ANALYSIS)) {
 		(void) fprintf(args.pa_printfile, "CONVERSION STATISTICS:\n");
 		PRINT("#exec/a.out", exec_aout);
 		PRINT("#exec/elf", exec_elf);
 		PRINT("#exec/unknown", exec_indeterminable);
 		PRINT("#exec handling errors", exec_errors);
 		PRINT("#samples/total", samples_total);
 		PRINT("#samples/unclaimed", samples_unknown_offset);
 		PRINT("#samples/unknown-object", samples_indeterminable);
 		PRINT("#samples/unknown-function", samples_unknown_function);
 		PRINT("#callchain/dubious-frames", callchain_dubious_frames);
 	}
 
 	if (mf)
 		(void) fclose(mf);
 }
Index: projects/release-arm-redux/usr.sbin/pw/Makefile
===================================================================
--- projects/release-arm-redux/usr.sbin/pw/Makefile	(revision 282691)
+++ projects/release-arm-redux/usr.sbin/pw/Makefile	(revision 282692)
@@ -1,19 +1,19 @@
 # $FreeBSD$
 
 PROG=	pw
 MAN=	pw.conf.5 pw.8
 SRCS=	pw.c pw_conf.c pw_user.c pw_group.c pw_log.c pw_nis.c pw_vpw.c \
 	grupd.c pwupd.c fileupd.c psdate.c \
 	bitmap.c cpdir.c rm_r.c
 
 WARNS?=	2
 
-LIBADD=	crypt util
+LIBADD=	crypt util sbuf
 
 .include <src.opts.mk>
 
 .if ${MK_TESTS} != "no"
 SUBDIR+=	tests
 .endif
 
 .include <bsd.prog.mk>
Index: projects/release-arm-redux/usr.sbin/pw/fileupd.c
===================================================================
--- projects/release-arm-redux/usr.sbin/pw/fileupd.c	(revision 282691)
+++ projects/release-arm-redux/usr.sbin/pw/fileupd.c	(revision 282692)
@@ -1,68 +1,47 @@
 /*-
  * Copyright (C) 1996
  *	David L. Nugent.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY DAVID L. NUGENT AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL DAVID L. NUGENT OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #ifndef lint
 static const char rcsid[] =
   "$FreeBSD$";
 #endif /* not lint */
 
-#include <stdio.h>
-#include <fcntl.h>
 #include <stdlib.h>
-#include <string.h>
-#include <sys/types.h>
-#include <sys/stat.h>
-#include <sys/param.h>
-#include <errno.h>
-#include <unistd.h>
 
 #include "pwupd.h"
-
-int
-extendline(char **buf, int * buflen, int needed)
-{
-	if (needed > *buflen) {
-		char	*tmp = realloc(*buf, needed);
-		if (tmp == NULL)
-			return -1;
-		*buf = tmp;
-		*buflen = needed;
-	}
-	return *buflen;
-}
 
 int
 extendarray(char ***buf, int * buflen, int needed)
 {
 	if (needed > *buflen) {
 		char	**tmp = realloc(*buf, needed * sizeof(char *));
 		if (tmp == NULL)
 			return -1;
 		*buf = tmp;
 		*buflen = needed;
 	}
 	return *buflen;
 }
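
Reviewer note: with extendline() removed, extendarray() is the only growth helper left in fileupd.c. The sketch below shows one way a caller might use it to keep a NULL-terminated list. It is an illustration only: append_name() is a hypothetical helper, not part of pw, and extendarray() is copied in (made static, with an added assert) so the fragment stands alone.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same contract as extendarray() in fileupd.c: grow to at least `needed` slots. */
static int
extendarray(char ***buf, int *buflen, int needed)
{
	assert(buf != NULL && buflen != NULL);
	if (needed > *buflen) {
		char **tmp = realloc(*buf, needed * sizeof(char *));
		if (tmp == NULL)
			return -1;
		*buf = tmp;
		*buflen = needed;
	}
	return *buflen;
}

/*
 * Hypothetical helper (not part of pw): store a copy of `name` at index
 * *count and keep a NULL terminator in the slot after it.
 */
static int
append_name(char ***list, int *cap, int *count, const char *name)
{
	size_t len = strlen(name) + 1;
	char *copy;

	/* One slot for the new entry plus one for the terminator. */
	if (extendarray(list, cap, *count + 2) == -1)
		return -1;
	if ((copy = malloc(len)) == NULL)
		return -1;
	memcpy(copy, name, len);
	(*list)[(*count)++] = copy;
	(*list)[*count] = NULL;
	return 0;
}
```

realloc() does not zero the slots it adds, so the caller has to write the terminator itself. This is why the callers in pw_conf.c and pw_user.c ask for i + 2 slots and then fill the remaining entries with NULL.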
Index: projects/release-arm-redux/usr.sbin/pw/grupd.c
===================================================================
--- projects/release-arm-redux/usr.sbin/pw/grupd.c	(revision 282691)
+++ projects/release-arm-redux/usr.sbin/pw/grupd.c	(revision 282692)
@@ -1,128 +1,124 @@
 /*-
  * Copyright (C) 1996
  *	David L. Nugent.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY DAVID L. NUGENT AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL DAVID L. NUGENT OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #ifndef lint
 static const char rcsid[] =
   "$FreeBSD$";
 #endif /* not lint */
 
 #include <grp.h>
 #include <libutil.h>
 #include <err.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
-#include <unistd.h>
-#include <stdarg.h>
-#include <sys/types.h>
-#include <sys/stat.h>
 #include <sys/param.h>
 
 #include "pwupd.h"
 
 static char * grpath = _PATH_PWD;
 
 int
 setgrdir(const char * dir)
 {
 	if (dir == NULL)
 		return -1;
 	else
 		grpath = strdup(dir);
 	if (grpath == NULL)
 		return -1;
 
 	return 0;
 }
 
 char *
 getgrpath(const char * file)
 {
 	static char pathbuf[MAXPATHLEN];
 
 	snprintf(pathbuf, sizeof pathbuf, "%s/%s", grpath, file);
 	return pathbuf;
 }
 
 static int
 gr_update(struct group * grp, char const * group)
 {
 	int pfd, tfd;
 	struct group *gr = NULL;
 	struct group *old_gr = NULL;
 
 	if (grp != NULL)
 		gr = gr_dup(grp);
 
 	if (group != NULL)
 		old_gr = GETGRNAM(group);
 
 	if (gr_init(grpath, NULL))
 		err(1, "gr_init()");
 
 	if ((pfd = gr_lock()) == -1) {
 		gr_fini();
 		err(1, "gr_lock()");
 	}
 	if ((tfd = gr_tmp(-1)) == -1) {
 		gr_fini();
 		err(1, "gr_tmp()");
 	}
 	if (gr_copy(pfd, tfd, gr, old_gr) == -1) {
 		gr_fini();
 		err(1, "gr_copy()");
 	}
 	if (gr_mkdb() == -1) {
 		gr_fini();
 		err(1, "gr_mkdb()");
 	}
 	free(gr);
 	gr_fini();
 	return 0;
 }
 
 
 int
 addgrent(struct group * grp)
 {
 	return gr_update(grp, NULL);
 }
 
 int
 chggrent(char const * login, struct group * grp)
 {
 	return gr_update(grp, login);
 }
 
 int
 delgrent(struct group * grp)
 {
 	char group[MAXLOGNAME];
 
 	strlcpy(group, grp->gr_name, MAXLOGNAME);
 
 	return gr_update(NULL, group);
 }
Index: projects/release-arm-redux/usr.sbin/pw/pw_conf.c
===================================================================
--- projects/release-arm-redux/usr.sbin/pw/pw_conf.c	(revision 282691)
+++ projects/release-arm-redux/usr.sbin/pw/pw_conf.c	(revision 282692)
@@ -1,503 +1,501 @@
 /*-
  * Copyright (C) 1996
  *	David L. Nugent.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY DAVID L. NUGENT AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL DAVID L. NUGENT OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #ifndef lint
 static const char rcsid[] =
   "$FreeBSD$";
 #endif /* not lint */
 
+#include <sys/types.h>
+#include <sys/sbuf.h>
 #include <string.h>
 #include <ctype.h>
 #include <fcntl.h>
 
 #include "pw.h"
 
 #define debugging 0
 
 enum {
 	_UC_NONE,
 	_UC_DEFAULTPWD,
 	_UC_REUSEUID,
 	_UC_REUSEGID,
 	_UC_NISPASSWD,
 	_UC_DOTDIR,
 	_UC_NEWMAIL,
 	_UC_LOGFILE,
 	_UC_HOMEROOT,
 	_UC_HOMEMODE,
 	_UC_SHELLPATH,
 	_UC_SHELLS,
 	_UC_DEFAULTSHELL,
 	_UC_DEFAULTGROUP,
 	_UC_EXTRAGROUPS,
 	_UC_DEFAULTCLASS,
 	_UC_MINUID,
 	_UC_MAXUID,
 	_UC_MINGID,
 	_UC_MAXGID,
 	_UC_EXPIRE,
 	_UC_PASSWORD,
 	_UC_FIELDS
 };
 
 static char     bourne_shell[] = "sh";
 
 static char    *system_shells[_UC_MAXSHELLS] =
 {
 	bourne_shell,
 	"csh",
 	"tcsh"
 };
 
 static char const *booltrue[] =
 {
 	"yes", "true", "1", "on", NULL
 };
 static char const *boolfalse[] =
 {
 	"no", "false", "0", "off", NULL
 };
 
 static struct userconf config =
 {
 	0,			/* Default password for new users? (nologin) */
 	0,			/* Reuse uids? */
 	0,			/* Reuse gids? */
 	NULL,			/* NIS version of the passwd file */
 	"/usr/share/skel",	/* Where to obtain skeleton files */
 	NULL,			/* Mail to send to new accounts */
 	"/var/log/userlog",	/* Where to log changes */
 	"/home",		/* Where to create home directory */
 	_DEF_DIRMODE,		/* Home directory perms, modified by umask */
 	"/bin",			/* Where shells are located */
 	system_shells,		/* List of shells (first is default) */
 	bourne_shell,		/* Default shell */
 	NULL,			/* Default group name */
 	NULL,			/* Default (additional) groups */
 	NULL,			/* Default login class */
 	1000, 32000,		/* Allowed range of uids */
 	1000, 32000,		/* Allowed range of gids */
 	0,			/* Days until account expires */
 	0,			/* Days until password expires */
 	0			/* size of default_group array */
 };
 
 static char const *comments[_UC_FIELDS] =
 {
 	"#\n# pw.conf - user/group configuration defaults\n#\n",
 	"\n# Password for new users? no=nologin yes=loginid none=blank random=random\n",
 	"\n# Reuse gaps in uid sequence? (yes or no)\n",
 	"\n# Reuse gaps in gid sequence? (yes or no)\n",
 	"\n# Path to the NIS passwd file (blank or 'no' for none)\n",
 	"\n# Obtain default dotfiles from this directory\n",
 	"\n# Mail this file to new user (/etc/newuser.msg or no)\n",
 	"\n# Log add/change/remove information in this file\n",
 	"\n# Root directory in which $HOME directory is created\n",
 	"\n# Mode for the new $HOME directory, will be modified by umask\n",
 	"\n# Colon separated list of directories containing valid shells\n",
 	"\n# Comma separated list of available shells (without paths)\n",
 	"\n# Default shell (without path)\n",
 	"\n# Default group (leave blank for new group per user)\n",
 	"\n# Extra groups for new users\n",
 	"\n# Default login class for new users\n",
 	"\n# Range of valid default user ids\n",
 	NULL,
 	"\n# Range of valid default group ids\n",
 	NULL,
 	"\n# Days after which account expires (0=disabled)\n",
 	"\n# Days after which password expires (0=disabled)\n"
 };
 
 static char const *kwds[] =
 {
 	"",
 	"defaultpasswd",
 	"reuseuids",
 	"reusegids",
 	"nispasswd",
 	"skeleton",
 	"newmail",
 	"logfile",
 	"home",
 	"homemode",
 	"shellpath",
 	"shells",
 	"defaultshell",
 	"defaultgroup",
 	"extragroups",
 	"defaultclass",
 	"minuid",
 	"maxuid",
 	"mingid",
 	"maxgid",
 	"expire_days",
 	"password_days",
 	NULL
 };
 
 static char    *
 unquote(char const * str)
 {
 	if (str && (*str == '"' || *str == '\'')) {
 		char           *p = strchr(str + 1, *str);
 
 		if (p != NULL)
 			*p = '\0';
 		return (char *) (*++str ? str : NULL);
 	}
 	return (char *) str;
 }
 
 int
 boolean_val(char const * str, int dflt)
 {
 	if ((str = unquote(str)) != NULL) {
 		int             i;
 
 		for (i = 0; booltrue[i]; i++)
 			if (strcmp(str, booltrue[i]) == 0)
 				return 1;
 		for (i = 0; boolfalse[i]; i++)
 			if (strcmp(str, boolfalse[i]) == 0)
 				return 0;
 
 		/*
 		 * Special cases for defaultpasswd
 		 */
 		if (strcmp(str, "random") == 0)
 			return -1;
 		if (strcmp(str, "none") == 0)
 			return -2;
 	}
 	return dflt;
 }
 
 char const     *
 boolean_str(int val)
 {
 	if (val == -1)
 		return "random";
 	else if (val == -2)
 		return "none";
 	else
 		return val ? booltrue[0] : boolfalse[0];
 }
 
 char           *
 newstr(char const * p)
 {
 	char           *q = NULL;
 
 	if ((p = unquote(p)) != NULL) {
 		int             l = strlen(p) + 1;
 
 		if ((q = malloc(l)) != NULL)
 			memcpy(q, p, l);
 	}
 	return q;
 }
 
 #define LNBUFSZ 1024
 
 
 struct userconf *
 read_userconfig(char const * file)
 {
 	FILE	*fp;
 	char	*buf, *p;
 	size_t	linecap;
 	ssize_t	linelen;
 
 	buf = NULL;
 	linecap = 0;
 
 	extendarray(&config.groups, &config.numgroups, 200);
 	memset(config.groups, 0, config.numgroups * sizeof(char *));
 	if (file == NULL)
 		file = _PATH_PW_CONF;
 
 	if ((fp = fopen(file, "r")) != NULL) {
 		while ((linelen = getline(&buf, &linecap, fp)) > 0) {
 			if (*buf && (p = strtok(buf, " \t\r\n=")) != NULL && *p != '#') {
 				static char const toks[] = " \t\r\n,=";
 				char           *q = strtok(NULL, toks);
 				int             i = 0;
 				mode_t          *modeset;
 
 				while (i < _UC_FIELDS && strcmp(p, kwds[i]) != 0)
 					++i;
 #if debugging
 				if (i == _UC_FIELDS)
 					printf("Got unknown kwd `%s' val=`%s'\n", p, q ? q : "");
 				else
 					printf("Got kwd[%s]=%s\n", p, q);
 #endif
 				switch (i) {
 				case _UC_DEFAULTPWD:
 					config.default_password = boolean_val(q, 1);
 					break;
 				case _UC_REUSEUID:
 					config.reuse_uids = boolean_val(q, 0);
 					break;
 				case _UC_REUSEGID:
 					config.reuse_gids = boolean_val(q, 0);
 					break;
 				case _UC_NISPASSWD:
 					config.nispasswd = (q == NULL || !boolean_val(q, 1))
 						? NULL : newstr(q);
 					break;
 				case _UC_DOTDIR:
 					config.dotdir = (q == NULL || !boolean_val(q, 1))
 						? NULL : newstr(q);
 					break;
 				case _UC_NEWMAIL:
 					config.newmail = (q == NULL || !boolean_val(q, 1))
 						? NULL : newstr(q);
 					break;
 				case _UC_LOGFILE:
 					config.logfile = (q == NULL || !boolean_val(q, 1))
 						? NULL : newstr(q);
 					break;
 				case _UC_HOMEROOT:
 					config.home = (q == NULL || !boolean_val(q, 1))
 						? "/home" : newstr(q);
 					break;
 				case _UC_HOMEMODE:
 					modeset = setmode(q);
 					config.homemode = (q == NULL || !boolean_val(q, 1))
 						? _DEF_DIRMODE : getmode(modeset, _DEF_DIRMODE);
 					free(modeset);
 					break;
 				case _UC_SHELLPATH:
 					config.shelldir = (q == NULL || !boolean_val(q, 1))
 						? "/bin" : newstr(q);
 					break;
 				case _UC_SHELLS:
 					for (i = 0; i < _UC_MAXSHELLS && q != NULL; i++, q = strtok(NULL, toks))
 						system_shells[i] = newstr(q);
 					if (i > 0)
 						while (i < _UC_MAXSHELLS)
 							system_shells[i++] = NULL;
 					break;
 				case _UC_DEFAULTSHELL:
 					config.shell_default = (q == NULL || !boolean_val(q, 1))
 						? (char *) bourne_shell : newstr(q);
 					break;
 				case _UC_DEFAULTGROUP:
 					q = unquote(q);
 					config.default_group = (q == NULL || !boolean_val(q, 1) || GETGRNAM(q) == NULL)
 						? NULL : newstr(q);
 					break;
 				case _UC_EXTRAGROUPS:
 					for (i = 0; q != NULL; q = strtok(NULL, toks)) {
 						if (extendarray(&config.groups, &config.numgroups, i + 2) != -1)
 							config.groups[i++] = newstr(q);
 					}
 					if (i > 0)
 						while (i < config.numgroups)
 							config.groups[i++] = NULL;
 					break;
 				case _UC_DEFAULTCLASS:
 					config.default_class = (q == NULL || !boolean_val(q, 1))
 						? NULL : newstr(q);
 					break;
 				case _UC_MINUID:
 					if ((q = unquote(q)) != NULL && isdigit(*q))
 						config.min_uid = (uid_t) atol(q);
 					break;
 				case _UC_MAXUID:
 					if ((q = unquote(q)) != NULL && isdigit(*q))
 						config.max_uid = (uid_t) atol(q);
 					break;
 				case _UC_MINGID:
 					if ((q = unquote(q)) != NULL && isdigit(*q))
 						config.min_gid = (gid_t) atol(q);
 					break;
 				case _UC_MAXGID:
 					if ((q = unquote(q)) != NULL && isdigit(*q))
 						config.max_gid = (gid_t) atol(q);
 					break;
 				case _UC_EXPIRE:
 					if ((q = unquote(q)) != NULL && isdigit(*q))
 						config.expire_days = atoi(q);
 					break;
 				case _UC_PASSWORD:
 					if ((q = unquote(q)) != NULL && isdigit(*q))
 						config.password_days = atoi(q);
 					break;
 				case _UC_FIELDS:
 				case _UC_NONE:
 					break;
 				}
 			}
 		}
 		if (linecap > 0)
 			free(buf);
 		fclose(fp);
 	}
 	return &config;
 }
 
 
 int
 write_userconfig(char const * file)
 {
 	int             fd;
+	struct sbuf	*buf;
 
 	if (file == NULL)
 		file = _PATH_PW_CONF;
 
 	if ((fd = open(file, O_CREAT | O_RDWR | O_TRUNC | O_EXLOCK, 0644)) != -1) {
 		FILE           *fp;
 
 		if ((fp = fdopen(fd, "w")) == NULL)
 			close(fd);
 		else {
-			int             i, j, k;
-			int		len = LNBUFSZ;
-			char           *buf = malloc(len);
-
+			int             i, j;
+			
+			buf = sbuf_new_auto();
 			for (i = _UC_NONE; i < _UC_FIELDS; i++) {
 				int             quote = 1;
-				char const     *val = buf;
 
-				*buf = '\0';
+				sbuf_clear(buf);
 				switch (i) {
 				case _UC_DEFAULTPWD:
-					val = boolean_str(config.default_password);
+					sbuf_cat(buf, boolean_str(config.default_password));
 					break;
 				case _UC_REUSEUID:
-					val = boolean_str(config.reuse_uids);
+					sbuf_cat(buf, boolean_str(config.reuse_uids));
 					break;
 				case _UC_REUSEGID:
-					val = boolean_str(config.reuse_gids);
+					sbuf_cat(buf, boolean_str(config.reuse_gids));
 					break;
 				case _UC_NISPASSWD:
-					val = config.nispasswd ? config.nispasswd : "";
+					sbuf_cat(buf, config.nispasswd ?
+					    config.nispasswd : "");
 					quote = 0;
 					break;
 				case _UC_DOTDIR:
-					val = config.dotdir ? config.dotdir : boolean_str(0);
+					sbuf_cat(buf, config.dotdir ?
+					    config.dotdir : boolean_str(0));
 					break;
 				case _UC_NEWMAIL:
-					val = config.newmail ? config.newmail : boolean_str(0);
+					sbuf_cat(buf, config.newmail ?
+					    config.newmail : boolean_str(0));
 					break;
 				case _UC_LOGFILE:
-					val = config.logfile ? config.logfile : boolean_str(0);
+					sbuf_cat(buf, config.logfile ?
+					    config.logfile : boolean_str(0));
 					break;
 				case _UC_HOMEROOT:
-					val = config.home;
+					sbuf_cat(buf, config.home);
 					break;
 				case _UC_HOMEMODE:
-					sprintf(buf, "%04o", config.homemode);
+					sbuf_printf(buf, "%04o", config.homemode);
 					quote = 0;
 					break;
 				case _UC_SHELLPATH:
-					val = config.shelldir;
+					sbuf_cat(buf, config.shelldir);
 					break;
 				case _UC_SHELLS:
-					for (j = k = 0; j < _UC_MAXSHELLS && system_shells[j] != NULL; j++) {
-						char	lbuf[64];
-						int	l = snprintf(lbuf, sizeof lbuf, "%s\"%s\"", k ? "," : "", system_shells[j]);
-						if (l < 0)
-							l = 0;
-						if (l + k + 1 < len || extendline(&buf, &len, len + LNBUFSZ) != -1) {
-							strcpy(buf + k, lbuf);
-							k += l;
-						}
+					for (j = 0; j < _UC_MAXSHELLS &&
+					    system_shells[j] != NULL; j++) {
+						sbuf_printf(buf, "%s\"%s\"", j ?
+						     "," : "", system_shells[j]);
 					}
 					quote = 0;
 					break;
 				case _UC_DEFAULTSHELL:
-					val = config.shell_default ? config.shell_default : bourne_shell;
+					sbuf_cat(buf, config.shell_default ?
+					    config.shell_default : bourne_shell);
 					break;
 				case _UC_DEFAULTGROUP:
-					val = config.default_group ? config.default_group : "";
+					sbuf_cat(buf, config.default_group ?
+					    config.default_group : "");
 					break;
 				case _UC_EXTRAGROUPS:
 					extendarray(&config.groups, &config.numgroups, 200);
-					for (j = k = 0; j < config.numgroups && config.groups[j] != NULL; j++) {
-						char	lbuf[64];
-						int	l = snprintf(lbuf, sizeof lbuf, "%s\"%s\"", k ? "," : "", config.groups[j]);
-						if (l < 0)
-							l = 0;
-						if (l + k + 1 < len || extendline(&buf, &len, len + 1024) != -1) {
-							strcpy(buf + k, lbuf);
-							k +=  l;
-						}
-					}
+					for (j = 0; j < config.numgroups &&
+					    config.groups[j] != NULL; j++)
+						sbuf_printf(buf, "%s\"%s\"", j ?
+						     "," : "", config.groups[j]);
 					quote = 0;
 					break;
 				case _UC_DEFAULTCLASS:
-					val = config.default_class ? config.default_class : "";
+					sbuf_cat(buf, config.default_class ?
+					    config.default_class : "");
 					break;
 				case _UC_MINUID:
-					sprintf(buf, "%lu", (unsigned long) config.min_uid);
+					sbuf_printf(buf, "%lu", (unsigned long) config.min_uid);
 					quote = 0;
 					break;
 				case _UC_MAXUID:
-					sprintf(buf, "%lu", (unsigned long) config.max_uid);
+					sbuf_printf(buf, "%lu", (unsigned long) config.max_uid);
 					quote = 0;
 					break;
 				case _UC_MINGID:
-					sprintf(buf, "%lu", (unsigned long) config.min_gid);
+					sbuf_printf(buf, "%lu", (unsigned long) config.min_gid);
 					quote = 0;
 					break;
 				case _UC_MAXGID:
-					sprintf(buf, "%lu", (unsigned long) config.max_gid);
+					sbuf_printf(buf, "%lu", (unsigned long) config.max_gid);
 					quote = 0;
 					break;
 				case _UC_EXPIRE:
-					sprintf(buf, "%d", config.expire_days);
+					sbuf_printf(buf, "%d", config.expire_days);
 					quote = 0;
 					break;
 				case _UC_PASSWORD:
-					sprintf(buf, "%d", config.password_days);
+					sbuf_printf(buf, "%d", config.password_days);
 					quote = 0;
 					break;
 				case _UC_NONE:
 					break;
 				}
+				sbuf_finish(buf);
 
 				if (comments[i])
 					fputs(comments[i], fp);
 
 				if (*kwds[i]) {
 					if (quote)
-						fprintf(fp, "%s = \"%s\"\n", kwds[i], val);
+						fprintf(fp, "%s = \"%s\"\n", kwds[i], sbuf_data(buf));
 					else
-						fprintf(fp, "%s = %s\n", kwds[i], val);
+						fprintf(fp, "%s = %s\n", kwds[i], sbuf_data(buf));
 #if debugging
-					printf("WROTE: %s = %s\n", kwds[i], val);
+					printf("WROTE: %s = %s\n", kwds[i], sbuf_data(buf));
 #endif
 				}
 			}
-			free(buf);
+			sbuf_delete(buf);
 			return fclose(fp) != EOF;
 		}
 	}
 	return 0;
 }
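
Reviewer note: the write_userconfig() change above replaces the hand-managed line buffer with the sbuf calls shown in the hunk (sbuf_new_auto, sbuf_clear, sbuf_cat/sbuf_printf, sbuf_finish, sbuf_data, sbuf_delete). Because sbuf is FreeBSD-specific, the sketch below uses POSIX open_memstream() as a stand-in to show the same idea on other systems: append formatted pieces to a buffer that grows by itself, then read the result once it is finalised. join_quoted() is a hypothetical name and this is not the code used by pw.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build "a","b","c" from a NULL-terminated list.
 * Returns a malloc'ed string the caller must free(), or NULL on failure.
 */
static char *
join_quoted(char *const *items)
{
	char *out = NULL;
	size_t len = 0;
	FILE *ms;
	int j;

	assert(items != NULL);
	if ((ms = open_memstream(&out, &len)) == NULL)
		return NULL;
	/* Same separator logic as the _UC_SHELLS loop: comma before all but the first. */
	for (j = 0; items[j] != NULL; j++)
		fprintf(ms, "%s\"%s\"", j ? "," : "", items[j]);
	/* fclose() finalises the buffer, much as sbuf_finish() does for an sbuf. */
	if (fclose(ms) != 0) {
		free(out);
		return NULL;
	}
	return out;
}
```

In both versions the caller no longer tracks a length or calls a resize helper, so the fixed 64-byte lbuf and the k/len bookkeeping from the old loop are gone.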
Index: projects/release-arm-redux/usr.sbin/pw/pw_nis.c
===================================================================
--- projects/release-arm-redux/usr.sbin/pw/pw_nis.c	(revision 282691)
+++ projects/release-arm-redux/usr.sbin/pw/pw_nis.c	(revision 282692)
@@ -1,96 +1,93 @@
 /*-
  * Copyright (C) 1996
  *	David L. Nugent.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY DAVID L. NUGENT AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL DAVID L. NUGENT OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
 
 #ifndef lint
 static const char rcsid[] =
   "$FreeBSD$";
 #endif /* not lint */
 
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
 #include <sys/types.h>
 #include <err.h>
 #include <pwd.h>
 #include <libutil.h>
 
 #include "pw.h"
 
 static int
 pw_nisupdate(const char * path, struct passwd * pwd, char const * user)
 {
 	int pfd, tfd;
 	struct passwd *pw = NULL;
 	struct passwd *old_pw = NULL;
 
 	if (pwd != NULL)
 		pw = pw_dup(pwd);
 
 	if (user != NULL)
 		old_pw = GETPWNAM(user);
 
 	if (pw_init(NULL, path))
 		err(1,"pw_init()");
 	if ((pfd = pw_lock()) == -1) {
 		pw_fini();
 		err(1, "pw_lock()");
 	}
 	if ((tfd = pw_tmp(-1)) == -1) {
 		pw_fini();
 		err(1, "pw_tmp()");
 	}
 	if (pw_copy(pfd, tfd, pw, old_pw) == -1) {
 		pw_fini();
 		err(1, "pw_copy()");
 	}
 	if (chmod(pw_tempname(), 0644) == -1)
 		err(1, "chmod()");
 	if (rename(pw_tempname(), path) == -1)
 		err(1, "rename()");
 
 	free(pw);
 	pw_fini();
 
 	return (0);
 }
 
 int
 addnispwent(const char *path, struct passwd * pwd)
 {
 	return pw_nisupdate(path, pwd, NULL);
 }
 
 int
 chgnispwent(const char *path, char const * login, struct passwd * pwd)
 {
 	return pw_nisupdate(path, pwd, login);
 }
 
 int
 delnispwent(const char *path, const char *login)
 {
 	return pw_nisupdate(path, NULL, login);
 }
Index: projects/release-arm-redux/usr.sbin/pw/pw_user.c
===================================================================
--- projects/release-arm-redux/usr.sbin/pw/pw_user.c	(revision 282691)
+++ projects/release-arm-redux/usr.sbin/pw/pw_user.c	(revision 282692)
@@ -1,1345 +1,1341 @@
 /*-
  * Copyright (C) 1996
  *	David L. Nugent.  All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY DAVID L. NUGENT AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.  IN NO EVENT SHALL DAVID L. NUGENT OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  * 
  */
 
 #ifndef lint
 static const char rcsid[] =
   "$FreeBSD$";
 #endif /* not lint */
 
 #include <ctype.h>
 #include <err.h>
 #include <fcntl.h>
 #include <sys/param.h>
 #include <dirent.h>
 #include <paths.h>
 #include <termios.h>
 #include <sys/types.h>
 #include <sys/time.h>
 #include <sys/resource.h>
-#include <unistd.h>
 #include <login_cap.h>
 #include <pwd.h>
 #include <grp.h>
 #include <libutil.h>
 #include "pw.h"
 #include "bitmap.h"
 
 #define LOGNAMESIZE (MAXLOGNAME-1)
 
 static		char locked_str[] = "*LOCKED*";
 
 static int      print_user(struct passwd * pwd, int pretty, int v7);
 static uid_t    pw_uidpolicy(struct userconf * cnf, struct cargs * args);
 static uid_t    pw_gidpolicy(struct userconf * cnf, struct cargs * args, char *nam, gid_t prefer);
 static time_t   pw_pwdpolicy(struct userconf * cnf, struct cargs * args);
 static time_t   pw_exppolicy(struct userconf * cnf, struct cargs * args);
 static char    *pw_homepolicy(struct userconf * cnf, struct cargs * args, char const * user);
 static char    *pw_shellpolicy(struct userconf * cnf, struct cargs * args, char *newshell);
 static char    *pw_password(struct userconf * cnf, struct cargs * args, char const * user);
 static char    *shell_path(char const * path, char *shells[], char *sh);
 static void     rmat(uid_t uid);
 static void     rmopie(char const * name);
 
 /*-
  * -C config      configuration file
  * -q             quiet operation
  * -n name        login name
  * -u uid         user id
  * -c comment     user name/comment
  * -d directory   home directory
  * -e date        account expiry date
  * -p date        password expiry date
  * -g grp         primary group
  * -G grp1,grp2   additional groups
  * -m [ -k dir ]  create and set up home
  * -s shell       name of login shell
  * -o             duplicate uid ok
  * -L class       user class
  * -l name        new login name
  * -h fd          password filehandle
  * -H fd          encrypted password filehandle
  * -F             force print or add
  *   Setting defaults:
  * -D             set user defaults
  * -b dir         default home root dir
  * -e period      default expiry period
  * -p period      default password change period
  * -g group       default group
  * -G grp1,grp2   default additional groups
  * -L class       default login class
  * -k dir         default home skeleton
  * -s shell       default shell
  * -w method      default password method
  */
 
 int
 pw_user(struct userconf * cnf, int mode, struct cargs * args)
 {
 	int	        rc, edited = 0;
 	char           *p = NULL;
 	char           *passtmp;
 	struct carg    *a_name;
 	struct carg    *a_uid;
 	struct carg    *arg;
 	struct passwd  *pwd = NULL;
 	struct group   *grp;
 	struct stat     st;
 	char            line[_PASSWORD_LEN+1];
 	FILE	       *fp;
 	char *dmode_c;
 	void *set = NULL;
 
 	static struct passwd fakeuser =
 	{
 		NULL,
 		"*",
 		-1,
 		-1,
 		0,
 		"",
 		"User &",
 		"/nonexistent",
 		"/bin/sh",
 		0
 #if defined(__FreeBSD__)
 		,0
 #endif
 	};
 
 
 	/*
 	 * With M_NEXT, we only need to return the
 	 * next uid to stdout
 	 */
 	if (mode == M_NEXT)
 	{
 		uid_t next = pw_uidpolicy(cnf, args);
 		if (getarg(args, 'q'))
 			return next;
 		printf("%ld:", (long)next);
 		pw_group(cnf, mode, args);
 		return EXIT_SUCCESS;
 	}
 
 	/*
 	 * We can do all of the common legwork here
 	 */
 
 	if ((arg = getarg(args, 'b')) != NULL) {
 		cnf->home = arg->val;
 	}
 
 	if ((arg = getarg(args, 'M')) != NULL) {
 		dmode_c = arg->val;
 		if ((set = setmode(dmode_c)) == NULL)
 			errx(EX_DATAERR, "invalid directory creation mode '%s'",
 			    dmode_c);
 		cnf->homemode = getmode(set, _DEF_DIRMODE);
 		free(set);
 	}
 
 	/*
 	 * If we'll need to use it or we're updating it,
 	 * then create the base home directory if necessary
 	 */
 	if (arg != NULL || getarg(args, 'm') != NULL) {
 		int	l = strlen(cnf->home);
 
 		if (l > 1 && cnf->home[l-1] == '/')	/* Shave off any trailing path delimiter */
 			cnf->home[--l] = '\0';
 
 		if (l < 2 || *cnf->home != '/')		/* Check for absolute path name */
 			errx(EX_DATAERR, "invalid base directory for home '%s'", cnf->home);
 
 		if (stat(cnf->home, &st) == -1) {
 			char	dbuf[MAXPATHLEN];
 
 			/*
 			 * This is a kludge especially for Joerg :)
 			 * If the home directory would be created in the root partition,
 			 * create it under /usr instead, which is likely to have more space,
 			 * and make cnf->home a symlink to it (e.g. /home -> usr/home).
 			 */
 			if (strchr(cnf->home+1, '/') == NULL) {
-				strcpy(dbuf, "/usr");
-				strncat(dbuf, cnf->home, MAXPATHLEN-5);
+				snprintf(dbuf, MAXPATHLEN, "/usr%s", cnf->home);
 				if (mkdir(dbuf, _DEF_DIRMODE) != -1 || errno == EEXIST) {
 					chown(dbuf, 0, 0);
 					/*
 					 * Skip first "/" and create symlink:
 					 * /home -> usr/home
 					 */
 					symlink(dbuf+1, cnf->home);
 				}
 				/* If this fails, fall back to old method */
 			}
 			strlcpy(dbuf, cnf->home, sizeof(dbuf));
 			p = dbuf;
 			if (stat(dbuf, &st) == -1) {
 				while ((p = strchr(p + 1, '/')) != NULL) {
 					*p = '\0';
 					if (stat(dbuf, &st) == -1) {
 						if (mkdir(dbuf, _DEF_DIRMODE) == -1)
 							goto direrr;
 						chown(dbuf, 0, 0);
 					} else if (!S_ISDIR(st.st_mode))
 						errx(EX_OSFILE, "'%s' (root home parent) is not a directory", dbuf);
 					*p = '/';
 				}
 			}
 			if (stat(dbuf, &st) == -1) {
 				if (mkdir(dbuf, _DEF_DIRMODE) == -1) {
 				direrr:	err(EX_OSFILE, "mkdir '%s'", dbuf);
 				}
 				chown(dbuf, 0, 0);
 			}
 		} else if (!S_ISDIR(st.st_mode))
 			errx(EX_OSFILE, "root home `%s' is not a directory", cnf->home);
 	}
 
 	if ((arg = getarg(args, 'e')) != NULL)
 		cnf->expire_days = atoi(arg->val);
 
 	if ((arg = getarg(args, 'y')) != NULL)
 		cnf->nispasswd = arg->val;
 
 	if ((arg = getarg(args, 'p')) != NULL && arg->val)
 		cnf->password_days = atoi(arg->val);
 
 	if ((arg = getarg(args, 'g')) != NULL) {
 		if (!*(p = arg->val))	/* Handle empty group list specially */
 			cnf->default_group = "";
 		else {
 			if ((grp = GETGRNAM(p)) == NULL) {
 				if (!isdigit((unsigned char)*p) || (grp = GETGRGID((gid_t) atoi(p))) == NULL)
 					errx(EX_NOUSER, "group `%s' does not exist", p);
 			}
 			cnf->default_group = newstr(grp->gr_name);
 		}
 	}
 	if ((arg = getarg(args, 'L')) != NULL)
 		cnf->default_class = pw_checkname((u_char *)arg->val, 0);
 
 	if ((arg = getarg(args, 'G')) != NULL && arg->val) {
 		int i = 0;
 
 		for (p = strtok(arg->val, ", \t"); p != NULL; p = strtok(NULL, ", \t")) {
 			if ((grp = GETGRNAM(p)) == NULL) {
 				if (!isdigit((unsigned char)*p) || (grp = GETGRGID((gid_t) atoi(p))) == NULL)
 					errx(EX_NOUSER, "group `%s' does not exist", p);
 			}
 			if (extendarray(&cnf->groups, &cnf->numgroups, i + 2) != -1)
 				cnf->groups[i++] = newstr(grp->gr_name);
 		}
 		while (i < cnf->numgroups)
 			cnf->groups[i++] = NULL;
 	}
 
 	if ((arg = getarg(args, 'k')) != NULL) {
 		if (stat(cnf->dotdir = arg->val, &st) == -1 || !S_ISDIR(st.st_mode))
 			errx(EX_OSFILE, "skeleton `%s' is not a directory or does not exist", cnf->dotdir);
 	}
 
 	if ((arg = getarg(args, 's')) != NULL)
 		cnf->shell_default = arg->val;
 
 	if ((arg = getarg(args, 'w')) != NULL)
 		cnf->default_password = boolean_val(arg->val, cnf->default_password);
 	if (mode == M_ADD && getarg(args, 'D')) {
 		if (getarg(args, 'n') != NULL)
 			errx(EX_DATAERR, "can't combine `-D' with `-n name'");
 		if ((arg = getarg(args, 'u')) != NULL && (p = strtok(arg->val, ", \t")) != NULL) {
 			if ((cnf->min_uid = (uid_t) atoi(p)) == 0)
 				cnf->min_uid = 1000;
 			if ((p = strtok(NULL, " ,\t")) == NULL || (cnf->max_uid = (uid_t) atoi(p)) < cnf->min_uid)
 				cnf->max_uid = 32000;
 		}
 		if ((arg = getarg(args, 'i')) != NULL && (p = strtok(arg->val, ", \t")) != NULL) {
 			if ((cnf->min_gid = (gid_t) atoi(p)) == 0)
 				cnf->min_gid = 1000;
 			if ((p = strtok(NULL, " ,\t")) == NULL || (cnf->max_gid = (gid_t) atoi(p)) < cnf->min_gid)
 				cnf->max_gid = 32000;
 		}
 
 		arg = getarg(args, 'C');
 		if (write_userconfig(arg ? arg->val : NULL))
 			return EXIT_SUCCESS;
 		warn("config update");
 		return EX_IOERR;
 	}
 
 	if (mode == M_PRINT && getarg(args, 'a')) {
 		int             pretty = getarg(args, 'P') != NULL;
 		int		v7 = getarg(args, '7') != NULL;
 		SETPWENT();
 		while ((pwd = GETPWENT()) != NULL)
 			print_user(pwd, pretty, v7);
 		ENDPWENT();
 		return EXIT_SUCCESS;
 	}
 
 	if ((a_name = getarg(args, 'n')) != NULL)
 		pwd = GETPWNAM(pw_checkname((u_char *)a_name->val, 0));
 	a_uid = getarg(args, 'u');
 
 	if (a_uid == NULL) {
 		if (a_name == NULL)
 			errx(EX_DATAERR, "user name or id required");
 
 		/*
 		 * Determine whether 'n' switch is name or uid - we don't
 		 * really care which we have, but we need to
 		 * know.
 		 */
 		if (mode != M_ADD && pwd == NULL
 		    && strspn(a_name->val, "0123456789") == strlen(a_name->val)
 		    && *a_name->val) {
 			(a_uid = a_name)->ch = 'u';
 			a_name = NULL;
 		}
 	} else {
 		if (strspn(a_uid->val, "0123456789") != strlen(a_uid->val))
 			errx(EX_USAGE, "-u expects a number");
 	}
 
 	/*
 	 * Update, delete & print require that the user exists
 	 */
 	if (mode == M_UPDATE || mode == M_DELETE ||
 	    mode == M_PRINT  || mode == M_LOCK   || mode == M_UNLOCK) {
 
 		if (a_name == NULL && pwd == NULL)	/* Try harder */
 			pwd = GETPWUID(atoi(a_uid->val));
 
 		if (pwd == NULL) {
 			if (mode == M_PRINT && getarg(args, 'F')) {
 				fakeuser.pw_name = a_name ? a_name->val : "nouser";
 				fakeuser.pw_uid = a_uid ? (uid_t) atol(a_uid->val) : -1;
 				return print_user(&fakeuser,
 						  getarg(args, 'P') != NULL,
 						  getarg(args, '7') != NULL);
 			}
 			if (a_name == NULL)
 				errx(EX_NOUSER, "no such uid `%s'", a_uid->val);
 			errx(EX_NOUSER, "no such user `%s'", a_name->val);
 		}
 
 		if (a_name == NULL)	/* May be needed later */
 			a_name = addarg(args, 'n', newstr(pwd->pw_name));
 
 		/*
 		 * The M_LOCK and M_UNLOCK functions simply add or remove
 		 * a "*LOCKED*" prefix from in front of the password to
 		 * prevent it decoding correctly, and therefore prevents
 		 * access. Of course, this only prevents access via
 		 * password authentication (not ssh, kerberos or any
 		 * other method that does not use the UNIX password) but
 		 * that is a known limitation.
 		 */
 
 		if (mode == M_LOCK) {
 			if (strncmp(pwd->pw_passwd, locked_str, sizeof(locked_str)-1) == 0)
 				errx(EX_DATAERR, "user '%s' is already locked", pwd->pw_name);
-			passtmp = malloc(strlen(pwd->pw_passwd) + sizeof(locked_str));
+			asprintf(&passtmp, "%s%s", locked_str, pwd->pw_passwd);
 			if (passtmp == NULL)	/* disaster */
 				errx(EX_UNAVAILABLE, "out of memory");
-			strcpy(passtmp, locked_str);
-			strcat(passtmp, pwd->pw_passwd);
 			pwd->pw_passwd = passtmp;
 			edited = 1;
 		} else if (mode == M_UNLOCK) {
 			if (strncmp(pwd->pw_passwd, locked_str, sizeof(locked_str)-1) != 0)
 				errx(EX_DATAERR, "user '%s' is not locked", pwd->pw_name);
 			pwd->pw_passwd += sizeof(locked_str)-1;
 			edited = 1;
 		} else if (mode == M_DELETE) {
 			/*
 			 * Handle deletions now
 			 */
 			char            file[MAXPATHLEN];
 			char            home[MAXPATHLEN];
 			uid_t           uid = pwd->pw_uid;
 			struct group    *gr;
 			char            grname[LOGNAMESIZE];
 
 			if (strcmp(pwd->pw_name, "root") == 0)
 				errx(EX_DATAERR, "cannot remove user 'root'");
 
 			if (!PWALTDIR()) {
 				/*
 				 * Remove opie record from /etc/opiekeys
 		        	 */
 
 				rmopie(pwd->pw_name);
 
 				/*
 				 * Remove crontabs
 				 */
 				snprintf(file, sizeof(file), "/var/cron/tabs/%s", pwd->pw_name);
 				if (access(file, F_OK) == 0) {
 					sprintf(file, "crontab -u %s -r", pwd->pw_name);
 					system(file);
 				}
 			}
 			/*
 			 * Save these for later, since contents of pwd may be
 			 * invalidated by deletion
 			 */
 			sprintf(file, "%s/%s", _PATH_MAILDIR, pwd->pw_name);
 			strlcpy(home, pwd->pw_dir, sizeof(home));
 			gr = GETGRGID(pwd->pw_gid);
 			if (gr != NULL)
 				strlcpy(grname, gr->gr_name, LOGNAMESIZE);
 			else
 				grname[0] = '\0';
 
 			rc = delpwent(pwd);
 			if (rc == -1)
 				err(EX_IOERR, "user '%s' does not exist", pwd->pw_name);
 			else if (rc != 0) {
 				warn("passwd update");
 				return EX_IOERR;
 			}
 
 			if (cnf->nispasswd && *cnf->nispasswd=='/') {
 				rc = delnispwent(cnf->nispasswd, a_name->val);
 				if (rc == -1)
 					warnx("WARNING: user '%s' does not exist in NIS passwd", pwd->pw_name);
 				else if (rc != 0)
 					warn("WARNING: NIS passwd update");
 				/* non-fatal */
 			}
 
 			grp = GETGRNAM(a_name->val);
 			if (grp != NULL &&
 			    (grp->gr_mem == NULL || *grp->gr_mem == NULL) &&
 			    strcmp(a_name->val, grname) == 0)
 				delgrent(GETGRNAM(a_name->val));
 			SETGRENT();
 			while ((grp = GETGRENT()) != NULL) {
 				int i, j;
 				char group[MAXLOGNAME];
 				if (grp->gr_mem != NULL) {
 					for (i = 0; grp->gr_mem[i] != NULL; i++) {
 						if (!strcmp(grp->gr_mem[i], a_name->val)) {
 							for (j = i; grp->gr_mem[j] != NULL; j++)
 								grp->gr_mem[j] = grp->gr_mem[j+1];
 							strlcpy(group, grp->gr_name, MAXLOGNAME);
 							chggrent(group, grp);
 						}
 					}
 				}
 			}
 			ENDGRENT();
 
 			pw_log(cnf, mode, W_USER, "%s(%ld) account removed", a_name->val, (long) uid);
 
 			if (!PWALTDIR()) {
 				/*
 				 * Remove mail file
 				 */
 				remove(file);
 
 				/*
 				 * Remove at jobs
 				 */
 				if (getpwuid(uid) == NULL)
 					rmat(uid);
 
 				/*
 				 * Remove home directory and contents
 				 */
 				if (getarg(args, 'r') != NULL && *home == '/' && getpwuid(uid) == NULL) {
 					if (stat(home, &st) != -1) {
 						rm_r(home, uid);
 						pw_log(cnf, mode, W_USER, "%s(%ld) home '%s' %sremoved",
 						       a_name->val, (long) uid, home,
 						       stat(home, &st) == -1 ? "" : "not completely ");
 					}
 				}
 			}
 			return EXIT_SUCCESS;
 		} else if (mode == M_PRINT)
 			return print_user(pwd,
 					  getarg(args, 'P') != NULL,
 					  getarg(args, '7') != NULL);
 
 		/*
 		 * The rest is edit code
 		 */
 		if ((arg = getarg(args, 'l')) != NULL) {
 			if (strcmp(pwd->pw_name, "root") == 0)
 				errx(EX_DATAERR, "can't rename `root' account");
 			pwd->pw_name = pw_checkname((u_char *)arg->val, 0);
 			edited = 1;
 		}
 
 		if ((arg = getarg(args, 'u')) != NULL && isdigit((unsigned char)*arg->val)) {
 			pwd->pw_uid = (uid_t) atol(arg->val);
 			edited = 1;
 			if (pwd->pw_uid != 0 && strcmp(pwd->pw_name, "root") == 0)
 				errx(EX_DATAERR, "can't change uid of `root' account");
 			if (pwd->pw_uid == 0 && strcmp(pwd->pw_name, "root") != 0)
 				warnx("WARNING: account `%s' will have a uid of 0 (superuser access!)", pwd->pw_name);
 		}
 
 		if ((arg = getarg(args, 'g')) != NULL && pwd->pw_uid != 0) {	/* Already checked this */
 			gid_t newgid = (gid_t) GETGRNAM(cnf->default_group)->gr_gid;
 			if (newgid != pwd->pw_gid) {
 				edited = 1;
 				pwd->pw_gid = newgid;
 			}
 		}
 
 		if ((arg = getarg(args, 'p')) != NULL) {
 			if (*arg->val == '\0' || strcmp(arg->val, "0") == 0) {
 				if (pwd->pw_change != 0) {
 					pwd->pw_change = 0;
 					edited = 1;
 				}
 			}
 			else {
 				time_t          now = time(NULL);
 				time_t          expire = parse_date(now, arg->val);
 
 				if (pwd->pw_change != expire) {
 					pwd->pw_change = expire;
 					edited = 1;
 				}
 			}
 		}
 
 		if ((arg = getarg(args, 'e')) != NULL) {
 			if (*arg->val == '\0' || strcmp(arg->val, "0") == 0) {
 				if (pwd->pw_expire != 0) {
 					pwd->pw_expire = 0;
 					edited = 1;
 				}
 			}
 			else {
 				time_t          now = time(NULL);
 				time_t          expire = parse_date(now, arg->val);
 
 				if (pwd->pw_expire != expire) {
 					pwd->pw_expire = expire;
 					edited = 1;
 				}
 			}
 		}
 
 		if ((arg = getarg(args, 's')) != NULL) {
 			char *shell = shell_path(cnf->shelldir, cnf->shells, arg->val);
 			if (shell == NULL)
 				shell = "";
 			if (strcmp(shell, pwd->pw_shell) != 0) {
 				pwd->pw_shell = shell;
 				edited = 1;
 			}
 		}
 
 		if (getarg(args, 'L')) {
 			if (cnf->default_class == NULL)
 				cnf->default_class = "";
 			if (strcmp(pwd->pw_class, cnf->default_class) != 0) {
 				pwd->pw_class = cnf->default_class;
 				edited = 1;
 			}
 		}
 
 		if ((arg  = getarg(args, 'd')) != NULL) {
 			if (strcmp(pwd->pw_dir, arg->val))
 				edited = 1;
 			if (stat(pwd->pw_dir = arg->val, &st) == -1) {
 				if (getarg(args, 'm') == NULL && strcmp(pwd->pw_dir, "/nonexistent") != 0)
 				  warnx("WARNING: home `%s' does not exist", pwd->pw_dir);
 			} else if (!S_ISDIR(st.st_mode))
 				warnx("WARNING: home `%s' is not a directory", pwd->pw_dir);
 		}
 
 		if ((arg = getarg(args, 'w')) != NULL &&
 		    getarg(args, 'h') == NULL && getarg(args, 'H') == NULL) {
 			login_cap_t *lc;
 
 			lc = login_getpwclass(pwd);
 			if (lc == NULL ||
 			    login_setcryptfmt(lc, "sha512", NULL) == NULL)
 				warn("setting crypt(3) format");
 			login_close(lc);
 			pwd->pw_passwd = pw_password(cnf, args, pwd->pw_name);
 			edited = 1;
 		}
 
 	} else {
 		login_cap_t *lc;
 
 		/*
 		 * Add code
 		 */
 
 		if (a_name == NULL)	/* Required */
 			errx(EX_DATAERR, "login name required");
 		else if ((pwd = GETPWNAM(a_name->val)) != NULL)	/* Exists */
 			errx(EX_DATAERR, "login name `%s' already exists", a_name->val);
 
 		/*
 		 * Now, set up defaults for a new user
 		 */
 		pwd = &fakeuser;
 		pwd->pw_name = a_name->val;
 		pwd->pw_class = cnf->default_class ? cnf->default_class : "";
 		pwd->pw_uid = pw_uidpolicy(cnf, args);
 		pwd->pw_gid = pw_gidpolicy(cnf, args, pwd->pw_name, (gid_t) pwd->pw_uid);
 		pwd->pw_change = pw_pwdpolicy(cnf, args);
 		pwd->pw_expire = pw_exppolicy(cnf, args);
 		pwd->pw_dir = pw_homepolicy(cnf, args, pwd->pw_name);
 		pwd->pw_shell = pw_shellpolicy(cnf, args, NULL);
 		lc = login_getpwclass(pwd);
 		if (lc == NULL || login_setcryptfmt(lc, "sha512", NULL) == NULL)
 			warn("setting crypt(3) format");
 		login_close(lc);
 		pwd->pw_passwd = pw_password(cnf, args, pwd->pw_name);
 		edited = 1;
 
 		if (pwd->pw_uid == 0 && strcmp(pwd->pw_name, "root") != 0)
 			warnx("WARNING: new account `%s' has a uid of 0 (superuser access!)", pwd->pw_name);
 	}
 
 	/*
 	 * Shared add/edit code
 	 */
 	if ((arg = getarg(args, 'c')) != NULL) {
 		char	*gecos = pw_checkname((u_char *)arg->val, 1);
 		if (strcmp(pwd->pw_gecos, gecos) != 0) {
 			pwd->pw_gecos = gecos;
 			edited = 1;
 		}
 	}
 
 	if ((arg = getarg(args, 'h')) != NULL ||
 	    (arg = getarg(args, 'H')) != NULL) {
 		if (strcmp(arg->val, "-") == 0) {
 			if (!pwd->pw_passwd || *pwd->pw_passwd != '*') {
 				pwd->pw_passwd = "*";	/* No access */
 				edited = 1;
 			}
 		} else {
 			int             fd = atoi(arg->val);
 			int		precrypt = (arg->ch == 'H');
 			int             b;
 			int             istty = isatty(fd);
 			struct termios  t;
 			login_cap_t	*lc;
 
 			if (istty) {
 				if (tcgetattr(fd, &t) == -1)
 					istty = 0;
 				else {
 					struct termios  n = t;
 
 					/* Disable echo */
 					n.c_lflag &= ~(ECHO);
 					tcsetattr(fd, TCSANOW, &n);
 					printf("%s%spassword for user %s:",
 					     (mode == M_UPDATE) ? "new " : "",
 					     precrypt ? "encrypted " : "",
 					     pwd->pw_name);
 					fflush(stdout);
 				}
 			}
 			b = read(fd, line, sizeof(line) - 1);
 			if (istty) {	/* Restore state */
 				tcsetattr(fd, TCSANOW, &t);
 				fputc('\n', stdout);
 				fflush(stdout);
 			}
 			if (b < 0) {
 				warn("-%c file descriptor", precrypt ? 'H' :
 				    'h');
 				return EX_IOERR;
 			}
 			line[b] = '\0';
 			if ((p = strpbrk(line, "\r\n")) != NULL)
 				*p = '\0';
 			if (!*line)
 				errx(EX_DATAERR, "empty password read on file descriptor %d", fd);
 			if (precrypt) {
 				if (strchr(line, ':') != NULL)
 					return EX_DATAERR;
 				pwd->pw_passwd = line;
 			} else {
 				lc = login_getpwclass(pwd);
 				if (lc == NULL ||
 				    login_setcryptfmt(lc, "sha512", NULL) == NULL)
 					warn("setting crypt(3) format");
 				login_close(lc);
 				pwd->pw_passwd = pw_pwcrypt(line);
 			}
 			edited = 1;
 		}
 	}
 
 	/*
 	 * Special case: -N only displays & exits
 	 */
 	if (getarg(args, 'N') != NULL)
 		return print_user(pwd,
 				  getarg(args, 'P') != NULL,
 				  getarg(args, '7') != NULL);
 
 	if (mode == M_ADD) {
 		edited = 1;	/* Always */
 		rc = addpwent(pwd);
 		if (rc == -1) {
 			warnx("user '%s' already exists", pwd->pw_name);
 			return EX_IOERR;
 		} else if (rc != 0) {
 			warn("passwd file update");
 			return EX_IOERR;
 		}
 		if (cnf->nispasswd && *cnf->nispasswd=='/') {
 			rc = addnispwent(cnf->nispasswd, pwd);
 			if (rc == -1)
 				warnx("User '%s' already exists in NIS passwd", pwd->pw_name);
 			else if (rc != 0)
 				warn("NIS passwd update");
 			/* NOTE: we treat NIS-only update errors as non-fatal */
 		}
 	} else if (mode == M_UPDATE || mode == M_LOCK || mode == M_UNLOCK) {
 		if (edited) {	/* Only update this if required */
 			rc = chgpwent(a_name->val, pwd);
 			if (rc == -1) {
 				warnx("user '%s' does not exist (NIS?)", pwd->pw_name);
 				return EX_IOERR;
 			} else if (rc != 0) {
 				warn("passwd file update");
 				return EX_IOERR;
 			}
 			if ( cnf->nispasswd && *cnf->nispasswd=='/') {
 				rc = chgnispwent(cnf->nispasswd, a_name->val, pwd);
 				if (rc == -1)
 					warn("User '%s' not found in NIS passwd", pwd->pw_name);
 				else if (rc != 0)
 					warn("NIS passwd update");
 				/* NOTE: NIS-only update errors are not fatal */
 			}
 		}
 	}
 
 	/*
 	 * Ok, user is created or changed - now edit group file
 	 */
 
 	if (mode == M_ADD || getarg(args, 'G') != NULL) {
 		int i, j;
 		/* First remove the user from all groups */
 		SETGRENT();
 		while ((grp = GETGRENT()) != NULL) {
 			char group[MAXLOGNAME];
 			if (grp->gr_mem == NULL)
 				continue;
 			for (i = 0; grp->gr_mem[i] != NULL; i++) {
 				if (strcmp(grp->gr_mem[i] , pwd->pw_name) != 0)
 					continue;
 				for (j = i; grp->gr_mem[j] != NULL ; j++)
 					grp->gr_mem[j] = grp->gr_mem[j+1];
 				strlcpy(group, grp->gr_name, MAXLOGNAME);
 				chggrent(group, grp);
 			}
 		}
 		ENDGRENT();
 
 		/* now add to group where needed */
 		for (i = 0; cnf->groups[i] != NULL; i++) {
 			grp = GETGRNAM(cnf->groups[i]);
 			grp = gr_add(grp, pwd->pw_name);
 			/*
 			 * grp can only be NULL in 2 cases:
 			 * - the new member is already a member
 			 * - a problem with memory occurs
 			 * in both cases we want to skip now.
 			 */
 			if (grp == NULL)
 				continue;
 			chggrent(cnf->groups[i], grp);
 			free(grp);
 		}
 	}
 
 
 	/* go get a current version of pwd */
 	pwd = GETPWNAM(a_name->val);
 	if (pwd == NULL) {
 		/* This will fail when we rename, so special case that */
 		if (mode == M_UPDATE && (arg = getarg(args, 'l')) != NULL) {
 			a_name->val = arg->val;		/* update new name */
 			pwd = GETPWNAM(a_name->val);	/* refetch renamed rec */
 		}
 	}
 	if (pwd == NULL)	/* can't go on without this */
 		errx(EX_NOUSER, "user '%s' disappeared during update", a_name->val);
 
 	grp = GETGRGID(pwd->pw_gid);
 	pw_log(cnf, mode, W_USER, "%s(%ld):%s(%ld):%s:%s:%s",
 	       pwd->pw_name, (long) pwd->pw_uid,
 	    grp ? grp->gr_name : "unknown", (long) (grp ? grp->gr_gid : -1),
 	       pwd->pw_gecos, pwd->pw_dir, pwd->pw_shell);
 
 	/*
 	 * If adding, let's touch and chown the user's mail file. This is not
 	 * strictly necessary under BSD with a 0755 maildir but it also
 	 * doesn't hurt anything to create the empty mailfile
 	 */
 	if (mode == M_ADD) {
 		if (!PWALTDIR()) {
 			sprintf(line, "%s/%s", _PATH_MAILDIR, pwd->pw_name);
 			close(open(line, O_RDWR | O_CREAT, 0600));	/* Preserve contents &
 									 * mtime */
 			chown(line, pwd->pw_uid, pwd->pw_gid);
 		}
 	}
 
 	/*
 	 * Let's create and populate the user's home directory. Note
 	 * that this also `works' for editing users if -m is used, but
 	 * existing files will *not* be overwritten.
 	 */
 	if (!PWALTDIR() && getarg(args, 'm') != NULL && pwd->pw_dir && *pwd->pw_dir == '/' && pwd->pw_dir[1]) {
 		copymkdir(pwd->pw_dir, cnf->dotdir, cnf->homemode, pwd->pw_uid, pwd->pw_gid);
 		pw_log(cnf, mode, W_USER, "%s(%ld) home %s made",
 		       pwd->pw_name, (long) pwd->pw_uid, pwd->pw_dir);
 	}
 
 
 	/*
 	 * Finally, send mail to the new user as well, if we are asked to
 	 */
 	if (mode == M_ADD && !PWALTDIR() && cnf->newmail && *cnf->newmail && (fp = fopen(cnf->newmail, "r")) != NULL) {
 		FILE           *pfp = popen(_PATH_SENDMAIL " -t", "w");
 		
 		if (pfp == NULL)
 			warn("sendmail");
 		else {
 			fprintf(pfp, "From: root\n" "To: %s\n" "Subject: Welcome!\n\n", pwd->pw_name);
 			while (fgets(line, sizeof(line), fp) != NULL) {
 				/* Do substitutions? */
 				fputs(line, pfp);
 			}
 			pclose(pfp);
 			pw_log(cnf, mode, W_USER, "%s(%ld) new user mail sent",
 			    pwd->pw_name, (long) pwd->pw_uid);
 		}
 		fclose(fp);
 	}
 
 	return EXIT_SUCCESS;
 }
 
 
 static          uid_t
 pw_uidpolicy(struct userconf * cnf, struct cargs * args)
 {
 	struct passwd  *pwd;
 	uid_t           uid = (uid_t) - 1;
 	struct carg    *a_uid = getarg(args, 'u');
 
 	/*
 	 * Check the given uid, if any
 	 */
 	if (a_uid != NULL) {
 		uid = (uid_t) atol(a_uid->val);
 
 		if ((pwd = GETPWUID(uid)) != NULL && getarg(args, 'o') == NULL)
 			errx(EX_DATAERR, "uid `%ld' has already been allocated", (long) pwd->pw_uid);
 	} else {
 		struct bitmap   bm;
 
 		/*
 		 * We need to allocate the next available uid under one of
 		 * two policies a) Grab the first unused uid b) Grab the
 		 * highest possible unused uid
 		 */
 		if (cnf->min_uid >= cnf->max_uid) {	/* Sanity
 							 * claus^H^H^H^Hheck */
 			cnf->min_uid = 1000;
 			cnf->max_uid = 32000;
 		}
 		bm = bm_alloc(cnf->max_uid - cnf->min_uid + 1);
 
 		/*
 		 * Now, let's fill the bitmap from the password file
 		 */
 		SETPWENT();
 		while ((pwd = GETPWENT()) != NULL)
 			if (pwd->pw_uid >= (uid_t) cnf->min_uid && pwd->pw_uid <= (uid_t) cnf->max_uid)
 				bm_setbit(&bm, pwd->pw_uid - cnf->min_uid);
 		ENDPWENT();
 
 		/*
 		 * Then apply the policy, with fallback to reuse if necessary
 		 */
 		if (cnf->reuse_uids || (uid = (uid_t) (bm_lastset(&bm) + cnf->min_uid + 1)) > cnf->max_uid)
 			uid = (uid_t) (bm_firstunset(&bm) + cnf->min_uid);
 
 		/*
 		 * Another sanity check
 		 */
 		if (uid < cnf->min_uid || uid > cnf->max_uid)
 			errx(EX_SOFTWARE, "unable to allocate a new uid - range fully used");
 		bm_dealloc(&bm);
 	}
 	return uid;
 }
 
 
 static          uid_t
 pw_gidpolicy(struct userconf * cnf, struct cargs * args, char *nam, gid_t prefer)
 {
 	struct group   *grp;
 	gid_t           gid = (uid_t) - 1;
 	struct carg    *a_gid = getarg(args, 'g');
 
 	/*
 	 * If no arg given, see if default can help out
 	 */
 	if (a_gid == NULL && cnf->default_group && *cnf->default_group)
 		a_gid = addarg(args, 'g', cnf->default_group);
 
 	/*
 	 * Check the given gid, if any
 	 */
 	SETGRENT();
 	if (a_gid != NULL) {
 		if ((grp = GETGRNAM(a_gid->val)) == NULL) {
 			gid = (gid_t) atol(a_gid->val);
 			if ((gid == 0 && !isdigit((unsigned char)*a_gid->val)) || (grp = GETGRGID(gid)) == NULL)
 				errx(EX_NOUSER, "group `%s' is not defined", a_gid->val);
 		}
 		gid = grp->gr_gid;
 	} else if ((grp = GETGRNAM(nam)) != NULL &&
 	    (grp->gr_mem == NULL || grp->gr_mem[0] == NULL)) {
 		gid = grp->gr_gid;  /* Already created? Use it anyway... */
 	} else {
 		struct cargs    grpargs;
 		char            tmp[32];
 
 		LIST_INIT(&grpargs);
 		addarg(&grpargs, 'n', nam);
 
 		/*
 		 * We need to auto-create a group with the user's name. We
 		 * can send all the appropriate output to our sister routine
 		 * but first see if we can create a group with gid==uid so we
 		 * can keep the user and group ids in sync. We purposely do
 		 * NOT check the gid range if we can force the sync. If the
 		 * user's name dups an existing group, then the group add
 		 * function will happily handle that case for us and exit.
 		 */
 		if (GETGRGID(prefer) == NULL) {
 			sprintf(tmp, "%lu", (unsigned long) prefer);
 			addarg(&grpargs, 'g', tmp);
 		}
 		if (getarg(args, 'N'))
 		{
 			addarg(&grpargs, 'N', NULL);
 			addarg(&grpargs, 'q', NULL);
 			gid = pw_group(cnf, M_NEXT, &grpargs);
 		}
 		else
 		{
 			pw_group(cnf, M_ADD, &grpargs);
 			if ((grp = GETGRNAM(nam)) != NULL)
 				gid = grp->gr_gid;
 		}
 		a_gid = LIST_FIRST(&grpargs);
 		while (a_gid != NULL) {
 			struct carg    *t = LIST_NEXT(a_gid, list);
 			LIST_REMOVE(a_gid, list);
 			a_gid = t;
 		}
 	}
 	ENDGRENT();
 	return gid;
 }
 
 
 static          time_t
 pw_pwdpolicy(struct userconf * cnf, struct cargs * args)
 {
 	time_t          result = 0;
 	time_t          now = time(NULL);
 	struct carg    *arg = getarg(args, 'p');
 
 	if (arg != NULL) {
 		if ((result = parse_date(now, arg->val)) == now)
 			errx(EX_DATAERR, "invalid date/time `%s'", arg->val);
 	} else if (cnf->password_days > 0)
 		result = now + ((long) cnf->password_days * 86400L);
 	return result;
 }
 
 
 static          time_t
 pw_exppolicy(struct userconf * cnf, struct cargs * args)
 {
 	time_t          result = 0;
 	time_t          now = time(NULL);
 	struct carg    *arg = getarg(args, 'e');
 
 	if (arg != NULL) {
 		if ((result = parse_date(now, arg->val)) == now)
 			errx(EX_DATAERR, "invalid date/time `%s'", arg->val);
 	} else if (cnf->expire_days > 0)
 		result = now + ((long) cnf->expire_days * 86400L);
 	return result;
 }
 
 
 static char    *
 pw_homepolicy(struct userconf * cnf, struct cargs * args, char const * user)
 {
 	struct carg    *arg = getarg(args, 'd');
 
 	if (arg)
 		return arg->val;
 	else {
 		static char     home[128];
 
 		if (cnf->home == NULL || *cnf->home == '\0')
 			errx(EX_CONFIG, "no base home directory set");
 		sprintf(home, "%s/%s", cnf->home, user);
 		return home;
 	}
 }
 
 static char    *
 shell_path(char const * path, char *shells[], char *sh)
 {
 	if (sh != NULL && (*sh == '/' || *sh == '\0'))
 		return sh;	/* specified full path or forced none */
 	else {
 		char           *p;
 		char            paths[_UC_MAXLINE];
 
 		/*
 		 * We need to search paths
 		 */
 		strlcpy(paths, path, sizeof(paths));
 		for (p = strtok(paths, ": \t\r\n"); p != NULL; p = strtok(NULL, ": \t\r\n")) {
 			int             i;
 			static char     shellpath[256];
 
 			if (sh != NULL) {
 				sprintf(shellpath, "%s/%s", p, sh);
 				if (access(shellpath, X_OK) == 0)
 					return shellpath;
 			} else
 				for (i = 0; i < _UC_MAXSHELLS && shells[i] != NULL; i++) {
 					sprintf(shellpath, "%s/%s", p, shells[i]);
 					if (access(shellpath, X_OK) == 0)
 						return shellpath;
 				}
 		}
 		if (sh != NULL)
 			errx(EX_OSFILE, "can't find shell `%s' in shell paths", sh);
 		errx(EX_CONFIG, "no default shell available or defined");
 		return NULL;
 	}
 }
 
 
 static char    *
 pw_shellpolicy(struct userconf * cnf, struct cargs * args, char *newshell)
 {
 	char           *sh = newshell;
 	struct carg    *arg = getarg(args, 's');
 
 	if (newshell == NULL && arg != NULL)
 		sh = arg->val;
 	return shell_path(cnf->shelldir, cnf->shells, sh ? sh : cnf->shell_default);
 }
 
 #define	SALTSIZE	32
 
 static char const chars[] = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ./";
 
 char           *
 pw_pwcrypt(char *password)
 {
 	int             i;
 	char            salt[SALTSIZE + 1];
 	char		*cryptpw;
 
 	static char     buf[256];
 
 	/*
 	 * Calculate a salt value
 	 */
 	for (i = 0; i < SALTSIZE; i++)
 		salt[i] = chars[arc4random_uniform(sizeof(chars) - 1)];
 	salt[SALTSIZE] = '\0';
 
 	cryptpw = crypt(password, salt);
 	if (cryptpw == NULL)
 		errx(EX_CONFIG, "crypt(3) failure");
 	return strcpy(buf, cryptpw);
 }
 
 
 static char    *
 pw_password(struct userconf * cnf, struct cargs * args, char const * user)
 {
 	int             i, l;
 	char            pwbuf[32];
 
 	switch (cnf->default_password) {
 	case -1:		/* Random password */
 		l = (arc4random() % 8 + 8);	/* 8 - 16 chars */
 		for (i = 0; i < l; i++)
 			pwbuf[i] = chars[arc4random_uniform(sizeof(chars)-1)];
 		pwbuf[i] = '\0';
 
 		/*
 		 * We give this information back to the user
 		 */
 		if (getarg(args, 'h') == NULL && getarg(args, 'H') == NULL &&
 		    getarg(args, 'N') == NULL) {
 			if (isatty(STDOUT_FILENO))
 				printf("Password for '%s' is: ", user);
 			printf("%s\n", pwbuf);
 			fflush(stdout);
 		}
 		break;
 
 	case -2:		/* No password at all! */
 		return "";
 
 	case 0:		/* No login - default */
 	default:
 		return "*";
 
 	case 1:		/* user's name */
 		strlcpy(pwbuf, user, sizeof(pwbuf));
 		break;
 	}
 	return pw_pwcrypt(pwbuf);
 }
 
 
 static int
 print_user(struct passwd * pwd, int pretty, int v7)
 {
 	if (!pretty) {
 		char            *buf;
 
 		if (!v7)
 			pwd->pw_passwd = (pwd->pw_passwd == NULL) ? "" : "*";
 
 		buf = v7 ? pw_make_v7(pwd) : pw_make(pwd);
 		printf("%s\n", buf);
 		free(buf);
 	} else {
 		int		j;
 		char           *p;
 		struct group   *grp = GETGRGID(pwd->pw_gid);
 		char            uname[60] = "User &", office[60] = "[None]",
 		                wphone[60] = "[None]", hphone[60] = "[None]";
 		char		acexpire[32] = "[None]", pwexpire[32] = "[None]";
 		struct tm *    tptr;
 
 		if ((p = strtok(pwd->pw_gecos, ",")) != NULL) {
 			strlcpy(uname, p, sizeof(uname));
 			if ((p = strtok(NULL, ",")) != NULL) {
 				strlcpy(office, p, sizeof(office));
 				if ((p = strtok(NULL, ",")) != NULL) {
 					strlcpy(wphone, p, sizeof(wphone));
 					if ((p = strtok(NULL, "")) != NULL) {
 						strlcpy(hphone, p,
 						    sizeof(hphone));
 					}
 				}
 			}
 		}
 		/*
 		 * Handle '&' in gecos field
 		 */
 		if ((p = strchr(uname, '&')) != NULL) {
 			int             l = strlen(pwd->pw_name);
 			int             m = strlen(p);
 
 			memmove(p + l, p + 1, m);
 			memmove(p, pwd->pw_name, l);
 			*p = (char) toupper((unsigned char)*p);
 		}
 		if (pwd->pw_expire > (time_t)0 && (tptr = localtime(&pwd->pw_expire)) != NULL)
 			strftime(acexpire, sizeof acexpire, "%c", tptr);
 		if (pwd->pw_change > (time_t)0 && (tptr = localtime(&pwd->pw_change)) != NULL)
 			strftime(pwexpire, sizeof pwexpire, "%c", tptr);
 		printf("Login Name: %-15s   #%-12ld Group: %-15s   #%ld\n"
 		       " Full Name: %s\n"
 		       "      Home: %-26.26s      Class: %s\n"
 		       "     Shell: %-26.26s     Office: %s\n"
 		       "Work Phone: %-26.26s Home Phone: %s\n"
 		       "Acc Expire: %-26.26s Pwd Expire: %s\n",
 		       pwd->pw_name, (long) pwd->pw_uid,
 		       grp ? grp->gr_name : "(invalid)", (long) pwd->pw_gid,
 		       uname, pwd->pw_dir, pwd->pw_class,
 		       pwd->pw_shell, office, wphone, hphone,
 		       acexpire, pwexpire);
 	        SETGRENT();
 		j = 0;
 		while ((grp=GETGRENT()) != NULL)
 		{
 			int     i = 0;
 			if (grp->gr_mem != NULL) {
 				while (grp->gr_mem[i] != NULL)
 				{
 					if (strcmp(grp->gr_mem[i], pwd->pw_name)==0)
 					{
 						printf(j++ == 0 ? "    Groups: %s" : ",%s", grp->gr_name);
 						break;
 					}
 					++i;
 				}
 			}
 		}
 		ENDGRENT();
 		printf("%s", j ? "\n" : "");
 	}
 	return EXIT_SUCCESS;
 }
 
 char    *
 pw_checkname(u_char *name, int gecos)
 {
 	char showch[8];
 	u_char const *badchars, *ch, *showtype;
 	int reject;
 
 	ch = name;
 	reject = 0;
 	if (gecos) {
 		/* See if the name is valid as a gecos (comment) field. */
 		badchars = ":!@";
 		showtype = "gecos field";
 	} else {
 		/* See if the name is valid as a userid or group. */
 		badchars = " ,\t:+&#%$^()!@~*?<>=|\\/\"";
 		showtype = "userid/group name";
 		/* Userids and groups can not have a leading '-'. */
 		if (*ch == '-')
 			reject = 1;
 	}
 	if (!reject) {
 		while (*ch) {
 			if (strchr(badchars, *ch) != NULL || *ch < ' ' ||
 			    *ch == 127) {
 				reject = 1;
 				break;
 			}
 			/* 8-bit characters are only allowed in GECOS fields */
 			if (!gecos && (*ch & 0x80)) {
 				reject = 1;
 				break;
 			}
 			ch++;
 		}
 	}
 	/*
 	 * A `$' is allowed as the final character for userids and groups,
 	 * mainly for the benefit of samba.
 	 */
 	if (reject && !gecos) {
 		if (*ch == '$' && *(ch + 1) == '\0') {
 			reject = 0;
 			ch++;
 		}
 	}
 	if (reject) {
 		snprintf(showch, sizeof(showch), (*ch >= ' ' && *ch < 127)
 		    ? "`%c'" : "0x%02x", *ch);
 		errx(EX_DATAERR, "invalid character %s at position %td in %s",
 		    showch, (ch - name), showtype);
 	}
 	if (!gecos && (ch - name) > LOGNAMESIZE)
 		errx(EX_DATAERR, "name too long `%s' (max is %d)", name,
 		    LOGNAMESIZE);
 	return (char *)name;
 }
 
 
 static void
 rmat(uid_t uid)
 {
 	DIR            *d = opendir("/var/at/jobs");
 
 	if (d != NULL) {
 		struct dirent  *e;
 
 		while ((e = readdir(d)) != NULL) {
 			struct stat     st;
 
 			if (strncmp(e->d_name, ".lock", 5) != 0 &&
 			    stat(e->d_name, &st) == 0 &&
 			    !S_ISDIR(st.st_mode) &&
 			    st.st_uid == uid) {
 				char            tmp[MAXPATHLEN];
 
 				sprintf(tmp, "/usr/bin/atrm %s", e->d_name);
 				system(tmp);
 			}
 		}
 		closedir(d);
 	}
 }
 
 static void
 rmopie(char const * name)
 {
 	static const char etcopie[] = "/etc/opiekeys";
 	FILE   *fp = fopen(etcopie, "r+");
 
 	if (fp != NULL) {
 		char	tmp[1024];
 		off_t	atofs = 0;
 		int	length = strlen(name);
 
 		while (fgets(tmp, sizeof tmp, fp) != NULL) {
 			if (strncmp(name, tmp, length) == 0 && tmp[length]==' ') {
 				if (fseek(fp, atofs, SEEK_SET) == 0) {
 					fwrite("#", 1, 1, fp);	/* Comment username out */
 				}
 				break;
 			}
 			atofs = ftell(fp);
 		}
 		/*
 		 * If we got an error of any sort, don't update!
 		 */
 		fclose(fp);
 	}
 }
 
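Note on the hunks above: the change replaces strcpy()/strncat() with snprintf() and malloc()/strcpy()/strcat() with asprintf(). The sketch below illustrates the same pattern in isolation, with explicit truncation and allocation-failure checks. The helper names build_usr_path() and prefix_passwd() are hypothetical and do not appear in pw_user.c. The committed code tests passtmp for NULL, which relies on FreeBSD's asprintf() setting the pointer to NULL on failure; checking the return value, as below, is the more portable form.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* exposes asprintf() on glibc; not needed on FreeBSD */
#endif
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: write "/usr" followed by home into dst.
 * Returns 0 on success, -1 if the result was truncated or on error.
 */
static int
build_usr_path(char *dst, size_t dstsize, const char *home)
{
	int n;

	assert(dst != NULL && home != NULL);
	n = snprintf(dst, dstsize, "/usr%s", home);
	/* snprintf() returns the length it wanted to write. */
	if (n < 0 || (size_t)n >= dstsize)
		return (-1);
	return (0);
}

/*
 * Hypothetical helper: return a newly allocated string holding
 * prefix followed by passwd, or NULL on allocation failure.
 * The caller frees the result.
 */
static char *
prefix_passwd(const char *prefix, const char *passwd)
{
	char *out;

	assert(prefix != NULL && passwd != NULL);
	/* Check the return value; the pointer is not portable on failure. */
	if (asprintf(&out, "%s%s", prefix, passwd) == -1)
		return (NULL);
	return (out);
}
```

A caller would treat a -1 from build_usr_path() as "path too long" and skip the /usr kludge, and a NULL from prefix_passwd() as the out-of-memory case.
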
Index: projects/release-arm-redux
===================================================================
--- projects/release-arm-redux	(revision 282691)
+++ projects/release-arm-redux	(revision 282692)

Property changes on: projects/release-arm-redux
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
   Merged /head:r282673-282691